Browsing by Author "Holt, James"
Now showing 1 - 10 of 10
Item
Automatic Yara Rule Generation Using Biclustering (ACM, 2020-09-06)
Raff, Edward; Zak, Richard; Munoz, Gary Lopez; Fleming, William; Anderson, Hyrum S.; Filar, Bobby; Nicholas, Charles; Holt, James
Yara rules are a ubiquitous tool among cybersecurity practitioners and analysts. Developing high-quality Yara rules to detect a malware family of interest can be labor- and time-intensive, even for expert users. Few tools exist and relatively little work has been done on how to automate the generation of Yara rules for specific families. In this paper, we leverage large n-grams (n ≥ 8) combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software. Our method, AutoYara, is fast, allowing for deployment on low-resource equipment for teams that deploy to remote networks. Our results demonstrate that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates, sometimes matching or even outperforming human analysts. In addition, real-world testing by malware analysts indicates AutoYara could reduce analyst time spent constructing Yara rules by 44-86%, allowing them to spend their time on the more advanced malware that current tools can't handle.

Item
A Coreset Learning Reality Check (2023-01-15)
Lu, Fred; Raff, Edward; Holt, James
Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness.
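The uniform-subsampling baseline common in this literature can be sketched as follows (a minimal illustration with numpy only; the helper name is hypothetical, not from the paper):

```python
import numpy as np

def uniform_subsample(X, y, m, seed=None):
    """Draw m rows uniformly at random (without replacement) from (X, y)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx], y[idx]

# Toy data: 10,000 rows, 5 features; keep a 1% uniform subsample.
X = np.random.default_rng(0).normal(size=(10_000, 5))
y = (X[:, 0] > 0).astype(int)
X_sub, y_sub = uniform_subsample(X, y, m=100, seed=1)
```

Coreset and optimal-subsampling methods replace the uniform draw with weighted sampling probabilities; this baseline is the reference point they are compared against.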
In many cases, methods do not outperform simple uniform subsampling.

Item
Creating Cybersecurity Knowledge Graphs from Malware After Action Reports (2020-10-06)
Piplai, Aritran; Mittal, Sudip; Joshi, Anupam; Finin, Tim; Holt, James; Zak, Richard
After Action Reports provide incisive analysis of cyber-incidents. Extracting cyber-knowledge from these sources would provide security analysts with credible information, which they can use to detect, or find patterns indicative of, a future cyber-attack. It is not possible for a security analyst to read and garner relevant information from the large number of After Action Reports and similar textual documents that detail an attack. An automated pipeline that extracts information from text sources, represents it in a knowledge graph, and reasons over it could help them analyze future cyber-attacks. In this paper, we describe a system to extract information from After Action Reports, which are published by established security corporations, and represent it in a Cybersecurity Knowledge Graph (CKG). We also show how these graphs can incorporate information from semi-structured sources such as STIX, and how they can help security analysts execute queries that involve inferences and retrieve information required to detect a future attack. We extract entities by building a customized named entity recognizer called `Malware Entity Extractor' (MEE). We then build a neural network to predict how pairs of `malware entities' are related to each other. Once we have predicted entity pairs and the relationship between them, we assert the `entity-relationship set' into a cybersecurity knowledge graph. In this process, each individual source of information (i.e., an After Action Report) leads to its own graph. Our next step is to fuse these graphs on common entities where possible, creating a single graph that represents the knowledge in multiple documents.
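Fusing triple sets on common entities can be illustrated with a toy sketch (illustrative only; the actual pipeline extracts triples with a neural relation predictor, and the entity names below are invented examples):

```python
def fuse_graphs(triples_a, triples_b):
    """Union two sets of (subject, relation, object) triples.

    Because entities are plain identifiers, graphs built from different
    reports merge automatically wherever they mention the same entity.
    """
    return set(triples_a) | set(triples_b)

# Two toy graphs extracted from different after-action reports.
g1 = {("WannaCry", "uses", "EternalBlue"), ("WannaCry", "targets", "SMBv1")}
g2 = {("WannaCry", "uses", "EternalBlue"),
      ("EternalBlue", "exploits", "CVE-2017-0144")}

fused = fuse_graphs(g1, g2)
# Shared entities ("WannaCry", "EternalBlue") connect the two graphs,
# so queries can now traverse knowledge from both documents.
```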
The cybersecurity knowledge graph can be populated from one After Action Report, and can also be fused with another knowledge graph about a similar cyber-attack, or with an After Action Report describing attributes of a similar malware. We show how this knowledge can be used to answer analyst queries that cannot be answered from a single source.

Item
Deploying Convolutional Networks on Untrusted Platforms Using 2D Holographic Reduced Representations (2022-06-13)
Alam, Mohammad Mahmudul; Raff, Edward; Oates, Tim; Holt, James
Due to the computational cost of running inference for a neural network, the need to deploy the inferential steps on a third party's compute environment or hardware is common. If the third party is not fully trusted, it is desirable to obfuscate the nature of the inputs and outputs, so that the third party cannot easily determine what specific task is being performed. Provably secure protocols for leveraging an untrusted party exist but are too computationally demanding to run in practice. We instead explore a different strategy of fast, heuristic security that we call Connectionist Symbolic Pseudo Secrets. By leveraging Holographic Reduced Representations (HRR), we create a neural network with a pseudo-encryption style defense that empirically shows robustness to attack, even under threat models that unrealistically favor the adversary.

Item
Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints (2021-08-09)
Nguyen, Andre T.; Raff, Edward; Nicholas, Charles; Holt, James
The detection of malware is a critical task for the protection of computing environments. This task often requires extremely low false positive rates (FPR) of 0.01% or even lower, for which modern machine learning has no readily available tools. We introduce the first broad investigation of the use of uncertainty for malware detection across multiple datasets, models, and feature types.
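Evaluating a detector under such a tight FPR budget amounts to thresholding scores at a high quantile of the benign-score distribution; a minimal sketch on synthetic scores (numpy only, not the paper's evaluation code):

```python
import numpy as np

def tpr_at_fpr(scores_benign, scores_malicious, fpr=1e-3):
    """TPR achieved when the decision threshold is set so that at most
    a `fpr` fraction of benign samples score above it."""
    # Threshold = (1 - fpr) quantile of the benign score distribution.
    thresh = np.quantile(scores_benign, 1.0 - fpr)
    return float(np.mean(scores_malicious > thresh))

# Toy score distributions: benign near 0, malicious shifted higher.
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=100_000)
malicious = rng.normal(3.0, 1.0, size=10_000)
rate = tpr_at_fpr(benign, malicious, fpr=1e-3)
```

At extreme budgets like 1e-5, the threshold sits in the far tail of the benign distribution, which is why very large benign sample sizes are needed for the estimate to be meaningful.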
We show how ensembling and Bayesian treatments of machine learning methods for static malware detection allow for improved identification of model errors, uncovering of new malware families, and predictive performance under extreme false positive constraints. In particular, we improve the true positive rate (TPR) at an actual realized FPR of 1e-5 from an expected 0.69 for previous methods to 0.80 on the best-performing model class on the Sophos industry-scale dataset. We additionally demonstrate how previous works have used an evaluation protocol that can lead to misleading results.

Item
MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers (2023-10-18)
Joyce, Robert J.; Raff, Edward; Nicholas, Charles; Holt, James
Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files, and classifying malware by family. However, malware can be categorized according to many other types of attributes, and the ability to identify these attributes in newly emerging malware using machine learning could provide significant value to analysts. In particular, we have identified four tasks which are under-represented in prior work: classification by the behaviors that malware exhibit, the platforms that malware run on, the vulnerabilities that malware exploit, and the packers that malware are packed with. To obtain labels for training and evaluating ML classifiers on these tasks, we created an antivirus (AV) tagging tool called ClarAVy. ClarAVy's sophisticated AV label parser distinguishes itself from prior AV-based taggers, with the ability to accurately parse 882 different AV label formats used by 90 different AV products. We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total.
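AV-label tagging of the kind ClarAVy performs can be illustrated with a toy parser (the label format and tag vocabulary below are invented for illustration; real AV labels span the 882 formats mentioned above and require far more sophisticated parsing):

```python
import re

# Toy vocabularies of attribute tags; the real tag sets are much larger.
BEHAVIORS = {"ransom", "worm", "downloader"}
PLATFORMS = {"win32", "win64", "linux", "android"}

def tag_av_label(label):
    """Split a delimiter-separated AV label into attribute tags."""
    tokens = {t.lower() for t in re.split(r"[:/.!\- ]+", label) if t}
    return {
        "behavior": sorted(tokens & BEHAVIORS),
        "platform": sorted(tokens & PLATFORMS),
    }

tags = tag_av_label("Win32/Ransom.Cerber!gen")
# {'behavior': ['ransom'], 'platform': ['win32']}
```

Aggregating such tags across many AV products, with per-format parsing rules, is what yields the dataset labels.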
Our malware behavior dataset includes 75 distinct tags - nearly 7x more than the only prior benchmark dataset with behavioral tags. To our knowledge, we are the first to release datasets with malware platform and packer tags.

Item
Out of Distribution Data Detection Using Dropout Bayesian Neural Networks (2022-02-18)
Nguyen, Andre T.; Lu, Fred; Munoz, Gary Lopez; Raff, Edward; Nicholas, Charles; Holt, James
We explore the utility of the information contained within a dropout-based Bayesian neural network (BNN) for the task of detecting out-of-distribution (OOD) data. We first show how previous attempts to leverage the randomized embeddings induced by the intermediate layers of a dropout BNN can fail due to the distance metric used. We introduce an alternative approach to measuring embedding uncertainty, justify its use theoretically, and demonstrate how incorporating embedding uncertainty improves OOD data identification across three tasks: image classification, language classification, and malware detection.

Item
Recasting Self-Attention with Holographic Reduced Representations (2022-08-15)
Alam, Mohammad Mahmudul; Raff, Edward; Oates, Tim; Holt, James
Self-attention has become fundamental to set and sequence modeling, particularly within transformer-style architectures. Given a sequence of T items, standard self-attention has O(T^2) memory and compute needs, leading to many recent works building approximations to self-attention with reduced computational or memory complexity. In this work, we instead re-cast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR). In doing so, we perform the same logical strategy as standard self-attention.
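The HRR primitive underlying this recasting is binding by circular convolution, with approximate unbinding by circular correlation; a minimal numpy sketch of the primitive (not the paper's full attention layer):

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution, computed in O(d log d) via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(s, a):
    """Approximate inverse: correlate s with a to recover a's partner."""
    return np.real(np.fft.ifft(np.fft.fft(s) * np.conj(np.fft.fft(a))))

rng = np.random.default_rng(0)
d = 1024
# HRR vectors are typically drawn i.i.d. N(0, 1/d) so norms concentrate near 1.
a, b = rng.normal(0.0, 1.0 / np.sqrt(d), size=(2, d))
b_hat = unbind(bind(a, b), a)
# b_hat is a noisy reconstruction of b; cosine similarity is high.
cos = b_hat @ b / (np.linalg.norm(b_hat) * np.linalg.norm(b))
```

Because binding and unbinding each cost O(d log d) and superpositions of bound pairs can be queried in aggregate, attention-like lookups can avoid the pairwise O(T^2) interaction.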
Implemented as a “Hrrformer”, we obtain several benefits, including faster compute (O(T log T) time complexity), less memory use per layer (O(T) space complexity), convergence in 10x fewer epochs, near state-of-the-art accuracy, and the ability to learn with just a single layer. Combined, these benefits make our Hrrformer up to 370x faster to train on the Long Range Arena benchmark.

Item
RelExt: Relation Extraction using Deep Learning approaches for Cybersecurity Knowledge Graph Improvement (2019-05-16)
Pingle, Aditya; Piplai, Aritran; Mittal, Sudip; Joshi, Anupam; Holt, James; Zak, Richard
Security analysts who work in a `Security Operations Center' (SOC) play a major role in ensuring the security of their organization. The amount of background knowledge they have about evolving and new attacks makes a significant difference in their ability to detect attacks. Open-source threat intelligence sources, like text descriptions of cyber-attacks, can be stored in a structured fashion in a cybersecurity knowledge graph. A cybersecurity knowledge graph can be paramount in aiding a security analyst to detect cyber threats, because it stores a vast range of cyber threat information in the form of semantic triples which can be queried. A semantic triple contains two cybersecurity entities with a relationship between them. In this work, we propose a system to create semantic triples over cybersecurity text, using deep learning approaches to extract possible relationships. We then assert the set of semantic triples generated by our system into a cybersecurity knowledge graph. Security analysts can retrieve this data from the knowledge graph and use it to form a decision about a cyber-attack.

Item
Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! (2023-12-25)
Patel, Tirth; Lu, Fred; Raff, Edward; Nicholas, Charles; Matuszek, Cynthia; Holt, James
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines, meaning a 0.1% change can cause an overwhelming number of false positives. However, academic research is often restrained to public datasets on the order of ten thousand samples, which are too small to detect improvements that may be relevant to industry. Working within these constraints, we devise an approach to generate a benchmark of configurable difficulty from a pool of available samples. This is done by leveraging malware family information from tools like AVClass to construct train/test splits that have different generalization rates, as measured by a secondary model. Our experiments demonstrate that using a less accurate secondary model with disparate features is effective at producing benchmarks for a more sophisticated target model that is under evaluation. We also ablate against alternative designs to show the need for our approach.
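The core of such harder splits is assigning whole malware families to either train or test, so no family straddles the boundary; a minimal sketch (plain Python, hypothetical helper name, toy family labels):

```python
import random

def family_split(samples, families, test_frac=0.2, seed=0):
    """Assign whole families to train or test, so no family appears in both."""
    fams = sorted(set(families))
    rng = random.Random(seed)
    rng.shuffle(fams)
    n_test = max(1, int(len(fams) * test_frac))
    test_fams = set(fams[:n_test])
    train = [s for s, f in zip(samples, families) if f not in test_fams]
    test = [s for s, f in zip(samples, families) if f in test_fams]
    return train, test

samples = [f"sample{i}" for i in range(10)]
families = ["emotet", "emotet", "zeus", "zeus", "zeus",
            "wannacry", "wannacry", "mirai", "mirai", "mirai"]
train, test = family_split(samples, families, test_frac=0.25)
# Every family's samples land entirely in train or entirely in test,
# forcing the model to generalize to unseen families.
```

Varying which and how many families are held out is one way to dial the split difficulty up or down.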