Browsing by Author "Raff, Edward"

Adversarial Transfer Attacks with Unknown Data and Class Overlap
(ACM, 2021-11-15) Richards, Luke E.; Nguyen, André; Capps, Ryan; Forsythe, Steven; Matuszek, Cynthia; Raff, Edward
The ability to transfer adversarial attacks from one model (the surrogate) to another model (the victim) has been an issue of concern within the machine learning (ML) community. The ability to successfully evade unseen models represents an uncomfortable level of ease toward implementing attacks. In this work we note that as studied, current transfer attack research has an unrealistic advantage for the attacker: the attacker has the exact same training data as the victim. We present the first study of transferring adversarial attacks focusing on the data available to attacker and victim under imperfect settings without querying the victim, where there is some variable level of overlap in the exact data used or in the classes learned by each model. This threat model is relevant to applications in medicine, malware, and others. Under this new threat model attack success rate is not correlated with data or class overlap in the way one would expect, and varies with dataset. This makes it difficult for attacker and defender to reason about each other and contributes to the broader study of model robustness and security. We remedy this by developing a masked version of Projected Gradient Descent that simulates class disparity, which enables the attacker to reliably estimate a lower-bound on their attack's success.
Applied Machine Learning for Information Security
(ACM, 2024-03-11) Samtani, Sagar; Raff, Edward; Anderson, Hyrum
Information security has undoubtedly become a critical aspect of modern cybersecurity practices. Over the last half-decade, numerous academic and industry groups have sought to develop machine learning, deep learning, and other areas of artificial intelligence-enabled analytics into information security practices. The Conference on Applied Machine Learning in Information Security (CAMLIS) is an emerging venue that seeks to gather researchers and practitioners to discuss applied and fundamental research on machine learning for information security applications. In 2021, CAMLIS partnered with ACM Digital Threats: Research and Practice (DTRAP) to provide opportunities for authors of accepted CAMLIS papers to submit their research for consideration into ACM DTRAP via a Special Issue on Applied Machine Learning for Information Security. This editorial summarizes the results of this Special Issue.
Automatic Yara Rule Generation Using Biclustering
(ACM, 2020-09-06) Raff, Edward; Zak, Richard; Munoz, Gary Lopez; Fleming, William; Anderson, Hyrum S.; Filar, Bobby; Nicholas, Charles; Holt, James
Yara rules are a ubiquitous tool among cybersecurity practitioners and analysts. Developing high-quality Yara rules to detect a malware family of interest can be labor- and time-intensive, even for expert users. Few tools exist and relatively little work has been done on how to automate the generation of Yara rules for specific families. In this paper, we leverage large n-grams (n≥8) combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software. Our method, AutoYara, is fast, allowing for deployment on low-resource equipment for teams that deploy to remote networks. Our results demonstrate that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates, sometimes matching or even outperforming human analysts. In addition, real-world testing by malware analysts indicates AutoYara could reduce analyst time spent constructing Yara rules by 44-86%, allowing them to spend their time on the more advanced malware that current tools can't handle.
AVScan2Vec: Feature Learning on Antivirus Scan Data for Production-Scale Malware Corpora
(2023-06-09) Joyce, Robert J.; Patel, Tirth; Nicholas, Charles; Raff, Edward
When investigating a malicious file, searching for related files is a common task that malware analysts must perform. Given that production malware corpora may contain over a billion files and consume petabytes of storage, many feature extraction and similarity search approaches are computationally infeasible. Our work explores the potential of antivirus (AV) scan data as a scalable source of features for malware. This is possible because AV scan reports are widely available through services such as VirusTotal and are ~100x smaller than the average malware sample. An AV scan report is rich in information and can indicate a malicious file's family, behavior, target operating system, and many other characteristics. We introduce AVScan2Vec, a language model trained to comprehend the semantics of AV scan data. AVScan2Vec ingests AV scan data for a malicious file and outputs a meaningful vector representation. AVScan2Vec vectors are ~3 to 85x smaller than popular alternatives in use today, enabling faster vector comparisons and lower memory usage. By incorporating Dynamic Continuous Indexing, we show that nearest-neighbor queries on AVScan2Vec vectors can scale to even the largest malware production datasets. We also demonstrate that AVScan2Vec vectors are superior to other leading malware feature vector representations across nearly all classification, clustering, and nearest-neighbor lookup algorithms that we evaluated.
Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech
(PKP, 2022-06-28) Kebe, Gaoussou Youssouf; Richards, Luke E.; Raff, Edward; Ferraro, Francis; Matuszek, Cynthia
Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even increasing general performance.
Bringing UMAP Closer to the Speed of Light with GPU Acceleration
(2020-08-01) Nolet, Corey J.; Lafargue, Victor; Raff, Edward; Nanditale, Thejaswi; Oates, Tim; Zedlewski, John; Patterson, Joshua
The Uniform Manifold Approximation and Projection (UMAP) algorithm has become widely popular for its ease of use, quality of results, and support for exploratory, unsupervised, supervised, and semi-supervised learning. While many algorithms can be ported to a GPU in a simple and direct fashion, such efforts have resulted in inefficient and inaccurate versions of UMAP. We show a number of techniques that can be used to make a faster and more faithful GPU version of UMAP, and obtain speedups of up to 100x in practice. Many of these design choices/lessons are general purpose and may inform the conversion of other graph and manifold learning algorithms to use GPUs. Our implementation has been made publicly available as part of the open source RAPIDS cuML library.
Comprehensive OOD Detection Improvements
(2024-01-18) Lakkapragada, Anish; Khanna, Amol; Raff, Edward; Inkawhich, Nathan
As machine learning becomes increasingly prevalent in impactful decisions, recognizing when inference data is outside the model's expected input distribution is paramount for giving context to predictions. Out-of-distribution (OOD) detection methods have been created for this task. Such methods can be split into representation-based and logit-based methods, according to whether they utilize the model's embeddings or its predictions, respectively, for OOD detection. In contrast to most papers, which focus solely on one such group, we address both. We employ dimensionality reduction on feature embeddings in representation-based methods for both time speedups and improved performance. Additionally, we propose DICE-COL, a modification of the popular logit-based method Directed Sparsification (DICE) that resolves an unnoticed flaw. We demonstrate the effectiveness of our methods on the OpenOODv1.5 benchmark framework, where they significantly improve performance and set state-of-the-art results.
Continuously Generalized Ordinal Regression for Linear and Deep Models
(SIAM, 2022) Lu, Fred; Ferraro, Francis; Raff, Edward
Ordinal regression is a classification task where classes have an order and prediction error increases the further the predicted class is from the true class. The standard approach for modeling ordinal data involves fitting parallel separating hyperplanes that optimize a certain loss function. This assumption offers sample efficient learning via inductive bias, but is often too restrictive in real-world datasets where features may have varying effects across different categories. Allowing class-specific hyperplane slopes creates generalized logistic ordinal regression, increasing the flexibility of the model at a cost to sample efficiency. We explore an extension of the generalized model to the all-thresholds logistic loss and propose a regularization approach that interpolates between these two extremes. Our method, which we term continuously generalized ordinal logistic, significantly outperforms the standard ordinal logistic model over a thorough set of ordinal regression benchmark datasets. We further extend this method to deep learning and show that it achieves competitive or lower prediction error compared to previous models over a range of datasets and modalities. Furthermore, two primary alternative models for deep learning ordinal regression are shown to be special cases of our framework.
A Coreset Learning Reality Check
(2023-01-15) Lu, Fred; Raff, Edward; Holt, James
Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.
COVID-19 Literature Clustering
(2020-04-03) Eren, Maksim Ekin; Solovyev, Nick; Nicholas, Charles; Raff, Edward
Cross-Sectional Survey of High-Risk Pregnant Women's Opinions on COVID-19 Vaccination
(Mary Ann Liebert, 2022-06-29) DesJardin, Marcia; Raff, Edward; Baranco, Nicholas; Mastrogiannis, Dimitrios
Background: Pregnant women are at increased risk of severe disease with coronavirus disease 2019 (COVID-19). Despite strong recommendations from American College of Obstetricians and Gynecologists and Society for Maternal Fetal Medicine for vaccination, COVID-19 vaccination hesitancy persists. With this study, we aim to evaluate opinions about the COVID-19 vaccine in a cohort of high-risk pregnant patients. Materials and Methods: Institutional review board approval was obtained. Patients attending a regional Maternal–Fetal Medicine clinic in central New York were surveyed about the COVID-19 vaccine using a standardized questionnaire. Demographic, obstetrical, and medical information was abstracted using medical records. The vaccinated and unvaccinated groups were evaluated using chi-square tests and a Bayesian model. Results: Among the 157 participants, 38.2% are vaccinated. There were no significant differences in race/ethnicity, living situation, marital status, employment status, insurance type, pregravid body mass index, history of recreational drug use, number of living children, or gestational age at the time of survey. Patients with less formal education are less likely to be vaccinated. There was no difference between influenza and tetanus diphtheria pertussis vaccination rates with COVID-19 vaccination rates. Unvaccinated patients cite lack of data in pregnancy (66%) as their primary concern. Most patients prefer to learn about vaccines via conversation with their doctor (46.7% for vaccinated and 59.8% for unvaccinated). Conclusions: The vaccination rate is low in our population. A provider-initiated conversation about COVID-19 vaccination included with routine prenatal care could increase the vaccination rate.
Deploying Convolutional Networks on Untrusted Platforms Using 2D Holographic Reduced Representations
(2022-06-13) Alam, Mohammad Mahmudul; Raff, Edward; Oates, Tim; Holt, James
Due to the computational cost of running inference for a neural network, the need to deploy the inferential steps on a third party's compute environment or hardware is common. If the third party is not fully trusted, it is desirable to obfuscate the nature of the inputs and outputs, so that the third party cannot easily determine what specific task is being performed. Provably secure protocols for leveraging an untrusted party exist but are too computationally demanding to run in practice. We instead explore a different strategy of fast, heuristic security that we call Connectionist Symbolic Pseudo Secrets. By leveraging Holographic Reduced Representations (HRR), we create a neural network with a pseudo-encryption style defense that empirically shows robustness to attack, even under threat models that unrealistically favor the adversary.
Engineering a Simplified 0-Bit Consistent Weighted Sampling
(2018-10-23) Raff, Edward; Sylvester, Jared; Nicholas, Charles
The Min-Hashing approach to sketching has become an important tool in data analysis, information retrieval, and classification. To apply it to real-valued datasets, the ICWS algorithm has become a seminal approach that is widely used, and provides state-of-the-art performance for this problem space. However, ICWS suffers a computational burden as the sketch size K increases. We develop a new Simplified approach to the ICWS algorithm that enables us to obtain over 20x speedups compared to the standard algorithm. The veracity of our approach is demonstrated empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster.
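For readers unfamiliar with the sketching idea this abstract builds on, the following is a minimal illustration of plain Min-Hashing on sets of strings. It is not ICWS and not the simplified method the paper proposes; the function names, the choice of BLAKE2b as the hash family, and the default sketch size are assumptions made only for this example.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash sketch: one minimum hash value per salted hash function.

    `items` must be a non-empty iterable of strings (hypothetical input format).
    """
    items = list(items)
    signature = []
    for seed in range(num_hashes):
        # A different salt per position simulates an independent hash function.
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching sketch positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Comparing two sketches costs time proportional to the sketch size rather than to the size of the original sets, which is why the cost of building each sketch becomes the bottleneck as the sketch size grows.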
Evading Malware Classifiers via Monte Carlo Mutant Feature Discovery
(2021-06-15) Boutsikas, John; Eren, Maksim E.; Varga, Charles; Raff, Edward; Matuszek, Cynthia; Nicholas, Charles
The use of Machine Learning has become a significant part of malware detection efforts due to the influx of new malware, an ever changing threat landscape, and the ability of Machine Learning methods to discover meaningful distinctions between malicious and benign software. Antivirus vendors have also begun to widely utilize malware classifiers based on dynamic and static malware analysis features. Therefore, a malware author might make evasive binary modifications against Machine Learning models as part of the malware development life cycle to execute an attack successfully. This makes the study of possible classifier evasion strategies an essential part of cyber defense against malice. To this end, we stage a grey box setup to analyze a scenario where the malware author does not know the target classifier algorithm, and does not have access to decisions made by the classifier, but knows the features used in training. In this experiment, a malicious actor trains a surrogate model using the EMBER-2018 dataset to discover binary mutations that cause an instance to be misclassified via a Monte Carlo tree search. Then, mutated malware is sent to the victim model that takes the place of an antivirus API to test whether it can evade detection.
Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs
(2020-05-06) Ordun, Catherine; Purushotham, Sanjay; Raff, Edward
This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets. First, we use pattern matching and second, topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would start to uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), that identifies unique clustering-behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades, by visualizing the connections of users over time from fast to slow retweeting. As the time to retweet increases, the density of connections also increases; in our sample, we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.
Flexible and Adaptive Fairness-aware Learning in Non-stationary Data Streams
(IEEE) Zhang, Wenbin; Zhang, Mingli; Zhang, Ji; Liu, Zhen; Chen, Zhiyuan; Wang, Jianwu; Raff, Edward; Messina, Enza
Artificial intelligence (AI)-based decision-making systems are employed nowadays in an ever growing number of online as well as offline services, some of great importance. Depending on sophisticated learning algorithms and available data, these systems are increasingly becoming automated and data-driven. However, these systems can impact individuals and communities with ethical or legal consequences. Numerous approaches have therefore been proposed to develop decision-making systems that are discrimination-conscious by-design. However, these methods assume the underlying data distribution is stationary without drift, which is counterfactual in many real world applications. In addition, their focus has been largely on minimizing discrimination while maximizing prediction performance without necessary flexibility in customizing the tradeoff according to different applications. To this end, we propose a learning algorithm for fair classification that also adapts to evolving data streams and further allows for a flexible control on the degree of accuracy and fairness. The positive results on a set of discriminated and non-stationary data streams demonstrate the effectiveness and flexibility of this approach.
A General Framework for Auditing Differentially Private Machine Learning
(2022-10-16) Lu, Fred; Munoz, Joseph; Fuchs, Maya; LeBlond, Tyler; Zaresky-Williams, Elliott; Raff, Edward; Ferraro, Francis; Testa, Brian
We present a framework to statistically audit the privacy guarantee conferred by a differentially private machine learner in practice. While previous works have taken steps toward evaluating privacy loss through poisoning attacks or membership inference, they have been tailored to specific models or have demonstrated low statistical power. Our work develops a general methodology to empirically evaluate the privacy of differentially private machine learning implementations, combining improved privacy search and verification methods with a toolkit of influence-based poisoning attacks. We demonstrate significantly improved auditing power over previous approaches on a variety of models including logistic regression, Naive Bayes, and random forest. Our method can be used to detect privacy violations due to implementation errors or misuse. When violations are not present, it can aid in understanding the amount of information that can be leaked from a given dataset, algorithm, and privacy specification.
Generating Thermal Human Faces for Physiological Assessment Using Thermal Sensor Auxiliary Labels
(2021-06-15) Ordun, Catherine; Raff, Edward; Purushotham, Sanjay
Thermal images reveal medically important physiological information about human stress, signs of inflammation, and emotional mood that cannot be seen on visible images. Providing a method to generate thermal faces from visible images would be highly valuable for the telemedicine community in order to show this medical information. To the best of our knowledge, there are limited works on visible-to-thermal (VT) face translation, and many current works go the opposite direction to generate visible faces from thermal surveillance images (TV) for law enforcement applications. As a result, we introduce favtGAN, a VT GAN which uses the pix2pix image translation model with an auxiliary sensor label prediction network for generating thermal faces from visible images. Since most TV methods are trained on only one data source drawn from one thermal sensor, we combine datasets from faces and cityscapes. These combined data are captured from similar sensors in order to bootstrap the training and transfer learning task, especially valuable because visible-thermal face datasets are limited. Experiments on these combined datasets show that favtGAN demonstrates an increase in SSIM and PSNR scores of generated thermal faces, compared to training on a single face dataset alone.
A Generative Approach for Image Registration of Visible-Thermal (VT) Cancer Faces
(2023-08-23) Ordun, Catherine; Cha, Alexandra; Raff, Edward; Purushotham, Sanjay; Kwok, Karen; Rule, Mason; Gulley, James
Since thermal imagery offers a unique modality to investigate pain, the U.S. National Institutes of Health (NIH) has collected a large and diverse set of cancer patient facial thermograms for AI-based pain research. However, differing angles from camera capture between thermal and visible sensors have led to misalignment between Visible-Thermal (VT) images. We modernize the classic computer vision task of image registration by applying and modifying a generative alignment algorithm to register VT cancer faces, without the need for a reference or alignment parameters. By registering VT faces, we demonstrate that the quality of thermal images produced in the generative AI downstream task of Visible-to-Thermal (V2T) image translation improves significantly, by up to 52.5%, compared to translation without registration. Images in this paper have been approved by the NIH NCI for public dissemination.
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
(Association for Computing Machinery (ACM), 2018) Raff, Edward; Nicholas, Charles; UMBC Faculty Collection
N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
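As a rough illustration of the general idea of hashing n-grams into a fixed-size table before selecting the most frequent entries, the sketch below counts byte n-grams by hash bucket and then picks the top-k buckets. This is a simplified toy, not the authors' implementation: the function names, the use of CRC32 as the hash, the table size, and the choice to count each bucket once per document are all assumptions made for the example.

```python
import zlib


def hashed_ngram_counts(docs, n=4, table_size=2**16):
    """Count, per hash bucket, how many documents contain an n-gram in that bucket.

    `docs` is an iterable of bytes objects. Memory use is fixed by `table_size`,
    regardless of how many distinct n-grams the corpus contains.
    """
    counts = [0] * table_size
    for data in docs:
        buckets_seen = set()
        for i in range(len(data) - n + 1):
            buckets_seen.add(zlib.crc32(data[i:i + n]) % table_size)
        # Each bucket is incremented at most once per document.
        for bucket in buckets_seen:
            counts[bucket] += 1
    return counts


def top_k_buckets(counts, k):
    """Return the k bucket indices with the highest counts (ties: lowest index first)."""
    return sorted(range(len(counts)), key=lambda b: (-counts[b], b))[:k]


def featurize(data, selected_buckets, n=4, table_size=2**16):
    """Binary feature vector: 1 if any n-gram of `data` hashes to the selected bucket."""
    index = {bucket: j for j, bucket in enumerate(selected_buckets)}
    vector = [0] * len(selected_buckets)
    for i in range(len(data) - n + 1):
        bucket = zlib.crc32(data[i:i + n]) % table_size
        if bucket in index:
            vector[index[bucket]] = 1
    return vector
```

Distinct n-grams that collide in the same bucket are merged, so the selection is approximate; a larger `table_size` reduces collisions at the cost of memory.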
