Investigating Antivirus Scan Results as a Source of Features and Labels for Machine Learning
Date
2024-01-01
Department
Computer Science and Electrical Engineering
Program
Computer Science
Rights
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Abstract
Advances in machine learning have recently found success in automating common malware analysis tasks. Historically, two of the primary challenges in implementing a machine learning model for use in a malware analysis environment have been selecting representative malware features and identifying a high-confidence source of malware labels. Both are made more difficult by the massive quantity and diversity of malicious files, as well as by the adversarial nature of malware analysis. Many existing malware featurization approaches are only feasible at small scales, can only be applied to a single file format, or are defeated by common obfuscation techniques. Malware datasets have long suffered from label quality issues, and datasets with ground-truth labels are severely restricted in both size and diversity. In this dissertation, we explore the utility of antivirus scan reports, which are the results obtained by scanning a malicious file with a collection of antivirus products. Antivirus products may identify malware using a variety of approaches (byte signatures, heuristics, dynamic analysis, etc.), and their outputs contain diverse information (including file format, behavior, malware family, packer, and vulnerability information). We show that, because of this diversity, antivirus scan reports have promising utility as a source of features for multiple types of supervised and unsupervised learning, and as a source of labels for multiple common malware classification tasks.