Investigating Antivirus Scan Results as a Source of Features and Labels for Machine Learning

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Subjects

Antivirus
Data Science
Machine Learning
Malware

Abstract

Advances in machine learning have recently found success in automating common malware analysis tasks. Historically, two of the primary challenges in implementing a machine learning model for use in a malware analysis environment have been selecting representative malware features and identifying a high-confidence source of malware labels. Both are made more difficult by the massive quantity and diversity of malicious files, as well as by the adversarial nature of malware analysis. Many existing malware featurization approaches are feasible only at small scales, can be applied to only a single file format, or are defeated by common obfuscation techniques. Malware datasets have long suffered from label quality issues, and datasets with ground-truth labels are severely restricted in both size and diversity. In this dissertation, we explore the utility of antivirus scan reports, which are the results obtained by scanning a malicious file with a collection of antivirus products. Antivirus products may identify malware using a variety of approaches (byte signatures, heuristics, dynamic analysis, etc.), and their outputs contain diverse information (including file format, behavior, malware family, packer, and vulnerability information). We show that, because of this diversity, antivirus scan reports have promising utility as a source of features for multiple types of supervised and unsupervised learning, and as a source of labels for multiple common malware classification tasks.
