Investigating Antivirus Scan Results as a Source of Features and Labels for Machine Learning

dc.contributor.advisor: Nicholas, Charles
dc.contributor.advisor: Raff, Edward
dc.contributor.author: Joyce, Robert James
dc.contributor.department: Computer Science and Electrical Engineering
dc.contributor.program: Computer Science
dc.date.accessioned: 2024-08-09T17:12:01Z
dc.date.available: 2024-08-09T17:12:01Z
dc.date.issued: 2024-01-01
dc.description.abstract: Advances in machine learning have recently found success in automating common malware analysis tasks. Historically, two of the primary challenges in implementing a machine learning model for use in a malware analysis environment have been selecting representative malware features and identifying a high-confidence source of malware labels. Both of these are made more difficult due to the massive quantity and diversity of malicious files, as well as the adversarial nature of malware analysis. Many existing malware featurization approaches are only feasible at small scales, can only be applied to a single file format, or are defeated by common obfuscation techniques. Malware datasets have long suffered from label quality issues, and datasets with ground-truth labels are severely restricted in both size and diversity. In this dissertation, we explore the utility of antivirus scan reports, which are the results obtained by scanning a malicious file with a collection of antivirus products. Antivirus products may identify malware using a variety of approaches (byte signatures, heuristics, dynamic analysis, etc.) and their outputs contain diverse information (including file format, behavior, malware family, packer, and vulnerability information). We show that due to this diversity, antivirus scan reports have promising utility as a source of features for multiple types of supervised and unsupervised learning, and as a source of labels for multiple common malware classification tasks.
dc.format: application:pdf
dc.genre: dissertation
dc.identifier: doi:10.13016/m2futd-jiih
dc.identifier.other: 12883
dc.identifier.uri: http://hdl.handle.net/11603/35299
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source: Original File Name: Joyce_umbc_0434D_12883.pdf
dc.subject: Antivirus
dc.subject: Data Science
dc.subject: Machine Learning
dc.subject: Malware
dc.title: Investigating Antivirus Scan Results as a Source of Features and Labels for Machine Learning
dc.type: Text
dcterms.accessRights: Distribution Rights granted to UMBC by the author.

Files

License bundle

Name: Joyce-Robert_1Open.pdf
Size: 203.01 KB
Format: Adobe Portable Document Format