Detecting, Quantifying, and Mitigating Bias in Malware Datasets

dc.contributor.advisor: Nicholas, Charles K
dc.contributor.author: Seymour III, John Jefferson
dc.contributor.department: Computer Science and Electrical Engineering
dc.contributor.program: Computer Science
dc.date.accessioned: 2021-09-01T13:55:26Z
dc.date.available: 2021-09-01T13:55:26Z
dc.date.issued: 2020-01-20
dc.description.abstract: The effectiveness of a malware classifier on new data is tightly coupled with the data on which it was trained and validated. Malware data are collected from various sources, which must be trusted both to supply correct labels and to yield samples that are independent and identically distributed (i.i.d.). However, little research assesses how well these assumptions hold in practice. Given data from various sources of unknown quality, what can we know about a malware classifier's ability to generalize to future, unseen data? Can we even create a malware classifier that generalizes, given issues of data quality and concept drift? How can we assure others that our malware classifier doesn't have underlying data quality issues? This dissertation describes the labeling of a massive dataset of over 33 million raw malware samples so that it can be used both for classifying malware families and as a baseline for measuring drift. It then demonstrates that the models from multiple prior studies are highly sensitive to drift. Finally, it tests new regularization methods that explicitly use the source of the data to penalize features which don't generalize from one dataset to another.
dc.format: application:pdf
dc.genre: dissertations
dc.identifier: doi:10.13016/m2nolb-tskz
dc.identifier.other: 12251
dc.identifier.uri: http://hdl.handle.net/11603/22840
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.source: Original File Name: SeymourIII_umbc_0434D_12251.pdf
dc.subject: Data Science
dc.subject: Dataset Bias
dc.subject: Machine Learning
dc.subject: Malware Analysis
dc.subject: Reverse Engineering
dc.title: Detecting, Quantifying, and Mitigating Bias in Malware Datasets
dc.type: Text
dcterms.accessRights: Distribution Rights granted to UMBC by the author.
dcterms.accessRights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Files

Original bundle

Name: SeymourIII_umbc_0434D_12251.pdf
Size: 1.05 MB
Format: Adobe Portable Document Format

License bundle

Name: Seymour III-John_Open.pdf
Size: 254.57 KB
Format: Adobe Portable Document Format