Detecting, Quantifying, and Mitigating Bias in Malware Datasets
Author/Creator
Author/Creator ORCID
Date
2020-01-20
Type of Work
Department
Computer Science and Electrical Engineering
Program
Computer Science
Citation of Original Publication
Rights
Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Abstract
The effectiveness of a malware classifier on new data is tightly coupled with the data on which it was trained and validated. Malware data are collected from various sources, which must be trusted both to supply correct labels and to provide samples that are independent and identically distributed (i.i.d.). However, little research exists on how well these assumptions hold in practice. Given data from various sources of unknown quality, what can we know about a malware classifier's ability to generalize to future, unseen data? Can we even create a malware classifier that generalizes, given issues of data quality and concept drift? How can we assure others that our malware classifier does not have underlying data quality issues? This dissertation describes the labeling of a massive dataset of over 33 million raw malware samples so that it can be used both for classification of malware families and as a baseline against which to measure drift. It then demonstrates that the models from multiple prior studies are highly sensitive to drift. Finally, it tests new regularization methods that explicitly use the source of the data to penalize features which do not generalize from one dataset to another.
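The source-aware regularization idea mentioned above could, very loosely, look like the following toy sketch. This is not the dissertation's actual method; it is a minimal illustration, assuming each sample carries a source tag, using scikit-learn and synthetic data, of penalizing (here, simply dropping) features whose learned weights disagree across sources.

```python
# Hypothetical sketch (not the dissertation's method): fit one model per data
# source, measure how much each feature's weight disagrees across sources, and
# keep only the features that behave consistently. Synthetic data stands in for
# real malware feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature vectors and labels from two pretend sources.
n, d = 2000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # signal shared by both sources
source = rng.integers(0, 2, size=n)             # which feed each sample came from
X[source == 1, 5] += y[source == 1] * 2.0       # feature 5 is predictive only in source 1

# Fit one classifier per source and compare the learned feature weights.
weights = []
for s in (0, 1):
    m = LogisticRegression(max_iter=1000).fit(X[source == s], y[source == s])
    weights.append(m.coef_.ravel())
disagreement = np.abs(weights[0] - weights[1])

# Keep only features whose weights are stable across sources (threshold is arbitrary),
# then train the final model on the reduced feature set.
stable = disagreement < np.percentile(disagreement, 75)
final = LogisticRegression(max_iter=1000).fit(X[:, stable], y)
print("features kept:", np.flatnonzero(stable))
```

A soft variant of the same idea would keep all features but scale a per-feature penalty by the cross-source disagreement instead of hard-thresholding it; either way, the point is that the source label drives the regularization.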