Detecting, Quantifying, and Mitigating Bias in Malware Datasets





Department: Computer Science and Electrical Engineering

Program: Computer Science



Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see or contact Special Collections at speccoll(at)


The effectiveness of a malware classifier on new data is tightly coupled with the data on which it was trained and validated. Malware data are collected from various sources, which must be trusted both to provide correct labels and to yield samples that are independent and identically distributed (i.i.d.). However, little research exists on how well this assumption holds in practice. Given data from various sources of unknown quality, what can we know about a malware classifier's ability to generalize to future, unseen data? Can we even build a malware classifier that generalizes, given issues of data quality and concept drift? How can we assure others that our malware classifier does not suffer from underlying data-quality issues? This dissertation describes the labeling of a massive dataset of over 33 million raw malware samples so that it can be used both for classification of malware families and as a baseline for measuring drift. It then demonstrates that the models from multiple prior studies are highly sensitive to drift. Finally, it tests new regularization methods that explicitly use the source of the data to penalize features that do not generalize from one dataset to another.
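The source-aware regularization idea described above can be illustrated with a minimal sketch. The code below is a hypothetical example, not the dissertation's actual method: it fits a logistic-regression model with a shared weight vector plus per-source offset vectors, and an L2 penalty on the offsets discourages relying on features whose useful weight differs between data sources. All names (`source_penalized_logreg`, the `lam` penalty strength) are illustrative assumptions.

```python
import numpy as np

def source_penalized_logreg(X, y, sources, lam=1.0, lr=0.1, epochs=500):
    """Logistic regression with a per-source disagreement penalty.

    A shared weight vector w is fit jointly with per-source offsets d_s.
    The L2 penalty (strength lam) on each d_s pushes the model toward
    features that carry the same signal in every source; features whose
    sign or strength flips between sources cancel out of w.
    (Illustrative sketch only, not the dissertation's exact method.)
    """
    n, p = X.shape
    uniq = np.unique(sources)
    w = np.zeros(p)
    d = {s: np.zeros(p) for s in uniq}
    for _ in range(epochs):
        for s in uniq:
            m = sources == s
            z = X[m] @ (w + d[s])
            prob = 1.0 / (1.0 + np.exp(-z))          # sigmoid
            g = X[m].T @ (prob - y[m]) / m.sum()     # logistic-loss gradient
            w -= lr * g                              # shared weights: plain step
            d[s] -= lr * (g + lam * d[s])            # offsets: penalized step
    return w, d

# Demo on synthetic data: feature 0 predicts the label in both sources,
# while feature 1 correlates with the label in source 0 and
# anti-correlates in source 1 (a non-generalizing feature).
rng = np.random.default_rng(1)
n = 400
src = np.array([0] * (n // 2) + [1] * (n // 2))
x0 = rng.normal(size=n)
y = (x0 + 0.1 * rng.normal(size=n) > 0).astype(float)
x1 = np.where(src == 0, 2 * y - 1, 1 - 2 * y) + 0.1 * rng.normal(size=n)
X = np.column_stack([x0, x1])
w, d = source_penalized_logreg(X, y, src, lam=1.0)
```

In this toy setup, the shared weight on the stable feature 0 should dominate the weight on the source-dependent feature 1, mimicking how a source-aware penalty can suppress features that fail to transfer between datasets.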