Detecting, Quantifying, and Mitigating Bias in Malware Datasets
dc.contributor.advisor | Nicholas, Charles K | |
dc.contributor.author | Seymour III, John Jefferson | |
dc.contributor.department | Computer Science and Electrical Engineering | |
dc.contributor.program | Computer Science | |
dc.date.accessioned | 2021-09-01T13:55:26Z | |
dc.date.available | 2021-09-01T13:55:26Z | |
dc.date.issued | 2020-01-20 | |
dc.description.abstract | The effectiveness of a malware classifier on new data is tightly coupled with the data on which it was trained and validated. Malware data are collected from various sources, which must be trusted to be correct about the given labels as well as independent and identically distributed (i.i.d.). However, little research exists toward assessing how well this assumption holds in practice. Given data from various sources of unknown quality, what can we know about a malware classifier's ability to generalize to future, unseen data? Can we even create a malware classifier that generalizes, given issues of data quality and concept drift? How can we assure others that our malware classifier doesn't have underlying data quality issues? This dissertation describes the labeling of a massive dataset of over 33 million raw malware samples so that it can be used both for classification of malware families and as a baseline for measuring drift. It then demonstrates that the models from multiple prior studies are highly sensitive to drift. Finally, it tests new methods for regularization that explicitly use the source of the data to penalize features which do not generalize from one dataset to another. | |
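A minimal sketch of the source-aware idea mentioned at the end of the abstract, assuming a simple "fit one classifier per source and keep only features whose weights agree across sources" formulation; the function and parameter names here are illustrative assumptions, not the dissertation's actual regularizer.

# Hypothetical sketch: down-weight features that fail to generalize across data sources.
# One simple classifier is fit per source; features whose learned weights agree in sign
# (and are non-negligible) across every source are kept, and a final model is fit on them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stable_feature_mask(X, y, sources, min_weight=1e-3):
    """Return a boolean mask of features whose per-source weights agree across sources."""
    per_source_weights = []
    for s in np.unique(sources):
        idx = sources == s
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        per_source_weights.append(clf.coef_.ravel())
    W = np.vstack(per_source_weights)                      # shape: (n_sources, n_features)
    same_sign = np.all(np.sign(W) == np.sign(W[0]), axis=0)
    non_trivial = np.all(np.abs(W) > min_weight, axis=0)
    return same_sign & non_trivial

# Toy usage: X is a feature matrix, y the malware-family/benign label,
# and sources records which feed each sample came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)
sources = rng.integers(0, 3, size=600)

mask = stable_feature_mask(X, y, sources)
final_clf = LogisticRegression(max_iter=1000).fit(X[:, mask], y)
print(f"kept {mask.sum()} of {mask.size} features")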
dc.format | application/pdf | |
dc.genre | dissertations | |
dc.identifier | doi:10.13016/m2nolb-tskz | |
dc.identifier.other | 12251 | |
dc.identifier.uri | http://hdl.handle.net/11603/22840 | |
dc.language | en | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Theses and Dissertations Collection | |
dc.relation.ispartof | UMBC Graduate School Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.source | Original File Name: SeymourIII_umbc_0434D_12251.pdf | |
dc.subject | Data Science | |
dc.subject | Dataset Bias | |
dc.subject | Machine Learning | |
dc.subject | Malware Analysis | |
dc.subject | Reverse Engineering | |
dc.title | Detecting, Quantifying, and Mitigating Bias in Malware Datasets | |
dc.type | Text | |
dcterms.accessRights | Distribution Rights granted to UMBC by the author. | |
dcterms.accessRights | This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu |