Detecting, Quantifying, and Mitigating Bias in Malware Datasets
dc.contributor.advisor | Nicholas, Charles K | |
dc.contributor.author | Seymour III, John Jefferson | |
dc.contributor.department | Computer Science and Electrical Engineering | |
dc.contributor.program | Computer Science | |
dc.date.accessioned | 2021-09-01T13:55:26Z | |
dc.date.available | 2021-09-01T13:55:26Z | |
dc.date.issued | 2020-01-20 | |
dc.description.abstract | The effectiveness of a malware classifier on new data is tightly coupled with the data on which it was trained and validated. Malware data are collected from various sources, which must be trusted to be correct about the given labels as well as independent and identically distributed (i.i.d.). However, little research exists toward assessing how well this assumption holds in practice. Given data from various sources of unknown quality, what can we know about a malware classifier's ability to generalize to future, unseen data? Can we even create a malware classifier that generalizes, given issues of data quality and concept drift? How can we assure others that our malware classifier doesn't have underlying data quality issues? This dissertation describes the labeling of a massive dataset of over 33 million raw malware samples so that it can be used both for classification of malware families and as a baseline for measuring drift. It then demonstrates that the models from multiple prior studies are highly sensitive to drift. Finally, it tests new methods for regularization that explicitly use the source of the data to penalize features which do not generalize from one dataset to another. | |
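A minimal sketch of the source-aware idea mentioned at the end of the abstract, assuming a simple "fit one classifier per source and keep only features whose weights agree across sources" formulation; the function and parameter names here are illustrative assumptions, not the dissertation's actual regularizer.

# Hypothetical sketch: down-weight features that fail to generalize across data sources.
# One simple classifier is fit per source; features whose learned weights agree in sign
# (and are non-negligible) across every source are kept, and a final model is fit on them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stable_feature_mask(X, y, sources, min_weight=1e-3):
    """Return a boolean mask of features whose per-source weights agree across sources."""
    per_source_weights = []
    for s in np.unique(sources):
        idx = sources == s
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        per_source_weights.append(clf.coef_.ravel())
    W = np.vstack(per_source_weights)                      # shape: (n_sources, n_features)
    same_sign = np.all(np.sign(W) == np.sign(W[0]), axis=0)
    non_trivial = np.all(np.abs(W) > min_weight, axis=0)
    return same_sign & non_trivial

# Toy usage: X is a feature matrix, y the malware-family/benign label,
# and sources records which feed each sample came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)
sources = rng.integers(0, 3, size=600)

mask = stable_feature_mask(X, y, sources)
final_clf = LogisticRegression(max_iter=1000).fit(X[:, mask], y)
print(f"kept {mask.sum()} of {mask.size} features")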
dc.format | application/pdf | |
dc.genre | dissertations | |
dc.identifier | doi:10.13016/m2nolb-tskz | |
dc.identifier.other | 12251 | |
dc.identifier.uri | http://hdl.handle.net/11603/22840 | |
dc.language | en | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Theses and Dissertations Collection | |
dc.relation.ispartof | UMBC Graduate School Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.source | Original File Name: SeymourIII_umbc_0434D_12251.pdf | |
dc.subject | Data Science | |
dc.subject | Dataset Bias | |
dc.subject | Machine Learning | |
dc.subject | Malware Analysis | |
dc.subject | Reverse Engineering | |
dc.title | Detecting, Quantifying, and Mitigating Bias in Malware Datasets | |
dc.type | Text | |
dcterms.accessRights | Distribution Rights granted to UMBC by the author. | |
dcterms.accessRights | This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu |