Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

dc.contributor.author: Eren, Maksim E.
dc.contributor.author: Bhattarai, Manish
dc.contributor.author: Joyce, Robert J.
dc.contributor.author: Raff, Edward
dc.contributor.author: Nicholas, Charles
dc.contributor.author: Alexandrov, Boian S.
dc.date.accessioned: 2023-10-06T14:08:46Z
dc.date.available: 2023-10-06T14:08:46Z
dc.date.issued: 2023-09-18
dc.description.abstract: Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families and helps with maintaining the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 both rare and prominent malware families, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80. [en_US]
dc.description.sponsorship: This manuscript has been approved for unlimited release and has been assigned LA-UR-23-30350. We thank Nick Solovyev and Drew Barlow for helpful suggestions and edits. This research was partially funded by the Los Alamos National Laboratory (LANL) Laboratory Directed Research and Development (LDRD) grant 20190020DR and LANL Institutional Computing Program, supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. [en_US]
dc.description.uri: https://dl.acm.org/doi/abs/10.1145/3624567 [en_US]
dc.format.extent: 26 pages [en_US]
dc.genre: journal articles [en_US]
dc.genre: postprints [en_US]
dc.identifier: doi:10.13016/m2dgkd-vjbi
dc.identifier.citation: Eren, Maksim E., Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, and Boian S. Alexandrov. “Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection.” ACM Transactions on Privacy and Security, September 18, 2023. https://doi.org/10.1145/3624567. [en_US]
dc.identifier.uri: https://doi.org/10.1145/3624567
dc.identifier.uri: http://hdl.handle.net/11603/30010
dc.language.iso: en_US [en_US]
dc.publisher: ACM [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law. [en_US]
dc.rights: Public Domain Mark 1.0 [*]
dc.rights.uri: http://creativecommons.org/publicdomain/mark/1.0/ [*]
dc.title: Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [en_US]
dc.type: Text [en_US]
dcterms.creator: https://orcid.org/0000-0001-9494-7139 [en_US]

Files

Original bundle

Name: 3624567.pdf
Size: 1.08 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission