KiloGrams: Very Large N-Grams for Malware Classification

dc.contributor.author: Raff, Edward
dc.contributor.author: Fleming, William
dc.contributor.author: Zak, Richard
dc.contributor.author: Anderson, Hyrum
dc.contributor.author: Finlayson, Bill
dc.contributor.author: Nicholas, Charles
dc.contributor.author: McLean, Mark
dc.description: LEMINCS @ KDD’19, August 5th, 2019, Anchorage, Alaska, United States (en_US)
dc.description.abstract: N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of n are tested, with n>6 being exceedingly rare. Larger values of n are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-k most frequent n-grams that is 60× faster for small n, and can tackle large n≥1024. Despite the unprecedented size of n considered, we show how these features still have predictive ability for malware classification tasks. More important, large n-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common n-grams in a file may be added as features to publicly available human-engineered features that rival efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset. (en_US)
dc.format.extent: 11 pages (en_US)
dc.genre: conference proceedings and papers preprints (en_US)
dc.identifier.citation: raff2019kilograms, KiloGrams: Very Large N-Grams for Malware Classification. Edward Raff; William Fleming; Richard Zak; Hyrum Anderson; Bill Finlayson; Charles Nicholas; Mark McLean; 2019. Cite as: arXiv:1908.00200v1 [cs.CR] (en_US)
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject: information retrieval (en_US)
dc.subject: machine learning (en_US)
dc.subject: predictive ability (en_US)
dc.subject: malware analysis (en_US)
dc.title: KiloGrams: Very Large N-Grams for Malware Classification (en_US)

