Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

dc.contributor.advisor: UMBC Faculty Collection
dc.contributor.author: Raff, Edward
dc.contributor.author: Nicholas, Charles
dc.date.accessioned: 2018-09-21T18:55:35Z
dc.date.available: 2018-09-21T18:55:35Z
dc.date.issued: 2018
dc.description: The 18th ACM Symposium on Document Engineering; This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.description.abstract: N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
dc.description.uri: http://www.edwardraff.com/publications/hash-grams-faster.pdf
dc.format.extent: 4 pages
dc.genre: conference paper
dc.identifier: doi:10.13016/M2Q814W51
dc.identifier.citation: Edward Raff and Charles Nicholas. 2018. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In DocEng ’18: ACM Symposium on Document Engineering 2018, August 28–31, 2018, Halifax, NS, Canada. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3209280.3229085
dc.identifier.uri: https://doi.org/10.1145/3209280.3229085
dc.identifier.uri: http://hdl.handle.net/11603/11343
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery (ACM)
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please contact the author.
dc.rights: Public Domain Mark 1.0
dc.rights.uri: http://creativecommons.org/publicdomain/mark/1.0/
dc.subject: top-k selection
dc.subject: bottleneck
dc.subject: Hash-Grams
dc.title: Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
dc.type: Text

Files

Original bundle
Name: hash-grams-faster.pdf
Size: 587.08 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission