Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

dc.contributor.advisor: UMBC Faculty Collection
dc.contributor.author: Raff, Edward
dc.contributor.author: Nicholas, Charles
dc.date.accessioned: 2018-09-21T18:55:35Z
dc.date.available: 2018-09-21T18:55:35Z
dc.date.issued: 2018
dc.description: The 18th ACM Symposium on Document Engineering; This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.description.abstract: N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
dc.description.uri: http://www.edwardraff.com/publications/hash-grams-faster.pdf
dc.format.extent: 4 pages
dc.genre: conference paper
dc.identifier: doi:10.13016/M2Q814W51
dc.identifier.citation: Edward Raff and Charles Nicholas. 2018. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In DocEng ’18: ACM Symposium on Document Engineering 2018, August 28–31, 2018, Halifax, NS, Canada. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3209280.3229085
dc.identifier.uri: https://doi.org/10.1145/3209280.3229085
dc.identifier.uri: http://hdl.handle.net/11603/11343
dc.language.iso: en
dc.publisher: Association for Computing Machinery (ACM)
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: Public Domain Mark 1.0
dc.rights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please contact the author.
dc.rights.uri: http://creativecommons.org/publicdomain/mark/1.0/
dc.subject: top-k selection
dc.subject: bottleneck
dc.subject: Hash-Grams
dc.title: Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
dc.type: Text

Files

Original bundle

Name: hash-grams-faster.pdf
Size: 587.08 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission