Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
dc.contributor.advisor | UMBC Faculty Collection | |
dc.contributor.author | Raff, Edward | |
dc.contributor.author | Nicholas, Charles | |
dc.date.accessioned | 2018-09-21T18:55:35Z | |
dc.date.available | 2018-09-21T18:55:35Z | |
dc.date.issued | 2018 | |
dc.description | The 18th ACM Symposium on Document Engineering; This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law. | en_US |
dc.description.abstract | N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first pass at feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements. | en_US |
dc.description.uri | http://www.edwardraff.com/publications/hash-grams-faster.pdf | en_US |
dc.format.extent | 4 pages | en_US |
dc.genre | conference paper | en_US |
dc.identifier | doi:10.13016/M2Q814W51 | |
dc.identifier.citation | Edward Raff and Charles Nicholas. 2018. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In DocEng ’18: ACM Symposium on Document Engineering 2018, August 28–31, 2018, Halifax, NS, Canada. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3209280.3229085 | en_US |
dc.identifier.uri | https://doi.org/10.1145/3209280.3229085 | |
dc.identifier.uri | http://hdl.handle.net/11603/11343 | |
dc.language.iso | en_US | en_US |
dc.publisher | Association for Computing Machinery (ACM) | en_US |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.rights | This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please contact the author. | |
dc.rights | Public Domain Mark 1.0 | * |
dc.rights.uri | http://creativecommons.org/publicdomain/mark/1.0/ | * |
dc.subject | top-k selection | en_US |
dc.subject | bottleneck | en_US |
dc.subject | Hash-Grams | en_US |
dc.title | Hash-Grams: Faster N-Gram Features for Classification and Malware Detection | en_US |
dc.type | Text | en_US |