Clustering Analysis of Malware Binaries Using the Lempel-Ziv Jaccard Distance






Computer Science and Electrical Engineering


Computer Science



Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.


In this thesis, we investigate the applicability of the Lempel-Ziv Jaccard Distance (LZJD), a recently introduced similarity metric on arbitrary binaries, for hierarchical clustering. We perform experiments with three separate datasets and analyze the quality of the clusters produced by a hierarchical density-based clustering algorithm, HDBSCAN, using internal and external metrics where applicable. Finally, we propose the Partitioned Lempel-Ziv Jaccard Distance (LZJDp), a novel modification to LZJD that forms the Lempel-Ziv dictionary by merging the dictionaries built from natural partitions of the original file. We evaluate three different partitioning methods for malicious binaries and compare these results to traditional LZJD. We find that LZJD does not perform well with hierarchical clustering and does not result in well-separated clusters. Additionally, we find that LZJDp underperforms LZJD, with decreasing accuracy and higher uncertainty as the number of partitions increases. Recommendations for future research are provided, including further exploration of flat clustering with LZJD to better understand why hierarchical clustering was ineffective. Additional research is also needed to improve LZJDp by reducing LZJD dictionary noise.
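
To make the two distances concrete, the following is a minimal Python sketch, not the implementation used in the thesis. It assumes the commonly described LZJD construction (a Lempel-Ziv dictionary taken as the set of shortest previously unseen substrings, compared with the Jaccard distance) and omits the hashing and min-hash approximation that practical LZJD implementations typically add. The function names are hypothetical, and the partitions are assumed to be supplied by the caller as byte strings, since the abstract does not specify the three partitioning methods.

```python
from typing import Iterable, Set


def lz_set(data: bytes) -> Set[bytes]:
    """Build a Lempel-Ziv style dictionary: the set of shortest unseen substrings."""
    seen: Set[bytes] = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New substring: record it and begin the next one after it.
            seen.add(chunk)
            start = end
    return seen


def jaccard_distance(a: Set[bytes], b: Set[bytes]) -> float:
    """Return 1 - |A n B| / |A u B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def lzjd(x: bytes, y: bytes) -> float:
    """Simplified LZJD: exact Jaccard distance between the two LZ dictionaries."""
    return jaccard_distance(lz_set(x), lz_set(y))


def lz_set_partitioned(parts: Iterable[bytes]) -> Set[bytes]:
    """Sketch of the LZJDp dictionary: merge the dictionaries of each partition."""
    merged: Set[bytes] = set()
    for part in parts:
        merged |= lz_set(part)
    return merged


def lzjd_p(x_parts: Iterable[bytes], y_parts: Iterable[bytes]) -> float:
    """Simplified LZJDp: Jaccard distance between the merged dictionaries."""
    return jaccard_distance(lz_set_partitioned(x_parts), lz_set_partitioned(y_parts))
```

In this sketch each partition restarts the dictionary from empty, so short substrings are rediscovered in every partition and the merged dictionary holds fewer long substrings as the partition count grows. That is one possible reading of the dictionary noise mentioned above, offered here only as an illustration.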