Nicholas, Charles K
Spizler, Alexander
2021-01-29
2021-01-29
2018-01-01
11814
http://hdl.handle.net/11603/20761
In this thesis, we investigate the applicability of the Lempel-Ziv Jaccard Distance (LZJD), a recently introduced similarity metric on arbitrary binaries, for hierarchical clustering. We perform experiments with three separate datasets and analyze the quality of clusters produced by a hierarchical density-based clustering algorithm, HDBSCAN, using internal and external metrics where applicable. Finally, we propose the Partitioned Lempel-Ziv Jaccard Distance (LZJDp), a novel modification to LZJD that forms the Lempel-Ziv dictionary by merging the dictionaries built from natural partitions of the original file. We evaluate three different partitioning methods for malicious binaries and compare the results to traditional LZJD. We find that LZJD does not perform well with hierarchical clustering and does not result in well-separated clusters. Additionally, we find that LZJDp underperforms LZJD, with decreasing accuracy and higher uncertainty as the number of partitions increases. We provide recommendations for future research, including further exploration of flat clustering with LZJD to better understand why hierarchical clustering was ineffective. Additional research is also needed to improve LZJDp by reducing LZJD dictionary noise.
application/pdf
clustering
hierarchical
LZJD
metric
Clustering Analysis of Malware Binaries Using the Lempel-Ziv Jaccard Distance
Text