CLUSTERING API CALL GRAPHS THROUGH GAUSSIAN MIXTURE MODELS FOR BINARY BEHAVIORAL STATIC ANALYSIS

Author/Creator

Author/Creator ORCID

Date

2019-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan thorugh a local library, pending author/copyright holder's permission.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Abstract

Analyzing binaries and malicious programs e?ciently has been a topic of great interest over the past decade. Over the years, tools have been developed to auto mate the process of analyzing binary behavior both statically and dynamically. This project will present a method of analyzing PE32 binaries and malware behavior stat ically through the extraction of Windows API call graphs. These call graphs are then converted into variable-length sequences which are then processed through an enhanced version of the Lempel-Ziv Jaccard Distance (LZJD) algorithm. Princi pal Component Analysis (PCA) is then used to reduce the dimensionality of the resulting vector, and the reduced vector is then soft clustered using Gaussian Mix ture models and quali?ed using silhouette score calculations to create a probabilistic model of the sample(s) in question. The overall concept of this work is by clustering unknown samples with known samples that have behavioral reports, various behaviors can be inferred based on the clustering assignments and silhouette scores. The motivation behind this ap proach is to reduce the amount of manual inspections of behavioral indicators that may infer a speci?c behavior, and to improve the partial automation of the reverse engineering process by developing a behavioral analysis method that requires no program execution, and allows the attributes of the call graph of the program to self-produce behavioral inferences through clustering. Rather than analyzing the disassembled code of a binary, instead Gaussian clusters are analyzed and behav ioral inferences are derived from the resulting clusters. In addition, this method does not require the often arduous task of labeling samples by one or more speci?c behaviors to then train a model that is only limited to those features. One impor tant note for this work is that the distinguishing factor from malware from goodware is the context and intent of the sample. While this work does not aim to identify directly whether an unknown sample is malware, the work does aim to present the behaviors of samples in a cluster format by which an analyst can then infer from the deduced behavior that a sample is benign or malicious as a use case.