CLUSTERING API CALL GRAPHS THROUGH GAUSSIAN MIXTURE MODELS FOR BINARY BEHAVIORAL STATIC ANALYSIS

dc.contributor.advisorNicholas, Charles
dc.contributor.authorAdangwa, Mariama
dc.contributor.departmentComputer Science and Electrical Engineering
dc.contributor.programComputer Science
dc.date.accessioned2021-09-01T13:55:28Z
dc.date.available2021-09-01T13:55:28Z
dc.date.issued2019-01-01
dc.description.abstractAnalyzing binaries and malicious programs e?ciently has been a topic of great interest over the past decade. Over the years, tools have been developed to auto mate the process of analyzing binary behavior both statically and dynamically. This project will present a method of analyzing PE32 binaries and malware behavior stat ically through the extraction of Windows API call graphs. These call graphs are then converted into variable-length sequences which are then processed through an enhanced version of the Lempel-Ziv Jaccard Distance (LZJD) algorithm. Princi pal Component Analysis (PCA) is then used to reduce the dimensionality of the resulting vector, and the reduced vector is then soft clustered using Gaussian Mix ture models and quali?ed using silhouette score calculations to create a probabilistic model of the sample(s) in question. The overall concept of this work is by clustering unknown samples with known samples that have behavioral reports, various behaviors can be inferred based on the clustering assignments and silhouette scores. The motivation behind this ap proach is to reduce the amount of manual inspections of behavioral indicators that may infer a speci?c behavior, and to improve the partial automation of the reverse engineering process by developing a behavioral analysis method that requires no program execution, and allows the attributes of the call graph of the program to self-produce behavioral inferences through clustering. Rather than analyzing the disassembled code of a binary, instead Gaussian clusters are analyzed and behav ioral inferences are derived from the resulting clusters. In addition, this method does not require the often arduous task of labeling samples by one or more speci?c behaviors to then train a model that is only limited to those features. One impor tant note for this work is that the distinguishing factor from malware from goodware is the context and intent of the sample. While this work does not aim to identify directly whether an unknown sample is malware, the work does aim to present the behaviors of samples in a cluster format by which an analyst can then infer from the deduced behavior that a sample is benign or malicious as a use case.
dc.formatapplication:pdf
dc.genretheses
dc.identifierdoi:10.13016/m28jba-jwwn
dc.identifier.other12113
dc.identifier.urihttp://hdl.handle.net/11603/22845
dc.languageen
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartofUMBC Theses and Dissertations Collection
dc.relation.ispartofUMBC Graduate School Collection
dc.relation.ispartofUMBC Student Collection
dc.sourceOriginal File Name: Adangwa_umbc_0434M_12113.pdf
dc.subjectAPI Call Graph
dc.subjectBehavior
dc.subjectBinary analysis
dc.subjectClustering
dc.subjectGaussian Mixture Models
dc.subjectMalware
dc.titleCLUSTERING API CALL GRAPHS THROUGH GAUSSIAN MIXTURE MODELS FOR BINARY BEHAVIORAL STATIC ANALYSIS
dc.typeText
dcterms.accessRightsAccess limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan thorugh a local library, pending author/copyright holder's permission.
dcterms.accessRightsThis item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Files

Original bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
Adangwa_umbc_0434M_12113.pdf
Size:
464.41 KB
Format:
Adobe Portable Document Format