Advanced Semi-supervised Tensor Decomposition Methods for Malware Characterization

Author/Creator

Author/Creator ORCID

Date

2024/01/01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Abstract

Malware continues to be one of the most dangerous and costly cyber threats to national security. As of last year, over 1.3 billion malware specimens had been documented, prompting the use of data-driven machine learning (ML) techniques for their analysis. However, existing ML approaches face significant barriers that limit their widespread adoption. These challenges include detecting novel malware, maintaining performance when little labeled data is available for training, and classifying malware under class imbalance, a scenario in which malware families are unevenly represented in the dataset. This dissertation addresses these shortcomings by introducing three novel semi-supervised ML methods based on tensor decomposition. Our methods build on dimensionality reduction, hierarchical tensor decomposition, automatic model determination, and feature extraction, combined with selective classification, also called a reject-option capability. The reject option is a form of self-awareness that allows our models to abstain from making a decision under uncertainty, which in turn allows for the detection of novel threats. In this dissertation, we present the foundational concepts underlying our methods and describe the approaches we developed: the Random Forest of Tensors (RFoT), the HNMFk Classifier, and MalwareDNA. Additionally, we detail how our methods utilize High Performance Computing (HPC), multi-processing, and Graphics Processing Units (GPUs) for accelerated computation. We present experiments with all three methods that demonstrate stable task performance under extreme class imbalance, low quantities of labeled data, and very large numbers of malware families. We also report results for simultaneously classifying benign-ware and malware, classifying malware families, and detecting novel malware families. Our results are compared against state-of-the-art semi-supervised and supervised ML baselines on two datasets.
We show that our methods surpass the performance of these baselines, with a trade-off of an increased abstention, or reject-option, rate.
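
The trade-off between accuracy and abstention can be illustrated with a minimal sketch of selective classification. It is not the mechanism used by RFoT, the HNMFk Classifier, or MalwareDNA, which the abstract does not specify. It assumes a generic confidence-threshold rule, and the names `predict_with_reject`, `abstention_rate`, and the `threshold` value are hypothetical.

```python
from typing import List, Optional, Sequence


def predict_with_reject(
    scores: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.8,  # assumed cut-off, chosen only for illustration
) -> Optional[str]:
    """Return the highest-scoring label, or None to abstain.

    `scores` are per-class confidence values (for example, probabilities).
    The model abstains when its best score falls below `threshold`.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None  # reject option: too uncertain to decide
    return labels[best]


def abstention_rate(predictions: List[Optional[str]]) -> float:
    """Fraction of samples on which the model abstained."""
    return sum(p is None for p in predictions) / len(predictions)


families = ["family_a", "family_b", "benign"]
preds = [
    predict_with_reject([0.92, 0.05, 0.03], families),  # confident prediction
    predict_with_reject([0.40, 0.35, 0.25], families),  # uncertain, abstains
]
print(preds, abstention_rate(preds))
```

Raising the threshold in such a scheme typically makes the retained predictions more reliable while increasing the share of abstentions. A sample that triggers abstention is a candidate for a novel family.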