Building an Efficient PDF Malware Detection System

Author/Creator

Author/Creator ORCID

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

With the widespread use of the Portable Document Format (PDF), it has increasingly become a target for malware, highlighting the need for effective detection solutions. In recent years, machine learning (ML)-based methods for PDF malware detection have grown in popularity. However, the effectiveness of ML models depends closely on the quality of the training dataset and the feature set employed. Moreover, many well-known ML-based detectors require a large number of features to be collected from each specimen before making a decision, which can be time-consuming. This thesis addresses these challenges by proposing a two-stage approach for PDF malware detection. The first stage focuses on rapid detection, providing an early warning against potential PDF malware attacks. For the second stage, we introduce a robust and highly reliable feature set for PDF malware identification. Our contributions are as follows:

1. Rapid Detection Methodology: Compared to traditional machine learning or neural network models, our novel distance-based method for rapid PDF malware detection requires far fewer training samples. Evaluated on the Contagio dataset, our method detects 90.50% of malware samples using only 20 benign PDFs for model training.
2. Dataset Analysis and Introduction of PdfRep: Through an examination of two widely used PDF malware datasets, namely Contagio and CIC, we find biases and representativeness issues that compromise the reliability of malware detection models. To mitigate these limitations, we present PdfRep, a new, more representative PDF malware dataset that outperforms existing PDF malware datasets on evaluation metrics.
3. Compact Feature Set for Enhanced Robustness: We introduce a novel set of just five features designed to optimize training efficiency and enhance the robustness of PDF malware detection systems. Experiments show that this compact feature set strengthens PDF malware detection systems against particular adversarial attacks and allows highly accurate models to be built.

In summary, this thesis presents a comprehensive approach to PDF malware detection, from rapid initial alerts to the deployment of a robust, efficient detection system enhanced by a novel dataset and an optimized feature set. Taken as a whole, these developments represent significant progress in defending against PDF-based malware attacks.