MOTIF: A Malware Reference Dataset with Ground Truth Family Labels

Date

2022-09-16

Citation of Original Publication

Robert J. Joyce, Dev Amlani, Charles Nicholas, Edward Raff, MOTIF: A Malware Reference Dataset with Ground Truth Family Labels, Computers & Security (2022), doi: https://doi.org/10.1016/j.cose.2022.102921

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Access to this item will begin on 09-16-2024

Abstract

Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3× larger than any prior expert-labeled corpus and 36× larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark, for the first time, the accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with accuracy of antivirus majority voting measured at only 62.10% and the well-known AVClass tool having just 46.78% accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration.
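To make the antivirus majority-voting evaluation mentioned in the abstract concrete, the sketch below shows one plausible way to compute such a baseline: normalize vendor-assigned family names through an alias map, take the most common name per sample, and score it against a ground-truth label. This is a minimal illustration under assumed data structures, not the paper's actual evaluation protocol; the alias entries and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical alias map: different vendor names that refer to the same family.
ALIASES = {
    "zeus": "zbot",
    "wannacrypt": "wannacry",
}

def normalize(name: str) -> str:
    """Lower-case a vendor-assigned family name and collapse known aliases."""
    name = name.lower()
    return ALIASES.get(name, name)

def majority_vote(av_labels: list[str]) -> str | None:
    """Return the most common normalized family name, or None on a tie or empty input."""
    counts = Counter(normalize(label) for label in av_labels if label)
    if not counts:
        return None
    (top, top_count), *rest = counts.most_common(2)
    if rest and rest[0][1] == top_count:
        return None  # ambiguous vote
    return top

def accuracy(samples: list[tuple[list[str], str]]) -> float:
    """Fraction of samples whose majority-vote label matches the ground-truth family."""
    hits = sum(majority_vote(avs) == normalize(truth) for avs, truth in samples)
    return hits / len(samples)

if __name__ == "__main__":
    # Toy example: each entry is ([AV-assigned names], ground-truth family).
    data = [
        (["Zeus", "zbot", "generic"], "zbot"),
        (["wannacry", "WannaCrypt", "ransom"], "wannacry"),
        (["emotet", "heodo", "trickbot"], "emotet"),  # three-way tie -> no vote
    ]
    print(f"Majority-vote accuracy: {accuracy(data):.2%}")
```

In practice, how aliases are resolved and how ties or missing detections are handled can shift a baseline like this considerably, which is part of why the alias lists shipped with MOTIF matter for reproducible comparisons.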