MOTIF: A Malware Reference Dataset with Ground Truth Family Labels
Date
2022-09-16
Citation of Original Publication
Robert J. Joyce, Dev Amlani, Charles Nicholas, Edward Raff, MOTIF: A Malware Reference Dataset with Ground Truth Family Labels, Computers & Security (2022), doi: https://doi.org/10.1016/j.cose.2022.102921
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Access to this item will begin on 09-16-2024
Abstract
Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3× larger than any prior expert-labeled corpus and 36× larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time the accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with the accuracy of antivirus majority voting measured at only 62.10% and the well-known AVClass tool having just 46.78% accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration.
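The antivirus majority-voting baseline evaluated above can be sketched as a plurality vote over per-engine family names. This is a minimal illustration only, with hypothetical inputs; the paper's actual evaluation pipeline, including how raw AV detection strings are normalized into family names, may differ.

```python
from collections import Counter

def majority_vote_family(av_labels):
    """Assign a malware family by plurality vote over antivirus labels.

    av_labels: list of family-name strings, one per AV engine
    (assumed to be already normalized; real AV output is not).
    Returns the most frequent family, or None for an empty list.
    Ties are broken by first appearance, following Counter's ordering.
    """
    if not av_labels:
        return None
    # most_common(1) yields a single (family, count) pair
    return Counter(av_labels).most_common(1)[0][0]
```

For example, if three engines report "zeus" and one reports "emotet", the vote yields "zeus"; the low measured accuracy of this baseline reflects how often engines disagree on, or alias, family names.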