MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers

Joyce, Robert J.; Raff, Edward; Nicholas, Charles; Holt, James

dc.contributor.author	Joyce, Robert J.
dc.contributor.author	Raff, Edward
dc.contributor.author	Nicholas, Charles
dc.contributor.author	Holt, James
dc.date.accessioned	2023-11-08T14:15:32Z
dc.date.available	2023-11-08T14:15:32Z
dc.date.issued	2023-10-18
dc.description	CAMLIS’23: Conference on Applied Machine Learning in Information Security (CAMLIS); Arlington, VA; October 19–20, 2023
dc.description.abstract	Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family. However, malware can be categorized according to many other types of attributes, and the ability to identify these attributes in newly emerging malware using machine learning could provide significant value to analysts. In particular, we have identified four tasks that are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with. To obtain labels for training and evaluating ML classifiers on these tasks, we created an antivirus (AV) tagging tool called ClarAVy. ClarAVy's sophisticated AV label parser distinguishes itself from prior AV-based taggers, with the ability to accurately parse 882 different AV label formats used by 90 different AV products. We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total. Our malware behavior dataset includes 75 distinct tags, nearly 7x more than the only prior benchmark dataset with behavioral tags. To our knowledge, we are the first to release datasets with malware platform and packer tags.	en_US
dc.description.uri	https://arxiv.org/abs/2310.11706	en_US
dc.format.extent	17 pages	en_US
dc.genre	conference papers and proceedings	en_US
dc.genre	preprints	en_US
dc.identifier	doi:10.13016/m2xlxt-bxxa
dc.identifier.uri	https://doi.org/10.48550/arXiv.2310.11706
dc.identifier.uri	http://hdl.handle.net/11603/30589
dc.language.iso	en_US	en_US
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof	UMBC Faculty Collection
dc.rights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.	en_US
dc.rights	Attribution 4.0 International (CC BY 4.0)	*
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	*
dc.title	MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers	en_US
dc.type	Text	en_US
dcterms.creator	https://orcid.org/0000-0001-9494-7139	en_US

Files

Original bundle

Name: 2310.11706.pdf
Size: 763.94 KB
Format: Adobe Portable Document Format

Collections

UMBC Computer Science and Electrical Engineering Department
UMBC Faculty Collection