CyberEnt: A Cybersecurity Domain Specific Dataset for Named Entity Recognition

dc.contributor.author: Hanks, Casey
dc.contributor.author: Maiden, Michael
dc.contributor.author: Ranade, Priyanka
dc.contributor.author: Finin, Tim
dc.contributor.author: Joshi, Anupam
dc.date.accessioned: 2022-06-21T21:06:35Z
dc.date.available: 2022-06-21T21:06:35Z
dc.date.issued: 2022-04-18
dc.description.abstract: Named Entity Recognition (NER) is a critical component of automated knowledge extraction. It allows Natural Language Processing (NLP) models to label instances of real-world entities that are important in the context of the text. To accomplish this, the NLP model needs to be trained on large corpora of human-annotated text. General, domain-agnostic text corpora are available, but they are not suited for fields such as cybersecurity that require domain-specific text for downstream tasks such as malware analysis. NLP for cybersecurity is an emerging field, and there is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to extract meaningful insights from Cyber Threat Intelligence (CTI). Terabytes of CTI data are disclosed daily, making it nearly impossible for human analysts to sift through manually. The cybersecurity domain has few training datasets available compared with domains such as medicine or law. We have created a large CTI corpus and are actively using it to train and test supervised and semi-supervised cybersecurity NER models using the spaCy NLP framework. We also aim to develop methods that allow continuous integration of incoming, up-to-date CTI information. [en_US]
dc.description.sponsorship: This work was funded, in part, by a grant from the NSA through the On-Ramp program and by the National Science Foundation under Grant Number 2114892. [en_US]
dc.description.uri: https://ebiquity.umbc.edu/paper/html/id/1022/CyNER-A-Cybersecurity-Domain-Specific-Dataset-for-Named-Entity-Recognition [en_US]
dc.description.uri: https://umbc.voicethread.com/share/19813519/
dc.format.extent: 5 minutes 36 seconds [en_US]
dc.genre: posters [en_US]
dc.genre: audio recordings [en_US]
dc.identifier: doi:10.13016/m27qz6-xiyt
dc.identifier.uri: http://hdl.handle.net/11603/25006
dc.language.iso: en_US [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. [en_US]
dc.subject: UMBC Undergraduate Research and Creative Achievement Day [en_US]
dc.subject: UMBC Ebiquity Research Group
dc.title: CyberEnt: A Cybersecurity Domain Specific Dataset for Named Entity Recognition [en_US]
dc.type: Sound [en_US]
dc.type: Text [en_US]
dcterms.creator: https://orcid.org/0000-0002-6593-1792 [en_US]
dcterms.creator: https://orcid.org/0000-0002-8641-3193 [en_US]

Files

Original bundle

Name: CyberEnt Poster.pdf
Size: 502.52 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission