CyberEnt: A Cybersecurity Domain Specific Dataset for Named Entity Recognition
dc.contributor.author | Hanks, Casey | |
dc.contributor.author | Maiden, Michael | |
dc.contributor.author | Ranade, Priyanka | |
dc.contributor.author | Finin, Tim | |
dc.contributor.author | Joshi, Anupam | |
dc.date.accessioned | 2022-06-21T21:06:35Z | |
dc.date.available | 2022-06-21T21:06:35Z | |
dc.date.issued | 2022-04-18 | |
dc.description.abstract | Named Entity Recognition (NER) is a critical component of automated knowledge extraction. It allows Natural Language Processing (NLP) models to label instances of real-world entities that are important in the context of the text. To accomplish this, the NLP model needs to be trained on large corpora of human-annotated text. General, domain-agnostic text corpora are available, but they are not suited for fields such as cybersecurity, which require domain-specific text for downstream tasks such as malware analysis. NLP for cybersecurity is an emerging field, and there is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to extract meaningful insights from Cyber Threat Intelligence (CTI). Terabytes of CTI data are disclosed on a daily basis, making it nearly impossible for human analysts to sift through manually. The cybersecurity domain has limited training datasets available, as opposed to other domains such as medicine or law. We have created a large CTI corpus and are actively using it to train and test supervised and semi-supervised cybersecurity NER models using the spaCy NLP framework. In addition, we aim to develop methods that allow continuous integration of incoming, up-to-date CTI information. | en_US |
dc.description.sponsorship | This work was funded, in part, by a grant from the NSA through the On-Ramp program and by the National Science Foundation under Grant Number 2114892. | en_US |
dc.description.uri | https://ebiquity.umbc.edu/paper/html/id/1022/CyNER-A-Cybersecurity-Domain-Specific-Dataset-for-Named-Entity-Recognition | en_US |
dc.description.uri | https://umbc.voicethread.com/share/19813519/ | |
dc.format.extent | 5 minutes 36 seconds | en_US |
dc.genre | posters | en_US |
dc.genre | audio recordings | en_US |
dc.identifier | doi:10.13016/m27qz6-xiyt | |
dc.identifier.uri | http://hdl.handle.net/11603/25006 | |
dc.language.iso | en_US | en_US |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | en_US |
dc.subject | UMBC Undergraduate Research and Creative Achievement Day | en_US |
dc.subject | UMBC Ebiquity Research Group | |
dc.title | CyberEnt: A Cybersecurity Domain Specific Dataset for Named Entity Recognition | en_US |
dc.type | Sound | en_US |
dc.type | Text | en_US |
dcterms.creator | https://orcid.org/0000-0002-6593-1792 | en_US |
dcterms.creator | https://orcid.org/0000-0002-8641-3193 | en_US |