CyberEnt: A Cybersecurity Domain Specific Dataset for Named Entity Recognition

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Subjects

UMBC Undergraduate Research and Creative Achievement Day
UMBC Ebiquity Research Group

Abstract

Named Entity Recognition (NER) is a critical component of automated knowledge extraction. It allows Natural Language Processing (NLP) models to label instances of real-world entities that are important in the context of a text. To accomplish this, an NLP model must be trained on large corpora of human-annotated text. General, domain-agnostic text corpora are available, but they are not suited to fields such as cybersecurity, which require domain-specific text for downstream tasks such as malware analysis. NLP for cybersecurity is an emerging field, and there is a pressing need for community-accessible datasets that can be used to train existing AI-based cybersecurity pipelines to extract meaningful insights from Cyber Threat Intelligence (CTI). Terabytes of CTI data are disclosed daily, making it nearly impossible for human analysts to sift through manually. Compared with domains such as medicine or law, the cybersecurity domain has few training datasets available. We have created a large CTI corpus and are actively using it to train and test supervised and semi-supervised cybersecurity NER models with the spaCy NLP framework. We also aim to develop methods that allow continuous integration of incoming, up-to-date CTI information.
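To make the idea of human-annotated NER text concrete, the following is a minimal sketch of how character-span annotations might be converted into per-token BIO tags, a common representation for NER training data. It does not reflect the CyberEnt annotation scheme: the sentence is invented, and the label names (MALWARE, VULNERABILITY, SOFTWARE) and the helpers `make_span` and `spans_to_bio` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# (start_char, end_char, label); end_char is exclusive.
Span = Tuple[int, int, str]


def make_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` and return it as a labeled span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Whitespace-tokenize `text` and assign a BIO tag to each token."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"  # default: token is outside every entity
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of a span gets B-, later tokens get I-.
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


# Invented sentence and hypothetical labels, for illustration only.
sentence = "SampleRAT exploits CVE-2099-0001 in ExampleOffice Suite"
annotations = [
    make_span(sentence, "SampleRAT", "MALWARE"),
    make_span(sentence, "CVE-2099-0001", "VULNERABILITY"),
    make_span(sentence, "ExampleOffice Suite", "SOFTWARE"),
]

for token, tag in spans_to_bio(sentence, annotations):
    print(f"{token}\t{tag}")
```

Whitespace tokenization is used here only to keep the sketch dependency-free; a framework such as spaCy applies its own tokenizer, so token boundaries may differ in practice.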
