CyBERT: Contextualized Embeddings for the Cybersecurity Domain

dc.contributor.author: Ranade, Priyanka
dc.contributor.author: Piplai, Aritran
dc.contributor.author: Joshi, Anupam
dc.contributor.author: Finin, Tim
dc.date.accessioned: 2022-08-18T22:33:17Z
dc.date.available: 2022-08-18T22:33:17Z
dc.date.issued: 2022-01-13
dc.description: 2021 IEEE International Conference on Big Data (Big Data), 15-18 December 2021, Orlando, FL, USA [en_US]
dc.description.abstract: We present CyBERT, a domain-specific Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned with a large corpus of textual cybersecurity data. State-of-the-art natural language models that can process dense, fine-grained textual threat, attack, and vulnerability information can provide numerous benefits to the cybersecurity community. The primary contribution of this paper is providing the security community with an initial fine-tuned BERT model that can perform a variety of cybersecurity-specific downstream tasks with high accuracy and efficient use of resources. We create a cybersecurity corpus from open-source unstructured and semi-unstructured Cyber Threat Intelligence (CTI) data and use it to fine-tune a base BERT model with Masked Language Modeling (MLM) to recognize specialized cybersecurity entities. We evaluate the model using various downstream tasks that can benefit modern Security Operations Centers (SOCs). The fine-tuned CyBERT model outperforms the base BERT model in the domain-specific MLM evaluation. We also provide use-cases of CyBERT applications in cybersecurity-based downstream tasks. [en_US]
dc.description.sponsorship: This material is based upon work supported by a grant from NSA and from the National Science Foundation Grant No. 2114892. [en_US]
dc.description.uri: https://ieeexplore.ieee.org/document/9671824 [en_US]
dc.format.extent: 9 pages [en_US]
dc.genre: conference papers and proceedings [en_US]
dc.genre: preprints [en_US]
dc.identifier: doi:10.13016/m2mnkf-02uq
dc.identifier.citation: P. Ranade, A. Piplai, A. Joshi and T. Finin, "CyBERT: Contextualized Embeddings for the Cybersecurity Domain," 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 3334-3342, doi: 10.1109/BigData52589.2021.9671824 [en_US]
dc.identifier.uri: https://doi.org/10.1109/BigData52589.2021.9671824
dc.identifier.uri: http://hdl.handle.net/11603/25498
dc.language.iso: en_US [en_US]
dc.publisher: IEEE [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. [en_US]
dc.subject: UMBC Ebiquity Research Group [en_US]
dc.title: CyBERT: Contextualized Embeddings for the Cybersecurity Domain [en_US]
dc.type: Text [en_US]
dcterms.creator: https://orcid.org/0000-0002-6437-1324 [en_US]
dcterms.creator: https://orcid.org/0000-0002-8641-3193 [en_US]
dcterms.creator: https://orcid.org/0000-0002-6593-1792 [en_US]

Files

Original bundle

Name: 1117.pdf
Size: 335.35 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission