Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

Roy, Arpita; Park, Youngja; Pan, Shimei

dc.contributor.author	Roy, Arpita
dc.contributor.author	Park, Youngja
dc.contributor.author	Pan, Shimei
dc.date.accessioned	2025-01-08T15:08:53Z
dc.date.available	2025-01-08T15:08:53Z
dc.date.issued	2017-09-21
dc.description.abstract	Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse, as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerability and Exposure (CVE) corpus. Our evaluation results have demonstrated the effectiveness of our method in learning domain-specific word embeddings.
dc.description.uri	http://arxiv.org/abs/1709.07470
dc.format.extent	8 pages
dc.genre	journal articles
dc.genre	preprints
dc.identifier	doi:10.13016/m2ak0r-n4la
dc.identifier.uri	https://doi.org/10.48550/arXiv.1709.07470
dc.identifier.uri	http://hdl.handle.net/11603/37201
dc.language.iso	en_US
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Information Systems Department
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Student Collection
dc.rights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject	Computer Science - Computation and Language
dc.title	Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
dc.type	Text
dcterms.creator	https://orcid.org/0000-0002-5989-8543

Files

Original bundle

Name: 1709.07470v1.pdf
Size: 918.24 KB
Format: Adobe Portable Document Format
