Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

dc.contributor.authorRoy, Arpita
dc.contributor.authorPark, Youngja
dc.contributor.authorPan, Shimei
dc.date.accessioned2025-01-08T15:08:53Z
dc.date.available2025-01-08T15:08:53Z
dc.date.issued2017-09-21
dc.description.abstractWord embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerability and Exposure (CVE) corpus. Our evaluation results have demonstrated the effectiveness of our method in learning domain-specific word embeddings.
dc.description.urihttp://arxiv.org/abs/1709.07470
dc.format.extent8 pages
dc.genrejournal articles
dc.genrepreprints
dc.identifierdoi:10.13016/m2ak0r-n4la
dc.identifier.urihttps://doi.org/10.48550/arXiv.1709.07470
dc.identifier.urihttp://hdl.handle.net/11603/37201
dc.language.isoen_US
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Information Systems Department
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Student Collection
dc.rightsThis item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subjectComputer Science - Computation and Language
dc.titleLearning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
dc.typeText
dcterms.creatorhttps://orcid.org/0000-0002-5989-8543

Files

Original bundle

Name: 1709.07470v1.pdf
Size: 918.24 KB
Format: Adobe Portable Document Format