Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
dc.contributor.author | Roy, Arpita | |
dc.contributor.author | Park, Youngja | |
dc.contributor.author | Pan, Shimei | |
dc.date.accessioned | 2025-01-08T15:08:53Z | |
dc.date.available | 2025-01-08T15:08:53Z | |
dc.date.issued | 2017-09-21 | |
dc.description.abstract | Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerability and Exposure (CVE) corpus. Our evaluation results have demonstrated the effectiveness of our method in learning domain-specific word embeddings. | |
dc.description.uri | http://arxiv.org/abs/1709.07470 | |
dc.format.extent | 8 pages | |
dc.genre | journal articles | |
dc.genre | preprints | |
dc.identifier | doi:10.13016/m2ak0r-n4la | |
dc.identifier.uri | https://doi.org/10.48550/arXiv.1709.07470 | |
dc.identifier.uri | http://hdl.handle.net/11603/37201 | |
dc.language.iso | en_US | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Information Systems Department | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
dc.subject | Computer Science - Computation and Language | |
dc.title | Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts | |
dc.type | Text | |
dcterms.creator | https://orcid.org/0000-0002-5989-8543 |