SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection
Date
2022-08-21
Rights
This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
Public Domain Mark 1.0
Abstract
As the amount of text data continues to grow, topic modeling plays an important role in understanding the content hidden within an overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, whose values represent the number of times two words co-occur within a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. In contrast to SeNMFk, our method enables the joint factorization of large document collections by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded to arXiv.