SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection

dc.contributor.author: Eren, Maksim E.
dc.contributor.author: Solovyev, Nick
dc.contributor.author: Bhattarai, Manish
dc.contributor.author: Rasmussen, Kim
dc.contributor.author: Nicholas, Charles
dc.contributor.author: Alexandrov, Boian S.
dc.date.accessioned: 2022-10-07T14:49:50Z
dc.date.available: 2022-10-07T14:49:50Z
dc.date.issued: 2022-08-21
dc.description.abstract: As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Contrary to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv. [en]
dc.description.sponsorship: This manuscript has been approved for unlimited release and has been assigned LA-UR-22-26571. This research was funded by the Los Alamos National Laboratory (LANL) Laboratory Directed Research and Development (LDRD) grant 20190020DR and the LANL Institutional Computing Program, supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. [en]
dc.description.uri: https://arxiv.org/abs/2208.09942 [en]
dc.format.extent: 5 pages [en]
dc.genre: journal articles [en]
dc.genre: preprints [en]
dc.identifier: doi:10.13016/m24yvf-yxrh
dc.identifier.uri: https://doi.org/10.48550/arXiv.2208.09942
dc.identifier.uri: http://hdl.handle.net/11603/26114
dc.language.iso: en [en]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: Public Domain Mark 1.0 [*]
dc.rights: This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law. [en]
dc.rights.uri: http://creativecommons.org/publicdomain/mark/1.0/ [*]
dc.title: SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection [en]
dc.type: Text [en]
dcterms.creator: https://orcid.org/0000-0001-9494-7139 [en]

Files

Original bundle

Name: 2208.09942.pdf
Size: 2.56 MB
Format: Adobe Portable Document Format
