Unsupervised Thematic Phrase Extraction From Single Text Artifacts

Author/Creator

Author/Creator ORCID

Date

2021-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Abstract

Automated knowledge discovery is central to augmenting how humans acquire and elicit knowledge from vast amounts of content. Precise and concise representations, both structured and semi-structured, of the knowledge contained in textual content have the potential to boost human productivity. They can also reduce, if not eliminate, the human error and bias that arise when people retrieve and curate knowledge from vast collections of content for subsequent knowledge-based tasks. Conventionally, knowledge discovery in text (KDT) approaches build domain knowledge by processing large collections of text documents and then use that acquired knowledge to guide the processing of individual documents. Consequently, these approaches are blind to the finer topical features of an individual document, because topic models that infer topicality in the context of the whole corpus abstract those features away. What is needed is an unsupervised method that extracts topical or thematic phrases from a single text document without access to entire collections of texts or to background domain or language dictionaries and thesauri. Further, the method should not abstract away the fine-grained thematic phrases contained in the document, so that it can be applied to hierarchical knowledge representation and to downstream document-level text analytics tasks. This work describes ThemaPhrase (ThP), a novel framework for unsupervised extraction of thematic phrases from single text artifacts. The framework operates without corpus-wide statistics or external domain knowledge, which makes it domain agnostic. ThP configurations are more robust than competing methods to variation in the topic-to-partitions ratio and in the average token occurrence frequency of a document.
This work identifies ThemaPhrase configurations that outperform competing methods in extracting thematic phrases that represent the topicality of a document at varied granularities. It further shows that pre-filtering sentences based on thematic phrases and thematic words helps improve extractive summarization for texts, such as patents, that have relatively high token occurrence frequencies and on which the baseline TextRank summarizer underperforms. ThemaPhrase configurations that outperform competing thematic phrase extraction methods in extractive summarization with sentence pre-filtering are discussed.