Unsupervised Thematic Phrase Extraction From Single Text Artifacts

dc.contributor.advisor: Oates, Tim
dc.contributor.author: Peshave, Akshay
dc.contributor.department: Computer Science and Electrical Engineering
dc.contributor.program: Computer Science
dc.date.accessioned: 2022-09-29T15:37:50Z
dc.date.available: 2022-09-29T15:37:50Z
dc.date.issued: 2021-01-01
dc.description.abstract: Automated knowledge discovery is central to augmenting how humans acquire and elicit knowledge from vast amounts of content. Precise and concise representations, both structured and semi-structured, of the knowledge contained in textual content have the potential to boost human productivity. Further, they can reduce, if not eliminate, human error and bias when people retrieve and curate knowledge from vast collections of content for subsequent knowledge-based tasks. Conventionally, knowledge discovery in text (KDT) approaches and paradigms have been designed to build domain knowledge by processing large collections of text documents, and then to apply that acquired domain knowledge as guidance when processing individual documents. Consequently, these approaches are blind to the finer topical features of an individual document, because these features are abstracted away by topic models that infer topicality in the context of the whole corpus. We need an unsupervised method that extracts topical or thematic phrases from a single text document without access to entire collections of texts, background domain knowledge, or language dictionaries and thesauri. Further, the method should not abstract away the fine-grained thematic phrases contained in the document, thus enabling its application to hierarchical knowledge representation and downstream document-level text analytics tasks. This work describes ThemaPhrase (ThP), a novel framework for unsupervised extraction of thematic phrases from single text artifacts. The framework operates without corpus-wide statistics or external domain knowledge, which makes it domain-agnostic. ThP configurations are more robust than competing methods to the topic-to-partitions ratio and to varying average token occurrence frequencies in a document.
Different configurations of ThemaPhrase are identified that outperform competing methods in extracting thematic phrases that represent the topicality of a document at varied granularities. Further, this work shows that sentence pre-filtering based on thematic phrases and thematic words helps improve extractive summarization for texts, such as patents, that have relatively higher token occurrence frequencies and on which the baseline TextRank summarizer underperforms. ThemaPhrase configurations that outperform competing thematic phrase extraction methods in extractive summarization using sentence pre-filtering are discussed.
dc.format: application:pdf
dc.genre: dissertations
dc.identifier: doi:10.13016/m2iwkr-eutb
dc.identifier.other: 12476
dc.identifier.uri: http://hdl.handle.net/11603/25972
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source: Original File Name: Peshave_umbc_0434D_12476.pdf
dc.subject: Knowledge Discovery in Text
dc.subject: Text Summarization
dc.subject: ThemaPhrase
dc.subject: Theme Representation
dc.subject: Topical Phrase Extraction
dc.title: Unsupervised Thematic Phrase Extraction From Single Text Artifacts
dc.type: Text
dcterms.accessRights: Distribution Rights granted to UMBC by the author.
dcterms.accessRights: Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Files

Original bundle
Name: Peshave_umbc_0434D_12476.pdf
Size: 5.25 MB
Format: Adobe Portable Document Format
License bundle
Name: Peshave-Akshay_Open.pdf
Size: 698.65 KB
Format: Adobe Portable Document Format