Understanding the Logical and Semantic Structure of Large Documents

dc.contributor.advisorFinin, Tim
dc.contributor.authorRahman, Muhammad Mahbubur
dc.contributor.departmentComputer Science and Electrical Engineering
dc.contributor.programComputer Science
dc.date.accessioned2021-01-29T18:12:38Z
dc.date.available2021-01-29T18:12:38Z
dc.date.issued2018-01-01
dc.description.abstractCurrent language understanding approaches focus mostly on small documents, such as newswire articles, blog posts, and product reviews. Understanding and extracting information from large documents like legal documents, reports, proposals, technical manuals, and research articles is still a challenging task, because such documents may be multi-themed and complex and may cover diverse topics. The content can be split into multiple files or aggregated into one large file. As a result, the content of the whole document may have different structures and formats. Furthermore, the information is expressed in different forms, such as paragraphs, headers, tables, images, mathematical equations, or a nested combination of these structures. Identifying a document's logical sections and organizing them into a standard structure to understand the semantic structure of a document will not only help many information extraction applications, but also enable users to quickly navigate to sections of interest. Such an understanding of a document's structure will significantly benefit and facilitate a variety of applications, such as information extraction, document summarization, and question answering. We intend to section large and complex PDF documents automatically and annotate each section with a semantic, human-understandable label. Our semantic labels are intended to capture the general-purpose and domain-specific semantics of a large document. In a nutshell, we aim to automatically identify and classify semantic sections of documents and assign human-understandable, consistent labels to them. We developed powerful, yet simple, approaches to build our framework using layout information and text contents extracted from documents, such as scholarly articles and RFP documents. The framework has four units: Pre-processing Unit, Annotation Unit, Classification Unit, and Semantic Annotation Unit.
We developed state-of-the-art machine learning and deep learning architectures. We also explored and experimented with Latent Dirichlet Allocation (LDA), TextRank, and TensorFlow Textsum models for semantic concept identification and document summarization, respectively. We mapped each section to a semantic name using a document ontology. We aimed to develop a generic and domain-independent framework. We used scholarly articles from the arXiv repository and RFP documents from RedShred. We evaluated the performance of our framework using different evaluation metrics, such as precision, recall, and F1-score. We also analyzed and visualized the results in the embedding space. We made available a dataset of information about a collection of scholarly articles from the arXiv e-prints that includes a wide range of metadata for each article, including a TOC, section labels, section summarizations, and more.
dc.formatapplication:pdf
dc.genredissertations
dc.identifierdoi:10.13016/m2hgey-ijlr
dc.identifier.other11895
dc.identifier.urihttp://hdl.handle.net/11603/20728
dc.languageen
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartofUMBC Theses and Dissertations Collection
dc.relation.ispartofUMBC Graduate School Collection
dc.relation.ispartofUMBC Student Collection
dc.sourceOriginal File Name: Rahman_umbc_0434D_11895.pdf
dc.subjectDeep Learning
dc.subjectDocument Structure
dc.subjectInformation Retrieval
dc.subjectMachine Learning
dc.subjectNatural Language Processing
dc.subjectNatural Language Understanding
dc.titleUnderstanding the Logical and Semantic Structure of Large Documents
dc.typeText
dcterms.accessRightsDistribution Rights granted to UMBC by the author.
dcterms.accessRightsAccess limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
dcterms.accessRightsThis item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Files

Original bundle

Name:
Rahman_umbc_0434D_11895.pdf
Size:
5.04 MB
Format:
Adobe Portable Document Format

License bundle

Name:
RahmanMUnderstanding_Open.pdf
Size:
41.14 KB
Format:
Adobe Portable Document Format