Understanding the Logical and Semantic Structure of Large Documents

Rahman, Muhammad Mahbubur

dc.contributor.advisor	Finin, Tim
dc.contributor.author	Rahman, Muhammad Mahbubur
dc.contributor.department	Computer Science and Electrical Engineering
dc.contributor.program	Computer Science
dc.date.accessioned	2021-01-29T18:12:38Z
dc.date.available	2021-01-29T18:12:38Z
dc.date.issued	2018-01-01
dc.description.abstract	Current language understanding approaches focus mostly on small documents, such as newswire articles, blog posts, and product reviews. Understanding and extracting information from large documents, such as legal documents, reports, proposals, technical manuals, and research articles, remains challenging because these documents may be multi-themed, complex, and cover diverse topics. The content can be split into multiple files or aggregated into one large file. As a result, the content of the whole document may have different structures and formats. Furthermore, the information is expressed in different forms, such as paragraphs, headers, tables, images, mathematical equations, or a nested combination of these structures. Identifying a document's logical sections and organizing them into a standard structure that captures the document's semantic structure will not only help many information extraction applications but also enable users to navigate quickly to sections of interest. Such an understanding of a document's structure will benefit a variety of applications, such as information extraction, document summarization, and question answering. We intend to section large and complex PDF documents automatically and annotate each section with a semantic, human-understandable label. Our semantic labels are intended to capture the general-purpose and domain-specific semantics of a large document. In a nutshell, we aim to automatically identify and classify semantic sections of documents and assign human-understandable, consistent labels to them. We developed powerful yet simple approaches to build our framework using layout information and text content extracted from documents, such as scholarly articles and RFP documents. The framework has four units: the Pre-processing Unit, Annotation Unit, Classification Unit, and Semantic Annotation Unit.
We developed state-of-the-art machine learning and deep learning architectures. We also experimented with Latent Dirichlet Allocation (LDA) for semantic concept identification and with the TextRank and TensorFlow Textsum models for document summarization. We mapped each section to a semantic name using a document ontology. We aimed to develop a generic, domain-independent framework. We used scholarly articles from the arXiv repository and RFP documents from RedShred. We evaluated the performance of our framework using different evaluation metrics, such as precision, recall, and F1 score. We also analyzed and visualized the results in the embedding space. We made available a dataset describing a collection of scholarly articles from arXiv e-prints, with a wide range of metadata for each article, including a TOC, section labels, section summarizations, and more.
dc.format	application:pdf
dc.genre	dissertations
dc.identifier	doi:10.13016/m2hgey-ijlr
dc.identifier.other	11895
dc.identifier.uri	http://hdl.handle.net/11603/20728
dc.language	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof	UMBC Theses and Dissertations Collection
dc.relation.ispartof	UMBC Graduate School Collection
dc.relation.ispartof	UMBC Student Collection
dc.source	Original File Name: Rahman_umbc_0434D_11895.pdf
dc.subject	Deep Learning
dc.subject	Document Structure
dc.subject	Information Retrieval
dc.subject	Machine Learning
dc.subject	Natural Language Processing
dc.subject	Natural Language Understanding
dc.title	Understanding the Logical and Semantic Structure of Large Documents
dc.type	Text
dcterms.accessRights	Distribution Rights granted to UMBC by the author.
dcterms.accessRights	Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
dcterms.accessRights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Files

Original bundle

Name: Rahman_umbc_0434D_11895.pdf
Size: 5.04 MB
Format: Adobe Portable Document Format

License bundle

Name: RahmanMUnderstanding_Open.pdf
Size: 41.14 KB
Format: Adobe Portable Document Format

Collections

UMBC Theses and Dissertations
UMBC Computer Science and Electrical Engineering Department
UMBC Graduate School
UMBC Student Collection