Dynamic Data Assimilation for Topic Modeling (DDATM)

Date

2017-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Understanding how a particular discipline such as climate science evolves over time has received renewed interest. Understanding this evolution makes predicting the future direction of the discipline more achievable. Dynamic Topic Modeling (DTM) has been applied to a number of disciplines to model topic evolution as a means to learn how a particular scientific discipline and its underlying concepts are changing. Understanding how a discipline evolves, and its internal and external influences, can be complicated by how the information retrieved over time is integrated. Different techniques are used to integrate sources of information; however, less research has been dedicated to understanding how to integrate these sources over time. Data assimilation is commonly used in a number of scientific disciplines to both understand and make predictions of various phenomena, using numerical models and assimilated observational data over time. This dissertation introduces a novel algorithm for scientific data assimilation, called Dynamic Data Assimilation for Topic Modeling (DDATM), which uses a new cross-domain divergence method and DTM. By using DDATM, observational data, in the form of full-text research papers, can be assimilated over time starting from an initial model. DDATM can be used as a way to integrate data from multiple sources and, due to its robustness, can exploit the assimilated observational information to better tolerate missing model information. When compared with a model built using DTM, the DDATM model produces topics with better characteristics, particularly for scientific data. The DDATM method is suitable for prediction and results in higher likelihood for subsequent documents. Compared with a DTM model, DDATM is also able to overcome missing information during the assimilation process.
The cross-domain divergence method generalizes as a method that can also bring together multiple disciplines into one cohesive model, enabling the identification of related concepts and documents across disciplines and time periods. Finally, grounding the topic modeling process with an ontology improves the quality of the topics for scientific data and enables a more refined understanding of concept relatedness and cross-domain influence. The results of this dissertation are demonstrated and evaluated by applying DDATM to 30 years of reports from the Intergovernmental Panel on Climate Change (IPCC), along with more than 150,000 documents that they cite, to show the evolution of the physical basis of climate change.