From Latent Knowledge Gathering to Side Information Injection in Discrete Sequential Models
Links to Files
Permanent Link
Author/Creator
Author/Creator ORCID
Date
Type of Work
Department
Computer Science and Electrical Engineering
Program
Engineering, Electrical
Citation of Original Publication
Rights
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Abstract
Representation learning aims at extracting relevant information from data to represent the input
in a way that is sufficient for performing a task. This problem is especially difficult when the data
under consideration is both sequential and discrete, as in natural language processing (NLP).
From classical methods like topic modeling to modern transformer-based architectures, one seeks
to utilize the available information from data or transferable knowledge to learn richer
representations. To that end, current state-of-the-art models rely on two major strategies:
a) Latent Knowledge Gathering, where we encourage a model to recognize semantic and
thematically relevant knowledge contained within the training data; methods include clustering
techniques such as topic modeling and document classification. b) Injecting Background
Information, where the goal is to exploit structural or representational priors, such as pretrained
models or word embeddings, to facilitate the training phase. Irrespective of the architecture or
task, the training process invariably begins with the encoding of high-dimensional documents
into more manageable, low-dimensional latent representations. We advocate for these
representations to be optimized to capture and utilize more pertinent information, enhancing
their efficacy in various language-based tasks. Considering document classification as an
example of semantic analysis, both the encoder and decoder are vital in extracting essential
information from inputs, especially when dealing with limited training data. Our extensive
experiments assess the capabilities of models across various data regimes, highlighting the
importance of efficient representation in handling the situation entity classification task.
In thematic analysis, despite notable advancements, many previous studies have overlooked the
extraction of valuable word-level information, such as latent thematic topics pertinent to each
word. Additionally, the use of auxiliary knowledge has often been confined to basic applications
like weight initialization. Some methods have simplified the process by merely appending
external knowledge to the input. Nonetheless, the effective utilization of information, whether
derived directly from the data or leveraged from background knowledge, remains a critical factor
in document representation. It is essential to ensure that the process of information gathering
does not compromise the richness of the original data.
First, we offer a novel lightweight unsupervised design that shows how to use topic models in
conjunction with recurrent neural networks (RNNs) with minimal word-level information loss.
Our approach maintains and uses lower-level representations that previous approaches had
discarded, and then it gathers and provides that information to a natural language generation
model. We conduct extensive experiments to compare the efficiency of the proposed model with
previously proposed architectures. The results demonstrate that retaining and exploiting word
topic assignments, previously overlooked, leads to new state-of-the-art performance in thematic
analysis.
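As an illustrative sketch of this kind of design (not the dissertation's actual implementation), one
way to retain word-level topic information is to concatenate each token's embedding with an
embedding of its latent topic assignment (e.g., from Gibbs-sampled LDA) before it enters the RNN
generator. The class name, dimensions, and use of PyTorch below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TopicAwareRNNLM(nn.Module):
    """Illustrative sketch: an RNN language model whose per-token input
    concatenates the word embedding with an embedding of that token's
    latent topic assignment, so word-level topic information is retained."""

    def __init__(self, vocab_size, num_topics, word_dim=128, topic_dim=32, hidden_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.topic_emb = nn.Embedding(num_topics, topic_dim)  # per-word topic assignments
        self.rnn = nn.GRU(word_dim + topic_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, topic_ids):
        # word_ids, topic_ids: (batch, seq_len) integer tensors
        x = torch.cat([self.word_emb(word_ids), self.topic_emb(topic_ids)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # next-token logits at every position
```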
Second, we consider how background (or side) knowledge can be used to guide model and
representation learning of text. This side knowledge can itself be structured, and may often be
given categorically. However, the sources of side knowledge can be incomplete, meaning that the
side knowledge may be structured but only partially observed. This poses challenges for learning.
To handle this, we first focus on partially observed side knowledge. We propose
using a structured, discrete, semi-supervised variational autoencoder framework, which uses
provided side knowledge to represent the original input text. This method is intricately designed
to use the partially observed knowledge as a guiding tool, without imposing limitations on the
training phase. We show that our approach can robustly handle varying levels of side knowledge
observation, and leads to consistent performance gains across multiple language modeling and
classification metrics.
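To make the semi-supervised setup concrete, the sketch below shows how a discrete-label VAE
objective can accommodate partially observed side knowledge, in the spirit of Kingma et al.'s
semi-supervised M2 model: observed labels contribute a supervised term, while missing labels are
marginalized under the inferred q(y|x). The function name and tensor layout are assumptions for
illustration; this is not the dissertation's exact objective.

```python
import torch
import torch.nn.functional as F

def semi_supervised_elbo(logits_y, elbo_per_y, y_observed):
    """Illustrative objective for a discrete semi-supervised VAE.

    logits_y:   (batch, K) classifier logits q(y | x) over K side-knowledge classes
    elbo_per_y: (batch, K) ELBO of the input evaluated with each candidate class y
    y_observed: (batch,)   observed class indices, or -1 where side knowledge is missing
    """
    log_q_y = F.log_softmax(logits_y, dim=-1)
    q_y = log_q_y.exp()

    # Missing side knowledge: marginalize the ELBO over y under q(y|x)
    # and add the entropy of q(y|x) (standard unlabeled term).
    unlabeled = (q_y * (elbo_per_y - log_q_y)).sum(dim=-1)

    # Observed side knowledge: plug in the given y and add a classification term.
    idx = y_observed.clamp(min=0).unsqueeze(1)
    labeled = elbo_per_y.gather(1, idx).squeeze(1) + log_q_y.gather(1, idx).squeeze(1)

    objective = torch.where(y_observed >= 0, labeled, unlabeled)
    return -objective.mean()  # negate: we minimize the loss
```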
Finally, we delve into scenarios where side knowledge is not just incomplete but also
contains noise. In this context, we introduce a universal framework for integrating discrete
information, based on the information bottleneck principle. This framework involves a thorough
theoretical exploration of how side information can be integrated into model parameters. Our
extensive theoretical analysis and empirical studies, including a case study on event modeling,
show that our approach not only extends and refines previous methods but also significantly
enhances performance. The proposed framework lays a robust theoretical groundwork for future
research in this domain.
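As one concrete instance of the information bottleneck principle mentioned above, the sketch
below shows a variational IB-style loss in the form popularized by Alemi et al. (2017): a prediction
term tied to I(Z;Y) plus a beta-weighted KL term that upper-bounds I(X;Z) in expectation. The
dissertation's framework for injecting discrete side information builds on this principle but is not
reproduced here; the function name and Gaussian encoder are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def vib_loss(task_logits, targets, mu, logvar, beta=1e-3):
    """Illustrative variational information bottleneck objective.

    task_logits: (batch, num_classes) predictions made from the stochastic code z
    targets:     (batch,)             ground-truth class indices
    mu, logvar:  (batch, z_dim)       parameters of the Gaussian encoder q(z|x)
    beta:        trade-off between prediction and compression
    """
    # Prediction term: encourages z to retain information about the target.
    ce = F.cross_entropy(task_logits, targets)
    # Compression term: KL[q(z|x) || N(0, I)] upper-bounds (in expectation) I(X; Z).
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1)
    return ce + beta * kl.mean()
```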
