Supervised Training Strategies for Low Resource Language Processing

Computer Science and Electrical Engineering



Over the last decade, we have witnessed an explosion in Artificial Intelligence (AI) research with a focus on deep neural networks (DNNs). Since Krizhevsky et al. (2017) proposed a convolutional neural network (CNN) architecture for the ImageNet task (Deng et al., 2009), deep neural networks have become the default model of choice for many computer vision and natural language processing tasks. The architecture showcases an important property: a modular composite function model (built from different layers and operations) can easily be scaled up to a large dataset. This has led to a generation of deep neural models built on easy access to image and textual data. But this approach to constructing neural networks has two important limitations. First, as the number of parameters in a model grows, the GPU compute required for training and inference grows prohibitively as well. Second, as the universe of natural language processing tasks expands, the cognitive complexity of the tasks increases too, and the cost of collecting good-quality annotations for textual data becomes a barrier to building better models. A common remedy is to train very large unsupervised models on the huge textual corpora available on the web and then transfer them to other tasks. In our work, we study each of these challenges and propose approaches to alleviate them, focusing on the design of models that are data- and hardware-efficient.

Our work has three main contributions. First, we study methods to efficiently utilize existing datasets by exploiting the inherent relationships between samples. We propose a locality preserving alignment algorithm that learns the local manifold structure surrounding a datapoint in embedding space and then aligns two manifolds while preserving this structure.
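The neighborhood-expansion idea can be sketched roughly as follows. This is a minimal NumPy illustration, not the dissertation's actual algorithm: the names `knn_indices` and `pseudo_pairs`, the linear alignment map `W`, and the plain Euclidean k-nearest-neighbour rule are all illustrative assumptions standing in for the learned local manifold structure.

```python
import numpy as np

def knn_indices(X, i, k):
    # Euclidean k-nearest neighbours of point i within embedding matrix X
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the point itself
    return np.argsort(d)[:k]

def pseudo_pairs(X_src, W, labeled_idx, k=3):
    # For each labeled source point, project its unlabeled neighbours
    # through the (assumed linear) alignment W to create pseudo
    # source-target pairs for additional training.
    labeled = set(labeled_idx)
    pairs = []
    for i in labeled_idx:
        for j in knn_indices(X_src, i, k):
            if j not in labeled:
                pairs.append((j, X_src[j] @ W))
    return pairs
```

The key point the sketch captures is that unlabeled points inherit a target-side representation purely by virtue of lying near a labeled point on the source manifold.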
Points that lack a target label but lie in the neighborhood of a labeled datapoint can thus be mapped into the target domain as well. This augments a given dataset with pseudo text-label pairs that can be used for additional model training.

Second, most current-generation models deployed for NLP tasks (apart from autoencoders) are designed with a unidirectional flow from source to target. We propose a bidirectional manifold alignment (BDMA) method that trains a single model to perform both forward and reverse mapping. The model is optimized with a cycle consistency loss inspired by Zhu et al. (2017)'s research on CycleGANs. We show the effectiveness of this approach on the crosslingual word alignment (CLWA) task and how it improves hardware efficiency by reducing the number of models deployed.

Lastly, to reduce model size, we propose a model architecture that infers labels with holographic reduced representations (HRRs). HRRs provide the ability to compose and decompose embeddings. In an eXtreme Multi-Label (XML) setting, where the label set is very large, we show how a model's output layer can be compressed by replacing it with a multi-label embedding that can be decomposed into its primary constituents. We show that the new HRR-based model has precision equivalent to the standard model.
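As a rough illustration of the cycle consistency idea, the sketch below assumes, purely for exposition, a single linear map `W` whose transpose serves as the reverse mapping; the actual BDMA model and its loss are not reproduced here. The loss penalizes any point that fails to return to its starting position after a round trip in either direction.

```python
import numpy as np

def cycle_consistency_loss(X, Y, W):
    # Forward cycle: X -> target via W, back via W^T, compare to X.
    # Backward cycle: Y -> source via W^T, back via W, compare to Y.
    # (Assumes W^T approximates the inverse map, as for orthogonal W.)
    fwd = np.mean(np.sum((X @ W @ W.T - X) ** 2, axis=1))
    bwd = np.mean(np.sum((Y @ W.T @ W - Y) ** 2, axis=1))
    return fwd + bwd
```

For an exactly orthogonal `W` the loss vanishes; in training, minimizing it pushes a single model toward consistent forward and reverse mappings.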
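The compose/decompose property of HRRs can be illustrated with circular convolution, the standard HRR binding operation (Plate's formulation), computed here via FFTs. The vector dimensions and the Gaussian initialization are illustrative; decomposition by circular correlation recovers a constituent only approximately, up to noise that shrinks as the dimension grows.

```python
import numpy as np

def bind(a, b):
    # Circular convolution: composes two embeddings into a single
    # vector of the same dimension.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # Circular correlation with a approximately recovers b from c = bind(a, b).
    return np.real(np.fft.ifft(np.fft.fft(c) * np.conj(np.fft.fft(a))))
```

Because a sum of bound label vectors can later be probed for each constituent, a wide multi-label output layer can be replaced by a single fixed-size composed embedding, which is the compression the XML contribution exploits.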