Towards Learning A Better Text Encoder

Date

2020-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Abstract

Encoding raw text into a machine-comprehensible representation while preserving useful information has long been a popular research area. Despite their success, traditional supervised text encoders demand human annotation, which is often expensive, inefficient, and sometimes infeasible to obtain. As a result, they are usually limited in both performance and generalization, especially deep learning approaches, which typically perform better with large quantities of data. Recent work has shown that language models pre-trained on large-scale corpora can serve as the basis for many downstream tasks and significantly improve the performance of the corresponding fine-tuned models. However, deep neural networks are still regarded as black boxes and thus often lack interpretability. Moreover, deep learning approaches are often vulnerable to adversarial perturbations of the input, perturbations that are usually imperceptible to humans or do not change the semantic meaning of the input. On the other hand, despite the utility of transfer learning from pre-trained language models, many downstream tasks still require a relatively large amount of labeled data to achieve the expected performance, which is often expensive or impractical in the real world. Self-supervised methods have therefore been proposed to address this problem, many of which rely on various data augmentation techniques. In this thesis, we propose to learn better text encoders from three directions: (1) we design a neural network architecture capable of approximating a larger set of functions; (2) we propose several algorithms that generate adversarial examples for text encoders and fine-tune the models on these examples to improve their robustness; and (3) we introduce a new data augmentation algorithm to enlarge the corpus when labeled data is limited. The proposed methods were evaluated on various downstream tasks against many baseline models and algorithms, including language modeling, sentiment classification, and semantic relatedness prediction; some were also evaluated for time efficiency. The experimental results show that: (1) the proposed architecture improves performance on various downstream NLP tasks without significantly increasing the number of parameters; (2) models adversarially trained with the proposed algorithm retain nearly the same performance on adversarial examples as on natural examples; and (3) the artificial data generated by the proposed augmentation technique significantly improves performance when labeled training data is very limited.
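
The abstract does not describe the proposed augmentation algorithm itself, so the following is only a minimal, illustrative sketch of the general idea of enlarging a small labeled corpus with label-preserving perturbations. All function names (augment, enlarge_corpus) and the specific perturbations used (random adjacent swaps and random word deletions) are assumptions for illustration, not the method proposed in the thesis.

import random

def augment(tokens, p_delete=0.1, n_swaps=1, seed=None):
    """Return a perturbed copy of a token list: a few random adjacent
    swaps plus random deletions, leaving most of the sentence intact."""
    rng = random.Random(seed)
    out = list(tokens)
    # random adjacent swaps
    for _ in range(n_swaps):
        if len(out) > 1:
            i = rng.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]
    # random deletions, but never drop every token
    kept = [t for t in out if rng.random() > p_delete]
    return kept if kept else [rng.choice(out)]

def enlarge_corpus(labeled, copies=4):
    """Expand a small labeled corpus by pairing each original example
    with several perturbed copies that keep the original label."""
    augmented = []
    for tokens, label in labeled:
        augmented.append((tokens, label))
        for k in range(copies):
            augmented.append((augment(tokens, seed=k), label))
    return augmented

if __name__ == "__main__":
    corpus = [(["the", "movie", "was", "surprisingly", "good"], "pos")]
    for example in enlarge_corpus(corpus, copies=2):
        print(example)

In this toy setup, each labeled sentence yields several lightly perturbed variants with the same label, which is one common way to stretch a very small training set; the thesis's actual augmentation and adversarial-example algorithms are described in the body of the work.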