Towards Learning A Better Text Encoder

dc.contributor.advisor: Oates, Tim
dc.contributor.author: Gao, Hang
dc.contributor.department: Computer Science and Electrical Engineering
dc.contributor.program: Computer Science
dc.date.accessioned: 2022-02-09T15:52:32Z
dc.date.available: 2022-02-09T15:52:32Z
dc.date.issued: 2020-01-01
dc.description.abstract: Encoding raw text into a machine-comprehensible representation while preserving useful information has long been a popular research area. Despite their success, traditional supervised text encoders demand human annotation, which is often expensive, inefficient, and sometimes infeasible to obtain. As a result, they are usually limited in both performance and generalization, especially deep learning approaches, which are widely known to perform better with large quantities of data. Recent work has shown that language models pre-trained on large-scale corpora can serve as the basis for many downstream tasks and significantly improve the performance of the corresponding fine-tuned models. However, deep neural networks are still considered black boxes and thus often lack interpretability. Moreover, deep learning approaches are often vulnerable to adversarial perturbations of the input: perturbations that are usually imperceptible to human eyes or that do not change the semantic meaning of the input. On the other hand, despite the utility of transfer learning from pre-trained language models, many downstream tasks still require a relatively large amount of labeled data to achieve the expected performance, which is often expensive or impractical in the real world. Self-supervised methods have therefore been proposed to address this problem, many of which rely on various data augmentation techniques. In this thesis, we propose to learn better text encoders along three directions: (1) we design a neural network with a better architecture, capable of approximating a larger set of functions; (2) we propose several algorithms to generate adversarial examples for text encoders and fine-tune the models on these samples to improve their robustness; (3) we introduce a new data augmentation algorithm to enlarge the corpus when labeled data is limited.
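The abstract describes adversarial perturbations that change a text's surface form without changing its semantic meaning. As a minimal illustrative sketch only (the synonym table and the substitution rule below are hypothetical stand-ins, not the algorithms proposed in the thesis), a meaning-preserving word substitution can be written as:

```python
# Illustrative sketch: a meaning-preserving word-substitution perturbation.
# The SYNONYMS table is a hypothetical stand-in; a real attack would query
# an embedding space or a lexical resource, and would pick substitutions
# that maximally change the target model's prediction.
SYNONYMS = {"great": "excellent", "bad": "poor", "movie": "film"}

def perturb(sentence: str) -> str:
    """Replace each word that has a listed synonym, leaving others intact.

    The surface form changes while the meaning is preserved, which is the
    property adversarial examples for text exploit.
    """
    words = sentence.split()
    return " ".join(SYNONYMS.get(w.lower(), w) for w in words)
```

Fine-tuning a model on such perturbed inputs (paired with their original labels) is the general recipe for the adversarial training the abstract refers to.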
The proposed methods were evaluated on various downstream tasks against many baseline models and algorithms, including language modeling, sentiment classification, and semantic relatedness prediction. Some were also evaluated for time efficiency. The experimental results show that: (1) our proposed neural network architecture improves performance on various downstream NLP tasks without significantly increasing the number of parameters required; (2) models adversarially trained with the proposed algorithm preserve roughly the same performance on adversarial examples as on natural examples; and (3) the artificial data generated with the proposed data augmentation technique significantly improves the models' performance when labeled training data is very limited.
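The abstract also refers to data augmentation for enlarging a small labeled corpus. As a generic, hedged sketch (this is a common label-preserving augmentation, not the new algorithm the thesis introduces), random position swaps generate extra training variants without altering a sentence's bag of words:

```python
import random

def augment(sentence: str, n: int = 2, seed: int = 0) -> list[str]:
    """Generate n variants of a sentence by swapping two word positions.

    Each variant keeps the same multiset of words, so a downstream label
    (e.g. sentiment) is unlikely to change -- the property an augmentation
    needs to safely enlarge a labeled corpus.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    words = sentence.split()
    variants = []
    for _ in range(n):
        w = words[:]
        i, j = rng.sample(range(len(w)), 2)  # two distinct positions
        w[i], w[j] = w[j], w[i]
        variants.append(" ".join(w))
    return variants
```

Pairing each variant with its original label multiplies the effective training-set size, which is the setting the abstract's third contribution targets.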
dc.format: application/pdf
dc.genre: dissertations
dc.identifier: doi:10.13016/m2j9k4-in2b
dc.identifier.other: 12356
dc.identifier.uri: http://hdl.handle.net/11603/24175
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.source: Original File Name: Gao_umbc_0434D_12356.pdf
dc.subject: adversarial training
dc.subject: data augmentation
dc.subject: long short-term memory
dc.subject: natural language processing
dc.title: Towards Learning A Better Text Encoder
dc.type: Text
dcterms.accessRights: Access limited to the UMBC community. The item may possibly be obtained via Interlibrary Loan through a local library, pending the author/copyright holder's permission.
dcterms.accessRights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu.

Files

Original bundle
Name: Gao_umbc_0434D_12356.pdf
Size: 2.28 MB
Format: Adobe Portable Document Format