Multilingual Text Alignment

Author/Creator

Author/Creator ORCID

Date

2019-01-01

Department

Information Systems

Program

Information Systems

Citation of Original Publication

Rights

Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

Cybersecurity threats, exploits, and intelligence sources have become largely cross-regional over time. Although the security community continually addresses this topic, its scope keeps expanding and introducing new areas of study. One relevant but heavily under-explored area of research is the use of multilingual open source intelligence in cyber operations. Open Source Intelligence (OSINT) in the form of text is scattered across major criminal networks and is highly multilingual in nature. By aligning multilingual sources, the security community can tap into new pools of intelligence. Language alignment can be achieved through the use of neural machine translation (NMT) systems. This thesis explores supervised and unsupervised methods for aligning multilingual open source intelligence sources without the use of third-party engines. Although third-party engines are growing stronger, they are unsuited to private security environments. First, sensitive intelligence is not a permitted input to third-party engines due to privacy and confidentiality policies. In addition, third-party engines produce generalized translations that tend to lack domain-specific cybersecurity terminology, which could be integral to attack discovery. We address these issues and describe our system, which enables threat intelligence understanding across unfamiliar languages. We create monolingual and multilingual word embeddings from open source intelligence data in two distinct languages, and derive a bilingual dictionary through both supervised and unsupervised methods. We then create a neural-network-based system that takes in cybersecurity data in a different language and outputs the respective English translation. We evaluate the system with traditional approaches and through experimental applications.
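
To make the dictionary-derivation step in the abstract concrete, here is a minimal sketch of one common way to read a bilingual dictionary out of word embeddings that already share a vector space: pair each source word with its nearest target word by cosine similarity. It is an illustration only, not the thesis's implementation. The abstract does not say which languages, vectors, or retrieval method were used, so the Russian/English word lists, the three-dimensional vectors, and the function names `cosine` and `induce_dictionary` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def induce_dictionary(src_vecs, tgt_vecs):
    """Map each source word to its most similar target word.

    Assumes both vocabularies are already embedded in one shared
    space (for example, after a supervised or unsupervised alignment).
    """
    dictionary = {}
    for src_word, src_vec in src_vecs.items():
        best = max(tgt_vecs, key=lambda w: cosine(src_vec, tgt_vecs[w]))
        dictionary[src_word] = best
    return dictionary


# Hypothetical toy vectors; real embeddings would be learned from
# OSINT text and have hundreds of dimensions.
russian = {
    "вирус": [0.9, 0.1, 0.0],
    "пароль": [0.1, 0.9, 0.1],
    "сеть": [0.0, 0.2, 0.9],
}
english = {
    "virus": [1.0, 0.0, 0.1],
    "password": [0.0, 1.0, 0.0],
    "network": [0.1, 0.1, 1.0],
}

bilingual = induce_dictionary(russian, english)
for src_word, tgt_word in bilingual.items():
    print(src_word, "->", tgt_word)
```

Plain nearest-neighbor retrieval is the simplest choice and is used here only for readability. A real system would first need an alignment step to bring the two monolingual spaces together, and might use a more robust retrieval criterion.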