Open Information Extraction for Code-Mix Hindi-English Social Media Data

Ferraro, Francis
PATE, MAYUR SATISH
2021-01-29
2021-01-29
2018-01-01
11894
http://hdl.handle.net/11603/20714

Open domain relation extraction (Angeli, Premkumar, & Manning 2015) is the process of finding relation triples. While a number of open information extraction (Open IE) systems are available for single-language text, traditional Open IE systems are not well suited to content that mixes multiple languages in a single utterance. In this thesis, we have extended an existing code-mix corpus (Das, Jamatia, & Gambäck 2015) by finding and annotating relation triples in an Open IE fashion. We will be open sourcing this newly annotated dataset. Using this newly annotated corpus, we have experimented with sequence-to-sequence neural networks (Zhang, Duh, & Van Durme 2017) for finding the relation triples. As prerequisites for the relation extraction pipeline, we have developed a part-of-speech tagger, a named entity recognizer, and a predicate recognizer for code-mix content. We have experimented with various approaches, such as Conditional Random Fields (CRF), Averaged Perceptron, and deep neural networks. To our knowledge, this is the first relation extraction system for any code-mixed natural language. We have achieved promising results for all of the components, and these results could be improved in the future with more code-mix data.

application:pdf
Code Mixing; Machine Learning; Named Entity Recognition; Natural Language Processing; Open IE; Seq2Seq Neural Network
Text