Computer Science and Electrical Engineering


Computer Science

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see or contact Special Collections at speccoll(at)
Distribution Rights granted to UMBC by the author.


Social media websites such as Twitter, Facebook, and LinkedIn aggregate large amounts of textual data. A wealth of user information can be inferred from this data, and it is potentially useful in advertising, analytics, sentiment analysis, and similar applications. It is estimated that over 60% of people in the US have a Twitter account, and immigrants make up a significant portion of the US population. As social media have become commonplace, people willingly post personal information such as their name, age, location, and alma mater. This makes it possible to use text classification methods to accurately determine demographic profiles. This thesis focuses on extracting latent demographic information from social media data. Previous works have attempted to determine users' race and ethnicity, while our work focuses on using posts on Twitter (tweets) to determine whether a user is an immigrant or a native US citizen. The method uses the distribution of ethnic names among immigrant and native populations to find and collect users in the United States, together with their tweets, across three race groups: Asian, Latino, and Caucasian/White. We use a supervised machine learning approach to predict the immigration status of a user by examining the textual content of tweets, using Multinomial Naive Bayes, Support Vector Machines, Logistic Regression, k-Nearest Neighbors, and Decision Trees. We investigate methods for improving the performance of these algorithms and determine how the number of features affects the accuracy of the resulting models. Additionally, we evaluate which features carry more weight in classifying users, and we attempt to discover latent topical patterns in the corpus using Latent Dirichlet Allocation.
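
To make the classification step concrete, the following is a minimal sketch of one of the listed algorithms, Multinomial Naive Bayes, applied to bag-of-words text. It is not the thesis implementation: the tokenizer, the function names (`tokenize`, `train_multinomial_nb`, `predict`), the smoothing value, and the six toy "tweets" are all assumptions invented for illustration, and the real work would rely on collected Twitter data and a fuller preprocessing pipeline.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real tweet preprocessing."""
    return text.lower().split()


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a Multinomial Naive Bayes model with add-alpha (Laplace) smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # per-class token counts
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, count in class_doc_counts.items():
        model["log_prior"][label] = math.log(count / n_docs)
        # Denominator: all tokens seen in this class plus smoothing mass.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for label, prior in model["log_prior"].items():
        ll = model["log_likelihood"][label]
        # Words outside the training vocabulary are ignored.
        scores[label] = prior + sum(ll[w] for w in tokenize(text) if w in ll)
    return max(scores, key=scores.get)


# Hypothetical toy data, invented for illustration only (not from the thesis).
train_docs = [
    "visa interview scheduled next week",
    "calling family back home tonight",
    "renewing my visa paperwork again",
    "high school reunion this weekend",
    "grew up watching this hometown team",
    "hometown parade with my high school friends",
]
train_labels = ["immigrant", "immigrant", "immigrant", "native", "native", "native"]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, "visa paperwork"))
print(predict(model, "high school team"))
```

The per-class log-likelihoods in `model["log_likelihood"]` also give a rough view of which words weigh most heavily toward each class, which is the kind of feature inspection the abstract mentions. Restricting `vocab` to the most informative words would be one simple way to study how the number of features affects accuracy.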