PREDICTING LATENT DEMOGRAPHIC ATTRIBUTES OF TWITTER USERS

dc.contributor.advisorOates, Tim
dc.contributor.authorFrolov, Georgiy
dc.contributor.departmentComputer Science and Electrical Engineering
dc.contributor.programComputer Science
dc.date.accessioned2019-10-11T13:39:17Z
dc.date.available2019-10-11T13:39:17Z
dc.date.issued2016-01-01
dc.description.abstractSocial media websites such as Twitter, Facebook, and LinkedIn aggregate large amounts of textual data. A wealth of user information can be inferred from this data, and it is potentially useful in advertising, analytics, sentiment analysis, etc. It is estimated that over 60% of people in the US have a Twitter account, and a significant portion of the US population consists of immigrants. As social media have become commonplace, people are willingly posting personal information such as their name, age, location, alma mater, etc. This makes it possible to use text classification methods to accurately determine demographic profiles. This thesis focuses on extracting latent demographic information from social media data. Previous works have attempted to determine a user's race and ethnicity, while our work focuses on using posts on Twitter (tweets) to determine whether a user is an immigrant or a native US citizen. The method uses the distribution of ethnic names among immigrant and native populations to find and collect users in the United States, along with their tweets, across three race groups: Asian, Latino, and Caucasian/White. We use a supervised machine learning approach to predict the immigration status of a user by examining the textual content of tweets, using Multinomial Naive Bayes, Support Vector Machines, Logistic Regression, k-Nearest Neighbors, and Decision Trees. We investigate methods for improving the performance of the algorithms and determine how the number of features affects the accuracy of the resulting models. Additionally, we evaluate which features carry more weight in classifying users, and attempt to discover latent topical patterns in the data corpus using Latent Dirichlet Allocation.
dc.genretheses
dc.identifierdoi:10.13016/m28fdc-jtnj
dc.identifier.other11484
dc.identifier.urihttp://hdl.handle.net/11603/15473
dc.languageen
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartofUMBC Theses and Dissertations Collection
dc.relation.ispartofUMBC Graduate School Collection
dc.relation.ispartofUMBC Student Collection
dc.rightsThis item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.sourceOriginal File Name: Frolov_umbc_0434M_11484.pdf
dc.subjectlatent demographic attribute
dc.subjectmachine learning
dc.subjectsocial media
dc.subjectsupervised learning
dc.subjecttext classification
dc.titlePREDICTING LATENT DEMOGRAPHIC ATTRIBUTES OF TWITTER USERS
dc.typeText
dcterms.accessRightsDistribution Rights granted to UMBC by the author.

Files

Original bundle
Name:
Frolov_umbc_0434M_11484.pdf
Size:
1.75 MB
Format:
Adobe Portable Document Format
License bundle
Name:
FrolovG_Predicting_Open.pdf
Size:
50.46 KB
Format:
Adobe Portable Document Format