PREDICTING LATENT DEMOGRAPHIC ATTRIBUTES OF TWITTER USERS

Frolov, Georgiy

dc.contributor.advisor	Oates, Tim
dc.contributor.author	Frolov, Georgiy
dc.contributor.department	Computer Science and Electrical Engineering
dc.contributor.program	Computer Science
dc.date.accessioned	2019-10-11T13:39:17Z
dc.date.available	2019-10-11T13:39:17Z
dc.date.issued	2016-01-01
dc.description.abstract	Social media websites such as Twitter, Facebook, and LinkedIn aggregate large amounts of textual data. A wealth of user information can be inferred from this data and is potentially useful in advertising, analytics, sentiment analysis, and similar applications. It is estimated that over 60% of people in the US have a Twitter account, and a significant portion of the US population consists of immigrants. As social media have become commonplace, people are willingly posting personal information such as their name, age, location, and alma mater. This makes it possible to use text classification methods to accurately determine demographic profiles. This thesis focuses on extracting latent demographic information from social media data. Previous works have attempted to determine a user's race and ethnicity, while our work focuses on using posts on Twitter (tweets) to determine whether a user is an immigrant or a native US citizen. The method uses the distribution of ethnic names among immigrant and native populations to find and collect users in the United States, along with their tweets, across three race groups: Asian, Latino, and Caucasian/White. We use a supervised machine learning approach to predict the immigration status of a user by examining the textual content of tweets, using Multinomial Naive Bayes, Support Vector Machines, Logistic Regression, k-Nearest Neighbors, and Decision Trees. We investigate methods for improving the performance of the algorithms and determine how the number of features affects the accuracy of the built models. Additionally, we evaluate which features carry more weight in classifying users, and attempt to discover latent topical patterns in the data corpus using Latent Dirichlet Allocation.
dc.genre	theses
dc.identifier	doi:10.13016/m28fdc-jtnj
dc.identifier.other	11484
dc.identifier.uri	http://hdl.handle.net/11603/15473
dc.language	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof	UMBC Theses and Dissertations Collection
dc.relation.ispartof	UMBC Graduate School Collection
dc.relation.ispartof	UMBC Student Collection
dc.rights	This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source	Original File Name: Frolov_umbc_0434M_11484.pdf
dc.subject	latent demographic attribute
dc.subject	machine learning
dc.subject	social media
dc.subject	supervised learning
dc.subject	text classification
dc.title	PREDICTING LATENT DEMOGRAPHIC ATTRIBUTES OF TWITTER USERS
dc.type	Text
dcterms.accessRights	Distribution Rights granted to UMBC by the author.

Files

Original bundle

Name: Frolov_umbc_0434M_11484.pdf
Size: 1.75 MB
Format: Adobe Portable Document Format

License bundle

Name: FrolovG_Predicting_Open.pdf
Size: 50.46 KB
Format: Adobe Portable Document Format

Collections

UMBC Theses and Dissertations