CLASSIFICATION AND PREDICTION OF NEWSPAPER ARTICLES ON THE BASIS OF AUTHOR GENDER

Author/Creator

Author/Creator ORCID

Date

2018-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Subjects

Abstract

Categorizing text on the basis of author gender is a long-standing problem in the field of machine learning, and gender has been used as a basis for classification in many types of text. In this thesis we focus on categorizing newspaper articles by author gender, applying traditional machine learning techniques for text classification to identify male and female writing styles. We use the New York Times Annotated Corpus, licensed by the Linguistic Data Consortium, which contains approximately 1.8 million articles. The article text is sorted, and initially only articles with definite male or female author bylines and labels are considered for classification and prediction. Each article includes the name of its author, which is matched against a list of names labelled male or female to determine the author's gender. Using the resulting model, we then try to predict the author gender of the authorless articles (including articles written by collective boards, such as editorials). We also conduct a comparative study of different machine learning techniques, such as logistic regression, decision tree classifiers, support vector machines, and a few more, to determine which learning method performs best on the corpus.
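
The byline-matching step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the name list NAME_GENDER, the functions byline_gender and split_articles, and the "byline"/"text" dictionary keys are all hypothetical, and the first-name lookup is only one plausible way to match a byline against a labelled list. Articles whose byline yields no definite label are set aside as the "authorless" set to be predicted later.

```python
import re

# Hypothetical labelled name list (assumption); the list used in the thesis
# is not given in this record.
NAME_GENDER = {
    "john": "male",
    "michael": "male",
    "mary": "female",
    "susan": "female",
}


def byline_gender(byline, name_gender=NAME_GENDER):
    """Return 'male', 'female', or None when no definite label is found."""
    if not byline:
        return None
    # Drop a leading "By " and keep only the first listed author.
    text = re.sub(r"^\s*by\s+", "", byline.strip(), flags=re.IGNORECASE)
    first_author = re.split(
        r"\s+and\s+|,|;", text, maxsplit=1, flags=re.IGNORECASE
    )[0]
    tokens = first_author.split()
    if not tokens:
        return None
    # Look up the first name, ignoring case and trailing periods.
    first_name = tokens[0].strip(".").lower()
    return name_gender.get(first_name)


def split_articles(articles):
    """Separate articles with a definite gender label from the rest."""
    labelled, unlabelled = [], []
    for article in articles:
        gender = byline_gender(article.get("byline"))
        if gender is None:
            unlabelled.append(article)
        else:
            labelled.append((article["text"], gender))
    return labelled, unlabelled
```

The labelled pairs would then serve as training data for the classifiers being compared, while the unlabelled articles (for example, editorials signed by a board) are the ones whose author gender the trained model tries to predict.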