Data Science Employment Classification using Machine Learning

Author/Creator ORCID

Date

2019-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Subjects

Abstract

Following the gold rush in artificial intelligence, a new career track called "data scientist" has taken the world by storm. Combining business intuition with technical soundness, data science is considered the most sought-after job of the 21st century. But one must first be able to tell whether a job posting is a data science-related job or not. This thesis aims to classify whether a job posting belongs to the data science field using machine learning models. Based on the results obtained, an extensive analysis is carried out to identify patterns and to determine whether data science is actually as in demand as one might think. The machine learning models used to classify the job advertisements are a Support Vector Machine (SVM) and a neural network built with TensorFlow. The SVM was chosen for several reasons. First, it has a regularization parameter, which makes the user think about avoiding overfitting. Second, it uses the kernel trick, so expert knowledge about the problem can be built in by engineering the kernel. Third, an SVM is defined by a convex optimization problem (no local minima) for which there are efficient methods (e.g., sequential minimal optimization). Lastly, it approximates a bound on the test error rate, and a substantial body of theory supports its use. Neural networks, in turn, have a relatively simple learning algorithm (stochastic gradient descent with backpropagation) compared to some Bayesian models. They also scale well to larger datasets thanks to readily available general-purpose GPU hardware and CUDA software. Finally, they can significantly outperform other models when well-chosen parameters are combined with high-quality labeled data. The dataset is obtained by scraping the Glassdoor website and is then subjected to pre-processing and feature extraction.
This data is then used to train the above-mentioned models on a training set of around 8,000 job advertisements, with a test set of 2,000 job advertisements. The results are tabulated in the form of a confusion matrix, and the accuracies of the two models are compared.
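
As a minimal sketch of the evaluation step described above, the following shows how a binary confusion matrix and accuracy could be computed for two classifiers and compared. The label vectors are made-up placeholders chosen for illustration; they are not results from the thesis. The function names `confusion_matrix` and `accuracy`, and the convention that 1 marks a data science posting, are assumptions of this sketch.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels.

    Assumed convention: 1 = data science posting, 0 = other posting.
    """
    counts = Counter(zip(y_true, y_pred))  # keys are (true, predicted) pairs
    tp = counts[(1, 1)]  # data science postings correctly identified
    fp = counts[(0, 1)]  # other postings wrongly flagged as data science
    fn = counts[(1, 0)]  # data science postings that were missed
    tn = counts[(0, 0)]  # other postings correctly rejected
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of postings whose predicted label matches the true label."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    return (tp + tn) / len(y_true)


# Placeholder labels for illustration only (NOT data from the thesis).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
svm_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical SVM predictions
nn_pred = [1, 1, 1, 0, 0, 0, 1, 1]   # hypothetical neural network predictions

for name, pred in [("SVM", svm_pred), ("Neural network", nn_pred)]:
    tp, fp, fn, tn = confusion_matrix(y_true, pred)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"accuracy={accuracy(y_true, pred):.3f}")
```

Reporting the four counts alongside accuracy is useful when the classes are imbalanced, since a model can reach high accuracy while missing most postings of the rarer class.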