Data Science Employment Classification using Machine Learning

dc.contributor.advisorNicholas, Charles
dc.contributor.authorChandrashekar, Tejus
dc.contributor.departmentComputer Science and Electrical Engineering
dc.contributor.programComputer Science
dc.date.accessioned2021-01-29T18:13:34Z
dc.date.available2021-01-29T18:13:34Z
dc.date.issued2019-01-01
dc.description.abstractFollowing the gold rush in artificial intelligence, a new career track called "data scientist" has taken the world by storm. Combining business intuition with technical soundness, data science is considered the most sought-after job of the 21st century. But one must be able to tell whether a job posting is a data science-related job or not. This thesis aims to classify whether a job posting belongs to the data science field using a machine learning model. Based on the results obtained, an extensive analysis is done to find patterns and to determine whether data science is actually as in demand as one might think. The machine learning models used for classifying the job advertisements are a Support Vector Machine (SVM) and neural networks with TensorFlow. These two models were considered for the following reasons. First, an SVM has a regularization parameter, which makes the user think about avoiding over-fitting. Next, it uses the kernel trick, so one can build in expert knowledge about the problem by engineering the kernel. Also, an SVM is defined by a convex optimization problem (no local minima) for which there are efficient methods (e.g. sequential minimal optimization). Lastly, it approximates a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea. Neural networks, in turn, have a relatively simple learning algorithm (stochastic gradient descent and backpropagation) when compared to some of the Bayesian models. They also scale well to larger datasets with the general-purpose GPU hardware and CUDA software that are readily available. Finally, they can significantly outperform other models when the right conditions and parameters are in place along with high-quality labeled data.
The dataset is obtained by scraping the Glassdoor website and is then subjected to pre-processing and feature extraction. This data is then used to train the above-mentioned models on a training set of around 8000 job advertisements and a test sample of 2000 job advertisements. The results are tabulated in the form of a confusion matrix and the accuracies of the two models are compared.
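The evaluation step described in the abstract (a confusion matrix per model, then an accuracy comparison) can be sketched as below. This is a minimal illustration in plain Python and not the thesis code: the helper names `confusion_matrix` and `accuracy`, the label encoding (1 = data science posting, 0 = other), and the toy label lists are all assumptions made for the example and do not reproduce any result from the thesis.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels=(1, 0)):
    """Return nested counts as matrix[actual][predicted]."""
    counts = Counter(zip(y_true, y_pred))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy data only (hypothetical): 1 = data science posting, 0 = other.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
svm_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # stand-in for SVM predictions
nn_pred = [1, 1, 1, 0, 0, 1, 0, 1]   # stand-in for neural network predictions

svm_cm = confusion_matrix(y_true, svm_pred)
nn_cm = confusion_matrix(y_true, nn_pred)

for name, pred, cm in (("SVM", svm_pred, svm_cm), ("NN", nn_pred, nn_cm)):
    print(name, "confusion matrix:", cm)
    print(name, "accuracy:", accuracy(y_true, pred))
```

On these toy labels the two stand-in models score 0.75 and 0.875; with real predictions on the 2000-advertisement test sample, the same two functions would give the tabulated matrix and the accuracy for each model.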
dc.formatapplication:pdf
dc.genretheses
dc.identifierdoi:10.13016/m2xxmy-qamh
dc.identifier.other12019
dc.identifier.urihttp://hdl.handle.net/11603/20872
dc.languageen
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartofUMBC Theses and Dissertations Collection
dc.relation.ispartofUMBC Graduate School Collection
dc.relation.ispartofUMBC Student Collection
dc.sourceOriginal File Name: Chandrashekar_umbc_0434M_12019.pdf
dc.titleData Science Employment Classification using Machine Learning
dc.typeText
dcterms.accessRightsDistribution Rights granted to UMBC by the author.
dcterms.accessRightsAccess limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
dcterms.accessRightsThis item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Files

Original bundle
Name:
Chandrashekar_umbc_0434M_12019.pdf
Size:
1.01 MB
Format:
Adobe Portable Document Format
License bundle
Name:
ChandrashekarTData_Open.pdf
Size:
44.84 KB
Format:
Adobe Portable Document Format