PARALLEL FEATURE SELECTION OF MULTIPLE CLASS DATASETS USING APACHE SPARK

Sankineni, Rishi

dc.contributor.advisor	Wang, Jianwu
dc.contributor.author	Sankineni, Rishi
dc.contributor.department	Information Systems
dc.contributor.program	Information Systems
dc.date.accessioned	2019-10-11T13:59:20Z
dc.date.available	2019-10-11T13:59:20Z
dc.date.issued	2017-01-01
dc.description.abstract	Feature selection is the task of selecting a small subset of the original features that achieves maximum classification accuracy. Such a subset has several important benefits: it reduces the computational complexity of learning algorithms, saves time, improves accuracy, and can give insight to people working in the problem domain. This makes feature selection an indispensable step in classification. In this thesis, we present a two-phase approach to feature selection. In the first phase, a batch-based Minimum Redundancy and Maximum Relevance (mRMR) algorithm is used with "correlation coefficient" and "mutual information" as statistical measures of similarity. This phase improves classification performance by removing redundant and unimportant features. In the second phase, we present a stream-based, tree-based feature selection method that allows dynamic generation and selection of features while taking advantage of the different feature classes and the fact that they are of different sizes and contain different fractions of good features. Experimental results show that this phase is computationally less expensive than comparable "batch" methods that do not take advantage of the feature classes and expect all features to be known in advance.
dc.genre	theses
dc.identifier	doi:10.13016/m29iqr-b0cu
dc.identifier.other	11724
dc.identifier.uri	http://hdl.handle.net/11603/15636
dc.language	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Information Systems Department Collection
dc.relation.ispartof	UMBC Theses and Dissertations Collection
dc.relation.ispartof	UMBC Graduate School Collection
dc.relation.ispartof	UMBC Student Collection
dc.rights	This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source	Original File Name: Sankineni_umbc_0434M_11724.pdf
dc.subject	Apache Spark
dc.subject	Big Data
dc.subject	Dimensionality Reduction
dc.subject	Feature Selection
dc.subject	Machine Learning
dc.subject	Streaming data
dc.title	PARALLEL FEATURE SELECTION OF MULTIPLE CLASS DATASETS USING APACHE SPARK
dc.type	Text
dcterms.accessRights	Distribution Rights granted to UMBC by the author.

Files

Original bundle

Name: Sankineni_umbc_0434M_11724.pdf
Size: 4.03 MB
Format: Adobe Portable Document Format

License bundle

Name: SankineniR_Parellel_Open.pdf
Size: 45.13 KB
Format: Adobe Portable Document Format

Collections

UMBC Theses and Dissertations