
    PARALLEL FEATURE SELECTION OF MULTIPLE CLASS DATASETS USING APACHE SPARK

    Files
    Sankineni_umbc_0434M_11724.pdf (4.025 MB)
    Permanent Link
    http://hdl.handle.net/11603/15636
    Collections
    • UMBC Theses and Dissertations
    Metadata
    Author/Creator
    Unknown author
    Date
    2017-01-01
    Type of Work
    Text
    thesis
    Department
    Information Systems
    Program
    Information Systems
    Rights
    This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
    Distribution Rights granted to UMBC by the author.
    Subjects
    Apache Spark
    Big Data
    Dimensionality Reduction
    Feature Selection
    Machine Learning
    Streaming data
    Abstract
    Feature selection is the task of selecting a small subset of the original features that achieves maximum classification accuracy. Such a subset offers several important benefits: it reduces the computational complexity of learning algorithms, saves time, improves accuracy, and the selected features can be insightful for people working in the problem domain. This makes feature selection an indispensable part of classification. In this thesis, we present a two-phase approach to feature selection. In the first phase, a batch-based Minimum Redundancy and Maximum Relevance (mRMR) algorithm is used, with the correlation coefficient and mutual information as statistical measures of similarity. This phase improves classification performance by removing redundant and unimportant features. In the second phase, we present a streaming, tree-based feature selection method that allows dynamic generation and selection of features while taking advantage of the different feature classes and of the fact that they differ in size and in their fraction of good features. Experimental results show that this phase is computationally less expensive than comparable batch methods that do not take advantage of the feature classes and that expect all features to be known in advance.
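
    The sketch below is not code from the thesis; it is a minimal illustration of the greedy mRMR scoring idea described in the abstract, written in plain Python with NumPy and scikit-learn rather than Apache Spark. It uses mutual information as the relevance measure and the absolute correlation coefficient as the redundancy measure; the function name, synthetic dataset, and parameter values are illustrative assumptions only.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    def mrmr_select(X, y, k):
        """Greedy mRMR-style selection: maximize relevance minus mean redundancy."""
        n_features = X.shape[1]
        relevance = mutual_info_classif(X, y, random_state=0)  # relevance of each feature to the label
        redundancy = np.abs(np.corrcoef(X, rowvar=False))      # |Pearson correlation| between feature pairs
        selected = [int(np.argmax(relevance))]                 # start with the most relevant feature
        while len(selected) < k:
            remaining = [f for f in range(n_features) if f not in selected]
            # score = relevance minus mean redundancy with the features chosen so far
            scores = [relevance[f] - redundancy[f, selected].mean() for f in remaining]
            selected.append(remaining[int(np.argmax(scores))])
        return selected

    if __name__ == "__main__":
        # Synthetic data for illustration only.
        X, y = make_classification(n_samples=500, n_features=30,
                                   n_informative=5, random_state=0)
        print("Selected feature indices:", mrmr_select(X, y, k=5))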


    Albin O. Kuhn Library & Gallery
    University of Maryland, Baltimore County
    1000 Hilltop Circle
    Baltimore, MD 21250
    www.umbc.edu/scholarworks

    Contact information:
    Email: scholarworks-group@umbc.edu
    Phone: 410-455-3544


    If you wish to submit a copyright complaint or withdrawal request, please email mdsoar-help@umd.edu.