Data-Driven Approaches to Classifier and Variable Selection in High-Dimensional Classification

Andalib, Vahid

dc.contributor.advisor	Baek, Seungchul
dc.contributor.author	Andalib, Vahid
dc.contributor.department	Mathematics and Statistics
dc.contributor.program	Statistics
dc.date.accessioned	2025-02-13T15:35:04Z
dc.date.available	2025-02-13T15:35:04Z
dc.date.issued	2024-01-01
dc.description.abstract	Classification in high dimensions has gained significant attention over the past two decades because Fisher's linear discriminant analysis (LDA) is not optimal when the sample size n is smaller than the number of variables p, i.e., p > n, mostly due to the singularity of the sample covariance matrix. This dissertation proposes two novel data-driven approaches to address the challenges in high-dimensional classification, both building upon Fisher's LDA. The first approach develops binary classifiers using random partitioning. Rather than modifying how the sample covariance matrix and sample mean vector are estimated when constructing a classifier, we build two types of high-dimensional classifiers using data splitting: single data splitting (SDS) and multiple data splitting (MDS). We also present a weighted version of the MDS classifier that further improves classification performance. Each split data set has fewer variables than the sample size, so that LDA is applicable, and the classification results are combined so as to minimize the misclassification rate. We provide theoretical justification for our methods by comparing their misclassification rates with that of LDA in high dimensions. The second approach proposes a high-dimensional classifier based on a two-stage procedure that performs both variable selection and classification. The variable selection scheme selects covariates that belong to the discriminative set, and it is aimed at obtaining a better classifier rather than at identifying significant variables for their own sake. In the first stage, we identify discriminative variables using the notion of a mirror statistic, proposed recently in the literature, together with an LDA direction vector obtained from a regularized form of the sample covariance matrix and a James-Stein type estimator for the mean vectors. 
In the second stage, a new classifier is developed using the selected variables, refined with a modified ε-greedy algorithm to enhance the LDA direction vector. Both approaches are extensively validated through simulation studies and real data analysis, including DNA microarray data sets. Our methods demonstrate superior or comparable performance to existing high-dimensional classifiers, offering improved classification accuracy, effective variable selection, and robustness in various scenarios. This dissertation contributes to the field of high-dimensional statistics by providing novel, theoretically grounded, and effective methods for classification in high-dimensional spaces, with potential applications in genomics, machine learning, and other domains facing the challenges of high-dimensional data analysis.
dc.format	application:pdf
dc.genre	dissertation
dc.identifier	doi:10.13016/m2jxx5-bplg
dc.identifier.other	12951
dc.identifier.uri	http://hdl.handle.net/11603/37643
dc.language	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Mathematics and Statistics Department Collection
dc.relation.ispartof	UMBC Theses and Dissertations Collection
dc.relation.ispartof	UMBC Graduate School Collection
dc.relation.ispartof	UMBC Student Collection
dc.rights	This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source	Original File Name: Andalib_umbc_0434D_12951.pdf
dc.subject	Classification
dc.subject	High dimension
dc.subject	Linear discriminant analysis
dc.subject	Mirror statistics
dc.subject	Random partitioning
dc.subject	Variable selection
dc.title	Data-Driven Approaches to Classifier and Variable Selection in High-Dimensional Classification
dc.type	Text
dcterms.accessRights	Distribution Rights granted to UMBC by the author.
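
The random-partitioning idea in the abstract can be illustrated with a minimal sketch. It is not the dissertation's SDS or MDS classifier: the function names (fit_lda, split_lda_predict), the block size, and the majority-vote rule for combining blocks are assumptions made here for illustration, whereas the abstract only states that the results are combined so as to minimize the misclassification rate. The sketch shows the one point the abstract does make: when the p variables are randomly split into blocks that each contain fewer variables than there are observations, the pooled sample covariance of each block is generically invertible, so ordinary LDA can be fitted block by block even though p > n.

```python
import numpy as np


def fit_lda(X, y):
    """Fit two-class Fisher LDA (labels 0 and 1); return direction w and offset b."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled sample covariance. Its rank is at most n - 2, so it can only be
    # inverted when the number of variables in X is at most n - 2.
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    w = np.linalg.solve(S, m1 - m0)
    b = -w @ (m0 + m1) / 2
    return w, b


def split_lda_predict(X_train, y_train, X_test, block_size, rng):
    """Randomly partition the variables, fit LDA per block, combine by vote.

    ASSUMPTION: majority voting is a stand-in combination rule chosen for
    this sketch; it is not taken from the dissertation.
    """
    p = X_train.shape[1]
    perm = rng.permutation(p)  # one random partition of the p variables
    blocks = [perm[i:i + block_size] for i in range(0, p, block_size)]
    votes = np.zeros(len(X_test))
    for idx in blocks:
        w, b = fit_lda(X_train[:, idx], y_train)
        votes += X_test[:, idx] @ w + b > 0  # 1 where the block predicts class 1
    # Ties (possible with an even number of blocks) go to class 0.
    return (votes > len(blocks) / 2).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 40, 100  # p > n, so full-data LDA would face a singular covariance
    y_train = np.repeat([0, 1], n // 2)
    X_train = rng.normal(size=(n, p)) + y_train[:, None]  # class 1 shifted by 1
    y_test = np.repeat([0, 1], 100)
    X_test = rng.normal(size=(200, p)) + y_test[:, None]
    pred = split_lda_predict(X_train, y_train, X_test, block_size=10, rng=rng)
    print("test accuracy:", (pred == y_test).mean())
```

In this toy setting every block uses 10 variables against 40 observations, so each per-block covariance matrix can be inverted directly. Repeating the random partition several times and aggregating over partitions would be one way to read "multiple data splitting", but that reading is also an assumption.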

Files

Original bundle

Name: Andalib_umbc_0434D_12951.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format

License bundle

Name: Andalib-Vahid_Open.pdf
Size: 134.94 KB
Format: Adobe Portable Document Format

Collections

UMBC Theses and Dissertations
UMBC Graduate School
UMBC Mathematics and Statistics Department
UMBC Student Collection