Data-Driven Approaches to Classifier and Variable Selection in High-Dimensional Classification

dc.contributor.advisor: Baek, Seungchul
dc.contributor.author: Andalib, Vahid
dc.contributor.department: Mathematics and Statistics
dc.contributor.program: Statistics
dc.date.accessioned: 2025-02-13T15:35:04Z
dc.date.available: 2025-02-13T15:35:04Z
dc.date.issued: 2024-01-01
dc.description.abstract: Classification in high dimensions has gained significant attention over the past two decades because Fisher's linear discriminant analysis (LDA) is not optimal when the sample size n is smaller than the number of variables p, i.e., p > n, mostly due to the singularity of the sample covariance matrix. This dissertation proposes two novel data-driven approaches to address the challenges in high-dimensional classification, both building upon Fisher's LDA. The first approach develops binary classifiers using random partitioning. Rather than modifying how the sample covariance matrix and sample mean vector are estimated in constructing a classifier, we build two types of high-dimensional classifiers using data splitting: single data splitting (SDS) and multiple data splitting (MDS). We also present a weighted version of the MDS classifier that further improves classification performance. Each split data set has fewer variables than the sample size, so that LDA is applicable, and the classification results are combined so as to minimize the misclassification rate. We provide theoretical justification for our methods by comparing their misclassification rates with that of LDA in high dimensions. The second approach proposes a high-dimensional classifier based on a two-stage procedure that performs both variable selection and classification. The variable selection scheme selects covariates that belong to the discriminative set, and it is aimed at obtaining a better classifier rather than at identifying significant variables per se. In the first stage, we identify discriminative variables by adopting the notion of a mirror statistic, recently proposed in the literature, together with an LDA direction vector obtained from a regularized form of the sample covariance matrix and a James-Stein type estimator for the mean vectors. In the second stage, a new classifier is developed using the selected variables, refined with a modified ε-greedy algorithm to enhance the LDA direction vector. Both approaches are extensively validated through simulation studies and real data analysis, including DNA microarray data sets. Our methods demonstrate performance superior or comparable to that of existing high-dimensional classifiers, offering improved classification accuracy, effective variable selection, and robustness in various scenarios. This dissertation contributes to the field of high-dimensional statistics by providing novel, theoretically grounded, and effective methods for classification in high-dimensional spaces, with potential applications in genomics, machine learning, and other domains facing the challenges of high-dimensional data analysis.
dc.format: application:pdf
dc.genre: dissertation
dc.identifier: doi:10.13016/m2jxx5-bplg
dc.identifier.other: 12951
dc.identifier.uri: http://hdl.handle.net/11603/37643
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Mathematics and Statistics Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source: Original File Name: Andalib_umbc_0434D_12951.pdf
dc.subject: Classification
dc.subject: High dimension
dc.subject: Linear discriminant analysis
dc.subject: Mirror statistics
dc.subject: Random partitioning
dc.subject: Variable selection
dc.title: Data-Driven Approaches to Classifier and Variable Selection in High-Dimensional Classification
dc.type: Text
dcterms.accessRights: Distribution Rights granted to UMBC by the author.
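
The random-partitioning idea summarized in the abstract can be illustrated with a minimal sketch. It is not the dissertation's SDS or MDS classifier: the function names (fit_lda, split_lda_predict), the block_size parameter, the simulated data, and above all the rule for combining blocks (a plain sum of per-block discriminant scores) are assumptions made here for illustration, whereas the abstract states that the results are combined so as to minimize the misclassification rate. The sketch only shows why splitting the variables into blocks smaller than the sample size makes ordinary LDA applicable when p > n.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class Fisher LDA (labels 0/1). Returns (w, b); score = x @ w + b."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled sample covariance. Its rank is at most n - 2, so it can be
    # inverted only if the block holds at most n - 2 variables.
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    w = np.linalg.solve(S, m1 - m0)   # LDA direction vector for this block
    b = -0.5 * (m0 + m1) @ w          # places the cutoff at the class midpoint
    return w, b

def split_lda_predict(X_train, y_train, X_test, block_size, rng):
    """Randomly partition the variables, fit LDA per block, combine the scores."""
    p = X_train.shape[1]
    perm = rng.permutation(p)         # one random partition of the variables
    blocks = [perm[i:i + block_size] for i in range(0, p, block_size)]
    total = np.zeros(len(X_test))
    for idx in blocks:
        w, b = fit_lda(X_train[:, idx], y_train)
        # Assumption: blocks are combined by summing discriminant scores.
        total += X_test[:, idx] @ w + b
    return (total > 0).astype(int)

# Simulated example with fewer observations than variables (p > n).
rng = np.random.default_rng(0)
n, p = 40, 200
y_train = np.repeat([0, 1], n // 2)
X_train = rng.normal(size=(n, p)) + y_train[:, None] * 1.0  # class 1 shifted by 1
y_test = np.repeat([0, 1], 50)
X_test = rng.normal(size=(100, p)) + y_test[:, None] * 1.0

pred = split_lda_predict(X_train, y_train, X_test, block_size=10, rng=rng)
accuracy = (pred == y_test).mean()
print(f"test accuracy: {accuracy:.2f}")
```

Repeating the partition with several random permutations and aggregating the resulting predictions would be one way to move from a single split toward a multiple-split scheme; how the dissertation weights or aggregates the splits is not specified in this record.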

Files

Original bundle

Name:
Andalib_umbc_0434D_12951.pdf
Size:
1.23 MB
Format:
Adobe Portable Document Format

License bundle

Name:
Andalib-Vahid_Open.pdf
Size:
134.94 KB
Format:
Adobe Portable Document Format