Improving the generalization of unsupervised feature learning by using data from different sources on gene expression data for cancer diagnosis
Date
2022-02-24
Citation of Original Publication
Liu, Z., Wang, R. & Zhang, W. Improving the generalization of unsupervised feature learning by using data from different sources on gene expression data for cancer diagnosis. Med Biol Eng Comput 60, 1055–1073 (2022). https://doi.org/10.1007/s11517-022-02522-2
Rights
This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s11517-022-02522-2
Access to this item will begin on 02/24/2023
Abstract
Machine learning techniques have been applied to gene expression profiling for cancer diagnosis. However, gene expression data suffer from the curse of high dimensionality. Different kinds of feature reduction methods have been proposed to reduce the number of features for the diagnosis of specific cancers. However, because samples of a particular tumor are difficult to obtain, the lack of training samples may lead to overfitting. In addition, a feature reduction model built for a specific tumor may not scale or generalize to new cancer types. To handle these problems, this paper proposes an unsupervised feature learning method to reduce the dimensionality of gene expression data. This method amplifies the training samples for feature learning by utilizing unlabeled samples from different sources. Two heuristic rules are devised to check whether the unlabeled samples can be used to amplify the training set. The amplified training set is used to train a feature learning model based on a sparse autoencoder. Since the method leverages the knowledge shared among expression data from different sources, it improves the generalization of unsupervised feature learning and further boosts cancer diagnosis performance. A series of experiments is carried out on gene expression datasets from TCGA and other sources. Experimental results show that our method improves the generalization of cancer diagnosis when unlabeled data are used for latent feature learning.
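
The general idea of training a sparse autoencoder on a training set amplified with unlabeled samples can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not describe the two heuristic rules, so the sketch simply pools every unlabeled sample without filtering. The L1 activation penalty, the layer sizes, the learning rate, the synthetic data, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_features, n_hidden, rng):
    """Small random weights for a one-hidden-layer autoencoder."""
    scale = 1.0 / np.sqrt(n_features)
    return {
        "W1": rng.normal(0.0, scale, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, scale, (n_hidden, n_features)),
        "b2": np.zeros(n_features),
    }


def encode(X, p):
    """Map samples to their latent (hidden-layer) features."""
    return sigmoid(X @ p["W1"] + p["b1"])


def reconstruction_error(X, p):
    """Mean squared error between X and its reconstruction."""
    X_hat = encode(X, p) @ p["W2"] + p["b2"]
    return float(np.mean((X_hat - X) ** 2))


def train_sparse_autoencoder(X, n_hidden=16, l1=1e-3, lr=0.05,
                             epochs=200, seed=0):
    """Full-batch gradient descent on squared error plus an L1 penalty
    on the hidden activations (one common way to encourage sparsity;
    assumed here, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = init_params(d, n_hidden, rng)
    for _ in range(epochs):
        H = encode(X, p)
        X_hat = H @ p["W2"] + p["b2"]
        # Gradient of (1 / 2n) * ||X_hat - X||^2 with respect to X_hat.
        d_out = (X_hat - X) / n
        # Sigmoid outputs are positive, so the L1 gradient is constant.
        d_hidden = d_out @ p["W2"].T + l1 / n
        d_pre = d_hidden * H * (1.0 - H)
        p["W2"] -= lr * (H.T @ d_out)
        p["b2"] -= lr * d_out.sum(axis=0)
        p["W1"] -= lr * (X.T @ d_pre)
        p["b1"] -= lr * d_pre.sum(axis=0)
    return p


# Synthetic stand-in for expression data: samples x genes.
rng = np.random.default_rng(1)
data = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 40))
data += 0.1 * rng.normal(size=data.shape)
data = (data - data.mean(axis=0)) / data.std(axis=0)

target_samples = data[:30]    # small set for the tumor of interest
unlabeled_pool = data[30:]    # unlabeled samples from other sources

# Amplify the training set by pooling (no filtering rules applied here).
amplified = np.vstack([target_samples, unlabeled_pool])
params = train_sparse_autoencoder(amplified, n_hidden=8, epochs=300)

# Latent features that a downstream classifier could consume.
latent = encode(target_samples, params)
```

In this sketch, `latent` holds the reduced representation of the target samples; a diagnosis classifier would then be trained on these features rather than on the raw expression values.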