Improving the generalization of unsupervised feature learning by using data from different sources on gene expression data for cancer diagnosis

Date

2022-02-24

Citation of Original Publication

Liu, Z., Wang, R. & Zhang, W. Improving the generalization of unsupervised feature learning by using data from different sources on gene expression data for cancer diagnosis. Med Biol Eng Comput 60, 1055–1073 (2022). https://doi.org/10.1007/s11517-022-02522-2

Rights

This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s11517-022-02522-2
Access to this item will begin on 02/24/2023

Abstract

Machine learning techniques have been utilized on gene expression profiling for cancer diagnosis. However, the gene expression data suffer from the curse of high dimensionality. Different kinds of feature reduction methods have been proposed to decrease the features for specific cancer diagnosis. However, with the difficulty of obtaining the samples of a particular tumor, the lack of training samples may lead to the overfitting problem. In addition, the feature reduction model on a specific tumor may lead to the problem that the model is not scalable and cannot be generalized to new cancer types. To handle these problems, this paper proposes an unsupervised feature learning method to reduce the data dimensionality of gene expression data. This method amplifies the training samples of feature learning by utilizing the unlabeled samples from different sources. Two heuristic rules are devised to check if the unlabeled samples could be used for amplifying the training set. The amplified training set is used to train the feature learning model based on sparse autoencoder. Since the method leverages the knowledge among the expression data from different sources, it improves the generalization of unsupervised feature learning and further boosts the cancer diagnosis performance. A series of experiments are carried out on the gene expression datasets from TCGA and other sources. Experimental results prove that our method improves the generalization of cancer diagnosis when unlabeled data are used for latent feature learning.
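
As an illustration of the general idea only, the following is a minimal sketch of a one-hidden-layer sparse autoencoder trained on a pooled set of labeled and unlabeled expression samples. It is not the authors' implementation. The abstract does not spell out the two heuristic rules, so the sketch assumes the unlabeled samples have already been accepted. The KL-divergence sparsity penalty, the function names (`train_sparse_autoencoder`, `encode`), the hyperparameters, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_sparse_autoencoder(X, n_hidden=8, rho=0.05, beta=1.0,
                             lr=0.1, epochs=200, seed=0):
    """Fit a sparse autoencoder by full-batch gradient descent.

    X is (n_samples, n_features), assumed scaled to [0, 1].
    rho is the target mean activation, beta weights the KL penalty.
    Returns the encoder parameters (W1, b1) and the per-epoch loss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: encode, then reconstruct.
        H = sigmoid(X @ W1 + b1)
        X_hat = sigmoid(H @ W2 + b2)
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)

        # Loss: mean squared reconstruction error plus KL sparsity penalty.
        recon = 0.5 * np.sum((X_hat - X) ** 2) / n
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        losses.append(recon + beta * kl)

        # Backward pass.
        d_out = (X_hat - X) * X_hat * (1 - X_hat) / n
        d_sparse = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n
        d_hid = (d_out @ W2.T + d_sparse) * H * (1 - H)

        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1), losses


def encode(X, params):
    """Map samples to the learned low-dimensional features."""
    W1, b1 = params
    return sigmoid(X @ W1 + b1)


# Synthetic stand-ins for expression matrices (hypothetical data).
rng = np.random.default_rng(42)
X_labeled = rng.random((30, 50))     # scarce samples of the target tumor
X_unlabeled = rng.random((120, 50))  # samples from other sources, assumed accepted

# Amplify the training set by pooling, then learn features without labels.
X_pool = np.vstack([X_labeled, X_unlabeled])
params, losses = train_sparse_autoencoder(X_pool)

# Reduced features for the labeled samples, ready for a downstream classifier.
Z_labeled = encode(X_labeled, params)
```

In this sketch the labels play no part in feature learning. They would only be needed afterwards, when a classifier is trained on `Z_labeled`.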