Chronic obstructive pulmonary disease (COPD) is the most common group of respiratory disorders, characterized by persistent and irreversible airflow obstruction.
Molecular phenotyping of COPD is challenging because access to lung tissue from patients with severely impaired lung function is limited. Bronchial brushing, by contrast, is a less invasive method that lets clinicians and researchers sample airway epithelial cells to better understand the changes in the cellular and molecular landscape of COPD lungs.
We therefore used a GEO dataset (GSE37147) that profiled bronchial epithelial cells obtained by bronchoscopy from a group of smokers with and without COPD (smoker controls, SC). To analyze this dataset, we applied two different machine learning (ML) techniques, using gene expression as features, to distinguish the COPD group from the SC group.
Determining bronchial gene expression signature of Chronic Obstructive Pulmonary Disease by machine learning techniques
Thi K. Tran-Nguyen (1), Tongbin Zhang (2), Son Do Hai Dang (3) and Steven R. Duncan (1)
Poster panels: Introduction; Methods and Results; Conclusion; Acknowledgement
(1) Division of Pulmonary and Critical Care Medicine, Department of Medicine, The University of Alabama at Birmingham, Birmingham, AL, U.S.A.
(2) The 1st School of Medicine and School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
(3) Master in Data Sciences Program, The University of Alabama at Birmingham, Birmingham, AL, U.S.A.
2. Random Forest to classify COPD from SC
Figure 2. Feature selection improved Random Forest predictive
performance. After running predictor importance estimation,
we selected the 17 genes with the highest importance ranking for
the subsequent classification task. Before feature selection,
5-fold cross-validation gave a classification accuracy of
0.676 and an AUC of 0.724. After feature selection, classification
accuracy improved to 0.744 and AUC to 0.812.
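The workflow in the Figure 2 caption (rank genes by importance, keep the top 17, score with 5-fold cross-validation) can be sketched as follows. This is a minimal illustration and not the poster's actual code: it assumes scikit-learn, uses a synthetic matrix in place of GSE37147, and the tree count, random seeds, and the choice to perform selection inside each cross-validation fold are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a samples x genes expression matrix (NOT GSE37147 data).
X, y = make_classification(n_samples=200, n_features=300, n_informative=15,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "roc_auc"]

# Baseline: Random Forest on all features.
baseline = RandomForestClassifier(n_estimators=200, random_state=0)

# Feature selection: keep the 17 features with the highest importance,
# then classify on those features only. threshold=-inf makes max_features
# the only selection criterion.
selected = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=0),
        threshold=-np.inf, max_features=17)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

def summarize(model):
    """Mean 5-fold accuracy and ROC AUC for a model."""
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    return {metric: float(scores[f"test_{metric}"].mean()) for metric in scoring}

baseline_scores = summarize(baseline)
selected_scores = summarize(selected)
print("all features:", baseline_scores)
print("top 17 features:", selected_scores)
```

Placing the selection step inside the pipeline means the importance ranking is recomputed on each training fold, so the held-out fold never influences which genes are kept. The poster does not state how selection and cross-validation were ordered, so this is a design choice of the sketch rather than a description of the original analysis.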
1. Conceptual and computing framework
‘Biomarker: ML perspective’
• To classify COPD from SC phenotypes: ‘Classification problem’. Random Forest: classification + feature selection (‘top-weighted gene as biomarker’).
• ‘Hub gene’, influencing many other genes: ‘Regression problem’. Support Vector: regression. Control association profile - COPD association profile = which genes have the ‘most different association patterns’?
3. Support Vector Machine (SVM) to identify COPD
hub genes
Figure 3. SVM results show the bronchial genes that are “hub genes”
in COPD compared to SC. First, SVM was used to compute the gene
expression correlation matrices separately for the COPD and SC cohorts.
For each gene, we calculated the difference in correlation patterns between
the COPD matrix and the SC matrix. These values were then used to identify
the genes whose gene network topology differs most between
COPD and SC.
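One way to read the procedure in the Figure 3 caption is sketched below. The poster does not give the exact formulation, so everything here is an assumption for illustration: each gene is regressed on all other genes with a linear support vector regressor (scikit-learn's `LinearSVR`), the fitted weights form a per-cohort association matrix, and a gene's hub score is the summed absolute difference of its weights between cohorts. The data are simulated, with gene 0 planted as a hub in the "COPD" cohort only.

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
n_samples, n_genes = 400, 12

def simulate(hub_active):
    """Simulated expression matrix; gene 0 drives genes 1-5 when hub_active."""
    X = rng.normal(size=(n_samples, n_genes))
    if hub_active:
        X[:, 1:6] = X[:, [0]] + 0.5 * rng.normal(size=(n_samples, 5))
    # Standardize each gene so weights are comparable across genes.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def association_matrix(X):
    """W[i, j] = weight of gene j when predicting gene i from all other genes."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for target in range(n):
        predictors = np.delete(np.arange(n), target)
        model = LinearSVR(C=1.0, epsilon=0.0, max_iter=10000, random_state=0)
        model.fit(X[:, predictors], X[:, target])
        W[target, predictors] = model.coef_
    return W

W_copd = association_matrix(simulate(hub_active=True))   # hypothetical COPD cohort
W_sc = association_matrix(simulate(hub_active=False))    # hypothetical SC cohort

# Control association profile minus COPD association profile, per gene pair.
diff = W_sc - W_copd

# Hub score: how much a gene's influence on all other genes changes between cohorts.
hub_scores = np.abs(diff).sum(axis=0)
ranking = np.argsort(hub_scores)[::-1]
print("genes ranked by change in association pattern:", ranking[:5])
```

Summing over columns scores each gene by its role as a predictor of other genes, which matches the "influencing many other genes" notion of a hub in Figure 1. Summing over rows, or using plain correlation matrices instead of regression weights, would be equally defensible variants.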
Figure 1. We conceptualize the identification of COPD biomarkers as
two different problems that can be solved by ML computing frameworks.
Methods and Results (cont.)
4. Two different types of biomarkers highlight
two different molecular processes in COPD vs SC
Figure 4. Distinct association patterns
found in COPD vs SC bronchial gene
expression correlation matrix (legend: positive correlation, negative correlation).
• Using two different ML techniques, we identified a bronchial gene expression
signature for COPD from data obtained by bronchoscopy.
• Many of the highest-ranked genes have been reported as COPD
biomarkers in studies assaying COPD lung tissue, suggesting that bronchial
brushing can serve as a reliable and robust surrogate tissue for assessing
COPD status without the need to sample lung tissue.
• Novel gene expression patterns in COPD airways suggest novel mechanisms
of airflow obstruction.
I would like to give special thanks to Thanh Nguyen, Ph.D. from the UAB
Informatics Institute for his ideas and support during the execution of this project.