Heuristic PCA Based Feature Extraction and Its Application to Bioinformatics

Presentation at "New Developments of Multivariate Statistical Methodologies -Robust, High Speed, and High-Accuracy", 25th-27th Nov 2014, Tsukuba Univ., Japan, http://www.math.tsukuba.ac.jp/~aoshima-lab/symposium.html

The corresponding book chapter is here:
https://www.researchgate.net/publication/271198208_Heuristic_Principal_Component_Analysis-Based_Unsupervised_Feature_Extraction_and_Its_Application_to_Bioinformatics


  1. 1. Heuristic PCA Based Feature Extraction and Its Application to Bioinformatics. Y-h. Taguchi, Dept. Phys., Chuo Univ.; Y. Murakami, Grad. Sch. Med., Osaka City Univ.; M. Iwadate, Dept. Biol. Sci., Chuo Univ.; H. Umeyama, Dept. Biol. Sci., Chuo Univ.; A. Okamoto, Dept. Sch. Health Sci., Aichi Univ. Edu.
  2. 2. 0. Why PCA? PCA = principal component analysis. Motivation: unsupervised feature selection. How PCA?
  3. 3. Toy data: 100 features ✕ 20 samples (class 1 and class 2), made of 10 ordered features and 90 random features. Each ordered feature reads 11111111110000000000 across the 20 samples. The random features are random binary rows such as 01000000110110011111, 00011110000101011101, ..., 01000011000110101111. How can the 10 ordered features be selected without classification information?
  4. 4. Embedding 100 features into 2D using PCA. (Figure labels: 90 random features, 10 ordered features.)
  5. 5. PC1 represents the discrimination between class 1 and class 2. (Figure labels: class 1, class 2; axis: 20 samples.)
  6. 6. Applying a “weak” unitary transformation to the space spanned by the 20 samples... (Figure labels: 20 samples, 20 samples, 100 features, class 1, class 2, 10 ordered features, 90 random features.)
  7. 7. The same 2D embedding. Thus we can select the 10 features. (Figure labels: 10 ordered features, 90 random features.)
  8. 8. PC1 “weakly” represents the discrimination between class 1 and class 2. (Figure labels: class 1, class 2; axis: 20 samples.)
  9. 9. Linear discriminant analysis + leave-one-out cross validation using the 10 ordered features .... Confusion matrix (rows: predicted class, columns: true class): predicted 1: 8, 2; predicted 2: 2, 8. Accuracy = Sensitivity = Specificity = 80%. How about real examples?
  10. 10. 1. Real example 1: Disease-associated aberrant promoter methylation. (Figure labels: methylation, gene, promoter.) Three autoimmune diseases: SLE, RA, DM. [MZ twins (healthy + sick) + 2 healthy controls] ✕ 5 = 20 samples → ✕ 3 diseases = 60 samples vs ≈ 1000 potential methylation sites.
  11. 11. Embedding of ~1000 promoters within 20 RA samples into 2D with PCA (PC2 vs PC3). Outlier promoters are selected. (Figure axes: PC2, PC3.)
  12. 12. PC2: RA. (Figure: PC2 over the 20 samples, male and female; ◯: sick twin, △: healthy twin, +: healthy control 1, ☓: healthy control 2.) Twins: Healthy > Sick. Controls: No. The 4th set: No → the reason why unsupervised feature selection is needed.
  13. 13. Scatter plots between healthy/RA twins. Red dots = selected promoters. (Figure axes: healthy twins, RA twins; panel P-values: P<2.2✕10^-16, P=2.2✕10^-12, P=3.7✕10^-12, P=3.9✕10^-1, P<2.2✕10^-16.) Individual promoters are significantly aberrantly methylated. Thus, feature selection is successful. After repeating the same procedure for the two additional diseases (SLE and DM)....
  14. 14. Among the three autoimmune diseases, the selected promoters are mostly common. No other method can achieve such an excellent coincidence between the three autoimmune diseases.
  15. 15. Lessons to learn: A predefined class definition (e.g., 'sick twin' vs 'healthy twin + two healthy controls') is not a good strategy for extracting “important” features that can exhibit much more complicated behavior (e.g., upregulated in males while downregulated in females).
  16. 16. Additional Remarks: Similar procedures were applied to squamous cell carcinoma(*), and genes with genotype-specific DNA methylation were extracted. These genes were identified as cancer-related genes using literature searches, and in silico drug screening was performed for these genes (BMC Syst. Biol., in press; to be presented at APBC2014). (*) Esophageal cancer
  17. 17. 2. Real example 2: Circulating biomarker findings for liver diseases. Why “circulating biomarker”? → Non-invasive, thus less stressful. Circulating = blood, etc. Target in this talk: microRNAs in blood → a microRNA is a non-protein-coding RNA that regulates other transcripts.
  18. 18. Data set: 14 diseases + healthy control. For example, 2D embedding of ~900 blood miRNAs using PCA in 32 lung cancer + 70 healthy controls. (Figure axes: PC1, PC2; label: 10 outlier miRNAs.) However, PC1 no longer exhibits a clear distinction between lung cancer and normal control.... (not shown here)
  19. 19. Prediction: Control vs Lung Cancer. LDA with PCA, leave-one-out cross validation (using 10 miRNAs, up to the 5th PC). Confusion matrix (rows: predicted, columns: true): predicted control: 56 true control, 8 true lung cancer; predicted lung cancer: 14 true control, 24 true lung cancer. Accuracy 0.784, Specificity 0.800, Sensitivity 0.750, Precision 0.632.
  20. 20. What is the advantage of PCA based feature extraction? → Stability. Cross validation test (10 folds) of the stability of feature extraction (100 trials): 14 diseases vs normal control ✕ 10 miRNAs = 140 miRNAs selected. Ideally, the same 140 miRNAs are selected in all 100 trials. As a result, 129 out of 140 miRNAs are selected with 100% probability.
  21. 21. Comparison of stabilities with other feature extraction methods: UFF(*): 111 out of 140 miRNAs; t-test based: 40 out of 140 miRNAs; SAM: 30 out of 140 miRNAs; gsMMD: 5 and 1 out of 140 miRNAs; RFE: 1 out of 140 miRNAs; ensemble RFE: 0 out of 140 miRNAs. (*) The only other unsupervised FE method.
  22. 22. Lessons to learn: A predefined class definition (e.g., 'sick twin' vs 'healthy twin + two healthy controls') is not a good strategy for extracting “stable” features. Relying too heavily on classification information may harm the stability of the selected features.
  23. 23. Additional remarks: The sets of 10 miRNAs selected as biomarkers that discriminate the 14 diseases from normal control largely overlapped (every set of 10 miRNAs was chosen from 12 common miRNAs). In addition, these 12 miRNAs discriminate seven additional diseases from healthy controls, even when using different measuring methodologies, samples, and studies (submitted).
  24. 24. 3. Real example 3: Analysis of proteome during bacterial incubation. Purpose: Antibiotics are nothing but a disaster for bacteria. They also kill non-toxic bacteria and thus cause resistance to drugs. Drugs that target proteins more specific to each bacterium would be much better and more effective. To get there, we first need to know how the proteome changes in response to environmental changes.
  25. 25. Data set: Two incubation conditions: stable (normal) and shaking (oxidative stress). Two fractions: cellular and supernatant. Four time points: from early to final through the middle growth phase. Three biological replicates. In total, 2 ✕ 2 ✕ 4 ✕ 3 = 48 samples are available.
  26. 26. 2D embedding of 48 samples using PCA. (Figure axes: PC1, PC2; cluster labels: cellular, early supernatant, late supernatant.)
  27. 27. PCA embeddings of proteins. 23 proteins selected (underlined are ribosomal proteins). (Figure axes: PC1, PC2.) SPy1489:hlpA, SPy2039:speB, Spy1073:rplL, SPy2005, SPy2018:emm1, Spy0059:rpmC, Spy0611:tufA, Spy0274:plr, Spy0062:rplX, SPy2043:mf, Spy0613:tpi, Spy2079:AhpC, SPy1831:rpsF, Spy2160:rpmG, SPy1373:ptsH, SPy0731:eno, Spy1371:gapN, Spy1881:pgk, SPy0711:speC, Spy0071:rpmD, SPy2070:groEL, Spy0019, SPy0712:mf2
  28. 28. using 23 proteins extracted via PCA. (Figure axes: PC1, PC2.)
  29. 29. Lessons to learn: Even when there is no criterion for what kind of classification to assume, unsupervised feature extraction can select prominent features.
  30. 30. 4. Discussion. Real example 1: Commonly methylated promoters between three autoimmune diseases were found by unsupervised feature extraction. Real example 2: Stable circulating biomarkers were selected for 14 diseases using unsupervised feature extraction. Real example 3: Successful extraction of prominent features with unsupervised feature extraction.
  31. 31. Unsupervised feature extraction seems to be the best method, however... When does PCA based feature extraction work? Is PCA based feature extraction the best? Are there any better unsupervised feature extraction methods? How can we evaluate unsupervised feature extraction? Is there any quantity to be maximized?
  32. 32. I believe that people here are experts on these topics. Help me....
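
The toy example on slides 3 to 5 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the random features are drawn as fair coin flips, each sample column is mean-centered before the decomposition, and the "outlier" features are taken to be the 10 with the largest absolute PC1 score. The function names `make_toy_data`, `embed_features` and `select_outlier_features` are hypothetical.

```python
import numpy as np

def make_toy_data(n_ordered=10, n_random=90, n_per_class=10, seed=0):
    """Rows are features, columns are samples (class 1 first, then class 2)."""
    rng = np.random.default_rng(seed)
    # Ordered features: 1 for every class-1 sample, 0 for every class-2 sample.
    pattern = np.r_[np.ones(n_per_class), np.zeros(n_per_class)]
    ordered = np.tile(pattern, (n_ordered, 1))
    # Random features: independent 0/1 entries (an assumption of this sketch).
    random_part = rng.integers(0, 2, size=(n_random, 2 * n_per_class)).astype(float)
    return np.vstack([ordered, random_part])

def embed_features(X, n_components=2):
    """PCA in which the features are the embedded points.

    Returns per-feature scores (n_features x n_components) and the
    matching loadings over the samples (n_components x n_samples).
    """
    Xc = X - X.mean(axis=0, keepdims=True)  # center each sample column (assumption)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components]
    return scores, loadings

def select_outlier_features(scores, n_select=10):
    """Pick the features lying farthest from the origin along PC1."""
    order = np.argsort(-np.abs(scores[:, 0]))
    return np.sort(order[:n_select])

X = make_toy_data()
scores, loadings = embed_features(X)
selected = select_outlier_features(scores)
print("selected feature indices:", selected)
print("PC1 loadings over the 20 samples:", np.round(loadings[0], 2))
```

In this layout the ordered features are rows 0 to 9, so one would expect the selected indices to be dominated by them, and the PC1 loadings to take opposite signs for the two halves of the samples. No class labels are used anywhere in the selection step.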
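
The performance figures on slides 9 and 19 follow directly from the confusion matrices. The sketch below recomputes them, assuming the rows of each matrix are predicted classes and the columns are true classes, and treating lung cancer (or class 1 in the toy case) as the positive class. The helper `binary_metrics` is a hypothetical name introduced for this illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from the four cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positives among actual positives
        "specificity": tn / (tn + fp),   # true negatives among actual negatives
        "precision": tp / (tp + fp),     # true positives among predicted positives
    }

# Slide 19: predicted control = (56 true control, 8 true lung cancer),
#           predicted lung cancer = (14 true control, 24 true lung cancer).
lung = binary_metrics(tp=24, fp=14, fn=8, tn=56)

# Slide 9 (toy data): predicted 1 = (8, 2), predicted 2 = (2, 8).
toy = binary_metrics(tp=8, fp=2, fn=2, tn=8)

for name, result in (("lung cancer", lung), ("toy data", toy)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

With this reading of the rows and columns, the recomputed values agree with the numbers quoted on the slides (0.784, 0.750, 0.800 and 0.632 for lung cancer; 80% throughout for the toy data).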
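
The stability test on slide 20 can be imitated on synthetic data. The slides do not spell out the exact protocol, so the sketch below makes its own assumptions: in each trial the samples are split into 10 folds, each fold is held out in turn, the top 10 features by absolute PC1 score are re-selected on the remaining samples, and the fraction of runs in which each feature is selected is reported. The data are the toy matrix of slide 3, not the miRNA data, and the function names are hypothetical.

```python
import numpy as np

def top_features_by_pc1(X, n_select=10):
    """Indices of the features with the largest absolute PC1 score."""
    Xc = X - X.mean(axis=0, keepdims=True)  # center each sample column (assumption)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    pc1_scores = U[:, 0] * S[0]
    return np.argsort(-np.abs(pc1_scores))[:n_select]

def selection_frequency(X, n_folds=10, n_trials=100, n_select=10, seed=0):
    """Fraction of held-out-fold runs in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    counts = np.zeros(n_features)
    n_runs = 0
    for _ in range(n_trials):
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for held_out in folds:
            keep = np.setdiff1d(np.arange(n_samples), held_out)
            counts[top_features_by_pc1(X[:, keep], n_select)] += 1
            n_runs += 1
    return counts / n_runs

# Toy matrix: 10 ordered features (rows 0-9) plus 90 random binary features.
rng = np.random.default_rng(1)
pattern = np.r_[np.ones(10), np.zeros(10)]
X = np.vstack([np.tile(pattern, (10, 1)),
               rng.integers(0, 2, size=(90, 20)).astype(float)])

freq = selection_frequency(X, n_trials=20)
print("features selected in every run:", np.flatnonzero(freq == 1.0))
```

A feature with frequency 1.0 corresponds to what the slide calls being selected with 100% probability. Because the selection never looks at class labels, removing a few samples changes the result only through the PCA itself.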
