Heuristic PCA Based Feature Extraction  and  Its Application to Bioinformatics

Presentation at "New Developments of Multivariate Statistical Methodologies - Robust, High Speed, and High-Accuracy", 25th-27th Nov 2014, Tsukuba Univ., Japan.

Book chapter is here

  • 1. Heuristic PCA Based Feature Extraction and Its Application to Bioinformatics. Y-h. Taguchi, Dept. Phys., Chuo Univ.; Y. Murakami, Grad. Sch. Med., Osaka City Univ.; M. Iwadate, Dept. Biol. Sci., Chuo Univ.; H. Umeyama, Dept. Biol. Sci., Chuo Univ.; A. Okamoto, Dept. Sch. Health Sci., Aichi Univ. Edu.
  • 2. 0. Why PCA? PCA = principal component analysis Motivation: Unsupervised Feature Selection How PCA?
  • 3. Synthetic data: 100 features ✕ 20 samples (class 1, then class 2). 10 ordered features, each with the pattern 11111111110000000000; 90 random features, e.g. 01000000110110011111, 00011110000101011101, ..., 01000011000110101111. How to select the 10 ordered features without classification information?
  • 4. Embedding 100 features into 2D using PCA 90 random Features 10 Ordered Features
  • 5. PC1 represents discrimination between class 1 and class 2 Class 1 Class 2 20 samples
  • 6. Applying a “weak” unitary transformation to the space spanned by the 20 samples... (Figure: 100 features ✕ 20 samples, 10 ordered features and 90 random features, class 1 and class 2, before and after the transformation.)
  • 7. The same 2D embedding. Thus we can select 10 features. 10 Ordered Features 90 random Features
  • 8. PC1 “weakly” represents discrimination between class 1 and class 2 Class 1 Class 2 20 samples
  • 9. Linear discriminant analysis + leave-one-out cross validation using the 10 ordered features. Confusion matrix (rows: predicted class, columns: true class 1, 2): predicted 1: 8, 2; predicted 2: 2, 8. Accuracy = Sensitivity = Specificity = 80%. How about real examples?
  • 10. 1. Real example 1: Disease-associated aberrant promoter methylation (methylation of gene promoters). Three autoimmune diseases: SLE, RA, DM. [MZ twins (healthy + sick) + 2 healthy controls] ✕ 5 = 20 samples per disease → ✕ 3 diseases = 60 samples, vs. ≈ 1000 potential methylation sites.
  • 11. Embedding of 〜1000 promoters within 20 RA samples into 2D with PCA (PC2 vs PC3) PC3 Outlier promoters, Selected PC2
  • 12. PC2 for RA over the 20 samples (male and female). Legend: ◯: sick twin, △: healthy twin, +: healthy control 1, ☓: healthy control 2. Twins: healthy > sick. Controls: no. The 4th set: no. → This is the reason why unsupervised feature selection is needed.
  • 13. Scatter plots between healthy and RA twins (axes: healthy twins, RA twins). Red dots = selected promoters. Panel P-values: P<2.2✕10^-16, P=2.2✕10^-12, P=3.7✕10^-12, P=3.9✕10^-1, P<2.2✕10^-16. Individual promoters are significantly aberrantly methylated. Thus, feature selection is successful. After repeating the same procedure for the two additional diseases (SLE and DM)....
  • 14. Among three autoimmune diseases, selected promoters are mostly common. No other methods can achieve such an excellent coincidence between three autoimmune diseases.
  • 15. Lessons to learn: Predefined class definition (e.g., 'sick twin' vs 'healthy twin + two healthy controls') is not a good strategy to extract “important” features that can exhibit much more complicated behavior (e.g., upregulated for male while downregulated for female)
  • 16. Additional Remarks: Similar procedures were applied to squamous cell carcinoma(*), and genes with genotype-specific DNA methylation were extracted. These genes were identified as cancer-related genes using literature searches, and in silico drug screening was performed for these genes (BMC Syst. Biol., in press, to be presented at APBC2014). (*) Esophageal cancer
  • 17. 2. Real example 2: Circulating biomarker findings for liver diseases. Why “circulating biomarker”? → Non-invasive, thus less stressful. Circulating = blood, etc. Target in this talk: microRNAs in blood → microRNA is non-protein-coding RNA that regulates other transcripts.
  • 18. Data set: 14 diseases + healthy control For example, 2D embeddings of 〜900 blood miRNAs using PCA in 32 lung cancer + 70 healthy controls PC2 10 outlier miRNAs PC1 However PC1 does not exhibit clear distinction between lung cancer and normal control any more.... (not shown here)
  • 19. Prediction: control vs. lung cancer. LDA with PCA, leave-one-out cross validation (using 10 miRNAs, up to the 5th PC). Confusion matrix (rows: predicted, columns: true control, true lung cancer): predicted control: 56, 8; predicted lung cancer: 14, 24. Accuracy 0.784, Specificity 0.800, Sensitivity 0.750, Precision 0.632.
  • 20. What is the advantage of PCA based feature extraction? → Stability. Cross validation test (10 folds) of the stability of feature extraction (100 trials): 14 diseases vs. normal control ✕ 10 miRNAs = 140 miRNAs selected. Ideally, the same 140 miRNAs are selected in all 100 trials. As a result, 129 out of 140 miRNAs are selected with 100% probability.
  • 21. Comparison of stability with other feature extraction methods. UFF(*): 111 out of 140 miRNAs; t-test based: 40 out of 140 miRNAs; SAM: 30 out of 140 miRNAs; gsMMD: 5 and 1 out of 140 miRNAs; RFE: 1 out of 140 miRNAs; ensemble RFE: 0 out of 140 miRNAs. (*) The only other unsupervised FE method.
  • 22. Lessons to learn: Predefined class definition (e.g., 'sick twin' vs 'healthy twin + two healthy controls') is not a good strategy to extract “stable” features. Relying too heavily on classification information may harm the stability of the selected features.
  • 23. Additional remarks: The sets of 10 miRNAs selected as biomarkers that discriminate each of the 14 diseases from normal controls largely overlapped (every set of 10 miRNAs was chosen from 12 common miRNAs). In addition, these 12 miRNAs discriminate seven additional diseases from healthy controls, even with different measuring methodologies, samples and studies (submitted).
  • 24. 3. Real example 3: Analysis of the proteome during bacterial incubation. Purpose: Antibiotics are nothing but a disaster for bacteria. They also kill non-toxic bacteria and thus cause drug resistance. Drugs that target proteins more specific to each bacterium would be much better and more effective. To achieve this, we first need to know how the proteome changes in response to environmental changes.
  • 25. Data set: Two incubation conditions: stable (normal) and shaking (oxidative stress). Two fractions: cellular and supernatant. Four time points: from early through middle to final growth phase. Three biological replicates. In total, 2 ✕ 2 ✕ 4 ✕ 3 = 48 samples are available.
  • 26. 2D embedding of 48 samples using PCA Cellular PC2 early supernatant PC1 late supernatant
  • 27. PCA embeddings of proteins (PC1 vs PC2). 23 proteins selected (underlined are ribosomal proteins): SPy1489:hlpA SPy2039:speB Spy1073:rplL SPy2005 SPy2018:emm1 Spy0059:rpmC Spy0611:tufA Spy0274:plr Spy0062:rplX SPy2043:mf Spy0613:tpi Spy2079:AhpC SPy1831:rpsF Spy2160:rpmG SPy1373:ptsH SPy0731:eno Spy1371:gapN Spy1881:pgk SPy0711:speC Spy0071:rpmD SPy2070:groEL Spy0019 SPy0712:mf2
  • 28. using 23 proteins extracted via PCA PC2 PC1
  • 29. Lessons to learn: Even if there is no criterion for what kind of classification is assumed, unsupervised feature extraction can select prominent features.
  • 30. 4. Discussion Real example 1: Commonly methylated promoters between three autoimmune diseases were found by unsupervised feature extraction. Real example 2: Stable circulating biomarkers were selected for 14 diseases using unsupervised feature extraction. Real example 3: Successful extraction of prominent features with unsupervised feature extraction
  • 31. Unsupervised feature extraction seems to be the best method, however... When does PCA based feature extraction work? Is PCA based feature extraction the best? Are there any better unsupervised feature extraction methods? How can we evaluate unsupervised feature extraction? Are there any variables to be maximized?
  • 32. I believe that people here should be experts on these topics. Help me....
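
The toy example of slides 3 to 5 (10 ordered features and 90 random features over 20 samples) can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`make_data`, `center_columns`, `first_pc`, `pc_scores`, `select_outliers`), the random seed, the per-sample centering, and the use of power iteration to get PC1 are all assumptions made here. Features are embedded along PC1 and the 10 features with the largest absolute scores are taken as outliers, without using class labels.

```python
import random


def make_data(n_ordered=10, n_random=90, n_per_class=10, seed=0):
    """Rows are features, columns are samples (class 1 first, then class 2)."""
    rng = random.Random(seed)
    ordered = [[1.0] * n_per_class + [0.0] * n_per_class for _ in range(n_ordered)]
    noise = [[float(rng.randint(0, 1)) for _ in range(2 * n_per_class)]
             for _ in range(n_random)]
    return ordered + noise


def center_columns(X):
    """Subtract each sample's mean over features (assumed preprocessing)."""
    n_rows, n_cols = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in X]


def first_pc(X, n_iter=500):
    """Leading eigenvector of the sample-by-sample covariance (power iteration)."""
    n_rows, n_cols = len(X), len(X[0])
    cov = [[sum(row[a] * row[b] for row in X) / n_rows for b in range(n_cols)]
           for a in range(n_cols)]
    v = [float(j + 1) for j in range(n_cols)]  # arbitrary start, no class labels
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def pc_scores(X, v):
    """Project every feature (row) onto the component v."""
    return [sum(x * y for x, y in zip(row, v)) for row in X]


def select_outliers(scores, k=10):
    """Indices of the k features with the largest absolute score."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)[:k]


X = center_columns(make_data())
pc1 = first_pc(X)
scores = pc_scores(X, pc1)
selected = select_outliers(scores, k=10)
print("selected feature indices:", sorted(selected))
```

Indices 0 to 9 are the ordered features in this sketch, so the printed list shows how many of them were recovered. Because the ordered features share one pattern, they receive identical PC1 scores; a random feature can occasionally score higher, depending on the seed.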
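
The performance figures on slides 9 and 19 follow directly from the confusion matrices. A minimal sketch, assuming that lung cancer is treated as the positive class on slide 19 (the function name `confusion_metrics` and its argument names are chosen here for illustration):

```python
def confusion_metrics(tn, fn, fp, tp):
    """Standard metrics from the four cells of a 2x2 confusion matrix."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Slide 19: predicted control = (56 true control, 8 true lung cancer),
#           predicted lung cancer = (14 true control, 24 true lung cancer).
lung = confusion_metrics(tn=56, fn=8, fp=14, tp=24)

# Slide 9: toy example, 8 correct and 2 wrong in each class.
toy = confusion_metrics(tn=8, fn=2, fp=2, tp=8)

for name, value in lung.items():
    print(f"{name}: {value:.3f}")
```

With the slide 19 counts this reproduces the reported values (0.784, 0.800, 0.750, 0.632), and the toy counts give 80% for accuracy, sensitivity and specificity.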