Biology

Chemistry

Principal Components Analysis

Informatics

Principal Component Analysis (PCA) of metabolomic sample processing methods

Goal:
Use PCA to identify the major modes of variance
(Data used: Pumpkin data 1.csv)
Topics:
1. Principal component number selection
2. Data pretreatment
3. PCA results visualization
Principal Components Analysis

Steps
1. Calculate a PCA model
2. Select the optimal number of principal components (PCs) for the model
3. Overview PCA scores and loadings plots
4. Repeat steps 1-2 using data centering and scaling
Visualize:
1. Sample scores annotated by extraction and treatment
2. Leverage and DmodX (distance from model plane)
3. Variable loadings and biplots
Exercise:
1. How many PCs are needed to capture 80% variance for raw data and scaled data?
2. Are there any moderate or extreme outliers?
3. What variables contribute most to the variance for raw and scaled data?
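The steps above can be sketched in Python with NumPy. This is a minimal illustration, not the workflow used to produce the slides: the `pca` helper and the synthetic matrix `X` are assumptions standing in for the tutorial's software and for Pumpkin data 1.csv, and autoscaling is taken to mean mean-centering followed by division by each variable's standard deviation.

```python
import numpy as np

def pca(X, scale=False):
    """PCA via SVD of the centered (optionally autoscaled) data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # mean-center each variable
    if scale:
        Xc = Xc / X.std(axis=0, ddof=1)      # autoscale: unit variance per variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                           # sample coordinates on each PC
    loadings = Vt.T                          # variable weights for each PC
    explained = s**2 / np.sum(s**2)          # fraction of variance per PC
    return scores, loadings, explained

# Synthetic stand-in for the data (assumption): 12 samples x 4 variables
# measured on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4)) * np.array([1000.0, 50.0, 5.0, 0.5])

raw_scores, raw_loadings, raw_explained = pca(X, scale=False)
auto_scores, auto_loadings, auto_explained = pca(X, scale=True)
print("raw:       ", np.round(raw_explained, 3))
print("autoscaled:", np.round(auto_explained, 3))
```

Comparing `raw_explained` with `auto_explained` is one way to approach Exercise 1, since the variance captured per PC usually changes a great deal once the variables are put on a common scale.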

PCA Variance Explained
(raw data)
• PCs can be selected to explain a minimum % of the variance in the data (~80%)
• PCs explaining less than 1% of the variance can be excluded
• Q2 is the cross-validated PCA prediction of left-out data
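The two selection rules above can be written as a short function. This is a sketch under stated assumptions: the name `select_n_pcs`, its parameters, and the example variance fractions are hypothetical, and taking the smaller of the two counts is one possible way to combine the rules, not something the slides prescribe. Q2 is not computed here.

```python
import numpy as np

def select_n_pcs(explained, min_cumulative=0.80, min_per_pc=0.01):
    """Return how many leading PCs to keep, plus the cumulative variance.

    `explained` holds the fraction of variance per PC, sorted in
    descending order (as PCA produces it).
    """
    explained = np.asarray(explained, dtype=float)
    cumulative = np.cumsum(explained)
    # smallest number of PCs whose cumulative variance reaches the target
    n_for_target = int(np.searchsorted(cumulative, min_cumulative) + 1)
    # number of PCs that each explain at least min_per_pc of the variance
    n_above_floor = int(np.sum(explained >= min_per_pc))
    return min(n_for_target, n_above_floor), cumulative

# Hypothetical variance fractions for eight PCs (illustration only)
explained = [0.52, 0.21, 0.12, 0.08, 0.04, 0.02, 0.006, 0.004]
n_pcs, cumulative = select_n_pcs(explained)
print("PCs to keep:", n_pcs)
print("cumulative variance:", np.round(cumulative, 3))
```

With these example fractions the first two PCs reach 73% and the third brings the total past the 80% target, so three PCs are kept.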
PCA Scores (raw data)
• Hotelling's T² ellipse shows the 95% confidence region for a bivariate normal distribution
• Samples lying outside the ellipse could be outliers
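A rough check for samples outside such an ellipse is sketched below. It is an approximation chosen to keep the example dependency-free: the limit is the chi-square quantile for two dimensions, which has the closed form -2 ln(1 - p), whereas PCA software typically uses an F-distribution-based limit that is somewhat wider for small sample sizes. The function name and the synthetic scores are assumptions.

```python
import numpy as np

def outside_t2_ellipse(scores_2d, confidence=0.95):
    """Flag samples outside the confidence ellipse of a two-PC scores plot."""
    t = np.asarray(scores_2d, dtype=float)
    t = t - t.mean(axis=0)                   # PCA scores are centered; harmless if already so
    var = t.var(axis=0, ddof=1)              # variance of the scores on each PC
    t2 = np.sum(t**2 / var, axis=1)          # squared distance, scaled per PC
    # chi-square quantile with 2 degrees of freedom (large-sample approximation)
    limit = -2.0 * np.log(1.0 - confidence)
    return t2 > limit, t2, limit

# Synthetic two-PC scores (assumption) with one sample placed far along PC1
rng = np.random.default_rng(1)
scores = rng.normal(size=(30, 2)) * np.array([3.0, 1.0])
scores[0] = [15.0, 0.0]

flags, t2, limit = outside_t2_ellipse(scores)
print("limit:", round(float(limit), 3))
print("samples outside the ellipse:", np.flatnonzero(flags))
```

Because PC scores are uncorrelated, scaling each axis by its own variance is enough here; no full covariance matrix is needed.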
PCA Loadings (raw data, centered)
• Loadings from PCA of unscaled data are highly correlated with variable magnitude
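The effect can be reproduced on synthetic data. In the sketch below every variable is given the same 20% relative variation, so the variables differ only in magnitude; the abundance values are hypothetical and not taken from the pumpkin data.

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([10000.0, 500.0, 20.0, 1.0])    # hypothetical metabolite abundances
# same 20% relative variation for every variable, so only magnitude differs
X = means * (1.0 + 0.2 * rng.normal(size=(20, 4)))

Xc = X - X.mean(axis=0)                          # centering only, no scaling
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_loadings = np.abs(Vt[0])                     # absolute loadings on PC1

for mean, loading in zip(means, pc1_loadings):
    print(f"mean abundance {mean:>8.1f}  |PC1 loading| {loading:.4f}")
```

The most abundant variable takes almost all of the PC1 loading even though no variable is relatively more variable than the others, which is the motivation for scaling before PCA.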
PCA Biplot (raw data)
• Biplots can be used to rapidly overview the correlation between sample scores and variable loadings
[Biplot annotations: one sample contains high maleic acid; another contains high sucrose (low maleic acid)]

PCA Leverage and DmodX
(raw data)

• Leverage is a sample's distance from the center of all samples within the PCA plane (high values indicate extreme outliers)
• Distance to model X (DmodX) is a sample's orthogonal distance to the PCA plane (high values indicate moderate outliers)
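Both diagnostics can be computed from a fitted PCA model, as in the sketch below. The formulas are one common formulation and should be read as assumptions: some packages add 1/n to the leverage, and DmodX is normalised with slightly different correction factors in different software. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def leverage_and_dmodx(X, n_components=2):
    """Distance within the PCA plane (leverage) and orthogonal to it (DmodX)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]       # scores
    P = Vt[:n_components].T                          # loadings
    # leverage: squared scores scaled by each PC's sum of squares, summed over PCs
    leverage = np.sum(T**2 / np.sum(T**2, axis=0), axis=1)
    # residuals: the part of each sample not described by the model plane
    E = Xc - T @ P.T
    # DmodX: residual standard deviation per sample, relative to the pooled value
    s_i = np.sqrt(np.sum(E**2, axis=1) / (k - n_components))
    s_0 = np.sqrt(np.sum(E**2) / ((n - n_components - 1) * (k - n_components)))
    return leverage, s_i / s_0

# Synthetic data (assumption): two variables carry the main variation
rng = np.random.default_rng(3)
X = 0.1 * rng.normal(size=(20, 5))                   # small noise in all five variables
X[:, :2] += 10.0 * rng.normal(size=(20, 2))
X[0, :2] = [60.0, -60.0]                             # sample 0: extreme within the plane
X[1] = [0.0, 0.0, 0.0, 0.0, 3.0]                     # sample 1: displaced away from the plane

leverage, dmodx = leverage_and_dmodx(X, n_components=2)
print("highest leverage:", int(np.argmax(leverage)))
print("highest DmodX:   ", int(np.argmax(dmodx)))
```

Sample 0 sits far from the center but inside the model plane, so it shows up in leverage; sample 1 is close to the center but off the plane, so it shows up in DmodX.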

PCA Variance Explained
(autoscaled)

PCA Scores (autoscaled)
• Loadings on PC1 describe differences due to extraction
• Loadings on PC2 describe differences due to treatment

PCA Leverage and DmodX
(autoscaled)
• Samples with both high leverage and high DmodX are likely outliers
• Re-evaluate the PCA results after removing them
PCA Loadings (autoscaled)
• Loadings from scaled data are independent of variable magnitude and show the rich variance structure of the data

Relationship between scores and
loadings (autoscaled)

[Figure annotations, by extraction: higher in 100% MeOH; lower in 100% MeOH]
Loadings and Scores
[Figure annotations: highest negative loading on PC1; highest positive loading on PC1]

3 principal components analysis
