This document provides an overview of computational techniques for analyzing metabolomics data. It describes several web-based tools and R packages that can be used for biomarker discovery, data analysis, and pathway analysis using metabolomics data. These include MetaboAnalyst for statistical analysis and visualization, xmsPANDA for preprocessing, biomarker discovery, clustering and network analysis, and Mummichog for pathway analysis. The document then discusses specific workflows and parameters for preprocessing raw LC-MS data, performing quality control checks, and conducting statistical analysis and visualization in MetaboAnalyst and xmsPANDA.
Cardiology_Metabolomics_workshop_2016_v2
1. Computational Techniques for Metabolomics Data Analysis
Sophia A. Banton and Karan Uppal
Clinical Biomarkers Laboratory
Emory University School of Medicine
sbanton@emory.edu, kuppal2@emory.edu
Integrated Health Science and Facilities Core
NIEHS P30 ES019776
August 11, 2016
2. Topics covered in this workshop
• Overview of metabolomics data
• Web-based tools for biomarker discovery and data analysis
– MetaboAnalyst 3.0 (hands-on)
• Using R for biomarker discovery and data analysis
– xmsPANDA (hands-on)
– Runs on R >= 3.2.0
• Mummichog for pathway analysis
– Runs on Python 2.7
5. HRM: Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty Liver Disease: An Untargeted, High Resolution Metabolomics Study
Jin and Banton, et al. Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty Liver Disease: An Untargeted, High Resolution Metabolomics Study. The Journal of Pediatrics, Volume 172, May 2016, Pages 14-19.e5.
7. Connecting HRM: Plasma Metabolomics of Common Marmosets (Callithrix jacchus) to Evaluate Diet and Feeding Husbandry
Banton et al. Plasma Metabolomics of Common Marmosets (Callithrix jacchus) to Evaluate Diet and Feeding Husbandry. JAALAS, March 2016.
8. Data Analysis Workflow
Input: LC-Orbitrap MS raw data
• Raw data processing with built-in feature and sample quality assessment (apLCMS with xMSanalyzer)
• Data exploratory analysis (box plots, histograms, etc.)
• Batch-effect evaluation and correction (using ComBat); void volume filtering
• Annotation of metabolites (xMSannotator)
• MS/MS validation and deconvolution (DeconMSn)
• Metabolite prediction based on MS/MS: Metlin (known), MassBank (known/unknown)
• Pathway analysis (Mummichog, MetaboAnalyst, MetaCore, MSEA)
• Biomarker and network analysis (xmsPANDA, MetabNet, MetaboAnalyst)
– Univariate: limma t-test, paired t-test, ANOVA, time-series
– Multivariate and predictive analysis: support vector machine, random forest, PLS-DA
– Clustering: two-way hierarchical clustering analysis
– Targeted and untargeted MWAS
Final deliverables:
1. Untargeted feature table
2. Targeted feature table
3. Annotated feature table
10. LC/MS data processing using apLCMS or XCMS with the xMSanalyzer R package
• Peak detection and alignment using apLCMS or XCMS at different parameter settings:
– Noise removal and peak detection in each run
– Peak grouping after retention time alignment
– Recovery of weaker signals / filling of missing peaks
– Summary feature table
• xMSanalyzer:
– Feature and sample quality assessment
– Merge results from different parameter settings
– Mass calibration, batch-effect evaluation and correction
– Annotation of metabolites
• Outputs:
1. Untargeted feature table
2. Targeted feature table
3. Annotated feature table
4. EIC and QC plots
11. Quality evaluation and assurance
A. xMSanalyzer has built-in routines that evaluate the quality of both features and samples
– Each sample is run in triplicate, which allows feature quality to be evaluated using the coefficient of variation (CV) and sample quality using the Pearson correlation within the technical replicates
– Only features with median CV <50%, and samples whose technical replicates have an average pairwise Pearson correlation >0.7, are retained for further analysis
– A quality score is assigned to each measured m/z that takes into account both the reliability and the reproducibility of detection
B. Batch-effect evaluation using Principal Component Analysis
C. Batch-effect correction using ComBat (Johnson et al. 2007, Biostatistics)
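The CV and correlation filters described above can be written out in a few lines. The sketch below is illustrative Python (the workshop tools themselves are R packages); the function names are ours, and the 50% median-CV cutoff follows the slide.

```python
import statistics

def cv_percent(values):
    """Coefficient of variation (%) across one sample's technical replicates."""
    m = statistics.mean(values)
    return 100.0 * statistics.stdev(values) / m if m else float("inf")

def pearson(x, y):
    """Plain Pearson correlation between two replicate intensity profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def keep_feature(replicate_intensities, max_median_cv=50.0):
    """Retain a feature if its median CV across samples is below the cutoff."""
    cvs = [cv_percent(reps) for reps in replicate_intensities]
    return statistics.median(cvs) < max_median_cv
```

Sample-level filtering works analogously: average the pairwise `pearson` values among a sample's three replicates and keep the sample if that average exceeds 0.7.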
12. Feature table – column headings
• mz: median measured mass-to-charge across all samples
• time: median retention time at which the ion elutes
• mz.min: minimum measured mass-to-charge across all samples
• mz.max: maximum measured mass-to-charge across all samples
• NumPres.All.Samples: number of samples with non-missing/non-zero values
• NumPres.Biol.Samples: number of biological samples for which 2 out of the 3 replicates have non-missing/non-zero values
• median_CV: median coefficient of variation (%) within technical replicates
• Qscore: quality score, defined as the ratio of the percentage of biological samples for which >50% of technical replicates have a signal to the median CV (%); a higher Qscore means the feature is more quantitatively reproducible within technical replicates and is detected across a larger percentage of biological samples
• Max.intensity: maximum intensity of the feature across all samples
• VT_SampleRunDate_RunNumber.cdf: integrated peak area (ion intensity) in each sample; each sample has 3 technical replicates (e.g., VT_130712_002, VT_130712_004, VT_130712_006)
Feature Quality Assessment
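The Qscore definition above translates directly into code. This is an illustrative Python sketch with hypothetical names; the exact xMSanalyzer formula may differ in detail.

```python
def qscore(replicates_per_biol_sample, median_cv_percent):
    """Qscore = (% of biological samples in which >50% of technical
    replicates have a signal) / median CV (%).

    replicates_per_biol_sample: one list of replicate intensities per
    biological sample; zeros count as missing signal."""
    detected = sum(
        1 for reps in replicates_per_biol_sample
        if sum(v > 0 for v in reps) / len(reps) > 0.5
    )
    pct_detected = 100.0 * detected / len(replicates_per_biol_sample)
    return pct_detected / median_cv_percent
```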
14. Biomarker and statistical analysis using MetaboAnalyst 3.0 (http://www.metaboanalyst.ca/)
15. Various options for feature selection and predictive evaluation
• Univariate:
– t-test, paired t-test, LIMMA-based t-test
• P-values from moderated t-tests are adjusted for multiple hypothesis testing using the Benjamini–Hochberg false discovery rate (FDR) correction method
– Manhattan plot to visualize metabolome-wide statistically significant changes
• Multivariate and data mining:
– Supervised:
• Support Vector Machine
• Partial Least Squares Discriminant Analysis
• Random Forest
– Unsupervised:
• Principal Component Analysis
• Two-way hierarchical clustering analysis
• K-means clustering
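The Benjamini–Hochberg adjustment mentioned above scales each sorted p-value by m/rank and then enforces monotonicity from the largest rank down. A minimal Python sketch:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values).
    q for the p-value of rank r is min over ranks >= r of p * m / rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * m
    prev = 1.0
    for rank_from_end, idx in enumerate(reversed(order)):
        rank = m - rank_from_end          # 1-based rank of this p-value
        q = min(prev, pvals[idx] * m / rank)
        adjusted[idx] = q
        prev = q
    return adjusted
```

Features with an adjusted p-value below the chosen FDR threshold (e.g., 0.05) are called significant; the corresponding raw-p cutoff is what the horizontal line on a Manhattan plot marks.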
36. (EXTREMELY) Useful resources
• Xia J. and Wishart D., Web-based inference of biological patterns,
functions and pathways from metabolomic data using
MetaboAnalyst, Nature Protocols 2011
• Sugimoto et al., Bioinformatics Tools for Mass Spectroscopy-
Based Metabolomic Data Processing and Analysis, Current
Bioinformatics 2012
37. xmsPANDA: R package for pre-processing, biomarker discovery,
clustering, and network analysis
38. xmsPANDA workflow
Module a) Data pre-processing (Stage 1)
• Replicate summarization
• Data filtering: missing values, relative standard deviation
• Data transformation (log, z-score)
• Normalization (quantile)
Module b) Data mining (Stage 2)
• Univariate: Limma t-test, paired t-test, Wilcoxon, mixed-effects model, ANOVA
• Multivariate and predictive analysis for regression and classification: support vector machine, MARS, random forest, PLS, sPLS
• Unsupervised: PCA, two-way hierarchical clustering analysis
Module c) Metabolome-wide association (correlation) analysis (Stage 3)
• Global: pairwise correlation and network of all metabolites
• Targeted: pairwise correlation and network of targeted metabolites
• Developed by Karan Uppal, Ph.D., M.Sc., Assistant Professor, Emory University School of Medicine
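The Stage 1 steps (replicate summarization, log transformation, z-scoring) can be sketched as below. This is illustrative Python, not xmsPANDA's implementation; quantile normalization and missing-value filtering are omitted for brevity, and the +1 offset before the log is our assumption to guard against zeros.

```python
import math
import statistics

def preprocess(feature_rows, n_reps=3):
    """Stage 1 sketch: average technical replicates, log2-transform,
    then z-score each feature across samples."""
    out = []
    for row in feature_rows:
        # replicate summarization: mean of each consecutive replicate group
        summarized = [
            statistics.mean(row[i:i + n_reps])
            for i in range(0, len(row), n_reps)
        ]
        logged = [math.log2(v + 1) for v in summarized]  # +1 guards zeros
        mu, sd = statistics.mean(logged), statistics.stdev(logged)
        out.append([(v - mu) / sd for v in logged])      # z-score per feature
    return out
```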
39. xmsPANDA: Various options for feature selection and predictive evaluation
• Univariate:
– t-test, paired t-test, LIMMA, linear regression, ANOVA
• P-values from moderated t-tests are adjusted for multiple hypothesis testing using the Benjamini–Hochberg false discovery rate (FDR) correction method
– Manhattan plot to visualize metabolome-wide statistically significant changes
• Multivariate and data mining:
– Supervised:
• Support Vector Machine
• Partial Least Squares Discriminant Analysis (PLS, PLSDA, sPLS, sPLSDA)
• Random Forest
• Splines-based (MARS)
– Unsupervised:
• Principal Component Analysis
• Two-way hierarchical clustering analysis
• Correlation/network analysis using MetabNet (Uppal 2015):
– Untargeted: correlations with all metabolites
– Targeted: correlations with metabolites from a specific pathway, clinical parameters
40. xmsPANDA: Sample input files
a. Feature table
b. Class labels file
Note: the order of the sample IDs must be identical in both files.
42. xmsPANDA Manhattan plots: the y-axis corresponds to –log10(p-value); the FDR cut-off is represented by the horizontal line
a) –log10(p) vs. m/z  b) –log10(p) vs. retention time
(Figure annotations: regions labeled for amino acids and for lipids/steroids.)
43. xmsPANDA PCA and cluster analysis
• Principal Component Analysis (PCA): samples plotted on PC1 vs. PC2
• Hierarchical Clustering Analysis (HCA): two-way heat map of samples × m/z features
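Two-way hierarchical clustering is ordinary agglomerative clustering applied once to the rows (samples) and once to the columns (m/z features) of the intensity matrix. A naive single-linkage sketch in Python (xmsPANDA itself relies on R's clustering routines; this only illustrates the idea):

```python
def single_linkage(points, n_clusters):
    """Naive agglomerative clustering with single linkage: repeatedly
    merge the two closest clusters until n_clusters remain.
    Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                dij = min(dist(points[a], points[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

A dendrogram is recovered by recording the merge order and heights instead of stopping at a fixed cluster count.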
45. xmsPANDA network analysis using MetabNet (Stage 3)
• Targeted metabolome-wide association study (MWAS) of specific metabolites (biomarkers, environmental exposures, etc.)
• Facilitates detection of related metabolic pathways and network structures
• Correlation-based network analysis:
– Each node corresponds to a metabolite; the edges correspond to the correlation coefficient Cij
– Two metabolites are linked if |Cij| > threshold at a user-defined significance level
– Pearson, Spearman, and partial correlation are supported
• Example network legend: correlated m/z at |cor| > 0.4 at FDR 0.2; putative biomarkers from PLS
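The edge rule above (link i and j when |Cij| exceeds a threshold) reduces to a scan over the correlation matrix. Illustrative Python; the significance filtering at a given FDR is omitted from this sketch.

```python
def correlation_network(names, corr, threshold=0.4):
    """Build the edge list of a correlation network: connect metabolites
    i and j whenever |Cij| > threshold.

    corr is a symmetric matrix of pairwise correlation coefficients
    (Pearson, Spearman, or partial) in the same order as names."""
    edges = []
    n = len(names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i][j]) > threshold:
                edges.append((names[i], names[j], corr[i][j]))
    return edges
```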
46. Summary
• xmsPANDA provides an automated workflow for analyzing metabolomics data (the package can also be adapted to other -omics data)
• The package facilitates network-level investigation of differentially expressed metabolites
• Includes independent functions for hierarchical clustering analysis, PCA, and boxplots
• Availability
– Emory IT Box (accessible under the MetabolomicsWorkshopSummer2016 folder on Box)
– Email: kuppal2@emory.edu
48. Mummichog for pathway enrichment analysis
A) In the untargeted metabolomics workflow, the conventional approach requires metabolites to be identified before pathway/network analysis, while mummichog (blue arrow) predicts functional activity, bypassing metabolite identification. B) Each row of dots represents the possible metabolite matches for one m/z feature: red is the true metabolite, gray the false matches. The conventional approach first requires identification of metabolites before mapping them to the metabolic network. C) mummichog maps all possible metabolite matches to the network and looks for local enrichment, which reflects the true activity because the false matches distribute randomly.
• Developed by Shuzhao Li, Ph.D., Assistant Professor, Emory University School of Medicine
• Li et al. 2013. PLoS Computational Biology
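The "local enrichment" idea in panel C can be illustrated with a simple over-representation test: count how many tentative matches fall in each pathway and ask how surprising that overlap is. This Python sketch uses a hypergeometric tail probability; note that mummichog's actual statistic is more elaborate (it builds a null distribution by resampling m/z features), so this is only a simplified stand-in.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items without replacement from a
    population of M containing n 'successes' (pathway members)."""
    return sum(
        comb(n, x) * comb(M - n, N - x) for x in range(k, min(n, N) + 1)
    ) / comb(M, N)

def pathway_enrichment(matched, pathways, universe_size):
    """Over-representation p-value per pathway for a set of tentative
    metabolite matches (simplified sketch of the enrichment step)."""
    N = len(matched)
    return {
        name: hypergeom_sf(len(matched & members), universe_size,
                           len(members), N)
        for name, members in pathways.items()
    }
```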
50. Metabolite annotation
• >10,000 reproducible signals can be detected using liquid chromatography coupled to high-resolution mass spectrometry
• Simple database searches can result in a large number of false positives
51. Metabolite annotation: mapping m/z from LC-MS data to known metabolites in databases
• There is a many-to-many relationship between m/z and metabolites: one m/z feature can match several metabolites (e.g., via different adducts), and one metabolite can give rise to several m/z values.
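The basic database-search step is a ppm-tolerance match between observed m/z and expected adduct masses. An illustrative Python sketch, restricted to the [M+H]+ adduct; the `db` format and function name are our assumptions, not an xMSannotator API.

```python
def match_mz(query_mz, db, ppm=10.0, proton=1.007276):
    """Match an observed m/z against database monoisotopic masses,
    assuming an [M+H]+ adduct and a ppm mass tolerance.

    db: hypothetical {metabolite_name: monoisotopic_mass} mapping."""
    hits = []
    for name, mass in db.items():
        expected = mass + proton                     # [M+H]+ ion mass
        if abs(query_mz - expected) / expected * 1e6 <= ppm:
            hits.append(name)
    return hits
```

Extending the loop over a full adduct table (as xMSannotator's queryadductlist does) multiplies the candidate matches, which is exactly why the multi-criteria scoring on the next slides is needed.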
52. Main goals of xMSannotator
• Incorporate multiple layers of information (m/z, retention time, intensity profiles, isotope patterns, pathway membership) to increase confidence in annotations and to prioritize candidates for validation using MS/MS and chemical standards
• Perform suspect screening (exposure to environmental chemicals, drugs)
• Allow use of cluster/module membership to facilitate generating hypotheses about the biochemical roles of features with no database matches
• Developed by Karan Uppal, Ph.D., M.Sc., Assistant Professor, Emory University School of Medicine
53. Databases supported by xMSannotator
• Human Metabolome Database (HMDB): about 41,000 metabolites
– 2,824 detected and quantified
– 251 detected but not quantified
– 38,439 expected but not detected
• LipidMaps: 36,269 lipids
• The Toxin and Toxin Target Database (T3DB): 2,097 toxic chemicals
• KEGG: 15,298 chemicals
54. xMSannotator functions
• multilevelannotation(): multi-criteria annotation that assigns annotations to confidence levels (high, medium, low, none)
• get_mz_by_KEGGspecies(): generate a list of expected m/z, based on adducts, for all metabolites associated with a species in KEGG
• get_mz_by_KEGGpathwayIDs(): generate a list of expected m/z, based on adducts, for all metabolites in specific pathways
• get_mz_by_KEGGcompoundIDs(): generate a list of expected m/z, based on adducts, for given KEGG compound IDs
• get_kegg_map(): download a KEGG map as a PNG file with color-coded KEGG IDs
• ChemSpider.annotation(): m/z-based annotation for select databases in ChemSpider
55. xMSannotator example script (R)
library(xMSannotator)
#Package data files
data(example_data) #example peak intensity matrix
data(adduct_table)
data(adduct_weights)
#data(customIDs) #example for custom IDs
#data(customDB) #example for custom DB
#data(hmdbAllinf)
#data(keggotherinf)
#data(t3dbotherinf)
###########Parameters to change##############
dataA<-read.table("/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/50marmosets_rawdata_averaged.txt",sep="\t",header=TRUE)
#OR
#dataA<-example_data
outloc<-"/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/testBloodSpotv1.1.2T3DB/"
max.mz.diff<-10 #mass search tolerance for DB matching in ppm
max.rt.diff<-10 #retention time tolerance between adducts/isotopes
corthresh<-0.7 #correlation threshold between adducts/isotopes
max_isp<-5
mass_defect_window<-0.01
num_nodes<-4 #number of cores to be used; 2 is recommended for desktop computers due to high memory consumption
db_name<-"HMDB" #other options: "KEGG", "LipidMaps", "T3DB"
status<-NA #other options: "Detected", "Expected and Not Quantified"
num_sets<-300 #number of sets into which the total number of database entries should be split
mode<-"pos" #ionization mode
queryadductlist<-c("M+2H","M+H+NH4","M+ACN+2H","M+2ACN+2H","M+H","M+NH4","M+Na","M+ACN+H","M+ACN+Na","M+2ACN+H","2M+H","2M+Na",
"2M+ACN+H","M+2Na-H","M+H-H2O","M+H-2H2O")
#other options: c("M-H","M-H2O-H","M+Na-2H","M+Cl","M+FA-H"); c("positive"); c("negative"); c("all"); see data(adduct_table) for the complete list
#########################
dataA<-unique(dataA)
print(dim(dataA))
system.time(annotres<-multilevelannotation(dataA=dataA,max.mz.diff=max.mz.diff,max.rt.diff=max.rt.diff,cormethod="pearson",
num_nodes=num_nodes,queryadductlist=queryadductlist,mode=mode,outloc=outloc,db_name=db_name,
adduct_weights=adduct_weights,num_sets=num_sets,allsteps=TRUE,corthresh=corthresh,NOPS_check=TRUE,
customIDs=NA,missing.value=NA,hclustmethod="complete",deepsplit=2,networktype="unsigned",
minclustsize=10,module.merge.dissimilarity=0.2,filter.by=c("M+H"),biofluid.location=NA,origin=NA,
status=status,boostIDs=NA,max_isp=max_isp,HMDBselect="union",mass_defect_window=mass_defect_window,
pathwaycheckmode="pm",mass_defect_mode="pos"))