Computational techniques for Metabolomics Data
Analysis
Sophia A. Banton and Karan Uppal
Clinical Biomarkers Laboratory
Emory University School of Medicine
sbanton@emory.edu, kuppal2@emory.edu
Integrated Health Science and Facilities Core
NIEHS P30 ES019776
August 11, 2016
Topics covered in this workshop
• Overview of metabolomics data
• Web-based tools for biomarker discovery and data analysis
– MetaboAnalyst3.0 (hands-on)
• Using R for biomarker discovery and data analysis
– xmsPANDA (hands-on)
– Runs on R >= 3.2.0
• Mummichog for pathway analysis
– Runs on Python 2.7
Possible Study Approaches (Workflows)
HRM: Pilot study of pulmonary tuberculosis
HRM: Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty
Liver Disease-An Untargeted, High Resolution Metabolomics Study
Jin and Banton, et al. Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty Liver Disease—An
Untargeted, High Resolution Metabolomics Study, The Journal of Pediatrics, Volume 172, May 2016, Pages 14-19.e5.
Connecting HRM with metabolic pathways
KEGG Pathways
HRM: Plasma Metabolomics of Common Marmosets (Callithrix
jacchus) to Evaluate Diet and Feeding Husbandry
Banton et al. Plasma Metabolomics of Common
Marmosets (Callithrix jacchus) to Evaluate Diet and
Feeding Husbandry. JAALAS. March 2016
LC-Orbitrap MS
Raw data
Data Analysis Workflow
Final deliverables
Raw data processing with
built-in feature and sample
quality assessment
(apLCMS with xMSanalyzer)
Data Exploratory Analysis
(Box plots, histograms, etc.)
Batch-effect evaluation and correction
(Using ComBat); void volume filtering
Annotation of metabolites
(xMSannotator)
1. Untargeted feature table
2. Targeted feature table
3. Annotated feature table
Metabolite prediction based
on MS/MS
• Metlin (known)
• MassBank (known/unknown)
MS/MS validation
and deconvolution
• DeconMSn
Pathway analysis
(Mummichog, MetaboAnalyst,
MetaCore, MSEA)
Biomarker and network analysis
(xmsPANDA, MetabNet, MetaboAnalyst)
• Univariate: Limma t-test, paired t-test,
ANOVA, time-series
• Multivariate and predictive analysis:
Support vector machine, Random forest,
PLSDA
• Clustering: Two-way Hierarchical
clustering analysis
• Targeted and untargeted MWAS
Step 1: Data extraction from RAW spectral files
Feature and sample quality
assessment
Merge results from different
parameter settings
Mass calibration, batch-effect
evaluation and correction
Annotation of metabolites
1. Untargeted feature table
2. Targeted feature table
3. Annotated feature table
4. EIC and QC plots
Noise removal and peak
detection in each run
Peak grouping after retention
time alignment
Recovery of weaker signals or
filling missing peaks
Summary feature table
Peak detection and alignment
using apLCMS or XCMS at
different parameter settings
apLCMS or XCMS
LC/MS data processing using apLCMS or
XCMS with xMSanalyzer R package
Quality evaluation and assurance
A. xMSanalyzer has built-in data quality evaluation routines that
evaluate the quality of both features and samples
– Each sample is run in triplicate, which allows the quality of features
and samples to be evaluated based on the coefficient of variation (CV) and
the Pearson correlation within the technical replicates, respectively
– Only features with median CV <50% and samples for which the technical
replicates have an average pairwise Pearson correlation >0.7 are retained
for further analysis
– A quality score is assigned to each measured m/z that takes into account
both reliability and reproducibility of detection
B. Batch-effect evaluation using Principal Component Analysis
C. Batch-effect correction using ComBat (Johnson 2007,
Biostatistics)
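The two retention rules in part A can be sketched in a few lines. This is a minimal illustration in plain Python, not xMSanalyzer's implementation: the function names and the toy intensities are hypothetical, and only the thresholds (median CV <50%, average pairwise Pearson correlation >0.7) come from the slide.

```python
from itertools import combinations
from statistics import mean, median, stdev


def cv_percent(values):
    """Coefficient of variation (%) of one feature across technical replicates."""
    m = mean(values)
    return 100.0 * stdev(values) / m if m > 0 else float("inf")


def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def median_cv(replicates_by_sample):
    """Median CV of one feature; input holds one replicate triplet per sample."""
    return median([cv_percent(reps) for reps in replicates_by_sample])


def mean_pairwise_correlation(replicate_profiles):
    """Average pairwise Pearson correlation among one sample's replicate runs."""
    pairs = list(combinations(replicate_profiles, 2))
    return mean([pearson(a, b) for a, b in pairs])


# Toy data (hypothetical): one feature measured in three samples, three replicates each.
feature = [[100, 110, 90], [200, 190, 210], [50, 55, 45]]
keep_feature = median_cv(feature) < 50

# Toy data (hypothetical): one sample, three replicate runs over four features.
sample_runs = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [0.9, 2.2, 3.1, 3.9]]
keep_sample = mean_pairwise_correlation(sample_runs) > 0.7
```

Both toy cases pass the filters: the replicates vary by about 10% around their mean and the three runs track each other closely.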
Feature table – column headings
mz: Median measured mass-to-charge across all samples
time: Median retention time at which the ion elutes
mz.min: Minimum measured mass-to-charge across all samples
mz.max: Maximum measured mass-to-charge across all samples
NumPres.All.Samples: Number of samples with non-missing/non-zero values
NumPres.Biol.Samples: Number of biological samples for which 2 out of the 3 replicates have non-missing/non-zero values
median_CV: Median coefficient of variation (%) within technical replicates
Qscore: Quality score, defined as the ratio of the percentage of biological samples for which > 50% of technical replicates have a signal to the % median CV. A higher Qscore means the feature is more quantitatively reproducible within technical replicates and is detected across a larger percentage of biological samples
Max.intensity: Maximum intensity of the feature across all samples
VT_SampleRunDate_RunNumber.cdf: Integrated peak area (ion intensity) in each sample; each sample has 3 technical replicates (e.g., VT_130712_002, VT_130712_004, VT_130712_006)
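The Qscore definition can be made concrete with a small calculation. The sketch below follows the wording of the column description only; how xMSanalyzer handles missing replicates when computing the CV is not stated on the slide, so computing the CV over the non-zero replicates of detected samples is an assumption, and the function name and data are hypothetical.

```python
from statistics import mean, median, stdev


def qscore(replicates_by_sample):
    """Percent of samples with a signal in >50% of replicates, divided by % median CV.

    Assumes at least two technical replicates per sample and treats 0 as missing.
    """
    detected = []
    for reps in replicates_by_sample:
        signals = [v for v in reps if v > 0]
        if len(signals) > 0.5 * len(reps):
            detected.append(signals)
    if not detected:
        return 0.0
    pct_detected = 100.0 * len(detected) / len(replicates_by_sample)
    median_cv = median([100.0 * stdev(s) / mean(s) for s in detected])
    return pct_detected / median_cv if median_cv > 0 else float("inf")


# Toy feature (hypothetical): four biological samples, three replicates each.
# The third sample has a signal in only one replicate, so it does not count as detected.
feature = [[100, 110, 90], [200, 190, 210], [0, 0, 30], [50, 55, 45]]
score = qscore(feature)  # 75% detected / 10% median CV
```

A feature seen in more samples, or measured with a lower CV, receives a higher score, which matches the interpretation given in the column description.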
Feature Quality Assessment
Sample Output
Columns: m/z, retention time, Sample 1, Sample 2, Sample 3
Biomarker and statistical analysis using
MetaboAnalyst3.0
(http://www.metaboanalyst.ca/)
Various options for feature selection and predictive
evaluation
• Univariate:
– T-test, Paired t-test, LIMMA based t-test
• P-values from moderated t-tests were adjusted for multiple hypothesis testing
using the Benjamini-Hochberg false discovery rate (FDR) correction method
– Manhattan plot to visualize metabolome wide statistically significant
changes
• Multivariate and data mining:
– Supervised:
• Support Vector Machine
• Partial Least Square Discriminant Analysis
• Random Forest
– Unsupervised:
• Principal Component Analysis
• Two-way hierarchical clustering analysis
• K-means clustering
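For the univariate options above, the Benjamini-Hochberg adjustment is simple enough to write out. This is a generic sketch of the standard step-up procedure in plain Python, not code taken from MetaboAnalyst; the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the minimum
    # of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five m/z features.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adj]
```

With these inputs only the first two features stay below an FDR threshold of 0.05, even though four of the raw p-values are below 0.05.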
MetaboAnalyst3.0: Multiple data analysis modules
MetaboAnalyst3.0: 3 main modules for statistical/biomarker
analysis
MetaboAnalyst3.0: Formatting input data files
• http://www.metaboanalyst.ca/faces/home.xhtml
MetaboAnalyst3.0: Selecting data file
Sample Input file
• “Smokers_nonsmokers_MetaboAnalyst.csv”
Sample NonSmoker_1 NonSmoker_7 NonSmoker_13 NonSmoker_19 NonSmoker_25
Phenotype NonSmoker NonSmoker NonSmoker NonSmoker NonSmoker
90.05544_114.26 6.25E+08 6.39E+08 1.03E+09 8.67E+08 9.07E+08
104.07101_114.84 1.13E+08 9.70E+07 59600000 1.88E+08 7.80E+07
104.10736_62.99 2.88E+09 4.34E+09 2.80E+09 2.67E+09 2.54E+09
114.06648_66.35 6.15E+08 6.85E+08 3.14E+08 6.09E+08 5.52E+08
118.08645_118.25 4.70E+09 4.21E+09 4.21E+09 6.28E+09 5.82E+09
119.03401_115.38 23737.65549 0 0 0 0
120.06562_119.55 2.85E+08 2.11E+08 2.79E+08 3.37E+08 2.58E+08
122.02708_124.25 40014.00396 39634.23778 34433.93197 68656.48709 8146.55363
123.0445_124.95 27500000 26300000 59400000 81600000 6.00E+07
123.05525_167.77 0 0 0 0 480688.6081
123.05531_52.82 3912282.412 12500000 3509484.928 2903190.851 0
124.04031_124.95 4381223.786 6501175.935 8539781.005 12200000 2333448.716
130.04993_122.35 7.31E+08 7.38E+08 1.02E+09 8.33E+08 8.17E+08
132.07675_102.12 1.50E+08 1.28E+08 1.60E+08 3.87E+08 2.78E+08
133.06077_123.44 8.30E+07 85400000 1.00E+08 85800000 93700000
134.04476_132.94 16100000 17100000 33900000 23400000 16200000
137.04564_76.31 94627.89366 99064.3524 31862.04198 74075.53368 96459.30732
137.05966_112.24 23026.51599 729321.1317 21884.85816 27338.56273 26548.87029
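The layout of this input file (sample names in the first row, class labels in the second, one feature per row after that) can be checked with a short parser before uploading. The sketch below is illustrative only: it assumes the file is comma-separated and that each feature label is m/z and retention time joined by an underscore, which the slide suggests but does not state. The embedded text reuses a few values from the table above.

```python
import csv
import io

# A few rows in the layout shown above, written as CSV (delimiter is an assumption).
EXAMPLE_CSV = """Sample,NonSmoker_1,NonSmoker_7,NonSmoker_13
Phenotype,NonSmoker,NonSmoker,NonSmoker
90.05544_114.26,6.25E+08,6.39E+08,1.03E+09
119.03401_115.38,23737.65549,0,0
"""


def read_feature_table(text):
    """Parse sample names, class labels, and per-feature intensities."""
    rows = list(csv.reader(io.StringIO(text)))
    samples = rows[0][1:]
    phenotype = dict(zip(samples, rows[1][1:]))
    features = {}
    for row in rows[2:]:
        mz, rt = (float(part) for part in row[0].split("_"))
        features[(mz, rt)] = [float(v) for v in row[1:]]
    return samples, phenotype, features


samples, phenotype, features = read_feature_table(EXAMPLE_CSV)

# Fraction of zero (missing) values per feature, useful before choosing filters.
missing_fraction = {
    key: sum(v == 0 for v in values) / len(values) for key, values in features.items()
}
```

Features such as 119.03401_115.38, which is zero in most samples, are the ones the missing-value check and filtering steps on the following slides are meant to catch.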
MetaboAnalyst3.0: Select input file
MetaboAnalyst3.0: Check for missing values and other potential issues such
as mislabeling
MetaboAnalyst3.0: Data filtering options
MetaboAnalyst3.0: Data transformation and scaling options
MetaboAnalyst3.0: Results after normalization
Before After
MetaboAnalyst3.0: Lots of options for statistical analysis!
Let’s try T-test first
MetaboAnalyst3.0: T-test output
Manhattan plot
MetaboAnalyst3.0: Click on individual red dots to visualize boxplots
Box-and-whisker plot
MetaboAnalyst3.0: Heatmap option with two-way HCA
Samples
Metabolites
MetaboAnalyst3.0: Heatmap option – using top 25 m/z features
based on T-test
Samples
Metabolites
MetaboAnalyst3.0: *PLSDA option
*Method for classifying – or separating – the groups
MetaboAnalyst3.0: PLSDA option – “2D Scores Plots” tab
MetaboAnalyst3.0: PLSDA option – “3D Scores Plots” tab
MetaboAnalyst3.0: PLSDA option – “Imp. Features” tab
• Top 15 features based on variable importance in projection (VIP), determined using PLS-DA
MetaboAnalyst3.0: Download the results
(EXTREMELY) Useful resources
• Xia J. and Wishart D., Web-based inference of biological patterns,
functions and pathways from metabolomic data using
MetaboAnalyst, Nature Protocols 2011
• Sugimoto et al., Bioinformatics Tools for Mass Spectroscopy-Based
Metabolomic Data Processing and Analysis, Current
Bioinformatics 2012
xmsPANDA: R package for pre-processing, biomarker discovery,
clustering, and network analysis
xmsPANDA workflow
Module a) Data pre-processing (Stage 1)
• Replicate summarization
• Data filtering: missing values, relative standard deviation
• Data Transformation (log, z-score)
• Normalization (Quantile)
Module b) Data mining (Stage 2)
• Univariate: Limma t-test, paired t-test, Wilcoxon test, mixed effects
model, ANOVA
• Multivariate and predictive analysis for regression and
classification: Support vector machine, MARS, Random
forest, PLS, sPLS
• Unsupervised: PCA, two-way Hierarchical clustering
analysis
Module c) Metabolome-wide association
(correlation) analysis (Stage 3)
• Global: Pairwise correlation and network of all metabolites
• Targeted: Pairwise correlation and network of targeted
metabolites
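The Stage 1 steps (replicate summarization, missing-value filtering, log transformation) can be illustrated on a single feature. This is a sketch under stated assumptions, not xmsPANDA's code: zeros are treated as missing, a sample is set to zero when too many of its replicates are missing, and the log2 transform adds 1 to avoid log2(0). The function names and intensities are hypothetical.

```python
from math import log2
from statistics import median


def summarize_replicates(intensities, num_replicates, max_missing=0.5):
    """Collapse consecutive technical replicates into one median value per sample."""
    summarized = []
    for start in range(0, len(intensities), num_replicates):
        reps = intensities[start:start + num_replicates]
        present = [v for v in reps if v > 0]
        missing_fraction = 1 - len(present) / len(reps)
        if present and missing_fraction <= max_missing:
            summarized.append(median(present))
        else:
            summarized.append(0.0)  # too few replicates with a signal
    return summarized


def keep_feature(summarized, max_missing=0.5):
    """Keep a feature only if it is missing in at most max_missing of the samples."""
    missing = sum(v == 0 for v in summarized) / len(summarized)
    return missing <= max_missing


def log2_transform(summarized):
    """log2(x + 1) transform; the +1 offset is an assumption of this sketch."""
    return [log2(v + 1) for v in summarized]


# One feature, three samples, three technical replicates each (hypothetical values).
row = [8, 7, 9, 0, 0, 3, 15, 15, 0]
per_sample = summarize_replicates(row, num_replicates=3)
retained = keep_feature(per_sample)
transformed = log2_transform(per_sample)
```

The second sample has a signal in only one of three replicates, so it is set to zero, while the third sample keeps the median of its two non-zero replicates.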
• Developed by Karan Uppal Ph.D., MSc., Assistant Professor, Emory University School
of Medicine
xmsPANDA: Various options for feature selection and
predictive evaluation
• Univariate:
– T-test, Paired t-test, LIMMA, linear regression, ANOVA
• P-values from moderated t-tests were adjusted for multiple hypothesis testing using the Benjamini-Hochberg
false discovery rate (FDR) correction method
– Manhattan plot to visualize metabolome wide statistically significant changes
• Multivariate and data mining:
– Supervised:
• Support Vector Machine
• Partial Least Square Discriminant Analysis (PLS, PLSDA, sPLS, sPLSDA)
• Random Forest
• Splines based (MARS)
– Unsupervised:
• Principal Component Analysis
• Two-way hierarchical clustering analysis
• Correlation/network analysis using *MetabNet (Uppal 2015):
– Untargeted: Correlations with all metabolites
– Targeted: Correlations with metabolites from a specific pathway, clinical parameters
xmsPANDA: Sample input files
a. Feature table
b. Class labels file
The order of the sample IDs must be identical in both files
xmsPANDA: Example script
library(xmsPANDA)
# Run the xmsPANDA workflow on a feature table and its class labels file
demetabs_res<-diffexp(
feature_table_file="/Users/karanuppal/Documents/Emory/JonesLab/Projects/C18_feature_table_PANDA.txt",
parentoutput_dir="/Users/karanuppal/Documents/Emory/JonesLab/Projects/PANDA_lmreggeno_gender_allmiss0.3group0.7_median_v1.0.3.1_p0.01B/",
class_labels_file="/Users/karanuppal/Documents/Emory/JonesLab/Projects/clinical_info_PANDA_class_gender.txt",
num_replicates = 2,
feat.filt.thresh =NA, summarize.replicates =TRUE,
summary.method="median",summary.na.replacement="zeros",rep.max.missing.thresh=0.3,
all.missing.thresh=0.5, group.missing.thresh=0.7,
log2transform = TRUE, medcenter=FALSE, znormtransform = FALSE,
quantile_norm = FALSE, lowess_norm = FALSE, madscaling = FALSE,
rsd.filt.list = seq(0, 0, 5), pairedanalysis = FALSE, featselmethod="lmreg",
fdrthresh = 0.01, fdrmethod="none",cor.method="spearman", abs.cor.thresh = 0.3, cor.fdrthresh=0.2,
kfold=10,feat_weight=1,globalcor=TRUE,target.metab.file=NA,
target.mzmatch.diff=10,target.rtmatch.diff=NA,max.cor.num=300,missing.val=0,networktype="complete",
samplermindex=NA,numtrees=1000,analysismode="classification",net_node_colors=c("green","red"),
net_legend=FALSE,heatmap.col.opt="RdBu",sample.col.opt="rainbow",alphacol=0.3, pls_vip_thresh = 3,
num_nodes = 2,
max_rf_varsel = 100, pls_ncomp = 5,
pcacenter=TRUE,pcascale=TRUE,pca.stage2.eval=FALSE,scoreplot_legend=TRUE,pca.global.eval=FALSE)
Other options for featselmethod:
limma: LIMMA
rf: random forest
spls: sparse PLS
pls: PLS
And more…
See example scripts for
more options
xmsPANDA Manhattan plots: Y-axis corresponds to the –log10 (p-value); FDR
cut-off is represented by the horizontal line
a) -logP vs m/z b) -logP vs time
Plot annotations: x-axes are m/z (a) and retention time (b); labeled regions: amino acids; lipids, steroids
xmsPANDA PCA and cluster analysis
Principal Component Analysis (PCA): axes PC1, PC2
Hierarchical Clustering Analysis (HCA): samples vs. m/z features
xmsPANDA Boxplots
xmsPANDA Network analysis using MetabNet (Stage 3)
Network legend: correlated m/z (|cor| > 0.4 at FDR 0.2); putative biomarkers from PLS
• Targeted metabolome-wide
association study (MWAS) of
specific metabolites (biomarkers,
environmental exposures, etc.)
• Facilitates detection of related
metabolic pathways and network
structures
• Correlation-based network analysis
• Each node corresponds to a metabolite and each edge
corresponds to the correlation
coefficient, Cij
• Two metabolites are linked if |Cij|>
threshold at a user defined
significance level
• Pearson, Spearman, and partial
correlation
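The linking rule above (connect two metabolites when |Cij| exceeds a threshold) can be sketched as follows. This is not MetabNet's implementation: it uses Pearson correlation only and omits the significance test and FDR step, and the feature names and intensity profiles are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length intensity profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_network(profiles, threshold):
    """Return {(node_a, node_b): Cij} for every pair with |Cij| > threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        c = pearson(profiles[a], profiles[b])
        if abs(c) > threshold:
            edges[(a, b)] = c
    return edges


# Hypothetical intensity profiles of four features across five samples.
profiles = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10.5],
    "C": [5, 4, 3, 2, 1],
    "D": [3, 1, 4, 1, 5],
}
edges = correlation_network(profiles, threshold=0.4)
```

Features A, B, and C end up mutually linked (C through negative correlations), while D stays unconnected because its correlations with the others fall below 0.4.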
Summary
• xmsPANDA provides an automated workflow for analyzing metabolomics
data (the package can be adapted to work with other -omics data)
• The package facilitates network-level investigation of significant or
differentially expressed metabolites
• Includes independent functions for hierarchical clustering analysis, PCA,
boxplots
• Availability
– Emory IT Box, (Accessible under MetabolomicsWorkshopSummer2016 folder
on Box)
– Email: kuppal2@emory.edu
Mummichog: Pathway enrichment analysis
A) In the workflow of untargeted metabolomics, the conventional approach requires the metabolites to be identified before
pathway/network analysis, while mummichog (blue arrow) predicts functional activity bypassing metabolite identification. B) Each
row of dots represents possible matches of metabolites from one m/z feature: red the true metabolite, gray the false matches. The
conventional approach first requires the identification of metabolites before mapping them to the metabolic network.
C) mummichog maps all possible metabolite matches to the network and looks for local enrichment, which reflects the true activity
because the false matches will distribute randomly.
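The idea in panel C, pooling every candidate match and then asking whether a pathway is over-represented, can be illustrated with a simple hypergeometric test. This is a toy sketch and not mummichog's algorithm, whose statistics are more involved (see Li et al. 2013); the metabolite IDs, pathway, and candidate matches are all hypothetical.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when drawing n items from N, of which K belong to the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def pathway_enrichment(candidates_by_feature, pathway, universe):
    """Pool all tentative matches of the significant features and test one pathway."""
    candidates = set().union(*candidates_by_feature.values()) & universe
    overlap = candidates & pathway
    p = hypergeom_tail(len(overlap), len(pathway & universe), len(candidates), len(universe))
    return overlap, p


# Hypothetical network of 20 metabolites, one pathway of 5.
universe = {"M%d" % i for i in range(1, 21)}
pathway = {"M1", "M2", "M3", "M4", "M5"}

# Each significant m/z feature keeps ALL of its tentative matches, true or false.
candidates_by_feature = {
    "feature_1": {"M1", "M12"},
    "feature_2": {"M2"},
    "feature_3": {"M3", "M17"},
}
overlap, p_value = pathway_enrichment(candidates_by_feature, pathway, universe)
```

False matches such as M12 and M17 scatter across the network, while the true matches concentrate in one pathway, which is what pushes the overlap above what chance alone would give.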
Mummichog for pathway enrichment analysis
• Developed by Shuzhao Li Ph.D., Assistant Professor, Emory University School of Medicine
• Li et al. 2013. PLoS Computational Biology
xMSannotator: Metabolite annotation
Manuscript under review; URL: https://sourceforge.net/projects/xmsannotator/
Metabolite annotation
• >10,000 reproducible signals can be detected using liquid
chromatography high resolution mass spectrometry
• Simple database searches can result in a large number of false
positives
Metabolite Annotation: mapping m/z from
LC-MS data to known metabolites in databases
Many-to-many relationship between m/z and metabolites
Figure labels: m/z 1, m/z 2
Main goals of xMSannotator
• Incorporating multiple layers of information (m/z, retention time,
intensity profiles, isotope patterns, pathway membership) to
enhance confidence in annotations and prioritize candidates for
validation using MS/MS and chemical standards
• Perform suspect screening (exposure to environmental chemicals,
drugs)
• Allow use of cluster/module membership to facilitate generating
hypotheses about biochemical roles of features with no database
matches
• Developed by Karan Uppal Ph.D., MSc., Assistant Professor, Emory University School
of Medicine
Databases supported by xMSannotator
• Human Metabolome Database (HMDB)
– About 41,000 metabolites
• 2,824 (Detected and Quantified)
• 251 (Detected but not Quantified)
• 38,439 (Expected but not detected)
• LipidMaps
– 36,269 lipids
• The toxin and toxin target database (T3DB)
– 2,097 toxic chemicals
• KEGG
– 15,298 chemicals
xMSannotator functions
• multilevelannotation() for multi-criteria based annotation that assigns
annotations into confidence levels (high, medium, low, none)
• get_mz_by_KEGGspecies:
– generate list of expected m/z based on adducts for all metabolites associated with a species in
KEGG
• get_mz_by_KEGGpathwayIDs:
– generate list of expected m/z based on adducts for all metabolites in specific pathways
• get_mz_by_KEGGcompoundIDs:
– generate list of expected m/z based on adducts for given KEGG compound ID
• get_kegg_map:
– Download KEGG map as a PNG file with color coded KEGG IDs
• ChemSpider.annotation:
– m/z based annotation for select databases in ChemSpider
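The "expected m/z based on adducts" idea behind the get_mz_by_* functions comes down to adding an adduct mass shift to the monoisotopic mass and comparing against the measured m/z within a ppm tolerance. The sketch below is not xMSannotator code: the function names are hypothetical, only three singly charged adducts are covered, and the mass shifts are the commonly tabulated values. The 5-Hydroxy-L-tryptophan numbers are taken from the sample output in this deck.

```python
# Mass shifts (Da) for a few singly charged positive-mode adducts.
ADDUCT_SHIFT = {
    "M+H": 1.007276,
    "M+NH4": 18.033823,
    "M+Na": 22.989218,
}


def expected_mz(monoisotopic_mass, adduct):
    """Expected m/z of a singly charged adduct of one molecule."""
    return monoisotopic_mass + ADDUCT_SHIFT[adduct]


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def matches(observed_mz, monoisotopic_mass, adduct, max_ppm=10.0):
    """True if the observed m/z lies within max_ppm of the expected adduct m/z."""
    return abs(ppm_error(observed_mz, expected_mz(monoisotopic_mass, adduct))) <= max_ppm


# 5-Hydroxy-L-tryptophan (HMDB00472): monoisotopic mass 220.08479, observed m/z 221.090047.
theoretical = expected_mz(220.08479, "M+H")
error = ppm_error(221.090047, theoretical)
is_match = matches(221.090047, 220.08479, "M+H", max_ppm=10.0)
```

The observed feature lies roughly 9 ppm below the expected M+H value, so it is accepted at a 10 ppm tolerance but would be rejected at 5 ppm, which shows how strongly the tolerance setting shapes the candidate list.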
xMSannotator example script (R)
library(xMSannotator)
#Package data files
data(example_data) #example peak intensity matrix
data(adduct_table)
data(adduct_weights)
#data(customIDs) #example for custom IDs
#data(customDB) #example for custom DB
#data(hmdbAllinf)
#data(keggotherinf)
#data(t3dbotherinf)
###########Parameters to change##############
#input feature table (tab-delimited)
dataA<-read.table("/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/50marmosets_rawdata_averaged.txt",sep="\t",header=TRUE)
#OR
#dataA<-example_data
outloc<-"/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/testBloodSpotv1.1.2T3DB/"
max.mz.diff<-10 #mass search tolerance for DB matching in ppm
max.rt.diff<-10 #retention time tolerance between adducts/isotopes
corthresh<-0.7 #correlation threshold between adducts/isotopes
max_isp=5
mass_defect_window=0.01
num_nodes<-4 #number of cores to be used; 2 is recommended for desktop computers due to high memory consumption
db_name="HMDB" #other options: KEGG, LipidMaps, T3DB
status=NA #other options: "Detected", NA, "Expected and Not Quantified"
num_sets<-300 #number of sets into which the total number of database entries should be split
mode<-"pos" #ionization mode
queryadductlist=c("M+2H","M+H+NH4","M+ACN+2H","M+2ACN+2H","M+H","M+NH4","M+Na","M+ACN+H","M+ACN+Na","M+2ACN+H","2M+H","2M+Na",
"2M+ACN+H","M+2Na-H","M+H-H2O","M+H-2H2O")
#other options: c("M-H","M-H2O-H","M+Na-2H","M+Cl","M+FA-H"); c("positive"); c("negative");
#c("all"); see data(adduct_table) for complete list
#########################
dataA<-unique(dataA) #drop duplicated rows
print(dim(dataA))
system.time(annotres<-multilevelannotation(dataA=dataA,max.mz.diff=max.mz.diff,max.rt.diff=max.rt.diff,cormethod="pearson",num_nodes=num_nodes,queryadductlist=queryadductlist,
mode=mode,outloc=outloc,db_name=db_name, adduct_weights=adduct_weights,num_sets=num_sets,allsteps=TRUE,
corthresh=corthresh,NOPS_check=TRUE,customIDs=NA,missing.value=NA,hclustmethod="complete",deepsplit=2,networktype="unsigned",
minclustsize=10,module.merge.dissimilarity=0.2,filter.by=c("M+H"),biofluid.location=NA,origin=NA,status=status,boostIDs=NA,max_isp=max_isp,
HMDBselect="union",mass_defect_window=mass_defect_window,pathwaycheckmode="pm",mass_defect_mode="pos")
)
Sample output
Confidence chemical_ID mz time MatchCategory Name Formula MonoisotopicMass Adduct ISgroup Module mean_int_vec
3 HMDB00472 221.090047 51.9551753 Unique 5-Hydroxy-L-tryptophan C11H12N2O3 220.08479 M+H ISgroup_17_1_10 17 687747.839
3 HMDB00472 202.095804 51.7621006 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[-18] - M_[-18] ISgroup_17_1_10 17 76047.9214
3 HMDB00472 222.093448 52.9762499 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[+2] - M_[+2] ISgroup_17_1_10 17 53822.941
3 HMDB00472 227.097096 51.5564666 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[+7] - M_[+7] ISgroup_17_1_10 17 62478.9814
3 HMDB00472 203.103566 50.3062004 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[-17] - M_[-17] ISgroup_17_1_11 17 108606.947
3 HMDB00269 302.302693 348.568133 Unique Sphinganine C18H39NO2 301.29808 M+H ISgroup_244_44_31 244 394617.009
3 HMDB00269 303.305073 347.574618 Unique Sphinganine C18H39NO2_[+2] - M_[+2] ISgroup_244_44_31 244 40886.1457
3 HMDB00222 400.340702 374.505405 Unique L-Palmitoylcarnitine C23H45NO4 399.33486 M+H ISgroup_244_45_35 244 3787674.18
3 HMDB00222 401.341243 371.277478 Multiple L-Palmitoylcarnitine C23H45NO4_[+2] - M_[+2] ISgroup_244_45_35 244 1213505.56
3 HMDB00211 181.070374 58.0993743 Multiple Myoinositol C6H12O6 180.06339 M+H ISgroup_194_3_8 194 2071579.66
3 HMDB00211 182.073758 54.9711864 Multiple Myoinositol C6H12O6_[+2] - M_[+2] ISgroup_194_3_8 194 128395.932
3 HMDB00201 204.121143 48.3112602 Unique L-Acetylcarnitine C9H17NO4 203.11576 M+H ISgroup_135_3_13 135 4173369.02
3 HMDB00201 206.125572 49.79958 Unique L-Acetylcarnitine C9H17NO4_[+3] - M_[+3] ISgroup_135_3_13 135 8500.80302
3 HMDB00172 132.100611 50.8616042 Multiple L-Isoleucine C6H13NO2 131.09463 M+H ISgroup_194_3_11 194 19782503.3
3 HMDB00172 86.0954504 48.6019537 Multiple L-Isoleucine C6H13NO2_[-45] - M_[-45] ISgroup_194_3_10 194 1330586.31
3 HMDB00172 133.10397 49.949951 Multiple L-Isoleucine C6H13NO2_[+2] - M_[+2] ISgroup_194_3_11 194 1219553.75
3 HMDB00162 116.06946 49.2340805 Unique L-Proline C5H9NO2 115.06333 M+H ISgroup_194_3_7 194 7364523.19
3 HMDB00162 117.072835 47.1430823 Unique L-Proline C5H9NO2_[+2] - M_[+2] ISgroup_194_3_8 194 328238.286
3 HMDB00159 166.084711 54.5950322 Unique L-Phenylalanine C9H11NO2 165.07898 M+H ISgroup_194_3_9 194 20408852.3
3 HMDB00159 120.079568 50.8568168 Unique L-Phenylalanine C9H11NO2_[-45] - M_[-45] ISgroup_194_3_8 194 3268442.52
3 HMDB00159 167.088077 50.3774503 Unique L-Phenylalanine C9H11NO2_[+2] - M_[+2] ISgroup_194_3_9 194 1862264.77
3 HMDB00148 148.059082 48.2266649 Multiple L-Glutamicacid C5H9NO4 147.05316 M+H ISgroup_244_1_6 244 491446.134
3 HMDB00148 192.022544 47.8072529 Multiple L-Glutamicacid C5H9NO4 147.05316 M+2Na-H ISgroup_244_1_3 244 75734.8384
Confidence scores for possible chemical identity:
• 0 is no confidence
• 1 is low confidence
• 2 is medium confidence
• 3 is high confidence
• 4 is experimentally confirmed
Summary
m/z          time         Smoker_11.raw Peak area  Smoker_15.raw Peak area  Smoker_13.raw Peak area
193.0970902  1.697928509  21590.09577              1465875.407              2921520.329
104.071007   1.914036922  1.68E+08                 1.20E+08                 1.18E+08
104.071007   1.914036922  1.68E+08                 1.20E+08                 1.18E+08
137.0456421  1.271814331  217380.9151              66352.25511              96353.93902
241.0307929  2.180590728  6.27E+07                 8.42E+07                 8.09E+07
134.044759   2.215654287  1.39E+07                 2.77E+07                 2.66E+07
Raw data
Data extraction
(apLCMS, XCMS, MzMine2.0, xMSanalyzer)
Probability or score-based annotation
(xMSannotator)
Biomarker discovery and
Network analysis
(MetaboAnalyst, xmsPANDA)
Pathway analysis and
applications (Dr. Shuzhao Li)
Clinical Biomarkers Laboratory
clinicalmetabolomics.org
Email: kuppal2@emory.edu
Dean Jones, Young-Mi Go, Shuzhao Li, Karan Uppal, Douglas Walker,
Josh Chandler, Sophia Banton, Ken Liu, Vilinh Tran, Michael Orr, Bill
Liang (not shown)
Lab website: http://clinicalmetabolomics.org/
Live Demonstrations of xmsPANDA and
Mummichog
xmsPANDA
• Installation instructions, data files, example R scripts, and manual on Emory
IT Box
Input
files
xmsPANDA
• R Script
#load xmsPANDA
library(xmsPANDA)
feature_table_file<-"/Users/karanuppal/Documents/Emory/Workshop/Workshop2016/Mzmine_smokers_nonsmokers_PANDA.txt"
class_labels_file<-"/Users/karanuppal/Documents/Emory/Workshop/Workshop2016/classlabels.txt"
outloc<-"/Users/karanuppal/Documents/Emory/Workshop/Workshop2016/testpanda4/"
demetabs_res<-diffexp(feature_table_file=feature_table_file,
parentoutput_dir=outloc,
class_labels_file=class_labels_file,
num_replicates = 3,
feat.filt.thresh =NA, summarize.replicates =TRUE, summary.method="median",summary.na.replacement="zeros",
rep.max.missing.thresh=0.5,
all.missing.thresh=NA, group.missing.thresh=NA, input.intensity.scale="raw",
log2transform = FALSE, medcenter=FALSE, znormtransform = FALSE,
quantile_norm = FALSE, lowess_norm = FALSE, madscaling = FALSE,
rsd.filt.list = c(0), pairedanalysis = FALSE, featselmethod="lm1wayanova",
fdrthresh = 0.05, fdrmethod="none",cor.method="pearson", abs.cor.thresh = 0.4, cor.fdrthresh=0.2,
kfold=10,feat_weight=1,globalcor=TRUE,target.metab.file=NA,
target.mzmatch.diff=10,target.rtmatch.diff=NA,max.cor.num=NA,missing.val=0,networktype="complete",
samplermindex=NA,numtrees=1000,analysismode="classification",net_node_colors=c("green","red"),
net_legend=FALSE,heatmap.col.opt="RdBu",sample.col.opt="rainbow",alphacol=0.3, pls_vip_thresh = 2, num_nodes = 2,
max_varsel = 100, pls_ncomp = 5,pcacenter=TRUE,pcascale=TRUE,pred.eval.method="BER",rocfeatlist=seq(2,10,1),
rocfeatincrement=TRUE,
rocclassifier="svm",foldchangethresh=0,wgcnarsdthresh=30,WGCNAmodules=FALSE,
optselect=FALSE,max_comp_sel=1,saveRda=FALSE,pca.cex.val=4,pls.permut.count=NA,
pca.ellipse=TRUE,ellipse.conf.level=0.95,legendlocation="bottomleft",svm.acc.tolerance=5)
xmsPANDA
• Results
• ReadME.txt
– Stage 1 results: Preprocessing (Normalization, transformation)
– Stage 2 results: Feature selection & evaluation results (Manhattan
plots, PCA, HCA, boxplots, table of significant features, clustering
results)
– Stage 3 results: Correlation based network analysis
xmsPANDA Stage 2 Results
• Results
• Pages 9 and 10 – Type I and II Manhattan plots
• Page 11 – 2-way HCA heatmap
• Final page(s) – box plots
Cotinine
xmsPANDA Stage 3 Results
• Correlation network plot
Mummichog
• Example data set and manual on Emory IT Box
Input
file
Mummichog
• Change directory in command prompt to location of
Mummichog folder:
• Example:
– cd mummichog-1.0.7\test
Mummichog
• Change directory in command prompt to location of
Mummichog folder:
• Example:
– C:\Users\sbanton\Downloads\mummichog-1.0.7\test>python ../mummichog/main.py -f testdata.txt -o testdata.txt -c 0.05
Mummichog
• Program is running
Mummichog
• Results
Quick results in HTML format
Mummichog Results
Pathways
Modules

  • 2. Topics covered in this workshop
      • Overview of metabolomics data
      • Web-based tools for biomarker discovery and data analysis
          – MetaboAnalyst3.0 (hands-on)
      • Using R for biomarker discovery and data analysis
          – xmsPANDA (hands-on)
          – Runs on R >= 3.2.0
      • Mummichog for pathway analysis
          – Runs on Python2.7
  • 4. HRM: Pilot study of pulmonary tuberculosis
  • 5. HRM: Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty Liver Disease-An Untargeted, High Resolution Metabolomics Study. Jin and Banton, et al., The Journal of Pediatrics, Volume 172, May 2016, Pages 14-19.e5.
  • 6. Connecting HRM with metabolic pathways (KEGG Pathways)
  • 7. Connecting HRM: Plasma Metabolomics of Common Marmosets (Callithrix jacchus) to Evaluate Diet and Feeding Husbandry. Banton et al., JAALAS, March 2016.
  • 8. Data Analysis Workflow (LC-Orbitrap MS raw data to final deliverables)
      • Raw data processing with built-in feature and sample quality assessment (apLCMS with xMSanalyzer)
      • Data exploratory analysis (box plots, histograms, etc.)
      • Batch-effect evaluation and correction (using ComBat); void volume filtering
      • Annotation of metabolites (xMSannotator)
      • 1. Untargeted feature table 2. Targeted feature table 3. Annotated feature table
      • Metabolite prediction based on MS/MS
          – Metlin (known)
          – MassBank (known/unknown)
      • MS/MS validation and deconvolution
          – DeconMSn
      • Pathway analysis (Mummichog, MetaboAnalyst, MetaCore, MSEA)
      • Biomarker and network analysis (xmsPANDA, MetabNet, MetaboAnalyst)
          – Univariate: Limma t-test, paired t-test, ANOVA, time-series
          – Multivariate and predictive analysis: Support vector machine, Random forest, PLSDA
          – Clustering: Two-way hierarchical clustering analysis
          – Targeted and untargeted MWAS
  • 9. Step 1: Data extraction from RAW spectral files
  • 10. LC/MS data processing using apLCMS or XCMS with xMSanalyzer R package
      • Peak detection and alignment using apLCMS or XCMS at different parameter settings
          – Noise removal and peak detection in each run
          – Peak grouping after retention time alignment
          – Recovery of weaker signals or filling missing peaks
          – Summary feature table
      • Merge results from different parameter settings
      • Feature and sample quality assessment
      • Mass calibration, batch-effect evaluation and correction
      • Annotation of metabolites
      • Outputs: 1. Untargeted feature table 2. Targeted feature table 3. Annotated feature table 4. EIC and QC plots
  • 11. Quality evaluation and assurance
      • A. xMSanalyzer has built-in routines that evaluate the quality of both features and samples
          – Each sample is run in triplicate, so feature quality can be evaluated from the coefficient of variation (CV) within the technical replicates, and sample quality from the Pearson correlation between them
          – Only features with median CV < 50% and samples whose technical replicates have an average pairwise Pearson correlation > 0.7 are retained for further analysis
          – A quality score is assigned to each measured m/z that takes into account both reliability and reproducibility of detection
      • B. Batch-effect evaluation using Principal Component Analysis
      • C. Batch-effect correction using ComBat (Johnson 2007, Biostatistics)
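The two retention rules on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not xMSanalyzer's own code: the function names (`cv_percent`, `feature_passes`, `sample_passes`) are hypothetical, and the sketch assumes the sample standard deviation is used for the CV.

```python
from itertools import combinations
from statistics import mean, median, stdev


def cv_percent(values):
    """Coefficient of variation (%) of one feature within one set of technical replicates."""
    m = mean(values)
    return 100.0 * stdev(values) / m if m > 0 else float("inf")


def pearson(x, y):
    """Pearson correlation between two equally long intensity profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def feature_passes(replicate_sets, max_median_cv=50.0):
    """Keep a feature if its median CV across biological samples is below the cutoff."""
    return median(cv_percent(reps) for reps in replicate_sets) < max_median_cv


def sample_passes(replicate_profiles, min_cor=0.7):
    """Keep a sample if the mean pairwise Pearson r of its replicate profiles exceeds the cutoff."""
    pairs = combinations(replicate_profiles, 2)
    return mean(pearson(a, b) for a, b in pairs) > min_cor


# One feature measured in triplicate in three biological samples (made-up intensities).
feature = [[100, 110, 90], [200, 210, 190], [50, 55, 45]]
print(feature_passes(feature))

# Three replicate injections of one sample, each profiled over four features (made-up).
good_sample = [[1, 2, 3, 4], [1.1, 2.1, 2.9, 4.2], [0.9, 1.8, 3.2, 3.9]]
bad_sample = [[1, 2, 3, 4], [4, 1, 3, 2], [2, 4, 1, 3]]
print(sample_passes(good_sample), sample_passes(bad_sample))
```

The 50% and 0.7 cutoffs are the ones stated on the slide; everything else, including the example intensities, is illustrative.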
  • 12. Feature table – column headings (Feature Quality Assessment)
      – mz: Median measured mass-to-charge across all samples
      – time: Median retention time at which the ion elutes
      – mz.min: Minimum measured mass-to-charge across all samples
      – mz.max: Maximum measured mass-to-charge across all samples
      – NumPres.All.Samples: Number of samples with non-missing/non-zero values
      – NumPres.Biol.Samples: Number of biological samples for which 2 out of the 3 replicates have non-missing/non-zero values
      – median_CV: Median coefficient of variation (%) within technical replicates
      – Qscore: Quality score, defined as the ratio of the percentage of biological samples for which > 50% of technical replicates have a signal to the % median CV. A higher Qscore means the feature is more quantitatively reproducible within technical replicates and is detected across a large percentage of biological samples
      – Max.intensity: Maximum intensity of the feature across all samples
      – VT_SampleRunDate_Run Number.cdf: Integrated peak area (ion intensity) in each sample; each sample has 3 technical replicates (e.g., VT_130712_002, VT_130712_004, VT_130712_006)
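The Qscore definition above translates directly into a ratio. The sketch below is an illustration of that definition only, not the xMSanalyzer implementation; the function name `qscore` is hypothetical, and it assumes that a missing signal is stored as 0 or `None`.

```python
def qscore(replicate_sets, median_cv_percent):
    """Ratio of (% of biological samples detected in > 50% of replicates) to (% median CV).

    replicate_sets: one list of technical-replicate intensities per biological sample.
    A value of 0 or None is treated as "no signal" (an assumption of this sketch).
    """
    detected = 0
    for reps in replicate_sets:
        with_signal = sum(1 for v in reps if v is not None and v > 0)
        if with_signal > 0.5 * len(reps):
            detected += 1
    pct_detected = 100.0 * detected / len(replicate_sets)
    if median_cv_percent == 0:
        return float("inf")  # perfectly reproducible feature; guard against division by zero
    return pct_detected / median_cv_percent


# Four biological samples in triplicate (made-up intensities); 3 of 4 are detected
# in at least 2 of 3 replicates, so 75% detected divided by a median CV of 15%.
example = [[5, 6, 0], [0, 0, 3], [7, 8, 9], [1, 0, 2]]
print(qscore(example, median_cv_percent=15.0))
```

A feature detected in more samples or with a lower median CV gets a higher score, which matches the interpretation given in the column description.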
  • 14. Biomarker and statistical analysis using MetaboAnalyst3.0 (http://www.metaboanalyst.ca/) Integrated Health Science and Facilities Core NIEHS P30 ES019776
  • 15. Various options for feature selection and predictive evaluation
      • Univariate:
          – T-test, paired t-test, LIMMA-based t-test
          – P-values from moderated t-tests were adjusted for multiple hypothesis testing using the Benjamini-Hochberg false discovery rate (FDR) correction method
          – Manhattan plot to visualize metabolome-wide statistically significant changes
      • Multivariate and data mining:
          – Supervised: Support Vector Machine, Partial Least Squares Discriminant Analysis, Random Forest
          – Unsupervised: Principal Component Analysis, two-way hierarchical clustering analysis, K-means clustering
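For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, here is a minimal standalone sketch of the standard step-up procedure. It is not taken from MetaboAnalyst or xmsPANDA; the function name `benjamini_hochberg` and the example p-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), then a running
    minimum from the largest rank downwards keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four m/z features.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]
print(adj_p, significant)
```

Features whose adjusted p-value falls at or below the chosen FDR threshold are the ones that would sit above the horizontal cut-off line in a Manhattan plot.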
  • 16. MetaboAnalyst3.0: Multiple data analysis modules
  • 17. MetaboAnalyst3.0: 3 main modules for statistical/biomarker analysis
  • 18. MetaboAnalyst3.0: Formatting input data files • http://www.metaboanalyst.ca/faces/home.xhtml
  • 20. Sample input file: “Smokers_nonsmokers_MetaboAnalyst.csv”
      Sample NonSmoker_1 NonSmoker_7 NonSmoker_13 NonSmoker_19 NonSmoker_25
      Phenotype NonSmoker NonSmoker NonSmoker NonSmoker NonSmoker
      90.05544_114.26 6.25E+08 6.39E+08 1.03E+09 8.67E+08 9.07E+08
      104.07101_114.84 1.13E+08 9.70E+07 59600000 1.88E+08 7.80E+07
      104.10736_62.99 2.88E+09 4.34E+09 2.80E+09 2.67E+09 2.54E+09
      114.06648_66.35 6.15E+08 6.85E+08 3.14E+08 6.09E+08 5.52E+08
      118.08645_118.25 4.70E+09 4.21E+09 4.21E+09 6.28E+09 5.82E+09
      119.03401_115.38 23737.65549 0 0 0 0
      120.06562_119.55 2.85E+08 2.11E+08 2.79E+08 3.37E+08 2.58E+08
      122.02708_124.25 40014.00396 39634.23778 34433.93197 68656.48709 8146.55363
      123.0445_124.95 27500000 26300000 59400000 81600000 6.00E+07
      123.05525_167.77 0 0 0 0 480688.6081
      123.05531_52.82 3912282.412 12500000 3509484.928 2903190.851 0
      124.04031_124.95 4381223.786 6501175.935 8539781.005 12200000 2333448.716
      130.04993_122.35 7.31E+08 7.38E+08 1.02E+09 8.33E+08 8.17E+08
      132.07675_102.12 1.50E+08 1.28E+08 1.60E+08 3.87E+08 2.78E+08
      133.06077_123.44 8.30E+07 85400000 1.00E+08 85800000 93700000
      134.04476_132.94 16100000 17100000 33900000 23400000 16200000
      137.04564_76.31 94627.89366 99064.3524 31862.04198 74075.53368 96459.30732
      137.05966_112.24 23026.51599 729321.1317 21884.85816 27338.56273 26548.87029
  • 22. MetaboAnalyst3.0: Check for missing values and other potential issues such as mislabeling
  • 24. MetaboAnalyst3.0: Data transformation and scaling options
  • 25. MetaboAnalyst3.0: Results after normalization (panels: before, after)
  • 26. MetaboAnalyst3.0: Lots of options for statistical analysis! Let’s try T-test first
  • 28. MetaboAnalyst3.0: Click on individual red dots to visualize box-and-whisker plots
  • 29. MetaboAnalyst3.0: Heatmap option with two-way HCA (axes: samples, metabolites)
  • 30. MetaboAnalyst3.0: Heatmap option – using top 25 m/z features based on T-test (axes: samples, metabolites)
  • 31. MetaboAnalyst3.0: PLSDA option (PLSDA is a method for classifying, or separating, the groups)
  • 32. MetaboAnalyst3.0: PLSDA option – “2D Scores Plots” tab
  • 33. MetaboAnalyst3.0: PLSDA option – “3D Scores Plots” tab
  • 34. MetaboAnalyst3.0: PLSDA option – “Imp. Features” tab • Top 15 features based on variable importance in projection (VIP) determined using PLS-DA
  • 36. (EXTREMELY) Useful resources • Xia J. and Wishart D., Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst, Nature Protocols 2011 • Sugimoto et al., Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis, Current Bioinformatics 2012
  • 37. xmsPANDA: R package for pre-processing, biomarker discovery, clustering, and network analysis
  • 38. xmsPANDA workflow (developed by Karan Uppal Ph.D., MSc., Assistant Professor, Emory University School of Medicine)
      • Module a) Data pre-processing (Stage 1)
          – Replicate summarization
          – Data filtering: missing values, relative standard deviation
          – Data transformation (log, z-score)
          – Normalization (quantile)
      • Module b) Data mining (Stage 2)
          – Univariate: Limma t-test, paired t-test, Wilcoxon test, mixed effects model, ANOVA
          – Multivariate and predictive analysis for regression and classification: Support vector machine, MARS, Random forest, PLS, sPLS
          – Unsupervised: PCA, two-way hierarchical clustering analysis
      • Module c) Metabolome-wide association (correlation) analysis (Stage 3)
          – Global: Pairwise correlation and network of all metabolites
          – Targeted: Pairwise correlation and network of targeted metabolites
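The Stage 1 steps (replicate summarization, missing-value filtering, log transformation) can be illustrated with a small sketch. This is plain Python written for this transcript, not xmsPANDA code: the function names, the `min_detected` cutoff, and the `+1` offset before taking log2 are all assumptions made for the example, and xmsPANDA's own thresholds may be defined differently.

```python
import math
from statistics import median


def summarize_replicates(values, num_replicates):
    """Collapse each run of consecutive technical replicates into its median."""
    return [
        median(values[i:i + num_replicates])
        for i in range(0, len(values), num_replicates)
    ]


def fraction_detected(values, missing_val=0):
    """Fraction of samples whose summarized intensity is not the missing-value code."""
    return sum(1 for v in values if v != missing_val) / len(values)


def preprocess(feature_rows, num_replicates=3, min_detected=0.5, offset=1.0):
    """Summarize replicates, drop sparsely detected features, then log2-transform.

    feature_rows maps a feature ID to its intensities, ordered so that the
    technical replicates of each biological sample are adjacent.
    """
    kept = {}
    for feature_id, values in feature_rows.items():
        summarized = summarize_replicates(values, num_replicates)
        if fraction_detected(summarized) >= min_detected:
            # offset avoids log2(0) for samples where the feature was not detected
            kept[feature_id] = [math.log2(v + offset) for v in summarized]
    return kept


# Two made-up features, two biological samples, three replicates each.
rows = {
    "feature_a": [3, 7, 5, 0, 0, 0],
    "feature_b": [0, 0, 0, 0, 0, 1],
}
processed = preprocess(rows)
print(processed)
```

Summarizing before filtering means a single noisy replicate cannot rescue a feature that is otherwise absent from a sample, which is why `feature_b` is dropped here.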
  • 39. xmsPANDA: Various options for feature selection and predictive evaluation
      • Univariate:
          – T-test, paired t-test, LIMMA, linear regression, ANOVA
          – P-values from moderated t-tests were adjusted for multiple hypothesis testing using the Benjamini-Hochberg false discovery rate (FDR) correction method
          – Manhattan plot to visualize metabolome-wide statistically significant changes
      • Multivariate and data mining:
          – Supervised: Support Vector Machine, Partial Least Squares Discriminant Analysis (PLS, PLSDA, sPLS, sPLSDA), Random Forest, splines based (MARS)
          – Unsupervised: Principal Component Analysis, two-way hierarchical clustering analysis
      • Correlation/network analysis using MetabNet (Uppal 2015):
          – Untargeted: Correlations with all metabolites
          – Targeted: Correlations with metabolites from a specific pathway, clinical parameters
  • 40. xmsPANDA: Sample input files: a. Feature table b. Class labels file. The order of the sample IDs must be identical in both files.
  • 41. xmsPANDA: Example script
      library(xmsPANDA)
      demetabs_res <- diffexp(
        # input and output locations
        feature_table_file="/Users/karanuppal/Documents/Emory/JonesLab/Projects/C18_feature_table_PANDA.txt",
        parentoutput_dir="/Users/karanuppal/Documents/Emory/JonesLab/Projects/PANDA_lmreggeno_gender_allmiss0.3group0.7_median_v1.0.3.1_p0.01B/",
        class_labels_file="/Users/karanuppal/Documents/Emory/JonesLab/Projects/clinical_info_PANDA_class_gender.txt",
        # replicate summarization and missing-value filtering
        num_replicates=2, feat.filt.thresh=NA, summarize.replicates=TRUE,
        summary.method="median", summary.na.replacement="zeros",
        rep.max.missing.thresh=0.3, all.missing.thresh=0.5, group.missing.thresh=0.7,
        # transformation and normalization
        log2transform=TRUE, medcenter=FALSE, znormtransform=FALSE, quantile_norm=FALSE,
        lowess_norm=FALSE, madscaling=FALSE, rsd.filt.list=seq(0, 0, 5),
        # feature selection
        pairedanalysis=FALSE, featselmethod="lmreg", fdrthresh=0.01, fdrmethod="none",
        # correlation and network analysis
        cor.method="spearman", abs.cor.thresh=0.3, cor.fdrthresh=0.2, kfold=10, feat_weight=1,
        globalcor=TRUE, target.metab.file=NA, target.mzmatch.diff=10, target.rtmatch.diff=NA,
        max.cor.num=300, missing.val=0, networktype="complete", samplermindex=NA,
        # model and plotting options
        numtrees=1000, analysismode="classification", net_node_colors=c("green","red"),
        net_legend=FALSE, heatmap.col.opt="RdBu", sample.col.opt="rainbow", alphacol=0.3,
        pls_vip_thresh=3, num_nodes=2, max_rf_varsel=100, pls_ncomp=5,
        pcacenter=TRUE, pcascale=TRUE, pca.stage2.eval=FALSE, scoreplot_legend=TRUE,
        pca.global.eval=FALSE)
      Other options for featselmethod: "limma" (LIMMA), "rf" (random forest), "spls" (sparse PLS), "pls" (PLS), and more. See the example scripts for more options.
  • 42. xmsPANDA Manhattan plots: the Y-axis corresponds to -log10(p-value); the FDR cut-off is represented by the horizontal line. a) -logP vs m/z b) -logP vs retention time (annotated regions: amino acids; lipids, steroids)
  • 43. xmsPANDA PCA and cluster analysis: Principal Component Analysis (PCA; axes PC1, PC2) and Hierarchical Clustering Analysis (HCA; samples vs. m/z features)
  • 45. xmsPANDA Network analysis using MetabNet (Stage 3)
      • Figure legend: correlated m/z (|cor| > 0.4 at FDR 0.2); putative biomarkers from PLS
      • Targeted metabolome-wide association study (MWAS) of specific metabolites (biomarkers, environmental exposures, etc.)
      • Facilitates detection of related metabolic pathways and network structures
      • Correlation-based network analysis
          – Each node corresponds to a metabolite and each edge to a correlation coefficient, Cij
          – Two metabolites are linked if |Cij| > threshold at a user-defined significance level
          – Pearson, Spearman, and partial correlation
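The linking rule on this slide (connect two metabolites when |Cij| exceeds a threshold) can be sketched as follows. This is a simplified illustration, not MetabNet: it uses Pearson correlation only, it omits the significance (FDR) test that the slide also requires, and the names `correlation_edges` and `abs_cor_thresh` are chosen for the example.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile has zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def correlation_edges(profiles, abs_cor_thresh=0.4):
    """Return (node_a, node_b, r) for every pair with |r| above the threshold.

    profiles maps a metabolite (or m/z feature) ID to its intensities across samples.
    """
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) > abs_cor_thresh:
            edges.append((a, b, r))
    return edges


# Four made-up features measured in four samples.
profiles = {
    "m1": [1, 2, 3, 4],
    "m2": [2, 4, 6, 8.5],
    "m3": [4, 3, 2, 1],
    "m4": [2, 4, 1, 3],
}
edges = correlation_edges(profiles)
for a, b, r in edges:
    print(a, b, round(r, 3))
```

Using the absolute value keeps strongly negative correlations (such as `m1` and `m3`) in the network, while weakly related features such as `m4` remain unconnected.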
  • 46. Summary
      • xmsPANDA provides an automated workflow for analyzing metabolomics data (the package can be adapted to work with other -omics data)
      • The package facilitates network-level investigation of significant or differentially expressed metabolites
      • Includes independent functions for hierarchical clustering analysis, PCA, boxplots
      • Availability
          – Emory IT Box (accessible under the MetabolomicsWorkshopSummer2016 folder on Box)
          – Email: kuppal2@emory.edu
  • 48. Mummichog for pathway enrichment analysis
      • A) In the workflow of untargeted metabolomics, the conventional approach requires the metabolites to be identified before pathway/network analysis, while mummichog (blue arrow) predicts functional activity, bypassing metabolite identification.
      • B) Each row of dots represents possible matches of metabolites from one m/z feature: red is the true metabolite, gray the false matches. The conventional approach first requires the identification of metabolites before mapping them to the metabolic network.
      • C) mummichog maps all possible metabolite matches to the network and looks for local enrichment, which reflects the true activity because the false matches will distribute randomly.
      • Developed by Shuzhao Li Ph.D., Assistant Professor, Emory University School of Medicine
      • Li et al. 2013. PLoS Computational Biology
  • 49. xMSannotator: Metabolite annotation Manuscript under review; URL: https://sourceforge.net/projects/xmsannotator/
  • 50. Metabolite annotation • >10,000 reproducible signals can be detected using liquid chromatography high resolution mass spectrometry • Simple database searches can result in a large number of false positives
  • 51. Metabolite Annotation: mapping m/z from LC-MS data to known metabolites in databases. Many-to-many relationship between m/z and metabolites (figure labels: m/z 1, m/z 2)
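A simple mass-only database search shows why the m/z-to-metabolite mapping is many-to-many. The sketch below is an illustration, not xMSannotator: `match_mz` and `ADDUCT_SHIFTS` are hypothetical names, only two positive-mode adducts are included, and the three-entry database is made up from monoisotopic masses (L-Leucine is an isomer of L-Isoleucine and shares its mass).

```python
# Mass shifts added to the neutral monoisotopic mass for two positive-mode adducts.
ADDUCT_SHIFTS = {"M+H": 1.007276, "M+Na": 22.989218}


def ppm_error(observed_mz, expected_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - expected_mz) / expected_mz * 1e6


def match_mz(observed_mz, database, max_ppm=10.0):
    """Return every (metabolite, adduct) whose expected m/z lies within max_ppm."""
    matches = []
    for name, mono_mass in database.items():
        for adduct, shift in ADDUCT_SHIFTS.items():
            expected = mono_mass + shift
            if abs(ppm_error(observed_mz, expected)) <= max_ppm:
                matches.append((name, adduct))
    return matches


# Tiny illustrative database of monoisotopic masses.
database = {
    "L-Isoleucine": 131.09463,
    "L-Leucine": 131.09463,
    "L-Proline": 115.06333,
}

# One observed m/z matches two isomers: mass alone cannot tell them apart.
print(match_mz(132.100611, database))
print(match_mz(116.06946, database))
```

Because isomers and alternative adducts can all fall inside the same ppm window, additional evidence such as retention time, correlation between adducts, and isotope patterns is needed to rank the candidates.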
  • 52. Main goals of xMSannotator (developed by Karan Uppal Ph.D., MSc., Assistant Professor, Emory University School of Medicine)
      • Incorporate multiple layers of information (m/z, retention time, intensity profiles, isotope patterns, pathway membership) to enhance confidence in annotations and prioritize candidates for validation using MS/MS and chemical standards
      • Perform suspect screening (exposure to environmental chemicals, drugs)
      • Allow use of cluster/module membership to help generate hypotheses about the biochemical roles of features with no database matches
  • 53. Databases supported by xMSannotator
      • Human Metabolome Database (HMDB) – About 41,000 metabolites
          – 2,824 (Detected and Quantified)
          – 251 (Detected but not Quantified)
          – 38,439 (Expected but not detected)
      • LipidMaps – 36,269 lipids
      • The toxin and toxin target database (T3DB) – 2,097 toxic chemicals
      • KEGG – 15,298 chemicals
  • 54. xMSannotator functions
      • multilevelannotation(): multi-criteria based annotation that assigns annotations into confidence levels (high, medium, low, none)
      • get_mz_by_KEGGspecies: generate list of expected m/z based on adducts for all metabolites associated with a species in KEGG
      • get_mz_by_KEGGpathwayIDs: generate list of expected m/z based on adducts for all metabolites in specific pathways
      • get_mz_by_KEGGcompoundIDs: generate list of expected m/z based on adducts for given KEGG compound ID
      • get_kegg_map: download KEGG map as a PNG file with color-coded KEGG IDs
      • ChemSpider.annotation: m/z based annotation for select databases in ChemSpider
  • 55. xMSannotator example script (R)
      library(xMSannotator)
      #Package data files
      data(example_data) #example peak intensity matrix
      data(adduct_table)
      data(adduct_weights)
      #data(customIDs) #example for custom IDs
      #data(customDB) #example for custom DB
      #data(hmdbAllinf)
      #data(keggotherinf)
      #data(t3dbotherinf)
      ###########Parameters to change##############
      dataA<-read.table("/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/50marmosets_rawdata_averaged.txt",sep="\t",header=TRUE)
      #OR
      #dataA<-example_data
      outloc<-"/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/testBloodSpotv1.1.2T3DB/"
      max.mz.diff<-10 #mass search tolerance for DB matching in ppm
      max.rt.diff<-10 #retention time tolerance between adducts/isotopes
      corthresh<-0.7 #correlation threshold between adducts/isotopes
      max_isp=5
      mass_defect_window=0.01
      num_nodes<-4 #number of cores to be used; 2 is recommended for desktop computers due to high memory consumption
      db_name="HMDB" #other options: KEGG, LipidMaps, T3DB
      status=NA #other options: "Detected", NA, "Expected and Not Quantified"
      num_sets<-300 #number of sets into which the total number of database entries should be split
      mode<-"pos" #ionization mode
      queryadductlist=c("M+2H","M+H+NH4","M+ACN+2H","M+2ACN+2H","M+H","M+NH4","M+Na","M+ACN+H","M+ACN+Na","M+2ACN+H","2M+H","2M+Na","2M+ACN+H","M+2Na-H","M+H-H2O","M+H-2H2O")
      #other options: c("M-H","M-H2O-H","M+Na-2H","M+Cl","M+FA-H"); c("positive"); c("negative"); c("all"); see data(adduct_table) for complete list
      #########################
      dataA<-unique(dataA)
      print(dim(dataA))
      system.time(annotres<-multilevelannotation(dataA=dataA,max.mz.diff=max.mz.diff,max.rt.diff=max.rt.diff,
        cormethod="pearson",num_nodes=num_nodes,queryadductlist=queryadductlist,
        mode=mode,outloc=outloc,db_name=db_name,adduct_weights=adduct_weights,num_sets=num_sets,allsteps=TRUE,
        corthresh=corthresh,NOPS_check=TRUE,customIDs=NA,missing.value=NA,hclustmethod="complete",deepsplit=2,
        networktype="unsigned",minclustsize=10,module.merge.dissimilarity=0.2,filter.by=c("M+H"),
        biofluid.location=NA,origin=NA,status=status,boostIDs=NA,max_isp=max_isp,HMDBselect="union",
        mass_defect_window=mass_defect_window,pathwaycheckmode="pm",mass_defect_mode="pos"))
  • 56. Sample output
      Confidence chemical_ID mz time MatchCategory Name Formula MonoisotopicMass Adduct ISgroup Module mean_int_vec
      3 HMDB00472 221.090047 51.9551753 Unique 5-Hydroxy-L-tryptophan C11H12N2O3 220.08479 M+H ISgroup_17_1_10 17 687747.839
      3 HMDB00472 202.095804 51.7621006 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[-18] - M_[-18] ISgroup_17_1_10 17 76047.9214
      3 HMDB00472 222.093448 52.9762499 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[+2] - M_[+2] ISgroup_17_1_10 17 53822.941
      3 HMDB00472 227.097096 51.5564666 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[+7] - M_[+7] ISgroup_17_1_10 17 62478.9814
      3 HMDB00472 203.103566 50.3062004 Unique 5-Hydroxy-L-tryptophan C11H12N2O3_[-17] - M_[-17] ISgroup_17_1_11 17 108606.947
      3 HMDB00269 302.302693 348.568133 Unique Sphinganine C18H39NO2 301.29808 M+H ISgroup_244_44_31 244 394617.009
      3 HMDB00269 303.305073 347.574618 Unique Sphinganine C18H39NO2_[+2] - M_[+2] ISgroup_244_44_31 244 40886.1457
      3 HMDB00222 400.340702 374.505405 Unique L-Palmitoylcarnitine C23H45NO4 399.33486 M+H ISgroup_244_45_35 244 3787674.18
      3 HMDB00222 401.341243 371.277478 Multiple L-Palmitoylcarnitine C23H45NO4_[+2] - M_[+2] ISgroup_244_45_35 244 1213505.56
      3 HMDB00211 181.070374 58.0993743 Multiple Myoinositol C6H12O6 180.06339 M+H ISgroup_194_3_8 194 2071579.66
      3 HMDB00211 182.073758 54.9711864 Multiple Myoinositol C6H12O6_[+2] - M_[+2] ISgroup_194_3_8 194 128395.932
      3 HMDB00201 204.121143 48.3112602 Unique L-Acetylcarnitine C9H17NO4 203.11576 M+H ISgroup_135_3_13 135 4173369.02
      3 HMDB00201 206.125572 49.79958 Unique L-Acetylcarnitine C9H17NO4_[+3] - M_[+3] ISgroup_135_3_13 135 8500.80302
      3 HMDB00172 132.100611 50.8616042 Multiple L-Isoleucine C6H13NO2 131.09463 M+H ISgroup_194_3_11 194 19782503.3
      3 HMDB00172 86.0954504 48.6019537 Multiple L-Isoleucine C6H13NO2_[-45] - M_[-45] ISgroup_194_3_10 194 1330586.31
      3 HMDB00172 133.10397 49.949951 Multiple L-Isoleucine C6H13NO2_[+2] - M_[+2] ISgroup_194_3_11 194 1219553.75
      3 HMDB00162 116.06946 49.2340805 Unique L-Proline C5H9NO2 115.06333 M+H ISgroup_194_3_7 194 7364523.19
      3 HMDB00162 117.072835 47.1430823 Unique L-Proline C5H9NO2_[+2] - M_[+2] ISgroup_194_3_8 194 328238.286
      3 HMDB00159 166.084711 54.5950322 Unique L-Phenylalanine C9H11NO2 165.07898 M+H ISgroup_194_3_9 194 20408852.3
      3 HMDB00159 120.079568 50.8568168 Unique L-Phenylalanine C9H11NO2_[-45] - M_[-45] ISgroup_194_3_8 194 3268442.52
      3 HMDB00159 167.088077 50.3774503 Unique L-Phenylalanine C9H11NO2_[+2] - M_[+2] ISgroup_194_3_9 194 1862264.77
      3 HMDB00148 148.059082 48.2266649 Multiple L-Glutamicacid C5H9NO4 147.05316 M+H ISgroup_244_1_6 244 491446.134
      3 HMDB00148 192.022544 47.8072529 Multiple L-Glutamicacid C5H9NO4 147.05316 M+2Na-H ISgroup_244_1_3 244 75734.8384
      Confidence scores for possible chemical identity:
      • 0 is no confidence
      • 1 is low confidence
      • 2 is medium confidence
      • 3 is high confidence
      • 4 is experimentally confirmed
  • 57. Summary
      • Raw data
      • Data extraction (apLCMS, XCMS, MzMine2.0, xMSanalyzer)
      • Probability or score-based annotation (xMSannotator)
      • Biomarker discovery and network analysis (MetaboAnalyst, xmsPANDA)
      • Pathway analysis and applications (Dr. Shuzhao Li)
      Example feature table (peak areas per raw file):
      m/z time Smoker_11.raw Smoker_15.raw Smoker_13.raw
      193.0970902 1.697928509 21590.09577 1465875.407 2921520.329
      104.071007 1.914036922 1.68E+08 1.20E+08 1.18E+08
      137.0456421 1.271814331 217380.9151 66352.25511 96353.93902
      241.0307929 2.180590728 6.27E+07 8.42E+07 8.09E+07
      134.044759 2.215654287 1.39E+07 2.77E+07 2.66E+07
  • 58. Clinical Biomarkers Laboratory • Lab website: http://clinicalmetabolomics.org/ • Email: kuppal2@emory.edu • Dean Jones, Young-Mi Go, Shuzhao Li, Karan Uppal, Douglas Walker, Josh Chandler, Sophia Banton, Ken Liu, Vilinh Tran, Michael Orr, Bill Liang (not shown)
  • 59. Live Demonstrations of xmsPANDA and Mummichog
  • 60. xmsPANDA • Installation instructions, data files (input files), example R scripts, and manual on Emory IT Box
  • 61. xmsPANDA • R Script
      #load xmsPANDA
      library(xmsPANDA)
      feature_table_file<-"/Users/karanuppal/Documents/Emory/Workshop/Workshop2016/Mzmine_smokers_nonsmokers_PANDA.txt"
      class_labels_file<-"/Users/karanuppal/Documents/Emory/Workshop/Workshop2016/classlabels.txt"
      outloc<-"/Users/karanuppal/Documents/Emory/Workshop/Workshop2016/testpanda4/"
      demetabs_res<-diffexp(feature_table_file=feature_table_file, parentoutput_dir=outloc,
        class_labels_file=class_labels_file, num_replicates=3, feat.filt.thresh=NA,
        summarize.replicates=TRUE, summary.method="median", summary.na.replacement="zeros",
        rep.max.missing.thresh=0.5, all.missing.thresh=NA, group.missing.thresh=NA,
        input.intensity.scale="raw", log2transform=FALSE, medcenter=FALSE, znormtransform=FALSE,
        quantile_norm=FALSE, lowess_norm=FALSE, madscaling=FALSE, rsd.filt.list=c(0),
        pairedanalysis=FALSE, featselmethod="lm1wayanova", fdrthresh=0.05, fdrmethod="none",
        cor.method="pearson", abs.cor.thresh=0.4, cor.fdrthresh=0.2, kfold=10, feat_weight=1,
        globalcor=TRUE, target.metab.file=NA, target.mzmatch.diff=10, target.rtmatch.diff=NA,
        max.cor.num=NA, missing.val=0, networktype="complete", samplermindex=NA, numtrees=1000,
        analysismode="classification", net_node_colors=c("green","red"), net_legend=FALSE,
        heatmap.col.opt="RdBu", sample.col.opt="rainbow", alphacol=0.3, pls_vip_thresh=2,
        num_nodes=2, max_varsel=100, pls_ncomp=5, pcacenter=TRUE, pcascale=TRUE,
        pred.eval.method="BER", rocfeatlist=seq(2,10,1), rocfeatincrement=TRUE, rocclassifier="svm",
        foldchangethresh=0, wgcnarsdthresh=30, WGCNAmodules=FALSE, optselect=FALSE, max_comp_sel=1,
        saveRda=FALSE, pca.cex.val=4, pls.permut.count=NA, pca.ellipse=TRUE,
        ellipse.conf.level=0.95, legendlocation="bottomleft", svm.acc.tolerance=5)
  • 62. xmsPANDA • Results • ReadME.txt
      – Stage 1 results: Preprocessing (normalization, transformation)
      – Stage 2 results: Feature selection and evaluation results (Manhattan plots, PCA, HCA, boxplots, table of significant features, clustering results)
      – Stage 3 results: Correlation-based network analysis
  • 63. xmsPANDA Stage 2 Results
      • Pages 9 and 10 – Type I and II Manhattan plots
      • Page 11 – 2-way HCA heatmap
      • Final page(s) – box plots (figure label: Cotinine)
  • 64. xmsPANDA Stage 3 Results • Correlation network plot
  • 65. Mummichog • Example data set (input file) and manual on Emory IT Box
  • 66. Mummichog • Change directory in command prompt to location of Mummichog folder: • Example: – cd mummichog-1.0.7\test
  • 67. Mummichog • Change directory in command prompt to location of Mummichog folder: • Example: – C:\Users\sbanton\Downloads\mummichog-1.0.7\test>python ../mummichog/main.py -f testdata.txt -o testdata.txt -c 0.05