Multivariate and network tools for biological data analysis
1. Dmitry Grapov and Oliver Fiehn
University of California, Davis
Multivariate Analysis and
Visualization Tools for
Metabolomic Data
2. State of the art facility producing massive
amounts of biological data…
>20-30K samples/yr
>200 studies
3. Sample
Variable
Data Analysis and Visualization
Quality Assessment
• use replicated measurements
and/or internal standards to
estimate analytical variance
Statistical and Multivariate
• use the experimental design
to test hypotheses and/or
identify trends in analytes
Functional
• use statistical and multivariate
results to identify impacted
biochemical domains
Network
• integrate statistical and
multivariate results with the
experimental design and
analyte metadata
experimental design
- organism, sex, age etc.
analyte description and
metadata
- biochemical class, mass
spectra, etc.
Variable Sample
4. Sample
Variable
Data Analysis and Visualization
Quality Assessment
• use replicated measurements
and/or internal standards to
estimate analytical variance
Statistical and Multivariate
• use the experimental design
to test hypotheses and/or
identify trends in analytes
Functional
• use statistical and multivariate
results to identify impacted
biochemical domains
Network
• integrate statistical and
multivariate results with the
experimental design and
analyte metadata
Network Mapping
experimental design
- organism, sex, age etc.
analyte description and
metadata
- biochemical class, mass
spectra, etc.
Variable Sample
5. Principal Component
Analysis (PCA) of all
analytes, showing QC
sample scores
Data Quality Assessment
Drift in >400 replicated measurements across >100 analytical batches for a single analyte
Acquisition batch
Abundance
QCs embedded
among >5,500
samples (1:10)
collected over
1.5 yrs
If the biological effect
size is less than the
analytical variance
then the experiment
will incorrectly yield
non-significant results
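The warning above can be made concrete with a rough calculation. The sketch below assumes that biological and analytical variance are independent and simply add, that both groups have the same size, and that all numbers are hypothetical; `expected_t` is an illustrative helper, not part of any tool named in this deck.

```python
import math

def expected_t(effect, biological_sd, analytical_sd, n_per_group):
    """Approximate two-sample t statistic for a given mean difference.

    Assumes biological and analytical variance are independent and add,
    and that both groups contain n_per_group samples.
    """
    total_variance = biological_sd ** 2 + analytical_sd ** 2
    standard_error = math.sqrt(2 * total_variance / n_per_group)
    return effect / standard_error

# Hypothetical numbers: the same biological effect, measured with and
# without substantial analytical noise.
t_clean = expected_t(effect=1.0, biological_sd=1.0, analytical_sd=0.0, n_per_group=10)
t_noisy = expected_t(effect=1.0, biological_sd=1.0, analytical_sd=3.0, n_per_group=10)
```

With these assumed values the expected t statistic drops from about 2.2 to about 0.7, so the same biological difference would likely go undetected once the analytical noise is included.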
6. Data Quality Assessment
Analyte specific data quality
overview
Sample specific normalization can be used
to estimate and remove analytical variance
Raw Data Normalized Data
Normalizations need to be
numerically and visually validated
log mean
low precision
%RSD
high precision
Samples
QCs
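A minimal sketch of the numerical validation step: compute %RSD per analyte across QC injections before and after normalization. Total-sum scaling is used here only as one possible sample-specific normalization (the slides do not name a method), and the QC values and function names are hypothetical.

```python
from statistics import mean, stdev

def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sd / mean."""
    return 100.0 * stdev(values) / mean(values)

def sum_normalize(sample):
    """Scale one sample so its analyte intensities sum to 1 (assumed method)."""
    total = sum(sample.values())
    return {analyte: value / total for analyte, value in sample.items()}

def rsd_by_analyte(samples):
    """%RSD of each analyte across a list of samples."""
    return {a: percent_rsd([s[a] for s in samples]) for a in samples[0]}

# Hypothetical QC injections of the same material, with a signal loss
# that affects every analyte in an injection equally.
qc_raw = [
    {"glucose": 100.0, "lactate": 50.0, "urea": 50.0},
    {"glucose": 90.0, "lactate": 45.0, "urea": 45.0},
    {"glucose": 80.0, "lactate": 40.0, "urea": 40.0},
]
qc_norm = [sum_normalize(s) for s in qc_raw]

raw_rsd = rsd_by_analyte(qc_raw)    # analytical variance before normalization
norm_rsd = rsd_by_analyte(qc_norm)  # analytical variance after normalization
```

In this contrived case the drift is purely multiplicative, so normalization removes it completely; real data would need the visual check as well, since a normalization can also introduce artifacts.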
7. Network Mapping
Ranked statistically
significant differences
within a biochemical
context
Statistics
Multivariate
Context
+
+
=
Statistical and Multivariate Analyses
Group 1
Group 2
What analytes are
different between the
two groups of samples?
Statistical
significant differences
lacking rank and
context
t-Test
Multivariate
ranked differences
lacking significance
and context
O-PLS-DA
8. Network Mapping
Statistics
Multivariate
Context
+
+
=
Statistical and Multivariate Analyses
Group 1
Group 2
What analytes are
different between the
two groups of samples?
Statistical
t-Test
Multivariate
O-PLS-DA
To see the big picture it is necessary to view the data from multiple
different angles
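The "Statistics + Multivariate + Context" idea can be sketched as a simple merge of per-analyte results. The p-values, loadings, and biochemical classes below are hypothetical stand-ins for t-test output, O-PLS-DA output, and analyte metadata; the sketch does not compute either model.

```python
# Hypothetical per-analyte results; all values are illustrative only.
results = {
    "glucose":  {"p_value": 0.001, "loading": 0.40,  "biochemical_class": "carbohydrate"},
    "lactate":  {"p_value": 0.030, "loading": -0.75, "biochemical_class": "organic acid"},
    "urea":     {"p_value": 0.600, "loading": 0.90,  "biochemical_class": "urea cycle"},
    "pyruvate": {"p_value": 0.200, "loading": 0.10,  "biochemical_class": "organic acid"},
}

def ranked_significant(results, alpha=0.05):
    """Keep analytes with p < alpha, rank them by absolute multivariate
    loading, and attach the biochemical class as context."""
    significant = [(name, r) for name, r in results.items() if r["p_value"] < alpha]
    significant.sort(key=lambda item: abs(item[1]["loading"]), reverse=True)
    return [(name, r["biochemical_class"]) for name, r in significant]

ranked = ranked_significant(results)
```

Here urea has the largest loading but is dropped because it is not significant, while lactate outranks glucose among the significant analytes; neither view alone would give that ordering.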
12. Functional Analysis: opportunity for ‘Omic integration
Use domain knowledge
databases to integrate
genomic, proteomic
and metabolomic data
Current approaches can
be limited to pathway
level analyses
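One common form of pathway-level analysis is an over-representation test. The sketch below uses a hypergeometric test as an assumed example (the slides do not specify a method); the function name and the counts are hypothetical.

```python
from math import comb

def enrichment_p_value(n_total, n_in_pathway, n_selected, n_overlap):
    """Hypergeometric probability of seeing at least n_overlap pathway
    members among n_selected analytes drawn from n_total measured analytes."""
    upper = min(n_in_pathway, n_selected)
    favourable = sum(
        comb(n_in_pathway, i) * comb(n_total - n_in_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_total, n_selected)

# Hypothetical counts: 100 measured analytes, 10 belong to the pathway,
# 20 were flagged as changed, and 6 of those 20 are pathway members.
p = enrichment_p_value(n_total=100, n_in_pathway=10, n_selected=20, n_overlap=6)
```

A test like this only sees analytes that are already annotated to a pathway, which is one reason such approaches stay limited to pathway-level conclusions.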
16. Empirical Networks
Use experiment specific or data driven relationships to gain novel insight
into biochemical relationships
urea cycle
nucleotide
synthesis
protein
glycosylation
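A data-driven network can be sketched by connecting analytes whose abundances correlate strongly across samples. Pearson correlation and the 0.8 threshold are assumptions chosen for illustration, and the abundance values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_edges(data, threshold=0.8):
    """Return (analyte_a, analyte_b, r) for every pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

# Hypothetical abundances of three analytes across five samples.
abundances = {
    "ornithine":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "citrulline": [2.1, 3.9, 6.2, 8.0, 9.9],
    "uridine":    [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = correlation_edges(abundances)
```

Only the ornithine and citrulline pair passes the threshold here, so the resulting network has a single edge; the edges come from the data rather than from a pathway database.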
17. Mass Spectral Networks
Use mass spectra as a proxy for structure to help make sense of
unknown compounds’ biochemical identities
Watrous J et al. PNAS 2012;109:E1743-E1752
unknown compounds are likely phytosterol
esters
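The idea of using spectra as a proxy for structure can be sketched with a plain cosine similarity between two spectra stored as m/z to intensity mappings. This is a simplified stand-in, not the scoring used in the cited work, and the spectra and function name are hypothetical.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical nominal-mass spectra.
known = {73: 100.0, 147: 50.0, 204: 80.0}
unknown = {73: 90.0, 147: 40.0, 204: 85.0}
unrelated = {57: 100.0, 71: 60.0}

sim_unknown = cosine_similarity(known, unknown)      # similar fragment pattern
sim_unrelated = cosine_similarity(known, unrelated)  # no shared fragments
```

Connecting spectra whose similarity exceeds a chosen cutoff places an unknown next to its most similar annotated neighbours, which is what suggests a candidate compound class.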
18. Mass Spectral Networks
Use mass spectra and empirical relationships to narrow down the
biochemical roles for unknown compounds
Rigorous chemical experiments identified the unknown compounds as partial
derivatization products of glucose
21. Analysis at the Metabolomic Scale and Beyond
pyruvate lactate
enzyme
gene B gene A
Pathway independent metabolomic (known and unknown),
proteomic and genomic data integration
22. Software and Resources
•DeviumWeb- Dynamic multivariate data analysis and
visualization platform
url: https://github.com/dgrapov/DeviumWeb
•imDEV- Microsoft Excel add-in for multivariate analysis
url: http://sourceforge.net/projects/imdev/
•MetaMapR: Network analysis tools for metabolomics
url: https://github.com/dgrapov/MetaMapR
•TeachingDemos- Tutorials and demonstrations
url: http://sourceforge.net/projects/teachingdemos/?source=directory
url: https://github.com/dgrapov/TeachingDemos
•Data analysis case studies and Examples
url: http://imdevsoftware.wordpress.com/