1. Workshop: “Statistical Methods for Omics Data Integration and Analysis”
Valencia, Spain, 14-16, 2015
INTEGRATION OF METABOLOMICS, LIPIDOMICS AND CLINICAL DATA BY
RANDOM FOREST
Animesh Acharjee1, Zsuzsanna Ament1, James A West1, Elizabeth Stanley1, Benjamin J Jenkins1,
Albert Koulman1 & Julian L Griffin1,2
1 Medical Research Council, Elsie Widdowson Laboratory, 120 Fulbourn Road, Cambridge, CB1 9NL, UK
2 The Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
Introduction: Peroxisome proliferator-activated
receptors (PPAR-α, PPAR-γ and PPAR-δ) are known to
regulate systemic metabolism (Ament et al., 2012).
Beneficial effects of their activation in the treatment of a
wide array of metabolic diseases are well established.
However, they can also cause side effects and adverse
pathological changes through unknown mechanisms. In
the current study, a PPAR-pan agonist (a triple agonist of
PPAR-α, -γ, and -δ) was investigated after dietary
treatment of male Sprague–Dawley (SD) rats. In addition
to the classical toxicological tests (urinalysis and clinical
chemistry), various mass spectrometry (MS) approaches
were employed to detect liver metabolomic and lipidomic
changes, in order to define the systemic
changes and better understand the underlying toxicity.
Here we present an approach that integrates
multiple data types and successfully combines
classical clinical chemistry and toxicology test results
with MS data. First, Random Forest (RF) (Breiman, 2001)
classification was used to select subsets of metabolites,
showing that RF is successful in building associations and
predicting different dose responses. Next, we used an RF
regression approach to link liver metabolites with clinical
phenotypes from plasma and urinalysis. Finally, an
integrated network analysis was performed, providing a
relatively small set of interrelated metabolites that can
predict the different dose levels with high accuracy. We
validate this approach by comparing the selected
metabolites to pathways known to be involved in PPAR
metabolism.
Methods: Five groups of 12 animals were
administered a PPAR-pan activator by daily oral gavage
at 30, 100, 300 or 1000 mg/kg/day for 13 weeks. Separate
satellite groups of animals (6 per group) were kept for a
4-week treatment-free period in the control, 300 and 1000
mg/kg/day dose groups. Blood and urine samples of all
animals were collected at weeks 13 and 18. At necropsy,
tissue samples were collected following an overdose of
anaesthetic (halothane Ph. Eur. Vapour). Samples were
snap-frozen in liquid nitrogen and were maintained at
-80 °C until further analysis. Gas chromatography mass
spectrometry (GC-MS), direct infusion mass
spectrometry (DI-MS) and liquid chromatography tandem
mass spectrometry (LC-MS/MS) methods were set up,
optimized, and used to measure hepatic (i) total fatty acids
(GC-MS), (ii-iii) intact lipids by DI-MS (pos. and neg.
ionization modes), (iv-v) intact lipids by LC-MS/MS (pos.
and neg. ionization modes), (vi) acyl-carnitines (targeted
LC-MS/MS), (vii) eicosanoids (SPE followed by
LC-MS/MS), and (viii-x) aqueous metabolites (open profiling
(pos. and neg.) and targeted LC-MS/MS), generating a
total of 9 datasets comprising over 1500 variables, in
addition to the clinical-chemical parameters (CCPs)
of plasma (33 variables) and urinalysis (12 variables) and
relative liver weight (ratio of liver weight to body weight).
Random Forest (RF) was used in both classification
and regression modes for different data types, including liver
metabolites and CCPs from urine and plasma. Using the
metabolites selected by the classification approach, RFs
were iteratively fitted so that they yielded the smallest
out-of-bag (OOB) error rates (Díaz-Uriarte and De Andres,
2006). Further, we included permutation tests to calculate
the significance of the associations of the metabolites with
CCPs. For integrated network analysis, we used partial
correlation because it can distinguish between
direct and indirect associations.
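The permutation tests mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the names (`permutation_pvalue`, `metabolite`, `ccp_linked`, `ccp_unlinked`) are hypothetical, and the absolute Pearson correlation is used as the test statistic purely for brevity, whereas the abstract does not state which statistic was permuted.

```python
import numpy as np


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real link to x while keeping both marginals.
        permuted = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        if permuted >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


# Synthetic stand-ins (assumptions): one metabolite and two clinical parameters.
rng = np.random.default_rng(42)
metabolite = rng.normal(size=60)
ccp_linked = 0.8 * metabolite + rng.normal(scale=0.5, size=60)
ccp_unlinked = rng.normal(size=60)

p_linked = permutation_pvalue(metabolite, ccp_linked)
p_unlinked = permutation_pvalue(metabolite, ccp_unlinked)
print("p (linked CCP):  ", p_linked)
print("p (unlinked CCP):", p_unlinked)
```

The same shuffling scheme applies to any association measure, for example the variation explained by an RF regression, at the cost of refitting the model for every permutation.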
Results: RF classification differentiated dose response
effects across all metabolomic and lipidomic datasets, and
the regression approach was successfully applied to link
CCPs with metabolomic and lipidomic data.
Classification approach: The different doses
administered were treated as a multiclass response, whilst
metabolomic and lipidomic data were treated as predictor
sets. RF was applied in classification mode and the OOB
misclassification error rate was calculated for the
individual datasets. Positive ion mode DI-MS intact lipids
and eicosanoid method variables were found to have the
lowest OOB errors of 36% each. Metabolites were selected
using a backward elimination approach (Díaz-Uriarte and De
Andres, 2006). From each of the 9 datasets, important
variables were selected, narrowing 1538 variables down to 57.
Using the selected variables only, the RF OOB error for dose
prediction was 22%. Applying the backward
elimination process again, we were able to select the most
discriminatory variables, further reducing the total number
of variables to 15 (Figure 1). These were selected across 5
datasets and further reduced the OOB error to 21%.
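The selection loop described above can be sketched as follows, using scikit-learn on synthetic data. This is a simplified variant of backward elimination guided by OOB error, not the authors' code or the exact procedure of Díaz-Uriarte and De Andres (2006). The function names, the 20% drop fraction, the tie-breaking rule and the simulated data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_error(X, y, seed=0, n_trees=200):
    """Fit a forest; return (OOB misclassification rate, feature importances)."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed)
    rf.fit(X, y)
    return 1.0 - rf.oob_score_, rf.feature_importances_


def backward_eliminate(X, y, drop_frac=0.2, min_features=2, seed=0):
    """Repeatedly drop the least important features and keep the subset
    with the smallest OOB error (ties go to the smaller subset)."""
    kept = np.arange(X.shape[1])
    best_err, best_kept = np.inf, kept
    while True:
        err, importances = oob_error(X[:, kept], y, seed)
        if err <= best_err:
            best_err, best_kept = err, kept
        if len(kept) <= min_features:
            break
        n_drop = max(1, int(drop_frac * len(kept)))
        n_keep = max(min_features, len(kept) - n_drop)
        order = np.argsort(importances)[::-1]  # most important first
        kept = np.sort(kept[order[:n_keep]])
    return best_kept, best_err


# Synthetic data (assumption): 5 dose groups x 12 animals, 40 variables,
# of which only the first three carry any dose signal.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 12)
X = rng.normal(size=(60, 40))
X[:, :3] += 1.5 * y[:, None]

selected, err = backward_eliminate(X, y)
print("selected variables:", selected)
print("OOB error of selected subset:", round(err, 2))
```

Because the OOB error is estimated from samples each tree did not see, no separate test set is needed, which matters with only 12 animals per group.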
Regression approach: We linked metabolomic and lipidomic
data to different clinical phenotypes: relative liver weight
and urine and plasma CCPs.
Relative liver weight: Intact lipids measured by DI-MS in
positive mode explained the most variation (84%), whereas
intact lipids in negative mode explained the least
(32%). In total, 42 variables (out of 1538) were selected,
explaining a striking 82% of the variation.
Urinalysis: Urine colour and turbidity were best
explained by selected intact lipid DI-MS (pos. and neg.),
eicosanoid, and intact lipid LC-MS/MS (pos.) data
variables. Of the variation (R²) in urine colour, 54% was
explained by 24 variables, and 60% of the variation in turbidity
was explained by 23 variables.
Plasma Clinical Chemistry: Using intact lipid DI-MS
(pos. and neg.), total fatty acid GC-MS and eicosanoid
data, 52%, 37% and 44% of the variation in aspartate
aminotransferase (AST, IU/L), albumin (g/L) and glucose
(mmol/L) were explained by 24, 31 and 27 variables,
respectively.
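The "variation explained" figures reported for the regression approach correspond to an R² computed on out-of-bag predictions. A minimal sketch with scikit-learn is given below, assuming synthetic data: the predictor matrix, the simulated phenotype and the names `pct_explained` and `top` are hypothetical and stand in for the liver lipid data and a CCP such as relative liver weight.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_animals, n_lipids = 60, 30

# Hypothetical liver lipid intensities (assumption, not the study data).
X = rng.normal(size=(n_animals, n_lipids))
# Hypothetical phenotype driven by two of the lipids plus noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n_animals)

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)

# With oob_score=True, oob_score_ holds the R^2 of the out-of-bag predictions.
pct_explained = 100.0 * rf.oob_score_
# Indices of the five predictors with the highest impurity-based importance.
top = np.argsort(rf.feature_importances_)[::-1][:5]

print(f"variation explained (OOB R^2): {pct_explained:.0f}%")
print("most important predictors:", top)
```

Ranking predictors by importance and refitting on the top-ranked subset is one way to arrive at statements of the form "N variables explain P% of the variation".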
Network analysis: An integrated network was built
using a partial correlation approach (Figure 1).
Figure 1: Partial correlation network of the 15 most discriminatory
variables differentiating between dose levels. Metabolites
from different data matrices are shown in different colours: total fatty
acids GC-MS (yellow); eicosanoid open profiling (red); intact
lipids from DI-MS in neg. (purple) and pos. (blue) mode; and
acyl-carnitines (green). Dotted lines represent negative and solid
lines positive partial correlation coefficients. Eico_X
represents unknown small molecules measured in the
eicosanoid assay by LC-MS/MS.
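To illustrate why partial correlation separates direct from indirect associations, the sketch below computes partial correlations from the inverse covariance (precision) matrix with NumPy. It is an illustrative example under stated assumptions, not the authors' network code: the three variables `a`, `b` and `c` are simulated so that `c` depends on `a` only through `b`, and the function name is hypothetical.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation of every pair of columns, given all other columns.

    data: array of shape (n_samples, n_variables), n_samples > n_variables.
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Simulated chain a -> b -> c (assumption): a and c are linked only via b.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
data = np.column_stack([a, b, c])

marginal = np.corrcoef(data, rowvar=False)
partial = partial_correlations(data)
print("correlation a-c:              ", round(marginal[0, 2], 2))
print("partial correlation a-c | b:  ", round(partial[0, 2], 2))
print("partial correlation a-b | c:  ", round(partial[0, 1], 2))
```

The ordinary correlation between `a` and `c` is high, while their partial correlation is close to zero, so a network drawn from partial correlations would contain the edges a-b and b-c but not a-c. Direct inversion needs more samples than variables. When that does not hold, a regularised estimate of the precision matrix would be required.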
Discussion: We analysed, processed and explored
multiple liver metabolomics and lipidomics datasets along
with CCPs measured from plasma and urine. RF
classification was successfully employed in the metabolite
selection process and allowed us not only to combine 9
different types of metabolite data from multiple platforms
but also to focus our attention on the 15 most discriminatory
metabolites for data interpretation and biological
understanding, while increasing the predictive ability at the
same time. Furthermore, RF regression proved to be useful
as an interdisciplinary approach in joining classical
toxicology with modern metabolomics and lipidomics
data. Four broad themes emerged from the analysis. Firstly,
the 15 selected metabolites include only lipids and no
aqueous compounds, reflecting the intimate role of PPARs
in lipidomic remodelling. Secondly, changes in
acyl-carnitines (C4-DC and C5:1) are suggestive of aciduria,
more specifically 2-methyl-3-hydroxybutyric aciduria.
PPARs are known to regulate mitochondrial lipid
metabolism, and aciduria is commonly reported in
mitochondrial disorders, which could be suggestive of
a common pathophysiological mechanism of damage.
Thirdly, it is interesting to note that, although the
discriminatory free fatty acids C20:3, C22:5 and C20:5 all
have the potential to feed into the eicosanoid cascade,
no associations were found with compounds detected by
the open profiling eicosanoid method. This could explain
the inability to identify these eicosanoid method
metabolites, and highlights the importance of a targeted
approach when these molecules are measured. Finally,
the odd-chain saturated fatty acid C17:0, commonly
considered to be a simple marker of ruminant fat intake,
was also found to be important and highly discriminatory,
leading us to speculate further on suggestions linking this
fatty acid to fatty acid α-oxidation (Jenkins et al., 2015).
In addition, CCPs were successfully combined with
metabolomic and lipidomic datasets, highlighting
unexpected connections, such as that between liver lipid
status and urine turbidity. The liver-related biochemical
parameters AST (a hepatic leakage enzyme) and albumin
(indicative of altered liver synthetic function) were also
linked with several decreasing phospholipids, a class of
compounds with well-established hepato-protective effects
(Küllenberg et al., 2012).
Conclusion: In this study, we demonstrate a
powerful strategy for integrating multiple ~omics datasets using
RF and selecting discriminatory metabolites for partial
correlation network analysis. Previously, Acharjee and
co-workers (Acharjee et al., 2011) integrated plant gene
expression and metabolomics data using RF regression;
however, to the best of our knowledge, no such integrative
approach has been utilised to link classical hepatic
parameters with metabolomic and/or lipidomic datasets.
RF has proved to be a reliable and useful method for
integrative data interpretation that can assist hypothesis
generation. We also hope that linking classical
toxicology parameters with metabolite markers will facilitate
more accurate and earlier detection of toxicity.
References:
Acharjee, A., Kloosterman, B., de Vos, R.C., Werij, J.S.,
Bachem, C.W., Visser, R.G., Maliepaard, C., 2011.
Data integration and network reconstruction with
~omics data using Random Forest regression in potato.
Analytica Chimica Acta 705, 56-63.
Ament, Z., Masoodi, M., Griffin, J.L., 2012. Applications of
metabolomics for understanding the action of
peroxisome proliferator-activated receptors (PPARs)
in diabetes, obesity and cancer. Genome Med 4, 32.
Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.
Díaz-Uriarte, R., De Andres, S.A., 2006. Gene selection and
classification of microarray data using random forest.
BMC Bioinformatics 7, 3.
Jenkins, B., West, J.A., Koulman, A., 2015. A Review of
Odd-Chain Fatty Acid Metabolism and the Role of
Pentadecanoic Acid (C15:0) and Heptadecanoic Acid
(C17:0) in Health and Disease. Molecules 20, 2425-2444.
Küllenberg, D., Taylor, L.A., Schneider, M., Massing, U., 2012.
Health effects of dietary phospholipids. Lipids Health
Dis 11, 1-16.