Metabolomic data: combining wavelet representation with learning approaches
1. Metabolomic data: combining wavelet
representation with learning approaches
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
In collaboration with Noslen Hernández (CENATAV, La
Havane, Cuba) & Philippe Besse
IUT de Carcassonne (UPVD)
& Institut de Mathématiques de Toulouse
Groupe de travail BioPuces, INRA de Castanet
May 19th, 2010
2. Overview
1 Presentation of the data
2 Wavelet preprocessing and normalization
3 Learning methods
4 Identification of relevant metabolites
5. Presentation of the data
Presentation of the data
Data were provided by Alain Paris (INRA): they are
metabolomic spectra (¹H NMR) of mouse urine, each consisting of
950 variables (from 0.50 ppm to 9.99 ppm).
Peaks have been aligned and the baseline has been removed.
7. Presentation of the data
Biological question
Study the effects of Hypochoeris radicata (HR) ingestion on
metabolism: HR flowers are responsible for a fatal disease in
horses, “Australian stringhalt” (nervous system attack,
trembling...).
Experiments were carried out on 72 mice.
11. Presentation of the data
Description of the experiments
Mice are divided into several groups according to:
gender: 36 males; 36 females
daily HR dose ingested: 0% (control): 24 mice; 3%: 24 mice; 9%: 24 mice
sacrifice date (3 dates): day 8: 24 mice; day 15: 24 mice; day 21: 24 mice
⇒ 18 groups (but the groups defined by sacrifice date are irrelevant
to the biological question).
14. Presentation of the data
Days of measurement
Urine was collected on the following days:
Day          0   1   4   8   11  15  18  21
Nb of obs.   68  68  68  66  46  44  19  18
For each mouse, between 1 and 8 measurements were made.
In total, 397 observations of 950 variables.
17. Wavelet preprocessing and normalization
Basics about wavelets
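For a given integer J, a spectrum f can be expressed at level J by:
$$f(x) = \sum_k \alpha_k \, 2^{-J/2} \, \Psi(2^{-J} x - k) + \sum_{j=1}^{J} \sum_k \beta_{jk} \, 2^{-j/2} \, \Phi(2^{-j} x - k)$$
The first sum is the trend, based on the father wavelet Ψ.
The second sum gathers the details of levels 1, . . . , J, based on the mother wavelet Φ.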
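To make the trend/detail split concrete, here is a minimal sketch of a multi-level decomposition with the Haar basis, written in plain Python rather than with the tools used for the talk. The function names (`haar_step`, `haar_decompose`, `haar_reconstruct`) and the toy spectrum are illustrative assumptions, and the sketch assumes a signal length divisible by 2**levels, which the slides do not discuss.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Return the trend at level `levels` and the details of levels 1..levels."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    trend, details = list(signal), []
    for _ in range(levels):
        trend, detail = haar_step(trend)
        details.append(detail)  # details[0] is level 1 (finest scale)
    return trend, details

def haar_reconstruct(trend, details):
    """Invert haar_decompose by undoing the levels from coarsest to finest."""
    s = 1 / math.sqrt(2)
    current = list(trend)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.extend([(a + d) * s, (a - d) * s])
        current = rebuilt
    return current

# Toy "spectrum" with 8 points (hypothetical values, not taken from the data).
toy_spectrum = [0.0, 0.1, 0.9, 1.0, 0.2, 0.1, 0.0, 0.0]
trend, details = haar_decompose(toy_spectrum, levels=3)
rebuilt = haar_reconstruct(trend, details)
```

Here `trend` plays the role of the α coefficients and `details` the role of the β coefficients in the expansion above; because each Haar detail is a scaled difference of neighbouring values, it behaves like a discrete derivative.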
21. Wavelet preprocessing and normalization
Example of a hierarchical decomposition for a metabolomic spectrum
[Figure: a spectrum decomposed level by level into details 1 to 8]
23. Wavelet preprocessing and normalization
Several strategies
Several wavelet bases
Haar wavelets (easily interpretable because they are close to
discrete derivatives);
D4 Daubechies wavelets (smoother representation but not
directly interpretable).
Several preprocessings
Use all wavelet coefficients as input data;
Use thresholded wavelet coefficients as input data (i.e., remove
the smallest coefficients with an automatic method called “soft
thresholding”);
Use only the detail coefficients (and the detail coefficients
of the shifted spectra) as input data.
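As an illustration of the thresholding option, the sketch below applies the usual soft-thresholding rule: coefficients whose absolute value is below the threshold are set to zero, and the others are shrunk towards zero by the threshold. The function name and the example values are assumptions made for this sketch; the slides do not say how the threshold itself is chosen, so it is left as a parameter here.

```python
def soft_threshold(coefficients, threshold):
    """Shrink each coefficient towards zero by `threshold`; small ones become 0."""
    result = []
    for c in coefficients:
        if c > threshold:
            result.append(c - threshold)
        elif c < -threshold:
            result.append(c + threshold)
        else:
            result.append(0.0)
    return result

# Hypothetical detail coefficients, thresholded at 1.0.
shrunk = soft_threshold([3.0, -0.5, 0.2, -2.0], threshold=1.0)
print(shrunk)  # [2.0, 0.0, 0.0, -1.0]
```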
24. Wavelet preprocessing and normalization
Scaling of wavelet coefficients (e.g., Haar detail coefficients)
[Figure: selected coefficients (D.1, D.57, D.125, D.297, D.370, D.443, D2.41, D2.120, D2.304, D2.389, D2.474) before scaling (axis from −40 to 40) and after scaling (axis from −15 to 15)]
27. Wavelet preprocessing and normalization
Normalization
For each measurement day, compute the median and variance of the
coefficients on the control group.
Use these values to normalize all the observations
(according to their measurement day).
[Figure: wavelet coefficients D2.444, D.78, D.332 and D2.289 plotted against day, before and after normalization]
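The day-wise normalization can be sketched as follows for a single wavelet coefficient. This is only an illustration under stated assumptions: the function names are hypothetical, the toy values are invented, and dividing by the control-group standard deviation (the square root of the variance) is one plausible reading of "normalization", which the slides do not spell out.

```python
import statistics
from collections import defaultdict

def control_reference(values, days, is_control):
    """Per-day (median, standard deviation) of one coefficient, control mice only."""
    by_day = defaultdict(list)
    for x, day, ctrl in zip(values, days, is_control):
        if ctrl:
            by_day[day].append(x)
    return {day: (statistics.median(xs), statistics.stdev(xs))
            for day, xs in by_day.items()}

def normalize(values, days, reference):
    """Center and scale every observation with the reference of its own day."""
    normalized = []
    for x, day in zip(values, days):
        median, sd = reference[day]
        normalized.append((x - median) / sd)
    return normalized

# Toy values of one coefficient: three control mice and one treated mouse per day.
values = [1.0, 2.0, 3.0, 5.0, 10.0, 12.0, 14.0, 20.0]
days = [0, 0, 0, 0, 8, 8, 8, 8]
is_control = [True, True, True, False, True, True, True, False]

reference = control_reference(values, days, is_control)
normalized = normalize(values, days, reference)
```

Treated mice are normalized with the control statistics of the same day, so a day effect shared by all mice is removed while a dose effect remains visible.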
30. Learning methods
Motivations
Purpose: validate the impact of HR ingestion on metabolism
by predicting the total HR dose ingested from the spectra. If
the prediction is accurate, the impact is not an artefact of the data
and the biological dependency is validated.
Compared methods:
random forest (R package randomForest)
ridge regression (R package glmnet)
LASSO (R package glmnet)
elastic net (R package glmnet)
Partial Least Squares (PLS) (R package mixOmics)
sparse PLS (R package mixOmics)
32. Learning methods
Methodology
Split the data into train and test sets that are balanced according to
the groups;
Preprocess (or not) the data with wavelets, then scale and normalize them;
Train each of the 6 methods (for each of the 7 kinds of
preprocessing) on the train set, using cross-validation to
tune the parameters;
Compute the mean squared error on the test set.
Repeat the previous scheme 250 times.
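The evaluation scheme above can be sketched in a few lines. This is a toy illustration, not the code used for the study (which relied on the R packages listed earlier): the learner is a one-predictor ridge regression without intercept standing in for the six methods, and the 25% test fraction, the 5 folds, the penalty grid and all function names are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(groups, test_fraction, rng):
    """Train/test indices that keep the same proportion of every group."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    train, test = [], []
    for members in by_group.values():
        members = members[:]
        rng.shuffle(members)
        n_test = max(1, round(test_fraction * len(members)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def fit_ridge_1d(x, y, lam):
    """Ridge slope for a single predictor, no intercept (toy stand-in learner)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def mse(x, y, slope):
    return sum((b - slope * a) ** 2 for a, b in zip(x, y)) / len(x)

def cv_select(x, y, lambdas, n_folds):
    """Penalty with the lowest cross-validated error on the train set."""
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for k in range(n_folds):
            held = [i for i in range(len(x)) if i % n_folds == k]
            kept = [i for i in range(len(x)) if i % n_folds != k]
            slope = fit_ridge_1d([x[i] for i in kept], [y[i] for i in kept], lam)
            err += mse([x[i] for i in held], [y[i] for i in held], slope)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

def repeated_evaluation(x, y, groups, lambdas, n_repeats=250, seed=0):
    """Test-set MSE of the tuned model for each random balanced split."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = stratified_split(groups, 0.25, rng)
        x_train, y_train = [x[i] for i in train], [y[i] for i in train]
        lam = cv_select(x_train, y_train, lambdas, n_folds=5)
        slope = fit_ridge_1d(x_train, y_train, lam)
        errors.append(mse([x[i] for i in test], [y[i] for i in test], slope))
    return errors

# Synthetic data: the target is the dose, the single feature is a noisy copy of it.
data_rng = random.Random(1)
doses = [0.0, 3.0, 9.0] * 24
feature = [d + data_rng.gauss(0, 1) for d in doses]
errors = repeated_evaluation(feature, doses, doses, [0.0, 1.0, 10.0], n_repeats=20)
print(sum(errors) / len(errors))
```

Tuning the penalty inside each train set, and never on the test set, is what keeps the reported test error an honest estimate; the distribution of the `errors` list is what allows methods and preprocessings to be compared.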
37. Identification of relevant metabolites
Identification issue
The full learning process is the following:
Spectra → Wavelet preprocessing → Learning → HR dose prediction
Hence, because of the preprocessing step, the coefficients selected
by the elastic net (ELN) are not directly related to metabolites (or to
locations on the spectra).
38. Identification of relevant metabolites
Adaptation of the importance measure
for each of the 950 variables v of the original data set do
    Randomize the observations of variable v
    Compute the full Daubechies wavelet representation with the randomized observations for v
    Scale and normalize using the mean, median or variance of the true values
    for each test set i do
        Compute new predictions with the false values of v and the corresponding MSE, mse_{v,i}
        Compute the decrease in accuracy for test set i: DA_i = 1 − mse_i / mse_{v,i}
    end for
    Average DA_i over i to obtain the importance of v
end for
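The loop above can be sketched in Python as follows. It is a hedged illustration only: `transform` stands for the whole wavelet, scaling and normalization step (assumed to be already fitted on the true values), `predict` for the trained model, and the toy transform, toy model and synthetic data are placeholders rather than the Daubechies representation and the learner used in the study. The guard against a zero permuted error is also an addition of this sketch.

```python
import random

def permutation_importance(spectra, targets, test_sets, transform, predict, seed=0):
    """Average decrease in accuracy DA_i = 1 - mse_i / mse_{v,i} for each variable v."""
    rng = random.Random(seed)

    def mse(rows, indices):
        return sum((predict(transform(rows[i])) - targets[i]) ** 2
                   for i in indices) / len(indices)

    base = [mse(spectra, indices) for indices in test_sets]  # mse_i with true values
    importance = []
    for v in range(len(spectra[0])):
        # Randomize the observations of variable v in the original space.
        column = [row[v] for row in spectra]
        rng.shuffle(column)
        shuffled = [row[:v] + [column[i]] + row[v + 1:] for i, row in enumerate(spectra)]
        decreases = []
        for k, indices in enumerate(test_sets):
            mse_v = mse(shuffled, indices)
            decreases.append(1 - base[k] / mse_v if mse_v > 0 else 0.0)
        importance.append(sum(decreases) / len(decreases))
    return importance

def toy_transform(row):
    # Placeholder for the wavelet step: pairwise means, then pairwise half-differences.
    return [(row[0] + row[1]) / 2, (row[2] + row[3]) / 2,
            (row[0] - row[1]) / 2, (row[2] - row[3]) / 2]

def toy_predict(coefficients):
    # Hypothetical model that only uses the third coefficient.
    return 2 * coefficients[2]

data_rng = random.Random(42)
spectra = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]
targets = [row[0] - row[1] + data_rng.gauss(0, 0.1) for row in spectra]
test_sets = [list(range(20)), list(range(20, 40))]
importance = permutation_importance(spectra, targets, test_sets, toy_transform, toy_predict)
```

Because the permutation is done on the original variables and the transform is recomputed afterwards, the resulting importance is attached to a position on the spectrum rather than to a wavelet coefficient. In this toy setting the first two variables receive a high importance and the last two an importance of zero.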
40. Identification of relevant metabolites
Identification of important metabolites
[Figure: important metabolites located along the spectrum, colour-coded; x-axis: ppm (2 to 10), y-axis from 0 to 20]
Some have already been identified: the most important is scyllo-inositol; one
of the orange ones is probably valine; one of the light yellow ones is probably
trimethylamine. The others are new.
41. Identification of relevant metabolites
What next?
Identification of the metabolites, study of the correlation between
the ones found and the ones previously emphasized.
Questions? Suggestions?