Representation of metabolomic data with wavelets

Nathalie Villa-Vialaneix
http://www.nathalievilla.org
Toulouse School of Economics
Workgroup BioPuces, INRA de Castanet
June 5th, 2009
BioPuces (05/06/09) Nathalie Villa Metabolomic data 1 / 16

Outline
1 Database presentation
2 Wavelet representation
3 Perspective of work

Database presentation

Basics about the database
The database was provided by Alain Paris (INRA) and consists of
metabolomic recordings (1H NMR spectra) from the urine of mice.
950 variables from 0.505 ppm to 9.995 ppm.
The baseline has been removed and the peaks have been aligned.

Purpose of the work
Study the effects of the ingestion of Hypochoeris radicata (HR) on the
metabolism: the inflorescences of this plant are known to be responsible
for a horse disease, Australian stringhalt.
As it is hard to obtain several dozen horses to sacrifice, the
experiments were conducted on 72 mice.

Description of the experiment
72 mice from:
• 2 sexes: 36 males, 36 females
• 3 kinds of HR doses: 0 (control): 24 mice, 3%: 24 mice, 9%: 24 mice
• 3 sacrifice dates: 8th day: 24 mice, 15th day: 24 mice, 21st day: 24 mice
⇒ 2 × 3 × 3 = 18 groups.
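
The count of 18 groups comes from crossing the three factors. A minimal sketch of that enumeration follows; the labels are illustrative, and the figure of 4 mice per group assumes a perfectly balanced design, which the slides suggest but do not state explicitly.

```python
from itertools import product

# Factor levels as listed on the slide (label strings are illustrative).
sexes = ["male", "female"]
doses = ["0 (control)", "3%", "9%"]
sacrifice_days = [8, 15, 21]

# Every combination of sex, dose and sacrifice date is one group.
groups = list(product(sexes, doses, sacrifice_days))

# Assumption: the 72 mice are spread evenly over the groups.
n_mice = 72
mice_per_group = n_mice // len(groups)

print(len(groups), "groups,", mice_per_group, "mice per group if balanced")
```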

Measurement days
The urine was collected on the following days:
Day                  0   1   4   8  11  15  18  21
Nb of observations  68  68  68  66  46  44  19  18
For each mouse, from 2 to 22 measurements are made.
In conclusion, 397 observations for 950 variables.
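
The total of 397 observations can be checked by summing the per-day counts. The short sketch below does that arithmetic; the dictionary and variable names are only illustrative.

```python
# Number of urine samples per collection day, as given on the slide.
observations_per_day = {0: 68, 1: 68, 4: 68, 8: 66, 11: 46, 15: 44, 18: 19, 21: 18}

n_observations = sum(observations_per_day.values())
n_variables = 950  # chemical-shift variables per spectrum

# Shape of the resulting data matrix: one row per spectrum.
data_shape = (n_observations, n_variables)
print(data_shape)
```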

Wavelet representation

Basic principle of wavelets
For a given integer J, each spectrum can be expressed at level J as:
\[
f(x) = \sum_{k} \alpha_k \, 2^{-J/2} \, \Psi\left(2^{-J} x - k\right) + \sum_{j=1}^{J} \sum_{k} \beta_{jk} \, 2^{-j/2} \, \Phi\left(2^{-j} x - k\right)
\]
• Trend (first sum): based on the father wavelet Ψ.
• Details at levels 1,...,J (second sum): based on the mother wavelet Φ.

Hierarchical decomposition
We add 74 zero values at the end of each spectrum (950 + 74 = 1024) to
obtain a dyadic discrete sampling.
Original data: f observed at t1, ..., t1024 (equally spaced)
↓
Level 1: Trend + Details
↓
. . .
↓
⇒ At level 9 (maximum level with a discrete sampling of length 1024), we
obtain 1025 coefficients.
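
The padding and the level-by-level split into trend and details can be sketched as follows. This is only an illustration: it uses the Haar wavelet, which is an assumption (the slides do not say which wavelet was used), and a synthetic signal stands in for a real spectrum. With Haar the total number of coefficients stays at 1024, so the count differs from the 1025 quoted above, which depends on the wavelet and boundary handling actually used.

```python
import math

def pad_to_dyadic(signal):
    """Append zeros until the length is a power of two."""
    target = 1
    while target < len(signal):
        target *= 2
    return list(signal) + [0.0] * (target - len(signal))

def haar_step(values):
    """One level: scaled pairwise sums (trend) and differences (details)."""
    s = math.sqrt(2.0)
    trend = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return trend, detail

def haar_decompose(values, levels):
    """Repeatedly split the current trend; details[0] is the finest level."""
    trend, details = list(values), []
    for _ in range(levels):
        trend, detail = haar_step(trend)
        details.append(detail)
    return trend, details

# Synthetic stand-in for one spectrum with 950 points.
spectrum = [math.sin(0.05 * i) for i in range(950)]
padded = pad_to_dyadic(spectrum)            # 950 + 74 zeros
trend, details = haar_decompose(padded, levels=9)
n_coeffs = len(trend) + sum(len(d) for d in details)
print(len(padded), len(trend), n_coeffs)
```

Each level halves the length of the trend, which is why a dyadic length is convenient.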

Examples
[Figure: example decomposition, with panels "Trend" and "Details"]

Denoising
For detail coefficients at levels greater than J (with J large
enough), hard thresholding is applied:
\[
c^* = \begin{cases} 0 & \text{if } |c| < 2 \log 10 \, \hat{\sigma} \\ c & \text{if } |c| \geq 2 \log 10 \, \hat{\sigma} \end{cases}
\]
(Donoho and Johnstone)
Two parameters have to be tuned:
• Which wavelet should be used?
• Which J should be used?
The choice is a trade-off between the quality of the reconstruction (how
close are the functions rebuilt from the filtered coefficients to the
original ones?) and the number of non-zero coefficients.
Both are chosen by minimizing an empirical (self-created) quality criterion:
\[
\frac{1}{n} \sum_{i} \frac{1}{D} \sum_{j} \left( f_i(t_j) - \hat{f}_i(t_j) \right)^2 + \frac{\text{Nb of non-zero coefficients}}{\text{Nb of coefficients}}
\]
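
The filtering rule and the quality criterion can be sketched in a few lines. The function names are hypothetical, the threshold is passed as a plain number because the way σ̂ is estimated is not given here, and the sketch assumes that n is the number of spectra, D the number of sampling points, and that the second term is the fraction of coefficients kept after filtering.

```python
def hard_threshold(coeffs, threshold):
    """Set to zero every coefficient whose absolute value is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

def quality_criterion(originals, reconstructions, n_kept, n_total):
    """Mean squared reconstruction error plus the fraction of kept coefficients."""
    n = len(originals)
    error = 0.0
    for f, f_hat in zip(originals, reconstructions):
        d = len(f)  # number of sampling points of this spectrum
        error += sum((a - b) ** 2 for a, b in zip(f, f_hat)) / d
    return error / n + n_kept / n_total

# Toy example: five coefficients, threshold chosen arbitrarily.
filtered = hard_threshold([0.1, -2.0, 0.05, 3.0, -0.2], threshold=0.5)
n_kept = sum(1 for c in filtered if c != 0.0)

# Toy "spectra" and their reconstructions (made-up numbers).
originals = [[1.0, 2.0], [0.0, 4.0]]
reconstructions = [[1.0, 1.0], [0.0, 4.0]]
score = quality_criterion(originals, reconstructions, n_kept, len(filtered))
print(filtered, score)
```

In practice the criterion would be evaluated for each candidate wavelet and each candidate J, keeping the pair with the smallest value.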

Final reconstruction of the data
274 non-zero coefficients

Boxplots
[Figure: boxplots of the original coefficients]
[Figure: boxplots of the scaled coefficients (centered by the mean and divided by the standard deviation)]
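
Scaling by mean and standard deviation can be sketched as below. The function name is hypothetical and the use of the sample standard deviation (rather than the population one) is an assumption; rows stand for spectra and columns for wavelet coefficients.

```python
import statistics

def standardize_columns(rows):
    """Center each column on its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # sample standard deviation
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy matrix: 3 observations, 2 coefficients with very different ranges.
coefficients = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize_columns(coefficients)
print(scaled)
```

After scaling, both columns have the same spread, so their boxplots become comparable.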

Perspective of work

Perspective of work
Using random forests
The idea is to use random forests to make predictions and also to extract
the coefficients that contribute most to explaining the target variables.

Perspective of work
Proposed regression: the scaled coefficients will be the explanatory
variables. The variable of interest could be:
• the dose (either as a number, or as a class, which leads to a
classification problem);
• the total dose ingested (i.e., the dose multiplied by the number of
days of ingestion);
• any other interesting idea?
The idea is then to rebuild the individuals from the main coefficients
(setting the others to zero) to see which peaks differ from one group to
the others.
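
Rebuilding a signal from a few selected coefficients, with all the others set to zero, can be sketched as follows. The Haar wavelet, the toy signal and the selected coefficient are all assumptions made for illustration; in the proposed work the selection would come from the random forest, which is not shown here.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(values, levels):
    """Return (trend, details); details[0] is the finest level."""
    trend, details = list(values), []
    for _ in range(levels):
        pairs = [(trend[i], trend[i + 1]) for i in range(0, len(trend), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        trend = [(a + b) / SQRT2 for a, b in pairs]
    return trend, details

def haar_inverse(trend, details):
    """Invert haar_forward: rebuild the signal from trend and details."""
    values = list(trend)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(values, detail):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        values = rebuilt
    return values

def keep_selected(details, selected):
    """Zero every detail coefficient whose (level, position) is not selected."""
    return [
        [c if (lvl, k) in selected else 0.0 for k, c in enumerate(level)]
        for lvl, level in enumerate(details)
    ]

# Toy "spectrum" with a single peak at position 2.
signal = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
trend, details = haar_forward(signal, levels=3)

# Hypothetical selection: only one fine-scale coefficient is judged important.
selected = {(0, 1)}
rebuilt = haar_inverse(trend, keep_selected(details, selected))
print([round(v, 3) for v in rebuilt])
```

The rebuilt signal keeps the sharp variation carried by the selected coefficient, which is how the peaks that separate groups could be located on the original ppm axis.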
