This document summarizes various computational methods for analyzing high-dimensional phenotypic data from screens of perturbed genes, including enrichment analysis, gene set enrichment analysis (GSEA), mapping phenotypes to networks, hierarchical clustering, and ranking genes by phenotypic similarity to a query. Key methods covered are enrichment analysis using hypergeometric tests to assess overrepresentation of hits in gene sets, hierarchical clustering to build clusters based on phenotypic distance metrics, and ranking genes based on similarity to a query phenotype profile.
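As a concrete sketch of the hypergeometric enrichment test mentioned above, the snippet below computes the probability of observing at least k hits inside a gene set by chance; all counts are invented for the example, not taken from the document.

```python
from scipy.stats import hypergeom

# Hypothetical counts: N genes screened, K genes in the annotated set,
# n genes called as hits, k hits falling inside the set.
N, K, n, k = 20000, 300, 150, 12

# P(X >= k) when drawing n hits without replacement from N genes,
# K of which belong to the set: the survival function at k - 1.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
```

With only about 2.25 overlapping genes expected by chance, twelve observed hits give a very small p-value, which is what the test flags as overrepresentation.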
O.M.GSEA - An in-depth introduction to gene-set enrichment analysis (Shana White)
The document provides an introduction to gene set enrichment analysis (GSEA) methodology. It describes how GSEA analyzes gene expression data to determine whether a particular set of genes, defined a priori, shows statistically significant differences between biological conditions (e.g. case vs. control groups). The key steps are: ranking genes based on their correlation with the conditions, calculating a running enrichment score to quantify overrepresentation of the gene set at the top or bottom of the ranking, and assessing significance through permutation testing. An example analysis compares gene expression from ulcerated vs. uninjured mouse stomachs to test for enrichment of genes related to stomach epithelial metaplasia.
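The running-score step can be sketched in a simplified, unweighted form (equal steps up at gene-set members and down at non-members; the actual GSEA statistic weights hits by their correlation with the phenotype):

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_total, n_hit = len(ranked_genes), int(in_set.sum())
    hit_step = 1.0 / n_hit                # step up at each gene-set member
    miss_step = 1.0 / (n_total - n_hit)   # step down at each non-member
    running = np.cumsum(np.where(in_set, hit_step, -miss_step))
    # ES is the maximum deviation of the running sum from zero
    return running[np.abs(running).argmax()]

ranked = [f"g{i}" for i in range(100)]    # g0 = most up-regulated
es = enrichment_score(ranked, {"g1", "g2", "g5", "g8"})
print(f"ES = {es:.3f}")
```

Because all four set members sit near the top of the ranking, the running sum climbs quickly and the resulting ES is large and positive; significance would then be assessed by recomputing the score over permuted phenotype labels.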
This document discusses mining complex relationships between microRNAs, transcription factors, and genes from heterogeneous data sources using causal inference approaches. Specifically, it describes a project that aims to infer regulatory relationships between microRNAs and mRNAs from multiple data sources including DNA sequences, gene expression data, and domain knowledge. It also discusses using causal inference methods like IDA to detect condition-specific regulatory relationships by analyzing samples split according to normal or cancer conditions.
IRJET- Disease Identification using Proteins Values and Regulatory Modules (IRJET Journal)
1) The document proposes developing a common knowledge base for genomic and proteomic analysis to identify genetic disorders using regulatory modules.
2) It involves using collaborative filtering and depth first search to cluster gene ontology terms and regulatory modules for each gene expression.
3) Finally, a Bayesian rose tree is used to represent the taxonomy for a particular gene ID and identify associated diseases.
This document discusses techniques for privacy-preserving data mining through random data perturbation. It argues that while adding random noise to data is intended to preserve privacy, the noise can often be filtered out through spectral analysis of the data's eigenvalues, breaking the privacy protections. The document presents an algorithm that exploits the differences between the eigenvalues of the true data and those of the noise to separate the two, effectively reversing the random perturbation. Experimental results demonstrate that the algorithm can accurately recover not only the original distributions but also individual records, challenging the assumption that random perturbation reveals only distribution-level information while keeping individual records private.
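A toy version of such a spectral filtering attack can be sketched as follows. The rank-1 synthetic data, the known noise level, and the Marchenko-Pastur cutoff are illustrative assumptions, not details taken from the document:

```python
import numpy as np

def spectral_filter(perturbed, sigma):
    """Estimate the signal part of noise-perturbed data by keeping only the
    eigen-components whose eigenvalues exceed the largest eigenvalue expected
    from pure i.i.d. noise of standard deviation sigma (Marchenko-Pastur edge)."""
    m, n = perturbed.shape
    centered = perturbed - perturbed.mean(axis=0)
    cov = centered.T @ centered / m
    eigvals, eigvecs = np.linalg.eigh(cov)
    noise_edge = sigma**2 * (1 + np.sqrt(n / m))**2   # noise-only eigenvalue bound
    keep = eigvals > noise_edge
    proj = eigvecs[:, keep] @ eigvecs[:, keep].T      # projector onto signal subspace
    return centered @ proj + perturbed.mean(axis=0)

rng = np.random.default_rng(0)
signal = rng.normal(size=(2000, 1)) @ rng.normal(size=(1, 10))  # rank-1 "true" data
noisy = signal + rng.normal(scale=0.5, size=signal.shape)       # perturbed release
recovered = spectral_filter(noisy, sigma=0.5)
```

Projecting onto the signal subspace discards the noise energy in the remaining directions, so the filtered estimate lands much closer to the true records than the released perturbed copy does — the essence of the privacy breach described.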
Course: Bioinformatics for Biomedical Research (2014).
Session: 3.2- Basic Aspects of Microarray Technology and Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Large scale machine learning challenges for systems biology (Maté Ongenaert)
Large scale machine learning challenges for systems biology
by Dr. Yvan Saeys - Machine Learning and Data Mining group, Bioinformatics and Systems Biology Division, VIB-UGent Department of Plant Systems Biology
Due to technological advances, the amount of biological data and the pace at which it is generated have increased dramatically during the past decade. To extract new knowledge from these ever-increasing data sets, automated techniques such as data mining and machine learning have become standard practice.
In this talk, I will give an overview of large scale machine learning challenges in bioinformatics and systems biology, highlighting the importance of using scalable and robust techniques such as ensemble learning methods implemented on large computing grids.
I will present some of our state-of-the-art tools to solve problems such as biomarker discovery, large scale network inference, and biomedical text mining at PubMed scale.
This document discusses techniques for privacy-preserving data mining by adding random perturbations to sensitive data. It summarizes that while adding noise aims to protect privacy, the underlying distributions can still be recovered using spectral filtering methods. The document outlines an algorithm that uses eigendecomposition to separate data distributions from noise distributions, enabling recovery of approximately correct anonymous features and breaking privacy protections.
This document provides an overview of genome-wide association studies (GWAS). It discusses the basic concept of GWAS, how to run and analyze one, and how to interpret the results. Key points include: GWAS genotype individuals at hundreds of thousands to millions of SNPs to look for associations with traits; extensive quality control is required; imputation can increase SNP coverage; statistical analysis involves computing p-values and correcting for multiple testing; and significant findings still require replication in independent samples.
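The multiple-testing step can be illustrated with a small simulation; the SNP count and the 1-degree-of-freedom chi-squared association statistics are assumptions made for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_snps = 100_000
# Simulated 1-df chi-squared association statistics under the null
chi2_stats = rng.chisquare(df=1, size=n_snps)
pvals = stats.chi2.sf(chi2_stats, df=1)

alpha = 0.05
threshold = alpha / n_snps      # Bonferroni correction: 5e-7 per SNP
hits = int(np.sum(pvals < threshold))
print(f"per-SNP threshold: {threshold:.1e}, SNPs passing: {hits}")
```

Under the null, essentially nothing survives the corrected threshold, which is why genome-wide significance cutoffs (commonly 5e-8 for a million tests) are so stringent and why surviving hits still need replication.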
This document discusses issues with reproducibility in EEG research and proposes solutions. It notes that flexible choices in EEG methodology and exploratory analyses can lead to false positives. Simulations demonstrate how double dipping, multiple comparisons, and lack of independent replication can produce significant effects from noise alone. The document advocates for preregistering analysis plans, including dummy effects in studies, subdividing data for exploration and replication, and using registered reports to improve reproducibility in EEG research.
The document discusses classifying brain cancer subtypes using statistical methods. It compares using gene expression data alone, copy number data alone, and both combined. It evaluates Naive Bayes, k-Nearest Neighbors, Support Vector Machine, and Random Forest classifiers. Random Forest achieved the highest average accuracy at 85.09%. Using both gene expression and copy number data together yielded slightly higher accuracy than gene expression alone. The highest individual accuracies were Random Forest on the Mesenchymal-Proneural gene expression dataset at 94.69% and Naive Bayes on the same datasets combined at 93.72%.
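A sketch of this kind of classifier comparison, using synthetic data in place of the study's expression and copy-number matrices (the accuracies it prints are therefore not the paper's figures):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a gene-expression matrix: 200 samples, 50 features
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
models = {
    "Naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each classifier
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {results[name]:.3f}")
```

Combining data types, as the study did, amounts to concatenating feature matrices before this comparison loop.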
Basics of Data Analysis in Bioinformatics (Elena Sügis)
The presentation gives an introduction to the basics of data analysis in bioinformatics.
The following topics are covered:
Data acquisition
Data summary (selecting the needed columns/rows from the file and showing basic descriptive statistics)
Preprocessing (missing values imputation, data normalization, etc.)
Principal Component Analysis
Data Clustering and cluster annotation (k-means, hierarchical)
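The listed steps can be strung together in a minimal sketch, using a synthetic expression-like matrix as a stand-in for real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 60 samples from two shifted groups, 100 features each
data = np.vstack([rng.normal(0, 1, (30, 100)),
                  rng.normal(1, 1, (30, 100))])

scaled = StandardScaler().fit_transform(data)         # normalization step
pcs = PCA(n_components=2).fit_transform(scaled)       # principal component analysis
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(pcs)      # k-means clustering
print(labels)
```

On real data, the cluster-annotation step would follow: inspecting which samples (or genes) fall into each cluster and testing them for shared biology.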
[DSC Europe 23][DigiHealth] Vesna Pajic - Machine Learning Techniques for omi... (DataScienceConferenc1)
The document discusses machine learning techniques for analyzing omics data. It introduces Velsera, a bioinformatics company, and describes how they used machine learning to predict cancer cell line responses to drugs based on gene expression data. Specifically, they cleaned the data, performed feature selection, and tested models like elastic net, GAMs, and XGBoost (which performed best). The final model identified 20 important genes, including one the client was interested in and another potential biomarker the client was unaware of.
Challenges and opportunities for machine learning in biomedical research (FranciscoJAzuajeG)
1. Machine learning faces challenges in biomedical research due to data heterogeneity, lack of labeled data, and complexity in biological patterns and networks.
2. Combining machine learning and biological network models can help address these challenges by encoding data in biologically meaningful networks and extracting network-based features for prediction.
3. Examples applying this approach to cancer datasets showed that models based on network centrality features outperformed other methods, and deep learning using these features achieved the best prediction performance across multiple neuroblastoma datasets.
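As a minimal illustration of network-based features, the sketch below computes a few centrality measures for genes in a tiny hypothetical interaction network; the gene names and edges are invented for the example, and the original work of course used real cancer datasets:

```python
import networkx as nx

# Toy interaction network (illustrative edges only)
edges = [("TP53", "MDM2"), ("TP53", "EP300"), ("MDM2", "UBE2D1"),
         ("EP300", "CREBBP"), ("TP53", "CREBBP")]
g = nx.Graph(edges)

degree = nx.degree_centrality(g)
betweenness = nx.betweenness_centrality(g)
closeness = nx.closeness_centrality(g)

# One feature vector per gene, ready to feed into a downstream predictor
features = {node: (degree[node], betweenness[node], closeness[node])
            for node in g.nodes}
print(features["TP53"])
```

Feature vectors like these, computed over a real interaction network, are what a downstream classifier or deep model would consume in the approach described.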
This document provides an overview of a course on systems biology. It begins with definitions of systems biology from various experts that emphasize examining biological systems as a whole through interactions rather than isolated parts. The course will cover basic analysis tools for large datasets like clustering and correlations. It will also discuss advanced modular analysis and modeling small networks. Standard analysis techniques like clustering gene expression data are demonstrated.
This document provides an overview of a course on systems biology. It defines systems biology as examining biological systems as a whole through the interactions of all components, rather than in isolation. The course covers basic analysis tools for large datasets, such as examining distributions, correlations, and clustering. It also discusses more advanced modular analysis tools like biclustering that decompose data into transcription modules. An example iterative algorithm called the Signature Algorithm is described for finding related genes based on expression profiles and score thresholds.
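The Signature Algorithm described above can be sketched in a simplified single-iteration form: score conditions by the seed genes, threshold, then score all genes over the selected conditions. The thresholds and synthetic data below are made up for the example, not the published implementation:

```python
import numpy as np

def signature_step(expr, seed_idx, cond_thresh=1.0, gene_thresh=1.0):
    """One iteration of a Signature-Algorithm-style module search (simplified).
    expr: genes x conditions matrix of expression values."""
    # Score conditions by the mean expression of the seed genes
    cond_scores = expr[seed_idx].mean(axis=0)
    cond_z = (cond_scores - cond_scores.mean()) / cond_scores.std()
    conditions = np.where(cond_z > cond_thresh)[0]
    # Score every gene over the selected conditions only
    gene_scores = expr[:, conditions].mean(axis=1)
    gene_z = (gene_scores - gene_scores.mean()) / gene_scores.std()
    genes = np.where(gene_z > gene_thresh)[0]
    return genes, conditions

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 50))
expr[:10, :5] += 4.0                    # planted transcription module
genes, conds = signature_step(expr, seed_idx=np.arange(5))
print(sorted(genes.tolist()), sorted(conds.tolist()))
```

Starting from five seed genes inside the planted module, the step recovers both the module's conditions and its full gene membership; the full algorithm iterates this until the gene set stabilizes.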
A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene.
Avoid overfitting in precision medicine: How to use cross-validation to relia... (Nicole Krämer)
The identification of patient subgroups who may derive benefit from a treatment is of crucial importance in precision medicine. Many different algorithms have been proposed and studied in the literature.
We illustrate that many of these algorithms overfit in the sense that the treatment benefit for the identified patients is substantially overestimated. Further, we show that with cross-validation, it is possible to obtain more realistic estimates.
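The point can be demonstrated with a generic sketch (synthetic data and a deliberately flexible model, not the subgroup-identification algorithms from the talk): the resubstitution estimate is optimistic, while cross-validation gives a more realistic figure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=30, n_informative=5,
                           random_state=0)
model = DecisionTreeClassifier(random_state=0)   # unpruned: very flexible

train_acc = model.fit(X, y).score(X, y)              # resubstitution estimate
cv_acc = cross_val_score(model, X, y, cv=5).mean()   # cross-validated estimate
print(f"training accuracy: {train_acc:.2f}, cross-validated: {cv_acc:.2f}")
```

The fully grown tree scores perfectly on the data it was fit to, while cross-validation reveals a substantially lower out-of-sample accuracy — the same inflation mechanism that overstates treatment benefit in identified patient subgroups.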
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques such as adding small random perturbations to the data or distributing the data across parties, and evaluates methods for recovering the original distribution from perturbed data while preventing identification of individual records. Spectral filtering is proposed to separate meaningful data patterns from noise by exploiting the eigen-properties of random matrices, and the trade-off between privacy and the accuracy of data mining models built on perturbed data is discussed.
Microarrays allow researchers to analyze gene expression levels across thousands of genes simultaneously. DNA microarrays work by hybridizing fluorescently-labeled cDNA or cRNA to complementary DNA probes attached to a solid surface. This technology has applications in gene expression profiling, disease diagnosis, drug discovery, and toxicology research. While microarrays provide high-throughput analysis, their limitations include not reflecting true protein levels, complex data analysis, expense, and short shelf life of DNA chips.
Microarray data analysis involves several key steps:
1) Feature extraction converts the scanned microarray image into quantifiable gene expression values.
2) Quality control assesses the microarray for errors through diagnostic plots of intensities and distributions.
3) Normalization controls for technical variations between assays while preserving biological variations.
4) Differential expression analysis identifies genes with different expression levels between conditions, while correcting for multiple testing.
5) Biological interpretation and public database submission provide meaning and accessibility of the results.
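Step 3 above can be illustrated with one common technique; quantile normalization is a standard choice for forcing every array onto the same intensity distribution (the document does not specify which normalization method was actually used):

```python
import numpy as np

def quantile_normalize(matrix):
    """matrix: genes x arrays. Returns a copy in which every column (array)
    shares the same distribution: the mean of the per-array sorted values."""
    order = np.argsort(matrix, axis=0)           # per-array intensity ranks
    sorted_vals = np.sort(matrix, axis=0)
    reference = sorted_vals.mean(axis=1)         # common reference distribution
    normalized = np.empty_like(matrix, dtype=float)
    for j in range(matrix.shape[1]):
        # Put the reference values back in each array's original rank order
        normalized[order[:, j], j] = reference
    return normalized

arrays = np.array([[5.0, 2.0, 3.0],
                   [2.0, 1.0, 4.0],
                   [3.0, 4.0, 6.0],
                   [4.0, 2.0, 8.0]])   # 4 genes x 3 arrays
print(quantile_normalize(arrays))
```

After normalization every array has identical quantiles, so the remaining between-array differences in gene rank reflect biology rather than technical variation — which is the goal stated in step 3.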
diagnosis of cancer, bioluminescent detection, haplotype mapping, imaging gene expression in vivo, types of cancer diagnosis method, ultrasound imaging
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). Epistasis refers to causal SNPs influencing a disease through their interactions rather than through their individual effects. The document frames epistasis detection as the problem of analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status; popular measures include the chi-squared and mutual information statistics. It reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler, and notes the challenges of reducing the computational burden and detecting higher-order epistatic interactions.
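The pairwise association measure can be sketched as a chi-squared test over the nine joint genotype classes of a SNP pair versus disease status; the planted interaction and all counts below are simulated, not drawn from any real cohort:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000
snp_a = rng.integers(0, 3, size=n)   # genotypes coded 0/1/2
snp_b = rng.integers(0, 3, size=n)
# Planted epistatic effect: risk is elevated only when BOTH SNPs carry
# at least one minor allele (neither SNP has a marginal effect this strong)
risk = 0.2 + 0.4 * ((snp_a > 0) & (snp_b > 0))
status = rng.random(n) < risk

combo = snp_a * 3 + snp_b            # 9 joint genotype classes
table = np.array([[np.sum((combo == c) & (status == s)) for s in (0, 1)]
                  for c in range(9)])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2g}")
```

An exhaustive pairwise scan would repeat this test for every SNP pair, which is exactly the computational burden the reviewed methods try to reduce.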
Pathomics Based Biomarkers, Tools, and Methods (imgcommcall)
This document discusses pathomics-based biomarkers, tools, and methods for multi-scale integrative analysis in biomedical informatics. It summarizes several projects involving extracting quantitative features from pathology and radiology images using image segmentation and analysis techniques. These features are then linked to molecular data and clinical outcomes using statistical and machine learning methods to develop biomarkers. The tools and methods described aim to standardize and optimize feature extraction while accounting for uncertainties.
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
Large scale machine learning challenges for systems biologyMaté Ongenaert
Large scale machine learning challenges for systems biology
by dr. Yvan Saeys - Machine Learning and Data Mining group, Bioinformatics and Systems Biology Division, VIB-UGent Department of Plant Systems Biology
Due to technological advances, the amount of biological data, and the pace at which it is generated has increased dramatically during the past decade. To extract new knowledge from these ever increasing data sets, automated techniques such as data mining and machine learning techniques have become standard practice.
In this talk, I will give an overview of large scale machine learning challenges in bioinformatics and systems biology, highlighting the importance of using scalable and robust techniques such as ensemble learning methods implemented on large computing grids.
I will present some of our state-of-the-art tools to solve problems such as biomarker discovery, large scale network inference, and biomedical text mining at PubMed scale.
This document discusses techniques for privacy-preserving data mining by adding random perturbations to sensitive data. It summarizes that while adding noise aims to protect privacy, the underlying distributions can still be recovered using spectral filtering methods. The document outlines an algorithm that uses eigendecomposition to separate data distributions from noise distributions, enabling recovery of approximately correct anonymous features and breaking privacy protections.
This document provides an overview of genome-wide association studies (GWAS). It discusses the basic concept of GWAS, running and analyzing a GWAS, and interpreting the results. Key points include: GWAS genotype individuals for hundreds of thousands to millions of SNPs to look for associations with traits; extensive quality control is required; imputation can increase SNP coverage; statistical analysis includes computing p-values and correcting for multiple testing; significant findings still require replication in independent samples.
This document discusses issues with reproducibility in EEG research and proposes solutions. It notes that flexible choices in EEG methodology and exploratory analyses can lead to false positives. Simulations demonstrate how double dipping, multiple comparisons, and lack of independent replication can produce significant effects from noise alone. The document advocates for preregistering analysis plans, including dummy effects in studies, subdividing data for exploration and replication, and using registered reports to improve reproducibility in EEG research.
The document discusses classifying brain cancer subtypes using statistical methods. It compares using gene expression data alone, copy number data alone, and both combined. It evaluates Naive Bayes, k-Nearest Neighbors, Support Vector Machine, and Random Forest classifiers. Random Forest achieved the highest average accuracy at 85.09%. Using both gene expression and copy number data together yielded slightly higher accuracy than gene expression alone. The highest individual accuracies were Random Forest on the Mesenchymal-Proneural gene expression dataset at 94.69% and Naive Bayes on the same datasets combined at 93.72%.
Basics of Data Analysis in BioinformaticsElena Sügis
Presentation gives introduction to the Basics of Data Analysis in Bioinformatics.
The following topics are covered:
Data acquisition
Data summary(selecting the needed column/rows from the file and showing basic descriptive statistics)
Preprocessing (missing values imputation, data normalization, etc.)
Principal Component Analysis
Data Clustering and cluster annotation (k-means, hierarchical)
Cluster annotations
[DSC Europe 23][DigiHealth] Vesna Pajic - Machine Learning Techniques for omi...DataScienceConferenc1
The document discusses machine learning techniques for analyzing omics data. It introduces Velsera, a bioinformatics company, and describes how they used machine learning to predict cancer cell line responses to drugs based on gene expression data. Specifically, they cleaned the data, performed feature selection, and tested models like elastic net, GAMs, and XGBoost (which performed best). The final model identified 20 important genes, including one the client was interested in and another potential biomarker the client was unaware of.
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
1. Machine learning faces challenges in biomedical research due to data heterogeneity, lack of labeled data, and complexity in biological patterns and networks.
2. Combining machine learning and biological network models can help address these challenges by encoding data in biologically meaningful networks and extracting network-based features for prediction.
3. Examples applying this approach to cancer datasets showed that models based on network centrality features outperformed other methods, and deep learning using these features achieved the best prediction performance across multiple neuroblastoma datasets.
This document provides an overview of a course on systems biology. It begins with definitions of systems biology from various experts that emphasize examining biological systems as a whole through interactions rather than isolated parts. The course will cover basic analysis tools for large datasets like clustering and correlations. It will also discuss advanced modular analysis and modeling small networks. Standard analysis techniques like clustering gene expression data are demonstrated.
This document provides an overview of a course on systems biology. It defines systems biology as examining biological systems as a whole through the interactions of all components, rather than in isolation. The course covers basic analysis tools for large datasets, such as examining distributions, correlations, and clustering. It also discusses more advanced modular analysis tools like biclustering that decompose data into transcription modules. An example iterative algorithm called the Signature Algorithm is described for finding related genes based on expression profiles and score thresholds.
A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene.
Avoid overfitting in precision medicine: How to use cross-validation to relia...Nicole Krämer
The identification of patient subgroups who may derive benefit from a treatment is of crucial importance in precision medicine. Many different algorithms have been proposed and studied in the literature.
We illustrate that many of these algorithms overfit in the sense that the treatment benefit for the identified patients is substantially overestimated. Further, we show that with cross-validation, it is possible to obtain more realistic estimates.
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques like adding small random perturbations to the data or distributing the data across parties. It evaluates methods to recover distributions from perturbed data while preventing identification of individual records. Spectral filtering is proposed to separate meaningful data patterns from predictable noise properties, potentially breaking privacy protections.
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques like adding small random perturbations to the data or distributing the data across parties. It evaluates methods to recover the original distribution from perturbed data while preserving individual privacy, such as using the eigen-properties of noise to filter it out. The discussion covers the trade-off between privacy and accuracy of data mining models on perturbed data.
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques like adding small random perturbations to the data or distributing the data across parties. It evaluates methods to recover distributions from perturbed data while preventing identification of individual records. Spectral filtering is proposed to separate noise from original data by exploiting properties of random matrices.
Microarrays allow researchers to analyze gene expression levels across thousands of genes simultaneously. DNA microarrays work by hybridizing fluorescently-labeled cDNA or cRNA to complementary DNA probes attached to a solid surface. This technology has applications in gene expression profiling, disease diagnosis, drug discovery, and toxicology research. While microarrays provide high-throughput analysis, their limitations include not reflecting true protein levels, complex data analysis, expense, and short shelf life of DNA chips.
Microarray data analysis involves several key steps:
1) Feature extraction converts the scanned microarray image into quantifiable gene expression values.
2) Quality control assesses the microarray for errors through diagnostic plots of intensities and distributions.
3) Normalization controls for technical variations between assays while preserving biological variations.
4) Differential expression analysis identifies genes with different expression levels between conditions, while correcting for multiple testing.
5) Biological interpretation and public database submission provide meaning and accessibility of the results.
diagnosis of cancer, bioluminescent detection, diagnosis of cancer, haplotype mapping, imaging gene expression in vivo, types of cancer diagnosis method, ultrasound imaging
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). It defines epistasis as the detection of causal SNPs for a disease through their interactions, rather than their individual effects. It outlines the problem of epistasis detection as analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status. Popular measures discussed are chi-squared and mutual information statistics. The document reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler. It notes the challenges of reducing computational burden and detecting higher-order epistatic interactions.
Pathomics Based Biomarkers, Tools, and Methods (imgcommcall)
This document discusses pathomics-based biomarkers, tools, and methods for multi-scale integrative analysis in biomedical informatics. It summarizes several projects involving extracting quantitative features from pathology and radiology images using image segmentation and analysis techniques. These features are then linked to molecular data and clinical outcomes using statistical and machine learning methods to develop biomarkers. The tools and methods described aim to standardize and optimize feature extraction while accounting for uncertainties.
Microarray data noise simulation
1. Microarray data noise simulation
Despoina I. Kalfakakou
Interinstitutional postgraduate program
“Information Technologies in Medicine and Biology”
Course: Simulation methods in medicine and biology
Instructor: Dr G. Spyrou
2. Microarray data
• DNA microarray: a collection of tiny DNA spots attached to a surface.
• Used to estimate the expression of a large number of genes at the same time.
• The expression measurements are saved in a TSV file, with rows representing the genes and columns representing the samples.
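The genes-by-samples TSV layout described above can be parsed in a few lines of Python; this is a generic sketch (the toy table and the `read_expression` helper are illustrative, not from the deck):

```python
import io
import numpy as np

# Toy expression table in the layout the deck describes:
# rows are genes, columns are samples, tab-separated.
TSV = (
    "gene\tsample1\tsample2\tsample3\n"
    "g1\t1.2\t0.9\t1.1\n"
    "g2\t5.4\t5.1\t4.8\n"
)

def read_expression(handle):
    """Parse a TSV expression table into gene names, sample names,
    and a genes x samples matrix."""
    samples = handle.readline().rstrip("\n").split("\t")[1:]
    genes, values = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        genes.append(fields[0])
        values.append([float(v) for v in fields[1:]])
    return genes, samples, np.array(values)

genes, samples, expr = read_expression(io.StringIO(TSV))
```

For a real file, the `io.StringIO` wrapper would simply be replaced by `open(path)`.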
3. Microarray data noise
• Biological noise:
 – Gene expression is a random and noisy process.
 – “Inner” noise: the result of the inherent stochasticity of biochemical processes such as transcription and translation.
 – “Outer” noise: variations in the quantities or conditions of other cellular components (e.g., proteins) that indirectly change the expression of a particular gene.
• Technical noise: artefacts.
5. Information extraction from real data {1/2}
• Real data consist of 30 breast tissue samples in 2 states: 20 healthy tissue and 10 tumour tissue.
• 20368 gene expression measurements per sample.
• Data are already normalized.
6. Information extraction from real data {2/2}
1. Per-gene study of the mean values and standard deviations of gene expression for each state.
2. Significance Analysis for the discovery of differentially expressed genes (non-parametric t-test).
3. Construction of the covariance matrix of the significant genes.
4. SVM training using the significant gene expressions.
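Steps 1 and 3 of this pipeline can be sketched as follows; the toy matrix, the sample split, and the choice of "significant" genes are stand-in assumptions, not the deck's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the deck's data: 200 genes x 30 samples,
# first 20 samples "healthy", last 10 "tumour".
expr = rng.normal(0.0, 1.0, size=(200, 30))
healthy, tumour = expr[:, :20], expr[:, 20:]

# Step 1: per-gene mean and standard deviation for each state.
stats_per_state = {
    "healthy": (healthy.mean(axis=1), healthy.std(axis=1, ddof=1)),
    "tumour": (tumour.mean(axis=1), tumour.std(axis=1, ddof=1)),
}

# Step 3: covariance matrix of a set of "significant" genes
# (here simply the first 10 rows, standing in for the SAM hits).
sig_genes = expr[:10]
cov = np.cov(sig_genes)      # 10 x 10 covariance matrix, rows = genes
```

`np.cov` treats each row as one variable, matching the genes-as-rows convention of the expression matrix.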
7. Simulation Model
• Idea: simulate an “ideal”, noiseless distribution, then apply different noise models to it.
• Final distribution for gene i:
 xi = ai + ni , where ai is the noiseless component and ni is the noise.
8. Ideal Distribution
• Gene i not significant: normal distribution whose mean and standard deviation equal the corresponding real-data values.
• Gene i significant: multivariate normal distribution with parameters: the vector of real-data mean values for the given state and the covariance matrix Σ of the correlated significant genes, where:
 – Σ[i,j] equals the covariance of genes i and j, if correlated,
 – Σ[i,j] = 0, if not correlated, and
 – Σ[i,j] = var(i), if i = j.
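Sampling from the ideal distribution described above might look like this; the dimensions, means, and covariance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: 3 correlated "significant" genes and
# 2 independent non-significant genes, 1000 simulated samples.
mu_sig = np.array([2.0, 3.0, 1.0])
sigma = np.array([[1.0, 0.6, 0.0],      # Sigma as defined on the slide:
                  [0.6, 1.0, 0.0],      # covariances off-diagonal,
                  [0.0, 0.0, 0.5]])     # variances on the diagonal
mu_ns = np.array([0.5, 1.5])
sd_ns = np.array([0.2, 0.4])

# Significant genes: one multivariate normal draw per sample.
sig = rng.multivariate_normal(mu_sig, sigma, size=1000)
# Non-significant genes: independent univariate normal draws.
nonsig = rng.normal(mu_ns, sd_ns, size=(1000, 2))
ideal = np.hstack([sig, nonsig])        # samples x genes
```

With enough samples the empirical covariance of the correlated genes recovers the entries of Σ.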
9. Noise {1/3}
• The behavior of the data is studied by adding known noise models:
 – Uniform noise
 – Gaussian noise
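Applying the additive model xi = ai + ni with these two noise models could look like the following sketch; the noise amplitudes are assumptions, since the slide's exact parameters are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Noiseless stand-in for the ideal distribution a_i: 100 genes x 30 samples.
ideal = rng.normal(5.0, 1.0, size=(100, 30))

# Uniform noise on [-0.5, 0.5) and zero-mean Gaussian noise (sd = 0.3);
# both amplitudes are illustrative choices.
uniform_noise = rng.uniform(-0.5, 0.5, size=ideal.shape)
gaussian_noise = rng.normal(0.0, 0.3, size=ideal.shape)

x_uniform = ideal + uniform_noise      # x = a + n, uniform model
x_gaussian = ideal + gaussian_noise    # x = a + n, Gaussian model
```

Because the noise is additive and independent of the signal, the two noisy matrices keep the shape and mean structure of the ideal one.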
12. Evaluation
– Significance Analysis for the discovery of differentially expressed genes.
– Use of the differentially expressed genes as test data for the SVM classifier trained on the real data.
13. Real Data Significance Analysis
SAM tool (Significance Analysis of Microarrays).
Upregulated: 70
Downregulated: 236
14. SVM training using real data
• Linear kernel.
• 10-fold cross validation.
• Confusion matrix (rows: true class, columns: predicted class):

              Predicted
  True        Normal   Diseased
  Normal          19          1
  Diseased         2          8

• Accuracy: 90% (27 of 30 samples classified correctly).
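A setup analogous to the one above (linear-kernel SVM, 10-fold cross-validation, confusion matrix) can be reproduced on synthetic data with scikit-learn; the class sizes mirror the deck's 20 normal vs 10 diseased samples, but the features and the degree of separation are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
# Toy stand-in: 30 samples x 10 "significant gene" features,
# with a mean shift so the diseased class is separable.
X = rng.normal(0.0, 1.0, size=(30, 10))
y = np.array([0] * 20 + [1] * 10)        # 0 = normal, 1 = diseased
X[y == 1] += 2.0

clf = SVC(kernel="linear")
pred = cross_val_predict(clf, X, y, cv=10)   # 10-fold cross-validation
cm = confusion_matrix(y, pred)               # rows: true, cols: predicted
acc = accuracy_score(y, pred)
```

`cross_val_predict` with integer labels uses stratified folds, so each fold keeps the 2:1 normal-to-diseased ratio.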
23. Future applications {2/2}
• From ConsensusClusterPlus: the real data can be divided into 4 categories.
• Noise simulation taking these 4 categories into account.
24. References {1/2}
• “Novel markers for differentiation of lobular and ductal invasive breast carcinomas by laser microdissection and microarray analysis”, Turashvili et al., BMC Cancer, 2007.
• “Using Gene Expression Noise to Understand Gene Regulation”, Munsky et al., Science, Vol. 336.
• “Simulating Correlated Multivariate Normal Data”, Alison Kosel, 2009.
• “Interplay between gene expression noise and regulatory network architecture”, Chalancon et al., Trends in Genetics, Vol. 28.
• “Models of stochastic gene expression”, Paulsson et al., Physics of Life Reviews 2 (2005).
• “Intrinsic and extrinsic contributions to stochasticity in gene expression”, Swain et al., PNAS, Vol. 99.
25. References {2/2}
• “Intrinsic noise in gene regulatory networks”, Mukund Thattai and Alexander van Oudenaarden, PNAS, Vol. 98.
• “Making sense of microarray data distributions”, Hoyle et al., Bioinformatics, Vol. 18.
• “A Flexible Microarray Data Simulation Model”, Doulaye Dembele, Microarrays, Vol. 2.
• “Simulation of microarray data with realistic characteristics”, Nykter et al., BMC Bioinformatics, 2006.
• http://statweb.stanford.edu/~tibs/SAM/
• “ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking”, Wilkerson et al., Bioinformatics, 2010.