Microarray data noise simulation
1. Microarray data noise simulation
Despoina I. Kalfakakou
Interinstitutional postgraduate program
“Information Technologies in Medicine and Biology”
Course: Simulation methods in medicine and biology
Instructor: Dr G. Spyrou
2. Microarray data
• DNA microarray: a collection of tiny
DNA spots on a surface.
• Used in order to estimate the expression
of a large number of genes at the same
time.
• The expression measurements are saved
in a tsv file, with rows representing the
genes and columns representing the
samples.
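As a minimal sketch of the file layout described above, the snippet below parses a tab-separated table with genes as rows and samples as columns. The gene IDs, sample names, values, and the helper name read_expression are all made up for illustration; the real file's header layout is an assumption.

```python
import csv
import io

# Hypothetical miniature expression file: the first column holds gene IDs,
# the remaining columns hold one sample each (all names are made up).
tsv_text = (
    "gene\tnormal_1\tnormal_2\ttumour_1\n"
    "GENE_A\t7.1\t6.9\t9.4\n"
    "GENE_B\t5.0\t5.2\t5.1\n"
)


def read_expression(handle):
    """Return (sample_names, {gene_id: [expression values]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # skip the gene-ID column
    genes = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, genes


samples, genes = read_expression(io.StringIO(tsv_text))
print(samples)
print(genes["GENE_A"])
```

For a real file, the same function could be called on an open file handle instead of the in-memory string.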
3. Microarray data noise
• Biological noise:
– Gene expression is a random and noisy
process.
• “Inner” (intrinsic) noise: the result of the inherent
stochasticity of biochemical processes such as
transcription and translation.
• “Outer” (extrinsic) noise: variations in the quantities or states
of other cellular components (e.g., proteins)
that indirectly change the expression of a
particular gene.
• Technical noise: Artefacts.
5. Information extraction from real data {1/2}
• The real data consist of 30 breast tissue
samples in 2 states: 20 healthy tissue and 10
tumour tissue.
• 20368 gene expression measurements per
sample.
• Data are already normalized.
6. Information extraction from real data {2/2}
1. Per gene study of mean values and standard
deviations of gene expressions for each
state.
2. Significance analysis to discover
differentially expressed genes
(non-parametric t-test).
3. Construction of the covariance matrix of the
significant genes.
4. SVM training using the significant gene
expressions.
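Steps 1 to 3 can be sketched with NumPy as follows. This is an illustration on random placeholder data, not the analysis used in the slides: the matrix size, the injected shift, and the cut-off of 4.0 are assumptions, and a plain Welch t statistic stands in for the SAM procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the real matrix: 50 genes x 30 samples,
# the first 20 columns healthy and the last 10 tumour (as in the slides).
n_genes, n_healthy, n_tumour = 50, 20, 10
expr = rng.normal(loc=8.0, scale=1.0, size=(n_genes, n_healthy + n_tumour))
expr[:5, n_healthy:] += 4.0  # make the first 5 genes differ in tumour samples

healthy, tumour = expr[:, :n_healthy], expr[:, n_healthy:]

# Step 1: per-gene mean and sample standard deviation for each state.
mean_h, sd_h = healthy.mean(axis=1), healthy.std(axis=1, ddof=1)
mean_t, sd_t = tumour.mean(axis=1), tumour.std(axis=1, ddof=1)

# Step 2 (stand-in): Welch t statistic per gene with an arbitrary cut-off.
# The slides use SAM; this simple threshold is only an illustration.
t_stat = (mean_t - mean_h) / np.sqrt(sd_t**2 / n_tumour + sd_h**2 / n_healthy)
significant = np.flatnonzero(np.abs(t_stat) > 4.0)

# Step 3: covariance matrix of the significant genes within one state
# (rows of the input are genes, so the result is genes x genes).
cov_tumour = np.cov(tumour[significant])

print(significant)
print(cov_tumour.shape)
```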
7. Simulation Model
• Idea: simulate an “ideal”, noiseless
distribution, then apply different noise
models to it.
• Final distribution for gene i:
• $x_i = a_i + n_i$, where $a_i$ is the noiseless distribution and $n_i$
is the noise.
8. Ideal Distribution
• Gene i not significant: normal distribution whose mean
and standard deviation equal those of the real data
for gene i in the given state.
• Gene i significant: multivariate normal distribution with
parameters: the vector of real data mean values in
the given state and the covariance
matrix Σ of the real correlated significant genes, where:
– Σ[i,j] equals the covariance of genes i and j, if they are correlated,
– Σ[i,j] = 0, if they are not correlated, and
– Σ[i,j] = var(i), if i = j.
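The construction of Σ and the sampling step for significant genes might look like the sketch below. The placeholder data, the correlation threshold of 0.6 used to decide whether two genes count as "correlated", and the eigenvalue repair are assumptions; the slides do not state how correlation was decided or how a non-positive-semidefinite Σ was handled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder "real" data for 4 significant genes in one state (10 samples).
real = rng.normal(loc=[[6.0], [7.5], [9.0], [5.5]], scale=1.0, size=(4, 10))
real[1] += 0.8 * real[0]  # make genes 0 and 1 correlated

mu = real.mean(axis=1)      # mean vector of the real data
full_cov = np.cov(real)     # genes x genes covariance
corr = np.corrcoef(real)    # used only to decide which pairs are "correlated"

# Keep a covariance only where |correlation| passes the assumed threshold;
# the diagonal (correlation 1) always keeps var(i).
threshold = 0.6
sigma = np.where(np.abs(corr) >= threshold, full_cov, 0.0)

# Zeroing entries can break positive semi-definiteness, so check and,
# if needed, shift the spectrum slightly before sampling.
min_eig = np.linalg.eigvalsh(sigma).min()
if min_eig < 0:
    sigma += (1e-8 - min_eig) * np.eye(len(mu))

# Draw 20 simulated samples; rows of `ideal` are samples, columns are genes.
ideal = rng.multivariate_normal(mu, sigma, size=20)

# A non-significant gene only needs a univariate normal draw.
ideal_single = rng.normal(loc=mu[2], scale=real[2].std(ddof=1), size=20)

print(ideal.shape, ideal_single.shape)
```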
9. Noise {1/3}
• The behavior of the data is studied by adding known
noise models:
– Uniform noise:
– Gaussian noise:
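A minimal sketch of adding the two noise models to a noiseless matrix, following x_i = a_i + n_i. The noise parameters are not given in the text, so the zero-mean form, the half-width of 0.5, and the standard deviation of 0.5 are assumptions, as are the function names.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the noiseless values a_i: 20 samples x 4 genes.
ideal = rng.normal(loc=8.0, scale=1.0, size=(20, 4))


def add_uniform_noise(a, half_width, rng):
    """Return a + n with n ~ U(-half_width, +half_width)."""
    return a + rng.uniform(-half_width, half_width, size=a.shape)


def add_gaussian_noise(a, sigma, rng):
    """Return a + n with n ~ N(0, sigma^2)."""
    return a + rng.normal(0.0, sigma, size=a.shape)


x_uniform = add_uniform_noise(ideal, half_width=0.5, rng=rng)
x_gauss = add_gaussian_noise(ideal, sigma=0.5, rng=rng)

print(x_uniform.shape, x_gauss.shape)
```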
12. Evaluation
– Significance analysis to discover
differentially expressed genes.
– Use of the differentially expressed genes as
test data for the SVM classifier trained on
the real data.
13. Real Data Significance Analysis
SAM tool (Significance Analysis of Microarrays).
Upregulated: 70
Downregulated: 236
14. SVM training using real data
• Linear kernel.
• 10-fold cross validation.
• Truth table (rows: true class; columns: predicted class):

True \ Predicted   Normal   Diseased
Normal             19       1
Diseased           2        8

• Accuracy: 90%.
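The reported accuracy follows directly from the table counts, as the short check below shows. Sensitivity and specificity are not stated in the slides; they are derived here from the same counts, treating "Diseased" as the positive class (an assumption).

```python
# Counts from the table: keys are (true class, predicted class).
confusion = {
    ("Normal", "Normal"): 19,
    ("Normal", "Diseased"): 1,
    ("Diseased", "Normal"): 2,
    ("Diseased", "Diseased"): 8,
}

total = sum(confusion.values())
correct = confusion[("Normal", "Normal")] + confusion[("Diseased", "Diseased")]

accuracy = correct / total  # (19 + 8) / 30

# Treating "Diseased" as the positive class:
diseased_total = confusion[("Diseased", "Normal")] + confusion[("Diseased", "Diseased")]
normal_total = confusion[("Normal", "Normal")] + confusion[("Normal", "Diseased")]
sensitivity = confusion[("Diseased", "Diseased")] / diseased_total  # 8 / 10
specificity = confusion[("Normal", "Normal")] / normal_total        # 19 / 20

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```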
23. Future applications {2/2}
• From ConsensusClusterPlus: the real data can
be divided into 4 categories.
• Noise simulation considering these 4
categories.
24. References {1/2}
• “Novel markers for differentiation of lobular and ductal
invasive breast carcinomas by laser microdissection and
microarray analysis”, Turashvili et al., BMC Cancer, 2007.
• “Using Gene Expression Noise to Understand Gene
Regulation”, Munsky et al., Science, Vol. 336.
• “Simulating Correlated Multivariate Normal Data”, Alison
Kosel, 2009.
• “Interplay between gene expression noise and regulatory
network architecture”, Chalancon et al., Trends in Genetics,
Vol. 28.
• “Models of stochastic gene expression”, Paulsson et al.,
Physics of Life Reviews 2 (2005).
• “Intrinsic and extrinsic contributions to stochasticity in gene
expression”, Swain et al., PNAS Vol. 99.
25. References {2/2}
• “Intrinsic noise in gene regulatory networks”, Mukund Thattai
and Alexander van Oudenaarden, PNAS Vol. 98.
• “Making sense of microarray data distributions”, Hoyle et al.,
Bioinformatics Vol. 18.
• “A Flexible Microarray Data Simulation Model”, Doulaye
Dembele, Microarrays, Vol. 2.
• “Simulation of microarray data with realistic characteristics”,
Nykter et al., BMC Bioinformatics 2006.
• http://statweb.stanford.edu/~tibs/SAM/
• “ConsensusClusterPlus: a class discovery tool with confidence
assessments and item tracking”, Wilkerson et al.,
Bioinformatics, 2010.