2. Background
μ, σ²
• Few observations made by a black box
• What is the distribution behind the black box?
• E.g., with what probability will it output a number
bigger than 5?
3. Approach
• Easy to determine with many observations
• With few observations:
• Assume a canonical distribution based on prior
knowledge
• Determine parameters of this distribution using
the observations, e.g., mean, variance
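The approach above can be sketched in a few lines, assuming the canonical distribution is normal. The observations and the function name `tail_probability` are hypothetical, chosen only to illustrate estimating the mean and variance from a handful of values and then asking how likely an output bigger than 5 would be.

```python
from statistics import NormalDist, mean, stdev

def tail_probability(observations, threshold):
    """Fit a normal distribution to the observations and return P(X > threshold)."""
    mu = mean(observations)
    sigma = stdev(observations)  # sample standard deviation (n - 1 denominator)
    fitted = NormalDist(mu, sigma)
    return 1.0 - fitted.cdf(threshold)

# Hypothetical observations from the black box (not taken from the slides)
obs = [4.1, 5.3, 4.8, 6.0, 4.6]
p = tail_probability(obs, 5)
print(f"Estimated P(X > 5) = {p:.3f}")
```

With so few observations the estimated parameters are themselves uncertain, so the result is only as good as the normality assumption and the sample.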
6. Microarray Data
• Many genes: 25,000
• 2 conditions (or more), many replicates within
each condition
• Which genes are differentially expressed
between the two conditions?
7. More Specifically
• For a particular gene
– Each condition is a black box
– Say 3 observations from each black box
• Do both black boxes have the same
distribution?
– Assume same canonical distribution
– Do both have the same parameters?
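One possible way to compare the two black boxes for a single gene is a two-sample t statistic. This is a minimal sketch, assuming both conditions are normal with a common variance and testing only whether the means differ. The intensity values and the name `pooled_t_statistic` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming normal data with equal variances.

    Returns the statistic and its degrees of freedom.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, na + nb - 2

# Hypothetical log-scale intensities for one gene, 3 replicates per condition
condition_1 = [8.1, 8.4, 8.2]
condition_2 = [9.0, 9.3, 9.1]
t, df = pooled_t_statistic(condition_1, condition_2)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

A large absolute value of t relative to a Student's t distribution with `df` degrees of freedom suggests the two conditions do not share the same mean. A library routine such as `scipy.stats.ttest_ind` would also return the corresponding p-value.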
8. Which Canonical Distribution
• Use data with many replicates
• 418.0294, 295.8019, 272.1220, 315.2978, 294.2242,
379.8320, 392.1817, 450.4758, 335.8242, 265.2478,
196.6982, 289.6532, 274.4035, 246.6807, 254.8710,
165.9416, 281.9463, 246.6434, 259.0019, 242.1968
• Distribution??
11. The QQ plot of log scale intensities
(i.e., actual vs simulated from normal)
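The points of such a QQ plot can be computed as sketched below, using the 20 replicate values listed on slide 8. Two assumptions are made here: the reference quantiles come from a fitted normal distribution rather than from simulated draws, and the log is taken base 2 (the slides do not state the base, and the correlation shown does not depend on it). The helper names are hypothetical.

```python
from math import log2, sqrt
from statistics import NormalDist, mean, stdev

intensities = [
    418.0294, 295.8019, 272.1220, 315.2978, 294.2242,
    379.8320, 392.1817, 450.4758, 335.8242, 265.2478,
    196.6982, 289.6532, 274.4035, 246.6807, 254.8710,
    165.9416, 281.9463, 246.6434, 259.0019, 242.1968,
]

def qq_points(values):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson_r(pairs):
    """Correlation between the two coordinates; near 1 means a straight QQ line."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

raw_pairs = qq_points(intensities)
log_pairs = qq_points([log2(x) for x in intensities])
print(f"QQ correlation, raw scale: {pearson_r(raw_pairs):.4f}")
print(f"QQ correlation, log scale: {pearson_r(log_pairs):.4f}")
```

Plotting each list of pairs gives the QQ plot itself. The correlation is only a rough numerical summary of how straight the line is, not a formal normality test.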
12. QQ Plot against a Normal Distribution
• 10 + 10 replicates in two groups
• Single group QQ plot
• Combined 2 groups QQ plot
• Combined log-scale QQ plot
Shapiro-Wilk Test
17. SD is flat now, except for very low values
Another reason to work on the log scale
Figure: SD vs mean across 3 replicates, computed for all genes after log-transformation
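A toy illustration of why the SD flattens on the log scale, assuming the replicate noise is multiplicative (an assumption, not something the slides state). The three hypothetical genes below have the same relative spread at very different intensity levels.

```python
from math import log2
from statistics import mean, stdev

# Hypothetical genes: three replicates each, with the same relative
# (multiplicative) spread at every intensity level.
base_levels = [100, 1000, 10000]
factors = [0.8, 1.0, 1.25]
genes = [[level * f for f in factors] for level in base_levels]

raw_sds = []
log_sds = []
for replicates in genes:
    logged = [log2(x) for x in replicates]
    raw_sds.append(stdev(replicates))
    log_sds.append(stdev(logged))
    print(
        f"mean = {mean(replicates):9.1f}  raw SD = {raw_sds[-1]:8.2f}  "
        f"log2 SD = {log_sds[-1]:.4f}"
    )
```

On the raw scale the SD grows with the mean, while on the log scale it is the same for all three genes, which is the flat pattern described above.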
24. The curve fit here may be a better estimate
Lots of false positives can be avoided here
Not much difference here
Figure: SD vs mean across 3 replicates, computed for all genes after log-transformation
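The slides do not say how the curve is fitted. As one possible sketch, a moving average of per-gene SDs over genes with similar mean intensity gives a smoothed SD that could stand in for a noisy 3-replicate estimate. The function name `smoothed_sd`, the window size, and the data are all hypothetical.

```python
from statistics import mean

def smoothed_sd(means, sds, window=3):
    """Moving-average fit of SD as a function of mean intensity.

    Genes are ranked by mean; each gene's fitted SD is the average SD of
    its neighbours in that ranking. Results are in the original gene order.
    """
    order = sorted(range(len(means)), key=lambda i: means[i])
    half = window // 2
    fitted = [0.0] * len(means)
    for rank, gene in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        fitted[gene] = mean(sds[order[j]] for j in range(lo, hi))
    return fitted

# Hypothetical per-gene log-scale means and SDs (3 replicates each)
gene_means = [6.0, 6.1, 6.2, 10.0, 10.1]
gene_sds = [0.30, 0.02, 0.28, 0.10, 0.12]
fitted = smoothed_sd(gene_means, gene_sds)
for m, s, f in zip(gene_means, gene_sds, fitted):
    print(f"mean = {m:5.1f}  per-gene SD = {s:.2f}  fitted SD = {f:.2f}")
```

The second gene's SD of 0.02 is far below that of its neighbours, so a t statistic built on it would be inflated. Using the fitted value instead is one way such false positives might be avoided.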