Introduction to statistics iii

Statistics for Next Generation Sequencing (RNA-Seq)

Distribution?
• 25000 genes, each with counts over several samples
• 2 conditions, each with several replicates

• Recall: microarray data were modelled as log-Normal
• This was based on fits to actual data with many replicates

• No equivalent data for RNA-Seq
• So go back to first principles

Simplifying the Hypergeometric Distribution
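
The derivation on this slide did not survive, so the chain below is an assumed reading of the title. Reads are sampled without replacement from a finite pool of fragments, which gives a hypergeometric count per gene. When the pool is much larger than the sample, that count is close to binomial, and when each gene is a small fraction of the pool, the binomial is close to Poisson. A minimal sketch with hypothetical pool and sample sizes:

```python
from math import comb, exp, factorial

# Hypothetical sizes, chosen only for illustration.
POOL = 1_000_000   # fragments in the library
FROM_GENE = 500    # fragments in the pool that come from one gene
READS = 2_000      # fragments actually sequenced


def hypergeom_pmf(k):
    """P(k reads from the gene) when sampling without replacement."""
    return comb(FROM_GENE, k) * comb(POOL - FROM_GENE, READS - k) / comb(POOL, READS)


def binom_pmf(k):
    """Same probability if each read were drawn independently."""
    p = FROM_GENE / POOL
    return comb(READS, k) * p**k * (1 - p) ** (READS - k)


def poisson_pmf(k):
    """Poisson approximation with lambda = expected reads from the gene."""
    lam = READS * FROM_GENE / POOL
    return lam**k * exp(-lam) / factorial(k)


# Print the three probabilities side by side for small counts.
for k in range(6):
    print(k, round(hypergeom_pmf(k), 4), round(binom_pmf(k), 4), round(poisson_pmf(k), 4))
```

With these sizes the three columns should agree closely. The agreement degrades as the sample approaches the pool size or as one gene dominates the pool.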

The Poisson Distribution

λ is both the mean and the variance
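
For reference, the Poisson probability mass function is P(X = k) = λ^k e^(−λ) / k!. The sketch below checks numerically that the mean and variance computed from this formula are both λ. The value λ = 4 and the truncation point are arbitrary choices for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam."""
    return lam**k * exp(-lam) / factorial(k)


def poisson_moments(lam, k_max=100):
    """Mean and variance from the pmf, truncated at k_max.

    The tail beyond k_max is negligible when lam is small relative to k_max.
    """
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mean, var = poisson_moments(4.0)
print(mean, var)  # both should be very close to 4
```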

The Poisson Distribution
(Wikipedia)
• The number of soldiers killed by horse-kicks each year in each corps in
the Prussian cavalry. This example was made famous by a book of Ladislaus
Josephovich Bortkiewicz (1868–1931).
• The number of yeast cells used when brewing Guinness beer. This example was
made famous by William Sealy Gosset (1876–1937).
• The number of phone calls arriving at a call centre per minute.
• The number of goals in sports involving two competing teams.
• The number of deaths per year in a given age group.
• The number of jumps in a stock price in a given time interval.
• Under an assumption of homogeneity, the number of times a web server is
accessed per minute.
• The number of mutations in a given stretch of DNA after a certain amount of
radiation.
• The proportion of cells that will be infected at a given multiplicity of infection.

Is Mean = Variance for NGS?

– Variance ∝ Mean²

Log scale: the white line is the Poisson line

Why this Over-Dispersion?

• The Poisson model only
models technical variation,
not biological variation

• Biological variation induces
more variance than
captured by the Poisson
model
– No reason for difference from microarrays, where SD ∝ Mean (or Variance ∝ Mean²)

SD vs Mean for Microarrays
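
The step from biological variation to Variance ∝ Mean² can be made explicit with the law of total variance. This is a sketch under an assumption not stated on the slide: the true expression level Λ of a gene varies between biological replicates with mean μ and a constant coefficient of variation c, and the observed count Y is Poisson given Λ.

```latex
\begin{aligned}
\operatorname{Var}(Y)
  &= \mathbb{E}\big[\operatorname{Var}(Y \mid \Lambda)\big]
   + \operatorname{Var}\big(\mathbb{E}[Y \mid \Lambda]\big) \\
  &= \mathbb{E}[\Lambda] + \operatorname{Var}(\Lambda) \\
  &= \mu + c^{2}\mu^{2}
\end{aligned}
```

For small μ the Poisson term μ dominates. For large μ the variance grows as Mean², which is the same scaling as SD ∝ Mean in microarrays.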

What Distribution is X?

• Log-Normal for Arrays?

• The combination of log-Normal and Poisson
doesn’t have a neat closed form (i.e., formula)

• So assume Gamma distribution
– Poisson + Gamma -> Negative Binomial
– Used traditionally to fix the problem of over-dispersion
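
The Poisson + Gamma mixture can be illustrated by simulation: draw a rate from a Gamma distribution, then draw a Poisson count at that rate. The sketch below uses only the standard library. The mean `mu`, the dispersion `phi` (the squared coefficient of variation of the Gamma rate), and the sample size are hypothetical values, and the Poisson sampler is a simple textbook method rather than anything prescribed by the slides.

```python
import random
from math import exp
from statistics import mean, variance


def poisson_sample(lam, rng):
    """Knuth's multiplication method; suitable for modest lam only."""
    limit = exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def gamma_poisson_sample(mu, phi, rng):
    """Draw a rate from a Gamma with mean mu and CV^2 = phi, then a Poisson count."""
    shape = 1.0 / phi
    scale = mu * phi  # mean of the Gamma = shape * scale = mu
    return poisson_sample(rng.gammavariate(shape, scale), rng)


rng = random.Random(0)
mu, phi = 50.0, 0.25  # hypothetical mean and dispersion
counts = [gamma_poisson_sample(mu, phi, rng) for _ in range(20000)]

# Sample mean, sample variance, and the variance mu + phi * mu^2
# expected for the mixture.
print(mean(counts), variance(counts), mu + phi * mu**2)
```

The sample variance should land far above the sample mean and close to μ + φμ², which is the over-dispersion a plain Poisson model cannot produce.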

The Gamma Distribution

Control on
Right Tail

The Negative Binomial Distribution
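
The formula on this slide was lost in extraction. One common way to write the negative binomial, assumed here rather than taken from the slides, uses a mean μ and a dispersion φ, with r = 1/φ:

P(Y = k) = Γ(k + r) / (k! Γ(r)) · (r / (r + μ))^r · (μ / (r + μ))^k

The sketch below evaluates this pmf on the log scale and checks numerically that it sums to 1 and has variance μ + φμ². The values of `mu` and `phi` are arbitrary.

```python
from math import exp, lgamma, log


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi.

    Under this parameterization the variance is mu + phi * mu**2.
    """
    r = 1.0 / phi
    log_p = (
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu))
        + k * log(mu / (r + mu))
    )
    return exp(log_p)


mu, phi = 10.0, 0.5  # hypothetical values
probs = [nb_pmf(k, mu, phi) for k in range(2000)]
total = sum(probs)
nb_mean = sum(k * p for k, p in enumerate(probs))
nb_var = sum((k - nb_mean) ** 2 * p for k, p in enumerate(probs))

# Compare the numerical variance with mu + phi * mu^2.
print(total, nb_mean, nb_var, mu + phi * mu**2)
```

As φ shrinks toward 0, the variance approaches μ and the distribution approaches the Poisson.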

Estimating Parameters

For each gene, estimate the mean across replicates, and then estimate the
variance from the curve fit above
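
The fitted curve itself did not survive extraction, so the following is only a sketch of the two-step idea under stated assumptions: the mean-variance curve has the negative binomial form Variance = Mean + φ·Mean², a single φ is shared across genes, and φ is fitted by least squares. The counts are simulated, and all function names and parameter values are hypothetical.

```python
import random
from math import exp
from statistics import mean, variance


def poisson_sample(lam, rng):
    """Knuth's multiplication method; suitable for modest lam only."""
    limit = exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_gene(mu, phi, n_reps, rng):
    """Simulated replicate counts for one gene (Gamma-Poisson mixture)."""
    shape = 1.0 / phi
    return [poisson_sample(rng.gammavariate(shape, mu * phi), rng) for _ in range(n_reps)]


def fit_dispersion(count_rows):
    """Least-squares phi in: variance = mean + phi * mean**2, pooled over genes."""
    num = den = 0.0
    for row in count_rows:
        m, s2 = mean(row), variance(row)
        num += (s2 - m) * m**2
        den += m**4
    return num / den


def fitted_variance(row, phi):
    """Step 2: read the variance off the fitted curve at the gene's mean."""
    m = mean(row)
    return m + phi * m**2


rng = random.Random(1)
true_phi = 0.2  # hypothetical dispersion used to simulate the data
gene_means = [rng.uniform(20, 100) for _ in range(500)]
rows = [simulate_gene(mu, true_phi, 10, rng) for mu in gene_means]

phi_hat = fit_dispersion(rows)
print(phi_hat)                            # estimate of the shared dispersion
print(fitted_variance(rows[0], phi_hat))  # curve-based variance for the first gene
```

Reading the variance off a curve fitted across all genes borrows information between genes, which matters because a per-gene sample variance from a handful of replicates is very noisy.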
