Introduction to statistics iii

763 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
763
On SlideShare
0
From Embeds
0
Number of Embeds
110
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Introduction to statistics iii

  1. 1. Statistics for Next Generation Sequencing (RNA-Seq)
  2. 2. Distribution?• 25000 genes, each with counts over several samples • 2 conditions, each with several replicates• Recall, log-Normal for Microarrays • Based on fitting on actual data with many replicates• No equivalent data for RNA-Seq • So go back to first principles
  3. 3. RNA-Seq Setting
  4. 4. RNA-Seq Counts Distribution
  5. 5. Hypergeometric Distribution
  6. 6. Simplifying the Hypergeometric Distribution
  7. 7. The Poisson Distribution λ is both mean and variance
  8. 8. The Poisson Distribution (Wikipedia)• The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).• The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876–1937).[19]• The number of phone calls arriving at a call centre per minute.• The number of goals in sports involving two competing teams.• The number of deaths per year in a given age group.• The number of jumps in a stock price in a given time interval.• Under an assumption of homogeneity, the number of times a web server is accessed per minute.• The number of mutations in a given stretch of DNA after a certain amount of radiation.• The proportion of cells that will be infected at a given multiplicity of infection.
  9. 9. Is Mean = Variance for NGS ?– Variance ∝ Mean2 Log Scale: Whiteline is the Poisson line
  10. 10. Why this Over-Dispersion• The Poisson model only models technical variation, not biological variation• Biological variation induces more variance than captured by the Poisson model– No reason for difference from microarrays where SD ∝ Mean (or Variance ∝ Mean2) SD vs Mean for Microarrays
  11. 11. Handling Over-Dispersion
  12. 12. What Distribution is X?• Log-Normal for Arrays?• The combination of log-Normal and Poisson doesn’t have a neat closed form (i.e., formula)• So assume Gamma distribution – Poisson + Gamma -> Negative Binomial – Used traditionally to fix the problem of over- dispersion
  13. 13. The Gamma Distribution Control on Right Tail
  14. 14. The Negative Binomial Distribution
  15. 15. Estimating Parameters For each gene, estimate the mean across replicates, and then estimate the variance from the curve fit above
  16. 16. Bias Correction
  17. 17. Thank You

×