Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)


Transcript

  • 1. Using Simulated Data to Optimise Experimental Design and Analysis for RNA-Sequencing. Conrad Burden, Mathematical Sciences Institute, Australian National University, Canberra
  • 2. RNA-Seq: Using high-throughput sequencing technology to sequence cDNA that has been reverse-transcribed from RNA, to get information about a sample’s RNA content. If the sample is mRNA from a cell, it detects which genes are expressed. Useful for:
    1. Expression profiling
    2. Detecting differential expression
  • 3. Workflow: Extract RNA → Library prep → Sequencing (RNA → cDNA)
    •  Extract mRNA from total RNA
    •  Randomly fragment
    •  Reverse transcribe to cDNA
    •  Ligate sequencing adaptor
    •  Size select to ~200 bases
    •  Amplify with PCR
    Sequence and map to reference genome to get a digital count of fragments sampled from each gene.
  • 4. Workflow: Extract RNA → Library prep → Sequencing (RNA → cDNA). Labels on the workflow: Biological variation; Technical variation; Poisson noise; Overdispersion. Final count for each gene is overdispersed Poisson.
  • 5. Workflow: Extract RNA → Library prep → Sequencing. Diagram annotations: transcript abundance ~ $q$; cDNA (conc = $R$); sequencing (count = $K$).
    1. For a given gene of interest, let $R$ = molar concentration of cDNA in the ‘library’, with $E(R) = q$, $\mathrm{Var}(R) = v$.
    2. Consider $q$ as a proxy for the ‘transcript abundance’ of this gene.
    3. The sequencer count $K$ for this gene given $R$ is Poisson: $K \mid R \sim \mathrm{Pois}(\lambda R)$.
    1, 2 and 3 imply $E(K) = \mu$, $\mathrm{Var}(K) = \mu(1 + \phi\mu)$, where $\mu = \lambda q$, $\phi = v/q^2$. $\phi$ is called the overdispersion.
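The moments stated on slide 5 follow from the laws of total expectation and total variance. A short derivation, using only the symbols defined on the slide ($R$, $K$, $q$, $v$, $\lambda$, $\mu$, $\phi$):

```latex
\begin{align*}
E(K) &= E\big[E(K \mid R)\big] = E(\lambda R) = \lambda q = \mu, \\
\operatorname{Var}(K) &= E\big[\operatorname{Var}(K \mid R)\big] + \operatorname{Var}\big[E(K \mid R)\big] \\
  &= E(\lambda R) + \operatorname{Var}(\lambda R) \\
  &= \lambda q + \lambda^2 v \\
  &= \mu + \frac{v}{q^2}\,(\lambda q)^2
   = \mu + \phi\mu^2
   = \mu(1 + \phi\mu).
\end{align*}
```

The first term is the Poisson (sequencing) noise and the second is the extra variance inherited from the library concentration $R$, which is why $\phi = 0$ recovers a pure Poisson count.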
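  • 6. Workflow: Extract RNA → Library prep → Sequencing. Diagram annotations: transcript abundance ~ $q$; cDNA (conc = $R$); sequencing (count = $K$). Moreover, if $\lambda R \sim \mathrm{Gamma}(\text{mean} = \mu,\ \text{variance} = \phi\mu^2)$, then $K \sim \mathrm{NegBin}(\text{mean} = \mu,\ \text{variance} = \mu(1 + \phi\mu))$. If $\lambda$, $\mu$ and $\phi$ can be estimated from the data, $q = \mu/\lambda$ gives an estimate of the abundance of this transcript.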
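The gamma-Poisson construction on this slide can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the seed and the values of `mu` and `phi` are illustrative assumptions, not taken from the talk. It assumes the gamma rate has mean μ and variance φμ² (shape 1/φ, scale φμ), which is the variance implied by Var(R) = v and φ = v/q² on slide 5.

```python
import random

def sample_poisson(rate, rng):
    """Draw from Pois(rate) by counting unit-rate arrivals in [0, rate)."""
    count, t = 0, rng.expovariate(1.0)
    while t < rate:
        count += 1
        t += rng.expovariate(1.0)
    return count

def sample_negbin(mu, phi, rng):
    """Draw K with mean mu and variance mu*(1 + phi*mu) as a gamma-Poisson mixture."""
    # lambda*R ~ Gamma(shape=1/phi, scale=phi*mu): mean mu, variance phi*mu**2
    rate = rng.gammavariate(1.0 / phi, phi * mu)
    return sample_poisson(rate, rng)

rng = random.Random(1)           # fixed seed (arbitrary) for reproducibility
mu, phi = 20.0, 0.25             # illustrative values only
counts = [sample_negbin(mu, phi, rng) for _ in range(50_000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((k - sample_mean) ** 2 for k in counts) / len(counts)

print("mean:", sample_mean, "expected:", mu)
print("variance:", sample_var, "expected:", mu * (1 + phi * mu))
```

With these settings the sample mean should land close to μ and the sample variance close to μ(1 + φμ), well above the Poisson value of μ.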
  • 7. [Figure panels: Synthetic Poisson vs. Poisson; Same cDNA library, different sequencers; Same biol. source, different cDNA libraries; Different biol. reps.] (Data: human lymphoblastoid cell lines from J.K. Pickrell et al., Nature 464, 768–772.)
  • 8. Data from an RNA-Seq experiment to detect differential expression typically looks like this (columns: different conditions or biol. samples, n reps per condition; rows: typically > 10,000 genes or transcript isoforms):
                          Condition A               Condition B               ... etc.
    Gene                  Rep 1   Rep 2   ...etc    Rep 1   Rep 2   ...etc
    ENSG00000209432       4       6       ...       35      45      ...
    ENSG00000209432       0       0       ...       2       1       ...
    ENSG00000209432       110     96      ...       177     203     ...
    ENSG00000209432       1268    1089    ...       9246    9873    ...
    ENSG00000212678       148     201     ...       112     93      ...
    ... etc.
  • 9. Data from an RNA-Seq experiment to detect differential expression typically looks like this: (the same table as on slide 8, overlaid with the question) Which genes are differentially expressed?
  • 10. R packages for assessing differential expression based on the negative binomial distribution:
    •  DESeq: S. Anders and W. Huber, Gen. Biol. 11:R106 (2010)
    •  edgeR: M. Robinson, D. McCarthy and G. Smyth, Bioinf. 26:139 (2010)
    •  (also NBPseq: Y. Di et al., SAGMB 10:24 (2011) and TSPM: P. Auer and R. Doerge, SAGMB 10:26 (2011))
  • 11. They differ in how they estimate the overdispersion ($\phi$) for each gene from a limited number of replicates:
    •  DESeq: dispersion $\phi$ estimated for each gene as the greater of a per-gene maximum likelihood estimate and a parametric fit to $\phi = a + b/\mu$
    •  edgeR: dispersion $\phi$ estimated per gene from a likelihood function conditioned on the sum across conditions, then squeezed towards a common-to-all-genes dispersion using empirical Bayes
  • 12. p-values under the null hypothesis $(\mu/\lambda)_{\text{condition A}} = (\mu/\lambda)_{\text{condition B}}$, calculated under the approximation that the total counts in each condition are NB, and conditioned on the sum of counts. [Figure: the observed point $(a, b)$ on axes $K_A$ = counts (cond. A) and $K_B$ = counts (cond. B).]
  • 13. p-values under the null hypothesis $(\mu/\lambda)_{\text{condition A}} = (\mu/\lambda)_{\text{condition B}}$, calculated under the approximation that the total counts in each condition are NB, and conditioned on the sum of counts. The (1-sided) p-value is the sum of these probabilities. [Figure: $\mathrm{Prob}(K_A = a \mid K_A + K_B = a + b)$ plotted against $a$, with $k_A$ marked; inset: the point $(a, b)$ on axes $K_A$ = counts (cond. A) and $K_B$ = counts (cond. B).]
  • 14. p-values under the null hypothesis $(\mu/\lambda)_{\text{condition A}} = (\mu/\lambda)_{\text{condition B}}$, calculated under the approximation that the total counts in each condition are NB, and conditioned on the sum of counts. The (2-sided) p-value is the sum of these probabilities. [Figure: $\mathrm{Prob}(K_A = a \mid K_A + K_B = a + b)$ plotted against $a$, with $k_A$ marked; inset: the point $(a, b)$ on axes $K_A$ = counts (cond. A) and $K_B$ = counts (cond. B).]
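The conditional calculation on slides 12 to 14 can be sketched in Python. This is a minimal illustration, not the code of DESeq or edgeR: the function names are hypothetical, the NB parameters of the per-condition totals are taken as given inputs rather than estimated, and the two-sided p-value is assumed to sum the conditional probabilities of every split of the total that is no more likely than the observed one (the shaded region of the slide's figure is not recoverable from this transcript).

```python
import math

def nb_logpmf(k, mean, phi):
    """log P(K = k) for a negative binomial with variance mean*(1 + phi*mean)."""
    r = 1.0 / phi              # NB 'size' parameter
    p = r / (r + mean)         # success probability
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log1p(-p))

def conditional_two_sided_p(a, b, mean_a, mean_b, phi_a, phi_b):
    """Two-sided p-value for observed totals (a, b), conditioned on a + b."""
    total = a + b
    # Joint log-probability of each split (j, total - j) of the observed sum
    log_joint = [nb_logpmf(j, mean_a, phi_a) + nb_logpmf(total - j, mean_b, phi_b)
                 for j in range(total + 1)]
    shift = max(log_joint)                      # avoid underflow in exp()
    weights = [math.exp(v - shift) for v in log_joint]
    observed = weights[a]
    # Sum splits that are no more likely than the observed one; the small
    # relative tolerance makes sure exact ties are included.
    extreme = sum(w for w in weights if w <= observed * (1 + 1e-9))
    return extreme / sum(weights)

# Under the null both conditions share the same mean (illustrative values)
print(conditional_two_sided_p(10, 10, 10.0, 10.0, 0.1, 0.1))
print(conditional_two_sided_p(0, 40, 20.0, 20.0, 0.1, 0.1))
```

Dividing by the sum of the weights is what implements the conditioning on $K_A + K_B$; working in log space keeps the calculation stable for large counts.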
  • 15. Robles et al., BMC Genomics (2012) 13:484
  • 16. Test DESeq and edgeR using simulated data. Testing the null hypothesis:
    1. Start with the Pickrell et al. dataset of 69 sequenced cDNA libraries from the HapMap project (i.e. a table of RNA-Seq counts for 69 biological replicates of ~60,000 transcript isoforms).
    2. Use max. likelihood to produce from this a set of NB parameters $(\mu_i, \phi_i)$ for $i = 1, \ldots, {\sim}60{,}000$ representing a ‘typical’ range of means and overdispersions for our synthetic transcriptome.
    3. Construct a synthetic dataset of counts:
       •  $n$ reps of ‘control’ counts $K_{ij}^{\text{control}} \sim \mathrm{NB}(\mu_i, \phi_i)$, $j = 1, \ldots, n$
       •  $n$ reps of ‘treatment’ counts $K_{ij}^{\text{treatment}} \sim \mathrm{NB}(\mu_i, \phi_i)$
  • 17. Null hypothesis: (no up- or down-regulation), n = 3 reps vs. 3 reps: expect flat p-value distribution. [Figure: histograms of null p-values (x-axis 0.0 to 1.0, y-axis percentage of total) for edgeR, DESeq and NBP, each in three panels: all (all t’cripts), high (> 100 counts), low (< 100 counts).]
  • 18. DESeq null p-values: synthetic data 3 vs. 3. [Figure: density against p-value (0.0 to 1.0).]
  • 19. Right-hand spike is an artifact of calculating p-values from a discrete distribution; it could be ‘fixed’ by replacing the discrete distribution by a continuous distribution. [Figure, two panels of $\mathrm{Prob}(K_A = a \mid K_A + K_B = k_A + k_B)$ against $a$, with $k_A$ marked. Discrete panel: 2-sided p-value is the sum of these probs. Continuous panel: 2-sided p-value is the shaded area; choose a point randomly in the interval $(k_A - \tfrac{1}{2},\ k_A + \tfrac{1}{2})$.]
  • 20. DESeq null p-values: synthetic data 3 vs. 3. [Figure: density against p-value (0.0 to 1.0); curves: original spectrum, spike removed.]
  • 21. DESeq null p-values: synthetic data 3 vs. 3. [Figure: density against p-value (0.0 to 1.0); curves: original spectrum, spike removed, parameters not estimated.] Remaining deviation from a uniform distribution is from having to estimate the parameters $\mu$ and $\phi$ for each transcript.
  • 22. Null hypothesis: α = 0 (no up- or down-regulation), n = 3 reps vs. 3 reps: expect flat p-value distribution. [Figure: the histograms of slide 17 (edgeR, DESeq, NBP; panels all, high, low) with annotations: Underestimate of dispersion; Artifact of p-value calculation for discrete data; Overestimate of dispersion.]
  • 23. FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance. [Figure: bar chart of FPR; axis and legend labels not recoverable from the transcript.] (Li et al., Biostatistics (2012) 13:523)
  • 24. FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance. [Figure: bar chart of FPR; axis and legend labels not recoverable from the transcript.] Annotations: Overdispersion underestimated: underconservative. Overdispersion overestimated: overconservative. (Li et al., Biostatistics (2012) 13:523)
  • 25. Testing the power to detect differential expression
    •  How many replicates are appropriate? (biological reps, or library prep reps if reps are from the same biological source)
    •  What sequencing depth?
    •  Is multiplexing (via barcodes) worthwhile?
  • 26. Synthetic dataset to test the power of DESeq and edgeR to detect differential expression
    1. Use max. likelihood estimates of $(\mu_i, \phi_i)$ from the Pickrell data again
    2. Construct a synthetic dataset of counts:
       •  $n$ reps of ‘control’ counts $K_{ij}^{\text{control}} \sim \mathrm{NB}(\mu_i, \phi_i)$, $j = 1, \ldots, n$
       •  $n$ reps of ‘treatment’ counts $K_{ij}^{\text{treatment}} \sim \mathrm{NB}(\mu_i \theta_i, \phi_i)$
       where
       $\theta_i = (1 + X_i)$ for 7.5% of the transcripts (up-regulated)
       $\theta_i = (1 + X_i)^{-1}$ for a further 7.5% (down-regulated)
       $\theta_i = 1$ for the remainder,
       with $X_i$ i.i.d. exponential random variables with parameter 1.
  • 27. Define a gene to be ‘effectively differentially expressed’ if $\theta_i < 1/1.2$ or $\theta_i > 1.2$. [Figure labels: 85% unchanged; Effectively non-DE; Effectively DE.]
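The fold-change model of slide 26 and the ‘effectively DE’ rule of slide 27 can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes each transcript is assigned to the up-regulated, down-regulated or unchanged group independently at random with probabilities 7.5%, 7.5% and 85%, whereas the talk may have used exact proportions.

```python
import random

def draw_fold_changes(n_transcripts, rng, frac_up=0.075, frac_down=0.075):
    """Draw theta_i for each transcript following the scheme on slide 26."""
    thetas = []
    for _ in range(n_transcripts):
        u = rng.random()
        x = rng.expovariate(1.0)              # X_i ~ exponential, parameter 1
        if u < frac_up:
            thetas.append(1.0 + x)            # up-regulated
        elif u < frac_up + frac_down:
            thetas.append(1.0 / (1.0 + x))    # down-regulated
        else:
            thetas.append(1.0)                # unchanged
    return thetas

def effectively_de(theta, threshold=1.2):
    """Slide 27 rule: theta > 1.2 or theta < 1/1.2."""
    return theta > threshold or theta < 1.0 / threshold

rng = random.Random(0)                        # arbitrary seed
thetas = draw_fold_changes(100_000, rng)
frac_effective = sum(effectively_de(t) for t in thetas) / len(thetas)
frac_unchanged = sum(t == 1.0 for t in thetas) / len(thetas)
print("effectively DE:", frac_effective, "unchanged:", frac_unchanged)
```

Because $P(1 + X_i > 1.2) = e^{-0.2} \approx 0.82$, only about 12% of all transcripts are effectively DE under this scheme, a little less than the 15% whose mean is changed at all. The treatment mean for transcript $i$ would then be $\mu_i \theta_i$.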
  • 28. Control for false discovery rate FDR = FP/(TP + FP) using the Benjamini-Hochberg adjusted p-value $p_{\text{adj}} < \alpha$. Finally, measure a false positive rate
       FPR = (# of effectively non-DE transcripts with $p_{\text{adj}} < \alpha$) / (total # of effectively non-DE transcripts)
    and a true positive rate
       TPR = (# of effectively DE transcripts with $p_{\text{adj}} < \alpha$) / (total # of effectively DE transcripts).
    Do this for a range of coverage depths and # replicates.
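The bookkeeping on this slide can be sketched in a few lines of Python. This is a minimal illustration assuming the standard Benjamini-Hochberg step-up adjustment; `bh_adjust` and `fpr_tpr` are hypothetical helper names and the input values are made up for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fpr_tpr(padj, is_effectively_de, alpha=0.01):
    """FPR and TPR as defined on slide 28 (both groups must be non-empty)."""
    called = [p < alpha for p in padj]
    n_de = sum(is_effectively_de)
    n_non_de = len(padj) - n_de
    tp = sum(c and d for c, d in zip(called, is_effectively_de))
    fp = sum(c and not d for c, d in zip(called, is_effectively_de))
    return fp / n_non_de, tp / n_de

adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(adjusted)
fpr, tpr = fpr_tpr([0.001, 0.5, 0.002, 0.9], [True, True, False, False])
print(fpr, tpr)
```

Note that the FPR here is measured against the ‘effectively non-DE’ transcripts, so a transcript with a true but small fold change (between 1/1.2 and 1.2) counts as a false positive if it is called.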
  • 29. With 7.5% up-regulated and 7.5% down-regulated: DESeq TPR = TP/(TP + FN) (× 100%) = ‘sensitivity’, using Benjamini-Hochberg adjusted p-value $p_{\text{adj}} \le 0.01$ as a significance criterion. 100% coverage ≈ $10^7$ reads.
  • 30. With 7.5% up-regulated and 7.5% down-regulated: edgeR TPR = TP/(TP + FN) (× 100%) = ‘sensitivity’, using Benjamini-Hochberg adjusted p-value $p_{\text{adj}} \le 0.01$ as a significance criterion. 100% coverage ≈ $10^7$ reads.
  • 31. With 7.5% up-regulated and 7.5% down-regulated: edgeR TPR = TP/(TP + FN) (× 100%) = ‘sensitivity’
    1. TPR increases with number of reps n
    2. TPR decreases with decreasing coverage depth
    3. Multiplexing (more reps, less coverage, keeping n times depth constant) improves TPR (grey curve)
    4. edgeR has slightly better sensitivity than DESeq
  • 32. With 7.5% up-regulated and 7.5% down-regulated: DESeq FPR = FP/(TN + FP) (× 100%) = 1 - ‘specificity’, using Benjamini-Hochberg adjusted p-value $p_{\text{adj}} \le 0.01$ as a significance criterion. [Figure curves labelled n = 12, n = 4, n = 2.]
  • 33. With 7.5% up-regulated and 7.5% down-regulated: edgeR FPR = FP/(TN + FP) (× 100%) = 1 - ‘specificity’, using Benjamini-Hochberg adjusted p-value $p_{\text{adj}} \le 0.01$ as a significance criterion. [Figure curves labelled n = 12, n = 4, n = 2.]
    1. Multiplexing (more reps, less coverage, keeping n times depth constant) improves specificity (grey curve)
    2. DESeq has slightly better specificity than edgeR
  • 34. With 7.5% up-regulated and 7.5% down-regulated: edgeR FPR = FP/(TN + FP) (× 100%) = 1 - ‘specificity’, using fold change > 2 as a criterion for detecting differential expression (not recommended). [Figure curves labelled n = 12, n = 4, n = 2.] FPR increases with decreasing coverage depth because more transcripts have very low counts and Poisson shot noise can easily induce a spurious doubling of counts.
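The shot-noise argument on this slide can be made concrete with a small exact calculation. The sketch below is a simplification and not the analysis in the talk: it assumes a single pair of independent pure Poisson counts with the same mean (no overdispersion, one replicate per condition) and asks how often one count is more than twice the other. The function names and the two example means are illustrative.

```python
import math

def poisson_pmf(k, mean):
    """P(K = k) for K ~ Pois(mean), computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def prob_spurious_doubling(mean):
    """P(one of two i.i.d. Pois(mean) counts is more than twice the other)."""
    # Truncate the sums far enough into the upper tail to be negligible
    kmax = int(mean + 12 * math.sqrt(mean) + 20)
    pmf = [poisson_pmf(k, mean) for k in range(kmax + 1)]
    total = 0.0
    for a in range(kmax + 1):
        for b in range(kmax + 1):
            if a > 2 * b or b > 2 * a:
                total += pmf[a] * pmf[b]
    return total

low_depth = prob_spurious_doubling(2.0)      # transcript with ~2 expected reads
high_depth = prob_spurious_doubling(200.0)   # transcript with ~200 expected reads
print("mean 2:", low_depth)
print("mean 200:", high_depth)
```

At a mean of 2 reads a spurious doubling is a common event, while at a mean of 200 reads it is vanishingly rare, which is why a raw fold-change threshold produces more false positives as coverage drops.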
  • 35. To summarise
    •  Have tested the performance of Negative Binomial based R packages for detecting differential expression using synthetic data.
    •  Under the null hypothesis, DESeq’s performance is consistently more conservative than edgeR across # of replicates, and closer to the expected significance level for small numbers of reps. edgeR is closer for high numbers of reps.
    •  With 15% of transcripts differentially expressed, for both edgeR and DESeq:
       –  sensitivity (= TPR) improves with number of replicates, as expected
       –  sensitivity declines with decreased sequencing depth, as expected
       –  sensitivity is better for edgeR than DESeq
       –  but multiplexing (decreasing sequencing depth while increasing # of replicates with the same total amount of ‘read estate’) increases sensitivity markedly
  • 36. To summarise. Recommend:
    •  The more (independent!) replicates the better
    •  It’s OK to sacrifice sequencing read depth by multiplexing
  • 37. Acknowledgements
    Sue Wilson, Australian National University and University of New South Wales
    Jen Taylor, Division of Plant Industry, CSIRO
    Sumaira Qureshi, Mathematical Sciences Institute, Australian National University
    Jose Robles, Division of Plant Industry, CSIRO
    Stuart Stephen, Division of Plant Industry, CSIRO