RNA-seq Analysis
Published in: Health & Medicine
Transcript

  • 1. RNA-seq analysis
      Mikael Huss, bioinformatics scientist at WABI (Wallenberg Advanced Infrastructure for Bioinformatics), Science for Life Laboratory / DBB, Stockholm University
      February 13, 2013
  • 2. Omics, biology and diseases
      [Figure: genomics + RNA profiles + protein “parts list” + protein profiles + interactomics, combined in systems biology, leading to pathways, molecular targets and diagnostics]
  • 3. Approximate contents of talk
      - Gene expression analysis in general; differences between RNA-seq and microarrays
      - Typical workflow(s) for RNA-seq analysis
      - Normalization issues
      - Visualization
      - Differential expression analysis
      I have tried to include many references so that you can go back to these slides for reference afterwards.
  • 4. How DNA gets transcribed to RNA (and then translated to proteins) varies between, e.g.:
      - Tissues
      - Cell types
      - Cell states
      - Individuals
  • 5. What can gene expression tell us? Basic research
      - How do gene expression patterns determine cellular identity (tissues, cell types …)?
      - How does gene expression control early development in an embryo?
      - What kinds of genes are expressed in response to specific stimuli (infections, smoking, environmental pollution, gym exercise …)?
      - What kinds of genes do bacteria or other microorganisms express in the human gut / in soil / in oceans under different conditions?
      … and much, much more …
  • 6. What can gene expression tell us? Diseases
      - Which genes are over- (or under-)expressed in patients vs. healthy controls?
      - Which genes are correlated with disease progression?
      - Can markers of hidden disease be found by sequencing blood plasma?
  • 7. Gene expression signatures for disease?
      Hypothesis: Cell types are stable states in a “space” of gene expression patterns. Diseases (e.g. cancers) distort the gene expression so that the cell ends up in the wrong stable state.
      Furusawa and Kaneko, Biology Direct 2009, 4:17
  • 8. Can the research community find such patterns?
      On-line prediction competitions, objectively scored by the organizers:
      - Diagnosing MS (multiple sclerosis), lung cancer, psoriasis, COPD (Swedish: KOL)
      - Prognosticating breast cancer outcome
  • 9. Human tissue RNA-seq data sets
      - Genotype-Tissue Expression project: http://commonfund.nih.gov/GTEx/
      - Illumina Human Body Map: accessed via ReCount database, bowtie-bio.sourceforge.net/recount/
      - Wang 2008 data set of ~15 human tissues: accessed via ReCount
      - RNA-seq Atlas: http://medicalgenomics.org/rna_seq_atlas
      - Human Protein Atlas: http://www.proteinatlas.org (tissue RNA-seq data not yet publicly released)
  • 10. Tools for genome-scale gene expression measurements
      - Microarrays (ca. 1995): sometimes called “gene chips”; based on hybridization
      - RNA sequencing (ca. 2008 in current form): based on sampling
  • 11. Typical (m)RNA-seq experiment
      [Figure: steps of an mRNA-seq experiment, with the “library” and the resulting reads labeled. Source: http://cmb.molgen.mpg.de]
  • 12. Alternative: rRNA depletion
      There are various kits for depleting rRNA instead (rather than selecting for poly-A tails).
      Pluses:
      - Can be used for microorganisms that don’t have poly-A tails
      - Thus, can be used for simultaneous host/pathogen expression profiling
      - Can find non-coding RNA
      Minuses:
      - Usually leaves in quite a lot of rRNA
      - In practice, efficiency often varies between samples -> hard to compare results
  • 13. Sequencing platforms
      Platform                                    Technology                   Length/read   Reads/run   Bases/run   Speed         Generation
      ABI 3730xl                                  Sanger sequencing            800 bp        96          60 kbp      10 years/HG   “old school”
      454 Life Sciences                           pyrosequencing               400 bp        1 million   400 Mbp     1 month/HG    “2nd gen”
      SOLiD + Illumina                            -                            100 bp        2 billion   500 Gbp     1 day/HG      “2nd gen”
      Pacific Biosciences, Oxford Nanopore etc.   single-molecule sequencing   20 000+ bp    5 million   100 Gbp     10 min/HG     “3rd gen”
  • 14. Microarray: Hybridization
      [Figure: microarray hybridization. Source: Wikipedia]
      The design of the microarray determines what you can detect in a sample.
  • 15. RNA sequencing: Sampling
      It is possible to detect transcripts that are not known a priori (in advance).
  • 16. RNA-seq advantages
      The non-dependence on a reference makes possible:
      - meta-transcriptomics
      - detecting novel splice variants
      - detecting novel transcripts
        - fusion transcripts
        - non-coding transcripts
  • 17. Some examples
      [Figure panels: RNA-seq Atlas; Wang 2008]
  • 18. Some examples
      [Figure panels: RNA-seq Atlas; Wang 2008; HPA. Skeletal muscle and adipose tissue samples are marked]
  • 19. What does one do with RNA-seq reads?
      - Mapping (also called alignment)
      - (De novo) assembly
  • 20. Mapping (alignment) vs. assembly
      Imagine a book being ripped to pieces, with word or sentence fragments ending up on each piece of paper.
      If you have a copy of the book that you can compare the pieces to, you have a mapping (alignment) problem.
      If you have no copy of the book, you have a de novo assembly problem.
  • 21. Mapping to a reference genome
      [Figure: reads from the sequencer shown above the gene (or transcript) sequence; annotations mark a sequencing error and genetic variation]
      Reads from the sequencer:
        CAATCAGA G TCCCACTGTGG
        AGACG TCCCACTGTGGGGTG
        GTGAAGTGTCCGTAGATGTGTG
        GCAAATGCAATCAGACG TCCC
  • 22. Mapping to a reference genome
        AGACG TCCCACTGTGGGGTG
        GTGAAGTGTCCGTAGATGTGTG
        GCAAATGCAATCAGACG TCCC
  • 23. Mapping to a reference genome
        GTGAAGTGTCCGTAGATGTGTG
        GCAAATGCAATCAGACG TCCC
  • 24. Mapping to a reference genome
        GCAAATGCAATCAGACG TCCC
  • 25. Mapping to a reference genome
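The mapping idea on the slides above can be sketched in a few lines. This is only a brute-force illustration, not how TopHat or any real aligner works: it slides each read along a reference and accepts the best position with at most a given number of mismatches. The reference string, the reads and the names `hamming` and `map_read` are hypothetical, made up for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None.

    Brute force: try every offset in the reference. Real aligners use
    indexes (e.g. suffix arrays or FM-indexes) instead of scanning.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


# Hypothetical reference and reads, loosely modeled on the slide.
reference = "GCAAATGCAATCAGACGTCCCACTGTGGGGTG"
exact_read = "CAATCAGACGTCCCACTGTGG"   # matches the reference exactly
error_read = "CAATCAGAGGTCCCACTGTGG"   # one substituted base
junk_read = "TTTTTTTTTT"               # does not fit anywhere

print(map_read(exact_read, reference))
print(map_read(error_read, reference))
print(map_read(junk_read, reference))
```

A single mismatch could be either a sequencing error or genetic variation; telling the two apart requires looking at many reads covering the same position.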
  • 26. Mapping to the genome vs. the transcriptome
      Vs. the genome:
      - Can (in principle) detect new transcripts, splice variants
      - Less sensitive, need a lot of coverage to discover new things
      - Need a “splice-aware” aligner such as TopHat, MapSplice, RUM etc.
      Vs. the transcriptome:
      - Not unbiased anymore, tied to existing annotation
      - Faster, more sensitive, need less coverage
      The best of both worlds?
      - Tools like TopHat (v1.4 and up) now do both
  • 27. If it had been de novo assembly
      Reads:
        CAATCAGA G TCCCACTGTGG
        AGACG TCCCACTGTGGGGTG
        GTGAAGTGTCCGTAGATGTGTG
        GCAAATGCAATCAGACG TCCC
      Assembly:
        CAATCAGA G TCCCACTGTGG
        AGACG TCCCACTGTGGGGTG
        GCAAATGCAATCAGACG TCCC
        GTGAAGTGTCCGTAGATGTGTG (“singleton”)
      Consensus sequence(s)
  • 28. Assembly of RNA-seq reads
      Will not be discussed much further here.
      Most popular de novo assemblers build de Bruijn graphs in which overlapping k-mers are connected to each other. The programs then try to find paths through the graph.
      Typically needs a LOT of RAM. One can try to pre-process using “digital normalization”.
      Tools:
      - Trinity
      - Velvet/Oases
      - CLC Bio (commercial)
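To make the de Bruijn graph idea concrete, here is a minimal sketch. It assumes the common formulation in which nodes are (k-1)-mers and every k-mer in a read contributes an edge; the toy reads and the function names are hypothetical. Real assemblers such as Trinity or Velvet/Oases add error correction, coverage tracking and graph simplification, none of which is shown.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer -> set of successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while there is exactly one way forward."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two overlapping toy reads (hypothetical) assemble into one contig.
graph = de_bruijn(["ACGT", "CGTA"], k=3)
print(walk_unbranched(graph, "AC"))
```

The memory cost mentioned on the slide comes from storing every distinct k-mer of a deep data set in such a graph.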
  • 29. Assembly of RNA-seq reads
      Typical workflow could be:
      - Clean the reads properly (remove adapters, low-quality reads)
        - Useful tools: FastQC, PRINSEQ, FASTX toolkit etc.
      - Run assembly tool of choice, resulting in a set of contigs
      - BLAST the contigs against the nt database, check for % overlap by transcripts in related organisms
      - Map your original reads back to the contigs and count the reads overlapping each (<- comparison of assembly & mapping)
  • 30. Quantifying expression with RNA-seq
      Microarrays give a continuous (floating-point) expression value for each gene.
      RNA-seq gives an integer value for each gene (“digital expression”): read counts.
  • 31. Example (SciLifeLab) mapping workflow
      - FASTQ file(s) -> TopHat 2.0 -> BAM file
      - BAM file -> Picard tools (SortSam, MarkDuplicates) -> sorted BAM file with duplicate reads removed
      - Sorted BAM file -> HTSeq 0.5 -> gene-level count files (for DE analysis)
      - Sorted BAM file -> Cufflinks 2.0 -> gene- and isoform-level expression estimates (FPKM, for reporting)
  • 32. RNA-seq mapping: different isoforms
      [Figure: reads mapped to two isoforms]
      - Isoform 1: Exon 1, Exon 2, Exon 3
      - Isoform 2: Exon 1, Exon 2
  • 33. (What it would look like mapped to the genome)
      [Figure: reads mapped across Exon 1, Exon 2 and Exon 3]
      Need a special mapping algorithm which allows large gaps, a “split-read aligner”.
  • 34. (What we would actually observe – of course we don’t know which reads come from which isoform)
      Statistical algorithms are needed to estimate what proportion of reads comes from which isoform (for example, maximum likelihood / expectation maximization).
  • 35. Isoform quantification tools (F = free, C = commercial, D = description only)
      Name                    F/C/D   Type of approach
      Xing et al. 2006        D       Maximum likelihood
      Partek                  C       Maximum likelihood
      Li et al. 2010          D       Maximum likelihood
      Avadis                  C       Maximum likelihood
      IsoEM                   F       Maximum likelihood
      MISO                    F       Maximum likelihood (MCMC)
      Cufflinks               F       Maximum likelihood
      rQuant                  F       Least squares (quadratic programming)
      Rpkmforgenes.py         F       Least squares
      Howard and Heber 2010   D       Least squares
      FluxCapacitor           F       Linear programming
      CLC Bio                 C       ?
      NSMAP                   F       Nonnegative Sparse Maximum A Posteriori
      ALEXA-SEQ               F       Use only reads that are compatible with a single isoform
      NEUMA                   D       Normalization by Expected Uniquely Mappable Area
  • 36. Some remarks on isoform quantification
      - It is necessary for correct gene-level quantification as well, because straight read counting methods can never be fully correct (from the 2012 CuffDiff2 paper)
      - Xing et al. (2006) gave the basic idea for EM-based isoform quantification, to which other programs (Cufflinks, MISO, IsoEM, …) have added various “bells and whistles”
      - It is actually pretty hard to do isoform quantification well, because there can be a lot of possible isoforms -> not enough sequence coverage to estimate
  • 37. Basic idea of the EM approach
      We have a set of reads mapping to some locus:
      - Some fit one specific isoform
      - Some fit several isoforms
      If we knew the isoforms’ expression levels, we could distribute the reads proportionally to those. But we don’t!
      On the other hand, if we knew the probability of each read to match each isoform, we could estimate the isoforms’ expression pretty well. But we don’t know that either.
      So … start with a guess and iterate!
      - Assign reads to isoforms according to some initial guess
      - Re-estimate isoform expression levels
      - Repeat until convergence!
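The iteration described above can be sketched as follows. This is a deliberately stripped-down illustration, not the algorithm of Cufflinks, MISO or IsoEM: it assumes every read is equally likely to come from any position of any compatible isoform, so isoform lengths and fragment-size models are ignored. The function name and the toy data are hypothetical.

```python
def em_isoform_fractions(read_compat, n_isoforms, n_iter=200):
    """Estimate the fraction of reads coming from each isoform.

    read_compat: one entry per read, listing the indices of the isoforms
    that the read is compatible with.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # initial guess: equal shares
    for _ in range(n_iter):
        # E-step: split each read among its compatible isoforms in
        # proportion to the current expression estimates.
        counts = [0.0] * n_isoforms
        for compat in read_compat:
            total = sum(theta[i] for i in compat)
            for i in compat:
                counts[i] += theta[i] / total
        # M-step: re-estimate expression from the fractional assignments.
        theta = [c / len(read_compat) for c in counts]
    return theta


# Toy locus (hypothetical): 30 reads fit only isoform 0, 10 fit only
# isoform 1, and 60 fit both.
reads = [[0]] * 30 + [[1]] * 10 + [[0, 1]] * 60
print(em_isoform_fractions(reads, n_isoforms=2))
```

In this toy case the ambiguous reads end up split 3:1, mirroring the ratio of the uniquely assigned reads, so the estimates approach 0.75 and 0.25.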
  • 38. Gene fusion detection with RNA-seq
      Beyond isoforms: detect pieces of different genes that have been fused. Look for reads that map in “wrong” ways.
      Wang et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs044
  • 39. Some further comments on microarrays and RNA-seq
      - Microarrays are still cheaper and faster.
        - You may be able to run more replicates, which is important for statistical power.
      - RNA-seq has a wider measurement range.
        - Low expressed transcripts:
          - Microarrays have high background signal -> poor measurement
          - RNA-seq can measure well if you sequence very deeply
        - Medium expressed transcripts:
          - Microarrays measure well
          - RNA-seq measures well if sequenced relatively deeply
        - High expressed transcripts:
          - Microarrays measure poorly because of saturation
          - RNA-seq measures well
      - Less is understood about how to pre-process and normalize RNA-seq data.
      - One interesting aspect of RNA-seq: you can continue to sequence a sample more to obtain better gene expression estimates.
  • 40. Analysis
      - Pre-processing and normalization
      - Visualization
      - Differential gene expression analysis
      - (Gene set analysis, pathway analysis, gene expression signatures … -> try to find the biological significance)
  • 41. Pre-processing
      Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
  • 42. Pre-processing
      Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
      - To correct for batch effects
        - Different labs
        - Different preparation times
        - Etc.
  • 43. Pre-processing
      Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
      - To correct for batch effects
        - Different labs
        - Different preparation times
        - Etc.
      - To correct for intrinsic technical biases in the technologies
  • 44. Pre-processing
      Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
      - To correct for batch effects
        - Different labs
        - Different preparation times
        - Etc.
      - To correct for intrinsic technical biases in the technologies
      - To make the expression value distributions conform to some assumptions in order to perform statistical tests
  • 45. RNA-seq pre-processing
      For RNA-seq data, it is still less understood than for microarrays how one should pre-process and normalize the data. Let’s look at some aspects (that sometimes apply to both RNA-seq and microarray data).
  • 46. R and Bioconductor
      Very helpful for (e.g.) microarray and RNA-seq differential expression analysis.
      Microarray:
      - affy, lumi (read raw microarray signal files & preprocess)
      - limma (differential expression analysis with complex designs)
      RNA-seq:
      - DESeq, edgeR, baySeq (differential expression analysis based on count data)
      - SAMSeq (nonparametric differential expression analysis)
  • 47. Variance stabilization
      - Raw data (could be microarray signal or RNA-seq counts): higher value -> higher variability (noise)
      - Log transform: lower value -> higher variability. Too aggressive
      - Variance stabilizing transform, e.g. voom() in the limma package
      Source: http://bridgecrest.blogspot.se/2011_09_01_archive.html
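The three cases on this slide can be reproduced with a short delta-method calculation. It is a sketch under a strong simplifying assumption, namely that counts are Poisson distributed with mean \mu (real RNA-seq counts are usually overdispersed), and the square-root transform is used only as a textbook example of a variance stabilizing transform; it is not what voom() does.

```latex
% Delta method: for a smooth transform g and a count X with mean \mu,
\operatorname{Var}[g(X)] \approx g'(\mu)^2 \, \operatorname{Var}[X]

% Assumption: X \sim \mathrm{Poisson}(\mu), so \operatorname{Var}[X] = \mu.

% Raw data, g(x) = x: variance grows with the mean.
\operatorname{Var}[X] = \mu

% Log transform, g(x) = \log x: variance grows as the mean shrinks.
\operatorname{Var}[\log X] \approx \frac{1}{\mu^2} \cdot \mu = \frac{1}{\mu}

% Square-root transform, g(x) = \sqrt{x}: variance roughly constant.
\operatorname{Var}[\sqrt{X}] \approx \frac{1}{4\mu} \cdot \mu = \frac{1}{4}
```

Under this assumption the log transform overcorrects (low values become the noisy ones), while a milder transform removes the dependence of the variance on the mean.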
  • 48. Quantifying expression with RNA-seq
      If you want to compare RNA-seq counts between different genes and/or samples, consider:
      - Longer genes/transcripts are expected to generate more reads
      - The more you sequence, the more reads you get from each gene
      Therefore, the standard measure has been RPKM (reads per kilobase per million mapped reads), which corrects for transcript length and sequencing depth:
      $$\mathrm{RPKM} = \frac{X_t / (l_{\mathrm{eff},t} / 10^3)}{N_{\mathrm{lib}} / 10^6} = \frac{10^9 \cdot X_t}{N_{\mathrm{lib}} \cdot l_{\mathrm{eff},t}}$$
      where $X_t$ is the number of reads mapped to transcript/gene/… t, $N_{\mathrm{lib}}$ is the number of mapped reads in the library, and $l_{\mathrm{eff},t}$ is the effective length of transcript/gene/… t.
      FPKM is a paired-end version of this.
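As a quick check of the RPKM definition, the calculation can be written out directly. The function name and the example numbers are hypothetical; the arithmetic follows the formula on the slide.

```python
def rpkm(reads_on_transcript, effective_length_bp, mapped_reads_in_library):
    """Reads per kilobase of transcript per million mapped reads."""
    reads_per_kb = reads_on_transcript / (effective_length_bp / 1e3)
    millions_of_reads = mapped_reads_in_library / 1e6
    return reads_per_kb / millions_of_reads


# Hypothetical example: 500 reads on a 2 kb transcript in a library of
# 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Doubling either the transcript length or the library size while keeping the read count fixed halves the value, which is exactly the correction the slide describes.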
  • 49. Alternatives
      TPM: “transcripts per million”. A slightly modified RPKM measure that accounts for differences in gene length distribution in the transcript population.
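A minimal sketch of TPM, assuming the usual definition: divide each count by the transcript's effective length, then rescale so that the values in one sample sum to one million. The function name and the numbers are hypothetical.

```python
def tpm(counts, effective_lengths):
    """Transcripts per million for one sample."""
    # Length-normalized read rates, one per transcript.
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical sample with three transcripts.
values = tpm(counts=[100, 100, 300], effective_lengths=[1000, 2000, 3000])
print(values)
print(sum(values))
```

Because the values always sum to one million within a sample, a TPM value can be read as a share of the transcript pool, which is not true of RPKM.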
  • 50. Alternatives
      TMM: “trimmed mean of M values”. Attempts to correct for differences in RNA composition between samples.
      E.g. if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
      [Figure: RNA population 1 and RNA population 2] Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1, although the expression levels are actually the same in populations 1 and 2.
      Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
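The composition effect can be reproduced with made-up numbers. In the sketch below, genes A, B and C have identical true abundance in both RNA populations, but population 1 also contains a dominant gene D. The correction shown is a plain median of count ratios, used here as a simplified stand-in for the idea behind TMM; it is not the trimmed, weighted procedure of Robinson and Oshlack or of edgeR. All names and values are hypothetical, and all genes are assumed to have the same length.

```python
from statistics import median


def simulate_counts(abundance, depth):
    """Expected read counts when `depth` reads are shared out by abundance."""
    total = sum(abundance.values())
    return {gene: depth * a / total for gene, a in abundance.items()}


# Hypothetical abundances: A, B and C are equal in both populations.
pop1 = {"A": 10, "B": 10, "C": 10, "D": 70}
pop2 = {"A": 10, "B": 10, "C": 10, "D": 0}
depth = 1_000_000  # equal sequencing depth in both samples

c1 = simulate_counts(pop1, depth)
c2 = simulate_counts(pop2, depth)
print(c1["A"], c2["A"])  # A looks about 3x lower in population 1

# Simplified composition factor: median ratio over genes seen in both.
shared = [g for g in c1 if c1[g] > 0 and c2[g] > 0]
factor = median(c1[g] / c2[g] for g in shared)
print(factor)
print(c1["A"] / factor, c2["A"])  # comparable again after rescaling
```

This also shows why such a factor cannot be computed on a single sample: it is defined relative to another sample.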
  • 51. Across-sample comparability
      Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046
  • 52. Across-sample comparability
  • 53. Across-sample comparability
  • 54. Practical issues with normalization methods
      - Limma / voom can give negative values
      - TMM cannot be done on a single sample
  • 55. RNA-seq pre-processing
      In RNA-seq, normalization of counts is often interwoven with differential expression analysis and done implicitly in DE packages such as DESeq, edgeR etc.
      Normalized values like RPKM are usually only used for reporting expression values, not for testing for differential expression. Why?
  • 56. Count nature of RNA-seq data
      These methods want to use the added statistical power provided by the count nature of RNA-seq data.
      Simplified toy example:
      - Scenario 1: A 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.
      - Scenario 2: A 300-bp transcript has 10 counts in sample A and 7 counts in sample B.
      Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM of sample A is the same in both scenarios, and so is the RPKM of sample B.
      In scenario 1, we can be more confident that there is a true difference in the expression level than in scenario 2 (although we would want more replicates, of course!). By analogy to a coin flip: 700 heads out of 1000 trials gives much more confidence that a coin is biased than 7 heads out of 10 trials.
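The coin-flip analogy can be checked numerically. The sketch below computes the exact probability that a fair coin gives at least k heads in n flips; the function name is hypothetical, and this one-sided binomial tail is only an illustration of the analogy, not the test that DESeq or edgeR performs.

```python
from math import comb


def fair_coin_tail(k, n):
    """P(at least k heads in n flips of a fair coin), computed exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


print(fair_coin_tail(7, 10))      # 0.171875
print(fair_coin_tail(700, 1000))  # vanishingly small
```

Seven heads out of ten happens by chance about 17% of the time, while 700 out of 1000 essentially never does, even though the proportion of heads is the same. Converting counts to RPKM discards exactly this information.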
  • 57. Visualization
      Can be useful for “sanity checking”, outlier detection and exploratory analysis in general.
      Examples of useful visualizations:
      - Heat maps
      - PCA/MDS/NMF
      - Box plots, violin plots etc.
  • 58. Box plots
      Useful for comparing groups.
      Adding the actual data points is optional but can be interesting.
  • 59. Sample correlation heat maps
      Heat maps are ubiquitous in transcriptomics.
      Correlations between samples, hierarchical clustering.
      Used for “sanity checks”, outlier detection.
      [Figure panels: Two tissues; Batch effects]
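The numbers behind a sample correlation heat map are simply pairwise correlations between expression profiles. A minimal sketch follows, assuming Pearson correlation on (for example) log-scale expression values; the sample names and values are hypothetical, and the plotting and hierarchical clustering steps are left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(samples):
    """Nested dict of all pairwise correlations between samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names}
            for a in names}


# Hypothetical expression values for four genes in three samples.
samples = {
    "liver_1": [5.0, 1.0, 8.0, 2.0],
    "liver_2": [5.5, 1.2, 7.5, 2.1],
    "muscle_1": [1.0, 6.0, 2.0, 9.0],
}
corr = correlation_matrix(samples)
for name, row in corr.items():
    print(name, [round(v, 2) for v in row.values()])
```

Replicates of the same tissue should correlate strongly with each other; a sample that correlates poorly with its own group is a candidate outlier or batch effect.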
  • 60. Gene / sample heat maps
      With a smaller collection of genes, one sometimes looks at gene/sample heat maps.
  • 61. PCA plots
      Another way to see how samples cluster.
  • 62. PCA plots
      Nice thing with PCA: you can also see how much each gene contributes to each principal component -> a kind of feature selection.
  • 63. Alternatives to PCA
      NMF: non-negative matrix factorization. Also a matrix decomposition technique (like PCA).
      “A bioinformatic assay for pluripotency in human cells”, Nature Methods, doi:10.1038/nmeth.1580
  • 64. PCA plot of human tissue RNA-seq
      [Figure legend: red = GTEx; green = Body Map; black = Human Protein Atlas]
  • 65. # of genes taking up X% of sequences
      [Figure: GTEx, RPKM; labeled genes: HBA1, HBB, HBA2]
  • 66. # of genes taking up X% of sequences
      [Figure: GTEx]
  • 67. # of genes taking up X% of sequences
      [Figure: Wang/Sandberg]
  • 68. Differential expression analysis
      Many tools available!
      Easily the most common type of analysis, even though it is understood that gene expression levels are not independent of each other and should in principle be considered together.
      However, since the number of samples is typically << the number of measured genes, a full model is usually not feasible to construct in practice. Some sort of feature selection is needed.
  • 69. Differential expression analysis
      One would simply like to do a t-test or something like that for each gene, but…
  • 70. Differential expression analysis
      One would simply like to do a t-test or something like that for each gene, but…
      - Assumes normal distribution & no mean-variance dependence
  • 71. Differential expression analysis
      One would simply like to do a t-test or something like that for each gene, but…
      - Assumes normal distribution & no mean-variance dependence
      - Hard to estimate variance from few samples
  • 72. Differential expression analysis
      One would simply like to do a t-test or something like that for each gene, but…
      - Assumes normal distribution & no mean-variance dependence
      - Hard to estimate variance from few samples
      - Multiple testing issue
  • 73. Parametric vs. non-parametric methods
      It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics -> methods like SAM (Significance Analysis of Microarrays) or SAM-seq (equivalent for RNA-seq data).
      However, it is (typically) harder to show statistical significance with non-parametric methods when there are few replicates.
      My rule of thumb:
      - Many replicates (~ >10) in each group -> use SAM(Seq)
      - Otherwise use DESeq or another parametric method
      Note that Simon Anders (creator of DESeq) says that non-parametric methods are definitely better with 12 replicates and maybe already at five:
      http://seqanswers.com/forums/showpost.php?p=74264&postcount=3
  • 74. Standard DE methods
      - Limma (microarrays, RNA-seq)
      - edgeR, DESeq (RNA-seq)
  • 75. Standard DE methods
      - Limma (microarrays, RNA-seq)
      - edgeR, DESeq (RNA-seq)
      Distributional issue: Solved by variance stabilizing transform in limma. edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that.
  • 76. Standard DE methods
      - Limma (microarrays, RNA-seq)
      - edgeR, DESeq (RNA-seq)
      Distributional issue: Solved by variance stabilizing transform in limma. edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that.
      Multiple testing issue: All of these packages report false discovery rate (corrected p values).
  • 77. Standard DE methods
      - Limma (microarrays, RNA-seq)
      - edgeR, DESeq (RNA-seq)
      Distributional issue: Solved by variance stabilizing transform in limma. edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that.
      Multiple testing issue: All of these packages report false discovery rate (corrected p values).
      Variance estimation issue: These packages (in slightly different ways) “borrow” information across genes to get a better variance estimate. One says that the estimates “shrink” from gene-specific estimates towards a common mean value.
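For the multiple testing issue, the corrected p values reported by such packages are commonly Benjamini-Hochberg adjusted values. The sketch below implements that procedure in plain Python as an illustration; that a given package uses exactly this adjustment by default is an assumption here, and the function name and p values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so each adjusted value is the
    # minimum of p * m / rank over this and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

With tens of thousands of genes tested at once, a raw cutoff of 0.05 would yield many false positives, so the adjusted values are the ones to threshold on.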
  • 79. CuffDiff2
      Integrates isoform quantification + differential expression analysis.
  • 80. Complex designs
      The simplest case is when you just want to compare two groups against each other. But what if you have several factors that you want to control for?
      E.g. you have taken tumor samples at two different time points from six patients, cultured the samples and treated them with two different anticancer drugs and a mock control treatment -> 2x6x3 = 36 samples.
      Now you want to assess the differential expression in response to one of the anticancer drugs, drug X. You could just compare all “drug X” samples to all control samples, but the inter-subject variability might be larger than the specific drug effect.
      Enter limma / DESeq / edgeR, which can work with factorial designs (SAMSeq cannot, which is another reason one might not want to use it).
  • 81. Limma and factorial designs
      limma stands for “linear models for microarray data”.
      Essentially, the expression of each gene is modeled with a linear relation:
      y = a + b*treatment + c*time + d*patient + e
      (a: baseline/average; e: error term/noise)
      The design matrix describes all the conditions, e.g. treatment, patient, time etc.
      http://www.math.ku.dk/~richard/courses/bioconductor2009/handout/19_08_Wednesday/KU-August2009-LIMMA/PPT-PDF/Robinson-limma-linear-models-ku-2009.6up.pdf
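To show what a design matrix looks like for the 2x6x3 example, the sketch below builds one with treatment (dummy) coding: an intercept plus one 0/1 column for each non-reference level of treatment, time and patient. It only constructs the matrix and does not fit anything, and it is not limma's own interface. The level names (`drugY`, `t1`, `p1`, …) are hypothetical, apart from drug X and the mock control mentioned on the slide.

```python
from itertools import product

# Factor levels; the first level of each factor is the reference.
treatments = ["mock", "drugX", "drugY"]
time_points = ["t1", "t2"]
patients = [f"p{i}" for i in range(1, 7)]


def design_matrix():
    """Return (column names, rows) for all 36 samples."""
    columns = (["intercept"] + treatments[1:] + time_points[1:]
               + patients[1:])
    rows = []
    for time, patient, treatment in product(time_points, patients,
                                            treatments):
        row = dict.fromkeys(columns, 0)
        row["intercept"] = 1
        # Reference levels get no column of their own.
        for level in (treatment, time, patient):
            if level in row:
                row[level] = 1
        rows.append([row[c] for c in columns])
    return columns, rows


columns, rows = design_matrix()
print(columns)
print(len(rows), "samples x", len(columns), "coefficients")
```

In a model like this, the coefficient for the `drugX` column estimates the drug X effect relative to the mock control after accounting for patient and time point, which is what makes the comparison robust to inter-subject variability.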
  • 82. Recent DE software comparison
  • 83. Take-away messages from DE tool comparison
      - CuffDiff2, which should theoretically be better, seems to work worse, probably due to the increased “statistical burden” from isoform expression estimation
      - The HTSeq quantification, which is theoretically “wrong”, seems to give good results with downstream software
      - It is practically always better to sequence more biological replicates than to sequence the same samples deeper
      Omitted from this comparison:
      - gains from ability to do complex designs
      - non-parametric methods
  • 84. The end
      Contact me at mikael.huss@scilifelab.se if you have any questions.
