base-resolution rna-seq
Jeff Leek
Johns Hopkins Bloomberg School of Public Health

@simplystats
normally
You are free to:

Copy, share, adapt and remix
Photograph, film and broadcast
Live tweet, blog, post video of

Pro...
today

1.  types of statistical methods
2.  derfinder
3.  Unexpected expression

@simplystats
data generation

Genome
@simplystats
data generation

Transcripts

Genome
@simplystats
data generation
Reads

Transcripts

Genome
@simplystats
“simplest” thing – annotate-identify

Genome

@simplystats
exon model
Count by Exon

Genome

Bullard et al. BMC Bioinformatics 2010
@simplystats
union model
Union of all exons

Genome

Bullard et al. BMC Bioinformatics 2010
@simplystats
union-intersection model
Union/Intersection

Genome

Bullard et al. BMC Bioinformatics 2010
@simplystats
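A minimal sketch of the "union of all exons" gene model, assuming half-open exon intervals and that a read is counted when its start position falls inside the merged exons. The helper names `merge_intervals` and `count_reads`, the counting rule, and the toy coordinates are illustrative assumptions, not taken from Bullard et al.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval when the new one overlaps it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def count_reads(read_starts, regions):
    """Count reads whose start position falls inside any region."""
    return sum(any(s <= pos < e for s, e in regions) for pos in read_starts)


# Hypothetical gene with two isoforms that share part of an exon.
isoform_a = [(100, 200), (300, 400)]
isoform_b = [(150, 250), (300, 400)]

# Union model: pool the exons of every isoform, then merge them.
union_exons = merge_intervals(isoform_a + isoform_b)

# Hypothetical read start positions.
reads = [120, 210, 260, 350, 450]
union_count = count_reads(reads, union_exons)
```

Changing the gene model (per-exon counts, union, union/intersection) changes which intervals are passed to `count_reads`, which is one way the gene-model choice feeds into the final counts.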
sources of variation in annotate-identify
1.  annotation
2.  gene models
3.  fragment-level biases
4.  technical variation...
annotation variation

Frazee et al. Biostatistics under review

@simplystats
gc-variation

Hansen et al. 2011 Biostatistics
@simplystats
biological variation

Stranger et al. (2007) vs. Montgomery et al. (2010) ...
some data

http://bowtie-bio.sourceforge.net/recount/
@simplystats
assemble-identify
Reads

Genome
@simplystats
assemble-identify (align)

Genome

@simplystats
assemble-identify (assemble)

Genome

Fragments

Transcripts
Trapnell et al. 2010 Nat. Biotech
@simplystats
assemble-identify (abundance)

Genome

Transcripts

Trapnell et al. 2010 Nat. Biotech
@simplystats
inherent ambiguity (boundaries)

Genome

Fragments

Transcripts
@simplystats
inherent ambiguity (assembly)

Genome

Alternative
Assemblies
@simplystats
assembly variation

Frazee et al. in prep
@simplystats
result of assembly variation

Frazee et al. in prep
@simplystats
result of assembly variation (bio reps)

Frazee et al. in prep
@simplystats
result of assembly variation

Frazee et al. in prep
@simplystats
result of assembly variation
[Figure: density histograms of Cufflinks p-values; panel title: Cufflinks v2; x-axis: p-value; y-axis: density ...]
methods
annotate-identify
1.  align
2.  gene-model
3.  abundances
4.  analyze

assemble-identify
1.  align
2.  assemble
3....
differentially expressed region finder
1.  Calculate base pair-resolution coverage
2.  Perform test at each base
...
derfinder notes
•  Ignores annotation
•  Coverage data at base resolution, designed
for “differential” expression analysis...
Solution
[Figure: base-by-base coverage values along a region]

@simplystats
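A minimal sketch of how base-resolution coverage can be computed from aligned reads, assuming each read is a half-open, 0-based `(start, end)` interval on a single chromosome. The function name `base_coverage` and the toy reads are hypothetical; this is not the derfinder implementation.

```python
def base_coverage(reads, region_length):
    """Return the number of reads covering each base of a region.

    Uses a difference array: +1 where a read starts, -1 just past its end,
    then a running sum gives per-base coverage in one pass.
    """
    diff = [0] * (region_length + 1)
    for start, end in reads:
        # Clip reads to the region so out-of-range reads are harmless.
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    coverage = []
    running = 0
    for delta in diff[:region_length]:
        running += delta
        coverage.append(running)
    return coverage


# Three hypothetical reads on a 10-base region.
toy_reads = [(0, 4), (2, 6), (3, 8)]
coverage = base_coverage(toy_reads, 10)
```

Reducing reads to a coverage vector like this is also where paired-end, junction and mapping-quality information drops out.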
result

n samples → 3 billion nt


@simplystats
Frazee et al. Biostatistics in review
base-pair model (case/control)

g() = Transform (Box-Cox, log(+32) etc.)
Yi,j = coverage on sample i at base j
lj = genomi...
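A minimal sketch of the per-base test for the case/control setting, assuming the transform g() is log2(count + 32) and that there are no confounder terms. Under those assumptions the test of the group effect at base j reduces to an equal-variance two-sample t-statistic. The function names and the toy coverage values are hypothetical.

```python
import math
from statistics import mean, variance


def transform(count, offset=32):
    """g(): log2 transform with an offset, one of the options on the slide."""
    return math.log2(count + offset)


def base_t_statistic(case_cov, control_cov):
    """Equal-variance two-sample t-statistic at a single base.

    Equivalent to testing the group coefficient in a linear model
    g(Y) = baseline + group effect + error, with no confounders (assumption).
    """
    x = [transform(c) for c in case_cov]
    y = [transform(c) for c in control_cov]
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    diff = mean(x) - mean(y)
    if se == 0:
        # No within-group variation: the statistic is 0 or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


# Rows are samples, columns are bases (hypothetical coverage values).
cases = [[10, 50, 52], [12, 48, 55], [11, 51, 49]]
controls = [[10, 5, 6], [12, 4, 7], [11, 6, 5]]

t_stats = [
    base_t_statistic([s[j] for s in cases], [s[j] for s in controls])
    for j in range(3)
]
```

With confounders in the model, the same idea would be fitted by least squares at every base instead of using the closed-form two-sample statistic.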
batch-variation

Blue: 3 sds below the mean
Orange: 3 sds above the mean
Horizontal l...
finding the statistics for d.e. bases

t ~ π0f0 + π1f1 + π2f2 + π3f3

@simplystats
empirical bayes

@simplystats
estimating parameters

t ~ π0f0 + π1f1 + π2f2 + π3f3
Assumed known – the distribution of zeros
Alternatively – Gottardo an...
estimating parameters

t ~ π0f0 + π1f1 + π2f2 + π3f3
Estimated null distribution from e.g. Efron 2002

@simplystats
estimating parameters

t ~ π0f0 + π1f1 + π2f2 + π3f3
Estimated from 2-groups model, assumed symmetric

@simplystats
hmm
hidden states
DE   DE   DE   not DE   not DE
t1   t2   t3   t4   t5
emissions ...
statistic

Observed

@simplystats
Frazee et al. Biostatistics in review
monte-carlo p-value

Observed
Null

Jaffe et al. Biostatistics 2011
Frazee et al. Biostatistics in review
Lag...
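A minimal sketch of a Monte Carlo p-value: the observed statistic is compared with statistics recomputed under the null (for example after permuting sample labels). The add-one correction is a common convention assumed here, not something stated on the slide, and the null values are made up for illustration.

```python
def empirical_p_value(observed, null_stats):
    """Fraction of null statistics at least as large as the observed one.

    The +1 in numerator and denominator keeps the p-value above zero
    when no null statistic exceeds the observed value (assumed convention).
    """
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


# Hypothetical region statistics from nine permutations of the labels.
null_stats = [1.0, 2.5, 0.3, 4.0, 0.9, 1.7, 3.2, 0.5, 2.1]
p_value = empirical_p_value(3.0, null_stats)
```

The resolution of such a p-value is limited by the number of null statistics, so a small number of permutations can only give coarse p-values.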
ma-plots

@simplystats
Frazee et al. Biostatistics in review
statistical significance

[Figure: p-value histograms (y-axis: Frequency); panel titles: DER Finder - males, DER...]
percent “correct hits” by ranking

@simplystats
caveat

Genome

Bullard et al. BMC Bioinformatics 2010
@simplystats
caveat

Genome

Bullard et al. BMC Bioinformatics 2010
@simplystats
annotation incorrect

[Figure: female vs. male; tracks: exons, states; y-axes: t statistic, log2(count+32) ...]
annotation missing

[Figure: female vs. male; tracks: exons, states; y-axes: t statistic, log2(count+32) ...]
missed by cufflinks

@simplystats
Frazee et al. Biostatistics in review
computational goals
•  Aligned reads (say from TopHat) to DERs in <
24 hours, all within R statistical software
–  Table o...
derfinder - fast
1.  Test for differential expression at each
base, record statistic (linear modeling)
2.  Identify contig...
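A minimal sketch of the thresholding step: scan the per-base statistics, keep contiguous runs above a cutoff, and summarize each run. The area is taken here as the sum of the statistics in the run, which is an assumption; `find_regions` and the toy statistics are hypothetical and this is not the bumphunter or derfinder code.

```python
def find_regions(stats, cutoff):
    """Return (start, end, area) for each contiguous run with stat > cutoff.

    Coordinates are 0-based and half-open; area is the sum of the
    statistics inside the run (assumed summary).
    """
    regions = []
    start = None
    for pos, value in enumerate(stats):
        if value > cutoff:
            if start is None:
                start = pos  # a new candidate region opens here
        elif start is not None:
            regions.append((start, pos, sum(stats[start:pos])))
            start = None
    if start is not None:
        # Close a region that runs to the end of the vector.
        regions.append((start, len(stats), sum(stats[start:])))
    return regions


# Hypothetical per-base statistics along a short stretch of genome.
stats = [0.5, 3.0, 4.0, 1.0, 5.0, 6.0, 7.0, 0.2]
regions = find_regions(stats, cutoff=2.0)
```

Each region's area can then be compared with areas from permuted data to get a region-level p-value.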
time and memory needed: derSnyder
20 samples

•  Load & filter data: 10 cores with mclapply
1hr 15min, 177 GB
•  Make mode...
Counts: derSnyder

20 samples

•  Load & filter data: 10 cores with
mclapply 1hr 15min, 177 GB
•  Create count table: 26 m...
lieber brain samples
•  DLPFC Paired-end RNAseq Data
•  36 samples across 6 age ranges, n=6/
group: Fetal (age < 0) ; Infa...
lieber brain samples

@simplystats
test for base-level de

@simplystats
thresholding on statistic
F-statistic corresponding to
p-value < 10^-8 (F5,30)


@simplystats
derfinder results
•  alt model: age group + median coverage
•  null model: median coverage
•  threshold: p-value < 1e-...
annotating
•  Devised “light-weight” R annotation files for
UCSC hg19 knownGene and Ensembl GRCh37.p11
•  “Genomic State” ...
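A minimal sketch of the idea that every base gets exactly one state. It assumes a simple priority rule (exon over intron over un-annotated) to resolve overlapping features; that rule, the function names, and the toy coordinates are illustrative assumptions rather than the actual "Genomic State" construction.

```python
def genomic_states(length, exons, introns):
    """Assign one state per base; exons override introns (assumed priority)."""
    states = ["unannotated"] * length
    for start, end in introns:
        for pos in range(max(start, 0), min(end, length)):
            states[pos] = "intron"
    for start, end in exons:
        for pos in range(max(start, 0), min(end, length)):
            states[pos] = "exon"
    return states


def annotate_region(region, states):
    """List the distinct states a half-open region overlaps."""
    start, end = region
    return sorted(set(states[start:end]))


# Hypothetical 12-base locus with overlapping exon and intron annotation.
states = genomic_states(12, exons=[(2, 5), (8, 10)], introns=[(5, 8), (4, 6)])
overlap = annotate_region((3, 7), states)
novel = annotate_region((10, 12), states)
```

Because each base has a single state, annotating a region is a lookup over its bases, which is part of why thousands of regions can be annotated quickly.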
derfinder results
•  2,655 regions (47.7%) show expression of
1+ annotated intron (UCSC: 2,505; 45%)
•  577 regions (10.4%...
derfinder results
•  261 regions (4.7%) crossed a known
lincRNA
–  51 overlapping 535 “intragenic” regions
(9.6%; e.g. no ...
derfinder results
•  Verifying the 5,565 DERs:
–  95% of regions had mappability of 100bp
reads greater than 99%
–  Only 1...
derfinder results
•  Fetal samples had the highest expression
in the majority of the regions (84%; 18
[1.7-Inf] fold incre...
derfinder results

@simplystats
derfinder subgroup
•  Identified DERs within each 6-sample age
group based on mean expression
–  Represents set of express...
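A minimal sketch of the subgroup analysis: average coverage across the samples in a group, then report the percent of bases whose mean coverage exceeds each cutoff. The function names, the strict ">" comparison, and the toy coverage values are assumptions for illustration.

```python
def group_mean_coverage(samples):
    """Per-base mean coverage across the samples of one group."""
    n = len(samples)
    return [sum(base) / n for base in zip(*samples)]


def percent_expressed(mean_coverage, cutoff):
    """Percent of bases with mean coverage strictly above the cutoff."""
    expressed = sum(1 for c in mean_coverage if c > cutoff)
    return 100.0 * expressed / len(mean_coverage)


# Two hypothetical samples from one age group, four bases each.
group = [[0, 4, 10, 30], [2, 6, 14, 50]]
mean_cov = group_mean_coverage(group)

# Vary the mean coverage cutoff, as on the slide.
by_cutoff = {cutoff: percent_expressed(mean_cov, cutoff) for cutoff in (0, 5, 20)}
```

Raising the cutoff can only shrink the expressed set, so the percent expressed is non-increasing in the cutoff.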
Percent of Genome Expressed

% of genome expressed

@simplystats
scaled % of genome expressed

Fetal is highest at EVERY cutoff

Infant is lowest after ...
higher cutoffs create longer DERs

@simplystats
Percent of Genome Expressed

% of genome expressed (L ≥ 12)

@simplystats
Scaled % of genome expressed (L ≥ 12)

Fetal is still highest at EVERY cutoff

@simplystats
Higher cutoffs still create longer DERs

@simplystats
try that stuff, yo!

https://github.com/lcolladotor/derfinder
https://github.com/lcolladotor/derfinderReport
https://githu...
acknowledgements
Leek Group
Alyssa Frazee
Prasad Patil
Leo Collado Torres
Abhi Nellore
University of Maryland
Héctor Corra...
Transcript of "Base-Resolution rna-seq - Jeff Leek"

  1. 1. base-resolution rna-seq Jeff Leek Johns Hopkins Bloomberg School of Public Health @simplystats
  2. 2. normally You are free to: Copy, share, adapt and remix Photograph, film and broadcast Live tweet, blog, post video of Provided: Provided you attribute this work to its author and respect the rights and licenses associated with its components Adapted from: @simplystats
  3. 3. today 1.  types of statistical methods 2.  derfinder 3.  Unexpected expression @simplystats
  4. 4. data generation Genome @simplystats
  5. 5. data generation Transcripts Genome @simplystats
  6. 6. data generation Reads Transcripts Genome @simplystats
  7. 7. “simplest” thing – annotate-identify Genome @simplystats
  8. 8. exon model Count by Exon Genome Bullard et al. BMC Bioinformatics 2010 @simplystats
  9. 9. union model Union of all exons Genome Bullard et al. BMC Bioinformatics 2010 @simplystats
  10. 10. union-intersection model Union/Intersection Genome Bullard et al. BMC Bioinformatics 2010 @simplystats
  11. 11. sources of variation in annotate-identify 1.  annotation 2.  gene models 3.  fragment-level biases 4.  technical variation 5.  biological variability @simplystats
  12. 12. annotation variation Frazee et al. Biostatistics under review @simplystats
  13. 13. gc-variation Hansen et al. 2011 Biostatistics @simplystats
  14. 14. biological variation Stranger et al. (2007) vs. Montgomery et al. (2010) Choy et al. (2008) vs. Pickrell et al. (2010) Hansen et al. 2010 Nat. Biotech @simplystats
  15. 15. some data http://bowtie-bio.sourceforge.net/recount/ @simplystats
  16. 16. assemble-identify Reads Genome @simplystats
  17. 17. assemble-identify (align) Genome @simplystats
  18. 18. assemble-identify (assemble) Genome Fragments Transcripts Trapnell et al. 2010 Nat. Biotech @simplystats
  19. 19. assemble-identify (abundance) Genome Transcripts Trapnell et al. 2010 Nat. Biotech @simplystats
  20. 20. inherent ambiguity (boundaries) Genome Fragments Transcripts @simplystats
  21. 21. inherent ambiguity (assembly) Genome Alternative Assemblies @simplystats
  22. 22. assembly variation Frazee et al. in prep @simplystats
  23. 23. result of assembly variation Frazee et al. in prep @simplystats
  24. 24. result of assembly variation (bio reps) Frazee et al. in prep @simplystats
  25. 25. result of assembly variation Frazee et al. in prep @simplystats
  26. 26. result of assembly variation [Figure: density histograms of Cufflinks p-values; panels: Cufflinks v1, Cufflinks v2; x-axis: p-value; y-axis: density] Frazee et al. in prep @simplystats Frazee et al. 2012 in prep
  27. 27. methods annotate-identify 1.  align 2.  gene-model 3.  abundances 4.  analyze assemble-identify 1.  align 2.  assemble 3.  abundances 4.  analyze Pros: •  analogous to microarray, •  processed data easy to handle Cons: •  incorrect/variable annotation •  gene model choices have a big impact Pros: •  alternative transcription •  (potentially) less annotation dependent Cons: •  ambiguity/variation in assembly @simplystats
  28. 28. differentially expressed region finder 1.  Calculate base pair-resolution coverage 2.  Perform test at each base 3.  Identify regions of differential expression (segment) 4.  Annotate regions (optional) Pros: •  processed data still easier to handle •  less dependent on annotation •  no assembly variability Cons: •  still no transcript-level abundances (but…) Frazee et al. 2012b in prep @simplystats
  29. 29. derfinder notes •  Ignores annotation •  Coverage data at base resolution, designed for “differential” expression analysis –  Lose paired end information –  Lose junction information –  Lose potential mapping quality information –  … •  Annotate the resulting differentially expressed regions (DERs) @simplystats
  30. 30. Solution [Figure: base-by-base coverage values along a region] @simplystats
  31. 31. result n samples → 3 billion nt @simplystats Frazee et al. Biostatistics in review
  32. 32. base-pair model (case/control) g() = Transform (Box-Cox, log(+32) etc.) Yi,j = coverage on sample i at base j lj = genomic location j α() = baseline coverage β() = change in coverage between groups γk() = adjustments for confounders Wik = value of kth confounder on ith sample @simplystats Frazee et al. Biostatistics in review
  33. 33. batch-variation Blue: 3 sds below the mean Orange: 3 sds above the mean Horizontal lines delimit process dates Human chromosome 16 Leek et al. 2010 Nat. Rev. Genet. @simplystats
  34. 34. finding the statistics for d.e. bases t ~ π0f0 + π1f1 + π2f2 + π3f3 @simplystats
  35. 35. empirical bayes @simplystats
  36. 36. estimating parameters t ~ π0f0 + π1f1 + π2f2 + π3f3 Assumed known – the distribution of zeros Alternatively – Gottardo and Raftery 2008 JCGS @simplystats
  37. 37. estimating parameters t ~ π0f0 + π1f1 + π2f2 + π3f3 Estimated null distribution from e.g. Efron 2002 @simplystats
  38. 38. estimating parameters t ~ π0f0 + π1f1 + π2f2 + π3f3 Estimated from 2-groups model, assumed symmetric @simplystats
  39. 39. hmm hidden states DE DE DE not DE not DE t1 t2 t3 t4 t5 emissions are statistics @simplystats Frazee et al. Biostatistics in review
  40. 40. statistic Observed   @simplystats Frazee et al. Biostatistics in review
  41. 41. monte-carlo p-value Observed Null Jaffe et al. Biostatistics 2011 Frazee et al. Biostatistics in review Langmead et al. in prep @simplystats
  42. 42. ma-plots @simplystats Frazee et al. Biostatistics in review
  43. 43. statistical significance [Figure: p-value histograms (x-axis: p values; y-axis: Frequency); panels: DER Finder - sex, DER Finder - males, Cufflinks - sex, Cufflinks - males, EdgeR - sex, EdgeR - males, DESeq - sex, DESeq - males] @simplystats Frazee et al. Biostatistics in review
  44. 44. percent “correct hits” by ranking @simplystats
  45. 45. caveat Genome Bullard et al. BMC Bioinformatics 2010 @simplystats
  46. 46. caveat Genome Bullard et al. BMC Bioinformatics 2010 @simplystats
  47. 47. annotation incorrect [Figure: female vs. male; tracks: exons, states; y-axes: t statistic, log2(count+32); x-axis: genomic position, chrY: 15016699 - 15017219] @simplystats Frazee et al. Biostatistics in review
  48. 48. annotation missing [Figure: female vs. male; tracks: exons, states; y-axes: t statistic, log2(count+32); x-axis: genomic position, chrY: 2715932-2716691] @simplystats Frazee et al. Biostatistics in review
  49. 49. missed by cufflinks @simplystats Frazee et al. Biostatistics in review
  50. 50. computational goals •  Aligned reads (say from TopHat) to DERs in < 24 hours, all within R statistical software –  Table of DERs and matrix of mean coverage per sample per region for post-hoc analysis –  Annotated using data from UCSC and Ensembl: counts of features and annotation lists –  Visualized DERs, including annotation to identify novel transcriptional activity –  Easy methods for counting exons from coverage objects (~2-4 hours from aligned reads for all samples) @simplystats
  51. 51. derfinder - fast 1.  Test for differential expression at each base, record statistic (linear modeling) 2.  Identify contiguous/adjacent bases that are differentially expressed above some cutoff (thresholding/“bumphunter”) 3.  Summarize each DER (area) 4.  Perform significance testing on region level (permutations, empirical p-values) @simplystats
  52. 52. time and memory needed: derSnyder 20 samples •  Load & filter data: 10 cores with mclapply 1hr 15min, 177 GB •  Make models: 20 min, 52 GB •  Analysis: 10 permutations, 4 cores each chr, total 59 mins –  chr1 41 min, 46 GB •  Merging: 30 min, 22 GB •  Report: 27 min, 17 GB •  Total wallclock time: 3 hr 46 min @simplystats
  53. 53. Counts: derSnyder 20 samples •  Load & filter data: 10 cores with mclapply 1hr 15min, 177 GB •  Create count table: 26 min, 24 GB •  Total wallclock time: 1 hr 41 min @simplystats
  54. 54. lieber brain samples •  DLPFC Paired-end RNAseq Data •  36 samples across 6 age ranges, n=6/group: Fetal (age < 0); Infant (0 - 1); Child (1 - 10); Teen (10 - 20); Adult (20 - 50); 50+ •  4 M and 2 F per group; mostly AA, but some Caucasians •  RINs are evenly distributed across age @simplystats
  55. 55. lieber brain samples @simplystats
  56. 56. test for base-level de @simplystats
  57. 57. thresholding on statistic F-statistic corresponding to p-value < 10^-8 (F5,30) @simplystats
  58. 58. derfinder results •  alt model: age group + median coverage •  null model: median coverage •  threshold: p-value < 1e-8 •  5,565 DERs with FWER ~ 0 (conservative) –  Median length: 148bp [IQR: 112-235] @simplystats
  59. 59. @simplystats
  60. 60. @simplystats
  61. 61. @simplystats
  62. 62. @simplystats
  63. 63. @simplystats
  64. 64. @simplystats
  65. 65. @simplystats
  66. 66. annotating •  Devised “light-weight” R annotation files for UCSC hg19 knownGene and Ensembl GRCh37.p11 •  “Genomic State” objects: each base pair in the genome gets assigned to exactly one “state”, annotations merged across overlapping features •  Two different configurations: –  “Full” (introns, exons, un-annotated/intragenic) –  “Coding” (introns, coding exons, UTRs, promoters, un-annotated/intragenic) •  Very fast, 1000s of regions in seconds @simplystats
  67. 67. derfinder results •  2,655 regions (47.7%) show expression of 1+ annotated intron (UCSC: 2,505; 45%) •  577 regions (10.4%) show expression of an “intragenic” region (UCSC: 800, 14%) Ensembl   UCSC   @simplystats
  68. 68. derfinder results •  261 regions (4.7%) crossed a known lincRNA –  51 overlapping 535 “intragenic” regions (9.6%; e.g. no exons) •  Only one region crossed known miRNA, but same region had annotated exon on other strand @simplystats
  69. 69. derfinder results •  Verifying the 5,565 DERs: –  95% of regions had mappability of 100bp reads greater than 99% –  Only 16 regions were in tracks excluded by Duke site of Encode (all “BSR/Beta” for satellite repeats) and 0 by Data Analysis Center of Encode –  Only 90 regions (1.5%) mapped to known pseudogenes @simplystats
  70. 70. derfinder results •  Fetal samples had the highest expression in the majority of the regions (84%; 18 [1.7-Inf] fold increase); second highest was 50+ group (7%; 1.4 [1-4.3] fold increase) @simplystats
  71. 71. derfinder results @simplystats
  72. 72. derfinder subgroup •  Identified DERs within each 6-sample age group based on mean expression –  Represents set of expressed sequences for each group at a given coverage threshold –  Varied mean coverage cutoff @simplystats
  73. 73. Percent  of  Genome  Expressed   % of genome expressed @simplystats
  74. 74. scaled % of genome expressed Fetal is highest at EVERY cutoff Infant is lowest after 114 reads Teen is lowest thru 114 reads @simplystats
  75. 75. higher cutoffs create longer DERs @simplystats
  76. 76. Percent  of  Genome  Expressed   % of genome expressed (L ≥ 12) @simplystats
  77. 77. Scaled % of genome expressed (L ≥ 12) Fetal is still highest at EVERY cutoff @simplystats
  78. 78. Higher cutoffs still create longer DERs @simplystats
  79. 79. try that stuff, yo! https://github.com/lcolladotor/derfinder https://github.com/lcolladotor/derfinderReport https://github.com/lcolladotor/derfinderExample @simplystats
  80. 80. acknowledgements Leek Group Alyssa Frazee Prasad Patil Leo Collado Torres Abhi Nellore University of Maryland Héctor Corrada Bravo Harvard Rafael Irizarry Lieber Institute Andrew Jaffe Danny Weinberger Thomas Hyde Hopkins Kasper Hansen Roger Peng Ben Langmead Sarven Sabunicyan Luigi Marchionni Donald Geman Funding Amazon Web Services Digital Science NIH CCNE Hopkins inHealth @simplystats