Validating and improving the
D. melanogaster reference genome sequence
using PacBio de novo assemblies
Casey M. Bergman
  1. Validating and improving the D. melanogaster reference genome sequence using PacBio de novo assemblies. Casey M. Bergman (@bergmanlab, @caseybergman). University of Liverpool Centre for Genomic Research, PacBio Symposium, 4 April 2014.
  2. Credits
     • Danny Miller (Stowers Institute)
     • Jane Landolin, Kristi Kim, Jason Chin & Edwin Hauw (Pacific Biosciences)
     • Sue Celniker & Roger Hoskins (Berkeley Drosophila Genome Project)
     • Sergey Koren & Adam Phillippy (National Biodefense Analysis and Countermeasures Center)
     • Raquel Linheiro (University of Manchester)
  3. “The” Drosophila genome circa 1910. Bridges (1916) PMID: 17245850
  4. “The” Drosophila genome circa 1925. Morgan et al. (1925) The Genetics of Drosophila
  5. “The” Drosophila genome circa 1940. Painter (1933) PMID: 17801695
  6. “The” Drosophila genome circa 1990. e.g. Bender et al. (1983) PMID: 6410077
     Quoted text: "The strategy we have used is called chromosomal walking and jumping; it is shown diagrammatically in Figure 1. The chromosomal origin of any non-repeated segment of D. melanogaster DNA (Dm segment) can be determined by in situ hybridization of that DNA to polytene chromosomes. When the sites of hybridization are visualized by tritium autoradiography, the position is usually confined to one or a few bands, which is similar to the precision of the cytological localizations of rearrangement breakpoints or the localizations of well-mapped genes. If a DNA sequence is found within a few bands of a gene of interest, that sequence can be used as the starting point for a chromosomal walk to the gene. A "step" in the walking procedure involves screening a recombinant DNA library of random large Dm segments to collect those that overlap the starting point."
     [Figure labels: START HERE; LEFT FUSION FRAGMENT; RIGHT FUSION FRAGMENT; INVERSION BREAK (two)]
     Fig. 1. The strategy for walking and jumping. The upper chromosome represents a portion of the right arm of the third chromosome with normal cytology (drawn from the map of Bridges, 1941), and the lower chromosome has an inversion of the region from 87E to 89E. A few steps of a chromosomal walk are shown diagrammatically below the 87E region (not to scale with the chromosome). When the walk reached the site of the inversion breakpoint, the DNA from that position could be used to identify the two fusion fragments isolated from the inversion chromosome. The foreign DNA in the fusion fragments (tandem circles) was homologous to normal chromosomal DNA at the right or distal inversion breakpoint, and thus it served as the origin of a chromosomal walk in 89E.
  7. “The” Drosophila genome circa 2000. Adams et al. (2000) PMID: 10731132
  8. Accuracy of whole genome shotgun (WGS) assembly vs. BAC-based physical map. Myers et al. (2000) PMID: 10731133
  9. Accuracy of WGS vs. BAC-based sequencing. Figure legend: peaks - discrepancies; green - gaps; purple - TEs. Myers et al. (2000) PMID: 10731133
  10. “The” Drosophila genome since 2000
     • ~ 120 Mb of euchromatin
     • ~ 60-100 Mb heterochromatin
  11. Euchromatic genome assemblies

     | Release | Date | Total size of scaffolds (nt) | Total size of contigs (nt) | Contigs | Contig N50 (nt) |
     |---|---|---|---|---|---|
     | 1 | Mar 2000 | 116,117,226 | 114,201,085 | 1427 | 220,490 |
     | 2 | Oct 2000 | 116,109,070 | 114,448,849 | 1103 | 318,193 |
     | 3 | Dec 2002 | 116,781,562 | 116,739,493 | 50 | 14,289,516 |
     | 4 | Apr 2004 | 118,357,599 | 118,348,386 | 28 | 18,203,742 |
     | 5 | Mar 2006 | 120,381,546 | 120,290,946 | 14 | 21,485,538 |

     Several gaps persist in euchromatic arms
  12. “The” Drosophila genome since 2000
     • ~ 120 Mb of euchromatin
     • ~ 60-100 Mb heterochromatin
  13. Heterochromatic genome assemblies. Hoskins et al. (2007) PMID: 17569867. Figure annotation: ~350 Kb in Rel5
  14. Heterochromatic genome assemblies

     | Release | Scaffolds | Total size of scaffolds (nt) | Contigs | Total size of contigs (nt) |
     |---|---|---|---|---|
     | 1 | 0 | 0 | 0 | 0 |
     | 2 | 1 (U) | 7,513,406 | 1000 | 5,530,718 |
     | 3 | 2604 | 20,941,991 | 3810 | 17,150,417 |
     | 4 | 0 | 0 | 0 | 0 |
     | 5 | 8 (U + armHet + mt) | 19,350,335 | 3044 | 16,535,110 |

     Majority of heterochromatin unassembled
  15. Low coverage pilot experiment with Hawley Lab
     http://bergmanlab.smith.man.ac.uk/?p=1971
  16. High coverage experiment with PacBio & BDGP
     http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html
     http://www.ncbi.nlm.nih.gov/sra/?term=SRP040522
  17. High coverage PacBio dataset for D. melanogaster BDGP reference strain

     | Metric | Value |
     |---|---|
     | Library Size (Kb) | 15 |
     | Chemistry | P5-C3 |
     | # SMRT cells | 42 |
     | Run time (days) | 6 |
     | # bases (nt) | 15,208,567,933 |
     | # reads | 1,514,730 |
     | avg length (nt) | 10,040 |
     | N50 (nt) | 14,214 |
     | Max (nt) | 44,766 |

     http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html
     http://www.ncbi.nlm.nih.gov/sra/?term=SRP040522
  18. Reference-based long-read mapping with BLASR
     http://bergmanlab.smith.man.ac.uk/?p=2176
  19. >90x coverage based on reference mapping
     http://bergmanlab.smith.man.ac.uk/?p=2176
  20. PacBio-only assemblies of the D. melanogaster genome

     | Assembly | Read Set | Pre-assembly | Assembler | Quivered |
     |---|---|---|---|---|
     | CA25x | 25x longest | PBcR | CA 8.1 | n |
     | CA25x-Q | 25x longest | PBcR | CA 8.1 | y |
     | CA50x | 50x longest | PBcR | CA 8.1 | n |
     | FALCON-Q | 25x longest | FALCON | FALCON | y |
     | FALCON-PBcR | 70x | PBcR | FALCON | n |
     | FALCON-AWS | all | FALCON | FALCON | n |

     Koren & Phillippy (unpublished); Chin & Bergman (unpublished)
  21. PacBio-only assemblies of the D. melanogaster genome

     | Assembly | Contigs | Contig N50 (nt) | Max Contig (nt) |
     |---|---|---|---|
     | CA25x | 128 | 15,297,019 | 24,622,056 |
     | CA25x-Q | 128 | 15,305,620 | 24,648,237 |
     | CA50x | 131 | 4,105,199 | 24,577,947 |
     | FALCON-Q | 434 | 5,001,041 | 21,336,512 |
     | FALCON-PBcR | 1774 | 7,499,810 | 25,727,813 |
     | FALCON-AWS | 955 | 7,882,002 | 21,631,108 |
  22. Long-range contiguity of CA25x assembly. Koren & Phillippy (unpublished)
     http://cbcb.umd.edu/software/PBcR/dmel.html
     [Figure panels: X, 3R, 3L, 2L, 4, 2R]
  23. Long-range contiguity of FALCON-Q assembly. Chin (unpublished)
     https://github.com/PacificBiosciences/falcon
     [Figure panels: X, 3R, 2R, 3L, 2L]
  24. Base level accuracy of PacBio D. melanogaster assemblies vs Release 5
     [Figure: mismatches/100kb and indels/100kb for CA25x, CA25x-Q, CA50x, FALCON-Q, FALCON-PBcR and FALCON-AWS; axis scales 0-30 and 0-300; reference marks labelled Rel1 and Rel3]
  25. Towards a $1000 genome assembly using FALCON, StarCluster & AWS

     | Assembly | Pre-assembly (CPU hours) | Assembly (CPU hours) |
     |---|---|---|
     | CA25x | 621,000 | 8,000 |
     | FALCON-AWS | 1,500 | 48 |

     Slide labels: Expert, Novice
     https://github.com/PacificBiosciences/FALCON/blob/v0.1.1/examples/Dmel_asm.md
     https://github.com/cbergman/FALCON/blob/v0.1.1/examples/Dmel_asm.md
  26. Euchromatic gap closure with PacBio contigs. Celniker (unpublished)
     [Figure panels: Gap at 64C; Gap at 57B]
  27. Identification of Y-chromosome contigs in PacBio assemblies by female/male depth ratio. Linheiro & Bergman (unpublished)
     [Figure: "Ratio Profile" histograms of female/male depth ratio (x-axis, 0 to 3) against log10 frequency (y-axis, 0 to 8) for chr2L, chr2R, chr3L, chr3R, chr4, chrX and chrYHet; depth from bwa-mapped short-read DNA-seq]
  28. Identification of Y-chromosome contigs in PacBio assemblies by female/male depth ratio. Linheiro & Bergman (unpublished)
     [Figure: female/male depth ratio (y-axis, 0 to 4) per window (10 Kb step) along contig 0052_00|quiver|quiver (x-axis: location in chr, x10000, 0 to 60); legend entries: A ratio, X ratio, Y ratio, X log 100 count, Y log 100 count]
  29. Improvement of the Y-chromosome assembly & gene models. Celniker (unpublished)
  30. Take Home (I)
     • View of D. melanogaster genome has been changing for >100 years & is still not complete
     • Frontier of D. melanogaster genome assembly is in heterochromatic regions (model for repeat-rich plant genomes)
     • PacBio long reads can be used to generate long-range de novo assemblies that can close euchromatic gaps & generate large heterochromatic contigs
     • Bioinformatic challenges: better pre-assembly algorithms, better polishing algorithms, *.h5 data archiving
  31. Take Home (II)
     • Early, open release of genomic data by small labs can stimulate big returns & new collaborations
     • PacBio has right corporate philosophy of engaging/collaborating with the genomics community (open data, open source)
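
The assembly tables on slides 11 and 21 compare releases by contig N50. As a reminder of what that statistic measures, here is a minimal Python sketch of the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly size. The function name `contig_n50` and the toy lengths are illustrative assumptions, not taken from the talk.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in nt).

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half of the total.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (not real D. melanogaster data).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(contig_n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the talk pairs it with base-level accuracy against Release 5.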
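
The read-set metrics on slide 17 can be sanity-checked with simple arithmetic. The sketch below recomputes the mean read length from the reported base and read counts, and estimates a nominal coverage by dividing total bases by genome size. The genome-size bounds are an assumption built from the slide 10 figures (~120 Mb euchromatin plus ~60-100 Mb heterochromatin); this nominal figure is not the same calculation as the mapping-based coverage quoted on slide 19.

```python
# Values reported on slide 17.
TOTAL_BASES = 15_208_567_933
NUM_READS = 1_514_730

# Assumed genome-size bounds (nt), from the rough figures on slide 10:
# ~120 Mb euchromatin plus ~60-100 Mb heterochromatin.
GENOME_SIZE_LOW = 120_000_000 + 60_000_000
GENOME_SIZE_HIGH = 120_000_000 + 100_000_000


def mean_read_length(total_bases, num_reads):
    """Average read length in nt."""
    return total_bases / num_reads


def nominal_coverage(total_bases, genome_size):
    """Sequenced bases divided by genome size (ignores mappability)."""
    return total_bases / genome_size


print(f"mean read length: {mean_read_length(TOTAL_BASES, NUM_READS):.0f} nt")
print(f"nominal coverage at 180 Mb: {nominal_coverage(TOTAL_BASES, GENOME_SIZE_LOW):.1f}x")
print(f"nominal coverage at 220 Mb: {nominal_coverage(TOTAL_BASES, GENOME_SIZE_HIGH):.1f}x")
```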
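
Slides 27 and 28 identify Y-chromosome contigs by the ratio of female to male read depth. The slides do not give the classification rule, so the following is only a sketch of the general idea: with depths normalised per library, autosomal sequence is expected near a ratio of 1, X-linked sequence near 2 (two copies in females, one in males), and Y-linked sequence near 0. The function names, the normalisation arguments, and the cut-offs `y_max` and `x_min` are all hypothetical.

```python
import math


def depth_ratio(female_depth, male_depth, female_norm=1.0, male_norm=1.0):
    """Female/male depth ratio after per-library normalisation.

    female_norm and male_norm are assumed scaling factors, for example the
    median autosomal depth of each library.
    """
    female = female_depth / female_norm
    male = male_depth / male_norm
    if male == 0:
        return math.inf if female > 0 else math.nan
    return female / male


def classify_contig(ratio, y_max=0.25, x_min=1.5):
    """Label a contig from its depth ratio; cut-offs are illustrative only."""
    if math.isnan(ratio):
        return "no-coverage"
    if ratio <= y_max:
        return "Y-like"
    if ratio >= x_min:
        return "X-like"
    return "autosome-like"


# Hypothetical mean depths per contig: (female, male).
toy_depths = {"ctgA": (30.0, 30.0), "ctgX": (40.0, 20.0), "ctgY": (1.0, 15.0)}
for name, (female, male) in toy_depths.items():
    print(name, classify_contig(depth_ratio(female, male)))
```

The same ratio can be computed per window along a contig, as in the 10 Kb-step profile on slide 28, to flag contigs whose signal is mixed rather than uniformly Y-like.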
