How to sequence a large eukaryotic genome

How to sequence a large eukaryotic genome - and how we sequenced the cod genome. A seminar I gave for the Computational Life Science (Univ. of Oslo) seminar series, September 28, 2011

Published in: Technology
  • Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. An example is shown in Figure 8, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can easily be confused by complex repeats, leading to mis-assemblies.
  • BAC-by-BAC approach.  The long lines represent individual BACs.  The minimal tiling path is represented by thick lines.  Each BAC in the tiling path is then sequenced through the shotgun method.
  • How to sequence a large eukaryotic genome

    1. How to sequence a large eukaryotic genome and how we sequenced the cod genome / Lex Nederbragt / Norwegian High-Throughput Sequencing Centre (NSC) / and / Centre for Ecological and Evolutionary Synthesis (CEES)
    2.
    3. What is a genome assembly? / A hierarchical data structure / that maps the sequence data / to a putative reconstruction of the target / Miller et al 2010, Genomics 95 (6): 315-327
    4. Hierarchical structure
    5. Sequence data / Reads / http://www.cbcb.umd.edu/research/assembly_primer.shtml
    6. Reads! / http://www.sciencephoto.com/media/210915/enlarge
    7. Contigs / Building contigs
    8. Contigs / Building contigs / Repeat copy 1 / Repeat copy 2 / Contig orientation? / Contig order? / Collapsed repeat consensus / http://www.cbcb.umd.edu/research/assembly_primer.shtml
    9. Mate pairs / Other read type / Repeat copy 1 / Repeat copy 2 / (much) longer fragments / mate pair reads
    10. Scaffolds / Ordered, oriented contigs / mate pairs / contigs / gap size estimate
    11. Hierarchical structure
    12. Algorithms / All are graph-based / Read 2 / Read 1 / Overlap / Graph-theory!
    13. Algorithms / Hamiltonian path / a path that visits every node exactly once / http://www.cbcb.umd.edu/research/assembly_primer.shtml
    14. Algorithms / Overlap calculation (alignment) / computationally intensive / Read 2 / Read 1 / Overlap
    15. Algorithms / Path through the graph / contig / Read 2 / Read 3 / Read 4 / Read 1 / Overlap / Overlap / Overlap
    16. Greedy extension / Oldest / http://www.cbcb.umd.edu/research/assembly_primer.shtml
    17. Overlap-Layout-Consensus / Typical for Sanger-type reads / also used by newbler from 454 Life Sciences / Steps / Overlap computation / Layout: graph simplification / Consensus: sequence
    18. Overlap-Layout-Consensus / Overlap phase: / K-mer seeds initiate overlap / ACGCGATTCAGGTTACCACG
    19. de Bruijn graphs / Developed outside of DNA-related work / Best solution for very short reads ≤100 nt / GACCTACA / GAC / ACC / CCT / CTA / TAC / ACA / Read / de Bruijn graph / K-mers (K=3) / K-1 bases overlap
    20. Graphs / Schatz M C et al. Genome Res. 2010;20:1165-1173
    21. Graphs / Simplify the graph / Add scaffolding information
    22. Sequence data / Sequencing errors / add complexity to graph / create new k-mers / Correction of errors / k-mer frequency / Kelley et al. Genome Biology 2010 11:R116
    23. How to sequence a genome / human 1990's / cod 1 2009 - 2011 / cod 2 2011 - 2012
    24. Human genome / Public effort / BAC-by-BAC sequencing / hierarchical shotgun sequencing / Genome / BACs / Select BACs / 100-150 kb / shotgun sequencing / http://www.cbcb.umd.edu/research/assembly_primer.shtml
    25. Human genome / Celera: shotgun sequencing / entire genome shotgun / use of mate pairs
    26. How to sequence a genome / Preparations / BAC-by-BAC / Add shotgun / and mate pairs
    27. The cod genome project / Preparations / * From a different individual
    28. Cod: strategy / ‘454 only’ / NO subcloning / Pure ‘shotgun’ approach / 454 specific paired end libraries / Supplementary / BAC ends using Sanger sequencing
    29. Cod: sequencing
    30. Cod: assembly / Input for assembly / 84 million reads / 28 billion bases (Gb) / 34x coverage / Assembly program / Newbler from 454 / Celera from Venter Inst. / Computing nodes / 24 cpus / 128 GB of memory
    31. Cod: assembly / 611 Mb in 6 467 scaffolds / but 35% gap bases / short contigs / incomplete genes
    32. Cod: gaps / Heterozygosity / Contig 1 / Polymorphic contig 2 / Polymorphic contig 3 / Contig 4 / Short Tandem Repeats / ACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA
    33. Cod: annotation / Ensembl / 'repair' genes based on stickleback sequence / ~22 000 genes / http://pre.ensembl.org/Gadus_morhua/
    34.
    35. Cod 2: 2011-2012 / Close the gaps / increase contig size / Pseudochromosomes / genetic linkage map / scaffolds to 'chromosomes' / anchoring / ordering and orienting
    36. Cod 2: strategy / New data / Illumina reads / longer 454 reads ~700 bases / PacBio reads? / Improved programs / newbler / New programs / assembly / gap closing
    37. Many programs to choose from
    38. Assembly competitions / Assemblathon 1 / simulated datasets / ALLPATHS_LG – Broad Institute MIT (US) / Soapdenovo – BGI (China) / SGA – Sanger Institute (UK)
    39. Assembly competitions / Assemblathon 2 / real datasets / snake – Illumina only / cichlid fish – Illumina only / parrot / Illumina / 454 FLX+ / PacBio / http://assemblathon.org/
    40. How to sequence a genome / In 2011 / Cheap alternative: RAD-tag sequencing
    41. How to sequence a genome / Foundation of Illumina data / 100x coverage Paired End reads (2x100bp) / several Mate Pair libraries / 2kb, 3kb, 8kb, 10kb, bigger? / this is now very cheap! / Fill gaps with long reads / 454 or PacBio
    42. How to sequence a genome / Add lots of bioinformatics... / http://cores.montana.edu/index.php?page=bioinformatics-core-facility
    43. Thank you! / lex.nederbragt@bio.uio.no / www.sequencing.uio.no
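
Slide 19 splits the read GACCTACA into K-mers with K=3 and links neighbours that share K-1 bases. A minimal sketch of that step might look like the following. The function names `kmers` and `de_bruijn_edges` are made up for illustration, and the sketch follows the slide's picture in treating K-mers as nodes; it is not how any particular assembler stores its graph.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Link consecutive k-mers of each read; linked k-mers share k-1 bases."""
    graph = defaultdict(set)
    for read in reads:
        parts = kmers(read, k)
        for left, right in zip(parts, parts[1:]):
            graph[left].add(right)
    return graph


read = "GACCTACA"  # the read shown on slide 19
nodes = kmers(read, 3)
graph = de_bruijn_edges([read], 3)

print(nodes)
for node in nodes[:-1]:
    print(node, "->", sorted(graph[node]))
```

Because identical K-mers from different reads collapse into one node, no pairwise read alignment is needed, which is one reason this representation suits very large numbers of short reads.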

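The greedy strategy from the slide notes (always join the pair of sequences with the largest overlap) can be sketched on toy data. This is an illustration under simplifying assumptions: exact suffix-prefix matches only, a single strand, no sequencing errors, and made-up reads cut from the 20-base sequence on slide 18. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more: keep separate contigs
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)
    return seqs


# Toy reads (assumed for illustration) covering ACGCGATTCAGGTTACCACG
toy_reads = ["ACGCGATTCA", "GATTCAGGTT", "CAGGTTACCA", "GTTACCACG"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

Note that the last read ends in ACG and the first read starts with ACG, a spurious 3-base overlap. Here the true overlaps are longer, so the greedy choice still works; with longer repeats the same local decision rule is what produces the mis-assemblies the notes warn about.
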
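The assembly figures on slides 30 and 31 can be cross-checked with simple arithmetic, using coverage = total bases / genome size. The derived values below are only what the slide numbers imply; the slides themselves do not state a mean read length, the genome size behind the 34x figure, or the amount of non-gap sequence.

```python
# Numbers taken from slides 30 and 31
n_reads = 84e6          # 84 million reads
total_bases = 28e9      # 28 billion bases
coverage = 34           # 34x coverage
scaffold_span = 611e6   # 611 Mb in scaffolds
gap_fraction = 0.35     # 35% gap bases

# Derived values (implied by the numbers above, not stated on the slides)
mean_read_length = total_bases / n_reads
implied_genome_size = total_bases / coverage
non_gap_sequence = scaffold_span * (1 - gap_fraction)

print(f"mean read length:    {mean_read_length:.0f} bases")
print(f"implied genome size: {implied_genome_size / 1e6:.0f} Mb")
print(f"non-gap sequence:    {non_gap_sequence / 1e6:.0f} Mb")
```

Under these assumptions the reads average roughly 333 bases, the coverage figure corresponds to a genome of roughly 824 Mb, and only about 397 Mb of the 611 Mb scaffold span is actual sequence, which is consistent with the short contigs and incomplete genes noted on slide 31.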