How to sequence a large eukaryotic genome - and how we sequenced the cod genome. A seminar I gave for the Computational Life Science (Univ. of Oslo) seminar series, September 28, 2011
1. How to sequence a large eukaryotic genome and how we sequenced the cod genome. Lex Nederbragt, Norwegian High-Throughput Sequencing Centre (NSC) and Centre for Ecological and Evolutionary Synthesis (CEES)
3. What is a genome assembly? A hierarchical data structure that maps the sequence data to a putative reconstruction of the target (Miller et al. 2010, Genomics 95 (6): 315-327)
17. Overlap-Layout-Consensus. Typical for Sanger-type reads; also used by Newbler from 454 Life Sciences. Steps: Overlap computation; Layout: graph simplification; Consensus: sequence
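The overlap computation step can be illustrated with a minimal sketch. It assumes error-free reads and exact suffix-prefix matches; real overlappers such as Newbler tolerate mismatches and use indexing rather than brute force. The function names and the toy reads are hypothetical.

```python
from itertools import permutations

def overlap_length(a, b, min_length=1):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_length=3):
    """All ordered read pairs (a, b) whose overlap is at least min_length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_length)
        if n > 0:
            graph[(a, b)] = n
    return graph

# Toy reads, made up for illustration
reads = ["GACCTA", "CCTACA", "TACAGG"]
print(overlap_graph(reads))
```

The layout step would then simplify this graph, and the consensus step would derive the sequence from the reads placed along the chosen path.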
19. de Bruijn graphs. Developed outside of DNA-related work. Best solution for very short reads (≤100 nt). Read: GACCTACA. K-mers (K=3): GAC ACC CCT CTA TAC ACA. de Bruijn graph: consecutive K-mers overlap by K-1 bases
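The K-mer decomposition on this slide can be reproduced with a short sketch. It follows the slide's picture, where K-mers are the nodes and consecutive K-mers share K-1 bases; the function names are hypothetical, and production assemblers use far more compact representations.

```python
from collections import defaultdict

def kmers(read, k):
    """Return all overlapping k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_edges(read, k):
    """Link each k-mer to the k-mer that follows it (they share k-1 bases)."""
    nodes = kmers(read, k)
    graph = defaultdict(list)
    for left, right in zip(nodes, nodes[1:]):
        graph[left].append(right)
    return dict(graph)

read = "GACCTACA"  # the read shown on the slide
print(kmers(read, 3))  # ['GAC', 'ACC', 'CCT', 'CTA', 'TAC', 'ACA']
print(de_bruijn_edges(read, 3))
```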
22. Sequence data. Sequencing errors add complexity to the graph: they create new k-mers. Correction of errors: k-mer frequency. Kelley et al. Genome Biology 2010 11:R116
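To see why an error creates new k-mers and how k-mer frequency exposes it, here is a toy sketch. It only counts k-mers and flags rare ones with a fixed threshold; it is not the method of Kelley et al., and the reads, the threshold, and the function names are assumptions made for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(counts, min_count):
    """k-mers seen fewer than min_count times: likely products of errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Three correct copies of a read plus one copy with a single substitution (A -> T)
reads = ["GACCTACA", "GACCTACA", "GACCTACA", "GACCTTCA"]
counts = count_kmers(reads, 3)
print(sorted(weak_kmers(counts, min_count=2)))  # ['CTT', 'TCA', 'TTC']
```

A single substitution here introduces three new 3-mers, each seen once, while the true k-mers are seen three or four times.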
23. How to sequence a genome: human (1990s); cod 1 (2009 - 2011); cod 2 (2011 - 2012)
24. Human genome. Public effort: BAC-by-BAC sequencing (hierarchical shotgun sequencing). Figure labels: Genome; BACs, 100-150 kb; Select BACs; shotgun sequencing. http://www.cbcb.umd.edu/research/assembly_primer.shtml
38. Assembly competitions. Assemblathon 1: simulated datasets. ALLPATHS-LG – Broad Institute MIT (US); SOAPdenovo – BGI (China); SGA – Sanger Institute (UK)
39. Assembly competitions. Assemblathon 2: real datasets. snake – Illumina only; cichlid fish – Illumina only; parrot – Illumina, 454 FLX+, PacBio. http://assemblathon.org/
40. How to sequence a genome In 2011 Cheap alternative: RAD-tag sequencing
41. How to sequence a genome. Foundation of Illumina data: 100x coverage, Paired End reads (2x100bp); several Mate Pair libraries: 2kb, 3kb, 8kb, 10kb, bigger? This is now very cheap! Fill gaps with long reads: 454 or PacBio
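As a rough guide to what 100x coverage means in reads, coverage is total sequenced bases divided by genome size. The sketch below turns that into a read-pair count for 2x100bp reads. The 1 Gb genome size is a made-up round number for illustration, not a figure from the talk, and the function name is hypothetical.

```python
def read_pairs_needed(genome_size_bp, coverage, read_length_bp):
    """Read pairs needed so that total sequenced bases = coverage * genome size."""
    bases_needed = genome_size_bp * coverage
    bases_per_pair = 2 * read_length_bp  # two reads per pair
    return -(-bases_needed // bases_per_pair)  # ceiling division on integers

# Hypothetical genome size, chosen only to keep the arithmetic easy to follow
HYPOTHETICAL_GENOME_SIZE_BP = 1_000_000_000

pairs = read_pairs_needed(HYPOTHETICAL_GENOME_SIZE_BP, coverage=100, read_length_bp=100)
print(pairs)  # 500000000
```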
42. How to sequence a genome Add lots of bioinformatics... http://cores.montana.edu/index.php?page=bioinformatics-core-facility
Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. An example is shown in Figure 8, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
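The greedy strategy can be sketched in a few lines: repeatedly merge the two sequences with the largest overlap until no overlaps remain. This is a toy version that assumes error-free reads, exact suffix-prefix overlaps, and a single strand; the function names and the example reads are hypothetical and much shorter than the reads in Figure 8.

```python
def overlap_length(a, b):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads):
    """Merge the best-overlapping pair at each step; stop when nothing overlaps."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining contigs share no overlap
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

print(greedy_assemble(["GACCTA", "CCTACA", "TACAGG"]))  # ['GACCTACAGG']
```

Because each step looks only at the current best overlap, two reads from different copies of a repeat can be merged just as readily as true neighbours, which is the source of the mis-assemblies described above.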
BAC-by-BAC approach. The long lines represent individual BACs. The minimal tiling path is represented by thick lines. Each BAC in the tiling path is then sequenced through the shotgun method.