How to sequence a large eukaryotic genome, and how we sequenced the cod genome. Lex Nederbragt, Norwegian High-Throughput Sequencing Centre (NSC) and Centre for Ecological and Evolutionary Synthesis (CEES)
What is a genome assembly? A hierarchical data structure that maps the sequence data to a putative reconstruction of the target. Miller et al. 2010, Genomics 95 (6): 315-327
Hierarchical structure
Sequence data: reads. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Reads! http://www.sciencephoto.com/media/210915/enlarge
Contigs: building contigs
Contigs: building contigs. Contig orientation? Contig order? Figure labels: Repeat copy 1, Repeat copy 2, Collapsed repeat consensus. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs: other read type; (much) longer fragments; mate pair reads. Figure labels: Repeat copy 1, Repeat copy 2
Scaffolds: ordered, oriented contigs. Figure labels: mate pairs, contigs, gap size estimate
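The gap size estimate on the scaffold slide can be illustrated with simple arithmetic. This is a minimal sketch, not how any particular scaffolder does it. It assumes a mate pair library with a known mean insert size, with one read of each pair mapped near the end of contig A and its mate near the start of contig B. The function names and the numbers are hypothetical.

```python
from statistics import mean

def estimate_gap(insert_size, dist_to_end_a, dist_from_start_b):
    """Gap estimate from one mate pair that links two contigs.

    insert_size: assumed mean fragment length of the library (bp)
    dist_to_end_a: bp from the outer end of read 1 to the end of contig A
    dist_from_start_b: bp from the start of contig B to the outer end of read 2
    """
    return insert_size - dist_to_end_a - dist_from_start_b

def scaffold_gap(insert_size, pairs):
    """Average the per-pair estimates over all pairs linking the two contigs."""
    return mean(estimate_gap(insert_size, a, b) for a, b in pairs)

# Hypothetical 3 kb library with three linking pairs
pairs = [(1200, 1500), (900, 1900), (2000, 900)]
print(scaffold_gap(3000, pairs))  # 200
```

A negative estimate would suggest that the two contigs actually overlap rather than being separated by a gap.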
Hierarchical structure
Algorithms: all are graph-based. Graph theory! Figure labels: Read 1, Read 2, Overlap
Algorithms: Hamiltonian path, a path that contains all the nodes. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Algorithms: overlap calculation (alignment); computationally intensive. Figure labels: Read 1, Read 2, Overlap
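To show why the overlap calculation is costly, here is a minimal sketch that compares every read against every other read. It assumes error-free reads and exact suffix-prefix matches only. Real assemblers use alignment that tolerates sequencing errors. The function names are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def all_overlaps(reads, min_len=3):
    """All-against-all comparison: the number of read pairs grows quadratically."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n  # edge i -> j in the overlap graph
    return edges

reads = ["ACGCGATTCAGG", "TTCAGGTTACCACG"]
print(all_overlaps(reads, min_len=4))  # {(0, 1): 6}
```

Each entry is an edge in the overlap graph: read 0 can be followed by read 1 with a 6-base overlap.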
Algorithms: path through the graph: contig. Figure labels: Read 1, Read 2, Read 3, Read 4, Overlap (three times)
Greedy extension: oldest. http://www.cbcb.umd.edu/research/assembly_primer.shtml
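Greedy extension can be sketched in a few lines: repeatedly merge the two sequences with the largest overlap until nothing overlaps any more. This is a toy illustration that assumes error-free reads from one strand and exact overlaps. It is not the code of any historical assembler, and the names are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair again and again (purely local decisions)."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: remaining sequences stay separate contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs

reads = ["ACGCGATTCA", "GATTCAGGTT", "CAGGTTACCACG"]
print(greedy_assemble(reads))  # ['ACGCGATTCAGGTTACCACG']
```

Because each merge only looks at the single best overlap, a repeat that produces a large spurious overlap would be joined just as readily, which is the weakness of the greedy approach.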
Overlap-Layout-Consensus: typical for Sanger-type reads; also used by newbler from 454 Life Sciences. Steps: overlap computation; layout (graph simplification); consensus (sequence)
Overlap-Layout-Consensus, overlap phase: K-mer seeds initiate overlap. Example sequence: ACGCGATTCAGGTTACCACG
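The idea of k-mer seeds can be sketched as an index lookup: only read pairs that share at least one k-mer are passed on to the expensive alignment step. This is a minimal sketch, assuming exact k-mer matches and a single strand. The function names and the choice of k = 5 are illustrative and not taken from newbler or any other assembler.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """All substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def candidate_pairs(reads, k):
    """Pairs of read indices that share at least one k-mer (the seed)."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            index[km].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

# Two reads cut from the example sequence on the slide, plus an unrelated read
reads = ["ACGCGATTCAGG", "TTCAGGTTACCACG", "GGGGGGGGGG"]
print(candidate_pairs(reads, k=5))  # {(0, 1)}
```

Only the pair (0, 1) would be aligned. The third read shares no 5-mer with the others and is never compared.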
de Bruijn graphs: developed outside of DNA-related work; best solution for very short reads (≤100 nt). Read: GACCTACA. K-mers (K=3): GAC, ACC, CCT, CTA, TAC, ACA. Consecutive K-mers overlap by K-1 bases and form the de Bruijn graph
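The k-mer example on the slide can be reproduced with a small sketch. It follows the slide's picture, where k-mers are the nodes and consecutive k-mers (overlapping by K-1 bases) are linked. It assumes error-free reads and a single strand, and the function names are illustrative.

```python
from collections import defaultdict

def kmers(seq, k):
    """All substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn(reads, k):
    """Link each k-mer to the k-mer that follows it in a read (k-1 bases overlap)."""
    graph = defaultdict(set)
    for read in reads:
        kms = kmers(read, k)
        for left, right in zip(kms, kms[1:]):
            graph[left].add(right)
    return graph

def walk(graph, start):
    """Follow unambiguous edges from `start` and spell out the sequence."""
    seq, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop at a cycle
            break
        seq += nxt[-1]
        seen.add(nxt)
        node = nxt
    return seq

print(kmers("GACCTACA", 3))  # ['GAC', 'ACC', 'CCT', 'CTA', 'TAC', 'ACA']
print(walk(de_bruijn(["GACCTA", "CCTACA"], 3), "GAC"))  # GACCTACA
```

Note that the two shorter reads are joined without any read-against-read alignment: the shared k-mers CCT and CTA connect them in the graph.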
Graphs. Schatz M C et al. Genome Res. 2010;20:1165-1173
Graphs: simplify the graph; add scaffolding information
Sequence data: sequencing errors add complexity to the graph and create new k-mers. Correction of errors: k-mer frequency. Kelley et al. Genome Biology 2010 11:R116
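How k-mer frequency exposes sequencing errors can be shown with a simple count. This is a simplified sketch and not the method of Kelley et al.: it assumes that k-mers seen fewer than a hypothetical `min_count` times are likely errors, and it ignores base qualities entirely.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def suspicious_kmers(reads, k, min_count=2):
    """K-mers seen fewer than `min_count` times (assumed threshold)."""
    counts = kmer_counts(reads, k)
    return {km for km, n in counts.items() if n < min_count}

# Three correct copies of a read and one copy with a single error (A -> T)
reads = ["GACCTACA"] * 3 + ["GACCTTCA"]
print(sorted(suspicious_kmers(reads, k=3)))  # ['CTT', 'TCA', 'TTC']
```

One wrong base creates up to k new, rare k-mers, which is why errors add so many extra nodes to the graph.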
How to sequence a genome: human, 1990's; cod 1, 2009 - 2011; cod 2, 2011 - 2012
Human genome, public effort: BAC-by-BAC sequencing (hierarchical shotgun sequencing). Figure labels: Genome, BACs, Select BACs (100-150 kb), shotgun sequencing. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Human genome, Celera: shotgun sequencing; entire genome shotgun; use of mate pairs
How to sequence a genome: preparations; BAC-by-BAC; add shotgun and mate pairs
The cod genome project: preparations. * From a different individual
Cod: strategy. ‘454 only’; NO subcloning; pure ‘shotgun’ approach; 454 specific paired end libraries. Supplementary: BAC ends using Sanger sequencing
Cod: sequencing
Cod: assembly. Input for assembly: 84 million reads; 28 billion bases (Gb); 34x coverage. Assembly program: Newbler from 454; Celera from Venter Inst. Computing nodes: 24 CPUs; 128 GB of memory
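The three input figures on this slide are linked by simple arithmetic: coverage is the total number of bases divided by the genome size. The sketch below only uses the numbers from the slide. The genome size it prints is what those numbers imply, not a figure stated in the talk, and the function names are illustrative.

```python
def mean_read_length(total_bases, n_reads):
    """Average read length in bases."""
    return total_bases / n_reads

def implied_genome_size(total_bases, coverage):
    """Genome size implied by coverage = total bases / genome size."""
    return total_bases / coverage

n_reads = 84e6       # 84 million reads
total_bases = 28e9   # 28 Gb
coverage = 34        # 34x

print(round(mean_read_length(total_bases, n_reads)))            # 333 (bases)
print(round(implied_genome_size(total_bases, coverage) / 1e6))  # 824 (Mb)
```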
Cod: assembly. 611 Mb in 6 467 scaffolds, but 35% gap bases; short contigs; incomplete genes
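A figure like "35% gap bases" can be computed directly from the scaffold sequences. This is a minimal sketch, assuming gaps are written as runs of N in the scaffolds, which is the usual convention. The function names and the toy scaffolds are illustrative.

```python
import re

def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap bases (N)."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return gaps / total

def contigs(scaffold):
    """Split a scaffold into its contigs at runs of N."""
    return [c for c in re.split(r"[Nn]+", scaffold) if c]

scaffolds = ["ACGTNNNNACGTAC", "GGNNNNNNCC"]
print(round(gap_fraction(scaffolds), 2))  # 0.42
print(contigs(scaffolds[0]))              # ['ACGT', 'ACGTAC']
```

Applied to the numbers on the slide, 35% of 611 Mb is roughly 214 Mb of gap bases.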
Cod: gaps. Heterozygosity (figure labels: Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4). Short Tandem Repeats (figure: long runs of ACACACACAC...)
Cod: annotation. Ensembl; 'repair' genes based on stickleback sequence; ~22 000 genes. http://pre.ensembl.org/Gadus_morhua/
Cod 2: 2011-2012. Close the gaps: increase contig size. Pseudochromosomes: genetic linkage map; scaffolds to 'chromosomes'; anchoring; ordering and orienting
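Anchoring, ordering and orienting scaffolds with a linkage map can be sketched as follows. This is a toy illustration that assumes every marker has a known position on a scaffold (bp) and on the genetic map (cM), and that all markers belong to one linkage group. The data and names are hypothetical, and real anchoring tools handle conflicting markers far more carefully.

```python
from collections import defaultdict

def anchor(markers):
    """markers: list of (scaffold, scaffold_pos_bp, map_pos_cM).

    Returns the scaffolds in map order, each with an orientation:
    '+' if cM increases along the scaffold, '-' if it decreases,
    '?' if the markers cannot tell (e.g. only one marker).
    """
    by_scaffold = defaultdict(list)
    for scaffold, bp, cm in markers:
        by_scaffold[scaffold].append((bp, cm))

    placed = []
    for scaffold, hits in by_scaffold.items():
        hits.sort()  # sort markers by position on the scaffold
        mean_cm = sum(cm for _, cm in hits) / len(hits)
        first_cm, last_cm = hits[0][1], hits[-1][1]
        if last_cm > first_cm:
            orientation = "+"
        elif last_cm < first_cm:
            orientation = "-"
        else:
            orientation = "?"
        placed.append((mean_cm, scaffold, orientation))

    placed.sort()  # order scaffolds by their mean map position
    return [(scaffold, orientation) for _, scaffold, orientation in placed]

markers = [
    ("s2", 100, 12.0), ("s2", 9000, 10.0),
    ("s1", 500, 1.0), ("s1", 7000, 3.5),
    ("s3", 200, 20.0),
]
print(anchor(markers))  # [('s1', '+'), ('s2', '-'), ('s3', '?')]
```

Scaffold s2 would be reverse-complemented before joining, and s3 can be placed but not oriented with a single marker.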
Cod 2: strategy. New data: Illumina reads; longer 454 reads (~700 bases); PacBio reads? Improved programs: newbler. New programs: assembly; gap closing
Many programs to choose from
Assembly competitions: Assemblathon 1, simulated datasets. ALLPATHS_LG – Broad Institute MIT (US); Soapdenovo – BGI (China); SGA – Sanger Institute (UK)
Assembly competitions: Assemblathon 2, real datasets. Snake: Illumina only. Cichlid fish: Illumina only. Parrot: Illumina, 454 FLX+, PacBio. http://assemblathon.org/
How to sequence a genome in 2011. Cheap alternative: RAD-tag sequencing
How to sequence a genome. Foundation of Illumina data: 100x coverage paired end reads (2x100 bp); several mate pair libraries (2 kb, 3 kb, 8 kb, 10 kb, bigger?); this is now very cheap! Fill gaps with long reads: 454 or PacBio
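What "100x coverage with 2x100 bp paired end reads" means in number of read pairs follows from the coverage formula. This is a back-of-the-envelope sketch. The genome size of 830 Mb is an assumed example value, not a figure from the slides, and the function name is hypothetical.

```python
import math

def read_pairs_needed(genome_size, coverage, read_length):
    """Read pairs needed: each pair contributes 2 * read_length bases."""
    return math.ceil(genome_size * coverage / (2 * read_length))

# Assumed genome size of 830 Mb, 100x coverage, 2x100 bp reads
print(read_pairs_needed(830_000_000, 100, 100))  # 415000000
```

So a genome of that size would need on the order of 400 million read pairs for the paired end foundation alone.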
How to sequence a genome: add lots of bioinformatics... http://cores.montana.edu/index.php?page=bioinformatics-core-facility
Thank you! lex.nederbragt@bio.uio.no, www.sequencing.uio.no

Editor's Notes

  • #17 Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. An example is shown in Figure 8, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
  • #25 BAC-by-BAC approach. The long lines represent individual BACs. The minimal tiling path is represented by thick lines. Each BAC in the tiling path is then sequenced through the shotgun method.