Assembling NGS data        Dr Torsten SeemannIMB Winter School - Brisbane – Tue 3 July @ 09:45am
Ideal worldI would not need to give this talk!                                      AGTCTAGGATTCGCTA                      ...
Real world• Can’t sequence full-length native DNA  – no instrument exists (yet)• But we can sequence short fragments  – 10...
De novo assembly• De novo assembly is the process of  reconstructing the original DNA sequences  using only the fragment r...
Shakespearomics• Reads  ds, Romans, count  ns, countrymen, le  Friends, Rom  send me your ears;  crymen, lend me• Overlaps...
The awful truth“Genome assembly is impossible.”                                       He wears                            ...
Approaches•   greedy assembly•   overlap :: layout :: consensus•   de Bruijn graphs•   string graphs•   seed and extend… a...
Assembly recipe• Find all overlaps between reads  – hmm, sounds like a lot of work…• Build a graph  – a picture of read co...
Find read overlaps• If we have N reads of length L  – we have to do ½N(N-1) ~ O(N²) comparisons  – each comparison is an ~...
N=6 means 15 overlap “scores”Read#   1    2    3    4    5    6 1      -    -    -    -    -    - 2      80   -    -    - ...
Graph construction    Thicker lines mean   strongerevidence for    overlapNode/Vertex               Edge/Arc
A more realistic graph
What ruins the graph?• Read errors  – introduce false edges and nodes• Non-haploid organisms  – heterozygosity causes lots...
Graph simplification• Squash small bubbles  – collapse small errors (or minor heterozygosity)• Remove spurs  – short “dead...
Graph traversal• For each unconnected graph   – at least one per replicon in original sample• Find a path which visits eac...
Graph traversal
What happens with repeats?The repeated element is collapsed into a single contig
Mis-assembled repeats       collapsed tandem                                                                              ...
The law of repeats• It is impossible to resolve repeats of  length S unless you have reads  longer than S.• It is impossib...
Types of reads• Example fragment  – atcgtatgatcttgagattctctcttcccttatagctgctata• “Single-end” read  – atcgtatgatcttgagattc...
Scaffolding• Paired-end reads  – known sequences at either end  – roughly known distance between ends  – unknown sequence ...
Contigs to Scaffolds                       Paired-end read             ContigsScaffold   Gap          Gap
What can we assemble?• Genomes  – A single organism eg. its chromosomal DNA• Meta-genomes  – gDNA from mixtures of organis...
Genomes• Expect uniformity  – Each part of genome represented by roughly    equal number of reads• Average depth of covera...
Meta-genomes• Expect proportionality & uniformity  – Each genome represented by proportion of    reads similar to their pr...
Meta-genome issues• Closely related species  – will have very similar reads  – lots of shared nodes in the graph• Conserve...
Transcriptomes• RNA-Seq  – first convert it into DNA (cDNA)  – represents a snapshot of RNA activity• Expect proportionali...
Transcriptome issues• Huge dynamic range  – some gets lots of reads, some get none• Splice variation  – very similar, subt...
Meta-transcriptomes• RNA-Seq  – on multiple transcriptomes at once• Expect proportional proportionality  – proportion of t...
Assessing assemblies• Genome assembly  – Total length similar to genome size  – Fewer, larger contigs  – Correctness of co...
The “N50”• “The length of that contig from which 50%  of the bases are in it and shorter contigs”• Imagine we got 7 contig...
N50 concerns• Optimizing for N50  – encourages mis-assemblies!• An aggressive assembler may over-join:  – 1,1,3,5,8,12,20 ...
Assembly tools• Genome  – Velvet, Abyss, Mira, Newbler, SGA, AllPaths,    Ray, Euler, SOAPdenovo, Edena, Arachne• Meta-gen...
Example• Culture your bacterium• Extract your genomic DNA• Send it to AGRF for Illumina sequencing  – 100bp paired end• Ge...
Velvet: hash readsvelvethDir31-fmtAuto                      New options-separateMRSA_R1.fastq.gzMRSA_R2.fastq.gz          ...
Velvet: assemblyvelvetg                   “Signal” levelDir-exp_cov auto-cov_cutoff auto                     “Noise” level
Velvet: examine resultsless Dir/contigs.fa>NODE_1_length_43211_cov_27.36569AGTCGATGCTTAGAGAGTATGACCTTCTATACAAAAATCTTATATTA...
VelvetVelvet: GUI         Assembler                    Graphical                    User                    Environment   ...
Contact• Email  – torsten.seemann@monash.edu• Web  – http://vicbioinformatics.com/  – http://vlsci.org.au/• Blog  – http:/...
Assembling ngs data   torsten seemann - imb - 3 jul 2012
Assembling ngs data   torsten seemann - imb - 3 jul 2012
Assembling ngs data   torsten seemann - imb - 3 jul 2012
Upcoming SlideShare
Loading in …5
×

Assembling ngs data torsten seemann - imb - 3 jul 2012

857 views

Published on

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
857
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
33
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Assembling ngs data torsten seemann - imb - 3 jul 2012

  1. 1. Assembling NGS data Dr Torsten SeemannIMB Winter School - Brisbane – Tue 3 July @ 09:45am
  2. 2. Ideal worldI would not need to give this talk! AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG AGATTCGTATAAGTCT AGGATTCGCTATAGAT TCAGGCTCTGATATAT TTCGCGGGATTAGCTAHuman DNA Non-existent 46 complete USB3 device haplotype chromosome sequences
  3. 3. Real world• Can’t sequence full-length native DNA – no instrument exists (yet)• But we can sequence short fragments – 100 at a time (Sanger) – 100,000 at a time (Roche 454) – 1,000,000 at a time (Ion Torrent) – 100,000,000 at a time (HiSeq 2000)
  4. 4. De novo assembly• De novo assembly is the process of reconstructing the original DNA sequences using only the fragment read sequences• Instinctively – like a jigsaw puzzle – involves finding overlaps between reads – sequencing errors will confuse matters
  5. 5. Shakespearomics• Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me• Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears;• Majority rule Friends, Romans, countrymen, lend me your ears;
  6. 6. The awful truth“Genome assembly is impossible.” He wears glasses so he must be smart A/Prof. Mihai Pop World leader in de novo assembly research.
  7. 7. Approaches• greedy assembly• overlap :: layout :: consensus• de Bruijn graphs• string graphs• seed and extend… all essentially doing the same thing, but taking different short cuts.
  8. 8. Assembly recipe• Find all overlaps between reads – hmm, sounds like a lot of work…• Build a graph – a picture of read connections• Simplify the graph – sequencing errors will mess it up a lot• Traverse the graph – trace a sensible path to produce a consensus
  9. 9. Find read overlaps• If we have N reads of length L – we have to do ½N(N-1) ~ O(N²) comparisons – each comparison is an ~ O(L²) alignment – use special tricks/heuristics to reduce these!• What counts as “overlapping” ? – minimum overlap length eg. 20bp – minimum %identity across overlap eg. 95% – choice depends on L and expected error rate
  10. 10. N=6 means 15 overlap “scores”Read# 1 2 3 4 5 6 1 - - - - - - 2 80 - - - - - 3 95 85 - - - - 4 0 30 20 - - - 5 0 0 25 70 - - 6 0 35 25 60 50 -
  11. 11. Graph construction Thicker lines mean strongerevidence for overlapNode/Vertex Edge/Arc
  12. 12. A more realistic graph
  13. 13. What ruins the graph?• Read errors – introduce false edges and nodes• Non-haploid organisms – heterozygosity causes lots of detours• Repeats – if longer than read length – causes nodes to be shared, locality confusion
  14. 14. Graph simplification• Squash small bubbles – collapse small errors (or minor heterozygosity)• Remove spurs – short “dead end” hairs on the graph• Join unambiguously connected nodes – reliable stretches of unique DNA• Remove transitive edges – Collapse paths saying the same thing differently
  15. 15. Graph traversal• For each unconnected graph – at least one per replicon in original sample• Find a path which visits each node once – the Hamiltonian path (or cycle) – provably NP-hard (this is bad) – unlikely to be single path due to repeat nodes – solution will be a set of paths which terminate at decision points• Form a consensus sequence from path – use all the overlap alignments – each of these is a CONTIG
  16. 16. Graph traversal
  17. 17. What happens with repeats?The repeated element is collapsed into a single contig
  18. 18. Mis-assembled repeats collapsed tandem excision I II III a b c a b c d I c b III a d a c b II b c rearrangement I II III IV a b c d e f I a III II IV a d e b c f
  19. 19. The law of repeats• It is impossible to resolve repeats of length S unless you have reads longer than S.• It is impossible to resolve repeats of length S unless you have reads longer than S.
  20. 20. Types of reads• Example fragment – atcgtatgatcttgagattctctcttcccttatagctgctata• “Single-end” read – atcgtatgatcttgagattctctcttcccttatagctgctata – Sequence one end of the fragment• “Paired-end” read – atcgtatgatcttgagattctctcttcccttatagctgctata – Sequence both ends of same fragment – we can exploit this information!
  21. 21. Scaffolding• Paired-end reads – known sequences at either end – roughly known distance between ends – unknown sequence between ends• Most ends will occur in same contig – if our contigs are longer than pair distance• Some ends will be in different contigs – evidence that these contigs are linked!
  22. 22. Contigs to Scaffolds Paired-end read ContigsScaffold Gap Gap
  23. 23. What can we assemble?• Genomes – A single organism eg. its chromosomal DNA• Meta-genomes – gDNA from mixtures of organisms• Transcriptomes – A single organism’s RNA inc. mRNA, ncRNA• Meta-transcriptomes – RNA from a mixture of organisms
  24. 24. Genomes• Expect uniformity – Each part of genome represented by roughly equal number of reads• Average depth of coverage – Genome: 4 Mbp – Yield: 4 million x 50 bp reads = 200 Mbp – Coverage: 200 ÷ 4 = 50x (reads per bp)
  25. 25. Meta-genomes• Expect proportionality & uniformity – Each genome represented by proportion of reads similar to their proportion in mixture• Example – Mix of 3 species: ¼ Staph, ¼ Clost, ½ Ecoli – Say we get 4M reads – Then we expect about: 1M from Staph, 1M from Clost, 2M from Ecoli
  26. 26. Meta-genome issues• Closely related species – will have very similar reads – lots of shared nodes in the graph• Conserved sequence – bits of DNA common to lots of organisms – “hub” nodes in the graph• Untangling is difficult – need longer reads
  27. 27. Transcriptomes• RNA-Seq – first convert it into DNA (cDNA) – represents a snapshot of RNA activity• Expect proportionality – the expression level of a gene is proportional to the number of reads from that gene’s cDNA
  28. 28. Transcriptome issues• Huge dynamic range – some gets lots of reads, some get none• Splice variation – very similar, subtly different transcripts – lots of shared nodes in graph
  29. 29. Meta-transcriptomes• RNA-Seq – on multiple transcriptomes at once• Expect proportional proportionality – proportion of that organism in mixture – proportions due to expression levels• Meta x transcriptome issues combined!
  30. 30. Assessing assemblies• Genome assembly – Total length similar to genome size – Fewer, larger contigs – Correctness of contigs• Metrics – Maximum contig length – N50 (next slide)
  31. 31. The “N50”• “The length of that contig from which 50% of the bases are in it and shorter contigs”• Imagine we got 7 contigs with lengths: – 1,1,3,5,8,12,20• Total – 1+1+3+5+8+12+20 = 50• N50 is the “halfway sum” = 25 – 1+1+3+5+8+12 = 30 (≥ 25) so N50 is 12
  32. 32. N50 concerns• Optimizing for N50 – encourages mis-assemblies!• An aggressive assembler may over-join: – 1,1,3,5,8,12,20 (previous) – 1,1,3,5,20,20 (now) – 1+1+3+5+20+20 = 50 (unchanged)• N50 is the “halfway sum” (still 25) – 1+1+3+5+20= 30 (≥ 25) so N50 is 20
  33. 33. Assembly tools• Genome – Velvet, Abyss, Mira, Newbler, SGA, AllPaths, Ray, Euler, SOAPdenovo, Edena, Arachne• Meta-genome – MetaVelvet, SGA, custom scripts + above• Transcriptome – Trans-Abyss, Oases, Trinity• Meta-Transcriptome – custom scripts + above
  34. 34. Example• Culture your bacterium• Extract your genomic DNA• Send it to AGRF for Illumina sequencing – 100bp paired end• Get back two files: – MRSA_R1.fastq.gz – MRSA_R2.fastq.gz• Now what?
  35. 35. Velvet: hash readsvelvethDir31-fmtAuto New options-separateMRSA_R1.fastq.gzMRSA_R2.fastq.gz No interleaving required
  36. 36. Velvet: assemblyvelvetg “Signal” levelDir-exp_cov auto-cov_cutoff auto “Noise” level
  37. 37. Velvet: examine resultsless Dir/contigs.fa>NODE_1_length_43211_cov_27.36569AGTCGATGCTTAGAGAGTATGACCTTCTATACAAAAATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGATCTGATCTGATATGATCTTAGAGTATCGGCTATTGCTAGTCTCGCGTATAATAAATAATATATTTTTCTAATGATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGATCTGATCTGATATGATCTTAGAGTATCGGCTATTGCTAGTCTCGCGTATAATAAATAATATATTTAGTAGTCT …
  38. 38. VelvetVelvet: GUI Assembler Graphical User Environment Where to save Add your reads Click run
  39. 39. Contact• Email – torsten.seemann@monash.edu• Web – http://vicbioinformatics.com/ – http://vlsci.org.au/• Blog – http://TheGenomeFactory.blogspot.com

×