Learning to love de Bruijn graphs
Ben Woodcroft,
Australian Centre for Ecogenomics (ACE)
Winter School in Bioinformatics, 2015
A slide from Torsten Seemann
K-mers and assembly
• For next-generation sequencing, comparison
of each read with each other read is
impossible.
– E.g. 10 million reads -> 10^7 x 10^7 read-read
comparisons. Slowww..
• K-mers and de Bruijn graphs help make things
tractable
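
A minimal sketch of why k-mers help: each read is chopped into k-mers once, and overlaps fall out of shared (k-1)-mers rather than read-vs-read comparison. The function names and toy reads here are made up for illustration, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every substring of length k in a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads (hypothetical) that overlap each other.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The work grows with the total number of bases, not with the square of the number of reads.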
K-mers and assembly
Forks
K-mer too small
K-mer too large
My favourite k-mer size
My favourite k-mer size
With a 100bp read, this can never happen with a k-mer size of 51
Fewer tips, more bubbles
As read lengths get longer, assemblers must move
from handling dead ends in the graph to handling
bubbles.
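
A rough sketch of telling the two structures apart at a fork, assuming the graph is a dict mapping each node to a set of successor nodes. The names `branch_path`, `classify_fork` and `max_tip_len` are hypothetical, and real assemblers also use coverage and look in both directions.

```python
def branch_path(graph, start, max_steps):
    """Follow a branch while each node has exactly one successor."""
    path = [start]
    node = start
    for _ in range(max_steps):
        succs = graph.get(node, ())
        if len(succs) != 1:
            break
        node = next(iter(succs))
        path.append(node)
    return path

def classify_fork(graph, fork, max_steps=100, max_tip_len=2):
    """Label each branch leaving a fork as 'bubble', 'tip' or 'other'."""
    paths = {s: branch_path(graph, s, max_steps) for s in graph[fork]}
    labels = {}
    for start, path in paths.items():
        others = set()
        for other_start, other_path in paths.items():
            if other_start != start:
                others.update(other_path)
        if others & set(path):
            labels[start] = "bubble"   # branches reconverge
        elif not graph.get(path[-1]) and len(path) <= max_tip_len:
            labels[start] = "tip"      # short dead end
        else:
            labels[start] = "other"
    return labels

# Toy graph (hypothetical): a bubble at A, a short dead end at E.
toy = {
    "A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": {"E"},
    "E": {"F", "G"}, "F": {"H"}, "H": {"I"}, "I": {"J"}, "G": set(),
}
print(classify_fork(toy, "A"))
print(classify_fork(toy, "E"))
```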
Tips and bubbles
Metagenome assembly
Me: “I know, why don’t I just assemble all my
data together?”
Run assembly
Wait 4 days
Out of memory allocating 18.4 million terabytes
of RAM.
Solutions to RAM issues
• Quality trimming
• Hard trimming
• Throwing away a proportion of reads
randomly
• Sequencing something else
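
Two of these options are simple enough to sketch, assuming reads are plain strings held in memory. `hard_trim` and `subsample` are illustrative names only; with paired-end data both mates of a pair would need to be kept or dropped together.

```python
import random

def hard_trim(read, length):
    """Keep only the first `length` bases of a read."""
    return read[:length]

def subsample(reads, keep_fraction, seed=0):
    """Keep each read independently with probability `keep_fraction`."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [read for read in reads if rng.random() < keep_fraction]

# Placeholder read names standing in for real sequences.
reads = [f"read{i}" for i in range(1000)]
kept = subsample(reads, keep_fraction=0.5, seed=1)
print(len(kept), "of", len(reads), "reads kept")
print(hard_trim("ACGTACGT", 5))
```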
Lossy de Bruijn graphs
The number of k-mers observed is vanishingly small
relative to the total number of possible k-mers
The human genome: ~3Gbp = ~3×10^9 k-mers
Total possible 51-mers: 4^51 = ~10^30
0.00000000000000000002%
When making a list of k-mers, counting extra ones
probably has little effect on assembly.
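
The arithmetic can be checked directly. This sketch treats ~3×10^9 as an upper bound on the distinct k-mers in the genome, which is an assumption made for the estimate.

```python
genome_kmers = 3 * 10**9     # roughly one k-mer per genome position
possible_51mers = 4**51      # four bases at each of 51 positions

fraction = genome_kmers / possible_51mers
print(f"possible 51-mers: {possible_51mers:.2e}")
print(f"observed share:   {fraction * 100:.1e} %")
```

Whichever way the numbers are rounded, a k-mer picked at random is almost never one that occurs in the genome, which is why a few spurious k-mers are unlikely to connect to anything real.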
Bloom filters
A low memory k-mer “store”
Is my k-mer in these reads?
From a bloom filter, the answer is either “no” or
“probably”
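
A minimal Bloom filter sketch to show where "no" and "probably" come from. The class, its parameters and the use of SHA-256 as the hash are choices made for this illustration, not how any particular assembler does it.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; membership tests can give false positives only."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive several bit positions by hashing the k-mer with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # Any unset bit means a definite "no"; all set means "probably".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

bf = BloomFilter(num_bits=10_000, num_hashes=3)
seen = ["ACGT", "CGTA", "GTAC"]
for kmer in seen:
    bf.add(kmer)
print("ACGT" in bf)  # every k-mer that was added is always reported as present
```

Memory use is set by `num_bits`, not by the number or length of k-mers stored, at the cost of occasionally answering "probably" for a k-mer that was never added.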
A finishing approach to assembly
A central assumption of this method is
that the genome is “mostly” complete
Scaffolding without mate pair data
Gap filling vs. assembly
• Regular assembly ain’t easy
• Re-assembly is more straightforward because
you know where you are trying to get to
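
Having a destination turns the problem into a path search between two known nodes. The sketch below is a plain bounded breadth-first search over a toy graph of (k-1)-mers. It is an illustration of the idea, not finishm's actual algorithm, and the shortest path is not necessarily the biologically correct one.

```python
from collections import deque

def find_path(graph, start, goal, max_nodes=1000):
    """Breadth-first search for a shortest node path from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) >= max_nodes:
            continue  # give up on paths longer than the allowed gap
        for succ in sorted(graph.get(node, ())):
            if succ not in seen:
                seen.add(succ)
                queue.append(path + [succ])
    return None

def path_to_sequence(path):
    """Spell the sequence: first node, then the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])

# Toy graph of 3-mers (hypothetical), with one side branch.
toy = {"ACG": {"CGT", "CGA"}, "CGT": {"GTA"}, "GTA": {"TAC"}, "CGA": set()}
path = find_path(toy, "ACG", "TAC")
print(path_to_sequence(path))
```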
Gap filling can correct assembly errors
• Contigs often contain errors right at their
ends
• By starting the search a bit back (e.g. 200bp)
from the end of the contig, these errors
can be overcome
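
One way to picture "starting a bit back" is to pick the search anchor from inside the contig instead of from its last k bases. `anchor_kmer` and its parameters are hypothetical names for this sketch.

```python
def anchor_kmer(contig, k, setback=200, end="right"):
    """Return a k-mer lying `setback` bases in from the chosen contig end."""
    if len(contig) < setback + k:
        raise ValueError("contig too short for this setback and k")
    if end == "right":
        stop = len(contig) - setback
        return contig[stop - k:stop]
    if end == "left":
        return contig[setback:setback + k]
    raise ValueError("end must be 'left' or 'right'")

# Toy contig (hypothetical): the last 5 bases stand in for an error-prone end.
contig = "A" * 10 + "CCCC" + "G" * 5
print(anchor_kmer(contig, k=4, setback=5, end="right"))
```

Anything between the anchor and the contig end is then rebuilt from the reads rather than trusted as is.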
Gap-filling can account for strain
variation
github.com/wwood/finishm
Thanks!
• Slideshare.com/benjwoodcroft
• Github.com/wwood
• Ecogenomic.org
