3. K-mers and assembly
• For next-generation sequencing, comparing
every read with every other read is
impractical.
– E.g. 10 million reads -> 10^7 x 10^7 = 10^14 read-read
comparisons. Slowww...
• K-mers and de Bruijn graphs help make things
tractable
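A minimal sketch of the idea, not any particular assembler's implementation: reads are broken into k-mers, and each k-mer becomes an edge between its (k-1)-mer prefix and suffix, so overlaps are found by lookup instead of all-vs-all comparison. The function names kmers and de_bruijn and the toy reads are made up for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads (hypothetical); overlapping reads share k-mers, so they land
# on the same nodes without ever being compared to each other directly.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph = de_bruijn(reads, k=4)

# Cost of the naive alternative for 10 million reads.
n_reads = 10**7
all_vs_all = n_reads * n_reads
```

Building the graph takes one pass over the reads, so the work grows with the total number of bases rather than with the square of the number of reads.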
12. Metagenome assembly
Me: “I know, why don’t I just assemble all my
data together?”
Run assembly
Wait 4 days
Out of memory allocating 18.4 million terabytes
of RAM.
13. Solutions to RAM issues
• Quality trimming
• Hard trimming
• Throwing away a proportion of reads
randomly
• Sequencing something else
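The first three options above can be sketched as follows. This is an illustration under simple assumptions, not the behaviour of any specific trimming tool: hard_trim, quality_trim and subsample are hypothetical names, quality scores are assumed to be a list of integers per read, and quality trimming here simply cuts at the first base below a threshold.

```python
import random

def hard_trim(read, length):
    """Keep only the first `length` bases, regardless of quality."""
    return read[:length]

def quality_trim(read, quals, min_q):
    """Cut the read at the first base whose quality is below min_q.
    (Assumed rule for illustration; real trimmers use various schemes.)"""
    for i, q in enumerate(quals):
        if q < min_q:
            return read[:i]
    return read

def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [r for r in reads if rng.random() < fraction]

trimmed = hard_trim("ACGTACGT", 5)
q_trimmed = quality_trim("ACGTAC", [30, 30, 30, 10, 30, 30], min_q=20)
kept = subsample(["ACGT"] * 100, fraction=0.5)
```

All three reduce the number of distinct k-mers the assembler has to hold in memory, at the cost of discarding some real data.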
14. Lossy de Bruijn graphs
The number of k-mers observed is vanishingly small
relative to the total number of possible k-mers
The human genome: ~3 Gbp = ~3×10^9 k-mers
Total possible 51-mers: 4^51 = ~5×10^30
Fraction observed: ~6×10^-20 %
When making a list of k-mers, recording a few extra ones
that were never observed (false positives) probably has
little effect on assembly.
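One common way to store a k-mer list lossily is a Bloom filter; the slide does not name the data structure, so treat this as an assumed example rather than the method used here. KmerBloomFilter, its parameters and the toy read are hypothetical. The filter can report a k-mer as present when it was never added, but it never misses one that was added.

```python
import hashlib

class KmerBloomFilter:
    """Fixed-size, lossy k-mer set: false positives possible, false negatives not."""

    def __init__(self, n_bits, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions by salting one hash with an index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

# Toy example (hypothetical read and sizes).
read = "ACGTTGCAAGGCTTAC"
k = 5
bloom = KmerBloomFilter(n_bits=1024)
observed = [read[i:i + k] for i in range(len(read) - k + 1)]
for kmer in observed:
    bloom.add(kmer)

# Fraction of all possible 51-mers that a ~3 Gbp genome can contain.
fraction_observed = (3 * 10**9) / 4**51
```

Memory use is fixed by n_bits no matter how many k-mers are added; because real k-mers occupy such a tiny corner of k-mer space, a false positive rarely connects to anything real in the graph.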
19. Gap filling vs. assembly
• Regular assembly ain’t easy
• Gap filling (re-assembly) is more straightforward
because you have a known destination to reach
20. Gap filling can correct assembly errors
• Contigs often contain errors right at their
ends
• By starting the search a bit back (e.g. 200 bp)
from the end of the contig, these errors
can be overcome
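A toy sketch of that idea, assuming the gap is filled by a breadth-first walk through the k-mers seen in the reads. fill_gap, its parameters and the sequences are hypothetical, and the example uses back=2 on tiny contigs where a real run might use something like 200 bp.

```python
from collections import deque

def fill_gap(left, right, kmer_set, k, back=0, max_len=1000):
    """Walk k-mers from `left` towards `right`, starting `back` bases inside
    each contig. Returns the joined sequence, or None if no path is found."""
    seed_at = len(left) - back - k
    seed = left[seed_at:seed_at + k]   # start k-mer, set back from the end
    target = right[back:back + k]      # destination k-mer in the next contig
    queue = deque([seed])
    seen = {seed}
    while queue:
        path = queue.popleft()
        last = path[-k:]
        if last == target:
            # Bases past the seed and before the target come from the walk,
            # which replaces any errors at the contig ends.
            return left[:seed_at] + path + right[back + k:]
        if len(path) >= max_len:
            continue
        for base in "ACGT":
            nxt = last[1:] + base
            if nxt in kmer_set and nxt not in seen:
                seen.add(nxt)
                queue.append(path + base)
    return None

# Toy data: the true sequence, and two contigs with an error at each end.
genome = "ATGGCGTACCTTGAACGCTA"
k = 4
kmer_set = {genome[i:i + k] for i in range(len(genome) - k + 1)}
left = "ATGGCGA"    # last base wrong (true: ...GCGT)
right = "CAACGCTA"  # first base wrong (true: GAAC...)

stuck = fill_gap(left, right, kmer_set, k, back=0)
fixed = fill_gap(left, right, kmer_set, k, back=2)
```

With back=0 the walk starts on an erroneous k-mer and goes nowhere; with back=2 it starts and ends on correct sequence and the erroneous end bases are overwritten by the path.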