2014 toronto-torbug

Building khmer, a platform
for research in scalable
sequence analysis
C. Titus Brown
ctb@msu.edu

Hello!
Assistant Professor; Microbiology; Computer Science;
etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown

Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
CACTGGACCG
ACTGGACCGA
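
A minimal sketch of this decomposition in Python. khmer itself is not used here; the helper name `kmers` is made up for illustration, and k = 10 is read off the slide. A read of length L contains L - k + 1 overlapping k-mers, so this 18-base read has 9.

```python
def kmers(read, k):
    """Return every length-k substring of the read, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"
all_kmers = kmers(read, 10)

# Print each k-mer indented by its offset so it lines up under the read.
print(read)
for offset, kmer in enumerate(all_kmers):
    print(" " * offset + kmer)
```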

K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG

K-mers give you an
implicit alignment
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
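
A rough sketch of what "implicit alignment" means, using the first pair of reads shown above: every k-mer shared by two reads votes for a relative shift between them, and the winning shift is the alignment. The function names are hypothetical and this is exact matching only, so (as the slide says) mismatches and indels are not handled.

```python
from collections import Counter, defaultdict


def kmer_positions(read, k):
    """Index a read: k-mer -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(read) - k + 1):
        index[read[i:i + k]].append(i)
    return index


def implied_offsets(read_a, read_b, k):
    """For each shared k-mer, count the shift (position in b minus position in a)."""
    index_a = kmer_positions(read_a, k)
    offsets = Counter()
    for j in range(len(read_b) - k + 1):
        for i in index_a.get(read_b[j:j + k], ()):
            offsets[j - i] += 1
    return offsets


read_a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
best_offset, votes = implied_offsets(read_a, read_b, 10).most_common(1)[0]
print(best_offset, votes)
```

Here the shared 10-mers agree that read_a starts 6 bases into read_b, without any explicit alignment step.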

De Bruijn graphs –
assemble on overlaps
J.R. Miller et al. / Genomics (2010)
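
A toy illustration of assembling on overlaps, under stated assumptions: nodes are (k-1)-mers, each k-mer in a read adds one edge, and a contig is spelled out by following edges while the path is unambiguous. The function names are invented for this sketch, and k = 11 (not the slide's 10) is chosen so that no node repeats in this tiny example; with k = 10 the repeated 9-mer TGGACCGAT creates a branch. Real assemblers also handle reverse complements, errors and branching, none of which appear here.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow edges from start while there is exactly one way forward."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


read_a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
k = 11
graph = de_bruijn([read_a, read_b], k)
contig = walk(graph, read_b[:k - 1])
print(contig)
```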

The problem with k-mers
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in up to k novel k-mers!
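
A quick check of this claim on the read above, as a sketch with a made-up helper name. The error-free read is the one from the previous slide; the erroneous read differs by a single G->C substitution. An error in the middle of a read is covered by k different k-mers; an error within k - 1 bases of either end is covered by fewer.

```python
def kmer_set(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G->C substitution

k = 10
novel = kmer_set(error_read, k) - kmer_set(true_read, k)
print(len(novel), "k-mers exist only because of the error")
```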

Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Assembly graphs scale with data size, not
information.

Practical memory
measurements (soil)
Velvet measurements (Adina Howe)

Counting k-mers
efficiently (RAM)
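
One way to keep k-mer counting within a fixed memory budget is a Count-Min Sketch, the style of probabilistic counter discussed in the Zhang et al. paper cited later in this talk. The sketch below is an illustration only, not khmer's implementation: the class name, the SHA-256-based hashing and the table sizes are all assumptions. Memory is fixed by width x depth no matter how many distinct k-mers arrive; the price is that collisions can make a count too high, but never too low.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter: may overcount, never undercounts."""

    def __init__(self, width, depth):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One slot per table; the row number is mixed into the hash input
        # so that each table uses a different hash function.
        for row in range(len(self.tables)):
            digest = hashlib.sha256(f"{row}:{kmer}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The smallest of the counters is the one least inflated by collisions.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


sketch = CountMinSketch(width=1000, depth=4)
read = "CCGATTGCACTGGACCGA"
k = 10
for _ in range(3):  # pretend the same read was sequenced three times
    for i in range(len(read) - k + 1):
        sketch.add(read[i:i + k])

print(sketch.count("CCGATTGCAC"))
```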

This leads to good things.
Efficient online counting of k-mers
Trimming reads on abundance
Efficient De Bruijn graph representations
Read abundance normalization

Data structures &
algorithms papers
• “These are not the k-mers you are looking for…”,
Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data”, Brown
et al., arXiv 1203.4802, under revision.

Data analysis papers
• “Tackling soil diversity with the assembly of large,
complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes &
transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale
multi-tissue mRNAseq, Scott et al., in prep.

Lab approach – not
intentional, but working out.
Novel data
structures and
algorithms
Implement at
scale
Apply to real
biological
problems

This leads to good things.
Efficient online counting of k-mers
Trimming reads on abundance
Efficient De Bruijn graph representations
Read abundance normalization
(khmer software)
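
To make "trimming reads on abundance" concrete, here is a sketch of one simple rule: cut a read just before its first k-mer that has been seen fewer than some minimum number of times. This is an assumption-laden illustration, not khmer's code; the function names and the threshold of 2 are hypothetical, and an exact `Counter` stands in for a memory-efficient counting structure.

```python
from collections import Counter


def count_kmers(reads, k):
    """Exact k-mer counts over a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trim_on_abundance(read, counts, k, min_count):
    """Cut the read just before its first k-mer seen fewer than min_count times."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < min_count:
            return read[:i + k - 1]
    return read


true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G->C substitution
k = 10
counts = count_kmers([true_read] * 5 + [error_read], k)

trimmed = trim_on_abundance(error_read, counts, k, min_count=2)
print(trimmed)
```

The k-mers created by the error are each seen only once, so the erroneous read is cut back to the bases that precede the error, while well-supported reads pass through unchanged.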

Efficient online counting of k-mers
Trimming reads on abundance
Efficient De Bruijn graph representations
Read abundance normalization
Streaming algorithms for assembly, variant calling, and error correction
Cloud assembly protocols
Efficient graph labeling & exploration
Data set partitioning approaches
Assembly-free comparison of data sets
HMM-guided assembly
Efficient search for target genes
Current research
(khmer software)

How is this feasible?!
Representative half-arsed lab software development
Version that
worked once, for
some publication.
Grad student 1
research
Grad student 2
research
Incompatible and broken code

A not-insane way to do software development
Stable version
Grad student 1
research
Grad student 2
research
Stable, tested code
Run tests
Run tests
Run tests
Run tests
Run tests

A not-insane way to do software development
Stable version
Grad student 1
research
Grad student 2
research
Stable, tested code
Run tests
Run tests
Run tests
Run tests
Run tests
Run tests
Run tests

Testing & version control
– the not so secret sauce
• High test coverage - grown over time.
• Stupidity driven testing – we write tests for bugs after
we find them and before we fix them.
• Pull requests & continuous integration – does your
proposed merge break tests?
• Pull requests & code review – does new code meet
our minimal coding etc requirements?
o Note: spellchecking!!!

Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have now added
acceptance tests to make sure that khmer works
OK with other packages.
• These acceptance tests are based on integration
tests, which in turn come from an education &
documentation effort…

khmer-protocols:
• Provide standard “cheap”
assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days
from raw reads to assembly,
annotations, and differential
expression analysis. ~$150 per
data set (on Amazon rental
computers)
• Open, versioned, forkable,
citable….
Read cleaning
Diginorm
Assembly
Annotation
RSEM differential
expression

Literate testing
• Our shell-command tutorials for bioinformatics can
now be executed in an automated fashion –
commands are extracted automatically into shell
scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and
confidence moving forward!
Leigh Sheneman
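
The idea of extracting commands from a tutorial can be sketched in a few lines. This is not the literate-resting code; it assumes a hypothetical convention in which commands live in indented reStructuredText literal blocks introduced by a line ending in `::`, and the function names are invented for the example.

```python
def extract_commands(rst_text):
    """Collect indented lines from literal blocks introduced by a '::' line."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        indented = line.startswith((" ", "\t"))
        if in_block and line.strip() and not indented:
            in_block = False  # a dedented line closes the literal block
        if in_block and line.strip():
            commands.append(line.strip())
        elif line.rstrip().endswith("::"):
            in_block = True
    return commands


def to_script(commands):
    """Turn extracted commands into a shell script that stops on first failure."""
    return "\n".join(["#!/bin/bash", "set -e"] + commands) + "\n"


tutorial = "\n".join([
    "First, count k-mers::",
    "",
    "   echo counting",
    "   echo done",
    "",
    "Back to prose.",
    "",
    "Then assemble::",
    "",
    "   echo assembling",
])

commands = extract_commands(tutorial)
print(to_script(commands))
```

Running the generated script under continuous integration is what turns a tutorial into a test: if a documented command stops working, the build fails.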

Doing things right
=> #awesomesauce
Protocols in English
for running analyses in
the cloud
Literate reSTing =>
shell scripts
Tool
competitions
Benchmarking
Education
Acceptance
tests

Benchmarking protocols
Data subset; AWS m1.xlarge
~1 hour
(See PyCon 2014 talk; video and blog post.)

Benchmarking protocols
Complete data; AWS m1.xlarge
~40 hours
(See PyCon 2014 talk; video and blog post.)

Efficient online counting of k-mers
Trimming reads on abundance
Efficient De Bruijn graph representations
Read abundance normalization
Streaming algorithms for assembly, variant calling, and error correction
Cloud assembly protocols
Efficient graph labeling & exploration
Data set partitioning approaches
Assembly-free comparison of data sets
HMM-guided assembly
Efficient search for target genes
Current research

Genomic intervals shared
between data sets
Qingpeng Zhang
* Assembly free!

Error correction via graph
alignment
Jason Pell and Jordan Fish

Error correction on
simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
          TP                  FP          TN           FN
          (corrected)         (mistakes)  (OK)         (missed)
ideal     3,469,834  99.1%    8,186       460,655,449  31,731   0.9%
1-pass    2,827,839  80.8%    30,254      460,633,381  673,726  19.2%
1.2-pass  3,403,171  97.2%    8,764       460,654,871  98,394   2.8%
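
The percentages in this table can be recomputed from the raw counts: the TP percentage is TP / (TP + FN), the fraction of true errors that were corrected, and the FN percentage is its complement. The short script below is only a worked check of that arithmetic; the precision column it prints is an extra derived quantity that does not appear on the slide.

```python
results = {
    "ideal":    {"TP": 3_469_834, "FP": 8_186,  "TN": 460_655_449, "FN": 31_731},
    "1-pass":   {"TP": 2_827_839, "FP": 30_254, "TN": 460_633_381, "FN": 673_726},
    "1.2-pass": {"TP": 3_403_171, "FP": 8_764,  "TN": 460_654_871, "FN": 98_394},
}


def sensitivity(r):
    """Fraction of real errors that were corrected."""
    return r["TP"] / (r["TP"] + r["FN"])


def precision(r):
    """Fraction of the changes made that were actually needed."""
    return r["TP"] / (r["TP"] + r["FP"])


for name, r in results.items():
    print(f"{name:9s} sensitivity {100 * sensitivity(r):5.1f}%  "
          f"precision {100 * precision(r):6.2f}%")
```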

Single pass, reference free, tunable, streaming online
variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.

Novelty… to what power?
• “Novelty” requirements for “high impact
publishing”:
o Must do novel algorithm development
o …and apply to novel and interesting data sets.
o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying
to develop and maintain a core set of functionality
in research software: novelty cubed? :)

Reproducibility
Scientific progress relies on reproducibility of analysis.
(Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion
=> IPython Notebook (also in
github)
Qingpeng Zhang

Concluding thoughts
• API is destiny – without online counting, diginorm &
streaming approaches would not have been
possible.
• Tackle the hard problems – engineering
optimization would not have gotten us very far.
• Testing lets us scale development & process – which
means when something works, we can run with it.

Caveats
• Expense and effort – you can spend an infinite
amount of time on infrastructure & process!
o Advice: choose techniques that address actual pain points.
o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good
software practices for yourself, not others.
o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
o These are 90% true statements :>

Can we crowdsource
bioinformatics?
We already are! Bioinformatics is already a tremendously
open and collaborative endeavor. (Let’s take advantage
of it!)
“It’s as if somewhere, out there, is a collection of totally free
software that can do a far better job than ours can, with
open, published methods, great support networks and
fantastic tutorials. But that’s madness – who on Earth
would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

Prospective: sequencing
tumor cells
• Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of
sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate
data while retaining variant information.
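
The data-volume estimate above is plain multiplication; spelled out as a small script (variable names are illustrative only):

```python
cells = 1_000
genome_bp = 3_000_000_000  # ~3 Gbp per cell, as on the slide
coverage = 20              # 20x sequencing depth

per_cell_bp = genome_bp * coverage
total_bp = cells * per_cell_bp

print(per_cell_bp / 1e9, "Gbp per cell")
print(total_bp / 1e12, "Tbp in total")
```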

Where are we taking this?
• Streaming online algorithms only look at data
~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
