Building khmer, a platform
for research in scalable
sequence analysis
C. Titus Brown
ctb@msu.edu
Hello!
Assistant Professor; Microbiology; Computer Science;
etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
CACTGGACCG
ACTGGACCGA
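A minimal sketch of the decomposition above, in Python. The function name `kmers` is illustrative only and is not khmer's API. The read is 18 bases long, so with k = 10 it yields 18 - 10 + 1 = 9 overlapping k-mers.

```python
def kmers(read, k):
    """Return every overlapping substring of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read = "CCGATTGCACTGGACCGA"
for kmer in kmers(read, 10):
    print(kmer)
```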
K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
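One way to see the implicit alignment, sketched in Python under the assumption that exact shared 10-mers are enough to place one read against the other. The helper names `kmer_positions` and `implied_offsets` are hypothetical. Every shared k-mer "votes" for an offset between the two sequences; mismatches and indels are ignored, as the slide notes.

```python
def kmer_positions(seq, k):
    """Map each k-mer to the first position where it occurs in seq."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], i)
    return positions

def implied_offsets(a, b, k):
    """Count, per offset (position in a minus position in b), the shared k-mers."""
    pos_a = kmer_positions(a, k)
    votes = {}
    for j in range(len(b) - k + 1):
        kmer = b[j:j + k]
        if kmer in pos_a:
            offset = pos_a[kmer] - j
            votes[offset] = votes.get(offset, 0) + 1
    return votes

a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
votes = implied_offsets(a, b, 10)
print(votes)
```

A single dominant offset means the two reads line up without any explicit alignment step.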
De Bruijn graphs –
assemble on overlaps
J.R. Miller et al. / Genomics (2010)
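A toy illustration of assembling on overlaps, assuming one common formulation in which nodes are (k-1)-mers and every observed k-mer is an edge. This is a sketch for intuition, not the representation khmer uses; `de_bruijn` and `walk` are hypothetical names.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build edges between (k-1)-mers, one per distinct k-mer in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow unambiguous edges from start and spell out the contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

# Two overlapping fragments of the example read from the k-mer slide.
reads = ["CCGATTGCACTGGA", "TTGCACTGGACCGA"]
graph = de_bruijn(reads, 10)
print(walk(graph, "CCGATTGCA"))
```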
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in up to k novel k-mers!
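The effect can be checked directly. The sketch below compares the erroneous read from this slide with its error-free version from the earlier slide (one G -> C substitution) and counts the 10-mers that exist only because of the error. An error closer than k bases to a read end would produce fewer than k novel k-mers.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 10
clean_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"

# k-mers present only in the read carrying the sequencing error.
novel = kmer_set(error_read, k) - kmer_set(clean_read, k)
print(len(novel), "novel k-mers from a single substitution")
```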
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Assembly graphs scale with data size, not
information.
Practical memory
measurements (soil)
Velvet measurements (Adina Howe)
Counting k-mers
efficiently (RAM)
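The slide does not spell out the data structure. As an assumption for illustration, the sketch below uses a Count-Min Sketch, a well-known fixed-memory approximate counter: counts can be inflated by hash collisions but are never too low. The class name, `width` and `depth` are hypothetical choices, not khmer's API.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: may overcount on collisions, never undercounts."""

    def __init__(self, width, depth):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One hash per table, made distinct by salting with the row number.
        for row, table in enumerate(self.tables):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield table, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for table, slot in self._slots(kmer):
            table[slot] += 1

    def count(self, kmer):
        # The smallest counter is the one least affected by collisions.
        return min(table[slot] for table, slot in self._slots(kmer))

sketch = CountMinSketch(width=10_007, depth=4)
read = "CCGATTGCACTGGACCGA"
for i in range(len(read) - 10 + 1):
    sketch.add(read[i:i + 10])
print(sketch.count("CCGATTGCAC"))
```

Memory use is fixed at width x depth counters no matter how many distinct k-mers are seen, and counting is online: each k-mer is processed once as it streams past.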
This leads to good things.
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Data structures &
algorithms papers
• “These are not the k-mers you are looking for…”,
Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data”, Brown
et al., arXiv 1203.4802, under revision.
Data analysis papers
• “Tackling soil diversity with the assembly of large,
complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes &
transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale
multi-tissue mRNAseq, Scott et al., in prep.
Lab approach – not
intentional, but working out.
Novel data
structures and
algorithms
Implement at
scale
Apply to real
biological
problems
This leads to good things.
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
(khmer software)
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Streaming
algorithms for
assembly,
variant calling,
and error
correction
Cloud assembly
protocols
Efficient graph
labeling &
exploration
Data set
partitioning
approaches
Assembly-free
comparison of
data sets
HMM-guided
assembly
Efficient search
for target genes
Current research
(khmer software)
How is this feasible?!
Representative half-arsed lab software development
Version that
worked once, for
some publication.
Grad student 1
research
Grad student 2
research
Incompatible and broken code
A not-insane way to do software development
Stable version
Grad student 1 research
Grad student 2 research
Stable, tested code
Run tests (label repeated throughout the diagram)
Testing & version control
– the not so secret sauce
• High test coverage - grown over time.
• Stupidity driven testing – we write tests for bugs after
we find them and before we fix them.
• Pull requests & continuous integration – does your
proposed merge break tests?
• Pull requests & code review – does new code meet
our minimal coding etc requirements?
o Note: spellchecking!!!
Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have now added
acceptance tests to make sure that khmer works
OK with other packages.
• These acceptance tests are based on integration
tests, which in turn come from an education &
documentation effort…
khmer-protocols
khmer-protocols:
• Provide standard “cheap”
assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days
from raw reads to assembly,
annotations, and differential
expression analysis. ~$150 per
data set (on Amazon rental
computers)
• Open, versioned, forkable,
citable….
Read cleaning
Diginorm
Assembly
Annotation
RSEM differential
expression
Literate testing
• Our shell-command tutorials for bioinformatics can
now be executed in an automated fashion –
commands are extracted automatically into shell
scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and
confidence moving forward!
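The extraction step can be sketched as follows. This is not the literate-resting implementation; it assumes, for illustration, that tutorial commands live in reStructuredText literal blocks (indented lines after a paragraph ending in `::`). The function name `extract_commands` and the sample tutorial are hypothetical.

```python
def extract_commands(rst_text):
    """Collect indented lines that follow a paragraph ending in '::'."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        indented = line.startswith((" ", "\t"))
        if in_block and line.strip() and not indented:
            in_block = False  # an unindented line closes the literal block
        if in_block and line.strip():
            commands.append(line.strip())
        elif line.rstrip().endswith("::"):
            in_block = True  # the indented lines that follow are commands
    return commands

tutorial = """\
Make a working directory::

   mkdir work
   cd work

Now list the files::

   ls
"""

# 'set -e' makes the generated script stop at the first failing command.
script = "\n".join(["set -e"] + extract_commands(tutorial))
print(script)
```

Running the generated script in a clean environment turns the tutorial itself into a test.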
Leigh Sheneman
Doing things right
=> #awesomesauce
Protocols in English
for running analyses in
the cloud
Literate reSTing =>
shell scripts
Tool
competitions
Benchmarking
Education
Acceptance
tests
Benchmarking protocols
Data subset; AWS m1.xlarge
~1 hour
(See PyCon 2014 talk; video and blog post.)
Benchmarking protocols
Complete data; AWS m1.xlarge
~40 hours
(See PyCon 2014 talk; video and blog post.)
Current research
Genomic intervals shared
between data sets
Qingpeng Zhang
* Assembly free!
Error correction via graph
alignment
Jason Pell and Jordan Fish
Error correction on
simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
            TP (corrected)       FP (mistakes)   TN (OK)        FN (missed)
ideal       3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%)
1-pass      2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%)
1.2-pass    3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%)
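The percentages in the table are the share of true errors that were corrected or missed, that is TP / (TP + FN) and FN / (TP + FN). A short check in Python, using only the counts from the slide:

```python
results = {
    "ideal":    {"TP": 3_469_834, "FP": 8_186,  "TN": 460_655_449, "FN": 31_731},
    "1-pass":   {"TP": 2_827_839, "FP": 30_254, "TN": 460_633_381, "FN": 673_726},
    "1.2-pass": {"TP": 3_403_171, "FP": 8_764,  "TN": 460_654_871, "FN": 98_394},
}

for name, r in results.items():
    errors = r["TP"] + r["FN"]  # total erroneous bases in the data
    corrected = r["TP"] / errors
    missed = r["FN"] / errors
    print(f"{name}: {corrected:.1%} corrected, {missed:.1%} missed")
```

Every row has the same number of true errors (3,501,565), so the rows are directly comparable.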
Single pass, reference free, tunable, streaming online
variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
Novelty… to what power?
• “Novelty” requirements for “high impact
publishing”:
o Must do novel algorithm development
o …and apply to novel and interesting data sets.
o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying
to develop and maintain a core set of functionality
in research software: novelty cubed? :)
Reproducibility
Scientific progress relies on reproducibility of analysis.
(Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion
=> IPython Notebook (also in
github)
Qingpeng Zhang
Concluding thoughts
• API is destiny – without online counting, diginorm &
streaming approaches would not have been
possible.
• Tackle the hard problems – engineering
optimization would not have gotten us very far.
• Testing lets us scale development & process – which
means when something works, we can run with it.
Caveats
• Expense and effort – you can spend an infinite
amount of time on infrastructure & process!
o Advice: choose techniques that address actual pain points.
o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good
software practices for yourself, not others.
o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
o These are 90% true statements :>
Can we crowdsource
bioinformatics?
We already are! Bioinformatics is already a tremendously
open and collaborative endeavor. (Let’s take advantage
of it!)
“It’s as if somewhere, out there, is a collection of totally free
software that can do a far better job than ours can, with
open, published methods, great support networks and
fantastic tutorials. But that’s madness – who on Earth
would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
Thanks!
Prospective: sequencing
tumor cells
• Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of
sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate
data while retaining variant information.
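The data volume quoted above follows from simple multiplication, shown here as a quick check (1 Tbp = 10^12 bp):

```python
cells = 1000
genome_bp = 3 * 10**9   # 3 Gbp per cell
coverage = 20           # 20x sequencing depth

total_bp = cells * genome_bp * coverage
print(total_bp / 10**12, "Tbp")
```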
Where are we taking this?
• Streaming online algorithms only look at data
~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
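To make "streaming, online" concrete, here is a sketch of digital normalization as described in the Brown et al. preprint cited earlier: a read is kept only if the median count of its k-mers, among the reads kept so far, is below a cutoff. As a simplifying assumption, an exact `Counter` stands in for khmer's memory-efficient counting; the function name and parameter values are illustrative.

```python
from collections import Counter
from statistics import median

def digital_normalization(reads, k, cutoff):
    """Yield reads whose median k-mer count so far is below cutoff."""
    counts = Counter()
    for read in reads:  # each read is examined exactly once
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read

# Ten identical reads: redundant coverage beyond the cutoff is discarded.
reads = ["CCGATTGCACTGGACCGA"] * 10
kept = list(digital_normalization(reads, k=10, cutoff=3))
print(len(kept), "of", len(reads), "reads kept")
```

The decision for each read depends only on what has already been seen, so the data never needs to be loaded or revisited as a whole.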

2014 toronto-torbug


Editor's Notes

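  • #15 (This leads to good things; khmer software) Slow, but powerful.
  • #26 (Benchmarking protocols; data subset) Acceptance testing other people’s software
  • #32 (Single pass, reference free, tunable, streaming online variant calling) Update from Jordan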
  • #38 More generally….