SlideShare a Scribd company logo
Building khmer, a platform
for research in scalable
sequence analysis
C. Titus Brown
ctb@msu.edu
Hello!
Assistant Professor; Microbiology; Computer Science;
etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
ACTGGACCGA
K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
De Bruijn graphs –
assemble on overlaps
J.R. Miller et al. / Genomics (2010)
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
Assembly graphs scale with data size, not
information.
Practical memory
measurements (soil)
Velvet measurements (Adina Howe)
Counting k-mers
efficiently (RAM)
This leads to good things.
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Data structures &
algorithms papers
• “These are not the k-mers you are looking for…”,
Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data”, Brown
et al., arXiv 1203.4802, under revision.
Data analysis papers
• “Tackling soil diversity with the assembly of large,
complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes &
transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale
multi-tissue mRNAseq, Scott et al., in prep.
Lab approach – not
intentional, but working out.
Novel data
structures and
algorithms
Implement at
scale
Apply to real
biological
problems
This leads to good things.
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
(khmer software)
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Streaming
algorithms for
assembly,
variant calling,
and error
correction
Cloud assembly
protocols
Efficient graph
labeling &
exploration
Data set
partitioning
approaches
Assembly-free
comparison of
data sets
HMM-guided
assembly
Efficient search
for target genes
Currentresearch
(khmer software)
How is this feasible?!
Representative half-arsed lab software development
Version that
worked once, for
some publication.
Grad student 1
research
Grad student 2
research
Incompatible and broken code
Stable version
Grad student 1
research
Grad student 2
research
Stable, tested code
Run tests
Run tests
Run tests
Run tests
Run tests
A not-insane way to do software development
A not-insane way to do software development
Stable version
Grad student 1
research
Grad student 2
research
Stable, tested code
Run tests
Run tests
Run tests
Run tests
Run tests
Run tests
Run tests
Testing & version control
– the not so secret sauce
• High test coverage - grown over time.
• Stupidity driven testing – we write tests for bugs after
we find them and before we fix them.
• Pull requests & continuous integration – does your
proposed merge break tests?
• Pull requests & code review – does new code meet
our minimal coding etc requirements?
o Note: spellchecking!!!
Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we now have added
acceptance tests to make sure that khmer works
OK with other packages.
• These acceptance tests are based on integration
tests, than in turn come from an education &
documentation effort…
khmer-protocols
khmer-protocols:
• Provide standard “cheap”
assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days
from raw reads to assembly,
annotations, and differential
expression analysis. ~$150 per
data set (on Amazon rental
computers)
• Open, versioned, forkable,
citable….
Read cleaning
Diginorm
Assembly
Annotation
RSEM differential
expression
Literate testing
• Our shell-command tutorials for bioinformatics can
now be executed in an automated fashion –
commands are extracted automatically into shell
scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and
confidence moving forward!
Leigh Sheneman
Doing things right
=> #awesomesauce
Protocols in English
for running analyses in
the cloud
Literate reSTing =>
shell scripts
Tool
competitions
Benchmarking
Education
Acceptance
tests
Benchmarking protocols
Data subset; AWS m1.xlarge
~1 hour
(See PyCon 2014 talk; video and blog post.)
Benchmarking protocols
Complete data; AWS m1.xlarge
~40 hours
(See PyCon 2014 talk; video and blog post.)
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Streaming
algorithms for
assembly,
variant calling,
and error
correction
Cloud assembly
protocols
Efficient graph
labeling &
exploration
Data set
partitioning
approaches
Assembly-free
comparison of
data sets
HMM-guided
assembly
Efficient search
for target genes
Currentresearch
Genomic intervals shared
between data sets
Qingpeng Zhang
* Assembly free!
Error correction via graph
alignment
Jason Pell and Jordan Fish
Error correction on
simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
TP FP TN FN
ideal 3,469,834 99.1% 8,186 460,655,449 31,731 0.9%
1-pass 2,827,839 80.8% 30,254 460,633,381 673,726 19.2%
1.2-pass 3,403,171 97.2% 8,764 460,654,871 98,394 2.8%
(corrected) (mistakes) (OK) (missed)
Single pass, reference free, tunable, streaming online
variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
Novelty… to what power?
• “Novelty” requirements for “high impact
publishing”:
o Must do novel algorithm development
o …and apply to novel and interesting data sets.
o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying
to develop and maintain a core set of functionality
in research software: novelty cubed? :)
Reproducibility
Scientific progress relies on reproducibility of analysis.
(Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion
=> IPython Notebook (also in
github)
Qingpeng Zhang
Concluding thoughts
• API is destiny – without online counting, diginorm &
streaming approaches would not have been
possible.
• Tackle the hard problems – engineering
optimization would not have gotten us very far.
• Testing lets us scale development & process – which
means when something works, we can run with it.
Caveats
• Expense and effort – you can spend an infinite
amount of time on infrastructure & process!
o Advice: choose techniques that address actual pain points.
o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good
software practices for yourself, not others.
o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
o These are 90% true statements :>
Can we crowdsource
bioinformatics?
We already are! Bioinformatics is already a tremendously
open and collaborative endeavor. (Let’s take advantage
of it!)
“It’s as if somewhere, out there, is a collection of totally free
software that can do a far better job than ours can, with
open, published methods, great support networks and
fantastic tutorials. But that’s madness – who on Earth
would create such an amazing resource?”
-
http://thescienceweb.wordpress.com/2014/02/21/bioinfor
matics-software-companies-have-no-clue-why-no-one-
buys-their-products/
Thanks!
Prospective: sequencing
tumor cells
• Goal: phylogenetically reconstruct causal “driver
mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of
sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate
data while retaining variant information.
Where are we taking this?
• Streaming online algorithms only look at data
~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.

More Related Content

Viewers also liked

Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition
 
Peuples inconnus
Peuples inconnusPeuples inconnus
Peuples inconnus
André Thépin
 
Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10John Doyle
 
Grandparents day
Grandparents day Grandparents day
Grandparents day
Takahe One
 
Trainings Evaluation Reports WPS Phase-II Layyah
Trainings Evaluation Reports WPS Phase-II LayyahTrainings Evaluation Reports WPS Phase-II Layyah
Trainings Evaluation Reports WPS Phase-II LayyahZafar Ahmad
 
Kezia Tragedy
Kezia TragedyKezia Tragedy
Kezia Tragedy
cobbo
 
Light Duty, Good Faith Job Offers + Transitional Work
Light Duty, Good Faith Job Offers + Transitional WorkLight Duty, Good Faith Job Offers + Transitional Work
Light Duty, Good Faith Job Offers + Transitional Work
Kegler Brown Hill + Ritter
 
3835 N Greenview #1
3835 N Greenview #13835 N Greenview #1
3835 N Greenview #1
bamadogg
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
c.titus.brown
 
What is electricity
What is electricityWhat is electricity
What is electricity
Julio Cesar Retamal Rojas
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
c.titus.brown
 
Introducing BlackBerry 10 [Indonesian Version]
Introducing BlackBerry 10 [Indonesian Version]Introducing BlackBerry 10 [Indonesian Version]
Introducing BlackBerry 10 [Indonesian Version]Khomeini Mujahid
 
AMD-V™ Nested Paging
AMD-V™ Nested PagingAMD-V™ Nested Paging
AMD-V™ Nested Paging
James Price
 
NABE Communications Section - Event Program
NABE Communications Section - Event ProgramNABE Communications Section - Event Program
NABE Communications Section - Event Program
Stephanie Abbott
 
[TDC 2013] Integre um grid de dados em memória na sua Arquitetura
[TDC 2013] Integre um grid de dados em memória na sua Arquitetura[TDC 2013] Integre um grid de dados em memória na sua Arquitetura
[TDC 2013] Integre um grid de dados em memória na sua ArquiteturaFernando Galdino
 
Presentacio Ciutats
Presentacio CiutatsPresentacio Ciutats
Presentacio Ciutats
Reckonerr
 
The Birth of Kehfab
The Birth of KehfabThe Birth of Kehfab
The Birth of KehfabWebsites.ca
 

Viewers also liked (20)

Duygular
DuygularDuygular
Duygular
 
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
 
Peuples inconnus
Peuples inconnusPeuples inconnus
Peuples inconnus
 
Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10
 
Grandparents day
Grandparents day Grandparents day
Grandparents day
 
Trainings Evaluation Reports WPS Phase-II Layyah
Trainings Evaluation Reports WPS Phase-II LayyahTrainings Evaluation Reports WPS Phase-II Layyah
Trainings Evaluation Reports WPS Phase-II Layyah
 
Kezia Tragedy
Kezia TragedyKezia Tragedy
Kezia Tragedy
 
Light Duty, Good Faith Job Offers + Transitional Work
Light Duty, Good Faith Job Offers + Transitional WorkLight Duty, Good Faith Job Offers + Transitional Work
Light Duty, Good Faith Job Offers + Transitional Work
 
3835 N Greenview #1
3835 N Greenview #13835 N Greenview #1
3835 N Greenview #1
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
What is electricity
What is electricityWhat is electricity
What is electricity
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Shirk
ShirkShirk
Shirk
 
Introducing BlackBerry 10 [Indonesian Version]
Introducing BlackBerry 10 [Indonesian Version]Introducing BlackBerry 10 [Indonesian Version]
Introducing BlackBerry 10 [Indonesian Version]
 
AMD-V™ Nested Paging
AMD-V™ Nested PagingAMD-V™ Nested Paging
AMD-V™ Nested Paging
 
NABE Communications Section - Event Program
NABE Communications Section - Event ProgramNABE Communications Section - Event Program
NABE Communications Section - Event Program
 
[TDC 2013] Integre um grid de dados em memória na sua Arquitetura
[TDC 2013] Integre um grid de dados em memória na sua Arquitetura[TDC 2013] Integre um grid de dados em memória na sua Arquitetura
[TDC 2013] Integre um grid de dados em memória na sua Arquitetura
 
Presentacio Ciutats
Presentacio CiutatsPresentacio Ciutats
Presentacio Ciutats
 
The Birth of Kehfab
The Birth of KehfabThe Birth of Kehfab
The Birth of Kehfab
 
Body Language
Body  LanguageBody  Language
Body Language
 

Similar to 2014 toronto-torbug

2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibilityc.titus.brown
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
c.titus.brown
 
Can we induce change with what we measure?
Can we induce change with what we measure?Can we induce change with what we measure?
Can we induce change with what we measure?
Michaela Greiler
 
Software testing
Software testingSoftware testing
Software testing
Enamul Haque
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosas
Merce Crosas
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
Tao Xie
 
DevOps - Boldly Go for Distro
DevOps - Boldly Go for DistroDevOps - Boldly Go for Distro
DevOps - Boldly Go for Distro
Paul Boos
 
Automatic for the People
Automatic for the PeopleAutomatic for the People
Automatic for the People
Andy Zaidman
 
From Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auFrom Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.au
evanbottcher
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
Robert Grossman
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
c.titus.brown
 
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your Code
Nate Abele
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
c.titus.brown
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
Yannick Wurm
 
Leaping over the Boundaries of Boundary Value Analysis
Leaping over the Boundaries of Boundary Value AnalysisLeaping over the Boundaries of Boundary Value Analysis
Leaping over the Boundaries of Boundary Value Analysis
TechWell
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
c.titus.brown
 
Trends in Agile Testing by Lisa Crispin
Trends in Agile Testing by Lisa CrispinTrends in Agile Testing by Lisa Crispin
Trends in Agile Testing by Lisa CrispinDirecti Group
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
PyData
 
Issues in AI product development and practices in audio applications
Issues in AI product development and practices in audio applicationsIssues in AI product development and practices in audio applications
Issues in AI product development and practices in audio applications
Taesu Kim
 
How to Actually DO High-volume Automated Testing
How to Actually DO High-volume Automated TestingHow to Actually DO High-volume Automated Testing
How to Actually DO High-volume Automated Testing
TechWell
 

Similar to 2014 toronto-torbug (20)

2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Can we induce change with what we measure?
Can we induce change with what we measure?Can we induce change with what we measure?
Can we induce change with what we measure?
 
Software testing
Software testingSoftware testing
Software testing
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosas
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
DevOps - Boldly Go for Distro
DevOps - Boldly Go for DistroDevOps - Boldly Go for Distro
DevOps - Boldly Go for Distro
 
Automatic for the People
Automatic for the PeopleAutomatic for the People
Automatic for the People
 
From Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auFrom Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.au
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your Code
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
 
Leaping over the Boundaries of Boundary Value Analysis
Leaping over the Boundaries of Boundary Value AnalysisLeaping over the Boundaries of Boundary Value Analysis
Leaping over the Boundaries of Boundary Value Analysis
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Trends in Agile Testing by Lisa Crispin
Trends in Agile Testing by Lisa CrispinTrends in Agile Testing by Lisa Crispin
Trends in Agile Testing by Lisa Crispin
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
Issues in AI product development and practices in audio applications
Issues in AI product development and practices in audio applicationsIssues in AI product development and practices in audio applications
Issues in AI product development and practices in audio applications
 
How to Actually DO High-volume Automated Testing
How to Actually DO High-volume Automated TestingHow to Actually DO High-volume Automated Testing
How to Actually DO High-volume Automated Testing
 

More from c.titus.brown

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
c.titus.brown
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
c.titus.brown
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
c.titus.brown
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
c.titus.brown
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
c.titus.brown
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
c.titus.brown
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
c.titus.brown
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
c.titus.brown
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
c.titus.brown
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
c.titus.brown
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
c.titus.brown
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
c.titus.brown
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
c.titus.brown
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
c.titus.brown
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
c.titus.brown
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
c.titus.brown
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
c.titus.brown
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
c.titus.brown
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
c.titus.brown
 

More from c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 

Recently uploaded

Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
AlguinaldoKong
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
IvanMallco1
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
ssuserbfdca9
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
kumarmathi863
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
subedisuryaofficial
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 

Recently uploaded (20)

Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 

2014 toronto-torbug

  • 1. Building khmer, a platform for research in scalable sequence analysis C. Titus Brown ctb@msu.edu
  • 2. Hello! Assistant Professor; Microbiology; Computer Science; etc. More information at: • ged.msu.edu/ • github.com/ged-lab/ • ivory.idyll.org/blog/ • @ctitusbrown
  • 3. Introducing k-mers CCGATTGCACTGGACCGA (<- read) CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC ACTGGACCGA
  • 4. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG
  • 5. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG CATGGACCGATTGCACTGGACCGATGCACGGACCG (with no accounting for mismatches or indels)
  • 6. De Bruijn graphs – assemble on overlaps J.R. Miller et al. / Genomics (2010)
  • 7. The problem with k-mers CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTCGACCGATGCACGGTACCG Each sequencing error results in k novel k-mers!
  • 8. Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com Assembly graphs scale with data size, not information.
  • 11. This leads to good things. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization
  • 12. Data structures & algorithms papers • “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review. • “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012. • “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
  • 13. Data analysis papers • “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014. • Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep. • A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
  • 14. Lab approach – not intentional, but working out. Novel data structures and algorithms Implement at scale Apply to real biological problems
  • 15. This leads to good things. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization (khmer software)
  • 16. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization Streaming algorithms for assembly, variant calling, and error correction Cloud assembly protocols Efficient graph labeling & exploration Data set partitioning approaches Assembly-free comparison of data sets HMM-guided assembly Efficient search for target genes Currentresearch (khmer software)
  • 17. How is this feasible?! Representative half-arsed lab software development Version that worked once, for some publication. Grad student 1 research Grad student 2 research Incompatible and broken code
  • 18. Stable version Grad student 1 research Grad student 2 research Stable, tested code Run tests Run tests Run tests Run tests Run tests A not-insane way to do software development
  • 19. A not-insane way to do software development Stable version Grad student 1 research Grad student 2 research Stable, tested code Run tests Run tests Run tests Run tests Run tests Run tests Run tests
  • 20. Testing & version control – the not so secret sauce • High test coverage - grown over time. • Stupidity driven testing – we write tests for bugs after we find them and before we fix them. • Pull requests & continuous integration – does your proposed merge break tests? • Pull requests & code review – does new code meet our minimal coding etc requirements? o Note: spellchecking!!!
  • 21. Integration testing • khmer is designed to work with other packages. • For releases >= 1.0, we now have added acceptance tests to make sure that khmer works OK with other packages. • These acceptance tests are based on integration tests, than in turn come from an education & documentation effort…
  • 23. khmer-protocols: • Provide standard “cheap” assembly protocols for the cloud. • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers) • Open, versioned, forkable, citable…. Read cleaning Diginorm Assembly Annotation RSEM differential expression
  • 24. Literate testing • Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts. • See: github.com/ged-lab/literate-resting/. • Tremendously improves peace of mind and confidence moving forward! Leigh Sheneman
  • 25. Doing things right => #awesomesauce Protocols in English for running analyses in the cloud Literate reSTing => shell scripts Tool competitions Benchmarking Education Acceptance tests
  • 26. Benchmarking protocols Data subset; AWS m1.xlarge ~1 hour (See PyCon 2014 talk; video and blog post.)
  • 27. Benchmarking protocols Complete data; AWS m1.xlarge ~40 hours (See PyCon 2014 talk; video and blog post.)
  • 28. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization Streaming algorithms for assembly, variant calling, and error correction Cloud assembly protocols Efficient graph labeling & exploration Data set partitioning approaches Assembly-free comparison of data sets HMM-guided assembly Efficient search for target genes Currentresearch
  • 29. Genomic intervals shared between data sets Qingpeng Zhang * Assembly free!
  • 30. Error correction via graph alignment Jason Pell and Jordan Fish
  • 31. Error correction on simulated E. coli data 1% error rate, 100x coverage. Jordan Fish and Jason Pell TP FP TN FN ideal 3,469,834 99.1% 8,186 460,655,449 31,731 0.9% 1-pass 2,827,839 80.8% 30,254 460,633,381 673,726 19.2% 1.2-pass 3,403,171 97.2% 8,764 460,654,871 98,394 2.8% (corrected) (mistakes) (OK) (missed)
  • 32. Single pass, reference free, tunable, streaming online variant calling. Streaming, online variant calling. See NIH BIG DATA grant, http://ged.msu.edu/.
  • 33. Novelty… to what power? • “Novelty” requirements for “high impact publishing”: o Must do novel algorithm development o …and apply to novel and interesting data sets. o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662) • We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)
  • 34. Reproducibility Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.) All our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’ • Graphing and data digestion => IPython Notebook (also in github) Qingpeng Zhang
  • 35. Concluding thoughts • API is destiny – without online counting, diginorm & streaming approaches would not have been possible. • Tackle the hard problems – engineering optimization would not have gotten us very far. • Testing lets us scale development & process – which means when something works, we can run with it.
  • 36. Caveats • Expense and effort – you can spend an infinite amount of time on infrastructure & process! o Advice: choose techniques that address actual pain points. o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014) • Funders and reviewers just don’t care – adopt good software practices for yourself, not others. o Advice: briefly mention keywords in grants, papers. • Advisors just don’t care – see above. o These are 90% true statements :>
  • 37. Can we crowdsource bioinformatics? We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!) “It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?” - http://thescienceweb.wordpress.com/2014/02/21/bioinfor matics-software-companies-have-no-clue-why-no-one- buys-their-products/
  • 39. Prospective: sequencing tumor cells • Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations. • 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence. • Most of this data will be redundant and not useful. • Developing diginorm-based algorithms to eliminate data while retaining variant information.
  • 40. Where are we taking this? • Streaming online algorithms only look at data ~once. • Diginorm is streaming, online… • Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.

Editor's Notes

  1. Slow, but powerful.
  2. Acceptance testing other people’s software
  3. Update from Jordan
  4. More generally….