2014 abic-talk

BUILDING BETTER
BIOINFORMATICS
SOFTWARE
(WHY THE HECK NOT?)
C. Titus Brown
ctb@msu.edu
Assistant Professor, MMG / CSE
Michigan State University

BUILDING BETTER
BIOINFORMATICS
SOFTWARE
(WHY THE HECK NOT?)
C. Titus Brown
ctb@msu.edu
A???????? Professor, VetMed, UC Davis

Lansing, Michigan -> Davis, California

Dot plots FTW!
Brown et al., 2005.

So I said these things…
“this tipping point was exacerbated by the loss of about
80% of the world’s data scientists in the 2021 Great
California Disruption.”
“[ Benchmarks ] have proven to be stifling of innovation,
because of the tendency to do incremental improvement.”
ivory.idyll.org/blog/2014-bosc-keynote.html

There is a massive profusion of software!
Mick Watson, @BioMickWatson:
biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/
jeffvictor.deviantart.com

The players, in caricature:
1. Computer scientists
2. Software engineers
3. Data scientists
4. Statisticians
5. Biologists

The Computer Scientist
Fast, sensitive, specific – pick one.

The (Good) Software Engineer
Does it have any unit tests?

The Data Scientist
How quickly can I run it, starting from
scratch?

The Statistician
What gives me the best p-value?

The Biologist
What gives me the most publishable
result?

Problems all along the way…
1. Computer scientists: build delicate, hard-to-use, very high-performance software that solves the wrong problem.
2. Software engineers: all work for Google.
3. Data scientists: use the wrong programs -- because they’re actually usable.
4. Statisticians: only get invited into the project six months after all the data is generated.
5. Biologists: are desperate to find any one of the above who knows any biology at all.

Example: de novo mRNAseq
Quality control
Assembly
Annotation
Differential expression
Every one of these steps is still an open research problem, with computational challenges and direct biological implications!

So:
1. This is all still research.
2. We’re unlikely to ever find out the right answer, but will
merely settle for one that’s not obviously terrible.
3. Everything is changing all the time: the data generation
tech, the hardware, the software, the theory...
4. Who are any of us to judge the value of any particular
approach?

(Well, sometimes me, when I’m peer
reviewer #2.)

All hands on deck!
Quality control
Assembly
Annotation
Differential expression
We need it all!
• Fast/sensitive/specific algorithms;
• Solid software;
• Statistical robustness;
• Biological insight;
• Well-trained data scientists.
(The best bioinformaticians have multiple personality disorder, or so I tell myself.)

That sort of explains why.
But this still leaves us with too many
choices.

Example: de novo mRNAseq
Quality control: 10-20 packages
x
Assembly: 2-5 packages
x
Annotation: 5-10 packages
x
Differential expression: 20-40 packages
= 2,000-40,000 combinations
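
The arithmetic behind that range can be checked with a short sketch. The step names and package counts come from the slide; the dictionary layout and variable names are only illustrative, and pairing each range with each step assumes the slide lists them in the same order.

```python
from math import prod

# (low, high) estimates of available packages per step, as listed on the slide.
packages = {
    "quality control": (10, 20),
    "assembly": (2, 5),
    "annotation": (5, 10),
    "differential expression": (20, 40),
}

# One workflow = one package choice per step, so the counts multiply.
low = prod(lo for lo, _ in packages.values())
high = prod(hi for _, hi in packages.values())

print(f"{low:,} to {high:,} possible workflows")
```

Adding one more step with even a handful of choices multiplies both ends of the range again, which is why evaluating single tools in isolation does not scale.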

What’s the solution!?
Ultimately? All of…
Whole-workflow evaluations of tools.
Small tools (see “small tools manifesto”).
Automation!
Simulations, synthetic data, mock data, real data.
Antagonistic data set development (**).
Tool development driven with use cases.
Build based on solid command-line workflows.
Those things called “controls”.
…and more

Trying out a few approaches…

1. Automate the hell out of everything
(Ubuntu 14.04, git, make, IPython Notebook, LaTeX)

Time from publication of KAnalyze to our 100%
reproducible re-evaluation? ~8 hours.

2. Protocols, not pipelines.
STOP HIDING THE ANALYSIS STEPS.
BIG BLACK BOXES ARE NOT SMALL
TOOLS!

Write down what you’re doing…
https://khmer-protocols.readthedocs.org/

…and add automated end-to-end tests.
cf. “literate ReSTing”
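
One way to picture an end-to-end test of a written protocol is to pull the shell commands out of the documentation and run them in order. The sketch below is not the actual "literate ReSTing" tool; it is a minimal illustration that assumes the protocol is written in reStructuredText with commands in `::` literal blocks. The function names, the sample protocol text, and the file name in it are all hypothetical.

```python
import subprocess
import textwrap


def extract_commands(rst_text):
    """Return the command lines found in '::' literal blocks of a reST text."""
    blocks, current, in_block = [], [], False
    for line in rst_text.splitlines():
        if in_block:
            if line.strip() == "" or line.startswith((" ", "\t")):
                current.append(line)
                continue
            # The first unindented line ends the literal block.
            blocks.append(current)
            current, in_block = [], False
        if line.rstrip().endswith("::"):
            in_block = True
    if in_block:
        blocks.append(current)
    commands = []
    for block in blocks:
        dedented = textwrap.dedent("\n".join(block))
        commands.extend(l for l in dedented.splitlines() if l.strip())
    return commands


def run_protocol(rst_text, workdir):
    """Run every extracted command in order; stop at the first failure."""
    script = "\n".join(extract_commands(rst_text))
    subprocess.run(["bash", "-e", "-c", script], cwd=workdir, check=True)


# Hypothetical two-step protocol, written the way a reader would follow it.
PROTOCOL = """\
Count the lines
===============

Make a tiny file::

   echo ACGT > reads.txt

Then count its lines::

   wc -l reads.txt
"""

if __name__ == "__main__":
    for command in extract_commands(PROTOCOL):
        print(command)
```

Calling `run_protocol(PROTOCOL, some_scratch_directory)` would then execute the documented steps exactly as written, so the documentation fails loudly when a step stops working.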

3. Drive sustainable software
development with use cases.

4. Put everything in the cloud and
measure it.
Eel Pond mRNAseq protocol: ~40 hours on an m1.xlarge.

5. Compare programs and workflows fairly.
Genome Reference
          Quality Filtered   Diginorm   Partition   Reinflation
Velvet    -                  80.90      83.64       84.57
IDBA      90.96              91.38      90.52       88.80
SPAdes    90.42              90.35      89.57       90.02

Mis-assembled Contig Length
          Quality Filtered   Diginorm   Partition   Reinflation
Velvet    -                  52071358   44730449    45381867
IDBA      21777032           20807513   17159671    18684159
SPAdes    28238787           21506019   14247392    18851571
Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013
Also! Tip o’ the hat to Michael Barton, nucleotid.es
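
Once every assembler has been run through the same treatments, the comparison itself is easy to script. The sketch below uses the mis-assembled contig lengths from the slide and assumes that lower is better and that the "-" for Velvet belongs to the Quality Filtered column; the helper name and data layout are illustrative only.

```python
# Treatments (columns) in the order shown on the slide.
TREATMENTS = ["Quality Filtered", "Diginorm", "Partition", "Reinflation"]

# Mis-assembled contig length per assembler; None marks the cell shown as "-".
misassembled = {
    "Velvet": [None, 52071358, 44730449, 45381867],
    "IDBA": [21777032, 20807513, 17159671, 18684159],
    "SPAdes": [28238787, 21506019, 14247392, 18851571],
}


def best_treatment(values):
    """Return (treatment, value) with the smallest value, skipping missing cells."""
    scored = [(v, t) for t, v in zip(TREATMENTS, values) if v is not None]
    value, treatment = min(scored)
    return treatment, value


best = {asm: best_treatment(vals) for asm, vals in misassembled.items()}

for assembler, (treatment, value) in best.items():
    print(f"{assembler}: least mis-assembly after {treatment} ({value:,})")
```

A single metric like this is only one view of the results; the same loop could be repeated for the Genome Reference numbers, where the treatments rank differently.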

A super fun way to do reviews!
• “What a nice new transcriptome assembler! Interesting
how it doesn’t perform that well on my 10 test data sets.”
• “Hey, so you make these claims, but I ran your code,
and…”
• “Fun fact! Your source code has a syntax error in it – even
Perl has standards! You’re still sure that’s the script you
used?”
• “Here – use our evaluation pipeline, since you clearly
need something better.”
The Brown Lab: taking passive aggression to a whole new level!

We breed our own problems.
Reward the behavior you want to see.
Let’s level up the field, already.

What are we working on, scientifically
speaking?

Streaming error correction of genomic, transcriptomic,
metagenomic data via graph alignment
Jason Pell, Jordan Fish, Michael Crusoe

Error correction on simulated E. coli data
            TP (corrected)     FP (mistakes)   TN (OK)        FN (missed)
1.2-pass    3,494,631 99.8%    3,865           460,601,171    5,533 2.8%
1% error rate, 100x coverage.
Michael Crusoe, Jordan Fish, Jason Pell
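
The four counts can be turned into summary rates with the standard confusion-matrix definitions. This is a sketch under the assumption that the 99.8% on the slide is sensitivity, TP / (TP + FN), which is what the counts round to; it does not attempt to reproduce the other percentage shown. Variable names are illustrative.

```python
# Counts from the slide (simulated E. coli, 1% error rate, 100x coverage).
tp = 3_494_631    # errors that were corrected
fp = 3_865        # correct bases changed by mistake
tn = 460_601_171  # correct bases left alone
fn = 5_533        # errors that were missed

total = tp + fp + tn + fn
sensitivity = tp / (tp + fn)   # share of real errors that were fixed
precision = tp / (tp + fp)     # share of changes that were justified
specificity = tn / (tn + fp)   # share of correct bases left untouched

print(f"bases examined: {total:,}")
print(f"sensitivity:    {sensitivity:.1%}")
print(f"precision:      {precision:.1%}")
print(f"specificity:    {specificity:.4%}")
```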

Error correction -> variant calling
Single pass, reference free, tunable, streaming
online variant calling.
(Hey, look, ma – a new mapper!)

Infrastructure: distributed graph database server
[Diagram components: web interface + API; compute server (Galaxy? Arvados?); data/info; raw data sets; public servers; "walled garden" server; private server; graph query layer; upload/submit (NCBI, KBase); import (MG-RAST, SRA, EBI)]
ivory.idyll.org/blog/2014-moore-ddd-talk.html

AGTA talk on Monday
• 3:15-4pm – come see me try to convince biomedical
researchers to share their data!
• 4-4:30pm – come listen to Ana Conesa talk about multi-omics
data integration!
Thanks!
