CT Brown - Doing next-gen sequencing analysis in the cloud

Doing next-gen sequencing
analysis in the cloud.
C. Titus Brown
ctb@msu.edu

Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald

Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI

Funding
USDA NIFA; NSF IOS; BEACON.

“Be the change you want to see”
We are aggressively open…

Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on Lab Web site:
http://ged.msu.edu/interests.html
(What’s a good license??)
• Preprints: on arXiv, q-bio:
‘kmer-percolation arxiv’
‘diginorm arxiv’

The data catastrophe!
• Data set sizes are growing faster than compute capacity
(esp. RAM).
• Many biological algorithms don’t scale all that well,
anyway.
• Algorithmically, we want:
  • Single-pass.
  • Compression approaches (lossy or otherwise).
  • Low-memory data structures.

• I, personally, think the last thing in the world we need
is another standalone package; hence, pre-filtering
approaches.
“Run our nifty approaches first, then feed into the
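
To make the "low-memory data structures" item concrete, here is a minimal sketch of one well-known fixed-memory counter, a Count-Min Sketch. The slides do not say which structure the lab's software uses, so the choice of structure, the class name `CountMinSketch`, and the `width` and `depth` values are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter with a fixed memory footprint (illustrative only)."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters; memory is set here and never grows.
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One slot per row, chosen by a hash of the item salted with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.rows[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._slots(item))


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
```

The estimate returned by `count` may be too high when two k-mers collide in every row, but it is never too low, and the table size does not depend on how many distinct k-mers are seen.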

Digital normalization

Suppose you have a
dilution factor of A (10) to
B (1). To get 10x of B you
need to get 100x of A!
Overkill!!

This 100x will consume
disk space and, because
of errors, memory.

Downsample based on de Bruijn
graph structure (which can be
derived online)
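
The 10:1 dilution arithmetic above can be written out as a tiny worked example. The function name and parameter names are hypothetical; the only inputs taken from the slide are the 10:1 ratio and the 10x target.

```python
def coverage_of_abundant(ratio_a, ratio_b, target_b):
    """Coverage accumulated on A by the time the rarer B reaches target_b."""
    return target_b * ratio_a / ratio_b


# A is 10 times more abundant than B, and we want 10x coverage of B.
needed_a = coverage_of_abundant(ratio_a=10, ratio_b=1, target_b=10)
print(needed_a)  # 100.0

# If 10x of A would also have been enough, this share of A's reads is redundant.
redundant_fraction = 1 - 10 / needed_a
print(redundant_fraction)  # 0.9
```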

Digital normalization algorithm

for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

Note, single pass; fixed memory.
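
As a runnable sketch of the loop above, under stated assumptions: the helper names `kmers` and `normalize`, the tiny `K`, and the `CUTOFF` value are illustrative and not taken from the lab's code. For readability this version counts k-mers exactly with a `Counter`, which grows with the number of distinct k-mers; a fixed-memory approximate counter would be needed to get the fixed-memory property the slide describes.

```python
from collections import Counter
from statistics import median

K = 4        # k-mer size; kept tiny for the toy reads below (assumption)
CUTOFF = 2   # keep reads until their estimated coverage reaches this value


def kmers(seq, k=K):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, cutoff=CUTOFF, k=K):
    """Single pass: keep a read only if its median k-mer count is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads update the counts
            kept.append(read)
        # otherwise the read is discarded and the counts stay untouched
    return kept


reads = ["ACGTTGCA"] * 5 + ["TTTTGGGG"]
kept = normalize(reads)
print(kept)  # ['ACGTTGCA', 'ACGTTGCA', 'TTTTGGGG']
```

The five copies of the first read are reduced to two, while the read seen only once is kept, which is the downsampling behaviour the slide describes.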

Digital normalization is efficient &
effective
Algorithmic nerdvana!
• Single pass algorithm;
• Fixed memory;
• Cheaper than assembly;
• Reduces assembly time;
• Scales assembly memory.

Brown et al., in review, PLoS On

Digital normalization removes errors

Shotgun data is often (1) high
coverage and (2) biased in coverage.

…here we discard > 95% of data!

Other key points
• Virtually identical contig assembly; scaffolding works
but is not yet cookie-cutter.

• Digital normalization changes de Bruijn graph assembly
so that it scales with the size of the source sample
rather than the size of your data set.

• Always lower memory than assembly: we never
collect most erroneous k-mers.

• Digital normalization can be done once, and then
assembly parameter exploration can be done.

Quotable quotes.
Comment: “This looks like a great solution for
people who can’t afford real computers”.

OK, but:

“Buying ever bigger computers is a great
solution for people who don’t want to think
hard.”

To be less snide: both kinds of scaling are needed,
of course.

Why use diginorm?
• Use the cloud to assemble any microbial
genomes incl. single-cell, many eukaryotic
genomes, most mRNAseq, and many
metagenomes.

• Seems to provide leverage on addressing many
biological or sample prep problems (single-cell &
genome amplification MDA; metagenome;
heterozygosity).

• And, well, the general idea of locus-specific
graph analysis solves lots of things…

Some interim concluding
thoughts
• Digital normalization-like approaches provide a
path to solving the majority of assembly scaling
problems, and will enable assembly on current
cloud computing hardware.
• This is not true for highly diverse metagenome
environments…
• For soil, we estimate that we need 50 Tbp / gram
soil. Sigh.

• Biologists and bioinformaticians hate:
  • Throwing away data
  • Caveats in bioinformatics papers (which reviewers
  like, note)

Streaming error correction.

We can do error trimming of genomic, MDA, transcriptomic,
metagenomic data in < 2 passes, fixed memory.
We have just submitted a proposal to adapt Euler or
Quake-like error correction (e.g. spectral alignment
problem) to this framework.
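
A minimal sketch of abundance-based error trimming, to illustrate the idea only. It uses two full passes (count, then trim) and an exact `Counter`, which is simpler than the "< 2 passes, fixed memory" streaming variant described above. The function names, `K`, and `MIN_COUNT` are assumptions for illustration, not the lab's implementation.

```python
from collections import Counter

K = 4          # k-mer size; tiny for the toy reads below (assumption)
MIN_COUNT = 2  # k-mers seen fewer times than this are treated as likely errors


def kmers(seq, k=K):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_read(read, counts, k=K, min_count=MIN_COUNT):
    """Cut the read just before its first low-abundance k-mer."""
    for i, kmer in enumerate(kmers(read, k)):
        if counts[kmer] < min_count:
            # Keep bases up to the end of the last trusted k-mer, if there is one.
            return read[:i + k - 1] if i else ""
    return read


reads = ["ACGTTGCA"] * 3 + ["ACGTTGGA"]  # the last read has one substituted base

# Pass 1: count every k-mer.
counts = Counter()
for read in reads:
    counts.update(kmers(read))

# Pass 2: trim each read at the first k-mer that looks erroneous.
trimmed = [trim_read(read, counts) for read in reads]
print(trimmed[-1])  # ACGTTG
```

The three error-free reads pass through unchanged; the read carrying the substitution is cut where its k-mers stop being supported by the rest of the data.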

Side note: error correction is the
biggest “data” problem left in
sequencing.

Both for mapping & assembly.

Replication fu
• In December 2011, I met Wes McKinney on a
train and he convinced me that I should look at
IPython Notebook.

• This is an interactive Web notebook for data
analysis…

• Hey, neat! We can use this for replication!
• All of our figures can be regenerated from scratch,
on an EC2 instance, using a Makefile (data
pipeline) and IPython Notebook (figure generation).
• Everything is version controlled.
• Honestly not much work, and will be less the next
time.

So… how’d that go?
• People who already cared thought it was nifty.
http://ivory.idyll.org/blog/replication-i.html
• Almost nobody else cares ;(
• Presub enquiry to editor: “Be sure that your paper can
be reproduced.” Uh, please read my letter to the end?
• “Could you improve your Makefile? I want to
reimplement diginorm in another language and reuse
your pipeline, but your Makefile is a mess.”
• Incredibly useful, nonetheless. Already part of
undergraduate and graduate training in my lab;
helping us and others with next papers; etc. etc. etc.

Life is way too short to waste on unnecessarily
replicating your own workflows, much less other
people’s.

Advertisement!
• Qingpeng Zhang (QP) will talk about our very
useful ‘khmer’ software for efficiently counting
k-mers.

• Want a simple Python lib for reading & indexing
FASTA/FASTQ? Check out screed.

“Better science through superior software.”

Advertisement

Panel on “Should we have voluntary review
standards for bioinformatics?”

Tomorrow, 4:30pm.

