Doing next-gen sequencing
   analysis in the cloud.
       C. Titus Brown
       ctb@msu.edu
Acknowledgements
Lab members involved:
   Adina Howe (w/Tiedje)
   Jason Pell
   Arend Hintze
   Rosangela Canino-Koning
   Qingpeng Zhang
   Elijah Lowe
   Likit Preeyanon
   Jiarong Guo
   Tim Brom
   Kanchan Pavangadkar
   Eric McDonald
Collaborators:
   Jim Tiedje, MSU
   Billie Swalla, UW
   Janet Jansson, LBNL
   Susannah Tringe, JGI
Funding:
   USDA NIFA; NSF IOS; BEACON.
“Be the change you want to see”
         We are aggressively open…

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
  (What's a good license??)
 Preprints: on arXiv, q-bio:
  'kmer-percolation arxiv'
  'diginorm arxiv'
The data catastrophe!
 Data set sizes growing faster than compute capacity
  (esp RAM).
 Many biological algorithms don't scale all that well,
  anyway.
 Algorithmically, we want:
   Single-pass.
   Compression approaches (lossy or otherwise).
   Low-memory data structures
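
As one illustration of a low-memory data structure, here is a minimal sketch of a count-min sketch used as a k-mer counter. This is an assumption about the kind of structure meant, not the lab's implementation; the class name, the default sizes, and the BLAKE2b-based hashing are all chosen here for illustration. Memory is fixed when the object is created, and counts can be overestimated but never underestimated.

```python
import hashlib


class CountMinSketch:
    """Approximate counter whose memory is fixed at construction time."""

    def __init__(self, width=10_000, depth=4):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a slot, so the minimum over rows
        # is the tightest available estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


sketch = CountMinSketch()
for kmer in ["ACGT", "ACGT", "CGTA", "ACGT"]:
    sketch.add(kmer)
print(sketch.count("ACGT"))  # at least 3; never an undercount
```

The trade-off is lossy counting in exchange for a memory footprint that does not grow with the number of distinct k-mers.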


 I, personally, think the last thing in the world we need
  is another standalone package: pre-filtering
  approaches.
    “Run our nifty approaches first, then feed into the
Digital normalization


Suppose you have a dilution factor of A (10) to B (1).
To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.
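
The arithmetic on this slide can be written out as a small sketch. It assumes that coverage is simply proportional to relative abundance; the function name and the dictionary layout are hypothetical, introduced only for this example.

```python
def required_coverage(target_coverage, relative_abundance):
    """Coverage every species reaches once the rarest hits target_coverage."""
    rarest = min(relative_abundance.values())
    return {name: target_coverage * abundance / rarest
            for name, abundance in relative_abundance.items()}


# A is 10 times as abundant as B; we want 10x coverage of B.
coverage = required_coverage(10, {"A": 10, "B": 1})
print(coverage)  # {'A': 100.0, 'B': 10.0}
```

Everything sequenced from A beyond the target coverage is the redundancy that digital normalization aims to discard.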
Downsample based on de Bruijn
graph structure (which can be
derived online)
Digital normalization algorithm

for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

            Note, single pass; fixed memory.
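
The pseudocode above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not the khmer implementation: reads are plain strings, k-mers are not canonicalized against their reverse complement, and counts live in an ordinary dict, which grows with the number of distinct k-mers. Getting the fixed-memory behaviour described on the slide would require swapping the dict for a fixed-size probabilistic counter. The names `kmers` and `normalize` and the default values of `k` and `cutoff` are illustrative.

```python
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count is still below cutoff."""
    counts = {}  # stand-in for a fixed-memory counting structure
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts.get(km, 0) for km in read_kmers) < cutoff:
            # Only reads that are kept contribute to the counts.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read
        # else: discard the read


reads = ["ACGTTGCAAG"] * 10 + ["TTTTTTTTTT"]
kept = list(normalize(reads, k=4, cutoff=3))
print(len(kept))  # 4: three copies of the repeated read plus the novel one
```

Because discarded reads never update the counts, k-mers that occur only in redundant reads, including many erroneous ones, are never stored.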
Digital normalization is efficient &
effective
Algorithmic nerdvana!
 • Single pass algorithm;
 • Fixed memory;
 • Cheaper than assembly;
 • Reduces assembly time;
 • Scales assembly memory.

Brown et al., in review, PLoS On
Digital normalization removes errors
Shotgun data is often (1) high
coverage and (2) biased in coverage.
…here we discard > 95% of data!
Other key points
 Virtually identical contig assembly; scaffolding works
  but is not yet cookie-cutter.

 Digital normalization changes the way de Bruijn graph
  assembly scales from the size of your data set to
  the size of the source sample.

 Always lower memory than assembly: we never
  collect most erroneous k-mers.

 Digital normalization can be done once, and then
  assembly parameter exploration can be done.
Quotable quotes.
Comment: “This looks like a great solution for
  people who can’t afford real computers”.

                     OK, but:

   “Buying ever bigger computers is a great
   solution for people who don’t want to think
                      hard.”

To be less snide: both kinds of scaling are needed,
                      of course.
Why use diginorm?
 Use the cloud to assemble any microbial
 genomes incl. single-cell, many eukaryotic
 genomes, most mRNAseq, and many
 metagenomes.

 Seems to provide leverage on addressing many
 biological or sample prep problems (single-cell &
 genome amplification MDA; metagenome;
 heterozygosity).

 And, well, the general idea of locus-specific
 graph analysis solves lots of things…
Some interim concluding
thoughts
 Digital normalization-like approaches provide a
 path to solving the majority of assembly scaling
 problems, and will enable assembly on current
 cloud computing hardware.
   This is not true for highly diverse metagenome
    environments…
   For soil, we estimate that we need 50 Tbp / gram
    soil. Sigh.

 Biologists and bioinformaticians hate:
   Throwing away data
   Caveats in bioinformatics papers (which reviewers
   like, note)
Streaming error correction.




We can do error trimming of genomic, MDA, transcriptomic,
     metagenomic data in < 2 passes, fixed memory.
   We have just submitted a proposal to adapt Euler or
   Quake-like error correction (e.g. spectral alignment
               problem) to this framework.
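
The slide does not spell out how the trimming works, so the following is only a sketch of one common abundance-based approach, stated here as an assumption: count k-mers, then truncate each read just before the base that introduces its first low-abundance k-mer. For simplicity it makes two full passes and uses an exact `Counter`, so it does not reproduce the "< 2 passes, fixed memory" property claimed above. All function names and parameters are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """First pass: tally every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def trim_read(read, counts, k, min_count):
    """Second pass: cut the read at its first low-abundance k-mer."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < min_count:
            # Keep everything before the last base of the offending k-mer.
            return read[:i + k - 1]
    return read


reads = ["ACGTTGCAAG"] * 5 + ["ACGTTGCATG"]  # last read has one substitution
counts = count_kmers(reads, k=4)
trimmed = [trim_read(r, counts, k=4, min_count=2) for r in reads]
print(trimmed[-1])  # ACGTTGCA
```

Trimming discards sequence rather than repairing it, which is where it differs from the Euler- or Quake-like correction mentioned on the slide.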
Side note: error correction is the
biggest “data” problem left in
sequencing.




        Both for mapping & assembly.
Replication fu
 In December 2011, I met Wes McKinney on a
 train and he convinced me that I should look at
 IPython Notebook.

 This is an interactive Web notebook for data
 analysis…

 Hey, neat! We can use this for replication!
   All of our figures can be regenerated from scratch,
    on an EC2 instance, using a Makefile (data
    pipeline) and IPython Notebook (figure generation).
   Everything is version controlled.
   Honestly not much work, and will be less the next
    time.
So… how'd that go?
 People who already cared thought it was nifty.
       http://ivory.idyll.org/blog/replication-i.html
 Almost nobody else cares ;(
   Presub enquiry to editor: “Be sure that your paper can
    be reproduced.” Uh, please read my letter to the end?
   “Could you improve your Makefile? I want to
    reimplement diginorm in another language and reuse
    your pipeline, but your Makefile is a mess.”
 Incredibly useful, nonetheless. Already part of
  undergraduate and graduate training in my lab;
  helping us and others with next papers; etc. etc. etc.

   Life is way too short to waste on unnecessarily
   replicating your own workflows, much less other
                       people’s.
Advertisement!
 Qingpeng Zhang (QP) will talk about our very
 useful 'khmer' software for efficiently counting
 k-mers.

 Want a simple Python lib for reading & indexing
 FASTA/FASTQ? Check out screed.

  “Better science through superior software.”
Advertisement


Panel on “Should we have voluntary review
       standards for bioinformatics?”

           Tomorrow, 4:30pm.
