SlideShare a Scribd company logo
1 of 39
Streaming and Compression Approaches for
Terascale Biological Sequence Data Analysis

               C. Titus Brown
             Assistant Professor
           CSE, MMG, BEACON
          Michigan State University
              September 2012
               ctb@msu.edu
Outline
 Acknowledgements


 Big Data and next-gen sequence analysis


 Sweeping generalizations about physics and
 biology

 Physics ain‟t biology, and vice versa
Acknowledgements
Lab members involved        Collaborators
   Adina Howe (w/Tiedje)    Jim Tiedje, MSU
   Jason Pell
   Arend Hintze             Janet Jansson, LBNL
   Rosangela Canino-        Susannah Tringe, JGI
    Koning
   Qingpeng Zhang
   Elijah Lowe
   Likit Preeyanon         Funding
   Jiarong Guo
   Tim Brom                USDA NIFA; NSF IOS;
   Kanchan Pavangadkar          BEACON.
   Eric McDonald
We practice open science!
See blog post accompanying talk: „titus brown blog‟

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio:
  „diginorm arxiv‟
Soil is full of uncultured microbes




                            Randy Jackson
Soil contains thousands to millions of species
                                    (“Collector’s curves” of ~species)

                                 99% of microbes cannot easily be cultured in the lab.
Number of OTUs




                   2000

                   1800

                   1600
                                                                                                          Iowa Corn
                   1400
                                                                                                          Iowa_Native_Prairie
                   1200                                                                                   Kansas Corn
                   1000                                                                                   Kansas_Native_Prairie
                                                                                                          Wisconsin Corn
                    800
                                                                                                          Wisconsin Native Prairie
                    600                                                                                   Wisconsin Restored Prairie
                    400                                                                                   Wisconsin Switchgrass

                    200

                      0
                          100 600 1100 1600 2100 2600 3100 3600 4100 4600 5100 5600 6100 6600 7100 7600 8100



                                                        Number of Sequences
Shotgun metagenomics
 Collect samples;


 Extract DNA;


 Feed into sequencer;


 Computationally analyze.




                      Wikipedia: Environmental shotgun sequencing.p
Task: assemble original text from random, error
prone observations
         It was the Gest of times, it was the wor
            , it was the worst of timZs, it was the
            isdom, it was the age of foolisXness
           , it was the worVt of times, it was the
         mes, it was Ahe age of wisdom, it was th
          It was the best of times, it Gas the wor
         mes, it was the age of witdom, it was th
             isdom, it was tIe age of foolishness



It was the best of times, it was the worst of times, it was the
         age of wisdom, it was the age of foolishness
Actual coverage varies widely from
the average.
Assembly via de Bruijn graphs – k-mer
overlaps




                          J.R. Miller et al. / Genomics (2010)
K-mer graph (k=14)




    Single nucleotide variations cause long branches;
                They don‟t rejoin quickly.
Reads vs edges (memory) in de Bruijn graphs




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


© The Author 2011. Published by Oxford University Press. All rights reserved. For
 Permissions, please email: journals.permissions@oup.com
The scale of the problem is stunning.
 I estimate a worldwide capacity for DNA sequencing
  of 15 petabases/yr (it‟s probably larger).
 Individual labs can generate ~100 Gbp in ~1 week for
  $10k.
 This sequencing is at a boutique level:
   Sequencing formats are semi-standard.
   Basic analysis approaches are ~80% cookbook.
   Every biological prep, problem, and analysis is different.
 Traditionally, biologists receive no training in
  computation. (And computational people receive no
  training in biology :)
 …and our computational infrastructure is optimizing
  for high performance computing, not high throughput.
My problems are also very
annoying…
 Est ~50 Tbp to comprehensively sample the
  microbial composition of a gram of soil.
 Currently we have approximately 2 Tbp spread
  across 9 soil samples, for one project; 1 Tbp
  across 10 samples for another.

 Need 3 TB RAM on single chassis to do
  assembly of 300 Gbp.
 …estimate 500 TB RAM for 50 Tbp of sequence.


                That just won‟t do.
1. Compressible de Bruijn graphs




          Each node represents a 14-mer;
     Links between each node are 13-mer overlaps
Can store implicit de Bruijn graphs in
   a Bloom filter
                                                               AGTCGG
       AGTCGGCATGAC
       AGTCGG                                                     …C
        GTCGGC
         TCGGCA                                                   …A
          CGGCAT
           GGCATG
                                                                  …T
            GCATGA
             CATGAC
                                                                  …G

                                                                  …A
                                      Bloom filter
                                                                  …C
This allows compression of graphs at the expense of false positive nodes/edges.
False positives introduce false
nodes/edges.
              When does this start to distort the graph?
Global graph structure is retained past
18% FPR


              1%
                          5%




              10%        15%




                               Jason Pell & Arend Hintze
Equivalent to bond percolation problem; percolation
threshold independent of k (?)




                                       Jason Pell & Arend Hintze
This data structure is strikingly
efficient for storing sparse k-mer
graphs.
      “Exact” is for best possible information-theoretical storage.




                                                Jason Pell & Arend Hintze
We implemented graph partitioning
     on top of this probabilistic de Bruijn
     graph.


Split reads into “bins”
 belonging to
 different source
 species.
Can do this based
 almost entirely on
 connectivity of
 sequences.
2. Online, streaming, lossy
compression.
      Much of next-gen sequencing is redundant.
Uneven coverage => even more
redundancy


                         Suppose you have a
                      dilution factor of A (10) to
                      B(1). To get 10x of B you
                        need to get 100x of A!
                                Overkill!!

                       This 100x will consume
                      disk space and, because
                         of errors, memory.
Downsample based on de Bruijn graph
structure; this can be derived via an online
algorithm.
Digital normalization algorithm

for read in dataset:
  if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
  else:
        # discard read

              Note, single pass; fixed memory.
Digital normalization retains information, while
discarding data and errors
For soil… what do we assemble?

                                             Predicted
     Total                      % Reads                     rplb
               Total Contigs                  protein
   Assembly                    Assembled                   genes
                                              coding


    2.5 bill     4.5 mill         19%        5.3 mill       391


    3.5 bill     5.9 mill         22%        6.8 mill       466

                                   This estimates number of species ^
    Putting it in perspective:
    Total equivalent of ~1200 bacterial genomes          Adina Howe
    Human genome ~3 billion bp
Concluding thoughts
 Our approaches provide significant and
  substantial practical and theoretical leverage to
  one of the most challenging current problems in
  computational biology: assembly.
 They provide a path to the future:
   Many-core compatible; distributable?
   Decreased memory footprint => cloud computing
    can be used for many analyses.
   At an algorithmic level, provide a noise-filtering
    solution for most of the current sequencing Big Data
    problems.
 They are in use, ~dozens of labs using digital
  normalization.
 …although we‟re still in the process of publishing
Physics ain‟t biology


 The following observations are for discussion…
they are not-so-casual observations from a lifetime
           of interacting with physicists.

     (Apologies in advance for the sweeping
                generalizations.)
Important note: I don‟t hate physicists!
Significant life events involving physicists:

Birth – Gerry Brown
First UNIX account – Mark Galassi
First publication – w/Chris Adami
Grad school plans – Hans Bethe et al.
Earthshine research (~8 pubs) – w/Steve Koonin and
                                   Phil Goode
2nd Favorite publication – w/Curtis Callan, Jr.

              I am very physicist-positive!
1. Models play a very different
role.
 Physics models are often predictive and
 constraining.
   Model specifies dynamics or interaction.
   Make specific measurements to obtain initial
    conditions.
   Model can then be used to predict fine-grained
    outcomes.

 Biology models can rarely be built in the first
 place…
   Models are dominated by unknowns.
   In a few cases, can be used to determine
   sufficiency of knowledge (“the observations can be
   explained by our model”); this does not mean the
   model is correct, merely that it could be.
Endomesoderm network
 Approximately 15 years and probably 200 man-
   years of research to assemble a map of gene
  interactions for the first 30 hours of sea urchin
                    development.




                                http://sugp.caltech.edu/endomes/
http://sugp.caltech.edu/endomes/
2. Little or no tradition of computation
in biology
 Until ~last decade, not too much in the way of big data.
 Models are rarely built for the purpose of understanding
  computational data, although that is changing.
 Ecological and evolutionary models are regarded with
  suspicion: guilty until proven innocent.
 Essential zero computational training at UG/G (although
  some math).

 “Sick” culture of computation in biology:
   Development of computational methods not respected as
    independent scientific endeavor in biology.
   Biologists want push-button software that “just works”.
   Sophisticated evaluation/validation of software by users is
    rare.

  (It is hard for me to explain to biologists how big a problem this
  is.)
3. Biology is built on facts, not theory.
 Experience with Callan:
   Constrained optimization of DNA binding model to 48
    known CRP binding sites => inability to eliminate 300-
    3000 extra sites in E. coli genome.
   Ohmigod their binding signature is preserved by
    evolution => they‟re probably real! How can this be!?
   …well, it turns out we don‟t know that much about E.
    coli.

 A nice damning quote from Mark Galassi:
 “Biology and bioinformatics seem interesting. Is there
any way I can take part in the research without learning
                     all the details?”
 NO. Biology is all about the details! The more the
                          better!
My career path
 Undergrad in Math (Reed)
   Research on evolution model (Avida) ~1992
   Earthshine observations of global albedo ~1994
 PhD in Molecular Developmental Biology
 (Caltech)
   Molecular biology, genomics, gene regulation
    ~1997-2008
   Bioinformatics ~2000-
 Faculty position in CSE and Microbiology
 (MSU), 2008
   Molecular developmental biology
   Bioinformatics
   Metagenomics & next-gen sequence analysis more
   generally
My career path
 Undergrad in Math (Reed)
   Research on evolution model (Avida) ~1992
   Earthshine observations of global albedo ~1994
 PhD in Molecular Developmental Biology (Caltech)
   Molecular biology, genomics, gene regulation ~1997-
    2008
   Bioinformatics ~2000-
 Faculty position in CSE and Microbiology
 (MSU), 2008
   Molecular developmental biology
   Bioinformatics
   Metagenomics & next-gen sequence analysis more
    generally
   Moving towards integration of data + modeling…
Concluding thoughts on this stuff
 Biologists simply don‟t trust models and data
  analysis approaches that come from “outside”
  biology.
 They‟re not necessarily wrong!
 Physicists can bring an important skill set and
  attitude to biological research, but their
  knowledge is useless. They have to meet the
  biology more than halfway.
 Biologists need more cross-training so we don‟t
  retrace the same softwareany of this, I‟m happy data
             But if you disagree with
                                      development, to chat.
  analysis, and modeling mistakes that physics et
  al. has spent 30 years figuring out.

More Related Content

Viewers also liked

Approximate Thin Plate Spline Mappings
Approximate Thin Plate Spline MappingsApproximate Thin Plate Spline Mappings
Approximate Thin Plate Spline MappingsArchzilon Eshun-Davies
 
The Seven Habits of Highly Ineffective Global Business People
The Seven Habits of Highly Ineffective Global Business PeopleThe Seven Habits of Highly Ineffective Global Business People
The Seven Habits of Highly Ineffective Global Business PeopleKegler Brown Hill + Ritter
 
2011 rwc webquest
2011 rwc webquest2011 rwc webquest
2011 rwc webquestTakahe One
 
Ondernemen kwf 26 nov 2012
Ondernemen kwf 26 nov 2012Ondernemen kwf 26 nov 2012
Ondernemen kwf 26 nov 2012Piet van Vugt
 
Nilai un dan_us_2010_ips
Nilai un dan_us_2010_ipsNilai un dan_us_2010_ips
Nilai un dan_us_2010_ips@rtNya
 
Navigating Your Way to Business Success in India
Navigating Your Way to Business Success in IndiaNavigating Your Way to Business Success in India
Navigating Your Way to Business Success in IndiaKegler Brown Hill + Ritter
 
Printemps
PrintempsPrintemps
PrintempsJURY
 
Intellisoft introductionrecruitment l120219
Intellisoft introductionrecruitment l120219Intellisoft introductionrecruitment l120219
Intellisoft introductionrecruitment l120219Sham Yemul
 
There is always_a_better_way
There is always_a_better_wayThere is always_a_better_way
There is always_a_better_wayDaniel Chua
 
Nice weekend with music
Nice weekend with musicNice weekend with music
Nice weekend with musicDaniel Chua
 
Romairone, Gregorio
Romairone, GregorioRomairone, Gregorio
Romairone, GregorioGregorio
 
وظائف القيادة
وظائف القيادةوظائف القيادة
وظائف القيادةAhmad Darwish
 
Great Photography
Great  PhotographyGreat  Photography
Great Photographylewisj2111
 

Viewers also liked (20)

Croisements11
Croisements11Croisements11
Croisements11
 
Approximate Thin Plate Spline Mappings
Approximate Thin Plate Spline MappingsApproximate Thin Plate Spline Mappings
Approximate Thin Plate Spline Mappings
 
The Seven Habits of Highly Ineffective Global Business People
The Seven Habits of Highly Ineffective Global Business PeopleThe Seven Habits of Highly Ineffective Global Business People
The Seven Habits of Highly Ineffective Global Business People
 
2011 rwc webquest
2011 rwc webquest2011 rwc webquest
2011 rwc webquest
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Deadlocks
DeadlocksDeadlocks
Deadlocks
 
Ondernemen kwf 26 nov 2012
Ondernemen kwf 26 nov 2012Ondernemen kwf 26 nov 2012
Ondernemen kwf 26 nov 2012
 
Nilai un dan_us_2010_ips
Nilai un dan_us_2010_ipsNilai un dan_us_2010_ips
Nilai un dan_us_2010_ips
 
Navigating Your Way to Business Success in India
Navigating Your Way to Business Success in IndiaNavigating Your Way to Business Success in India
Navigating Your Way to Business Success in India
 
Printemps
PrintempsPrintemps
Printemps
 
Portfolio
PortfolioPortfolio
Portfolio
 
Sdarticle3
Sdarticle3Sdarticle3
Sdarticle3
 
Intellisoft introductionrecruitment l120219
Intellisoft introductionrecruitment l120219Intellisoft introductionrecruitment l120219
Intellisoft introductionrecruitment l120219
 
There is always_a_better_way
There is always_a_better_wayThere is always_a_better_way
There is always_a_better_way
 
Nice weekend with music
Nice weekend with musicNice weekend with music
Nice weekend with music
 
Romairone, Gregorio
Romairone, GregorioRomairone, Gregorio
Romairone, Gregorio
 
Futura+ Idealcombi
Futura+ IdealcombiFutura+ Idealcombi
Futura+ Idealcombi
 
وظائف القيادة
وظائف القيادةوظائف القيادة
وظائف القيادة
 
Great Photography
Great  PhotographyGreat  Photography
Great Photography
 
Ramifications of New USEPA
Ramifications of New USEPARamifications of New USEPA
Ramifications of New USEPA
 

Similar to 2012 XLDB talk

2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattlec.titus.brown
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizonac.titus.brown
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptxc.titus.brown
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)Dag Endresen
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...c.titus.brown
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinarc.titus.brown
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Keith Bradnam
 
When is a genome finished?
When is a genome finished? When is a genome finished?
When is a genome finished? Keith Bradnam
 
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)Dag Endresen
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Keith Bradnam
 
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)Dag Endresen
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotesc.titus.brown
 
Inference and informatics in a 'sequenced' world
Inference and informatics in a 'sequenced' worldInference and informatics in a 'sequenced' world
Inference and informatics in a 'sequenced' worldJoe Parker
 

Similar to 2012 XLDB talk (20)

2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
 
When is a genome finished?
When is a genome finished? When is a genome finished?
When is a genome finished?
 
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
Inference and informatics in a 'sequenced' world
Inference and informatics in a 'sequenced' worldInference and informatics in a 'sequenced' world
Inference and informatics in a 'sequenced' world
 

More from c.titus.brown

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynotec.titus.brown
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-reviewc.titus.brown
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcastc.titus.brown
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbugc.titus.brown
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenomec.titus.brown
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformaticsc.titus.brown
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streamingc.titus.brown
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 

More from c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

2012 XLDB talk

  • 1. Streaming and Compression Approaches for Terascale Biological Sequence Data Analysis C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University September 2012 ctb@msu.edu
  • 2. Outline  Acknowledgements  Big Data and next-gen sequence analysis  Sweeping generalizations about physics and biology  Physics ain‟t biology, and vice versa
  • 3. Acknowledgements Lab members involved Collaborators  Adina Howe (w/Tiedje)  Jim Tiedje, MSU  Jason Pell  Arend Hintze  Janet Jansson, LBNL  Rosangela Canino-  Susannah Tringe, JGI Koning  Qingpeng Zhang  Elijah Lowe  Likit Preeyanon Funding  Jiarong Guo  Tim Brom USDA NIFA; NSF IOS;  Kanchan Pavangadkar BEACON.  Eric McDonald
  • 4.
  • 5. We practice open science! See blog post accompanying talk: „titus brown blog‟ Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/interests.html  Preprints: on arXiv, q-bio: „diginorm arxiv‟
  • 6. Soil is full of uncultured microbes Randy Jackson
  • 7. Soil contains thousands to millions of species (“Collector’s curves” of ~species) 99% of microbes cannot easily be cultured in the lab. Number of OTUs 2000 1800 1600 Iowa Corn 1400 Iowa_Native_Prairie 1200 Kansas Corn 1000 Kansas_Native_Prairie Wisconsin Corn 800 Wisconsin Native Prairie 600 Wisconsin Restored Prairie 400 Wisconsin Switchgrass 200 0 100 600 1100 1600 2100 2600 3100 3600 4100 4600 5100 5600 6100 6600 7100 7600 8100 Number of Sequences
  • 8. Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. Wikipedia: Environmental shotgun sequencing.p
  • 9. Task: assemble original text from random, error prone observations It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 10. Actual coverage varies widely from the average.
  • 11. Assembly via de Bruijn graphs – k-mer overlaps J.R. Miller et al. / Genomics (2010)
  • 12. K-mer graph (k=14) Single nucleotide variations cause long branches; They don‟t rejoin quickly.
  • 13. Reads vs edges (memory) in de Bruijn graphs Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 14. The scale of the problem is stunning.  I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it‟s probably larger).  Individual labs can generate ~100 Gbp in ~1 week for $10k.  This sequencing is at a boutique level:  Sequencing formats are semi-standard.  Basic analysis approaches are ~80% cookbook.  Every biological prep, problem, and analysis is different.  Traditionally, biologists receive no training in computation. (And computational people receive no training in biology :)  …and our computational infrastructure is optimizing for high performance computing, not high throughput.
  • 15. My problems are also very annoying…  Est ~50 Tbp to comprehensively sample the microbial composition of a gram of soil.  Currently we have approximately 2 Tbp spread across 9 soil samples, for one project; 1 Tbp across 10 samples for another.  Need 3 TB RAM on single chassis to do assembly of 300 Gbp.  …estimate 500 TB RAM for 50 Tbp of sequence. That just won‟t do.
  • 16. 1. Compressible de Bruijn graphs Each node represents a 14-mer; Links between each node are 13-mer overlaps
  • 17. Can store implicit de Bruijn graphs in a Bloom filter AGTCGG AGTCGGCATGAC AGTCGG …C GTCGGC TCGGCA …A CGGCAT GGCATG …T GCATGA CATGAC …G …A Bloom filter …C This allows compression of graphs at the expense of false positive nodes/edges.
  • 18. False positives introduce false nodes/edges. When does this start to distort the graph?
  • 19. Global graph structure is retained past 18% FPR 1% 5% 10% 15% Jason Pell & Arend Hintze
  • 20. Equivalent to bond percolation problem; percolation threshold independent of k (?) Jason Pell & Arend Hintze
  • 21. This data structure is strikingly efficient for storing sparse k-mer graphs. “Exact” is for best possible information-theoretical storage. Jason Pell & Arend Hintze
  • 22. We implemented graph partitioning on top of this probabilistic de Bruijn graph. Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences.
  • 23. 2. Online, streaming, lossy compression. Much of next-gen sequencing is redundant.
  • 24. Uneven coverage => even more redundancy Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  • 25. Downsample based on de Bruijn graph structure; this can be derived via an online algorithm.
  • 26. Digital normalization algorithm for read in dataset: if estimated_coverage(read) < CUTOFF: update_kmer_counts(read) save(read) else: # discard read Note, single pass; fixed memory.
  • 27. Digital normalization retains information, while discarding data and errors
  • 28. For soil… what do we assemble? Predicted Total % Reads rplb Total Contigs protein Assembly Assembled genes coding 2.5 bill 4.5 mill 19% 5.3 mill 391 3.5 bill 5.9 mill 22% 6.8 mill 466 This estimates number of species ^ Putting it in perspective: Total equivalent of ~1200 bacterial genomes Adina Howe Human genome ~3 billion bp
  • 29. Concluding thoughts  Our approaches provide significant and substantial practical and theoretical leverage to one of the most challenging current problems in computational biology: assembly.  They provide a path to the future:  Many-core compatible; distributable?  Decreased memory footprint => cloud computing can be used for many analyses.  At an algorithmic level, provide a noise-filtering solution for most of the current sequencing Big Data problems.  They are in use, ~dozens of labs using digital normalization.  …although we‟re still in the process of publishing
  • 30. Physics ain‟t biology The following observations are for discussion… they are not-so-casual observations from a lifetime of interacting with physicists. (Apologies in advance for the sweeping generalizations.)
  • 31. Important note: I don‟t hate physicists! Significant life events involving physicists: Birth – Gerry Brown First UNIX account – Mark Galassi First publication – w/Chris Adami Grad school plans – Hans Bethe et al. Earthshine research (~8 pubs) – w/Steve Koonin and Phil Goode 2nd Favorite publication – w/Curtis Callan, Jr. I am very physicist-positive!
  • 32. 1. Models play a very different role.  Physics models are often predictive and constraining.  Model specifies dynamics or interaction.  Make specific measurements to obtain initial conditions.  Model can then be used to predict fine-grained outcomes.  Biology models can rarely be built in the first place…  Models are dominated by unknowns.  In a few cases, can be used to determine sufficiency of knowledge (“the observations can be explained by our model”); this does not mean the model is correct, merely that it could be.
  • 33. Endomesoderm network Approximately 15 years and probably 200 man- years of research to assemble a map of gene interactions for the first 30 hours of sea urchin development. http://sugp.caltech.edu/endomes/
  • 35. 2. Little or no tradition of computation in biology  Until ~last decade, not too much in the way of big data.  Models are rarely built for the purpose of understanding computational data, although that is changing.  Ecological and evolutionary models are regarded with suspicion: guilty until proven innocent.  Essential zero computational training at UG/G (although some math).  “Sick” culture of computation in biology:  Development of computational methods not respected as independent scientific endeavor in biology.  Biologists want push-button software that “just works”.  Sophisticated evaluation/validation of software by users is rare. (It is hard for me to explain to biologists how big a problem this is.)
  • 36. 3. Biology is built on facts, not theory.  Experience with Callan:  Constrained optimization of DNA binding model to 48 known CRP binding sites => inability to eliminate 300- 3000 extra sites in E. coli genome.  Ohmigod their binding signature is preserved by evolution => they‟re probably real! How can this be!?  …well, it turns out we don‟t know that much about E. coli.  A nice damning quote from Mark Galassi: “Biology and bioinformatics seem interesting. Is there any way I can take part in the research without learning all the details?” NO. Biology is all about the details! The more the better!
  • 37. My career path  Undergrad in Math (Reed)  Research on evolution model (Avida) ~1992  Earthshine observations of global albedo ~1994  PhD in Molecular Developmental Biology (Caltech)  Molecular biology, genomics, gene regulation ~1997-2008  Bioinformatics ~2000-  Faculty position in CSE and Microbiology (MSU), 2008  Molecular developmental biology  Bioinformatics  Metagenomics & next-gen sequence analysis more generally
  • 38. My career path  Undergrad in Math (Reed)  Research on evolution model (Avida) ~1992  Earthshine observations of global albedo ~1994  PhD in Molecular Developmental Biology (Caltech)  Molecular biology, genomics, gene regulation ~1997- 2008  Bioinformatics ~2000-  Faculty position in CSE and Microbiology (MSU), 2008  Molecular developmental biology  Bioinformatics  Metagenomics & next-gen sequence analysis more generally  Moving towards integration of data + modeling…
  • 39. Concluding thoughts on this stuff  Biologists simply don‟t trust models and data analysis approaches that come from “outside” biology.  They‟re not necessarily wrong!  Physicists can bring an important skill set and attitude to biological research, but their knowledge is useless. They have to meet the biology more than halfway.  Biologists need more cross-training so we don‟t retrace the same softwareany of this, I‟m happy data But if you disagree with development, to chat. analysis, and modeling mistakes that physics et al. has spent 30 years figuring out.

Editor's Notes

  1. High coverage is essential.