Doing next-gen sequencing
   analysis in the cloud.
       C. Titus Brown
       ctb@msu.edu
Acknowledgements
Lab members involved
   Adina Howe (w/Tiedje)
   Jason Pell
   Arend Hintze
   Rosangela Canino-Koning
   Qingpeng Zhang
   Elijah Lowe
   Likit Preeyanon
   Jiarong Guo
   Tim Brom
   Kanchan Pavangadkar
   Eric McDonald
Collaborators
   Jim Tiedje, MSU
   Billie Swalla, UW
   Janet Jansson, LBNL
   Susannah Tringe, JGI
Funding
   USDA NIFA; NSF IOS; BEACON.
“Be the change you want to see”
         We are aggressively open…

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
  (What's a good license??)
 Preprints: on arXiv, q-bio:
  'kmer-percolation arxiv'
  'diginorm arxiv'
The data catastrophe!
 Data set sizes growing faster than compute capacity
  (esp RAM).
 Many biological algorithms don't scale all that well,
  anyway.
 Algorithmically, we want:
   Single-pass.
   Compression approaches (lossy or otherwise).
   Low-memory data structures


 I, personally, think the last thing in the world we need
  is another standalone package: pre-filtering
  approaches.
    “Run our nifty approaches first, then feed into the
Digital normalization


 Suppose you have a dilution factor of A (10) to B (1).
  To get 10x coverage of B you need to get 100x of A!
  Overkill!!

 This 100x will consume disk space and, because of
  errors, memory.
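
The arithmetic on this slide can be written out as a small sketch. The function name and the abundance dictionary are illustrative assumptions, not part of the talk; the numbers are the ones on the slide (A is 10 times as abundant as B).

```python
def coverage_needed(abundances, target_coverage):
    """Coverage each species receives once the rarest one reaches target_coverage.

    abundances: mapping of species name -> relative abundance (hypothetical input).
    """
    rarest = min(abundances.values())
    # Shotgun reads are shared in proportion to abundance, so a species that is
    # r times more abundant than the rarest one receives r times the coverage.
    return {name: target_coverage * a / rarest for name, a in abundances.items()}


print(coverage_needed({"A": 10, "B": 1}, target_coverage=10))
# {'A': 100.0, 'B': 10.0}
```

Everything above 10x for A is redundant for assembly purposes, which is the excess that downsampling aims to remove.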
Downsample based on de Bruijn
graph structure (which can be
derived online)
Digital normalization algorithm

for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        # read adds new coverage: count its k-mers and keep it
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read; its k-mers are never counted

            Note, single pass; fixed memory.
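
A minimal runnable sketch of the loop above, under stated assumptions: the k-mer size, the cutoff, and the helper names (`kmers`, `normalize`) are illustrative, and an exact `Counter` stands in for whatever fixed-size counting structure a real implementation would use. With an exact counter the memory is not fixed; only the single-pass logic is shown here.

```python
from collections import Counter
from statistics import median

K = 4        # k-mer size; tiny for illustration (assumed value)
CUTOFF = 3   # coverage cutoff (assumed value)


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=K, cutoff=CUTOFF):
    """Single pass: keep a read only if its median k-mer count is below cutoff."""
    counts = Counter()  # exact counts; a fixed-size sketch would bound memory
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            kept.append(read)
        # else: discard the read; its k-mers are never counted
    return kept


# Ten copies of one read plus one read from a "rare" region.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
print(len(normalize(reads)))  # 4
```

Three copies of the abundant read are kept before its median k-mer count reaches the cutoff; the rare read is kept because none of its k-mers has been seen.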
Digital normalization is efficient &
effective
 Algorithmic nerdvana!
   • Single pass algorithm;
   • Fixed memory;
   • Cheaper than assembly;
   • Reduces assembly time;
   • Scales assembly memory.




                                 Brown et al., in review, PLoS On
Digital normalization removes errors
Shotgun data is often (1) high
coverage and (2) biased in coverage.
…here we discard > 95% of data!
Other key points
 Virtually identical contig assembly; scaffolding works
  but is not yet cookie-cutter.

 Digital normalization changes the way de Bruijn graph
  assembly scales from the size of your data set to
  the size of the source sample.

 Always lower memory than assembly: we never
  collect most erroneous k-mers.

 Digital normalization can be done once, and then
  assembly parameter exploration can be done.
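
The fixed-memory property depends on counting k-mers in a structure whose size is chosen up front and does not grow with the data. One well-known structure of that kind is the Count-Min Sketch, shown below as a minimal illustration. The slides do not say which structure the lab's software uses, so the class, its parameters, and the hashing scheme are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter with a fixed number of cells, chosen up front."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), salt=bytes([row]), digest_size=8
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is the tightest available estimate and never undercounts.
        return min(self.tables[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for kmer in ["ACGT", "ACGT", "CGTA"]:
    sketch.add(kmer)
print(sketch.count("ACGT") >= 2)  # True
```

Memory stays at `width * depth` cells however many k-mers are added; the price is that counts may be overestimated when distinct k-mers collide.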
Quotable quotes.
Comment: “This looks like a great solution for
  people who can’t afford real computers”.

                     OK, but:

   “Buying ever bigger computers is a great
   solution for people who don’t want to think
                      hard.”

To be less snide: both kinds of scaling are needed,
                      of course.
Why use diginorm?
 Use the cloud to assemble any microbial
 genomes incl. single-cell, many eukaryotic
 genomes, most mRNAseq, and many
 metagenomes.

 Seems to provide leverage on addressing many
 biological or sample prep problems (single-cell &
 genome amplification MDA; metagenome;
 heterozygosity).

 And, well, the general idea of locus specific
 graph analysis solves lots of things…
Some interim concluding
thoughts
 Digital normalization-like approaches provide a
 path to solving the majority of assembly scaling
 problems, and will enable assembly on current
 cloud computing hardware.
   This is not true for highly diverse metagenome
    environments…
   For soil, we estimate that we need 50 Tbp / gram
    soil. Sigh.

 Biologists and bioinformaticians hate:
   Throwing away data
   Caveats in bioinformatics papers (which reviewers
   like, note)
Streaming error correction.




We can do error trimming of genomic, MDA, transcriptomic,
     metagenomic data in < 2 passes, fixed memory.
   We have just submitted a proposal to adapt Euler or
   Quake-like error correction (e.g. spectral alignment
               problem) to this framework.
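
To make the idea of abundance-based error trimming concrete, here is a deliberately simplified sketch. It uses two full passes (count, then trim) and an exact counter, so it does not reproduce the "< 2 passes, fixed memory" behaviour described above; the threshold, k-mer size, and function names are assumptions for illustration only.

```python
from collections import Counter

K = 4          # illustrative k-mer size (assumed)
MIN_COUNT = 2  # k-mers seen fewer times than this are treated as likely errors


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_read(read, counts, k=K, min_count=MIN_COUNT):
    """Truncate read so that it contains only k-mers with trusted abundance."""
    for i, word in enumerate(kmers(read, k)):
        if counts[word] < min_count:
            # k-mers 0..i-1 are trusted and cover bases 0..i+k-2.
            return read[:i + k - 1]
    return read


# Five correct reads and one read with a substitution (A -> T at index 7).
reads = ["ACGTTGCAAG"] * 5 + ["ACGTTGCTAG"]
counts = Counter(w for r in reads for w in kmers(r))

print(trim_read("ACGTTGCTAG", counts))  # ACGTTGC
print(trim_read("ACGTTGCAAG", counts))  # ACGTTGCAAG
```

The erroneous base creates k-mers that occur only once, so the read is cut just before the first of them; correct reads pass through unchanged.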
Side note: error correction is the
biggest “data” problem left in
sequencing.




        Both for mapping & assembly.
Replication fu
 In December 2011, I met Wes McKinney on a
 train and he convinced me that I should look at
 IPython Notebook.

 This is an interactive Web notebook for data
 analysis…

 Hey, neat! We can use this for replication!
   All of our figures can be regenerated from scratch,
    on an EC2 instance, using a Makefile (data
    pipeline) and IPython Notebook (figure generation).
   Everything is version controlled.
   Honestly not much work, and will be less the next
    time.
So… how'd that go?
 People who already cared thought it was nifty.
       http://ivory.idyll.org/blog/replication-i.html
 Almost nobody else cares ;(
   Presub enquiry to editor: “Be sure that your paper can
    be reproduced.” Uh, please read my letter to the end?
   “Could you improve your Makefile? I want to
    reimplement diginorm in another language and reuse
    your pipeline, but your Makefile is a mess.”
 Incredibly useful, nonetheless. Already part of
  undergraduate and graduate training in my lab;
  helping us and others with next papers; etc. etc. etc.

   Life is way too short to waste on unnecessarily
   replicating your own workflows, much less other
                       people’s.
Advertisement!
 Qingpeng Zhang (QP) will talk about our very
 useful 'khmer' software for efficiently counting
 k-mers.

 Want a simple Python lib for reading & indexing
 FASTA/FASTQ? Check out screed.

  “Better science through superior software.”
Advertisement


Panel on “Should we have voluntary review
       standards for bioinformatics?”

           Tomorrow, 4:30pm.

Talk at Bioinformatics Open Source Conference, 2012

  • 1. Doing next-gen sequencing analysis in the cloud. C. Titus Brown ctb@msu.edu
  • 2. Acknowledgements Lab members involved Collaborators  Adina Howe (w/Tiedje)  Jim Tiedje, MSU  Jason Pell  ArendHintze  Billie Swalla, UW  RosangelaCanino-  Janet Jansson, LBNL Koning  Qingpeng Zhang  Susannah Tringe, JGI  Elijah Lowe  LikitPreeyanon Funding  JiarongGuo  Tim Brom USDA NIFA; NSF IOS;  KanchanPavangadkar BEACON.  Eric McDonald
  • 3.
  • 4. “Be the change you want to see” We are aggressivelyopen… Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog („titus brown blog‟)  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/interests.html (What‟s a good license??)  Preprints: on arXiv, q-bio: „kmer-percolation arxiv‟ „diginormarxiv‟
  • 5. The data catastrophe!  Data set sizes growing faster than compute capacity (esp RAM).  Many biological algorithms don‟t scale all that well, anyway.  Algorithmically, we want:  Single-pass.  Compression approaches (lossy or otherwise).  Low-memory data structures  I, personally, think the last thing in the world we need is another standalone package: pre-filtering approaches. “Run our nifty approaches first, then feed into the
  • 6. Digital normalization. Suppose you have a dilution factor of A (10) to B (1), i.e. A is ten times as abundant as B in the sample. To get 10x coverage of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
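The arithmetic on this slide can be written out as a small calculation. This is only an illustrative sketch: the function name is hypothetical, and it assumes reads are sampled in proportion to abundance and that A and B have comparable genome sizes, neither of which the slide states.

```python
def coverage_of_abundant(target_cov_rare, abundance_ratio):
    """Coverage the abundant species reaches once the rare one hits its target.

    abundance_ratio is abundant:rare, e.g. 10 for the slide's A (10) to B (1).
    Assumes sampling proportional to abundance and similar genome sizes.
    """
    return target_cov_rare * abundance_ratio


# Slide example: 10x coverage of B with A ten times more abundant.
cov_a = coverage_of_abundant(target_cov_rare=10, abundance_ratio=10)
print(f"A is sequenced to {cov_a}x to reach 10x of B")
```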
  • 7. Downsample based on de Bruijn graph structure (which can be derived online)
  • 8. Digital normalization algorithm: for read in dataset: if median_kmer_count(read) < CUTOFF: update_kmer_counts(read); save(read); else: discard read. Note: single pass; fixed memory.
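The pseudocode on this slide can be made runnable in a few lines of Python. The following is a minimal sketch, not the lab's implementation: the helper names, the tiny k-mer size, and the toy reads are all assumptions chosen for illustration, and it counts k-mers exactly with a `Counter`, so unlike the approach described in the talk it does not run in fixed memory.

```python
from collections import Counter
from statistics import median

K = 4       # k-mer size (illustrative only; real data would use a larger k)
CUTOFF = 3  # keep a read only while its median k-mer count is below this


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=K, cutoff=CUTOFF):
    """Single pass: yield reads whose median k-mer count so far is below cutoff."""
    counts = Counter()  # exact counts; an assumption, NOT a fixed-memory structure
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # update_kmer_counts(read)
            yield read                 # save(read)
        # else: discard the read; its region is already covered to the cutoff


# Toy data: one sequence at high coverage, one at low coverage.
reads = ["ACGTTGCAAG"] * 10 + ["TTGACCATGG"] * 2
kept = list(normalize(reads))
print(len(kept), "of", len(reads), "reads kept")
```

In this toy run the high-coverage sequence is cut down to the cutoff while both copies of the low-coverage sequence survive, which is the behaviour the slide describes.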
  • 9. Digital normalization is efficient & effective • Single pass algorithm • Fixed memory; Algorithmic nerdvana! • Cheaper than assembly; • Reduces assembly time; • Scales assembly memory. Brown et al., in review, PLoS On
  • 11. Shotgun data is often (1) high coverage and (2) biased in coverage.
  • 12. …here we discard > 95% of data!
  • 13. Other key points. Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter. Digital normalization changes the way de Bruijn graph assembly scales, from the size of your data set to the size of the source sample. Always lower memory than assembly: we never collect most erroneous k-mers. Digital normalization can be done once, and then assembly parameter exploration can be done.
  • 14. Quotable quotes. Comment: “This looks like a great solution for people who can’t afford real computers”. OK, but: “Buying ever bigger computers is a great solution for people who don’t want to think hard.” To be less snide: both kinds of scaling are needed, of course.
  • 15. Why use diginorm? Use the cloud to assemble any microbial genomes incl. single-cell, many eukaryotic genomes, most mRNAseq, and many metagenomes. Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification MDA; metagenome; heterozygosity). And, well, the general idea of locus-specific graph analysis solves lots of things…
  • 16. Some interim concluding thoughts. Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware. This is not true for highly diverse metagenome environments… For soil, we estimate that we need 50 Tbp / gram soil. Sigh. Biologists and bioinformaticians hate: throwing away data; caveats in bioinformatics papers (which reviewers like, note).
  • 17. Streaming error correction. We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment problem) to this framework.
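One simple way to picture k-mer-based error trimming is to cut each read at the first k-mer that is rare in the data set, since a sequencing error tends to create k-mers seen only once. The sketch below is an assumption for illustration, not the method from the talk: the function names and thresholds are hypothetical, it makes two full passes rather than fewer than two, and it uses exact counts rather than a fixed-memory structure.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Pass 1: count every k-mer across all reads (exact counts, not fixed memory)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def trim_read(read, counts, k, min_count):
    """Pass 2: keep the prefix covered by k-mers seen at least min_count times."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < min_count:
            # The last trusted k-mer starts at i - 1 and ends at base i + k - 2.
            return read[:i + k - 1] if i > 0 else ""
    return read


# Toy data: five correct copies and one read with a single-base error near the end.
good, bad = "ACGTTGCAAG", "ACGTTGCATG"
counts = count_kmers([good] * 5 + [bad], k=4)
print(trim_read(bad, counts, k=4, min_count=2))
print(trim_read(good, counts, k=4, min_count=2))
```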
  • 18. Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
  • 19. Replication fu. In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook. This is an interactive Web notebook for data analysis… Hey, neat! We can use this for replication! All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation). Everything is version controlled. Honestly not much work, and will be less the next time.
  • 20.
  • 21. So… how’d that go? People who already cared thought it was nifty. http://ivory.idyll.org/blog/replication-i.html Almost nobody else cares ;( Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please read my letter to the end? “Could you improve your Makefile? I want to reimplement diginorm in another language and reuse your pipeline, but your Makefile is a mess.” Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next papers; etc. etc. etc. Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
  • 22. Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald. Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI. Funding: USDA NIFA; NSF IOS; BEACON.
  • 23. Advertisement! Qingpeng Zhang (QP) will talk about our very useful ‘khmer’ software for efficiently counting k-mers. Want a simple Python lib for reading & indexing FASTA/FASTQ? Check out screed. “Better science through superior software.”
  • 24. Advertisement Panel on “Should we have voluntary review standards for bioinformatics?” Tomorrow, 4:30pm.
  • 25. We are aggressively open. Everything discussed here: Code: github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog (‘titus brown blog’). Twitter: @ctitusbrown. Grants on Lab Web site: http://ged.msu.edu/interests.html (What’s a good license??). Preprints: on arXiv, q-bio: ‘kmer-percolation arxiv’, ‘diginorm arxiv’.