UC BERKELEY
[Chart: $K per genome, log-scale axis from $100,000.0 down to $0.1, 2001 - 2014]
Emperor of All Maladies, page 464
“Computer Scientists May Have What It Takes to Help Cure Cancer,”
David Patterson, New York Times, 12/5/2011
[Figure labels: reference genome, consensus, reconstructed genome, variant]
Genetic Data Processing Pipeline:
Reads -> Read Alignment -> SNP Calling -> Structural Variant Detection -> Reconstructed Genome
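The consensus and SNP calling stages named above can be illustrated with a toy sketch. It assumes the reads have already been aligned (each read comes with a start position on the reference), uses a simple majority vote per position, and ignores indels, base qualities, and structural variants. The function names and the example sequences are made up for illustration; production SNP callers rely on statistical models rather than a plain vote.

```python
from collections import Counter


def consensus(reference, aligned_reads):
    """Majority-vote consensus from reads given as (start, sequence) pairs.

    Positions not covered by any read fall back to the reference base.
    """
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for i, base in enumerate(seq):
            pos = start + i
            if 0 <= pos < len(reference):
                piles[pos][base] += 1
    return "".join(
        pile.most_common(1)[0][0] if pile else ref_base
        for ref_base, pile in zip(reference, piles)
    )


def call_snps(reference, consensus_seq):
    """Report (position, reference base, consensus base) where the two differ."""
    return [
        (pos, ref_base, cons_base)
        for pos, (ref_base, cons_base) in enumerate(zip(reference, consensus_seq))
        if ref_base != cons_base
    ]


# Hypothetical toy data: three reads agree on an A where the reference has a T.
REFERENCE = "ACGTACGT"
READS = [(0, "ACGA"), (2, "GAAC"), (3, "AACG"), (4, "ACGT")]

reconstructed = consensus(REFERENCE, READS)
snps = call_snps(REFERENCE, reconstructed)
print(reconstructed, snps)
```

In this toy input, position 3 is covered by three reads that all carry an A, so the vote overrides the reference T and the position is reported as a variant.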
Malachi Griffith, Washington University, August 19, 2012,
“Cancer genome and transcriptome sequencing – analysis challenges and bottlenecks”
[Figure annotation: 119th step]
“Computational science: ...Error…why scientific programming does not compute,”
by Zeeya Merali, 13 October 2010, Nature 467, 775-777
UC Students/Post-Docs
 – Ma’ayan Bresler
 – Kristal Curtis
 – Jesse Liptrap
 – Sara Sheehan
 – Ameet Talwalkar
 – Jonathan Terhorst
 – Richard Xia
 – Matei Zaharia
 – Yuchen Zhang
External
 – Bill Bolosky (MS/MSR)
 – Mishali Naik (Intel)
 – Paolo Narvaez (Intel)
 – Ravi Pandya (MS)
 – Abirami Prabhakaran (Intel)
 – Taylor Sittler (UCSF)
 – Gans Srinivasa (Intel)
 – Arun Wiita (UCSF)
Faculty
 – Armando Fox
 – Michael Jordan
 – Anthony Joseph
 – David Patterson
 – Satish Rao
 – Scott Shenker
 – Yun Song
 – Ion Stoica
Expertise
 – Computational Biology/Medicine
 – Machine Learning
 – Systems
• 2011-2016
• Berkeley Data Analysis Stack release as Open Source
[Diagram labels: Adaptive/Active Machine Learning and Analytics; Massive and Diverse Data; CrowdSourcing/Human Computation; Cloud Computing]
[Figure labels: genome, read]

Seed   Positions
AAAA   0, 8
ACCT   4, 16, 24
GTGA   12, 20
…      …

Aligner     % Aligned   % Error   Reads/sec   Time (hours)
Bowtie2     84%         0         14,400      22
BWA         87%         0.31      9,000       35
Novoalign   89%         0.21      4,260       73
SOAP2       79%         0         19,500      16
SNAP        87%         0         189,000     2

http://snap.cs.berkeley.edu/
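The seed table above suggests a hash index from short seeds to genome positions. The following is a minimal sketch of that idea, not SNAP's actual implementation: the toy genome is invented so that 4-base seeds sampled every 4 positions reproduce the three table rows, and the stride, function names, and voting scheme are all assumptions made for illustration.

```python
from collections import defaultdict


def build_seed_index(genome, seed_len, stride=1):
    """Map each seed (substring of length seed_len) to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - seed_len + 1, stride):
        index[genome[pos:pos + seed_len]].append(pos)
    return dict(index)


def candidate_locations(read, index, seed_len):
    """Look up every seed of the read and vote for implied read start positions.

    Returns (start, votes) pairs, most-voted first. A real aligner would then
    verify each candidate by comparing the full read against the genome.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset
            if start >= 0:
                votes[start] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical genome chosen to match the seed table on the slide.
TOY_GENOME = "AAAAACCTAAAAGTGAACCTGTGAACCT"
seed_index = build_seed_index(TOY_GENOME, seed_len=4, stride=4)
hits = candidate_locations("GTGAACCT", seed_index, seed_len=4)
print(seed_index)
print(hits)
```

Each lookup is a constant-time dictionary access, so the cost of finding candidates depends on the number of seeds in the read and the number of positions per seed rather than on the genome length.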
1. Create easy-to-use, fast, accurate genetic analysis pipelines
[Map: TCGA centers by location; markers labeled Genome Center, Proteome Center, Analysis Center, Sequencing Center, Biospecimen Core, Data Coordinating Center, Data Center]

TCGA Centers (institutions shown on the map):
 – Baylor College of Medicine
 – BC Cancer Research Center
 – Boise State University
 – Brigham & Women’s Hospital and Harvard Medical School
 – Broad Institute
 – Complete Genomics Inc.
 – Fred Hutchinson Cancer Research Center
 – Institute for Systems Biology
 – International Genomics Consortium
 – Johns Hopkins University
 – Memorial Sloan-Kettering Cancer Center
 – Nationwide Children’s Hospital
 – Oregon Health & Science University
 – Pacific NW National Laboratory
 – University of California, Santa Cruz
 – University of North Carolina
 – University of Southern California
 – University of Texas, M.D. Anderson Cancer Ctr
 – Vanderbilt University
 – Washington University Genome Institute

TCGA Centers (types):
 – Biospecimen Core Resource
 – Genome Characterization Centers (GCCs)
 – Genome Sequencing Centers (GSCs)
 – Proteome Characterization Centers (PCCs)
 – Data Coordination Center (DCC)
 – Genome Data Analysis Centers (GDACs)
• Built at SDSC to store DNA information for The Cancer Genome Atlas
• Designed for 50,000 genomes with an average of 100 gigabytes per genome: 5 petabytes
• Currently 24,000 files from ~5,500 cases, ~60 gigabytes/case; 2 PB of downloads in total
• Total cost ~$100/year/genome at 50K genomes, i.e. $5M/year. The technology cost is about ½ the total
• Co-location opportunities in the same data center for groups who want to compute on the data
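The capacity and cost figures above follow from simple multiplication. The short sketch below reproduces that arithmetic, assuming decimal units (1 PB = 1,000,000 GB); the function and variable names are chosen for illustration only.

```python
GB_PER_TB = 1_000        # decimal units assumed
GB_PER_PB = 1_000_000    # decimal units assumed


def capacity_pb(genomes, gb_per_genome):
    """Total storage in petabytes for a number of genomes."""
    return genomes * gb_per_genome / GB_PER_PB


def annual_cost_usd(genomes, usd_per_genome_per_year):
    """Total yearly cost in dollars."""
    return genomes * usd_per_genome_per_year


design_capacity = capacity_pb(50_000, 100)        # design target from the slide
yearly_cost = annual_cost_usd(50_000, 100)        # at $100/year/genome
technology_cost = yearly_cost / 2                 # "about half the total"
current_stored_tb = 5_500 * 60 / GB_PER_TB        # ~5,500 cases at ~60 GB/case

print(design_capacity, yearly_cost, technology_cost, current_stored_tb)
```

Under these assumptions the ~5,500 cases held at the time correspond to roughly 330 TB, a small fraction of the 5 PB design capacity.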
Lessons learned by CGHub on storage of sequence data
We are sincerely eager to hear your feedback on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.

BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012
