BDT205 Solving Big Problems with Big Data - AWS re:Invent 2012



The problem of big data is not only that it is capacious, but that it is also heterogeneous, dirty, and growing even faster than disk capacity is improving. One challenge, then, is to derive value by answering ad hoc questions quickly enough to justify preserving big data. A group of us from databases, machine learning, networking, and systems just started a new lab at the University of California, Berkeley, to tackle this challenge. The AMPLab works at the intersection of three trends: statistical machine learning (Algorithms), cloud computing (Machines), and crowdsourcing (People).

One of the driving applications for the AMPLab is cancer genomics. Over the next several years, gene-sequencing technologies will begin to make their way into medicine, offering the most complex tests available. This advance brings a new type of data with tremendous promise to help elucidate physiological and pathological functions within the body, and to support more informed decisions about patient care. The cost of genome sequencing is projected to fall within a range where it may be used for diagnostic and treatment purposes within the next two years. Because of the overwhelming amount of information these tests return, direct human interpretation is not feasible, so interpretation will have to be guided by computational methods and visualization. The use of sequencing information has debuted in cancer. A provocative hypothesis is that the massive growth of online digital descriptions of tumor cell genomes will enable computer scientists to help make breakthroughs in cancer treatment, perhaps even within the next few years. Learn about the frightening fractions of cancer, dramatic speedups in genomic data processing by using cloud computing, and the blurring between opportunity and obligation when dealing with a problem that affects the lives of millions of people.



  2. 2. [Chart: cost per genome in $K, log scale from $100,000.0 down to $0.1, 2001 - 2014]
  3. 3. Emperor of All Maladies, page 464
  4. 4. “Computer Scientists May Have What It Takes to Help Cure Cancer,” David Patterson, New York Times, 12/5/2011
  5. 5. [Diagram labels: reconstructed genome, variant, reference genome, consensus]
  6. 6. Genetic Data Processing Pipeline. Stages shown: Reads, Read Alignment, SNP Calling, Structural Variant Detection, Reconstructed Genome
  7. 7. Malachi Griffith, Washington University, August 19, 2012, “Cancer genome and transcriptome sequencing – analysis challenges and bottlenecks”; 119th step
  8. 8. “Computational science: ...Error…why scientific programming does not compute,” by Zeeya Merali, 13 October 2010, Nature 467, 775-777
  9. 9. UC Students/Post-Docs: Ma’ayan Bresler, Kristal Curtis, Jesse Liptrap, Sara Sheehan, Ameet Talwalkar, Jonathan Terhorst, Richard Xia, Matei Zaharia, Yuchen Zhang. External: Bill Bolosky (MS/MSR), Mishali Naik (Intel), Paolo Narvaez (Intel), Ravi Pandya (MS), Abirami Prabhakaran (Intel), Taylor Sittler (UCSF), Gans Srinivasa (Intel), Arun Wiita (UCSF). Faculty: Armando Fox, Michael Jordan, Anthony Joseph, David Patterson, Satish Rao, Scott Shenker, Yun Song, Ion Stoica. Expertise: Computational Biology/Medicine, Machine Learning, Systems
  10. 10. • 2011-2016 • Berkeley Data Analysis Stack release as Open Source • Adaptive/Active Machine Learning and Analytics • Cloud Computing • CrowdSourcing/Human Computation • Massive and Diverse Data
  11. 11. [Diagram labels: genome, read] Seed index (Seed: Positions): AAAA: 0, 8; ACCT: 4, 16, 24; GTGA: 12, 20; …
      Aligner     % Aligned   % Error   Reads/sec   Time (hours)
      Bowtie2     84%         0         14,400      22
      BWA         87%         0.31      9,000       35
      Novoalign   89%         0.21      4,260       73
      SOAP2       79%         0         19,500      16
      SNAP        87%         0         189,000     2
  12. 12. 1. Create easy-to-use, fast, accurate genetic analysis pipelines
  13. 13. [Map: TCGA Centers] Institutions shown: Boise State University; Brigham & Women’s Hospital and Harvard Medical School; Broad Institute; Johns Hopkins University; Memorial Sloan-Kettering Cancer Center; BC Cancer Research Center; Fred Hutchinson Cancer Research Center; Complete Genomics Inc.; Pacific NW National Laboratory; University of Southern California; Nationwide Children’s Hospital; Oregon Health & Science University; Institute for Systems Biology; University of California, Santa Cruz; Vanderbilt University; Washington University Genome Institute; University of North Carolina; International Genomics Consortium; Baylor College of Medicine; University of Texas, M.D. Anderson Cancer Ctr. Legend (center types): Biospecimen Core Resource; Genome Characterization Centers (GCCs); Genome Sequencing Centers (GSCs); Proteome Characterization Centers (PCCs); Data Coordination Center (DCC); Genome Data Analysis Centers (GDACs)
  14. 14. • Built at SDSC to store DNA information for The Cancer Genome Atlas • Designed for 50,000 genomes with an average of 100 gigabytes per genome: 5 petabytes • Currently 24,000 files from ~5,500 cases, ~60 gigabytes/case, in total 2 PB of downloads • Total cost ~$100/year/genome at 50K genomes, i.e. $5M/year. The technology cost is about ½ the total • Co-location opportunities in the same data center for groups who want to compute on the data
  15. 15. Lessons learned by CGHub on storage of sequence data
  16. 16. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
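Slide 11 shows a table mapping short seeds to their positions in the genome, alongside throughput figures for several aligners. The Python below is a minimal sketch of how such a seed index could be built and used to place a read. It is an illustration only, not the algorithm of SNAP or of any other aligner on the slide. The toy genome string, the function names, the 4-base seed length, the non-overlapping indexing step, and the mismatch-count scoring are all assumptions chosen so that the index reproduces the three entries visible on the slide.

```python
from collections import Counter, defaultdict

SEED_LEN = 4  # assumed seed length, matching the 4-letter seeds on the slide

# Hypothetical toy genome, constructed so the index matches the slide's entries.
GENOME = "AAAAACCTAAAAGTGAACCTGTGAACCT"


def build_seed_index(genome, seed_len=SEED_LEN, step=SEED_LEN):
    """Map each seed (substring of length seed_len) to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - seed_len + 1, step):
        index[genome[pos:pos + seed_len]].append(pos)
    return dict(index)


def candidate_starts(read, index, seed_len=SEED_LEN):
    """Count, for each possible read start, how many seeds of the read hit it."""
    votes = Counter()
    for offset in range(len(read) - seed_len + 1):
        for pos in index.get(read[offset:offset + seed_len], ()):
            start = pos - offset
            if start >= 0:
                votes[start] += 1
    return votes


def align(read, genome, index, max_mismatches=1, seed_len=SEED_LEN):
    """Return (start, mismatches) for the best candidate, or None if none fits."""
    best = None
    for start, _ in candidate_starts(read, index, seed_len).most_common():
        window = genome[start:start + len(read)]
        if len(window) < len(read):
            continue  # candidate runs off the end of the genome
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


INDEX = build_seed_index(GENOME)
print(INDEX)                                # seed -> positions table
print(align("GTGAACCTGTGA", GENOME, INDEX)) # exact copy of genome[12:24]
print(align("GTGAACCTGTGC", GENOME, INDEX)) # same read with one substituted base
```

The design point the sketch tries to convey is that the index turns alignment into a few hash lookups followed by a cheap verification at a handful of candidate positions, rather than a scan of the whole genome for every read.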
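The sizing and cost figures on slide 14 follow from simple multiplication, which the short Python check below reproduces. It is a sketch under stated assumptions: decimal units (1 PB = 1,000,000 GB) are assumed, the function names are made up for illustration, and the "about ½" technology share is taken from the slide as a round figure.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def capacity_pb(genomes, gb_per_genome):
    """Total storage in petabytes for a number of genomes of a given size."""
    return genomes * gb_per_genome / GB_PER_PB


def annual_cost_usd(genomes, usd_per_genome_year):
    """Total yearly cost in dollars at a flat per-genome rate."""
    return genomes * usd_per_genome_year


design_pb = capacity_pb(50_000, 100)        # design target from the slide
total_cost = annual_cost_usd(50_000, 100)   # ~$100/year/genome at 50K genomes
technology_cost = total_cost / 2            # slide: technology is about half

print(design_pb)        # 5.0 (petabytes)
print(total_cost)       # 5000000 (dollars per year)
print(technology_cost)  # 2500000.0 (dollars per year)
```

By the same arithmetic, the ~5,500 cases at ~60 gigabytes per case reported as current holdings come to roughly a third of a petabyte stored; the 2 PB figure on the slide refers to downloads, not to stored data.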