BDT205 Solving Big Problems with Big Data - AWS re: Invent 2012

[Figure: $K per genome, log scale from $100,000.0 down to $0.1, 2001 - 2014]

Emperor of All Maladies,
page 464

“Computer Scientists May Have What It Takes to Help Cure Cancer,”
David Patterson, New York Times, 12/5/2011

reconstructed genome

variant

reference genome

consensus

Reads

Genetic Data Processing Pipeline

Read Alignment

SNP Calling

Structural Variant Detection

Reconstructed Genome

Malachi Griffith, Washington University, August 19, 2012,
“Cancer genome and transcriptome sequencing – analysis challenges and bottlenecks”

119th
step

“Computational science: ...Error…why scientific programming does not compute,”
by Zeeya Merali, 13 October 2010, Nature 467, 775-777

UC Students/Post-Docs
– Ma’ayan Bresler
– Kristal Curtis
– Jesse Liptrap
– Sara Sheehan
– Ameet Talwalkar
– Jonathan Terhorst
– Richard Xia
– Matei Zaharia
– Yuchen Zhang

External
– Bill Bolosky (MS/MSR)
– Mishali Naik (Intel)
– Paolo Narvaez (Intel)
– Ravi Pandya (MS)
– Abirami Prabhakaran (Intel)
– Taylor Sittler (UCSF)
– Gans Srinivasa (Intel)
– Arun Wiita (UCSF)

Faculty
– Armando Fox
– Michael Jordan
– Anthony Joseph
– David Patterson
– Satish Rao
– Scott Shenker
– Yun Song
– Ion Stoica

Expertise
– Computational Biology/Medicine
– Machine Learning
– Systems

• 2011-2016
• Berkeley Data Analysis Stack release as Open Source

Adaptive/Active Machine Learning and Analytics

Massive and Diverse Data

CrowdSourcing/Human Computation

Cloud Computing

genome

read

Seed    Positions
AAAA    0, 8
ACCT    4, 16, 24
GTGA    12, 20
…       …

Aligner      % Aligned   % Error   Reads/sec   Time (hours)
Bowtie2      84%         0         14,400      22
BWA          87%         0.31      9,000       35
Novoalign    89%         0.21      4,260       73
SOAP2        79%         0         19,500      16
SNAP         87%         0         189,000     2
http://snap.cs.berkeley.edu/
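
The seed table on this slide (seed string mapped to a list of genome positions) is the kind of hash index a seed-based aligner looks up for each read. The sketch below illustrates that idea only; it is not SNAP's actual implementation. The toy genome is an assumption, constructed so that non-overlapping 4-base seeds reproduce the positions shown in the table, and the names `build_seed_index` and `candidate_positions` are hypothetical.

```python
from collections import defaultdict

# Toy genome (an assumption for illustration): built so that seeds taken
# every 4 bases land at the positions listed in the slide's seed table.
TOY_GENOME = "".join(["AAAA", "ACCT", "AAAA", "GTGA", "ACCT", "GTGA", "ACCT"])


def build_seed_index(genome, seed_len, step=1):
    """Map each seed (substring of length seed_len) to its genome positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - seed_len + 1, step):
        index[genome[pos:pos + seed_len]].append(pos)
    return dict(index)


def candidate_positions(read, index, seed_len):
    """Look up every seed of the read and vote for implied alignment starts.

    Returns (start, votes) pairs, most-voted first, ties broken by position.
    """
    votes = defaultdict(int)
    for offset in range(0, len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset  # where the read would begin in the genome
            if start >= 0:
                votes[start] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


def mismatches(read, genome, start):
    """Count mismatching bases when the read is placed at `start`."""
    window = genome[start:start + len(read)]
    if len(window) < len(read):
        return len(read)  # read runs off the end: treat as a full mismatch
    return sum(1 for a, b in zip(read, window) if a != b)


if __name__ == "__main__":
    idx = build_seed_index(TOY_GENOME, seed_len=4, step=4)
    print(idx)
    for start, n in candidate_positions("GTGAACCT", idx, seed_len=4):
        print(start, n, mismatches("GTGAACCT", TOY_GENOME, start))
```

Voting over seed hits narrows the search to a few candidate locations, so the more expensive base-by-base comparison only runs where several seeds agree.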

1. Create easy-to-use, fast, accurate genetic analysis
pipelines

[Figure: map of TCGA centers, labeled by center type]

Institutions shown on the map:
Boise State University
Brigham & Women’s Hospital and Harvard Medical School
Broad Institute
Johns Hopkins University
Memorial Sloan-Kettering Cancer Center
BC Cancer Research Center
Fred Hutchinson Cancer Research Center
Complete Genomics Inc.
Pacific NW National Laboratory
University of Southern California
Nationwide Children’s Hospital
Oregon Health & Science University
Institute for Systems Biology
University of California, Santa Cruz
Vanderbilt University
Washington University Genome Institute
University of North Carolina
International Genomics Consortium
Baylor College of Medicine
University of Texas, M.D. Anderson Cancer Ctr

TCGA Centers:
Biospecimen Core Resource
Genome Characterization Centers (GCCs)
Genome Sequencing Centers (GSCs)
Proteome Characterization Centers (PCCs)
Data Coordination Center (DCC)
Genome Data Analysis Centers (GDACs)

• Built at SDSC to store DNA information for The Cancer Genome Atlas
• Designed for 50,000 genomes with an average of 100 gigabytes per genome: 5 petabytes
• Currently 24,000 files from ~5,500 cases, ~60 gigabytes/case, in total 2 PB of downloads
• Total cost ~$100/year/genome at 50K genomes, i.e. $5M/year. The technology cost is about ½ the total
• Co-location opportunities in the same data center for groups who want to compute on the data
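
The capacity and cost figures in the bullets above follow from simple multiplication, which the sketch below reproduces. It assumes decimal units (1 PB = 1,000,000 GB); the function names are hypothetical, and the figure for data currently stored is derived here from the slide's ~5,500 cases at ~60 GB each rather than stated on the slide.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units (1 PB = 10^6 GB)


def capacity_pb(genomes, gb_per_genome):
    """Total storage in petabytes for a number of genomes."""
    return genomes * gb_per_genome / GB_PER_PB


def annual_cost_usd(genomes, usd_per_genome_per_year):
    """Total yearly cost in dollars."""
    return genomes * usd_per_genome_per_year


# Design point from the slide: 50,000 genomes at 100 GB each.
designed_pb = capacity_pb(50_000, 100)

# Derived, not stated on the slide: ~5,500 cases at ~60 GB each.
current_pb = capacity_pb(5_500, 60)

# Cost from the slide: ~$100 per genome per year at 50K genomes.
total_cost = annual_cost_usd(50_000, 100)

# The slide says technology is about half of the total cost.
technology_cost = total_cost * 0.5

if __name__ == "__main__":
    print(f"designed capacity: {designed_pb:.1f} PB")
    print(f"currently stored (derived): {current_pb:.2f} PB")
    print(f"total cost: ${total_cost:,.0f}/year")
    print(f"technology share: ${technology_cost:,.0f}/year")
```

Note that the 2 PB on the slide counts downloads, not stored data, so it is not expected to match the derived storage figure.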

Lessons learned by CGHub on storage of
sequence data

We are sincerely eager to
hear your feedback on this
presentation and on re:Invent.

Please fill out an evaluation
form when you have a
chance.
