© 2014 MapR Technologies 1
Renaissance in Medicine (Draft 1)
High-Level Biomedical Goal: Improve Fitness
Therapeutics => Diagnostics => Prognostics
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires high-resolution mechanistic knowledge
– Reverse engineer how genetic variation leads to (un)desired traits
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics
[Image: Star Wars III: Revenge of the Sith]
[Image: Star Wars V: The Empire Strikes Back]
Many DNA-Based Apps Coming*…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
* Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Chart: 2014 vs. 2020, axis 0 to 25, split into Clinical and Non-Clinical]
(Even) Moore’s Law
Stein. 2010. The case for cloud computing in genome informatics
“(Even) Moore’s” begins in 2004 with Solexa (acquired by ILMN 2007)
[Chart: Storage (MB/$) vs. DNA (bp/$); annotated: ILMN HiSeq XTen (Jan 2014), $1000 Genome]
Trends and Events: ILMN HiSeq XTen Specs
• Sold in sets of 10 units ONLY (XTen =10 sequencers)
~ $10 million/XTen, shipments began in Jan 2014
• XTen produces 600 GBases/day @ 30x oversampling
= 1.8 TBases per 3-day cycle
= 54 TBytes per 3-day cycle
= $1000 per genome
= 18,000 genomes/year/XTen
~ 4,000,000 births/year (US, 2012)
=> Neonatal sequencing is a reality (with 200 of today’s systems)
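The "200 systems" figure follows from dividing births per year by the per-system throughput quoted on this slide. A quick check in Python, using only the slide's numbers (the variable names are illustrative, not from the talk):

```python
import math

# Figures quoted on the slide.
births_per_year = 4_000_000          # US births, 2012
genomes_per_year_per_xten = 18_000   # one XTen set (10 sequencers)

# Number of XTen sets needed to sequence every newborn once per year.
systems_needed = births_per_year / genomes_per_year_per_xten
print(f"about {systems_needed:.0f} XTen sets (at least {math.ceil(systems_needed)})")
```

The exact quotient is a little over 222, which the slide rounds to 200.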
Summary: Major Impact on Social Fabric
Diseases soon to be gone
• Muscular dystrophy
• Cystic fibrosis
• Albinism
• Phenylketonuria
• Hemophilia
Paternity Tests
• fact: US paternity fraud rate is 1 in 25
More Troubling:
• Huntington’s Disease: allow?
• Inherited Cancers (10% !!!): allow?
http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in
http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
http://en.wikipedia.org/wiki/Paternity_fraud
http://www.cancer.org/cancer/cancercauses/geneticsandcancer/heredity-and-cancer
Singapore: Government Sponsored Matchmaking
• Some people have more desirable genes than others.
• “Our government wants smart ladies to meet smart guys to get smart children.” ~ Annie Chan, Club2040 (Singapore matchmaking agency)
http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html
Why hasn’t this happened yet?
The Evolving Genomics Workload
DNA Specimen => DNA Sequencing => Primary Analytics => Apps
DNA Sequencing Value Chain
[Chart: % effort (0 to 100) for Pre-NGS (~2000), Now, and Future (~2020)]
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Bottleneck @ Primary Analytics
DNA Specimen => DNA Sequencing => Primary Analytics (Fix this) => Apps
Sequence is Becoming Free
DNA sequencing effectively becomes free
Commoditization pattern
Huge influx of inexpensive data
Creates new medical and biotech use-cases
[Chart: % effort (0 to 100) for Pre-NGS (~2000), Now, and Future]
Experiment Design and “Downstream” Analytics
Specialization will grow to 100% effort
This is the desirable scenario
Biologists ought to be doing biology
[Chart: % effort (0 to 100) for Pre-NGS (~2000), Now, and Future; region labeled ANALYTICS]
Data Management (1º Analytics) Bottleneck
Time currently being spent on BigData problems
Not ideal
Physicians & Biologists need help from CS & SW Engineers
[Chart: % effort (0 to 100) for Pre-NGS (~2000), Now, and Future (~2020)]
Just Remember the Diamond
[Chart: % effort (0 to 100) for Pre-NGS (~2000), Now, and Future (~2020)]
DNA Sequencing Meets MapReduce
Parallelize Primary Analytics
.fastq => short read alignment => reads & mappings => genotype calling => .vcf
Sequence Analysis, Quick Overview
[…] G A C T A G A fragment1
A C A G T T T A C A fragment2
A G A T A - - A G A fragment3
A A C A G C T T A C A […] fragment4
C T A T A G A T A A fragment5
[…] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA
[…] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
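The comparison of sampleDNA against referenceDNA can be sketched in a few lines. This is a minimal illustration, assuming the two sequences on the slide are already aligned base-for-base with no gaps and ignoring the "[…]" context; the function name is hypothetical:

```python
# Sequences copied from the slide, with the "[…]" context markers dropped.
# Assumption: both strings are aligned position by position, with no gaps.
reference_dna = "GATTACAGATTACAGATTACA"
sample_dna = "GACTACAGATAACAGATTACA"


def find_mismatches(reference, sample):
    """Return (position, reference_base, sample_base) for each differing column."""
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


mismatches = find_mismatches(reference_dna, sample_dna)
print(mismatches)  # → [(2, 'T', 'C'), (10, 'T', 'A')]
```

In practice the sample sequence is not known up front; it has to be inferred from the overlapping fragments, which is what the following slides address.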
What is the (Probable) Color of Each Column?
Which Columns are (probably) Not White?
Strategy 1: examine foreach column, foreach row
O(rows*cols) + O(1 col) memory
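A minimal sketch of Strategy 1, under stated assumptions: rows are short reads given as hypothetical (start_column, bases) pairs, and a column counts as "not white" when its most common base differs from the reference. The toy data and names are illustrative, not from the talk.

```python
from collections import Counter

# Toy data (hypothetical): a reference and reads given as (start_column, bases).
REFERENCE = "GATTACAGAT"
READS = [(0, "GACTA"), (1, "ACTAC"), (2, "CTACA"), (5, "CAGAT"), (4, "ACAGA")]


def non_white_columns_by_column(reference, reads):
    """Strategy 1: for each column, scan every row; hold one column's tally at a time."""
    result = []
    for col in range(len(reference)):
        tally = Counter()  # memory for a single column only
        for start, bases in reads:  # every row is visited for every column
            if start <= col < start + len(bases):
                tally[bases[col - start]] += 1
        if tally and tally.most_common(1)[0][0] != reference[col]:
            result.append(col)
    return result


print(non_white_columns_by_column(REFERENCE, READS))  # → [2]
```

The nested loop is where the rows*cols cost comes from: each column pass touches all reads, including those that do not cover it.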
Which Columns are (probably) Not White?
Strategy 2: examine foreach row. keep running tallies
O(rows) + O(rows*cols) memory
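Strategy 2 can be sketched the same way. As before, the (start_column, bases) read format, the toy data, and the majority-vote rule for "not white" are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy data (hypothetical): a reference and reads given as (start_column, bases).
REFERENCE = "GATTACAGAT"
READS = [(0, "GACTA"), (1, "ACTAC"), (2, "CTACA"), (5, "CAGAT"), (4, "ACAGA")]


def non_white_columns_by_row(reference, reads):
    """Strategy 2: visit each row once, keeping a running tally for every column."""
    tallies = defaultdict(Counter)  # all column tallies stay in memory at once
    for start, bases in reads:
        for offset, base in enumerate(bases):
            tallies[start + offset][base] += 1
    return sorted(
        col
        for col, tally in tallies.items()
        if tally.most_common(1)[0][0] != reference[col]
    )


print(non_white_columns_by_row(REFERENCE, READS))  # → [2]
```

Each read is touched once, but no column can be decided until all reads have been seen, so every tally has to be kept until the end.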
Which Columns are (probably) Not White?
Strategy 3: rotate matrix. examine foreach column
O(rows log rows) + O(cols) + O(1 col) memory
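A sketch of Strategy 3, again on hypothetical toy data. Here "rotate" is interpreted as turning each row into (column, base) records and sorting them by column; the sort is done in memory for brevity, whereas a real pipeline would presumably use an external or distributed sort.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Toy data (hypothetical): a reference and reads given as (start_column, bases).
REFERENCE = "GATTACAGAT"
READS = [(0, "GACTA"), (1, "ACTAC"), (2, "CTACA"), (5, "CAGAT"), (4, "ACAGA")]


def non_white_columns_by_rotation(reference, reads):
    """Strategy 3: sort (column, base) records by column, then scan column by column."""
    records = sorted(
        (
            (start + offset, base)
            for start, bases in reads
            for offset, base in enumerate(bases)
        ),
        key=itemgetter(0),
    )
    result = []
    # After sorting, each column's records are contiguous, so one tally at a time suffices.
    for col, group in groupby(records, key=itemgetter(0)):
        tally = Counter(base for _, base in group)
        if tally.most_common(1)[0][0] != reference[col]:
            result.append(col)
    return result


print(non_white_columns_by_rotation(REFERENCE, READS))  # → [2]
```

The scan after the sort is purely sequential, which is the access pattern the comparison slide credits to this strategy.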
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low mem req
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High mem req
• Sequential access pattern
Strategy 3: O(rows log rows) + O(cols) + O(1 col) memory
• Low mem req
• Sequential access pattern
• Requires Sort
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low mem req
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High mem req
• Sequential access pattern
Strategy 3: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory
• Low mem req
• Sequential access pattern
• Requires Sort
As # of rows & columns increases, Strategy 3 becomes more attractive
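The "÷ shards" term can be illustrated by partitioning the rotated records by column so that each shard is sorted and scanned independently. This is a sketch under assumptions: the modulo partitioning rule, the toy data, and all names are hypothetical, and the shards are processed sequentially here even though they could run on separate workers.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Toy data (hypothetical): a reference and reads given as (start_column, bases).
REFERENCE = "GATTACAGAT"
READS = [(0, "GACTA"), (1, "ACTAC"), (2, "CTACA"), (5, "CAGAT"), (4, "ACAGA")]


def shard_by_column(reads, num_shards):
    """Route each (column, base) record to a shard chosen by column number."""
    shards = [[] for _ in range(num_shards)]
    for start, bases in reads:
        for offset, base in enumerate(bases):
            col = start + offset
            shards[col % num_shards].append((col, base))
    return shards


def non_white_in_shard(reference, records):
    """Sort one shard by column, then scan it one column at a time."""
    found = []
    ordered = sorted(records, key=itemgetter(0))
    for col, group in groupby(ordered, key=itemgetter(0)):
        tally = Counter(base for _, base in group)
        if tally.most_common(1)[0][0] != reference[col]:
            found.append(col)
    return found


shards = shard_by_column(READS, num_shards=3)
# All records for a given column land in the same shard, so shards never need to talk.
result = sorted(
    col for records in shards for col in non_white_in_shard(REFERENCE, records)
)
print(result)  # → [2]
```

Because a column never spans two shards, the per-shard answers can simply be concatenated.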
Primary Sequence Analysis (ETL), MapReduce style
.fastq => short read alignment (MAP) => .bam => genotype calling (MAP) => .vcf
REDUCE, rotate matrix 90º
Cost annotations: O(mn) / 1; (O(mn) + O(n log n)) / s
Hello!
See also: Twitter Algebird – abstract algebra library for Scala, used with MapReduce-style frameworks
First App You’ll Likely See: Clinical Genomics
Clinical Sequencing Business Process Workflow
[Diagram: Physician and Patient at the Clinic; blood/saliva to the Clinical Lab; extract to Analytics]
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others.
Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
What’s the Impact on Human Evolution?
More Reading: The Red Queen: Sex and the Evolution of Human Nature
Clinical Genomics, Information Systems Perspective
extract: Compressed Structured Base4 Data
Base4=>Base2 Converter [[ DE-STRUCTURES ]]
Uncompressed Unstructured Base2 Data
“BI” Reporting and Visualization tools
Physician, Patient, Analyst, Stakeholder
Clinical Genomics, Information Systems Perspective
Physician, Patient, Analyst, Stakeholder
ETL (1º analytics) => Data Store => Analytics (2º analytics) => Reporting and Viz
Not much in this presentation, see also: http://slidesha.re/1sC2BOX
Clinical Applications: Performance Matters
[Diagram components: DNA Sequencer (×3), Raw DNA, NFS, MapR Filesystem, 1º Analytics, SNP calls, Reference DBs, Static Clinical Reporting (Physician, Patient), ETL, SNP DB, 2º Analytics (Researcher, Subject)]
Variant Collection Enables Downstream Apps
• GWAS (genome-wide association studies)
• Versioned, Personalized Medicine
• Companion Diagnostics
SNP DB => 2º Analytics => New Markets
Hello! More linear algebra ☺ [Spark, Summingbird, Lambda Architecture Slides]
First Bottleneck Removed. Now What?
[Chart: % effort (0 to 100) for Pre-NGS (~2000), Now, and Future; region labeled ANALYTICS]
Next Bottleneck, Of Course!
Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my experiment
• That elucidate genetic basis of phenotype…
• And rank order them by impact/likelihood/etc.
• In context of, e.g.
– ε1: Racial, etc. background
– ε2: Experimental design-specific concerns (e.g. familial IBD/IBS)
– ε3: Environmental factors and penetrance
– ε4: Assay-specific biases and noise
At the risk of over-simplifying, as a business-level concept…
phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
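To make the linear model concrete, here is a minimal least-squares fit of α and β on made-up data. Everything in it is an assumption for illustration: the genotype coding (0, 1, or 2 copies of a variant allele), the phenotype values, and the function name. The residuals lump ε1 through ε4 together; separating them would need the extra covariates listed above.

```python
from statistics import mean

# Hypothetical toy data: genotype coded as copies of a variant allele (0, 1, 2).
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]


def fit_line(x, y):
    """Ordinary least squares for y = alpha * x + beta with a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    beta = y_bar - alpha * x_bar
    return alpha, beta


alpha, beta = fit_line(genotype, phenotype)
# What the line does not explain: the combined effect of the epsilon terms.
residuals = [yi - (alpha * xi + beta) for xi, yi in zip(genotype, phenotype)]
print(f"alpha={alpha:.3f} beta={beta:.3f}")
```

On this toy data the fitted slope is close to 1.0 per allele copy; the numbers carry no biological meaning.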
HUGE PROBLEM
COMBINATORIAL EXPLOSION
What’s a Percolator?
• Google Percolator
– “Caffeine” update 2010
• Iterative, incremental prioritized updates
• No batch processing
• Decouple computational results from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
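The idea of incremental updates driven by notifications can be sketched with a toy table that calls observers on every write. This is not Percolator's API: the class, the callback signature, and the SNP and trait identifiers are all hypothetical, and the sketch leaves out distributed transactions and prioritization entirely.

```python
from collections import defaultdict


class ObservedTable:
    """Toy key-value table that notifies observers on each write (hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def observe(self, callback):
        self.observers.append(callback)

    def write(self, key, value):
        old = self.rows.get(key)
        self.rows[key] = value
        for callback in self.observers:
            callback(key, old, value)


# Derived result maintained incrementally: number of SNPs per phenotype.
snps_per_phenotype = defaultdict(int)


def update_counts(key, old_phenotype, new_phenotype):
    """Adjust only the counts affected by this one write."""
    if old_phenotype is not None:
        snps_per_phenotype[old_phenotype] -= 1
    snps_per_phenotype[new_phenotype] += 1


table = ObservedTable()
table.observe(update_counts)
table.write("snp_0001", "trait_a")
table.write("snp_0002", "trait_a")
table.write("snp_0001", "trait_b")  # a re-annotation touches two counters, not the whole table
print(dict(snps_per_phenotype))  # → {'trait_a': 1, 'trait_b': 1}
```

The work per write depends on what changed rather than on how much data is already stored, which is the sense of "decouple computational results from data size".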
Solution: Percolate
Inputs: SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
Denormalize and Percolate: (re)prioritize & (re)process, buffer
Outputs: service queries, drive dashboards, create reports, denormalize for display
New models
@allenday on percolators: http://slidesha.re/1qSXCKw
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
NPR. 2011. The Search For Analysts To Make Sense Of ‘Big Data’
http://www.npr.org/2011/11/30/142893065
If they were unlabeled, would you know which is which?
• Identify network structures
• Label them
• Observe stimulus=>response space mapping
• Purposefully target
• $$$$ Twitter’s Business Model
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
Q&A
@allenday
allenday@mapr.com
linkedin.com/in/allenday

More Related Content

Viewers also liked

Multimedia. 204
Multimedia. 204Multimedia. 204
Multimedia. 204maumargu
 
Policy Development Process Infographic Arabic
Policy Development Process Infographic ArabicPolicy Development Process Infographic Arabic
Policy Development Process Infographic ArabicICANN
 
Digital Marketing at Asset Management Firms [INFOGRAPHIC]
Digital Marketing at Asset Management Firms [INFOGRAPHIC]Digital Marketing at Asset Management Firms [INFOGRAPHIC]
Digital Marketing at Asset Management Firms [INFOGRAPHIC]Kurtosys Systems
 
Enterprise works overview 2013 v2
Enterprise works overview 2013 v2Enterprise works overview 2013 v2
Enterprise works overview 2013 v2UIResearchPark
 
USECON RoX 2015: UX Camp - gezielte Entwicklung von Design-Leitbildern
USECON RoX 2015: UX Camp - gezielte Entwicklung von Design-LeitbildernUSECON RoX 2015: UX Camp - gezielte Entwicklung von Design-Leitbildern
USECON RoX 2015: UX Camp - gezielte Entwicklung von Design-LeitbildernUSECON
 
Big Data Case Studies
Big Data Case Studies Big Data Case Studies
Big Data Case Studies UIResearchPark
 
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design PatternsAllen Day, PhD
 
Business Value of Data
Business Value of Data Business Value of Data
Business Value of Data UIResearchPark
 
Tutorial WordPress.com
Tutorial WordPress.comTutorial WordPress.com
Tutorial WordPress.commauricio souza
 
Texte, die konvertieren
Texte, die konvertierenTexte, die konvertieren
Texte, die konvertierenEric Kubitz
 
Grundlagen: Content einbauen (SEOkomm, 2015)
Grundlagen: Content einbauen (SEOkomm, 2015)Grundlagen: Content einbauen (SEOkomm, 2015)
Grundlagen: Content einbauen (SEOkomm, 2015)Eric Kubitz
 
Aquis Search General Presentation
Aquis Search General PresentationAquis Search General Presentation
Aquis Search General PresentationGordon Kwong
 

Viewers also liked (12)

Multimedia. 204
Multimedia. 204Multimedia. 204
Multimedia. 204
 
Policy Development Process Infographic Arabic
Policy Development Process Infographic ArabicPolicy Development Process Infographic Arabic
Policy Development Process Infographic Arabic
 
Digital Marketing at Asset Management Firms [INFOGRAPHIC]
Digital Marketing at Asset Management Firms [INFOGRAPHIC]Digital Marketing at Asset Management Firms [INFOGRAPHIC]
Digital Marketing at Asset Management Firms [INFOGRAPHIC]
 
Enterprise works overview 2013 v2
Enterprise works overview 2013 v2Enterprise works overview 2013 v2
Enterprise works overview 2013 v2
 
USECON RoX 2015: UX Camp - gezielte Entwicklung von Design-Leitbildern
USECON RoX 2015: UX Camp - gezielte Entwicklung von Design-LeitbildernUSECON RoX 2015: UX Camp - gezielte Entwicklung von Design-Leitbildern
USECON RoX 2015: UX Camp - gezielte Entwicklung von Design-Leitbildern
 
Big Data Case Studies
Big Data Case Studies Big Data Case Studies
Big Data Case Studies
 
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
 
Business Value of Data
Business Value of Data Business Value of Data
Business Value of Data
 
Tutorial WordPress.com
Tutorial WordPress.comTutorial WordPress.com
Tutorial WordPress.com
 
Texte, die konvertieren
Texte, die konvertierenTexte, die konvertieren
Texte, die konvertieren
 
Grundlagen: Content einbauen (SEOkomm, 2015)
Grundlagen: Content einbauen (SEOkomm, 2015)Grundlagen: Content einbauen (SEOkomm, 2015)
Grundlagen: Content einbauen (SEOkomm, 2015)
 
Aquis Search General Presentation
Aquis Search General PresentationAquis Search General Presentation
Aquis Search General Presentation
 

Similar to 2014.06.30 - Renaissance in Medicine - Singapore Management University - Data Science SG

Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersAllen Day, PhD
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Allen Day, PhD
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIAllen Day, PhD
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Allen Day, PhD
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleJulius Remigio, CBIP
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Deep Learning for AI (3)
Deep Learning for AI (3)Deep Learning for AI (3)
Deep Learning for AI (3)Dongheon Lee
 
Using Semantic Technology to Drive Agile Analytics - SLIDES
Using Semantic Technology to Drive Agile Analytics - SLIDESUsing Semantic Technology to Drive Agile Analytics - SLIDES
Using Semantic Technology to Drive Agile Analytics - SLIDESDATAVERSITY
 
Maze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and developmentMaze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and developmentNolan Nichols
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...Environmental Intelligence Lab
 
CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...CINECAProject
 
Crowds Cure Canver: Annotating Data from The Cancer Imaging Archive
Crowds Cure Canver: Annotating Data from The Cancer Imaging ArchiveCrowds Cure Canver: Annotating Data from The Cancer Imaging Archive
Crowds Cure Canver: Annotating Data from The Cancer Imaging ArchiveCancerImagingInforma
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 

Similar to 2014.06.30 - Renaissance in Medicine - Singapore Management University - Data Science SG (20)

Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data Engineers
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San Jose
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for Genomics
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Deep Learning for AI (3)
Deep Learning for AI (3)Deep Learning for AI (3)
Deep Learning for AI (3)
 
Using Semantic Technology to Drive Agile Analytics - SLIDES
Using Semantic Technology to Drive Agile Analytics - SLIDESUsing Semantic Technology to Drive Agile Analytics - SLIDES
Using Semantic Technology to Drive Agile Analytics - SLIDES
 
Maze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and developmentMaze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and development
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
 
CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...
 
Crowds Cure Canver: Annotating Data from The Cancer Imaging Archive
Crowds Cure Canver: Annotating Data from The Cancer Imaging ArchiveCrowds Cure Canver: Annotating Data from The Cancer Imaging Archive
Crowds Cure Canver: Annotating Data from The Cancer Imaging Archive
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 

More from Allen Day, PhD

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...Allen Day, PhD
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...Allen Day, PhD
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser UniversityAllen Day, PhD
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - WageningenAllen Day, PhD
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - AmsterdamAllen Day, PhD
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedAllen Day, PhD
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big DataAllen Day, PhD
 
2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data AnalyticsAllen Day, PhD
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 

More from Allen Day, PhD (14)

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen
 
20170402 Crop Innovation and Business - Amsterdam
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
Genome Analysis Pipelines with Spark and ADAM
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
Building Data Science Teams, Abbreviated
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
2013.12.12 - Sydney - Big Data Analytics
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 

2014.06.30 - Renaissance in Medicine - Singapore Management University - Data Science SG

  • 1. © 2014 MapR Technologies. Primary Sequence Analysis (ETL), MapReduce style: .fastq => short read alignment => .bam => genotype calling => .vcf. MAP, MAP, REDUCE (rotate matrix 90º). (O(mn)) / 1; (O(mn) + O(n log n)) / s. Hello!
  • 2. Renaissance in Medicine (Draft 1)
  • 3. High-Level Biomedical Goal: Improve Fitness. Therapeutics => Diagnostics => Prognostics • Therapeutics => traditional medicine • Diagnostics => personalized medicine – NextGen public health – Requires hi-res mechanical knowledge – Reverse engineer how genetic variation leads to (un)desired traits • Prognostics => GATTACA (dys/eu)topia – Managed populations / NextGen eugenics
  • 4. Star Wars III: Revenge of the Sith
  • 5. Star Wars V: The Empire Strikes Back
  • 6. (no text)
  • 7. Many DNA-Based Apps Coming*… • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs. * Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning. [Chart: 0 to 25, 2014 and 2020; series: Clinical, Non-Clinical]
  • 8. (Even) Moore’s Law. Stein. 2010. The case for cloud computing in genome informatics. “(Even) Moore’s” begins in 2004 with Solexa (acquired by ILMN 2007). [Chart: Storage: MB/$; DNA: bp/$; annotations: ILMN HiSeq XTen (Jan 2014), $1000 Genome]
  • 9. Trends and Events: ILMN HiSeq XTen Specs • Sold in sets of 10 units ONLY (XTen = 10 sequencers) ~ $10 million/XTen, shipments began in Jan 2014 • XTen produces 600 GBases/day @ 30x oversampling = 1.8 TBases per 3-day cycle = 54 TBytes per 3-day cycle = $1000 per genome = 18,000 genomes/year/XTen ~ 4,000,000 births/year (US, 2012) => Neonatal sequencing is a reality (with 200 of today’s systems)
  • 10. Summary: Major Impact on Social Fabric • Muscular dystrophy • Cystic fibrosis • Albinism • Phenylketonuria • Hemophilia. Diseases soon to be gone. http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979 http://en.wikipedia.org/wiki/Paternity_fraud http://www.cancer.org/cancer/cancercauses/geneticsandcancer/heredity-and-cancer Paternity Tests fact: US paternity fraud rate is 1 in 25. More Troubling: Huntington’s Disease: allow? Inherited Cancers (10% !!!): allow?
  • 11. Singapore: Government Sponsored Matchmaking • Some people have more desirable genes than others. • “Our government wants smart ladies to meet smart guys to get smart children.” ~ Annie Chan, Club2040 (Singapore matchmaking agency) http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html
  • 12. (no text)
  • 13. (no text)
  • 14. Why hasn’t this happened yet?
  • 15. The Evolving Genomics Workload: DNA Specimen => DNA Sequencing => Primary Analytics => Apps
  • 16. DNA Sequencing Value Chain. [Chart: %Effort, 0 to 100; Pre-NGS ~2000, Now, Future ~2020] Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • 17. Bottleneck @ Primary Analytics: DNA Specimen => DNA Sequencing => Primary Analytics => Apps. Fix this
  • 18. Sequence is Becoming Free • DNA sequencing effectively becomes free • Commoditization pattern • Huge influx of inexpensive data • Creates new medical and biotech use-cases. [Chart: %Effort, 0 to 100; Pre-NGS ~2000, Now, Future ~]
  • 19. Experiment Design and “Downstream” Analytics • Specialization will grow to 100% effort • This is the desirable scenario • Biologists ought to be doing biology. [Chart: %Effort, 0 to 100; Pre-NGS ~2000, Now, Future ~; region: ANALYTICS]
  • 20. Data Management (1º Analytics) Bottleneck • Time currently being spent on BigData problems • Not ideal • Physicians & Biologists need help from CS & SW Engineers. [Chart: %Effort, 0 to 100; Pre-NGS ~2000, Now, Future ~2020]
  • 21. Just Remember the Diamond. [Chart: %Effort, 0 to 100; Pre-NGS ~2000, Now, Future ~2020]
  • 22. DNA Sequencing Meets MapReduce
  • 23. Parallelize Primary Analytics: .fastq => short read alignment => reads & mappings => genotype calling => .vcf
  • 24. Sequence Analysis, Quick Overview: […] G A C T A G A fragment1; A C A G T T T A C A fragment2; A G A T A - - A G A fragment3; A A C A G C T T A C A […] fragment4; C T A T A G A T A A fragment5; […] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA; […] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
  • 25. What is the (Probable) Color of Each Column?
  • 26. Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row. O(rows*cols) + O(1 col) memory
  • 27. Which Columns are (probably) Not White? Strategy 2: examine foreach row, keep running tallies. O(rows) + O(rows*cols) memory
  • 28. Which Columns are (probably) Not White? Strategy 3: rotate matrix, examine foreach column. O(rows log rows) + O(cols) + O(1 col) memory
  • 29. Comparison of Strategies
      – Strategy 1: Low mem req; Random access pattern, many ops. O(rows*cols) + O(1 col) memory
      – Strategy 2: High mem req; Sequential access pattern. O(rows) + O(rows*cols) memory
      – Strategy 3: Low mem req; Sequential access pattern; Requires Sort. O(rows log rows) + O(cols) + O(1 col) memory
  • 30. Comparison of Strategies
      – Strategy 1: Low mem req; Random access pattern, many ops. O(rows*cols) + O(1 col) memory
      – Strategy 2: High mem req; Sequential access pattern. O(rows) + O(rows*cols) memory
      – Strategy 3: Low mem req; Sequential access pattern; Requires Sort. O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory
      – As # of rows & columns increases, Strategy 3 becomes more attractive
  • 31. Primary Sequence Analysis (ETL), MapReduce style: .fastq => short read alignment => .bam => genotype calling => .vcf. MAP, MAP, REDUCE (rotate matrix 90º). (O(mn)) / 1; (O(mn) + O(n log n)) / s. Hello!
  • 32. See also: Twitter Algebird – Parallel Linear Algebra Library for Scala / MapReduce
  • 33. First App You’ll Likely See: Clinical Genomics
  • 34. Clinical Sequencing Business Process Workflow: Physician; Patient; Clinic; blood/saliva; Clinical Lab; Analytics; extract
  • 35. One Bad MTHFR. MTHFR C677T. Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  • 36. What’s the Impact on Human Evolution? More Reading: The Red Queen: Sex and the Evolution of Human Nature
  • 37. Clinical Sequencing Business Process Workflow: Physician; Patient; Clinic; blood/saliva; Clinical Lab; Analytics; extract
  • 38. Clinical Genomics, Information Systems Perspective: Compressed Structured Base4 Data; Uncompressed Unstructured Base2 Data; extract; Base4=>Base2 Converter [[ DE-STRUCTURES ]]; “BI” Reporting and Visualization tools; Physician; Patient; Analyst; Stakeholder
  • 39. Clinical Genomics, Information Systems Perspective: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics
  • 40. Clinical Genomics, Information Systems Perspective: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics; 1º analytics; 2º analytics. Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  • 41. Clinical Applications: Performance Matters. MapR Filesystem; NFS; DNA Sequencer (x3); Raw DNA; 1º Analytics; SNP calls; Static Clinical Reporting; Physician; Patient; Reference DBs; SNP DB; ETL; 2º Analytics; Researcher; Subject
  • 42. Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics. SNP DB; 2º Analytics; New Markets. Hello! More linear algebra :) [Spark, Summingbird, Lambda Architecture Slides]
  • 43. First Bottleneck Removed. Now What? [Chart: %Effort, 0 to 100; Pre-NGS ~2000, Now, Future ~; region: ANALYTICS]
  • 44. Next Bottleneck, Of Course!
  • 45. Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  • 46. Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design-specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise. phenotype = α genotype + β + ε1 + ε2 + ε3 + ε4. At risk of over-simplifying as business-level concept…
  • 47. HUGE PROBLEM: COMBINATORIAL EXPLOSION
  • 48. What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size. Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • 49. Solution: Percolate. SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies; Denormalize and Percolate; (re)prioritize & (re)process; service queries; drive dashboards; create reports; denormalize for display; buffer; New models. @allenday on percolators: http://slidesha.re/1qSXCKw
  • 50. If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building. NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data' http://www.npr.org/2011/11/30/142893065
  • 51. If they were unlabeled, would you know which is which? • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • $$$$. Twitter’s Business Model. Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
  • 52. Robot Scientist. Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 53. Robot (Data?) Scientist. Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 54. (no text)
  • 55. Q&A: @allenday, allenday@mapr.com, allenday, linkedin.com/in/allenday
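
Slides 26 to 31 contrast three ways of finding the columns that differ from the reference, and favor "rotating" the read matrix so that each column can be scanned sequentially. The Python below is a minimal sketch of that idea, not the deck's implementation: the reference string, the reads, and the function names are all made up for illustration, and a simple majority vote stands in for real genotype calling. `call_by_rotation` follows Strategy 3 (emit one cell per base, sort by column, tally one column at a time); `call_by_running_tallies` follows Strategy 2 (one pass over the reads with a tally per column held in memory) and is included only for comparison.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical toy data: the sample differs from the reference at column 4
# (A -> T), and one read carries a sequencing error at column 11.
REFERENCE = "GATTACAGATTACA"
READS = [  # (start column in the reference, read sequence)
    (0, "GATTTCA"),
    (2, "TTTCAGA"),
    (3, "TTCAGAT"),
    (4, "TCAGATT"),
    (7, "GATTACA"),
    (8, "ATTGCA"),  # error: G instead of A at column 11
    (9, "TTACA"),
]


def emit_cells(reads):
    """MAP step: turn each read (a row) into (column, base) cells."""
    for start, seq in reads:
        for offset, base in enumerate(seq):
            yield start + offset, base


def call_by_rotation(reference, reads):
    """Strategy 3: sort cells by column (the 'rotation'), then scan columns."""
    cells = sorted(emit_cells(reads), key=itemgetter(0))  # the sort step
    variants = {}
    for col, group in groupby(cells, key=itemgetter(0)):
        # REDUCE step: only one column's tally is held at a time.
        tally = Counter(base for _, base in group)
        consensus = tally.most_common(1)[0][0]
        if consensus != reference[col]:
            variants[col] = (reference[col], consensus)
    return variants


def call_by_running_tallies(reference, reads):
    """Strategy 2: single pass over reads, tallies for all columns in memory."""
    tallies = [Counter() for _ in reference]
    for col, base in emit_cells(reads):
        tallies[col][base] += 1
    variants = {}
    for col, tally in enumerate(tallies):
        if tally:
            consensus = tally.most_common(1)[0][0]
            if consensus != reference[col]:
                variants[col] = (reference[col], consensus)
    return variants


if __name__ == "__main__":
    print(call_by_rotation(REFERENCE, READS))
```

Because the sorted cells for different column ranges are independent, the sort and the per-column tally could in principle be split across shards, which is the property slide 30 points to when it divides the sort and scan costs by the number of shards.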