© 2014 MapR Technologies 1
Hadoop for Genomics: What you need to know
© 2014 MapR Technologies 2
DNA Sequencing, pre-2004
[Chart over years: CPU (transistors/mm²), HDD (GB/mm²), DNA (bp/$, pre-2004)]
© 2014 MapR Technologies 3
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm²), HDD (GB/mm²), DNA (bp/$, pre-2004), DNA (bp/$, post-2004)]
© 2014 MapR Technologies 4
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm²), HDD (GB/mm²), DNA (bp/$, pre-2004), DNA (bp/$, post-2004)]
Similar disruption occurred for Internet traffic in the mid-1990s
© 2014 MapR Technologies 5
Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Bar chart: 2014 vs. 2020, Clinical and Non-Clinical, axis 0 to 25]
© 2014 MapR Technologies 6
Genomics Value Chain
Order Test from Clinic => Extract Biosample => BioBank Biosample => DNA Extraction => Sequence Biosample => Secondary Analytics => Tertiary Analytics => Reporting to Clinic
Academic R&D, Pharma R&D, Clinic Therapy
Increased scale requirement
Increased feature set requirement
© 2014 MapR Technologies 7
Genomics Value Chain
                 Sequence Biosample    Secondary Analytics   Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen    OK (GATK)             Not OK (manual)
Pharma R&D       OK, e.g. ILMN XTen    Not OK (GATK)         Missing, manual
Clinic Therapy   OK, e.g. ILMN XTen    Missing               Missing
Increased scale requirement
Increased feature set requirement
Requirements
• Data Intense
• Batch
• High utilization
• Low COGS
Requirements
• Data Intense
• Interactive
• Easy to integrate
• Expressive
© 2014 MapR Technologies 8
Target Application: Alleviate / Prevent (Deterministic) Suffering
[Diagram: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient]
© 2014 MapR Technologies 9
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
What Does Moore’s Law Feel Like? #Dataviz:
Lara Croft 230=>40,000 Polygons (1996-2014)
© 2014 MapR Technologies 10
Application: Forensics
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
© 2014 MapR Technologies 11
Growth in Resource Capacity
© 2014 MapR Technologies 12
Disruption Circa 2000
[Chart: NASDAQ Composite]
© 2014 MapR Technologies 13
What Happened?
What did winners do right to survive the .com recession?
[Chart: NASDAQ Composite]
© 2014 MapR Technologies 14
Early 1990s: Early eCommerce Vendor Setup
Storage
read/write
read/write
Website
Back Office
© 2014 MapR Technologies 15
Early 1990s: Early eCommerce Vendor Setup
Storage
read/write
read/write
Website
Back Office
<= SAN & NAS, Oracle
<= HPC
© 2014 MapR Technologies 16
Late 1990s: Workload became too big
Storage
read/write
read/write
Website Website Website Website
Back Office Back Office
© 2014 MapR Technologies 17
Survivor Strategy Revealed: Google Publishes
• 2003: Google Filesystem (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html
© 2014 MapR Technologies 18
Scale-out with Google FS + MapReduce
read/write
read/write
Website Website Website Website
Storage + Compute Cluster
Back Office Back Office
© 2014 MapR Technologies 19
Genomics: Internet Boom Déjà Vu
© 2014 MapR Technologies 20
DNA Sequencing, post-2004
[Chart series: DNA Sequence, NASDAQ Composite]
© 2014 MapR Technologies 21
DNA Sequencing, pre-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator / Edge Node
Sequencer
SAN & NAS =>
HPC =>
© 2014 MapR Technologies 22
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator / Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
© 2014 MapR Technologies 23
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator / Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
HPC bottleneck
Sequencer back-pressure
© 2014 MapR Technologies 24
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator / Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
HPC bottleneck
Sequencer back-pressure
NAS doesn’t look like a great solution anymore…
© 2014 MapR Technologies 25
Solution: Implemented 2014 @ Complete Genomics with MapR
write-only
DNA Sequencer Cluster (e.g. Illumina X-Ten)
Storage + Compute Cluster
Decentralize I/O
© 2014 MapR Technologies 26
MapR NFS Gateway on Application Servers
Application Server
• Linux NFS Client
• Loopback Mount: localhost:/mapr /mapr
• mapr-nfsserver: Translate NFS into API Calls (MapR client API)
MapR Data Platform
• mapr-fileserver nodes S1, S2, S3, S4, S5
• File split into Chunk 1 to Chunk 5, 256MB each; every chunk appears on three fileservers in the diagram
• MapR Inline Compression
Network Security: MapR RPC Full Wire Encryption
• Client -> Server Communication
• Server -> Server Communication
Supported Compression algorithms (per Directory): LZ4, LZF, ZLIB
Network Traffic will be compressed automatically
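The chunk layout on this slide can be sketched in a few lines of Python. This is an illustration only: the 256MB chunk size, the five servers, and the three copies per chunk come from the slide, while the round-robin placement rule and the function names are assumptions made for the example and are not MapR's actual placement policy.

```python
CHUNK_SIZE = 256 * 1024 * 1024  # 256MB, as shown on the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def place_replicas(num_chunks, servers, replicas=3):
    """Hypothetical round-robin placement: chunk i lands on `replicas` consecutive servers."""
    placement = {}
    for i in range(num_chunks):
        placement[i + 1] = [servers[(i + r) % len(servers)] for r in range(replicas)]
    return placement


servers = ["S1", "S2", "S3", "S4", "S5"]
chunks = split_into_chunks(5 * CHUNK_SIZE)        # a file of exactly five chunks
placement = place_replicas(len(chunks), servers)  # chunk number -> list of servers
```

Because every chunk lives on several servers, reads and writes from many sequencers spread across the cluster instead of queueing behind a single NAS head.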
© 2014 MapR Technologies 27
[WHITEBOARD BREAK]
© 2014 MapR Technologies 28
[REDACTED]
© 2014 MapR Technologies 29
Allows Secondary Analytics to Scale Out
[Diagram: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient]
© 2014 MapR Technologies 30
Secondary Analytics: Acute Pain Point
FastQ Reads => Aligned Reads => Variants
ADAM + Avocado
• Matrix rotation is very I/O intense
• Local de novo is best… …only feasible with efficient rotations
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney. 2008
© 2014 MapR Technologies 31
Apache Parquet
© 2014 MapR Technologies 32
Row-Oriented Format
ID     Reference  Position  Next ID  Sequence  Quality
read1  chr1       10000     read2    TTGGAG    ABCDEF
read2  chr1       20000     -        TCGTAA    ABCDEF
read3  chr2       5000      -        GGGAAC    ABCDEF
read4  chr3       1000000   read6    CCCTAC    ABCDEF
read5  chr4       900000    -        TTTAAG    ABCDEF
[Figure markers: 0, 5, 20, 40, 57]
© 2014 MapR Technologies 33
Row-Oriented Splitting
© 2014 MapR Technologies 34
Column-Oriented Format
ID:        read1, read2, read3, read4, read5
Reference: chr1, chr1, chr2, chr3, chr4
Position:  10000, 20000, 5000, 1000000, 900000
Next ID:   read2, -, -, read6, -
Sequence:  TTGGAG, TCGTAA, GGGAAC, CCCTAC, TTTAAG
Quality:   ABCDEF, ABCDEF, ABCDEF, ABCDEF, ABCDEF
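A minimal sketch of the row-to-column pivot, using the five example reads from these slides. It is plain Python rather than Parquet itself; the helper names and the run-length encoding step are assumptions added to show why a column layout helps, not a description of Parquet's on-disk format.

```python
FIELDS = ["id", "reference", "position", "next_id", "sequence", "quality"]

# The five reads from the row-oriented slide ("-" becomes None).
rows = [
    ("read1", "chr1", 10000, "read2", "TTGGAG", "ABCDEF"),
    ("read2", "chr1", 20000, None, "TCGTAA", "ABCDEF"),
    ("read3", "chr2", 5000, None, "GGGAAC", "ABCDEF"),
    ("read4", "chr3", 1000000, "read6", "CCCTAC", "ABCDEF"),
    ("read5", "chr4", 900000, None, "TTTAAG", "ABCDEF"),
]


def to_columns(rows, fields):
    """Pivot row tuples into one list per field."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(rows, FIELDS)

# A query that needs two fields touches only two columns, not whole rows.
positions_on_chr1 = [
    pos for ref, pos in zip(columns["reference"], columns["position"]) if ref == "chr1"
]

# Values of one type sit next to each other, so repeats compress well.
reference_rle = run_length_encode(columns["reference"])
quality_rle = run_length_encode(columns["quality"])
```

The same two effects, reading fewer bytes per query and compressing similar values together, are what make columnar formats attractive for read and variant data.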
© 2014 MapR Technologies 35
Column-Oriented Format Partitioning
ID:        read1, read2, read3, read4, read5
Reference: chr1, chr1, chr2, chr3, chr4
Position:  10000, 20000, 5000, 1000000, 900000
Next ID:   read2, -, -, read6, -
Sequence:  TTGGAG, TCGTAA, TTGGAG, GGGAAC, TTTAAG
Quality:   ABCDEF, ABCDEF, ABCDEF, ABCDEF, ABCDEF
© 2014 MapR Technologies 36
Column-Oriented Format Splitting
© 2014 MapR Technologies 37
Apache Parquet
© 2014 MapR Technologies 38
Apache Parquet
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
© 2014 MapR Technologies 39
Allows Secondary Analytics to Scale Out
GATK / HPC method: flat after chromosome split
Hadoop / Spark method
© 2014 MapR Technologies 40
Tertiary Analytics
© 2014 MapR Technologies 41
Downstream Analytics: GWAS/PheWAS
FastQ Reads => Aligned Reads => Variants => Function => Phenotypes
ADAM + Avocado
Scalable GWAS/PheWAS: “Green Field” Territory
© 2014 MapR Technologies 42
Target Application: Alleviate / Prevent Suffering
[Diagram: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient]
© 2014 MapR Technologies 43
GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
© 2014 MapR Technologies 44
PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
© 2014 MapR Technologies 45
Genome × Phenome Analysis
For a given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix
[Matrix: rows 𝛿1, 𝛿3, 𝛿5 (SNPs); columns ϕ1, ϕ3, ϕ5 (phenotypes)]
SPARSE: Billion+ Phenotypes
SPARSE: Billion+ Genotypes
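The counting rule above can be sketched with a sparse dictionary keyed by (SNP, phenotype). The toy population, the identifiers, and the function names are all hypothetical. The two lookups mirror the GWAS and PheWAS questions from the earlier slides, but they return raw co-occurrence counts, not a statistical association test.

```python
from collections import Counter

# Hypothetical toy population: individual -> (SNPs carried, phenotypes observed).
population = {
    "ind1": ({"d1", "d3"}, {"p1"}),
    "ind2": ({"d1"}, {"p1", "p5"}),
    "ind3": ({"d5"}, {"p3"}),
}


def build_count_matrix(population):
    """Sparse matrix: (snp, phenotype) -> number of individuals showing both."""
    counts = Counter()
    for snps, phenotypes in population.values():
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return counts


def gwas(counts, phenotype):
    """One column: which SNPs co-occur with this phenotype?"""
    return {s: n for (s, p), n in counts.items() if p == phenotype}


def phewas(counts, snp):
    """One row: which phenotypes co-occur with this SNP?"""
    return {p: n for (s, p), n in counts.items() if s == snp}


matrix = build_count_matrix(population)
snps_for_p1 = gwas(matrix, "p1")
phenotypes_for_d1 = phewas(matrix, "d1")
```

Only non-zero cells are stored, which is what keeps a billion-by-billion matrix tractable when almost every cell is empty.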
© 2014 MapR Technologies 46
Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
[Matrix: rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5]
Principal Column Vector: Archetype Genotypes
Principal Row Vector: Archetype Phenotypes
Sparse Matrix Package is Actively Developed in Spark Community
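One way to read the "principal row and column vectors" is as the leading singular vector pair of the count matrix. That reading is an assumption, since the slide speaks of eigenvectors of X and Y without defining them. The sketch below uses plain power iteration on a small made-up matrix; at real scale this would run on a distributed sparse-matrix library rather than Python lists.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def principal_vectors(matrix, iterations=100):
    """Power iteration for the leading (column, row) singular vector pair."""
    matrix_t = transpose(matrix)
    v = normalize([1.0] * len(matrix[0]))     # one weight per phenotype
    for _ in range(iterations):
        u = normalize(mat_vec(matrix, v))     # one weight per SNP
        v = normalize(mat_vec(matrix_t, u))
    u = normalize(mat_vec(matrix, v))
    return u, v


# Hypothetical counts: rows are SNPs d1, d3, d5; columns are phenotypes p1, p3, p5.
counts = [
    [4, 0, 3],
    [2, 0, 1],
    [0, 2, 0],
]
archetype_genotype, archetype_phenotype = principal_vectors(counts)
```

In this toy matrix d1 and d3 travel with p1 and p5, so the leading pair puts its weight on those SNPs and phenotypes and almost none on d5 and p3.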
© 2014 MapR Technologies 47
Generalized Approach: Genome × Phenome Tensor
• Maintain individual identity
• Aggregating individuals gives up statistical power
• Leverage pedigrees – Individuals are not independent observations
[Diagram axes: Variants, Phenotypes]
© 2014 MapR Technologies 48
Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
G: Genotypes
P: Med Record Phenotypes, e.g. disease risk, drug response
Fortunately, this has already been done…
© 2014 MapR Technologies 49
Largest Biometric Database in the World
1.2B PEOPLE
© 2014 MapR Technologies 50
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing leakage of 30-40%
Standardize identity => Stop leakage
© 2014 MapR Technologies 51
Aadhaar Biometric Capture & Index
Raw
Digital
Fingerprint
© 2014 MapR Technologies 52
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
Low Entropy + Unique
Low Entropy + Infrequent
© 2014 MapR Technologies 53
Consistent, Low Latency
[Chart legend: M7 Read Latency, Others Read Latency]
© 2014 MapR Technologies 54
How Does this Relate to Genomics?
F-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
© 2014 MapR Technologies 55
How Does this Relate to Genomics?
F-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Phenotype:
healthy or sick?
Phenotype Partition
=>
Low Entropy
© 2014 MapR Technologies 56
Takeaway 1: Don’t reinvent the wheel
• Individuals × fingerprint minutiae: find rare minutiae to uniquely identify
≈
• Medical records × genetic variants: find shared variants to get disease root cause
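The analogy can be sketched with plain set operations. One query looks for features that are rare across the population (identification). The other looks for features shared by a chosen group and absent elsewhere (a candidate root cause). The records, feature names, and function names below are made up for illustration and do not describe how Aadhaar or any variant store is implemented.

```python
from collections import Counter


def feature_counts(records):
    """How many records carry each feature."""
    counts = Counter()
    for features in records.values():
        counts.update(features)
    return counts


def rare_features(records, max_count=1):
    """Per record, the features carried by at most max_count records."""
    counts = feature_counts(records)
    return {
        name: {f for f in features if counts[f] <= max_count}
        for name, features in records.items()
    }


def shared_in_cases_only(records, cases):
    """Features every case carries and no other record carries."""
    case_sets = [records[name] for name in cases]
    shared = set.intersection(*case_sets) if case_sets else set()
    in_controls = set()
    for name, features in records.items():
        if name not in cases:
            in_controls |= features
    return shared - in_controls


# Hypothetical records: the feature sets could be minutiae or variants.
records = {
    "person_a": {"m1", "m2", "m7"},
    "person_b": {"m1", "m3"},
    "person_c": {"m1", "m2", "m9"},
}
identifying = rare_features(records)
candidate_causes = shared_in_cases_only(records, cases=["person_a", "person_c"])
```

Both queries run over the same individuals-by-features table, which is the sense in which the slide calls the two problems the same data shape.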
© 2014 MapR Technologies 57
Takeaway 2: Evolution, not Revolution
[Chart series: DNA Sequence, NASDAQ Composite]
© 2014 MapR Technologies 58
Thank You
@allenday // @mapr
Now a few slides about MapR’s product…
…and proposed next actions
© 2014 MapR Technologies 59
“Quick Start” Package
Engagement includes:
1. Identification of data sources, transformations and reporting engines
2. Access and use of the solution template including source code
3. Training on customizing the solution template to the organization’s requirements
4. Deployment architecture document that enables a production deployment plan for the specific solution
SOLUTION TEMPLATE
KNOWLEDGE TRANSFER
DEPLOYMENT ARCHITECTURE
© 2014 MapR Technologies 60
“Quick Start” 1 – Resequencing with Hadoop
• Reduces Storage Hardware Requirements
• Accelerates Data Processing Time
• Minimal impact to existing data pipelines
“Quick Start” 2 – Variant Analysis with NoSQL
• Present data for exploration
• Operationalize complex workflows
• Web-scale performance
© 2014 MapR Technologies 62
Genomics Value Chain
                 Sequence Biosample    Secondary Analytics   Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen    OK (GATK)             Not OK
Pharma R&D       OK, e.g. ILMN XTen    Not OK                Missing
Clinic Therapy   OK, e.g. ILMN XTen    Missing               Missing
© 2014 MapR Technologies 63
Genomics Value Chain
                 Sequence Biosample    Secondary Analytics   Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen    OK (GATK)             Not OK
Pharma R&D       OK, e.g. ILMN XTen    Not OK                Missing
Clinic Therapy   OK, e.g. ILMN XTen    Missing               Missing
Addressed by Quick Start 1
Addressed by Quick Start 2
© 2014 MapR Technologies 64
BONUS ROUND
© 2014 MapR Technologies 65
Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY
© 2014 MapR Technologies 66
GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
© 2014 MapR Technologies 67
Projected GERMLINE run times (in hours)
[Chart: Hours (0 to 700) vs. Samples (2,500 to 122,500). Series: GERMLINE run times; Projected GERMLINE run times]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY
© 2014 MapR Technologies 68
GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
© 2014 MapR Technologies 69
Run times for matching (in hours)
[Chart: Hours (0 to 180) vs. Samples. Series: GERMLINE run times; Jermline run times; Projected GERMLINE run times. Annotations: EXPONENTIAL, LINEAR, HBase Refactor]
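The slides do not show how Jermline works. The sketch below is only a guess at the general idea behind a stateful refactor: keep an index of segments already seen, so each new sample is compared against stored state instead of re-running the whole pool. The class name, the fixed-window keying, and the in-memory dictionary standing in for an HBase table are all assumptions.

```python
from collections import defaultdict


class IncrementalMatcher:
    """Toy stateful matcher: index fixed-size windows of each sample's sequence."""

    def __init__(self, window=4):
        self.window = window
        # (window number, window text) -> ids of samples already stored there.
        self.index = defaultdict(set)

    def _keys(self, haplotype):
        """Yield one key per non-overlapping window of the sequence."""
        for start in range(0, len(haplotype) - self.window + 1, self.window):
            yield (start // self.window, haplotype[start:start + self.window])

    def add_sample(self, sample_id, haplotype):
        """Store a new sample and return {earlier sample id: shared window count}."""
        matches = defaultdict(int)
        for key in self._keys(haplotype):
            for other in self.index[key]:
                matches[other] += 1
            self.index[key].add(sample_id)
        return dict(matches)


matcher = IncrementalMatcher(window=4)
first = matcher.add_sample("s1", "AAAACCCCGGGG")
second = matcher.add_sample("s2", "AAAATTTTGGGG")
third = matcher.add_sample("s3", "TTTTTTTTTTTT")
```

Adding a sample costs work proportional to its own windows plus the hits it finds, so the total grows with the number of samples added rather than with repeated all-against-all runs.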
© 2014 MapR Technologies 70
• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]
© 2014 MapR Technologies 71
Further Growth & Optimization
© 2014 MapR Technologies 72
Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation
– With improved accuracy!
Underdog replaces Beagle
[Chart: Total Run Size and Total Beagle-Underdog Duration, axis 0 to 80,000]
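The speedup implied by these figures is simple arithmetic. The slide says "under 25 minutes", so 25 is used here as an upper bound on the new run time and the computed factor is therefore a lower bound; the variable names are illustrative.

```python
def speedup(before_minutes, after_minutes):
    """Ratio of old run time to new run time."""
    return before_minutes / after_minutes


before_minutes = 12 * 60   # 12 hours per 1,000 samples with Beagle
after_minutes = 25         # "under 25 minutes" with Underdog, taken as an upper bound

factor = speedup(before_minutes, after_minutes)        # roughly 28.8
samples_per_hour_before = 1000 / 12                    # roughly 83
samples_per_hour_after = 1000 / (after_minutes / 60)   # roughly 2,400
```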
© 2014 MapR Technologies 73
Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Chart: duration of each pipeline step per run, x-axis from 500 to 395,432, y-axis 0 to 250,000. Series: Beagle-Underdog Phasing; Pipeline Finalize; Relationship Processing; Germline-Jermline Results Processing; Germline-Jermline Processing; Beagle Post Phasing; Admixture; Plink Prep; Pipeline Initialization]
Annotations: Jermline replaces Germline; Ethnicity V2 Release; Underdog replaces Beagle; AdMixture on Hadoop
© 2014 MapR Technologies 74
…while the business continues to grow rapidly
[Chart: DNA Database Size (# of processed samples), Jan-12 to Apr-14, axis 0 to 450,000]

Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI


Editor's Notes

  • #6 clinical
  • #50 49
  • #51 Increase GDP by 2%
  • #53 BOOM LSH
  • #54 This chart shows that MapR-DB (the database in the MapR Enterprise Database Edition, formerly known as M7) (in blue) consistently reads data quickly with no spikes. Other distributions suffer from periodic “housekeeping” tasks like compactions (defragmentation) and garbage collection, leading to sharp spikes in read delays.
  • #74 Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ Jermline was released.