1.
Genome Analysis Pipelines, Big Data Style
© 2015 MapR Technologies
Allen Day, PhD // Chief Scientist @ MapR.com
2016.04.12, Big Data Everywhere
2.
Agenda
• Presentation Motivations
– Data inertia, data local computing
• Highlights of BigData solutions ecosystem
– MapR, NoSQL, Spark
• Biotech Analytics Use Cases: transition from sensors to insights
– Population DBs: NoSQL performance
– Cost savings: NoSQL cost structure
– Legacy tools integration: Spark wrappers
3.
Data Inertia
• Newton’s 1st Law of Motion (Law of Inertia)
• “An object at rest stays at rest … unless acted upon by an unbalanced force”
• Force required to transport data increases with data size and device latency
– CPU < CPU caches < RAM < Disk/SSD < Network (bigger toward the right, faster toward the left)
4.
Data Inertia + Exponential Data Growth => Data Local “BigData” Computing
• Traditional algorithm design moves data to the executing program
– High Perf Cluster + Storage Network (HPC+SAN)
• Key insight – the program is proportionally much smaller than the data, thus easier to move.
• Modern algorithm design moves the executing program to the data
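A back-of-the-envelope sketch of the key insight, in Python. The data size, program size, and link speed are hypothetical values chosen only for illustration, and the transfer model ignores latency and protocol overhead.

```python
# Hypothetical sizes and link speed, chosen only for illustration.
DATA_BYTES = 1 * 10**12            # 1 TB of data (assumed)
PROGRAM_BYTES = 10 * 10**6         # 10 MB program (assumed)
LINK_BYTES_PER_SEC = 125 * 10**6   # 1 Gbit/s link = 125 MB/s (assumed)


def transfer_seconds(size_bytes, bytes_per_sec=LINK_BYTES_PER_SEC):
    """Idealized transfer time: size / bandwidth, ignoring latency and overhead."""
    return size_bytes / bytes_per_sec


move_data = transfer_seconds(DATA_BYTES)        # ship the data to the program
move_program = transfer_seconds(PROGRAM_BYTES)  # ship the program to the data

print(f"move data:    {move_data:,.0f} s")
print(f"move program: {move_program:.2f} s")
print(f"ratio:        {move_data / move_program:,.0f}x")
```

Under these assumed numbers, the gap between the two transfer times is simply the ratio of data size to program size, which is the point of the slide.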
5.
Some BigData Tools
What is Spark?
• Spark is a parallel computing framework that allows a job to run on 1000s of computers as easily as on 1. No code changes required.
• Makes good use of RAM and SSD storage
What is HBase?
• HBase is a non-relational (NoSQL), distributed database modeled on Google’s BigTable.
• Provides highly scalable sustained and random access to very large data sets
6.
MapR Converged Platform for BigData
7.
Cost-Effective ETL (Novartis)
8.
The Problem
• Key step in data ingest for R&D is handled by the enterprise data warehouse (EDW)
– Video, Proteomics, NGS, Metagenomics
• EDW at maximum capacity
– Multiple rounds of software optimization already done
– Data still growing
• Insight-limiting (= career-limiting) bottleneck
9.
Three Options
1. No more insights / candidates
2. Increase EDW size
– Expensive
– Known to not scale well
3. Find a more scalable solution
10.
Original Flow – ELTL
[Diagram labels: Raw data; Extract, Load; Data Warehouse; Transform, Load; Knowledge graph; Downstream Analysis (R&D)]
Raw data: public/private, compounds, expression data, genotype data, EHR data, …
11.
Simplified Analysis – EDW Strategy
• Majority of EDW storage consumed by ELTL processing
– Caused by a minority of the code (raw data transformations)
• Increasing EDW capacity yields sub-linear performance gains
– Poor division of labor
12.
With ETL Offload
[Diagram labels: Raw data; Extract, Load; Transform, Load; MapR; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
Raw data: public/private, compounds, expression data, genotype data, EHR data, …
13.
Simplified Analysis – MapR Strategy
• Lower cost per TB of added ETL capacity by replacing EDW with MapR
• Scale-out architecture – linear spend gives linear performance increase
• Strategic advantage – next-gen architecture for implementing new use cases
– Insights/time (and career) acceleration
14.
Additionally…
[Diagram labels: Raw data; Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
Raw data: public/private, compounds, expression data, genotype data, EHR data, …
15.
New Use Cases are Enabled
[Diagram labels: Raw data; Extract, Load; MapR; Transform, Load; New Use Cases; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
Raw data: public and private, compounds, expression data, genotype data, EHR data, …
16.
NoSQL: Scalable Population DBs
17.
Catalog genetic variants => find QTLs
• Current public human cohort proposals: 100K–1M individuals, >400% CAGR
• Seed and livestock companies: same trend
• Px/Dx biomarkers for PGx, reproductive medicine, biometrics, etc.
• The idea is to catalog genetic variants and find quantitative trait loci (QTLs)
• Well-studied problem; let’s take a look
18.
Genome × Phenome Analysis
[Figure: sparse matrix with a genotype axis (SNPs 𝛿1, 𝛿3, 𝛿5, …; a billion+) and a phenotype axis (ϕ1, ϕ3, ϕ5, …; a billion+)]
For a given population, a given SNP 𝛿, and a given phenotype ϕ: the matrix value is the number of occurrences.
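A minimal Python sketch of building such a count matrix in sparse form. It assumes "occurrences" means co-occurrences of a SNP and a phenotype in the same individual; the record layout and the identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical observations: one (individual, SNP, phenotype) triple per
# co-occurrence. Real inputs would come from variant calls and phenotype records.
observations = [
    ("ind1", "snp1", "pheno1"),
    ("ind2", "snp1", "pheno1"),
    ("ind2", "snp3", "pheno5"),
    ("ind3", "snp5", "pheno3"),
    ("ind3", "snp1", "pheno1"),
]


def count_matrix(triples):
    """Sparse genome x phenome matrix stored as {(snp, phenotype): count}."""
    counts = Counter()
    for _individual, snp, phenotype in triples:
        counts[(snp, phenotype)] += 1
    return counts


matrix = count_matrix(observations)
print(matrix[("snp1", "pheno1")])  # cells never observed read as 0
```

Only non-zero cells are stored, which is what makes a matrix with a billion+ rows and columns tractable at all.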
19.
Associate QTLs to variants via Genome × Phenome Matrix Factorization
[Figure: the genome × phenome matrix (𝛿1, 𝛿3, 𝛿5 × ϕ1, ϕ3, ϕ5) factorized with Spark & MapR into archetypal genotypes (column eigenvector) and archetypal phenotypes (row eigenvector)]
• Row eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
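The slides do not say which factorization is used. One common reading is a singular value decomposition, whose left and right singular vectors are the eigenvectors of M·Mᵀ and Mᵀ·M. The Python sketch below, under that assumption, recovers the leading pair of factors by power iteration on a tiny dense stand-in matrix with hypothetical values; it is not the Spark implementation referred to on the slide.

```python
import math

# Small dense stand-in for the sparse SNP x phenotype count matrix
# (hypothetical values). Rows are SNPs, columns are phenotypes.
M = [
    [3.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 2.0],
]


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def leading_factors(matrix, iterations=200):
    """Power iteration for the leading singular triple (u, sigma, v)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = normalize([1.0] * n_cols)
    for _ in range(iterations):
        # u <- M v, then v <- M^T u, renormalizing after each step
        u = normalize([sum(matrix[i][j] * v[j] for j in range(n_cols))
                       for i in range(n_rows)])
        v = normalize([sum(matrix[i][j] * u[i] for i in range(n_rows))
                       for j in range(n_cols)])
    mv = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in mv))
    u = [x / sigma for x in mv]
    return u, sigma, v


# u weights SNPs (an "archetypal genotype"); v weights phenotypes.
u, sigma, v = leading_factors(M)
```

In this toy matrix the second SNP and second phenotype are unrelated to the others, so their weights in the leading factors shrink toward zero while the first and third group together.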
22.
[Figure: genome × phenome matrix (𝛿1, 𝛿3, 𝛿5 × ϕ1, ϕ3, ϕ5) with archetypal genotypes (column eigenvector) and archetypal phenotypes (row eigenvector)]
Moreover… This is a generalized GWAS: it’s PheWAS
NB: These calculations are a mixed I/O workload: they require high-throughput sustained reads and low-latency random access
Proven MapR-DB use case: Aadhaar biometric system, biometrics for 1B humans
24.
If we change the labels…
[Figure: the same matrix relabeled, with documents (doc1, doc3, doc5) on one axis and users (user1, user3, user5) on the other]
25.
[Figure: document × user matrix (doc1, doc3, doc5 × user1, user3, user5), with factors labeled INTERESTS and BEHAVIORS]
We have the core of the Google / Facebook / Twitter ad revenue engine
27.
Spark: Porting Legacy Pipelines
28.
Alignment
[Diagram: DNA Reads + Reference Sequences → Aligned Reads → Downstream Applications…]
29.
Alignment
[Diagram: DNA Reads + Reference Sequences → Align() → Aligned Reads → Downstream Applications…]
30.
Possible Align() Outcomes
[Diagram: Unaligned DNA Reads + Reference Sequences → Align() → one of three outcomes]
• Single Location Reads
• Multiple Location Reads
• Unlocatable Reads
31.
Many-to-Many Relationship Between Reads and Locations
Reads:
• Read1
• Read2
• Read3
• Read4
• NULL
Locations:
• LocationA
• LocationB
• LocationC
• LocationD
• LocationA
• NULL
• LocationE
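A short Python sketch of how read-to-location pairs translate into the three Align() outcomes. The pairing used here is an assumed example, since the connecting lines of the original figure are not recoverable from the transcript.

```python
from collections import defaultdict

# Hypothetical alignment hits as (read, location) pairs; None marks "no location".
# This pairing is an assumed example, not the mapping drawn on the slide.
hits = [
    ("Read1", "LocationA"),
    ("Read2", "LocationB"),
    ("Read2", "LocationC"),
    ("Read3", "LocationA"),
    ("Read4", None),
]


def classify_reads(pairs):
    """Group reads into single-location, multiple-location, and unlocatable."""
    locations_by_read = defaultdict(set)
    for read, location in pairs:
        locations_by_read[read]  # ensure every read has an entry, even with no hit
        if location is not None:
            locations_by_read[read].add(location)

    outcome = {"single": [], "multiple": [], "unlocatable": []}
    for read, locations in sorted(locations_by_read.items()):
        if not locations:
            outcome["unlocatable"].append(read)
        elif len(locations) == 1:
            outcome["single"].append(read)
        else:
            outcome["multiple"].append(read)
    return outcome


outcome = classify_reads(hits)
print(outcome)
```

Read1 and Read3 share LocationA while Read2 maps to two locations, so neither side of the relationship is unique: that is the many-to-many property.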
32.
Parallelizing Alignment
[Diagram: Unaligned DNA Reads → Split() → Part1, Part2, Part3 → Align() → Locations (one set per part) → Concat() → Sort() → Etc… → Aligned DNA Reads]
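The Split() / Align() / Concat() / Sort() stages can be sketched on a single machine in Python. This is a toy illustration under stated assumptions: the reference string, the reads, and the exact-match `align` function are hypothetical stand-ins, and a thread pool replaces the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "ACGTACGTTAGC"  # toy reference sequence (assumed)


def split(reads, n_parts):
    """Split(): deal reads round-robin into n_parts partitions."""
    return [reads[i::n_parts] for i in range(n_parts)]


def align(part):
    """Align(): toy exact-match aligner returning (position, read) pairs.

    Position is the first offset of the read in REFERENCE, or -1 if absent.
    A real aligner tolerates mismatches and reports far more detail.
    """
    return [(REFERENCE.find(read), read) for read in part]


def parallel_align(reads, n_parts=3):
    parts = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        aligned_parts = list(pool.map(align, parts))          # Align() per part
    merged = [hit for part in aligned_parts for hit in part]  # Concat()
    return sorted(merged)                                     # Sort() by position


result = parallel_align(["TAGC", "ACGT", "GGGG", "CGTT", "GTAC"])
print(result)
```

Each partition is aligned independently, so the Align() stage scales with the number of partitions; only Concat() and Sort() need to see all results.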
33.
Using HPC+SAN Has Bottlenecks (GridEngine, etc.)
[Diagram: Part1, Part2, Part3, annotated with: Volume Read Bottleneck; Volume Write Bottleneck; Read & Write Bottleneck]
34.
Using Spark Eliminates Bottlenecks
[Diagram: Split() → Align() → Concat() → Sort()]
35.
Bottom Level: Integration with Legacy Tools
[Diagram: Container with Local I/O, running a Legacy Sub-process]
37.
Bottom Level: Integration with Legacy Tools
• No time today to look at code, but there is a deeper slideshow on doing this with the Bowtie aligner:
• http://www.slideshare.net/allenday
• https://github.com/allenday/spark-genome-alignment-demo
[Diagram: Container with Local I/O, running a Legacy Sub-process]
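The general pattern of wrapping a legacy tool as a sub-process fed through local I/O can be sketched in Python as follows. This is not the code from the linked demo: the "legacy tool" is a hypothetical stand-in (a tiny Python one-liner that upper-cases its input) so the example runs anywhere, and a real pipeline would invoke the aligner binary instead. Spark offers a comparable mechanism in `RDD.pipe`, though whether the demo uses it is not stated in the slides.

```python
import subprocess
import sys

# Stand-in for a legacy command-line tool: reads lines on stdin and writes one
# result line per input line on stdout. Hypothetical; it only upper-cases input.
LEGACY_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]


def run_legacy_tool(partition, command=LEGACY_COMMAND):
    """Feed one partition of records to the sub-process and collect its output lines."""
    payload = "".join(record + "\n" for record in partition)
    completed = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()


# Each partition gets its own sub-process, mirroring one container per data split.
partitions = [["acgt", "ttag"], ["ggcc"]]
results = [run_legacy_tool(part) for part in partitions]
print(results)
```

Because the tool only sees stdin and stdout, it needs no modification; the wrapper decides where each partition's data comes from and where the output goes.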
38.
Thanks! Questions?
@allenday, @mapr
aday@mapr.com
linkedin.com/in/allenday
slideshare.net/allenday