© 2015 MapR Technologies
Allen Day, PhD // Chief Scientist @ MapR.com
2016.04.12, Big Data Everywhere
Agenda
•  Presentation Motivations
–  Data inertia, data local computing
•  Highlights of BigData solutions ecosystem
–  MapR, NoSQL, Spark
•  Biotech Analytics Use Cases
–  Transition from sensors to insights - population DBs
•  NoSQL performance
–  Cost savings
•  NoSQL cost structure
–  Legacy tools – integration
•  Spark wrappers
Data Inertia
•  Newton’s 1st Law of Motion (Law of Inertia)
•  “An object at rest stays at rest … unless acted upon by an unbalanced force”
•  Force required to transport data increases with data size and device latency
–  CPU < CPU caches < RAM < Disk/SSD < Network (bigger toward the right, faster toward the left)
Data Inertia + Exponential Data Growth => Data Local “BigData” Computing
•  Traditional algorithm design moves data to the executing program
–  High Perf Cluster + Storage Network (HPC+SAN)
•  Key insight – program proportionally much smaller than data, thus easier to move.
•  Modern algorithm design moves executing program to the data
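As a rough illustration of data inertia and of why it is cheaper to move the program than the data, transfer time can be modeled as latency plus size divided by bandwidth. The sketch below assumes this simple model; the device latencies, bandwidths, and the 1 MB / 1 TB sizes are hypothetical placeholders, not figures from the talk.

```python
def transfer_seconds(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical device characteristics, for illustration only.
DEVICES = {
    "RAM": {"latency_s": 1e-7, "bandwidth": 10e9},
    "SSD": {"latency_s": 1e-4, "bandwidth": 500e6},
    "Network": {"latency_s": 1e-3, "bandwidth": 100e6},
}


def compare_devices(size_bytes):
    """Estimated seconds to move size_bytes through each device."""
    return {
        name: transfer_seconds(size_bytes, d["latency_s"], d["bandwidth"])
        for name, d in DEVICES.items()
    }


def ship_program_vs_data(program_bytes, data_bytes):
    """Network cost of moving the program to the data versus the reverse."""
    net = DEVICES["Network"]
    program = transfer_seconds(program_bytes, net["latency_s"], net["bandwidth"])
    data = transfer_seconds(data_bytes, net["latency_s"], net["bandwidth"])
    return program, data


if __name__ == "__main__":
    for name, seconds in compare_devices(10**9).items():
        print(f"{name}: {seconds:.4f} s for 1 GB")
    program, data = ship_program_vs_data(10**6, 10**12)
    print(f"ship 1 MB program: {program:.3f} s, ship 1 TB data: {data:.0f} s")
```

Under these assumed numbers the gap between shipping the program and shipping the data is several orders of magnitude, which is the point of the data-local design.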
Some BigData Tools
What is Spark?
•  Spark is a parallel computing framework that allows a job to run on 1000s of computers as easily as 1. No code changes required.
•  Makes good use of RAM and SSD storage
What is HBase?
•  HBase is a non-relational (NoSQL), distributed database modeled on Google’s Bigtable.
•  Provides highly scalable sustained and random access to very large data sets
MapR Converged Platform for BigData
Cost-Effective ETL (Novartis)
The Problem
•  Key step in data ingest for R&D handled by enterprise data warehouse (EDW)
–  Video, Proteomics, NGS, Metagenomics
•  EDW at maximum capacity
–  Multiple rounds of software optimization already done
–  Data still growing
•  Insight-limiting (= career-limiting) bottleneck
Three Options
1.  No more insights / candidates
2.  Increase EDW size
–  Expensive
–  Known to not scale well
3.  Find a more scalable solution
Original Flow – ELTL
Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Flow diagram. Stage labels: Extract, Load; Transform, Load; Data Warehouse; Knowledge graph; Downstream Analysis (R&D).]
Simplified Analysis – EDW Strategy
•  Majority of EDW storage consumed by ELTL processing
–  Caused by minority of code (raw data transformations)
•  Increasing EDW capacity yields sub-linear performance
–  Poor division of labor
With ETL Offload
Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Flow diagram. Stage labels: Extract, Load; Transform, Load; MapR; Knowledge graph; Data Warehouse; Downstream Analysis (R&D).]
Simplified Analysis – MapR Strategy
•  Lower cost per TB of increased ETL capacity by replacing EDW with MapR
•  Scale-out architecture – linear spend gives linear performance increase
•  Strategic advantage – next-gen architecture for implementing new use cases
–  Insights/time (and career) acceleration
Additionally…
Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Flow diagram, next build. Stage labels: Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D).]
New Use Cases are Enabled
Raw data:
•  Public and private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Flow diagram, final build. Stage labels: Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D); New Use Cases.]
NoSQL: Scalable Population DBs
Catalog genetic variants => find QTLs
•  Current public human cohort proposals: 100K–1M individuals, >400% CAGR
•  Seed and livestock companies, same trend
•  Px/Dx biomarkers for PGx, reproductive medicine, biometrics, etc.
•  Idea is to catalog genetic variants, find QTLs (quantitative trait loci)
•  Well-studied problem, let’s take a look
Genome × Phenome Analysis
[Figure: sparse matrix with SNPs 𝛿 on one axis (𝛿1, 𝛿3, 𝛿5 shown; billion+ genotypes) and phenotypes ϕ on the other (ϕ1, ϕ3, ϕ5 shown; billion+ phenotypes).]
For given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix.
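The counting rule above can be sketched in a few lines. This is a minimal illustration, assuming that an "occurrence" means one individual who carries the SNP and also shows the phenotype; the toy population and the identifiers `d1`, `p1`, and so on are hypothetical. Only nonzero cells are stored, which is what keeps a billion-by-billion matrix tractable.

```python
from collections import Counter


def build_count_matrix(individuals):
    """Map (snp, phenotype) to the number of individuals showing both.

    Cells that never occur are simply absent, so the matrix stays sparse.
    """
    counts = Counter()
    for person in individuals:
        for snp in person["snps"]:
            for phenotype in person["phenotypes"]:
                counts[(snp, phenotype)] += 1
    return counts


# Hypothetical toy population, for illustration only.
POPULATION = [
    {"snps": {"d1", "d3"}, "phenotypes": {"p1"}},
    {"snps": {"d1"}, "phenotypes": {"p1", "p5"}},
    {"snps": {"d5"}, "phenotypes": {"p3"}},
]

if __name__ == "__main__":
    matrix = build_count_matrix(POPULATION)
    for (snp, phenotype), count in sorted(matrix.items()):
        print(snp, phenotype, count)
```

In a NoSQL store the same structure would typically be keyed by SNP or phenotype identifier, with one stored cell per nonzero count; that mapping is a design assumption here, not something the slides specify.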
Associate QTLs to variants via Genome × Phenome Matrix Factorization
[Figure: the 𝛿 × ϕ matrix, factorized w/ Spark & MapR into Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector).]
•  Row Eigenvectors of X represent
–  Sets of related phenotypes (by SNP)
•  Column Eigenvectors of Y represent
–  Sets of related SNPs (by phenotype)
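One way to read the factorization step is as a truncated singular value decomposition, whose singular vectors are eigenvectors of M·Mᵀ and Mᵀ·M. The slides do not name the algorithm, so the sketch below swaps in plain power iteration on a small dense matrix as an assumption; at scale a distributed routine would take its place. The toy matrix and function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def leading_factors(matrix, iterations=100):
    """Power iteration for the leading singular triple (u, sigma, v).

    u weights the rows (SNPs) and v weights the columns (phenotypes).
    """
    matrix_t = transpose(matrix)
    v = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        u = normalize(matvec(matrix, v))
        v = normalize(matvec(matrix_t, u))
    scaled_u = matvec(matrix, v)
    sigma = math.sqrt(sum(x * x for x in scaled_u))
    u = [x / sigma for x in scaled_u]
    return u, sigma, v


# Hypothetical counts: rows are SNPs, columns are phenotypes.
COUNTS = [
    [2.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    u, sigma, v = leading_factors(COUNTS)
    print("archetypal genotype weights:", [round(x, 3) for x in u])
    print("archetypal phenotype weights:", [round(x, 3) for x in v])
    print("strength:", round(sigma, 3))
```

In this toy case the leading pair of vectors puts equal weight on the first two SNPs and the first two phenotypes, grouping them as one related set, and gives the third row and column no weight.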
Moreover… This is a generalized GWAS: it’s PheWAS
[Figure: the same 𝛿 × ϕ matrix with Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector).]
NB: These calculations are a mixed I/O workload – they require high-throughput sustained reads and low-latency random access
Proven MapR-DB use case: Aadhaar biometric system, biometrics for 1B humans
Furthermore… If we change the labels…
[Figure: the same matrix, relabeled from 𝛿1, 𝛿3, 𝛿5 and ϕ1, ϕ3, ϕ5 to doc1, doc3, doc5 and user1, user3, user5, with axes INTERESTS and BEHAVIORS.]
We have the core of the Google / Facebook / Twitter ad revenue engine
Spark: Porting Legacy Pipelines
Alignment
[Diagram: DNA Reads and Reference Sequences are inputs to Align(), which outputs Aligned Reads for Downstream Applications…]
Possible Align() Outcomes
[Diagram: Unaligned DNA Reads and Reference Sequences are inputs to Align(), which yields three outcomes:]
•  Single Location Reads
•  Multiple Location Reads
•  Unlocatable Reads
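The three outcomes can be illustrated with a toy aligner. This is a minimal sketch that assumes exact substring matching stands in for Align(); real aligners such as Bowtie tolerate mismatches and use indexed references. The reference sequences and reads are hypothetical.

```python
def find_locations(read, references):
    """Return every (reference_name, offset) where the read matches exactly."""
    hits = []
    for name, sequence in references.items():
        start = sequence.find(read)
        while start != -1:
            hits.append((name, start))
            start = sequence.find(read, start + 1)
    return hits


def classify(reads, references):
    """Sort reads into single-location, multiple-location, and unlocatable."""
    outcome = {"single": [], "multiple": [], "unlocatable": []}
    for read in reads:
        hits = find_locations(read, references)
        if not hits:
            outcome["unlocatable"].append((read, hits))
        elif len(hits) == 1:
            outcome["single"].append((read, hits))
        else:
            outcome["multiple"].append((read, hits))
    return outcome


# Hypothetical toy inputs, for illustration only.
REFERENCES = {"chr1": "ACGTACGTTT", "chr2": "GGGACGAAA"}
READS = ["GTTT", "ACG", "TTTTT"]

if __name__ == "__main__":
    for category, entries in classify(READS, REFERENCES).items():
        print(category, entries)
```

Because one read can hit several locations and one location can be hit by several reads, the result is naturally a list of (read, location) pairs rather than a one-to-one table.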
Many-to-Many Relationship Between Reads and Locations
[Diagram: two linked columns.]
•  Reads: Read1, Read2, Read3, Read4, NULL
•  Locations: LocationA, LocationB, LocationC, LocationD, LocationA, NULL, LocationE
Parallelizing Alignment
[Diagram: Unaligned DNA Reads → Split() → Part1, Part2, Part3 → Align() (each part yields Locations) → Concat() → Sort() → Etc… → Aligned DNA Reads]
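The Split(), Align(), Concat(), Sort() stages can be sketched on one machine. This is a minimal illustration, assuming a thread pool stands in for cluster nodes and exact string search stands in for a real aligner; in Spark the partitions would live on different machines. The reference string, reads, and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical reference sequence, for illustration only.
REFERENCE = "ACGTACGTTT"


def split(reads, n_parts):
    """Deal the reads round-robin into n_parts partitions."""
    return [reads[i::n_parts] for i in range(n_parts)]


def align(part):
    """Align one partition: (offset, read), with offset -1 if unlocatable."""
    return [(REFERENCE.find(read), read) for read in part]


def concat(parts):
    """Flatten the per-partition results into one list."""
    return [item for part in parts for item in part]


def run_pipeline(reads, n_parts=3):
    """Split, align each part independently, then concatenate and sort."""
    parts = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        aligned_parts = list(pool.map(align, parts))
    return sorted(concat(aligned_parts))


if __name__ == "__main__":
    for offset, read in run_pipeline(["GTTT", "ACGT", "TACG", "CCCC"]):
        print(offset, read)
```

Each partition is aligned without reference to the others, so the only stages that need all the data in one place are the concatenate and sort steps.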
Using HPC+SAN has Bottlenecks (GridEngine, Etc)
[Diagram: the pipeline over Part1, Part2, Part3, annotated with a Volume Read Bottleneck, a Volume Write Bottleneck, and a Read & Write Bottleneck.]
Using Spark Eliminates Bottlenecks
[Diagram: pipeline stages Split(), Align(), Concat(), Sort().]
Bottom Level: Integration with Legacy Tools
[Diagram labels: Local I/O; Container; Legacy Sub-process.]
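The legacy sub-process pattern can be sketched as follows: each partition's records are piped through an unmodified command-line tool via its standard input and output. This is an assumption about the general shape, not the talk's actual wrapper. The stand-in "legacy tool" is a hypothetical Python one-liner that upper-cases its input; a real pipeline would invoke the aligner binary instead. Spark's RDD API offers a `pipe` operation in the same spirit.

```python
import subprocess
import sys

# Hypothetical stand-in for a legacy command-line tool: upper-cases stdin.
LEGACY_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def run_legacy_tool(records, command=LEGACY_COMMAND):
    """Pipe one partition's records through a sub-process, return its lines."""
    payload = "\n".join(records) + "\n"
    completed = subprocess.run(
        command,
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise if the legacy tool exits with an error
    )
    return completed.stdout.splitlines()


def process_partitions(partitions):
    """Apply the legacy tool to each partition independently."""
    return [run_legacy_tool(partition) for partition in partitions]


if __name__ == "__main__":
    print(process_partitions([["acgt", "ttga"], ["ggcc"]]))
```

Keeping the tool's input and output on local streams means the legacy code needs no changes; only the wrapper knows about partitions.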
•  No time today to look at code, but there is a deeper slideshow on doing this with the Bowtie aligner:
•  http://www.slideshare.net/allenday
•  https://github.com/allenday/spark-genome-alignment-demo
Thanks! Questions?
@allenday, @mapr
aday@mapr.com
linkedin.com/in/allenday
slideshare.net/allenday

Genome Analysis Pipelines, Big Data Style
