TileDB webinars - Nov 4, 2021
Population Genomics is a Data Management Problem
Dr. Stavros Papadopoulos
Founder & CEO of TileDB, Inc.
Who we are
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
WHERE IT ALL STARTED
TileDB was spun out from MIT and Intel Labs in 2017
Originally developed GenomicsDB (collaboration between Intel and Broad)
INVESTORS
Raised over $20M; we are very well capitalized
Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
Special Thanks: Helix
Their mission is to empower every person to improve their life through DNA
Multi-year collaboration to scale population genomics workflows
Provided tons of ideas and optimizations to TileDB-VCF
Find more info about Helix at helix.com
Agenda
The problem in population genomics
A general solution
A concrete solution with TileDB
Dr. Stephen Kingsmore on genome-informed inpatient pediatric care
TileDB-VCF walkthrough (Dr. Aaron Wolen)
Work in progress
The Problem | On the Surface
A large collection of (single-sample) VCF files
Analysis is done by “slicing” a portion across files
Mainly two approaches:
● Separate single-sample VCFs
● Combined multi-sample VCFs
Specialized downstream tools, expecting VCF inputs
All solutions around files and environments
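As a rough illustration of the "slicing" described on this slide, the sketch below queries one genomic region across several single-sample collections. The record layout, the sample names and the `slice_region` helper are all hypothetical; real VCF parsing and indexing are not modeled.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    contig: str
    pos: int   # 1-based position
    ref: str
    alt: str


# Hypothetical stand-in for a directory of single-sample VCF files.
single_sample_vcfs = {
    "sampleA": [Variant("chr1", 100, "A", "G"), Variant("chr1", 250, "C", "T")],
    "sampleB": [Variant("chr1", 120, "G", "A"), Variant("chr2", 50, "T", "C")],
    "sampleC": [Variant("chr1", 300, "A", "AT")],
}


def slice_region(files, contig, start, end):
    """Collect variants with start <= pos <= end on `contig`, sample by sample."""
    hits = {}
    files_touched = 0
    for sample, records in files.items():
        files_touched += 1  # every file is visited, even if it contributes nothing
        matching = [v for v in records if v.contig == contig and start <= v.pos <= end]
        if matching:
            hits[sample] = matching
    return hits, files_touched


hits, files_touched = slice_region(single_sample_vcfs, "chr1", 100, 200)
```

Here every file is visited for a single region query, so the per-file cost is paid once per sample regardless of how many records match.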
The Problem | On the Surface
Problem with single-sample VCF
Latency from slicing each file separately adds up
Problems with multi-sample VCF
Storage space scales super-linearly with the number of samples
Multi-sample VCF file cannot be updated (N+1 problem)
Scaling population genomics is blocked
This is just a tiny fraction of the problem in Genomics
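A toy sketch of the N+1 problem, under simplifying assumptions: a merged multi-sample table is modeled as one row per site with one genotype column per sample, and a sparse store as one cell per (contig, position, sample). The function names, genotype strings and the `./.` padding convention are illustrative only and do not reproduce any real tool's behavior.

```python
def add_sample_to_merged(rows, n_samples, new_calls):
    """Rebuild every row of a merged table to append one more sample column."""
    sites = set(rows) | set(new_calls)
    new_rows = {}
    for site in sorted(sites):
        existing = rows.get(site, ["./."] * n_samples)  # pad sites the old samples lack
        new_rows[site] = existing + [new_calls.get(site, "./.")]
    return new_rows, len(new_rows)  # every row is rewritten


def add_sample_to_sparse(cells, sample, new_calls):
    """Append only the new sample's cells; existing cells are untouched."""
    for (contig, pos), genotype in new_calls.items():
        cells[(contig, pos, sample)] = genotype
    return len(new_calls)


rows = {("chr1", 100): ["0/1", "0/0"], ("chr1", 200): ["0/0", "1/1"]}
new_calls = {("chr1", 200): "0/1", ("chr1", 300): "1/1"}

merged_rows, rows_rewritten = add_sample_to_merged(rows, 2, new_calls)

cells = {
    (contig, pos, sample): gt
    for (contig, pos), gts in rows.items()
    for sample, gt in zip(["s1", "s2"], gts)
}
cells_written = add_sample_to_sparse(cells, "s3", new_calls)
```

In this toy case the merged table rewrites all three rows and ends up with nine entries, three of them padding, while the sparse store writes two new cells and holds six in total.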
The Holistic Problem
Data management is nowhere in the picture
The whole data economics in genomics is flawed
Data Economics
Production: what format the data is produced in and where it is stored
Distribution: who has access to the data, what the means of access is, and how it is monetized
Consumption: how tools compute on the data and where the computation happens
The Production Problem
slow & expensive
often custom & in-house
costly & time consuming
Some analytics infra
Specialized applications, wrangling and fusion
Storage in some cloud bucket or file manager
Numerous VCF files, also some tables
The Distribution Problem #1
wasteful re-invention
Storage in some cloud bucket or marketplace
Org #N: Download + Wrangle + Build analytics infra
Org #1: Download + Wrangle + Build analytics infra
Numerous VCF files, also some tables
The Distribution Problem #2
Data owner bears the distribution cost, wrangling, etc.
Re-invention across data owners
Some analytics infra
Queries by consumer #1
Queries by consumer #N
The Consumption Problem
inefficient & costly, poor governance
Storage in some cloud bucket or server
Group #N: Wrangle + Copy - Use tool & infra #N
Group #1: Wrangle + Copy - Use tool & infra #1
Numerous VCF files, also some tables
The Solution
Universal data management platform
Data in a universal, analysis-ready format
User / group #1: any tool, any scale
User / group #N: any tool, any scale
All Data Science tools
No infrastructure hassles
No downloads or copies
Efficient and cloud-native
Solves N+1 problem
Unifies all data
Accessible by any tool
Global-scale governance
One infra, you own the data
Collaboration and reproducibility
Marketplace built-in
Cost shifted to consumer
Enter TileDB
Secure governance & collaboration
Scalable, serverless compute
Data & code sharing & monetization
Pay-as-you-go, consumer pays
Extreme interoperability
Zero infrastructure
multi-dimensional arrays
Universal data management platform
Data in a universal, analysis-ready format
User / group #1: any tool, any scale
User / group #N: any tool, any scale
The Secret Sauce | The Data Model
Dense array
Store everything as dense or sparse multi-dimensional arrays
Sparse array
[Figure: 3D sparse array of variant calls. Axes: Contig (string; chr1, chr2, chr3, ...), Position (uint32; 1, 2, ...), Sample (string). Cells: variants v1-v7, covering SNPs and indels/CNVs, plus an anchor a1. Other labels: query range, expansion, anchor_gap.]
Population Genomics with TileDB
Store variant call data as 3D sparse arrays
Storage
[Figure: 3D sparse array with axes Contig (string; chr1, chr2, chr3, ...), Position (uint32; 1, 2, ...) and Sample (string), holding cells v1-v7 and anchor a1. Labels: query range, results.]
Retrieval
https://github.com/TileDB-Inc/TileDB-VCF
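One possible reading of the labels on these slides (anchor, anchor_gap, query range, expansion) is sketched below in plain Python. It assumes each variant is stored at its start position, that extra anchor cells are added every `anchor_gap` positions inside long variants, and that a range query is widened to the left by `anchor_gap` so that long variants starting before the range are still found. This is an illustration of that interpretation only; the `Cell` layout and both helpers are hypothetical and are not TileDB-VCF code.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    contig: str
    pos: int         # coordinate at which the cell is stored
    sample: str
    start: int       # true start of the variant
    end: int         # inclusive end of the variant
    is_anchor: bool


def store_variant(cells, contig, sample, start, end, anchor_gap):
    """Store the variant at its start, plus anchors every anchor_gap positions."""
    cells.append(Cell(contig, start, sample, start, end, False))
    pos = start + anchor_gap
    while pos <= end:
        cells.append(Cell(contig, pos, sample, start, end, True))
        pos += anchor_gap


def query(cells, contig, q_start, q_end, anchor_gap):
    """Return variants overlapping [q_start, q_end], scanning an expanded range."""
    lo = q_start - anchor_gap  # the "expansion" of the query range
    found = set()
    for c in cells:
        if c.contig != contig or not (lo <= c.pos <= q_end):
            continue
        if c.start <= q_end and c.end >= q_start:  # true interval overlap
            found.add((c.sample, c.start, c.end))
    return sorted(found)


cells = []
store_variant(cells, "chr1", "sampleA", 1500, 1500, anchor_gap=1000)  # SNP
store_variant(cells, "chr1", "sampleB", 100, 2600, anchor_gap=1000)   # long deletion
store_variant(cells, "chr1", "sampleC", 5000, 5000, anchor_gap=1000)  # SNP

with_expansion = query(cells, "chr1", 2500, 3000, anchor_gap=1000)
without_expansion = query(cells, "chr1", 2500, 3000, anchor_gap=0)
```

With the expansion, the deletion from sampleB is found through its anchor at position 2100; with no expansion the same query returns nothing, because no cell is stored inside the query range itself.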
Arrays Subsume Dataframes
Sparse array
Dataframe
Dense vector
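A minimal sketch of the idea that a dataframe can be viewed either as a set of aligned dense vectors or as a sparse array. The column names and the `to_sparse` helper are made up for illustration; choosing `sample` and `pos` as dimensions is an assumption, not something the slide specifies.

```python
# A dataframe as dense vectors: one vector per column, aligned by row index.
table = {
    "sample": ["s1", "s2", "s3"],
    "pos": [100, 250, 100],
    "gt": ["0/1", "1/1", "0/0"],
}


def row(table, i):
    """Read row i by taking element i of every column vector."""
    return {name: column[i] for name, column in table.items()}


def to_sparse(table, dims, attrs):
    """Re-index the same data as a sparse array keyed by the chosen dimensions."""
    n_rows = len(next(iter(table.values())))
    sparse = {}
    for i in range(n_rows):
        coord = tuple(table[d][i] for d in dims)
        sparse[coord] = {a: table[a][i] for a in attrs}
    return sparse


sparse = to_sparse(table, dims=("sample", "pos"), attrs=("gt",))
at_pos_100 = sorted(coord for coord in sparse if coord[1] == 100)
```

The dense view answers "give me row i" directly, while the sparse view answers "give me the cells at these coordinates" without scanning by row number.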
The Secret Sauce | The Data Model
What can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
Tables (1D dense or ND sparse)
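Two entries from this list, sketched with plain Python containers under illustrative assumptions: a graph as a 2D sparse array whose non-empty cells are edges, and a flat file as a 1D dense array of bytes. The edge weights, node names and helper names are hypothetical.

```python
# Graph as a 2D sparse array: coordinates are (source, destination), value is a weight.
edges = {("a", "b"): 1.0, ("a", "c"): 2.5, ("c", "a"): 0.7}


def out_neighbors(edges, src):
    """Slice the 'row' of the sparse array that belongs to one source node."""
    return sorted(dst for (s, dst) in edges if s == src)


# Flat file as a 1D dense array: every byte offset holds exactly one uint8 value.
flat_file = b"##fileformat=VCFv4.2\n"
file_as_array = list(flat_file)


def read_bytes(array, start, end):
    """Slice the half-open byte range [start, end) back out of the dense array."""
    return bytes(array[start:end])


neighbors_of_a = out_neighbors(edges, "a")
header_key = read_bytes(file_as_array, 2, 12)
```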
How we built a Universal Database
TileDB Cloud
Unified data management and easy serverless compute at global scale
❏ Access control and logging
❏ Serverless SQL, UDFs, task graphs
❏ Jupyter notebooks and dashboards
Efficient APIs & tool integrations, zero-copy techniques
TileDB Embedded
Open-source interoperable storage with a universal open-spec array format
❏ Parallel IO, rapid reads & writes
❏ Columnar, cloud-optimized
❏ Data versioning & time traveling
Superior performance
Built in C++
Fully-parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
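To make "columnar format" and spatial indexing for sparse arrays a little more concrete, here is a simplified sketch. It uses a flat list of minimum bounding rectangles to skip tiles, not a full R-tree hierarchy, and it reads only the requested attribute column. The `Tile` layout, the integer sample indices and the `read` helper are assumptions for illustration, not TileDB internals.

```python
from typing import NamedTuple


class Tile(NamedTuple):
    mbr: tuple    # ((pos_min, pos_max), (sample_min, sample_max))
    pos: list     # coordinate column
    sample: list  # coordinate column (integer sample index)
    gt: list      # attribute column, stored separately from the coordinates


def overlaps(mbr, query):
    """True if the bounding rectangle intersects the query box in every dimension."""
    return all(lo <= q_hi and q_lo <= hi for (lo, hi), (q_lo, q_hi) in zip(mbr, query))


def read(tiles, query, attr):
    """Return values of one attribute for cells inside the query box."""
    (p_lo, p_hi), (s_lo, s_hi) = query
    values, tiles_read = [], 0
    for tile in tiles:
        if not overlaps(tile.mbr, query):
            continue  # pruned using the bounding rectangle alone
        tiles_read += 1
        column = getattr(tile, attr)  # only the requested column is touched
        for i, (p, s) in enumerate(zip(tile.pos, tile.sample)):
            if p_lo <= p <= p_hi and s_lo <= s <= s_hi:
                values.append(column[i])
    return values, tiles_read


tiles = [
    Tile(((100, 200), (0, 1)), pos=[100, 200], sample=[0, 1], gt=["0/1", "1/1"]),
    Tile(((5000, 5100), (0, 2)), pos=[5000, 5100], sample=[2, 0], gt=["0/0", "0/1"]),
]
values, tiles_read = read(tiles, query=((150, 300), (0, 5)), attr="gt")
```

Only the first tile is opened for this query, because the second tile's bounding rectangle lies entirely outside the requested position range.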
Rapid updates & data versioning
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
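A simplified model of how immutable writes can give data versioning and time traveling: each write becomes a new timestamped batch that is never modified, and a read at time `at` overlays all batches written up to that time. The `VersionedArray` class and its method names are hypothetical and only sketch the idea; they do not reproduce the TileDB API or on-disk format.

```python
class VersionedArray:
    """Toy append-only array: writes never modify earlier batches."""

    def __init__(self):
        self._fragments = []  # list of (timestamp, {coordinate: value})

    def write(self, timestamp, cells):
        # Copy the cells so later changes by the caller cannot alter history.
        self._fragments.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Overlay fragments in timestamp order, optionally stopping at `at`."""
        view = {}
        for timestamp, cells in sorted(self._fragments, key=lambda f: f[0]):
            if at is not None and timestamp > at:
                break
            view.update(cells)  # later writes shadow earlier ones
        return view


array = VersionedArray()
array.write(1, {("chr1", 100, "s1"): "0/1"})
array.write(2, {("chr1", 100, "s1"): "1/1", ("chr1", 200, "s2"): "0/1"})

as_of_first_write = array.read(at=1)
latest = array.read()
```

Because a write only appends, a reader that fixes its timestamp sees a stable snapshot even while new batches arrive, which is one way a reader and writer model can avoid locks.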
Extreme interoperability
Numerous APIs
Numerous integrations
All backends
Optimized for the cloud
Immutable writes
Parallel IO
Minimization of requests
TileDB Cloud
Universal storage
Universal tooling
Universal data
.vcf .csv .bam .fastq
Universal scale
Management. Collaboration. Scalability.
TileDB Cloud
Built to work anywhere: works as SaaS (https://cloud.tiledb.com) and on premises; currently on AWS, soon on any cloud
It is completely serverless: slicing, SQL, UDFs, task graphs
Can launch Jupyter notebooks: on-demand JupyterHub instances
It is geo-aware: compute sent to the data
It is secure: authentication, compliance, etc.
TileDB Cloud
Everything is monetizable: full marketplace (via Stripe)
Everything is shareable at global scale: access control inside and outside your organization; make any data and code public; discover any public data and code (central catalog)
Everything is an array! Jupyter notebooks, UDFs and task graphs, ML models, dashboards (e.g., R Shiny apps), all types of data (even flat files)
Everything is logged: full auditability (data, code, any action)
Work in Progress
or why you should depart from file formats and
specialized solutions
RLE compression for strings
Compute directly on compressed data
Compute push-down
Fine-grained access policies
Constant perf optimizations
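The first two items can be illustrated with a small, generic sketch: run-length encoding (RLE) of a string column, followed by a count computed on the runs without decoding them. This is textbook RLE in plain Python, shown only to make the idea concrete; it is an assumption that the planned feature resembles it, and the function names are made up.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal strings into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


def count_equal(runs, target):
    """Count occurrences of `target` by summing run lengths, with no decoding."""
    return sum(length for value, length in runs if value == target)


# Sorted or clustered string columns (such as contig names) compress well.
contigs = ["chr1"] * 4 + ["chr2"] * 3 + ["chr1"] * 2
runs = rle_encode(contigs)
chr1_count = count_equal(runs, "chr1")
```

The count touches three runs instead of nine values, which is the kind of saving that computing directly on compressed data aims for.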
The Universal Database
Thank you

Population genomics is a data management problem

  • 1. TileDB webinars - Nov 4, 2021. Population Genomics is a Data Management Problem. Dr. Stavros Papadopoulos, Founder & CEO of TileDB, Inc.
  • 2. Who we are. WHERE IT ALL STARTED: TileDB was spun out from MIT and Intel Labs in 2017. Originally developed GenomicsDB (collaboration between Intel and Broad). INVESTORS: Raised over $20M, we are very well capitalized. Deep roots at the intersection of HPC, databases and data science. Traction with telecoms, pharmas, hospitals and other scientific organizations. 40 members with expertise across all applications and domains.
  • 3. Disclaimer: All the credit for our amazing work goes to our powerful team (check it out at https://tiledb.com/about). I am the exclusive recipient of complaints (email me at: stavros@tiledb.com).
  • 4. Special Thanks: Helix. Their mission is to empower every person to improve their life through DNA. Multi-year collaboration to scale population genomics workflows. Provided tons of ideas and optimizations to TileDB-VCF. Find more info about Helix at helix.com
  • 5. Agenda: The problem in population genomics; A general solution; A concrete solution with TileDB; Dr. Stephen Kingsmore on genome-informed inpatient pediatric care; TileDB-VCF walkthrough (Dr. Aaron Wolen); Work in progress.
  • 6. The Problem | On the Surface. A large collection of (single-sample) VCF files. Analysis is done by “slicing” a portion across files. Mainly two approaches: ● Separate single-sample VCFs ● Combined multi-sample VCFs. Specialized downstream tools, expecting VCF inputs. All solutions around files and environments.
  • 7. The Problem | On the Surface. Problem with single-sample VCF: Latency from slicing each file separately adds up. Problems with multi-sample VCF: Storage space scales super-linearly with the number of samples; Multi-sample VCF file cannot be updated (N+1 problem). Scaling population genomics is blocked. This is just a tiny fraction of the problem in Genomics.
  • 8. The Holistic Problem. Data management is nowhere in the picture. The whole data economics in genomics is flawed.
  • 9. Data Economics. Production: What format does the data get produced in and where does it get stored. Distribution: Who has access to the data, what is the means of access, and monetization. Consumption: How tools can compute on the data, where does the computation happen.
  • 10. The Production Problem: slow & expensive; often custom & in-house; costly & time consuming. Some analytics infra. Specialized applications, wrangling and fusion. Storage in some cloud bucket or file manager. Numerous VCF files, also some tables.
  • 11. The Distribution Problem #1: wasteful re-invention. Numerous VCF files, also some tables. Storage in some cloud bucket or marketplace. Org #1: Download + Wrangle + Build analytics infra. Org #N: Download + Wrangle + Build analytics infra.
  • 12. The Distribution Problem #2: Data owner bears the distribution cost, wrangling, etc. Re-invention across data owners. Numerous VCF files, also some tables. Some analytics infra. Queries by consumer #1. Queries by consumer #N.
  • 13. The Consumption Problem: inefficient & costly, poor governance. Numerous VCF files, also some tables. Storage in some cloud bucket or server. Group #1: Wrangle + Copy - Use tool & infra #1. Group #N: Wrangle + Copy - Use tool & infra #N.
  • 14. The Solution. Universal data management platform. Data in a universal, analysis-ready format. User / group #1: any tool, any scale. User / group #N: any tool, any scale. All Data Science tools; No infrastructure hassles; No downloads or copies; Efficient and cloud-native; Solves N+1 problem; Unifies all data; Accessible by any tool; Global-scale governance; One infra, you own the data; Collaboration and reproducibility; Marketplace built-in; Cost shifted to consumer.
  • 15. Enter TileDB. Universal data management platform. Data in a universal, analysis-ready format; multi-dimensional arrays. User / group #1: any tool, any scale. User / group #N: any tool, any scale. Secure governance & collaboration; Scalable, serverless compute; Data & code sharing & monetization; Pay-as-you-go, consumer pays; Extreme interoperability; Zero infrastructure.
  • 16. The Secret Sauce | The Data Model. Store everything as dense or sparse multi-dimensional arrays. [Figure panels: Dense array; Sparse array]
  • 17. Population Genomics with TileDB. Store variant call data as 3D sparse arrays. [Figure, two panels: Storage and Retrieval. Axes in both panels: Sample (string), Position (uint32), Contig (string; chr1, chr2, chr3). Storage panel: variants v1 to v7 (SNPs and Indel/CNVs), an anchor a1, and the anchor_gap spacing. Retrieval panel: a query range, the query range expansion, and the results.] https://github.com/TileDB-Inc/TileDB-VCF
  • 18. Arrays Subsume Dataframes. [Figure panels: Dense vector; Dataframe; Sparse array]
  • 19. The Secret Sauce | The Data Model. What can be modeled as an array: LiDAR (3D sparse); SAR (2D or 3D dense); Population genomics (3D sparse); Single-cell genomics (2D dense or sparse); Biomedical imaging (2D or 3D dense); Even flat files!!! (1D dense); Time series (ND dense or sparse); Weather (2D or 3D dense); Graphs (2D sparse); Video (3D dense); Key-values (1D or ND sparse); Tables (1D dense or ND sparse).
  • 20. How we built a Universal Database. TileDB Embedded: Open-source interoperable storage with a universal open-spec array format. ❏ Parallel IO, rapid reads & writes ❏ Columnar, cloud-optimized ❏ Data versioning & time traveling. Efficient APIs & tool integrations, zero-copy techniques. TileDB Cloud: Unified data management and easy serverless compute at global scale. ❏ Access control and logging ❏ Serverless SQL, UDFs, task graphs ❏ Jupyter notebooks and dashboards.
  • 21. TileDB Embedded. Open source: https://github.com/TileDB-Inc/TileDB Superior performance: Built in C++; Fully-parallelized; Columnar format; Multiple compressors; R-trees for sparse arrays. Rapid updates & data versioning: Immutable writes; Lock-free; Parallel reader / writer model; Time traveling.
  • 22. TileDB Embedded. Open source: https://github.com/TileDB-Inc/TileDB Extreme interoperability: Numerous APIs; Numerous integrations; All backends. Optimized for the cloud: Immutable writes; Parallel IO; Minimization of requests.
  • 23. TileDB Cloud. Universal storage; Universal tooling; Universal data (.vcf, .csv, .bam, .fastq); Universal scale. Management. Collaboration. Scalability.
  • 24. TileDB Cloud. Built to work anywhere: Works as SaaS (https://cloud.tiledb.com); Works on premises; Currently on AWS, soon on any cloud. It is completely serverless: Slicing, SQL, UDFs, task graphs. Can launch Jupyter notebooks: On-demand JupyterHub instances. It is geo-aware: Compute sent to the data. It is secure: Authentication, compliance, etc.
  • 25. TileDB Cloud. Everything is monetizable: Full marketplace (via Stripe). Everything is shareable at global scale: Access control inside and outside your organization; Make any data and code public; Discover any public data and code (central catalog). Everything is an array! Jupyter notebooks; UDFs and task graphs; ML models; Dashboards (e.g., R shiny apps); All types of data (even flat files). Everything is logged: Full auditability (data, code, any action).
  • 26. Work in Progress or why you should depart from file formats and specialized solutions RLE compression for strings Compute directly on compressed data Compute push-down Fine-grained access policies Constant perf optimizations
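
Slide 7 makes two claims about multi-sample VCFs: storage grows super-linearly with the number of samples, and adding one sample (the N+1 problem) forces a rewrite. A toy model can make both concrete. The sketch below assumes that a merged file holds one row per site in the union of all samples' variant sites and one column per sample; the sample data and the function name are hypothetical, and real VCF merging is considerably more involved.

```python
def merged_cell_count(samples):
    """Cells in a toy merged table: (size of the union of sites) x (number of samples)."""
    sites = set()
    for variant_sites in samples.values():
        sites |= variant_sites
    return len(sites) * len(samples)


# Hypothetical data: every sample carries two shared sites and two private sites.
samples = {}
for i in range(1, 5):
    samples[f"s{i}"] = {100, 200, 1000 * i, 1000 * i + 1}
    sparse_cells = sum(len(v) for v in samples.values())  # one cell per actual call
    print(len(samples), "samples:", sparse_cells, "sparse cells,",
          merged_cell_count(samples), "merged cells")
```

In this toy data the per-call (sparse) count grows by a constant amount per sample, while the merged table grows faster because each new sample adds both a column and new rows that every earlier sample must fill in. That same coupling is why adding sample N+1 touches the whole merged table.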
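
Slide 17 mentions anchors, an anchor_gap, and query range expansion without spelling out how they fit together. The sketch below is one plausible reading, in plain Python rather than the TileDB-VCF API: each record is stored as a cell at its start position, long records get extra anchor cells every `anchor_gap` bases, and a query widens its lower bound by `anchor_gap` so that a record starting well before the range is still found through one of its anchors. The `Cell` type, function names, gap value, and records are all hypothetical.

```python
from collections import namedtuple

# One cell of a toy 3D sparse array: coordinates (contig, pos, sample) plus attributes.
Cell = namedtuple("Cell", "contig pos sample variant_id end")

ANCHOR_GAP = 1000  # hypothetical gap; the slide does not give a value


def ingest(records, anchor_gap=ANCHOR_GAP):
    """records: iterable of (contig, start, end, sample, variant_id)."""
    cells = []
    for contig, start, end, sample, vid in records:
        cells.append(Cell(contig, start, sample, vid, end))
        # Long records (indels/CNVs) get an anchor cell every anchor_gap bases.
        anchor = start + anchor_gap
        while anchor <= end:
            cells.append(Cell(contig, anchor, sample, vid, end))
            anchor += anchor_gap
    return cells


def query(cells, contig, lo, hi, anchor_gap=ANCHOR_GAP):
    """Return (sample, variant_id) pairs whose [start, end] overlaps [lo, hi]."""
    expanded_lo = max(0, lo - anchor_gap)  # query range expansion
    hits = set()  # a record may be reached through several cells
    for c in cells:
        if c.contig == contig and expanded_lo <= c.pos <= hi and c.end >= lo:
            hits.add((c.sample, c.variant_id))
    return sorted(hits)


records = [
    ("chr1", 100, 100, "s1", "v1"),   # SNP
    ("chr1", 500, 3200, "s2", "a1"),  # long CNV, gets anchors at 1500 and 2500
    ("chr1", 5000, 5000, "s1", "v2"),  # SNP
]
cells = ingest(records)
print(query(cells, "chr1", 3000, 3100))
```

The query for positions 3000 to 3100 finds the CNV that starts at 500 through its anchor at 2500, which a lookup on start positions alone would miss. The expansion is bounded by `anchor_gap`, so the scan stays local instead of reaching back to the beginning of the contig.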
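
Slides 21 and 22 list immutable writes and time traveling side by side. A minimal sketch of how the two can relate, assuming each write is kept as a separate timestamped batch that is never modified afterwards: a read merges all batches up to a chosen timestamp. This is a toy model, not TileDB's on-disk format; the class, method names, and integer timestamps are hypothetical.

```python
class FragmentedArray:
    """Toy sparse array in which every write is an immutable, timestamped batch."""

    def __init__(self):
        self._fragments = []  # list of (timestamp, {coords: value})

    def write(self, timestamp, cells):
        # Earlier batches are never touched; a write only appends a new one.
        self._fragments.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Merge batches with timestamp <= at (all batches if at is None); later writes win."""
        view = {}
        for ts, cells in sorted(self._fragments, key=lambda f: f[0]):
            if at is None or ts <= at:
                view.update(cells)
        return view


arr = FragmentedArray()
arr.write(1, {("chr1", 100, "s1"): "A>T"})
arr.write(2, {("chr1", 250, "s2"): "G>C"})  # the "N+1" sample arrives later

print(arr.read(at=1))  # the array as it looked before the second write
print(arr.read())      # the current view
```

Under this model, adding a sample never rewrites existing data, which is how an append-only design sidesteps the N+1 problem described on slide 7.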
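
Slide 26 pairs "RLE compression for strings" with "Compute directly on compressed data". The general idea can be sketched in a few lines: run-length encode a column of repeated strings, then answer a count query from the runs without decoding them. This illustrates the technique in general, not TileDB's implementation; the contig column and function names are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def count_equal(runs, target):
    """Count occurrences of target by summing run lengths, without decoding."""
    return sum(length for value, length in runs if value == target)


# Hypothetical contig column with long runs of repeated strings.
contigs = ["chr1"] * 5 + ["chr2"] * 3 + ["chr1"] * 2
runs = rle_encode(contigs)

print(runs)
print(count_equal(runs, "chr1"))
```

The count touches three runs instead of ten strings. The saving grows with run length, so columns that are sorted or highly repetitive benefit most.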