Population genomics is a data management problem

TileDB webinars - Nov 4, 2021
Population Genomics is a Data Management Problem
Dr. Stavros Papadopoulos
Founder & CEO of TileDB, Inc.

Who we are
TileDB was spun out from MIT and Intel Labs in 2017
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
WHERE IT ALL STARTED
Originally developed GenomicsDB (collaboration between Intel and Broad)
INVESTORS
Raised over $20M, we are very well capitalized

Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about

Special Thanks
Helix
Their mission is to empower every person to improve their life through DNA
Multi-year collaboration to scale population genomics workflows
Provided tons of ideas and optimizations to TileDB-VCF
Find more info about Helix at helix.com

Agenda
The problem in population genomics
A general solution
A concrete solution with TileDB
Dr. Stephen Kingsmore on genome-informed inpatient pediatric care
TileDB-VCF walkthrough (Dr. Aaron Wolen)
Work in progress

The Problem | On the Surface
A large collection of (single-sample) VCF files
Analysis is done by “slicing” a portion across files
Mainly two approaches:
● Separate single-sample VCFs
● Combined multi-sample VCFs
Specialized downstream tools, expecting VCF inputs
All solutions around files and environments

The Problem | On the Surface
Problem with single-sample VCF
Latency from slicing each file separately adds up
Problems with multi-sample VCF
Storage space scales super-linearly with the number of samples
Multi-sample VCF file cannot be updated (N+1 problem)
Scaling population genomics is blocked
This is just a tiny fraction of the problem in Genomics
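
The N+1 problem above can be sketched with a toy model. This is only an illustration in plain Python: the data, the "./." missing-call placeholder and both helper functions are hypothetical, and real multi-sample VCF merging is far more involved. It contrasts adding one sample to a combined per-position table (every row is rewritten) with appending the same sample to a sparse store keyed by (sample, position).

```python
# Toy illustration of the N+1 problem; all names and data are hypothetical.

def merge_into_multisample(matrix, samples, new_sample, new_calls):
    """Add one sample to a combined position -> [genotype per sample] table.

    Every existing row gains a column, and every position that is new to the
    table is back-filled with missing calls for all old samples.
    Returns the number of rows that had to be (re)written.
    """
    touched = 0
    for pos in set(matrix) | set(new_calls):
        row = matrix.get(pos, ["./."] * len(samples))
        matrix[pos] = row + [new_calls.get(pos, "./.")]
        touched += 1
    samples.append(new_sample)
    return touched


def append_to_sparse(cells, new_sample, new_calls):
    """Add one sample to a sparse (sample, position) -> genotype store.

    Only the new sample's own calls are written; existing cells are untouched.
    Returns the number of cells written.
    """
    for pos, genotype in new_calls.items():
        cells[(new_sample, pos)] = genotype
    return len(new_calls)


samples = ["s1", "s2"]
matrix = {100: ["0/1", "./."], 200: ["./.", "1/1"], 300: ["0/1", "0/1"]}
cells = {("s1", 100): "0/1", ("s2", 200): "1/1",
         ("s1", 300): "0/1", ("s2", 300): "0/1"}

new_calls = {200: "0/1", 400: "1/1"}
rows_rewritten = merge_into_multisample(matrix, samples, "s3", new_calls)
cells_written = append_to_sparse(cells, "s3", new_calls)
print(rows_rewritten, cells_written)  # 4 2
```

In this toy case the combined table rewrites all four rows to add one sample, while the sparse store writes only the two new calls.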

The Holistic Problem
Data management is nowhere in the picture
The whole data economics in genomics is flawed

Data Economics
Production: what format the data gets produced in and where it gets stored
Distribution: who has access to the data, what the means of access is, and monetization
Consumption: how tools can compute on the data and where the computation happens

The Production Problem
Numerous VCF files, also some tables
Storage in some cloud bucket or file manager
Specialized applications, wrangling and fusion
Some analytics infra
slow & expensive; often custom & in-house; costly & time consuming

The Distribution Problem #1
Numerous VCF files, also some tables
bucket or marketplace
Org #1: Download + Wrangle + Build analytics infra
Org #N: Download + Wrangle + Build analytics infra
wasteful re-invention

The Distribution Problem #2
Numerous VCF files, also some tables
wrangling, etc.
some analytics infra
Queries by consumer #1
Queries by consumer #N
Data owner bears the distribution cost; re-invention across data owners

The Consumption Problem
Numerous VCF files, also some tables
bucket or server
Group #1: Wrangle + Copy - Use tool & infra #1
Group #N: Wrangle + Copy - Use tool & infra #N
inefficient & costly, poor governance

The Solution
Universal data management platform
Data in a universal, analysis-ready format
User / group #1:
any tool, any scale
User / group #N:
any tool, any scale
All Data Science tools
No infrastructure hassles
No downloads or copies
Efficient and cloud-native
Solves N+1 problem
Unifies all data
Accessible by any tool
Global-scale governance
One infra, you own the data
Collaboration and reproducibility
Marketplace built-in
Cost shifted to consumer

Enter TileDB
Secure governance & collaboration
Scalable, serverless compute
Data & code sharing & monetization
Pay-as-you-go, consumer pays
Extreme interoperability
Zero infrastructure
multi-dimensional arrays
Universal data management platform
Data in a universal, analysis-ready format
User / group #1:
any tool, any scale
User / group #N:
any tool, any scale

The Secret Sauce | The Data Model
Store everything as dense or sparse multi-dimensional arrays
Dense array
Sparse array

Population Genomics with TileDB
Store variant call data as 3D sparse arrays
[Figure: two panels, Storage and Retrieval. Dimensions: Sample (string), Position (uint32), Contig (string; chr1, chr2, chr3). Cells: variants v1 to v7 (SNPs, indels/CNVs) and anchor a1. Labels: anchor, anchor_gap, query range, query range expansion, results.]
https://github.com/TileDB-Inc/TileDB-VCF
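
The figure labels (anchor, anchor_gap, query range expansion) suggest how range queries can still find long variants that start outside the query range. The following is a minimal sketch of that idea in plain Python, under stated assumptions: each variant is stored at its start position, long variants also get anchor cells every anchor_gap positions, and a query expands its range to the left by anchor_gap before filtering on true overlap. The function names, the data and the ANCHOR_GAP value are hypothetical, and this is not the TileDB-VCF implementation.

```python
# Toy model of anchor-based range queries; names and data are hypothetical
# and this is not the TileDB-VCF implementation.
ANCHOR_GAP = 10  # assumed value, chosen small for illustration


def cells_for(sample, contig, start, end, anchor_gap=ANCHOR_GAP):
    """Yield the sparse cells stored for one variant.

    The variant is stored at its start position. A variant spanning at least
    anchor_gap positions also gets anchor cells every anchor_gap positions,
    each pointing back to the same record.
    """
    record = (sample, contig, start, end)
    yield (contig, start, sample), record
    pos = start + anchor_gap
    while pos <= end:
        yield (contig, pos, sample), record
        pos += anchor_gap


def build(variants):
    """Collect the cells of all variants into one flat list (the 'array')."""
    store = []
    for variant in variants:
        store.extend(cells_for(*variant))
    return store


def query(store, contig, q_start, q_end, anchor_gap=ANCHOR_GAP):
    """Return the records overlapping [q_start, q_end] on one contig."""
    lo = q_start - anchor_gap  # query range expansion to the left
    hits = set()
    for (c, pos, _sample), record in store:
        if c == contig and lo <= pos <= q_end:
            _, _, start, end = record
            # The expansion can pick up non-overlapping records, so filter.
            if start <= q_end and end >= q_start:
                hits.add(record)
    return sorted(hits)


variants = [
    ("s1", "chr1", 5, 5),    # SNP
    ("s2", "chr1", 20, 65),  # long variant, e.g. a CNV
    ("s3", "chr1", 70, 72),  # short indel
    ("s1", "chr2", 40, 40),  # SNP on another contig
]
store = build(variants)
print(query(store, "chr1", 52, 58))  # [('s2', 'chr1', 20, 65)]
print(query(store, "chr1", 66, 71))  # [('s3', 'chr1', 70, 72)]
```

In the first query no cell of the long variant falls inside [52, 58]; it is found only because the expanded range reaches its anchor at position 50. The second query shows the overlap filter discarding the same variant when its anchor at 60 is picked up by the expansion.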

Arrays Subsume Dataframes
Sparse array
Dataframe
Dense vector
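
One way to read this slide: a dataframe can be viewed either as a dense vector (one row-id dimension, columns as attributes) or as a sparse array (some columns promoted to dimensions). The sketch below shows both views in plain Python. The table contents and helper names are hypothetical, and the dictionaries merely stand in for an array engine.

```python
# Toy illustration only; the table contents and helper names are hypothetical.
rows = [
    {"sample": "s1", "pos": 100, "ref": "A", "alt": "T"},
    {"sample": "s1", "pos": 250, "ref": "G", "alt": "C"},
    {"sample": "s2", "pos": 100, "ref": "A", "alt": "G"},
]

# View 1: dense vector. One dimension (the row id 0..n-1); every column
# becomes an attribute stored as its own list (a columnar layout).
dense = {col: [r[col] for r in rows] for col in rows[0]}


def dense_slice(dense, start, stop):
    """Read rows start..stop-1 along the single row-id dimension."""
    return {col: values[start:stop] for col, values in dense.items()}


# View 2: sparse array. Chosen columns become dimensions (coordinates) and
# the remaining columns stay attributes. Only non-empty cells are stored.
# This toy version assumes coordinates are unique.
def to_sparse(rows, dims):
    cells = {}
    for r in rows:
        coord = tuple(r[d] for d in dims)
        cells[coord] = {k: v for k, v in r.items() if k not in dims}
    return cells


def sparse_slice(cells, sample, pos_lo, pos_hi):
    """Range query on the (sample, pos) dimensions."""
    return {
        coord: attrs
        for coord, attrs in cells.items()
        if coord[0] == sample and pos_lo <= coord[1] <= pos_hi
    }


sparse = to_sparse(rows, dims=("sample", "pos"))
print(dense_slice(dense, 0, 2)["alt"])     # ['T', 'C']
print(sparse_slice(sparse, "s1", 0, 200))  # {('s1', 100): {'ref': 'A', 'alt': 'T'}}
```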

The Secret Sauce | The Data Model
What can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
Tables (1D dense or ND sparse)

How we built a Universal Database
TileDB Cloud
Unified data management and easy serverless compute at global scale
❏ Access control and logging
❏ Serverless SQL, UDFs, task graphs
❏ Jupyter notebooks and dashboards
Efficient APIs & tool integrations, zero-copy techniques
TileDB Embedded
Open-source interoperable storage with a universal open-spec array format
❏ Parallel IO, rapid reads & writes
❏ Columnar, cloud-optimized
❏ Data versioning & time traveling

TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance
Built in C++
Fully-parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
Rapid updates & data versioning
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
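
The link between immutable writes, versioning and time traveling can be sketched as follows. This is a conceptual toy in plain Python, not TileDB's storage format or API: the FragmentedArray class and its methods are hypothetical. Each write is kept as a separate timestamped batch that is never modified, and a read replays the batches up to a chosen timestamp.

```python
# Conceptual toy; the class and method names are hypothetical and do not
# mirror TileDB's on-disk format or API.
class FragmentedArray:
    def __init__(self):
        # Append-only list of (timestamp, {coordinate: value}) batches.
        self._fragments = []

    def write(self, timestamp, cells):
        """Record a write as a new batch; earlier batches are never touched."""
        self._fragments.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Return the array contents as of time `at` (latest if None).

        Batches are replayed in timestamp order, so later writes to the same
        coordinate win.
        """
        result = {}
        for timestamp, cells in sorted(self._fragments, key=lambda f: f[0]):
            if at is None or timestamp <= at:
                result.update(cells)
        return result


array = FragmentedArray()
array.write(1, {(0,): "a", (1,): "b"})
array.write(2, {(1,): "B"})  # an update, stored as a new batch

print(array.read(at=1))  # {(0,): 'a', (1,): 'b'}
print(array.read())      # {(0,): 'a', (1,): 'B'}
```

Because a write only adds a new batch, writers in this toy model never need to coordinate with readers over existing data, which is one intuition for why immutable writes pair naturally with lock-free, parallel access.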

TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
Extreme interoperability
Numerous APIs
Numerous integrations
All backends
Optimized for the cloud
Immutable writes
Parallel IO
Minimization of requests

TileDB Cloud
Management. Collaboration. Scalability.
Universal data (.vcf .csv .bam .fastq)
Universal storage
Universal tooling
Universal scale

TileDB Cloud
Built to work anywhere: works as SaaS (https://cloud.tiledb.com), works on premises; currently on AWS, soon on any cloud
It is completely serverless: slicing, SQL, UDFs, task graphs
Can launch Jupyter notebooks: on-demand JupyterHub instances
It is geo-aware: compute sent to the data
It is secure: authentication, compliance, etc.

TileDB Cloud
Everything is an array! All types of data (even flat files), Jupyter notebooks, UDFs and task graphs, ML models, dashboards (e.g., R Shiny apps)
Everything is shareable at global scale: access control inside and outside your organization; make any data and code public; discover any public data and code (central catalog)
Everything is monetizable: full marketplace (via Stripe)
Everything is logged: full auditability (data, code, any action)

Work in Progress
or why you should depart from file formats and
specialized solutions
RLE compression for strings
Compute directly on compressed data
Compute push-down
Fine-grained access policies
Constant perf optimizations
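
Two of the items above, RLE compression for strings and computing directly on compressed data, can be illustrated together. The sketch below is a generic run-length encoding in plain Python, not TileDB's implementation; the function names and the contig data are hypothetical. It shows that a count query can be answered from the runs alone, without decoding.

```python
# Generic run-length encoding sketch; not TileDB's implementation.
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


def count_equal(runs, target):
    """Count occurrences of target directly on the compressed runs."""
    return sum(length for value, length in runs if value == target)


# A sorted string column such as contig names compresses well with RLE.
contigs = ["chr1"] * 4 + ["chr2"] * 3 + ["chr1"] * 2
runs = rle_encode(contigs)

print(runs)                       # [('chr1', 4), ('chr2', 3), ('chr1', 2)]
print(count_equal(runs, "chr1"))  # 6
```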

The Universal Database
Thank you
