TileDB webinars
The TileDB Embedded Storage Engine
Dr. Stavros Papadopoulos
Founder & CEO of TileDB, Inc.
Who is this webinar for?
Those wanting to learn about data storage fundamentals
Layout, compression, IO, etc.
Those looking to efficiently store/access any kind of data to/from anywhere
Dataframes, genomics, LiDAR, SAR, weather, and more, with a single engine
Those tired of managing custom, inefficient data formats
Formats not supporting fast updates, indexing, versioning, cloud performance
Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
Who we are
Where it all started: TileDB was spun out of MIT and Intel Labs in 2017
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
Investors: we have raised over $20M and are very well capitalized
What is TileDB Embedded?
An embeddable C library that stores and accesses multi-dimensional arrays
Dense array Sparse array
It implements very fast array slicing across dimensions
TileDB Embedded at a Glance
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance
Built in C++
Fully-parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
Rapid updates & data versioning
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
Schema evolution
Extreme interoperability
Numerous APIs
Numerous integrations
All backends
Optimized for the cloud
Immutable writes
Parallel IO
Minimization of requests
APIs & tool integrations with zero-copy where possible
TileDB Embedded
Open-source interoperable storage with a universal open-spec array format
● Parallel IO, rapid reads & writes
● Columnar, cloud-optimized
● Data versioning & time traveling
Agenda
Why arrays?
The basics
Advanced internal mechanics
Examples
Work in progress
Comparison to other formats and engines
Docs at docs.tiledb.com
Why Arrays?
Regardless of what kind of data you have, it is laid out in a 1D storage medium (byte 0, 1, ...)
Regardless of what kind of algorithm you run, the algorithm involves a set of slices (think of the algorithm as a task graph, where each task may slice)
Performance is absolutely dictated by the locality of the slice results on the 1D medium
Arrays provide a flexible way to map/slice any-dimensional (ND) data to/from a 1D layout
Giving different “importance” to different dimensions (order and tiling)
Choosing whether dimension coordinates should be materialized or not (dense vs. sparse)
Considering compression, encryption and other filters (tiling)
Abstracting all the engineering magic that it takes to make everything very fast (engine)
Unifying the data model for all application domains! (universality)
Building indices for fast search (e.g., R-trees)
Arrays Subsume Dataframes
Sparse array
Dataframe
Dense vector
Arrays Are Universal
What else can be modeled as an array?
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
The Basics
A Simple 2D Dense Array
dense_array1
├── __t2_t2_uuid1_v                 (fragment)
│   ├── __fragment_metadata.tdb
│   └── a0.tdb                      (attribute data)
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema                        (schema)
    └── __t1_t1_uuid2
Array contents:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
A Simple 2D Sparse Array
sparse_array1
├── __t2_t2_uuid2_v                 (fragment)
│   ├── __fragment_metadata.tdb
│   ├── a0.tdb                      (attribute data)
│   ├── d0.tdb                      (coordinates)
│   └── d1.tdb                      (coordinates)
├── __t2_t2_uuid2_v.ok
├── __lock.tdb
├── __meta
└── __schema                        (schema)
    └── __t1_t2_uuid1
[Figure: a sparse 2D array with six non-empty cells holding the values 1 to 6]
Groups
dense_group
├── __tiledb_group.tdb
└── nested_group
    ├── __tiledb_group.tdb
    └── dense_array1
        ├── __lock.tdb
        ├── __meta
        └── __schema
Groups provide an easy way to hierarchically organize arrays
Array Metadata
dense_array1
├── __t2_t2_uuid2_v
│   ├── __fragment_metadata.tdb
│   └── a0.tdb
├── __t2_t2_uuid2_v.ok
├── __lock.tdb
├── __meta
│   └── __t3_t3_uuid3               (metadata go here)
└── __schema
    └── __t1_t2_uuid1
You can attach any number of (key, value) pairs to an array
The key must be a string, and the value can be anything
Multiple Attributes
dense_array1
├── __t2_t2_uuid1_v
│   ├── __fragment_metadata.tdb
│   ├── a0.tdb                      (attribute data)
│   └── a1.tdb                      (attribute data)
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
    └── __t1_t1_uuid2
Array contents (two attributes per cell):
1,a 2,b 3,c 4,d
5,e 6,f 7,g 8,h
9,i 10,j 11,k 12,l
13,m 14,n 15,o 16,p
You can store more than one value in each cell, even of different types
TileDB has a “columnar” format that allows you to efficiently subselect on attributes
Var-length Attributes
dense_array3
├── __t2_t2_uuid1_v
│   ├── __fragment_metadata.tdb
│   ├── a0.tdb                      (offsets)
│   └── a0_var.tdb                  (var-length data)
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
    └── __t1_t1_uuid2
TileDB supports storing variable-length values in a cell (of any data type)
Array contents:
a bb ccc dddd
e ff ggg hhhh
i jj kkk llll
m nn ooo pppp
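One way to picture the offsets file and the var-length data file is as an offsets buffer plus a single contiguous byte buffer. The sketch below is plain Python and assumes byte offsets that mark where each cell's value starts; the helper names `pack_var` and `unpack_var` are hypothetical and are not TileDB API.

```python
def pack_var(values):
    """Pack strings into (offsets, data); offsets[i] is the starting byte of cell i."""
    offsets, chunks, pos = [], [], 0
    for value in values:
        raw = value.encode("utf-8")
        offsets.append(pos)
        chunks.append(raw)
        pos += len(raw)
    return offsets, b"".join(chunks)


def unpack_var(offsets, data):
    """Recover the per-cell values; the last cell runs to the end of the data buffer."""
    ends = offsets[1:] + [len(data)]
    return [data[start:end].decode("utf-8") for start, end in zip(offsets, ends)]


cells = ["a", "bb", "ccc", "dddd"]       # first row of the example array
offsets, data = pack_var(cells)
print(offsets)                            # [0, 1, 3, 6]
print(data)                               # b'abbcccdddd'
print(unpack_var(offsets, data))          # ['a', 'bb', 'ccc', 'dddd']
```

Keeping the offsets fixed-size means a reader can locate any cell's value with one lookup, without scanning the variable-length bytes.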
Var-length Dimensions
sparse_array4
├── __t2_t2_uuid1_v
│   ├── __fragment_metadata.tdb
│   ├── a0.tdb
│   ├── d0.tdb                      (offsets)
│   └── d0_var.tdb                  (var-length data)
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
    └── __t1_t1_uuid2
You can also have var-length dimensions and slice naturally using string ranges
Applicable only to sparse arrays
[Figure: a 1D sparse array over an unbounded string domain with infinite gaps; the string coordinates a, bb, ccc, dddd, e, ff carry the values 1 to 6]
Heterogeneous Dimensions
Sparse arrays allow you to have dimensions of different types
The following 2D array allows efficient slicing on a string and a float32 dimension
[Figure: a 2D sparse array with an infinite string dimension and an infinite float32 dimension; labels visible in the figure: 4, 1.0, 0.0, “dddd”, 0.4]
Arrays as Dataframes
An array is essentially a dataframe
where dimensions are special (they are “indexed”)
What About Cloud Object Stores?
array_name → {s3,azure,gcs,tiledb}://path/array_name
Everything demonstrated works as is on the cloud
Tiling & Layout
Tiling | Dense Arrays
A space tile is the atomic unit of IO
[Figure: the 4x4 array below shown twice, with different space tile extents. With one choice a slice fetches the whole array from storage; with the other it fetches only a portion of the array (a tile).]
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Cell Layout | Dense Arrays
Three parameters define the layout of values on storage, called the global order
Space tile extents
Tile order/layout (row-major or column-major)
Cell order/layout (row-major or column-major)
[Figure: three example layouts]
2x2 space tiles, row-major tile order, row-major cell order
2x2 space tiles, col-major tile order, row-major cell order
4x2 space tiles, row-major tile order, col-major cell order
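The three parameters can be turned into a small enumeration of cells in global order. This is an illustrative sketch in plain Python, not TileDB code; it assumes a 2D array whose shape is divisible by the tile extents, and it reads "4x2" as 4 rows by 2 columns. The function names are hypothetical.

```python
from itertools import product


def global_order(shape, extents, tile_order="row", cell_order="row"):
    """Return the (row, col) cell coordinates in the order they land on 1D storage."""

    def ordered(n_rows, n_cols, order):
        pairs = list(product(range(n_rows), range(n_cols)))  # row-major by default
        if order == "col":
            pairs.sort(key=lambda rc: (rc[1], rc[0]))        # column-major
        return pairs

    tiles = ordered(shape[0] // extents[0], shape[1] // extents[1], tile_order)
    cells = ordered(extents[0], extents[1], cell_order)
    return [(tr * extents[0] + r, tc * extents[1] + c)
            for tr, tc in tiles for r, c in cells]


def layout(array, extents, tile_order, cell_order):
    """Return the array values in global order."""
    shape = (len(array), len(array[0]))
    return [array[r][c] for r, c in global_order(shape, extents, tile_order, cell_order)]


array = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(layout(array, (2, 2), "row", "row"))
# [1, 2, 5, 6, 3, 4, 7, 8, 9, 10, 13, 14, 11, 12, 15, 16]
print(layout(array, (2, 2), "col", "row"))
# [1, 2, 5, 6, 9, 10, 13, 14, 3, 4, 7, 8, 11, 12, 15, 16]
print(layout(array, (4, 2), "row", "col"))
# [1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 16]
```

A slice reads fastest when the cells it needs sit next to each other in this 1D sequence, which is why the three parameters are worth tuning to the expected read pattern.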
Tiling & Cell Layout | Sparse Arrays
Sparse arrays store only non-empty cells
Grouping non-empty cells with space tiles would be inefficient (due to potential skew)
The atomic unit of IO in sparse arrays is the data tile, of fixed (user-defined) capacity
First impose a global order similar to dense arrays, then group based on capacity
[Figure: the same sparse array under 2x2 space tiles with col-major tile order and row-major cell order, grouped into data tiles of capacity 2 and of capacity 4]
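The "order first, then group by capacity" idea can be sketched as follows. This is plain Python under stated assumptions: a 2D array, zero-based coordinates, and a made-up set of non-empty cells (the coordinates in the slide figure are not recoverable). `global_key` and `data_tiles` are hypothetical helper names.

```python
def global_key(coord, extents, tile_order="col", cell_order="row"):
    """Sort key that places a cell in the global order (tile first, then cell in tile)."""
    r, c = coord
    tile = (r // extents[0], c // extents[1])
    cell = (r % extents[0], c % extents[1])
    tile_key = tile if tile_order == "row" else tile[::-1]
    cell_key = cell if cell_order == "row" else cell[::-1]
    return tile_key + cell_key


def data_tiles(coords, extents, capacity, **orders):
    """Sort non-empty cells in global order, then cut them into tiles of `capacity` cells."""
    ordered = sorted(coords, key=lambda rc: global_key(rc, extents, **orders))
    tiles = [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
    # Minimum bounding rectangle of each data tile: (min corner, max corner).
    mbrs = [((min(r for r, _ in t), min(c for _, c in t)),
             (max(r for r, _ in t), max(c for _, c in t))) for t in tiles]
    return tiles, mbrs


coords = [(0, 0), (0, 1), (1, 3), (2, 0), (2, 2), (3, 3)]  # hypothetical non-empty cells
tiles, mbrs = data_tiles(coords, extents=(2, 2), capacity=2)
print(tiles)  # [[(0, 0), (0, 1)], [(2, 0), (1, 3)], [(2, 2), (3, 3)]]
print(mbrs)   # [((0, 0), (0, 1)), ((1, 0), (2, 3)), ((2, 2), (3, 3))]
```

Every data tile holds the same number of cells regardless of how skewed the data is, which is the point of grouping by capacity instead of by space tile.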
Hilbert Order | Sparse Arrays
Space tiles greatly affect the cell layout in sparse arrays
Sometimes it is very difficult to define a good space tiling (especially with floats and strings)
For such cases, the Hilbert order is ideal (no tile extents and order)
For floats we discretize the domain into buckets based on the number of dimensions
For strings we assign a number of bits per dimension and then use the string prefixes as numbers
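To make the float case concrete, here is a minimal 2D sketch: each float coordinate is discretized into a bucket, and the bucket pair is mapped to a position on a Hilbert curve using the well-known iterative algorithm. The grid size, the `[0, 1]` domain and the function names are assumptions for illustration; TileDB's actual bucket counts and string handling are not reproduced here.

```python
def hilbert_index(n, x, y):
    """Map integer cell (x, y) on an n-by-n grid (n a power of two) to its Hilbert position."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def bucket(value, lo, hi, n):
    """Discretize a float in [lo, hi] into one of n equal-width buckets."""
    return min(int((value - lo) / (hi - lo) * n), n - 1)


n = 4                                    # assumed: 4 buckets per dimension
points = [(0.9, 0.1), (0.1, 0.1), (0.1, 0.9), (0.9, 0.9)]
ordered = sorted(
    points,
    key=lambda p: hilbert_index(n, bucket(p[0], 0.0, 1.0, n), bucket(p[1], 0.0, 1.0, n)),
)
print(ordered)  # [(0.1, 0.1), (0.1, 0.9), (0.9, 0.9), (0.9, 0.1)]
```

Consecutive positions on the curve are always neighbouring grid cells, so points that are close in space tend to end up close on storage without any tile extents being chosen.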
Tile Filters
TileDB allows a wide range of filters to be applied to each tile prior to its storage
Compressors (gzip, zstd, bzip2, …)
Checksums
Encryption
The atomic unit of filtering is the chunk (typically equal to the L1 cache size)
TileDB applies the filters across chunks in parallel in a pipeline
[Figure: a tile split into chunks, with zstd and AES256 shown as the filters applied]
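A rough picture of a chunked filter pipeline, using only the Python standard library. It is a sketch, not TileDB's implementation: zlib stands in for the compressors named on the slide, a CRC32 plays the role of the checksum filter, encryption is left out because the standard library has no AES, and the 32 KB chunk size is an assumed value.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 32 * 1024  # assumed chunk size; the slide only says "typically the L1 cache size"


def filter_chunk(chunk: bytes) -> bytes:
    """Compress one chunk, then prepend a checksum of the compressed bytes."""
    compressed = zlib.compress(chunk)
    return zlib.crc32(compressed).to_bytes(4, "big") + compressed


def unfilter_chunk(blob: bytes) -> bytes:
    """Reverse the pipeline: verify the checksum, then decompress."""
    checksum, compressed = blob[:4], blob[4:]
    if zlib.crc32(compressed).to_bytes(4, "big") != checksum:
        raise ValueError("checksum mismatch")
    return zlib.decompress(compressed)


def filter_tile(tile: bytes):
    """Split a tile into chunks and run the pipeline on the chunks in parallel."""
    chunks = [tile[i:i + CHUNK] for i in range(0, len(tile), CHUNK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(filter_chunk, chunks))


def unfilter_tile(blobs):
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(unfilter_chunk, blobs))


tile = bytes(range(256)) * 400           # a 100 KB tile of sample data
stored = filter_tile(tile)
print(len(stored), "chunks,", sum(len(b) for b in stored), "bytes after filtering")
print(unfilter_tile(stored) == tile)     # True
```

Because each chunk is filtered independently, the work parallelizes naturally and a corrupted chunk can be detected without touching the rest of the tile.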
Advanced Internal Mechanics
Versioning and Time Traveling
In TileDB, every write is immutable
Each (batch) write creates a timestamped fragment
With fragments, TileDB implements versioning and time traveling
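The mechanism can be sketched with fragments modeled as timestamped dictionaries. This is a toy model in plain Python, assuming integer timestamps and an inclusive read interval; the `read` function and the data layout are hypothetical. The data follows the dense example: a full 4x4 write, then a small overwrite.

```python
def read(fragments, t_start, t_end):
    """Overlay all fragments with t_start <= timestamp <= t_end, oldest first."""
    result = {}
    for timestamp, cells in sorted(fragments, key=lambda f: f[0]):
        if t_start <= timestamp <= t_end:
            result.update(cells)         # later writes win
    return result


t1, t2 = 1, 2
full = {(r, c): 4 * r + c + 1 for r in range(4) for c in range(4)}   # values 1..16
patch = {(0, 0): 100, (0, 1): 200, (1, 0): 500, (1, 1): 600}          # 2x2 overwrite
fragments = [(t1, full), (t2, patch)]    # each write is a new, immutable fragment

print(read(fragments, 0, t1)[(0, 0)])         # 1   (state as of t1)
print(read(fragments, 0, t2)[(0, 0)])         # 100 (latest state)
print(read(fragments, t1 + 1, t2) == patch)   # True (only what changed after t1)
```

Nothing is modified in place: a read simply chooses which fragments to consider, which is why writers and readers need no locks.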
Versioning and Time Traveling | Dense Arrays
Write at t1 (the full array):
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Write at t2 (the top-left 2x2 block):
100 200
500 600
Read at [0,t1]:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Read at [0,t2]:
100 200 3 4
500 600 7 8
9 10 11 12
13 14 15 16
Read at (t1,t2]:
100 200 - -
500 600 - -
- - - -
- - - -
Versioning and Time Traveling | Sparse Arrays
Write at t1: six non-empty cells with the values 1, 2, 3, 4, 5, 6
Write at t2: the values 100 and 40, written at the coordinates of the cells holding 1 and 4
Read at [0,t1]: 1, 2, 3, 4, 5, 6
Read at (t1,t2]: 100, 40
Read at [0,t2] when no dups are allowed: 100, 2, 3, 40, 5, 6
Read at [0,t2] when dups are allowed: both 1 and 100, 2, 3, both 4 and 40, 5, 6
Indexing
TileDB has a three-level indexing approach
Fragment timestamps (in the fragment names) for time traveling
Non-empty domain in each fragment’s metadata
Either simple offset arithmetic (dense) or R-trees (sparse)
Algorithm
1. Get list of fragment names (with .ok), e.g. t1_t1_uuid1_v, t2_t2_uuid2_v, ...
2. Ignore fragments with timestamp not in the time traveling interval
3. Ignore fragments whose non-empty domain (in __fragment_metadata.tdb) does not overlap the slice
4a. Ignore dense tiles via implicit positional indexing, or
4b. Ignore sparse tiles from the R-tree that do not overlap the slice
A slicing query just traverses the R-tree top-down, visiting only nodes/MBRs (minimum bounding rectangles) that intersect the slice
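The pruning steps can be sketched for the sparse case as below. This is a simplified model in plain Python: the `Fragment` class and its fields are hypothetical, boxes are lists of inclusive `(lo, hi)` pairs, and the R-tree is flattened to a linear scan over leaf MBRs rather than a real top-down traversal.

```python
from dataclasses import dataclass


def overlaps(a, b):
    """True if two N-D boxes, each a list of inclusive (lo, hi) pairs, intersect."""
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))


@dataclass
class Fragment:
    name: str
    timestamp: int
    non_empty_domain: list
    mbrs: list                            # one box per data tile


def tiles_to_fetch(fragments, slice_box, t_start, t_end):
    hits = []
    for frag in fragments:
        if not (t_start <= frag.timestamp <= t_end):          # step 2
            continue
        if not overlaps(frag.non_empty_domain, slice_box):    # step 3
            continue
        for tile_idx, mbr in enumerate(frag.mbrs):            # step 4b (leaf MBRs only)
            if overlaps(mbr, slice_box):
                hits.append((frag.name, tile_idx))
    return hits


fragments = [
    Fragment("f1", 1, [(0, 3), (0, 3)],
             [[(0, 0), (0, 1)], [(1, 2), (0, 3)], [(2, 3), (2, 3)]]),
    Fragment("f2", 2, [(0, 0), (0, 0)], [[(0, 0), (0, 0)]]),
    Fragment("f3", 5, [(0, 3), (0, 3)], [[(0, 3), (0, 3)]]),
]
print(tiles_to_fetch(fragments, [(2, 3), (2, 3)], 0, 2))
# [('f1', 1), ('f1', 2)]
```

Each level discards work cheaply before the next, more detailed check runs: `f3` is skipped by its timestamp alone and `f2` by its non-empty domain, so only `f1`'s tile MBRs are examined.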
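Indexing
Dense arrays: given the non-empty domain, the space tile extents and the tile order, we can easily find that this slice overlaps the second and fourth tile
[Figure: the 4x4 array below with 2x2 space tiles in row-major tile order, and a slice drawn over it]
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Sparse arrays: the R-tree (stored in fragment metadata) is built over the MBRs of the data tiles
[Figure: a sparse array with 2x2 space tiles, col-major tile order, row-major cell order and capacity 2; its data tiles are bounded by MBR1, MBR2, MBR3 and MBR4, which appear as the leaves of the R-tree]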
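For the dense case, "implicit positional indexing" amounts to integer arithmetic on the slice bounds. A minimal sketch in plain Python, assuming a 2D array, inclusive slice ranges and row-major tile order; the function name and the particular slice are hypothetical, chosen so that the result matches the "second and fourth tile" statement.

```python
def overlapping_tiles(slice_rows, slice_cols, domain_start, extents, tiles_per_row):
    """Return 0-based positions, in row-major tile order, of the space tiles a slice touches."""
    (r0, r1), (c0, c1) = slice_rows, slice_cols
    tr0 = (r0 - domain_start[0]) // extents[0]
    tr1 = (r1 - domain_start[0]) // extents[0]
    tc0 = (c0 - domain_start[1]) // extents[1]
    tc1 = (c1 - domain_start[1]) // extents[1]
    return [tr * tiles_per_row + tc
            for tr in range(tr0, tr1 + 1)
            for tc in range(tc0, tc1 + 1)]


# 4x4 array with domain [1,4] x [1,4], 2x2 space tiles (two tiles per tile row).
# Hypothetical slice: rows 1..4, columns 3..4.
positions = overlapping_tiles((1, 4), (3, 4), domain_start=(1, 1),
                              extents=(2, 2), tiles_per_row=2)
print(positions)                       # [1, 3]
print([p + 1 for p in positions])      # [2, 4], i.e. the second and fourth tile
```

No index structure has to be stored or read for this: the tile positions follow directly from the schema.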
Consolidation & Vacuuming
Numerous fragments can lead to performance degradation (loss of locality, expensive listing)
TileDB supports two levels of consolidation
Fragment metadata (group the non-empty domains in a single place)
Fragments (better preserve data locality)
Old fragments are preserved after consolidation (for time traveling)
TileDB can vacuum old fragments to save space and boost listing
Time traveling will not work on vacuumed fragments
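The trade-off between consolidation and vacuuming can be illustrated with a toy model. Fragments are plain dictionaries with a timestamp range and their cells; `consolidate` and `vacuum` are hypothetical helpers that only mimic the behaviour described above, not TileDB's API.

```python
def consolidate(fragments):
    """Merge all fragments into one new fragment; the originals are kept."""
    merged = {}
    for frag in sorted(fragments, key=lambda f: f["t_end"]):
        merged.update(frag["cells"])     # later writes win
    new = {
        "t_start": min(f["t_start"] for f in fragments),
        "t_end": max(f["t_end"] for f in fragments),
        "cells": merged,
    }
    return fragments + [new]


def vacuum(fragments):
    """Drop fragments whose time range is covered by a wider (consolidated) fragment."""
    def covered(f):
        return any(
            g["t_start"] <= f["t_start"] and f["t_end"] <= g["t_end"]
            and (g["t_end"] - g["t_start"]) > (f["t_end"] - f["t_start"])
            for g in fragments
        )
    return [f for f in fragments if not covered(f)]


f1 = {"t_start": 1, "t_end": 1, "cells": {"a": 1, "b": 2}}
f2 = {"t_start": 2, "t_end": 2, "cells": {"a": 100}}

after = consolidate([f1, f2])
print(len(after), after[-1]["cells"])    # 3 {'a': 100, 'b': 2}
print(len(vacuum(after)))                # 1
```

After consolidation the value `a = 1` is still reachable through `f1`; after vacuuming only the merged fragment remains, so a read "as of t1" can no longer be answered.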
Attribute Filter Push-Down
TileDB supports pushing attribute filter conditions down to the engine
That typically boosts performance
Much less data gets copied around
More L1-cache conscious
More opportunities for parallelism and vectorization
Schema Evolution
TileDB supports schema evolution (since v2.4)
Adding an attribute
Dropping an attribute
More schema evolution features are coming up
Full versioning and time traveling is supported
Notes on Writing
Lots of flexibility in writing in different orders, different domain subarray, etc.
Support for lock-free, parallel writing
Tips for performance:
Each tile should be 100KB to 1MB
Each fragment should be 1 to 2GB
Fragments should not “interleave”
Run fragment metadata consolidation (especially on cloud object stores)
No support for deletions and updates yet (coming up soon)
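The tile-size tip translates into simple arithmetic when choosing tile extents. The sketch below assumes a dense array, fixed-size cells, sizes measured before compression, and 1KB = 1024 bytes; the helper names are hypothetical.

```python
import math


def tile_bytes(extents, cell_bytes):
    """Uncompressed size of one space tile for a single fixed-size attribute."""
    return math.prod(extents) * cell_bytes


def in_recommended_range(nbytes, lo=100 * 1024, hi=1024 * 1024):
    """Check a tile size against the 100KB to 1MB guideline."""
    return lo <= nbytes <= hi


# float64 attribute (8 bytes per cell) with three candidate tile extents
for extents in [(64, 64), (256, 256), (1000, 1000)]:
    size = tile_bytes(extents, 8)
    print(extents, size, in_recommended_range(size))
# (64, 64) 32768 False
# (256, 256) 524288 True
# (1000, 1000) 8000000 False
```

For sparse arrays the same check would apply to the data tile, i.e. the capacity multiplied by the bytes per cell.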
Notes on Reading
TileDB is eventually consistent
Support for parallel writers, parallel readers (all lock-free)
Support for reads in different layouts
Support for “streaming reads” (incomplete queries)
Tips for performance:
Allocate sufficient space for the result buffers (minimize incomplete queries)
Tune written layout based on the read layout (application dependent)
Push down coordinate and attribute filter conditions
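The interaction between buffer size and incomplete queries can be shown with a toy generator. This is a schematic model in plain Python, not the TileDB query API: `streaming_read` and the status strings are illustrative names, and the "query result" is just a list.

```python
def streaming_read(cells, buffer_capacity):
    """Yield results in batches that fit a fixed-size buffer, resuming after each batch."""
    pos = 0
    while pos < len(cells):
        batch = cells[pos:pos + buffer_capacity]
        pos += len(batch)
        status = "COMPLETE" if pos >= len(cells) else "INCOMPLETE"
        yield status, batch


result = list(range(1, 11))              # ten result cells
for status, batch in streaming_read(result, buffer_capacity=4):
    print(status, batch)
# INCOMPLETE [1, 2, 3, 4]
# INCOMPLETE [5, 6, 7, 8]
# COMPLETE [9, 10]
```

A larger buffer means fewer round trips for the same result, which is the reason for allocating enough space up front; a small buffer still works and lets the caller process results that do not fit in memory.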
Work In Progress
Coming Up
More schema evolution features
Support for deletes and updates
Git-like versioning
ACID via modularizing locking
More tile filters (e.g., sum, min, max)
RLE and dictionary compression on strings
Computations on compressed data
Linear Algebra operations
More SQL push down (e.g., group by)
Graph algorithms
TileDB vs. Others
High-level Comparisons
vs. HDF5
TileDB is cloud-native
TileDB has support for sparse arrays
TileDB has support for versioning and time traveling
vs. Zarr
TileDB is built in C++ and is more interoperable
TileDB has support for sparse arrays
TileDB has support for versioning and time traveling
High-level Comparisons
vs. Parquet
TileDB is multi-dimensional and supports more flexible layouts
TileDB has support for dense arrays
vs. Delta Lake
TileDB does not rely on Spark, Presto or other subsystem
TileDB has support for dense arrays
TileDB has support for versioning and time traveling
TileDB does not support deletes, updates and full ACID (yet)
TileDB is natively multi-dimensional and supports more flexible layouts
The Universal Database
Thank you
WE ARE HIRING
Apply at tiledb.workable.com

👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 

The TileDB Embedded Storage Engine

  • 1. TileDB webinars: The TileDB Embedded Storage Engine. Dr. Stavros Papadopoulos, Founder & CEO of TileDB, Inc.
  • 2. Who is this webinar for? Those wanting to learn about data storage fundamentals (layout, compression, IO, etc.). Those looking to efficiently store/access any kind of data to/from anywhere (dataframes, genomics, LiDAR, SAR, weather, and more, with a single engine). Those tired of managing custom, inefficient data formats (formats not supporting fast updates, indexing, versioning, cloud performance).
  • 3. Disclaimer. I am the exclusive recipient of complaints; email me at: stavros@tiledb.com. All the credit for our amazing work goes to our powerful team; check it out at https://tiledb.com/about
  • 4. Who we are. Deep roots at the intersection of HPC, databases and data science. Traction with telecoms, pharmas, hospitals and other scientific organizations. 40 members with expertise across all applications and domains. Where it all started: TileDB got spun out from MIT and Intel Labs in 2017. Investors: raised over $20M, we are very well capitalized.
  • 5. What is TileDB Embedded? An embeddable C++ library that stores and accesses multi-dimensional arrays (dense arrays and sparse arrays). It implements very fast array slicing across dimensions.
  • 6. TileDB Embedded at a Glance. Open source: https://github.com/TileDBInc/TileDB Superior performance: built in C++; fully parallelized; columnar format; multiple compressors; R-trees for sparse arrays. Rapid updates & data versioning: immutable writes; lock-free; parallel reader / writer model; time traveling; schema evolution.
  • 7. TileDB Embedded at a Glance. Open source: https://github.com/TileDBInc/TileDB Extreme interoperability: numerous APIs; numerous integrations; all backends. Optimized for the cloud: immutable writes; parallel IO; minimization of requests.
  • 8. TileDB Embedded at a Glance. APIs & tool integrations with zero-copy where possible. TileDB Embedded: open-source interoperable storage with a universal open-spec array format. Parallel IO, rapid reads & writes; columnar, cloud-optimized; data versioning & time traveling.
  • 9. Agenda. Why arrays? The basics. Advanced internal mechanics. Examples. Work in progress. Comparison to other formats and engines. Docs at docs.tiledb.com
  • 10. Why Arrays? Regardless of what kind of data you have, it is laid out in a 1D storage medium (byte 0, 1, ...). Regardless of what kind of algorithm you run, the algorithm involves a set of slices (an algorithm is a task graph, where each task may slice).
  • 11. Why Arrays? Performance is absolutely dictated by the slice result locality on the 1D medium. [Figure: two layouts of bytes 0, 1, ... on the 1D medium]
  • 12. Why Arrays? Arrays provide a flexible way to map/slice any-dimensional (ND) data to/from a 1D layout: giving different “importance” to different dimensions (order and tiling); choosing whether dimension coordinates should be materialized or not (dense vs. sparse); considering compression, encryption and other filters (tiling); abstracting all the engineering magic that it takes to make everything very fast (engine); unifying the data model for all application domains (universality); building indices for fast search (e.g., R-trees).
  • 13. Arrays Subsume Dataframes. [Figure labels: sparse array, dataframe, dense vector]
  • 14. Arrays Are Universal. What else can be modeled as an array: LiDAR (3D sparse); SAR (2D or 3D dense); population genomics (3D sparse); single-cell genomics (2D dense or sparse); biomedical imaging (2D or 3D dense); even flat files!!! (1D dense); time series (ND dense or sparse); weather (2D or 3D dense); graphs (2D sparse); video (3D dense); key-values (1D or ND sparse).
  • 16. A Simple 2D Dense Array. The array holds the cell values 1 to 16. On storage:
      dense_array1
      ├── __t2_t2_uuid1_v                (fragment)
      │   ├── __fragment_metadata.tdb
      │   └── a0.tdb                     (attribute data)
      ├── __t2_t2_uuid1.ok
      ├── __lock.tdb
      ├── __meta
      └── __schema                       (schema)
          └── __t1_t1_uuid2
  • 17. A Simple 2D Sparse Array. The array holds six non-empty cells with values 1 to 6. On storage:
      sparse_array1
      ├── __t2_t2_uuid2_v                (fragment)
      │   ├── __fragment_metadata.tdb
      │   ├── a0.tdb                     (attribute data)
      │   ├── d0.tdb                     (coordinates)
      │   └── d1.tdb                     (coordinates)
      ├── __t2_t2_uuid2_v.ok
      ├── __lock.tdb
      ├── __meta
      └── __schema                       (schema)
          └── __t1_t2_uuid1
  • 18. Groups. Groups provide an easy way to hierarchically organize arrays:
      dense_group
      ├── __tiledb_group.tdb
      └── nested_group
          ├── __tiledb_group.tdb
          └── dense_array1
              ├── __lock.tdb
              ├── __meta
              └── __schema
  • 19. Array Metadata. You can attach any number of (key, value) pairs to an array. The key must be a string, and the value can be anything.
      dense_array1
      ├── __t2_t2_uuid2_v
      │   ├── __fragment_metadata.tdb
      │   └── a0.tdb
      ├── __t2_t2_uuid2_v.ok
      ├── __lock.tdb
      ├── __meta                         (metadata go here)
      │   └── __t3_t3_uuid3
      └── __schema
          └── __t1_t2_uuid1
  • 20. Multiple Attributes. You can store more than one value in each cell, even of different types. TileDB has a “columnar” format that allows you to efficiently subselect on attributes. Example cells: 1,a 2,b 3,c 4,d 5,e 6,f 7,g 8,h 9,i 10,j 11,k 12,l 13,m 14,n 15,o 16,p
      dense_array1
      ├── __t2_t2_uuid1_v
      │   ├── __fragment_metadata.tdb
      │   ├── a0.tdb                     (attribute data)
      │   └── a1.tdb                     (attribute data)
      ├── __t2_t2_uuid1.ok
      ├── __lock.tdb
      ├── __meta
      └── __schema
          └── __t1_t1_uuid2
  • 21. Var-length Attributes. TileDB supports storing variable-length values in a cell (of any data type). Example cells: a bb ccc dddd e ff ggg hhhh i jj kkk lll m nn ooo pppp
      dense_array3
      ├── __t2_t2_uuid1_v
      │   ├── __fragment_metadata.tdb
      │   ├── a0.tdb                     (offsets)
      │   └── a0_var.tdb                 (var-length data)
      ├── __t2_t2_uuid1.ok
      ├── __lock.tdb
      ├── __meta
      └── __schema
          └── __t1_t1_uuid2
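The offsets / var-length data split on this slide can be illustrated with a small Python sketch. It is not TileDB code: `pack_var_cells` and `unpack_var_cells` are hypothetical helper names, and the use of byte offsets into a UTF-8 buffer is an assumption made for illustration.

```python
def pack_var_cells(values):
    """Split variable-length strings into an offsets list and one data buffer.

    Assumption: offsets are starting byte positions into the data buffer.
    """
    offsets = []
    data = bytearray()
    for value in values:
        offsets.append(len(data))      # where this cell starts
        data += value.encode("utf-8")  # append the cell's bytes
    return offsets, bytes(data)


def unpack_var_cells(offsets, data):
    """Rebuild the cell values; each cell ends where the next one starts."""
    ends = offsets[1:] + [len(data)]
    return [data[start:end].decode("utf-8") for start, end in zip(offsets, ends)]


offsets, data = pack_var_cells(["a", "bb", "ccc", "dddd"])
print(offsets)  # [0, 1, 3, 6]
print(data)     # b'abbcccdddd'
```

Keeping fixed-size offsets apart from the packed values is what lets a reader locate any cell without scanning the whole data buffer.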
  • 22. Var-length Dimensions. You can also have var-length dimensions and slice naturally using string ranges. Applicable only to sparse arrays. [Figure: cells with string coordinates a, bb, ccc, dddd, e, ff and values 1 to 6; unbounded domain, infinite gaps]
      sparse_array4
      ├── __t2_t2_uuid1_v
      │   ├── __fragment_metadata.tdb
      │   ├── a0.tdb
      │   ├── d0.tdb                     (offsets)
      │   └── d0_var.tdb                 (var-length data)
      ├── __t2_t2_uuid1.ok
      ├── __lock.tdb
      ├── __meta
      └── __schema
          └── __t1_t1_uuid2
  • 23. Heterogeneous Dimensions. Sparse arrays allow you to have dimensions of different types. The following 2D array allows efficient slicing on a string and a float32 dimension. [Figure labels: 4, 1.0, 0.0, “dddd”, 0.4, infinite string dimension, infinite float32 dimension]
  • 24. Arrays as Dataframes. An array is essentially a dataframe where dimensions are special (they are “indexed”).
  • 25. What About Cloud Object Stores? Everything demonstrated works as is on the cloud: array_name → {s3,azure,gcs,tiledb}://path/array_name
  • 27. Tiling | Dense Arrays. A space tile is the atomic unit of IO. [Figure: the 16-cell array (values 1 to 16) shown with two different space tile extents; with one, a slice fetches the whole array from storage; with the other, it fetches only a portion of the array (a tile)]
  • 28. Cell Layout | Dense Arrays. Three parameters define the layout of the values on storage, called the global order: space tile extents; tile order/layout (row-major or column-major); cell order/layout (row-major or column-major). [Figure: three examples: row-major tile order, row-major cell order, 2x2 space tiles; col-major tile order, row-major cell order, 2x2 space tiles; row-major tile order, col-major cell order, 4x2 space tiles]
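To make the global order concrete, here is a minimal Python sketch that derives the 1D order of a 2D dense array from the three parameters named on this slide. The function names and the 4x4 example array are assumptions for illustration; this is not how the TileDB engine is implemented.

```python
def _grid(rows, cols, order):
    """List the (row, col) positions of a rows x cols grid in the given order."""
    if order == "row-major":
        return [(r, c) for r in range(rows) for c in range(cols)]
    if order == "col-major":
        return [(r, c) for c in range(cols) for r in range(rows)]
    raise ValueError(f"unknown order: {order}")


def global_order(shape, tile_extents, tile_order, cell_order):
    """Return cell coordinates in global order: tiles first, then cells per tile."""
    n_rows, n_cols = shape
    t_rows, t_cols = tile_extents
    if n_rows % t_rows or n_cols % t_cols:
        raise ValueError("this sketch assumes extents that divide the domain")
    coords = []
    for tile_r, tile_c in _grid(n_rows // t_rows, n_cols // t_cols, tile_order):
        for r, c in _grid(t_rows, t_cols, cell_order):
            coords.append((tile_r * t_rows + r, tile_c * t_cols + c))
    return coords


# Hypothetical 4x4 array holding the values 1 to 16 in row-major order.
array = [[4 * r + c + 1 for c in range(4)] for r in range(4)]


def layout(tile_extents, tile_order, cell_order):
    order = global_order((4, 4), tile_extents, tile_order, cell_order)
    return [array[r][c] for r, c in order]


print(layout((2, 2), "row-major", "row-major"))
# [1, 2, 5, 6, 3, 4, 7, 8, 9, 10, 13, 14, 11, 12, 15, 16]
print(layout((2, 2), "col-major", "row-major"))
# [1, 2, 5, 6, 9, 10, 13, 14, 3, 4, 7, 8, 11, 12, 15, 16]
print(layout((4, 2), "row-major", "col-major"))
# [1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 16]
```

The three calls correspond to the three parameter combinations listed on the slide; values that end up adjacent in the output are the ones a slice can read with a single contiguous IO.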
  • 29. Tiling & Cell Layout | Sparse Arrays. Sparse arrays store only non-empty cells. Grouping non-empty cells with space tiles would be inefficient (due to potential skew). The atomic unit of IO in sparse arrays is the data tile, of fixed (user-defined) capacity. First impose a global order similar to dense arrays, then group based on capacity. [Figure: two examples with col-major tile order, row-major cell order and 2x2 space tiles, one with capacity 2 and one with capacity 4; labels: space tile extents, data tile]
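The two-step procedure on this slide (sort by global order, then group by capacity) can be sketched as follows. The helper name `data_tiles`, the zero-based coordinates and the example cells are assumptions; the sketch hard-codes the col-major tile order and row-major cell order used in the slide's examples, and also computes a minimum bounding rectangle (MBR) per data tile.

```python
def data_tiles(cells, tile_extents, capacity):
    """Group non-empty cells into data tiles of at most `capacity` cells.

    cells: list of (row, col) coordinates of the non-empty cells.
    Returns a list of (cells_in_tile, mbr) pairs, where
    mbr = ((row_min, row_max), (col_min, col_max)).
    """
    t_rows, t_cols = tile_extents

    def global_order_key(cell):
        r, c = cell
        # Col-major tile order: tile column first, then tile row.
        # Row-major cell order inside the space tile: row first, then column.
        return (c // t_cols, r // t_rows, r % t_rows, c % t_cols)

    ordered = sorted(cells, key=global_order_key)
    tiles = []
    for start in range(0, len(ordered), capacity):
        group = ordered[start:start + capacity]
        rows = [r for r, _ in group]
        cols = [c for _, c in group]
        tiles.append((group, ((min(rows), max(rows)), (min(cols), max(cols)))))
    return tiles


cells = [(0, 0), (0, 3), (1, 1), (2, 0), (3, 2), (3, 3)]
for group, mbr in data_tiles(cells, tile_extents=(2, 2), capacity=2):
    print(group, mbr)
# [(0, 0), (1, 1)] ((0, 1), (0, 1))
# [(2, 0), (0, 3)] ((0, 2), (0, 3))
# [(3, 2), (3, 3)] ((3, 3), (2, 3))
```

Note how the second data tile spans two space tiles: data tiles follow the capacity, not the space tile boundaries, which is why their MBRs are worth indexing.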
  • 30. Hilbert Order | Sparse Arrays. Space tiles greatly affect the cell layout in sparse arrays. Sometimes it is very difficult to define a good space tiling (especially with floats and strings). For such cases, the Hilbert order is ideal (no tile extents and order). For floats, we discretize the domain into buckets based on the number of dimensions. For strings, we assign a number of bits per dimension and then use the string prefixes as numbers.
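As an illustration of ordering float coordinates along a Hilbert curve, the sketch below uses the standard 2D coordinate-to-index conversion and a simple equal-width bucketing. The function names, the `bits` parameter and the bucketing rule are assumptions for this example and are not claimed to match TileDB's exact scheme.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) on the Hilbert curve of an n x n grid.

    n must be a power of two; this is the classic iterative conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def bucket(value, low, high, n):
    """Discretize a float in [low, high] into one of n equal-width buckets."""
    index = int((value - low) / (high - low) * n)
    return min(max(index, 0), n - 1)


def hilbert_key(point, domain, bits=4):
    """Sort key for a 2D float point: bucket each dimension, then map to the curve."""
    n = 1 << bits
    x, y = point
    (x_lo, x_hi), (y_lo, y_hi) = domain
    return hilbert_index(n, bucket(x, x_lo, x_hi, n), bucket(y, y_lo, y_hi, n))


unit_square = ((0.0, 1.0), (0.0, 1.0))
points = [(0.9, 0.1), (0.1, 0.1), (0.1, 0.9), (0.9, 0.9)]
print(sorted(points, key=lambda p: hilbert_key(p, unit_square)))
```

Cells that are close on the curve are also close in space, so sorting by this key tends to keep the results of a range slice clustered on storage without the user choosing any tile extents.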
  • 31. Tile Filters. TileDB allows a wide range of filters to be applied to each tile prior to its storage: compressors (gzip, zstd, bzip2, …); checksums; encryption. The atomic unit of filtering is the chunk (typically equal to the L1 cache size). TileDB applies the filters across chunks in parallel in a pipeline. [Figure: a tile split into chunks, each passing through zstd and then AES256]
  • 33. Versioning and Time Traveling. In TileDB, every write is immutable. Each (batch) write creates a timestamped fragment. With fragments, TileDB implements versioning and time traveling.
  • 34. Versioning and Time Traveling | Dense Arrays. [Figure: the write at t1 stores the values 1 to 16; the write at t2 stores 100, 200, 500, 600 over four of the cells. Read at [0,t1] returns 1 to 16. Read at [0,t2] returns 100, 200, 500, 600 in the overwritten cells and 3, 4, 7 to 16 elsewhere. Read at (t1,t2] returns 100, 200, 500, 600 and empty values (-) for the other twelve cells.]
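The dense example can be mimicked with a toy model in which each write is an immutable, timestamped fragment and a read overlays the fragments that fall inside the requested time interval. The data structures, the integer timestamps t1 = 1 and t2 = 2, and the assumption that the second write overwrites the cells holding 1, 2, 5 and 6 are illustrative choices, not TileDB's implementation.

```python
T1, T2 = 1, 2  # assumed integer timestamps for the two writes

# Each fragment is (timestamp, {cell position: value}); fragments are never modified.
fragments = [
    (T1, {pos: pos + 1 for pos in range(16)}),  # write at t1: values 1 to 16
    (T2, {0: 100, 1: 200, 4: 500, 5: 600}),     # write at t2: four cells (assumed positions)
]


def read(fragments, t_start, t_end, n_cells=16):
    """Overlay all fragments with t_start <= timestamp <= t_end, oldest first.

    Cells not covered by any selected fragment come back as None (empty).
    """
    result = [None] * n_cells
    for timestamp, cells in sorted(fragments, key=lambda f: f[0]):
        if t_start <= timestamp <= t_end:
            for pos, value in cells.items():
                result[pos] = value  # newer fragments win
    return result


print(read(fragments, 0, T1))   # the array as of t1
print(read(fragments, 0, T2))   # latest view: t2 values on top of t1 values
print(read(fragments, T2, T2))  # (t1, t2]: only what was written at t2
```

Since no fragment is ever rewritten, any earlier state of the array can be reproduced just by choosing which fragments take part in the read.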
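  • 35. Versioning and Time Traveling | Sparse Arrays (when no dups are allowed). [Figure: the write at t1 stores cells with values 1 to 6; the write at t2 stores 100 and 40. Read at [0,t1] returns 1, 2, 3, 4, 5, 6. Read at (t1,t2] returns 100 and 40. Read at [0,t2] returns 100, 2, 3, 40, 5, 6 and 1.]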
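  • 36. Versioning and Time Traveling | Sparse Arrays (when dups are allowed). [Figure: the write at t1 stores cells with values 1 to 6; the write at t2 stores 100 and 40. Read at [0,t1] returns 1, 2, 3, 4, 5, 6. Read at (t1,t2] returns 100 and 40. Read at [0,t2] returns 100, 2, 3, 40, 5, 6, 1 and also 4 (labeled “dups”).]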
  • 37. Indexing. TileDB has a three-level indexing approach: fragment timestamps (in the fragment names) for time traveling; the non-empty domain in each fragment’s metadata; either simple offset arithmetic (dense) or R-trees (sparse). Algorithm: 1. Get the list of fragment names (with .ok): t1_t1_uuid1_v, t2_t2_uuid2_v, ... 2. Ignore fragments with timestamp not in the time traveling interval. 3. Ignore fragments with non-empty domain (in __fragment_metadata.tdb) not overlapping the slice. 4a. Ignore dense tiles via implicit positional indexing, or 4b. ignore sparse tiles from the R-tree that do not overlap the slice.
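Steps 1 to 3 and 4b of the algorithm above can be sketched as a chain of filters. This is a simplified model, not TileDB's code: the `Fragment` class, the fragment names with integer timestamps, and the flat list of MBRs (standing in for a real R-tree) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[Tuple[int, int], ...]  # one inclusive (low, high) range per dimension


@dataclass
class Fragment:
    name: str              # assumed pattern: __<t_start>_<t_end>_<uuid>_v
    non_empty_domain: Box
    mbrs: List[Box]        # flat list standing in for the R-tree leaves
    has_ok_file: bool = True


def fragment_timestamps(name):
    """Parse the timestamp range encoded in the fragment name."""
    start, end = name.lstrip("_").split("_")[:2]
    return int(start), int(end)


def overlaps(a, b):
    """True if two boxes intersect in every dimension."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b))


def candidate_tiles(fragments, t_start, t_end, slice_box):
    """Return (fragment name, data tile index) pairs that may hold results."""
    hits = []
    for frag in fragments:
        if not frag.has_ok_file:                        # step 1: committed fragments only
            continue
        frag_start, frag_end = fragment_timestamps(frag.name)
        if not (t_start <= frag_start and frag_end <= t_end):  # step 2: time interval
            continue
        if not overlaps(frag.non_empty_domain, slice_box):     # step 3: non-empty domain
            continue
        for index, mbr in enumerate(frag.mbrs):          # step 4b: MBR overlap
            if overlaps(mbr, slice_box):
                hits.append((frag.name, index))
    return hits


fragments = [
    Fragment("__1_1_uuid1_v", ((0, 3), (0, 3)), [((0, 1), (0, 1)), ((2, 3), (2, 3))]),
    Fragment("__2_2_uuid2_v", ((0, 1), (0, 1)), [((0, 1), (0, 1))]),
    Fragment("__5_5_uuid3_v", ((0, 3), (0, 3)), [((2, 3), (2, 3))]),
    Fragment("__2_2_uuid4_v", ((0, 3), (0, 3)), [((2, 3), (2, 3))], has_ok_file=False),
]
print(candidate_tiles(fragments, 0, 2, ((2, 3), (2, 3))))
# [('__1_1_uuid1_v', 1)]
```

Each level discards work before any tile data is touched: the last two fragments are rejected from their names and `.ok` status alone, the second from its non-empty domain, and only one data tile of the first fragment is left to fetch.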
  • 38. Indexing. Dense: given the non-empty domain, the space tile extents and the tile order, we can easily find that this slice overlaps the second and fourth tile. [Figure: array with values 1 to 16, row-major tile order, 2x2 space tiles] Sparse: a slicing query would just traverse the tree top-down, visiting only nodes/MBRs that intersect the slice. [Figure: MBR1 to MBR4 under col-major tile order, row-major cell order, 2x2 space tiles, capacity 2; the R-tree over MBR1 to MBR4 is stored in fragment metadata]
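The dense case needs no search structure at all; the overlapping tiles follow from arithmetic on the slice bounds. A minimal sketch, assuming a 4x4 array with 2x2 space tiles in row-major tile order and a hypothetical slice covering all rows of the last two columns (the slide does not state the exact slice):

```python
def overlapping_tiles(row_range, col_range, tile_extents, shape):
    """Zero-based positions (in row-major tile order) of space tiles hit by a slice.

    row_range and col_range are inclusive (low, high) cell coordinates.
    """
    t_rows, t_cols = tile_extents
    tiles_per_row = shape[1] // t_cols
    first_tr, last_tr = row_range[0] // t_rows, row_range[1] // t_rows
    first_tc, last_tc = col_range[0] // t_cols, col_range[1] // t_cols
    return [tr * tiles_per_row + tc
            for tr in range(first_tr, last_tr + 1)
            for tc in range(first_tc, last_tc + 1)]


# Assumed slice: rows 0..3, columns 2..3 of a 4x4 array with 2x2 space tiles.
print(overlapping_tiles((0, 3), (2, 3), (2, 2), (4, 4)))  # [1, 3]
```

Positions 1 and 3 are the second and fourth tile in storage order, so only those two tiles would need to be fetched.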
  • 39. Consolidation & Vacuuming. Numerous fragments can lead to performance degradation (loss of locality, expensive listing). TileDB supports two levels of consolidation: fragment metadata (group the non-empty domains in a single place) and fragments (better preserve data locality). Old fragments are preserved after consolidation (for time traveling). TileDB can vacuum old fragments to save space and boost listing. Time traveling will not work on vacuumed fragments.
  • 40. Attribute Filter Push-Down. TileDB supports pushing attribute filter conditions down to the engine. That typically boosts performance: much less data gets copied around; it is more L1-cache conscious; there are more opportunities for parallelism and vectorization.
  • 41. Schema Evolution. TileDB supports schema evolution (since v2.4): adding an attribute; dropping an attribute. More schema evolution features are coming up. Full versioning and time traveling is supported.
  • 42. Notes on Writing. Lots of flexibility in writing in different orders, different domain subarrays, etc. Support for lock-free, parallel writing. Tips for performance: each tile should be 100KB to 1MB; each fragment should be 1 to 2GB; fragments should not “interleave”; run fragment metadata consolidation (especially on cloud object stores). No support for deletions and updates yet (coming up soon).
  • 43. Notes on Reading. TileDB is eventually consistent. Support for parallel writers, parallel readers (all lock-free). Support for reads in different layouts. Support for “streaming reads” (incomplete queries). Tips for performance: allocate sufficient space for the result buffers (minimize incomplete queries); tune the written layout based on the read layout (application dependent); push down coordinate and attribute filter conditions.
  • 45. Coming Up. More schema evolution features; support for deletes and updates; Git-like versioning; ACID via modularizing locking; more tile filters (e.g., sum, min, max); RLE and dictionary compression on strings; computations on compressed data; linear algebra operations; more SQL push-down (e.g., group by); graph algorithms.
  • 47. High-level Comparisons. vs. HDF5: TileDB is cloud-native; TileDB has support for sparse arrays; TileDB has support for versioning and time traveling. vs. Zarr: TileDB is built in C++ and is more interoperable; TileDB has support for sparse arrays; TileDB has support for versioning and time traveling.
  • 48. High-level Comparisons. vs. Parquet: TileDB is multi-dimensional and supports more flexible layouts; TileDB has support for dense arrays. vs. Delta Lake: TileDB does not rely on Spark, Presto or other subsystem; TileDB has support for dense arrays; TileDB has support for versioning and time traveling; TileDB does not support deletes, updates and full ACID (yet); TileDB is natively multi-dimensional and supports more flexible layouts.
  • 49. The Universal Database. Thank you. WE ARE HIRING: apply at tiledb.workable.com