The TileDB Array Data Storage
Manager
Authors:
Stavros Papadopoulos, Kushal Datta, Samuel Madden and
Timothy Mattson
Published in:
Proceedings of the VLDB Endowment, Vol. 10, Issue 4,
November 2016
Reviewer: Olamide Timothy Tawose
Preamble
Many scientific and engineering fields generate enormous amounts of data
through measurement, simulation, and experimentation. Examples of data
produced in such fields include astronomical images, DNA sequences, geo-
locations, social relationships, and so on.
All of these data are naturally represented as multi-dimensional arrays, which
can be either dense (when every array element has a value) or sparse (when
the majority of the array elements are empty, i.e., zero or null).
Scientific array data can be very large, containing billions of non-null values
that do not readily fit into the memory of a single or even multiple machines.
Simply storing arrays as files forces application programmers to handle many
issues, including array representation on disk (i.e., sparse vs. dense layouts),
compression, parallel access, and performance.
Alternatively, these issues can be handled by optimized, special-purpose
array data storage management systems, which perform complex analytics
on scientific data. Central to such systems are efficient data access primitives
to read and write arrays.
Objectives
The authors of this work developed TileDB, a novel storage
manager for multi-dimensional arrays in scientific data
management systems.
In contrast to existing solutions, TileDB is optimized for both
dense and sparse arrays. Its key idea is to organize array
elements into ordered collections called fragments.
TileDB is the first storage manager truly optimized for both
forms of multi-dimensional arrays (i.e., dense and sparse).
Existing Array Management Systems
 HDF5: HDF5 is a well-known array data storage manager. It is
a dense array format. HDF5 groups array elements into
regular hyperrectangles, called chunks, which are stored on
the disk in binary format in a single large file.
Shortcomings:
i) It does not efficiently capture sparse arrays.
ii) HDF5 is optimized for in-place writes of large blocks. In-place
updates result in poor performance when writing small blocks
of elements that are randomly distributed, due to the
expensive random disk accesses they incur.
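To make the chunking and in-place update behavior concrete, here is a minimal sketch using the h5py Python bindings; the file name, array shape, and chunk size are illustrative choices, not values from the paper:

```python
import numpy as np
import h5py

with h5py.File("dense.h5", "w") as f:
    # Elements are grouped into regular 1000x1000 chunks, stored in one file.
    dset = f.create_dataset("a1", shape=(50_000, 20_000),
                            dtype="i4", chunks=(1_000, 1_000))
    # Large contiguous block write: the pattern HDF5 is optimized for.
    dset[0:1_000, 0:1_000] = np.ones((1_000, 1_000), dtype=np.int32)
    # Small, randomly scattered updates: each one rewrites a different
    # chunk in place, incurring an expensive random disk access.
    rng = np.random.default_rng(0)
    for i, j in zip(rng.integers(0, 50_000, size=5),
                    rng.integers(0, 20_000, size=5)):
        dset[i, j] = 7
```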
Existing Array Management Systems
 Parallel HDF5: It is a parallel version of HDF5 with some
additional limitations.
Shortcomings:
i) It does not allow concurrent writes to compressed data.
ii) It does not support variable-length element values.
iii) Operation atomicity requires some coding effort from the user,
and imposes extra ordering semantics.
 Vertica: Alternatively, relational databases (e.g., Vertica) have
been used as the storage backend for array management
(e.g., in RAM for dense and SRAM for sparse arrays), storing
non-empty elements as records and explicitly encoding the
element indices as extra table columns. However, explicitly
encoding the indices leads to poor performance in the case of
dense arrays.
Existing Array Management Systems
 SciDB: It is a popular array database, but it also has
significant storage management limitations.
Shortcomings:
i) It is not truly designed for sparse arrays as it relies on regular
dimensional chunking (similar to HDF5).
ii) It also requires reading an entire chunk even if a small portion
of it is requested, and updating an entire chunk even when a
single element changes, leading to poor performance in the
case of random writes.
Overview of TileDB
 Data Model: TileDB provides an array-oriented data model that is similar to that of SciDB. An
array consists of dimensions and attributes. Each dimension has a domain, and all dimension
domains collectively define the logical space of the array. A cell may either be empty (null) or
contain a tuple of attribute values. Each attribute is either a primitive type (int, float, or char), or
a fixed- or variable-sized vector of a primitive type. Arrays in TileDB can be dense or sparse. All
TileDB operations can be performed on either type of array, but the choice of dense vs. sparse
can significantly impact application performance.
 Figure below shows two example arrays, one dense and one sparse, both with two dimensions,
rows and columns. Both dimensions have domain [1,4]. The arrays have two attributes, a1 of
type int32 and a2 of type variable char. Each cell is identified by its coordinates, i.e., a (rows,
columns) pair. In the examples, empty cells are shown in white, and non-empty cells in gray. In
the dense array, every cell has an attribute tuple, e.g., cell (4,4) contains <15, pppp>, whereas
several cells in the sparse array are empty, e.g., cell (4,1).
Figure: The logical array view
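As an illustration of this data model, the sketch below declares the two example arrays using the modern TileDB Python API (tiledb-py). The paper itself describes a C API, so treat the exact calls and the array names as assumptions about the current library, not the 2016 version:

```python
import numpy as np
import tiledb

# Two dimensions, rows and columns, each with domain [1, 4].
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int64),
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int64),
)
# Two attributes: a1 (int32) and a2 (variable-sized characters).
attrs = [
    tiledb.Attr(name="a1", dtype=np.int32),
    tiledb.Attr(name="a2", dtype=str),  # var-length string attribute
]
# The same logical schema can be declared dense or sparse.
tiledb.Array.create("dense_example",
                    tiledb.ArraySchema(domain=dom, attrs=attrs, sparse=False))
tiledb.Array.create("sparse_example",
                    tiledb.ArraySchema(domain=dom, attrs=attrs, sparse=True))
```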
Overview of TileDB
 Compression: TileDB employs tile-based compression. Additionally, it
allows the user to select a different compression scheme on a per-attribute
basis, as attributes are stored separately.
The API supports defining a different compression scheme for each
attribute, but currently only gzip is implemented. The current version of
TileDB does not yet support other compression schemes, such as RLE and
LZ, or user-defined schemes.
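A short sketch of per-attribute compression, again in the modern Python API; the filter classes shown post-date the paper (which had gzip only), so treat them as assumptions about the current library:

```python
import numpy as np
import tiledb

# Attributes are stored separately, so each can carry its own filter list.
gzip6 = tiledb.FilterList([tiledb.GzipFilter(level=6)])
a1 = tiledb.Attr(name="a1", dtype=np.int32, filters=gzip6)  # gzip-compressed
a2 = tiledb.Attr(name="a2", dtype=str)                      # uncompressed
```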
Overview of TileDB
 The TileDB storage manager organizes array elements into ordered
collections called fragments in order to support both dense and sparse multi-
dimensional arrays. Fragments can be either dense or sparse.
A fragment is a timestamped snapshot of a batch of array updates, i.e., a collection of
array modifications carried out via write operations and made visible at a particular
time instant. For instance, the initial loading of the data into the array constitutes
the first array fragment. If at a later time a set of cells is modified, then these cells
constitute the second array fragment, and so on. In that sense, an array is
composed of a collection of array fragments, each of which can be regarded as a
separate array, whose collective logical overlap composes the current logical view
of the array.
Dense fragments are used only with dense arrays, but sparse fragments may be
applied to both dense and sparse arrays.
 The concept of fragments dramatically improves write performance as writes are
performed in batches, and each batch (which may consist of many small, random
writes) is written to a separate fragment sequentially.
 The organization into fragments turns random writes into sequential writes.
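The following is a minimal, library-free Python sketch of the fragment idea (not TileDB code): each write batch becomes its own fragment, and a read resolves a cell from the newest fragment that contains it:

```python
fragments = []  # ordered oldest -> newest, one per write batch

def write_batch(cells):
    """Many small, random updates land in ONE new fragment that is
    written sequentially, instead of seeking into existing data."""
    fragments.append(dict(cells))  # cells: {(row, col): value, ...}

def read_cell(coords):
    """The newest fragment containing the cell wins."""
    for frag in reversed(fragments):
        if coords in frag:
            return frag[coords]
    return None  # empty cell

write_batch({(1, 1): 0, (1, 2): 1})   # initial load = first fragment
write_batch({(1, 2): 112})            # later updates = second fragment
assert read_cell((1, 2)) == 112       # collective logical view
```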
Fragment Examples
Figure below shows an example of an array with three fragments; the first two are dense and the
third is sparse. Observe that the second fragment is dense within a hyper-rectangular subarray,
whereas any cell outside this subarray is empty. The figure also illustrates the collective logical view
of the array; the cells of the most recent fragments overwrite those of the older ones.
Fragments are the key concept that enables TileDB to perform rapid writes.
Overview of TileDB
 The TileDB storage manager implements an efficient read algorithm to find
the most recent fragment that updated each returned element in situations where
multiple fragments may cover the same logical region of an array and some
fragments may be sparse.
This read algorithm allows it to efficiently access all fragments, skipping data that does
not qualify for the result. The algorithm is slightly different for dense and sparse
arrays.
 Also, the TileDB storage manager employs a consolidation algorithm that merges
multiple fragments into a single one, to tackle the read performance degradation that
occurs when the number of fragments grows.
 TileDB supports parallelization via multi-threading and multi-processing,
offering thread-/process-safety and atomicity via lightweight locking. TileDB
delivers better performance than existing state-of-the-art systems, thanks to its
efficient storage manager.
 Moreover, it guarantees operation atomicity without imposing any additional
ordering constraints (in contrast to parallel implementations of HDF5, such as
PHDF5).
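Continuing the library-free sketch above, consolidation merges many fragments into one. In the actual library, the modern Python API exposes this as tiledb.consolidate(uri), which is an assumption about the current API rather than the 2016 C API:

```python
fragments = [
    {(1, 1): 0, (1, 2): 1},   # oldest fragment (initial load)
    {(1, 2): 112},            # newer fragment (an update batch)
]

def consolidate(frags):
    """Merge oldest-to-newest so newer cell values overwrite older ones;
    a single fragment replaces many, restoring read performance."""
    merged = {}
    for frag in frags:
        merged.update(frag)
    return [merged]

fragments = consolidate(fragments)
assert fragments == [{(1, 1): 0, (1, 2): 112}]
```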
Experiments
Competitors: TileDB was compared against three competitors:
HDF5 (for dense arrays), SciDB (for dense and sparse
arrays), and Vertica (for dense and sparse arrays). For all parallel
experiments, they used the parallel version of HDF5, namely
PHDF5.
System configuration: All experiments were performed on an Intel x86_64
platform with a 2.3 GHz 36-core CPU and 128 GB of RAM, running CentOS 6. They
utilized a 4 TB, 7200 rpm Western Digital HDD and a 480 GB Intel SSD, neither RAID-
ed, both formatted with the ext4 file system. Because the HDD is I/O bound even with a
single thread, they ran the serial experiments on the HDD. Because the SSD is much
faster, they ran the parallel experiments on SSDs.
For TileDB, SciDB and HDF5, they experimented both with and without gzip
compression, using compression level 6. Note: PHDF5 does not support parallel load or
updates on compressed datasets. In addition, Vertica cannot be used without
compression; its default compression method is RLE, which is currently not supported
by TileDB. Therefore, they used Vertica with gzip (compression level 6).
Experiments
Also, Vertica by default uses all the available cores in the system. In order to vary the
number of threads used in the parallel experiments, they simply shut down a subset of
the cores at the OS level.
Datasets: For dense arrays, they constructed synthetic 2D arrays of variable sizes,
with a single int32 attribute.
For sparse arrays, they used datasets retrieved from the AIS database,
which was collected by the National Oceanic and Atmospheric Administration by
tracking vessels in U.S. and international waters. They extracted attributes X
(longitude), Y (latitude), SOG, COG, Heading, ROT, Status, VoyageID, and
MMSI. They used X,Y as the dimensions (i.e., they created a 2D array) and the
remaining fields as the attributes. SciDB does not support real (floating-point)
dimension values, so they converted X and Y to int64.
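For illustration, a hedged sketch of such an AIS array in the modern tiledb-py API; the domain bounds, tile extents, and attribute dtypes are guesses for the sketch, and only the use of X and Y as int64 dimensions is grounded in the slide:

```python
import numpy as np
import tiledb

# X (longitude) and Y (latitude) as int64 dimensions; extents illustrative.
dom = tiledb.Domain(
    tiledb.Dim(name="X", domain=(0, 2**31), tile=10_000, dtype=np.int64),
    tiledb.Dim(name="Y", domain=(0, 2**31), tile=10_000, dtype=np.int64),
)
# Remaining AIS fields as attributes (dtypes assumed for the sketch).
schema = tiledb.ArraySchema(
    domain=dom, sparse=True,
    attrs=[tiledb.Attr(name=n, dtype=np.float64)
           for n in ("SOG", "COG", "Heading", "ROT", "Status")]
          + [tiledb.Attr(name="VoyageID", dtype=np.int64),
             tiledb.Attr(name="MMSI", dtype=np.int64)],
)
tiledb.Array.create("ais", schema)
```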
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
a. Load: Figure below shows the time to load a dense array into TileDB, HDF5 and
SciDB, as well as their gzip-compressed versions, denoted TileDB+Z, HDF5+Z and
SciDB+Z, respectively.
In this experiment, only one CPU core was activated and each system ran on a single
instance. They generated 2D synthetic arrays with sizes 4 GB, 8 GB, and 16 GB,
and dimension domains 50,000 × 20,000, 50,000 × 40,000, and 100,000 × 40,000,
respectively.
TileDB matches the performance of HDF5 (it takes only 60 s to load 4 GB), and it is
consistently more than an order of magnitude faster than SciDB. A dollar ('$')
symbol indicates that the system did not manage to terminate within the shown
time.
However, the performance of TileDB+Z, HDF5+Z and SciDB+Z deteriorates, since
gzip compression adds CPU cost.
Figure: Load performance of dense arrays
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
a. Load: Figure below assesses the cost of loading data in parallel, for the 4 GB array.
In this experiment, as many CPU cores were activated as the number of tested
instances. They excluded HDF5+Z, since PHDF5 does not allow parallel writes with
compression.
TileDB and HDF5 are unaffected by the level of parallelism, as they are I/O bound.
Moreover, each process writes large data chunks at a time, monopolizing the
internal parallelism of the SSD; thus, no extra savings are noticed with parallel
processes.
SciDB seems to scale because it is CPU bound. Moreover, the compressed versions
of all systems scale nicely, since they are CPU bound as well due to the CPU cost
of gzip compression.
The performance of TileDB and HDF5 match, whereas TileDB and TileDB+Z remain
substantially faster than SciDB and SciDB+Z.
Figure: Load performance of dense arrays
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
b. Update: Figure below shows the time to perform random element updates to the 4
GB array, as a function of the number of updated elements.
They excluded HDF5+Z as it does not support updates on compressed data.
TileDB is up to more than 2 orders of magnitude faster than HDF5, with 100K updates
running in less than 1s in TileDB vs. more than 100s in HDF5, and more than 4 orders
of magnitude faster than SciDB. This is due to the sequential, fragment-based writes of
TileDB, as opposed to the in-place updates in HDF5 and the chunk-based updates in
SciDB. The performance gap increases with the number of updated elements.
Figure: Random update performance of dense arrays
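To show why these updates are cheap, a hedged sketch in the modern tiledb-py API: each write session below appends one new fragment sequentially rather than rewriting data in place. It reuses the illustrative "dense_example" array from the data-model sketch, and the exact write syntax is an assumption about the current library:

```python
import numpy as np
import tiledb

# First write session: one batch of cell updates -> one new fragment.
with tiledb.open("dense_example", mode="w") as A:
    A[1:3, 1:3] = {"a1": np.full((2, 2), 42, dtype=np.int32),
                   "a2": np.array([["x", "yy"], ["zzz", "w"]])}

# Second write session -> a second fragment; nothing is updated in place.
with tiledb.open("dense_example", mode="w") as A:
    A[1:2, 1:2] = {"a1": np.array([[7]], dtype=np.int32),
                   "a2": np.array([["q"]])}
```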
Dense Arrays: TileDB vs HDF5, SciDB and Vertica
b. Update: Figure below illustrates the update cost on SSDs when varying the number
of instances and fixing the number of element updates to 100K.
They split the updates equally into separate binary files, one for each instance.
TileDB and HDF5 are unaffected by the number of instances and, thus, all the previous
observations hold. SciDB performance improves with the number of instances.
However, it is still up to more than 3 orders of magnitude slower than TileDB.
Figure: Random update performance of dense arrays
Sparse Arrays: TileDB vs SciDB and Vertica
The next focus is on sparse arrays, comparing TileDB with Vertica+Z (gzip-compressed
and following SRAM) and SciDB on the AIS dataset. HDF5 is not optimized for sparse
arrays, so they omitted it from these experiments.
a. Load: Figure below shows the load performance of the three systems. They created
three datasets with sizes 6 GB, 12 GB, and 24 GB by loading from the AIS dataset
(originally in CSV format), after de-duplicating the coordinates.
TileDB is once again more than an order of magnitude faster than SciDB. Moreover,
TileDB+Z is always slightly faster than Vertica+Z.
Figure: Load performance of sparse arrays
Sparse Arrays: TileDB vs SciDB and Vertica
a. Load: Figure below plots the time to load the 6 GB dataset in parallel, versus the
number of instances. Note that Vertica by default uses all the cores in the system.
Thus, when they varied the number of instances, they activated only as many CPU
cores as the number of tested instances.
TileDB, TileDB+Z and Vertica+Z scale nicely with the number of instances. The
performance differences among the systems are similar to what was explained above
for the serial case.
Figure: Load performance of sparse arrays
Conclusion: Key Takeaways
The evaluation reveals the following main results:
(i) TileDB is several orders of magnitude faster than HDF5 in the case of random
element updates for dense arrays, and at least as efficient in other settings, offering up
to 2x better performance on compressed data;
(ii) TileDB outperforms SciDB in all settings for both dense and sparse arrays, generally
by several orders of magnitude;
(iii) TileDB is 2x-40x faster than Vertica in the dense case. It is also at least as fast as
Vertica in the sparse case, featuring up to more than 2x faster parallel reads;
(iv) the performance of the read algorithm in TileDB is robust up to a large number of
fragments, whereas the consolidation mechanism requires only marginally more time
than the initial loading;
(v) TileDB exhibits excellent scalability as the dataset size and level of parallelism
increase.
Future Work
The described experiments use only synthetic 2D integer
data for the dense case. A stronger evaluation would use
higher-dimensional, real-world data. Also, TileDB should be
run in configurations more directly comparable to its
competitors.
In addition, not all compression schemes are currently
supported by the TileDB API. Its current implementation
only provides support for gzip, while other schemes like
Run-Length Encoding (RLE) and Lempel-Ziv (LZ) are
not supported.
THANK YOU FOR LISTENING