This document summarizes a TileDB webinar on TileDB Embedded, an embeddable C++ library that stores and accesses multi-dimensional arrays. It covers who the webinar is for, a short disclaimer, and TileDB's origins and investors. It then summarizes what TileDB Embedded is, its performance, open-source nature, interoperability, and optimization for the cloud, and outlines the agenda: why arrays, internal mechanics, examples, and a comparison to other formats.
1. TileDB webinars
The TileDB Embedded Storage Engine
Dr. Stavros Papadopoulos, Founder & CEO of TileDB, Inc.
2. Who is this webinar for?
Those wanting to learn about data storage fundamentals
Layout, compression, IO, etc.
Those looking to efficiently store/access any kind of data to/from anywhere
Dataframes, genomics, LiDAR, SAR, weather, and more, with a single engine
Those tired of managing custom, inefficient data formats
Formats not supporting fast updates, indexing, versioning, cloud performance
3. Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
4. Who we are
WHERE IT ALL STARTED: TileDB got spun out from MIT and Intel Labs in 2017
INVESTORS: Raised over $20M, we are very well capitalized
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
5. What is TileDB Embedded?
An embeddable C++ library that stores and accesses multi-dimensional arrays
Dense array Sparse array
It implements very fast array slicing across dimensions
6. TileDB Embedded at a Glance
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance
Built in C++
Fully-parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
Rapid updates & data versioning
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
Schema evolution
7. TileDB Embedded at a Glance
Open source: https://github.com/TileDB-Inc/TileDB
Extreme interoperability
Numerous APIs
Numerous integrations
All backends
Optimized for the cloud
Immutable writes
Parallel IO
Minimization of requests
8. TileDB Embedded at a Glance
APIs & tool Integrations with zero-copy where possible
TileDB Embedded: open-source interoperable storage with a universal open-spec array format
● Parallel IO, rapid reads & writes
● Columnar, cloud-optimized
● Data versioning & time traveling
9. Agenda
Why arrays?
The basics
Advanced internal mechanics
Examples
Work in progress
Comparison to other formats and engines
Docs at docs.tiledb.com
10. Why Arrays?
Regardless of what kind of data you have, it is laid out in a 1D storage medium
[Figure: a 1D medium addressed as Byte 0, 1, ...]
Regardless of what kind of algorithm you run, the algorithm involves a set of slices
[Figure: algorithm as a task graph, where each task may slice]
11. Why Arrays?
Byte 0 1 ...
Byte 0 1 ...
Performance is absolutely dictated by the slice result locality on the 1D medium
12. Why Arrays?
Arrays provide a flexible way to map/slice any-dimensional (ND) data to/from a 1D layout
Giving different “importance” to different dimensions (order and tiling)
Choosing whether dimension coordinates should be materialized or not (dense vs. sparse)
Considering compression, encryption and other filters (tiling)
Abstracting all the engineering magic that it takes to make everything very fast (engine)
Unifying the data model for all application domains! (universality)
Building indices for fast search (e.g., R-trees)
14. Arrays Are Universal
What else can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
19. Array Metadata
dense_array1
├── __t2_t2_uuid2_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
├── __t2_t2_uuid2_v.ok
├── __lock.tdb
├── __meta
│ └── __t3_t3_uuid3
└── __schema
└── __t1_t2_uuid1
You can attach any number of (key, value) pairs to an array
The key must be a string, and the value can be anything
metadata go here
20. Multiple Attributes
dense_array1
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
│ └── a1.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
1,a 2,b 3,c 4,d
5,e 6,f 7,g 8,h
9,i 10,j 11,k 12,l
13,m 14,n 15,o 16,p
You can store more than one value in each cell, even of different types
TileDB has a “columnar” format that allows you to efficiently subselect on attributes
attribute data
21. Var-length Attributes
dense_array3
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
│ └── a0_var.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
TileDB supports storing variable-length values in a cell (of any data type)
a bb ccc dddd
e ff ggg hhhh
i jj kkk llll
m nn ooo pppp
offsets
var-length data
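To make the "offsets" plus "var-length data" layout a little more concrete, here is a minimal sketch in plain Python. It is not the TileDB API: the helper names `pack_var` and `unpack_var` are hypothetical, and the use of byte offsets into one contiguous UTF-8 buffer is an assumption made for illustration.

```python
def pack_var(values):
    """Pack strings into (offsets, data).

    offsets[i] is the byte position where value i starts inside data.
    """
    offsets, chunks, pos = [], [], 0
    for v in values:
        raw = v.encode("utf-8")
        offsets.append(pos)
        chunks.append(raw)
        pos += len(raw)
    return offsets, b"".join(chunks)


def unpack_var(offsets, data):
    """Rebuild the original strings from the two buffers."""
    ends = offsets[1:] + [len(data)]
    return [data[s:e].decode("utf-8") for s, e in zip(offsets, ends)]


# First row of the example grid on the slide.
cells = ["a", "bb", "ccc", "dddd"]
offsets, data = pack_var(cells)
```

Each value's length is implied by the gap to the next offset, so only the last value needs the total buffer size.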
22. Var-length Dimensions
sparse_array4
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
│ └── d0.tdb
│ └── d0_var.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
You can also have var-length dimensions and slice naturally using string ranges
Applicable only to sparse arrays
offsets
var-length data
a bb ccc dddd e ff
1 2 3 4 5 6
unbounded domain
infinite gaps
27. Tiling | Dense Arrays
A space tile is the atomic unit of IO
[Figure: a 4x4 dense array with cell values 1 to 16, shown with two different space tile extents. With one choice of extents a slice fetches the whole array from storage; with the other it fetches only a portion of the array (a tile).]
28. Cell Layout | Dense Arrays
Three parameters define the values layout on storage, called the global order
Space tile extents
Tile order/layout (row-major or column-major)
Cell order/layout (row-major or column major)
[Figure: three example layouts: (a) row-major tile order, row-major cell order, 2x2 space tiles; (b) col-major tile order, row-major cell order, 2x2 space tiles; (c) row-major tile order, col-major cell order, 4x2 space tiles]
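The way the three parameters combine into a global order can be sketched as follows. This is plain Python, not TileDB code; `global_order` and `values_in_order` are hypothetical helpers, and the 4x4 array with row-major cell values 1 to 16 is assumed as the example.

```python
def global_order(rows, cols, tile_rows, tile_cols,
                 tile_order="row", cell_order="row"):
    """Return 0-based (row, col) coordinates in 1D storage order."""
    # Top-left corner of every space tile, generated in row-major order.
    tiles = [(tr, tc)
             for tr in range(0, rows, tile_rows)
             for tc in range(0, cols, tile_cols)]
    if tile_order == "col":
        tiles.sort(key=lambda t: (t[1], t[0]))
    out = []
    for tr, tc in tiles:
        # Cells of this tile, generated in row-major order.
        cells = [(r, c)
                 for r in range(tr, min(tr + tile_rows, rows))
                 for c in range(tc, min(tc + tile_cols, cols))]
        if cell_order == "col":
            cells.sort(key=lambda rc: (rc[1], rc[0]))
        out.extend(cells)
    return out


def values_in_order(order, cols):
    """Map coordinates to the values 1..N of a row-major numbered grid."""
    return [r * cols + c + 1 for r, c in order]


row_row = values_in_order(global_order(4, 4, 2, 2, "row", "row"), 4)
col_row = values_in_order(global_order(4, 4, 2, 2, "col", "row"), 4)
row_col = values_in_order(global_order(4, 4, 2, 2, "row", "col"), 4)
```

With 2x2 tiles, `row_row` starts 1, 2, 5, 6: the whole first tile is written before any cell of the next tile, which is what keeps a tile-aligned slice contiguous on the 1D medium.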
29. Tiling & Cell Layout | Sparse Arrays
Sparse arrays store only non-empty cells
Grouping non-empty cells with space tiles would be inefficient (due to potential skew)
The atomic unit of IO in sparse arrays is the data tile, of fixed (user-defined) capacity
First impose a global order similar to dense arrays, then group based on capacity
[Figure: the same sparse array with col-major tile order, row-major cell order and 2x2 space tiles, grouped into data tiles with capacity 2 and with capacity 4; the space tile extents and a data tile are marked]
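The "impose a global order, then group by capacity" step can be sketched like this. It is an illustration in plain Python, not TileDB code; `data_tiles` and `mbr` are hypothetical names, the tile and cell orders are fixed to col-major tiles and row-major cells as in the figure, and the sample coordinates are made up.

```python
def data_tiles(cells, tile_rows, tile_cols, capacity):
    """Group non-empty 0-based (row, col) cells into data tiles.

    Cells are sorted by space tile (col-major tile order), then by
    row-major position inside the tile, then cut every `capacity` cells.
    """
    def key(rc):
        r, c = rc
        return (c // tile_cols, r // tile_rows, r, c)

    ordered = sorted(cells, key=key)
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]


def mbr(tile):
    """Minimum bounding rectangle of a data tile: ((row_lo, row_hi), (col_lo, col_hi))."""
    rows = [r for r, _ in tile]
    cols = [c for _, c in tile]
    return (min(rows), max(rows)), (min(cols), max(cols))


# Hypothetical non-empty cells of a 4x4 sparse array, in arbitrary input order.
cells = [(3, 3), (0, 3), (1, 1), (2, 2), (0, 0), (3, 0)]
tiles_cap2 = data_tiles(cells, 2, 2, capacity=2)
tiles_cap4 = data_tiles(cells, 2, 2, capacity=4)
```

Every data tile holds the same number of cells (except possibly the last), regardless of how skewed the non-empty cells are across space tiles.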
30. Hilbert Order | Sparse Arrays
Space tiles greatly affect the cell layout in sparse arrays
Sometimes it is very difficult to define a good space tiling (especially with floats and strings)
For such cases, the Hilbert order is ideal (no tile extents and order)
For floats we discretize the domain into buckets based on the number of dimensions
For strings we assign a number of bits per dimension and then use the string prefixes as numbers
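As a rough illustration of ordering cells along a Hilbert curve, the sketch below uses the classic 2D coordinate-to-index conversion on an n x n grid, plus a simple bucketing helper for floats. It is not TileDB's implementation: the function names, the restriction to two dimensions, and the bucketing formula are all assumptions for illustration.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) on the Hilbert curve of an n x n grid.

    n must be a power of two; x and y are integers in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def bucket(value, lo, hi, n):
    """Discretize a float in [lo, hi] into one of n integer buckets."""
    i = int((value - lo) / (hi - lo) * n)
    return min(max(i, 0), n - 1)


# Sort all cells of a 4x4 grid along the curve.
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: hilbert_index(4, *p))
```

Consecutive cells in `order` are always grid neighbours, which is the locality property that makes the curve useful when no good space tiling is available.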
31. Tile Filters
TileDB allows a wide range of filters to be applied to each tile prior to its storage
Compressors (gzip, zstd, bzip2, …)
Checksums
Encryption
The atomic unit of filtering is the chunk (typically equal to the L1 cache size)
TileDB applies the filters across chunks in parallel in a pipeline
[Figure: a tile split into chunks, with each chunk passed through zstd and AES256 filters]
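A chunked filter pipeline of this kind might look like the sketch below. It is an illustration with the Python standard library only, not TileDB code: zlib stands in for the compressors named above, a CRC32 stands in for the checksum filter, encryption is left out because the standard library has no AES, and the 64 KB `CHUNK_SIZE` and all function names are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # assumed chunk size, for illustration only


def filter_chunk(chunk):
    """Compress one chunk, then checksum the compressed bytes."""
    compressed = zlib.compress(chunk)
    return zlib.crc32(compressed), compressed


def unfilter_chunk(item):
    """Verify the checksum, then decompress (the pipeline in reverse)."""
    checksum, compressed = item
    if zlib.crc32(compressed) != checksum:
        raise ValueError("checksum mismatch")
    return zlib.decompress(compressed)


def filter_tile(tile, chunk_size=CHUNK_SIZE):
    """Split a tile into chunks and filter the chunks in parallel."""
    chunks = [tile[i:i + chunk_size] for i in range(0, len(tile), chunk_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(filter_chunk, chunks))


def unfilter_tile(filtered):
    """Reverse the filters on every chunk and reassemble the tile."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(unfilter_chunk, filtered))


tile = bytes(range(256)) * 1024  # a 256 KB tile of sample data
filtered = filter_tile(tile)
```

Because each chunk is filtered independently, the chunks can be processed in parallel on write and on read.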
33. Versioning and Time Traveling
In TileDB, every write is immutable
Each (batch) write creates a timestamped fragment
With fragments, TileDB implements versioning and time traveling
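35. Versioning and Time Traveling | Sparse Arrays
When no dups allowed
[Figure: a sparse array written at t1 with cell values 1 to 6, followed by a write at t2 that places 100 and 40 on the cells that held 1 and 4]
Read at [0,t1]: 1, 2, 3, 4, 5, 6
Read at (t1,t2]: 100, 40
Read at [0,t2]: 100, 2, 3, 40, 5, 6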
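36. Versioning and Time Traveling | Sparse Arrays
When dups are allowed
[Figure: the same write at t1 (cell values 1 to 6) and write at t2 (values 100 and 40)]
Read at [0,t1]: 1, 2, 3, 4, 5, 6
Read at (t1,t2]: 100, 40
Read at [0,t2]: 1, 100, 2, 3, 4, 40, 5, 6 (old and new values are both returned as dups)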
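The behaviour of timestamped, immutable fragments under time traveling can be mimicked with a few lines of plain Python. This is a conceptual sketch, not the TileDB API: the `read` function, the 1D integer coordinates 1 to 6, and the use of integers 1 and 2 for t1 and t2 are all assumptions; the half-open interval (t1,t2] is modelled by starting the closed interval at t2.

```python
def read(fragments, t_start, t_end, allow_dups=False):
    """Return sorted (coord, value) pairs from fragments in [t_start, t_end].

    fragments is a list of (timestamp, {coord: value}); nothing is ever
    modified in place, each write just adds a new fragment.
    """
    selected = sorted((f for f in fragments if t_start <= f[0] <= t_end),
                      key=lambda f: f[0])
    if allow_dups:
        # Every written value is returned, including superseded ones.
        return sorted((c, v) for _, cells in selected for c, v in cells.items())
    latest = {}
    for _, cells in selected:
        latest.update(cells)  # newer fragments win on the same coordinate
    return sorted(latest.items())


T1, T2 = 1, 2
fragments = [
    (T1, {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}),  # write at t1
    (T2, {1: 100, 4: 40}),                       # write at t2
]

at_t1 = read(fragments, 0, T1)
only_t2 = read(fragments, T2, T2)
up_to_t2 = read(fragments, 0, T2)
up_to_t2_dups = read(fragments, 0, T2, allow_dups=True)
```

Reading "as of t1" simply ignores the later fragment, which is why no data has to be copied or rewritten to keep old versions available.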
37. Indexing
TileDB has a three-level indexing approach
Fragment timestamps (in the fragment names) for time traveling
Non-empty domain in each fragment’s metadata
Either simple offset arithmetic (dense) or R-trees (sparse)
Algorithm
1. Get list of fragment names (with .ok), e.g., t1_t1_uuid1_v, t2_t2_uuid2_v, ...
2. Ignore fragments with timestamp not in time traveling interval
3. Ignore fragments with non-empty domain not overlapping slice (from each fragment's __fragment_metadata.tdb)
4a. Ignore dense tiles via implicit positional indexing, or
4b. Ignore sparse tiles from the R-tree that do not overlap the slice
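Steps 1 to 3 of the algorithm (fragment-level pruning) can be sketched as below. This is plain Python over an in-memory listing, not TileDB code. The name pattern `__<t_start>_<t_end>_<uuid>_<version>` is inferred from the directory listings on the slides, the function names are hypothetical, and requiring the fragment's timestamp range to lie fully inside the time traveling interval is an assumption.

```python
def parse_fragment_name(name):
    """Extract (t_start, t_end) from a name like '__1_1_aaa_v' (assumed pattern)."""
    _, _, t_start, t_end, _uuid, _version = name.split("_")
    return int(t_start), int(t_end)


def overlaps(a, b):
    """True if two boxes, given as inclusive (lo, hi) per dimension, intersect."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


def select_fragments(listing, non_empty_domains, time_interval, slice_box):
    # Step 1: only fragments with a matching .ok file count as committed.
    committed = [n[:-len(".ok")] for n in listing if n.endswith(".ok")]
    t_lo, t_hi = time_interval
    result = []
    for name in committed:
        t_start, t_end = parse_fragment_name(name)
        # Step 2: skip fragments outside the time traveling interval.
        if not (t_lo <= t_start and t_end <= t_hi):
            continue
        # Step 3: skip fragments whose non-empty domain misses the slice.
        if not overlaps(non_empty_domains[name], slice_box):
            continue
        result.append(name)
    return result


listing = ["__1_1_aaa_v", "__1_1_aaa_v.ok",
           "__2_2_bbb_v", "__2_2_bbb_v.ok",
           "__3_3_ccc_v"]  # no .ok file: write not committed
domains = {
    "__1_1_aaa_v": [(1, 2), (1, 4)],
    "__2_2_bbb_v": [(3, 4), (1, 2)],
}
```

For example, `select_fragments(listing, domains, (0, 2), [(1, 2), (3, 4)])` keeps only the first fragment: the second is committed and in the time interval, but its non-empty domain does not touch the slice.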
38. Indexing
Dense arrays: given the non-empty domain, the space tile extents and the tile order, we can easily find that this slice overlaps the second and fourth tile
[Figure: a 4x4 dense array with cell values 1 to 16, row-major tile order, 2x2 space tiles]
Sparse arrays: a slicing query would just traverse the tree top-down, visiting only nodes/MBRs that intersect the slice
[Figure: a sparse array with col-major tile order, row-major cell order, 2x2 space tiles and capacity 2, whose data tiles are bounded by MBR1 to MBR4; the R-tree over MBR1, MBR2, MBR3, MBR4 is stored in fragment metadata]
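Both kinds of tile pruning can be sketched in a few lines. This is a conceptual illustration in plain Python, not TileDB code: the function names are hypothetical, coordinates are 0-based with inclusive ranges, the dense case assumes row-major tile order, and the sparse case scans a flat list of MBRs as a stand-in for a top-down R-tree traversal. The sample MBR values are made up.

```python
def dense_tiles_for_slice(n_cols, tile_rows, tile_cols, row_range, col_range):
    """Ids of space tiles a slice touches, found by offset arithmetic only."""
    tiles_per_row = -(-n_cols // tile_cols)  # ceiling division
    r_lo, r_hi = row_range
    c_lo, c_hi = col_range
    return [tr * tiles_per_row + tc
            for tr in range(r_lo // tile_rows, r_hi // tile_rows + 1)
            for tc in range(c_lo // tile_cols, c_hi // tile_cols + 1)]


def overlaps(a, b):
    """True if two boxes, given as inclusive (lo, hi) per dimension, intersect."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


def sparse_tiles_for_slice(mbrs, slice_box):
    """Indexes of data tiles whose MBR intersects the slice."""
    return [i for i, m in enumerate(mbrs) if overlaps(m, slice_box)]


# 4x4 dense array, 2x2 space tiles, slice over rows 1..2 and columns 2..3.
dense_hits = dense_tiles_for_slice(4, 2, 2, (1, 2), (2, 3))

# Four hypothetical MBRs and a slice over rows 0..1, columns 0..3.
mbrs = [[(0, 1), (0, 1)], [(2, 3), (0, 0)], [(0, 0), (3, 3)], [(2, 3), (2, 3)]]
sparse_hits = sparse_tiles_for_slice(mbrs, [(0, 1), (0, 3)])
```

In the dense example the slice touches tile ids 1 and 3, i.e. the second and fourth tile, and no tile metadata had to be read to find that out.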
39. Consolidation & Vacuuming
Numerous fragments can lead to performance degradation (loss of locality, expensive listing)
TileDB supports two levels of consolidation
Fragment metadata (group the non-empty domains in a single place)
Fragments (better preserve data locality)
Old fragments are preserved after consolidation (for time traveling)
TileDB can vacuum old fragments to save space and boost listing
Time traveling will not work on vacuumed fragments
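The relationship between fragment consolidation and vacuuming can be illustrated with a small in-memory model. This is a sketch in plain Python, not TileDB code: the dictionary representation, the function names, and the `merged` placeholder used where a real fragment name would carry a UUID are all assumptions.

```python
def consolidate(fragments):
    """Add one merged fragment covering all inputs; keep the originals.

    fragments maps name -> (t_start, t_end, {coord: value}).
    Returns (new_fragments, consolidated_name).
    """
    ordered = sorted(fragments.values(), key=lambda f: f[1])
    merged = {}
    for _, _, cells in ordered:
        merged.update(cells)  # later writes win
    t_start = min(f[0] for f in ordered)
    t_end = max(f[1] for f in ordered)
    name = f"__{t_start}_{t_end}_merged_v"  # 'merged' is a placeholder id
    out = dict(fragments)
    out[name] = (t_start, t_end, merged)
    return out, name


def vacuum(fragments, consolidated_name):
    """Drop the old fragments that the consolidated fragment covers."""
    t_start, t_end, _ = fragments[consolidated_name]
    return {n: f for n, f in fragments.items()
            if n == consolidated_name or not (t_start <= f[0] and f[1] <= t_end)}


frags = {
    "__1_1_a_v": (1, 1, {1: 1, 4: 4}),
    "__2_2_b_v": (2, 2, {1: 100}),
}
consolidated, name = consolidate(frags)
vacuumed = vacuum(consolidated, name)
```

After `consolidate` the state as of t1 can still be reconstructed from the original fragments; after `vacuum` only the merged fragment remains, so that earlier state is gone.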
40. Attribute Filter Push-Down
TileDB supports pushing attribute filter conditions down to the engine
That typically boosts performance
Much less data gets copied around
More L1-cache conscious
More opportunities for parallelism and vectorization
41. Schema Evolution
TileDB supports schema evolution (since v2.4)
Adding an attribute
Dropping an attribute
More schema evolution features are coming up
Full versioning and time traveling is supported
42. Notes on Writing
Lots of flexibility in writing in different orders, different domain subarray, etc.
Support for lock-free, parallel writing
Tips for performance:
Each tile should be 100KB-1MB
Each fragment should be 1-2GB
Fragments should not “interleave”
Run fragment metadata consolidation (especially on cloud object stores)
No support for deletions and updates yet (coming up soon)
43. Notes on Reading
TileDB is eventually consistent
Support for parallel writers, parallel readers (all lock-free)
Support for reads in different layouts
Support for “streaming reads” (incomplete queries)
Tips for performance:
Allocate sufficient space for the result buffers (minimize incomplete queries)
Tune written layout based on the read layout (application dependent)
Push down coordinate and attribute filter conditions
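The idea behind "streaming reads" (incomplete queries) is that a read with a result buffer too small for the full result returns what fits and can be resubmitted for the rest. A minimal model in plain Python follows. It is not the TileDB API: `streaming_read` and the status strings are hypothetical labels chosen for illustration.

```python
def streaming_read(cells, buffer_capacity):
    """Yield (batch, status) pairs, each batch at most buffer_capacity cells."""
    for start in range(0, len(cells), buffer_capacity):
        batch = cells[start:start + buffer_capacity]
        done = start + buffer_capacity >= len(cells)
        yield batch, ("COMPLETE" if done else "INCOMPLETE")


results = list(streaming_read(list(range(5)), buffer_capacity=2))
one_shot = list(streaming_read(list(range(5)), buffer_capacity=8))
```

With a buffer of 2 cells the read takes three rounds; with a buffer of 8 it finishes in one, which is why allocating sufficient result buffers reduces the number of incomplete queries.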
45. Coming Up
More schema evolution features
Support for deletes and updates
Git-like versioning
ACID via modularizing locking
More tile filters (e.g., sum, min, max)
RLE and dictionary compression on strings
Computations on compressed data
Linear Algebra operations
More SQL push down (e.g., group by)
Graph algorithms
47. High-level Comparisons
vs. HDF5
TileDB is cloud-native
TileDB has support for sparse arrays
TileDB has support for versioning and time traveling
vs. Zarr
TileDB is built in C++ and is more interoperable
TileDB has support for sparse arrays
TileDB has support for versioning and time traveling
48. High-level Comparisons
vs. Parquet
TileDB is multi-dimensional and supports more flexible layouts
TileDB has support for dense arrays
vs. Delta Lake
TileDB does not rely on Spark, Presto or other subsystem
TileDB has support for dense arrays
TileDB has support for versioning and time traveling
TileDB does not support deletes, updates and full ACID (yet)
TileDB is natively multi-dimensional and supports more flexible layouts