Categorisation of databases based on how they store data. Pictorial depiction of partitioning and replication of data. Grouping of prominent databases used in the industry.
3. So many names and technologies (aka confusing)
Azure data warehouse
Blob
Redis
Cassandra
Druid
Graphite
MySQL
MemSQL
…. plus tens of other options in the market
4. Break it down
1. How is data stored
Row oriented,
Column oriented,
Sorted string,
Document,
Object store,
Key-value in memory,
Time series
2. Partitioning - Scale up and down
3. Replication - Consistency
4. Atomicity - All or none
5. Isolation - consistent view of data through the transaction
6. 2. Partitioning
Writing the data
Key Data
Specified as partition key
Or
Generated by system
Range partition (Manual intervention)
Hash partition (Scaling issues)
Consistent hashing (Avoid shuffle during scaling)
Round robin (Even distribution)
Stateless
Stateful
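The difference between plain hash partitioning and consistent hashing can be sketched in a few lines. This is a minimal illustration, not how any particular database implements it: the `stable_hash` helper, the `ConsistentHashRing` class, the partition names and the choice of 100 virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def hash_partition(key: str, num_partitions: int) -> int:
    """Plain hash partitioning: hash modulo the partition count."""
    return stable_hash(key) % num_partitions

class ConsistentHashRing:
    """Hypothetical ring: each node owns many points on a hash circle."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._points)
        return self._ring[idx][1]

keys = [f"user-{i}" for i in range(10_000)]

# Scaling from 4 to 5 partitions with modulo hashing reshuffles most keys.
moved_mod = sum(hash_partition(k, 4) != hash_partition(k, 5) for k in keys)

# With a ring, only keys now owned by the new node change partition.
before = ConsistentHashRing(["p0", "p1", "p2", "p3"])
after = ConsistentHashRing(["p0", "p1", "p2", "p3", "p4"])
moved_ring = sum(before.lookup(k) != after.lookup(k) for k in keys)

print(f"modulo hashing moved {moved_mod} keys, consistent hashing moved {moved_ring}")
```

Both schemes are stateless in the sense that the partition is computed from the key alone; the ring simply limits how many keys have to move when a partition is added.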
7. 2. Partitioning
Writing the data
Key Data
Specified as partition key
Or
Generated by system
Partition 1
Partition 2
Partition n
Redirect
Logic
8. 2. Partitioning
Reading it back
Key (?) Data
Partition 1
Partition 2
Partition n
Partitioning key columns are specified
Partitioning key columns are NOT specified
Local indexes
Local indexes
Local indexes
9. 2. Partitioning
Reading it back
Key (?) Data
Partition 1
Partition 2
Partition n
Partitioning key columns are specified
Partitioning key columns are NOT specified
Local indexes
Local indexes
Local indexes
Process 1
Process 2
Process n
Collect
Render
Output
MPP
Massively parallel processing
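The two read paths can be sketched as follows. This is only an illustration under assumed names: `customer_id` as the partitioning key, a local index on `city`, modulo routing, and a thread pool standing in for the per-partition processes are all hypothetical choices, not taken from any specific engine.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 3

def partition_for(customer_id: int) -> int:
    """Redirect logic: route a key to its partition (assumed modulo scheme)."""
    return customer_id % NUM_PARTITIONS

# Each partition keeps its rows plus a local index on the "city" column.
partitions = [{"rows": {}, "city_index": {}} for _ in range(NUM_PARTITIONS)]

def write(row: dict) -> None:
    p = partitions[partition_for(row["customer_id"])]
    p["rows"][row["customer_id"]] = row
    p["city_index"].setdefault(row["city"], []).append(row["customer_id"])

def read_by_key(customer_id: int):
    """Partitioning key specified: exactly one partition is touched."""
    p = partitions[partition_for(customer_id)]
    return p["rows"].get(customer_id)

def read_by_city(city: str) -> list:
    """Partitioning key NOT specified: every partition consults its local
    index in parallel, then the partial results are collected."""
    def scan(p):
        return [p["rows"][cid] for cid in p["city_index"].get(city, [])]

    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        chunks = list(pool.map(scan, partitions))
    return sorted((r for chunk in chunks for r in chunk),
                  key=lambda r: r["customer_id"])

for i in range(9):
    write({"customer_id": i, "city": "Pune" if i % 2 == 0 else "Oslo"})

print(read_by_key(4))
print([r["customer_id"] for r in read_by_city("Oslo")])
```

The second query does more total work, but because each partition works independently the elapsed time is bounded by the slowest partition plus the collect step.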
10. 3. Replication
Centralised model - Master slave
Round robin based partitioning requires a centralised metastore to keep track of states
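Why round robin needs a metastore can be shown with a small sketch. The `RoundRobinWriter` class and its in-process `metastore` dictionary are hypothetical stand-ins; a real system would keep this state in a separate, centralised service.

```python
from itertools import cycle

class RoundRobinWriter:
    """Spreads new keys evenly, so key placement cannot be recomputed later."""

    def __init__(self, num_partitions: int):
        self._next = cycle(range(num_partitions))
        self.partitions = [dict() for _ in range(num_partitions)]
        self.metastore = {}  # key -> partition number (the centralised state)

    def write(self, key, value) -> int:
        if key in self.metastore:
            p = self.metastore[key]      # known key: update it in place
        else:
            p = next(self._next)         # new key: next partition in turn
            self.metastore[key] = p
        self.partitions[p][key] = value
        return p

    def read(self, key):
        # Without the metastore, every partition would have to be searched.
        return self.partitions[self.metastore[key]][key]

store = RoundRobinWriter(num_partitions=3)
for i in range(6):
    store.write(f"k{i}", i * 10)

print([len(p) for p in store.partitions])
print(store.read("k4"))
```

The distribution is perfectly even, but the price is the lookup table: it has to be available and consistent for every read and write, which is what pushes this scheme towards a centralised model.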
12. How is data stored & CRUD operations
Data format applies to the data within a partition
Your replication and partitioning strategy is independent of storage format
14. Data storage - Row oriented
Read path
Ordering of columns matters: (a,b,c) is different from (c,b,a)
Penalty for updating all index trees for the table
Statistics refresh can be deferred -
Hybrids
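Why column order matters can be illustrated with a sorted list of tuples standing in for an index tree. This is a simplified sketch, assuming a composite index behaves like a sorted sequence of key tuples; the sample rows and the helper names `seek_by_leading` and `scan_by_last` are made up for the example.

```python
import bisect

rows = [
    {"a": 1, "b": "x", "c": 30},
    {"a": 1, "b": "y", "c": 10},
    {"a": 2, "b": "x", "c": 20},
    {"a": 2, "b": "y", "c": 30},
    {"a": 3, "b": "x", "c": 10},
    {"a": 3, "b": "z", "c": 30},
]

# Two indexes over the same rows, differing only in column order.
index_abc = sorted((r["a"], r["b"], r["c"]) for r in rows)
index_cba = sorted((r["c"], r["b"], r["a"]) for r in rows)

def seek_by_leading(index, value):
    """Seek: binary-search to the first entry whose leading column matches,
    then read only the contiguous run of matching entries."""
    lo = bisect.bisect_left(index, (value,))
    hi = lo
    while hi < len(index) and index[hi][0] == value:
        hi += 1
    return index[lo:hi]

def scan_by_last(index, value):
    """No usable prefix: every entry has to be examined."""
    return [entry for entry in index if entry[2] == value]

print(seek_by_leading(index_abc, 2))   # a = 2 is a cheap seek on (a,b,c)
print(scan_by_last(index_abc, 30))     # c = 30 needs a full pass over (a,b,c)
print(seek_by_leading(index_cba, 30))  # c = 30 is a cheap seek on (c,b,a)
```

Keeping both indexes makes both queries fast, which is exactly where the write penalty comes from: every insert or update has to maintain each index tree.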
15. Data storage - Row oriented - Examples
Row level operations
Multiple types of query searches - by virtue of different indexes
Inefficiencies -
16. Analytical querying!
Lots of seeks from disk (range-based queries)
Efficiency(Scan) >>> Efficiency(Seek)
The entire row is fetched to operate on a few columns
Big drawback for analytical queries.
19. Data storage - Column oriented
Hybrid
Efficient for columnar aggregates and joins - Analytical queries
Efficient for filtering data based on condition
Inefficient for frequent updates (causes a lot of soft deletes/tombstones)
Inefficient for retrieval of a select few rows
Compaction overheads
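The trade-off between the two layouts can be made concrete by counting how many values each one has to read. This is a toy in-memory sketch; the field names and sample data are invented, and real engines add compression and block-level I/O that are not modelled here.

```python
FIELDS = ("id", "name", "country", "amount")

# Row oriented: each record is stored (and fetched) as a whole.
rows = [
    (1, "alice", "IN", 120.0),
    (2, "bob", "NO", 80.0),
    (3, "carol", "IN", 50.0),
    (4, "dave", "US", 250.0),
]

# Column oriented: the same data, one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}

def total_amount_row_store(rows):
    """Aggregate over rows: every column of every row is read."""
    values_read, total = 0, 0.0
    for row in rows:
        values_read += len(row)
        total += row[3]
    return total, values_read

def total_amount_column_store(columns):
    """Aggregate over one column: only that column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)

def fetch_row_column_store(columns, position):
    """Retrieving one full row means touching every column list."""
    return tuple(columns[name][position] for name in FIELDS)

print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
print(fetch_row_column_store(columns, 2))
```

The aggregate reads a quarter of the values in the columnar layout, while reassembling a single row has to visit every column, which mirrors the strengths and weaknesses listed above.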
20. Data storage - Sorted String
Immutability concept of columnar, but storage is row level
Row based data
Threshold
Write path
22. Data storage - Sorted String
Conceptually SSTable => Segments
Efficient for range based queries - Scan on disk
Low latency Inserts
Peer to peer protocol. Multi datacenter replication.
Inefficient for interleaved reads - filter queries. Potentially traverse complete table.
Inefficient for aggregates and joins
Compaction overheads
Query first approach
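The write path, segments and compaction can be sketched with a tiny in-memory model. This is an illustrative toy, not the design of any specific database: the `TinyLSM` class, its `threshold` parameter and the use of Python lists as "on-disk" segments are assumptions made for the example.

```python
import bisect

class TinyLSM:
    """Writes go to an in-memory table; once it reaches a threshold it is
    flushed as an immutable, sorted segment (a stand-in for an SSTable)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.memtable = {}
        self.segments = []  # sorted lists of (key, value), oldest first

    def put(self, key, value) -> None:
        self.memtable[key] = value  # low-latency insert, no seek needed
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self) -> None:
        """Merge all segments, keeping only the latest value per key."""
        merged = {}
        for segment in self.segments:  # oldest first, newer values overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []

db = TinyLSM(threshold=2)
for key, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)

print(len(db.segments), db.get("a"))
db.compact()
print(db.segments)
```

Segments are never modified after a flush, so an update is just a newer entry that shadows the old one; compaction is the deferred cost of cleaning those shadowed entries up.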
23. Data storage - Key, Value = Doc : Document
Key Data
Data is a document whose schema can vary.
JSON is the usual format.
Query ability may be required on certain columns in the document.
Ability to specify a column within the document as the key for partitioning
24. Data storage - Key, Value = File : Object store
Key Data
Data is a large file.
Query ability is not required.
Eventual consistency is fine.
Metadata layer to provide a file system look and feel
25. Data storage - Key, Value = Minimal data : Cache
Key Data
Data is a few MB in size
Lightweight data structures used for persisting value
In-memory and fast
Ideal for caching use cases
Hybrid
26. Data storage - Key, Value = Periodic : Time Series
Key Data
Data captured from data-stream/device-measurements periodically at high frequency.
Size of value is not large. Older values should have the capability to be aggregated and stored
Using concept of sorted string database
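Aggregating older values can be sketched as a simple roll-up. This is a minimal illustration assuming points are `(timestamp_seconds, value)` pairs and that averaging is an acceptable aggregate; the function names and the retention parameters are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def downsample(points, bucket_seconds):
    """Replace raw points with one averaged point per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

def roll_up(points, now, keep_raw_seconds, bucket_seconds):
    """Keep recent points at full resolution; aggregate everything older."""
    cutoff = now - keep_raw_seconds
    old = [p for p in points if p[0] < cutoff]
    recent = [p for p in points if p[0] >= cutoff]
    return downsample(old, bucket_seconds), recent

points = [(0, 10.0), (10, 20.0), (60, 30.0), (70, 50.0), (120, 5.0)]
aggregated, recent = roll_up(points, now=130, keep_raw_seconds=20, bucket_seconds=60)

print(aggregated)
print(recent)
```

Because points arrive in time order and are keyed by timestamp, the append-and-flush write pattern of a sorted string store is a natural fit, with the roll-up running alongside compaction.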