What a Modern Database Enables_Srini Srinivasan.pdf

What a Modern Database
Enables
Srini Srinivasan
CTO and Founder
Aerospike

All rights reserved. © 2023 Aerospike, Inc.
Our Driving Design Centers
Optimizations for Modern System Architectures
• CPU and NUMA pinning
• Storage tiers (DRAM, NVMe)
• Hybrid Memory Architecture
• Network to application alignment
Massive Parallelism with Indexing
• Multi-threaded NUMA architecture
• Data distribution across disks, nodes
• Client accesses server in single hop
• AI/ML Processing
Strongly Consistent Transactions
• Zero Data Loss
• Linearizable reads (tunable)
• Read one, write all scheme
• Roster concept maximizes availability
Geo-Distributed Active-Active System
• Uniform data partitioning
• Mixed workload handling
• Self-managed rack aware clusters
• Synch and Asynch Replication


Aerospike Strong Consistency with 33% Less Hardware
Failure Support – Big Hardware Savings
• 1 failure => 2 copies
• 2 failures => 3 copies
When is data consistent?
• Once all nodes respond
Aerospike consensus is
non-quorum, roster-based
How is consistency maintained?
• With a roster
• Determines cluster health
Heartbeats
• Exchanged by nodes
• CPU unaffected as data/node increases
[Diagram: an Application connected to node A (Leader) and node B (Follower)]
Aerospike passes Jepsen tests: https://jepsen.io/analyses/aerospike-3-99-0-3
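
The hardware saving follows from the copy counts listed above (1 failure => 2 copies, 2 failures => 3 copies). A minimal sketch of that arithmetic, assuming the comparison baseline is a majority-quorum scheme that needs 2f + 1 copies to tolerate f failures. The slide does not name its baseline, and the function names here are illustrative only.

```python
def roster_copies(failures: int) -> int:
    """Copies per the slide: tolerating f failures takes f + 1 copies."""
    return failures + 1


def quorum_copies(failures: int) -> int:
    """Assumed baseline: a majority quorum needs 2f + 1 copies."""
    return 2 * failures + 1


def savings(failures: int) -> float:
    """Fraction of copies saved relative to the assumed quorum baseline."""
    baseline = quorum_copies(failures)
    return (baseline - roster_copies(failures)) / baseline


for f in (1, 2):
    print(f, roster_copies(f), quorum_copies(f), round(savings(f) * 100))
```

For one tolerated failure this gives 2 copies against 3, one third fewer, which is consistent with the 33% in the slide title under the stated assumption.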

Strong Consistency (SC) – Write Logic
Writes go to all replicas before returning to the client, with commits that add minimal friction
Participants: Client, Master, Replica 1, Replica 2
1. Request (Client to Master)
2. Write Local (Master)
3. Replicate (Master to Replica 1 and Replica 2)
4. Response (each replica to Master)
5. Advise Replicated* (Master to each replica)
6. Success (Master to Client)
*Advise Replicated is one-way and is sent only when there is more than 1 copy
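
The numbered write steps above can be sketched as a small in-process simulation. This is an illustration under stated assumptions, not Aerospike's implementation: the class and method names are hypothetical, there is no networking or failure handling, and a replica's acknowledgement is modeled as a plain return value.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}           # key -> value
        self.replicated = set()  # keys this node knows are on all copies

    def replicate(self, key, value):
        """Step 3 on the replica side; the return value is step 4."""
        self.data[key] = value
        return True

    def advise_replicated(self, key):
        """Step 5: one-way notification, no reply is expected."""
        self.replicated.add(key)


class Master(Replica):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = list(replicas)

    def write(self, key, value):
        """Handle a client request (step 1) and report success (step 6)."""
        self.data[key] = value                                    # step 2
        acks = [r.replicate(key, value) for r in self.replicas]   # steps 3-4
        if not all(acks):
            return False  # a replica did not acknowledge: do not report success
        self.replicated.add(key)
        for r in self.replicas:  # step 5, skipped when there is only one copy
            r.advise_replicated(key)
        return True                                               # step 6


replicas = [Replica("Replica 1"), Replica("Replica 2")]
master = Master("Master", replicas)
print(master.write("k", "v"), [r.data for r in replicas])
```

The point of the sketch is the ordering: the client sees success only after every replica has acknowledged the write.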

SC – Linearizable Read Logic
A master or replica read alone is sufficient for Sequential Consistency
Linearizable read (participants: Client, Master, Replica 1, Replica 2):
1. Request (Client to Master)
2. Read Local (Master)
3. Status Request (Master to each replica)
4. Status Response (each replica to Master)
5. Response (Master to Client)
No stale reads possible
Extra network round-trip
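
The difference between the two read modes can be sketched as follows. The slide does not say what the status exchange verifies, so this sketch assumes each node carries a version number for the partition and that the master serves the read only if every replica reports the same number. The `version` field and all names are hypothetical.

```python
class Node:
    def __init__(self, name, version=1):
        self.name = name
        self.version = version  # assumed per-partition version marker
        self.data = {}

    def status(self):
        """Steps 3-4: answer a status request from the master."""
        return self.version


def sequential_read(node, key):
    """A single master or replica read, with no status round-trip."""
    return node.data.get(key)


def linearizable_read(master, replicas, key):
    """Read locally, then confirm status with every replica before replying."""
    value = master.data.get(key)                   # step 2
    statuses = [r.status() for r in replicas]      # steps 3-4
    if any(s != master.version for s in statuses):
        raise RuntimeError("replica status mismatch; read not served")
    return value                                   # step 5


master = Node("Master")
replicas = [Node("Replica 1"), Node("Replica 2")]
master.data["k"] = "v"
print(linearizable_read(master, replicas, "k"))
```

The extra status exchange is the additional network round-trip the slide mentions; the sequential read skips it.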

High Availability in an RF2 Strong Consistency System
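[Diagram: one cluster of nodes A, B, C ... Z split across Rack 1 (Zone 1) and Rack 2 (Zone 2) with synchronous replication. Partitions 1 to 3 each have a master (1M, 2M, 3M) and a replica (1R, 2R, 3R). Applications issue Write (3M) and Read (3M or 3R).]
RF2: Complete copy of data
Writes < 10ms
Reads < 1ms
Automatic Sync
RF = Replication Factor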
Rack Awareness pegs data copies to racks distributed across zones or datacenters within a cluster
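
One way to picture rack-aware placement is sketched below. It assumes each partition has an ordered list of candidate nodes and that copies are taken from that list while skipping nodes whose rack already holds a copy. The function name, the `succession` list, and the fallback behaviour are illustrative assumptions, not Aerospike's documented algorithm.

```python
def place_copies(succession, rack_of, replication_factor=2):
    """Pick nodes for one partition, preferring one copy per rack.

    succession: candidate nodes in preference order (first is the master).
    rack_of:    mapping of node name -> rack name.
    """
    chosen = [succession[0]]
    used_racks = {rack_of[succession[0]]}
    # First pass: only accept nodes on racks that hold no copy yet.
    for node in succession[1:]:
        if len(chosen) == replication_factor:
            break
        if rack_of[node] not in used_racks:
            chosen.append(node)
            used_racks.add(rack_of[node])
    # Fallback: with fewer racks than copies, fill from the remaining nodes.
    for node in succession[1:]:
        if len(chosen) == replication_factor:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen


racks = {"A": "Rack 1", "B": "Rack 1", "C": "Rack 2", "Z": "Rack 2"}
print(place_copies(["A", "B", "C", "Z"], racks))
```

With two racks and RF2, the master and its replica land on different racks, so each rack ends up with a copy of every partition.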

Data Availability During Split Brain Events
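[Diagrams: a five-node cluster (A B C D E) shown under several network splits, with nodes labeled RM (Roster Master), RR (Roster Replica), M (Master), and R (Replica)]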
In a healthy cluster the Roster Master is the same as the Master, and the Roster Replica is the same as the Replica.
Rule 1: A sub-cluster is Active if it has the Roster Master and all Roster Replicas, and at least 1 of them is full.
Rule 2: A sub-cluster is Active if it has the majority of nodes and at least one full Roster Master or Roster Replica, OR if it has exactly ½ the roster nodes and the Roster Master, and the partition is full.
Rule 3: A sub-cluster is Active if it is a Super Majority Cluster and the partitions are full or subsets.
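
Rules 1 and 2 can be written down as a predicate over a sub-cluster. This is a sketch of one reading of the slide, not the product's logic: "full" is modeled as a set of nodes holding a full copy of the partition, "the partition is full" in Rule 2 is read as "the Roster Master holds a full copy", and Rule 3 is left out because the slide does not define a Super Majority Cluster. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionView:
    roster: frozenset           # every node on the roster
    roster_master: str
    roster_replicas: frozenset
    full: frozenset             # nodes holding a full copy of the partition


def is_active(sub_cluster, p):
    """Return True if Rule 1 or Rule 2 makes the partition Active here."""
    sub = frozenset(sub_cluster) & p.roster
    holders = frozenset({p.roster_master}) | p.roster_replicas
    has_master = p.roster_master in sub

    # Rule 1: Roster Master and all Roster Replicas present, at least 1 full.
    if has_master and p.roster_replicas <= sub and (holders & p.full):
        return True

    # Rule 2, first half: majority of nodes plus one full master or replica.
    if 2 * len(sub) > len(p.roster) and (holders & sub & p.full):
        return True

    # Rule 2, second half: exactly half the roster, with a full Roster Master.
    if 2 * len(sub) == len(p.roster) and has_master and p.roster_master in p.full:
        return True

    return False


view = PartitionView(
    roster=frozenset("ABCDE"),
    roster_master="A",
    roster_replicas=frozenset("B"),
    full=frozenset("AB"),
)
print(is_active("AB", view), is_active("CDE", view), is_active("ACD", view))
```

In this example the two-node side holding both copies stays Active under Rule 1, while the three-node majority without a full copy does not.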

9

Node Add/Remove/Update without Disruption
Self-healing, auto-sharding, algorithmic cluster management
[Diagram: nodes A, B, C, Z, each holding 25% of cluster data]
High uptime
“Shared Nothing” architecture
No single points of failure
Self-healing capability
Auto rebalance upon node add/remove
Data migrates automatically, evenly
Set-and-forget DevOps
Automatic sharding of data
No re-tuning of cluster for use-case changes

Global Transactions – Sync Active-Active
Geographically distributed strongly consistent transactions at scale
Synchronous active-active replication
Strong Consistency (linearizable)
No data loss
Conflict avoidance
Auto recovery on single site failure
Low latency reads from local rack
[Diagram: a single cluster with Racks 1, 2, 3, roster membership based, with automatic sync. Rack 1 is in USA West (Nodes 1 to 3), Rack 2 in USA East (Nodes 4 to 6), Rack 3 in the United Kingdom (Nodes 7 to 9). Node 9 is marked M; Node 1 and Node 4 are marked R1. Each site serves local apps.]
Writes ~ 200 ms
Reads < 1ms

Distributed Data Hub – Async Active-Active
Multiple clusters connected via XDR
› Asynchronous active-active replication
› Dynamic fine-grained data routing
› Relaxed consistency (lag ~ 100ms)
[Diagram: data sources (Web, Social, Streaming Video, Gaming, Enterprise Applications, IoT, 3rd Party, Mobile) feed Locations A, B, and C (each SOE). The locations connect via XDR to a Real Time System of Record (Single Source of Truth), alongside Streaming AI/ML Engines, Features, Predictive Analytics, a Legacy Data Store, and a Query & Reporting Store. Tier labels: Edge (ms), Core (ms), Warehouse (sec-to-mins). Volume labels: TB’s, 100’s PB’s, PB’s.]

Optimizations for Modern System Architectures

Real-time Read Access to Data in SSD
Patented Hybrid Memory Architecture™ (HMA) places data on SSD and only indexes in DRAM
Software is written in C to talk to the hardware natively, not through an API layer
HYBRID-MEMORY ARCHITECTURE™
Direct SSD device access
Highly Parallelized
Large Block Writes to SSD
SSD vendor-optimized
Continuous, non-disruptive defrag
[Diagram: the Hybrid Memory Architecture reaches SSD and NVMe SSD devices directly through the block interface; the "other database" stack passes through the OS file system and page cache before the block interface and SSDs.]

Storage Tier Configurations
All DRAM
› Index and Data in DRAM
› Sub millisecond reads & writes
Hybrid DRAM/Flash
› Index in DRAM, Data in Flash
› Sub millisecond reads & writes
› 5-10X lower server footprint
All Flash
› Index and Data in Flash
› Sub 5-millisecond reads & writes
› Lower DRAM usage than HMA
› Suitable for lots of small objects
› Server footprint reduction similar to HMA
[Diagram: for each configuration, an index (DRAM index or flash index) holding expiry, digest & tree info, record metadata, and a storage pointer, which points into storage holding bins 1 to 3. Operations, the write queue, defrag, and reads are shown for data in flash.]

SLAs versus Scale on Storage Tiers
Instance class      Instance type    Memory     Local SSD      Data     Addressable memory space   Nodes
Memory Optimized    r6in.16xlarge    512 GiB    2 x 1900 GB    20 TB    512 GiB/node               37
Storage Optimized   im4gn.8xlarge    128 GiB    2 x 7500 GB    20 TB    15 TB/node                 6

Tier             Latency       Scale
In-Memory        99% < 1ms     Terabytes
Hybrid Memory    99% < 1ms     Petabytes
All-Flash        99% < 10ms    Petabytes

[Chart axes: Performance + Cost versus Affordable Scale]
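
The 37-node figure can be checked with simple capacity arithmetic. A minimal sketch, assuming "TB" means 10^12 bytes, "GiB" means 2^30 bytes, and the node count is driven only by raw addressable space per node. The function name is illustrative.

```python
import math

GIB = 2**30   # binary gibibyte
TB = 10**12   # decimal terabyte (assumed reading of "TB" on the slide)


def nodes_needed(data_bytes: int, per_node_bytes: int) -> int:
    """Smallest node count whose combined addressable space holds the data."""
    return math.ceil(data_bytes / per_node_bytes)


in_memory_nodes = nodes_needed(20 * TB, 512 * GIB)
flash_lower_bound = nodes_needed(20 * TB, 15 * TB)
print(in_memory_nodes, flash_lower_bound)
```

Under these assumptions the in-memory case comes out at 37 nodes, matching the slide. For the storage-optimized case, raw capacity alone gives a lower bound of 2 nodes; the slide's figure of 6 presumably reflects constraints it does not state, so the sketch does not try to reproduce it.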

C-based DB kernel
Optimizations for CPU, Memory, Network
➤ Multi-threaded data structures (NUMA pinned)
➤ Nested locking model for synchronization
➤ Lockless data structures
➤ Partitioned single threaded data structures
➤ Index entries are aligned to cache line (64 bytes)
➤ Custom memory management (arenas)
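
To see why cache-line alignment of index entries matters, the sketch below counts how many 64-byte cache lines one entry spans at a given offset, and estimates index memory for a record count. It assumes each index entry occupies exactly one 64-byte line; the helper names are illustrative.

```python
CACHE_LINE = 64  # bytes, as stated for index entries above


def cache_lines_touched(offset: int, size: int = CACHE_LINE) -> int:
    """Number of cache lines spanned by `size` bytes starting at `offset`."""
    first = offset // CACHE_LINE
    last = (offset + size - 1) // CACHE_LINE
    return last - first + 1


def index_bytes(records: int, entry_bytes: int = CACHE_LINE) -> int:
    """Index memory if every record has one entry of `entry_bytes` bytes."""
    return records * entry_bytes


print(cache_lines_touched(128), cache_lines_touched(100))
print(index_bytes(10**9) / 2**30)  # GiB for one billion records
```

An aligned entry is read with a single cache-line fetch, while the same entry at an unaligned offset straddles two lines.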
[Diagram: Memory Arena Assignment and Multi-core Architecture. NIC queues are tied through NIC IRQ binding to the cores of each CPU socket.]


Data distribution
Intelligent Data Partitioning Eliminates Hotspots
Data distribution is deterministic, uniform and algorithmic
Even amount of data on every node and on every flash device
Load balanced continually and automatically on all servers, even while scaling up/down or with cluster reconfigurations
No retuning for new use cases (same scheme/algos)
Partition table for a five-node cluster (nodes A B C D E):
Partition Id   Leader   Replica 1   Replica 2   Replica 3   Replica 4
0              B        D           E           A           C
1              E        C           A           D           B
2              C        B           E           A           D
…              …        …           …           …           …
4095           A        E           B           D           C
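
A table of this shape can be produced deterministically from the node list alone. The sketch below uses rendezvous-style hashing (sort the nodes by a hash of partition ID and node name) as a stand-in. This is an assumed technique chosen for illustration, not the algorithm the slides describe, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # partition IDs 0..4095, as in the table above


def succession(partition_id, nodes):
    """Order the nodes for one partition: leader first, then replicas."""
    def score(node):
        return hashlib.sha256(f"{partition_id}:{node}".encode()).digest()
    return sorted(nodes, key=score)


def partition_table(nodes):
    """Map every partition ID to its ordered node list."""
    return {pid: succession(pid, nodes) for pid in range(N_PARTITIONS)}


table = partition_table(["A", "B", "C", "D", "E"])
leaders = Counter(row[0] for row in table.values())
print(table[0], dict(leaders))
```

Every participant that knows the node list computes the same table, and leadership spreads roughly evenly across nodes. With this stand-in, removing a node leaves the relative order of the remaining nodes unchanged for every partition.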

Remove bottlenecks: Same low latency from 1st GB to the 1st PB…
Smart Client™
Direct Path to Data (single-hop)
Each node knows where all data resides via Smart Client™
Client is a 1st-class participant in the architecture and data fabric
Continuously updates
Calculates Partition ID to determine Node ID
Cluster-spanning operations (scan, query, batch) are sent to all processing nodes for parallel processing
Executes operation APIs (e.g. CRUD+)
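
Single-hop routing can be sketched as a client that keeps a partition map and resolves each key locally. The hash function, the way the digest is reduced to a partition ID, and the class name below are assumptions for illustration; the slides only state that the client calculates a Partition ID to determine the Node ID.

```python
import hashlib

N_PARTITIONS = 4096  # assumed from the 0..4095 partition IDs shown earlier


def partition_id(key: str) -> int:
    """Map a key to a partition ID (hash choice is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


class SmartClientSketch:
    """Hypothetical client that routes each key straight to its node."""

    def __init__(self, partition_map):
        self.partition_map = dict(partition_map)  # partition ID -> node

    def update_map(self, partition_map):
        """Swap in a fresh map, e.g. after the cluster changes."""
        self.partition_map = dict(partition_map)

    def node_for(self, key: str) -> str:
        """Resolve the owning node locally, so the request takes one hop."""
        return self.partition_map[partition_id(key)]


nodes = ["A", "B", "C", "D", "E"]
client = SmartClientSketch({pid: nodes[pid % 5] for pid in range(N_PARTITIONS)})
print(partition_id("user:42"), client.node_for("user:42"))
```

Because the lookup happens inside the client, no proxy or coordinator node sits between the application and the node that owns the record.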

Secondary Indexes – Parallel Query Execution
[Diagram: secondary index entries (b1:r1, b2:r1, … b1:r2, b2:r4, … b5:r3, b2:r9, …) are kept per partition (P1, P2, … Px) in DRAM and point through the primary index to records stored on SSD]
Query
• Value-based lookup
• Via secondary index
• Similar to SQL “select”
Parallel execution
• Per partition
• Scatter-gather scheme
• Multiple threads across nodes
Parallel access is efficient for “low selectivity” indices
Supports equality matches and range queries: Integer, double, string, blob
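
The per-partition scatter-gather scheme can be sketched with a thread pool. This is an illustration only: each partition's secondary index is modeled as a list of (value, record id) pairs, the query is an inclusive integer range, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def query_partition(index, lo, hi):
    """Scan one partition's secondary index for values in [lo, hi]."""
    return [record_id for value, record_id in index if lo <= value <= hi]


def scatter_gather(partition_indexes, lo, hi, max_workers=4):
    """Run the range query on every partition in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(
            lambda index: query_partition(index, lo, hi), partition_indexes
        )
        return [record_id for part in parts for record_id in part]


partitions = [
    [(10, "r1"), (25, "r2")],   # P1
    [(18, "r3"), (40, "r4")],   # P2
    [(22, "r5")],               # Px
]
print(scatter_gather(partitions, 15, 30))
```

Each partition is searched independently, so adding threads or nodes spreads the scan, and the client only has to concatenate the partial results.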

Massively Parallel Architecture
Data distribution is deterministic, uniform and algorithmic
Data distribution
Even amount of data on every node and on every flash device
Load balanced continually and automatically on all servers, even while scaling up/down or with cluster reconfigurations
No retuning for new use cases (same scheme/algos)
No hot spots with intelligent auto-sharding
[Diagram: a client connected to nodes A, B, and C. Each node holds 33% of cluster data, split as 11% on each of SSD 1, SSD 2, and SSD 3.]

Summary
Optimizations for Modern System Architectures
• CPU and NUMA pinning
• Storage tiers (DRAM, NVMe)
• Hybrid Memory Architecture
• Network to application alignment
Massive Parallelism with Indexing
• Multi-threaded NUMA architecture
• Data distribution across disks, nodes
• Client accesses server in single hop
• AI/ML Processing
Strongly Consistent Transactions
• Zero Data Loss
• Linearizable reads (tunable)
• Read one, write all scheme
• Roster concept maximizes availability
Geo-Distributed Active-Active System
• Uniform data partitioning
• Mixed workload handling
• Self-managed rack aware clusters
• Synch and Asynch Replication

Thank You
