Isolating Streaming Ingest and Queries Using RocksDB

[Diagram: streaming records + deltas feed a system marked "?" that serves apps: search queries, vector search, real-time analytics, SQL.]

1. Sharding → scalability
[Diagram: streaming records + deltas; apps issuing search queries, vector search, and real-time analytics.]

Sharding choices and tradeoffs
[Decision-tree figure; each question branches yes/no.]
Questions:
● Value-dependent mapping?
● Data + indexes together?
● Can documents change?
Outcomes:
● ✅ Opportunities for larger read I/Os; ❌ Coordination overheads balloon as ingest latency drops
● ❌ Unable to support most search and analytics apps
● Doc sharding: ❗ Small read I/Os; ✅ Efficient streaming ingest; ✅ Consistent indexes

1. Doc sharding → scalability + streaming ingest
[Diagram: streaming records + deltas; apps issuing search queries, vector search, and real-time analytics.]
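A minimal sketch of doc sharding, assuming each document is assigned to a shard by hashing its ID. The talk does not specify the hash or ID scheme; `shard_for`, `route`, `NUM_SHARDS`, and the `_id` field are hypothetical names. Because the mapping ignores field values, an updated document stays on the same shard, where its data and index entries can be kept together.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document ID to a shard using a stable hash of the ID only."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(docs):
    """Group documents by shard; each shard would index only its own documents."""
    shards = {}
    for doc in docs:
        shards.setdefault(shard_for(doc["_id"]), []).append(doc)
    return shards


if __name__ == "__main__":
    docs = [{"_id": f"user-{i}", "age": 20 + i} for i in range(5)]
    for shard, members in sorted(route(docs).items()):
        print(shard, [d["_id"] for d in members])
```

Two versions of the same document always land on the same shard, so no cross-shard coordination is needed when a field value changes.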

2. Isolation between ingest and query work

2. Post-ingest replication → isolation + elasticity
[Diagram: Leader and Follower instances; App A and App B.]

3. Disaggregated storage → efficiency + elasticity
[Diagram: App A and App B.]

What technology for disaggregated storage?
Cold (AWS S3)
● 👍 Cheapest $/GB
● 👍 Highly durable
● 👍 Built-in RPC API
● 👎 High/unpredictable latency
● 👎 Expensive $/IOPS
Hot (EBS or NVMe)
● 👍 Cheapest $/IOPS
● 👍 Low latency
● 👍+👎 Build/run your own RPC service
● 👎 More expensive $/GB

Cloud-native search + analytics
1. Doc sharding with indexes – Converged indexing
→ Scalability
→ Streaming ingest
2. Post-ingest replication – Compute:compute separation
→ Isolation
→ Compute elasticity
3. Disaggregated hot storage – Compute:storage separation
→ High disk utilization
→ Compute elasticity
→ Storage elasticity

RocksDB replication at Rockset

1 Rockset shard ≈ 1 RocksDB instance
Rockset stores data for each shard in RocksDB. Writes are flushed to storage after the memtable is full.
[Diagram labels: Data stream, Ingest, Memtable, App, Query, SSD, Shared hot storage.]

Document:RocksDB-key mapping is M:N

Ingest turns logical update into physical deltas
Ingesting one document applies a delta to many RocksDB keys. Deltas are merged lazily by RocksDB.
[Diagram labels: Data stream, Ingest, Memtable, App, Query, SSD, Shared hot storage.]
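A rough sketch of how one logical document write might fan out into deltas on many keys. The key layout (`R/` row entries, `C/` column entries, `S/` inverted-index postings) is an assumption made for illustration, not Rockset's actual encoding, and `LazyStore` is a dictionary stand-in for RocksDB that merges deltas at read time, in the spirit of RocksDB merge operators. Removing stale postings when a value changes is left out.

```python
from collections import defaultdict


def deltas_for(doc_id, doc):
    """Yield (key, delta) pairs for one document. The key layout is hypothetical."""
    for field, value in doc.items():
        yield (f"R/{doc_id}/{field}", ("put", value))    # row-store entry
        yield (f"C/{field}/{doc_id}", ("put", value))    # column-store entry
        yield (f"S/{field}/{value!r}", ("add", doc_id))  # inverted-index posting


class LazyStore:
    """Stand-in for RocksDB: appends deltas on write, merges them on read."""

    def __init__(self):
        self.pending = defaultdict(list)

    def apply(self, key, delta):
        # Cheap append; no read-modify-write on the ingest path.
        self.pending[key].append(delta)

    def get(self, key):
        value, postings = None, set()
        for op, arg in self.pending.get(key, []):
            if op == "put":
                value = arg          # last put wins
            elif op == "add":
                postings.add(arg)    # postings accumulate
        return postings if postings else value


if __name__ == "__main__":
    store = LazyStore()
    for key, delta in deltas_for("doc1", {"city": "Oslo", "age": 30}):
        store.apply(key, delta)
    print(sorted(store.pending))
```

In this toy layout a two-field document touches six keys, which is one way to read the M:N document-to-key mapping.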

Fine-grained RocksDB replication
Leader/follower replication makes fresh data available in all RocksDB instances
● Replication stream sends data and metadata changes
● Applying memtable updates takes 6× to 10× less CPU than ingest
● Followers don't run compaction
[Diagram labels: Data stream, Ingest, Query (one marked Optional), Memtable ×2, App ×2, SSD, Shared hot storage.]
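A toy, in-process sketch of the leader/follower split described above. The leader does the CPU-heavy work (parsing and indexing a raw record) once and ships the resulting key/value updates; the follower only applies them to its own memtable. The class names, the `R/` key layout, and the `_id` field are hypothetical, and a real replication stream would also carry metadata changes and cross a network.

```python
import json


class Follower:
    def __init__(self):
        self.memtable = {}

    def apply(self, updates):
        """Apply pre-computed updates; no parsing or indexing is repeated here."""
        self.memtable.update(updates)

    def query(self, key):
        return self.memtable.get(key)


class Leader:
    def __init__(self, followers=()):
        self.memtable = {}
        self.followers = list(followers)

    def ingest(self, raw_record: str):
        """Parse and index a record, then replicate the physical updates."""
        doc = json.loads(raw_record)
        updates = [(f"R/{doc['_id']}/{k}", v) for k, v in doc.items() if k != "_id"]
        self.memtable.update(updates)
        for follower in self.followers:
            follower.apply(updates)  # ship key/value updates, not the raw record


if __name__ == "__main__":
    follower = Follower()
    leader = Leader([follower])
    leader.ingest('{"_id": "d1", "city": "Oslo"}')
    print(follower.query("R/d1/city"))
```

Query traffic sent to the follower never competes with parsing or indexing work, which is the isolation the slide is after.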

Compute:compute separation for vector search
IVF ANN (inverted file approximate nearest neighbor)
● Leader periodically builds Voronoi cell decomposition of vector space
● Cell membership can be updated in real time
● Follower queries are executed the same as an inverted index lookup
[Diagram labels: Data stream, App, Ingest, IVF partitioning, Query, Memtable ×2, IVF cells, SSD, Shared hot storage.]
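A small sketch of the IVF idea, assuming the centroids that define the Voronoi cells are already given (the periodic centroid build, for example by k-means, is omitted). `IVFIndex`, `upsert`, `search`, and `nprobe` are illustrative names, not Rockset's API. Each cell behaves like a posting list keyed by cell ID, so a query reads a few lists instead of scanning every vector.

```python
import math


def nearest(centroids, v):
    """Index of the centroid closest to v (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], v))


class IVFIndex:
    def __init__(self, centroids):
        self.centroids = centroids
        # cell id -> {doc_id: vector}; plays the role of an inverted-index posting list
        self.cells = {i: {} for i in range(len(centroids))}

    def upsert(self, doc_id, vector):
        """Update cell membership: drop any old entry, add to the nearest cell."""
        for members in self.cells.values():
            members.pop(doc_id, None)
        self.cells[nearest(self.centroids, vector)][doc_id] = vector

    def search(self, query, k=1, nprobe=1):
        """Scan only the nprobe closest cells and return the k nearest doc IDs."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: math.dist(self.centroids[i], query))
        candidates = [(math.dist(vec, query), doc_id)
                      for cell in order[:nprobe]
                      for doc_id, vec in self.cells[cell].items()]
        return [doc_id for _, doc_id in sorted(candidates)[:k]]


if __name__ == "__main__":
    index = IVFIndex([(0.0, 0.0), (10.0, 10.0)])
    index.upsert("a", (1.0, 1.0))
    index.upsert("b", (9.0, 9.0))
    print(index.search((0.0, 0.0)))
```

Raising `nprobe` trades more reads for better recall, since a true nearest neighbor may sit just across a cell boundary.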

RocksDB is a log-structured merge tree (LSM)
1. Writes are buffered in RAM in a “memtable”
2. Many megabytes of values are written to a new file at once
3. Files are immutable, and store keys in sorted order
4. “Compaction” creates new files by merging several old files, making things more sorted and removing duplicates
[Diagram: Memory → big async writes → Disk.]
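The four steps above can be sketched as a toy LSM in a few lines. This is a teaching model under simplifying assumptions, not RocksDB's implementation: files are in-memory tuples, and there are no levels, write-ahead log, or deletion markers. `TinyLSM` and `memtable_limit` are made-up names.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}                 # step 1: writes are buffered in RAM
        self.memtable_limit = memtable_limit
        self.files = []                    # oldest first; each file is an immutable sorted tuple

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Steps 2 and 3: write the whole memtable as one new sorted, immutable file."""
        if self.memtable:
            self.files.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for f in reversed(self.files):     # the newest file wins
            i = bisect.bisect_left(f, (key,))
            if i < len(f) and f[i][0] == key:
                return f[i][1]
        return None

    def compact(self):
        """Step 4: merge all files into one, keeping only the newest value per key."""
        merged = {}
        for f in self.files:               # oldest first, so newer values overwrite
            merged.update(f)
        self.files = [tuple(sorted(merged.items()))] if merged else []


if __name__ == "__main__":
    db = TinyLSM(memtable_limit=2)
    for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
        db.put(k, v)
    db.compact()
    print(db.files)
```

Reads may have to check several files before compaction runs, which is why compaction matters for read cost as well as for space.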

Rockset uses indexes to accelerate queries
The cost-based optimizer chooses a plan for each query fragment:
● Column scan → Filter
○ Large reads
○ Bandwidth limited
● Index lookup → Fetch other cols in matching rows
○ Small reads
○ Latency limited
○ IOPS limited

Big writes → Write to S3
Small reads → Read from SSD

Write to S3 + read from SSD
● (Big) writes go to AWS S3
○ High BW
○ Ensures durability
● Reads go to 1-copy SSD
○ Low latency
○ High IOPS
○ Efficient small reads
○ High space utilization
[Diagram labels: Data stream, App ×2, Ingest, Query ×2, Memtable ×2, S3, SSD, Shared hot storage.]
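A minimal sketch of the write-to-S3, read-from-SSD split, with plain dictionaries standing in for S3 and the SSD tier. `HotStorageSketch` and its methods are hypothetical names, and copying each file to the SSD tier at write time is an assumption of this sketch, modeled on the "prefetch on file creation" idea the talk lists for avoiding cold misses.

```python
class HotStorageSketch:
    """Dicts stand in for S3 (durable copy) and the single-copy SSD tier."""

    def __init__(self):
        self.s3 = {}
        self.ssd = {}
        self.misses = 0

    def write_file(self, name: str, data: bytes):
        self.s3[name] = data   # one big write; S3 provides durability
        self.ssd[name] = data  # assumed: populate the SSD copy when the file is created

    def read_block(self, name: str, offset: int, length: int) -> bytes:
        if name not in self.ssd:
            # Cache miss: fall back to the durable copy and refill the SSD tier.
            self.misses += 1
            self.ssd[name] = self.s3[name]
        return self.ssd[name][offset:offset + length]

    def evict(self, name: str):
        """Drop the SSD copy; the data remains safe in S3."""
        self.ssd.pop(name, None)


if __name__ == "__main__":
    store = HotStorageSketch()
    store.write_file("000123.sst", b"0123456789")
    print(store.read_block("000123.sst", 2, 4), store.misses)
```

Because durability comes from S3, the SSD tier can hold a single copy of each file and still lose nothing if a node fails.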

Challenge: Cache miss to S3 can be 1,000× slower
● Cold misses
○ Synchronous prefetch on file creation
○ Prefetch from periodic S3 list
● Capacity misses
○ Cluster auto-scaling
● Software restart for upgrades
○ Dual-head serving during rolls
● Cluster resizing
○ Double-reading with rendezvous hashing
● Failure recovery
○ Whole-cluster recovery with rendezvous hashing
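For readers unfamiliar with rendezvous (highest-random-weight) hashing, here is a short sketch. Every node gets a score for each file, and the highest-scoring node owns the file. The `read_candidates` helper shows one possible reading of "double-reading" during a resize (try the owner under the new membership, then the owner under the old one); that interpretation, the node and file names, and the use of SHA-256 are assumptions, not details from the talk.

```python
import hashlib


def score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(key: str, nodes):
    """Rendezvous hashing: the node with the highest score owns the key."""
    return max(nodes, key=lambda n: score(n, key))


def read_candidates(key: str, old_nodes, new_nodes):
    """Hypothetical double-read order during a resize: new owner first, then old owner."""
    first, second = owner(key, new_nodes), owner(key, old_nodes)
    return [first] if first == second else [first, second]


if __name__ == "__main__":
    old = ["n1", "n2", "n3"]
    new = old + ["n4"]
    files = [f"file-{i}.sst" for i in range(10)]
    moved = [f for f in files if owner(f, old) != owner(f, new)]
    print(f"{len(moved)} of {len(files)} files move, all to n4")
```

When a node is added, a file changes owner only if the new node outscores every existing node for it, so existing nodes never trade files among themselves.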

[Sign: IT HAS BEEN 6 DAYS SINCE THE LAST CACHE MISS]
Rockset hot storage is a near-perfect S3 cache
● Cold misses
○ Synchronous prefetch on file creation
○ Prefetch from periodic S3 list
● Capacity misses
○ Cluster auto-scaling
● Software restart for upgrades
○ Dual-head serving during rolls
● Cluster resizing
○ Second-chance reads with recent old configs
● Failure recovery
○ Whole-cluster recovery with rendezvous hashing
99.9999% cache hit rate
https://rockset.com/blog/separate-compute-storage-rocksdb/

Rockset
● Real-time indexing
● Cloud-native efficiency
● Full-featured SQL
