Optimizing Tiered Storage for Low-Latency
Real-Time Analytics at AI Scale
Songqiao Su
Software Engineer, StarTree
All data is not equal

Historical data (internal dashboards, reporting, ad-hoc):
  Latency: sub-second
  Concurrency: 100s of users
  Cost sensitive

Real-time data (real-time, user-facing analytics):
  Latency: milliseconds
  Concurrency: millions of users
  Latency sensitive

Example use cases: missed orders, inaccurate orders, downtime, top selling items, menu item feedback.
Tightly coupled storage & compute

Each server (Server 1, Server 2) owns its local Disk/SSD.

  Access speed:        micro- to milliseconds
  Access method:       POSIX APIs
  Access availability: single instance
  Cost:                $
As data volume increases

You pay for compute that can sit unutilized, and Disk/SSD storage is expensive compared to Cloud Object Storage. Adding storage means adding servers (Server 1 through Server 5), each with its own Disk/SSD:

  Num Compute Units      1      5      10     100
  Storage (TB)           2      10     20     200
  Monthly Compute Cost   $200   $1000  $2000  $20000
  Monthly Storage Cost   $200   $1000  $2000  $20000
  Total Monthly Cost     $400   $2000  $4000  $40000
Decoupled storage & compute

                       Disk / SSD               Cloud Object Storage
  Access speed         micro- to milliseconds   100s of milliseconds
  Access method        POSIX APIs               network call
  Access availability  single instance          shared across instances
  Cost                 $                        ~1/5 $
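Applied to the 100-compute-unit row above: at roughly one fifth the storage cost, the $20000 monthly storage bill drops to about $4000, and compute no longer has to scale with data volume just to hold the data.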
Tiered Storage for Apache Pinot in StarTree Cloud

Behind the brokers, servers can run in three modes:
  ● Fully tightly-coupled: Server 1 and Server 2 serve everything from local Disk/SSD
  ● Fully decoupled: Server 3 and Server 4 serve everything from Cloud Object Storage
  ● Hybrid: a mix of both
Tiered Storage for Apache Pinot in StarTree Cloud

In the hybrid setup, servers keep recent data (e.g. <= 30 days) on local Disk/SSD and serve historical data (e.g. > 30 days) from Cloud Object Storage. The tier is declared in config:

  tierConfigs: [{
    tierS3: {
      age: 30d,
      tierBackend: s3,
      tierBackendProperties: {
        region: us-west-2,
        bucket: foo.bucket
      }
    }
  }]
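Read this as: segments older than 30 days are served from the foo.bucket S3 bucket in us-west-2, while newer segments stay on the servers' local disks. (The keys above follow the slide and may differ from the shipped config schema.)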
Two key questions

  ● What data to read?
  ● When to read?
Why we don't do lazy loading

Lazy loading would answer the two questions as: read the entire Pinot segment, and read it during query execution. The first query is slow while the segment downloads from Cloud Object Storage; the second query is fast from local disk. For OLAP this is a strict no-go:

  ● OLAP workloads are non-predictable
  ● Instance storage is limited
  ● Wasteful data is fetched
Pinot segment format

Columns: browser, region, country, impressions, cost, timestamp.

All per-column buffers are packed into a single file, columns.psf. Each column can carry a forward index (fwd_idx), an inverted index (inv_idx), and a dictionary (dict), laid out back to back:

  browser.fwd_idx | browser.inv_idx | browser.dict |
  region.inv_idx | region.fwd_idx | region.dict |
  country... | impressions.fwd_idx | impressions.dict | cost... | timestamp...

An index_map records where each buffer starts inside columns.psf:

  browser.fwd_idx.offset=...
  browser.inv_idx.offset=...
  region.fwd_idx.offset=...
What to read - Selective columnar fetch

  select sum(impressions) where region='rivendell'

For this query, Server 1 consults the index_map and fetches only region.dict, region.inv_idx, impressions.fwd_idx, and impressions.dict from Cloud Object Storage, each as a Range GET against the right offsets in columns.psf. Buffers the query never touches (browser, country, cost, timestamp, ...) are never downloaded.
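A minimal sketch of a selective fetch with the AWS SDK for Java v2. The bucket name, segment path, and the hard-coded index_map offsets are hypothetical stand-ins, not Pinot's internal API:

  import software.amazon.awssdk.core.ResponseBytes;
  import software.amazon.awssdk.services.s3.S3Client;
  import software.amazon.awssdk.services.s3.model.GetObjectRequest;
  import software.amazon.awssdk.services.s3.model.GetObjectResponse;
  import java.util.Map;

  public class SelectiveColumnarFetch {
      // Hypothetical index_map: buffer name -> [startOffset, endOffset) in columns.psf
      static final Map<String, long[]> INDEX_MAP = Map.of(
              "region.inv_idx", new long[]{4_096, 8_192},
              "impressions.fwd_idx", new long[]{65_536, 131_072});

      public static void main(String[] args) {
          try (S3Client s3 = S3Client.create()) {
              for (String buffer : INDEX_MAP.keySet()) {
                  long[] range = INDEX_MAP.get(buffer);
                  // HTTP Range GET: download only this buffer's byte range,
                  // never the whole segment file.
                  GetObjectRequest req = GetObjectRequest.builder()
                          .bucket("foo.bucket")                      // from tierBackendProperties
                          .key("segments/mytable/seg_0/columns.psf") // hypothetical segment path
                          .range("bytes=" + range[0] + "-" + (range[1] - 1))
                          .build();
                  ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(req);
                  System.out.println(buffer + ": fetched " + bytes.asByteArray().length + " bytes");
              }
          }
      }
  }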
When to read - Fetch during segment execution?

The planning phase makes the segment execution plan; the Pinot server then processes segments in parallel, running several segment executions at once.
Fetch during segment execution

40 segments at parallelism 8 means 5 batches, and each S3 access takes ~200ms. When the fetch happens inside segment execution, every batch stalls on its own fetch:

  Fetch for batch 1: 200ms
  Fetch for batch 2: 200ms
  Fetch for batch 3: 200ms
  Fetch for batch 4: 200ms
  Fetch for batch 5: 200ms

  Total time = executionTimeMs + 5 x 200ms = executionTimeMs + 1000ms

The CPU sits idle while fetching. Can we decouple fetch and execution?
Prefetch

The planning phase now makes the segment execution plan and also prefetches all segment columns. Each segment execution on the Pinot server acquires its columns (waiting only if they are not yet available), runs, and releases them when done.
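A minimal sketch of that acquire/release pipeline, assuming a hypothetical fetchFromObjectStore call; Pinot's actual implementation differs:

  import java.util.List;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class PrefetchPipeline {
      static byte[] fetchFromObjectStore(String segment) {
          // Hypothetical: one ~200ms Range GET for the segment's query columns.
          return new byte[0];
      }

      static void execute(String segment, byte[] columns) {
          // CPU-bound segment execution over the fetched column buffers.
      }

      public static void main(String[] args) {
          List<String> segments = List.of("seg_0", "seg_1", "seg_2" /* ... 40 total */);
          ExecutorService fetchPool = Executors.newFixedThreadPool(8);
          ExecutorService execPool = Executors.newFixedThreadPool(8);

          // Planning phase: kick off prefetch for ALL segments at once.
          List<CompletableFuture<Void>> tasks = segments.stream()
                  .map(seg -> CompletableFuture
                          .supplyAsync(() -> fetchFromObjectStore(seg), fetchPool) // prefetch
                          .thenAcceptAsync(cols -> execute(seg, cols), execPool))  // acquire + execute
                  .toList();

          // Execution overlaps with in-flight fetches instead of stalling per batch.
          CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
          fetchPool.shutdown();
          execPool.shutdown();
      }
  }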
Pipelining fetch and execution

# of segments: 40; parallelism: 8; each S3 access: ~200ms

Before prefetch: each of the 5 batches waits on its own 200ms fetch
  Total time = executionTimeMs + 1000ms

After prefetch: fetches for ALL batches are issued during planning and overlap with one another
  Total time = executionTimeMs + 200ms
Benchmark vs. Presto

Tested with about 300 segments (200GB in total) on one r5.2xlarge Pinot server.

  SELECT COUNT(*)
  FROM GithubEventsTier
  WHERE DAY >= 20200701 AND DAY <= 20200714
    Presto (decoupled): 5340ms   Pinot tiered storage: 63ms

  SELECT MAX(pull_request_additions)
  FROM GithubEventsTier
  WHERE DAY >= 20201101 AND DAY <= 20201114 AND type = 'PullRequestEvent'
    Presto (decoupled): 1580ms   Pinot tiered storage: 350ms

  SELECT MAX(pull_request_commits), COUNT(*), repo_id
  FROM GithubEventsTier
  WHERE type = 'PullRequestEvent' AND DAY = 20201114
  GROUP BY repo_id ORDER BY COUNT(*) DESC LIMIT 1000
    Presto (decoupled): 1400ms   Pinot tiered storage: 278ms

  SELECT SUM(pull_request_commits)
  FROM GithubEventsTier
  WHERE JSON_MATCH(actor_json, '"actor_id"=''39814207''') AND DAY = 20200701
    Presto (decoupled): 8560ms   Pinot tiered storage: 397ms

Q: What to read?  A: Selective columnar fetch
Q: When to read?  A: Prefetch during planning
What makes Pinot fast?

A query arrives at the Pinot Broker and fans out to Server 1, Server 2, and Server 3, which hold segments 1-12 between them. Each stage shrinks the total segments to process:

  ● Broker-level pruning (e.g. segments 1-12 down to segments 5-12)
  ● Server-level pruning
  ● Filter optimizations
  ● Aggregation optimizations
Pin any column index locally

Servers 1-4 serve Pinot segments 1-7 from Cloud Object Storage, but selected indexes can be pinned on local Disks/SSDs:

  preload.index.keys:
    account_id.bloom_filter,
    transaction_id.bloom_filter

Each server preloads the corresponding segments' bloom_filters onto its Disk/SSD; everything else stays in Cloud Object Storage.
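Bloom filters are small relative to forward indexes, so pinning them locally keeps the segment-pruning path free of object-store round trips even while the bulk of the segment lives in Cloud Object Storage.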
Read even less - Read blocks & on-demand

  select sum(impressions) where region='rivendell'

Server 1 prefetches region.inv_idx from the columnar segment format and evaluates the filter:

  region.inv_idx:
    gondor    -> 3, 4, 5, 14, 25
    rivendell -> 0, 1, 2, 20, 21
    shire     -> 6, 7, 8, 9, 10, 11, 12
    ...

The matching rows (0, 1, 2, 20, 21) land in only 2 blocks of impressions.fwd_idx (which spans rows 0-25), so just those blocks are read on-demand via Range GET instead of the whole column.
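A minimal sketch of the row-id-to-block math, assuming fixed-size blocks and a hypothetical rangeGet helper:

  import java.util.List;
  import java.util.TreeSet;

  public class BlockOnDemandRead {
      static final int ROWS_PER_BLOCK = 13;           // assumption: rows 0-25 span 2 blocks
      static final long BLOCK_SIZE_BYTES = 64 * 1024; // 64KB block size (see lessons learnt)

      public static void main(String[] args) {
          // Row IDs matched by region.inv_idx for 'rivendell'.
          List<Integer> matchedRows = List.of(0, 1, 2, 20, 21);

          // Deduplicate to the set of forward-index blocks that hold those rows.
          TreeSet<Integer> blocks = new TreeSet<>();
          for (int row : matchedRows) {
              blocks.add(row / ROWS_PER_BLOCK);
          }
          System.out.println("Blocks to read: " + blocks); // [0, 1] -> only 2 Range GETs

          for (int block : blocks) {
              long start = block * BLOCK_SIZE_BYTES;
              long end = start + BLOCK_SIZE_BYTES - 1;
              // rangeGet("impressions.fwd_idx", start, end); // hypothetical Range GET helper
              System.out.printf("GET bytes=%d-%d of impressions.fwd_idx%n", start, end);
          }
      }
  }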
Parallel block prefetch in the post-filter phase

The matching row IDs are known as soon as the in-filter evaluation over region.inv_idx finishes (rivendell -> 0, 1, 2, 20, 21). Rather than reading impressions.fwd_idx blocks one at a time as the aggregation scans, the server identifies the needed chunks right after in-filter evaluation and prefetches them in parallel.
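Since each Range GET costs roughly the same ~200ms regardless of order, issuing the post-filter block reads concurrently (the same pattern as the planning-phase prefetch sketch above) bounds the added latency to about one round trip rather than one per block.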
Block Cache Prefetch vs. Whole Column Fetch

  SELECT payload_pull_request // <- large raw forward index
  FROM github_events_bc_big
  WHERE repo_name = '...'     // <- high selectivity filter
    Whole column read, tiered storage: 1593ms
    Block cache, tiered storage:        516ms
    Pinot tightly coupled:              150ms

  SELECT id                   // <- dictionary encoded column
  FROM github_events_bc_big
  WHERE repo_name = '...'     // <- high selectivity filter
    Whole column read, tiered storage:  158ms
    Block cache, tiered storage:         39ms
    Pinot tightly coupled:                7ms
Block Cache Prefetch Using Different Clients

  SELECT payload_pull_request  // <- large raw forward index
  FROM github_events_bc_small  // single segment small table
  WHERE repo_name = '...'      // <- high selectivity filter

    Using NettyNioAsyncHttpClient: 333ms
    Using S3CrtAsyncClient:        272ms

Optimize the block size
Lessons learnt

● High selectivity lets queries fetch less data, but the workload shifts from network-throughput-bound to I/O-bound (many small reads).
● The default S3 client is not tuned for an I/O-bound workload; config knobs such as max concurrency and target throughput need to be tuned properly to support higher IOPS.
● For high-selectivity queries, a block cache size in the 64KB to 128KB range is the sweet spot for both throughput and latency.
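As a hedged illustration, the AWS SDK for Java v2 CRT client exposes both knobs named above; the values here are assumptions to tune against your own workload, not StarTree's settings:

  import software.amazon.awssdk.regions.Region;
  import software.amazon.awssdk.services.s3.S3AsyncClient;

  public class TunedS3Client {
      public static void main(String[] args) {
          // CRT-based async client, the faster of the two clients benchmarked above.
          S3AsyncClient s3 = S3AsyncClient.crtBuilder()
                  .region(Region.US_WEST_2)
                  .maxConcurrency(256)          // assumption: raise for IOPS-bound block reads
                  .targetThroughputInGbps(10.0) // assumption: match instance network bandwidth
                  .build();
          // ... issue many concurrent 64-128KB Range GETs against this client ...
          s3.close();
      }
  }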
Journey To Iceberg Integration

Journey to open table format: the Pinot side (Controller, Broker, Servers, Zookeeper) connects to an Iceberg table, made up of a Catalog, metadata files, manifest files/list, and Parquet data files. To serve queries, Pinot:

  1. Watches the Iceberg table
  2. Transforms it into Pinot metadata + indexes
  3. Locates and queries the data files
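A minimal sketch of steps 1 and 3 with the Iceberg Java API, assuming a Hadoop catalog at a hypothetical warehouse path; the Pinot-side transform in step 2 is out of scope here:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.TableIdentifier;
  import org.apache.iceberg.hadoop.HadoopCatalog;
  import org.apache.iceberg.io.CloseableIterable;

  public class WatchIcebergTable {
      public static void main(String[] args) throws Exception {
          // Step 1: load the table through the catalog (hypothetical warehouse path).
          HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3://foo.bucket/warehouse");
          Table table = catalog.loadTable(TableIdentifier.of("db", "github_events"));

          // Step 3: plan the scan to locate the Parquet data files to query.
          try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
              for (FileScanTask task : tasks) {
                  System.out.println(task.file().path()); // Parquet data file location
              }
          }
          catalog.close();
      }
  }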
Apache Pinot's Tiered Storage Journey

● Why did we need tiered storage?
● How we fit tiered storage into the Pinot architecture
● How we implemented it
● Journey to open table format

No lazy loading
What to read:
  ● Selective columnar fetch
  ● Block reads
When to read:
  ● Pipelined fetch & execution
  ● Prefetch
  ● Pinning
Thank You!
docs.startree.ai
