Optimizing Tiered Storage for Low-Latency
Real-Time Analytics at AI Scale
Songqiao Su
Software Engineer, StarTree
All data is not equal

Historical data (internal dashboards, reporting, ad-hoc):
  Latency: sub-second
  Concurrency: 100s of users
  Cost sensitive

Real-time data (real-time, user-facing analytics):
  Latency: milliseconds
  Concurrency: millions of users
  Latency sensitive

Example use cases: missed orders, inaccurate orders, downtime, top selling items, menu item feedback.
Tightly coupled storage & compute

Each server (Server 1, Server 2) owns its local Disk/SSD.

  Access speed:        micro- to milliseconds
  Access method:       POSIX APIs
  Access availability: single instance
  Cost:                $
As data volume increases

You pay for compute that can sit unutilized, and Disk/SSD storage is expensive compared to Cloud Object Storage. Adding storage means adding servers (Server 1 through Server 5), each with its own Disk/SSD:

  Num Compute Units      1      5      10     100
  Storage (TB)           2      10     20     200
  Monthly Compute Cost   $200   $1000  $2000  $20000
  Monthly Storage Cost   $200   $1000  $2000  $20000
  Total Monthly Cost     $400   $2000  $4000  $40000
Decoupled storage & compute

                       Disk / SSD               Cloud Object Storage
  Access speed         micro- to milliseconds   100s of milliseconds
  Access method        POSIX APIs               network call
  Access availability  single instance          shared across instances
  Cost                 $                        ~1/5 $
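Applied to the 100-compute-unit row above: at roughly one fifth the storage cost, the $20000 monthly storage bill drops to about $4000, and compute no longer has to scale with data volume just to hold the data.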
Tiered Storage for Apache Pinot in StarTree Cloud

Behind the brokers, servers can run in three modes:
  ● Fully tightly-coupled: Server 1 and Server 2 serve everything from local Disk/SSD
  ● Fully decoupled: Server 3 and Server 4 serve everything from Cloud Object Storage
  ● Hybrid: a mix of both
Tiered Storage for Apache Pinot in StarTree Cloud

In the hybrid setup, servers keep recent data (e.g. <= 30 days) on local Disk/SSD and serve historical data (e.g. > 30 days) from Cloud Object Storage. The tier is declared in config:

  tierConfigs: [{
    tierS3: {
      age: 30d,
      tierBackend: s3,
      tierBackendProperties: {
        region: us-west-2,
        bucket: foo.bucket
      }
    }
  }]
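Read this as: segments older than 30 days are served from the foo.bucket S3 bucket in us-west-2, while newer segments stay on the servers' local disks. (The keys above follow the slide and may differ from the shipped config schema.)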
Two key questions

  ● What data to read?
  ● When to read?
Why we don't do lazy loading

Lazy loading would answer the two questions as: read the entire Pinot segment, and read it during query execution. The first query is slow while the segment downloads from Cloud Object Storage; the second query is fast from local disk. For OLAP this is a strict no-go:

  ● OLAP workloads are non-predictable
  ● Instance storage is limited
  ● Wasteful data is fetched
Pinot segment format

Columns: browser, region, country, impressions, cost, timestamp.

All per-column buffers are packed into a single file, columns.psf. Each column can carry a forward index (fwd_idx), an inverted index (inv_idx), and a dictionary (dict), laid out back to back:

  browser.fwd_idx | browser.inv_idx | browser.dict |
  region.inv_idx | region.fwd_idx | region.dict |
  country... | impressions.fwd_idx | impressions.dict | cost... | timestamp...

An index_map records where each buffer starts inside columns.psf:

  browser.fwd_idx.offset=...
  browser.inv_idx.offset=...
  region.fwd_idx.offset=...
What to read - Selective columnar fetch

  select sum(impressions) where region='rivendell'

For this query, Server 1 consults the index_map and fetches only region.dict, region.inv_idx, impressions.fwd_idx, and impressions.dict from Cloud Object Storage, each as a Range GET against the right offsets in columns.psf. Buffers the query never touches (browser, country, cost, timestamp, ...) are never downloaded.
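A minimal sketch of a selective fetch with the AWS SDK for Java v2. The bucket name, segment path, and the hard-coded index_map offsets are hypothetical stand-ins, not Pinot's internal API:

  import software.amazon.awssdk.core.ResponseBytes;
  import software.amazon.awssdk.services.s3.S3Client;
  import software.amazon.awssdk.services.s3.model.GetObjectRequest;
  import software.amazon.awssdk.services.s3.model.GetObjectResponse;
  import java.util.Map;

  public class SelectiveColumnarFetch {
      // Hypothetical index_map: buffer name -> [startOffset, endOffset) in columns.psf
      static final Map<String, long[]> INDEX_MAP = Map.of(
              "region.inv_idx", new long[]{4_096, 8_192},
              "impressions.fwd_idx", new long[]{65_536, 131_072});

      public static void main(String[] args) {
          try (S3Client s3 = S3Client.create()) {
              for (String buffer : INDEX_MAP.keySet()) {
                  long[] range = INDEX_MAP.get(buffer);
                  // HTTP Range GET: download only this buffer's byte range,
                  // never the whole segment file.
                  GetObjectRequest req = GetObjectRequest.builder()
                          .bucket("foo.bucket")                      // from tierBackendProperties
                          .key("segments/mytable/seg_0/columns.psf") // hypothetical segment path
                          .range("bytes=" + range[0] + "-" + (range[1] - 1))
                          .build();
                  ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(req);
                  System.out.println(buffer + ": fetched " + bytes.asByteArray().length + " bytes");
              }
          }
      }
  }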
When to read - Fetch during segment execution?

The planning phase makes the segment execution plan; the Pinot server then processes segments in parallel, running several segment executions at once.
Fetch during segment execution

40 segments at parallelism 8 means 5 batches, and each S3 access takes ~200ms. When the fetch happens inside segment execution, every batch stalls on its own fetch:

  Fetch for batch 1: 200ms
  Fetch for batch 2: 200ms
  Fetch for batch 3: 200ms
  Fetch for batch 4: 200ms
  Fetch for batch 5: 200ms

  Total time = executionTimeMs + 5 x 200ms = executionTimeMs + 1000ms

The CPU sits idle while fetching. Can we decouple fetch and execution?
Prefetch

The planning phase now makes the segment execution plan and also prefetches all segment columns. Each segment execution on the Pinot server acquires its columns (waiting only if they are not yet available), runs, and releases them when done.
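A minimal sketch of that acquire/release pipeline, assuming a hypothetical fetchFromObjectStore call; Pinot's actual implementation differs:

  import java.util.List;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class PrefetchPipeline {
      static byte[] fetchFromObjectStore(String segment) {
          // Hypothetical: one ~200ms Range GET for the segment's query columns.
          return new byte[0];
      }

      static void execute(String segment, byte[] columns) {
          // CPU-bound segment execution over the fetched column buffers.
      }

      public static void main(String[] args) {
          List<String> segments = List.of("seg_0", "seg_1", "seg_2" /* ... 40 total */);
          ExecutorService fetchPool = Executors.newFixedThreadPool(8);
          ExecutorService execPool = Executors.newFixedThreadPool(8);

          // Planning phase: kick off prefetch for ALL segments at once.
          List<CompletableFuture<Void>> tasks = segments.stream()
                  .map(seg -> CompletableFuture
                          .supplyAsync(() -> fetchFromObjectStore(seg), fetchPool) // prefetch
                          .thenAcceptAsync(cols -> execute(seg, cols), execPool))  // acquire + execute
                  .toList();

          // Execution overlaps with in-flight fetches instead of stalling per batch.
          CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
          fetchPool.shutdown();
          execPool.shutdown();
      }
  }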
Pipelining fetch and execution

# of segments: 40; parallelism: 8; each S3 access: ~200ms

Before prefetch: each of the 5 batches waits on its own 200ms fetch
  Total time = executionTimeMs + 1000ms

After prefetch: fetches for ALL batches are issued during planning and overlap with one another
  Total time = executionTimeMs + 200ms
Benchmark vs. Presto

Tested with about 300 segments (200GB in total) on one r5.2xlarge Pinot server.

  SELECT COUNT(*)
  FROM GithubEventsTier
  WHERE DAY >= 20200701 AND DAY <= 20200714
    Presto (decoupled): 5340ms   Pinot tiered storage: 63ms

  SELECT MAX(pull_request_additions)
  FROM GithubEventsTier
  WHERE DAY >= 20201101 AND DAY <= 20201114 AND type = 'PullRequestEvent'
    Presto (decoupled): 1580ms   Pinot tiered storage: 350ms

  SELECT MAX(pull_request_commits), COUNT(*), repo_id
  FROM GithubEventsTier
  WHERE type = 'PullRequestEvent' AND DAY = 20201114
  GROUP BY repo_id ORDER BY COUNT(*) DESC LIMIT 1000
    Presto (decoupled): 1400ms   Pinot tiered storage: 278ms

  SELECT SUM(pull_request_commits)
  FROM GithubEventsTier
  WHERE JSON_MATCH(actor_json, '"actor_id"=''39814207''') AND DAY = 20200701
    Presto (decoupled): 8560ms   Pinot tiered storage: 397ms

Q: What to read?  A: Selective columnar fetch
Q: When to read?  A: Prefetch during planning
What makes Pinot fast?

A query arrives at the Pinot Broker and fans out to Server 1, Server 2, and Server 3, which hold segments 1-12 between them. Each stage shrinks the total segments to process:

  ● Broker-level pruning (e.g. segments 1-12 down to segments 5-12)
  ● Server-level pruning
  ● Filter optimizations
  ● Aggregation optimizations
Pin any column index locally

Servers 1-4 serve Pinot segments 1-7 from Cloud Object Storage, but selected indexes can be pinned on local Disks/SSDs:

  preload.index.keys:
    account_id.bloom_filter,
    transaction_id.bloom_filter

Each server preloads the corresponding segments' bloom_filters onto its Disk/SSD; everything else stays in Cloud Object Storage.
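Bloom filters are small relative to forward indexes, so pinning them locally keeps the segment-pruning path free of object-store round trips even while the bulk of the segment lives in Cloud Object Storage.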
Read even less - Read blocks & on-demand

  select sum(impressions) where region='rivendell'

Server 1 prefetches region.inv_idx from the columnar segment format and evaluates the filter:

  region.inv_idx:
    gondor    -> 3, 4, 5, 14, 25
    rivendell -> 0, 1, 2, 20, 21
    shire     -> 6, 7, 8, 9, 10, 11, 12
    ...

The matching rows (0, 1, 2, 20, 21) land in only 2 blocks of impressions.fwd_idx (which spans rows 0-25), so just those blocks are read on-demand via Range GET instead of the whole column.
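A minimal sketch of the row-id-to-block math, assuming fixed-size blocks and a hypothetical rangeGet helper:

  import java.util.List;
  import java.util.TreeSet;

  public class BlockOnDemandRead {
      static final int ROWS_PER_BLOCK = 13;           // assumption: rows 0-25 span 2 blocks
      static final long BLOCK_SIZE_BYTES = 64 * 1024; // 64KB block size (see lessons learnt)

      public static void main(String[] args) {
          // Row IDs matched by region.inv_idx for 'rivendell'.
          List<Integer> matchedRows = List.of(0, 1, 2, 20, 21);

          // Deduplicate to the set of forward-index blocks that hold those rows.
          TreeSet<Integer> blocks = new TreeSet<>();
          for (int row : matchedRows) {
              blocks.add(row / ROWS_PER_BLOCK);
          }
          System.out.println("Blocks to read: " + blocks); // [0, 1] -> only 2 Range GETs

          for (int block : blocks) {
              long start = block * BLOCK_SIZE_BYTES;
              long end = start + BLOCK_SIZE_BYTES - 1;
              // rangeGet("impressions.fwd_idx", start, end); // hypothetical Range GET helper
              System.out.printf("GET bytes=%d-%d of impressions.fwd_idx%n", start, end);
          }
      }
  }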
Parallel block prefetch in the post-filter phase

The matching row IDs are known as soon as the in-filter evaluation over region.inv_idx finishes (rivendell -> 0, 1, 2, 20, 21). Rather than reading impressions.fwd_idx blocks one at a time as the aggregation scans, the server identifies the needed chunks right after in-filter evaluation and prefetches them in parallel.
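Since each Range GET costs roughly the same ~200ms regardless of order, issuing the post-filter block reads concurrently (the same pattern as the planning-phase prefetch sketch above) bounds the added latency to about one round trip rather than one per block.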
Block Cache Prefetch vs. Whole Column Fetch

  SELECT payload_pull_request // <- large raw forward index
  FROM github_events_bc_big
  WHERE repo_name = '...'     // <- high selectivity filter
    Whole column read, tiered storage: 1593ms
    Block cache, tiered storage:        516ms
    Pinot tightly coupled:              150ms

  SELECT id                   // <- dictionary encoded column
  FROM github_events_bc_big
  WHERE repo_name = '...'     // <- high selectivity filter
    Whole column read, tiered storage:  158ms
    Block cache, tiered storage:         39ms
    Pinot tightly coupled:                7ms
Block Cache Prefetch Using Different Clients

  SELECT payload_pull_request  // <- large raw forward index
  FROM github_events_bc_small  // single segment small table
  WHERE repo_name = '...'      // <- high selectivity filter

    Using NettyNioAsyncHttpClient: 333ms
    Using S3CrtAsyncClient:        272ms

Optimize the block size
Lessons learnt

● High selectivity lets queries fetch less data, but the workload shifts from network-throughput-bound to I/O-bound (many small reads).
● The default S3 client is not tuned for an I/O-bound workload; config knobs such as max concurrency and target throughput need to be tuned properly to support higher IOPS.
● For high-selectivity queries, a block cache size in the 64KB to 128KB range is the sweet spot for both throughput and latency.
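As a hedged illustration, the AWS SDK for Java v2 CRT client exposes both knobs named above; the values here are assumptions to tune against your own workload, not StarTree's settings:

  import software.amazon.awssdk.regions.Region;
  import software.amazon.awssdk.services.s3.S3AsyncClient;

  public class TunedS3Client {
      public static void main(String[] args) {
          // CRT-based async client, the faster of the two clients benchmarked above.
          S3AsyncClient s3 = S3AsyncClient.crtBuilder()
                  .region(Region.US_WEST_2)
                  .maxConcurrency(256)          // assumption: raise for IOPS-bound block reads
                  .targetThroughputInGbps(10.0) // assumption: match instance network bandwidth
                  .build();
          // ... issue many concurrent 64-128KB Range GETs against this client ...
          s3.close();
      }
  }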
Journey To Iceberg Integration

Journey to open table format: the Pinot side (Controller, Broker, Servers, Zookeeper) connects to an Iceberg table, made up of a Catalog, metadata files, manifest files/list, and Parquet data files. To serve queries, Pinot:

  1. Watches the Iceberg table
  2. Transforms it into Pinot metadata + indexes
  3. Locates and queries the data files
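A minimal sketch of steps 1 and 3 with the Iceberg Java API, assuming a Hadoop catalog at a hypothetical warehouse path; the Pinot-side transform in step 2 is out of scope here:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.TableIdentifier;
  import org.apache.iceberg.hadoop.HadoopCatalog;
  import org.apache.iceberg.io.CloseableIterable;

  public class WatchIcebergTable {
      public static void main(String[] args) throws Exception {
          // Step 1: load the table through the catalog (hypothetical warehouse path).
          HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3://foo.bucket/warehouse");
          Table table = catalog.loadTable(TableIdentifier.of("db", "github_events"));

          // Step 3: plan the scan to locate the Parquet data files to query.
          try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
              for (FileScanTask task : tasks) {
                  System.out.println(task.file().path()); // Parquet data file location
              }
          }
          catalog.close();
      }
  }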
Apache Pinot's Tiered Storage Journey

● Why did we need tiered storage?
● How we fit tiered storage into the Pinot architecture
● How we implemented it
● Journey to open table format

No lazy loading
What to read:
  ● Selective columnar fetch
  ● Block reads
When to read:
  ● Pipelined fetch & execution
  ● Prefetch
  ● Pinning
Thank You!
docs.startree.ai
