Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018

Pinot: Realtime OLAP for 530 Million Users
Seunghyun Lee
Software Engineer

Today’s agenda
1. Motivation
2. Architecture Overview
3. Scaling Pinot
4. Q&A

Analytics Use Case: Interactive Dashboard
select sum(pageView), time from T
where country = us,
browser = chrome,…
group by time
Slice and dice over arbitrary dimensions
Human driven queries
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store

Analytics Use Case: Site Facing
select sum(pageView) from T
where memberId = 456,
pageKey = “profilePage”,
privacySettings in (…)
group by time,[title|geo|industry]
Pre-defined query format with different
primary key values
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99th percentile) | 1000s qps | KV Store

Analytics Use Case: Anomaly Detection
for d1 in [us, ca, … ]
for d2 in [chrome, ie, … ]
…
where country = d1, browser = d2
group by time
Identifying all issues requires us to monitor
all possible combinations
Periodic machine generated queries (bursty)
Use Case | Response Latency | Query Rate | Possible Solutions
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine

Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99th percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
What makes these use cases use different solutions?
Different solutions based on different workload characteristics
Can we support all these use cases in one single system?

What is Pinot?
SQL-like interface with predictable latency (no joins)
Batch Data Ingestion (Hadoop)
Realtime Data Ingestion (Kafka)
Distributed, horizontally scalable
Open source! (https://github.com/linkedin/pinot)

Pinot @ LinkedIn
50+ site-facing use cases
60k+ queries per second
2000+ tables
1.4m+ records ingested per second
• 300B documents per data center
• 2 trillion documents for internal use case

Architecture Overview
• Controller - handles cluster-wide coordination using Apache Helix and ZooKeeper
• Broker - handles query fan-out and query routing to servers
• Server - responds to query requests originating from the brokers

Query Execution: Distributed
Components: Broker, Servers (S1, S2, S3), Controller (Helix)
1. Query
2. Fetch routing table from Helix
3. Scatter request
4. Process request & send response
5. Gather response
6. Return response
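The numbered steps on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Pinot's broker code: the server contents, the routing table and the helper names `server_sum` and `broker_sum` are all hypothetical stand-ins for what Helix and the servers provide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory cluster: server name -> segment name -> metric values.
SERVERS = {
    "S1": {"seg_a": [1, 2, 3]},
    "S2": {"seg_b": [10, 20]},
    "S3": {"seg_c": [100]},
}

# Hypothetical routing table (step 2): the segments each server answers for.
ROUTING_TABLE = {"S1": ["seg_a"], "S2": ["seg_b"], "S3": ["seg_c"]}


def server_sum(server, segments):
    """Step 4: a server processes its segments and returns a partial sum."""
    return sum(sum(SERVERS[server][segment]) for segment in segments)


def broker_sum(routing_table):
    """Steps 3, 5 and 6: scatter the request, gather partial sums, merge them."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(server_sum, server, segments)
            for server, segments in routing_table.items()
        ]
        partial_sums = [future.result() for future in futures]
    return sum(partial_sums)


total = broker_sum(ROUTING_TABLE)
print(total)
```

The broker only merges partial results, so the per-segment work stays on the servers.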

Query Execution: Hybrid Querying
[Figure, built up over several frames: two timelines. The offline server holds data for t = 1-2, produced by an offline Hadoop job; the realtime server holds t = 1, 2, 3, 4, 5. The broker keeps "Time boundary: 2".]
Broker splits the query at the time boundary:
select sum(m) from T
offline server: where t <= 2
realtime server: where t > 2
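A minimal sketch of the time-boundary split, assuming the base query has no where clause of its own. The function names and the toy per-time-unit data are hypothetical; the point is only that data present on both server types is counted once.

```python
def split_hybrid_query(base_query, time_column, time_boundary):
    """Rewrite one query into an offline part and a realtime part."""
    offline_query = f"{base_query} where {time_column} <= {time_boundary}"
    realtime_query = f"{base_query} where {time_column} > {time_boundary}"
    return offline_query, realtime_query


def hybrid_sum(offline_rows, realtime_rows, time_boundary):
    """Sum metric m, taking t <= boundary from offline and t > boundary from realtime."""
    offline_part = sum(m for t, m in offline_rows.items() if t <= time_boundary)
    realtime_part = sum(m for t, m in realtime_rows.items() if t > time_boundary)
    return offline_part + realtime_part


# Toy data: time unit -> metric value. Both servers hold t = 1 and t = 2.
offline_rows = {1: 10, 2: 20}
realtime_rows = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}

offline_query, realtime_query = split_hybrid_query("select sum(m) from T", "t", 2)
result = hybrid_sum(offline_rows, realtime_rows, time_boundary=2)
print(offline_query)
print(realtime_query)
print(result)
```

Summing both sources without the boundary would count t = 1 and t = 2 twice.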

Query Execution: Single Node
Query Optimization
select max(col) from T
    Use metadata instead of scanning
select sum(metric) from T
where country = us and accountId = x
    Reorders filter for better performance (apply accountId before country predicate)
Dynamic query planning based on column metadata, index, and dictionary
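The two optimizations on this slide could look roughly like the sketch below. It assumes, as an illustration only, that the planner approximates selectivity by column cardinality and evaluates the higher-cardinality column first; the slide does not state the actual heuristic. `ColumnMetadata` and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ColumnMetadata:
    min_value: object
    max_value: object
    cardinality: int  # number of distinct values in the segment


def max_from_metadata(metadata, column):
    """Answer max(column) from segment metadata without scanning any rows."""
    return metadata[column].max_value


def order_predicates(predicates, metadata):
    """Put equality predicates on higher-cardinality columns first (assumed heuristic)."""
    return sorted(
        predicates,
        key=lambda predicate: metadata[predicate[0]].cardinality,
        reverse=True,
    )


# Hypothetical segment metadata.
metadata = {
    "col": ColumnMetadata(3, 97, 50),
    "country": ColumnMetadata("ae", "zw", 200),
    "accountId": ColumnMetadata(1, 999_999, 1_000_000),
}

max_col = max_from_metadata(metadata, "col")
plan = order_predicates([("country", "us"), ("accountId", "x")], metadata)
print(max_col, plan)
```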

Anatomy of Pinot Segment
Metadata: start/end time, available indexes, partitioning info, min/max value, …
Indexes: Inverted, Sorted, Startree
Raw Data
docId | country | code
0 | us | 002
1 | ca | 001
2 | jp | 003
… | … | …
Dictionary (row position = dictId)
dictId | country | code
0 | ca | 001
1 | jp | 002
2 | us | 003
… | … | …
Forward Index (dictId stored per docId)
docId | country | code
0 | 2 | 1
1 | 0 | 0
2 | 1 | 2
… | … | …
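The relationship between raw data, dictionary and forward index on this slide can be reproduced with a short sketch. It assumes a sorted dictionary whose position is the dictId, which matches the values shown; the function names are hypothetical and the inverted index is added only to show the reverse mapping.

```python
def build_dictionary_and_forward_index(values):
    """Return (sorted dictionary, forward index of dictIds ordered by docId)."""
    dictionary = sorted(set(values))
    dict_id_of = {value: dict_id for dict_id, value in enumerate(dictionary)}
    forward_index = [dict_id_of[value] for value in values]
    return dictionary, forward_index


def build_inverted_index(forward_index):
    """Map each dictId to the list of docIds that contain it."""
    inverted = {}
    for doc_id, dict_id in enumerate(forward_index):
        inverted.setdefault(dict_id, []).append(doc_id)
    return inverted


# Raw data from the slide, in docId order.
country_dict, country_fwd = build_dictionary_and_forward_index(["us", "ca", "jp"])
code_dict, code_fwd = build_dictionary_and_forward_index(["002", "001", "003"])
country_inverted = build_inverted_index(country_fwd)
print(country_dict, country_fwd)
print(code_dict, code_fwd)
```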

Recap: Analytics Use Cases
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99th percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
Different solutions based on different workload characteristics

Interactive Dashboard
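Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
where country = us, browser = chrome,…
group by time
[Figure: latency histogram; x-axis: Latency (milliseconds), 0 to 500; y-axis: Frequency; series: pinot, druid]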

Site Facing
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99th percentile) | 1000s qps | KV Store
select sum(pageView) from T
where memberId = xx, privacySettings in…
group by time,[title|geo|industry]
[Figure: x-axis: Queries per second (10 to 1000); y-axis: Latency (milliseconds) (100 to 10000); series: druid, pinot]

Pinot Optimizations For Site Facing Use Cases
• Optimizing Query Processing
1. Sorted Index + Dynamic execution planning
• Optimizing Scatter and Gather
1. Smart segment assignment and routing
2. Data partitioning and pruning

Optimizing Query Processing: Sorted Index
• Access to both forward/inverted index
• Fetch contiguous block, benefit from locality
• For item filtering, pick scanning or inverted index based on cardinality of sorted column
Sorted index on memberId
memberId | start docId | end docId
123 | 0 | 100
456 | 101 | 300
… | … | …
Forward index
docId | memberId
0 | 123
… | …
100 | 123
101 | 456
… | …
300 | 456
… | …
select …
where memberId = 456, item in(…)
group by …
[Figure: x-axis: Queries per second (10 to 1000); y-axis ticks: 100, 1000; series: sorted index, inverted index]
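A minimal sketch of a sorted-index lookup, using binary search over a column sorted by memberId. It mirrors the docId ranges on this slide (123 at 0-100, 456 at 101-300); the `item` column and both function names are hypothetical, and the item filter simply scans the contiguous block.

```python
from bisect import bisect_left, bisect_right


def sorted_index_lookup(sorted_column, value):
    """Return the inclusive (start docId, end docId) range for value, or None."""
    start = bisect_left(sorted_column, value)
    end = bisect_right(sorted_column, value)
    if start == end:
        return None
    return start, end - 1


def filter_items(item_column, doc_range, wanted_items):
    """Scan only the contiguous docId block and keep docs whose item is wanted."""
    start, end = doc_range
    return [doc for doc in range(start, end + 1) if item_column[doc] in wanted_items]


# Column sorted by memberId: docIds 0-100 hold 123, docIds 101-300 hold 456.
member_ids = [123] * 101 + [456] * 200
# Hypothetical second column used for the "item in (…)" predicate.
items = [doc_id % 5 for doc_id in range(len(member_ids))]

doc_range = sorted_index_lookup(member_ids, 456)
matching_docs = filter_items(items, doc_range, wanted_items={0})
print(doc_range, len(matching_docs))
```

Because the matching docIds are contiguous, the second predicate touches one block of the forward index rather than scattered rows.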

Optimizing Scatter and Gather: Querying All Servers
Replica group: a set of servers that contains a complete set of all segments.
[Figure: two layouts of segments 1-4 across servers, each serving query 1 and query 2; the second layout groups the servers into replica groups RG1 and RG2]
[Figure: x-axis: Queries per second (10 to 1000); y-axis ticks: 100, 1000, 10000; series: without routing optimization, with routing optimization]
Problem | Impact | Solution
Querying all servers | 99th percentile latency is impacted by the slowest server (e.g. gc) | Control the number of servers to fan-out
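Replica-group routing can be sketched as follows. The segment layout, server names and the round-robin choice are assumptions made for illustration; the slide only states that a replica group holds a complete copy of all segments and that fan-out should be limited.

```python
from itertools import cycle

# Hypothetical layout: each replica group holds every segment exactly once.
REPLICA_GROUPS = {
    "RG1": {"server1": ["seg1", "seg2"], "server2": ["seg3", "seg4"]},
    "RG2": {"server3": ["seg1", "seg2"], "server4": ["seg3", "seg4"]},
}


class ReplicaGroupRouter:
    """Send each query to one replica group instead of to every server."""

    def __init__(self, replica_groups):
        self._groups = replica_groups
        self._next_group = cycle(sorted(replica_groups))

    def route(self):
        """Return (group name, server -> segments) for the next query."""
        group = next(self._next_group)
        return group, self._groups[group]


router = ReplicaGroupRouter(REPLICA_GROUPS)
first_group, first_servers = router.route()
second_group, second_servers = router.route()
print(first_group, sorted(first_servers))
print(second_group, sorted(second_servers))
```

Each query now waits on two servers rather than four, so a slow server affects fewer queries.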

Optimizing Scatter and Gather: Querying All Segments
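[Figure: without partitioning, query 1 and query 2 each touch segments S1, S2, S3, S4; with partitioning, S1 (p=1), S2 (p=1), S3 (p=2), S4 (p=2), and query 1 (mid = p1) and query 2 (mid = p2) touch only the segments of their partition]
Problem | Impact | Solution
Querying all segments | More CPU work on server | Minimize the number of segments queried (partitioning and pruning)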
select …
where memberId = 456, item in(…)
group by …
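A sketch of partition-based segment pruning for a query that filters on memberId. The modulo partition function and the segment-to-partition map are hypothetical; they only reuse the S1-S4 and p=1/p=2 labels from this slide.

```python
# Partition recorded in each segment's metadata (labels taken from the slide).
SEGMENT_PARTITIONS = {"S1": 1, "S2": 1, "S3": 2, "S4": 2}


def partition_of(member_id, num_partitions=2):
    """Hypothetical partition function: modulo, shifted to 1-based partition ids."""
    return member_id % num_partitions + 1


def prune_segments(member_id, segment_partitions, num_partitions=2):
    """Keep only the segments whose partition can contain member_id."""
    target = partition_of(member_id, num_partitions)
    return sorted(
        segment
        for segment, partition in segment_partitions.items()
        if partition == target
    )


segments_for_456 = prune_segments(456, SEGMENT_PARTITIONS)
segments_for_457 = prune_segments(457, SEGMENT_PARTITIONS)
print(segments_for_456, segments_for_457)
```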

Anomaly Detection: Challenge
for d1 in [us, ca, …]
for d2 in [key1, key2,…]
…
select sum(pageViews) from T
where country=d1, page_key=d2,
source_app=d3, device_name=d4…
group by country, time
…
Filter | Aggregation | Latency
select … where country = us,… | Slow, scan 60-70% data | high
select … where country = kenya,… | Scan less than 1% | low
• Latency is not predictable; it depends on the query predicate
• Monitoring all possible combinations makes the problem worse!

Time vs Space Trade-off
[Figure: latency vs. storage requirement]
Columnar Store: variable latency, low storage overhead
KV Store (Pre-computed): low latency, high storage overhead
Startree Index

Startree Index Generation
1. Multidimensional sort
2. Split on the column and create a node for each value
3. Create star node (aggregate metric after removing the split column)
4. Apply 1,2,3 for each node recursively and stop when number of records in node < SplitThreshold
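Raw records
docId | country | browser | … other dimensions | impression
0 | al | ie | … | 10
1 | ca | safari | … | 10
2 | … | … | … | …
… | us | chrome | … | 10
… | us | chrome | … | 10
… | us | ie | … | 10
N | us | safari | … | 10
Aggregated records
N+1 | * | chrome | … | 40
N+2 | * | ie | … | 20
N+3 | * | safari | … | 20
[Tree: root splits on country (al, ca, …, us, *); the next level splits on browser (chrome, …, safari)]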

Time vs Space Trade-off with Startree
latency
storage requirement
Columnar Store
KV Store (Pre-computed)
Startree Index
SplitThreshold = infinity: no prematerialization
SplitThreshold = 1: full materialization
SplitThreshold = 100,000: partial, data-aware materialization

Startree Query Execution
select sum(pageViews) from T
where country = AL
where browser = Chrome
select sum(X)
from T
where d1=v1 and d2=v2 and …
Any query pattern will scan less than SplitThreshold records
[Tree: root splits on country (al, ca, …, us, *); the next level splits on browser (chrome, …, safari, * and chrome, …, safari)]
select sum(pageViews) from T
where country = CA
Legend: Raw docs, Aggregated docs
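Query execution over such a tree can be sketched as a single root-to-leaf walk: follow the child for a filtered dimension, otherwise follow the star child. The tree below is hand-built, hypothetical sample data with dimensions country and browser; the node layout and the function names are assumptions for illustration.

```python
STAR = "*"
DIMENSIONS = ["country", "browser"]

# Hand-built toy tree: leaves hold (dimensions, metric) records.
TREE = {"children": {
    "al": {"records": [(("al", "ie"), 10)]},
    "ca": {"records": [(("ca", "safari"), 10)]},
    "us": {"children": {
        "chrome": {"records": [(("us", "chrome"), 10), (("us", "chrome"), 10)]},
        "ie": {"records": [(("us", "ie"), 10)]},
        "safari": {"records": [(("us", "safari"), 10)]},
        STAR: {"records": [(("us", STAR), 40)]},
    }},
    STAR: {"children": {
        "chrome": {"records": [((STAR, "chrome"), 20)]},
        "ie": {"records": [((STAR, "ie"), 20)]},
        "safari": {"records": [((STAR, "safari"), 20)]},
        STAR: {"records": [((STAR, STAR), 60)]},
    }},
}}


def matches(dims, filters):
    """True if the record satisfies every equality filter."""
    return all(dims[DIMENSIONS.index(name)] == value for name, value in filters.items())


def star_tree_sum(node, filters, depth=0):
    """Sum the metric for equality filters given as {dimension name: value}."""
    if "records" in node:
        return sum(metric for dims, metric in node["records"] if matches(dims, filters))
    # Follow the filtered value if there is one, otherwise the star child.
    wanted = filters.get(DIMENSIONS[depth], STAR)
    child = node["children"].get(wanted)
    return 0 if child is None else star_tree_sum(child, filters, depth + 1)


sum_al = star_tree_sum(TREE, {"country": "al"})
sum_chrome = star_tree_sum(TREE, {"browser": "chrome"})
sum_ca = star_tree_sum(TREE, {"country": "ca"})
print(sum_al, sum_chrome, sum_ca)
```

The browser-only query reads one pre-aggregated record under the star node instead of scanning every raw record.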

Anomaly Detection
[Figure: x-axis: Queries per second (1 to 100); y-axis ticks: 100, 1000, 10000; series: druid, pinot with inverted index, pinot with startree index]
Use Case | Response Latency | Query Throughput | Possible Solutions
Anomaly detection | sub-second to few seconds | 10-100s queries per second | Streaming Engine

Pinot vs Druid
Feature | Druid | Pinot
Inverted Index | Always on all columns, fixed | Configurable on per column basis
Query Execution Layer | Fixed Plan | Split into planning and execution
Data Organization | N/A | Sorted column
Partitioning | Only available for time column | Available for any column
Controlling query fan-out | N/A | Replica group based segment assignment and routing
Smart pre-materialization | N/A | Star-tree

Can we support all these use cases in one single system?
Use Case | Response Latency | Query Rate | Solution
Interactive dashboard | sub-second to few seconds | ~1 qps | Pinot
Site facing | 100ms (99th percentile) | 1000s qps | Pinot
Anomaly detection | sub-second to few seconds | 10-100s qps | Pinot
