Pinot: Realtime OLAP for 530 Million Users
Seunghyun Lee
Software Engineer
Today’s agenda
1. Motivation
2. Architecture Overview
3. Scaling Pinot
4. Q&A
Analytics Use Case: Interactive Dashboard
select sum(pageView), time from T
where country = us,
browser = chrome,…
group by time
Slice and dice over arbitrary dimensions
Human driven queries
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Analytics Use Case: Site Facing
select sum(pageView) from T
where memberId = 456,
pageKey = “profilePage”,
privacySettings in (…)
group by time,[title|geo|industry]
Pre-defined query format with different
primary key values
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99th percentile) | 1000s qps | KV Store
Analytics Use Case: Anomaly Detection
for d1 in [us, ca, … ]
for d2 in [chrome, ie, … ]
…
select sum(pageView), time from T
where country = d1, browser = d2
group by time
Identifying all issues requires us to monitor
all possible combinations
Periodic machine generated queries (bursty)
Use Case | Response Latency | Query Rate | Possible Solutions
Anomaly Detection | sub-second to few seconds | 10-100s qps | Streaming Engine
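The nested loops on this slide can be sketched as a small query generator. This is only an illustration: the dimension values, the `anomaly_queries` helper, and the quoted SQL syntax are assumptions, not part of the talk. It shows how the number of machine-generated queries grows with the product of the dimension cardinalities.

```python
from itertools import product

# Hypothetical dimension values; a real deployment has many more.
DIMENSIONS = {
    "country": ["us", "ca"],
    "browser": ["chrome", "ie"],
}

def anomaly_queries(dimensions):
    """Yield one aggregation query per combination of dimension values."""
    names = list(dimensions)
    for combo in product(*dimensions.values()):
        where = " and ".join(f"{n} = '{v}'" for n, v in zip(names, combo))
        yield f"select sum(pageView), time from T where {where} group by time"

queries = list(anomaly_queries(DIMENSIONS))
for q in queries:
    print(q)
```

With two dimensions of two values each this yields four queries; every extra dimension multiplies the count, which is why the workload is bursty.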
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99th percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
What makes these use cases use different solutions?
Different solutions based on
different workload
characteristics
Can we support all these use cases in one single system?
What is Pinot?
SQL-like interface with predictable latency (no joins)
Batch Data Ingestion (Hadoop)
Realtime Data Ingestion (Kafka)
Distributed, horizontally scalable
Open source! (https://github.com/linkedin/pinot)
Pinot @ LinkedIn
• +50 Site Facing Use cases
• +60k Queries per second
• +2000 Tables
• +1.4m Records ingested per second
• 300B documents per data center
• 2 trillion documents for internal use case
Today’s agenda
1. Motivation
2. Architecture Overview
3. Scaling Pinot
4. Q&A
Architecture Overview
• Controller - handles cluster-wide coordination using Apache Helix and Zookeeper
• Broker - handles query fan out and query routing to servers
• Server - responds to query requests originating from the brokers
Query Execution: Distributed
[Diagram: Broker, Controller (Helix), and Server, with labels S1, S2, S3]
1. Query
2. Fetch routing table from Helix
3. Scatter request
4. Process request & send response
5. Gather response
6. Return response
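The scatter and gather steps on this slide can be sketched in a few lines. This is a minimal in-process illustration, not Pinot's implementation: the routing table, the server names, the segment contents, and the function names are all hypothetical, and real brokers talk to servers over the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table (step 2): server name -> segments to query there.
ROUTING_TABLE = {"server-a": ["S1", "S2"], "server-b": ["S3"]}

# Hypothetical segment contents: each segment holds a list of metric values.
SEGMENT_DATA = {"S1": [1, 2, 3], "S2": [10], "S3": [5, 5]}

def process_on_server(server, segments):
    """Step 4: a server computes a partial sum over its assigned segments."""
    return sum(sum(SEGMENT_DATA[s]) for s in segments)

def broker_sum(routing_table):
    """Steps 3, 5 and 6 as seen from the broker."""
    with ThreadPoolExecutor() as pool:
        # Step 3: scatter one request per server.
        futures = [pool.submit(process_on_server, server, segments)
                   for server, segments in routing_table.items()]
        # Step 5: gather the partial results.
        partials = [f.result() for f in futures]
    # Step 6: merge the partials and return the response.
    return sum(partials)

total = broker_sum(ROUTING_TABLE)
print(total)
```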
Query Execution: Hybrid Querying
[Diagram, built up over six slides: an offline server timeline and a realtime server timeline with time points t = 1 to 5. An offline Hadoop job produces the data for t = 1-2 on the offline server. The broker holds "Time boundary: 2".]
Broker receives: select sum(m) from T
Offline server: select sum(m) from T where t <= 2
Realtime server: select sum(m) from T where t > 2
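The time-boundary split can be sketched as follows. The rows, the metric values, and the helper names are made up for illustration; the only thing taken from the slides is the boundary of 2 and the two predicates `t <= 2` and `t > 2`. The point is that data present on both sides is counted once.

```python
# Hypothetical (t, m) rows. The realtime side still holds t = 1 and t = 2,
# which the offline job has also produced.
OFFLINE_ROWS = [(1, 10), (2, 20)]
REALTIME_ROWS = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
TIME_BOUNDARY = 2

def sum_m(rows, predicate):
    """select sum(m) from rows where predicate(t)."""
    return sum(m for t, m in rows if predicate(t))

def hybrid_sum():
    # Offline server answers: select sum(m) from T where t <= 2
    offline_part = sum_m(OFFLINE_ROWS, lambda t: t <= TIME_BOUNDARY)
    # Realtime server answers: select sum(m) from T where t > 2
    realtime_part = sum_m(REALTIME_ROWS, lambda t: t > TIME_BOUNDARY)
    return offline_part + realtime_part

# Querying both sides without the boundary would double count t = 1 and t = 2.
naive_sum = sum_m(OFFLINE_ROWS, lambda t: True) + sum_m(REALTIME_ROWS, lambda t: True)
print(hybrid_sum(), naive_sum)
```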
Query Execution: Single Node
Query Optimization
Query | Optimization
select max(col) from T | Use metadata instead of scanning
select sum(metric) from T where country = us and accountId = x | Reorders filter for better performance (apply accountId before country predicate)
Dynamic query planning based on column metadata, index, and dictionary
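A rough sketch of the two optimizations on this slide. The `ColumnMetadata` class, the metadata values, and the rule "evaluate the higher-cardinality column first" are assumptions chosen to reproduce the accountId-before-country example; the slide does not say which statistics the planner actually uses.

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    max_value: object   # kept per column, so max() needs no scan
    cardinality: int    # number of distinct values (assumed statistic)

# Hypothetical per-segment column metadata.
METADATA = {
    "col": ColumnMetadata(max_value=97, cardinality=50),
    "country": ColumnMetadata(max_value="zw", cardinality=250),
    "accountId": ColumnMetadata(max_value=999_999, cardinality=1_000_000),
}

def max_from_metadata(column):
    """select max(column) from T, answered from metadata only."""
    return METADATA[column].max_value

def reorder_filters(predicates):
    """Put the predicate on the higher-cardinality column first.

    Assumption: an equality match on a column with more distinct values
    keeps fewer documents, so applying it first shrinks later work.
    """
    return sorted(predicates,
                  key=lambda p: METADATA[p[0]].cardinality,
                  reverse=True)

plan = reorder_filters([("country", "us"), ("accountId", "x")])
print(max_from_metadata("col"), plan)
```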
Anatomy of Pinot Segment
Metadata: start/end time, available indexes, partitioning info, min/max value, …
Indexes: Inverted, Sorted, Startree

Raw Data
docId | country | code
0 | us | 002
1 | ca | 001
2 | jp | 003
… | … | …

Dictionary
dictId | country | code
0 | ca | 001
1 | jp | 002
2 | us | 003
… | … | …

Forward Index (dictId per docId)
docId | country | code
0 | 2 | 1
1 | 0 | 0
2 | 1 | 2
… | … | …
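The dictionary and forward index on this slide can be reproduced with a short sketch. The `build_column` helper is hypothetical and the inverted index layout (dictId to list of docIds) is an assumption; the country values are the three rows shown on the slide.

```python
def build_column(values):
    """Dictionary-encode one column and derive its forward and inverted index."""
    # Dictionary: sorted distinct values; the position is the dictId.
    dictionary = sorted(set(values))
    dict_id = {value: i for i, value in enumerate(dictionary)}
    # Forward index: docId -> dictId.
    forward = [dict_id[value] for value in values]
    # Inverted index (assumed layout): dictId -> docIds holding that value.
    inverted = {}
    for doc_id, d in enumerate(forward):
        inverted.setdefault(d, []).append(doc_id)
    return dictionary, forward, inverted

# Raw data from the slide: docId 0 = us, 1 = ca, 2 = jp.
dictionary, forward, inverted = build_column(["us", "ca", "jp"])
print(dictionary)  # ['ca', 'jp', 'us']
print(forward)     # [2, 0, 1]
```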
Today’s agenda
1. Motivation
2. Architecture Overview
3. Scaling Pinot
4. Q&A
Recap: Analytics Use Cases
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99th percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
Different solutions based on
different workload
characteristics
Interactive Dashboard
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
select sum(pageView), time from T
where country = us, browser = chrome,…
group by time
[Figure: distribution of query latency (milliseconds, 0 to 500) vs. frequency for pinot and druid]
Site Facing
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99th percentile) | 1000s qps | KV Store
select sum(pageView) from T
where memberId = xx, privacySettings in…
group by time,[title|geo|industry]
[Figure: two panels plotting latency (milliseconds) against queries per second for druid and pinot]
Pinot Optimizations For Site Facing Use Cases
• Optimizing Query Processing
1. Sorted Index + Dynamic execution planning
• Optimizing Scatter and Gather
1. Smart segment assignment and routing
2. Data partitioning and pruning
Optimizing Query Processing: Sorted Index
• Access to both forward/inverted index
• Fetch contiguous block, benefit from locality
• For item filtering, pick scanning or inverted index based on cardinality of
sorted column
memberId | start docId | end docId
123 | 0 | 100
456 | 101 | 300
… | … | …

docId | memberId
0 | 123
... | …
100 | 123
101 | 456
… | …
300 | 456
… | …
select …
where memberId = 456, item in(…)
group by …
[Figure: latency (milliseconds) vs. queries per second for sorted index and inverted index]
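A small sketch of why a sorted column helps: the matching documents form one contiguous docId range that can be found with two binary searches. The `doc_id_range` helper is hypothetical; the memberId layout mirrors the slide (123 in docIds 0-100, 456 in docIds 101-300).

```python
from bisect import bisect_left, bisect_right

def doc_id_range(sorted_column, value):
    """Return (start docId, end docId) for value, or None if it is absent."""
    start = bisect_left(sorted_column, value)
    end = bisect_right(sorted_column, value) - 1
    return (start, end) if start <= end else None

# Column sorted on memberId, as on the slide.
member_ids = [123] * 101 + [456] * 200

print(doc_id_range(member_ids, 456))  # (101, 300)
print(doc_id_range(member_ids, 789))  # None
```

Any further filter, such as the `item in(…)` predicate, then only has to look at the documents inside that range.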
Optimizing Scatter and Gather: Querying All Servers
Replica group: a set of servers that contains a complete set of all segments.
[Diagram: segments 1-4 placed on servers, shown without and with replica groups RG1 and RG2, with query 1 and query 2 routed to the servers]
[Figure: latency (milliseconds) vs. queries per second, without routing optimization and with routing optimization]
Problem | Impact | Solution
Querying all servers | 99th percentile latency is impacted by the slowest server (e.g. gc) | Control the number of servers to fan out to
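Replica-group routing can be sketched as picking one group per query instead of fanning out to every server. The server names, the segment placement, and the round-robin choice are assumptions for illustration; only the group names RG1 and RG2 and segments 1-4 come from the slide.

```python
import itertools

# Hypothetical placement: each replica group holds all four segments.
REPLICA_GROUPS = {
    "RG1": {"server-1": [1, 2], "server-2": [3, 4]},
    "RG2": {"server-3": [1, 3], "server-4": [2, 4]},
}

def make_router(replica_groups):
    """Return a route() function that cycles through replica groups."""
    group_cycle = itertools.cycle(sorted(replica_groups))

    def route(_query):
        group = next(group_cycle)
        # Only the servers of the chosen group are contacted.
        return group, replica_groups[group]

    return route

route = make_router(REPLICA_GROUPS)
first_group, first_servers = route("query 1")
second_group, second_servers = route("query 2")
print(first_group, sorted(first_servers))
print(second_group, sorted(second_servers))
```

Each query touches two servers instead of four, so one slow server affects fewer queries.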
Optimizing Scatter and Gather: Querying All Segments
[Diagram: two layouts of segments S1-S4 serving query 1 and query 2. In the partitioned layout: S1 (p=1), S2 (p=1), S3 (p=2), S4 (p=2); query 1 (mid = p1), query 2 (mid = p2).]
Problem | Impact | Solution
Querying all segments | More CPU work on server | Minimize the number of segments (partitioning and pruning)
select …
where memberId = 456, item in(…)
group by …
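Partition-based pruning can be sketched like this. The modulo partition function and the `prune_segments` helper are assumptions; the segment-to-partition mapping (S1 and S2 in p=1, S3 and S4 in p=2) is the one shown on the slide.

```python
NUM_PARTITIONS = 2

# Partition recorded for each segment, as on the slide.
SEGMENT_PARTITION = {"S1": 1, "S2": 1, "S3": 2, "S4": 2}

def partition_of(member_id):
    """Hypothetical partition function: modulo, numbered from 1."""
    return member_id % NUM_PARTITIONS + 1

def prune_segments(member_id):
    """Keep only the segments whose partition can contain member_id."""
    p = partition_of(member_id)
    return sorted(seg for seg, seg_p in SEGMENT_PARTITION.items() if seg_p == p)

print(prune_segments(456))  # ['S1', 'S2']
print(prune_segments(457))  # ['S3', 'S4']
```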
Anomaly Detection: Challenge
for d1 in [us, ca, …]
for d2 in [key1, key2,…]
…
select sum(pageViews) from T
where country=d1, page_key=d2,
source_app=d3, device_name=d4…
group by country, time
…
Filter | Aggregation | Latency
select … where country = us,… | Slow, scan 60-70% data | high
select … where country = kenya,… | Scan less than 1% | low
• Latency is not predictable; it depends on the query predicate
• Monitoring all possible combinations makes the problem worse!
Time vs Space Trade-off
[Figure: latency vs. storage requirement]
Columnar Store: variable latency, low storage overhead
KV Store (Pre-computed): low latency, high storage overhead
Startree Index (shown on the same plot)
Startree Index Generation
1. Multidimensional sort
2. Split on the column and create a node for each value
3. Create star node (aggregate metric after removing the split column)
4. Apply 1,2,3 for each node recursively and stop when number of records in node < SplitThreshold
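[Diagram: star-tree with root; country level: al, ca, …, us, *; browser level: chrome, …, safari]

Raw records
docId | country | browser | … (other dimensions) | impression
0 | al | ie | … | 10
1 | ca | safari | … | 10
2 | … | … | … | …
… | us | chrome | … | 10
… | us | chrome | … | 10
… | us | ie | … | 10
N | us | safari | … | 10

Aggregated records
docId | country | browser | … (other dimensions) | impression
N+1 | * | chrome | … | 40
N+2 | * | ie | … | 20
N+3 | * | safari | … | 20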
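The generation steps can be sketched as a small recursive builder plus a lookup. This is a toy model under stated assumptions, not Pinot's on-disk format: nodes are plain dicts, the six records and the SplitThreshold of 2 are made up, and only the two dimensions country and browser are used.

```python
from collections import defaultdict

STAR = "*"
DIMENSIONS = ("country", "browser")

def build(records, dim_index, split_threshold):
    """records: list of ((country, browser), metric) pairs."""
    node = {"records": records, "children": {}}
    # Step 4: stop when the node is small enough (or no dimension is left).
    if dim_index == len(DIMENSIONS) or len(records) < split_threshold:
        return node
    groups = defaultdict(list)
    star_totals = defaultdict(int)
    for dims, metric in sorted(records):                  # step 1: sort
        groups[dims[dim_index]].append((dims, metric))    # step 2: split
        # Step 3: drop the split column and aggregate the metric.
        star_key = dims[:dim_index] + (STAR,) + dims[dim_index + 1:]
        star_totals[star_key] += metric
    groups[STAR] = sorted(star_totals.items())
    for value, group in groups.items():
        node["children"][value] = build(group, dim_index + 1, split_threshold)
    return node

def query_sum(node, filters, dim_index=0):
    """filters: one value per dimension, or None for 'no predicate'."""
    if not node["children"]:
        # Leaf: scan its (few) records and apply the remaining predicates.
        return sum(m for dims, m in node["records"]
                   if all(f is None or f == d for f, d in zip(filters, dims)))
    key = filters[dim_index] if filters[dim_index] is not None else STAR
    child = node["children"].get(key)
    return 0 if child is None else query_sum(child, filters, dim_index + 1)

RAW = [(("al", "ie"), 10), (("ca", "safari"), 10), (("us", "chrome"), 10),
       (("us", "chrome"), 10), (("us", "ie"), 10), (("us", "safari"), 10)]
tree = build(RAW, 0, split_threshold=2)

print(query_sum(tree, ("al", None)))      # where country = al
print(query_sum(tree, (None, "chrome")))  # where browser = chrome
print(query_sum(tree, (None, None)))      # no predicate
```

A query without a predicate on a dimension follows the star child, so it reads pre-aggregated records instead of scanning the raw ones.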
Time vs Space Trade-off with Startree
[Figure: latency vs. storage requirement, with Columnar Store, KV Store (Pre-computed), and Startree Index]
SplitThreshold = infinity: no prematerialization
SplitThreshold = 1: full materialization
SplitThreshold = 100,000: partial, data-aware materialization
Startree Query Execution
select sum(pageViews) from T where country = AL
select sum(pageViews) from T where country = CA
select sum(pageViews) from T where browser = Chrome
select sum(pageViews) from T
select sum(X) from T where d1=v1 and d2=v2 and …
Any query pattern will scan fewer than SplitThreshold records
[Diagram: star-tree with root; country level: al, ca, …, us, *; browser level: chrome, …, safari, *; leaves point to raw docs and aggregated docs]
Anomaly Detection
[Figure: latency (milliseconds) vs. queries per second for druid, pinot with inverted index, and pinot with startree index]
Use Case | Response Latency | Query Throughput | Possible Solutions
Anomaly detection | sub-second to few seconds | 10-100s queries per second | Streaming Engine
Pinot vs Druid
Feature | Druid | Pinot
Inverted Index | Always on all columns, fixed | Configurable on a per-column basis
Query Execution Layer | Fixed Plan | Split into planning and execution
Data Organization | N/A | Sorted column
Partitioning | Only available for time column | Available for any column
Controlling query fan-out | N/A | Replica group based segment assignment and routing
Smart pre-materialization | N/A | Star-tree
Can we support all these use cases in one single system?
Use Case | Response Latency | Query Rate | Solution
Interactive dashboard | sub-second to few seconds | ~1 qps | Pinot
Site facing | 100ms (99th percentile) | 1000s qps | Pinot
Anomaly detection | sub-second to few seconds | 10-100s qps | Pinot
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018

  • 1. Pinot: Realtime OLAP for 530 Million Users Seunghyun Lee Software Engineer
  • 2. Today’s agenda 1. Motivation 2. Architecture Overview 3. Scaling Pinot 4. Q&A
  • 3. Analytics Use Case: Interactive Dashboard select sum(pageView), time from T where country = us, browser = chrome,… group by time Slice and dice over arbitrary dimensions Human driven queries Use Case Response Latency Query Rate Possible Solutions Interactive dashboard sub-second to few seconds ~1 qps Columnar Store
  • 4. Analytics Use Case: Site Facing select sum(pageView) from T where memberId = 456, pageKey = “profilePage”, privacySettings in (…) group by time,[title|geo|industry] Pre-defined query format with different primary key values Use Case Response Latency Query Rate Possible Solutions Site facing 100ms (99 percentile) 1000s qps KV Store
  • 5. Analytics Use Case: Anomaly Detection for d1 in [us, ca, … ] for d2 in [chrome, ie, … ] … select sum(pageView), time from T where country = d1, browser = d2 group by time Identifying all issues requires us to monitor all possible combinations Periodic machine generated queries (bursty) Use Case Response Latency Query Rate Possible Solutions Anomaly Detection sub-second to few seconds 10-100s qps Streaming Engine
  • 6. Use Case Response Latency Query Rate Possible Solutions Interactive dashboard sub-second to few seconds ~1 qps Columnar Store Site facing 100ms (99 percentile) 1000s qps KV Store (pre-cube) Anomaly detection sub-second to few seconds 10-100s qps Streaming Engine Same input data (Pageview) Same OLAP style query What makes these use cases use different solutions? Different solutions based on different workload characteristics Can we support all these use cases in one single system?
  • 7. What is Pinot? SQL-like interface with predictable latency (no joins) Batch Data Ingestion (Hadoop) Realtime Data Ingestion (Kafka) Distributed, horizontally scalable Open source! (https://github.com/linkedin/pinot)
  • 8. Pinot @ LinkedIn +50 Site Facing Use cases +60k Queries per second Records ingested per second +2000 Tables +1.4m • 300B documents per data center • 2 trillion documents for internal use case
  • 9. Today’s agenda 1. Motivation 2. Architecture Overview 3. Scaling Pinot 4. Q&A
  • 10. Architecture Overview • Controller - handles cluster-wide coordination using Apace Helix and Zookeeper • Broker - handles query fan out and query routing to servers • Server - responds to query requests originating from the brokers
  • 11. Query Execution: Distributed Broker S1 S3 S2 S1 S3 S2 1. Query 2.Fetch routing table from Helix 4. Process request & send response 5. Gather response 6. Return response Server 3. Scatter request Controller (Helix)
  • 12. Query Execution: Hybrid Querying time offline server time t = 1 realtime server 2 3 4 5
  • 13. Query Execution: Hybrid Querying time 1-2 offline server time t = 1 realtime server 2 3 4 5 offline Hadoop job
  • 14. Query Execution: Hybrid Querying time 1-2 offline server time t = 1 realtime server 2 3 4 5
  • 15. Query Execution: Hybrid Querying time offline server Broker time realtime server Time boundary: 2 3 4 5 1-2t = 1 2
  • 16. Query Execution: Hybrid Querying time offline server Broker time realtime server Time boundary: 2 3 4 5 select sum(m) from T t = 1 2 1-2
  • 17. Query Execution: Hybrid Querying time offline server Broker time realtime server Time boundary: 2 3 4 5 select sum(m) from T where t <= 2 select sum(m) from T where t > 2 select sum(m) from T 1-2t = 1 2
  • 18. Query Execution: Single Node Query Optimization select max(col) from T Use metadata instead of scanning select sum(metric) from T where country = us and accountId = x Reorders filter for better performance (apply accountId before country predicate) Dynamic query planning based on column metadata, index, and dictionary
  • 19. Anatomy of Pinot Segment Dictionary Forward Index Metadata start/end time available indexes partitioning info min/max value … Inverted Sorted Startree Indexes docId country code 0 us 002 1 ca 001 2 jp 003 … … … country ca jp us … dictId docId code 001 002 003 … country 2 0 1 … code 1 0 2 … Raw Data
  • 20. Today’s agenda 1. Motivation 2. Architecture Overview 3. Scaling Pinot 4. Q&A
  • 21. Recap: Analytics Use Cases Use Case Response Latency Query Rate Possible Solutions Interactive dashboard sub-second to few seconds ~1 qps Columnar Store Site facing 100ms (99 percentile) 1000s qps KV Store (pre-cube) Anomaly detection sub-second to few seconds 10-100s qps Streaming Engine Same input data (Pageview) Same OLAP style query Different solutions based on different workload characteristics
  • 22. Interactive Dashboard Use Case Response Latency Query Rate Possible Solutions Interactive dashboard sub-second to few seconds ~1 qps Columnar Store select sum(pageView), time from T where country = us, browser = chrome,… group by time 0 100 200 300 400 500 Latency (milliseconds) Frequency pinot druid
  • 23. Site Facing Use Case Response Latency Query Rate Possible Solutions Site facing 100ms (99 percentile) 1000s qps KV Store select sum(pageView) from T where memberId = xx, privacySettings in… group by time,[title|geo|industry] ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ●● ●●100 1000 10000 10 1000 Queries per second Latency(milliseconds) druid pinot ● ● ● ● ● ●●● ●●●●● ● ● ● ● ● ●●● ●●●●● ● ● ● ● ● ●●● ●●●●● 100 1000 10000 10 100 1000 Queries per second Latency(milliseconds) pinot druid
  • 24. Pinot Optimizations For Site Facing Use Cases • Optimizing Query Processing 1. Sorted Index + Dynamic execution planning • Optimizing Scatter and Gather 1. Smart segment assignment and routing 2. Data partitioning and pruning
  • 25. Optimizing Query Processing: Sorted Index • Access to both forward/inverted index • Fetch contiguous block, benefit from locality • For item filtering, pick scanning or inverted index based on cardinality of sorted column memberId start docId end docId 123 0 100 456 101 300 … … … docId memberId 0 123 ... … 100 123 101 456 … … 300 456 … … select … where memberId = 456, item in(…) group by … ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 100 1000 10 100 1000 Queries per second Latency(milliseconds) sorted index inverted index
  • 26. Optimizing Scatter and Gather: Querying All Servers
    Replica group: a set of servers that contains a complete set of all segments.
    [Diagram: query 1 and query 2 routed to servers holding segments 1-4, without and with replica groups RG1 and RG2]
    [Figure: latency (milliseconds) vs. queries per second, without routing optimization and with routing optimization]
    Problem | Impact | Solution
    Querying all servers | 99th percentile latency is impacted by the slowest server (e.g. GC) | Control the number of servers to fan out to
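A minimal sketch of replica-group routing, assuming a round-robin choice between groups. The server names, the segment layout and the round-robin policy are assumptions for illustration; the slide only states that a replica group holds a complete set of segments and that fan-out should be limited.

```python
from itertools import cycle

# Hypothetical layout: each replica group holds a complete copy of segments 1-4.
REPLICA_GROUPS = {
    "RG1": {"server1": [1, 2], "server2": [3, 4]},
    "RG2": {"server3": [1, 3], "server4": [2, 4]},
}

def make_router(groups):
    """Return a function that picks one replica group per query (round-robin)."""
    names = cycle(sorted(groups))

    def route():
        name = next(names)
        return name, groups[name]

    return route

route = make_router(REPLICA_GROUPS)
for query in ("query 1", "query 2"):
    group, servers = route()
    # Each query touches only the servers of one group, not all four servers.
    print(query, "->", group, sorted(servers))
```

With two groups of two servers each, every query fans out to two servers instead of four, so a slow server affects only the queries routed to its group.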
  • 27. Optimizing Scatter and Gather: Querying All Segments
    [Diagram: without partitioning, query 1 and query 2 each hit segments S1, S2, S3, S4; with partitioning, query 1 (mid = p1) hits S1 (p=1) and S2 (p=1), and query 2 (mid = p2) hits S3 (p=2) and S4 (p=2)]
    Problem | Impact | Solution
    Querying all segments | More CPU work on server | Minimize the number of segments (partitioning and pruning)
    select … where memberId = 456, item in(…) group by …
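Partition-based pruning can be sketched as follows. The partition function (a modulo over memberId) and the segment-to-partition map are assumptions chosen to mirror the p=1 / p=2 labels on the slide; the slides do not specify the actual partition function.

```python
NUM_PARTITIONS = 2  # assumed for the example

def partition_of(member_id):
    """Hypothetical partition function; ids start at 1 to match p=1, p=2 on the slide."""
    return member_id % NUM_PARTITIONS + 1

# Partition id recorded for each segment (as labelled on the slide).
SEGMENT_PARTITION = {"S1": 1, "S2": 1, "S3": 2, "S4": 2}

def prune_segments(member_id):
    """Keep only the segments whose partition can contain this memberId."""
    wanted = partition_of(member_id)
    return sorted(s for s, p in SEGMENT_PARTITION.items() if p == wanted)

print(prune_segments(456))
print(prune_segments(457))
```

A query with a memberId predicate then processes half of the segments in this toy setup; queries without that predicate still have to visit every segment.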
  • 28. Anomaly Detection: Challenge
    for d1 in [us, ca, …]
      for d2 in [key1, key2,…]
        …
        select sum(pageViews) from T where country=d1, page_key=d2, source_app=d3, device_name=d4… group by country, time …
    Filter | Aggregation | Latency
    select … where country = us,… | Slow, scan 60-70% data | high
    select … where country = kenya,… | Scan less than 1% | low
    • Latency is not predictable; it depends on the query predicate
    • Monitoring all possible combinations makes the problem worse!
  • 29. Time vs Space Trade-off
    [Diagram: latency vs. storage requirement, with Columnar Store, KV Store (Pre-computed) and Startree Index plotted]
    Columnar Store: variable latency, low storage overhead
    KV Store (Pre-computed): low latency, high storage overhead
  • 30. Startree Index Generation
    1. Multidimensional sort
    2. Split on the column and create a node for each value
    3. Create star node (aggregate metric after removing the split column)
    4. Apply 1, 2, 3 for each node recursively and stop when number of records in node < SplitThreshold
    Raw records:
    docId | country | browser | … other dimensions | impression
    0 | al | ie | | 10
    1 | ca | safari | | 10
    2 | … | … | … | …
    … | us | chrome | | 10
    … | us | chrome | | 10
    … | us | ie | | 10
    N | us | safari | | 10
    Aggregated records:
    N+1 | * | chrome | | 40
    N+2 | * | ie | | 20
    N+3 | * | safari | | 20
    [Diagram: tree with root, a country level (al, ca, …, us, *) and a browser level (chrome, …, safari)]
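The four generation steps can be sketched as a small recursive builder. This is a simplified illustration, not Pinot's implementation: nodes here keep their records inline rather than docId ranges, the dimension order is fixed, and the function names are made up for the example. Only the rows visible on the slide are used, so the totals are smaller than the slide's (chrome sums to 20 here, not 40).

```python
from collections import defaultdict

# Each record is ((country, browser), impression); rows taken from the slide.
RAW = [
    (("al", "ie"), 10), (("ca", "safari"), 10),
    (("us", "chrome"), 10), (("us", "chrome"), 10),
    (("us", "ie"), 10), (("us", "safari"), 10),
]

def aggregate(records, dim_index):
    """Step 3: replace one dimension with '*' and sum the metric over equal rows."""
    sums = defaultdict(int)
    for dims, metric in records:
        key = dims[:dim_index] + ("*",) + dims[dim_index + 1:]
        sums[key] += metric
    return sorted(sums.items())

def build(records, dim_index, num_dims, split_threshold):
    """Steps 1, 2 and 4: sort, split on the current dimension, recurse."""
    node = {"records": records, "children": {}}
    if dim_index == num_dims or len(records) < split_threshold:
        return node  # leaf: too few records to be worth splitting further
    groups = defaultdict(list)
    for rec in sorted(records):  # multidimensional sort
        groups[rec[0][dim_index]].append(rec)
    for value, group in groups.items():
        node["children"][value] = build(group, dim_index + 1, num_dims, split_threshold)
    # Star node: the same data with the split column aggregated away.
    node["children"]["*"] = build(
        aggregate(records, dim_index), dim_index + 1, num_dims, split_threshold
    )
    return node

tree = build(RAW, 0, 2, split_threshold=2)
print(tree["children"]["*"]["records"])
```

Raising `split_threshold` makes more nodes stay leaves, which means fewer pre-aggregated records and more scanning at query time.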
  • 31. Time vs Space Trade-off with Startree
    [Diagram: latency vs. storage requirement, with Columnar Store, KV Store (Pre-computed) and Startree Index plotted]
    SplitThreshold = infinity: no prematerialization
    SplitThreshold = 1: full materialization
    SplitThreshold = 100,000: partial, data-aware materialization
  • 32. Startree Query Execution
    select sum(pageViews) from T where country = AL
    select sum(pageViews) from T where country = CA
    select sum(pageViews) from T where browser = Chrome
    select sum(pageViews) from T
    select sum(X) from T where d1=v1 and d2=v2 and …
    Any query pattern will scan less than SplitThreshold records
    [Diagram: star-tree with root, a country level (al, ca, …, us, *) and a browser level (chrome, …, safari, *); leaves point to raw docs and aggregated docs]
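Query execution over a star-tree can be sketched as a walk from the root: at each level, follow the child for the predicate value if the query filters on that dimension, otherwise follow the star child, then scan the few records at the leaf. The tree below is hand-built and hypothetical (one level, split on country only, toy totals), and `star_sum` is an illustrative name, not a Pinot API.

```python
DIMS = ("country", "browser")  # dimension order used by the toy tree

def leaf(*records):
    return {"records": list(records), "children": {}}

# Hand-built toy tree: split on country only; '*' holds pre-aggregated docs.
TREE = {
    "records": [],
    "children": {
        "al": leaf((("al", "ie"), 10)),
        "ca": leaf((("ca", "safari"), 10)),
        "us": leaf((("us", "chrome"), 20), (("us", "ie"), 10), (("us", "safari"), 10)),
        "*": leaf((("*", "chrome"), 20), (("*", "ie"), 20), (("*", "safari"), 20)),
    },
}

def star_sum(node, predicates, depth=0):
    """Sum the metric for equality predicates such as {'country': 'us'}."""
    if not node["children"]:
        # Leaf: scan its records and apply the predicates directly.
        return sum(
            metric
            for dims, metric in node["records"]
            if all(dims[DIMS.index(d)] == v for d, v in predicates.items())
        )
    # Follow the predicate value if present, else the star (aggregated) child.
    child = node["children"].get(predicates.get(DIMS[depth], "*"))
    return 0 if child is None else star_sum(child, predicates, depth + 1)

print(star_sum(TREE, {"country": "al"}))
print(star_sum(TREE, {"browser": "chrome"}))
print(star_sum(TREE, {}))
```

In every case the scan is confined to one leaf, which is what bounds the number of records read per query.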
  • 33. Anomaly Detection
    [Figure: latency (milliseconds) vs. queries per second for druid, pinot with inverted index, and pinot with startree index]
    Use Case | Response Latency | Query Throughput | Possible Solutions
    Anomaly detection | sub-second to few seconds | 10-100s queries per second | Streaming Engine
  • 34. Pinot vs Druid
    Feature | Druid | Pinot
    Inverted Index | Always on all columns, fixed | Configurable on a per-column basis
    Query Execution Layer | Fixed Plan | Split into planning and execution
    Data Organization | N/A | Sorted column
    Partitioning | Only available for time column | Available for any column
    Controlling query fan-out | N/A | Replica group based segment assignment and routing
    Smart pre-materialization | N/A | Star-tree
  • 35. Can we support all these use cases in one single system?
    Use Case | Response Latency | Query Rate | Solution
    Interactive dashboard | sub-second to few seconds | ~1 qps | Pinot
    Site facing | 100ms (99 percentile) | 1000s qps | Pinot
    Anomaly detection | sub-second to few seconds | 10-100s qps | Pinot