
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018

Pinot: Realtime OLAP for 530 Million Users (https://github.com/linkedin/pinot)

Published in: Data & Analytics

  1. 1. Pinot: Realtime OLAP for 530 Million Users. Seunghyun Lee, Software Engineer
  2. 2. Today’s agenda 1. Motivation 2. Architecture Overview 3. Scaling Pinot 4. Q&A
  3. 3. Analytics Use Case: Interactive Dashboard
        select sum(pageView), time from T where country = us, browser = chrome, … group by time
        Slice and dice over arbitrary dimensions. Human-driven queries.
        Use case: Interactive dashboard | Response latency: sub-second to a few seconds | Query rate: ~1 qps | Possible solution: Columnar store
  4. 4. Analytics Use Case: Site Facing
        select sum(pageView) from T where memberId = 456, pageKey = “profilePage”, privacySettings in (…) group by time, [title|geo|industry]
        Pre-defined query format with different primary key values.
        Use case: Site facing | Response latency: 100 ms (99th percentile) | Query rate: 1000s qps | Possible solution: KV store
  5. 5. Analytics Use Case: Anomaly Detection
        for d1 in [us, ca, …]
          for d2 in [chrome, ie, …]
            …
              select sum(pageView), time from T where country = d1, browser = d2 group by time
        Identifying all issues requires us to monitor all possible combinations. Periodic, machine-generated queries (bursty).
        Use case: Anomaly detection | Response latency: sub-second to a few seconds | Query rate: 10-100s qps | Possible solution: Streaming engine
  6. 6. Use case | Response latency | Query rate | Possible solution
        Interactive dashboard | sub-second to a few seconds | ~1 qps | Columnar store
        Site facing | 100 ms (99th percentile) | 1000s qps | KV store (pre-cube)
        Anomaly detection | sub-second to a few seconds | 10-100s qps | Streaming engine
        Same input data (Pageview). Same OLAP-style query. What makes these use cases use different solutions? Different solutions are chosen based on different workload characteristics. Can we support all these use cases in one single system?
  7. 7. What is Pinot?
        • SQL-like interface with predictable latency (no joins)
        • Batch data ingestion (Hadoop)
        • Realtime data ingestion (Kafka)
        • Distributed, horizontally scalable
        • Open source! (https://github.com/linkedin/pinot)
  8. 8. Pinot @ LinkedIn: 50+ site-facing use cases; 60k+ queries per second; 1.4M+ records ingested per second; 2,000+ tables
        • 300B documents per data center
        • 2 trillion documents for internal use case
  9. 9. Today’s agenda 1. Motivation 2. Architecture Overview 3. Scaling Pinot 4. Q&A
  10. 10. Architecture Overview
        • Controller: handles cluster-wide coordination using Apache Helix and Zookeeper
        • Broker: handles query fan-out and query routing to servers
        • Server: responds to query requests originating from the brokers
  11. 11. Query Execution: Distributed
        1. Query  2. Fetch routing table from Helix  3. Scatter request  4. Process request & send response  5. Gather response  6. Return response
        [Diagram: Broker, Controller (Helix), and Servers holding segments S1, S2, S3]
  12. 12. Query Execution: Hybrid Querying
        [Diagram: time axes for an offline server and a realtime server; realtime server: t = 1, 2, 3, 4, 5]
  13. 13. Query Execution: Hybrid Querying
        [Diagram: offline Hadoop job; offline server: 1-2; realtime server: t = 1, 2, 3, 4, 5]
  14. 14. Query Execution: Hybrid Querying
        [Diagram: offline server: 1-2; realtime server: t = 1, 2, 3, 4, 5]
  15. 15. Query Execution: Hybrid Querying
        [Diagram: Broker with time boundary: 2; offline server: 1-2; realtime server: t = 1, 2, 3, 4, 5]
  16. 16. Query Execution: Hybrid Querying
        [Diagram: Broker with time boundary: 2; offline server: 1-2; realtime server: t = 1, 2, 3, 4, 5]
        Incoming query: select sum(m) from T
  17. 17. Query Execution: Hybrid Querying
        [Diagram: Broker with time boundary: 2; offline server: 1-2; realtime server: t = 1, 2, 3, 4, 5]
        Incoming query: select sum(m) from T
        Sent to offline server: select sum(m) from T where t <= 2
        Sent to realtime server: select sum(m) from T where t > 2
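
The time-boundary rewrite on the hybrid querying slides can be sketched in a few lines of Python. This is only an illustration of the idea, not Pinot's broker code: `split_hybrid_query` and `merge_sums` are hypothetical helpers, and the string rewrite assumes a simple query with no group by or order by clause.

```python
def split_hybrid_query(query: str, time_column: str, time_boundary: int) -> dict:
    """Rewrite one logical query into an offline part and a realtime part.

    Assumption: the query has no GROUP BY / ORDER BY clause, so a predicate
    can simply be appended at the end.
    """
    # Extend an existing WHERE clause, or start a new one.
    joiner = " and " if " where " in query.lower() else " where "
    return {
        "offline": f"{query}{joiner}{time_column} <= {time_boundary}",
        "realtime": f"{query}{joiner}{time_column} > {time_boundary}",
    }


def merge_sums(offline_sum: float, realtime_sum: float) -> float:
    # sum() is additive, so the two partial results can simply be added.
    return offline_sum + realtime_sum


parts = split_hybrid_query("select sum(m) from T", "t", 2)
print(parts["offline"])   # select sum(m) from T where t <= 2
print(parts["realtime"])  # select sum(m) from T where t > 2
```

Because the two predicates are disjoint, data for t = 1 and t = 2 that exists on both the offline and realtime servers is counted only once.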
  18. 18. Query Execution: Single Node Query Optimization
        select max(col) from T: use metadata instead of scanning.
        select sum(metric) from T where country = us and accountId = x: reorders filters for better performance (apply the accountId predicate before the country predicate).
        Dynamic query planning based on column metadata, index, and dictionary.
  19. 19. Anatomy of Pinot Segment
        Metadata: start/end time, available indexes, partitioning info, min/max value, …
        Indexes: Inverted, Sorted, Startree
        Raw data (docId | country | code): 0 | us | 002; 1 | ca | 001; 2 | jp | 003; …
        Dictionary (dictId order): country: ca, jp, us, …; code: 001, 002, 003, …
        Forward index (docId | country dictId | code dictId): 0 | 2 | 1; 1 | 0 | 0; 2 | 1 | 2; …
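
The dictionary and forward index on the segment slide can be reproduced with a small Python sketch. It is a simplified model rather than Pinot's on-disk format, and `build_column` is a hypothetical helper: it assumes a sorted dictionary whose position is the dictId, a forward index mapping docId to dictId, and an inverted index mapping dictId to docIds.

```python
from collections import defaultdict


def build_column(values):
    """Return (dictionary, forward_index, inverted_index) for one column."""
    dictionary = sorted(set(values))                    # dictId -> value
    dict_id = {v: i for i, v in enumerate(dictionary)}  # value -> dictId
    forward = [dict_id[v] for v in values]              # docId -> dictId
    inverted = defaultdict(list)                        # dictId -> [docId, ...]
    for doc_id, d in enumerate(forward):
        inverted[d].append(doc_id)
    return dictionary, forward, dict(inverted)


# Raw data from the slide: docId 0 = (us, 002), 1 = (ca, 001), 2 = (jp, 003)
country_dict, country_fwd, country_inv = build_column(["us", "ca", "jp"])
code_dict, code_fwd, code_inv = build_column(["002", "001", "003"])

print(country_dict, country_fwd)  # ['ca', 'jp', 'us'] [2, 0, 1]
print(code_dict, code_fwd)        # ['001', '002', '003'] [1, 0, 2]
```

A filter such as country = us then becomes a dictionary lookup (us has dictId 2) followed by either a scan of the forward index or a lookup in the inverted index.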
  20. 20. Today’s agenda 1. Motivation 2. Architecture Overview 3. Scaling Pinot 4. Q&A
  21. 21. Recap: Analytics Use Cases
        Use case | Response latency | Query rate | Possible solution
        Interactive dashboard | sub-second to a few seconds | ~1 qps | Columnar store
        Site facing | 100 ms (99th percentile) | 1000s qps | KV store (pre-cube)
        Anomaly detection | sub-second to a few seconds | 10-100s qps | Streaming engine
        Same input data (Pageview). Same OLAP-style query. Different solutions based on different workload characteristics.
  22. 22. Interactive Dashboard
        Use case: Interactive dashboard | Response latency: sub-second to a few seconds | Query rate: ~1 qps | Possible solution: Columnar store
        select sum(pageView), time from T where country = us, browser = chrome, … group by time
        [Figure: latency histogram; axes: latency (milliseconds), 0 to 500, and frequency; series: pinot, druid]
  23. 23. Site Facing
        Use case: Site facing | Response latency: 100 ms (99th percentile) | Query rate: 1000s qps | Possible solution: KV store
        select sum(pageView) from T where memberId = xx, privacySettings in … group by time, [title|geo|industry]
        [Figure: two plots of latency (milliseconds) against queries per second; series: druid, pinot]
  24. 24. Pinot Optimizations For Site Facing Use Cases • Optimizing Query Processing 1. Sorted Index + Dynamic execution planning • Optimizing Scatter and Gather 1. Smart segment assignment and routing 2. Data partitioning and pruning
  25. 25. Optimizing Query Processing: Sorted Index
        • Access to both forward and inverted index
        • Fetch a contiguous block and benefit from locality
        • For item filtering, pick scanning or the inverted index based on the cardinality of the sorted column
        Sorted index (memberId | start docId | end docId): 123 | 0 | 100; 456 | 101 | 300; …
        Forward index (docId | memberId): 0 | 123; …; 100 | 123; 101 | 456; …; 300 | 456; …
        select … where memberId = 456, item in(…) group by …
        [Figure: latency (milliseconds) against queries per second; series: sorted index, inverted index]
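
A minimal Python sketch of the sorted-index lookup follows. It assumes the column is stored sorted by memberId, so every value occupies one contiguous docId range that can be found by binary search; `doc_range` is a hypothetical helper, not a Pinot API.

```python
from bisect import bisect_left, bisect_right


def doc_range(sorted_column, value):
    """Return the inclusive (start_doc_id, end_doc_id) range for value, or None."""
    start = bisect_left(sorted_column, value)
    end = bisect_right(sorted_column, value) - 1
    return (start, end) if start <= end else None


# As on the slide: docIds 0-100 hold memberId 123, docIds 101-300 hold memberId 456.
member_id = [123] * 101 + [456] * 200

print(doc_range(member_id, 456))  # (101, 300)
print(doc_range(member_id, 123))  # (0, 100)
print(doc_range(member_id, 999))  # None
```

Any remaining predicate (for example item in (…)) only needs to be evaluated inside the returned block, which is where the locality benefit comes from.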
  26. 26. Optimizing Scatter and Gather: Querying All Servers
        Replica group: a set of servers that contains a complete set of all segments.
        [Diagram: segments 1-4 on servers, query 1 and query 2, replica groups RG1 and RG2]
        [Figure: latency (milliseconds) against queries per second; series: without routing optimization, with routing optimization]
        Problem: Querying all servers | Impact: 99th percentile latency is impacted by the slowest server (e.g. GC) | Solution: Control the number of servers to fan out to
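
Replica-group routing can be sketched as follows. The server names and the segment layout are hypothetical, and `route` is an illustrative helper rather than Pinot's routing code; the only property taken from the slide is that each replica group holds a complete copy of all segments, so one query needs to fan out to one group only.

```python
import random


def route(replica_groups, rng=random):
    """Pick one replica group and return its {server: [segments]} plan."""
    return rng.choice(replica_groups)


# Two replica groups, each holding a full copy of segments 1-4 on two servers.
replica_groups = [
    {"server1": [1, 2], "server2": [3, 4]},  # RG1 (hypothetical layout)
    {"server3": [1, 3], "server4": [2, 4]},  # RG2 (hypothetical layout)
]

plan = route(replica_groups, random.Random(0))
covered = sorted(seg for segs in plan.values() for seg in segs)
print(len(plan), covered)  # 2 [1, 2, 3, 4]
```

With four servers in total, each query touches two of them instead of four, so a single slow server affects fewer queries.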
  27. 27. Optimizing Scatter and Gather: Querying All Segments
        [Diagram, before: query 1 and query 2 each go to segments S1, S2, S3, S4]
        [Diagram, after: S1 (p=1), S2 (p=1), S3 (p=2), S4 (p=2); query 1 (mid = p1); query 2 (mid = p2)]
        Problem: Querying all segments | Impact: More CPU work on server | Solution: Minimize the number of segments (partitioning and pruning)
        select … where memberId = 456, item in(…) group by …
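
Partition-based pruning can be sketched like this. The modulo partition function is an assumption made for illustration (the slides do not say which function is used), and `partition_of` and `prune_segments` are hypothetical names.

```python
def partition_of(member_id: int, num_partitions: int) -> int:
    # Hypothetical partition function: modulo on the key, numbered from 1
    # to match the p=1 / p=2 labels on the slide.
    return member_id % num_partitions + 1


def prune_segments(segment_partitions: dict, member_id: int, num_partitions: int) -> list:
    """Keep only the segments whose partition can contain member_id."""
    target = partition_of(member_id, num_partitions)
    return [seg for seg, p in segment_partitions.items() if p == target]


# Segment -> partition id, as recorded in segment metadata.
segments = {"S1": 1, "S2": 1, "S3": 2, "S4": 2}

print(prune_segments(segments, 456, 2))  # ['S1', 'S2']
print(prune_segments(segments, 457, 2))  # ['S3', 'S4']
```

A query with memberId = 456 is then processed on two segments instead of four.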
  28. 28. Anomaly Detection: Challenge
        for d1 in [us, ca, …]
          for d2 in [key1, key2, …]
            …
              select sum(pageViews) from T where country=d1, page_key=d2, source_app=d3, device_name=d4 … group by country, time
        Filter | Aggregation | Latency
        select … where country = us, … | Slow, scans 60-70% of the data | high
        select … where country = kenya, … | Scans less than 1% | low
        • Latency is not predictable; it depends on the query predicate
        • Monitoring all possible combinations makes the problem worse!
  29. 29. Time vs Space Trade-off
        [Figure: latency against storage requirement, with Columnar Store, KV Store (Pre-computed), and Startree Index]
        Columnar Store: variable latency, low storage overhead
        KV Store (Pre-computed): low latency, high storage overhead
  30. 30. Startree Index Generation
        1. Multidimensional sort
        2. Split on the column and create a node for each value
        3. Create star node (aggregate metric after removing the split column)
        4. Apply 1, 2, 3 for each node recursively and stop when number of records in node < SplitThreshold
        Raw records (docId | country | browser | impression; other dimensions omitted):
        0 | al | ie | 10
        1 | ca | safari | 10
        2 | … | … | …
        … | us | chrome | 10
        … | us | chrome | 10
        … | us | ie | 10
        N | us | safari | 10
        Aggregated records:
        N+1 | * | chrome | 40
        N+2 | * | ie | 20
        N+3 | * | safari | 20
        [Diagram: tree with root; country level: al, ca, …, us, *; browser level: chrome, …, safari]
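
The four generation steps can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not Pinot's implementation: nodes are plain dicts, the single metric is summed, dimensions are split in a fixed order, and `build_star_tree` is a hypothetical name. The sample data is a small made-up set, not the elided table from the slide.

```python
from collections import defaultdict


def build_star_tree(records, dims, split_threshold):
    """records: sorted list of (dimension_values_tuple, metric)."""
    node = {"records": len(records), "sum": sum(m for _, m in records), "children": {}}
    # Step 4: stop when the node is small enough (or no dimension is left).
    if len(records) < split_threshold or not dims:
        return node
    # Step 2: split on the first remaining column, one child per value.
    groups = defaultdict(list)
    for values, metric in records:
        groups[values[0]].append((values[1:], metric))
    for value, group in groups.items():
        node["children"][value] = build_star_tree(group, dims[1:], split_threshold)
    # Step 3: star node, aggregating the metric after removing the split column.
    merged = defaultdict(int)
    for values, metric in records:
        merged[values[1:]] += metric
    node["children"]["*"] = build_star_tree(sorted(merged.items()), dims[1:], split_threshold)
    return node


raw = [(("us", "chrome"), 10), (("al", "ie"), 10), (("us", "ie"), 10),
       (("ca", "safari"), 10), (("us", "chrome"), 10), (("us", "safari"), 10)]
# Step 1: multidimensional sort, then build.
tree = build_star_tree(sorted(raw), ["country", "browser"], split_threshold=2)

print(tree["sum"])                                       # 60
print(tree["children"]["*"]["children"]["chrome"]["sum"])  # 20
print(tree["children"]["us"]["children"]["*"]["sum"])      # 40
```

In this sketch, a query such as sum over browser = chrome for any country reads the pre-aggregated node under the country star node instead of scanning the raw us records.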
  31. 31. Time vs Space Trade-off with Startree
        [Figure: latency against storage requirement, with Columnar Store, KV Store (Pre-computed), and Startree Index]
        SplitThreshold = infinity: no prematerialization
        SplitThreshold = 1: full materialization
        SplitThreshold = 100,000: partial, data-aware materialization
  32. 32. Startree Query Execution
        select sum(pageViews) from T where country = AL
        select sum(pageViews) from T where country = CA
        select sum(pageViews) from T where browser = Chrome
        select sum(pageViews) from T
        select sum(X) from T where d1=v1 and d2=v2 and …
        Any query pattern will scan fewer than SplitThreshold records.
        [Diagram: tree with root; country level: al, ca, …, us, *; browser level: chrome, …, safari, *; leaves point to raw docs and aggregated docs]
  33. 33. Anomaly Detection
        [Figure: latency (milliseconds) against queries per second; series: druid, pinot with inverted index, pinot with startree index]
        Use case: Anomaly detection | Response latency: sub-second to a few seconds | Query throughput: 10-100s queries per second | Possible solution: Streaming engine
  34. 34. Pinot vs Druid
        Feature | Druid | Pinot
        Inverted index | Always on all columns, fixed | Configurable on a per-column basis
        Query execution layer | Fixed plan | Split into planning and execution
        Data organization | N/A | Sorted column
        Partitioning | Only available for time column | Available for any column
        Controlling query fan-out | N/A | Replica group based segment assignment and routing
        Smart pre-materialization | N/A | Star-tree
  35. 35. Can we support all these use cases in one single system?
        Use case | Response latency | Query rate | Solution
        Interactive dashboard | sub-second to a few seconds | ~1 qps | Pinot
        Site facing | 100 ms (99th percentile) | 1000s qps | Pinot
        Anomaly detection | sub-second to a few seconds | 10-100s qps | Pinot
