Real-time Analytics with
Trino and Apache Pinot
Xiang Fu, Elon Azoulay
Oct 22, 2021
Speaker Intro
Xiang Fu
● Co-founder - StarTree
● Previously: Streaming Platform @ Uber
● PMC & Committer: Apache Pinot
Elon Azoulay
● Software Engineer - Stealth Mode Startup
● Previously: Data @ Facebook
Agenda
● Today’s Compromises: Latency vs. Flexibility
● No trade-off with Apache Pinot
● Trino Pinot Connector
● Benchmark
Today’s Compromises: Latency vs Flexibility
Flexibility takes time: Join on the Fly
Tables: customers, orders
FILTER: customers.state = 'California' AND customers.gender = 'Female'
JOIN: JOIN customers ON (customers.customer_id = orders.customer_id)
GROUP BY: customers.city, Month(orders.date)
AGGREGATION: sum(orders.amount)
- Flexible to do any computation
- High query cost: disk & network I/O, data partitioning, data serde
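As a rough illustration of the four stages on this slide, the sketch below runs filter, join, group by, and aggregation over a few in-memory rows. The sample data and the function name `join_on_the_fly` are hypothetical; only the column names come from the slide. It is a toy model, not how Trino or Pinot execute the query.

```python
from collections import defaultdict

# Hypothetical sample rows; only the column names follow the slide.
customers = [
    {"customer_id": 1, "state": "California", "gender": "Female", "city": "San Jose"},
    {"customer_id": 2, "state": "California", "gender": "Male", "city": "Fresno"},
    {"customer_id": 3, "state": "Oregon", "gender": "Female", "city": "Portland"},
]
orders = [
    {"customer_id": 1, "date": "2021-09-03", "amount": 20.0},
    {"customer_id": 1, "date": "2021-09-17", "amount": 5.0},
    {"customer_id": 1, "date": "2021-10-01", "amount": 7.5},
    {"customer_id": 2, "date": "2021-09-09", "amount": 11.0},
    {"customer_id": 3, "date": "2021-09-12", "amount": 30.0},
]


def join_on_the_fly(customers, orders):
    # FILTER: keep only the customers matching the predicate.
    selected = {
        c["customer_id"]: c
        for c in customers
        if c["state"] == "California" and c["gender"] == "Female"
    }
    totals = defaultdict(float)
    for order in orders:
        # JOIN: look up the customer for each order; every order row is visited.
        customer = selected.get(order["customer_id"])
        if customer is None:
            continue
        # GROUP BY city, Month(date); AGGREGATION: sum(amount).
        month = int(order["date"][5:7])
        totals[(customer["city"], month)] += order["amount"]
    return dict(totals)


result = join_on_the_fly(customers, orders)
print(result)
```

Every query repeats the full scan of `orders` and the join lookup, which is the query cost the slide refers to.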
ETL Trade-offs: Pre-joined Table
Tables: customers, orders
JOIN (in ETL): JOIN customers ON (customers.customer_id = orders.customer_id), producing the pre-joined table user_orders_joined
FILTER: state = 'California' AND gender = 'Female'
GROUP BY: city, Month(orders.date)
AGGREGATION: sum(amount)
- Flexible to explore user dimensions
- Query time is still proportional to the data scan, not predictable
ETL Trade-offs: Pre-aggregated Table
Pre-joined table: user_orders_joined
Aggregation + GroupBy (in ETL): SELECT sum(amount) AS sum_amount, date, city GROUP BY date, city, producing the pre-aggregated table user_orders_aggregated
FILTER: state = 'California' AND gender = 'Female'
GROUP BY: city, Month(orders.date)
AGGREGATION: sum(sum_amount)
- Reduced query runtime workload
- Query time is still proportional to the product of the cardinalities of the non-group-by columns
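The two steps on this slide can be sketched as follows: an ETL step that collapses the pre-joined rows by (date, city), and a query step that sums the stored partial sums by (city, month). The sample rows and function names are hypothetical. The slide's filter is left out because the aggregated rows in this sketch keep only date and city.

```python
from collections import defaultdict

# Hypothetical rows of the pre-joined table user_orders_joined.
user_orders_joined = [
    {"city": "San Jose", "date": "2021-09-03", "amount": 20.0},
    {"city": "San Jose", "date": "2021-09-03", "amount": 5.0},
    {"city": "San Jose", "date": "2021-10-01", "amount": 7.5},
    {"city": "Fresno", "date": "2021-09-09", "amount": 11.0},
]


def pre_aggregate(rows):
    """ETL step: SELECT sum(amount) AS sum_amount, date, city GROUP BY date, city."""
    sums = defaultdict(float)
    for row in rows:
        sums[(row["date"], row["city"])] += row["amount"]
    return [
        {"date": date, "city": city, "sum_amount": total}
        for (date, city), total in sums.items()
    ]


def monthly_totals(aggregated_rows):
    """Query step: sum(sum_amount) GROUP BY city, Month(date)."""
    totals = defaultdict(float)
    for row in aggregated_rows:
        month = int(row["date"][5:7])
        totals[(row["city"], month)] += row["sum_amount"]
    return dict(totals)


user_orders_aggregated = pre_aggregate(user_orders_joined)
totals = monthly_totals(user_orders_aggregated)
print(len(user_orders_joined), "raw rows ->", len(user_orders_aggregated), "aggregated rows")
print(totals)
```

The query now scans one row per (date, city) pair rather than one row per order, but the number of rows it scans still grows with the number of distinct dates and cities.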
ETL Trade-offs: Pre-cubed Table
Pre-joined table: user_orders_joined
Cubing (in ETL): SELECT sum(amount) AS sum_amount, date, Month(date) AS month, city GROUP BY CUBE (date, city, Month(date)), producing the table user_orders_cubed (labeled Pre-Aggregated Table)
FILTER: state = 'California' AND gender = 'Female'
PROJECTION: sum_amount, month, city
- Predictable query runtime
- Storage overhead: one raw record translates to multiple records
- Dimension explosion
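To make the storage overhead concrete, here is a minimal sketch of cubing: each raw record contributes to one cell for every subset of the dimensions, so with three dimensions a single record produces 2^3 = 8 cells. The function name `cube` and the sample rows are hypothetical, and `None` stands for a rolled-up dimension.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("date", "city", "month")


def cube(rows, dimensions=DIMENSIONS, measure="amount"):
    """Sum `measure` for every subset of `dimensions`, like GROUP BY CUBE.

    A dimension that is rolled up in a cell is stored as None.
    """
    cells = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for kept in combinations(dimensions, size):
                key = tuple(row[d] if d in kept else None for d in dimensions)
                cells[key] += row[measure]
    return dict(cells)


# Hypothetical pre-joined rows.
rows = [
    {"date": "2021-09-03", "city": "San Jose", "month": 9, "amount": 20.0},
    {"date": "2021-09-17", "city": "San Jose", "month": 9, "amount": 5.0},
]

single_record_cells = cube(rows[:1])
cells = cube(rows)

# Query time is a lookup (projection) rather than a scan.
print(cells[(None, "San Jose", 9)])
```

The lookup cost is predictable, but the cell count doubles with every added dimension, which is the dimension explosion listed on the slide.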
No Trade-off with Apache Pinot
[Figure: spectrum of approaches (Fact Table + Dimension Table, Pre-Join, Pre-Aggregation, Pre-Cube) compared on three axes, Latency, Flexibility, and Throughput, each running between low and high.]
Apache Pinot Overview
- Ingestion: millions of events/sec
- Workload: thousands of queries/sec
- Performance: millisecond latency
- Operation: clusters of thousands of nodes
[Diagram labels: User Facing Applications, Business Facing Metrics, Anomaly Detection; Real-Time, Offline; ADLS, GCS]
[Architecture diagram labels: BI Visualization, Data Products, Anomaly Detection; Real-Time, Offline, OLTP, ADLS, GCS; Server1, Server2, Server3; Broker 1, Broker 2; Controller; Zookeeper]
Secrets Behind Apache Pinot
[Diagram: techniques arranged under the stages Storage, Filter, Aggregation, and Scan, with a legend distinguishing Common Techniques from Pinot techniques. Techniques shown: Columnar Store, Compression, Byte Encoding, Bit/RLE Encoding, Bloom Filter, Inverted Index, Sorted Index, Range Index, Text Index, Star-Tree Index, Star-Tree Pre-aggregation, Per-segment flexible query planning]
Apache Pinot - StarTree Index
• Configurable trade-off between latency and space through partial pre-aggregation
• Can achieve a hard upper bound on query latency
[Chart: Latency vs. Storage. Points: No pre-computation; Partial pre-computation (StarTree Index) at T = 10000 and T = 100; Full Pre-Cube (KV Store)]
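For reference, a star-tree index is declared in the Pinot table config. The sketch below builds such a fragment as a Python dict and prints it as JSON. It assumes that the threshold T on the chart corresponds to the `maxLeafRecords` setting, and it reuses the column names from the earlier slides (`state`, `gender`, `city`, `amount`) purely for illustration. The field names are given as recalled from the Pinot documentation and should be checked against the version in use.

```python
import json


def star_tree_index_config(max_leaf_records):
    """Build one star-tree index entry; column names are illustrative only."""
    return {
        # Order in which dimensions are split when the tree is built.
        "dimensionsSplitOrder": ["state", "gender", "city"],
        "skipStarNodeCreationForDimensions": [],
        # Pre-aggregated metrics, written as FUNCTION__column.
        "functionColumnPairs": ["SUM__amount"],
        # Assumed to be the threshold T from the chart: an upper bound on the
        # records kept under a leaf node before further splitting.
        "maxLeafRecords": max_leaf_records,
    }


table_index_config = {"starTreeIndexConfigs": [star_tree_index_config(10000)]}
print(json.dumps(table_index_config, indent=2))
```

Under that assumption, a smaller threshold such as 100 buys lower latency at the price of more pre-aggregated storage, which matches the direction of the chart.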
Trino Overview
Source: Trino Architecture
Trino Pinot Connector
[Diagram labels: BI Visualization, Data Products, Anomaly Detection, Ad hoc Analysis; ADLS, GCS]
Trino Pinot Connector
Trino Pinot Connector: Aggregation Pushdown
Chasing the light: Aggregation pushdown
- Issues a single Pinot broker request
- Best-effort pushdown for aggregations such as count/sum/min/max/distinct/approximate_distinct, etc.
- 10x to 100x latency improvement
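The effect of pushdown can be sketched with a toy model: when the aggregation runs inside the broker request, only one row per group travels back to Trino instead of every raw row. Everything here is hypothetical (`PINOT_ROWS`, `broker_query`, `count_by_team`); it does not call the real connector or Pinot APIs.

```python
from collections import Counter

# Hypothetical stand-in for rows stored in Pinot.
PINOT_ROWS = [{"team": "Giants"}] * 3 + [{"team": "Mets"}] * 2


def broker_query(group_by=None):
    """Toy broker: return raw rows, or (key, count) pairs if the aggregation is pushed down."""
    if group_by is None:
        return list(PINOT_ROWS)
    counts = Counter(row[group_by] for row in PINOT_ROWS)
    return sorted(counts.items())


def count_by_team(pushdown):
    """Return (result, rows transferred from the broker) for count(*) GROUP BY team."""
    if pushdown:
        # Aggregation happens in the broker; the engine only relays the result.
        transferred = broker_query(group_by="team")
        result = dict(transferred)
    else:
        # All raw rows are transferred and aggregated on the engine side.
        transferred = broker_query()
        result = dict(Counter(row["team"] for row in transferred))
    return result, len(transferred)


with_pushdown = count_by_team(pushdown=True)
without_pushdown = count_by_team(pushdown=False)
print(with_pushdown, without_pushdown)
```

Both paths return the same counts; the difference is the number of rows that cross the wire, which grows with table size only when pushdown is off.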
Passthrough Broker Queries
Group by Expression Support
SELECT CASE WHEN team = 'Giants' THEN 'BIG' ELSE 'SMALL' END AS size,
       team, count(*)
FROM pinot.default."SELECT team
  FROM baseball_stats WHERE conference = 'America East'"
GROUP BY CASE WHEN team = 'Giants' THEN 'BIG' ELSE 'SMALL' END, 2
Passthrough Broker Queries
Order by Expression Support
SELECT team, count(*)
FROM pinot.default."SELECT team, player
  FROM baseball_stats WHERE conference = 'America East'"
ORDER BY CASE WHEN team = 'Giants' THEN 'BIG' ELSE 'SMALL' END, 2
Trino Pinot Connector: Server Query + Pinot Streaming API
Pinot Streaming (gRPC) Connector
- Distributes the workload in parallel across Trino workers
- Configurable memory footprint when pulling data from Pinot
- Opens the door to queries that require a full table scan or a join
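The first two bullets can be pictured with a small sketch: segments are spread across workers, and each worker pulls rows in bounded chunks so that only one chunk is held in memory at a time. The round-robin assignment and the names `assign_segments` and `stream_rows` are assumptions for illustration, not the connector's actual split scheduling or gRPC client.

```python
from itertools import islice


def assign_segments(segments, num_workers):
    """Round-robin assignment of segments to workers (illustrative only)."""
    assignment = [[] for _ in range(num_workers)]
    for index, segment in enumerate(segments):
        assignment[index % num_workers].append(segment)
    return assignment


def stream_rows(rows, chunk_size):
    """Yield rows in chunks of at most chunk_size, one chunk in memory at a time."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


assignment = assign_segments(["s0", "s1", "s2", "s3", "s4"], num_workers=2)
chunk_sizes = [len(chunk) for chunk in stream_rows(range(10), chunk_size=4)]
print(assignment, chunk_sizes)
```

Here `chunk_size` plays the role of the configurable memory footprint: a smaller value lowers peak memory per worker at the cost of more round trips.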
Ongoing and Future Work on the Connector
● Data Insertion
○ Push segments to the controller
○ Add or replace segments
Ongoing and Future Work on the Connector
● Pinot segment deletion
● Table & Column Creation/Alter/Drop
CREATE TABLE DIMTABLE
  (LONG_COL bigint, STRING_COL varchar)
WITH (
  PRIMARY_KEY_COLUMNS = ARRAY['long_col'],
  OFFLINE_CONFIG = '{
    "tableName": "dimtable",
    "tableType": "OFFLINE",
    "isDimTable": true,
    "segmentsConfig": {
      "segmentPushType": "REFRESH",
      "replication": "1"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "loadMode": "MMAP"
    }
  }'
);
Perf Benchmark
Benchmark Config:
- 1 Pinot Controller (4 cores/8GB)
- 1 Pinot Broker (4 cores/8GB)
- 3 Pinot Servers (4 cores/8GB)
- 1 Trino Coordinator (4 cores/8GB)
- 1 Trino Worker (4 cores/8GB)
Data Set:
- 40 million rows
Query:
- Aggregation GroupBy + Predicate
Trino:
- Aggregation pushdown enabled vs. disabled
Perf Benchmark
Query Type: Aggregation Group By + Predicate pushdown
Thank you
- Getting Started: https://tinyurl.com/trinoPinotTutorial
- Run Trino in Kubernetes: https://github.com/trinodb/charts
- StarTree: https://www.startree.ai/
- Apache Pinot: https://pinot.apache.org/
- Pinot on GitHub: https://github.com/apache/pinot
- Pinot Slack: https://tinyurl.com/pinotSlackChannel
- Apache Pinot Twitter: https://twitter.com/ApachePinot
- Apache Pinot Meetup: https://www.meetup.com/apache-pinot
- Starburst: https://www.starburst.io/
- Trino: https://trino.io/
- Trino on GitHub: https://github.com/trinodb/trino
- Trino Slack: https://trino.io/slack.html


Editor's Notes

  • #11 Realtime OLAP database: columnar, indexed storage; low-latency analytics; distributed (highly available, reliable, scalable); Lambda architecture (offline data pushes, real-time stream ingestion); open source.
  • #12 Explain the controller: a single coordinator controlling all actions. Cluster state maintains the partition assignment (partition-to-server mapping) as well as the start and end offsets.
  • #14 For certain query patterns (slice and dice on a given list of dimensions), we allow users to configure an upper bound on the number of documents to scan. Pinot will intelligently pre-aggregate part of the records to meet the requirement, without exploding the storage.
  • #15 Components: coordinator (query endpoint, metadata) and workers (process query results). Divides the work into splits, which are processed in parallel. Results are returned from the coordinator in the final phase of processing.
  • #17 Join two Pinot tables.
  • #19 Build the broker query; push down filter, aggregation, limit; produce a single broker split; submit the broker request; produce results to Trino; process joins, other filters, aggregations, and the final limit; return results to the client.
  • #20 Build the broker query; push down filter, aggregation, limit; produce a single broker split; submit the broker request; produce results to Trino; process joins, other filters, aggregations, and the final limit; return results to the client.
  • #21 The first step is to get the metadata from the Pinot controller. Talk about how this is configurable with the cache TTL config.
  • #24 Pinot: fast single-table OLAP. Trino: powerful connector ecosystem. A complete system that covers the entire landscape. Get the best of Trino and Pinot. Proven stack at Uber and many more.
  • #25 Pinot: fast single-table OLAP. Trino: powerful connector ecosystem. A complete system that covers the entire landscape. Get the best of Trino and Pinot. Proven stack at Uber and many more.