Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng Fu, Uber

Real-time analytics with upsert using Apache Kafka and Apache Pinot

About Me
Yupeng Fu (yupeng9@github)
● Staff Engineer @ Uber Inc.
● Real-time Data Infrastructure lead
● Committer: Apache Pinot, Alluxio

Real Time Use cases @ Uber
Exploration
Dashboards
Application
Machine Learning

Apache Kafka @Uber
● De facto standard for data streaming
● Use cases at Uber
○ Pub/sub
○ Real-time analytics
○ Stream processing
○ Change Data Capture (CDC)
○ Ingestion into data lake
○ Logging
Trillions Msg/Day
PBs Data/day
Tens of Thousands Topics
Thousands Services

Apache Pinot for real-time OLAP
Peak QPS: 170k+
Events/sec: 1M+
Query Latency: ms

Apache Pinot for real-time OLAP
● Distributed, columnar database
● Chosen for its
○ High QPS, low latency query support
○ Cost effective as compared to others
● Use cases at Uber
○ User-Facing Analytics (Restaurant Manager, Orders near me)
○ Dashboards
○ Operational Intelligence
○ Financial Intelligence
Hundreds TBs Data
Tens of Thousands QPS
Milliseconds latency
99.99% Uptime

Pinot’s High Level Architecture
[Figure: architecture diagram showing the realtime pipeline, the batch pipeline, the data plane, and the control plane]

Why upsert in Pinot?
● Ingested data from Kafka can be updated or corrected
● Deliver an accurate and up-to-date real-time view
● No easy workaround in SQL
SELECT currentStatus, count(*)
FROM uberEatsOrders
WHERE regionId = 1366
  AND minutesSinceEpoch BETWEEN 25432140 AND 25433580
GROUP BY currentStatus
TOP 10000
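
To make the problem concrete, here is a minimal Python sketch (not Pinot code) over made-up order events. Every status change is appended as a new row, so a plain group-by counts an order once per status it has ever had, while an upsert view counts only the latest row per primary key. The field names `orderId`, `currentStatus`, and `ts` and all sample values are illustrative assumptions.

```python
from collections import Counter

# Hypothetical append-only event rows, in the order they arrive from Kafka.
events = [
    {"orderId": "aa", "currentStatus": "CREATED", "ts": 1},
    {"orderId": "bb", "currentStatus": "CREATED", "ts": 2},
    {"orderId": "aa", "currentStatus": "DELIVERED", "ts": 3},
    {"orderId": "bb", "currentStatus": "CANCELLED", "ts": 4},
]


def count_by_status(rows):
    """Equivalent of GROUP BY currentStatus with count(*)."""
    return Counter(row["currentStatus"] for row in rows)


def latest_per_key(rows):
    """Keep only the newest row for each orderId (the upsert view)."""
    latest = {}
    for row in rows:
        current = latest.get(row["orderId"])
        if current is None or row["ts"] >= current["ts"]:
            latest[row["orderId"]] = row
    return list(latest.values())


naive = count_by_status(events)                     # counts stale rows too
upserted = count_by_status(latest_per_key(events))  # one row per order
```

In this sample, `naive` still reports two orders as `CREATED` although neither is in that state any more, while `upserted` reports one `DELIVERED` and one `CANCELLED` order.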

Upsert use cases @ Uber
● Uber Eats
○ e.g. Orders real-time analysis grouped by current status
● Uber Rides
○ e.g. Financial report on corrected ride fares
● Uber Ads
○ e.g. Attribution analysis for ad events
● Uber Freight
○ e.g. Metrics reporting on carrier’s real-time engagement
● Customer Obsession Platform
○ e.g. Real-time metrics updates per contact change
● Segmentation and Targeting Platform
○ e.g. Support online attribute changes on user audiences
● ...

Pinot Data Flow (Realtime)
[Figure: the Pinot Controller, Zookeeper, and the Pinot Broker alongside four Pinot Servers (S1 to S4). Segment-to-server mappings shown: Seg1 -> S1, Seg2 -> S2, Seg3 -> S3, Seg4 -> S4, and with replicas Seg1 -> S1, S4; Seg2 -> S2, S3; Seg3 -> S3, S1; Seg4 -> S4, S2. Example query: select count(*) from X where country = us. Records with PK=1 appear in several segments.]
● Segments are immutable
● Segments are distributed
● Segments are replicated

Global coordinator - first attempt
● A central coordinator to map PK to record locations
● Use Kafka to aggregate metadata and dispatch updates
● Use virtual columns to annotate segment for query rewriting

Global coordinator - pros/cons
● Explored over 1+ year of development and testing
● Advantages
○ Fewer changes to Pinot core
○ No preprocessing needed on the input stream
● Disadvantages
○ Global coordinator is a single point of failure
○ Scalability bottleneck on the input/output Kafka topics
○ Query rewrite complexity over the virtual columns
○ Hard to support partial updates due to row-level annotation

Problem revisited
● The key challenge is establishing global coordination efficiently
● Alternatively, reduce it to a local coordination problem
○ Leverage the partition-by-key feature in Kafka
○ Assign all segments that contain a given primary key to the same server
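
The reduction to local coordination can be sketched in a few lines of Python. This is an illustration only: `zlib.crc32` stands in for the hash (Kafka's default partitioner applies murmur2 to the key bytes), and the partition count and the partition-to-server table are hypothetical, with replication left out for brevity.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count of the input topic

# Hypothetical static assignment of stream partitions to Pinot servers.
PARTITION_TO_SERVER = {0: "S1", 1: "S2", 2: "S3", 3: "S4"}


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a primary key to a partition.

    crc32 is a stand-in hash; the point is only that the same key
    always yields the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def server_for(key: str) -> str:
    """All records of one key land on one server, so that server can
    resolve the latest version of the key without global coordination."""
    return PARTITION_TO_SERVER[partition_for(key)]
```

Because every record for a given key is routed to the same partition, and every segment built from that partition is hosted by the same server, the server holding a key sees all of its versions.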

Local coordinator - revisited design

Local coordinator - pros/cons
● Advantages
○ Significantly simplified overall architecture
○ Scalability from the shared-nothing architecture
● Challenges
○ Major surgery to Pinot core required
○ Stream processing job required to repartition the input stream

Upsert example flow
Current Status

Upsert example flow
An update on order bb arrived

Upsert example flow
An update on order ee arrived
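
The flow on these slides can be approximated with a small Python model. It is a sketch under stated assumptions, not Pinot's implementation: it assumes each partition keeps a map from primary key to the location of its latest record, plus a per-segment set of valid document IDs that queries use as a filter. The class and method names, segment names, and timestamps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RecordLocation:
    segment: str
    doc_id: int
    timestamp: int


@dataclass
class PartitionUpsertState:
    """Hypothetical per-partition upsert bookkeeping (illustrative only)."""
    locations: dict = field(default_factory=dict)   # primary key -> RecordLocation
    valid_docs: dict = field(default_factory=dict)  # segment -> set of valid doc ids

    def add_record(self, pk, segment, doc_id, timestamp):
        self.valid_docs.setdefault(segment, set())
        previous = self.locations.get(pk)
        if previous is not None and timestamp < previous.timestamp:
            # Late-arriving older version: the row is stored but never valid.
            return
        if previous is not None:
            # Invalidate the old row in place; the segment itself is untouched.
            self.valid_docs[previous.segment].discard(previous.doc_id)
        self.valid_docs[segment].add(doc_id)
        self.locations[pk] = RecordLocation(segment, doc_id, timestamp)

    def is_valid(self, segment, doc_id):
        return doc_id in self.valid_docs.get(segment, set())


state = PartitionUpsertState()
# Current status: three orders in an older segment (made-up data).
state.add_record("aa", "seg1", 0, timestamp=10)
state.add_record("bb", "seg1", 1, timestamp=11)
state.add_record("ee", "seg1", 2, timestamp=12)
# An update on order bb arrives, then an update on order ee.
state.add_record("bb", "seg2", 0, timestamp=20)
state.add_record("ee", "seg2", 1, timestamp=21)
```

After the two updates, only the row for order aa remains valid in `seg1`, and both rows in `seg2` are valid. The older rows for bb and ee still exist in the immutable segment but are filtered out at query time.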

Journey thus far and road ahead

Upsert progress
● First attempted in 09/2018
● Redesign started in 06/2020
● Released in Pinot 0.6, 11/2020
○ Documentation: https://docs.pinot.apache.org/basics/data-import/upsert
○ Design: https://github.com/apache/incubator-pinot/issues/4261

Upsert in action - disable with query option
● Upsert can be disabled on the fly with a query option
○ Analysis for updates, e.g. how many updates per UUID
○ Useful for debugging/troubleshooting
SELECT productTypeUUID as order_uuid,
jobState as current_status,
secondsSinceEpoch
FROM eats_job_state option(skipUpsert=true)
WHERE productTypeUUID = 'eb09ce96-cfd6-4a14-93ed-bc93d82ea600'
ORDER BY secondsSinceEpoch desc
LIMIT 10

Limitations and Next Steps
● Input stream must be partitioned
● Table bootstrap and longer data retention
○ Directly push segments to real-time table
○ https://github.com/apache/incubator-pinot/pull/6567
● Partial update
○ Different merge strategies
○ https://github.com/apache/incubator-pinot/issues/6575
● Certain Pinot indexes (e.g. star-tree) cannot be used
