"We will discuss how we at Pinterest transformed real-time user engagement event consumption.
Every day, we log hundreds of billions of user engagement events across different domains to a few common Kafka topics, which are consumed by hundreds of real-time applications. These applications were built on divergent frameworks (e.g. Spark Streaming, Storm, Flink, and internally developed frameworks using the Kafka Consumer API) without standardized processing logic. This led to repeated implementations of similar logic, multiple codebases to maintain, low data quality, and inconsistency with offline datasets. These issues hurt the scalability, reliability, efficiency, and data accuracy of the applications, and ultimately the quality of real-time content recommendations and the user experience.
To address these challenges, we unified event consumption in our real-time applications by consolidating compute engines onto Flink, splitting the events in those common topics by engagement type, and generating cleansed events with standardized processing aligned on business concepts. Through these efforts, we achieved multi-million-dollar infrastructure savings and a double-digit engagement gain after applications adopted the cleansed events.
Moving forward, we are implementing frameworks to better track and govern the Kafka events and real-time use cases."
4. What is Pinterest?
Pinterest is the visual inspiration platform people around the world use to shop products personalized to their taste, find ideas to do offline and discover the most inspiring creators.
Pinterest’s mission is to bring everyone the inspiration to create a life that they love.
5. Who are we?
We are engineers from Pinterest Data Eng.
Data Eng’s mission is to create and run reliable, efficient and planet-scale data platforms and services to accelerate innovation and sustain Pinterest business.
13. Scaling challenges and solutions 2019 ~ 2020 Stability
Challenge
● outbound traffic = number of jobs * inbound traffic
● Kafka clusters hosting the event topics had very high resource saturation
Observation
● each job only needs to process a few common event types (e.g. click, view)
● events of those common types are a small portion of all events
[Diagram: one shared "event" topic (type 1, type 2, …, type M) consumed in full by each of streaming jobs 1…N]
14. Solution
Stream Splitter v1
● Flink DataStream API
● Job graph consists of source, filter and sink
● Filter operator only keeps events of a small set of types required by downstream
[Diagram: "event" topic (type 1, type 2, …, type M) → Stream Splitter v1 → "event_core" topic (type i, type j, type k)]
Scaling challenges and solutions 2019 ~ 2020 Stability
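The v1 filter operator described above can be sketched as a simple type-set predicate. This is a plain-Java illustration (type names are hypothetical; in the real job this logic would live in a Flink FilterFunction):

```java
import java.util.Set;

public class EventTypeFilter {
    // Small set of event types required by downstream jobs (illustrative values).
    private final Set<String> keepTypes;

    public EventTypeFilter(Set<String> keepTypes) {
        this.keepTypes = keepTypes;
    }

    // Keep an event only if its type is in the required set.
    public boolean filter(String eventType) {
        return keepTypes.contains(eventType);
    }
}
```

Because the kept types are a small portion of all events, the derived topic is far smaller than the source topic.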
15. Win
● The derived topic was about 10% of the original event topics, and the high Kafka cluster resource saturation issue was mitigated.
● Due to the smaller input QPS, jobs processing the derived topic required less CPU / memory, and AWS cross-AZ traffic cost was reduced. Infra savings!!!
[Diagram: "event_core" topic (type i, type j, type k) consumed by streaming jobs 1…N]
Scaling challenges and solutions 2019 ~ 2020 Stability
16. Challenge
● With new jobs requiring more event types, the derived topic grew larger and larger (10% -> 30% of the original event topics)
● Infra cost grew significantly as new jobs onboarded
Observation
● QPS for each job became larger due to the growth of the derived topic, so each job required more resources
● Each job still had to filter input events by type to get what it needed
Scaling challenges and solutions 2021 ~ 2022 Efficiency
17. Solution
Stream Splitter v2
● Flink SQL
● Job consists of a statement set of DML statements, e.g. insert into event_type_i (select * from event where type = type_i)
● one DML statement per per-type event topic
[Diagram: "event" topic (type 1, type 2, …, type M) → Stream Splitter v2 → per-type topics event_type_i, event_type_j, event_type_k, …]
Scaling challenges and solutions 2021 ~ 2022 Efficiency
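The statement set can be sketched in Flink SQL like this (table and type names are illustrative; the real jobs have one INSERT per per-type topic):

```sql
-- All INSERTs run as a single Flink job sharing one scan of the source topic.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO event_type_i SELECT * FROM event WHERE type = 'type_i';
  INSERT INTO event_type_j SELECT * FROM event WHERE type = 'type_j';
  INSERT INTO event_type_k SELECT * FROM event WHERE type = 'type_k';
END;
```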
18. Scaling challenges and solutions 2021 ~ 2022 Efficiency
Win
● Downstream jobs only process the several per-type event topics that they need
● Downstream jobs no longer need filter logic
● Downstream jobs require much less infra resources (infra savings!!!)
● Setting up a new pipeline only requires a new topic and a SQL statement
[Diagram: per-type topics event_type_i, event_type_j, event_type_k, … consumed by streaming jobs 1…N]
19. Scaling challenges and solutions 2021 ~ 2022 Efficiency
Issues with Stream Splitter v2
● All the records coming out of the source operator are forwarded to every pipeline
● Stream Splitter v2 jobs cost twice as much as v1 jobs
Note: the job graph is generated by the internal SQL planner from the DML statements; other operators that do not affect the data transportation pattern are not shown for better visualization.
[Diagram: Kafka Source (QPS M) forwards all M records to each filter (filter i: type = type_i; filter j: type = type_j; filter k: type = type_k; …); each filter i then passes only Mi records to Kafka Sink i]
20. Scaling challenges and solutions 2021 ~ 2022 Data quality
Challenge
● Streaming and batch workflows generated inconsistent results
Observation
● Streaming jobs re-implemented much of the batch ETL logic without standardization
[Diagram: from the same "event" source, streaming jobs counted 70 impressions while the batch workflows producing the DWH SOT tables counted 100 impressions]
21. Scaling challenges and solutions 2021 ~ 2022 Data quality
Solution
Real time DWH streams
● Built with NRTG - a mini framework on top of a subset of the Flink DataStream API (the Flink state API is not supported)
● Job graph consists of source, filter, enrich, dedup and sink
● filter, enrich and dedup logic reuses that of the batch ETL
● dedup keys are stored in off-heap memory (with pre-configured memory size) via the 3rd-party library ohc
[Diagram: "event" topic (type 1, type 2, …, type M) → Real time DWH streams job → "dwh_event" topic (enriched and deduped; type i, type j, type k)]
Caveat: dedup accuracy is compromised during task restarts or job deployments, as the in-memory dedup keys are lost. It takes up to 1 day’s raw events to rebuild the state.
22. Scaling challenges and solutions 2021 ~ 2022 Data quality
Improved Solution
Real time DWH streams with native Flink state
● Native Flink state API support is added to NRTG
● Dedup operator is re-written using Flink MapState to store dedup keys with a 1-day TTL
● RocksDB state backend, with S3 storing active (read / write) keys and backups
● Savepoint size is tens of TB. Full state is preserved during task restarts and job redeployments
[Diagram: "event" topic (type 1, type 2, …, type M) → improved job → "dwh_event" topic (enriched and deduped; type i, type j, type k)]
Dedup accuracy is guaranteed during task restarts or job redeployments with a specified checkpoint (from S3).
23. Scaling challenges and solutions 2021 ~ 2022 Data quality
Win
● Downstream jobs reading dwh_event can generate results consistent with the batch workflows; the computed real-time signals used in recommendation helped boost Pinterest engagement metrics by double digits.
● Downstream jobs no longer need to implement enrich and dedup logic, and job graphs are simplified to focus only on the business logic.
[Diagram: streaming jobs reading "dwh_event" now count 70 impressions, matching the 70 impressions in the batch workflows' DWH SOT tables]
24. Scaling challenges and solutions 2021 ~ 2022 Data quality
Issues with the Real-time DWH streams job
● The generated dwh_event topic contains multiple types, so downstream jobs read unnecessary data and thus still implement filter logic
● The mini framework introduces extra overhead
● Supporting a new type is slow - the logic for processing different types is coupled together due to the mini framework’s API requirements
25. Two solutions for pre-processing engagement events
Stream Splitter
● Pros: efficient downstream consumption; fast onboarding
● Cons: no data quality; repetitive processing logic in downstream jobs; inefficient job runtime (data duplication)
Realtime DWH
● Pros: data quality; simplified downstream job logic
● Cons: slow onboarding; inefficient downstream consumption; inefficient job runtime (framework overhead)
With two solutions, downstream job developers are confused about which to use, and both infra cost and KTLO cost double.
26. Unified Solution - Requirements
● Efficiency
○ Pre-processing jobs have an efficient runtime
○ Downstream jobs only read the events they need to process
● Data quality
○ Downstream jobs read enriched and deduped events that can generate results consistent with the batch workflows
● Dev velocity
○ Supporting a new type in the pre-processing jobs should be simple and easy to enable without affecting the existing pipelines
○ Downstream jobs no longer port the filter-and-enrich logic from batch ETL and no longer implement deduplication on the data source
● KTLO
○ maintain one unified solution rather than 2 solutions
27. Unified solution - API choice
● Flink DataStream API
● Flink SQL
● Mini framework like NRTG
● Flink Table API - our final choice
○ It is more expressive than Flink SQL - complex logic can’t be easily implemented in SQL
○ It is very flexible
■ sources and sinks are registered and accessed as Flink tables
■ a Table is easily converted to a DataStream when we want to leverage low-level features
○ It does not have any extra framework overhead like NRTG
28. Unified solution - job framework
Framework design
● Each output stream is generated through a pipeline made up of filter, enrich, dedup and sink operators
● Pipelines are pluggable and independent from each other
● Classes from the batch ETL are re-used to maintain consistent semantics
● Java reflection is leveraged to easily configure each pipeline
Job graph optimization - side outputs
● A job operator assigns every source event, based on its type, to the matching pipeline through side outputs
● Essentially we are implementing “filter pushdown” to reduce unnecessary data transportation
31. Platinum Event Streams - What it offers
[Diagram: raw event topic → platinum event streams → streaming applications, providing Standardized Event Selection, Event Deduplication and Downstream Efficiency]
32. Platinum Event Streams - User Flow
Before: streaming app developers had to ask the logging / metric owners directly - “I want to use event A as one of my signals, what’s the correct logic to process it from raw events?”
After: platinum event streams sit between the logging owners, metric definition owners and data warehouse team on one side and streaming app developers on the other. Faster onboarding w/ guaranteed quality and efficiency!
34. Platinum Event Stream - Flink Processing
[Diagram: Kafka Source Table (QPS M) → Splitter w/ Filters → side outputs 1…N (QPS M1…MN) → per-pipeline Enrich i → Dedup i → Kafka Sink i]
35. Platinum Event Stream - Splitter w/ Filters
Splitter Functionalities:
1. Filter out the events we don’t need.
2. Split the stream into different sub-pipelines according to event types.
36. Platinum Event Stream - Splitter w/ Filters
Metric Repository (shared by batch and streaming processing) defines the per-event logic, e.g. for Event / Metric X:
  def filter(event: Event): Boolean = ……
  def createDedupKey(event: Event) = ……
→ Standardized event selection, consistent w/ batch applications.
37. Platinum Event Stream - Splitter w/ Filters
Splitter functionalities: (1) Filtering (2) Split streams
Solution 1 - FlatMapFunc: the Kafka Source Table forwards the full stream (QPS M) to each of the N FlatMapFuncs. Severe back pressure and scalability issues when input traffic is high.
Solution 2 - Side Output: the splitter initializes a Map<event type, pipeline tag>; when processing, it emits each event with the corresponding pipeline tag and throws the event away if not needed. Each side output i carries only Mi, the QPS needed by pipeline i, and ΣMi << M. Scalability issue solved!
(M - QPS of the input raw event stream; Mi - QPS of side output i)
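The side-output routing can be sketched outside Flink as a tag-routing map. Class, tag and type names below are illustrative; in the real job this is a Flink process function emitting to OutputTags:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Splitter {
    // event type -> pipeline tag (in Flink, the tag would be an OutputTag).
    private final Map<String, String> typeToTag;
    // pipeline tag -> events emitted to that side output (stand-in for Flink's side outputs).
    private final Map<String, List<String>> sideOutputs = new HashMap<>();

    public Splitter(Map<String, String> typeToTag) {
        this.typeToTag = typeToTag;
    }

    public void process(String eventType, String payload) {
        String tag = typeToTag.get(eventType);
        if (tag == null) {
            return; // type not needed by any pipeline: throw the event away
        }
        sideOutputs.computeIfAbsent(tag, t -> new ArrayList<>()).add(payload);
    }

    public List<String> sideOutput(String tag) {
        return sideOutputs.getOrDefault(tag, List.of());
    }
}
```

Because unneeded types are dropped at the splitter, each downstream pipeline only receives its own Mi records instead of all M.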
38. Platinum Event Stream - Enrich
● Decoded info: decode some commonly used fields (e.g. BASE64-encoded) for downstream to use.
● Derived info: derive spam flags from a couple of different fields logged in the raw event data.
● Latency info: additional latency information (in ms) helps latency-sensitive downstreams take different actions according to per-event latency.
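The "decoded info" and "latency info" enrichments can be sketched as below. Class, method and field names are illustrative, not the actual job code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EnrichDecoder {
    // Decode a BASE64-encoded raw-event field once, so every downstream
    // consumer reads the decoded value instead of repeating the decode.
    public static String decodeField(String base64Field) {
        return new String(Base64.getDecoder().decode(base64Field), StandardCharsets.UTF_8);
    }

    // Attach per-event latency (ms) so latency-sensitive downstreams can
    // act differently on stale events.
    public static long latencyMs(long eventTimeMs, long processingTimeMs) {
        return processingTimeMs - eventTimeMs;
    }
}
```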
39. Platinum Event Stream - Dedup
Why do we need deduplication?
● Duplicate events exist in Pinterest’s raw event data.
● In some cases, duplicate rates vary from ~10-40% depending on the type of event.
Causes of duplicates:
1. Users repeating actions when interacting with the Pinterest app.
2. Incorrect client logging implementations.
3. Clients resending logging messages.
Solution:
● Deduplicate in both batch and streaming pipelines before exporting to dashboards or flowing into ML systems.
40. Platinum Event Stream - Dedup
[Diagram: events are keyed by UserID; if DedupKey(e) does not exist in state, update the state and output the event. Keyed state has a 24hr TTL and totals 2-10 TB.]
Implemented with stateful Flink functions.
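The dedup step can be sketched in plain Java as a keyed first-seen check. In the production job the map below is Flink MapState on the RocksDB backend with a 24h TTL; class and key names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class Deduper {
    // 24h TTL window, matching the dedup key TTL described in the slides.
    private static final long TTL_MS = 24L * 60 * 60 * 1000;
    // dedup key -> timestamp of the first occurrence.
    private final Map<String, Long> seen = new HashMap<>();

    /** Returns true if the event should be emitted (first occurrence within the TTL window). */
    public boolean firstSeen(String dedupKey, long nowMs) {
        Long first = seen.get(dedupKey);
        if (first != null && nowMs - first < TTL_MS) {
            return false; // duplicate within the 24h window: drop
        }
        seen.put(dedupKey, nowMs); // new key, or expired entry refreshed
        return true;
    }
}
```

Unlike this in-heap sketch, Flink keyed state survives restarts via checkpoints, which is what guarantees dedup accuracy across redeployments.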
41. Platinum Event Stream - Dedup
Incremental checkpoints for large state (24hr TTL, 2-10 TB)
● Full state size: 2-10 TB
● Size of each incremental checkpoint: tens of GB
Re-deployment
● From savepoint: ~10 - 20 mins
● From checkpoint: < 2 mins
42. Easy-to-Extend Framework with Java Reflection
Metric Repository (referenced by both online and offline processing) holds the event definitions: event_definitions/EventA.scala, EventB.scala, EventC.scala.
One-line configuration per pipeline in *.properties:
  pipeline1.eventClass=A
  pipeline2.eventClass=B
  pipeline3.eventClass=C
Java reflection looks up the Event class by its name when building the job graph, and invokes the functions defined for each metric at runtime in each pipeline, e.g. MetricA.filter(), MetricA.createDedupKey().
1. Only a few lines of configuration changes are needed to add a new streaming pipeline.
2. Batch and streaming logic consistency is guaranteed by referencing the same code repo.
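The reflection-based wiring can be sketched as follows. The nested EventA class stands in for an event definition from the shared metric repository; all names are illustrative:

```java
public class PipelineLoader {
    // Stand-in for an event definition class (e.g. EventA.scala in the repo).
    public static class EventA {
        public boolean filter(String event) {
            return event.contains("click");
        }
    }

    // Look up and instantiate the event class named in the pipeline config.
    public static Object newEventDefinition(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("failed to load event class " + className, e);
        }
    }

    // Invoke a metric function (e.g. filter) on the loaded definition at runtime.
    public static Object invoke(Object def, String methodName, Object arg) {
        try {
            return def.getClass().getMethod(methodName, arg.getClass()).invoke(def, arg);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("failed to invoke " + methodName, e);
        }
    }
}
```

Adding a pipeline then only requires a new properties line naming the class, with no job-code change.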
43. Platinum Event Stream - Data Quality Monitoring
Before: 30-40% discrepancies between streaming and batch applications.
After: >99% match rate between streaming and batch applications.
Daily comparison with the batch SOT dataset:
● platinum event streams are dumped to offline tables (Kafka topic → S3 dump via an internal framework)
● an internal offline data checker system compares them against the offline SOT tables
● alerts fire on match-rate violations; dashboards for continuous monitoring
44. Platinum Event Streams - Cost Efficiency
Previously, the Efficiency solution (Stream Splitter) and the Data Quality solution (Realtime DWH) each cost 600 vcores. The Unified Solution (Efficiency + Data Quality) runs at 600 vcores in total - both functionalities with a single copy of the cost, similar to either previous offering alone!
45. 5. Wins and Learns
1. User engagement boost brought by cleaner source data!
2. Highly simplified onboarding flow for downstream streaming applications!
3. Hundreds of thousands of dollars in infra savings, as well as maintenance cost savings!
47. Ongoing efforts - streaming governance
We are building a streaming lineage & catalog, integrated with the batch lineage and catalog for unified data governance:
● a catalog of Flink tables registered for all the external systems that interact with Flink jobs
● lineage between Flink jobs and external systems
48. Ongoing efforts - streaming and incremental ETL
We are building solutions on top of CDC, Kafka, Flink, Iceberg and Spark to
● ingest data in near real time from online systems to the offline data lake
● incrementally process offline data in the data lake