Evolution of Real-time User Engagement Event
Consumption at Pinterest
Heng Zhang
Lu Liu
09/26/2023
Agenda
1. Introduction
2. Background
3. Real-time event processing architecture evolution
4. Unified solution deep dive
5. Wins and Learns
6. Ongoing efforts
7. Q&A
1. Introduction
Pinterest is the visual inspiration platform people
around the world use to shop products personalized to
their taste, find ideas to do offline and discover the
most inspiring creators.
Pinterest’s mission is to bring everyone the inspiration
to create a life that they love.
What is Pinterest?
We are engineers from Pinterest Data Eng.
Data Eng’s mission is to create and run reliable, efficient and planet-scale
data platforms and services to accelerate innovation and sustain Pinterest
business.
Who are we?
2. Background
Confidential | © Pinterest
User Engagement Events Processing - Overview
Standardized metric processing
[Diagram] Frontend, backend, and impression events are transported by the Kafka message transportation service; streaming jobs consume them in real time, while Merced lands them into raw event tables, from which batch Data Warehouse ETL and batch workflows produce the Data Warehouse SOT tables.
User Engagement Events Processing - Data Source
● Scale and data volume:
○ hundreds of billions of events daily (hundreds of TB)
○ volume doubles every few years
● Event schema:
○ all events share the same Thrift struct definition
○ event types:
■ the number of events per type follows a long-tail distribution (thousands of types)
■ new types are constantly added for new use cases
■ most use cases only need to process events of a very small set of types
User Engagement Events Processing - Use Cases
● Signals used in recommendation
● Derived results shown directly on different product surfaces
● Analytical dashboards for pinners and advertisers
● Batch and real-time experiments
● Internal analytics & debugging
3. Real-time Event Processing
Architecture Evolution
Compute Frameworks
Multiple frameworks maintained by different teams before 2019:
● mini frameworks built on top of the Kafka client - Pinpin (content & discovery)
● Apache Storm - Roadrunner
● Apache Spark Streaming - Voracity
● Kafka Streams - ads budgeting
Compute Frameworks
Xenon - Pinterest's stream processing platform on top of Flink since 2019
[Diagram] The Developer APIs: native Flink DataStream API; NRTG (Lite) on top of the DataStream API; Flink Table API & SQL
The Resource Management & Job Execution Layer: cluster management (YARN); job state management (checkpoints, backups, restores, edits); security / auth (PII / FGAC); job health & diagnosis (Dr. Squirrel)
The Deployment Stack: CI/CD (Hermez); Job Management Service; common libraries and connectors
Scaling challenges and solutions 2019 ~ 2020 Stability
Challenge
● outbound traffic = number of jobs × inbound traffic
● Kafka clusters hosting the event topics had very high resource saturation
Observation
● each job only needs to process a few common event types (e.g. click, view)
● events of those common types are a small portion of all the events
[Diagram] event (type 1, type 2, …, type M) → streaming job 1, streaming job 2, …, streaming job N
Solution
Stream Splitter v1
● Flink DataStream API
● the job graph consists of source, filter, and sink operators
● the filter operator keeps only events of the small set of types required by downstream jobs
[Diagram] event (type 1, type 2, …, type M) → Stream Splitter v1 → event_core (type i, type j, type k)
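The v1 filter is essentially a predicate over a small allowlist of types. A minimal sketch (class name and event types are illustrative, not Pinterest's actual code; in the real job this predicate runs inside a Flink filter operator):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SplitterV1Filter {
    // Illustrative allowlist of the common types required by downstream jobs.
    static final Set<String> CORE_TYPES = Set.of("click", "view", "save");

    // Keep only events whose type is in the small allowlist.
    static boolean keep(String eventType) {
        return CORE_TYPES.contains(eventType);
    }

    public static void main(String[] args) {
        List<String> input = List.of("click", "render", "view", "scroll", "click");
        List<String> eventCore = input.stream()
                .filter(SplitterV1Filter::keep)
                .collect(Collectors.toList());
        System.out.println(eventCore); // only the core types survive
    }
}
```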
Scaling challenges and solutions 2019 ~ 2020 Stability
Win
● The derived topics were about 10% of the original event topics, and the high Kafka cluster resource saturation was mitigated.
● Due to the smaller input QPS, jobs processing the derived topics required less CPU / memory, and AWS cross-AZ traffic cost was reduced. Infra savings!!!
[Diagram] event_core (type i, type j, type k) → streaming job 1, streaming job 2, …, streaming job N
Scaling challenges and solutions 2019 ~ 2020 Stability
Challenge
● With new jobs requiring more event types, the derived topics grew larger and larger (10% → 30% of the original event topics)
● Infra cost grew significantly as new jobs onboarded
Observation
● QPS for each job increased with the growth of the derived topics, so each job required more resources
● Each job still had to filter input events by type to get what it needed
Scaling challenges and solutions 2021 ~ 2022 Efficiency
Solution
Stream Splitter v2
● Flink SQL
● the job consists of a statement set of DML statements - INSERT INTO event_type_i SELECT * FROM event WHERE type = type_i
● one DML statement per per-type event topic
[Diagram] event (type 1, type 2, …, type M) → Stream Splitter v2 → event_type_i, event_type_j, event_type_k, …
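Generating the statement set can be sketched as simple templating, one DML statement per type (table and column names are illustrative, not the actual schema):

```java
import java.util.List;

public class SplitterV2Statements {
    // Build one INSERT statement per per-type topic, mirroring the statement
    // set described above. Names here are assumptions for illustration.
    static String dml(String type) {
        return "INSERT INTO event_" + type
             + " SELECT * FROM event WHERE type = '" + type + "'";
    }

    public static void main(String[] args) {
        // One statement per per-type event topic.
        for (String t : List.of("click", "view", "save")) {
            System.out.println(dml(t));
        }
    }
}
```

In the real job these statements would be added to a Flink SQL statement set so the planner builds a single job graph with one sink per topic.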
Scaling challenges and solutions 2021 ~ 2022 Efficiency
Win
● Downstream jobs only process the several per-type event topics that they need
● Downstream jobs no longer need filter logic
● Downstream jobs require much less infra resources (infra savings!!!)
● Setting up a new pipeline only requires a new topic and a SQL statement
[Diagram] event_type_i, event_type_j, event_type_k, … → streaming job 1, streaming job 2, …, streaming job N
Scaling challenges and solutions 2021 ~ 2022 Efficiency
Issues with Stream Splitter v2
● All records coming out of the source operator are forwarded to every pipeline
● Stream Splitter v2 jobs cost twice as much as v1 jobs
Note: the job graph is generated by the internal SQL planner from the DML statements; operators that do not affect the data transportation pattern are omitted for better visualization.
[Diagram] Kafka Source (M records) fans out to every filter: filter i (type = type_i) → Kafka Sink i (Mi records), filter j (type = type_j) → Kafka Sink j (Mj), filter k (type = type_k) → Kafka Sink k (Mk), … - each filter receives all M records.
Scaling challenges and solutions 2021 ~ 2022 Data quality
Challenge
● Streaming and batch workflows generated inconsistent results
Observation
● Streaming jobs re-implemented much of the batch ETL logic without standardization
[Diagram] From the same events, streaming jobs and batch workflows (DWH SOT tables) report different impression counts (100 vs. 70).
Scaling challenges and solutions 2021 ~ 2022 Data quality
Solution
Real-time DWH streams
● built with NRTG - a mini framework on top of a subset of the Flink DataStream API (the Flink state API is not supported)
● the job graph consists of source, filter, enrich, dedup, and sink operators
● the filter, enrich, and dedup logic reuses that of the batch ETL
● dedup keys are stored in off-heap memory (with a pre-configured memory size) via the third-party library OHC
[Diagram] event (type 1, type 2, …, type M) → dwh_event (enriched and deduped; type i, type j, type k)
Dedup accuracy is compromised during task restarts or job deployments, as the in-memory dedup keys are lost; it takes up to one day's raw events to rebuild the state.
Scaling challenges and solutions 2021 ~ 2022 Data quality
Improved Solution
Real-time DWH streams with native Flink state
● the native Flink state API was added to NRTG
● the dedup operator was rewritten to use Flink MapState to store dedup keys with a 1-day TTL
● the RocksDB state backend stores the active (read / write) keys, with backups in S3
● savepoint size is tens of TB; the full state is preserved during task restarts and job redeployments
[Diagram] event (type 1, type 2, …, type M) → dwh_event (enriched and deduped; type i, type j, type k)
Dedup accuracy is guaranteed during task restarts or job redeployments with a specified checkpoint (from S3).
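The TTL'd dedup state can be sketched with a plain map standing in for Flink's keyed MapState (a simplified, single-process sketch under stated assumptions: the real operator uses Flink keyed state on RocksDB with a 1-day TTL, and the dedup-key format here is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class TtlDedup {
    // 1-day TTL, as in the Flink MapState configuration described above.
    static final long TTL_MS = 24L * 60 * 60 * 1000;

    // Stand-in for Flink keyed MapState: dedup key -> first-seen timestamp.
    final Map<String, Long> seen = new HashMap<>();

    // Returns true if the event is new (emit it), false if it is a duplicate.
    boolean firstSeen(String dedupKey, long nowMs) {
        // Evict expired keys; Flink's state TTL does this for us in the real job.
        seen.entrySet().removeIf(e -> nowMs - e.getValue() > TTL_MS);
        return seen.putIfAbsent(dedupKey, nowMs) == null;
    }

    public static void main(String[] args) {
        TtlDedup d = new TtlDedup();
        System.out.println(d.firstSeen("user1:click:pin9", 0));             // first occurrence: kept
        System.out.println(d.firstSeen("user1:click:pin9", 1_000));         // duplicate within TTL: dropped
        System.out.println(d.firstSeen("user1:click:pin9", TTL_MS + 2_000)); // after TTL: kept again
    }
}
```

Because the sketch keeps state in heap memory, it has exactly the weakness the v1 NRTG job had: state is lost on restart. Flink's checkpointed state is what removes that weakness.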
Scaling challenges and solutions 2021 ~ 2022 Data quality
Win
● Downstream jobs reading dwh_event can generate results consistent with the batch workflows; the computed real-time signals used in recommendation helped boost Pinterest engagement metrics by double digits.
● Downstream jobs no longer need to implement enrich and dedup logic, and their job graphs are simplified to focus only on the business logic.
[Diagram] Streaming jobs (via dwh_event) and batch workflows (DWH SOT tables) now both report 70 impressions.
Scaling challenges and solutions 2021 ~ 2022 Data quality
Issues with the Real-time DWH streams job
● The generated dwh_event topic contains multiple types, so downstream jobs read unnecessary data and still implement filter logic
● The mini framework introduces extra overhead
● Supporting a new type is slow - the logic for processing different types is coupled together due to the mini framework's API requirements
Two solutions for pre-processing engagement events
Stream Splitter
● Pros: efficient downstream consumption; fast onboarding
● Cons: no data quality; repetitive processing logic in downstream jobs; inefficient job runtime (data duplication)
Realtime DWH
● Pros: data quality; simplified downstream job logic
● Cons: slow onboarding; inefficient downstream consumption; inefficient job runtime (framework overhead)
Downstream job developers are confused about which one to use; infra cost doubles, and KTLO cost doubles.
Unified Solution - Requirements
● Efficiency
○ pre-processing jobs have an efficient runtime
○ downstream jobs only read the events they need to process
● Data quality
○ downstream jobs read enriched and deduped events that generate results consistent with the batch workflows
● Dev velocity
○ supporting a new type in the pre-processing jobs should be simple and easy to enable without affecting existing pipelines
○ downstream jobs no longer port the filter / enrich logic from batch ETL and no longer implement deduplication on the data source
● KTLO
○ maintain one unified solution rather than two
Unified solution - API choice
● Flink DataStream API
● Flink SQL
● a mini framework like NRTG
● Flink Table API - our final choice
○ it is more expressive than Flink SQL - complex logic can't easily be implemented in SQL
○ it is very flexible
■ sources and sinks are registered and accessed as Flink tables
■ a Table is easily converted to a DataStream when we want to leverage low-level features
○ it does not have the extra framework overhead of NRTG
Unified solution - job framework
Framework design
● each output stream is generated through a pipeline made up of filter, enrich, dedup, and sink operators
● pipelines are pluggable and independent of each other
● classes from the batch ETL are re-used to maintain consistent semantics
● Java reflection is leveraged to easily configure each pipeline
Job graph optimization - side outputs
● a job operator assigns every source event to its pipeline, based on type, through side outputs
● essentially we are implementing "filter pushdown" to reduce unnecessary data transportation
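A toy calculation shows the transport saving from this filter pushdown (all numbers are illustrative, not Pinterest's actual traffic):

```java
public class PushdownSavings {
    // Stream Splitter v2 behaviour: every source record is forwarded to every
    // pipeline's filter, so transported records scale with the pipeline count.
    static long fullFanOut(long inputQps, int pipelines) {
        return inputQps * pipelines;
    }

    // Side-output behaviour: only the events a pipeline actually needs are
    // forwarded to it, so transported records are just the sum of per-pipeline QPS.
    static long withPushdown(long[] perPipelineQps) {
        long total = 0;
        for (long qps : perPipelineQps) total += qps;
        return total;
    }

    public static void main(String[] args) {
        long m = 100_000;                         // input QPS M (illustrative)
        long[] mi = {1_000, 2_000, 500, 1_500};   // per-pipeline needs Mi, with ΣMi << M
        System.out.println("full fan-out records/s: " + fullFanOut(m, mi.length)); // 400000
        System.out.println("with filter pushdown:   " + withPushdown(mi));         // 5000
    }
}
```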
Unified solution - new offering
Platinum Event Stream
4. Platinum Event Stream Deep Dive
Platinum Event Streams - What They Offer
[Diagram] raw events → platinum event streams → streaming applications, providing standardized event selection, event deduplication, and downstream efficiency.
Platinum Event Streams - User Flow
Before: streaming app developers had to ask the logging / metric owners, "I want to use event A as one of my signals - what's the correct logic to process it from raw events?"
After: logging owners and metric definition owners feed the Data Warehouse team's platinum event streams, which streaming app developers consume directly - faster onboarding with guaranteed quality and efficiency!
Platinum Event Stream - Technical Architecture
Input: raw events
Flink processing: event splitting, filtering, enrichment, deduplication
Output: Kafka topics with cleansed event data
Platinum Event Stream - Flink Processing
[Diagram] Kafka Source Table (M records) → Splitter w/ Filters → side outputs 1…N (M1…MN records) → Enrich 1…N → Dedup 1…N → Kafka Sinks 1…N.
Platinum Event Stream - Splitter w/ Filters
Splitter functionalities:
1. Filter out the events we don't need.
2. Split the stream into different sub-pipelines according to event types.
[Diagram] Kafka Source Table → Splitter (w/ filters) → Enrich i → Dedup i → Kafka Sink i, for each of the N pipelines.
The filter and dedup-key logic for each event / metric X comes from the Metric Repository (shared by batch and streaming processing):
def filter(event: Event): Boolean = ……
def createDedupKey(event: Event) = ……
This gives standardized event selection that is consistent with batch applications.
Platinum Event Stream - Splitter w/ Filters
Splitter functionalities: (1) filtering, (2) splitting streams
Solution 1 - FlatMapFunc: the Kafka source table fans out to FlatMapFunc 1…N, so every function receives the full input QPS M. This caused severe back pressure and scalability issues when input traffic was high.
Solution 2 - Side Output: a single splitter initializes a Map<event type, pipeline tag>; for each event it emits the event with the corresponding pipeline tag, or throws it away if no pipeline needs it. Side output i carries only Mi, the QPS needed by pipeline i, and ΣMi << M. Scalability issue solved!
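The side-output splitter can be sketched with plain collections standing in for Flink OutputTags (a simplified sketch; in the real operator each buffer is a side-output stream and the tag map is built from the pipeline configuration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Splitter {
    // Stand-in for the Map<event type, pipeline tag>: type -> per-pipeline buffer.
    final Map<String, List<String>> pipelines = new HashMap<>();

    Splitter(List<String> neededTypes) {
        for (String t : neededTypes) pipelines.put(t, new ArrayList<>());
    }

    // Emit the event to its pipeline's side output; throw it away if no
    // pipeline needs this type.
    void process(String type, String event) {
        List<String> out = pipelines.get(type);
        if (out != null) out.add(event);
    }

    public static void main(String[] args) {
        Splitter s = new Splitter(List.of("click", "view"));
        s.process("click", "e1");
        s.process("scroll", "e2"); // thrown away: no pipeline needs it
        s.process("view", "e3");
        System.out.println(s.pipelines);
    }
}
```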
Platinum Event Stream - Enrich
● Decoded info: commonly used fields (e.g. BASE64-encoded ones) are decoded for downstream use.
● Derived info: e.g. spam flags derived from a couple of different fields logged in the raw event data.
● Latency information: additional per-event latency (in ms) that helps latency-sensitive downstream jobs take different actions according to the latency of each event.
Platinum Event Stream - Dedup
Why do we need deduplication?
● Duplicate events exist in Pinterest's raw event data.
● In some cases, duplicate rates range from ~10% to 40%, depending on the event type.
Causes of duplicates:
1. Users repeating actions while interacting with the Pinterest app.
2. Incorrect client logging implementations.
3. Clients resending logging messages.
Solution:
● Deduplicate in both batch and streaming pipelines before exporting to dashboards or feeding ML systems.
Platinum Event Stream - Dedup
[Diagram] The dedup operator keys the stream by UserID and computes DedupKey(e); if the key does not exist in state, it updates the state and outputs the event. State has a 24hr TTL and is 2-10 TB in size.
Stateful Flink functions with a 24hr TTL over 2-10 TB of state.
Incremental checkpointing for large state:
● full state size: 2-10 TB
● size of each incremental checkpoint: tens of GB
Re-deployment:
● from a savepoint: ~10-20 mins
● from a checkpoint: < 2 mins
Easy-to-Extend Framework with Java Reflection
The Metric Repository (referenced by both online and offline processing) holds the event definitions (EventA.scala, EventB.scala, EventC.scala). Each streaming pipeline is enabled with a one-line configuration in *.properties:
pipeline1.eventClass=A
pipeline2.eventClass=B
pipeline3.eventClass=C
When building the job graph, the framework looks up each Event class by its name with Java reflection, then invokes the functions defined for each metric at runtime for each pipeline, e.g. MetricA.filter() and MetricA.createDedupKey().
1. Only a few lines of configuration change are needed to add a new streaming pipeline.
2. Batch and streaming logic consistency is guaranteed by referencing the same code repo.
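The reflection-based lookup can be sketched as follows (the class and its methods are illustrative stand-ins for the repository's event definitions, and the class-name wiring mimics the `*.properties` entries):

```java
import java.lang.reflect.Method;

public class ReflectionPipelines {
    // Illustrative stand-in for an event definition class from the shared
    // metric repository (e.g. EventA.scala).
    public static class EventA {
        public boolean filter(String event) { return event.contains("A"); }
        public String createDedupKey(String event) { return "A:" + event; }
    }

    public static void main(String[] args) throws Exception {
        // As in the *.properties file, the pipeline only names the class...
        String eventClass = ReflectionPipelines.class.getName() + "$EventA";

        // ...and the framework looks it up and invokes its per-metric functions
        // by reflection when building the job graph.
        Object metric = Class.forName(eventClass).getDeclaredConstructor().newInstance();
        Method filter = metric.getClass().getMethod("filter", String.class);
        Method keyFn  = metric.getClass().getMethod("createDedupKey", String.class);

        System.out.println(filter.invoke(metric, "event-A-1")); // true
        System.out.println(keyFn.invoke(metric, "event-A-1"));  // A:event-A-1
    }
}
```

Adding a new stream then only requires a new event definition class and one configuration line naming it.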
Platinum Event Stream - Data Quality Monitoring
Before: 30-40% discrepancies between streaming and batch applications.
After: >99% match rate between streaming and batch applications.
Daily comparison with the batch SOT dataset: platinum event streams are dumped to offline tables (Kafka topic → S3, via an internal framework) and compared against the offline SOT tables by an internal offline data checker system, with alerts for match-rate violations and dashboards for continuous monitoring.
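The daily check can be sketched as a set comparison (a toy sketch: the match-rate definition and the alert threshold here are illustrative assumptions, not the internal checker's actual logic):

```java
import java.util.Set;

public class MatchRateChecker {
    // Alert if the streaming vs. batch match rate drops below this (assumed threshold).
    static final double THRESHOLD = 0.99;

    // Fraction of batch SOT rows that also appear in the streaming dump.
    static double matchRate(Set<String> streamingRows, Set<String> batchSotRows) {
        if (batchSotRows.isEmpty()) return 1.0;
        long matched = batchSotRows.stream().filter(streamingRows::contains).count();
        return (double) matched / batchSotRows.size();
    }

    public static void main(String[] args) {
        Set<String> streaming = Set.of("r1", "r2", "r3", "r4");
        Set<String> batch = Set.of("r1", "r2", "r3", "r4", "r5");
        double rate = matchRate(streaming, batch);
        System.out.println("match rate: " + rate);
        if (rate < THRESHOLD) System.out.println("ALERT: match rate violation");
    }
}
```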
Platinum Event Streams - Cost Efficiency
Previously, the efficiency solution (Stream Splitter) and the data quality solution (Realtime DWH) each cost about 600 vcores. The unified solution (efficiency + data quality) achieves both functionalities for about 600 vcores - a single copy of the cost of each previous offering!
5. Wins and Learns
1. User engagement boost brought by cleaner source data!
2. Highly simplified onboarding flow for downstream streaming applications!
3. Hundreds of thousands of dollars in infra savings, as well as maintenance cost savings!
6. Ongoing efforts
Ongoing efforts - streaming governance
We are building streaming lineage & catalog, integrated with the batch lineage and catalog, for unified data governance:
● a catalog of Flink tables, which register all the external systems interacting with Flink jobs
● lineage between Flink jobs and external systems
Ongoing efforts - streaming and incremental ETL
We are building solutions on top of CDC, Kafka, Flink, Iceberg, and Spark to
● ingest data in near real-time from online systems into the offline data lake
● incrementally process offline data in the data lake
Thank you!
Q & A
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Pinterest's Evolution of Real-time User Engagement Event Processing

  • 11. Confidential | © Pinterest Compute Frameworks Multiple frameworks maintained by different teams before 2019 ● Mini frameworks built on top of the Kafka client - Pinpin content & discovery ● Apache Storm - Roadrunner ● Apache Spark Streaming - Voracity ● Kafka Streams - Ads budgeting
  • 12. Confidential | © Pinterest Xenon - Pinterest stream processing platform on top of Flink since 2019 + Cluster Management (YARN) Native Flink DataStream API NRTG(Lite) on top of DataStream API Flink Table API & SQL The Resource Management & Job Execution Layer The Developer APIs Job State Management (Checkpoints, Backups, Restores, Edits) Security / Auth (PII/FGAC) Job Health & Diagnosis (Dr. Squirrel) CI/CD Hermez The Deployment Stack Job Management Service Common Libraries and Connectors Compute Frameworks
  • 13. Scaling challenges and solutions 2019 ~ 2020 Stability Challenge ● outbound traffic = number of jobs * inbound traffic ● Kafka clusters hosting event topics had very high resource saturation Observation ● each job only needs to process a few common event types (e.g. click, view) ● events of those common types are a small portion of all the events event (type 1, type 2, …, type M) streaming job 1 streaming job 2 streaming job N …
  • 14. Solution event (type 1, type 2, …, type M) Stream Splitter v1 ● Flink DataStream API ● Job graph consists of source, filter and sink ● The filter operator only keeps events of the small set of types required by downstream jobs event_core (type i, type j, type k) Scaling challenges and solutions 2019 ~ 2020 Stability
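The v1 filter step can be sketched in a few lines. This is not Pinterest's actual Flink code; it is a minimal Python model in which events are dicts with a `type` field, and `DOWNSTREAM_TYPES` is a hypothetical allowlist of the few common types downstream jobs need:

```python
# Minimal sketch of the Stream Splitter v1 filter step (not actual Flink code).
# DOWNSTREAM_TYPES is a hypothetical allowlist of the small set of common
# event types (e.g. click, view) that downstream jobs actually consume.
DOWNSTREAM_TYPES = {"click", "view", "save"}

def keep_event(event: dict) -> bool:
    """Filter predicate: keep only events whose type downstreams need."""
    return event.get("type") in DOWNSTREAM_TYPES

def split(events):
    """source -> filter -> sink: emit only the small subset of event types."""
    return [e for e in events if keep_event(e)]
```

Because the retained types are a small portion of all events, the derived topic ends up far smaller than the source topic, which is exactly the win the next slide describes.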
  • 15. Win ● The derived topics were about 10% of the original event topics, and the high Kafka cluster resource saturation issue was mitigated. ● Due to the smaller input QPS, jobs processing the derived topics required less CPU / memory, and AWS cross-AZ traffic cost was reduced. Infra savings!!! event_core (type i, type j, type k) streaming job 1 streaming job 2 streaming job N … Scaling challenges and solutions 2019 ~ 2020 Stability
  • 16. Challenge ● With new jobs requiring more event types, the derived topics grew larger and larger (10% -> 30% of the original event topics) ● Infra cost grew significantly as new jobs onboarded Observation ● QPS for each job became larger due to the growth of the derived topics, and jobs required more resources ● Each job had to filter input events by type to get what it needed Scaling challenges and solutions 2021 ~ 2022 Efficiency
  • 17. Solution event (type 1, type 2, …, type M) Stream Splitter v2 ● Flink SQL ● Job consists of a statement set of DML statements – INSERT INTO event_type_i SELECT * FROM event WHERE type = type_i ● one DML statement per per-type event topic event_type_i event_type_j event_type_k … Scaling challenges and solutions 2021 ~ 2022 Efficiency
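The statement-set idea can be modeled outside of Flink; below is a small Python sketch that generates one DML string per event type. The table names (`event`, `event_type_*`) and the `type` column mirror the slide but are assumptions, not the production schema:

```python
# Sketch of how the v2 statement set could be generated: one INSERT ... SELECT
# per per-type topic. Table/column names are hypothetical stand-ins for the
# slide's `event` source table and `event_type_i` sink tables.
def build_statement_set(types):
    return [
        f"INSERT INTO event_type_{t} SELECT * FROM event WHERE type = '{t}'"
        for t in types
    ]
```

In Flink, each generated string would be added to a statement set so all per-type pipelines run as one job sharing a single source scan.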
  • 18. Scaling challenges and solutions 2021 ~ 2022 Efficiency Win ● Downstream jobs only process the several per-type event topics that they need ● Downstream jobs no longer need filter logic ● Downstream jobs require much less infra resources (infra savings!!!) ● Setting up a new pipeline only requires a new topic and a SQL statement streaming job 1 streaming job 2 streaming job N … event_type_i event_type_j event_type_k …
  • 19. Scaling challenges and solutions 2021 ~ 2022 Efficiency Issues with Stream Splitter v2 ● All the records coming out of the source operator are forwarded to every pipeline ● Stream Splitter v2 jobs cost twice as much as v1 jobs Note: The job graph is generated by the internal SQL planner from the DML statements; other operators that do not affect the data transportation pattern are not shown for better visualization Kafka Source filter i (type = type_i) Kafka Sink i Kafka Sink j Kafka Sink k … filter j (type = type_j) filter k (type = type_k) … M M M M Mi Mi Mj Mj Mk Mk
  • 20. Scaling challenges and solutions 2021 ~ 2022 Data quality Challenge ● Streaming and batch workflows generated inconsistent results Observation ● Streaming jobs re-implemented many batch ETL logics without standardization Streaming jobs 70 impressions 100 impressions event DWH SOT tables Batch workflows
  • 21. Scaling challenges and solutions 2021 ~ 2022 Data quality Solution event (type 1, type 2, …, type M) Real-time DWH streams ● Built with NRTG - a mini framework on top of a subset of the Flink DataStream API (the Flink state API is not supported) ● Job graph consists of source, filter, enrich, dedup and sink ● filter, enrich and dedup logics reuse those in the batch ETL ● dedup keys are stored in off-heap memory (with a pre-configured memory size) via the 3rd-party library OHC dwh_event (enriched and deduped) (type i, type j, type k) Dedup accuracy is compromised during task restarts or job deployments, as in-memory dedup keys are lost; it takes up to 1 day's raw events to rebuild the state.
  • 22. Scaling challenges and solutions 2021 ~ 2022 Data quality Improved Solution event (type 1, type 2, …, type M) Real-time DWH streams with native Flink state ● The native Flink state API is added to NRTG ● The dedup operator is re-written using Flink MapState to store dedup keys with a 1-day TTL ● RocksDB state backend and S3 store active (read / write) keys and backups ● Savepoint size is tens of TB; full state is preserved during task restarts and job redeployments dwh_event (enriched and deduped) (type i, type j, type k) Dedup accuracy is guaranteed during task restarts or job redeployments with a specified checkpoint (from S3)
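The TTL-based dedup logic can be illustrated with a small Python model. The real job uses Flink MapState on a RocksDB backend; here an in-process dict stands in for the keyed state, and `dedup_key()` is a hypothetical stand-in for the key logic shared with the batch ETL:

```python
import time

# Sketch of TTL-based deduplication (the real job uses Flink MapState with a
# RocksDB state backend; this only models the logic). TTL_SECONDS stands in
# for the 1-day TTL on the slide; the dedup key fields are assumptions.
TTL_SECONDS = 24 * 3600

def dedup_key(event: dict) -> str:
    return f'{event["user_id"]}:{event["type"]}:{event["event_id"]}'

class Deduper:
    def __init__(self, ttl=TTL_SECONDS, clock=time.time):
        self._seen = {}      # dedup key -> expiry timestamp (MapState analogue)
        self._ttl = ttl
        self._clock = clock  # injectable clock, handy for testing

    def process(self, event: dict):
        now = self._clock()
        key = dedup_key(event)
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return None                 # duplicate within TTL: drop
        self._seen[key] = now + self._ttl
        return event                    # first occurrence: emit downstream
```

The durability difference is the point of the slide: a dict (or OHC) loses `_seen` on restart, while Flink's checkpointed MapState restores it.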
  • 23. Scaling challenges and solutions 2021 ~ 2022 Data quality Win ● Downstream jobs reading dwh_events can generate consistent results with the batch workflows; the computed real-time signals used in recommendation helped to boost Pinterest engagement metrics by double digits. ● Downstream jobs no longer need to implement enrich and dedup logics and job graphs are simplified to only focus on the business logic. Streaming jobs 70 impressions 70 impressions dwh_event DWH SOT tables Batch workflows
  • 24. Scaling challenges and solutions 2021 ~ 2022 Data quality Issues with the Real-time DWH streams job ● The generated dwh_event topic consists of multiple types, so downstream jobs read unnecessary data and have to implement filter logics ● The mini framework introduces extra overhead ● Supporting a new type is slow - the logics for processing different types are coupled together due to the mini framework's API requirements
  • 25. Two solutions for pre-processing engagement events Stream Splitter Efficient downstream consumption Fast onboarding No data quality guarantee Repetitive processing logics in downstream jobs Inefficient job runtime (data duplication) Realtime DWH Data quality Simplified downstream job logic Slow onboarding Inefficient downstream consumption Inefficient job runtime (framework overhead) Downstream job developers are confused about what to use Infra cost doubles and KTLO cost doubles
  • 26. Unified Solution - Requirements ● Efficiency ○ Pre-processing jobs have efficient runtime ○ Downstream jobs only read events what they need to process ● Data quality ○ Downstream jobs read enriched and deduped events that can generate consistent results with the batch workflows ● Dev velocity ○ Supporting a new type in the pre-processing jobs should be simple and can be enabled easily without affecting the existing pipelines ○ Downstream jobs no longer port the filter-enrich logics from batch ETL and no longer implement deduplication logic on data source ● KTLO ○ maintain one unified solution rather than 2 solutions
  • 27. Unified solution - API choice ● Flink Datastream API ● Flink SQL ● Mini framework like NRTG ● Flink Table API - our final choice ○ It is more expressive than Flink SQL - complex logics can’t be easily implemented as SQL ○ It is very flexible ■ source and sinks are registered and accessed as Flink tables ■ easily convert Table to Datastream when we want to leverage low-level features ○ It does not have any extra framework overhead like NRTG
  • 28. Unified solution - job framework Framework design ● Each output stream is generated through a pipeline made up of filter, enrich, dedup and sink operators ● Pipelines are pluggable and independent from each other ● Classes from batch ETL are re-used to maintain consistent semantics ● Java reflection is leveraged to easily configure each pipeline Job graph optimization - side outputs ● A job operator assigns every source event, based on its type, to the corresponding pipeline through side outputs ● Essentially we are implementing “filter pushdown” to reduce unnecessary data transportation
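The side-output "filter pushdown" can be sketched as a single routing pass: one splitter assigns each event to at most one pipeline tag, and events no pipeline needs are dropped at the source instead of being fanned out to every pipeline. This is a Python model, not the Flink `OutputTag` implementation, and the type/tag names are hypothetical:

```python
# Sketch of the splitter's side-output routing ("filter pushdown"). In Flink
# this would be a process function emitting to per-pipeline OutputTags; here
# we model it as one pass that buckets events by a type -> tag map.
TYPE_TO_TAG = {"click": "pipeline_click", "view": "pipeline_view"}

def route(events):
    outputs = {tag: [] for tag in TYPE_TO_TAG.values()}
    for event in events:
        tag = TYPE_TO_TAG.get(event["type"])
        if tag is not None:            # side output for the matching pipeline
            outputs[tag].append(event)
        # events of types no pipeline needs are discarded at the source
    return outputs
```

Each pipeline then sees only its own side output (QPS Mi), so the sum of per-pipeline traffic is far below the source QPS M, unlike the v2 fan-out.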
  • 29. Unified solution - new offering Platinum Event Stream
  • 30. 4. Platinum Event Stream Deep Dive
  • 31. Platinum Event Streams - What do they offer? raw event platinum event streams Standardized Event Selection Event Deduplication Downstream Efficiency streaming applications
  • 32. Platinum Event Streams - User Flow Logging / Metric Owners Streaming App Developers I want to use event A as one of my signals, what’s the correct logic to process it from raw events? Before After Logging Owners Metric Definition Owners Streaming App Developers Data Warehouse Team platinum event streams Faster onboarding w/ guaranteed quality and efficiency!
  • 33. Platinum Event Stream - Technical Architecture Input: raw event Flink processing: Event splitting, filtering, enrichment, deduplication Output: Kafka topics w/ cleansed event data
  • 34. Platinum Event Stream - Flink Processing Kafka Source Table Dedup 1 Kafka Sink 1 Enrich 1 Dedup 2 Kafka Sink 2 Enrich 2 Dedup N Kafka Sink N Enrich N … … … Splitter w/ Filters Side output 1 Side output 2 … Side output N M M1 M2 MN
  • 35. Platinum Event Stream - Splitter w/ Filters Splitter Functionalities: 1. Filter out the events we don’t need. 2. Split the stream into different sub-pipelines according to event types. … … … Splitter (w/ filters) Enrich i Dedup i Kafka Sink i Kafka Source Table Enrich 1 Dedup 1 Kafka Sink 1 Enrich N Dedup N Kafka Sink N
  • 36. Metric Repository (shared by batch and streaming processing) Event / Metric X def filter(event: Event): Boolean = …… …… def createDedupKey(event: Event) = …… …… … … … Splitter (w/ filters) Enrich i Dedup i Kafka Sink i Kafka Source Table Enrich 1 Dedup 1 Kafka Sink 1 Enrich N Dedup N Kafka Sink N Standardized Event Selection Consistent w/ Batch Applications Platinum Event Stream - Splitter w/ Filters
  • 37. Splitter Functionalities: (1) Filtering (2) Split Streams Solution 1 - FlatMapFunc Severe back pressure and scalability issue when input traffic is high. Solution 2 - Side Output Kafka Source Table Splitter: Initialize: Map<event type, pipeline tag> Process: - Emit events with corresponding pipeline tag. - Throw away if not needed. … FlatMapFunc - 1 Kafka Source Table FlatMapFunc - 2 FlatMapFunc - N … M M - QPS of input raw event stream M M M M1 M2 Mn Mi - QPS of side output i which is needed by pipeline i ΣMi << M Scalability issue solved! Platinum Event Stream - Splitter w/ Filters
  • 38. Latency Information Decoded Info Derived Info ● Derived spam flags from a couple of different fields logged in the raw event data. ms ● Additional latency information. ● Helps latency-sensitive downstreams take different actions according to per-event latency. … … … Splitter (w/ filters) Enrich i Dedup i Kafka Sink i Kafka Source Table Enrich 1 Dedup 1 Kafka Sink 1 Enrich N Dedup N Kafka Sink N ● Decoded some commonly used fields for downstreams to use. BASE64 Platinum Event Stream - Enrich
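The enrich step described above (decoding a BASE64 field and attaching latency in milliseconds) can be sketched as follows. The field names (`payload_b64`, `logged_at_ms`) are assumptions for illustration, and the derived spam flags are omitted:

```python
import base64
import time

# Sketch of the enrich step: decode a BASE64-encoded field and attach event
# latency in ms. Field names are hypothetical; spam-flag derivation omitted.
def enrich(event: dict, now_ms=None) -> dict:
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    enriched = dict(event)  # keep the raw event immutable
    enriched["payload_decoded"] = base64.b64decode(event["payload_b64"]).decode()
    enriched["latency_ms"] = now_ms - event["logged_at_ms"]
    return enriched
```

Doing this once in the platinum stream means every downstream consumer gets pre-decoded fields and per-event latency for free.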
  • 39. … … … Splitter (w/ filters) Enrich i Dedup i Kafka Sink i Kafka Source Table Enrich 1 Dedup 1 Kafka Sink 1 Enrich N Dedup N Kafka Sink N Why do we need deduplication? ● Duplicate events exist in Pinterest’s raw event data. ● In some cases, duplicate rates vary from ~10-40% depending on the type of event. Causes of duplicates: 1. Repeated user actions when interacting with the Pinterest app. 2. Incorrect client logging implementations. 3. Clients resending logging messages. Solution: ● Deduplicate in both batch and streaming pipelines before exporting to dashboards or flowing into ML systems. Platinum Event Stream - Dedup
  • 40. Key by UserID Not exists Update state & Output DedupKey (e) 24hr TTL 2-10 TB … … … Splitter (w/ filters) Enrich i Dedup i Kafka Sink i Kafka Source Table Enrich 1 Dedup 1 Kafka Sink 1 Enrich N Dedup N Kafka Sink N Platinum Event Stream - Dedup Flink Stateful Functions
  • 41. 24hr TTL 2-10 TB Incremental checkpoints for the large state ● Full state size: 2-10 TB ● Per-checkpoint size: tens of GB Re-deployment ● From savepoint: ~10 - 20 mins ● From checkpoint: < 2 mins … … … Splitter (w/ filters) Enrich i Dedup i Kafka Sink i Kafka Source Table Enrich 1 Dedup 1 Kafka Sink 1 Enrich N Dedup N Kafka Sink N Platinum Event Stream - Dedup
  • 42. Easy-to-Extend Framework with Java Reflection Metric Repository (referenced by both online and offline processing) event_definitions EventA.scala EventB.scala EventC.scala One-Line Configuration pipeline1.eventClass=A pipeline2.eventClass=B pipeline3.eventClass=C *.properties: 1. Only a few lines of configuration changes are needed to add a new streaming pipeline. 2. Batch and streaming logic consistency is guaranteed by referencing the same code repo. Java Reflection Look up the Event class by its name with Java reflection when building the job graph. Invoke the functions defined for each metric at runtime for each pipeline: MetricA.filter() MetricA.createDedupeKey()
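The reflection-driven wiring can be modeled compactly: each pipeline names its event class in a one-line config entry, and the framework resolves the class by name while building the job graph. This Python sketch uses a `globals()` lookup as the analogue of Java's `Class.forName`; the class, config keys, and method names are hypothetical:

```python
# Sketch of the reflection-driven pipeline wiring. In the real framework this
# is Java reflection over Scala event classes; here globals() plays the role
# of Class.forName. EventA and the config keys are hypothetical.
class EventA:
    @staticmethod
    def filter(event):
        return event.get("type") == "A"

    @staticmethod
    def create_dedup_key(event):
        return f'A:{event["id"]}'

CONFIG = {"pipeline1.eventClass": "EventA"}  # one line per pipeline

def build_pipeline(pipeline: str):
    """Resolve the event class named in the config for this pipeline."""
    cls_name = CONFIG[f"{pipeline}.eventClass"]
    return globals()[cls_name]          # reflective class lookup by name
```

Adding a new pipeline then means writing one event class and one config line, with no framework code changes.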
  • 43. Platinum Event Stream - Data Quality Monitoring Before 30-40% discrepancies between streaming and batch applications After >99% match rate between streaming and batch applications Daily comparison with the batch SOT dataset platinum event streams offline tables offline SOT tables Kafka topic → S3 dump (internal framework) Internal offline data checker system Alerts for match-rate violations Dashboards for continuous monitoring
  • 44. Platinum Event Streams - Cost Efficiency Efficiency Solution 600 vcores 600 vcores Data Quality Solution Unified Solution (Efficiency + Data Quality) 600 vcores Achieves both functionalities at the cost of a single copy, similar to each previous offering alone!
  • 45. 5. Wins and Learns 1. User engagement boost brought by cleaner source data! 2. Highly simplified onboarding flow for downstream streaming applications! 3. Hundreds of thousands of dollars in infra savings, as well as maintenance cost savings!
  • 47. Ongoing efforts - streaming governance We are building streaming lineage & catalog, integrated with the batch lineage and catalog for unified data governance ● a catalog of Flink tables registered for all the external systems that interact with Flink jobs ● lineage between Flink jobs and external systems
  • 48. Ongoing efforts - streaming and incremental ETL We are building solutions on top of CDC, Kafka, Flink, Iceberg and Spark to ● ingest data in near real-time from online systems to the offline data lake ● incrementally process offline data in the data lake