Historically, Pinterest's data warehouse ingestion and indexing services were implemented on batch ETL and Kafka streaming, respectively. As the product side leans more toward real-time and near-real-time data to innovate and compete, teams across Pinterest are working together to revamp the ingestion and processing stack.
In this talk, we share our near-real-time ingestion system built on top of Apache Kafka, Apache Flink, and Apache Iceberg. We picked ANSI SQL as the common currency to minimize the "lambda architecture" learning curve for teams adopting near-real-time data.
6. PinStats Analytic
“Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”
7. Content Understanding
Safety: content safety, quality, and rich content signals in near real time.
Distribution: fast distribution via near-real-time signals and learned retrieval.
[Diagram: content signal areas: Content Creation, Audience Targeting, Content Understanding, Quality, Interests & Annotations, Embeddings, Performance]
11. Pinterest Practice till 2021
Discovery
● Five catalogs, depending on storage choice
● Can't easily access data to backfill
● Lineage embedded in code and configuration files
Governance
● Tribal knowledge; owner left the company
● Ping multiple teams to find the owner(s) of a schema field
Service Mindset
● Offload state management to OLTP
● Limited data and logic definition reuse
13. Toward a Federated Big Data(base) System - 2022
A federation approach toward a rapidly changing data landscape, adapting to heterogeneous workloads and systems:
● Extensibility: virtual table and view interfaces implemented by multiple compute engines (e.g., Spark / Flink / Presto)
● Connectivity: RDBMS, NoSQL, message queues, and OLAP, as well as cloud data warehouses
● Portability: user workloads expressed as UDFs and SQL are easier to migrate across systems; an API approach like Apache Beam is also compatible
'Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases' - Amit P. Sheth, James A. Larson
18. Pattern 1 - streams-to-streams filtering and transform
Components
● Read Kafka tables
● Filter and transform
● Append into Kafka tables
Considerations
● Schema evolution
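Pattern 1 can be expressed as a single FlinkSQL statement. A minimal sketch, assuming Kafka-backed tables `raw_events_kafka` and `filtered_events_kafka` are already registered in the catalog (all table and column names here are illustrative, not Pinterest's actual schemas):

```sql
-- Pattern 1 sketch: filter and lightly transform one Kafka stream into another.
INSERT INTO filtered_events_kafka       -- sink: another Kafka-backed table
SELECT
  event_id,
  user_id,
  LOWER(event_type) AS event_type,     -- example transform: normalize casing
  event_ts
FROM raw_events_kafka                   -- source: Kafka-backed table
WHERE event_type IS NOT NULL
  AND user_id <> 0;                     -- drop malformed records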
19. Pattern 2 - raw log ingestion
Components
● Read Kafka table
● Deduplication
● Append to Iceberg / S3
Considerations
● Data format
● Late-arriving events
● Event-time vs. processing-time based ingestion
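Deduplication in Pattern 2 maps naturally onto FlinkSQL's `ROW_NUMBER()` dedup idiom. A minimal sketch, assuming hypothetical tables `raw_logs_kafka` (with a processing-time attribute `proc_ts`) and an Iceberg-backed `raw_logs_iceberg`:

```sql
-- Pattern 2 sketch: keep the first-seen copy of each event id, then append
-- the deduplicated stream to an Iceberg table.
INSERT INTO raw_logs_iceberg
SELECT event_id, user_id, payload, event_ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY event_id
           ORDER BY proc_ts ASC          -- first arrival wins
         ) AS rn
  FROM raw_logs_kafka
)
WHERE rn = 1;
```

Ordering by event time instead of processing time here is the lever for the event-time vs. processing-time consideration above.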
20. Pattern 3 - real-time data warehouse
Components
● Join Kafka table with lookup table
● Deduplication and aggregation
● Upsert to Iceberg / S3
Considerations
● Caching to reduce lookup latency
● State TTL
● Iceberg tuning
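The lookup join plus aggregation in Pattern 3 can be sketched with Flink's temporal lookup join syntax. This assumes hypothetical tables `events_kafka` (with a processing-time attribute `proc_ts` and a watermarked `event_ts`), a dimension table `dim_pins`, and an Iceberg sink `pin_metrics_iceberg`:

```sql
-- Pattern 3 sketch: enrich a Kafka stream via a lookup join against a
-- dimension table, aggregate per minute, and upsert the result to Iceberg.
INSERT INTO pin_metrics_iceberg
SELECT
  d.advertiser_id,
  TUMBLE_START(e.event_ts, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS impressions
FROM events_kafka AS e
JOIN dim_pins FOR SYSTEM_TIME AS OF e.proc_ts AS d   -- lookup join
  ON e.pin_id = d.pin_id
GROUP BY
  d.advertiser_id,
  TUMBLE(e.event_ts, INTERVAL '1' MINUTE);
```

Lookup-join caching and state TTL are set on the connector and job configuration rather than in the query itself.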
21. Pattern 4 - online ingestion / indexing
Components
● Stream synchronization
● Event enrichment
● Upsert to OLTP
Considerations
● OLTP with version history (row snapshots as of a past timestamp) helps with reproducible backfill
● Each upsert bumps the version timestamp
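Pattern 4 can be sketched as an interval join between two streams whose result is upserted into an OLTP sink declared with a primary key; the version-timestamp column is what makes past snapshots replayable. All names below are illustrative:

```sql
-- Pattern 4 sketch: synchronize two Kafka streams with an interval join,
-- enrich, and upsert into an OLTP table keyed by user_id. Writing a fresh
-- version_ts on every upsert gives the sink row-level version history.
INSERT INTO user_profile_oltp            -- sink declared with PRIMARY KEY (user_id)
SELECT
  u.user_id,
  u.email,
  s.last_active_ts,
  CURRENT_TIMESTAMP AS version_ts        -- version bump on every upsert
FROM user_updates_kafka AS u
JOIN session_events_kafka AS s
  ON u.user_id = s.user_id
 AND s.event_ts BETWEEN u.update_ts - INTERVAL '5' MINUTE
                    AND u.update_ts + INTERVAL '5' MINUTE;  -- interval join
```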
23. Platform support for real-time ingestion and processing
● Support Thrift format in various source / sink connectors (FLIP-237)
● Expand production-level use cases for the different data ingestion and processing patterns
● Develop tools that let platform users easily build, test, and productionize FlinkSQL-based streaming applications
● Align with internal efforts on schema visualization / lineage tracking, table governance, and data formats