Historically, Pinterest's data warehouse ingestion and indexing services were implemented on batch ETL and Kafka streaming, respectively. As the product side leans more toward real-time and near-real-time data to innovate and compete, teams across Pinterest are working together to revamp the ingestion and processing stack.
In this talk, we share our near-real-time ingestion system built on top of Apache Kafka, Apache Flink, and Apache Iceberg. We picked ANSI SQL as the common currency to minimize the "lambda architecture" learning curve for teams adopting near-real-time data.
6. PinStats Analytic
“Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”
7. Content Understanding
Safety: content safety, quality, and rich content signals in near real time.
Distribution: fast distribution via near-real-time signals and learned retrieval.
[Diagram: content signal areas: Content Creation, Audience Targeting, Content Understanding, Quality, Interests & Annotations, Embeddings, Performance]
11. Pinterest Practice till 2021
Discovery
● Five catalogs, depending on storage choice
● Can't easily access data to backfill
● Lineage embedded in code and configuration files
Governance
● Tribal knowledge; owner left the company
● Ping multiple teams to find the owner(s) of a schema field
Service Mindset
● Offload state management to OLTP
● Limited data and logic definition reuse
13. Toward a Federated Big Data(base) System - 2022
A federation approach toward a rapidly changing data landscape, adapting to heterogeneous workloads and systems:
● Extensibility: virtual table and view interfaces implemented by multiple compute engines (e.g., Spark / Flink / Presto)
● Connectivity: RDBMS, NoSQL, message queues, and OLAP, as well as cloud data warehouses
● Portability: user workloads expressed as UDFs and SQL are easier to migrate across systems; an API approach like Apache Beam is also compatible
'Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases' - Amit P. Sheth, James A. Larson
18. Pattern 1 - streams-to-streams filtering and transform
Components
● Read Kafka tables
● Filter and transform
● Append into Kafka tables
Considerations
● Schema evolution
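Pattern 1 can be expressed as a single FlinkSQL statement. A minimal sketch, assuming Kafka-backed tables `raw_events_kafka` and `filtered_events_kafka` are already registered in the catalog (all table and column names here are illustrative, not Pinterest's actual schemas):

```sql
-- Pattern 1 sketch: filter and lightly transform one Kafka stream into another.
INSERT INTO filtered_events_kafka       -- sink: another Kafka-backed table
SELECT
  event_id,
  user_id,
  LOWER(event_type) AS event_type,     -- example transform: normalize casing
  event_ts
FROM raw_events_kafka                   -- source: Kafka-backed table
WHERE event_type IS NOT NULL
  AND user_id <> 0;                     -- drop malformed records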
19. Pattern 2 - raw log ingestion
Components
● Read Kafka table
● Deduplication
● Append to Iceberg / S3
Considerations
● Data format
● Late-arriving events
● Event-time vs. processing-time based ingestion
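Deduplication in Pattern 2 maps naturally onto FlinkSQL's `ROW_NUMBER()` dedup idiom. A minimal sketch, assuming hypothetical tables `raw_logs_kafka` (with a processing-time attribute `proc_ts`) and an Iceberg-backed `raw_logs_iceberg`:

```sql
-- Pattern 2 sketch: keep the first-seen copy of each event id, then append
-- the deduplicated stream to an Iceberg table.
INSERT INTO raw_logs_iceberg
SELECT event_id, user_id, payload, event_ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY event_id
           ORDER BY proc_ts ASC          -- first arrival wins
         ) AS rn
  FROM raw_logs_kafka
)
WHERE rn = 1;
```

Ordering by event time instead of processing time here is the lever for the event-time vs. processing-time consideration above.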
20. Pattern 3 - real-time data warehouse
Components
● Join Kafka table with lookup table
● Deduplication and aggregation
● Upsert to Iceberg / S3
Considerations
● Caching to reduce lookup latency
● State TTL
● Iceberg tuning
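The lookup join plus aggregation in Pattern 3 can be sketched with Flink's temporal lookup join syntax. This assumes hypothetical tables `events_kafka` (with a processing-time attribute `proc_ts` and a watermarked `event_ts`), a dimension table `dim_pins`, and an Iceberg sink `pin_metrics_iceberg`:

```sql
-- Pattern 3 sketch: enrich a Kafka stream via a lookup join against a
-- dimension table, aggregate per minute, and upsert the result to Iceberg.
INSERT INTO pin_metrics_iceberg
SELECT
  d.advertiser_id,
  TUMBLE_START(e.event_ts, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS impressions
FROM events_kafka AS e
JOIN dim_pins FOR SYSTEM_TIME AS OF e.proc_ts AS d   -- lookup join
  ON e.pin_id = d.pin_id
GROUP BY
  d.advertiser_id,
  TUMBLE(e.event_ts, INTERVAL '1' MINUTE);
```

Lookup-join caching and state TTL are set on the connector and job configuration rather than in the query itself.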
21. Pattern 4 - online ingestion / indexing
Components
● Stream synchronization
● Event enrichment
● Upsert to OLTP
Considerations
● OLTP with version history (row snapshots as of a past timestamp) helps with reproducible backfill
● Each upsert bumps the version timestamp
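Pattern 4 can be sketched as an interval join between two streams whose result is upserted into an OLTP sink declared with a primary key; the version-timestamp column is what makes past snapshots replayable. All names below are illustrative:

```sql
-- Pattern 4 sketch: synchronize two Kafka streams with an interval join,
-- enrich, and upsert into an OLTP table keyed by user_id. Writing a fresh
-- version_ts on every upsert gives the sink row-level version history.
INSERT INTO user_profile_oltp            -- sink declared with PRIMARY KEY (user_id)
SELECT
  u.user_id,
  u.email,
  s.last_active_ts,
  CURRENT_TIMESTAMP AS version_ts        -- version bump on every upsert
FROM user_updates_kafka AS u
JOIN session_events_kafka AS s
  ON u.user_id = s.user_id
 AND s.event_ts BETWEEN u.update_ts - INTERVAL '5' MINUTE
                    AND u.update_ts + INTERVAL '5' MINUTE;  -- interval join
```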
23. Platform support for real-time ingestion and processing
● Support Thrift format in various source / sink connectors (FLIP-237)
● Expand production-level use cases for the different data ingestion and processing patterns
● Develop tools that let platform users easily build, test, and productionize FlinkSQL-based streaming applications
● Align with internal efforts on schema visualization / lineage tracking, table governance, and data formats