Why Wait? Real-Time Ingestion
Chen Qin
Heng Zhang
10/04/2022
Agenda
1. Introduction
2. Pain points and challenges
3. Real-time data ingestion & processing
4. Ongoing work
5. Q&A
1. Introduction
Confidential | © Pinterest
Who are we?
● We are a team of engineers, SREs, PM and EM that builds the stateful stream data processing platform called Xenon at Pinterest.
● We support around 100 engineers who build and operate nearly 100 Flink applications.
● We run (near) real-time applications at 10M messages per second and process petabytes every month.
● We have enabled 10+ top-level company KRs in the past 3 years.
Xenon - Pinterest stream processing platform
Layers shown in the diagram:
● The Resource Management & Job Execution Layer
● The Developer APIs
● The Deployment Stack
Components shown in the diagram:
● Cluster Management (YARN)
● NRTG
● Common Libraries and Connectors
● Flink SQL
● Job State Management (Checkpoints, Backups, Restores, Edits)
● Security / Auth (PII/FGAC)
● Job Health & Diagnosis (Dr. Squirrel)
● CI/CD Hermez
● Job Management Service
PinStats Analytic
“Overall, users … cited that currently
they have difficulties monitoring content
performance due to a lack of real-time
data being available, which they find
frustrating.”
Content Understanding
● Safety: content safety, quality and rich content signals in near real time.
● Distribution: fast distribution via near-real-time signals and learned retrieval.
Diagram labels: Content Creation, Audience Targeting, Content Understanding, Quality, Interests & Annotations, Embeddings, Performance
Infra Engineering
Realtime A/B testing
User uptime
metrics statsd aggregation
2. Pain points and challenges
Architecture Challenges
Discovery
● Schema fragmentation
● Connectivity segregation
● Lineage not tracked
Governance
● Compliance - GDPR, DSA
● Ownership and quality
● Access and security
Service Mindset
● “not a data problem… until data has
problem”
https://future.com/emerging-architectures-modern-data-infrastructure/
Pinterest Practice till 2021
Discovery
● Five catalogs depending on storage choice
● Can’t easily access data to backfill
● Lineage embedded in code and configuration files
Governance
● Tribal knowledge, owner left company
● Ping multiple teams to find owner(s) of a schema field
Service Mindset
● Offload state management to OLTP
● Limited reuse of data and logic definitions
Flink Dev velocity is the pain point
● Steep learning curve - Flink DataStream API, Time / Watermark / State, Async I/O, Source / Sink connectors, data formats and schemas
● Huge effort to build a streaming job from scratch with logic similar to its batch counterpart (Spark / Cascading / MapReduce)
● Hard to validate that streaming job results match batch job results, because the implementations are completely different and use different frameworks (Flink vs Spark)
Toward Federated Big Data(Base) System - 2022
Federation approach towards a rapidly changing data landscape, adapting to heterogeneous workloads and systems
● Extensibility: virtual table and view interface implemented by multiple compute engines (e.g. Spark / Flink / Presto)
● Connectivity: RDBMS, NoSQL, Message Queue, OLAP as well as cloud data warehouse
● Portability: user workload as UDF and SQL is easier to migrate across systems; API approach like Apache Beam is also compatible
‘Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases’ - Amit P. Sheth, James A. Larson
3. Real-time data ingestion & processing
Overview of Pinterest’s Data Ingestion and Processing Systems
Table and SQL centric approach
Ingestion and processing are continuous queries on Flink Tables
● Everything is a table - Kafka Topics, Tables / Segments and Services are all registered as Flink Tables (generic table and Hive-compatible table)
● Hive Metastore - centralized metadata service for all the Flink Tables
● Hive UDF - complex processing logic wrapped inside Hive UDFs, which can be shared and used in both Flink and Spark
● Iceberg - preferred storage format to support row-level deletion, schema evolution, versioning and efficient queries
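The shared-UDF point can be illustrated with a small sketch. It is plain Python rather than a real Hive UDF, and the names normalize_domain, batch_job and streaming_job are hypothetical. The idea it shows is only that one logic definition feeds both the batch and the streaming path, which makes their results directly comparable.

```python
from typing import Iterable, Iterator, List


def normalize_domain(url: str) -> str:
    """Hypothetical shared 'UDF': lower-cased host without scheme, 'www.' or path."""
    host = url.split("://", 1)[-1].split("/", 1)[0].lower()
    return host[4:] if host.startswith("www.") else host


def batch_job(urls: List[str]) -> List[str]:
    # Batch path: the whole input is available up front.
    return sorted(normalize_domain(u) for u in urls)


def streaming_job(urls: Iterable[str]) -> Iterator[str]:
    # Streaming path: records are handled one at a time as they arrive.
    for u in urls:
        yield normalize_domain(u)


if __name__ == "__main__":
    sample = ["https://WWW.Example.com/a", "http://pinterest.com/pin/1"]
    # Because both paths call the same function, validation is a plain comparison.
    print(batch_job(sample) == sorted(streaming_job(iter(sample))))
```

Keeping the logic in one function is what removes the "two implementations in two frameworks" validation problem described earlier.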
Pattern 1 - streams to streams filtering and transform
Components
● Read Kafka tables
● Filtering and transform
● Append into Kafka tables
Considerations
● Schema evolution
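A minimal sketch of Pattern 1, assuming events are plain dictionaries rather than Kafka records. The function name filter_and_transform and the fields id, kind and score are made up for illustration. The use of defaults for missing fields is one simple way to tolerate schema evolution, not necessarily the approach used at Pinterest.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_and_transform(
    events: Iterable[Dict[str, Any]], min_score: float = 0.5
) -> Iterator[Dict[str, Any]]:
    """Read a stream, keep rows above a threshold, and emit a reshaped row."""
    for event in events:
        # Schema evolution: older records may lack fields that were added later,
        # so read optional fields with defaults instead of failing.
        score = event.get("score", 0.0)
        if score < min_score:
            continue  # filtering step
        # Transform step: project and normalize the output row.
        yield {
            "id": event["id"],
            "kind": event.get("kind", "unknown").upper(),
            "score": score,
        }


if __name__ == "__main__":
    source = [
        {"id": 1, "score": 0.9, "kind": "pin"},
        {"id": 2, "score": 0.1},
        {"id": 3, "score": 0.7},  # written before 'kind' existed
    ]
    for row in filter_and_transform(source):
        print(row)
```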
Pattern 2 - raw log ingestion
Components
● Read Kafka table
● Deduplication
● Append to Iceberg / S3
Considerations
● Data format
● Late arrival events
● Event-time / processing-time based ingestion
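A rough sketch of Pattern 2 in plain Python, assuming each event carries a unique id and an event-time ts in seconds. The class RawLogIngestor, the allowed_lateness parameter and the in-memory sink list are hypothetical stand-ins. A real job would keep this state in the stream processor and write to Iceberg / S3.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RawLogIngestor:
    allowed_lateness: int = 60  # seconds behind the newest event time (assumed value)
    seen_ids: Set[str] = field(default_factory=set)
    max_event_time: int = 0
    sink: List[Dict] = field(default_factory=list)  # stands in for the Iceberg / S3 table
    late: List[Dict] = field(default_factory=list)  # side output for late arrivals

    def process(self, event: Dict) -> None:
        # Deduplication: drop events whose id was already ingested.
        if event["id"] in self.seen_ids:
            return
        self.seen_ids.add(event["id"])
        # Event-time handling: anything older than the watermark counts as late.
        watermark = self.max_event_time - self.allowed_lateness
        if event["ts"] < watermark:
            self.late.append(event)
            return
        self.max_event_time = max(self.max_event_time, event["ts"])
        self.sink.append(event)  # append-only write


if __name__ == "__main__":
    ingestor = RawLogIngestor()
    for e in [{"id": "a", "ts": 100}, {"id": "a", "ts": 100}, {"id": "b", "ts": 200}]:
        ingestor.process(e)
    print(len(ingestor.sink))
```

Note that seen_ids grows without bound in this sketch. A production job would need to expire old ids.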
Pattern 3 - real time data warehouse
Components
● Kafka table join lookup table
● Deduplication and aggregation
● Upsert to Iceberg / S3
Considerations
● Caching to reduce lookup latency
● State TTL
● Iceberg tuning
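The caching consideration in Pattern 3 can be sketched as follows, assuming the lookup table is reachable through a fetch function and the target table is a dictionary keyed by its primary key. CachedLookup, enrich_and_upsert and the fields user_id, pin_id and country are hypothetical names. The cache TTL here is a simplification and is not the same thing as Flink state TTL.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


class CachedLookup:
    """Wraps a slow dimension lookup with a small time-bounded cache."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # served from cache, no remote call
        self.misses += 1
        value = self._fetch(key)
        self._cache[key] = (now + self._ttl, value)
        return value


def enrich_and_upsert(events: Iterable[Dict], lookup: CachedLookup,
                      table: Dict[Tuple[str, str], int]) -> None:
    for event in events:
        dim = lookup.get(event["user_id"])      # stream joined with lookup table
        key = (dim["country"], event["pin_id"])  # primary key of the target table
        table[key] = table.get(key, 0) + 1       # aggregate, then upsert by key


if __name__ == "__main__":
    dims = {"u1": {"country": "US"}}
    warehouse: Dict[Tuple[str, str], int] = {}
    lookup = CachedLookup(dims.__getitem__)
    enrich_and_upsert([{"user_id": "u1", "pin_id": "p1"}] * 3, lookup, warehouse)
    print(warehouse, lookup.misses)
```

A longer cache TTL reduces lookup latency at the cost of staler dimension data, which is the trade-off behind the caching bullet above.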
Pattern 4 - online ingestion / indexing
Components
● Streams synchronization
● Event enrichment
● Upsert to OLTP
Considerations
● OLTP with version history (row snapshot of past
timestamp) helps with reproducible backfill
● Upsert bumps version timestamp
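The version-history consideration in Pattern 4 can be sketched with an in-memory store. VersionedStore, upsert and read_as_of are hypothetical names and do not correspond to any particular OLTP system. The sketch only shows how keeping one row snapshot per version timestamp lets a backfill read the row as it was at a past point in time.

```python
import bisect
from typing import Dict, List, Optional


class VersionedStore:
    """Toy key-value store that keeps every row version, ordered by timestamp."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[int]] = {}  # key -> sorted version timestamps
        self._rows: Dict[str, List[dict]] = {}      # key -> rows aligned with _versions

    def upsert(self, key: str, row: dict, version_ts: int) -> None:
        # Each upsert writes a new snapshot under its own version timestamp.
        ts_list = self._versions.setdefault(key, [])
        rows = self._rows.setdefault(key, [])
        idx = bisect.bisect_right(ts_list, version_ts)
        ts_list.insert(idx, version_ts)
        rows.insert(idx, dict(row))

    def read_as_of(self, key: str, ts: int) -> Optional[dict]:
        # Return the newest snapshot whose version timestamp is <= ts, if any.
        ts_list = self._versions.get(key, [])
        idx = bisect.bisect_right(ts_list, ts)
        return self._rows[key][idx - 1] if idx > 0 else None


if __name__ == "__main__":
    store = VersionedStore()
    store.upsert("pin1", {"title": "draft"}, version_ts=100)
    store.upsert("pin1", {"title": "final"}, version_ts=200)
    # A backfill replaying events from time 150 sees the row as it was then.
    print(store.read_as_of("pin1", 150))
```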
4. Ongoing efforts
Platform support for real time ingestion and processing
● Support Thrift format in various source / sink connectors (FLIP-237)
● Expand production-level use cases for the different data ingestion and processing patterns
● Develop tools to allow platform users to easily build, test and productionize
FlinkSQL-based streaming applications.
● Align with internal efforts on Schema visualization / lineage tracking, table
governance, and data formats
5. Q & A

Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022

  • 1. Why Wait ? Real-Time Ingestion Chen Qin Heng Zhang 10/04/2022
  • 2. Agenda 1. Introduction 2. Pain points and challenges 3. Real-time data ingestion & processing 4. Ongoing work 5. Q&A
  • 4. Confidential | © Pinterest Who are we? ● We are a team of engineers, SREs, PM and EM that builds the stateful stream data processing platform called Xenon at Pinterest. ● We support around 100 engineers build and operate nearly 100 Flink Applications. ● We run (near) real time applications with at 10M messages per second and process Petabytes every month. ● We have enabled 10+ top level company KRs in the past 3 years.
  • 5. Xenon - Pinterest stream processing platform. Diagram labels: Cluster Management (YARN); NRTG; Common Libraries and Connectors; Flink SQL; The Resource Management & Job Execution Layer; The Developer APIs; Job State Management (Checkpoints, Backups, Restores, Edits); Security / Auth (PII/FGAC); Job Health & Diagnosis (Dr. Squirrel); CI/CD Hermez; The Deployment Stack; Job Management Service
  • 6. PinStats Analytic “Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”
  • 7. Content Understanding ● Safety: content safety, quality and rich content signals in near real time. ● Distribution: fast distribution via near-real-time signals and learned retrieval. Diagram labels: Content Creation; Audience Targeting; Content Understanding; Quality; Interests & Annotations; Embeddings; Performance
  • 8. Infra Engineering Realtime A/B testing User uptime metrics statsd aggregation
  • 9. 2. Pain points and challenges
  • 10. Architecture Challenges. Discovery: ● Schema fragmentation ● Connectivity segregation ● Lineage not tracked. Governance: ● Compliance - GDPR, DSA ● Ownership and quality ● Access and security. Service Mindset: ● “not a data problem… until data has problem”. Source: https://future.com/emerging-architectures-modern-data-infrastructure/
  • 11. Pinterest Practice till 2021. Discovery: ● Five catalogs, depending on storage choice ● Can’t easily access data to backfill ● Lineage embedded in code and configuration files. Governance: ● Tribal knowledge, lost when an owner leaves the company ● Need to ping multiple teams to find the owner(s) of a schema field. Service Mindset: ● State management offloaded to OLTP ● Limited reuse of data and logic definitions
  • 12. Flink dev velocity is the pain point ● Steep learning curve - Flink DataStream API, Time / Watermark / State, Async I/O, Source / Sink connectors, data formats and schemas ● Huge effort to build a streaming job from scratch with logic similar to its batch counterpart (Spark / Cascading / MapReduce) ● Hard to validate that streaming job results match batch job results, because the two are implemented separately on different frameworks (Flink vs Spark)
  • 13. Toward Federated Big Data(Base) System - 2022. A federation approach to a rapidly changing data landscape that adapts to heterogeneous workloads and systems ● Extensibility: virtual table and view interface implemented by multiple compute engines (e.g. Spark / Flink / Presto) ● Connectivity: RDBMS, NoSQL, message queue, OLAP, as well as cloud data warehouse ● Portability: user workloads expressed as UDFs and SQL are easier to migrate across systems; an API approach like Apache Beam is also compatible. Reference: “Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases” - Amit P. Sheth, James A. Larson
  • 14. 3. Real-time data ingestion & processing
  • 15. Overview of Pinterest’s Data Ingestion and Processing Systems
  • 16. Table and SQL centric approach ● Everything is a table - Kafka Topic, Table / Segments and Services are all registered as Flink Tables (generic table and Hive-compatible table) ● Hive Metastore - centralized metadata service for all the Flink Tables ● Hive UDF - complex processing logic wrapped inside Hive UDFs, which can be shared and used in both Flink and Spark ● Iceberg - preferred storage format to support row-level deletion, schema evolution, versioning and efficient queries
  • 17. Ingestion and processing are continuous queries on Flink Tables
  • 18. Pattern 1 - stream-to-stream filtering and transformation. Components: ● Read Kafka tables ● Filtering and transform ● Append into Kafka tables. Considerations: ● Schema evolution
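
Pattern 1 can be illustrated with a minimal sketch in plain Python. This is not Flink or Kafka code and not the speakers' implementation; the field names (`user_id`, `event_type`, `country`) and the default value are hypothetical. It shows a filter and projection that keeps working when a newer schema version adds a field that older records lack.

```python
from typing import Dict, Iterable, Iterator


def filter_and_transform(
    events: Iterable[Dict[str, object]],
    default_country: str = "unknown",  # hypothetical default for the added field
) -> Iterator[Dict[str, object]]:
    """Keep only click events and project them onto the output schema."""
    for event in events:
        # Filtering step: drop everything that is not a click.
        if event.get("event_type") != "click":
            continue
        # Transform step: "country" is assumed to be a field added in a later
        # schema version, so records written with the old schema get a default.
        yield {
            "user_id": event["user_id"],
            "event_type": event["event_type"],
            "country": event.get("country", default_country),
        }


source = [
    {"user_id": 1, "event_type": "click"},                   # old schema
    {"user_id": 2, "event_type": "view", "country": "US"},   # filtered out
    {"user_id": 3, "event_type": "click", "country": "DE"},  # new schema
]
sink = list(filter_and_transform(source))
```

Defaulting a missing field is only one way to tolerate schema evolution; rejecting or routing old-schema records elsewhere would be an equally valid choice.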
  • 19. Pattern 2 - raw log ingestion. Components: ● Read Kafka table ● Deduplication ● Append to Iceberg / S3. Considerations: ● Data format ● Late arrival events ● Event-time / processing-time based ingestion
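
The deduplication and late-arrival considerations of Pattern 2 can be sketched in plain Python as follows. This is an illustration only, not Flink's watermark or state API; the class name `DedupIngestor`, the `allowed_lateness_s` parameter and the in-memory lists standing in for Iceberg / S3 are all assumptions made for the example.

```python
class DedupIngestor:
    """Event-time ingestion with deduplication by event id (illustrative)."""

    def __init__(self, allowed_lateness_s: float) -> None:
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = float("-inf")  # highest event time seen so far
        self.seen_ids = set()                # dedup state, unbounded in this sketch
        self.accepted = []                   # stands in for the append-only sink
        self.late = []                       # stands in for a late-event side output

    def process(self, event_id: str, event_time: float) -> str:
        # A simple watermark: the maximum event time minus the allowed lateness.
        watermark = self.max_event_time - self.allowed_lateness_s
        if event_time < watermark:
            self.late.append(event_id)
            return "late"
        if event_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(event_id)
        self.max_event_time = max(self.max_event_time, event_time)
        self.accepted.append(event_id)
        return "accepted"


ingestor = DedupIngestor(allowed_lateness_s=10)
results = [
    ingestor.process("a", 100),
    ingestor.process("a", 100),  # same id again
    ingestor.process("b", 120),
    ingestor.process("c", 105),  # behind the watermark of 110
    ingestor.process("d", 115),  # out of order but within allowed lateness
]
```

A production job would also have to expire the dedup state, which this sketch keeps forever.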
  • 20. Pattern 3 - real-time data warehouse. Components: ● Kafka table join lookup table ● Deduplication and aggregation ● Upsert to Iceberg / S3. Considerations: ● Caching to reduce lookup latency ● State TTL ● Iceberg tuning
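
The lookup-join caching consideration in Pattern 3 can be sketched in plain Python. This is a hedged illustration, not Flink's lookup cache or the Iceberg sink; `TtlLookupCache`, `aggregate_and_upsert`, the `pin_id` field and the dict standing in for the warehouse table are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, Iterable, Tuple


class TtlLookupCache:
    """Caches lookup-table results and refreshes entries older than ttl_s."""

    def __init__(self, lookup: Callable[[int], str], ttl_s: float,
                 clock: Callable[[], float]) -> None:
        self._lookup = lookup
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so the example is deterministic
        self._entries: Dict[int, Tuple[str, float]] = {}
        self.misses = 0      # number of calls that reached the lookup table

    def get(self, key: int) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]
        self.misses += 1
        value = self._lookup(key)
        self._entries[key] = (value, now)
        return value


def aggregate_and_upsert(events: Iterable[dict], cache: TtlLookupCache,
                         warehouse: Dict[str, int]) -> None:
    """Join each event with its category, then upsert a per-category count."""
    for event in events:
        category = cache.get(event["pin_id"])
        # Upsert: insert the row if the key is new, otherwise update it.
        warehouse[category] = warehouse.get(category, 0) + 1


now = [0.0]
categories = {1: "food", 2: "travel"}  # stands in for the lookup table
cache = TtlLookupCache(categories.__getitem__, ttl_s=60, clock=lambda: now[0])
warehouse: Dict[str, int] = {}
aggregate_and_upsert([{"pin_id": 1}, {"pin_id": 1}, {"pin_id": 2}], cache, warehouse)
```

The TTL trades freshness for latency: a longer TTL means fewer lookups but a longer window in which a changed dimension row goes unnoticed.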
  • 21. Pattern 4 - online ingestion / indexing. Components: ● Streams synchronization ● Event enrichment ● Upsert to OLTP. Considerations: ● OLTP with version history (row snapshot of past timestamp) helps with reproducible backfill ● Upsert bumps version timestamp
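
The version-history consideration in Pattern 4 can be sketched with a small in-memory store in plain Python. This is an assumption-laden illustration rather than any particular OLTP system's API; `VersionedStore`, `upsert` and `read_as_of` are hypothetical names. Each upsert records a new version timestamp, and a backfill can read the row as it looked at a past timestamp.

```python
import bisect
from typing import Any, Dict, List, Optional


class VersionedStore:
    """Keeps every version of a row, ordered by version timestamp."""

    def __init__(self) -> None:
        self._timestamps: Dict[str, List[int]] = {}
        self._values: Dict[str, List[Any]] = {}

    def upsert(self, key: str, value: Any, ts: int) -> None:
        # Each upsert adds a version instead of overwriting the row in place.
        timestamps = self._timestamps.setdefault(key, [])
        values = self._values.setdefault(key, [])
        index = bisect.bisect_right(timestamps, ts)
        timestamps.insert(index, ts)
        values.insert(index, value)

    def read_as_of(self, key: str, ts: int) -> Optional[Any]:
        # Return the latest version written at or before ts, if any.
        timestamps = self._timestamps.get(key, [])
        index = bisect.bisect_right(timestamps, ts)
        return self._values[key][index - 1] if index > 0 else None


store = VersionedStore()
store.upsert("pin1", {"title": "a"}, ts=10)
store.upsert("pin1", {"title": "b"}, ts=20)
snapshot = store.read_as_of("pin1", 15)  # row as seen by a backfill at ts=15
```

Because reads are pinned to a timestamp, rerunning a backfill over the same time range sees the same rows even after later upserts.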
  • 23. Platform support for real-time ingestion and processing ● Support Thrift format in various source / sink connectors (FLIP-237) ● Expand production-level use cases for the different data ingestion and processing patterns ● Develop tools that let platform users easily build, test and productionize FlinkSQL-based streaming applications ● Align with internal efforts on schema visualization / lineage tracking, table governance, and data formats
  • 24. 5. Q & A