Flink-powered stream processing platform at Pinterest
Rainie Li
Software engineer@Pinterest
Kanchi Masalia
Software engineer@Pinterest
Agenda
1. Introduction
2. Challenges & Use cases
3. Platform missions & Frameworks
4. Ongoing Work
5. Q&A
Confidential
|
©
Pinterest
Streaming use cases on Xenon platform
[Chart: OKR promised vs. OKR delivered; labels: ~2x over, ~3x scale]
Why Real Time Stream Processing
● Ads real-time spend and reporting - Calculate spend against budget limits in near real time
to quickly adjust budget pacing and update advertisers with more timely reporting results
● Fast User Signals - Make user content signals available quickly after content creation and use
these signals in ML pipelines for a personalized and fresh user experience
● Realtime Trust & Safety - Reduce levels of unsafe content as close to content creation time as possible
● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement
metrics to Creators so they can refine their content with minimal feedback delay
● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users
by updating product metadata in near real time
● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup,
verification, and evaluation
Existing Issues
● Fragmented technologies
○ Self-managed Kafka Streams jobs (Ads Infra)
○ Overwatch platform for small batch Spark jobs (Ads Data,
Measurement)
● Lack of developer support
● Availability & scalability issues
Who are we?
● We are a team of engineers, SREs, PM and EM that builds the
stateful stream data processing platform called Xenon at Pinterest.
● We support around 100 engineers who build and operate 100+ Flink
applications.
● We run (near) real-time applications at 300M messages per
second and process 150TB of data per second.
● We have enabled 10+ top level company KRs in the past 3 years.
Xenon platform Mission
● Stability: reliably host all deployed Flink-based stream processing
applications
● Dev Velocity: quickly productionize new use cases / features to
meet business and product needs
● Cloud Efficiency: efficiently operate infrastructure and strive for best
practices
Xenon - Pinterest stream processing platform
[Diagram: platform layers and components]
● Layers: The Developer APIs; The Resource Management & Job Execution Layer; The Deployment Stack
● Components: Cluster Management (YARN); NRTG; Common Libraries and Connectors; Flink SQL; Job State Management (Checkpoints, Backups, Restores, Edits); Security / Auth (PII/FGAC); Job Health & Diagnosis (Dr. Squirrel); CI/CD; Hermez; Job Management Service
PinStats Analytic use case
“Overall, users … cited that currently they have difficulties monitoring content
performance due to a lack of real-time data being available, which they find
frustrating.”
Creator Content use cases
● Fast user signals: Make user content signals available quickly after content
creation
● Safety: Reduce levels of unsafe content as close to content creation time as possible
[Diagram labels: Content Creation, Audience Targeting, Content Understanding, Quality, Interests & Annotations, Embeddings, Performance]
Ads real-time spend and reporting
● Calculate spend against budget limits in real time to quickly adjust budget and
update advertisers with more timely results
Xenon platform Mission No.1 - stability
● Xenon Stability Strategy
● Job Deployment Framework - Hermez and Job Submission service
● Job Management Service - Pinterest stateful streaming application
runtime monitoring and automatic failover to a different AZ.
Xenon Job Deployment Framework
[Diagram: deployment flow in numbered steps 1-8 across Repo, Jenkins, Artifactory, S3, Hermez, Job Submission Service, and YARN clusters]
● Xenon jobs / Hermez workloads: 154
● Production Xenon use cases: >90
● Deployments every day: 179
Highlights
Stability and Tier 1 support
● Enhanced JSS State Machine
● Supported job level dedicated S3 buckets
User experience
● Hermez supported most recent checkpoint deployment
● Hermez supported kill job and distributed shell
● Enriched savepoint information on Hermez
● Track daily & monthly deployment success rate
Metrics
● Job submission latency
Xenon Job Management Service
Monitoring
● Job Status
● Critical metrics (QPS)
● Checkpointing health
● Job/task health
● Notify users
Auto Recovery
Auto recover failed jobs from:
● Last completed checkpoint
● Most recent savepoint
● Fresh State
AZ Failure Resilience
Auto failover jobs to backup clusters in different AZs when primary cluster/AZ goes down
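The recovery order and the AZ failover described above can be sketched roughly as follows. This is a minimal illustration, not the JMS implementation: `JobState`, `choose_restore_point`, and `choose_cluster` are hypothetical names, and the idea that cluster health is available as a simple set is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass
class JobState:
    """Hypothetical view of the restore points known for one job."""
    last_completed_checkpoint: Optional[str] = None  # e.g. a checkpoint path
    most_recent_savepoint: Optional[str] = None      # e.g. a savepoint path


def choose_restore_point(state: JobState) -> Optional[str]:
    """Return the path to restore from, or None to start from fresh state.

    Preference order follows the slide: last completed checkpoint,
    then most recent savepoint, then fresh state.
    """
    if state.last_completed_checkpoint:
        return state.last_completed_checkpoint
    if state.most_recent_savepoint:
        return state.most_recent_savepoint
    return None


def choose_cluster(primary: str, backups: Sequence[str],
                   healthy: Set[str]) -> Optional[str]:
    """Pick the primary cluster if healthy, else the first healthy backup."""
    if primary in healthy:
        return primary
    for cluster in backups:
        if cluster in healthy:
            return cluster
    return None  # no healthy cluster available
```

For example, with the primary in AZ-a down and only AZ-c healthy, `choose_cluster("az-a", ["az-b", "az-c"], {"az-c"})` selects `"az-c"`, and the job would then be restored from whatever `choose_restore_point` returns.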
JMS Architecture
[Diagram: Xenon JMS (Monitoring, Auto Recovery, Deployment), Statsboard, ZK clusters, Hermez, JSS, Flink API, user, and YARN clusters in AZ-a, AZ-b, and AZ-c with a Failover path]
● Jobs under management: >90
● Faster recovery time: 10X
● Jobs recovered every week: >7
Xenon platform Mission No. 2 - Developer Velocity
● Near Real Time Galaxy - Pinterest stateful streaming application Job
development framework
● CICD - Pinterest stateful streaming application change rollout flow
● Dr. Squirrel - Pinterest self-serve streaming application
troubleshooting portal
● Working model - New Use Case Onboarding Process
NRTG
Definition:
● Pinterest stateful streaming application Job development framework
History:
● Galaxy: a high-level managed execution platform for producing and
consuming signals (e.g. entity features) about Pinterest entities (such
as pins, boards, users).
● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow
API used in batch and extends it to streaming applications.
NRTG components (khaki boxes below)
VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill)
● User code focuses only on Business logic. ✅
● Tune flink operators using configs. ✅
● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧
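The split between business logic and operator tuning can be illustrated with a toy sketch. It does not use the NRTG or Flink APIs: `CONFIG`, `REGISTRY`, `build_plan`, and `run` are hypothetical names, and the "parallelism" value is only carried along to show where a tuning knob would live outside user code.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def to_upper(record: str) -> str:
    """Business logic: the only part the user writes."""
    return record.upper()


# Hypothetical config: operator tuning lives here, not in user code.
CONFIG = {"operators": [{"name": "to_upper", "parallelism": 4}]}

# Hypothetical registry mapping config names to user functions.
REGISTRY: Dict[str, Callable[[str], str]] = {"to_upper": to_upper}


def build_plan(config: dict,
               registry: Dict[str, Callable[[str], str]]
               ) -> List[Tuple[Callable[[str], str], int]]:
    """Pair each configured operator with its tuning value (default 1)."""
    return [(registry[op["name"]], op.get("parallelism", 1))
            for op in config["operators"]]


def run(plan: List[Tuple[Callable[[str], str], int]],
        records: Iterable[str]) -> List[str]:
    """Apply each map transform in order; parallelism is ignored in this toy."""
    out = list(records)
    for fn, _parallelism in plan:
        out = [fn(r) for r in out]
    return out
```

Changing the parallelism in `CONFIG` leaves `to_upper` untouched, which is the property the slide describes.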
[Diagram labels: Xenon, Flink, Application, Code, Config]
Xenon CICD framework - big picture
● Bring the CICD practice from stateless online services to stateful streaming world
● Leverage the same CICD infrastructure
● Customize the CICD pipeline for validating and deploying Flink-based stream
applications
● Achieve the goal of safely rolling out Xenon user / platform changes with minimal
human effort involved in validation
Xenon CICD pipelines - details
● auto-triggered based on cron rule and availability of new artifacts
● stability checks
○ job submission success
○ no restart-loop
○ savepoint generation success
○ ACA metrics validation
○ auto-recovery from TM/JM failure
● Prod deploy: decider-controlled, safe operations on prod job during
business hours
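A stability gate over the checks listed above might look like the sketch below. It is an assumption about shape only: the check names mirror the slide, but `run_stability_gate` and the callables passed to it are hypothetical, and a real pipeline would run these as pipeline stages rather than in-process functions.

```python
from typing import Callable, Dict, List, Tuple

# Check names mirror the slide; the functions behind them are placeholders.
STABILITY_CHECKS = [
    "job_submission_success",
    "no_restart_loop",
    "savepoint_generation_success",
    "aca_metrics_validation",
    "auto_recovery_from_tm_jm_failure",
]


def run_stability_gate(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[bool, List[str]]:
    """Run every required check and report which ones failed.

    A missing check counts as a failure, so a change cannot be promoted
    to prod just because a check was never wired up.
    """
    failed = []
    for name in STABILITY_CHECKS:
        check = checks.get(name)
        if check is None or not check():
            failed.append(name)
    return (not failed, failed)
```

The prod deploy step would proceed only when the first element of the result is `True`; the list of failed names gives the user something concrete to investigate.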
Xenon CICD framework - User Interface
Xenon CICD Pipeline UI
● Pipeline execution history
● Pipeline operation: disable / enable / trigger
● Links to Pipeline YAML and Spinnaker
Spinnaker UI
● Pipeline parameters
● Pipeline execution status
● Details about each Stage
Job Debugging tool - Dr. Squirrel
Definition:
● One-stop shop for Flink job troubleshooting
Features:
● Surface suspicious stats to Xenon users instead of users searching for them
○ GC, CPU, memory, backpressure, exceptions, bad config...
● Provide instructions on top of suspicious stats
Goal:
● Cut down troubleshooting time, lower the required Flink internal knowledge for
troubleshooting, increase the dev velocity
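The idea of surfacing suspicious stats together with instructions can be sketched as a small rule table. Everything here is hypothetical: the stat names, thresholds, and advice strings are illustrative assumptions, not Dr. Squirrel's actual rules.

```python
from typing import Dict, List, NamedTuple, Tuple


class Finding(NamedTuple):
    stat: str
    value: float
    advice: str


# Hypothetical rules: stat name -> (threshold, advice shown to the user).
RULES: Dict[str, Tuple[float, str]] = {
    "gc_time_ratio": (
        0.10, "High GC time: consider increasing task manager memory."),
    "cpu_utilization": (
        0.90, "CPU is saturated: consider raising parallelism."),
    "backpressure_ratio": (
        0.50, "Backpressure detected: inspect the slowest downstream operator."),
}


def diagnose(stats: Dict[str, float]) -> List[Finding]:
    """Return one finding per stat that exceeds its threshold."""
    findings = []
    for stat, (threshold, advice) in RULES.items():
        value = stats.get(stat)
        if value is not None and value > threshold:
            findings.append(Finding(stat, value, advice))
    return findings
```

Calling `diagnose({"gc_time_ratio": 0.25, "cpu_utilization": 0.4})` yields a single finding for GC, so the user sees the suspicious stat and a suggested next step without searching dashboards.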
Dr. Squirrel UI
Architecture - Part 1
Architecture - Part 2
Working model - New Use Case Onboarding Process
● Xenon team provides managed bootstrap of new use case:
○ best practices in terms of choosing framework and deciding job graph
○ Dev environment setup
○ a buildable and deployable skeleton project (bazel, java, test, configs)
○ Hermez workloads creation
○ CICD pipeline
○ YARN queue
○ dashboard / alerts with default settings
● Xenon developers write and test business logic code
● Support auto-generation of NRTG and Flink SQL based projects
Outcome: reduce the onboarding time by 3+ weeks
Xenon platform Mission No. 3 - Cloud efficiency (ongoing)
● Auto Scaling - Auto tuning & auto scaling up/down of Flink applications
● Cluster upgrade - Automatic job migration during platform upgrade
● Resource Optimization - Load balance Xenon clusters
● Evaluate k8s
Auto Scaling
● Service to dynamically adjust job parallelism based on metrics: Kafka lag, CPU utilization, and
backpressure.
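A scaling decision driven by these three metrics could be sketched as below. The thresholds, the doubling/halving policy, and the names `JobMetrics` and `recommend_parallelism` are all assumptions chosen for illustration; the slides do not describe the actual policy.

```python
from dataclasses import dataclass


@dataclass
class JobMetrics:
    kafka_lag: int          # outstanding messages across consumed partitions
    cpu_utilization: float  # 0.0 to 1.0, averaged over task managers
    backpressured: bool     # whether any operator reports backpressure


def recommend_parallelism(current: int, m: JobMetrics, *,
                          min_p: int = 1, max_p: int = 256,
                          lag_threshold: int = 100_000,
                          cpu_high: float = 0.8,
                          cpu_low: float = 0.3) -> int:
    """Suggest a new parallelism, clamped to [min_p, max_p].

    Scale up when the job is falling behind or saturated; scale down only
    when there is no lag, no backpressure, and CPU is mostly idle.
    """
    if m.kafka_lag > lag_threshold or m.backpressured or m.cpu_utilization > cpu_high:
        target = current * 2
    elif m.kafka_lag == 0 and m.cpu_utilization < cpu_low:
        target = current // 2
    else:
        target = current
    return max(min_p, min(max_p, target))
```

Because changing parallelism in a stateful job generally means restarting it from a savepoint or checkpoint, a real service would likely add a cooldown between decisions; that is omitted here.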
Questions?
Anumol Sebastian
Chenqi Liu
Hannah Chen
Divye Kapoor
Kanchi Masalia
Lu Niu
Rainie Li
Teja Thotapalli
Nishant More
Samuel Bahr
Heng Zhang
Kevin Browne
Sergii Marchenko
Ashish Jhaveri
Dinesh Kumar Sekar
Chen Qin
Shaowen Wang
YOU?!
Q & A
Thank you

More Related Content

What's hot

Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistentconfluent
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkVerverica
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache PinotAltinity Ltd
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 

What's hot (20)

Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 

Similar to Flink powered stream processing platform at Pinterest

Why Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableWhy Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableHostedbyConfluent
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
 
Accelerating Digital Transformation: It's About Digital Enablement
Accelerating Digital Transformation:  It's About Digital EnablementAccelerating Digital Transformation:  It's About Digital Enablement
Accelerating Digital Transformation: It's About Digital EnablementJoshua Gossett
 
Hewlett Packard Entreprise | Stormrunner load | Game Changer
Hewlett Packard Entreprise | Stormrunner load | Game ChangerHewlett Packard Entreprise | Stormrunner load | Game Changer
Hewlett Packard Entreprise | Stormrunner load | Game ChangerJeffrey Nunn
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
 
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open SourceAll Things Open
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
 
Software engineering with Softjourn
Software engineering with SoftjournSoftware engineering with Softjourn
Software engineering with SoftjournEmmy Gengler
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...confluent
 
The Kubernetes Effect
The Kubernetes EffectThe Kubernetes Effect
The Kubernetes EffectBilgin Ibryam
 
The differing ways to monitor and instrument
The differing ways to monitor and instrumentThe differing ways to monitor and instrument
The differing ways to monitor and instrumentJonah Kowall
 
Nayeem shaik resume
Nayeem shaik resumeNayeem shaik resume
Nayeem shaik resumeNayeem Shaik
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesJosef Adersberger
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesQAware GmbH
 

Similar to Flink powered stream processing platform at Pinterest (20)

Why Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made ScalableWhy Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
Why Serverless Flink Matters - Blazing Fast Stream Processing Made Scalable
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
Un-clouding the cloud
Un-clouding the cloudUn-clouding the cloud
Un-clouding the cloud
 
Accelerating Digital Transformation: It's About Digital Enablement
Accelerating Digital Transformation:  It's About Digital EnablementAccelerating Digital Transformation:  It's About Digital Enablement
Accelerating Digital Transformation: It's About Digital Enablement
 
Hewlett Packard Entreprise | Stormrunner load | Game Changer
Hewlett Packard Entreprise | Stormrunner load | Game ChangerHewlett Packard Entreprise | Stormrunner load | Game Changer
Hewlett Packard Entreprise | Stormrunner load | Game Changer
 
DeepakSingh
DeepakSinghDeepakSingh
DeepakSingh
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
 
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
Software engineering with Softjourn
Software engineering with SoftjournSoftware engineering with Softjourn
Software engineering with Softjourn
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
 
Ahmed El Mawaziny CV
Ahmed El Mawaziny CVAhmed El Mawaziny CV
Ahmed El Mawaziny CV
 
The Kubernetes Effect
The Kubernetes EffectThe Kubernetes Effect
The Kubernetes Effect
 
The differing ways to monitor and instrument
The differing ways to monitor and instrumentThe differing ways to monitor and instrument
The differing ways to monitor and instrument
 
Cisco project ideas
Cisco   project ideasCisco   project ideas
Cisco project ideas
 
Nayeem shaik resume
Nayeem shaik resumeNayeem shaik resume
Nayeem shaik resume
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 

More from Flink Forward

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!Flink Forward
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitFlink Forward
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkFlink Forward
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 

More from Flink Forward (10)

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 



Flink powered stream processing platform at Pinterest

  • 2. Flink-powered stream processing platform at Pinterest Rainie Li Software engineer@Pinterest Kanchi Masalia Software engineer@Pinterest
  • 3. Agenda 1. Introduction 2. Challenges & Use cases 3. Platform missions & Frameworks 4. Ongoing Work 5. Q&A
  • 6. Confidential | © Pinterest Streaming use cases on Xenon platform OKR promised OKR delivered ~2x over ~3x scale
  • 7. Why Real Time Stream Processing ● Ads real-time spend and reporting - Calculate spend against budget limits in near real time to quickly adjust budget pacing and update advertisers with more timely reporting results ● Fast User Signals - Make user content signals available quickly after content creation and use these signals in ML pipelines for a personalized and fresh user experience ● Realtime Trust & Safety - Reduce levels of unsafe content as close to content creation time as possible ● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement metrics to Creators so they can refine their content with minimal feedback delay ● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users by updating product metadata in near real time ● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup, verification, and evaluation
  • 8. Existing Issues ● Fragmented technologies ○ Self-managed Kafka Streams jobs (Ads Infra) ○ Overwatch platform for small batch Spark jobs (Ads Data, Measurement) ● Lack of developer support ● Availability & scalability issues
  • 9. Who are we? ● We are a team of engineers, SREs, a PM, and an EM that builds Xenon, the stateful stream data processing platform at Pinterest. ● We support around 100 engineers who build and operate 100+ Flink applications. ● We run (near) real-time applications at 300M messages per second and process 150TB of data per second. ● We have enabled 10+ top-level company KRs in the past 3 years.
  • 10. Xenon platform Mission ● Stability: reliably host all deployed Flink-based stream processing applications ● Dev Velocity: quickly productionize new use cases / features to meet business and product needs ● Cloud Efficiency: efficiently operate infrastructure and strive for best practices
  • 11. Xenon - Pinterest stream processing platform Cluster Management (YARN) NRTG Common Libraries and Connectors Flink SQL The Resource Management & Job Execution Layer The Developer APIs Job State Management (Checkpoints, Backups, Restores, Edits) Security / Auth (PII/FGAC) Job Health & Diagnosis (Dr. Squirrel) CI/CD Hermez The Deployment Stack Job Management Service +
  • 12. PinStats Analytic Use case “Overall, users … cited that currently they have difficulties monitoring content performance due to a lack of real-time data being available, which they find frustrating.”
  • 13. Creator Content Use cases Fast user signals: Make user content signals available quickly after content creation Safety: Reduce levels of unsafe content as close to content creation time as possible Content Creation Audience Targeting Content Understanding Quality Interests & Annotations Embeddings Performance
  • 14. Ads real-time spend and reporting Calculate spend against budget limits in real time to quickly adjust budget and update advertisers with more timely results
  • 15. Xenon platform Mission No. 1 - Stability ● Xenon Stability Strategy ● Job Deployment Framework - Hermez and Job Submission Service ● Job Management Service - runtime monitoring of Pinterest stateful streaming applications and automatic failover to a different AZ
  • 17. Xenon Jobs / Hermez workloads 154 Production Xenon use cases >90 179 Deployments everyday
  • 18. Highlights Stability and Tier 1 support ● Enhanced JSS State Machine ● Supported job level dedicated S3 buckets User experience ● Hermez supported most recent checkpoint deployment ● Hermez supported kill job and distributed shell ● Enriched savepoint information on Hermez ● Track daily & monthly deployment success rate Metrics ● Job submission latency
  • 19. Xenon Job Management Service ● Monitoring: job status, critical metrics (QPS), checkpointing health, job/task health, notifying users ● Auto Recovery: auto recover failed jobs from the last completed checkpoint, the most recent savepoint, or fresh state ● AZ Failure Resilience: auto failover jobs to backup clusters in different AZs when the primary cluster/AZ goes down
  • 20. Xenon JMS Statsboard ZK Clusters Hermez JSS Auto Recovery Monitoring Deployment Yarn Clusters AZ-a Yarn Clusters AZ-b Yarn Clusters AZ-c Failover JMS Architecture Flink API user
  • 21. Jobs under management Faster recovery time >90 Jobs get recovered every week 10X >7
  • 22. Xenon platform Mission No. 2 - Developer Velocity ● Near Real Time Galaxy - Pinterest stateful streaming application job development framework ● CICD - Pinterest stateful streaming application change rollout flow ● Dr. Squirrel - Pinterest self-serve streaming application troubleshooting portal ● Working model - New Use Case Onboarding Process
  • 23. NRTG Definition: ● Pinterest stateful streaming application job development framework History: ● Galaxy: a high-level managed execution platform for producing and consuming signals (e.g. entity features) about Pinterest entities (such as pins, boards, and users). ● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow API used in batch and extends it to streaming applications.
  • 25. VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill) ● User code focuses only on business logic. ✅ ● Tune Flink operators using configs. ✅ ● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧 Xenon Flink Application Code Config
  • 26. Xenon CICD framework - big picture ● Bring the CICD practice from stateless online services to the stateful streaming world ● Leverage the same CICD infrastructure ● Customize the CICD pipeline for validating and deploying Flink-based stream applications ● Achieve the goal of safely rolling out Xenon user / platform changes with minimal human effort involved in validation
  • 28. Xenon CICD pipelines - details ● Auto-triggered based on cron rule and availability of new artifacts ● Stability checks ○ job submission success ○ no restart-loop ○ savepoint generation success ○ ACA metrics validation ○ auto-recovery from TM/JM failure ● Prod deploy: decider-controlled, safe operations on prod job during business hours
  • 29. Xenon CICD framework - User Interface Xenon CICD Pipeline UI: ● Pipeline execution history ● Pipeline operation: disable / enable / trigger ● Links to Pipeline YAML and Spinnaker Spinnaker UI: ● Pipeline parameters ● Pipeline execution status ● Details about each Stage
  • 30. Job Debugging tool - Dr. Squirrel Definition: ● One-stop shop for Flink job troubleshooting Features: ● Surface suspicious stats to Xenon users so they do not have to search for them ○ GC, CPU, memory, backpressure, exceptions, bad config... ● Provide instructions on top of suspicious stats Goal: ● Cut down troubleshooting time, lower the knowledge of Flink internals required for troubleshooting, and increase dev velocity
  • 34. Working model - New Use Case Onboarding Process ● Xenon team provides managed bootstrap of new use cases: ○ best practices for choosing a framework and deciding the job graph ○ dev environment setup ○ a buildable and deployable skeleton project (bazel, java, test, configs) ○ Hermez workload creation ○ CICD pipeline ○ YARN queue ○ dashboard / alerts with default settings ● Xenon developers write and test business logic code ● Supports auto-generation of NRTG- and Flink SQL-based projects Outcome: reduce the onboarding time by 3+ weeks
  • 35. Xenon platform Mission No. 3 - Cloud Efficiency (ongoing) ● Auto Scaling - auto tuning & auto scaling up/down of Flink applications ● Cluster upgrade - automatic job migration during platform upgrade ● Resource Optimization - load balance Xenon clusters ● Evaluate k8s
  • 36. Auto Scaling ● Service to dynamically adjust job parallelism based on metrics - Kafka lag, CPU utilization, and backpressure.
  • 37. Questions? Anumol Sebastian Chenqi Liu Hannah Chen Divye Kapoor Kanchi Masalia Lu Niu Rainie Li Teja Thotapalli Nishant More Samuel Bahr Heng Zhang Kevin Browne Sergii Marchenko Ashish Jhaveri Dinesh Kumar Sekar Chen Qin Shaowen Wang YOU?!
  • 38. Q & A
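
Slide 19 lists three sources the Job Management Service can recover a failed job from: the last completed checkpoint, the most recent savepoint, or fresh state. The slides do not say how JMS chooses between them, so the following is only a minimal sketch that assumes the listed order is a priority order. The names `JobSnapshots` and `choose_restore_source` are hypothetical and are not part of Xenon or Flink.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JobSnapshots:
    """Hypothetical record of the restore points known for one job."""
    last_completed_checkpoint: Optional[str] = None  # e.g. a storage path
    most_recent_savepoint: Optional[str] = None


def choose_restore_source(snapshots: JobSnapshots) -> Tuple[str, Optional[str]]:
    """Pick a restore source, assuming the slide's order is a priority order."""
    if snapshots.last_completed_checkpoint:
        return ("checkpoint", snapshots.last_completed_checkpoint)
    if snapshots.most_recent_savepoint:
        return ("savepoint", snapshots.most_recent_savepoint)
    # Nothing to restore from: start the job with empty state.
    return ("fresh_state", None)
```

A caller would pass whatever restore points it has on record and submit the job with the returned path, falling back to an empty-state start when the path is `None`.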
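
Slide 36 describes a service that adjusts job parallelism based on Kafka lag, CPU utilization, and backpressure. The slides give no thresholds or scaling policy, so this is a rough sketch under assumed values: every threshold, the doubling/halving step, and the names `JobMetrics` and `propose_parallelism` are hypothetical illustrations, not Pinterest's implementation.

```python
from dataclasses import dataclass


@dataclass
class JobMetrics:
    """Hypothetical snapshot of the three signals named on the slide."""
    kafka_lag: int          # records the consumer is behind
    cpu_utilization: float  # fraction between 0.0 and 1.0
    backpressured: bool     # whether any operator reports backpressure


def propose_parallelism(
    current: int,
    metrics: JobMetrics,
    *,
    max_lag: int = 100_000,   # assumed lag threshold
    high_cpu: float = 0.8,    # assumed scale-up CPU threshold
    low_cpu: float = 0.3,     # assumed scale-down CPU threshold
    min_parallelism: int = 1,
    max_parallelism: int = 256,
) -> int:
    """Return a new parallelism: double under pressure, halve when idle."""
    under_pressure = (
        metrics.kafka_lag > max_lag
        or metrics.backpressured
        or metrics.cpu_utilization > high_cpu
    )
    if under_pressure:
        return min(current * 2, max_parallelism)

    idle = (
        metrics.kafka_lag == 0
        and not metrics.backpressured
        and metrics.cpu_utilization < low_cpu
    )
    if idle:
        return max(current // 2, min_parallelism)

    # Metrics are within the assumed comfortable band: leave the job alone.
    return current
```

Scaling up on any single signal but scaling down only when all three are quiet is a deliberately conservative choice for this sketch; a real service would also need cooldown periods so a job is not rescaled on every evaluation.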