Large scale stream processing with Apache Flink
Nikolay Stoitsev
Sr. Software Engineer at Uber Tech Sofia
Stream Processing?
User Interaction Logs
Application Logs
Sensor Data
Database Commit Logs
Infinite Dataset
Producer Stream HDFS
Hive
Big Latency
Producer Stream
HDFS
Real-time service
Apache Storm
storm.apache.org
High-latency & accurate
vs.
Low-latency & approximation
Lambda architecture
https://www.oreilly.com/ideas/questioning-the-lambda-architecture
Kappa Architecture
Use Apache Kafka
Durable, scalable, fault-tolerant
Producer Kafka
Stream Processor
Metrics we want to track
Net payout
Daily items sold
Weekly items sold
Order acceptance rate
Order preparation speed
Item rating
Real time
Scalable
Granular
Highly available
Order Stream
Payment Stream
User Rating Stream
Stream Processor
OLAP
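A minimal sketch of what the stream processor might compute for two of the metrics above (daily items sold and order acceptance rate). The event layout, the `ORDERS` sample data, and the rule that only accepted orders count as sold are assumptions made for illustration; they are not taken from the talk.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical order events: (epoch seconds, restaurant id, items, accepted?)
ORDERS = [
    (1_546_300_800, "r1", 2, True),   # 2019-01-01 00:00:00 UTC
    (1_546_304_400, "r1", 1, False),  # 2019-01-01 01:00:00 UTC
    (1_546_387_200, "r1", 3, True),   # 2019-01-02 00:00:00 UTC
    (1_546_390_800, "r2", 4, True),   # 2019-01-02 01:00:00 UTC
]


def day_of(ts):
    """Map an event timestamp to its one-day tumbling window (UTC date)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


def daily_items_sold(orders):
    """Sum items per (restaurant, day); assumes only accepted orders count."""
    sold = defaultdict(int)
    for ts, restaurant, items, accepted in orders:
        if accepted:
            sold[(restaurant, day_of(ts))] += items
    return dict(sold)


def acceptance_rate(orders):
    """Fraction of orders each restaurant accepted."""
    seen = defaultdict(int)
    accepted_count = defaultdict(int)
    for _, restaurant, _, accepted in orders:
        seen[restaurant] += 1
        accepted_count[restaurant] += int(accepted)
    return {r: accepted_count[r] / seen[r] for r in seen}


if __name__ == "__main__":
    print(daily_items_sold(ORDERS))
    print(acceptance_rate(ORDERS))
```

A real job would update these aggregates incrementally as events arrive and write them to the OLAP store; the finite list here only stands in for the stream.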
Apache Samza
samza.apache.org
Apache Flink
flink.apache.org
Everything is a batch
vs.
Everything is a stream
Single JVM, Cluster, Cloud
Runtime
DataSet API, DataStream API
Dataflow graph
Source
Source
Operator
Operator
Operator Sink
OLAP
https://ci.apache.org/projects/flink/flink-docs-release-1.6/concepts/programming-model.html
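The dataflow graph (sources, operators, sink) can be illustrated with plain Python generators. This is a conceptual sketch only, not Flink's API: `source`, `map_op`, `filter_op`, `sink` and the "item,quantity" record layout are hypothetical names chosen for the example.

```python
def source(records):
    """Source: emits records one at a time."""
    for record in records:
        yield record


def map_op(fn, stream):
    """Operator: transforms every record."""
    for record in stream:
        yield fn(record)


def filter_op(predicate, stream):
    """Operator: drops records that fail the predicate."""
    for record in stream:
        if predicate(record):
            yield record


def sink(stream, out):
    """Sink: drains the stream into an output collection."""
    for record in stream:
        out.append(record)


def run_pipeline(lines):
    """Wire source -> map -> filter -> map -> sink and run it."""
    out = []
    parsed = map_op(lambda line: line.split(","), source(lines))
    valid = filter_op(lambda f: len(f) == 2 and f[1].isdigit(), parsed)
    typed = map_op(lambda f: (f[0], int(f[1])), valid)
    sink(typed, out)
    return out


if __name__ == "__main__":
    print(run_pipeline(["burger,2", "bad-line", "pizza,x", "salad,1"]))
```

Records flow through the graph one by one, so no stage needs the whole dataset in memory. In Flink each operator can additionally run as several parallel instances, which this single-process sketch does not model.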
Flink Program
Optimizer
Graph Builder
Client Job Manager
Task Manager Task Manager
Snapshot Store
Fault tolerant
Lightweight Asynchronous Snapshots for Distributed Dataflows
Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, Kostas Tzoumas
Barrier Msg Msg Barrier Msg Msg Barrier
Operator
Barrier Msg Msg Barrier Msg Msg
Operator
Msg
Snapshot Store
Exactly Once Processing
Can handle very large state
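A heavily simplified sketch of the barrier idea, assuming a single operator with a single input. When a barrier arrives, the operator copies its state to the snapshot store together with the position to resume from. After a failure, the state is restored and the input is replayed from that position, so every message affects the state exactly once. The real algorithm also aligns barriers across multiple inputs and takes snapshots asynchronously; neither is modelled here, and all names are hypothetical.

```python
import copy

BARRIER = object()  # marker injected into the stream between messages


def run(stream, snapshot_store, start=0, state=None, crash_at=None):
    """Count messages per key, snapshotting the counts at each barrier."""
    counts = copy.deepcopy(state) if state is not None else {}
    for pos in range(start, len(stream)):
        if crash_at is not None and pos == crash_at:
            raise RuntimeError("simulated failure")
        item = stream[pos]
        if item is BARRIER:
            # Persist the state and the position right after the barrier.
            snapshot_store.append(
                {"resume_at": pos + 1, "state": copy.deepcopy(counts)}
            )
        else:
            counts[item] = counts.get(item, 0) + 1
    return counts


def run_with_recovery(stream, crash_at):
    """Run once, fail at crash_at, then restore the latest snapshot."""
    store = []
    try:
        return run(stream, store, crash_at=crash_at)
    except RuntimeError:
        last = store[-1] if store else {"resume_at": 0, "state": {}}
        return run(stream, store, start=last["resume_at"], state=last["state"])


if __name__ == "__main__":
    events = ["a", "b", BARRIER, "a", "c", BARRIER, "a"]
    print(run(events, []))
    print(run_with_recovery(events, crash_at=4))
```

Both calls end with the same counts: the work done between the last barrier and the crash is discarded with the lost state and redone during replay, so nothing is counted twice.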
Flink Program
Optimizer
Graph Builder
Client Job Manager
Task Manager Task Manager
Snapshot Store
Job Manager
Job Manager
Zookeeper
Flink Program
Optimizer
Graph Builder
Client
Task Manager Task Manager
Snapshot Store
Job Manager
Job Manager
Zookeeper
Joining Streams
Order Stream
User Rating Stream
Local Join
Local Join
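One way to read the "Local Join" picture: both streams are partitioned by the same key, so records with the same order id land on the same worker and each worker can join its share locally. The sketch below assumes finite lists in place of unbounded streams and a CRC32-based partitioner; the record layouts and function names are hypothetical, and a real streaming join would also bound its state with windows.

```python
import zlib
from collections import defaultdict


def partition_of(key, parallelism):
    """Deterministically map a key to one of `parallelism` partitions."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def key_by(stream, parallelism):
    """Route each record to a partition using its first field as the key."""
    parts = [[] for _ in range(parallelism)]
    for record in stream:
        parts[partition_of(record[0], parallelism)].append(record)
    return parts


def local_join(orders, ratings):
    """Hash join inside one partition: (order_id, item) with (order_id, stars)."""
    items_by_order = defaultdict(list)
    for order_id, item in orders:
        items_by_order[order_id].append(item)
    return [
        (order_id, item, stars)
        for order_id, stars in ratings
        for item in items_by_order.get(order_id, [])
    ]


def join_streams(orders, ratings, parallelism=2):
    """Partition both streams by order id, then join each partition locally."""
    joined = []
    for o_part, r_part in zip(key_by(orders, parallelism), key_by(ratings, parallelism)):
        joined.extend(local_join(o_part, r_part))
    return sorted(joined)


if __name__ == "__main__":
    orders = [("o1", "burger"), ("o2", "pizza"), ("o3", "salad")]
    ratings = [("o2", 5), ("o1", 4), ("o9", 1)]
    print(join_streams(orders, ratings))
```

Because both streams use the same partitioner, the result does not depend on the degree of parallelism; only the amount of work per worker changes.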
Apache Flink
● Can join streams
● Fault tolerant
● Exactly Once Processing
● Combines stream and batch processing
… but it requires Java/Scala code
Scalable, efficient and robust
github.com/uber/AthenaX
SQL → what data to analyze
Flink → how to analyze it
Resource estimation and auto scaling
Monitoring and automatic failure recovery
eng.uber.com/athenax
Thanks!
Nikolay Stoitsev @ Uber