SlideShare a Scribd company logo
1 of 36
Download to read offline
Building a modern data pipeline:
Lessons learned
Saulius Valatka
Full stack online advertising platform
Serving banners, real-time auctions, data management, etc.
Adform
Data driven products
◉ Analytics
◉ Algorithm optimization
◉ Fraud detection
◉ Forecasting
>1,000,000 QPS
bids, impressions, clicks, requests, …
>500 TB
data stored
>500 MB/s
data incoming
Kafka
the backbone of any modern data pipeline
Producing
Transactional “exactly-once”
Since Kafka 0.11
Kafka Delivery Guarantees
Producing
Transactional “exactly-once”
Since Kafka 0.11
Kafka Delivery Guarantees
Consuming
Consumer must implement
“atomic” offset commiting
Long term storage loading
HDFS, Vertica and the like
1
Loading projects topics
_topic _partition _offset cookie_id url
impressions 1 401 123412341 http://www.reddit.com/
impressions 1 402 23412354 http://lwn.net/Kernel
impressions 2 390 234123566 http://join.adform.com/
impressions 2 391 64342453 http://www.ufc.com/rankings
impressions 1 403 341235346 http://nasza-klasa.pl/
impressions 3 943 465344354 http://www.facebook.com/
All the benefits of CQRS
But consistency is still hard
Multiple data stores
Kafka is awesome
We can achieve real-time data flow with exactly-once
guarantees and have statically-typed data at every step
of the way
Kafka is awesome
We can achieve real-time data flow with exactly-once
guarantees and have statically-typed data at every step
of the way
Kafa is our source of truth
Stream Processing
Flink, Spark, Kafka-Streams, oh my!
2
The world beyond batch: Streaming 101
A high level tour of modern data-processing concepts
By Tyler Akidau. August 5, 2015
The world beyond batch: Streaming 102
The what, where, when, and how of unbounded data processing.
By Tyler Akidau. January 20, 2016
Stream processing
Modern frameworks like spark (structured streaming),
flink and kafka streams allow for exactly-once stateful
data processing
Stream processing
Modern frameworks like spark (structured streaming),
flink and kafka streams allow for exactly-once stateful
data processing
Streaming should be preferred over batching
Batch Processing
… or what we don’t want to do with streams
3
What do we do in batch ?
◉ Algorithm training
◉ Attributions with wide windows
◉ Data cubing
How do you even batch ?
azkaban, oozie, luigi, airflow …
why do people keep reinventing this ?
Scheduling reinvented ¯_(ツ)_/¯
Scheduling reinvented ¯_(ツ)_/¯
◉ Declarative configuration
◉ Runs containers (https://github.com/adform/sprint)
◉ Not (yet) open-sourced
“
Time based scheduling cannot in principle
guarantee data consistency because of late data
😭
When is it safe to run ?
offset
time
real
kafka-time to the rescue!
offset
time
kafka-time watermark
real
What about “dimensions” ?
a.k.a. metadata: clients, campaigns, etc.
4
Data Distribution Platform
A modern data pipeline is …
Consistent
Otherwise, what are you even
doing ?
A modern data pipeline is …
Consistent
Otherwise, what are you even
doing ?
Immutable
Reproducible, testable,
auditable, other good words
A modern data pipeline is …
Consistent
Otherwise, what are you even
doing ?
Real Time
Ain’t nobody got time to wait for
batches …
Immutable
Reproducible, testable,
auditable, other good words
A modern data pipeline is …
Consistent
Otherwise, what are you even
doing ?
Real Time
Ain’t nobody got time to wait for
batches …
Immutable
Reproducible, testable,
auditable, other good words
Performant
Seriously, ain’t nobody got time
to wait for queries …
A modern data pipeline is …
Consistent
Otherwise, what are you even
doing ?
Real Time
Ain’t nobody got time to wait for
batches …
Statically Typed
Yes, I said it, so sue me …
Immutable
Reproducible, testable,
auditable, other good words
Performant
Seriously, ain’t nobody got time
to wait for queries …
A modern data pipeline is …
Consistent
Otherwise, what are you even
doing ?
Real Time
Ain’t nobody got time to wait for
batches …
Statically Typed
Yes, I said it, so sue me …
Immutable
Reproducible, testable,
auditable, other good words
Performant
Seriously, ain’t nobody got time
to wait for queries …
Accessible
Clearly understood, easily used
and extended
Any questions ?
You can find me at
◉ @saulius_vl
◉ github.com/sauliusvl
Thanks!

More Related Content

Similar to Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform

Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...
Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...
Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...confluent
 
Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례uEngine Solutions
 
Real Time Streaming - Apache Kafka
Real Time Streaming - Apache KafkaReal Time Streaming - Apache Kafka
Real Time Streaming - Apache KafkaKnoldus Inc.
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basFlorent Ramiere
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIuser Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIconfluent
 
Using Apache Kafka to Analyze Session Windows
Using Apache Kafka to Analyze Session WindowsUsing Apache Kafka to Analyze Session Windows
Using Apache Kafka to Analyze Session Windowsconfluent
 
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...confluent
 
Reinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun RaoReinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun Raoconfluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureGuido Schmutz
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafkaconfluent
 
Data Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAData Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAAndrew Morgan
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBMongoDB
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightMicrosoft Tech Community
 

Similar to Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform (20)

Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...
Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...
Building a Secure, Tamper-Proof & Scalable Blockchain on Top of Apache Kafka ...
 
Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례
 
Real Time Streaming - Apache Kafka
Real Time Streaming - Apache KafkaReal Time Streaming - Apache Kafka
Real Time Streaming - Apache Kafka
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en bas
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Surge openstack
Surge openstackSurge openstack
Surge openstack
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIuser Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
 
Using Apache Kafka to Analyze Session Windows
Using Apache Kafka to Analyze Session WindowsUsing Apache Kafka to Analyze Session Windows
Using Apache Kafka to Analyze Session Windows
 
Gcp dataflow
Gcp dataflowGcp dataflow
Gcp dataflow
 
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
 
Reinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun RaoReinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun Rao
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafka
 
Data Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAData Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEA
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDB
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
 

More from Evention

The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...Evention
 
A/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.comA/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.comEvention
 
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...Evention
 
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Evention
 
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...Evention
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
 
Privacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatPrivacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatEvention
 
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...Evention
 
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...Evention
 
Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...Evention
 
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...Evention
 
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...Evention
 
Stream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansStream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansEvention
 
Scaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell SpotifyScaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell SpotifyEvention
 
Big Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz ŚliwaBig Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz ŚliwaEvention
 
Elastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz KołpućElastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz KołpućEvention
 
H2 o deep water making deep learning accessible to everyone -jo-fai chow
H2 o deep water   making deep learning accessible to everyone -jo-fai chowH2 o deep water   making deep learning accessible to everyone -jo-fai chow
H2 o deep water making deep learning accessible to everyone -jo-fai chowEvention
 
That won’t fit into RAM - Michał Brzezicki
That won’t fit into RAM -  Michał  BrzezickiThat won’t fit into RAM -  Michał  Brzezicki
That won’t fit into RAM - Michał BrzezickiEvention
 
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian HueskeStream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian HueskeEvention
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Evention
 

More from Evention (20)

The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...
 
A/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.comA/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.com
 
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
 
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
 
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Privacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatPrivacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, Mapflat
 
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
 
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
 
Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...
 
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
 
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
 
Stream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansStream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data Artisans
 
Scaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell SpotifyScaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell Spotify
 
Big Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz ŚliwaBig Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz Śliwa
 
Elastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz KołpućElastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz Kołpuć
 
H2 o deep water making deep learning accessible to everyone -jo-fai chow
H2 o deep water   making deep learning accessible to everyone -jo-fai chowH2 o deep water   making deep learning accessible to everyone -jo-fai chow
H2 o deep water making deep learning accessible to everyone -jo-fai chow
 
That won’t fit into RAM - Michał Brzezicki
That won’t fit into RAM -  Michał  BrzezickiThat won’t fit into RAM -  Michał  Brzezicki
That won’t fit into RAM - Michał Brzezicki
 
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian HueskeStream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian Hueske
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
 

Recently uploaded

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 

Recently uploaded (20)

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 

Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform

  • 1. Building a modern data pipeline: Lessons learned Saulius Valatka
  • 2. Full stack online advertising platform Serving banners, real-time auctions, data management, etc. Adform
  • 3. Data driven products ◉ Analytics ◉ Algorithm optimization ◉ Fraud detection ◉ Forecasting
  • 4. >1,000,000 QPS bids, impressions, clicks, requests, … >500 TB data stored >500 MB/s data incoming
  • 5.
  • 6. Kafka the backbone of any modern data pipeline
  • 7.
  • 9. Producing Transactional “exactly-once” Since Kafka 0.11 Kafka Delivery Guarantees Consuming Consumer must implement “atomic” offset commiting
  • 10. Long term storage loading HDFS, Vertica and the like 1
  • 11. Loading projects topics _topic _partition _offset cookie_id url impressions 1 401 123412341 http://www.reddit.com/ impressions 1 402 23412354 http://lwn.net/Kernel impressions 2 390 234123566 http://join.adform.com/ impressions 2 391 64342453 http://www.ufc.com/rankings impressions 1 403 341235346 http://nasza-klasa.pl/ impressions 3 943 465344354 http://www.facebook.com/
  • 12.
  • 13. All the benefits of CQRS But consistency is still hard Multiple data stores
  • 14. Kafka is awesome We can achieve real-time data flow with exactly-once guarantees and have statically-typed data at every step of the way
  • 15. Kafka is awesome We can achieve real-time data flow with exactly-once guarantees and have statically-typed data at every step of the way Kafa is our source of truth
  • 16. Stream Processing Flink, Spark, Kafka-Streams, oh my! 2
  • 17. The world beyond batch: Streaming 101 A high level tour of modern data-processing concepts By Tyler Akidau. August 5, 2015 The world beyond batch: Streaming 102 The what, where, when, and how of unbounded data processing. By Tyler Akidau. January 20, 2016
  • 18. Stream processing Modern frameworks like spark (structured streaming), flink and kafka streams allow for exactly-once stateful data processing
  • 19. Stream processing Modern frameworks like spark (structured streaming), flink and kafka streams allow for exactly-once stateful data processing Streaming should be preferred over batching
  • 20. Batch Processing … or what we don’t want to do with streams 3
  • 21. What do we do in batch ? ◉ Algorithm training ◉ Attributions with wide windows ◉ Data cubing
  • 22. How do you even batch ? azkaban, oozie, luigi, airflow … why do people keep reinventing this ?
  • 24. Scheduling reinvented ¯_(ツ)_/¯ ◉ Declarative configuration ◉ Runs containers (https://github.com/adform/sprint) ◉ Not (yet) open-sourced
  • 25. “ Time based scheduling cannot in principle guarantee data consistency because of late data 😭
  • 26. When is it safe to run ? offset time real
  • 27. kafka-time to the rescue! offset time kafka-time watermark real
  • 28. What about “dimensions” ? a.k.a. metadata: clients, campaigns, etc. 4
  • 30. A modern data pipeline is … Consistent Otherwise, what are you even doing ?
  • 31. A modern data pipeline is … Consistent Otherwise, what are you even doing ? Immutable Reproducible, testable, auditable, other good words
  • 32. A modern data pipeline is … Consistent Otherwise, what are you even doing ? Real Time Ain’t nobody got time to wait for batches … Immutable Reproducible, testable, auditable, other good words
  • 33. A modern data pipeline is … Consistent Otherwise, what are you even doing ? Real Time Ain’t nobody got time to wait for batches … Immutable Reproducible, testable, auditable, other good words Performant Seriously, ain’t nobody got time to wait for queries …
  • 34. A modern data pipeline is … Consistent Otherwise, what are you even doing ? Real Time Ain’t nobody got time to wait for batches … Statically Typed Yes, I said it, so sue me … Immutable Reproducible, testable, auditable, other good words Performant Seriously, ain’t nobody got time to wait for queries …
  • 35. A modern data pipeline is … Consistent Otherwise, what are you even doing ? Real Time Ain’t nobody got time to wait for batches … Statically Typed Yes, I said it, so sue me … Immutable Reproducible, testable, auditable, other good words Performant Seriously, ain’t nobody got time to wait for queries … Accessible Clearly understood, easily used and extended
  • 36. Any questions ? You can find me at ◉ @saulius_vl ◉ github.com/sauliusvl Thanks!