SlideShare a Scribd company logo
1 of 30
Building Our Software to Scale
Tal Sliwowicz, Director R&D - Infrastructure Engineering
Lior Chaga - Senior Software Engineer, Data Platform
Distributed Kafka
Architecture,
Taboola Scale
Our Scale
+2.4B Pageviews / day
400K http requests / second
+1.5B monthly unique users
500B recommendations / per month
~450K recommendations / second – at peak
40TB / day
2016 vs. 2017
+36% PVs
+34% Recommendations
+220% Incoming Data
Data Infrastructure Requirements
• Fresh data
• Fast queries
• Exact (billing)
• Flexible – simple to add data and extend
• Scale faster than the business
• Endure traffic spikes
Our Strategy to Scale
• Best of breed technologies
• A lot of custom development
• Highly optimized distributions and hardware
• Software designed with scale and self healing in mind
• Everything is monitored and profiled
• Infrastructure to support extremely agile development cycles
Any server and even any data center can go down at any time without
service interruption:
• Share nothing - each server is independent
• Application is stateless - server can accept any traffic
• Data is fully replicated in real time
• Dynamic load balancing
Data must be exact
• All processing is idempotent
• “Exactly Once” semantics - never count twice
Data Infrastructure fully pluggable
• Connect into it at any point
Architecture Principles
Backend Processing
Our code deals with:
• Real time joining of multiple streams
• Real time access to data
• Distributed processing
FE
Servers
Jade
Cloud
Storage
Tensor Flow
Serving
SQL
Backend Processing
Jade
Cloud
Storage
Tensor Flow
Serving
SQL
FE
Servers
High Throughput Using Custom Buffering
Volume
• 50B protobufs / day - billing related protobufs
• 25B protobufs / day - monitoring related protobufs
Requirements
• Can’t interrupt the recommendations service
• Can’t lose data
How do we handle this volume?
• Custom message buffering
• Async sending
• Offheap - No GC by using pre-allocated DirectByteBuffers
WriteCoordinator
Protobuff
Event
Preallocated
Reusable
GzipDirectByteBuffers
Kafka
Endpoint
Filesystem Fallback Handler
failure
Message Handling Flow
drain
Monitoring
• Messages waiting on the filesystem (should be zero)
• Message producing rate
• Number of buffers being used
• Blocking Queue Size + Dropped Messages
• Message Size
• Payload Size
• Send to Kafka times
• Errors
Schema Management
Protostuff with schema evolution
• Separate git repo with strict ALM
• Testing for backward/forward compatibility
• Feature branches cannot be deployed to production
• Became critical with many developers adding data
FE
Servers
Architecture Evolution
FE
Servers
Backend
Backend
Backend
Backend
Frontend
Backend
Multi DC Deployment
• 6 FE Data centers
• 4 BE Data Centers
• Brokers: 36 FE + 50 BE
• Broker = 7TB NVME Disk, 128GB RAM, 32 CPU Core, 10GB Ethernet
• Full Data Replication to BE DCs
Mirroring With Kafka Mirror Maker
Topic 1
.
.
.
.
.
.
.
.
Topic N
Partition 1
.
.
.
Partition K
FE Kafka
Topic 1
.
.
.
.
.
.
.
Topic N
BE Kafka
Kafka
Mirror
Maker
C1
C2
Cm
P
P
P
Message size varies:
O(10KB) - O(MB)
Mirroring With Kafka Mirror Maker
Topic 1
.
.
.
.
.
.
.
.
Topic N
Partition 1
.
.
.
Partition K
FE Kafka
Topic 1
.
.
.
.
.
.
.
Topic N
BE Kafka
Kafka
Mirror
Maker
C1
C2
Ck
Partition 1
.
.
.
Partition K
P
P
P
Topic 1
.
.
.
.
.
.
.
.
Topic N
Mirroring Done Right - KFC
Topic 1
.
.
.
.
.
.
.
Topic N
BE Kafka
KFC
Mirror
C1
C2
Ck
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
Producer Pools - as
much as we need
Partition 1
.
.
.
Partition K
FE Kafka
Partition 1
.
.
.
Partition K
Introducing KFC
Multi-purpose framework for consuming from kafka and processing
messages in parallel, with built-in monitoring.
TaboolaKafkaConsumer
Consumer
Runnable
MessageProcessor KafkaConsumerParallelismStrategy KafkaCommitStrategy
Example: MirroringMessageProcessor
Example: MirroringMessageProcessor (2)
Monitoring
● Registering all org.apache.kafka.common.Metric as codahale Gauge
● Alerting on poll cycle time:
○ Messaging processing is stuck
○ Can’t find group coordinator
● Alerting on partition lag:
○ Lag by number of messages
○ Lag by delta from produce time (bursty topics)
Monitoring poll cycle
Monitoring lag
Other KFC Usages
• Backup message to GCS - pay for PUT
• Idempotent write to C* - partial protobuf parse + pushback
• Embedded KFC for event driven processing
• BQ Upload - tweaking fetch size
• RDBMS updates
• Many more...
Thank You
Apache Spark as Our Distributed Processing Engine
Serves multiple functions:
• Runs our code that joins data stream into pageviews and sessions
• SQL engine for analysts and algo group
• Raw data feed into the deep learning engine
• 100s of data aggregators constantly running feeding data to Backstage (in beta)
Fun facts
• ~15K cores + >70TB of RAM (+>8PB on disk historic data)
• Clusters will grow by 50% in size by year’s end
• Using Spark since the end of 2013 - in production
• Contributed multiple critical fixes back to community

More Related Content

What's hot

Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...HostedbyConfluent
 
Using Kafka to scale database replication
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replicationVenu Ryali
 
Gwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New Architecture
Gwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New ArchitectureGwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New Architecture
Gwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New Architectureconfluent
 
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...HostedbyConfluent
 
Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...
Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...
Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...HostedbyConfluent
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...StreamNative
 
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka StreamsKafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streamsconfluent
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikHostedbyConfluent
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberHostedbyConfluent
 
Administrative techniques to reduce Kafka costs | Anna Kepler, Viasat
Administrative techniques to reduce Kafka costs | Anna Kepler, ViasatAdministrative techniques to reduce Kafka costs | Anna Kepler, Viasat
Administrative techniques to reduce Kafka costs | Anna Kepler, ViasatHostedbyConfluent
 
Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...
Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...
Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...HostedbyConfluent
 
Real time dashboards with Kafka and Druid
Real time dashboards with Kafka and DruidReal time dashboards with Kafka and Druid
Real time dashboards with Kafka and DruidVenu Ryali
 
Real-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka EcosystemReal-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka Ecosystemconfluent
 
Building a derived data store using Kafka
Building a derived data store using KafkaBuilding a derived data store using Kafka
Building a derived data store using KafkaVenu Ryali
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...confluent
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetesconfluent
 
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, MicrosoftAzure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, MicrosoftHostedbyConfluent
 
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...HostedbyConfluent
 
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...HostedbyConfluent
 

What's hot (20)

Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
 
Using Kafka to scale database replication
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replication
 
Gwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New Architecture
Gwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New ArchitectureGwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New Architecture
Gwen Shapira, Confluent | Kafka Summit 2020 Keynote | Kafka’s New Architecture
 
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...
Tradeoffs in Distributed Systems Design: Is Kafka The Best? (Ben Stopford and...
 
Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...
Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...
Asynchronous Transaction Processing With Kafka as a Single Source of Truth - ...
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
 
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka StreamsKafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Administrative techniques to reduce Kafka costs | Anna Kepler, Viasat
Administrative techniques to reduce Kafka costs | Anna Kepler, ViasatAdministrative techniques to reduce Kafka costs | Anna Kepler, Viasat
Administrative techniques to reduce Kafka costs | Anna Kepler, Viasat
 
Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...
Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...
Taming a massive fleet of Python-based Kafka apps at Robinhood | Chandra Kuch...
 
Real time dashboards with Kafka and Druid
Real time dashboards with Kafka and DruidReal time dashboards with Kafka and Druid
Real time dashboards with Kafka and Druid
 
Real-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka EcosystemReal-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka Ecosystem
 
Building a derived data store using Kafka
Building a derived data store using KafkaBuilding a derived data store using Kafka
Building a derived data store using Kafka
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
 
Apache Kafka Streams
Apache Kafka StreamsApache Kafka Streams
Apache Kafka Streams
 
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, MicrosoftAzure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft
 
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
 
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
 

Similar to Distributed Kafka Architecture Taboola Scale

Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...LINE Corporation
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestKrishna Gade
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Kai Sasaki
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterHostedbyConfluent
 
Stream Processing @ Lyft
Stream Processing @ LyftStream Processing @ Lyft
Stream Processing @ LyftJamie Grier
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Peter Bakas
 
Real Time Insights for Advertising Tech
Real Time Insights for Advertising TechReal Time Insights for Advertising Tech
Real Time Insights for Advertising TechApache Apex
 
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.Data Con LA
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Data Con LA
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...Amazon Web Services
 
Ai big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenkoAi big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenkoOlga Zinkevych
 
Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"DataConf
 
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...Jonghyun Lee
 

Similar to Distributed Kafka Architecture Taboola Scale (20)

Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b...
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
Stream Processing @ Lyft
Stream Processing @ LyftStream Processing @ Lyft
Stream Processing @ Lyft
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
 
Real Time Insights for Advertising Tech
Real Time Insights for Advertising TechReal Time Insights for Advertising Tech
Real Time Insights for Advertising Tech
 
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
 
Ai big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenkoAi big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenko
 
Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"
 
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
 

Recently uploaded

Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Calllward7
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfscitechtalktv
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeralNABLAS株式会社
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfEmmanuel Dauda
 

Recently uploaded (20)

Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 

Distributed Kafka Architecture Taboola Scale

  • 1. Building Our Software to Scale Tal Sliwowicz, Director R&D - Infrastructure Engineering Lior Chaga - Senior Software Engineer, Data Platform Distributed Kafka Architecture, Taboola Scale
  • 2.
  • 3.
  • 4. Our Scale +2.4B Pageviews / day 400K http requests / second +1.5B monthly unique users 500B recommendations / per month ~450K recommendations / second – at peak 40TB / day
  • 5. 2016 vs. 2017 +36% PVs +34% Recommendations +220% Incoming Data
  • 6. Data Infrastructure Requirements • Fresh data • Fast queries • Exact (billing) • Flexible – simple to add data and extend • Scale faster than the business • Endure traffic spikes
  • 7. Our Strategy to Scale • Best of breed technologies • A lot of custom development • Highly optimized distributions and hardware • Software designed with scale and self healing in mind • Everything is monitored and profiled • Infrastructure to support extremely agile development cycles
  • 8. Any server and even any data center can go down at any time without service interruption: • Share nothing - each server is independent • Application is stateless - server can accept any traffic • Data is fully replicated in real time • Dynamic load balancing Data must be exact • All processing is idempotent • “Exactly Once” semantics - never count twice Data Infrastructure fully pluggable • Connect into it at any point Architecture Principles
  • 9. Backend Processing Our code deals with: • Real time joining of multiple streams • Real time access to data • Distributed processing FE Servers Jade Cloud Storage Tensor Flow Serving SQL
  • 11. High Throughput Using Custom Buffering Volume • 50B protobufs / day - billing related protobufs • 25B protobufs / day - monitoring related protobufs Requirements • Can’t interrupt the recommendations service • Can’t lose data How do we handle this volume? • Custom message buffering • Async sending • Offheap - No GC by using pre-allocated DirectByteBuffers
  • 13. Monitoring • Messages waiting on the filesystem (should be zero) • Message producing rate • Number of buffers being used • Blocking Queue Size + Dropped Messages • Message Size • Payload Size • Send to Kafka times • Errors
  • 14.
  • 15.
  • 16. Schema Management Protostuff with schema evolution • Separate git repo with strict ALM • Testing for backward/forward compatibility • Feature branches cannot be deployed to production • Became critical with many developers adding data
  • 18. Multi DC Deployment • 6 FE Data centers • 4 BE Data Centers • Brokers: 36 FE + 50 BE • Broker = 7TB NVME Disk, 128GB RAM, 32 CPU Core, 10GB Ethernet • Full Data Replication to BE DCs
  • 19. Mirroring With Kafka Mirror Maker Topic 1 . . . . . . . . Topic N Partition 1 . . . Partition K FE Kafka Topic 1 . . . . . . . Topic N BE Kafka Kafka Mirror Maker C1 C2 Cm P P P Message size varies: O(10KB) - O(MB)
  • 20. Mirroring With Kafka Mirror Maker Topic 1 . . . . . . . . Topic N Partition 1 . . . Partition K FE Kafka Topic 1 . . . . . . . Topic N BE Kafka Kafka Mirror Maker C1 C2 Ck Partition 1 . . . Partition K P P P
  • 21. Topic 1 . . . . . . . . Topic N Mirroring Done Right - KFC Topic 1 . . . . . . . Topic N BE Kafka KFC Mirror C1 C2 Ck P P P P P P P P P P P P P P P Producer Pools - as much as we need Partition 1 . . . Partition K FE Kafka Partition 1 . . . Partition K
  • 22. Introducing KFC Multi-purpose framework for consuming from kafka and processing messages in parallel, with built-in monitoring. TaboolaKafkaConsumer Consumer Runnable MessageProcessor KafkaConsumerParallelismStrategy KafkaCommitStrategy
  • 25. Monitoring ● Registering all org.apache.kafka.common.Metric as codahale Gauge ● Alerting on poll cycle time: ○ Messaging processing is stuck ○ Can’t find group coordinator ● Alerting on partition lag: ○ Lag by number of messages ○ Lag by delta from produce time (bursty topics)
  • 28. Other KFC Usages • Backup message to GCS - pay for PUT • Idempotent write to C* - partial protobuf parse + pushback • Embedded KFC for event driven processing • BQ Upload - tweaking fetch size • RDBMS updates • Many more...
  • 30. Apache Spark as Our Distributed Processing Engine Serves multiple functions: • Runs our code that joins data stream into pageviews and sessions • SQL engine for analysts and algo group • Raw data feed into the deep learning engine • 100s of data aggregators constantly running feeding data to Backstage (in beta) Fun facts • ~15K cores + >70TB of RAM (+>8PB on disk historic data) • Clusters will grow by 50% in size by year’s end • Using Spark since the end of 2013 - in production • Contributed multiple critical fixes back to community