
Forecasting Kafka Lag Issues with Machine Learning

"A key operational challenge for running Kafka in production is managing Kafka Partition Lag. Kafka partitions exhibit a variety of normal trends, influenced by how consumers consume data in partitions. Kafka Lag also exhibits abnormal patterns caused by issues in the Kafka clusters or in its consumers. Administrators need to monitor Kafka Lag, distinguish between normal and abnormal trends and act when application outcomes are impacted. Lag impacts latency and accuracy of data and insights produced from a Big Data pipeline. How can we continuously monitor Kafka Lag automatically, identify normal and abnormal trends and forecast Lag issues ahead of time? In this session, we will discuss our work in this regard using machine learning. We will discuss popular lag patterns and how our ensemble forecasting system learns from the past and predicts future trends. We will also showcase some case studies and benefits of having such a system as part of a Kafka observability platform."

Forecasting Kafka Lag with Machine Learning
Kumaran Ponnambalam
Principal Engineer - AI
© 2023 Outshift by Cisco and/or its affiliates. All rights reserved. Outshift by Cisco Public.
Speaker Intro
• AI/ML Leader & Author
• 20+ years in Data, Analytics, AI/ML
• Current
  • Principal Engineer – AI
  • Outshift by Cisco
• Author
  • LinkedIn Learning – AI/ML & Big Data
  • Apache Kafka Essential Training – Getting Started
  • Apache Kafka Essential Training – Building scalable apps
Kumaran Ponnambalam
Agenda
• The Kafka consumer lag problem
• Need for lag forecasting
• Current approaches to lag forecasting
• Using Retrieval Augmented Generation
• Our approach
• Examples of real-time forecasting
Kafka consumer lag
The Kafka consumer lag problem
• One topic, multiple consumer groups
• Consumers consume data independently of producers
• Lag indicates “pending” records not yet consumed by a consumer group
• Lag can happen by design (batch consumers) or because the consumers are not able to keep up with the producers
• Lag directly impacts the latency of data processing pipelines and hence when outcomes are achieved
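The definition of lag on this slide can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the function names, the `(topic, partition)` keys, and the sample offsets are all hypothetical, and the offsets are assumed to have already been fetched from the cluster by whatever client or admin tool is in use. Lag for one consumer group is taken as the partition's latest offset minus the group's committed offset.

```python
from typing import Dict, Tuple

# (topic, partition) key; an illustrative alias, not a Kafka client type.
TopicPartition = Tuple[str, int]


def partition_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return the pending record count per partition for one consumer group."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a group with no committed offset has consumed nothing.
        # Real behavior depends on retention and the consumer's reset policy.
        committed = committed_offsets.get(tp, 0)
        # Clamp at zero in case the two offsets were sampled at different times.
        lag[tp] = max(end - committed, 0)
    return lag


def total_lag(lag_by_partition: Dict[TopicPartition, int]) -> int:
    """Sum the per-partition lag into a single figure for the group."""
    return sum(lag_by_partition.values())


if __name__ == "__main__":
    # One topic, two consumer groups reading it independently (made-up numbers).
    end = {("orders", 0): 1500, ("orders", 1): 1200}
    streaming_group = {("orders", 0): 1500, ("orders", 1): 1100}
    batch_group = {("orders", 0): 400}

    print("streaming:", partition_lag(end, streaming_group))
    print("batch:", partition_lag(end, batch_group))
```

Because each group commits its own offsets, the same topic yields a different lag figure per group, which is why a batch consumer can show a large lag by design while a streaming consumer on the same topic shows almost none.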
Impact of unplanned lag
• Pending records keep increasing over time
• Delays in achieving desired application service levels
• Business impact
  • A real-time threat alert system will delay alerts, leading to the threat becoming real
  • A real-time recommendation system may not provide recommendations in time, leading to loss of business
• Restoring service levels is involved and takes time
  • Increasing capacity of brokers / consumers
  • Resetting consumer offsets
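To make "pending records keep increasing over time" concrete, the sketch below fits a straight line to recent lag samples and extrapolates when the lag would cross a service-level threshold. It is a naive baseline under stated assumptions, not the ensemble forecasting system the talk describes: the function names, the threshold, and the sample values are hypothetical, and a linear fit will mislead on the periodic patterns that batch consumers produce.

```python
from typing import Optional, Sequence


def lag_slope(timestamps: Sequence[float], lags: Sequence[float]) -> float:
    """Least-squares slope of lag over time, in records per second."""
    n = len(timestamps)
    if n < 2 or n != len(lags):
        raise ValueError("need at least two aligned samples")
    mean_t = sum(timestamps) / n
    mean_l = sum(lags) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")
    cov = sum((t - mean_t) * (l - mean_l) for t, l in zip(timestamps, lags))
    return cov / var_t


def seconds_until_threshold(
    timestamps: Sequence[float],
    lags: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Estimate seconds until lag reaches the threshold.

    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if the fitted trend is flat or falling.
    """
    if lags[-1] >= threshold:
        return 0.0
    slope = lag_slope(timestamps, lags)
    if slope <= 0:
        return None
    # Extrapolate from the most recent sample along the fitted slope.
    return (threshold - lags[-1]) / slope


if __name__ == "__main__":
    # Lag sampled once a minute (made-up numbers).
    ts = [0, 60, 120, 180]
    lag = [100, 200, 300, 400]
    print(seconds_until_threshold(ts, lag, threshold=1000))
```

An early estimate like this matters because the remedies listed on the slide (adding broker or consumer capacity, resetting offsets) take time, so the warning has to arrive before the service level is already breached.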

Bluetooth Low Energy(BLE) and beacons working
 
My self introduction to know others abut me
My self  introduction to know others abut meMy self  introduction to know others abut me
My self introduction to know others abut me
 
Enhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersEnhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for Partners
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024
 
COE AI Lab Universities
COE AI Lab UniversitiesCOE AI Lab Universities
COE AI Lab Universities
 

Forecasting Kafka Lag Issues with Machine Learning

  • 1. Forecasting Kafka Lag with Machine Learning Kumaran Ponnambalam Principal Engineer - AI
  • 2. Speaker Intro: Kumaran Ponnambalam
      • AI/ML Leader & Author
      • 20+ years in Data, Analytics, AI/ML
      • Current
          • Principal Engineer – AI
          • Outshift by Cisco
      • Author
          • LinkedIn Learning – AI/ML & Big Data
          • Apache Kafka Essential Training – Getting Started
          • Apache Kafka Essential Training – Building scalable apps
      © 2023 Outshift by Cisco and/or its affiliates. All rights reserved. Outshift by Cisco Public.
  • 3. Agenda
      • The Kafka consumer lag problem
      • Need for lag forecasting
      • Current approaches to lag forecasting
      • Using Retrieval Augmented Generation
      • Our approach
      • Examples of real time forecasting
  • 4. Kafka consumer lag
  • 5. The Kafka consumer lag problem
      • One topic, multiple consumer groups
      • Consumers consume data independently of producers
      • Lag indicates "pending" records not yet consumed by a consumer group
      • Lag can happen by design (batch consumers) or because the consumers are not able to keep up with the producers
      • Lag directly impacts the latency of data processing pipelines and hence when outcomes are achieved
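The lag definition above can be made concrete with a little arithmetic. The following is a minimal sketch, not the speaker's implementation: it assumes lag per partition is the log end offset minus the group's committed offset, and the function names and offset numbers are hypothetical.

```python
from typing import Dict


def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Pending records for one partition: newest offset minus committed offset."""
    return max(log_end_offset - committed_offset, 0)


def group_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Total lag of one consumer group across the partitions of a topic.

    Simplifying assumption: a partition with no committed offset is
    counted as fully pending (committed offset 0).
    """
    return sum(
        partition_lag(end, committed.get(partition, 0))
        for partition, end in end_offsets.items()
    )


# One topic, two consumer groups reading it independently (made-up offsets).
end_offsets = {0: 1500, 1: 1200, 2: 900}
realtime_group = {0: 1495, 1: 1200, 2: 898}   # keeps up with the producers
batch_group = {0: 1000, 1: 700, 2: 400}       # lags by design

print(group_lag(end_offsets, realtime_group))  # 7
print(group_lag(end_offsets, batch_group))     # 1500
```

The same topic yields very different lag values per consumer group, which is why lag has to be tracked and forecast per group rather than per topic.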
  • 6. Impact of unplanned lag
      • Pending records keep increasing over time
      • Delays in achieving desired application service levels
      • Business impact
          • A real-time threat alert system will delay alerts, leading to the threat becoming real
          • A real-time recommendation system may not provide recommendations in time, leading to loss of business
      • Restoring service levels is involved and takes time
          • Increasing capacity of brokers / consumers
          • Resetting consumer offsets
  • 7. Popular lag patterns
  • 8. Trends observed in consumer lag
      • Each consumer group has a steady-state consumption pattern
          • Repeating steady-state patterns over time
          • Keeps lag close to zero or at a steady value
      • Hard failures are easy to track, but lag creep is not
          • Increasing load vs. limited capacity
          • Soft failures across systems / infrastructure
      • When lag happens, will the consumer eventually catch up?
      • Steady state and exception patterns do exist
      • Patterns vary by individual consumer group
  • 9. Forecasting Kafka Lag patterns
  • 10. Why forecast Kafka Lag?
      • So that we know if the consumer group won't eventually catch up
          • We will know ahead of time!
      • We can mitigate the problem
          • We could possibly prevent it!
          • Have time to implement corrective actions (add nodes, etc.)
      • Minimize business impact
          • Mitigate before customers notice
          • Avoid loss of revenue or trust
  • 11. Forecasting Lag
      • A pattern represents values for lag over time
      • For forecasting, we split the pattern into a key and a value
      • Given a query (key), can we forecast the value?
      • In real time, at specified intervals (say every 1 min), we can get the last X data points (key) and forecast the next Y data points (value)
      • [Diagram labels: Key, Value, Forecasting]
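The key/value split described above can be sketched as a sliding window over a lag time series. This is an illustrative sketch only: the function name `split_windows`, the window lengths, and the sample lag values are assumptions, with `key_len` standing in for X and `value_len` for Y.

```python
from typing import List, Tuple

Window = Tuple[List[float], List[float]]  # (key, value)


def split_windows(series: List[float], key_len: int, value_len: int) -> List[Window]:
    """Slide over a lag series and emit (key, value) pairs.

    key   = key_len consecutive observations (the "last X data points")
    value = the value_len observations that followed (the "next Y data points")
    """
    pairs = []
    for start in range(len(series) - key_len - value_len + 1):
        key = series[start:start + key_len]
        value = series[start + key_len:start + key_len + value_len]
        pairs.append((key, value))
    return pairs


# Hypothetical lag samples, one per interval.
lag = [0, 2, 5, 9, 4, 1, 0, 3]
pairs = split_windows(lag, key_len=4, value_len=2)

for key, value in pairs:
    print(key, "->", value)
```

Historical pairs like these can serve both as entries in a lookup store and as training examples for a sequence model; at query time only the key (the most recent X points) is known.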
  • 12. Current approaches to Kafka lag forecasting
      • ARIMA
          • Moving average computation to forecast time series
          • Easy to implement and provides good forecasts on repeating patterns
          • Cannot handle non-linear relationships and sudden shocks
      • ML Sequence Models
          • Use a deep learning model to forecast the next pattern in a time series
          • Can model complex patterns and relationships, especially exceptions
          • Expensive, especially when predictions need to be made in real time (every 5 sec. or so)
      • Knowledge Base of previous patterns
          • Use a knowledge base of previously observed patterns and the eventual result
          • Easy to manage and low cost for lookups
          • Needs prior patterns for the consumer group; cannot forecast new trends
  • 13. Generative AI for forecasting
      • Generative AI has taken the world by storm this year
      • Applied to several use cases for machine learning
      • Forecasting sequences of numbers is also a "generation" use case
      • While Gen AI is popular with text & images, applications for numeric data also exist
      • Generative models can be built to understand sequence patterns and forecast into the future
  • 14. Retrieval Augmented Generation (RAG)
      • Popular technique in Generative AI for creating knowledge-based applications (e.g., question answering)
      • Uses a combination of
          • A deep learning model that can generate outputs (patterns, text, images) based on its learning
          • A knowledge base that stores information about a specific domain / context
      • Using a knowledge base constrains the output to the specific domain. It also costs significantly less than using a deep learning model as a knowledge store
      • The deep learning model complements the knowledge base to generate output
  • 15. Forecasting Kafka lag with RAG
      • Pattern DB: Built based on previously observed patterns for a given Kafka topic consumer
      • Estimation Models: Ensemble of models, trained on previously observed patterns across Kafka topic consumers
      • Pattern Matching Service: Retrieval based on similarity between the current observed lag and patterns in the DB
      • Span Estimation Service: Choose the best forecast (generation) from predictions of the model ensemble
      • Forecast Engine: Create forecast from retrieved patterns + generated patterns based on native heuristics
      • Clients: Send the current observed lag pattern and fetch the forecast for the immediate future
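The retrieval step, matching the current lag against stored patterns by similarity, might look roughly like the sketch below. The slides do not say which similarity measure is used; Euclidean distance, the `max_distance` cutoff, and all names here are assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Tuple

Pattern = Tuple[List[float], List[float]]  # (key, value) as stored in the pattern DB


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length lag windows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(
    pattern_db: List[Pattern], query: List[float], max_distance: float
) -> Optional[Tuple[List[float], float]]:
    """Return (value, distance) of the closest stored key, or None if no
    stored key is within max_distance of the query."""
    best_value, best_dist = None, float("inf")
    for key, value in pattern_db:
        if len(key) != len(query):
            continue  # only compare windows of the same length
        d = distance(key, query)
        if d < best_dist:
            best_value, best_dist = value, d
    if best_value is None or best_dist > max_distance:
        return None
    return best_value, best_dist


# Hypothetical pattern DB for one consumer group.
pattern_db = [
    ([0, 2, 5, 9], [4, 1]),         # steady-state sawtooth
    ([10, 20, 30, 40], [50, 60]),   # previously observed lag creep
]

match = retrieve(pattern_db, query=[0, 3, 5, 8], max_distance=3.0)
miss = retrieve(pattern_db, query=[100, 100, 100, 100], max_distance=3.0)
print(match)
print(miss)
```

A linear scan is fine for a handful of patterns per consumer group; a larger store would presumably call for an indexed nearest-neighbor lookup, which is outside the scope of this sketch.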
  • 16. Forecasting with RAG
      • Pattern DB
          • Built based on past observed data
          • Provides steady state lag patterns (and exception patterns observed previously)
          • Usually sufficient to forecast lag during steady state operations
          • Cheap
      • Ensemble of estimation models
          • Provides general patterns, especially ones unknown for the specific consumer
          • Only used when the pattern DB has no good matches (often initially, only for exceptions later)
          • Expensive, but helps model unknown scenarios
  • 17. Forecasting with RAG (contd.)
      • Forecast Engine
          • Smart engine that picks pattern DB vs. ensemble models based on quality of predictions
          • Pattern DB first
          • Uses ensemble models only when the pattern DB does not return high quality results
          • Uses heuristics & ML to make decisions
      • Auto ML for training
          • Kafka lag is observed for a given consumer-group and used to build the pattern DB / models
          • Pattern DB updated when new patterns are observed
          • Models retrained when new patterns / drift are observed
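The "pattern DB first, ensemble only as fallback" decision can be illustrated with a small sketch. Everything here is an assumption made for the example: the two toy models stand in for the trained ensemble, and choosing the "best" model by backtesting on the most recent points is one plausible heuristic, not the one the talk describes.

```python
from typing import Callable, Dict, List, Optional, Tuple

Forecaster = Callable[[List[float], int], List[float]]


def naive_last(history: List[float], horizon: int) -> List[float]:
    """Toy model: repeat the last observed lag value."""
    return [history[-1]] * horizon


def linear_drift(history: List[float], horizon: int) -> List[float]:
    """Toy model: extend the average slope of the history."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (i + 1) for i in range(horizon)]


def backtest_error(model: Forecaster, history: List[float], horizon: int) -> float:
    """Mean absolute error when predicting the last `horizon` points
    from the points before them (needs len(history) >= horizon + 2)."""
    train, actual = history[:-horizon], history[-horizon:]
    predicted = model(train, horizon)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / horizon


def forecast(
    history: List[float],
    horizon: int,
    db_match: Optional[List[float]],
    ensemble: Dict[str, Forecaster],
) -> Tuple[str, List[float]]:
    """Use the retrieved pattern when there is one; otherwise fall back
    to the ensemble member with the lowest backtest error."""
    if db_match is not None:
        return "pattern_db", db_match[:horizon]
    name = min(ensemble, key=lambda n: backtest_error(ensemble[n], history, horizon))
    return name, ensemble[name](history, horizon)


ensemble = {"naive_last": naive_last, "linear_drift": linear_drift}
history = [10, 20, 30, 40, 50, 60]  # hypothetical lag creep

print(forecast(history, 2, db_match=[55, 58], ensemble=ensemble))
print(forecast(history, 2, db_match=None, ensemble=ensemble))
```

Keeping the cheap lookup on the hot path and paying for model inference only on a miss mirrors the cost argument made on the slides.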
  • 18. Real time forecasting examples
  • 19. Example 1
  • 20. Example 2
  • 21. Example 3
  • 22. Using lag forecasts
      • Set up thresholds based on expected consumer behavior
      • Automate by querying the forecast at regular intervals
      • Use the predicted sequences / trend line to forecast expected lag values in the future
      • Monitor predictions over time to ensure that the trend stays consistent
      • Trigger alerts / actions when the trend points to threshold violations in the future
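Projecting a trend line from a forecast and checking it against a threshold could be done along these lines. This is a hedged sketch: the least-squares fit, the function names, and the sample numbers are assumptions, and a real system would also need the monitoring-over-time step listed above before alerting.

```python
import math
from typing import List, Optional, Tuple


def fit_trend(values: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares line through (0, v0), (1, v1), ...
    Returns (slope, intercept). Needs at least two points."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def steps_until_breach(forecast: List[float], threshold: float) -> Optional[int]:
    """Number of intervals, counted from the first forecast point, until
    the trend line reaches the threshold. None if the trend never gets there."""
    slope, intercept = fit_trend(forecast)
    if intercept >= threshold:
        return 0  # trend already at or above the threshold
    if slope <= 0:
        return None  # flat or recovering lag
    return math.ceil((threshold - intercept) / slope)


# Hypothetical forecast sequences, one value per interval.
creeping = [100, 150, 200, 250]
steady = [5, 5, 5, 5]

print(steps_until_breach(creeping, threshold=500))  # 8
print(steps_until_breach(steady, threshold=500))    # None
```

A caller polling the forecast at regular intervals could raise an alert once the returned step count drops below the lead time needed for corrective action.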
  • 23. Future work
      • Add ARIMA based models too, to have a 3-dimensional forecasting engine
      • Build a Kafka plugin to observe lag by consumer-group and publish to the forecast engine
      • Add user interfaces for provisioning and analytics
      • Triggers for exceptions