Forecasting Kafka Lag with Machine Learning
Kumaran Ponnambalam
Principal Engineer - AI
© 2023 Outshift by Cisco and/or its affiliates. All rights reserved. Outshift by Cisco Public.
Speaker Intro
• AI/ML Leader & Author
• 20+ years in Data, Analytics, AI/ML
• Current
  • Principal Engineer – AI
  • Outshift by Cisco
• Author
  • LinkedIn Learning – AI/ML & Big Data
  • Apache Kafka Essential Training – Getting Started
  • Apache Kafka Essential Training – Building scalable apps
Kumaran Ponnambalam
Agenda
• The Kafka consumer lag problem
• Need for lag forecasting
• Current approaches to lag forecasting
• Using Retrieval Augmented Generation
• Our approach
• Examples of real time forecasting
Kafka consumer lag
The Kafka consumer lag problem
• One topic – multiple consumer groups
• Consumers consume data independently of producers
• Lag indicates "pending" records not yet consumed by a consumer group
• Lag can happen by design (batch consumers) or because the consumers are not able to keep up with the producers
• Lag directly impacts the latency of data processing pipelines and hence when outcomes are achieved
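To make "pending records" concrete, here is a minimal sketch of how lag is commonly computed per partition: the log-end offset minus the consumer group's committed offset. The function names and the plain-dictionary inputs are assumptions for illustration; a real setup would read these offsets from a Kafka client or admin tool.

```python
from typing import Dict, Optional


def consumer_group_lag(end_offsets: Dict[int, int],
                       committed_offsets: Dict[int, Optional[int]]) -> Dict[int, int]:
    """Per-partition lag: log-end offset minus the group's committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            # Simplifying assumption: with no committed offset, treat the
            # whole partition (from offset 0) as pending.
            committed = 0
        # Clamp at zero in case the two offsets were sampled at different times.
        lag[partition] = max(end - committed, 0)
    return lag


def total_lag(end_offsets: Dict[int, int],
              committed_offsets: Dict[int, Optional[int]]) -> int:
    """Total lag of the consumer group across all partitions of the topic."""
    return sum(consumer_group_lag(end_offsets, committed_offsets).values())
```

Each consumer group has its own committed offsets, so the same topic can show very different lag for different groups.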
Impact of unplanned lag
• Pending records keep increasing over time
• Delays in achieving desired application service levels
• Business impact
  • A real-time threat alert system will delay alerts, giving the threat time to become real
  • A real-time recommendation system may not provide recommendations in time, leading to loss of business
• Restoring service levels is involved and takes time
  • Increasing capacity of brokers / consumers
  • Resetting consumer offsets
Popular lag patterns
Trends observed in consumer lag
• Each consumer group has a steady-state consumption pattern
  • Repeating steady-state patterns over time
  • Keeps lag close to zero or at a steady value
• Hard failures are easy to track, but lag creep is not
  • Increasing load vs. limited capacity
  • Soft failures across systems / infrastructure
• When lag happens, will the consumer eventually catch up?
  • Steady-state and exception patterns do exist
  • Patterns vary by individual consumer group
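Lag creep is hard to spot because no single sample looks alarming. One simple way to surface it, shown here as a sketch and not as the method used in the talk, is to fit a least-squares line to a recent window of lag samples and look at its slope. The function names and the `min_slope` parameter are hypothetical.

```python
from typing import Sequence


def lag_slope(lags: Sequence[float]) -> float:
    """Least-squares slope of lag versus sample index (records per interval)."""
    n = len(lags)
    if n < 2:
        raise ValueError("need at least two lag samples")
    mean_x = (n - 1) / 2
    mean_y = sum(lags) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(lags))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_creeping(lags: Sequence[float], min_slope: float = 1.0) -> bool:
    """Flag a window whose lag grows faster than min_slope records per interval."""
    return lag_slope(lags) > min_slope
```

A steady-state window oscillates around a constant and yields a slope near zero, while creep shows up as a persistently positive slope.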
Forecasting Kafka Lag patterns
Why forecast Kafka Lag?
• So that we know if the consumer group won't eventually catch up
  • We will know ahead of time!
• We can mitigate the problem
  • We could possibly prevent it!
  • Have time to implement corrective actions (add nodes etc.)
• Minimize business impact
  • Mitigate before customers notice
  • Avoid loss of revenue or trust
Forecasting Lag
• A pattern represents values for lag over time
• For forecasting, we split the pattern into a key and a value
• Given a query (key), can we forecast the value?
• In real time, at specified intervals (say every 1 min), we can get the last X data points (key) and forecast the next Y data points (value)
[Figure labels: Key, Value, Forecasting]
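The key/value split can be sketched as a sliding window over a lag series: each window of X points is a key, and the Y points that follow are its value. The function names `split_key_value` and `latest_key` are illustrative, not taken from the talk.

```python
from typing import List, Sequence, Tuple


def split_key_value(series: Sequence[float], key_len: int,
                    value_len: int) -> List[Tuple[List[float], List[float]]]:
    """Slide over the series and emit (key, value) pairs.

    key   = key_len consecutive lag samples (the "last X data points")
    value = the value_len samples that follow (the "next Y data points")
    """
    if key_len <= 0 or value_len <= 0:
        raise ValueError("key_len and value_len must be positive")
    window = key_len + value_len
    pairs = []
    for start in range(len(series) - window + 1):
        key = list(series[start:start + key_len])
        value = list(series[start + key_len:start + window])
        pairs.append((key, value))
    return pairs


def latest_key(series: Sequence[float], key_len: int) -> List[float]:
    """The most recent key_len samples, used as the query at forecast time."""
    if key_len <= 0:
        raise ValueError("key_len must be positive")
    return list(series[-key_len:])
```

Historical pairs are what a knowledge base or model would be built from; at forecast time only the latest key is known and the value is what gets predicted.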
Current approaches to Kafka lag forecasting
• ARIMA
  • What: moving average computation to forecast time series
  • Pros: easy to implement and provides good forecasts on repeating patterns
  • Cons: cannot handle non-linear relationships and sudden shocks
• ML Sequence Models
  • What: use a deep learning model to forecast the next pattern in a time series
  • Pros: can model complex patterns and relationships, especially exceptions
  • Cons: expensive, especially when predictions need to be made in real time for 5 sec. or so
• Knowledge Base of previous patterns
  • What: use a knowledge base of previously observed patterns and their eventual results
  • Pros: easy to manage and low cost for lookups
  • Cons: needs prior patterns for the consumer group; cannot forecast new trends
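As a rough illustration of the moving-average idea behind the first approach, the sketch below forecasts each next point as the mean of the last few points and feeds that forecast back in. This is a plain moving-average baseline, not a full ARIMA model, and the function name and parameters are assumptions.

```python
from typing import List, Sequence


def moving_average_forecast(history: Sequence[float], window: int,
                            steps: int) -> List[float]:
    """Forecast `steps` future points, each as the mean of the last `window` points."""
    if window <= 0 or len(history) < window:
        raise ValueError("window must be positive and no longer than the history")
    values = list(history)
    forecast = []
    for _ in range(steps):
        next_value = sum(values[-window:]) / window
        forecast.append(next_value)
        # Feed the forecast back in so later steps build on earlier ones.
        values.append(next_value)
    return forecast
```

Because every forecast is an average of recent values, the output is pulled toward the recent mean, which is one way to see why this family of methods struggles with sudden shocks.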
Generative AI for forecasting
• Generative AI has taken the world by storm this year
• Applied to several use cases for machine learning
• Forecasting sequences of numbers is also a “generation” use case
• While Gen AI is popular with text & images, applications for numeric data also exist
• Generative models can be built to understand sequence patterns and forecast
into the future
Retrieval Augmented Generation (RAG)
• Popular technique in Generative AI for creating knowledge-based applications (e.g., question answering)
• Uses a combination of
  • A deep learning model that can generate outputs (patterns, text, images) based on its learning
  • A knowledge base that stores information about a specific domain / context
• Using a knowledge base constrains the output to the specific domain. It also costs significantly less than using a deep learning model as a knowledge store
• The deep learning model complements the knowledge base to generate output
Forecasting Kafka lag with RAG
[Architecture diagram with components: Clients, Forecast Engine, Pattern Matching Service, Pattern DB, Span Estimation Service, Estimation Models]
• Clients: send the current observed lag pattern and fetch the forecast for the immediate future
• Forecast Engine: creates the forecast from retrieved patterns + generated patterns, based on native heuristics
• Pattern Matching Service: retrieval based on similarity between the current observed lag and patterns in the DB
• Pattern DB: built based on previously observed patterns for a given Kafka topic consumer
• Span Estimation Service: chooses the best forecast (generation) from the predictions of the model ensemble
• Estimation Models: ensemble of models, trained on previously observed patterns across Kafka topic consumers
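The retrieval step can be sketched as a nearest-neighbour lookup over stored (key, value) patterns. The talk does not say which similarity measure is used; Euclidean distance is an assumption here, as are the function name `retrieve` and the in-memory list standing in for the pattern DB.

```python
import math
from typing import List, Sequence, Tuple

# A stored pattern: (key, value) = (observed lag window, lag that followed).
Pattern = Tuple[Sequence[float], Sequence[float]]


def retrieve(pattern_db: Sequence[Pattern], query: Sequence[float],
             k: int = 1) -> List[Tuple[float, List[float]]]:
    """Return up to k (distance, value) matches, closest key first."""
    scored = [
        (math.dist(key, query), list(value))
        for key, value in pattern_db
        if len(key) == len(query)  # only compare windows of the same length
    ]
    scored.sort(key=lambda item: item[0])
    return scored[:k]
```

The returned distance doubles as a match-quality signal: a small distance means the current lag closely resembles something seen before.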
Forecasting with RAG
• Pattern DB
  • Built based on past observed data
  • Provides steady-state lag patterns (and exception patterns observed previously)
  • Usually sufficient to forecast lag during steady-state operations
  • Cheap
• Ensemble of estimation models
  • Provides general patterns, especially ones unknown for the specific consumer
  • Only used when the pattern DB has no good matches (often initially, later only for exceptions)
  • Expensive, but helps model unknown scenarios
Forecasting with RAG (contd.)
• Forecast Engine
  • Smart engine that picks pattern DB vs. ensemble models based on quality of predictions
  • Pattern DB first
  • Uses ensemble models only when the pattern DB does not return high-quality results
  • Uses heuristics & ML to make decisions
• Auto ML for training
  • Kafka lag is observed for a given consumer group and used to build the pattern DB / models
  • Pattern DB updated when new patterns are observed
  • Models retrained when new patterns / drift are observed
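The "pattern DB first, ensemble only as fallback" decision can be sketched as follows. This is an illustration under stated assumptions, not the engine described in the talk: match quality is approximated by a Euclidean distance cutoff (`max_distance`), and the "best" ensemble member is taken to be the one with the lowest recently measured error, which each model is assumed to carry alongside its predict function.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pattern = Tuple[Sequence[float], Sequence[float]]                # (key, value)
Model = Tuple[float, Callable[[Sequence[float]], List[float]]]  # (recent_error, predict)


def forecast(query: Sequence[float], pattern_db: Sequence[Pattern],
             models: Sequence[Model], max_distance: float) -> Tuple[str, List[float]]:
    """Return (source, forecast), where source is "pattern_db" or "ensemble"."""
    # 1. Pattern DB first: the cheap lookup of the closest stored key.
    best_dist, best_value = math.inf, None
    for key, value in pattern_db:
        if len(key) != len(query):
            continue
        dist = math.dist(key, query)
        if dist < best_dist:
            best_dist, best_value = dist, value
    if best_value is not None and best_dist <= max_distance:
        return "pattern_db", list(best_value)

    # 2. Fallback: the more expensive ensemble, used only without a good match.
    if not models:
        raise ValueError("no good pattern match and no estimation models available")
    _, predict = min(models, key=lambda model: model[0])
    return "ensemble", list(predict(query))
```

Returning the source along with the forecast makes it easy to track how often the fallback is used, which could also signal when new patterns should be added to the pattern DB.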
Real time forecasting examples
Example 1
Example 2
Example 3
Using lag forecasts
• Set up thresholds based on expected consumer behavior
• Automate by querying the forecast at regular intervals
• Use the predicted sequences / trend line to forecast expected lag values in the future
• Monitor predictions over time to ensure that the trend stays consistent
• Trigger alerts / actions when the trend points to threshold violations in the future
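The threshold check on a forecast can be sketched as below: scan the predicted lag values and report how soon the first violation is expected. The function names and the `interval_seconds` parameter (the spacing between forecast points) are assumptions for illustration.

```python
from typing import Optional, Sequence


def first_breach(forecast: Sequence[float], threshold: float) -> Optional[int]:
    """1-based step at which forecast lag first exceeds the threshold, or None."""
    for step, lag in enumerate(forecast, start=1):
        if lag > threshold:
            return step
    return None


def seconds_until_breach(forecast: Sequence[float], threshold: float,
                         interval_seconds: float) -> Optional[float]:
    """Lead time before the predicted violation, or None if none is predicted."""
    step = first_breach(forecast, threshold)
    if step is None:
        return None
    return step * interval_seconds
```

A scheduler could call this on every new forecast and raise an alert or trigger a scaling action only when the breach persists across several consecutive forecasts, in line with monitoring that the trend stays consistent.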
Future work
• Add ARIMA-based models too, to have a 3-dimensional forecasting engine
• Build a Kafka plugin to observe lag by consumer group and publish it to the forecast engine
• Add user interfaces for provisioning and analytics
• Triggers for exceptions
Thank You !
Advanced Test Driven-Development @ php[tek] 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Forecasting Kafka Lag with Machine Learning

  • 1. Forecasting Kafka Lag with Machine Learning. Kumaran Ponnambalam, Principal Engineer - AI
  • 2. Speaker Intro: Kumaran Ponnambalam
      • AI/ML Leader & Author
      • 20+ years in Data, Analytics, AI/ML
      • Current
      • Principal Engineer – AI
      • Outshift by Cisco
      • Author
      • LinkedIn Learning – AI/ML & Big Data
      • Apache Kafka Essential Training – Getting Started
      • Apache Kafka Essential Training – Building scalable apps
  • 3. Agenda
      • The Kafka consumer lag problem
      • Need for lag forecasting
      • Current approaches to lag forecasting
      • Using Retrieval Augmented Generation
      • Our approach
      • Examples of real-time forecasting
  • 4. Kafka consumer lag
  • 5. The Kafka consumer lag problem
      • One topic – multiple consumer groups
      • Consumers consume data independently of producers
      • Lag indicates "pending" records not yet consumed by a consumer group
      • Lag can happen by design (batch consumers) or because the consumers are not able to keep up with the producers
      • Lag directly impacts the latency of data processing pipelines and hence when outcomes are achieved
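The "pending records" on this slide can be made concrete with a small sketch. It assumes lag is measured per partition as the log end offset minus the consumer group's committed offset, which is the usual definition rather than something stated in the deck. The function name and the sample offsets are hypothetical.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag and the total for one consumer group.

    Both arguments map a partition id to an offset. A partition without a
    committed offset is treated as fully unread (committed offset 0); that is
    an assumption of this sketch, not Kafka's own reset behavior.
    """
    per_partition = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two offsets were sampled at different times.
        per_partition[partition] = max(end - committed, 0)
    return per_partition, sum(per_partition.values())


lag_by_partition, total_lag = consumer_lag(
    end_offsets={0: 1500, 1: 1200, 2: 900},
    committed_offsets={0: 1450, 1: 1200},
)
print(lag_by_partition)  # {0: 50, 1: 0, 2: 900}
print(total_lag)         # 950
```

Sampling this total at a fixed interval gives the lag time series that the rest of the talk forecasts.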
  • 6. Impact of unplanned lag
      • Pending records keep increasing over time
      • Delays in achieving desired application service levels
      • Business impact
      • A real-time threat alert system will delay alerts, so the threat may become real
      • A real-time recommendation system may not provide recommendations in time, leading to loss of business
      • Restoring service levels is involved and takes time
      • Increasing capacity of brokers / consumers
      • Resetting consumer offsets
  • 7. Popular lag patterns
  • 8. Trends observed in consumer lag
      • Each consumer group has a steady-state consumption pattern
      • Repeating steady-state patterns over time
      • Keeps lag close to zero or at a steady value
      • Hard failures are easy to track, but lag creep is not
      • Increasing load vs. limited capacity
      • Soft failures across systems / infrastructure
      • When lag happens, will the consumer eventually catch up?
      • Steady-state and exception patterns do exist
      • Patterns vary by individual consumer group
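The catch-up question on this slide has a simple back-of-the-envelope form when rates are constant: lag drains only if the consumer is faster than the producer. The sketch below is illustrative arithmetic under that constant-rate assumption, not the forecasting method described in the talk. The function name and the rates are hypothetical.

```python
import math


def seconds_to_catch_up(current_lag, produce_rate, consume_rate):
    """Estimate seconds until lag reaches zero at constant rates (records/s).

    Returns math.inf when the consumer is not faster than the producer,
    i.e. the "lag creep" case where the group never catches up.
    """
    if current_lag <= 0:
        return 0.0
    net_drain = consume_rate - produce_rate
    if net_drain <= 0:
        return math.inf
    return current_lag / net_drain


print(seconds_to_catch_up(12_000, produce_rate=800, consume_rate=1_000))  # 60.0
print(seconds_to_catch_up(12_000, produce_rate=1_000, consume_rate=900))  # inf
```

Real rates are rarely constant, which is why the talk moves from this kind of estimate to pattern-based forecasting.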
  • 9. Forecasting Kafka Lag patterns
  • 10. Why forecast Kafka Lag?
      • So that we know whether the consumer group will eventually fail to catch up
      • We will know ahead of time!
      • We can mitigate the problem
      • We could possibly prevent it!
      • Have time to implement corrective actions (add nodes, etc.)
      • Minimize business impact
      • Mitigate before customers notice
      • Avoid loss of revenue or trust
  • 11. Forecasting Lag
      • A pattern represents values for lag over time
      • For forecasting, we split the pattern into a key and a value
      • Given a query (key), can we forecast the value?
      • In real time, at specified intervals (say every 1 min), we can take the last X data points (key) and forecast the next Y data points (value)
      • [Diagram labels: Key, Value, Forecasting]
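The key/value split described on this slide can be sketched as a sliding window over the lag series. This is a minimal illustration assuming a plain list of lag samples; the function names and the window sizes (the slide's X and Y) are hypothetical.

```python
def split_key_value(series, key_len, value_len):
    """Slide over the series and return (key, value) pairs.

    Each key is key_len consecutive lag samples; its value is the
    value_len samples that immediately follow.
    """
    pairs = []
    last_start = len(series) - key_len - value_len
    for start in range(last_start + 1):
        key = series[start:start + key_len]
        value = series[start + key_len:start + key_len + value_len]
        pairs.append((key, value))
    return pairs


def latest_key(series, key_len):
    """Return the most recent key_len samples, used as the real-time query."""
    return series[-key_len:]


lag = [0, 5, 12, 20, 31, 45, 60]
pairs = split_key_value(lag, key_len=3, value_len=2)
print(pairs[0])            # ([0, 5, 12], [20, 31])
print(len(pairs))          # 3
print(latest_key(lag, 3))  # [31, 45, 60]
```

The historical pairs are what a pattern store or model would learn from; the latest key is the query sent at each interval.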
  • 12. Current approaches to Kafka lag forecasting
      • ARIMA: moving-average-based computation to forecast time series. Easy to implement and provides good forecasts on repeating patterns. Cannot handle non-linear relationships and sudden shocks.
      • ML Sequence Models: use a deep learning model to forecast the next pattern in the time series. Can model complex patterns and relationships, especially exceptions. Expensive, especially when predictions need to be made in real time at intervals of 5 seconds or so.
      • Knowledge Base of previous patterns: use a knowledge base of previously observed patterns and their eventual results. Easy to manage and low cost for lookups. Needs prior patterns for the consumer group; cannot forecast new trends.
  • 13. Generative AI for forecasting
      • Generative AI has taken the world by storm this year
      • Applied to several use cases for machine learning
      • Forecasting sequences of numbers is also a "generation" use case
      • While Gen AI is popular with text & images, applications for numeric data also exist
      • Generative models can be built to understand sequence patterns and forecast into the future
  • 14. Retrieval Augmented Generation (RAG)
      • Popular technique in Generative AI for creating knowledge-based applications (e.g., question answering)
      • Uses a combination of
      • A deep learning model that can generate outputs (patterns, text, images) based on its learning
      • A knowledge base that stores information about a specific domain / context
      • Using a knowledge base constrains the output to the specific domain. It also costs significantly less than using a deep learning model as a knowledge store
      • The deep learning model complements the knowledge base to generate output
  • 15. Forecasting Kafka lag with RAG
      • [Architecture diagram: Clients, Forecast Engine, Pattern Matching Service, Span Estimation Service, Pattern DB, Estimation Models]
      • Pattern DB: built based on previously observed patterns for a given Kafka topic consumer
      • Estimation Models: ensemble of models, trained on previously observed patterns across Kafka topic consumers
      • Pattern Matching Service: retrieval based on similarity between the currently observed lag and the patterns in the DB
      • Span Estimation Service: chooses the best forecast (generation) from the predictions of the model ensemble
      • Forecast Engine: creates the forecast from retrieved patterns + generated patterns, based on native heuristics
      • Clients: send the currently observed lag pattern and fetch the forecast for the immediate future
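The retrieval step on this slide (similarity between the current lag and stored patterns) can be sketched as a nearest-neighbor lookup. The deck does not say which similarity measure is used; Euclidean distance is an assumption here, and the in-memory list standing in for the Pattern DB, the function names, and the sample patterns are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(pattern_db, query_key):
    """Return (distance, value) for the stored key closest to query_key.

    pattern_db is a list of (key, value) pairs. Keys whose length differs
    from the query are skipped. Returns None if nothing is comparable.
    """
    best = None
    for key, value in pattern_db:
        if len(key) != len(query_key):
            continue
        distance = euclidean(key, query_key)
        if best is None or distance < best[0]:
            best = (distance, value)
    return best


pattern_db = [
    ([0, 2, 1], [0, 1]),        # steady state: lag stays near zero
    ([10, 20, 30], [40, 50]),   # exception: lag keeps climbing
]
distance, forecast = retrieve(pattern_db, [11, 19, 31])
print(forecast)            # [40, 50]
print(round(distance, 3))  # 1.732
```

The returned distance doubles as a match-quality signal: a large distance suggests the store has no good pattern for the current behavior.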
  • 16. Forecasting with RAG
      • Pattern DB
      • Built based on past observed data
      • Provides steady-state lag patterns (and exception patterns observed previously)
      • Usually sufficient to forecast lag during steady-state operations
      • Cheap
      • Ensemble of estimation models
      • Provides general patterns, especially ones not yet seen for the specific consumer
      • Only used when the pattern DB has no good matches (frequent initially, later only for exceptions)
      • Expensive, but helps model unknown scenarios
  • 17. Forecasting with RAG (contd.)
      • Forecast Engine
      • Smart engine that picks pattern DB vs. ensemble models based on the quality of predictions
      • Pattern DB first
      • Uses ensemble models only when the pattern DB does not return high-quality results
      • Uses heuristics & ML to make decisions
      • Auto ML for training
      • Kafka lag is observed for a given consumer group and used to build the pattern DB / models
      • Pattern DB updated when new patterns are observed
      • Models retrained when new patterns / drift are observed
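The "pattern DB first, ensemble only as fallback" rule can be sketched as follows. Everything here is an assumption for illustration: the distance cutoff `max_distance`, the two toy models standing in for the ensemble, and the backtest used to choose between them are hypothetical, since the deck does not describe the engine's actual heuristics.

```python
def last_value_model(window, horizon):
    """Toy model: repeat the most recent lag value."""
    return [window[-1]] * horizon


def drift_model(window, horizon):
    """Toy model: extend the average slope of the window (needs 2+ points)."""
    slope = (window[-1] - window[0]) / (len(window) - 1)
    return [window[-1] + slope * (i + 1) for i in range(horizon)]


def backtest_error(model, window):
    """Absolute error when predicting the window's last point from the rest."""
    predicted = model(window[:-1], 1)[0]
    return abs(predicted - window[-1])


def forecast(window, horizon, db_match, max_distance, models):
    """Prefer the pattern DB match; otherwise use the best-scoring model.

    db_match is a (distance, value) pair from the pattern DB, or None.
    window must hold at least 3 points so the backtest can run.
    """
    if db_match is not None and db_match[0] <= max_distance:
        return "pattern_db", db_match[1][:horizon]
    best = min(models, key=lambda m: backtest_error(m, window))
    return "ensemble", best(window, horizon)


models = [last_value_model, drift_model]
window = [10, 20, 30, 40]
print(forecast(window, 2, (1.0, [41, 43]), 5.0, models))  # ('pattern_db', [41, 43])
print(forecast(window, 2, (25.0, [1, 2]), 5.0, models))   # ('ensemble', [50.0, 60.0])
```

Keeping the cheap lookup on the hot path and calling models only on a poor match mirrors the cost argument made on the previous slide.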
  • 18. Real time forecasting examples
  • 19. Example 1
  • 20. Example 2
  • 21. Example 3
  • 22. Using lag forecasts
      • Set up thresholds based on expected consumer behavior
      • Automate by querying the forecast at regular intervals
      • Use the predicted sequences / trend line to forecast expected lag values in the future
      • Monitor predictions over time to ensure that the trend stays consistent
      • Trigger alerts / actions when the trend points to threshold violations in the future
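The threshold check on this slide can be sketched by fitting a trend line to a forecast and asking when it would cross the threshold. A least-squares line is an assumption; the deck does not specify how the trend line is computed. The function names, the threshold, and the sample forecast are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept). Needs at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def steps_until_breach(forecast, threshold, max_steps):
    """Return the first step (0..max_steps) where the trend line reaches
    the threshold, or None if it stays below within that horizon."""
    slope, intercept = fit_trend(forecast)
    for step in range(max_steps + 1):
        if intercept + slope * step >= threshold:
            return step
    return None


forecast = [100, 150, 200, 250]  # predicted lag for the next four intervals
print(steps_until_breach(forecast, threshold=500, max_steps=10))  # 8
print(steps_until_breach(forecast, threshold=500, max_steps=5))   # None
```

A scheduler could run this check on every fresh forecast and raise an alert only when a breach is predicted for several consecutive runs, which is one way to read "ensure that the trend stays consistent".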
  • 23. Future work
      • Add ARIMA-based models too, to have a 3-dimensional forecasting engine
      • Build a Kafka plugin to observe lag by consumer group and publish it to the forecast engine
      • Add user interfaces for provisioning and analytics
      • Triggers for exceptions