SlideShare a Scribd company logo
www.clairvoyantsoft.com
Building Zero Data loss pipelines
with Apache Kafka
By: Avinash Ramineni
Quick Poll
| 2
| 3
About
Background Awards & Recognition
Boutique consulting firm centered on building data solutions and
products
All things Web and Data Engineering, Analytics, ML and User
Experience to bring it all together
Support core Hadoop platform, data engineering pipelines and provide
administrative and devops expertise focused on Hadoop
| 4
● Introduction and terminologies
● How can we lose data ?
● Zero Data Loss Pipelines
○ Producer
○ Kafka Cluster
○ Consumer
● Monitoring for Data Loss
● Summary
Agenda
| 5
● An open source distributed stream processing platform
○ High Throughput
○ Scalable
○ Low Latency
○ Real-time
○ Fault Tolerant
● Messages are organized as topics
● Producers push messages
● Consumers pull messages
● Kafka doesn't have message acknowledgments, it assumes the consumer keeps tracks of what's been consumed
so far.
Kafka - Basics
| 6
● Each topic has multiple partitions
● Unit of parallelism
● IDs are unique for a partition for a topic
● Makes sure partitions within the same topic are roughly the same size
● Too many partitions ?.
Kafka - Partitions
| 7
Consuming from a Kafka Topic
Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
| 8
● Replication factor
○ Unit of replication is partition
○ Producer ACk
● A partition is owned by a single broker in the cluster, and that broker is called the leader for the partition.
● Producers and Consumers are only served by Leaders
● A Zookeeper cluster is called an “ensemble”.
● Spreads leader partitions evenly on brokers throughout the cluster
Kafka - Cluster
| 9
Kafka - Cluster
| 10
● Any Kafka client (a producer or consumer) communicates only with the leader partition for data
○ All other partitions exist for redundancy and failover.
○ Follower partitions are responsible for copying new records from their leader partitions.
○ Follower partitions have an exact copy of the contents of the leader. Such partitions are called in-sync replicas (ISR).
Kafka - ISR (In Sync Replica)
| 11
● Failures Happen
● Systems need to be designed to tolerate failure
● Expect failures and design systems to handle them
Distributed Systems
| 12
● Producer
○ Acks =0, 1, all
○ Batch size, linger time
○ Broker is down
○ Block.on.buffer.full = false
● Kafka Cluster
○ Disk writes/flush are asynchronous
■ Hardware crashes
○ Metadata corruption
○ Kafka Bugs
● Consumer
○ Offset management
■ Offsets are committed before processing the messages completely
How can we lose data ?
| 13
● At-least once delivery semantics
● Out-of-order delivery semantics
● Kafka Cluster
● Each component in the pipeline makes sure that they are not loosing the messages
○ Producer
■ Takes the responsibility of making sure that messages are successfully delivered to Kafka Brokers
○ Kafka Cluster
■ Reliable and Resilient
○ Consumers
■ Takes the responsibility for consuming all messages from Kafka Cluster with the understanding
● Messages can be delivered more than once and have idempotency logic
● Operates with understanding of retention policy of the brokers, number of partitions for the
topics, etc.
● Manages check pointing and committing offsets
Zero Data loss pipelines
| 14
● Handle cases where broker is unavailable / down / network
issue
○ Local Spooling
○ Low Latency store
○ Backup cluster
● Ack = all ?
○ Consistency Vs Availability
■ Min ISR < = R
● (ex: replication factor = 3 min isr =2)
● Latency Vs Throughput
○ Batch size and linger time
● Standardardized on Message schema
○ Sequence Id, Timestamp, originating service
Producers
| 15
● Configure broker to make sure that no messages are lost at the broker end and can survive hardware failures using below
aspects
○ Partitions
■ Spread leader partitions evenly on brokers throughout the cluster
● Balances the load
■ Partition replication
● Replication Factor (R < = size of the cluster) (network congestion?)
● Sufficient replication sets to preserve data
● Rack awareness
○ Data Retention
■ Too much retention can be a problem
■ Leader election scenarios (unclean leader election)
○ Zookeeper availability
● DR / Backup
○ Mirror Maker to replicate the messages to another cluster
Kafka Cluster
| 16
● Failure Detection
○ Reassign replications on the dead brokers
○ Monitor for under replicated partitions
● Kafka Cluster sizing
○ Disk
○ Network
Kafka Cluster
| 17
Kafka Sizing
Recommended approach: Simulate load using load generation tools that ship with Kafka
Simplest: size based on disk-space (data ingest rate * data retention)
A little better: disk and network throughput requirements
Example: 1 TB/day —> 277MB/s. Replication 3 , Consumers 10
● Disk throughput = ingest * replication *2 (for lagging readers) = 277 * 3 * 2 = 1662 MB/s
● Network read throughput = ingest * (replication -1+ consumers)=277 *(3-1+10) = 3323 MB/s
● Network write throughput = ingest rate * replication = 277 * 3 = 831 MB /s
Network (e.g. 10 GBE): (3323+831)/1250 = 3.3 nodes
Disk (e.g. 6 drives & 70MB/s): 1662 /(6*70) = 3.95 nodes
Plan for double, so 8 kafka nodes
| 18
● Understand the data retention period and design accordingly (need to keep up with the producers)
○ Parallelism = number of partitions
■ Max consumers = partitions
○ Proper checkpointing and offset management
■ Set Autocommit disabled and commit offset after processing the message
■ Commit less often and Asynchronously
■ Checkpoint frequency
○ Expect at least once delivery of a message and have idempotency logic
Consumers
| 19
● QC Check Topic
○ Message sequence Id
○ Message counts per time bucket (10 mins? )
○ Reconcile on the number of messages produced and number of messages consumed numbers
● Kafka Audit System
○ Chaperone
■ https://github.com/uber/chaperone
○ Kafka Monitor
● Continuously monitor for issues
○ Monitor for producer errors -
■ number of retries and counts in retry database
○ Monitor Consumer Lag, fetch rate, fetch latency, records per request
○ Monitor for workload skews
● Capture Metrics
Monitor for Data Loss
| 20
● Guaranteeing zero data loss is not just Kafka’s problem
● Zero data loss pipelines require operations (Kafka cluster) and development teams (Producer /
consumer) working together
● Anticipate failures and design code to handle
● Gracefully Shutdown your application
Summary
Thank You!
| 21
Questions?
avinash@clairvoyantsoft.com

More Related Content

What's hot

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
Martin Podval
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
SATOSHI TAGOMORI
 
kafka
kafkakafka
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
Prateek Maheshwari
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
Todd Palino
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
AIMDek Technologies
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
Ketan Gote
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
confluent
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Jean-Paul Azar
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
Chhavi Parasher
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Apache Kafka - Patterns anti-patterns
Apache Kafka - Patterns anti-patternsApache Kafka - Patterns anti-patterns
Apache Kafka - Patterns anti-patterns
Florent Ramiere
 
Apache Camel v3, Camel K and Camel Quarkus
Apache Camel v3, Camel K and Camel QuarkusApache Camel v3, Camel K and Camel Quarkus
Apache Camel v3, Camel K and Camel Quarkus
Claus Ibsen
 

What's hot (20)

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
kafka
kafkakafka
kafka
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Apache Kafka - Patterns anti-patterns
Apache Kafka - Patterns anti-patternsApache Kafka - Patterns anti-patterns
Apache Kafka - Patterns anti-patterns
 
Apache Camel v3, Camel K and Camel Quarkus
Apache Camel v3, Camel K and Camel QuarkusApache Camel v3, Camel K and Camel Quarkus
Apache Camel v3, Camel K and Camel Quarkus
 

Similar to Building zero data loss pipelines with apache kafka

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
Saroj Panyasrivanit
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Apache Kafka - Free Friday
Apache Kafka - Free FridayApache Kafka - Free Friday
Apache Kafka - Free Friday
Otávio Carvalho
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
LINE Corporation
 
Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018
Otávio Carvalho
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
Mark Harrison
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
Allen (Xiaozhong) Wang
 
Performance challenges in software networking
Performance challenges in software networkingPerformance challenges in software networking
Performance challenges in software networking
Stephen Hemminger
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
Samuel Kerrien
 
Uber: Kafka Consumer Proxy
Uber: Kafka Consumer ProxyUber: Kafka Consumer Proxy
Uber: Kafka Consumer Proxy
confluent
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
Gluster.org
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Peter Bakas
 
Kafka in action - Tech Talk - Paytm
Kafka in action - Tech Talk - PaytmKafka in action - Tech Talk - Paytm
Kafka in action - Tech Talk - Paytm
Sumit Jain
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
HostedbyConfluent
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
Hotstar
 
High performance messaging with Apache Pulsar
High performance messaging with Apache PulsarHigh performance messaging with Apache Pulsar
High performance messaging with Apache Pulsar
Matteo Merli
 
Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®
confluent
 

Similar to Building zero data loss pipelines with apache kafka (20)

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
kafka
kafkakafka
kafka
 
Apache Kafka - Free Friday
Apache Kafka - Free FridayApache Kafka - Free Friday
Apache Kafka - Free Friday
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
 
Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018Non-Kafkaesque Apache Kafka - Yottabyte 2018
Non-Kafkaesque Apache Kafka - Yottabyte 2018
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
 
Performance challenges in software networking
Performance challenges in software networkingPerformance challenges in software networking
Performance challenges in software networking
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Uber: Kafka Consumer Proxy
Uber: Kafka Consumer ProxyUber: Kafka Consumer Proxy
Uber: Kafka Consumer Proxy
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Kafka in action - Tech Talk - Paytm
Kafka in action - Tech Talk - PaytmKafka in action - Tech Talk - Paytm
Kafka in action - Tech Talk - Paytm
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
High performance messaging with Apache Pulsar
High performance messaging with Apache PulsarHigh performance messaging with Apache Pulsar
High performance messaging with Apache Pulsar
 
Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®
 

More from Avinash Ramineni

Simplifying the data privacy governance quagmire building automated privacy ...
Simplifying the data privacy governance quagmire  building automated privacy ...Simplifying the data privacy governance quagmire  building automated privacy ...
Simplifying the data privacy governance quagmire building automated privacy ...
Avinash Ramineni
 
Winning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeWinning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscape
Avinash Ramineni
 
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Avinash Ramineni
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
Avinash Ramineni
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Avinash Ramineni
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
Avinash Ramineni
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
Avinash Ramineni
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
Avinash Ramineni
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
Avinash Ramineni
 
Event Driven Architectures
Event Driven ArchitecturesEvent Driven Architectures
Event Driven ArchitecturesAvinash Ramineni
 

More from Avinash Ramineni (10)

Simplifying the data privacy governance quagmire building automated privacy ...
Simplifying the data privacy governance quagmire  building automated privacy ...Simplifying the data privacy governance quagmire  building automated privacy ...
Simplifying the data privacy governance quagmire building automated privacy ...
 
Winning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeWinning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscape
 
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Event Driven Architectures
Event Driven ArchitecturesEvent Driven Architectures
Event Driven Architectures
 

Recently uploaded

一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 

Recently uploaded (20)

一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 

Building zero data loss pipelines with apache kafka

  • 1. www.clairvoyantsoft.com Building Zero Data loss pipelines with Apache Kafka By: Avinash Ramineni
  • 3. | 3 About Background Awards & Recognition Boutique consulting firm centered on building data solutions and products All things Web and Data Engineering, Analytics, ML and User Experience to bring it all together Support core Hadoop platform, data engineering pipelines and provide administrative and devops expertise focused on Hadoop
  • 4. | 4 ● Introduction and terminologies ● How can we lose data ? ● Zero Data Loss Pipelines ○ Producer ○ Kafka Cluster ○ Consumer ● Monitoring for Data Loss ● Summary Agenda
  • 5. | 5 ● An open source distributed stream processing platform ○ High Throughput ○ Scalable ○ Low Latency ○ Real-time ○ Fault Tolerant ● Messages are organized as topics ● Producers push messages ● Consumers pull messages ● Kafka doesn't have message acknowledgments, it assumes the consumer keeps tracks of what's been consumed so far. Kafka - Basics
  • 6. | 6 ● Each topic has multiple partitions ● Unit of parallelism ● IDs are unique for a partition for a topic ● Makes sure partitions within the same topic are roughly the same size ● Too many partitions ?. Kafka - Partitions
  • 7. | 7 Consuming from a Kafka Topic Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
  • 8. | 8 ● Replication factor ○ Unit of replication is partition ○ Producer ACk ● A partition is owned by a single broker in the cluster, and that broker is called the leader for the partition. ● Producers and Consumers are only served by Leaders ● A Zookeeper cluster is called an “ensemble”. ● Spreads leader partitions evenly on brokers throughout the cluster Kafka - Cluster
  • 9. | 9 Kafka - Cluster
  • 10. | 10 ● Any Kafka client (a producer or consumer) communicates only with the leader partition for data ○ All other partitions exist for redundancy and failover. ○ Follower partitions are responsible for copying new records from their leader partitions. ○ Follower partitions have an exact copy of the contents of the leader. Such partitions are called in-sync replicas (ISR). Kafka - ISR (In Sync Replica)
  • 11. | 11 ● Failures Happen ● Systems need to be designed to tolerate failure ● Expect failures and design systems to handle them Distributed Systems
  • 12. | 12 ● Producer ○ Acks =0, 1, all ○ Batch size, linger time ○ Broker is down ○ Block.on.buffer.full = false ● Kafka Cluster ○ Disk writes/flush are asynchronous ■ Hardware crashes ○ Metadata corruption ○ Kafka Bugs ● Consumer ○ Offset management ■ Offsets are committed before processing the messages completely How can we lose data ?
  • 13. | 13 ● At-least once delivery semantics ● Out-of-order delivery semantics ● Kafka Cluster ● Each component in the pipeline makes sure that they are not loosing the messages ○ Producer ■ Takes the responsibility of making sure that messages are successfully delivered to Kafka Brokers ○ Kafka Cluster ■ Reliable and Resilient ○ Consumers ■ Takes the responsibility for consuming all messages from Kafka Cluster with the understanding ● Messages can be delivered more than once and have idempotency logic ● Operates with understanding of retention policy of the brokers, number of partitions for the topics, etc. ● Manages check pointing and committing offsets Zero Data loss pipelines
  • 14. | 14 ● Handle cases where broker is unavailable / down / network issue ○ Local Spooling ○ Low Latency store ○ Backup cluster ● Ack = all ? ○ Consistency Vs Availability ■ Min ISR < = R ● (ex: replication factor = 3 min isr =2) ● Latency Vs Throughput ○ Batch size and linger time ● Standardardized on Message schema ○ Sequence Id, Timestamp, originating service Producers
  • 15. | 15 ● Configure broker to make sure that no messages are lost at the broker end and can survive hardware failures using below aspects ○ Partitions ■ Spread leader partitions evenly on brokers throughout the cluster ● Balances the load ■ Partition replication ● Replication Factor (R < = size of the cluster) (network congestion?) ● Sufficient replication sets to preserve data ● Rack awareness ○ Data Retention ■ Too much retention can be a problem ■ Leader election scenarios (unclean leader election) ○ Zookeeper availability ● DR / Backup ○ Mirror Maker to replicate the messages to another cluster Kafka Cluster
  • 16. | 16 ● Failure Detection ○ Reassign replications on the dead brokers ○ Monitor for under replicated partitions ● Kafka Cluster sizing ○ Disk ○ Network Kafka Cluster
  • 17. | 17 Kafka Sizing Recommended approach: Simulate load using load generation tools that ship with Kafka Simplest: size based on disk-space (data ingest rate * data retention) A little better: disk and network throughput requirements Example: 1 TB/day —> 277MB/s. Replication 3 , Consumers 10 ● Disk throughput = ingest * replication *2 (for lagging readers) = 277 * 3 * 2 = 1662 MB/s ● Network read throughput = ingest * (replication -1+ consumers)=277 *(3-1+10) = 3323 MB/s ● Network write throughput = ingest rate * replication = 277 * 3 = 831 MB /s Network (e.g. 10 GBE): (3323+831)/1250 = 3.3 nodes Disk (e.g. 6 drives & 70MB/s): 1662 /(6*70) = 3.95 nodes Plan for double, so 8 kafka nodes
  • 18. | 18 ● Understand the data retention period and design accordingly (need to keep up with the producers) ○ Parallelism = number of partitions ■ Max consumers = partitions ○ Proper checkpointing and offset management ■ Set Autocommit disabled and commit offset after processing the message ■ Commit less often and Asynchronously ■ Checkpoint frequency ○ Expect at least once delivery of a message and have idempotency logic Consumers
  • 19. | 19 ● QC Check Topic ○ Message sequence Id ○ Message counts per time bucket (10 mins? ) ○ Reconcile on the number of messages produced and number of messages consumed numbers ● Kafka Audit System ○ Chaperone ■ https://github.com/uber/chaperone ○ Kafka Monitor ● Continuously monitor for issues ○ Monitor for producer errors - ■ number of retries and counts in retry database ○ Monitor Consumer Lag, fetch rate, fetch latency, records per request ○ Monitor for workload skews ● Capture Metrics Monitor for Data Loss
  • 20. | 20 ● Guaranteeing zero data loss is not just Kafka’s problem ● Zero data loss pipelines require operations (Kafka cluster) and development teams (Producer / consumer) working together ● Anticipate failures and design code to handle ● Gracefully Shutdown your application Summary