SlideShare a Scribd company logo
Page 1Classification: Restricted
Hadoop Training
Kafka And Zookeeper
Page 2Classification: Restricted
Agenda
• Kafka Overview
• Need for Kafka
• Kafka Architecture
• Kafka Components
• ZooKeeper Overview
• Leader Node
Page 3Classification: Restricted
Apache Kafka was originated at LinkedIn and later became an open sourced
Apache project in 2011, then First-class Apache project in 2012. Kafka is
written in Scala and Java.
Apache Kafka is publish-subscribe based fault tolerant messaging
system. It is fast, scalable and distributed by design
Two types of messaging patterns are available - one is point to point and the
other is publish-subscribe (pub-sub) messaging system.
Most of the messaging patterns follow pub-sub.
What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system and a robust
queue that can handle a high volume of data
and enables you to pass messages from one end-point to another. Kafka is
suitable for both offline and online message
consumption. Kafka messages are persisted on the disk and replicated within the
cluster to prevent data loss.
Kafka is built on top of the ZooKeeper synchronization service
Kafka Overview
Page 4Classification: Restricted
Kafka is a unified platform for handling all the real-time data feeds.
Kafka supports low latency message delivery and gives guarantee for fault
tolerance in the presence of machine failures.
It has the ability to handle a large number of diverse consumers.
Kafka is very fast, performs 2 million writes/sec.
Topics
A stream of messages belonging to a particular category is called a topic.
Data is stored in topics.
Topics are split into partitions. For each topic, Kafka keeps a mini-mum of
one partition
Need for Kafka
Page 5Classification: Restricted
Kafka Architecture
For kafka zookeeper is needed
Page 6Classification: Restricted
In the above diagram, a topic is configured into three partitions. Partition 1
has two offset factors 0 and 1. Partition 2 has four offset factors 0, 1, 2,
and 3. Partition 3 has one offset factor 0. The id of the replica is same as
the id of the server that hosts it.Assume, if the replication factor of the
topic is set to 3, then Kafka will create 3 identical replicas of each partition
and place them in the cluster to make available for all its operations. To
balance a load in cluster, each broker stores one or more of those
partitions. Multiple producers and consumers can publish and retrieve
messages at the same time.
Kafka Architecture
Page 7Classification: Restricted
Partition
Topics may have many partitions, so it can handle an arbitrary amount of
data.
Replicas are nothing but backups of a partition.
Replicas are never read or write data.
They are used to prevent data loss.
Brokers are simple system responsible for maintaining the pub-lished data.
Each broker may have zero or more partitions per topic. Kafka cluster typically
consists of multiple brokers to maintain load balance
Kafka Cluster
Kafka’s having more than one broker are called as Kafka cluster
In very simple language :broker does the work of holding the data taken from
producers so that the consumer present anywhere in that cluster can have
the access/or pull to that message from the broker when needed
Kafka Components
Page 8Classification: Restricted
Producers
Producers are the publisher of messages to one or more Kafka topics.
Producers send data to Kafka brokers.
Every time a producer pub-lishes a message to a broker, the broker simply
appends the message to the last segment file.
Actually, the message will be appended to a partition.
Producer can also send messages to a partition of their choice
Assume, if there are N partitions in a topic and N number of brokers, each
broker will have one partition else accordingly accommodated
Consumers
Consumers read data from brokers. Consumers subscribes to one or more
topics and consume published messages by pulling data from the brokers.
Kafka Components
Page 9Classification: Restricted
Producers
Producers push data to brokers. When the new broker is started, all the
producers search it and automatically sends a message to that new broker.
Kafka producer doesn’t wait for acknowledgements from the broker and
sends messages as fast as the broker can handle.
Consumers
Since Kafka brokers are stateless, which means that the consumer has to
maintain how many messages have been consumed by using partition offset.
If the consumer acknowledges a particular message offset, it implies that the
consumer has consumed all prior messages. The consumer issues an
asynchronous pull request to the broker to have a buffer of bytes ready to
consume. The consumers can rewind or skip to any point in a partition simply
by supplying an offset value. Consumer offset value is notified by ZooKeeper
Kafka Components
Page 10Classification: Restricted
• Kafka is simply a collection of topics split into one or more partitions. A Kafka
partition is a linearly ordered sequence of messages, where each message is
identified by their index (called as offset). All the data in a Kafka cluster is the
disjointed union of partitions. Incoming messages are written at the end of a
partition and messages are sequentially read by consumers
• Following is the step wise workflow of the Pub-Sub Messaging −
• Producers send message to a topic at regular intervals.
• Kafka broker stores all messages in the partitions configured for that particular
topic. It ensures the messages are equally shared between partitions. If the
producer sends two messages and there are two partitions, Kafka will store one
message in the first partition and the second message in the second partition.
• Consumer subscribes to a specific topic.
• Once the consumer subscribes to a topic, Kafka will provide the current offset of
the topic to the consumer and also saves the offset in the Zookeeper ensemble.
• Consumer will request the Kafka in a regular interval (like 100 Ms) for new
messages.
Kafka Components
Page 11Classification: Restricted
• Once Kafka receives the messages from producers, it forwards these
messages to the consumers.
• Consumer will receive the message and process it.
• Once the messages are processed, consumer will send an
acknowledgement to the Kafka broker.
• Once Kafka receives an acknowledgement, it changes the offset to the new
value and updates it in the Zookeeper. Since offsets are maintained in the
Zookeeper, the consumer can read next message correctly even during
server outrages.
• This above flow will repeat until the consumer stops the request.
• Consumer has the option to rewind/skip to the desired offset of a topic at
any time and read all the subsequent messages.
Kafka Components
Page 12Classification: Restricted
• Role of ZooKeeper
• A critical dependency of Apache Kafka is Apache Zookeeper, which is a
distributed configuration and synchronization service. Zookeeper serves as
the coordination interface between the Kafka brokers and consumers. The
Kafka servers share information via a Zookeeper cluster. Kafka stores basic
metadata in Zookeeper such as information about topics, brokers,
consumer offsets (queue readers) and so on.
• Since all the critical information is stored in the Zookeeper and it normally
replicates this data across its ensemble, failure of Kafka broker / Zookeeper
does not affect the state of the Kafka cluster. Kafka will restore the state,
once the Zookeeper restarts.
Kafka Components
Page 13Classification: Restricted
Zookeeper provides a flexible coordination infrastructure for distributed
environment. ZooKeeper framework supports many of the today's best
industrial applications.
Assume that a Hadoop cluster bridges 100 or more commodity servers.
Therefore, there’s a need for coordination and naming services. As
computation of large number of nodes are involved, each node needs to
synchronize with each other, know where to access services, and know how
they should be configured. At this point of time, Hadoop clusters require
cross-node services. ZooKeeper provides the facilities for cross-node
synchronization and ensures the tasks across Hadoop projects are serialized
and synchronized.It is used to keep the distributed system functioning
together as a single unitZooKeeper is a distributed co-ordination service to
manage large set of hosts. Co-ordinating and managing a service in a
distributed environment is a complicated process.
ZooKeeper Overview
Page 14Classification: Restricted
ZooKeeper solves this issue with its simple architecture and API. ZooKeeper
allows developers to focus on core application logic without worrying about
the distributed nature of the application.The ZooKeeper framework was
originally built at “Yahoo!” for accessing their applications in an easy and
robust manner. Later, Apache ZooKeeper became a standard for organized
service used by Hadoop, HBase, and other distributed frameworks. For
example, Apache HBase uses ZooKeeper to track the status of distributed
data.
ZooKeeper Overview
Page 15Classification: Restricted
Let us analyze how a leader node can be elected in a ZooKeeper ensemble.
Consider there are N number of nodes in a cluster. The process of leader election
is as follows −
•All the nodes create a sequential, ephemeral znode with the same
path, /app/leader_election/guid_.
•ZooKeeper ensemble will append the 10-digit sequence number to the path and
the znode created will be/app/leader_election/guid_0000000001,
/app/leader_election/guid_0000000002, etc.
•For a given instance, the node which creates the smallest number in the znode
becomes the leader and all the other nodes are followers.
•Each follower node watches the znode having the next smallest number. For
example, the node which creates
znode/app/leader_election/guid_0000000008 will watch the
znode/app/leader_election/guid_0000000007 and the node which creates the
znode /app/leader_election/guid_0000000007 will watch the
znode /app/leader_election/guid_0000000006.
Leader Node
Page 16Classification: Restricted
•If the leader goes down, then its corresponding
znode/app/leader_electionN gets deleted.
•The next in line follower node will get the notification through watcher
about the leader removal.
•The next in line follower node will check if there are other znodes with the
smallest number. If none, then it will assume the role of the leader.
Otherwise, it finds the node which created the znode with the smallest
number as leader.
•Similarly, all other follower nodes elect the node which created the znode
with the smallest number as leader.
Leader Node
Page 17Classification: Restricted
Distributed applications are difficult to coordinate and work with as they are
much more error prone due to huge number of machines attached to
network. As many machines are involved, race condition and deadlocks are
common problems when implementing distributed applications. Race
condition occurs when a machine tries to perform two or more operations at a
time and this can be taken care by serialization property of ZooKeeper.
Deadlocks are when two or more machines try to access same shared
resource at the same time. More precisely they try to access each other’s
resources which leads to lock of system as none of the system is releasing the
resource but waiting for other system to release it. Synchronization in
Zookeeper helps to solve the deadlock. Another major issue with distributed
application can be partial failure of process, which can lead to inconsistency of
data. Zookeeper handles this through atomicity, which means either whole of
the process will finish or nothing will persist after failure. Thus Zookeeper is an
important part of Hadoop that take care of these small but important issues so
that developer can focus more on functionality of the application.
ZooKeeper
Page 18Classification: Restricted
THANK YOU!!

More Related Content

What's hot

Flume
FlumeFlume
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Steve Hoffman
 
Apache Flume (NG)
Apache Flume (NG)Apache Flume (NG)
Apache Flume (NG)
Alexander Alten
 
Dns
DnsDns
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
kawamuray
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...
kawamuray
 
Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11
Cloudera, Inc.
 
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
kawamuray
 
Chapter 06
Chapter 06Chapter 06
Chapter 06
cclay3
 
Apache HTTP Server
Apache HTTP ServerApache HTTP Server
Apache HTTP Server
Tan Huynh Cong
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
Cloudera, Inc.
 
7 understanding DNS
7 understanding DNS7 understanding DNS
7 understanding DNS
Hameda Hurmat
 
Http/2
Http/2Http/2
03 network services
03 network services03 network services
03 network services
Jadavsejal
 
M enrichment
M enrichmentM enrichment
M enrichment
Vasanthii Chowdary
 
Traffic Characterization
Traffic CharacterizationTraffic Characterization
Traffic Characterization
Ismail Mukiibi
 
Domain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTPDomain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTP
saurav kumar
 

What's hot (20)

Flume
FlumeFlume
Flume
 
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
 
Apache Flume (NG)
Apache Flume (NG)Apache Flume (NG)
Apache Flume (NG)
 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
Dns
DnsDns
Dns
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...
 
Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11
 
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
 
Chapter 06
Chapter 06Chapter 06
Chapter 06
 
Apache HTTP Server
Apache HTTP ServerApache HTTP Server
Apache HTTP Server
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
 
main
mainmain
main
 
7 understanding DNS
7 understanding DNS7 understanding DNS
7 understanding DNS
 
Http/2
Http/2Http/2
Http/2
 
About Http Connection
About Http ConnectionAbout Http Connection
About Http Connection
 
03 network services
03 network services03 network services
03 network services
 
M enrichment
M enrichmentM enrichment
M enrichment
 
Traffic Characterization
Traffic CharacterizationTraffic Characterization
Traffic Characterization
 
Domain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTPDomain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTP
 

Similar to Session 23 - Kafka and Zookeeper

Apache kafka
Apache kafkaApache kafka
Apache kafka
Viswanath J
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Apache kafkaApache kafka
Apache kafka
Jemin Patel
 
kafka_session_updated.pptx
kafka_session_updated.pptxkafka_session_updated.pptx
kafka_session_updated.pptx
Koiuyt1
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Ramakrishna kapa
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Srikrishna k
 
Kafka Fundamentals
Kafka FundamentalsKafka Fundamentals
Kafka Fundamentals
Ketan Keshri
 
Kafka tutorial
Kafka tutorialKafka tutorial
Kafka tutorial
Srikrishna k
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Srikrishna k
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdf
TarekHamdi8
 
Columbus mule soft_meetup_aug2021_Kafka_Integration
Columbus mule soft_meetup_aug2021_Kafka_IntegrationColumbus mule soft_meetup_aug2021_Kafka_Integration
Columbus mule soft_meetup_aug2021_Kafka_Integration
MuleSoft Meetup
 
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQCluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQShameera Rathnayaka
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams Presentation
Knoldus Inc.
 
Kafka pub sub demo
Kafka pub sub demoKafka pub sub demo
Kafka pub sub demo
Srish Kumar
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 
A Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka SkillsA Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka Skills
Ravindra kumar
 
Python Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuidePython Kafka Integration: Developers Guide
Python Kafka Integration: Developers Guide
Inexture Solutions
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
Knoldus Inc.
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Kumar Shivam
 

Similar to Session 23 - Kafka and Zookeeper (20)

Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
kafka_session_updated.pptx
kafka_session_updated.pptxkafka_session_updated.pptx
kafka_session_updated.pptx
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka Fundamentals
Kafka FundamentalsKafka Fundamentals
Kafka Fundamentals
 
Kafka tutorial
Kafka tutorialKafka tutorial
Kafka tutorial
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdf
 
Columbus mule soft_meetup_aug2021_Kafka_Integration
Columbus mule soft_meetup_aug2021_Kafka_IntegrationColumbus mule soft_meetup_aug2021_Kafka_Integration
Columbus mule soft_meetup_aug2021_Kafka_Integration
 
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQCluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams Presentation
 
Kafka pub sub demo
Kafka pub sub demoKafka pub sub demo
Kafka pub sub demo
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
A Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka SkillsA Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka Skills
 
Python Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuidePython Kafka Integration: Developers Guide
Python Kafka Integration: Developers Guide
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 

More from AnandMHadoop

Overview of Java
Overview of Java Overview of Java
Overview of Java
AnandMHadoop
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
AnandMHadoop
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
AnandMHadoop
 
Session 04 -Pig Continued
Session 04 -Pig ContinuedSession 04 -Pig Continued
Session 04 -Pig Continued
AnandMHadoop
 
Session 04 pig - slides
Session 04   pig - slidesSession 04   pig - slides
Session 04 pig - slides
AnandMHadoop
 
Session 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic CommandsSession 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic Commands
AnandMHadoop
 
Session 02 - Yarn Concepts
Session 02 - Yarn ConceptsSession 02 - Yarn Concepts
Session 02 - Yarn Concepts
AnandMHadoop
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
AnandMHadoop
 

More from AnandMHadoop (8)

Overview of Java
Overview of Java Overview of Java
Overview of Java
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
 
Session 04 -Pig Continued
Session 04 -Pig ContinuedSession 04 -Pig Continued
Session 04 -Pig Continued
 
Session 04 pig - slides
Session 04   pig - slidesSession 04   pig - slides
Session 04 pig - slides
 
Session 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic CommandsSession 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic Commands
 
Session 02 - Yarn Concepts
Session 02 - Yarn ConceptsSession 02 - Yarn Concepts
Session 02 - Yarn Concepts
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Session 23 - Kafka and Zookeeper

  • 1. Page 1Classification: Restricted Hadoop Training Kafka And Zookeeper
  • 2. Page 2Classification: Restricted Agenda • Kafka Overview • Need for Kafka • Kafka Architecture • Kafka Components • ZooKeeper Overview • Leader Node
  • 3. Page 3Classification: Restricted Apache Kafka was originated at LinkedIn and later became an open sourced Apache project in 2011, then First-class Apache project in 2012. Kafka is written in Scala and Java. Apache Kafka is publish-subscribe based fault tolerant messaging system. It is fast, scalable and distributed by design Two types of messaging patterns are available - one is point to point and the other is publish-subscribe (pub-sub) messaging system. Most of the messaging patterns follow pub-sub. What is Kafka? Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service Kafka Overview
  • 4. Page 4Classification: Restricted Kafka is a unified platform for handling all the real-time data feeds. Kafka supports low latency message delivery and gives guarantee for fault tolerance in the presence of machine failures. It has the ability to handle a large number of diverse consumers. Kafka is very fast, performs 2 million writes/sec. Topics A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics are split into partitions. For each topic, Kafka keeps a mini-mum of one partition Need for Kafka
  • 5. Page 5Classification: Restricted Kafka Architecture For kafka zookeeper is needed
  • 6. Page 6Classification: Restricted In the above diagram, a topic is configured into three partitions. Partition 1 has two offset factors 0 and 1. Partition 2 has four offset factors 0, 1, 2, and 3. Partition 3 has one offset factor 0. The id of the replica is same as the id of the server that hosts it.Assume, if the replication factor of the topic is set to 3, then Kafka will create 3 identical replicas of each partition and place them in the cluster to make available for all its operations. To balance a load in cluster, each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time. Kafka Architecture
  • 7. Page 7Classification: Restricted Partition Topics may have many partitions, so it can handle an arbitrary amount of data. Replicas are nothing but backups of a partition. Replicas are never read or write data. They are used to prevent data loss. Brokers are simple system responsible for maintaining the pub-lished data. Each broker may have zero or more partitions per topic. Kafka cluster typically consists of multiple brokers to maintain load balance Kafka Cluster Kafka’s having more than one broker are called as Kafka cluster In very simple language :broker does the work of holding the data taken from producers so that the consumer present anywhere in that cluster can have the access/or pull to that message from the broker when needed Kafka Components
  • 8. Page 8Classification: Restricted Producers Producers are the publisher of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer pub-lishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message will be appended to a partition. Producer can also send messages to a partition of their choice Assume, if there are N partitions in a topic and N number of brokers, each broker will have one partition else accordingly accommodated Consumers Consumers read data from brokers. Consumers subscribes to one or more topics and consume published messages by pulling data from the brokers. Kafka Components
  • 9. Page 9Classification: Restricted Producers Producers push data to brokers. When the new broker is started, all the producers search it and automatically sends a message to that new broker. Kafka producer doesn’t wait for acknowledgements from the broker and sends messages as fast as the broker can handle. Consumers Since Kafka brokers are stateless, which means that the consumer has to maintain how many messages have been consumed by using partition offset. If the consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages. The consumer issues an asynchronous pull request to the broker to have a buffer of bytes ready to consume. The consumers can rewind or skip to any point in a partition simply by supplying an offset value. Consumer offset value is notified by ZooKeeper Kafka Components
  • 10. Page 10Classification: Restricted • Kafka is simply a collection of topics split into one or more partitions. A Kafka partition is a linearly ordered sequence of messages, where each message is identified by their index (called as offset). All the data in a Kafka cluster is the disjointed union of partitions. Incoming messages are written at the end of a partition and messages are sequentially read by consumers • Following is the step wise workflow of the Pub-Sub Messaging − • Producers send message to a topic at regular intervals. • Kafka broker stores all messages in the partitions configured for that particular topic. It ensures the messages are equally shared between partitions. If the producer sends two messages and there are two partitions, Kafka will store one message in the first partition and the second message in the second partition. • Consumer subscribes to a specific topic. • Once the consumer subscribes to a topic, Kafka will provide the current offset of the topic to the consumer and also saves the offset in the Zookeeper ensemble. • Consumer will request the Kafka in a regular interval (like 100 Ms) for new messages. Kafka Components
  • 11. Page 11Classification: Restricted • Once Kafka receives the messages from producers, it forwards these messages to the consumers. • Consumer will receive the message and process it. • Once the messages are processed, consumer will send an acknowledgement to the Kafka broker. • Once Kafka receives an acknowledgement, it changes the offset to the new value and updates it in the Zookeeper. Since offsets are maintained in the Zookeeper, the consumer can read next message correctly even during server outrages. • This above flow will repeat until the consumer stops the request. • Consumer has the option to rewind/skip to the desired offset of a topic at any time and read all the subsequent messages. Kafka Components
  • 12. Page 12Classification: Restricted • Role of ZooKeeper • A critical dependency of Apache Kafka is Apache Zookeeper, which is a distributed configuration and synchronization service. Zookeeper serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers share information via a Zookeeper cluster. Kafka stores basic metadata in Zookeeper such as information about topics, brokers, consumer offsets (queue readers) and so on. • Since all the critical information is stored in the Zookeeper and it normally replicates this data across its ensemble, failure of Kafka broker / Zookeeper does not affect the state of the Kafka cluster. Kafka will restore the state, once the Zookeeper restarts. Kafka Components
  • 13. Page 13Classification: Restricted Zookeeper provides a flexible coordination infrastructure for distributed environment. ZooKeeper framework supports many of the today's best industrial applications. Assume that a Hadoop cluster bridges 100 or more commodity servers. Therefore, there’s a need for coordination and naming services. As computation of large number of nodes are involved, each node needs to synchronize with each other, know where to access services, and know how they should be configured. At this point of time, Hadoop clusters require cross-node services. ZooKeeper provides the facilities for cross-node synchronization and ensures the tasks across Hadoop projects are serialized and synchronized.It is used to keep the distributed system functioning together as a single unitZooKeeper is a distributed co-ordination service to manage large set of hosts. Co-ordinating and managing a service in a distributed environment is a complicated process. ZooKeeper Overview
  • 14. Page 14Classification: Restricted ZooKeeper solves this issue with its simple architecture and API. ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.The ZooKeeper framework was originally built at “Yahoo!” for accessing their applications in an easy and robust manner. Later, Apache ZooKeeper became a standard for organized service used by Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to track the status of distributed data. ZooKeeper Overview
  • 15. Page 15Classification: Restricted Let us analyze how a leader node can be elected in a ZooKeeper ensemble. Consider there are N number of nodes in a cluster. The process of leader election is as follows − •All the nodes create a sequential, ephemeral znode with the same path, /app/leader_election/guid_. •ZooKeeper ensemble will append the 10-digit sequence number to the path and the znode created will be/app/leader_election/guid_0000000001, /app/leader_election/guid_0000000002, etc. •For a given instance, the node which creates the smallest number in the znode becomes the leader and all the other nodes are followers. •Each follower node watches the znode having the next smallest number. For example, the node which creates znode/app/leader_election/guid_0000000008 will watch the znode/app/leader_election/guid_0000000007 and the node which creates the znode /app/leader_election/guid_0000000007 will watch the znode /app/leader_election/guid_0000000006. Leader Node
  • 16. Page 16Classification: Restricted •If the leader goes down, then its corresponding znode/app/leader_electionN gets deleted. •The next in line follower node will get the notification through watcher about the leader removal. •The next in line follower node will check if there are other znodes with the smallest number. If none, then it will assume the role of the leader. Otherwise, it finds the node which created the znode with the smallest number as leader. •Similarly, all other follower nodes elect the node which created the znode with the smallest number as leader. Leader Node
  • 17. Page 17Classification: Restricted Distributed applications are difficult to coordinate and work with as they are much more error prone due to huge number of machines attached to network. As many machines are involved, race condition and deadlocks are common problems when implementing distributed applications. Race condition occurs when a machine tries to perform two or more operations at a time and this can be taken care by serialization property of ZooKeeper. Deadlocks are when two or more machines try to access same shared resource at the same time. More precisely they try to access each other’s resources which leads to lock of system as none of the system is releasing the resource but waiting for other system to release it. Synchronization in Zookeeper helps to solve the deadlock. Another major issue with distributed application can be partial failure of process, which can lead to inconsistency of data. Zookeeper handles this through atomicity, which means either whole of the process will finish or nothing will persist after failure. Thus Zookeeper is an important part of Hadoop that take care of these small but important issues so that developer can focus more on functionality of the application. ZooKeeper