This document provides an overview of Apache Kafka. It begins with defining Kafka as a distributed streaming platform and messaging system. It then lists the agenda which includes what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. Core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
Apache Kafka is a distributed publish-subscribe messaging system that allows for high throughput, low latency data ingestion and distribution. It provides reliability through replication, scalability by partitioning topics across brokers, and durability by persisting messages to disk. Common uses of Kafka include metrics collection, log aggregation, and stream processing using frameworks like Spark Streaming. Kafka's architecture includes brokers that store topics which are partitions distributed across a cluster, with ZooKeeper for coordination. Producers write messages to topics and consumers read messages in a subscriber model.
Apache Kafka Fundamentals for Architects, Admins and Developers, by confluent
This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.
Apache Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming apps. Producers write data to topics which are committed to disks across partitions and replicated for fault tolerance. Consumers read data from topics in a decoupled manner based on offsets. Kafka can process streaming data in real-time and at large volumes with low latency and high throughput.
Kafka is a distributed publish-subscribe messaging system that allows both streaming and storage of data feeds. It is designed to be fast, scalable, durable, and fault-tolerant. Kafka maintains feeds of messages called topics that can be published to by producers and subscribed to by consumers. A Kafka cluster typically runs on multiple servers called brokers that store topics which may be partitioned and replicated for fault tolerance. Producers publish messages to topics which are distributed to consumers through consumer groups that balance load.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Kafka is an open-source message broker that provides high-throughput and low-latency data processing. It uses a distributed commit log to store messages in categories called topics. Processes that publish messages are producers, while processes that subscribe to topics are consumers. Consumers can belong to consumer groups for parallel processing. Kafka guarantees order and no lost messages. It uses Zookeeper for metadata and coordination.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
Apache Kafka is a fast, scalable, durable and distributed messaging system. It is designed for high throughput systems and can replace traditional message brokers. Kafka has better throughput, partitioning, replication and fault tolerance compared to other messaging systems, making it suitable for large-scale applications. Kafka persists all data to disk for reliability and uses distributed commit logs for durability.
A brief introduction to Apache Kafka and describe its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that will help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic-partition counts, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control on who can read from and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs and monitor consumer offsets.
Kafka Tutorial - Introduction to Apache Kafka (Part 1), by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through some simple Java client examples for a Kafka producer and a Kafka consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka producers.
Producer Performance Tuning for Apache Kafka, by Jiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
This document provides an introduction to Apache Kafka. It describes Kafka as a distributed messaging system with features like durability, scalability, publish-subscribe capabilities, and ordering. It discusses key Kafka concepts like producers, consumers, topics, partitions and brokers. It also summarizes use cases for Kafka and how to implement producers and consumers in code. Finally, it briefly outlines related tools like Kafka Connect and Kafka Streams that build upon the Kafka platform.
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree, by Slim Baltagi
Kafka as a streaming data platform is becoming the successor to traditional messaging systems such as RabbitMQ. Nevertheless, there are still some use cases where they could be a good fit. This one single slide tries to answer in a concise and unbiased way where to use Apache Kafka and where to use RabbitMQ. Your comments and feedback are much appreciated.
Hello, Kafka! (An Introduction to Apache Kafka), by Timothy Spann
Hello Apache Kafka
An introduction to Apache Kafka with Timothy Spann and Carolyn Duby, Cloudera Principal Engineers.
We also demo Flink SQL, SMM, SSB, Schema Registry, Apache Kafka, Apache NiFi and Public Cloud - AWS.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale: its caveats, its guarantees and the use cases it offers.
How we use it @ZaprMediaLabs.
Apache Kafka is a distributed publish-subscribe messaging system that allows for high-throughput, persistent storage of messages. It provides decoupling of data pipelines by allowing producers to write messages to topics that can then be read from by multiple consumer applications in a scalable, fault-tolerant way. Key aspects of Kafka include topics for categorizing messages, partitions for scaling and parallelism, replication for redundancy, and producers and consumers for writing and reading messages.
This document provides an introduction to Apache Kafka, an open-source distributed event streaming platform. It discusses Kafka's history as a project originally developed by LinkedIn, its use cases like messaging, activity tracking and stream processing. It describes key Kafka concepts like topics, partitions, offsets, replicas, brokers and producers/consumers. It also gives examples of how companies like Netflix, Uber and LinkedIn use Kafka in their applications and provides a comparison to Apache Spark.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across clusters of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages - producers and consumers of information.
ksqlDB is a stream processing SQL engine that allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing them in near real time with a SQL-like language, and producing results back to a Kafka topic. Not a single line of Java code has to be written, and you can reuse your SQL know-how. This significantly lowers the bar for getting started with stream processing.
ksqlDB offers powerful stream processing capabilities, such as joins, aggregations, time windows and support for event time. In this talk I will present how ksqlDB integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo on a fictitious IoT sample.
Full recorded presentation at https://www.youtube.com/watch?v=2UfAgCSKPZo for Tetrate Tech Talks on 2022/05/13.
Envoy's support for the Kafka protocol, in the form of the broker-filter and the mesh-filter.
Contents:
- overview of Kafka (use cases, partitioning, producer/consumer, protocol);
- proxying Kafka (non-Envoy specific);
- proxying Kafka with Envoy;
- handling Kafka protocol in Envoy;
- Kafka-broker-filter for per-connection proxying;
- Kafka-mesh-filter to provide front proxy for multiple Kafka clusters.
References:
- https://adam-kotwasinski.medium.com/deploying-envoy-and-kafka-8aa7513ec0a0
- https://adam-kotwasinski.medium.com/kafka-mesh-filter-in-envoy-a70b3aefcdef
Apache Kafka is a distributed publish-subscribe messaging system that can handle high volumes of data and enable messages to be passed from one endpoint to another. It uses a distributed commit log that allows messages to be persisted on disk for durability. Kafka is fast, scalable, fault-tolerant, and guarantees zero data loss. It is used by companies like LinkedIn, Twitter, and Netflix to handle high volumes of real-time data and streaming workloads.
Building Zero Data Loss Pipelines with Apache Kafka, by Avinash Ramineni
Kafka is playing an increasingly important role in messaging and streaming systems and is becoming the de facto messaging platform in many enterprises. Managing and maintaining Kafka deployments and tuning the data pipelines for high performance and scalability can become a challenging task.
In this session, we will discuss the lessons learned and the best practices for achieving zero data loss pipelines.
This document provides an overview of Apache Kafka. It describes Kafka as a distributed publish-subscribe messaging system with a distributed commit log that provides high-throughput and low-latency processing of streaming data. The document covers Kafka concepts like topics, partitions, producers, consumers, replication, and reliability guarantees. It also discusses Kafka architecture, performance optimizations, configuration parameters for durability and reliability, and use cases for activity tracking, messaging, metrics, and stream processing.
Fundamentals and Architecture of Apache Kafka, by Angelo Cesaro
This presentation explains Apache Kafka's architecture and internal design giving an overview of Kafka internal functions, including:
brokers, replication, partitions, producers, consumers, the commit log, and a comparison with traditional message queues.
This document discusses the evolution of Kafka clusters at AppsFlyer over time. The initial cluster had 4 brokers and handled hundreds of millions of messages with low partitioning and replication. A new cluster was designed with more brokers, replication across availability zones, and higher partitioning to support billions of messages. However, this led to issues like uneven leader distribution and failures. Various solutions were implemented like increasing brokers, splitting topics, and hardware upgrades. Ongoing testing and monitoring helped identify more problems and improvements around replication, partitioning, and automation. Key lessons learned included balancing replication and leaders, supporting dynamic changes, and thorough testing of failure scenarios.
This session goes through an understanding of Apache Kafka and its components, working through best practices to achieve a fault-tolerant system with high availability and consistency by tuning Kafka brokers and producers for the best results.
Apache Kafka is a distributed streaming platform that allows for publishing and subscribing to streams of records. It uses a broker system and partitions topics to allow for scaling and parallelism. LinkedIn's Camus is a MapReduce job that moves data from Kafka to HDFS in distributed fashion. It consists of three stages: setup, the MapReduce job, and cleanup.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015, by Monal Daxini
Keystone - Processing over Half a Trillion events per day with 8 million events & 17 GB per second peaks, and at-least once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka based pipeline and support Spark Streaming.
Presentation from the Kafka meetup on 13-SEP-2013, including some notes to clarify some slides. Enjoy!
Avi Levi
123avi@gmail.com
https://www.linkedin.com/in/leviavi/
This document provides an introduction and overview of Apache Kafka. It discusses Kafka's core concepts including producers, consumers, topics, partitions and brokers. It also covers how to install and run Kafka, producer and consumer configuration settings, and how data is distributed in a Kafka cluster. Examples of creating topics, producing and consuming messages are also included.
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective, by HostedbyConfluent
"As Apache Kafka gains widespread adoption, an increasing number of people face its pitfalls. Despite completing courses and reading documentation, many encounter hurdles navigating Kafka's subtle complexities.
Join us for an enlightening session led by the customer support team of Conduktor, where we engage daily with users grappling with Kafka's subtleties. We've observed recurring themes in user queries: What happens when a consumer group rebalances? What is an advertised listener? Why aren't my records displayed in chronological order when I consume them? How does retention work?
For all these questions, the answer is ""It depends"". In this talk, we aim to demystify these uncertainties by presenting nuanced scenarios for each query. That way you will be more confident on how your Kafka infrastructure works behind the scenes, and you'll be equipped to share this knowledge with your colleagues. By being aware of the most common misconceptions, you should be able to both speed up your own learning curve and also help others more effectively."
Apache Kafka is a distributed messaging system that provides fast, highly scalable messaging through a publish-subscribe model. It was built at LinkedIn as a central hub for messaging between systems and focuses on scalability and fault tolerance. Kafka uses a distributed commit log architecture with topics that are partitioned for scalability and parallelism. It provides high throughput and fault tolerance through replication and an in-sync replica set.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This document provides an overview of structured streaming with Kafka in Spark. It discusses data collection vs ingestion and why they are key. It also covers Kafka architecture and terminology. It describes how Spark integrates with Kafka for streaming data sources. It explains checkpointing in structured streaming and using Kafka as a sink. The document discusses delivery semantics and how Spark supports exactly-once semantics with certain output stores. Finally, it outlines new features in Kafka for exactly-once guarantees and the future of structured streaming.
LinkedIn Stream Processing Meetup - Apache Pulsar, by Karthik Ramasamy
Apache Pulsar is a fast, highly scalable, and flexible pub/sub messaging system. It provides guaranteed message delivery, ordering, and durability by backing messages with a replicated log storage. Pulsar's architecture allows for independent scalability of brokers and storage nodes. It supports multi-tenancy, geo-replication, and high throughput of over 1.8 million messages per second in a single partition.
Timothy will introduce Apache Pulsar, an open-source distributed messaging and streaming platform. He will discuss how to build real-time applications using Pulsar with various libraries, schemas, languages, frameworks and tools. The presentation will cover what Pulsar is, its functions and components, how it compares to other technologies like Apache Kafka, its advantages, and how to integrate it with tools like Apache Flink, Apache Spark, Apache NiFi and more. A demo and Q&A will follow.
Uber has one of the largest Kafka deployments in the industry. To improve scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers and consumers. Users do not need to know which cluster a topic resides on; clients view a "logical cluster". The federation layer maps clients to the actual physical clusters and keeps the location of the physical cluster transparent to the user. Cluster federation brings several benefits that support our business growth and ease our daily operation.
Client control: inside Uber there is a large number of applications and clients on Kafka, and it is challenging to migrate a topic with live consumers between clusters. Coordination with the users is usually needed to shift their traffic to the migrated cluster. Cluster federation enables much more control of the clients from the server side, by redirecting consumer traffic to another physical cluster without restarting the application.
Scalability: with federation, the Kafka service can horizontally scale by adding more clusters when a cluster is full. Topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, from the user perspective they view only one logical cluster.
Availability: with a topic replicated to at least two clusters, we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region failover. This also provides much freedom and alleviates the risks of carrying out important maintenance on a critical cluster: before the maintenance, we mark the cluster as secondary and migrate off the live traffic and consumers.
We will present the details of the architecture and several interesting technical challenges we overcame.
This document provides an introduction to Apache Kafka. It discusses why Kafka is needed for real-time streaming data processing and real-time analytics. It also outlines some of Kafka's key features like scalability, reliability, replication, and fault tolerance. The document summarizes common use cases for Kafka and examples of large companies that use it. Finally, it describes Kafka's core architecture including topics, partitions, producers, consumers, and how it integrates with Zookeeper.
1. Apache Kafka
2. Agenda
1. What is Kafka?
2. Why Kafka?
3. Kafka Use Cases
4. Who Uses Kafka?
5. Why is Kafka So Fast?
6. Kafka Core Concepts (Theory)
7. Kafka CLI 101
3. What is Kafka?
At the beginning …
“... a publish/subscribe messaging system ...”
4. What is Kafka?
... today …
“... a stream data platform ...”
5. What is Kafka?
... but at the core …
“... a distributed, horizontally-scalable, fault-tolerant ...”
6. What is Kafka?
● Developed at LinkedIn back in 2010, open-sourced in 2011
● Designed to be fast, scalable, durable and available
● Used to decouple data streams and systems
● Distributed by nature (cluster)
● Resilient architecture
● Fault tolerant
● High throughput / low latency
● Able to handle a huge number of consumers
7. Why Kafka?
● Great performance (low latency, < 10 ms)
● Horizontally scalable (can add more nodes to the cluster)
● Fault-tolerant storage
○ Replicates topic log partitions to multiple servers
● Stable, reliable, durable
● Robust replication (no data loss)
10. Kafka Use Cases
● Messaging
○ As a “traditional” messaging system
● Website activity tracking
○ Events like page views and searches
● Metrics collection and monitoring
○ Alerting and reporting on operational metrics
● Log aggregation
○ Collecting logs from multiple services
● Stream processing
○ Reading, processing and writing streams for real-time analysis
11. Who Uses Kafka?
● LinkedIn uses Kafka to monitor activity data and operational metrics
● Uber uses Kafka to gather user, taxi and trip data in real time to compute and forecast surge pricing in real time
● Netflix uses Kafka to serve recommendations in real time while you’re watching a TV show
12. Why is Kafka So Fast?
● Zero copy - calls the OS kernel directly to move data without copying it through the application
● Batches data in chunks (see the producer sketch after this list)
○ End to end, from producer to file system to consumer, which minimises cross-machine latency
○ Enables more efficient data compression, reducing I/O latency
● Sequential disk writes - avoids random disk access
○ Writes to an immutable commit log: no slow disk seeks, no random I/O operations
○ The disk is accessed in a sequential manner
● Horizontal scale - uses hundreds to thousands of partitions for a single topic
○ Spread out across thousands of servers
○ Handles massive load
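As a rough illustration of the batching and compression knobs mentioned above, here is a minimal Java producer sketch. The broker address and topic name are placeholders for a local test setup; batch.size, linger.ms and compression.type are standard producer settings, but the values shown are illustrative, not recommendations.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("batch.size", 65536);       // accumulate up to 64 KB per partition before sending
            props.put("linger.ms", 10);           // wait up to 10 ms so batches can fill
            props.put("compression.type", "lz4"); // compress whole batches, cutting I/O and bandwidth
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100; i++) {
                    // records headed for the same partition are grouped into one compressed batch
                    producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "event-" + i));
                }
            }
        }
    }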
14. Kafka Core Concept - Topic and Partitions
Topic:
● Similar to a table in a database (but without reference IDs)
● Each topic is identified by its name (unique key)
Partitions:
● A topic is split into partitions, and each partition is ordered
● Each message in a partition is assigned a sequential id called an offset
○ Starts from zero and increases to 1, 2, 3, ... and so on
● Ordering is only guaranteed within a partition of a topic
● Once data is written to a partition, it cannot be changed (immutability)
● Data is retained for a configurable period of time (the default is 7 days)
15. Kafka Core Concept - Topic and Partitions
For example, one topic (Topic A) with 3 partitions:
Partition 0: offsets 0 1 2 3 4 5
Partition 1: offsets 0 1 2 3
Partition 2: offsets 0 1 2 3 4 5 6 7
Writes are appended at the “new” end of each partition; the oldest records sit at offset 0.
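A sketch of how a topic like this could be created programmatically with the Java AdminClient (the kafka-topics command-line tool does the same job); the topic name, partition count and the replication factor of 1 are assumptions for a single-broker sandbox:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicA {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // "topic-a" with 3 partitions, replication factor 1 (fine for a local sandbox)
                NewTopic topic = new NewTopic("topic-a", 3, (short) 1);
                admin.createTopics(List.of(topic)).all().get(); // block until the broker confirms
            }
        }
    }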
16. Kafka Core Concept - Kafka Brokers
Kafka Brokers:
● A broker is a Kafka server that holds partitions of topics
● Each broker has an ID (a number)
● A Kafka cluster is composed of multiple brokers (servers)
● A topic consists of partitions that can be spread across multiple nodes of the cluster
● Connecting to one broker bootstraps the client to the entire cluster (bootstrap server)
● Start with at least 3 brokers; a cluster can grow to 10, 100, or 1,000 brokers if needed
19. Kafka Core Concept - Kafka Replication
Kafka replication factor, failover, ISR
● Kafka replicates topic partitions
○ Across multiple nodes in the cluster, for failover
● For a topic with replication factor N, Kafka can tolerate up to N - 1 server failures without losing data
○ For example, with 3 brokers and a replication factor of 3, up to 3 - 1 = 2 brokers can fail and your data is still not lost
○ The replication factor determines how many brokers each partition is replicated to
20. Kafka Core Concept - Kafka Replication
For example, in a Kafka cluster, Topic A with 2 partitions and a replication factor of 2:
Broker 1: Topic A, Partition 0
Broker 2: Topic A, Partition 1, plus a replica of Partition 0
Broker 3: a replica of Topic A, Partition 1
Each partition is replicated to one other broker.
22. Kafka Core Concept - Leader for Partition
Leader for Partition
● Each partition in a topic has 1 leader and 0 or more replicas
● At any time, only one broker can be the leader for a given partition
● Only the leader can receive and serve data for a partition
● The other brokers synchronize the data (followers)
○ The group of in-sync replicas for a partition is called the ISR (in-sync replicas)
● Therefore each partition is going to have one leader and multiple in-sync replicas
● Kafka replication exists for failover
○ If one broker goes down, another broker (in the ISR) can serve the data
23. Kafka Core Concept - Leader in Partition
Continuing the Topic A example: Broker 1 holds Partition 0 (leader); Broker 2 holds Partition 1 (leader) and a follower copy of Partition 0 (in the ISR); Broker 3 holds a follower copy of Partition 1 (in the ISR). Followers replicate from their leaders.
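To make leaders and ISRs visible, here is a small AdminClient sketch that describes a topic and prints the leader broker and the in-sync replicas for each partition. It assumes the placeholder topic-a and local broker from the earlier sketches:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ShowLeadersAndIsr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc =
                    admin.describeTopics(List.of("topic-a")).all().get().get("topic-a");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // leader() is the broker currently serving reads/writes for this partition;
                    // isr() lists the replicas that are fully caught up with the leader
                    System.out.printf("partition %d: leader=%s isr=%s%n",
                                      p.partition(), p.leader(), p.isr());
                }
            }
        }
    }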
27. Kafka Core Concept - Producers
Producers
● Producers write data to topics (which are made of partitions)
● The load is balanced across many brokers
For example, a producer sending data to Topic A writes to Partition 0 on Broker 1 and to Partition 1 on Broker 2 (whose replica lives on Broker 3); each write is appended at the next offset of its partition.
28. Kafka Core Concept - Producers
Durable Writes
● Durability can be configured with the producer configuration (see the sketch after this list)
○ acks=0 : the producer never waits for an ack (possible data loss)
○ acks=1 : the producer gets an ack after the leader has received the data (limited data loss)
○ acks=all : the producer gets an ack after all ISRs receive the data (no data loss)
● Producers can trade off between throughput and durability of writes
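A minimal sketch of a durable write with the Java producer, against the same placeholder broker and topic. acks=all is the slide's "no data loss" setting, and blocking on the returned Future trades throughput for the durability guarantee:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class DurableProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all"); // ack only after all in-sync replicas have the record
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // block until the broker acknowledges; get() throws if the write failed
                RecordMetadata md =
                    producer.send(new ProducerRecord<>("topic-a", "key", "value")).get();
                // the broker reports where the record landed
                System.out.printf("acked at partition=%d offset=%d%n", md.partition(), md.offset());
            }
        }
    }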
29. Kafka Core Concept - Consumer
Consumer
● Consumers read data from a topic
● Data is read in order within each partition
● Messages stay in Kafka; they are not removed after they are consumed
For example, a consumer reads a, e, i, k in order from one partition, c, g from another, and b, d, f, h, j, l, m, n from a third.
30. Kafka Core Concept - Consumer Groups
Consumer Groups
● Consumers can be organised into consumer groups (see the sketch below)
● If you have more consumers than partitions, some consumers will be inactive
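A sketch of a consumer joining a group, again with placeholder broker, topic and group names. Starting several copies of this program with the same group.id splits topic-a's partitions among them; with 3 partitions, a fourth copy would sit idle:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "demo-group"); // instances sharing this id share the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("topic-a"));
                while (true) {
                    // poll returns whatever records are available on this instance's partitions
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          r.partition(), r.offset(), r.value());
                    }
                }
            }
        }
    }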
31. Kafka Core Concept - Consumer Offsets
Consumer Offsets
● Kafka stores the offsets at which a consumer group has been reading (checkpointing / bookmarking)
● The committed offsets are stored in a Kafka topic named “__consumer_offsets”
● When a consumer in a group has processed data received from Kafka, it should commit the offsets (to “__consumer_offsets”)
● If a consumer dies, it will be able to read back from where it left off
32. Kafka Core Concept - Message Delivery Semantics
Delivery Semantics for Consumers
● Consumers choose when to commit offsets (compare the commit placement in the sketch below)
● There are 3 delivery semantics
○ At most once:
■ Read message, commit offset, process message
■ Messages may be lost but are never redelivered
○ At least once: (usually preferred)
■ Read message, process message, commit offset
■ Messages are never lost but may be redelivered
○ Exactly once:
■ Each message is delivered once and only once
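The difference between at-most-once and at-least-once comes down to where the commit sits relative to processing. A hedged sketch with manual commits (enable.auto.commit=false); the process() helper is hypothetical business logic, and broker, topic and group names are the same placeholders as before:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "demo-group");
            props.put("enable.auto.commit", "false"); // we decide when offsets are committed
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("topic-a"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    // At-most-once would call consumer.commitSync() HERE, before processing:
                    // a crash during processing would then lose those records.
                    for (ConsumerRecord<String, String> r : records) {
                        process(r); // a crash here means uncommitted records get redelivered
                    }
                    consumer.commitSync(); // at-least-once: commit only after processing
                }
            }
        }

        static void process(ConsumerRecord<String, String> r) { // hypothetical business logic
            System.out.println(r.value());
        }
    }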
33. Kafka Core Concept - Zookeeper
Zookeeper
● Manages brokers (keeps a list of them)
● Helps with leader election for Kafka brokers and topic-partition pairs
● Manages service discovery for the Kafka brokers that form the cluster
● Sends notifications to Kafka in case of changes
○ a new broker joins,
○ a broker dies,
○ a topic is removed,
○ a topic is added, etc.
● Kafka cannot work without Zookeeper
34. Kafka CLI - 101
● Kafka Topics CLI
● Kafka Console Producer CLI
● Kafka Console Consumer CLI
● Kafka Consumer Groups CLI
● Resetting Offsets
● CLI Options that are good to know
● Let’s play with Kafka in action
Hi, good morning everyone. Today I will present Apache Kafka.
We will start with the first topic.
We will focus on the last two topics.
LinkedIn had around 300 million users generating events every day, which sometimes led to data-loss problems.
That is how Kafka came about: it was designed to handle data at scale, to guarantee that data reaches its consumers, and to run as a distributed system.
Storage is distributed across the cluster.
Resilient architecture: for example, we can add or remove consumers at any time and Kafka will rebalance the load.
Fault tolerant: the data stays durable.
Horizontally scalable: more machines (nodes) can be added to the cluster.
Fast: latency is under 10 ms.
Fault tolerant - parts of the data are backed up on several different servers in the cluster.
Robust, because records written to a Kafka server are persisted to disk and replicated to other servers.
This is where Kafka comes into the picture.
Kafka decouples/separates data from systems.
Kafka is really good at moving your data, because Kafka is really fast.
Messaging - message queue
We use Kafka for messaging more than other tools, probably because of its replication, built-in partitioning and fault tolerance compared with traditional messaging systems such as RabbitMQ.
Kafka is often used instead of message queues such as RabbitMQ because of its high throughput, reliability, replication and fault tolerance.
Stream processing
Because Kafka is a real-time publish-subscribe messaging system, people usually use Kafka for real-time processing and monitoring systems.
All these companies use Kafka so they can make real-time recommendations; real-time decisions give them real-time insight into their users.
Kafka compresses and batches your data to fit your bandwidth.
For example, suppose your network bandwidth is 10 MB but your data is 100 MB.
It is more efficient to send ten batches of 10 MB than 100 MB in one go, reducing I/O latency.
Sequential disk access is faster than random disk access.
As you can see, it’s not that different. But still, sequential memory access is faster than sequential disk access, so why not choose memory? Because Kafka runs on top of the JVM, which gives us two disadvantages.
The memory overhead of objects is very high, often doubling the size of the stored data (or even more).
Garbage collection happens every now and then, so creating objects in memory is very expensive: as in-heap data increases, more time is needed to collect unused data (garbage), and GC runs more frequently.
Kafka runs on top of the JVM; if we wrote data into memory directly, the memory overhead would be high and GC would happen frequently, so MMAP is used to avoid this issue.
MMAP maps the file contents from disk into memory.
Producers get data from source systems and send it to Kafka
Consumers consume data from Kafka and send it to target systems
Zookeeper manages which replica should be read from and where each consumer should continue reading
Zookeeper is used to manage the Kafka servers
Kafka client -> broker_1: establish connection + metadata request
broker_1 -> Kafka client: return a list of all brokers
Kafka client -> broker_3: the Kafka client can now connect to whichever broker it needs (see the sketch below)
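A small sketch of this bootstrap flow using Kafka's Java AdminClient: connect to one broker, request metadata, and discover the rest of the cluster. The single localhost:9092 address is an assumption:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class BootstrapSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One reachable broker is enough; the client learns about the others from metadata.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeCluster().nodes().get()
                 .forEach(node -> System.out.println("broker: " + node));
        }
    }
}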
We have to understand topics and partitions before we dive deeper into Kafka. Topics and partitions are how Kafka organizes messages.
Topic - topics are broken up into ordered commit logs called partitions.
Imagine Kafka holding a huge amount of data: how do we know which messages to get? We need a unique key to find them, and that key is the topic (you can think of a topic as a unique key in a database).
Partitions - as we know, a Kafka server/broker stores the data, but sometimes the data is so big that a single computer cannot hold it.
So the idea is to divide the data into partitions and keep them on different machines (distributed data).
Data already written to a topic cannot be modified (immutability).
Data is deleted after the retention time we configure (default 604800000 ms, i.e. 7 days), whether or not it has been consumed, to clear space in storage. Data in Kafka has a time limit.
Offset: a record in a partition has an offset associated with it. Think of it like this: a partition is like an array; offsets are like indexes.
Order is guaranteed only within a partition (not across partitions)
Offsets start from zero and increase 1, 2, 3, ... and so on
Look at partition 0 -> the latest offset is 4, so the next one will be offset 5
An offset just specifies a position within a partition
At first there is no value; when the first record arrives, its offset is 0, and the count keeps increasing in order as more records arrive
Records are ordered within their own partition, which means offset 0 of partition 0 may come before or after offset 0 of partition 1 (the seek sketch below shows offsets used as indexes)
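A hedged Java sketch of the "partition is an array, offset is an index" idea: assign partition 0 directly and seek to offset 5, matching the slide's example. The broker address and topic name reuse the CLI examples later in this deck:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("first_topic", 0);
            consumer.assign(Collections.singletonList(p0)); // read this partition directly
            consumer.seek(p0, 5);                           // jump to "index" 5 of the log
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}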
Bootstrap server - each broker knows about all brokers, all topics, and all partitions (metadata)
Topic 1 has 3 partitions
Topic 2 has 2 partitions
Data is broken up into partitions and distributed to different brokers/machines
When you create a topic, Kafka automatically assigns the topic and distributes it across all your brokers
Multiple brokers in a single group are called a Kafka cluster
Kafka is a distributed system
Replication means that if one broker goes down, things will keep working
Each topic replicates its partitions from the leader to some number of servers/brokers (called in-sync replication: when data reaches the leader, it is copied to the replicas right away). There should be more than one copy, usually 2-3, and each partition has exactly one leader. This property is what gives Apache Kafka its fault tolerance: if one broker dies, Kafka can still read and write data on a replica by promoting that replica to leader.
If any one broker dies, we still have two others to do the work, but if more than one dies we are out of luck. That is why choosing the replication factor matters: the higher it is, the safer you are. With a replication factor of n you can tolerate n-1 dead servers. The trade-off is that it costs storage space, so estimate it from your own risk.
You have two copies of each piece of data
If broker 2 goes down, topic A will not be lost
Replicas allow us to ensure that data will not be lost
Only the leader can receive and serve data for a partition
>> In other words, the leader is the replica that consumers/producers use
to read and write data in the partition when exchanging messages
The other brokers will synchronize the data
>> The other replicas are called followers; they don’t serve client requests
They replicate messages from the leader, forming the “in-sync replicas” (ISR)
Who decides the leader and followers?
Answer: Zookeeper
❖ Kafka Replication is for Failover
❖ Mirror Maker is used for Disaster Recovery
❖ Mirror Maker replicates a Kafka cluster to another data-center or AWS region
❖ It is called mirroring, rather than replication, because “replication” refers to copying within a single cluster
As I mentioned before, a single topic partition has one leader, and the others are followers
Broker 1 is the leader
Broker 2 and broker 3 are followers
What happens if broker 1 goes down?
Then broker 2 becomes the leader, because broker 2 was in the ISR
And then if broker 1 comes back,
it will try to become the leader again after replicating the data
Messages are appended to a topic-partition in the order they are sent
Basically, if a producer sends data without a key, the data is sent round-robin to broker 1, broker 2, broker 3
If the producer sends data with a key, Kafka hashes the key and uses the value to select a partition
Once a key goes to a partition, it goes there every time: a key always maps to the same partition
If we don’t set a key, messages are sent round-robin to the brokers hosting that topic’s partitions
If a key is set, then after the first produce, any later message with the same key is routed to the same broker/partition that key went to the first time
To decide which partition a brand-new key should live on, Kafka hashes the key and derives the partition from the hash value, as shown in the sketch below
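A minimal sketch of that key-to-partition mapping. Note the hedge: Kafka's default partitioner actually hashes the serialized key bytes with murmur2; Arrays.hashCode below is a stand-in to show the idea, not the real hash function:

import java.util.Arrays;

public class PartitionerSketch {
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative,
        // then map the hash onto one of the topic's partitions.
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes();
        // Same key + same partition count => same partition every time.
        System.out.println(partitionFor(key, 3));
        System.out.println(partitionFor(key, 3)); // identical output
    }
}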
Producers can choose to receive acknowledgement of data writes
Acknowledgement is a synonym for confirmation
There are 3 acknowledgement modes (see the producer sketch below):
acks=0 - just send the data, no acknowledgement
acks=1 - get an ack once only the leader has written the data (default)
acks=all - get an ack after all ISR (leader & replicas) have received the data
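A minimal Java producer sketch showing the acks setting, assuming a broker on localhost:9092 and the first_topic topic from the CLI examples later in this deck:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // "0" = fire and forget, "1" = leader only, "all" = wait for all in-sync replicas
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("first_topic", "key1", "hello"),
                (metadata, e) -> {
                    if (e == null) {
                        System.out.printf("acked: partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}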
Consumers read messages in the order stored in a topic-partition
Consumers
The main job of a consumer is to read data from partitions. It only needs to connect to some broker and specify a topic, and it will read the data regardless of which broker each partition lives on, just like on the producer side.
Order
To stress the reading order again: a consumer reads data in order within a single partition, but different partitions are read in parallel. So if data must be used in order, designing the message key for ordering from the start is very important.
If our system receives a lot of data from producers and we don’t have enough consumers, we just add more consumers.
We can add/remove consumers at any time, and Kafka will rebalance the load.
Consumer Groups
The problem: if our system produces a lot of data while the consumers we have are too few, we can simply add consumers so consumption speeds up, since they consume in parallel. One rule for consumers in a group: their number must not exceed the number of partitions in the topic the group is interested in; any extra consumers will sit idle.
From the same example, notice that no two consumers read the same partition, meaning that no matter how many consumers we add or remove, they will never read duplicate data.
Another great property: we can add and remove consumers at any time, and the group will re-balance by itself which consumer reads which partition. This is the resilience property.
You might wonder what happens with multiple groups (for different consumption purposes, e.g. a new service): where does each start reading within the same topic? The answer is that each group has its own counter (offset). If group 1 is running and we add group 2, group 2 starts reading from the beginning (configurable: from the beginning or from the current data); the counters are completely separate.
Two consumers in the same group cannot consume messages from the same partition at the same time. A consumer can consume from multiple partitions at the same time.
How does Kafka know which offset a consumer should read next?
The counter for the next offset of each consumer in a consumer group: committed offsets are stored in a topic named “__consumer_offsets” (in versions < 0.9, offsets were kept in Zookeeper).
“I died, and now I’m back alive. So now I can start at this offset and continue reading from there.”
At most once
Offsets are committed as soon as the message is received
If the processing goes wrong, the message will be lost (it won’t be read again)
At least once
Offsets are committed after the message is processed
You read the data, do something with it, and then commit the offset. If processing goes wrong, or your consumer goes down, the message will be read again (see the consumer sketch below).
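A minimal at-least-once consumer sketch in Java: auto-commit is disabled and offsets are committed only after processing, so a crash mid-batch means redelivery rather than loss. The broker address, topic, and group name reuse the CLI examples below; everything else is illustrative:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my_app");          // the consumer group
        props.put("enable.auto.commit", "false"); // we commit manually, after processing
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, nothing is committed and we re-read
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}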
- Manages the brokers: knows which broker is where, and whether it is dead or alive
- Records which topics exist or not, and how many partitions each topic has
- Elects the leader/replicas of each partition
- Sends a signal to Kafka on every change that happens, e.g. a new topic, a dead broker, or a new broker
- Records how much data each producer/consumer is allowed to write or read
- Stores authorization data: which users are allowed to create which topics
- Records how many consumers each consumer group has and, in old versions, which offset each has read up to
- Has its own quorum, usually an odd number of Zookeeper nodes (3, 5, 7, ...), because writes need consensus: more than half of the running Zookeeper nodes must confirm a write before the Zookeeper leader considers it truly committed
- Zookeeper has a leader (handles writes) and the rest of the servers are followers (handle reads)
- Consumers & producers don’t write to Zookeeper; they write to Kafka
- Kafka just manages all its metadata in Zookeeper
- Zookeeper does not store consumer offsets (in current versions)
Kafka requires Java version 8 (not 9, not 10)
Visualization tool -> Kafka Tool
Create topic
kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --create --replication-factor 1 --partitions 3
List topic
kafka-topics --zookeeper 127.0.0.1:2181 --list
Describe topic
kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --describe
Delete topic
kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --delete
Console Producer CLI - produce
kafka-console-producer --broker-list localhost:9092 --topic first_topic
Console Consumer CLI - consume
kafka-console-consumer --bootstrap-server localhost:9092 --topic first_topic
Console Consumer CLI - consume from beginning
kafka-console-consumer --bootstrap-server localhost:9092 --topic first_topic --from-beginning
Consumer Group
kafka-console-consumer --bootstrap-server localhost:9092 --topic topic_1 --group my_app
Consumer Group CLI – list
kafka-consumer-groups --bootstrap-server localhost:9092 --list
Consumer Group CLI – describe
kafka-consumer-groups --bootstrap-server localhost:9092 --group my_app --describe
Consumer Group CLI - reset offsets
kafka-consumer-groups --bootstrap-server localhost:9092 --group my_app --reset-offsets --to-earliest --topic topic_1 --execute
CLI Options
Produce topic with key
kafka-console-producer --broker-list localhost:9092 --topic first_topic --property parse.key=true --property key.separator=,
Consume with key
kafka-console-consumer --bootstrap-server localhost:9092 --topic first_topic --property print.key=true --property key.separator=,