This document provides a quick reference to key concepts for working with Apache Kafka, including:
1. Out of the box, Kafka provides at-most-once or at-least-once delivery, and the guide explains how messages can be lost or duplicated; exactly-once semantics were introduced in later versions.
2. Using more partitions can increase unavailability when a broker fails uncleanly, since a leader election must occur for each affected partition, and can increase end-to-end latency because replication is serialized across partitions.
3. The Schema Registry helps enforce schemas when serializing data to Kafka with Avro, avoiding breakage from schema changes.
A Quick Guide to Refresh Kafka Skills
1. Limitations of Kafka
A. Out of the box, Kafka is an at-least-once or at-most-once delivery system, but not exactly-once.
At-most-once scenario: loss of messages.
At-least-once scenario: duplicate messages.
An at-most-once scenario happens when the commit interval has elapsed, which triggers Kafka to automatically commit the last used offset. Suppose the consumer crashes before it finishes processing the messages. When the consumer restarts, it receives messages from the last committed offset, so it can lose the few messages in between.

An at-least-once scenario happens when the consumer processes a message and writes it into its persistent store, then crashes at that point. Suppose Kafka did not get a chance to commit the offset to the broker because the commit interval has not passed. When the consumer restarts, it is delivered a few older messages from the last committed offset.

https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
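To get at-least-once behavior on purpose, the usual pattern is to disable auto-commit and commit offsets only after processing. A minimal sketch of that pattern; the broker address, group id, and topic name are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "demo-group");               // placeholder group id
        props.put("enable.auto.commit", "false");          // no automatic offset commits
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // a crash before commitSync() means redelivery, not loss
                }
                consumer.commitSync(); // commit only after processing: at-least-once
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}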
More Partitions May Increase Unavailability

In the common case, when a broker is shut down cleanly, the controller proactively moves the leaders off the shutting-down broker one at a time. Moving a single leader takes only a few milliseconds, so from the client's perspective there is only a small window of unavailability during a clean broker shutdown.

However, when a broker is shut down uncleanly (e.g., kill -9), the observed unavailability can be proportional to the number of partitions. Suppose a broker has a total of 2000 partitions, each with 2 replicas. Roughly, this broker will be the leader for about 1000 partitions. When it fails uncleanly, all 1000 of those partitions become unavailable at the same time. If it takes 5 ms to elect a new leader for a single partition, it will take up to 5 seconds to elect new leaders for all 1000 partitions. So, for some partitions, the observed unavailability can be 5 seconds plus the time taken to detect the failure.
More Partitions May Increase End-to-end Latency

End-to-end latency in Kafka is defined as the time from when a message is published by the producer to when it is read by the consumer. Kafka only exposes a message to a consumer after it has been committed, i.e., replicated to all the in-sync replicas, so the time to commit a message can be a significant portion of the end-to-end latency. By default, a Kafka broker uses only a single thread to replicate data from another broker, for all partitions that share replicas between the two brokers. Experiments show that replicating 1000 partitions from one broker to another can add about 20 ms of latency, which implies the end-to-end latency is at least 20 ms. That can be too high for some real-time applications.

https://de.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
High-Level Streams DSL: using the Streams DSL, a user can express transformations, aggregations, grouping, etc. With each transformation, data has to be serialized and written to a topic; for the next operation in the chain it has to be read back from the topic, meaning all the side operations (partition-key calculation, persisting to disk, etc.) happen for every entity.

Kafka Streams assumes that the Serde class used for serialization and deserialization is the one provided in the config. When the format of the data changes within the operation chain, the user has to provide the appropriate Serde. If the existing Serdes can't handle the format, the user has to create a custom one: extend the Serde class and implement a custom serializer and deserializer. From one class we end up with four, which is not optimal; for each custom data format in the operation chain we create three additional classes. An alternative approach is to use generic JSON or Avro Serdes. One more thing: the user has to specify a Serde for both the key and the value parts of the message.
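One way to cut that boilerplate is Serdes.serdeFrom, which wraps a serializer/deserializer pair into a single Serde. A minimal sketch, assuming a hypothetical Order type with a trivial CSV encoding:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

// Hypothetical domain type used in the topology.
class Order {
    final String id;
    final double amount;
    Order(String id, double amount) { this.id = id; this.amount = amount; }
    String toCsv() { return id + "," + amount; }
    static Order fromCsv(String line) {
        String[] parts = line.split(",");
        return new Order(parts[0], Double.parseDouble(parts[1]));
    }
}

public class OrderSerde {
    // serdeFrom builds a Serde from a serializer/deserializer lambda pair,
    // instead of three hand-written classes per payload format.
    public static Serde<Order> instance() {
        return Serdes.serdeFrom(
            (topic, order) -> order.toCsv().getBytes(StandardCharsets.UTF_8),
            (topic, bytes) -> Order.fromCsv(new String(bytes, StandardCharsets.UTF_8)));
    }
}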
Restarting the application: after a breakdown, the application has to scan each internal topic up to the last valid offset, and this can take some time, especially if log compaction is not used and/or a retention period is not set up.

https://dzone.com/articles/problem-with-kafka-streams-1
2. Exactly once delivery in Kafka
A. Exactly-once semantics were introduced in the Apache Kafka 0.11 release and Confluent Platform 3.3.

https://medium.com/@jaykreps/exactly-once-support-in-apache-kafka-55e1fdd0a35f

For previous versions of Kafka we can partially achieve this with an exactly-once static consumer via assign (one-and-only-one message delivery).
Steps (a sketch follows the list):
Step 1: Set enable.auto.commit = false.
Step 2: Do not call consumer.commitSync() after processing a message.
Step 3: Register the consumer to a specific partition using the 'assign' call.
Step 4: On startup, seek the consumer to a specific message offset by calling consumer.seek(topicPartition, offset).
Step 5: While processing the messages, get hold of the offset of each message. Store the processed message's offset along with the processed message itself in an atomic transaction. When data is stored in a relational database, atomicity is easy to implement; for a non-relational data store such as HDFS or a NoSQL store, one way to achieve atomicity is to store the offset together with the message.
Step 6: Implement idempotence as a safety net.

https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
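A minimal sketch of these steps; the broker address, topic, partition number, and the offset-persistence helpers (loadOffsetFromStore, storeAtomically) are hypothetical placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignSeekConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("enable.auto.commit", "false");         // Step 1: no auto-commit
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition partition = new TopicPartition("demo-topic", 0); // placeholder
        long lastProcessedOffset = loadOffsetFromStore(); // offset persisted with the data

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition)); // Step 3: static assignment
            consumer.seek(partition, lastProcessedOffset + 1);     // Step 4: resume after last processed
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Step 5: store the result and record.offset() in one atomic write.
                    storeAtomically(record.value(), record.offset());
                }
                // Step 2: no consumer.commitSync(); the external store owns the offset.
            }
        }
    }

    private static long loadOffsetFromStore() { return -1L; }                 // hypothetical
    private static void storeAtomically(String value, long offset) { /* hypothetical */ }
}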
3. Why Avro producer and consumer is preferred
A. Avro is an open-source binary message exchange protocol. Avro helps send optimized messages across the wire, reducing network overhead. Avro can enforce a schema for messages, defined using JSON, and can generate binding objects in various programming languages from these schemas. Message payloads are automatically bound to these generated objects on the consumer side. Avro is natively supported and highly recommended for use with Kafka.

https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
4. Schema registry
A. Kafka takes bytes as input and sends bytes as output, with no data verification. Obviously, your data has meaning beyond bytes, so your consumers need to parse it and later interpret it. Parsing problems mainly occur in two situations: the field you're looking for doesn't exist anymore, or the type of the field has changed (e.g. what used to be a String is now an Integer).

What are our options to prevent and overcome these issues?

Catch exceptions on parsing errors. Your code becomes ugly and very hard to maintain. 👎

Never ever change the data producer, and triple-check that your producer code will never forget to send a field. That's what most companies do, but after a few key people quit, all your "safeguards" are gone. 👎👎

Adopt a data format and enforce rules that allow you to perform schema evolution while guaranteeing not to break your downstream applications. 👏 (Sounds too good to be true?) That data format is Apache Avro. Avro is one of the fastest formats to serialize and deserialize, and it supports schema evolution.
The Kafka Avro Serializer

The engineering beauty of this architecture is that your producers now use a new serializer, provided courtesy of Confluent, named KafkaAvroSerializer. Upon producing Avro data to Kafka, the following happens (simplified version):

Your producer checks whether the schema is available in the Schema Registry. If it is not available, the producer registers and caches it.

The Schema Registry verifies that the schema is either the same as before or a valid evolution. If not, it returns an exception and the KafkaAvroSerializer crashes your producer. Better safe than sorry.

If the schema is valid and all checks pass, the producer includes only a reference to the schema (the schema ID) in the message sent to Kafka, not the whole schema. The advantage is that the messages sent to Kafka are much smaller!

https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
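A minimal producer sketch using the Confluent KafkaAvroSerializer with an Avro GenericRecord; the broker address, registry URL, topic name, and User schema are placeholder assumptions:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    // Hypothetical schema; in practice this usually lives in a .avsc file.
    private static final String USER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry

        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // The serializer registers (or validates) the schema against the registry and
        // writes only the schema id plus the Avro-encoded payload into each message.
        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users-avro", "alice", user)); // placeholder topic
        }
    }
}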
5. In-Sync Replicas (AKA ISR)
A. Every topic partition in Kafka is replicated n times, where n is the replication factor of the topic. This allows Kafka to automatically fail over to these replicas when a server in the cluster fails, so that messages remain available in the presence of failures. Replication in Kafka happens at the partition granularity, where the partition's write-ahead log is replicated in order to n servers. Out of the n replicas, one is designated the leader while the others are followers. As the name suggests, the leader takes the writes from the producer and the followers merely copy the leader's log in order.

When a producer sends a message to the broker, it is written by the leader and replicated to all the partition's replicas. A message is committed only after it has been successfully copied to all the in-sync replicas.
What does it mean for a replica to be caught up to the leader? A replica that has not "caught up" to the leader's log may be marked as an out-of-sync replica. Take the example of a single-partition topic foo with a replication factor of 3. Assume that the replicas for this partition live on brokers 1, 2 and 3, and that 3 messages have been committed on topic foo. The replica on broker 1 is the current leader, replicas 2 and 3 are followers, and all replicas are part of the ISR. Also assume that replica.lag.max.messages is set to 4, which means that as long as a follower is behind the leader by no more than 3 messages, it will not be removed from the ISR. And replica.lag.time.max.ms is set to 500 ms, which means that as long as the followers send a fetch request to the leader every 500 ms or sooner, they will not be marked dead and will not be removed from the ISR.
A replica can be out of sync with the leader for several reasons:

Slow replica: a follower replica that is consistently unable to catch up with the writes on the leader for a certain period of time. One of the most common causes is an I/O bottleneck on the follower, causing it to append the copied messages at a rate slower than it can consume from the leader.

Stuck replica: a follower replica that has stopped fetching from the leader for a certain period of time. A replica can be stuck due to a GC pause or because it has failed or died.

Bootstrapping replica: when the user increases the replication factor of the topic, the new follower replicas are out of sync until they are fully caught up to the leader's log.

https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
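The committed-message guarantee above is what a producer leans on when it sets acks=all: the send is not acknowledged until every in-sync replica has the write. A minimal sketch, with broker address and topic as placeholders; pairing acks=all with the topic-level min.insync.replicas setting is a common durability configuration, though that setting is not discussed in the source above:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("acks", "all");   // wait until all in-sync replicas have the write
        props.put("retries", "3");  // retry transient send failures
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With acks=all, the send completes successfully only once the message is
            // committed, i.e. replicated to every replica currently in the ISR.
            producer.send(new ProducerRecord<>("foo", "key", "value"));
        }
    }
}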
6. Kafka log compaction
A. In a Kafka cluster, the retention policy can be set on a per-topic basis: time-based, size-based, or log-compaction-based. Log compaction retains at least the last known value for each record key within a single topic partition. Compacted logs are useful for restoring state after a crash or system failure. Another important use case is CDC (change data capture).

Kafka log compaction also allows for deletes. A message with a key and a null payload acts like a tombstone, a delete marker for that key. Tombstones are cleared after a period. Log compaction runs periodically in the background by recopying log segments. Compaction does not block reads and can be throttled to avoid impacting the I/O of producers and consumers.

The Kafka Log Cleaner performs log compaction. The Log Cleaner has a pool of background compaction threads that recopy log segment files, removing older records whose keys reappear later in the log.

The topic config min.compaction.lag.ms guarantees a minimum period that must pass before a message can be compacted. The consumer sees all tombstones as long as it reaches the head of the log within a period less than the topic config delete.retention.ms (the default is 24 hours). Log compaction never reorders messages; it only removes some, and the partition offset of a message never changes.

http://cloudurable.com/blog/kafka-architecture-log-compaction/index.html
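A compacted topic can be created programmatically with the AdminClient. A minimal sketch; the broker address, topic name, and the specific config values are placeholder assumptions:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 3) // placeholder name
                .configs(Map.of(
                    "cleanup.policy", "compact",         // keep at least the latest value per key
                    "min.compaction.lag.ms", "60000",    // no compaction for 60 s after append
                    "delete.retention.ms", "86400000")); // tombstones visible for 24 h
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}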
7. Kafka record keys for partition strategy
A. The key is optional metadata that can be sent with a Kafka message, and by default it is used to route the message to a specific partition. E.g. if you're sending a message m with key k to a topic mytopic that has p partitions, then m goes to partition Hash(k) % p in mytopic. The key has no connection to the offset of a partition whatsoever; offsets are used by consumers to keep track of the position of the last read message in a partition.

If a PartitionKeyStrategy is used with a topic, the value is used as the message key, and is then implicitly used to select the partition according to the default behavior of the Kafka client:

If a valid partition number is specified, that partition is used when sending the record. If no partition is specified but a key is present, a partition is chosen using a hash of the key. If neither key nor partition is present, a partition is assigned in a round-robin fashion.

It might be desirable in some cases to control these independently. For example, you might wish to have a message key that is more fine-grained than the partition key, for use with Kafka log compaction on sub-graphs of the entity state.

https://blog.newrelic.com/engineering/effective-strategies-kafka-topic-partitioning/
https://stackoverflow.com/questions/51245962/how-to-choose-a-key-and-offset-for-a-kafka-producer
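A minimal sketch of key-based routing: records sent with the same key always land in the same partition, preserving per-key ordering. The broker address, topic, key, and value are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition: all events for user-42 stay in order.
            RecordMetadata meta = producer
                .send(new ProducerRecord<>("mytopic", "user-42", "logged-in"))
                .get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}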
8. Kafka MirrorMaker
A. Kafka's mirroring feature makes it possible to maintain a replica of an existing Kafka cluster. MirrorMaker is just a regular Java producer/consumer pair: data is read from topics in the origin cluster and written to a topic with the same name in the destination cluster. You can run many such mirroring processes to increase throughput and for fault tolerance (if one process dies, the others take over the additional load).

An alternative is Confluent Replicator, a more complete solution that handles topic configuration as well as data, and integrates with Kafka Connect and Control Center to improve availability, scalability, and ease of use.

https://docs.confluent.io/current/multi-dc-replicator/mirrormaker.html
9. Challenges faced while working with Kafka
A. Duplicates (at-least-once behavior): remove the duplicates by checking against previously persisted messages, or implement exactly-once behavior.

Consumer liveness: a consumer may take too long to process a message. The group coordinator expects group members to send regular heartbeats to indicate that they remain active. A background heartbeat thread runs in the consumer, sending regular heartbeats to the coordinator. If the coordinator does not receive a heartbeat from a group member within the session timeout, it removes the member from the group and starts a rebalance of the group. The session timeout can be much shorter than the maximum polling interval, so that the time taken to detect a failed consumer can be short even if message processing takes a long time.
You can configure the maximum polling interval using the max.poll.interval.ms property and the session timeout using the session.timeout.ms property. You will typically not need to change these settings unless it takes more than 5 minutes to process a batch of messages.

If you have problems with message handling caused by message flooding, you can set consumer options to control the speed of message consumption: use fetch.max.bytes and max.poll.records to control how much data a call to poll() can return.

https://console.bluemix.net/docs/services/EventStreams/eventstreams114.html#consuming_messages
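A minimal sketch of a consumer configured for slow processing and message flooding, using the four properties above; the broker address, group id, and the specific values are placeholder assumptions to be tuned to your workload:

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TunedConsumerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "slow-processing-group");   // placeholder group id
        props.put("session.timeout.ms", "10000");    // fast heartbeat-based failure detection
        props.put("max.poll.interval.ms", "600000"); // allow up to 10 min between poll() calls
        props.put("max.poll.records", "100");        // cap records returned per poll()
        props.put("fetch.max.bytes", "1048576");     // cap bytes fetched per request (1 MiB)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual; the settings above bound both how much work
            // each poll() returns and how long processing may take before a rebalance.
        }
    }
}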