2. (Simplified) Glossary
Kafka ~ Distributed messaging system (distributed Pub Sub)
Brokers ~ The machines where the data is stored
Topic ~ Queue(s) of messages on cluster
Producer & Consumer ~ Pub Sub clients for the topic
Avro ~ A serialization format
3. OVERVIEW
Kafka: Why and How?
Producer - Consumer
Topics
A common format: Avro
Where is the data?
Isn’t that just one big single point of failure?
8. Publish & Subscribe using a messaging queue
● Topic represented by a dedicated queue
● Writer and Reader don’t know each other
● Processing data is the reader’s responsibility
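The decoupling described in the bullets above can be sketched with a toy in-memory topic. This is only an illustration, not Kafka's API: the `Topic` and `Reader` classes and the topic name `orders` are hypothetical. The writer only appends to the topic, and each reader keeps its own position and decides what to do with the messages.

```python
class Topic:
    """Toy in-memory topic: an append-only list of messages (not Kafka's API)."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def publish(self, message):
        # The writer only knows the topic, never the readers.
        self.messages.append(message)
        return len(self.messages) - 1  # position of the new message


class Reader:
    """Each reader tracks its own position in the topic."""

    def __init__(self, topic):
        self.topic = topic
        self.position = 0

    def poll(self):
        # Return everything published since this reader's last poll.
        new_messages = self.topic.messages[self.position:]
        self.position = len(self.topic.messages)
        return new_messages


topic = Topic("orders")  # hypothetical topic name
topic.publish({"id": 1})
topic.publish({"id": 2})

reader_a = Reader(topic)
reader_b = Reader(topic)

# Processing is the reader's responsibility: reader_a extracts ids,
# reader_b just counts, and neither affects the other.
ids_seen_by_a = [m["id"] for m in reader_a.poll()]
count_seen_by_b = len(reader_b.poll())
print(ids_seen_by_a, count_seen_by_b)
```

Reading does not remove anything from the topic, which is why two readers can consume the same messages independently.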
10. Kafka storage
By default on Kafka:
● Writes to disk (zero-copy)
● Messages are retained for 6 months per topic
● Topics are distributed for parallelism
● Topics are replicated for resilience
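A minimal sketch of the last two bullets, assuming a topic is split into partitions (for parallelism) and each partition is copied to several brokers (for resilience). The function names, broker names, and the round-robin placement are hypothetical, and `zlib.crc32` is a stand-in hash chosen for the example; real Kafka clients use their own partitioner.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition.

    Stand-in hash for illustration only; real Kafka clients use a
    different partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replica_brokers(partition, brokers, replication_factor):
    """Hypothetical round-robin placement of a partition's replicas."""
    return [
        brokers[(partition + i) % len(brokers)]
        for i in range(replication_factor)
    ]


brokers = ["broker-1", "broker-2", "broker-3"]  # hypothetical broker names

# The same key always lands in the same partition.
p = partition_for("user-42", num_partitions=3)
same_p = partition_for("user-42", num_partitions=3)

# Each partition lives on several brokers, so losing one broker
# does not lose the data.
placement = {
    part: replica_brokers(part, brokers, replication_factor=2)
    for part in range(3)
}
print(p, placement)
```

With a replication factor of 2 and three brokers, every partition has copies on two distinct machines in this sketch.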
28. Brokers are where most of the stuff happens
● Dump the data: The data sits on the brokers’ disk(s). Data flows to and from Kafka. It’s immutable: you can’t change it directly.
● Retention: By default, data is kept for approx. 6 months, but it can stay there indefinitely. In all cases, its expiration is totally independent of its consumption.
● Scalable: To increase space we can “simply” add a new broker.
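The retention point, that expiration does not depend on consumption, can be illustrated with a toy log. This is a sketch under assumptions: `PartitionLog` is a hypothetical class, and the 10-second retention and the timestamps are placeholder values chosen to keep the example short, not values from the slides.

```python
class PartitionLog:
    """Toy append-only log with time-based retention (not Kafka's API)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # (offset, timestamp, value) tuples
        self.next_offset = 0

    def append(self, value, now):
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1

    def read_from(self, offset):
        # Reading never deletes anything.
        return [r for r in self.records if r[0] >= offset]

    def expire(self, now):
        # Only age matters; whether a record was read is irrelevant.
        self.records = [
            r for r in self.records
            if now - r[1] < self.retention_seconds
        ]


log = PartitionLog(retention_seconds=10)  # placeholder retention
log.append("a", now=0)
log.append("b", now=8)

consumed = log.read_from(0)          # both records are consumed
log.expire(now=9)                    # nothing is old enough to expire
remaining_after_read = len(log.records)

log.expire(now=12)                   # "a" is 12s old and expires, "b" stays
remaining_values = [value for _, _, value in log.records]
print(remaining_after_read, remaining_values)
```

A record that was already read stays until it is old enough, and a record that nobody read would still be dropped once it passes the retention period.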
36. Talent bank’s use case
Stream “Latest”
1 topic per domain.entity
3 partitions per topic
Retention > weeks
37. Data team’s use case with JT MySQL
Stream full content of DB
1 topic per table
1 partition per topic
Retention > months
38. Data team’s use case with Salesforce
Stream “Latest”
1 topic per “Object”
1 partition
Retention < 1 week
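The naming and retention conventions in the three use cases above can be turned into a couple of small helpers. This is a sketch with assumptions: the helper names and the example domain and entity (`talent`, `profile`) are hypothetical, and only the "1 week" bound comes from the slides. Kafka expresses topic retention in milliseconds through the `retention.ms` topic setting, which is why the conversion is shown.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def topic_name(domain, entity):
    """Build a 'domain.entity' topic name, as in the first use case."""
    return f"{domain}.{entity}"


def retention_ms(days):
    """Convert a retention period in days to milliseconds."""
    return days * MS_PER_DAY


# Hypothetical domain and entity, for illustration only.
example_topic = topic_name("talent", "profile")

# "Retention < 1 week" means the setting must stay below this bound.
one_week_ms = retention_ms(7)
print(example_topic, one_week_ms)
```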
39. (Complete) Glossary
Kafka -> Your new best friend
Topic -> Log file of the messages (exists at cluster level)
Offset -> Primary key of the message (at partition level)
Brokers -> The machines that fully handle the topics
Producer & Consumer -> Your job
Avro -> So much better than json ;)
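To make the "offset as primary key at partition level" entry concrete, here is a toy model in which each partition numbers its own messages from zero. The `TopicLog` class and the topic name are hypothetical. The point of the sketch is that an offset alone is ambiguous across partitions, so a message is identified by topic, partition, and offset together.

```python
class TopicLog:
    """Toy topic made of independent partition logs (not Kafka's API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, value):
        log = self.partitions[partition]
        log.append(value)
        # The offset is the message's position within its own partition.
        return (self.name, partition, len(log) - 1)

    def get(self, partition, offset):
        return self.partitions[partition][offset]


topic = TopicLog("domain.entity", num_partitions=3)  # hypothetical name

first = topic.append(0, "x")   # first message in partition 0
second = topic.append(1, "y")  # first message in partition 1
third = topic.append(0, "z")   # second message in partition 0

# Offset 0 exists in both partitions and points to different messages.
print(first, second, third)
print(topic.get(0, 0), topic.get(1, 0))
```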
41. Valuable resources
Kafka for beginners : https://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
Kafka overview : https://www.alibabacloud.com/blog/an-overview-of-kafka-distributed-message-system_594218
Is Kafka a database : https://speakerdeck.com/ept/is-kafka-a-database
Putting the Power of Kafka into the Hands of Data Scientists : https://multithreaded.stitchfix.com/blog/2018/09/05/datahighway/
Why we chose Kafka : https://tech.trello.com/why-we-chose-kafka/
Salesforce notifications to Kafka topics : https://glenmazza.net/blog/entry/salesforce-notifications-to-kafka-topics
Streaming data out of the monolith : https://medium.com/blablacar-tech/streaming-data-out-of-the-monolith-building-a-highly-reliable-cdc-stack-d71599131acb
Kafka clients At Most Once, At Least Once, Exactly Once : https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
Message serialization in Kafka using Avro part 1 : http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-1/
Message serialization in Kafka using Avro part 2 : http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-2/
Offset management in Kafka : https://fr.slideshare.net/jjkoshy/offset-management-in-kafka
Kafka listeners explained : https://rmoff.net/2018/08/02/kafka-listeners-explained/
The power of rebalancing in Kafka : https://www.youtube.com/watch?v=MmLezWRI3Ys