But as we start to deploy our applications we realizet hat clients need data from a number of sources. So we add them as needed.
But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with it’s history at LinkedIn and they use it as a high throughput relatively low latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producer push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
So in kafka, feeds of messages are stored in categories called topics. So you have a message, it goes into a given topic. You create a topic explicitly or you can just start publishing to a topic and have it created auto-magically.
Anything that publishes message to a kafka topic is called a producer.
Anything that reads messages from kafka is called a consumer.
The broker is a cluster of one ore more servers.
All of the communication under the covers is done via a binary TCP protocol (which you don’t really have to worry too much about, unless you start writing custom client implementations).
So we can see in the diagram here that producers can write to really any broker in the cluster and consumers can do the same. Producers rely on Zookeeper for three basic things:
detecting the addition and the removal of brokers and consumers, Triggering a rebalance process in each consumer when the above events happen, and (3) maintaining the consumption relationship and keeping track of the consumed offset of each partition. Specifically, when each broker or consumer starts up, it stores its information in a broker or consumer registry in Zookeeper. The broker registry contains the broker’s host name and port, and the set of topics and partitions stored on it. There’s also a consumer registry which in versions < 0.8.2 was the only option. This can now be stored in a special Kafka topic, as zookeeper could become a bottleneck in high throughput processing.
Replication -> all the the min.insync.replicas. ..there is a timeout.
The single digit
Highlight boxes with different color
How are offsets shared
This is doable with an idempotent producer where the producer tracks committed messages within some configurable window
A lot of people for cross-data center replication. Goldengate -> Kafka.
Rabbit MQ. I can just grab messages, basically transient consumers that can rejoin if necessary. TIBCO as well. Cigna
Active Mq->trackign consumers. You can have some kind of clustering, but recovering from a a failed nodes is a problem, can lose messages or use them for a long time.
Add note about the different ways to handle message formats. Add Gwen’s code samples. Java example, python example
So basically the old way was you could run the producer async which was blazing fast but you might miss stuff. Now you can send everything async and check the callbacks to ensure that you aren’t dropping messages.