Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
4. Features
10/25/2016 Confidential 4
• Brokered
• Distributed
• High throughput for publish and subscribe
• Easily scalable
• Fast
• Replicated commit log service
• Partitioned
• Stores messages on disk
• In-order delivery, per partition
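The "partitioned commit log" and "in-order delivery, per partition" points can be pictured with a toy in-memory model. This is only a sketch: the class and method names (PartitionedLog, append, read) are invented for illustration and are not Kafka's API, and a real broker persists the log on disk.

```python
class PartitionedLog:
    """Toy stand-in for a topic split into partitions (hypothetical, not Kafka's API)."""

    def __init__(self, num_partitions):
        # One append-only list per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append to the end of one partition and return the assigned offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition, offset, max_messages=10):
        """Read from one partition in append order, starting at `offset`."""
        return self.partitions[partition][offset:offset + max_messages]


log = PartitionedLog(num_partitions=2)
log.append(0, "a1")
log.append(1, "b1")
log.append(0, "a2")
print(log.read(0, 0))  # ['a1', 'a2']
```

Order is guaranteed only inside a partition: nothing here says whether "b1" comes before or after "a2" from a reader's point of view.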
8. Topic
• Each message in a partition is uniquely identified by its offset
• A consumer can change its offset to re-consume or skip messages
• Partitions are replicated across a configurable number of servers
• Retention policy
• Parallelism at the partition level
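A minimal sketch of how a consumer-owned offset allows re-consuming or skipping messages. The PartitionConsumer class and its poll/seek methods are hypothetical helpers written for this example, and the partition is a plain Python list rather than a real broker.

```python
class PartitionConsumer:
    """Hypothetical consumer that tracks its own position in one partition."""

    def __init__(self, partition_log):
        self.log = partition_log  # a plain list standing in for a partition
        self.offset = 0           # next offset to read

    def poll(self):
        """Return the next message and advance, or None when caught up."""
        if self.offset >= len(self.log):
            return None
        message = self.log[self.offset]
        self.offset += 1
        return message

    def seek(self, offset):
        """Move the read position; the log itself is not modified."""
        self.offset = offset


partition = ["m0", "m1", "m2", "m3"]
consumer = PartitionConsumer(partition)
first = [consumer.poll(), consumer.poll()]  # reads m0, m1

consumer.seek(0)              # rewind: re-consume from the start
replayed = consumer.poll()

consumer.seek(3)              # jump ahead: skip m1 and m2
skipped_to = consumer.poll()
```

Because the position lives with the consumer, rewinding or skipping is just a change of an integer; the stored messages stay in place until the retention policy removes them.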
9. Producers and Consumers
Producers:
• Decide which message goes to which partition (load balancing)
• Batch messages
• Send asynchronously
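The producer-side points (partition choice and batching) can be sketched as below. Everything here is an assumption made for illustration: the CRC32-based partitioner, the BatchingProducer class, and the synchronous `send` callback that stands in for what would be an asynchronous network send.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an assumption here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class BatchingProducer:
    """Hypothetical producer that buffers messages per partition."""

    def __init__(self, num_partitions, batch_size, send):
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self.send = send      # callable(partition, list_of_messages)
        self.pending = {}     # partition -> buffered messages

    def produce(self, key, value):
        partition = choose_partition(key, self.num_partitions)
        batch = self.pending.setdefault(partition, [])
        batch.append(value)
        if len(batch) >= self.batch_size:
            self.flush(partition)  # a full batch goes out in one send
        return partition

    def flush(self, partition):
        batch = self.pending.pop(partition, [])
        if batch:
            self.send(partition, batch)

    def flush_all(self):
        for partition in list(self.pending):
            self.flush(partition)


sent = []
producer = BatchingProducer(num_partitions=3, batch_size=2,
                            send=lambda p, msgs: sent.append((p, msgs)))
for i in range(4):
    producer.produce("user-42", f"click-{i}")
producer.flush_all()
print(sent)  # two batches of two messages, both for the same partition
```

Messages with the same key land in the same partition, so their relative order is preserved, and four messages cost two sends instead of four.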
Consumers:
• Pull messages from the broker
• Support both queuing and publish-subscribe
• Consumer groups: a cluster of consumers
• Ordered delivery per partition
Not JMS-based
Initially developed for tracking activity on web pages
Has its own unique design
Communication is over TCP
Compression
Physically, the log is a file
Uses a distributed commit log
Storage is distributed
Kafka is all about the log
The leader of a partition, on one server, handles all reads and writes; followers passively replicate the leader.
Compression
Pull
Flow control on the consumer side
Aggressive batching
Suppose:
Queue -> all consumer instances share the same group name
Pub-sub -> each consumer instance has a different group name
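The two cases above can be sketched with a small routing function. The `route` helper and the modulo-based choice of owner inside a group are assumptions made for this example; they are not Kafka's actual assignment strategy.

```python
from collections import defaultdict


def route(partition, consumers):
    """Return the instances that receive messages from `partition`.

    `consumers` is a list of (instance_name, group_name) pairs. Every group
    sees the partition, but only one instance inside each group owns it.
    """
    groups = defaultdict(list)
    for instance, group in consumers:
        groups[group].append(instance)

    receivers = []
    for group in sorted(groups):
        members = sorted(groups[group])
        # Hypothetical rule: pick the owner by partition number.
        receivers.append(members[partition % len(members)])
    return receivers


queue_style = [("c1", "orders"), ("c2", "orders")]    # same group name
pubsub_style = [("c1", "billing"), ("c2", "audit")]   # different group names

print(route(0, queue_style))   # ['c1']
print(route(1, queue_style))   # ['c2']
print(route(0, pubsub_style))  # ['c2', 'c1']
```

With one shared group name the partitions are divided among the instances (queue behaviour); with distinct group names every instance receives every partition (pub-sub behaviour).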
Candidates:
Apache Kafka
Apache ActiveMQ version 5.4
RabbitMQ version 2.4
System:
Linux machines
8 × 2 GHz cores
16 GB memory
6 disks with RAID 10
1 Gb network link
One machine as the broker, the other for the producer and consumer
Justification:
1. The Kafka producer doesn't wait for acknowledgements.
2. Efficient storage format: the header is 9 bytes, versus 144 bytes in ActiveMQ (as required by JMS).
The busiest thread in ActiveMQ spends its time accessing a B-Tree to maintain message metadata and state.
3. Zero copy: avoids the copies through user space in the traditional path:
Disk -> page cache in kernel space
Kernel space -> user space
User space -> socket buffer
Socket buffer -> NIC buffer
4. Compressing multiple messages together as a message set
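Point 4 can be illustrated by comparing per-message compression with compression of a whole batch. This is a sketch under stated assumptions: zlib is used only because it ships with Python, the length-prefixed framing in `pack`/`unpack` is invented for the example and is not Kafka's wire format, and the sample messages are made up.

```python
import zlib

# Made-up activity events; similar messages share a lot of repeated text.
messages = [
    f'{{"user": {i % 5}, "page": "/products/list", "action": "view"}}'.encode("utf-8")
    for i in range(200)
]

# Option A: compress every message on its own.
individual = sum(len(zlib.compress(m)) for m in messages)


def pack(batch):
    """Join messages into one blob, each prefixed by a 4-byte length."""
    return b"".join(len(m).to_bytes(4, "big") + m for m in batch)


def unpack(blob):
    """Split a blob produced by pack() back into the original messages."""
    out, pos = [], 0
    while pos < len(blob):
        size = int.from_bytes(blob[pos:pos + 4], "big")
        out.append(blob[pos + 4:pos + 4 + size])
        pos += 4 + size
    return out


# Option B: compress the whole batch (the "message set") at once.
message_set = zlib.compress(pack(messages))
batched = len(message_set)

print(individual, batched)
```

Compressing the set lets the compressor exploit redundancy across messages, which per-message compression cannot see, so the batched size comes out far smaller for repetitive data like this.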