Kafka - A little introduction
Pub-Sub Messaging System
Distributed
Performance
Disk/Memory Performance
[Chart: read values/second for disk, SSD, and memory under random vs sequential access, reading from arrays of 1M to 1000M values; the key point is that a sequential disk read can be faster than a random memory read. Source: http://queue.acm.org/detail.cfm?id=1563874]
Persistent
Length  | Magic Value | Checksum | Payload
4 bytes | 1 byte      | 4 bytes  | n bytes
Token (Offset: 0, Broker: kafka.local, Topic: Testing) -> Input -> MR Job -> Outputs: Sequence File + updated Token (Offset: 130098, Broker: kafka.local, Topic: Testing)
Useful Things
  • http://incubator.apache.org/kafka/
  • https://github.com/pingles/clj-kafka
A brief run through of Kafka and some of its interesting characteristics that make it a great messaging system for collecting and aggregating data.

  • Built by LinkedIn to process and store high-volume activity-stream data, but it's really a general-purpose messaging system...
  • At its heart, it's a pub-sub messaging system...
  • It starts with a broker.
  • Publishers connect to the broker
  • and send their messages.
  • Then we connect some consumers and they can pull messages. Note that when consumers connect they'll receive all messages for a topic, not just those sent since they connected; more on that later...
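The pull-based flow in the notes above can be sketched as a toy in-memory broker. This is purely conceptual (real Kafka persists messages to disk and speaks its own wire protocol); the class and method names here are invented for illustration.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory stand-in for a broker: one append-only
    list of messages per topic. Illustrative only."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # Consumers pull from any offset, so a newly connected
        # consumer can read every message ever sent to the topic,
        # not just those published after it connected.
        return self.topics[topic][offset:]

broker = Broker()
broker.publish("Testing", b"first")
broker.publish("Testing", b"second")

# A consumer connecting now still sees both messages.
assert broker.consume("Testing") == [b"first", b"second"]
assert broker.consume("Testing", offset=1) == [b"second"]
```

Because the broker keeps the log rather than tracking per-consumer delivery, any number of consumers can read the same topic independently.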
  • But it's also distributed, which is to say...
  • We can have multiple brokers in multiple places and aggregate them together. Internally we can also partition within topics to allow parallel consumption, but that's for another talk...
  • Before we get into what makes it particularly different (persistence), it's useful to understand some of the engineering decisions behind how it works. Performance is interesting because the behaviour of disks and memory has informed the way Kafka has been built to embrace disk persistence.
  • Research from an ACM paper: values/second is the number of four-byte integer values read per second from a 1-billion-long (4 GB) array on disk or in memory. Kafka uses the OS's default page caching rather than a custom in-memory store; since all disk writes and reads go through the cache anyway, this avoids paying the caching overhead of objects within the JVM. Rather than maintaining everything in memory and flushing when necessary, everything is written immediately, and configurable flushing determines how much data is at risk. Similar to Varnish.
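The write path described in that note (hand every write to the OS immediately, let the page cache absorb it, and make flushing a durability knob) can be illustrated with a plain append-only file. This is a conceptual sketch, not Kafka's implementation; `flush_every` is an invented parameter standing in for Kafka's configurable flush settings.

```python
import os
import tempfile

class AppendLog:
    """Append-only log sketch: every write goes straight to the file
    (i.e. into the OS page cache); fsync only every `flush_every`
    messages. Messages written since the last fsync are the data
    'at risk' if the machine dies."""

    def __init__(self, path, flush_every=100):
        self.f = open(path, "ab")
        self.flush_every = flush_every
        self.unflushed = 0

    def append(self, payload: bytes):
        self.f.write(payload)
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.f.flush()                 # drain Python's buffer
            os.fsync(self.f.fileno())      # force page cache to disk
            self.unflushed = 0

path = os.path.join(tempfile.mkdtemp(), "testing.log")
log = AppendLog(path, flush_every=2)
log.append(b"hello")
log.append(b"world")   # second append triggers the fsync
assert os.path.getsize(path) == 10
```

Raising `flush_every` trades durability for throughput, which is exactly the knob the note describes.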
  • It starts with a topic, a text description for the messages contained within. We use it to describe how to deserialize the message bytes.
  • So we send a message to the topic; what happens?
  • Kafka creates a file and persists the message, which is to say it hands it off to the OS to write. Files are just sets of bytes, nothing clever. Internally, Kafka abstracts the collection of message bytes into a MessageSet, which is backed by a file. So what does each message look like?
  • With a 9-byte header (length, magic value, checksum), an n-byte payload gives a message of n + 9 bytes: a 91-byte payload makes a 100-byte message, which means the next message would start at offset 100.
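That offset arithmetic can be made concrete by packing a message in the layout from the earlier slide (4-byte length, 1-byte magic value, 4-byte checksum, payload). This is a sketch, not Kafka's exact serializer: I assume a CRC32 checksum and that the length field counts everything after itself, both of which vary by Kafka protocol version.

```python
import struct
import zlib

MAGIC = 0  # a single magic byte versions the message format

def encode_message(payload: bytes) -> bytes:
    """Pack one message as: length (4) | magic (1) | CRC32 (4) | payload.
    Assumption: the length field covers magic + checksum + payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    body = struct.pack(">BI", MAGIC, checksum) + payload
    return struct.pack(">I", len(body)) + body

payload = b"x" * 91
message = encode_message(payload)

# 4 + 1 + 4 bytes of framing around a 91-byte payload = 100 bytes,
# so the next message in the log would start at offset 100.
assert len(message) == 100
```

Since every message carries its own length prefix, a reader at any valid offset can walk the file message by message without an index.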
  • And we can see our offsets at the bottom...
  • So we have the offsets, which lets us send all messages to consumers, not just those that were sent after they connected...
  • It's up to the consumer to remember what they've consumed, but this means you can re-consume an entire set of messages easily, which is very useful when integrating with long-term storage like HDFS... A quick look at the way it works:
  • Our input to the Hadoop job is a token file that specifies the offset to read from, the topic, etc. Having read the token, the mapper connects and consumes messages from the given offset. The mapper outputs two sets of data: the mapped output, such as the message payloads, and an updated token file with the last-read offset. This is the key: successful completion of the job produces both the output data and new metadata for the next run, which means that if the job fails we can re-run it and it will consume from the last successfully consumed offset.
  • The newly created output becomes the next input.
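The token handshake in those notes (read the offset from a token, consume, emit the data plus an updated token that becomes the next run's input) can be sketched like this. The function name and token fields are invented for illustration; a plain list stands in for the Kafka topic.

```python
def run_job(token, messages):
    """One 'job run' over an in-memory message list standing in for a
    Kafka topic. Reads from token['offset'] and returns (output, new
    token). The new token only exists if the run completes, so a
    failed run simply re-consumes from the last successful offset."""
    start = token["offset"]
    consumed = messages[start:]
    new_token = dict(token, offset=start + len(consumed))
    return consumed, new_token

topic = [b"m0", b"m1", b"m2"]
token = {"broker": "kafka.local", "topic": "Testing", "offset": 0}

output, token = run_job(token, topic)   # first run consumes everything
assert token["offset"] == 3

topic.append(b"m3")
output, token = run_job(token, topic)   # next run starts where we left off
assert output == [b"m3"]
```

Because the offset lives in the job's output rather than on the broker, replaying or recovering a run never needs broker-side coordination.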
  • And this is why Kafka is an interesting messaging system: suitable for both batch and realtime processing.