Apache Kafka
http://incubator.apache.org/kafka


         /*

              Joe Stein, Chief Architect
              http://www.medialets.com
              Twitter: @allthingshadoop

              Committer: Apache Kafka
         */




Overview

• whoami
• Kafka: What is it?
• Kafka: Why do we need it?
• Kafka: What do we get?
• Kafka: How do we get it?
• Kafka: Let's jump in!


Medialets




Medialets
•   Largest deployment of rich media ads for mobile devices
•   Installed on hundreds of millions of devices
•   3-4 TB of new data every day
•   Thousands of services in production
•   Hundreds of thousands of events received every second
•   Response times are measured in microseconds
•   Languages
     – 55% JVM (70% Scala & 30% Java)
     – 20% C/C++
     – 13% Python
     – 10% Ruby
     – 2% Bash


Apache Kafka

  What is it?




A distributed publish-subscribe messaging system

• Originally created by LinkedIn, contributed to Apache in
  July 2011; currently in incubation
• Kafka is written in Scala
• Multi-language support for the producer/consumer API
  (Scala, Java, Ruby, Python, C++, Go, PHP, etc.)




Apache Kafka

Why do we need it?




Offline log aggregation and real-time messaging
Other “log-aggregation only” systems (e.g. Scribe and Flume) are
architected for “push” to drive the data.
   – High performance and scale, however:
      • Expected end points are large (e.g. Hadoop)
      • End points can’t run much business logic in real time
        because they have to consume as fast as data is pushed to
        them… unless consuming the data is their main job
• Messaging systems (e.g. RabbitMQ, ActiveMQ)
   – Do not scale
      • No API for batching; delivery is transactional (the broker
        retains each consumer’s stream position)
      • No message persistence means multiple consumers reading
        over time are impossible, limiting the architecture
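The contrast above can be sketched with a toy example: in Kafka's model the consumer, not the broker, owns its stream position, so many consumers can read the same persisted log independently and replay it at will. A minimal sketch (the in-memory log and `poll` helper are illustrative, not the Kafka API):

```scala
// Toy persisted log: messages stay put; each consumer tracks its own offset.
object OffsetDemo extends App {
  val log = Vector("m0", "m1", "m2", "m3") // stands in for an on-disk segment

  // Fetch up to `max` messages starting at `offset`; return the batch and
  // the next offset the consumer should remember.
  def poll(offset: Int, max: Int): (Seq[String], Int) = {
    val batch = log.slice(offset, offset + max)
    (batch, offset + batch.size)
  }

  // Two independent consumers over the same data, at different positions.
  val (batchA, nextA) = poll(0, 2) // consumer A reads m0, m1
  val (batchB, nextB) = poll(2, 2) // consumer B reads m2, m3
  println(s"A read ${batchA.mkString(",")}; B read ${batchB.mkString(",")}")

  // Replay: A can rewind to 0 at any time, because the broker never deleted
  // messages on acknowledgement.
  val (replay, _) = poll(0, 4)
  assert(replay == log)
}
```

Because the broker does no per-consumer bookkeeping, adding a consumer costs the broker nothing beyond the reads themselves.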




All-in-one system with one architecture and one API
• Kafka is a specialized system that overlaps use cases for
  both offline and real-time log processing.




Apache Kafka

What do we get?




Performance & Scale
  • Producer Test:
    – LinkedIn configured the broker in all systems to asynchronously
      flush messages to its persistence store.
    – For each system, they ran a single producer to publish a total of 10
      million messages, each of 200 bytes.
    – They configured the Kafka producer to send messages in batches of
      size 1 and 50. ActiveMQ and RabbitMQ don’t seem to have an easy
      way to batch messages, so they assumed a batch size of 1.
    – In the next slide, the x-axis represents the amount of data sent to
      the broker over time in MB, and the y-axis corresponds to the
      producer throughput in messages per second. On average, Kafka
      can publish messages at rates of 50,000 and 400,000 messages
      per second for batch sizes of 1 and 50, respectively.




http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

Performance & Scale

(Producer throughput graph from the NetDB’11 paper: data sent in MB on the x-axis, messages/sec on the y-axis.)
Performance & Scale
  • Consumer Test:
    – In the second experiment, LinkedIn tested the performance
      of the consumer. Again, for all systems, they used a single
      consumer to retrieve a total of 10 million messages.
    – They configured all systems so that each pull request
      would prefetch approximately the same amount of data: up
      to 1,000 messages or about 200 KB.
    – For both ActiveMQ and RabbitMQ, they set the consumer
      acknowledge mode to be automatic. Since all messages fit
      in memory, all systems were serving data from the page
      cache of the underlying file system or some in-memory
      buffers.




Performance & Scale

(Consumer throughput graph from the NetDB’11 paper.)
Apache Kafka

How do we get it?




Performance & Scale
• Producing:
   – The Kafka producer currently doesn’t wait for acknowledgements from the broker
     and sends messages as fast as the broker can handle them.
       • This is a valid optimization for the log aggregation case, as data must be
         sent asynchronously to avoid introducing any latency into the live serving of
         traffic. We note that without acknowledging the producer, there is no
         guarantee that every published message is actually received by the broker.
       • For many types of log data, it is desirable to trade durability for throughput,
          as long as the number of dropped messages is relatively small.
       • Durability through replication is being addressed in 0.8
   – Kafka has a very efficient storage format.
     http://incubator.apache.org/kafka/design.html
   – Batching
• Consuming:
   – Kafka does not maintain delivery state on the broker, which means zero writes
     for each consumed message.
   – Kafka uses the sendfile API, transferring bytes from disk to socket entirely in
     kernel space, saving the copies and system calls of the usual kernel → user →
     kernel round trip.
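On the JVM, the sendfile optimization is exposed as `FileChannel.transferTo`, which lets the kernel move file bytes directly to another channel without staging them in user space. A minimal sketch of the call (transferring file-to-file here for self-containment; Kafka's real target is a socket channel):

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, StandardOpenOption}

object ZeroCopyDemo extends App {
  // Set up a source file standing in for a Kafka log segment.
  val src = Files.createTempFile("kafka-demo-src", ".log")
  val dst = Files.createTempFile("kafka-demo-dst", ".log")
  Files.write(src, "some log segment bytes".getBytes("UTF-8"))

  val in  = FileChannel.open(src, StandardOpenOption.READ)
  val out = FileChannel.open(dst, StandardOpenOption.WRITE)
  try {
    // transferTo may move fewer bytes than requested, so loop until done.
    // On Linux this maps to sendfile(2), avoiding the read-into-user-space /
    // write-back-to-kernel round trip.
    var position = 0L
    val size = in.size()
    while (position < size)
      position += in.transferTo(position, size - position, out)
  } finally { in.close(); out.close() }

  assert(Files.readAllBytes(dst).sameElements(Files.readAllBytes(src)))
  println(s"transferred ${Files.size(dst)} bytes via transferTo")
}
```

The same loop shape applies when the target is a `SocketChannel`, which is the consumer-serving path described above.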



Apache Kafka

  Let's jump in




Producer
core/src/main/scala/kafka/tools/ProducerShell.scala

/**
 * Interactive shell for producing messages from the command line
 */

// config setup
val propsFile = options.valueOf(producerPropsOpt)
val producerConfig = new ProducerConfig(Utils.loadProps(propsFile))
val topic = options.valueOf(topicOpt)
val producer = new Producer[String, String](producerConfig)

// read lines from stdin and publish each one as a message until EOF
val input = new BufferedReader(new InputStreamReader(System.in))
var done = false
while(!done) {
  val line = input.readLine()
  if(line == null) {
    done = true
  } else {
    val message = line.trim
    producer.send(new ProducerData[String, String](topic, message))
    println("Sent: %s (%d bytes)".format(line, message.getBytes.length))
  }
}
producer.close()
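The `ProducerConfig` above is driven by a properties file passed on the command line. A minimal `producer.properties` for the 0.7 line looks roughly like this (property names follow the 0.7 quickstart; check `config/producer.properties` in the release for the authoritative set):

```properties
# discover brokers through ZooKeeper
zk.connect=localhost:2181
# serialize message payloads as plain strings
serializer.class=kafka.serializer.StringEncoder
```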
Consumer
core/src/main/scala/kafka/consumer/ConsoleConsumer.scala
/**
 * Consumer that dumps messages out to standard out.
 */
val connector = Consumer.create(config) // kafka.consumer.ConsumerConnector
val stream = connector.createMessageStreamsByFilter(filterSpec).get(0)
val iter = if(maxMessages >= 0)
    stream.slice(0, maxMessages)
  else
    stream
val formatter: MessageFormatter = messageFormatterClass.newInstance().asInstanceOf[MessageFormatter]
formatter.init(formatterArgs)

try {
  for(messageAndTopic <- iter) {
    try {
      formatter.writeTo(messageAndTopic.message, System.out)
    } catch {
      case e =>
        if (skipMessageOnError)
          error("Error processing message, skipping this message: ", e)
        else
          throw e
    }
    if(System.out.checkError()) {
      // This means no one is listening to our output stream any more, time to shut down
      System.err.println("Unable to write to standard out, closing consumer.")
      formatter.close()
      connector.shutdown()
      System.exit(1)
    }
  }
} catch {
  case e => error("Error processing message, stopping consumer: ", e)
}
Running everything
Download Kafka Source
• http://incubator.apache.org/kafka/downloads.html

Open a Terminal
• cp ~/Downloads/kafka-0.7.1-incubating-src.tgz .
• tar -xvf kafka-0.7.1-incubating-src.tgz
• cd kafka-0.7.1-incubating
• ./sbt update
• ./sbt package
Open 3 more terminals http://incubator.apache.org/kafka/quickstart.html
• Terminal 1
    – bin/zookeeper-server-start.sh config/zookeeper.properties
• Terminal 2
    – bin/kafka-server-start.sh config/server.properties
• Terminal 3
    – bin/kafka-producer-shell.sh --props config/producer.properties --topic scalathon
    – Start typing
• Terminal 4
    – bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic scalathon
• Terminal 3
    – Type some more
• Terminal 4
    – See what you just typed
    – Ctrl+c
    – bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic scalathon --from-beginning
    – See EVERYTHING you have typed
We are hiring!
 /*

      Joe Stein, Chief Architect
      http://www.medialets.com
      Twitter: @allthingshadoop

 */


 Medialets
 The rich media ad
 platform for mobile.
                      connect@medialets.com
                      www.medialets.com/showcase




