Apache Kafka



  1. Apache Kafka (http://incubator.apache.org/kafka) /* Joe Stein, Chief Architect, http://www.medialets.com, Twitter: @allthingshadoop, Committer: Apache Kafka */
  2. Overview
     • whoami
     • Kafka: What is it?
     • Kafka: Why do we need it?
     • Kafka: What do we get?
     • Kafka: How do we get it?
     • Kafka: Let's jump in!
  3. Medialets
  4. Medialets
     • Largest deployment of rich media ads for mobile devices
     • Installed on hundreds of millions of devices
     • 3-4 TB of new data every day
     • Thousands of services in production
     • Hundreds of thousands of events received every second
     • Response times are measured in microseconds
     • Languages
       – 55% JVM (70% Scala & 30% Java)
       – 20% C/C++
       – 13% Python
       – 10% Ruby
       – 2% Bash
  5. Apache Kafka: What is it?
  6. A distributed publish-subscribe messaging system
     • Originally created by LinkedIn; contributed to Apache in July 2011 and currently in incubation
     • Kafka is written in Scala
     • Multi-language support for the Producer/Consumer API (Scala, Java, Ruby, Python, C++, Go, PHP, etc.)
  7. Apache Kafka: Why do we need it?
  8. Offline log aggregation and real-time messaging
     • Other "log-aggregation only" systems (e.g. Scribe and Flume) are architected around "push" to drive the data.
       – High performance and scale, however:
         • Expected end points are large (e.g. Hadoop)
         • End points can't carry much business logic in real time, because they have to consume as fast as data is pushed to them... unless consuming the data is their main job
     • Messaging systems (e.g. RabbitMQ, ActiveMQ)
       – Do not scale:
         • No API for batching; delivery is transactional (the broker retains each consumer's stream position)
         • No message persistence, so multiple consumers reading the same data over time are impossible, limiting the architecture
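The pull-based, persistent-log design contrasted above can be sketched with a toy in-memory log (plain Python, purely illustrative; none of these names come from the Kafka API): because messages persist after delivery, each consumer tracks its own offset, the broker stores no delivery state, and many consumers can read the same data over time.

```python
# Toy append-only log illustrating Kafka-style pull consumption.
# The "broker" keeps the data, not per-consumer delivery state.
class Log:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def fetch(self, offset, max_messages=2):
        # Consumers pull at their own pace, starting from their own offset.
        return self.messages[offset:offset + max_messages]

log = Log()
for m in ["a", "b", "c", "d"]:
    log.append(m)

# Two independent consumers over the same retained messages,
# each holding its own position:
fast_offset, slow_offset = 0, 0
fast_offset += len(log.fetch(fast_offset, max_messages=4))  # reads a..d
slow_offset += len(log.fetch(slow_offset, max_messages=1))  # reads only a

print(fast_offset, slow_offset)  # 4 1
```

The slow consumer is never forced to keep up, and nothing is deleted on delivery; that is the property the push-based and broker-tracked systems above lack.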
  9. All-in-one system with one architecture and one API
     • Kafka is a specialized system that covers use cases for both offline and real-time log processing.
  10. Apache Kafka: What do we get?
  11. Performance & Scale
      • Producer test:
        – LinkedIn configured the broker in all systems to asynchronously flush messages to its persistence store.
        – For each system, they ran a single producer to publish a total of 10 million messages, each of 200 bytes.
        – They configured the Kafka producer to send messages in batches of size 1 and 50. ActiveMQ and RabbitMQ don't seem to have an easy way to batch messages, so they assumed a batch size of 1 for those systems.
        – In the next slide, the x-axis represents the amount of data sent to the broker over time in MB, and the y-axis corresponds to producer throughput in messages per second. On average, Kafka can publish messages at rates of 50,000 and 400,000 messages per second for batch sizes of 1 and 50, respectively.
      http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
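The figures above are easy to sanity-check with a little arithmetic (a back-of-the-envelope sketch, not taken from the paper):

```python
# 10 million messages of 200 bytes each is about 2 GB of payload.
messages = 10_000_000
msg_bytes = 200
total_gb = messages * msg_bytes / 1e9
print(total_gb)  # 2.0

# At the reported average rates, batching changes total publish time 8x:
secs_batch_1 = messages / 50_000    # batch size 1  -> 200 seconds
secs_batch_50 = messages / 400_000  # batch size 50 -> 25 seconds
print(secs_batch_1, secs_batch_50)  # 200.0 25.0
```

The 8x gap is the amortization of per-request overhead (framing, syscalls, network round trips) across 50 messages instead of 1.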
  12. Performance & Scale (producer throughput chart)
      http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
  13. Performance & Scale
      • Consumer test:
        – In the second experiment, LinkedIn tested the performance of the consumer. Again, for all systems, they used a single consumer to retrieve a total of 10 million messages.
        – They configured all systems so that each pull request would prefetch approximately the same amount of data: up to 1000 messages or about 200KB.
        – For both ActiveMQ and RabbitMQ, they set the consumer acknowledge mode to automatic. Since all messages fit in memory, all systems were serving data from the page cache of the underlying file system or from in-memory buffers.
      http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
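The prefetch sizing ("up to 1000 messages or about 200KB") can be modeled as a fetch that stops at whichever limit is reached first (an illustrative sketch, not the actual consumer API):

```python
def prefetch(messages, max_count=1000, max_bytes=200_000):
    """Return the largest prefix within both the count and byte limits."""
    batch, size = [], 0
    for msg in messages:
        if len(batch) >= max_count or size + len(msg) > max_bytes:
            break
        batch.append(msg)
        size += len(msg)
    return batch

msgs = [b"x" * 200] * 5000   # 200-byte messages, as in the benchmark
batch = prefetch(msgs)
print(len(batch))            # 1000: with 200-byte messages both limits coincide
```

With 200-byte messages the two limits meet exactly (1000 x 200 B = 200 KB), which is presumably why the paper quotes them together.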
  14. Performance & Scale (consumer throughput chart)
      http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
  15. Apache Kafka: How do we get it?
  16. Performance & Scale
      • Producing:
        – The Kafka producer currently doesn't wait for acknowledgements from the broker and sends messages as fast as the broker can handle.
          • This is a valid optimization for the log-aggregation case, as data must be sent asynchronously to avoid introducing any latency into the live serving of traffic. Note that without acknowledgements, there is no guarantee that every published message is actually received by the broker.
          • For many types of log data, it is desirable to trade durability for throughput, as long as the number of dropped messages is relatively small.
          • Durability through replication is being addressed in 0.8.
        – Kafka has a very efficient storage format: http://incubator.apache.org/kafka/design.html
        – Batching
      • Consuming:
        – Kafka does not maintain delivery state, which means zero writes for each consumed message.
        – Kafka uses the sendfile API, transferring bytes from disk to socket entirely within kernel space, saving copies and context switches between kernel and user space.
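The sendfile point can be demonstrated directly: the kernel copies file bytes straight to a socket, with no round trip through user-space buffers. Below is a minimal, Linux-oriented sketch in Python (Kafka itself reaches sendfile through Java NIO's FileChannel.transferTo; this just invokes the same syscall from Python):

```python
import os
import socket
import tempfile

# A small payload that comfortably fits in the socket buffers,
# so the single sendfile call below cannot block.
payload = b"kafka log segment bytes" * 40

with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    f.flush()
    left, right = socket.socketpair()   # stands in for a consumer connection
    with left, right:
        # Kernel-space copy: file descriptor -> socket, no user-space buffer.
        sent = os.sendfile(left.fileno(), f.fileno(), 0, len(payload))
        left.shutdown(socket.SHUT_WR)
        received = right.recv(len(payload), socket.MSG_WAITALL)

print(sent, len(received))
```

A read()/write() loop would instead copy every byte into a user-space buffer and back out again, doubling the memory traffic and adding syscalls per chunk; that is the saving the slide refers to.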
  17. Apache Kafka: Let's jump in
  18. Producer (core/src/main/scala/kafka/tools/ProducerShell.scala)
      /**
       * Interactive shell for producing messages from the command line
       */
      // config setup
      val propsFile = options.valueOf(producerPropsOpt)
      val producerConfig = new ProducerConfig(Utils.loadProps(propsFile))
      val topic = options.valueOf(topicOpt)
      val producer = new Producer[String, String](producerConfig)
      val input = new BufferedReader(new InputStreamReader(System.in))
      var done = false
      while (!done) {
        val line = input.readLine()
        if (line == null) {
          done = true
        } else {
          val message = line.trim
          producer.send(new ProducerData[String, String](topic, message))
          println("Sent: %s (%d bytes)".format(line, message.getBytes.length))
        }
      }
      producer.close()
  19. Consumer (core/src/main/scala/kafka/consumer/ConsoleConsumer.scala)
      /**
       * Consumer that dumps messages out to standard out.
       */
      val connector = Consumer.create(config) // kafka.consumer.ConsumerConnector
      val stream = connector.createMessageStreamsByFilter(filterSpec).get(0)
      val iter = if (maxMessages >= 0) stream.slice(0, maxMessages) else stream
      val formatter: MessageFormatter = messageFormatterClass.newInstance().asInstanceOf[MessageFormatter]
      formatter.init(formatterArgs)
      try {
        for (messageAndTopic <- iter) {
          try {
            formatter.writeTo(messageAndTopic.message, System.out)
          } catch {
            case e =>
              if (skipMessageOnError) error("Error processing message, skipping this message: ", e)
              else throw e
          }
          if (System.out.checkError()) {
            // This means no one is listening to our output stream any more, time to shut down
            System.err.println("Unable to write to standard out, closing consumer.")
            formatter.close()
            connector.shutdown()
            System.exit(1)
          }
        }
      } catch {
        case e => error("Error processing message, stopping consumer: ", e)
      }
  20. Running everything
      Download the Kafka source: http://incubator.apache.org/kafka/downloads.html
      Open a terminal:
      • cp ~/Downloads/kafka-0.7.1-incubating-src.tgz .
      • tar -xvf kafka-0.7.1-incubating-src.tgz
      • cd kafka-0.7.1-incubating
      • ./sbt update
      • ./sbt package
      Open 3 more terminals (http://incubator.apache.org/kafka/quickstart.html):
      • Terminal 1
        – bin/zookeeper-server-start.sh config/zookeeper.properties
      • Terminal 2
        – bin/kafka-server-start.sh config/server.properties
      • Terminal 3
        – bin/kafka-producer-shell.sh --props config/producer.properties --topic scalathon
        – start typing
      • Terminal 4
        – bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic scalathon
      • Terminal 3
        – type some more
      • Terminal 4
        – see what you just typed
        – Ctrl+C
        – bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic scalathon --from-beginning
        – see EVERYTHING you have typed
  21. We are hiring! /* Joe Stein, Chief Architect, http://www.medialets.com, Twitter: @allthingshadoop */
      Medialets: the rich media ad platform for mobile.
      connect@medialets.com
      www.medialets.com/showcase