Real-time streaming and data pipelines with Apache Kafka

Get up and running quickly with Apache Kafka http://kafka.apache.org/

* Fast * A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

* Scalable * Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capacity of any single machine and to allow clusters of coordinated consumers.

* Durable * Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

* Distributed by Design * Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.


Transcript

  • 1. REAL-TIME STREAMING AND DATA PIPELINES WITH APACHE KAFKA
  • 2. Real-time Streaming
  • 3. Real-time Streaming with Data Pipelines (diagram: Sync Z-Y < 20ms, Y-X < 20ms; Async Z-Y < 1ms, Y-X < 1ms)
  • 4. Real-time Streaming with Data Pipelines (diagram: Sync Y-X < 20ms; Async Y-X < 1ms)
  • 5. Real-time Streaming with Data Pipelines
  • 6. Before we get started
    - Samples: https://github.com/stealthly/scala-kafka
    - Apache Kafka 0.8.0 source code: https://github.com/apache/kafka
    - Docs: http://kafka.apache.org/documentation.html
    - Wiki: https://cwiki.apache.org/confluence/display/KAFKA/Index
    - Presentations: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations
    - Tools: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools and https://github.com/apache/kafka/tree/0.8/core/src/main/scala/kafka/tools
  • 7. Apache Kafka
    - Fast: a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
    - Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capacity of any single machine and to allow clusters of coordinated consumers.
    - Durable: messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
    - Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  • 8. Up and running with Apache Kafka
    1) Install Vagrant http://www.vagrantup.com/
    2) Install VirtualBox https://www.virtualbox.org/
    3) git clone https://github.com/stealthly/scala-kafka
    4) cd scala-kafka
    5) vagrant up
       ZooKeeper will be running on 192.168.86.5
       BrokerOne will be running on 192.168.86.10
       All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the VM)
    6) ./sbt test
       [success] Total time: 37 s, completed Dec 19, 2013 11:21:13 AM
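    Beyond the test suite, a quick manual smoke test (not from the deck) is to point the console clients that ship with the Kafka 0.8 distribution at the Vagrant VMs. The topic name here is illustrative; with default broker settings it is auto-created on first use:

       # produce a few lines by hand (type messages, Ctrl-C to exit)
       bin/kafka-console-producer.sh --broker-list 192.168.86.10:9092 --topic smoke-test

       # read everything back from the beginning
       bin/kafka-console-consumer.sh --zookeeper 192.168.86.5:2181 --topic smoke-test --from-beginning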
  • 9. Maven
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.10</artifactId>
      <version>0.8.0</version>
      <exclusions>
        <!-- https://issues.apache.org/jira/browse/KAFKA-1160 -->
        <exclusion>
          <groupId>com.sun.jdmk</groupId>
          <artifactId>jmxtools</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.sun.jmx</groupId>
          <artifactId>jmxri</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  • 10. SBT
    "org.apache.kafka" % "kafka_2.10" % "0.8.0" intransitive()
    Or
    libraryDependencies ++= Seq(
      "org.apache.kafka" % "kafka_2.10" % "0.8.0"
    )
    https://issues.apache.org/jira/browse/KAFKA-1160
    https://github.com/apache/kafka/blob/0.8/project/Build.scala?source=c
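    Pulling the two slides together, a minimal build.sbt sketch that mirrors the Maven exclusions for KAFKA-1160 (the project name and Scala version are illustrative, not from the deck):

       // minimal build.sbt sketch; name and scalaVersion are assumptions
       name := "kafka-example"

       scalaVersion := "2.10.3"

       // exclusions work around https://issues.apache.org/jira/browse/KAFKA-1160
       libraryDependencies += ("org.apache.kafka" % "kafka_2.10" % "0.8.0"
         exclude("com.sun.jdmk", "jmxtools")
         exclude("com.sun.jmx", "jmxri"))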
  • 11. Apache Kafka
    Producers
      - Topics
      - Brokers
      - Sync Producers
      - Async Producers
      - (Async/Sync) => Akka Producers
    Consumers
      - Topics
      - Partitions
      - Read from the start
      - Read from where last left off
    Brokers
      - Partitions
      - Load Balancing Producers
      - Load Balancing Consumer
      - In Sync Replica Set (ISR)
      - Client API
  • 12. Producers
    - Topics
    - Brokers
    - Sync Producers
    - Async Producers
    - (Async/Sync) => Akka Producers
  • 13. Producer
    /** at least one of these for every partition **/
    val producer = new KafkaProducer("topic", "192.168.86.10:9092")
    producer.sendMessage(testMessage)
  • 14. Kafka Producer
    // wrapper for core/src/main/scala/kafka/producer/Producer.scala
    case class KafkaProducer(
      topic: String,
      brokerList: String,
      synchronously: Boolean = true,
      compress: Boolean = true,
      batchSize: Integer = 200,
      messageSendMaxRetries: Integer = 3
    ) {
      val props = new Properties()
      val codec = if (compress) DefaultCompressionCodec.codec else NoCompressionCodec.codec
      props.put("compression.codec", codec.toString)
      props.put("producer.type", if (synchronously) "sync" else "async")
      props.put("metadata.broker.list", brokerList)
      props.put("batch.num.messages", batchSize.toString)
      props.put("message.send.max.retries", messageSendMaxRetries.toString)

      val producer = new Producer[AnyRef, AnyRef](new ProducerConfig(props))

      def sendMessage(message: String) = {
        try {
          producer.send(new KeyedMessage(topic, message.getBytes))
        } catch {
          case e: Exception =>
            e.printStackTrace
            System.exit(1)
        }
      }
    }
  • 15. Producer Config
    Docs: http://kafka.apache.org/documentation.html#producerconfigs
    Source: https://github.com/apache/kafka/blob/0.8/core/src/main/scala/kafka/producer/ProducerConfig.scala?source=c
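    As a sketch of how the wrapper above could use more of those documented 0.8 producer configs, two illustrative additions (the values chosen here are assumptions, not from the deck):

       // illustrative extra settings for the props built in the KafkaProducer wrapper
       props.put("request.required.acks", "1")      // 0 = fire-and-forget, 1 = leader ack, -1 = all in-sync replicas
       props.put("queue.buffering.max.ms", "5000")  // max time to buffer when producer.type=async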
  • 16. Akka Producer
    class KafkaAkkaProducer extends Actor with Logging {
      private val producerCreated = new AtomicBoolean(false)
      var producer: KafkaProducer = null

      def init(topic: String, zklist: String) = {
        if (!producerCreated.getAndSet(true)) {
          producer = new KafkaProducer(topic, zklist)
        }
      }

      def receive = {
        case t: (String, String) => init(t._1, t._2)
        case msg: String => producer.sendString(msg)
      }
    }
  • 17. Consumers
    - Topics
    - Partitions
    - Read from the start
    - Read from where last left off
  • 18. Kafka Consumer
    class KafkaConsumer(
      topic: String,
      groupId: String,
      zookeeperConnect: String,
      readFromStartOfStream: Boolean = true
    ) extends Logging {
      val props = new Properties()
      props.put("group.id", groupId)
      props.put("zookeeper.connect", zookeeperConnect)
      props.put("auto.offset.reset", if (readFromStartOfStream) "smallest" else "largest")

      val config = new ConsumerConfig(props)
      val connector = Consumer.create(config)
      val filterSpec = new Whitelist(topic)
      val stream = connector.createMessageStreamsByFilter(filterSpec, 1, new DefaultDecoder(), new DefaultDecoder()).get(0)

      def read(write: (Array[Byte]) => Unit) = {
        for (messageAndTopic <- stream) {
          write(messageAndTopic.message)
        }
      }

      def close() {
        connector.shutdown()
      }
    }
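    A minimal usage sketch of this wrapper (topic, group id, and the println handler are illustrative; slide 20 shows the fuller version with Akka):

       val consumer = new KafkaConsumer("publisher", "loop", "192.168.86.5:2181")
       // read() blocks, iterating the stream and passing each message payload to the callback
       consumer.read(bytes => println(new String(bytes)))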
  • 19. Low Level Consumer
    Source sample: https://github.com/apache/kafka/blob/0.8/core/src/main/scala/kafka/tools/SimpleConsumerShell.scala?source=c
  • 20. val topic = "publisher"
    val consumer = new KafkaConsumer(topic, "loop", "192.168.86.5:2181")
    val count = 2
    val pdc = system.actorOf(Props[KafkaAkkaProducer].withRouter(RoundRobinRouter(count)), "router")
    1 to count foreach { i => pdc ! (topic, "192.168.86.10:9092") }

    def exec(binaryObject: Array[Byte]) = {
      val message = new String(binaryObject)
      pdc ! message
    }

    pdc ! "go, go gadget stream 1"
    pdc ! "go, go gadget stream 2"
    consumer.read(exec)
  • 21. Brokers
    - Partitions
    - Load Balancing Producers
    - Load Balancing Consumer
    - In Sync Replica Set (ISR)
    - Client API
  • 22. Partitions make up a Topic
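    To make this concrete (not from the deck): in 0.8 a topic's partition count is fixed at creation time, e.g. via the bundled shell tool. The topic name and counts below are illustrative; replication is 1 because the Vagrant setup runs a single broker:

       # create a topic spread over 3 partitions on the cluster
       bin/kafka-create-topic.sh --zookeeper 192.168.86.5:2181 --topic publisher --partition 3 --replica 1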
  • 23. Load Balancing Producer By Partition
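    A hedged sketch of how a 0.8 producer influences partition placement: messages sent with a key are routed by the partitioner (hash of the key modulo partition count under the default partitioner), so one key consistently lands on one partition. The key and payload here are illustrative, reusing the Producer[AnyRef, AnyRef] from slide 14:

       import kafka.producer.KeyedMessage

       // same key => same partition under the default hash partitioner
       val keyed = new KeyedMessage[AnyRef, AnyRef]("publisher", "user-42".getBytes, "payload".getBytes)
       producer.send(keyed)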
  • 24. Load Balancing Consumer By Partition
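    Correspondingly on the consumer side (a sketch, not from the deck): within one group, the high-level consumer assigns each partition to exactly one stream, so asking the connector for several streams spreads the topic's partitions across them. Topic, group id, and stream count are illustrative:

       import java.util.Properties
       import kafka.consumer.{Consumer, ConsumerConfig}

       val props = new Properties()
       props.put("group.id", "loop")
       props.put("zookeeper.connect", "192.168.86.5:2181")

       // 2 streams in one group share the topic's partitions between them;
       // each stream would typically be drained by its own thread
       val connector = Consumer.create(new ConsumerConfig(props))
       val streams = connector.createMessageStreams(Map("publisher" -> 2))("publisher")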
  • 25. Replication – In Sync Replicas
  • 26. Client API
    - Developers Guide: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
    - Non-JVM Clients: https://cwiki.apache.org/confluence/display/KAFKA/Clients
      - Python: pure Python implementation with full protocol support. Consumer and producer implementations included; GZIP and Snappy compression supported.
      - C: high-performance C library with full protocol support.
      - C++: native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.
      - Go (aka golang): pure Go implementation with full protocol support. Consumer and producer implementations included; GZIP and Snappy compression supported.
      - Ruby: pure Ruby; consumer and producer implementations included; GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
      - Clojure: https://github.com/pingles/clj-kafka/ (0.8 branch)
  • 27. Questions? /*********************************** Joe Stein Founder, Principal Consultant Big Data Open Source Security LLC http://www.stealth.ly Twitter: @allthingshadoop ************************************/