Real-time streaming and data pipelines with Apache Kafka

Joe Stein
Joe SteinArchitect, Developer, Technologist
REAL-TIME STREAMING AND DATA PIPELINES WITH
APACHE KAFKA
Real-time Streaming
Real-time Streaming with Data
Pipelines
Sync
Z-Y < 20ms
Y-X < 20ms
Async
Z-Y < 1ms
Y-X < 1ms
Real-time Streaming with Data
Pipelines
Sync
Y-X < 20ms
Async
Y-X < 1ms
Real-time Streaming with Data
Pipelines
Before we get started
 Samples
 https://github.com/stealthly/scala-kafka

 Apache Kafka 0.8.0 Source Code
 https://github.com/apache/kafka

 Docs
 http://kafka.apache.org/documentation.html

 Wiki
 https://cwiki.apache.org/confluence/display/KAFKA/Index

 Presentations
 https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations

Tools
 https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools
 https://github.com/apache/kafka/tree/0.8/core/src/main/scala/kafka/tools
Apache Kafka
 Fast - A single Kafka broker can handle hundreds of megabytes of reads and writes per second
from thousands of clients.
 Scalable - Kafka is designed to allow a single cluster to serve as the central data backbone for a
large organization. It can be elastically and transparently expanded without downtime. Data
streams are partitioned and spread over a cluster of machines to allow data streams larger than
the capability of any single machine and to allow clusters of co-ordinated consumers
 Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.
Each broker can handle terabytes of messages without performance impact.

 Distributed by Design - Kafka has a modern cluster-centric design that offers strong durability
and fault-tolerance guarantees.
Up and running with Apache Kafka
1) Install Vagrant http://www.vagrantup.com/
2) Install Virtual Box https://www.virtualbox.org/
3) git clone https://github.com/stealthly/scala-kafka
4) cd scala-kafka
5) vagrant up
Zookeeper will be running on 192.168.86.5
BrokerOne will be running on 192.168.86.10
All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm)
6) ./sbt test
[success] Total time: 37 s, completed Dec 19, 2013 11:21:13 AM
Maven
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.0</version>
<exclusions>
<!-- https://issues.apache.org/jira/browse/KAFKA-1160 -->
<exclusion>
<groupId>com.sun.jdmk</groupId>
<artifactId>jmxtools</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jmx</groupId>
<artifactId>jmxri</artifactId>
</exclusion>
</exclusions>
</dependency>
SBT
"org.apache.kafka" % "kafka_2.10" % "0.8.0“ intransitive()
Or
libraryDependencies ++= Seq(
"org.apache.kafka" % "kafka_2.10" % "0.8.0",
)
https://issues.apache.org/jira/browse/KAFKA-1160
https://github.com/apache/kafka/blob/0.8/project/Build.scala?source=c
Apache Kafka
Producers

Consumers

Brokers

 Topics
 Brokers
 Sync Producers
 Async Producers
(Async/Sync)=> Akka Producers

 Topics
 Partitions
 Read from the start
 Read from where last left off

 Partitions
 Load Balancing Producers
 Load Balancing Consumer
 In Sync Replicaset (ISR)
 Client API
Producers
 Topics
 Brokers
 Sync Producers
 Async Producers
 (Async/Sync)=> Akka Producers
Producer

/** at least one of these for every partition **/
val producer = new KafkaProducer(“topic","192.168.86.10:9092")
producer.sendMessage(testMessage)
Kafka
Producer

case class KafkaProducer(
topic: String,
brokerList: String,
synchronously: Boolean = true,
compress: Boolean = true,
batchSize: Integer = 200,
messageSendMaxRetries: Integer = 3
){
val props = new Properties()
val codec = if(compress) DefaultCompressionCodec.codec else NoCompressionCodec.codec
props.put("compression.codec", codec.toString)
props.put("producer.type", if(synchronously) "sync" else "async")
props.put("metadata.broker.list", brokerList)
props.put("batch.num.messages", batchSize.toString)
props.put("message.send.max.retries", messageSendMaxRetries.toString)

wrapper for
core/src/main/kafka/producer/Producer.Scala

val producer = new Producer[AnyRef, AnyRef](new ProducerConfig(props))
def sendMessage(message: String) = {
try {
producer.send(new KeyedMessage(topic,message.getBytes))
} catch {
case e: Exception =>
e.printStackTrace
System.exit(1)
}
}
}
Docs
http://kafka.apache.org/documentation.html#producerconfigs

Producer
Config

Source
https://github.com/apache/kafka/blob/0.8/core/src/main/scala/kafka/
producer/ProducerConfig.scala?source=c
class KafkaAkkaProducer extends Actor with Logging {
private val producerCreated = new AtomicBoolean(false)
var producer: KafkaProducer = null
def init(topic: String, zklist: String) = {
if (!producerCreated.getAndSet(true)) {
producer = new KafkaProducer(topic,zklist)
}

Akka Producer

def receive = {
case t: (String, String) ⇒ {
init(t._1, t._2)
}
case msg: String ⇒ {
producer.sendString(msg)
}
}
}
Consumers
Topics
Partitions
Read from the start
Read from where last left off
class KafkaConsumer(
topic: String,
groupId: String,
zookeeperConnect: String,
readFromStartOfStream: Boolean = true
) extends Logging {
val props = new Properties()
props.put("group.id", groupId)
props.put("zookeeper.connect", zookeeperConnect)
props.put("auto.offset.reset", if(readFromStartOfStream) "smallest" else "largest")
val config = new ConsumerConfig(props)
val connector = Consumer.create(config)
val filterSpec = new Whitelist(topic)
val stream = connector.createMessageStreamsByFilter(filterSpec, 1, new DefaultDecoder(), new DefaultDecoder()).get(0)
def read(write: (Array[Byte])=>Unit) = {
for(messageAndTopic <- stream) {
write(messageAndTopic.message)
}
}
def close() {
connector.shutdown()
}
}
Source Sample
https://github.com/apache/kafka/blob/0.8/core/src/main/sca
la/kafka/tools/SimpleConsumerShell.scala?source=c

Low Level
Consumer
val topic = “publisher”
val consumer = new KafkaConsumer(topic ,”loop”,"192.168.86.5:2181")
val count = 2
val pdc = system.actorOf(Props[KafkaAkkaProducer].withRouter(RoundRobinRouter(count)), "router")
1 to countforeach { i =>(
pdc ! (topic,"192.168.86.10:9092"))

}
def exec(binaryObject: Array[Byte]) = {
val message = new String(binaryObject)
pdc ! message
}
pdc ! “go, go gadget stream 1”
pdc ! “go, go gadget stream 2”
consumer.read(exec)
Brokers
 Partitions
 Load Balancing Producers
 Load Balancing Consumer
 In Sync Replicaset (ISR)
 Client API
Partitions make up a Topic
Load Balancing Producer By
Partition
Load Balancing Consumer By
Partition
Replication – In Sync Replicas
Client API
o Developers Guide
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
o Non-JVM Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
o Python
o Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.

oC
o High performance C library with full protocol support

o C++
o Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.

o Go (aka golang)
o Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.

o Ruby
o Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.

o Clojure
o https://github.com/pingles/clj-kafka/ 0.8 branch
Questions?
/***********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly

Twitter: @allthingshadoop
************************************/
1 of 27

Recommended

Being Ready for Apache Kafka - Apache: Big Data Europe 2015 by
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Michael Noll
8.9K views57 slides
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache... by
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
7.5K views72 slides
How to manage large amounts of data with akka streams by
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsIgor Mielientiev
727 views24 slides
Reactive Stream Processing with Akka Streams by
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsKonrad Malawski
14.2K views123 slides
Akka Streams and HTTP by
Akka Streams and HTTPAkka Streams and HTTP
Akka Streams and HTTPRoland Kuhn
41.6K views58 slides
Exactly-once Stream Processing with Kafka Streams by
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsGuozhang Wang
3.7K views74 slides

More Related Content

What's hot

Akka streams scala italy2015 by
Akka streams scala italy2015Akka streams scala italy2015
Akka streams scala italy2015mircodotta
2K views58 slides
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And... by
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
23K views31 slides
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React... by
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Lightbend
12.4K views41 slides
Kafka Streams: the easiest way to start with stream processing by
Kafka Streams: the easiest way to start with stream processingKafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processingYaroslav Tkachenko
6.6K views40 slides
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu... by
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Shirshanka Das
17.1K views39 slides
Akka streams by
Akka streamsAkka streams
Akka streamsmircodotta
5K views59 slides

What's hot(20)

Akka streams scala italy2015 by mircodotta
Akka streams scala italy2015Akka streams scala italy2015
Akka streams scala italy2015
mircodotta2K views
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And... by Lightbend
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lightbend23K views
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React... by Lightbend
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Lightbend12.4K views
Kafka Streams: the easiest way to start with stream processing by Yaroslav Tkachenko
Kafka Streams: the easiest way to start with stream processingKafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processing
Yaroslav Tkachenko6.6K views
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu... by Shirshanka Das
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Shirshanka Das17.1K views
Realtime Statistics based on Apache Storm and RocketMQ by Xin Wang
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
Xin Wang2K views
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka... by Reactivesummit
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Reactivesummit2.1K views
Kafka Summit SF 2017 - Exactly-once Stream Processing with Kafka Streams by confluent
Kafka Summit SF 2017 - Exactly-once Stream Processing with Kafka StreamsKafka Summit SF 2017 - Exactly-once Stream Processing with Kafka Streams
Kafka Summit SF 2017 - Exactly-once Stream Processing with Kafka Streams
confluent3.2K views
Journey into Reactive Streams and Akka Streams by Kevin Webber
Journey into Reactive Streams and Akka StreamsJourney into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka Streams
Kevin Webber1.6K views
Specs2 whirlwind tour at Scaladays 2014 by Eric Torreborre
Specs2 whirlwind tour at Scaladays 2014Specs2 whirlwind tour at Scaladays 2014
Specs2 whirlwind tour at Scaladays 2014
Eric Torreborre2.5K views
Introducing Exactly Once Semantics To Apache Kafka by Apurva Mehta
Introducing Exactly Once Semantics To Apache KafkaIntroducing Exactly Once Semantics To Apache Kafka
Introducing Exactly Once Semantics To Apache Kafka
Apurva Mehta6.1K views
Reactor, Reactive streams and MicroServices by Stéphane Maldini
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
Stéphane Maldini5.6K views
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka... by confluent
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
confluent3.4K views
Asynchronous Orchestration DSL on squbs by Anil Gursel
Asynchronous Orchestration DSL on squbsAsynchronous Orchestration DSL on squbs
Asynchronous Orchestration DSL on squbs
Anil Gursel525 views
Akka Streams - From Zero to Kafka by Mark Harrison
Akka Streams - From Zero to KafkaAkka Streams - From Zero to Kafka
Akka Streams - From Zero to Kafka
Mark Harrison743 views
ksqlDB: A Stream-Relational Database System by confluent
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent1.4K views
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo by Joe Stein
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Joe Stein3.1K views
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax by Databricks
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. SaxIntroducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Databricks1.7K views

Viewers also liked

Kinesis vs-kafka-and-kafka-deep-dive by
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveYifeng Jiang
13K views25 slides
Developing Real-Time Data Pipelines with Apache Kafka by
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
61.4K views35 slides
Kafka at Scale: Multi-Tier Architectures by
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
13.4K views33 slides
Real time Analytics with Apache Kafka and Apache Spark by
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
84K views54 slides
Introduction to Apache Kafka by
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
52.2K views70 slides
Apache Kafka 0.8 basic training - Verisign by
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
178K views119 slides

Viewers also liked(6)

Kinesis vs-kafka-and-kafka-deep-dive by Yifeng Jiang
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
Yifeng Jiang13K views
Developing Real-Time Data Pipelines with Apache Kafka by Joe Stein
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein61.4K views
Kafka at Scale: Multi-Tier Architectures by Todd Palino
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
Todd Palino13.4K views
Real time Analytics with Apache Kafka and Apache Spark by Rahul Jain
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain84K views
Introduction to Apache Kafka by Jeff Holoman
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman52.2K views
Apache Kafka 0.8 basic training - Verisign by Michael Noll
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
Michael Noll178K views

Similar to Real-time streaming and data pipelines with Apache Kafka

Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30... by
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...HostedbyConfluent
439 views45 slides
Spark Streaming Info by
Spark Streaming InfoSpark Streaming Info
Spark Streaming InfoDoug Chang
1.4K views10 slides
Sparkstreaming by
SparkstreamingSparkstreaming
SparkstreamingMarilyn Waldman
517 views48 slides
Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021 by
Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021
Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021StreamNative
115 views18 slides
5 Takeaways from Migrating a Library to Scala 3 - Scala Love by
5 Takeaways from Migrating a Library to Scala 3 - Scala Love5 Takeaways from Migrating a Library to Scala 3 - Scala Love
5 Takeaways from Migrating a Library to Scala 3 - Scala LoveNatan Silnitsky
424 views50 slides
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra by
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraJoe Stein
9.7K views37 slides

Similar to Real-time streaming and data pipelines with Apache Kafka(20)

Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30... by HostedbyConfluent
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
HostedbyConfluent439 views
Spark Streaming Info by Doug Chang
Spark Streaming InfoSpark Streaming Info
Spark Streaming Info
Doug Chang1.4K views
Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021 by StreamNative
Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021
Simplifying Migration from Kafka to Pulsar - Pulsar Summit NA 2021
StreamNative115 views
5 Takeaways from Migrating a Library to Scala 3 - Scala Love by Natan Silnitsky
5 Takeaways from Migrating a Library to Scala 3 - Scala Love5 Takeaways from Migrating a Library to Scala 3 - Scala Love
5 Takeaways from Migrating a Library to Scala 3 - Scala Love
Natan Silnitsky424 views
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra by Joe Stein
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein9.7K views
Kafka streams - From pub/sub to a complete stream processing platform by Paolo Castagna
Kafka streams - From pub/sub to a complete stream processing platformKafka streams - From pub/sub to a complete stream processing platform
Kafka streams - From pub/sub to a complete stream processing platform
Paolo Castagna1K views
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ... by DataStax Academy
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy1.2K views
Camel Kafka Connectors: Tune Kafka to “Speak” with (Almost) Everything (Andre... by HostedbyConfluent
Camel Kafka Connectors: Tune Kafka to “Speak” with (Almost) Everything (Andre...Camel Kafka Connectors: Tune Kafka to “Speak” with (Almost) Everything (Andre...
Camel Kafka Connectors: Tune Kafka to “Speak” with (Almost) Everything (Andre...
HostedbyConfluent3.7K views
Kafka Connect & Streams - the ecosystem around Kafka by Guido Schmutz
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
Guido Schmutz1.9K views
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe... by confluent
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
confluent5.7K views
Apache Camel - The integration library by Claus Ibsen
Apache Camel - The integration libraryApache Camel - The integration library
Apache Camel - The integration library
Claus Ibsen3.6K views
Spring into rails by Hiro Asari
Spring into railsSpring into rails
Spring into rails
Hiro Asari1K views
Introduction to Apache Camel by Claus Ibsen
Introduction to Apache CamelIntroduction to Apache Camel
Introduction to Apache Camel
Claus Ibsen5.6K views
Continuous Delivery: The Next Frontier by Carlos Sanchez
Continuous Delivery: The Next FrontierContinuous Delivery: The Next Frontier
Continuous Delivery: The Next Frontier
Carlos Sanchez1.6K views
Big data and hadoop training - Session 5 by hkbhadraa
Big data and hadoop training - Session 5Big data and hadoop training - Session 5
Big data and hadoop training - Session 5
hkbhadraa397 views
Apache Tuscany 2.x Extensibility And SPIs by Raymond Feng
Apache Tuscany 2.x Extensibility And SPIsApache Tuscany 2.x Extensibility And SPIs
Apache Tuscany 2.x Extensibility And SPIs
Raymond Feng744 views

More from Joe Stein

Streaming Processing with a Distributed Commit Log by
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogJoe Stein
3.1K views77 slides
SMACK Stack 1.1 by
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1Joe Stein
5.7K views61 slides
Get started with Developing Frameworks in Go on Apache Mesos by
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosJoe Stein
1.9K views50 slides
Introduction To Apache Mesos by
Introduction To Apache MesosIntroduction To Apache Mesos
Introduction To Apache MesosJoe Stein
1.7K views49 slides
Developing Real-Time Data Pipelines with Apache Kafka by
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
2.7K views40 slides
Developing Frameworks for Apache Mesos by
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache MesosJoe Stein
2.3K views38 slides

More from Joe Stein(20)

Streaming Processing with a Distributed Commit Log by Joe Stein
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
Joe Stein3.1K views
SMACK Stack 1.1 by Joe Stein
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
Joe Stein5.7K views
Get started with Developing Frameworks in Go on Apache Mesos by Joe Stein
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
Joe Stein1.9K views
Introduction To Apache Mesos by Joe Stein
Introduction To Apache MesosIntroduction To Apache Mesos
Introduction To Apache Mesos
Joe Stein1.7K views
Developing Real-Time Data Pipelines with Apache Kafka by Joe Stein
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein2.7K views
Developing Frameworks for Apache Mesos by Joe Stein
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
Joe Stein2.3K views
Making Distributed Data Persistent Services Elastic (Without Losing All Your ... by Joe Stein
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Joe Stein2.2K views
Containerized Data Persistence on Mesos by Joe Stein
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on Mesos
Joe Stein3.6K views
Making Apache Kafka Elastic with Apache Mesos by Joe Stein
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
Joe Stein7.9K views
Building and Deploying Application to Apache Mesos by Joe Stein
Building and Deploying Application to Apache MesosBuilding and Deploying Application to Apache Mesos
Building and Deploying Application to Apache Mesos
Joe Stein27.2K views
Apache Kafka, HDFS, Accumulo and more on Mesos by Joe Stein
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on Mesos
Joe Stein7.1K views
Developing with the Go client for Apache Kafka by Joe Stein
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache Kafka
Joe Stein20.9K views
Current and Future of Apache Kafka by Joe Stein
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein9K views
Introduction Apache Kafka by Joe Stein
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
Joe Stein5.1K views
Introduction to Apache Mesos by Joe Stein
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache Mesos
Joe Stein19.1K views
Developing Realtime Data Pipelines With Apache Kafka by Joe Stein
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
Joe Stein4.9K views
Apache Cassandra 2.0 by Joe Stein
Apache Cassandra 2.0Apache Cassandra 2.0
Apache Cassandra 2.0
Joe Stein6.4K views
Storing Time Series Metrics With Cassandra and Composite Columns by Joe Stein
Storing Time Series Metrics With Cassandra and Composite ColumnsStoring Time Series Metrics With Cassandra and Composite Columns
Storing Time Series Metrics With Cassandra and Composite Columns
Joe Stein5.1K views
Apache Kafka by Joe Stein
Apache KafkaApache Kafka
Apache Kafka
Joe Stein23.8K views
Hadoop Streaming Tutorial With Python by Joe Stein
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
Joe Stein9.9K views

Recently uploaded

Black and White Modern Science Presentation.pptx by
Black and White Modern Science Presentation.pptxBlack and White Modern Science Presentation.pptx
Black and White Modern Science Presentation.pptxmaryamkhalid2916
16 views21 slides
Report 2030 Digital Decade by
Report 2030 Digital DecadeReport 2030 Digital Decade
Report 2030 Digital DecadeMassimo Talia
15 views41 slides
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveNetwork Automation Forum
30 views35 slides
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院IttrainingIttraining
41 views8 slides
Scaling Knowledge Graph Architectures with AI by
Scaling Knowledge Graph Architectures with AIScaling Knowledge Graph Architectures with AI
Scaling Knowledge Graph Architectures with AIEnterprise Knowledge
28 views15 slides
Case Study Copenhagen Energy and Business Central.pdf by
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdfAitana
16 views3 slides

Recently uploaded(20)

Black and White Modern Science Presentation.pptx by maryamkhalid2916
Black and White Modern Science Presentation.pptxBlack and White Modern Science Presentation.pptx
Black and White Modern Science Presentation.pptx
maryamkhalid291616 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2217 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta19 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software257 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn21 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb13 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker33 views
1st parposal presentation.pptx by i238212
1st parposal presentation.pptx1st parposal presentation.pptx
1st parposal presentation.pptx
i2382129 views
6g - REPORT.pdf by Liveplex
6g - REPORT.pdf6g - REPORT.pdf
6g - REPORT.pdf
Liveplex10 views

Real-time streaming and data pipelines with Apache Kafka

  • 1. REAL-TIME STREAMING AND DATA PIPELINES WITH APACHE KAFKA
  • 3. Real-time Streaming with Data Pipelines Sync Z-Y < 20ms Y-X < 20ms Async Z-Y < 1ms Y-X < 1ms
  • 4. Real-time Streaming with Data Pipelines Sync Y-X < 20ms Async Y-X < 1ms
  • 5. Real-time Streaming with Data Pipelines
  • 6. Before we get started  Samples  https://github.com/stealthly/scala-kafka  Apache Kafka 0.8.0 Source Code  https://github.com/apache/kafka  Docs  http://kafka.apache.org/documentation.html  Wiki  https://cwiki.apache.org/confluence/display/KAFKA/Index  Presentations  https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations Tools  https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools  https://github.com/apache/kafka/tree/0.8/core/src/main/scala/kafka/tools
  • 7. Apache Kafka  Fast - A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.  Scalable - Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers  Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.  Distributed by Design - Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  • 8. Up and running with Apache Kafka 1) Install Vagrant http://www.vagrantup.com/ 2) Install Virtual Box https://www.virtualbox.org/ 3) git clone https://github.com/stealthly/scala-kafka 4) cd scala-kafka 5) vagrant up Zookeeper will be running on 192.168.86.5 BrokerOne will be running on 192.168.86.10 All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm) 6) ./sbt test [success] Total time: 37 s, completed Dec 19, 2013 11:21:13 AM
  • 10. SBT "org.apache.kafka" % "kafka_2.10" % "0.8.0“ intransitive() Or libraryDependencies ++= Seq( "org.apache.kafka" % "kafka_2.10" % "0.8.0", ) https://issues.apache.org/jira/browse/KAFKA-1160 https://github.com/apache/kafka/blob/0.8/project/Build.scala?source=c
  • 11. Apache Kafka Producers Consumers Brokers  Topics  Brokers  Sync Producers  Async Producers (Async/Sync)=> Akka Producers  Topics  Partitions  Read from the start  Read from where last left off  Partitions  Load Balancing Producers  Load Balancing Consumer  In Sync Replicaset (ISR)  Client API
  • 12. Producers  Topics  Brokers  Sync Producers  Async Producers  (Async/Sync)=> Akka Producers
  • 13. Producer /** at least one of these for every partition **/ val producer = new KafkaProducer(“topic","192.168.86.10:9092") producer.sendMessage(testMessage)
  • 14. Kafka Producer case class KafkaProducer( topic: String, brokerList: String, synchronously: Boolean = true, compress: Boolean = true, batchSize: Integer = 200, messageSendMaxRetries: Integer = 3 ){ val props = new Properties() val codec = if(compress) DefaultCompressionCodec.codec else NoCompressionCodec.codec props.put("compression.codec", codec.toString) props.put("producer.type", if(synchronously) "sync" else "async") props.put("metadata.broker.list", brokerList) props.put("batch.num.messages", batchSize.toString) props.put("message.send.max.retries", messageSendMaxRetries.toString) wrapper for core/src/main/kafka/producer/Producer.Scala val producer = new Producer[AnyRef, AnyRef](new ProducerConfig(props)) def sendMessage(message: String) = { try { producer.send(new KeyedMessage(topic,message.getBytes)) } catch { case e: Exception => e.printStackTrace System.exit(1) } } }
  • 16. class KafkaAkkaProducer extends Actor with Logging { private val producerCreated = new AtomicBoolean(false) var producer: KafkaProducer = null def init(topic: String, zklist: String) = { if (!producerCreated.getAndSet(true)) { producer = new KafkaProducer(topic,zklist) } Akka Producer def receive = { case t: (String, String) ⇒ { init(t._1, t._2) } case msg: String ⇒ { producer.sendString(msg) } } }
  • 17. Consumers Topics Partitions Read from the start Read from where last left off
  • 18. class KafkaConsumer( topic: String, groupId: String, zookeeperConnect: String, readFromStartOfStream: Boolean = true ) extends Logging { val props = new Properties() props.put("group.id", groupId) props.put("zookeeper.connect", zookeeperConnect) props.put("auto.offset.reset", if(readFromStartOfStream) "smallest" else "largest") val config = new ConsumerConfig(props) val connector = Consumer.create(config) val filterSpec = new Whitelist(topic) val stream = connector.createMessageStreamsByFilter(filterSpec, 1, new DefaultDecoder(), new DefaultDecoder()).get(0) def read(write: (Array[Byte])=>Unit) = { for(messageAndTopic <- stream) { write(messageAndTopic.message) } } def close() { connector.shutdown() } }
  • 20. val topic = “publisher” val consumer = new KafkaConsumer(topic ,”loop”,"192.168.86.5:2181") val count = 2 val pdc = system.actorOf(Props[KafkaAkkaProducer].withRouter(RoundRobinRouter(count)), "router") 1 to countforeach { i =>( pdc ! (topic,"192.168.86.10:9092")) } def exec(binaryObject: Array[Byte]) = { val message = new String(binaryObject) pdc ! message } pdc ! “go, go gadget stream 1” pdc ! “go, go gadget stream 2” consumer.read(exec)
  • 21. Brokers  Partitions  Load Balancing Producers  Load Balancing Consumer  In Sync Replicaset (ISR)  Client API
  • 23. Load Balancing Producer By Partition
  • 24. Load Balancing Consumer By Partition
  • 25. Replication – In Sync Replicas
  • 26. Client API o Developers Guide https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol o Non-JVM Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients o Python o Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. oC o High performance C library with full protocol support o C++ o Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset. o Go (aka golang) o Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. o Ruby o Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2. o Clojure o https://github.com/pingles/clj-kafka/ 0.8 branch
  • 27. Questions? /*********************************** Joe Stein Founder, Principal Consultant Big Data Open Source Security LLC http://www.stealth.ly Twitter: @allthingshadoop ************************************/