SlideShare a Scribd company logo
Apache Kafka
Real-Time Data Pipelines
http://kafka.apache.org/
Joe Stein
● Developer, Architect & Technologist
● Founder & Principal Consultant => Big Data Open Source Security LLC - http://stealth.ly
Big Data Open Source Security LLC provides professional services and product solutions for the collection,
storage, transfer, real-time analytics, batch processing and reporting for complex data streams, data sets and
distributed systems. BDOSS is all about the "glue" and helping companies to not only figure out what Big Data
Infrastructure Components to use but also how to change their existing (or build new) systems to work with
them.
● Apache Kafka Committer & PMC member
● Blog & Podcast - http://allthingshadoop.com
● Twitter @allthingshadoop
Overview
● What is Apache Kafka?
○ Data pipelines
○ Architecture
● How does Apache Kafka work?
○ Brokers
○ Producers
○ Consumers
○ Topics
○ Partitions
● How to use Apache Kafka?
○ Existing Integrations
○ Client Libraries
○ Out of the box API
○ Tools
Apache Kafka
● Apache Kafka
○ http://kafka.apache.org
● Apache Kafka Source Code
○ https://github.com/apache/kafka
● Documentation
○ http://kafka.apache.org/documentation.html
● Wiki
○ https://cwiki.apache.org/confluence/display/KAFKA/Index
Data Pipelines
Point to Point Data Pipelines are Problematic
Decouple
Kafka decouples data-pipelines
Kafka Architecture
Topics & Partitions
A high-throughput distributed messaging system
rethought as a distributed commit log.
Replication
Brokers load balance producers by partition
Consumer groups provides isolation to topics and partitions
Consumers rebalance themselves for partitions
How does Kafka do all of this?
● Producers - ** push **
○ Batching
○ Compression
○ Sync (Ack), Async (auto batch)
○ Replication
○ Sequential writes, guaranteed ordering within each partition
● Consumers - ** pull **
○ No state held by broker
○ Consumers control reading from the stream
● Zero Copy for producers and consumers to and from the broker http://kafka.
apache.org/documentation.html#maximizingefficiency
● Message stay on disk when consumed, deletes on TTL with compaction
available in 0.8.1 https://kafka.apache.org/documentation.html#compaction
Traditional Data Copy
https://www.ibm.com/developerworks/linux/library/j-zerocopy/
Zero Copy
https://www.ibm.com/developerworks/linux/library/j-zerocopy/
Performance comparison: Traditional approach vs. zero copy
File size Normal file transfer (ms) transferTo (ms)
7MB 156 45
21MB 337 128
63MB 843 387
98MB 1320 617
200MB 2124 1150
350MB 3631 1762
700MB 13498 4422
1GB 18399 8537
https://www.ibm.com/developerworks/linux/library/j-zerocopy/
Log Compaction
Existing Integrations
https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
● log4j Appender
● Apache Storm
● Apache Camel
● Apache Samza
● Apache Hadoop
● Apache Flume
● Camus
● AWS S3
● Rieman
● Sematext
● Dropwizard
● LogStash
● Fluent
Client Libraries
Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
● Python - Pure Python implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
● C - High performance C library with full protocol support
● C++ - Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.
● Go (aka golang) Pure Go implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
● Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy
compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
● Clojure - Clojure DSL for the Kafka API
● JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
Wire Protocol Developers Guide
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
Really Quick Start
1) Install Vagrant http://www.vagrantup.com/
2) Install Virtual Box https://www.virtualbox.org/
3) git clone https://github.com/stealthly/scala-kafka
4) cd scala-kafka
5) vagrant up
Zookeeper will be running on 192.168.86.5
BrokerOne will be running on 192.168.86.10
All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm)
6) ./gradlew test
Developing Producers
https://github.com/stealthly/scala-kafka/blob/master/src/test/scala/KafkaSpec.scala
val producer = new KafkaProducer(“test-topic”,"192.168.86.10:9092")
producer.send(“hello distributed commit log”)
Producers
https://github.com/stealthly/scala-kafka/blob/master/src/main/scala/KafkaProducer.scala
case class KafkaProducer(
topic: String,
brokerList: String,
/** brokerList - This is for bootstrapping and the producer will only use it for
getting metadata (topics, partitions and replicas). The socket connections for
sending the actual data will be established based on the broker information
returned in the metadata. The format is host1:port1,host2:port2, and the list can
be a subset of brokers or a VIP pointing to a subset of brokers.
*/
Producer
clientId: String = UUID.randomUUID().toString,
/** clientId - The client id is a user-specified string sent in each request to help
trace calls. It should logically identify the application making the request. */
synchronously: Boolean = true,
/** synchronously - This parameter specifies whether the messages are sent
asynchronously in a background thread. Valid values are false for
asynchronous send and true for synchronous send. By setting the producer to
async we allow batching together of requests (which is great for throughput) but
open the possibility of a failure of the client machine dropping unsent data.*/
Producer
compress: Boolean = true,
/** compress -This parameter allows you to specify the compression codec for
all data generated by this producer. When set to true gzip is used. To override
and use snappy you need to implement that as the default codec for
compression using SnappyCompressionCodec.codec instead of
DefaultCompressionCodec.codec below. */
batchSize: Integer = 200,
/** batchSize -The number of messages to send in one batch when using
async mode. The producer will wait until either this number of messages are
ready to send or queue.buffer.max.ms is reached.*/
Producer
messageSendMaxRetries: Integer = 3,
/** messageSendMaxRetries - This property will cause the producer to
automatically retry a failed send request. This property specifies the number of
retries when such failures occur. Note that setting a non-zero value here can
lead to duplicates in the case of network errors that cause a message to be
sent but the acknowledgement to be lost.*/
Producer
requestRequiredAcks: Integer = -1
/** requestRequiredAcks
0) which means that the producer never waits for an acknowledgement from the broker
(the same behavior as 0.7). This option provides the lowest latency but the weakest
durability guarantees (some data will be lost when a server fails).
1) which means that the producer gets an acknowledgement after the leader replica has
received the data. This option provides better durability as the client waits until the server
acknowledges the request as successful (only messages that were written to the now-
dead leader but not yet replicated will be lost).
-1) which means that the producer gets an acknowledgement after all in-sync replicas
have received the data. This option provides the best durability, we guarantee that no
messages will be lost as long as at least one in sync replica remains.*/
val props = new Properties()
val codec = if(compress) DefaultCompressionCodec.codec else NoCompressionCodec.codec
props.put("compression.codec", codec.toString)
http://kafka.apache.org/documentation.html#producerconfigs
props.put("require.requred.acks",requestRequiredAcks.toString)
val producer = new Producer[AnyRef, AnyRef](new ProducerConfig(props))
def kafkaMesssage(message: Array[Byte], partition: Array[Byte]): KeyedMessage[AnyRef, AnyRef] = {
if (partition == null) {
new KeyedMessage(topic,message)
} else {
new KeyedMessage(topic,message, partition)
}
}
Producer
Producer
def send(message: String, partition: String = null): Unit = {
send(message.getBytes("UTF8"), if (partition == null) null else partition.getBytes("UTF8"))
}
def send(message: Array[Byte], partition: Array[Byte]): Unit = {
try {
producer.send(kafkaMesssage(message, partition))
} catch {
case e: Exception =>
e.printStackTrace
System.exit(1)
}
}
High Level Consumer
https://github.com/stealthly/scala-kafka/blob/master/src/main/scala/KafkaConsumer.scala
class KafkaConsumer(
topic: String,
/** topic - The high-level API hides the details of brokers from the consumer
and allows consuming off the cluster of machines without concern for the
underlying topology. It also maintains the state of what has been consumed.
The high-level API also provides the ability to subscribe to topics that match a
filter expression (i.e., either a whitelist or a blacklist regular expression).*/
High Level Consumer
groupId: String,
/** groupId - A string that uniquely identifies the group of consumer processes
to which this consumer belongs. By setting the same group id multiple
processes indicate that they are all part of the same consumer group.*/
zookeeperConnect: String,
/** zookeeperConnect - Specifies the zookeeper connection string in the form
hostname:port where host and port are the host and port of a zookeeper server.
To allow connecting through other zookeeper nodes when that zookeeper
machine is down you can also specify multiple hosts in the form hostname1:
port1,hostname2:port2,hostname3:port3. The server may also have a
zookeeper chroot path as part of it's zookeeper connection string which puts its
data under some path in the global zookeeper namespace. */
High Level Consumer
val props = new Properties()
props.put("group.id", groupId)
props.put("zookeeper.connect", zookeeperConnect)
props.put("auto.offset.reset", if(readFromStartOfStream) "smallest" else "largest")
val config = new ConsumerConfig(props)
val connector = Consumer.create(config)
val filterSpec = new Whitelist(topic)
val stream = connector.createMessageStreamsByFilter(filterSpec, 1, new
DefaultDecoder(), new DefaultDecoder()).get(0)
High Level Consumer
def read(write: (Array[Byte])=>Unit) = {
for(messageAndTopic <- stream) {
try {
write(messageAndTopic.message)
} catch {
case e: Throwable => error("Error processing message, skipping this message: ", e)
}
}
}
High Level Consumer
https://github.com/stealthly/scala-kafka/blob/master/src/test/scala/KafkaSpec.scala
val consumer = new KafkaConsumer(“test-topic”,”groupTest”,"192.168.86.5:2181")
def exec(binaryObject: Array[Byte]) = {
//magic happens
}
consumer.read(exec)
Simple Consumer
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
https://github.com/apache/kafka/blob/0.8/core/src/main/scala/kafka/tools/SimpleConsumerShell.scala
val fetchRequest = fetchRequestBuilder
.addFetch(topic, partitionId, offset, fetchSize)
.build()
System Tools
https://cwiki.apache.org/confluence/display/KAFKA/System+Tools
● Consumer Offset Checker
● Dump Log Segment
● Export Zookeeper Offsets
● Get Offset Shell
● Import Zookeeper Offsets
● JMX Tool
● Kafka Migration Tool
● Mirror Maker
● Replay Log Producer
● Simple Consumer Shell
● State Change Log Merger
● Update Offsets In Zookeeper
● Verify Consumer Rebalance
Replication Tools
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools
● Controlled Shutdown
● Preferred Replica Leader Election Tool
● List Topic Tool
● Create Topic Tool
● Add Partition Tool
● Reassign Partitions Tool
● StateChangeLogMerger Tool
Questions?
/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/

More Related Content

What's hot

Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 
Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021
Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021
Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021
StreamNative
 
WebSocket MicroService vs. REST Microservice
WebSocket MicroService vs. REST MicroserviceWebSocket MicroService vs. REST Microservice
WebSocket MicroService vs. REST Microservice
Rick Hightower
 
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre ZembBuilding a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
StreamNative
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System Overview
Dmitry Tolpeko
 
Micro on NATS - Microservices with Messaging
Micro on NATS - Microservices with MessagingMicro on NATS - Microservices with Messaging
Micro on NATS - Microservices with Messaging
Apcera
 
Modern Distributed Messaging and RPC
Modern Distributed Messaging and RPCModern Distributed Messaging and RPC
Modern Distributed Messaging and RPC
Max Alexejev
 
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
Natan Silnitsky
 
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
Lucas Jellema
 
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
Protecting your data at rest with Apache Kafka by Confluent and VormetricProtecting your data at rest with Apache Kafka by Confluent and Vormetric
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
confluent
 
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
confluent
 
Quantum (OpenStack Meetup Feb 9th, 2012)
Quantum (OpenStack Meetup Feb 9th, 2012)Quantum (OpenStack Meetup Feb 9th, 2012)
Quantum (OpenStack Meetup Feb 9th, 2012)
Dan Wendlandt
 
Apache kafka introduction
Apache kafka introductionApache kafka introduction
Apache kafka introduction
Mohammad Mazharuddin
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache Kafka
Joe Stein
 
Kafka as Message Broker
Kafka as Message BrokerKafka as Message Broker
Kafka as Message Broker
Haluan Irsad
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Shiao-An Yuan
 
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
StreamNative
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
AIMDek Technologies
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
StreamNative
 
OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...
OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...
OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...
eNovance
 

What's hot (20)

Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021
Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021
Deep Dive into the Pulsar Binary Protocol - Pulsar Virtual Summit Europe 2021
 
WebSocket MicroService vs. REST Microservice
WebSocket MicroService vs. REST MicroserviceWebSocket MicroService vs. REST Microservice
WebSocket MicroService vs. REST Microservice
 
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre ZembBuilding a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System Overview
 
Micro on NATS - Microservices with Messaging
Micro on NATS - Microservices with MessagingMicro on NATS - Microservices with Messaging
Micro on NATS - Microservices with Messaging
 
Modern Distributed Messaging and RPC
Modern Distributed Messaging and RPCModern Distributed Messaging and RPC
Modern Distributed Messaging and RPC
 
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
 
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
 
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
Protecting your data at rest with Apache Kafka by Confluent and VormetricProtecting your data at rest with Apache Kafka by Confluent and Vormetric
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
 
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
 
Quantum (OpenStack Meetup Feb 9th, 2012)
Quantum (OpenStack Meetup Feb 9th, 2012)Quantum (OpenStack Meetup Feb 9th, 2012)
Quantum (OpenStack Meetup Feb 9th, 2012)
 
Apache kafka introduction
Apache kafka introductionApache kafka introduction
Apache kafka introduction
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache Kafka
 
Kafka as Message Broker
Kafka as Message BrokerKafka as Message Broker
Kafka as Message Broker
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
 
OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...
OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...
OpenStack in Action 4! Emilien Macchi & Sylvain Afchain - What's new in neutr...
 

Viewers also liked

Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
DataStax Academy
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache Kafka
Joe Stein
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
jstein.cassandra.nyc.2011
jstein.cassandra.nyc.2011jstein.cassandra.nyc.2011
jstein.cassandra.nyc.2011
Joe Stein
 
Storing Time Series Metrics With Cassandra and Composite Columns
Storing Time Series Metrics With Cassandra and Composite ColumnsStoring Time Series Metrics With Cassandra and Composite Columns
Storing Time Series Metrics With Cassandra and Composite Columns
Joe Stein
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
Tao Feng
 
Apache Cassandra 2.0
Apache Cassandra 2.0Apache Cassandra 2.0
Apache Cassandra 2.0
Joe Stein
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
Peerapat Asoktummarungsri
 
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Joe Stein
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Navina Ramesh
 
Containerized Data Persistence on Mesos
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on Mesos
Joe Stein
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
Joe Stein
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
Joe Stein
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
Joe Stein
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonJoe Stein
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
Joe Stein
 
Introduction To Apache Mesos
Introduction To Apache MesosIntroduction To Apache Mesos
Introduction To Apache Mesos
Joe Stein
 

Viewers also liked (20)

Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
jstein.cassandra.nyc.2011
jstein.cassandra.nyc.2011jstein.cassandra.nyc.2011
jstein.cassandra.nyc.2011
 
Storing Time Series Metrics With Cassandra and Composite Columns
Storing Time Series Metrics With Cassandra and Composite ColumnsStoring Time Series Metrics With Cassandra and Composite Columns
Storing Time Series Metrics With Cassandra and Composite Columns
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
 
Apache Cassandra 2.0
Apache Cassandra 2.0Apache Cassandra 2.0
Apache Cassandra 2.0
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
 
Containerized Data Persistence on Mesos
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on Mesos
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
 
Introduction To Apache Mesos
Introduction To Apache MesosIntroduction To Apache Mesos
Introduction To Apache Mesos
 

Similar to Developing Realtime Data Pipelines With Apache Kafka

Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Guido Schmutz
 
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
confluent
 
Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Kafka zero to hero
Kafka zero to heroKafka zero to hero
Kafka zero to hero
Avi Levi
 
Apache Kafka - From zero to hero
Apache Kafka - From zero to heroApache Kafka - From zero to hero
Apache Kafka - From zero to hero
Apache Kafka TLV
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NET
confluent
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
Guido Schmutz
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
kawamuray
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 
Virtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaVirtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to Kafka
Jason Bell
 
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps_Fest
 
Apache Kafka Women Who Code Meetup
Apache Kafka Women Who Code MeetupApache Kafka Women Who Code Meetup
Apache Kafka Women Who Code Meetup
Snehal Nagmote
 
Kafka Deep Dive
Kafka Deep DiveKafka Deep Dive
Kafka Deep Dive
Knoldus Inc.
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3
uzzal basak
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
Samuel Kerrien
 

Similar to Developing Realtime Data Pipelines With Apache Kafka (20)

Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
 
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Kafka zero to hero
Kafka zero to heroKafka zero to hero
Kafka zero to hero
 
Apache Kafka - From zero to hero
Apache Kafka - From zero to heroApache Kafka - From zero to hero
Apache Kafka - From zero to hero
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NET
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Virtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaVirtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to Kafka
 
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
 
Apache Kafka Women Who Code Meetup
Apache Kafka Women Who Code MeetupApache Kafka Women Who Code Meetup
Apache Kafka Women Who Code Meetup
 
Kafka Deep Dive
Kafka Deep DiveKafka Deep Dive
Kafka Deep Dive
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 

More from Joe Stein

SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
Joe Stein
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
Joe Stein
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Joe Stein
 
Building and Deploying Application to Apache Mesos
Building and Deploying Application to Apache MesosBuilding and Deploying Application to Apache Mesos
Building and Deploying Application to Apache Mesos
Joe Stein
 
Apache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on Mesos
Joe Stein
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache Mesos
Joe Stein
 

More from Joe Stein (6)

SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
Building and Deploying Application to Apache Mesos
Building and Deploying Application to Apache MesosBuilding and Deploying Application to Apache Mesos
Building and Deploying Application to Apache Mesos
 
Apache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on Mesos
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache Mesos
 

Recently uploaded

UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
UiPathCommunity
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Developing Realtime Data Pipelines With Apache Kafka

  • 1. Apache Kafka Real-Time Data Pipelines http://kafka.apache.org/
  • 2. Joe Stein ● Developer, Architect & Technologist ● Founder & Principal Consultant => Big Data Open Source Security LLC - http://stealth.ly Big Data Open Source Security LLC provides professional services and product solutions for the collection, storage, transfer, real-time analytics, batch processing and reporting for complex data streams, data sets and distributed systems. BDOSS is all about the "glue" and helping companies to not only figure out what Big Data Infrastructure Components to use but also how to change their existing (or build new) systems to work with them. ● Apache Kafka Committer & PMC member ● Blog & Podcast - http://allthingshadoop.com ● Twitter @allthingshadoop
  • 3. Overview ● What is Apache Kafka? ○ Data pipelines ○ Architecture ● How does Apache Kafka work? ○ Brokers ○ Producers ○ Consumers ○ Topics ○ Partitions ● How to use Apache Kafka? ○ Existing Integrations ○ Client Libraries ○ Out of the box API ○ Tools
  • 4. Apache Kafka ● Apache Kafka ○ http://kafka.apache.org ● Apache Kafka Source Code ○ https://github.com/apache/kafka ● Documentation ○ http://kafka.apache.org/documentation.html ● Wiki ○ https://cwiki.apache.org/confluence/display/KAFKA/Index
  • 6. Point to Point Data Pipelines are Problematic
  • 11. A high-throughput distributed messaging system rethought as a distributed commit log.
  • 13. Brokers load balance producers by partition
  • 14. Consumer groups provides isolation to topics and partitions
  • 16. How does Kafka do all of this? ● Producers - ** push ** ○ Batching ○ Compression ○ Sync (Ack), Async (auto batch) ○ Replication ○ Sequential writes, guaranteed ordering within each partition ● Consumers - ** pull ** ○ No state held by broker ○ Consumers control reading from the stream ● Zero Copy for producers and consumers to and from the broker http://kafka. apache.org/documentation.html#maximizingefficiency ● Message stay on disk when consumed, deletes on TTL with compaction available in 0.8.1 https://kafka.apache.org/documentation.html#compaction
  • 19. Performance comparison: Traditional approach vs. zero copy File size Normal file transfer (ms) transferTo (ms) 7MB 156 45 21MB 337 128 63MB 843 387 98MB 1320 617 200MB 2124 1150 350MB 3631 1762 700MB 13498 4422 1GB 18399 8537 https://www.ibm.com/developerworks/linux/library/j-zerocopy/
  • 21. Existing Integrations https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem ● log4j Appender ● Apache Storm ● Apache Camel ● Apache Samza ● Apache Hadoop ● Apache Flume ● Camus ● AWS S3 ● Rieman ● Sematext ● Dropwizard ● LogStash ● Fluent
  • 22. Client Libraries Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients ● Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. ● C - High performance C library with full protocol support ● C++ - Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset. ● Go (aka golang) Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. ● Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2. ● Clojure - Clojure DSL for the Kafka API ● JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation Wire Protocol Developers Guide https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
  • 23. Really Quick Start 1) Install Vagrant http://www.vagrantup.com/ 2) Install Virtual Box https://www.virtualbox.org/ 3) git clone https://github.com/stealthly/scala-kafka 4) cd scala-kafka 5) vagrant up Zookeeper will be running on 192.168.86.5 BrokerOne will be running on 192.168.86.10 All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm) 6) ./gradlew test
  • 24. Developing Producers https://github.com/stealthly/scala-kafka/blob/master/src/test/scala/KafkaSpec.scala val producer = new KafkaProducer(“test-topic”,"192.168.86.10:9092") producer.send(“hello distributed commit log”)
  • 25. Producers https://github.com/stealthly/scala-kafka/blob/master/src/main/scala/KafkaProducer.scala case class KafkaProducer( topic: String, brokerList: String, /** brokerList - This is for bootstrapping and the producer will only use it for getting metadata (topics, partitions and replicas). The socket connections for sending the actual data will be established based on the broker information returned in the metadata. The format is host1:port1,host2:port2, and the list can be a subset of brokers or a VIP pointing to a subset of brokers. */
  • 26. Producer clientId: String = UUID.randomUUID().toString, /** clientId - The client id is a user-specified string sent in each request to help trace calls. It should logically identify the application making the request. */ synchronously: Boolean = true, /** synchronously - This parameter specifies whether the messages are sent asynchronously in a background thread. Valid values are false for asynchronous send and true for synchronous send. By setting the producer to async we allow batching together of requests (which is great for throughput) but open the possibility of a failure of the client machine dropping unsent data.*/
  • 27. Producer compress: Boolean = true, /** compress -This parameter allows you to specify the compression codec for all data generated by this producer. When set to true gzip is used. To override and use snappy you need to implement that as the default codec for compression using SnappyCompressionCodec.codec instead of DefaultCompressionCodec.codec below. */ batchSize: Integer = 200, /** batchSize -The number of messages to send in one batch when using async mode. The producer will wait until either this number of messages are ready to send or queue.buffer.max.ms is reached.*/
  • 28. Producer messageSendMaxRetries: Integer = 3, /** messageSendMaxRetries - This property will cause the producer to automatically retry a failed send request. This property specifies the number of retries when such failures occur. Note that setting a non-zero value here can lead to duplicates in the case of network errors that cause a message to be sent but the acknowledgement to be lost.*/
  • 29. Producer requestRequiredAcks: Integer = -1 /** requestRequiredAcks 0) which means that the producer never waits for an acknowledgement from the broker (the same behavior as 0.7). This option provides the lowest latency but the weakest durability guarantees (some data will be lost when a server fails). 1) which means that the producer gets an acknowledgement after the leader replica has received the data. This option provides better durability as the client waits until the server acknowledges the request as successful (only messages that were written to the now- dead leader but not yet replicated will be lost). -1) which means that the producer gets an acknowledgement after all in-sync replicas have received the data. This option provides the best durability, we guarantee that no messages will be lost as long as at least one in sync replica remains.*/
  • 30. val props = new Properties() val codec = if(compress) DefaultCompressionCodec.codec else NoCompressionCodec.codec props.put("compression.codec", codec.toString) http://kafka.apache.org/documentation.html#producerconfigs props.put("require.requred.acks",requestRequiredAcks.toString) val producer = new Producer[AnyRef, AnyRef](new ProducerConfig(props)) def kafkaMesssage(message: Array[Byte], partition: Array[Byte]): KeyedMessage[AnyRef, AnyRef] = { if (partition == null) { new KeyedMessage(topic,message) } else { new KeyedMessage(topic,message, partition) } } Producer
  • 31. Producer def send(message: String, partition: String = null): Unit = { send(message.getBytes("UTF8"), if (partition == null) null else partition.getBytes("UTF8")) } def send(message: Array[Byte], partition: Array[Byte]): Unit = { try { producer.send(kafkaMesssage(message, partition)) } catch { case e: Exception => e.printStackTrace System.exit(1) } }
  • 32. High Level Consumer https://github.com/stealthly/scala-kafka/blob/master/src/main/scala/KafkaConsumer.scala class KafkaConsumer( topic: String, /** topic - The high-level API hides the details of brokers from the consumer and allows consuming off the cluster of machines without concern for the underlying topology. It also maintains the state of what has been consumed. The high-level API also provides the ability to subscribe to topics that match a filter expression (i.e., either a whitelist or a blacklist regular expression).*/
  • 33. High Level Consumer groupId: String, /** groupId - A string that uniquely identifies the group of consumer processes to which this consumer belongs. By setting the same group id multiple processes indicate that they are all part of the same consumer group.*/ zookeeperConnect: String, /** zookeeperConnect - Specifies the zookeeper connection string in the form hostname:port where host and port are the host and port of a zookeeper server. To allow connecting through other zookeeper nodes when that zookeeper machine is down you can also specify multiple hosts in the form hostname1: port1,hostname2:port2,hostname3:port3. The server may also have a zookeeper chroot path as part of it's zookeeper connection string which puts its data under some path in the global zookeeper namespace. */
  • 34. High Level Consumer val props = new Properties() props.put("group.id", groupId) props.put("zookeeper.connect", zookeeperConnect) props.put("auto.offset.reset", if(readFromStartOfStream) "smallest" else "largest") val config = new ConsumerConfig(props) val connector = Consumer.create(config) val filterSpec = new Whitelist(topic) val stream = connector.createMessageStreamsByFilter(filterSpec, 1, new DefaultDecoder(), new DefaultDecoder()).get(0)
  • 35. High Level Consumer def read(write: (Array[Byte])=>Unit) = { for(messageAndTopic <- stream) { try { write(messageAndTopic.message) } catch { case e: Throwable => error("Error processing message, skipping this message: ", e) } } }
  • 36. High Level Consumer https://github.com/stealthly/scala-kafka/blob/master/src/test/scala/KafkaSpec.scala val consumer = new KafkaConsumer(“test-topic”,”groupTest”,"192.168.86.5:2181") def exec(binaryObject: Array[Byte]) = { //magic happens } consumer.read(exec)
  • 38. System Tools https://cwiki.apache.org/confluence/display/KAFKA/System+Tools ● Consumer Offset Checker ● Dump Log Segment ● Export Zookeeper Offsets ● Get Offset Shell ● Import Zookeeper Offsets ● JMX Tool ● Kafka Migration Tool ● Mirror Maker ● Replay Log Producer ● Simple Consumer Shell ● State Change Log Merger ● Update Offsets In Zookeeper ● Verify Consumer Rebalance
  • 39. Replication Tools https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools ● Controlled Shutdown ● Preferred Replica Leader Election Tool ● List Topic Tool ● Create Topic Tool ● Add Partition Tool ● Reassign Partitions Tool ● StateChangeLogMerger Tool
  • 40. Questions? /******************************************* Joe Stein Founder, Principal Consultant Big Data Open Source Security LLC http://www.stealth.ly Twitter: @allthingshadoop ********************************************/