© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Intro to Apache Kafka
Jason Hubbard | Systems Engineer
Kafka Overview
What is Kafka?
• Developed by LinkedIn after challenges building pipelines into Hadoop
• Message-based store used to build data pipelines and support streaming applications
• Kafka offers
•Publish & subscribe semantics
•Horizontal scalability
•High availability
•Nodes in a Kafka cluster (called brokers) can handle
•Reads/writes per second in the 100s of MBs
•Thousands of producers and consumers
•Multiple node failures (with proper configuration)
Why Kafka? (Or rather, why not Flume?)
• No ability to replay events
• Multiple sinks require event replication (via multiple channels)
• Sinks that share a source (mostly) process events in sync
(Diagram: a Flume topology in which two agents read Logs and More Logs with spool sources and forward them through channels to Avro sinks; a downstream agent's Avro source fans out through channels to HBase and HDFS sinks writing to HBase and HDFS.)
Why Kafka for Hadoop?
(Diagram: Hadoop data ingest architecture in 2009 vs. 2012.)
Why Kafka? Decoupling
(Diagram: pipeline architecture in 2012 vs. 2013 and beyond.)
A Departure from Legacy Models
• Message stores have two well-known types
• Queues (“producer-consumers”)
• Topics (“publisher-subscribers”)
• One consumer gets one message from a queue, then it’s gone
•Consumers might work alone or in concert
• Multiple subscribers can get one message from a topic
•Messages are “published”
• Kafka inverts or blends these concepts
•Tracks consumers by group identification
•Retains messages by expiration, not consumer interaction
•Bakes in partitioning for scalability and parallel operations
•Bakes in replication for availability and fault tolerance
Components & Roles
• A Kafka server is called a broker
•Brokers can work together in a cluster
• Each broker hosts message stores called topics
•You can partition a topic across brokers for scale and parallelism
•You can also replicate a topic for resilience to failure
•Producers push to a Kafka topic, consumers pull
•Kafka provides Consumer and Producer APIs
Detailed Architecture
It’s all about the logs!
…No, not application logs
Kafka Detailed Architecture
• Brokers and consumers initialize their state in Zookeeper
• Broker state includes hostname, port, and partition list
• Consumer state includes group name and message offsets (storing offsets in Zookeeper is deprecated)
(Diagram: producers writing to and consumers reading from a Kafka cluster of brokers, with Zookeeper coordinating the brokers and storing offsets.)
Kafka and Zookeeper
• Kafka uses Zookeeper
• To indicate ‘liveness’ of each broker
• To store broker and consumer state
• To coordinate leader elections for failover
• Zookeeper stores consumer offsets by default
• This can be switched to the brokers, if desired
• Zookeeper also tracks and supports state changes such as
• Adding/removing brokers and consumers
• Rebalancing consumers
• Directing producers and consumers to partition leaders
Topic Partitions
• A partition is a totally-ordered store of messages (a log)
• Partition order is immutable
• Messages are deleted when their retention period expires
• New messages can only be appended
• The message offset is both a sequence number and a unique identifier within a (topic, partition) pair, as the sketch after the diagram shows
(Diagram: three partition logs for one topic. Partition 0 holds offsets 0-13, Partition 1 holds offsets 0-11, Partition 2 holds offsets 0-13; writes are appended at the new end, with older messages toward the old end.)
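Because an offset uniquely identifies a message within its (topic, partition), a consumer can seek to any retained position and re-read from there. A minimal sketch using the Java client from Scala; the broker address, topic, partition, and starting offset are illustrative assumptions, not from the slides.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "replay-demo")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val p0 = new TopicPartition("device_status", 0)

// assign() takes explicit partitions (no group rebalancing); seek() moves the offset
consumer.assign(Collections.singletonList(p0))
consumer.seek(p0, 5L)   // start re-reading at offset 5

// Each record carries its (topic, partition, offset) coordinates
// (poll(long) here; clients 2.0+ also accept java.time.Duration)
for (r <- consumer.poll(1000).asScala)
  println(s"${r.topic}/${r.partition} @ ${r.offset}: ${r.value}")
consumer.close()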
How are partitions distributed?
• Partitions are usually distributed across brokers
• Each broker may host partitions of several topics
• One broker acts as leader for any replicated partition
•Other brokers with a replica act as followers
•Only the leader serves read/write requests
• If the leader fails, a follower is elected to take over
• Election occurs only among in-sync replicas (ISRs)
Scalability & Parallelism
• Partitions allow a topic's message storage to exceed one broker's capacity
•More brokers = greater message capacity
•Partitions also allow consumer groups to read a topic in parallel
•Each group member can read its own partition(s)
•Kafka ensures that consumers within one group never contend for the same partition
Replication
• A topic partition is the unit of replication
• A replica remains in-sync with its leader so long as
• It maintains communication with Zookeeper
• It does not fall too far behind the leader (configurable)
•Replicating to n brokers allows Kafka to remain available through n - 1 broker losses
•How good that guarantee is in practice depends on the number of in-sync replicas (ISRs)
Fault Tolerance
• A broker may lead for some partitions and follow for others
• The replication factor of each topic determines how many brokers will follow
• Followers passively replicate the leader
•You can set an ISR policy
•Boils down to preference for high, medium, or low throughput
•The right ISR policy strikes some balance between
•Availability: electing a leader quickly in the event of failure
•Latency: assuring a producer its messages are safe (i.e., durable)
Producers
• Producers publish data (messages) to Kafka topics
• Producers choose the partition a message goes to
• By selecting in round-robin fashion to distribute the load
• By assigning a semantic partitioning function to key the messages (both strategies are sketched below)
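The two strategies above map directly onto the producer API: a record sent without a key is spread across partitions by the client, while a keyed record is hashed so equal keys always land in the same partition. A minimal sketch using the Java client from Scala; the broker address, topic, and key are illustrative assumptions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// No key: the client distributes records across partitions to balance load
producer.send(new ProducerRecord[String, String]("device_status", "sensor offline"))

// With a key: "device-42" always hashes to the same partition, preserving per-key order
producer.send(new ProducerRecord[String, String]("device_status", "device-42", "sensor offline"))

producer.close()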
Consumers
• A consumer reads messages published to Kafka topics by moving its offset
•The offset increments by default
•Every consumer specifies a group label (see the sketch below)
• Consumer actions in one group do not affect other groups
• If one group "tails" a topic's messages, it does not change what another group can consume
• Consumers come and go with little impact on the cluster or other consumers
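A minimal sketch of the group label in practice, using the Java client from Scala: every consumer started with the same group.id shares the topic's partitions, while a consumer with a different group.id gets its own independent view of the same messages. The broker address, group, and topic names are illustrative assumptions.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "alerting")   // consumers sharing this label split the partitions between them
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("device_status"))

// Polling advances this group's offsets only; a second group (say, "auditing")
// reading the same topic is unaffected.
while (true) {
  for (r <- consumer.poll(1000).asScala)
    println(s"offset ${r.offset}: ${r.value}")
}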
Kafka Consumer Group Operation
• Within a consumer group, all messages in a partition are read by the same single instance
• Group members can be processes residing on separate machines
• The diagram below shows a two-broker cluster
• The brokers host one topic in four partitions, P0-P3
• Group A has two instances; each instance reads two partitions
• Group B has four instances; each instance reads one partition
(Diagram: a two-broker Kafka cluster hosting partitions P0-P3. Consumer Group A has instances C1 and C2, each reading two partitions; Consumer Group B has instances C3-C6, each reading one partition.)
Messages
• Kafka stores messages in its own format
•Producers and consumers also use this format for transfer efficiency
• Any serializable object can be a message
• Popular formats include string, JSON, and Avro
• Each message’s offset also serves as its unique identifier within a topic partition
Traditional Message Ordering
• Traditional queues store messages in the order received
• Consumers draw messages in store order
• With multiple consumers, however, messages are not received in order
•Consumers may experience different delays
•They might also consume messages at different rates
•To retain order, only one process may consume from the queue, which comes at the expense of parallelism
Guarantees for Ordering
• Kafka appends messages sent by a producer to one partition in sending order
• If a producer sends M1 followed by M2 to the same partition
• M1 will have a lower offset than M2
• M1 will appear earlier in the partition (see the sketch below)
• A consumer always sees messages in stored order
• Given a partition with replication factor N, up to N - 1 server failures may occur without message loss
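The offset ordering can be observed from the metadata the broker returns on each send. A minimal sketch with the Java client from Scala; both records use the same key so they land in the same partition (the broker address, topic, and key are illustrative assumptions).

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// Same key => same partition, so the broker appends M1 before M2
val m1 = producer.send(new ProducerRecord[String, String]("device_status", "device-42", "M1")).get()
val m2 = producer.send(new ProducerRecord[String, String]("device_status", "device-42", "M2")).get()
assert(m1.offset < m2.offset)   // M1 was appended first, so it has the lower offset
producer.close()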
Message Retention
• The Kafka cluster retains messages for a configurable length of time
• You can set the retention time per topic or globally
•You can also set a storage limit on any topic (a programmatic sketch follows)
• Kafka deletes messages upon expiration
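For reference, a hedged sketch of setting time- and size-based retention on a single topic programmatically. It uses the AdminClient API, which arrived in Kafka clients newer than the command-line tools shown in the demo; the broker address, topic name, partition and replica counts, and retention values are illustrative assumptions (the same settings can be applied with the kafka-topics or kafka-configs tools).

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.collection.JavaConverters._

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val admin = AdminClient.create(props)

// Topic with 3 partitions and replication factor 2, keeping messages for
// 7 days (retention.ms) or up to 1 GiB per partition (retention.bytes),
// whichever limit is reached first.
val topic = new NewTopic("device_status", 3, 2.toShort).configs(
  Map("retention.ms" -> "604800000", "retention.bytes" -> "1073741824").asJava)

admin.createTopics(Collections.singleton(topic)).all().get()
admin.close()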
Demo
Creating Topics
• Kafka ships with command line tools useful for exploring
• The kafka-topics tool creates topics via Zookeeper
• The default Zookeeper port is 2181
• To create and list the topic device_status
• Use the --list parameter to list all topics
$ kafka-topics --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic device_status
$ kafka-topics --list --zookeeper localhost:2181
Creating a Producer
• Use kafka-console-producer to publish messages
• Requires a broker list, e.g., localhost:9092
• Provide a comma-delimited list for failover protection
• Provide the name of the topic
• We will log messages to the topic named device_status
$ kafka-console-producer --broker-list localhost:9092 \
    --topic device_status
Creating a Consumer
• The kafka-console-consumer tool is a simple consumer
• It uses ZooKeeper to connect; below we access localhost:2181
• We also name a topic: device_status
• To read all available messages on the topic, we use the --from-beginning option
$ kafka-console-consumer --zookeeper localhost:2181 \
    --topic device_status --from-beginning
Creating a Spark Consumer
• Spark Streaming can act as a Kafka consumer via the kafka010 direct stream API

import org.apache.spark.streaming._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(sc, Seconds(1))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "kafkaintro",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val topics = Array("TopicA")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

stream.map(_.value).print()
ssc.start()
Common & Best Practices
Tip: Balance Throughput & Durability
• Producers specify the durability they need with the property request.required.acks (sketched below)
• Adding brokers can improve throughput
• Common practice:

Durability | Behaviour                | Per-Event Latency | Required Acks (request.required.acks)
Highest    | All replicas are in-sync | Highest           | -1
Moderate   | Leader ACKs the message  | Medium            | 1
Lowest     | No ACKs required         | Lowest            | 0

Property              | Value
replication           | 3
min.insync.replicas   | 2
request.required.acks | -1
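A sketch of the highest-durability row with the Java producer, where the equivalent property is named acks rather than request.required.acks. It assumes the topic was created with replication factor 3 and min.insync.replicas=2 as in the table above; the broker address and topic name are illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// acks=all (equivalent to -1): the leader waits for every in-sync replica before
// acknowledging. With replication 3 and min.insync.replicas 2, an acknowledged
// write survives a single broker failure. Use "1" or "0" for lower per-event
// latency at the cost of durability, as in the table above.
props.put("acks", "all")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("device_status", "device-42", "OK")).get()
producer.close()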
Tip: Consider Message Keys
• A Kafka message is stored as a KV pair
•The key is not used in the default case
•A producer can set content in a message key, then use a Partitioner subclass to hash the key (a sketch follows this list)
• This allows the producer to effect semantic partitioning
•Example: DEBUG, INFO, WARN, ERROR partitions for a syslog topic
• Kafka guarantees messages with the same partition hash are stored in the same partition
•A consumer group could then pair each member with an intended partition
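A hedged sketch of such a Partitioner for the syslog example, written against the Java producer's org.apache.kafka.clients.producer.Partitioner interface; the class name, level-to-partition mapping, and fallback hashing are illustrative assumptions.

import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

// Routes each syslog level to a fixed partition; any other key falls back to a hash.
class SyslogLevelPartitioner extends Partitioner {
  private val levels = Map("DEBUG" -> 0, "INFO" -> 1, "WARN" -> 2, "ERROR" -> 3)

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions = cluster.partitionsForTopic(topic).size
    levels.getOrElse(String.valueOf(key),
      (String.valueOf(key).hashCode & 0x7fffffff) % numPartitions)
  }

  override def close(): Unit = ()
  override def configure(configs: JMap[String, _]): Unit = ()
}

// The producer registers the class with:
//   props.put("partitioner.class", classOf[SyslogLevelPartitioner].getName)
// A consumer group member could then be assigned the partition for its level.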
Tip: Writing Files to Topics
• Kafka will accept file content as a message
• Write a file’s data to the device_alerts topic:
• Then read it:
$ cat alerts.txt | kafka-console-producer \
    --broker-list localhost:9092 --topic device_alerts

$ kafka-console-consumer --zookeeper localhost:2181 \
    --topic device_alerts --from-beginning

Remember that the consumer offsets might be stored in Kafka instead of Zookeeper.
Best Uses
• Kafka is intended for storing messages
•Log records
•Event information
•For small messages, latency in the tens of milliseconds is common
• Kafka is not well-suited for large file transfers
•Keeping messages small (under roughly 10 KB) helps preserve low latency
Thank you
Jason.Hubbard@cloudera.com