Kafka Introduction
Apache Kafka ATL Meetup
Jeff Holoman
© 2015 Cloudera, Inc. All rights reserved.
Agenda
• Introduction
• Concepts
• Use Cases
• Development
• Administration
• Cluster Planning
• Cloudera Integration and Roadmap (New Stuff)
Obligatory "About Me" Slide
Sr. Systems Engineer
Cloudera
@jeffholoman
Areas of Focus
• Data Ingest
• Financial Services
• Healthcare
Former Oracle
• Developer
• DBA
Apache Kafka Contributor
Co-Namer of “Flafka”
Cloudera engineering
blogger
I mostly spend my time now
helping build large-scale
solutions in the Financial
Services Vertical
Before we get started
• Kafka is picking up steam in the community, and a lot of things are changing
(patches welcome!)
• dev@kafka.apache.org
• users@kafka.apache.org
• http://ingest.tips/ - Great Blog
• Things we won’t cover in much detail
• Wire protocol
• Security-related stuff (KAFKA-1682)
• Mirror-maker
• Let's keep this interactive.
• If you are a Kafka user and want to talk at the next one of these, please let
me know.
Concepts
Basic Kafka Concepts
Why Kafka
[Diagram: a single client reading from one source]
Data pipelines start like this.
Why Kafka
[Diagram: several clients now reusing the same source]
Then we reuse them.
Why Kafka
[Diagram: additional backends added as endpoints to the existing sources]
Then we add additional endpoints to the existing sources.
Why Kafka
[Diagram: many clients wired point-to-point to many backends]
Then it starts to look like this.
Why Kafka
[Diagram: a full mesh of clients and backends, with cross-links between backends]
With maybe some of this.
As distributed systems and services increasingly become part of a modern architecture, this makes for a fragile system.
Why Kafka
Kafka decouples data pipelines.
[Diagram: source systems publish to Kafka (producers → brokers); Hadoop, security systems, real-time monitoring, and the data warehouse read from it (consumers)]
Key terminology
• Kafka maintains feeds of messages in categories called topics.
• Processes that publish messages to a Kafka topic are called
producers.
• Processes that subscribe to topics and process the feed of published
messages are called consumers.
• Kafka runs as a cluster of one or more servers, each of which is called a
broker.
• Communication between all components uses a simple, high-performance
binary protocol over TCP.
Efficiency
• Kafka achieves its high throughput and low latency primarily through
two key concepts
• 1) Batching of individual messages to amortize network overhead
and append/consume chunks together
• 2) Zero copy I/O using sendfile (Java’s NIO FileChannel transferTo
method).
• Uses the Linux sendfile() system call, which skips unnecessary copies
• Heavily relies on Linux PageCache
• The I/O scheduler will batch together consecutive small writes into bigger physical writes
which improves throughput.
• The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head
which improves throughput.
• It automatically uses all the free memory on the machine
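The transferTo path above can be sketched in plain Java. This is a standalone illustration of the same NIO call Kafka's broker uses (class and file names here are hypothetical, not Kafka code):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {

    // Move a file's bytes into another channel without staging them in
    // user-space buffers: transferTo() maps to sendfile() on Linux.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long position = 0;
            long size = in.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("log-segment", ".bin");
        Path dst = Files.createTempFile("copy", ".bin");
        Files.write(src, "batched message payload".getBytes());
        System.out.println(transfer(src, dst) + " bytes transferred");
    }
}
```

In the broker the destination is a socket channel rather than a file, which is what lets consumers read log segments without the data ever passing through the JVM heap.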
Efficiency - Implication
• In a system where consumers are roughly caught up with producers,
you are essentially reading data from cache.
• This is what allows end-to-end latency to be so low
Architecture
[Diagram: producers send to a Kafka cluster of brokers; consumers fetch from the brokers and store their offsets in Zookeeper]
Topics - Partitions
• Topics are broken up into ordered commit logs called partitions.
• Each message in a partition is assigned a sequential id called an
offset.
• Data is retained for a configurable period of time*
[Diagram: three partitions, each an ordered, append-only log of offsets starting at 0; writes land at the new end, with the oldest messages at the other]
Message Ordering
• Ordering is only guaranteed within a partition for a topic
• To ensure ordering:
• Group messages in a partition by key (producer)
• Configure exactly one consumer instance per partition within a consumer
group
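Keyed grouping can be sketched with a simple hash partitioner. This is illustrative only, not byte-for-byte the partitioner Kafka's producer ships with:

```java
public class KeyPartitioner {

    // Hash the key into a partition index. A given key always lands in the
    // same partition, so its messages keep their relative order.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int first = partitionFor("account-42", 3);
        int second = partitionFor("account-42", 3);
        // Same key, same partition: ordering holds for this account's events.
        System.out.println("account-42 -> partition " + first
                + " (stable: " + (first == second) + ")");
    }
}
```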
Guarantees
• Messages sent by a producer to a particular topic partition will be
appended in the order they are sent
• A consumer instance sees messages in the order they are stored in
the log
• For a topic with replication factor N, Kafka can tolerate up to N-1
server failures without “losing” any messages committed to the log
Topics - Replication
• Topics can (and should) be replicated.
• The unit of replication is the partition
• Each partition in a topic has 1 leader and 0 or more follower replicas.
• A replica is deemed to be “in-sync” if
• The replica can communicate with Zookeeper
• The replica is not “too far” behind the leader (configurable)
• The group of in-sync replicas for a partition is called the ISR (In-Sync
Replicas)
• The Replication factor cannot be lowered
Topics - Replication
• Durability can be configured with the producer configuration
request.required.acks
• 0 The producer never waits for an ack
• 1 The producer gets an ack after the leader replica has received the data
• -1 The producer gets an ack after all ISRs receive the data
• Minimum available ISR can also be configured such that an error is
returned if enough replicas are not available to replicate data
Durable Writes
• Producers can choose to trade throughput for durability of writes:

  Durability   Behaviour                          Per-Event Latency   Required Acks (request.required.acks)
  Highest      ACK all ISRs have received         Highest             -1
  Medium       ACK once the leader has received   Medium              1
  Lowest       No ACKs required                   Lowest              0

• Throughput can also be raised with more brokers… (so do this instead)!
• A sane configuration:

  Property                 Value
  replication              3
  min.insync.replicas      2
  request.required.acks    -1
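The producer side of that sane configuration can be sketched as old-producer properties. The broker addresses are placeholders, and note that replication and min.insync.replicas are broker/topic settings, not producer properties:

```java
import java.util.Properties;

public class DurableProducerConfig {

    static Properties durableConfig() {
        Properties props = new Properties();
        // Placeholder broker list -- substitute your own brokers.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        // -1: wait until every in-sync replica has the write before acking.
        props.put("request.required.acks", "-1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(durableConfig());
    }
}
```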
Producer
• Producers publish to a topic of their choosing (push)
• Load can be distributed
• Typically by “round-robin”
• Can also do “semantic partitioning” based on a key in the message
• Brokers load balance by partition
• Can support async (less durable) sending
• All nodes can answer metadata requests about:
• Which servers are alive
• Where leaders are for the partitions of a topic
Producer – Load Balancing and ISRs
[Diagram: topic my_topic with 3 partitions and 3 replicas spread across brokers 100, 101, and 102:
  Partition 0 – Leader: 100, ISR: 101,102
  Partition 1 – Leader: 101, ISR: 100,102
  Partition 2 – Leader: 102, ISR: 101,100]
Consumer
• Multiple consumers can read from the same topic
• Each consumer is responsible for managing its own offset
• Messages stay in Kafka; they are not removed after they are consumed
[Diagram: a producer writes messages at offsets 1234567-1234577; three consumers fetch independently, each tracking its own position in the log]
Consumer
• Consumers can go away
[Diagram: same log; one consumer has disconnected and the remaining two keep fetching]
Consumer
• And then come back
[Diagram: the consumer reconnects and resumes fetching from its saved offset]
Consumer - Groups
• Consumers can be organized into Consumer Groups
Common Patterns:
• 1) All consumer instances in one group
• Acts like a traditional queue with load balancing
• 2) All consumer instances in different groups
• All messages are broadcast to all consumer instances
• 3) “Logical Subscriber” – Many consumer instances in a group
• Consumers are added for scalability and fault tolerance
• Each consumer instance reads from one or more partitions for a topic
• There cannot be more consumer instances than partitions
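The "no more instances than partitions" rule falls out of how a topic's partitions are divided among a group. A toy assignment sketch (not Kafka's actual rebalancing code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAssignment {

    // Deal partitions out to consumer instances round-robin. Instances
    // beyond the partition count end up with nothing to read.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three consumers but only two partitions: c3 is assigned nothing.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3"), 2));
    }
}
```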
Consumer - Groups
[Diagram: partitions P0-P3 across brokers 1 and 2; Consumer Group A holds C1-C2, Consumer Group B holds C3-C6]
Consumer groups provide isolation to topics and partitions.
Consumer - Groups
[Diagram: a consumer in Group A fails; the group rebalances its partitions across the remaining members]
Groups can rebalance themselves.
Delivery Semantics
• At least once (the default)
• Messages are never lost but may be redelivered
• At most once
• Messages are lost but never redelivered
• Exactly once
• Messages are delivered once and only once
Delivery Semantics
• Exactly once – messages delivered once and only once – is much harder (impossible??)
Getting Exactly Once Semantics
• Must consider two components
• Durability guarantees when publishing a message
• Durability guarantees when consuming a message
• Producer
• What happens when a produce request was sent but a network error returned
before an ack?
• Use a single writer per partition and check the latest committed value after
network errors
• Consumer
• Include a unique ID (e.g. UUID) and de-duplicate.
• Consider storing offsets with data
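The unique-ID de-duplication idea in miniature. In a real consumer the seen-ID set must be persisted atomically with the output (e.g. by storing offsets with the data), or it resets on restart:

```java
import java.util.HashSet;
import java.util.Set;

public class DedupConsumer {

    // IDs we have already applied downstream; redeliveries hit this set.
    private final Set<String> seen = new HashSet<>();

    // Returns true if the message was applied, false if it was a duplicate.
    boolean process(String messageId, String payload) {
        if (!seen.add(messageId)) {
            return false; // at-least-once redelivery: drop it
        }
        // ... apply payload to downstream state here ...
        return true;
    }

    public static void main(String[] args) {
        DedupConsumer consumer = new DedupConsumer();
        System.out.println(consumer.process("uuid-1", "order created")); // true
        System.out.println(consumer.process("uuid-1", "order created")); // false: duplicate dropped
    }
}
```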
Use Cases
• Real-Time Stream Processing (combined with Spark Streaming)
• General purpose Message Bus
• Collecting User Activity Data
• Collecting Operational Metrics from applications, servers or devices
• Log Aggregation
• Change Data Capture
• Commit Log for distributed systems
Positioning
Should I use Kafka ?
• For really large file transfers?
• Probably not, it’s designed for “messages” not really for files. If you need to ship large files,
consider good-ole-file transfer, or breaking up the files and reading per line to move to
Kafka.
• As a replacement for MQ/Rabbit/Tibco
• Probably. Performance Numbers are drastically superior. Also gives the ability for transient
consumers. Handles failures pretty well.
• If security on the broker and across the wire is important?
• Not right now. We can’t really enforce much in the way of security. (KAFKA-1682)
• To do transformations of data?
• Not really by itself
Developing with Kafka
API and Clients
Kafka Clients
• Remember Kafka implements a binary TCP protocol.
• All clients except the JVM-based clients are maintained external to the code
base.
• Full Client List here:
Java Producer Example – Old (< 0.8.1)

/* start the producer */
private void start() {
    producer = new Producer<String, String>(config);
}

/* create record and send to Kafka
 * because the key is null, data will be sent to a random partition.
 * the producer will switch to a different random partition every 10 minutes
 */
private void produce(String s) {
    KeyedMessage<String, String> message =
        new KeyedMessage<String, String>(topic, null, s);
    producer.send(message);
}
Producer - New

/**
 * Send the given record asynchronously and return a future which will eventually contain
 * the response information.
 *
 * @param record The record to send
 * @return A future which will eventually contain the response information
 */
public Future<RecordMetadata> send(ProducerRecord record);

/**
 * Send a record and invoke the given callback when the record has been acknowledged
 * by the server
 */
public Future<RecordMetadata> send(ProducerRecord record, Callback callback);

KAFKA-1982
Consumers
• New Consumer coming later in the year.
• For now two options
• High Level Consumer -> Much easier to code against, simpler to work with, but gives
you less control, particularly around managing consumption.
• Simple Consumer -> A lot more control, but hard to implement correctly.
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
• If you can, use the High Level Consumer. Be warned: the default behavior is to
"auto-commit" offsets, meaning offsets are committed periodically for you. If
you care about data loss, don't do this.
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
http://ingest.tips/2014/10/12/kafka-high-level-consumer-frequently-missing-pieces/
Kafka Clients
• Full list on the wiki, some examples…
Log4j Appender
https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/producer/KafkaLog4jAppender.scala
log4j.appender.KAFKA=kafka.producer.KafkaLog4jAppender
log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.KAFKA.BrokerList=192.168.86.10:9092
log4j.appender.KAFKA.ProducerType=async
log4j.appender.KAFKA.Topic=logs
Syslog Producer
• Syslog producer -
https://github.com/stealthly/go_kafka_client/tree/master/syslog
Data Formatting
• There are three basic approaches
• Wild West Crazy
• LinkedIn Way
• https://github.com/schema-repo/schema-repo
• http://confluent.io/docs/current/schema-registry/docs/index.html
• Parse-Magic Way – Scaling Data
He means traffic lights
Having understood the downsides of working with plaintext
log messages, the next logical evolution step is to use a
loosely-structured, easy-to-evolve markup scheme. JSON is
a popular format we’ve seen adopted by multiple
companies, and tried ourselves. It has the advantage of
being easily parseable, arbitrarily extensible, and amenable
to compression.
JSON log messages generally start out as an adequate
solution, but then gradually spiral into a persistent
nightmare
Jimmy Lin and Dmitriy Ryaboy
Twitter, Inc
Scaling Big Data Mining Infrastructure:
The Twitter Experience
http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
Administration
Basic
Installation / Configuration
Source: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf
[root@dev ~]# cd /opt/cloudera/csd/
[root@dev csd]# wget http://archive-primary.cloudera.com/cloudera-labs/csds/kafka/CLABS_KAFKA-1.0.0.jar
--2014-11-27 14:57:07-- http://archive-primary.cloudera.com/cloudera-labs/csds/kafka/CLABS_KAFKA-1.0.0.jar
Resolving archive-primary.cloudera.com... 184.73.217.71
Connecting to archive-primary.cloudera.com|184.73.217.71|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5017 (4.9K) [application/java-archive]
Saving to: “CLABS_KAFKA-1.0.0.jar”
100%[===============================================================================>] 5,017 --.-K/s in 0s
2014-11-27 14:57:07 (421 MB/s) - “CLABS_KAFKA-1.0.0.jar” saved [5017/5017]
[root@dev csd]# service cloudera-scm-server restart
Stopping cloudera-scm-server: [ OK ]
Starting cloudera-scm-server: [ OK ]
[root@dev csd]#
1. Download the Cloudera Labs Kafka CSD into /opt/cloudera/csd/
2. Restart CM
3. CSD will be included in CM as of 5.4
Installation / Configuration
Source: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf
4. Download and distribute the Kafka parcel.
5. Add a Kafka service.
6. Review the configuration.
• Create a topic*:
• List topics:
• Write data:
• Read data:
Usage
bin/kafka-topics.sh --zookeeper zkhost:2181 --create --topic foo --replication-factor 1 --partitions 1
bin/kafka-topics.sh --zookeeper zkhost:2181 --list
cat data | bin/kafka-console-producer.sh --broker-list brokerhost:9092 --topic test
bin/kafka-console-consumer.sh --zookeeper zkhost:2181 --topic test --from-beginning
* deleting a topic not possible in 0.8.2
Kafka Topics - Admin
• Commands to administer topics are via shell script:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-topics.sh --list --zookeeper localhost:2181
Topics can be created dynamically by producers…don’t do this
Sizing and Planning
Guidelines and estimates
Sizing / Planning
• Cloudera is in the process of benchmarking on different hardware
configurations
• Early testing shows that relatively small clusters (3-5 nodes) can
process > 1M messages/s, with some caveats
• We will release these when they are ready
Producer Performance – Single Thread

  Type            Records/sec   MB/s   Avg Latency (ms)   Max Latency   Median Latency   95th %tile
  No Replication  1,100,182     104    42                 1070          1                362
  3x Async        1,056,546     101    42                 1157          2                323
  3x Sync         493,855       47     379                4483          192              1692
Benchmark Results
[Charts omitted]
Hardware Selection
• A standard Hadoop datanode configuration is recommended.
• 12-14 disks are sufficient; SATA 7200 (more frequent flushes may
benefit from SAS)
• RAID 10 or JBOD? I have an opinion about this, not necessarily
shared by all:
• Writes in JBOD will round-robin but all writes to a given partition will
be contiguous. This could lead to imbalance across disks
• Rebuilding a failed volume can effectively disable a server with
I/O requests. (debatable)
• Kafka built-in replication should make up for reliability provided by RAID
• Kafka utilizes Page Cache heavily so additional memory will be put to good
use.
• For a lower bound, calculate write_throughput * time_in_buffer (in seconds)
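That lower bound is just a product; a quick sketch with hypothetical numbers:

```java
public class BufferMemory {

    // Memory needed to keep recent writes in page cache:
    // bytes/second written * how long (seconds) data should stay cached.
    static long lowerBoundBytes(long writeBytesPerSec, long secondsInBuffer) {
        return writeBytesPerSec * secondsInBuffer;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 100 MB/s of writes, consumers lagging up to 30 s -> 3000 MB of cache
        System.out.println(lowerBoundBytes(100 * mb, 30) / mb + " MB");
    }
}
```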
Other notes
• 10Gb bonded NICs are preferred – even without async sends you can definitely
saturate a network
• Consider a dedicated zookeeper cluster (3-5 nodes)
• 1U machines, 2-4 SSDs, stock RAM
• This is less of an issue with Kafka-based offsets; you can share with Hadoop if you
want
Cloudera Integration
Current state and Roadmap
Flume (Flafka)
• Allows for zero-coding ingestion into HDFS / HBase / Solr
• Can utilize custom routing and per-event processing via selectors
and interceptors
• Batch size of source can be tuned for throughput/latency
• Lots of other stuff… full details here:
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/
Flume (Flafka)
• Flume Integrates with Kafka as of CDH 5.2
• Source
• Sink
• Channel
Flafka
[Diagram: data sources (logs, JMS, web servers, etc.) and Kafka feed a Flume agent – sources, interceptors, selectors, channels, sinks – which produces to Kafka and writes to HDFS]
Flafka Channel
[Diagram: producers write to Kafka, which serves as the Flume channel; sinks deliver to HDFS, HBase, and Solr]
Spark Streaming
Consuming data in parallel for high throughput:
• Each input DStream creates a single receiver, which runs on a single machine. But
often with Kafka the NIC can be a bottleneck. To consume data in parallel from
multiple machines, initialize multiple KafkaInputDStreams with the same consumer
group id. Each input DStream will (for the most part) run on a separate machine.
• In addition, each DStream can use multiple threads to consume from Kafka
Writing to Kafka from Spark or Spark Streaming:
Directly use the Kafka Producer API from your Spark code.
Example:
KafkaWordCount
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
New in Spark 1.3
• http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
• The new release of Apache Spark, 1.3, includes new RDD and
DStream implementations for reading data from Apache Kafka.
• More uniform usage of Spark cluster resources when consuming from Kafka
• Control of message delivery semantics
• Delivery guarantees without reliance on a write-ahead log in HDFS
• Access to message metadata
Other Streaming
• Storm
• Samza
• VoltDB (kinda)
• DataTorrent
• Tigon
Upcoming
New Stuff
New in Current Release (0.8.2)
• New Producer (we covered)
• Delete Topic
• New offset management (we covered)
• Automated Leader Rebalancing
• Controlled Shutdown
• Turn off Unclean Leader Election
• Min ISR (we covered)
• Connection Quotas
Slated for Next Release(s)
• New Consumer (Already in trunk)
• SSL and Security Enhancements (In review)
• New Metrics (under discussion)
• File-buffer-backed producer (maybe)
• Admin CLI (In Review)
• Enhanced Mirror Maker – Replication
• Better docs / code cleanup
• Quotas
• Purgatory Redesign
Thank you.
Kafka Tutorial - basics of the Kafka streaming platform
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 

Similar to Introduction to Apache Kafka

intro-kafka
intro-kafkaintro-kafka
intro-kafka
Rahul Shukla
 
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Data Con LA
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Shravan (Sean) Pabba
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
Grant Henke
 
Intro to Apache Kafka
Intro to Apache KafkaIntro to Apache Kafka
Intro to Apache Kafka
Jason Hubbard
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache Kafka
Angelo Cesaro
 
Kafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupKafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User Group
Jeff Holoman
 
Building an Event Bus at Scale
Building an Event Bus at ScaleBuilding an Event Bus at Scale
Building an Event Bus at Scale
jimriecken
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdf
TarekHamdi8
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Srikrishna k
 
Kafka tutorial
Kafka tutorialKafka tutorial
Kafka tutorial
Srikrishna k
 
Distributed messaging with Apache Kafka
Distributed messaging with Apache KafkaDistributed messaging with Apache Kafka
Distributed messaging with Apache Kafka
Saumitra Srivastav
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
Knoldus Inc.
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
 
Hands-on Workshop: Apache Pulsar
Hands-on Workshop: Apache PulsarHands-on Workshop: Apache Pulsar
Hands-on Workshop: Apache Pulsar
Sijie Guo
 
Linked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache PulsarLinked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache Pulsar
Karthik Ramasamy
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdf
TarekHamdi8
 
Kafka for begginer
Kafka for begginerKafka for begginer
Kafka for begginer
Yousun Jeong
 
Kafka overview v0.1
Kafka overview v0.1Kafka overview v0.1
Kafka overview v0.1
Mahendran Ponnusamy
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 

Similar to Introduction to Apache Kafka (20)

intro-kafka
intro-kafkaintro-kafka
intro-kafka
 
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bu...
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
Intro to Apache Kafka
Intro to Apache KafkaIntro to Apache Kafka
Intro to Apache Kafka
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache Kafka
 
Kafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupKafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User Group
 
Building an Event Bus at Scale
Building an Event Bus at ScaleBuilding an Event Bus at Scale
Building an Event Bus at Scale
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdf
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka tutorial
Kafka tutorialKafka tutorial
Kafka tutorial
 
Distributed messaging with Apache Kafka
Distributed messaging with Apache KafkaDistributed messaging with Apache Kafka
Distributed messaging with Apache Kafka
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Hands-on Workshop: Apache Pulsar
Hands-on Workshop: Apache PulsarHands-on Workshop: Apache Pulsar
Hands-on Workshop: Apache Pulsar
 
Linked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache PulsarLinked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache Pulsar
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdf
 
Kafka for begginer
Kafka for begginerKafka for begginer
Kafka for begginer
 
Kafka overview v0.1
Kafka overview v0.1Kafka overview v0.1
Kafka overview v0.1
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 

Recently uploaded

Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
SynapseIndia
 
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
digitalxplive
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
bhumivarma35300
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
Lidia A.
 
WhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring AppsWhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring Apps
HackersList
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
ssuser1915fe1
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
313mohammedarshad
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Torry Harris
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
ChristopherTHyatt
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
BrainSell Technologies
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 

Recently uploaded (20)

Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
 
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
WhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring AppsWhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring Apps
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 

Introduction to Apache Kafka

  • 1. Kafka Introduction Apache Kafka ATL Meetup Jeff Holoman
  • 2. 2© 2015 Cloudera, Inc. All rights reserved. Agenda • Introduction • Concepts • Use Cases • Development • Administration • Cluster Planning • Cloudera Integration and Roadmap (New Stuff)
  • 3. 3© 2015 Cloudera and/or its affiliates. All rights reserved. Obligatory “About Me” Slide Sr. Systems Engineer Cloudera @jeffholoman Areas of Focus • Data Ingest • Financial Services • Healthcare Former Oracle • Developer • DBA Apache Kafka Contributor Co-Namer of “Flafka” Cloudera engineering blogger I mostly spend my time now helping build large-scale solutions in the Financial Services Vertical
  • 4. 4© 2015 Cloudera, Inc. All rights reserved. Before we get started • Kafka is picking up steam in the community; a lot of things are changing (patches welcome!) • dev@kafka.apache.org • users@kafka.apache.org • http://ingest.tips/ - Great Blog • Things we won’t cover in much detail • Wire protocol • Security-related stuff (KAFKA-1682) • Mirror-maker • Let’s keep this interactive. • If you are a Kafka user and want to talk at the next one of these, please let me know.
  • 5. 5© 2015 Cloudera and/or its affiliates. All rights reserved. Concepts Basic Kafka Concepts
  • 6. 6© 2015 Cloudera, Inc. All rights reserved. Why Kafka 6 Client Source Data Pipelines Start like this.
  • 7. 7© 2015 Cloudera, Inc. All rights reserved. Why Kafka 7 Client Source Client Client Client Then we reuse them
  • 8. 8© 2015 Cloudera, Inc. All rights reserved. Why Kafka 8 Client Backend Client Client Client Then we add additional endpoints to the existing sources Another Backend
  • 9. 9© 2015 Cloudera, Inc. All rights reserved. Why Kafka 9 Client Backend Client Client Client Then it starts to look like this Another Backend Another Backend Another Backend
  • 10. 10© 2015 Cloudera, Inc. All rights reserved. Why Kafka 10 Client Backend Client Client Client With maybe some of this Another Backend Another Backend Another Backend As distributed systems and services increasingly become part of a modern architecture, this makes for a fragile system
  • 11. 11© 2015 Cloudera, Inc. All rights reserved. Kafka decouples Data Pipelines Why Kafka 11 Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka Producers Brokers Consumers
  • 12. 12© 2015 Cloudera, Inc. All rights reserved. Key terminology • Kafka maintains feeds of messages in categories called topics. • Processes that publish messages to a Kafka topic are called producers. • Processes that subscribe to topics and process the feed of published messages are called consumers. • Kafka is run as a cluster comprised of one or more servers, each of which is called a broker. • Communication between all components is done via a simple, high-performance binary API over TCP.
  • 13. 13© 2015 Cloudera, Inc. All rights reserved. Efficiency • Kafka achieves its high throughput and low latency primarily from two key concepts • 1) Batching of individual messages to amortize network overhead and append/consume chunks together • 2) Zero-copy I/O using sendfile (Java’s NIO FileChannel transferTo method), which uses the Linux sendfile() system call to skip unnecessary copies • Heavily relies on the Linux page cache • The I/O scheduler will batch together consecutive small writes into bigger physical writes, which improves throughput • The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head, which improves throughput • The page cache automatically uses all the free memory on the machine
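The zero-copy path mentioned above is exposed in Java through FileChannel.transferTo, which on Linux maps to sendfile(). A minimal, self-contained sketch using only the JDK (this is illustration, not Kafka's code; the exception-wrapping helpers are just for convenience here):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {

    // Copy src to dst channel-to-channel. transferTo lets the kernel move
    // the bytes directly (sendfile on Linux) instead of round-tripping
    // them through user-space buffers on the JVM heap.
    public static long transfer(Path src, Path dst) {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
            return pos;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small unchecked helpers so callers don't need try/catch.
    public static Path tempFile(String suffix) {
        try { return Files.createTempFile("zerocopy", suffix); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void write(Path p, String s) {
        try { Files.write(p, s.getBytes(StandardCharsets.UTF_8)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static String read(Path p) {
        try { return new String(Files.readAllBytes(p), StandardCharsets.UTF_8); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

When a Kafka consumer is caught up, the broker serves responses straight from the page cache over the socket this way, which is why end-to-end latency can stay so low.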
  • 14. 14© 2015 Cloudera, Inc. All rights reserved. Efficiency - Implication • In a system where consumers are roughly caught up with producers, you are essentially reading data from cache. • This is what allows end-to-end latency to be so low
  • 15. 15© 2015 Cloudera, Inc. All rights reserved. Architecture 15 [Diagram: Producers send to a Kafka Cluster of four Brokers; Consumers fetch from the cluster; Zookeeper stores Offsets]
  • 16. 16© 2015 Cloudera, Inc. All rights reserved. Topics - Partitions • Topics are broken up into ordered commit logs called partitions. • Each message in a partition is assigned a sequential id called an offset. • Data is retained for a configurable period of time* [Diagram: Partition 1 holds offsets 0–13, Partition 2 offsets 0–11, Partition 3 offsets 0–13; writes append at the new end, with the oldest data at the other]
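The partition/offset model above can be sketched as a toy append-only log (illustration only, not Kafka's implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single topic partition: an append-only log where each
// message is assigned the next sequential offset.
public class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Append a message and return the offset it was assigned.
    public long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Consumers read by offset; reading does not remove the data.
    public String read(long offset) {
        return messages.get((int) offset);
    }

    // The next offset that will be assigned (the "log end offset").
    public long endOffset() {
        return messages.size();
    }
}
```

Because the offset is just a position in the log, many consumers can read the same partition independently, each tracking its own offset.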
  • 17. 17© 2015 Cloudera, Inc. All rights reserved. Message Ordering • Ordering is only guaranteed within a partition for a topic • To ensure ordering: • Group messages in a partition by key (producer) • Configure exactly one consumer instance per partition within a consumer group
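Grouping messages in a partition by key can be illustrated with a simplified stand-in for the producer's partitioner (Kafka's real partitioner hashes differently; this only shows the idea that a given key always maps to the same partition, preserving per-key ordering):

```java
// Simplified key-based ("semantic") partitioning: the partition is a
// deterministic function of the key, so all messages with the same key
// land in the same partition and keep their relative order there.
public class KeyPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is always non-negative,
        // even for keys whose hashCode() is negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

A producer using this scheme sends every message for, say, one user ID through one partition, which is what makes the "one consumer instance per partition" ordering guarantee useful.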
  • 18. 18© 2015 Cloudera, Inc. All rights reserved. Guarantees • Messages sent by a producer to a particular topic partition will be appended in the order they are sent • A consumer instance sees messages in the order they are stored in the log • For a topic with replication factor N, Kafka can tolerate up to N-1 server failures without “losing” any messages committed to the log
  • 19. 19© 2015 Cloudera, Inc. All rights reserved. Topics - Replication • Topics can (and should) be replicated. • The unit of replication is the partition • Each partition in a topic has 1 leader and 0 or more replicas. • A replica is deemed to be “in-sync” if • The replica can communicate with Zookeeper • The replica is not “too far” behind the leader (configurable) • The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas) • The Replication factor cannot be lowered
  • 20. 20© 2015 Cloudera, Inc. All rights reserved. Topics - Replication • Durability can be configured with the producer configuration request.required.acks • 0 The producer never waits for an ack • 1 The producer gets an ack after the leader replica has received the data • -1 The producer gets an ack after all ISRs receive the data • Minimum available ISR can also be configured such that an error is returned if enough replicas are not available to replicate data
  • 21. 21© 2015 Cloudera, Inc. All rights reserved. Durable Writes • Producers can choose to trade throughput for durability of writes: • Highest durability: ACK after all ISRs have received the data (request.required.acks = -1), highest per-event latency • Medium durability: ACK once the leader has received the data (request.required.acks = 1), medium latency • Lowest durability: no ACKs required (request.required.acks = 0), lowest latency • Throughput can also be raised with more brokers… (so do this instead)! • A sane configuration: replication = 3, min.insync.replicas = 2, request.required.acks = -1
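As a sketch, the "sane configuration" above splits into a producer-side setting and two topic-side settings. The helper below is hypothetical (the property names match the 0.8.x-era producer; check them against your Kafka version):

```java
import java.util.Properties;

public class DurableProducerConfig {

    // Producer-side setting from the slide: wait for all in-sync replicas
    // to acknowledge before considering a write successful.
    public static Properties producerProps(String brokerList) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList); // old-producer broker list
        props.put("request.required.acks", "-1");
        return props;
    }

    // Topic-side settings from the slide; these are applied when the topic
    // is created (e.g. via kafka-topics.sh), not in the producer config.
    public static String topicCreateArgs() {
        return "--replication-factor 3 --config min.insync.replicas=2";
    }
}
```

With min.insync.replicas = 2 and acks = -1, a write fails fast with an error if fewer than two replicas are available, rather than silently accepting under-replicated data.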
  • 22. 22© 2015 Cloudera, Inc. All rights reserved. Producer • Producers publish to a topic of their choosing (push) • Load can be distributed • Typically by “round-robin” • Can also do “semantic partitioning” based on a key in the message • Brokers load balance by partition • Can support async (less durable) sending • All nodes can answer metadata requests about: • Which servers are alive • Where leaders are for the partitions of a topic
  • 23. 23© 2015 Cloudera, Inc. All rights reserved. Producer – Load Balancing and ISRs [Diagram: a Producer writing to Brokers 100, 101 and 102] Topic: my_topic Partitions: 3 Replicas: 3 • Partition: 1 Leader: 101 ISR: 100,102 • Partition: 2 Leader: 102 ISR: 101,100 • Partition: 0 Leader: 100 ISR: 101,102
  • 24. 24© 2015 Cloudera, Inc. All rights reserved. Consumer • Multiple Consumers can read from the same topic • Each Consumer is responsible for managing its own offset • Messages stay in Kafka; they are not removed after they are consumed [Diagram: a producer writes offsets 1234567–1234577 while three consumers fetch independently at their own positions]
  • 25. 25© 2015 Cloudera, Inc. All rights reserved. Consumer • Consumers can go away [Diagram: one consumer drops off; the remaining consumers keep fetching]
  • 26. 26© 2015 Cloudera, Inc. All rights reserved. Consumer • And then come back [Diagram: the consumer rejoins and resumes fetching from its saved offset]
  • 27. 27© 2015 Cloudera, Inc. All rights reserved. Consumer - Groups • Consumers can be organized into Consumer Groups Common Patterns: • 1) All consumer instances in one group • Acts like a traditional queue with load balancing • 2) All consumer instances in different groups • All messages are broadcast to all consumer instances • 3) “Logical Subscriber” – Many consumer instances in a group • Consumers are added for scalability and fault tolerance • Each consumer instance reads from one or more partitions for a topic • There cannot be more consumer instances than partitions
  • 28. 28© 2015 Cloudera, Inc. All rights reserved. Consumer - Groups [Diagram: Kafka Cluster with Broker 1 and Broker 2 holding partitions P0–P3, consumed by Consumer Group A (C1, C2) and Consumer Group B (C3–C6)] Consumer Groups provide isolation to topics and partitions
  • 29. 29© 2015 Cloudera, Inc. All rights reserved. Consumer - Groups [Diagram: same layout; when a consumer fails (X), the group can rebalance itself across the remaining consumers]
  • 30. 30© 2015 Cloudera, Inc. All rights reserved. Delivery Semantics • At least once (the default) • Messages are never lost but may be redelivered • At most once • Messages may be lost but are never redelivered • Exactly once • Messages are delivered once and only once
  • 31. 31© 2015 Cloudera, Inc. All rights reserved. Delivery Semantics • At least once • Messages are never lost but may be redelivered • At most once • Messages may be lost but are never redelivered • Exactly once • Messages are delivered once and only once — much harder (impossible??)
  • 32. 32© 2015 Cloudera, Inc. All rights reserved. Getting Exactly Once Semantics • Must consider two components • Durability guarantees when publishing a message • Durability guarantees when consuming a message • Producer • What happens when a produce request was sent but a network error returned before an ack? • Use a single writer per partition and check the latest committed value after network errors • Consumer • Include a unique ID (e.g. UUID) and de-duplicate. • Consider storing offsets with data
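The consumer-side de-duplication idea above can be sketched as follows (illustration only; a real implementation would persist the seen-ID set, or store offsets atomically alongside the output data, so the state survives a restart):

```java
import java.util.HashSet;
import java.util.Set;

// Consumer-side de-duplication: each message carries a unique ID (e.g. a
// UUID set by the producer), and the consumer skips IDs it has already
// processed. Combined with at-least-once delivery, this gives
// effectively-once processing.
public class DedupConsumer {
    private final Set<String> seen = new HashSet<>();
    private int processed = 0;

    // Returns true if the message was processed, false if it was a
    // duplicate redelivery and was skipped.
    public boolean handle(String messageId, String payload) {
        if (!seen.add(messageId)) {
            return false; // already processed this ID
        }
        processed++; // ...do the real work with payload here...
        return true;
    }

    public int processedCount() {
        return processed;
    }
}
```

The same pattern covers the producer-side retry problem: if a send times out and is retried, the duplicate write is filtered out downstream by its ID.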
  • 33. 33© 2015 Cloudera, Inc. All rights reserved. Use Cases • Real-Time Stream Processing (combined with Spark Streaming) • General purpose Message Bus • Collecting User Activity Data • Collecting Operational Metrics from applications, servers or devices • Log Aggregation • Change Data Capture • Commit Log for distributed systems
  • 34. 34© 2015 Cloudera, Inc. All rights reserved. Positioning Should I use Kafka ? • For really large file transfers? • Probably not, it’s designed for “messages” not really for files. If you need to ship large files, consider good-ole-file transfer, or breaking up the files and reading per line to move to Kafka. • As a replacement for MQ/Rabbit/Tibco • Probably. Performance Numbers are drastically superior. Also gives the ability for transient consumers. Handles failures pretty well. • If security on the broker and across the wire is important? • Not right now. We can’t really enforce much in the way of security. (KAFKA-1682) • To do transformations of data? • Not really by itself
  • 35. 35© 2015 Cloudera and/or its affiliates. All rights reserved. Developing with Kafka API and Clients
  • 36. 36© 2015 Cloudera, Inc. All rights reserved. Kafka Clients • Remember Kafka implements a binary TCP protocol. • All clients except the JVM-based clients are maintained external to the code base. • Full Client List here:
  • 37. 37© 2015 Cloudera, Inc. All rights reserved. Java Producer Example – Old (< 0.8.1)
    /* start the producer */
    private void start() {
      producer = new Producer<String, String>(config);
    }
    /* create record and send to Kafka
     * because the key is null, data will be sent to a random partition.
     * the producer will switch to a different random partition every 10 minutes
     */
    private void produce(String s) {
      KeyedMessage<String, String> message = new KeyedMessage<String, String>(topic, null, s);
      producer.send(message);
    }
  • 38. 38© 2015 Cloudera, Inc. All rights reserved. Producer - New
    /**
     * Send the given record asynchronously and return a future which will
     * eventually contain the response information.
     * @param record The record to send
     * @return A future which will eventually contain the response information
     */
    public Future<RecordMetadata> send(ProducerRecord record);

    /**
     * Send a record and invoke the given callback when the record has been
     * acknowledged by the server
     */
    public Future<RecordMetadata> send(ProducerRecord record, Callback callback);
  • 40. 40© 2015 Cloudera, Inc. All rights reserved. Consumers • New Consumer coming later in the year. • For now, two options • High Level Consumer -> Much easier to code against, simpler to work with, but gives you less control, particularly around managing consumption. • Simple Consumer -> A lot more control, but hard to implement correctly. https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example • If you can, use the High Level Consumer. Be warned: the default behavior is to “auto-commit” offsets, meaning offsets will be committed periodically for you. If you care about data loss, don’t do this. https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example http://ingest.tips/2014/10/12/kafka-high-level-consumer-frequently-missing-pieces/
  • 41. 41© 2015 Cloudera, Inc. All rights reserved. Kafka Clients • Full list on the wiki, some examples…
  • 42. 42© 2015 Cloudera, Inc. All rights reserved. Log4j Appender https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/producer/KafkaLog4jAppender.scala
    log4j.appender.KAFKA=kafka.producer.KafkaLog4jAppender
    log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout
    log4j.appender.KAFKA.layout.ConversionPattern=[%d] %p %m (%c)%n
    log4j.appender.KAFKA.BrokerList=192.168.86.10:9092
    log4j.appender.KAFKA.ProducerType=async
    log4j.appender.KAFKA.Topic=logs
  • 43. 43© 2015 Cloudera, Inc. All rights reserved. Syslog Producer • Syslog producer - https://github.com/stealthly/go_kafka_client/tree/master/syslog
  • 44. 44© 2015 Cloudera, Inc. All rights reserved. Data Formatting • There are three basic approaches • Wild West Crazy • LinkedIn Way • https://github.com/schema-repo/schema-repo • http://confluent.io/docs/current/schema-registry/docs/index.html • Parse-Magic Way – Scaling Data
  • 45. 45© 2015 Cloudera and/or its affiliates. All rights reserved. He means traffic lights
  • 46. 46© 2015 Cloudera and/or its affiliates. All rights reserved. Having understood the downsides of working with plaintext log messages, the next logical evolution step is to use a loosely-structured, easy-to-evolve markup scheme. JSON is a popular format we’ve seen adopted by multiple companies, and tried ourselves. It has the advantage of being easily parseable, arbitrarily extensible, and amenable to compression. JSON log messages generally start out as an adequate solution, but then gradually spiral into a persistent nightmare. Jimmy Lin and Dmitriy Ryaboy, Twitter, Inc. Scaling Big Data Mining Infrastructure: The Twitter Experience http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
  • 47. 47© 2015 Cloudera and/or its affiliates. All rights reserved. Administration Basic
  • 48. 48© 2015 Cloudera, Inc. All rights reserved. Installation / Configuration Source: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf [root@dev ~]# cd /opt/cloudera/csd/ [root@dev csd]# wget http://archive-primary.cloudera.com/cloudera-labs/csds/kafka/CLABS_KAFKA-1.0.0.jar --2014-11-27 14:57:07-- http://archive-primary.cloudera.com/cloudera-labs/csds/kafka/CLABS_KAFKA-1.0.0.jar Resolving archive-primary.cloudera.com... 184.73.217.71 Connecting to archive-primary.cloudera.com|184.73.217.71|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 5017 (4.9K) [application/java-archive] Saving to: “CLABS_KAFKA-1.0.0.jar” 100%[===============================================================================>] 5,017 --.-K/s in 0s 2014-11-27 14:57:07 (421 MB/s) - “CLABS_KAFKA-1.0.0.jar” saved [5017/5017] [root@dev csd]# service cloudera-scm-server restart Stopping cloudera-scm-server: [ OK ] Starting cloudera-scm-server: [ OK ] [root@dev csd]# 1. Download the Cloudera Labs Kafka CSD into /opt/cloudera/csd/ 2. Restart CM 3. CSD will be included in CM as of 5.4
  • 49. 49© 2015 Cloudera, Inc. All rights reserved. Installation / Configuration Source: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf 3. Download and distribute the Kafka parcel 4. Add a Kafka service 5. Review configuration
  • 50. 50© 2015 Cloudera, Inc. All rights reserved. Usage • Create a topic*: bin/kafka-topics.sh --zookeeper zkhost:2181 --create --topic test --replication-factor 1 --partitions 1 • List topics: bin/kafka-topics.sh --zookeeper zkhost:2181 --list • Write data: cat data | bin/kafka-console-producer.sh --broker-list brokerhost:9092 --topic test • Read data: bin/kafka-console-consumer.sh --zookeeper zkhost:2181 --topic test --from-beginning * deleting a topic is not possible before 0.8.2
  • 51. 51© 2015 Cloudera, Inc. All rights reserved. Kafka Topics - Admin • Topics are administered via shell scripts: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test bin/kafka-topics.sh --list --zookeeper localhost:2181 • Topics can also be created dynamically by producers…don’t do this
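To enforce that last point, topic auto-creation can be switched off on the brokers. A sketch of the relevant server.properties entries (both are real broker settings; delete.topic.enable applies as of 0.8.2):

```properties
# stop producers from creating topics implicitly
auto.create.topics.enable=false
# allow kafka-topics.sh --delete to actually take effect (0.8.2)
delete.topic.enable=true
```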
  • 52. 52© 2015 Cloudera and/or its affiliates. All rights reserved. Sizing and Planning Guidelines and estimates
  • 53. 53© 2015 Cloudera, Inc. All rights reserved. Sizing / Planning • Cloudera is in the process of benchmarking on different hardware configurations • Early testing shows that relatively small clusters (3-5 nodes) can process > 1M messages/s* (*with some caveats) • We will release these results when they are ready
  • 54. 54© 2015 Cloudera, Inc. All rights reserved. Producer Performance – Single Thread
Type | Records/sec | MB/s | Avg Latency (ms) | Max Latency | Median Latency | 95th %tile
No Replication | 1,100,182 | 104 | 42 | 1070 | 1 | 362
3x Async | 1,056,546 | 101 | 42 | 1157 | 2 | 323
3x Sync | 493,855 | 47 | 379 | 4483 | 192 | 1692
  • 55. 55© 2015 Cloudera, Inc. All rights reserved. Benchmark Results
  • 56. 56© 2015 Cloudera, Inc. All rights reserved. Benchmark Results
  • 57. 57© 2015 Cloudera, Inc. All rights reserved. Hardware Selection • A standard Hadoop datanode configuration is recommended. • 12-14 disks is sufficient. SATA 7200 (more frequent flushes may benefit from SAS) • RAID 10 or JBOD --- I have an opinion about this. Not necessarily shared by all • Writes in JBOD will round-robin, but all writes to a given partition will be contiguous. This could lead to imbalance across disks • Rebuilding a failed volume can effectively disable a server with I/O requests. (debatable) • Kafka’s built-in replication should make up for the reliability provided by RAID • Kafka utilizes the page cache heavily, so additional memory will be put to good use. • For a lower bound, calculate write_throughput * time_in_buffer(s)
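As a quick worked example of that lower bound (the throughput and buffer-time numbers here are illustrative, not a benchmark):

```python
def page_cache_lower_bound_mb(write_throughput_mb_s, time_in_buffer_s):
    """Lower-bound memory estimate: sustained write throughput (MB/s)
    multiplied by how long data should stay hot in the page cache (s)."""
    return write_throughput_mb_s * time_in_buffer_s

# e.g. brokers sustaining 100 MB/s of writes, keeping ~30 s of data hot:
# 100 * 30 = 3000 MB, i.e. roughly 3 GB of page cache per broker
print(page_cache_lower_bound_mb(100, 30))
```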
  • 58. 58© 2015 Cloudera, Inc. All rights reserved. Other notes • 10Gb bonded NICs are preferred – with async producers you can definitely saturate a network • Consider a dedicated ZooKeeper cluster (3-5 nodes) • 1U machines, 2-4 SSDs, stock RAM • This is less of an issue with Kafka-based offsets; you can share ZooKeeper with Hadoop if you want
  • 59. 59© 2015 Cloudera and/or its affiliates. All rights reserved. Cloudera Integration Current state and Roadmap
  • 60. 60© 2015 Cloudera, Inc. All rights reserved. Flume (Flafka) • Allows for zero-coding ingestion into HDFS / HBase / Solr • Can utilize custom routing and per-event processing via selectors and interceptors • Batch size of source can be tuned for throughput/latency • Lots of other stuff… full details here: http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/
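A minimal Kafka-source-to-HDFS agent sketch, using the property names from the Flafka blog post above; agent, host, topic, and path names are all illustrative:

```properties
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = zkhost:2181
tier1.sources.source1.topic = logs
tier1.sources.source1.groupId = flume
# tune batchSize for throughput vs. latency
tier1.sources.source1.batchSize = 1000
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
```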
  • 61. 61© 2015 Cloudera, Inc. All rights reserved. Flume (Flafka) • Flume Integrates with Kafka as of CDH 5.2 • Source • Sink • Channel
  • 62. 62© 2015 Cloudera, Inc. All rights reserved. Flafka (diagram): data sources (logs, JMS, web servers, etc.) -> Kafka producers -> Kafka -> Flume agent (Kafka source -> interceptors / selectors -> channel -> HDFS sink)
  • 63. 63© 2015 Cloudera, Inc. All rights reserved. Flafka Channel (diagram): Kafka producers -> Kafka used directly as the Flume channel -> Flume agent sinks -> HDFS / HBase / Solr
  • 64. 64© 2015 Cloudera, Inc. All rights reserved. Spark Streaming Consuming data in parallel for high throughput: • Each input DStream creates a single receiver, which runs on a single machine. But often with Kafka, the NIC can be a bottleneck. To consume data in parallel from multiple machines, initialize multiple KafkaInputDStreams with the same Consumer Group Id. Each input DStream will (for the most part) run on a separate machine. • In addition, each DStream can use multiple threads to consume from Kafka Writing to Kafka from Spark or Spark Streaming: Directly use the Kafka Producer API from your Spark code. Example: KafkaWordCount http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
  • 65. 65© 2015 Cloudera, Inc. All rights reserved. New in Spark 1.3 • http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ • The new release of Apache Spark, 1.3, includes new RDD and DStream implementations for reading data from Apache Kafka. • More uniform usage of Spark cluster resources when consuming from Kafka • Control of message delivery semantics • Delivery guarantees without reliance on a write-ahead log in HDFS • Access to message metadata
  • 66. 66© 2015 Cloudera, Inc. All rights reserved. Other Streaming • Storm • Samza • VoltDB (kinda) • DataTorrent • Tigon
  • 67. 67© 2015 Cloudera and/or its affiliates. All rights reserved. Upcoming New Stuff
  • 68. 68© 2015 Cloudera, Inc. All rights reserved. New in Current Release (0.8.2) • New Producer (we covered) • Delete Topic • New Offset Management (we covered) • Automated Leader Rebalancing • Controlled Shutdown • Turn off Unclean Leader Election • Min ISR (we covered) • Connection Quotas
  • 69. 69© 2015 Cloudera, Inc. All rights reserved. Slated for Next Release(s) • New Consumer (Already in trunk) • SSL and Security Enhancements (In review) • New Metrics (under discussion) • File-buffer-backed producer (maybe) • Admin CLI (In Review) • Enhanced Mirror Maker – Replication • Better docs / code cleanup • Quotas • Purgatory Redesign

Editor's Notes

  1. Then we end up adding clients to use that source.
  2. But as we start to deploy our applications we realize that clients need data from a number of sources. So we add them as needed.
  3. But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
  4. Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where they use it as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producers push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
  5. So in Kafka, feeds of messages are stored in categories called topics. So you have a message, it goes into a given topic. You create a topic explicitly or you can just start publishing to a topic and have it created auto-magically. Anything that publishes messages to a Kafka topic is called a producer. Anything that reads messages from Kafka is called a consumer. The broker tier is a cluster of one or more servers. All of the communication under the covers is done via a binary TCP protocol (which you don’t really have to worry too much about, unless you start writing custom client implementations).
  6. So we can see in the diagram here that producers can write to really any broker in the cluster and consumers can do the same. Kafka relies on ZooKeeper for three basic things: detecting the addition and the removal of brokers and consumers, triggering a rebalance process in each consumer when those events happen, and maintaining the consumption relationship and keeping track of the consumed offset of each partition. Specifically, when each broker or consumer starts up, it stores its information in a broker or consumer registry in ZooKeeper. The broker registry contains the broker’s host name and port, and the set of topics and partitions stored on it. There’s also a consumer registry, which in versions < 0.8.2 was the only place offsets could live. Offsets can now be stored in a special Kafka topic, as ZooKeeper could become a bottleneck in high-throughput processing.
  7. Replication -> all the min.insync.replicas… there is a timeout. The single digit
  8. Highlight boxes with different color
  9. How are offsets shared
  10. This is doable with an idempotent producer where the producer tracks committed messages within some configurable window
  11. A lot of people for cross-data center replication. Goldengate -> Kafka.
  12. RabbitMQ: I can just grab messages, basically transient consumers that can rejoin if necessary. TIBCO as well. Cigna: ActiveMQ -> tracking consumers. You can have some kind of clustering, but recovering from a failed node is a problem: you can lose messages or go a long time without them.
  13. Add note about the different ways to handle message formats. Add Gwen’s code samples. Java example, python example
  14. So basically the old way was you could run the producer async which was blazing fast but you might miss stuff. Now you can send everything async and check the callbacks to ensure that you aren’t dropping messages.
  15. Confluent will be releasing a REST api soon.
  16. /opt/cloudera/parcels/CLABS_KAFKA/lib/kafka/bin/kafka-topics.sh --zookeeper dev:2181/kafka --list /opt/cloudera/parcels/CLABS_KAFKA/lib/kafka/bin/kafka-topics.sh --zookeeper dev:2181/kafka --create --topic test --replication-factor 1 --partitions 1
  17. Add whitepaper