Apache Kafka

© Cloudera, Inc. All rights reserved.

Apache Kafka - Ingestion and Processing Pipeline

NJ Hadoop Meetup – 8/11/15

Shravan Pabba
@skpabba


Agenda

•  Kafka Concepts and Architecture
•  Kafka vs Traditional messaging systems
•  Kafka with Cloudera
•  Demo
   § Install and configure Kafka on Cloudera cluster
   § Client tools - Add and consume data from topics
   § Replication and Failover capabilities
   § Flume Integration and demo of Kafka to Flume to HDFS
•  Other topics


About Me

•  Systems Engineer @ Cloudera
•  Previously Pre/Post Sales Architect @ GigaSpaces, IBM
•  Mainframes, Client/Server, Distributed & Cloud


Kafka Concepts and Architecture


Typical Data Hub Architecture

[Diagram: Cloudera Enterprise Data Hub, managed by Cloudera Manager]
•  Ingestion: Kafka, Flume, Spark Streaming, DistCp, Sqoop, File Dumping
•  Access Layer: Interactive, JDBC, ODBC, ETL, Hive, Pig, Spark DAG, MLlib, Giraph, Grid Compute, Custom
•  Egress: DistCp, Producer, File Dumping, Kafka/Custom, Custom HBase API, Solr
•  Engines: YARN, Spark, MapReduce, Impala
•  Storage Layer: HDFS, HBase, Solr
•  Sentry (Security Framework), Encryption, Navigator


Why Kafka? (Or rather, why didn’t LinkedIn use Flume?)

•  No ability to replay events
•  Multiple sinks require event replication (via multiple channels)
•  Sinks that share a source (mostly) process events in sync
•  This is tight coupling

[Diagram: Logs → Spool Source → Channel → Avro Sink, and More Logs → Spool Source → Channel → Avro Sink, both feeding an Avro Source that fans out through separate Channels to an HBase Sink (HBase) and an HDFS Sink (HDFS)]


Why Kafka?

[Diagram, 2009: Web logs → Hadoop. Connections = O(1)]


Why Kafka? Increasing complexity

[Diagram, 2009: Web logs → Hadoop. Connections = O(1)]
[Diagram, 2014: Transactions, Metrics, Web logs, and Audit Logs each connect directly to Hadoop, Warehouse, Alerting, and Security. Connections = O(Systems²)]


Why Kafka? Decoupling

[Diagram, 2014: Transactions, Metrics, Web logs, and Audit Logs each connect directly to Hadoop, Warehouse, Alerting, and Security. Connections = O(Systems²)]
[Diagram, 2015+?: Transactions, Metrics, Web logs, and Audit Logs write to Kafka; Hadoop, Warehouse, Alerting, and Security read from Kafka. Connections = O(Systems)]
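
The connection counts on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes every source is wired to every destination in the point-to-point case, and that each system holds a single connection to the central log in the decoupled case. The function names are made up for this example.

```python
def point_to_point_connections(sources, sinks):
    """Every source is wired directly to every sink."""
    return len(sources) * len(sinks)


def hub_connections(sources, sinks):
    """Every system holds exactly one connection, to the central log."""
    return len(sources) + len(sinks)


# System names taken from the slide's diagram.
sources = ["Transactions", "Metrics", "Web logs", "Audit Logs"]
sinks = ["Hadoop", "Warehouse", "Alerting", "Security"]

direct = point_to_point_connections(sources, sinks)
via_kafka = hub_connections(sources, sinks)
print(f"point-to-point: {direct}, via a central log: {via_kafka}")
```

With four systems on each side the direct wiring needs 16 connections and the central log needs 8; the gap widens as more systems are added.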


Why Kafka? Because logs.

•  Distributed, structured logs are very useful
•  Resiliency / replication
   § Database write-ahead logs (HBase WAL, Oracle redo logs, etc.)
•  System decoupling
   § Enterprise service buses (ESBs)
   § Data integration (change data capture)
•  Stream processing (e.g. real-time alerts)
•  Consensus (using logical clocks)


What is Kafka?

•  Kafka is …

[Diagram: Transactions, Metrics, Web logs, and Audit Logs feed Kafka, which feeds Hadoop, Warehouse, Alerting, and Security]


What is Kafka?

•  Kafka is a distributed, …

[Diagram: Transactions, Metrics, Web logs, and Audit Logs feed Kafka, shown as three Brokers, which feeds Hadoop, Warehouse, Alerting, and Security]


What is Kafka?

•  Kafka is a distributed, topic-oriented, …

[Diagram: Sources 1-3 write to Topic 1 and Topic 2 on a Broker; Sink 1 reads Topic 1 and Sink 2 reads Topic 2]


What is Kafka?

•  Kafka is a distributed, topic-oriented, partitioned, …

[Diagram: Sources 1-3 write to Topic 1 and Topic 2, each split into Partition 1 and Partition 2 across two Brokers; Sink 1 reads Topic 1 and Sink 2 reads Topic 2]


What is Kafka?

•  Kafka is a distributed, topic-oriented, partitioned, replicated commit log.

[Diagram: Sources 1-3 write to Topic 1 and Topic 2, each split into Partition 1 and Partition 2 across two Brokers; each Broker also holds a copy of the other Broker's partitions; Sink 1 reads Topic 1 and Sink 2 reads Topic 2]


What is Kafka?

•  Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
•  Kafka is also a pub-sub messaging system.
•  Messages can be text (e.g. syslog), but binary is best (preferably Avro!).

[Diagram: Sources 1-3 write to Topic 1 and Topic 2, each split into Partition 1 and Partition 2 across two Brokers; each Broker also holds a copy of the other Broker's partitions; Sink 1 reads Topic 1 and Sink 2 reads Topic 2]
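
To make "topic-oriented, partitioned commit log" concrete, here is a toy in-memory model. It is a sketch, not Kafka's implementation: the `Topic` class and its methods are hypothetical, messages are spread over partitions round-robin (an assumption of this example), and replication is left out. It shows that each partition is an append-only sequence with its own offsets, and that two subscribers can read the same data independently, which is the pub-sub behaviour.

```python
from itertools import count


class Topic:
    """Toy in-memory topic: a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = count()  # round-robin counter (assumption of this sketch)

    def append(self, message):
        """Append a message to one partition and return (partition, offset)."""
        p = next(self._next) % len(self.partitions)
        log = self.partitions[p]
        log.append(message)
        return p, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from `offset` onwards."""
        return self.partitions[partition][offset:]


topic = Topic("weblogs", num_partitions=2)
positions = [topic.append(f"event-{i}".encode()) for i in range(5)]

# Two independent subscribers read the same partition; reading removes nothing.
sink1 = topic.read(partition=0, offset=0)
sink2 = topic.read(partition=0, offset=0)
print(positions)
print(sink1 == sink2)
```

Ordering is preserved within a partition, not across the whole topic, which is why the offsets restart at zero in each partition.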


Architectural Overview

•  Each machine is called a Broker
•  Data written belongs to Topics (analogous to a Table in a database)
•  Each Topic is partitioned
•  Partitions are distributed across the Brokers
•  Partitions are also replicated (one replica per partition is the Leader Partition)
•  Producers and Consumers talk to the Leader Partition

[Diagram: Kafka Cluster with two Producers and two Consumers]
   Broker 1: Partition 1 (Leader), Partition 2, Partition 3
   Broker 2: Partition 2 (Leader), Partition 1, Partition 3
   Broker 3: Partition 3 (Leader), Partition 1, Partition 2
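
The layout in the diagram can be reproduced with a simple round-robin placement. This is a simplified illustration only: `assign_replicas` and `leader_after_failure` are hypothetical helpers, and Kafka's real replica placement and leader election are more involved than this.

```python
def assign_replicas(num_brokers, num_partitions, replication_factor):
    """Round-robin placement: partition p starts on broker p and wraps around.

    Returns {partition: [broker, ...]} using 1-based numbers; the first
    broker listed for a partition is treated as its leader.
    """
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        layout[p + 1] = [(p + r) % num_brokers + 1 for r in range(replication_factor)]
    return layout


def leader_after_failure(replicas, failed_broker):
    """Simplified failover: the first surviving replica becomes the leader."""
    survivors = [b for b in replicas if b != failed_broker]
    return survivors[0] if survivors else None


layout = assign_replicas(num_brokers=3, num_partitions=3, replication_factor=3)
for partition, replicas in layout.items():
    print(f"Partition {partition}: leader on Broker {replicas[0]}, replicas on {replicas}")

# If Broker 1 fails, Partition 1 needs a new leader; the others keep theirs.
new_leader = leader_after_failure(layout[1], failed_broker=1)
```

With three brokers, three partitions, and a replication factor of three, each broker leads exactly one partition and holds follower copies of the other two, matching the slide.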


The Kafka Advantage

High-Throughput & Low Latency
•  One broker can handle 100s of MB of reads/writes per second, from 1000s of clients
•  Messages delivered in milliseconds

Durability & Reliability
•  Zero data loss with messages persisted on disk and replicated within the cluster
•  Highly available, with fault tolerance built into the system

Scalability & Flexibility
•  Elastically and transparently add more machines without downtime for horizontal scalability
•  Dynamically add Producers & Consumers
•  Enable real-time & batch consumption

Cost-Efficient
•  Modest cluster optimized to handle millions of messages per second
•  Open standard for long-term value
•  With Cloudera, a single system for multiple workloads


How does it compare to Flume and Traditional Messaging?


Kafka vs Flume

Kafka
•  Kafka is very much a general-purpose system. Many producers and many consumers sharing multiple topics
•  Kafka has a significantly smaller producer and consumer ecosystem
•  Kafka requires an external stream processing system for in-flight data processing
•  Highly available ingest pipeline

Flume
•  Flume is a special-purpose tool designed to send data to HDFS, HBase (and Solr)
•  Flume has many built-in sources and sinks
•  In-flight data processing using interceptors. Useful for data masking or filtering
•  Flume does not replicate events


Random and Sequential Access in Disk and Memory

Source: http://queue.acm.org/detail.cfm?id=1563874


Why is Kafka fast?

Kafka
•  Kafka does only sequential file I/O
•  Kafka keeps a single pointer into each partition of a topic. All messages prior to the pointer are considered consumed, and all messages after it are considered unconsumed
•  Relies heavily on the OS page cache for data storage, with zero-copy transfer
•  No GC, no memory overhead
•  Kafka supports end-to-end batching and compression of messages

Traditional Messaging
•  Traditional messaging does random file/memory I/O (B-tree structures)
•  Typically, messaging systems keep some kind of per-message state about what has been consumed and have to update it
•  Disk/memory is used for storage
•  JVM == GC and memory overhead
•  Traditional messaging is typically non-batched and uncompressed
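
The "single pointer" idea is easy to see in code. The sketch below contrasts it with per-message bookkeeping; both classes are hypothetical illustrations rather than Kafka or any real broker's API. Because the pointer is just an integer over an untouched log, replaying events amounts to moving it back.

```python
class OffsetConsumer:
    """Tracks consumption of one partition with a single integer."""

    def __init__(self, log):
        self.log = log    # shared, append-only list of messages
        self.offset = 0   # everything before this index counts as consumed

    def poll(self, max_messages=10):
        """Read the next batch sequentially and advance the pointer once."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Rewind or skip ahead by moving the pointer; the log is untouched."""
        self.offset = offset


class PerMessageQueue:
    """Contrast: a queue that keeps a consumed flag for every message."""

    def __init__(self, messages):
        self.messages = messages
        self.consumed = [False] * len(messages)

    def poll(self):
        """Find the first unconsumed message and update its state."""
        for i, done in enumerate(self.consumed):
            if not done:
                self.consumed[i] = True  # one state update per message
                return self.messages[i]
        return None


log = [f"msg-{i}" for i in range(5)]

consumer = OffsetConsumer(log)
first = consumer.poll(3)
rest = consumer.poll(3)
consumer.seek(0)              # replay from the beginning
replayed = consumer.poll(5)

queue = PerMessageQueue(log)
head = queue.poll()
```

The offset consumer stores one number no matter how long the log grows, while the per-message queue stores and updates one flag per message.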


Canonical Use Cases

•  Real-Time Stream Processing
•  General-Purpose Message Bus
•  User Activity Data Collection
•  Operational Metrics Collection (applications, servers, or devices)
•  Log Aggregation
•  Change Data Capture
•  Distributed Systems Commit Log


Kafka and Cloudera


Simplified Management

•  Deploy and Configure Kafka clusters
•  Unified Management
•  Multiple Kafka clusters
•  Entire platform
•  Monitoring, Alerts, and Dashboards


Configure Kafka using CM


CM has much more!


Kafka + Apache Flume

•  Kafka can be configured as a fast, reliable Flume Channel
•  Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers

Flume Sinks Consume from Kafka:
Write data to HDFS, HBase, or Search

Flume Sources Write to Kafka:
Read from logs, files, JMS, HTTP, RPC, Thrift, etc. and write events to Kafka


Cloudera + Kafka

Community involvement and contribution:
•  Spearheading adding security features to Kafka
•  Identified and fixed core architectural issues to make Kafka fully reliable
•  Strong relationship with Confluent.io and other Kafka Committers

Support expertise and experience:
•  Multiple production customers
•  Support team trained by Kafka Committers

Integrated with Cloudera’s production-ready platform:
•  Cloudera Manager CSD makes it easy to deploy, configure, and monitor Kafka clusters
•  End-to-end workloads with other components, all on a single system
•  Leading security, governance, administration, and partner network


Roadmap

Security:
•  Authentication with Kerberos
•  Topic level Authorization
•  SSL encryption of data over-the-wire
•  Improved Cloudera Manager integration
•  HUE integration

*Roadmap is subject to change


Kafka Demo

•  Install and configure Kafka on Cloudera cluster
•  Client tools - Add and consume data from topics
•  Replication and Failover capabilities
•  Flume Integration and demo of Kafka to Flume to HDFS


Clients/APIs

•  Java, Python, Go, C/C++, .NET, Clojure, Ruby, Erlang, stdin/stdout and more here: https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-ProducerDaemon
•  Producer and Consumer API
•  New Java Producer API was introduced in 0.8.2
•  New consumer API is coming in the next release


Camus/Samza/Kafka Manager

•  Camus and Samza are tools created and used at LinkedIn
•  Camus is a client for ingesting Kafka data into Hadoop (MR jobs under the covers)
•  Camus is being phased out and replaced with Gobblin
•  Samza is a stream processing framework that uses Kafka for messaging and YARN for processing (resource management, etc.)
•  Kafka Manager is a management tool for Kafka developed at Yahoo
