Big Data Day LA 2015 - Introduction to Apache Kafka - The Big Data Message Bus by Ashish Singh of Cloudera

© Cloudera, Inc. All rights reserved.

Introduction to Apache Kafka - The Big Data Message Bus

Ashish Singh | Software Engineer, Cloudera

About Me

•  Software Engineer @ Cloudera
•  Contributed to Kafka, Hive, Parquet and Sentry
•  Used to work in HPC
•  @singhasdev

Why Kafka

Data pipelines start like this.

[Diagram: one Client connected to one Source]

Why Kafka

Then we reuse them.

[Diagram: four Clients connected to one Source]

Why Kafka

Then we add consumers to the existing sources.

[Diagram: four Clients, one Backend and one "Another Backend"]

Why Kafka

Then it starts to look like this.

[Diagram: four Clients, one Backend and three "Another Backend" boxes]

Why Kafka

With maybe some of this.

[Diagram: four Clients, one Backend and three "Another Backend" boxes]

How we got here

We wanted to do some stuff in Hadoop.

[Diagram: several Application and RDBMS boxes, Hadoop, a batch file transfer, and Reporting]

Why Kafka

Kafka decouples data pipelines.

[Diagram: four Source Systems (Producers) write to Kafka (Broker); Hadoop, Security Systems, Real-time monitoring and Data Warehouse read from it (Consumers)]

About Kafka

•  Publish/Subscribe Messaging System from LinkedIn
•  High throughput (100s of thousands of messages/sec)
•  Low latency (sub-second to low seconds)
•  Fault-tolerant (Replicated and Distributed)
•  Supports Agnostic Messaging
•  Standardizes format and delivery

Concepts

Basic Kafka Concepts

Key terminology

•  Kafka maintains feeds of messages in categories called topics.
•  Processes that publish messages to a Kafka topic are called producers.
•  Processes that subscribe to topics and process the feed of published messages are called consumers.
•  Kafka is run as a cluster made up of one or more servers, each of which is called a broker.
•  Communication between all components is done via a simple, high-performance binary API over TCP.

Architecture

[Diagram: Producers write to a Kafka Cluster of four Brokers; Consumers read from the cluster; Zookeeper sits alongside the cluster; an arrow is labelled Offsets]

Topics - Partitions

•  Topics are broken up into ordered commit logs called partitions.
•  Each message in a partition is assigned a sequential id called an offset.
•  Data is retained for a configurable period of time.

[Diagram: Partition 1, Partition 2 and Partition 3 holding offsets 0-13, 0-11 and 0-13; writes are appended at the new end of each partition, ordered from Old to New]
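
The partition and offset idea above can be sketched as a toy in-memory log. This is only an illustration under stated assumptions, not Kafka code: the `Partition` class and its methods are hypothetical names, and retention (deleting old data) is left out.

```python
class Partition:
    """Toy append-only log: a message's offset is its position in the list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append at the 'new' end and return the offset that was assigned."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages (offset, message) pairs starting at offset."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]


# A topic with three partitions, numbered as on the slide.
topic = {p: Partition() for p in (1, 2, 3)}

first = topic[1].append("a")   # offset 0 in partition 1
second = topic[1].append("b")  # offset 1 in partition 1
other = topic[2].append("c")   # offset 0 in partition 2: offsets are per partition
```

Each partition numbers its own messages, which is why an offset only means something together with a partition.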

Message Ordering

•  Ordering is only guaranteed within a partition for a topic
•  To ensure ordering:
    • Group messages in a partition by key (producer)
    • Configure exactly one consumer instance per partition within a consumer group
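
A minimal sketch of grouping messages by key follows. It assumes a CRC32 hash as a stand-in for whatever hash the real producer uses; `partition_for`, `events_for` and the sample events are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the topic


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; equal keys always land in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]

# Each partition is an append-only list, so arrival order is preserved inside it.
partitions = defaultdict(list)
for key, value in events:
    partitions[partition_for(key)].append((key, value))


def events_for(key):
    """Read one key's events back from the single partition that holds them."""
    return [v for k, v in partitions[partition_for(key)] if k == key]
```

Because every event for `user-1` goes to one partition, a single consumer reading that partition sees them in the order they were sent. Events for different keys may sit in different partitions and have no guaranteed relative order.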

Guarantees

•  Messages sent by a producer to a particular topic partition will be appended in the order they are sent
•  A consumer instance sees messages in the order they are stored in the log
•  For a topic with replication factor N, Kafka can tolerate up to N-1 server failures without “losing” any messages committed to the log

Topics - Replication

•  Topics can (and should) be replicated.
•  The unit of replication is the partition
•  Each partition in a topic has 1 leader and 0 or more replicas.
•  A replica is deemed to be “in-sync” if
    • The replica can communicate with Zookeeper
    • The replica is not “too far” behind the leader (configurable)
•  The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)
•  The Replication factor cannot be lowered

Topics - Replication

•  Durability can be configured with the producer configuration request.required.acks
    • 0: The producer never waits for an ack
    • 1: The producer gets an ack after the leader replica has received the data
    • -1: The producer gets an ack after all ISRs receive the data
•  Minimum available ISR can also be configured such that an error is returned if enough replicas are not available to replicate data
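
The three acknowledgement levels and the minimum-ISR check can be modelled with one small function. It is a sketch, not broker logic: `acks_to_wait_for` and `NotEnoughReplicas` are hypothetical names, and `isr_size` is assumed to count the leader.

```python
class NotEnoughReplicas(Exception):
    """Raised when the ISR is smaller than the configured minimum."""


def acks_to_wait_for(required_acks, isr_size, min_insync_replicas=1):
    """Return how many replica acknowledgements the producer waits for.

    required_acks follows the slide: 0 = none, 1 = leader only, -1 = all ISRs.
    """
    if required_acks == 0:
        return 0
    if required_acks == 1:
        return 1
    if required_acks == -1:
        # The minimum-ISR check only matters when all ISRs must acknowledge.
        if isr_size < min_insync_replicas:
            raise NotEnoughReplicas(
                f"ISR has {isr_size} replicas, need {min_insync_replicas}"
            )
        return isr_size
    raise ValueError("required_acks must be 0, 1 or -1 in this sketch")
```

For example, `acks_to_wait_for(-1, isr_size=3, min_insync_replicas=2)` waits for three acknowledgements, while the same call with `isr_size=1` raises `NotEnoughReplicas` instead of accepting a write that only one replica holds.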

Durable Writes

•  Producers can choose to trade throughput for durability of writes:

Durability    Behaviour                          Per Event Latency    Required Acknowledgements (request.required.acks)
Highest       ACK all ISRs have received         Highest              -1
Medium        ACK once the leader has received   Medium               1
Lowest        No ACKs required                   Lowest               0

•  Throughput can also be raised with more brokers… (so do this instead)!
•  A sane configuration:

Property                 Value
replication              3
min.insync.replicas      2
request.required.acks    -1
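
As a rough check of what the sane configuration buys, the arithmetic can be written out. This is a back-of-the-envelope sketch assuming request.required.acks is -1; `tolerated_failures` is a hypothetical helper, and the N-1 figure is the guarantee stated earlier in the deck.

```python
def tolerated_failures(replication, min_insync_replicas):
    """Broker failures one partition can absorb, assuming acks = -1."""
    if not 1 <= min_insync_replicas <= replication:
        raise ValueError("need 1 <= min_insync_replicas <= replication")
    return {
        # Writes are rejected once fewer than min_insync_replicas remain.
        "writes_still_accepted": replication - min_insync_replicas,
        # Committed messages survive as long as one replica is left (N-1).
        "committed_data_survives": replication - 1,
    }


sane = tolerated_failures(replication=3, min_insync_replicas=2)
# sane == {"writes_still_accepted": 1, "committed_data_survives": 2}
```

With replication 3 and min.insync.replicas 2, one broker can fail without interrupting writes, and committed data survives two failures.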

Producer

•  Producers publish to a topic of their choosing (push)
•  Load can be distributed
    • Typically by “round-robin”
    • Can also do “semantic partitioning” based on a key in the message
•  Brokers load balance by partition
•  Can support async (less durable) sending
•  All nodes can answer metadata requests about:
    • Which servers are alive
    • Where leaders are for the partitions of a topic
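
The two load-distribution modes can be sketched in one toy partitioner. `ToyPartitioner` is a hypothetical class, and CRC32 is an assumed stand-in for the hash a real producer would use.

```python
import itertools
import zlib


class ToyPartitioner:
    """Round-robin for keyless messages, hash-based for keyed ones."""

    def __init__(self, num_partitions):
        self._num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: spread load evenly across partitions.
            return next(self._round_robin)
        # Key present: "semantic partitioning", the same key maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self._num_partitions


partitioner = ToyPartitioner(num_partitions=3)
keyless = [partitioner.partition() for _ in range(6)]
print(keyless)  # [0, 1, 2, 0, 1, 2]
```

Round-robin gives the most even spread, while keyed partitioning gives up some balance in exchange for keeping related messages together.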

Producer - Load Balancing and ISRs

Topic: my_topic    Partitions: 3    Replicas: 3

Partition    Leader    ISR
0            100       101,102
1            101       100,102
2            102       101,100

[Diagram: a Producer writing to Broker 100, Broker 101 and Broker 102, each holding partitions 0, 1 and 2]

Consumer

•  Multiple Consumers can read from the same topic
•  Each Consumer is responsible for managing its own offset
•  Messages stay on Kafka…they are not removed after they are consumed

[Diagram: a log with offsets 1234567 to 1234577; the Producer sends and the write lands at offset 1234577; three Consumers fetch from the log]

Consumer

•  Consumers can go away

[Diagram: the same log with offsets 1234567 to 1234577; the Producer sends and the write lands at offset 1234577; two Consumers remain and fetch from the log]

Consumer

•  And then come back

[Diagram: the same log with offsets 1234567 to 1234577; the Producer sends and the write lands at offset 1234577; three Consumers fetch from the log again]
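
The consumer behaviour on the last three slides can be sketched with a list standing in for one partition. `ToyConsumer` and the message names are hypothetical; the point is only that the offset lives with the consumer and the log is never modified by reads.

```python
log = [f"msg-{i}" for i in range(10)]  # stand-in for one partition


class ToyConsumer:
    def __init__(self, name, start_offset=0):
        self.name = name
        self.offset = start_offset  # next offset to fetch, owned by the consumer

    def fetch(self, log, max_messages=3):
        """Read a batch from the current offset and advance past it."""
        batch = log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch


fast = ToyConsumer("fast")
slow = ToyConsumer("slow")

fast.fetch(log, max_messages=10)  # reads everything
slow.fetch(log, max_messages=3)   # reads msg-0..msg-2, then "goes away"
saved = slow.offset               # the consumer remembers where it stopped

# Later the consumer "comes back" and resumes from its saved offset.
returned = ToyConsumer("slow", start_offset=saved)
resumed = returned.fetch(log, max_messages=2)
```

Reading does not remove anything, so the fast consumer finishing early has no effect on what the slow consumer sees when it returns.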

Consumer - Groups

•  Consumers can be organized into Consumer Groups
•  Common Patterns:
•  1) All consumer instances in one group
    • Acts like a traditional queue with load balancing
•  2) All consumer instances in different groups
    • All messages are broadcast to all consumer instances
•  3) “Logical Subscriber” - Many consumer instances in a group
    • Consumers are added for scalability and fault tolerance
    • Each consumer instance reads from one or more partitions for a topic
    • There cannot be more active consumer instances in a group than partitions (extra instances sit idle)
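
The way partitions are shared inside a group, and repeated across groups, can be sketched as follows. This toy uses a plain round-robin split, which is an assumption; `assign` and the consumer names are hypothetical.

```python
def assign(partitions, consumers):
    """Spread a topic's partitions over the consumers of one group."""
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Each group gets its own full copy of the topic (broadcast across groups),
# while partitions are divided among the members inside a group (queue-like).
group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])

# With more consumers than partitions, the extra consumer gets nothing to read.
crowded = assign(partitions, ["C1", "C2", "C3", "C4", "C5"])
```

Here `group_a` gives each consumer two partitions, `group_b` gives each consumer one, and in `crowded` the fifth consumer is left with an empty list.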

Consumer - Groups

Consumer Groups provide isolation to topics and partitions.

[Diagram: a Kafka Cluster with Broker 1 and Broker 2 holding partitions P0, P3, P1 and P2; consumers C1 to C6 are split across Consumer Group A and Consumer Group B]

Data Exchange in Distributed Architectures

•  Multiple systems interacting together benefit from a common data exchange format.
•  Choosing the correct standard can significantly impact application design

[Diagram: two Clients, each serializing to and deserializing from a Common Data Format]

I ♥ Avro

•  Define Schema
•  Generate code for objects
•  Serialize / Deserialize into Bytes or JSON
•  Embed schema in files / records… or not
•  Support for our favorite languages… Except Go.
•  Schema Evolution
    • Add and remove fields without breaking anything
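
The schema evolution bullet can be illustrated with a toy resolver. It does not use the Avro library and only mimics the general idea of reading old data with a newer schema; `READER_SCHEMA_V2`, `resolve` and the record fields are hypothetical, and JSON stands in for the serialized bytes.

```python
import json

# The newer reader schema drops "name" and adds "email" with a default.
READER_SCHEMA_V2 = {
    "fields": [
        {"name": "id"},
        {"name": "email", "default": None},
    ]
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Fields missing from the record take the reader's default; fields the
    reader does not declare are dropped.
    """
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out


# A record written under the older schema, which had "id" and "name".
old_bytes = json.dumps({"id": 7, "name": "ada"}).encode("utf-8")
new_view = resolve(json.loads(old_bytes), READER_SCHEMA_V2)
# new_view == {"id": 7, "email": None}
```

The removed field is ignored and the added field falls back to its default, so data written before the change stays readable. An added field without a default would raise an error in this sketch.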

Use Cases

•  Real-Time Stream Processing (combined with Spark Streaming)
•  General purpose Message Bus
•  Collecting User Activity Data
•  Collecting Operational Metrics from applications, servers or devices
•  Log Aggregation
•  Change Data Capture
•  Commit Log for distributed systems

FAQs

•  Should I use SSDs for Kafka Brokers?
•  How do I encrypt the data persisted on my Kafka Brokers?
•  Is it true that Zookeeper can become a pain point with a Kafka cluster?
•  Does Kafka support cross-data center availability?
•  What type of data transformations are supported on Kafka?
•  How to send large messages or payloads through Kafka?
•  Does Kafka support MQTT or JMS protocols?
