Chicago Data Summit: Flume: An IntroductionCloudera, Inc.
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: Reliability, Scalability, Extensibility, and Manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
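Flume's real extension interface is a Java API; as a language-neutral sketch of the idea, any backend that implements a small sink contract can receive the event stream (class and method names here are invented for illustration, not Flume's actual API):

```python
# Hypothetical sketch of Flume's sink-extension idea: any backend that
# implements open/append/close can be plugged in as a delivery target.
class EventSink:
    def open(self): ...
    def append(self, event): ...
    def close(self): ...

class InMemorySink(EventSink):
    """Toy sink that collects events in memory (stand-in for HDFS)."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

def deliver(events, sink):
    """Drive any sink through its lifecycle with a batch of events."""
    sink.open()
    for e in events:
        sink.append(e)
    sink.close()
```

Swapping `InMemorySink` for, say, a search-indexer sink is then a purely local change, which is the point of the extension interface.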
This presentation describes Flume, a distributed log collection system for shipping data to frameworks such as Hadoop and HBase. It provides an overview and describes updates and emerging stories from the community since its open source release. These are the slides from the 2/18/11 Austin, TX HUG.
During the talk, we will build a simple web app using Lift and then introduce Akka (http://akkasource.org) to help scale it. Specifically, we will demonstrate Remote Actors, "Let it crash" failover, and the Dispatcher. Other Scala-oriented tools we will use include sbt and ENSIME mode for Emacs.
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationLinas Virbalas
Hot standby and streaming replication move the needle forward for high availability and scaling for a wide number of applications. Continuent Tungsten has earlier supported PostgreSQL clustering using warm standby. In this talk we will describe how to build clusters using the PostgreSQL 9 features and give our report from the trenches.
This talk will cover how PostgreSQL 9 hot standby and streaming replication work from a user perspective, then dive into a description of how Tungsten really makes them shine.
We'll cover the following issues:
* Configuration of warm standby and streaming replication
* Provisioning new standby instances
* Strategies for balancing reads across primary and standby databases
* Managing failover
Please join us for an enlightening presentation on a set of PostgreSQL 9 features that are interesting to a wide range of PostgreSQL users.
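The read-balancing strategy from the list above can be sketched as a toy router: writes always go to the primary, reads are spread round-robin across primary and standbys (host names and the routing policy are illustrative, not Tungsten's actual implementation):

```python
import itertools

class ReadWriteRouter:
    """Toy read/write splitter for a primary plus hot standbys."""
    def __init__(self, primary, standbys):
        self.primary = primary
        # Reads cycle over every node; writes must hit the primary.
        self._readers = itertools.cycle([primary] + list(standbys))

    def route(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._readers) if is_read else self.primary
```

A real router also has to consider replication lag on standbys; this sketch ignores that deliberately.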
Living the Easy Life with Rules-Based Autonomic Database ClustersLinas Virbalas
Our business at Continuent is development of database clusters with highly simplified management and the ability to be operated unattended for prolonged periods of time. As part of our Tungsten Clustering product we developed an extensible, low-latency, fault-tolerant management framework for database clusters built around a core of group communications and business rules. We have found that our system is easy to maintain and to extend. For example, an extension to switch virtual IP addresses in the event of a database node failure was implemented in an afternoon as a set of two rules and a single bash script. In our talk we will cover the following:
* Basic architecture of a rules-based management framework for databases
* Introduction to business rules, with code examples showing how they work to repair problems ranging from simple process failures to network partitions
* A quick demo of business rules in operation.
* Finally, some thoughts on the benefits of the approach and our experiences (good and bad) with autonomic management of database clusters.
This is an approach to management that we believe will be of interest to anyone who cares about keeping important data highly available, as well as anyone interested in learning about rules technology.
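A minimal sketch of the rules-based idea described above, assuming each rule is a (condition, action) pair evaluated against a dictionary of facts about the cluster (the names and the virtual-IP example are illustrative, not Tungsten's actual rule language):

```python
class RuleEngine:
    """Minimal forward-chaining sketch: fire every rule whose
    condition holds for the current cluster facts."""
    def __init__(self):
        self.rules = []

    def rule(self, condition, action):
        self.rules.append((condition, action))

    def run(self, facts):
        fired = []
        for condition, action in self.rules:
            if condition(facts):
                action(facts)
                fired.append(action.__name__)
        return fired

# Example rule: move the virtual IP when the primary database is down.
def move_vip(facts):
    facts["vip_host"] = facts["standby"]

engine = RuleEngine()
engine.rule(lambda f: not f["primary_alive"], move_vip)
```

The appeal of the approach is visible even at this scale: the VIP-failover behavior is one rule, added without touching the engine.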
Apache MXNet for IoT with Apache NiFi. Using Apache MXNet with Apache NiFi and MiniFi for IoT use cases. Ingesting, managing, orchestration and running IoT workloads.
IoT with Apache MXNet and Apache NiFi and MiniFiDataWorks Summit
A hands-on deep dive on using Apache MiniFi with Apache MXNet on edge devices, including a Raspberry Pi with Movidius and an NVIDIA Jetson TX1. We run deep learning models on the edge device and send images, sensor data, and deep learning results if values exceed norms. Using S2S (site-to-site), data is sent to NiFi for further processing, additional deep learning processing, and data augmentation. A stream of data is landed as ORC files in HDFS with Hive tables on top.
Processed data is stored in Avro format with a schema kept in the Schema Registry. Visualization is shown in Zeppelin.
Use Cases: Security Camera Monitoring, Utility Asset Anomaly Detection, Temperature and Humidity filtering for devices.
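The "send only if values exceed norms" edge filtering described above can be sketched as a simple range check (the threshold values are invented for illustration):

```python
def exceeds_norms(reading, norms):
    """Return True when any sensor value falls outside its allowed
    range, i.e. the reading is worth shipping upstream via S2S."""
    for key, (low, high) in norms.items():
        value = reading.get(key)
        if value is not None and not (low <= value <= high):
            return True
    return False

# Illustrative thresholds only; real deployments would tune per device.
NORMS = {"temperature_c": (0, 45), "humidity_pct": (20, 80)}
```

On a constrained edge device, dropping in-range readings like this is what keeps the S2S link from being saturated with uninteresting data.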
This talk builds on several existing articles I have written:
https://community.hortonworks.com/articles/142686/real-time-ingesting-and-transforming-sensor-and-so.html
https://community.hortonworks.com/articles/121916/controlling-big-data-flows-with-gestures-minifi-ni.html
https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html
https://community.hortonworks.com/articles/103863/using-an-asus-tinkerboard-with-tensorflow-and-pyth.html
https://community.hortonworks.com/articles/130814/sensors-and-image-capture-and-deep-learning-analys.html
https://community.hortonworks.com/articles/107379/minifi-for-image-capture-and-ingestion-from-raspbe.html
https://community.hortonworks.com/articles/118132/minifi-capturing-converting-tensorflow-inception-t.html
https://community.hortonworks.com/articles/136039/integrating-nvidia-jetson-tx1-running-tensorrt-int-3.html
https://community.hortonworks.com/articles/101904/part-2-iot-augmenting-gps-data-with-weather.html
Speaker
Timothy Spann, Solutions Engineer, Hortonworks
Building Data Pipelines for Solr with Apache NiFiBryan Bende
Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic. Some of NiFi's key features include a web-based user interface for monitoring and controlling data flows, guaranteed delivery, data provenance, and easy extensibility through custom processor development.
These features make NiFi a perfect candidate for building production quality data pipelines that interact with Apache Solr. This talk will demonstrate how to use a NiFi processor that delivers data to a Solr update handler, as well as a processor for extracting data from Solr on regular intervals for delivery to down-stream systems. In addition we will show how these processors can be combined with other built-in NiFi processors to solve a variety of use cases, including log aggregation, and indexing messages received from Kafka.
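As a hedged sketch of what delivering to Solr's JSON update handler involves (the URL layout follows Solr's standard `/update` endpoint; the helper function itself is hypothetical and is not a NiFi processor):

```python
import json

def solr_update_request(collection, docs,
                        base_url="http://localhost:8983/solr"):
    """Build the URL, JSON body, and headers for a Solr update.
    base_url and collection are illustrative; POST the result with
    any HTTP client."""
    url = f"{base_url}/{collection}/update?commit=true"
    body = json.dumps(docs)
    headers = {"Content-Type": "application/json"}
    return url, body, headers
```

In the NiFi flow described in the talk, a processor performs this same shaping of flowfile content into an update-handler request.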
Data-Center Replication with Apache AccumuloJosh Elser
Talk given at Accumulo Summit 2014 in Hyattsville, MD.
Apache Accumulo presently lacks the ability to automatically replicate its contents to another Accumulo instance with low latency and no administrator intervention. This talk will outline the problems in designing a low-latency replication system for Accumulo tables, describe an implementation that leverages some useful features of Accumulo, and outline future work in the area.
Accumulo Summit 2014: Data-Center Replication with Apache AccumuloAccumulo Summit
Speaker: Josh Elser
Apache Accumulo presently lacks the ability to automatically replicate its contents to another Accumulo instance with low latency. The only options currently available involve quiescing a table, exporting that table, copying it to the remote instance and importing it. This is unacceptable for a few reasons, the most important being the required unavailability of the given table during export. This talk will outline the problems in designing a low-latency replication system for Accumulo tables, describe an implementation that leverages some useful features of Accumulo, and outline future work in the area.
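The replication idea, applying mutations locally while queuing them for asynchronous shipment to a peer, can be sketched as follows (a toy model for intuition only, not Accumulo's actual write-ahead-log-based design):

```python
from collections import deque

class ReplicatedTable:
    """Toy log-shipping sketch: every mutation is applied locally and
    queued; a background step drains the queue to the peer, so the
    table never has to be quiesced or exported."""
    def __init__(self):
        self.data = {}
        self.pending = deque()

    def put(self, key, value):
        self.data[key] = value          # local write stays available
        self.pending.append((key, value))  # queued for replication

    def replicate_to(self, peer):
        while self.pending:
            key, value = self.pending.popleft()
            peer.data[key] = value
```

The key property, mirrored from the talk's motivation, is that replication latency is decoupled from local write availability.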
If you want a 10,000 ft overview of the concept of flow analysis, take a look at these 6 slides. It discusses the concept of a “flow”, the role of exporters and collectors, and the characteristics of various flow formats.
5 Things Cucumber Is Bad At by Richard LawrenceSkills Matter
This talk will look at 5 things Cucumber’s bad at, why that’s a good thing, and what it tells us about Cucumber’s sweet spot in a team’s toolkit.
Many times, when people complain about something Cucumber’s not good at, they’re unwittingly describing something Cucumber shouldn't be good at. They’re revealing that they don’t quite understand BDD and Cucumber’s role in it.
Cucumber is the world's most misunderstood collaboration tool and people need to hear this over and over again.
Patterns for slick database applicationsSkills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
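Slick's query reuse is expressed in Scala; as a language-neutral sketch of the "re-using code between queries" theme, small query fragments can be written once and composed (all names here are invented for illustration):

```python
def by_active(rows):
    """Reusable fragment: keep only active rows."""
    return (r for r in rows if r["active"])

def by_country(country):
    """Parameterized fragment: keep rows for one country."""
    def q(rows):
        return (r for r in rows if r["country"] == country)
    return q

def compose(*filters):
    """Chain fragments so common predicates are written once."""
    def run(rows):
        for f in filters:
            rows = f(rows)
        return list(rows)
    return run

active_in_us = compose(by_active, by_country("US"))
```

In Slick the same composition happens over `Query` values and is translated to SQL, but the boilerplate-reduction payoff is the same.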
Scala eXchange 2013: Haoyi Li on Metascala, a tiny DIY JVMSkills Matter
Metascala is a tiny metacircular Java Virtual Machine (JVM) written in the Scala programming language. Metascala is barely 3000 lines of Scala, and is complete enough that it is able to interpret itself metacircularly. Being written in Scala and compiled to Java bytecode, the Metascala JVM requires a host JVM in order to run.
The goal of Metascala is to create a platform to experiment with the JVM: a 3000 line JVM written in Scala is probably much more approachable than the 1,000,000 lines of C/C++ which make up HotSpot, the standard implementation, and more amenable to implementing fun features like continuations, isolates or value classes. The 3000 lines of code gives you:
The bytecode interpreter, together with all the run-time data structures
A stack-machine to SSA register-machine bytecode translator
A custom heap, complete with a stop-the-world, copying garbage collector
Implementations of parts of the JVM's native interface
Although it is far from a complete implementation, Metascala already provides the ability to run untrusted bytecode securely (albeit slowly), since every operation which could potentially cause harm (including memory allocations and CPU usage) is virtualized and can be controlled. Ongoing work includes tightening of the security guarantees, improving compatibility and increasing performance.
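The heart of the list above, the bytecode interpreter, is a loop dispatching on opcodes over an operand stack. A tiny stack-machine interpreter in that spirit, with an invented three-instruction set for illustration:

```python
def interpret(code, stack=None):
    """Minimal stack-machine interpreter loop, in the spirit of
    Metascala's bytecode interpreter (instruction set is invented)."""
    stack = [] if stack is None else stack
    pc = 0  # program counter
    while pc < len(code):
        op, *args = code[pc]
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        pc += 1
    return stack[-1]
```

Metascala additionally translates such stack code to an SSA register form before interpreting, which this sketch omits.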
Progressive F# Tutorials NYC: Dmitry Mozorov & Jack Pappas on Code Quotations ...Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
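F# quotations capture code as data so programs can inspect and transform it; a rough Python analogue uses the standard `ast` module to parse an expression, rewrite the tree, and re-evaluate it (the constant-doubling rewrite is an invented example, not from the tutorial):

```python
import ast

# Quote: parse the expression into a tree instead of evaluating it.
tree = ast.parse("x * x + 1", mode="eval")

class DoubleConstants(ast.NodeTransformer):
    """Example transform over quoted code: double numeric literals."""
    def visit_Constant(self, node):
        if isinstance(node.value, (int, float)):
            return ast.Constant(node.value * 2)
        return node

rewritten = ast.fix_missing_locations(DoubleConstants().visit(tree))

# Splice: compile and evaluate the transformed tree with x bound to 3,
# so x*x + 1 becomes x*x + 2 = 11.
result = eval(compile(rewritten, "<quoted>", "eval"), {"x": 3})
```

F# quotations are typed and integrated into the language (`<@ ... @>`), which makes this pattern far safer there than the string-based version shown here.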
CukeUp NYC: Ian Dees on Elixir, Erlang, and CucumberlSkills Matter
Elixir, Erlang, and Cucumberl
Elixir is a new Ruby-inspired programming language that uses the powerful concurrent machinery of Erlang behind the scenes. Cucumberl is a port of Cucumber to Erlang. Let's see what happens when we put them together.
In this talk, we'll discuss:
How Erlang's concurrency makes it easier to write robust programs
Elixir's approachable syntax
How to test Erlang and Elixir programs using Cucumberl
Attendees will walk away with a solid introduction to the principles of Erlang, and an appreciation of the way Elixir brings the joy of Ruby to the solidity of the Erlang runtime.
CukeUp NYC: Peter Bell on Getting Started with cucumber.jsSkills Matter
Cukeup NYC. Peter Bell on Getting started with cucumber.js
Ever wished you could use cucumber in your javascript apps? In this talk we'll look at the current state of play of cucumber js, when you should and shouldn't use it, and how to get started writing your step definitions in javascript.
Agile Testing & BDD eXchange NYC 2013: Jeffrey Davidson & Lav Pathak & Sam Ho...Skills Matter
In this engaging experience report, we will present 3 different views – Developer, Tester, Business Analyst – of implementing Acceptance Test Driven Development in a complex, data-driven domain. Hear how we used ATDD for building a ubiquitous language across the entire team, promoting faster feedback, and cultivating a culture where product owners were deeply invested in the quality of both every deliverable and the system as a whole.
Progressive F# Tutorials NYC: Rachel Reese & Phil Trelford on Try F# from Zero...Skills Matter
In this tutorial, Phil and Rachel will introduce you to the Try F# samples giving you exposure to, and an understanding of, how F# tackles some real-world scenarios. We'll help you explore, generate, and just play around with code samples, as well as talk you through some of the key principles of F#. By the end of this session, you'll have gone from zero to data science in only a few hours!
Progressive F# Tutorials NYC: Don Syme keynote on F# in the Open Source WorldSkills Matter
F# is a powerful open-source language which Microsoft, other companies and the F# community all contribute to. In this talk, Don will discuss how the “F# space” has recently opened up significantly in interesting ways. F# now includes contributions that range from Cloud IDE platforms, Cloud Compute frameworks, Data interoperability components, Cross-platform execution, Try F#, MonoDevelop, and even Emacs editor integration with surprising tooling support, as well as the Visual F# tools from Microsoft and the broader NuGet package ecosystem. Don will also talk about some of the latest contributions from Microsoft Research, including new type provider components for F#, and describe how his team works with the Visual F# team and other teams around Microsoft. There will also be demos of some fun new stuff that’s been going on with F# at MSR and the community.
Agile Testing & BDD eXchange NYC 2013: Gojko Adzic on Bond Villain Guide to S...Skills Matter
Would you like to learn how to make your software testing practices more effective? And how to use your testing strategy to better capture and reflect customer requirements? Gojko Adzic takes a critical look at the effectiveness of current software testing practices and proposes strategies to make it much more effective.
Dmitry Mozorov on Code Quotations: Code-as-Data for F#Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
Simon Peyton Jones: Managing parallelismSkills Matter
If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (ie functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet, and yet ... functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll sketch a whole range of ways of writing parallel program in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.
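The data-parallel end of the spectrum Simon sketches rests on one observation: when the mapped function is pure, the runtime may evaluate elements in any order, or in parallel, without changing the result. A toy sketch of that contract using threads (illustrative only, not one of the Haskell mechanisms discussed in the talk):

```python
from concurrent.futures import ThreadPoolExecutor

def pmap(f, xs, workers=4):
    """Parallel map over a pure function f: because f has no side
    effects, evaluation order is unobservable and the executor is
    free to run elements concurrently. Result order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))
```

The moment `f` has side effects, this equivalence with sequential `map` breaks, which is exactly the argument for tightly controlling effects.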
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in this talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My slides, together with Rik Marselis, from the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
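A deliberately simplified sketch of the idea, not the paper's exact algorithm: treat a byte position as uninteresting if mutating it never changes the coverage the target program produces, then trim those positions from the seed (`coverage_of` stands in for actually running the instrumented target):

```python
def uninteresting_bytes(seed, coverage_of):
    """Flag byte positions whose mutation leaves coverage unchanged.
    Simplified one-mutation probe; a real tool would try several
    mutations per position before declaring it uninteresting."""
    baseline = coverage_of(seed)
    boring = []
    for i in range(len(seed)):
        mutated = seed[:i] + bytes([seed[i] ^ 0xFF]) + seed[i + 1:]
        if coverage_of(mutated) == baseline:
            boring.append(i)
    return boring

def trim(seed, positions):
    """Drop the flagged positions, yielding a leaner seed."""
    skip = set(positions)
    return bytes(b for i, b in enumerate(seed) if i not in skip)
```

The payoff mirrors the results above: mutations in the trimmed seed are spent only on bytes the target actually reacts to.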
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
2. Scenario
• Situation:
  – You have hundreds of services producing logs
  – You're running a daily cron job on the logs
    • Rotating the logs
    • Maybe compressing or otherwise processing them
    • Transferring them to HDFS (the Hadoop Distributed File System)
• Problem:
  – As the amount of data increases, it takes longer and longer to run the cron job
7/15/2010 2
3. You need a "Flume"
• Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them
• Open source, Apache v2.0 License
• Goals:
  – Reliability
  – Scalability
  – Extensibility
  – Manageability
(Photo: Columbia Gorge, Broughton Log Flume)
4. Use cases
• Collecting logs from nodes in your Hadoop cluster
• Collecting logs from services such as httpd, mail, etc.
• Collecting impressions from custom apps for an ad network
• But wait, there's more!
  – Basic online in-stream analysis
  – Online in-stream file processing and manipulation
(Caption: It's log, log ... Everyone wants a log!)
5. Key abstractions
• Data path and control path
• Nodes are in the data path
  – Nodes have a source and a sink
  – They can take different roles
• A typical topology has agent nodes and collector nodes
• Optionally it has processor nodes
• Masters are in the control path
  – Centralized point of configuration
  – Specify sources and sinks
  – Can control flows of data between nodes
  – Use one master or use many with a ZooKeeper-backed quorum
(Diagram: Agent and Collector nodes on the data path; Master on the control path)
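The agent/collector split maps directly onto Flume's dataflow configuration language. A minimal sketch, with syntax and source/sink names recalled from the Flume 0.9.x user guide (exact spellings may differ; hostnames, paths, and the port are illustrative):

```
agent1 : tail("/var/log/httpd/access.log") | agentSink("collector1", 35853) ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/weblogs/", "web-") ;
```

The master pushes this mapping of node name to `source | sink` out to the corresponding nodes, so changing a flow is a configuration update rather than a redeploy.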
8. Outline
• What is Flume?
  – Goals and architecture
• Reliability
  – Fault-tolerance and high availability
• Scalability
  – Horizontal scalability of all nodes and masters
• Extensibility
  – Unix principle; all kinds of data, all kinds of sources, all kinds of sinks
• Manageability
  – Centralized management supporting dynamic reconfiguration
9. RELIABILITY
The logs will still get there…
10. Tunable data reliability levels
• Best effort
  – Fire and forget
  (Agent → Collector → HDFS)
• Store on failure + retry
  – Local acks, local errors detectable
  – Failover when faults detected
  (Agent → Collector → HDFS)
• End-to-end reliability
  – End-to-end acks
  – Data survives compound failures, and may be retried multiple times
  (Agent → Collector → HDFS)
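In the configuration language, the three levels correspond to different agent sinks. A sketch, assuming the `agentBESink` / `agentDFOSink` / `agentE2ESink` names as recalled from the Flume 0.9.x documentation:

```
bestEffort : tail("/var/log/app.log") | agentBESink("collector1", 35853) ;
diskFailover : tail("/var/log/app.log") | agentDFOSink("collector1", 35853) ;
endToEnd : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
```

BE fires and forgets; DFO buffers to local disk on failure and retries; E2E keeps the event in the agent's write-ahead log until the final destination acknowledges delivery.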
13. Data path is horizontally scalable
(Diagram: multiple Agents → Collector → HDFS)
• Add collectors to increase availability and to handle more data
  – Assumes a single agent will not dominate a collector
  – Fewer connections to HDFS
  – Larger, more efficient writes to HDFS
• Agents have mechanisms for machine resource tradeoffs
  • Write log locally to avoid collector disk IO bottleneck and catastrophic failures
  • Compression and batching (trade CPU for network)
  • Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
14. Load balancing
(Diagram: Agents logically partitioned across multiple Collectors)
• Agents are logically partitioned and can send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
• Spread load if a collector goes down
• Spread load if new collectors are added to the system
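Failover between collectors can be spelled out explicitly in the sink specification. A sketch using the `< primary ? backup >` failover form as recalled from the Flume 0.9.x guide (node names and port are illustrative):

```
agent1 : tail("/var/log/app.log") | < agentSink("collectorA", 35853) ? agentSink("collectorB", 35853) > ;
```

In larger deployments the master can generate randomized failover chains of this shape automatically, which is what spreads load when a collector dies or when new collectors join.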
16. Control plane is horizontally scalable
(Diagram: Nodes → Masters → ZooKeeper quorum ZK1/ZK2/ZK3)
• A master controls dynamic configurations of nodes
  – Uses a consensus protocol to keep state consistent
  – Scales well for configuration reads
  – Allows for adaptive repartitioning in the future
• Nodes can talk to any master
• Masters can talk to any ZooKeeper member
19. EXTENSIBILITY
Turn raw logs into something useful…
20. Flume is easy to extend
• Simple source and sink APIs
  – Event-granularity streaming design
  – Have many simple operations and compose them for complex behavior
• End-to-end principle
  – Put smarts and state at the end points; keep the middle simple
• Flume deals with reliability
  – Just add a new source or a new sink, and Flume has primitives to deal with reliability
21. Variety of data sources
• Can deal with push and pull sources
  (Diagram: App → Agent via push, poll, and embed)
• Supports many legacy event sources
  – Tailing a file
  – Output from a periodically exec'ed program
  – Syslog, syslog-ng
  – Experimental: IRC / Twitter / Scribe / AMQP
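Each of these source types plugs into the same `source | sink` slot in a node's configuration. A sketch, with the `tail` and `syslogTcp` source names as recalled from the 0.9.x docs (ports and paths are illustrative):

```
webNode : tail("/var/log/httpd/access.log") | agentSink("collector1", 35853) ;
syslogNode : syslogTcp(5140) | agentSink("collector1", 35853) ;
```

Swapping the source changes where events come from; everything downstream, including the reliability mechanisms, stays the same.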
22. Variety of data output
• Send data to many sinks
  – HDFS, files, console, RPC
  – Experimental: HBase, Voldemort, S3, etc.
• Supports an extensible variety of output formats and destinations
  – Output to language-neutral and open data formats (JSON, Avro, text)
  – Compressed output files in development
• Uses decorators to process event data in-flight
  – Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc.
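Decorators wrap a sink and transform events on the way through, so in-flight processing composes in the configuration itself. A sketch assuming `batch` and `gzip` decorator names (illustrative; the exact decorator catalog should be checked against the documentation):

```
agent1 : tail("/var/log/app.log") | { batch(100) => { gzip => agentE2ESink("collector1", 35853) } } ;
```

Here events would be grouped 100 at a time and compressed before being shipped, trading agent CPU for network bandwidth as described above.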
24. Centralized data flow management
• Master specifies node sources, sinks, and data flows
  – Simply specify the role of the node: collector, agent
  – Or specify a custom configuration for a node
• Control interfaces:
  – Flume Shell
  – Basic Web
  – HUE + Flume Manager App (Enterprise users)
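With the Flume Shell, a role or custom configuration is pushed from the command line. A hedged transcript sketch (the `exec config` command form and the master port 35873 are as recalled from the 0.9.x guide; hostnames are illustrative):

```
$ flume shell -c masterhost:35873
[flume masterhost:35873] exec config agent1 'tail("/var/log/app.log")' 'agentSink("collector1", 35853)'
[flume masterhost:35873] exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/", "log-")'
```

Because the master holds the configurations centrally, the same commands reconfigure a node whether it is one machine or one of hundreds.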
26. For advanced users
• A concise and precise configuration language for specifying arbitrary data paths
  – Dataflows are essentially DAGs
  – Control specific event flows
    • Enable durability and failover mechanisms
    • Tune the parameters of these mechanisms
  – Dynamic updates of configurations
    • Allows for live failover changes
    • Allows for handling newly provisioned machines
    • Allows for changing analytics
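Because dataflows are DAGs, a single source can fan out to several sinks in one specification. A sketch using the `[ sink1, sink2 ]` fan-out form as recalled from the 0.9.x guide (the `console` sink name is illustrative):

```
agent1 : tail("/var/log/app.log") | [ console, agentE2ESink("collector1", 35853) ] ;
```

Updating such a spec at the master re-wires the flow live, which is how failover changes, newly provisioned machines, and changed analytics are handled without restarting nodes.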
28. Summary
• Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
  – Tunable data reliability levels
  – Reliable master backed by ZooKeeper
  – Writes data to HDFS into buckets ready for batch processing
  – Dynamically configurable nodes
  – Simplified automated management for agent+collector topologies
• Open source, Apache v2.0 license