Inside Flume

                            Henry Robinson
                          henry@cloudera.com
                               @henryr




Tuesday, 17 August 2010
Who am I?

  • Distributed systems guy

  • Apache ZooKeeper committer

  • I work at Cloudera on Flume, ZooKeeper, Hue, more...

  • p.s. Cloudera is hiring!




About Cloudera

  • Software, services and support for Hadoop
  • Built around an open core
        • All our patches get contributed upstream
        • Flume and Hue are open-source
        • We just started the Whirr project
  • We maintain, package and support Cloudera’s Distribution
    for Hadoop
        • Smoothing off a lot of the rough edges around Hadoop
        • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive,
          Pig, Hue, Flume and more.


What’s the problem?

  • Data collection is currently a priori and ad hoc

  • A priori - decide what you want to collect ahead of time

  • Ad hoc - each kind of data source goes through its own
    collection path
        • Usually a collection of fragile, custom scripts




What is Flume? (and how can it help?)

  • Flume is:
        •   A distributed data collection service
        •   Scalable
        •   Configurable
        •   Extensible
        •   Manageable
        •   Open source
  • How can it help?
        • One-stop solution for data collection of all formats
        • Flexible reliability guarantees allow careful performance tuning
        • Enables quick iteration on new collection strategies


The Flume Model

  • Built around the concept of flows
  • A single flow corresponds to a type of data source
        • Like web server logs
        • Or machine monitoring metrics
  • Different flows might have different compression,
    batching or reliability setups
        • Flume multiplexes many flows onto one service instance
  • Flows are composed of nodes chained together
        • Each Flume process can run many nodes, so resources are
          shared
        • Each node receives data at its source, and sends it to its sink


Flume Flows

  • Three typical flows, all on the same Flume service


  [Diagram: Flow 1 - web clicks (reliable delivery, compressed, batched);
   Flow 2 - process monitoring (best-effort delivery); Flow 3 - advert
   impressions (reliable delivery). Data enters each flow on the left and
   emerges as events on the right.]

Anatomy of a Flume node

  • Data come in through a source...
  • ... are optionally processed by one or more decorators...
  • ... and then are transmitted out via a sink
  • Each of these components is (re-)configurable at run-time
  • Each has a very simple API, and a plugin interface that makes
    customizing Flume very easy (see the sketch below)
  • These simple abstractions are sufficient to build more
    complex features like acknowledged delivery, filtering,
    compression
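
  To make the source -> decorator -> sink pipeline concrete, here is a
  minimal Java sketch. The interfaces are hypothetical stand-ins, not
  Flume's actual plugin API; a real decorator would transform Flume events
  rather than plain strings.

      // Hypothetical stand-ins for Flume's plugin interfaces (illustration only).
      interface EventSource { String next(); }          // produces raw event bodies
      interface EventSink   { void append(String e); }  // consumes event bodies

      // A decorator wraps a sink and transforms events on the way through.
      class UpperCaseDecorator implements EventSink {
          private final EventSink downstream;
          UpperCaseDecorator(EventSink downstream) { this.downstream = downstream; }
          public void append(String e) { downstream.append(e.toUpperCase()); }
      }

      public class NodeSketch {
          public static void main(String[] args) {
              EventSource source = () -> "a log line";                  // stand-in source
              EventSink sink = e -> System.out.println("sink: " + e);   // stand-in sink
              EventSink node = new UpperCaseDecorator(sink);            // src | { deco => sink }
              node.append(source.next());
          }
      }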

Agents and Collectors

  • Nodes that receive data from an application are called
    agents
  • Flume supports many sources for agents, including:
        •   Syslog
        •   Tailing a file (sketched below)
        •   Unix processes
        •   Scribe API
        •   Twitter
  • Nodes that write data to permanent storage are called
    collectors
        • Most often they write to HDFS
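
  As a sketch of what a tail-style agent source does, the Java below polls
  a file and emits newly appended lines. It is illustrative only - not
  Flume's implementation - and the log path is just an example.

      import java.io.IOException;
      import java.io.RandomAccessFile;

      // Illustrative tail-style source: poll a file, emit newly appended lines.
      public class TailSketch {
          public static void main(String[] args) throws IOException, InterruptedException {
              RandomAccessFile f = new RandomAccessFile("/var/log/httpd.log", "r");
              long offset = f.length();               // start at the end, like tail -f
              while (true) {                          // a sketch: loops forever
                  if (f.length() > offset) {
                      f.seek(offset);
                      String line;
                      while ((line = f.readLine()) != null) {
                          System.out.println("event: " + line);  // hand off to the node's sink
                      }
                      offset = f.getFilePointer();
                  }
                  Thread.sleep(1000);                 // poll interval
              }
          }
      }
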
Flume Nodes

  [Diagram: the three node roles in a flow.
   Agent - source: tail Apache HTTPD logs; sink: forward to a downstream
   processor node.
   Processor - source: receive from the upstream agent node; decorator:
   extract the browser name from the log string and attach it to the event;
   sink: forward to a downstream collector node.
   Collector - source: receive from the upstream processor node; sink:
   write to HDFS at hdfs://namenode/weblogs/%{browser}/.]

  • Each role may be played by many different nodes
  • Usually require substantially fewer collectors than agents

Flume Events

  • All data are transformed into a series of events

  • Events are a pair (body, metadata)

  • Body is a string of bytes

  • Metadata is a table mapping keys to values
        • Flume can use this to inform processing
        • Or simply write it with the event
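
  Concretely, an event can be pictured as a byte-array body plus a
  key/value metadata table. The Java below is a minimal illustrative
  sketch of that shape, not Flume's actual Event class; the field types
  are assumptions, and the "browser" key echoes the collector example
  later in this deck.

      import java.nio.charset.StandardCharsets;
      import java.util.HashMap;
      import java.util.Map;

      // Illustrative event shape: a byte-string body plus a metadata table.
      public class EventSketch {
          final byte[] body;                    // the raw payload
          final Map<String, String> metadata;   // e.g. {"browser" -> "Firefox"}

          EventSketch(byte[] body, Map<String, String> metadata) {
              this.body = body;
              this.metadata = metadata;
          }

          public static void main(String[] args) {
              Map<String, String> meta = new HashMap<>();
              meta.put("browser", "Firefox");   // a decorator might attach this
              EventSketch e = new EventSketch(
                  "GET /index.html HTTP/1.1".getBytes(StandardCharsets.UTF_8), meta);
              // A collector sink could use %{browser} from the metadata to pick a bucket.
              System.out.println(new String(e.body, StandardCharsets.UTF_8) + " " + e.metadata);
          }
      }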


The Flume Configuration Language

  • Node configurations are written in a simple language
        • my-flume-node : src | { decorator => sink }
  • For example: a configuration to read HTTP log data from
    a file and send it to a collector:
        • web-log-agent : tail("/var/log/httpd.log") | agentBESink
  • On the collector, receive data and bucket it according to
    browser:
        • web-log-collector : autoCollectorSource
          | { regex("(Firefox|Internet Explorer)", "browser") =>
          collectorSink("hdfs://namenode/flume-logs/%{browser}") }
  • Two lines to set up an entire flow


Keeping Track of Nodes

  • The master service monitors all Flume nodes
        • A single port of call for checking on the health of your Flume
          service
  • Send commands to the master, and it will forward them
    to the nodes
  • The Flume Shell is a convenient, scriptable command-line
    tool
  • Web-based UIs are also available



Flume as a Distributed System

  • Fundamental principle: Keep state out of the data path
    where possible
        •   Replication is costly
        •   Consistency is problematic
        •   Global knowledge is impractical
        •   Follow the end-to-end principle - put smarts at the edges (see
            the sketch below)
  • Advantages
        • Failures become much cheaper
        • Performance is better
  • Disadvantages
        • Have to weaken some delivery guarantees
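
  To illustrate the trade-off, here is a sketch of end-to-end acknowledged
  delivery: the agent at the edge keeps each event buffered until the
  collector confirms a durable write, so the hops in between can stay
  stateless. This is an illustration of the principle, not Flume's actual
  protocol.

      import java.util.ArrayDeque;
      import java.util.Queue;

      // Illustrative end-to-end acknowledgement: the agent retains events
      // until the collector acks a durable write; intermediate hops hold no state.
      public class EndToEndSketch {
          static class Collector {
              boolean write(String event) {
                  System.out.println("durably wrote: " + event);
                  return true;                      // ack only after the write succeeds
              }
          }

          public static void main(String[] args) {
              Queue<String> unacked = new ArrayDeque<>();  // agent-side retry buffer
              Collector collector = new Collector();

              unacked.add("event-1");
              unacked.add("event-2");
              while (!unacked.isEmpty()) {
                  String e = unacked.peek();
                  if (collector.write(e)) {
                      unacked.poll();               // drop only once acknowledged
                  }                                 // else: leave it queued and retry
              }
          }
      }
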
Scalability and reliability in Flume

  • The data path is ‘horizontally scalable’
        • Add more machines, get more performance
        • Typically the bottleneck is write performance at the collector
        • If machines fail, others automatically take their place
  • The master only requires a few machines
        • Consistency and replication handled by ZooKeeper + gossip
        • A cluster of five or seven machines can handle thousands of
          nodes
        • Can add more if you manage to hit the limit



Flume as Open Source

  • http://github.com/cloudera/flume
  • Already a vibrant contributor community
  • Flume 0.9.1 is at release candidate 0 right now

  • Cloudera provides
        • Packages
        • Standardisation
        • Support



