Flume is an open-source distributed system for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, extensible data model and can ingest data from many different sources, process it using customizable handlers, and deliver the aggregated data to destinations such as HDFS, HBase, and Elasticsearch. Flume continues to provide reliable service in the presence of failures, recovering from errors and handling failover.
Flume is a distributed system for reliably collecting logs from different sources and transporting them to centralized data stores. It provides mechanisms for handling failures and scaling to large volumes of log data. Key components include agents that collect logs from various sources, collectors that receive logs from agents and pass them to storage, and masters that manage the configuration. Flume uses a decentralized control plane and provides tunable reliability levels to ensure logs are not lost during failures. It can scale horizontally by adding more agents, collectors, and masters as needed to handle increasing data volumes.
Chicago Data Summit: Flume: An Introduction - Cloudera, Inc.
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
Apache Flume is a distributed system for efficiently collecting large streams of log data into Hadoop. It has a simple architecture based on streaming data flows between sources, sinks, and channels. An agent contains a source that collects data, a channel that buffers the data, and a sink that stores it. This document demonstrates how to install Flume, configure it to collect tweets from Twitter using the Twitter streaming API, and save the tweets to HDFS.
Apache Flume is a tool for collecting large amounts of streaming data from various sources and transporting it to a centralized data store like HDFS. It reliably delivers events from multiple data sources to destinations such as HDFS or HBase. Flume uses a simple and flexible architecture based on streaming data flows, with reliable delivery of events guaranteed through a system of agents, channels, and sinks.
This document provides an overview of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It describes the core concepts in Flume including events, clients, agents, sources, channels, and sinks. Sources are components that read data and pass it to channels. Channels buffer events and sinks remove events from channels and transmit them to their destination. The document discusses commonly used source, channel and sink types and provides examples of Flume flows.
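As a minimal sketch of how these pieces are wired together, the following single-agent configuration (in Flume NG's properties format, closely following the canonical example in the Flume user guide) connects a netcat source to a logger sink through a memory channel; the agent name a1 and the port are arbitrary placeholders:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source: listens on a TCP port and turns each line of text into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# memory channel: buffers events between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# logger sink: writes events to the agent's log, useful for smoke testing
a1.sinks.k1.type = logger
# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1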
This document outlines Apache Flume, a distributed system for collecting large amounts of log data from various sources and transporting it to a centralized data store such as Hadoop. It describes the key components of Flume including agents, sources, sinks and flows. It explains how Flume provides reliable, scalable, extensible and manageable log aggregation capabilities through its node-based architecture and horizontal scalability. An example use case of using Flume for near real-time log aggregation is also briefly mentioned.
Apache Flume is a system for collecting streaming data from various sources and transporting it to destinations such as HDFS or HBase. It has a configurable architecture that allows data to flow from clients to sinks via channels. Sources produce events that are sent to channels, which then deliver the events to sinks. Flume agents run sources and sinks in a configurable topology to provide reliable data transport.
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014 - Steve Hoffman
Apache Flume is a distributed system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store such as Hadoop Distributed File System (HDFS). It consists of agents that collect data from sources and deliver it to sinks using channels. Common sources include log files, Kafka streams, and Avro clients. Common sinks include HDFS, HBase, Elasticsearch, and Kafka. Flume provides reliable and available service for efficiently collecting and moving large amounts of log data.
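As one hedged illustration of such a pairing, the fragment below reads from a Kafka topic and writes to HDFS using the Kafka source and HDFS sink that ship with recent Flume 1.x releases; broker addresses, the topic name, local directories, and the HDFS path are placeholders, and property names should be verified against the user guide for your Flume version:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Kafka source: consumes events from a Kafka topic
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.sources.r1.kafka.topics = app-logs
# file channel: durable buffering on local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
# HDFS sink: writes events into time-partitioned directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs/%Y/%m/%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1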
Apache Flume is a simple yet robust data collection and aggregation framework that allows easy, declarative configuration of components to pipeline data from upstream sources to backend services such as Hadoop HDFS, HBase, and others.
Flume NG is a tool for collecting and moving large amounts of log data from distributed servers to a Hadoop cluster. It uses agents that collect data through sources like netcat, store data temporarily in channels like memory, and then write data to sinks like HDFS. Flume provides reliable data transport through its use of transactions and flexible configuration through sources, channels, and sinks.
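Assuming the agent is defined in a properties file like the sketches elsewhere on this page, it is typically launched with the standard flume-ng command; the file name example.conf and the agent name a1 are placeholders that must match the prefix used in the properties file:

# start agent a1 from example.conf, logging to the console for easy debugging
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console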
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume: its motivation, its most important features, the architecture of Flume, its reliability guarantees, agent configuration, integration with the Apache Hadoop ecosystem, and more.
The document describes using Apache Flume to collect log data from machines in a manufacturing process. It proposes setting up Flume agents on each machine that generate log files and forwarding the data to a central HDFS server. The author tests a sample Flume configuration with two virtual machines generating logs and an agent transferring the data to an HDFS directory. Next steps discussed are analyzing the log data using tools like MapReduce, Hive, and Mahout and visualizing it to improve quality control and production processes.
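A hedged sketch of that two-tier layout is shown below: a machine-side agent tails a local log file and forwards events over Avro to a collector agent that writes to HDFS. Hostnames, ports, paths, and the tail command are illustrative assumptions, and an exec/tail source by itself cannot guarantee delivery if the agent process dies:

# agent running on each manufacturing machine
machine.sources = logs
machine.channels = ch
machine.sinks = toCollector
machine.sources.logs.type = exec
machine.sources.logs.command = tail -F /var/log/machine/process.log
machine.channels.ch.type = file
machine.channels.ch.checkpointDir = /var/lib/flume/checkpoint
machine.channels.ch.dataDirs = /var/lib/flume/data
machine.sinks.toCollector.type = avro
machine.sinks.toCollector.hostname = collector01.example.com
machine.sinks.toCollector.port = 4141
machine.sources.logs.channels = ch
machine.sinks.toCollector.channel = ch

# collector agent in front of the central HDFS cluster
collector.sources = fromMachines
collector.channels = ch
collector.sinks = toHdfs
collector.sources.fromMachines.type = avro
collector.sources.fromMachines.bind = 0.0.0.0
collector.sources.fromMachines.port = 4141
collector.channels.ch.type = file
collector.channels.ch.checkpointDir = /var/lib/flume/collector-checkpoint
collector.channels.ch.dataDirs = /var/lib/flume/collector-data
collector.sinks.toHdfs.type = hdfs
collector.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/manufacturing/logs/%Y/%m/%d
collector.sinks.toHdfs.hdfs.fileType = DataStream
collector.sinks.toHdfs.hdfs.useLocalTimeStamp = true
collector.sources.fromMachines.channels = ch
collector.sinks.toHdfs.channel = ch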
This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
This document discusses filesystems, RPC, HDFS, and I/O schedulers. It provides an overview of Linux kernel I/O schedulers and how they optimize disk access. It then discusses the I/O stack in Linux, including the virtual filesystem (VFS) layer. It describes the NFS client-server model using RPC over TCP/IP and how HDFS uses a similar model with its own APIs. Finally, it outlines the write process in HDFS from the client to data nodes.
This document introduces Flume and Flive. It summarizes that Flume is a distributed data collection system that can easily extend to new data formats and scales linearly as new nodes are added. It discusses Flume's core concepts of events, flows, nodes, and reliability features. It then introduces Flive, an enhanced version of Flume developed by Hanborq that provides improved performance, functionality, manageability, and integration with Hugetable.
Flume is a system for collecting, aggregating, and moving large amounts of streaming data into Hadoop. It has reliable, customizable components like sources that generate or collect event data, channels that buffer events, and sinks that ship events to destinations. Sources put events into channels, which decouple sources from sinks and provide reliability. Sinks remove events from channels and transmit them to their final destination. Flume ensures reliable event delivery through transactional channel operations and persistence. It also provides load balancing, failover, and contextual routing capabilities through interceptors, channel selectors, and sink processors.
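These routing and failover features are all driven from the agent configuration. As a hedged fragment (component names are placeholders, and the underlying source, channel, and sink definitions are assumed to exist elsewhere in the same file), the lines below add a timestamp interceptor to a source and group two sinks into a failover pair:

# interceptor: stamps each event with an ingest timestamp header
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# failover sink processor: k1 is preferred, k2 takes over if k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000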
This document provides an overview of effective HBase health checking and troubleshooting. It discusses HBase architecture including the roles of the master, regionservers and Zookeeper. It then describes various tools and utilities for troubleshooting like the master and regionserver UIs, process logs, JMX stats, the HBase shell, HBCK and performance evaluation tools. It also covers common problems like HBase not serving data or abrupt regionserver restarts and provides steps to troubleshoot these issues.
Sqoop is a tool for transferring bulk data between Hadoop and structured datastores like relational databases. This document compares Sqoop 1 and the new Sqoop 2 architecture. Sqoop 2 addresses limitations of Sqoop 1 by providing a server-side installation, explicit connector selection, common functionality for all connectors, and role-based security for accessing external systems and managing resources. The document highlights improved ease of use, extension, and security in Sqoop 2 compared to the client-side tool design of Sqoop 1.
In this session you will learn:
1. Kafka Overview
2. Need for Kafka
3. Kafka Architecture
4. Kafka Components
5. ZooKeeper Overview
6. Leader Node
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
High Availability for HBase Tables - Past, Present, and Future - DataWorks Summit
This document summarizes different approaches to achieving high availability in HBase. It discusses HBase region replicas, asynchronous WAL replication, and timeline consistency introduced in HBase 1.1 to provide increased read availability. It also describes WanDisco's non-stop HBase implementation using Paxos consensus and Facebook's HydraBase which uses RAFT consensus, with HydraBase designating region replicas as active or witness. The document compares these approaches on attributes like consensus algorithm, read/write availability, strong consistency, and support for multi-datacenter deployments.
This document provides an overview of HP Shadowbase, a data replication software. It discusses Shadowbase's capabilities including replication of data across various database platforms, zero downtime migrations, management utilities, and attractive costs. It also outlines the Shadowbase product portfolio and supported platforms/databases. Finally, it describes Shadowbase's architectural components and how data is captured from source environments and delivered to target databases.
HBase Read High Availability Using Timeline Consistent Region Replicas - enissoz
This document summarizes a talk on implementing timeline consistency for HBase region replicas. It introduces the concept of region replicas, where each region has multiple copies hosted on different servers. The primary accepts writes, while secondary replicas are read-only. Reads from secondaries return possibly stale data. The talk outlines the implementation of region replicas in HBase, including updates to the master, region servers, and IPC. It discusses data replication approaches and next steps to implement write replication using the write-ahead log. The goal is to provide high availability for reads in HBase while tolerating single-server failures.
- The document summarizes the state of Apache HBase, including recent releases, compatibility between versions, and new developments.
- Key releases include HBase 1.1, 1.2, and 1.3, which added features like async RPC client, scan improvements, and date-tiered compaction. HBase 2.0 is targeting compatibility improvements and major changes to data layout and assignment.
- New developments include date-tiered compaction for time series data, Spark integration, and ongoing work on async operations, replication 2.0, and reducing garbage collection overhead.
The document discusses best practices for operating and supporting Apache HBase. It outlines tools like the HBase UI and HBCK that can be used to debug issues. The top categories of issues covered are region server stability problems, read/write performance, and inconsistencies. SmartSense is introduced as a tool that can help detect configuration issues proactively.
This document discusses using Apache Flume to collect streaming Twitter data. Flume is a distributed service that can efficiently collect, aggregate, and move large amounts of log and streaming data to Hadoop. It describes how to set up a Flume agent to capture Twitter data using keywords and send it to HDFS. The agent uses a Twitter source, memory channel, and HDFS sink. It also discusses including additional Flume dependencies and configuring Flume and Hadoop to integrate Twitter data collection and storage in HDFS.
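The agent described above is usually expressed as a configuration along the following lines; the TwitterSource class name is the one used in Cloudera's tutorial (it requires the tutorial's extra jar on the Flume classpath), and the OAuth credentials, keywords, and HDFS path are placeholders:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
# Twitter source from the Cloudera example: streams tweets matching the keywords
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, flume, bigdata
# memory channel: buffers tweets in RAM between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# HDFS sink: writes the raw tweets into date-partitioned directories
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel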
HBase and HDFS: Understanding FileSystem Usage in HBase - enissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
In this session you will learn:
Flume Overview
Flume Agent
Sinks
Flume Installation
What is Netcat & Telnet?
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
This document provides an overview of Flume, an open source distributed system for collecting, aggregating, and moving large amounts of log data efficiently. Flume allows for reliable log collection from hundreds of services and nodes. It uses a scalable data flow model where data sources like log files are ingested from nodes as streams and delivered to destinations like HDFS. The document outlines Flume's goals of reliability, scalability, extensibility and manageability. It describes key components like agents, collectors and masters, and how the data and control planes can horizontally scale out through the addition of nodes. Reliability is achieved through tunable data acknowledgment levels.
This document provides an overview of Apache Flume and how it can be used to load streaming data into a Hadoop cluster. It describes Flume's core components like sources, channels, sinks and how they work together in an agent. It also gives examples of using a single Flume agent and multiple agents to collect web server logs. Advanced features like interceptors, fan-in/fan-out are also briefly covered along with a simple configuration example to ingest data into HDFS.
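A hedged single-agent sketch along those lines watches a directory of completed (rotated) web-server logs with a spooling directory source and ships them to HDFS; the directory, agent name, and HDFS path are assumptions, and the spooling directory source expects files to be immutable once dropped into the directory:

web.sources = access
web.channels = ch
web.sinks = toHdfs
# spooling directory source: picks up finished log files placed in the directory
web.sources.access.type = spooldir
web.sources.access.spoolDir = /var/log/httpd/completed
web.sources.access.fileHeader = true
# durable file channel
web.channels.ch.type = file
web.channels.ch.checkpointDir = /var/lib/flume/web-checkpoint
web.channels.ch.dataDirs = /var/lib/flume/web-data
# HDFS sink with roll settings tuned for fewer, larger files
web.sinks.toHdfs.type = hdfs
web.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/weblogs/%Y/%m/%d
web.sinks.toHdfs.hdfs.fileType = DataStream
web.sinks.toHdfs.hdfs.rollInterval = 300
web.sinks.toHdfs.hdfs.rollSize = 134217728
web.sinks.toHdfs.hdfs.rollCount = 0
web.sinks.toHdfs.hdfs.useLocalTimeStamp = true
web.sources.access.channels = ch
web.sinks.toHdfs.channel = ch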
This document discusses Apache Flume and using its HBase sink. It provides an overview of Flume's architecture, components, and sources. It describes how the HBase sink works, its configuration options, and provides an example configuration for collecting data from a sequential source and storing it in an HBase table.
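A hedged version of such a configuration uses Flume's built-in sequence-generator source to produce test events and the synchronous HBase sink with the simple serializer; the table and column family names are placeholders and must already exist in HBase:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# seq source: emits an increasing counter, handy for exercising the pipeline
a1.sources.r1.type = seq
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# HBase sink: writes each event into the given table and column family
a1.sinks.k1.type = hbase
a1.sinks.k1.table = flume_test
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1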
Hadoop security has improved with additions such as HDFS ACLs, Hive column-level ACLs, HBase cell-level ACLs, and Knox for perimeter security. Data encryption has also been enhanced, with support for encrypting data in transit using SSL and data at rest through file encryption or the upcoming native HDFS encryption. Authentication is provided by Kerberos/AD with token-based authorization, and auditing tracks who accessed what data.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It has several core components including HDFS for distributed file storage and MapReduce for distributed processing. HDFS stores data across clusters of machines with replication for fault tolerance. MapReduce allows parallel processing of large datasets in a distributed manner. Hadoop was designed with goals of using commodity hardware, easy recovery from failures, large distributed file systems, and fast processing of large datasets.
This document summarizes key aspects of the Hadoop Distributed File System (HDFS). HDFS is designed for storing very large files across commodity hardware. It uses a master/slave architecture with a single NameNode that manages file system metadata and multiple DataNodes that store application data. HDFS allows for streaming access to this distributed data and can provide higher throughput than a single high-end server by parallelizing reads across nodes.
Henry Robinson works at Cloudera on distributed data collection tools like Flume and ZooKeeper. Cloudera provides support for Hadoop and open source projects like Flume. Flume is a scalable and configurable system for collecting large amounts of log and event data into Hadoop from diverse sources. It allows defining flexible data flows that can reliably move data between collection agents and storage systems.
Gluster Webinar: Introduction to GlusterFS - GlusterFS
GlusterFS is an open source, scale-out network filesystem. It runs on commodity hardware and allows indefinite growth in capacity and performance by simply adding server nodes. Key benefits include flexibility to deploy on any hardware, linearly scalable performance, and superior storage economics compared to traditional storage solutions. GlusterFS uses a distributed hashing technique instead of a metadata server to provide high availability and reliability.
Presentation from 2013-06-27 at the Workshop on the Future of Big Data Management, discussing Hadoop for a science audience of HPC/grid users and people suddenly discovering that their data is accruing toward petabytes.
The other talks were on GPFS, LustreFS, and Ceph, so rather than just do beauty-contest slides, I decided to raise the question of "what is a filesystem?" and whether the constraints imposed by the Unix metaphor and API are becoming limits on scale and parallelism (both technically and, for GPFS and Lustre Enterprise, in cost).
Then: HDFS as the foundation for the Hadoop stack.
All the other FS talks did emphasise their Hadoop integration, with the Intel talk doing the most to assert performance improvements of LustreFS over HDFSv1 in dfsIO and Terasort (no gridmix?), which showed something important: Hadoop is the application that all DFS developers have to have a story for.
Kelkoo uses a Big Data platform including Flume, HDFS, Spark on Yarn, and Hive/SparkSQL. Flume collects log data from various sources and aggregates it into HDFS for distributed storage. HDFS uses a namenode and datanodes for high availability. Spark on Yarn enables distributed processing of the data through Spark applications running executors and tasks across Yarn containers. Hive and SparkSQL allow querying and analyzing the data.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering authentication, authorization, and auditing. Hadoop takes a “defense in depth” approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases, and design for authorization improvements in HDFS, Hive, and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches to Hive authorization. We further show how our approach lends itself to a particular initial implementation choice that has the limitation that the Hive Server owns the data, but where an alternate, more general implementation is also possible down the road. In the case of HBase, we describe cell-level authorization. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.
This Introduction to GlusterFS webinar introduces and reviews the GlusterFS architecture and its key functionality. Learn how GlusterFS is deployed in the datacenter, in the cloud, or between the two. We’ll also cover a brief update on GlusterFS v3.3, which is currently in beta.
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred - NETWAYS
Storage is one of the most important parts of a data center, and the complexity of designing, building, and delivering an always-available service continues to increase every year. One of the best solutions to these problems is a distributed filesystem (DFS). This talk describes the basic architectures of DFSs and compares different free software solutions to show what makes a DFS suitable for large-scale distributed environments. We explain how to use and deploy each solution, along with its advantages, disadvantages, performance, and layout. We also introduce case studies of implementations based on OpenAFS, GlusterFS, and Hadoop, aimed at building your own cloud storage.
Big data processing using Hadoop poster presentation - Amrut Patil
This document compares implementing Hadoop infrastructure on Amazon Web Services (AWS) versus commodity hardware. It discusses setting up Hadoop clusters on both AWS Elastic Compute Cloud (EC2) instances and several retired PCs running Ubuntu. The document also provides an overview of the Hadoop architecture, including the roles of the NameNode, DataNode, JobTracker, and TaskTracker in distributed storage and processing within Hadoop.
Design a data pipeline to gather log events and transform them into queryable data with Hive DDL.
This covers Java applications using log4j and non-Java Unix applications using rsyslog.
This document provides an overview of securing Hadoop applications and clusters. It discusses authentication using Kerberos, authorization using POSIX permissions and HDFS ACLs, encrypting HDFS data at rest, and configuring secure communication between Hadoop services and clients. The principles of least privilege and separating duties are important to apply for a secure Hadoop deployment. Application code may need changes to use Kerberos authentication when accessing Hadoop services.
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
3. Flume: 4 months after Hadoop World 2010
Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener
Cloudera, Inc.
Austin Hadoop Users Group, 2/17/2011
4. Who Am I?
• Cloudera:
– Software Engineer on the Platform Team
– Flume Project Lead / Designer / Architect
• U of Washington:
– “On Leave” from PhD program
– Research in Systems and Programming Languages
• Previously:
– Computer Security, Embedded Systems
5. The basic scenario
• You have a bunch of servers generating log files.
• You figured out that your logs are valuable and you want to keep them and analyze them.
• Because of the volume of data, you’ve started using Apache Hadoop or Cloudera’s Distribution of Apache Hadoop.
• … and you’ve got some ad-hoc, hacked-together scripts that copy data from servers to HDFS.
(“It’s log, log… everyone wants a log!”)
6. Ad-hockery gets complicated
• Reliability
– Will your data still get there … if your scripts fail? … if your hardware fails? … if HDFS goes down? … if EC2 has flaked out?
• Scale
– As you add servers, will your scripts keep up with 100GB’s per day? Will you have tons of small files? Are you going to have tons of connections? Are you willing to suffer more latency to mitigate?
• Manageability
– How do you know if the script failed on machine 172? What about logs from that other system? How do you monitor and configure all the servers? Can you deal with elasticity?
• Extensibility
– Can you service custom logs? Send data to different places like HBase, Hive or incremental search indexes? Can you do near-realtime?
• Blackbox
– What happens when the guy who wrote it leaves?
7. Cloudera Flume
Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
Project Principles:
• Scalability
• Reliability
• Extensibility
• Manageability
• Openness
8. : The Standard Use Case
[Diagram: many servers, each running an Agent; Agents feed a tier of Collectors, which write into HDFS. Agent tier → Collector tier.]
9. : The Standard Use Case
[Same diagram, with the Agent and Collector tiers grouped inside the Flume system boundary.]
10. : The Standard Use Case
[Same diagram, with a Flume Master added alongside the Agent and Collector tiers.]
12. Flume’s Key Abstractions
[Diagram: an Agent node and a Collector node, each with a source and a sink, plus a Master.]
• Data path and control path
• Nodes are in the data path
– Nodes have a source and a sink
– They can take different roles
• A typical topology has agent nodes and collector nodes.
• Optionally it has processor nodes.
• Masters are in the control path.
– Centralized point of configuration.
– Specify sources and sinks
– Can control flows of data between nodes
– Use one master or use many with a ZK-backed quorum
14. Can I has the codez?
node001: tail("/var/log/app/log") | autoE2ESink;
node002: tail("/var/log/app/log") | autoE2ESink;
…
node100: tail("/var/log/app/log") | autoE2ESink;

collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
15. Outline
• What is Flume?
• Scalability
– Horizontal scalability of all nodes and masters
• Reliability
– Fault-tolerance and High availability
• Extensibility
– Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
• Manageability
– Centralized management supporting dynamic reconfiguration
• Openness
– Apache v2.0 License and an active and growing community
17. : The Standard Use Case
[Recap of the standard use case diagram: server Agents → Collector tier → HDFS, all within Flume.]
18. Data path is horizontally scalable
[Diagram: multiple server Agents feed a Collector, which writes to HDFS.]
• Add collectors to increase availability and to handle more data
– Assumes a single agent will not dominate a collector
– Fewer connections to HDFS.
– Larger, more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
• Write log locally to avoid collector disk IO bottleneck and catastrophic failures
• Compression and batching (trade CPU for network; see the sketch below)
• Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
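To make the compression/batching point concrete, decorators in the dataflow language wrap a sink and transform events on the way through. A minimal sketch, assuming the batch() and gzip decorator names and the { decorator => sink } wrapping syntax from the Flume (OG) user guide; neither appears in this deck, so treat the exact spellings as assumptions:

node001: tail("/var/log/app/log") | { batch(100) => { gzip => autoE2ESink } };

The agent trades some CPU for fewer, larger, compressed sends to its collector; a matching unbatch/decompress step would be needed on the collector side (also an assumption, omitted here).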
20. Tunable failure recovery modes
[Diagram: an Agent → Collector → HDFS pipeline shown for each mode.]
• Best effort
– Fire and forget
• Store on failure + retry
– Local acks, local errors detectable
– Failover when faults detected.
• End to end reliability
– End to end acks
– Data survives compound failures, and may be retried multiple times
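The three modes above correspond to different agent-side sinks in the dataflow language. A minimal sketch, assuming the agentBESink (best effort), agentDFOSink (store on failure + retry, i.e. disk failover), and agentE2ESink (end-to-end acks) names from the Flume (OG) user guide; the deck itself only shows autoE2ESink, so treat these exact names as assumptions:

nodeBE: tail("/var/log/app/log") | agentBESink("collector1");
nodeDFO: tail("/var/log/app/log") | agentDFOSink("collector1");
nodeE2E: tail("/var/log/app/log") | agentE2ESink("collector1");

Only the sink choice changes; the source and the rest of the flow stay the same.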
21. Load balancing
[Diagram: Agents spread across multiple Collectors.]
• Agents are logically partitioned and send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
• Spread load if a collector goes down.
• Spread load if new collectors added to the system.
22. Load balancing and collector failover
[Same diagram and bullets as slide 21; this build shows agents failing over when a collector is unavailable.]
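Failover among collectors can also be spelled out explicitly rather than left to the master. A rough sketch, assuming the < primary ? backup > failover syntax from the Flume (OG) user guide (an assumption; this deck relies on autoE2ESink/autoCollectorSource, which let the master generate randomized failover chains for you):

node001: tail("/var/log/app/log") | < agentE2ESink("collector1") ? agentE2ESink("collector2") >;

If collector1 is down, the agent fails over to collector2, spreading load across the remaining collectors.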
23. Control plane is horizontally scalable
[Diagram: Nodes talking to multiple Masters backed by a ZooKeeper ensemble (ZK1, ZK2, ZK3).]
• A master controls dynamic configurations of nodes
– Uses a consensus protocol to keep state consistent
– Scales well for configuration reads
– Allows for adaptive repartitioning in the future
• Nodes can talk to any master.
• Masters can talk to an existing ZK ensemble
27. Centralized Dataflow Management Interfaces
• One place to specify node sources, sinks and data flows.
• Basic Web interface
• Flume Shell
– Command line interface
– Scriptable (see the sketch below)
• Cloudera Enterprise
– Flume Monitor App
– Graphical web interface
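For a sense of what scripting the Flume Shell looks like, here is a minimal sketch. It assumes the shell's -c connect flag and the exec config and getconfigs commands from the Flume (OG) documentation; the deck does not show them, so treat the exact command names as assumptions:

flume shell -c masterhost
exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'
exec config collector1 'autoCollectorSource' 'collectorSink("hdfs://logs/app/", "applogs")'
getconfigs

Because the shell is scriptable, the same commands can be driven from provisioning scripts when new machines come online.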
28. Configuring Flume
[Diagram: tail → filter → fanout → { console, roll → hdfs }.]
Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
• A concise and precise configuration language for specifying dataflows in a node.
• Dynamic updates of configurations
– Allows for live failover changes
– Allows for handling newly provisioned machines
– Allows for changing analytics
29. Output bucketing
[Diagram: Collectors writing bucketed output files into HDFS, for example:]
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…
node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
• Automatic output file management
– Writes HDFS files into time-based bucket directories
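To make the escape expansion concrete: the %Y/%m%d/%H00 pattern in the sink path is filled in from each event's timestamp, so events land in hour-sized buckets like the file names above. For example (illustrative timestamps):

event stamped 2010-07-15 12:34 → hdfs://namenode/logs/web/2010/0715/1200/data-xxx.txt
event stamped 2010-07-15 13:05 → hdfs://namenode/logs/web/2010/0715/1300/data-xxx.txt

The "data" argument is the file prefix; the collector appends a unique suffix (the xxx/xxy in the example filenames).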
31. Flume is easy to extend
• Simple source and sink APIs
– An event streaming design
– Many simple operations compose into complex behavior
• Plug-in architecture so you can add your own sources, sinks, and decorators (see the sketch below)
[Diagram: source → fanout → decorators → sinks.]
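Once a plug-in is registered with a node, its source, sink, or decorator name becomes usable in the dataflow language just like the built-ins. A minimal sketch with a purely hypothetical plug-in sink name (myIndexSink stands in for whatever your plug-in registers), assuming the flume.plugin.classes property that Flume (OG) uses to load plug-in classes:

flume.plugin.classes = com.example.flume.MyIndexSinkPlugin   (in flume-site.xml; property name assumed)
collector1: autoCollectorSource | myIndexSink("http://search-host:8080/index");

The point is that a new sink composes with the existing sources and decorators; nothing on the agents has to change.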
32. Variety of Connectors
• Sources produce data
– Console, Exec, Syslog, Scribe, IRC, Twitter
– In the works: JMS, AMQP, pubsubhubbub/RSS/Atom
• Sinks consume data
– Console, Local files, HDFS, S3
– Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search
– In the works: JMS, AMQP
• Decorators modify data sent to sinks (see the sketch below)
– Wire batching, compression, sampling, projection, extraction, throughput throttling
– Custom near real-time processing (Meebo)
– JRuby event modifiers (InfoChimps)
– Cryptographic extensions (Rearden)
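To show how a couple of these connectors snap together, here is a minimal sketch that receives syslog traffic, batches it, and buckets it into HDFS. It assumes the syslogTcp() source and batch() decorator names from the Flume (OG) user guide; only their categories (Syslog sources, wire batching) are named on this slide, so treat the exact names as assumptions:

syslognode: syslogTcp(5140) | { batch(100) => autoE2ESink };
collector1: autoCollectorSource | collectorSink("hdfs://namenode/logs/syslog/%Y/%m%d", "syslog")

Any of the contributed sinks (HBase, Cassandra, Elastic Search) could slot in where collectorSink appears.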
33. : Multi Datacenter
[Diagram: in each datacenter, API servers and processor servers run Agents that feed a local Collector tier; the collectors write into a central HDFS.]
34. : Multi Datacenter
[Same diagram, with a Relay inserted between the datacenter Collector tiers and the central HDFS.]
35. : Near Realtime Aggregator
[Diagram: ad servers run Agents feeding a Tracker + Collector into HDFS; quick reports go to a DB in near real time, and a Hive job later produces verified reports.]
36. An enterprise story
[Diagram: API servers on Windows and Linux run Agents feeding a Flume Collector tier, which writes into a Kerberos-secured HDFS; authentication integrates with Active Directory / LDAP.]
37. An emerging community story
[Diagram: server Agents feed a Collector whose fanout writes to HDFS (Hive and Pig queries), to HBase plus an HBase index (key lookups and range queries), and to an incremental search index (search and faceted queries).]
39. Flume is Open Source
• Apache v2.0 Open Source License
– Independent from Apache Software Foundation
• GitHub source code repository
– http://github.com/cloudera/flume
– Regular tarball update versions every 2-3 months.
– Regular CDH packaging updates every 3-4 months.
• Review Board for code review
• New external committers wanted!
– Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
– Independent folks: Bruce Mitchener
40. Growing user and developer community
• History:
– Initial Open Source Release, June 2010
• Growth:
– Pre-Hadoop Summit (Late June 2010):
• 4 followers, 4 forks (original authors)
– Pre-Hadoop World (October 2010):
• 174 followers, 34 forks
– Pre-CDH3B4 Release (February 2011):
• 288 followers, 51 forks
41. Support
• Community-based mailing lists for support
– “an answer in a few days”
– User: https://groups.google.com/a/cloudera.org/group/flume-user
– Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
• Community-based IRC chat room
– “quick questions, quick answers”
– #flume in irc.freenode.net
• Commercial support with Cloudera Enterprise subscription
– Chat with sales@cloudera.com
43. Summary
• Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
– It is centrally managed, which allows for automated and adaptive configurations.
– This design allows for near-real-time processing.
– Apache v2.0 License with an active and growing community
• Part of Cloudera’s Distribution for Hadoop, about to be refreshed for CDH3b4.
44. Questions? (and shameless plugs)
• Contact info:
– jon@cloudera.com
– Twitter @jmhsieh
• Cloudera Training in Dallas
– Hadoop Training for Developers - March 14-16
– Hadoop Training for Administrators - March 17-18
– Sign up at http://cloudera.eventbrite.com
– 10% discount code for classes: "hug"
• Cloudera is Hiring!