Cloudera's Flume
 

Talk given at Large Scale Production Engineering meetup at Yahoo on March 24th, 2011.



    Cloudera's Flume: Presentation Transcript

    • Flume: Reliable log collection. Henry Robinson, Cloudera. Jon Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener and others. Yahoo!, 03/24/11
    • Who am I? ▪ Software engineer at Cloudera ▪ Apache ZooKeeper committer and PMC member ▪ Worked on first few versions of Flume ▪ Now onto pastures new (i.e. internal projects)
    • Some motivation ▪ You have lots of servers generating log files or other data ▪ You realise that capturing all your data is now economical and valuable ▪ You’ve got a shiny new Apache Hadoop cluster ready to store and process all this data ▪ But the data’s not in the cluster. It’s on your servers.
    • The Traditional Solution ▪ Build your own data-copying scripts ▪ Maybe a cron job ▪ SCP’ed to all machines ▪ Started and stopped manually ▪ Configuration is hardcoded ▪ Adding new sources usually means a new script ▪ Job security for whoever wrote them :)
    • Ad-hoc development gets complicated ▪ Reliability is hard ▪ What if your scripts fail? What if HDFS is down, or if EC2 no longer exists? ▪ Scale is hard ▪ What to do when the number of servers doubles? Do your scripts load balance and aggregate effectively? ▪ Manageability is hard ▪ How will you change the location of the namenode, or the directory that you write to? How do you monitor your scripts? ▪ Extensibility is hard ▪ How will you add new sources?
    • Cloudera Flume ▪ Flume is a distributed service for collecting data and persisting it in a centralised store ▪ What it is: ▪ Reliable - if you really need to collect every event, you can do that (but the control is in your hands) ▪ Scalable - got more data? Add more machines. ▪ Extensible - as your data sources evolve, so can your Flume flow ▪ Manageable - provision, deploy, configure, reconfigure and terminate data flows from a single point of configuration ▪ Open source - Apache licensed, on GitHub ▪ Part of CDH - nicely integrated with Hadoop, Hive and HBase
    • Flume’s Key Abstractions ▪ Separate data plane and control plane ▪ Nodes are the data plane ▪ Each node has a source (where data come in) and a sink (where the data are sent to) ▪ Nodes also have optional decorators to do per-event processing ▪ Nodes have different roles in data flow ▪ Agent nodes gather data from the source machines ▪ Collector nodes receive data from many agents, aggregate and batch-write to HDFS
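
    As an illustrative sketch (not from the original slides), the two data-plane roles can be written in the configuration syntax introduced a few slides later. The hostname, port and HDFS path are placeholders; agentSink, collectorSource and collectorSink are standard sources and sinks from the Flume (OG) user guide.

        web-agent : tail("/var/log/httpd.log") | agentSink("collector.example.com", 35853)
        web-collector : collectorSource(35853) | collectorSink("hdfs://namenode/flume/weblogs/", "web-")
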
    • Hierarchical Collection ▪ The simplest realistic logical topology
    • Flume Nodes (diagram) ▪ An agent node tails the Apache HTTPD logs (source) and sends them to a downstream processor node (sink) ▪ A processor node receives from the upstream agent (source), uses a decorator to extract the browser name from the log string and attach it to the event, and forwards to a downstream collector node (sink) ▪ A collector node receives from the upstream processor (source) and writes to HDFS at hdfs://namenode/weblogs/%{browser}/ (sink) ▪ Each role may be played by many different nodes ▪ Usually require substantially fewer collectors than agents
    • The Control Plane ▪ Master nodes are the control plane ▪ Centralised, fault-tolerant configuration source (backed by Apache ZooKeeper) ▪ Provision and configure new data flows without having to directly access the machine ▪ Single pane-of-glass to monitor the performance and behaviour of individual nodes
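
    For example, provisioning a flow from the master rather than on the node itself might look like the minimal sketch below, assuming the flume shell and its exec config command as described in the Flume (OG) user guide; the hostnames are placeholders.

        flume shell -c master.example.com
        exec config web-log-agent 'tail("/var/log/httpd.log")' 'agentBESink("collector.example.com", 35853)'
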
    • The Event Model ▪ All data are transformed into a series of events ▪ Events are a pair (body, metadata) ▪ Body is a string of bytes ▪ Metadata is a table mapping keys to values ▪ Flume can use this to inform processing ▪ Or simply write it with the event
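
    To make this concrete, metadata attached by a decorator can drive where an event is written. A minimal sketch, assuming the value() decorator and the %{attribute} and date escape sequences supported by collectorSink in Flume (OG); the hostname and path are placeholders.

        tagged-collector : collectorSource(35853) | { value("dc", "east-1") => collectorSink("hdfs://namenode/flume/%{dc}/%Y-%m-%d/", "ev-") }
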
    • The Configuration Language ▪ Node configurations are written in a simple language ▪ my-flume-node : src | { decorator => sink } ▪ For example: a configuration to read HTTP log data from a file and send it to a collector: ▪ web-log-agent : tail("/var/log/httpd.log") | agentBESink ▪ On the collector, receive data and bucket it according to browser: ▪ web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") } ▪ Two lines to set up an entire flow
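
    The two lines above can also be submitted to the master in a single step. A sketch, assuming the flume shell's exec multiconfig command from the Flume (OG) user guide; the quoting is illustrative.

        exec multiconfig 'web-log-agent : tail("/var/log/httpd.log") | agentBESink ; web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") } ;'
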
    • Deployment ▪ On every machine in your Flume system, run a single process ▪ Flume multiplexes many agents and collectors onto the same single process, much like a ‘virtual machine’ for nodes ▪ Once you have Flume itself configured, this allows you to deploy many flows as required on one instance of the service
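
    As a sketch of that multiplexing, several logical nodes can be mapped onto one physical Flume process and then configured individually with exec config. This assumes the flume shell's exec map command from the Flume (OG) user guide; the physical node name host1.example.com is a placeholder.

        exec map host1.example.com web-agent
        exec map host1.example.com web-collector
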
    • Tunable Reliability ▪ Reliability and latency are in tension ▪ Rather than be prescriptive, we let the user decide the tradeoff ▪ Best effort reliability (aka ‘fire and forget’) sends events upstream and then promptly forgets about them ▪ Great if there are no failures; messages sent exactly once. ▪ Store-and-forward reliability stores events to local disk before forwarding. If the next-hop fails before acknowledging receipt, we can recover. ▪ End-to-end reliability stores events at the source, and only when they are acknowledged by the eventual destination are they garbage collected
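
    In the configuration language, the three levels correspond to three agent sink variants. A minimal sketch, assuming the agentBESink, agentDFOSink and agentE2ESink names from the Flume (OG) user guide; the collector hostname and port are placeholders.

        fire-and-forget-agent : tail("/var/log/httpd.log") | agentBESink("collector.example.com", 35853)
        store-and-forward-agent : tail("/var/log/httpd.log") | agentDFOSink("collector.example.com", 35853)
        end-to-end-agent : tail("/var/log/httpd.log") | agentE2ESink("collector.example.com", 35853)
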
    • Scalability • The data path is ‘horizontally scalable’ • Add more machines, get more performance • Typically the bottleneck is write performance at the collector • If machines fail, others automatically take their place • The master only requires a few machines • Scales substantially sublinearly with number of clients • A cluster of five or seven machines can handle thousands of nodes • Can add more if you manage to hit the limit • If you do, tell us about it!
    • Extensibility ▪ Sources, sinks and decorators are all pluggable ▪ To integrate with your own systems, typically you have to implement a three-method API and you’re done. ▪ The result: an ever-expanding library of sources and sinks, allowing for integration with many systems, including: ▪ Syslog ▪ HBase ▪ Hive ▪ Unix stdin / stdout ▪ Voldemort ▪ Scribe API compatible
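
    For a sink, the three methods are roughly open(), append(event) and close() (a source is similar, with next() in place of append()). Once the plugin class is registered with the node, for example via the flume.plugin.classes property in Flume (OG), it can be referenced from the configuration language like any built-in. In the sketch below, ircSink is a purely hypothetical custom sink; only tail() is a standard source.

        irc-alerts : tail("/var/log/httpd.log") | ircSink("irc.example.com", "#ops")
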
    • Thanks! Questions? ▪ henry@cloudera.com http://github.com/cloudera/flume