Flume: Reliable Distributed Streaming Log Collection

Flume intro-100715: Presentation Transcript

  • Flume: Reliable Distributed Streaming Log Collection. Jonathan Hsieh, Henry Robinson, Patrick Hunt. Cloudera, Inc. 7/15/2010
  • Scenario • Situation: – You have hundreds of services producing logs in a datacenter. – They produce a lot of logs that you want to analyze. – You have Hadoop, a system for processing large volumes of data. • Problem: – How do I reliably ship all my logs to a place where Hadoop can analyze them? 7/15/2010 3
  • Use cases • Collecting logs from nodes in a Hadoop cluster • Collecting logs from services such as httpd, mail, etc. • Collecting impressions from custom apps for an ad network • But wait, there’s more! – Basic availability metrics – Basic online in-stream analysis (Image caption: “It’s log, log… everyone wants a log!”) 7/15/2010 4
  • A sample topology (diagram): an agent tier of many Agent nodes feeds a collector tier of Collector nodes, coordinated by a Master; the collectors write to HDFS under /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and /logs/web/2010/0715/1400. 7/15/2010 5
  • You need a “Flume” • Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them. • Open source, Apache v2.0 license • Goals: – Reliability – Scalability – Extensibility – Manageability (Image: Columbia Gorge, Broughton Log Flume) 7/15/2010 6
  • Key abstractions • Data path and control path (diagram: Agent, Collector, Master) • Nodes are in the data path – Nodes have a source and a sink – They can take different roles • A typical topology has agent nodes and collector nodes; optionally it has processor nodes. • Masters are in the control path – Centralized point of configuration – Specify sources and sinks – Can control flows of data between nodes – Use one master or use many with a ZK-backed quorum 7/15/2010 7
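    A minimal sketch of the node = source | sink abstraction in Flume’s dataflow configuration language (the notation that appears on the “Output bucketing” slide later in the deck). Only collectorSource and collectorSink are confirmed by the slides; the node names, the tail source, the agentSink, and port 35853 are illustrative assumptions:
      webserver-1 : tail("/var/log/httpd/access.log") | agentSink("collector-1", 35853)
      collector-1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")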
  • A sample topology (diagram, repeated): agent tier → collector tier → HDFS, coordinated by the Master. 7/15/2010 8
  • Masters (diagram): the same topology with the Master highlighted in the control path above the agent tier, collector tier, and storage tier (HDFS). 7/15/2010 9
  • Outline • What is Flume? – Goals and architecture • Reliability – Fault-tolerance and High availability • Scalability – Horizontal scalability of all nodes and masters • Extensibility – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks • Manageability – Centralized management supporting dynamic reconfiguration 7/15/2010 10
  • RELIABILITY The logs will still get there. 7/15/2010 11
  • Failures • Faults can happen at many levels – Software applications can fail – Machines can fail – Networking gear can fail – Excessive networking congestion or machine load – A node goes down for maintenance. • How do we make sure that events make it to a permanent store? 7/15/2010 12
  • Tunable data reliability levels (diagram: Agent → Collector → HDFS for each level) • Best effort – Fire and forget • Store on failure + retry – Local acks, local errors detectable – Failover when faults detected • End-to-end reliability – End-to-end acks – Data survives compound failures, and may be retried multiple times 7/15/2010 13
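    In the configuration language, the reliability level is selected by the agent-side sink. A sketch, assuming the agentBESink (best effort), agentDFOSink (store on failure / disk failover), and agentE2ESink (end-to-end acks) sink names from Flume documentation of this period; node names, path, and port are illustrative:
      be-agent  : tail("/var/log/app.log") | agentBESink("collector-1", 35853)
      dfo-agent : tail("/var/log/app.log") | agentDFOSink("collector-1", 35853)
      e2e-agent : tail("/var/log/app.log") | agentE2ESink("collector-1", 35853)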
  • Dealing with Agent failures • We do not want to lose data • Make events durable at the generation point. – If a log generator goes down, it is not generating logs. – If the event generation point fails and recovers, data will still reach the end point • Data is durable and survives machine crashes and reboots – Allows for synchronous writes in log-generating applications. • A watchdog program restarts the agent if it fails. 7/15/2010 14
  • Dealing with Collector Failures • Data is durable at the agent: – Minimizes the amount of state and possible data loss – Not necessary to durably keep intermediate state at the collector – Retry if the collector goes down. • Use hot failover so agents can use alternate paths: – The master predetermines failovers to balance load when collectors go down. 7/15/2010 15
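    Failover paths can also be written directly into an agent’s sink. A sketch, assuming the < primary ? backup > failover composition described in Flume documentation of the time; hosts and port are illustrative:
      agent-1 : tail("/var/log/app.log") | < agentSink("collector-1", 35853) ? agentSink("collector-2", 35853) >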
  • Master Service Failures • A master machine should not be a single point of failure! • Masters keep two kinds of information: • Configuration information (node/flow configuration) – Kept in a ZooKeeper ensemble for a persistent, highly available metadata store – Failures are easily recovered from • Ephemeral information (heartbeat info, acks, metrics reports) – Kept in memory – Failures will lose this data – This information can be lazily replicated 7/15/2010 16
  • SCALABILITY (Image: logs jamming the Kemi River) 7/15/2010 17
  • A sample topology (diagram, repeated): agent tier → collector tier → HDFS, coordinated by the Master. 7/15/2010 18
  • Data path is horizontally scalable (diagram: many Agents → Collector → HDFS) • Add collectors to increase availability and to handle more data – Assumes a single agent will not dominate a collector – Fewer connections to HDFS – Larger, more efficient writes to HDFS • Agents have mechanisms for machine resource tradeoffs – Write the log locally to avoid collector disk I/O bottlenecks and catastrophic failures – Compression and batching (trade CPU for network) – Push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks) 7/15/2010 19
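    The “trade CPU for network” point maps onto sink decorators in the configuration language. A sketch, assuming the batch and gzip decorator names from Flume documentation of the period (with a matching unbatch/decompress step on the collector side, not shown); names and port are illustrative:
      agent-1 : tail("/var/log/app.log") | { batch(100) => { gzip => agentSink("collector-1", 35853) } }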
  • Load balancing (diagram: agents partitioned across multiple collectors) • Agents are logically partitioned and send to different collectors • Use randomization to pre-specify failovers when many collectors exist • Spread load if a collector goes down • Spread load if new collectors are added to the system 7/15/2010 20-21
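    The randomized, master-maintained failover chains were exposed as logical sinks that the master expands into a per-agent ordering of the known collectors. A sketch, assuming the autoBEChain and autoE2EChain sink names from Flume documentation of the period:
      agent-1 : tail("/var/log/app.log") | autoE2EChain
      agent-2 : tail("/var/log/app.log") | autoBEChain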
  • Control plane is horizontally scalable (diagram: Nodes → Masters → ZK1/ZK2/ZK3) • A master controls dynamic configurations of nodes – Uses a consensus protocol to keep state consistent – Scales well for configuration reads – Allows for adaptive repartitioning in the future • Nodes can talk to any master. • Masters can talk to any ZK member 7/15/2010 22-24
  • EXTENSIBILITY Turn raw logs into something useful… 7/15/2010 25
  • Flume is easy to extend • Simple source and sink APIs – Event-granularity streaming design – Provide many simple operations and compose them for complex behavior. • End-to-end principle – Put smarts and state at the end points. Keep the middle simple. • Flume deals with reliability. – Just add a new source or a new sink; Flume has primitives to deal with reliability 7/15/2010 26
  • Variety of Data sources • Can deal with push and pull sources (diagram: push, poll, and embedded App → Agent paths). • Supports many legacy event sources – Tailing a file – Output from a periodically exec’ed program – Syslog, syslog-ng – Experimental: IRC / Twitter / Scribe / AMQP 7/15/2010 27
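    A sketch of what such sources look like in the configuration language; the tail, exec, and syslogTcp source names follow Flume documentation of the period, and the paths, command, ports, and node names are illustrative assumptions:
      tail-agent   : tail("/var/log/httpd/access.log") | agentSink("collector-1", 35853)
      exec-agent   : exec("/usr/local/bin/report-metrics.sh") | agentSink("collector-1", 35853)
      syslog-agent : syslogTcp(5140) | agentSink("collector-1", 35853)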
  • Variety of Data output • Send data to many sinks – Files, HDFS, console, RPC – Experimental: HBase, Voldemort, S3, etc. • Supports an extensible variety of output formats and destinations – Output to language-neutral and open data formats (JSON, Avro, text) – Compressed output files in development • Uses decorators to process event data in flight. – Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc. 7/15/2010 28
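    A sketch of an in-flight decorator applied on the collector side; the intervalSampler decorator name is an assumption based on Flume documentation of the period, and the HDFS path and prefix are illustrative:
      collector-1 : collectorSource(35853) | { intervalSampler(10) => collectorSink("hdfs://namenode/logs/sampled/%Y/%m%d", "sample") }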
  • MANAGEABILITY (Image caption: Wheeeeee!) 7/15/2010 29
  • Centralized data flow management • One place to specify node sources, sinks, and data flows. – Simply specify the role of the node: collector, agent – Or specify a custom configuration for a node • Control interfaces (see the shell sketch below): – Flume shell – Basic web interface – HUE + Flume Manager App (Enterprise users) 7/15/2010 30
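    A sketch of driving configuration from the Flume shell; the connect and exec config command forms and the master admin port 35873 are assumptions based on the Flume shell of that era, and node names, paths, and ports are illustrative:
      connect master-host:35873
      exec config webserver-1 'tail("/var/log/httpd/access.log")' 'agentSink("collector-1", 35853)'
      exec config collector-1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")'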
  • Output bucketing (diagram: collectors writing HDFS files such as /logs/web/2010/0715/1200/data-xxx.txt and data-xxy.txt, and likewise for the 1300 and 1400 buckets) • node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data") • Automatic output file management – Writes HDFS files into time-based bucket directories 7/15/2010 31
  • Simplified configurations • To make configuring Flume nodes higher level, we use logical nodes. – The Flume node process is a physical node – Each Flume node process can host multiple logical nodes • This: – Reduces the amount of detail required in configurations – Reduces process-centric management overhead – Allows for finer-grained resource control and isolation between flows 7/15/2010 32
  • Flow Isolation (diagram: multiple logical nodes per agent machine, each feeding its own collector; see the sketch below) • Isolate different kinds of data when and where it is generated – Have multiple logical nodes on a machine – Each has its own data source – Each has its own data sink 7/15/2010 33-34
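    A sketch of flow isolation via logical nodes; the exec map shell command for hosting two logical nodes on one physical machine is an assumption based on the Flume shell of the period, and the node and flow names are illustrative:
      exec map webserver-1 web-access-flow
      exec map webserver-1 web-error-flow
      exec config web-access-flow 'tail("/var/log/httpd/access.log")' 'agentSink("collector-access", 35853)'
      exec config web-error-flow 'tail("/var/log/httpd/error.log")' 'agentSink("collector-errors", 35853)'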
  • For advanced users • A concise and precise configuration language for specifying arbitrary data paths. – Dataflows are essentially DAGs – Control specific event flows • Enable durability and failover mechanisms • Tune the parameters of these mechanisms – Dynamic updates of configurations • Allows for live failover changes • Allows for handling newly provisioned machines • Allows for changing analytics 7/15/2010 35
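    A sketch of the kind of explicit data path this language allows, composing a decorator with an in-line failover specification; the sink and decorator names reuse the assumptions from the earlier sketches:
      agent-1 : tail("/var/log/app.log") | { batch(100) => < agentE2ESink("collector-1", 35853) ? agentE2ESink("collector-2", 35853) > }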
  • CONCLUSIONS 7/15/2010 36
  • Summary • Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs – Tunable data reliability levels – Reliable master backed by ZK – Writes data to HDFS into buckets ready for batch processing – Dynamically configurable nodes – Simplified automated management for agent+collector topologies • Open source, Apache v2.0 license. 7/15/2010 37
  • Contribute! • GitHub source repo – http://github.com/cloudera/flume • Mailing lists – User: https://groups.google.com/a/cloudera.org/group/flume-user – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev • Development trackers – JIRA (bugs/ formal feature requests): • https://issues.cloudera.org/browse/FLUME – Review board (code reviews): • http://review.hbase.org -> http://review.cloudera.org • IRC Channels – #flume @ irc.freenode.net 7/15/2010 38
  • Image credits • http://www.flickr.com/photos/victorvonsalza/3327750057/ • http://www.flickr.com/photos/victorvonsalza/3207639929/ • http://www.flickr.com/photos/victorvonsalza/3327750059/ • http://www.emvergeoning.com/?m=200811 • http://www.flickr.com/photos/juse/188960076/ • http://www.flickr.com/photos/juse/188960076/ • http://www.flickr.com/photos/23720661@N08/3186507302/ • http://clarksoutdoorchairs.com/log_adirondack_chairs.html • http://www.flickr.com/photos/dboo/3314299591/ 7/15/2010 40