Flume
Reliable Distributed
Streaming Log Collection

Jonathan Hsieh, Henry Robinson, Patrick Hunt
Cloudera, Inc
7/15/2010
Transcript of "Flume intro-100715"

  1. Flume: Reliable Distributed Streaming Log Collection
     Jonathan Hsieh, Henry Robinson, Patrick Hunt
     Cloudera, Inc
     7/15/2010
  2. Scenario
     • Situation:
       – You have hundreds of services producing logs in a datacenter.
       – They produce a lot of logs that you want to analyze.
       – You have Hadoop, a system for processing large volumes of data.
     • Problem:
       – How do I reliably ship all my logs to a place where Hadoop can analyze them?
  3. Use cases
     • Collecting logs from nodes in a Hadoop cluster
     • Collecting logs from services such as httpd, mail, etc.
     • Collecting impressions from custom apps for an ad network
     • But wait, there’s more!
       – Basic metrics of available
       – Basic online in-stream analysis
     (Image caption: It’s log, log... Everyone wants a log!)
  4. A sample topology
     [Diagram: agents in the agent tier send to collectors in the collector tier, which write to HDFS under /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and /logs/web/2010/0715/1400; a master is shown separately.]
  5. You need a “Flume”
     • Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them.
     • Open source, Apache v2.0 License
     • Goals:
       – Reliability
       – Scalability
       – Extensibility
       – Manageability
     (Image: Columbia Gorge, Broughton Log Flume)
  6. Key abstractions
     • Data path and control path
     • Nodes are in the data path
       – Nodes have a source and a sink
       – They can take different roles
         • A typical topology has agent nodes and collector nodes.
         • Optionally it has processor nodes.
     • Masters are in the control path.
       – Centralized point of configuration.
       – Specify sources and sinks
       – Can control flows of data between nodes
       – Use one master or use many with a ZooKeeper (ZK)-backed quorum
     [Diagram labels: Agent, Collector, Master]
  8. Masters
     [Diagram: the sample topology with its agent tier, collector tier, and master, plus a labeled storage tier: HDFS with /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and /logs/web/2010/0715/1400.]
  9. Outline
     • What is Flume?
       – Goals and architecture
     • Reliability
       – Fault tolerance and high availability
     • Scalability
       – Horizontal scalability of all nodes and masters
     • Extensibility
       – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
     • Manageability
       – Centralized management supporting dynamic reconfiguration
  10. RELIABILITY
      The logs will still get there.
  11. Failures
      • Faults can happen at many levels
        – Software applications can fail
        – Machines can fail
        – Networking gear can fail
        – Excessive networking congestion or machine load
        – A node goes down for maintenance.
      • How do we make sure that events make it to a permanent store?
  12. Tunable data reliability levels
      • Best effort
        – Fire and forget
      • Store on failure + retry
        – Local acks, local errors detectable
        – Failover when faults detected.
      • End-to-end reliability
        – End-to-end acks
        – Data survives compound failures, and may be retried multiple times
      [Diagram: an Agent → Collector → HDFS path shown for each level]
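The first two levels on this slide can be contrasted with a minimal Python sketch. This is an illustration only, not Flume's implementation: `FlakyCollector`, `send_best_effort`, and `StoreOnFailureAgent` are hypothetical names, and the in-memory `deque` stands in for a local on-disk log. End-to-end acks are not modeled.

```python
from collections import deque


class FlakyCollector:
    """Hypothetical collector that rejects events while `up` is False."""

    def __init__(self):
        self.up = True
        self.received = []

    def append(self, event):
        if not self.up:
            raise ConnectionError("collector unreachable")
        self.received.append(event)


def send_best_effort(collector, event):
    """Best effort: fire and forget, a failed send is silently dropped."""
    try:
        collector.append(event)
    except ConnectionError:
        pass


class StoreOnFailureAgent:
    """Store on failure + retry: keep events locally until a send succeeds."""

    def __init__(self, collector):
        self.collector = collector
        self.backlog = deque()  # assumption: stands in for a local on-disk log

    def send(self, event):
        self.backlog.append(event)
        self.flush()

    def flush(self):
        # Retry the oldest event first so ordering is preserved.
        while self.backlog:
            try:
                self.collector.append(self.backlog[0])
            except ConnectionError:
                return  # local error detected: keep the backlog, retry later
            self.backlog.popleft()


collector = FlakyCollector()
agent = StoreOnFailureAgent(collector)

collector.up = False
send_best_effort(collector, "a")  # lost
agent.send("b")                   # buffered
agent.send("c")                   # buffered

collector.up = True
agent.flush()                     # backlog is delivered after recovery
```

In this sketch the best-effort event is lost during the outage, while the buffered events arrive once the collector is reachable again.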
  13. Dealing with Agent failures
      • We do not want to lose data
      • Make events durable at the generation point.
        – If a log generator goes down, it is not generating logs.
        – If the event generation point fails and recovers, data will reach the end point
      • Data is durable and survives if a machine crashes and reboots
        – Allows for synchronous writes in log-generating applications.
      • Watchdog program to restart the agent if it fails.
  14. Dealing with Collector Failures
      • Data is durable at the agent:
        – Minimize the amount of state and possible data loss
        – Not necessary to durably keep intermediate state at collector
        – Retry if collector goes down.
      • Use hot failover so agents can use alternate paths:
        – Master predetermines failovers to load balance when collectors go down.
  15. Master Service Failures
      • A master machine should not be a single point of failure!
      • Masters keep two kinds of information:
        • Configuration information (node/flow configuration)
          – Kept in a ZooKeeper ensemble for a persistent, highly available metadata store
          – Failures easily recovered from
        • Ephemeral information (heartbeat info, acks, metrics reports)
          – Kept in memory
          – Failures will lose data
          – This information can be lazily replicated
  16. SCALABILITY
      (Image: Logs jamming the Kemi River)
  18. Data path is horizontally scalable
      • Add collectors to increase availability and to handle more data
        – Assumes a single agent will not dominate a collector
        – Fewer connections to HDFS.
        – Larger, more efficient writes to HDFS.
      • Agents have mechanisms for machine resource tradeoffs
        – Write log locally to avoid collector disk IO bottleneck and catastrophic failures
        – Compression and batching (trade CPU for network)
        – Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
      [Diagram: several Agents → Collector → HDFS]
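The "compression and batching" tradeoff can be illustrated with a small Python sketch. It assumes events are single-line strings and uses gzip as one possible codec. `pack_batch` and `unpack_batch` are hypothetical helper names, not Flume functions.

```python
import gzip


def pack_batch(events):
    """Join a batch of single-line events and compress it into one payload."""
    return gzip.compress("\n".join(events).encode("utf-8"))


def unpack_batch(blob):
    """Reverse pack_batch on the receiving side."""
    return gzip.decompress(blob).decode("utf-8").split("\n")


# Repetitive log lines compress well, so one batched payload uses far fewer
# bytes on the wire than the raw events, at the cost of CPU on both ends.
events = [f"GET /index.html 200 request={i}" for i in range(1000)]
raw_size = sum(len(e.encode("utf-8")) for e in events)
packed = pack_batch(events)
```

Batching also reduces the number of sends, which is separate from the byte savings measured here.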
  19. Load balancing
      • Agents are logically partitioned and send to different collectors
      • Use randomization to pre-specify failovers when many collectors exist
        – Spread load if a collector goes down.
        – Spread load if new collectors are added to the system.
      [Diagram: six Agents partitioned across three Collectors]
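One way to read "use randomization to pre-specify failovers" is that each agent receives an ordered list of collectors ahead of time: a primary chosen by partitioning, followed by the remaining collectors in random order. The Python sketch below works under that assumption. `failover_chains` and `pick_collector` are hypothetical names, and the slides do not specify Flume's actual assignment policy.

```python
import random


def failover_chains(agents, collectors, seed=0):
    """Give each agent a primary collector plus randomly ordered backups."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    chains = {}
    for i, agent in enumerate(agents):
        primary = collectors[i % len(collectors)]  # simple logical partitioning
        backups = [c for c in collectors if c != primary]
        rng.shuffle(backups)  # different agents fail over to different backups
        chains[agent] = [primary] + backups
    return chains


def pick_collector(chain, down):
    """Return the first collector in the chain that is not known to be down."""
    for collector in chain:
        if collector not in down:
            return collector
    return None


agents = ["a1", "a2", "a3", "a4", "a5", "a6"]
collectors = ["c1", "c2", "c3"]
chains = failover_chains(agents, collectors)
```

Because the backup order is randomized per agent, the agents of a failed collector tend to spread across the survivors rather than all moving to the same one.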
  21. Control plane is horizontally scalable
      • A master controls dynamic configurations of nodes
        – Uses a consensus protocol to keep state consistent
        – Scales well for configuration reads
        – Allows for adaptive repartitioning in the future
      • Nodes can talk to any master.
      • Masters can talk to any ZK member
      [Diagram: Nodes, three Masters, and ZooKeeper members ZK1, ZK2, ZK3]
  24. EXTENSIBILITY
      Turn raw logs into something useful…
  25. Flume is easy to extend
      • Simple source and sink APIs
        – Event granularity streaming design
        – Have many simple operations and compose for complex behavior.
      • End-to-end principle
        – Put smarts and state at the end points. Keep the middle simple.
      • Flume deals with reliability.
        – Just add a new source or add a new sink and Flume has primitives to deal with reliability
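The idea of simple, event-granularity source and sink APIs can be sketched in Python as follows. The class and method names (`ListSource`, `ListSink`, `events`, `append`, `run_node`) are assumptions made for illustration and are not Flume's Java interfaces.

```python
class ListSource:
    """Hypothetical source: yields one event per input line."""

    def __init__(self, lines):
        self.lines = lines

    def events(self):
        for line in self.lines:
            yield {"body": line}


class ListSink:
    """Hypothetical sink: collects the events it is handed."""

    def __init__(self):
        self.items = []

    def append(self, event):
        self.items.append(event)


def run_node(source, sink):
    """A node in the data path: move events from its source to its sink."""
    for event in source.events():
        sink.append(event)


sink = ListSink()
run_node(ListSource(["first line", "second line"]), sink)
```

With an interface this small, a new source only needs to produce events and a new sink only needs to accept them. The node loop stays the same.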
  26. Variety of Data sources
      • Can deal with push and pull sources.
      • Supports many legacy event sources
        – Tailing a file
        – Output from periodically Exec’ed program
        – Syslog, Syslog-ng
        – Experimental: IRC / Twitter / Scribe / AMQP
      [Diagram: push, poll, and embed arrangements between an App and an Agent]
  27. Variety of Data output
      • Send data to many sinks
        – Files, HDFS, Console, RPC
        – Experimental: HBase, Voldemort, S3, etc.
      • Supports an extensible variety of output formats and destinations
        – Output to language-neutral and open data formats (JSON, Avro, text)
        – Compressed output files in development
      • Uses decorators to process event data in flight.
        – Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc.
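The decorator idea, processing events in flight by wrapping one sink in another, can be sketched in Python. `ListSink`, `FilterDecorator`, and `BatchDecorator` are hypothetical names. The sketch shows only the composition pattern, not Flume's decorator classes.

```python
class ListSink:
    """Hypothetical terminal sink that records whatever it receives."""

    def __init__(self):
        self.items = []

    def append(self, event):
        self.items.append(event)


class FilterDecorator:
    """Pass an event downstream only if the predicate accepts it."""

    def __init__(self, sink, predicate):
        self.sink = sink
        self.predicate = predicate

    def append(self, event):
        if self.predicate(event):
            self.sink.append(event)


class BatchDecorator:
    """Group events into lists of `size` before passing them downstream."""

    def __init__(self, sink, size):
        self.sink = sink
        self.size = size
        self.pending = []

    def append(self, event):
        self.pending.append(event)
        if len(self.pending) >= self.size:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.append(list(self.pending))
            self.pending.clear()


sink = ListSink()
batcher = BatchDecorator(sink, size=2)
pipeline = FilterDecorator(batcher, lambda e: e.startswith("ERROR"))

for line in ["ERROR a", "INFO b", "ERROR c", "ERROR d"]:
    pipeline.append(line)
batcher.flush()  # push out the final partial batch
```

Each decorator exposes the same `append` method as a sink, so decorators can be stacked in any order without the terminal sink knowing about them.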
  28. MANAGEABILITY
      (Image caption: Wheeeeee!)
  29. Centralized data flow management
      • One place to specify node sources, sinks and data flows.
        – Simply specify the role of the node: collector, agent
        – Or specify a custom configuration for a node
      • Control Interfaces:
        – Flume Shell
        – Basic web
        – HUE + Flume Manager App (Enterprise users)
  30. Output bucketing
      • Example collector configuration:
          node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
      • Resulting files in HDFS, written by the collectors:
          /logs/web/2010/0715/1200/data-xxx.txt
          /logs/web/2010/0715/1200/data-xxy.txt
          /logs/web/2010/0715/1300/data-xxx.txt
          /logs/web/2010/0715/1300/data-xxy.txt
          /logs/web/2010/0715/1400/data-xxx.txt
          …
      • Automatic output file management
        – Write HDFS files into buckets over time, based on tags
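The mapping from the path template to hourly buckets can be approximated in Python. The escape sequences in the slide's example (`%Y`, `%m`, `%d`, `%H`) happen to match `strftime` directives, so this sketch simply uses `strftime`. That is an assumption made for illustration and says nothing about how Flume expands them. `bucket_path` and the `suffix` argument are hypothetical.

```python
from datetime import datetime


def bucket_path(template, prefix, event_time, suffix):
    """Expand the time escapes in `template` and build a file path under it."""
    directory = event_time.strftime(template)  # "%H00" becomes e.g. "1200"
    return f"{directory}/{prefix}-{suffix}.txt"


template = "hdfs://namenode/logs/web/%Y/%m%d/%H00"
path = bucket_path(template, "data", datetime(2010, 7, 15, 12, 34), "xxx")
```

Here an event timestamped 12:34 on 7/15/2010 lands in the `/logs/web/2010/0715/1200` bucket, matching the directory layout shown on the slide.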
  31. Simplified configurations
      • To make configuring Flume nodes higher level, we use logical nodes.
        – The Flume node process is a physical node
        – Each Flume node process can host multiple logical nodes
      • Allows for:
        – Less detail required in configurations.
        – Less process-centric management overhead
        – Finer-grained resource control and isolation with flows
  32. Flow Isolation
      • Isolate different kinds of data when and where it is generated
        – Have multiple logical nodes on a machine
        – Each has its own data source
        – Each has its own data sink
      [Diagram: six Agents feeding three Collectors]
  34. For advanced users
      • A concise and precise configuration language for specifying arbitrary data paths.
        – Dataflows are essentially DAGs
        – Control specific event flows
          • Enable durability and failover mechanisms
          • Tune the parameters of these mechanisms
        – Dynamic updates of configurations
          • Allows for live failover changes
          • Allows for handling newly provisioned machines
          • Allows for changing analytics
  35. CONCLUSIONS
  36. Summary
      • Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
        – Tunable data reliability levels for day
        – Reliable master backed by ZK
        – Write data to HDFS into buckets ready for batch processing
        – Dynamically configurable nodes
        – Simplified automated management for agent+collector topologies
      • Open Source, Apache v2.0.
  37. Contribute!
      • GitHub source repo
        – http://github.com/cloudera/flume
      • Mailing lists
        – User: https://groups.google.com/a/cloudera.org/group/flume-user
        – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
      • Development trackers
        – JIRA (bugs / formal feature requests):
          • https://issues.cloudera.org/browse/FLUME
        – Review board (code reviews):
          • http://review.hbase.org -> http://review.cloudera.org
      • IRC Channels
        – #flume @ irc.freenode.net
  38. Image credits
      • http://www.flickr.com/photos/victorvonsalza/3327750057/
      • http://www.flickr.com/photos/victorvonsalza/3207639929/
      • http://www.flickr.com/photos/victorvonsalza/3327750059/
      • http://www.emvergeoning.com/?m=200811
      • http://www.flickr.com/photos/juse/188960076/
      • http://www.flickr.com/photos/juse/188960076/
      • http://www.flickr.com/photos/23720661@N08/3186507302/
      • http://clarksoutdoorchairs.com/log_adirondack_chairs.html
      • http://www.flickr.com/photos/dboo/3314299591/