Apache Flume

Brief description of Apache Flume 0.9.x for EEDC assignment



  • 1. Arinto Murdopo and Josep Subirats, Group 4, EEDC 2012
  • 2. Outline
    ● Current problem
    ● What is Apache Flume?
    ● The Flume Model
      ○ Flows and Nodes
      ○ Agent, Processor and Collector Nodes
      ○ Data and Control Path
    ● Flume goals
      ○ Reliability
      ○ Scalability
      ○ Extensibility
      ○ Manageability
    ● Use case: Near Realtime Aggregator
  • 3. Current Problem
    ● Situation: you have hundreds of services running on different servers that produce lots of large logs, which should be analyzed altogether. You have Hadoop to process them.
    ● Problem: how do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!
  • 4. What is Apache Flume?
    ● It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them where they have to be processed.
    ● Goals: reliability, scalability, extensibility, manageability. Exactly what I needed!
  • 5. The Flume Model: Flows and Nodes
    ● A flow corresponds to a type of data source (server logs, machine monitoring metrics...).
    ● Flows are made up of nodes chained together (see slide 7).
  • 6. The Flume Model: Flows and Nodes
    ● In a node, data comes in through a source, is optionally processed by one or more decorators, and is then transmitted out via a sink.
      ○ Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
      ○ Decorator examples: wire batching, compression, sampling, projection, extraction...
      ○ Sink examples: Console, local files, HDFS, S3, other nodes...
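In Flume 0.9's dataflow language, a decorator wraps a sink using the `{ decorator => sink }` syntax. A minimal sketch, assuming a node that tails a local log, compresses events with the stock `gzip` decorator, and forwards them to another node (node names and the log path are illustrative):

```
node1 : tail("/var/log/app.log") | { gzip => agentSink("collector1") } ;
```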
  • 7. The Flume Model: Agent, Processor and Collector Nodes
    ● Agent: receives data from an application.
    ● Processor (optional): performs intermediate processing.
    ● Collector: writes data to permanent storage.
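Chained together, these roles look roughly like the following Flume 0.9 configuration, a sketch using the stock `agentSink`, `collectorSource` and `collectorSink` primitives (host names, the port and the HDFS path are illustrative):

```
agent1     : tail("/var/log/httpd/access_log") | agentSink("collector1", 35853) ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/", "web-") ;
```

The agent forwards events to the collector's listening port; the collector batches them and writes them to durable storage.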
  • 8. The Flume Model: Data and Control Path (1/2)
    ● Nodes are in the data path.
  • 9. The Flume Model: Data and Control Path (2/2)
    ● Masters are in the control path.
    ● Centralized point of configuration; multiple masters coordinate through ZooKeeper (ZK).
    ● Specify sources, sinks and control data flows.
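The master exposes this control path through the Flume shell, from which node configurations can be pushed at runtime. A sketch following the 0.9 shell conventions (master host, port, node name and paths are illustrative):

```
$ flume shell -c master:35873
exec config agent1 'tail("/var/log/app.log")' 'agentSink("collector1")'
```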
  • 10. Flume Goals: Reliability
    ● Tunable failure recovery modes:
      ○ Best effort
      ○ Store on failure and retry
      ○ End-to-end reliability
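In Flume 0.9 these three modes map to different agent sink variants: `agentBESink` (best effort, events may be lost if the collector is down), `agentDFOSink` (disk failover: store on failure, then retry) and `agentE2ESink` (end-to-end acknowledgements). A sketch, with node and collector names illustrative:

```
agentBE  : tail("app.log") | agentBESink("collector1") ;
agentDFO : tail("app.log") | agentDFOSink("collector1") ;
agentE2E : tail("app.log") | agentE2ESink("collector1") ;
```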
  • 11. Flume Goals: Scalability
    ● Horizontally scalable data path
    ● Load balancing
  • 12. Flume Goals: Scalability
    ● Horizontally scalable control path
  • 13. Flume Goals: Extensibility
    ● Simple source and sink API
      ○ Event streaming and composition of simple operations
    ● Plug-in architecture
      ○ Add your own sources, sinks and decorators
  • 14. Flume Goals: Manageability
    ● Centralized data flow management interface
  • 15. Flume Goals: Manageability
    ● Configuring Flume:

      node : tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;

    ● Output bucketing:

      /logs/web/2010/0715/1200/data-xxx.txt
      /logs/web/2010/0715/1200/data-xxy.txt
      /logs/web/2010/0715/1300/data-xxx.txt
      /logs/web/2010/0715/1300/data-xxy.txt
      /logs/web/2010/0715/1400/data-xxx.txt
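Bucketed paths like those above come from date escape sequences expanded in the sink's output path. A sketch of a collector sink that would produce hour-level buckets of that shape (port and base path are illustrative; `%Y`, `%m`, `%d` and `%H` are Flume's year/month/day/hour escapes):

```
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00/", "data-") ;
```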
  • 16. Use Case: Near Realtime Aggregator
  • 17. Conclusion
    ● Flume is a distributed data collection service.
    ● It is suitable for enterprise settings with large amounts of log data to process.
  • 18. Q&A: any questions?
  • 19. References
    ● http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/
    ● http://www.slideshare.net/cloudera/inside-flume
    ● http://www.slideshare.net/cloudera/flume-intro100715
    ● http://www.slideshare.net/cloudera/flume-austin-hug-21711