
Apache Flume


Brief description of Apache Flume 0.9.x for EEDC assignment

Published in: Technology

Apache Flume

  1. Arinto Murdopo, Josep Subirats. Group 4, EEDC 2012
  2. Outline
     ● Current problem
     ● What is Apache Flume?
     ● The Flume Model
       ○ Flows and Nodes
       ○ Agent, Processor and Collector Nodes
       ○ Data and Control Path
     ● Flume goals
       ○ Reliability
       ○ Scalability
       ○ Extensibility
       ○ Manageability
     ● Use case: Near Realtime Aggregator
  3. Current Problem
     ● Situation: You have hundreds of services running on different servers that produce lots of large logs, which should be analyzed together. You have Hadoop to process them.
     ● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!
  4. What is Apache Flume?
     ● It is a distributed data collection service that gathers flows of data (such as logs) from their sources and aggregates them at the place where they have to be processed.
     ● Goals: reliability, scalability, extensibility, manageability. Exactly what I needed!
  5. The Flume Model: Flows and Nodes
     ● A flow corresponds to a type of data source (server logs, machine monitoring metrics...).
     ● Flows are composed of nodes chained together (see slide 7).
  6. The Flume Model: Flows and Nodes
     ● In a Node, data come in through a source...
       Examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
     ● ...are optionally processed by one or more decorators...
       Examples: wire batching, compression, sampling, projection, extraction...
     ● ...and then are transmitted out via a sink.
       Examples: Console, local files, HDFS, S3, other nodes...
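The source, decorator and sink chain on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Flume's API: `run_node`, `sample_every` and `project_field` are hypothetical names, events are plain strings, and the source is an in-memory list standing in for something like a tailed log file.

```python
from typing import Callable, Iterable, Iterator, List

Event = str
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def sample_every(n: int) -> Decorator:
    """Hypothetical decorator: keep every n-th event (simple sampling)."""
    def apply(events: Iterator[Event]) -> Iterator[Event]:
        for i, event in enumerate(events):
            if i % n == 0:
                yield event
    return apply


def project_field(index: int, sep: str = " ") -> Decorator:
    """Hypothetical decorator: keep one separator-delimited field (projection)."""
    def apply(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            fields = event.split(sep)
            if index < len(fields):
                yield fields[index]
    return apply


def run_node(source: Iterable[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> int:
    """Read from the source, apply each decorator in order, write to the sink.

    Returns the number of events handed to the sink.
    """
    stream: Iterator[Event] = iter(source)
    for decorate in decorators:
        stream = decorate(stream)
    delivered = 0
    for event in stream:
        sink(event)
        delivered += 1
    return delivered


# In-memory stand-ins for a log source and a storage sink.
log_lines = ["GET /index 200", "GET /missing 404", "POST /login 200", "GET /img 200"]
collected: List[Event] = []
count = run_node(log_lines, [sample_every(2), project_field(1)], collected.append)
print(count, collected)
```

Because the sink of one node can be the source of another, the same function could be chained to model several nodes in a flow.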
  7. The Flume Model: Agent, Processor and Collector Nodes
     ● Agent: receives data from an application.
     ● Processor (optional): performs intermediate processing.
     ● Collector: writes data to permanent storage.
  8. The Flume Model: Data and Control Path (1/2)
     Nodes are in the data path.
  9. The Flume Model: Data and Control Path (2/2)
     Masters are in the control path.
     ● Centralized point of configuration. Multiple masters coordinate through ZooKeeper (ZK).
     ● Specify sources and sinks, and control data flows.
  10. Flume Goals: Reliability
      Tunable Failure Recovery Modes
      ● Best Effort
      ● Store on Failure and Retry
      ● End to End Reliability
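The slide only names the three recovery modes, so the following Python sketch rests on an assumed reading of them: best effort drops an event when the send fails, store on failure keeps the failed event locally and retries later, and end to end records every event before sending and forgets it only after an acknowledgement. The `Agent` class, the `pending` list (standing in for local disk) and the flaky `send` function are all hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, List


class Mode(Enum):
    BEST_EFFORT = "best effort"
    STORE_ON_FAILURE = "store on failure and retry"
    END_TO_END = "end to end"


class Agent:
    def __init__(self, mode: Mode, send: Callable[[str], bool]) -> None:
        self.mode = mode
        self.send = send              # returns True when downstream acknowledges
        self.pending: List[str] = []  # stand-in for a local disk buffer or log

    def _try_send(self, event: str) -> bool:
        try:
            return self.send(event)
        except ConnectionError:
            return False

    def deliver(self, event: str) -> None:
        if self.mode is Mode.END_TO_END:
            # Record first, forget only once the send is acknowledged.
            self.pending.append(event)
            if self._try_send(event):
                self.pending.remove(event)
            return
        ok = self._try_send(event)
        if not ok and self.mode is Mode.STORE_ON_FAILURE:
            self.pending.append(event)
        # In best-effort mode a failed event is simply lost.

    def retry(self) -> None:
        """Resend everything still pending, keeping what fails again."""
        self.pending = [e for e in self.pending if not self._try_send(e)]


def make_flaky_send(fail_first: int, received: List[str]) -> Callable[[str], bool]:
    """Build a send function whose first `fail_first` calls raise ConnectionError."""
    calls = {"n": 0}

    def send(event: str) -> bool:
        calls["n"] += 1
        if calls["n"] <= fail_first:
            raise ConnectionError("collector unreachable")
        received.append(event)
        return True

    return send


results: Dict[Mode, List[str]] = {}
for mode in Mode:
    received: List[str] = []
    agent = Agent(mode, make_flaky_send(1, received))
    for event in ["e1", "e2", "e3"]:
        agent.deliver(event)
    agent.retry()
    results[mode] = received
    print(mode.value, received)
```

In this toy run the first send fails, so best effort loses `e1` while the other two modes deliver it on retry. The difference between those two modes (what survives a crash of the agent itself) is not modelled here.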
  11. Flume Goals: Scalability
      Horizontally Scalable Data Path
      Load Balancing
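The slide names load balancing without describing a scheme, so the sketch below assumes the simplest one: agents are spread round-robin over the collectors that are currently reachable. The function name and the node names are hypothetical, and this is not a description of how Flume itself assigns agents.

```python
from itertools import cycle
from typing import Dict, FrozenSet, List


def assign_round_robin(agents: List[str],
                       collectors: List[str],
                       down: FrozenSet[str] = frozenset()) -> Dict[str, str]:
    """Map each agent to a healthy collector, cycling through them in order."""
    healthy = [c for c in collectors if c not in down]
    if not healthy:
        raise RuntimeError("no collector available")
    ring = cycle(healthy)
    return {agent: next(ring) for agent in agents}


agents = ["a1", "a2", "a3", "a4", "a5"]
collectors = ["c1", "c2", "c3"]

normal = assign_round_robin(agents, collectors)
degraded = assign_round_robin(agents, collectors, down=frozenset({"c2"}))
print(normal)
print(degraded)
```

Adding a collector to the list spreads the same agents over more machines, which is the sense in which the data path scales horizontally.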
  12. Flume Goals: Scalability
      Horizontally Scalable Control Path
  13. Flume Goals: Extensibility
      ● Simple Source and Sink API
        ○ Event streaming and composition of simple operations
      ● Plug-in Architecture
        ○ Add your own sources, sinks, decorators
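A plug-in architecture of this kind usually comes down to a registry that maps a name to a factory. The Python sketch below shows that pattern for sinks only. It is an assumption about the general shape, not Flume's plug-in API, and `register_sink`, `build_sink` and the two example sinks are hypothetical.

```python
from typing import Callable, Dict, List

Sink = Callable[[str], None]
SinkFactory = Callable[..., Sink]

_SINKS: Dict[str, SinkFactory] = {}


def register_sink(name: str) -> Callable[[SinkFactory], SinkFactory]:
    """Class-free plug-in hook: register a sink factory under a name."""
    def wrap(factory: SinkFactory) -> SinkFactory:
        _SINKS[name] = factory
        return factory
    return wrap


def build_sink(name: str, *args) -> Sink:
    """Look up a registered factory by name and build the sink."""
    if name not in _SINKS:
        raise KeyError(f"unknown sink: {name}")
    return _SINKS[name](*args)


@register_sink("memory")
def memory_sink(buffer: List[str]) -> Sink:
    """Sink that appends every event to a caller-supplied list."""
    return buffer.append


@register_sink("prefixed")
def prefixed_sink(prefix: str, buffer: List[str]) -> Sink:
    """Sink that tags each event with a prefix before storing it."""
    def write(event: str) -> None:
        buffer.append(f"{prefix}{event}")
    return write


stored: List[str] = []
sink = build_sink("prefixed", "web:", stored)
for event in ["login", "logout"]:
    sink(event)
print(stored)
```

Sources and decorators could be registered the same way, each with its own table.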
  14. Flume Goals: Manageability
      Centralized Data Flow Management Interface
  15. Flume Goals: Manageability
      Configuring Flume:
        Node: tail("file") | filter [ console, roll (1000) { dfs("hdfs://namenode/user/flume") } ] ;
      Output Bucketing:
        /logs/web/2010/0715/1200/data-xxx.txt
        /logs/web/2010/0715/1200/data-xxy.txt
        /logs/web/2010/0715/1300/data-xxx.txt
        /logs/web/2010/0715/1300/data-xxy.txt
        /logs/web/2010/0715/1400/data-xxx.txt
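The bucketed paths on this slide follow a year / month-day / hour layout. A minimal Python sketch of how such a path could be derived from an event timestamp is shown here. The hourly granularity is inferred from the example paths (1200, 1300, 1400), `bucket_path` is a hypothetical helper, and the `xxx` file identifier is the slide's own placeholder.

```python
from datetime import datetime


def bucket_path(root: str, event_time: datetime, file_id: str) -> str:
    """Build a path of the form root/YYYY/MMDD/HH00/data-<file_id>.txt."""
    # Truncate to the start of the hour so all events in that hour share a bucket.
    hour = event_time.replace(minute=0, second=0, microsecond=0)
    return f"{root}/{hour:%Y}/{hour:%m%d}/{hour:%H%M}/data-{file_id}.txt"


path = bucket_path("/logs/web", datetime(2010, 7, 15, 12, 34, 56), "xxx")
print(path)  # /logs/web/2010/0715/1200/data-xxx.txt
```

Grouping files by time in this way lets a later Hadoop job read one hour or one day of logs by pointing at a single directory.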
  16. Use Case: Near Realtime Aggregator
  17. Conclusion
      Flume is:
      ● A distributed data collection service
      ● Suitable for enterprise settings
      ● Useful when there is a large amount of log data to process
  18. Q&A
      Questions to be unveiled?
  19. References
      ● http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/