Apache Flume

A brief description of Apache Flume 0.9.x for the EEDC assignment.


Transcript

  • 1. Arinto Murdopo and Josep Subirats, Group 4, EEDC 2012
  • 2. Outline
    ● Current problem
    ● What is Apache Flume?
    ● The Flume Model
      ○ Flows and Nodes
      ○ Agent, Processor and Collector Nodes
      ○ Data and Control Path
    ● Flume goals
      ○ Reliability
      ○ Scalability
      ○ Extensibility
      ○ Manageability
    ● Use case: Near Realtime Aggregator
  • 3. Current Problem
    ● Situation: You have hundreds of services running on different servers that produce lots of large logs, which should be analyzed altogether, and you have Hadoop to process them.
    ● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!
  • 4. What is Apache Flume?
    ● A distributed data collection service that gets flows of data (like logs) from their source and aggregates them to where they have to be processed.
    ● Goals: reliability, scalability, extensibility, manageability. Exactly what I needed!
  • 5. The Flume Model: Flows and Nodes
    ● A flow corresponds to a type of data source (server logs, machine monitoring metrics...).
    ● Flows are composed of nodes chained together (see slide 7).
  • 6. The Flume Model: Flows and Nodes
    ● In a node, data comes in through a source (examples: console, exec, syslog, IRC, Twitter, other nodes...),
    ● is optionally processed by one or more decorators (examples: wire batching, compression, sampling, projection, extraction...),
    ● and is then transmitted out via a sink (examples: console, local files, HDFS, S3, other nodes...).
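    A rough sketch (not from the original deck) of how a source, a decorator and a sink are wired together in the 0.9.x dataflow language; the batch decorator and the log path are assumptions, so check the 0.9.x user guide for the exact names:

      node1 : tail("/var/log/app.log") | { batch(100) => console } ;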
  • 7. The Flume Model: Agent, Processor and Collector Nodes
    ● Agent: receives data from an application.
    ● Processor (optional): intermediate processing.
    ● Collector: writes data to permanent storage.
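    One plausible three-tier wiring in the 0.9.x configuration syntax; the host names, the port and the agentSink/collectorSource/collectorSink shorthands are illustrative assumptions, not taken from the deck:

      agent1     : tail("/var/log/httpd/access_log") | agentSink("proc1", 35853) ;
      proc1      : collectorSource(35853) | agentSink("collector1", 35853) ;
      collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume", "web-") ;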
  • 8. The Flume Model: Data and Control Path (1/2)
    ● Nodes are in the data path.
  • 9. The Flume Model: Data and Control Path (2/2)
    ● Masters are in the control path.
    ● Centralized point of configuration; multiple masters coordinate through ZooKeeper (ZK).
    ● Specify sources and sinks, and control data flows.
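    A sketch of pushing a flow from the master with the flume shell; the node names and log path are made up, and the exact shell commands should be checked against the 0.9.x guide:

      $ flume shell
      connect master-host
      exec config agent1 'tail("/var/log/app.log")' 'agentBESink("collector1", 35853)'
      exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume", "data-")'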
  • 10. Flume Goals: Reliability
    Tunable failure recovery modes:
    ● Best Effort
    ● Store on Failure and Retry
    ● End to End Reliability
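    In 0.9.x these modes roughly correspond to the choice of agent sink; the sink names below (BE = best effort, DFO = disk failover, E2E = end to end) follow the usual shorthands but are given as an assumption, not a quote from the deck:

      Best Effort:                agent1 : tail("/var/log/app.log") | agentBESink("collector1", 35853) ;
      Store on Failure and Retry: agent1 : tail("/var/log/app.log") | agentDFOSink("collector1", 35853) ;
      End to End:                 agent1 : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;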
  • 11. Flume Goals: Scalability
    ● Horizontally scalable data path.
    ● Load balancing.
  • 12. Flume Goals: Scalability
    ● Horizontally scalable control path.
  • 13. Flume Goals: Extensibility
    ● Simple Source and Sink API
      ○ Event streaming and composition of simple operations.
    ● Plug-in architecture
      ○ Add your own sources, sinks and decorators.
  • 14. Flume Goals: Manageability
    ● Centralized data flow management interface.
  • 15. Flume Goals: Manageability
    Configuring Flume:
      node : tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
    Output bucketing:
      /logs/web/2010/0715/1200/data-xxx.txt
      /logs/web/2010/0715/1200/data-xxy.txt
      /logs/web/2010/0715/1300/data-xxx.txt
      /logs/web/2010/0715/1300/data-xxy.txt
      /logs/web/2010/0715/1400/data-xxx.txt
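    The bucketed paths above are usually produced with date escape sequences in the collector sink's output directory; a minimal sketch, with the escape syntax and path assumed for illustration:

      collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data-") ;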
  • 16. Use Case: Near Realtime Aggregator
  • 17. Conclusion
    Flume is:
    ● a distributed data collection service,
    ● suitable for enterprise settings,
    ● with large amounts of log data to process.
  • 18. Q&A. Questions?
  • 19. References
    ● http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/
    ● http://www.slideshare.net/cloudera/inside-flume
    ● http://www.slideshare.net/cloudera/flume-intro100715
    ● http://www.slideshare.net/cloudera/flume-austin-hug-21711