Apache flume by Swapnil Dubey

Published in: Technology

  1. Apache Flume: Loading Big Data into Hadoop Cluster using Flume. Swapnil Dubey, Big Data Hacker, GoDataDriven
  2. Agenda
     ➢ What is Apache Flume?
     ➢ Problem statement
     ➢ Use case: Collecting web server logs
     ➢ Overview/Architecture of Flume
     ➢ Demos
  3. What is Flume?
     Collection and aggregation of streaming data, typically used for log data.
     Advantages over other solutions:
     ➢ Scalable, reliable, customizable
     ➢ Declarative and dynamic configuration
     ➢ Contextual routing
     ➢ Feature rich and fully extensible
     ➢ Open source
  4. Problem Statement
  5. Problem Statement
     [Diagram: application servers emitting logs, and a Hadoop cluster]
  6. Problem Statement
     ➢ Data collection is ad hoc
     ➢ How to get data to Hadoop
     ➢ Streaming data
  7. Problem Statement
     [Diagram: application servers send logs to a Flume agent, which performs the HDFS write into the Hadoop cluster]
  8. Collecting web server logs
     ➢ Collecting web logs using:
       - A single Flume agent
       - Multiple Flume agents
     ➢ Typical converging flow
       - Converging flow characteristics: load balancing, multiplexing, failover
       - Large converging flows
       - Event volume
  9. Problem Statement: Single Flume Agent
     [Diagram: application servers send logs to one Flume agent, which performs the HDFS write into the Hadoop cluster]
  10. Problem Statement: Multiple Flume Agents (1)
     [Diagram: application servers, three Flume agents each performing its own HDFS write, and the Hadoop cluster]
  11. Problem Statement: Multiple Flume Agents (2)
     [Diagram: application servers, several Flume agents, a single HDFS write, and the Hadoop cluster]
  12. Overview/Architecture of Flume
  13. Components of Flume
     [Diagram labels: Events, Client]
  14. Core Concepts
     ➢ Events
     ➢ Client
     ➢ Agents
       - Source, Channel, Sink
       - Interceptor
       - Channel Selector
       - Sink Processor
  15. Core Concept: Event
     An Event is the basic unit of data transported by Flume from source to destination.
     ➢ The payload is opaque to Flume.
     ➢ Events are accompanied by optional headers.
     Headers:
       - Headers are a collection of unique key-value pairs.
       - Headers are used for contextual routing.
  16. Core Concept: Client
     Entity that generates events and passes them to one or more agents.
     ➢ Example: Flume log4j Appender
     ➢ Decouples Flume from the system where event data is generated.
  17. Core Concept: Agent
     Container for hosting sources, channels, sinks and other components.
  18. Core Concept: Source
     Component that receives events and places them onto one or more channels.
     ➢ Different types of sources:
       - Specialized sources for integrating with well-known systems, for example Syslog and Netcat
       - Auto-generating sources: Exec, SEQ
       - IPC sources for agent-to-agent communication: Avro, Thrift
     ➢ Requires at least one channel to function.
  19. Core Concept: Channel
     Component that buffers incoming events, which are ultimately consumed by sinks.
     ➢ Different channels: Memory, File, Database
     ➢ Channels are fully transactional.
  20. Core Concept: Sink
     Component that takes events from a channel and transmits them to the next-hop destination.
     Different types of sinks:
       - Terminal sinks: HDFS, HBase
       - Auto-consuming sinks: Null Sink
       - IPC sinks for agent-to-agent communication: Avro, Thrift
  21. Core Concept: Interceptor
     Interceptors are applied to sources in a predetermined order to add information to events and to filter them.
     ➢ Built-in interceptors: allow adding headers such as timestamps, static markers, etc.
     ➢ Custom interceptors: create headers by inspecting the event.
  22. Channel Selector
     Facilitates the selection of one or more channels, based on preset criteria.
     ➢ Built-in channel selectors:
       - Replicating: for duplicating events
       - Multiplexing: for routing based on headers
     ➢ Custom selectors can be written for dynamic criteria.
  23. Sink Processor
     The sink processor is responsible for invoking one sink from a specified group of sinks.
     ➢ Built-in sink processors:
       - Load Balancing Sink Processor
       - Failover Sink Processor
       - Default Sink Processor
  24. Data Ingest
     [Diagram: clients send events to a source; the channel processor passes the unfiltered events through the interceptors; the channel selector decides which channels receive the filtered events]
  25. Data Drain
     ➢ Event removal from the channel is transactional.
     [Diagram: the sink runner drives the sink processor, which handles sink selection and invocation; the sink takes events from the channels and sends them to the next hop]
  26. Agent Pipeline * Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data- Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
  27. Assured Delivery
     Agents use transactional exchange to guarantee delivery across hops.
     [Diagram: on each hop (source to channel, channel to sink) the steps are start transaction, take events, send events, end transaction]
  28. Setting up a simple agent for HDFS
     agent.sources = netcat-collect
     agent.sinks = hdfs-write
     agent.channels = memoryChannel

     agent.sources.netcat-collect.type = netcat
     agent.sources.netcat-collect.bind = 127.0.0.1
     agent.sources.netcat-collect.port = 11111

     agent.sinks.hdfs-write.type = hdfs
     agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
     agent.sinks.hdfs-write.hdfs.rollInterval = 30
     agent.sinks.hdfs-write.hdfs.writeFormat = Text
     agent.sinks.hdfs-write.hdfs.fileType = DataStream

     agent.channels.memoryChannel.type = memory
     agent.channels.memoryChannel.capacity = 10000

     agent.sources.netcat-collect.channels = memoryChannel
     agent.sinks.hdfs-write.channel = memoryChannel
  29. Advanced Features
     Fan-In and Fan-Out
     hdfs-agent.channels = mchannel1 mchannel2
     hdfs-agent.sources.netcat-collect.selector.type = replicating
     hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2

     Interceptors
     hdfs-agent.sources.netcat-collect.interceptors = filt_int
     hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
     hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
     hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true
  30. Got Big Data & Analytics work? Contact india@GoDataDriven.com. We are hiring!
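
The converging flow from slides 8 to 11, together with the IPC sources and sinks (slides 18 and 20) and the failover sink processor (slide 23), could be wired up roughly as follows. This is a sketch, not the presenter's demo configuration: the agent names `web` and `collector`, the component names, the collector host names, the port 4545, the log path, and the local channel directories are all assumptions chosen for illustration. Only the HDFS path is taken from slide 28.

```properties
# Tier 1: one agent per web server (agent name "web" is hypothetical)
web.sources = tail-logs
web.channels = mem
web.sinks = to-collector1 to-collector2
web.sinkgroups = collectors

# Exec source that follows the web server log (path is an assumption)
web.sources.tail-logs.type = exec
web.sources.tail-logs.command = tail -F /var/log/httpd/access_log
web.sources.tail-logs.channels = mem

web.channels.mem.type = memory
web.channels.mem.capacity = 10000

# Avro sinks forward events to the next hop (host names are assumptions)
web.sinks.to-collector1.type = avro
web.sinks.to-collector1.hostname = collector1.example.com
web.sinks.to-collector1.port = 4545
web.sinks.to-collector1.channel = mem

web.sinks.to-collector2.type = avro
web.sinks.to-collector2.hostname = collector2.example.com
web.sinks.to-collector2.port = 4545
web.sinks.to-collector2.channel = mem

# Failover sink processor: the sink with the larger priority value is
# preferred; the other sink is used while the preferred one is failing
web.sinkgroups.collectors.sinks = to-collector1 to-collector2
web.sinkgroups.collectors.processor.type = failover
web.sinkgroups.collectors.processor.priority.to-collector1 = 10
web.sinkgroups.collectors.processor.priority.to-collector2 = 5
web.sinkgroups.collectors.processor.maxpenalty = 10000

# Tier 2: collector agent that receives Avro events and writes to HDFS
collector.sources = avro-in
collector.channels = file-ch
collector.sinks = hdfs-write

collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 0.0.0.0
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = file-ch

# File channel keeps buffered events on local disk (directories are assumptions)
collector.channels.file-ch.type = file
collector.channels.file-ch.checkpointDir = /var/flume/checkpoint
collector.channels.file-ch.dataDirs = /var/flume/data

collector.sinks.hdfs-write.type = hdfs
collector.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
collector.sinks.hdfs-write.hdfs.fileType = DataStream
collector.sinks.hdfs-write.channel = file-ch
```

Setting `processor.type = load_balance` on the sink group instead would spread events across both collectors rather than preferring one of them.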

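The contextual routing described on slides 15, 21 and 22 (headers set by interceptors, then a multiplexing channel selector) might look like the sketch below. It assumes log lines that begin with a level such as `ERROR`; the agent name `router`, the header name `level`, the channel and sink names, and the `errors` subdirectory are hypothetical. The netcat source settings and the HDFS base path are reused from slide 28.

```properties
router.sources = netcat-collect
router.channels = error-ch default-ch
router.sinks = hdfs-errors log-rest

router.sources.netcat-collect.type = netcat
router.sources.netcat-collect.bind = 127.0.0.1
router.sources.netcat-collect.port = 11111
router.sources.netcat-collect.channels = error-ch default-ch

# Interceptors run in the order listed
router.sources.netcat-collect.interceptors = ts lvl
# Built-in timestamp interceptor: adds a "timestamp" header
router.sources.netcat-collect.interceptors.ts.type = timestamp
# Built-in regex extractor: copies the leading log level into a "level" header
router.sources.netcat-collect.interceptors.lvl.type = regex_extractor
router.sources.netcat-collect.interceptors.lvl.regex = ^(ERROR|WARN|INFO)
router.sources.netcat-collect.interceptors.lvl.serializers = s1
router.sources.netcat-collect.interceptors.lvl.serializers.s1.name = level

# Multiplexing selector: route on the value of the "level" header
router.sources.netcat-collect.selector.type = multiplexing
router.sources.netcat-collect.selector.header = level
router.sources.netcat-collect.selector.mapping.ERROR = error-ch
router.sources.netcat-collect.selector.default = default-ch

router.channels.error-ch.type = memory
router.channels.default-ch.type = memory

# Errors go to HDFS, bucketed by day using the timestamp header
router.sinks.hdfs-errors.type = hdfs
router.sinks.hdfs-errors.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test/errors/%Y-%m-%d
router.sinks.hdfs-errors.hdfs.fileType = DataStream
router.sinks.hdfs-errors.channel = error-ch

# Everything else is written to the agent's own log
router.sinks.log-rest.type = logger
router.sinks.log-rest.channel = default-ch
```

Events without a matching `level` value, including lines that carry no level at all, fall through to `default-ch`.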