Apache Flume
Loading Big Data into a Hadoop Cluster using Flume
Swapnil Dubey
Big Data Hacker
GoDataDriven
Agenda
➢ What is Apache Flume?
➢ Problem statement
➢ Use Case: Collecting web server logs
➢ Overview/Architecture of Flume
➢ Demos
What is Flume?
Collection & Aggregation of Streaming Data
- Typically used for log data.
Advantages over other solutions:
➢ Scalable, Reliable, Customizable
➢ Declarative and Dynamic Configuration
➢ Contextual Routing
➢ Feature Rich and Fully Extensible
➢ Open source
Problem Statement
[Diagram: multiple application servers, each producing logs, with a Hadoop cluster as the intended destination]
Problem Statement
➢ Data collection is ad hoc
➢ How to get data into Hadoop?
➢ Data arrives as a continuous stream
Problem Statement
[Diagram: a Flume agent collects the logs from the application servers and writes them to HDFS in the Hadoop cluster]
Collecting web server logs
➢ Collecting web logs using:
- A single Flume agent
- Multiple Flume agents
➢ Typical converging flow
- Converging flow characteristics: load balancing, multiplexing, failover
- Large converging flows
- Event volume
Problem Statement: Single Flume Agent
[Diagram: one Flume agent receives the logs from all application servers and performs the HDFS write]
Problem Statement: Multiple Flume Agents - 1
[Diagram: one Flume agent per application server, each performing its own HDFS write]
Problem Statement: Multiple Flume Agents - 2
[Diagram: one Flume agent per application server, converging into a single aggregating Flume agent that performs the HDFS write]
Overview/Architecture of Flume
Components of Flume
[Diagram: a Client sends Events to a Flume agent]
Core Concepts
➢ Events
➢ Client
➢ Agents
- Source, Channel, Sink
- Interceptor
- Channel Selector
- Sink Processor
Core Concept: Event
An Event is the basic unit of data transported by Flume from source to destination.
➢ The payload is opaque to Flume.
➢ Events are accompanied by optional headers.
Headers:
- Headers are a collection of unique key-value pairs
- Headers are used for contextual routing
Core Concept: Client
An entity that generates events and passes them to one or more agents.
➢ Example: Flume log4j Appender
➢ Decouples Flume from the system where event data is generated.
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks, and other components.
Core Concepts: Source
A component that receives events and places them onto one or more channels.
➢ Different types of sources:
- Specialized sources for integrating with well-known systems, for example Syslog and Netcat
- Auto-generating sources: Exec, SEQ
- IPC sources for agent-to-agent communication: Avro, Thrift
➢ Requires at least one Channel to function.
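A minimal sketch of an Avro IPC source for agent-to-agent communication (the bind address and port here are assumptions, not from this deck):
# Avro IPC source: receives events sent by an upstream agent's Avro sink
agent.sources = avro-in
agent.sources.avro-in.type = avro
agent.sources.avro-in.bind = 0.0.0.0
agent.sources.avro-in.port = 4141
agent.sources.avro-in.channels = memoryChannel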
Core Concept: Channel
A component that buffers incoming events until they are consumed by Sinks.
➢ Different channels: Memory, File, Database
➢ Channels are fully transactional.
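Where durability across agent restarts matters, a file channel can stand in for the memory channel; a minimal sketch (the directory paths are assumptions):
# File channel: persists events to disk so they survive an agent restart
agent.channels = fileChannel
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /var/flume/checkpoint
agent.channels.fileChannel.dataDirs = /var/flume/data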
Core Concepts: Sink
A component that takes events from a channel and transmits them to the next-hop destination.
Different types of Sinks:
- Terminal sinks: HDFS, HBase
- Auto-consuming sinks: Null Sink
- IPC sinks for agent-to-agent communication: Avro, Thrift
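A minimal sketch of an Avro IPC sink forwarding events to a downstream agent (the hostname and port are assumptions):
# Avro IPC sink: sends events to the Avro source of the next-hop agent
agent.sinks = avro-out
agent.sinks.avro-out.type = avro
agent.sinks.avro-out.hostname = collector.example.com
agent.sinks.avro-out.port = 4141
agent.sinks.avro-out.channel = memoryChannel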
Core Concepts: Interceptor
Interceptors are applied to sources in a predetermined order to enable adding information to, and filtering of, events.
➢ Built-in interceptors: allow adding headers such as timestamps, static markers, etc.
➢ Custom interceptors: create headers by inspecting the event.
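For example, the built-in timestamp interceptor adds a timestamp header that an HDFS sink can later use for time-based bucketing; a minimal sketch (the interceptor name ts is arbitrary):
# Adds a "timestamp" header (milliseconds since epoch) to every event
agent.sources.netcat-collect.interceptors = ts
agent.sources.netcat-collect.interceptors.ts.type = timestamp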
Channel Selector
Facilitates the selection of one or more channels, based on preset criteria.
➢ Built-in channel selectors:
- Replicating: duplicates events across all configured channels
- Multiplexing: routes events based on headers
➢ Custom selectors can be written for dynamic criteria.
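A minimal sketch of a multiplexing selector routing on a header (the datacenter header name and its values are assumptions):
# Route events to channels by the value of the "datacenter" header
agent.sources.netcat-collect.selector.type = multiplexing
agent.sources.netcat-collect.selector.header = datacenter
agent.sources.netcat-collect.selector.mapping.NYC = mchannel1
agent.sources.netcat-collect.selector.mapping.AMS = mchannel2
agent.sources.netcat-collect.selector.default = mchannel1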
Sink Processor
The Sink Processor is responsible for invoking one sink from a specified group of sinks.
➢ Built-in sink processors:
- Load Balancing Sink Processor
- Failover Sink Processor
- Default Sink Processor
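A minimal sketch of a failover sink processor (the sink names reuse examples from this deck; the group name and priorities are assumptions): the highest-priority healthy sink receives all events, and the next one takes over on failure.
agent.sinkgroups = sg1
agent.sinkgroups.sg1.sinks = hdfs-write avro-out
agent.sinkgroups.sg1.processor.type = failover
# Higher priority wins while healthy; on failure, events fail over to the next
agent.sinkgroups.sg1.processor.priority.hdfs-write = 10
agent.sinkgroups.sg1.processor.priority.avro-out = 5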
Data Ingest
[Diagram: Events from the Client enter the Source; the Channel Processor applies the Interceptor chain, which may filter events, and the Channel Selector decides which channel(s) each event is placed on.]
Data Drain
➢ Event removal from a Channel is transactional.
[Diagram: the Sink Runner invokes the Sink Processor, which selects a Sink from the group; the chosen Sink takes events from the Channel and sends them to the next hop.]
Agent Pipeline
* Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
Assured Delivery
Agents use transactional exchanges to guarantee delivery across hops.
[Diagram: on each hop, the sending agent's Sink and the receiving agent's Source each wrap their work in a channel transaction: start transaction, take or put events, send, end transaction.]
Setting up a simple agent for HDFS
# Name the components of this agent
agent.sources = netcat-collect
agent.sinks = hdfs-write
agent.channels = memoryChannel

# Netcat source: listens on a local TCP port; each line becomes an event
agent.sources.netcat-collect.type = netcat
agent.sources.netcat-collect.bind = 127.0.0.1
agent.sources.netcat-collect.port = 11111

# HDFS sink: writes events as plain text, rolling files every 30 seconds
agent.sinks.hdfs-write.type = hdfs
agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
agent.sinks.hdfs-write.hdfs.rollInterval = 30
agent.sinks.hdfs-write.hdfs.writeFormat = Text
agent.sinks.hdfs-write.hdfs.fileType = DataStream

# In-memory channel buffering up to 10000 events
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

# Wire the source and the sink to the channel
agent.sources.netcat-collect.channels = memoryChannel
agent.sinks.hdfs-write.channel = memoryChannel
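Assuming this configuration is saved as agent.conf, the agent can be started with the standard flume-ng launcher (the --name value must match the property prefix, here agent):
flume-ng agent --conf conf --conf-file agent.conf --name agent -Dflume.root.logger=INFO,console
Events can then be sent for testing with, for example, nc 127.0.0.1 11111.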
Advanced Features
Fan-In and Fan-Out
hdfs-agent.channels = mchannel1 mchannel2
hdfs-agent.sources.netcat-collect.selector.type = replicating
hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2
Interceptors
hdfs-agent.sources.netcat-collect.interceptors = filt_int
hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true
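With excludeEvents = true, the regex_filter interceptor drops every event whose body matches ^echo.* before it reaches the channel; setting it to false would instead keep only the matching events.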
Got Big Data & Analytics work? Contact india@GoDataDriven.com
We are hiring!!
