Apache Flume
Loading Big Data into a Hadoop Cluster using Flume
Swapnil Dubey
Big Data Hacker
GoDataDriven
Agenda
➢ What is Apache Flume?
➢ Problem statement
➢ Use Case: Collecting web server logs
➢ Overview/Architecture of Flume
➢ Demos
What is Flume?
Collection & Aggregation of Streaming Data
- Typically used for log data.
Advantages over other solutions:
➢ Scalable, Reliable, Customizable
➢ Declarative and Dynamic Configuration
➢ Contextual Routing
➢ Feature Rich and Fully Extensible
➢ Open source
Problem Statement
[Diagram: multiple application servers, each producing logs, with a Hadoop cluster as the intended destination]
Problem Statement
➢ Data collection is ad hoc
➢ How to get data into Hadoop?
➢ Data arrives as a continuous stream
Problem Statement
[Diagram: a Flume agent collects the logs from the application servers and writes them to HDFS in the Hadoop cluster]
Collecting web server logs
➢ Collecting web logs using:
- A single Flume agent
- Multiple Flume agents
➢ Typical converging flow
- Converging flow characteristics: load balancing, multiplexing, failover
- Large converging flows
- Event volume
Problem Statement: Single Flume Agent
[Diagram: one Flume agent receives the logs from all application servers and performs the HDFS write]
Problem Statement: Multiple Flume Agents - 1
[Diagram: one Flume agent per application server, each performing its own HDFS write]
Problem Statement: Multiple Flume Agents - 2
[Diagram: one Flume agent per application server, converging into a single aggregating Flume agent that performs the HDFS write]
Overview/Architecture of Flume
Components of Flume
[Diagram: a Client sends Events to a Flume agent]
Core Concepts
➢ Events
➢ Client
➢ Agents
- Source, Channel, Sink
- Interceptor
- Channel Selector
- Sink Processor
Core Concept: Event
An Event is the basic unit of data transported by Flume from source to destination.
➢ The payload is opaque to Flume.
➢ Events are accompanied by optional headers.
Headers:
- Headers are a collection of unique key-value pairs
- Headers are used for contextual routing
Core Concept: Client
An entity that generates events and passes them to one or more agents.
➢ Example: Flume log4j Appender
➢ Decouples Flume from the system where event data is generated.
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks, and other components.
Core Concepts: Source
A component that receives events and places them onto one or more channels.
➢ Different types of sources:
- Specialized sources for integrating with well-known systems, for example Syslog and Netcat
- Auto-generating sources: Exec, SEQ
- IPC sources for agent-to-agent communication: Avro, Thrift
➢ Requires at least one Channel to function.
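A minimal sketch of an Avro IPC source for agent-to-agent communication (the bind address and port here are assumptions, not from this deck):
# Avro IPC source: receives events sent by an upstream agent's Avro sink
agent.sources = avro-in
agent.sources.avro-in.type = avro
agent.sources.avro-in.bind = 0.0.0.0
agent.sources.avro-in.port = 4141
agent.sources.avro-in.channels = memoryChannel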
Core Concept: Channel
A component that buffers incoming events until they are consumed by Sinks.
➢ Different channels: Memory, File, Database
➢ Channels are fully transactional.
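Where durability across agent restarts matters, a file channel can stand in for the memory channel; a minimal sketch (the directory paths are assumptions):
# File channel: persists events to disk so they survive an agent restart
agent.channels = fileChannel
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /var/flume/checkpoint
agent.channels.fileChannel.dataDirs = /var/flume/data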
Core Concepts: Sink
A component that takes events from a channel and transmits them to the next-hop destination.
Different types of Sinks:
- Terminal sinks: HDFS, HBase
- Auto-consuming sinks: Null Sink
- IPC sinks for agent-to-agent communication: Avro, Thrift
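A minimal sketch of an Avro IPC sink forwarding events to a downstream agent (the hostname and port are assumptions):
# Avro IPC sink: sends events to the Avro source of the next-hop agent
agent.sinks = avro-out
agent.sinks.avro-out.type = avro
agent.sinks.avro-out.hostname = collector.example.com
agent.sinks.avro-out.port = 4141
agent.sinks.avro-out.channel = memoryChannel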
Core Concepts: Interceptor
Interceptors are applied to sources in a predetermined order to enable adding information to, and filtering of, events.
➢ Built-in interceptors: allow adding headers such as timestamps, static markers, etc.
➢ Custom interceptors: create headers by inspecting the event.
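For example, the built-in timestamp interceptor adds a timestamp header that an HDFS sink can later use for time-based bucketing; a minimal sketch (the interceptor name ts is arbitrary):
# Adds a "timestamp" header (milliseconds since epoch) to every event
agent.sources.netcat-collect.interceptors = ts
agent.sources.netcat-collect.interceptors.ts.type = timestamp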
Channel Selector
Facilitates the selection of one or more channels, based on preset criteria.
➢ Built-in channel selectors:
- Replicating: duplicates events across all configured channels
- Multiplexing: routes events based on headers
➢ Custom selectors can be written for dynamic criteria.
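A minimal sketch of a multiplexing selector routing on a header (the datacenter header name and its values are assumptions):
# Route events to channels by the value of the "datacenter" header
agent.sources.netcat-collect.selector.type = multiplexing
agent.sources.netcat-collect.selector.header = datacenter
agent.sources.netcat-collect.selector.mapping.NYC = mchannel1
agent.sources.netcat-collect.selector.mapping.AMS = mchannel2
agent.sources.netcat-collect.selector.default = mchannel1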
Sink Processor
The Sink Processor is responsible for invoking one sink from a specified group of sinks.
➢ Built-in sink processors:
- Load Balancing Sink Processor
- Failover Sink Processor
- Default Sink Processor
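A minimal sketch of a failover sink processor (the sink names reuse examples from this deck; the group name and priorities are assumptions): the highest-priority healthy sink receives all events, and the next one takes over on failure.
agent.sinkgroups = sg1
agent.sinkgroups.sg1.sinks = hdfs-write avro-out
agent.sinkgroups.sg1.processor.type = failover
# Higher priority wins while healthy; on failure, events fail over to the next
agent.sinkgroups.sg1.processor.priority.hdfs-write = 10
agent.sinkgroups.sg1.processor.priority.avro-out = 5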
Data Ingest
[Diagram: Events from the Client enter the Source; the Channel Processor applies the Interceptor chain, which may filter events, and the Channel Selector decides which channel(s) each event is placed on.]
Data Drain
➢ Event removal from a Channel is transactional.
[Diagram: the Sink Runner invokes the Sink Processor, which selects a Sink from the group; the chosen Sink takes events from the Channel and sends them to the next hop.]
Agent Pipeline
* Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
Assured Delivery
Agents use transactional exchanges to guarantee delivery across hops.
[Diagram: on each hop, the sending agent's Sink and the receiving agent's Source each wrap their work in a channel transaction: start transaction, take or put events, send, end transaction.]
Setting up a simple agent for HDFS
# Name the components of this agent
agent.sources = netcat-collect
agent.sinks = hdfs-write
agent.channels = memoryChannel

# Netcat source: listens on a local TCP port; each line becomes an event
agent.sources.netcat-collect.type = netcat
agent.sources.netcat-collect.bind = 127.0.0.1
agent.sources.netcat-collect.port = 11111

# HDFS sink: writes events as plain text, rolling files every 30 seconds
agent.sinks.hdfs-write.type = hdfs
agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
agent.sinks.hdfs-write.hdfs.rollInterval = 30
agent.sinks.hdfs-write.hdfs.writeFormat = Text
agent.sinks.hdfs-write.hdfs.fileType = DataStream

# In-memory channel buffering up to 10000 events
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

# Wire the source and the sink to the channel
agent.sources.netcat-collect.channels = memoryChannel
agent.sinks.hdfs-write.channel = memoryChannel
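Assuming this configuration is saved as agent.conf, the agent can be started with the standard flume-ng launcher (the --name value must match the property prefix, here agent):
flume-ng agent --conf conf --conf-file agent.conf --name agent -Dflume.root.logger=INFO,console
Events can then be sent for testing with, for example, nc 127.0.0.1 11111.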
Advanced Features
Fan-In and Fan-Out
hdfs-agent.channels = mchannel1 mchannel2
hdfs-agent.sources.netcat-collect.selector.type = replicating
hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2
Interceptors
hdfs-agent.sources.netcat-collect.interceptors = filt_int
hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true
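With excludeEvents = true, the regex_filter interceptor drops every event whose body matches ^echo.* before it reaches the channel; setting it to false would instead keep only the matching events.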
Got Big Data & Analytics work? Contact india@GoDataDriven.com
We are hiring!!
