Apache Flume
-Prabhu Sundarraj
-(Data Engineer)
~What is Apache Flume?~
 Flume is a top-level project at the
Apache Software Foundation. While it
can function as a general-purpose
event queue manager, in the context
of Hadoop it is most often used as a
log aggregator, collecting log data
from many diverse sources and
moving them to a centralized data
store.
Why do we need Flume...?
[Diagram: log files generated by web servers and application servers (e.g., in the US) are collected by Flume and moved into HDFS / the Hadoop cluster.]
What is a log file and what does it look like?
 Logs are unstructured data: not a table, not an RDBMS, just plain log files (so we cannot use Sqoop).
 That is why we need Flume.
 Cloudera originally created Flume to move these log files into the Hadoop cluster.
Flume Components:
A Flume data flow is made up of five main
components: Events, Sources, Channels, Sinks,
and Agents.
Events:
 An event is the basic unit of data that is moved
using Flume. It is similar to a message in JMS and
is generally small. It is made up of headers and a
byte-array body.
Sources:
 The source receives the event from some external
entity and stores it in a channel. The source must
understand the type of event that is sent to it: an
Avro event requires an Avro source.
Channels:
 A channel is an internal passive store with certain
specific characteristics. An in-memory channel, for
example, can move events very quickly, but does not
provide persistence. A file-based channel provides
persistence. A source stores an event in the channel
where it stays until it is consumed by a sink. This
temporary storage lets source and sink run
asynchronously.
Sinks:
 The sink removes the event from the channel and forwards it either to a destination, such as HDFS, or to another agent/data flow. The sink must output an event that is appropriate to the destination.
Agents:
 An agent is the container for a Flume data flow. It is any
physical JVM running Flume. An agent must contain at least
one source, channel, and sink, but the same agent can run
multiple sources, sinks, and channels. A particular data flow
path is set up through the configuration process.
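To make these five components concrete, here is a minimal single-agent sketch in Flume's properties-file format. The agent and component names (a1, r1, c1, k1) are illustrative: a netcat source feeds an in-memory channel, which a logger sink drains.

# Name the components of agent "a1" (names are placeholders)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between source and sink (fast, not persistent)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events to the agent's log (useful for testing)
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Swapping the source or sink type (for example, to spooldir or hdfs) changes what the agent ingests or where it writes, while the source-channel-sink wiring stays the same.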
Problem Statement:
 Data collection is ad hoc
 How do we get the data into Hadoop?
 The data is streaming
For Example:
 On Twitter, download all the tweets matching "Narendra Modi". If you download again five minutes later, new tweets will already have been generated.
Key features:
 Streaming data  the data never stops
 Flume agent  it brings in the streaming data
 Flume can run anywhere; it does not have to be installed on top of the Hadoop cluster.
For example:
 Our application server is in the US, our Hadoop cluster is in the UK, and the Flume server is in India (or anywhere else).
 For this we only have to configure Flume accordingly.
Multiple Flume agents connecting to a single Flume agent
 The Flume client gets the data.
 The Flume client is installed on the web server.
 The source talks to the Flume client to get the data.
 Once the client has informed the source, it starts pushing the data to the source.
 The source is the component that receives the data on the Flume server.
 Whatever data the source gets from the client, it hands to the sink, and the sink pushes the data to HDFS (a configuration sketch of this flow follows this list).
 Channel: it is used to connect the source and the sink.
 A Flume agent can be started with this command:
$ flume-ng agent --name <agent_name> --conf <conf_dir> --conf-file <conf_file>
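The flow just described (client on the web server, source and sink on the Flume server, HDFS as the destination) can be written as a two-tier configuration. The sketch below is an illustration only; the agent names, host names, ports, and paths are assumptions and do not come from the slides.

# --- Tier 1: agent "webagent" running on the web server ---
webagent.sources = logsrc
webagent.channels = mem1
webagent.sinks = avrosink

# Exec source tails the application log (path is an assumption)
webagent.sources.logsrc.type = exec
webagent.sources.logsrc.command = tail -F /var/log/app/access.log

webagent.channels.mem1.type = memory

# Avro sink forwards events to the collector agent
webagent.sinks.avrosink.type = avro
webagent.sinks.avrosink.hostname = flume-collector.example.com
webagent.sinks.avrosink.port = 4141

webagent.sources.logsrc.channels = mem1
webagent.sinks.avrosink.channel = mem1

# --- Tier 2: agent "collector" running on the Flume server ---
collector.sources = avrosrc
collector.channels = mem2
collector.sinks = hdfssink

# Avro source receives the events sent by the tier-1 agents
collector.sources.avrosrc.type = avro
collector.sources.avrosrc.bind = 0.0.0.0
collector.sources.avrosrc.port = 4141

collector.channels.mem2.type = memory

# HDFS sink writes the events into the Hadoop cluster (path is an assumption)
collector.sinks.hdfssink.type = hdfs
collector.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
collector.sinks.hdfssink.hdfs.fileType = DataStream
collector.sinks.hdfssink.hdfs.useLocalTimeStamp = true

collector.sources.avrosrc.channels = mem2
collector.sinks.hdfssink.channel = mem2

The same pattern scales to the multi-agent picture above: several tier-1 agents can all point their Avro sinks at the one collector agent's Avro source.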
Flume environment / binary path:
Flume Configuration file (spooldir_conf_file)
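The slide refers to a spooling-directory configuration file but does not reproduce it, so the following is a hedged sketch of what such a spooldir_conf_file might contain (agent name, directories, and HDFS path are assumptions). It pairs a spooling-directory source with a file channel, which, unlike the memory channel used earlier, persists buffered events to disk.

# Hypothetical contents of spooldir_conf_file (names and paths are assumptions)
spoolagent.sources = spool1
spoolagent.channels = fchan
spoolagent.sinks = hdfs1

# Spooling-directory source: ingests files dropped into the directory
spoolagent.sources.spool1.type = spooldir
spoolagent.sources.spool1.spoolDir = /home/hadoop/flume_spool
spoolagent.sources.spool1.fileHeader = true

# File channel: persistent buffering on local disk
spoolagent.channels.fchan.type = file
spoolagent.channels.fchan.checkpointDir = /home/hadoop/flume/checkpoint
spoolagent.channels.fchan.dataDirs = /home/hadoop/flume/data

# HDFS sink: writes the spooled log data into HDFS as plain text
spoolagent.sinks.hdfs1.type = hdfs
spoolagent.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/spooldir
spoolagent.sinks.hdfs1.hdfs.fileType = DataStream
spoolagent.sinks.hdfs1.hdfs.writeFormat = Text

# Wire the components together
spoolagent.sources.spool1.channels = fchan
spoolagent.sinks.hdfs1.channel = fchan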
Scenario with hands-on Flume commands:
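The slides do not list the commands themselves, so here is an assumed walk-through of how such a hands-on scenario typically runs with the spooldir sketch above (all paths and names are illustrative):

# 1. Create the spooling directory that the source will watch
mkdir -p /home/hadoop/flume_spool

# 2. Start the agent defined in the configuration file
flume-ng agent --name spoolagent \
  --conf $FLUME_HOME/conf \
  --conf-file /home/hadoop/conf/spooldir_conf_file

# 3. Drop a log file into the spooling directory; Flume picks it up and ingests it
cp /var/log/app/access.log /home/hadoop/flume_spool/

# 4. Verify that the events have landed in HDFS
hdfs dfs -ls /flume/spooldir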
Thank you!
