Apache Flume
-Prabhu Sundarraj
-(Data Engineer)
~What is Apache Flume?~
 Flume is a top-level project at the
Apache Software Foundation. While it
can function as a general-purpose
event queue manager, in the context
of Hadoop it is most often used as a
log aggregator, collecting log data
from many diverse sources and
moving them to a centralized data
store.
Why do we need Flume...?
[Diagram: log files generated by web servers and application servers (e.g., in the US) are collected by Flume and moved into HDFS / the Hadoop cluster.]
What is a log file and what does it look like?
 Logs are unstructured data: not a table, not an RDBMS, just plain log files (so we cannot use Sqoop).
 That is why we need Flume.
 Cloudera originally created Flume to move these log files into the Hadoop cluster.
Flume Components:
A Flume data flow is made up of five main
components: Events, Sources, Channels, Sinks,
and Agents.
Events:
 An event is the basic unit of data that is moved
using Flume. It is similar to a message in JMS and
is generally small. It is made up of headers and a
byte-array body.
Sources:
 The source receives the event from some external
entity and stores it in a channel. The source must
understand the type of event that is sent to it: an
Avro event requires an Avro source.
Channels:
 A channel is an internal passive store with certain
specific characteristics. An in-memory channel, for
example, can move events very quickly, but does not
provide persistence. A file-based channel provides
persistence. A source stores an event in the channel
where it stays until it is consumed by a sink. This
temporary storage lets source and sink run
asynchronously.
Sinks:
 The sink removes the event from the channel and forwards it either to a destination, such as HDFS, or to another agent/data flow. The sink must output an event that is appropriate to the destination.
Agents:
 An agent is the container for a Flume data flow. It is any
physical JVM running Flume. An agent must contain at least
one source, channel, and sink, but the same agent can run
multiple sources, sinks, and channels. A particular data flow
path is set up through the configuration process.
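To make these five components concrete, here is a minimal single-agent sketch in Flume's properties-file format. The agent and component names (a1, r1, c1, k1) are illustrative: a netcat source feeds an in-memory channel, which a logger sink drains.

# Name the components of agent "a1" (names are placeholders)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between source and sink (fast, not persistent)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events to the agent's log (useful for testing)
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Swapping the source or sink type (for example, to spooldir or hdfs) changes what the agent ingests or where it writes, while the source-channel-sink wiring stays the same.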
Problem Statement:
 Data collection is ad hoc
 How do we get the data into Hadoop?
 The data is streaming
For Example:
 On Twitter, download all the tweets matching "Narendra Modi". If you download again five minutes later, new tweets will already have been generated.
Key features:
 Streaming data  the data never stops
 Flume agent  it brings in the streaming data
 Flume can run anywhere; it does not have to be installed on top of the Hadoop cluster.
For example:
 Our application server is in the US, our Hadoop cluster is in the UK, and the Flume server is in India (or anywhere else).
 For this we only have to configure Flume accordingly.
Multiple Flume agents connecting to a single Flume agent
 The Flume client gets the data.
 The Flume client is installed on the web server.
 The source talks to the Flume client to get the data.
 Once the client has informed the source, it starts pushing the data to the source.
 The source is the component that receives the data on the Flume server.
 Whatever data the source gets from the client, it hands to the sink, and the sink pushes the data to HDFS (a configuration sketch of this flow follows this list).
 Channel: it is used to connect the source and the sink.
 A Flume agent can be started with this command:
$ flume-ng agent --name <agent_name> --conf <conf_dir> --conf-file <conf_file>
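The flow just described (client on the web server, source and sink on the Flume server, HDFS as the destination) can be written as a two-tier configuration. The sketch below is an illustration only; the agent names, host names, ports, and paths are assumptions and do not come from the slides.

# --- Tier 1: agent "webagent" running on the web server ---
webagent.sources = logsrc
webagent.channels = mem1
webagent.sinks = avrosink

# Exec source tails the application log (path is an assumption)
webagent.sources.logsrc.type = exec
webagent.sources.logsrc.command = tail -F /var/log/app/access.log

webagent.channels.mem1.type = memory

# Avro sink forwards events to the collector agent
webagent.sinks.avrosink.type = avro
webagent.sinks.avrosink.hostname = flume-collector.example.com
webagent.sinks.avrosink.port = 4141

webagent.sources.logsrc.channels = mem1
webagent.sinks.avrosink.channel = mem1

# --- Tier 2: agent "collector" running on the Flume server ---
collector.sources = avrosrc
collector.channels = mem2
collector.sinks = hdfssink

# Avro source receives the events sent by the tier-1 agents
collector.sources.avrosrc.type = avro
collector.sources.avrosrc.bind = 0.0.0.0
collector.sources.avrosrc.port = 4141

collector.channels.mem2.type = memory

# HDFS sink writes the events into the Hadoop cluster (path is an assumption)
collector.sinks.hdfssink.type = hdfs
collector.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
collector.sinks.hdfssink.hdfs.fileType = DataStream
collector.sinks.hdfssink.hdfs.useLocalTimeStamp = true

collector.sources.avrosrc.channels = mem2
collector.sinks.hdfssink.channel = mem2

The same pattern scales to the multi-agent picture above: several tier-1 agents can all point their Avro sinks at the one collector agent's Avro source.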
Flume environment / binary path:
Flume Configuration file (spooldir_conf_file)
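The slide refers to a spooling-directory configuration file but does not reproduce it, so the following is a hedged sketch of what such a spooldir_conf_file might contain (agent name, directories, and HDFS path are assumptions). It pairs a spooling-directory source with a file channel, which, unlike the memory channel used earlier, persists buffered events to disk.

# Hypothetical contents of spooldir_conf_file (names and paths are assumptions)
spoolagent.sources = spool1
spoolagent.channels = fchan
spoolagent.sinks = hdfs1

# Spooling-directory source: ingests files dropped into the directory
spoolagent.sources.spool1.type = spooldir
spoolagent.sources.spool1.spoolDir = /home/hadoop/flume_spool
spoolagent.sources.spool1.fileHeader = true

# File channel: persistent buffering on local disk
spoolagent.channels.fchan.type = file
spoolagent.channels.fchan.checkpointDir = /home/hadoop/flume/checkpoint
spoolagent.channels.fchan.dataDirs = /home/hadoop/flume/data

# HDFS sink: writes the spooled log data into HDFS as plain text
spoolagent.sinks.hdfs1.type = hdfs
spoolagent.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/spooldir
spoolagent.sinks.hdfs1.hdfs.fileType = DataStream
spoolagent.sinks.hdfs1.hdfs.writeFormat = Text

# Wire the components together
spoolagent.sources.spool1.channels = fchan
spoolagent.sinks.hdfs1.channel = fchan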
Scenario with hands-on Flume commands:
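The slides do not list the commands themselves, so here is an assumed walk-through of how such a hands-on scenario typically runs with the spooldir sketch above (all paths and names are illustrative):

# 1. Create the spooling directory that the source will watch
mkdir -p /home/hadoop/flume_spool

# 2. Start the agent defined in the configuration file
flume-ng agent --name spoolagent \
  --conf $FLUME_HOME/conf \
  --conf-file /home/hadoop/conf/spooldir_conf_file

# 3. Drop a log file into the spooling directory; Flume picks it up and ingests it
cp /var/log/app/access.log /home/hadoop/flume_spool/

# 4. Verify that the events have landed in HDFS
hdfs dfs -ls /flume/spooldir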
Thank you!
