1. Final Project: Apache Flume
Rapheephan Thongkham-Uan (Nancy)
CSCI E-90 Cloud Computing
Harvard University Extension School
Prof. Zoran B. Djordjević
@TakeshiDemonkey
2. What is Apache Flume?
▪ Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data from
many different sources to a centralised data store.
(http://flume.apache.org/FlumeUserGuide.html)
▪ The currently available version lines are 0.9.x and 1.x.
▪ I want to focus on Flume use cases in manufacturing.
@Rapheephan
3. Applying Flume to Manufacturing Process
▪ In the factory, there are many machines used in production.
▪ If each machine produces one log data file whenever a lot of product
finishes processing, a large amount of log data accumulates on the
server every day.
▪ Our objective is to analyse these log files in real time for quality
control and production control improvement.
▪ First, we need to collect these log data files from the production
lines into HDFS, then pass them through the analysis process.
4. Multi-agent flow image in the production system
[Diagram: agents 1–3 forward events to agent 4 for consolidation; agent 4 writes to HDFS]
5. My Sample
[Diagram: agent 1 — SOURCE → CHANNEL → SINK → HDFS]
▪ My system
▪ Java Runtime Environment (Java 1.6.0_31)
▪ Cloudera's Distribution Including Apache Hadoop (CDH4.3)
▪ Working steps
1. Install Apache Flume on the host machine
(Flume installation guide for CDH4: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_12.html)
2. Create two log-generating Java applications, one for machine1 and one for machine2
3. Configure the Flume agent
4. Start the Flume agent and test the system
6. Prepare the log generation application
▪ Create two virtual machines for generating machine1's and machine2's
log data.
▪ Create a simple Java socket program that produces log events to the
agent's source on a specific port (11111).
▪ Export it as an executable JAR file and move it to virtual machine1.
▪ Copy the other to virtual machine2.
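A minimal sketch of such a socket client might look like the following. This is an assumed design, not the actual genLog.jar: the default agent address, the loop count, and the sleep interval are placeholders chosen to match the slides.

```java
// LogGenerator.java -- a minimal sketch of the log-generating client.
// The agent address, event count, and interval are assumptions based on the slides.
import java.io.PrintWriter;
import java.net.Socket;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogGenerator {

    // Build one log line in the style of the sample output shown later,
    // e.g. "2013-12-17 14:32:19: This is a sample log file from machine 1."
    static String formatLine(Date now, String machine) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return fmt.format(now) + ": This is a sample log file from machine " + machine + ".";
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "133.196.211.209"; // the source's bind address
        String machine = args.length > 1 ? args[1] : "1";
        // The netcat-like source reads newline-terminated lines from a TCP
        // connection on port 11111, so println() delivers one event per line.
        Socket socket = new Socket(host, 11111);
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        try {
            for (int i = 0; i < 10; i++) {
                out.println(formatLine(new Date(), machine));
                Thread.sleep(3000); // one event every few seconds
            }
        } finally {
            out.close();
            socket.close();
        }
    }
}
```

Run it as `java -jar genLog.jar <agent-host> <machine-id>` on each virtual machine, passing `1` on machine1 and `2` on machine2.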
@Rapheephan
6
7. Configuring the Flume-ng agent on the Host
▪ We have to configure the source, the channel, and the sink in the
flow. My agent name is hdfs-agent.
▪ First, name the components in the agent:
hdfs-agent.sources = log-collect
hdfs-agent.channels = memoryChannel
hdfs-agent.sinks = hdfs-write
▪ Next, define the source's properties as follows:
hdfs-agent.sources.log-collect.type = netcat
hdfs-agent.sources.log-collect.bind = 133.196.211.209
hdfs-agent.sources.log-collect.port = 11111
hdfs-agent.sources.log-collect.channels = memoryChannel
▪ My source is a netcat-like source that listens on port 11111.
▪ Don't forget to define the channel used by the source.
@Rapheephan
7
8. Configuring the Flume-ng agent on the Host (2)
▪ We want to collect the log data and write it to the 'testflume'
directory on the HDFS cluster. Therefore, the sink should be defined
as follows:
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://<namenode>/user/<myusername>/testflume
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
hdfs-agent.sinks.hdfs-write.channel = memoryChannel
▪ Don’t forget to specify the channel used by the sink.
▪ Finally, configure the channel
hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity = 1000
▪ The channel stores the log data in memory, holding a maximum of
1000 events.
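For reference, the fragments above assemble into a single flume.conf like this (the <namenode> and <myusername> placeholders stand for site-specific values):

```properties
# flume.conf -- name the components of agent "hdfs-agent"
hdfs-agent.sources = log-collect
hdfs-agent.channels = memoryChannel
hdfs-agent.sinks = hdfs-write

# netcat-like source listening on port 11111
hdfs-agent.sources.log-collect.type = netcat
hdfs-agent.sources.log-collect.bind = 133.196.211.209
hdfs-agent.sources.log-collect.port = 11111
hdfs-agent.sources.log-collect.channels = memoryChannel

# HDFS sink writing plain-text events into the testflume directory
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://<namenode>/user/<myusername>/testflume
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
hdfs-agent.sinks.hdfs-write.channel = memoryChannel

# in-memory channel holding up to 1000 events
hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity = 1000
```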
9. Start the Flume agent and get results
▪ My configuration file name is ‘flume.conf’, and my agent name is
‘hdfs-agent’.
▪ Start the Flume agent using the following command.
$ flume-ng agent --conf-file flume.conf --name hdfs-agent
▪ Execute genLog.jar on both machines.
▪ On the Flume host, you will be able to see something like this:
13/12/17 14:36:13 INFO hdfs.BucketWriter: Creating hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230.tmp
13/12/17 14:36:19 INFO hdfs.BucketWriter: Renaming hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230.tmp to hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230
▪ Verify that the log data has been stored as events on HDFS:
<my userid>@<host>:~$ hadoop fs -cat testflume/*30
2013-12-17 14:32:19: This is a sample log file from machine 1.
2013-12-17 14:32:24: This is a sample log file from machine 1.
2013-12-17 14:32:27: This is a sample log file from machine 2.
2013-12-17 14:32:29: This is a sample log file from machine 1.
10. Next steps
▪ Analyse the log data and visualise it in (near) real time.
[Diagram: agents 1–3 feed agent 4, which writes to HDFS; the data is then processed with MapReduce, Hive, Mahout, and Impala and presented with visualisation tools]
▪ Improving the throughput of the system.
▪ Analysing and predicting future trends.
▪ etc.