1. Final Project: Apache Flume
Rapheephan Thongkham-Uan (Nancy)
CSCI E-90 Cloud Computing
Harvard University Extension School
Prof. Zoran B. Djordjević
@TakeshiDemonkey
2. What is Apache Flume?
▪ Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data from
many different sources to a centralised data store.
(http://flume.apache.org/FlumeUserGuide.html)
▪ The currently available version lines are 0.9.x and 1.x.
▪ I want to focus on Flume use cases in manufacturing.
@Rapheephan
3. Applying Flume to Manufacturing Process
▪ In the factory, there are many machines used in production.
▪ If each machine produces one log data file whenever a lot of product
finishes processing, a large amount of log data accumulates on the
server every day.
▪ Our objective is to analyse these log files in real time for quality
control and production control improvement.
▪ First, we need to collect these log data files from the production
lines into HDFS, then pass them through the analysis process.
4. Multi-agent flow image in the production system
[Diagram: agents 1–3 forward events to agent 4 for consolidation; agent 4 writes to HDFS]
5. My Sample
[Diagram: agent 1 — SOURCE → CHANNEL → SINK → HDFS]
▪ My system
▪ Java Runtime Environment (Java 1.6.0_31)
▪ Cloudera's Distribution Including Apache Hadoop (CDH4.3)
▪ Working steps
1. Install Apache Flume on the host machine
(Flume installation guide for CDH4: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_12.html)
2. Create two log-generating Java applications, one for machine1 and one for machine2
3. Configure the Flume agent
4. Start the Flume agent and test the system
6. Prepare the log generation application
▪ Create two virtual machines for generating machine1's and machine2's
log data.
▪ Create a simple Java socket program that produces log events to the
agent's source on a specific port (11111).
▪ Export it as an executable JAR file and move it to virtual machine1.
▪ Copy the other to virtual machine2.
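A minimal sketch of such a socket client might look like the following. This is an assumed design, not the actual genLog.jar: the default agent address, the loop count, and the sleep interval are placeholders chosen to match the slides.

```java
// LogGenerator.java -- a minimal sketch of the log-generating client.
// The agent address, event count, and interval are assumptions based on the slides.
import java.io.PrintWriter;
import java.net.Socket;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogGenerator {

    // Build one log line in the style of the sample output shown later,
    // e.g. "2013-12-17 14:32:19: This is a sample log file from machine 1."
    static String formatLine(Date now, String machine) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return fmt.format(now) + ": This is a sample log file from machine " + machine + ".";
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "133.196.211.209"; // the source's bind address
        String machine = args.length > 1 ? args[1] : "1";
        // The netcat-like source reads newline-terminated lines from a TCP
        // connection on port 11111, so println() delivers one event per line.
        Socket socket = new Socket(host, 11111);
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        try {
            for (int i = 0; i < 10; i++) {
                out.println(formatLine(new Date(), machine));
                Thread.sleep(3000); // one event every few seconds
            }
        } finally {
            out.close();
            socket.close();
        }
    }
}
```

Run it as `java -jar genLog.jar <agent-host> <machine-id>` on each virtual machine, passing `1` on machine1 and `2` on machine2.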
@Rapheephan
6
7. Configuring the Flume-ng agent on the Host
▪ We have to configure the source, the channel, and the sink in the
flow. My agent name is hdfs-agent.
▪ First, name the components in the agent:
hdfs-agent.sources = log-collect
hdfs-agent.channels = memoryChannel
hdfs-agent.sinks = hdfs-write
▪ Next, define the source's properties as follows:
hdfs-agent.sources.log-collect.type = netcat
hdfs-agent.sources.log-collect.bind = 133.196.211.209
hdfs-agent.sources.log-collect.port = 11111
hdfs-agent.sources.log-collect.channels = memoryChannel
▪ My source is a netcat-like source that listens on port 11111.
▪ Don't forget to define the channel used by the source.
@Rapheephan
7
8. Configuring the Flume-ng agent on the Host (2)
▪ We want to collect the log data and write it to the 'testflume'
directory on the HDFS cluster. Therefore, the sink should be defined
as follows:
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://<namenode>/user/<myusername>/testflume
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
hdfs-agent.sinks.hdfs-write.channel = memoryChannel
▪ Don’t forget to specify the channel used by the sink.
▪ Finally, configure the channel
hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity = 1000
▪ The channel stores the log data in memory, holding a maximum of
1000 events.
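For reference, the fragments above assemble into a single flume.conf like this (the <namenode> and <myusername> placeholders stand for site-specific values):

```properties
# flume.conf -- name the components of agent "hdfs-agent"
hdfs-agent.sources = log-collect
hdfs-agent.channels = memoryChannel
hdfs-agent.sinks = hdfs-write

# netcat-like source listening on port 11111
hdfs-agent.sources.log-collect.type = netcat
hdfs-agent.sources.log-collect.bind = 133.196.211.209
hdfs-agent.sources.log-collect.port = 11111
hdfs-agent.sources.log-collect.channels = memoryChannel

# HDFS sink writing plain-text events into the testflume directory
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://<namenode>/user/<myusername>/testflume
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
hdfs-agent.sinks.hdfs-write.channel = memoryChannel

# in-memory channel holding up to 1000 events
hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity = 1000
```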
9. Start the Flume agent and get results
▪ My configuration file name is ‘flume.conf’, and my agent name is
‘hdfs-agent’.
▪ Start the Flume agent using the following command.
$ flume-ng agent --conf-file flume.conf --name hdfs-agent
▪ Execute genLog.jar on both machines.
▪ On the Flume host, you will be able to see something like this:
13/12/17 14:36:13 INFO hdfs.BucketWriter: Creating hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230.tmp
13/12/17 14:36:19 INFO hdfs.BucketWriter: Renaming hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230.tmp to hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230
▪ Verify that the log data has been stored as events on HDFS:
<my userid>@<host>:~$ hadoop fs -cat testflume/*30
2013-12-17 14:32:19: This is a sample log file from machine 1.
2013-12-17 14:32:24: This is a sample log file from machine 1.
2013-12-17 14:32:27: This is a sample log file from machine 2.
2013-12-17 14:32:29: This is a sample log file from machine 1.
10. Next steps
▪ Analyse the log data and visualise it in (near) real time.
[Diagram: agents 1–3 feed agent 4, which writes to HDFS; the data is then processed with MapReduce, Hive, Mahout, and Impala and presented with visualisation tools]
▪ Improving the throughput of the system.
▪ Analysing and predicting future trends.
▪ etc.