                   Apache FLUME Scalability Analysis
                                                       Arinto Murdopo
                                              Facultat Informatica de Barcelona
                                             Universitat Politecnica de Catalunya
                                                       Barcelona, Spain




I. INTRODUCTION

Flume is a distributed data collection service focusing on high reliability and availability. It is often used for moving large amounts of log data from web servers into a storage system such as HDFS for further processing.

In our initial project, we used Flume to aggregate RSS feeds from many websites. In future work, those feeds will be shown on a dedicated website, and the data in HDFS can be used for further analysis.

What are we going to do in this project? We will analyse the scalability of Flume in terms of the number of events that a specific Flume configuration can support.

II. BACKGROUND

In this section, we quickly explain terminology that appears often in this technical report.

A Flume event is a unit of data flow transferred by Flume. It has a byte payload of a specific size and an optional set of string attributes. A Flume agent is a JVM process that runs a Flume source, channel, and sink. A Flume source collects the events delivered to it. Flume channels are the repositories where events are staged inside an agent. In Flume 0.9.x, we can perform operations such as filtering in the channel, but in Flume 1.0.x (the version we are using here), the filtering features are still work in progress. A Flume channel has a property called capacity, which needs to be configured properly so that Flume achieves the desired level of scalability. A Flume sink forwards the events to another Flume agent or to a data container such as HDFS, Cassandra, Dynamo, or even a conventional SQL database.
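To make these terms concrete, the following sketch shows how a single Flume 1.x agent wires a source, a channel, and a sink together in its properties-style configuration file. This is a minimal illustrative example, not our actual deployment configuration; the agent and component names (a1, r1, c1, k1), the port, and the HDFS path are all hypothetical.

    # Hypothetical flume.conf fragment for one agent named a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: listens for syslog events arriving over TCP
    a1.sources.r1.type = syslogtcp
    a1.sources.r1.port = 5140
    a1.sources.r1.host = 0.0.0.0
    a1.sources.r1.channels = c1

    # Channel: in-memory buffer; capacity is the property studied in this report
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 100000

    # Sink: drains the channel and writes events into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events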
III. EXPERIMENT SETUP

The scalability analysis is based on an existing Flume-NG performance measurement study [1] but differs in its experiment setup. We reused the Flume load generator from that study so that we can quantify scalability as the number of x-byte events per second, where x is the event size in bytes. The differences in experiment setup are:
   • This experiment introduces a one-to-one relationship between the nodes and the Flume load generators. That means each Flume load generator process runs on an independent node (an Amazon EC2 medium instance).
   • This experiment introduces a cascading setup, which will verify whether there is an improvement in scalability compared to the non-cascading setup.

We execute each experiment for 20 to 30 minutes because we have limited disk storage for the collected data.

A. Setup 1: One-to-one relationship between nodes and Flume load generator

Figure 1 shows the first setup for this experiment. The Flume load generator is implemented in Java as the Hammer class. Hammer sends TCP events of configurable size; in this setup, we use an event size of 300 bytes. The Flume node is configured with a SyslogTcpSource, which listens for TCP events on a configurable port and generates its own Flume events, which are transmitted via a memory channel to the HDFS sink. The HDFS cluster in this setup consists of three replicated nodes. We will check the relation between the channel capacity and the maximum event rate that the Flume node can support.

Fig. 1. Setup 1: One-to-one relationship between nodes and Flume load generator
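The Hammer class itself belongs to the tooling of the measurement study [1]; the fragment below is only a minimal sketch of the same idea under stated assumptions: a plain TCP socket, a fixed 300-byte line-delimited payload, and crude one-second pacing. The class name, arguments, and rate are hypothetical.

    import java.io.OutputStream;
    import java.net.Socket;
    import java.util.Arrays;

    // Minimal Hammer-style load generator sketch (hypothetical, not the
    // actual Hammer class from [1]): sends fixed-size, newline-terminated
    // events over TCP at a configurable rate.
    public class HammerSketch {
        public static void main(String[] args) throws Exception {
            String host = args[0];                    // Flume node address
            int port = Integer.parseInt(args[1]);     // SyslogTcpSource port
            int eventSize = 300;                      // bytes per event, as in our setup
            int eventsPerSecond = 200;                // target send rate

            // A real syslog event would start with a priority header such as
            // "<13>"; here we simply pad the line with a filler byte.
            byte[] payload = new byte[eventSize];
            Arrays.fill(payload, (byte) 'x');
            payload[eventSize - 1] = (byte) '\n';     // syslog TCP events are line-delimited

            try (Socket socket = new Socket(host, port)) {
                OutputStream out = socket.getOutputStream();
                while (true) {
                    for (int i = 0; i < eventsPerSecond; i++) {
                        out.write(payload);
                    }
                    out.flush();
                    Thread.sleep(1000);               // crude one-second pacing
                }
            }
        }
    }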
B. Setup 2: Cascading setup

Figure 2 shows the second setup for this experiment. We still use an event size of 300 bytes. Two Flume scale nodes are aggregated into one single collector node before the data is written into the HDFS cluster. The two separate load generators run on different, independent nodes. We will compare the cumulative events-per-second rate of the two Flume scale nodes against the first setup, using the same channel-capacity settings.

Fig. 2. Setup 2: Cascading setup
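The report does not spell out how the scale nodes talk to the collector; a plausible wiring in Flume 1.x, sketched below as an assumption rather than our recorded configuration, is to forward events over Flume's Avro RPC. Only the tier-to-tier pieces are shown; the syslog source and memory channel on each scale node follow the earlier sketch, and all names, hosts, and ports are hypothetical.

    # Scale node (e.g. Scale1-1): the sink forwards events to the
    # collector over Avro RPC instead of writing to HDFS directly
    scale.sinks = k1
    scale.sinks.k1.type = avro
    scale.sinks.k1.channel = c1
    scale.sinks.k1.hostname = collector-host
    scale.sinks.k1.port = 4141

    # Collector node: receives Avro events from both scale nodes,
    # buffers them in its own memory channel, and writes them to HDFS
    collector.sources = r1
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4141
    collector.sources.r1.channels = c1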
IV. RESULT AND DISCUSSION

A. Verifying Maximum Supported Number of Events

To verify the number of events that a specific configuration can support, we inspect the Flume log file at /var/log/flume-ng/flume.log. If the process continues without Java exceptions that cause it to stop and without regularly generated exceptions, then we treat the configuration as able to handle the event rate applied to it.
B. Setup 1 Result

We tested several channel-capacity configurations under load from the Flume load generator; the results are shown in Table I below:

 Channel Capacity (events)    Max Events Per Second
 100000                       200
 200000                       250
 400000                       275

                            TABLE I
  RELATION BETWEEN CHANNEL CAPACITY AND MAX EVENTS PER SECOND

From these results we can see that doubling the channel capacity does not double the maximum supported event rate; the two are not linearly correlated.

C. Setup 2 Result

In the second setup, we use a channel capacity of 100000 for the Scale1-1 and Scale1-2 nodes, and a channel capacity of 200000 for the collector node. From the previous experiment in setup 1, we found that the maximum event rate a single Flume node can support is 200 events per second. Table II shows the results of this setup.

 Scale1-1       Scale1-2       Cumulative     Observation
 Event Rate     Event Rate     Event Rate
 (events/s)     (events/s)     (events/s)
 100            100            200            No exception found that causes the
                                              Flume node processes to stop running
 200            200            400            No exception found that causes the
                                              Flume node processes to stop running
 250            250            500            Exceptions occur at regular intervals,
                                              in this case every 3 to 5 seconds,
                                              complaining that an event has just
                                              been lost

                            TABLE II
   RELATION BETWEEN CUMULATIVE FLUME EVENT RATES AND RELIABILITY
                      OF THE FLUME CLUSTER

From the second setup, it is interesting to see that the cumulative event rate that the Flume node cluster can support is doubled (from 200 to 400 events per second). By adding new nodes, we offload the responsibility of handling events from the collector node onto the other two nodes, Scale1-1 and Scale1-2. Note that when an event arrives at node Scale1-1 or Scale1-2, it is not forwarded to the collector node immediately. Scale1-1 and Scale1-2 hold events until a certain number of them have accumulated, and this number is configurable through the flume.conf configuration file.
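The report does not name the property that controls this accumulation; assuming the scale nodes forward events through an Avro sink as sketched earlier, the corresponding Flume 1.x knob would be the sink's batch-size property, for example:

    # Hypothetical: forward events to the collector in batches of 100
    scale.sinks.k1.batch-size = 100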

V. FUTURE WORKS

The previous Flume-NG performance measurement study [1] includes a Pig script that further analyses the data dumped into HDFS. Further analysis that could be done with it includes detecting how many times Flume retries sending data when failures occur and how many events are transferred in the meantime. During this project, the writer spent significant time trying to make this script work, but unfortunately, by the end of the project, it still did not work properly. Therefore, future work on this project is to continue fixing the existing Pig script so that we can obtain more meaningful results from the data dumped in HDFS.

VI. CONCLUSIONS

Flume is promising in terms of scalability. By using a cascading setup, we are able to improve the scalability of Flume. However, more analysis is needed, especially using the existing Pig script to process the collected data.

REFERENCES

[1] M. Percy, “Flume NG performance measurements.” https://cwiki.apache.org/FLUME/flume-ng-performance-measurements.html, 2012.
[2] Apache Flume, “Flume 1.x user guide.” http://archive.cloudera.com/cdh4/cdh/4/flume-ng-1.1.0-cdh4.0.0b2/FlumeUserGuide.html, 2012.
[3] A. Arvind, “Apache Flume (incubating).” https://blogs.apache.org/flume/entry/flume_ng_architecture, 2011.
[4] “Flume 1.x installation.” https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation.