                   Apache FLUME Scalability Analysis
                                                       Arinto Murdopo
                                              Facultat Informatica de Barcelona
                                             Universitat Politecnica de Catalunya
                                                       Barcelona, Spain




I. INTRODUCTION

Flume is a distributed data collection service focusing on high reliability and availability. It is often used for moving large amounts of log data from web servers into a storage system such as HDFS for further processing.

In our initial project, we used Flume to aggregate RSS feeds from many websites. In future work, those feeds will be shown on a dedicated website, and the data in HDFS can be used for further analysis.

What are we going to do in this project? We will analyse the scalability of Flume in terms of the number of events that a specific Flume configuration can support.

II. BACKGROUND

In this section, we quickly explain terminology that appears often in this technical report.

A Flume event is a unit of data flow transferred by Flume. It has a byte payload of a specific size and an optional set of string attributes. A Flume agent is a JVM process that runs a Flume source, channel, and sink. A Flume source collects the events delivered to it. Flume channels are the repositories where events are staged inside an agent. In Flume 0.9.x, we can perform operations such as filtering in the channel, but in Flume 1.0.x (the version we are using here), the filtering features are still work in progress. A Flume channel has a property called capacity, which needs to be configured properly so that Flume achieves the desired level of scalability. A Flume sink forwards the events to another Flume agent or to a data container such as HDFS, Cassandra, Dynamo, or even a conventional SQL database.
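To make these terms concrete, the following sketch shows how a single Flume 1.x agent wires a source, a channel, and a sink together in its properties-style configuration file. This is a minimal illustrative example, not our actual deployment configuration; the agent and component names (a1, r1, c1, k1), the port, and the HDFS path are all hypothetical.

    # Hypothetical flume.conf fragment for one agent named a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: listens for syslog events arriving over TCP
    a1.sources.r1.type = syslogtcp
    a1.sources.r1.port = 5140
    a1.sources.r1.host = 0.0.0.0
    a1.sources.r1.channels = c1

    # Channel: in-memory buffer; capacity is the property studied in this report
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 100000

    # Sink: drains the channel and writes events into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events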
III. EXPERIMENT SETUP

The scalability analysis is based on an existing Flume-NG performance measurement study [1] but differs in its experiment setup. We reused the Flume load generator from that study so that we can quantify scalability as the number of x-byte events per second, where x is the event size in bytes. The differences in experiment setup are:
   • This experiment introduces a one-to-one relationship between the nodes and the Flume load generators. That means each Flume load generator process runs on an independent node (an Amazon EC2 medium instance).
   • This experiment introduces a cascading setup, which will verify whether there is an improvement in scalability compared to the non-cascading setup.

We execute each experiment for 20 to 30 minutes because we have limited disk storage for the collected data.

A. Setup 1: One-to-one relationship between nodes and Flume load generator

Figure 1 shows the first setup for this experiment. The Flume load generator is implemented in Java as the Hammer class. Hammer sends TCP events of configurable size; in this setup, we use an event size of 300 bytes. The Flume node is configured with a SyslogTcpSource, which listens for TCP events on a configurable port and generates its own Flume events, which are transmitted via a memory channel to the HDFS sink. The HDFS cluster in this setup consists of three replicated nodes. We will check the relation between the channel capacity and the maximum event rate that the Flume node can support.

Fig. 1. Setup 1: One-to-one relationship between nodes and Flume load generator
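The Hammer class itself belongs to the tooling of the measurement study [1]; the fragment below is only a minimal sketch of the same idea under stated assumptions: a plain TCP socket, a fixed 300-byte line-delimited payload, and crude one-second pacing. The class name, arguments, and rate are hypothetical.

    import java.io.OutputStream;
    import java.net.Socket;
    import java.util.Arrays;

    // Minimal Hammer-style load generator sketch (hypothetical, not the
    // actual Hammer class from [1]): sends fixed-size, newline-terminated
    // events over TCP at a configurable rate.
    public class HammerSketch {
        public static void main(String[] args) throws Exception {
            String host = args[0];                    // Flume node address
            int port = Integer.parseInt(args[1]);     // SyslogTcpSource port
            int eventSize = 300;                      // bytes per event, as in our setup
            int eventsPerSecond = 200;                // target send rate

            // A real syslog event would start with a priority header such as
            // "<13>"; here we simply pad the line with a filler byte.
            byte[] payload = new byte[eventSize];
            Arrays.fill(payload, (byte) 'x');
            payload[eventSize - 1] = (byte) '\n';     // syslog TCP events are line-delimited

            try (Socket socket = new Socket(host, port)) {
                OutputStream out = socket.getOutputStream();
                while (true) {
                    for (int i = 0; i < eventsPerSecond; i++) {
                        out.write(payload);
                    }
                    out.flush();
                    Thread.sleep(1000);               // crude one-second pacing
                }
            }
        }
    }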
B. Setup 2: Cascading setup

Figure 2 shows the second setup for this experiment. We still use an event size of 300 bytes. Two Flume scale nodes are aggregated into one single collector node before the data is written into the HDFS cluster. The two separate load generators run on different, independent nodes. We will compare the cumulative events-per-second rate of the two Flume scale nodes against the first setup, using the same channel-capacity settings.

Fig. 2. Setup 2: Cascading setup
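The report does not spell out how the scale nodes talk to the collector; a plausible wiring in Flume 1.x, sketched below as an assumption rather than our recorded configuration, is to forward events over Flume's Avro RPC. Only the tier-to-tier pieces are shown; the syslog source and memory channel on each scale node follow the earlier sketch, and all names, hosts, and ports are hypothetical.

    # Scale node (e.g. Scale1-1): the sink forwards events to the
    # collector over Avro RPC instead of writing to HDFS directly
    scale.sinks = k1
    scale.sinks.k1.type = avro
    scale.sinks.k1.channel = c1
    scale.sinks.k1.hostname = collector-host
    scale.sinks.k1.port = 4141

    # Collector node: receives Avro events from both scale nodes,
    # buffers them in its own memory channel, and writes them to HDFS
    collector.sources = r1
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4141
    collector.sources.r1.channels = c1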
IV. RESULT AND DISCUSSION

A. Verifying Maximum Supported Number of Events

To verify the number of events that a specific configuration can support, we inspect the Flume log file at /var/log/flume-ng/flume.log. If the process continues without Java exceptions that cause it to stop and without regularly generated exceptions, then we treat the configuration as able to handle the event rate applied to it.
B. Setup 1 Result

We tested several channel-capacity configurations under load from the Flume load generator; the results are shown in Table I below:

 Channel Capacity (events)    Max Events Per Second
 100000                       200
 200000                       250
 400000                       275

                            TABLE I
  RELATION BETWEEN CHANNEL CAPACITY AND MAX EVENTS PER SECOND

From these results we can see that doubling the channel capacity does not double the maximum supported event rate; the two are not linearly correlated.

C. Setup 2 Result

In the second setup, we use a channel capacity of 100000 for the Scale1-1 and Scale1-2 nodes, and a channel capacity of 200000 for the collector node. From the previous experiment in setup 1, we found that the maximum event rate a single Flume node can support is 200 events per second. Table II shows the results of this setup.

 Scale1-1       Scale1-2       Cumulative     Observation
 Event Rate     Event Rate     Event Rate
 (events/s)     (events/s)     (events/s)
 100            100            200            No exception found that causes the
                                              Flume node processes to stop running
 200            200            400            No exception found that causes the
                                              Flume node processes to stop running
 250            250            500            Exceptions occur at regular intervals,
                                              in this case every 3 to 5 seconds,
                                              complaining that an event has just
                                              been lost

                            TABLE II
   RELATION BETWEEN CUMULATIVE FLUME EVENT RATES AND RELIABILITY
                      OF THE FLUME CLUSTER

From the second setup, it is interesting to see that the cumulative event rate that the Flume node cluster can support is doubled (from 200 to 400 events per second). By adding new nodes, we offload the responsibility of handling events from the collector node onto the other two nodes, Scale1-1 and Scale1-2. Note that when an event arrives at node Scale1-1 or Scale1-2, it is not forwarded to the collector node immediately. Scale1-1 and Scale1-2 hold events until a certain number of them have accumulated, and this number is configurable through the flume.conf configuration file.
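The report does not name the property that controls this accumulation; assuming the scale nodes forward events through an Avro sink as sketched earlier, the corresponding Flume 1.x knob would be the sink's batch-size property, for example:

    # Hypothetical: forward events to the collector in batches of 100
    scale.sinks.k1.batch-size = 100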

V. FUTURE WORKS

The previous Flume-NG performance measurement study [1] includes a Pig script that further analyses the data dumped into HDFS. Further analysis that could be done with it includes detecting how many times Flume retries sending data when failures occur and how many events are transferred in the meantime. During this project, the writer spent significant time trying to make this script work, but unfortunately, by the end of the project, it still did not work properly. Therefore, future work on this project is to continue fixing the existing Pig script so that we can obtain more meaningful results from the data dumped in HDFS.

VI. CONCLUSIONS

Flume is promising in terms of scalability. By using a cascading setup, we are able to improve the scalability of Flume. However, more analysis is needed, especially using the existing Pig script to process the collected data.

REFERENCES

[1] M. Percy, “Flume NG performance measurements.” https://cwiki.apache.org/FLUME/flume-ng-performance-measurements.html, 2012.
[2] Apache Flume, “Flume 1.x user guide.” http://archive.cloudera.com/cdh4/cdh/4/flume-ng-1.1.0-cdh4.0.0b2/FlumeUserGuide.html, 2012.
[3] A. Arvind, “Apache Flume (incubating).” https://blogs.apache.org/flume/entry/flume_ng_architecture, 2011.
[4] “Flume 1.x installation.” https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation.