1                   Apache FLUME Scalability Analysis                                                       Arinto Murdopo...
2                                                                                 events from Collector nodes into the oth...
Upcoming SlideShare
Loading in …5

Flume Event Scalability


Published on

Short report for Scalable Distributed System project for Spring Semester 2012

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Flume Event Scalability

  1. 1. 1 Apache FLUME Scalability Analysis Arinto Murdopo Facultat Informatica de Barcelona Universitat Politecnica de Catalunya Barcelona, Spain I. I NTRODUCTION We execute each experiment for duration of 20 to 30 minutes Flume is distributed data collection service focusing on because we have limited size of disk storage for storing thehigh reliability and availability. It often used for moving large data.amounts of log data from web servers to database such asHDFS for further processing. A. Setup 1: One to one relationship between nodes and Flume In initial project, we used Flume to aggregate RSS from load generatormany websites. In future works, those RSSs will be shown in Figure 1 shows the first setup for this experiment. Flumea dedicated website and data in HDFS can be used for further load generator is implemented using Java under Hammer class.data analysis. Hammer basically sends configurable TCP events in term of What are we going to do in this project In this project, we size. In this setup, we use event size of 300 bytes. Flume nodewill analyse the scalability of Flume in term of number of is configured with SyslogTcpSource which will listen to TCPevent that specific Flume configuration can support. event in configurable port and generate its own Flume event to be transmitted via Memory channel to the HDFS sink. The HDFS in this setup consists of three replicated nodes. We will II. BACKGROUND check the relation between channel capacity and maximum In this section, we will quickly explain terminologies that rate of events that Flume node can support.often appear in this technical report. Flume event is a unit of data flow that transferred by Flume.It has specific size of byte payload and optional set of stringattributes. Flume agent is a JVM process that runs Flumesource, channel, and sink. Flume source collects the eventsthat delivered to the sources. Flume channels are repositorieswhere the events are processed in an agent. In Flume 0.9.x, wecan do something like filtering in the channel. But in Flume1.0.x (the one that we are using now), the filtering featuresare still working in progress. Flume channel has a propertycalled capacity, which needs to be configured properly so Fig. 1. Setup 1: One to one relationship between nodes and Flume loadthat Flume will have desired level of scalability. Flume sink generatorforwards the events to another Flume agent or data container,such as HDFS, Cassandra, Dynamo or even conventional SQLDatabase. B. Setup 2: Cascading setup Figure 2 shows the second setup for this experiment. We III. E XPERIMENT S ETUP still use event size of 300 bytes in this setup. Two Flume The scalability analysis is based on existing Flume-NG per- scale nodes are aggregated into one single collector nodeformance measurement study [1] but it differs in experiment before writing the data into HDFS cluster. Two separate loadsetup. We reused Flume load generator from the aforemen- generators reside in different independent nodes. We willtioned study so that we can easily quantify the scalability as compare accumulated number of event-per-seconds of twonumber of x-byte-event per seconds where x is the size of he Flume scale nodes with first setup with same settings ofevent. The differences on experiment setup are: channel capacity. • This experiment introduces one-to-one relationship be- IV. R ESULT AND D ISCUSSION tween the nodes and Flume load generator. That means, each Flume load generator process exists in an indepen- A. Verifying Maximum Supported Number of Events dent node (which is Amazon EC2 medium instance) To verify the number of events that specific configuration • This experiment introduces cascading setup, which will can support, we inspect Flume log file in /var/log/flume- verify whether there is improvement in scalability or not ng/flume.log. If the process can continue without Java ex- compared to non-cascading setup ceptions that cause it to stop or without regularly generated
  2. 2. 2 events from Collector nodes into the other two nodes which are Scale1-1 and Scale1-2. Note that when event arrives in node Scale1-1 or Scale1-2, the event is not directly forwarded into Collector node. Node Scale1-1 and Scale1-2 will hold the events until certain number of events are accumulated. This amount of accumulated events are configurable using flume.conf configuration file.Fig. 2. Setup 2: Cascading setup V. F UTURE W ORKS Previous Flume-ng Performance Measurement study [1] actually includes a Pig script that will further analyse the data dumped in HDFS. Further analysis that can be done includesexceptions, then we will treat it as able to handle the event detecting how many time Flume tries to retries data sendingrates applied to it. when failures occur and the number of events have been transferred during During this project, the writer has spentB. Setup 1 Result significant time to make it works, but unfortunately until the We used several capacity configuration of Flume load gen- end of the project, it still does not work properly. Therefore,erator and the results are shown in Table I below: future works of this project is to continue fixing existing Pig script so that we have more meaningful result from the dumped Channel Capacity Max Events Per Second 100000 200 data in HDFS. 200000 250 400000 275 VI. C ONCLUSIONS TABLE I Flume is promising in term of scalability. By using cascad- R ELATION BETWEEN C HANNEL C APACITY AND M AX E VENTS P ER S ECOND ing setup, we are able to improve the scalability of Flume. However, more analysis is needed especially using additional existing Pig script to process the data. From these results we can easily see that doubling thechannel capacity does not necessarily doubled the maximum R EFERENCESnumber of event rates. Both of them are not linearly correlated. [1] M. Percy, “Flume NG performance measurements.” https://cwiki.apache. org/FLUME/flume-ng-performance-measurements.html, 2012. [2] A. Apache, “Flume 1.x user guide.” http://archive.cloudera.com/cdh4/cdh/C. Setup 2 Result 4/flume-ng-1.1.0-cdh4.0.0b2/FlumeUserGuide.html, 2012. [3] a. arvind, “Apache flume (incubating).” https://blogs.apache.org/flume/ In second setup, we use channel capacity of 100000 for entry/flume ng architecture, 2011.Scale1-1 and Scale1-2 nodes, and channel capacity of 200000 [4] “Flume 1.x installation.” https://ccp.cloudera.com/display/CDHDOC/for Collector nodes. From previous experiment in setup 1, we Flume+1.x+Installation.find that the maximum number of events that a single flumenode is 200 events per second. Table II shows the results ofthis setup. Scale1-1 Scale1-2 Cumulative Observation Event Rates Event Rates Event Rates 100 100 200 No exception found that causes Flume node processes to stop running 200 200 400 No exception found that causes Flume node processes to stop running 250 250 500 Exception occurs at regular in- terval. In this case, the ex- ception occurs every 3 to 5 seconds, complaining that an event has just lost. TABLE II R ELATION BETWEEN CUMULATIVE F LUME EVENT RATES AND RELIABILITY OF F LUME CLUSTER From the second setup, it is interesting to see that thecumulative number of events that can be supported by Flumenode cluster is doubled (from 200 to 400). Well, by addingnew nodes, we offload the responsibility of handling the