Scalable Distributed Systems
Impact of reliability on the scalability of Flume
Report by Mário Almeida (EMDC)

Introduction

This report describes the mechanisms that provide fault tolerance in Flume, as well as their impact on scalability due to the growing number of flows and possible bottlenecks.

In this context, reliability is the ability to keep delivering events and logging them in the HDFS in the face of failures, without losing data. These failures can be caused by physical hardware faults, scarce bandwidth or memory, software crashes, etc.

In order to eliminate single points of failure, previous versions of Flume had a failover mechanism that would move a flow to a new agent without user intervention. This was done via a Flume Master node that kept a global view of all the data flows and could dynamically reconfigure nodes and data flows through failover chains.

The problem with this implementation was that the Flume Master would become a bottleneck of the system (even though it could be replicated through secondary Flume Master nodes), and configuring its behavior for larger sets of data flows would become complicated. This had a clear impact on the scalability of the system.

The current version of Flume addresses this problem by multiplexing or replicating flows. This means that an agent can send data through different channels for load balancing, or replicate the same flow through two different channels. This solution does provide the desired reliability, but it either duplicates the information in the system or needs more agents in order to balance the load.

In order to show the impact of failures on a Flume architecture, a scenario was chosen in which the sources are updated very often and are volatile. This case is particularly relevant for the failure of the source-gathering agents, since some events get lost. The case in which the collectors fail can be tolerated in the newer version through the use of persistent channels.

Multiple experiments were performed that test the use of memory channels versus persistent JDBC channels. A new mechanism for tolerating failures is also proposed and tested against the already existing architectures.
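As a concrete reference for the replication mechanism mentioned above, the following is a minimal sketch of a Flume NG agent configuration that copies every event from one source into two channels feeding two different collectors. All component names and hostnames are illustrative; this is not the configuration used in the experiments.

# Sketch only: one source replicated into two channels/sinks (names illustrative).
agent1.sources = generator
agent1.channels = ch1 ch2
agent1.sinks = sink1 sink2

# The replicating selector (Flume's default) copies each event
# into all of the source's channels.
agent1.sources.generator.type = exec
agent1.sources.generator.command = tail -f /etc/flume-ng/conf/S1
agent1.sources.generator.selector.type = replicating
agent1.sources.generator.channels = ch1 ch2

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

# Each copy of the flow goes to a different collector.
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = collector1.example.com
agent1.sinks.sink1.port = 60000
agent1.sinks.sink1.channel = ch1

agent1.sinks.sink2.type = avro
agent1.sinks.sink2.hostname = collector2.example.com
agent1.sinks.sink2.port = 60000
agent1.sinks.sink2.channel = ch2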
Proposal

Let's start by describing the architecture of the Flume-based system represented in figure 1. As one can observe, there are 5 distinct agents, 2 of them acting as collectors. The sources Source1, Source2 and Source3 consisted of three different C applications that generate sequences of numbers and output them to a file. At any given moment this file only contains a single sequence number, in order to achieve the volatility property. All the agents ran on different machines.

Figure 1. Implemented Flume-based architecture.

Given the likely reasons, mentioned above, for the deprecation of the Master node in the new Flume NG, this report evaluates a possible architecture for achieving reliability in Flume. In order to make it less centralized, the idea is to form smaller clusters in which the nodes are responsible for keeping track of each other. In case one of them fails, another takes over its responsibilities, either gathering information from a source or aggregating data from other agents.
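As the conclusions note, this takeover logic was realized through scripts rather than changes to Flume itself. A minimal bash sketch of the idea follows, in the spirit of the appendix scripts: only the ping-based liveness check mirrors those scripts, while the peer hostname is a placeholder and the "takeover" argument to configen is hypothetical.

#!/bin/bash
# Sketch only: per-agent watchdog for the proposed clustering scheme.
# Assumptions: PEER is another agent in the same cluster (placeholder
# hostname) and the "takeover" argument to configen is hypothetical.
PEER="ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
while true; do
    # Field 4 of ping's summary line is the number of replies received.
    recv=$(ping -c4 "$PEER" | grep transmitted | awk '{print $4}')
    if [[ -z "$recv" || "$recv" == "0" ]]; then
        # Peer looks dead: regenerate the local configuration so that this
        # agent adopts the peer's source/sink, then restart the local agent.
        ./configen takeover "$PEER"    # hypothetical argument
        sudo flume-ng agent -n agent1 -f /etc/flume-ng/conf/flumeAgent1.conf &
        break
    fi
    sleep 10
done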
Figure 2. Representation of the clustering of agents. Red arrows represent ping messages, black dotted arrows represent possible reconfigurations.

Following the architecture depicted in figure 2, one cluster could consist of Agent11, Agent12 and Agent21, and another cluster could consist of C1 and C2. For example, in case Agent11 fails, Agent12 or Agent21 takes its place gathering information from Source1, until it finally restarts. In case Collector2 fails, Collector1 aggregates the information coming from Agent21. For this purpose, every agent belonging to a cluster has to have knowledge about the other agents in the same cluster, including their sources and sinks. In this experiment the agents ping each other in order to keep track of which agents are alive.

This notion of multiple small clusters, as shown in figure 2, makes the system less dependent on a centralized entity and also eliminates the need for extra nodes. It might not preserve some consistency properties that could be achieved through the use of the embedded ZooKeeper, though.

Experiments

All the experiments were conducted with a lifespan of a few hours, generating up to 10 thousand events. The objective was to make these periods as long as possible, taking into account the costs and other schedules.

Normal execution of Flume

The first experiment consisted of the normal execution of Flume, without any failures and using memory channels.

Collector fails while using memory channels

The second experiment consisted of disconnecting collector agent 2 during the execution of Flume and registering the lost events.
Collector fails while using persistent channels

The third experiment consisted of disconnecting collector agent 2 during the execution of Flume using JDBC channels and registering the lost events.

Agent fails while using persistent channels

In this experiment it is Agent21 that is disconnected, using JDBC channels.

Dynamic routing

In this last experiment, both Collector2 and Agent11 are disconnected. It uses memory channels.

Results

Normal execution of Flume

As expected, during the normal execution of Flume all the events generated by the sources were logged into the HDFS. This means that the rate at which the data was being generated was well within the capacity of the memory channels.

Collector fails while using memory channels

Although I initially thought that the failure of a collector would imply that the data read from the source would be lost by the agent, due to the limitations of channel capacity, in most tests Flume was able to store this data and resend it once the collector was restarted. Flume uses a transactional approach to guarantee the reliable delivery of events: an event is only removed from a channel once it has been successfully stored in the next channel. This is still bounded by the capacity of the channel, but for the implementation used, with a channel capacity of 10 thousand events, Collector2 could be down for more than one hour without any channel overflow.

It was then decided to drop this capacity to 100 events and double the rate of data generation. After this change I disconnected Collector2 until more than 100 events had been read by Agent21. Once Collector2 came back online it received the 100 events that were stored in the channel of Agent21, but failed to deliver the subsequent events. In fact, it stopped working from that point on, requiring a restart.

Collector fails while using persistent channels

As expected, since Flume stores the events in a relational database instead of in memory, when a collector dies the events are stored and delivered whenever it becomes available again. This is indeed the way Flume achieves recoverability. Although it works well, it seems that while the memory channel groups events before sending, the JDBC channel works more like a pipeline, sending the events one by one. This, together with the inherent cost of writing to persistent storage, might have a significant impact on performance for large-scale systems.

Probably due to this as well, there seemed to be an imbalance between the rates at which the sources' data flows were reaching the collector. Out of roughly every 10 events that reached the collector, only 2 were from Source1 and 8 were from Source2, although both produced data at the same rate. I wondered what would happen over longer runs: in my case, after around an hour the difference had already grown to some 400 events. There seemed to be no option to group events.
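For reference, moving an agent from the memory channel to the durable JDBC channel only changes the channel definition; a minimal sketch, reusing the agent component names from the appendix configuration:

# Sketch: durable JDBC channel (backed by an embedded Derby database by
# default) in place of the memory channel, so that queued events survive
# a collector outage.
agent1.channels = jdbcChannel
agent1.channels.jdbcChannel.type = jdbc
agent1.sources.generator.channels = jdbcChannel
agent1.sinks.avro-forward-sink.channel = jdbcChannel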
Agent fails while using persistent channels

Although I expected that a stopped agent would lose some events, it seems that the events are still logged somehow, and after restarting the agent resends all the events since the beginning, repeating up to hundreds of events in the process. The experiments seemed to indicate that the JDBC channel logs events even in case the agent crashes. Other channels, such as the recoverableMemoryChannel, provide parameters to set how often a background worker checks old logs, but there is no mention of this for JDBC. Overall, although it repeated hundreds if not thousands of events, it didn't lose a single one. This would not be the case if the whole machine crashed, but I couldn't test that since my source ran on the same node as the agent itself; further testing would need a new architecture.

Dynamic routing

In the way it was implemented, every time a new configuration is generated and the agent restarted, the events in memory are lost. This happens when Collector2 is disconnected and Agent21 has to change its data flow to Collector1. Overall, some events were lost while migrating between configurations, but it achieved better results than the normal memory reliability scheme of Flume. Although for all-round purposes the JDBC channels probably achieve better reliability results, clustering incurs smaller delays in retrieving data. It might also not route the data in the most efficient way if, due to failures, all flows end up going through the same nodes.

Conclusions

As long as the data generation rate doesn't overflow the available capacity of the memory channel, Flume works well for most cases, with or without failures. There is one failure that probably can't be handled without creating replicated flows: the "unsubscribing" from a source due to the failure of a machine (not only the agent) where the source has volatile information. Further experiments should be conducted in order to compare the performance of creating clusters against the actual replication of data flows and the subsequent processing needed to store them in the HDFS. Scalability-wise, it seems that using well-implemented clusters would mean having fewer nodes and less flow of information, since the ping/heartbeat rate in a real system is much lower than a data flow. Still, the way Flume has implemented its reliability is good for its simplicity.

The proposed architecture was implemented through multiple scripts instead of actually changing the source code of Flume. This means that there were some workarounds, such as restarting and reloading configurations, that might introduce errors into the experiments. These scripts would also never provide easy-to-use access to this routing mechanism. That said, more experiments would be needed to make this report more significant. Even so, it gives an interesting overview of the mechanisms for achieving reliability in Flume while describing their limitations.

Implementing dynamic routing in the current architecture of Flume can also be achieved by using an Avro client to set different headers depending on the intended route. This could possibly be a solution for implementing the proposed architecture.
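To illustrate that last point, Flume's multiplexing channel selector can route each event according to a header set upstream, for instance by an Avro client. A hedged sketch follows; the header name, mapping values and channel names are illustrative, not part of the implemented system.

# Sketch: route events to different channels based on a "route" header
# set by the upstream client.
agent21.sources.avro-in.selector.type = multiplexing
agent21.sources.avro-in.selector.header = route
agent21.sources.avro-in.selector.mapping.c1 = ch-to-collector1
agent21.sources.avro-in.selector.mapping.c2 = ch-to-collector2
# Events without a recognized header value fall back to the default channel.
agent21.sources.avro-in.selector.default = ch-to-collector1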
Appendix

Configuration steps

Tools
● puttygen
● putty
● scp
● pssh
● Cloudera Manager 3.7
● Flume
● Hadoop

Installing CDH3 and setting up the HDFS
1. Go to Security Groups and create a new group.
   a. In the inbound rules add ssh, http and icmp (for ping). Apply Rule Changes. If it's only for test purposes you can just allow all tcp, udp and icmp.
2. Go to Key Pairs, create a key pair and download the pem file.
3. Create the public and private key files:
   a. puttygen key.ppk -L > ~/.ssh/id_rsa.pub
   b. puttygen /path/to/puttykey.ppk -O private-openssh -o ~/.ssh/id_rsa
4. Create 10 SUSE Medium instances in the AWS Management Console:
   a. Choose SUSE, next.
   b. 10 instances, Medium, next.
   c. Next.
   d. Use the created key pair, next.
   e. Choose the previously created security group.
5. Choose an AWS instance and rename it to CDHManager. Right-click on it, Connect, and copy the public DNS.
6. Download Cloudera Manager Free Edition and copy its bin file to the machine:
   a. scp cloudera-manager-installer.bin root@publicdns:/root/cloudera-manager-installer.bin
7. SSH into the machine and perform an ls; the Cloudera bin file should be there:
   a. ssh root@publicdns
   b. ls (check that the bin file is listed)
8. Install it:
   a. chmod u+x cloudera-manager-installer.bin
   b. sudo ./cloudera-manager-installer.bin
   c. Next, next, yes, next, yes... wait until the installation finishes.
9. Go to your web browser and paste the public DNS followed by port 7180, like this: publicDns:7180. Note that you can't access it yet; this is because our security group doesn't allow connections on this port:
   a. Go to Security Groups. Add a custom tcp rule with port 7180. Apply Rule Changes.
   b. Reopen the webpage. Username: admin, password: admin.
   c. Install only the free edition. Continue.
   d. Proceed without registering.
10. Go to My Instances in AWS and select all except the CDHManager. Notice that all the public DNS names appear listed below; copy them all at once and paste them on the webpage. Take out the parts that are not needed, such as "i-2jh3h53hk3h:". (Sometimes some nodes might not be accessible; just delete/restart them, create another one and put its public DNS in the list.)
11. Find the instances and install CDH3 on them with default values. Continue.
12. Choose root, all accept the same public key, select your public key and, for the private key, select the pem key. Install...
13. Continue, continue, cluster CDH3.
   a. If an error occurs it is due to ports that are not open (generally icmp or others). A common error is: "The inspector failed to run on all hosts."
14. Add the hdfs service, for example with 3 datanodes; one of them can be a name node as well.

Installing Flume NG
1. On Linux install pssh: sudo apt-get install pssh. (Yes, I have a VirtualBox running Ubuntu, and I prefer to have all these tools there. I tried X servers on Windows but didn't like them.)
2. Create a hosts.txt file with all the public DNS names.
3. Install putty on Ubuntu:
   a. sudo apt-get install putty
4. Install Flume like a boss:
   a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng"
5. Make it start on boot:
   a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng-agent"

Running the experiment
1. Send my scripts to the servers; they generate the config files:
   a. while read line; do scp -S ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no configen root@$line:/etc/flume-ng/conf; done < hosts.txt
   b. while read line; do scp -S ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no genCollector root@$line:/etc/flume-ng/conf; done < hosts.txt
2. Run configen on every machine:
   a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "cd /etc/flume-ng/conf; sudo chmod u+x configen; ./configen"
   b. Check the results:
      i. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "cd /etc/flume-ng/conf; ls"
3. SSH into the name node and do:
   a. sudo -u hdfs hadoop fs -mkdir /flume
4. SSH into the collector agents and do:
   a. sudo -u hdfs flume-ng agent -n collector1 -f /etc/flume-ng/conf/flumeCollector1.conf (note that sudo -u hdfs is needed because of the authentication mechanism of the HDFS; otherwise the following error
would occur: Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/dfs/nn/image/flume":hdfs:hadoop:drwxr-xr-x)
5. SSH into the agents that access the sources and do the following, replacing # with the agent's number (1, 2, 3):
   a. ./run S# &
   b. Check that tail -f S# produces results; stop it with ctrl+c.
   c. sudo flume-ng agent -n agent# -f /etc/flume-ng/conf/flumeAgent#.conf
6. You will probably get the following error:
   a. Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/dfs/nn/image/flume":hdfs:hadoop:drwxr-xr-x (this is because the reserved non-DFS space is bigger than the node's free space). Do the following:
      i. Go to the CDH Manager web UI, select Services, hdfs, Configurations. Set dfs.datanode.du.reserved to a value that leaves the HDFS some free space, for example 1737418240 (a bit more than 1 GB).
      ii. Restart the HDFS.
7. Now you are able to run experiments. If needed, you can download all the files of the HDFS to a local directory with the following command:
   a. hadoop fs -copyToLocal hdfs://ec2-23-22-64-132.compute-1.amazonaws.com:8020/flume /etc/flume-ng/conf/fs
   b. or:
      i. ls Flume* | awk '{print "wget http://ec2-23-22-64-132.compute-1.amazonaws.com:50075/streamFile/flume/"$0"? -O res/data-"$0}' > wget

To conclude, a Java file was created that reads from these files. The data was parsed in order to obtain the results. This experiment was repeated for multiple configurations.

Bash script that generates the configuration files

#!/bin/bash
#ping -c4 ip | grep transmitted | awk '{if($4==0){print "online"}else{print "offline"}}'

# Write the C source of the event generator to ccode.c and compile it.
# (Single quotes keep the C escapes and format strings intact.)
echo '#include <stdio.h>
#include <unistd.h>
#include <time.h>
int main(int argc, char *argv[]){
    int ctr = 0;
    FILE *file;
    file = fopen(argv[1], "w+");
    if(file == NULL)
        puts("fuuuuuu!");
    while(1){
        //puts(".");
        fprintf(file, "%s : %d\r", argv[1], ctr++);
        fflush(file);
        //printf("%s : %d\n", argv[1], ctr++);
        usleep(2000000);
        if(ctr == 1000000)
            ctr = 0;
    }
    fclose(file);
    return 0;
}' > ccode.c
gcc -o run ccode.c

# Generate the configuration file of each source-gathering agent.
for agent in 1 2 3
do
    if [[ $agent == 1 || $agent == 2 ]]
    then
        collector="ec2-50-17-85-221.compute-1.amazonaws.com"
    else
        collector="ec2-50-19-2-196.compute-1.amazonaws.com"
    fi
    #echo "Setting $collector"
    echo "agent$agent.sources = generator
agent$agent.channels = memoryChannel
agent$agent.sinks = avro-forward-sink
# For each one of the sources, the type is defined
agent$agent.sources.generator.type = exec
agent$agent.sources.generator.command = tail -f /etc/flume-ng/conf/S$agent
agent$agent.sources.generator.logStdErr = true
# The channel can be defined as follows.
agent$agent.sources.generator.channels = memoryChannel
# Each sink's type must be defined
agent$agent.sinks.avro-forward-sink.type = avro
agent$agent.sinks.avro-forward-sink.hostname = $collector
agent$agent.sinks.avro-forward-sink.port = 60000
# Specify the channel the sink should use
agent$agent.sinks.avro-forward-sink.channel = memoryChannel
# Each channel's type is defined.
agent$agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent$agent.channels.memoryChannel.capacity = 10000" > flumeAgent$agent.conf
done

#COLLECTORS----------------------
for collector in 1 2
do
    if [[ $collector == 1 ]]
    then
        # Strangely, this agent1 is indeed the collector's address
        agent1="ec2-50-17-85-221.compute-1.amazonaws.com"
        #"ec2-107-21-171-50.compute-1.amazonaws.com"
        #agent2="ec2-23-22-216-49.compute-1.amazonaws.com"
        sources="avro-source-1"
        ./genCollector $collector "$sources" "$agent1"
    else
        agent1="ec2-50-19-2-196.compute-1.amazonaws.com"
        #"ec2-107-22-64-107.compute-1.amazonaws.com"
        sources="avro-source-1"
        ./genCollector $collector "$sources" "$agent1"
    fi
done

Bash script that generates a single Collector configuration file

#!/bin/bash
echo "collector$1.sources = $2
collector$1.channels = memory-1
collector$1.sinks = hdfs-sink
#hdfs-sink
# For each one of the sources, the type is defined
collector$1.sources.avro-source-1.type = avro
collector$1.sources.avro-source-1.bind = $3
collector$1.sources.avro-source-1.port = 60000
collector$1.sources.avro-source-1.channels = memory-1" > flumeCollector$1.conf
echo "# Each sink's type must be defined
collector$1.sinks.hdfs-sink.type = hdfs
#logger
collector$1.sinks.hdfs-sink.hdfs.path = hdfs://ec2-23-22-64-132.compute-1.amazonaws.com:8020/flume
collector$1.sinks.hdfs-sink.channel = memory-1
# Each channel's type is defined.
collector$1.channels.memory-1.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
collector$1.channels.memory-1.capacity = 30000" >> flumeCollector$1.conf
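For example, the main script above invokes this generator as:

./genCollector 1 "avro-source-1" "ec2-50-17-85-221.compute-1.amazonaws.com"

which writes flumeCollector1.conf with an Avro source bound to that address on port 60000 and an HDFS sink fed through a memory channel of capacity 30000.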
