R hive tutorial supplement 1 - Installing Hadoop


Published in: Technology

RHive Tutorial – Installing Hadoop

This tutorial is for beginning users without much prior knowledge of Hadoop. It gives a simple explanation of how to install Hadoop, which must be in place before RHive can be installed.
RHive depends on Hive, and Hive in turn depends on Hadoop. Thus both Hadoop and Hive must be installed before RHive.
The installation method introduced in this tutorial sets up a small Hadoop environment for RHive.
This basic installation is useful for quickly building a small-scale distributed environment using VMs or just a few servers. It may not be appropriate for large, well-structured environments.

Installing Hadoop

Work Environment

The environment used in this tutorial is set up as follows:
  • Server cluster environment: Cloud Service
  • Number of servers: 4 virtual machines in total
  • Server specs: virtual machine, 1 core, 1GB main memory, 25GB hard disk for the OS, 2TB additional hard disk
  • OS: CentOS 5
  • Network: 10.x.x.x IP addresses

Pre-installation Checklist

Checking root account, firewall, SELinux

You must be able to connect to the servers prepared for the Hadoop installation as root, or have sudo permission that gives you system-level (root-level) access.
Each server should also be free of special firewall or security settings. If you are using a Linux system with such settings, you must have clearance to control them or already know how to use them.
If SELinux or a firewall is running with strict rules in place for security purposes, you must either manually configure the Hadoop-related ports or ACLs (Access Control Lists), or simply disable SELinux and the firewall altogether.
This tutorial installs Hadoop on isolated VMs with no external access. Since they cannot be reached from outside, SELinux and the firewall are entirely deactivated on them.

Check Server IP Address
You must know the IP addresses of the servers you will be using.
The four servers used in this tutorial each have their own IP address. This tutorial makes one of them (node0 below) the Hadoop name node, and the other three (node1 to node3) become Hadoop's job nodes.

Preliminary preparations before installing Hadoop

Setting hosts file

You need to edit /etc/hosts on each server. As you might already know, this file manually maps hostnames to IP addresses. Doing this makes configuring Hadoop more convenient.
Connect to all four servers and add the following lines to each /etc/hosts file, with each server's IP address in front of its hostname.

    <IP address of node0>  node0
    <IP address of node1>  node1
    <IP address of node2>  node2
    <IP address of node3>  node3

node0 to node3 are arbitrary hostnames; any memorable names will do. But changing them after Hadoop has been installed and run is quite dangerous, so choose them with that in mind.

Installing Java

As Hadoop is written in Java, a JVM is required.
Java is often installed along with Linux, and even if it isn't, it can be installed easily.
If the servers you are using do not have Java installed, use the following command to install Java on all servers.

    yum install java
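The hosts-file step above has to be repeated on every server, so it helps if it can be re-run safely. The following is a minimal sketch of one way to do that, not part of the original tutorial: add_host_entry is a hypothetical helper name, and the 10.0.0.x addresses are placeholders that you would replace with your servers' real addresses. The demonstration works on a scratch file, so nothing under /etc is touched.

```shell
#!/bin/sh
# add_host_entry is a hypothetical helper (an assumption of this sketch).
# It appends "IP<TAB>hostname" to a hosts file only when the hostname is
# not already mapped there, so it is safe to run more than once.
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    # Look for the hostname as a whole field (field 2 onward) on any
    # non-comment line; awk exits 0 if it was found, 1 otherwise.
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$hosts_file"; then
        echo "$name already present in $hosts_file"
    else
        printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
        echo "added $name to $hosts_file"
    fi
}

# Demonstration against a scratch file instead of the real /etc/hosts.
scratch=$(mktemp)
add_host_entry "$scratch" 10.0.0.10 node0   # placeholder address
add_host_entry "$scratch" 10.0.0.11 node1   # placeholder address
add_host_entry "$scratch" 10.0.0.10 node0   # second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

To apply it for real, you would pass /etc/hosts as the first argument while running as root on each server.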
Assigning JAVA_HOME environment variable

The JAVA_HOME environment variable must be set.
It must point to the directory where the Java SDK or JRE is installed. On CentOS, you can find that directory with the following command.

    update-alternatives --display java

In the work environment used in this tutorial, JAVA_HOME is "/usr/lib/jvm/jre-1.6.0-openjdk.x86_64".
The JAVA_HOME path can change depending on your environment or the installed Java version, so you must find out your server's exact JAVA_HOME. For that, refer to the documentation of your Linux distribution or another document on how to install Java.
Once you have found JAVA_HOME, register the environment variable in /etc/profile, ~/.bashrc, or a similar file.

    export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64

Installing Java and assigning JAVA_HOME should be done in the same way on all servers.

Downloading Hadoop

Now we'll start installing Hadoop.
As Hadoop is written in Java, simply decompressing the downloaded file completes the installation.
Hadoop 1.0.0 is packaged as rpm and deb, so you can use rpm, dpkg, or similar tools to install it. However, since Hive does not yet support Hadoop 1.0.0, it is not wise to use that version with Hive.
Before installing, Hadoop needs a directory to install into. In other words, you must decide on and create a suitable directory to decompress the file in, and it must be in a location with sufficient disk space.
Hadoop uses a lot of space once it starts up, writing log files and storing files in HDFS. So check that the disk where Hadoop will be installed has enough free space, and if a large additional hard disk is installed, check where it is mounted before installing.
This tutorial mounted a hard disk of at least 2TB at "/mnt" on each server and created a "/mnt/srv" directory below it to install Hadoop in.
It is good to set up the same directory structure on all the other servers as well.
Make a directory called srv (the name is arbitrary), as shown below.

    mkdir /mnt/srv
    cd /mnt/srv

We will install Hadoop under the base directory chosen above.
Now we are going to download Hadoop from Hadoop's official website. This tutorial recommends using version 0.20.203.
You can download every Hadoop version from the following site.

    http://www.apache.org/dyn/closer.cgi/hadoop/common/

The same version must be installed on all the servers. One way to do this is to copy the downloaded file to all servers.
Download the chosen Hadoop version on the server as shown below.

    wget http://apache.tt.co.kr//hadoop/common/hadoop-

You can also switch to a mirror site that suits you better.
Decompress the downloaded file.

    tar xvfz hadoop-

Once you have downloaded it onto one server, you can use shell commands like the following to create the same directory on the other servers and copy the file to them.
If you are not accustomed to shell programming, just do the same work manually on every other server.
    $ for I in `seq 3`; do ssh node$I mkdir /mnt/srv; done
    $ for I in `seq 3`; do scp hadoop*.gz node$I:/mnt/srv/; done
    $ for I in `seq 3`; do ssh node$I 'cd /mnt/srv/ && tar xvfz hadoop*.gz'; done

The remote commands in the last loop are quoted so that both cd and tar run on the remote node rather than tar running locally.

Making SSH Key

To enable the Hadoop namenode to control each node, you must create and register a key with a null passphrase.
Hadoop connects from the namenode to each server to run the tasktracker or datanode, and to do this it must be able to connect to each node without a password.
This tutorial creates a key that allows connecting to all servers with the root account.
With the command below, create a private key and a public key that do not ask for a password.

    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Now register the public key in authorized_keys.

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now check whether you can use the command below to connect to localhost via ssh without entering a password.

    ssh localhost

If you log in without being asked for a password, it is done.
Now exit from the connected localhost.

    exit

If you fail to connect or see a password prompt despite having properly set up OpenSSH and the keys as described above, you might need to check the sshd settings and make changes.
The sshd settings file is usually "/etc/ssh/sshd_config".
Edit the sshd_config file using any familiar editor.
    vi /etc/ssh/sshd_config

There are many configuration values in the file, but the items you should focus on are listed below.
If any of the lines below is disabled (prefixed with a #) or missing, edit or insert it as follows, then quit the editor.

    RSAAuthentication yes
    PubkeyAuthentication yes
    AuthorizedKeysFile .ssh/authorized_keys

If you still cannot connect to localhost via ssh without being asked for a password despite having modified the settings file, consult the system administrator or refer to relevant documents on configuring sshd.
Now you must insert the public key into the other servers' ~/.ssh/authorized_keys.
Normally you would copy ~/.ssh/id_rsa.pub to the other servers and then append it to authorized_keys there, but for the sake of convenience this tutorial copies authorized_keys itself to the other servers. Note that this overwrites any authorized_keys file that already exists on those servers.
Copy the entire file as shown below.

    $ for I in `seq 3`; do scp ~/.ssh/authorized_keys node$I:~/.ssh/; done

Fixing Hadoop Configurations

Once Hadoop is installed, its settings need to be configured.
This tutorial modifies 4 files in the Hadoop conf directory: hadoop-env.sh, core-site.xml, mapred-site.xml, and hdfs-site.xml.

Move to Hadoop conf Directory

First, head over to the conf directory of the Hadoop installation.

    cd /mnt/srv/hadoop-

Modify hadoop-env.sh

Open a text editor and modify hadoop-env.sh.
    vi hadoop-env.sh

Look for the lines shown below and edit them to suit your environment.

    export JAVA_HOME=/usr/java/default
    export HADOOP_LOG_DIR=/mnt/srv/hadoopdata/data/logs

JAVA_HOME can be set to the same JAVA_HOME value found earlier in this tutorial.
As HADOOP_LOG_DIR is where Hadoop's logs will be saved, it is good to choose a location with sufficient space. We will be using a directory called /mnt/srv/hadoopdata/data/logs.

Editing core-site.xml

Open core-site.xml with a text editor.

    vi core-site.xml

In this file, adjust hadoop.tmp.dir and fs.default.name to appropriate values.

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://node0:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/mnt/srv/hadoopdata/hadoop-${user.name}</value>
        <description>A base for other temporary directories.</description>
      </property>
    </configuration>

Editing hdfs-site.xml
There is no need to edit hdfs-site.xml for this setup.
But should you need to change anything, you can open the file and adjust its values with a text editor, just as with core-site.xml.
Open hdfs-site.xml with a text editor.

    vi hdfs-site.xml

Should you want to increase the number of files Hadoop will open simultaneously, adjust the value as shown below:

    <configuration>
      <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
      </property>
    </configuration>

The setting above is optional.

Editing mapred-site.xml

Open mapred-site.xml with a text editor like vi.

    vi mapred-site.xml

If you open the file and look through the contents, you may find something like the following. Here you should edit the value of mapred.job.tracker to suit your environment.
Use the defaults for the rest, or customize them to your liking.

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>node0:9001</value>
      </property>
      <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.apache.hadoop.mapred.FairScheduler</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>6</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>6</value>
      </property>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx2048M</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>16</value>
      </property>
      <property>
        <name>mapred.task.timeout</name>
        <value>3600000</value>
      </property>
    </configuration>

Activating Hadoop

Checking whether Hadoop is Running

Once Hadoop is installed, you can use a web browser to connect to a web page that shows Hadoop's status.
It is normally served on port 50030.

    http://node0:50030/

If you see Hadoop's state as "RUNNING" like below, then Hadoop is running normally.
    node0 Hadoop Map/Reduce Administration
    Quick Links
    State: RUNNING
    Started: Thu Jan 05 17:24:18 EST 2012
    Version:, r1099333
    Compiled: Wed May 4 07:57:50 PDT 2011 by oom
    Identifier: 201201051724

Naturally, you cannot connect to the page above if the Hadoop namenode is on the other side of a firewall and port 50030 is not open.

Trying to Run MRbench

Hadoop provides several useful utilities by default.
Among them, hadoop-test-* gives you an easy way to exercise map/reduce tasks.
For the Hadoop version used in this tutorial, the Hadoop home directory must contain the corresponding hadoop-test- file.
You can check whether Hadoop's Map/Reduce is working with the following command:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-test- mrbench

The results of executing the above command are as follows.

    MRBenchmark.0.0.2
    11/12/07 13:15:36 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
    11/12/07 13:15:36 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-1026698718.txt
    11/12/07 13:15:36 INFO mapred.MRBench: Running job 0: input=hdfs://node0:9000/benchmarks/MRBench/mr_input output=hdfs://node0:9000/benchmarks/MRBench/mr_output/output_1220591687
    11/12/07 13:15:36 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/12/07 13:15:37 INFO mapred.JobClient: Running job: job_201112071314_0001
    11/12/07 13:15:38 INFO mapred.JobClient:  map 0% reduce 0%
    11/12/07 13:15:55 INFO mapred.JobClient:  map 50% reduce 0%
    11/12/07 13:15:58 INFO mapred.JobClient:  map 100% reduce 0%
    11/12/07 13:16:10 INFO mapred.JobClient:  map 100% reduce 100%
    11/12/07 13:16:15 INFO mapred.JobClient: Job complete: job_201112071314_0001
    11/12/07 13:16:15 INFO mapred.JobClient: Counters: 26
    11/12/07 13:16:15 INFO mapred.JobClient:   Job Counters
    11/12/07 13:16:15 INFO mapred.JobClient:     Launched reduce tasks=1
    11/12/07 13:16:15 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22701
    11/12/07 13:16:15 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    11/12/07 13:16:15 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    11/12/07 13:16:15 INFO mapred.JobClient:     Launched map tasks=2
    11/12/07 13:16:15 INFO mapred.JobClient:     Data-local map tasks=2
    11/12/07 13:16:15 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15000
    11/12/07 13:16:15 INFO mapred.JobClient:   File Input Format Counters
    11/12/07 13:16:15 INFO mapred.JobClient:     Bytes Read=4
    11/12/07 13:16:15 INFO mapred.JobClient:   File Output Format Counters
    11/12/07 13:16:15 INFO mapred.JobClient:     Bytes Written=3
    11/12/07 13:16:15 INFO mapred.JobClient:   FileSystemCounters
    11/12/07 13:16:15 INFO mapred.JobClient:     FILE_BYTES_READ=13
    11/12/07 13:16:15 INFO mapred.JobClient:     HDFS_BYTES_READ=244
    11/12/07 13:16:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=63949
    11/12/07 13:16:15 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=3
    11/12/07 13:16:15 INFO mapred.JobClient:   Map-Reduce Framework
    11/12/07 13:16:15 INFO mapred.JobClient:     Map output materialized bytes=19
    11/12/07 13:16:15 INFO mapred.JobClient:     Map input records=1
    11/12/07 13:16:15 INFO mapred.JobClient:     Reduce shuffle bytes=19
    11/12/07 13:16:15 INFO mapred.JobClient:     Spilled Records=2
    11/12/07 13:16:15 INFO mapred.JobClient:     Map output bytes=5
    11/12/07 13:16:15 INFO mapred.JobClient:     Map input bytes=2
    11/12/07 13:16:15 INFO mapred.JobClient:     Combine input records=0
    11/12/07 13:16:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=240
    11/12/07 13:16:15 INFO mapred.JobClient:     Reduce input records=1
    11/12/07 13:16:15 INFO mapred.JobClient:     Reduce input groups=1
    11/12/07 13:16:15 INFO mapred.JobClient:     Combine output records=0
    11/12/07 13:16:15 INFO mapred.JobClient:     Reduce output records=1
    11/12/07 13:16:15 INFO mapred.JobClient:     Map output records=1
    DataLines       Maps    Reduces AvgTime (milliseconds)
    1               2       1       39487

If no errors occurred during the run, Hadoop is working without problems.
Now you can write Map/Reduce implementations of your own and use Hadoop to perform distributed processing.
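If you run MRBench repeatedly, for example after changing mapred-site.xml, you may want only the final AvgTime figure rather than the whole log. The following is a small sketch of one way to extract it, assuming the summary table has the same layout as in the run above. mrbench_avgtime is a hypothetical helper name, not a Hadoop utility.

```shell
#!/bin/sh
# mrbench_avgtime is a hypothetical helper (an assumption of this sketch).
# It reads MRBench output on stdin and prints the AvgTime value, in
# milliseconds, from the row that follows the "DataLines ..." header.
# It returns a non-zero status when no such table is found.
mrbench_avgtime() {
    awk '
        /^[[:space:]]*DataLines/ {
            # The result row is the line right after the header.
            if ((getline) > 0) { print $4; found = 1 }
            exit
        }
        END { if (!found) exit 1 }
    '
}

# Demonstration with the summary lines from the run shown in this tutorial.
mrbench_avgtime <<'EOF'
DataLines       Maps    Reduces AvgTime (milliseconds)
1               2       1       39487
EOF
```

For the sample above this prints 39487. In practice you would pipe the output of the mrbench command into mrbench_avgtime instead of using a here-document.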