RHive tutorial - Installation
This is a tutorial which explains how to install the RHive packages with Hadoop and Hive environment.

Published in: Technology

Transcript

  • 1. RHive tutorial - Installation

There are 3 ways to install RHive:
    • Install from CRAN.
    • Download an already built R package from the RHive project homepage, then install it with R CMD.
    • Download the source from Github, build it, then install it.

Excluding the version deployed on CRAN, all RHive packages and sources can be found at the site below.
RHive's Github repository path: https://github.com/nexr/RHive

Contents of this Tutorial
This tutorial explains how to install and run R and RHive in an environment where Hadoop and Hive are running.

Environments used in this Tutorial
This tutorial is written with installing RHive on 64-bit CentOS 5 Linux in mind. Installation procedures on other Linux distributions or Mac OS X are virtually identical; only the methods of installing packages such as git or ant may differ between them. Using RHive on Windows will be covered in a separate article.

Hadoop and Hive Structural Environment
The modules installed and running on the servers used in this tutorial are as follows.
    10.1.1.1 - Hadoop namenode, Hive server, R, RHive
    10.1.1.[2-4] - Hadoop job node, DFS node, Rserve node

Thus, this tutorial assumes the following are already set up:
    • The Hadoop namenode is installed on server 10.1.1.1, and Hive is installed there with the Hive server running.
    • Servers 10.1.1.2, 10.1.1.3, and 10.1.1.4 each run a Hadoop DFS node and a Hadoop job node.
    • Hadoop and Hive are functioning normally.
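The assumptions listed above can be partly checked from a shell before going further. The following is a minimal sketch and not part of the original tutorial: the helper name missing_tools is hypothetical, and it assumes the Hadoop and Hive launchers are called hadoop and hive and are on the PATH of the account you work with, which the tutorial does not state. It only confirms the commands exist, not that the services are healthy.

```shell
#!/bin/sh
# Print the name of every given tool that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of RHive.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    return 0
}

# Assumption: the Hadoop and Hive launchers are named "hadoop" and "hive".
missing=$(missing_tools hadoop hive)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "hadoop and hive are both on PATH"
fi
```

The same helper could be reused later for git, ant, java, and R, which the build steps of this tutorial require.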
  • 2. Should you require guidance starting from Hadoop and Hive installation, please use the Hadoop and Hive references.

Note
It is generally not a good idea to install anything other than the namenode on the Hadoop namenode server, but for the sake of quick setup of a small-scale cluster (and out of convenience), this tutorial installs the Hive server, R, and RHive there. If a larger scale with simultaneous usage by multiple users is desired, an appropriately adapted application of this tutorial should suffice.

Method of Installing Git to Download Sources
Downloading the source code from Github and installing it is not much trouble, and it has the advantage of letting you directly build and use the newest packages. If a problem is found in the RHive version you are using and there are source code updates, it is faster to just download the source code and build it.
The Github repository where you can download RHive's source code is as follows:
    git://github.com/nexr/RHive.git

If you are using Linux or Mac OS X, you can open a terminal and use SSH to connect to the remote server you plan to work on.
This tutorial uses the root account as its work account. If your environment does not permit connecting as root, you must obtain sudoer permission and run the commands with sudo.

Connecting to or opening a terminal
Open a terminal window or connect to the server you plan to work on.
    ssh root@10.1.1.1
Note: we assume 10.1.1.1 is the server on which RHive should be installed.

Download Source Code
Make a temporary directory and download the RHive source into it via git. Then move to the automatically created subdirectory, 'RHive'.
  • 3.
    mkdir RHive_source
    cd RHive_source
    git clone git://github.com/nexr/RHive.git
    # if you succeed, the directory "RHive" is made automatically
    cd RHive
If git is not installed and you therefore cannot clone, use the command below to install git and then follow the directions above.
    yum install git

Using ant to build jar
Before building the RHive package, you must build the sub-modules, which are written in Java and packaged as files with the jar extension. This is not required when installing from CRAN or downloading an already built package; it is required when downloading the source and building it manually. That is, the jar used by RHive's sub-modules must be compiled before the RHive package can be made into a form installable by R.
Run ant to compile the jar files that will be included in the RHive sub-modules.
    ant build
If ant is not installed, install it first and then run the command above. Java must be installed as well, of course.
Ant can be installed with the following command:
    yum install ant
Once the build command has been executed, output like the following results:
    # ant
    Buildfile: build.xml
    compile:
        [mkdir] Created dir: /mnt/srv/RHive_package/RHive/build/classes
        [javac] Compiling 5 source files to
  • 4.
        /mnt/srv/RHive_package/RHive/build/classes
        [unjar] Expanding: /mnt/srv/RHive_package/RHive/RHive/inst/javasrc/lib/REngine.jar into /mnt/srv/RHive_package/RHive/build/classes
        [unjar] Expanding: /mnt/srv/RHive_package/RHive/RHive/inst/javasrc/lib/RserveEngine.jar into /mnt/srv/RHive_package/RHive/build/classes
    jar:
        [jar] Building jar: /mnt/srv/RHive_package/RHive/rhive_udf.jar
    cran:
        [copy] Copying 1 file to /mnt/srv/RHive_package/RHive/RHive/inst/java
        [copy] Copying 13 files to /mnt/srv/RHive_package/RHive/build/CRAN/rhive/inst
        [copy] Copying 9 files to /mnt/srv/RHive_package/RHive/build/CRAN/rhive/man
        [copy] Copying 3 files to /mnt/srv/RHive_package/RHive/build/CRAN/rhive/R
        [copy] Copying 1 file to /mnt/srv/RHive_package/RHive/build/CRAN/rhive
        [copy] Copying 1 file to /mnt/srv/RHive_package/RHive/build/CRAN/rhive
        [delete] Deleting: /mnt/srv/RHive_package/RHive/rhive_udf.jar
    main:
    BUILD SUCCESSFUL
The output shows that the build succeeded. If it fails, the quickest solution is to consult the RHive development team.

Building RHive Package
After building the sub-modules, RHive must be built as an R package before it can be installed. First check that the current path is the directory where the jar was built, then build the RHive package as shown below.
    # pwd
    /root/RHive_package/RHive
    # ls -l
    total 76
  • 5.
    -rw-r--r-- 1 root root  1413 Dec 11 16:41 ChangeLog
    -rw-r--r-- 1 root root  2068 Dec 11 16:41 INSTALL
    -rw-r--r-- 1 root root  2444 Dec 11 16:41 README
    drwxr-xr-x 5 root root  4096 Dec 11 16:41 RHive
    drwxr-xr-x 4 root root  4096 Dec 11 16:42 build
    -rw-r--r-- 1 root root  2999 Dec 11 16:41 build.xml
    -rw-r--r-- 1 root root 35244 Dec 11 16:41 rhive-logo.jpg
    -rw-r--r-- 1 root root 12732 Dec 11 16:41 rhive-logo.png
    # R CMD build ./RHive
If the build was successful, you will see a result message like the following.
    * checking for file ‘./RHive/DESCRIPTION’ ... OK
    * preparing ‘RHive’:
    * checking DESCRIPTION meta-information ... OK
    * checking for LF line-endings in source and make files
    * checking for empty or unneeded directories
    * building ‘RHive_0.0-4.tar.gz’
You can see that RHive_0.0-4.tar.gz has been created. This package is installable by R. The name of the created file will differ according to the version of the RHive package being built.

Install RHive Package
Now we shall install the RHive package just created or downloaded. It can be installed with the following command:
    R CMD INSTALL ./RHive_0.0-4.tar.gz
No errors means the installation succeeded. However, you might encounter errors related to the rJava and Rserve packages.
  • 6.
    * installing to library ‘/usr/lib64/R/library’
    ERROR: dependencies ‘rJava’, ‘Rserve’ are not available for package ‘RHive’
    * removing ‘/usr/lib64/R/library/RHive’
This error message indicates that the R packages rJava and Rserve are not installed in the R being used. RHive depends on the rJava and Rserve packages, so they must already be installed. Installing RHive from CRAN automatically installs the packages it depends on, but when installing from source this does not happen automatically, so install them manually.
    # Open R
    install.packages("rJava")
    install.packages("Rserve")
    # and install RHive
    install.packages("./RHive_0.0-4.tar.gz", repos=NULL)
No errors indicates a successful installation.

Directly downloading RHive package from project site
The URL where you can download a built package is as follows:
    https://github.com/nexr/RHive/downloads
Download a suitable version from the above site. This tutorial installs the version listed below:
    RHive_0.0-4-2011121201.tar.gz - RHive_0.0-4 SNAPSHOP (build2011121201) - R package
You can also download this file via a web browser and install it on a laptop or desktop, or send the file to a remote server via FTP and install it there. This tutorial shows how to install it on a remote Linux server.
First, use a terminal to connect to the remote Linux server where RHive will be installed. In this tutorial it is server 10.1.1.1, located on the internal network.
    ssh root@10.1.1.1
  • 7.
    mkdir RHive_installable
    cd RHive_installable
With a temporary directory created as above, use wget to download the file. The download link can be obtained from the aforementioned download site. Remember to pass the --no-check-certificate option to wget.
    wget --no-check-certificate https://github.com/downloads/nexr/RHive/RHive_0.0-4-2011121401.tar.gz
Once the download is complete, your current directory will contain the following file:
    # ls -al
    total 3240
    drwxr-xr-x 2 root root    4096 Dec 11 18:00 .
    drwxr-x--- 6 root root    4096 Dec 11 18:02 ..
    -rw-r--r-- 1 root root 3302766 Dec 12  2011 RHive_0.0-4-2011121401.tar.gz
This file is a package that the RHive development team built for uploading to CRAN, so it does not require a separate build procedure. It can be installed directly using R. Use the name of the file you actually downloaded:
    R CMD INSTALL ./RHive_0.0-4-2011121401.tar.gz
If you encounter an error message about the rJava and Rserve dependencies like the one mentioned before, install those packages inside R first and then reinstall the downloaded file. As mentioned before, they can be installed as follows:
    # Open R
    install.packages("rJava")
    install.packages("Rserve")
No errors means the installation is complete.
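Because the file name has to match between the wget and R CMD INSTALL steps, one way to avoid a mismatch is to derive it from the URL once. The sketch below is illustrative only: the function names pkg_file_from_url and download_and_install are hypothetical, and only wget --no-check-certificate, basename, and R CMD INSTALL are taken from the steps described in this tutorial.

```shell
#!/bin/sh
# Derive the package file name from its download URL (hypothetical helper).
pkg_file_from_url() {
    basename "$1"
}

# Download a built RHive package and install it with R.
# Defined here but not called; run it only on the server where RHive belongs.
download_and_install() {
    url=$1
    file=$(pkg_file_from_url "$url")
    wget --no-check-certificate "$url" -O "$file" || return 1
    R CMD INSTALL "./$file"
}

# Usage (URL taken from this tutorial):
# download_and_install https://github.com/downloads/nexr/RHive/RHive_0.0-4-2011121401.tar.gz
```

If the install step still reports missing rJava or Rserve, install those inside R as described above and call the function again.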
  • 8. Downloading source code without using Git client
You can download the source code from Github even without the git command or a Git client. Github supports downloading the compressed source code via a web browser. You can download the newest source code as shown below.
    wget --no-check-certificate https://github.com/nexr/RHive/zipball/master -O RHive.zip
    unzip RHive.zip
    cd nexr-RHive-df7341c/
Compiling the sources and building the package is the same as if you had downloaded the RHive source with a Git client.

Installing R and Rserve
In order to use RHive, all job nodes of Hadoop must have Rserve installed. RHive controls Rserve by referencing the slaves file in RHive's conf.
Installing Rserve is not hard. Connect to the Hadoop name node and to each job node and install R on each; install Rserve on each job node as well. The name node does not need Rserve installed.
    ssh root@10.1.1.1
If R is not already installed, install that first. On CentOS 5, you can use the following method to install the newest version of R. Remember to install R-devel, because it is necessary for installing Rserve.
    rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
    yum install R
    yum install R-devel
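Since the same packages are needed on several machines, the installation can be repeated over SSH instead of logging in to each node by hand. This is a minimal sketch under stated assumptions: it uses the job node addresses from this tutorial, assumes root SSH access to each node and that the EPEL repository is already enabled there, and the names run_on_nodes and DRY_RUN are hypothetical. With DRY_RUN=1 (the default in this sketch) the commands are only printed.

```shell
#!/bin/sh
# Job nodes used in this tutorial.
NODES="10.1.1.2 10.1.1.3 10.1.1.4"

# Run one command on every node over SSH (hypothetical helper).
# With DRY_RUN=1 (the default here) the ssh commands are printed, not executed.
run_on_nodes() {
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$node $1"
        else
            ssh "root@$node" "$1" || return 1
        fi
    done
}

# -y answers yum's confirmation prompt so the loop does not wait for input.
run_on_nodes "yum -y install R R-devel"
```

Setting DRY_RUN=0 in the environment would execute the commands for real; review the printed commands first.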
  • 9. If the required packages are installed, install Rserve via the following command.
    # Open R
    install.packages("Rserve")
If the installed R does not have a file named libR.so, the following error occurs when attempting to install Rserve.
    * installing *source* package ‘Rserve’ ...
    ** package ‘Rserve’ successfully unpacked and MD5 sums checked
    checking whether to compile the server... yes
    configure: error: R was configured without --enable-R-shlib or --enable-R-static-lib

    *** Rserve requires R (shared or static) library.                        ***
    *** Please install R library or compile R with either --enable-R-shlib  ***
    *** or --enable-R-static-lib support                                    ***

    Alternatively use --without-server if you wish to build only Rserve client.

    ERROR: configuration failed for package ‘Rserve’
    * removing ‘/usr/lib64/R/library/Rserve’
To solve this problem, R must be compiled with --enable-R-shlib or --enable-R-static-lib. However, most Linux distributions ship R compiled with these options, so this error is probably caused by something else.
First, use the command below to find the path where R's library files are.
    # R CMD config --ldflags
  • 10.
    -L/usr/lib64/R/lib -lR
You might encounter the following error while executing the above command.
    [root@i-10-24-1-34 Rserve]# R CMD config --ldflags
    /usr/lib64/R/bin/config: line 142: make: command not found
    /usr/lib64/R/bin/config: line 143: make: command not found
This means the 'make' utility is missing. Rserve needs it for installation, so 'make' has to be installed. Install it as shown below, then run "R CMD config --ldflags" again and check that the library path is displayed.
    yum install make
Now check that libR.so is indeed in the printed path.
    # ls -al /usr/lib64/R/lib
    total 4560
    drwxr-xr-x 2 root root    4096 Dec 13 03:00 .
    drwxr-xr-x 7 root root    4096 Dec 13 03:35 ..
    -rwxr-xr-x 1 root root 2996480 Nov  8 14:19 libR.so
    -rwxr-xr-x 1 root root  177176 Nov  8 14:19 libRblas.so
    -rwxr-xr-x 1 root root 1470264 Nov  8 14:19 libRlapack.so
libR.so is confirmed to be there. Now that all preparations for installing Rserve are complete, retry and finish installing Rserve.
    # Open R
    install.packages("Rserve")

Running Rserve
  • 11. Once Rserve installation is complete, run Rserve as a daemon. Before running Rserve, its configuration must be adjusted to enable remote connections. Adjust the configuration as follows:
    Connect to the server where Rserve will be run.
    On every Hadoop job node, open the file "/etc/Rserv.conf" in a text editor.
    If there is no such file, create it.
    Insert the line
        remote enable
    into the file, then save and exit.
    Rserv.conf can configure many other options. Details on configuration can be found at the URL below.
    http://www.rforge.net/Rserve/doc.html
Then leave R and run Rserve from the command prompt.
    R CMD Rserve
Once Rserve is running as a daemon, the following command can be used to check whether it is listening on a port.
    # netstat -nltp
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address           Foreign Address  State   PID/Program name
    tcp        0      0 0.0.0.0:6311            0.0.0.0:*        LISTEN  25516/Rserve
    tcp        0      0 :::59873                :::*             LISTEN  13023/java
    tcp        0      0 :::50020                :::*             LISTEN  13023/java
    tcp        0      0 ::ffff:127.0.0.1:46056  :::*             LISTEN  13112/java
    tcp        0      0 :::50060                :::*
  • 12.
                                                                 LISTEN  13112/java
    tcp        0      0 :::22                   :::*             LISTEN  1109/sshd
    tcp        0      0 :::50010                :::*             LISTEN  13023/java
    tcp        0      0 :::50075                :::*             LISTEN  13023/java
You can see the Rserve daemon listening on port 6311. Port 6311 is the default port that Rserve uses. It can be changed in the configuration, but do not change it unless there is a special reason to.
If the port is not open because of a firewall, it must be opened so that the internal servers can connect to each other. To check this, first see whether the server where RHive will run can connect to each Rserve node.
    # connect to the RHive server
    ssh root@10.1.1.1
    # telnet 10.1.1.2 6311
    Trying 10.1.1.2...
    Connected to 10.1.1.2.
    Escape character is ^].
    Rsrv0103QAP1

    --------------
    # telnet 10.1.1.3 6311
    Trying 10.1.1.3...
    Connected to 10.1.1.3.
    Escape character is ^].
    Rsrv0103QAP1

    --------------
    # telnet 10.1.1.4 6311
    Trying 10.1.1.4...
  • 13.
    Connected to 10.1.1.4.
    Escape character is ^].
    Rsrv0103QAP1

    --------------

Configuring Hadoop and Hive for RHive
In order to run RHive, the laptop, desktop, or server with RHive installed must also have Hadoop and Hive installed, and its Hadoop configuration must match that of the Hadoop cluster.
If the server planned for RHive installation does not have Hadoop or Hive installed, install the same versions as those installed on the Hadoop cluster. Then copy the cluster's Hadoop configuration so that the two match.
After that, configure the environment variables.
    export HADOOP_HOME=/service/hadoop-0.20.203.0
    export HIVE_HOME=/service/hive-0.7.1
Above, /service/hadoop-0.20.203.0 is the path where Hadoop is installed and /service/hive-0.7.1 is where Hive is installed. These lines must be put into /etc/profile.
If RHive is installed on the same server as the Hadoop namenode, no separate configuration is required. But if it is on a different server or a laptop, edit the contents of /service/hadoop-0.20.203.0/conf to be the same as on the Hadoop cluster you plan to use.

Running the RHive Example
As stated before, the environment variables must be configured before running R in order to use RHive. More precisely, suitable environment variables must be set before RHive is initialized.
If you forgot to set HIVE_HOME and HADOOP_HOME in the laptop's or server's environment, or wish to switch between different versions, they can be set after starting R, as shown below.
    # Open R
  • 14.
    Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
    Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
    library(RHive)
You can skip this if you edited /etc/profile or a similar file. This method has the disadvantage of having to be repeated every time R is run.

Checking for and Setting RHive Environment Variables
You can check whether the environment variables are properly set by running R and calling the rhive.env() function. If either the Hive home directory or the Hadoop home directory does not show up properly, recheck whether they have been set correctly.
    rhive.env()
    Hive Home Directory : /mnt/srv/hive-0.8.1
    Hadoop Home Directory : /mnt/srv/hadoop-0.20.203.0
    Default RServe List
    node1 node2 node3
    Disconnected HiveServer and HDFS

RHive connect
After loading RHive and before doing any work, you must call the rhive.connect() function to connect to the Hive server. If the connection is not made, the RHive functions will not work.
    rhive.connect()
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/service/hive-0.7.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/service/hadoop-0.20.203.0/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

Checking the contents of HDFS files
  • 15. You may see many complex messages printed when making the connection. These may be ignored.
Now you can use the rhive.hdfs.* functions to handle Hadoop's HDFS; they correspond to the "hadoop fs" commands. You can use the rhive.hdfs.ls() function to list the files in HDFS.
    rhive.hdfs.ls("/")
      permission owner      group   length      modify-time        file
    1  rwxr-xr-x  root supergroup        0 2011-12-07 14:27    /airline
    2  rwxr-xr-x  root supergroup        0 2011-12-07 13:16 /benchmarks
    3  rw-r--r--  root supergroup 11186419 2011-12-06 03:59   /messages
    4  rwxr-xr-x  root supergroup        0 2011-12-07 22:05        /mnt
    5  rwxr-xr-x  root supergroup        0 2011-12-07 22:15      /rhive
    6  rwxr-xr-x  root supergroup        0 2011-12-07 20:19        /tmp

Checking table list of Hive
You can also check the list of tables registered in Hive by using the rhive.list.tables() function. If you have not made any tables, you will see the following result.
    rhive.list.tables()
    [1] tab_name
    <0 rows> (or 0-length row.names)

Creating Hive table
You can use a simple command to save an R data frame to a Hive table.
    tablename <- rhive.write.table(USArrests)
  • 16. USArrests is a data set provided with R. RHive converts the data frame's object name into the Hive table name and stores the data as a Hive table.

Checking Table descriptions
You can use the rhive.desc.table() function to see the description of a Hive table.
    rhive.desc.table("USArrests")
      col_name data_type comment
    1  rowname    string
    2   murder    double
    3  assault       int
    4 urbanpop       int
    5     rape    double
Note that Hive table names are not case-sensitive.

Creating Hive Tables 2
It is also possible to take other data, such as data sets in the MASS package or data loaded from CSV files, and store it in Hive.
    library(MASS)
    tablename <- rhive.write.table(Aids2)
    rhive.desc.table(tablename)
    rhive.load.table(tablename)
This method is useful for uploading relatively small data to Hive. When saving several GB of data to Hive, the recommended method is to save the files to HDFS and configure them as an external table. RHive currently does not handle this automatically for users; such a feature is still on the drawing board.

Executing a simple SQL syntax
  • 17. You can use the rhive.query() function to send SQL to Hive. Let's try running a simple SQL statement that counts all records in the Hive table usarrests.
    rhive.query("SELECT COUNT(*) FROM usarrests")
      X_c0
    1   50
The SQL statement above was executed as a Map/Reduce job by Hadoop and Hive. If you see a result like the one above, the RHive, Hadoop, and Hive configurations are correct, and Hadoop computed and output the total count of the input data.
One thing to watch out for: this example used only a very small data set, so it cannot be said to have made full use of the potential of Hive and Hadoop, which are distributed processing platforms. Small data such as "usarrests", which fits in a single server's memory, can be processed within R without RHive. This step just checks that the configuration is correct and that the basic functions are in working order. If you wish to use RHive with Hadoop and Hive, it is appropriate to use data ranging in size from at least several GiB to tens of GiB.

FAQ and Contact Info
Consult the following reference materials for explanations and details of each RHive function.
If you find a bug or have difficulty using RHive, file a bug report on the RHive site or ask the RHive development team via e-mail. The RHive development team is always open and responsive to questions, requests, and bug reports.
e-mail: rhive@nexr.com