RHive tutorial supplement 3 - RStudio-server setup for RHive
This tutorial explains how to set up RStudio so that RHive is more convenient to use. Detailed documentation on setting up RStudio is available at http://rstudio.org. What follows is a how-to for RHive users on installing and using RStudio.

RHive is an R package that uses Hadoop and Hive to process massive data. Much of the R code written with RHive finishes quickly, but code that processes extremely large data can take a long time to produce results. Depending on the size of the data and the complexity of the calculations, it can take anywhere from a few minutes to a couple of weeks.

The problem is that the R session must be kept alive until the task started by the user has completed. If the user ran the code from a laptop, the laptop must stay on and keep its session until the code finishes. Even on a desktop, it is difficult to avoid a reboot or a similar interruption for as long as the task runs. Many other inconveniences stem from having to keep the session.

This problem is not specific to RHive. It also occurs when using Hadoop or Hive alone, and RHive is no exception.

One workaround is to connect by terminal to a machine that has a Hadoop client and run the code in the background. But this is not very convenient for R users, and it makes it hard to benefit from an IDE or from the usual working environment in R. Users who are not familiar with the terminal also have to learn it.

RStudio is the best solution for this. RStudio comes in desktop and server versions, and the desktop version is a very good IDE for R. RStudio-server is accessed through a web browser, lets many people share common resources, and has the advantage of keeping the user's session. If the Hadoop, Hive, and RHive installation sits in a restricted network that has to be reached through a firewall, the RStudio port can be opened for that purpose. Using RStudio-server makes RHive more convenient to use. Lastly, since RStudio gives access to the server's R environment, it lets multiple people share RHive, Hadoop, and Hive.

This tutorial demonstrates how to install, connect to, and use RStudio-server.

Installing RStudio-server

RStudio can be downloaded from its official site.

http://rstudio.org/

The official site, rstudio.org, provides documents detailing how to install and use RStudio. The page below is an installation guide, so it is equally fine to follow that instead of this tutorial.

http://rstudio.org/download/server

This tutorial explains how to install RStudio on CentOS 5. Most of this installation guide is taken from the aforementioned site, with some changes.

You must install R before installing RStudio-server. If you have read the previous RHive tutorials and installed RHive accordingly, R should already be installed, but the steps are repeated here.

To install the newest version of R, do the following.

$ sudo rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm

Now install R.

$ sudo yum install R R-devel
When installing RHive, remember to install not only R but R-devel as well.

Before installing RStudio-server, you must know whether your server has a 32-bit or a 64-bit architecture. Recent servers are most likely 64-bit, and you can confirm this with the uname command.

$ uname -m
x86_64

The output above confirms that the server has a 64-bit architecture. Now download the RStudio version appropriate for your architecture.

Installing for 32-bit:

$ wget http://download2.rstudio.org/rstudio-server-0.94.110-i686.rpm
$ sudo rpm -Uvh rstudio-server-0.94.110-i686.rpm

Installing for 64-bit:

$ wget http://download2.rstudio.org/rstudio-server-0.94.110-x86_64.rpm
$ sudo rpm -Uvh rstudio-server-0.94.110-x86_64.rpm

Making a User Account

To connect to RStudio-server, a user account must exist on the server where RStudio-server is installed. RStudio-server does not allow connecting with the root account, so accounts for normal users are needed. Connect to the server, create accounts for the would-be users of RStudio-server, and set their passwords.

ssh email@example.com
adduser user1
passwd user1
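When several people will share the server, the two account commands above can be wrapped in a small loop. The following is only a minimal sketch and not part of the original setup: the function name create_rstudio_users and the DRY_RUN switch are assumptions made for illustration, and the account names are placeholders. With DRY_RUN left at its default, the script only prints the commands it would run. To actually create the accounts it has to be run as root on the server with DRY_RUN=0, and passwd will still prompt for each password interactively.

```shell
#!/bin/sh
# Hypothetical helper: create one login account per name given as argument.
# DRY_RUN defaults to 1, so nothing on the system changes unless it is set to 0.
create_rstudio_users() {
    for name in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Only show what would be executed.
            echo "adduser $name"
            echo "passwd $name"
        else
            # Create the account, then set its password (interactive prompt).
            adduser "$name" && passwd "$name"
        fi
    done
}

# Placeholder account names; replace them with the real users of RStudio-server.
create_rstudio_users user1 user2
```

Keeping the dry run as the default makes it possible to review the list of accounts before anything is created.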
The user1 above is an arbitrary account name, so choose one to your liking.

Starting RStudio-server

RStudio-server must be run as a background process (daemon mode). Connect to the server and start it as shown below.

ssh firstname.lastname@example.org
/etc/init.d/rstudio-server start

As shown above, starting it is simple.

Connecting to RStudio-server

You can use a web browser to connect to RStudio-server. Open your web browser and go to the RStudio-server URL.

http://10.1.1.1:8787

RStudio listens on port 8787 by default. You can change this to another port as needed. Now you can connect to RStudio-server and perform massive data analysis with R and RHive.

Tips for using RHive in RStudio

While working in RStudio-server, RHive may fail to load because of improper environment variables. In this case, you can solve the problem by adding code that assigns values to the environment variables.

Sys.setenv(HADOOP_HOME="/mnt/srv/hadoop-0.20.203.0")
Sys.setenv(HIVE_HOME="/mnt/srv/hive-0.7.1")
Sys.setenv(RHIVE_DATA="/mnt/srv/rhive_data")
library(RHive)

HADOOP_HOME and HIVE_HOME above must be set to the home directories of Hadoop and Hive on the server where RStudio is installed. RHIVE_DATA refers to a temporary directory that RHive will use; it is created on each Hadoop node.
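A wrong path in HADOOP_HOME or HIVE_HOME is an easy mistake to make, so it can help to confirm on the server that the directories exist before putting them into Sys.setenv(). The following is a minimal sketch under that assumption: check_dirs is a hypothetical helper name, and the two paths are simply the example values used in this tutorial. RHIVE_DATA is left out because it is created on the Hadoop nodes.

```shell
#!/bin/sh
# Hypothetical helper: report whether each given path is an existing directory.
# Prints one "ok" or "missing" line per path and returns non-zero if any is missing.
check_dirs() {
    rc=0
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "ok      $dir"
        else
            echo "missing $dir"
            rc=1
        fi
    done
    return $rc
}

# Example paths from this tutorial; adjust them to your own installation.
check_dirs /mnt/srv/hadoop-0.20.203.0 /mnt/srv/hive-0.7.1 \
    || echo "Fix the paths before using them in Sys.setenv()"
```

Run it in a terminal on the machine where RStudio-server is installed, since those are the paths the R session will see.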
The environment variables should be set before loading RHive with the library function. If you have already loaded RHive without setting the environment variables, you can set them afterwards and then call the rhive.init() function to initialize RHive.

library(RHive)
Sys.setenv(HADOOP_HOME="/mnt/srv/hadoop-0.20.203.0")
Sys.setenv(HIVE_HOME="/mnt/srv/hive-0.7.1")
Sys.setenv(RHIVE_DATA="/mnt/srv/rhive_data")
rhive.init()

You have now finished setting up an environment in which you can write R code in RStudio and use RHive to work with Hive and Hadoop.
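As an alternative to typing the Sys.setenv() calls in every session, R reads NAME=value lines from a .Renviron file in the user's home directory at startup. The following is a minimal sketch of how those lines could be written from the shell. It is an assumption that this suits your setup: add_renviron_entry is a hypothetical helper name, the values are the example paths from this tutorial, and by default the script writes to a scratch file in the temporary directory, not to the real ~/.Renviron.

```shell
#!/bin/sh
# Hypothetical helper: append NAME=value to an Renviron-style file,
# unless a line defining NAME is already present.
# $1 = file, $2 = variable name, $3 = value
add_renviron_entry() {
    file=$1
    name=$2
    value=$3
    touch "$file"
    if ! grep -q "^${name}=" "$file"; then
        printf '%s=%s\n' "$name" "$value" >> "$file"
    fi
}

# Writes to a scratch file by default; set RENVIRON="$HOME/.Renviron" to apply it.
RENVIRON=${RENVIRON:-${TMPDIR:-/tmp}/Renviron.example}
add_renviron_entry "$RENVIRON" HADOOP_HOME /mnt/srv/hadoop-0.20.203.0
add_renviron_entry "$RENVIRON" HIVE_HOME /mnt/srv/hive-0.7.1
add_renviron_entry "$RENVIRON" RHIVE_DATA /mnt/srv/rhive_data
```

Because an existing entry is never overwritten, the script can be run again safely. A new R session is needed before the values become visible through Sys.getenv().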