RHive tutorial - HDFS functions

Hive stores and processes its data on Hadoop's distributed file system (HDFS). To use Hive and RHive effectively, you must therefore be able to put, get, and remove big data on HDFS.

RHive provides functions corresponding to what the "hadoop fs" command supports. With these functions, a user can handle HDFS from within the R environment, without using the Hadoop CLI (command line interface) or the Hadoop HDFS library. If you are more comfortable with the Hadoop CLI or library, it is fine to keep using them. But if you work from RStudio Server rather than a terminal, the RHive HDFS functions should prove an easy-to-use way for R users to handle HDFS.

Before Running the Examples

The rhive.hdfs.* functions work only after RHive has been installed and library(RHive) and rhive.connect() have been executed successfully. Do not forget to run the following before trying the examples.

# Open R
library(RHive)
rhive.connect()

rhive.hdfs.connect

In order to use the RHive HDFS functions, a connection to HDFS must be established. If the Hadoop configuration for HDFS is set properly, rhive.connect() performs this step automatically, so there is normally no need to call it separately. If you need to connect to a different HDFS, you can do it like this:

rhive.hdfs.connect("hdfs://10.1.1.1:9000")
[1] "Java-Object{DFS[DFSClient[clientName=DFSClient_630489789, ugi=root]]}"
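If the host or port is wrong, the connection attempt fails, so in scripts it can help to wrap the setup in error handling. A minimal sketch; the namenode URI hdfs://10.1.1.1:9000 is a placeholder for your own cluster:

# Connect to Hive and HDFS, failing with a readable message.
library(RHive)
tryCatch({
  rhive.connect()
  rhive.hdfs.connect("hdfs://10.1.1.1:9000")  # placeholder namenode URI
}, error = function(e) {
  stop("Could not connect to HDFS: ", conditionMessage(e))
})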
The connection will fail if you do not supply the exact hostname and port number of the HDFS service. Ask your system manager if you do not have this information.

rhive.hdfs.ls

This does the same thing as "hadoop fs -ls" and is used like this:

rhive.hdfs.ls("/")
  permission owner      group   length      modify-time        file
1  rwxr-xr-x  root supergroup        0 2011-12-07 14:27    /airline
2  rwxr-xr-x  root supergroup        0 2011-12-07 13:16 /benchmarks
3  rw-r--r--  root supergroup 11186419 2011-12-06 03:59   /messages
4  rwxr-xr-x  root supergroup        0 2011-12-07 22:05        /mnt
5  rwxr-xr-x  root supergroup        0 2011-12-13 20:24      /rhive
6  rwxr-xr-x  root supergroup        0 2011-12-07 20:19        /tmp
7  rwxr-xr-x  root supergroup        0 2011-12-14 01:14       /user

This is the same as the following Hadoop CLI command:

hadoop fs -ls /

rhive.hdfs.get

The rhive.hdfs.get function brings data from HDFS to the local file system, in the same way as "hadoop fs -get". The next example fetches the messages data from HDFS, saves it to /tmp/messages on the local system, and then checks the number of records.

rhive.hdfs.get("/messages", "/tmp/messages")
[1] TRUE
system("wc -l /tmp/messages")
145889 /tmp/messages
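Once the file is on local disk you can also inspect it with ordinary R functions instead of shelling out to wc. A minimal sketch, assuming /tmp/messages exists locally after the rhive.hdfs.get call above:

# Count the records with base R instead of system("wc -l ...").
msg <- readLines("/tmp/messages")
length(msg)   # 145889, matching the wc -l output above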
rhive.hdfs.put

The rhive.hdfs.put function uploads local data to HDFS. It works like "hadoop fs -put" and is the opposite of rhive.hdfs.get. The following example uploads /tmp/messages on the local system to /messages_new in HDFS.

rhive.hdfs.put("/tmp/messages", "/messages_new")
rhive.hdfs.ls("/")
  permission owner      group   length      modify-time          file
1  rwxr-xr-x  root supergroup        0 2011-12-07 14:27      /airline
2  rwxr-xr-x  root supergroup        0 2011-12-07 13:16   /benchmarks
3  rw-r--r--  root supergroup 11186419 2011-12-06 03:59     /messages
4  rw-r--r--  root supergroup 11186419 2011-12-14 02:02 /messages_new
5  rwxr-xr-x  root supergroup        0 2011-12-07 22:05          /mnt
6  rwxr-xr-x  root supergroup        0 2011-12-13 20:24        /rhive
7  rwxr-xr-x  root supergroup        0 2011-12-14 01:14         /user

You can see that a new file, /messages_new, now appears in HDFS.
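rhive.hdfs.put is also a convenient way to move data created inside R onto the cluster: write it to a local file first, then upload that file. A minimal sketch with hypothetical file names, assuming you have write access to / in HDFS:

# Write an R data.frame to a local CSV file, then upload it to HDFS.
df <- data.frame(id = 1:3, value = c("a", "b", "c"))
write.csv(df, "/tmp/sample.csv", row.names = FALSE)   # hypothetical local path
rhive.hdfs.put("/tmp/sample.csv", "/sample.csv")      # hypothetical HDFS path
rhive.hdfs.ls("/")   # /sample.csv should now appear in the listing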
rhive.hdfs.rm

This does the same thing as "hadoop fs -rm": it deletes files in HDFS.

rhive.hdfs.rm("/messages_new")
rhive.hdfs.ls("/")
  permission owner      group   length      modify-time        file
1  rwxr-xr-x  root supergroup        0 2011-12-07 14:27    /airline
2  rwxr-xr-x  root supergroup        0 2011-12-07 13:16 /benchmarks
3  rw-r--r--  root supergroup 11186419 2011-12-06 03:59   /messages
4  rwxr-xr-x  root supergroup        0 2011-12-07 22:05        /mnt
5  rwxr-xr-x  root supergroup        0 2011-12-13 20:24      /rhive
6  rwxr-xr-x  root supergroup        0 2011-12-14 01:14       /user

You can see that /messages_new has been deleted from HDFS.

rhive.hdfs.rename

This does the same thing as "hadoop fs -mv": it renames files in HDFS or moves them to another directory.

rhive.hdfs.rename("/messages", "/messages_renamed")
[1] TRUE
rhive.hdfs.ls("/")
  permission owner      group   length      modify-time              file
1  rwxr-xr-x  root supergroup        0 2011-12-07 14:27          /airline
2  rwxr-xr-x  root supergroup        0 2011-12-07 13:16       /benchmarks
3  rw-r--r--  root supergroup 11186419 2011-12-06 03:59 /messages_renamed
4  rwxr-xr-x  root supergroup        0 2011-12-07 22:05              /mnt
5  rwxr-xr-x  root supergroup        0 2011-12-13 20:24            /rhive
6  rwxr-xr-x  root supergroup        0 2011-12-14 01:14             /user
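Because the destination can be a path in another directory, rhive.hdfs.rename also serves as a move operation, just like "hadoop fs -mv". A minimal sketch with hypothetical paths:

# Move a file into another directory instead of renaming it in place.
# Both paths are hypothetical; /archive is assumed to already exist.
rhive.hdfs.rename("/data/report.txt", "/archive/report.txt")
rhive.hdfs.ls("/archive")   # the file should now be listed here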
rhive.hdfs.exists

This checks whether a file exists in HDFS. There is no "hadoop fs" command that serves as a direct counterpart.

rhive.hdfs.exists("/messages_renamed")
[1] TRUE
rhive.hdfs.exists("/foobar")
[1] FALSE

rhive.hdfs.mkdirs

This does the same thing as "hadoop fs -mkdir". It creates directories in HDFS, including any intermediate subdirectories.

rhive.hdfs.mkdirs("/newdir/newsubdir")
[1] TRUE
rhive.hdfs.ls("/")
  permission owner      group   length      modify-time              file
1  rwxr-xr-x  root supergroup        0 2011-12-07 14:27          /airline
2  rwxr-xr-x  root supergroup        0 2011-12-07 13:16       /benchmarks
3  rw-r--r--  root supergroup 11186419 2011-12-06 03:59 /messages_renamed
4  rwxr-xr-x  root supergroup        0 2011-12-07 22:05              /mnt
5  rwxr-xr-x  root supergroup        0 2011-12-14 02:13           /newdir
6  rwxr-xr-x  root supergroup        0 2011-12-13 20:24            /rhive
7  rwxr-xr-x  root supergroup        0 2011-12-14 01:14             /user
rhive.hdfs.ls("/newdir")
  permission owner      group length      modify-time              file
1  rwxr-xr-x  root supergroup      0 2011-12-14 02:13 /newdir/newsubdir
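Combining rhive.hdfs.exists with the functions above makes it easy to write setup code that is safe to run repeatedly. A minimal sketch with a hypothetical path:

# Create a directory only if it does not exist yet.
target <- "/newdir/newsubdir"   # hypothetical target path
if (!rhive.hdfs.exists(target)) {
  rhive.hdfs.mkdirs(target)
}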
rhive.hdfs.close

This closes the connection when you have finished working with HDFS and no longer need it.

rhive.hdfs.close()
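In scripts it is a good habit to close the connection even when an error occurs part-way through. A minimal sketch using on.exit inside a hypothetical helper function:

# Run some HDFS work and guarantee the connection is closed afterwards.
with_hdfs <- function(expr) {
  on.exit(rhive.hdfs.close())   # always runs, even if expr throws an error
  expr
}
with_hdfs(rhive.hdfs.ls("/"))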
