RHive tutorial - basic functions
This tutorial explains how to load the RHive library and how to use RHive's basic functions.
Loading RHive
Load RHive the same way you load any other R package:
library(RHive)
But before loading RHive, do not forget to configure the HADOOP_HOME and HIVE_HOME environment variables. If they are not set, you can set them temporarily before loading the library, as follows. HADOOP_HOME is the home directory where Hadoop is installed, and HIVE_HOME is the home directory where Hive is installed. Consult the tutorial "RHive installation and setting" for details on the environment variables.
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
library(RHive)
rhive.init
rhive.init is a procedure that performs internal initialization. If the environment variables were configured correctly before loading RHive, it runs automatically. But if they were not configured when RHive was loaded via library(RHive), the following error message results:
rhive.connect()
Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl, :
  No running JVM detected. Maybe .jinit() would help.
Error in .jfindClass(as.character(class)) :
  No running JVM detected. Maybe .jinit() would help.
In this case, set HIVE_HOME and HADOOP_HOME as shown below, or exit R, configure the environment variables, and restart R.
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
rhive.init()
Or, close R, then:

export HIVE_HOME="/service/hive-0.7.1"
export HADOOP_HOME="/service/hadoop-0.20.203.0"

and open R again.
rhive.connect
All RHive functions work only after a connection to the Hive server has been made. If you have not established a connection with the rhive.connect function before using other RHive functions, they will fail with the following error:
Error in .jcast(hiveclient[[1]], new.class =
  "org/apache/hadoop/hive/service/HiveClient", :
  cannot cast anything but Java objects
Establishing a connection with the Hive server to use RHive is simple:
rhive.connect()
The rhive.connect function can additionally take a few more arguments.
rhiveConnection <- rhive.connect("10.1.1.1")
If the Hive server is installed on a different server from the one where RHive is installed and you have to connect remotely, a connection can be made by passing arguments to the rhive.connect function.
If you have multiple Hadoop and Hive clusters configured for RHive and want to switch between the Hives, then, just as with a DB client for a database such as MySQL, you make the connections and pass one to the functions via an argument to explicitly select a connection.
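For example, two connections might be kept and selected per call, roughly as sketched below. The hiveclient argument name is an assumption here; check the documentation of your RHive version for the exact parameter used to pass a connection.

```r
# Sketch only: the `hiveclient` argument name is an assumption,
# and the IP addresses are placeholders for two Hive clusters.
connA <- rhive.connect("10.1.1.1")   # first Hive cluster
connB <- rhive.connect("10.1.1.2")   # second Hive cluster

rhive.query("SHOW TABLES", hiveclient = connA)  # runs against cluster A
rhive.query("SHOW TABLES", hiveclient = connB)  # runs against cluster B

rhive.close(connB)
rhive.close(connA)
```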
rhive.query
If you have used Hive, you probably know that Hive supports SQL syntax for handling data on Map/Reduce and HDFS. rhive.query sends SQL to Hive and receives the results. Users who know SQL will find examples like the following familiar.
rhive.query("SELECT * FROM usarrests")
Running the example above prints the contents of the table named 'usarrests' to the screen. Instead of merely printing the returned result to the screen, you can also assign it to a data.frame object.
resultDF <- rhive.query("SELECT * FROM usarrests")
One thing to beware of: if the data returned by rhive.query is bigger than the memory of the machine running R, exhausting the available memory will cause an error. So you must not fetch data of such size into an object. It is better to first create a temporary table and insert the results of the SQL into that temporary table. You can do it as follows.
rhive.query("
  CREATE TABLE new_usarrests (
    rowname  string,
    murder   double,
    assault  int,
    urbanpop int,
    rape     double
  )")
rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")
Consult the Hive documentation for a detailed account of how to use Hive SQL.
rhive.close
If you have finished using Hive and no longer wish to use RHive functions, you can use the rhive.close function to terminate the connection.
rhive.close()
Alternatively, you can assign a specific connection to close it.
conn <- rhive.connect()
rhive.close(conn)
rhive.list.tables
The rhive.list.tables function returns the list of tables in Hive.
rhive.list.tables()
       tab_name
1         aids2
2 new_usarrests
3     usarrests
This is effectively identical to this:
rhive.query("SHOW TABLES")
rhive.desc.table
The rhive.desc.table function shows the description of the chosen table.
rhive.desc.table("usarrests")
  col_name data_type comment
1  rowname    string
2   murder    double
3  assault       int
4 urbanpop       int
5     rape    double
This is effectively identical to this:
rhive.query("DESC usarrests")
rhive.load.table
The rhive.load.table function loads a Hive table's contents into an R data.frame object.
df1 <- rhive.load.table("usarrests")
df1
This is effectively identical to this:
df1 <- rhive.query("SELECT * FROM usarrests")
df1
rhive.write.table
The rhive.write.table function is the inverse of rhive.load.table, but it is more convenient. Normally, to add data to a table in Hive you must first create the table. rhive.write.table requires no such additional work: it creates a Hive table from an R data.frame and inserts all the data into it.
head(UScrime)
    M So Ed Po1 Po2  LF M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time   y
1 151  1 91  58  56 510 950  33 301 108 41 394  261 0.084602 26.2011 791
...
The rhive.write.table function encounters an error and does not work if the table to be created already exists in Hive. Hence, if you attempt to save a data.frame whose name matches a table already in Hive, you must delete that table before using rhive.write.table.
if (rhive.exist.table("uscrime")) {
  rhive.query("DROP TABLE uscrime")
}
rhive.write.table(UScrime)
RHive - alias functions
RHive's function names look similar to S3 generic naming rules, but many are actually not generic. This leaves the S3 generic names free for functions that RHive may or may not support in the future. For users who dislike the confusion wrought by functions that contain "." yet do not count as generic, there are functions with different names that serve the same roles. These alias functions are described below.
hiveConnect
This is the same as rhive.connect.
hiveQuery
This is the same as rhive.query.
hiveClose
This is the same as rhive.close.
hiveListTables
This is the same as rhive.list.tables.
hiveDescTable
This is the same as rhive.desc.table.
hiveLoadTable
This is the same as rhive.load.table.
rhive.basic.cut
rhive.basic.cut converts one numerical column of a table into one factorized column. First, the range of the numerical column is divided into intervals, and the values in the column are factorized according to which interval they fall into. rhive.basic.cut takes the following six arguments: tablename (a table name), col (a numerical column name), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. summary = TRUE returns the total counts of values falling into each interval; if FALSE, the name of a new table containing the factorized column is returned. forcedRef = TRUE forces rhive.basic.cut to return a table name, while forcedRef = FALSE returns a data frame. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE respectively.
Example for summary = FALSE
>
table_name
=
rhive.basic.cut(tablename
=
"iris",
col
=
"sepallength",
breaks
=
seq(0,
5,
0.5),
right
=
FALSE,
summary
=
FALSE,
forcedRef
=
TRUE)
> table_name
[1] "rhive_result_1330382904"
attr(,"result:size")
[1] 4296
> results = rhive.query("select * from rhive_result_1330382904")
> head(results)
  rowname sepalwidth petallength petalwidth species sepallength
1       1        3.5         1.4        0.2  setosa        NULL
2       2        3.0         1.4        0.2  setosa   [4.5,5.0)
3       3        3.2         1.3        0.2  setosa   [4.5,5.0)
4       4        3.1         1.5        0.2  setosa   [4.5,5.0)
5       5        3.6         1.4        0.2  setosa        NULL
6       6        3.9         1.7        0.4  setosa        NULL
Example for summary = TRUE
> summary = rhive.basic.cut(tablename = "iris", col = "sepallength",
    breaks = seq(0, 5, 0.5), right = FALSE, summary = TRUE, forcedRef = TRUE)
> summary
     NULL [4.0,4.5) [4.5,5.0)
      128         4        18
rhive.basic.cut2
rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. That is, the range of each numerical column is divided into intervals, and the values in each column are factorized according to which interval they fall into. rhive.basic.cut2 takes the following eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 are the numerical cut points for the two columns. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. keepCol = TRUE keeps the two numerical columns even after the conversion; otherwise the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name, while forcedRef = FALSE returns a data frame. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE respectively.
Example for right = TRUE and keepCol = FALSE
> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2
= "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right =
TRUE, keepCol = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330385833"
attr(,"result:size")
[1] 5272
> results = rhive.query("select * from rhive_result_1330385833")
> head(results)
5 5 3.6 0.2 setosa 5.0 NULL 1.4 [1.0,1.5)
...
rhive.basic.xtabs
rhive.basic.xtabs makes a contingency table from cross-classifying factors. A formula object and a table name are used as input arguments, and a contingency table in matrix format is returned based on the given formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the two column names agegp and alcgp from the table are the cross-classifying factors, and the observations for each combination of the factors are summed over another column, ncontrols.
Example for esoph data
> xtab_formula = as.formula(paste("ncontrols", "~", "agegp", "+", "alcgp", sep = ""))
> xtab_formula
ncontrols ~ agegp + alcgp
> table_result = rhive.basic.xtabs(formula = xtab_formula, tablename = "esoph")
> head(table_result)
       alcgp
agegp   0-39g/day 120+ 40-79 80-119
  25-34        61    5    45      5
  35-44        89   10    80     20
  45-54        78   15    81     39
  55-64        89   26    84     43
  65-74        71    8    53     29
  75+          27    3    12      2
rhive.basic.t.test
The rhive.basic.t.test function runs Welch's t-test on two samples. The difference between the two samples' means is tested against the alternative hypothesis that the difference is not 0; that is, a two-sided test is performed.
The following is an example of testing the mean difference between the irises' sepal lengths and petal lengths. Pay attention to how the function is called with the "sepallength" and "petallength" columns.
> rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
[1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
$statistic
       t
13.14223

$parameter
      df
211.5427

$p.value
[1] 0

$estimate
$estimate[[1]]
mean of x
 5.843333

$estimate[[2]]
mean of y
    3.758

>
Interpreting the results: the p-value is 0, revealing a difference between the means of sepal length and petal length. The resulting statistics are converted into an R list object, and a string assembled from the statistics is printed to the console.
The iris data consists of 150 observations provided with R. Running R's t.test on the same data yields a slightly different t-statistic of 13.0984. This is because the variance used by the t.test function is the sample variance, while rhive.basic.t.test uses the population variance. With little data, as in this example, the t-statistics may deviate, but the deviation dwindles as the data grows. Since rhive.basic.t.test is designed with massive data analysis in mind, it uses the population variance for speedy calculation.
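This difference can be reproduced in plain R, without a Hive connection. The sketch below computes the Welch t-statistic twice on the built-in iris data: once with the sample variance, as t.test does, and once with the population variance, as rhive.basic.t.test does.

```r
# Welch t-statistic on iris, with sample vs. population variance.
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)
m <- length(y)

# Sample variance (what t.test uses): var() divides by n - 1.
t_sample <- (mean(x) - mean(y)) / sqrt(var(x) / n + var(y) / m)

# Population variance (what rhive.basic.t.test uses): divide by n instead.
pvar <- function(v) var(v) * (length(v) - 1) / length(v)
t_pop <- (mean(x) - mean(y)) / sqrt(pvar(x) / n + pvar(y) / m)

round(t_sample, 4)  # 13.0984, matching t.test
round(t_pop, 4)     # 13.1422, matching rhive.basic.t.test
```

With 150 observations per sample the two statistics already agree to three significant digits, which is why the distinction only matters for small data.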
rhive.block.sample
The rhive.block.sample function samples a Hive table by block and returns the sample as a new table. Its percent argument is optional and sets the percentage of data to extract from the total; the default value is 0.01, meaning 0.01% of the total data is extracted. Note that percent is not the ratio of the sampled row count to the total row count, but rather the ratio of sampled blocks to total blocks: rhive.block.sample samples by block. For this reason, the entire data set may be returned when using rhive.block.sample on Hive tables of small size; this occurs when the data is smaller than the block size configured in Hive.
The seed argument specifies the random seed used when Hive executes block sampling. If the random seeds are identical, Hive's block sampling returns the same results. To guarantee random samples on every call, it is best to set the seed argument of rhive.block.sample using R's sample function.
The subset argument is optional and specifies a condition on the data to be extracted from the target Hive table. It takes a character value corresponding to the 'where' clause of Hive's HQL, so it must use syntax valid in an HQL where clause.
The return value of rhive.block.sample is a character value: the name of the Hive table containing the sampled blocks. That is, rhive.block.sample automatically creates a temporary Hive table holding the sampled blocks and returns that table's name. The following example samples 0.01% of a Hive table named listvirtualmachines, using R's sample function to pick the random seed for Hive's block sampling.
seedNumber <- sample(1:2^16, 1)
rhive.block.sample("listvirtualmachines", seed=seedNumber)
[1] "rhive_sblk_1330404552"
As per this example, a Hive table named "rhive_sblk_1330404552", holding 0.01% of the data from the Hive table listvirtualmachines, has been created.
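Combining the optional arguments, a call that restricts what gets sampled might look like the sketch below. The table weblogs and its column country are hypothetical, and the call needs a live Hive connection.

```r
# Sketch only: "weblogs" and its "country" column are hypothetical names.
seedNumber <- sample(1:2^16, 1)            # a fresh random seed, as above
sampled <- rhive.block.sample("weblogs",
                              percent = 0.1,                # 0.1% of blocks
                              seed    = seedNumber,
                              subset  = "country = 'KR'")   # HQL where clause
# The returned table name can then be queried like any Hive table.
rhive.query(paste("select count(*) from", sampled))
```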
rhive.basic.scale
The rhive.basic.scale function standardizes a numerical column to mean 0 and standard deviation 1. The first argument is the table name, and the second is the column name. The returned list contains the name of a result table to which a "scaled_<column name>" column has been added; like any other Hive table, it can be accessed and manipulated through RHive.
scaled <- rhive.basic.scale("iris", "sepallength")
attr(scaled, "scaled:center")
# [1] 5.843333
attr(scaled, "scaled:scale")
# [1] 0.8253013
> rhive.desc.table(scaled[[1]])
#             col_name data_type comment
# 1            rowname    string
# 2         sepalwidth    double
# 3        petallength    double
# 4         petalwidth    double
# 5            species    string
# 6        sepallength    double
# 7 scaled_sepallength    double
rhive.basic.by
The rhive.basic.by function runs a group-by on a specified column. The code below applies group by to the "species" column and returns the result of applying the sum function to "sepallength". In the results you will find the sum of sepallength for each species.
rhive.basic.by("iris", "species", "sum", "sepallength")
#      species   sum
# 1     setosa 250.3
# 2 versicolor 296.8
# 3  virginica 329.4
rhive.basic.merge
rhive.basic.merge creates a new data set by merging two tables on their common columns.
# checking data
rhive.query('select * from iris limit 5')
  rowname sepallength sepalwidth petallength petalwidth species
1       1         5.1        3.5         1.4        0.2  setosa
2       2         4.9        3.0         1.4        0.2  setosa
3       3         4.7        3.2         1.3        0.2  setosa
4       4         4.6        3.1         1.5        0.2  setosa
5       5         5.0        3.6         1.4        0.2  setosa
rhive.query('select * from usarrests limit 5')
     rowname murder assault urbanpop rape
1    Alabama   13.2     236       58 21.2
2     Alaska   10.0     263       48 44.5
3    Arizona    8.1     294       80 31.0
4   Arkansas    8.8     190       50 19.5
5 California    9.0     276       91 40.6
## rhive.basic.merge
rhive.basic.merge('iris', 'usarrests', by.x='sepallength', by.y='murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
Merging in this way is similar to a 'join' in SQL. The following is equivalent:
# Use join to extract and print the names of all rows not found to be common
# after merging. Should row names overlap, only print out the name of the
# former row.
rhive.big.query('select a.sepallength, a.sepalwidth, a.petallength,
    a.petalwidth, a.species, b.assault, b.urbanpop, b.rape, a.rowname
    from iris a join usarrests b on a.sepallength = b.murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
rhive.basic.mode
rhive.basic.mode returns the mode and its frequency within a specified column of a Hive table.
rhive.basic.mode('iris', 'sepallength')
  sepallength freq
1           5   10
rhive.basic.range
rhive.basic.range returns the smallest and largest values in the specified numerical column of a Hive table.
rhive.basic.range('iris', 'sepallength')
[1] 4.3 7.9