Tokyo r sqldf

Driving R with SQL
Even people who work with RDBs will find R more approachable.

Takashi Minoda (簑田 高志)

Contents

1. Self-introduction
2. Introducing the sqldf package
3. Going a little further…

03/05/11 2

Self-introduction

● Twitter aad34210
● http://pracmper.blogspot.com/
● Web


Questions

First, a few questions for everyone.

・Who has used R?
・Who has operated an RDB using SQL?
・Who has ever thought that aggregation in R is tedious? ( ・ д ・ ) tch


The sqldf package
When you aggregate data in R…
library(ggplot2)  # provides the diamonds dataset
head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

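# Sum of price by cut
price_sum <- aggregate(diamonds[,c(7)] , list(cut = diamonds$cut) , sum)
# Mean of the other numeric columns by cut
other_mean <- aggregate(diamonds[,c(5:10)] , list(cut = diamonds$cut) , mean)
# Merge the two results on cut
merge(price_sum , other_mean , by = c("cut"))

( ・ д ・ ) tch > You have to write separate code for every function you want to compute…
                 I just want to aggregate a data frame easily.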


The sqldf package
● You use R,
● you find aggregation tedious ( ・ д ・ ) tch,
● and you have used SQL.

If that sounds like you, today I will introduce the sqldf package.


sqldf
sqldf is an R package for running SQL statements on R data frames, optimized for convenience. The user
simply specifies an SQL statement in R using data frame names in place of table names and a database with
appropriate table layouts/schema is automatically created, the data frames are automatically loaded into
the database, the specified SQL statement is performed, the result is read back into R and the database is
deleted all automatically behind the scenes making the database's existence transparent to the user who
only specifies the SQL statement. Surprisingly this can at times be even faster than the corresponding pure
R calculation (although the purpose of the project is convenience and not speed).
http://code.google.com/p/sqldf/

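(In short: sqldf lets you run SQL against R data frames.
It creates the DB, runs the SQL,
returns the result to R, and deletes the DB automatically.)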
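
The sequence in the project description (create a database, load the data frame, run the SQL, read the result back, discard the database) can be reproduced by hand. The sketch below only illustrates those steps. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database; it is not sqldf's actual internal code.

```r
library(DBI)

# 1. Create a temporary in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# 2. Load the data frame into the database as a table named "iris".
dbWriteTable(con, "iris", iris)

# 3. Run the SQL statement and read the result back into R as a data frame.
res <- dbGetQuery(con, "SELECT COUNT(*) AS iris_count FROM iris")

# 4. Close the connection; the in-memory database is discarded with it.
dbDisconnect(con)

print(res)
```

With sqldf, all four steps collapse into a single call that takes only the SQL string.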


# Install and load sqldf
install.packages("sqldf")
library(sqldf)

# Usage: sqldf("SQL statement")

# Basic usage
# Count the rows of iris
sqldf("SELECT COUNT(*) as iris_count FROM iris")
iris_count
1 150

# Count the rows of iris by Species
sqldf("SELECT Species , COUNT(*) as iris_count FROM iris GROUP BY Species")
Species iris_count
1 setosa 50
2 versicolor 50
3 virginica 50
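
The same pattern should work for any data frame in the workspace, not just iris. As a small sketch, the query below groups the built-in mtcars data by cylinder count. The choice of mtcars and the column aliases n and avg_mpg are only for illustration and do not come from the slides.

```r
library(sqldf)

# Count cars and average mpg per cylinder count, using the built-in mtcars data.
res <- sqldf("
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
  FROM mtcars
  GROUP BY cyl
  ORDER BY cyl
")
print(res)
```

The data frame name is used directly as the table name, and the result comes back as an ordinary data frame with one row per group.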


# Going a little further
# Column names containing "." are awkward in SQL, so rename them with underscores.
iris2 <- iris
colnames(iris2) <- c("Sepal_Length" , "Sepal_Width" , "Petal_Length" ,"Petal_Width" , "Species")
head(iris2)

# Row count and averages by Species
sqldf("
SELECT
Species ,
COUNT(Species) as Species_num,
AVG(Sepal_Length) as average_Length,
AVG(Sepal_Width) as average_width
FROM
iris2
GROUP BY
Species
")
Species Species_num average_Length average_width
1 setosa 50 5.006 3.428
2 versicolor 50 5.936 2.770
3 virginica 50 6.588 2.974


# Subqueries
# Extract the rows whose Petal.Length is at or above the average,
# using a subquery in the WHERE clause

sqldf("SELECT * FROM iris2 WHERE Petal_Length >= (select avg(Petal_Length) from iris2)")

Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 7.0 3.2 4.7 1.4 versicolor
2 6.4 3.2 4.5 1.5 versicolor
3 6.9 3.1 4.9 1.5 versicolor
4 5.5 2.3 4.0 1.3 versicolor
5 6.5 2.8 4.6 1.5 versicolor
6 5.7 2.8 4.5 1.3 versicolor
・
・
・
avg(Petal_Length) = 3.758
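
To make the two parts of the query explicit, here is a base R cross-check of the same filter. It is only a sketch for comparison; the variable names avg_petal and above_avg are made up for this example.

```r
iris2 <- iris
names(iris2) <- gsub(".", "_", names(iris2), fixed = TRUE)

# The scalar the subquery computes: the overall mean of Petal_Length.
avg_petal <- mean(iris2$Petal_Length)

# The rows the outer query keeps: Petal_Length at or above that mean.
above_avg <- iris2[iris2$Petal_Length >= avg_petal, ]

print(avg_petal)
head(above_avg)
```

The inner SELECT corresponds to avg_petal and the outer WHERE corresponds to the row filter, so the first rows should match the sqldf output above.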


# Using variables
# Build the SQL string from an R variable
# Filter by Species
var <- "setosa"
sql_head <- "SELECT * FROM iris2 WHERE Species = "
sql_str <- paste0(sql_head , "'", var ,"'")
sqldf(sql_str)

Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
・
・
・
・
・

# Wrap it in a function
Sepal_search <- function(var){
  sql_head <- "SELECT * FROM iris2 WHERE Species = "
  sql_str <- paste0(sql_head , "'", var ,"'")
  print(sqldf(sql_str))
}
# Call the function
Sepal_search(var = "versicolor")
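
When the value comes from outside the script, pasting it straight into the SQL can break the statement if it contains a single quote. One possible variant, sketched here with a hypothetical helper named build_species_sql, doubles any single quotes before building the string with sprintf.

```r
# Hypothetical helper: builds the query string for a given species name.
# Single quotes inside the value are doubled, which is how SQL string
# literals escape them, so a stray quote cannot end the literal early.
build_species_sql <- function(species) {
  escaped <- gsub("'", "''", species, fixed = TRUE)
  sprintf("SELECT * FROM iris2 WHERE Species = '%s'", escaped)
}

sql_str <- build_species_sql("versicolor")
print(sql_str)
# The resulting string can then be passed to sqldf(sql_str).
```

Keeping the string construction separate from the sqldf call also makes it easy to print and inspect the SQL before running it.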


# Speed comparison
# R vs. sqldf

# R code
R_Code <- function(){
price_sum <- aggregate(diamonds[,c(7)] , list(cut = diamonds$cut) , sum)
other_mean <- aggregate(diamonds[,c(5,7:10)] , list(cut = diamonds$cut) , mean)
merge(price_sum , other_mean , by = c("cut"))
}
system.time(R_Code())

# sqldf
sql_df_code <- function(){
sqldf("
SELECT
cut, SUM(price), avg(depth), avg(price), avg(x), avg(y), avg(z)
FROM diamonds
GROUP BY cut
")
}

# Time the sqldf version
system.time(sql_df_code())


# Results
# R (aggregate, merge)
system.time(R_Code())
user system elapsed
0.468 0.041 0.541

# sqldf
system.time(sql_df_code())
user system elapsed
0.841 0.036 0.895

・The R code was faster than the sqldf code!
・I suppose it is a trade-off between ease of writing and speed…
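
A single system.time call can be noisy, so a comparison like this is usually steadier when each function is run several times. The helper below, time_median, is a hypothetical name introduced for this sketch; it repeats a function and reports the median elapsed time.

```r
# Hypothetical helper: run f() several times and return the median elapsed time.
time_median <- function(f, times = 5) {
  elapsed <- replicate(times, system.time(f())[["elapsed"]])
  median(elapsed)
}

# Example with a small built-in dataset; swap in R_Code or sql_df_code
# from the slides to compare the two approaches.
med <- time_median(function() aggregate(Sepal.Length ~ Species, data = iris, FUN = mean))
print(med)
```

Calling time_median(R_Code) and time_median(sql_df_code) would give two medians to compare instead of two single measurements.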


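Summary
1. With sqldf, you can aggregate R data frames using SQL.

2. Usage is simply sqldf("[SQL]");
the result comes back to R.

3. By building the SQL with paste inside a function,
you can pass variables to sqldf.

4. For speed, though, plain R…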
