Your SlideShare is downloading. ×
0
useR Vignette: Accessing Databases from R Greater Boston useR Group May 4, 2011 by Jeffrey Breen [email_address] Photo fro...
Outline <ul><li>Why relational databases?
Introducing DBI
Simple SQL queries
dbApply() marries strengths of SQL and *apply()
The parallel universe of RODBC
sqldf: No database? No problem!
Further Reading
Loading mtcars sample data.frame into MySQL </li></ul>AP Photo/Ben Margot
Why relational databases? <ul><li>Databases excel at handling large amounts of, um, data
They're everywhere </li><ul><li>Virtually all enterprise applications are built on relational databases (CRM, ERP, HRIS, e...
Upcoming SlideShare
Loading in...5
×

Accessing Databases from R

23,031

Published on

Overview of accessing relational databases from R. Focuses and demonstrates DBI family (RMySQL, RPostgreSQL, ROracle, RJDBC, etc.) but also introduces RODBC. Highlights DBI's dbApply() function to combine strengths of SQL and *apply() on large data sets. Demonstrates sqldf package which provides SQL access to standard R data.frames.

Presented at the May 2011 meeting of the Greater Boston useR Group.

Published in: Technology, Education
0 Comments
7 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
23,031
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
211
Comments
0
Likes
7
Embeds 0
No embeds

No notes for slide

Transcript of "Accessing Databases from R"

  1. 1. useR Vignette: Accessing Databases from R Greater Boston useR Group May 4, 2011 by Jeffrey Breen [email_address] Photo from http://en.wikipedia.org/wiki/File:Oracle_Headquarters_Redwood_Shores.jpg
  2. 2. Outline <ul><li>Why relational databases?
  3. 3. Introducing DBI
  4. 4. Simple SQL queries
  5. 5. dbApply() marries strengths of SQL and *apply()
  6. 6. The parallel universe of RODBC
  7. 7. sqldf: No database? No problem!
  8. 8. Further Reading
  9. 9. Loading mtcars sample data.frame into MySQL </li></ul>AP Photo/Ben Margot
  10. 10. Why relational databases? <ul><li>Databases excel at handling large amounts of, um, data
  11. 11. They're everywhere </li><ul><li>Virtually all enterprise applications are built on relational databases (CRM, ERP, HRIS, etc.)
  12. 12. Thanks to high quality open source databases (esp. MySQL and PostgreSQL), they're central to dynamic web development since beginning. </li><ul><li>“LAMP” = Linux + Apache + MySQL + PHP </li></ul><li>Amazon's “Relational Data Service” is just a tuned deployment of MySQL </li></ul><li>SQL provides almost-standard language to filter, aggregate, group, sort </li><ul><li>SQL-like query languages showing up in new places (Hadoop Hive)
  13. 13. ODBC provides SQL interface to non-database data (Excel, CSV, text files) </li></ul></ul>
  14. 14. Introducing DBI <ul><li>DBI provides a common interface for (most of) R's database packages
  15. 15. Database-specific code implemented in sub-packages </li><ul><li>RMySQL, RPostgreSQL, ROracle, RSQLite, RJDBC </li></ul><li>Use dbConnect(), dbDisconnect() to open, close connections: </li></ul>> library(RMySQL) > con = dbConnect(&quot;MySQL&quot;, &quot;testdb&quot;, username=&quot;testuser&quot;, password=&quot;testpass&quot;) [...] > dbDisconnect(con)
  16. 16. Using DBI <ul><li>dbReadTable() and dbWriteTable() read and write entire tables </li></ul>> df = dbReadTable(con, 'motortrend') > head(df, 4)                    mpg cyl disp  hp drat    wt  qsec vs am gear carb     mfg      model Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   Mazda        RX4 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   Mazda    RX4 Wag Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  Datsun        710 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  Hornet    4 Drive <ul><li>dbGetQuery() runs SQL query and returns entire result set </li></ul>> df = dbGetQuery(con, &quot;SELECT * FROM motortrend&quot;) > head(df,4) row_names mpg cyl disp hp drat wt qsec vs am gear carb mfg model 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive <ul><ul><li>Note how dbReadTable() uses “row_names” column </li></ul><li>Use dbSendQuery() & fetch() to stream larger result sets
  17. 17. Advanced functions available to read schema definitions, handle transactions, call stored procedures, etc. </li></ul>
  18. 18. Simple SQL queries Fetch a column with no filtering but de-dupe: > df = dbGetQuery(con, &quot;SELECT DISTINCT mfg FROM motortrend&quot;) > head(df, 3)        mfg 1    Mazda 2   Datsun 3   Hornet Aggregate and sort result: > df = dbGetQuery(con, &quot;SELECT mfg, avg(hp) AS meanHP FROM motortrend GROUP BY mfg ORDER BY meanHP DESC&quot;) > head(df, 4)        mfg meanHP 1 Maserati    335 2     Ford    264 3   Duster    245 4   Camaro    245 > df = dbGetQuery(con, &quot;SELECT cyl as cylinders, avg(hp) as meanHP FROM motortrend GROUP by cyl ORDER BY cyl&quot;) > df cylinders meanHP 1 4 82.63636 2 6 122.28571 3 8 209.21429
  19. 19. <ul><li>dbApply() marries strengths of SQL and *apply() </li></ul><ul><li>Operates on result set from dbSendQuery() </li><ul><li>Uses fetch() to bring in smaller chunks at a time to handle Big Data
  20. 20. You must order result set by your “chunking” variable </li></ul><li>Example: calculate quantiles for horsepower vs. cylinders </li></ul>> sql = &quot;SELECT cyl, hp FROM motortrend ORDER BY cyl&quot; > rs = dbSendQuery(con, sql) > dbApply(rs, INDEX='cyl', FUN=function(x, grp) quantile(x$hp)) $`4.000000`    0%   25%   50%   75%  100%   52.0  65.5  91.0  96.0 113.0 $`6.000000`   0%  25%  50%  75% 100%   105  110  110  123  175 $`8.000000`     0%    25%    50%    75%   100% 150.00 176.25 192.50 241.25 335.00 <ul><li>Implemented and available in RMySQL, RPostgreSQL </li></ul>
  21. 21. The parallel universe of RODBC <ul><li>ODBC = “open database connectivity” </li><ul><li>Released by Microsoft in 1992
  22. 22. Cross-platform, but strongest support on Windows
  23. 23. ODBC drivers are available for every database you can think of PLUS Excel spreadsheets, CSV text files, etc. </li></ul><li>For historical reasons, RODBC not part of DBI family
  24. 24. Same idea, different details: </li><ul><li>odbcConnect() instead of dbConnection()
  25. 25. sqlFetch() = dbReadTable()
  26. 26. sqlSave() = dbWriteTable()
  27. 27. sqlQuery() = dbGetQuery() </li></ul><li>Closest match in DBI family is RJDBC using Java JDBC drivers </li></ul>
  28. 28. sqldf: No database? No problem! <ul><li>Provides SQL access to data.frames as if they were tables
  29. 29. Creates & updates SQLite databases automagically </li><ul><li>But can also be used with existing SQLite, MySQL databases </li></ul></ul>> library(sqldf) > data(mtcars) > sqldf(&quot;SELECT cyl, avg(hp) FROM mtcars GROUP BY cyl ORDER BY cyl&quot;) cyl avg(hp) 1 4 82.63636 2 6 122.28571 3 8 209.21429 > library(stringr) > mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1] > sqldf(&quot;SELECT mfg, avg(hp) AS meanHP FROM mtcars GROUP BY mfg ORDER BY meanHP DESC LIMIT 4&quot;) mfg meanHP 1 Maserati 335 2 Ford 264 3 Camaro 245 4 Duster 245
  30. 30. Further Reading <ul><li>Bell Labs: R/S-Database Interface </li><ul><li>http://stat.bell-labs.com/RS-DBI/ </li></ul><li>R Data Import/Export manual </li><ul><li>http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases </li></ul><li>CRAN: DBI and “Reverse depends” friends </li><ul><li>http://cran.r-project.org/web/packages/DBI/
  31. 31. http://cran.r-project.org/web/packages/RMySQL/
  32. 32. http://cran.r-project.org/web/packages/RPostgreSQL/
  33. 33. http://cran.r-project.org/web/packages/RJDBC/ </li></ul><li>CRAN: RODBC </li><ul><li>http://cran.r-project.org/web/packages/RODBC/ </li></ul><li>CRAN: sqldf </li><ul><li>http://cran.r-project.org/web/packages/sqldf/ </li></ul><li>Phil Spector's SQL tutorial </li><ul><li>http://www.stat.berkeley.edu/~spector/sql.pdf </li></ul></ul>
  34. 34. Loading 'mtcars' sample data.frame into MySQL In MySQL, create new database & user: mysql> create database testdb; mysql> grant all privileges on testdb.* to 'testuser'@'localhost' identified by 'testpass'; mysql> flush privileges; In R, load &quot;mtcars&quot; data.frame, clean up, and write to new &quot;motortrend&quot; data base table: library(stringr) library(RMySQL) data(mtcars) mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1] mtcars$mfg[mtcars$mfg=='Merc'] = 'Mercedes' mtcars$model = str_split_fixed(rownames(mtcars), ' ', 2)[,2] con = dbConnect(&quot;MySQL&quot;, &quot;testdb&quot;, username=&quot;testuser&quot;, password=&quot;testpass&quot;) dbWriteTable(con, 'motortrend', mtcars) dbDisconnect(con)
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×