Webinar: The rmongodb R package

A free one-hour webinar giving a general introduction to the "rmongodb" R package (https://github.com/mongosoup/rmongodb), which provides an interface between the MongoDB database (http://www.mongodb.com/) and the R statistical computing environment (http://www.r-project.org).

  1. 1. Webinar: The rmongodb R package
      Dr. rer. nat. Markus Schmidberger
      January 30th, 2014
      Email: markus@mongosoup.de
      Twitter: @cloudHPC
  2. 2. Outline
      • Introduction to Big Data, MongoDB, MongoSoup, R
      • Introduction to R database packages such as rmongodb
      • rmongodb Live Demo
      • Summary & Outlook & Questions
  3. 3. Big Data
      Wikipedia: "… a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …"
      • storing
      • processing
  4. 4. Storing: NoSQL - MongoDB
      • NoSQL: databases using looser consistency models to store data
      • MongoDB
          • most popular NoSQL database system
          • document oriented
          • JSON-like documents with dynamic schemas
      • http://docs.mongodb.org/manual/reference/sqlcomparison/
  5. 5. MongoDB - some commands
      db.collection.find()
      db.collection.find().pretty()
      db.collection.find( { _id: 5 } )
      db.collection.find( { pop: { $gt: 25 } } )
      db.collection.insert( { item: "card", pop: 15 } )
      db.collection.ensureIndex( { orderDate: 1, zipcode: -1 } )
      db.collection.update( { _id: 1 }, { $set: { "name": "Warner" } } )
  6. 6. MongoSoup
      • German MongoDB as a Service
      • cloudControl Add-On
      • running on AWS EU-Region or in Munich (Germany)
      • all features available: shared / dedicated hosting, replica set, sharding
      • 24/7 support available
  7. 7. Processing: Analyzing with R and Hadoop
      • backward-looking analysis is outdated
      • today: quasi real-time analysis
      • tomorrow: forward-looking predictive analysis
      • more complex methods, more data available, more processing time required
      • efficient processing technology required: R, Hadoop, …
      • see my Strata London 2013 tutorial "Big Data Analyses with R"
  8. 8. Introduction to R
      • R is a free software environment for statistical computing and graphics
      • offers tools to manage and analyze data
      • standard statistical methods are implemented
      • compiles and runs on different operating systems
      • support from a huge community
  9. 9. One statistical example
      cl <- kmeans(dat, 4)
      cl
      K-means clustering with 4 clusters of sizes 17, 30, 22, 31
      Cluster means:
            [,1]    [,2]
      1  0.02846 -0.3379
      2  0.76616  1.0020
      3  1.37160  0.9707
      4 -0.06849  0.1409
      Clustering vector:
       [1] 4 2 4 4 1 1 4 1 4 4 1 4 4 4 4 4 1 4 4 2 4 4 4 4 4 4 4 1 4 4 1 1 1 1 2
      [36] 1 1 4 4 4 1 1 4 4 4 1 1 1 4 4 3 2 3 2 3 3 2 2 3 2 3 2 2 3 2 2 3 2 2 3
  10. 10. [71] 3 2 2 3 3 2 2 2 2 2 2 2 3 2 2 4 3 2 3 2 2 3 3 3 3 3 3 2 3 2
          Within cluster sum of squares by cluster:
          [1] 1.836 4.660 1.994 3.047
           (between_SS / total_SS =  84.1 %)
          Available components:
          [1] "cluster"      "centers"      "totss"        "withinss"
          [5] "tot.withinss" "betweenss"    "size"         "iter"
          [9] "ifault"
  11. 11. plot(dat, col = cl$cluster, cex=2, pch=16)
          points(cl$centers, col = 1:4, pch = 13, cex = 4)
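
The slides never show how `dat` was built. A minimal sketch, assuming two-dimensional points drawn around four centres (the centres, spread, and seed below are invented for illustration and are not the presenter's data), walks through the same workflow with base R only:

```r
# Hypothetical stand-in for the undefined `dat`: 100 points in 2-D,
# 25 around each of four assumed centres.
set.seed(42)
centres <- matrix(c(0, 0,
                    1, 1,
                    0, 1,
                    1, 0), ncol = 2, byrow = TRUE)
dat <- centres[rep(1:4, each = 25), ] +
  matrix(rnorm(200, sd = 0.15), ncol = 2)

# Fit k-means with 4 clusters; nstart restarts guard against a poor
# random initialisation.
cl <- kmeans(dat, centers = 4, nstart = 10)

cl$size      # number of points per cluster
cl$centers   # one row of coordinates per cluster

# Same plot calls as on the slide: points coloured by cluster, centres on top.
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
```

Because the data are random, cluster sizes and centres will differ from the numbers printed on the slides; only the sequence of calls is the same.
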
  12. 12. R and Databases
      • SQL provides a standard language to filter, aggregate, group, and sort data
      • SQL in new places: Hive, Impala, …
      • many R packages to connect to the SQL world
      • R stores relational data in data.frames (extended lists)
  13. 13. data(iris)
          head(iris[,1:3], n=3)
            Sepal.Length Sepal.Width Petal.Length
          1          5.1         3.5          1.4
          2          4.9         3.0          1.4
          3          4.7         3.2          1.3
          class(iris)
          [1] "data.frame"
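
To make the "data.frames (extended lists)" remark concrete, here is a small base-R sketch of the SQL-style operations named above (filter, group and aggregate, sort) applied to `iris`. The variable names `wide`, `avg`, and `avg_sorted` are illustrative, and the SQL in the comments is only a rough analogy.

```r
data(iris)

# A data.frame is a list of equal-length columns with extra attributes.
is.list(iris)          # TRUE
length(iris)           # number of columns

# filter: roughly WHERE Sepal.Length > 7
wide <- iris[iris$Sepal.Length > 7, ]

# group + aggregate: roughly SELECT Species, AVG(Sepal.Length) ... GROUP BY Species
avg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# sort: roughly ORDER BY Sepal.Length DESC
avg_sorted <- avg[order(avg$Sepal.Length, decreasing = TRUE), ]
avg_sorted
```
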
  14. 14. R package: sqldf
      running SQL statements on R data frames
      library(sqldf)
      sqldf("select Sepal_Length,Sepal_Width,Petal_Length from iris limit 2")
        Sepal_Length Sepal_Width Petal_Length
      1          5.1         3.5          1.4
      2          4.9         3.0          1.4
      sqldf("select count(*) from iris")
        count(*)
      1      150
  15. 15. Other relational R packages
      • RMySQL
      • RPostgreSQL
      • ROracle
      • RJDBC
      • RODBC
      • RSQLite (SQLite engine is included)
      One big problem: all packages read the full query results into R memory
  16. 16. R and MongoDB
      on CRAN there are two packages to connect R with MongoDB
      • rmongodb
          • supported by MongoDB, Inc.
          • powerful for big data
      • RMongo
          • easy to use
          • limited functionality
          • reads full query results into R memory
  17. 17. R package: RMongo
      library(RMongo)
      mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
      dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
      dbShowCollections(mongo)
      [1] "zips" "system.users"
      [5] "test_data" "ccp" "system.indexes"
  18. 18. dbGetQuery(mongo, "zips","{'state':'AL'}", skip=0, limit=5)
             X_id state       city                        loc   pop
          1 35004    AL      ACMAR  [ -86.51557 , 33.584132]  6055
          2 35005    AL ADAMSVILLE [ -86.959727 , 33.588437] 10616
          3 35006    AL      ADGER [ -87.167455 , 33.434277]  3205
          4 35007    AL   KEYSTONE [ -86.812861 , 33.236868] 14218
          5 35010    AL   NEW SITE [ -85.951086 , 32.941445] 19942
  19. 19. dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
          [1] "ok"
          # e.g. no command to remove collections
          # e.g. no command to create indices
          dbDisconnect(mongo)
  20. 20. R package: rmongodb
      • developed on top of the MongoDB-supported C driver
      • new maintainer: markus@mongosoup.de
      • new repository: https://github.com/mongosoup/rmongodb
      • please provide feedback or contribute via pull requests
  21. 21. library(rmongodb)
          mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb",
                                username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
          mongo
          [1] 0
          attr(,"mongo")
          <pointer: 0x102e4aac0>
          attr(,"class")
          [1] "mongo"
          attr(,"host")
          [1] "dbs001.mongosoup.de"
          attr(,"name")
          [1] ""
  22. 22. attr(,"username")
          [1] "JwQcDLJSYQJb"
          attr(,"password")
          [1] "RSXPkUkXXXXX"
          attr(,"db")
          [1] "cc_JwQcDLJSYQJb"
          attr(,"timeout")
          [1] 0
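
Once a connection object exists, a query can be walked with a cursor instead of being loaded in one piece. The following is a sketch under assumptions, not the presenter's demo code: the helper `sum_pop_by_state` is hypothetical, it assumes a collection shaped like the `zips` data (documents with `state` and numeric `pop` fields), and the host and namespace in the commented call are placeholders. It uses rmongodb's cursor functions (`mongo.find`, `mongo.cursor.next`, `mongo.cursor.value`, `mongo.cursor.destroy`).

```r
# Sketch only: calling this function needs the rmongodb package and a
# reachable MongoDB server. Defining it does not.
sum_pop_by_state <- function(mongo, ns, state) {
  # Build the query document {"state": <state>} as BSON.
  query <- rmongodb::mongo.bson.from.list(list(state = state))

  # mongo.find() returns a cursor; documents are fetched one at a time
  # inside the loop rather than as one large result object.
  cursor <- rmongodb::mongo.find(mongo, ns, query)
  on.exit(rmongodb::mongo.cursor.destroy(cursor))

  total <- 0
  while (rmongodb::mongo.cursor.next(cursor)) {
    doc <- rmongodb::mongo.bson.to.list(rmongodb::mongo.cursor.value(cursor))
    total <- total + doc$pop   # assumes a numeric `pop` field
  }
  total
}

# Example call (commented out because it needs a live server;
# "localhost" and "test.zips" are placeholders):
# mongo <- rmongodb::mongo.create(host = "localhost", db = "test")
# if (rmongodb::mongo.is.connected(mongo)) {
#   print(sum_pop_by_state(mongo, "test.zips", "AL"))
#   rmongodb::mongo.destroy(mongo)
# }
```

The namespace argument follows rmongodb's "database.collection" convention.
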
  23. 23. Live Demo
      Live Demo with RStudio and MongoSoup
  24. 24. JSON <-> BSON <-> R
      • new functionality in development
      • still problems with sub-documents and JSON arrays
      • using the jsonlite package helps
      library(rmongodb)
      library(jsonlite)
  25. 25. bson <- mongo.bson.from.JSON('{"state":"AL"}')
          bson
          state : 2 AL
          list <- mongo.bson.to.list(bson)
          list
          $state
          [1] "AL"
          toJSON(list)
          [1] "{ \"state\" : [ \"AL\" ] }"
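
The array around "AL" in the last line comes from jsonlite encoding length-one R vectors as JSON arrays by default. A minimal sketch, assuming a jsonlite version in which `toJSON()` has the `auto_unbox` argument (rmongodb is not needed here; `doc` is a stand-in for the list returned by `mongo.bson.to.list()`):

```r
library(jsonlite)

# Stand-in for the list returned by mongo.bson.to.list() on the slide.
doc <- list(state = "AL")

# Default: every atomic vector becomes a JSON array, even of length one.
as_array <- toJSON(doc)

# auto_unbox = TRUE writes length-one vectors as JSON scalars instead.
as_scalar <- toJSON(doc, auto_unbox = TRUE)

print(as_array)
print(as_scalar)
```

With `auto_unbox = TRUE` the output has the same shape as the JSON that went into `mongo.bson.from.JSON()`, which matters when the string is sent back to MongoDB as a query.
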
  26. 26. Summary
      • R is a powerful statistical tool to analyse many different kinds of data
      • R can access databases
      • MongoDB and rmongodb are ready for Big Data
      • some open usability issues remain
  27. 27. Outlook
      • Fix JSON to BSON issues
      • Provide efficient conversion from MongoDB to data.frames
      • Use new mongodb-c library
          • a lot of work: re-engineering the rmongodb back-end
          • -> more speed, more functionality
      • Continue developing the dplyrmongodb package: https://github.com/schmidb/dplyrmongodb
  28. 28. Questions & Answers
      • thanks a lot for your attention
      • demo code available as vignette in the rmongodb package on GitHub
      Email: markus@mongosoup.de
      Twitter: @cloudHPC