Webinar: The rmongodb R package

A free one-hour webinar giving a general introduction to the "rmongodb" R package (https://github.com/mongosoup/rmongodb), which connects the MongoDB database (http://www.mongodb.com/) with the R statistical computing environment (http://www.r-project.org).

Transcript

  • 1. Webinar: The rmongodb R package Dr. rer. nat. Markus Schmidberger January 30th, 2014 Email: markus@mongosoup.de Twitter: @cloudHPC
  • 2. Outline: Introduction to Big Data, MongoDB, MongoSoup, R; Introduction to R database packages such as rmongodb; rmongodb Live Demo; Summary & Outlook & Questions
  • 3. Big Data Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. … storing processing
  • 4. Storing: NoSQL - MongoDB. NoSQL: databases using looser consistency models to store data. MongoDB: most popular NoSQL database system; document oriented; JSON-like documents with dynamic schemas. http://docs.mongodb.org/manual/reference/sqlcomparison/
  • 5. MongoDB - some commands db.collection.find() db.collection.find().pretty() db.collection.find( { _id: 5 } ) db.collection.find( { pop: { $gt: 25 } } ) db.collection.insert( { item: "card", pop: 15 } ) db.collection.ensureIndex( { orderDate: 1, zipcode: -1 } ) db.collection.update( { _id: 1 }, { $set: { "name": "Warner" } } )
  • 6. MongoSoup: German MongoDB as a Service; cloudControl Add-On; running on AWS EU-Region or in Munich (Germany); all features available: shared / dedicated hosting, replica set, sharding; 24/7 support available
  • 7. Processing: Analyzing with R and Hadoop. Backward-looking analysis is outdated; today: quasi real-time analysis; tomorrow: forward-looking predictive analysis. More complex methods, more data available, more processing time required. Efficient processing technology required: R, Hadoop, … See my Strata London 2013 tutorial “Big Data Analyses with R”
  • 8. Introduction to R: R is a free software environment for statistical computing and graphics; offers tools to manage and analyze data; standard statistical methods are implemented; compiles and runs under different operating systems; support via a huge community
  • 9. One statistical Example kmeans(dat, 4) K-means clustering with 4 clusters of sizes 17, 30, 22, 31 Cluster means: [,1] [,2] 1 0.02846 -0.3379 2 0.76616 1.0020 3 1.37160 0.9707 4 -0.06849 0.1409 Clustering vector: [1] 4 2 4 4 1 1 4 1 4 4 1 4 4 4 4 4 1 4 4 2 4 4 4 4 4 4 4 1 4 4 1 1 1 1 2 [36] 1 1 4 4 4 1 1 4 4 4 1 1 1 4 4 3 2 3 2 3
  • 10. 3 2 2 3 2 3 2 2 3 2 2 3 2 2 3 [71] 3 2 2 3 3 2 2 2 2 2 2 2 3 2 2 4 3 2 3 2 2 3 3 3 3 3 3 2 3 2 Within cluster sum of squares by cluster: [1] 1.836 4.660 1.994 3.047 (between_SS / total_SS = 84.1 %) Available components: [1] "cluster" "centers" "totss" "withinss" [5] "tot.withinss" "betweenss" "size" "iter" [9] "ifault"
  • 11. plot(dat, col = cl$cluster, cex=2, pch=16) points(cl$centers, col = 1:4, pch = 13, cex = 4)
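
Slides 9 to 11 use `dat` and `cl` without showing how they were created. The sketch below is one way to reproduce a comparable example in base R. The simulated data, the seed, and `nstart = 5` are assumptions and not taken from the webinar, so the cluster sizes and centers will differ from the output on the slides.

```r
# Assumption: 100 two-dimensional points drawn around two centers,
# similar in spirit to the example in ?kmeans.
set.seed(42)
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# Fit k-means with 4 clusters and keep the result for plotting.
cl <- kmeans(dat, centers = 4, nstart = 5)

# Inspect a few of the "Available components" listed on slide 10.
print(cl$size)
print(cl$centers)

# Same plotting calls as on slide 11: points colored by cluster,
# cluster centers drawn on top.
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
```

Assigning the result of `kmeans()` to `cl` is what makes `cl$cluster` and `cl$centers` available to the plotting calls.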
  • 12. R and Databases: SQL provides a standard language to filter, aggregate, group, sort data; SQL in new places: Hive, Impala, …; many R packages to connect to the SQL world; R stores relational data in data.frames (extended lists)
  • 13. data(iris) head(iris[,1:3], n=3) Sepal.Length Sepal.Width Petal.Length 1 5.1 3.5 1.4 2 4.9 3.0 1.4 3 4.7 3.2 1.3 class(iris) [1] "data.frame"
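
To illustrate the "extended lists" remark on slide 12, here is a minimal base R sketch showing that a data.frame is a list of columns, together with rough base R counterparts of the SQL operations named there (filter, aggregate, group, sort). The SQL statements in the comments are only illustrative and are not from the webinar.

```r
data(iris)

# A data.frame is a list of equal-length columns.
print(is.list(iris))
print(names(iris))

# Filter and project, roughly:
#   SELECT Sepal.Length, Sepal.Width FROM iris WHERE Species = 'setosa' LIMIT 3
setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]
print(head(setosa, n = 3))

# Group and aggregate, roughly:
#   SELECT Species, AVG(Sepal.Length) FROM iris GROUP BY Species
avg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Sort, roughly: ORDER BY Sepal.Length DESC
avg_sorted <- avg[order(-avg$Sepal.Length), ]
print(avg_sorted)
```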
  • 14. R package: sqldf running SQL statements on R data frames library(sqldf) sqldf("select Sepal_Length,Sepal_Width,Petal_Length from iris limit 2") Sepal_Length Sepal_Width Petal_Length 1 5.1 3.5 1.4 2 4.9 3.0 1.4 sqldf("select count(*) from iris") count(*) 1 150
  • 15. Other relational R packages: RMySQL, RPostgreSQL, ROracle, RJDBC, RODBC, RSQLite (SQLite engine is included). One big problem: all packages read the full query results into R memory
  • 16. R and MongoDB: on CRAN there are two packages to connect R with MongoDB. rmongodb: supported by MongoDB, Inc.; powerful for big data. RMongo: easy to use; limited functionality; reads full query results into R memory
  • 17. R package: RMongo library(RMongo) mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017) dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX") dbShowCollections(mongo) [1] "zips" "system.users" [5] "test_data" "ccp" "system.indexes"
  • 18. dbGetQuery(mongo, "zips","{'state':'AL'}", skip=0, limit=5) X_id state city 1 35004 AL ACMAR 2 35005 AL ADAMSVILLE 3 35006 AL ADGER 4 35007 AL KEYSTONE 5 35010 AL NEW SITE loc pop [ -86.51557 , 33.584132] 6055 [ -86.959727 , 33.588437] 10616 [ -87.167455 , 33.434277] 3205 [ -86.812861 , 33.236868] 14218 [ -85.951086 , 32.941445] 19942
  • 19. dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }') [1] "ok" # e.g. no command to remove collections # e.g. no command to create indices dbDisconnect(mongo)
  • 20. R package: rmongodb. Developed on top of the MongoDB-supported C driver; new maintainer: markus@mongosoup.de; new repository: https://github.com/mongosoup/rmongodb; please provide feedback or contribute via Pull Requests
  • 21. library(rmongodb) mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb", username="JwQcDLJSYQJb", password="RSXPkUkXXXXX") mongo [1] 0 attr(,"mongo") <pointer: 0x102e4aac0> attr(,"class") [1] "mongo" attr(,"host") [1] "dbs001.mongosoup.de" attr(,"name") [1] ""
  • 22. attr(,"username") [1] "JwQcDLJSYQJb" attr(,"password") [1] "RSXPkUkxRdOX" attr(,"db") [1] "cc_JwQcDLJSYQJb" attr(,"timeout") [1] 0
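
As a rough sketch of what might follow the connection step on slides 21 and 22, the code below checks the connection, counts matching documents, and fetches one of them. The host, database, and collection names are hypothetical placeholders (a local server and a `zips` collection like the one in the RMongo example), and the code only talks to MongoDB if the rmongodb package is installed and the server is reachable.

```r
# Hypothetical connection details; replace with your own.
host <- "localhost:27017"
db <- "test"
ns <- paste(db, "zips", sep = ".")  # rmongodb addresses collections as "db.collection"

if (requireNamespace("rmongodb", quietly = TRUE)) {
  library(rmongodb)
  mongo <- mongo.create(host = host, db = db)

  if (mongo.is.connected(mongo)) {
    # Build the query document {"state": "AL"} from an R list.
    query <- mongo.bson.from.list(list(state = "AL"))

    # Count matching documents, then fetch a single one.
    print(mongo.count(mongo, ns, query))
    first <- mongo.find.one(mongo, ns, query)
    if (!is.null(first)) print(mongo.bson.to.list(first))

    # Close the connection when done.
    mongo.destroy(mongo)
  } else {
    message("Could not connect to ", host)
  }
} else {
  message("rmongodb is not installed; skipping the MongoDB calls")
}
```

Note that printing the connection object, as on slide 22, also prints the `password` attribute, so it is probably best avoided in shared logs.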
  • 23. Live Demo Live Demo with RStudio and MongoSoup
  • 24. JSON <-> BSON <-> R: new functionality in development; still problems with sub-documents and JSON arrays; using jsonlite package helps. library(rmongodb) library(jsonlite)
  • 25. bson <- mongo.bson.from.JSON('{"state":"AL"}') bson state : 2 AL list <- mongo.bson.to.list(bson) list $state [1] "AL" toJSON(list) [1] "{ "state" : [ "AL" ] }"
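
The `[ "AL" ]` in the `toJSON` output on slide 25 appears because jsonlite encodes length-one R vectors as JSON arrays by default. The sketch below shows the round trip with jsonlite alone, without a MongoDB connection. The `auto_unbox = TRUE` argument is a jsonlite option that was not shown in the webinar, and the sample document is made up for illustration.

```r
library(jsonlite)

# Parse a JSON document into an R list.
doc <- fromJSON('{"state":"AL","pop":6055}')
str(doc)

# Default encoding: every length-one vector becomes a JSON array.
boxed <- toJSON(doc)
print(boxed)

# With auto_unbox = TRUE, length-one vectors become JSON scalars.
unboxed <- toJSON(doc, auto_unbox = TRUE)
print(unboxed)

# The same option helps when writing query documents with operators,
# such as the $gt query from slide 5.
query <- toJSON(list(pop = list(`$gt` = 25)), auto_unbox = TRUE)
print(query)
```

A string produced this way could then be passed to `mongo.bson.from.JSON()` as on slide 25, subject to the sub-document and array issues mentioned on slide 24.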
  • 26. Summary: R is a powerful statistical tool to analyse many different kinds of data; R can access databases; MongoDB and rmongodb are ready for Big Data; some open issues for simple usability
  • 27. Outlook: Fixing JSON to BSON issues; provide efficient functionality for MongoDB to data.frames; use new mongodb-c library; a lot of work: re-engineering rmongodb back-end -> more speed, more functionality; continue developing the plyrmongodb package: https://github.com/schmidb/dplyrmongodb
  • 28. Questions & Answers thanks a lot for your attention demo code available as vignette in the rmongodb package on github Email: markus@mongosoup.de Twitter: @cloudHPC