Big Data Science with H2O in R

Anqi Fu's presentation from the August 20 Meetup on using H2O with R.

Published in: Health & Medicine, Technology
Notes
  • http://docs.0xdata.com/quickstart/quickstart_R.html
  • Packages > Install package(s) > Select CRAN mirror (US CA1) > Search for RCurl, rjson and bitops
  • Pull up R and demo this in the console, making sure everyone can follow along
  • H2OParsedData: Each data set/calculation is associated with a unique hex key; the object acts like a “pointer”
  • Model: coefficients, deviance, aic, df.residual, etc
  • As the penalty weight increases, lasso gives sparser results (more coefficients exactly zero), while ridge shrinks all coefficients toward zero without necessarily reaching it
  • Transcript

    • 1. H2O – The Open Source Math Engine Big Data Science with H2O in R
    • 2. 4/23/13 H2O – Open Source Math & Machine Learning for Big Data Anqi Fu, August 2013
    • 3. Universe is sparse. Life is messy. Data is sparse & messy. - Lao Tzu
    • 4. Introduction to Big Data
      • There are about as many bits of information in our digital universe as there are stars in our actual universe.
      • The process to decode the human genome took 10 years. It can now be done in a week.
      • Big data means more than “lots of data”
    • 5. H2O – The Open Source Math Engine Better Predictions Same Interface
    • 6. Installation
      1. Install and run H2O
        • Command line: java -Xmx2g -jar h2o.jar
        • Pull up http://localhost:54321 in browser
      2. Install the R package
        • install.packages(c("RCurl", "rjson", "bitops"))
        • install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL, type = "source") (replace Path/To/Package with the actual location of the file)
      3. In R console, type library(h2o)
        • demo(package = "h2o")
        • demo(h2o.glm)
    • 7. Always have H2O running first!
    • 8. Basic R Script
      1. Tell R where H2O is running:
        localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
      2. Check connection:
        h2o.checkClient(localH2O)
      3. Pass H2OClient as parameter to import:
        h2o.importFile(localH2O, path = "Path/To/Data", …)
    • 9. Overview of Objects
      • H2OClient: ip=character, port=numeric
      • H2OParsedData: h2o=H2OClient, key=character
      • H2OGLMModel: key=character, data=H2OParsedData, model=list(coefficients, deviance, aic, etc)
      • Example: myModel@model$coefficients
      • Diagram: H2O holding key="prostate.hex" and key="airlines.hex"
    • 10. Overview of Methods
      Standard R | H2O
      read.csv, read.table, etc | h2o.importFile, h2o.importURL
      summary | summary (limited to data only)
      glm, glmnet | h2o.glm(y, x, data, family, nfolds, alpha, lambda)
      kmeans | h2o.kmeans(data, centers, cols, iter.max)
      randomForest, cforest | h2o.randomForest(y, x_ignore, data, ntree, depth, classwt)
    • 11. Demo 1: Basic GLM in H2O through R
    • 12. Demo 1: Prostate Cancer Data
      • Prostate cancer data set from Ohio State University Comprehensive Cancer Center
      • N = 380 patients, ages ranging from 43-79
      • Goal: Predict presence of tumor from baseline exam of patient (age, race, PSA, total Gleason score, etc)
    • 13. Prostate Cancer Data (plot)
      • y = CAPSULE (0 = no tumor, 1 = tumor)
      • x = PSA (prostate-specific antigen)
    • 14. Prostate Cancer Logistic Regression Fit
      • Family: Binomial, Link: Logit
      • Data: y = CAPSULE (0 = no tumor, 1 = tumor), x = PSA (prostate-specific antigen)
      • Goal: Estimate probability CAPSULE = 1
    • 15. GLM Parameters
      • y = response variable
      • x = predictor variables (vector)
      • family = binomial (default link = logit)
      • data = H2OParsedData object
      • nfolds = number of cross-validation folds
      • lambda = weight on the penalty term
      • alpha = elastic net mixing parameter
        • alpha = 0 is ridge penalty (L2 norm)
        • alpha = 1 is lasso penalty (L1 norm)
    • 16. Under the Hood: Hacking R for H2O
    • 17. Under the Hood (diagram labels: Data Scientist, Analyst, etc; REST API; Data (JSON); Import; Parse; H2O)
    • 18. GLM Code Snippet
      • Create an object to represent model
        setClass("H2OGLMModel", representation(key="character", data="H2OParsedData", model="list"))
      • Declare new method for algorithm
        setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })
      • Slide callouts label the class name and slots in setClass, and a parameter with its initial value in setGeneric
    • 19. GLM Code Snippet
      setMethod("h2o.glm", signature(x="character", y="character", data="H2OParsedData", …), function(x, y, data, …) {
      • Send parameters to GLM.json page; the GLM job is started
        res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key, y = y, x = paste(x, sep="", collapse=","), …)
      • Keep polling and wait until job completed
        while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) { Sys.sleep(1) }
      • Query Inspect.json page with GLM model key to get results
        res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT, key=res$destination_key)
      • S4 reference: http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
    • 20. Demo 2: Data Munging and Remote H2O
    • 21. Demo 2: Airlines Data
      • Airlines data set 1987-2013 from RITA (25%)
      • Goal: Predict if flight’s arrival will be delayed
      • Examine slices of data directly
        head(airlines.hex, n = 10); tail(airlines.hex)
        summary(airlines.hex$DepTime)
      • Take a subset of data to play with in R
        airlines.small = as.data.frame(airlines.hex[1:1000,])
        glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
    • 22. http://www.transtats.bts.gov/Fields.asp?Table_ID=236
    • 23. Connecting to H2O Remotely
      • Your slip of paper contains IP/port of your assigned cluster
      • Point R to remote H2O client
        remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)
      • All data operations occur on cluster
        h2o.importFile(remoteH2O, path = "Path/On/Remote/Server/To/Data", …)
      • Objects/methods operate just like before!
    • 24. Roadmap
      • Long-term Goal: Full H2O/R Integration
      • Subset col by name/index: df[,c(1,2)]; df[,"name"]
      • Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1
      • Filter rows: df[df$cName < 5,]
      • Combine data frames by row/col: rbind, cbind
      • Apply functions: tapply, sapply, lapply
      • Support for R libraries (plyr, ggplot2, etc)
      • More Algorithms: GBM, PCA, Neural Networks
    • 25. Questions and Suggestions?
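
The logistic fit described on slides 13 and 14 can be tried without an H2O cluster using base R's glm. The following is a minimal sketch under stated assumptions: the column names CAPSULE and PSA follow the slides, but the data are simulated here (the exponential PSA distribution and the "true" coefficients are made up for illustration) and are not the actual prostate data set.

```r
# Simulated stand-in for the prostate data; values are invented, not the real data set.
set.seed(42)
n <- 380                                      # same N as quoted on slide 12
PSA <- rexp(n, rate = 1 / 15)                 # hypothetical PSA values
p_true <- plogis(-1 + 0.05 * PSA)             # assumed relationship, for illustration only
CAPSULE <- rbinom(n, size = 1, prob = p_true) # 0 = no tumor, 1 = tumor
prostate <- data.frame(CAPSULE = CAPSULE, PSA = PSA)

# Binomial family with its default logit link, as on slide 14.
fit <- glm(CAPSULE ~ PSA, family = binomial, data = prostate)

# Estimated probability that CAPSULE = 1 at a few PSA levels.
new_patients <- data.frame(PSA = c(5, 20, 60))
pred <- predict(fit, newdata = new_patients, type = "response")

print(coef(fit))
print(pred)
```

With type = "response", predict returns probabilities on the 0 to 1 scale rather than log-odds, which matches the slide's stated goal of estimating the probability that CAPSULE = 1.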
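
The roles of lambda and alpha on slide 15 are easier to see when the penalized objective is written out. The form below assumes the glmnet-style parameterization of the elastic net; the slides do not state the exact formula H2O uses, so scaling constants may differ. Here $\ell$ is the log-likelihood, $N$ the number of observations, and $\beta$ the coefficient vector excluding the intercept $\beta_0$.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\,\ell(\beta_0, \beta)
  \;+\; \lambda \left[ \alpha\,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right]
```

Setting $\alpha = 1$ leaves only the L1 term (lasso) and $\alpha = 0$ leaves only the L2 term (ridge), consistent with the slide; $\lambda$ scales the whole penalty.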
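
The pattern on slides 18 and 19 (an S4 class for the result, a generic, and a method that starts a job and polls until it finishes) can be sketched in plain R without a running H2O instance. Everything named below (ParsedData, GLMModel, fitGLM, make_job, mock_poll) is hypothetical and stands in for the package's own classes and internal helpers; the "job" is a local counter, not a REST call.

```r
library(methods)

# Hypothetical stand-ins for the slide's classes; no H2O server is contacted.
setClass("ParsedData", representation(key = "character"))
setClass("GLMModel",
         representation(key = "character", data = "ParsedData", model = "list"))

# Mock job: an environment holding the number of polls left before completion.
make_job <- function(steps) {
  job <- new.env()
  job$remaining <- steps
  job
}

# Returns -1 once the mock job is done, mirroring the sentinel checked on slide 19.
mock_poll <- function(job) {
  job$remaining <- job$remaining - 1
  if (job$remaining <= 0) -1 else job$remaining
}

setGeneric("fitGLM", function(x, y, data, ...) standardGeneric("fitGLM"))

setMethod("fitGLM",
  signature(x = "character", y = "character", data = "ParsedData"),
  function(x, y, data, ...) {
    predictors <- paste(x, collapse = ",")     # comma-joined, as in the slide
    job <- make_job(3)                         # "start" the job
    while (mock_poll(job) != -1) {             # keep polling until completed
      Sys.sleep(0.01)
    }
    # Wrap the (mock) result in a model object that points back at its data.
    new("GLMModel", key = paste0(data@key, ".glm"), data = data,
        model = list(x = predictors, y = y))
  })

m <- fitGLM(x = c("AGE", "PSA"), y = "CAPSULE",
            data = new("ParsedData", key = "prostate.hex"))
print(m@model$x)
```

Dispatching on the class of data is what lets the same call work for any object that carries a key, which is the "pointer" idea from the speaker notes.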
