Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Ken and Fonda will talk through how organizations are embracing open source machine learning and AI platforms and what strategies to use to make the transformation easier.

- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Published in: Data & Analytics

  1. Open Source or Closed Source. Ken Sanford and Fonda Ingram, July 25, 2016
  2. Trending Now? 2015 SAS vs. R Survey Results – Burtch Works
  3. Trending Now?
     • Buy vs Build
       o Businesses are: "Engineering/Technology/Innovation companies"
     • Build yourself and potentially sell
     • Companies are hesitant to make a LARGE upfront capital investment before they see proven value
  4. Analytics POV Lifecycle
     • Discovery
       o Determine Business Objective
       o Determine Modeling Goal
     • Data Prep - Understanding
       o Data Collection
       o Data Exploration
       o Data Quality
       o Data Transformation
     • Model Building
       o Build Models
       o Model Assessment
     • Evaluation
       o Model Performance
       o Success criteria
     • Deployment
       o Monitoring and maintenance
       o Model Management
  5. Why Open Source?
  6. Why use Open Source?
     • Reduce vendor dependency
       o Run the program for any purpose
       o Customize the program - use cutting edge analytics NOW
     • Reduce cost
       o Freedom does not imply FREE
     • Responsive and Competitive
       o Innovate in Real Time
       o Rebuild in-house expertise and regain control
  7. Why use H2O?
     • Capital investment upfront is minimal
       o Download H2O, use it, and keep learning; once you mature, we can help you
     • Algorithms and Accuracy
       o Distributed implementation of cutting edge ML algorithms
     • Building components that touch all facets of the Analytics POV Lifecycle
     • Flexible API available in R, Python, Scala, REST/JSON
     • Community driven
  8. Why do customers hesitate?
     • Difficult to convert all SAS software to open source and keep my sanity...
     • I have been a SAS programmer for years...
     • What I have is working – why change...
     • I need a throat to choke if something goes wrong...
     • I like long product install times...
     • No one gets fired for buying SAS...
  9. What do I need to start?
     • Migration Strategy
       o Analytical Tool? (R or Python...)
       o Analytics Platform? (Hadoop, S3, etc.)
     • Start small and get your feet wet (with H2O)
     • May need to create a hybrid environment
  10. How to get started?
      • Existing Use Case
        o Review data requirements
      • Get your data into H2O
        o Select existing model to migrate
      • Identify algorithms – start small
      • Transition should be gradual
  11. Language Translation
      SAS               | H2O-R
      dataset           | dataframe
      observation       | row
      variable          | column
      BY-Group          | By function
      if else           | h2o.ifelse
      . (missing value) | NA (missing value)
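The translation table above can be sketched in R as follows. This is a minimal illustration, not the presenters' code: it assumes a local H2O cluster can be started with `h2o.init()`, and it uses R's built-in `iris` data (column names `Sepal.Length`, `Species`) as a stand-in for a migrated SAS dataset. The threshold `5.8` and the column name `long_sepal` are made up for the example.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data stands in for the SAS dataset.
h2odf <- as.h2o(iris)

# SAS BY-group processing -> grouped aggregation:
# mean sepal length for each species.
by_species <- h2o.group_by(data = h2odf, by = "Species", mean("Sepal.Length"))
print(by_species)

# SAS if/else -> h2o.ifelse: flag rows with a long sepal (hypothetical cutoff).
h2odf$long_sepal <- h2o.ifelse(h2odf$Sepal.Length > 5.8, 1, 0)

# SAS "." (missing) -> NA: keep only rows where the column is not missing.
complete <- h2odf[!is.na(h2odf$Sepal.Length), ]
```

Each SAS concept maps to a single call on the H2O frame, so a DATA step can usually be migrated one statement at a time.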
  12. How to Import data?
      Export SAS dataset to CSV file
        proc export data=work.Wheaderdataset
          outfile='/folders/myfolders/wheader.csv'
          dbms=csv replace;
        run;
      Import to H2O
        library(h2o)
        h2o.init()
        h2odf <- h2o.importFile(path = "h2o/data/iris_wheader.csv")
        stopifnot(nrow(h2odf) == 150)
  13. Munging – How to slice columns?
      Slicing Columns in a SAS dataset
        /* Slice 1 column SepalLength - keep or drop */
        data iris2;
          set sashelp.iris;
          keep SepalLength;
        run;
        /* Slice all columns except Species – keep or drop */
        data iris2;
          set sashelp.iris;
          keep PetalLength PetalWidth SepalLength SepalWidth;
        run;
      Slicing Columns in a H2O dataframe
        # slice 1 column by name
        c1_1 <- h2odf[, "sepal_len"]
        # slice cols by vector of names
        cols_1 <- h2odf[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]
  14. Munging – How to slice rows?
      Slicing Rows in a SAS dataset
        /* Slicing obs 15 from a SAS dataset */
        data subset1;
          set sashelp.iris (firstobs=15 obs=15);
        run;
        /* Slicing a range of obs from a SAS dataset */
        data subset2;
          set sashelp.iris (firstobs=25 obs=49);
        run;
      Slicing Rows from a H2O dataframe
        # slice 1 row by index
        c1 <- h2odf[15,]
        # slice a range of rows
        c1_1 <- h2odf[25:49,]
  15. Munging – How to slice rows?
      Slicing Rows in a SAS dataset
        /* Slicing with a value */
        data subset3;
          set sashelp.iris;
          if SepalLength > 50;
        run;
        /* Filter out missing values from a SAS dataset */
        data subset3;
          set sashelp.iris;
          if SepalLength = . then delete;
        run;
      Slicing Rows from a H2O dataframe
        # slice with a boolean mask
        mask <- h2odf[,"sepal_len"] < 4.4
        cols <- h2odf[mask,]
        # filter out missing values
        mask <- is.na(h2odf[,"sepal_len"])
        cols <- h2odf[!mask,]
  16. Munging – How to replace values?
      Replacing values in a SAS dataset
        /* Replace a single numerical datum */
        data iris;
          obsnum = 15;
          modify iris point=obsnum;
          SepalWidth = 2;
          replace;
          stop;
        run;
        /* Replace a whole column */
        data iris;
          modify iris;
          SepalWidth = SepalWidth * 3;
          replace;
        run;
      Replacing values in a H2O dataframe
        # replace a single numerical datum
        h2odf[15,3] <- 2
        # replace a whole column
        h2odf[,1] <- 3*h2odf[,1]
  17. Munging – How to replace values?
      Replacing values in a SAS dataset
        /* replacement with if */
        data iris1;
          modify iris1;
          if SepalLength < 4.4 then SepalLength = 22;
          replace;
        run;
        /* Replace missing values with 0 */
        data iris1;
          modify iris1;
          if SepalLength = . then SepalLength = 0;
          replace;
        run;
      Replacing values in a H2O dataframe
        # replacement with ifelse
        h2odf[,"sepal_len"] <- h2o.ifelse(h2odf[,"sepal_len"] < 4.4, 22, h2odf[,"sepal_len"])
        # replace missing values with 0
        h2odf[is.na(h2odf[,"sepal_len"]), "sepal_len"] <- 0
  18. Algorithms on H2O: Supervised Learning
      • Statistical Analysis
        o Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie
        o Naïve Bayes
      • Ensembles
        o Distributed Random Forest: Classification or regression models
        o Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
      • Deep Neural Networks
        o Deep learning: Creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
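As a rough illustration of how one of the supervised algorithms listed above is called from R, the sketch below trains a Gradient Boosting Machine. It is an assumption-laden example rather than anything shown in the talk: it uses R's built-in `iris` data, an 80/20 split, and arbitrary settings (`ntrees = 20`, `seed = 1234`).

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data is the training set; Species is the target.
h2odf <- as.h2o(iris)
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Split into roughly 80% training and 20% test rows.
splits <- h2o.splitFrame(h2odf, ratios = 0.8, seed = 1234)
train <- splits[[1]]
test <- splits[[2]]

# Train a small GBM classifier (settings chosen only for illustration).
gbm_model <- h2o.gbm(x = predictors, y = "Species",
                     training_frame = train, ntrees = 20, seed = 1234)

# Evaluate on the held-out rows and score them.
perf <- h2o.performance(gbm_model, newdata = test)
print(perf)
preds <- h2o.predict(gbm_model, test)
```

Swapping `h2o.gbm` for `h2o.glm`, `h2o.randomForest`, or `h2o.deeplearning` keeps the same `x`, `y`, and `training_frame` arguments, which is what makes starting small with one algorithm practical.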
  19. Algorithms on H2O: Unsupervised Learning
      • Clustering
        o K-means: Partitions observations into k clusters/groups of the same spatial size
      • Dimensionality Reduction
        o Principal Component Analysis: Linearly transforms correlated variables to independent components
        o Generalized Low Rank Models*: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
      • Anomaly Detection
        o Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning
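The unsupervised algorithms listed above follow the same calling pattern. The sketch below runs K-means and PCA on the four numeric columns of R's built-in `iris` data. The data, `k = 3` clusters, `k = 2` components, and the seed are all assumptions made for illustration.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data; only the numeric columns are used.
h2odf <- as.h2o(iris)
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# K-means with 3 clusters on standardized inputs.
km <- h2o.kmeans(training_frame = h2odf, x = predictors,
                 k = 3, standardize = TRUE, seed = 1234)
clusters <- h2o.predict(km, h2odf)   # one cluster label per row

# PCA keeping the first 2 principal components.
pca <- h2o.prcomp(training_frame = h2odf, x = predictors,
                  k = 2, transform = "STANDARDIZE")
scores <- h2o.predict(pca, h2odf)    # projected coordinates per row
```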
  20. Modeling techniques to help you analyze data (SAS, H2O and R)
      Algorithm       | SAS                                          | R                   | H2O
      GLM             | proc glm, proc reg, proc logistic, proc genmod | glmnet, lm          | h2o.glm
      PCA             | proc princomp                                | princomp            | h2o.prcomp
      Factor Analysis | proc factor                                  | factanal, factor.pa | -
      SVD             | proc hptmine, proc hpsvm                     | svd                 | h2o.svd
      Clustering      | proc fastclus, proc hpclus                   | kmeans              | h2o.kmeans
      Random Forest   | proc hpforest (EM Node)                      | randomForest        | h2o.randomForest
  21. Modeling techniques to help you analyze data (SAS, H2O and R)
      Algorithm                                       | SAS                                                                      | R       | H2O
      Gradient Boosting                               | proc arboretum                                                           | gbm     | h2o.gbm
      Neural Networks                                 | proc hpneural (EM Node), autoneural (EM Node), proc neural, proc dmneural | nnet    | h2o.deeplearning
      Ensemble (Stacking)                             | proc ensemble (EM Node)                                                  | -       | h2o.ensemble, h2o.metalearn, h2o.stack (in dev)
      GLRM (Cluster Analysis, Recommendation Engines) | NA                                                                       | NA      | h2o.glrm
      Kernel Density Estimation                       | proc kde                                                                 | density | -
  22. Modeling techniques to help you analyze data (SAS, H2O and R)
      Algorithm                   | SAS                      | R       | H2O
      Variable Clustering         | proc varclus             | varclus | -
      ARIMA                       | proc arima               | arima   | -
      Autoregressive Models       | proc autoreg             | ar      | -
      Correlation                 | proc corr                | cor     | -
      Survival Models             | proc phreg               | coxph   | Not currently available -- h2o.coxph
      Linear Mixed Effects Models | proc mixed, proc glimmix | lme     | -
  23. Modeling techniques to help you analyze data (SAS, H2O and R)
      Algorithm                 | SAS                         | R                                               | H2O
      Summary                   | proc summary, proc means    | summary, mean, median, max/min, quantile, variance | -
      Grouping/Sort/Rank        | proc sort, proc rank        | aggregate, ddply, order, datatable              | -
      Exploratory Data Analysis | proc univariate, proc hpbin | moments (package), hist, ecdf, qqnorm, pnorm    | -
  24. Modeling techniques to help you analyze data (SAS, H2O and R)
      Algorithm | SAS               | R                                  | H2O
      Plots     | gplot, sgplot     | ggplot2, ggvis, rgl, htmlwidgets   | -
      Sampling  | proc surveyselect | runif, sample                      | h2o.runif
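The sampling row above pairs `proc surveyselect` with `h2o.runif`. One common way to use it is to draw a uniform random number per row and then filter on it. The sketch below assumes R's built-in `iris` data, an arbitrary 80/20 split, and an arbitrary seed.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data stands in for the migrated dataset.
h2odf <- as.h2o(iris)

# One uniform random number in [0, 1) per row; the seed makes it repeatable.
r <- h2o.runif(h2odf, seed = 42)

# Roughly 80% of rows go to the sample, the rest to a holdout set.
sampled <- h2odf[r < 0.8, ]
holdout <- h2odf[r >= 0.8, ]
```

Because the split is driven by a random draw, the sample size is only approximately 80% of the rows; every row still lands in exactly one of the two frames.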
  25. Questions
