Skutil - H2O meets Sklearn - Taylor Smith

Skutil brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love.

- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Published in: Technology

  1. Scikit-Util: H2O Meets Sklearn. Taylor Smith, October 26, 2016
  2. Agenda
     - About me
     - Problem statement
     - Overview
     - Package motivation
     - Notable H2O additions
     - Side-by-side
     - Questions
  3. About me
     - Taylor Smith
     - Data scientist at State Farm
     - M.S. Analytics from The University of Texas at Austin
     - ~3 years in data science, ~6 years writing software
     - tgsmith61591@gmail.com
     - http://github.com/tgsmith61591
     - https://www.linkedin.com/in/taylorgsmith
     - @TayGriffinSmith
  4. Problem statement: Why am I standing here talking to you?
  5. DS/DE: typical division of labor
     - Data scientist: (1) frame the problem, (2) gather raw data, (3) analyze
     - Data engineer: (1) gather raw data, (2) consolidate data, (3) production
  6. Where's the disconnect?
     - Exploration
       - Technologies (Hadoop/Spark/Python/R)
     - Implementation
       - Technologies (Python/R/Java)
       - Dependencies/versioning
     - Discrepancy in tooling
  7. Package motivation
     - What is skutil?
       - Began as a pre-processing library to unify Caret, sklearn, etc.
       - Specifically relevant to actuarial departments (why?)
       - Evolved to include H2O modules
     - Objectives:
       - Deliver an easy transition into the world of distributed computing that H2O offers
       - Help bridge the "gap" between data scientist and data engineer roles
       - Provide the same, familiar interface that sklearn users have come to know and love
  8. Package motivation [cont'd]
     - Regarding R…
       - H2O package completeness
     - Why Python…
       - Quickly growing active user base
       - Easily supported by non-DS engineers
       - CI/CD friendly
     - https://www.r-bloggers.com/on-the-growth-of-r-and-python-for-data-science/
  9. Skutil: notable H2O additions
     - H2OPipeline
       - Similar to sklearn.pipeline.Pipeline
     - Diagram: H2OTransformer → H2OTransformer → H2OEstimator
  10. Skutil: notable H2O additions [cont'd]
      - H2OGridSearchCV (and H2ORandomizedSearchCV)
        - Similar to the sklearn.grid_search module
      - Diagram: Parameter grid → Param set 0 … Param set n → Best model
  11. OK, I have a model… now what?
      - Deploying in Python?
        - Pickle-compatible persistence
        - Entire pipelines can be stored
      - Deploying the model in Java?
        - Leverage H2O's built-in "download POJO" capability*
        - (A future release will auto-generate a main class and compile a runnable fat jar)
      - * Just the H2O model; not the full pipeline
  12. Skutil at a glance: present and future
      - Current (v0.1.3)
        - Transformers
        - Feature selection
        - Imputation
        - Class balancers
        - Model selection & Pipelines
      - Road map
        - PySpark integration (thank you to fellow contributor, Charles Drotar)
        - Automated runnable jar creation using jinja +
  13. H2O vs. Sklearn: Skutil in action
  14. H2O vs. Sklearn: load data, split data, fit model
  15. Skutil vs. Sklearn: load data, split data, fit model
  16. Questions? Thank you!
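
Slides 9 and 10 describe H2OPipeline and H2OGridSearchCV as analogues of sklearn's Pipeline and grid search. The pattern they share can be sketched in plain Python without assuming anything about skutil's or H2O's actual API. Every name below (MeanImputer, ThresholdClassifier, Pipeline, grid_search) is hypothetical and exists only for illustration. Unlike a cross-validated search, this sketch scores each parameter set on a single held-out split.

```python
import itertools
from statistics import mean

# All names in this block are hypothetical illustrations of the
# transformer -> estimator pattern; none of them is a skutil or H2O API.


class MeanImputer:
    """Transformer: fills missing values (None) with the mean seen in fit."""

    def fit(self, xs, ys=None):
        self.fill_ = mean(x for x in xs if x is not None)
        return self

    def transform(self, xs):
        return [self.fill_ if x is None else x for x in xs]


class ThresholdClassifier:
    """Estimator: predicts 1 when a value exceeds a fixed threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, xs, ys):
        return self  # nothing is learned; the threshold is a hyperparameter

    def predict(self, xs):
        return [int(x > self.threshold) for x in xs]


class Pipeline:
    """Chains transformers in order and ends with one estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, xs, ys):
        for step in self.transformers:
            xs = step.fit(xs, ys).transform(xs)
        self.estimator.fit(xs, ys)
        return self

    def predict(self, xs):
        for step in self.transformers:
            xs = step.transform(xs)
        return self.estimator.predict(xs)


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def grid_search(param_grid, train, valid):
    """Fits one pipeline per parameter set; returns (score, params, model)."""
    names = sorted(param_grid)
    best = None
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        model = Pipeline([MeanImputer()], ThresholdClassifier(**params))
        model.fit(*train)
        score = accuracy(valid[1], model.predict(valid[0]))
        if best is None or score > best[0]:
            best = (score, params, model)
    return best


# Toy data: None marks a missing value that the imputer must fill.
train = ([1.0, None, 3.0, 8.0, 9.0, None], [0, 0, 0, 1, 1, 1])
valid = ([2.0, 4.0, None, 10.0], [0, 0, 1, 1])

best_score, best_params, best_model = grid_search(
    {"threshold": [2.0, 5.0, 9.0]}, train, valid
)
print(best_params, best_score)
```

The point of the design is that the imputer's fitted state travels with the model: the winning pipeline can be handed raw data containing missing values and still predict, which is the same convenience the deck attributes to pipelines in sklearn and skutil.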
