Machine Learning with H2O, Spark, and Python at Strata 2015


Machine Learning with H2O, Spark, and Python at Strata SJ 2015, by Cliff Click and Michal Malohlava


  1. Machine Intelligence: Fast, Scalable In-Memory Machine and Deep Learning For Smarter Applications. Python & Sparkling Water with H2O. Cliff Click, Michal Malohlava
  2. Who Am I? Cliff Click, CTO, Co-Founder
      ●  40 yrs coding
      ●  35 yrs building compilers
      ●  30 yrs distributed computation
      ●  20 yrs OS, device drivers, HPC, HotSpot
      ●  10 yrs Low-latency GC, custom Java hardware, NonBlockingHashMap
      ●  20 patents, dozens of papers
      ●  100s of public talks
      ●  PhD Computer Science, 1995, Rice University
      ●  HotSpot JVM Server Compiler: “showed the world JITing is possible”
  3. H2O: Open Source In-Memory Machine Learning for Big Data
      ●  Distributed In-Memory Math Platform: GLM, GBM, RF, K-Means, PCA, Deep Learning
      ●  Easy to use SDK & API: Java, R (CRAN), Scala, Spark, Python, JSON, Browser GUI
      ●  Use ALL your data: Modeling without sampling; HDFS, S3, NFS, NoSql
      ●  Big Data & Better Algorithms: Better Predictions!
  4. TBD. Customer Support TBD Head of Sales Distributed Systems Engineers Making ML Scale!
  5. Practical Machine Learning (Value: Requirements)
      ●  Fast & Interactive: In-Memory
      ●  Big Data (No Sampling): Distributed
      ●  Ownership: Open Source
      ●  Extensibility: API/SDK
      ●  Portability: Java, REST/JSON
      ●  Infrastructure: Cloud or On-Premise, Hadoop or Private Cluster
  6. H2O Architecture [diagram labels: Prediction Engine; R & Exec Engine; Web Interface; Spark Scala REPL; Nano-Fast Scoring Engine; Distributed In-Memory K/V Store; Column Compress Data; Map/Reduce; Memory Manager; Algorithms! GBM, Random Forest, GLM, PCA, K-Means, Deep Learning; HDFS, S3, NFS; RealTime DataFlow]
  8. Python & Sparkling Water
      ●  CitiBike of NYC
      ●  Predict bikes-per-hour-per-station
          –  From per-trip logs
      ●  10M rows of data
      ●  Group-By, date/time feature-munging
      Demo!
  9. H2O: A Platform for Big Math
      ●  Most Any Java on Big 2-D Tables
          –  Write like it's single-thread POJO code
          –  Runs distributed & parallel by default
      ●  Fast: billion row logistic regression takes 4 sec
      ●  World's first parallel & distributed GBM
          –  Plus Deep Learn / Neural Nets, RF, PCA, K-means...
      ●  R integration: use terabyte datasets from R
      ●  Sparkling Water: Direct Spark integration
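The "write it like single-thread code, run it distributed" idea from slide 9 can be illustrated with a tiny map/reduce over chunks of a column. This is only a sketch in plain Scala collections: it does not use H2O's own task API, and the names `MapReduceSketch`, `mapChunk`, `reduce`, and `mean` are hypothetical.

```scala
object MapReduceSketch {
  // "Map" step: each chunk of the column is summarized on its own,
  // so chunks could be processed on different nodes in parallel.
  def mapChunk(chunk: Array[Double]): (Double, Long) =
    (chunk.sum, chunk.length.toLong)

  // "Reduce" step: combine two partial (sum, count) summaries.
  def reduce(a: (Double, Long), b: (Double, Long)): (Double, Long) =
    (a._1 + b._1, a._2 + b._2)

  // Mean of a column stored as chunks; yields NaN when there are no rows.
  def mean(chunks: Seq[Array[Double]]): Double = {
    val (sum, count) = chunks.map(mapChunk).foldLeft((0.0, 0L))(reduce)
    sum / count
  }
}
```

Because `reduce` is associative, the partial results can be combined in any order, which is what makes the per-chunk work safe to distribute.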
  10. H2O: A Platform for Big Math
      ●  Easy launch: “java -jar h2o.jar”
          –  No GC tuning: -Xmx as big as you like
      ●  Production ready:
          –  Private on-premise cluster OR In the Cloud
          –  Hadoop, Yarn, EC2, or standalone cluster
          –  HDFS, S3, NFS, URI & other datasources
          –  Open Source, Apache v2
  11. Can I call H2O’s algorithms from my Spark workflow?
  12. YES, you can!
  13. Sparkling Water
  14. Sparkling Water
      ●  Provides transparent integration into the Spark ecosystem
      ●  Pure H2ORDD encapsulating H2O DataFrame
      ●  Transparent use of H2O data structures and algorithms with the Spark API
      ●  Excels in Spark workflows requiring advanced Machine Learning algorithms
  15. Sparkling Water Design [diagram labels: spark-submit; Sparkling App; Spark Master JVM; three Spark Worker JVMs; Sparkling Water Cluster of three Spark Executor JVMs, each running H2O]
  16. Data Distribution [diagram labels: Data Source (e.g. HDFS); Sparkling Water Cluster of three Spark Executor JVMs, each running H2O; H2O RDD; Spark RDD]. RDDs and DataFrames share the same memory space
  17. Demo time!
  18. SPARKLING WATER DEMO H2O.AI Created by / @h2oai
  19. LAUNCH SPARKLING SHELL
      > export SPARK_HOME="/path/to/spark/installation"
      > bin/sparkling-shell
  20. PREPARE AN ENVIRONMENT
      val DIR_PREFIX = "/Users/michal/Devel/projects/h2o/repos/h2o2/bigdata/laptop/
      // Common imports
      import org.apache.spark.h2o._
      import org.apache.spark.examples.h2o._
      import org.apache.spark.examples.h2o.DemoUtils._
      import org.apache.spark.sql.SQLContext
      import water.fvec._
      import hex.tree.gbm.GBM
      import hex.tree.gbm.GBMModel.GBMParameters
      // Initialize Spark SQLContext
      implicit val sqlContext = new SQLContext(sc)
      import sqlContext._
  21. LAUNCH H2O SERVICES
      implicit val h2oContext = new H2OContext(sc).start()
      import h2oContext._
  22. LOAD CITIBIKE DATA USING H2O API
      val dataFiles = Array[String](
        "2013-07.csv", "2013-08.csv", "2013-09.csv",
        "2013-10.csv", "2013-11.csv", "2013-12.csv").map(f => new, f))
      // Load and parse data
      val bikesDF = new DataFrame(dataFiles:_*)
      // Rename columns and remove all spaces in header
      val colNames = bikesDF.names().map( n => n.replace(' ', '_'))
      bikesDF._names = colNames
      bikesDF.update(null)
  23. USER-DEFINED COLUMN TRANSFORMATION
      // Select column 'starttime'
      val startTimeF = bikesDF('starttime)
      // Invoke column transformation and append the created column
      bikesDF.add(new TimeSplit().doIt(startTimeF))
      // Do not forget to update frame in K/V store
      bikesDF.update(null)
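As a rough illustration of what a column transformation such as `TimeSplit` might compute, the sketch below derives a whole-day index from each trip's start time. It uses plain Scala collections rather than the H2O API. The names `DaysSketch`, `daysSinceEpoch`, and `daysColumn` are hypothetical, and the choice of "days since the Unix epoch" is an assumption, not the demo's confirmed implementation.

```scala
object DaysSketch {
  // Milliseconds in one day.
  val MillisPerDay: Long = 1000L * 60 * 60 * 24

  // Hypothetical helper: whole days elapsed since the Unix epoch (UTC).
  // floorDiv keeps the result correct for timestamps before 1970 as well.
  def daysSinceEpoch(startTimeMillis: Long): Long =
    Math.floorDiv(startTimeMillis, MillisPerDay)

  // Apply the transformation to a whole column, producing a new "Days" column.
  def daysColumn(startTimes: Seq[Long]): Seq[Long] =
    startTimes.map(daysSinceEpoch)
}
```

A derived day index like this gives later steps a coarse key to group and join on.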
  24. OPEN H2O FLOW UI
      openFlow
      AND EXPLORE DATA...
      > getFrames
      ...
  25. FROM H2O'S DATAFRAME TO RDD
      val bikesRdd = asSchemaRDD(bikesDF)
  26. USE SPARK SQL
      // Register table and SQL table
      sqlContext.registerRDDAsTable(bikesRdd, "bikesRdd")
      // Perform SQL group operation
      val bikesPerDayRdd = sql(
        """SELECT Days, start_station_id, count(*) bikes
          |FROM bikesRdd
          |GROUP BY Days, start_station_id """.stripMargin)
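For readers less familiar with SQL, the group-by on slide 26 corresponds to the following operation on an ordinary in-memory collection. This is a sketch only: the `Trip` case class is a hypothetical, reduced view of one row, and nothing here touches Spark or H2O.

```scala
object GroupBySketch {
  // Hypothetical, minimal view of one trip: only the two grouping columns.
  final case class Trip(days: Long, startStationId: Int)

  // Count trips per (Days, start_station_id), mirroring
  // SELECT Days, start_station_id, count(*) ... GROUP BY Days, start_station_id.
  def bikesPerDay(trips: Seq[Trip]): Map[(Long, Int), Int] =
    trips
      .groupBy(t => (t.days, t.startStationId))
      .map { case (key, rows) => (key, rows.size) }
}
```

Each key in the resulting map plays the role of one output row, with the count as the `bikes` column.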
  27. FROM RDD TO H2O'S DATAFRAME
      val bikesPerDayDF:DataFrame = bikesPerDayRdd
      AND PERFORM ADDITIONAL COLUMN TRANSFORMATION
      // Select "Days" column
      val daysVec = bikesPerDayDF('Days)
      // Refine column into "Month" and "DayOfWeek"
      val finalBikeDF = bikesPerDayDF.add(new TimeTransform().doIt(daysVec))
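The refinement of "Days" into "Month" and "DayOfWeek" can be sketched with `java.time`. This assumes that "Days" holds days since the Unix epoch, which the slides do not state. `TimeTransformSketch` and `monthAndDayOfWeek` are hypothetical names and not the demo's `TimeTransform` class.

```scala
import java.time.LocalDate

object TimeTransformSketch {
  // Hypothetical helper: split a days-since-epoch value into
  // (month 1..12, ISO day of week where 1 = Monday and 7 = Sunday).
  def monthAndDayOfWeek(daysSinceEpoch: Long): (Int, Int) = {
    val date = LocalDate.ofEpochDay(daysSinceEpoch)
    (date.getMonthValue, date.getDayOfWeek.getValue)
  }
}
```

Calendar features like these let a tree model pick up seasonal and weekday/weekend effects that a raw day counter hides.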
  29. GBM MODEL BUILDER
      def buildModel(df: DataFrame, trees: Int = 200, depth: Int = 6):R2 = {
        // Split into train, test, and hold-out parts
        val frs = splitFrame(df, Seq("train.hex", "test.hex", "hold.hex"), Seq(0.6, 0.3, 0.1))
        val (train, test, hold) = (frs(0), frs(1), frs(2))
        // Configure GBM parameters
        val gbmParams = new GBMParameters()
        gbmParams._train = train
        gbmParams._valid = test
        gbmParams._response_column = 'bikes
        gbmParams._ntrees = trees
        gbmParams._max_depth = depth
        // Build a model
        val gbmModel = new GBM(gbmParams).trainModel.get
        // Score datasets
        Seq(train,test,hold).foreach(gbmModel.score(_).delete)
        // Collect R2 metrics
        val result = R2("Model #1", r2(gbmModel, train), r2(gbmModel, test), r2(gbmModel, hold))
        // Perform clean-up
        Seq(train, test, hold).foreach(_.delete())
        result
      }
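Two ideas in the model builder are easy to show in isolation: splitting rows by ratios and computing R2 (the coefficient of determination). The sketch below uses plain Scala collections. `ModelEvalSketch`, `splitByRatios`, and this `r2` are hypothetical helpers, and the split here is sequential and deterministic, which may differ from how the demo's `splitFrame` assigns rows.

```scala
object ModelEvalSketch {
  // Split rows into consecutive parts whose sizes follow the given ratios.
  // Assumption: ratios sum to (approximately) 1.0.
  def splitByRatios[A](rows: Seq[A], ratios: Seq[Double]): Seq[Seq[A]] = {
    // Cumulative ratios become row-index boundaries, e.g. 0, 6, 9, 10 for 10 rows.
    val bounds = ratios.scanLeft(0.0)(_ + _).map(r => math.round(r * rows.size).toInt)
    bounds.zip(bounds.tail).map { case (from, until) => rows.slice(from, until) }
  }

  // R2 = 1 - SS_res / SS_tot. Not a finite number when all actual values are equal.
  def r2(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size, "inputs must be non-empty and aligned")
    val mean = actual.sum / actual.size
    val ssRes = actual.zip(predicted).map { case (a, p) => (a - p) * (a - p) }.sum
    val ssTot = actual.map(a => (a - mean) * (a - mean)).sum
    1.0 - ssRes / ssTot
  }
}
```

Reporting R2 separately on the train, test, and hold-out parts, as the slide does, makes overfitting visible: a high training score with a much lower hold-out score is the warning sign.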
  30. BUILD A GBM MODEL
      val result1 = buildModel(finalBikeDF)
  32. LOAD WEATHER DATA USING SPARK API
      // Load weather data in NY 2013
      val weatherData = sc.textFile(DIR_PREFIX + "31081_New_York_City__Hourly_2013.csv")
      // Parse data and filter them
      val weatherRdd =",")).
        map(row => NYWeatherParse(row)).
        filter(!_.isWrongRow()).
        filter(_.HourLocal == Some(12)).setName("weather").cache()
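The parse-then-filter pattern on slide 32 (drop malformed rows, keep only the noon reading) can be sketched on plain strings. The three-column layout `days,hour,temperature` and the `Weather` case class are assumptions made for illustration; the demo's `NYWeatherParse` and the real CSV have a different, richer schema.

```scala
import scala.util.Try

object WeatherFilterSketch {
  // Hypothetical, reduced weather record; missing or unparsable fields become None.
  final case class Weather(days: Option[Long], hourLocal: Option[Int], temperature: Option[Double]) {
    // A row is unusable when it lacks the fields needed for filtering and joining.
    def isWrongRow: Boolean = days.isEmpty || hourLocal.isEmpty
  }

  // Parse one CSV line, assuming the column order days,hour,temperature.
  def parse(line: String): Weather = {
    val cols = line.split(",", -1).map(_.trim)
    def col(i: Int): Option[String] =
      if (i < cols.length && cols(i).nonEmpty) Some(cols(i)) else None
    Weather(
      col(0).flatMap(s => Try(s.toLong).toOption),
      col(1).flatMap(s => Try(s.toInt).toOption),
      col(2).flatMap(s => Try(s.toDouble).toOption))
  }

  // Keep well-formed rows recorded at local noon, mirroring the two filters on the slide.
  def noonReadings(lines: Seq[String]): Seq[Weather] =
    lines.map(parse).filter(!_.isWrongRow).filter(_.hourLocal == Some(12))
}
```

Keeping a single reading per day is what allows the later join on the day column to produce one weather row per bike row.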
  33. CREATE A JOINED TABLE USING H2O'S DATAFRAME AND SPARK'S RDD
      // Join with bike table
      sqlContext.registerRDDAsTable(weatherRdd, "weatherRdd")
      sqlContext.registerRDDAsTable(asSchemaRDD(finalBikeDF), "bikesRdd")
      val bikesWeatherRdd = sql(
        """SELECT b.Days, b.start_station_id, b.bikes,
          |b.Month, b.DayOfWeek,
          |w.DewPoint, w.HumidityFraction, w.Prcp1Hour,
          |w.Temperature, w.WeatherCode1
          | FROM bikesRdd b
          | JOIN weatherRdd w
          | ON b.Days = w.Days """.stripMargin)
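The join on slide 33 is an inner join keyed on the day column. A minimal in-memory equivalent looks like the sketch below. The case classes are hypothetical and carry only a subset of the columns named in the query; no Spark or H2O calls are involved.

```scala
object JoinSketch {
  // Hypothetical, reduced row types for the two tables.
  final case class BikeRow(days: Long, startStationId: Int, bikes: Int)
  final case class WeatherRow(days: Long, temperature: Double)
  final case class Joined(days: Long, startStationId: Int, bikes: Int, temperature: Double)

  // Inner join on the day key, like JOIN ... ON b.Days = w.Days.
  // Bike rows without a matching weather row are dropped.
  def join(bikes: Seq[BikeRow], weather: Seq[WeatherRow]): Seq[Joined] = {
    val weatherByDay: Map[Long, Seq[WeatherRow]] = weather.groupBy(_.days)
    for {
      b <- bikes
      w <- weatherByDay.getOrElse(b.days, Seq.empty)
    } yield Joined(b.days, b.startStationId, b.bikes, w.temperature)
  }
}
```

Indexing the smaller table by key first avoids comparing every bike row against every weather row.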
  34. BUILD A NEW MODEL USING SPARK'S RDD IN H2O'S API
      val result2 = buildModel(bikesWeatherRdd)
