Machine Learning with H2O, Spark, and Python at Strata 2015

Sri Ambati
CEO & Founder at H2O.ai
H2O.ai
Machine Intelligence
Fast, Scalable In-Memory Machine and Deep Learning
For Smarter Applications
Python & Sparkling Water
with H2O
Cliff Click
Michal Malohlava
Who Am I?
Cliff Click
CTO, Co-Founder H2O.ai
cliff@h2o.ai
40 yrs coding
35 yrs building compilers
30 yrs distributed computation
20 yrs OS, device drivers, HPC, HotSpot
10 yrs Low-latency GC, custom java hardware
NonBlockingHashMap
20 patents, dozens of papers
100s of public talks
PhD Computer Science
1995 Rice University
HotSpot JVM Server Compiler
“showed the world JITing is possible”
H2O Open Source In-Memory
Machine Learning for Big Data
Distributed In-Memory Math Platform
GLM, GBM, RF, K-Means, PCA, Deep Learning
Easy to use SDK & API
Java, R (CRAN), Scala, Spark, Python, JSON, Browser GUI
Use ALL your data
Modeling without sampling
HDFS, S3, NFS, NoSQL
Big Data & Better Algorithms
Better Predictions!
Customer Support (TBD)
Head of Sales (TBD)
Distributed Systems Engineers
Making ML Scale!
Practical Machine Learning
Value → Requirement
Fast & Interactive → In-Memory
Big Data (No Sampling) → Distributed
Ownership → Open Source
Extensibility → API/SDK
Portability → Java, REST/JSON
Infrastructure → Cloud or On-Premise Hadoop, or Private Cluster
H2O Architecture
Interfaces: Prediction Engine, R & Exec Engine, Web Interface, Spark Scala REPL, Nano-Fast Scoring Engine
Core: Distributed In-Memory K/V Store, Column-Compressed Data, Map/Reduce, Memory Manager
Algorithms: GBM, Random Forest, GLM, PCA, K-Means, Deep Learning
Storage: HDFS, S3, NFS
Real-Time DataFlow
Python & Sparkling Water
●  CitiBike of NYC
●  Predict bikes-per-hour-per-station
–  From per-trip logs
●  10M rows of data
●  Group-By, date/time feature-munging
Demo!
H2O: A Platform for Big Math
●  Most Any Java on Big 2-D Tables
–  Write like it's single-threaded POJO code
–  Runs distributed & parallel by default
●  Fast: a billion-row logistic regression takes 4 sec
●  World's first parallel & distributed GBM
–  Plus Deep Learning / Neural Nets, RF, PCA, K-Means...
●  R integration: use terabyte datasets from R
●  Sparkling Water: Direct Spark integration
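The "write it single-threaded, it runs parallel" model can be pictured as a chunk-wise map/reduce. A minimal plain-Python sketch (conceptual only, not the H2O API): a column is stored as chunks, the per-chunk "map" runs in parallel across the cluster, and a "reduce" combines the partials.

```python
# Conceptual sketch of H2O's chunked map/reduce (plain Python, not the H2O API)
from functools import reduce

column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# "Map" phase: each chunk independently computes a partial (sum, count)
chunks = [column[i:i + 2] for i in range(0, len(column), 2)]
partials = [(sum(c), len(c)) for c in chunks]

# "Reduce" phase: combine the partial results pairwise
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
mean = total / count  # same answer as the single-threaded sum(column)/len(column)
```

You write only the per-chunk logic and the combiner; the framework handles distribution.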
H2O: A Platform for Big Math
●  Easy launch: “java -jar h2o.jar”
–  No GC tuning: -Xmx as big as you like
●  Production ready:
–  Private on-premise cluster OR
In the Cloud
–  Hadoop, Yarn, EC2, or standalone cluster
–  HDFS, S3, NFS, URI & other data sources
–  Open Source, Apache v2
Can I call H2O’s algorithms from my Spark workflow?
YES, you can!
Sparkling Water
Sparkling Water Provides
Transparent integration into the Spark ecosystem
A pure H2ORDD encapsulating an H2O DataFrame
Transparent use of H2O data structures and algorithms with the Spark API
Excels in Spark workflows requiring advanced Machine Learning algorithms
Sparkling Water Design
[Diagram: spark-submit launches a Sparkling App via the Spark Master JVM onto Spark Worker JVMs; each Spark Executor JVM embeds an H2O instance, and together the executors form the Sparkling Water Cluster.]
Data Distribution
[Diagram: data is loaded from a source (e.g. HDFS) into the Sparkling Water Cluster; inside each Spark Executor JVM, the H2O RDD and the Spark RDD share the same memory space.]
Demo time!
SPARKLING WATER DEMO
LAUNCH SPARKLING SHELL
> export SPARK_HOME="/path/to/spark/installation"
> bin/sparkling-shell
PREPARE AN ENVIRONMENT
val DIR_PREFIX = "/Users/michal/Devel/projects/h2o/repos/h2o2/bigdata/laptop/" // closing quote restored; point this at your local data directory
// Common imports
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.examples.h2o.DemoUtils._
import org.apache.spark.sql.SQLContext
import water.fvec._
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
// Initialize Spark SQLContext
implicit val sqlContext = new SQLContext(sc)
import sqlContext._
LAUNCH H2O SERVICES
implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._
LOAD CITIBIKE DATA
USING H2O API
val dataFiles = Array[String](
"2013-07.csv", "2013-08.csv", "2013-09.csv", "2013-10.csv",
"2013-11.csv", "2013-12.csv").map(f => new java.io.File(DIR_PREFIX, f))
// Load and parse data
val bikesDF = new DataFrame(dataFiles:_*)
// Rename columns and remove all spaces in header
val colNames = bikesDF.names().map( n => n.replace(' ', '_'))
bikesDF._names = colNames
bikesDF.update(null)
USER-DEFINED COLUMN TRANSFORMATION
// Select column 'starttime'
val startTimeF = bikesDF('starttime)
// Invoke column transformation and append the created column
bikesDF.add(new TimeSplit().doIt(startTimeF))
// Do not forget to update frame in K/V store
bikesDF.update(null)
OPEN H2O FLOW UI
openFlow
AND EXPLORE DATA...
> getFrames
...
FROM H2O'S DATAFRAME TO RDD
val bikesRdd = asSchemaRDD(bikesDF)
USE SPARK SQL
// Register the RDD as a SQL table
sqlContext.registerRDDAsTable(bikesRdd, "bikesRdd")
// Perform SQL group operation
val bikesPerDayRdd = sql(
"""SELECT Days, start_station_id, count(*) bikes
|FROM bikesRdd
|GROUP BY Days, start_station_id """.stripMargin)
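What the query computes can be sketched in plain Python: count trips per (Days, start_station_id) pair. The rows below are made up for illustration; the field names mirror the SQL.

```python
# Plain-Python sketch of the GROUP BY ... count(*) above (illustrative rows)
from collections import Counter

trips = [
    {"Days": 100, "start_station_id": 1},
    {"Days": 100, "start_station_id": 1},
    {"Days": 100, "start_station_id": 2},
    {"Days": 101, "start_station_id": 1},
]

# One count per (day, station) group — the "bikes" column of the result
bikes_per_day = Counter((t["Days"], t["start_station_id"]) for t in trips)
```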
FROM RDD TO H2O'S DATAFRAME
val bikesPerDayDF:DataFrame = bikesPerDayRdd
AND PERFORM ADDITIONAL COLUMN TRANSFORMATION
// Select "Days" column
val daysVec = bikesPerDayDF('Days)
// Refine column into "Month" and "DayOfWeek"
val finalBikeDF = bikesPerDayDF.add(new TimeTransform().doIt(daysVec))
TIME TO BUILD A MODEL!
GBM MODEL BUILDER
def buildModel(df: DataFrame, trees: Int = 200, depth: Int = 6):R2 = {
// Split into train and test parts
val frs = splitFrame(df, Seq("train.hex", "test.hex", "hold.hex"), Seq(0.6, 0.3, 0.1))
val (train, test, hold) = (frs(0), frs(1), frs(2))
// Configure GBM parameters
val gbmParams = new GBMParameters()
gbmParams._train = train
gbmParams._valid = test
gbmParams._response_column = 'bikes
gbmParams._ntrees = trees
gbmParams._max_depth = depth
// Build a model
val gbmModel = new GBM(gbmParams).trainModel.get
// Score datasets
Seq(train,test,hold).foreach(gbmModel.score(_).delete)
// Collect R2 metrics
val result = R2("Model #1", r2(gbmModel, train), r2(gbmModel, test), r2(gbmModel, hold))
// Perform clean-up
Seq(train, test, hold).foreach(_.delete())
result
}
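The 60/30/10 split performed by splitFrame above can be pictured as a ratio-based random row assignment. A minimal plain-Python sketch (illustrative only; H2O's actual splitFrame operates on distributed frames and differs in detail):

```python
# Illustrative ratio-based train/test/hold split (not H2O's implementation)
import random

random.seed(42)  # reproducible assignment
rows = list(range(1000))
split = {"train": [], "test": [], "hold": []}
for r in rows:
    x = random.random()
    if x < 0.6:      # ~60% of rows
        split["train"].append(r)
    elif x < 0.9:    # next ~30%
        split["test"].append(r)
    else:            # remaining ~10%
        split["hold"].append(r)
```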
BUILD A GBM MODEL
val result1 = buildModel(finalBikeDF)
CAN WE IMPROVE THE MODEL
BY USING INFORMATION
ABOUT WEATHER?
LOAD WEATHER DATA
USING SPARK API
// Load weather data in NY 2013
val weatherData = sc.textFile(DIR_PREFIX + "31081_New_York_City__Hourly_2013.csv")
// Parse and filter the data
val weatherRdd = weatherData.map(_.split(",")).
map(row => NYWeatherParse(row)).
filter(!_.isWrongRow()).
filter(_.HourLocal == Some(12)).setName("weather").cache()
CREATE A JOINED TABLE
USING H2O'S DATAFRAME AND SPARK'S RDD
// Join with bike table
sqlContext.registerRDDAsTable(weatherRdd, "weatherRdd")
sqlContext.registerRDDAsTable(asSchemaRDD(finalBikeDF), "bikesRdd")
val bikesWeatherRdd = sql(
"""SELECT b.Days, b.start_station_id, b.bikes,
|b.Month, b.DayOfWeek,
|w.DewPoint, w.HumidityFraction, w.Prcp1Hour,
|w.Temperature, w.WeatherCode1
| FROM bikesRdd b
| JOIN weatherRdd w
| ON b.Days = w.Days
""".stripMargin)
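The join's effect can be sketched in plain Python: each per-station-per-day bike count is matched with that day's noon weather reading on the Days key. Field names and values here are illustrative, not the real schema.

```python
# Plain-Python sketch of the SQL join above (illustrative data)
bikes_per_day = [
    {"Days": 100, "start_station_id": 1, "bikes": 2},
    {"Days": 101, "start_station_id": 1, "bikes": 1},
]
weather = {100: {"Temperature": 25.0}, 101: {"Temperature": 18.5}}

# Inner join on Days: merge each bike row with its day's weather record
joined = [
    {**b, **weather[b["Days"]]}
    for b in bikes_per_day
    if b["Days"] in weather
]
```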
BUILD A NEW MODEL
USING SPARK'S RDD IN H2O'S API
val result2 = buildModel(bikesWeatherRdd)
Check out H2O.ai Training Books
http://learn.h2o.ai/

Check out the H2O.ai Blog
http://h2o.ai/blog/

Check out the H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata

Check out GitHub
https://github.com/h2oai
More info
Learn more about H2O at h2o.ai
Thank you!
Follow us at
@h2oai