OCT 29th, 2014 WEBINAR 
H2O 
Fast. Scalable. Machine Learning 
For Smarter Applications 
“Fluids are In, Animals are Out.” 
~ Svetlana Sicular, Gartner
Speakers 
Joel Horwitz 
Joel is a caffeine, data, and laughter driven product strategist. He is an 
active community member having founded Bay Area Analytics, Tweets 
regularly @JSHorwitz, blogs regularly joelshorwitz.com and speaks 
regularly at industry events. Always eager to learn and lend a helping hand 
makes him an invaluable asset to 0xdata. 
Michal Malohlava 
Michal is a geek, developer, Java, Linux, programming languages 
enthusiast developing software for over 10 years. He obtained PhD from 
the Charles University in Prague in 2012 and post-doc at Purdue 
University. 
H2O World Register at http://www.0xdata.com/h2o-world
Time is the only non-renewable resource. 
Speed Matters!
Today 
• Why Are We Here 
• Who We Are 
• How Do We Do It 
• Who We Work With 
• What We Believe 
• Demo and Q&A
A New Interpretation of Moore’s Law 
“Like the physical universe, the digital universe is large - by 2020 containing nearly as 
many digital bits as there are stars in the universe. It is doubling in size every two 
years, and by 2020 the digital universe - the data we create and copy annually - will 
reach 44 zettabytes, or 44 trillion gigabytes.” 
- IDC 2014
An Evolving Market to Meet the Demand 
RDBMS MPP 
Business 
Intelligence 
Data 
Science 
H 
O Distributed 
2 
File System 
Machine 
Learning
Decreasing Cost of Data is Driving Demand 
H 
O 
2 
1970 1980 1990 2000 2010 2020
H2O is the First Dedicated 
Machine Learning Open Source Platform 
H2O is for application developers and analysts who need 
scalable and fast machine learning. H2O is an open source 
predictive analytics platform. Unlike traditional analytics tools, 
H2O provides a combination of extraordinary math, a high 
performance parallel architecture, and unrivaled ease of use.
Who Are We? 
H 
2 
O
H2O Awards and Accolades 
• Top R Project of UserR Conference 2014 
• Fortune Big Data All-Stars 2014, Arno Candel 
• 100+ Meetups 
• 6000+ Users
H2O is Built for Speed and Scale 
• OpenSource 
• REST API 
• Native R Support 
• NanoFastTM Scoring Engine 
• Sophisticated Algorithms
H2O Seamlessly Integrates with Your Workflow 
• 20X Faster Imports and 3X 
Compression w/ .hex format. 
• 4 Billion Row Regression in 
Seconds. 
• Deploy in POJO or with our 
REST API
Code is incomplete without Community! 
Open Source Drives Innovation.
Law of Large Numbers Triumphs!
Every Generation Needs to Invent Its Own Math. 
ML is the new SQL!
What do our customers say about us? 
"The platform can generate Jar files to deploy models into production. This 
alone is a milestone!" - Hassan Namarvar, ShareThis 
“I have to give credit to H2O. They have a very complete way of showing which 
algorithm is the best.” – Nachum Shacham, Paypal 
“I analyzed 1 million rows training set, fitting a logistic regression with elastic 
penalty, and doing a grid search on parameters with 10-fold cross validation for 
each parameter combination… doing this analysis was a breeze… orders of 
magnitude faster than R.” - Antonio Molins, Netflix 
“Never have we had such a quick, simple, scalable and cost effective 
deployment solution for predictive modeling.” – Lou Carvalheira, Cisco
Advertising 
Better Conversions 
Brand Conversion Reach ROI 
Overall, I would say that the H2O platform is the most elegant open source in-memory ~ Hassan Namarvar, Principal Data Scientist
Fraud 
Better Detections 
Purchase 
Shopping Theft Passwords 
I have to give credit to H2O. 
They have a very complete way of showing which algorithm is the best. 
~ Nachum Shacham, Principal Data Scientist
Marketing 
Better Spend 
ROI 
Network Segments Measure 
H2O has established a new equilibrium point for performance, 
accuracy and cost for statistics and machine learning. 
~ Lou Carvalheira, Principal Data Scientist
Select Customers Powered by H2O
@hexadata & @mmalohlava 
presents 
Sparkling Water 
“Killer App for Spark”
Memory efficient 
Performance of computation 
Machine learning algorithms 
Parser, GUI, R-interface 
User-friendly API 
Large and active community 
Platform components - SQL 
Multitenancy
Sparkling Water 
+ 
RDD 
immutable 
world 
DataFrame 
mutable 
world
Sparkling Water 
RDD DataFrame
Sparkling Water 
Provides 
Transparent integration into Spark ecosystem 
Pure H2ORDD encapsulating H2O DataFrame 
Transparent use of H2O data structures and 
algorithms with Spark API 
Excels in Spark workflows requiring advanced 
Machine Learning algorithms
Sparkling Water Design 
implements 
spark-submit 
Spark 
Master 
JVM 
Spark 
Worker 
JVM 
Spark 
Worker 
JVM 
Spark 
Worker 
JVM 
Sparkling Water Cluster 
Spark 
Executor 
JVM 
H2O 
Spark 
Executor 
JVM 
H2O 
Spark 
Executor 
JVM 
H2O 
Sparklin 
g App 
jar file 
Contains application 
and Sparkling Water 
classes
Data Distribution 
Sparkling Water Cluster 
H2O 
H2O 
H2O 
Spark Executor JVM Data 
Source 
(e.g. 
HDFS) 
H2O 
RDD 
Spark 
RDD 
Spark Executor JVM 
Spark Executor JVM 
RDDs and DataFrames 
share same memory 
space
Demo Time
Flight delay prediction 
“Build a model using weather and 
flight data to predict delays of flights 
arriving to Chicago O’Hare 
International Airport”
Example Outline 
Load & Parse CSV data from 2 data sources 
Use Spark API to filter data, do SQL query for join 
Create a regression model 
Use model for delay prediction 
Plot residual plot from R
Sparkling Water 
Requirements 
Linux or Mac OS X 
Oracle Java 1.7
Download 
http://0xdata.com/download/
Install and Launch 
Unpack zip file 
and 
Point SPARK_HOME to your Spark installation 
and 
Launch h2o-examples/sparkling-shell
What is Sparkling Shell? 
Standard spark-shell 
With additional Sparkling Water classes 
export MASTER=“local-cluster[3,2,1024]” 
spark-shell  
—-jars sparkling-water.jar 
JAR containing 
Sparkling 
Water 
Spark Master 
address
Lets play with Sparkling 
shell…
Create H2O Client 
import org.apache.spark.h2o._ 
import org.apache.spark.examples.h2o._ 
val h2oContext = new H2OContext(sc).start(3) 
import h2oContext._ 
Regular Spark context 
provided by Spark shell 
Size of demanded 
H2O cloud 
Contains implicit utility functions 
Demo specific 
classes
Is Spark Running? 
Go to http://localhost:4040
Is H2O running? 
http://localhost:54321/steam/index.html
Load Data #1 
Load weather data into RDD 
val weatherDataFile = 
“examples/smalldata/weather.csv" 
val wrawdata = sc.textFile(weatherDataFile,3) 
.cache() 
val weatherTable = wrawdata 
.map(_.split(“,")) 
.map(row => WeatherParse(row)) 
.filter(!_.isWrongRow()) 
Regular Spark API 
Ad-hoc Parser
Weather Data 
case class Weather( val Year : Option[Int], 
val Month : Option[Int], 
val Day : Option[Int], 
val TmaxF : Option[Int], // Max temperatur in F 
val TminF : Option[Int], // Min temperatur in F 
val TmeanF : Option[Float], // Mean temperatur in F 
val PrcpIn : Option[Float], // Precipitation (inches) 
val SnowIn : Option[Float], // Snow (inches) 
val CDD : Option[Float], // Cooling Degree Day 
val HDD : Option[Float], // Heating Degree Day 
val GDD : Option[Float]) // Growing Degree Day 
Simple POSO to hold one row of weather data
Load Data #2 
Load flights data into H2O frame 
import java.io.File 
val dataFile = 
“examples/smalldata/allyears2k_headers.csv.gz" 
val airlinesData = new DataFrame(new File(dataFile))
Where is the data? 
Go to http://localhost:54321/steam/index.html
Use Spark API for Data 
Filtering 
// Create RDD wrapper around DataFrame 
val airlinesTable : RDD[Airlines] 
= toRDD[Airlines](airlinesData) 
// And use Spark RDD API directly 
val flightsToORD = airlinesTable 
.filter( f => f.Dest == Some(“ORD") ) 
Create a cheap wrapper 
around H2O DataFrame 
Regular Spark 
RDD call
Use Spark SQL to Data Join 
import org.apache.spark.sql.SQLContext 
// We need to create SQL context 
val sqlContext = new SQLContext(sc) 
import sqlContext._ 
flightsToORD.registerTempTable("FlightsToORD") 
weatherTable.registerTempTable("WeatherORD")
Join Data based on Flight Date 
val bigTable = sql( 
"""SELECT 
| f.Year,f.Month,f.DayofMonth, 
| f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime, 
| f.UniqueCarrier,f.FlightNum,f.TailNum, 
| f.Origin,f.Distance, 
| w.TmaxF,w.TminF,w.TmeanF, 
| w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD, 
| f.ArrDelay 
| FROM FlightsToORD f 
| JOIN WeatherORD w 
| ON f.Year=w.Year AND f.Month=w.Month 
| AND f.DayofMonth=w.Day""".stripMargin)
Launch H2O Algorithms 
import hex.deeplearning._ 
import hex.deeplearning.DeepLearningModel 
.DeepLearningParameters 
// Setup deep learning parameters 
val dlParams = new DeepLearningParameters() 
dlParams._train = bigTable 
dlParams._response_column = 'ArrDelay 
dlParams._classification = false 
// Create a new model builder 
val dl = new DeepLearning(dlParams) 
val dlModel = dl.train.get 
Result of 
SQL query 
Blocking call
Make a prediction 
// Use model to score data 
val prediction = dlModel.score(result)(‘predict) 
// Collect predicted values via RDD API 
val predictionValues = toRDD[DoubleHolder](prediction) 
.collect 
.map ( _.result.getOrElse("NaN") )
Generate Residuals Plot 
# Import H2O library and initialize H2O client 
library(h2o) 
h = h2o.init() 
# Fetch prediction and actual data, use remembered keys 
pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69") 
act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57") 
# Select right columns 
predDelay = pred$predict 
actDelay = act$ArrDelay 
# Make sure that number of rows is same 
nrow(actDelay) == nrow(predDelay) 
# Compute residuals 
residuals = predDelay - actDelay 
# Plot residuals 
compare = cbind( 
as.data.frame(actDelay$ArrDelay), 
as.data.frame(residuals$predict)) 
plot( compare[,1:2] ) 
References 
of data
More info 
Checkout 0xdata Blog for Sparkling Water tutorials 
http://0xdata.com/blog/ 
Checkout 0xdata Youtube Channel 
https://www.youtube.com/user/0xdata 
Checkout github 
https://github.com/0xdata/sparkling-water
Thank you! 
Learn more about H2O at 
0xdata.com 
or 
neo> for r in sparkling-water; do 
git clone “git@github.com:0xdata/$r.git” 
done 
Follow us at @hexadata

Sparkling Water Webinar October 29th, 2014

  • 1.
    OCT 29th, 2014WEBINAR H2O Fast. Scalable. Machine Learning For Smarter Applications “Fluids are In, Animals are Out.” ~ Svetlana Sicular, Gartner
  • 2.
    Speakers Joel Horwitz Joel is a caffeine, data, and laughter driven product strategist. He is an active community member having founded Bay Area Analytics, Tweets regularly @JSHorwitz, blogs regularly joelshorwitz.com and speaks regularly at industry events. Always eager to learn and lend a helping hand makes him an invaluable asset to 0xdata. Michal Malohlava Michal is a geek, developer, Java, Linux, programming languages enthusiast developing software for over 10 years. He obtained PhD from the Charles University in Prague in 2012 and post-doc at Purdue University. H2O World Register at http://www.0xdata.com/h2o-world
  • 3.
    Time is theonly non-renewable resource. Speed Matters!
  • 4.
    Today • WhyAre We Here • Who We Are • How Do We Do It • Who We Work With • What We Believe • Demo and Q&A
  • 5.
    A New Interpretationof Moore’s Law “Like the physical universe, the digital universe is large - by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe - the data we create and copy annually - will reach 44 zettabytes, or 44 trillion gigabytes.” - IDC 2014
  • 6.
    An Evolving Marketto Meet the Demand RDBMS MPP Business Intelligence Data Science H O Distributed 2 File System Machine Learning
  • 7.
    Decreasing Cost ofData is Driving Demand H O 2 1970 1980 1990 2000 2010 2020
  • 8.
    H2O is theFirst Dedicated Machine Learning Open Source Platform H2O is for application developers and analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math, a high performance parallel architecture, and unrivaled ease of use.
  • 9.
  • 10.
    H2O Awards andAccolades • Top R Project of UserR Conference 2014 • Fortune Big Data All-Stars 2014, Arno Candel • 100+ Meetups • 6000+ Users
  • 11.
    H2O is Builtfor Speed and Scale • OpenSource • REST API • Native R Support • NanoFastTM Scoring Engine • Sophisticated Algorithms
  • 12.
    H2O Seamlessly Integrateswith Your Workflow • 20X Faster Imports and 3X Compression w/ .hex format. • 4 Billion Row Regression in Seconds. • Deploy in POJO or with our REST API
  • 13.
    Code is incompletewithout Community! Open Source Drives Innovation.
  • 14.
    Law of LargeNumbers Triumphs!
  • 15.
    Every Generation Needsto Invent Its Own Math. ML is the new SQL!
  • 16.
    What do ourcustomers say about us? "The platform can generate Jar files to deploy models into production. This alone is a milestone!" - Hassan Namarvar, ShareThis “I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.” – Nachum Shacham, Paypal “I analyzed 1 million rows training set, fitting a logistic regression with elastic penalty, and doing a grid search on parameters with 10-fold cross validation for each parameter combination… doing this analysis was a breeze… orders of magnitude faster than R.” - Antonio Molins, Netflix “Never have we had such a quick, simple, scalable and cost effective deployment solution for predictive modeling.” – Lou Carvalheira, Cisco
  • 17.
    Advertising Better Conversions Brand Conversion Reach ROI Overall, I would say that the H2O platform is the most elegant open source in-memory ~ Hassan Namarvar, Principal Data Scientist
  • 18.
    Fraud Better Detections Purchase Shopping Theft Passwords I have to give credit to H2O. They have a very complete way of showing which algorithm is the best. ~ Nachum Shacham, Principal Data Scientist
  • 19.
    Marketing Better Spend ROI Network Segments Measure H2O has established a new equilibrium point for performance, accuracy and cost for statistics and machine learning. ~ Lou Carvalheira, Principal Data Scientist
  • 20.
  • 21.
    @hexadata & @mmalohlava presents Sparkling Water “Killer App for Spark”
  • 22.
    Memory efficient Performanceof computation Machine learning algorithms Parser, GUI, R-interface User-friendly API Large and active community Platform components - SQL Multitenancy
  • 23.
    Sparkling Water + RDD immutable world DataFrame mutable world
  • 24.
  • 25.
    Sparkling Water Provides Transparent integration into Spark ecosystem Pure H2ORDD encapsulating H2O DataFrame Transparent use of H2O data structures and algorithms with Spark API Excels in Spark workflows requiring advanced Machine Learning algorithms
  • 26.
    Sparkling Water Design implements spark-submit Spark Master JVM Spark Worker JVM Spark Worker JVM Spark Worker JVM Sparkling Water Cluster Spark Executor JVM H2O Spark Executor JVM H2O Spark Executor JVM H2O Sparklin g App jar file Contains application and Sparkling Water classes
  • 27.
    Data Distribution SparklingWater Cluster H2O H2O H2O Spark Executor JVM Data Source (e.g. HDFS) H2O RDD Spark RDD Spark Executor JVM Spark Executor JVM RDDs and DataFrames share same memory space
  • 28.
  • 29.
    Flight delay prediction “Build a model using weather and flight data to predict delays of flights arriving to Chicago O’Hare International Airport”
  • 30.
    Example Outline Load& Parse CSV data from 2 data sources Use Spark API to filter data, do SQL query for join Create a regression model Use model for delay prediction Plot residual plot from R
  • 31.
    Sparkling Water Requirements Linux or Mac OS X Oracle Java 1.7
  • 32.
  • 33.
    Install and Launch Unpack zip file and Point SPARK_HOME to your Spark installation and Launch h2o-examples/sparkling-shell
  • 34.
    What is SparklingShell? Standard spark-shell With additional Sparkling Water classes export MASTER=“local-cluster[3,2,1024]” spark-shell —-jars sparkling-water.jar JAR containing Sparkling Water Spark Master address
  • 35.
    Lets play withSparkling shell…
  • 36.
    Create H2O Client import org.apache.spark.h2o._ import org.apache.spark.examples.h2o._ val h2oContext = new H2OContext(sc).start(3) import h2oContext._ Regular Spark context provided by Spark shell Size of demanded H2O cloud Contains implicit utility functions Demo specific classes
  • 37.
    Is Spark Running? Go to http://localhost:4040
  • 38.
    Is H2O running? http://localhost:54321/steam/index.html
  • 39.
    Load Data #1 Load weather data into RDD val weatherDataFile = “examples/smalldata/weather.csv" val wrawdata = sc.textFile(weatherDataFile,3) .cache() val weatherTable = wrawdata .map(_.split(“,")) .map(row => WeatherParse(row)) .filter(!_.isWrongRow()) Regular Spark API Ad-hoc Parser
  • 40.
    Weather Data caseclass Weather( val Year : Option[Int], val Month : Option[Int], val Day : Option[Int], val TmaxF : Option[Int], // Max temperatur in F val TminF : Option[Int], // Min temperatur in F val TmeanF : Option[Float], // Mean temperatur in F val PrcpIn : Option[Float], // Precipitation (inches) val SnowIn : Option[Float], // Snow (inches) val CDD : Option[Float], // Cooling Degree Day val HDD : Option[Float], // Heating Degree Day val GDD : Option[Float]) // Growing Degree Day Simple POSO to hold one row of weather data
  • 41.
    Load Data #2 Load flights data into H2O frame import java.io.File val dataFile = “examples/smalldata/allyears2k_headers.csv.gz" val airlinesData = new DataFrame(new File(dataFile))
  • 42.
    Where is thedata? Go to http://localhost:54321/steam/index.html
  • 43.
    Use Spark APIfor Data Filtering // Create RDD wrapper around DataFrame val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData) // And use Spark RDD API directly val flightsToORD = airlinesTable .filter( f => f.Dest == Some(“ORD") ) Create a cheap wrapper around H2O DataFrame Regular Spark RDD call
  • 44.
    Use Spark SQLto Data Join import org.apache.spark.sql.SQLContext // We need to create SQL context val sqlContext = new SQLContext(sc) import sqlContext._ flightsToORD.registerTempTable("FlightsToORD") weatherTable.registerTempTable("WeatherORD")
  • 45.
    Join Data basedon Flight Date val bigTable = sql( """SELECT | f.Year,f.Month,f.DayofMonth, | f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime, | f.UniqueCarrier,f.FlightNum,f.TailNum, | f.Origin,f.Distance, | w.TmaxF,w.TminF,w.TmeanF, | w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD, | f.ArrDelay | FROM FlightsToORD f | JOIN WeatherORD w | ON f.Year=w.Year AND f.Month=w.Month | AND f.DayofMonth=w.Day""".stripMargin)
  • 46.
    Launch H2O Algorithms import hex.deeplearning._ import hex.deeplearning.DeepLearningModel .DeepLearningParameters // Setup deep learning parameters val dlParams = new DeepLearningParameters() dlParams._train = bigTable dlParams._response_column = 'ArrDelay dlParams._classification = false // Create a new model builder val dl = new DeepLearning(dlParams) val dlModel = dl.train.get Result of SQL query Blocking call
  • 47.
    Make a prediction // Use model to score data val prediction = dlModel.score(result)(‘predict) // Collect predicted values via RDD API val predictionValues = toRDD[DoubleHolder](prediction) .collect .map ( _.result.getOrElse("NaN") )
  • 48.
    Generate Residuals Plot # Import H2O library and initialize H2O client library(h2o) h = h2o.init() # Fetch prediction and actual data, use remembered keys pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69") act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57") # Select right columns predDelay = pred$predict actDelay = act$ArrDelay # Make sure that number of rows is same nrow(actDelay) == nrow(predDelay) # Compute residuals residuals = predDelay - actDelay # Plot residuals compare = cbind( as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict)) plot( compare[,1:2] ) References of data
  • 49.
    More info Checkout0xdata Blog for Sparkling Water tutorials http://0xdata.com/blog/ Checkout 0xdata Youtube Channel https://www.youtube.com/user/0xdata Checkout github https://github.com/0xdata/sparkling-water
  • 50.
    Thank you! Learnmore about H2O at 0xdata.com or neo> for r in sparkling-water; do git clone “git@github.com:0xdata/$r.git” done Follow us at @hexadata

Editor's Notes

  • #23 immutable v. mutable approach, racy updates Strong points H2O: column compression parser small nobles based on customers feedback/knowledge tunned algo