Sparkling Water Webinar October 29th, 2014

OCT 29th, 2014 WEBINAR
H2O
Fast. Scalable. Machine Learning
For Smarter Applications
“Fluids are In, Animals are Out.”
~ Svetlana Sicular, Gartner

Speakers
Joel Horwitz
Joel is a caffeine, data, and laughter driven product strategist. He is an
active community member having founded Bay Area Analytics, Tweets
regularly @JSHorwitz, blogs regularly joelshorwitz.com and speaks
regularly at industry events. Always eager to learn and lend a helping hand
makes him an invaluable asset to 0xdata.
Michal Malohlava
Michal is a geek, developer, Java, Linux, programming languages
enthusiast developing software for over 10 years. He obtained PhD from
the Charles University in Prague in 2012 and post-doc at Purdue
University.
H2O World Register at http://www.0xdata.com/h2o-world

Time is the only non-renewable resource.
Speed Matters!

Today
• Why Are We Here
• Who We Are
• How Do We Do It
• Who We Work With
• What We Believe
• Demo and Q&A

A New Interpretation of Moore’s Law
“Like the physical universe, the digital universe is large - by 2020 containing nearly as
many digital bits as there are stars in the universe. It is doubling in size every two
years, and by 2020 the digital universe - the data we create and copy annually - will
reach 44 zettabytes, or 44 trillion gigabytes.”
- IDC 2014

An Evolving Market to Meet the Demand
RDBMS MPP
Business
Intelligence
Data
Science
H
O Distributed
2
File System
Machine
Learning

Decreasing Cost of Data is Driving Demand
H
O
2
1970 1980 1990 2000 2010 2020

H2O is the First Dedicated
Machine Learning Open Source Platform
H2O is for application developers and analysts who need
scalable and fast machine learning. H2O is an open source
predictive analytics platform. Unlike traditional analytics tools,
H2O provides a combination of extraordinary math, a high
performance parallel architecture, and unrivaled ease of use.

H2O Awards and Accolades
• Top R Project of UserR Conference 2014
• Fortune Big Data All-Stars 2014, Arno Candel
• 100+ Meetups
• 6000+ Users

H2O is Built for Speed and Scale
• OpenSource
• REST API
• Native R Support
• NanoFastTM Scoring Engine
• Sophisticated Algorithms

H2O Seamlessly Integrates with Your Workflow
• 20X Faster Imports and 3X
Compression w/ .hex format.
• 4 Billion Row Regression in
Seconds.
• Deploy in POJO or with our
REST API

Code is incomplete without Community!
Open Source Drives Innovation.

Law of Large Numbers Triumphs!

Every Generation Needs to Invent Its Own Math.
ML is the new SQL!

What do our customers say about us?
"The platform can generate Jar files to deploy models into production. This
alone is a milestone!" - Hassan Namarvar, ShareThis
“I have to give credit to H2O. They have a very complete way of showing which
algorithm is the best.” – Nachum Shacham, Paypal
“I analyzed 1 million rows training set, fitting a logistic regression with elastic
penalty, and doing a grid search on parameters with 10-fold cross validation for
each parameter combination… doing this analysis was a breeze… orders of
magnitude faster than R.” - Antonio Molins, Netflix
“Never have we had such a quick, simple, scalable and cost effective
deployment solution for predictive modeling.” – Lou Carvalheira, Cisco

Advertising
Better Conversions
Brand Conversion Reach ROI
Overall, I would say that the H2O platform is the most elegant open source in-memory ~ Hassan Namarvar, Principal Data Scientist

Fraud
Better Detections
Purchase
Shopping Theft Passwords
I have to give credit to H2O.
They have a very complete way of showing which algorithm is the best.
~ Nachum Shacham, Principal Data Scientist

Marketing
Better Spend
ROI
Network Segments Measure
H2O has established a new equilibrium point for performance,
accuracy and cost for statistics and machine learning.
~ Lou Carvalheira, Principal Data Scientist

Select Customers Powered by H2O

@hexadata & @mmalohlava
presents
Sparkling Water
“Killer App for Spark”

Memory efficient
Performance of computation
Machine learning algorithms
Parser, GUI, R-interface
User-friendly API
Large and active community
Platform components - SQL
Multitenancy

Sparkling Water
+
RDD
immutable
world
DataFrame
mutable
world

Sparkling Water
RDD DataFrame

Sparkling Water
Provides
Transparent integration into Spark ecosystem
Pure H2ORDD encapsulating H2O DataFrame
Transparent use of H2O data structures and
algorithms with Spark API
Excels in Spark workflows requiring advanced
Machine Learning algorithms

Sparkling Water Design
implements
spark-submit
Spark
Master
JVM
Spark
Worker
JVM
Spark
Worker
JVM
Spark
Worker
JVM
Sparkling Water Cluster
Spark
Executor
JVM
H2O
Spark
Executor
JVM
H2O
Spark
Executor
JVM
H2O
Sparklin
g App
jar file
Contains application
and Sparkling Water
classes

Data Distribution
Sparkling Water Cluster
H2O
H2O
H2O
Spark Executor JVM Data
Source
(e.g.
HDFS)
H2O
RDD
Spark
RDD
Spark Executor JVM
Spark Executor JVM
RDDs and DataFrames
share same memory
space

Flight delay prediction
“Build a model using weather and
flight data to predict delays of flights
arriving to Chicago O’Hare
International Airport”

Example Outline
Load & Parse CSV data from 2 data sources
Use Spark API to filter data, do SQL query for join
Create a regression model
Use model for delay prediction
Plot residual plot from R

Sparkling Water
Requirements
Linux or Mac OS X
Oracle Java 1.7

Download
http://0xdata.com/download/

Install and Launch
Unpack zip file
and
Point SPARK_HOME to your Spark installation
and
Launch h2o-examples/sparkling-shell

What is Sparkling Shell?
Standard spark-shell
With additional Sparkling Water classes
export MASTER=“local-cluster[3,2,1024]”
spark-shell
—-jars sparkling-water.jar
JAR containing
Sparkling
Water
Spark Master
address

Lets play with Sparkling
shell…

Create H2O Client
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
val h2oContext = new H2OContext(sc).start(3)
import h2oContext._
Regular Spark context
provided by Spark shell
Size of demanded
H2O cloud
Contains implicit utility functions
Demo specific
classes

Is Spark Running?
Go to http://localhost:4040

Is H2O running?
http://localhost:54321/steam/index.html

Load Data #1
Load weather data into RDD
val weatherDataFile =
“examples/smalldata/weather.csv"
val wrawdata = sc.textFile(weatherDataFile,3)
.cache()
val weatherTable = wrawdata
.map(_.split(“,"))
.map(row => WeatherParse(row))
.filter(!_.isWrongRow())
Regular Spark API
Ad-hoc Parser

Weather Data
case class Weather( val Year : Option[Int],
val Month : Option[Int],
val Day : Option[Int],
val TmaxF : Option[Int], // Max temperatur in F
val TminF : Option[Int], // Min temperatur in F
val TmeanF : Option[Float], // Mean temperatur in F
val PrcpIn : Option[Float], // Precipitation (inches)
val SnowIn : Option[Float], // Snow (inches)
val CDD : Option[Float], // Cooling Degree Day
val HDD : Option[Float], // Heating Degree Day
val GDD : Option[Float]) // Growing Degree Day
Simple POSO to hold one row of weather data

Load Data #2
Load flights data into H2O frame
import java.io.File
val dataFile =
“examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))

Where is the data?
Go to http://localhost:54321/steam/index.html

Use Spark API for Data
Filtering
// Create RDD wrapper around DataFrame
val airlinesTable : RDD[Airlines]
= toRDD[Airlines](airlinesData)
// And use Spark RDD API directly
val flightsToORD = airlinesTable
.filter( f => f.Dest == Some(“ORD") )
Create a cheap wrapper
around H2O DataFrame
Regular Spark
RDD call

Use Spark SQL to Data Join
import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")

Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel
.DeepLearningParameters
// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._train = bigTable
dlParams._response_column = 'ArrDelay
dlParams._classification = false
// Create a new model builder
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
Result of
SQL query
Blocking call

Make a prediction
// Use model to score data
val prediction = dlModel.score(result)(‘predict)
// Collect predicted values via RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
.collect
.map ( _.result.getOrElse("NaN") )

Generate Residuals Plot
# Import H2O library and initialize H2O client
library(h2o)
h = h2o.init()
# Fetch prediction and actual data, use remembered keys
pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")
act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select right columns
predDelay = pred$predict
actDelay = act$ArrDelay
# Make sure that number of rows is same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind(
as.data.frame(actDelay$ArrDelay),
as.data.frame(residuals$predict))
plot( compare[,1:2] )
References
of data

More info
Checkout 0xdata Blog for Sparkling Water tutorials
http://0xdata.com/blog/
Checkout 0xdata Youtube Channel
https://www.youtube.com/user/0xdata
Checkout github
https://github.com/0xdata/sparkling-water

Thank you!
Learn more about H2O at
0xdata.com
or
neo> for r in sparkling-water; do
git clone “git@github.com:0xdata/$r.git”
done
Follow us at @hexadata

Sparkling Water Webinar October 29th, 2014

More Related Content

What's hot

Viewers also liked

Similar to Sparkling Water Webinar October 29th, 2014

More from Sri Ambati

Recently uploaded

Sparkling Water Webinar October 29th, 2014

Editor's Notes