Sparkling Water is the newest application on the Apache Spark in-memory platform to extend Machine Learning for better predictions and to quickly deploy models into production. H2O is proud to partner with Cloudera and Databricks to bring this capability to a wide audience.
H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine. Learn more by going to http://www.h2o.ai and contact us for more information.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Sparkling Water Webinar October 29th, 2014
1. OCT 29th, 2014 WEBINAR
H2O
Fast. Scalable. Machine Learning
For Smarter Applications
“Fluids are In, Animals are Out.”
~ Svetlana Sicular, Gartner
2. Speakers
Joel Horwitz
Joel is a product strategist driven by caffeine, data, and laughter. He is an
active community member who founded Bay Area Analytics, tweets
regularly as @JSHorwitz, blogs at joelshorwitz.com, and speaks
regularly at industry events. His eagerness to learn and to lend a helping
hand makes him an invaluable asset to 0xdata.
Michal Malohlava
Michal is a geek and developer, an enthusiast for Java, Linux, and
programming languages, who has been developing software for over 10 years.
He obtained his PhD from Charles University in Prague in 2012 and was a
post-doc at Purdue University.
H2O World: register at http://www.0xdata.com/h2o-world
3. Time is the only non-renewable resource.
Speed Matters!
4. Today
• Why Are We Here
• Who We Are
• How Do We Do It
• Who We Work With
• What We Believe
• Demo and Q&A
5. A New Interpretation of Moore’s Law
“Like the physical universe, the digital universe is large - by 2020 containing nearly as
many digital bits as there are stars in the universe. It is doubling in size every two
years, and by 2020 the digital universe - the data we create and copy annually - will
reach 44 zettabytes, or 44 trillion gigabytes.”
- IDC 2014
6. An Evolving Market to Meet the Demand
[Diagram labels: RDBMS, MPP, Business Intelligence, Distributed File System, Data Science, H2O, Machine Learning]
7. Decreasing Cost of Data is Driving Demand
[Chart: time axis from 1970 to 2020 in ten-year steps, with H2O marked]
8. H2O is the First Dedicated
Machine Learning Open Source Platform
H2O is for application developers and analysts who need
scalable and fast machine learning. H2O is an open source
predictive analytics platform. Unlike traditional analytics tools,
H2O provides a combination of extraordinary math, a high
performance parallel architecture, and unrivaled ease of use.
10. H2O Awards and Accolades
• Top R Project of the useR! Conference 2014
• Fortune Big Data All-Stars 2014, Arno Candel
• 100+ Meetups
• 6000+ Users
11. H2O is Built for Speed and Scale
• Open Source
• REST API
• Native R Support
• NanoFast™ Scoring Engine
• Sophisticated Algorithms
12. H2O Seamlessly Integrates with Your Workflow
• 20X Faster Imports and 3X Compression w/ .hex format.
• 4 Billion Row Regression in Seconds.
• Deploy in POJO or with our REST API
16. What do our customers say about us?
"The platform can generate Jar files to deploy models into production. This
alone is a milestone!" - Hassan Namarvar, ShareThis
“I have to give credit to H2O. They have a very complete way of showing which
algorithm is the best.” – Nachum Shacham, PayPal
“I analyzed 1 million rows training set, fitting a logistic regression with elastic
penalty, and doing a grid search on parameters with 10-fold cross validation for
each parameter combination… doing this analysis was a breeze… orders of
magnitude faster than R.” - Antonio Molins, Netflix
“Never have we had such a quick, simple, scalable and cost effective
deployment solution for predictive modeling.” – Lou Carvalheira, Cisco
17. Advertising
Better Conversions
Brand Conversion Reach ROI
Overall, I would say that the H2O platform is the most elegant open source in-memory ~ Hassan Namarvar, Principal Data Scientist
18. Fraud
Better Detections
Purchase
Shopping Theft Passwords
I have to give credit to H2O.
They have a very complete way of showing which algorithm is the best.
~ Nachum Shacham, Principal Data Scientist
19. Marketing
Better Spend
ROI
Network Segments Measure
H2O has established a new equilibrium point for performance,
accuracy and cost for statistics and machine learning.
~ Lou Carvalheira, Principal Data Scientist
22. Memory efficient
Performance of computation
Machine learning algorithms
Parser, GUI, R-interface
User-friendly API
Large and active community
Platform components - SQL
Multitenancy
25. Sparkling Water
Provides
Transparent integration into Spark ecosystem
Pure H2ORDD encapsulating H2O DataFrame
Transparent use of H2O data structures and
algorithms with Spark API
Excels in Spark workflows requiring advanced
Machine Learning algorithms
26. Sparkling Water Design
[Diagram components]
• Sparkling App jar file: contains application and Sparkling Water classes, launched with spark-submit
• Spark Master JVM
• Three Spark Worker JVMs
• Sparkling Water Cluster: three Spark Executor JVMs, each running H2O
27. Data Distribution
[Diagram components]
• Data Source (e.g. HDFS)
• Sparkling Water Cluster: three Spark Executor JVMs, each running H2O
• H2O RDD and Spark RDD
• RDDs and DataFrames share the same memory space
29. Flight delay prediction
“Build a model using weather and
flight data to predict delays of flights
arriving at Chicago O’Hare
International Airport”
30. Example Outline
Load & Parse CSV data from 2 data sources
Use Spark API to filter data, do SQL query for join
Create a regression model
Use model for delay prediction
Plot residuals from R
33. Install and Launch
Unpack zip file
and
Point SPARK_HOME to your Spark installation
and
Launch h2o-examples/sparkling-shell
34. What is Sparkling Shell?
Standard spark-shell
With additional Sparkling Water classes
# Spark Master address
export MASTER="local-cluster[3,2,1024]"
# JAR containing Sparkling Water
spark-shell --jars sparkling-water.jar
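The master string above follows Spark's `local-cluster[workers, coresPerWorker, memoryPerWorkerMB]` form, so `local-cluster[3,2,1024]` asks for 3 workers with 2 cores and 1024 MB each. A minimal sketch that makes the three fields explicit is shown here; `LocalClusterSketch`, `LocalCluster` and `parse` are hypothetical names used only for illustration and are not part of Spark or Sparkling Water.

```scala
object LocalClusterSketch {
  // Hypothetical holder for the three numbers in a local-cluster master string.
  final case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int)

  // Matches the whole string, e.g. "local-cluster[3,2,1024]".
  private val Pattern = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  // Returns None for any master string that is not in local-cluster form.
  def parse(master: String): Option[LocalCluster] = master match {
    case Pattern(w, c, m) => Some(LocalCluster(w.toInt, c.toInt, m.toInt))
    case _                => None
  }
}
```

Calling `LocalClusterSketch.parse("local-cluster[3,2,1024]")` yields the three values as named fields, which can help when checking that a demo cluster has the size you intended.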
39. Load Data #1
Load weather data into RDD
val weatherDataFile = "examples/smalldata/weather.csv"
// Regular Spark API
val wrawdata = sc.textFile(weatherDataFile, 3).cache()
// Ad-hoc parser
val weatherTable = wrawdata
  .map(_.split(","))
  .map(row => WeatherParse(row))
  .filter(!_.isWrongRow())
40. Weather Data
case class Weather( val Year   : Option[Int],
                    val Month  : Option[Int],
                    val Day    : Option[Int],
                    val TmaxF  : Option[Int],   // Max temperature in F
                    val TminF  : Option[Int],   // Min temperature in F
                    val TmeanF : Option[Float], // Mean temperature in F
                    val PrcpIn : Option[Float], // Precipitation (inches)
                    val SnowIn : Option[Float], // Snow (inches)
                    val CDD    : Option[Float], // Cooling Degree Day
                    val HDD    : Option[Float], // Heating Degree Day
                    val GDD    : Option[Float]) // Growing Degree Day
Simple POSO to hold one row of weather data
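The slides call `WeatherParse(row)` and `isWrongRow()` without showing either one. The sketch below is one possible shape for such an ad-hoc parser, under two assumptions that are not confirmed by the slides: the CSV columns arrive in the same order as the case class fields, and a row counts as "wrong" when any part of its date is missing. The object name `WeatherParseSketch` is hypothetical.

```scala
import scala.util.Try

object WeatherParseSketch {
  // Same fields as the Weather class on the slide; isWrongRow is an assumed definition.
  final case class Weather(
      Year: Option[Int], Month: Option[Int], Day: Option[Int],
      TmaxF: Option[Int], TminF: Option[Int], TmeanF: Option[Float],
      PrcpIn: Option[Float], SnowIn: Option[Float],
      CDD: Option[Float], HDD: Option[Float], GDD: Option[Float]) {
    // Assumption: a row is unusable when any part of the date is missing.
    def isWrongRow(): Boolean = Year.isEmpty || Month.isEmpty || Day.isEmpty
  }

  // A missing column, an empty cell, or a non-numeric cell all become None.
  private def int(row: Array[String], i: Int): Option[Int] =
    Try(row(i).trim.toInt).toOption

  private def float(row: Array[String], i: Int): Option[Float] =
    Try(row(i).trim.toFloat).toOption

  // Assumption: columns are ordered exactly like the case class fields.
  def parse(row: Array[String]): Weather = Weather(
    int(row, 0), int(row, 1), int(row, 2), int(row, 3), int(row, 4),
    float(row, 5), float(row, 6), float(row, 7),
    float(row, 8), float(row, 9), float(row, 10))
}
```

Keeping every field as an `Option` means a header line or a damaged record still parses without throwing, and can then be dropped with a filter on `isWrongRow()`.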
41. Load Data #2
Load flights data into H2O frame
import java.io.File
val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))
42. Where is the data?
Go to http://localhost:54321/steam/index.html
43. Use Spark API for Data Filtering
// Create RDD wrapper around DataFrame
// (a cheap wrapper around the H2O DataFrame)
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
// And use Spark RDD API directly (a regular Spark RDD call)
val flightsToORD = airlinesTable
  .filter( f => f.Dest == Some("ORD") )
44. Use Spark SQL for Data Join
import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")
45. Join Data based on Flight Date
val bigTable = sql(
"""SELECT
| f.Year,f.Month,f.DayofMonth,
| f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
| f.UniqueCarrier,f.FlightNum,f.TailNum,
| f.Origin,f.Distance,
| w.TmaxF,w.TminF,w.TmeanF,
| w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
| f.ArrDelay
| FROM FlightsToORD f
| JOIN WeatherORD w
| ON f.Year=w.Year AND f.Month=w.Month
| AND f.DayofMonth=w.Day""".stripMargin)
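To make the join condition concrete, here is a minimal sketch of the same date-based inner join on plain Scala collections, with no Spark involved. The row types `Flight`, `DayWeather` and `Joined` are hypothetical, cut-down stand-ins for the real tables. The sketch also shows one subtlety: in SQL a NULL key never matches, whereas `None == None` is true in Scala, so rows with a missing date part have to be skipped explicitly.

```scala
object JoinSketch {
  // Hypothetical, cut-down row types; the real tables carry many more columns.
  final case class Flight(Year: Option[Int], Month: Option[Int],
                          DayofMonth: Option[Int], ArrDelay: Option[Int])
  final case class DayWeather(Year: Option[Int], Month: Option[Int],
                              Day: Option[Int], TmeanF: Option[Float])
  final case class Joined(year: Int, month: Int, day: Int,
                          tmeanF: Option[Float], arrDelay: Option[Int])

  def joinOnDate(flights: Seq[Flight], weather: Seq[DayWeather]): Seq[Joined] = {
    // Index weather rows by complete date; rows with a missing key part are
    // dropped, mirroring how a SQL inner join never matches NULL keys.
    val byDate: Map[(Int, Int, Int), Seq[DayWeather]] =
      weather
        .flatMap(w => for (y <- w.Year; m <- w.Month; d <- w.Day) yield ((y, m, d), w))
        .groupBy(_._1)
        .map { case (key, rows) => key -> rows.map(_._2) }

    // Emit one output row per matching (flight, weather) pair.
    for {
      f <- flights
      y <- f.Year.toSeq
      m <- f.Month.toSeq
      d <- f.DayofMonth.toSeq
      w <- byDate.getOrElse((y, m, d), Seq.empty)
    } yield Joined(y, m, d, w.TmeanF, f.ArrDelay)
  }
}
```

Flights on dates without a weather record, and flights with an incomplete date, simply produce no output rows, which is the behaviour of the inner `JOIN` in the query.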
46. Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._train = bigTable // Result of SQL query
dlParams._response_column = 'ArrDelay
dlParams._classification = false
// Create a new model builder
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get // Blocking call
47. Make a prediction
// Use model to score data (bigTable is the joined table built earlier)
val prediction = dlModel.score(bigTable)('predict)
// Collect predicted values via RDD API; missing predictions become NaN
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map ( _.result.getOrElse(Double.NaN) )
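Once predicted and actual delays are available as numeric sequences, the residuals (predicted minus actual) can also be computed directly in Scala. The following is a minimal sketch on plain collections. `ResidualSketch` and its methods are hypothetical names, and the RMSE summary is an addition for illustration that does not appear in the slides.

```scala
object ResidualSketch {
  // Residual for each row: predicted minus actual.
  def residuals(predicted: Seq[Double], actual: Seq[Double]): Seq[Double] = {
    require(predicted.length == actual.length, "row counts must match")
    predicted.zip(actual).map { case (p, a) => p - a }
  }

  // Root-mean-square error over the rows whose residual is a number;
  // NaN residuals (missing predictions or actuals) are ignored.
  def rmse(res: Seq[Double]): Double = {
    val valid = res.filterNot(_.isNaN)
    if (valid.isEmpty) Double.NaN
    else math.sqrt(valid.map(r => r * r).sum / valid.length)
  }
}
```

The `require` check plays the same role as comparing row counts before subtracting: a mismatch usually means the prediction and the actual data came from different frames.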
48. Generate Residuals Plot
# Import H2O library and initialize H2O client
library(h2o)
h = h2o.init()
# Fetch prediction and actual data, use remembered keys
# (the keys are references to the data)
pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")
act = h2o.getFrame(h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select right columns
predDelay = pred$predict
actDelay = act$ArrDelay
# Make sure that number of rows is same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind(
  as.data.frame(actDelay$ArrDelay),
  as.data.frame(residuals$predict))
plot( compare[,1:2] )
49. More info
Check out the 0xdata Blog for Sparkling Water tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/sparkling-water
50. Thank you!
Learn more about H2O at
0xdata.com
or
neo> for r in sparkling-water; do
git clone "git@github.com:0xdata/$r.git"
done
Follow us at @hexadata
Editor's Notes
immutable v. mutable approach, racy updates
Strong points
H2O:
column compression
parser
small nobles based on customer feedback/knowledge
tuned algo