H2O Rains with Databricks Cloud
Michal Malohlava
@mmalohlava
Meetup 2016/02/04, SF
Who Am I?
Background
• PhD in CS from Charles University in Prague, Czech Republic
• Postdoc at Purdue University experimenting with algos for large-scale computation
• Now SW engineer at H2O.ai
Experience with domain-specific languages, distributed systems, software engineering, and big data.
H2O.ai
H2O team
Co-Founders: Sri Ambati, Cliff Click
Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie
H2O
Open-Source In-Memory Data Science Platform
• Highly optimized Java code (in-house)
• Distributed in-memory K-V store and map/reduce computation framework
• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)
• Read/write access to distributed data frames (R/Pandas-style)
• ML algos: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
• REST API clients: interactive UI, R, Python
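A minimal sketch of the map/reduce idea over a chunked column, in plain Python. This is not H2O's API: the `chunks` list, `map_chunk`, and `combine` are hypothetical names used only to illustrate how per-chunk partial results can be merged into one answer.

```python
from functools import reduce

# Hypothetical stand-in for a distributed column: each inner list is one
# chunk that would live on a different node in a real cluster.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

def map_chunk(chunk):
    """Map step: summarize one chunk locally as (sum, count)."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Reduce step: merge two partial (sum, count) results."""
    return (a[0] + b[0], a[1] + b[1])

total, count = reduce(combine, map(map_chunk, chunks))
column_mean = total / count
print(column_mean)
```

Because `combine` is associative, partial results can be merged in any order, which is what lets this kind of computation run across nodes.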
H2O + Spark = Sparkling Water
Apache Spark
Open-source distributed execution platform
User-friendly API for data transformation based on RDDs, DataFrames (from 1.4), and Datasets (from 1.6)
Platform components: SQL, MLlib, text mining, Avro, Redshift, Kinesis
Easily extendable by 3rd-party packages
Interactive shell
Current release: 1.6
Supported releases: 1.3, 1.4, 1.5
Databricks
• founded by the creators of Apache Spark
• still contribute 75% of the code to the Spark project
• cloud platform for running Spark in your AWS account
Databricks Platform
• integrated collaborative data science workspace
• notebook interface inspired by IPython and Zeppelin but purpose-built for Spark
• self-service cluster manager and job scheduler for production Spark workloads
Sparkling Water
Provides
Transparent integration of H2O with the Spark ecosystem
Transparent use of H2O data structures and algorithms with the Spark API
Platform for building Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Functionality missing in H2O can be provided by Spark and vice versa
How to use Sparkling Water?
Model Building
Data Source -> Data munging -> Modelling -> Prediction processing
Modelling: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
Data Munging
Data Source -> Data load/munging/exploration -> Modelling
Stream processing
Off-line model training: Data Source -> Data munging -> Modelling
Export model in a binary format or as code
Deploy the model
Stream processing (Spark Streaming/Storm): Data Stream -> Model prediction
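A minimal sketch of this train off-line, export, deploy, and score-a-stream pattern in plain Python. It uses neither H2O nor Spark Streaming: the toy threshold "model", the `train` and `score_stream` helpers, and the use of `pickle` as the binary format are all assumptions made for illustration.

```python
import pickle

# Off-line training on a toy labeled set: (message length, is_spam).
# The "model" is a single length threshold, chosen only for illustration.
training = [(20, 0), (35, 0), (120, 1), (150, 1)]

def train(rows):
    ham = [x for x, y in rows if y == 0]
    spam = [x for x, y in rows if y == 1]
    # Threshold halfway between the two class means.
    return {"threshold": (sum(ham) / len(ham) + sum(spam) / len(spam)) / 2}

# Export the model in a binary format (here: pickle bytes).
exported = pickle.dumps(train(training))

def score_stream(model_bytes, stream):
    """Load the deployed model once, then score records as they arrive."""
    model = pickle.loads(model_bytes)
    for value in stream:
        yield value, int(value > model["threshold"])

# Any iterator can play the role of the data stream.
predictions = list(score_stream(exported, iter([10, 200, 90])))
print(predictions)
```

The point of the split is that the stream processor only needs the exported model, not the training data or the training environment.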
What is inside?
Databricks
Driver node: Scala/Py main program, H2OContext, SparkContext
Worker nodes (3): Spark executor with H2O services
Spark Cluster (Spark executors with H2O services)
Data Source -> DataFrame (Spark)
Data Source -> H2OFrame (H2O services)
DataFrame -> H2OFrame: h2oContext.asH2OFrame
H2OFrame -> DataFrame: h2oContext.asDataFrame
DEMO Time!
What do we need?
Databricks account (14-day free trial at www.databricks.com)
AWS account
Sparkling Water coordinates: ai.h2o:sparkling-water-examples_2.10:1.5.10
And some cool machine learning idea!
OR
Detect spam text messages
Data sample
Goal
For a given text message, identify whether it is spam or not
Machine Learning Workflow
1. Extract data
2. Transform, tokenize messages
3. Build TF-IDF model
4. Create and evaluate Deep Learning model
5. Use the model to detect spam
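Steps 2 and 3 of the workflow can be sketched in plain Python as below. This is an illustration only, not the demo's code and not the Spark or H2O API: the sample `messages`, the `tokenize` and `tf_idf` helpers, and the particular weighting (length-normalized term frequency times log(N / document frequency)) are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical sample messages; the talk's data set is not reproduced here.
messages = [
    "Free prize! Call now to claim your free prize",
    "Are we still meeting for lunch today",
    "Call me when you are free",
]

def tokenize(text):
    """Lowercase the message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses length-normalized term frequency and
    idf = log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Count in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

vectors = tf_idf(messages)
```

Terms that occur in fewer messages get a higher weight, so in the first message "prize" outweighs "free", which also appears in the third message. The resulting vectors are the kind of numeric features a model in step 4 would be trained on.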
Check out H2O.ai Training Books
http://learn.h2o.ai/
Check out H2O.ai Blog
http://h2o.ai/blog/
Check out H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
More info
Learn more at h2o.ai
Follow us at @h2oai
Thank you!
Sparkling Water is an open-source ML application platform combining the power of Spark and H2O

H2O Rains with Databricks Cloud - Parisoma SF