H2O Rains with Databricks Cloud - Parisoma SF

H2O Rains with
Databricks Cloud
Michal Malohlava
@mmalohlava
Meetup 2016/02/04, SF

Who Am I?
Background
• PhD in CS from Charles University in Prague, Czech Republic
• Postdoc at Purdue University experimenting with algos for large-scale computation
• Now SW engineer at H2O.ai

Experience with domain-specific languages, distributed systems, software engineering, and big data.

H2O.ai
H2O team
Co-Founders: Sri Ambati, Cliff Click
Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie

H2O
Open-Source In-Memory Data Science
Platform
• Highly optimized Java code (in-house)
• Distributed in-memory K-V store and map/reduce computation framework
• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)
• Read/write access to distributed data frames (R/Pandas-style)
• ML algos - Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
• REST API with clients: interactive UI, R, Python

Apache Spark
Open-source distributed execution platform
User-friendly API for data transformation based on RDDs, DataFrames (from 1.4) and Datasets (from 1.6)
Platform components - SQL, MLlib, text mining, Avro, Redshift, Kinesis.
Easily extendable by 3rd party packages
Interactive shell
Current release 1.6
Supported releases 1.3, 1.4, 1.5

Databricks
• founded by the creators of Apache Spark
• still contribute 75% of the code to the Spark project
• cloud platform for running Spark in your AWS account
Databricks Platform
• integrated collaborative data science workspace
• notebook interface inspired by IPython and Zeppelin but purpose-built for Spark
• self-service cluster manager and job scheduler for production Spark workloads

Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and algorithms with Spark API
Platform for building Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Functionality missing in H2O can be replaced by Spark and vice versa

Model Building
Data Source -> Data munging -> Modelling -> Prediction processing
Modelling: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

Data Munging
Data Source -> Data load/munging/exploration -> Modelling

Stream processing
Off-line model training: Data Source -> Data munging -> Modelling
Export model in a binary format or as code
Deploy the model
Stream processing: Data Stream -> Spark Streaming/Storm -> Model prediction

Databricks
Driver node: Scala/Py main program, SparkContext, H2OContext
Worker node: Spark executor
Worker node: Spark executor
Worker node: Spark executor

Spark Cluster: Spark executors, each running H2O services
Data Source -> DataFrame (Spark)
Data Source -> H2OFrame (H2O services)
h2oContext.asDataFrame: H2OFrame -> DataFrame
h2oContext.asH2OFrame: DataFrame -> H2OFrame

What do we need?
Databricks account (14-day free trial at www.databricks.com)
AWS account
Sparkling Water coordinates: ai.h2o:sparkling-water-examples_2.10:1.5.10
And some cool machine learning idea!

Goal
For a given text message, identify whether it is spam or not

Machine Learning
Workflow
1. Extract data
2. Transform, tokenize messages
3. Build TF-IDF model
4. Create and evaluate Deep Learning model
5. Use the model to detect spam
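
Steps 2 and 3 of the workflow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the talk: the function names (tokenize, fit_idf, tf_idf), the tokenization rule, and the sample messages are all hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is just one common choice. Steps 4 and 5 (training and applying a Deep Learning model) are outside the scope of the sketch.

```python
import math
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message and keep alphabetic tokens of length >= 2.

    Assumption: a simple regex tokenizer; the talk does not specify one.
    """
    return [t for t in re.findall(r"[a-z]+", message.lower()) if len(t) >= 2]


def fit_idf(tokenized_docs):
    """Compute a smoothed inverse document frequency per term.

    idf(term) = log((N + 1) / (df(term) + 1)), where N is the number of
    documents and df(term) is the number of documents containing the term.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Weight raw term counts by IDF; terms unseen during fitting are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Hypothetical sample messages, for illustration only.
messages = ["Free prize, call now", "Are we meeting now?"]
tokenized = [tokenize(m) for m in messages]
idf = fit_idf(tokenized)
vectors = [tf_idf(tokens, idf) for tokens in tokenized]
```

The resulting sparse vectors (one dictionary per message) are the kind of features a classifier in step 4 would be trained on. A term that appears in every message, such as "now" here, gets a weight of zero and so does not help separate spam from non-spam.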

More info
Check out H2O.ai Training Books
http://learn.h2o.ai/
Check out H2O.ai Blog
http://h2o.ai/blog/
Check out H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/

Learn more at h2o.ai
Follow us at @h2oai
Thank you!
Sparkling Water is an open-source ML application platform combining the power of Spark and H2O
