Two popular tools for doing Machine Learning on top of the JVM ecosystem are H2O and SparkML. This presentation compares the two as machine learning libraries (it does not consider Spark's data-munging capabilities). This work was done in June 2018.
2. H2O
Open Source, In-Memory, Distributed Machine Learning Tool
• Open Source (Apache 2.0)
• In-Memory (Faster)
• Distributed (Big Data/No Sampling)
• Third Version (Stable)
• Easy To Use
• Mission - "How do we get this to work efficiently at big data scale?"
http://docs.h2o.ai/
3. • R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow)
• Entire library is embedded inside a jar file
• Written in Java; natively supports Java & Scala
• R, Python, JavaScript, Excel, Tableau, and Flow communicate with
H2O clusters using REST API calls
• Easy to switch between R/Python/Java/Flow environments
Multiple Language Support
• Uses in-memory compression (2-4 times smaller than gzip)
• Data frames are much smaller in memory and on disk
• Handles billions of data rows in-memory, even with a small cluster
• Data gets distributed across multiple JVMs
• Modeling uses the whole data set (no sampling)
• Faster training/prediction time
• The larger the data set, the better the performance
• Comes with the Flow web-based GUI (easy to use for non-programmers)
• However, not very impressive!
• Easy to deploy models in production
• Checkpoint
• Continue training an existing model with new data
• Iterative Methods (???)
H2O : Advantage
https://en.wikipedia.org/wiki/H2O_(software)
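The checkpoint feature above can be sketched with H2O's Python API. This is a minimal sketch, assuming a running local H2O cluster; the CSV paths and the "label" column name are hypothetical.

```python
# Sketch: continuing training of an existing H2O model on new data
# via the `checkpoint` parameter.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # connect to (or start) a local H2O cluster

train = h2o.import_file("train.csv")        # initial batch (hypothetical path)
newer = h2o.import_file("train_newer.csv")  # newer data, same schema

gbm = H2OGradientBoostingEstimator(model_id="gbm_v1", ntrees=50)
gbm.train(y="label", training_frame=train)

# Continue training the same model: pass the previous model_id as
# `checkpoint` and a larger `ntrees` target.
gbm_v2 = H2OGradientBoostingEstimator(checkpoint="gbm_v1", ntrees=100)
gbm_v2.train(y="label", training_frame=newer)
```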
5. Clustering (1/2)
• Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster
• Clustering enhances speed of computation
• Hadoop/Spark for clustering is NOT mandatory
• Multi-node cluster with shared memory model
• All computation in-memory
• Each node sees only some rows of data
• No limit to cluster size
• Distributed Data Frames (collection of vectors)
• Columns are distributed (across nodes)
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
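A multi-node cluster like the one described above is typically formed by starting `h2o.jar` on each machine with a shared flatfile, then attaching a client. A minimal sketch, assuming the nodes are already running; the hostname is hypothetical:

```python
# Sketch: connecting a Python client to an existing multi-node H2O cluster.
# Nodes are started out-of-band, e.g. on each machine:
#   java -Xmx8g -jar h2o.jar -name mycluster -flatfile flatfile.txt -port 54321
# where flatfile.txt lists every node's ip:port.
import h2o

h2o.connect(ip="node1.example.com", port=54321)
print(h2o.cluster().cloud_size)  # number of nodes that joined the cloud
</imports>
```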
6. Clustering : Limitations (2/2)
• For small data, clustering introduces slowness
• Find the sweet spot between data size & number of nodes
• Each node in the cluster should be of the same size (recommended)
• New nodes cannot be added once the cluster has started
• If any machine dies, the whole cluster must be rebuilt
• If a single node is removed, the whole cluster becomes unusable
• Nodes should be physically close, to minimize network latency
• Each node must be running the same version of h2o.jar
7. Productionizing H2O
1. Build a model using Python/R/Java/Flow
2. Download the model (as a POJO or MOJO) as a zip file
3. Download the resulting h2o-genmodel.jar (a library supporting scoring)
4. Invoke the model from a Java class to generate predictions
• Can be easily embedded inside a Java Application
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
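Steps 1-3 can be sketched with the Python API; step 4 happens on the Java side. A minimal sketch, assuming a running H2O cluster; paths and the "label" column are hypothetical.

```python
# Sketch: train a model, then export it as a MOJO plus the
# h2o-genmodel.jar scoring library.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(y="label", training_frame=train)

# Writes the MOJO zip and h2o-genmodel.jar to the given path.
mojo_path = model.download_mojo(path="/tmp/models", get_genmodel_jar=True)
print(mojo_path)

# Step 4 is done on the Java side, roughly:
#   EasyPredictModelWrapper m =
#       new EasyPredictModelWrapper(MojoModel.load("/tmp/models/<mojo>.zip"));
#   BinomialModelPrediction p = m.predictBinomial(rowData);
```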
8. H2O Flow
• Web-based interactive client environment
• Similar to Jupyter Notebook
• Can be used by non-programmers as well (mouse clicks!)
• Combines code execution, text, mathematics, plots & rich media in a single document
• Allows
• Data upload
• View data uploaded directly / through other clients
• Build models
• View models built directly / through other clients
• Predict
• View predictions generated directly / through other clients
• Check cluster/CPU status
9. Algorithms
Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest,
Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier,
Stacked Ensembles, XGBoost
Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means
Clustering, Principal Component Analysis (PCA)
Miscellaneous: Word2vec
Common: Quantiles, Early Stopping
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
10. H2O Ecosystem
• H2O
• Steam
• Enterprise Steam
• Sparkling Water
• Driverless AI
• H2O4GPU
11. H2O Steam
• End-to-end platform that streamlines the entire process of building and deploying applications
• Cluster Manager
• Start/stop clusters, allocate memory, start/pause/stop H2O instances
• Secure multi-tenant environment
• Secure multi-tenant environment
• Model Manager
• Build, store, manage, compare, promote (historical) models
• Run A/B Test for models
• Scoring Server
• Deploys a model
• Scoring through REST API or In-App
12. Sparkling Water (1/3)
• Combines the fast, scalable machine learning algorithms of H2O with
the capabilities of Spark
• Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster
• “Certified on Spark”
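The executor-side launch described above can be sketched with PySparkling. A minimal sketch, assuming `pyspark` and a matching `h2o-pysparkling` package are installed; the file path is hypothetical.

```python
# Sketch: launching H2O inside a Spark application with PySparkling.
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sw-demo").getOrCreate()
hc = H2OContext.getOrCreate(spark)   # starts H2O on each Spark executor

df = spark.read.csv("data.csv", header=True, inferSchema=True)
h2o_frame = hc.asH2OFrame(df)        # hand the Spark DataFrame to H2O
```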
13. Sparkling Water – Use Case (2/3)
Use Case 1:
The data pipeline consists of multiple data transformations using the Spark
API. The final form of the data is converted into an H2O frame and passed to
an H2O algorithm.
Use Case 2:
The data pipeline uses H2O's parallel data-load and parse capabilities, while
the Spark API serves as another provider of data transformations. H2O can
also be used as an in-place data transformer.
http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
14. Sparkling Water – Use Case (3/3)
Use Case 3:
1. The offline training pipeline, invoked regularly, uses the Spark & H2O
APIs and produces an H2O model as output. The model is exported in a form
independent of the H2O runtime.
2. The streaming data pipeline (using Spark Streaming) uses the model trained
in the first pipeline to score incoming data. Since the model is exported
with no runtime dependency on H2O, the streaming pipeline can stay
lightweight and independent of the H2O/Sparkling Water infrastructure.
15. Spark (MLlib) vs H2O
• Spark is better at the data preparation and data munging steps
• H2O's algorithms are faster than those in Spark MLlib
• MLlib underperforms in terms of memory, CPU, and time
• H2O provides a web interface (Flow) for data visualization
• H2O and MLlib have an overlap of algorithms
• H2O is better for productionization
• The POJO/MOJO approach is friendlier to integrate with Java applications
• Allows evaluation-metrics visualization and tracking of jobs and job statuses
• H2O allows grid search (Spark ML also offers grid search, via ParamGridBuilder)
• Spark has better community support
• H2O has enterprise support
Check the slide on References
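H2O's grid search mentioned above can be sketched with its Python API. A minimal sketch, assuming a running cluster; the file path and "label" column are hypothetical.

```python
# Sketch: hyper-parameter grid search over a GBM with H2OGridSearch.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")

hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}
grid = H2OGridSearch(model=H2OGradientBoostingEstimator(ntrees=50),
                     hyper_params=hyper_params)
grid.train(y="label", training_frame=train)

# Trained models sorted by a metric, best first.
print(grid.get_grid(sort_by="auc", decreasing=True))
```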
16. • Needs of the "iyzico" fraud-detection product
• Continuous Delivery: models need to be continuously deployed to production
• Real-Time Fraud Detection: prediction time of max 100 ms
• High Availability & Scalability
• Low Learning Curve: stack should be usable by data scientists & SW developers
• Open Source
• Fast: fast prototyping & deploying
• On Premise
• Initial Choice
• prediction.io + Spark ML
Case Study I : Migration From Spark MLlib To H2O (1/3)
Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
17. Case Study I (2/3)
• Benchmarking Criteria : TensorFlow, Spark ML, H2O (winner)
• Simplicity of deploying an existing model (local env) to production
• POJO-based models; easy to deploy in a Java environment
• Release management and the DevOps cycle are easy
• Hardware requirements for training
• Memory needed for training on 1 million transactions & 100 features with RF (64 trees):
Spark ML: 16 GB RAM, TensorFlow: 10 GB, H2O: 2 GB
• Decision Trees and Bayesian Models
• Python, R, SQL support
• Experimentation in a local environment
• Experiments can be done with Python, R
• Prediction time (ms)
• Feature engineering and the data pipeline were in Java 8; no migration needed
• Migration from Spark ML + prediction.io to H2O
• 60 GB RAM saved (Spark ML & prediction.io needed it for model training)
• 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time)
• Response time decreased almost 10x (300 milliseconds to 35 milliseconds)
Case Study I (3/3)
19. Case Study II : Booking.com (1/n)
Source: https://www.youtube.com/watch?v=_CBKECLkIt8
23. Benchmarking ML Libraries
https://github.com/szilard/benchm-ml
• Training data
• Number of rows varied as 10K, 100K, 1M, 10M
• ~1K features
• Binary classification problem
• Hardware (single instance)
• Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM)
• If OOM, r3.8xlarge instance (32 cores, 250 GB RAM)
• Observations
• Training time
• Maximum memory usage during training
• AUC (predictive accuracy)
24. Random Forest
H2O
• Fast, uses all cores, more accurate
• Memory Efficient
• 1M rows: 5 GB, 10M rows: 25 GB
Spark MLlib
• Slower
• Larger memory footprint
• Runs OOM at n = 1M
• With 250 GB, finishes for 1M but crashes for 10M
• AUC degraded at 1M
• Spark 2.0 is even slower
XGBoost
• Fast
• High accuracy
• Memory efficient
• 1M rows: 2 GB, 10M rows: 9 GB
26. Performance of various GBM implementations
For deployment, H2O has the best ways to deploy as a real-time (fast-scoring) application.
https://github.com/szilard/GBM-perf
27. Do I need Big Data?
• Single Instance vs Cluster
• Sending data over a network vs using shared memory
• Several distributed systems have significant computation & memory overhead
• Map-reduce style communication patterns: not the best fit for many ML algorithms
Benchmarking For Bigger Data
28. Netflix VectorFlow
• Minimalist library
• Specifically optimized for training on sparse data
• Single-machine, multi-core environment
29. Benchmarking For Bigger Data
• Not enough clarity about the hardware used
• For tree-based ensembles (RF, GBM), H2O and XGBoost can train on 100M records on a single server, though training times stretch to several hours
(Charts: single-node vs multi-node benchmark results)
31. Disadvantages
• No High Availability (HA) for clusters
• Doesn't work well on sparse data
• GPU support is in alpha stage
• There is no SVM
• Cluster support helps with Big Data
• Small data needs a single, fast machine with lots of cores
36. Do I need Spark to run H2O?
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
37. H2O : POJO vs MOJO
- POJOs are not supported for source files larger than 1 GB
- MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models
- POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
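Exporting the same trained model in both formats can be sketched with the Python API. A minimal sketch, assuming a running cluster; the path and "label" column are hypothetical.

```python
# Sketch: exporting a trained model as a POJO (generated Java source)
# vs a MOJO (portable zip artifact).
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(y="label", training_frame=train)

h2o.download_pojo(model, path="/tmp/models")   # writes <model_id>.java
model.download_mojo(path="/tmp/models")        # writes <model_id>.zip
```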
38. Spark ML vs Spark MLlib
• Spark MLlib vs Spark ML:
• https://spark.apache.org/docs/latest/ml-guide.html