Machine Learning With H2O vs SparkML
by Arnab Biswas
June 2018
H2O
Open Source, In-Memory, Distributed Machine Learning Tool
• Open Source (Apache 2.0)
• In-Memory (Faster)
• Distributed (Big Data/No Sampling)
• Third Version (Stable)
• Easy To Use
• Mission - "How do we get this to work efficiently at big data scale?"
http://docs.h2o.ai/
Multiple Language Support
• R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow)
• Entire library is embedded inside a jar file
• Written in Java, so it naturally supports Java & Scala
• R, Python, JavaScript, Excel, Tableau and Flow communicate with H2O clusters using REST API calls
• Easy to switch between R/Python/Java/Flow environments
H2O : Advantages
• Uses in-memory compression (2-4 times smaller than gzip)
• Data frames are much smaller in memory and on disk
• Handles billions of data rows in-memory, even with a small cluster
• Data gets distributed across multiple JVMs
• Modeling uses the whole data set (without sampling)
• Faster training/prediction time
• The larger the data set, the better the performance
• Includes Flow, a web-based GUI (easy to use for non-programmers)
• However, not very impressive!
• Easy to deploy models in production
• Checkpoint
• Continue training an existing model with new data
• Iterative Methods (???)
https://en.wikipedia.org/wiki/H2O_(software)
Clustering (1/2)
• Can be deployed on a single node / multi-node cluster / Hadoop cluster
/ Apache Spark cluster
• Clustering enhances speed of computation
• Hadoop/Spark for clustering is NOT mandatory
• Multi-node cluster with shared memory model
• All computation in-memory
• Each node sees only some rows of data
• No limit to cluster size
• Distributed Data Frames (collection of vectors)
• Columns are distributed (across nodes)
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
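The idea that each node holds only some rows of a column, and that results are combined afterwards, can be illustrated with a toy sketch. This is not H2O's implementation; the function names (`split_rows`, `local_stats`, `distributed_mean`) and the sample column are hypothetical, and the "nodes" are just Python lists in one process.

```python
from typing import List, Tuple


def split_rows(column: List[float], n_nodes: int) -> List[List[float]]:
    """Split one column into contiguous row chunks, one chunk per node."""
    chunk = -(-len(column) // n_nodes)  # ceiling division
    return [column[i * chunk:(i + 1) * chunk] for i in range(n_nodes)]


def local_stats(chunk: List[float]) -> Tuple[float, int]:
    """Map step: a node summarises only the rows it holds."""
    return sum(chunk), len(chunk)


def distributed_mean(column: List[float], n_nodes: int) -> float:
    """Reduce step: combine the per-node partial results into one mean."""
    partials = [local_stats(c) for c in split_rows(column, n_nodes)]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical column of seven values spread over three "nodes".
ages = [23.0, 35.0, 41.0, 29.0, 52.0, 38.0, 47.0]
print(split_rows(ages, 3))
print(distributed_mean(ages, 3))
```

Only the small `(sum, count)` pairs travel between nodes in this sketch, which is why the network cost stays low relative to the data size.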
Clustering : Limitations (2/2)
• For small data, clustering introduces slowness
• Find the sweet spot between data size & number of nodes
• Each node in the cluster should be of the same size (recommended)
• New nodes cannot be added once the cluster starts up
• If any machine dies, the whole cluster must be rebuilt
• If a single node gets removed, the whole cluster becomes unusable
• Nodes should be physically close, to minimize network latency
• Each node must be running the same version of h2o.jar
Productionizing H2O
1. Build a model using Python/R/Java/Flow
2. Download the model (as a POJO or MOJO) as a zip file.
3. Download the resulting h2o-genmodel.jar (a library supporting scoring)
4. Invoke the model from a Java class to generate predictions
• Can be easily embedded inside a Java Application
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
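Steps 1 to 3 above might look roughly like this from the Python client. This is a minimal sketch, assuming the `h2o` package and a Java runtime are installed; the helper names, the choice of a GBM, and its settings (`ntrees=50`, `max_depth=5`) are illustrative assumptions, not taken from the slides. Step 4 (scoring from a Java class with h2o-genmodel.jar) is not shown.

```python
from typing import List


def predictor_columns(columns: List[str], response: str) -> List[str]:
    """Every column except the response is used as a predictor."""
    return [c for c in columns if c != response]


def train_and_export_mojo(csv_path: str, response: str, out_dir: str = ".") -> str:
    """Train a small GBM and export it as a MOJO zip plus h2o-genmodel.jar."""
    # Imported lazily so this module still loads where h2o is not installed.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()  # start or connect to a local H2O instance
    frame = h2o.import_file(csv_path)
    # For classification the response column would also need .asfactor().
    model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
    model.train(
        x=predictor_columns(frame.columns, response),
        y=response,
        training_frame=frame,
    )
    # Writes the MOJO zip to out_dir; get_genmodel_jar=True also saves
    # h2o-genmodel.jar next to it for use from Java.
    return model.download_mojo(path=out_dir, get_genmodel_jar=True)
```

A call such as `train_and_export_mojo("train.csv", "label")` (hypothetical file and column names) would return the path of the exported MOJO.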
H2O Flow
• Web-based interactive client environment
• Similar to Jupyter Notebook
• Can be used by non-programmers as well (mouse clicks!)
• Combine code execution, text, mathematics, plots
& rich media in a single document
• Allows
• Data upload
• View data uploaded directly / through other
clients
• Build Model
• View models built directly / through other
clients
• Predict
• View predictions generated directly or through
other clients
• Check cluster/CPU status
Algorithms
• Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost
• Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA)
• Miscellaneous: Word2vec
• Common: Quantiles, Early Stopping
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
H2O Ecosystem
• H2O
• Steam
• Enterprise Steam
• Sparkling Water
• Driverless AI
• H2O4GPU
H2O Steam
• End-to-end platform that streamlines the entire process of building and deploying
applications
• Cluster Manager
• Start/stop cluster, allocate memory, start/pause/stop H2O instances
• Secure multi-tenant environment
• Model Manager
• Build, store, manage, compare, promote (historical) models
• Run A/B Test for models
• Scoring Server
• Deploys a model
• Scoring through REST API or In-App
Sparkling Water (1/3)
• Combines the fast, scalable machine learning algorithms of H2O with
the capabilities of Spark
• Provides a way to launch the H2O service on each Spark executor in
the Spark cluster, forming an H2O cluster
• “Certified on Spark”
Sparkling Water – Use Case (2/3)
Use Case 1:
The data pipeline consists of multiple data transformations done with the help of the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm.
Use Case 2:
The data pipeline consists of H2O's parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer.
http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
Sparkling Water – Use Case (3/3)
Use Case 3:
1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and provides an H2O model as output. The model is exported in a form independent of the H2O run-time.
2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O/Sparkling Water infrastructure.
Spark (MLlib) vs H2O
• Spark is better at the data preparation and data munging steps
• H2O is faster than the algorithms in Spark MLlib
• MLlib underperforms in terms of memory, CPU and time
• H2O provides a Web Interface (Flow) for data visualization
• H2O and MLlib have overlapping algorithms
• H2O is better for productionization
• POJO/MOJO approach is friendlier to integrate with Java applications
• Allows evaluation metrics visualization, tracking jobs and job statuses
• H2O allows grid search (Spark ML also supports it through ParamGridBuilder with CrossValidator)
• Spark has better community support
• H2O has enterprise support
Check the slide on References
Case Study I : Migration From Spark MLlib To H2O (1/3)
• Needs of the “iyzico” fraud detection product
• Continuous Delivery: Models need to be continuously deployed to production
• Real-Time Fraud Detection: Prediction time of max 100 ms
• High Availability & Scalability
• Low Learning Curve: Stack should be usable by data scientists & SW developers
• Open Source
• Fast: Fast prototyping & deploying
• On Premise
• Initial Choice
• prediction.io + Spark ML
Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
Case Study I (2/3)
• Benchmarking Criteria : TensorFlow, Spark ML, H2O (Winner)
• Simplicity of deploying an existing model (local env) to production
• POJO based models. Easy to deploy in Java environment
• Release management and DevOps cycle are easy
• Hardware requirements for training
• Memory needed for training on 1 million transactions & 100 features with RF (64 trees): Spark ML: 16 GB RAM, TensorFlow: 10 GB, H2O: 2 GB
• Decision Trees and Bayesian Models
• Python, R, SQL Support
• Experimentation on local environment
• Experiments can be done with Python, R
• Prediction time (ms)
Case Study I (3/3)
• Feature engineering and the data pipeline were already in Java 8, so they did not need to be migrated
• Migration from Spark ML + prediction.io to H2O
• 60 GB RAM saved (Spark ML & prediction.io needed it for model training)
• 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time)
• Response time decreased roughly 8.6 times (300 milliseconds to 35 milliseconds)
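As a quick check of the figures quoted in this case study, the ratios can be worked out directly. The variable names below are illustrative; the numbers are the ones reported on the slides (300 ms to 35 ms response time, and 16 GB / 10 GB / 2 GB of training memory).

```python
# Response time before and after the migration, in milliseconds.
response_ms_before = 300
response_ms_after = 35
speedup = response_ms_before / response_ms_after
print(f"response-time speedup: {speedup:.1f}x")

# Training memory reported for 1 million transactions, 100 features, RF with 64 trees.
training_ram_gb = {"Spark ML": 16, "TensorFlow": 10, "H2O": 2}
for name, gb in training_ram_gb.items():
    ratio = gb / training_ram_gb["H2O"]
    print(f"{name}: {gb} GB ({ratio:.0f}x the H2O footprint)")
```

The speedup comes out at about 8.6x, and Spark ML needed 8 times the training memory reported for H2O.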
Case Study II : Booking.com
Source: https://www.youtube.com/watch?v=_CBKECLkIt8
Spark/Sparkling Water – Do I need it?
Benchmarking ML Libraries
https://github.com/szilard/benchm-ml
• Training data
• Number of rows varied as 10K, 100K, 1M, 10M
• ~1K features
• Binary Classification Problem
• Hardware (Single Instance)
• Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM)
• If OOM, r3.8xlarge instance (32 cores, 250 GB RAM)
• Observations
• Training time
• Maximum memory usage during training
• AUC (predictive accuracy)
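The three observations (training time, peak memory, AUC) can be collected with a small harness like the one below. This is a sketch under stated assumptions, not the benchmark's actual code: `auc` and `measure` are hypothetical helpers, the labels and scores are made-up toy data, and `tracemalloc` only sees Python-level allocations, so it would not capture memory used inside a JVM or native library.

```python
import time
import tracemalloc
from typing import Callable, List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def measure(fn: Callable[[], object]) -> Tuple[object, float, int]:
    """Run fn once; return its result, wall-clock seconds and peak bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


# Toy data standing in for a model's predictions.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value, seconds, peak_bytes = measure(lambda: auc(labels, scores))
print(f"AUC={value:.2f} time={seconds:.6f}s peak={peak_bytes} bytes")
```

The pairwise AUC is quadratic in the number of rows, which is fine for a sketch but would be replaced by a rank-based computation at benchmark scale.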
Random Forest
H2O
• Fast, uses all cores, more accurate
• Memory Efficient
• 1M: 5 GB, 10M: 25 GB
Spark MLlib
• Slower
• Larger memory footprint
• Runs out of memory (OOM) at n = 1M
• With 250 GB, finishes for 1M, but crashes for 10M
• AUC broke at 1M
• Spark 2.0 is even slower
XGBoost
• Fast
• High accuracy
• Memory efficient
• 1M: 2 GB, 10M: 9 GB
Gradient Boosting Machines
learn_rate=0.01, max_depth=16, n_trees=1000
learn_rate=0.1, max_depth=6, n_trees=300
• Memory footprint of GBMs is smaller than for RF
• Bottleneck is mainly training time
• Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees
• H2O and xgboost are the fastest
Performance of various GBM implementations
For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application.
https://github.com/szilard/GBM-perf
Do I need Big Data?
• Single Instance vs Cluster
• Sending data over a network vs using shared memory
• Several distributed systems have significant computation & memory overhead
• Map-reduce style communication pattern: not the best fit for many ML algorithms
Benchmarking For Bigger Data
Netflix VectorFlow
• Minimalist library
• Specifically optimized for
training sparse data
• Single-machine, multi-core
environment
Benchmarking For Bigger Data
• Not enough clarity about the hardware used
• For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though the training times become several hours
Single Node
Multiple Nodes
Security In H2O
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html
Disadvantages
• No High Availability (HA) for Clusters
• Doesn’t work well on sparse data
• GPU Support is in alpha stage
• There is No SVM
• Cluster support helps Big Data
• Small data needs a single, fast machine with lots of cores
References
• https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How
• https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool
• https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually
• https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ
Questions
H2O Architecture
https://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf
http://gotocon.com/dl/goto-berlin-2014/slides/PetrMaj_and_TomasNykodym_FastAnalyticsOnBigData.pdf
H2O Frame Distributed Fork & Join
Do I need Spark to run H2O?
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
H2O : POJO vs MOJO
- POJOs are not supported for source files larger than 1G
- MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM,
GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost
models.
- POJOs are also not supported for XGBoost, GLRM, or Stacked
Ensembles models.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
Spark ML vs Spark MLlib
• Spark MLlib vs Spark ML :
• https://spark.apache.org/docs/latest/ml-guide.html
Machine Learning With H2O vs SparkML

  • 1. Machine Learning by H2O vs SparkML Arnab Biswas June 2018
  • 2. H2O Open Source, In-Memory, Distributed Machine Learning Tool • Open Source (Apache 2.0) • In-Memory (Faster) • Distributed (Big Data/No Sampling) • Third Version (Stable) • Easy To Use • Mission - "How do we get this to work efficiently at big data scale?“ http://docs.h2o.ai/
  • 3. • R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow) • Entire library is embedded inside a jar file • Composed in Java, naturally supports Java & Scala • R, Python, JavaScript, Excel, Tableau, Flow communicates with H2O clusters using REST API calls • Easy to switch between R/Python/Java/Flow environments Multiple LanguageSupport
  • 4. • Uses in-memory compression(2-4 times smaller than gzip) • Data frames are much smaller in memory and on disk • Handles billions of data rows in-memory, even with a small cluster • Data gets distributed acrossmultiple JVM • Modelingusing whole set of data (without sampling) • Faster training/predictiontime • The larger is the data set, the better is the performance • Consists of a Flow web-based GUI (Easy to use for Non-Programmers) • However,notvery impressive! • Easy to deploy models in production • Checkpoint • Continuetraining an existing model with new data • IterativeMethods (???) H2O : Advantage https://en.wikipedia.org/wiki/H2O_(software)
  • 5. Clustering (1/2) • Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster • Clustering enhances speed of computation • Hadoop/Spark for clustering is NOT mandatory • Multi-node cluster with shared memory model • All computation in-memory • Each node sees only some rows of data • No limit to cluster size • Distributed Data Frames (collection of vectors) • Columns are distributed (across nodes) - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
  • 6. Clustering : Limitations (2/2) • For small data, clustering introduces slowness • Find the sweet spot between data size & number of nodes • Each node on the cluster must be of same size (Recommended) • New Nodes can not be added once the cluster starts up • If any machine dies, the whole cluster must be rebuilt • If a single node gets removed, whole cluster becomes unusable • Nodes should be physically close, to minimize network latency • Each node must be running the same version of h2o.jar
  • 7. Productionizing H2O 1. Build a Modelusing Python/R/Java/Flow 2. Download the model (as a POJO or MOJO)as a zip file. 3. Download resultingh2o-genmodel.jar (Isa library supportingscoring) 4. Invokethe model fromJava class to generate prediction • Can be easily embedded inside a Java Application http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
  • 8. H2O Flow • Web-based interactive client environment • Similar to Jupyter Notebook • Can be used by non-programmer as well (Mouse clicks!) • Combine code execution, text, mathematics, plots & rich media in a single document • Allows • Data upload • View data uploaded directly / through other clients • Build Model • View models built directly / through other clients • Predict • View predictions generated directly or through other clients • Check cluster/CPUstatus
  • 9. Algorithms Supervised Unsupervised Miscellaneous Common Cox Proportional Hazards Aggregagtor Word2vec Quantiles Deep Learning Generalized Low Rank Models (GLRM) Early Stopping Distributed Random Forest K-Means Clustering Generalized Linear Model Principal Component Analysis (PCA) Gradient Boosting Machine Naïve Bayes Classifier Stacked Ensembles XGBoost https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
  • 10. H2O Ecosystem • H2O • Steam • Enterprise Steam • Sparkling Water • Driverless AI • H2O4GPU
  • 11. H2O Steam • End-to-end platform that streamlines the entire process of building and deploying applications • Cluster Manager • Start/stop cluster, allocate memory, start/pause/stopH2O instances • Secure multi-tenant environment • Model Manager • Build, store, manage, compare, promote (historical) models • Run A/B Test for models • Scoring Server • Deploys a model • Scoring through REST API or In-App
  • 12. Sparkling Water (1/3) • Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark • Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming a H2O cluster • “Certified on Spark”
  • 13. Sparkling Water – Use Case (2/3) • Use Case 1: The data pipeline consists of multiple data transformations done with the help of the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm. • Use Case 2: The data pipeline uses H2O’s parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer. • http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
  • 14. Sparkling Water – Use Case (3/3) • Use Case 3: 1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and provides an H2O model as output. The model is exported in a form independent of the H2O run-time. 2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O / Sparkling Water infrastructure.
  • 15. Spark (MLlib) vs H2O • Spark is better at the data preparation and data munging steps • H2O is faster than the algorithms in Spark MLlib • MLlib underperforms in terms of memory, CPU and time • H2O provides a web interface (Flow) for data visualization • H2O and MLlib have overlapping algorithms • H2O is better for productionization • POJO/MOJO approach is friendlier for integration with Java applications • Allows visualization of evaluation metrics, and tracking of jobs and job statuses • H2O allows grid search (Spark doesn’t?) • Spark has better community support • H2O has enterprise support • See the References slide
  • 16. Case Study I : Migration From Spark MLlib To H2O (1/3) • Needs of the “iyzico” fraud detection product • Continuous Delivery: Models need to be continuously deployed to production • Real-Time Fraud Detection: Prediction time of max 100 ms • High Availability & Scalability • Low Learning Curve: Stack should be usable by data scientists & software developers • Open Source • Fast: Fast prototyping & deploying • On Premise • Initial Choice • prediction.io + Spark ML • Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
  • 17. Case Study I (2/3) • Benchmarking criteria, compared across TensorFlow, Spark ML and H2O (winner) • Simplicity of deploying an existing model (local env) to production • POJO-based models. Easy to deploy in a Java environment • Release management and DevOps cycle are easy • Hardware requirements for training • Memory needed for training with 1 million transactions & 100 features with RF (64 trees): Spark ML: 16 GB RAM, TensorFlow: 10 GB, H2O: 2 GB • Decision Trees and Bayesian Models • Python, R, SQL support • Experimentation on local environment • Experiments can be done with Python, R • Prediction time (ms)
  • 18. Case Study I (3/3) • Feature engineering and the data pipeline were already in Java 8, so they needed no migration • Migration from Spark ML + prediction.io to H2O • 60 GB RAM saved (Spark ML & prediction.io needed it for model training) • 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time) • Response time decreased almost 10 times (from 300 milliseconds to 35 milliseconds)
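The figures quoted in this case study can be checked with a few lines of arithmetic. The sketch below only restates numbers from the slides (300 ms to 35 ms, the 100 ms prediction budget, and the 16 GB / 10 GB / 2 GB training-memory figures); the variable names are made up for illustration.

```python
# Figures quoted on the Case Study I slides; variable names are illustrative.
latency_before_ms = 300   # Spark ML + prediction.io
latency_after_ms = 35     # H2O
latency_budget_ms = 100   # real-time fraud detection requirement

speedup = latency_before_ms / latency_after_ms
print(f"Speed-up: {speedup:.1f}x")
print(f"Within budget before: {latency_before_ms <= latency_budget_ms}")
print(f"Within budget after:  {latency_after_ms <= latency_budget_ms}")

# Training memory for 1M transactions, 100 features, RF with 64 trees (GB).
training_memory_gb = {"Spark ML": 16, "TensorFlow": 10, "H2O": 2}
memory_ratio = {
    name: gb / training_memory_gb["H2O"]
    for name, gb in training_memory_gb.items()
}
print(memory_ratio)
```

The latency ratio works out to roughly 8.6, which the slide rounds to "almost 10 times"; only the H2O figure fits inside the 100 ms budget.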
  • 19. Case Study II : Booking.com (1/3) Source: https://www.youtube.com/watch?v=_CBKECLkIt8
  • 20. Case Study II : Booking.com (2/3)
  • 21. Case Study II : Booking.com (3/3)
  • 22. Spark/Sparkling Water – Do I need it?
  • 23. Benchmarking ML Libraries https://github.com/szilard/benchm-ml • Training data • Number of rows varied as 10K, 100K, 1M, 10M • ~1K features • Binary classification problem • Hardware (single instance) • Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM) • If OOM, r3.8xlarge instance (32 cores, 250 GB RAM) • Observations • Training time • Maximum memory usage during training • AUC (predictive accuracy)
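AUC, the accuracy measure used in this benchmark, can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it by brute force over all pairs; it is a plain-Python illustration with made-up data, not the code used in the benchmark.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Runs in O(n_pos * n_neg), which is fine for
    small illustrative inputs but not for benchmark-sized data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

A model that ranks every positive above every negative scores 1.0, and random scoring gives about 0.5.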
  • 24. Random Forest • H2O: • Fast, uses all cores, more accurate • Memory efficient • 1M: 5 GB, 10M: 25 GB • Spark MLlib: • Slower • Larger memory footprint • Runs out of memory at n = 1M • With 250 GB RAM, finishes for 1M, but crashes for 10M • AUC broke at 1M • Spark 2.0 is even slower • XGBoost: • Fast • High accuracy • Memory efficient • 1M: 2 GB, 10M: 9 GB
  • 25. Gradient Boosting Machines • Configurations benchmarked: • learn_rate=0.01, max_depth=16, n_trees=1000 • learn_rate=0.1, max_depth=6, n_trees=300 • Memory footprint of GBMs is smaller than for RF • Bottleneck is mainly training time • Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees • H2O and xgboost are the fastest
  • 26. Performance of various GBM implementations • For deployment, H2O offers the best options for real-time (fast scoring) applications. https://github.com/szilard/GBM-perf
  • 27. Do I need Big Data? • Single Instance vs Cluster • Sending data over a network vs using shared memory • Several distributed systems have significant computation & memory overhead • Map-reduce style communication pattern: not the best fit for many ML algorithms Benchmarking For Bigger Data
  • 28. Netflix VectorFlow • Minimalist library • Specifically optimized for training on sparse data • Single-machine, multi-core environment
  • 29. Benchmarking For Bigger Data • Not enough clarity about the hardware used • For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though the training times become several hours • [Charts: Single Node; Multiple Nodes]
  • 31. Disadvantages • No High Availability (HA) for clusters • Doesn’t work well on sparse data • GPU support is in alpha stage • No SVM • Cluster support mainly helps with big data • Small data is better served by a single, fast machine with many cores
  • 32. References • https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How • https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool • https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually • https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ
  • 35. H2O Frame Distributed Fork & Join
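The fork & join idea can be illustrated with a toy example: a column is split into row chunks, each worker computes a partial result on its own chunk, and the partial results are combined at the end. This is only a sketch using Python threads on one machine; the function names are hypothetical and it does not reproduce H2O's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_rows(column, n_chunks):
    """Split one column into contiguous row chunks, one per simulated node."""
    if not column:
        raise ValueError("column must not be empty")
    size = -(-len(column) // n_chunks)  # ceiling division
    return [column[i:i + size] for i in range(0, len(column), size)]


def partial_sum_count(chunk):
    """Fork step: each worker sees only its own rows."""
    return sum(chunk), len(chunk)


def distributed_mean(column, n_chunks=4):
    """Compute a column mean from per-chunk partial sums and counts."""
    chunks = chunk_rows(column, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    # Join step: combine the small partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
print(distributed_mean(values))
```

Only the small (sum, count) pairs travel back to the caller, which is why this pattern suits data where each node holds just some of the rows.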
  • 36. Do I need Spark to run H2O? - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
  • 37. H2O : POJO vs MOJO - POJOs are not supported for source files larger than 1G - MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models. - POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
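The three rules on this slide can be encoded as a small lookup. The helper below is hypothetical (it is not part of the H2O API), the algorithm lists are copied from the slide, and "source file" is assumed to mean the generated POJO source. For algorithms the slide does not mention, "POJO" in the result only means the slide does not rule it out.

```python
# Algorithm lists copied from the slide; the helper itself is illustrative.
MOJO_SUPPORTED = {
    "AutoML", "Deep Learning", "DRF", "GBM", "GLM", "GLRM", "K-Means",
    "Stacked Ensembles", "SVM", "Word2vec", "XGBoost",
}
POJO_UNSUPPORTED = {"XGBoost", "GLRM", "Stacked Ensembles"}
POJO_MAX_SOURCE_GB = 1.0  # assumed to refer to the generated POJO source


def formats_not_ruled_out(algorithm, pojo_source_gb):
    """Return the export formats the slide's rules leave open for a model."""
    formats = []
    if algorithm in MOJO_SUPPORTED:
        formats.append("MOJO")
    pojo_allowed = (
        algorithm not in POJO_UNSUPPORTED
        and pojo_source_gb <= POJO_MAX_SOURCE_GB
    )
    if pojo_allowed:
        formats.append("POJO")
    return formats


print(formats_not_ruled_out("GBM", 0.2))
print(formats_not_ruled_out("XGBoost", 0.2))
print(formats_not_ruled_out("GBM", 1.5))
```

Under these rules MOJO is the only option for XGBoost, GLRM and Stacked Ensembles, and for any model whose POJO source exceeds the size limit.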
  • 38. Spark ML vs Spark MLlib • https://spark.apache.org/docs/latest/ml-guide.html