Machine Learning: H2O vs SparkML
by Arnab Biswas
June 2018
H2O
Open Source, In-Memory, Distributed Machine Learning Tool
• Open Source (Apache 2.0)
• In-Memory (Faster)
• Distributed (Big Data/No Sampling)
• Third Version (Stable)
• Easy To Use
• Mission - "How do we get this to work efficiently at big data scale?"
http://docs.h2o.ai/
Multiple Language Support
• R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow)
• Entire library is embedded inside a jar file
• Written in Java, so it naturally supports Java & Scala
• R, Python, JavaScript, Excel, Tableau, and Flow communicate with H2O clusters using REST API calls
• Easy to switch between R/Python/Java/Flow environments
H2O : Advantage
• Uses in-memory compression (2-4 times smaller than gzip)
• Data frames are much smaller in memory and on disk
• Handles billions of data rows in-memory, even with a small cluster
• Data gets distributed across multiple JVMs
• Modeling uses the whole data set (without sampling)
• Faster training/prediction time
• The larger the data set, the greater the performance benefit
• Includes Flow, a web-based GUI (easy to use for non-programmers)
• However, not very impressive!
• Easy to deploy models in production
• Checkpoint
• Continue training an existing model with new data
• Iterative Methods (???)
https://en.wikipedia.org/wiki/H2O_(software)
Clustering (1/2)
• Can be deployed on a single node / multi-node cluster / Hadoop cluster
/ Apache Spark cluster
• Clustering enhances speed of computation
• Hadoop/Spark for clustering is NOT mandatory
• Multi-node cluster with shared memory model
• All computation in-memory
• Each node sees only some rows of data
• No limit to cluster size
• Distributed Data Frames (collection of vectors)
• Columns are distributed (across nodes)
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
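The row-wise distribution described above (each node sees only some rows, and the per-node results are combined) can be sketched in plain Python. This is a toy illustration under stated assumptions, not H2O code: `split_rows`, `node_partial`, and `distributed_mean` are hypothetical names, and the "nodes" are simply list slices inside one process.

```python
from typing import List, Tuple


def split_rows(column: List[float], n_nodes: int) -> List[List[float]]:
    """Split one non-empty column into contiguous row chunks, one per node."""
    chunk = max(1, -(-len(column) // n_nodes))  # ceiling division
    return [column[i:i + chunk] for i in range(0, len(column), chunk)]


def node_partial(rows: List[float]) -> Tuple[float, int]:
    """Map step: each node summarizes only the rows it holds."""
    return sum(rows), len(rows)


def distributed_mean(column: List[float], n_nodes: int) -> float:
    """Reduce step: combine the per-node (sum, count) pairs into one mean."""
    partials = [node_partial(chunk) for chunk in split_rows(column, n_nodes)]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = [float(v) for v in range(1, 11)]  # toy column with 10 rows
print(distributed_mean(values, n_nodes=3))  # prints 5.5
```

Only the small (sum, count) pairs cross the "network" here, which is why the result does not depend on how many nodes hold the rows.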
Clustering : Limitations (2/2)
• For small data, clustering introduces slowness
• Find the sweet spot between data size & number of nodes
• Each node on the cluster must be of same size (Recommended)
• New nodes cannot be added once the cluster starts up
• If any machine dies, the whole cluster must be rebuilt
• If a single node gets removed, whole cluster becomes unusable
• Nodes should be physically close, to minimize network latency
• Each node must be running the same version of h2o.jar
Productionizing H2O
1. Build a model using Python/R/Java/Flow
2. Download the model (as a POJO or MOJO) as a zip file.
3. Download the resulting h2o-genmodel.jar (a library that supports scoring)
4. Invoke the model from a Java class to generate predictions
• Can be easily embedded inside a Java Application
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
H2O Flow
• Web-based interactive client environment
• Similar to Jupyter Notebook
• Can be used by non-programmers as well (mouse clicks!)
• Combines code execution, text, mathematics, plots & rich media in a single document
• Allows
• Data upload
• View data uploaded directly / through other clients
• Build Model
• View models built directly / through other clients
• Predict
• View predictions generated directly or through other clients
• Check cluster/CPU status
Algorithms
• Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost
• Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA)
• Miscellaneous: Word2vec
• Common: Quantiles, Early Stopping
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
H2O Ecosystem
• H2O
• Steam
• Enterprise Steam
• Sparkling Water
• Driverless AI
• H2O4GPU
H2O Steam
• End-to-end platform that streamlines the entire process of building and deploying applications
• Cluster Manager
• Start/stop cluster, allocate memory, start/pause/stop H2O instances
• Secure multi-tenant environment
• Model Manager
• Build, store, manage, compare, promote (historical) models
• Run A/B Test for models
• Scoring Server
• Deploys a model
• Scoring through REST API or In-App
Sparkling Water (1/3)
• Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark
• Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster
• “Certified on Spark”
Sparkling Water – Use Case (2/3)
Use Case 1:
The data pipeline consists of multiple data transformations done with the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm.
Use Case 2:
The data pipeline uses H2O's parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer.
http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
Sparkling Water – Use Case (3/3)
Use Case 3:
1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and provides an H2O model as output. The model is exported in a form independent of the H2O run-time.
2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O/Sparkling Water infrastructure.
Spark (MLlib) vs H2O
• Spark is better at the data preparation and data munging steps
• H2O is faster than the algorithms in Spark MLlib
• MLlib underperforms in terms of memory, CPU and time
• H2O provides a web interface (Flow) for data visualization
• H2O and MLlib have overlapping algorithms
• H2O is better for productionization
• POJO/MOJO approach is friendlier to integrate with Java applications
• Allows evaluation metrics visualization, tracking jobs and job statuses
• H2O allows grid search (Spark ML also provides it, via ParamGridBuilder with CrossValidator)
• Spark has better community support
• H2O has enterprise support
Check the slide on References
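Grid search, as mentioned above, trains one model per combination of hyperparameter values and keeps the best one. A minimal sketch of that enumeration in plain Python follows. It is neither the H2O nor the Spark API: `expand_grid`, `grid_search`, and `toy_score` are hypothetical names, and `toy_score` merely stands in for training and validating a real model.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def expand_grid(grid: Dict[str, List[float]]) -> List[Params]:
    """Return every combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]


def grid_search(grid: Dict[str, List[float]],
                score: Callable[[Params], float]) -> Tuple[Params, float]:
    """Score each combination and return the best one (higher is better)."""
    best = max(expand_grid(grid), key=score)
    return best, score(best)


def toy_score(p: Params) -> float:
    """Hypothetical stand-in for a validation metric; peaks at depth 6, rate 0.1."""
    return -abs(p["max_depth"] - 6) - abs(p["learn_rate"] - 0.1)


grid = {"max_depth": [4, 6, 16], "learn_rate": [0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_score)
print(len(expand_grid(grid)), best_params)
```

The number of models grows as the product of the list lengths (3 x 2 = 6 here), which is why training time per model matters so much when comparing libraries.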
Case Study I : Migration From Spark MLlib To H2O (1/3)
• Requirements of the “iyzico” fraud detection product
• Continuous Delivery: Models need to be continuously deployed to production
• Real-Time Fraud Detection: Prediction time of max 100 ms
• High Availability & Scalability
• Low Learning Curve: Stack should be usable by data scientists & SW developers
• Open Source
• Fast: Fast prototyping & deploying
• On Premise
• Initial Choice
• prediction.io + Spark ML
Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
Case Study I (2/3)
• Benchmarking Criteria : TensorFlow, SparkML, H2O (Winner)
• Simplicity of deploying an existing model (local env) to production
• POJO based models. Easy to deploy in Java environment
• Release management and DevOps cycle are easy
• Hardware requirements for training
• Memory needed for training with 1 million transactions & 100 features with RF (64 trees)
Spark ML : 16 GB RAM, TensorFlow : 10 GB, H2O : 2 GB
• Decision Trees and Bayesian Models
• Python, R, SQL Support
• Experimentation on local environment
• Experiments can be done with Python, R
• Prediction time (ms)
• Feature Engineering, Data Pipeline was in Java 8. No need for migration
Case Study I (3/3)
• Migration from Spark ML + prediction.io to H2O
• 60 GB RAM is saved (Spark ML & prediction.io needed it for model training)
• 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time)
• Response time decreased almost 9-fold (300 milliseconds to 35 milliseconds)
Case Study II : Booking.com (1/3)
Source: https://www.youtube.com/watch?v=_CBKECLkIt8
Case Study II : Booking.com (2/3)
Case Study II : Booking.com (3/3)
Spark/Sparkling Water – Do I need it?
Benchmarking ML Libraries
https://github.com/szilard/benchm-ml
• Training data
• Number of rows varied as 10K, 100K, 1M, 10M
• ~1K features
• Binary Classification Problem
• Hardware (Single Instance)
• Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM)
• If OOM, r3.8xlarge instance (32 cores, 250 GB RAM)
• Observations
• Training time
• Maximum memory usage during training
• AUC (predictive accuracy)
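AUC, the accuracy metric listed above, is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch that computes it directly from that definition is shown below. It is quadratic in the number of examples, so it suits only small illustrations, and `auc` is a hypothetical helper rather than part of any benchmarked library.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes AUC comparable across libraries that produce differently scaled scores.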
Random Forest
H2O
• Fast, uses all cores, more accurate
• Memory Efficient
• 1M : 5 GB, 10M : 25 GB
Spark MLlib
• Slower
• Larger memory footprint
• Runs OOM at n = 1M
• With 250 GB, finishes for 1M, but crashes for 10M
• AUC broke at 1M
• Spark 2.0 is even slower
XGBoost
• Fast
• High accuracy
• Memory efficient
• 1M : 2 GB, 10M : 9 GB
Gradient Boosting Machines
learn_rate=0.01, max_depth=16, n_trees=1000
learn_rate=0.1, max_depth=6, n_trees=300
• Memory footprint of GBMs is smaller than for RF
• Bottleneck is mainly training time
• Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees
• H2O and XGBoost are the fastest
Performance of various GBM implementations
For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application.
https://github.com/szilard/GBM-perf
Do I need Big Data?
• Single Instance vs Cluster
• Sending data over a network vs using shared memory
• Several distributed systems have significant computation & memory overhead
• Map-reduce style communication pattern : Not the best fit for many ML algorithms
Benchmarking For Bigger Data
Netflix VectorFlow
• Minimalist library
• Specifically optimized for training on sparse data
• Single-machine, multi-core environment
Benchmarking For Bigger Data
• Not enough clarity about the hardware used
• For tree-based ensembles (RF, GBM), H2O and XGBoost can train on 100M records on a single server, though the training times become several hours
Single Node
Multiple Nodes
Security In H2O
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html
Disadvantages
• No High Availability (HA) for Clusters
• Doesn’t work well on sparse data
• GPU Support is in alpha stage
• There is no SVM
• Cluster support helps with Big Data
• Small data needs a single, fast machine with lots of cores
References
• https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How
• https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool
• https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually
• https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ
Questions
H2O Architecture
https://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf
http://gotocon.com/dl/goto-berlin-2014/slides/PetrMaj_and_TomasNykodym_FastAnalyticsOnBigData.pdf
H2O Frame Distributed Fork & Join
Do I need Spark to run H2O?
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
H2O : POJO vs MOJO
- POJOs are not supported for source files larger than 1 GB
- MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models.
- POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
Spark ML vs Spark MLlib
• Spark MLlib vs Spark ML :
• https://spark.apache.org/docs/latest/ml-guide.html
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 

Machine Learning: H2O vs SparkML

  • 1. Machine Learning: H2O vs SparkML, by Arnab Biswas, June 2018
  • 2. H2O: Open Source, In-Memory, Distributed Machine Learning Tool • Open Source (Apache 2.0) • In-Memory (Faster) • Distributed (Big Data/No Sampling) • Third Version (Stable) • Easy To Use • Mission - "How do we get this to work efficiently at big data scale?" http://docs.h2o.ai/
  • 3. Multiple Language Support • R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow) • Entire library is embedded inside a jar file • Written in Java, so it naturally supports Java & Scala • R, Python, JavaScript, Excel, Tableau and Flow communicate with H2O clusters using REST API calls • Easy to switch between R/Python/Java/Flow environments
  • 4. H2O : Advantage • Uses in-memory compression (2-4 times smaller than gzip) • Data frames are much smaller in memory and on disk • Handles billions of data rows in-memory, even with a small cluster • Data gets distributed across multiple JVMs • Modeling uses the whole data set (without sampling) • Faster training/prediction time • The larger the data set, the better the performance • Includes Flow, a web-based GUI (easy to use for non-programmers) • However, not very impressive! • Easy to deploy models in production • Checkpoint • Continue training an existing model with new data • Iterative Methods (???) https://en.wikipedia.org/wiki/H2O_(software)
  • 5. Clustering (1/2) • Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster • Clustering enhances speed of computation • Hadoop/Spark for clustering is NOT mandatory • Multi-node cluster with shared memory model • All computation in-memory • Each node sees only some rows of data • No limit to cluster size • Distributed Data Frames (collection of vectors) • Columns are distributed (across nodes) - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
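The idea on slide 5 that each node sees only some rows of a column can be pictured with a small single-machine sketch. This is not H2O code: `split_rows`, `chunk_stats` and `distributed_mean` are hypothetical names, and threads stand in for cluster nodes. It only illustrates the fork (per-chunk work) and join (combining partial results) pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(column, n_nodes):
    """Split one column into contiguous row chunks, one per simulated node."""
    if not column:
        raise ValueError("column must not be empty")
    size = -(-len(column) // n_nodes)  # ceiling division
    return [column[i:i + size] for i in range(0, len(column), size)]


def chunk_stats(chunk):
    """Fork step: a node summarizes only the rows it holds."""
    return sum(chunk), len(chunk)


def distributed_mean(column, n_nodes=4):
    """Run the fork step on every chunk, then join the partial results."""
    chunks = split_rows(column, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


column = list(range(1, 101))
print(distributed_mean(column))  # mean of the integers 1..100
```

No node ever needs the full column: only the small `(sum, count)` pairs travel back to be joined, which is why this pattern scales with the number of rows.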
  • 6. Clustering : Limitations (2/2) • For small data, clustering slows things down • Find the sweet spot between data size & number of nodes • Each node in the cluster should be the same size (recommended) • New nodes cannot be added once the cluster starts up • If any machine dies, the whole cluster must be rebuilt • If a single node gets removed, the whole cluster becomes unusable • Nodes should be physically close, to minimize network latency • Each node must be running the same version of h2o.jar
  • 7. Productionizing H2O 1. Build a model using Python/R/Java/Flow 2. Download the model (as a POJO or MOJO) as a zip file 3. Download the resulting h2o-genmodel.jar (a library that supports scoring) 4. Invoke the model from a Java class to generate predictions • Can be easily embedded inside a Java application http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
  • 8. H2O Flow • Web-based interactive client environment • Similar to Jupyter Notebook • Can be used by non-programmers as well (mouse clicks!) • Combines code execution, text, mathematics, plots & rich media in a single document • Allows • Data upload • View data uploaded directly / through other clients • Build models • View models built directly / through other clients • Predict • View predictions generated directly or through other clients • Check cluster/CPU status
  • 9. Algorithms • Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost • Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA) • Miscellaneous: Word2vec • Common: Quantiles, Early Stopping https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
  • 10. H2O Ecosystem • H2O • Steam • Enterprise Steam • Sparkling Water • Driverless AI • H2O4GPU
  • 11. H2O Steam • End-to-end platform that streamlines the entire process of building and deploying applications • Cluster Manager • Start/stop cluster, allocate memory, start/pause/stop H2O instances • Secure multi-tenant environment • Model Manager • Build, store, manage, compare, promote (historical) models • Run A/B tests for models • Scoring Server • Deploys a model • Scoring through REST API or in-app
  • 12. Sparkling Water (1/3) • Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark • Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster • “Certified on Spark”
  • 13. Sparkling Water – Use Case (2/3) Use Case 1: The data pipeline consists of multiple data transformations done with the help of the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm. Use Case 2: The data pipeline uses H2O’s parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer. http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
  • 14. Sparkling Water – Use Case (3/3) Use Case 3: 1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and produces an H2O model as output. The model is exported in a form independent of the H2O run-time. 2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O/Sparkling Water infrastructure.
  • 15. Spark (MLlib) vs H2O • Spark is better at the data preparation and data munging steps • H2O's algorithms are faster than those in Spark MLlib • MLlib underperforms in terms of memory, CPU and time • H2O provides a web interface (Flow) for data visualization • H2O and MLlib have overlapping algorithms • H2O is better for productionization • The POJO/MOJO approach is friendlier to integrate with Java applications • Allows evaluation metrics visualization, tracking jobs and job statuses • H2O allows grid search (Spark ML also offers one through ParamGridBuilder and CrossValidator) • Spark has better community support • H2O has enterprise support Check the slide on References
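Grid search, mentioned on slide 15, simply tries every combination of hyperparameter values and keeps the best-scoring one. The sketch below is library-independent and uses neither the H2O nor the Spark API. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "train a model and return its validation metric". The grid values are borrowed from the GBM settings on slide 25.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate every combination in `grid`; return the best params and score."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for training and validating a real model.
    # It is highest at learn_rate=0.1, max_depth=6 purely by construction.
    return -abs(params["learn_rate"] - 0.1) - abs(params["max_depth"] - 6) / 100


grid = {"learn_rate": [0.01, 0.1], "max_depth": [6, 16]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The number of models trained is the product of the list lengths (here 2 x 2 = 4), so each extra hyperparameter multiplies the training cost.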
  • 16. Case Study I : Migration From Spark MLlib To H2O (1/3) • Needs of the “iyzico” fraud detection product • Continuous Delivery: Models need to be continuously deployed to production • Real-Time Fraud Detection: Prediction time of max 100 ms • High Availability & Scalability • Low Learning Curve: Stack should be usable by data scientists & software developers • Open Source • Fast: Fast prototyping & deploying • On Premise • Initial Choice • prediction.io + Spark ML Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
  • 17. Case Study I (2/3) • Benchmarking criteria, comparing TensorFlow, Spark ML and H2O (the winner) • Simplicity of deploying an existing model (local env) to production • POJO-based models. Easy to deploy in a Java environment • Release management and DevOps cycle are easy • Hardware requirements for training • Memory needed for training with 1 million transactions & 100 features with RF (64 trees): Spark ML 16 GB RAM, TensorFlow 10 GB, H2O 2 GB • Decision Trees and Bayesian Models • Python, R, SQL support • Experimentation in a local environment • Experiments can be done with Python, R • Prediction time (ms)
  • 18. Case Study I (3/3) • Feature engineering and the data pipeline were in Java 8, so they needed no migration • Migration from Spark ML + prediction.io to H2O • 60 GB of RAM saved (Spark ML & prediction.io needed it for model training) • 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time) • Response time decreased almost 9 times (300 milliseconds to 35 milliseconds)
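The response-time figures on slide 18 can be checked with a few lines of arithmetic. The helper names `speedup` and `reduction_pct` are hypothetical. The 100 ms budget is the real-time requirement from slide 16.

```python
def speedup(before_ms, after_ms):
    """How many times faster the new response time is."""
    return before_ms / after_ms


def reduction_pct(before_ms, after_ms):
    """Percentage of the original response time that was removed."""
    return 100.0 * (before_ms - after_ms) / before_ms


before_ms, after_ms = 300, 35   # Spark ML + prediction.io vs H2O (slide 18)
budget_ms = 100                 # real-time fraud detection limit (slide 16)

print(f"speedup: {speedup(before_ms, after_ms):.1f}x")                    # 8.6x
print(f"latency reduction: {reduction_pct(before_ms, after_ms):.1f}%")    # 88.3%
print(f"within budget: {after_ms <= budget_ms}")                          # True
```

At 300 ms the original stack missed the 100 ms requirement by a factor of three. At 35 ms the H2O deployment leaves about 65 ms of headroom.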
  • 19. Case Study II : Booking.com (1/n) Source: https://www.youtube.com/watch?v=_CBKECLkIt8
  • 20. Case Study II : Booking.com (2/n)
  • 21. Case Study II : Booking.com (3/3)
  • 22. Spark/Sparkling Water – Do I need it?
  • 23. Benchmarking ML Libraries https://github.com/szilard/benchm-ml • Training data • Number of rows varied as 10K, 100K, 1M, 10M • ~1K features • Binary classification problem • Hardware (single instance) • Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM) • If OOM, r3.8xlarge instance (32 cores, 250 GB RAM) • Observations • Training time • Maximum memory usage during training • AUC (predictive accuracy)
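AUC, the accuracy measure used in this benchmark, is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a minimal illustration, not the benchmark's own code, and the pairwise loop is only practical for small inputs. The function name `auc` and the sample data are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.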
  • 24. Random Forest • H2O • Fast, uses all cores, more accurate • Memory efficient • 1M: 5 GB, 10M: 25 GB • Spark MLlib • Slower • Larger memory footprint • Runs out of memory (OOM) at n = 1M • With 250 GB, finishes for 1M, but crashes for 10M • AUC broke at 1M • Spark 2.0 is even slower • XGBoost • Fast • High accuracy • Memory efficient • 1M: 2 GB, 10M: 9 GB
  • 25. Gradient Boosting Machines (chart panels: learn_rate=0.01, max_depth=16, n_trees=1000; learn_rate=0.1, max_depth=6, n_trees=300) • Memory footprint of GBMs is smaller than for RF • Bottleneck is mainly training time • Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees • H2O and xgboost are the fastest
  • 26. Performance of various GBM implementations • For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application. https://github.com/szilard/GBM-perf
  • 27. Benchmarking For Bigger Data: Do I need Big Data? • Single instance vs cluster • Sending data over a network vs using shared memory • Several distributed systems have significant computation & memory overhead • Map-reduce style communication pattern: not the best fit for many ML algorithms
  • 28. Netflix VectorFlow • Minimalist library • Specifically optimized for training sparse data • Single-machine, multi-core environment
  • 29. Benchmarking For Bigger Data • Not enough clarity about the hardware used • For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though the training times become several hours (chart panels: Single Node, Multiple Nodes)
  • 31. Disadvantages • No High Availability (HA) for clusters • Doesn’t work well on sparse data • GPU support is in alpha stage • There is no SVM • Cluster support helps with big data • Small data needs a single, fast machine with lots of cores
  • 32. References • https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How • https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool • https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually • https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ
  • 35. H2O Frame Distributed Fork & Join
  • 36. Do I need Spark to run H2O? - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
  • 37. H2O : POJO vs MOJO - POJOs are not supported for source files larger than 1G - MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models. - POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
  • 38. Spark ML vs Spark MLlib • https://spark.apache.org/docs/latest/ml-guide.html