H2O is widely used for machine learning projects. A TechCrunch article, published in January 2017 by John Mannes, reported that around 20% of Fortune 500 companies use H2O.
Talk 1: Introduction to Scalable & Automatic Machine Learning with H2O
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models.
In this presentation, Joe will introduce the AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.
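The stacked ensemble idea behind the AutoML leaderboard winner can be sketched in plain Python. This is a toy illustration of stacking, not H2O's implementation: the base "models" are deliberately trivial, and the metalearner is replaced by a coarse grid search over blend weights.

```python
# Toy sketch of a stacked ensemble: two trivial base "models" make
# predictions, and a meta-step finds the blend weight that minimizes
# squared error on held-out data. Illustrative only; H2O's Stacked
# Ensemble combines real base models via cross-validated predictions.

def base_mean(train_y):
    m = sum(train_y) / len(train_y)
    return lambda x: m                      # always predicts the mean

def base_linear(train_x, train_y):
    # least-squares slope through the origin: y ~ w * x
    w = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)
    return lambda x: w * x

def stack_weight(p1, p2, y):
    # pick the convex combination a*p1 + (1-a)*p2 minimizing squared error,
    # scanned on a coarse grid (a stand-in for the metalearner fit)
    best = min((sum((a * q1 + (1 - a) * q2 - t) ** 2
                    for q1, q2, t in zip(p1, p2, y)), a)
               for a in [i / 100 for i in range(101)])
    return best[1]

train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
m1 = base_mean(train_y)
m2 = base_linear(train_x, train_y)
holdout_x, holdout_y = [5, 6], [10.1, 11.8]
a = stack_weight([m1(x) for x in holdout_x],
                 [m2(x) for x in holdout_x], holdout_y)
ensemble = lambda x: a * m1(x) + (1 - a) * m2(x)
print(a, ensemble(5))
```

Here the holdout data is near-linear, so the meta-step gives all the weight to the linear base model; with more diverse base models the blend is what usually puts the stacked ensemble at the top of the leaderboard.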
Talk 2: Making Multimillion-dollar Baseball Decisions with H2O AutoML and Shiny
Joe recently teamed up with IBM and Aginity to create a proof of concept "Moneyball" app for the IBM Think conference in Vegas. The original goal was to prove that different tools (e.g. H2O, Aginity AMP, IBM Data Science Experience, R and Shiny) could work together seamlessly for common business use-cases. Little did Joe know, the app would be used by Ari Kaplan (the real "Moneyball" guy) to validate the future performance of some baseball players. Ari recommended one player to a Major League Baseball team. The player was signed the next day with a multimillion-dollar contract. This talk is about Joe's journey to a real "Moneyball" application.
Bio: Jo-fai (or Joe) Chow is a data scientist at H2O.ai. Before joining H2O, he was on the business intelligence team at Virgin Media in the UK, where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist, promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at the STREAM Industrial Doctorate Centre, working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specializing in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.
3. Company Overview
Founded: 2012; Series C in November 2017
Products:
• Driverless AI – Automated Machine Learning
• H2O Open Source Machine Learning
• Sparkling Water
Mission: Democratize AI. Do Good.
Team: ~100 employees
• Distributed systems engineers doing machine learning
• World-class visualization designers
Offices: Mountain View, London, Prague
19. Gartner names H2O as Leader with the most completeness of vision
• H2O.ai recognized as a technology leader with the most completeness of vision
• H2O.ai was recognized for its mindshare, partner network and status as a quasi-industry standard for machine learning and AI
• H2O customers gave the highest overall score among all the vendors for sales relationship and account management, customer support (onboarding, troubleshooting, etc.) and overall service and support
27. Supported Formats & Data Sources
9 formats (file type or folder of files): CSV, XLS, XLSX, ORC (*1), Hive (*2), SVMLight, ARFF, Parquet, Avro 1.8.0 (*3)
5 sources: HDFS, S3, NFS, local, SQL
*1 Only if H2O is running as a Hadoop job
*2 Hive files that are saved in ORC format
*3 Without multi-file parsing or column type modification
33. Distributed Algorithms
Advantageous Foundation
• Foundation for in-memory distributed algorithm calculation: distributed data frames and columnar compression
• All algorithms are distributed in H2O: GBM, GLM, DRF, Deep Learning and more, with fine-grained map-reduce iterations
• The only enterprise-grade, open-source distributed algorithms on the market
User Benefits
• "Out-of-the-box" functionality for all algorithms (no more scripting) and a uniform interface across all languages: R, Python, Java
• Designed for all sizes of data sets, especially large data
• Highly optimized Java code for model exports
• In-house expertise for all algorithms
[Figures: parallel parse into distributed rows; fine-grain map-reduce illustration – scalable distributed histogram calculation for GBM]
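The fine-grained map-reduce pattern behind the distributed histogram calculation for GBM can be sketched in plain Python. This is a simplified illustration, not H2O's actual implementation (which runs in Java over distributed data frames): each partition maps its rows into local histogram bins, and a reduce step sums the partial histograms.

```python
from functools import reduce

def map_histogram(partition, lo, hi, nbins):
    """Map step: build a local histogram for one data partition."""
    width = (hi - lo) / nbins
    bins = [0] * nbins
    for x in partition:
        i = min(int((x - lo) / width), nbins - 1)  # clamp max value into last bin
        bins[i] += 1
    return bins

def reduce_histograms(h1, h2):
    """Reduce step: combine two partial histograms by summing bin counts."""
    return [a + b for a, b in zip(h1, h2)]

# Three "nodes", each holding a partition of the rows
partitions = [[0.1, 0.4, 0.9], [0.5, 0.6], [0.2, 0.95]]
partials = [map_histogram(p, 0.0, 1.0, 4) for p in partitions]
total = reduce(reduce_histograms, partials)
print(total)  # [2, 1, 2, 2]: combined counts per quarter-width bin
```

Because the reduce step is just element-wise addition, it is associative and can be applied in any order across nodes, which is what makes the histogram calculation scale.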
34. H2O-3 Algorithms Overview
Supervised Learning
Statistical Analysis
• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Ensembles
• Distributed Random Forest: classification or regression models
• Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations
Deep Neural Networks
• Deep Learning: creates multi-layer feedforward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
Clustering
• K-means: partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k
Dimensionality Reduction
• Principal Component Analysis: linearly transforms correlated variables to independent components
• Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
Anomaly Detection
• Autoencoders: find outliers using nonlinear dimensionality reduction with deep learning
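The k-means idea listed above can be illustrated with a minimal Lloyd's-algorithm sketch in plain Python. This is not H2O's distributed implementation (which can also estimate k automatically); here k and the initial centers are fixed, and the data is 1-D for readability.

```python
# Minimal Lloyd's-algorithm k-means on 1-D data: alternate between
# assigning points to their nearest center and moving each center to
# the mean of its assigned points.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for x in points:
            i = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
            clusters[i].append(x)
        # Update step: move each center to the mean of its cluster
        # (empty clusters keep their old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers, clusters = kmeans(points, centers=[0.0, 5.0])
print(centers)  # converges to [1.0, 9.5], the means of the two groups
```

The same assign-then-update structure maps naturally onto the map-reduce pattern H2O uses: assignments are computed per partition, and the per-cluster sums and counts are reduced across nodes before the update.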
61. H2O Products
• H2O AI Open Source Engine: in-memory, distributed machine learning algorithms with the H2O Flow GUI
• Sparkling Water: integration with Spark
• H2O4GPU: lightning-fast machine learning on GPUs
• Driverless AI: automatic feature engineering, machine learning and interpretability
• Secure multi-tenant H2O clusters
62. Algorithms on H2O-3 (CPU)
“Confidential and property of H2O.ai. All rights reserved”
(Same algorithm overview as slide 34.)
63. Algorithms on H2O4GPU (more to come)
(Same algorithm overview as slide 34, running on GPUs.)
73. Approach One: Learning from Lahman only
• Inputs: Lahman data (age, height, weight, …) and historical performance stats, arranged as sliding windows (stats from the previous n years) – about 300 Lahman features
• Model: H2O AutoML learns the pattern
• Outputs: predictions of home runs, batting average, …
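The sliding-window feature construction can be sketched in plain Python. This is a hypothetical illustration: the stat names (`hr`, `avg`) are invented stand-ins, not the actual Lahman column names, and real feature tables would cover many more stats per lagged season.

```python
# Hypothetical sketch of sliding-window features: for each player-season,
# pull each stat from the previous n seasons as lagged features, with the
# current season's value as the prediction target.

def sliding_window_features(seasons, n):
    """seasons: list of per-year stat dicts in chronological order.
    Returns one feature row per season that has n full previous seasons."""
    rows = []
    for i in range(n, len(seasons)):
        row = {"year": seasons[i]["year"]}
        for lag in range(1, n + 1):
            prev = seasons[i - lag]
            for stat in ("hr", "avg"):          # illustrative stat names
                row[f"{stat}_lag{lag}"] = prev[stat]
        row["target_hr"] = seasons[i]["hr"]     # what AutoML would predict
        rows.append(row)
    return rows

seasons = [
    {"year": 2014, "hr": 20, "avg": 0.270},
    {"year": 2015, "hr": 25, "avg": 0.285},
    {"year": 2016, "hr": 30, "avg": 0.300},
]
rows = sliding_window_features(seasons, n=2)
print(rows[0])  # 2016 row with hr/avg lags from 2015 and 2014
```

Extending each row with a second source of lagged stats (as in Approach Two with AriDB pitch data) is just more columns per lag in the same loop.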
74. Approach Two: Learning from Lahman & AriDB
• Inputs: Lahman data (age, height, weight, …) and historical performance stats, plus AriDB pitch data (fastball, curveball, slider, velocity, …), arranged as sliding windows (stats from the previous n years) – about 300 Lahman features + 200 AriDB features
• Model: H2O AutoML learns the pattern
• Outputs: predictions of home runs, batting average, …