Automatic	and	Interpretable	Machine	
Learning	in	R	with	H2O	and	LIME	
Jo-fai	(Joe)	Chow
Data	Science	Evangelist	/
Community	Manager
joe@h2o.ai
@matlabulous
Download	→	https://bit.ly/
joe_eRum_2018
Agenda
2
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm Regression	Example
8:30 – 9:00	pm 🍕 🍕 🍕 🍕 🍕
9:00	– 9:20	pm Classification Example
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
3
Time Topics /	Tasks
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
mlbench for	datasets
install.packages(‘mlbench’)
4
Install	H2O	using	tar.gz (if	needed)
Go	to	https://www.h2o.ai/download/
Click
Click
Unzip
Install
5
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm Regression	Example
8:30 – 9:00	pm
9:00	– 9:20	pm Classification Example
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
Why?
• Most	users/organizations	can	
benefit	from	automatic	machine	
learning	pipelines.
• Eliminate	time	wasted	on	human	
errors,	debugging	etc.
• Model	interpretations	is	crucial	for	
those	who	must	explain	their	
models	to	regulators	or	customers.
You	will	learn	…
• How	to	build	high	quality	H2O	
models	(almost)	automatically.
• How	to	explain	predictions	from	
complex	H2O	models	with	LIME.
• Bonus:	A	real	use-case	that	led	to	
multimillion-dollar	baseball	
decisions	earlier	this	year.
6
+ +
$20Mtrade	2	weeks	prior	to	the	
season	beginning
AutoML	+	LIME	+	Shiny	=	Real	Moneyball
About	Me
8
• Before	H2O
• Water	Engineer	/	EngD	Researcher	/	
Matlab	Fan	Boy
(wonder	why							@matlabulous?)
• Discovered	R,	Python,	H2O	…	
never	look	back	again
• Data	Scientist	at	Virgin	Media	(UK),	
Domino	Data	Lab	(US)
• At	H2O	…	
• Data	Scientist	/	Evangelist	/
• Sales	Engineer	/	Solution	Architect	/
• Community	Manager
…	The	harsh	reality	of	startup	life	…
• H2O	SWAG	Photographer
#AroundTheWorldWithH2Oai
Love	H2O?	Get	some	stickers!
Barcelona
Berlin
Brussels Paris
Milan	(2016)
9
Reminder:	#360Selfie
Budapest
Cologne
ToulouseAmsterdam
BerlinLondon
About	H2O.ai …
Have	you	seen	Avengers:	Infinity	War?
Do	you	know	all	the	characters	in	the	movie?	(No	spoilers	- I	promise)
10
11
We	develop	
machine	learning	platforms
CONFIDENTIAL
Gartner names H2O as Leader with the most completeness of vision
• H2O.ai recognized as a technology
leader with most completeness of
vision
• H2O.ai was recognized for the
mindshare, partner network and
status as a quasi-industry standard
for machine learning and AI.
• H2O customers gave the highest
overall score among all the vendors
for sales relationship and account
management, customer support
(onboarding, troubleshooting, etc.)
and overall service and support.
CONFIDENTIAL
Platforms with H2O integration
H2O + KNIME Talk
at KNIME Summit
Mar 2017
Founded 2012, Series C in Nov, 2017
Products • Driverless AI – Automated Machine Learning
• H2O Open Source Machine Learning
• Sparkling Water
Mission Democratize AI. Do Good
Team ~100 employees
• Distributed Systems Engineers doing Machine Learning
• World-class visualization designers
Offices Mountain View, London, Prague
14
Company	Overview
H2O Products
In-Memory, Distributed
Machine Learning Algorithms
with H2O Flow GUI
H2O AI Open Source Engine
Integration with Spark
Lightning Fast machine
learning on GPUs
Automatic feature
engineering, machine
learning and interpretability
Secure multi-tenant H2O clusters
H2O Products
In-Memory, Distributed
Machine Learning Algorithms
with H2O Flow GUI
H2O AI Open Source Engine
Integration with Spark
Lightning Fast machine
learning on GPUs
Automatic feature
engineering, machine
learning and interpretability
Secure multi-tenant H2O clusters
In-Memory, Distributed
Machine Learning Algorithms
with H2O Flow GUI
This Workshop
CONFIDENTIAL
CONFIDENTIAL
Worldwide	Recognition	in	the	H2O.ai	Community
Financial InsuranceMarketingHW
Vendors
Retail Advisory &
Accounting
Healthcare
“H2O.ai's	reference	customers	gave	it	the	highest	overall	score	for	sales	
relationship	and	overall	service	and	support”	- Gartner	MQ	2018
Open	Source	
Community Paying	Customers
18
Why	H2O?
19
Our	Mission:
Make	Machine	Learning	Accessible	to	Everyone
Scientific	Advisory	Council
20
21H2O	Team
22H2O	Team
Origin	of	R	Package	`ggplot2`
23H2O	Team
Matt	Dowle
24H2O	Team
Erin	LeDell,	Chief	ML	Scientist
Women	in	ML/DS	& R-Ladies	Global
25H2O	Team
Erin	LeDell Navdeep	Gill
H2O	AutoML
26H2O	Team
1st
4th
25th
48th
33rd
Kaggle	Grand	Masters	
(and	their	Highest	Rank)
13th
About	80,000	Kagglers
27H2O	Team
1st
4th
25th
48th
33rd
13th
181st
Hoping	to	get	closer	to	them	at	some	point	…
28
https://www.meetup.com/pro/h2oai/
29
Encourage	your	friends/
colleagues	to	give	a	talk.
joe@h2o.ai
We	sponsor
meetups
We	encourage
diversity
Source:	https://www.maddycoupons.in/blog/yummy-yummy-pizza/
…	and	more
Joe’s
Meetups
Female	
Speaker
Female	Speaker	
Ratio
London	
Dec	2017
Kasia	Kulma 1/3
Amsterdam	
Feb 2018
Andreea	
Bejinaru
1/2	
London
Mar	2018
Cheuk	Ting	Ho 1/3
London	
Jun	2018
Torgyn	
Shaikhina
1/4
Overall (so	far)	=	4/12	=	33.3%
About	H2O	AutoML
30
Automatic	Machine	Learning	with	H2O
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
HDFS
S3
NFS
Distributed
In-Memory
Load	Data
Loss-less
Compression
H2O	Compute	Engine
Production	Scoring	Environment
Exploratory	&
Descriptive
Analysis
Feature	
Engineering	&
Selection
Supervised	&
Unsupervised
Modeling
Model
Evaluation	&
Selection
Predict
Data	&	Model
Storage
Model	Export:
Plain	Old	Java	Object
Your
Imagination
Data	Prep	Export:
Plain	Old	Java	Object
Local
SQL
High	Level	Architecture
31
HDFS
S3
NFS
Distributed
In-Memory
Load	Data
Loss-less
Compression
H2O	Compute	Engine
Production	Scoring	Environment
Exploratory	&
Descriptive
Analysis
Feature	
Engineering	&
Selection
Supervised	&
Unsupervised
Modeling
Model
Evaluation	&
Selection
Predict
Data	&	Model
Storage
Model	Export:
Plain	Old	Java	Object
Your
Imagination
Data	Prep	Export:
Plain	Old	Java	Object
Local
SQL
High	Level	Architecture
32
Import	Data	from	
Multiple	Sources
HDFS
S3
NFS
Distributed
In-Memory
Load	Data
Loss-less
Compression
H2O	Compute	Engine
Production	Scoring	Environment
Exploratory	&
Descriptive
Analysis
Feature	
Engineering	&
Selection
Supervised	&
Unsupervised
Modeling
Model
Evaluation	&
Selection
Predict
Data	&	Model
Storage
Model	Export:
Plain	Old	Java	Object
Your
Imagination
Data	Prep	Export:
Plain	Old	Java	Object
Local
SQL
High	Level	Architecture
33
Fast,	Scalable	&	Distributed	
Compute	Engine	Written	in	
Java
Supervised	Learning
• Generalized	Linear	Models:	Binomial,	
Gaussian,	Gamma,	Poisson	and	Tweedie
• Naïve	Bayes	
Statistical	
Analysis
Ensembles
• Distributed	Random	Forest:	Classification	
or	regression	models
• Gradient	Boosting	Machine:	Produces	an	
ensemble	of	decision	trees	with	increasing	
refined	approximations
Deep	Neural	
Networks
• Deep	learning:	Create	multi-layer	feed	
forward	neural	networks	starting	with	an	
input	layer	followed	by	multiple	layers	of	
nonlinear	transformations
H2O-3	Algorithms	Overview
Unsupervised	Learning
• K-means:	Partitions	observations	into	k	
clusters/groups	of	the	same	spatial	size.	
Automatically	detect	optimal	k
Clustering
Dimensionality	
Reduction
• Principal	Component	Analysis:	Linearly	transforms	
correlated	variables	to	independent	components
• Generalized	Low	Rank	Models:	extend	the	idea	of	
PCA	to	handle	arbitrary	data	consisting	of	numerical,	
Boolean,	categorical,	and	missing	data
Anomaly	
Detection
• Autoencoders:	Find	outliers	using	a	
nonlinear	dimensionality	reduction	using	
deep	learning
34
H2O	Core
CPU
Model Building
H2O
H2O	Core
H2O H2O H2O
H2O	Core
CPU CPU CPU
Model Building
H2O Distributed In-Memory
H2O	Core
YARN
CPU CPU CPU
Model Building
H2O Distributed In-Memory
SQL NFS
S3
Firewall or Cloud
HDFS
S3
NFS
Distributed
In-Memory
Load	Data
Loss-less
Compression
H2O	Compute	Engine
Production	Scoring	Environment
Exploratory	&
Descriptive
Analysis
Feature	
Engineering	&
Selection
Supervised	&
Unsupervised
Modeling
Model
Evaluation	&
Selection
Predict
Data	&	Model
Storage
Model	Export:
Plain	Old	Java	Object
Your
Imagination
Data	Prep	Export:
Plain	Old	Java	Object
Local
SQL
High	Level	Architecture
39
Multiple	Interfaces
H2O	Flow	(Web)
40
Using	H2O	with	R	and	Python
41
42H2O	Team
Erin	LeDell Navdeep	Gill
H2O	AutoML
43
44
AutoML
Supervised	Learning
• Generalized	Linear	Models:	Binomial,	
Gaussian,	Gamma,	Poisson	and	Tweedie
• Naïve	Bayes	
Statistical	
Analysis
Ensembles
• Distributed	Random	Forest:	Classification	
or	regression	models
• Gradient	Boosting	Machine:	Produces	an	
ensemble	of	decision	trees	with	increasing	
refined	approximations
Deep	Neural	
Networks
• Deep	learning:	Create	multi-layer	feed	
forward	neural	networks	starting	with	an	
input	layer	followed	by	multiple	layers	of	
nonlinear	transformations
H2O-3	Supervised	Algorithms	in	AutoML
Unsupervised	Learning
• K-means:	Partitions	observations	into	k	
clusters/groups	of	the	same	spatial	size.	
Automatically	detect	optimal	k
Clustering
Dimensionality	
Reduction
• Principal	Component	Analysis:	Linearly	transforms	
correlated	variables	to	independent	components
• Generalized	Low	Rank	Models:	extend	the	idea	of	
PCA	to	handle	arbitrary	data	consisting	of	numerical,	
Boolean,	categorical,	and	missing	data
Anomaly	
Detection
• Autoencoders:	Find	outliers	using	a	
nonlinear	dimensionality	reduction	using	
deep	learning
45
46
47
HDFS
S3
NFS
Distributed
In-Memory
Load	Data
Loss-less
Compression
H2O	Compute	Engine
Production	Scoring	Environment
Exploratory	&
Descriptive
Analysis
Feature	
Engineering	&
Selection
Supervised	&
Unsupervised
Modeling
Model
Evaluation	&
Selection
Predict
Data	&	Model
Storage
Model	Export:
Plain	Old	Java	Object
Your
Imagination
Data	Prep	Export:
Plain	Old	Java	Object
Local
SQL
High	Level	Architecture
48
Export	Standalone	Models	
for	Production
49
URL: docs.h2o.ai
About	Machine	Learning	
Interpretability
50
LIME	(Local	Interpretable	Model-Agnostic	Explanations)
…	and	more
Acknowledgement
51
Why	Should	I	Trust	Your	Model?
52
53
Age
Age
Propensity	to	PurchasePropensity	to	Purchase
Linear	Models
Machine	Learning
Exact	explanations	for	
approximate	models.
Lost	profits.
Wasted	marketing.
“For	a	one	unit	increase	in	age,	propensity	to	
purchase	increases	by	3%	on	average.”
Approximate	explanations	
for	exact	models.
Slope	begins	to	decrease	
here.	Act	to	optimize	
savings.
Slope	begins	to	increase	
here	sharply.	Act	to	
optimize	profits.
Theory
• LIME	approximates	model	locally	
as	logistic	or	linear	model
• Repeats	process	many	times
• Outputs	features	that	are	most	
important	to	local	models
Outcome
• Approximate	reasoning
• Complex	models	can	be	
interpreted
• Neural	nets,	Random	Forest,	
Ensembles	etc.
55
Local	Interpretable	Model-Agnostic	Explanations
LIME - How	does	it	work?
56
https://github.com/jphall663/awesome-machine-learning-interpretability
57
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm
Regression	Example
examplesregression_...Rmd
8:30 – 9:00	pm
9:00	– 9:20	pm Classification Example
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
Learning from Boston Housing Data
Features:
No. of rooms, Crime Rate,
Age …
Median
House
Value
(medv)
Predictions
H2O Machine Learning
Learn the Pattern
58
🍕🍕🍕@h2oai		@matlabulous
#AutoML			#LIME
Pizzas	&	Drinks	8:30	– 9:00	pm
Sponsored	by	H2O.ai
60
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm Regression	Example
8:30 – 9:00	pm
9:00	– 9:20	pm
Classification	Example
examplesclassification_...Rmd
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
Learning from Diabetes Data
Features:
Pregnant, Age, Glucose …
Positive
/
Negative
Predictions
H2O Machine Learning
Learn the Pattern
61
62
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm Regression	Example
8:30 – 9:00	pm 🍕 🍕 🍕 🍕 🍕
9:00	– 9:20	pm Classification	Example
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
Why?
• Most	users/organizations	can	
benefit	from	automatic	machine	
learning	pipelines.
• Eliminate	time	wasted	on	human	
errors,	debugging	etc.
• Model	interpretations	is	crucial	for	
those	who	must	explain	their	
models	to	regulators	or	customers.
You	will	learn	…
• How	to	build	high	quality	H2O	
models	(almost)	automatically.
• How	to	explain	predictions	from	
complex	H2O	models	with	LIME.
• Bonus:	A	real	use-case	that	led	to	
multimillion-dollar	baseball	
decisions	earlier	this	year.
63
64
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm Regression	Example
8:30 – 9:00	pm 🍕 🍕 🍕 🍕 🍕
9:00	– 9:20	pm Classification	Example
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
SlideShare	Link
Podcast	Link
66
Time Topics /	Tasks
7:00	– 7:30 pm Welcome	+	Data	Hack	Italia	
7:30	– 7:45 pm
Install	h2o,	lime,	mlbench from	CRAN
slides/code:	bit.ly/joe_eRum_2018
7:45	– 8:00	pm Introduction	(H2O,	AutoML,	LIME)
8:00 – 8:30	pm Regression	Example
8:30 – 9:00	pm 🍕 🍕 🍕 🍕 🍕
9:00	– 9:20	pm Classification	Example
9:20	– 9:30	pm Quick	Recap
9:30	– 9:45	pm Real	Use-Case:	Moneyball
9:45	– 10:00	pm Other	H2O	News	+	Q	&	A
H2O Products
In-Memory, Distributed
Machine Learning Algorithms
with H2O Flow GUI
H2O AI Open Source Engine
Integration with Spark
Lightning Fast machine
learning on GPUs
Automatic feature
engineering, machine
learning and interpretability
Secure multi-tenant H2O clusters
Lightning Fast machine
learning on GPUs
“Confidential	and	property	of	H2O.ai.	All	rights	reserved”
Supervised Learning
• Generalized Linear Models: Binomial,
Gaussian, Gamma, Poisson and
Tweedie
• Naïve Bayes
Statistical
Analysis
Ensembles
• Distributed Random Forest:
Classification or regression models
• Gradient Boosting Machine: Produces
an ensemble of decision trees with
increasing refined approximations
Deep Neural
Networks
• Deep learning: Create multi-layer feed
forward neural networks starting with
an input layer followed by multiple
layers of nonlinear transformations
Algorithms on H2O-3 (CPU)
Unsupervised Learning
• K-means: Partitions observations into
k clusters/groups of the same spatial
size. Automatically detect optimal k
Clustering
Dimensionality
Reduction
• Principal Component Analysis: Linearly
transforms correlated variables to independent
components
• Generalized Low Rank Models: extend the idea of
PCA to handle arbitrary data consisting of
numerical, Boolean, categorical, and missing
data
Anomaly
Detection
• Autoencoders: Find outliers using a
nonlinear dimensionality reduction
using deep learning
“Confidential	and	property	of	H2O.ai.	All	rights	reserved”
Supervised Learning
• Generalized Linear Models: Binomial,
Gaussian, Gamma, Poisson and
Tweedie
• Naïve Bayes
Statistical
Analysis
Ensembles
• Distributed Random Forest:
Classification or regression models
• Gradient Boosting Machine: Produces
an ensemble of decision trees with
increasing refined approximations
Deep Neural
Networks
• Deep learning: Create multi-layer feed
forward neural networks starting with
an input layer followed by multiple
layers of nonlinear transformations
Algorithms on H2O4GPU (more to come)
Unsupervised Learning
• K-means: Partitions observations into
k clusters/groups of the same spatial
size. Automatically detect optimal k
Clustering
Dimensionality
Reduction
• Principal Component Analysis: Linearly
transforms correlated variables to independent
components
• Generalized Low Rank Models: extend the idea of
PCA to handle arbitrary data consisting of
numerical, Boolean, categorical, and missing
data
Anomaly
Detection
• Autoencoders: Find outliers using a
nonlinear dimensionality reduction
using deep learning
70
https://github.com/h2oai/h2o4gpu
71
blog.h2o.ai
Coming	Up
72
Date City Talk /	Workshop
25 June Milan H2O	+	LIME	Workshop
28	June Rome H2O	Intro	+	Use	Cases
Mid-July London Next	London	Meetup	(T.B.C.)
Mid-Oct London H2O	World	London
13-14	Nov London
Big Data	LDN
Keynote	by	Sri	(CEO)	+	Meetup
• Organizers	&	Sponsors
73
Thanks!
• Code,	Slides	&	Documents
• bit.ly/h2o_meetups
• bit.ly/joe_eRum_2018
• docs.h2o.ai
• Contact
• joe@h2o.ai
• @matlabulous
• github.com/woobe
• Please	search/ask	questions	on	
Stack	Overflow
• Use	the	tag	`h2o`	(not	h2	zero)
Appendix
74
Supported Formats & Data Sources
CSV
XLS
XLSX
ORC*
Hive*
SVMLight
ARFF
Parquet
Avro 1.8.0*
HDFS
S3
NFS
LOCAL
SQL
9Formats 5Sources
File type or Folder of Files
* 1. only if H2O is running as a Hadoop job
* 2. Hive files that are saved in ORC format
* 3. without multi-file parsing or column type modification
Distributed	Algorithms
• Foundation	for	In-Memory	Distributed	Algorithm	
Calculation	- Distributed	Data	Frames and	columnar	
compression	
• All	algorithms	are	distributed	in	H2O:	GBM,	GLM,	DRF,	Deep	
Learning	and	more.		Fine-grained	map-reduce	iterations.
• Only	enterprise-grade,	open-source	distributed	algorithms	
in	the	market
User	Benefits
Advantageous	Foundation	
• “Out-of-box”	functionalities	for	all	algorithms (NO	MORE	
SCRIPTING) and	uniform	interface	across	all	languages:	R,	
Python,	Java
• Designed	for	all	sizes	of	data	sets,	especially	large data
• Highly	optimized	Java	code	for	model	exports
• In-house	expertise	for	all	algorithms
Parallel	Parse	into	Distributed	Rows
Fine	Grain	Map	Reduce	Illustration:	Scalable	
Distributed	Histogram	Calculation	for	GBM
Foundation	for	Distributed	Algorithms
76
Demo:	
H2O	on	a	320-Core	Hadoop	
Cluster
(Web	Interface)
77
78
https://www.kaggle.com/c/higgs-boson
Learning	from	Higgs	Boson	Machine	Data
79
Sensors
(Detector)	
Data
Historical	Outcome
Is	it	a	Higgs	Particle	
(Yes/No)
Predicted	Outcome
Learn	the	Pattern
11M
Rows
28	Features
Raw	Data	Size:	7.48	GB
80
11M Rows Size	(Raw):					7.48	GB
Compressed:	2.00	GB	(≈	27%	of	Raw)
81
10	nodes
10	x	32	=	
320	Cores
10	x		29.6	=	296	
GB	Memory
H2O	Water	Meter	
(CPU	Monitor)
82
10	x	32	=	320	Cores

Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition)