In this talk, I will give you an overview of our company (H2O.ai), our open-source machine learning platform (H2O), as well as our new projects (e.g. Deep Water and Steam). This will be useful for attendees who are not familiar with H2O.
Introduction to Distributed Computing Engines for Data Processing - Simone Ro... (Data Science Milan)
This document provides an introduction to distributed computing engines for data processing. It discusses what distributed computing systems are and how they address the problem of data and tasks being too large for a single machine. It then covers key distributed computing systems like Hadoop, Spark and Flink. For each system, it summarizes what it is, when and where it originated, why it was created, and how it works at a high level. It also provides brief examples of common use cases for each system today.
Project “Deep Water” (H2O integration with other deep learning libraries) - Jo... (Data Science Milan)
The “Deep Water” project is about integrating our H2O platform with other open-source deep learning libraries such as TensorFlow, mxnet and Caffe. I will talk about the motivation and potential benefits of this project and then carry out a live demo using mxnet as the GPU backend.
This document provides an overview of Think Big Analytics, an analytics consulting firm. It discusses their services portfolio including data engineering, data science, analytics operations and managed services. It also highlights their global delivery model and successful projects with over 100 clients. The document then discusses their approach to artificial intelligence and deep learning, including applications across industries like banking, connected cars, and automated check processing. It emphasizes the need for a phased implementation approach to AI and challenges around technology, data, and deployment.
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O - Sri Ambati
Arno Candel introduces Deep Water, which brings Tensorflow, Caffe, Mxnet to H2O. It also brings support for GPUs, image classification, NLP and much more to H2O.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Love & Innovative technology presented by a technology pioneer and an AI expe... - Romeo Kienzler
This document discusses the rise of connected devices and machine learning. It notes that the number of connected devices is expected to grow from 15 billion in 2015 to 40 billion in 2020. It then covers various machine learning techniques including machine learning on historic data, online learning, neural networks, convolutional neural networks, recurrent neural networks, LSTM networks, and IBM's TrueNorth neural network chip. The document argues that neural networks can learn mathematical functions and algorithms and have outperformed traditional methods for problems like anomaly detection. However, neural networks are also computationally complex.
Deep Learning and Advanced Machine Learning on IoT - Romeo Kienzler
This document discusses advances in machine learning and deep learning on IoT devices. It notes that the number of connected devices is growing rapidly and will reach 40 billion by 2020. It then covers different types of machine learning approaches like online learning vs learning from historic data. It also demonstrates several deep learning techniques including neural networks, convolutional neural networks, LSTMs, and autoencoders. Finally, it discusses challenges like computational complexity and potential solutions like IBM's TrueNorth neuromorphic chip.
Dmitry will show the audience how to get started with Mxnet and how to build Deep Learning models to classify images, sound and text.
Towards the Cytoscape Cyberinfrastructure - Keiichiro Ono
1) The document discusses ongoing projects at the Cytoscape Core Developer Team to integrate Cytoscape into larger computational workflows by sharing data and computing resources over the network.
2) This will utilize standard tools like RStudio and IPython Notebook as primary workbenches for advanced users.
3) One project is a simple web application called CyNetShare that allows sharing of network visualization using Cytoscape.js in a web browser.
Some "challenges" on the open-source/open-data front - Greg Landrum
The document discusses challenges with chemical data interoperability and proposes some solutions. It notes that different software tools produce inconsistent results for chemical descriptors and structure representations. It suggests standardizing an open-source cheminformatics toolkit and defining open formats for common file types like SMILES, to improve reproducibility. It also proposes developing new open standards for representing complex molecules like organometallics containing metals.
cyREST provides platform-independent access to Cytoscape's data models and functions via REST. This allows different tools like RStudio, IPython notebooks, command line utilities, and web apps to interact with Cytoscape. The goal is for all bioinformatics tools to work seamlessly together. cyREST demonstrates controlling Cytoscape from an IPython notebook to enable interactive data analysis across environments and computing resources.
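The REST interaction described above can be sketched with Python's standard library. Port 1234 and the /v1 prefix are cyREST's documented defaults; the helper name is invented, and actually issuing the request of course requires a running Cytoscape instance with cyREST installed:

```python
import json
import urllib.request

# cyREST serves Cytoscape's data model over HTTP on port 1234 by default;
# adjust the base URL if your instance is configured differently.
BASE = "http://localhost:1234/v1"

def list_network_ids(base=BASE):
    """Return the SUIDs of all networks in the running Cytoscape session."""
    with urllib.request.urlopen(base + "/networks") as resp:
        return json.loads(resp.read().decode("utf-8"))

endpoint = BASE + "/networks"
print(endpoint)  # http://localhost:1234/v1/networks
```

Any HTTP-capable client (R, a notebook, curl) can hit the same endpoints, which is what makes the workbench-agnostic workflows above possible.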
Congresso Sociedade Brasileira de Computação CSBC2016 Porto Alegre (Brazil)
Workshop on Cloud Networks & Cloudscape Brazil
Rodolfo Azevedo - Associate professor at University of Campinas, Brazil
Interdisciplinary Research for Cloud Computing: Future and challenges
Overview of Modern Graph Analysis Tools - Keiichiro Ono
This document discusses modern tools for graph analysis and making graph workflows reproducible. It introduces cyREST, a RESTful API for programmatic access to Cytoscape, and language-specific wrappers like RCy3 and py2cytoscape that provide natural APIs. These tools allow running Cytoscape workflows in notebooks and on remote machines. It also covers graph libraries for analysis like NetworkX, igraph, graph-tool, and PGX for smaller graphs, and distributed frameworks like GraphX, GraphLab Create, and Neo4j for extremely large graphs with billions of nodes. The document recommends not using NetworkX for large data and considering cloud-based options for tools that are difficult to install.
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
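The stacking idea can be illustrated without H2O itself. The toy sketch below (all function names invented) uses two trivial base learners, leave-one-out holdout predictions, and a grid search over convex weights as the metalearner; H2O's Stacked Ensemble performs this shape of computation at scale with cross-validation and a trainable metalearner:

```python
# Toy sketch of the Super Learner (stacking) idea: combine cross-validated
# predictions of base learners with a metalearner.

def fit_mean(xs, ys):                 # base learner 1: constant predictor
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):                 # base learner 2: 1-D least squares
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def cv_predictions(fit, xs, ys):      # leave-one-out holdout predictions
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
        preds.append(model(xs[i]))
    return preds

def super_learner(xs, ys):
    p1 = cv_predictions(fit_mean, xs, ys)
    p2 = cv_predictions(fit_line, xs, ys)
    # metalearner: pick the convex combination minimizing squared error
    best_w = min((w / 100 for w in range(101)),
                 key=lambda w: sum((w * a + (1 - w) * b - y) ** 2
                                   for a, b, y in zip(p1, p2, ys)))
    f1, f2 = fit_mean(xs, ys), fit_line(xs, ys)
    return lambda x: best_w * f1(x) + (1 - best_w) * f2(x)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
model = super_learner(xs, ys)
print(model(5.0))
```

On this near-linear data the metalearner assigns almost all weight to the linear base learner, so the ensemble predicts close to 10 at x = 5.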
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Presentation given at the Stockholm R useR Group (SRUG) meetup on Dec 6, 2016. Contains a general overview of deep learning, material on using TensorFlow in R, etc.
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG - Thamme Gowda
Presented at Machine Learning Reading Group (MLRG) at NASA Jet Propulsion Laboratory (JPL).
Data programming is helpful for creating large datasets; our application is in the Mars Target Encyclopedia (MTE) project.
This document provides an overview of predictive churn modeling using H2O and Sparkling Water. It discusses what predictive churn is and key performance measures like lift. It also introduces H2O as a machine learning platform, Apache Spark, and H2O Sparkling Water which integrates H2O with Spark. The document demonstrates building a predictive churn model on telco customer data using different approaches in H2O Flow, Spark Scala, and R. It discusses deploying a model via REST API, Docker, and H2O Steam.
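Lift, one of the performance measures mentioned, compares the churn rate among the customers the model scores highest against the overall churn rate. A minimal sketch, with invented scores and labels:

```python
# Lift at the top fraction: how much more likely are the customers with the
# highest predicted churn scores to churn than the average customer.

def lift_at(scores, labels, fraction=0.1):
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(lab for _, lab in ranked[:k]) / k     # churn rate in top k
    base_rate = sum(labels) / len(labels)                # overall churn rate
    return top_rate / base_rate

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0,    0,    0]
print(lift_at(scores, labels, fraction=0.2))  # → 2.5
```

A lift of 2.5 at the top 20% means targeting that segment reaches churners 2.5 times more efficiently than contacting customers at random.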
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presentation slides for the SDCSB Cytoscape Workshop on 5/19/2016. The presentation covers the current status of the Cytoscape project and an overview of the Cytoscape ecosystem. It briefly mentions the Cytoscape Cyberinfrastructure.
Speaker: Pierre Richemond, Data Science Institute of Imperial College
Title: Cutting edge generative models: Applications and implications
Abstract: This talk will examine recent developments in deep learning content generation at scale. Whether it be images or text, the latest methods have now reached a level of quality making it hard to discriminate between human- and AI-generated content. We will review recent examples of such generative models, and put their significance in a broader context, in light of such powerful tools’ potential for dual use.
Bio: Pierre is currently researching his PhD in deep reinforcement learning at the Data Science Institute of Imperial College. He also teaches Deep Learning at the Graduate School, and helps to run the Deep Learning Network and organises thematic reading groups. His background is in mathematics - he has studied electrical engineering at ENST, probability theory and stochastic processes at Universite Paris VI - Ecole Polytechnique, and business management at HEC.
Today, Google Cloud Platform (GCP) is one of the leaders among cloud APIs. Although it was established only five years ago, GCP has gained notable expansion due to its suite of public cloud services, built on a huge, solid infrastructure. GCP allows developers to use these services by accessing the GCP RESTful API, which is described through HTML pages on its website. However, the documentation of the GCP API is written in natural language (English prose) and therefore shows several drawbacks, such as Informal Heterogeneous Documentation, Imprecise Types, Implicit Attribute Metadata, Hidden Links, Redundancy and Lack of Visual Support. To avoid confusion and misunderstandings, cloud developers clearly need a precise specification of the knowledge and activities in GCP. Therefore, this paper introduces GCP MODEL, an inferred formal model-driven specification of GCP which describes without ambiguity the resources offered by GCP. GCP MODEL conforms to the Open Cloud Computing Interface (OCCI) metamodel and is implemented with the open-source, model-driven, Eclipse-based OCCIWARE tool chain. Thanks to our GCP MODEL, we offer corrections to the drawbacks we identified.
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc... - Keiichiro Ono
This document provides an overview of a tutorial on building reproducible network data visualization workflows using Cytoscape and IPython Notebook. The tutorial will cover integrating data, analyzing networks, visualizing results, and preparing outputs for publication. It will demonstrate setting up a portable data analysis environment using Docker and sharing work through GitHub. The bulk of the tutorial will focus on using IPython Notebook as an electronic lab notebook for interactive and reproducible experiments with Cytoscape.
SDCSB Cytoscape and Network Analysis Workshop at Sanford Consortium - Keiichiro Ono
This document provides an overview and update on Cytoscape, an open source platform for biological network analysis and visualization. Key points discussed include:
- Cytoscape 3.2.1 is the latest desktop application release with new features like a chart editor and exporting visualizations as web applications.
- Cytoscape.js is a JavaScript library for building web applications that visualize networks, and there are examples of web apps built with it.
- Cytoscape's cyberinfrastructure initiative aims to make the software more accessible and integratable for computational biologists through services, apps, and repositories.
Cytoscape and External Data Analysis Tools - Keiichiro Ono
This document summarizes Keiichiro Ono's lab meeting presentation about developing a RESTful API for Cytoscape. The presentation covered the motivation for external tools to programmatically access Cytoscape, the design of a new Cytoscape module that exposes a RESTful API, and a proof-of-concept demo. The goal is to make Cytoscape more accessible for hardcore users to embed in automated workflows from languages like R and Python.
Inaugural talk Data Science Milan - Gianmario Spacagna
This document summarizes the inaugural talk for the Data Science Milan meetup group. It provides background on the speaker, Gianmario Spacagna, including his work experience and interests in machine learning systems, Scala, and the Professional Data Science Manifesto. It also gives an overview of the Data Science Milan meetup group, including its goals of promoting data-driven innovation and knowledge sharing among its members. Additionally, it outlines partnerships with Big Data consultancy startups and other meetup groups. Finally, it summarizes the results of an initial interests survey of group members.
Data intensive applications with Apache Flink - Simone Robutti, Radicalbit (Data Science Milan)
"Data intensive applications with Apache Flink" by Simone Robutti, Machine Learning Engineer @ Radicalbit
In the last 10 years, the IT industry has seen a complete revolution in the perceived value that computing has for businesses and in how engineers think about applications: in several application domains, the need for data has outgrown the capacity of commodity hardware, and the need for information has outpaced traditional processing technologies and approaches. In this talk we'll introduce Apache Flink, a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It is an open-source project that builds on proven approaches as well as innovative algorithms. We will go in depth on how this tool can be used to implement data-intensive applications, in particular regarding present tools and future perspectives for using machine learning algorithms in a distributed context.
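The keyed, windowed aggregation at the heart of such dataflow engines can be sketched in plain Python. This toy version is single-threaded and ignores the distribution and fault tolerance that Flink actually provides; all names and the event data are illustrative:

```python
# Conceptual sketch of a keyed tumbling-window aggregation, the core
# dataflow pattern of engines like Flink (DataStream -> keyBy -> window
# -> sum). Events are (timestamp, key, value) tuples in time order.
from collections import defaultdict

def keyed_tumbling_window(events, window_size):
    """Yield one {key: running_sum} dict per closed time window."""
    window = defaultdict(float)
    window_end = None
    for ts, key, value in events:
        if window_end is None:
            window_end = ts + window_size
        while ts >= window_end:          # window closed: emit and advance
            yield dict(window)
            window.clear()
            window_end += window_size
        window[key] += value
    if window:                           # flush the final partial window
        yield dict(window)

stream = [(0, "a", 1), (1, "b", 2), (2, "a", 3), (5, "a", 1), (6, "b", 1)]
print(list(keyed_tumbling_window(stream, window_size=5)))
```

Running this groups the first three events into the [0, 5) window and the last two into [5, 10), which is the same shape of result a Flink keyed window sum would produce.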
Simone Robutti, 27, is a Machine Learning Engineer at Radicalbit. He earned a Master’s Degree at Università degli studi di Milano with a thesis on SVMs for noisy labeled datasets. From then on his interests shifted towards the engineering side of Machine Learning and Big Data: implementation, deployment, portability and maintainability of ML-intensive systems. Right now his focus at Radicalbit is Flink and its Machine Learning library, FlinkML.
The Barclays Data Science Hackathon: Building Retail Recommender Systems base... (Data Science Milan)
In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a Kaggle-like competition, where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:
• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.
• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.
• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.
• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.
• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).
• How Scala (and functional programming) helped our cause.
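The type-safe ETL bullet above can be sketched outside Scala as well. Below is a minimal, hypothetical Python analogue (the talk used Spark with Scala case classes; `Product` and `Transaction` are invented record types used only to illustrate typed, nested ETL structures):

```python
from dataclasses import dataclass

# Invented, simplified records: the talk's Scala case classes played
# the same role of giving ETL steps a typed, possibly nested schema.
@dataclass(frozen=True)
class Product:
    category: str
    price: float

@dataclass(frozen=True)
class Transaction:
    customer_id: str
    products: list  # list of Product

def total_spend(tx: Transaction) -> float:
    # Typed access: a typo such as p.prise fails loudly instead of
    # silently producing nulls, which is the point of type-safe ETL.
    return sum(p.price for p in tx.products)

tx = Transaction("c42", [Product("food", 3.5), Product("books", 12.0)])
print(total_spend(tx))  # 15.5
```

The same shape scales from a laptop prototype to a cluster because the schema travels with the data rather than living in stringly-typed column names.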
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications. His main expertise is on building production-oriented machine learning systems. Co-author of the Professional Manifesto for Data Science, he loves evangelising his passion for best practices and effective methodologies amongst the community. Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
This document provides an introduction to H2O, an open source machine learning platform, and discusses potential Internet of Things (IoT) use cases for predictive maintenance and outlier detection. The document outlines Joe Chow's background and experience, provides an overview of H2O's capabilities including algorithms, interfaces, and exporting models for production. It then demonstrates how to use H2O for predictive maintenance on a dataset of sensor readings to predict equipment failures, and for outlier detection on the MNIST handwritten digits dataset to identify anomalous images.
Introduction to Machine Learning with H2O and Python (Jo-fai Chow)
This document provides an introduction and overview of machine learning with H2O and Python. It begins with background information about the presenter, Joe Chow, including his work experience and side projects. The agenda then outlines topics to be covered, including an introduction to H2O.ai (the company and the machine learning platform), followed by a Python tutorial and examples. The tutorial covers importing and manipulating data, basic and advanced regression and classification models, and using H2O in the cloud.
This is my Deep Water talk for the TensorFlow Paris meetup.
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use and deployment.
This document summarizes a presentation given by Joe Chow on machine learning using H2O.ai's platform. The presentation introduced H2O, its products like Deep Water for deep learning, and demonstrated examples of building models with R and Python. It showed how H2O provides a unified interface for TensorFlow, MXNet and Caffe, allowing users to easily build and deploy deep learning models with different frameworks. The document provided an overview of the company and platform capabilities like scalable algorithms, model export and multiple language interfaces like R and Python.
This document summarizes a presentation given by Joe Chow on machine learning using H2O.ai's platform. The presentation covered:
1) An introduction to Joe and H2O.ai, including the company's mission to operationalize data science.
2) An overview of the H2O platform for machine learning, including its distributed algorithms, interfaces for R and Python, and model export capabilities.
3) A demonstration of deep learning using H2O's Deep Water integration with TensorFlow, MXNet, and Caffe, allowing users to build and deploy models across different frameworks.
Introduction to H2O and Model Stacking Use Cases (Jo-fai Chow)
This document provides an introduction and overview of H2O, an open source machine learning platform. It discusses H2O's capabilities for supervised and unsupervised learning using algorithms like gradient boosted machines, random forests, and deep learning. It also introduces the concept of model stacking in H2O, which uses the predictions from multiple models as inputs to train a new meta-model, and provides examples of stacking for regression and classification problems using various datasets.
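The stacking recipe described here (predictions from multiple models become inputs to a meta-model) can be illustrated in miniature. This is not H2O's implementation: the targets and base-model predictions below are invented, and the "meta-model" is reduced to a single blending weight chosen by grid search:

```python
# Out-of-fold predictions from two base models feed the meta-model;
# here the meta-model is just the convex weight w blending A and B.
y      = [1.0, 2.0, 3.0, 4.0]   # true targets on a held-out fold
pred_a = [1.2, 1.8, 3.3, 3.9]   # base model A's out-of-fold predictions
pred_b = [0.8, 2.4, 2.7, 4.2]   # base model B's out-of-fold predictions

def mse(w):
    blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
    return sum((p - t) ** 2 for p, t in zip(blend, y)) / len(y)

# Grid-search the blending weight on the held-out predictions.
best_w = min((w / 100 for w in range(101)), key=mse)
print(best_w, mse(best_w))
```

A real stacker trains a full model (e.g. a GLM) on the base predictions, but the principle of fitting the combiner on held-out predictions is the same.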
Joe Chow gave a presentation about machine learning use cases using H2O. He introduced H2O, an open source machine learning platform that works with R, Python, and other languages. He discussed how companies like Telenor and customers in Brazil are using H2O for tasks like predictive modeling. Joe also highlighted current algorithms in H2O and how it is evolving with new features like deep learning integration. The talk provided an overview of H2O and real-world examples of how organizations are applying machine learning with the platform.
Kaggle Competitions, New Friends, New Skills and New Opportunities (Jo-fai Chow)
1) Joe Chow transitioned from a career in civil engineering to data science after completing his PhD and participating in his first massive open online course (MOOC), which introduced him to Kaggle competitions.
2) His first Kaggle competition exposed gaps in his skills with open-source languages like R and Python and collaboration. He worked to fill these gaps by taking more MOOCs and learning from other Kagglers.
3) Participating in Kaggle competitions led to new opportunities for Joe, including side projects applying his skills, presentations, collaborations, a job at H2O.ai, and becoming a respected member of the data science community.
Introduction to Machine Learning with H2O and Python (Jo-fai Chow)
This document provides an introduction and agenda for a tutorial on machine learning with H2O and Python. The tutorial introduction covers the speaker's background and experience with machine learning. The agenda then outlines the topics to be covered, including an overview of H2O.ai as a company and machine learning platform, examples of importing and manipulating data with H2O and Python, training regression and classification models with different algorithms, improving model performance through tuning and ensembling techniques, and using H2O in the cloud. Code examples and Jupyter notebooks are referenced throughout the tutorial sections.
Introduction to Machine Learning with H2O and Python (Sri Ambati)
This document provides an introduction and agenda for a tutorial on machine learning with H2O and Python. The introduction discusses the presenter's background and qualifications. The agenda outlines topics to be covered including an overview of H2O.ai as a company and machine learning platform, tutorials on using the H2O Python module to import data, build regression and classification models, and improve model performance through techniques like cross-validation, grid search, and stacking. Case study notebooks and examples will be used to demonstrate key machine learning concepts and the H2O framework in Python.
This document summarizes a presentation about H2O's machine learning platform and Deep Water distributed deep learning capabilities. The presentation introduces H2O, its open source in-memory machine learning platform, performance advantages, and interfaces for R, Python and Flow. Deep Water is introduced as H2O's integration with TensorFlow, MXNet and Caffe that provides a unified interface for distributed deep learning on GPUs. Examples are shown training convolutional neural networks on image datasets using Deep Water with different backends.
Automatic and Interpretable Machine Learning in R with H2O and LIME (Sri Ambati)
This is a hands-on tutorial for R beginners. I will demonstrate the use of two R packages, h2o & LIME, for automatic and interpretable machine learning. Participants will be able to follow and build regression and classification models quickly with H2O’s AutoML. They will then be able to explain the model outcomes with a framework called Local Interpretable Model-Agnostic Explanations (LIME).
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools... (Artefactual Systems - AtoM)
These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
Intro to Machine Learning with H2O and AWS (Sri Ambati)
Navdeep Gill @ Galvanize Seattle - May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Testing In Production (TiP) Advances with Big Data and the Cloud (SOASTA)
The document summarizes a webinar presentation about methodologies and technologies for testing in production (TiP). The presentation discusses leveraging active and passive monitoring of real user data in production for testing purposes. It also covers experimenting with new features on real users in production, and using load testing to evaluate system stress and scalability under real world conditions. The presentation was given by employees from Microsoft and SOASTA to discuss TiP strategies.
H2O Deep Water - Making Deep Learning Accessible to Everyone (Jo-fai Chow)
H2O Deep Water is a tool that integrates distributed deep learning with H2O's machine learning platform. It allows users to build, stack, and deploy deep learning models from libraries like TensorFlow, MXNet, and Caffe through a unified interface. Deep Water inherits properties from H2O like scalability, ease of use, and deployment capabilities. It also makes deep learning more accessible by supporting popular network architectures and allowing easy ensemble of deep models with other H2O algorithms.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2nwSwEh.
Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.
Marco Bonzanini is Data Scientist and co-organizer of PyData London Meetup.
The Quest for an Open Source Data Science Platform (QAware GmbH)
Cloud Native Night July 2019, Munich: Talk by Jörg Schad (@joerg_schad, Head of Engineering & ML at ArangoDB)
=== Please download slides if blurred! ===
Abstract: With the rapid and recent rise of data science, the Machine Learning Platforms being built are becoming more complex. For example, consider the various Kubeflow components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature Store, and more. Each of these components produces metadata: different (versions of) datasets, different versions of a Jupyter notebook, different training parameters, test/training accuracy, different features, model serving statistics, and many more.
For production use it is critical to have a common view across all these metadata as we have to ask questions such as: Which jupyter notebook has been used to build Model xyz currently running in production? If there is new data for a given dataset, which models (currently serving in production) have to be updated?
In this talk, we look at existing implementations, in particular MLMD as part of the TensorFlow ecosystem. Further, we propose a first draft of an (MLMD-compatible) universal Metadata API. We demo the first implementation of this API using ArangoDB.
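The lineage question posed above ("which Jupyter notebook was used to build Model xyz?") amounts to walking a metadata graph backwards. A toy sketch with an invented API (not MLMD's API nor the proposed Metadata API; artifact names are made up):

```python
# Store typed artifacts plus directed "produced-from" edges,
# then answer lineage queries by walking the edges upstream.
artifacts = {}   # id -> {"type": ..., "name": ...}
parents = {}     # artifact id -> list of upstream artifact ids

def record(aid, atype, name, inputs=()):
    artifacts[aid] = {"type": atype, "name": name}
    parents[aid] = list(inputs)

record("d1", "dataset",  "clicks_v3")
record("n1", "notebook", "train_model.ipynb", inputs=["d1"])
record("m1", "model",    "xyz", inputs=["n1"])

def upstream(aid, atype):
    """All ancestors of `aid` having the given artifact type."""
    found, stack = [], list(parents.get(aid, []))
    while stack:
        cur = stack.pop()
        if artifacts[cur]["type"] == atype:
            found.append(artifacts[cur]["name"])
        stack.extend(parents.get(cur, []))
    return found

print(upstream("m1", "notebook"))  # ['train_model.ipynb']
```

The reverse walk ("new data arrived for dataset d1, which production models must be retrained?") is the same traversal along child edges.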
Similar to Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O (20)
ML & Graph algorithms to prevent financial crime in digital payments (Data Science Milan)
This document discusses using machine learning and graph algorithms to prevent financial crime in digital payments. It presents a layered approach: Level 0 uses rule-based SQL queries to detect anomalies, Level 1 applies supervised machine learning to classify transactions, and Level 2 uses a graph database and rules to model network anomalies. Level 3 combines machine learning, graph algorithms, and personalized PageRank to spread anomaly scores throughout a transaction network and identify suspicious groups. The strategies are being piloted through the Infinitech Project to develop technologies for applications in financial crime prevention, cybersecurity, and personalized products using AI, big data, IoT, and blockchain.
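The Level-3 score-spreading step can be sketched as a plain power iteration of personalized PageRank; the transaction graph and the flagged seed account below are invented for illustration:

```python
# Directed transaction graph and a restart distribution concentrated
# on "a", the account flagged as anomalous by the earlier levels.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
seed = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
alpha = 0.85  # damping factor

rank = dict(seed)
for _ in range(50):  # power iteration
    new = {n: (1 - alpha) * seed[n] for n in graph}
    for n, outs in graph.items():
        for m in outs:
            new[m] += alpha * rank[n] / len(outs)
    rank = new

print(sorted(rank, key=rank.get, reverse=True))
```

Accounts reachable from the flagged seed inherit part of its anomaly mass, while disconnected accounts (here "d") stay near zero, which is exactly how suspicious groups surface.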
How to use the Economic Complexity Index to guide innovation plans (Data Science Milan)
The document discusses how to use the Economic Complexity Index (ECI) and Product Complexity Index (PCI) to guide innovation plans. It explains that the ECI and PCI are network measures that provide insights into economic development patterns by measuring diversity and ubiquity. The talk will show how to compute these metrics based on network theory and how they can be interpreted to compare countries, markets, products, and inform data-driven plans. Occupation complexity is also calculated based on skill diversity and ubiquity to understand changing skill demands over time.
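The two raw ingredients of ECI/PCI, diversity and ubiquity, are straightforward to compute from a binary country-by-product export matrix; the tiny matrix below is invented, and the real method of reflections then iterates averages of these counts:

```python
# M[c][p] = 1 if country c exports product p competitively.
M = {
    "A": {"cars": 1, "chips": 1, "wine": 1},
    "B": {"cars": 0, "chips": 0, "wine": 1},
}
products = ["cars", "chips", "wine"]

# Diversity k_{c,0}: how many products a country exports.
diversity = {c: sum(row.values()) for c, row in M.items()}
# Ubiquity k_{p,0}: how many countries export a product.
ubiquity = {p: sum(M[c][p] for c in M) for p in products}

print(diversity)  # {'A': 3, 'B': 1}
print(ubiquity)   # {'cars': 1, 'chips': 1, 'wine': 2}
```

Country A is diverse and exports low-ubiquity products (cars, chips), the signature of a complex economy; country B only exports the ubiquitous product.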
"You don't need a bigger boat": serverless MLOps for reasonable companies (Data Science Milan)
It is indeed a wonderful time to build machine learning systems, as the growing ecosystem of tools and shared best practices makes even small teams incredibly productive at scale. In this talk, we present our philosophy for modern, no-nonsense data pipelines, highlighting the advantages of an (almost) pure serverless and open-source approach, and showing how the entire toolchain works, from raw data to model serving, on a real-world dataset.
Finally, we argue that the crucial component for analyzing data pipelines is not the model per se, but the surrounding DAG, and present our proposal for producing automated "DAG cards" from Metaflow classes.
Bio:
Jacopo Tagliabue was co-founder and CTO of Tooso, an A.I. company in San Francisco acquired by Coveo in 2019. Jacopo is currently the Lead A.I. Scientist at Coveo. When not busy building A.I. products, he is exploring research topics at the intersection of language, reasoning and learning, with several publications at major conferences (e.g. WWW, SIGIR, RecSys, NAACL). In previous lives, he managed to get a Ph.D., do scienc-y things for a pro basketball team, and simulate a pre-Columbian civilization.
Topics: MLOps, Metaflow, model cards.
Question generation using Natural Language Processing by QuestGen.AI (Data Science Milan)
Ramsri Goutham presented on generating multiple choice questions (MCQs) from text using natural language processing. He discussed using T5 transformers and sense2vec vectors to generate questions from news articles and generate wrong answer choices using WordNet and Sense2vec. Ramsri also shared an open source question generation library called Questgen and demonstrated generating MCQs from sample text about Elon Musk and cryptocurrencies in a Google Colab notebook.
Abstract: Data preparation and modelling are the activities that take most of the time in a typical data scientist workday. In this session we’ll see how AWS services for Analytics and data management can be effectively used and integrated in AI/ML pipelines. We’ll focus on AWS Glue, AWS Glue DataBrew and AWS Data Wrangler with a bit of theory and hands-on demos.
Bio:
Francesco Marelli is a senior solutions architect at Amazon Web Services. He has lived and worked in the UK, Italy, Switzerland and other countries in EMEA. He specializes in the design and implementation of Analytics, Data Management and Big Data systems. Francesco also has strong experience in systems integration and in the design and implementation of applications.
Topics: machine learning pipelines, AWS, cloud.
Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure (Data Science Milan)
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at a different cadence.
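The core promise, one registered transformation serving both training and inference, can be sketched with an invented registry API (this is not the Feature Store API described in the talk, and `basket_size` is a made-up feature):

```python
# A feature store in miniature: register a transformation once,
# then apply the *same* code path at training time and at serving
# time, avoiding training/serving skew.
registry = {}

def feature(name):
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@feature("basket_size")
def basket_size(raw):
    return len(raw["items"])

def build_vector(raw, names):
    return [registry[n](raw) for n in names]

train_row = build_vector({"items": ["a", "b"]}, ["basket_size"])
serve_row = build_vector({"items": ["a", "b"]}, ["basket_size"])
print(train_row == serve_row)  # True
```

A production feature store adds versioning, storage, and discovery on top, but the single-source-of-truth transformation is the essential idea.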
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
This document provides an overview of reinforcement learning. It discusses the reinforcement learning framework including actors like agents, environments, states, actions, rewards, and policies. It also summarizes several common reinforcement learning methods including value-based methods, policy-based methods, and model-based methods. Value-based methods estimate value functions using algorithms like Q-learning and deep Q-networks. Policy-based methods directly learn policies using policy gradient algorithms like REINFORCE. Model-based methods learn models of the environment and then plan based on these models.
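As a concrete instance of the value-based family mentioned above, here is tabular Q-learning on a toy four-state chain (the environment and hyperparameters are invented for illustration):

```python
import random

# Toy chain MDP: states 0..3, actions move left (-1) or right (+1),
# reward 1 for reaching the goal state 3, zero otherwise.
random.seed(0)
n_states, actions = 4, [-1, +1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(500):
    s = 0
    while s != 3:
        if random.random() < eps:
            a = random.choice(actions)                  # explore
        else:
            a = max(actions, key=lambda x: Q[(s, x)])   # exploit
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 3 else 0.0
        best_next = max(Q[(s2, b)] for b in actions)
        # Q-learning update toward the bootstrapped target.
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# Greedy policy extracted from the learned Q-table.
policy = [max(actions, key=lambda x: Q[(s, x)]) for s in range(3)]
print(policy)
```

A deep Q-network replaces the table with a neural network, and policy-gradient methods skip the value table entirely, but the update rule above is the canonical value-based step.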
Time Series Classification with Deep Learning | Marco Del Pra (Data Science Milan)
Today a great deal of data is stored in the form of time series, and with the current wide diffusion of real-time applications, many areas show a strongly increasing interest in applications based on this kind of data: for example finance, advertising, marketing, health care, automated disease detection, biometrics, retail, and the identification of anomalies of any kind. It is therefore very interesting to understand the role and potential of machine learning in this sector.
Many methods can be used for the classification of the time series, but all of them, apart from deep learning, require some kind of feature engineering as a separate stage before the classification is performed, and this can imply the loss of some important information and the increase of the development and test time. On the contrary, deep learning models such as recurrent and convolutional neural networks already incorporate this kind of feature engineering internally, optimizing it and eliminating the need to do it manually. Therefore they are able to extract information from the time series in a faster, more direct, and more complete way.
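Mechanically, the "built-in feature engineering" of a convolutional network is just learned kernels sliding over the series. Below is a hand-rolled 1-D convolution with a fixed difference kernel; in a real CNN the kernel values are learned during training, and the series here is invented:

```python
# Slide a kernel over the series and emit a feature map.
# [1, -1] is a difference filter: it highlights changes in the signal.
series = [0.0, 0.0, 1.0, 1.0, 0.0]
kernel = [1.0, -1.0]

feature_map = [
    sum(k * series[i + j] for j, k in enumerate(kernel))
    for i in range(len(series) - len(kernel) + 1)
]
print(feature_map)  # [0.0, -1.0, 0.0, 1.0]
```

Stacking many such (learned) filters, plus nonlinearities and pooling, is what lets the network extract discriminative features directly from raw series without a separate feature-engineering stage.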
Bio:
Marco Del Pra
I am 41 years old and was born in Venice, and I have two master's degrees (Computer Science and Mathematics). I have been working for about 10 years in Artificial Intelligence, first as a Data Scientist, then as a Team Leader and finally as Head of Data. Among others, I have worked for Microsoft, for the European Commission (JRC of Ispra) and for Cuebiq. I am currently working as a freelancer and, together with two other co-founders, am creating an innovative AI startup. I have two important publications in applied mathematics.
Topics: recurrent and convolutional neural networks, deep learning, time-series.
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI (Data Science Milan)
The talk will introduce Ludwig, a deep learning toolbox that allows users to train models and use them for prediction without writing code. It is unique in its ability to make deep learning easier to understand for non-experts and to enable faster model-improvement iteration cycles for experienced machine learning developers and researchers alike. By using Ludwig, experts and researchers can simplify the prototyping process and streamline data processing so that they can focus on developing deep learning architectures.
Bio:
Piero Molino is a Senior Research Scientist at Uber AI, focusing on machine learning for language and dialogue. Piero completed a PhD on Question Answering at the University of Bari, Italy, and founded QuestionCube, a startup that built a framework for semantic search and QA. He worked for Yahoo Labs in Barcelona on learning to rank and for IBM Watson in New York on natural language processing with deep learning, then joined Geometric Intelligence, where he worked on grounded language understanding. After Uber acquired Geometric Intelligence, he became one of the founding members of Uber AI Labs.
Audience projection of target consumers over multiple domains a ner and baye... (Data Science Milan)
Traditional market research is generally conducted via questionnaires or other forms of explicit feedback, put directly to an ad hoc panel of individuals who in aggregate are representative of a larger group of people. Unfortunately, those traditional approaches are often invasive, non-scalable, and biased. Indirect approaches based on sparse and implicit consumer feedback (e.g., social network interactions, web browsing, or online purchases) are more scalable, more authentic, and more suitable for real-time consumer insights.
Although those sources of implicit consumer feedback provide relevant and detailed pictures of the population, they individually provide only a limited set of observable behaviors.
The Holy Grail of market research is the ability to merge different sources of consumer interests into an augmented view that connects all the dots across multiple domains.
Unfortunately, user-centric "fusion" algorithms present many limitations in the case of heterogeneous datasets strongly differing in terms of size and density and when the number of sources to merge increases.
We propose a novel Audience Projection approach, able to define a target audience as a subset of the population in a source domain and to project this target onto a set of users in a destination dataset.
We will show how libraries such as spaCy provide deep learning implementations of Named Entity Recognition (NER) to match related brands, and how Bayesian inference can transfer knowledge from the source domain. This way, we can estimate the probability that a user belongs to the target, using the source-domain distribution of interest volumes for common entities as model evidence and the source target size as the prior probability.
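As a rough illustration of the Bayesian step described above, here is a minimal pure-Python sketch. All entity statistics and the prior are invented for the example (they are not Helixa's data): given source-domain interest rates for common entities among target users and among the whole population, Bayes' rule yields the probability that a destination-domain user belongs to the target.

```python
import math

# Hypothetical source-domain statistics: for each shared entity (e.g. a
# brand matched via NER), the fraction of target users and of all users
# who show interest in it.
p_interest_given_target = {"brand_a": 0.60, "brand_b": 0.30}
p_interest_overall = {"brand_a": 0.20, "brand_b": 0.25}

prior_target = 0.10  # target audience size in the source domain (prior)

def posterior_target(observed_entities):
    """Naive-Bayes style projection: P(target | observed interests)."""
    log_odds = math.log(prior_target / (1 - prior_target))
    for e in observed_entities:
        p_t = p_interest_given_target[e]
        # P(interest | not target), derived from the overall rate:
        p_nt = (p_interest_overall[e] - prior_target * p_t) / (1 - prior_target)
        log_odds += math.log(p_t / p_nt)
    return 1 / (1 + math.exp(-log_odds))

print(round(posterior_target(["brand_a", "brand_b"]), 3))  # → 0.345
```

A user showing interest in both (hypothetical) brands moves from a 10% prior to roughly a 34% posterior probability of belonging to the target.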
Bio:
Gianmario Spacagna is the chief scientist and head of AI at Helixa. His team’s mission is building the next generation of behavior algorithms and models of human decision making with careful attention to their potential and effects on society. His experience covers a diverse portfolio of machine learning algorithms and data products across different industries. Previously, he worked as a data scientist in IoT automotive (Pirelli Cyber Technology), retail and business banking (Barclays Analytics Centre of Excellence), threat intelligence (Cisco Talos), predictive marketing (AgilOne), plus some occasional freelancing. He’s a co-author of the book Python Deep Learning, contributor to the “Professional Manifesto for Data Science,” and founder of the Data Science Milan community. Gianmario holds a master’s degree in telematics (Polytechnic of Turin) and software engineering of distributed systems (KTH of Stockholm). After having spent half of his career abroad, he now lives in Milan. His favorite hobbies include home cooking, hiking, and exploring the surrounding nature on his motorcycle.
Weakly Supervised Learning: Introduction and Best Practices
In this talk we will introduce the three main types of weakly supervised learning: incomplete, inexact, and inaccurate supervision. We will examine how models can be trained under weak supervision and look at real applications of weakly supervised learning, showing how it can improve results and decrease costs.
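A minimal sketch of one flavor of weak supervision (inaccurate labels produced by cheap heuristics, in the spirit of labeling-function approaches); the rules and documents below are invented for illustration:

```python
# Toy weak supervision: several noisy labeling functions (heuristics)
# vote on unlabeled examples; a majority vote produces "inaccurate"
# training labels far more cheaply than manual annotation would.

def lf_contains_refund(text):   # heuristic 1: refund requests
    return 1 if "refund" in text else 0

def lf_contains_angry(text):    # heuristic 2: explicit anger
    return 1 if "angry" in text else 0

def lf_exclamations(text):      # heuristic 3: lots of exclamation marks
    return 1 if text.count("!") >= 2 else 0

LFS = [lf_contains_refund, lf_contains_angry, lf_exclamations]

def weak_label(text):
    """Majority vote over labeling functions: 1 = complaint, 0 = other."""
    votes = [lf(text) for lf in LFS]
    return 1 if sum(votes) * 2 > len(votes) else 0

docs = [
    "I want a refund now!! This is unacceptable!!",
    "Thanks for the quick delivery.",
    "I am angry, please refund my order",
]
print([weak_label(d) for d in docs])  # → [1, 0, 1]
```

The resulting noisy labels can then train an ordinary classifier, which often generalizes beyond the individual heuristics.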
Bio:
Kristina Khvatova works as a Software Engineer at Softec S.p.A. She is currently involved in the development of a project for data analysis and visualisation, which includes quantitative and qualitative analysis based on classification, optimisation, time-series prediction, and anomaly-detection techniques. She obtained a master's degree in Mathematics at Saint Petersburg State University and a master's degree in Computer Science at the University of Milano-Bicocca.
GANs beyond nice pictures: real value of data generation, Alex Honchar - Data Science Milan
Generative modeling can be used for problems beyond just generation, such as anomaly detection, determining factors of variation in datasets, domain adaptation between not-aligned datasets, and building better embeddings for supervised learning tasks. Generative models can model the underlying distribution of data to check if a point belongs to that distribution or create new points from the distribution. They can learn low-dimensional manifolds on which real-world high-dimensional data like images lie. This allows generative models to be applied to challenges like filtering, style transfer, and improving embeddings to boost performance on downstream tasks.
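As a toy illustration of the "model the distribution, then check membership" idea, the sketch below uses a fitted 1-D Gaussian as a stand-in for a learned generative model (a real GAN-based detector would use discriminator or reconstruction scores instead; everything here is an assumption for illustration):

```python
import math
import random

random.seed(0)

# Stand-in for a learned generative model: fit a 1-D Gaussian to
# "normal" data, then flag points whose likelihood is too low.
normal_data = [random.gauss(0.0, 1.0) for _ in range(1000)]

mu = sum(normal_data) / len(normal_data)
var = sum((x - mu) ** 2 for x in normal_data) / len(normal_data)

def log_density(x):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Threshold: the 1st percentile of the training log-likelihoods.
scores = sorted(log_density(x) for x in normal_data)
threshold = scores[len(scores) // 100]

def is_anomaly(x):
    return log_density(x) < threshold

print(is_anomaly(0.1), is_anomaly(8.0))  # typical point vs. far outlier
```

Points the model assigns very low likelihood are flagged as not belonging to the learned distribution, which is exactly the anomaly-detection use of generative models described above.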
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco - Data Science Milan
Humans have the extraordinary ability to learn continually from experience. Not only can we apply previously learned knowledge and skills to new situations, we can also use these as the foundation for later learning. One of the grand goals of AI is building an artificial continually learning agent that constructs a sophisticated understanding of the world from its own experience through the autonomous incremental development of ever more complex skills and knowledge.
"Continual Learning" (CL) is indeed a fast emerging topic in AI concerning the ability to efficiently improve the performance of a deep model over time, dealing with a long (and possibly unlimited) sequence of data/tasks. In this workshop, after a brief introduction of the topic, we’ll implement different Continual Learning strategies and assess them on common vision benchmarks. We’ll conclude the workshop with a look at possible real world applications of CL.
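One of the simplest CL strategies covered in such workshops is rehearsal with a replay memory. A minimal sketch using reservoir sampling and placeholder examples (this is an illustrative assumption, not the workshop's actual code):

```python
import random

random.seed(42)

class ReplayBuffer:
    """Reservoir-sampled memory for rehearsal-based continual learning.

    Keep a small random sample of past examples and mix it into each new
    task's training batches, so the model is reminded of earlier tasks
    (mitigating catastrophic forgetting).
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example  # replace with decaying probability

    def rehearsal_batch(self, new_batch, k):
        """New-task batch augmented with k replayed old examples."""
        return new_batch + random.sample(self.buffer, min(k, len(self.buffer)))

memory = ReplayBuffer(capacity=50)
for task_id in range(3):            # a short sequence of tasks
    for i in range(200):
        memory.add((task_id, i))    # (task, example) placeholders
batch = memory.rehearsal_batch([("new", 0)], k=10)
print(len(memory.buffer), len(batch))  # → 50 11
```

The buffer ends up holding a roughly uniform sample across all tasks seen so far, regardless of how long the task sequence grows.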
Vincenzo Lomonaco is a Deep Learning PhD student at the University of Bologna and founder of ContinualAI.org. He is also the PhD student representative at the Department of Computer Science and Engineering (DISI) and a teaching assistant for the courses “Machine Learning” and “Computer Architectures” in the same department. Previously, he was a machine learning software engineer at IDL in-line Devices and a master's student at the University of Bologna, where he graduated cum laude in 2015 with the dissertation “Deep Learning for Computer Vision: a Comparison Between CNNs and HTMs on Object Recognition Tasks”.
Processing 3D images has many use cases: improving autonomous driving, enabling digital conversion of old factory buildings, powering augmented-reality solutions for medical surgery, and more. 3D images also help with 3D modeling and product safety evaluation.
3D image processing brings enormous benefits but also amplifies computing costs. The size of the point cloud, the number of points, its sparse and irregular structure, and the adverse impact of light reflections, (partial) occlusions, etc. make point clouds difficult for engineers to process.
Moving from hand-crafted features to deep learning techniques for semantic segmentation, object classification and detection, action detection in 3D videos, and more, we have come a long way in 3D image processing.
3D Point Cloud image processing is increasingly used to solve Industry 4.0 use cases to help architects, builders and product managers. I will share some of the innovations that are helping the progress of 3D point cloud processing. I will share the practical implementation issues we faced while developing deep learning models to make sense of 3D Point Clouds.
Attendees: beginner and intermediate practitioners in image processing and 3D point clouds
Profile of the speaker:
SK Reddy is the Chief Product Officer AI at Hexagon (www.hexagon.com). He is an AI and ML expert and a two-time successful startup entrepreneur, as well as an AI startup advisor. He is also a frequent conference speaker and an AI blogger.
Deep time-to-failure: predicting failures, churns and customer lifetime with ... - Data Science Milan
1. The document discusses using deep learning models like recurrent neural networks to predict time-to-failure events from time series data. It specifically focuses on a technique called Deep Time-to-Failure which extends a Weibull Time-to-Event Recurrent Neural Network to predict a single failure event.
2. As a case study, the technique is applied to predict failure times of NASA jet engines using sensor data as inputs. The model is trained on historical sequences of data to learn the distribution of time-to-failure and can provide probabilistic predictions and confidence intervals.
3. Key aspects of the Deep Time-to-Failure approach include using both censored and uncensored training data and consuming raw time series as input.
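Once a network of this kind predicts Weibull parameters, survival probabilities and quantiles follow in closed form. A sketch with illustrative parameter values (these numbers are assumptions for the example, not NASA results):

```python
import math

# A network's predicted Weibull parameters for one engine
# (alpha = scale, beta = shape; illustrative values only):
alpha, beta = 120.0, 1.8

def survival(t):
    """P(failure occurs after time t) under Weibull(alpha, beta)."""
    return math.exp(-((t / alpha) ** beta))

def quantile(p):
    """Time by which failure has occurred with probability p."""
    return alpha * (-math.log(1 - p)) ** (1 / beta)

median_ttf = quantile(0.5)
print(round(survival(60.0), 3))  # → 0.75  chance of surviving past t=60
print(round(median_ttf, 1))      # → 97.9  median remaining useful life
```

Because the model outputs a full distribution rather than a point estimate, any quantile can serve as a confidence interval on the time to failure.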
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ... - Data Science Milan
50 Shades of Text - Leveraging Natural Language Processing (NLP) to validate, improve, and expand the functionalities of a product
Nowadays, every company either stores or produces text data: from web logs and user queries to translations and support tickets. Yet not everyone knows how to extract valuable insights from it. In this session, we will present a practical case of moving from raw text data to a valuable business application, leveraging some of the major NLP methodologies (word embeddings, word2vec, doc2vec, fastText, etc.)
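A toy sketch of the common doc2vec-style baseline: average word vectors into a document vector and compare documents by cosine similarity. The 3-dimensional "word vectors" below are hand-made for illustration; a real pipeline would train them with word2vec or fastText on the company's text.

```python
import math

# Hand-made toy "word vectors" (assumption for illustration only).
vectors = {
    "shipping": [0.9, 0.1, 0.0], "delivery": [0.8, 0.2, 0.1],
    "invoice":  [0.1, 0.9, 0.0], "payment":  [0.0, 0.8, 0.2],
}

def doc_vector(tokens):
    """Baseline document embedding: average the known word vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

ticket1 = doc_vector(["shipping", "delivery"])  # logistics complaint
ticket2 = doc_vector(["invoice", "payment"])    # billing question
ticket3 = doc_vector(["delivery"])              # another logistics ticket
print(cosine(ticket1, ticket3) > cosine(ticket1, ticket2))  # → True
```

Even this crude averaging groups the two logistics tickets together, which is the building block behind search, routing, and deduplication applications on support text.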
Bio: Alessandro is a data veteran. He holds two Master’s degrees in computer engineering, one from Politecnico di Milano and the other from University of Illinois at Chicago (UIC).
He started his career in data consultancy, where he mastered Apache Spark for Machine Learning projects and subsequently joined WW Grainger, one of the largest MRO e-commerce companies in the United States. In September 2017, after more than 5 years in the USA, Alessandro returned to his native country, Italy, where he is now leading a team of data scientists. His current work focuses on achieving energy efficiency through the automation of energy management processes for commercial customers.
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply - Data Science Milan
This document contains summaries of three projects related to pricing optimization:
1) Optimal discount strategy for products in close-out phase to balance margin loss and inventory costs. The solution involved sales forecasting, price elasticity modeling, and discount optimization.
2) Online pricing optimization using contextual multi-armed bandit algorithms to maximize ticket revenues. The solution used algorithms like UCB1 and ORAT.
3) Renewal price optimization for subscription products by developing elasticity curves and using simplex optimization to determine optimal prices given business objectives and constraints.
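The UCB1 algorithm named in the online-pricing project can be sketched in a few lines; the arms' conversion rates below are hypothetical stand-ins for candidate prices:

```python
import math
import random

random.seed(7)

def ucb1(true_rates, rounds=5000):
    """UCB1 bandit: pick the arm maximizing mean + sqrt(2 ln t / n)."""
    n_arms = len(true_rates)
    counts = [0] * n_arms
    rewards = [0.0] * n_arms
    total = 0.0
    for t in range(1, rounds + 1):
        if t <= n_arms:                 # play each arm once to initialize
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: rewards[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
        total += reward
    return counts, total

# Three hypothetical prices with unknown conversion rates:
counts, total = ucb1([0.05, 0.30, 0.15])
print(counts.index(max(counts)))  # → 1  (the best arm gets pulled most)
```

The exploration bonus shrinks as an arm is sampled, so the algorithm gradually concentrates ticket offers on the best-performing price while still revisiting the others occasionally.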
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig... - Data Science Milan
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrigoni, Senior Data Scientist, Pirelli (pirelli.com)
Abstract:
Pirelli, a global performance tire manufacturer, uses data science in its 20 factories to improve quality and efficiency, and reduce energy consumption. For this “Smart Manufacturing” initiative, Pirelli’s data science team has developed predictive models and analytics tools to monitor processes, machines and materials on the factory floors. In this talk we will show some of the solutions we deploy, demonstrate how we used Domino’s data science platform and Plot.ly to build these solutions, and discuss the next steps in this journey towards predictive maintenance.
Bio:
Alberto Arrigoni is a data scientist at Pirelli, where he works to process sensors and telemetry data for IoT, Smart Factories and connected-vehicle applications.
He works closely with all major business units such as R&D, industrial engineering and BI to develop tailored machine learning algorithms and production systems.
He holds a PhD in biostatistics from the University of Milan Bicocca and prior to joining Pirelli was a staff data scientist at the National Institute of Molecular Genetics (Milan), as well as a Fulbright student at the Santa Clara University and visiting PhD student at Pacific Biosciences (Menlo Park, CA).
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users want to take full advantage of their devices' features, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher overall coverage. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
These are the slides of a talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
Infrastructure Challenges in Scaling RAG with Custom AI models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it offers you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also practices that can lead to unnecessary spending, such as using a person document instead of a mail-in database for shared mailboxes. We will show you such cases and their solutions. And of course, we will explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder will introduce you to this new world. It will give you the tools and know-how to stay on top of things. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
1. Introduction to
Machine Learning with H2O
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous
Data Science Milan
Politecnico di Milano
10th October, 2016
2. About Me: Civil Engineer → Data Scientist
• 2005 - 2015
• Water Engineer
o Consultant for Utilities
o Industrial PhD
• Water Engineering +
Machine Learning
• Discovered H2O in 2014!
• 2015 - Present
• Data Scientist
o Virgin Media (UK)
o Domino Data Lab (US)
o H2O.ai (US)
2
Why? Long story – see bit.ly/joe_h2o_talk2
3. Agenda
• First Talk (25 mins)
o About H2O.ai
o Demo
• A Simple Classification Task
• H2O’s Web Interface
o Why H2O?
• Our Community
• Our Customers
o What’s Next?
• New H2O Features
• Second Talk (25 mins)
o H2O for IoT
• Predictive Maintenance
• Anomaly Detection
• H2O’s R Interface
• Third Talk (25 mins)
o Deep Water
o Demo
• H2O + mxnet on GPU
• H2O’s Python Interface
3
5. About H2O.ai
• H2O.ai, the Company
o Team: 80 (70 shown)
o Founded in 2012
o HQ: Mountain View, California
• H2O, the Platform
o Open Source (Apache 2.0)
o Algorithms written in Java
• Fast, distributed and scalable
o Multiple interfaces to suit different users
• Web, R, Python, Java, Scala, REST/JSON
o Works with desktop/laptop, cloud, Spark
and Hadoop
Joe
11. A Typical Machine Learning Task
• Demo
o Dataset – MNIST
• LeCun et al. (1999)
• Hand-written Digits
o Import & Explore Data
o Build & Evaluate Models
o Make Predictions
11
Photo credit: http://www.opendeep.org/v0.0.5/docs/tutorial-classifying-handwritten-mnist-images
12. MNIST Hand-Written Digits
• 784 Inputs
o 28 x 28 = 784 pixels
• 1 Output
o 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9
o Classification
• Files
o Train (60k Records)
o Test (10k)
• Links
o https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz
o https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz
12
Photo credit: https://ml4a.github.io/ml4a/neural_networks/
13. H2O Flow (Web Interface) Demo
• Download and unzip jar
from www.h2o.ai
• In terminal:
o java -jar h2o.jar
• Web browser:
o localhost:54321
13
17. More Advanced Topics
• Advanced Features
o Hyperparameters Tuning
o Model Stacking
o Saving/Loading Models
o Export Plain Old Java
Object (POJO)
• Key Resources
o docs.h2o.ai
• Joe’s Previous H2O Talks
o bit.ly/joe_h2o_talk3
o bit.ly/h2o_budapest_1
o bit.ly/h2o_paris_1
17
20. Szilard Pafka – Chief Data Scientist at Epoch
• Szilard’s talks / blog
posts about H2O:
o ML Benchmark
o Intro to ML with H2O
o H2O Scoring
o Tweets
20
29. H2O in Action
29
Thank you
Data Science Milan – May 19, 2016
Bringing Deep Learning into production - Paolo Platter, AgileLab
http://www.slideshare.net/ds_mi/bringing-deep-learning-into-production-paolo-platter-agilelab
31. H2O is Evolving
• H2O Open Tour NYC
YouTube Playlist
o Advanced data munging
o Visual ML
o Deep Water (3rd talk)
o Sparkling Water
• PySparkling & RSparkling
o Steam
31
Next time?
33. End of First Talk – Thanks!
33
• Data Science Milan
• Gianmario Spacagna
• Politecnico di Milano
• Resources
o bit.ly/h2o_milan_1
o www.h2o.ai
o docs.h2o.ai
• Contact
o joe@h2o.ai
o @matlabulous
o github.com/woobe
41. 41
Users have full access to all available parameters
to fine-tune the model training process
For example, I am using
rectifier with dropout as the activation
to train the model for 20 epochs
with class balancing
Leaving other settings as default
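The same settings shown here in Flow can also be expressed through H2O's Python interface. A minimal sketch, assuming H2O 3.x: `activation`, `epochs`, and `balance_classes` are real `H2ODeepLearningEstimator` arguments, while the dataset handling below is a placeholder.

```python
# Settings from the Flow demo, as keyword arguments for H2O's Python API.
params = {
    "activation": "RectifierWithDropout",  # rectifier with dropout
    "epochs": 20,                          # train for 20 epochs
    "balance_classes": True,               # class balancing
}

# Usage sketch (requires the h2o package and a running local cluster):
#   import h2o
#   from h2o.estimators import H2ODeepLearningEstimator
#   h2o.init()
#   train = h2o.import_file("train.csv.gz")
#   train[-1] = train[-1].asfactor()   # response column as categorical
#   model = H2ODeepLearningEstimator(**params)
#   model.train(x=train.columns[:-1], y=train.columns[-1],
#               training_frame=train)

print(sorted(params))
```

All other estimator parameters keep their defaults, matching the "leaving other settings as default" step in the Flow demo.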
42. 42
Training the model with estimated remaining time
– users can stop the process early if they want to