H2O intro at Dallas Meetup

DALLAS MEETUP
JORGE LUIS HERNANDEZ VILLAPOL

INTRO – BIO
• Jorge Luis Hernandez Villapol
• Engineering Intern at H2O.ai
• Graduate student at UNT, Master's in Electrical Engineering
• Background: Electronics Engineer
• Jorge@h2o.ai

AGENDA
• Intro - Bio
• Data Scientist Checklist
• H2O Intro – Products
• H2O Workflow
• Demo
• Where to go next?
• Q&A

CHECKLIST – BE PASSIONATE
• 10,000 Hours Rule - Malcolm Gladwell
• Marios Michailidis aka KazAnova
– Kaggle Grand Master – Top 3
– Senior Data Scientist at H2O.ai

CHECKLIST – BE EAGER TO LEARN
• New models
• New frameworks
• New technology
• New and old approaches

CHECKLIST – KEEP YOUR FUNDAMENTALS IN CHECK
• Statistics Fundamentals
– Mean, Median, Variance
– Random Variables, PDF
– Central Limit Theorem, i.i.d. samples
• Error Metrics
– RMSE, MSE, AUC
• Accuracy vs Precision
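
As a quick refresher on the items above, the following is a minimal sketch using only Python's standard library. The sample values and the helper names (`mse`, `rmse`, `auc`) are illustrative assumptions and are not taken from the talk.

```python
import math
import statistics


def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE, in the target's units."""
    return math.sqrt(mse(actual, predicted))


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise version is quadratic in the
    sample size, so it is only meant for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up numbers, for illustration only.
values = [2.0, 4.0, 4.0, 5.0, 10.0]
print("mean:", statistics.mean(values))
print("median:", statistics.median(values))
print("population variance:", statistics.pvariance(values))

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
print("MSE:", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", auc(labels, scores))
```

RMSE is often easier to communicate than MSE because it is expressed in the same units as the target, while AUC only depends on how the scores rank positives against negatives.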

CHECKLIST – THE DATA SCIENTIST CYCLE
• Question & Hypothesis
• Data Mining
• Modeling
• Evaluation
• Present & Document
• Deploy
• Idea

CHECKLIST – LEARN TO CODE
• Python, R, Scala, Julia, Java, …
• Basic Level:
– Basic Understanding,
– Basic data operations
• Expert Level:
– Performance
– Code Readability

CHECKLIST – HAVE A TOOLBOX … OF SOLUTIONS
• Whenever you get a new problem
– Have I done this before?
– Have I done something similar before?
– Can I reuse/adapt something I have done before?
• Whenever you get a new solution
– Document
– Present
– Save

CHECKLIST – KEEP YOUR TOOLBOX UPDATED AND GROWING
• Do your own benchmark between your tools.
• Keep an eye out for updates (FYI H2O makes minor releases every 2 weeks)
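
One possible way to benchmark your own tools is a small timing harness like the sketch below. The two candidate functions are hypothetical stand-ins for whatever tools you want to compare; only `timeit` from the standard library is used.

```python
import timeit


def sum_with_loop(n):
    """Candidate A (hypothetical): accumulate squares in an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_with_builtin(n):
    """Candidate B (hypothetical): same computation with sum() and a generator."""
    return sum(i * i for i in range(n))


def benchmark(candidates, n=10_000, repeats=5, number=20):
    """Return the best wall-clock time in seconds for each candidate."""
    results = {}
    for name, func in candidates.items():
        timings = timeit.repeat(lambda: func(n), repeat=repeats, number=number)
        results[name] = min(timings)  # the minimum is the least noisy estimate
    return results


candidates = {"loop": sum_with_loop, "builtin": sum_with_builtin}

# Check that both candidates agree before comparing their speed.
assert sum_with_loop(1000) == sum_with_builtin(1000)

results = benchmark(candidates)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```

Timings depend on the machine and its load, so it is worth rerunning the comparison after each tool update.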

CHECKLIST – SEPARATE YOUR DATA
• Overfitting is public enemy #1
• A good rule of thumb is to have training, validation, and test sets.
• Be careful with the split! No leakage to your test set!
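
The three-way split described above can be sketched as follows. The 70/15/15 proportions, the function name `split_dataset`, and the plain random shuffle are assumptions for illustration; time-ordered or grouped data usually needs a different splitting strategy to avoid leakage.

```python
import random


def split_dataset(rows, train_frac=0.7, valid_frac=0.15, seed=42):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    if not 0 < train_frac + valid_frac < 1:
        raise ValueError("train_frac + valid_frac must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_end = int(len(shuffled) * train_frac)
    valid_end = train_end + int(len(shuffled) * valid_frac)
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


rows = list(range(100))  # stand-in for real records
train, valid, test = split_dataset(rows)

# No row may appear in more than one set.
assert not set(train) & set(valid)
assert not set(train) & set(test)
assert not set(valid) & set(test)

print(len(train), len(valid), len(test))
```

Because each row lands in exactly one slice of the shuffled list, the test set stays unseen as long as no preprocessing step is fitted on the full dataset beforehand.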

CHECKLIST – ONE ENSEMBLE TO RULE THEM ALL OR SIMPLER IS BETTER?
• Start with a simple model as your baseline.
• Grow in complexity until satisfied.
• Ensembles and stacking help against overfitting.
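
To illustrate the baseline-first idea, here is a toy sketch comparing a "predict the mean" baseline, three noisy models, and their simple average. Everything in it is synthetic and assumed for illustration: the "models" are just the true values plus independent noise, the baseline mean is computed on the same data for brevity, and plain averaging stands in for an ensemble (it is not stacking).

```python
import random


def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


rng = random.Random(0)

# Synthetic targets and three hypothetical models with independent errors.
actual = [rng.uniform(0, 10) for _ in range(200)]
model_predictions = [
    [y + rng.gauss(0, 1.0) for y in actual] for _ in range(3)
]

# Baseline: always predict the mean of the targets.
baseline_value = sum(actual) / len(actual)
baseline_predictions = [baseline_value] * len(actual)

# Ensemble: average the three models' predictions row by row.
ensemble_predictions = [
    sum(preds) / len(preds) for preds in zip(*model_predictions)
]

baseline_error = mse(actual, baseline_predictions)
individual_errors = [mse(actual, preds) for preds in model_predictions]
ensemble_error = mse(actual, ensemble_predictions)

print("baseline MSE:", round(baseline_error, 3))
print("individual MSEs:", [round(e, 3) for e in individual_errors])
print("averaged ensemble MSE:", round(ensemble_error, 3))
```

For squared error, the averaged prediction is never worse than the average of the individual errors, which is one reason averaging is a common first ensemble to try once a baseline exists.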

WHAT IS H2O?
H2O is an open source, in-memory, distributed, fast, and scalable machine
learning and predictive analytics platform that allows you to build machine
learning models on big data and provides easy productionalization of those
models in an enterprise environment.

ALGORITHMS ON H2O
Supervised Learning
• Statistical Analysis
– Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson, and Tweedie
– Naïve Bayes
• Ensembles
– Distributed Random Forest: Classification or regression models
– Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
• Deep Neural Networks
– Deep Learning: Creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
• Clustering
– K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detects the optimal k
• Dimensionality Reduction
– Principal Component Analysis: Linearly transforms correlated variables to independent components
– Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Anomaly Detection
– Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning
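
As a rough idea of how one of these algorithms might be used from Python, the sketch below trains a Gradient Boosting Machine with the `h2o` package. It is not the demo from the talk: the function name `train_gbm_sketch`, the split ratios, and the hyperparameter values are assumptions, and the code presumes a binary classification target, the `h2o` package, and a Java runtime.

```python
def train_gbm_sketch(csv_path, target, ntrees=50, max_depth=5, seed=42):
    """Train an H2O GBM classifier and return (model, AUC on a held-out test split).

    The h2o imports are done lazily so that defining this function has no
    side effects and does not require H2O to be installed.
    """
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()  # start or connect to a local H2O cluster

    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # treat the target as categorical
    predictors = [name for name in frame.columns if name != target]

    # Roughly 70% train, 15% validation, and the remainder as test.
    train, valid, test = frame.split_frame(ratios=[0.7, 0.15], seed=seed)

    model = H2OGradientBoostingEstimator(
        ntrees=ntrees, max_depth=max_depth, seed=seed
    )
    model.train(
        x=predictors, y=target, training_frame=train, validation_frame=valid
    )

    # AUC is only defined here for binary classification targets.
    performance = model.model_performance(test_data=test)
    return model, performance.auc()
```

A call might look like `model, test_auc = train_gbm_sketch("my_data.csv", "label")`, where the file name and column name are placeholders for your own data.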

WHERE TO GO NEXT?
• Download and test for yourself
– https://www.h2o.ai/
• Docs
– http://docs.h2o.ai/h2o/latest-stable/index.html
• Video Tutorials
– https://www.youtube.com/user/0xdata
