Introduction to Generalised Low-Rank
Model and Missing Values
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous
Based on work by Anqi Fu, Madeleine Udell, Corinne Horn,
Reza Zadeh & Stephen Boyd.
About H2O.ai
• H2O is an open-source, distributed
machine learning library written in
Java with APIs in R, Python, Scala and
REST/JSON.
• Produced by H2O.ai in Mountain
View, CA.
• H2O.ai advisers are Trevor Hastie,
Rob Tibshirani and Stephen Boyd
from Stanford.
About Me
• 2005 - 2015
• Water Engineer
o Consultant for Utilities
o EngD Research
• 2015 - Present
• Data Scientist
o Virgin Media
o Domino Data Lab
o H2O.ai
About This Talk
• Overview of generalised low-rank model (GLRM).
• Four application examples:
o Basics.
o How to accelerate machine learning.
o How to visualise clusters.
o How to impute missing values.
• Q & A.
GLRM Overview
• GLRM is an extension of well-known matrix factorisation methods such as Principal
Component Analysis (PCA).
• Unlike PCA, which is limited to numerical data, GLRM can also handle categorical,
ordinal and Boolean data.
• Given: Data table A with m rows and n columns
• Find: Compressed representation as numeric tables X (m rows, k columns) and Y (k rows,
n columns), where k is a small user-specified number
• Y = archetypal features created from columns of A
• X = rows of A in reduced feature space
• GLRM can approximately reconstruct A from the product XY
[Figure: A ≈ XY (Memory Reduction / Saving)]
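The factorisation described above can be sketched for the simplest case only: purely numeric data with squared-error loss and no regularisation, where the best rank-k pair X, Y comes from a truncated SVD. This is an illustrative NumPy sketch, not the H2O implementation; the function name `low_rank_factors` and the random data are assumptions made for the example.

```python
import numpy as np

def low_rank_factors(A, k):
    """Return X (m x k) and Y (k x n) so that X @ Y is the best rank-k
    approximation of A under squared error (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :k] * s[:k]   # rows of A in the reduced feature space
    Y = Vt[:k, :]          # k archetypal features over the n columns
    return X, Y

rng = np.random.default_rng(0)
# Hypothetical data: 32 rows, 11 columns, built from 3 hidden factors plus noise.
A = rng.normal(size=(32, 3)) @ rng.normal(size=(3, 11))
A += 0.01 * rng.normal(size=A.shape)

X, Y = low_rank_factors(A, k=3)
relative_error = np.linalg.norm(A - X @ Y) / np.linalg.norm(A)
print(X.shape, Y.shape, relative_error)
```

Categorical, ordinal and Boolean columns need other loss functions, which this sketch does not cover.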
GLRM Key Features
• Memory
o Compressing large data set with minimal loss in accuracy
• Speed
o Reduced dimensionality = short model training time
• Feature Engineering
o Condensed features can be analysed visually
• Missing Data Imputation
o Reconstructing data set will automatically impute missing values
GLRM Technical References
• Paper
o arxiv.org/abs/1410.0342
• Other Resources
o H2O World Video
o Tutorials
Example 1: Motor Trend Car Road Tests
• “mtcars” dataset in R
• Original data table A: m = 32 rows, n = 11 columns
Example 1: Training a GLRM
• Check convergence
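The training code shown on the original slide is not reproduced in this text. As a rough illustration of what checking convergence means, the sketch below fits X and Y by alternating least squares on squared-error loss and records the objective after every iteration. The function `fit_als`, the iteration count and the random data are assumptions for the example, not H2O's solver or API.

```python
import numpy as np

def fit_als(A, k, iterations=20, seed=0):
    """Alternating least squares for min ||A - X @ Y||_F^2.
    Returns X, Y and the objective value after each iteration."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Y = rng.normal(size=(k, n))
    history = []
    for _ in range(iterations):
        # Fix Y and solve for X:  X @ Y ~ A  is the same as  Y.T @ X.T ~ A.T
        X = np.linalg.lstsq(Y.T, A.T, rcond=None)[0].T
        # Fix X and solve for Y:  X @ Y ~ A
        Y = np.linalg.lstsq(X, A, rcond=None)[0]
        history.append(float(np.sum((A - X @ Y) ** 2)))
    return X, Y, history

rng = np.random.default_rng(1)
# Hypothetical 32 x 11 table with roughly rank-3 structure.
A = rng.normal(size=(32, 3)) @ rng.normal(size=(3, 11))
A += 0.1 * rng.normal(size=A.shape)

X, Y, history = fit_als(A, k=3)
print(history[0], history[-1])
```

Because each half-step solves its least-squares problem exactly, the recorded objective should never go up; a curve that has flattened out suggests the fit has converged.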
Example 1: X and Y from GLRM
• X: 32 rows × 3 columns
• Y: 3 rows × 11 columns
Example 1: Summary
• A ≈ XY
• Memory Reduction / Saving
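The memory saving for this example can be worked out from the table sizes alone. The short calculation below counts cells only and assumes every cell costs the same to store; the variable names are illustrative.

```python
m, n, k = 32, 11, 3                  # sizes of A, X and Y in Example 1
cells_original = m * n               # entries in A
cells_compressed = m * k + k * n     # entries in X plus entries in Y
saving = 1 - cells_compressed / cells_original
print(cells_original, cells_compressed, round(saving, 3))
```

Here A has 352 cells while X and Y together have 129, a reduction of roughly 63%. The saving grows as m and n get large relative to k.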
Example 2: ML Acceleration
• About the dataset
o R package “mlbench”
o Multi-spectral scanner image
data
o 6k samples
o x1 to x36: predictors
o Classes:
• 6 levels
• Different types of soil
o Use GLRM to compress
predictors
Example 2: Use GLRM to Speed Up ML
• k = 6
• Reduce to 6 features
Example 2: Random Forest
• Train a vanilla H2O
Random Forest model
with …
o Full data set (36
predictors)
o Compressed data set (6
predictors)
Example 2: Results Comparison
Data | Time | Log Loss (10-fold Cross Validation) | Accuracy (10-fold Cross Validation)
Raw data (36 predictors) | 4 mins 26 sec | 0.24553 | 91.80%
Data compressed with GLRM (6 predictors) | 1 min 24 sec | 0.25792 | 90.59%
• Benefits of GLRM
o Shorter training time
o Quick insight before running models on full data set
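The trade-off in the results above can be put into numbers with a few lines of arithmetic. The figures are taken directly from the comparison; only the variable names are made up for this sketch.

```python
raw_seconds = 4 * 60 + 26            # 4 mins 26 sec on the 36 raw predictors
glrm_seconds = 1 * 60 + 24           # 1 min 24 sec on the 6 GLRM features
speedup = raw_seconds / glrm_seconds

accuracy_drop = 91.80 - 90.59        # percentage points
log_loss_increase = 0.25792 - 0.24553

print(round(speedup, 2), round(accuracy_drop, 2), round(log_loss_increase, 5))
```

Training on the compressed data is a little over three times faster, at the cost of about 1.2 percentage points of cross-validated accuracy.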
Example 3: Clusters Visualisation
• About the dataset
o Multi-spectral scanner
image data
o Same as Example 2
o x1 to x36: predictors
o Use GLRM to compress
predictors to 2D
representation
o Use 6 classes to colour
clusters
Example 3: Clusters Visualisation
Example 4: Imputation
• “mtcars”: same dataset as Example 1
• Randomly introduce 50% missing values
Example 4: GLRM with NAs
When we reconstruct the table using GLRM,
missing values are automatically imputed.
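One way to see why reconstruction imputes missing values: X and Y are fitted using only the observed cells, and the product XY then has a value in every cell. The sketch below does this with alternating least squares on squared-error loss. The function `impute_low_rank`, the small ridge term (added here for numerical stability), the choice k = 2 and the synthetic data are all assumptions for illustration, not the settings used in the talk.

```python
import numpy as np

def impute_low_rank(A, k, iterations=50, ridge=0.1, seed=0):
    """Fit X (m x k) and Y (k x n) on the observed cells of A only, then
    fill each NaN in A with the matching cell of X @ Y."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(A)
    m, n = A.shape
    X = rng.normal(size=(m, k))
    Y = rng.normal(size=(k, n))
    penalty = ridge * np.eye(k)   # assumed ridge term keeps each solve well-posed
    for _ in range(iterations):
        for i in range(m):        # update row i of X from its observed cells
            cols = observed[i]
            Yo = Y[:, cols]
            X[i] = np.linalg.solve(Yo @ Yo.T + penalty, Yo @ A[i, cols])
        for j in range(n):        # update column j of Y from its observed cells
            rows = observed[:, j]
            Xo = X[rows]
            Y[:, j] = np.linalg.solve(Xo.T @ Xo + penalty, Xo.T @ A[rows, j])
    return np.where(observed, A, X @ Y)

rng = np.random.default_rng(1)
# Hypothetical complete table with rank-2 structure, same shape as mtcars.
full = rng.normal(size=(32, 2)) @ rng.normal(size=(2, 11))
with_gaps = full.copy()
with_gaps[rng.random(full.shape) < 0.5] = np.nan   # about 50% missing

imputed = impute_low_rank(with_gaps, k=2)
abs_diff = np.abs(full - imputed)                  # zero wherever a cell was observed
print(np.nanmax(abs_diff))
```

The absolute difference between the original and imputed tables gives a per-cell view of imputation error.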
Example 4: Results Comparison
• We are asking GLRM to
do a difficult job
o 50% missing values
o Imputation results look
reasonable
Absolute difference between original and
imputed values.
Conclusions
• Use GLRM to
o Save memory
o Speed up machine learning
o Visualise clusters
o Impute missing values
• A great tool for data pre-processing
o Include it in your data pipeline
Any Questions?
• Contact
o joe@h2o.ai
o @matlabulous
o github.com/woobe
• Slides & Code
o github.com/h2oai/h2o-meetups
• H2O in London
o Meetups / Office (soon)
o www.h2o.ai/careers
• H2O Help Docs &
Tutorials
o www.h2o.ai/docs
o university.h2o.ai

More Related Content

What's hot

mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0
Yury Kashnitsky
 
Big data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big DataBig data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big Data
Christos Hadjinikolis
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
Doug Needham
 
Josh Wills, MLconf 2013
Josh Wills, MLconf 2013Josh Wills, MLconf 2013
Josh Wills, MLconf 2013
MLconf
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
Paco Nathan
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
Doug Needham
 
OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
Joaquin Vanschoren
 

What's hot (7)

mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0
 
Big data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big DataBig data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big Data
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Josh Wills, MLconf 2013
Josh Wills, MLconf 2013Josh Wills, MLconf 2013
Josh Wills, MLconf 2013
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
 
OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
 

Viewers also liked

Designing Sustainable Drainage Systems
Designing Sustainable Drainage SystemsDesigning Sustainable Drainage Systems
Designing Sustainable Drainage Systems
Jo-fai Chow
 
H2O Machine Learning Use Cases
H2O Machine Learning Use CasesH2O Machine Learning Use Cases
H2O Machine Learning Use Cases
Jo-fai Chow
 
NEXT Mobile Trends 2011
NEXT Mobile Trends 2011NEXT Mobile Trends 2011
NEXT Mobile Trends 2011
BathenJelden
 
Project "Deep Water"
Project "Deep Water"Project "Deep Water"
Project "Deep Water"
Jo-fai Chow
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
Jo-fai Chow
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
Jo-fai Chow
 
Hoja para taller
Hoja para taller Hoja para taller
Hoja para taller
Roberto -Buenaire Manzano
 

Viewers also liked (7)

Designing Sustainable Drainage Systems
Designing Sustainable Drainage SystemsDesigning Sustainable Drainage Systems
Designing Sustainable Drainage Systems
 
H2O Machine Learning Use Cases
H2O Machine Learning Use CasesH2O Machine Learning Use Cases
H2O Machine Learning Use Cases
 
NEXT Mobile Trends 2011
NEXT Mobile Trends 2011NEXT Mobile Trends 2011
NEXT Mobile Trends 2011
 
Project "Deep Water"
Project "Deep Water"Project "Deep Water"
Project "Deep Water"
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 
Hoja para taller
Hoja para taller Hoja para taller
Hoja para taller
 

Similar to Introduction to Generalised Low-Rank Model and Missing Values

Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
Conor B. Murphy
 
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL DatabasesA Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
Luiz Henrique Zambom Santana
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
Jinseob Kim
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
 
Machine Learning in H2O
Machine Learning in H2OMachine Learning in H2O
Machine Learning in H2O
Aakash Gupta
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
University of Washington
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
Josef Hardi
 
polystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfpolystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdf
Rim Moussa
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
Eran Shlomo
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
MOVING Project
 
0629venmoplus
0629venmoplus0629venmoplus
0629venmoplus
Qingpeng "Q.P." Zhang
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
Sawood Alam
 
Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...
Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...
Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...
Unicon, Inc.
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Demi Ben-Ari
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
BigML, Inc
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Alexandru Iosup
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
Mohit Garg
 
Srikanta Mishra
Srikanta MishraSrikanta Mishra
Regression and Classification: An Artificial Neural Network Approach
Regression and Classification: An Artificial Neural Network ApproachRegression and Classification: An Artificial Neural Network Approach
Regression and Classification: An Artificial Neural Network Approach
Khulna University
 

Similar to Introduction to Generalised Low-Rank Model and Missing Values (20)

Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL DatabasesA Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
 
Machine Learning in H2O
Machine Learning in H2OMachine Learning in H2O
Machine Learning in H2O
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 
polystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfpolystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdf
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
 
0629venmoplus
0629venmoplus0629venmoplus
0629venmoplus
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 
Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...
Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...
Apereo Webinar: Learning What Works When Scaling Analytics Infrastructure (Ja...
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Srikanta Mishra
Srikanta MishraSrikanta Mishra
Srikanta Mishra
 
Regression and Classification: An Artificial Neural Network Approach
Regression and Classification: An Artificial Neural Network ApproachRegression and Classification: An Artificial Neural Network Approach
Regression and Classification: An Artificial Neural Network Approach
 

More from Jo-fai Chow

Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and ShinyMaking Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Jo-fai Chow
 
Automatic and Interpretable Machine Learning in R with H2O and LIME
Automatic and Interpretable Machine Learning in R with H2O and LIMEAutomatic and Interpretable Machine Learning in R with H2O and LIME
Automatic and Interpretable Machine Learning in R with H2O and LIME
Jo-fai Chow
 
Automatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIMEAutomatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIME
Jo-fai Chow
 
H2O at Berlin R Meetup
H2O at Berlin R MeetupH2O at Berlin R Meetup
H2O at Berlin R Meetup
Jo-fai Chow
 
H2O at BelgradeR Meetup
H2O at BelgradeR MeetupH2O at BelgradeR Meetup
H2O at BelgradeR Meetup
Jo-fai Chow
 
Introduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use CasesIntroduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use Cases
Jo-fai Chow
 
H2O at Poznan R Meetup
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R Meetup
Jo-fai Chow
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
Jo-fai Chow
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
Jo-fai Chow
 
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data GeekFrom Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
Jo-fai Chow
 
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
Jo-fai Chow
 
Developing a New Decision Support System for SuDS
Developing a New Decision Support System for SuDSDeveloping a New Decision Support System for SuDS
Developing a New Decision Support System for SuDS
Jo-fai Chow
 
Udacity Statement (Introduction to Statistics, August 2012)
Udacity Statement (Introduction to Statistics, August 2012)Udacity Statement (Introduction to Statistics, August 2012)
Udacity Statement (Introduction to Statistics, August 2012)Jo-fai Chow
 
Coursera Statement (Computational Investing, Part I,
Coursera Statement (Computational Investing, Part I, Coursera Statement (Computational Investing, Part I,
Coursera Statement (Computational Investing, Part I, Jo-fai Chow
 
Coursera Statement (Computing for Data Analysis, Oct 2013)
Coursera Statement (Computing for Data Analysis, Oct 2013)Coursera Statement (Computing for Data Analysis, Oct 2013)
Coursera Statement (Computing for Data Analysis, Oct 2013)Jo-fai Chow
 
Coursera Statement (Data Analysis, Mar 2013)
Coursera Statement (Data Analysis, Mar 2013)Coursera Statement (Data Analysis, Mar 2013)
Coursera Statement (Data Analysis, Mar 2013)Jo-fai Chow
 
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
Jo-fai Chow
 

More from Jo-fai Chow (17)

Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and ShinyMaking Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
 
Automatic and Interpretable Machine Learning in R with H2O and LIME
Automatic and Interpretable Machine Learning in R with H2O and LIMEAutomatic and Interpretable Machine Learning in R with H2O and LIME
Automatic and Interpretable Machine Learning in R with H2O and LIME
 
Automatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIMEAutomatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIME
 
H2O at Berlin R Meetup
H2O at Berlin R MeetupH2O at Berlin R Meetup
H2O at Berlin R Meetup
 
H2O at BelgradeR Meetup
H2O at BelgradeR MeetupH2O at BelgradeR Meetup
H2O at BelgradeR Meetup
 
Introduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use CasesIntroduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use Cases
 
H2O at Poznan R Meetup
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R Meetup
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data GeekFrom Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
 
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
 
Developing a New Decision Support System for SuDS
Developing a New Decision Support System for SuDSDeveloping a New Decision Support System for SuDS
Developing a New Decision Support System for SuDS
 
Udacity Statement (Introduction to Statistics, August 2012)
Udacity Statement (Introduction to Statistics, August 2012)Udacity Statement (Introduction to Statistics, August 2012)
Udacity Statement (Introduction to Statistics, August 2012)
 
Coursera Statement (Computational Investing, Part I,
Coursera Statement (Computational Investing, Part I, Coursera Statement (Computational Investing, Part I,
Coursera Statement (Computational Investing, Part I,
 
Coursera Statement (Computing for Data Analysis, Oct 2013)
Coursera Statement (Computing for Data Analysis, Oct 2013)Coursera Statement (Computing for Data Analysis, Oct 2013)
Coursera Statement (Computing for Data Analysis, Oct 2013)
 
Coursera Statement (Data Analysis, Mar 2013)
Coursera Statement (Data Analysis, Mar 2013)Coursera Statement (Data Analysis, Mar 2013)
Coursera Statement (Data Analysis, Mar 2013)
 
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
 

Recently uploaded

一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 

Recently uploaded (20)

一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 

Introduction to Generalised Low-Rank Model and Missing Values

  • 1. Introduction to Generalised Low-Rank Model and Missing Values Jo-fai (Joe) Chow Data Scientist joe@h2o.ai @matlabulus Based on work by Anqi Fu, Madeleine Udell, Corinne Horn, Reza Zadeh & Stephen Boyd.
  • 2. About H2O.ai
      • H2O is an open-source, distributed machine learning library written in Java with APIs in R, Python, Scala and REST/JSON.
      • Produced by H2O.ai in Mountain View, CA.
      • H2O.ai advisers are Trevor Hastie, Rob Tibshirani and Stephen Boyd from Stanford.
  • 3. About Me
      • 2005 - 2015
      • Water Engineer
          o Consultant for Utilities
          o EngD Research
      • 2015 - Present
      • Data Scientist
          o Virgin Media
          o Domino Data Lab
          o H2O.ai
  • 4. About This Talk
      • Overview of generalised low-rank model (GLRM).
      • Four application examples:
          o Basics.
          o How to accelerate machine learning.
          o How to visualise clusters.
          o How to impute missing values.
      • Q & A.
  • 5. GLRM Overview
      • GLRM is an extension of well-known matrix factorisation methods such as Principal Component Analysis (PCA).
      • Unlike PCA, which is limited to numerical data, GLRM can also handle categorical, ordinal and Boolean data.
      • Given: data table A with m rows and n columns.
      • Find: a compressed representation as numeric tables X (m rows, k columns) and Y (k rows, n columns), where k is a small user-specified number.
      • Y = archetypal features created from columns of A.
      • X = rows of A in reduced feature space.
      • GLRM can approximately reconstruct A from the product XY.
      [Figure: A ≈ XY; storing X and Y in place of A gives a memory reduction / saving]
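For readers who want the overview above in symbols, the fitting problem can be written roughly as follows. This follows the formulation in the paper listed under the talk's technical references (arxiv.org/abs/1410.0342). The symbols Ω (the set of observed cells), L_ij (a loss chosen per cell or column), and r_i and r̃_j (regularisers) do not appear on the slide and are introduced here only for illustration.

```latex
\min_{X,\,Y}\;
  \sum_{(i,j)\in\Omega} L_{ij}\!\left(x_i y_j,\; A_{ij}\right)
  \;+\; \sum_{i=1}^{m} r_i(x_i)
  \;+\; \sum_{j=1}^{n} \tilde{r}_j(y_j),
\qquad X \in \mathbb{R}^{m \times k},\quad Y \in \mathbb{R}^{k \times n}
```

Here x_i is the i-th row of X and y_j is the j-th column of Y. With a quadratic loss, no regularisation and every cell observed, this reduces to PCA. Choosing other losses per column is what lets GLRM handle categorical, ordinal and Boolean data, and summing only over Ω is what makes missing values harmless during fitting.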
  • 6. GLRM Key Features
      • Memory
          o Compressing large data set with minimal loss in accuracy
      • Speed
          o Reduced dimensionality = short model training time
      • Feature Engineering
          o Condensed features can be analysed visually
      • Missing Data Imputation
          o Reconstructing data set will automatically impute missing values
  • 7. GLRM Technical References
      • Paper
          o arxiv.org/abs/1410.0342
      • Other Resources
          o H2O World Video
          o Tutorials
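  • 8. Example 1: Motor Trend Car Road Tests
      [Figure: original data table A, the “mtcars” dataset in R, with m = 32 rows and n = 11 columns]
  • 9. Example 1: Training a GLRM
      [Figure: GLRM training, annotated “Check convergence”]
  • 10. Example 1: X and Y from GLRM
      [Figure: X with 32 rows and 3 columns; Y with 3 rows and 11 columns]
  • 11. Example 1: Summary
      [Figure: A ≈ XY; storing X and Y in place of A gives a memory reduction / saving]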
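The memory claim in Example 1 can be checked with simple counting. The sketch below is not taken from the talk's code; `stored_values` is a hypothetical helper that counts table entries only (not bytes), using the shapes given on the slides (A is 32 x 11, X is 32 x 3, Y is 3 x 11).

```python
def stored_values(m: int, n: int, k: int) -> tuple[int, int]:
    """Return (entries in A, entries in X plus Y) for an m x n table at rank k."""
    full = m * n              # A has m rows and n columns
    factors = m * k + k * n   # X is m x k, Y is k x n
    return full, factors


# Shapes from Example 1: mtcars with m = 32, n = 11, compressed with k = 3.
full, factors = stored_values(m=32, n=11, k=3)
print(full, factors, round(factors / full, 3))  # 352 129 0.366
```

On a table this small the saving is modest. Because the factor count grows as k(m + n) while the table grows as mn, the ratio improves as m and n get large relative to k.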
  • 12. Example 2: ML Acceleration
      • About the dataset
          o R package “mlbench”
          o Multi-spectral scanner image data
          o 6k samples
          o x1 to x36: predictors
          o Classes:
              • 6 levels
              • Different types of soil
          o Use GLRM to compress predictors
  • 13. Example 2: Use GLRM to Speed Up ML
      [Figure: GLRM with k = 6; reduce to 6 features]
  • 14. Example 2: Random Forest
      • Train a vanilla H2O Random Forest model with …
          o Full data set (36 predictors)
          o Compressed data set (6 predictors)
  • 15. Example 2: Results Comparison

                                                                10-fold Cross Validation
      Data                                       Time           Log Loss    Accuracy
      Raw data (36 Predictors)                   4 mins 26 sec  0.24553     91.80%
      Data compressed with GLRM (6 Predictors)   1 min 24 sec   0.25792     90.59%

      • Benefits of GLRM
          o Shorter training time
          o Quick insight before running models on full data set
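To put the Example 2 comparison in relative terms, the arithmetic below uses only the figures reported on the results slide. The helper `to_seconds` and the variable names are hypothetical. The slide does not say whether the reported time for the compressed data includes fitting the GLRM itself, so the ratio is best read as a comparison of the two reported timings rather than an end-to-end speed-up.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair to seconds."""
    return 60 * minutes + seconds


raw_time = to_seconds(4, 26)     # raw data, 36 predictors
glrm_time = to_seconds(1, 24)    # GLRM-compressed data, 6 predictors

speed_up = raw_time / glrm_time        # ratio of the two reported times
accuracy_drop = 91.80 - 90.59          # percentage points
log_loss_rise = 0.25792 - 0.24553

print(f"{speed_up:.2f}x faster")                     # 3.17x faster
print(f"{accuracy_drop:.2f} points lower accuracy")  # 1.21 points lower accuracy
print(f"{log_loss_rise:.5f} higher log loss")        # 0.01239 higher log loss
```

In short, the reported timings differ by roughly a factor of three, at a cost of about 1.2 percentage points of cross-validated accuracy.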
  • 16. Example 3: Clusters Visualisation
      • About the dataset
          o Multi-spectral scanner image data
          o Same as Example 2
          o x1 to x36: predictors
          o Use GLRM to compress predictors to 2D representation
          o Use 6 classes to colour clusters
  • 17. Example 3: Clusters Visualisation
      [Figure: clusters visualisation]
  • 18. Example 4: Imputation
      “mtcars”: the same dataset as in Example 1. Randomly introduce 50% missing values.
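The slide does not say how the missing values were introduced. One plausible reading, sketched here, is to hide a fixed fraction of cells chosen uniformly at random. `mask_cells` is a hypothetical helper, and the table is a placeholder with the same 32 x 11 shape as mtcars rather than the real data.

```python
import random


def mask_cells(table, fraction, seed=0):
    """Return a copy of `table` with `fraction` of its cells replaced by None."""
    rng = random.Random(seed)  # local generator so the result is repeatable
    n_rows, n_cols = len(table), len(table[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    hidden = set(rng.sample(cells, round(fraction * len(cells))))
    return [
        [None if (i, j) in hidden else table[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Placeholder 32 x 11 table standing in for mtcars (values are NOT the real data).
table = [[float(i * 11 + j) for j in range(11)] for i in range(32)]
masked = mask_cells(table, fraction=0.5)
n_missing = sum(cell is None for row in masked for cell in row)
print(n_missing)  # 176
```

Sampling cells without replacement hides exactly half of the 352 entries. Masking each cell independently with probability 0.5 would give only approximately half.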
  • 19. Example 4: GLRM with NAs
      When we reconstruct the table using GLRM, missing values are automatically imputed.
  • 20. Example 4: Results Comparison
      • We are asking GLRM to do a difficult job
          o 50% missing values
          o Imputation results look reasonable
      [Figure: absolute difference between original and imputed values]
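The idea behind Example 4 can be illustrated on a toy table. The sketch below is not the H2O implementation and not the talk's code. It assumes the simplest special case of a low-rank model: rank 1, quadratic loss, no regularisation, fitted by alternating least squares over the observed cells only. `rank1_impute` is a hypothetical helper, and the 3 x 3 table is made up so that it is exactly rank 1.

```python
def rank1_impute(table, iterations=300):
    """Fill None cells using a rank-1 fit x[i] * y[j] to the observed cells."""
    n_rows, n_cols = len(table), len(table[0])
    x = [1.0] * n_rows
    y = [1.0] * n_cols
    for _ in range(iterations):
        # Update each x[i] by least squares over the observed cells in row i.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if table[i][j] is not None]
            x[i] = sum(table[i][j] * y[j] for j in obs) / sum(y[j] ** 2 for j in obs)
        # Update each y[j] by least squares over the observed cells in column j.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if table[i][j] is not None]
            y[j] = sum(table[i][j] * x[i] for i in obs) / sum(x[i] ** 2 for i in obs)
    # Keep observed cells as they are; fill missing ones from the reconstruction.
    return [
        [x[i] * y[j] if table[i][j] is None else table[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy data (made up): an exactly rank-1 table with one cell hidden per row.
original = [[2.0, 4.0, 6.0], [4.0, 8.0, 12.0], [6.0, 12.0, 18.0]]
masked = [[2.0, 4.0, None], [None, 8.0, 12.0], [6.0, None, 18.0]]
imputed = rank1_impute(masked)

# Absolute difference between original and imputed values, as on the slide.
abs_diff = [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(original, imputed)]
worst = max(max(row) for row in abs_diff)
```

Because the toy table is exactly rank 1, the hidden cells are recovered almost exactly. Real data such as mtcars is only approximately low rank and has columns on very different scales, so the differences there are non-zero, which is why the slide describes the results as reasonable rather than exact.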
  • 21. Conclusions
      • Use GLRM to
          o Save memory
          o Speed up machine learning
          o Visualise clusters
          o Impute missing values
      • A great tool for data pre-processing
          o Include it in your data pipeline
  • 22. Any Questions?
      • Contact
          o joe@h2o.ai
          o @matlabulous
          o github.com/woobe
      • Slides & Code
          o github.com/h2oai/h2o-meetups
      • H2O in London
          o Meetups / Office (soon)
          o www.h2o.ai/careers
      • H2O Help Docs & Tutorials
          o www.h2o.ai/docs
          o university.h2o.ai