SlideShare a Scribd company logo
1 of 41
Download to read offline
Machine Learning Specialization
Handling missing data
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization3 ©2015-2016 Emily Fox & Carlos Guestrin
Decision tree review
T(xi) = Traverse decision tree
Loan
Application
Input: xi
ŷi	
Credit?
Safe Term?
Risky Safe
Income?
Term?
Risky Safe
Risky
start
excellent	 poor	
fair	
5 years	3 years	
Low	high	
5 years	3 years
Machine Learning Specialization5
So far: data always completely observed
©2015-2016 Emily Fox & Carlos Guestrin
Credit Term Income y
excellent 3 yrs high safe
fair 5 yrs low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs low safe
poor 3 yrs high risky
poor 5 yrs low safe
fair 3 yrs high safe
Known x and y
values for all
data points
Machine Learning Specialization6 ©2015-2016 Emily Fox & Carlos Guestrin
Loan application
may be
3 or 5 years
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor ? high risky
poor 5 yrs low safe
fair ? high safe
Missing data
Machine Learning Specialization7
Missing values impact training and predictions
©2015-2016 Emily Fox & Carlos Guestrin
1.  Training data: Contains “unknown”
values
2.  Predictions: Input at prediction time
contains “unknown” values
Machine Learning Specialization8 ©2015-2016 Emily Fox & Carlos Guestrin
Start
Credit?
Safe
excellent	
Income?
poor	
Term?
Risky Safe
fair	
5 years	3 years	
Risky
Low	
Term?
Risky Safe
high	
5 years	3 years	
xi = (Credit = poor, Income = ?, Term = 5 years)
Missing values during prediction
What do
we do???
Start
Credit?
Income?
Machine Learning Specialization10 ©2015-2016 Emily Fox & Carlos Guestrin
Training
Data
Feature
extraction
ML
model
Quality
metric
ML algorithm
y
h(x)
T(x)
x ŷ
Missing values
during training
Missing values
during prediction
Machine Learning Specialization
Handling missing data
Strategy 1: Purification by skipping
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization13
Idea 1: Purification by skipping/removing
©2015-2016 Emily Fox & Carlos Guestrin
Training
Data
h(x) is a subset of the data
without any missing values
x
Feature
extraction
h(x)
Remove/skip
missing values
May contain data points
where one (or more)
features are missing
Machine Learning Specialization15
Idea 1: Skip data points
with missing values
©2015-2016 Emily Fox & Carlos Guestrin
N = 9, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs low risky
poor 3 yrs ? safe
fair ? high safe
N = 6, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs low risky
h(x)
x
Skip data
points with
missing values
Machine Learning Specialization16
The challenge with Idea 1
©2015-2016 Emily Fox & Carlos Guestrin
N = 9, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor ? high risky
excellent ? low risky
fair ? high safe
poor 3 yrs low risky
poor ? low safe
fair ? high safe
N = 3, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair 3 yrs high safe
poor 3 yrs low risky
h(x)
x
Skip data
points with
missing values
Warning: More than
50% of the loan
terms are unknown!
Machine Learning Specialization17
Idea 2: Skip features
with missing values
©2015-2016 Emily Fox & Carlos Guestrin
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor ? high risky
excellent ? low risky
fair 5 yrs high safe
poor ? high risky
poor ? low safe
fair ? high safe
N = 9, 3 features
x
N = 9, 2 features
h(x)
Credit Income y
excellent high safe
fair low risky
fair high safe
poor high risky
excellent low risky
fair high safe
poor high risky
poor low safe
fair high safe
Skip features
with many
missing values
Machine Learning Specialization19
Missing value skipping: Ideas 1 & 2
©2015-2016 Emily Fox & Carlos Guestrin
Idea 1: Skip data points where any
feature contains a missing value
-  Make sure only a few data points are
skipped
Idea 2: Skip an entire feature if it’s
missing for many data points
-  Make sure only a few features are
skipped
Machine Learning Specialization20
Missing value skipping: Pros and Cons
©2015-2016 Emily Fox & Carlos Guestrin
Pros
•  Easy to understand and implement
•  Can be applied to any model
(decision trees, logistic regression, linear
regression,…)
Cons
•  Removing data points and features may
remove important information from data
•  Unclear when it’s better to remove
data points versus features
•  Doesn’t help if data is missing at
prediction time
Machine Learning Specialization
Handling missing data
Strategy 2: Prification by imputing
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization23
Main drawback of skipping strategy
©2015-2016 Emily Fox & Carlos Guestrin
N = 9, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs low risky
poor ? low safe
fair ? high safe
N = 6, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs low risky
h(x)
x
Skip data points with
missing values
Data is precious,
don’t throw
it away!
Machine Learning Specialization25
Can we keep all the data?
©2015-2016 Emily Fox & Carlos Guestrin
credit term income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs high risky
poor ? low safe
fair ? high safe
Use other data points
in x to “guess” the “?”
Machine Learning Specialization26
Idea 2: Purification by imputing
©2015-2016 Emily Fox & Carlos Guestrin
Training
Data
Same number of data
points as the original data
x
Feature
extraction
h(x)
Impute/
Substitute
missing values
Machine Learning Specialization28
Idea 2: Imputation/Substitution
©2015-2016 Emily Fox & Carlos Guestrin
N = 9, 3 features
Fill in each missing
value with a
calculated guess
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs high risky
poor ? low safe
fair ? high safe
Credit Term Income y
excellent 3 yrs high safe
fair 3 yrs low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs high risky
poor 3 yrs low safe
fair 3 yrs high safe
N = 9, 3 features
Machine Learning Specialization29
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs high risky
poor ? low safe
fair ? high safe
Credit Term Income y
excellent 3 yrs high safe
fair 3 yrs low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs high risky
poor 3 yrs low safe
fair 3 yrs high safe
Example: Replace ? with
most common value
©2015-2016 Emily Fox & Carlos Guestrin
# 3 year loans: 4
# 5 year loans: 2
Purification by
imputing
Machine Learning Specialization30
Common (simple) rules for
purification by imputation
©2015-2016 Emily Fox & Carlos Guestrin
Credit Term Income y
excellent 3 yrs high safe
fair ? low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs high safe
poor 3 yrs high risky
poor ? low safe
fair ? high safe
Impute each feature with missing values:
1.  Categorical features use mode: Most
popular value (mode) of non-missing xi
2.  Numerical features use average or median:
Average or median value of non-missing xi
Many advanced methods exist,
e.g., expectation-maximization (EM) algorithm
Machine Learning Specialization32
Missing value imputation: Pros and Cons
©2015-2016 Emily Fox & Carlos Guestrin
Pros
•  Easy to understand and implement
•  Can be applied to any model
(decision trees, logistic regression, linear regression,…)
•  Can be used at prediction time: use same
imputation rules
Cons
•  May result in systematic errors
Example: Feature “age” missing in all banks in
Washington by state law
Machine Learning Specialization
Handling missing data
Strategy 3: Adapt learning algorithm
to be robust to missing values
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization34 ©2015-2016 Emily Fox & Carlos Guestrin
Start
Credit?
Safe
excellent	
Income?
poor	
Term?
Risky Safe
fair	
5 years	3 years	
Risky
Low	
Term?
Risky Safe
high	
5 years	3 years	
xi = (Credit = poor, Income = ?, Term = 5 years)
Missing values during prediction: revisited
What do
we do???
Start
Credit?
Income?
Machine Learning Specialization35 ©2015-2016 Emily Fox & Carlos Guestrin
Start
Credit?
Safe
excellent	
Income?
poor	
Term?
Risky Safe
fair	
5 years	3 years	
Risky
Low	
Term?
Risky Safe
high	
5 years	3 years	
xi = (Credit = poor, Income = ?, Term = 5 years)
Add missing values to the tree definition
Associate missing
values with a branch
Start
Credit?
Income?
or unknown	
Risky
Machine Learning Specialization37 ©2015-2016 Emily Fox & Carlos Guestrin
Start
Credit?
Safe
excellent	
Income?
poor	
Term?
Risky Safe
fair	
5 years	3 years	
Risky
Low	
Term?
Risky Safe
high	
5 years	3 years	
Add missing value choice to every decision node
or unknown	
Every decision node includes choice
of response to missing values
or unknown	
or unknown	
or unknown
Machine Learning Specialization38 ©2015-2016 Emily Fox & Carlos Guestrin
Start
Credit?
Term?
Safe
fair	
5 years	
Prediction with missing values becomes simple
xi = (Credit = ?, Income = high, Term = 5 years)
or unknown	
Safe
excellent	
Income?
poor	
Risky
Low	
Term?
Risky Safe
high	
5 years	3 years	
or unknown	
or unknown	
Risky
3 years	
or unknown
Machine Learning Specialization39 ©2015-2016 Emily Fox & Carlos Guestrin
Start
Credit?
Income?
poor	
Term?
Safe
high	
5 years	
Prediction with missing values becomes simple
xi = (Credit = poor, Income = high, Term = ?)
or unknown	
Safe
excellent	
Term?
Risky Safe
fair	
5 years	3 years	
Risky
Low	
Risky
3 years	
or unknown	
or unknown	
or unknown
Machine Learning Specialization41
Explicitly handling missing data by
learning algorithm: Pros and Cons
©2015-2016 Emily Fox & Carlos Guestrin
Pros
•  Addresses training and prediction time
•  More accurate predictions
Cons
•  Requires modification of learning algorithm
-  Very simple for decision trees
Machine Learning Specialization
Feature split selection
with missing data
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization44
Pick feature split
leading to lowest
classification error
Greedy decision tree learning
©2015-2016 Emily Fox & Carlos Guestrin
•  Step 1: Start with an empty tree
•  Step 2: Select a feature to split data
•  For each split of the tree:
•  Step 3: If nothing more to,
make predictions
•  Step 4: Otherwise, go to Step 2 &
continue (recurse) on this split
Must select feature &
branch for missing values!
Machine Learning Specialization46
Feature split (without missing values)
©2015-2016 Emily Fox & Carlos Guestrin
Root
excellent fair poor
Credit?
Split on the
feature Credit
Data points with
Credit = poor
go here
Machine Learning Specialization47
Feature split (with missing values)
©2015-2016 Emily Fox & Carlos Guestrin
Root
excellent fair
poor
or unknown
Credit?
Data points with
Credit = poor or unknown
go here
Split on the
feature Credit
Machine Learning Specialization48
Missing value handling in threshold splits
©2015-2016 Emily Fox & Carlos Guestrin
Root
< $50K
or unknwon
>= $50K
Income
Data points with
income < $50K or unknown
Split on the
feature Income
Machine Learning Specialization50
Should missing go left, right, or middle?
©2015-2016 Emily Fox & Carlos Guestrin
Root
excellent
or unknown
fair poor
Credit?
Choice 1: Missing values go
with Credit=excellent
Root
excellent
fair
or unknown
poor
Credit?
Choice 2: Missing values
go with Credit=fair
Root
excellent fair
poor
or unknown
Credit?
Choice 3: Missing values
go with Credit=poor
Choose branch that leads to
lowest classification error!
Machine Learning Specialization51
Computing classification error of
decision stump with missing data
©2015-2016 Emily Fox & Carlos Guestrin
N = 40, 3 features
Credit Term Income y
excellent 3 yrs high safe
? 5 yrs low risky
fair 3 yrs high safe
poor 5 yrs high risky
? 3 yrs low risky
? 5 yrs low safe
poor 3 yrs high risky
poor 5 yrs low safe
fair 3 yrs high safe
… … … …
Root
22 18
excellent
or unknown
fair poor
Credit?
Observed values 9 0 8 4 4 12
“Unknown” 1 2
Total 10 2 8 4 4 12
Error = .
=
Machine Learning Specialization52
Use classification error to decide
©2015-2016 Emily Fox & Carlos Guestrin
Root
22 18
excellent
or unknown
10 2
fair
8 4
poor
4 12
Credit?
Root
22 18
excellent
9 0
fair
or unknown
9 6
poor
4 12
Credit?
Root
22 18
excellent
9 0
fair
8 4
poor
or unknown
5 14
Credit?
Choice 1: error = 0.25 Choice 2: error = 0.25 Choice 3: error = 0.225
WINNER
Machine Learning Specialization54
Feature split selection algorithm
with missing value handling
©2015-2016 Emily Fox & Carlos Guestrin
•  Given a subset of data M (a node in a tree)
•  For each feature hi(x):
1.  Split data points of M where hi(x) is not
“unknown” according to feature hi(x)
2.  Consider assigning data points with “unknown”
value for hi(x) to each branch
A.  Compute classification error split & branch
assignment of “unknown” values
•  Chose feature h*(x) & branch assignment of
“unknown” with lowest classification error
Machine Learning Specialization
Summary of handling missing data
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization56
What you can do now…
©2015-2016 Emily Fox & Carlos Guestrin
Describe common ways to handling missing data:
1.  Skip all rows with any missing values
2.  Skip features with many missing values
3.  Impute missing values using other data points
Modify learning algorithm (decision trees) to handle
missing data:
1.  Missing values get added to one branch of split
2.  Use classification error to determine where missing
values go
Machine Learning Specialization57
Thank you to Dr. Krishna Sridhar
Dr. Krishna Sridhar
Staff Data Scientist, Dato, Inc.
©2015-2016 Emily Fox & Carlos Guestrin

More Related Content

Viewers also liked

Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...Norishige Fukushima
 
face recognition system using LBP
face recognition system using LBPface recognition system using LBP
face recognition system using LBPMarwan H. Noman
 
Lbp based edge-texture features for object recoginition
Lbp based edge-texture features for object recoginitionLbp based edge-texture features for object recoginition
Lbp based edge-texture features for object recoginitionIGEEKS TECHNOLOGIES
 
Human detection iccv09
Human detection iccv09Human detection iccv09
Human detection iccv09fanghuaxue
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector MachineShao-Chuan Wang
 
Facial expression recognition based on local binary patterns final
Facial expression recognition based on local binary patterns finalFacial expression recognition based on local binary patterns final
Facial expression recognition based on local binary patterns finalahmad abdelhafeez
 
face recognition system using LBP
face recognition system using LBPface recognition system using LBP
face recognition system using LBPMarwan H. Noman
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Space Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM InverterSpace Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM InverterPurushotam Kumar
 

Viewers also liked (12)

Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
 
face recognition system using LBP
face recognition system using LBPface recognition system using LBP
face recognition system using LBP
 
Lbp based edge-texture features for object recoginition
Lbp based edge-texture features for object recoginitionLbp based edge-texture features for object recoginition
Lbp based edge-texture features for object recoginition
 
Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Human detection iccv09
Human detection iccv09Human detection iccv09
Human detection iccv09
 
Local binary pattern
Local binary patternLocal binary pattern
Local binary pattern
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector Machine
 
Facial expression recognition based on local binary patterns final
Facial expression recognition based on local binary patterns finalFacial expression recognition based on local binary patterns final
Facial expression recognition based on local binary patterns final
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
face recognition system using LBP
face recognition system using LBPface recognition system using LBP
face recognition system using LBP
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Space Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM InverterSpace Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM Inverter
 

Similar to C3.4.2

Data Driven Risk Management
Data Driven Risk ManagementData Driven Risk Management
Data Driven Risk ManagementResolver Inc.
 
When User Stories Are Not Enough
When User Stories Are Not EnoughWhen User Stories Are Not Enough
When User Stories Are Not EnoughTechWell
 
Using Analytics To Make Smart HR Decisions
Using Analytics To Make Smart HR DecisionsUsing Analytics To Make Smart HR Decisions
Using Analytics To Make Smart HR DecisionsBambooHR
 
Assessingcredithealth whathasdatagottodowithitd-risk-191111012147
Assessingcredithealth whathasdatagottodowithitd-risk-191111012147Assessingcredithealth whathasdatagottodowithitd-risk-191111012147
Assessingcredithealth whathasdatagottodowithitd-risk-191111012147sea1na
 
Assessing credit health - what has data got to do with it?
Assessing credit health - what has data got to do with it?Assessing credit health - what has data got to do with it?
Assessing credit health - what has data got to do with it?NUS-ISS
 
Iceberg Webinar: Adding relevant financial context to your BCM program
Iceberg Webinar: Adding relevant financial context to your BCM programIceberg Webinar: Adding relevant financial context to your BCM program
Iceberg Webinar: Adding relevant financial context to your BCM program Iceberg Networks Corporation
 
Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
Meet TransmogrifAI, Open Source AutoML That Powers Einstein PredictionsMeet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
Meet TransmogrifAI, Open Source AutoML That Powers Einstein PredictionsMatthew Tovbin
 
Machine learning for factor investing
Machine learning for factor investingMachine learning for factor investing
Machine learning for factor investingQuantUniversity
 
Managing in the Presence of Uncertanty
Managing in the Presence of UncertantyManaging in the Presence of Uncertanty
Managing in the Presence of UncertantyGlen Alleman
 
Synthetic VIX Data Generation Using ML Techniques
Synthetic VIX Data Generation Using ML TechniquesSynthetic VIX Data Generation Using ML Techniques
Synthetic VIX Data Generation Using ML TechniquesQuantUniversity
 
Ecommerce(2)
Ecommerce(2)Ecommerce(2)
Ecommerce(2)ecommerce
 
Machine Learning and AI in Risk Management
Machine Learning and AI in Risk ManagementMachine Learning and AI in Risk Management
Machine Learning and AI in Risk ManagementQuantUniversity
 
Threat Modeling in the Cloud
Threat Modeling in the CloudThreat Modeling in the Cloud
Threat Modeling in the CloudPaige Cruz
 
Rise and Importance of Dividends
Rise and Importance of DividendsRise and Importance of Dividends
Rise and Importance of DividendsTRECDallas
 
Business Goals and Constraints.” Please respond to the following.docx
Business Goals and Constraints.” Please respond to the following.docxBusiness Goals and Constraints.” Please respond to the following.docx
Business Goals and Constraints.” Please respond to the following.docxfelicidaddinwoodie
 
Six Sigma Training Tutorial for industrial engineering in factory.pdf
Six Sigma Training Tutorial for industrial engineering in factory.pdfSix Sigma Training Tutorial for industrial engineering in factory.pdf
Six Sigma Training Tutorial for industrial engineering in factory.pdfabdulrohman195
 
Introduction to FAIR Risk Methodology – Global CISO Forum 2019 – Donna Gall...
Introduction to FAIR Risk Methodology – Global CISO Forum 2019  –  Donna Gall...Introduction to FAIR Risk Methodology – Global CISO Forum 2019  –  Donna Gall...
Introduction to FAIR Risk Methodology – Global CISO Forum 2019 – Donna Gall...EC-Council
 
2018 CISSP Mentor Program Session 3
2018 CISSP Mentor Program Session 32018 CISSP Mentor Program Session 3
2018 CISSP Mentor Program Session 3FRSecure
 
CyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MD
CyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MDCyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MD
CyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MDClearedJobs.Net
 

Similar to C3.4.2 (20)

Data Driven Risk Management
Data Driven Risk ManagementData Driven Risk Management
Data Driven Risk Management
 
When User Stories Are Not Enough
When User Stories Are Not EnoughWhen User Stories Are Not Enough
When User Stories Are Not Enough
 
Using Analytics To Make Smart HR Decisions
Using Analytics To Make Smart HR DecisionsUsing Analytics To Make Smart HR Decisions
Using Analytics To Make Smart HR Decisions
 
Assessingcredithealth whathasdatagottodowithitd-risk-191111012147
Assessingcredithealth whathasdatagottodowithitd-risk-191111012147Assessingcredithealth whathasdatagottodowithitd-risk-191111012147
Assessingcredithealth whathasdatagottodowithitd-risk-191111012147
 
Assessing credit health - what has data got to do with it?
Assessing credit health - what has data got to do with it?Assessing credit health - what has data got to do with it?
Assessing credit health - what has data got to do with it?
 
Iceberg Webinar: Adding relevant financial context to your BCM program
Iceberg Webinar: Adding relevant financial context to your BCM programIceberg Webinar: Adding relevant financial context to your BCM program
Iceberg Webinar: Adding relevant financial context to your BCM program
 
Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
Meet TransmogrifAI, Open Source AutoML That Powers Einstein PredictionsMeet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
 
Machine learning for factor investing
Machine learning for factor investingMachine learning for factor investing
Machine learning for factor investing
 
Managing in the Presence of Uncertanty
Managing in the Presence of UncertantyManaging in the Presence of Uncertanty
Managing in the Presence of Uncertanty
 
Synthetic VIX Data Generation Using ML Techniques
Synthetic VIX Data Generation Using ML TechniquesSynthetic VIX Data Generation Using ML Techniques
Synthetic VIX Data Generation Using ML Techniques
 
Ecommerce(2)
Ecommerce(2)Ecommerce(2)
Ecommerce(2)
 
Machine Learning and AI in Risk Management
Machine Learning and AI in Risk ManagementMachine Learning and AI in Risk Management
Machine Learning and AI in Risk Management
 
Threat Modeling in the Cloud
Threat Modeling in the CloudThreat Modeling in the Cloud
Threat Modeling in the Cloud
 
Rise and Importance of Dividends
Rise and Importance of DividendsRise and Importance of Dividends
Rise and Importance of Dividends
 
Business Goals and Constraints.” Please respond to the following.docx
Business Goals and Constraints.” Please respond to the following.docxBusiness Goals and Constraints.” Please respond to the following.docx
Business Goals and Constraints.” Please respond to the following.docx
 
Risk Assessments
Risk AssessmentsRisk Assessments
Risk Assessments
 
Six Sigma Training Tutorial for industrial engineering in factory.pdf
Six Sigma Training Tutorial for industrial engineering in factory.pdfSix Sigma Training Tutorial for industrial engineering in factory.pdf
Six Sigma Training Tutorial for industrial engineering in factory.pdf
 
Introduction to FAIR Risk Methodology – Global CISO Forum 2019 – Donna Gall...
Introduction to FAIR Risk Methodology – Global CISO Forum 2019  –  Donna Gall...Introduction to FAIR Risk Methodology – Global CISO Forum 2019  –  Donna Gall...
Introduction to FAIR Risk Methodology – Global CISO Forum 2019 – Donna Gall...
 
2018 CISSP Mentor Program Session 3
2018 CISSP Mentor Program Session 32018 CISSP Mentor Program Session 3
2018 CISSP Mentor Program Session 3
 
CyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MD
CyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MDCyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MD
CyberMaryland Job Fair Job Seeker Handbook October 9, 2018, Baltimore, MD
 

More from Daniel LIAO (18)

C4.5
C4.5C4.5
C4.5
 
C4.4
C4.4C4.4
C4.4
 
C4.3.1
C4.3.1C4.3.1
C4.3.1
 
Lime
LimeLime
Lime
 
C4.1.2
C4.1.2C4.1.2
C4.1.2
 
C3.7.1
C3.7.1C3.7.1
C3.7.1
 
C3.6.1
C3.6.1C3.6.1
C3.6.1
 
C3.5.1
C3.5.1C3.5.1
C3.5.1
 
C2.3
C2.3C2.3
C2.3
 
C2.6
C2.6C2.6
C2.6
 
C2.2
C2.2C2.2
C2.2
 
C2.5
C2.5C2.5
C2.5
 
C2.4
C2.4C2.4
C2.4
 
C3.1.2
C3.1.2C3.1.2
C3.1.2
 
C3.2.2
C3.2.2C3.2.2
C3.2.2
 
C3.2.1
C3.2.1C3.2.1
C3.2.1
 
C3.1.logistic intro
C3.1.logistic introC3.1.logistic intro
C3.1.logistic intro
 
C2.1 intro
C2.1 introC2.1 intro
C2.1 intro
 

Recently uploaded

Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 

Recently uploaded (20)

Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 

C3.4.2

  • 1. Machine Learning Specialization Handling missing data Emily Fox & Carlos Guestrin Machine Learning Specialization University of Washington ©2015-2016 Emily Fox & Carlos Guestrin
  • 2. Machine Learning Specialization3 ©2015-2016 Emily Fox & Carlos Guestrin Decision tree review T(xi) = Traverse decision tree Loan Application Input: xi ŷi Credit? Safe Term? Risky Safe Income? Term? Risky Safe Risky start excellent poor fair 5 years 3 years Low high 5 years 3 years
  • 3. Machine Learning Specialization5 So far: data always completely observed ©2015-2016 Emily Fox & Carlos Guestrin Credit Term Income y excellent 3 yrs high safe fair 5 yrs low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs low safe poor 3 yrs high risky poor 5 yrs low safe fair 3 yrs high safe Known x and y values for all data points
  • 4. Machine Learning Specialization6 ©2015-2016 Emily Fox & Carlos Guestrin Loan application may be 3 or 5 years Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor ? high risky poor 5 yrs low safe fair ? high safe Missing data
  • 5. Machine Learning Specialization7 Missing values impact training and predictions ©2015-2016 Emily Fox & Carlos Guestrin 1.  Training data: Contains “unknown” values 2.  Predictions: Input at prediction time contains “unknown” values
  • 6. Machine Learning Specialization8 ©2015-2016 Emily Fox & Carlos Guestrin Start Credit? Safe excellent Income? poor Term? Risky Safe fair 5 years 3 years Risky Low Term? Risky Safe high 5 years 3 years xi = (Credit = poor, Income = ?, Term = 5 years) Missing values during prediction What do we do??? Start Credit? Income?
  • 7. Machine Learning Specialization10 ©2015-2016 Emily Fox & Carlos Guestrin Training Data Feature extraction ML model Quality metric ML algorithm y h(x) T(x) x ŷ Missing values during training Missing values during prediction
  • 8. Machine Learning Specialization Handling missing data Strategy 1: Purification by skipping ©2015-2016 Emily Fox & Carlos Guestrin
  • 9. Machine Learning Specialization13 Idea 1: Purification by skipping/removing ©2015-2016 Emily Fox & Carlos Guestrin Training Data h(x) is a subset of the data without any missing values x Feature extraction h(x) Remove/skip missing values May contain data points where one (or more) features are missing
  • 10. Machine Learning Specialization15 Idea 1: Skip data points with missing values ©2015-2016 Emily Fox & Carlos Guestrin N = 9, 3 features Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs low risky poor 3 yrs ? safe fair ? high safe N = 6, 3 features Credit Term Income y excellent 3 yrs high safe fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs low risky h(x) x Skip data points with missing values
  • 11. Machine Learning Specialization16 The challenge with Idea 1 ©2015-2016 Emily Fox & Carlos Guestrin N = 9, 3 features Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor ? high risky excellent ? low risky fair ? high safe poor 3 yrs low risky poor ? low safe fair ? high safe N = 3, 3 features Credit Term Income y excellent 3 yrs high safe fair 3 yrs high safe poor 3 yrs low risky h(x) x Skip data points with missing values Warning: More than 50% of the loan terms are unknown!
  • 12. Machine Learning Specialization17 Idea 2: Skip features with missing values ©2015-2016 Emily Fox & Carlos Guestrin Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor ? high risky excellent ? low risky fair 5 yrs high safe poor ? high risky poor ? low safe fair ? high safe N = 9, 3 features x N = 9, 2 features h(x) Credit Income y excellent high safe fair low risky fair high safe poor high risky excellent low risky fair high safe poor high risky poor low safe fair high safe Skip features with many missing values
  • 13. Machine Learning Specialization19 Missing value skipping: Ideas 1 & 2 ©2015-2016 Emily Fox & Carlos Guestrin Idea 1: Skip data points where any feature contains a missing value -  Make sure only a few data points are skipped Idea 2: Skip an entire feature if it’s missing for many data points -  Make sure only a few features are skipped
  • 14. Machine Learning Specialization20 Missing value skipping: Pros and Cons ©2015-2016 Emily Fox & Carlos Guestrin Pros •  Easy to understand and implement •  Can be applied to any model (decision trees, logistic regression, linear regression,…) Cons •  Removing data points and features may remove important information from data •  Unclear when it’s better to remove data points versus features •  Doesn’t help if data is missing at prediction time
  • 15. Machine Learning Specialization Handling missing data Strategy 2: Prification by imputing ©2015-2016 Emily Fox & Carlos Guestrin
  • 16. Machine Learning Specialization23 Main drawback of skipping strategy ©2015-2016 Emily Fox & Carlos Guestrin N = 9, 3 features Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs low risky poor ? low safe fair ? high safe N = 6, 3 features Credit Term Income y excellent 3 yrs high safe fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs low risky h(x) x Skip data points with missing values Data is precious, don’t throw it away!
  • 17. Machine Learning Specialization25 Can we keep all the data? ©2015-2016 Emily Fox & Carlos Guestrin credit term income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs high risky poor ? low safe fair ? high safe Use other data points in x to “guess” the “?”
  • 18. Machine Learning Specialization26 Idea 2: Purification by imputing ©2015-2016 Emily Fox & Carlos Guestrin Training Data Same number of data points as the original data x Feature extraction h(x) Impute/ Substitute missing values
  • 19. Machine Learning Specialization28 Idea 2: Imputation/Substitution ©2015-2016 Emily Fox & Carlos Guestrin N = 9, 3 features Fill in each missing value with a calculated guess Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs high risky poor ? low safe fair ? high safe Credit Term Income y excellent 3 yrs high safe fair 3 yrs low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs high risky poor 3 yrs low safe fair 3 yrs high safe N = 9, 3 features
  • 20. Machine Learning Specialization29 Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs high risky poor ? low safe fair ? high safe Credit Term Income y excellent 3 yrs high safe fair 3 yrs low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs high risky poor 3 yrs low safe fair 3 yrs high safe Example: Replace ? with most common value ©2015-2016 Emily Fox & Carlos Guestrin # 3 year loans: 4 # 5 year loans: 2 Purification by imputing
  • 21. Machine Learning Specialization30 Common (simple) rules for purification by imputation ©2015-2016 Emily Fox & Carlos Guestrin Credit Term Income y excellent 3 yrs high safe fair ? low risky fair 3 yrs high safe poor 5 yrs high risky excellent 3 yrs low risky fair 5 yrs high safe poor 3 yrs high risky poor ? low safe fair ? high safe Impute each feature with missing values: 1.  Categorical features use mode: Most popular value (mode) of non-missing xi 2.  Numerical features use average or median: Average or median value of non-missing xi Many advanced methods exist, e.g., expectation-maximization (EM) algorithm
  • 22. Machine Learning Specialization32 Missing value imputation: Pros and Cons ©2015-2016 Emily Fox & Carlos Guestrin Pros •  Easy to understand and implement •  Can be applied to any model (decision trees, logistic regression, linear regression,…) •  Can be used at prediction time: use same imputation rules Cons •  May result in systematic errors Example: Feature “age” missing in all banks in Washington by state law
  • 23. Machine Learning Specialization Handling missing data Strategy 3: Adapt learning algorithm to be robust to missing values ©2015-2016 Emily Fox & Carlos Guestrin
  • 24. Machine Learning Specialization34 ©2015-2016 Emily Fox & Carlos Guestrin Start Credit? Safe excellent Income? poor Term? Risky Safe fair 5 years 3 years Risky Low Term? Risky Safe high 5 years 3 years xi = (Credit = poor, Income = ?, Term = 5 years) Missing values during prediction: revisited What do we do??? Start Credit? Income?
  • 25. Machine Learning Specialization35 ©2015-2016 Emily Fox & Carlos Guestrin Start Credit? Safe excellent Income? poor Term? Risky Safe fair 5 years 3 years Risky Low Term? Risky Safe high 5 years 3 years xi = (Credit = poor, Income = ?, Term = 5 years) Add missing values to the tree definition Associate missing values with a branch Start Credit? Income? or unknown Risky
  • 26. Machine Learning Specialization37 ©2015-2016 Emily Fox & Carlos Guestrin Start Credit? Safe excellent Income? poor Term? Risky Safe fair 5 years 3 years Risky Low Term? Risky Safe high 5 years 3 years Add missing value choice to every decision node or unknown Every decision node includes choice of response to missing values or unknown or unknown or unknown
  • 27. Machine Learning Specialization38 ©2015-2016 Emily Fox & Carlos Guestrin Start Credit? Term? Safe fair 5 years Prediction with missing values becomes simple xi = (Credit = ?, Income = high, Term = 5 years) or unknown Safe excellent Income? poor Risky Low Term? Risky Safe high 5 years 3 years or unknown or unknown Risky 3 years or unknown
  • 28. Machine Learning Specialization39 ©2015-2016 Emily Fox & Carlos Guestrin Start Credit? Income? poor Term? Safe high 5 years Prediction with missing values becomes simple xi = (Credit = poor, Income = high, Term = ?) or unknown Safe excellent Term? Risky Safe fair 5 years 3 years Risky Low Risky 3 years or unknown or unknown or unknown
  • 29. Machine Learning Specialization41 Explicitly handling missing data by learning algorithm: Pros and Cons ©2015-2016 Emily Fox & Carlos Guestrin Pros •  Addresses training and prediction time •  More accurate predictions Cons •  Requires modification of learning algorithm -  Very simple for decision trees
  • 30. Machine Learning Specialization Feature split selection with missing data ©2015-2016 Emily Fox & Carlos Guestrin
  • 31. Machine Learning Specialization44 Pick feature split leading to lowest classification error Greedy decision tree learning ©2015-2016 Emily Fox & Carlos Guestrin •  Step 1: Start with an empty tree •  Step 2: Select a feature to split data •  For each split of the tree: •  Step 3: If nothing more to, make predictions •  Step 4: Otherwise, go to Step 2 & continue (recurse) on this split Must select feature & branch for missing values!
  • 32. Machine Learning Specialization46 Feature split (without missing values) ©2015-2016 Emily Fox & Carlos Guestrin Root excellent fair poor Credit? Split on the feature Credit Data points with Credit = poor go here
  • 33. Machine Learning Specialization47 Feature split (with missing values) ©2015-2016 Emily Fox & Carlos Guestrin Root excellent fair poor or unknown Credit? Data points with Credit = poor or unknown go here Split on the feature Credit
  • 34. Machine Learning Specialization48 Missing value handling in threshold splits ©2015-2016 Emily Fox & Carlos Guestrin Root < $50K or unknwon >= $50K Income Data points with income < $50K or unknown Split on the feature Income
  • 35. Machine Learning Specialization50 Should missing go left, right, or middle? ©2015-2016 Emily Fox & Carlos Guestrin Root excellent or unknown fair poor Credit? Choice 1: Missing values go with Credit=excellent Root excellent fair or unknown poor Credit? Choice 2: Missing values go with Credit=fair Root excellent fair poor or unknown Credit? Choice 3: Missing values go with Credit=poor Choose branch that leads to lowest classification error!
  • 36. Machine Learning Specialization51 Computing classification error of decision stump with missing data ©2015-2016 Emily Fox & Carlos Guestrin N = 40, 3 features Credit Term Income y excellent 3 yrs high safe ? 5 yrs low risky fair 3 yrs high safe poor 5 yrs high risky ? 3 yrs low risky ? 5 yrs low safe poor 3 yrs high risky poor 5 yrs low safe fair 3 yrs high safe … … … … Root 22 18 excellent or unknown fair poor Credit? Observed values 9 0 8 4 4 12 “Unknown” 1 2 Total 10 2 8 4 4 12 Error = . =
  • 37. Machine Learning Specialization52 Use classification error to decide ©2015-2016 Emily Fox & Carlos Guestrin Root 22 18 excellent or unknown 10 2 fair 8 4 poor 4 12 Credit? Root 22 18 excellent 9 0 fair or unknown 9 6 poor 4 12 Credit? Root 22 18 excellent 9 0 fair 8 4 poor or unknown 5 14 Credit? Choice 1: error = 0.25 Choice 2: error = 0.25 Choice 3: error = 0.225 WINNER
  • 38. Machine Learning Specialization54 Feature split selection algorithm with missing value handling ©2015-2016 Emily Fox & Carlos Guestrin •  Given a subset of data M (a node in a tree) •  For each feature hi(x): 1.  Split data points of M where hi(x) is not “unknown” according to feature hi(x) 2.  Consider assigning data points with “unknown” value for hi(x) to each branch A.  Compute classification error split & branch assignment of “unknown” values •  Chose feature h*(x) & branch assignment of “unknown” with lowest classification error
  • 39. Machine Learning Specialization Summary of handling missing data ©2015-2016 Emily Fox & Carlos Guestrin
  • 40. Machine Learning Specialization56 What you can do now… ©2015-2016 Emily Fox & Carlos Guestrin Describe common ways to handling missing data: 1.  Skip all rows with any missing values 2.  Skip features with many missing values 3.  Impute missing values using other data points Modify learning algorithm (decision trees) to handle missing data: 1.  Missing values get added to one branch of split 2.  Use classification error to determine where missing values go
  • 41. Machine Learning Specialization57 Thank you to Dr. Krishna Sridhar Dr. Krishna Sridhar Staff Data Scientist, Dato, Inc. ©2015-2016 Emily Fox & Carlos Guestrin