© 2020 Minitab, LLC.
Machine Learning with Binary Logistic Regression
Meet the Presenter:
Cheryl Pammer
Senior Advisory Statistician
• 25+ years of experience
• Minitab Trainer
• Statistical Consultant
• Minitab Software Designer
• Master's in Statistics
Learning Objectives
►Use Binary Logistic Regression in a Machine Learning Environment
►Discuss Methods for Model Selection Using:
▪ P-values
▪ Area Under the ROC Curve
▪ Information Criteria
Basic Supervised Machine Learning Algorithms
►Continuous Y: Regression, CART Regression Trees
►Categorical Y: Logistic Regression, CART Classification Trees
What Has a Machine Learned?
[Figure: data set divided into Training Data and Test Data]
Data is split into a training set and a test set:
►Training (or learn) set = used to create the model
►Test set = used to assess model performance
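The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure Minitab uses: the function name `train_test_split`, the 30% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly split rows into (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # seeded shuffle for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

patients = list(range(100))                   # stand-in for 100 patient records
train, test = train_test_split(patients, test_fraction=0.3)
print(len(train), len(test))                  # 70 30
```

The model is fit on `train` only; `test` is set aside until the model's performance is assessed.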
Why is Model Testing (Validation) Important?
►Assessing a model with the same data used to fit it rewards overfitting and
overstates how well the model will predict.
►Overfit models do not predict well.
Bias-Variance Trade-Off
[Figure: one model with high bias, one model with high variance]
Validation
Validation helps find the best balance between a model that is too simple
(high bias) and one that is too complex (high variance).
Example
►Hospital system needs to estimate
the probability that a patient will need
to be readmitted within 30 days.
►Administrators use patient data for
the past year to determine the key
drivers of readmission and predict
readmission probability for new
patients.
Binary Logistic Regression
Models relationship between
binary response (Y) and multiple
features (X).
Binary Logistic Regression
Relationship can be expressed as an equation:
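$$\log_e\!\left[\frac{p}{1-p}\right] = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
$$\text{Probability (Event)} = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}$$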
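As a small sketch of how the two equations fit together, the Python below computes the log-odds from a set of coefficients and converts it to an event probability. The coefficient values and function names are hypothetical, chosen only so the arithmetic is easy to follow; they are not fitted to the readmission data.

```python
import math

def log_odds(intercept, coefs, x):
    """Linear predictor: b0 + b1*x1 + ... + bk*xk."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))

def event_probability(intercept, coefs, x):
    """P(Event) = exp(eta) / (1 + exp(eta)), evaluated in a numerically stable way."""
    eta = log_odds(intercept, coefs, x)
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))   # same value, avoids overflow for large eta
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical coefficients, for illustration only (not fitted to any data).
b0, b = -2.0, [0.5, 0.25]
p = event_probability(b0, b, [2.0, 4.0])      # eta = -2 + 1 + 1 = 0
print(p)                                      # 0.5
```

A log-odds of 0 corresponds to a probability of 0.5; positive log-odds push the probability toward 1 and negative log-odds push it toward 0.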
Baseline Rate
Potential Predictors
▪ Number of Hospital Days
▪ Number of Lab Procedures
▪ Number of Procedures
▪ Number of Medications
▪ Number of Outpatient Visits
▪ Number of Emergency Visits
▪ Number of Diagnoses
▪ Race
▪ Gender
▪ Age
▪ Admission
▪ Discharged To
▪ Diabetes
Potential Terms Up To Order 2
Stepwise Regression Using P-values
Automatically select regression or logistic regression models by
adding or removing terms, one step at a time:
►Backward Elimination: Start with full model (all terms) and
remove term with the highest p-value until everything left is
significant.
►Forward Selection: Start with an empty model (intercept only) and
add the term with the lowest p-value until none of the remaining
candidate terms is significant.
►Stepwise: Start with an empty model (intercept only), add term
with lowest p-value. At each step, add or remove terms based on
p-values.
Validation With a Test Set
Hold out a random X% of the data when fitting the model.
Receiver Operating Characteristic (ROC) Curve
►Plot of True Positive Rate vs False Positive Rate
►For a random classifier True Positive Rate = False Positive Rate
              True Yes   True No
Model = Yes   #TP        #FP
Model = No    #FN        #TN
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
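The quantities on this slide can be computed directly from predicted probabilities. The sketch below is illustrative only: the labels and scores are made up, and the area under the ROC curve is computed with the pairwise-ranking formulation (the share of Yes/No pairs in which the Yes case receives the higher score), which is one standard way to obtain it. Note that True Positive Rate is the sensitivity and False Positive Rate is 1 - specificity.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when 'Model = Yes' means score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted_yes = s >= threshold
        if predicted_yes and y == 1:
            tp += 1
        elif predicted_yes and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity = TP/(TP + FN); Specificity = TN/(FP + TN)."""
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    return tp / (tp + fn), tn / (fp + tn)

def roc_auc(y_true, scores):
    """Area under the ROC curve: share of (Yes, No) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = Yes, 0 = No) and predicted probabilities.
y = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
sens, spec = sensitivity_specificity(y, scores, threshold=0.5)
auc = roc_auc(y, scores)
print(sens, spec, auc)
```

A classifier that scores cases at random has an area of about 0.5, which is why Test ROC values only slightly above 0.5 indicate weak predictive power.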
Stepwise Model Selection
Step  Predictors                                     Test ROC
1     Discharged To                                  0.5655
2     Discharged To, #Emergency Visits               0.5830
3     Discharged To, #Emergency Visits, #Diagnoses   0.6007
…
8     Final (8 Terms)                                0.6021
Visualizing Model Results
Visualizing Model Results
Predicting the Probability of Readmission
Problems With P-Value-Based Variable Selection
►With larger data sets, tests can be
too powerful and almost everything
is significant.
►With many potential terms, some
will be significant by chance.
►Because individual p-values are
dependent on other terms in
model, finding the correct subset of
terms is challenging.
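The second point can be made concrete with a short calculation. Assuming, for illustration, that none of the candidate terms has any real effect and that the tests are independent, the expected number of "significant" terms and the chance of seeing at least one follow directly from the significance level. The count of 100 candidate terms is a hypothetical figure, not one taken from the readmission example.

```python
def chance_findings(n_terms, alpha=0.05):
    """Expected number of false positives, and P(at least one), among
    n_terms truly null terms tested independently at level alpha."""
    expected = n_terms * alpha
    at_least_one = 1.0 - (1.0 - alpha) ** n_terms
    return expected, at_least_one

expected, at_least_one = chance_findings(100, alpha=0.05)
print(expected)                  # about 5 terms look significant by chance
print(round(at_least_one, 4))    # 0.9941
```

Real model terms are rarely independent, so the exact numbers will differ, but the direction of the effect is the same: the more terms are screened, the more spurious ones pass.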
Model Selection Strategies
►Need criterion to compare models.
▪ Categorical Y: Area under ROC curve (Test ROC)
▪ Continuous Y: Test R²
▪ Information criteria: AIC, BIC
►Given a criterion, need a search strategy.
▪ Look at all possible models (Best Subsets)
▪ Stepwise procedures
Many automated model fitting techniques exist. These vary
depending on type of model.
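The two search strategies named above can be sketched generically once a criterion is available. In the Python below, `criterion` is any function that scores a set of terms with smaller values being better (for a larger-is-better measure such as Test ROC, its negative could be used). The toy criterion and the term names A, B, C are invented for illustration; in practice the criterion would refit the model for each set of terms.

```python
from itertools import combinations

def best_subsets(candidates, criterion):
    """Score every non-empty subset of candidates; smaller criterion is better."""
    subsets = (
        frozenset(combo)
        for size in range(1, len(candidates) + 1)
        for combo in combinations(candidates, size)
    )
    return min(subsets, key=criterion)

def forward_search(candidates, criterion):
    """Greedy search: add the single best term until the criterion stops improving."""
    selected, best_score = [], float("inf")
    remaining = list(candidates)
    while remaining:
        score, term = min((criterion(frozenset(selected + [t])), t) for t in remaining)
        if score >= best_score:
            break                      # no candidate improves the criterion
        selected.append(term)
        remaining.remove(term)
        best_score = score
    return selected

# Hypothetical criterion: each term has a made-up benefit,
# and every term in the model costs a fixed penalty of 3.
GAIN = {"A": 10, "B": 4, "C": 0}

def toy_criterion(terms):
    return 100 - sum(GAIN[t] for t in terms) + 3 * len(terms)

print(sorted(best_subsets(list(GAIN), toy_criterion)))   # ['A', 'B']
print(forward_search(list(GAIN), toy_criterion))         # ['A', 'B']
```

Best Subsets evaluates 2^k - 1 models for k candidate terms, so it becomes expensive quickly; the forward search evaluates far fewer models but is not guaranteed to find the same answer.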
Example: Furniture Delivery
►Furniture manufacturer investigates
12 potential key drivers of defects
in setup and delivery process.
►Data represent individual deliveries
over several months.
►Response:
▪ Yes = Damaged
▪ No = Not Damaged
Information Criteria
►When a model is used to represent a population, information is lost.
►The Akaike Information Criterion (AIC) and the Bayesian Information Criterion
(BIC) estimate relative information loss. Less is better.
Information Criteria
Goal: Find a model that is neither underfit nor overfit.
►Assess goodness of fit using likelihood function – discourages
underfitting.
►Penalize overfitting by considering the number of model terms.
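The slides do not show formulas, so the sketch below uses the standard textbook definitions: AIC = -2 ln L + 2k and BIC = -2 ln L + k ln(n), where ln L is the maximized log-likelihood, k is the number of estimated parameters, and n is the number of observations. The two candidate fits are hypothetical numbers chosen to show the criteria disagreeing.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike criterion: -2*lnL + 2*k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Bayesian criterion: -2*lnL + k*ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical fits on n = 1000 rows: the larger model gains a little likelihood.
small = {"log_likelihood": -520.0, "n_params": 3}
large = {"log_likelihood": -517.0, "n_params": 5}
n = 1000

for name, m in (("small", small), ("large", large)):
    print(name, aic(**m), round(bic(n_obs=n, **m), 1))
```

With these numbers AIC prefers the larger model and BIC prefers the smaller one. The per-term penalty is 2 for AIC and ln(n) for BIC, so BIC penalizes extra terms more heavily whenever n is 8 or more.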
Properties of AIC and BIC
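►BIC will typically choose a model as small as or smaller than the one
AIC chooses.
►As the sample size grows to infinity, it can be shown that:
▪ AIC tends to choose a model that contains the true model (it does not
leave needed variables out), but it may include extra terms.
▪ BIC chooses exactly the right model with probability approaching 1,
provided the true model is among the candidates.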
Forward Information Criteria Model Selection
Step  Predictors        BIC    Test ROC
1     Warehouse Time    1348   0.9977
2     + Tech Lead       1290   0.9989
3     + Team Lift       1077   0.9993
4     + Load Type       877    0.9997
5     + Stop Number     810    0.9998
6     + Driver          1122   0.9998
BIC typically results in smaller models unless n is small.
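Reading the table as a forward search, BIC falls through step 5 and then rises at step 6, so the five-predictor model is the one with the smallest BIC. The sketch below walks the table's values with a simple "stop at the first increase" rule. That stopping rule is an assumption made for illustration; the slides do not state the exact rule the software applies.

```python
# (predictor added, BIC, Test ROC) copied from the table on this slide
path = [
    ("Warehouse Time", 1348, 0.9977),
    ("Tech Lead", 1290, 0.9989),
    ("Team Lift", 1077, 0.9993),
    ("Load Type", 877, 0.9997),
    ("Stop Number", 810, 0.9998),
    ("Driver", 1122, 0.9998),
]

def stop_at_first_increase(path):
    """Return the predictors kept when the search stops as soon as BIC rises."""
    kept = []
    best_bic = float("inf")
    for predictor, bic, _test_roc in path:
        if bic >= best_bic:
            break                      # adding this predictor made BIC worse
        kept.append(predictor)
        best_bic = bic
    return kept, best_bic

kept, best_bic = stop_at_first_increase(path)
print(kept)       # the first five predictors, ending with 'Stop Number'
print(best_bic)   # 810
```

Test ROC is unchanged between steps 5 and 6 (0.9998), so on these numbers both criteria agree that adding Driver does not help.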
Receiver Operating Characteristic (ROC) Curve
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
              True Yes   True No
Model = Yes   #TP        #FP
Model = No    #FN        #TN
One Look at the Data…
Highly Significant, But…
The Real Key Result
Take-aways
You have learned to use:
►Binary Logistic Regression in a Machine Learning Environment
►Old and New Methods for Model Selection:
▪ P-values
▪ Area Under the ROC Curve
▪ Information Criteria
Questions?
Cheryl Pammer
cpammer@minitab.com
Upcoming Webinars and Virtual Events
• Machine Learning with Classification & Regression Trees (CART®)
Time: Wednesday 15 July, 12PM AEST (10AM HKT / 2PM NZST)
See all the details and sign up at:
https://info.minitab.com/resources/webinars/webinar-wednesdays
Upcoming Webinars and Virtual Events
• Online/Virtual Training
Minitab is now offering virtual training taught by
Minitab experts – perfect for remote/home workers.
Visit www.minitab.com/training/training for more info.
• Talk to Minitab
Complimentary resources to help you deal quickly with today's challenges and changing environment.
Visit www.minitab.com and click on the Talk to Minitab button and a Minitab representative will be in touch!
