Applied several data mining techniques (Decision Tree, Random Forest, Naive Bayes, and Boosting) to determine which most accurately predicts medical appointment no-shows. Used ROC curves and confusion matrices to compare the results of the techniques applied.
Business Analytics with R - Using Data Mining Techniques
1. Medical Appointment No Shows
Presented by:
Team 2: Kayla Reinhart, Medha Tiwary, Janelle Manuel, Anvitha Ananth
2. Introduction & Problem
● 30% (about 100,000) of patients at a public-sector primary care medical facility in Brazil
missed their scheduled appointments between 5/2013 and 12/2015
● On average, a primary care visit costs about $200, which translates to approximately $20M in
missed revenue for the practice over roughly 2.5 years
● Beyond missed revenue, assuming that it costs $10 to accommodate each missed
appointment, no-shows could cost the facility nearly $1M in salaries and operational costs
● Aside from direct revenue costs, no-shows significantly affect care delivery, cost of care, and
resource planning. Delayed testing potentially puts patients in danger. Missed screenings can
result in delayed disease detection. Reducing no-show rates can help diminish costs and
improve the quality of health care delivery.
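The cost estimates above follow from simple multiplication. A minimal sketch of that arithmetic in Python, using only the figures quoted on the slide (the variable names are illustrative, not part of the original analysis):

```python
# Figures quoted on the slide; variable names are illustrative assumptions.
missed_appointments = 100_000   # roughly 30% of scheduled appointments
avg_visit_revenue = 200         # USD per primary care visit
accommodation_cost = 10         # USD assumed per missed appointment

missed_revenue = missed_appointments * avg_visit_revenue
operational_cost = missed_appointments * accommodation_cost

print(f"Missed revenue:   ${missed_revenue:,}")
print(f"Operational cost: ${operational_cost:,}")
```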
3. Objectives
● Evaluate the cause of missed
appointments and the impact of the input
features on no-shows
● Predict whether a patient is going to miss
an appointment
● Determine which factors have the largest
impact on no-show status
● Help doctors identify traits in patients that
are more likely to miss an appointment
and give recommendations on how to
combat high no-show rates
4. About the Dataset
- 300,000 observations
- 15 variables (characteristics)
- Coded Gender, Day of the Week, and Status as numeric values
Input | Description | Details
Age | Patient’s Age | 0-95
Gender | Patient’s Gender | F = 1, M = 2
Appointment Registration | Date/Time Appointment was Made | Time and Date Stamp Provided
Appointment Data | Date/Time of Appointment | Time and Date Stamp Provided
Day of the Week | Day of the week of Appointment | 1 = Sunday …. 7 = Saturday
Status | Patient Showed or Didn’t Show | 0 = Show Up, 1 = No Show
Diabetes | Patient is a Diabetic | 0 = No, 1 = Yes
Alcoholism | Patient is an Alcoholic | 0 = No, 1 = Yes
Hypertension | Patient is Hypertensive | 0 = No, 1 = Yes
Handicap | Patient is Handicapped | 0 = No, 1 = Yes
Smoker | Patient is a Smoker | 0 = No, 1 = Yes
Welfare | Patient Receives Government Assistance | 0 = No, 1 = Yes
Tuberculosis | Patient had Tuberculosis | 0 = No, 1 = Yes
SMS Reminder | Patient was Sent a Text Reminder | 0 = No, 1 = Yes
Waiting Time | Duration (in days) Between Date Appt was Made and Date of Appt | Amount in negative days
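As an illustration of how a Waiting Time value in negative days could be derived from the two timestamps, here is a minimal Python sketch. The timestamp format and the sign convention (registration date minus appointment date) are assumptions inferred from the table, and `waiting_time_days` is a hypothetical helper, not a function from the original analysis.

```python
from datetime import datetime

# ASSUMPTION: timestamps look like "2014-12-16T14:46:25Z"; the slides do not
# state the exact format.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def waiting_time_days(registered: str, appointment: str) -> int:
    """Days between registration and appointment, expressed as a negative number."""
    reg = datetime.strptime(registered, TIMESTAMP_FORMAT).date()
    appt = datetime.strptime(appointment, TIMESTAMP_FORMAT).date()
    # Registration minus appointment gives a negative value when the
    # appointment lies in the future, matching "amount in negative days".
    return (reg - appt).days

example = waiting_time_days("2014-12-16T14:46:25Z", "2015-01-14T00:00:00Z")
print(example)  # -29
```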
5. Evaluating the Data
- Less than 3% of the data contained missing values; these observations were
removed.
- The Age input had improbable values; we treated these as outliers and removed
them from the data set (194 observations, <1%).
- It was evident that there was a class imbalance problem with the “STATUS” target:
  - Majority Class: Status = 0 (Show) - 70% of the data
  - Minority Class: Status = 1 (No Show) - 30% of the data
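The cleaning steps and the class-balance check can be sketched in a few lines of Python. This is only an illustration on made-up records: the field names follow the dataset table, while the 0-95 age bounds used as the outlier rule and the helper `is_clean` are assumptions, since the slides do not say which cutoff was applied.

```python
from collections import Counter

# Toy records for illustration only; the values are invented.
records = [
    {"Age": 34, "Status": 0},
    {"Age": 7, "Status": 1},
    {"Age": None, "Status": 0},   # missing value: dropped
    {"Age": -2, "Status": 0},     # improbable age: dropped
    {"Age": 61, "Status": 0},
    {"Age": 113, "Status": 1},    # improbable age: dropped
    {"Age": 45, "Status": 1},
    {"Age": 29, "Status": 0},
]

def is_clean(row, min_age=0, max_age=95):
    """Keep rows with no missing values and an age inside the assumed range."""
    if any(value is None for value in row.values()):
        return False
    return min_age <= row["Age"] <= max_age

cleaned = [row for row in records if is_clean(row)]

# Share of each Status value among the remaining rows.
counts = Counter(row["Status"] for row in cleaned)
shares = {status: n / len(cleaned) for status, n in counts.items()}
print(len(cleaned), shares)
```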
6. Preparing the Data
- Removed the Appointment Registration
and Appointment Data features, since
these values were summarized in the
Waiting Time feature.
- Using regression, we found that Age,
Alcoholism, Hypertension, SMS
Reminder, Day of the Week, and
Waiting Time were significant variables.
These are the variables we decided to
use as inputs for our models.
7. Class Imbalance Strategy #1: SMOTE
- Performed SMOTE to bring the training dataset to 60,000 observations and correct
the class imbalance.
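For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class points by interpolating between a minority point and one of its nearest minority-class neighbours. The Python sketch below is a simplified variant written for illustration (it picks base points at random rather than oversampling each point a fixed number of times) and is not the implementation the team used; the function name `smote` and the toy (Age, Waiting Time) values are assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical no-show records as (Age, Waiting Time) pairs.
minority = [(25.0, -3.0), (31.0, -10.0), (40.0, -1.0), (52.0, -20.0)]
new_points = smote(minority, n_new=6, k=2)
print(new_points)
```

In practice the features would usually be scaled before computing distances, and binary inputs need special handling, which this sketch leaves out.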
8. Class Imbalance Strategy #1: SMOTE
- Implemented a Random Forest model and
scored the data with a randomly sampled
15,000-observation test data set.
- The error matrix showed that even after
using SMOTE to correct for the class
imbalance, the model was unable to
predict any of the no-shows in the test
data.
- The model's roughly 70% accuracy is misleading:
it only reflects the share of shows in the test data.

Actual \ Predicted | 0 | 1
0 | 10480 | 0
1 | 4520 | 0
Error Rate: 30%
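The figures in the error matrix make the problem concrete. The short Python sketch below recomputes accuracy, error rate, recall, and precision from the four counts reported on the slide; the variable names are illustrative.

```python
# Counts from the error matrix (rows: actual, columns: predicted).
tn, fp = 10480, 0   # actual shows
fn, tp = 4520, 0    # actual no-shows

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
error_rate = (fp + fn) / total

# Recall: share of actual no-shows that the model caught.
recall = tp / (tp + fn) if (tp + fn) else float("nan")
# Precision is undefined here because the model never predicts a no-show.
precision = tp / (tp + fp) if (tp + fp) else float("nan")

# Accuracy of always predicting the majority class ("show").
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.3f} error={error_rate:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

Because the accuracy equals the majority-class baseline and recall is zero, the model adds nothing over always predicting "show".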
9. Class Imbalance Strategy #2: Cost-Sensitive Learning
- Randomly sampled 60,000 observations, or
20% of the dataset
- Used a 70/0/30 partition
- Applied a loss matrix of 0, 20, 50, 0
- Error rate increased, but the number of correctly
predicted no-shows increased as well
- In this case, recall is high (.73) but precision is
low (.33), driving down the F1 measure (.46)
and making the 52% accuracy rate
unacceptable
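One way to see why a loss matrix raises recall at the expense of precision is to look at the decision rule it implies. The Python sketch below assumes the four values are ordered as (true negative, false positive, false negative, true positive), so a missed no-show costs 50 and a false alarm costs 20; the slide does not state the ordering, so this reading is an assumption. Tree-based learners typically apply the losses while building the model rather than as an explicit threshold, so treat this as an illustration of the principle only.

```python
# ASSUMPTION: loss matrix 0, 20, 50, 0 read as (TN, FP, FN, TP).
LOSS_FP = 20.0   # predicted no-show, but the patient showed up
LOSS_FN = 50.0   # predicted show, but the patient missed the appointment

def expected_losses(p_no_show):
    """Expected loss of each possible prediction, given P(no-show)."""
    loss_if_predict_show = p_no_show * LOSS_FN
    loss_if_predict_no_show = (1.0 - p_no_show) * LOSS_FP
    return loss_if_predict_show, loss_if_predict_no_show

def predict(p_no_show):
    """Return 1 (no-show) when that prediction has the lower expected loss."""
    show_loss, no_show_loss = expected_losses(p_no_show)
    return 1 if no_show_loss < show_loss else 0

# Probability above which predicting a no-show becomes the cheaper choice.
threshold = LOSS_FP / (LOSS_FP + LOSS_FN)
print(f"threshold={threshold:.3f}", predict(0.25), predict(0.30))
```

Under this reading the cutoff drops from 0.5 to about 0.29, so more appointments are flagged as no-shows: more true no-shows are caught, but more false alarms are raised as well.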
10. Class Imbalance Strategy #3: Ignore It
- Ran a Random Forest on the
60,000-observation sample
with a 70/0/30
partition
- Accuracy improved, but
recall (.06) and F1 (.10)
fell sharply, with precision
at only .40
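The F1 measure is the harmonic mean of precision and recall, so it stays low whenever either one is low. A minimal Python sketch using the rounded precision and recall values reported for strategies #2 and #3 (the function name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

cost_sensitive_f1 = f1_score(0.33, 0.73)   # strategy #2
ignore_f1 = f1_score(0.40, 0.06)           # strategy #3

print(f"cost-sensitive F1={cost_sensitive_f1:.3f}, ignore-it F1={ignore_f1:.3f}")
```

With these rounded inputs, strategy #2 comes out at about .45 rather than the reported .46; the small gap presumably comes from rounding of the underlying precision and recall.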
11. Conclusions and Recommendations
● After fitting multiple models with multiple parameter settings, we were unable to find any
features within the given dataset that reliably predicted missed appointments
● We suspect that the inputs provided are not predictive of no-shows
● We recommend that the facility gather additional information about their
patients when the appointments are made such as:
○ Job Type (Full-time, Part-time, unemployed, etc.)
○ New vs. Existing Patient
○ Reason for Appointment
○ Recency of Symptoms
○ Severity of Symptoms
○ Insurance Coverage
○ Distance from Facility
○ Means of Transportation