This document describes a project to predict 30-day hospital readmission for diabetic patients using machine learning models. It provides an overview of the dataset used and the data preprocessing steps, including cleaning, dimension reduction, and sampling. Several classification models are tested on the preprocessed data, including logistic regression, neural networks, decision trees, boosted trees, bootstrap forest, and naive Bayes. The neural network model is selected for having the lowest false negative rate while maintaining acceptable accuracy. Further adjustments are made to optimize the cutoff value to reduce false negatives. The developed model is expected to help healthcare providers identify high-risk patients and potentially prevent readmissions.
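The cutoff adjustment described above can be sketched in a few lines. The labels and predicted probabilities below are hypothetical stand-ins, not the project's data:

```python
# Sketch: lowering the probability cutoff trades extra false positives
# for fewer false negatives. Scores and labels here are made up.

def confusion_counts(y_true, scores, cutoff):
    """Return (TP, FP, TN, FN) at a given probability cutoff."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
    tn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 1)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]            # 1 = readmitted within 30 days
scores = [0.9, 0.4, 0.6, 0.2, 0.55, 0.1, 0.3, 0.35]

for cutoff in (0.5, 0.3):
    tp, fp, tn, fn = confusion_counts(y_true, scores, cutoff)
    print(f"cutoff={cutoff}: FN={fn}, FP={fp}, FN rate={fn / (fn + tp):.2f}")
```

On this toy data the default 0.5 cutoff misses two readmissions; dropping the cutoff to 0.3 catches them at the price of one extra false alarm, which is the trade the project accepts when false negatives are costlier.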
Understanding the Analytical Method Validation in a Practical Perspective - Dr. Ishaq B Mohammed
The document discusses analytical method development, validation and transfer. It begins by introducing the importance of method development, validation and transfer in pharmaceutical analysis. It then discusses some key aspects of each including the objectives of method development, definition of validation, and the purpose of method transfer. The document provides examples of parameters to consider for method development including sample type, required data, analyte levels, and expected precision and accuracy. It also gives an overview of common validation parameters like accuracy, precision, specificity, range and linearity. The document aims to provide guidance on establishing reliable analytical methods for pharmaceutical applications.
SGS Biopharm Day 2016 - Modeling & simulation in Phase 1 - Ruben Faelens
Modeling and simulation can optimize clinical trials in the following ways:
1. PK/PD modeling can help translate animal data to humans, define initial doses for first-in-human studies, and determine dose escalation strategies.
2. Modeling can guide decisions throughout development by exploring objectives for future phases.
3. A case study example demonstrates how PK/PD modeling was used to iteratively refine dose selection and determine stopping rules for a first-in-human clinical trial of a monoclonal antibody.
2014-10-22 EUGM | ROYCHAUDHURI | Phase I Combination Trials - Cytel USA
This document summarizes a Bayesian statistical approach for determining the maximum tolerated dose (MTD) in phase I oncology clinical trials testing drug combinations. It describes the challenges of combination trials and outlines a methodology using a Bayesian model with parameters for individual drug effects and drug-drug interactions. The methodology is applied to a sample dual-drug combination trial, with results presented across multiple cohorts and doses. Based on the modeled toxicity probabilities and additional clinical data, 6 mg of drug 1 and 400 mg of drug 2 are identified as the MTD and recommended phase II dose.
2010 smg training_cardiff_day1_session1 (1 of 3)_mckenzie - rgveroniki
This document discusses different analytical methods for meta-analyzing continuous outcome data from randomized trials: final values, change scores, and analysis of covariance (ANCOVA). It presents an example comparing the properties of these estimators using observed and simulated trial data. Key findings include:
1) The three estimators can produce different intervention effect estimates depending on the correlation between baseline and follow-up scores; ANCOVA generally has the smallest standard error.
2) ANCOVA is preferred as it is unconditionally and conditionally unbiased, whereas final values and change scores can be conditionally biased.
3) In meta-analysis, when trials have adequate allocation concealment, pooled baseline imbalance is usually not problematic; however
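The comparison of the three estimators can be illustrated with a toy simulation. Everything below (effect size, baseline-follow-up correlation, sample size) is an illustrative assumption, and the ANCOVA adjustment is approximated using the known correlation rather than a fitted regression slope:

```python
# Toy two-arm trial: baseline ~ N(50, 10), follow-up correlated with
# baseline (rho), treatment shifts follow-up by true_effect.
import random
import statistics

random.seed(42)
n, true_effect, rho = 2000, -2.0, 0.7

def simulate_arm(n, effect):
    base = [random.gauss(50, 10) for _ in range(n)]
    follow = [50 + effect + rho * (b - 50) +
              random.gauss(0, 10 * (1 - rho ** 2) ** 0.5) for b in base]
    return base, follow

tb, tf = simulate_arm(n, true_effect)  # treatment arm
cb, cf = simulate_arm(n, 0.0)          # control arm

final_values = statistics.mean(tf) - statistics.mean(cf)
change_scores = ((statistics.mean(tf) - statistics.mean(tb)) -
                 (statistics.mean(cf) - statistics.mean(cb)))
# ANCOVA-style estimate: adjust the follow-up difference for any
# chance baseline imbalance between the arms
ancova = final_values - rho * (statistics.mean(tb) - statistics.mean(cb))

for name, est in [("final values", final_values),
                  ("change scores", change_scores),
                  ("ANCOVA", ancova)]:
    print(f"{name}: {est:.2f}")
```

All three estimates land near the true effect of -2, but because ANCOVA removes the baseline-correlated noise it has the smallest standard error across repeated simulations, matching finding 1 above.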
How to double success rate of pediatric trials? - JogaGobburu
This document discusses using quantitative clinical pharmacology and modeling to improve the success rate of pediatric drug trials. It proposes a "learn-apply" approach where prior knowledge from clinical trials is used to simulate trial designs and select doses for registration studies. This can help power pediatric studies and support approvals. The goal is to leverage past learnings to design better pediatric trials through simulation and modeling.
This document provides a summary of a project to analyze factors related to readmission of diabetes patients using a dataset from 130 US hospitals. The team cleaned the data by removing attributes with high percentages of missing values, irrelevant attributes, and instances of deceased patients. They applied the SMOTE technique to address data imbalance, oversampling the minority readmission class by 200%. Three classifiers - J48 decision tree, Naive Bayes, and Bayes Net - were selected for experiments to predict patient readmission.
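The SMOTE step can be sketched as follows. This is a minimal pure-Python illustration of the interpolation idea on made-up 2-D points; a real analysis would use a library implementation (e.g. Weka's SMOTE filter or imbalanced-learn):

```python
# Minimal SMOTE-style oversampling sketch (illustrative data only).
import random

def smote_200(minority, k=2, seed=0):
    """Generate 2 synthetic points per minority point (200% oversampling)
    by interpolating toward a random one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbours by squared Euclidean distance, excluding x
        others = sorted((p for j, p in enumerate(minority) if j != i),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbours = others[:k]
        for _ in range(2):                 # 200% -> two synthetic samples
            nb = rng.choice(neighbours)
            gap = rng.random()             # interpolation fraction in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (5.0, 5.0)]
new_points = smote_200(minority)
print(len(new_points))   # 4 minority samples -> 8 synthetic samples
```

In SMOTE's terminology, oversampling "by 200%" means two synthetic samples per original minority sample, each placed on the line segment between a minority point and one of its nearest minority-class neighbours.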
Are you interested in learning how to prevent hospital readmissions for your diabetic population? It is a popular belief that measuring blood glucose is the most predictive variable in determining a hospital readmission for a diabetic. However, many providers of care simply do not perform the test on known diabetic patients. This study looks at an advanced analytic method that works within the current healthcare provider's workflow to identify the likelihood of a future 30-day unplanned readmission before hospital discharge.
Big Data Analytics for Healthcare - Chandan K. Reddy (.docx) - aulasnilda
Big Data Analytics for Healthcare

Chandan K. Reddy
Department of Computer Science, Wayne State University

Jimeng Sun
Healthcare Analytics Department, IBM TJ Watson Research Center
Jimeng Sun, Large-scale Healthcare Analytics
Healthcare Analytics using Electronic Health Records (EHR)

Old way: data are expensive and small
- Input data come from clinical trials, which are small and costly
- Modeling effort is small since the data are limited (a single model can still take months)

EHR era: data are cheap and large
- Broader patient population
- Noisy data
- Heterogeneous data
- Diverse scale
- Complex use cases
Heterogeneous Medical Data
- Diagnosis
- Medication
- Lab
- Clinical notes
- Images
- Genetic data
Challenges in Healthcare Analytics
- Collaboration across domains
- Analytic platform
- Intuitive results
- Scalable computation
PARALLEL MODEL BUILDING
Motivation: predictive modeling using EHR is growing

Need for scalable predictive modeling platforms/systems due to increased computational requirements from:
- Processing EHR data (due to volume, variability, and heterogeneity)
- Building accurate models
- Building clinically meaningful models
- Validating models for accuracy and generalizability

(Chart: explosion in interest in EHR-based predictive modeling.)
What does it take to develop a predictive model using EHR?

Marina (IBM analytics consultant): Within 3 months, we need to
1. understand the business case
2. obtain the data
3. prepare the data
4. develop predictive models
5. deliver the final model
David Gotz, Harry Starvropoulos, Jimeng Sun, Fei Wang. ICDA: A Platform for Intelligent Care Delivery Analytics, AMIA 2012.
A Generalized Predictive Modeling Pipeline
Cohort Construction: Find an appropriate set of patients with the specified
target condition and a corresponding set of control patients without the
condition.
Feature Construction: Compute a feature vector representation for each
patient based on the patient’s EHR data.
Cross Validation: Partition the data into complementary subsets for use in
model training and validation testing.
Feature Selection: Rank the input features and select a subset of relevant
features for use in the model.
Classification: Train and evaluate a model for a specific classifier.
Output: Clean up intermediate files and put results into their final locations.
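The pipeline stages above can be sketched as plain functions over toy records. The field names and data are illustrative assumptions, not the ICDA platform's actual interfaces:

```python
# Toy sketch of cohort construction, feature construction, and
# cross-validation splitting from the generalized pipeline.

def construct_cohort(patients, condition):
    """Split patients into cases (with the target condition) and controls."""
    cases = [p for p in patients if condition in p["diagnoses"]]
    controls = [p for p in patients if condition not in p["diagnoses"]]
    return cases, controls

def construct_features(patient, feature_names):
    """One feature vector per patient, drawn from the patient's EHR events."""
    return [patient["events"].get(name, 0) for name in feature_names]

def split_folds(items, n_folds=3):
    """Complementary subsets for model training and validation testing."""
    return [items[i::n_folds] for i in range(n_folds)]

patients = [
    {"id": 1, "diagnoses": {"heart failure"}, "events": {"bnp": 3, "echo": 1}},
    {"id": 2, "diagnoses": {"hypertension"}, "events": {"bp": 5}},
    {"id": 3, "diagnoses": {"heart failure"}, "events": {"bnp": 1}},
]

cases, controls = construct_cohort(patients, "heart failure")
vectors = [construct_features(p, ["bnp", "echo", "bp"]) for p in cases]
folds = split_folds(cases + controls, n_folds=3)
print(len(cases), vectors, len(folds))
```

Feature selection and classification would then operate on `vectors` within each fold; the sketch only shows how the earlier stages hand data to them.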
Cohort Construction

(Diagram: cases and controls are drawn from the pool of all patients.)

Disease  Target                  Samples
D1       Hypertension control    5000
D2       Heart failure onset     33K
D3       Hypertension diagnosis  300K
The study explores major factors that contribute to hospital readmissions via various analysis algorithms, including decision tree, neural network, and Bayesian network.
The document discusses predictive modeling for diabetes using medical claims data. It describes cleaning the data, calculating baseline statistics for diabetes, using likelihood ratios to predict diabetes risk in a validation set, and evaluating the model's sensitivity, specificity, and AUC. While the model achieves good accuracy for public policy, the conclusion cautions that individual risk predictions may not be accurate enough to guide patients due to overlaps between normal and disease risk distributions.
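The likelihood-ratio step works by converting a baseline probability to odds, multiplying by the test's likelihood ratio, and converting back. The prevalence and likelihood ratio below are made-up numbers, not the study's estimates:

```python
def post_test_probability(pretest_p, lr):
    """Update a pretest probability with a likelihood ratio via odds."""
    pretest_odds = pretest_p / (1 - pretest_p)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical: 9% baseline diabetes prevalence, a positive marker
# with a likelihood ratio of 6.
print(round(post_test_probability(0.09, 6.0), 3))  # -> 0.372
```

Note that even a fairly strong likelihood ratio moves a low baseline probability only to about 37% here, which illustrates the conclusion's caution: overlapping risk distributions limit how decisive individual predictions can be.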
This document discusses biases that can arise in randomized controlled trials and meta-analyses. It notes that biases can be introduced in the design, conduct, analysis, and reporting of trials. Various empirical studies are presented that demonstrate biases from lack of allocation concealment and blinding in trials. Risk of bias assessments are recommended over quality scores for evaluating biases in individual trials and meta-analyses.
The document describes the design and testing of a decision support tool for intelligently setting guardrail limits on smart infusion pumps. The tool allows users to set guardrails based on simple rules to reduce missed detections. Testing of a rule to set thresholds to eliminate missed detections for morphine doses in adults showed that it greatly increased the false alarm rate from 46% to 99.5%, while eliminating all missed detections. The conclusions discuss expanding the tool to more drugs and rules, and reducing remaining false alarms through manual threshold adjustments.
Business Analytics with R - Using Data Mining Techniques - Anvitha Ananth
Applied various data mining techniques (decision tree, random forest, naive Bayes, and boosting) to determine which was most accurate in predicting medical appointment no-shows. Used ROC curves and confusion matrices to compare the results of the techniques applied.
Sample Size: A couple more hints to handle it right using SAS and R - Dave Vanz
Andrii Artemchuk from Intego Group, a Ukrainian offshore staffing company, presented this PowerPoint on SAS and R at the PhUSE conference in Frankfurt, Germany, in 2018.
ICU Patient Deterioration Prediction: A Data-Mining Approach - csandit
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH - cscpconf
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
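A minimal version of ranking lab-test features might look like this. The data and the score (absolute class-mean difference) are toy assumptions, much simpler than the state-of-the-art selectors the paper applies to MIMIC-II:

```python
# Univariate feature ranking: score each lab test by how far apart its
# class means are, then sort tests from most to least informative.

def rank_features(X, y, names):
    scores = {}
    for j, name in enumerate(names):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(scores, key=scores.get, reverse=True)

names = ["lactate", "creatinine", "wbc"]
X = [[4.0, 1.0, 7.0], [3.8, 1.1, 8.0], [1.0, 1.0, 7.5], [1.2, 0.9, 7.2]]
y = [1, 1, 0, 0]          # 1 = deteriorated, 0 = stable
print(rank_features(X, y, names))  # lactate separates the classes most
```

Tests at the bottom of the ranking are candidates for omission, which is exactly the cost and observation-time saving the abstract argues for.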
Decision Support System to Evaluate Patient Readmission Risk - Avishek Choudhury
This document presents a decision support system for predicting patient readmission risk. It discusses the importance of reducing patient readmissions due to factors like healthcare costs and patient health. The author develops a predictive model using a dataset of 68,000 patient instances and 14 attributes. Feature selection identifies the top 5 important parameters. Various machine learning methods are tested, with support vector machines achieving the best accuracy of 97% on preprocessed data. A genetic algorithm and greedy ensemble further improve the prediction accuracy to 98.5% using gradient boosting. The author concludes the model can help identify at-risk patients to minimize preventable readmissions.
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ... - IJDKP
This document discusses examining the effect of feature selection on improving patient deterioration prediction in intensive care units. The authors apply feature selection techniques to laboratory test data from the MIMIC-II database to identify the most important laboratory tests for predicting patient deterioration. They find that feature selection can help reduce redundant tests, potentially saving costs and allowing earlier treatment. The selected features provide insights into critical tests without domain expertise. In future work, the authors plan to evaluate additional feature selection methods and classification algorithms on this task.
AN IMPROVED MODEL FOR CLINICAL DECISION SUPPORT SYSTEM - ijaia
The document describes an improved model for a clinical decision support system that was developed to address issues with misdiagnosis and inconsistent healthcare records. The system incorporates both knowledge-based and non-knowledge based decision support methods using a hybrid approach. It was trained and validated using prostate cancer and diabetes datasets, achieving classification accuracies of 98% and 94% respectively. The system aims to enhance disease detection and prediction to support better healthcare delivery.
Clinical trials involve testing new drugs or treatments on human subjects to evaluate their efficacy, safety and appropriate dosages. They generally proceed through four phases, starting with animal and laboratory tests, followed by small safety and efficacy trials on humans, then larger randomized controlled trials, and finally post-marketing surveillance. Randomized controlled trials aim to reduce bias by randomly assigning similar subjects to either the test treatment or a control treatment, and collecting and analyzing outcome data while blinding investigators. Intention-to-treat analysis includes all subjects in their original assigned groups regardless of compliance or withdrawal to avoid bias from non-compliance. Multiple regression and logistic regression analyses can be used to compare outcomes between treatments while accounting for prognostic factors.
Survey on data mining techniques in heart disease prediction - Sivagowry Shathesh
This document describes a study on applying data mining techniques to analyze and predict heart disease. It discusses how data mining can extract valuable knowledge from healthcare data. The study uses several data mining techniques like decision trees, naive Bayes classification, clustering, and association rule mining on heart disease datasets from UC Irvine to predict heart disease. Experimental results show that multilayer neural networks and classification techniques like naive Bayes had higher prediction accuracy compared to other methods.
A webinar hosted by CHIME. It shared thoughts on one of my areas of interest – harnessing both business intelligence and health IT, for more effective measurement of healthcare performance.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
The document describes a study that uses principal component analysis (PCA) for feature extraction to reduce the number of clinical markers needed for disease classification. PCA was performed on prostate cancer and diabetes datasets to extract the most relevant features. For prostate cancer, PCA extracted 3 features from 4 original markers, and for diabetes it extracted 4 features from 5 original markers. When the reduced feature sets were used in a neural network, it yielded classification accuracies of 80% for prostate cancer and 75% for diabetes. The feature extraction approach aims to lower the cost of clinical decision support systems by reducing the number of tests required while maintaining accuracy.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
Patients waste a great deal of resources in the course of identifying the pathogens that caused their ailments; this calls for concern, hence the need to develop a reliable tool for minimizing the cost involved in classifying disease pathogens without compromising accuracy. In this paper, we developed a feature extraction model which reduces the clinical markers for prostate cancer and diabetes. The feature extraction, in the form of principal component analysis (PCA), was used to extract relevant components from the prostate cancer and diabetes datasets. The simulation and experiments were done with MATLAB. The system was able to extract 3 relevant features out of 4 prostate cancer clinical markers and 4 relevant features out of 5 diabetes clinical markers. The results showed that, when trained in a multilayer neural network, the extracted relevant features yielded better classification accuracy, at 80% for the prostate cancer dataset and 75% for the diabetes dataset.
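The PCA step can be sketched with NumPy on random data. The paper used MATLAB and real clinical markers; the dataset shape here is an arbitrary stand-in:

```python
# Minimal PCA feature extraction: project the data onto the top-k
# principal components of its covariance matrix.
import numpy as np

def pca_extract(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k directions
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))     # e.g. 20 patients, 5 clinical markers
Z = pca_extract(X, 4)            # reduce 5 markers to 4 extracted features
print(Z.shape)                   # (20, 4)
```

`eigh` is used because the covariance matrix is symmetric, and the columns of the result are ordered by descending eigenvalue, so the first extracted feature carries the most variance.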
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
Patients waste a great deal of resources in the course of identifying the pathogens that caused their ailments; this calls for concern, hence the need to develop a reliable tool for minimizing the cost involved in classifying disease pathogens without compromising accuracy. In this paper, we developed a feature extraction model which reduces the clinical markers for prostate cancer and diabetes. The feature extraction, in the form of principal component analysis (PCA), was used to extract relevant components from the prostate cancer and diabetes datasets. The simulation and experiments were done with MATLAB. The system was able to extract 3 relevant features out of 4 prostate cancer clinical markers and 4 relevant features out of 5 diabetes clinical markers. The results showed that, when trained in a multilayer neural network, the extracted relevant features yielded better classification accuracy, at 80% for the prostate cancer dataset and 75% for the diabetes dataset respectively.
David Madigan MedicReS World Congress 2014 - MedicReS
This document discusses issues with published observational studies and potential ways to improve them. It notes that even small choices in study design can lead to a wide range of results. Applying different analysis methods to large healthcare databases still found many false positives and negatives. However, performance improved by tailoring analyses to specific outcomes and restricting to larger sample sizes. Self-controlled designs like case-crossover consistently outperformed traditional cohort and case-control designs. Further strategies like evaluating results against known null distributions may help address biases from unmeasured confounding.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
1Big Data Analytics forHealthcareChandan K. ReddyD.docxaulasnilda
1
Big Data Analytics for
Healthcare
Chandan K. Reddy
Department of Computer Science
Wayne State University
Jimeng Sun
Healthcare Analytics Department
IBM TJ Watson Research Center
2Jimeng Sun, Large-scale Healthcare Analytics
Healthcare Analytics using Electronic Health Records (EHR)
Old way: Data are expensive and small
– Input data are from clinical trials, which is small
and costly
– Modeling effort is small since the data is limited
• A single model can still take months
EHR era: Data are cheap and large
– Broader patient population
– Noisy data
– Heterogeneous data
– Diverse scale
– Complex use cases
3Jimeng Sun, Large-scale Healthcare Analytics
Heterogeneous Medical Data
DiagnosisDiagnosis
MedicationMedication
LabLab
Clinical
notes
Clinical
notes
ImagesImages
Genetic
data
Genetic
data
4Jimeng Sun, Large-scale Healthcare Analytics
Challenges of Healthcare AnalyticsScalability ChallengesChallenges in Healthcare Analytics
Collaboration across domains
Analytic platform
Intuitive results
Scalable computation
5
PARALLEL MODEL BUILDING
6Jimeng Sun, Large-scale Healthcare Analytics
Motivation – Predictive modeling using EHR is growing
Need for scalable predictive modeling platforms/systems due to increased
computational requirements from:
– Processing EHR data (due to volume, variability, and heterogeneity)
– Building accurate models
– Building clinically meaningful models
– Validating models for accuracy and generalizability
Explosion in
interest
7Jimeng Sun, Large-scale Healthcare Analytics
What does it take to develop a predictive model using EHR?
Marina: IBM
Analytics Consultant
1
2
3
4
5
Within 3 months, we need to
1. understand business case
2. obtain the data
3. prepare the data
4. develop predictive models
5. deliver the final model
David Gotz, Harry Starvropoulos, Jimeng Sun, Fei Wang.
ICDA: A Platform for Intelligent Care Delivery Analytics, AMIA 2012
8Jimeng Sun, Large-scale Healthcare Analytics
A Generalized Predictive Modeling Pipeline
Cohort Construction: Find an appropriate set of patients with the specified
target condition and a corresponding set of control patients without the
condition.
Feature Construction: Compute a feature vector representation for each
patient based on the patient’s EHR data.
Cross Validation: Partition the data into complementary subsets for use in
model training and validation testing.
Feature Selection: Rank the input features and select a subset of relevant
features for use in the model.
Classification: The training and evaluation of a model for a specific classifier.
Output: Clean up intermediate files and to put results into their final locations.
Model specification
9Jimeng Sun, Large-scale Healthcare Analytics
Cohort Construction
A
ll
pa
tie
nt
s
D1
Disease Target samples
D1 Hypertension control 5000
D2 Heart failure onset 33K
D3 Hypertension diagnosis 300K
Cases
Controls
D3
D2
10Jimeng Sun, Large- ...
The study explores major factors that contribute to hospital readmissions via various analysis algorithms, including decision tree, neutral network and Bayesian network.
The document discusses predictive modeling for diabetes using medical claims data. It describes cleaning the data, calculating baseline statistics for diabetes, using likelihood ratios to predict diabetes risk in a validation set, and evaluating the model's sensitivity, specificity, and AUC. While the model achieves good accuracy for public policy, the conclusion cautions that individual risk predictions may not be accurate enough to guide patients due to overlaps between normal and disease risk distributions.
This document discusses biases that can arise in randomized controlled trials and meta-analyses. It notes that biases can be introduced in the design, conduct, analysis, and reporting of trials. Various empirical studies are presented that demonstrate biases from lack of allocation concealment and blinding in trials. Risk of bias assessments are recommended over quality scores for evaluating biases in individual trials and meta-analyses.
The document describes the design and testing of a decision support tool for intelligently setting guardrail limits on smart infusion pumps. The tool allows users to set guardrails based on simple rules to reduce missed detections. Testing of a rule to set thresholds to eliminate missed detections for morphine doses in adults showed that it greatly increased the false alarm rate from 46% to 99.5%, while eliminating all missed detections. The conclusions discuss expanding the tool to more drugs and rules, and reducing remaining false alarms through manual threshold adjustments.
Business Analytics with R - Using Data Mining TechniquesAnvitha Ananth
Applied various data mining techniques such as - Decision Tree, Random forest, Naive Bayes and Boosting to determine which was more accurate in predicting medical appointment no shows. Used the ROC and confusion matrix to compare the results of the various techniques applied.
Sample Size: A couple more hints to handle it right using SAS and RDave Vanz
Andrii Artemchuk from Intego Group, a Ukrainian offshore staffing company, presented this power point to the audience at a phUSE conference in Frankfurt Germany in 2018 on SAS and R
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH - cscpconf
A huge amount of medical data is generated every day, which presents a challenge in analysing
these data. The obvious solution to this challenge is to reduce the amount of data without
information loss. Dimension reduction is considered the most popular approach for reducing
data size and also to reduce noise and redundancies in data. In this paper, we investigate the
effect of feature selection in improving the prediction of patient deterioration in ICUs. We
consider lab tests as features. Thus, choosing a subset of features would mean choosing the
most important lab tests to perform. If the number of tests can be reduced by identifying the
most important tests, then we could also identify the redundant tests. By omitting the redundant
tests, observation time could be reduced and early treatment could be provided to avoid the risk.
Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art
feature selection for predicting ICU patient deterioration using the medical lab results. We
apply our technique on the publicly available MIMIC-II database and show the effectiveness of
the feature selection. We also provide a detailed analysis of the best features identified by our
approach.
Decision Support System to Evaluate Patient Readmission Risk - Avishek Choudhury
This document presents a decision support system for predicting patient readmission risk. It discusses the importance of reducing patient readmissions due to factors like healthcare costs and patient health. The author develops a predictive model using a dataset of 68,000 patient instances and 14 attributes. Feature selection identifies the top 5 important parameters. Various machine learning methods are tested, with support vector machines achieving the best accuracy of 97% on preprocessed data. A genetic algorithm and greedy ensemble further improve the prediction accuracy to 98.5% using gradient boosting. The author concludes the model can help identify at-risk patients to minimize preventable readmissions.
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ... - IJDKP
This document discusses examining the effect of feature selection on improving patient deterioration prediction in intensive care units. The authors apply feature selection techniques to laboratory test data from the MIMIC-II database to identify the most important laboratory tests for predicting patient deterioration. They find that feature selection can help reduce redundant tests, potentially saving costs and allowing earlier treatment. The selected features provide insights into critical tests without domain expertise. In future work, the authors plan to evaluate additional feature selection methods and classification algorithms on this task.
AN IMPROVED MODEL FOR CLINICAL DECISION SUPPORT SYSTEM - ijaia
The document describes an improved model for a clinical decision support system that was developed to address issues with misdiagnosis and inconsistent healthcare records. The system incorporates both knowledge-based and non-knowledge based decision support methods using a hybrid approach. It was trained and validated using prostate cancer and diabetes datasets, achieving classification accuracies of 98% and 94% respectively. The system aims to enhance disease detection and prediction to support better healthcare delivery.
Clinical trials involve testing new drugs or treatments on human subjects to evaluate their efficacy, safety and appropriate dosages. They generally proceed through four phases, starting with animal and laboratory tests, followed by small safety and efficacy trials on humans, then larger randomized controlled trials, and finally post-marketing surveillance. Randomized controlled trials aim to reduce bias by randomly assigning similar subjects to either the test treatment or a control treatment, and collecting and analyzing outcome data while blinding investigators. Intention-to-treat analysis includes all subjects in their original assigned groups regardless of compliance or withdrawal to avoid bias from non-compliance. Multiple regression and logistic regression analyses can be used to compare outcomes between treatments while accounting for prognostic factors.
Survey on data mining techniques in heart disease prediction - Sivagowry Shathesh
This document describes a study on applying data mining techniques to analyze and predict heart disease. It discusses how data mining can extract valuable knowledge from healthcare data. The study uses several data mining techniques like decision trees, naive Bayes classification, clustering, and association rule mining on heart disease datasets from UC Irvine to predict heart disease. Experimental results show that multilayer neural networks and classification techniques like naive Bayes had higher prediction accuracy compared to other methods.
A webinar hosted by CHIME. It shared thoughts on one of my areas of interest – harnessing both business intelligence and health IT, for more effective measurement of healthcare performance.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
The document describes a study that uses principal component analysis (PCA) for feature extraction to reduce the number of clinical markers needed for disease classification. PCA was performed on prostate cancer and diabetes datasets to extract the most relevant features. For prostate cancer, PCA extracted 3 features from 4 original markers, and for diabetes it extracted 4 features from 5 original markers. When the reduced feature sets were used in a neural network, it yielded classification accuracies of 80% for prostate cancer and 75% for diabetes. The feature extraction approach aims to lower the cost of clinical decision support systems by reducing the number of tests required while maintaining accuracy.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
Patients waste a great deal of resources in the course of identifying the pathogens that caused their ailments; this calls for concern, hence the need to develop a reliable tool for minimizing the cost involved in classification of disease pathogens without compromising accuracy. In this paper, we developed a feature extraction model which reduces the clinical markers for prostate cancer and diabetes. The feature extraction, in the form of principal component analysis (PCA), was used to extract relevant components from prostate cancer and diabetes datasets. The simulation and experiment of the system were done with MATLAB. The system was able to extract 3 relevant features out of 4 prostate cancer clinical markers and 4 relevant features out of 5 diabetes clinical markers. The result showed that when trained in a multilayer neural network it yielded better classification accuracy with the extracted relevant features: 80% and 75% in the prostate cancer and diabetes datasets respectively.
David Madigan MedicReS World Congress 2014 - MedicReS
This document discusses issues with published observational studies and potential ways to improve them. It notes that even small choices in study design can lead to a wide range of results. Applying different analysis methods to large healthcare databases still found many false positives and negatives. However, performance improved by tailoring analyses to specific outcomes and restricting to larger sample sizes. Self-controlled designs like case-crossover consistently outperformed traditional cohort and case-control designs. Further strategies like evaluating results against known null distributions may help address biases from unmeasured confounding.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to Advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Analysis insight about a Flyball dog competition team's performance - roli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
1. OPIM 5604 | Team 3
OPIM 5604 - PREDICTIVE MODELING
Predicting Hospital Readmission Rates within 30 days for
Diabetic Patients
TEAM 3
Yashi Sarbhai
Piyush Bishnoi
Manu Shankar
Muhammad Sanan Akbar
Mounika Paladugu
Contents
1.0 Executive Summary
2.0 Problem Statement
3.0 Methodology
3.1 Dataset Overview
3.1.1 Attributes and Target Variable Table
3.2 Data Exploration Techniques
3.2.1 Data Cleaning
3.2.2 Dimension Reduction
3.2.3 Missing Value Detection
3.2.4 Outlier Detection and Treatment
4.0 Modification
4.1 Recoding Categorical Values
4.2 Rare Event Sampling
5.0 Modeling
5.1 Nominal Logistic
5.2 Neural Networks
5.3 Decision Trees
5.4 Boosted Tree
5.5 Bootstrap Forest
5.6 Naïve Bayes
6.0 Assess
6.1 Model Comparison
6.2 Model Improvement
7.0 Results and Conclusion
7.1 Business Value of the Model
7.2 Conclusion
8.0 References
1.0 Executive Summary:
A patient is considered 're-admitted' when, after being discharged from the hospital, they need to be admitted again with the same problem within 30 days. The number of hospital readmissions indicates inefficiency in healthcare systems and additional treatment costs. Therefore, healthcare markets and government healthcare agencies use 30-day readmission as an index of the quality of treatment provided, a performance and quality-control measure, and a target for cost reduction. Identifying which patients are potential candidates for readmission will enable healthcare providers to improve their service, perform any additional investigations if needed, and preferably prevent readmission in the future.

The National Diabetic Statistics report states that 9.3% of the population in the United States has diabetes, of which 28% are still undiagnosed. According to a current US medical report, there are approximately 0.1 million diabetic patients, and readmission treatment for them costs around $250 million. The 30-day readmission rate for diabetes is found to be 13-25%, which is considerably higher than the overall rate for hospitalized patients (8-14%).
2.0 Problem Statement:
The Hospital Readmission Reduction Program, started under the Affordable Care Act, aims to improve the quality of medical treatment and reduce spending on readmissions. We are trying to predict the readmission of diabetic patients within 30 days from the given dataset. We cannot prevent readmission entirely, but the developed model and its predictions can be used to reduce readmissions if the necessary measures are taken and implemented. Real-world data for 100,061 patients was collected; it has 50 parameters covering all medical details related to patients, diagnoses, hospitals, lab tests, etc. The first major task is to identify the parameters that contribute directly to readmission and derive the trend. The collected data has a huge amount of missing values and redundant information. The developed model is expected to predict the 30-day readmission of diabetic patients with significant accuracy. The study performed describes data collection, data preparation, dimension reduction, the models deployed and their accuracy, and the interesting observations and patterns identified.
3.0 Methodology
3.1 DATASET OVERVIEW
The dataset has been extracted from the UCI machine learning repository and represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, including numerous features representing patient and hospital outcomes.
The total number of instances in the dataset is 101,766 and the total number of column attributes is 50. The target variable in this data is the "Readmitted" column, which is classified by days to readmission (<30, >30, NO).
Link: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#
3.1.1 Attributes and Target Variable Table
List of features and their descriptions in the initial dataset.
Attribute Type Description and values % missing
Encounter ID Numeric Unique identifier of an encounter 0%
Patient number Numeric Unique identifier of a patient 0%
Race Nominal
Values: Caucasian, Asian, African American, Hispanic, and
other 2%
Gender Nominal Values: male, female, and unknown/invalid 0%
Age Nominal Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) 0%
Weight Numeric Weight in pounds. 97%
Admission type Nominal
Integer identifier corresponding to 9 distinct values, for
example, emergency, urgent, elective, newborn, and not
available 0%
Discharge
disposition Nominal
Integer identifier corresponding to 29 distinct values, for
example, discharged to home, expired, and not available 0%
Admission source Nominal
Integer identifier corresponding to 21 distinct values, for
example, physician referral, emergency room, and transfer
from a hospital 0%
Time in hospital Numeric Integer number of days between admission and discharge 0%
Payer code Nominal
Integer identifier corresponding to 23 distinct values, for
example, Blue Cross/Blue Shield, Medicare, and self-pay 52%
Medical specialty Nominal
Integer identifier of a specialty of the admitting physician,
corresponding to 84 distinct values, for example,
cardiology, internal medicine, family/general practice, and
surgeon 53%
Number of lab
procedures Numeric Number of lab tests performed during the encounter 0%
Number of
procedures Numeric
Number of procedures (other than lab tests) performed
during the encounter 0%
Number of
medications Numeric
Number of distinct generic names administered during the
encounter 0%
Number of
outpatient visits Numeric
Number of outpatient visits of the patient in the year
preceding the encounter 0%
Number of
emergency visits Numeric
Number of emergency visits of the patient in the year
preceding the encounter 0%
Number of
inpatient visits Numeric
Number of inpatient visits of the patient in the year
preceding the encounter 0%
Diagnosis 1 Nominal
The primary diagnosis (coded as first three digits of ICD9);
848 distinct values 0%
Diagnosis 2 Nominal
Secondary diagnosis (coded as first three digits of ICD9);
923 distinct values 0%
Diagnosis 3 Nominal
Additional secondary diagnosis (coded as first three digits
of ICD9); 954 distinct values 1%
Number of
diagnoses Numeric Number of diagnoses entered to the system 0%
Glucose serum test
result Nominal
Indicates the range of the result or if the test was not taken.
Values: “>200,” “>300,” “normal,” and “none” if not
measured 0%
A1c test result Nominal
Indicates the range of the result or if the test was not taken.
Values: “>8” if the result was greater than 8%, “>7” if the
result was greater than 7% but less than 8%, “normal” if the
result was less than 7%, and “none” if not measured. 0%
Change of
medications Nominal
Indicates if there was a change in diabetic medications
(either dosage or generic name). Values: “change” and “no
change” 0%
Diabetes
medications Nominal
Indicates if there was any diabetic medication prescribed.
Values: “yes” and “no” 0%
23 features for
medications Nominal
For the generic names: metformin, repaglinide, nateglinide,
chlorpropamide, glimepiride, acetohexamide, glipizide,
glyburide, tolbutamide, pioglitazone, rosiglitazone,
acarbose, miglitol, troglitazone, tolazamide, examide,
insulin, glyburide-metformin, glipizide-metformin,
glimepiride-pioglitazone, metformin-rosiglitazone, and
metformin-pioglitazone, the feature indicates whether the
drug was prescribed or there was a change in the dosage.
Values: “up” if the dosage was increased during the
encounter, “down” if the dosage was decreased, “steady” if
the dosage did not change, and “no” if the drug was not
prescribed 0%
Readmitted Nominal
Days to inpatient readmission. Values: “<30” if the patient
was readmitted in less than 30 days, “>30” if the patient
was readmitted in more than 30 days, and “No” for no
record of readmission. 0%
3.2 Data Exploration Techniques
The core objective of the data exploration step is to remove redundant data: rows and columns that are less significant in predicting the target variable. The below steps were followed in exploring and processing the data:
3.2.1 Data Cleaning
A new value "NA" was created and imputed for the insignificant values in the instances:
1. Admission_type_id -> Values 5, 6, 8, which represent Not Available, NULL and Not Mapped, are converted to NA, and 7, which represents Trauma Center, is recoded to 5.
2. Discharge_disposition_id -> 18, 25, 26 (NULL, Not Mapped, Invalid) converted to NA.
3. Admission_source_id -> 9, 15, 17, 20, 21 (Not Available, NULL, Not Mapped, Invalid) converted to NA.
3.2.2 Dimension Reduction
A dimension reduction technique was deployed in the exploration process to reduce the number of variables. The target was to identify the minimum number of relevant attributes with non-overlapping information. The below measures were taken:
3.2.2.1 Column Removal.
The following columns were removed from the dataset.
S. No. Attribute Reason for Removal
1. Weight 98,569 values were missing from a total of 101,766 rows, which accounts for 96.85%
2. Payer Code Payer code signifies the mode of payment for different patients and is not very significant to the problem statement
3. Medical specialty Medical specialty has around 50% missing values and thus needs to be removed
4. Diag1, Diag2, Diag3 These are nominal variables with around 1,000 distinct possible values each; they are codes used for medical purposes. These columns were removed to reduce the complexity of the model (a trade-off between complexity and accuracy)
3.2.2.2 Derived Column
medical_procedures: This column is a summation of num_lab_procedures, num_procedures and num_medications. It represents the individual's interaction with the hospital.
previous_number_of_visits: number_outpatient, number_emergency and number_inpatient were converted into a single attribute, which is the sum of the three columns.
diabetes_medications: The 23 medication columns, metformin through metformin-pioglitazone, were converted to a scale of 0 (no) and 1 (yes). These values were then summed to reduce the number of columns from 23 to 1.
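The three derived columns can be sketched as below. This is an illustrative pandas version using the UCI column names; the two example rows are invented, and only two of the 23 drug columns are shown as stand-ins.

```python
import pandas as pd

# Two invented encounters; column names follow the UCI dataset.
df = pd.DataFrame({
    "num_lab_procedures": [41, 59],
    "num_procedures": [0, 5],
    "num_medications": [1, 18],
    "number_outpatient": [0, 2],
    "number_emergency": [0, 1],
    "number_inpatient": [0, 1],
    "metformin": ["No", "Steady"],   # stand-ins for the 23 drug columns
    "insulin": ["Up", "No"],
})

# medical_procedures: total procedures/medications during the encounter
df["medical_procedures"] = (df["num_lab_procedures"]
                            + df["num_procedures"]
                            + df["num_medications"])

# previous_number_of_visits: all visits in the year preceding the encounter
df["previous_number_of_visits"] = (df["number_outpatient"]
                                   + df["number_emergency"]
                                   + df["number_inpatient"])

# diabetes_medications: each drug column becomes 0/1 ("No" -> 0, any of
# Steady/Up/Down -> 1), then the indicators are summed into one count
drug_cols = ["metformin", "insulin"]
df["diabetes_medications"] = (df[drug_cols] != "No").sum(axis=1)
```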
3.2.2.3 Principal Component Analysis.
Principal component analysis was computed for 5 attributes (time in hospital, medical_procedures, previous visits, number of diagnoses, diabetes_medications). 4 components out of five were chosen. This decision was made keeping in mind the complexity versus the accuracy of the model.
As a result of all these techniques, we were able to reduce the attributes from 50 to 18, which means that dimensionality was reduced by more than 50%.
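The PCA step (done in JMP in the project) can be sketched with plain NumPy via the SVD of the standardized data matrix; the data here is random stand-in data, not the dataset itself.

```python
import numpy as np

# Random stand-in data: 200 encounters x 5 attributes
# (time in hospital, medical_procedures, previous visits,
#  number of diagnoses, diabetes_medications).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before PCA

# PCA via SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / (S**2).sum()            # variance explained per component

k = 4                                      # keep 4 of the 5 components
scores = X @ Vt[:k].T                      # reduced data, shape (200, 4)
```

In practice `k` is chosen by inspecting `explained`, the same complexity-versus-accuracy trade-off described above.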
3.2.3 Missing Value Detection
Missing value detection and treatment played an essential role in data processing. The below two approaches were followed to treat the missing values:
● All the missing values were first identified and imputed with a value called 'NA'
● Attributes that had a majority of 'NA' values were dropped from the dataset
3.2.4 Outlier Detection and Treatment.
As part of outlier detection, we identified the attributes containing outliers. From the business perspective, these were not outliers but actual values; hence, no values were removed. For example, the number of visits ranged from 0 to 80, but such values cannot be ruled out.
4.0 Modification
4.1 Recoding Categorical Values
Using the "Recode" option in JMP, categorical values of different attributes were reassigned to suit the business needs and the research conducted.
S. No. Attributes Recoded Values Applied
1 Age According to various studies, age groups were coded as below:
● 0-40 years → 1
● 40-70 years → 2
● Above 70 → 3
2 Max_glu_serum Max glucose serum signifies the sugar levels:
● None → 0
● Normal → 1
● Abnormal (>200 & >300) → 2
3 A1Cresult The A1C result is a blood test that reflects average blood glucose levels over the past 3 months. The results were coded as:
● None → 0
● Normal → 1
● Abnormal (>7 & >8) → 2
4 Medications All the diabetes medications (with values No, Steady, Up, Down) were coded as 0 or 1:
● Steady, Up, Down → 1
● No → 0
4.2 Rare Event Sampling
● Simple random sampling may produce too few of the rare class to yield useful information about what distinguishes it from the dominant class. In such cases, stratified sampling is often used to oversample cases from the rare class and improve the performance of classifiers.
● In our case, the proportion of the target variable equal to 'yes' was too rare to produce any accurate results. Hence, stratified sampling was applied to gain a balanced ratio of 'yes' in the data sample.
● As a result, the total number of instances was reduced to 38,525, with 11,357 rows having the target variable "Readmitted" as "yes".
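The rebalancing step can be sketched as below: positives are kept and the dominant class is subsampled until positives reach a target share. The toy frame and the exact 30% target are illustrative; 30% roughly mirrors the 11,357-of-38,525 split reported above.

```python
import pandas as pd

# Toy frame with a rare positive class (10% "yes"), mimicking the
# imbalance of the readmission target.
df = pd.DataFrame({"readmitted_30": ["yes"] * 100 + ["no"] * 900})

pos = df[df["readmitted_30"] == "yes"]
neg = df[df["readmitted_30"] == "no"]

# Undersample the dominant class so positives make up ~30% of the sample.
target_ratio = 0.3
n_neg = int(len(pos) * (1 - target_ratio) / target_ratio)
sample = pd.concat([pos, neg.sample(n=n_neg, random_state=0)])
```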
5.0 Modeling
This is supervised learning for classification. After the data was processed, various classification models were tested to identify the model with maximum accuracy and AUC and minimum misclassification rate. The below models were executed to study their performance.
5.1 Nominal Logistic
5.6 Naïve Bayes
6.0 Assess
6.1 Model Comparison:
Model                 False Negatives (readmitted but predicted as No)   Accuracy   AUC      Misclassification rate
Logistic Regression   932                                                63.7%      0.6612   0.36
Neural Networks       844                                                60.6%      0.6570   0.39
Decision Trees        877                                                58.1%      0.6299   0.41
Boosted Tree          952                                                61.4%      0.6365   0.38
Bootstrap Forest      895                                                60.9%      0.6515   0.39
Naïve Bayes           1098                                               65.5%      0.6412   0.35
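Each column of the comparison table is computed from a confusion matrix. A minimal sketch on invented labels and predictions (1 = readmitted within 30 days):

```python
import numpy as np

# Invented labels/predictions, purely to show the formulas.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 1])

# False negatives: patients who were readmitted but predicted as "No"
false_negatives = int(((y_true == 1) & (y_pred == 0)).sum())

accuracy = float((y_true == y_pred).mean())
misclassification_rate = 1.0 - accuracy
```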
Out of all the models, we are prioritizing false negatives. The Neural Networks model has the lowest false negatives, although its total accuracy is lower than that of Naïve Bayes, Logistic Regression and the others. So, we choose this model. False negatives imply wrongly predicting a readmitted patient as 'No', which means we may fail to provide proper treatment to those patients.
6.2 Model Improvement
We altered the cutoff, trying different values and lowering it to reduce false negatives, trading off total accuracy against the reduction in false negatives. Finally, we brought false negatives down from 844 to 591 while maintaining our total accuracy.
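The cutoff adjustment works as sketched below: a patient is flagged when the predicted probability meets the cutoff, so lowering the cutoff flags more patients and converts some false negatives into (cheaper) false positives. The probabilities here are invented, not model output.

```python
import numpy as np

# Invented predicted probabilities of readmission for six patients.
y_true = np.array([1, 1, 1, 0, 0, 0])
p_hat = np.array([0.70, 0.47, 0.30, 0.48, 0.20, 0.10])

def false_negatives(cutoff):
    """Count readmitted patients the model fails to flag at this cutoff."""
    y_pred = (p_hat >= cutoff).astype(int)
    return int(((y_true == 1) & (y_pred == 0)).sum())

fn_050 = false_negatives(0.50)   # the patient at p = 0.47 is missed
fn_045 = false_negatives(0.45)   # 0.47 is now flagged: one fewer miss
```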
7.0 Results and Conclusion:
Hospitals should use the Neural Network model to predict whether a patient is likely to be readmitted within 30 days. The cutoff for the Neural Network model should be kept at 0.45, as this achieves the target of minimizing false negatives. Because stratified sampling was used to address the rarity of the target variable, the reported accuracy of the model may be compromised.
7.1 Business Value of the Model:
Our project develops a predictive tool for health service providers to identify patients at risk of readmission within 30 days, with an accuracy of 60.6%. From the model profiler, we can see that the number of inpatient visits contributes most to the risk of readmission, so hospitals should provide more care to those patients.
Model cost with cutoff 0.45:
False negatives = 591, false positives = 9,471; cost = (591 × $1,000 + 9,471 × $200) = $2,485,200