Healthy Analytics - Prevention through Prediction

CLIENT Health

Heart Failure: Prevention through Prediction

12/17/2015

Table of Contents
Abstract
Problem Statement
Assumptions and Limitations
Data Preparation
Initial Findings
Data Processing
Modeling Results
KNN – Model Analysis
The Rapid Miner Model is described as –
Naïve Bayesian – Model Analysis
Recommendations
Summary
Appendix
Unsuccessful Models
Unsuccessful Probit Models

Abstract

Heart failure is a growing problem in the United States. It is a leading cause of
hospitalization among adults over 65 and claims a life every 60 seconds. It also costs
roughly $17 billion annually in the United States. An opportunity exists to reduce costs by
lowering the readmittance rate: about 50% of US patients are currently readmitted within six
months of their initial treatment (Akshay S. Desai & Lynne W. Stevenson, 2012).
For CLIENT specifically, 68% of patients were readmitted within 90 days of being released,
costing $20 million. We created one model to predict whether a patient is at high risk of
generating above average charges, and two more models to identify which treatments are
associated with a significant reduction in readmittance. Early detection of patients that fall
into either category will give CLIENT an upper hand in managing costs and providing effective
treatment.
Problem Statement
We sought to reduce the cost of heart failure by creating specialized care for patients at
high risk of generating above average charges and by identifying effective treatments that
lead to fewer readmittances.

Data Preparation
Data purification, or cleansing, is a critical activity in any project for creating successful
models. We started by noting the seven tables we received and summarizing the data
contained within:

1. ND_HF_1_INDEX_20151111 Index File for Heart Attack
2. ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data
3. ND_HF_4_REIMBURSEMENT_CLINIC_20151111 Reimbursement clinic Data
4. ND_HF_5_FINANCIAL_HOSPITAL_20151111 Financial Hospital Data
5. ND_HF_6_DIAGNOSIS_20151111 Diagnosis Data
6. ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 Hospital ICD procedure Data
7. ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 Hospital CPT procedure Data

The graphic below shows the data files and their relationships:

Initial Findings

1. All seven tables contain a PATIENT_IDENTIFIER.
2. The index file for heart attack, financial hospital data, hospital ICD procedure data and
hospital CPT procedure data all contain PATIENT_IDENTIFIER and HAR_IDENTIFIER.
3. The demographic data table and the index file for heart attack have a 1:1 relationship.
4. Reimbursement clinic data is a fact table and contains only the clinical procedures and
related cost.
5. The financial hospital data table, hospital ICD procedure data table and hospital CPT
procedure data table are related by PATIENT_IDENTIFIER and HAR_IDENTIFIER. The
financial hospital data table is the superset, and the hospital ICD procedure data table and
hospital CPT procedure data table are subsets. To get all the procedures we produced a
left outer join between these three tables.
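
The left outer join in finding 5 can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the scripts used in the project. The short table names, the TOTAL_CHARGES, ICD_DESCRIPTION and CPT_DESCRIPTION columns, and all rows are hypothetical; only PATIENT_IDENTIFIER and HAR_IDENTIFIER come from the findings above.

```python
import sqlite3

# Toy data only: table names, description/charge columns and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE financial (PATIENT_IDENTIFIER INTEGER, HAR_IDENTIFIER INTEGER, TOTAL_CHARGES REAL);
CREATE TABLE icd_procedure (PATIENT_IDENTIFIER INTEGER, HAR_IDENTIFIER INTEGER, ICD_DESCRIPTION TEXT);
CREATE TABLE cpt_procedure (PATIENT_IDENTIFIER INTEGER, HAR_IDENTIFIER INTEGER, CPT_DESCRIPTION TEXT);
INSERT INTO financial VALUES (1, 10, 5000.0), (2, 20, 1200.0), (3, 30, 800.0);
INSERT INTO icd_procedure VALUES (1, 10, 'RIGHT HEART CARDIAC CATHETERIZATION');
INSERT INTO cpt_procedure VALUES (1, 10, 'ELECTROCARDIOGRAM, TRACING'), (2, 20, 'CHEST XRAY');
""")

# The financial table is the superset, so it sits on the left of both joins.
# Encounters without a matching ICD or CPT row are kept, with NULL descriptions.
rows = conn.execute("""
SELECT f.PATIENT_IDENTIFIER, f.HAR_IDENTIFIER, i.ICD_DESCRIPTION, c.CPT_DESCRIPTION
FROM financial AS f
LEFT OUTER JOIN icd_procedure AS i
  ON i.PATIENT_IDENTIFIER = f.PATIENT_IDENTIFIER AND i.HAR_IDENTIFIER = f.HAR_IDENTIFIER
LEFT OUTER JOIN cpt_procedure AS c
  ON c.PATIENT_IDENTIFIER = f.PATIENT_IDENTIFIER AND c.HAR_IDENTIFIER = f.HAR_IDENTIFIER
ORDER BY f.PATIENT_IDENTIFIER
""").fetchall()
```

Keeping the financial table on the left means no hospital encounter is lost when it has no recorded ICD or CPT procedure.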

CLIENT provided metadata and data element descriptions, which were critical to our
interpretation of the tables. Upon initial analysis, we observed significant value in preventing
repeat heart attacks. The category was later broadened to heart related procedures, which we
manually noted in the processed file. To prepare our data for modeling and further analysis we
performed the following steps:

Process: Data Cleansing
Description: Data cleansing started very early in the project and continued up to the launch of
the models to ensure consistent results. In the first round of data cleansing we removed blank
values, null values, 0 values, white spaces and extra characters which were not required.
Tools: MS Access, MS Excel, SQL Scripts

Process: Data Profiling
Description: Data profiling is an analysis of the candidate data sources to clarify the
structure, content, relationships and derivation rules of the data. Profiling helps not only to
understand anomalies and to assess data quality, but also to discover, register, and assess
enterprise metadata. We performed range analysis, completeness analysis, pattern analysis and
value distribution analysis as well.
Tools: R Code, MS Excel, Tableau

Process: Deduplication
Description: Data deduplication is a method of reducing storage needs by eliminating redundant
data. Only one unique instance of the data is actually retained on storage media. Redundant
data is replaced with a pointer to the unique data copy. In this process we identified the
unique heart related procedures and unique clinic procedures.
Tools: MS SQL, MS Excel and Python

Process: Data Segmentation
Description: Our segmentation included three groups from the CLIENT data: patients before heart
attack, patients at the time of heart attack and patients after heart attack.
Tools: MS SQL, MS Excel, Python

Process: Data Aggregation
Description: Data was aggregated to create an analysis summary. In this process we combined the
various procedures to identify a unique patient before, during and after heart attack.
Tools: MS SQL, MS Excel

Process: Fuzzy Matching
Description: Approximate string matching (often colloquially referred to as fuzzy string
searching) is the technique of finding strings that match a pattern approximately (rather than
exactly). In this process we matched heart related procedures like echocardiogram, cardiac,
coronary etc.
Tools: MS SQL, MS Excel, R, Heart Specialist Dr.

Process: Cross Ref
Description: A cross reference table was built to store unique heart related procedures from
the fuzzy matching data set, to allow for easier processing. The cross reference table used a
unique code for each unique procedure.
Tools: MS Excel, MS Access, Heart Specialist Dr.

Process: Final Data Set
Description: In each step above we built validation reports to verify how many records were
dropped in each step and how many records were passed through into the new data set. The final
data set contained 544 records with the addition of a visit column and description code. The
visit column identified return visits (binary) and the description column contained procedure
codes.
Tools: MS Excel, MS Access, Tableau
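
The fuzzy matching and cross reference steps can be illustrated with a short Python sketch based on the standard library's difflib module. It is only an approximation of the idea: the project itself used MS SQL, MS Excel and R with review by a heart specialist. The keyword list, the 0.8 cutoff and the function names are assumptions made for this example.

```python
import difflib

# Hypothetical keyword list; the report names echocardiogram, cardiac and
# coronary as examples of heart related terms.
HEART_KEYWORDS = ["echocardiogram", "cardiac", "coronary", "heart"]

def is_heart_related(description, cutoff=0.8):
    """Return True if any word approximately matches a heart keyword."""
    words = description.lower().replace(",", " ").split()
    return any(
        difflib.get_close_matches(word, HEART_KEYWORDS, n=1, cutoff=cutoff)
        for word in words
    )

def build_cross_reference(descriptions):
    """Assign one code per unique heart related description (A1A1, A1A2, ...)."""
    xref = {}
    for description in descriptions:
        if is_heart_related(description) and description not in xref:
            xref[description] = "A1A%d" % (len(xref) + 1)
    return xref

sample = [
    "CORONARY ARTERIOGRAPHY USING TWO CATHETERS",
    "CHEST XRAY",
    "RIGHT HEART CARDIAC CATHETERIZATION",
    "CORONARY ARTERIOGRAPHY USING TWO CATHETERS",  # duplicate gets no new code
]
xref = build_cross_reference(sample)
```

A cutoff below 1.0 tolerates spelling variants and plurals, which is why a manual review of the matches is still advisable.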

Below is a graphical representation of the cleansing process:

Data Dictionary provided by CLIENT Health Care:

Data Processing
A top down approach was used for loading data into various staging areas. The steps we
followed to load data into the various staging areas and the final extract are described below.
1. Match ND_HF_1_INDEX_20151111 (Index File for Heart Attack) with
ND_HF_3_DEMOGRAPHICS_20151111 (Demographic Data) to build the patient details by demography.
2. Left outer join ND_HF_5_FINANCIAL_HOSPITAL_20151111 (Financial Hospital Data) with
ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 (Hospital ICD procedure Data) and
ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 (Hospital CPT procedure Data) by PATIENT_IDENTIFIER
and HAR_IDENTIFIER to collect all financial data, ICD procedure data and CPT procedure data.
3. Create 3 sets of data:
a. Data prior to the heart attack (day 0): ADMIT_DAYS_FROM_INDEX < 0
b. Data at the time of the heart attack (day 0): ADMIT_DAYS_FROM_INDEX = 0
c. Data after the heart attack (day 0): ADMIT_DAYS_FROM_INDEX > 0
4. Join ND_HF_1_INDEX_20151111 (Index File for Heart Attack), ND_HF_3_DEMOGRAPHICS_20151111
(Demographic Data) and ND_HF_4_REIMBURSEMENT_CLINIC_20151111 (Reimbursement clinic Data) by
PATIENT_IDENTIFIER, and create the same 3 sets of data:
a. Data prior to the heart attack (day 0): ENCOUNTER_DAYS_FROM_INDEX < 0
b. Data at the time of the heart attack (day 0): ENCOUNTER_DAYS_FROM_INDEX = 0
c. Data after the heart attack (day 0): ENCOUNTER_DAYS_FROM_INDEX > 0
5. From steps 3 and 4, capture the procedure descriptions.
6. For the less than 0, equal to 0 and greater than 0 data sets from step 3, use SQL scripts to
transpose and deduplicate the data.
7. For the less than 0, equal to 0 and greater than 0 data sets from step 4, use SQL scripts to
transpose and deduplicate the data.
8. Aggregate the data into unique rows by patient ID for the less than 0, equal to 0 and greater
than 0 data sets.
9. Merge the less than 0, equal to 0 and greater than 0 data sets for clinical and hospital
procedures.
10. Match PATIENT_IDENTIFIER with the data set from step 8 and identify the matching patient
identifiers across the three data sets.
11. Match PATIENT_IDENTIFIER with the data set from step 9 and identify the matching patient
identifiers across the three data sets.
12. Merge the data sets for matched patients across the less than 0, equal to 0 and greater than
0 sets.
13. Apply the fuzzy match to the procedures to find the heart related procedures, and validate
the results with the heart specialist.
14. The data set from step 12 contains 544 records and 75 procedure fields.
15. Remove the patient ID, as the objective of this exercise is to identify whether patients who
had certain heart related procedures before or during the heart attack will be back for a heart
related procedure.
16. Uniquely identify the heart related procedures and clinical procedures and code them as
“A1A1”, “A1A2” and so on. This treats each procedure as one heart related procedure instead of
treating each word of its description as an individual heart related procedure. To build this
logic we used a cross reference table with unique procedure descriptions and related unique
values.
17. Add two fields to the data set:
a. Visit – Values “Yes” and “No”. Yes captures a visit after heart attack for a heart related
procedure. No captures that there was no visit after heart attack for a heart related procedure.
b. Description – Related unique values from the cross reference table.
18. Finalize the data set with 544 records and a mix of patients with and without an after visit
for heart related procedures.
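
The segmentation in steps 3 and 4 and the Visit field in step 17 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the SQL scripts used in the project: the record layout (patient, days from index, description), the function names, and the rule that only patients seen before or at the index event receive a label are all hypothetical.

```python
from collections import defaultdict

def segment(days_from_index):
    """Map a days-from-index value to the before / during / after group."""
    if days_from_index < 0:
        return "before"
    if days_from_index == 0:
        return "during"
    return "after"

def build_visit_labels(records, heart_related):
    """records: iterable of (patient_id, days_from_index, description).

    Returns {patient_id: "Yes" or "No"} for patients seen before or at the
    index event, where "Yes" means a heart related procedure after it.
    """
    groups = defaultdict(lambda: defaultdict(set))
    for patient_id, days, description in records:
        # Storing descriptions in a set deduplicates repeated procedures.
        groups[patient_id][segment(days)].add(description)
    labels = {}
    for patient_id, by_segment in groups.items():
        if not (by_segment["before"] or by_segment["during"]):
            continue  # assumption: no history at or before the index event
        returned = any(d in heart_related for d in by_segment["after"])
        labels[patient_id] = "Yes" if returned else "No"
    return labels

records = [
    (1, -5, "CHEST XRAY"),
    (1, 0, "RIGHT HEART CARDIAC CATHETERIZATION"),
    (1, 30, "RIGHT HEART CARDIAC CATHETERIZATION"),
    (2, 0, "CHEST XRAY"),
    (2, 12, "CHEST XRAY"),
    (3, 40, "RIGHT HEART CARDIAC CATHETERIZATION"),
]
labels = build_visit_labels(records, {"RIGHT HEART CARDIAC CATHETERIZATION"})
```
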
Additional Data Mining steps necessary for predictive models

Process: Data Aggregation 1
Description: Consolidated files 1 and 3 below to build the full Demo and Admittance data file,
to build regression models against:
● ND_HF_1_INDEX_20151111 Index File for Heart Attack
● ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data
Tools: MS Excel

Process: Data Aggregation 2
Description:
● ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 Hospital ICD procedure Data
● ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 Hospital CPT procedure Data
Tools: MS Excel

Process: Data Aggregation 3
Description: Combination of Data Aggregation 1 and Data Aggregation 2, using individual
Patient ID as the key.
Tools: MS Excel

Process: Data Aggregation 4
Description: Manual tag of HIGH SPEND patients, using the file below to separate patients into
terciles, with the top tercile representing HIGH COST patients:
● ND_HF_4_REIMBURSEMENT_CLINIC_20151111 Reimbursement clinic Data
Tools: MS Excel

Process: Final Data Set
Description: All four data sets above were used across multiple regression models in an attempt
to predict 1) Return for Heart Related Procedure 2) Patient cost 3) High Patient cost (Binary).
Tools: MS Excel, R
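
The tercile tagging in Data Aggregation 4 was done manually in MS Excel. As a rough programmatic equivalent, the Python sketch below flags patients above the upper tercile cut point. The function name, the patient keys and the cost values are hypothetical, and statistics.quantiles uses its default interpolation, which may differ slightly from a spreadsheet percentile formula.

```python
import statistics

def tag_high_cost(total_cost_by_patient):
    """Flag (1/0) the patients whose total cost falls in the top tercile."""
    costs = list(total_cost_by_patient.values())
    # quantiles(n=3) returns the two cut points that split the data into terciles.
    _, upper_cut = statistics.quantiles(costs, n=3)
    return {
        patient: int(cost > upper_cut)
        for patient, cost in total_cost_by_patient.items()
    }

# Toy totals per patient, for illustration only.
costs = {"p1": 100, "p2": 200, "p3": 300, "p4": 400, "p5": 500, "p6": 600}
flags = tag_high_cost(costs)
```

The resulting 0/1 flag is the kind of binary target a Probit model can be fitted against.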

Modeling Results
To predict whether a patient would return for a heart related procedure, the following methods
were used –
1. KNN – We used the KNN model to predict whether a patient will visit the hospital for a
heart procedure, given that he or she had visited hospitals before and during the heart attack
and had heart related procedures.
2. Naïve Bayesian – We used the same data set for the Naïve Bayesian model. With the
help of the Naïve Bayesian model we predicted whether a patient will visit the hospital for a
heart procedure, given that he or she had visited hospitals before and during the heart attack
and had heart related procedures.

We used the tools R and Rapid Miner for the following results.
KNN – Model Analysis
In this model we used Rapid Miner to analyze our processed data set containing 544 records.
The following steps were used:
1. Store the cleaned data in the source folder for analysis
2. Set the role of the after visit attribute to label, as this is what we will be predicting
3. Split the data into 70% training and 30% test
4. Apply the KNN model
5. Use K = 25 for the first run
6. Monitor the performance
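
For readers without Rapid Miner, the steps above can be approximated with a small Python sketch. It is not the Rapid Miner process itself: the Hamming distance on binary procedure indicators, the fixed random seed and the toy rows are assumptions, and the toy example uses k = 3 rather than the K = 25 of the first run.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of positions where two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k):
    """train: list of (features, label). Majority vote among the k nearest rows."""
    nearest = sorted(train, key=lambda row: hamming(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def split_70_30(rows, seed=0):
    """Shuffle the rows and split them into 70% training and 30% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * 7 // 10
    return rows[:cut], rows[cut:]

def accuracy(train, test, k):
    """Share of test rows whose predicted label equals the actual label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)

# Toy rows: each feature marks whether a given procedure was recorded.
train = [
    ((1, 1, 0), "Yes"), ((1, 0, 0), "Yes"), ((1, 1, 1), "Yes"),
    ((0, 0, 1), "No"), ((0, 0, 0), "No"), ((0, 1, 1), "No"),
]
test = [((1, 1, 0), "Yes"), ((0, 0, 1), "No")]
toy_accuracy = accuracy(train, test, k=3)
```

A smaller k makes the vote more local, which is one reason to compare several values of K as done in the runs reported below.
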
Output
TP – True Positive – Classified the after visit for heart procedure as “Yes” when the actual is
“Yes” (Classified correctly)
FP – False Positive – Classified the after visit for heart procedure as “Yes” when the actual is
“No” (Classified incorrectly)
FN – False Negative – Classified the after visit for heart procedure as “No” when the actual is
“Yes” (Classified incorrectly)
TN – True Negative – Classified the after visit for heart procedure as “No” when the actual is
“No” (Classified correctly)

With the help of the 2x2 confusion matrix we can formalize our definition of the prediction
accuracy (success rate):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The class prediction for a true after visit is 61.54%.
The class prediction for no after visit is 82.75%.
The overall accuracy of the model is 81.90%.

We readjusted the model and reran it 5 times with different subsets of attributes. The model
accuracy increased from 81.90% as follows:

Run Sequence Accuracy K Value
1 81.90% 25
2 82.54% 25
3 84.65% 25
4 86.84% 25
5 88.90% 25
6 88.90% 25

Once the accuracy leveled off at 88.90%, we changed the K value from 25 to 10.

After executing the steps above with K = 10, we obtained the following confusion matrix:

Overall Accuracy of the Model is 89.57%

We readjusted the model and ran it 7 times with different subsets of attributes. Compared with
the initial 81.90%, the model accuracy was as follows:

Run Sequence Accuracy K Value
1 89.57% 10
2 89.57% 10
3 89.57% 10
4 89.57% 10
5 88.90% 10
6 87.90% 10
7 89.57% 10

Model accuracy increased when we changed the K value from 25 to 10 and adjusted the
number of columns. After the 4th attempt the model accuracy dropped, so we changed the model
back to the run 4 settings and locked it with K = 10 for comparison with the Naïve Bayesian
model.

The Rapid Miner Model is described as –

Below is a list of procedures and whether patients who had them returned after the heart attack
for a heart related procedure –

Heart Related Procedures After Visit
IMPL CRT PACEMAKER SYS Yes
Iv infusion therapy/prophylaxis dx 1st >1 hou Yes
RIGHT HEART CARDIAC CATHETERIZATION Yes
OTHER ELECTRIC COUNTERSHOCK OF HEART Yes
CORONARY ARTERIOGRAPHY USING TWO CATHETERS Yes
CHEST XRAY No
INSERTION OF ONE VASCULAR STENT No
ANGIOCARDIOGRAPHY OF LEFT HEART STRUCTURES No
ELECTROCARDIOGRAM, TRACING No
CL:URINALYSIS, AUTO W/SCOPE No
CL:ROUTINE VENIPUNCTURE No

The Total List –

Naïve Bayesian – Model Analysis
After using the heart procedure data to predict repeat visits for heart procedures after the
heart attack, we took a different approach and applied the Naïve Bayesian model. From our
data mining class we learned that not all mining models are good for all business problems.
The Naïve Bayesian model is typically used for the following business cases –
1. Text classification, such as junk email (spam) filtering, author identification, or topic
categorization
2. Intrusion detection and anomaly detection in computer networks
3. Diagnosing medical conditions, given a set of observed symptoms
Typically, Bayesian classifiers are best applied to problems in which the information from
numerous attributes should be considered simultaneously in order to estimate the probability
of an outcome.

Classifying the Heart Related Procedures from the dataset
We looked into the set of procedures and used the fuzzy match logic to identify the heart
related procedures. Once we identified the heart related procedures, we tagged them with a
unique identifier.
We used the R package “tm” to clean up the procedure descriptions for both the training and
test data, using the following steps –
1. Convert all words to lower case using tm_map(Description_corpus, tolower)
2. Remove stop words using tm_map(Description_corpus, removeWords, stopwords())
3. Remove punctuation using tm_map(Description_corpus, removePunctuation)
4. Remove white space using tm_map(Description_corpus, stripWhitespace)
5. Convert the documents to plain text using tm_map(Description_corpus,
PlainTextDocument)
Then we created the document term matrix from the cleaned descriptions.
> Description_dtm <- DocumentTermMatrix(Description_clean)
> dim(Description_dtm)
[1]  539 1279

> Description_dtm
<<DocumentTermMatrix (documents: 539, terms: 1279)>>
Non-/sparse entries: 9639/679742
Sparsity           : 99%
Maximal term length: 7
Weighting          : term frequency (tf)


After cleanup we used 300 records for training and 254 for testing. We ran the R script for the
word cloud, and the procedures associated with a possible return are listed as:

We ran the R script for the word cloud, and the procedures associated with no return are listed as:

Classifying the possible patient visit for heart related procedure after heart related procedure
before and at the time of heart attack
After identifying the heart related procedures before the heart attack and at the time of the
heart attack, we used the “naiveBayes” classifier from the package “e1071” to classify the
visits, and then used the “predict” function to classify each after visit as being for a heart
related procedure or for a non heart related procedure.
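
The idea behind the classifier can be sketched in plain Python. The sketch below is a multinomial Naïve Bayes with add-one (Laplace) smoothing over description words; it illustrates the technique only and is not the e1071 implementation or its API. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """documents: list of token lists; labels: parallel list of class names."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, tokens):
    """Return the class with the highest posterior log probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for w in tokens:
            if w in vocabulary:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities for unseen pairs.
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, for illustration only.
docs = [
    ["coronary", "atherectomy"],
    ["aortic", "valve", "replacement"],
    ["insulin", "injection"],
    ["eye", "fluid", "removal"],
]
labels = ["After_Visit", "After_Visit", "No_After_Visit", "No_After_Visit"]
model = train_naive_bayes(docs, labels)
prediction = predict_naive_bayes(model, ["coronary", "valve"])
```

Each word contributes independently to the class score, which is the "naïve" assumption that lets many attributes be combined at once.
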
Below is a list of procedures and whether patients who had them returned after the heart attack
for a heart related procedure:
Heart Related Procedures After Visit
TRANSLUMINAL CORONARY ATHERECTOMY Yes
ENDOVASCULAR REPLACEMENT OF AORTIC VALVE Yes
Echo tthrc rt 2d +mmode rest&strs cont EC Yes
CL:R&l hrt cath w/injec hrt art/grft&l vent img s& Yes
INSULIN INJECTION PR 5 UNITS No
REMOVAL OF INNER EYE FLUID No
FLUOROGUIDE FOR SPINE INJECT No
MONOXIDE DIFFUSING CAPACITY No
ESOPHAGUS MOTILITY STUDY No
CT ANGIOGRAPHY, HEAD No
EXHALED CARBON DIOXIDE TEST No

Confusion Matrix from Naïve Bayesian Model

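               | Actual
     Predicted |    After_Visit | No_After_Visit |      Row Total |
---------------|----------------|----------------|----------------|
   After_Visit |              3 |             20 |             23 |
               |          0.130 |          0.870 |          0.097 |
               |          0.103 |          0.096 |                |
---------------|----------------|----------------|----------------|
No_After_Visit |             26 |            189 |            215 |
               |          0.121 |          0.879 |          0.903 |
               |          0.897 |          0.904 |                |
---------------|----------------|----------------|----------------|
  Column Total |             29 |            209 |            238 |
               |          0.122 |          0.878 |                |
---------------|----------------|----------------|----------------|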

TP – True Positive – Classified the after visit for heart procedure as “After_Visit” when the
actual is “After_Visit” (Classified correctly)
FP – False Positive – Classified the after visit for heart procedure as “After_Visit” when the
actual is “No_After_Visit” (Classified incorrectly)
FN – False Negative – Classified the after visit for heart procedure as “No_After_Visit” when the
actual is “After_Visit” (Classified incorrectly)
TN – True Negative – Classified the after visit for heart procedure as “No_After_Visit” when the
actual is “No_After_Visit” (Classified correctly)

With the help of the 2x2 confusion matrix we can formalize our definition of the prediction
accuracy (success rate):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Overall Accuracy of the Model is 80.67%
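
The overall accuracy can be reproduced directly from the four counts in the Naïve Bayesian confusion matrix. The short Python check below uses only those counts; the variable names are ours, and the precision and recall lines simply restate the row and column proportions printed in the matrix.

```python
# Counts from the Naïve Bayesian confusion matrix
# (rows are predictions, columns are actual classes).
tp, fp = 3, 20     # predicted After_Visit
fn, tn = 26, 189   # predicted No_After_Visit

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted After_Visit that were correct
recall = tp / (tp + fn)      # share of actual After_Visit that were found

print(f"accuracy={accuracy:.2%} precision={precision:.1%} recall={recall:.1%}")
# prints: accuracy=80.67% precision=13.0% recall=10.3%
```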

Total List of Procedures

Regression (PROBIT) – Predictive Model

Note: Biopsy of Heart Lining was present in 99% of High Cost patients, and was the only POSITIVE
indicator for high cost. All other independent variables had a negative influence in predicting HIGH
COST.

Rerun Probit against the larger data set:

Files and Scripts used for Regression (Gaussian) to Predict COST and Probit to predict (binary) Return
Heart Patient, High Cost Patient.

Summary
High level results follow.
We developed three models that can successfully predict (1) Whether a Patient will Return for a
Repeat Heart Procedure after their initial episode, and (2) Whether a Patient Should be Flagged as
“High Cost” in terms of treatment following their initial episode (i.e., falling within the top
one-third of most expensive patients treated).
CLIENT Health will benefit from the ability to predict repeat heart procedures in order to better serve
their patients through an optimization of resources. Specifically, the client will be able to (1)
Anticipate Repeat Heart Procedures at Time of Heart Attack, (2) Assign Protocol Around Patients at
Risk to Return, and (3) Work to Influence Drivers for Return, including supporting patients as they
make lifestyle changes to improve their heart health. This is where recurring patient check-ins come
into play. While it wasn’t within the scope of this project, technology will likely play a large role in
managing patients who have had a heart episode in the past.
Benefits can also be reaped from the ability for CLIENT to predict which patients returning for a heart
episode should be flagged as “high cost”. Specifically, the client will be able to (1) Leverage this Model
to Assign a Care Manager to a High Cost Potential Patient, and (2) Lay the Groundwork for
Researching and Building a Protocol Specific to Reducing Cost on HC Patients.
As mentioned throughout this report, there were many other predictive models that were explored
throughout this project. While we are pleased with the results from our three successful predictive
models, we also learned a great deal from models that yielded insignificant statistical results. This
iterative process is essential to the analysis of a dataset such as that which was generously provided
to us by our client, CLIENT Health.

Technical results from our successful models follow.
Readmittance KNN Model 1: Identifies returning and non-returning patients for heart related
procedures after a heart attack. The overall accuracy of the KNN model in predicting returning and
non-returning patients after heart attack is 89.57%. The KNN model successfully predicted 79% of
returning and 91% of non-returning patients for heart procedures.
Readmittance Bayesian Model 2: Identifies returning and non-returning patients for heart related
procedures after a heart attack. The overall accuracy of the Naive Bayesian model in predicting
returning and non-returning patients after heart attack is 80.67%. The Naive Bayesian model
successfully predicted 75% of returning and 87% of non-returning patients for heart procedures.
High-Cost Probit Model 3: Pre-event procedures are effective in predicting High Cost for returning
patients. The Probit model was successful in predicting 93% of High Cost and 80% of Non-High Cost
patients.

Recommendations
Our overarching goal for this project was to optimize patient care resources through addressing two
objectives: (1) Predicting Patient Readmittance and, (2) Identifying High Cost Patients. We were
successful with three predictive models that address those two objectives.
Our recommendations for CLIENT are threefold: (1) Enrich Patient Data, (2) Go Beyond
Demographics and other traditional practices common to predicting an initial heart episode when
predicting readmittance, and (3) Leverage the Predictable.
In terms of Enriching Patient Data, we feel that CLIENT Health has the opportunity to challenge the
status quo and assert industry leadership in managing high-quality patient data. The industry
continues to adapt to digital patient records, and federal regulations surrounding the collection,
management and privacy of patient data are stringent. We believe, however, that CLIENT could make
strides in the industry by starting the conversation surrounding mandated patient data collection and
recordkeeping. Whether that data is stored by CLIENT, a third-party provider or even the
government was not within the scope of this project. Throughout our team's data analysis, however, it
has become clear that this would benefit patients in terms of medical care and enable providers such
as CLIENT Health to optimize resources for patient care.
Early on in this process, we learned that we could not judge a patient's likelihood of a second heart
episode based on demographics alone. Based on this learning, Do Not Judge is a recommendation we
make to CLIENT Health. While demographics can be helpful in predicting an initial heart episode,
providers such as CLIENT must take into consideration a patient’s medical history when it comes to
predicting heart episode readmittance. We even learned through cardiologist Dr. Roychaudhuri that
traditional causes of initial heart failure (high blood pressure, high cholesterol, obesity and
diabetes) are not the best means by which medical technicians should address the likelihood of heart
failure readmittance or even a treatment plan upon that readmittance.
Finally, we recommend that CLIENT Health Leverage the Predictable by employing our predictive
modeling techniques in predicting heart failure patient return and in predicting which readmitted
patients will be the highest cost drivers. We recommend that the Care Manager be leveraged to focus on
patients with a high likelihood of readmittance in order to optimize medical technician resources. In
terms of those high cost patients, we recommend focusing on patients and procedures that are linked
to higher post-heart-attack cost in order to provide specialized care at the patient level and optimize
resources at the provider level.

Appendix
Unsuccessful Models

Several predictive models were used, with most proving ineffective in predicting 1) Patient
Return, and 2) Patient cost.  Below are the regression models that were unsuccessful.
1) Regression (Gaussian):  File [ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data ]
a. Trying to predict patient cost
b. Ran STEP WISE OPTIMIZATION
c. Unsuccessful
2) Regression (Gaussian): File [ Data Aggregation 2 ]
b. Ran STEP WISE Optimization
c. Unsuccessful
3) Regression (Gaussian): File [ Data Aggregation 3]
c. Unsuccessful
4) Regression (PROBIT): File [ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data ]
a. Trying to predict HIGH COST (Binary)
c. Unsuccessful
5) Regression (PROBIT): File [ Data Aggregation 2 ]
b. Ran STEP WISE Optimization
c. Unsuccessful
6) Regression (PROBIT): File [ Data Aggregation 3]
c. Unsuccessful

Within each attempt above, there were several additional attempts to filter the model data to
attempt to trigger an improved stepwise optimization, but alas, they were still unsuccessful.

Unsuccessful Probit Models

We noticed that the model was overpredicting 0 = “Not High Cost”. We agreed to selectively
alter the data based on the SIGNIFICANCE value in the PROBIT summary. We selectively
removed the variables with the HIGHEST SIG values (lowest correlation), along with patients
categorized as 0 (Not High Cost). This was a time-intensive, manual process that allowed us to
reduce the amount of data in the model that may have been disrupting its predictive ability.
The original data consisted of 1,000 procedures and 56 Demo + Patient detail variables, with
2,400 patient IDs. The revised data consisted of a mix of 78 procedures and 12 Demo + Patient
variables, with 1,000 patients.
