2. The index file for heart attack, financial hospital data, hospital ICD procedure data and
hospital CPT procedure data all contain PATIENT_IDENTIFIER and HAR_IDENTIFIER.
3. The demographic data table and the index file for heart attack have a 1:1 relationship.
4. Reimbursement clinic data is a fact table; it contains only the clinical procedures and
their related costs.
5. The financial hospital data table, hospital ICD procedure data table and hospital CPT
procedure data table are related by PATIENT_IDENTIFIER and HAR_IDENTIFIER. The
financial hospital data table is the superset, and the hospital ICD procedure data table and
hospital CPT procedure data table are subsets of it. To get all the procedures, we performed a
left outer join across these three tables.
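The left outer join described in item 5 can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the SQL scripts used in the project. The short table names, the columns TOTAL_CHARGES, ICD_DESC and CPT_DESC, and the sample rows are all hypothetical; only PATIENT_IDENTIFIER and HAR_IDENTIFIER come from the source data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, simplified stand-ins for the three hospital tables.
cur.executescript("""
CREATE TABLE financial_hospital (
    PATIENT_IDENTIFIER INTEGER, HAR_IDENTIFIER INTEGER, TOTAL_CHARGES REAL);
CREATE TABLE hospital_icd_procedure (
    PATIENT_IDENTIFIER INTEGER, HAR_IDENTIFIER INTEGER, ICD_DESC TEXT);
CREATE TABLE hospital_cpt_procedure (
    PATIENT_IDENTIFIER INTEGER, HAR_IDENTIFIER INTEGER, CPT_DESC TEXT);
INSERT INTO financial_hospital VALUES (1, 100, 5000.0), (2, 200, 1200.0), (3, 300, 800.0);
INSERT INTO hospital_icd_procedure VALUES (1, 100, 'Coronary angioplasty');
INSERT INTO hospital_cpt_procedure VALUES (1, 100, 'Echocardiography'), (2, 200, 'Chest X-ray');
""")

# The financial table is the superset, so it sits on the left of both joins.
# Rows without a matching ICD or CPT procedure are kept, with NULLs in those columns.
rows = cur.execute("""
SELECT f.PATIENT_IDENTIFIER, f.HAR_IDENTIFIER, f.TOTAL_CHARGES, i.ICD_DESC, c.CPT_DESC
FROM financial_hospital AS f
LEFT OUTER JOIN hospital_icd_procedure AS i
  ON i.PATIENT_IDENTIFIER = f.PATIENT_IDENTIFIER
 AND i.HAR_IDENTIFIER = f.HAR_IDENTIFIER
LEFT OUTER JOIN hospital_cpt_procedure AS c
  ON c.PATIENT_IDENTIFIER = f.PATIENT_IDENTIFIER
 AND c.HAR_IDENTIFIER = f.HAR_IDENTIFIER
ORDER BY f.PATIENT_IDENTIFIER
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Putting the superset table on the left keeps every financial record, including those with no recorded ICD or CPT procedure.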
Metadata and data element descriptions were received from CLIENT and were critical to our
interpretation of the tables. Upon initial analysis, we observed significant value in preventing
repeat heart attacks. The category was later broadened to heart-related procedures, which we
manually noted in the processed file. To prepare our data for modeling and further analysis, we
performed the following steps:
Process: Data Cleansing
Description: Data cleansing started very early in the project and continued up to the launch
of the models to ensure consistent results. In the first round of data cleansing we removed
blank values, null values, 0 values, white space and extra characters which were not required.
Tools: MS Access, MS Excel, SQL Scripts

Process: Data Profiling
Description: Data profiling is an analysis of the candidate data sources to clarify the
structure, content, relationships and derivation rules of the data. Profiling helps not only to
understand anomalies and to assess data quality, but also to discover, register, and assess
enterprise metadata. We also performed range analysis, completeness analysis, pattern analysis
and value distribution analysis.
Tools: R Code, MS Excel, Tableau

Process: Deduplication
Description: Data deduplication is a method of reducing storage needs by eliminating redundant
data. Only one unique instance of the data is actually retained on storage media. Redundant data is
Tools: MS SQL, MS Excel and Python
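The cleansing and deduplication steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MS Access, Excel or SQL logic used in the project: the set of values treated as missing, the rule for what counts as an "extra character", the choice to drop a whole row when any value is missing, and the sample rows are all hypothetical.

```python
import re

# Assumption: these spellings (compared case-insensitively) count as blank/null/0 values.
MISSING = {"", "null", "none", "na", "n/a", "0"}

def clean_value(value):
    """Trim and collapse white space, drop stray characters, map missing values to None."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", str(value)).strip()
    # Assumption: anything other than letters, digits, spaces and . , / _ - is an extra character.
    text = re.sub(r"[^A-Za-z0-9 .,/_-]", "", text).strip()
    return None if text.lower() in MISSING else text

def clean_and_deduplicate(rows):
    """Clean every value, drop rows with a missing value, keep one instance of each row."""
    seen = set()
    result = []
    for row in rows:
        cleaned = tuple(clean_value(v) for v in row)
        if any(v is None for v in cleaned):
            continue  # blank, null or 0 value: row removed
        if cleaned in seen:
            continue  # redundant copy: only the first instance is retained
        seen.add(cleaned)
        result.append(cleaned)
    return result

raw = [
    ("  101 ", "Echocardiography**"),
    ("101", "Echocardiography"),
    ("102", "NULL"),
    ("103", "  "),
    ("104", "Cardiac   catheterization"),
]
cleaned = clean_and_deduplicate(raw)
print(cleaned)
```

Cleaning before deduplicating matters here: the first two sample rows only become identical once white space and stray characters are removed.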
1. Match ND_HF_1_INDEX_20151111 (Index File for Heart Attack) with
ND_HF_3_DEMOGRAPHICS_20151111 (Demographic Data) to build the patient details by demography.
2. Left outer join ND_HF_5_FINANCIAL_HOSPITAL_20151111 (Financial Hospital Data),
ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 (Hospital ICD Procedure Data) and
ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 (Hospital CPT Procedure Data) by PATIENT_IDENTIFIER
and HAR_IDENTIFIER to collect all financial data, ICD procedure data and CPT procedure data.
3. Create three sets of data, relative to the heart attack day (day = 0):
a. Data prior to the heart attack: ADMIT_DAYS_FROM_INDEX < 0
b. Data at the time of the heart attack: ADMIT_DAYS_FROM_INDEX = 0
c. Data after the heart attack: ADMIT_DAYS_FROM_INDEX > 0
4. Join ND_HF_1_INDEX_20151111 (Index File for Heart Attack), ND_HF_3_DEMOGRAPHICS_20151111
(Demographic Data) and ND_HF_4_REIMBURSEMENT_CLINIC_20151111 (Reimbursement Clinic Data) by
PATIENT_IDENTIFIER, and create the same three sets:
a. Data prior to the heart attack: ENCOUNTER_DAYS_FROM_INDEX < 0
b. Data at the time of the heart attack: ENCOUNTER_DAYS_FROM_INDEX = 0
c. Data after the heart attack: ENCOUNTER_DAYS_FROM_INDEX > 0
5. From steps 3 and 4, capture the procedure descriptions.
6. For the less-than-0, equal-to-0 and greater-than-0 data sets from step 3, use SQL scripts to
transpose and deduplicate the data.
7. For the less-than-0, equal-to-0 and greater-than-0 data sets from step 4, use SQL scripts to
transpose and deduplicate the data.
8. Aggregate each of the less-than-0, equal-to-0 and greater-than-0 data sets into unique rows
by patient ID.
9. Merge the less-than-0, equal-to-0 and greater-than-0 data sets for clinical and hospital
procedures.
10. Match PATIENT_IDENTIFIER with the data set from step 8 and identify the matching patient
identifiers across the three data sets.
11. Match PATIENT_IDENTIFIER with the data set from step 9 and identify the matching patient
identifiers across the three data sets.
12. Merge the data sets for matched patients for the less-than-0, equal-to-0 and greater-than-0 sets.
13. Apply fuzzy matching to the procedures to find the heart-related procedures, and validate the
result with the heart specialist.
14. The data set from step 12 contains 544 records and 75 procedure fields.
15. Remove the patient ID, as the objective of this exercise is to identify whether patients who
have certain heart-related procedures before or during the heart attack will be back for a
heart-related procedure.
16. Uniquely identify the heart-related procedures and clinical procedures and assign them codes
such as “A1A1” and “A1A2”. This methodology treats each description as one heart-related
procedure, instead of treating its individual words as separate heart-related procedures. To build
this logic, we used a cross-reference table of unique procedure descriptions and their related
unique values.
17. Two fields in the data set:
a. Visit – Values “Yes” and “No”. “Yes” captures that there is a visit after the heart attack for a
heart-related procedure. “No” captures that there is no such visit.
b. Description – Related unique values from the cross-reference table.
18. Finalize the data set with 544 records and a mix of patients with and without an after-visit
for heart-related procedures.
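The three-way split in steps 3 and 4 and the Visit field in step 17 can be sketched as follows. This is a minimal illustration, not the SQL scripts used in the project. The PROCEDURE field name, the set of heart-related descriptions and the sample records are hypothetical; ADMIT_DAYS_FROM_INDEX and PATIENT_IDENTIFIER are taken from the source tables.

```python
def window(days_from_index):
    """Classify a record relative to the heart attack day (day 0)."""
    if days_from_index < 0:
        return "before"
    if days_from_index == 0:
        return "index"
    return "after"

def split_by_window(records, day_field):
    """Split records into the less-than-0, equal-to-0 and greater-than-0 sets."""
    sets = {"before": [], "index": [], "after": []}
    for record in records:
        sets[window(record[day_field])].append(record)
    return sets

def label_visits(records, day_field, heart_related):
    """Visit is "Yes" if the patient has a heart-related procedure after day 0, else "No"."""
    labels = {}
    for record in records:
        patient = record["PATIENT_IDENTIFIER"]
        labels.setdefault(patient, "No")
        if record[day_field] > 0 and record["PROCEDURE"] in heart_related:
            labels[patient] = "Yes"
    return labels

# Hypothetical sample records and heart-related procedure list.
records = [
    {"PATIENT_IDENTIFIER": 1, "ADMIT_DAYS_FROM_INDEX": -30, "PROCEDURE": "Echocardiography"},
    {"PATIENT_IDENTIFIER": 1, "ADMIT_DAYS_FROM_INDEX": 0, "PROCEDURE": "Coronary angioplasty"},
    {"PATIENT_IDENTIFIER": 1, "ADMIT_DAYS_FROM_INDEX": 45, "PROCEDURE": "Echocardiography"},
    {"PATIENT_IDENTIFIER": 2, "ADMIT_DAYS_FROM_INDEX": 0, "PROCEDURE": "Coronary angioplasty"},
    {"PATIENT_IDENTIFIER": 2, "ADMIT_DAYS_FROM_INDEX": 10, "PROCEDURE": "Chest X-ray"},
]
heart_related = {"Echocardiography", "Coronary angioplasty"}

sets = split_by_window(records, "ADMIT_DAYS_FROM_INDEX")
visits = label_visits(records, "ADMIT_DAYS_FROM_INDEX", heart_related)
print({name: len(rows) for name, rows in sets.items()})
print(visits)
```

The same functions apply to the clinic data by passing ENCOUNTER_DAYS_FROM_INDEX as the day field.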
Additional Data Mining steps necessary for predictive models

Process: Data Aggregation 1
Description: Consolidated files 1 & 3 below to build the full Demo and Admittance data file, to
build regression models against:
● ND_HF_1_INDEX_20151111 Index File for Heart Attack
● ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data
Tools: MS Excel

Process: Data Aggregation 2
Description: Consolidated the files below, to build regression models against:
● ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 Hospital ICD procedure Data
● ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 Hospital CPT procedure Data
Tools: MS Excel

Process: Data Aggregation 3
Description: Combination of Data Aggregation 1 & Data Aggregation 2, using the individual
Patient ID as the key:
● ND_HF_1_INDEX_20151111 Index File for Heart Attack
● ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data
● ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 Hospital ICD procedure Data
● ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 Hospital CPT procedure Data
Tools: MS Excel

Process: Data Aggregation 4
Description: Manual tag of HIGH SPEND patients, using the file below to separate patients
into terciles, with the top tercile representing HIGH COST patients:
● ND_HF_4_REIMBURSEMENT_CLINIC_20151111 Reimbursement clinic Data
Tools: MS Excel
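The tercile tagging in Data Aggregation 4 was done manually in MS Excel. A minimal programmatic sketch of the same idea follows. It assumes each patient has a single total cost figure, and it uses the default "exclusive" method of Python's statistics.quantiles for the cut points; the patient IDs, costs and the "OTHER" label are hypothetical.

```python
import statistics

def tag_high_cost(cost_by_patient):
    """Tag patients whose cost falls above the upper tercile cut point as HIGH COST."""
    costs = list(cost_by_patient.values())
    # n=3 returns the two cut points that divide the costs into terciles.
    lower_cut, upper_cut = statistics.quantiles(costs, n=3)
    return {
        patient: "HIGH COST" if cost > upper_cut else "OTHER"
        for patient, cost in cost_by_patient.items()
    }

# Hypothetical total reimbursement cost per patient.
cost_by_patient = {"P1": 100.0, "P2": 200.0, "P3": 300.0, "P4": 400.0, "P5": 500.0, "P6": 600.0}
tags = tag_high_cost(cost_by_patient)
print(tags)
```

A different quantile method, or a different handling of ties at the cut point, would move borderline patients between groups, so the rule should be fixed before the tag is used as a model target.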
Run Sequence Accuracy K Value
1 81.90% 25
2 82.54% 25
3 84.65% 25
4 86.84% 25
5 88.90% 25
6 88.90% 25
Once the accuracy held constant at 88.90%, we changed the K value from 25 to 10.
After executing the set of instructions above with K = 10, we obtained the following confusion
matrix.
The class prediction for the true After visit is 79.17%
TP – True Positive – Classified the after visit for heart procedure as “After_Visit” when the
actual is “After_Visit” (classified correctly)
FP – False Positive – Classified the after visit for heart procedure as “After_Visit” when the
actual is “No_After_Visit” (classified incorrectly)
FN – False Negative – Classified the after visit for heart procedure as “No_After_Visit” when the
actual is “After_Visit” (classified incorrectly)
TN – True Negative – Classified the after visit for heart procedure as “No_After_Visit” when the
actual is “No_After_Visit” (classified correctly)
With the help of the 2×2 confusion matrix, we can formalize our definition of the prediction
accuracy (success rate).
The class prediction for the true After visit is 9.7%
The class prediction for the No After visit is 90.3%
The overall accuracy of the model is 80.67%
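The standard rates that follow from the four cells defined above can be sketched as below. The counts are illustrative only and are not the model's actual confusion matrix. The report does not state which formula lies behind its "class prediction" figures, so precision, recall and specificity are shown here as common candidates rather than as the definition used.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute standard rates from the four cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        # Share of all cases classified correctly (the overall success rate).
        "accuracy": (tp + tn) / total,
        # Of the cases predicted "After_Visit", the share that really were.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the actual "After_Visit" cases, the share the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Of the actual "No_After_Visit" cases, the share the model found.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Illustrative counts only.
metrics = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Reporting per-class rates next to overall accuracy is useful when the two classes are imbalanced, since a model can reach high accuracy while rarely predicting the minority class correctly.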
Total List of Procedures
Regression (PROBIT) – Predictive Model
Appendix
Unsuccessful Models
Additional Data Mining steps necessary for predictive models

Process: Data Aggregation 1
Description: Consolidated files 1 & 3 below to build the full Demo and Admittance data file, to
build regression models against:
● ND_HF_1_INDEX_20151111 Index File for Heart Attack
● ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data
Tools: MS Excel

Process: Data Aggregation 2
Description: Consolidated the files below, to build regression models against:
● ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 Hospital ICD procedure Data
● ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 Hospital CPT procedure Data
Tools: MS Excel

Process: Data Aggregation 3
Description: Combination of Data Aggregation 1 & Data Aggregation 2, using the individual
Patient ID as the key:
● ND_HF_1_INDEX_20151111 Index File for Heart Attack
● ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data
● ND_HF_7_HOSPITAL_ICD_PROCEDURE_20151111 Hospital ICD procedure Data
● ND_HF_8_HOSPITAL_CPT_PROCEDURE_20151111 Hospital CPT procedure Data
Tools: MS Excel

Process: Data Aggregation 4
Description: Manual tag of HIGH SPEND patients, using the file below to separate patients
into terciles, with the top tercile representing HIGH COST patients:
● ND_HF_4_REIMBURSEMENT_CLINIC_20151111 Reimbursement clinic Data
Tools: MS Excel

Process: Final Data Set
Description: All four data sets above were used across multiple regression models in an
attempt to predict 1) Return for Heart Related Procedure, 2) Patient cost and 3) High Patient
cost (Binary).
Tools: MS Excel, R
Several predictive models were used, most of which proved ineffective in predicting 1) patient
return and 2) patient cost. Below are the regression models that were unsuccessful.
1) Regression (Gaussian): File [ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data]
a. Trying to predict patient cost
b. Ran stepwise optimization
c. Unsuccessful
2) Regression (Gaussian): File [Data Aggregation 2]
a. Trying to predict patient cost
b. Ran stepwise optimization
c. Unsuccessful
3) Regression (Gaussian): File [Data Aggregation 3]
a. Trying to predict patient cost
b. Ran stepwise optimization
c. Unsuccessful
4) Regression (PROBIT): File [ND_HF_3_DEMOGRAPHICS_20151111 Demographic Data]
a. Trying to predict HIGH COST (Binary)
b. Ran stepwise optimization
c. Unsuccessful
5) Regression (PROBIT): File [Data Aggregation 2]
a. Trying to predict HIGH COST (Binary)
b. Ran stepwise optimization
c. Unsuccessful
6) Regression (PROBIT): File [Data Aggregation 3]
a. Trying to predict HIGH COST (Binary)
b. Ran stepwise optimization
c. Unsuccessful
Within each attempt above, we made several additional attempts to filter the model data in
order to improve the stepwise optimization, but these were also unsuccessful.