Building an Initial Readmission Model
using WEKA as a Proof of Concept
Jin On
Geneia LLC
Harrisburg, PA
267-393-6810
jin.on@geneia.com
Sunil Kakade
Northwestern University
Chicago, IL
312-503-2418
sunil.kakade@gmail.com
Abstract- In this paper we use WEKA to build an initial
readmission model in order to identify opportunities for
transitional care interventions that may reduce
readmissions among chronically ill patients.
Keywords- Readmissions, WEKA, Predictive models, Care
Transitions, SAS, EM cluster
I. INTRODUCTION
The goal of this study is to build an initial readmission
model using open source healthcare data as a proof of
concept.
Specifically, Geneia LLC (Geneia) took the lessons learned
from this project and productionalized a process within
the Theon® platform: Care Modeler. Building the model
is the first step, but implementing it in a production
environment requires additional research and planning.
II. IMPORTANCE OF READMISSION MODELS
There has been an increased interest in readmission risk
models for two main reasons. First, transitional care
interventions may reduce readmissions among patients.
Second, readmission rates are starting to be utilized as a
quality metric.
The Centers for Medicare & Medicaid Services (CMS)
recently began using readmission rates as a publicly
reported metric, with plans to lower reimbursement to
hospitals with excess risk-standardized readmission
rates (Kansagara et al. 1688).
III. BUSINESS UNDERSTANDING: GENEIA
Geneia is producing a model that identifies statistically
significant predictors that are related to a greater risk of
unplanned all-cause 30-day readmissions. This model can
help identify opportunities for transitional care
interventions that may reduce readmissions among
chronically ill adults. Eliminating preventable
readmissions also improves a hospital’s risk-adjusted
readmission rate, which is used for hospital
comparison.
A systematic review published in JAMA, the journal of the
American Medical Association, covered 30 studies of 26
unique readmission models and grouped them by the type
of data they use:
Models relying on Retrospective Administrative Data
Models using Real-Time Administrative Data
Models Incorporating Primary Data Collection
Due to the nature of payer data, the team will focus on
building a readmission model relying heavily on
retrospective administrative data. However, previous
studies that strictly utilized retrospective administrative
data performed below expectations, so exploring other
data sources such as socio-economic information and
EMRs will be valuable (Kansagara et al. 1688).
Utilizing CMS’ open data will help in determining the
production process in order to successfully implement an
impactful readmission model.
IV. DATA UNDERSTANDING
In order to build an initial model the team used an open
data set from CMS: the 2008 Public Use File. This file was
created by drawing a 5% random sample of beneficiaries
from a 100% beneficiary summary file (CMS 1).
There are a total of 9 variables: IP_CLM_ID,
BENE_SEX_IDENT_CD, BENE_AGE_CAT_CD,
IP_CLM_BASE_DRG_CD, IP_CLM_ICD9_PRCDR_CD,
IP_CLM_DAYS_CD, IP_DRG_QUINT_PMT_AVG,
IP_DRG_QUINT_PMT_CD, and READMIT. All of these
variables had consecutive numbers as their values that
mapped to something meaningful. The first variable,
IP_CLM_ID, is the primary key to this table and is
therefore unique for every row. The BENE_SEX_IDENT_CD
variable indicates the sex of the beneficiary.
The next variable is BENE_AGE_CAT_CD. This categorical
variable shows the beneficiary’s age range based on
2008. However, if the member died during 2008, then
the age at the date of death was used.
IP_CLM_BASE_DRG_CD is another categorical variable.
This variable shows the diagnosis-related group (DRG)
that was on the hospital claim submitted for payment.
A single variable value can represent more than one DRG
code. For example, DRG codes ‘001’ and ‘002’ both
represent ‘Heart transplant or implant of heart assist
system’. Values range from 1 to 311; only the top 5 are
shown below.
IP_CLM_ICD9_PRCDR_CD indicates the primary
procedure performed during the inpatient stay. ICD-9
procedure codes are usually four digits in length;
however, these values are two digits. The two digits after
the decimal point were excluded because they only add
detail about the procedure, such as its exact location.
Values range from ‘00’ to ‘99’. Not all claims have a
procedure code value populated, so the value ‘00’
indicates that the claim had no procedure performed.
IP_CLM_DAYS_CD is a categorical variable that shows the
number of days the member stayed in the hospital. This
is also known as the ‘length of stay’. This variable was
computed as the difference between CLM_FROM_DT
and CLM_THRU_DT. However, if those two variables
are the same date, then the value of IP_CLM_DAYS_CD is
set to 1. The total inpatient days are categorized into
four groups:
IP_DRG_QUINT_PMT_AVG contains the average
Medicare total claim payment amount of the payment
quintile the claim falls in. To calculate this field, all claims
with similar DRGs were grouped into quintiles using the
Medicare total payment amount on the claims. Then the
average payment amount for the claims in each
quintile was calculated.
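The two-step calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `quintile_averages`, the rank-based quintile rule, and the sample payments are hypothetical, and CMS may set quintile cut points or break ties differently.

```python
from collections import defaultdict
from statistics import mean


def quintile_averages(claims):
    """Average payment per (DRG, payment quintile) bucket.

    `claims` is a list of (drg_code, payment) tuples. The rank-based
    quintile rule is an assumption made for this sketch.
    """
    by_drg = defaultdict(list)
    for drg, payment in claims:
        by_drg[drg].append(payment)

    averages = {}
    for drg, payments in by_drg.items():
        payments.sort()
        n = len(payments)
        buckets = defaultdict(list)
        for rank, payment in enumerate(payments):
            # Map sorted positions 0..n-1 onto quintiles 1..5.
            quintile = rank * 5 // n + 1
            buckets[quintile].append(payment)
        for quintile, values in buckets.items():
            averages[(drg, quintile)] = mean(values)
    return averages


# Hypothetical payments for ten claims that share one DRG value.
sample = [(1, float(p)) for p in range(100, 1100, 100)]
result = quintile_averages(sample)
for key in sorted(result):
    print(key, result[key])
```

In this sketch the pair (DRG, quintile) plays the role of IP_DRG_QUINT_PMT_CD, and the stored average plays the role of IP_DRG_QUINT_PMT_AVG.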
IP_DRG_QUINT_PMT_CD is a categorical variable that
indicates the quintile value to which the actual Medicare
payment amount on the claim belongs. A value of 1
indicates the lowest quintile.
READMIT is a binary variable that indicates whether a
member was readmitted. This is a very loose definition of
readmission. The response variable was created this way
in order to see how the other variables react to it. The
data has no indicator of whether a readmission has a
similar diagnosis to the original claim, so a dummy
variable was created as a workaround. A
value of ‘1’ indicates that the member was readmitted
after this claim was submitted. A value of ‘0’ indicates
the member has no readmission history.
READMIT value   Frequency   Frequency (%)
0               294,232     50.0042%
1               294,183     49.9958%
V. EXPLORATORY DATA ANALYSIS
The data preparation step was time and resource
intensive. A SAS® program applied the formatting and
translated all the numeric values to values that made
business sense. When the initial neural network model
was run with the primary key still in the data set, the
output was unexpected. To obtain more reasonable
results, the primary key was removed before any further
algorithms were run on the data.
The ‘READMIT’ variable was created as a part of
the data preparation because CMS’ data set
contains only descriptive data about the inpatient stay;
it does not contain an indicator that identifies a
readmission. This random variable was created in an
Excel spreadsheet using the ‘RANDBETWEEN(0,1)’
function. Thus the dataset shows that 50% of the
inpatient stays ended in a readmission. This is highly
unlikely in a real-world setting; however, it was sufficient
for building the overall process.
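To make the consequence of this labelling concrete, here is a minimal Python sketch that imitates the spreadsheet step. It is not the original Excel workbook: the function name `random_readmit_labels`, the seed, and the alternating sex column are hypothetical. It shows that a coin-flip label lands near 50% overall and near 50% inside any subgroup, so no input variable can carry information about it.

```python
import random


def random_readmit_labels(n_rows, seed=0):
    """Draw one 0/1 label per row, each value equally likely.

    This mirrors what RANDBETWEEN(0,1) does in a spreadsheet; the
    seed only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n_rows)]


# Same number of rows as the data set described in this paper.
labels = random_readmit_labels(588415)
share_readmitted = sum(labels) / len(labels)
print(f"share labelled readmitted: {share_readmitted:.4f}")

# Hypothetical feature column: a sex code of 1 or 2, alternating by row.
sex = [1 + (i % 2) for i in range(len(labels))]
group_shares = {}
for code in (1, 2):
    group = [y for s, y in zip(sex, labels) if s == code]
    group_shares[code] = sum(group) / len(group)
    print(f"sex code {code}: share readmitted {group_shares[code]:.4f}")
```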
VI. MODELING: WEKA
This project utilized WEKA, a free machine learning tool
written in Java and created at the University of Waikato
in New Zealand (its logo is the weka, a flightless New
Zealand bird). The EM
clustering algorithm was executed first to explore the
relationships among the dataset’s variables, and the
algorithm produced 2 clusters. The attribute summaries
for the training and test sets were very similar. The
output also showed that, although the variables are
categorical codes stored as numbers, WEKA treated each
one as a continuous variable. For example,
IP_CLM_ICD9_PRCDR_CD holds numeric codes for ICD-9
procedures. The clustered
output reports a mean and standard deviation for this
variable, but these statistics have no business value.
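The point about coded variables can be illustrated with a short Python sketch. The six procedure codes are made up for the example. Averaging them produces a number that names no procedure, while counting them as categories gives a summary that can be read. In WEKA, one way to declare such a column as categorical is the NumericToNominal filter (weka.filters.unsupervised.attribute.NumericToNominal); whether that was tried in this project is not stated.

```python
from collections import Counter
from statistics import mean

# Hypothetical two-digit procedure codes for six claims.
procedure_codes = ["36", "81", "36", "00", "99", "36"]

# Treated as numbers, the codes average to a value with no clinical meaning.
numeric_mean = mean(int(code) for code in procedure_codes)
print(f"mean of codes: {numeric_mean}")

# Treated as categories, the codes give a frequency per procedure group.
counts = Counter(procedure_codes)
for code, count in counts.most_common():
    print(code, count)
```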
In the test split, Cluster 0 (the readmission cluster) has
a READMIT mean of 1. The member confinements that
were readmitted (READMIT=1) show certain qualities.
For example, the mean of the sex variable is 1.56, which
leans slightly towards the female population. The mean
age category is 3.5752, which falls closest to the 75-79
age band. However, when the 2 clusters are compared,
the variables are almost identically distributed between
them, which makes it difficult to draw general
conclusions about the qualities that make a member
risky for readmission. This occurred because of how the
readmission variable was created. Once Geneia’s data is
used to build initial readmission models, there should be
fewer similarities between cohorts of confinements.
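One way to quantify how alike the two clusters are is a standardized difference: the gap between the cluster means divided by a pooled standard deviation. The sketch below applies it to the test-split means and standard deviations printed in the appendix. The equal-weight pooling is an assumption that relies on the two clusters being the same size, and the function name is hypothetical.

```python
from math import sqrt

# (mean cluster 0, mean cluster 1, std cluster 0, std cluster 1),
# copied from the test-split EM output in the appendix.
clusters = {
    "BENE_SEX_IDENT_CD": (1.5612, 1.5608, 0.4962, 0.4963),
    "BENE_AGE_CAT_CD": (3.5752, 3.5724, 1.8054, 1.8065),
    "IP_CLM_BASE_DRG_CD": (140.6608, 139.6798, 79.5428, 79.6735),
    "IP_CLM_ICD9_PRCDR_CD": (59.7292, 59.646, 21.2796, 21.2513),
    "IP_CLM_DAYS_CD": (2.5151, 2.5171, 0.9717, 0.9734),
    "IP_DRG_QUINT_PMT_AVG": (9072.7224, 9559.9805, 8682.2329, 12098.1965),
    "IP_DRG_QUINT_PMT_CD": (2.9971, 3.0001, 1.4145, 1.4155),
}


def standardized_difference(mean0, mean1, std0, std1):
    """Absolute mean difference divided by the pooled standard deviation."""
    pooled = sqrt((std0 ** 2 + std1 ** 2) / 2)
    return abs(mean0 - mean1) / pooled


differences = {
    name: standardized_difference(*stats) for name, stats in clusters.items()
}
for name, value in sorted(differences.items(), key=lambda item: -item[1]):
    print(f"{name:22s} {value:.4f}")
```

Every attribute other than READMIT differs by less than 0.05 pooled standard deviations, which is consistent with the observation that the clusters separate on the random label alone.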
VII. CONCLUSIONS
This proof of concept showed how to observe the EM
output and map out the model building process within
Geneia. Utilizing internal tools and data with the
guidance of what was learned with this proof of concept,
Geneia is producing a readmission model. This
readmission model can then be accessed from the
Theon® platform so customers are empowered with
information to aid in lowering the readmission risk of a
patient.
The data set used from CMS was a good foundation to
build a successful productionalized readmission model.
This project explored the different techniques that can
be utilized to create a robust model.
VIII. NEXT STEPS
Using open source tools and open data from CMS was an
easy way to demonstrate the proof of concept of a
readmission model. The readmission model uses
Geneia’s demo data with SAS® tools. Geneia’s Care
Modeler produces the readmission probabilities, and
the user interface that displays these probabilities is
provided in another Theon® platform application area.
As a part of Geneia’s vision to deliver actionable data
through advanced analytics, we will be utilizing
state-of-the-art tools in order to productionalize and
implement the readmission model in a timely fashion.
The customer’s needs determine which Theon®
tool will be most applicable to their organization.
IX. ACKNOWLEDGEMENTS
I would like to thank Professor Kakade for his continued
support for this project in CIS435. I would also like to
thank Geneia for giving me the means to implement an
impactful readmission model through the Theon®
platform.
X. REFERENCES
Kansagara, Devan, Honora Englander, Amanda Salanitro,
David Kagen, Cecelia Theobald, Michele Freeman, and
Sunil Kripalani. "Risk Prediction Models for Hospital
Readmission." JAMA 306.15 (2011): 1688. Web.
"BSA Inpatient Claims PUF." Centers for Medicare &
Medicaid Services. N.p., n.d. Web. 15 Nov. 2013.
<http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Inpatient_Claims.html>.
[Figure: process flow. Proof of Concept with WEKA and CMS Test Data -> Develop Readmission Model using SAS and Geneia Data -> Implement Model within Theon -> Optimize and incorporate new data sources]
APPENDIX
The WEKA output is shown below:
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: 2008_BSA_Inpatient_Claims_PUF_v2
Instances: 588415
Attributes: 8
BENE_SEX_IDENT_CD
BENE_AGE_CAT_CD
IP_CLM_BASE_DRG_CD
IP_CLM_ICD9_PRCDR_CD
IP_CLM_DAYS_CD
IP_DRG_QUINT_PMT_AVG
IP_DRG_QUINT_PMT_CD
READMIT
Test mode:split 75% train, remainder test
=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 2
Cluster
Attribute 0 1
(0.5) (0.5)
=============================================
BENE_SEX_IDENT_CD
mean 1.5614 1.5609
std. dev. 0.4962 0.4963
BENE_AGE_CAT_CD
mean 3.5743 3.575
std. dev. 1.8063 1.8053
IP_CLM_BASE_DRG_CD
mean 139.7065 140.7003
std. dev. 79.7529 79.6169
IP_CLM_ICD9_PRCDR_CD
mean 59.6499 59.7179
std. dev. 21.2575 21.2698
IP_CLM_DAYS_CD
mean 2.5165 2.5162
std. dev. 0.9735 0.972
IP_DRG_QUINT_PMT_AVG
mean 9573.9776 9050.2159
std. dev. 12086.2923 8568.1225
IP_DRG_QUINT_PMT_CD
mean 3.001 2.9979
std. dev. 1.4156 1.4142
READMIT
mean 0.0019 1
std. dev. 0.0437 0.5
Time taken to build model (full training data) : 101530.36 seconds
====================================================================================
=== Model and evaluation on test split ===
EM
==
Number of clusters selected by cross validation: 2
Cluster
Attribute 0 1
(0.5) (0.5)
=============================================
BENE_SEX_IDENT_CD
mean 1.5612 1.5608
std. dev. 0.4962 0.4963
BENE_AGE_CAT_CD
mean 3.5752 3.5724
std. dev. 1.8054 1.8065
IP_CLM_BASE_DRG_CD
mean 140.6608 139.6798
std. dev. 79.5428 79.6735
IP_CLM_ICD9_PRCDR_CD
mean 59.7292 59.646
std. dev. 21.2796 21.2513
IP_CLM_DAYS_CD
mean 2.5151 2.5171
std. dev. 0.9717 0.9734
IP_DRG_QUINT_PMT_AVG
mean 9072.7224 9559.9805
std. dev. 8682.2329 12098.1965
IP_DRG_QUINT_PMT_CD
mean 2.9971 3.0001
std. dev. 1.4145 1.4155
READMIT
mean 1 0.0017
std. dev. 0.5 0.0411
Time taken to build model (percentage split) : 7727.8 seconds
Clustered Instances
0 73500 ( 50%)
1 73604 ( 50%)
Log likelihood: -26.50445
As you can see in my data set, there appear to be equal numbers of males and females in both the
readmitted and non-readmitted populations.
The 5 age bands also show similar numbers in both the readmitted and non-readmitted
populations.
Diagnosis codes appear to be evenly distributed between the readmitted and non-readmitted
populations as well.
For the procedure code there are some gaps in what was performed in my data set; however, the
readmitted and non-readmitted populations look similar once again.
After exploring how the data looks after clustering, I decided to utilize neural networks.
A total of 5 nodes were created. Although each node can be analyzed, I was more
interested in the summary output. The root mean squared error came out to 56.99%, which makes
this a pretty poor neural network for predicting readmission. I’m exploring ways to lower this RMSE;
however, I believe this is also a data pre-processing issue. The WEKA output is shown below.
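Before the output, some context for the RMSE figure may help. The sketch below assumes that 56.99% corresponds to an RMSE of 0.5699 on the 0-1 scale of READMIT, and it uses the class counts from the READMIT frequency table. It compares the reported value with a baseline that ignores every input and always predicts 0.5.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Class counts taken from the READMIT frequency table in this paper.
n_not_readmitted, n_readmitted = 294232, 294183
actual = [0] * n_not_readmitted + [1] * n_readmitted

# Baseline: ignore every input and always predict a probability of 0.5.
baseline = rmse(actual, [0.5] * len(actual))
reported = 0.5699  # assumed to be the 56.99% figure on the 0-1 scale

print(f"constant-0.5 baseline RMSE: {baseline:.4f}")
print(f"reported network RMSE:      {reported:.4f}")
```

Under these assumptions the network does no better than the constant baseline, which is what one would expect when the target is a random label.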
=== Run information ===
Scheme:weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: 2008_BSA_Inpatient_Claims_PUF_v2
Instances: 588415
Attributes: 8
BENE_SEX_IDENT_CD
BENE_AGE_CAT_CD
IP_CLM_BASE_DRG_CD
IP_CLM_ICD9_PRCDR_CD
IP_CLM_DAYS_CD
IP_DRG_QUINT_PMT_AVG
IP_DRG_QUINT_PMT_CD
READMIT
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===