Building an Initial Readmission Model
using WEKA as a Proof of Concept
Jin On
Geneia LLC
Harrisburg, PA
267-393-6810
jin.on@geneia.com
Sunil Kakade
Northwestern University
Chicago, IL
312-503-2418
sunil.kakade@gmail.com
Abstract- In this paper we use WEKA to build an initial
readmission model in order to identify opportunities for
transitional care interventions that may reduce
readmissions among chronically ill patients.
Keywords- Readmissions, WEKA, Predictive models, Care
Transitions, SAS, EM cluster
I. INTRODUCTION
The goal of this study is to build an initial readmission
model using open source healthcare data as a proof of
concept.
Specifically, Geneia LLC (Geneia) took lessons learned
from this project and productionalized a process within
the Theon® platform: Care Modeler. Building the model
is the first step, but in order to implement this model in a
production environment additional research and
planning is required.
II. IMPORTANCE OF READMISSION MODELS
There has been an increased interest in readmission risk
models for two main reasons. First, transitional care
interventions may reduce readmissions among patients.
Second, readmission rates are starting to be utilized as a
quality metric.
The Centers for Medicare & Medicaid Services (CMS)
recently began using readmission rates as a way to
lower reimbursement to hospitals with excess
risk-standardized readmission rates (Kansagara et al. 1688).
III. BUSINESS UNDERSTANDING: GENEIA
Geneia is producing a model that identifies statistically
significant predictors that are related to a greater risk of
unplanned all-cause 30-day readmissions. This model can
help identify opportunities for transitional care
interventions that may reduce readmissions among
chronically ill adults. Eliminating preventable
readmissions improves a hospital’s risk-adjusted
readmission rate, which is used for hospital
comparison.
A systematic review published in the Journal of the
American Medical Association covered 30 studies of 26
unique readmission models and grouped them into the
following types (Kansagara et al. 1688):
• Models relying on Retrospective Administrative
Data
• Models using Real-Time Administrative Data
• Models Incorporating Primary Data Collection
Due to the nature of payer data, the team will focus on
building a readmission model relying heavily on
retrospective administrative data. However, previous
studies that strictly utilized retrospective administrative
data performed below expectations, so exploring other
data sources such as socio-economic information and
EMRs will be valuable (Kansagara et al. 1688).
Utilizing CMS’ open data will help in determining the
production process in order to successfully implement an
impactful readmission model.
IV. DATA UNDERSTANDING
In order to build an initial model, the team used an open
data set from CMS: the 2008 Public Use File. This file was
created from a 5% random sample of beneficiaries drawn
from a 100% beneficiary summary file (CMS 1).
There are a total of 9 variables: IP_CLM_ID,
BENE_SEX_IDENT_CD, BENE_AGE_CAT_CD,
IP_CLM_BASE_DRG_CD, IP_CLM_ICD9_PRCDR_CD,
IP_CLM_DAYS_CD, IP_DRG_QUINT_PMT_AVG,
IP_DRG_QUINT_PMT_CD, and READMIT. The values of
these variables are consecutive numeric codes that
map to something meaningful. The first variable,
IP_CLM_ID, is the primary key of this table and is
therefore unique for every row. The BENE_SEX_IDENT_CD
variable indicates the sex of the beneficiary.
The next variable is BENE_AGE_CAT_CD. This categorical
variable shows the beneficiary’s age range based on
2008. However, if the member died during 2008, then
the age at the date of death was used.
IP_CLM_BASE_DRG_CD is another categorical variable.
This variable shows the diagnostic related group that was
on the hospital claim that was submitted for payment.
Each value can represent more than one DRG code.
For example, DRG codes ‘001’ and ‘002’ represent ‘Heart
transplant or implant of heart assist system’. Values
range from 1 to 311; only the top 5 are shown below.
IP_CLM_ICD9_PRCDR_CD indicates the primary
procedure performed during the inpatient stay. ICD-9
procedure codes are usually four digits in length;
however, these values are two digits. The two digits after
the decimal point were excluded because they only give
more detail on the procedure, such as its exact location.
Values range from ‘00’ to ‘99’. Not all claims have a
procedure code value populated, so the value ‘00’
indicates that the claim had no procedure performed.
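The truncation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `procedure_category` and the sample input are hypothetical, and the public use file already ships with the two-digit values, so this is not the code CMS used.

```python
def procedure_category(icd9_procedure_code: str) -> str:
    """Return the two-digit category of an ICD-9 procedure code.

    Hypothetical helper: a blank code is mapped to '00', following the
    'no procedure performed' convention described in the text.
    """
    code = icd9_procedure_code.strip()
    if not code:
        return "00"
    # Keep only the digits before the decimal point (or the first two
    # digits when the code is written without a decimal point).
    return code.split(".")[0].zfill(2)[:2]


print(procedure_category("45.23"))  # illustrative four-digit input
print(procedure_category(""))       # claim with no procedure code
```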
IP_CLM_DAYS_CD is a categorical variable that shows the
number of days the member stayed in the hospital. This
is also known as the ‘length of stay’. This variable was
computed as the difference between the CLM_FROM_DT
and the CLM_THRU_DT. However, if those two variables
are the same date, then the value of IP_CLM_DAYS_CD is
set to 1. The total inpatient days are categorized into
four groups:
IP_DRG_QUINT_PMT_AVG contains the average
Medicare total claim payment amount of the payment
quintile to which the claim belongs. To calculate this
field, all claims with similar DRGs were grouped into
quintiles using the Medicare total payment amount on
the claims. Then the average payment amount for the
claims in each quintile was calculated.
IP_DRG_QUINT_PMT_CD is a categorical variable that
indicates the quintile value to which the actual Medicare
payment amount on the claim belongs. A value of 1
indicates the lowest quintile.
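A minimal sketch of how such quintile fields could be derived is given here. It assumes claims are grouped by their base DRG code and that quintiles are assigned by rank within each group; the function name `quintile_summary` and the payment amounts are made up for illustration, and the exact CMS procedure may differ.

```python
from collections import defaultdict
from statistics import mean


def quintile_summary(claims):
    """Group (drg_code, payment) pairs by DRG and average each quintile.

    Returns {drg_code: [(quintile, average_payment), ...]} where
    quintile 1 holds the lowest payments.
    """
    by_drg = defaultdict(list)
    for drg, payment in claims:
        by_drg[drg].append(payment)

    summary = {}
    for drg, payments in by_drg.items():
        payments.sort()
        n = len(payments)
        buckets = defaultdict(list)
        for rank, payment in enumerate(payments):
            # Rank-based assignment: the lowest fifth of payments gets 1.
            buckets[rank * 5 // n + 1].append(payment)
        summary[drg] = [(q, mean(v)) for q, v in sorted(buckets.items())]
    return summary


# Ten made-up payments for a single hypothetical DRG group.
claims = [(1, float(p)) for p in range(1000, 11000, 1000)]
for quintile, average in quintile_summary(claims)[1]:
    print(quintile, average)
```

In this sketch the quintile number plays the role of IP_DRG_QUINT_PMT_CD and the per-quintile average plays the role of IP_DRG_QUINT_PMT_AVG.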
READMIT is a binary variable that indicates if a member
was readmitted or not. This is a very loose definition of
readmission. The response variable was created this way
in order to see how the other variables react to it. There
is no indicator to show whether or not the readmission
has a similar diagnosis to the original claim; however, this
was worked around by creating a dummy variable. A
value of ‘1’ indicates that the member was readmitted
after this claim was submitted. A value of ‘0’ indicates
that the member has no readmission history.
READMIT value    Frequency    Frequency (%)
0                294,232      50.0042%
1                294,183      49.9958%
V. EXPLORATORY DATA ANALYSIS
The data preparation step was time and resource
intensive. A SAS® program created the formatting and
translated all the numeric values into values that made
business sense. The primary key was also removed from
the data set before the algorithms were run: when the
initial neural network model ran with the key included,
the output was unexpected, so the key was dropped to
obtain more reasonable results.
The ‘READMIT’ variable was created as a part of
the data preparation. CMS’ data set contains only
descriptive data about the inpatient stay and does not
contain an indicator to determine a readmission, so
this random variable was created in an Excel
spreadsheet using the ‘RANDBETWEEN(0,1)’
function. Thus the dataset shows that 50% of the
inpatient stays ended in a readmission. This is highly
unlikely in a real world setting; however, it was
sufficient for building the overall process.
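For readers without Excel, the same kind of placeholder label can be generated with Python's standard library. This is a hedged equivalent of the ‘RANDBETWEEN(0,1)’ step, not the spreadsheet actually used; the function name `random_readmit_flags` and the seed are illustrative assumptions.

```python
import random


def random_readmit_flags(n_rows, seed=0):
    """Return n_rows random 0/1 flags, each value equally likely."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.randint(0, 1) for _ in range(n_rows)]


flags = random_readmit_flags(100_000)
share = sum(flags) / len(flags)
print(f"share flagged as readmitted: {share:.3f}")
```

Because the flag is independent of every other column, no model should be expected to predict it better than chance.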
VI. MODELING: WEKA
This project utilized WEKA, a free machine learning
tool written in Java that was created by the University
of Waikato in New Zealand (hence the weka bird logo).
The EM clustering algorithm was executed first to
explore the relationships among the dataset’s variables,
and the algorithm produced 2 clusters. The attributes
looked very similar in the training and test sets. The
output showed that, because the coded variables are
stored as numbers, WEKA treated each one as a
continuous variable. For example,
IP_CLM_ICD9_PRCDR_CD holds numeric codes for
ICD-9 procedure categories. The clustered output
shows the mean and standard deviation for this
variable; however, these statistics have no business value.
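The point about averaging coded values can be illustrated with a small Python sketch. The codes below are made up, and the helper `summarize_nominal` is hypothetical; within WEKA itself, a filter such as NumericToNominal could be applied before clustering so that codes are treated as categories, although that step was not part of this project.

```python
from collections import Counter
from statistics import mean

# Made-up procedure category codes for four claims.
codes = [1, 1, 99, 99]


def summarize_nominal(values):
    """Return (code, count) pairs, the natural summary for a coded field."""
    return Counter(values).most_common()


# The mean lands on a value that never occurs in the data, which is why
# a mean and standard deviation say little about a coded field.
print(mean(codes))
print(summarize_nominal(codes))
```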
Cluster 0, or Cluster_Readmission, in the test model
has a READMIT mean of 1. The member confinements
that are likely to be readmitted (readmit=1) have certain
qualities. For example, the mean of the gender variable
is 1.56, which leans slightly towards the female
population. The mean age category is 3.5752, which
suggests that ages 75-79 are more likely to be
readmitted. However, when comparing the 2 clusters,
the variables are evenly distributed between them, which
makes it difficult to make general comments on the
qualities that make a member risky for readmission.
This occurred because of how the readmission variable
was created. Once Geneia’s data is used to build initial
readmission models, there will be fewer similarities
between cohorts of confinements.
VII. CONCLUSIONS
This proof of concept showed how to observe the EM
output and map out the model building process within
Geneia. Utilizing internal tools and data with the
guidance of what was learned with this proof of concept,
Geneia is producing a readmission model. This
readmission model can then be accessed from the
Theon® platform so customers are empowered with
information to aid in lowering the readmission risk of a
patient.
The data set used from CMS was a good foundation to
build a successful productionalized readmission model.
This project explored the different techniques that can
be utilized to create a robust model.
VIII. NEXT STEPS
Using open source tools and open data from CMS was an
easy way to display the proof of concept of a
readmission model. The readmission model uses
Geneia’s demo data with SAS® tools. Geneia’s Care
Modeler produces the readmission probabilities, and
the user interface showing these probabilities is
produced in another Theon® platform application area.
As a part of Geneia’s vision to deliver actionable data
through advanced analytics, we will be utilizing state of
the art tools in order to productionalize and implement
the readmission model in a timely fashion. The
customer’s needs determine which Theon® tool will be
most applicable to their organization.
IX. ACKNOWLEDGEMENTS
I would like to thank Professor Kakade for his continued
support for this project in CIS435. I would also like to
thank Geneia for giving me the means to implement an
impactful readmission model through the Theon®
platform.
X. REFERENCES
Kansagara, Devan, Honora Englander, Amanda Salanitro,
David Kagen, Cecelia Theobald, Michele Freeman, and
Sunil Kripalani. "Risk Prediction Models for Hospital
Readmission." JAMA 306.15 (2011): 1688. Web.
"BSA Inpatient Claims PUF." Centers for Medicare &
Medicaid Services. N.p., n.d. Web. 15 Nov. 2013.
<http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Inpatient_Claims.html>.
[Figure: process flow. Proof of Concept with WEKA and CMS Test Data -> Develop Readmission Model using SAS and Geneia Data -> Implement Model within Theon -> Optimize and incorporate new data sources]
APPENDIX
The WEKA output is shown below:
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: 2008_BSA_Inpatient_Claims_PUF_v2
Instances: 588415
Attributes: 8
BENE_SEX_IDENT_CD
BENE_AGE_CAT_CD
IP_CLM_BASE_DRG_CD
IP_CLM_ICD9_PRCDR_CD
IP_CLM_DAYS_CD
IP_DRG_QUINT_PMT_AVG
IP_DRG_QUINT_PMT_CD
READMIT
Test mode:split 75% train, remainder test
=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 2
Cluster
Attribute 0 1
(0.5) (0.5)
=============================================
BENE_SEX_IDENT_CD
mean 1.5614 1.5609
std. dev. 0.4962 0.4963
BENE_AGE_CAT_CD
mean 3.5743 3.575
std. dev. 1.8063 1.8053
IP_CLM_BASE_DRG_CD
mean 139.7065 140.7003
std. dev. 79.7529 79.6169
IP_CLM_ICD9_PRCDR_CD
mean 59.6499 59.7179
std. dev. 21.2575 21.2698
IP_CLM_DAYS_CD
mean 2.5165 2.5162
std. dev. 0.9735 0.972
IP_DRG_QUINT_PMT_AVG
mean 9573.9776 9050.2159
std. dev. 12086.2923 8568.1225
IP_DRG_QUINT_PMT_CD
mean 3.001 2.9979
std. dev. 1.4156 1.4142
READMIT
mean 0.0019 1
std. dev. 0.0437 0.5
Time taken to build model (full training data) : 101530.36 seconds
==================================================================================
==
=== Model and evaluation on test split ===
EM
==
Number of clusters selected by cross validation: 2
Cluster
Attribute 0 1
(0.5) (0.5)
=============================================
BENE_SEX_IDENT_CD
mean 1.5612 1.5608
std. dev. 0.4962 0.4963
BENE_AGE_CAT_CD
mean 3.5752 3.5724
std. dev. 1.8054 1.8065
IP_CLM_BASE_DRG_CD
mean 140.6608 139.6798
std. dev. 79.5428 79.6735
IP_CLM_ICD9_PRCDR_CD
mean 59.7292 59.646
std. dev. 21.2796 21.2513
IP_CLM_DAYS_CD
mean 2.5151 2.5171
std. dev. 0.9717 0.9734
IP_DRG_QUINT_PMT_AVG
mean 9072.7224 9559.9805
std. dev. 8682.2329 12098.1965
IP_DRG_QUINT_PMT_CD
mean 2.9971 3.0001
std. dev. 1.4145 1.4155
READMIT
mean 1 0.0017
std. dev. 0.5 0.0411
Time taken to build model (percentage split) : 7727.8 seconds
Clustered Instances
0 73500 ( 50%)
1 73604 ( 50%)
Log likelihood: -26.50445
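One way to quantify how similar the two clusters are is a standardized mean difference per attribute. The sketch below copies a few means and standard deviations from the test-split output above; the pooled-standard-deviation formula and the function name `standardized_difference` are choices made for this illustration and are not part of the WEKA output.

```python
from math import sqrt

# (mean_0, sd_0, mean_1, sd_1) copied from the test-split EM output.
test_split = {
    "BENE_SEX_IDENT_CD": (1.5612, 0.4962, 1.5608, 0.4963),
    "BENE_AGE_CAT_CD": (3.5752, 1.8054, 3.5724, 1.8065),
    "IP_CLM_DAYS_CD": (2.5151, 0.9717, 2.5171, 0.9734),
    "IP_DRG_QUINT_PMT_AVG": (9072.7224, 8682.2329, 9559.9805, 12098.1965),
    "READMIT": (1.0, 0.5, 0.0017, 0.0411),
}


def standardized_difference(m0, s0, m1, s1):
    """Difference in cluster means, in units of the pooled std. dev."""
    pooled = sqrt((s0 ** 2 + s1 ** 2) / 2)
    return (m0 - m1) / pooled


differences = {
    name: standardized_difference(*stats) for name, stats in test_split.items()
}
for name, d in differences.items():
    print(f"{name:22s} {d:+.4f}")
```

Under these assumptions only READMIT separates the clusters; every other attribute differs by a small fraction of a standard deviation.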
In my data set, the readmitted and non-readmitted populations contain similar numbers of males and
females.
The 5 age bands also show similar numbers in both the readmitted and non-readmitted
populations.
Diagnosis codes are evenly distributed between the readmitted and non-readmitted
populations as well.
For the procedure code there are some gaps in which procedures appear in my data set; however, the
readmitted and non-readmitted populations look similar once again.
After exploring how the data looks after clustering, I decided to utilize neural networks.
A total of 5 nodes were created. Although each node can be analyzed, I was more
interested in the summary output. The root mean squared error came out to 0.5699, which makes
this a pretty poor neural network for predicting readmission. I’m exploring ways to lower this RMSE;
however, I believe this is a data pre-processing issue as well. The WEKA output is shown below.
=== Run information ===
Scheme:weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: 2008_BSA_Inpatient_Claims_PUF_v2
Instances: 588415
Attributes: 8
BENE_SEX_IDENT_CD
BENE_AGE_CAT_CD
IP_CLM_BASE_DRG_CD
IP_CLM_ICD9_PRCDR_CD
IP_CLM_DAYS_CD
IP_DRG_QUINT_PMT_AVG
IP_DRG_QUINT_PMT_CD
READMIT
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold 0.10507212142288758
Node 1 -1.3304456888484266
Node 2 -0.2549966715099939
Node 3 -3.390541070153192
Node 4 -3.6528578622010737
Sigmoid Node 1
Inputs Weights
Threshold -13.464450269067473
Attrib BENE_SEX_IDENT_CD -4.939190410264623
Attrib BENE_AGE_CAT_CD -8.740965586414031
Attrib IP_CLM_BASE_DRG_CD -0.10494256311588898
Attrib IP_CLM_ICD9_PRCDR_CD -0.2764279667821472
Attrib IP_CLM_DAYS_CD -4.941028814043151
Attrib IP_DRG_QUINT_PMT_AVG 11.229004034610403
Attrib IP_DRG_QUINT_PMT_CD -1.4161938479261476
Sigmoid Node 2
Inputs Weights
Threshold -29.510054335269658
Attrib BENE_SEX_IDENT_CD -2.4530603729826272
Attrib BENE_AGE_CAT_CD 12.256052634877257
Attrib IP_CLM_BASE_DRG_CD -6.556030932329017
Attrib IP_CLM_ICD9_PRCDR_CD -7.619288616295435
Attrib IP_CLM_DAYS_CD -0.9677884091499084
Attrib IP_DRG_QUINT_PMT_AVG 7.5937369382205935
Attrib IP_DRG_QUINT_PMT_CD 9.236701926843953
Sigmoid Node 3
Inputs Weights
Threshold -9.487580769500077
Attrib BENE_SEX_IDENT_CD 2.8264751617336055
Attrib BENE_AGE_CAT_CD 1.1920813268658716
Attrib IP_CLM_BASE_DRG_CD -1.0685283000189505
Attrib IP_CLM_ICD9_PRCDR_CD -0.4555504249499105
Attrib IP_CLM_DAYS_CD 0.04425927527189678
Attrib IP_DRG_QUINT_PMT_AVG 2.091775720037553
Attrib IP_DRG_QUINT_PMT_CD -0.14238942134384816
Sigmoid Node 4
Inputs Weights
Threshold -9.265272395597584
Attrib BENE_SEX_IDENT_CD 0.16355901451324487
Attrib BENE_AGE_CAT_CD -0.18375201349583137
Attrib IP_CLM_BASE_DRG_CD -2.5806491645028133
Attrib IP_CLM_ICD9_PRCDR_CD 0.3839209458877335
Attrib IP_CLM_DAYS_CD -2.010902479691323
Attrib IP_DRG_QUINT_PMT_AVG 0.34075788774097127
Attrib IP_DRG_QUINT_PMT_CD 0.7098860282183899
Class
Input
Node 0
Time taken to build model: 971.8 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.0024
Mean absolute error 0.4994
Root mean squared error 0.5699
Relative absolute error 99.877 %
Root relative squared error 113.9805 %
Total Number of Instances 588415
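The summary metrics above can be put in context by comparing them with a trivial baseline that always predicts the mean of READMIT. The sketch below is an approximation under stated assumptions: it uses the class counts reported for READMIT in the data understanding section, and it assumes the relative errors are ratios against that mean-predicting baseline, which is how such relative measures are commonly defined. The function name `baseline_errors` is hypothetical.

```python
from math import sqrt


def baseline_errors(n_zeros, n_ones):
    """MAE and RMSE of always predicting the mean of a 0/1 target."""
    n = n_zeros + n_ones
    p = n_ones / n  # mean of the target
    mae = (n_zeros * p + n_ones * (1 - p)) / n
    rmse = sqrt((n_zeros * p ** 2 + n_ones * (1 - p) ** 2) / n)
    return mae, rmse


base_mae, base_rmse = baseline_errors(294_232, 294_183)

# Errors reported in the cross-validation summary above.
model_mae, model_rmse = 0.4994, 0.5699

rae = model_mae / base_mae
rrse = model_rmse / base_rmse
print(f"relative absolute error     ~ {rae:.2%}")
print(f"root relative squared error ~ {rrse:.2%}")
```

A root relative squared error above 100% means the network does slightly worse than the constant baseline, which is consistent with a response variable that was generated at random.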

A SURVEY ON BLOOD DISEASE DETECTION USING MACHINE LEARNING
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
 
Manuscript dss
Manuscript dssManuscript dss
Manuscript dss
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - Statswork
 
Automobile Insurance Claim Fraud Detection using Random Forest and ADASYN
Automobile Insurance Claim Fraud Detection using Random Forest and ADASYNAutomobile Insurance Claim Fraud Detection using Random Forest and ADASYN
Automobile Insurance Claim Fraud Detection using Random Forest and ADASYN
 
Smart Healthcare Prediction System Using Machine Learning
Smart Healthcare Prediction System Using Machine LearningSmart Healthcare Prediction System Using Machine Learning
Smart Healthcare Prediction System Using Machine Learning
 
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRYPROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
 
MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
 
Patient Privacy Control for Health Care in Cloud Computing System
Patient Privacy Control for Health Care in Cloud Computing  SystemPatient Privacy Control for Health Care in Cloud Computing  System
Patient Privacy Control for Health Care in Cloud Computing System
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approach
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
A Novel Approach for Forecasting Disease Using Machine Learning
A Novel Approach for Forecasting Disease Using Machine LearningA Novel Approach for Forecasting Disease Using Machine Learning
A Novel Approach for Forecasting Disease Using Machine Learning
 
PREDICTION OF DIABETES (SUGAR) USING MACHINE LEARNING TECHNIQUES
PREDICTION OF DIABETES (SUGAR) USING MACHINE LEARNING TECHNIQUESPREDICTION OF DIABETES (SUGAR) USING MACHINE LEARNING TECHNIQUES
PREDICTION OF DIABETES (SUGAR) USING MACHINE LEARNING TECHNIQUES
 
Credit iconip
Credit iconipCredit iconip
Credit iconip
 
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
 
Dynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
Dynamic Rule Base Construction and Maintenance Scheme for Disease PredictionDynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
Dynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
 
IBM impact-final-reviewed1
IBM impact-final-reviewed1IBM impact-final-reviewed1
IBM impact-final-reviewed1
 

The Journal of the American Medical Association published a systematic review of 30 studies of 26 unique readmission models, which grouped the models into three types:

• Models relying on retrospective administrative data
• Models using real-time administrative data
• Models incorporating primary data collection

Due to the nature of payer data, the team will focus on building a readmission model that relies heavily on retrospective administrative data. However, previous studies that strictly used retrospective administrative data performed below expectations, so exploring other data sources such as socio-economic information and EMRs will be valuable (Kansagara et al. 1688). Using CMS’ open data will help determine the production process needed to implement an impactful readmission model.
IV. DATA UNDERSTANDING
In order to build an initial model the team used an open data set from CMS: the 2008 Public Use File. This file was created from a 5% random sample of beneficiaries drawn from a 100% beneficiary summary file (CMS 1).

There are a total of 9 variables: IP_CLM_ID, BENE_SEX_IDENT_CD, BENE_AGE_CAT_CD, IP_CLM_BASE_DRG_CD, IP_CLM_ICD9_PRCDR_CD, IP_CLM_DAYS_CD, IP_DRG_QUINT_PMT_AVG, IP_DRG_QUINT_PMT_CD, and READMIT. All of these variables take consecutive numbers as values, and each number maps to something meaningful.

The first variable, IP_CLM_ID, is the primary key of this table and is therefore unique for every row.

The BENE_SEX_IDENT_CD variable indicates the sex of the beneficiary.

BENE_AGE_CAT_CD is a categorical variable that shows the beneficiary’s age range as of 2008. If the member died during 2008, the age at the date of death was used.

IP_CLM_BASE_DRG_CD is another categorical variable. It shows the diagnosis-related group (DRG) on the hospital claim that was submitted for payment. A single value can represent more than one DRG code. For example, DRG codes ‘001’ and ‘002’ both represent ‘Heart transplant or implant of heart assist system’. Values range from 1 to 311; only the top 5 are shown below.

IP_CLM_ICD9_PRCDR_CD indicates the primary procedure performed during the inpatient stay. ICD-9 procedure codes are usually four digits in length, but these values are two digits. The two digits after the decimal point were excluded because they only add detail about the procedure, such as its exact location. Values range from ‘00’ to ‘99’. Not all claims have a procedure code populated, so the value ‘00’ indicates that no procedure was performed on the claim.

IP_CLM_DAYS_CD is a categorical variable that shows the number of days the member stayed in the hospital, also known as the ‘length of stay’. This variable was computed as the difference between CLM_FROM_DT and CLM_THRU_DT. If those two variables are the same date, the value of IP_CLM_DAYS_CD is set to 1. The total inpatient days are categorized into four groups.

IP_DRG_QUINT_PMT_AVG contains the average Medicare total claim payment amount of the payment quintile. To calculate this field, all claims with similar DRGs were grouped into quintiles using the Medicare total payment amount on the claims. Then the average payment amount of the claims in each quintile was calculated.
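The two derived fields described above can be illustrated with a minimal Python sketch. It is not the CMS procedure: the function names are hypothetical, the quintiles are formed by simple rank-based splitting (an assumption, since the exact CMS grouping rule is not given here), and the four-group cut-offs for IP_CLM_DAYS_CD are left out because they are not stated.

```python
from datetime import date
from statistics import mean


def length_of_stay(clm_from_dt: date, clm_thru_dt: date) -> int:
    """Days between claim start and end; same-day stays count as 1."""
    days = (clm_thru_dt - clm_from_dt).days
    return days if days > 0 else 1


def quintile_averages(payments):
    """Split payments into five rank-based groups and average each one.

    Assumption: equal-sized groups by sorted rank. Returns a dict that
    maps quintile number (1 = lowest payments) to the group's mean.
    """
    ordered = sorted(payments)
    n = len(ordered)
    averages = {}
    for q in range(5):
        bucket = ordered[q * n // 5:(q + 1) * n // 5]
        if bucket:  # skip empty groups when there are fewer than 5 claims
            averages[q + 1] = mean(bucket)
    return averages


if __name__ == "__main__":
    print(length_of_stay(date(2008, 3, 1), date(2008, 3, 4)))
    print(quintile_averages([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]))
```

In this sketch the quintile number plays the role of IP_DRG_QUINT_PMT_CD and the group mean plays the role of IP_DRG_QUINT_PMT_AVG, computed within one group of similar DRGs.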
IP_DRG_QUINT_PMT_CD is a categorical variable that indicates the quintile to which the actual Medicare payment amount on the claim belongs. A value of 1 indicates the lowest quintile.

READMIT is a binary variable that indicates whether or not a member was readmitted. This is a very loose definition of readmission. The response variable was created this way in order to see how the other variables react to it. There is no indicator showing whether the readmission has a diagnosis similar to the original claim, so a dummy variable was created as a workaround. A value of ‘1’ indicates that the member was readmitted after this claim was submitted. A value of ‘0’ indicates that the member has no readmission history.

Variable   Value   Frequency   Frequency (%)
READMIT    0       294,232     50.0042%
READMIT    1       294,183     49.9958%

V. EXPLORATORY DATA ANALYSIS
The data preparation step was time and resource intensive. A SAS® program created the formatting and translated all the numeric values into values that made business sense.

The primary key was removed from the data set before the algorithms were run. When the initial neural network model was run with the key included, the output was unexpected, so the key was dropped to obtain more reasonable results.

The ‘READMIT’ variable was created as part of the data preparation. This was necessary because the CMS data set contains only descriptive data about the inpatient stay; it does not contain an indicator that identifies a readmission. The variable was generated at random in an Excel spreadsheet using the ‘RANDBETWEEN(0,1)’ function. As a result, the data set shows 50% of the inpatient stays ending in a readmission. This is highly unlikely in a real-world setting, but it was sufficient for building the overall process.

VI. MODELING: WEKA
This project used WEKA, a free machine learning tool written in Java. It was created by the University of Waikato in New Zealand and is named after the weka, a flightless bird native to New Zealand that appears in its logo.
The EM clustering algorithm was executed first to explore the relationships among the data set’s variables, and it produced 2 clusters. The attributes looked very similar in the training and test sets.

The output showed that, although the variables are categorical codes stored as numbers, WEKA treated each one as a continuous variable. For example, IP_CLM_ICD9_PRCDR_CD holds numeric ICD-9 procedure codes. The clustered output reports the mean and standard deviation of this variable, but these have no business value.

In the test split, Cluster 0 (Cluster_Readmission) has a READMIT mean of 1. The member confinements that are likely to be readmitted (readmit=1) show certain characteristics. For example, the mean of the gender variable is 1.56, which leans slightly towards the female population. The mean age category is 3.5752, which suggests that ages 75-79 are more likely to be readmitted.

However, when the 2 clusters are compared, the variables are evenly distributed between them, which makes it difficult to generalize about the qualities that put a member at risk of readmission. This occurred because of how the readmission variable was created. Once Geneia’s data is used to build the initial readmission models, there will be fewer similarities between cohorts of confinements.
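Why a randomly generated READMIT flag leaves the two groups looking alike can be shown with a small simulation. This is only a sketch under stated assumptions: `random.randint(0, 1)` stands in for Excel’s RANDBETWEEN(0,1), the age-category codes 1 to 6 are illustrative rather than taken from the CMS file, and both function names are hypothetical.

```python
import random
from statistics import mean


def simulate_random_readmit(n_rows, seed=0):
    """Build (age_category, readmit) rows where readmit is a coin flip."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        age_cat = rng.randint(1, 6)  # hypothetical age-category code
        readmit = rng.randint(0, 1)  # same idea as RANDBETWEEN(0,1)
        rows.append((age_cat, readmit))
    return rows


def mean_age_by_readmit(rows):
    """Mean age category for the non-readmitted (0) and readmitted (1) rows."""
    return {
        flag: mean(age for age, readmit in rows if readmit == flag)
        for flag in (0, 1)
    }


if __name__ == "__main__":
    rows = simulate_random_readmit(20000, seed=42)
    share = sum(readmit for _, readmit in rows) / len(rows)
    print("share readmitted:", round(share, 4))
    print("mean age category by READMIT:", mean_age_by_readmit(rows))
```

Because the flag is drawn independently of every other column, roughly half of the rows are labeled as readmitted and the group means differ only by sampling noise, which is the same pattern seen in the cluster output.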
VII. CONCLUSIONS
This proof of concept showed how to read the EM output and how to map out the model building process within Geneia. Using internal tools and data, and guided by what was learned from this proof of concept, Geneia is producing a readmission model. This readmission model can then be accessed from the Theon® platform so that customers are empowered with information that helps lower a patient’s readmission risk.

The data set from CMS was a good foundation for building a successful productionalized readmission model. This project explored the different techniques that can be used to create a robust model.

VIII. NEXT STEPS
Using open source tools and open data from CMS was an easy way to demonstrate the proof of concept of a readmission model. The readmission model uses Geneia’s demo data with SAS® tools. Geneia’s Care Modeler produces the readmission probabilities, and the user interface that shows these probabilities is produced in another application area of the Theon® platform.

As part of Geneia’s vision to deliver actionable data through advanced analytics, we will use state-of-the-art tools to productionalize and implement the readmission model in a timely fashion. The customer’s needs determine which Theon® tool is most applicable to their organization.

[Figure: process flow. Proof of Concept with WEKA and CMS Test Data; Develop Readmission Model using SAS and Geneia Data; Implement Model within Theon; Optimize and incorporate new data sources.]

IX. ACKNOWLEDGEMENTS
I would like to thank Professor Kakade for his continued support of this project in CIS435. I would also like to thank Geneia for giving me the means to implement an impactful readmission model through the Theon® platform.

X. REFERENCES
Kansagara, Devan, Honora Englander, Amanda Salanitro, David Kagen, Cecelia Theobald, Michele Freeman, and Sunil Kripalani. "Risk Prediction Models for Hospital Readmission." JAMA 306.15 (2011): 1688. Web.

"BSA Inpatient Claims PUF." Centers for Medicare & Medicaid Services. N.p., n.d. Web. 15 Nov. 2013. <http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Inpatient_Claims.html>.
APPENDIX
The WEKA output is shown below:

=== Run information ===

Scheme:       weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation:     2008_BSA_Inpatient_Claims_PUF_v2
Instances:    588415
Attributes:   8
              BENE_SEX_IDENT_CD
              BENE_AGE_CAT_CD
              IP_CLM_BASE_DRG_CD
              IP_CLM_ICD9_PRCDR_CD
              IP_CLM_DAYS_CD
              IP_DRG_QUINT_PMT_AVG
              IP_DRG_QUINT_PMT_CD
              READMIT
Test mode:    split 75% train, remainder test

=== Model and evaluation on training set ===

EM
==

Number of clusters selected by cross validation: 2

                                Cluster
Attribute                       0            1
                            (0.5)        (0.5)
=============================================
BENE_SEX_IDENT_CD
  mean                     1.5614       1.5609
  std. dev.                0.4962       0.4963
BENE_AGE_CAT_CD
  mean                     3.5743       3.575
  std. dev.                1.8063       1.8053
IP_CLM_BASE_DRG_CD
  mean                   139.7065     140.7003
  std. dev.               79.7529      79.6169
IP_CLM_ICD9_PRCDR_CD
  mean                    59.6499      59.7179
  std. dev.               21.2575      21.2698
IP_CLM_DAYS_CD
  mean                     2.5165       2.5162
  std. dev.                0.9735       0.972
IP_DRG_QUINT_PMT_AVG
  mean                  9573.9776    9050.2159
  std. dev.            12086.2923    8568.1225
IP_DRG_QUINT_PMT_CD
  mean                     3.001        2.9979
  std. dev.                1.4156       1.4142
READMIT
  mean                     0.0019       1
  std. dev.                0.0437       0.5

Time taken to build model (full training data) : 101530.36 seconds

=============================================

=== Model and evaluation on test split ===

EM
==

Number of clusters selected by cross validation: 2

                                Cluster
Attribute                       0            1
                            (0.5)        (0.5)
=============================================
BENE_SEX_IDENT_CD
  mean                     1.5612       1.5608
  std. dev.                0.4962       0.4963
BENE_AGE_CAT_CD
  mean                     3.5752       3.5724
  std. dev.                1.8054       1.8065
IP_CLM_BASE_DRG_CD
  mean                   140.6608     139.6798
  std. dev.               79.5428      79.6735
IP_CLM_ICD9_PRCDR_CD
  mean                    59.7292      59.646
  std. dev.               21.2796      21.2513
IP_CLM_DAYS_CD
  mean                     2.5151       2.5171
  std. dev.                0.9717       0.9734
IP_DRG_QUINT_PMT_AVG
  mean                  9072.7224    9559.9805
  std. dev.             8682.2329   12098.1965
IP_DRG_QUINT_PMT_CD
  mean                     2.9971       3.0001
  std. dev.                1.4145       1.4155
READMIT
  mean                     1            0.0017
  std. dev.                0.5          0.0411

Time taken to build model (percentage split) : 7727.8 seconds

Clustered Instances

0      73500 ( 50%)
1      73604 ( 50%)

Log likelihood: -26.50445
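One way to put a number on how similar the two clusters are is a standardized mean difference per attribute. The sketch below uses the test-split means and standard deviations reported above; the choice of metric and the function name are my own assumptions and are not part of the WEKA output.

```python
from math import sqrt

# Test-split values: (cluster 0 mean, cluster 0 sd, cluster 1 mean, cluster 1 sd)
TEST_SPLIT = {
    "BENE_SEX_IDENT_CD": (1.5612, 0.4962, 1.5608, 0.4963),
    "BENE_AGE_CAT_CD": (3.5752, 1.8054, 3.5724, 1.8065),
    "IP_CLM_BASE_DRG_CD": (140.6608, 79.5428, 139.6798, 79.6735),
    "IP_CLM_ICD9_PRCDR_CD": (59.7292, 21.2796, 59.646, 21.2513),
    "IP_CLM_DAYS_CD": (2.5151, 0.9717, 2.5171, 0.9734),
    "IP_DRG_QUINT_PMT_AVG": (9072.7224, 8682.2329, 9559.9805, 12098.1965),
    "IP_DRG_QUINT_PMT_CD": (2.9971, 1.4145, 3.0001, 1.4155),
}


def standardized_difference(m0, s0, m1, s1):
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((s0 ** 2 + s1 ** 2) / 2)
    return abs(m0 - m1) / pooled_sd


if __name__ == "__main__":
    for name, stats in TEST_SPLIT.items():
        print(f"{name:22s} {standardized_difference(*stats):.4f}")
```

Every attribute comes out well below 0.1 standard deviations, with the payment average showing the largest gap, which supports the observation that the clusters are nearly indistinguishable apart from READMIT itself.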
In my data set, there appear to be equal numbers of males and females in both the readmitted and non-readmitted populations.

The 5 age bands also show similar numbers in the readmitted and non-readmitted populations.

Diagnosis codes appear to be evenly distributed between the readmitted and non-readmitted populations as well.
For the procedure code there are some gaps in what was performed in my data set, but the readmitted and non-readmitted populations once again look similar.

After exploring how the data looked after clustering, I decided to use neural networks. A total of 5 nodes were created. Although each node can be analyzed, I was more interested in the summary output. The root mean squared error came out to 0.5699 on the 0/1 READMIT target, which makes this a poor neural network for predicting readmission. I am exploring ways to lower this RMSE, but I believe this is also a data pre-processing issue. The WEKA output is shown below.

=== Run information ===

Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation:     2008_BSA_Inpatient_Claims_PUF_v2
Instances:    588415
Attributes:   8
              BENE_SEX_IDENT_CD
              BENE_AGE_CAT_CD
              IP_CLM_BASE_DRG_CD
              IP_CLM_ICD9_PRCDR_CD
              IP_CLM_DAYS_CD
              IP_DRG_QUINT_PMT_AVG
              IP_DRG_QUINT_PMT_CD
              READMIT
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Linear Node 0
    Inputs    Weights
    Threshold    0.10507212142288758
    Node 1    -1.3304456888484266
    Node 2    -0.2549966715099939
    Node 3    -3.390541070153192
    Node 4    -3.6528578622010737
Sigmoid Node 1
    Inputs    Weights
    Threshold    -13.464450269067473
    Attrib BENE_SEX_IDENT_CD    -4.939190410264623
    Attrib BENE_AGE_CAT_CD    -8.740965586414031
    Attrib IP_CLM_BASE_DRG_CD    -0.10494256311588898
    Attrib IP_CLM_ICD9_PRCDR_CD    -0.2764279667821472
    Attrib IP_CLM_DAYS_CD    -4.941028814043151
    Attrib IP_DRG_QUINT_PMT_AVG    11.229004034610403
    Attrib IP_DRG_QUINT_PMT_CD    -1.4161938479261476
Sigmoid Node 2
    Inputs    Weights
    Threshold    -29.510054335269658
    Attrib BENE_SEX_IDENT_CD    -2.4530603729826272
    Attrib BENE_AGE_CAT_CD    12.256052634877257
    Attrib IP_CLM_BASE_DRG_CD    -6.556030932329017
    Attrib IP_CLM_ICD9_PRCDR_CD    -7.619288616295435
    Attrib IP_CLM_DAYS_CD    -0.9677884091499084
    Attrib IP_DRG_QUINT_PMT_AVG    7.5937369382205935
    Attrib IP_DRG_QUINT_PMT_CD    9.236701926843953
Sigmoid Node 3
    Inputs    Weights
    Threshold    -9.487580769500077
    Attrib BENE_SEX_IDENT_CD    2.8264751617336055
    Attrib BENE_AGE_CAT_CD    1.1920813268658716
    Attrib IP_CLM_BASE_DRG_CD    -1.0685283000189505
    Attrib IP_CLM_ICD9_PRCDR_CD    -0.4555504249499105
    Attrib IP_CLM_DAYS_CD    0.04425927527189678
    Attrib IP_DRG_QUINT_PMT_AVG    2.091775720037553
    Attrib IP_DRG_QUINT_PMT_CD    -0.14238942134384816
Sigmoid Node 4
    Inputs    Weights
    Threshold    -9.265272395597584
    Attrib BENE_SEX_IDENT_CD    0.16355901451324487
    Attrib BENE_AGE_CAT_CD    -0.18375201349583137
    Attrib IP_CLM_BASE_DRG_CD    -2.5806491645028133
    Attrib IP_CLM_ICD9_PRCDR_CD    0.3839209458877335
    Attrib IP_CLM_DAYS_CD    -2.010902479691323
    Attrib IP_DRG_QUINT_PMT_AVG    0.34075788774097127
    Attrib IP_DRG_QUINT_PMT_CD    0.7098860282183899
Class
    Input
    Node 0

Time taken to build model: 971.8 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient                  0.0024
Mean absolute error                      0.4994
Root mean squared error                  0.5699
Relative absolute error                 99.877  %
Root relative squared error            113.9805 %
Total Number of Instances           588415
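The relative error figures in the summary can be sanity-checked against a trivial baseline that always predicts the overall share of readmissions. The sketch below is an approximation under stated assumptions: it uses the READMIT counts from the frequency table, the function name is hypothetical, and the baseline is computed over the full data set rather than per cross-validation fold, so it will not match the WEKA output to the last digit.

```python
from math import sqrt

READMITTED = 294183    # rows with READMIT = 1 (from the frequency table)
TOTAL = 588415         # total number of instances

MODEL_MAE = 0.4994     # mean absolute error reported in the summary
MODEL_RMSE = 0.5699    # root mean squared error reported in the summary


def baseline_errors(p):
    """MAE and RMSE of always predicting p for a 0/1 target with mean p."""
    mae = 2 * p * (1 - p)
    rmse = sqrt(p * (1 - p))
    return mae, rmse


p = READMITTED / TOTAL
base_mae, base_rmse = baseline_errors(p)

# Values above 100% mean the model does worse than the constant baseline.
relative_absolute_error = 100 * MODEL_MAE / base_mae
root_relative_squared_error = 100 * MODEL_RMSE / base_rmse

if __name__ == "__main__":
    print(f"relative absolute error:     {relative_absolute_error:.2f} %")
    print(f"root relative squared error: {root_relative_squared_error:.2f} %")
```

With a balanced 0/1 target both baseline errors are about 0.5, so an RMSE of 0.5699 corresponds to a root relative squared error of roughly 114%. In other words, the network is slightly worse than predicting 0.5 for every claim, which is what one would expect when the label is random.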