This project aims to predict healthcare costs and compare the predictions against the actual costs recorded in a US survey of hospitals. The dataset analyzed is a sample dataset used for educational purposes only.
2. BUSINESS PROBLEM:
A nationwide survey of hospital costs conducted by the US
Agency for Healthcare consists of hospital records of inpatient
samples. The analysis uses the hospital records from
Wisconsin, and our main aim is to predict the
healthcare cost.
3. DATA OVERVIEW:
* According to a Survey
Dataset dimensions:
Total number of rows: 151
Total number of columns: 6
Dependent variable:
TOTCHG – Actual total healthcare cost
Independent variables:
AGE: Values range from 0 to 17
FEMALE: 1 if the patient is female, otherwise 0
LOS: Length of stay, ranging from 0 to 41
RACE: 6 unique race values, coded 1 to 6
APRDRG: All Patient Refined Diagnosis Related Groups
4. BUSINESS SOLUTION:
Since the data has several independent variables and the
dependent variable (TOTCHG) is continuous, we will use the
Multiple Linear Regression (MLR) algorithm.
We will split the dataset in a 70:30 ratio, train the model on
70% of the data, and predict the healthcare cost on the remaining 30%.
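The split-and-fit step can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis. The function names (train_test_split, fit_ols, predict), the fixed random seed, the rounding rule for the split point, and the toy data are all assumptions made for the example.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_ols(X, y):
    """Fit y ~ intercept + X by solving the normal equations (X'X) b = X'y."""
    A = [[1.0] + [float(v) for v in row] for row in X]  # add intercept column
    k = len(A[0])
    # Augmented matrix [X'X | X'y]
    M = [[sum(r[i] * r[j] for r in A) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(A, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]  # [intercept, slope_1, ...]

def predict(coefs, X):
    """Apply fitted coefficients to new rows of predictors."""
    return [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in X]

# Toy data (made up for illustration): y = 100 + 50*x1 + 200*x2 exactly
X_toy = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8),
         (6, 2), (7, 7), (8, 4), (9, 9), (10, 1)]
y_toy = [100 + 50 * a + 200 * b for a, b in X_toy]
coefs = fit_ols(X_toy, y_toy)
```

With the 151 rows reported above, this rounding rule would give 106 training rows and 45 test rows. A different seed or rounding rule would give a slightly different split.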
5. MODEL INTERPRETATION:
Setting the confidence level at 95% (significance level
0.05), we looked for variables with p-value < 0.05 to
identify the significant variables.
We found that only AGE, LOS and APRDRG have
p-values below 0.05, so these are our
significant variables.
Therefore, we will rebuild our model using only
these significant variables.
Also, looking at the slope (Estimate) values, we
found the following relations between the independent
and dependent variables:
1. AGE and LOS have positive slopes: TOTCHG rises as they increase
2. FEMALE, RACE & APRDRG have negative slopes:
they are associated with lower TOTCHG
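The selection rule above (keep a variable only if its p-value is below 0.05) can be sketched as a simple filter. The p-values in this sketch are placeholders made up for illustration and are not the fitted model's output. They are chosen only so that the filter reproduces the selection stated above. The function name significant_terms is likewise hypothetical.

```python
ALPHA = 0.05  # significance level matching a 95% confidence level

def significant_terms(p_values, alpha=ALPHA):
    """Return the names of terms whose p-value is strictly below alpha."""
    return [name for name, p in p_values.items() if p < alpha]

# Placeholder p-values for illustration only; NOT taken from the model summary.
example_p_values = {
    "AGE": 0.01,
    "FEMALE": 0.40,
    "LOS": 0.001,
    "RACE": 0.60,
    "APRDRG": 0.02,
}

selected = significant_terms(example_p_values)
print(selected)  # ['AGE', 'LOS', 'APRDRG']
```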
6. MODEL RE-BUILDING:
Using only the significant variables from our last model, we
re-built the lm model and found that all the variables
have p-values below 0.05.
Thus, the model built with these variables is our final
model for predicting healthcare cost.
Looking at the slope (Estimate) values, we found the
following relations between the independent and dependent
variables:
1. AGE and LOS have positive slopes: TOTCHG rises as they increase
2. APRDRG has a negative slope: it is associated with lower TOTCHG
Also, we got an R-squared value of 0.4434. This
means our current independent variables
explain only 44.34% of the variance in the dependent variable.
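For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows that calculation in plain Python. The function name r_squared and the toy numbers are assumptions for illustration and are unrelated to the hospital data.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: variation left unexplained by the model
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation around the mean of the actual values
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy values (made up): ss_res = 1 and ss_tot = 5, so R-squared is 1 - 1/5
toy_actual = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.5, 1.5, 3.5, 3.5]
toy_r2 = r_squared(toy_actual, toy_predicted)
```

Perfect predictions give an R-squared of 1, and predicting the mean for every row gives 0. The reported 0.4434 sits a little under halfway between the two.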
7. PREDICTION RESULTS:
Based on our final model, we predicted the total
healthcare cost (predtest) on the test data and calculated
the residuals and the mean squared error (MSE).
Our MSE value turned out to be 3825548.
Also, we plotted our actual healthcare cost (TOTCHG)
against predicted healthcare cost (predtest) to see
how closely they agree.
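The residual and MSE calculation can be sketched as below. This is an illustration in plain Python with made-up numbers, not the original computation. The function names residuals and mean_squared_error are assumptions.

```python
def residuals(actual, predicted):
    """Actual minus predicted, one value per test row."""
    return [a - p for a, p in zip(actual, predicted)]

def mean_squared_error(actual, predicted):
    """Average of the squared residuals."""
    res = residuals(actual, predicted)
    return sum(r * r for r in res) / len(res)

# Toy values (made up): residuals are -2, 2 and -3, so MSE = (4 + 4 + 9) / 3
toy_actual = [10.0, 20.0, 30.0]
toy_predicted = [12.0, 18.0, 33.0]
toy_res = residuals(toy_actual, toy_predicted)
toy_mse = mean_squared_error(toy_actual, toy_predicted)
```

Because the residuals are squared, MSE is expressed in squared units of TOTCHG, so its magnitude should be read relative to the scale of the costs.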
8. CONCLUSION:
Our MSE value is 3825548, which corresponds to a root
mean squared error of about 1,956 in the units of TOTCHG.
This is high and indicates low accuracy in our predictions.
The plotted graph also shows the mismatch between
actual (TOTCHG) and predicted (predtest) healthcare
cost.
We also calculated the R-squared value, which turned
out to be only 44.34%.
Thus, we can conclude that our current model is
insufficient to predict healthcare cost accurately.
This is likely because our dataset is very small and
the current variables cannot fully
explain our dependent variable.
With a richer dataset having more features and
information, the same model may give
better results.