PREDICTION and RATE analysis: Health Insurance

HEALTH INSURANCE RATE ANALYSIS
AND PREDICTION
USING HEALTHCARE.GOV
MARKETPLACE DATA
By Sunitha Flowerhill
Big Data, BI, Hadoop Data lake Engineer and Architect
1

The Health Insurance Marketplace Public Use Files
(PUF) contain data on health and dental plans
offered to individuals and small businesses through
the US Health Insurance Marketplace.
2

PROJECT DIRECTIONS, PROCEDURES, GOALS:
• DOWNLOAD NATIONWIDE DATASETS FROM HEALTHCARE.GOV
• LOOK AT THE METADATA AND SEE IF IT MATCHES WITH YOUR PROJECT GOALS.
• IDENTIFY THE BEST SUITED DATASET FROM THE DOWNLOADED BUNCH OF INSURANCE
DATASETS
• CLEAN UP THE DATA USING JMP TOOLS: ROWS AND COLS MENUS, DATA FILTER, ROW
SELECTION, ETC.
• NARROW IT DOWN TO STATE OF DELAWARE DATA
• PRELIMINARY ANALYSIS OF THE DATA – MARK THE NECESSARY COLUMNS, DELETE EMPTY
COLUMNS
• CHECK FOR CONSISTENCY OF DATA USING GRAPH BUILDER
• CONVERT THE CATEGORICAL VARIABLES: AGE TO NUMERIC, RATE TO CURRENCY (REMOVE
THE $ SYMBOL)
• FURTHER CORE ANALYSIS: DECISION TREE, PARTIAL LEAST SQUARES, NEURAL NETWORKS
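The cleanup steps above were done in JMP. As a rough illustration only, the same filtering and type conversion could be sketched in Python. The column names (StateCode, Age, IndividualRate), the sample rows, and the rule of keeping the leading number of an age label are all assumptions made for this sketch, not details taken from the slides.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions about rate_puf.csv.
SAMPLE = """StateCode,Age,IndividualRate
DE,21,$250.10
DE,0-20,$159.30
AK,30,$310.00
DE,65 and over,"$1,050.25"
"""


def age_to_number(age_text):
    """Return the leading integer of an age label, e.g. '65 and over' -> 65."""
    digits = ""
    for ch in age_text.strip():
        if ch.isdigit():
            digits += ch
        else:
            break
    return int(digits) if digits else None


def rate_to_number(rate_text):
    """Strip the dollar sign and thousands separators, then parse as float."""
    return float(rate_text.replace("$", "").replace(",", "").strip())


def clean_rows(text, state="DE"):
    """Keep only rows for one state and convert Age and rate to numbers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["StateCode"] != state:
            continue
        rows.append({
            "Age": age_to_number(row["Age"]),
            "IndividualRate": rate_to_number(row["IndividualRate"]),
        })
    return rows


de_rows = clean_rows(SAMPLE)
```

How age bands such as "0-20" should be coded is a modeling decision; taking the leading number is only one possible choice.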
3

Out of the 18 downloaded datasets, I selected the large individual rates
file. I then selected the DE data, cleaned up the age column and made it
numeric, cleaned up the rate column by removing the dollar sign, removed
columns that are insignificant for DE (such as tobacco), and eliminated
empty columns. The tools used were the data filter, row selection, the
formula editor, etc.
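Removing empty and uninformative columns can also be expressed as a small rule. The sketch below is a Python illustration, not the JMP procedure used here: it drops any column that is empty in every row or holds one identical value in every row. The sample rows and column names are hypothetical, and treating the tobacco column as constant for DE is an assumption of the sketch.

```python
def drop_uninformative_columns(rows):
    """Drop columns that are empty in every row or identical in every row."""
    if not rows:
        return rows
    keep = []
    for name in rows[0]:
        values = {row[name] for row in rows}
        non_empty = {v for v in values if v not in ("", None)}
        # Keep a column only if it has data and more than one distinct value.
        if non_empty and len(values) > 1:
            keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical rows: Tobacco is constant and Couple is empty everywhere.
sample_rows = [
    {"Age": 21, "Tobacco": "No Preference", "IndividualRate": 250.10, "Couple": ""},
    {"Age": 40, "Tobacco": "No Preference", "IndividualRate": 319.50, "Couple": ""},
]
cleaned = drop_uninformative_columns(sample_rows)
```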
4

THE DATA
The rate_puf.csv file has now become rate_DE.jmp, containing only clean data.
5

There is a steady increase of rate per month and year.
There is a steady increase of rate with age.
Finding out which issuer holds the most business in the State of DE.
Which issuers have marked-up and marked-down versions of plans.
I have done various analyses to make sure I am choosing the correct X
factors. There is an interesting 3D plot with Rate as Y, and Age and
version number as X and Z.
6

The first analysis is the Partition decision tree – I chose this because of
the significant number of categorical variables. The major report
elements are towards the right.
7

Here is a beautiful story unfolding from the insurance rates of the state
of Delaware, from Healthcare.gov: out of 15,928 individuals, 1,350 people
of prime age have a premium of 0. The major contributors to the premium
are listed in the green rectangle. Age is the most decisive factor, with
14 splits. The second is the version number, which I believe is the
marked-up or marked-down version of the same plan by Healthcare.gov, with
8 splits. Then comes the issuer, that is, the various companies that offer
healthcare plans. The rest of the components are insignificant.
Altogether there are 25 splits on the prime components mentioned above.
A decision tree is the best choice when many of the variables are
categorical, and there is only one Y, which is the rate per individual.
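To make the idea of a "split" concrete, here is a minimal Python sketch of how a regression tree can pick one split point on a numeric factor such as age. It uses the generic criterion of minimizing the summed squared error of the two groups; JMP's Partition platform may rank candidate splits differently, and the ages and rates below are made-up numbers, not values from the DE data.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the split x < threshold with lowest SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no boundary between identical x values
        threshold = pairs[i][0]
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Made-up example: rates jump between age 30 and age 50.
ages = [21, 25, 30, 50, 55, 60]
rates = [200, 210, 220, 400, 420, 440]
threshold, total_sse = best_split(ages, rates)
```

A full tree repeats this search inside each resulting group, which is how a single factor like age can account for many splits.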
8

3D TREE
The RSquare looks good and the Actual by Predicted plot is symmetrical.
Three split trials gave similar results.
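For readers unfamiliar with RSquare, the sketch below shows the standard definition, one minus the ratio of residual to total sum of squares, in Python. The actual and predicted values are invented for illustration and are not taken from the JMP report.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented values: predictions close to the actual rates give R^2 near 1.
actual_rates = [200, 300, 400]
predicted_rates = [210, 290, 400]
r2 = r_squared(actual_rates, predicted_rates)
```

A value close to 1 means the predicted rates track the actual rates closely, which is what a tight, symmetrical Actual by Predicted plot also indicates.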
9

This is a Fit Model, partial least squares
10

The minimum number of factors is 8,
and there are 16 factors for VIP.
11

COMPARING PREDICTION PROFILERS :
PLS, DECISION TREE, NEURAL NETWORK
Out of curiosity, I compared the decision tree with another method,
partial least squares, which mostly supports continuous variables. The
prediction profiler mentioned above is very interesting to look at. The
major factors in the rate prediction in the state of Delaware are:
1. Age (the rate increases with age). 2. Version number (the higher the
number, the lower the rate; low version numbers have marked-up
premiums). Then categorical variables such as issuerid1 and issuerid2
take the next places. We have 2014, 2015 and 2016 data; there is a
constant, insignificant increase with month and year.
15

THE
BEGINNING...
LESSONS LEARNED, CONCLUSIONS, APPENDIX:
✓ START EARLY, MAKE EVERY EFFORT TO CLEAN DATA, ANALYZE AND RE-ANALYZE USING GRAPHS
✓ ELIMINATE UNWANTED DATA, GET OPTIMUM DATA FOR EVALUATION
➢ WHEN THERE IS A SIGNIFICANT NUMBER OF CATEGORICAL VARIABLES, A PARTITION DECISION TREE IS A GOOD
CHOICE.
➢ FIT MODEL->PLS ALSO ACCEPTS A MIXTURE OF CATEGORICAL AND NUMERIC VARIABLES AND GIVES
OPTIMUM RESULTS.
➢ NEURAL NETWORKS WORK WONDERS WITH LARGER, CLEANER DATASETS.
➢ FROM ALL THE ANALYSES, AGE, ISSUER AND THE MARKED UP-DOWN VERSION NUMBER ARE THE MOST SIGNIFICANT
FACTORS IN DECIDING THE INDIVIDUAL RATE.
➢ FOR RATE PREDICTION, THE MAJOR COMPONENTS ARE:
1. AGE
2. VERSION NUMBER
3. ISSUERID, ISSUERID2
4. MONTH AND YEAR
APPENDIX:
HTTPS://DATA.HEALTHCARE.GOV/
HTTP://DHSS.DELAWARE.GOV/DHCC/
16
