Data Con LA 2020
Description
In 2018, healthcare spending in the US accounted for 17% of the nation’s GDP. With such significant spending, how can we better understand what that means for healthcare and treatment accessibility? When policy changes occur, how can we gauge the impact on rural areas, which are disproportionately affected by inadequate access to healthcare (or “healthcare deserts”)? Using publicly available data and records, it is possible to locate all major hospitals in the U.S. and, for every residential ZIP code, model the population affected by healthcare deserts at various travel mileage thresholds. This talk will focus on:
· The several public datasets that are available to address this question
· The logic and algorithm(s) used to compute this efficiently in Python
· Visualizing the problem and telling the story in Tableau
Speaker
Andrew Kaszpurenko, Manager of Advanced Analytics, Edwards Lifesciences THV Division
2. Agenda
Introduction: About me and what I do…
Data for Good: Mapping US Healthcare Desert
Data Sources
Methodologies – Brute Force
Visualization
Methodologies – Smarter way
Result and Summary
3. Intro: Andrew Kaszpurenko
Manager of Advanced Analytics at Edwards
Lifesciences THV Division for the last three years.
Leads a team that uses a variety of machine learning and
AI methods to build models that inform leadership and
business partners and help patients get treatment for
aortic stenosis.
Before joining Edwards, Andrew spent over a
decade working in a variety of industries,
including life insurance, health insurance, finance, and
direct primary care.
4. Intro: Edwards Lifesciences
Edwards Lifesciences
Edwards Lifesciences is the global leader in patient-focused medical innovations for
structural heart disease, as well as critical care and surgical monitoring.
Driven by a passion to help patients, the company collaborates with the world’s
leading clinicians and researchers to address unmet healthcare needs, working to
improve patient outcomes and enhance lives.
Edwards Lifesciences is headquartered in Irvine, CA, and has 14,000 employees globally.
Transcatheter Heart Valve (THV) Division
The minimally invasive procedure used to treat aortic stenosis is called transcatheter
aortic valve replacement (TAVR).
In the past, many people suffering from severe aortic stenosis had limited options to
replace an unhealthy valve, such as open heart surgery. Since 2011, TAVR has
opened a door of possibilities and options for treating people in the United States with
severe aortic stenosis.
5. Problem statement
Medical Background: Aortic valve stenosis that is related to increasing age and the
buildup of calcium deposits on the aortic valve is most common in older people. It
usually doesn't cause symptoms until ages 70 or 80.
Background: Medicare was looking to revise TAVR coverage decisions and some
organizations were pushing for tighter requirements for which hospitals could
perform the procedure based on volume of other procedures.
Problem Statement: Under different assumptions about how far a patient is
willing to travel for treatment, how can we show whether a policy change
would affect access?
6. Desired End Goal
Map of the United States, color indicating
where the TAVR deserts would be.
As granular as possible to really show the
deserts.
Ability to adjust the distance a patient travels
Ability to have a few scenarios of hospitals
included or not
Overlay income and poverty information
7. Tools & Technical Challenge
Tableau: Business friendly way to visually display and interface with the data
Python: To clean & organize the data in a manner that makes it friendly for Tableau
to work with.
Technical Challenge:
There are roughly 30,000 zip codes in the US, and Tableau should not have to redo
nearest-point calculations every time the user changes a parameter.
A full nearest-point calculation would mean about 450M distance computations
on every parameter change: (30,000 x 30,000)/2
8. Input Tables
Two tables needed:
List of Hospitals and their Zip Code
– Plus the different scenarios we want to
consider
A crosswalk of all the Zip Codes and their Lat,
Long
Both are publicly available:
– https://public.opendatasoft.com/explore/dataset/us-zip-
code-latitude-and-longitude/table/
– https://data.medicare.gov/Hospital-Compare/Hospital-
General-Information/xubh-q36u
List of Hospitals
List of Zip Codes and Lat Long
9. First Thoughts
Initial Thoughts:
Create a table outside of Tableau
Color coded by how far the closest hospital is:
1. Take every zip and calculate its distance to every
other zip.
2. Mask only the hospitals (zips) we are interested in.
3. Take the minimum distance for each zip in the table.
This means it’s (30k zips x 30k zips)/2 = 450M calcs.
Revise this:
Only the zips that contain a hospital need to be compared.
30k zips x 800 hospital zips = 24M calcs
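The two counts above can be checked with a few lines of arithmetic. This is only a sanity check; 30,000 and 800 are the rounded figures from the slide, not exact dataset sizes.

```python
# Rounded counts taken from the slide; the real files will differ slightly.
n_zips = 30_000
n_hospital_zips = 800

# Every ZIP against every other ZIP, counting each unordered pair once.
all_pairs = n_zips * n_zips // 2

# Every ZIP against only the ZIPs that contain a hospital.
hospital_pairs = n_zips * n_hospital_zips

reduction_factor = all_pairs / hospital_pairs

print(f"all pairs:      {all_pairs:,}")
print(f"hospital pairs: {hospital_pairs:,}")
print(f"reduction:      {reduction_factor:.2f}x")
```

Restricting the comparison to hospital zips cuts the work by a factor of about 19 before any smarter algorithm is involved.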
10. Methodology – Brute Force
Steps:
Import and clean every hospital and every
zip
Take each of the 30k zips, pair it with the 800 possible
hospital locations, and calculate each distance.
For every zip, take the minimum distance across
all of its hospital pairs.
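The steps above can be sketched in plain Python. This is a minimal illustration rather than the talk's code: it swaps in a haversine (spherical) distance instead of geopy's geodesic so that it runs with the standard library only, and the ZIP codes and coordinates are hypothetical sample values.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, the constant used in the talk's code


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical sample data: ZIP -> (lat, lon). Not taken from the talk's files.
zips = {"92614": (33.68, -117.83), "10001": (40.75, -73.99), "60601": (41.89, -87.62)}
hospital_zips = {"90033": (34.06, -118.20), "10016": (40.74, -73.98)}

# Brute force: for each ZIP, compute the distance to every hospital ZIP
# and keep the smallest (distance, hospital ZIP) pair.
nearest = {}
for z, (zlat, zlon) in zips.items():
    nearest[z] = min(
        (haversine_miles(zlat, zlon, hlat, hlon), h)
        for h, (hlat, hlon) in hospital_zips.items()
    )

for z, (miles, h) in nearest.items():
    print(f"{z}: nearest hospital ZIP {h}, {miles:.1f} miles")
```

The nested loop makes the cost explicit: the number of distance calls is the number of zips times the number of hospital zips.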
11. Code I – Import the Data and Basic clean up
Import the Zip Code information as a
DataFrame
Import the Hospitals information as a
DataFrame
– Some hospitals share the same zip
code as another hospital, so no need to
do the calc more than once
– Merge in the Lat/Long into the Hospital
DataFrame
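The clean-up described above might look like the following on toy data. The facility names, ZIP codes, and coordinates are hypothetical; only the pattern (deduplicate on ZIP, then left-merge the coordinates) reflects the slide.

```python
import pandas as pd

# Hypothetical crosswalk of ZIP codes to coordinates.
df_zip = pd.DataFrame({
    "ZipCode": ["90033", "10016", "92614"],
    "LAT": [34.06, 40.74, 33.68],
    "LON": [-118.20, -73.98, -117.83],
})

# Hypothetical hospital list; two hospitals share ZIP 90033.
df_hos = pd.DataFrame({
    "Facility": ["Hospital A", "Hospital B", "Hospital C"],
    "ZipCode": ["90033", "90033", "10016"],
})

# Distances are computed per ZIP, so one row per hospital ZIP is enough.
df_hos_zip = df_hos.drop_duplicates("ZipCode", keep="first")

# Attach coordinates by looking each hospital ZIP up in the crosswalk.
df_hos_zip = df_hos_zip.merge(df_zip, how="left", on="ZipCode").reset_index(drop=True)

print(df_hos_zip)
```

Keeping ZIP codes as strings matters in practice, because codes with leading zeros lose digits when read as integers.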
Process
12. Code II – Create Master Dataframe and Calc Distance
Create a new DataFrame with All Hospitals
Mapped to all possible Zip Codes.
– i.e., Hospital 1 will have 30,000 points
– There are 18M rows now
Process
Now have a Data Frame with every hospital
and all the zips and distance from it
Min distance along each zip
Run the function against each row and
return the miles from it.
– Uses geopy’s geodesic function
– On an i7-8650U it takes 1 hr 25 min
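As a point of comparison, the same all-pairs table can be built and measured without a row-by-row function call. The sketch below is an assumption-laden variant, not the talk's code: it uses a vectorized haversine formula in NumPy instead of geopy's geodesic (a sphere rather than an ellipsoid, so results differ slightly), relies on `how="cross"` from pandas 1.2 or later, and uses hypothetical sample coordinates.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8

# Hypothetical sample data, already renamed so both sides have distinct columns.
df_hos = pd.DataFrame({"Zip_Hos": ["90033", "10016"],
                       "Lath": [34.06, 40.74], "Longh": [-118.20, -73.98]})
df_zip = pd.DataFrame({"Zip_Map": ["92614", "10001", "60601"],
                       "Latz": [33.68, 40.75, 41.89], "Longz": [-117.83, -73.99, -87.62]})

# Every hospital ZIP paired with every ZIP.
df_pairs = df_hos.merge(df_zip, how="cross")

# Haversine distance computed on whole columns at once.
lat1 = np.radians(df_pairs["Latz"].to_numpy())
lat2 = np.radians(df_pairs["Lath"].to_numpy())
dlat = lat2 - lat1
dlon = np.radians(df_pairs["Longh"].to_numpy() - df_pairs["Longz"].to_numpy())
a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
df_pairs["miles"] = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Keep the row with the minimum distance for each ZIP.
df_min = df_pairs.loc[df_pairs.groupby("Zip_Map")["miles"].idxmin()].reset_index(drop=True)
nearest = dict(zip(df_min["Zip_Map"], df_min["Zip_Hos"]))

print(df_min[["Zip_Map", "Zip_Hos", "miles"]])
```

Column-wise arithmetic is typically much faster than `DataFrame.apply` with `axis=1`, though it still evaluates every pair.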
13. Taking the clean output
table and loading it into
Tableau.
Making the map a dual
axis map to get the
hospital locations overlaid
with the result
Using a simple color
palette where anyone can
tell what is good or bad.
Process
Setup of Visual
14. Adding in Population statistics
– Census
Income Metrics
– IRS.gov
Process
Adding Layers of Detail
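Once population is attached to each ZIP, the affected population at a given travel threshold is a filtered sum. The sketch below assumes hypothetical ZIP labels, distances, and populations; the function name `population_beyond` is illustrative and not from the talk.

```python
# Hypothetical per-ZIP results: (miles to nearest hospital, population).
zip_results = {
    "A": (12.0, 50_000),
    "B": (48.0, 8_000),
    "C": (95.0, 1_500),
    "D": (160.0, 500),
}


def population_beyond(results, threshold_miles):
    """Total population living farther than threshold_miles from a hospital."""
    return sum(pop for miles, pop in results.values() if miles > threshold_miles)


thresholds = [25, 50, 100, 150]
total_population = sum(pop for _, pop in zip_results.values())

desert_population = {t: population_beyond(zip_results, t) for t in thresholds}
desert_share = {t: desert_population[t] / total_population for t in thresholds}

for t in thresholds:
    print(f"> {t} miles: {desert_population[t]:,} people ({desert_share[t]:.1%})")
```

Because the distances are precomputed, changing the threshold only re-runs this cheap filter, which is the property the Tableau parameter relies on.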
15. Methodologies – Faster Way
Divide into smaller problems:
Lat, Long are just two points on a
surface.
Just search that region
Using scikit-learn’s K-D Tree
What’s taking so long:
Many repetitive calculations that are
useless.
– Want to avoid calculating whether Irvine is closer
to NYC than to Los Angeles.
Nearest Neighbor
16. K-D Tree basics
Divide into a smaller problem:
Lat, Long are just two points on a surface.
Divide into smaller problems
Using scikit-learn’s K-D Tree
[Figure: K-D Tree. Ten points, a through j, plotted on a 0 to 10 grid. The accompanying tree splits first on x ≥ 6 into the groups {a, b, c, d, e} and {f, g, h, i, j}, then splits each group on y ≥ 6.]
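A K-D Tree nearest-neighbor lookup on flat (x, y) points can be sketched with scikit-learn as follows. The ten coordinates are hypothetical stand-ins for points a through j; the slide's exact positions are not recoverable from the text.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Ten hypothetical points on a 0-10 grid, labeled a through j.
labels = list("abcdefghij")
points = np.array([
    [1, 8], [2, 9], [3, 7], [1, 2], [4, 3],
    [7, 8], [9, 9], [8, 2], [6, 5], [9, 4],
], dtype=float)

# Build the tree once; a small leaf_size forces several splits on this tiny set.
tree = KDTree(points, leaf_size=2)

# Ask for the single nearest point (k=1) to a query location.
query = np.array([[8.5, 8.5]])
dist, idx = tree.query(query, k=1)
nearest_label = labels[idx[0, 0]]

# Cross-check against a brute-force scan over every point.
brute_idx = int(np.argmin(np.linalg.norm(points - query, axis=1)))

print(nearest_label, round(float(dist[0, 0]), 3), labels[brute_idx])
```

The tree answers the same question as the brute-force scan but can skip whole regions of the plane that cannot contain a closer point.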
17. Ball Tree
Divide into a smaller problem:
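Because these points are on a sphere (a low-dimensional manifold), distances must be
great-circle (haversine) distances, which scikit-learn’s K-D Tree does not support
But scikit-learn’s Ball Tree accepts the haversine metric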
https://towardsdatascience.com/using-scikit-learns-binary-trees-to-efficiently-find-latitude-and-longitude-neighbors-909979bd929b
https://towardsdatascience.com/tree-algorithms-explained-ball-tree-algorithm-vs-kd-tree-vs-brute-force-9746debcd940
[Figure: Ball Tree. Ten points, a through j, plotted on a 0 to 10 grid.]
Conceptually similar to K-D Trees
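A minimal Ball Tree query with the haversine metric might look like this. The city coordinates are approximate and serve only as stand-ins for hospital sites; the Irvine versus NYC and Los Angeles comparison echoes the example from the earlier slide.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8

# Approximate (lat, lon) in degrees, used as stand-ins for hospital sites.
sites = {"Los Angeles": (34.0522, -118.2437), "New York": (40.7128, -74.0060)}
site_names = list(sites)

# The haversine metric expects [latitude, longitude] in radians.
site_rad = np.deg2rad(np.array(list(sites.values())))
tree = BallTree(site_rad, metric="haversine")

# Query point: Irvine, CA (approximate coordinates).
irvine_rad = np.deg2rad(np.array([[33.6846, -117.8265]]))
dist_rad, idx = tree.query(irvine_rad, k=1)

nearest_site = site_names[idx[0, 0]]
miles = float(dist_rad[0, 0]) * EARTH_RADIUS_MILES  # radians on the unit sphere -> miles

print(f"Nearest site to Irvine: {nearest_site}, about {miles:.0f} miles")
```

The tree is built once per hospital scenario and can then be queried for all zips in a single call.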
18. Code II – Add in Potential Scenarios
Using same main data frames
as before, ZipCode and
Hospital
Create Scenarios
Convert Lat and Long into
Radians
Process
Create Dataframes based on
the scenario
19. Code II – Ball Tree
Using same 2 main data
frames as before
– ZipCode
– Hospital
Process
Take the Array and put it into a
DataFrame
Query the Ball tree against all
zips (30k).
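k=1 sets how many sites to return (only the nearest)
The tree returns distances in radians; multiply by the Earth’s radius (about 3,959 miles) to get miles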
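The unit conversion can be checked by hand: a haversine distance is an angle on a unit sphere, so scaling by the Earth's radius gives miles. The helper name `radians_to_miles` below is illustrative only.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # constant used in the talk's code (rounded to 3959 on the slide)


def radians_to_miles(angle_rad):
    """Convert a central angle in radians to surface distance in miles."""
    return angle_rad * EARTH_RADIUS_MILES


one_radian_miles = radians_to_miles(1.0)
one_degree_miles = radians_to_miles(math.radians(1.0))

print(f"1 radian ~ {one_radian_miles:.1f} miles")
print(f"1 degree ~ {one_degree_miles:.2f} miles")
```

One degree of arc comes out at roughly 69 miles, which is a handy sanity check on any distance column.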
20. Code II – Final
Run for each scenario
Stack on top of itself for each
scenario.
– Tableau likes long and skinny
data
– 3 scenarios = 30k x 3 = 90k
rows
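Stacking the per-scenario results into one long table can be sketched as below. The scenario names, ZIP codes, and distances are hypothetical; the point is the shape, one row per ZIP per scenario.

```python
import pandas as pd

zip_codes = ["92614", "10001", "60601"]

# Hypothetical nearest-hospital distances (miles) for three scenarios.
scenario_miles = {
    "cur": [5.0, 2.0, 10.0],
    "pot": [35.0, 2.0, 40.0],
    "alt": [35.0, 60.0, 40.0],
}

# Build one frame per scenario, tagged with the scenario name.
frames = [
    pd.DataFrame({"ZipCode": zip_codes, "Miles": miles, "scenario": name})
    for name, miles in scenario_miles.items()
]

# Stack them vertically: rows = number of zips x number of scenarios.
df_long = pd.concat(frames, ignore_index=True)

print(df_long)
```

With the scenario as a column rather than as separate distance columns, a single Tableau filter or parameter can switch between scenarios.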
Process
22. White space is your friend
Remove grid lines &
default shading
Use a set palette and font
size
Every element should matter;
if it does not, remove it
Recognize the difference
between a presentation vs
dashboard
Visual Best Practices
23. Result and Summary
This analysis fed into internal work to understand different proposed Medicare
guidelines.
It helps to understand and quantify the geographic limitations to treatment accessibility.
Other applications:
Logistics and distribution
– Food accessibility, supply chain, etc
Edwards Lifesciences is looking for passionate data professionals, visit
www.edwards.com
25. Code 1 - Setup
from sklearn.neighbors import BallTree, KDTree
import numpy as np
import pandas as pd
from geopy.distance import geodesic
def get_scenario_list(df, mask, target_col):
    # returns a target_col subset as a list based on condition defined in mask
    foo_list = df[mask][target_col].to_list()
    return foo_list
# Import ZIP Code geo mapping file
# (`path` and `path2` are directory prefixes that are not shown on the slides)
import_cols = ['ZipCode', 'Latitude', 'Longitude', 'ShowMap', 'City', 'State', 'Population',
               'PopOver65', 'Median_Income', 'Average_Income']
df_zip = pd.read_excel(path + "ZipClean.xlsx", sheet_name="UniqueZip",
                       dtype={"ZipCode": str}, usecols=import_cols)
# ZIP code subset of useable zips
df_zip.drop_duplicates(['ZipCode'], keep='first', inplace=True)
df_zip.reset_index(drop=True, inplace=True)
df_zip.rename(columns={'Latitude': 'LAT', 'Longitude': 'LON'}, inplace=True)
df_zip = df_zip[np.isfinite(df_zip['LAT'])].reset_index(drop=True)
# Import Hospital File
df_hos_final = pd.read_excel(path2 + "HospitalFile.xlsx", sheet_name='Hospitals',
                             dtype={"Facility Zip": str})
df_hos_final.rename(columns={"Facility Zip": 'ZipCode'}, inplace=True)
# Merge Map information to hospital file to get Lat long for each hospital ZIP code
df_hos_final = pd.merge(df_hos_final, df_zip[['ZipCode', 'LAT', 'LON']], how="left", on=["ZipCode"])
# drop duplicates
df_hos = df_hos_final[['ZipCode', 'ScenerioCurrent', 'ScenerioPotential', 'LAT', 'LON']].copy()
df_hos.drop_duplicates(['ZipCode'], keep='first', inplace=True)
df_hos.reset_index(drop=True, inplace=True)
df_hos = df_hos[np.isfinite(df_hos['LAT'])].reset_index(drop=True)
26. Code 2 – Brute Force
# Rename the coordinate columns on copies so the zip and hospital sides stay distinct
# (the setup code named them 'LAT' and 'LON', and later steps still expect those names)
df_zip_bf = df_zip.rename(columns={'LAT': 'Latz', 'LON': 'Longz'})
df_hos_bf = df_hos.rename(columns={'LAT': 'Lath', 'LON': 'Longh'})
# Create a new DataFrame with every hospital to every possible zip available
df_dist = pd.merge(df_hos_bf.assign(key=0), df_zip_bf.assign(key=0), on='key').drop('key', axis=1)
df_dist.rename(columns={'ZipCode_x': 'Zip_Hos', 'ZipCode_y': 'Zip_Map'}, inplace=True)
# Run Distance Calc
df_dist['miles'] = df_dist.apply(
    lambda row: geodesic((row['Latz'], row['Longz']), (row['Lath'], row['Longh'])).miles,
    axis=1)
df_dist.reset_index(drop=True, inplace=True)
27. Code 3 – Nearest Neighbor (Ball Tree)
scenario_cur_mask = df_hos['ScenerioCurrent'] == 1
scenario_pot_mask = df_hos['ScenerioPotential'] == 1
# Use dictionary to store all of the scenario lists. This will be iterated through below
scenarios_dict = {
    'cur': get_scenario_list(df_hos, scenario_cur_mask, 'ZipCode'),
    'pot': get_scenario_list(df_hos, scenario_pot_mask, 'ZipCode'),
}
# Creates new columns converting coordinate degrees to radians (both dfs)
for df in [df_hos, df_zip]:
    for col in ['LAT', 'LON']:
        rad = np.deg2rad(df[col].values)
        df[f'{col}_rad'] = rad
# loop through each scenario in scenarios_dict, output BallTree nearest neighbor distance to closest hospital for every U.S. ZIP code
# add a flag for which scenario it is
# output for each scenario will be same length as df_zip
df_dist = pd.DataFrame()
for scenario in scenarios_dict:
    # subset df_hos by each scenario
    zip_list = scenarios_dict[scenario]
    locations_a = df_hos[df_hos['ZipCode'].isin(zip_list)].copy()
    locations_b = df_zip.copy()
    # BallTree nearest neighbor distance
    ball = BallTree(locations_a[["LAT_rad", "LON_rad"]].values, metric='haversine')
    distances, indices = ball.query(locations_b[['LAT_rad', 'LON_rad']].values, k=1)
    distances = distances * 3958.8
    # get distances into a df, concatenate
    df_temp = pd.DataFrame(data=distances, index=locations_b['ZipCode'])
    df_temp.rename(columns={0: 'Miles'}, inplace=True)
    df_temp.reset_index(drop=False, inplace=True)
    df_temp['scenario'] = scenario
    df_dist = pd.concat([df_dist, df_temp], ignore_index=True)
df_dist.reset_index(drop=True, inplace=True)
# merge in map data
cols_zip = ['ZipCode', 'City', 'State', 'Population', 'PopOver65', 'Median_Income', 'Average_Income']
df_dist = pd.merge(df_dist, df_zip[cols_zip], how="left", on=["ZipCode"])
# cols_hos = ['ZipCode', 'Facility Name', 'ScenerioPotential', 'ScenerioCurrent']
# df_dist = pd.merge(df_dist, df_hos_final[cols_hos], how="left", on=["ZipCode"])
df_dist.to_csv(path2 + "ZipDist_Hos_Sklearn.csv", index=False, header=True, encoding='utf8')