Introduction to Healthcare Data Analytics with Extreme Tree Models
Yubin Park, PhD
Chief Technology Officer
Who am I
• Co-founder and Chief Technology Officer of Accordion Health, Inc.
• PhD from the University of Texas at Austin
  • Advisor: Professor Joydeep Ghosh
  • Studied Machine Learning and Data Mining, with a special focus on healthcare data
• Involved in various industry data mining projects
  • USAA: Lifetime modeling of customers
  • SK Telecom: Smartphone purchase prediction, usage pattern analysis
  • LinkedIn Corp.: Related search keyword recommendation
  • Whole Foods Market: Price elasticity modeling
  • …
Accordion Health
• Healthcare Data Analytics Company
• Founded in 2014 by
  • Sriram Vishwanath, PhD
  • Yubin Park, PhD
  • Joyce Ho, PhD
• A team of data scientists and medical professionals
• Helps healthcare organizations lower costs and improve quality
From Health Datapalooza 2014
Types of Problems We Solve
• Which patient is likely to be readmitted?
• Which patient is likely to develop type 2 diabetes?
• Which patient is likely to adhere to their medication?
• How much will this patient cost this year?
• How many inpatient admissions will this patient have this year?
• Which physician is likely to follow our care guideline?
• What star rating will our organization receive this year?
• …
Healthcare Data is Messy
• Data structure
  • Unstructured data such as EHR
  • Structured data such as claims
• Location
  • Doctors' offices, insurance companies, governments, etc.
• Data definition
  • Different definitions for different communities
• Data format
  • Various industry formats
• Data complexity
  • Patients going in and out of systems
  • Incomplete data
  • Regulations & requirements
• Source: Health Catalyst
My Usual Workflow
Summary Statistics → Visual Inspection → Data Cleansing & Feature Engineering (1) → Baseline Models → Extreme Tree Models → Data Cleansing & Feature Engineering (2) → Custom Extreme Tree Models → Data Cleansing & Feature Engineering (3) → Fully Customized Models
I start my data project by checking summary statistics, distributions, and data errors, and by applying simple models.
Extreme Tree Models* serve as a checkpoint before further developing customized models.
*Extreme Tree Models refer to a class of models that use a tree as a base classifier.
Why Tree-based Models
"Of all the well-known methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining."
• T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning
How to Grow a Tree
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
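A minimal sketch of these five steps for regression, assuming NumPy arrays and variance reduction as the split score (the criterion CART uses, per the next slide); grow_tree and best_split are illustrative names, not the presenter's code:

import numpy as np

def best_split(X, y):
    # Steps 2-3: scan every (feature, cut-point) pair and keep the one
    # that most reduces the variance of y across the two child nodes.
    best, best_score = None, np.var(y) * len(y)
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= cut
            score = (np.var(y[left]) * left.sum()
                     + np.var(y[~left]) * (~left).sum())
            if score < best_score:
                best, best_score = (j, cut), score
    return best

def grow_tree(X, y, min_samples=5):
    # Step 5 stops when a node is too small or already pure.
    if len(y) < min_samples or np.var(y) == 0:
        return {"leaf": float(np.mean(y))}
    split = best_split(X, y)
    if split is None:
        return {"leaf": float(np.mean(y))}
    j, cut = split
    left = X[:, j] <= cut  # Step 4: partition the dataset in two
    return {"feature": j, "cut": cut,
            "left": grow_tree(X[left], y[left], min_samples),
            "right": grow_tree(X[~left], y[~left], min_samples)}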
Various Kinds of Trees – C4.5, CART
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Information Gain → C4.5
Gini Impurity, Variance Reduction → CART
- Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
- Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
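To make the two splitting criteria concrete, a small sketch of entropy-based information gain (C4.5) and Gini impurity (CART) for classification labels; the function names are illustrative:

import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum_k p_k log2(p_k), the basis of C4.5's information gain
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G(S) = 1 - sum_k p_k^2, CART's impurity for classification
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Gain = H(parent) minus the size-weighted entropy of the children
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))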
Tree → Forest
• Randomization Methods
  • Random data sampling
  • Random feature sampling
  • Random cut-point sampling
Various Kinds of Forests – Bagged Trees
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Sample with replacement, and many trees → Bagged Trees
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
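A hedged sketch of bagging, reusing scikit-learn pieces that appear later in these slides; each tree is fit on a bootstrap sample and the forest averages their predictions (the function names are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def fit_bagged_trees(X, y, n_trees=100):
    trees = []
    for _ in range(n_trees):
        # Step 1 is what changes: each tree starts from a sample
        # of the dataset drawn with replacement.
        Xb, yb = resample(X, y)
        trees.append(DecisionTreeRegressor().fit(Xb, yb))
    return trees

def predict_bagged(trees, X):
    # Averaging over many trees is what reduces variance.
    return np.mean([t.predict(X) for t in trees], axis=0)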
Various Kinds of Forests – Random Subspace
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Select a random subset of features
Then find the best feature/cut-point
- Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
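A minimal sketch of the random subspace idea, with each tree trained on a random subset of columns; the subset size k is an assumption for illustration, not a value from the paper:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_random_subspace(X, y, n_trees=100, k=3):
    models = []
    for _ in range(n_trees):
        # Each tree only ever sees k randomly chosen features.
        cols = rng.choice(X.shape[1], size=k, replace=False)
        models.append((cols, DecisionTreeRegressor().fit(X[:, cols], y)))
    return models

def predict_random_subspace(models, X):
    return np.mean([m.predict(X[:, cols]) for cols, m in models], axis=0)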
Various Kinds of Forests – Random Forests
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Sample with replacement
Select a random subset of features
Then find the best feature/cut-point
- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
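In scikit-learn these two ingredients map onto two parameters; a brief usage sketch (the hyperparameter values are illustrative):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,        # sample the data with replacement per tree
    max_features="sqrt",   # random feature subset examined at each split
)
# rf.fit(Xtrain, Ytrain); rf.predict(Xtest)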
Various Kinds of Trees – ExtraTrees
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Select a random subset of (feature, cut-point) pairs
Then find the best (feature, cut-point) pair
- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
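A sketch of the extra randomization at a single node: instead of searching all pairs, draw k random (feature, cut-point) candidates and keep the best-scoring one (k and the function name are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def random_split_candidates(X, k=10):
    # Draw k (feature, cut-point) pairs; each cut-point is uniform
    # between the chosen feature's min and max within this node.
    candidates = []
    for _ in range(k):
        j = int(rng.integers(X.shape[1]))
        lo, hi = X[:, j].min(), X[:, j].max()
        candidates.append((j, float(rng.uniform(lo, hi))))
    return candidates

# Each candidate would then be scored with, e.g., the variance-reduction
# criterion from the earlier grow_tree sketch, and the best one kept.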
Again, Bias vs. Variance
• Bias: error from the model
• Variance: error from the data
• Recursive partitioning → fewer samples as the tree grows
  • Split features/cut-points are sensitive to the training samples
• Randomization decreases variance
• Image Source: Scott Fortmann-Roe
Evolution of Bias vs. Variance
(Figure from the reference below.)
- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
Bias-Variance Trade-off
Image Source: Scott Fortmann-Roe
• Randomization methods reduce variance
• However, for some problems, reducing the bias of a model may be more critical for improving its accuracy
  • e.g., a very complex dataset with many variables and samples
Are Tree Models High-Variance Models?
• It depends on…
  • Number of data samples
  • Number of features
  • Data complexity
• Randomization methods
  • Decrease variance
  • But increase bias
There is another way of decreasing the expected error, which:
- Decreases bias
- May increase variance
Boosting: Learn from Errors
Y ≈ f0(X), where E1 = |Y − f0(X)|^2
E1 ≈ f1(X), where E2 = |E1 − f1(X)|^2
E2 ≈ f2(X), where E3 = |E2 − f2(X)|^2
and so on...
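A minimal sketch of this error-fitting loop with shallow regression trees; for squared loss each stage fits the current residual, a simplified stand-in for the gradient step formalized on the next slides (names and hyperparameters are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_stages=100, lr=0.1):
    pred = np.zeros(len(y))
    stages = []
    for _ in range(n_stages):
        residual = y - pred                    # the current error E_k
        t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * t.predict(X)              # shrink each correction
        stages.append(t)
    return stages

def predict_boosted(stages, X, lr=0.1):
    return lr * np.sum([t.predict(X) for t in stages], axis=0)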
Additive Model Framework
• The Additive Model Framework generalizes boosting, stacking, and other variants
• Source: T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning (ESL)
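For reference, the basis-expansion form from ESL, where the base learner b (here a tree) has its own parameters γ_m:

f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m)

Boosting corresponds to fitting these terms sequentially, one m at a time.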
Gradient Boosting Machine
• Additive Models can be numerically optimized via Gradient Descent
• Source: Wikipedia and ESL
- Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
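In symbols, each stage takes a step along the negative functional gradient of the loss; a sketch of the update, loosely following ESL's notation:

f_m(x) = f_{m-1}(x) + \nu \, h_m(x), \quad h_m(x) \approx -\left[ \frac{\partial L(y, f(x))}{\partial f(x)} \right]_{f = f_{m-1}}

Here h_m is a regression tree fit to the negative gradient (for squared loss, the ordinary residual) and ν is the learning rate.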
Extreme Gradient Boosting (XGBoost)
Various data mining competitions on Kaggle
One thing they have in common:
- They all used XGBoost
What’s	so	Special	about	XGBoost
• XGBoost implements	the	basic	idea	of	GBM	with	some	tweaks,	such	
as:
• Regularization	of	base	trees
• Approximate	split	finding
• Weighted	quantile sketch
• Sparsity-aware	split	finding
• Cache-aware	block	structure	for	out-of-core	computation
• “XGBoost scales	beyond	billions	of	examples	using	far	fewer	resources	
than	existing	systems.”	– T.	Chen	and	C.	Guestrin
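A hedged usage sketch with the xgboost Python package (the hyperparameter values are illustrative, not from the slides); reg_lambda exposes the base-tree regularization mentioned above, and tree_method="approx" selects approximate split finding:

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.25,
    max_depth=8,
    reg_lambda=1.0,        # L2 regularization on leaf weights
    tree_method="approx",  # approximate, quantile-based split finding
)
# model.fit(Xtrain, Ytrain); model.predict(Xtest)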
Going Further Extreme
• XGBoost of XGBoost
• Bagging of XGBoost
• Bagging of XGBoost of XGBoost of …
• Stacking, Bagging, Sampling, etc.
• Source: Kaggle
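For instance, "Bagging of XGBoost" can be sketched in a few lines (a hedged illustration, not a recipe from the slides):

import numpy as np
import xgboost as xgb
from sklearn.utils import resample

def fit_bagged_xgb(X, y, n_bags=10):
    # Each booster is trained on its own bootstrap sample.
    models = []
    for _ in range(n_bags):
        Xb, yb = resample(X, y)
        models.append(xgb.XGBRegressor(n_estimators=200).fit(Xb, yb))
    return models

def predict_bagged_xgb(models, X):
    return np.mean([m.predict(X) for m in models], axis=0)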
Real-world Example: Predict MedAdh Scores
• The Centers for Medicare and Medicaid Services (CMS) measures the performance of Medicare Advantage (MA) Plans via the Star Rating System
• Medication Adherence (MedAdh) is one of the most important quality measures in the Star Rating System
• MA Plans want to know how much their MedAdh scores will change in the next two years
Predict MedAdh Scores
• Where can I find the data?
  • Download from the CMS Part C and D Performance Data webpage
• Constructing the datasets
  • MedAdh data from 2012, 2013 → Training Features, Xtrain
  • MedAdh data from 2015 → Training Label, Ytrain
  • MedAdh data from 2013, 2014 → Test Features, Xtest
  • MedAdh data from 2016 → Test Label, Ytest
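A hedged sketch of this construction with pandas, assuming one CSV per measurement year keyed by an MA contract ID (the file names and the contract_id column are hypothetical):

import pandas as pd

years = {y: pd.read_csv("medadh_%d.csv" % y, index_col="contract_id")
         for y in [2012, 2013, 2014, 2015, 2016]}

# Features: two consecutive years of scores; label: the score two years out.
Xtrain = years[2012].join(years[2013], lsuffix="_2012", rsuffix="_2013")
Ytrain = years[2015]
Xtest = years[2013].join(years[2014], lsuffix="_2013", rsuffix="_2014")
Ytest = years[2016]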
Lots of Missing Data
• Not all MA plans are measured in a given year → Mean Imputation
X1,X2,X3,X4,X5,X6,X7,X8,X9,Y
...
71.2,72.7,69.9,75.2,75.9,71.0,1.8
-999,-999,-999,75.8,72.5,68.8,-4.8
61.8,59.4,57.7,57.3,59.3,58.3,16.7
...
-999,-999,-999,82.8,80.0,69.8,-11.8
73.8,73.2,71.8,74.5,76.1,72.9,4.5
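A minimal sketch of the mean imputation, assuming -999 is the missing-value sentinel as in the sample above (and that every column has at least one observed value):

import numpy as np

def mean_impute(X, sentinel=-999):
    # Replace the sentinel in each column with that column's mean
    # computed over the observed (non-sentinel) values.
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        observed = X[:, j] != sentinel
        X[~observed, j] = X[observed, j].mean()
    return X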
Try Various Models
• From simple models like Linear Regression and Decision Tree to extreme tree models such as ExtraTrees and Gradient Boosting
from sklearn import linear_model                         # LinearRegression baseline
from sklearn import tree                                 # DecisionTreeRegressor baseline
from sklearn.utils import resample                       # bootstrap sampling
from sklearn.metrics import mean_squared_error           # RMSE evaluation
from sklearn.ensemble import ExtraTreesRegressor         # extreme tree model
from sklearn.ensemble import GradientBoostingRegressor   # extreme tree model
Try Various Models – code snippet
• From simple models like Linear Regression and Decision Tree to extreme tree models such as ExtraTrees and Gradient Boosting
lm = linear_model.LinearRegression()
dt = tree.DecisionTreeRegressor()
etr = ExtraTreesRegressor(n_estimators=100, max_depth=10)
gbr = GradientBoostingRegressor(n_estimators=500,
                                learning_rate=0.25,
                                max_depth=8)
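The slides do not show the fit/evaluate step; a minimal sketch, assuming Xtrain, Ytrain, Xtest, and Ytest are the arrays built earlier, producing the RMSE figures on the next slide:

import numpy as np

for name, model in [("lm", lm), ("dt", dt), ("etr", etr), ("gbr", gbr)]:
    model.fit(Xtrain, Ytrain)
    rmse = np.sqrt(mean_squared_error(Ytest, model.predict(Xtest)))
    print("%s: %s" % (name, rmse))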
Try Various Models – results
$ python test.py
…
RMSE Results
lm:  2.7125536923
dt:  3.10460672029
etr: 2.18597303421
gbr: 2.02698129388
Try Various Models – results
Extreme Tree Models exhibit significant improvements in accuracy compared to simple models.
One can build more sophisticated models based on the error characteristics of these models.
Contact
• yubin [at] accordionhealth [dot] com