Published in:
Technology


- 1. Introduction to Healthcare Data Analytics with Extreme Tree Models • Yubin Park, PhD • Chief Technology Officer
- 2. Who am I • Co-founder and Chief Technology Officer of Accordion Health, Inc. • PhD from the University of Texas at Austin • Advisor: Professor Joydeep Ghosh • Studied Machine Learning and Data Mining, with a special focus on healthcare data • Involved in various industry data mining projects • USAA: Lifetime modeling of customers • SK Telecom: Smartphone purchase prediction, usage pattern analysis • LinkedIn Corp.: Related search keywords recommendation • Whole Foods Market: Price elasticity modeling • …
- 3. Accordion Health • Healthcare Data Analytics Company • Founded in 2014 by • Sriram Vishwanath, PhD • Yubin Park, PhD • Joyce Ho, PhD • A team of data scientists and medical professionals • Help healthcare organizations lower costs and improve quality • From Health Datapalooza 2014
- 4. Types of Problems We Solve • Which patient is likely to be readmitted? • Which patient is likely to develop type 2 diabetes? • Which patient is likely to adhere to their medication? • How much will this patient cost this year? • How many inpatient admissions will this patient have this year? • Which physician is likely to follow our care guideline? • What star rating will our organization receive this year? • …
- 5. Healthcare Data is Messy • Data structure • Unstructured data such as EHR • Structured data such as claims • Location • Doctors’ offices, insurance companies, governments, etc. • Data definition • Different definitions for different communities • Data format • Various industry formats • Data complexity • Patients going in and out of systems • Incomplete data • Regulations & requirements • Source: Health Catalyst
- 6. My Usual Workflow • Summary Statistics • Visual Inspection • Data Cleansing & Feature Engineering (1) • Baseline Models • Extreme Tree Models • Data Cleansing & Feature Engineering (2) • Custom Extreme Tree Models • Data Cleansing & Feature Engineering (3) • Fully Customized Models • I start my data project by checking summary statistics, distributions, and data errors, and by applying simple models. Extreme Tree Models* serve as a checkpoint before further developing customized models. • *Extreme Tree Models refer to a class of models that use a tree as a base classifier.
- 7. Why Tree-based Models • “Of all the well-known methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining.” • J. H. Friedman, R. Tibshirani, and T. Hastie, The Elements of Statistical Learning
- 8. How to Grow a Tree 1. Start with a dataset 2. Pick a splitting feature 3. Pick a splitting cut-point 4. Split the dataset into two sets based on the splitting feature and cut-point 5. Repeat from Step 2 with the partitioned datasets
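The five steps above can be sketched in plain Python. This is a minimal illustration, not the deck's code: it assumes a regression target, scores splits by weighted child variance, and stops at a hypothetical `max_depth`; the names `best_split`, `grow`, and `predict` are made up for this example.

```python
from statistics import mean, pvariance


def best_split(rows, targets):
    """Steps 2-3: pick the (feature, cut_point) with the lowest weighted child variance."""
    best, best_score = None, float("inf")
    for feature in range(len(rows[0])):
        for cut in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, targets) if row[feature] <= cut]
            right = [y for row, y in zip(rows, targets) if row[feature] > cut]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = len(left) * pvariance(left) + len(right) * pvariance(right)
            if score < best_score:
                best, best_score = (feature, cut), score
    return best


def grow(rows, targets, depth=0, max_depth=3):
    """Steps 4-5: split on the chosen pair, then recurse on each partition."""
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None:
        return {"leaf": mean(targets)}  # no usable split: predict the mean
    feature, cut = split
    left = [(r, y) for r, y in zip(rows, targets) if r[feature] <= cut]
    right = [(r, y) for r, y in zip(rows, targets) if r[feature] > cut]
    return {
        "feature": feature,
        "cut": cut,
        "left": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["cut"] else node["right"]
    return node["leaf"]
```

The exhaustive search over every observed value is the simplest possible choice; the forest variants on the following slides differ mainly in how they restrict or randomize this search.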
- 9. Various Kinds of Trees – C4.5, CART 1. Start with a dataset 2. Pick a splitting feature 3. Pick a splitting cut-point 4. Split the dataset into two sets based on the splitting feature and cut-point 5. Repeat from Step 2 with the partitioned datasets • Information Gain → C4.5 • Gini Impurity, Variance Reduction → CART • Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. • Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
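The three split criteria named on this slide can be written as small functions. The sketch below is illustrative only; the function names are hypothetical, and C4.5 additionally normalizes information gain into a gain ratio, which is left out here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits; the impurity behind information gain."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity, the CART criterion for classification."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def variance(values):
    """Population variance, the CART criterion for regression."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    children = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - children
```

Passing `entropy` gives information gain, `gini` gives the Gini decrease, and `variance` gives variance reduction, so one split search can serve all three criteria.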
- 10. Tree → Forest • Randomization Methods • Random data sampling • Random feature sampling • Random cut-point sampling
- 11. Various Kinds of Forests – Bagged Trees 1. Start with a dataset 2. Pick a splitting feature 3. Pick a splitting cut-point 4. Split the dataset into two sets based on the splitting feature and cut-point 5. Repeat from Step 2 with the partitioned datasets • Sample with replacement, and many trees → Bagged Trees • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
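Bagging changes only Step 1: each tree starts from a bootstrap sample of the dataset, and the trees' predictions are averaged. A minimal sketch with scikit-learn follows, using the `resample` helper that the deck imports later; the function names and the default of 25 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample


def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Fit one fully grown tree per bootstrap sample (sampling with replacement)."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        X_boot, y_boot = resample(X, y, replace=True, random_state=rng)
        trees.append(DecisionTreeRegressor(random_state=rng).fit(X_boot, y_boot))
    return trees


def predict_bagged(trees, X):
    """Average the predictions of all trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```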
- 12. Various Kinds of Forests – Random Subspace 1. Start with a dataset 2. Pick a splitting feature 3. Pick a splitting cut-point 4. Split the dataset into two sets based on the splitting feature and cut-point 5. Repeat from Step 2 with the partitioned datasets • Select a random subset of features • Then find the best feature/cut-point • Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
- 13. Various Kinds of Forests – Random Forests 1. Start with a dataset 2. Pick a splitting feature 3. Pick a splitting cut-point 4. Split the dataset into two sets based on the splitting feature and cut-point 5. Repeat from Step 2 with the partitioned datasets • Sample with replacement • Select a random subset of features • Then find the best feature/cut-point • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
- 14. Various Kinds of Trees – ExtraTrees 1. Start with a dataset 2. Pick a splitting feature 3. Pick a splitting cut-point 4. Split the dataset into two sets based on the splitting feature and cut-point 5. Repeat from Step 2 with the partitioned datasets • Select a random subset of (feature, cut-point) pairs • Then find the best (feature, cut-point) pair • Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
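The ExtraTrees variant of Steps 2 and 3 can be sketched with NumPy: draw a few random (feature, cut-point) pairs, then keep the best of them. This is an illustrative sketch assuming a regression target scored by variance reduction, with one uniformly drawn cut-point per sampled feature; the function names and the parameter `k` are hypothetical.

```python
import numpy as np


def random_split_candidates(X, k, rng):
    """Draw up to k random features and one uniform random cut-point for each."""
    n_features = X.shape[1]
    features = rng.choice(n_features, size=min(k, n_features), replace=False)
    candidates = []
    for f in features:
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo < hi:  # constant features cannot be split
            candidates.append((int(f), float(rng.uniform(lo, hi))))
    return candidates


def variance_reduction(y, mask):
    """Parent variance minus the size-weighted variance of the two children."""
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted


def pick_extratrees_split(X, y, k, rng):
    """Return the best of the random candidates, or None if there are none."""
    candidates = random_split_candidates(X, k, rng)
    if not candidates:
        return None
    return max(candidates, key=lambda fc: variance_reduction(y, X[:, fc[0]] <= fc[1]))
```

Compared with the exhaustive search of a classic tree, only `k` candidate splits are scored per node, which is where the extra randomness comes from.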
- 15. Again, Bias vs Variance • Bias: Error from model • Variance: Error from data • Recursive partition → fewer samples as tree grows • Split features/cut-points are susceptible to training samples • Randomization decreases variance • Image Source: Scott Fortmann-Roe
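The claim that randomization decreases variance can be checked empirically by refitting a model on many freshly drawn training sets and measuring how much its predictions move. The sketch below is an assumption-laden illustration: the sine-plus-noise data, the sample sizes, and the helper name `prediction_variance` are all made up, and a scikit-learn `RandomForestRegressor` stands in for the randomized ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


def prediction_variance(make_model, n_repeats=20, n_train=200, seed=0):
    """Average, over fixed test points, of the prediction variance across training sets."""
    rng = np.random.RandomState(seed)
    X_test = np.linspace(0, 1, 50).reshape(-1, 1)
    preds = []
    for i in range(n_repeats):
        # A new noisy training set for every repeat (synthetic, for illustration).
        X = rng.uniform(0, 1, size=(n_train, 1))
        y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.5, size=n_train)
        preds.append(make_model(i).fit(X, y).predict(X_test))
    return float(np.var(preds, axis=0).mean())


tree_var = prediction_variance(lambda i: DecisionTreeRegressor(random_state=i))
forest_var = prediction_variance(
    lambda i: RandomForestRegressor(n_estimators=50, random_state=i)
)
print(f"single tree: {tree_var:.3f}  forest: {forest_var:.3f}")
```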
- 16. Evolution of Bias vs. Variance • Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
- 17. Bias Variance Trade-off • Randomization methods reduce variance • However, for some problems, reducing the bias of a model may be more critical for improving its accuracy • A very complex dataset with many variables and samples • Image Source: Scott Fortmann-Roe
- 18. Are Tree Models High-Variance Models? • It depends… • Number of data samples • Number of features • Data complexity • Randomization Methods • Decrease Variance • But increase Bias • There is another way of decreasing the expected error, which • Decreases Bias • May increase variance
- 19. Boosting: Learn from Errors • Y = f_0(X), where E_1 = |Y − f_0(X)|^2 • E_1 = f_1(X), where E_2 = |Y − f_1(X)|^2 • E_2 = f_2(X), where E_3 = |Y − f_2(X)|^2 • and so on...
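The idea of learning from errors can be sketched as a loop in which each new model is fitted to what the previous models got wrong. Note the swap: this sketch fits signed residuals rather than the squared errors written on the slide, which is the usual choice for squared loss. Shallow scikit-learn trees are used as stages, and the function names, `learning_rate`, and `n_rounds` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit a constant, then repeatedly fit a small tree to the current residuals."""
    base = float(np.mean(y))
    residual = y - base
    stages = []
    for _ in range(n_rounds):
        stage = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - learning_rate * stage.predict(X)  # what is still unexplained
        stages.append(stage)
    return base, learning_rate, stages


def predict_boosted(model, X):
    """Sum the constant and the shrunken contribution of every stage."""
    base, learning_rate, stages = model
    pred = np.full(X.shape[0], base)
    for stage in stages:
        pred = pred + learning_rate * stage.predict(X)
    return pred
```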
- 20. Additive Model Framework • Additive Model Framework generalizes boosting, stacking, and other variants • Source: J. H. Friedman, R. Tibshirani, and T. Hastie, The Elements of Statistical Learning (ESL)
- 21. Gradient Boosting Machine • Additive Models can be numerically optimized via Gradient Descent • Source: Wikipedia and ESL • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
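The slide's formulas did not survive in this transcript. As a hedged reconstruction of the standard scheme (a simplified form with a fixed shrinkage factor $\nu$ in place of a per-stage line search; the symbols $L$, $F_m$, $h_m$, and $r_{im}$ are notation chosen here), gradient boosting fits each new base learner to the negative gradient of the loss:

```latex
\begin{align}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}} \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl( r_{im} - h(x_i) \bigr)^2 \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x)
\end{align}
```

For the squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$ the negative gradient is simply the residual $r_{im} = y_i - F_{m-1}(x_i)$, which connects this view to the learn-from-errors picture of boosting.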
- 22. Extreme Gradient Boosting (XGBoost) • Various Data Mining Competitions in Kaggle • One thing they have in common: they all used XGBoost
- 23. What’s so Special about XGBoost • XGBoost implements the basic idea of GBM with some tweaks, such as: • Regularization of base trees • Approximate split finding • Weighted quantile sketch • Sparsity-aware split finding • Cache-aware block structure for out-of-core computation • “XGBoost scales beyond billions of examples using far fewer resources than existing systems.” – T. Chen and C. Guestrin
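To make "regularization of base trees" concrete, here is a short summary of the regularized objective as given in Chen and Guestrin's XGBoost paper; the slide itself does not show these formulas. $T$ is the number of leaves of a tree $f$, $w$ its leaf weights, and $G$, $H$ are sums of first and second derivatives of the loss over the samples in a leaf or split side.

```latex
\begin{align}
\mathcal{L} &= \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2 \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda} \\
\text{Gain} &= \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
\end{align}
```

A larger $\lambda$ shrinks leaf weights toward zero, and a split is only worth making when its gain exceeds the per-leaf penalty $\gamma$.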
- 24. Going Further Extreme • XGBoost of XGBoost • Bagging of XGBoost • Bagging of XGBoost of XGBoost of … • Stacking, Bagging, Sampling, etc. • Source: Kaggle
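One of these combinations, bagging of boosted models, can be sketched with scikit-learn alone. This swaps in `GradientBoostingRegressor` for XGBoost so the example needs no extra dependency; the helper name `make_bagged_gbm` and all hyperparameter values are assumptions, not settings from the deck.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor


def make_bagged_gbm(n_bags=5, seed=0):
    """Wrap a gradient boosting model in a bagging ensemble of n_bags copies."""
    booster = GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=seed
    )
    # Each copy of the booster is trained on its own bootstrap sample.
    return BaggingRegressor(booster, n_estimators=n_bags, random_state=seed)
```

Usage follows the normal estimator interface, for example `make_bagged_gbm().fit(X_train, y_train).predict(X_test)`.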
- 25. Real-world Example: Predict MedAdh Scores • Centers for Medicare and Medicaid Services (CMS) measures the performance of Medicare Advantage (MA) Plans via the Star Rating System • Medication Adherence (MedAdh) is one of the most important quality measures in the Star Rating System • MA Plans want to know how much their MedAdh scores will change in the next two years
- 26. Predict MedAdh Scores • Where can I find data? • Download from the CMS Part C and D Performance Data webpage • Constructing datasets • MedAdh Data from 2012, 2013 → Training Features, Xtrain • MedAdh Data from 2015 → Training Label, Ytrain • MedAdh Data from 2013, 2014 → Test Features, Xtest • MedAdh Data from 2016 → Test Label, Ytest
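The train/test construction above is a sliding window over years: two feature years, then a label two years after the last feature year. A minimal sketch follows, assuming the scores have already been loaded into a nested dict `scores[plan][year]` for a single measure. The plan IDs and values are invented, the real CMS files need parsing first, and using the raw label-year score as the label is an assumption.

```python
MISSING = -999.0  # sentinel for a plan that was not measured in a feature year


def build_window(scores, feature_years, label_year):
    """Return (X, y): one row per plan that has a score in the label year."""
    X, y = [], []
    for plan, by_year in sorted(scores.items()):
        if label_year not in by_year:
            continue  # without a label the plan cannot be used
        X.append([by_year.get(year, MISSING) for year in feature_years])
        y.append(by_year[label_year])
    return X, y


# Hypothetical scores, for illustration only.
scores = {
    "PLAN_A": {2012: 70.0, 2013: 72.0, 2014: 73.0, 2015: 75.0, 2016: 76.0},
    "PLAN_B": {2013: 65.0, 2014: 66.0, 2015: 68.0, 2016: 67.0},
    "PLAN_C": {2012: 80.0, 2013: 81.0},
}

X_train, y_train = build_window(scores, feature_years=(2012, 2013), label_year=2015)
X_test, y_test = build_window(scores, feature_years=(2013, 2014), label_year=2016)
```

Keeping the test window strictly later than the training window mirrors how the model would be used: trained on the past, scored on years it has not seen.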
- 27. Lots of Missing Data • Not all MA plans are measured for a given year → Mean Imputation • Sample rows, with -999 marking a missing value:
      X1,X2,X3,X4,X5,X6,X7,X8,X9,Y
      ...
      71.2,72.7,69.9,75.2,75.9,71.0,1.8
      -999,-999,-999,75.8,72.5,68.8,-4.8
      61.8,59.4,57.7,57.3,59.3,58.3,16.7
      ...
      -999,-999,-999,82.8,80.0,69.8,-11.8
      73.8,73.2,71.8,74.5,76.1,72.9,4.5
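Mean imputation replaces each missing entry with the mean of the observed values in the same column. The sketch below assumes the `-999` sentinel seen in the sample rows and applies it to the feature columns only; the function name `impute_column_means` is hypothetical.

```python
import numpy as np

MISSING = -999.0


def impute_column_means(X, missing=MISSING):
    """Return a copy of X with sentinel entries replaced by their column mean."""
    X = np.array(X, dtype=float)  # np.array copies, so the input is left untouched
    mask = X == missing
    X[mask] = np.nan
    col_means = np.nanmean(X, axis=0)  # mean over observed values only
    rows, cols = np.where(mask)
    X[rows, cols] = col_means[cols]
    return X


# Feature columns of the sample rows shown on the slide (label column dropped).
features = [
    [71.2, 72.7, 69.9, 75.2, 75.9, 71.0],
    [-999, -999, -999, 75.8, 72.5, 68.8],
    [61.8, 59.4, 57.7, 57.3, 59.3, 58.3],
    [-999, -999, -999, 82.8, 80.0, 69.8],
    [73.8, 73.2, 71.8, 74.5, 76.1, 72.9],
]
imputed = impute_column_means(features)
```

To avoid leaking information, the column means would normally be computed on the training features and reused for the test features. scikit-learn's `SimpleImputer(missing_values=-999, strategy="mean")` offers the same behavior behind a fit/transform interface.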
- 28. Try Various Models • From simple models like Linear Regression and Decision Tree to extreme-tree models such as ExtraTrees and Gradient Boosting
      from sklearn import linear_model
      from sklearn import tree
      from sklearn.utils import resample
      from sklearn.metrics import mean_squared_error
      from sklearn.ensemble import ExtraTreesRegressor
      from sklearn.ensemble import GradientBoostingRegressor
- 29. Try Various Models – code snippet • From simple models like Linear Regression and Decision Tree to extreme-tree models such as ExtraTrees and Gradient Boosting
      lm = linear_model.LinearRegression()
      dt = tree.DecisionTreeRegressor()
      etr = ExtraTreesRegressor(n_estimators=100, max_depth=10)
      gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.25, max_depth=8)
- 30. Try Various Models – results
      $ python test.py
      …
      RMSE Results
      lm: 2.7125536923
      dt: 3.10460672029
      etr: 2.18597303421
      gbr: 2.02698129388
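The deck shows the imports, the model definitions, and the printed results, but not the fit-and-score loop in between. The sketch below fills that gap under stated assumptions: the models use the hyperparameters from the slide, while the helper `rmse_by_model`, the fixed `random_state` values, and the synthetic data are made up here, so the numbers it prints will not match the slide.

```python
import numpy as np
from sklearn import linear_model, tree
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


def rmse_by_model(models, X_train, y_train, X_test, y_test):
    """Fit every model on the training split and report RMSE on the test split."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        results[name] = float(np.sqrt(mse))
    return results


# Synthetic stand-in for the MedAdh features and labels (illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(50, 90, size=(400, 6))
y = 0.3 * (X[:, 5] - X[:, 0]) + np.sin(X[:, 3] / 5.0) + rng.normal(scale=1.0, size=400)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

models = {
    "lm": linear_model.LinearRegression(),
    "dt": tree.DecisionTreeRegressor(random_state=0),
    "etr": ExtraTreesRegressor(n_estimators=100, max_depth=10, random_state=0),
    "gbr": GradientBoostingRegressor(
        n_estimators=500, learning_rate=0.25, max_depth=8, random_state=0
    ),
}

results = rmse_by_model(models, X_train, y_train, X_test, y_test)
print("RMSE Results")
for name, value in results.items():
    print(f"{name}: {value}")
```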
- 31. Try Various Models – results • Extreme Tree Models exhibit significant improvements in accuracy compared to simple models. One can build more sophisticated models based on the error characteristics of these models.
- 32. Contact • yubin [at] accordionhealth [dot] com
