LOW/HIGH-RISK DIABETES GROUP SEGMENTATION USING α-TREES
AMA-IEEE Medical Technology Conference 2011
Anurekha Ramakrishnan (1), Yubin Park (2), Joydeep Ghosh (2)
(1) Dept. of Statistics and Scientific Computation
(2) Dept. of Electrical and Computer Engineering
The University of Texas at Austin
Barriers to Machine Learning Adoption in Healthcare
Class imbalance: target ratios are often extremely skewed.
Mismatch with performance metrics: misclassification rates may not be relevant because asymmetric costs are involved. Sensitivity/specificity or lift should be part of the learning goals.
Interpretation of results: simple AND/OR rules (in natural language) are desirable.
We suggest a possible solution to these problems using modified α-Trees and a disjunctive combination of rules.
Objectives
Other requirements:
Interpretable segmentation: AND/OR rules in natural language.
Extensive coverage using simple rules.
Note: These objectives differ from traditional machine learning objectives. They are based on observations of many failed medical decision support systems.
BRFSS Dataset
Behavioral Risk Factor Surveillance System (URL: http://www.cdc.gov/brfss/)
The largest telephone survey, conducted since 1984.
Tracks health conditions and risk behaviors in the United States.
Contains information on a variety of diseases, e.g. diabetes, hypertension, cancer, asthma, HIV, etc.
More than 400,000 records per year.
Many states use BRFSS data to support health-related legislative efforts.
α-Tree [1]
A decision tree algorithm (in the family of CART and C4.5).
Decision criterion: α-divergence. Generalizes C4.5.
Robust performance in class-imbalance settings.
Modified α-Tree: growth stops when a low/high-risk group is obtained.
Different α values result in different decision rules.
Decision trees provide greedy (sub-optimal) solutions. By disjunctively combining the solutions from different α-Trees, we can approach a better solution.
Python code available (http://www.ideal.ece.utexas.edu/~yubin/)
[1] Y. Park and J. Ghosh, "Compact Ensemble Trees for Imbalanced Data," in 10th International Workshop on Multiple Classifier Systems, Italy, June 2011.
3-Phase Diagram
Example: the high-risk group is defined as a group with a diabetes rate above 24%, twice the rate of the normal population.
Rule 1: RFHYPE5 = 1 & AGE_G >= 5.0 & RFHLTH = 2 & BMI4CAT >= 2.0 (from α = 0.1)
OR Rule 2: RFHYPE5 ≠ 1 & RFHLTH = 1 & BMI4CAT >= 2.9 & PNEUVAC3 = 1 (from α = 1.0)
OR Rule 3: RFHYPE5 = 2 & RFHLTH ≠ 1 (from α = 1.5)
OR …
These combined rules extract high-risk diabetes segments (>24%).
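The disjunctive combination can be illustrated with a minimal sketch: each rule is a conjunction (AND) of conditions, and a respondent belongs to the high-risk segment if any rule (OR) matches. The three rules are the ones listed above; the `records` values, the `DIABETES` column name, and the helper names are hypothetical and serve only to show the mechanics.

```python
# Each rule is a list of predicates that must all hold (AND).
RULES = [
    # Rule 1 (from alpha = 0.1)
    [lambda r: r["RFHYPE5"] == 1, lambda r: r["AGE_G"] >= 5.0,
     lambda r: r["RFHLTH"] == 2, lambda r: r["BMI4CAT"] >= 2.0],
    # Rule 2 (from alpha = 1.0)
    [lambda r: r["RFHYPE5"] != 1, lambda r: r["RFHLTH"] == 1,
     lambda r: r["BMI4CAT"] >= 2.9, lambda r: r["PNEUVAC3"] == 1],
    # Rule 3 (from alpha = 1.5)
    [lambda r: r["RFHYPE5"] == 2, lambda r: r["RFHLTH"] != 1],
]


def matches(rule, record):
    """A single rule matches when every one of its conditions holds."""
    return all(condition(record) for condition in rule)


def in_segment(rules, record):
    """The combined segment is the union (OR) of the individual rules."""
    return any(matches(rule, record) for rule in rules)


def segment_stats(rules, records, target="DIABETES"):
    """Return (coverage, target rate) of the combined segment."""
    selected = [r for r in records if in_segment(rules, r)]
    coverage = len(selected) / len(records)
    rate = sum(r[target] for r in selected) / len(selected) if selected else 0.0
    return coverage, rate


# Made-up respondents (NOT BRFSS data); DIABETES is a hypothetical 0/1 label.
records = [
    {"RFHYPE5": 1, "AGE_G": 5, "RFHLTH": 2, "BMI4CAT": 3, "PNEUVAC3": 2, "DIABETES": 1},
    {"RFHYPE5": 2, "AGE_G": 3, "RFHLTH": 1, "BMI4CAT": 3, "PNEUVAC3": 1, "DIABETES": 0},
    {"RFHYPE5": 2, "AGE_G": 6, "RFHLTH": 2, "BMI4CAT": 2, "PNEUVAC3": 2, "DIABETES": 1},
    {"RFHYPE5": 1, "AGE_G": 2, "RFHLTH": 1, "BMI4CAT": 1, "PNEUVAC3": 2, "DIABETES": 0},
    {"RFHYPE5": 1, "AGE_G": 4, "RFHLTH": 2, "BMI4CAT": 3, "PNEUVAC3": 2, "DIABETES": 0},
    {"RFHYPE5": 1, "AGE_G": 3, "RFHLTH": 1, "BMI4CAT": 2, "PNEUVAC3": 2, "DIABETES": 0},
]

coverage, rate = segment_stats(RULES, records)
print(f"coverage={coverage:.2f}, diabetes rate in segment={rate:.2f}")
```

Adding a rule from another α-Tree can only keep or enlarge the coverage of the combined segment, which is why the combined rate should be re-checked against the risk threshold after each addition.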
Example Tree Structure
When α = 2.0, a total of five high-risk segmentation rules are extracted. Different α values result in different tree structures.
[Figure: example tree with Yes/No branches]
Results for the Twice-Higher Diabetes Rate Group (High-risk)
Resultant rules from α-Trees:
RFHYPE5 = 2 & RFHLTH ≠ 1
RFHYPE5 ≠ 2 & RFHLTH = 2 & RFCHOL = 2
…
English translation:
Segment 1: They have high blood pressure and consider themselves unhealthy (including those who did not respond to this question).
Segment 2: They have high cholesterol and consider themselves unhealthy, but they do not have high blood pressure.
…
Results for the Four-Times-Lower Diabetes Rate Group (Low-risk)
Resultant rules from α-Trees:
RFHYPE5 ≠ 2 and RFHLTH ≠ 2 and PNEUVAC3 ≠ 1
RFHYPE5 = 1 and RFHLTH ≠ 2 and AGE_G < 5.0
…
English translation:
Segment 1: They do not have high blood pressure and consider themselves healthy. They had a pneumonia shot at least once in their lifetime.
Segment 2: They have high blood pressure, but consider themselves healthy and are under 50 years of age.
…
Appendix A
α-Divergence
Special cases
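As a hedged illustration, the sketch below implements one common parameterization of the α-divergence for discrete, normalized distributions, D_α(p‖q) = (1 − Σ p^α q^(1−α)) / (α(1−α)). Whether this is exactly the form and the set of special cases intended here is an assumption. Under this parameterization, α → 1 gives KL(p‖q), α → 0 gives KL(q‖p), α = 2 gives half the Pearson χ² divergence, and α = 0.5 is proportional to the squared Hellinger distance. The function name and the requirement of strictly positive probabilities are choices made for the example.

```python
import math


def alpha_divergence(p, q, alpha):
    """alpha-divergence between two discrete distributions p and q.

    Assumes p and q are sequences of strictly positive probabilities
    that each sum to 1 (an assumption made to keep the sketch short).
    """
    if abs(alpha - 1.0) < 1e-9:
        # Limit alpha -> 1: Kullback-Leibler divergence KL(p || q).
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    if abs(alpha) < 1e-9:
        # Limit alpha -> 0: reverse divergence KL(q || p).
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))


p = [0.5, 0.5]
q = [0.25, 0.75]
for a in (0.5, 1.0, 2.0):
    print(a, alpha_divergence(p, q, a))
```

The two limits are handled as explicit branches because the general expression divides by α(1−α), which is zero at α = 0 and α = 1.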
Appendix B
Modified α-Tree Algorithm
Input: BRFSS (input data), α (parameter)
Output: Low-risk group extraction rules
Select the best feature, i.e. the one that maximizes the α-divergence criterion.
If (no such feature exists) or (number of data points < cut-off size) or (this group is a low/high-risk group) then
    stop growth.
Else
    segment the input data based on the best feature.
    recursively run Modified α-Tree Algorithm(segmented data, α) on each segment.
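The recursion above can be sketched in Python as follows. This is a minimal illustration, not the authors' released code. Several choices are assumptions made for the example: the split criterion is taken to be the α-divergence between the joint distribution of (feature value, class) and the product of its marginals; splits are multiway on categorical values; only low-risk leaves are emitted; α must be positive; and the names `grow`, `split_score`, `threshold`, and `min_size` are hypothetical.

```python
import math
from collections import Counter


def alpha_divergence(p, q, alpha):
    """alpha-divergence for alpha > 0; cells with p = 0 contribute nothing."""
    if abs(alpha - 1.0) < 1e-9:  # limit alpha -> 1: KL(p || q)
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return (1.0 - s) / (alpha * (1.0 - alpha))


def split_score(rows, feature, target, alpha):
    """Divergence between joint p(x, y) and the product p(x) p(y) (assumed criterion)."""
    n = len(rows)
    joint = Counter((r[feature], r[target]) for r in rows)
    px = Counter(r[feature] for r in rows)
    py = Counter(r[target] for r in rows)
    cells = [(x, y) for x in px for y in py]
    p = [joint[(x, y)] / n for x, y in cells]
    q = [px[x] * py[y] / n ** 2 for x, y in cells]
    return alpha_divergence(p, q, alpha)


def grow(rows, features, target, alpha, threshold, min_size, path=()):
    """Return a list of (rule, rate, size) tuples for the low-risk leaves."""
    if len(rows) < min_size:  # too few data points: stop growth
        return []
    rate = sum(r[target] for r in rows) / len(rows)
    if rate <= threshold:  # low-risk group reached: stop and emit its rule
        return [(path, rate, len(rows))]
    if not features:  # no feature left to split on
        return []
    scores = {f: split_score(rows, f, target, alpha) for f in features}
    best = max(scores, key=scores.get)
    if scores[best] <= 1e-12:  # no informative feature: stop growth
        return []
    remaining = [f for f in features if f != best]
    rules = []
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        rules += grow(subset, remaining, target, alpha, threshold,
                      min_size, path + ((best, value),))
    return rules


def make_rows(hype, hlth, n, n_diabetic):
    """Build made-up rows (NOT BRFSS data) with a 0/1 DIABETES label."""
    return [{"RFHYPE5": hype, "RFHLTH": hlth, "DIABETES": int(i < n_diabetic)}
            for i in range(n)]


rows = (make_rows(1, 1, 10, 0) + make_rows(1, 2, 10, 2)
        + make_rows(2, 1, 10, 3) + make_rows(2, 2, 10, 7))
for rule, rate, size in grow(rows, ["RFHYPE5", "RFHLTH"], "DIABETES",
                             alpha=1.0, threshold=0.05, min_size=5):
    print(" and ".join(f"{f} = {v}" for f, v in rule), f"(rate={rate:.2f}, n={size})")
```

Running `grow` with several α values and taking the union of the returned rules corresponds to the disjunctive combination of rules from different α-Trees. Extracting high-risk groups would use the mirrored test `rate >= threshold`.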