On the Predictability of non-CGM Diabetes Data for Personalized Recommendation
1. On the Predictability of non-CGM Diabetes
Data for Personalized Recommendation
Tu Nguyen, Markus Rokicki
Presenter: Damianos Melidis
L3S Research Center / Leibniz Universität Hannover, Germany
KDAH-CIKM 2018
2. Task Introduction
• Problem statement: given the current time and the past data, predict the blood glucose level in the next hour
• Input: blood glucose, insulin intake, carb intake, and activity (step) logs
• Output: predicted blood glucose
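As an illustration only (the record layout and field names below are assumptions, not from the slides), the task maps a patient's log history to a one-hour-ahead glucose value. The toy predictor here is just the "Last" baseline from the model list:

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    # hypothetical layout for one time step of a patient's log
    minutes: int      # minutes since midnight
    glucose: float    # blood glucose (mg/dL)
    insulin: float    # insulin intake (units)
    carbs: float      # carb intake (g)
    steps: int        # activity

def predict_next_hour(history: list) -> float:
    """Toy predictor (the 'Last' baseline): return the most
    recent observed glucose value as the one-hour forecast."""
    return history[-1].glucose

history = [LogEntry(480, 110.0, 0.0, 30.0, 0),
           LogEntry(540, 145.0, 4.0, 0.0, 500)]
print(predict_next_hour(history))  # 145.0
```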
5. Regression Models for Comparison
• Simple baselines: the last observed value (Last) and the arithmetic mean of the glucose values in the training set (AVG).
• Context-AVG: a (temporal) context-weighted average of previous glucose values.
• ARIMA: a generalization of the autoregressive moving average model.
• LSTM: a neural-network regression that accounts for the sequence dependence among glucose inputs.
• Random Forest: a meta-estimator that learns an ensemble of regression trees and averages their individual outputs to form the prediction.
• Extremely Randomized Trees: a Random Forest variant in which, in contrast to regular regression trees, the split thresholds per feature are chosen at random.
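A minimal sketch of how such a comparison might be run with scikit-learn (the synthetic data, RMSE metric, and hyperparameters here are assumptions; the paper's exact setup may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # stand-in feature matrix
y = X[:, 0] * 10 + 100 + rng.normal(size=200)    # stand-in glucose target

X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

models = {
    "AVG": lambda: np.full(len(y_te), y_tr.mean()),   # training-set mean baseline
    "Last": lambda: np.full(len(y_te), y_tr[-1]),     # last-value baseline
    "RandomForest": lambda: RandomForestRegressor(
        n_estimators=100, random_state=0).fit(X_tr, y_tr).predict(X_te),
    "ExtraTrees": lambda: ExtraTreesRegressor(
        n_estimators=100, random_state=0).fit(X_tr, y_tr).predict(X_te),
}
for name, run in models.items():
    print(f"{name}: RMSE = {rmse(y_te, run()):.2f}")
```

On this synthetic data the tree ensembles should beat the constant baselines; on real glucose logs the gap is an empirical question, which is what the paper measures.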
6. Features
44 features in total, grouped as:
• (Temporal) Context group
• Insulin decay concentration group
• Carb linear absorption group
• Activity (step) group
• Glucose group
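The slides do not give the exact feature formulas; as a hedged illustration, an "insulin decay concentration" feature could be an exponentially decayed sum of recent insulin doses (the half-life and function name below are arbitrary assumptions, not the paper's definition):

```python
import math

def insulin_decay_feature(doses, now, half_life_min=60.0):
    """Exponentially decayed sum of past insulin doses.
    doses: list of (minutes, units); now: current time in minutes.
    half_life_min is an assumed decay half-life, not from the paper."""
    k = math.log(2) / half_life_min
    return sum(u * math.exp(-k * (now - t)) for t, u in doses if t <= now)

# a dose of 4 units taken 60 minutes ago has decayed to about half
print(insulin_decay_feature([(0, 4.0)], now=60))  # ≈ 2.0
```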
7. Overall Results
Overall regression performance over all patients:
• Bold indicates the best-performing model
• We omit models that performed too weakly, e.g., linear regression
8. Feature Analysis
Most useful features  | Score
timeWeightAvgGlucose  | 1.63
avgGlucose            | 1.22
sumInsulinDecayLast_0 | 0.54
timeBin               | 0.46
sumInsulinLast_0      | 0.42
• Context-AVG and AVG are important features
• The time period of the day matters
• The most recent insulin intakes are important
9. Feature Analysis
Least useful features   | Score
sumInsulinLast_2        | 0.005
sumCarbAbsorptionLast_1 | 0.008
sumActivityLast_2       | 0.016
sumInsulinDecayLast_1   | 0.017
sumActivityLast_3       | 0.037
• Certain look-back lengths are not important
• Activity contributes less significantly
10. Performance by Period of the Day
Overall Random Forest results, split by subsets:
• time of day (night: 0-6, morning: 6-12, afternoon: 12-18, evening: 18-24)
• 1 hour and 4 hours after a meal
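The time-of-day split above can be sketched as a simple binning function (the function name is illustrative; the hour boundaries are the slide's):

```python
def time_of_day(hour: int) -> str:
    """Bin an hour (0-23) into the four periods from the slide:
    night 0-6, morning 6-12, afternoon 12-18, evening 18-24."""
    bins = ["night", "morning", "afternoon", "evening"]
    return bins[hour // 6]

print(time_of_day(3))   # night
print(time_of_day(14))  # afternoon
```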
11. When to Predict?
• In an online setting, a learning model suffers from a cold-start effect:
• not enough training instances
• noise
• new, unseen data that differs too much from the training data
• We evaluate two strategies to handle these aspects:
• training variability
• prediction confidence
12. When to predict – Training variability
• Standard deviation of 10-fold CV
• Intuition: the decrease in std. dev is correlated with model stability
• Observed:the decrease in std. dev is correlated with training size
• Level of correlation varies across patients
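A sketch of the training-variability check, assuming scikit-learn's `cross_val_score` with 10-fold CV on growing prefixes of the training data (synthetic data here; the actual patient logs and model settings differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X[:, 0] * 10 + rng.normal(size=400)

for frac in (0.25, 0.5, 0.75, 1.0):
    n = int(len(X) * frac)
    scores = cross_val_score(
        RandomForestRegressor(n_estimators=50, random_state=0),
        X[:n], y[:n], cv=10,
        scoring="neg_root_mean_squared_error")
    # std. dev. of the 10 per-fold scores as a stability proxy
    print(f"n={n:3d}  RMSE std across folds = {scores.std():.3f}")
```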
13. When to predict – Training variability
• Leave-one-out cross-validation on incrementally growing training sets
14. When to predict – Prediction Confidence
Prediction confidence: estimate standard errors in Random Forest
• Estimation relies on the sampling variance of bagged predictors
(trees)
• Sampling however suffers Monte Carlo effect (bias)
We apply 2 estimation approaches
• Bias estimate
• Bias-corrected estimate [1]
[1] Wager et al. “Confidence intervals for random forests: The jackknife and the infinitesimal jackknife.” JMLR 2014
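The slide's estimators follow Wager et al. [1]; as a much simpler stand-in, the spread of the individual trees' predictions already gives a rough per-point confidence signal (this is a proxy for illustration, not the paper's bias-corrected jackknife):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 10 + rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

X_new = rng.normal(size=(5, 5))
# collect every tree's prediction, then look at their spread
per_tree = np.stack([t.predict(X_new) for t in forest.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)
for m, s in zip(mean, std):
    print(f"prediction {m:7.2f}  +/- {s:.2f}")
```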
15. When to predict – Prediction Confidence
(Plots: prediction confidence; panels labeled "Bias-corrected")
16. When to predict – Incremental training size
• Observed: prediction confidence does not correlate with increasing training size
(Plots: prediction confidence with 1/4, 2/4, 3/4, and 4/4 of the training data)
17. Overall Results with Filtering Methods
• Sanity filter: heuristics that remove noise
• Stability filter: prediction confidence (the std. dev. is not needed once the training size is large enough)
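The slides do not list the sanity heuristics; a hedged sketch of such a filter might drop physiologically implausible glucose readings and large jumps (the thresholds and function name below are illustrative assumptions):

```python
def sanity_filter(readings, lo=40.0, hi=400.0, max_jump=100.0):
    """Keep readings within an assumed plausible range (mg/dL) and
    drop points that jump implausibly far from the last kept value."""
    kept = []
    for g in readings:
        if not (lo <= g <= hi):
            continue  # out-of-range value, likely a sensor or logging error
        if kept and abs(g - kept[-1]) > max_jump:
            continue  # implausible jump from the previous kept reading
        kept.append(g)
    return kept

print(sanity_filter([120.0, 900.0, 130.0, 10.0, 350.0]))  # [120.0, 130.0]
```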
19. Feature Analysis
We use tree-based strategies:
• rank how much a feature decreases the weighted impurity of a tree
• average over the forest
• shown for patient 8:
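In scikit-learn terms, this impurity-based ranking corresponds to `feature_importances_` on a fitted forest (synthetic data and generic feature names here, not the patient data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
# only the first two features carry signal; the rest are noise
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# mean decrease in weighted impurity per feature, averaged over the trees
for name, imp in zip(["f0", "f1", "f2", "f3"], forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The importances sum to 1, so they give a relative ranking rather than an absolute score like the table above.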