Evaluation*
in ubiquitous computing
* (quantitative evaluation, qualitative studies are a field of their own!)
Nils Hammerla, Research Associate @ Open Lab
Evaluation? Why?
• Performance evaluation
• How well does it work?
• Model selection
• Which approach works best?
• Parameter selection
• How do I tune my approach for best results?
• Demonstration of feasibility
• Is it worthwhile investing more time/money?
Training and test-sets
Basic elements: samples, segments, images, etc.
Training and test-sets
• Hold-out validation
  • Choose part of the data as training and test-set
    • By some heuristic (e.g. 20% held out as test), or
    • Depending on the study design.
  • Gives a single performance estimate
  • Performance may critically depend on the chosen test-set

[Figure: data split into training, validation, and test portions, e.g. by user (User 1–6) or by run (Runs 1–3 for training, Run 4 for testing)]

• User-dependent performance
  How would the system perform on more data from a known user?
• User-independent performance
  How would the system perform for a new user?
  (A sketch of both kinds of split follows below.)
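As an illustration of the two kinds of hold-out split, here is a minimal Python sketch. It assumes scikit-learn and numpy are available; `X`, `y` and the per-sample `users` array are hypothetical placeholders, not data from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Hypothetical data: feature matrix X, labels y, and the user each sample came from.
rng = np.random.RandomState(0)
X = rng.randn(600, 10)
y = rng.randint(0, 3, size=600)
users = np.repeat(np.arange(6), 100)          # "User 1" ... "User 6"

# User-dependent hold-out: samples from every user may appear in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# User-independent hold-out: all samples of a held-out user go to the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=1/6, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=users))
print("held-out user(s):", np.unique(users[test_idx]))
```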
Training and test-sets
• Repeated hold-out
  • Split the data into k contiguous folds
  • A split into k folds gives k performance estimates
  • Gives a more reliable performance estimate
  • But: this depends on how the folds are constructed!
• Variants (see the sketch below)
  • Leave-one-subject-out (LOSO) — user-independent
  • Leave-one-run-out — user-dependent
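A minimal LOSO sketch with scikit-learn, reusing the hypothetical `X`, `y`, `users` arrays from the previous sketch; leave-one-run-out works the same way, with a per-sample run identifier as the group.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Leave-one-subject-out: each fold holds out every sample of one user.
logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, groups=users, cv=logo)
print("per-subject accuracy:", scores)       # one estimate per held-out user
print("mean ± std: %.2f ± %.2f" % (scores.mean(), scores.std()))
```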
Training and test-sets
• (Random, stratified) cross-validation
  • Split the data into k folds at the level of individual samples
  • Folds are constructed to be stratified w.r.t. the class distribution
  • Popular in general machine learning
    • The standard approach in many ML frameworks (see the sketch below)
  • Gives a user-dependent performance estimate
  • Requires samples to be statistically independent
    • This is rarely the case in ubicomp!
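For reference, this is what the standard stratified k-fold looks like in scikit-learn, again reusing the hypothetical `X`, `y` from above. Note that it shuffles individual samples and therefore ignores any dependence between neighbouring segments.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Random, stratified 5-fold cross-validation on individual samples.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("fold accuracies:", scores)
# Caution: if X contains overlapping sliding windows, near-identical neighbouring
# windows end up in both training and test folds, inflating these numbers.
```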
Pitfalls
• User-dependent vs. user-independent performance
  • Assumption: user-dependent performance is the upper bound of possible system performance
  • Cross-validation is therefore popular in ubicomp to demonstrate the feasibility of e.g. a new technical approach
[Figure: "LOSO vs. xval performance in related work" — accuracy (0–100%) of cross-validation (xval) vs. leave-one-subject-out (loso) across six studies]
Pitfalls
• Cross-validation in segmented time-series
  • When testing on a segment i, it is very likely that segments i-1 or i+1 are in the training set
  • Neighbouring segments typically overlap
    • They are therefore very similar!
  • This biases the results and leads to inflated performance figures
Pitfalls
• Cross-validation in segmented time-series
  • This is an issue, as cross-validation is widely used
    • For user-dependent evaluation
    • For parameter estimation (e.g. in neural networks)
  • A simple extension of cross-validation alleviates the bias (an illustrative sketch of block-wise splitting follows below)
  • There will be a talk on this on Friday morning!
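The extension itself is the topic of the Friday talk. Purely as an illustration of why neighbouring windows leak, and of one way to keep them on the same side of a split, here is a hypothetical sketch that groups overlapping windows by the contiguous block they were cut from and cross-validates over blocks rather than individual windows (all names and the block length are mine).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical 1-D signal cut into 50%-overlapping sliding windows.
signal = np.random.randn(10_000)
win, hop = 128, 64
starts = np.arange(0, len(signal) - win, hop)
X_win = np.stack([signal[s:s + win] for s in starts])

# Adjacent windows share half their samples: a random k-fold would put near-duplicates
# of each test window into the training set. Grouping windows into contiguous blocks
# and splitting over blocks keeps neighbours together.
block_len = 20                                   # windows per block (illustrative)
blocks = np.arange(len(starts)) // block_len
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X_win, groups=blocks):
    # No block ever appears in both training and test fold.
    assert not set(blocks[train_idx]) & set(blocks[test_idx])
```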
Pitfalls
• Reporting results
  • Always report the mean and standard deviation of your results
  • Use t-tests for differences in mean performance (see the sketch below)
    • Whether t-tests on cross-validated results are meaningful is still debated in statistics
    • You have to use the same folds when comparing different approaches!
  • Better: repeat the whole experiment multiple times (if computationally feasible)
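A small sketch of this style of reporting, with hypothetical per-fold accuracies for two approaches evaluated on the same folds (scipy assumed available).

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two approaches, evaluated on the SAME folds.
acc_a = np.array([0.81, 0.78, 0.84, 0.80, 0.79])
acc_b = np.array([0.77, 0.75, 0.80, 0.78, 0.74])

print("A: %.2f ± %.2f" % (acc_a.mean(), acc_a.std()))
print("B: %.2f ± %.2f" % (acc_b.mean(), acc_b.std()))

# Paired t-test, since both approaches share identical folds.
# (Whether such tests are meaningful on cross-validated results is still debated.)
t, p = stats.ttest_rel(acc_a, acc_b)
print("paired t-test: t=%.2f, p=%.3f" % (t, p))
```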
Performance metrics

Confusion matrix (rows: ground truth, columns: prediction):

                 A     B     C
            A  1022    13   144
            B   300   542    24
            C    12    55   132

Basic elements, counted for each class (A, B, C):
• true positives (tp)
• false positives (fp)
• false negatives (fn)
• true negatives (tn)
Performance metrics

For each class (or for two-class problems):

  Precision / PPV         tp / (tp + fp)
  Recall / Sensitivity    tp / (tp + fn)
  Specificity             tn / (tn + fp)
  Accuracy                (tp + tn) / (tp + fp + fn + tn)
  F1-score                2 * prec * sens / (prec + sens)
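A minimal numpy sketch of these per-class quantities, computed from the confusion matrix above (rows = ground truth, columns = prediction); the function name is my own, not part of the lecture.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples with ground truth i predicted as j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # column sum minus the diagonal
    fn = cm.sum(axis=1) - tp          # row sum minus the diagonal
    tn = cm.sum() - tp - fp - fn
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc  = (tp + tn) / cm.sum()
    f1   = 2 * prec * sens / (prec + sens)
    return prec, sens, spec, acc, f1

cm = np.array([[1022, 13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]])
for name, vals in zip(["PPV", "Sens", "Spec", "Acc", "F1"], per_class_metrics(cm)):
    print(name, np.round(vals, 2))
```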
Performance metrics

For a set of classes C = {A, B, C, …}, with n_c samples of class c and N samples in total:

  Overall accuracy      (1/N) * Σ_{c∈C} tp_c
  Mean accuracy         (1/|C|) * Σ_{c∈C} acc_c
  Weighted F1-score     (2/N) * Σ_{c∈C} n_c * (prec_c × sens_c) / (prec_c + sens_c)
  Mean F1-score         (2/|C|) * Σ_{c∈C} (prec_c × sens_c) / (prec_c + sens_c)
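A sketch of these aggregates, building on the `per_class_metrics` helper and the matrix `cm` defined in the previous sketch (naming is mine); on that matrix it reproduces the figures on the example slide further below.

```python
def aggregate_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    prec, sens, spec, acc, f1 = per_class_metrics(cm)
    n_c, N = cm.sum(axis=1), cm.sum()
    overall_acc = np.diag(cm).sum() / N        # (1/N) * sum of tp_c
    mean_acc    = acc.mean()                   # (1/|C|) * sum of acc_c
    weighted_f1 = (n_c / N * f1).sum()         # classes weighted by their size
    mean_f1     = f1.mean()                    # every class weighted equally
    return overall_acc, mean_acc, weighted_f1, mean_f1

print(np.round(aggregate_metrics(cm), 2))      # approx. (0.76, 0.84, 0.76, 0.69)
```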
Performance metrics

Normalised confusion matrices
• Divide each row by its sum to get confusion probabilities

[Figure: normalised confusion matrices, a) on the HOME data-set (prediction vs. diary annotation) and b) on the LAB data-set (prediction vs. clinician annotation); classes: asleep, off, on, dys; cell values in the range 0–1]

[Hammerla, Nils Y., et al. "PD Disease State Assessment in Naturalistic Environments using Deep Learning." AAAI 2015]
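Row-normalisation itself is a one-liner; a small sketch using the 3-class example matrix from earlier rather than the PD matrices shown in the figure.

```python
import numpy as np

cm = np.array([[1022, 13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]], dtype=float)

# Divide each row by its sum: each row then holds P(prediction | ground-truth class).
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))
```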
Performance metrics

Receiver Operating Characteristic (ROC)
• Illustrates the trade-off between
  • True Positive Rate (sensitivity / recall), and
  • False Positive Rate (1 - specificity)
• Useful if the approach has a simple parameter, like a threshold

[Figure: ROC curves for different classifiers (knn, c45, logR, PCA-logR); x-axis: false positive rate, y-axis: true positive rate]

[Ladha, Cassim, et al. "ClimbAX: skill assessment for climbing enthusiasts." Ubicomp 2013]
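A minimal ROC sketch with scikit-learn for a two-class problem, using hypothetical scores; the decision threshold is the "simple parameter" being swept.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical binary ground truth and classifier scores.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per threshold
print("AUC = %.2f" % auc(fpr, tpr))
```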
Performance metrics

Probabilistic models:
• Log likelihood
• AIC / BIC
• KL-divergence

Regression:
• Mean Square Error (MSE)
• Root Mean Square Error (RMSE)
• Residuals
• Correlation coefficients
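For the regression metrics, a quick numpy sketch with hypothetical targets and predictions.

```python
import numpy as np

# Hypothetical regression targets and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.4, 3.8, 5.2])

residuals = y_true - y_pred
mse  = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
r    = np.corrcoef(y_true, y_pred)[0, 1]      # Pearson correlation coefficient
print("MSE %.3f, RMSE %.3f, r %.3f" % (mse, rmse, r))
```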
Example

Confusion matrix (rows: ground truth, columns: prediction):

                 A     B     C
            A  1022    13   144
            B   300   542    24
            C    12    55   132

Per-class metrics:

                   A      B      C
  PPV             0.77   0.89   0.44
  Sensitivity     0.87   0.63   0.66
  Specificity     0.71   0.95   0.92
  Accuracy        0.79   0.83   0.90
  F1-score        0.81   0.73   0.53

  Overall Accuracy    0.76
  Weighted F1-score   0.76
  Mean F1-score       0.69

[Figure: the same confusion matrix, row-normalised to the range 0–1]
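The same table can be reproduced with library metrics. One hedged trick (my own, not from the lecture) is to expand the confusion matrix back into per-sample label arrays and feed those to scikit-learn.

```python
import numpy as np
from sklearn.metrics import classification_report

cm = np.array([[1022, 13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]])

# Expand the confusion matrix into per-sample label arrays so that standard
# library metrics can be applied (rows = ground truth, columns = prediction).
y_true = np.repeat([0, 1, 2], cm.sum(axis=1))
y_pred = np.concatenate([np.repeat([0, 1, 2], row) for row in cm])

print(classification_report(y_true, y_pred, target_names=["A", "B", "C"]))
# 'precision' corresponds to the PPV row above, 'recall' to sensitivity, and the
# macro / weighted F1 averages to the mean / weighted F1-scores.
```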
Which metric should I not use?
• Specificity, or per-class accuracy
  • Specificity = tn / (tn + fp), per-class accuracy = (tp + tn) / (tp + fp + fn + tn)
  • Less meaningful as the number of classes increases
    • #TN cells grow in O(n^2),
    • #FN, #FP cells grow in O(n)!
  • Leads to e.g. incredible specificity figures (> 0.99)
  • Better: PPV = tp / (tp + fp) or Recall = tp / (tp + fn)
Which metric should I not use?
• Overall accuracy: (1/N) * Σ_{c∈C} tp_c
  • In unbalanced data-sets, overall accuracy can be misleading

    10,000    10     2                9,950    40    22
       100     1     0                   40    45    15
        20     1     0                   10     1    10

    Overall accuracy  = 0.99    ~=     Overall accuracy  = 0.99
    Weighted F1-score = 0.98     <     Weighted F1-score = 0.99
    Mean F1-score     = 0.34    <<     Mean F1-score     = 0.59
Which metric should I use?

                Balanced                        Unbalanced
  Two-class     - Accuracy                      - PPV, Recall for the rare class
                - F1-score                      - Mean F1-score
                - Sensitivity, Specificity
  Multi-class   - Accuracy                      - Mean F1 vs. weighted F1
                - Weighted F1                   - Confusion matrices
                - Mean F1
                - Avg. PPV, Sensitivity

No single metric reflects all aspects of performance!
Which metric should I use?
• Application requirements
  • Different applications have different requirements
  • What type of error is more critical?
    • False positives can hurt if the activities of interest are rare
    • Low precision can be a problem in medical applications
  • There is always a trade-off between different aspects of performance
Ward et al. (2011)
• Continuous recognition
• Measures like accuracy treat each error the same
• But: The predictions are time-series themselves, and different errors
are possible
• Which type of error is more acceptable in my application?
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Ward et al. (2011)
• Continuous recognition
• Similar to bioinformatics:
• Merge, Deletion, Confusion, Fragmentation, Insertion
• Each type of error is scored differently (cost matrix)
• Compound score reflects (adjustable) performance metric
• Systems that show very good performance in regular metrics may
show poor performance in one of these aspects!
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Summary
• Be careful when selecting training and test-sets
  • Different aspects of performance (user-dependent / user-independent)
  • Avoid random cross-validation for segmented time-series!
• Performance metrics
  • No single metric reflects all aspects of performance
    • It depends on your application!
  • More difficult for unbalanced multi-class problems
• Continuous recognition
  • Different types of errors have a different impact on the perceived performance of a system
  • Scoring systems can be tailored to the individual application
