Evaluation*
in ubiquitous computing
* (quantitative evaluation, qualitative studies are a field of their own!)
Nils Hammerla, Research Associate @ Open Lab
Evaluation? Why?
• Performance evaluation
• How well does it work?
• Model selection
• Which approach works best?
• Parameter selection
• How do I tune my approach for best results?
• Demonstration of feasibility
• Is it worthwhile investing more time/money?
Training and test-sets
Basic elements: samples, segments, images, etc.
Training and test-sets
• Hold-out validation
  • Choose part of the data as training and test-set
    • By some heuristic (e.g. 20% held out as test), or
    • Depending on the study design.
  • Gives a single performance estimate
  • Performance may critically depend on the chosen test-set

[Figure: data split into training, validation, and test portions, e.g. by user (User 1–6) or by run (Runs 1–3 for training, Run 4 for testing)]

• User-dependent performance
  How would the system perform on more data from a known user?
• User-independent performance
  How would the system perform for a new user?
  (A sketch of both kinds of split follows below.)
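As an illustration of the two kinds of hold-out split, here is a minimal Python sketch. It assumes scikit-learn and numpy are available; `X`, `y` and the per-sample `users` array are hypothetical placeholders, not data from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Hypothetical data: feature matrix X, labels y, and the user each sample came from.
rng = np.random.RandomState(0)
X = rng.randn(600, 10)
y = rng.randint(0, 3, size=600)
users = np.repeat(np.arange(6), 100)          # "User 1" ... "User 6"

# User-dependent hold-out: samples from every user may appear in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# User-independent hold-out: all samples of a held-out user go to the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=1/6, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=users))
print("held-out user(s):", np.unique(users[test_idx]))
```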
Training and test-sets
• Repeated hold-out
  • Split the data into k contiguous folds
  • A split into k folds gives k performance estimates
  • Gives a more reliable performance estimate
  • But: this depends on how the folds are constructed!
• Variants (see the sketch below)
  • Leave-one-subject-out (LOSO) — user-independent
  • Leave-one-run-out — user-dependent
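A minimal LOSO sketch with scikit-learn, reusing the hypothetical `X`, `y`, `users` arrays from the previous sketch; leave-one-run-out works the same way, with a per-sample run identifier as the group.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Leave-one-subject-out: each fold holds out every sample of one user.
logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, groups=users, cv=logo)
print("per-subject accuracy:", scores)       # one estimate per held-out user
print("mean ± std: %.2f ± %.2f" % (scores.mean(), scores.std()))
```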
Training and test-sets
• (Random, stratified) cross-validation
  • Split the data into k folds at the level of individual samples
  • Folds are constructed to be stratified w.r.t. the class distribution
  • Popular in general machine learning
    • The standard approach in many ML frameworks (see the sketch below)
  • Gives a user-dependent performance estimate
  • Requires samples to be statistically independent
    • This is rarely the case in ubicomp!
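For reference, this is what the standard stratified k-fold looks like in scikit-learn, again reusing the hypothetical `X`, `y` from above. Note that it shuffles individual samples and therefore ignores any dependence between neighbouring segments.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Random, stratified 5-fold cross-validation on individual samples.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("fold accuracies:", scores)
# Caution: if X contains overlapping sliding windows, near-identical neighbouring
# windows end up in both training and test folds, inflating these numbers.
```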
Pitfalls
• User-dependent vs. user-independent performance
  • Assumption: user-dependent performance is the upper bound of possible system performance
  • Cross-validation is therefore popular in ubicomp to demonstrate the feasibility of e.g. a new technical approach
[Figure: "LOSO vs. xval performance in related work" — accuracy (0–100%) of cross-validation (xval) vs. leave-one-subject-out (loso) across six studies]
Pitfalls
• Cross-validation in segmented time-series
  • When testing on a segment i, it is very likely that segments i-1 or i+1 are in the training set
  • Neighbouring segments typically overlap
    • They are therefore very similar!
  • This biases the results and leads to inflated performance figures
Pitfalls
• Cross-validation in segmented time-series
  • This is an issue, as cross-validation is widely used
    • For user-dependent evaluation
    • For parameter estimation (e.g. in neural networks)
  • A simple extension of cross-validation alleviates the bias (an illustrative sketch of block-wise splitting follows below)
  • There will be a talk on this on Friday morning!
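The extension itself is the topic of the Friday talk. Purely as an illustration of why neighbouring windows leak, and of one way to keep them on the same side of a split, here is a hypothetical sketch that groups overlapping windows by the contiguous block they were cut from and cross-validates over blocks rather than individual windows (all names and the block length are mine).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical 1-D signal cut into 50%-overlapping sliding windows.
signal = np.random.randn(10_000)
win, hop = 128, 64
starts = np.arange(0, len(signal) - win, hop)
X_win = np.stack([signal[s:s + win] for s in starts])

# Adjacent windows share half their samples: a random k-fold would put near-duplicates
# of each test window into the training set. Grouping windows into contiguous blocks
# and splitting over blocks keeps neighbours together.
block_len = 20                                   # windows per block (illustrative)
blocks = np.arange(len(starts)) // block_len
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X_win, groups=blocks):
    # No block ever appears in both training and test fold.
    assert not set(blocks[train_idx]) & set(blocks[test_idx])
```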
Pitfalls
• Reporting results
  • Always report the mean and standard deviation of your results
  • Use t-tests for differences in mean performance (see the sketch below)
    • Whether t-tests on cross-validated results are meaningful is still debated in statistics
    • You have to use the same folds when comparing different approaches!
  • Better: repeat the whole experiment multiple times (if computationally feasible)
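A small sketch of this style of reporting, with hypothetical per-fold accuracies for two approaches evaluated on the same folds (scipy assumed available).

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two approaches, evaluated on the SAME folds.
acc_a = np.array([0.81, 0.78, 0.84, 0.80, 0.79])
acc_b = np.array([0.77, 0.75, 0.80, 0.78, 0.74])

print("A: %.2f ± %.2f" % (acc_a.mean(), acc_a.std()))
print("B: %.2f ± %.2f" % (acc_b.mean(), acc_b.std()))

# Paired t-test, since both approaches share identical folds.
# (Whether such tests are meaningful on cross-validated results is still debated.)
t, p = stats.ttest_rel(acc_a, acc_b)
print("paired t-test: t=%.2f, p=%.3f" % (t, p))
```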
Performance metrics

Confusion matrix (rows: ground truth, columns: prediction):

                 A     B     C
            A  1022    13   144
            B   300   542    24
            C    12    55   132

Basic elements, counted for each class (A, B, C):
• true positives (tp)
• false positives (fp)
• false negatives (fn)
• true negatives (tn)
Performance metrics

For each class (or for two-class problems):

  Precision / PPV         tp / (tp + fp)
  Recall / Sensitivity    tp / (tp + fn)
  Specificity             tn / (tn + fp)
  Accuracy                (tp + tn) / (tp + fp + fn + tn)
  F1-score                2 * prec * sens / (prec + sens)
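A minimal numpy sketch of these per-class quantities, computed from the confusion matrix above (rows = ground truth, columns = prediction); the function name is my own, not part of the lecture.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples with ground truth i predicted as j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # column sum minus the diagonal
    fn = cm.sum(axis=1) - tp          # row sum minus the diagonal
    tn = cm.sum() - tp - fp - fn
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc  = (tp + tn) / cm.sum()
    f1   = 2 * prec * sens / (prec + sens)
    return prec, sens, spec, acc, f1

cm = np.array([[1022, 13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]])
for name, vals in zip(["PPV", "Sens", "Spec", "Acc", "F1"], per_class_metrics(cm)):
    print(name, np.round(vals, 2))
```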
Performance metrics

For a set of classes C = {A, B, C, …}, with n_c samples of class c and N samples in total:

  Overall accuracy      (1/N) * Σ_{c∈C} tp_c
  Mean accuracy         (1/|C|) * Σ_{c∈C} acc_c
  Weighted F1-score     (2/N) * Σ_{c∈C} n_c * (prec_c × sens_c) / (prec_c + sens_c)
  Mean F1-score         (2/|C|) * Σ_{c∈C} (prec_c × sens_c) / (prec_c + sens_c)
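A sketch of these aggregates, building on the `per_class_metrics` helper and the matrix `cm` defined in the previous sketch (naming is mine); on that matrix it reproduces the figures on the example slide further below.

```python
def aggregate_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    prec, sens, spec, acc, f1 = per_class_metrics(cm)
    n_c, N = cm.sum(axis=1), cm.sum()
    overall_acc = np.diag(cm).sum() / N        # (1/N) * sum of tp_c
    mean_acc    = acc.mean()                   # (1/|C|) * sum of acc_c
    weighted_f1 = (n_c / N * f1).sum()         # classes weighted by their size
    mean_f1     = f1.mean()                    # every class weighted equally
    return overall_acc, mean_acc, weighted_f1, mean_f1

print(np.round(aggregate_metrics(cm), 2))      # approx. (0.76, 0.84, 0.76, 0.69)
```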
Performance metrics

Normalised confusion matrices
• Divide each row by its sum to get confusion probabilities

[Figure: normalised confusion matrices, a) on the HOME data-set (prediction vs. diary annotation) and b) on the LAB data-set (prediction vs. clinician annotation); classes: asleep, off, on, dys; cell values in the range 0–1]

[Hammerla, Nils Y., et al. "PD Disease State Assessment in Naturalistic Environments using Deep Learning." AAAI 2015]
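Row-normalisation itself is a one-liner; a small sketch using the 3-class example matrix from earlier rather than the PD matrices shown in the figure.

```python
import numpy as np

cm = np.array([[1022, 13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]], dtype=float)

# Divide each row by its sum: each row then holds P(prediction | ground-truth class).
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))
```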
Performance metrics

Receiver Operating Characteristic (ROC)
• Illustrates the trade-off between
  • True Positive Rate (sensitivity / recall), and
  • False Positive Rate (1 - specificity)
• Useful if the approach has a simple parameter, like a threshold

[Figure: ROC curves for different classifiers (knn, c45, logR, PCA-logR); x-axis: false positive rate, y-axis: true positive rate]

[Ladha, Cassim, et al. "ClimbAX: skill assessment for climbing enthusiasts." Ubicomp 2013]
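A minimal ROC sketch with scikit-learn for a two-class problem, using hypothetical scores; the decision threshold is the "simple parameter" being swept.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical binary ground truth and classifier scores.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per threshold
print("AUC = %.2f" % auc(fpr, tpr))
```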
Performance metrics

Probabilistic models:
• Log likelihood
• AIC / BIC
• KL-divergence

Regression:
• Mean Square Error (MSE)
• Root Mean Square Error (RMSE)
• Residuals
• Correlation coefficients
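For the regression metrics, a quick numpy sketch with hypothetical targets and predictions.

```python
import numpy as np

# Hypothetical regression targets and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.4, 3.8, 5.2])

residuals = y_true - y_pred
mse  = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
r    = np.corrcoef(y_true, y_pred)[0, 1]      # Pearson correlation coefficient
print("MSE %.3f, RMSE %.3f, r %.3f" % (mse, rmse, r))
```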
Example

Confusion matrix (rows: ground truth, columns: prediction):

                 A     B     C
            A  1022    13   144
            B   300   542    24
            C    12    55   132

Per-class metrics:

                   A      B      C
  PPV             0.77   0.89   0.44
  Sensitivity     0.87   0.63   0.66
  Specificity     0.71   0.95   0.92
  Accuracy        0.79   0.83   0.90
  F1-score        0.81   0.73   0.53

  Overall Accuracy    0.76
  Weighted F1-score   0.76
  Mean F1-score       0.69

[Figure: the same confusion matrix, row-normalised to the range 0–1]
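The same table can be reproduced with library metrics. One hedged trick (my own, not from the lecture) is to expand the confusion matrix back into per-sample label arrays and feed those to scikit-learn.

```python
import numpy as np
from sklearn.metrics import classification_report

cm = np.array([[1022, 13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]])

# Expand the confusion matrix into per-sample label arrays so that standard
# library metrics can be applied (rows = ground truth, columns = prediction).
y_true = np.repeat([0, 1, 2], cm.sum(axis=1))
y_pred = np.concatenate([np.repeat([0, 1, 2], row) for row in cm])

print(classification_report(y_true, y_pred, target_names=["A", "B", "C"]))
# 'precision' corresponds to the PPV row above, 'recall' to sensitivity, and the
# macro / weighted F1 averages to the mean / weighted F1-scores.
```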
Which metric should I not use?
• Specificity, or per-class accuracy
  • Specificity = tn / (tn + fp), per-class accuracy = (tp + tn) / (tp + fp + fn + tn)
  • Less meaningful as the number of classes increases
    • #TN cells grow in O(n^2),
    • #FN, #FP cells grow in O(n)!
  • Leads to e.g. incredible specificity figures (> 0.99)
  • Better: PPV = tp / (tp + fp) or Recall = tp / (tp + fn)
Which metric should I not use?
• Overall accuracy: (1/N) * Σ_{c∈C} tp_c
  • In unbalanced data-sets, overall accuracy can be misleading

    10,000    10     2                9,950    40    22
       100     1     0                   40    45    15
        20     1     0                   10     1    10

    Overall accuracy  = 0.99    ~=     Overall accuracy  = 0.99
    Weighted F1-score = 0.98     <     Weighted F1-score = 0.99
    Mean F1-score     = 0.34    <<     Mean F1-score     = 0.59
Which metric should I use?

                Balanced                        Unbalanced
  Two-class     - Accuracy                      - PPV, Recall for the rare class
                - F1-score                      - Mean F1-score
                - Sensitivity, Specificity
  Multi-class   - Accuracy                      - Mean F1 vs. weighted F1
                - Weighted F1                   - Confusion matrices
                - Mean F1
                - Avg. PPV, Sensitivity

No single metric reflects all aspects of performance!
Which metric should I use?
• Application requirements
  • Different applications have different requirements
  • What type of error is more critical?
    • False positives can hurt if the activities of interest are rare
    • Low precision can be a problem in medical applications
  • There is always a trade-off between different aspects of performance
Ward et al. (2011)
• Continuous recognition
• Measures like accuracy treat each error the same
• But: The predictions are time-series themselves, and different errors
are possible
• Which type of error is more acceptable in my application?
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Ward et al. (2011)
• Continuous recognition
• Similar to bioinformatics:
• Merge, Deletion, Confusion, Fragmentation, Insertion
• Each type of error is scored differently (cost matrix)
• Compound score reflects (adjustable) performance metric
• Systems that show very good performance in regular metrics may
show poor performance in one of these aspects!
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Summary
• Be careful when selecting training and test-sets
  • Different aspects of performance (user-dependent / user-independent)
  • Avoid random cross-validation for segmented time-series!
• Performance metrics
  • No single metric reflects all aspects of performance
    • It depends on your application!
  • More difficult for unbalanced multi-class problems
• Continuous recognition
  • Different types of errors have a different impact on the perceived performance of a system
  • Scoring systems can be tailored to the individual application
