How to improve the statistical power of the 10-fold crossvalidation scheme in Recommender Systems

University of Ljubljana
[LDOS]

Faculty of Electrical Engineering
Digital Signal, Image and Video Processing Laboratory

Andrej Košir
Ante Odić
Marko Tkalčič



Statistical power, replicability and reproducibility
 What is:
 Replicability: to get the same experimental result (on the same data)
 Reproducibility: to get similar experimental results leading to the same conclusion
Mackay, R., & Oldford, R. (2000). Scientific method, statistical method, and the speed of light (Working paper 2000-02). Department of Statistics and Actuarial Science, University of Waterloo.

 In terms of statistical testing
 Higher power => better reproducibility
 More likely to get to the same conclusions



On statistical hypothesis testing
 When do we need statistical tests?
 The results should not change if we repeat the experiment
 When we need them: at later stages of development, when the results are similar
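[Figure: two recommender systems, RS 1 and RS 2, are evaluated on the same test data and produce scores F1 and F2; values shown on the slide: 0.72, 0.89, 0.74]

 Elements of statistical testing
 Working hypotheses
 Null and alternative hypotheses: 𝐻0 and 𝐻1
 p-value: 𝑝
 Risk level: 𝛼
 Decision on 𝐻0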



On errors and statistical power
 Errors in test decision:
 Errors of type I and type II
 Effect size

                Decision Ĥ0      Decision Ĥ1
    True H0     OK               type I error
    True H1     type II error    OK

 Power: Power = Pr[Ĥ1 | H1], the probability of deciding for H1 when H1 is true
 For each test a new analysis is required
 More is better
 The best one can do:
• Task 1 - How to select the sample size: a priori power
• Task 2 - How to estimate the achieved power: post hoc power
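
The quantities in the decision table can be written out explicitly. The block below is a sketch in standard notation; the symbols β, d, μ_D and σ_D are not on the slide and are introduced here as assumptions (β is the type II error rate, d the standardized mean of the paired differences). The sample-size formula is the usual normal approximation for a two-sided paired test, so for small n it gives a slightly optimistic number.

```latex
\alpha = \Pr[\hat{H}_1 \mid H_0] \quad \text{(type I error rate)}, \qquad
\beta  = \Pr[\hat{H}_0 \mid H_1] \quad \text{(type II error rate)}, \qquad
\text{Power} = \Pr[\hat{H}_1 \mid H_1] = 1 - \beta

% Task 1 (a priori power), normal approximation, two-sided paired test:
n \approx \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{d} \right)^{2},
\qquad d = \frac{\mu_D}{\sigma_D}

% Worked example: \alpha = 0.05, target power 0.80, d = 1
n \approx \left( \frac{1.960 + 0.842}{1} \right)^{2} \approx 7.85
```

Under these assumptions about 8 paired observations would be needed, which is in the range of a 10-fold scheme only when the per-fold differences are large relative to their spread.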

 History:
 1908 by William Sealy Gosset (Student): he did not need it
 Mainly ignored until then

 Software: G*Power
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/



The application we were working on: contextual variables
 Which contextual variables are relevant:
 What is context
 Candidates: time, weather, mood, ...
 Can we simply use it all?
• Irrelevant context can worsen the performance of RS

 Test if a given context is relevant
 How: compare RS with and without it

ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Predicting and detecting the relevant contextual information in a movie-recommender system. Interact. comput. [Print ed.], 2013, vol. 25, no. 1, pp. 74-90, doi:10.1093/iwc/iws003. [COBISS.SI-ID 9650260]
ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Impact of the context relevancy on ratings prediction in a movie-recommender system. Automatika (Zagreb), 2013, vol. 54, no. 2, pp. 252-262, doi:10.7305/automatika.54-2.258. [COBISS.SI-ID 9782356]



The problem we observed: cross validation scheme

ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Predicting and detecting the relevant contextual
information in a movie-recommender system. Interact. comput., vol. 25, no. 1, pp. 74-90, 2013.

 There were differences among folds, but not in conclusion
 What is wrong?
 Paired / unpaired?

 What is usually done:
 Confusion matrix computation is actually unpaired



Proposed solution
The procedure outline:
1. Select the scalar comparison measure (such as precision or F-measure).
2. Store the evaluation results of each fold and each method separately.
3. According to the specific features of the evaluation results (distributions etc.), select the most powerful test that meets these specific features.
4. Perform the paired version of the selected test.
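
The procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the reported results: the per-fold F-measures are made-up numbers, the function names are hypothetical, and both tests are computed exactly by enumeration with the standard library only (an exact Wilcoxon signed-rank test for the paired case, an exact rank-sum test in the Mann-Whitney sense for the independent case).

```python
from itertools import combinations, product


def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def paired_signed_rank_p(a, b):
    """Exact two-sided signed-rank p-value, enumerating all sign patterns."""
    diffs = [round(x - y, 6) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # zero differences are dropped
    ranks = average_ranks([abs(d) for d in diffs])
    half = sum(ranks) / 2
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    extreme = count = 0
    # Under H0 every difference is equally likely to be positive or negative.
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        count += 1
        extreme += abs(w_plus - half) >= abs(observed - half) - 1e-12
    return extreme / count


def independent_rank_sum_p(a, b):
    """Exact two-sided rank-sum p-value, enumerating all group assignments."""
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    n = len(a)
    expected = sum(ranks) * n / len(pooled)
    observed = sum(ranks[:n])
    extreme = count = 0
    # Under H0 every choice of n observations for the first group is equally likely.
    for idx in combinations(range(len(pooled)), n):
        s = sum(ranks[i] for i in idx)
        count += 1
        extreme += abs(s - expected) >= abs(observed - expected) - 1e-12
    return extreme / count


# Step 2: per-fold results stored separately (hypothetical F-measures).
with_context = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.77, 0.65, 0.70]
without_context = [0.60, 0.70, 0.52, 0.78, 0.66, 0.71, 0.58, 0.74, 0.63, 0.69]

# Step 4: the paired test uses the fold-by-fold differences.
p_paired = paired_signed_rank_p(with_context, without_context)
p_indep = independent_rank_sum_p(with_context, without_context)
print(f"paired p = {p_paired:.4f}, independent p = {p_indep:.4f}")
```

In this made-up data the folds differ a lot from each other, but one method is slightly better in every fold. The paired test sees the consistent per-fold difference, while the independent test has to find it against the fold-to-fold spread.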



Materials and methods (1)
 Dataset:
 Context Movie Dataset (LDOS-CoMoDa)
 1611 ratings from 89 users to 946 items with associated contextual factors.
 Contextual variables
• time (morning, afternoon, evening, night),
• daytype (working day, weekend),
• season (spring, summer, autumn, winter),
• location (home, public place, friend's house),
• weather (sunny/clear, rainy, stormy, snowy, cloudy),
• social (alone, partner, friends, colleagues, parents, public, family),
• endEmo (sad, happy, scared, surprised, angry, disgusted, neutral),
• dominantEmo (sad, happy, scared, surprised, angry, disgusted, neutral),
• mood (positive, neutral, negative),
• physical (healthy, ill),
• decision (user's choice, given by other),
• interaction (first, n-th)

 Publicly available:
LDOS-CoMoDa contextual dataset: available at www.ldos.si/comoda.html.
Used by 29 researchers at this moment.



Materials and methods (2), results
 Experimental design
 10-fold cross validation
 Two procedures: ProcPaired, ProcIndep

 Results – which contextual variable improves MF?
 Tests: Wilcoxon signed-rank test (ProcPaired) and Mann-Whitney U test (ProcIndep)
 The achieved (post-hoc) statistical power for the paired test (pw paired) and for the independent test (pw indep.), along with the computed p-values

Id  Var 1        Var 2    pw paired  p paired  pw indep.  p indep.
1   Physical     Weather  0.42       0.001     0.14       0.24
2   Decision     Social   0.99       0.004     0.25       0.19
3   Interaction  Social   0.06       <0.001    0.05       0.43
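
The kind of gap between paired and independent power seen in the table can be explored with a small Monte Carlo simulation. The sketch below is not the analysis behind the table: it assumes normally distributed scores, a shared per-fold difficulty effect, and swaps the rank tests for paired and pooled two-sample t-tests so that only tabulated critical values are needed. All parameter values (`delta`, `fold_sd`, `noise_sd`) and the function name are hypothetical.

```python
import random
from statistics import mean, stdev

# Two-sided 5% critical values of Student's t distribution.
T_CRIT_PAIRED = 2.262  # 9 degrees of freedom (10 paired folds)
T_CRIT_INDEP = 2.101   # 18 degrees of freedom (two groups of 10)


def simulate_power(delta=0.02, fold_sd=0.05, noise_sd=0.01,
                   folds=10, runs=2000, seed=1):
    """Estimate the power of a paired and an independent t-test by simulation.

    Each fold has its own difficulty (fold_sd) shared by both methods;
    method B is better than method A by delta on average.
    """
    rng = random.Random(seed)
    paired_rejections = 0
    indep_rejections = 0
    for _ in range(runs):
        fold_effect = [rng.gauss(0.0, fold_sd) for _ in range(folds)]
        a = [0.70 + f + rng.gauss(0.0, noise_sd) for f in fold_effect]
        b = [0.70 + delta + f + rng.gauss(0.0, noise_sd) for f in fold_effect]

        # Paired test: the shared fold effect cancels in the differences.
        d = [y - x for x, y in zip(a, b)]
        t_paired = mean(d) / (stdev(d) / folds ** 0.5)

        # Independent test: the fold effect stays in the variance estimate.
        pooled_var = (stdev(a) ** 2 + stdev(b) ** 2) / 2
        t_indep = (mean(b) - mean(a)) / (2 * pooled_var / folds) ** 0.5

        paired_rejections += abs(t_paired) > T_CRIT_PAIRED
        indep_rejections += abs(t_indep) > T_CRIT_INDEP
    return paired_rejections / runs, indep_rejections / runs


power_paired, power_indep = simulate_power()
print(f"estimated power: paired = {power_paired:.2f}, "
      f"independent = {power_indep:.2f}")
```

With these assumed parameters the fold-to-fold spread is much larger than the method difference, so the independent test rarely rejects while the paired test almost always does. Shrinking `fold_sd` towards `noise_sd` narrows the gap.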



Discussion
 Power improvements (independent → paired):
 The first combination (physical vs. weather): 0.14 → 0.42, low but useful;
 The second combination (decision vs. social): 0.19 → 0.99, the difference in power is again substantial;
 The third combination (interaction vs. social): 0.05 → 0.06, negligible;

 It does not require substantial additional work
 Worth the effort



Further work
 We limited ourselves to 10-fold cross validation and simple tests only. There is more out there.
 We will concentrate on comparing RS with respect to the selected final tasks (such as best five), not only on scalar performance measures (such as precision at five).

 More sophisticated statistical approaches:
 are available, such as multi-level repeated binomial regression
 my opinion: they will not be used frequently

THANK YOU
Invitation: International Conference on Automatic Face and Gesture
Recognition FG2015, http://www.fg2015.org/



Presentation structure
 The goal
 What does it have to do with replicability and reproducibility?
 Selected items from statistics
 Our case & problem statement
 Proposed solution & comments
 Experimental results
 Future work
 Take away notes
