# How to improve the statistical power of the 10-fold cross validation scheme in Recommender Systems

RecSys 2013 workshop paper on how to improve your cross-validation scheme in order to improve the statistical power of underlying significance testing.


### How to improve the statistical power of the 10-fold cross validation scheme in Recommender Systems

1. How to improve the statistical power of the 10-fold cross validation scheme in Recommender Systems
   - Andrej Košir, Ante Odić, Marko Tkalčič
   - University of Ljubljana, Faculty of Electrical Engineering, Digital Signal, Image and Video Processing Laboratory (LDOS)
2. Statistical power, replicability and reproducibility
   - Definitions:
     - Replicability: getting the same experimental result (on the same data).
     - Reproducibility: getting similar experimental results that lead to the same conclusion.
     - Source: Mackay, R., & Oldford, R. (2000). Scientific method, statistical method, and the speed of light (Working paper 2000-02). Department of Statistics and Actuarial Science, University of Waterloo.
   - In terms of statistical testing:
     - Higher power => better reproducibility.
     - Repeated experiments are more likely to reach the same conclusions.
3. On statistical hypothesis testing
   - When do we need statistical tests?
     - The results should not change if we repeat the experiment.
     - We need them at later stages of development, where the results of the compared systems are similar.
   - [Figure: test data evaluated by two recommender systems, RS 1 and RS 2, giving F-measures F1 and F2; values shown: 0.72, 0.89, 0.74]
   - Elements of statistical testing:
     - Working hypotheses
     - Null and alternative hypotheses: $H_0$ and $H_1$
     - p-value: $p$
     - Risk level: $\alpha$
     - Decision on $H_0$
4. On errors and statistical power
   - Errors in the test decision: type I and type II errors.

     | Truth | Decision $\hat{H}_0$ | Decision $\hat{H}_1$ |
     |-------|----------------------|----------------------|
     | $H_0$ | OK                   | type I               |
     | $H_1$ | type II              | OK                   |

   - Effect size
   - Power: $\text{Power} = \Pr[\hat{H}_1 \mid H_1]$
     - For each test a new analysis is required.
     - More is better.
     - The best one can do.
     - Task 1: how to select the sample size (a priori power).
     - Task 2: how to estimate the achieved power (post-hoc power).
   - History:
     - 1908 by William Sealy Gosset (Student): he did not need it.
     - Mainly ignored until then.
   - Software: GPower, http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
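
The two tasks on slide 4 (a priori sample size and achieved power) can be illustrated with a minimal Python sketch. It assumes a one-sided, one-sample z-test with a known standardized effect size, which is a simplification chosen for illustration and not the test used in the talk; the function names and the example effect size of 0.5 are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided, one-sample z-test.

    effect_size is the standardized effect (mean shift divided by the
    standard deviation); this setup is an assumption made for illustration.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    # Under H1 the test statistic is shifted by effect_size * sqrt(n).
    return 1.0 - STD_NORMAL.cdf(z_crit - effect_size * math.sqrt(n))


def required_sample_size(effect_size: float,
                         target_power: float = 0.8,
                         alpha: float = 0.05) -> int:
    """Smallest n for which the z-test above reaches target_power."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


# Task 1 (a priori): how many samples are needed for power 0.8?
n_needed = required_sample_size(effect_size=0.5)
# Task 2 (post hoc): which power is achieved with that many samples?
achieved = z_test_power(effect_size=0.5, n=n_needed)
print(f"n = {n_needed}, achieved power = {achieved:.3f}")
```

With an effect size of zero the computed power equals the risk level α, which is a quick sanity check on the formula.
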
5. The application we were working on: contextual variables
   - Which contextual variables are relevant?
     - What is context?
     - Candidates: time, weather, mood, ...
     - Can we simply use them all?
       - Irrelevant context can worsen the performance of an RS.
   - Test whether a given context is relevant.
     - How: compare the RS with and without it.
   - References:
     - Odić, A., Tkalčič, M., Tasič, J. F., & Košir, A. (2013). Predicting and detecting the relevant contextual information in a movie-recommender system. Interacting with Computers, 25(1), 74-90. doi:10.1093/iwc/iws003. [COBISS.SI-ID 9650260]
     - Odić, A., Tkalčič, M., Tasič, J. F., & Košir, A. (2013). Impact of the context relevancy on ratings prediction in a movie-recommender system. Automatika (Zagreb), 54(2), 252-262. doi:10.7305/automatika.54-2.258. [COBISS.SI-ID 9782356]
6. The problem we observed: cross validation scheme
   - Source: Odić, A., Tkalčič, M., Tasič, J. F., & Košir, A. (2013). Predicting and detecting the relevant contextual information in a movie-recommender system. Interacting with Computers, 25(1), 74-90.
   - There were differences among folds, but not in the conclusion.
   - What is wrong?
     - Paired or unpaired?
   - What is usually done:
     - The confusion matrix computation is actually unpaired.
7. Proposed solution. The procedure outline:
   1. Select the scalar comparison measure (such as precision or F-measure).
   2. Store the evaluation results of each fold and each method separately.
   3. According to the specific features of the evaluation results (distributions etc.), select the most powerful test that meets these specific features.
   4. Perform the paired version of the selected test.
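
The procedure on slide 7 can be sketched in plain Python as follows. The per-fold F-measures are made-up illustration values, not results from the talk, and the helper names are hypothetical. The sketch assumes the Wilcoxon signed-rank test has been selected in step 3 and computes its exact two-sided p-value by enumerating all sign assignments, which is feasible for 10 folds (1024 assignments).

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided p-value of the paired Wilcoxon signed-rank test."""
    # Round to suppress floating-point noise, then drop zero differences.
    diffs = [round(x - y, 12) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    centre = sum(ranks) / 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - centre)
    # Under H0 every difference is equally likely to be positive or negative.
    extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Steps 1 and 2: one scalar measure (F-measure), stored per fold and per method.
# Hypothetical values; index k of both lists refers to the same fold.
scores = {
    "with_context":    [0.74, 0.71, 0.78, 0.69, 0.75, 0.73, 0.77, 0.70, 0.76, 0.72],
    "without_context": [0.72, 0.70, 0.75, 0.68, 0.72, 0.71, 0.74, 0.69, 0.73, 0.70],
}

# Step 4: paired test on the fold-wise differences.
p_value = wilcoxon_signed_rank_exact(scores["with_context"],
                                     scores["without_context"])
print(f"paired Wilcoxon signed-rank p-value: {p_value:.4f}")
```

The key design point is that the two score lists stay aligned by fold, so the test works on fold-wise differences and the variation between folds cancels out.
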
8. Materials and methods (1)
   - Dataset:
     - Context Movie Dataset (LDOS-CoMoDa)
     - 1611 ratings from 89 users to 946 items with associated contextual factors.
   - Contextual variables:
     - time (morning, afternoon, evening, night)
     - daytype (working day, weekend, ...)
     - season (spring, summer, autumn, winter)
     - location (home, public place, friend's house)
     - weather (sunny/clear, rainy, stormy, snowy, cloudy)
     - social (alone, partner, friends, colleagues, parents, public, family)
     - endEmo (sad, happy, scared, surprised, angry, disgusted, neutral)
     - dominantEmo (sad, happy, scared, surprised, angry, disgusted, neutral)
     - mood (positive, neutral, negative)
     - physical (healthy, ill)
     - decision (user's choice, given by other)
     - interaction (first, n-th)
   - Publicly available: the LDOS-CoMoDa contextual dataset is available at www.ldos.si/comoda.html. Used by 29 researchers at this moment.
9. Materials and methods (2), results
   - Experimental design:
     - 10-fold cross validation
     - Two procedures: ProcPaired, ProcIndep
   - Results: which contextual variable improves MF?
     - Tests: Wilcoxon signed-rank test (ProcPaired) and Mann-Whitney U test (ProcIndep).
     - The achieved (post-hoc) statistical power for the paired test and for the independent test, along with the computed p-values:

   | Id | Var 1       | Var 2   | Power (paired) | p (paired) | Power (indep.) | p (indep.) |
   |----|-------------|---------|----------------|------------|----------------|------------|
   | 1  | Physical    | Weather | 0.42           | 0.001      | 0.14           | 0.24       |
   | 2  | Decision    | Social  | 0.99           | 0.004      | 0.25           | 0.19       |
   | 3  | Interaction | Social  | 0.06           | <0.001     | 0.05           | 0.43       |
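
Why the paired procedure tends to reach higher power on cross-validation results can be illustrated with a small Monte Carlo sketch. Everything here is an assumption for illustration: the simulated scores share a per-fold difficulty term (hypothetical parameters `fold_sd`, `noise_sd`, `gain`), both tests use the normal approximation of their statistic without continuity correction, and the function names are made up. It is not the power analysis used for the numbers reported on slide 9.

```python
import math
import random
from statistics import NormalDist


def ranks(values):
    """Ranks starting at 1 (assumes no ties, which holds for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def two_sided_p(z):
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def wilcoxon_p(a, b):
    """Paired Wilcoxon signed-rank test, normal approximation."""
    diffs = [x - y for x, y in zip(a, b)]
    r = ranks([abs(d) for d in diffs])
    n = len(diffs)
    w_plus = sum(rk for rk, d in zip(r, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return two_sided_p((w_plus - mean) / sd)


def mann_whitney_p(a, b):
    """Independent-samples Mann-Whitney U test, normal approximation."""
    r = ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    u = sum(r[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return two_sided_p((u - mean) / sd)


def estimate_power(test, n_trials=2000, n_folds=10, gain=0.02,
                   fold_sd=0.05, noise_sd=0.01, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which `test` rejects H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Each fold has its own difficulty, shared by both methods.
        fold = [rng.gauss(0.7, fold_sd) for _ in range(n_folds)]
        base = [f + rng.gauss(0.0, noise_sd) for f in fold]
        new = [f + gain + rng.gauss(0.0, noise_sd) for f in fold]
        if test(new, base) < alpha:
            rejections += 1
    return rejections / n_trials


paired_power = estimate_power(wilcoxon_p)
indep_power = estimate_power(mann_whitney_p)
print(f"estimated power, paired: {paired_power:.2f}, independent: {indep_power:.2f}")
```

In this setup the fold-to-fold variation is much larger than the method noise, so the independent test has to detect the gain against the full fold variance while the paired test only sees the variance of the fold-wise differences.
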
10. Discussion
    - Power improvements (independent → paired):
      - The first combination (physical vs. weather): 0.14 → 0.42, low but useful.
      - The second combination (decision vs. social): 0.25 → 0.99, the difference in power is again substantial.
      - The third combination (interaction vs. social): 0.05 → 0.06, irrelevant.
    - It does not require substantial additional work.
    - Worth the effort.
11. Further work
    - We limited ourselves to 10-fold cross validation and simple tests only. There is more out there.
    - We will concentrate on comparing RSs with regard to the selected final tasks (such as best five), not limited to scalar performance measures (such as precision at five).
    - More sophisticated statistical approaches:
      - are available, such as multi-level repeated binomial regression;
      - my opinion: they will not be used frequently.
    - Thank you.
    - Invitation: International Conference on Automatic Face and Gesture Recognition FG2015, http://www.fg2015.org/
12. Presentation structure
    - The goal
    - What does it have to do with replicability and reproducibility?
    - Selected items from statistics
    - Our case & problem statement
    - Proposed solution & comments
    - Experimental results
    - Future work
    - Take away notes