The Effect of Third Party
Implementations on Reproducibility
Balázs Hidasi | Gravity R&D, a Taboola Company | @balazshidasi
Ádám Czapp | Gravity R&D, a Taboola Company | @adam_czapp
Challenges of comparing the performance of algorithms
• Why does reproducibility matter?
▪ Science: core of the process
▪ Application: which papers should I try?
• Chain of approximation (from business goals down to the implementation):
▪ Business goals & metrics
▪ Goal of the recommender
▪ Online A/B test: on business KPIs; not reproducible
o A/B test setup: splitting, stopping, independence, …
▪ Offline evaluation: imperfect proxy; offline metrics
o Evaluation setup: best setup & metric for the task, dataset, preprocessing, splitting, measuring
▪ Hyperparameter optimization
▪ Algorithm (re)implementation
• Check out our poster on evaluation flaws tomorrow!
Why are algorithms reimplemented?
Personal reimplementation
• Use with custom evaluator
• Efficiency (time of experiments)
• Official code not available
Production reimplementation
• Efficiency requirements
• Language/framework requirements
Public reimplementation
• Accessibility
• Contributing
• Official code not available
Benchmarking frameworks
• Use with unified evaluator
• Standardization/benchmarking
• Accessibility
Are reimplementations correct representations of the original?
Comparing reimplementations of an algorithm to the original
• We chose GRU4Rec, because…
Seminal work of its field
• Started the line of deep learning methods for
session-based/sequential recommendations
• Often used as baseline
Official public implementation
https://github.com/hidasib/GRU4Rec
• Since spring 2016
• (ICLR publication)
• Still supported today
• Well-known
Implemented in Theano
• Discontinued DL framework (2018)
• Motivation for recoding in more popular
frameworks
Simple algorithm but highly adapted
• Simple architecture
• Custom adaptations to the recommender domain
• Described in detail in the corresponding papers
Reimplementations of GRU4Rec
• Checked
▪ 2 PyTorch implementations
o GRU4REC-pytorch
– Popular reimplementation
– Published in 2018
– Last commit in 2021
o Torch-GRU4Rec
– Newer implementation from 2020
▪ 2 Tensorflow/Keras implementations
o GRU4Rec_Tensorflow
– Popular reimplementation
– Published in 2017
– Last commit in 2019
o KerasGRU4Rec
– Published in 2018
– Last meaningful update in 2020
▪ 2 benchmarking framework implementations
o Microsoft Recommenders
– Large algorithm collection
o Recpack
– Recently released framework
Recommended by the RecSys 2023 CFP
• Others (we know of)
▪ Discontinued PyTorch reimplementation
▪ RecBole implementation
o Doesn’t even reference the right papers
RQ1: Do they implement the same architecture as the original?
• Architecture of GRU4Rec (sketched in code below):
▪ Input items of the minibatch → embedding table for inputs (3 options) → input embeddings → GRU layer(s) → session state for step t
▪ Target items and negative samples of the minibatch → embedding table for outputs → target & negative sample embeddings
▪ Dot product of the session state with the target & negative sample embeddings → scores → loss function
• Different architecture in MS Recommenders:
▪ Input sequences of the minibatch → embedding table for inputs → input embeddings → GRU layer(s) → session state for the sequence → cloned (xN)
▪ Target items and negative samples of the minibatch → embedding table for outputs → target & negative sample embeddings
▪ FFN on the concatenated state+embedding → scores → loss function
▪ Severe scalability issue
o Number of negative samples is strictly limited during training
o Requires negative sampling during inference → evaluation flaw
o Not able to rank all items during inference
• Those who could do it (5/6): GRU4REC-pytorch, Torch-GRU4Rec, GRU4Rec_Tensorflow, KerasGRU4Rec, Recpack
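To make the data flow above concrete, here is a minimal PyTorch-style sketch of the original scoring path (input embedding → GRU → dot product with the embeddings of the mini-batch targets and sampled negatives). Layer sizes, the per-item output bias, and all names are illustrative assumptions, not the official Theano code.

```python
import torch
import torch.nn as nn

class GRU4RecSketch(nn.Module):
    def __init__(self, n_items, emb_dim=64, hidden_dim=100):
        super().__init__()
        self.input_emb = nn.Embedding(n_items, emb_dim)       # "embedding table, inputs"
        self.output_emb = nn.Embedding(n_items, hidden_dim)   # "embedding table, outputs"
        self.output_bias = nn.Embedding(n_items, 1)           # per-item bias (assumption)
        self.gru = nn.GRUCell(emb_dim, hidden_dim)            # single GRU layer for brevity

    def forward(self, input_items, hidden, candidate_items):
        # input_items:     (batch,) current item of each session in the mini-batch
        # hidden:          (batch, hidden_dim) session state carried over from step t-1
        # candidate_items: (n_cand,) targets of the mini-batch + shared extra negatives
        hidden = self.gru(self.input_emb(input_items), hidden)   # session state for step t
        cand_emb = self.output_emb(candidate_items)               # (n_cand, hidden_dim)
        scores = hidden @ cand_emb.T + self.output_bias(candidate_items).squeeze(-1)
        return scores, hidden  # scores: (batch, n_cand), fed into the loss function

# Usage: score each session's next step against the mini-batch targets
# (mini-batch negative sampling) plus a few shared extra negative samples.
model = GRU4RecSketch(n_items=1000)
inputs = torch.randint(0, 1000, (32,))
targets = torch.randint(0, 1000, (32,))
extra_negs = torch.randint(0, 1000, (16,))
hidden = torch.zeros(32, 100)
scores, hidden = model(inputs, hidden, torch.cat([targets, extra_negs]))
```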
RQ2: Do they have the same features as the original?
• GRU4Rec = GRU adapted to the recommendation problem
▪ Missing features (see table)
▪ Missing hyperparameters
o All versions: momentum, logQ
o Some versions: bpreg, embedding/hidden dropout, sample_alpha, …
Feature comparison (GRU4REC-pytorch, Torch-GRU4Rec, GRU4Rec_Tensorflow, KerasGRU4Rec, Recpack; each feature marked Included, Missing, or Partial or flawed per implementation):
▪ Session-parallel mini-batches (see the sketch below)
▪ Negative sampling: mini-batch, shared extra
▪ Loss: cross-entropy, BPR-max
▪ Embedding: no embedding, separate, shared
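To illustrate the first feature in the table, a minimal sketch of session-parallel mini-batching: sessions occupy the slots of the mini-batch, and when one ends, the next session takes its slot and that slot's hidden state is reset. The data layout and names are assumptions for illustration, not the official implementation.

```python
import numpy as np

def session_parallel_batches(session_items, batch_size):
    """session_items: list of time-ordered item-ID lists, one per session
    (assumes at least batch_size sessions, each with at least 2 events)."""
    slot_session = list(range(batch_size))   # which session sits in each mini-batch slot
    slot_pos = [0] * batch_size              # how far that session has been read
    next_session = batch_size
    reset_slots = list(range(batch_size))    # all hidden states start fresh
    while True:
        inputs = [session_items[s][p] for s, p in zip(slot_session, slot_pos)]
        targets = [session_items[s][p + 1] for s, p in zip(slot_session, slot_pos)]
        yield np.array(inputs), np.array(targets), reset_slots
        reset_slots = []
        for slot in range(batch_size):
            slot_pos[slot] += 1
            # session exhausted (no next target): bring in a new session, reset its state
            if slot_pos[slot] + 1 >= len(session_items[slot_session[slot]]):
                if next_session >= len(session_items):
                    return
                slot_session[slot] = next_session
                slot_pos[slot] = 0
                reset_slots.append(slot)
                next_session += 1

# Usage: before feeding (inputs, targets) to the GRU, zero the hidden-state rows
# listed in `resets`, so a new session never inherits the previous session's state.
sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11], [12, 13, 14]]
for inputs, targets, resets in session_parallel_batches(sessions, batch_size=2):
    pass
```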
RQ3: Do they suffer from implementation errors?
• Error classes, by the effort needed to fix them:
▪ Basic errors: almost certainly fixed by any user
o e.g. typos/syntax errors; variables on the incorrect device
▪ Inference errors: potentially fixed by an involved user
o e.g. P(dropout) is used as P(keep) (illustrated below); code is not prepared for unseen test items; hidden states are not reset properly
▪ Minor errors (easy to notice & fix): likely fixed by an experienced user
o e.g. large initial accumulator value prevents convergence; differences to the original (learning rate decay, initialization, optimizer); hard-coded hyperparameters; sampling and softmax are in reverse order
▪ Major errors (hard to notice or fix): may be fixed by a very thorough user
o e.g. softmax applied twice; hidden states are reset at incorrect times; incorrect BPR-max loss; dropout can be set but is not applied; embedding and hidden dropout use the same parameter by mistake
▪ Core errors (full rewrite): most likely NOT fixed by any user
o e.g. sampling and scoring are in reverse order
• Number of occurrences (basic / inference / minor / major / core):
▪ GRU4REC-pytorch: 1 / 1 / 0 / 5 / 1
▪ Torch-GRU4Rec: 1 / 0 / 0 / 0 / 1
▪ GRU4Rec_Tensorflow: 2 / 0 / 3 / 0 / 0
▪ KerasGRU4Rec: 0 / 0 / 2 / 2 / 0
▪ Recpack: 2 / 0 / 3 / 1 / 1
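As a concrete example of how subtle these bugs can be, a minimal illustration of the "P(dropout) is used as P(keep)" error listed above; the parameter value and names are illustrative only.

```python
import torch
import torch.nn as nn

p_dropout = 0.2  # hyperparameter meant as the probability of DROPPING a unit

# correct: torch.nn.Dropout expects the drop probability
correct = nn.Dropout(p=p_dropout)

# buggy: passing the keep probability drops 80% of the units instead of 20%,
# silently crippling the model while the code still runs without errors
buggy = nn.Dropout(p=1.0 - p_dropout)

x = torch.ones(1, 10_000)
print(correct(x).eq(0).float().mean())  # ~0.2 of the units zeroed
print(buggy(x).eq(0).float().mean())    # ~0.8 of the units zeroed
```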
RQ4: How do missing features & errors affect offline results?
• Methodology (figure): each reimplementation is improved/fixed step by step (out-of-the-box → inference fix → minor fix → major fix), while the official implementation is limited to the reimplementation's feature set ("matching features") and compared against the unrestricted original
▪ Degradation due to errors: covers the part due to fixable errors and the part due to non-fixable errors
▪ Total degradation: additionally includes the degradation due to missing features
• Performance loss relative to the original (perf. loss via errors / via features / total):
▪ MEDIAN
o GRU4REC-pytorch: -56.34% / -46.14% / -75.73%
o Torch-GRU4Rec: -1.29% / -5.90% / -7.55%
o GRU4Rec_Tensorflow: -80.59% / -47.15% / -89.46%
o KerasGRU4Rec: -9.54% / -11.94% / -21.32%
o Recpack: -21.23% / -8.48% / -30.27%
▪ MAX
o GRU4REC-pytorch: -99.38% / -63.88% / -99.62%
o Torch-GRU4Rec: -10.46% / -18.92% / -27.24%
o GRU4Rec_Tensorflow: -88.44% / -61.81% / -93.89%
o KerasGRU4Rec: -26.69% / -15.26% / -37.87%
o Recpack: -37.14% / -22.71% / -48.86%
• Measured on 5 public session-based datasets
▪ Yoochoose, Rees46, Coveo, Retailrocket, Diginetica
• Next item prediction (strict)
• Recall & MRR
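For reference, a minimal sketch of the two metrics used above under strict next-item prediction (the single true next item is the only relevant one, ranked against all items); the cutoff and names are illustrative.

```python
import numpy as np

def recall_and_mrr_at_k(scores, targets, k=20):
    """scores: (n_events, n_items) model scores; targets: (n_events,) true next item IDs."""
    # rank of the target = number of items scored strictly higher, plus one
    target_scores = scores[np.arange(len(targets)), targets]
    ranks = (scores > target_scores[:, None]).sum(axis=1) + 1
    hit = ranks <= k
    recall = hit.mean()                           # fraction of events with the target in the top-k
    mrr = np.where(hit, 1.0 / ranks, 0.0).mean()  # reciprocal rank, 0 if outside the top-k
    return recall, mrr

# usage on random scores, just to show the expected shapes
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 500))
targets = rng.integers(0, 500, size=1000)
print(recall_and_mrr_at_k(scores, targets, k=20))
```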
RQ5: Training time comparisons
• Out-of-the-box reimplementations vs. the feature-complete official versions
• Reimplementations are generally slow
• KerasGRU4Rec and Recpack versions scale badly (no negative sampling)
• Largest slowdown factor: 335.87x
Epoch time (cross-entropy loss, best hyperparams) on the Yoochoose / Rees46 datasets:
▪ GRU4Rec (original): 451.75 / 367.41
▪ GRU4REC-pytorch: 1948.18 / 7618.05
▪ Torch-GRU4Rec: 2082.11 / 7192.88
▪ GRU4Rec_Tensorflow: 981.87 / 2381.62
▪ KerasGRU4Rec: 13740.04 / 123265.62
▪ Recpack: 13458.18 / 49262.16
▪ Official PyTorch version: 1117.46 / 632.02
▪ Official Tensorflow version: 1275.9 / 679.505
What does this mean?
• Final tally
▪ The MS Recommenders version is GRU4Rec in name only and is deeply flawed
▪ The other versions each miss at least one important feature of the original
▪ All versions have performance-decreasing bugs
▪ Two implementations scale poorly
• Potentially a lot of research from the last 6-7 years used flawed baseline(s)
▪ Hard to tell: no indication of the implementation used
▪ Results might be invalidated
• GRU4Rec is probably not the only affected algorithm
▪ It has a public version to base reimplementations on, yet they are still flawed
▪ Other well-known baselines should be checked
• Discussions
▪ Responsibility
▪ Trust in the tools we use
▪ How to correct affected work?
What can you do?
If your research used a flawed version
• Rerun experiments with official code
• Extend your work with the results
If you want to help
• Check reimplementations of other popular
baselines
As an author
• Always state the implementation you use for every
baseline
• Including link, optionally commit hash
• Use official code if possible
If you reimplement an algorithm
• Validate your version against the original
before using or releasing it
• Compare metrics achieved on multiple datasets
under multiple hyperparameter settings
• Compare recommendation lists (see the sketch below)
• Check if your version has every feature/setting
• Describe the validation process and its results
in the README
• Consider if any future change to the original
code (e.g. bugfix) should be added to your
version as well
• If implementations diverge due to the original
changing, state it clearly
As maintainer of a benchmarking framework
• Same as reimplementing any algorithm
• + validate every reimplementation submitted by
contributors
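For the "compare recommendation lists" step above, one possible sketch: measure the average overlap of the top-k lists produced by the original and the reimplementation on the same inputs. The function and the toy data are illustrative assumptions, not a prescribed procedure.

```python
import numpy as np

def topk_overlap(scores_a, scores_b, k=20):
    """scores_*: (n_events, n_items) score matrices from the two implementations."""
    top_a = np.argsort(-scores_a, axis=1)[:, :k]
    top_b = np.argsort(-scores_b, axis=1)[:, :k]
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(top_a, top_b)]
    return float(np.mean(overlaps))

# usage: near-identical implementations should give an average overlap close to 1.0;
# a low value points to diverging behaviour worth investigating before release
rng = np.random.default_rng(0)
orig = rng.normal(size=(100, 500))
reimpl = orig + rng.normal(scale=0.01, size=orig.shape)  # stand-in for the reimplementation
print(topk_overlap(orig, reimpl, k=20))
```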
The wider picture (towards standardized benchmarking)
• State of RecSys benchmarking:
▪ Little has changed in the last decade
▪ Focus is on baseline reimplementations
▪ Collection of algorithms
▪ Evaluation is somewhat neglected
o Incorrect assumptions:
– One/few size fits all
– Single correct evaluation setup
• Towards standardized benchmarking
▪ Collect popular recommendation tasks
o e.g. CTR prediction, session-based recommendation, user-based recommendation, warm/cold-start versions, reoccurrence prediction, etc.
▪ Evaluation stems from the tasks: agree on offline evaluation setups and datasets (with their preprocessing) for each task
▪ Focus on the evaluation code of these setups, including dataset & preprocessing
▪ Provide simple interfaces for evaluating external algorithms (see the sketch below)
o Authors can then use the framework during their research
▪ Only once everything is ready, add some of the most well-known baselines
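A minimal sketch of what such a "simple interface for evaluating external algorithms" could look like: the framework owns the data, the split, and the metrics, and an external model only has to implement two methods. The protocol, method names, and the evaluation loop are assumptions for illustration, not an existing framework's API.

```python
from typing import Protocol, Sequence
import numpy as np

class SessionRecommender(Protocol):
    def fit(self, train_sessions: Sequence[Sequence[int]]) -> None:
        ...

    def score_next(self, session_prefix: Sequence[int]) -> np.ndarray:
        """Return a score for every item, given the session's events seen so far."""
        ...

def evaluate_next_item(model: SessionRecommender,
                       test_sessions: Sequence[Sequence[int]], k: int = 20):
    """Strict next-item evaluation: one prediction per event after the first."""
    hits, rr, n = 0, 0.0, 0
    for session in test_sessions:
        for t in range(1, len(session)):
            scores = model.score_next(session[:t])
            target = session[t]
            rank = int((scores > scores[target]).sum()) + 1
            if rank <= k:
                hits += 1
                rr += 1.0 / rank
            n += 1
    return {"recall@k": hits / n, "mrr@k": rr / n}

# usage: the framework calls fit() on the training split and then the loop above,
# so every external implementation is measured under exactly the same setup
```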
Thanks for your attention!
Read the paper! Check out the project website!
We'd also like to help: official reimplementations of GRU4Rec in PyTorch and Tensorflow
More Related Content

Similar to Effect of Third Party Implementations on Reproducibility of Recommender Systems

City universitylondon devprocess_g_a_reitsch
City universitylondon devprocess_g_a_reitschCity universitylondon devprocess_g_a_reitsch
City universitylondon devprocess_g_a_reitschalanreitsch
 
Pr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open sourcePr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open sourceTerry Bunio
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsMarc Hornbeek
 
Lean Solutions – Agile Transformation at the United States Postal Service
Lean Solutions  – Agile Transformation at the United States Postal ServiceLean Solutions  – Agile Transformation at the United States Postal Service
Lean Solutions – Agile Transformation at the United States Postal ServiceITSM Academy, Inc.
 
Cloud and Network Transformation using DevOps methodology : Cisco Live 2015
Cloud and Network Transformation using DevOps methodology : Cisco Live 2015Cloud and Network Transformation using DevOps methodology : Cisco Live 2015
Cloud and Network Transformation using DevOps methodology : Cisco Live 2015Vimal Suba
 
Sledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QASledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QAShelley Lambert
 
Automation in the world of project
Automation  in the world of projectAutomation  in the world of project
Automation in the world of projectZbyszek Mockun
 
0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf
0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf
0121_RESOURCE_SoftwareDevelopmentLifecycles.pdfBinNguynVn3
 
Using graphs for recommendations
Using graphs for recommendationsUsing graphs for recommendations
Using graphs for recommendationsRik Van Bruggen
 
Case tools and modern process of system development
Case tools and modern process of system development Case tools and modern process of system development
Case tools and modern process of system development tushar217
 
Saturn - UCSD CNS Research Review
Saturn - UCSD CNS Research ReviewSaturn - UCSD CNS Research Review
Saturn - UCSD CNS Research ReviewKabirNagrecha
 
Saturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep LearningSaturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep LearningKabirNagrecha
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample TrackingBruce Kozuma
 
Ladies Be Architects: Study Group IV: Project and System Governance
Ladies Be Architects: Study Group IV: Project and System GovernanceLadies Be Architects: Study Group IV: Project and System Governance
Ladies Be Architects: Study Group IV: Project and System Governancegemziebeth
 
Pose extraction for real time workout assistant - milestone 1
Pose extraction for real time workout assistant - milestone 1Pose extraction for real time workout assistant - milestone 1
Pose extraction for real time workout assistant - milestone 1Zachary Christmas
 
Integration Testing Practice using Perl
Integration Testing Practice using PerlIntegration Testing Practice using Perl
Integration Testing Practice using PerlMasaki Nakagawa
 
The Myth Of Requirements
The Myth Of RequirementsThe Myth Of Requirements
The Myth Of RequirementsAlan McSweeney
 

Similar to Effect of Third Party Implementations on Reproducibility of Recommender Systems (20)

City universitylondon devprocess_g_a_reitsch
City universitylondon devprocess_g_a_reitschCity universitylondon devprocess_g_a_reitsch
City universitylondon devprocess_g_a_reitsch
 
Pr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open sourcePr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open source
 
Re ppt1
Re ppt1Re ppt1
Re ppt1
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE Assessments
 
Lean Solutions – Agile Transformation at the United States Postal Service
Lean Solutions  – Agile Transformation at the United States Postal ServiceLean Solutions  – Agile Transformation at the United States Postal Service
Lean Solutions – Agile Transformation at the United States Postal Service
 
Cloud and Network Transformation using DevOps methodology : Cisco Live 2015
Cloud and Network Transformation using DevOps methodology : Cisco Live 2015Cloud and Network Transformation using DevOps methodology : Cisco Live 2015
Cloud and Network Transformation using DevOps methodology : Cisco Live 2015
 
Sledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QASledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QA
 
Automation in the world of project
Automation  in the world of projectAutomation  in the world of project
Automation in the world of project
 
0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf
0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf
0121_RESOURCE_SoftwareDevelopmentLifecycles.pdf
 
Using graphs for recommendations
Using graphs for recommendationsUsing graphs for recommendations
Using graphs for recommendations
 
Case tools and modern process of system development
Case tools and modern process of system development Case tools and modern process of system development
Case tools and modern process of system development
 
Saturn - UCSD CNS Research Review
Saturn - UCSD CNS Research ReviewSaturn - UCSD CNS Research Review
Saturn - UCSD CNS Research Review
 
Saturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep LearningSaturn: Joint Optimization for Large-Model Deep Learning
Saturn: Joint Optimization for Large-Model Deep Learning
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
 
Ladies Be Architects: Study Group IV: Project and System Governance
Ladies Be Architects: Study Group IV: Project and System GovernanceLadies Be Architects: Study Group IV: Project and System Governance
Ladies Be Architects: Study Group IV: Project and System Governance
 
Pose extraction for real time workout assistant - milestone 1
Pose extraction for real time workout assistant - milestone 1Pose extraction for real time workout assistant - milestone 1
Pose extraction for real time workout assistant - milestone 1
 
Integration Testing Practice using Perl
Integration Testing Practice using PerlIntegration Testing Practice using Perl
Integration Testing Practice using Perl
 
Evolve 19 | Gina Petruccelli | Let’s Dig Into Requirements
Evolve 19 | Gina Petruccelli | Let’s Dig Into RequirementsEvolve 19 | Gina Petruccelli | Let’s Dig Into Requirements
Evolve 19 | Gina Petruccelli | Let’s Dig Into Requirements
 
The Myth Of Requirements
The Myth Of RequirementsThe Myth Of Requirements
The Myth Of Requirements
 
5174 oracleascp
5174 oracleascp5174 oracleascp
5174 oracleascp
 

More from Balázs Hidasi

Egyedi termék kreatívok tömeges gyártása generatív AI segítségével
Egyedi termék kreatívok tömeges gyártása generatív AI segítségévelEgyedi termék kreatívok tömeges gyártása generatív AI segítségével
Egyedi termék kreatívok tömeges gyártása generatív AI segítségévelBalázs Hidasi
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...Balázs Hidasi
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...Balázs Hidasi
 
Context aware factorization methods for implicit feedback based recommendatio...
Context aware factorization methods for implicit feedback based recommendatio...Context aware factorization methods for implicit feedback based recommendatio...
Context aware factorization methods for implicit feedback based recommendatio...Balázs Hidasi
 
Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Balázs Hidasi
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendationsBalázs Hidasi
 
Context-aware preference modeling with factorization
Context-aware preference modeling with factorizationContext-aware preference modeling with factorization
Context-aware preference modeling with factorizationBalázs Hidasi
 
Approximate modeling of continuous context in factorization algorithms (CaRR1...
Approximate modeling of continuous context in factorization algorithms (CaRR1...Approximate modeling of continuous context in factorization algorithms (CaRR1...
Approximate modeling of continuous context in factorization algorithms (CaRR1...Balázs Hidasi
 
Utilizing additional information in factorization methods (research overview,...
Utilizing additional information in factorization methods (research overview,...Utilizing additional information in factorization methods (research overview,...
Utilizing additional information in factorization methods (research overview,...Balázs Hidasi
 
Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...
Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...
Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...Balázs Hidasi
 
Context-aware similarities within the factorization framework (CaRR 2013 pres...
Context-aware similarities within the factorization framework (CaRR 2013 pres...Context-aware similarities within the factorization framework (CaRR 2013 pres...
Context-aware similarities within the factorization framework (CaRR 2013 pres...Balázs Hidasi
 
iTALS: implicit tensor factorization for context-aware recommendations (ECML/...
iTALS: implicit tensor factorization for context-aware recommendations (ECML/...iTALS: implicit tensor factorization for context-aware recommendations (ECML/...
iTALS: implicit tensor factorization for context-aware recommendations (ECML/...Balázs Hidasi
 
Initialization of matrix factorization (CaRR 2012 presentation)
Initialization of matrix factorization (CaRR 2012 presentation)Initialization of matrix factorization (CaRR 2012 presentation)
Initialization of matrix factorization (CaRR 2012 presentation)Balázs Hidasi
 
ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)
ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)
ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)Balázs Hidasi
 
ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)
ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)
ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)Balázs Hidasi
 
ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)
ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)
ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)Balázs Hidasi
 

More from Balázs Hidasi (17)

Egyedi termék kreatívok tömeges gyártása generatív AI segítségével
Egyedi termék kreatívok tömeges gyártása generatív AI segítségévelEgyedi termék kreatívok tömeges gyártása generatív AI segítségével
Egyedi termék kreatívok tömeges gyártása generatív AI segítségével
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
 
Context aware factorization methods for implicit feedback based recommendatio...
Context aware factorization methods for implicit feedback based recommendatio...Context aware factorization methods for implicit feedback based recommendatio...
Context aware factorization methods for implicit feedback based recommendatio...
 
Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendations
 
Context-aware preference modeling with factorization
Context-aware preference modeling with factorizationContext-aware preference modeling with factorization
Context-aware preference modeling with factorization
 
Approximate modeling of continuous context in factorization algorithms (CaRR1...
Approximate modeling of continuous context in factorization algorithms (CaRR1...Approximate modeling of continuous context in factorization algorithms (CaRR1...
Approximate modeling of continuous context in factorization algorithms (CaRR1...
 
Utilizing additional information in factorization methods (research overview,...
Utilizing additional information in factorization methods (research overview,...Utilizing additional information in factorization methods (research overview,...
Utilizing additional information in factorization methods (research overview,...
 
Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...
Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...
Az implicit ajánlási probléma és néhány megoldása (BME TMIT szeminárium előad...
 
Context-aware similarities within the factorization framework (CaRR 2013 pres...
Context-aware similarities within the factorization framework (CaRR 2013 pres...Context-aware similarities within the factorization framework (CaRR 2013 pres...
Context-aware similarities within the factorization framework (CaRR 2013 pres...
 
iTALS: implicit tensor factorization for context-aware recommendations (ECML/...
iTALS: implicit tensor factorization for context-aware recommendations (ECML/...iTALS: implicit tensor factorization for context-aware recommendations (ECML/...
iTALS: implicit tensor factorization for context-aware recommendations (ECML/...
 
Initialization of matrix factorization (CaRR 2012 presentation)
Initialization of matrix factorization (CaRR 2012 presentation)Initialization of matrix factorization (CaRR 2012 presentation)
Initialization of matrix factorization (CaRR 2012 presentation)
 
ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)
ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)
ShiftTree: model alapú idősor-osztályozó (VK 2009 előadás)
 
ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)
ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)
ShiftTree: model alapú idősor-osztályozó (ML@BP előadás, 2012)
 
ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)
ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)
ShiftTree: model based time series classifier (ECML/PKDD 2011 presentation)
 

Recently uploaded

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxAleenaTreesaSaji
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
NAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdf
NAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdfNAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdf
NAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdfWadeK3
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 

Recently uploaded (20)

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
NAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdf
NAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdfNAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdf
NAVSEA PEO USC - Unmanned & Small Combatants 26Oct23.pdf
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 

Effect of Third Party Implementations on Reproducibility of Recommender Systems

  • 1. The Effect of Third Party Implementations on Reproducibility Balázs Hidasi | Gravity R&D, a Taboola Company | @balazshidasi Ádám Czapp | Gravity R&D, a Taboola Company | @adam_czapp
  • 2. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try?
  • 3. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try? Offline evaluation • Imperfect proxy • Offline metrics
  • 4. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try? Online A/B test • On business KPIs • Not reproducible Offline evaluation • Imperfect proxy • Offline metrics
  • 5. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try? Online A/B test • On business KPIs • Not reproducible Offline evaluation • Imperfect proxy • Offline metrics Goal of the recommender Business goals & metrics Chain of approximation
  • 6. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try? Online A/B test • On business KPIs • Not reproducible Offline evaluation • Imperfect proxy • Offline metrics Goal of the recommender Business goals & metrics A/B test setup • Splitting • Stopping • Independence • … Evaluation setup • Best setup & metric for task • Dataset • Preprocessing • Splitting • Measuring Chain of approximation Check out our poster on evaluation flaws tomorrow!
  • 7. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try? Online A/B test • On business KPIs • Not reproducible Offline evaluation • Imperfect proxy • Offline metrics Goal of the recommender Business goals & metrics A/B test setup • Splitting • Stopping • Independence • … Evaluation setup • Best setup & metric for task • Dataset • Preprocessing • Splitting • Measuring Hyperparameter optimization Chain of approximation Check out our poster on evaluation flaws tomorrow!
  • 8. Challenges of comparing the performance of algorithms • Why does reproducibility matter? ▪ Science: core of the process ▪ Application: which papers should I try? Online A/B test • On business KPIs • Not reproducible Offline evaluation • Imperfect proxy • Offline metrics Goal of the recommender Business goals & metrics A/B test setup • Splitting • Stopping • Independence • … Evaluation setup • Best setup & metric for task • Dataset • Preprocessing • Splitting • Measuring Hyperparameter optimization Algorithm (re)implementation Chain of approximation Check out our poster on evaluation flaws tomorrow!
  • 9. Why are algorithms reimplemented?
  • 10. Why are algorithms reimplemented? Personal reimplementation • Use with custom evaluator • Efficiency (time of experiments) • Official code not available
  • 11. Why are algorithms reimplemented? Personal reimplementation • Use with custom evaluator • Efficiency (time of experiments) • Official code not available Production reimplementation • Efficiency requirements • Language/framework requirements
  • 12. Why are algorithms reimplemented? Personal reimplementation • Use with custom evaluator • Efficiency (time of experiments) • Official code not available Production reimplementation • Efficiency requirements • Language/framework requirements Public reimplementation • Accessibility • Contributing • Official code not available
  • 13. Why are algorithms reimplemented? Personal reimplementation • Use with custom evaluator • Efficiency (time of experiments) • Official code not available Production reimplementation • Efficiency requirements • Language/framework requirements Public reimplementation • Accessibility • Contributing • Official code not available Benchmarking frameworks • Use with unified evaluator • Standardization/benchmarking • Accessibility
  • 14. Why are algorithms reimplemented? Are reimplementations correct representations of the original? Personal reimplementation • Use with custom evaluator • Efficiency (time of experiments) • Official code not available Production reimplementation • Efficiency requirements • Language/framework requirements Public reimplementation • Accessibility • Contributing • Official code not available Benchmarking frameworks • Use with unified evaluator • Standardization/benchmarking • Accessibility
  • 15. Comparing reimplementations of an algorithm to the original • We chose GRU4Rec, because…
  • 16. Comparing reimplementations of an algorithm to the original • We chose GRU4Rec, because… Seminal work of its field • Started the line of deep learning methods for session-based/sequential recommendations • Often used as baseline
  • 17. Comparing reimplementations of an algorithm to the original • We chose GRU4Rec, because… Seminal work of its field • Started the line of deep learning methods for session-based/sequential recommendations • Often used as baseline Official public implementation https://github.com/hidasib/GRU4Rec • Since spring 2016 • (ICLR publication) • Still supported today • Well-known
  • 18. Comparing reimplementations of an algorithm to the original • We chose GRU4Rec, because… Seminal work of its field • Started the line of deep learning methods for session-based/sequential recommendations • Often used as baseline Official public implementation https://github.com/hidasib/GRU4Rec • Since spring 2016 • (ICLR publication) • Still supported today • Well-known Simple algorithm but highly adapted • Simple architecture • Custom adaptations to the recommender domain • Described in detail in the corresponding papers
  • 19. Comparing reimplementations of an algorithm to the original • We chose GRU4Rec, because… Seminal work of its field • Started the line of deep learning methods for session-based/sequential recommendations • Often used as baseline Official public implementation https://github.com/hidasib/GRU4Rec • Since spring 2016 • (ICLR publication) • Still supported today • Well-known Implemented in Theano • Discontinued DL framework (2018) • Motivation for recoding in more popular frameworks Simple algorithm but highly adapted • Simple architecture • Custom adaptations to the recommender domain • Described in detail in the corresponding papers
  • 20. Reimplementations of GRU4Rec • Checked ▪ 2 PyTorch implementations o GRU4REC-pytorch – Popular reimplementation – Published in 2018 – Last commit in 2021 o Torch-GRU4Rec – Newer implementation from 2020 ▪ 2 Tensorflow/Keras implementations o GRU4Rec_Tensorflow – Popular reimplementation – Published in 2017 – Last commit in 2019 o KerasGRU4Rec – Published in 2018 – Last meaningful update in 2020 ▪ 2 benchmarking framework implementations o Microsoft Recommenders – Large algorithm collection o Recpack – Recently released framework
  • 21. Reimplementations of GRU4Rec • Checked ▪ 2 PyTorch implementations o GRU4REC-pytorch – Popular reimplementation – Published in 2018 – Last commit in 2021 o Torch-GRU4Rec – Newer implementation from 2020 ▪ 2 Tensorflow/Keras implementations o GRU4Rec_Tensorflow – Popular reimplementation – Published in 2017 – Last commit in 2019 o KerasGRU4Rec – Published in 2018 – Last meaningful update in 2020 ▪ 2 benchmarking framework implementations o Microsoft Recommenders – Large algorithm collection o Recpack – Recently released framework Recommended by RecSys 2023 CFP
  • 22. Reimplementations of GRU4Rec • Checked ▪ 2 PyTorch implementations o GRU4REC-pytorch – Popular reimplementation – Published in 2018 – Last commit in 2021 o Torch-GRU4Rec – Newer implementation from 2020 ▪ 2 Tensorflow/Keras implementations o GRU4Rec_Tensorflow – Popular reimplementation – Published in 2017 – Last commit in 2019 o KerasGRU4Rec – Published in 2018 – Last meaningful update in 2020 ▪ 2 benchmarking framework implementations o Microsoft Recommenders – Large algorithm collection o Recpack – Recently released framework Recommended by RecSys 2023 CFP • Others (we know of) ▪ Discontinued PyTorch reimplementation ▪ RecBole implementation o Doesn’t even reference the right papers
• 23–26. RQ1: Do they implement the same architecture as the original?
• Architecture of GRU4Rec (per the diagram): input items of the mini-batch → embedding table for inputs (3 options) → input embeddings → GRU layer(s) → session state for step t → dot product with the output-embedding rows of the target items and negative samples of the mini-batch → scores → loss function
• 5 of 6 reimplementations follow this architecture: GRU4REC-pytorch, Torch-GRU4Rec, GRU4Rec_Tensorflow, KerasGRU4Rec, Recpack
• MS Recommenders implements a different architecture: input sequences of the mini-batch → embedding table for inputs → input embeddings → GRU layer(s) → session state for the whole sequence, cloned N times → feed-forward network on the concatenated state + target/negative-sample embeddings → scores → loss function
▪ Severe scalability issue
o The number of negative samples is strictly limited during training
o Requires negative sampling during inference → evaluation flaw
o Not able to rank all items during inference
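To make the data flow described above concrete, here is a minimal PyTorch-style sketch of the scoring path of one training step. It is our own illustration, not code from the official repository: the class and variable names (SessionGRU, item_emb, out_emb) are hypothetical, and the loss (cross-entropy or BPR-max applied to the scores) is omitted.

```python
import torch
import torch.nn as nn

class SessionGRU(nn.Module):
    def __init__(self, n_items, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, emb_dim)    # embedding table, inputs
        self.out_emb = nn.Embedding(n_items, hidden_dim)  # embedding table, outputs
        self.gru = nn.GRUCell(emb_dim, hidden_dim)        # single GRU layer for brevity

    def forward(self, input_items, candidate_items, hidden):
        # input_items:     (B,)   current item of each session in the mini-batch
        # candidate_items: (C,)   target items + shared negative samples of the mini-batch
        # hidden:          (B, H) session state carried over from the previous step
        x = self.item_emb(input_items)             # input embeddings
        hidden = self.gru(x, hidden)               # session state for step t
        cands = self.out_emb(candidate_items)      # target + negative sample embeddings
        scores = hidden @ cands.T                  # dot product -> (B, C) scores
        return scores, hidden                      # loss is applied to the scores

model = SessionGRU(n_items=50_000)
hidden = torch.zeros(32, 100)                      # fresh states for 32 parallel sessions
inputs = torch.randint(0, 50_000, (32,))
candidates = torch.randint(0, 50_000, (64,))       # e.g. 32 targets + 32 extra shared samples
scores, hidden = model(inputs, candidates, hidden) # scores: (32, 64)
```

The returned session state is fed back in at the next step of the session-parallel mini-batch and must be reset whenever a session ends; as the later error analysis shows, this is exactly the part several reimplementations get wrong.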
• 27. RQ2: Do they have the same features as the original?
• GRU4Rec = GRU adapted to the recommendation problem
▪ Missing features (see the feature matrix below)
▪ Missing hyperparameters
o All versions: momentum, logQ
o Some versions: bpreg, embedding/hidden dropout, sample_alpha, …
• Feature matrix: each implementation (GRU4REC-pytorch, Torch-GRU4Rec, GRU4Rec_Tensorflow, KerasGRU4Rec, Recpack) is rated Included / Missing / Partial or flawed for every GRU4Rec feature; see the paper for the exact ratings
▪ Session-parallel mini-batches
▪ Negative sampling: mini-batch, shared extra
▪ Loss: cross-entropy, BPR-max
▪ Embedding: no embedding, separate, shared
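As an illustration of one of these features, here is a short sketch (our code, hypothetical names) of mini-batch negative sampling combined with the cross-entropy loss: the targets of the other sessions in the mini-batch serve as negative examples, so a single (B, B) score matrix covers both positives and negatives. The shared extra samples and the sample_alpha-controlled sampling distribution of the official implementation are not shown.

```python
import torch
import torch.nn.functional as F

def minibatch_xe_loss(hidden, target_emb):
    # hidden:     (B, H) session states after the current step
    # target_emb: (B, H) output embeddings of the B target items of the mini-batch
    scores = hidden @ target_emb.T                     # (B, B); row i scores session i
    labels = torch.arange(scores.size(0), device=scores.device)
    # Diagonal entries are the positives; every off-diagonal item in a row acts
    # as a negative sample taken from the other sessions of the same batch.
    return F.cross_entropy(scores, labels)
```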
• 28. RQ3: Do they suffer from implementation errors?
• Error classes, by the effort needed to fix them
▪ Basic errors – almost certainly fixed by any user
▪ Inference errors – potentially fixed by an involved user
▪ Minor errors (easy to notice & fix) – likely fixed by an experienced user
▪ Major errors (hard to notice or fix) – may be fixed by a very thorough user
▪ Core errors (require a full rewrite) – most likely NOT fixed by any user
• Examples across these classes: typos/syntax errors; variables on the incorrect device; P(dropout) used as P(keep); code not prepared for unseen test items; hidden states not reset properly; large initial accumulator value prevents convergence; differences to the original (learning rate decay, initialization, optimizer); hard-coded hyperparameters; sampling and softmax in reverse order; softmax applied twice; hidden states reset at incorrect times; incorrect BPR-max loss; dropout can be set but is not applied; embedding and hidden dropout use the same parameter by mistake; sampling and scoring in reverse order
• Number of occurrences (basic / inference / minor / major / core)
▪ GRU4REC-pytorch: 1 / 1 / 0 / 5 / 1
▪ Torch-GRU4Rec: 1 / 0 / 0 / 0 / 1
▪ GRU4Rec_Tensorflow: 2 / 0 / 3 / 0 / 0
▪ KerasGRU4Rec: 0 / 0 / 2 / 2 / 0
▪ Recpack: 2 / 0 / 3 / 1 / 1
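Two of the error types listed above are easy to show in a few lines. The snippets below are our own illustrations using PyTorch conventions, not code taken from any of the audited repositories.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# "P(dropout) is used as P(keep)": PyTorch's nn.Dropout expects the probability
# of dropping a unit, so passing a keep probability inverts the behaviour.
p_drop = 0.3
buggy_dropout = nn.Dropout(p=1.0 - p_drop)   # silently drops 70% of the units
fixed_dropout = nn.Dropout(p=p_drop)         # drops 30%, as intended

# "Softmax applied twice": re-normalizing already normalized probabilities
# flattens the distribution and distorts the loss.
logits = torch.randn(4, 10)
probs = F.softmax(logits, dim=1)             # correct
probs_twice = F.softmax(probs, dim=1)        # bug: much flatter than intended
```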
• 29–31. RQ4: How do missing features & errors affect offline results?
• Setup (per the diagram): each reimplementation is measured out-of-the-box and after applying inference, minor, and major fixes; the official implementation is measured both restricted to the matching feature set and in its original, full-featured form
▪ Degradation due to errors: gap between the reimplementation and the feature-matched official version (split into fixable and non-fixable errors)
▪ Degradation due to missing features: gap between the feature-matched official version and the original
▪ Total degradation: the combination of the two
• Measured on 5 public session-based datasets: Yoochoose, Rees46, Coveo, Retailrocket, Diginetica
• Next item prediction (strict); Recall & MRR
• Performance loss (via errors / via features / total)
▪ MEDIAN
o GRU4REC-pytorch: -56.34% / -46.14% / -75.73%
o Torch-GRU4Rec: -1.29% / -5.90% / -7.55%
o GRU4Rec_Tensorflow: -80.59% / -47.15% / -89.46%
o KerasGRU4Rec: -9.54% / -11.94% / -21.32%
o Recpack: -21.23% / -8.48% / -30.27%
▪ MAX
o GRU4REC-pytorch: -99.38% / -63.88% / -99.62%
o Torch-GRU4Rec: -10.46% / -18.92% / -27.24%
o GRU4Rec_Tensorflow: -88.44% / -61.81% / -93.89%
o KerasGRU4Rec: -26.69% / -15.26% / -37.87%
o Recpack: -37.14% / -22.71% / -48.86%
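For reference, this is the kind of strict next-item evaluation behind Recall and MRR numbers such as the ones above, sketched by us as a generic illustration rather than the paper's evaluation code: for every test event there is exactly one relevant item (the actual next item), and only its rank matters.

```python
import numpy as np

def recall_mrr_at_k(ranks, k=20):
    # ranks: 1-based rank of the true next item among all items, one per test event
    ranks = np.asarray(ranks)
    hits = ranks <= k
    recall = hits.mean()                            # share of events with the target in the top-k
    mrr = np.where(hits, 1.0 / ranks, 0.0).mean()   # reciprocal rank, cut off at k
    return recall, mrr

recall_mrr_at_k([1, 3, 25, 7], k=20)   # -> (0.75, ~0.37)
```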
• 32. RQ5: Training time comparisons
• OOB versions vs. feature-complete official versions
• Reimplementations are generally slow
• KerasGRU4Rec and Recpack versions scale badly (no sampling)
• Largest slowdown factor: 335.87x
• Epoch time (cross-entropy, best hyperparameters), Yoochoose dataset: GRU4Rec (original) 451.75; GRU4REC-pytorch 1948.18; Torch-GRU4Rec 2082.11; GRU4Rec_Tensorflow 981.87; KerasGRU4Rec 13740.04; Recpack 13458.18; Official PyTorch version 1117.46; Official Tensorflow version 1275.9
• Epoch time (cross-entropy, best hyperparameters), Rees46 dataset: GRU4Rec (original) 367.41; GRU4REC-pytorch 7618.05; Torch-GRU4Rec 7192.88; GRU4Rec_Tensorflow 2381.62; KerasGRU4Rec 123265.62; Recpack 49262.16; Official PyTorch version 632.02; Official Tensorflow version 679.505
• 33. What does this mean?
• Final tally
▪ MS Recommenders’ version is GRU4Rec in name only and deeply flawed
▪ Every other version misses at least one important feature of the original
▪ All versions have performance-decreasing bugs
▪ Two implementations scale poorly
• Potentially a lot of research from the last 6–7 years used flawed baseline(s)
▪ Hard to tell: papers rarely indicate which implementation was used
▪ Results might be invalidated
• GRU4Rec is probably not the only algorithm affected
▪ It has a public version to base reimplementations on, yet they are still flawed
▪ Other well-known baselines should be checked
• Discussion points
▪ Responsibility
▪ Trust in the tools we use
▪ How to correct affected work?
• 34–38. What can you do?
• If your research used a flawed version
▪ Rerun the experiments with the official code
▪ Extend your work with the new results
• If you want to help
▪ Check the reimplementations of other popular baselines
• As an author
▪ Always state which implementation you used for every baseline, including a link and, optionally, a commit hash
▪ Use official code if possible
• If you reimplement an algorithm
▪ Validate your version against the original before using or releasing it
o Compare the metrics achieved on multiple datasets under multiple hyperparameter settings
o Compare recommendation lists (see the sketch after this list)
▪ Check whether your version has every feature/setting of the original
▪ Describe the validation process and its results in the README
▪ Consider whether future changes to the original code (e.g. bugfixes) should be ported to your version as well
▪ If the implementations diverge because the original changed, state it clearly
• As the maintainer of a benchmarking framework
▪ Everything that applies to reimplementing any algorithm
▪ Plus: validate every reimplementation submitted by contributors
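A minimal sketch of the recommendation-list comparison suggested above (our own code, with hypothetical names): run the original and the reimplementation on the same test events and report the average overlap of their top-k lists. A large gap is a strong signal that the two versions are not equivalent, even if aggregate metrics look similar.

```python
def mean_topk_overlap(lists_a, lists_b, k=20):
    # lists_a, lists_b: ranked item-id lists produced by the two implementations
    # for the same test events, in the same order
    overlaps = [len(set(a[:k]) & set(b[:k])) / k for a, b in zip(lists_a, lists_b)]
    return sum(overlaps) / len(overlaps)

# Identical lists give 1.0, disjoint lists give 0.0
mean_topk_overlap([[1, 2, 3]], [[1, 2, 4]], k=3)   # -> 0.666...
```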
• 39–40. The wider picture (towards standardized benchmarking)
• State of RecSys benchmarking
▪ Little has changed in the last decade
▪ The focus is on baseline reimplementations: collecting algorithms
▪ Evaluation is somewhat neglected
o Incorrect assumptions: one/few setups fit all tasks; there is a single correct evaluation setup
• Towards standardized benchmarking
▪ Collect popular recommendation tasks (e.g. CTR prediction, session-based recommendation, user-based recommendation, warm/cold-start versions, reoccurrence prediction, etc.)
▪ Let evaluation stem from the tasks: agree on offline evaluation setups, datasets, and their preprocessing for each task
▪ Focus on the evaluation code of these setups, including datasets & preprocessing
▪ Provide simple interfaces for evaluating external algorithms (see the sketch below), so that authors can use the framework during their research
▪ Only once everything is ready, add some of the most well-known baselines
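As a sketch of what such an interface could look like (entirely hypothetical, not the API of any existing framework): the framework owns the task definition, the data split, and the metrics, and an external algorithm only has to implement two methods.

```python
from abc import ABC, abstractmethod

class SessionRecommender(ABC):
    @abstractmethod
    def fit(self, train_sessions):
        """Train on session data provided in the framework's standard format."""

    @abstractmethod
    def rank_items(self, session_prefix, k):
        """Return the top-k item ids predicted to continue the given session."""

def hit_rate_at_k(model, test_cases, k=20):
    # test_cases: (session_prefix, true_next_item) pairs owned by the framework,
    # so every external algorithm is evaluated on exactly the same events.
    hits = sum(next_item in model.rank_items(prefix, k)
               for prefix, next_item in test_cases)
    return hits / len(test_cases)
```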
• 41. Thanks for your attention! Read the paper and check out the project website. We’d also like to help: official reimplementations of GRU4Rec are available in PyTorch and Tensorflow.