COMPARING ON-LINE and OFF-LINE: EVALUATION RESULTS
Off-line vs. On-line Evaluation of Recommender
Systems in Small E-commerce
Ladislav Peška
Department of Software Engineering
Charles University in Prague, Czech Republic
ABSTRACT
Recommending in the context of small e-commerce enterprises is
rather challenging due to the lower volume of interactions and low
user loyalty, rarely extending beyond a single session. On the other
hand, we usually have to deal with lower volumes of objects, which
are easier for users to discover via searching or the browsing GUI.
The main goal of this paper is to determine applicability of off-line
evaluation metrics in learning usability of recommender systems
(evaluated on-line in A/B testing). In total 800 variants of
recommending algorithms were evaluated off-line w.r.t. 18 metrics
covering rating-based, ranking-based, novelty and diversity
evaluation.
Off-line results were compared with on-line evaluation of 12
selected recommender variants. Off-line results showed a great
variance in performance w.r.t. different metrics, with the Pareto
front covering 68% of the approaches. On-line results were highly
diversified w.r.t. the volume of objects visited by the user.
Ranking-based metrics provided the best estimation for novel users.
We further train two regressors to predict on-line results based on
the off-line metrics and estimate performance of recommenders not
evaluated in A/B testing directly.
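The Pareto front mentioned in the abstract can be illustrated with a minimal sketch. It assumes every metric is oriented so that higher is better (an error metric such as MAE would have to be negated first); the variant names and scores below are hypothetical toy values, not results from the poster.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, Sequence[float]]) -> list[str]:
    """Return the names of all variants that no other variant dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical (metric 1, metric 2) scores, higher is better for both.
variants = {"A": (0.8, 0.2), "B": (0.5, 0.5), "C": (0.4, 0.4), "D": (0.2, 0.9)}
front = pareto_front(variants)
print(front)  # C is dominated by B; A, B and D remain
```

With many metrics that trade off against each other, a large share of variants ends up non-dominated, which is consistent with a front covering most of the evaluated approaches.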
RESULTS and FUTURE WORK
• CTR and VRR became gradually less consistent with more
objects visited by the user.
o Perhaps, some users tend to observe all objects from some category ->
Modify VRR metric to incorporate elapsed time.
o Evaluate more business-oriented metrics in the future (conversions,
revenue, actions after click).
• For users with lower volume of visited objects (majority of the
dataset), ranking-based metrics are best estimators of on-line
performance.
o Intra list diversity (ILD) seems to gain some importance for users with
longer history.
o Rating-based and novelty metrics were mostly negatively correlated or
indifferent throughout the dataset.
o Aim for a finer-grained classification of users in the future.
• Evaluate metrics based on knowledge of user’s choices (MNAR).
o However, high ratio of VRR / CTR scores indicates potentially lower effect
of missing not at random.
• Evaluate other off-line metrics, such as object’s popularity.
• Evaluate regression/ranking methods aiming to predict on-line
results from off-line metrics.
• Verify results on additional small e-commerce vendors & for
additional recommending algorithms.
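The poster does not spell out how intra-list diversity (ILD) is computed. One common definition, used here purely as an assumption, is the mean pairwise distance between the items of a recommended list. The sketch below uses cosine distance over hypothetical item vectors.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def intra_list_diversity(items: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine distance within one recommended list (assumed ILD definition)."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical item vectors: two identical items and one orthogonal item.
rec_list = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
ild = intra_list_diversity(rec_list)
print(round(ild, 3))
```

For ILD10 the same average would be taken over the top 10 recommended objects.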
DOMAIN: Czech travel agency
• Approx. 300-800 visitors daily, several hundred to thousands of objects.
• However, just a few visited objects per user, low user loyalty.
• Over 2 years of historic data, 560K records, complex feedback available.
ALGORITHMS: Item-to-Item models
Low user loyalty and high fluctuation of users would prevent effective usage of
user-based algorithms, such as matrix factorization.
• Word2vec (item2vec): CF based on the stream of object visits.
• Doc2vec: CB based on the textual descriptions of tours.
• VSM (Cosine): CB based on the descriptive features of tours (length, price,
destination, meal plan etc.)
Hyperparameters: embedding size, context window size, diversity and novelty
enhancements, user profile.
USER PROFILE: (Which objects were used to represent user?)
• Mean: All visited objects are used, similarities per visited object are averaged.
• Max: All objects are used, max(similarity w.r.t. visited object) is used.
• Last(-k): Only last (k) objects are used with linearly decreasing weight.
• Temporal(-k): Last k (all) objects are used with decreasing weight based on
the real time (days) elapsed from the feedback observation.
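The four user-profile variants above can be sketched as different ways of aggregating the similarities between one candidate object and the user's visited objects. The poster gives no exact weighting formulas, so the linear weights 1..k for Last(-k) and the exponential half-life decay for Temporal are assumptions, as are all function and parameter names.

```python
from typing import Sequence


def weighted_mean(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of similarities; 0.0 if all weights are zero."""
    total = sum(weights)
    return sum(s * w for s, w in zip(sims, weights)) / total if total else 0.0


def mean_profile(sims: Sequence[float]) -> float:
    """Mean: average similarity over all visited objects."""
    return sum(sims) / len(sims)


def max_profile(sims: Sequence[float]) -> float:
    """Max: similarity to the closest visited object."""
    return max(sims)


def last_k_profile(sims: Sequence[float], k: int) -> float:
    """Last(-k): only the k most recent visits, weights falling linearly with age.

    Assumes sims is ordered from oldest to newest visit.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    recent = list(sims[-k:])
    weights = list(range(1, len(recent) + 1))  # newest visit gets the largest weight
    return weighted_mean(recent, weights)


def temporal_profile(sims: Sequence[float], days_ago: Sequence[float],
                     half_life: float = 7.0) -> float:
    """Temporal: weights decay with the days elapsed since each visit (assumed half-life)."""
    weights = [0.5 ** (d / half_life) for d in days_ago]
    return weighted_mean(sims, weights)


# Hypothetical similarities of one candidate to three visited objects (oldest first).
sims = [0.2, 0.4, 0.9]
print(mean_profile(sims), max_profile(sims))
print(last_k_profile(sims, k=2))
print(temporal_profile(sims, days_ago=[14, 7, 0]))
```

Candidates would then be ranked by the aggregated score; the choice of profile decides how strongly the most recent visits dominate that ranking.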
EVALUATION:
Off-line phase: June 1 – July 19, 2018
- 970 users (with visited objects in both train and test set)
- 800 variants of [algorithm, hyperparameters, user profile] were evaluated.
On-line phase: July 19 – August 17, 2018
• Selected 12 algorithms (best & worst w.r.t. each off-line metric)
• 4287 users (with some visited objects to create a user profile)
o One RS variant assigned to each user
• In total 928 click-throughs (CTR)
• In total 10961 visits after recommendation (VRR)
Peter Vojtáš
Department of Software Engineering
Charles University in Prague, Czech Republic
PEARSON’S CORRELATION (figure panel; image not reproduced)
OFF-LINE METRICS and on-line results
Off-line metric groups: rating (MAE); ranking (AUC, MRR, nDCG100); novelty (Nov10t, Nov10u); diversity (ILD10). On-line metrics: CTR, VRR.

Id  Algorithm                         MAE    AUC    MRR    nDCG100  Nov10t  Nov10u  ILD10  CTR     VRR
0   Doc2vec; e:128, w:1, last, nov.   0.292  0.617  0.031  0.057    0.234   1.000   0.800  0.0070  0.050
1   Doc2vec; e:128, w:1, temp., div.  0.362  0.679  0.031  0.075    0.221   0.999   0.838  0.0084  0.075
2   Doc2vec; e:32, w:5, mean          0.455  0.555  0.028  0.050    0.211   0.997   0.786  0.0089  0.054
3   Doc2vec; e:32, w:5, mean, div.    0.455  0.555  0.025  0.046    0.214   0.998   0.859  0.0062  0.060
4   Doc2vec; e:128, w:5, max, nov.    0.214  0.526  0.012  0.031    0.229   0.995   0.741  0.0077  0.052
5   Cosine; temp., nov.               0.406  0.797  0.146  0.215    0.255   0.994   0.270  0.0057  0.020
6   Cosine; mean, nov.                0.400  0.795  0.149  0.214    0.229   0.994   0.223  0.0119  0.088
7   Cosine; last-10                   0.390  0.783  0.127  0.205    0.218   0.996   0.208  0.0075  0.055
8   Word2vec; e:64, w:5, mean, div.   0.414  0.809  0.103  0.182    0.215   0.973   0.683  0.0090  0.062
9   Word2vec; e:32, w:5, temp., nov.  0.438  0.816  0.102  0.195    0.244   0.977   0.495  0.0095  0.065
10  Word2vec; e:128, w:3, last        0.290  0.734  0.097  0.168    0.212   0.997   0.534  0.0077  0.056
11  Word2vec; e:32, w:3, last-10      0.432  0.814  0.134  0.229    0.214   0.988   0.443  0.0080  0.089

Comparing on-line and off-line results for users with 1-5 visited objects. Parameters: e = embedding size,
w = context window size; nov. and div. denote novelty and diversity enhancements. In the original poster, the
best results w.r.t. each metric are bold and green, the worst are red, and the best per algorithm type are italic.
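As an illustration of how off-line and on-line columns can be compared, the sketch below computes Pearson's correlation across the 12 variants, using the MRR, nDCG100, CTR and VRR values copied from the table above. It is a minimal re-computation, not the authors' evaluation script, and with only 12 points the coefficients are noisy.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Values for variants 0-11, copied from the table (users with 1-5 visited objects).
mrr = [0.031, 0.031, 0.028, 0.025, 0.012, 0.146, 0.149, 0.127, 0.103, 0.102, 0.097, 0.134]
ndcg = [0.057, 0.075, 0.050, 0.046, 0.031, 0.215, 0.214, 0.205, 0.182, 0.195, 0.168, 0.229]
ctr = [0.0070, 0.0084, 0.0089, 0.0062, 0.0077, 0.0057, 0.0119, 0.0075, 0.0090, 0.0095, 0.0077, 0.0080]
vrr = [0.050, 0.075, 0.054, 0.060, 0.052, 0.020, 0.088, 0.055, 0.062, 0.065, 0.056, 0.089]

# Correlation of each off-line metric with each on-line metric.
corr = {
    (off_name, on_name): pearson(off_vals, on_vals)
    for off_name, off_vals in (("MRR", mrr), ("nDCG100", ndcg))
    for on_name, on_vals in (("CTR", ctr), ("VRR", vrr))
}
for pair, r in corr.items():
    print(pair, round(r, 3))
print("CTR vs VRR", round(pearson(ctr, vrr), 3))
```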
KEY FINDINGS
• One of the surprising findings was that results are highly dependent
on the number of objects visited by the user. While for users with
lower volume of objects, both on-line metrics were highly correlated
and ranking-based metrics provide most relevant estimations, CTR
and VRR became gradually less consistent for users with longer history.
This is further illustrated by the volume of interactions. While per-user
CTR gradually decreases with visited objects, VRR increases.
• A reasonable level of diversity seems important for users with more
visited objects. This was also indicated by the regression methods
trained to predict on-line results from off-line metrics.
• As for the recommending algorithms, trained regressors prefer Cosine
and Word2vec models over Doc2vec, which is also observable from
the actual on-line results.
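The poster does not say which regressors were trained. As a stand-in, the sketch below fits a one-variable ordinary least-squares line that predicts VRR from nDCG100, using the values of the 12 variants in the results table, and produces leave-one-out predictions to mimic estimating a variant that was not A/B tested. The choice of model and of a single feature is an assumption made for illustration.

```python
from typing import Sequence


def fit_line(x: Sequence[float], y: Sequence[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(x: Sequence[float], y: Sequence[float]) -> list[float]:
    """Predict each point from a line fitted on all the other points."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(list(x[:i]) + list(x[i + 1:]),
                                    list(y[:i]) + list(y[i + 1:]))
        preds.append(slope * x[i] + intercept)
    return preds


# nDCG100 (off-line) and VRR (on-line) for variants 0-11, copied from the table.
ndcg = [0.057, 0.075, 0.050, 0.046, 0.031, 0.215, 0.214, 0.205, 0.182, 0.195, 0.168, 0.229]
vrr = [0.050, 0.075, 0.054, 0.060, 0.052, 0.020, 0.088, 0.055, 0.062, 0.065, 0.056, 0.089]

slope, intercept = fit_line(ndcg, vrr)
loo_predictions = leave_one_out(ndcg, vrr)

# In-sample squared error of the line versus always predicting the mean VRR.
mean_vrr = sum(vrr) / len(vrr)
sse_line = sum((slope * a + intercept - b) ** 2 for a, b in zip(ndcg, vrr))
sse_mean = sum((mean_vrr - b) ** 2 for b in vrr)
print(slope, intercept, sse_line, sse_mean)
```

Comparing the leave-one-out predictions with the observed VRR gives a rough sense of how far a single off-line metric can be trusted as a proxy on this small sample.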
Evaluation site:
www.slantour.cz
Code & full results:
github.com/lpeska/REVEAL2018
Contact:
peska@ksi.mff.cuni.cz
vojtas@ksi.mff.cuni.cz
Figure panel: Users with 1-2 visited objects
Figure panel: Comparison of total and per-user CTR and VRR scores
