Get on with it!
Recommender system industry
challenges move towards real-world,
online evaluation
Padova – March 24th, 2016
Andreas Lommatzsch - TU Berlin, Berlin, Germany
Jonas Seiler - plista, Berlin, Germany
Daniel Kohlsdorf - XING, Hamburg, Germany
CrowdRec - www.crowdrec.eu
Idomaar - http://rf.crowdrec.eu
Where are recommender
system challenges headed?
Direction 1:
Use info beyond the
user-item matrix.
Direction 2:
Online evaluation +
multiple metrics.
Moving towards real-world evaluation
Flickr credit: rodneycampbell
Why evaluate?
• Evaluation is crucial for the success of real-life systems
• How should we evaluate?
• Precision and recall
• Technical complexity
• Influence on sales
• Required hardware resources
• Business models
• Scalability
• Diversity of the presented results
• User satisfaction
Traditional Evaluation in IR
“The Cranfield paradigm”
Evaluation Settings
• A static collection of documents
• A set of queries
• A list of relevant documents defined by experts for each query
Traditional Evaluation in IR
Advantages
• Reproducible setting
• All researchers have exactly the same information
• Optimized for measuring precision (see the sketch below)
[Figure: a query linked to its expert-judged relevant documents]
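To make the Cranfield setup concrete, here is a minimal sketch of precision and recall computed against expert judgments; the queries, documents, and judgments below are invented:

    # Cranfield-style evaluation: a fixed query set, expert-judged relevant
    # documents, and deterministic precision/recall per query.
    qrels = {"q1": {"d1", "d4"}, "q2": {"d2"}}          # expert judgments
    results = {"q1": ["d1", "d2", "d4"], "q2": ["d3"]}  # system output

    for q, retrieved in results.items():
        hits = len(set(retrieved) & qrels[q])
        print(q,
              "precision:", hits / len(retrieved),
              "recall:", hits / len(qrels[q]))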
Weaknesses of traditional IR evaluation
• High cost of creating datasets
• Datasets are not up to date
• Domain-specific documents
• The expert-defined ground truth does not consider individual user preferences
• Context-awareness is not considered
• Technical aspects are ignored
Context is
everything
Industry and recsys challenges
• Challenges benefit both industry and academic research.
• We look at how industry challenges have evolved since the Netflix Prize (concluded in 2009).
Traditional Evaluation in RecSys
“The Netflix paradigm”
Evaluation Settings
• Rating prediction on user-item matrices
• Large, sparse datasets
• Predict personalized ratings
• Cross-validation, RMSE (see the sketch below)
Advantages
• Reproducible setting
• Personalization
• Dataset is based on real user ratings
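As a reminder of what the Netflix-paradigm metric measures, a minimal RMSE sketch over a held-out rating sample (all values invented):

    # RMSE between predicted and held-out true ratings.
    import math

    true_ratings = [4.0, 3.0, 5.0, 2.0]
    predicted    = [3.8, 3.4, 4.5, 2.5]

    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(true_ratings, predicted))
                     / len(true_ratings))
    print("RMSE:", rmse)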
Traditional Evaluation in RecSys
Weaknesses of traditional Recommender evaluation
• Static data
• Only one type of data: user ratings
• User ratings are noisy
• Temporal aspects tend to be ignored
• Context-awareness is not considered
• Technical aspects are ignored
Challenges of Developing Applications
Challenges
• Data streams - continuous changes
• Big data
• Combine knowledge from different sources
• Context-Awareness
• Users expect personally relevant results
• Heterogeneous devices
• Technical complexity, real-time requirements
How can we address these challenges in the evaluation?
• Realistic evaluation setting
• Heterogeneous data sources
• Streams
• Dynamic user feedback
• Appropriate metrics
• Precision and User satisfaction
• Technical complexity
• Sales and Business models
• Online and Offline Evaluation
How to Set Up a Better Evaluation?
Approaches for a Better Evaluation
• News recommendations @ plista
• Job recommendations @ XING
The plista Recommendation Scenario
Setting
● 250 ms response time
● ≈350 million impressions per day
● Active in 10 countries
Challenges
● News items change continuously
● Users do not log in explicitly
● Seasonality and context-dependent user preferences
Evaluation @ plista
Offline
• Cross-validation
• Metric Optimization Engine (https://github.com/Yelp/MOE)
• Integration into Spark
• How well does it correlate with online evaluation?
• Time complexity
Online
• A/B tests
• Limited by caching memory and computational resources
• MOE (Metric Optimization Engine)
Evaluation using MOE
Offline
• Mean and variance estimation over the parameter space with a Gaussian Process
• Evaluate the parameter setting with the highest Expected Improvement (EI), Upper Confidence Bound, … (see the sketch below)
• REST API
Online
• A/B tests are expensive
• Model non-stationarity
• Integrate out non-stationarity to get the mean EI
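A minimal sketch of the idea behind MOE, not its actual API: fit a Gaussian Process to the (parameter, CTR) pairs observed so far, then pick the candidate with the highest Expected Improvement. It uses scikit-learn for the GP; the parameter grid and CTR values are invented:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    X = np.array([[0.1], [0.4], [0.7]])  # parameter settings tried so far
    y = np.array([0.021, 0.034, 0.029])  # observed CTRs (hypothetical)

    gp = GaussianProcessRegressor().fit(X, y)

    # mean/variance estimate over the parameter space
    candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected Improvement over the best observation so far
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    print("next parameter to test online:", candidates[np.argmax(ei)])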
The CLEF-NewsREEL challenge
Provide an API enabling researchers to test their own ideas
• A challenge in CLEF (Conferences and Labs of the Evaluation Forum)
• 2 tasks: online and offline evaluation
CLEF-NewsREEL Online Task
How does the challenge work?
• Live streams consisting of impressions, requests, and clicks; 5 publishers; approximately 6 million messages per day
• Technical requirement: 100 ms per request
• Live evaluation based on CTR (see the sketch below)
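A minimal sketch of how a CTR-based leaderboard could be maintained over the live stream; the message format here is invented (the real ORP messages are richer JSON):

    from collections import Counter

    shown, clicked = Counter(), Counter()

    def on_message(msg):
        # msg: {"type": "impression" | "click", "team": str}  (assumed shape)
        if msg["type"] == "impression":
            shown[msg["team"]] += 1
        elif msg["type"] == "click":
            clicked[msg["team"]] += 1

    def ctr(team):
        # click-through rate: clicks per shown recommendation
        return clicked[team] / shown[team] if shown[team] else 0.0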
CLEF-NewsREEL Offline Task
Online vs. offline evaluation
• Technical aspects can be evaluated without user feedback
• Analyze the required resources and the response time
• Simulate the online evaluation by replaying a recorded stream (see the sketch below)
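A minimal sketch of stream replay, assuming a list of timestamped events and a recommend() callback; this is not the Idomaar API, just the pacing idea:

    import time

    def replay(events, recommend, budget_ms=100):
        # events: (timestamp_in_seconds, payload) pairs, sorted by time
        wall_start, log_start = time.time(), events[0][0]
        for ts, payload in events:
            # wait until this event is "due" relative to the recording
            time.sleep(max(0.0, (ts - log_start) - (time.time() - wall_start)))
            began = time.time()
            recommend(payload)
            latency_ms = (time.time() - began) * 1000.0
            if latency_ms > budget_ms:
                print(f"late response: {latency_ms:.1f} ms > {budget_ms} ms")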
CLEF-NewsREEL Offline Task
Challenge
• Realistic simulation of streams
• Reproducible setup of computing environments
Solution
• A framework simplifying the setup of the evaluation environment
• The Idomaar framework, developed in the CrowdRec project
http://rf.crowdrec.eu
More Information
• SIGIR Forum, December 2015 (Vol. 49, No. 2)
  http://sigir.org/files/forum/2015D/p129.pdf
CLEF-NewsREEL
Evaluate your algorithm online and offline in NewsREEL
• Register for the challenge!
  http://crowdrec.eu/2015/11/clef-newsreel-2016/
  (registration open until April 22nd)
• Tutorials and templates are provided at orp.plista.com
XING - Evaluation based on interaction
● On XING, users can give explicit feedback on recommendations.
● The amount of explicit user feedback is much lower than that of implicit signals.
● A/B tests focus on click-through rate.
XING - RecSys Challenge, Scoring, Space on Page
● Predict 30 items for each user.
● Score: a weighted combination of precision at several cut-offs (see the sketch below)
  ○ precisionAt(2)
  ○ precisionAt(4)
  ○ precisionAt(6)
  ○ precisionAt(20)
● Top 6 items are visible on the page
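A minimal sketch of this kind of score; the weights below are invented, the challenge defines its own:

    WEIGHTS = {2: 1.0, 4: 1.0, 6: 2.0, 20: 1.0}  # hypothetical weights

    def precision_at(k, predicted, relevant):
        # fraction of the first k predicted items that are relevant
        return len(set(predicted[:k]) & set(relevant)) / k

    def score(predicted, relevant):
        return sum(w * precision_at(k, predicted, relevant)
                   for k, w in WEIGHTS.items())

    print(score(["a", "b", "c", "d"], {"a", "c"}))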
XING - RecSys Challenge, User Data
• User ID
• Job Title
• Educational Degree
• Field of Study
• Location
XING - RecSys Challenge, User Data
• Number of past jobs
• Years of Experience
• Current career level
• Current discipline
• Current industry
XING - RecSys Challenge, Item Data
• Job title
• Desired career level
• Desired discipline
• Desired industry
XING - RecSys Challenge, Interaction Data
• Timestamp
• User
• Job
• Type: deletion, click, or bookmark (see the record sketch below)
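Put together, one interaction record could look like the following; the field names are paraphrased from the slides and the types are assumptions:

    from dataclasses import dataclass

    @dataclass
    class Interaction:
        timestamp: int  # when the interaction happened (epoch seconds assumed)
        user_id: int
        job_id: int
        kind: str       # "deletion", "click", or "bookmark"

    sample = Interaction(timestamp=1457999999, user_id=42, job_id=7, kind="click")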
XING - RecSys Challenge, Future
• Live challenge
• Participants submit predicted future interactions
• The submitted solutions are recommended on the platform
• Participants get points for actual user clicks
Cycle: release to challenge → work on predictions → collect clicks → score
How to set up a better evaluation
• Consider different quality criteria (prediction quality, technical aspects, business models)
• Aggregate heterogeneous information sources
• Consider user feedback
• Use online and offline analyses to understand users and their requirements
Concluding ...
Participate in challenges based on real-life scenarios
• NewsREEL challenge
• RecSys 2016 challenge
=> Organize a challenge. Focus on real-life data.