Telefonica is a fast-growing Telecom
1989. Clients: about 12 million subscribers. Services: basic telephone and data services. Geographies: Spain. Staff: about 71,000 professionals. Finances: Rev: 4,273 M€, EPS(1): 0.45 €
2000. Clients: about 68 million customers. Services: wireline and mobile voice, data and Internet services. Geographies: operations in 16 countries. Staff: about 149,000 professionals. Finances: Rev: 28,485 M€, EPS(1): 0.67 €
2008. Clients: about 260 million customers. Services: integrated ICT solutions for all customers. Geographies: operations in 25 countries. Staff: about 257,000 professionals. Finances: Rev: 57,946 M€, EPS: 1.63 €
(1) EPS: Earnings per share
Currently among the largest in the world
Telco sector worldwide ranking by market cap (US$ bn). Source: Bloomberg, 06/12/09
Leader in South America
Total Accesses (as of March ‘09): 159.5 million
Argentina: 20.9 million
Brazil: 61.4 million
Central America: 6.1 million
Colombia: 12.6 million
Chile: 10.1 million
Ecuador: 3.3 million
Mexico: 15.7 million
Peru: 15.2 million
Uruguay: 1.5 million
Venezuela: 12.0 million
[Map: wireline and mobile market rank per country]
Notes:
- Central America includes Guatemala, Panama, El Salvador and Nicaragua
- Total accesses figure includes Narrowband Internet accesses of Terra Brasil and Terra Colombia, and Broadband Internet accesses of Terra Brasil, Telefónica de Argentina, Terra Guatemala and Terra México.
And a significant footprint in Europe
Total Accesses (as of March ’09): 93.8 million
Spain: 47.2 million
UK: 20.8 million
Germany: 16.0 million
Ireland: 1.7 million
Czech Republic: 7.7 million
Slovakia: 0.4 million
[Map: wireline and mobile market rank per country]
Telefonica R&D (TID) is the Research and Development Unit of the Telefónica Group
MISSION: “To contribute to the improvement of the Telefónica Group’s competitiveness through technological innovation”
Founded in 1988
Largest private R&D center in Spain
More than 1100 professionals
Five centers in Spain and two in Latin America
In 2008 Telefónica was the first Spanish company by R&D investment and the third in the EU
[Chart labels: Products / Services / Processes development; Technological Innovation (1); R&D 594 M€; 4.384 M€; Applied research R&D 61 M€]
TID Scientific Groups: Publications, Patents, TechTransfer
Pablo Rodriguez, Internet Scientific Director
Nuria Oliver, Multimedia Scientific Director
Data Mining and User Modeling Acting Scientific Director
Internet Scientific Areas
Content Distribution and P2P: Next generation Managed P2P-TV; Future Internet: Content Networking; Delay Tolerant Bulk Distribution; Network Transparency
Social Networks: Information Propagation; Social Search Engines; Infrastructure for Social based cloud computing
Wireless and Mobile Systems: Wireless bundling; Device2Device Content Distribution; Large Scale mobile data analysis
Multimedia Scientific Areas
Multimedia Core: Multimedia Data Analysis, Search & Retrieval; Video, Audio, Image, Music, Text, Sensor Data; Understanding, Summarization, Visualization
Mobile and Ubicomp: Context Awareness; Urban Computing; Mobile Multimedia & Search; Wearable Physiological Monitoring
HCC: Multimodal User Interfaces; Expression, Gesture, Emotion Recognition; Personalization & Recommendation Systems; Super Telepresence
Data Mining & User Modeling Areas
DATA MINING
Integration of statistical & knowledge-based techniques
Stream mining
Large scale & distributed machine learning
USER MODELING
Application to new services (technology for development)
Cognitive, socio-cultural, and contextual modeling
Behavioral user modeling (service-use patterns)
SOCIAL NETWORK ANALYSIS & BUSINESS INT.
Analytical CRM
Trend-spotting, service propagation & churn
Social Graph Analysis (construction, dynamics)
I like it... I like it not
Evaluating User Ratings Noise in Recommender Systems
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver
Telefonica Research
Recommender Systems are everywhere
Netflix: 2/3 of the movies rented were recommended
Google News: recommendations generate 38% more clickthrough
Amazon: 35% sales from recommendations
“We are leaving the age of Information and entering the Age of Recommendation” - The Long Tail (Chris Anderson)
The Netflix Prize
500K users x 17K movie titles = 100M ratings = $1M (if you “only” improve existing system by 10%! From 0.95 to 0.85 RMSE)
This is what Netflix thinks a 10% improvement is worth for their business
49K contestants on 40K teams from 184 countries.
41K valid submissions from 5K teams; 64 submissions in the “last 24 hours”
But, is there a limit to RS accuracy?
Evolution of accuracy in Netflix Prize
The Magic Barrier
Magic Barrier = Limit on prediction accuracy due to noise in original data
Natural Noise = involuntary noise introduced by users when giving feedback
Due to (a) mistakes, and (b) lack of resolution in the personal rating scale (e.g. on a 1 to 5 scale, a 2 may mean the same as a 3 for some users and some items).
Magic Barrier >= Natural Noise Threshold
We cannot predict with less error than the resolution in the original data
The Question in the Wind
Our related research questions
Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common Rating procedure?
Q2. How large is the prediction error due to these inconsistencies?
Q3. What factors affect user inconsistencies?
Experimental Setup (I)
Test-retest procedure: you need at least 3 trials to separate reliability from stability
Reliability : how much you can trust the instrument you are using (i.e. ratings)
$r = r_{12} r_{23} / r_{13}$
Stability : drift in user opinion
$s_{12} = r_{13} / r_{23}$; $s_{23} = r_{13} / r_{12}$; $s_{13} = r_{13}^2 / (r_{12} r_{23})$
Users rated movies in 3 trials
Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3
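The reliability and stability formulas above can be sketched in a few lines. This is a minimal illustration, assuming that r12, r23 and r13 are Pearson correlations between the ratings given in each pair of trials; the function names and the example correlations are hypothetical and not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def reliability(r12, r23, r13):
    """Overall reliability r = r12 * r23 / r13."""
    return r12 * r23 / r13

def stabilities(r12, r23, r13):
    """Stability coefficients between each pair of trials."""
    return {
        "s12": r13 / r23,
        "s23": r13 / r12,
        "s13": r13 ** 2 / (r12 * r23),
    }

# Hypothetical correlations, chosen only to make the arithmetic easy to follow.
example_r = reliability(0.9, 0.9, 0.81)
example_s = stabilities(0.9, 0.9, 0.81)
```

Note that under these definitions s13 always equals s12 times s23, which gives a quick sanity check on any reported set of stabilities.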
Experimental Setup (II)
100 Movies selected from Netflix dataset doing a stratified random sampling on popularity
Ratings on a 1 to 5 star scale
Special “not seen” symbol.
Trial 1 and 3 = random order; trial 2 = ordered by popularity
118 participants
Results
Comparison to Netflix Data
Distribution of number of ratings per movie very similar to Netflix but average rating is lower (users are not voluntarily choosing what to rate)
Test-retest Stability and Reliability
Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
Stabilities might also be accounting for “learning effect” (note s12<s23)
Analysis of User Inconsistencies
Effect of the Not-Seen. Given a pair of consecutive trials:
More than 10% of items rated in a trial are then not rated in the following one
More than 20% of items are rated in only one of the two trials
RMSE due to Inconsistencies
Higher between R1 and R3 (same order, longer time)
Lower between R2 and R3 (removed “learning” effects?)
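As a rough sketch of how an RMSE between two trials could be computed, the snippet below compares the ratings one user gave to the same items in two trials, using only the items rated in both (the intersection). The dictionary layout and the sample ratings are assumptions made for illustration, not data from the experiment.

```python
import math

def rmse_between_trials(trial_a, trial_b):
    """RMSE over items rated in both trials.

    Each trial is a dict mapping item -> rating; "not seen" items are absent.
    """
    common = trial_a.keys() & trial_b.keys()
    if not common:
        raise ValueError("no items rated in both trials")
    squared_error = sum((trial_a[i] - trial_b[i]) ** 2 for i in common)
    return math.sqrt(squared_error / len(common))

# Hypothetical ratings from one user in two trials.
r1 = {"m1": 4, "m2": 2, "m3": 5, "m4": 3}
r3 = {"m1": 4, "m2": 3, "m3": 5, "m5": 1}
example_rmse = rmse_between_trials(r1, r3)
```

Items rated in only one of the two trials (m4 and m5 here) are ignored by this sketch; handling them requires an extra convention for the missing rating.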
Impacting Variables (I)
Rating Scale Effect
Extreme ratings are more consistent
2 and 3 are the least consistent
34% of inconsistencies are between 2 and 3 and 25% between 3 and 4
90% of inconsistencies are ±1
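A breakdown like the one above can be obtained by tallying, for every item rated differently in two trials, which pair of values was involved. The sketch below is one possible way to do it; the function name and the sample ratings are hypothetical.

```python
from collections import Counter

def inconsistency_profile(trial_a, trial_b):
    """Count inconsistent rating pairs and the share that differ by one star.

    Each trial is a dict mapping item -> rating on the 1 to 5 scale.
    """
    pairs = Counter()
    for item in trial_a.keys() & trial_b.keys():
        a, b = trial_a[item], trial_b[item]
        if a != b:
            # Store the pair in sorted order so (2, 3) and (3, 2) count together.
            pairs[tuple(sorted((a, b)))] += 1
    total = sum(pairs.values())
    off_by_one = sum(c for (low, high), c in pairs.items() if high - low == 1)
    share = off_by_one / total if total else 0.0
    return pairs, share

# Hypothetical ratings from one user in two trials.
first = {"a": 2, "b": 3, "c": 5, "d": 1, "e": 4}
second = {"a": 3, "b": 4, "c": 5, "d": 3, "e": 4}
example_pairs, example_share = inconsistency_profile(first, second)
```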
Impacting Variables (II)
Item Order Effect
R1 is the trial with the most inconsistencies
R3 has fewer, but not when excluding “not seen” (learning effect improves “not seen” discrimination)
R2 minimizes inconsistencies because of order (reducing “contrast effect”).
Impacting Variables (and III)
User Rating Speed Effect
Evaluation time decreases as survey progresses in R1 and R3 (users losing attention but also learning)
In R2 evaluation time starts decreasing until users find segment of “popular” movies
Rating speed is not correlated with inconsistencies
Long-term Errors and Stability
New trial 7 months later with a subset of the users (36 out of the 118 in the original set)
R1 <-> 15 days <-> R3 <-> 7 months <-> R4: all in the same random order
New Reliability (significantly lower): r = 0.8763 (less than 0.9)
New Stabilities (still high): s13 = 1.0025, s34 = 0.9706, and s14 = 0.9730
RMSE (much higher):
R13 = 0.6143, R14 = 0.6822, and R34 = 0.6835 for the intersection, and R13 = 0.7445, R14 = 0.8156 , R34 = 0.8014 for the union
Conclusions
Recommender Systems (and related Collaborative Filtering applications) are becoming extremely popular
Large research investments in coming up with better algorithms
However, understanding user feedback is often much more important for the end result
To lower the Magic Barrier, RS should find ways of obtaining better and less noisy feedback from users, and model user response in the algorithm.
So... What can we do?
Different proposals
In order to deal with noise in user feedback we have so far proposed 3 different approaches:
(1) Instead of regular users, take feedback from experts, which we expect to be less noisy (SIGIR09)
(2) Combine ensembles of datasets to identify which works better for each user (IJCAI09)
(3) Denoise user feedback by using a re-rating approach (Recsys09)
I will detail (1) and briefly mention the other 2.
The Wisdom of the Few
A Collaborative Filtering Approach Based on Expert Opinions from the Web
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver
Telefonica Research (Barcelona)
Neal Lathia
UCL (London)
First, a little quiz
Name that Book....
“ It is really only experts who can reliably account for their reactions”
Crowds are not always wise
Collaborative filtering is the preferred approach for Recommender Systems
Recommendations are drawn from your past behavior and that of similar users in the system
Standard CF approach:
Find your Neighbors from the set of other users
Recommend things that your Neighbors liked and you have not “seen”
Problem: predictions are based on a large dataset that is sparse and noisy
Overview of the Approach
expert = individual that we can trust to have produced thoughtful, consistent and reliable evaluations (ratings) of items in a given domain
Expert-based Collaborative Filtering
Find neighbors from a reduced set of experts instead of regular users.
Identify domain experts with reliable ratings
For each user, compute “ expert neighbors ”
Compute recommendations similar to standard kNN CF
Advantages of the Approach
Noise
Experts introduce less natural noise
Malicious Ratings
Dataset can be monitored to avoid shilling
Data Sparsity
Reduced set of domain experts can be motivated to rate items
Cold Start problem
Experts rate items as soon as they are available
Scalability
Dataset is several orders of magnitude smaller
Privacy
Recommendations can be computed locally
Take home message
Expert Collaborative Filtering
Is a new approach to recommendation but it builds on standard CF
Addresses many of standard CF's shortcomings
At least in some conditions, users prefer it over standard CF approaches
User study
57 participants, only 14.5 ratings/participant
50% of the users consider Expert-based CF to be good or very good
Expert-based CF: only algorithm with an average rating over 3 (on a 0-4 scale)
User Study
Results to the questions: “The recommendation list includes movies I like/dislike” (1-4 Likert)
Experts-CF clearly outperforms other methods
Expert Collaborative Filtering
Expert-based CF
Given user $u \in U$ and a similarity threshold $\delta$, find the set of experts $E' \subseteq E$ such that: $\forall e \in E' \Rightarrow \mathrm{sim}(u, e) \geq \delta$
confidence threshold $\tau$ = the minimum number of expert neighbors who must have rated the item in order to trust their prediction.
Given an item $i$, find $E'' \subseteq E'$ s.t. $\forall e \in E'' \Rightarrow r_{ei} \neq \text{unrated}$, and let $n = |E''|$.
if $n < \tau$ ⇒ no prediction, user mean is returned.
if $n \geq \tau$ ⇒ rating can be predicted: similarity-weighted average of the ratings input from each expert $e$ in $E''$
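The prediction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not say which similarity measure is used, so plain cosine similarity over co-rated items is assumed here, and the names `delta` (similarity threshold), `tau` (confidence threshold) and the toy ratings are all hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity over co-rated items (an assumed similarity measure)."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(user, experts, item, delta=0.01, tau=2):
    """Return (rating, predicted): predicted is False when falling back to the user mean."""
    user_mean = sum(user.values()) / len(user)
    sims = {e: cosine_sim(user, ratings) for e, ratings in experts.items()}
    # Experts at least `delta` similar to the user.
    similar = [e for e, s in sims.items() if s >= delta]
    # Of those, the experts who actually rated the item.
    raters = [e for e in similar if item in experts[e]]
    weight = sum(sims[e] for e in raters)
    if len(raters) < tau or weight == 0:
        return user_mean, False
    # Similarity-weighted average of the experts' ratings for the item.
    return sum(sims[e] * experts[e][item] for e in raters) / weight, True

user = {"a": 5, "b": 1}
experts = {
    "e1": {"a": 5, "b": 1, "c": 4},
    "e2": {"a": 4, "b": 2, "c": 5},
    "e3": {"a": 1, "b": 5},
}
rating, ok = predict(user, experts, "c", delta=0.5, tau=2)
fallback = predict(user, experts, "c", delta=0.5, tau=3)
```

With `tau=3` only two sufficiently similar experts have rated item "c", so the sketch falls back to the user's mean rating, mirroring the "no prediction" branch.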
Experts vs. Users Analysis
Mining the Web for Expert Ratings
Collections of expert ratings can be obtained almost directly on the web: we crawled the Rotten Tomatoes movie critics mash-up
Only those (169) with more than 250 ratings in the Netflix dataset were used
Dataset Analysis (# ratings)
Sparsity coefficient: 0.01 (users) vs. 0.07 (experts)
Average movie has 1K user ratings vs. 100 expert ratings
Average expert rated 400 movies, 10% rated > 1K
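The slides do not define the sparsity coefficient. One plausible reading, assumed here, is the fraction of the rater-by-item matrix that is filled in. Plugging in the rounded Netflix Prize figures quoted earlier in the talk (100M ratings, 500K users, 17K titles) gives a value consistent with the 0.01 reported for users.

```python
def sparsity_coefficient(n_ratings, n_raters, n_items):
    """Fraction of the rater x item matrix that holds a rating (assumed definition)."""
    return n_ratings / (n_raters * n_items)

# Rounded Netflix Prize figures: 100M ratings, 500K users, 17K movie titles.
users_coefficient = sparsity_coefficient(100_000_000, 500_000, 17_000)
```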
Dataset Analysis (average)
Users: average movie rating ~0.55 (3.2⋆);
10% ≤ 0.45 (2.8⋆), 10% ≥ 0.7 (3.8⋆)
Experts: average movie rating ~0.6 (3.4⋆)
10% ≤ 0.4 (2.6⋆), 10% ≥ 0.8 (4.2⋆)
user ratings centered around 0.7 (3.8⋆)
expert ratings centered around 0.6 (3.4⋆): small variability
only 10% of the experts have a mean score ≤ 0.55 (3.2⋆) and another 10% ≥ 0.7 (3.8⋆)
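The starred values quoted alongside the normalized scores are consistent with a linear mapping of the [0, 1] range onto the 1 to 5 star scale. The mapping below is inferred from the numbers on the slides rather than stated by them, so treat it as an assumption.

```python
def to_stars(score, low=1, high=5):
    """Map a normalized score in [0, 1] onto the star scale (inferred mapping)."""
    return low + score * (high - low)

def spread_to_stars(std, low=1, high=5):
    """Map a normalized standard deviation onto star units (no offset applies)."""
    return std * (high - low)

mean_in_stars = to_stars(0.55)        # user average movie rating
spread_in_stars = spread_to_stars(0.25)  # per-movie std for users
```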
Dataset Analysis (std)
Users:
per movie centered around 0.25 (1⋆), little variation
per user centered around 0.25, larger variability
Experts:
lower std per movie (0.15) and larger variation.
average std per expert = 0.2, small variability.
Dataset Analysis. Summary
Experts...
are much less sparse
rate movies all over the rating scale instead of being biased towards rating only “good” movies (different incentives).
but, they seem to consistently agree on the good movies.
have a lower overall standard deviation per movie: they tend to agree more than regular users.
tend to deviate less from their personal average rating.
Experimental Results
Evaluation Procedure
Use the 169 experts to predict ratings from 10,000 users sampled from the Netflix dataset
Prediction MAE using an 80-20 holdout procedure (5-fold cross-validation)
Top-N precision by classifying items as being “recommendable” given a threshold
Still, take results with a grain of salt... we have a user study backing up the approach
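The two measures used in this evaluation can be sketched as below. This is an illustrative reading, assuming that coverage is the fraction of held-out ratings for which a prediction could be made and that an item counts as "recommendable" when its rating reaches the threshold; the function names and sample values are hypothetical.

```python
def mae_and_coverage(predictions, truth):
    """predictions maps item -> predicted rating, or None when no prediction was possible."""
    errors = [abs(p - truth[i]) for i, p in predictions.items() if p is not None]
    coverage = len(errors) / len(predictions)
    mae = sum(errors) / len(errors) if errors else float("nan")
    return mae, coverage

def top_n_precision(predictions, truth, threshold=4):
    """Share of items predicted as recommendable that the user actually rated >= threshold."""
    recommended = [i for i, p in predictions.items() if p is not None and p >= threshold]
    if not recommended:
        return float("nan")
    hits = sum(1 for i in recommended if truth[i] >= threshold)
    return hits / len(recommended)

# Hypothetical held-out ratings for one user.
predictions = {"a": 4.5, "b": 3.0, "c": None, "d": 4.2}
truth = {"a": 5, "b": 2, "c": 4, "d": 3}
example_mae, example_coverage = mae_and_coverage(predictions, truth)
example_precision = top_n_precision(predictions, truth, threshold=4)
```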
Results. Prediction MAE
Setting our parameters to $\tau$ = 10 and $\delta$ = 0.01, we obtain a MAE of 0.781 and a coverage of 97.7%
expert-CF yields a significant accuracy improvement with respect to using the experts’ average
Accuracy is worse than standard CF (with better coverage)
Role of Thresholds
MAE decreases as the similarity threshold ($\delta$) increases until the 0.06 mark, where it starts to increase as we move to higher values.
below 0.0 it degrades rapidly: too many experts
Coverage decreases as we increase $\delta$.
For the optimal MAE point of 0.06, coverage is still above 70%.
MAE as a function of the confidence threshold ($\tau$) for $\delta$ = 0.0 and $\delta$ = 0.01 (optimal around 9)
Comparison to standard CF
Standard NN CF has a MAE around 10% lower, but its coverage is also 10% lower
Expert-CF only works worse for the 10% of the users with lower MAE
Results 2. Top-N Precision
Precision of the Top-N Recommendations as a function of the “recommendable” threshold
For a threshold of 4, NN-CF outperforms expert-based but if we lower it to 3 they are almost equal
Conclusions
Different approach to the Recommendation problem
At least in some conditions, users prefer recommendations from similar experts over those from similar users.
Expert-based CF has the potential to address many of standard CF's shortcomings
Future/Current Work
We are currently exploring its performance in other domains and implementing a distributed expert-based CF application (work with Jae-Wook Ahn, Pittsburgh U.)
Adaptive Data Sources
Collaborative Filtering With Adaptive Information Sources (ITWP @ IJCAI)
With Neal Lathia, UCL (London)
[Diagram labels: user modeling; experts? friends? like-minded?; similarity, trust, reputation]
Adaptive Data sources
Given
a simple, un-tuned, kNN predictor and multiple information sources
A problem
users are subjective, accuracy varies with source
A promise
optimal classification of users to best source produces incredibly accurate predictions
Adaptive Data sources
A question
how to classify users to source set?
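To make the "promise" concrete, the sketch below shows the oracle case: if the per-user error of each source were known, every user could be assigned to the source that works best for them. The source names and error values are invented for illustration only, and the open question on the slide (how to make this assignment without knowing the errors in advance) is not addressed here.

```python
def best_source_per_user(errors):
    """errors maps user -> {source: MAE}; returns user -> source with the lowest MAE."""
    return {user: min(by_source, key=by_source.get) for user, by_source in errors.items()}

def mean_error(errors, assignment):
    """Average MAE when each user is served by the assigned source."""
    return sum(errors[u][s] for u, s in assignment.items()) / len(assignment)

# Invented per-user errors for three hypothetical information sources.
errors = {
    "u1": {"experts": 0.70, "friends": 0.90, "like-minded": 0.80},
    "u2": {"experts": 0.95, "friends": 0.60, "like-minded": 0.75},
}
assignment = best_source_per_user(errors)
oracle_mae = mean_error(errors, assignment)
# Best result achievable when all users share one source.
single_source_mae = min(
    mean_error(errors, {u: s for u in errors}) for s in ("experts", "friends", "like-minded")
)
```

In this toy example the per-user assignment beats the best single source, which is the effect the slide describes.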
Rate it Again
Increasing Recommendation Accuracy by User re-Rating (Recsys 09, NY, October 09)
Rate it again
By asking users to rate items again we can remove noise in the dataset
Improvements of up to 14% in accuracy!
Because we don't want all users to re-rate all items we design ways to do partial denoising
Data-dependent: only denoise extreme ratings
User-dependent: detect “noisy” users
Conclusions
For many applications such as Recommender Systems (but also Search, Advertising, and even Networks) understanding data and users is vital
Algorithms can only be as good as the data they use as input
The use of User Mining techniques is going to be a growing trend in many areas in the coming years