Information overload
“People read around 10 MB worth of material a day, hear 400 MB a day, and see one MB of information every second” The Economist, November 2006
Tell me what you like...
Tell me what you like and I will tell you who you are
Tell me who you know and I will tell you what you like
Tell me what you have and I will tell you what you need
The value of recommendations
Netflix: 2/3 of the movies rented were recommended
Google News: recommendations generate 38% more clickthrough
Amazon: 35% sales from recommendations
Choicestream: 28% of the people would buy more music if they found what they liked.
02 Recommender Systems
The “Recommender problem”
Estimate a utility function that automatically predicts how much a user will like an item that is unknown to her. Based on:
Past behavior
Relations to other users
Item similarity
...
Approaches to Recommendation
Collaborative Filtering
Recommend items based only on how other users have previously rated those items
User-based
Find users similar to me and recommend what those users liked
Item-based
Find items similar to those that I have previously liked
Content-based
Recommend based on features inherent to the items
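The item-based variant described above can be sketched roughly as follows. This is a minimal illustration, not a production recommender: the toy `ratings` data, the function names, and the choice of adjusted cosine similarity (ratings centered on each user's mean) are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1, "D": 2},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 1},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 4},
    "dan": {"A": 5, "C": 2},
}

def user_mean(user):
    """Average rating given by one user."""
    values = ratings[user].values()
    return sum(values) / len(values)

def centered_column(item):
    """Ratings of `item`, each minus the rating user's mean, keyed by user."""
    return {u: r[item] - user_mean(u) for u, r in ratings.items() if item in r}

def similarity(item_a, item_b):
    """Adjusted cosine similarity over users who rated both items."""
    a, b = centered_column(item_a), centered_column(item_b)
    common = a.keys() & b.keys()
    num = sum(a[u] * b[u] for u in common)
    den = sqrt(sum(a[u] ** 2 for u in common)) * sqrt(sum(b[u] ** 2 for u in common))
    return num / den if den else 0.0

def predict(user, item):
    """Similarity-weighted average of the user's own ratings of similar items."""
    neighbours = [(similarity(item, j), r) for j, r in ratings[user].items() if j != item]
    positive = [(s, r) for s, r in neighbours if s > 0]
    if not positive:
        return user_mean(user)  # fall back when no similar item was rated
    return sum(s * r for s, r in positive) / sum(s for s, _ in positive)
```

In this toy data, item B is rated much like A and item D much like C, so `predict("dan", "B")` follows dan's high rating of A while `predict("dan", "D")` follows his low rating of C.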
What works?
What works clearly depends on the domain of the recommender: Domain-specific modeling
However, in the general case the best single approach has (so far) proven to be item-based collaborative filtering.
Other approaches can be hybridized to improve results in specific cases (cold-start problem...)
03 The Netflix Prize
The Netflix Prize
500,000 users * 17,000 movie titles = 100M ratings = $1M (if you “only” improve the existing system by 10%, from 0.95 to 0.85 RMSE!)
This is what Netflix thinks a 10% improvement is worth for their business
29K contestants on 23K teams from 165 countries.
19K valid submissions from 2700 teams; 59 submissions in the “last 24 hours”
The Netflix Prize
First conclusion: it is extremely simple to reach “reasonable” recommendations and extremely difficult to improve on them.
The Netflix Prize
(Apart from the extremely unlikely possibility of getting the $1M) it is a great source of data and measurable improvement.
100M ratings from 1 to 5
Measure of success: RMSE
Most successful teams are using item-based collaborative filtering and some sort of matrix factorization (such as SVD) and...
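For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. A minimal sketch follows; the function names and the sample numbers are illustrative assumptions, and only the 0.95 and 0.85 figures come from the slides.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))

def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse improves on baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# Made-up predictions against made-up true ratings
error = rmse([4, 3, 5], [5, 3, 3])

# The rounded figures quoted for the prize: 0.95 down to 0.85
gain = relative_improvement(0.95, 0.85)
```

With the rounded figures, `gain` comes out slightly above 0.10, which is consistent with the 10% target.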
The Netflix Prize
Currently the leader is at 8.5% improvement (blending 107 individual predictors using all sorts of techniques)
Many teams are merging
04 The Sparsity Problem
The Sparsity Problem
If you represent the Netflix rating data in a User/Movie matrix you get...
500,000 x 17,000 = 8,500 M positions
Out of which only 100M (about 1.2%) are not 0's!
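The arithmetic above, and one simple way to store only the known ratings, can be sketched like this. The rounded sizes are the ones quoted on the slide; the `observed` dictionary and `get_rating` helper are hypothetical names used only for illustration.

```python
# Rounded sizes quoted on the slide
n_users = 500_000
n_movies = 17_000
n_ratings = 100_000_000

cells = n_users * n_movies    # positions in the full User/Movie matrix
density = n_ratings / cells   # fraction of positions that hold a rating

# Store only the observed entries, keyed by (user, movie).
# An absent key means "not rated", which is different from a rating of 0.
observed = {(0, 10): 4, (0, 42): 5, (3, 10): 1}

def get_rating(user, movie):
    """Return the stored rating, or None if the user has not rated the movie."""
    return observed.get((user, movie))
```

Keeping "missing" distinct from 0 matters because a rating of 0 would otherwise be read as a strong dislike.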
Methods of dimensionality reduction
Matrix Factorization
Clustering
Projection (PCA ...)
Dimensionality Reduction
Matrix Factorization
This is so far the “winning horse”
In particular the Singular Value Decomposition method (Simon Funk's modified SVD)
Clustering
Similar results can be obtained, but at a higher computational cost (so far many “traditional” algorithms such as k-NN have been tried with varying results).
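A rough sketch of matrix factorization trained by stochastic gradient descent, in the spirit of the SVD-style method mentioned above. It is not Simon Funk's exact procedure (which, among other details, trains features one at a time); it updates all factors together for brevity. The toy data, the hyperparameters (`k`, `lr`, `reg`, `epochs`) and all function names are assumptions, and predictions are not clipped to the 1 to 5 range.

```python
import random
from math import sqrt

# Hypothetical toy data: (user index, item index, rating on a 1-5 scale)
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1), (0, 3, 2),
    (1, 0, 4), (1, 1, 5), (1, 2, 2), (1, 3, 1),
    (2, 0, 1), (2, 1, 2), (2, 2, 5), (2, 3, 4),
    (3, 0, 5), (3, 2, 2),
]

def train(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Learn k factors per user and per item from the known ratings only."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def predict(model, u, i):
    """Global mean plus the dot product of user and item factors."""
    mu, P, Q = model
    return mu + sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(model, ratings):
    """Root mean squared error of the model on a list of ratings."""
    return sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

model = train(RATINGS, n_users=4, n_items=4)
```

The model never touches the empty positions of the matrix during training, which is what makes this family of methods practical on data as sparse as the Netflix set.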
Our approach to Dimensionality Reduction
We are experimenting with message-passing clustering algorithms
Affinity Propagation (Frey & Dueck, Science, February 2007)
But wait... Is this all about tweaking algorithms?
05 Working with the data
What about the data?
Data massaging
Denoising – can we remove outliers and/or estimate noise?
We are working on estimating noise inherent to the absolute quantized rating system.
Remove global effects
User tendencies (e.g. to rate higher than others)
Movie tendencies
Cross tendencies (movie vs. time...)
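One simple way to remove the first two kinds of global effects listed above is to subtract a global mean, then a per-user offset, then a per-movie offset, and keep the residuals for later modeling. The sketch below is an assumption about how this could be done, not the authors' procedure; it skips time effects and any shrinkage of offsets for users or movies with few ratings, and all names are hypothetical.

```python
from collections import defaultdict

def remove_global_effects(ratings):
    """Split ratings into global mean, user offsets, movie offsets and residuals.

    `ratings` is a list of (user, movie, rating) triples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # User tendency: how far a user's ratings sit from the global mean on average
    by_user = defaultdict(list)
    for user, _, r in ratings:
        by_user[user].append(r - mu)
    b_user = {u: sum(d) / len(d) for u, d in by_user.items()}

    # Movie tendency: estimated on what is left after the user offset
    by_movie = defaultdict(list)
    for user, movie, r in ratings:
        by_movie[movie].append(r - mu - b_user[user])
    b_movie = {m: sum(d) / len(d) for m, d in by_movie.items()}

    residuals = [(u, m, r - mu - b_user[u] - b_movie[m]) for u, m, r in ratings]
    return mu, b_user, b_movie, residuals

# Hypothetical toy data: u1 rates one point above average, m1 is liked more than m2
toy = [("u1", "m1", 5), ("u1", "m2", 4), ("u2", "m1", 3), ("u2", "m2", 2)]
mu, b_user, b_movie, residuals = remove_global_effects(toy)
```

In this toy case the offsets explain the ratings completely, so every residual is zero; on real data the residuals are what the collaborative filtering model is left to explain.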
Approaching the sparsity problem
A different (although complementary) approach to reducing data sparsity deals with trying to improve the data set.
2 possibilities
Content-based approach
“Group” similar items because they share similar important features (such as genre or director in films) to reduce dimensions
Add editorial data from external sources
User-based approach
Are there users “out there” that can provide missing data?
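As an illustration of the content-based grouping idea, the sketch below collapses a user/movie rating list into a much smaller user/genre table by averaging. The movie ids, the genre labels and the `collapse_by_genre` helper are hypothetical, and averaging is only one of several ways the grouping could be done.

```python
from collections import defaultdict

# Hypothetical editorial metadata: movie id -> genre
genre_of = {"m1": "sci-fi", "m2": "sci-fi", "m3": "romance"}

# Hypothetical (user, movie, rating) triples
ratings = [
    ("ann", "m1", 5),
    ("ann", "m2", 4),
    ("ann", "m3", 2),
    ("bob", "m1", 3),
]

def collapse_by_genre(ratings, genre_of):
    """Average each user's ratings per genre, shrinking movie columns to genre columns."""
    buckets = defaultdict(list)
    for user, movie, rating in ratings:
        buckets[(user, genre_of[movie])].append(rating)
    return {key: sum(values) / len(values) for key, values in buckets.items()}

by_genre = collapse_by_genre(ratings, genre_of)
```

Three movie columns become two genre columns here; with thousands of titles and a few dozen genres the resulting matrix is far denser, at the cost of losing per-title detail.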
User-oriented data approach
Adding “expert” users might help in clustering the data set
We are crawling the web to find complementary information for users such as critics or others coming from services similar to Netflix
[Diagram: Recommendation Systems. Labels: Multimedia Entertainment, E-commerce, Social Networking, News/Blogs/Portals, Communities; Platform, Products and Services, Commercialization; Content, Packaging and Design, Devices, Access, Commercialization, Customers]
07 Conclusions
Key technology in future years
Many areas to improve and large unexplored research field
Area related to many traditional disciplines: Computer Science, Statistics, Economics, Sociology...