The document discusses building a recommender system using collaborative filtering approaches. It describes collecting usage and rating data, calculating item-item and user-user similarities, making predictions for unknown values using k-nearest neighbors, and evaluating the system using measures like precision, recall and root mean squared error. Implementation details like programming languages, databases and cloud infrastructure are also summarized.
11. Data
what do we have?
Usage (implicit) vs. Ratings (explicit)
• Noisy vs. Accurate
• Only positive feedback vs. Positive and negative feedback
• Easy to collect vs. Hard to collect
12. Data
what do we use?
• Active users (Tracker activity in the past month): ~9,000
• Actively used software items (in the past month): ~10,000
• We calculate recommendations separately for each OS and for Web applications
13. Recommender system methods
Collaborative recommendations: The user will be
recommended items that people with similar tastes and
preferences liked (used) in the past
• Item-based collaborative filtering
• User-based collaborative filtering (we use it only for calculating user similarities, to find people like you)
• Combining both methods
22. K-nearest neighbor approach
• Performance vs quality
• We take only the ‘K’ most similar items (say 4)
• Space complexity: O(m + Kn)
• Computational complexity: O(m + n²)
[Figure: Gmail similarities (example values: 0.6, 0.8, 0.4, 0.4, 0.3, 0.3); only the K most similar items are kept as neighbors]
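The neighbor-truncation step above can be sketched in Ruby (a minimal illustration, not Wakoopa's production code; the item names are invented, only the similarity values come from the figure):

```ruby
# Keep only the K most similar items as neighbors (K-nearest neighbor truncation).
# Item names are hypothetical; similarities are the example values from the figure.
def k_nearest_neighbors(similarities, k)
  similarities.max_by(k) { |_item, sim| sim }.to_h
end

gmail_similarities = {
  "Thunderbird" => 0.6, "Outlook" => 0.8, "Mail.app" => 0.4,
  "Sparrow"     => 0.4, "Mutt"    => 0.3, "Pine"     => 0.3
}

neighbors = k_nearest_neighbors(gmail_similarities, 4)
```

Truncating to K neighbors is what brings the space complexity down to O(m + Kn): only K similarity entries are stored per item.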
23. Calculate the predicted value for Gmail
Gmail similarities   User usage
0.6                  1
0.8                  1
0.4                  1
0.4                  1
24. Calculate the predicted value for Gmail
Gmail similarities   User usage
0.6                  0.9
0.8                  0.8
0.4                  0.6
0.4                  0.2
Usage correction: more usage results in a higher score in [0,1]
25. Calculate the predicted value for Gmail
Gmail similarities   User usage
0.6                  0.9
0.8                  0.8
0.4                  0.6
0.4                  0.2

((0.6 * 0.9) + (0.8 * 0.8) + (0.4 * 0.6)) / (0.6 + 0.8 + 0.4 + 0.4) = 0.82
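The prediction step can be sketched in Ruby as a similarity-weighted average (a minimal sketch, assuming the prediction is Σ(sim · usage) / Σ(sim) over all K neighbors; not Wakoopa's production code):

```ruby
# Predict a score for an item the user hasn't used, as the similarity-weighted
# average of the user's corrected usage scores on the item's K nearest neighbors.
def predict(neighbor_pairs)
  num   = neighbor_pairs.sum { |sim, usage| sim * usage }
  denom = neighbor_pairs.sum { |sim, _usage| sim }
  denom.zero? ? 0.0 : num / denom
end

# (similarity, corrected usage) pairs for Gmail's four neighbors, from the table above
pairs = [[0.6, 0.9], [0.8, 0.8], [0.4, 0.6], [0.4, 0.2]]
score = predict(pairs)
```

Dividing by the sum of the similarities keeps the predicted score in the same [0,1] range as the corrected usage values.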
26. Calculate the predicted value for Gmail
• User feedback
• Contacts usage
• Commercial vs Free

Gmail similarities   User usage
0.6                  0.9
0.8                  0.8
0.4                  0.6
0.4                  0.2

((0.6 * 0.9) + (0.8 * 0.8) + (0.4 * 0.6)) / (0.6 + 0.8 + 0.4 + 0.4) = 0.82
27. Calculate all unknown values and
show the Top-N recommendations to each user
[User-item matrix: rows are Users, columns are Software items; 1 = the user uses the item, ? = an unknown value to predict]
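Filling the unknown cells and showing Top-N can be sketched as follows (a toy example with invented item names and scores, assuming each ? cell has already been filled by the prediction step):

```ruby
# Given one user's predicted scores for unknown items, return the N items
# with the highest predicted scores as that user's recommendations.
def top_n(predictions, n)
  predictions.max_by(n) { |_item, score| score }.map(&:first)
end

# Hypothetical predicted scores for one user's unknown cells
predictions = { "Gmail" => 0.82, "Skype" => 0.41, "Spotify" => 0.67, "Picasa" => 0.12 }
recs = top_n(predictions, 2)
```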
28. Explainability
Why did I get this recommendation?
• Overlap between the item’s (K) neighbors and your usage
30. Applying inverse user frequency
log(n/ni): ni is the number of users that use item i and n is the total number of users in the database
[Example: usage vectors weighted by inverse user frequency]
0.1 0.2 0   0.4 0   0.4 0
0.1 0.2 0.6 0   0.8 0   0
0.1 0.2 0   0.4 0   0.4 0
0.1 0.2 0.6 0.4 0.8 0.4 0
Cosine Similarity(Coen, Menno):
0 0.2 0.6 0.4 0 0.4 0.2
0 0.2 0   0.4 0 0   0.2
The fact that you both use TextMate tells you more than the fact that you both use Firefox
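The effect can be sketched in Ruby (a minimal illustration; the total user count and per-item user counts are invented, only the log(n/ni) weighting comes from the slide):

```ruby
# Inverse user frequency: items used by fewer users get a higher weight, so
# sharing a rare item (TextMate) moves two users closer together than sharing
# a ubiquitous one (Firefox). All counts below are hypothetical.
def iuf(n_users, users_of_item)
  Math.log(n_users.to_f / users_of_item)
end

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

n = 1000                                    # total users (hypothetical)
w_firefox  = iuf(n, 900)                    # ~0.1: almost everyone uses it
w_textmate = iuf(n, 50)                     # ~3.0: rare, so informative

coen            = [w_firefox, w_textmate]   # Coen uses both items
shares_firefox  = [w_firefox, 0.0]          # overlaps with Coen on Firefox only
shares_textmate = [0.0, w_textmate]         # overlaps with Coen on TextMate only

sim_firefox  = cosine(coen, shares_firefox)
sim_textmate = cosine(coen, shares_textmate)
```

With these weights the TextMate-only overlap yields a far higher cosine similarity than the Firefox-only overlap.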
33. Performance
measure for success
• Cross-validation: Train-Test split (80-20)
• Precision and Recall:
- precision = size(hit set) / size(total given recs)
- recall = size(hit set) / size(test set)
• Root mean squared error (RMSE)
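The evaluation measures above can be sketched in Ruby (a minimal illustration with a hypothetical held-out test set and invented item names):

```ruby
# Precision/recall for one user: hide part of the usage data as a test set,
# recommend from the training data, and count hits.
def precision_recall(recommended, test_set)
  hits = (recommended & test_set).size
  [hits.to_f / recommended.size, hits.to_f / test_set.size]
end

# RMSE between predicted and actual scores
def rmse(predicted, actual)
  Math.sqrt(predicted.zip(actual).sum { |p, a| (p - a)**2 } / predicted.size.to_f)
end

recs     = ["Gmail", "Spotify", "Skype", "Picasa"]  # Top-4 given to the user
held_out = ["Gmail", "Spotify", "Dropbox"]          # items hidden for testing
precision, recall = precision_recall(recs, held_out)
```

Here 2 of the 4 recommendations are hits, so precision is 2/4 and recall is 2/3.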
34. Implementation
• Ruby Enterprise Edition (improved garbage collection)
• MySQL database
• Built our own C libraries
• Amazon EC2:
- Low cost
- Flexibility
- Ease of use
• Open source
35. Future challenges
• What is the best algorithm for Wakoopa? (or you)
• Reducing space-time complexity (scalability):
- Parallelization (Clojure)
- Distributed computing (Hadoop)