The document discusses building a recommender system using collaborative filtering approaches. It describes collecting usage and rating data, calculating item-item and user-user similarities, making predictions for unknown values using k-nearest neighbors, and evaluating the system using measures like precision, recall and root mean squared error. Implementation details like programming languages, databases and cloud infrastructure are also summarized.
11. Data
what do we have?
Usage (implicit) vs. Ratings (explicit)
• Noisy vs. Accurate
• Only positive feedback vs. Positive and negative feedback
• Easy to collect vs. Hard to collect
12. Data
what do we use?
• Active users (Tracker activity in the past month): ~9,000
• Actively used software items (in the past month): ~10,000
• We calculate recommendations separately for each OS and for Web applications
13. Recommender system methods
Collaborative recommendations: The user will be
recommended items that people with similar tastes and
preferences liked (used) in the past
• Item-based collaborative filtering
• User-based collaborative filtering (we use it only for calculating user similarities, to find people like you)
• Combining both methods
22. K-nearest neighbor approach
• Performance vs quality
• We take only the ‘K’ most similar items (say 4)
• Space complexity: O(m + Kn)
• Computational complexity: O(m + n²)
[Figure: Gmail similarities (example values: 0.6, 0.8, 0.4, 0.4, 0.3, 0.3); only the K most similar items are kept as neighbors]
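The neighbor-truncation step above can be sketched in Ruby (a minimal illustration, not Wakoopa's production code; the item names are invented, only the similarity values come from the figure):

```ruby
# Keep only the K most similar items as neighbors (K-nearest neighbor truncation).
# Item names are hypothetical; similarities are the example values from the figure.
def k_nearest_neighbors(similarities, k)
  similarities.max_by(k) { |_item, sim| sim }.to_h
end

gmail_similarities = {
  "Thunderbird" => 0.6, "Outlook" => 0.8, "Mail.app" => 0.4,
  "Sparrow"     => 0.4, "Mutt"    => 0.3, "Pine"     => 0.3
}

neighbors = k_nearest_neighbors(gmail_similarities, 4)
```

Truncating to K neighbors is what brings the space complexity down to O(m + Kn): only K similarity entries are stored per item.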
23. Calculate the predicted value for Gmail
Gmail similarities   User usage
0.6                  1
0.8                  1
0.4                  1
0.4                  1
24. Calculate the predicted value for Gmail
Gmail similarities   User usage
0.6                  0.9
0.8                  0.8
0.4                  0.6
0.4                  0.2
Usage correction: more usage results in a higher score in [0,1]
25. Calculate the predicted value for Gmail
Gmail similarities   User usage
0.6                  0.9
0.8                  0.8
0.4                  0.6
0.4                  0.2

((0.6 * 0.9) + (0.8 * 0.8) + (0.4 * 0.6)) / (0.6 + 0.8 + 0.4 + 0.4) = 0.82
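The prediction step can be sketched in Ruby as a similarity-weighted average (a minimal sketch, assuming the prediction is Σ(sim · usage) / Σ(sim) over all K neighbors; not Wakoopa's production code):

```ruby
# Predict a score for an item the user hasn't used, as the similarity-weighted
# average of the user's corrected usage scores on the item's K nearest neighbors.
def predict(neighbor_pairs)
  num   = neighbor_pairs.sum { |sim, usage| sim * usage }
  denom = neighbor_pairs.sum { |sim, _usage| sim }
  denom.zero? ? 0.0 : num / denom
end

# (similarity, corrected usage) pairs for Gmail's four neighbors, from the table above
pairs = [[0.6, 0.9], [0.8, 0.8], [0.4, 0.6], [0.4, 0.2]]
score = predict(pairs)
```

Dividing by the sum of the similarities keeps the predicted score in the same [0,1] range as the corrected usage values.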
26. Calculate the predicted value for Gmail
• User feedback
• Contacts usage
• Commercial vs Free

Gmail similarities   User usage
0.6                  0.9
0.8                  0.8
0.4                  0.6
0.4                  0.2

((0.6 * 0.9) + (0.8 * 0.8) + (0.4 * 0.6)) / (0.6 + 0.8 + 0.4 + 0.4) = 0.82
27. Calculate all unknown values and
show the Top-N recommendations to each user
[User-item matrix: rows are Users, columns are Software items; 1 = the user uses the item, ? = an unknown value to predict]
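Filling the unknown cells and showing Top-N can be sketched as follows (a toy example with invented item names and scores, assuming each ? cell has already been filled by the prediction step):

```ruby
# Given one user's predicted scores for unknown items, return the N items
# with the highest predicted scores as that user's recommendations.
def top_n(predictions, n)
  predictions.max_by(n) { |_item, score| score }.map(&:first)
end

# Hypothetical predicted scores for one user's unknown cells
predictions = { "Gmail" => 0.82, "Skype" => 0.41, "Spotify" => 0.67, "Picasa" => 0.12 }
recs = top_n(predictions, 2)
```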
28. Explainability
Why did I get this recommendation?
• Overlap between the item’s (K) neighbors and your usage
30. Applying inverse user frequency
log(n/ni): ni is the number of users that use item i and n is the total number of users in the database
[Example: usage vectors weighted by inverse user frequency]
0.1 0.2 0   0.4 0   0.4 0
0.1 0.2 0.6 0   0.8 0   0
0.1 0.2 0   0.4 0   0.4 0
0.1 0.2 0.6 0.4 0.8 0.4 0
Cosine Similarity(Coen, Menno):
0 0.2 0.6 0.4 0 0.4 0.2
0 0.2 0   0.4 0 0   0.2
The fact that you both use TextMate tells you more than the fact that you both use Firefox
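The effect can be sketched in Ruby (a minimal illustration; the total user count and per-item user counts are invented, only the log(n/ni) weighting comes from the slide):

```ruby
# Inverse user frequency: items used by fewer users get a higher weight, so
# sharing a rare item (TextMate) moves two users closer together than sharing
# a ubiquitous one (Firefox). All counts below are hypothetical.
def iuf(n_users, users_of_item)
  Math.log(n_users.to_f / users_of_item)
end

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

n = 1000                                    # total users (hypothetical)
w_firefox  = iuf(n, 900)                    # ~0.1: almost everyone uses it
w_textmate = iuf(n, 50)                     # ~3.0: rare, so informative

coen            = [w_firefox, w_textmate]   # Coen uses both items
shares_firefox  = [w_firefox, 0.0]          # overlaps with Coen on Firefox only
shares_textmate = [0.0, w_textmate]         # overlaps with Coen on TextMate only

sim_firefox  = cosine(coen, shares_firefox)
sim_textmate = cosine(coen, shares_textmate)
```

With these weights the TextMate-only overlap yields a far higher cosine similarity than the Firefox-only overlap.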
33. Performance
measure for success
• Cross-validation: Train-Test split (80-20)
• Precision and Recall:
- precision = size(hit set) / size(total given recs)
- recall = size(hit set) / size(test set)
• Root mean squared error (RMSE)
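The evaluation measures above can be sketched in Ruby (a minimal illustration with a hypothetical held-out test set and invented item names):

```ruby
# Precision/recall for one user: hide part of the usage data as a test set,
# recommend from the training data, and count hits.
def precision_recall(recommended, test_set)
  hits = (recommended & test_set).size
  [hits.to_f / recommended.size, hits.to_f / test_set.size]
end

# RMSE between predicted and actual scores
def rmse(predicted, actual)
  Math.sqrt(predicted.zip(actual).sum { |p, a| (p - a)**2 } / predicted.size.to_f)
end

recs     = ["Gmail", "Spotify", "Skype", "Picasa"]  # Top-4 given to the user
held_out = ["Gmail", "Spotify", "Dropbox"]          # items hidden for testing
precision, recall = precision_recall(recs, held_out)
```

Here 2 of the 4 recommendations are hits, so precision is 2/4 and recall is 2/3.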
34. Implementation
• Ruby Enterprise Edition (improved garbage collection)
• MySQL database
• Built our own C libraries
• Amazon EC2:
- Low cost
- Flexibility
- Ease of use
• Open source
35. Future challenges
• What is the best algorithm for Wakoopa? (or you)
• Reducing space-time complexity (scalability):
- Parallelization (Clojure)
- Distributed computing (Hadoop)