Mahout Taste Engine

Published in: Technology, Education

  1. - S.V.Giri
  2. Mahout provides implementations of scalable machine learning algorithms. -- Wikipedia
     Machine learning algorithms:
     - Collaborative filtering
     - Clustering
     - Classification
     - Dimensionality reduction
     - Anomaly detection
  3. Similarity: number of common movies between users
     SIM(US1, US2) = 0, SIM(US1, US3) = 3
     Threshold for similarity
     Drawback: the more movies a user watches, the more similar that user appears to everyone else.
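
The common-movie count above can be sketched as a set intersection. The watch lists below are hypothetical (the slide's underlying data is not in the transcript); they are chosen only so that the counts match SIM(US1, US2) = 0 and SIM(US1, US3) = 3. The Tanimoto coefficient, one of the measures listed on the next slide, is included as one possible way to soften the heavy-watcher bias by dividing by the size of the union.

```python
def common_count(a, b):
    """Number of movies that both users have watched."""
    return len(set(a) & set(b))


def tanimoto(a, b):
    """Shared movies divided by all movies either user watched."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)


# Hypothetical watch lists, not taken from the slides.
watched = {
    "US1": {"A", "B", "C"},
    "US2": {"D", "E"},
    "US3": {"A", "B", "C", "D"},
}

print(common_count(watched["US1"], watched["US2"]))  # 0
print(common_count(watched["US1"], watched["US3"]))  # 3
print(tanimoto(watched["US1"], watched["US3"]))      # 0.75
```
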
  4. Similarity measures:
     - Cosine Similarity
     - Tanimoto Coefficient
     - Pearson Correlation Coefficient
     - Euclidean Distance
     - LogLikelihood Similarity
     - Spearman Rank Correlation
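  5. - A measure of similarity between 2 vectors
     - Values from 0 to 1 (when all components are non-negative, as with ratings)
     $$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$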
  6. Cos(US1,US2) = (5*0 + 4*2 + 0*4 + 1*5) / (8.19*6.71) = 0.24
     Cos(US1,US3) = (5*5 + 4*5 + 5*5 + 0*2 + 1*1) / (8.19*8.94) = 0.97
     Cos(US2,US4) = (0*1 + 4*4 + 5*3) / (6.71*5.10) = 0.91
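
A minimal sketch of the user-to-user cosine computation, using the rating table that appears on slide 20. It assumes, as the worked numbers seem to, that a missing rating ("-") counts as 0 in the dot product and is left out of the norm. The dictionary layout and function name are illustrative choices, not Mahout's API.

```python
import math

# Ratings from the slide-20 table; a missing key means "not rated".
ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (missing = 0)."""
    dot = sum(r * v.get(item, 0) for item, r in u.items())
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


for a, b in [("US1", "US2"), ("US1", "US3"), ("US2", "US4")]:
    print(a, b, round(cosine(ratings[a], ratings[b]), 2))
# US1 US2 0.24
# US1 US3 0.97
# US2 US4 0.91
```
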
  7. Netflix Prize: October 2006, 1 million dollars
     Training data set
     - Users: 480,000
     - Movies: 18,000
     - Pairs: 100 million
     - Ratings: 1 to 5
     Test data set
     - Ratings to be predicted: 1.5 million pairs
     - Metric: RMSE
     - Cinematch: 0.9514
     - Best RMSE: 0.8563 (achieved by BellKor's Pragmatic Chaos)
  8. Actual values: (us1,mv1,5) (u2,mv1,3) (u3,mv2,1) (u5,mv6,3)
     Predicted values: (us1,mv1,4) (u2,mv1,3) (u3,mv2,2) (u5,mv6,2)
     $$\mathrm{RMSE} = \sqrt{\frac{(5-4)^2 + (3-3)^2 + (1-2)^2 + (3-2)^2}{4}} = \sqrt{0.75} \approx 0.87$$
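
The RMSE arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and the plain-list inputs are illustrative, and the four value pairs are the ones given on the slide.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [5, 3, 1, 3]
predicted = [4, 3, 2, 2]
print(round(rmse(actual, predicted), 3))  # 0.866
```
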
  9. (US4, SW2) = ??
     Average of all the other users' ratings for this movie
     = (4 + 2 + 5) / 3 = round(3.67) = 4
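
A sketch of this movie-average baseline, assuming the slide-20 rating table and that the three values averaged (4, 2 and 5) are the SW2 ratings of US1, US2 and US3. The function name is illustrative.

```python
# Ratings from the slide-20 table; a missing key means "not rated".
ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def item_average(ratings, item, exclude_user):
    """Mean rating of `item` over every other user who rated it."""
    values = [
        user_ratings[item]
        for user, user_ratings in ratings.items()
        if user != exclude_user and item in user_ratings
    ]
    if not values:
        return None  # nobody else rated this item
    return sum(values) / len(values)


estimate = item_average(ratings, "SW2", "US4")
print(round(estimate, 2), round(estimate))  # 3.67 4
```

This baseline ignores who the user is, which is what the similarity-based predictions on the following slides try to improve.
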
  10. (no text on this slide)
  11. Sim(US4,US1) = 0.19
      Sim(US4,US2) = 0.91
      Sim(US4,US3) = 0.35
      US4 is most similar to US2.
      Hence Rating(US4,SW2) = Rating(US2,SW2) = 2
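
The nearest-neighbour step can be sketched as follows, assuming cosine similarity with missing ratings treated as 0 and the slide-20 rating table. `predict_from_nearest` is a hypothetical helper written for this illustration, not a Mahout call.

```python
import math

# Ratings from the slide-20 table; a missing key means "not rated".
ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (missing = 0)."""
    dot = sum(r * v.get(item, 0) for item, r in u.items())
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_from_nearest(ratings, user, item):
    """Copy the rating of the most similar user who has rated `item`."""
    candidates = [
        (cosine(ratings[user], other), name)
        for name, other in ratings.items()
        if name != user and item in other
    ]
    if not candidates:
        return None
    best_sim, best_user = max(candidates)
    return ratings[best_user][item], best_user, best_sim


rating, neighbour, sim = predict_from_nearest(ratings, "US4", "SW2")
print(rating, neighbour, round(sim, 2))  # 2 US2 0.91
```
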
  12. Sim(US5,US2) = 0.955
      Naive prediction: Rating(US5,SW2) = Rating(US2,SW2) = 2
      But Avg(US2) = 3 and Avg(US5) = 2, so US5 rates lower on average.
      Adjusted prediction: Rating(US5,SW2) = Rating(US2,SW2) + Avg(US5) - Avg(US2) = 2 + 2 - 3 = 1
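
The mean adjustment is a one-line formula; the sketch below uses the averages stated on the slide. Clamping the result to the 1 to 5 rating scale is an added assumption, not something the slide states.

```python
def mean_adjusted(neighbour_rating, neighbour_avg, target_avg, low=1, high=5):
    """Shift the neighbour's rating by the difference in user averages.

    The clamp to [low, high] is an assumption made for this sketch.
    """
    raw = neighbour_rating + target_avg - neighbour_avg
    return max(low, min(high, raw))


# Rating(US2,SW2) = 2, Avg(US2) = 3, Avg(US5) = 2
print(mean_adjusted(2, neighbour_avg=3, target_avg=2))  # 1
```
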
  13. Training data set
      - Users: 480,000
      - Movies: 18,000
      - Ratings: 100 million
      Sparse matrix
      - Possible pairings: 480,000 * 18,000 = 8.64 billion
      - Pairs present: about 1.2%
      - Best representation: (key, value) pairs
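
The sparsity figure and the (key, value) representation can be illustrated briefly. The counts are the ones on the slide; keying a dictionary by (user, movie) is one possible reading of "(Key, Value) pair", shown here with a few entries from the slide-20 table.

```python
users, movies, known_ratings = 480_000, 18_000, 100_000_000

possible_pairs = users * movies
density = known_ratings / possible_pairs
print(possible_pairs)             # 8640000000
print(round(density * 100, 2))    # 1.16 (percent)

# Sparse storage: only observed ratings are kept.
sparse = {
    ("US1", "SW1"): 5,
    ("US1", "SW2"): 4,
    ("US2", "SW2"): 2,
}
print(sparse.get(("US2", "LOTR1")))  # None, rating not present
```
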
  14. Similarity matrix computation: time complexity
      User-based similarity: for all pairs of users, compute Sim(user vector, user vector)
      - Number of users = 480,000
      - Number of user pairs = 480,000 * 480,000 = about 230 billion
      - Number of comparisons for one similarity value = 18,000
      - Total computations = 230 billion * 18,000 = about 4,140 trillion operations
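
The operation count can be reproduced exactly. The sketch also shows, as an aside not on the slide, that counting each unordered pair once (similarity is symmetric) roughly halves the work without changing the order of magnitude.

```python
users, movies = 480_000, 18_000

ordered_pairs = users * users
total_ops = ordered_pairs * movies
print(ordered_pairs)  # 230400000000      (about 230 billion)
print(total_ops)      # 4147200000000000  (about 4,147 trillion)

# Sim(a, b) == Sim(b, a), so each unordered pair is needed only once.
unordered_pairs = users * (users - 1) // 2
print(unordered_pairs)  # 115199760000
```
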
  15. Dimensionality reduction:
      - SVD (Singular Value Decomposition)
      - MinHashing
      - Locality Sensitive Hashing (LSH)
  16. User  Movie         Rating
      US1   SW1           5
      US1   SW2           4
      US1   LOTR1         5
      US1   Notting Hill  0
      US1   Mean Girls    1
      US2   SW1           0
      US2   SW2           2
      US2   LOTR1         -
      …
  17. (no text on this slide)
  18. - User based: similarity between users
      - Product based: similarity between products
      - Click based: based on user clicks/likes
      - Content based: based on tags, reviews, ratings
  19. (no text on this slide)
  20. Cos(SW1, SW2) = 0.94
      Cos(SW1, Notting Hill) = 0.33
      Cos(Mean Girls, Notting Hill) = 0.94

                    US1  US2  US3  US4
      SW1             5    0    5    1
      SW2             4    2    5    -
      LOTR1           5    -    5    -
      Notting Hill    0    4    2    4
      Mean Girls      1    5    1    3
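
Item-based similarity uses the same cosine formula on the rows of the table instead of the columns. A minimal sketch, assuming a missing rating ("-") is treated as 0; the row-as-list layout is an illustrative choice. With the table as transcribed, Cos(SW1, Notting Hill) evaluates to about 0.33.

```python
import math

# Rows of the table, in user order US1..US4; None marks a missing rating.
items = {
    "SW1": [5, 0, 5, 1],
    "SW2": [4, 2, 5, None],
    "LOTR1": [5, None, 5, None],
    "Notting Hill": [0, 4, 2, 4],
    "Mean Girls": [1, 5, 1, 3],
}


def item_cosine(a, b):
    """Cosine similarity of two item rows, treating None as 0."""
    x = [v or 0 for v in a]
    y = [v or 0 for v in b]
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


print(round(item_cosine(items["SW1"], items["SW2"]), 2))                  # 0.94
print(round(item_cosine(items["SW1"], items["Notting Hill"]), 2))         # 0.33
print(round(item_cosine(items["Mean Girls"], items["Notting Hill"]), 2))  # 0.94
```
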
  21. The Firm ∼ The Rainmaker
      The Bourne Identity ∼ The Bourne Ultimatum
      - Uniform weight
      - Weighted parameters

                            Author         Category  Year
      The Firm              John Grisham   Thriller  1991
      The Bourne Identity   Robert Ludlum  Thriller  1980
      The Bourne Ultimatum  Robert Ludlum  Thriller  1990
      The Rainmaker         John Grisham   Thriller  1995
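
One way to read "uniform weight" versus "weighted parameters" is as a weighted share of matching attributes. The sketch below is an assumption about what the slide intends: the exact-match rule and the specific weights are hypothetical, and only the attribute table comes from the slide.

```python
books = {
    "The Firm": {"author": "John Grisham", "category": "Thriller", "year": 1991},
    "The Bourne Identity": {"author": "Robert Ludlum", "category": "Thriller", "year": 1980},
    "The Bourne Ultimatum": {"author": "Robert Ludlum", "category": "Thriller", "year": 1990},
    "The Rainmaker": {"author": "John Grisham", "category": "Thriller", "year": 1995},
}


def attribute_similarity(a, b, weights):
    """Weighted fraction of attributes on which two items agree exactly."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items() if a[attr] == b[attr])
    return matched / total


uniform = {"author": 1, "category": 1, "year": 1}
weighted = {"author": 3, "category": 1, "year": 0.5}  # hypothetical weights

firm, rainmaker = books["The Firm"], books["The Rainmaker"]
bourne = books["The Bourne Identity"]
print(round(attribute_similarity(firm, rainmaker, uniform), 2))   # 0.67
print(round(attribute_similarity(firm, bourne, uniform), 2))      # 0.33
print(round(attribute_similarity(firm, rainmaker, weighted), 2))  # 0.89
print(round(attribute_similarity(firm, bourne, weighted), 2))     # 0.22
```

Giving the author a larger weight widens the gap between same-author and different-author pairs, which is the effect the two example pairings suggest.
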
  22. Problem:
      - User reads a news article
      - Find similar news articles
      - Don't return the same news article
      How to convert a document into a vector?
      - Extract all the words
      - Remove stop words
      - Identify named entities
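
The first two steps (extract words, remove stop words) can be sketched with the standard library, producing a bag-of-words count vector that the cosine measure can compare. The stop-word list is a tiny hypothetical sample, the example sentences are made up, and named-entity identification is left out because it needs an external NLP tool.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "for"}


def to_vector(text):
    """Lowercase, split into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = to_vector("The cat sat on the mat")
doc2 = to_vector("A cat is on the mat")
print(doc1)                           # Counter({'cat': 1, 'sat': 1, 'mat': 1})
print(round(cosine(doc1, doc2), 2))   # 0.82
```

A similarity of 1.0 would indicate the same bag of words, so an upper cutoff below 1.0 is one possible way to avoid returning the article the user just read.
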
  23. New movie
      - No views (or few views)
      - No similar movies
      New user
      - No ratings (or few ratings)
      - No similar users
  24. Thank you
