Mahout Taste Engine

Transcript

  • 1. - S.V.Giri
  • 2. Mahout provides implementations of scalable machine learning algorithms (source: Wikipedia). Machine learning algorithms: collaborative filtering, clustering, classification, dimensionality reduction, anomaly detection.
  • 3. Similarity: the number of movies two users have in common. SIM(US1, US2) = 0, SIM(US1, US3) = 3. A threshold decides when two users count as similar. Drawback: the more movies a user watches, the more similar he appears to everyone else.
  • 4. Similarity measures: Cosine Similarity, Tanimoto Coefficient, Pearson Correlation Coefficient, Euclidean Distance, LogLikelihood Similarity, Spearman Rank Correlation.
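Two of the measures listed above can be sketched in plain Python. This is only an illustration: the function names and the use of `None` for a missing rating are assumptions made here, not Mahout's API.

```python
import math

def tanimoto(items_a, items_b):
    """Tanimoto coefficient: shared items divided by all items either user rated."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(x, y):
    """Pearson correlation over the positions where both users have a rating."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0  # assumption: too little overlap counts as "no correlation"
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant rating vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)

print(tanimoto({"SW1", "SW2"}, {"SW2", "LOTR1"}))
print(pearson([1, 2, 3], [2, 4, 6]))
```

Tanimoto looks only at which items were rated, while Pearson looks at how the ratings move together, so the two can rank the same pair of users quite differently.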
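  • 5. Cosine similarity: a measure of similarity between two vectors. Values range from 0 to 1 when all components are non-negative, as ratings are.
      $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$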
  • 6. Cos(US1,US2) = (5*0 + 4*2 + 0*4 + 1*5) / (8.19*6.71) = 0.24
      Cos(US1,US3) = (5*5 + 4*5 + 5*5 + 0*2 + 1*1) / (8.19*8.94) = 0.97
      Cos(US2,US4) = (0*1 + 4*4 + 5*3) / (6.71*5.10) = 0.91
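The arithmetic above can be reproduced with a short Python sketch. The ratings are the ones in the deck's example table. Counting a missing rating ("-") as 0 is an assumption, chosen because it matches the vector norms the slide uses (8.19, 6.71). The names `RATINGS` and `cosine` are illustrative only.

```python
import math

# Item order: SW1, SW2, LOTR1, Notting Hill, Mean Girls. None marks "-".
RATINGS = {
    "US1": [5, 4, 5, 0, 1],
    "US2": [0, 2, None, 4, 5],
    "US3": [5, 5, 5, 2, 1],
    "US4": [1, None, None, 4, 3],
}

def cosine(x, y):
    """Cosine similarity, counting a missing rating as 0 (assumption)."""
    x = [v or 0 for v in x]
    y = [v or 0 for v in y]
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a user with no ratings is similar to nobody
    return dot / (norm_x * norm_y)

for u, v in [("US1", "US2"), ("US1", "US3"), ("US2", "US4")]:
    print(u, v, round(cosine(RATINGS[u], RATINGS[v]), 2))
```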
  • 7. Netflix Prize: October 2006, 1 million dollar prize. Training data set: 480,000 users, 18,000 movies, 100 million (user, movie) pairs, ratings from 1 to 5. Test data set: 1.5 million pairs with ratings to be predicted. Metric: RMSE. Cinematch: 0.9514. Best RMSE: 0.8563 (achieved by BellKor's Pragmatic Chaos).
  • 8. Actual values: (us1,mv1,5) (u2,mv1,3) (u3,mv2,1) (u5,mv6,3)
      Predicted values: (us1,mv1,4) (u2,mv1,3) (u3,mv2,2) (u5,mv6,2)
      RMSE = √(((5-4)² + (3-3)² + (1-2)² + (3-2)²) / 4) = √(3/4) ≈ 0.87
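The RMSE computation above, as a minimal Python sketch. The function name `rmse` is an illustrative choice; the two lists are the actual and predicted ratings from the slide, in the same order.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [5, 3, 1, 3]
predicted = [4, 3, 2, 2]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, one prediction that is off by 2 costs as much as four predictions that are each off by 1.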
  • 9. (US4, SW2) = ? Baseline: the average of all other users' ratings for this movie = (4 + 2 + 5) / 3 = 3.67, which rounds to 4.
  • 11. Sim(US4,US1) = 0.19, Sim(US4,US2) = 0.91, Sim(US4,US3) = 0.35. US4 is most similar to US2, hence Rating(US4,SW2) = Rating(US2,SW2) = 2.
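A sketch of this nearest-neighbour step in Python, using the deck's example ratings. It assumes cosine similarity with missing ratings counted as 0, and a single nearest neighbour. The helper name `predict_from_nearest` is hypothetical.

```python
import math

ITEMS = ["SW1", "SW2", "LOTR1", "Notting Hill", "Mean Girls"]
RATINGS = {  # None marks a missing rating ("-")
    "US1": [5, 4, 5, 0, 1],
    "US2": [0, 2, None, 4, 5],
    "US3": [5, 5, 5, 2, 1],
    "US4": [1, None, None, 4, 3],
}

def cosine(x, y):
    x = [v or 0 for v in x]  # assumption: missing rating counts as 0
    y = [v or 0 for v in y]
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def predict_from_nearest(user, item):
    """Copy the rating of the most similar user who has rated the item."""
    idx = ITEMS.index(item)
    candidates = [
        (cosine(RATINGS[user], RATINGS[other]), other)
        for other in RATINGS
        if other != user and RATINGS[other][idx] is not None
    ]
    if not candidates:
        return None  # nobody else rated this item
    similarity, neighbour = max(candidates)
    return RATINGS[neighbour][idx], neighbour, similarity

print(predict_from_nearest("US4", "SW2"))
```

Practical recommenders usually average over several neighbours, weighted by similarity, instead of copying one rating. The single-neighbour version is kept here because it is what the slide describes.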
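  • 12. Sim(US5,US2) = 0.955, so the unadjusted estimate is Rating(US5,SW2) = Rating(US2,SW2) = 2. But Avg(US2) = 3 and Avg(US5) = 2, so correct for the difference in average rating: Rating(US5,SW2) = Rating(US2,SW2) + Avg(US5) - Avg(US2) = 2 + 2 - 3 = 1.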
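The mean adjustment on this slide is a one-line formula. The sketch below only restates it with the slide's numbers. The function name is hypothetical.

```python
def mean_adjusted_rating(neighbour_rating, neighbour_mean, user_mean):
    """Shift the neighbour's rating by the gap between the two users' averages."""
    return neighbour_rating + (user_mean - neighbour_mean)

# US2 rated SW2 as 2; Avg(US2) = 3, Avg(US5) = 2 (values from the slide).
print(mean_adjusted_rating(neighbour_rating=2, neighbour_mean=3, user_mean=2))
```

The adjustment compensates for users who rate systematically higher or lower than their neighbours. The result can fall outside the rating scale, so a real system would probably clamp it.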
  • 13. Training data set: 480,000 users, 18,000 movies, 100 million ratings. The rating matrix is sparse: possible pairings = 480,000 * 18,000 = 8.64 billion, pairs present ≈ 1.2%. Best representation: (key, value) pairs.
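A small Python sketch of the sparsity figures and of a (key, value) representation. Using a dictionary keyed by (user, item) is one possible reading of "(Key, Value) pair", and the names are illustrative.

```python
USERS = 480_000
MOVIES = 18_000
RATINGS_PRESENT = 100_000_000

possible_pairs = USERS * MOVIES
density = RATINGS_PRESENT / possible_pairs
print(f"possible pairs: {possible_pairs:,}")
print(f"fraction present: {density:.2%}")

# Sparse store: only observed ratings take up space.
sparse_ratings = {
    ("US1", "SW1"): 5,
    ("US1", "SW2"): 4,
    ("US2", "SW2"): 2,
}

def get_rating(store, user, item):
    """Return the stored rating, or None when the pair was never rated."""
    return store.get((user, item))

print(get_rating(sparse_ratings, "US1", "SW2"))
print(get_rating(sparse_ratings, "US2", "LOTR1"))
```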
  • 14. Similarity matrix computation, time complexity. User-based similarity: for all user pairs, Sim(user vector, user vector). Number of users = 480,000. Number of user pairs = 480,000 * 480,000 ≈ 230 billion. Number of comparisons for one similarity value = 18,000. Total computations = 230 billion * 18,000 = 4,140 trillion operations.
  • 15. Dimensionality reduction: SVD (Singular Value Decomposition), MinHashing, Locality-Sensitive Hashing (LSH).
  • 16. Input as (user, item, rating) triples:
        US1  SW1           5
        US1  SW2           4
        US1  LOTR1         5
        US1  Notting Hill  0
        US1  Mean Girls    1
        US2  SW1           0
        US2  SW2           2
        US2  LOTR1         -
        …
  • 18. User based: similarity between users. Product based: similarity between products. Click based: based on user clicks/likes. Content based: based on tags, reviews, ratings.
  • 20. Cos(SW1, SW2) = 0.94, Cos(SW1, Notting Hill) = 0.327, Cos(Mean Girls, Notting Hill) = 0.94
                      US1  US2  US3  US4
        SW1             5    0    5    1
        SW2             4    2    5    -
        LOTR1           5    -    5    -
        Notting Hill    0    4    2    4
        Mean Girls      1    5    1    3
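Item-based similarity uses the same cosine formula, applied to the rows of the table (one vector per movie) instead of the columns. A minimal Python sketch follows, assuming missing ratings count as 0. `most_similar_items` is a hypothetical helper.

```python
import math

# One rating vector per item; user order: US1, US2, US3, US4. None marks "-".
ITEM_RATINGS = {
    "SW1": [5, 0, 5, 1],
    "SW2": [4, 2, 5, None],
    "LOTR1": [5, None, 5, None],
    "Notting Hill": [0, 4, 2, 4],
    "Mean Girls": [1, 5, 1, 3],
}

def cosine(x, y):
    x = [v or 0 for v in x]  # assumption: missing rating counts as 0
    y = [v or 0 for v in y]
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def most_similar_items(item, top_n=2):
    """Rank the other items by cosine similarity to the given item."""
    scores = [
        (cosine(ITEM_RATINGS[item], vector), other)
        for other, vector in ITEM_RATINGS.items()
        if other != item
    ]
    return sorted(scores, reverse=True)[:top_n]

for score, other in most_similar_items("SW1"):
    print(f"{other}: {score:.3f}")
```

With this data the films group as expected: SW1 sits closest to LOTR1 and SW2, while Notting Hill and Mean Girls sit close to each other.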
  • 21. The Firm ∼ The Rainmaker; The Bourne Identity ∼ The Bourne Ultimatum. Attribute weighting: uniform weight or weighted parameters.
                                Author          Category  Year
        The Firm                John Grisham    Thriller  1991
        The Bourne Identity     Robert Ludlum   Thriller  1980
        The Bourne Ultimatum    Robert Ludlum   Thriller  1990
        The Rainmaker           John Grisham    Thriller  1995
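One way to read "uniform weight" versus "weighted parameters" is as a weighted count of matching attributes. The Python sketch below takes that reading. The function name and the weights (author 3, category 1, year 1) are hypothetical and chosen only for illustration.

```python
BOOKS = {
    "The Firm": {"author": "John Grisham", "category": "Thriller", "year": 1991},
    "The Bourne Identity": {"author": "Robert Ludlum", "category": "Thriller", "year": 1980},
    "The Bourne Ultimatum": {"author": "Robert Ludlum", "category": "Thriller", "year": 1990},
    "The Rainmaker": {"author": "John Grisham", "category": "Thriller", "year": 1995},
}
ATTRIBUTES = ["author", "category", "year"]

def attribute_similarity(a, b, weights=None):
    """Weighted share of attributes on which two items agree exactly."""
    if weights is None:
        weights = {name: 1.0 for name in ATTRIBUTES}  # uniform weighting
    total = sum(weights[name] for name in ATTRIBUTES)
    matched = sum(weights[name] for name in ATTRIBUTES if a[name] == b[name])
    return matched / total

firm, rainmaker = BOOKS["The Firm"], BOOKS["The Rainmaker"]
bourne = BOOKS["The Bourne Identity"]
author_heavy = {"author": 3.0, "category": 1.0, "year": 1.0}  # hypothetical weights

print(attribute_similarity(firm, rainmaker))                # uniform
print(attribute_similarity(firm, bourne))                   # uniform
print(attribute_similarity(firm, rainmaker, author_heavy))  # weighted
print(attribute_similarity(firm, bourne, author_heavy))     # weighted
```

Since every book in the table is a thriller, the category match alone cannot separate them. Weighting the author more heavily widens the gap between same-author and different-author pairs. Exact matching on year is crude, and a closeness score would likely work better.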
  • 22. Problem: a user reads a news article; find similar news articles, but don't return the same article. How to convert a document into a vector? Extract all the words, remove stop words, identify named entities.
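The first two steps (extract the words, remove stop words) can be sketched as a bag-of-words vector in Python. The stop-word list here is a tiny hypothetical sample, the tokenizer is a simple regular expression, and named-entity identification is left out because it needs an NLP library.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "for"}

def document_vector(text):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)

def cosine_counts(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

first = document_vector("The central bank raised interest rates on Tuesday")
second = document_vector("Interest rates rise as the bank tightens policy")
print(first)
print(cosine_counts(first, second))
```

To avoid recommending the article the user has just read, the article itself (and any exact duplicate) would be excluded from the candidate set before ranking.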
  • 23. New movie: no views (or few views), so no similar movies. New user: no ratings (or few ratings), so no similar users.
  • 24. Thank you