Mahout Taste Engine
- S.V.Giri
Apache Mahout provides implementations of scalable machine learning algorithms
-- Wikipedia
Machine Learning Algorithms
- Collaborative Filtering
- Clustering
- Classification
- Dimensionality reduction
- Anomaly detection
Similarity – number of movies that two users have in common
SIM(US1, US2) = 0, SIM(US1, US3) = 3
Threshold for similarity
The more movies a user watches, the more similar he appears to every other user
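The common-movie count can be sketched in Python as follows. The ratings are taken from the example table that appears later in the deck. The cut-off of 4 for counting a movie as "liked" is an assumption: the deck mentions a threshold but gives no value, and 4 is chosen here because it reproduces SIM(US1, US2) = 0 and SIM(US1, US3) = 3.

```python
# Ratings from the deck's example table; unrated movies are simply absent.
RATINGS = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}

LIKE_THRESHOLD = 4  # assumption: a rating of 4 or more counts as "liked"


def liked(user):
    """Set of movies the user rated at or above the threshold."""
    return {movie for movie, rating in RATINGS[user].items() if rating >= LIKE_THRESHOLD}


def common_movies(user_a, user_b):
    """Similarity as the number of liked movies the two users share."""
    return len(liked(user_a) & liked(user_b))


print(common_movies("US1", "US2"))
print(common_movies("US1", "US3"))
```

Because the score is a raw count, a user with a long viewing history overlaps with almost everyone, which is the weakness noted above.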
Cosine Similarity
Tanimoto Coefficient
Pearson Correlation Coefficient
Euclidean Distance
LogLikelihood Similarity
Spearman Rank Correlation
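Cosine similarity:
- A measure of similarity between 2 vectors
- Values from 0 to 1 for non-negative rating vectors

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$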
Cos(US1,US2) = (5*0 + 4*2 + 0*4 + 1*5) / (8.19*6.71) = 0.24
Cos(US1,US3) = (5*5 + 4*5 + 5*5 + 0*2 + 1*1) / (8.19*8.94) = 0.97
Cos(US2,US4) = (0*1 + 4*4 + 5*3) / (6.71*5.10) = 0.91
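A minimal Python sketch of the same calculation follows. It assumes each user is a vector over the five movies in the order SW1, SW2, LOTR1, Notting Hill, Mean Girls, and that a missing rating is treated as 0, which is what the sums above appear to do.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_x * norm_y)


# Order: SW1, SW2, LOTR1, Notting Hill, Mean Girls (missing ratings as 0)
US1 = [5, 4, 5, 0, 1]
US2 = [0, 2, 0, 4, 5]
US3 = [5, 5, 5, 2, 1]
US4 = [1, 0, 0, 4, 3]

for name, (a, b) in {"US1,US2": (US1, US2), "US1,US3": (US1, US3), "US2,US4": (US2, US4)}.items():
    print(name, round(cosine(a, b), 2))
```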
Netflix Prize: launched October 2006, 1 million dollar prize
Training Data Set
Users – 480,000
Movies – 18,000
Pairs – 100 Million
Ratings : 1- 5
Test Data Set
Ratings to be predicted – 1.5 Million Pairs
Metric – RMSE
Cinematch – 0.9514
Best RMSE – 0.8563 (achieved by BellKor's Pragmatic Chaos)
Actual Values –
(us1,mv1,5)(u2,mv1,3)(u3,mv2,1)(u5,mv6,3)
Predicted Values –
(us1,mv1,4)(u2,mv1,3)(u3,mv2,2)(u5,mv6,2)
RMSE = √(((5-4)² + (3-3)² + (1-2)² + (3-2)²) / 4) = √0.75
≈ 0.87
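The RMSE calculation can be sketched in Python as below, using the four (user, movie, rating) triples from this slide. The dictionary layout and the function name `rmse` are illustrative choices, not part of the deck.

```python
import math

# (user, movie) -> rating, copied from the actual and predicted values above
actual = {("us1", "mv1"): 5, ("u2", "mv1"): 3, ("u3", "mv2"): 1, ("u5", "mv6"): 3}
predicted = {("us1", "mv1"): 4, ("u2", "mv1"): 3, ("u3", "mv2"): 2, ("u5", "mv6"): 2}


def rmse(actual, predicted):
    """Root mean squared error over the pairs present in `actual`."""
    squared_errors = [(actual[key] - predicted[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(round(rmse(actual, predicted), 3))
```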
(US4, SW2) = ??
Average of all the other users' ratings for this movie
= (4 + 2 + 5) / 3 = round(3.67) = 4
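This movie-average baseline can be sketched as follows. It assumes the missing value is US4's rating for SW2 (the ratings 4, 2 and 5 match the SW2 row of the deck's example table); the helper name `item_average` is hypothetical.

```python
# Ratings from the deck's example table; unrated movies are simply absent.
RATINGS = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def item_average(movie, exclude_user):
    """Mean rating of `movie` over all users except `exclude_user`."""
    values = [
        ratings[movie]
        for user, ratings in RATINGS.items()
        if user != exclude_user and movie in ratings
    ]
    if not values:
        return None  # nobody else rated this movie
    return sum(values) / len(values)


estimate = item_average("SW2", exclude_user="US4")
print(round(estimate, 2), round(estimate))
```

The baseline ignores who the user is: every user with a missing SW2 rating would receive the same prediction.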
Sim(US4,US1) = 0.19
Sim(US4,US2) = 0.91
Sim(US4,US3)= 0.35
US4 is most similar to US2
Hence Rating(US4,SW2) = Rating(US2,SW2) = 2
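A sketch of this nearest-neighbour step, assuming cosine similarity over the deck's example table with missing ratings treated as 0. The function names are illustrative; this is not how Mahout's Taste classes are written, only the idea behind them.

```python
import math

MOVIES = ["SW1", "SW2", "LOTR1", "Notting Hill", "Mean Girls"]
RATINGS = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def vector(user):
    """Dense rating vector; a missing rating becomes 0 (assumption)."""
    return [RATINGS[user].get(movie, 0) for movie in MOVIES]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


def predict_from_nearest(user, movie):
    """Copy the rating of the most similar user who has rated `movie`."""
    candidates = [u for u in RATINGS if u != user and movie in RATINGS[u]]
    nearest = max(candidates, key=lambda u: cosine(vector(user), vector(u)))
    return nearest, RATINGS[nearest][movie]


print(predict_from_nearest("US4", "SW2"))
```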
Sim(US5,US2) = 0.955
Naive copy: Rating(US5,SW2) = Rating(US2,SW2) = 2
Avg(US2) = 3, Avg(US5) = 2
Adjusted for each user's average: Rating(US5,SW2) = Rating(US2,SW2) + Avg(US5) - Avg(US2) = 2 + 2 - 3 = 1
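The adjustment shifts the neighbour's rating by the difference between the two users' average ratings. A minimal sketch, with a hypothetical function name:

```python
def mean_adjusted_rating(neighbor_rating, neighbor_mean, user_mean):
    """Neighbour's rating, shifted by how much lower or higher the user rates on average."""
    return neighbor_rating + (user_mean - neighbor_mean)


# Values from the slide: Rating(US2,SW2) = 2, Avg(US2) = 3, Avg(US5) = 2
print(mean_adjusted_rating(neighbor_rating=2, neighbor_mean=3, user_mean=2))
```

In practice the result may need clamping to the rating scale; that step is not shown in the deck and is left out here.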
Training Data Set
Users – 480,000
Movies – 18,000
Ratings – 100 Million
Sparse Matrix
Possible pairings – 480,000 * 18,000 = 8.64 billion
Pairs present = 100 million / 8.64 billion ≈ 1.2%
Best Representation:
(Key, Value) pair
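The sparsity figure and the key-value representation can be sketched as below. A plain Python dictionary keyed by (user, movie) stands in for whatever key-value store is used in practice; that choice is an assumption for illustration.

```python
USERS = 480_000
MOVIES = 18_000
RATINGS_PRESENT = 100_000_000

possible_pairs = USERS * MOVIES
density = RATINGS_PRESENT / possible_pairs
print(possible_pairs, f"{density:.2%}")

# Sparse representation: store only the pairs that exist.
sparse_ratings = {
    ("US1", "SW1"): 5,
    ("US1", "SW2"): 4,
    ("US2", "SW2"): 2,
}


def get_rating(user, movie):
    """Return the stored rating, or None when the pair was never rated."""
    return sparse_ratings.get((user, movie))


print(get_rating("US1", "SW2"), get_rating("US2", "LOTR1"))
```

Memory then grows with the number of ratings (about 100 million) rather than with the 8.64 billion cells of the full matrix.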
Similarity Matrix Computation
Time Complexity
User based Similarity :
For every pair of users: Sim(UserVector, UserVector)
Number of users = 480,000
Number of user pairs = 480,000 * 480,000 ≈ 230 billion user pairs
Number of comparisons for one similarity value = 18,000
Total computations = 230 billion * 18,000 = 4,140 trillion operations
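The estimate can be reproduced with a few lines of Python. The sketch also counts unordered pairs, since a symmetric measure such as cosine similarity only needs each pair once; that roughly halves the work but does not change the order of magnitude.

```python
USERS = 480_000
MOVIES = 18_000

ordered_pairs = USERS * USERS               # every (a, b) combination, as on the slide
unordered_pairs = USERS * (USERS - 1) // 2  # each pair once, no self-pairs
dense_operations = ordered_pairs * MOVIES   # one comparison per movie per pair

print(f"{ordered_pairs:,}")
print(f"{unordered_pairs:,}")
print(f"{dense_operations:,}")
```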
Dimensionality reduction:
SVD (Singular Value Decomposition)
MinHashing
Locality Sensitive Hashing (LSH)
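As an illustration of one of these techniques, here is a minimal MinHash sketch in Python. It assumes each user is reduced to the set of integer movie IDs they have rated, and it uses random linear hash functions of the form (a*x + b) mod p as a common stand-in for random permutations. The number of hash functions, the seed and all names are illustrative choices, not taken from the deck or from Mahout.

```python
import random

PRIME = 2_147_483_647  # large prime used as the hash modulus


def make_hash_functions(count, seed=0):
    """Create `count` random (a, b) pairs for h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def signature(items, hash_functions):
    """MinHash signature of a non-empty set of integer IDs."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)


hashes = make_hash_functions(200)
user_a = {1, 2, 3, 4}
user_b = {3, 4, 5, 6}
print(exact_jaccard(user_a, user_b))
print(estimated_jaccard(signature(user_a, hashes), signature(user_b, hashes)))
```

Each user is compared through a fixed-length signature (200 numbers here) instead of a vector with one entry per movie, which is where the saving comes from.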
US1 SW1 5
US1 SW2 4
US1 LOTR1 5
US1 Notting Hill 0
US1 Mean Girls 1
US2 SW1 0
US2 SW2 2
US2 LOTR1 -
…
User Based – Similarity Between Users
Product Based – Similarity Between Products
Click Based – Based on user Clicks/Likes
Content Based – Based on tags, reviews, ratings.
Cos(SW1, SW2) = 0.94
Cos(SW1, Notting Hill) = 0.33
Cos(Mean Girls, Notting Hill) = 0.94

              US1  US2  US3  US4
SW1             5    0    5    1
SW2             4    2    5    -
LOTR1           5    -    5    -
Notting Hill    0    4    2    4
Mean Girls      1    5    1    3
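Item-based similarity applies the same cosine measure to the rows of this table instead of the columns. The sketch below assumes missing ratings are treated as 0 and adds a hypothetical helper, `most_similar`, that ranks the other movies for a given one.

```python
import math

# One vector per movie, ordered US1, US2, US3, US4 (missing ratings as 0)
ITEMS = {
    "SW1": [5, 0, 5, 1],
    "SW2": [4, 2, 5, 0],
    "LOTR1": [5, 0, 5, 0],
    "Notting Hill": [0, 4, 2, 4],
    "Mean Girls": [1, 5, 1, 3],
}


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


def most_similar(movie):
    """Other movie with the highest cosine similarity to `movie`."""
    others = [m for m in ITEMS if m != movie]
    return max(others, key=lambda m: cosine(ITEMS[movie], ITEMS[m]))


print(round(cosine(ITEMS["SW1"], ITEMS["SW2"]), 2))
print(round(cosine(ITEMS["Mean Girls"], ITEMS["Notting Hill"]), 2))
print(most_similar("Mean Girls"))
```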
The Firm ∼ The Rainmaker
The Bourne Identity ∼ The Bourne Ultimatum
- Uniform Weight
- Weighted Parameters

Title                   Author         Category  Year
The Firm                John Grisham   Thriller  1991
The Bourne Identity     Robert Ludlum  Thriller  1980
The Bourne Ultimatum    Robert Ludlum  Thriller  1990
The Rainmaker           John Grisham   Thriller  1995
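One way to read "uniform weight" versus "weighted parameters" is as a weighted share of matching attributes. The sketch below follows that reading; the scoring rule and the specific weights are assumptions for illustration, not values from the deck.

```python
BOOKS = {
    "The Firm": {"author": "John Grisham", "category": "Thriller", "year": 1991},
    "The Bourne Identity": {"author": "Robert Ludlum", "category": "Thriller", "year": 1980},
    "The Bourne Ultimatum": {"author": "Robert Ludlum", "category": "Thriller", "year": 1990},
    "The Rainmaker": {"author": "John Grisham", "category": "Thriller", "year": 1995},
}

UNIFORM = {"author": 1, "category": 1, "year": 1}
WEIGHTED = {"author": 3, "category": 1, "year": 1}  # hypothetical: author matters most


def attribute_similarity(title_a, title_b, weights):
    """Weighted share of attributes on which the two books agree exactly."""
    a, b = BOOKS[title_a], BOOKS[title_b]
    matched = sum(w for attr, w in weights.items() if a[attr] == b[attr])
    return matched / sum(weights.values())


for other in ("The Rainmaker", "The Bourne Identity"):
    print(
        other,
        round(attribute_similarity("The Firm", other, UNIFORM), 2),
        round(attribute_similarity("The Firm", other, WEIGHTED), 2),
    )
```

With uniform weights the shared category alone already gives every pair a score of 1/3; raising the author weight widens the gap between same-author and different-author pairs.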
Problem:
- User reads a news article
- Find similar news articles
- Don't return the same news article
How to convert a document into a vector?
- Extract all the words
- Remove stop words
- Identify named entities
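The first two steps can be sketched in Python as a bag-of-words vector compared with cosine similarity. The stop-word list is a tiny illustrative sample and named-entity identification is left out, since it needs an NLP library that the deck does not name.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for"}


def to_vector(text):
    """Lower-case the text, extract words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


article_a = to_vector("The election results are in")
article_b = to_vector("Election results in the capital")
print(round(cosine(article_a, article_b), 2))
```

To avoid recommending the article the user has just read, candidates whose similarity is at or very near 1.0 could be filtered out; the exact cut-off would be a tuning choice.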
New Movie
- No views (or few views)
- No similar Movies
New User
- No ratings (or few ratings)
- No similar Users
Thank you