Recommendation Engines

Modern web applications embrace personalization in order to provide a unique customer experience. Recommendation engines, in general, and Collaborative Filtering, in particular, are essential techniques for delivering state-of-the-art personalization effects on a web site.

These slides are based on a presentation that I gave to New England's Java User Group (NEJUG) in 2009; in that respect, they are quite old. Nevertheless, the content is about the fundamental concepts of these techniques and the fundamentals never go out of fashion.

The code references are from the Yooreeka project, which started with the code of the book "Algorithms of the Intelligent Web" (Manning, 2009). You can find the Yooreeka 2.0 API (Javadoc) at http://www.marmanis.com/static/javadoc/index.html

Usage Rights

© All Rights Reserved

Recommendation Engines Presentation Transcript

  • 1. Recommendation Engines: A key personalization feature of modern web applications. Haralambos (Babis) Marmanis, NEJUG, June 11, 2009
  • 2. Presentation Outline
    Sections: Introduction, Basic Concepts, Collaborative Filtering, Content based, Netflix Prize, Summary
    1 Introduction: Recommendations in Action; “It’s the Economy ...”; Java source code
    2 Basic Concepts: The Online Music Store Example; Similarity; Distance (formulas); Similarity (formulas); The “best” Similarity formula
    3 Collaborative Filtering: User based; Rating Counting Matrix; Item based
    4 Content based
  • 3. Recommendations in Action: Online store recommendations. Amazon.com: provide recommendations for purchasing more items.
  • 4. Recommendations in Action: Online store recommendations. Netflix.com: provide recommendations for viewing more movies.
  • 5. Recommendations in Action: Content recommendations. Any news portal or other content aggregator: recommendations for articles, books, news stories.
  • 6. “It’s the Economy ...”: The Long Tail. Goodbye Pareto Principle, Hello Long Tail. Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith used a log-linear curve to describe the relationship between Amazon.com sales and sales ranking. They found that a large proportion of Amazon.com’s book sales come from obscure books that were not available in brick-and-mortar stores. They also found that the consumer benefit from access to increased product variety in online book stores is ten times larger than the benefit from access to lower prices online!
  • 9. Java source code: Yooreeka! An open-source machine learning library: search, recommendations, clustering, classification, and combination of classifiers. URL: http://code.google.com/p/yooreeka/
  • 11. The Online Music Store Example: Frank’s music ratings
  • 12. The Online Music Store Example: Constantine’s music ratings
  • 13. The Online Music Store Example: Catherine’s music ratings
  • 14. Similarity: The notion of Similarity. Often based on the notion of distance: the smaller the distance, the greater the similarity. Similarity values are typically constrained to [0,∞) or [0,1]. It is not necessary to define a similarity formula; e.g., if d < ε for some chosen threshold ε, then similar, otherwise not. Similarity could also be empirical or probabilistic.
  • 19. Distance (formulas): Let X and Y be two vectors in $\mathbb{R}^N$ with components $X_i$ and $Y_i$.
    Minkowski or p-norm distance: $d = \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p}$ (1)
    Manhattan distance (the case p = 1): $d = \sum_{i=1}^{N} |X_i - Y_i|$ (2)
    Chebychev or $L_\infty$ distance: $d = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} = \max_i |X_i - Y_i|$ (3)
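The distance formulas on this slide can be sketched in a few lines of Java. This is a minimal illustration, not code from Yooreeka: the class name `Distances` and its method names are hypothetical, Manhattan is taken as the p = 1 case of the Minkowski distance, and Chebychev as its p → ∞ limit (the largest absolute component difference).

```java
/** Illustrative sketch only; the class and method names are assumptions, not Yooreeka API. */
final class Distances {
    private Distances() {}

    /** Minkowski (p-norm) distance: (sum_i |x_i - y_i|^p)^(1/p), assuming p >= 1. */
    static double minkowski(double[] x, double[] y, double p) {
        requireSameLength(x, y);
        if (p < 1.0) {
            throw new IllegalArgumentException("p must be >= 1");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Manhattan distance: the p = 1 case, a plain sum of absolute differences. */
    static double manhattan(double[] x, double[] y) {
        requireSameLength(x, y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.abs(x[i] - y[i]);
        }
        return sum;
    }

    /** Chebychev distance: the p -> infinity limit, i.e. the largest absolute difference. */
    static double chebyshev(double[] x, double[] y) {
        requireSameLength(x, y);
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }

    private static void requireSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 0.0, 3.0};
        System.out.println("Manhattan: " + manhattan(x, y));
        System.out.println("Euclidean: " + minkowski(x, y, 2.0));
        System.out.println("Chebychev: " + chebyshev(x, y));
    }
}
```

Calling `minkowski` with a large p (say 64) on the same vectors gives a value very close to `chebyshev`, which is one way to see the limit numerically.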
  • 23. Similarity (formulas):
    Naïve Similarity: $\mathrm{sim}_{\mathrm{Naive}} = \frac{\beta}{\beta + d}$ (4), where d is the Euclidean distance.
    Similarity I: $\mathrm{sim}_{I} = 1 - \tanh(\sigma)$ (5), where σ is the biased estimator of sample variance.
    Similarity II: $\mathrm{sim}_{II} = \mathrm{sim}_{I} \times \frac{\mathrm{common}}{\mathrm{maximum}}$ (6)
    There is more ... Jaccard, Tanimoto, and so on.
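Two of the measures named on this slide are easy to sketch: the naïve similarity of equation (4) and the Jaccard coefficient. The code below is an illustration under stated assumptions, not Yooreeka code: the class `SimilaritySketch` is hypothetical, β is assumed to be a positive tuning constant, and the Jaccard variant treats each user as the set of items they rated (returning 1.0 for two empty sets is a convention chosen here).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; names and conventions are assumptions. */
final class SimilaritySketch {
    private SimilaritySketch() {}

    /** Naive similarity beta / (beta + d): equals 1 at distance 0 and decays towards 0. */
    static double naive(double distance, double beta) {
        if (beta <= 0.0 || distance < 0.0) {
            throw new IllegalArgumentException("beta must be > 0 and distance >= 0");
        }
        return beta / (beta + distance);
    }

    /** Jaccard coefficient: |A intersect B| / |A union B|. */
    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention assumed for this sketch
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        System.out.println("naive(d=1, beta=1) = " + naive(1.0, 1.0));
        Set<String> frank = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> catherine = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println("jaccard = " + jaccard(frank, catherine));
    }
}
```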
  • 26. The “best” Similarity formula: Which is the best similarity formula? There is no such thing! It depends on the problem, the data, the definition of ... “best”. Spertus, Sahami, and Büyükkökten (2005), “Evaluating similarity measures: a large-scale study in the Orkut social network,” Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. The simple L2-based (cosine) similarity showed the best empirical results among seven similarity metrics.
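For reference, the L2-based (cosine) similarity mentioned on this slide is the dot product of two vectors divided by the product of their L2 norms. A minimal sketch follows; the class name `CosineSimilarity` is hypothetical, and returning 0.0 when either vector is all zeros is a convention assumed here rather than something the slides specify.

```java
/** Illustrative sketch only; not taken from Yooreeka. */
final class CosineSimilarity {
    private CosineSimilarity() {}

    /** cos(x, y) = (x . y) / (||x||_2 * ||y||_2). */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0; // assumed convention for an all-zero vector
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {2.0, 4.0, 6.0};
        System.out.println("cosine(a, b) = " + cosine(a, b));
    }
}
```

Because only the angle between the vectors matters, a user who rates everything twice as high as another user still gets a cosine similarity close to 1.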
  • 31. Collaborative Filtering: Tapestry, an experimental mail system by Goldberg et al. (circa 1992) at Xerox PARC
  • 36. User based: User Similarity Matrix
             U1   U2   U3   U4   U5   ..
      U1  [ S11  S12  S13  S14  S15  ... ]
      U2  [ S21  S22  S23  S24  S25  ... ]
      U3  [ S31  S32  S33  S34  S35  ... ]
      U4  [ S41  S42  S43  S44  S45  ... ]
      U5  [ S51  S52  S53  S54  S55  ... ]
      ..  [ ...  ...  ...  ...  ...  ... ]
  • 37. User based: User Similarity Matrix (cont.)
             U1    U2     U3     U4     U5     ..
      U1  [ 1.0, 0.333, 0.385, 0.333, 0.364, ... ]
      U2  [ 0.0, 1.000, 0.545, 0.385, 0.615, ... ]
      U3  [ 0.0, 0.000, 1.000, 0.364, 0.636, ... ]
      U4  [ 0.0, 0.000, 0.000, 1.000, 0.231, ... ]
      U5  [ 0.0, 0.000, 0.000, 0.000, 1.000, ... ]
      ..  [ 0.0, 0.000, 0.000, 0.000, 0.000, ... ]
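The numeric matrix above is upper triangular: the diagonal is 1 and the entries below it are left at 0, presumably because similarity is symmetric and each pair needs to be computed only once. A minimal sketch of building such a matrix follows. It is not the Yooreeka implementation: the class `UserSimilarityMatrix`, the dense `double[][]` ratings layout (one row per user, one column per item), and the naïve Euclidean similarity with β = 1 are all assumptions made for illustration.

```java
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; layout and names are assumptions, not Yooreeka API. */
final class UserSimilarityMatrix {
    private UserSimilarityMatrix() {}

    /** Fills the diagonal with 1.0 and the upper triangle with pairwise similarities. */
    static double[][] build(double[][] ratings,
                            ToDoubleBiFunction<double[], double[]> similarity) {
        int n = ratings.length;
        double[][] s = new double[n][n];
        for (int u = 0; u < n; u++) {
            s[u][u] = 1.0;
            for (int v = u + 1; v < n; v++) {
                s[u][v] = similarity.applyAsDouble(ratings[u], ratings[v]);
            }
        }
        return s; // entries below the diagonal stay 0.0
    }

    /** Naive similarity 1 / (1 + d) with d the Euclidean distance (beta = 1 assumed). */
    static double naiveEuclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    public static void main(String[] args) {
        double[][] ratings = {{5, 3}, {5, 3}, {1, 3}};
        double[][] s = build(ratings, UserSimilarityMatrix::naiveEuclidean);
        for (double[] row : s) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Passing the transposed ratings matrix (one row per item) to the same routine would give an item similarity matrix of the same shape.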
  • 38. Rating Counting Matrix: Rating Counting Matrix
             R1   R2   R3   R4   R5
      R1  [ X11  X12  X13  X14  X15 ]
      R2  [ X21  X22  X23  X24  X25 ]
      R3  [ X31  X32  X33  X34  X35 ]
      R4  [ X41  X42  X43  X44  X45 ]
      R5  [ X51  X52  X53  X54  X55 ]
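The slide does not spell out what the entries mean. One plausible reading, assumed here, is that for a pair of users A and B, X_ij counts the items that A rated i and B rated j on a 1 to 5 scale, so that agreement lies on the diagonal. Under that assumption, a simple similarity is the fraction of commonly rated items on which the two users agree exactly. The class `RatingCountSketch` and the use of 0 for "not rated" are hypothetical choices for this sketch.

```java
/** Illustrative sketch under an assumed reading of the rating counting matrix. */
final class RatingCountSketch {
    private RatingCountSketch() {}

    /**
     * Counts rating pairs over the items both users rated.
     * Ratings are integers in 1..levels; 0 means "not rated" (assumed encoding).
     */
    static int[][] count(int[] ratingsA, int[] ratingsB, int levels) {
        if (ratingsA.length != ratingsB.length) {
            throw new IllegalArgumentException("rating arrays must have the same length");
        }
        int[][] m = new int[levels][levels];
        for (int item = 0; item < ratingsA.length; item++) {
            int a = ratingsA[item];
            int b = ratingsB[item];
            if (a == 0 || b == 0) {
                continue; // skip items that only one user (or neither) rated
            }
            if (a < 1 || a > levels || b < 1 || b > levels) {
                throw new IllegalArgumentException("rating out of range at item " + item);
            }
            m[a - 1][b - 1]++;
        }
        return m;
    }

    /** Fraction of common items with identical ratings (the diagonal of the matrix). */
    static double agreementSimilarity(int[][] m) {
        int total = 0;
        int agree = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    agree += m[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) agree / total;
    }

    public static void main(String[] args) {
        int[] a = {5, 3, 0, 1};
        int[] b = {5, 4, 2, 1};
        System.out.println("similarity = " + agreementSimilarity(count(a, b, 5)));
    }
}
```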
  • 39. Rating Counting Matrix: BeanShell script (Users)
      BaseDataset ds = MusicData.createDataset();                     // build the sample music dataset
      Delphi delphi = new Delphi(ds,RecommendationType.USER_BASED);   // user-based recommender
      MusicUser mu1 = ds.pickUser("Bob");                             // pick a user by name
      delphi.findSimilarUsers(mu1);                                   // users similar to Bob
      delphi.recommend(mu1);                                          // recommendations for Bob
  • 40. Item based: Item Similarity Matrix
             I1    I2     I3     I4     I5     ...
      I1  [ 1.0, 0.333, 0.385, 0.333, 0.364, ... ]
      I2  [ 0.0, 1.000, 0.545, 0.385, 0.615, ... ]
      I3  [ 0.0, 0.000, 1.000, 0.364, 0.636, ... ]
      I4  [ 0.0, 0.000, 0.000, 1.000, 0.231, ... ]
      I5  [ 0.0, 0.000, 0.000, 0.000, 1.000, ... ]
      ..  [ 0.0, 0.000, 0.000, 0.000, 0.000, ... ]
  • 41. Item based: BeanShell script (Items)
      Delphi delphi = new Delphi(ds,RecommendationType.ITEM_BASED);   // item-based recommender on the same dataset
      MusicUser mu1 = ds.pickUser("Bob");                             // pick a user by name
      delphi.recommend(mu1);                                          // recommendations for Bob
      MusicItem mi = ds.pickItem("La Bamba");                         // pick an item by name
      delphi.findSimilarItems(mi);                                    // items similar to "La Bamba"
  • 42. Item based: Peruse the code: Delphi, UserBasedSimilarity, ItemBasedSimilarity, BaseSimilarityMatrix, RatingCountMatrix
  • 48. Text Parsing & Analysis: No more ratings, what do we do? Now we deal with documents, so we need to define similarity based on the content of the documents. Use Lucene’s StandardAnalyzer, or build your own! (see CustomAnalyzer)
  • 52. Document representation: No more ratings!
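Without ratings, a document can be represented by the terms it contains. The sketch below shows one common representation, a term-frequency map compared with cosine similarity. It is a deliberately simple stand-in and does not use Lucene or the deck's CustomAnalyzer: the class `TermVectorSketch` is hypothetical, tokenization is a plain split on non-alphanumeric characters, and there is no stop-word removal or stemming.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; a simplified stand-in for a real text analyzer. */
final class TermVectorSketch {
    private TermVectorSketch() {}

    /** Lower-cases the text, splits on non-letter/digit runs, and counts each term. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity between two sparse term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("rock music");
        Map<String, Integer> d2 = termFrequencies("rock concert");
        System.out.println("similarity = " + cosine(d1, d2));
    }
}
```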
  • 56. Netflix Prize: Description. More than 100 million ratings; 480 thousand randomly-chosen, anonymous customers; 18 thousand movie titles.
  • 60. Netflix Prize: Lessons learned. Important considerations: data normalization; neighbor selection (How many neighbors? Who are the “best” neighbors?); neighbor weights. “Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.”
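The three considerations on this slide map onto a textbook neighborhood predictor, sketched below. This is not a Netflix Prize method or Yooreeka code; it is a generic illustration with assumed choices: normalization by subtracting each user's mean rating, neighbors chosen as the k most similar users who rated the item, and weights equal to the (positive) similarities. The class `NeighborPrediction`, the dense ratings layout, and 0 meaning "not rated" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; a generic mean-centered, similarity-weighted kNN predictor. */
final class NeighborPrediction {
    private NeighborPrediction() {}

    /** Mean of the non-zero entries; 0 is assumed to mean "not rated". */
    static double mean(double[] ratings) {
        double sum = 0.0;
        int count = 0;
        for (double r : ratings) {
            if (r > 0.0) {
                sum += r;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    /**
     * Predicts ratings[user][item] from the k most similar users who rated the item.
     * sims[v] is the similarity between user and v; only positive values are used.
     */
    static double predict(double[][] ratings, double[] sims, int user, int item, int k) {
        // Neighbor selection: candidates must have rated the item.
        List<Integer> candidates = new ArrayList<>();
        for (int v = 0; v < ratings.length; v++) {
            if (v != user && ratings[v][item] > 0.0 && sims[v] > 0.0) {
                candidates.add(v);
            }
        }
        candidates.sort(Comparator.comparingDouble((Integer v) -> sims[v]).reversed());

        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int n = 0; n < Math.min(k, candidates.size()); n++) {
            int v = candidates.get(n);
            // Normalization: use the neighbor's deviation from their own mean.
            weightedSum += sims[v] * (ratings[v][item] - mean(ratings[v]));
            weightTotal += sims[v];
        }
        double baseline = mean(ratings[user]);
        return weightTotal == 0.0 ? baseline : baseline + weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        double[][] ratings = {{4, 2, 0}, {5, 3, 4}, {1, 1, 4}};
        double[] simsToUser0 = {1.0, 0.8, 0.2};
        System.out.println("prediction = " + predict(ratings, simsToUser0, 0, 2, 2));
    }
}
```

Varying k and the similarity function in a sketch like this is a quick way to see how sensitive predictions are to the neighbor questions the slide raises.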
  • 65. Summary: Important considerations. Business value validation: “Long Tail”, “niches to riches”, etc. Similarity metrics: many to choose from, do not be afraid to explore! Collaborative Filtering (“Show me your friend ...”): user based, item based. Content based recommendations: NLP challenges. Large scale implementations: speed, data size, quality.