Modern web applications embrace personalization in order to provide a unique customer experience. Recommendation engines, in general, and Collaborative Filtering, in particular, are essential techniques for delivering state-of-the-art personalization effects on a web site.

These slides are based on a presentation that I gave to New England's Java User Group (NEJUG) in 2009; in that respect, they are quite old. Nevertheless, the content is about the fundamental concepts of these techniques and the fundamentals never go out of fashion.

The code references are from the Yooreeka project, which started with the code of the book "Algorithms of the Intelligent Web" (Manning, 2009). You can find the Yooreeka 2.0 API (Javadoc) at http://www.marmanis.com/static/javadoc/index.html


- 1. Recommendation Engines: A key personalization feature of modern web applications. Haralambos (Babis) Marmanis, NEJUG, June 11, 2009
- 2. Presentation Outline (sections: Introduction, Basic Concepts, Collaborative Filtering, Content based, Netflix Prize, Summary)
  1. Introduction: Recommendations in Action; "It's the Economy ..."; Java source code
  2. Basic Concepts: The Online Music Store Example; Similarity; Distance (formulas); Similarity (formulas); The "best" Similarity formula
  3. Collaborative Filtering: User based; Rating Counting Matrix; Item based
  4. Content based
- 3. Recommendations in Action: Online store recommendations. Amazon.com: provide recommendations for purchasing more items.
- 4. Recommendations in Action: Online store recommendations. Netflix.com: provide recommendations for viewing more movies.
- 5. Recommendations in Action: Content recommendations. Any news portal or other content aggregator: recommendations for articles, books, news stories.
- 6. "It's the Economy ...": The Long Tail. Goodbye Pareto Principle, Hello Long Tail.
  - Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith used a log-linear curve to describe the relationship between Amazon.com sales and sales ranking.
  - They found that a large proportion of Amazon.com's book sales come from obscure books that were not available in brick-and-mortar stores.
  - They also found that the consumer benefit from access to increased product variety in online book stores is ten times larger than the benefit from access to lower prices online!
- 9. Java source code: Yooreeka! Open source machine learning library: search, recommendations, clustering, classification, and combination of classifiers! URL: http://code.google.com/p/yooreeka/
- 11. The Online Music Store Example: Frank's music ratings
- 12. The Online Music Store Example: Constantine's music ratings
- 13. The Online Music Store Example: Catherine's music ratings
- 14. Similarity: The notion of Similarity
  - Often based on the notion of distance
  - The smaller the distance, the greater the similarity
  - Similarity values are typically constrained to [0,∞) or [0,1]
  - It is not necessary to define similarity formulas, e.g. if d is below a chosen threshold then similar, otherwise not
  - Similarity could also be empirical or probabilistic
- 19. Distance (formulas). Let $X$ and $Y$ be two vectors in $\mathbb{R}^N$ with components $X_i$ and $Y_i$.
  - Minkowski or p-norm distance: $$d = \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} \quad (1)$$
  - Manhattan distance (the case $p = 1$): $$d = \sum_{i=1}^{N} |X_i - Y_i| \quad (2)$$
  - Chebychev or $L_\infty$ distance: $$d = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} = \max_i |X_i - Y_i| \quad (3)$$
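
The three distances above can be sketched in a few lines of Java. This is a minimal illustration, not code from Yooreeka; the class name `Distances`, its method names, and the sample vectors are made up for this example.

```java
class Distances {
    /** Minkowski (p-norm) distance, equation (1). Assumes p >= 1. */
    static double minkowski(double[] x, double[] y, double p) {
        requireSameLength(x, y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Manhattan distance, equation (2): the p = 1 case. */
    static double manhattan(double[] x, double[] y) {
        return minkowski(x, y, 1.0);
    }

    /** Chebychev distance, equation (3): the largest component difference. */
    static double chebychev(double[] x, double[] y) {
        requireSameLength(x, y);
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }

    private static void requireSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        // Hypothetical rating vectors, for illustration only.
        double[] x = {1, 2, 3};
        double[] y = {4, 0, 3};
        System.out.println(manhattan(x, y));      // prints 5.0
        System.out.println(chebychev(x, y));      // prints 3.0
        System.out.println(minkowski(x, y, 50));  // close to the Chebychev value
    }
}
```

Raising `p` in `minkowski` moves the result toward the `chebychev` value, which is the limit stated in equation (3).
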
- 23. Similarity (formulas)
  - Naïve Similarity: $$\mathrm{sim}_{\mathrm{Naive}} = \frac{\beta}{\beta + d} \quad (4)$$ where $d$ is the Euclidean distance.
  - Similarity I: $$\mathrm{sim}_{I} = 1 - \tanh(\sigma) \quad (5)$$ where $\sigma$ is the biased estimator of sample variance.
  - Similarity II: $$\mathrm{sim}_{II} = \mathrm{sim}_{I} \times \frac{\text{common}}{\text{maximum}} \quad (6)$$
  - There is more . . . Jaccard, Tanimoto, and so on
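
Equations (4) to (6) translate almost directly into code. The sketch below is an illustration under stated assumptions, not the Yooreeka implementation: the slide does not say which data σ is computed over, so `biasedVariance` assumes it is the divide-by-n variance of the per-item rating differences, and `common` and `maximum` are assumed to be the number of items both users rated and the largest number of items they could have shared. All names are hypothetical.

```java
class Similarities {
    /** Naive similarity, equation (4): beta / (beta + d), with d the Euclidean distance. */
    static double naive(double[] x, double[] y, double beta) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sumSq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumSq += diff * diff;
        }
        return beta / (beta + Math.sqrt(sumSq));
    }

    /** Similarity I, equation (5): 1 - tanh(sigma). */
    static double simI(double sigma) {
        return 1.0 - Math.tanh(sigma);
    }

    /** Similarity II, equation (6): Similarity I scaled by common / maximum. */
    static double simII(double sigma, int common, int maximum) {
        return simI(sigma) * common / maximum;
    }

    /**
     * ASSUMPTION: sigma taken as the biased (divide-by-n) variance of the
     * rating differences; the slide does not specify the underlying sample.
     */
    static double biasedVariance(double[] diffs) {
        double mean = 0.0;
        for (double d : diffs) {
            mean += d;
        }
        mean /= diffs.length;
        double sum = 0.0;
        for (double d : diffs) {
            sum += (d - mean) * (d - mean);
        }
        return sum / diffs.length;
    }
}
```

Identical rating vectors give a naive similarity of 1, and a larger σ pushes Similarity I toward 0, which matches the "smaller distance, greater similarity" idea from the earlier slide.
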
- 26. The "best" Similarity formula
  - Which is the best similarity formula? There is no such thing! It depends on the problem, the data, the definition of ... "best".
  - Spertus, Sahami, and Büyükkökten (2005). Evaluating similarity measures: a large-scale study in the orkut social network. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.
  - The simple L2 based (cosine) similarity showed the best empirical results among six similarity metrics.
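
For reference, the L2 based (cosine) similarity mentioned above is the dot product of two vectors divided by the product of their L2 norms. A minimal sketch follows; the class name and the choice to return 0 for an all-zero vector are assumptions of this example.

```java
class Cosine {
    /** Cosine similarity: dot(x, y) / (||x||_2 * ||y||_2). */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0; // convention chosen here: an all-zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }
}
```

Because the vectors are normalized, two users whose ratings differ only by a constant factor get a similarity of 1.
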
- 31. Tapestry: experimental mail system by Goldberg et al. (circa 1992) at Xerox PARC
- 36. User based: User Similarity Matrix

  |     | U1  | U2  | U3  | U4  | U5  | ... |
  | --- | --- | --- | --- | --- | --- | --- |
  | U1  | S11 | S12 | S13 | S14 | S15 | ... |
  | U2  | S21 | S22 | S23 | S24 | S25 | ... |
  | U3  | S31 | S32 | S33 | S34 | S35 | ... |
  | U4  | S41 | S42 | S43 | S44 | S45 | ... |
  | U5  | S51 | S52 | S53 | S54 | S55 | ... |
  | ... | ... | ... | ... | ... | ... | ... |
- 37. User based: User Similarity Matrix (cont.)

  |     | U1  | U2    | U3    | U4    | U5    | ... |
  | --- | --- | ----- | ----- | ----- | ----- | --- |
  | U1  | 1.0 | 0.333 | 0.385 | 0.333 | 0.364 | ... |
  | U2  | 0.0 | 1.000 | 0.545 | 0.385 | 0.615 | ... |
  | U3  | 0.0 | 0.000 | 1.000 | 0.364 | 0.636 | ... |
  | U4  | 0.0 | 0.000 | 0.000 | 1.000 | 0.231 | ... |
  | U5  | 0.0 | 0.000 | 0.000 | 0.000 | 1.000 | ... |
  | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | ... |
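
The matrix above has 1.0 on the diagonal and zeros below it, which suggests that only the upper triangle is stored because similarity is symmetric. The following sketch builds such a matrix for any pairwise similarity function. It is an illustration only: `SimilarityMatrix`, `build`, `lookup`, and the simple `agreement` measure are hypothetical names, not Yooreeka classes.

```java
import java.util.function.ToDoubleBiFunction;

class SimilarityMatrix {
    /** Fills the diagonal with 1.0 and the upper triangle with sim(u, v); the rest stays 0.0. */
    static double[][] build(double[][] ratings, ToDoubleBiFunction<double[], double[]> sim) {
        int n = ratings.length;
        double[][] s = new double[n][n];
        for (int u = 0; u < n; u++) {
            s[u][u] = 1.0;
            for (int v = u + 1; v < n; v++) {
                s[u][v] = sim.applyAsDouble(ratings[u], ratings[v]);
            }
        }
        return s;
    }

    /** Reads a similarity regardless of argument order by always using the upper triangle. */
    static double lookup(double[][] s, int u, int v) {
        return u <= v ? s[u][v] : s[v][u];
    }

    /** Example measure (an assumption of this sketch): fraction of items rated identically. */
    static double agreement(double[] a, double[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        // Hypothetical ratings: three users, three items.
        double[][] ratings = {{5, 3, 4}, {5, 1, 4}, {2, 3, 1}};
        double[][] s = build(ratings, SimilarityMatrix::agreement);
        System.out.println(lookup(s, 1, 0)); // same value as s[0][1]
    }
}
```

Storing one triangle halves the work of filling the matrix, at the cost of going through a helper such as `lookup` whenever the row index is larger than the column index.
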
- 38. Rating Counting Matrix

  |     | R1  | R2  | R3  | R4  | R5  |
  | --- | --- | --- | --- | --- | --- |
  | R1  | X11 | X12 | X13 | X14 | X15 |
  | R2  | X21 | X22 | X23 | X24 | X25 |
  | R3  | X31 | X32 | X33 | X34 | X35 |
  | R4  | X41 | X42 | X43 | X44 | X45 |
  | R5  | X51 | X52 | X53 | X54 | X55 |
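
The slide shows only the symbolic 5×5 layout. One plausible reading, assumed here and suggested by the name, is that X_ij counts the items that one user rated i and the other rated j, so that the diagonal holds the cases where the two users agree. The sketch below follows that reading; `RatingCounts` and its methods are hypothetical and not the Yooreeka `RatingCountMatrix` class.

```java
class RatingCounts {
    /**
     * counts[i][j] = number of items that user A rated (i + 1) and user B rated (j + 1).
     * Ratings are integers in 1..maxRating; 0 marks an unrated item (assumption of this sketch).
     */
    static int[][] count(int[] a, int[] b, int maxRating) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("rating arrays must have the same length");
        }
        int[][] counts = new int[maxRating][maxRating];
        for (int k = 0; k < a.length; k++) {
            if (a[k] > 0 && b[k] > 0) {
                counts[a[k] - 1][b[k] - 1]++;
            }
        }
        return counts;
    }

    /** Share of co-rated items on which both users gave the same rating (diagonal / total). */
    static double agreementSimilarity(int[][] counts) {
        int agree = 0;
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts[i].length; j++) {
                total += counts[i][j];
                if (i == j) {
                    agree += counts[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) agree / total;
    }
}
```

For example, with the made-up ratings `{5, 3, 0, 4, 1}` and `{5, 2, 4, 4, 0}` the two users share three rated items and agree on two of them, so the similarity is 2/3.
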
- 39. Rating Counting Matrix: BeanShell script (Users)

      BaseDataset ds = MusicData.createDataset();
      Delphi delphi = new Delphi(ds, RecommendationType.USER_BASED);
      MusicUser mu1 = ds.pickUser("Bob");
      delphi.findSimilarUsers(mu1);
      delphi.recommend(mu1);
- 40. Item based: Item Similarity Matrix

  |     | I1  | I2    | I3    | I4    | I5    | ... |
  | --- | --- | ----- | ----- | ----- | ----- | --- |
  | I1  | 1.0 | 0.333 | 0.385 | 0.333 | 0.364 | ... |
  | I2  | 0.0 | 1.000 | 0.545 | 0.385 | 0.615 | ... |
  | I3  | 0.0 | 0.000 | 1.000 | 0.364 | 0.636 | ... |
  | I4  | 0.0 | 0.000 | 0.000 | 1.000 | 0.231 | ... |
  | I5  | 0.0 | 0.000 | 0.000 | 0.000 | 1.000 | ... |
  | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | ... |
- 41. Item based: BeanShell script (Items)

      Delphi delphi = new Delphi(ds, RecommendationType.ITEM_BASED);
      MusicUser mu1 = ds.pickUser("Bob");
      delphi.recommend(mu1);
      MusicItem mi = ds.pickItem("La Bamba");
      delphi.findSimilarItems(mi);
- 42. Item based: Peruse the code: `Delphi`, `UserBasedSimilarity`, `ItemBasedSimilarity`, `BaseSimilarityMatrix`, `RatingCountMatrix`
- 48. Text Parsing & Analysis
  - No more ratings, what do we do?
  - Now we deal with documents
  - So, we need to define similarity based on the content of the documents
  - Use Lucene's `StandardAnalyzer`
  - Build your own! (see `CustomAnalyzer`)
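
To make "similarity based on the content of the documents" concrete, the sketch below turns each document into a term-frequency map and compares two maps with cosine similarity. It deliberately uses plain Java string handling as a stand-in for a Lucene analyzer, so it is not how `StandardAnalyzer` or `CustomAnalyzer` work internally; the class name, the tiny stop-word list, and the tokenization rule are all assumptions of this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ContentSimilarity {
    // Tiny illustrative stop-word list; a real analyzer ships a much longer one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in", "to", "is"));

    /** Lowercases, splits on non-alphanumeric characters, drops stop words, counts terms. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine similarity between two term-frequency maps. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> m) {
        double sum = 0.0;
        for (int v : m.values()) {
            sum += (double) v * v;
        }
        return Math.sqrt(sum);
    }
}
```

Once documents are represented this way, the user based and item based machinery from the earlier slides can be reused with document vectors in place of rating vectors.
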
- 52. Document representation: No more ratings!
- 56. Netflix Prize: Description
  - More than 100 million ratings
  - 480 thousand randomly-chosen, anonymous customers
  - 18 thousand movie titles
- 60. Lessons learned: Important considerations
  - Data normalization
  - Neighbor selection: How many neighbors? Who are the "best" neighbors?
  - Neighbor weights
  - "Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique."
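
The three considerations above (normalization, neighbor selection, neighbor weights) show up together in a basic neighborhood predictor. The sketch below is one common textbook variant, chosen for illustration and not taken from the slides or from any Netflix Prize entry: ratings are normalized by subtracting each user's mean, the `k` most similar users who rated the item are selected, and their deviations are weighted by similarity. The class name, the use of 0 for "unrated", and the full symmetric `sim` matrix are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class NeighborPrediction {
    /** Mean of a user's known ratings; 0 marks an unrated item. */
    static double mean(double[] ratings) {
        double sum = 0.0;
        int count = 0;
        for (double r : ratings) {
            if (r > 0) {
                sum += r;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    /** Predicts the rating of `user` for `item` from the k most similar users who rated it. */
    static double predict(double[][] ratings, double[][] sim, int user, int item, int k) {
        // Neighbor selection: only users who actually rated the item are candidates.
        List<Integer> candidates = new ArrayList<>();
        for (int v = 0; v < ratings.length; v++) {
            if (v != user && ratings[v][item] > 0) {
                candidates.add(v);
            }
        }
        // The "best" neighbors are taken to be the most similar ones.
        candidates.sort((a, b) -> Double.compare(sim[user][b], sim[user][a]));

        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int idx = 0; idx < Math.min(k, candidates.size()); idx++) {
            int v = candidates.get(idx);
            double w = sim[user][v];                                // neighbor weight
            weighted += w * (ratings[v][item] - mean(ratings[v]));  // normalized rating
            totalWeight += Math.abs(w);
        }
        double base = mean(ratings[user]);
        return totalWeight == 0.0 ? base : base + weighted / totalWeight;
    }
}
```

Changing `k`, the similarity used to fill `sim`, or the normalization step gives different predictors, which is where the questions on the slide become practical tuning decisions.
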
- 65. Summary: Important considerations
  - Business value validation: "Long Tail", "niches to riches", etc.
  - Similarity metrics: many to choose from, do not be afraid to explore!
  - Collaborative Filtering: "Show me your friend ..." (User based, Item based)
  - Content based recommendations: NLP challenges
  - Large scale implementations: speed, data size, quality