Crab: A Python Framework for Building Recommender Systems

Keynote at VII PythonBrasil (Brazilian Python Users Meeting), São Paulo, SP, on 01/10/2011.


  1. Crab: A Python Framework for Building Recommendation Engines. PythonBrasil 2011, São Paulo, SP. Marcel Caraciolo (@marcelcaraciolo), Ricardo Caspirro (@ricardocaspirro), Bruno Melo (@brunomelo).
  2. What is Crab? A Python framework for building recommendation engines: a scikit module for collaborative, content-based and hybrid filtering, and a Mahout alternative for Python developers :D. Open source under the BSD license. https://github.com/muricoca/crab
  3. When did it start? It began one year ago, community-driven, with 4 members. In April 2011 the open-source lab Muriçoca incorporated it, and since then we have been rewriting it as a scikit. https://github.com/muricoca/
  4. Knowing scikits. Scikits are SciPy Toolkits: independent projects hosted under a common namespace. Examples: Scikits Image, Scikits MlabWrap, Scikits AudioLab, Scikit-Learn, ... http://scikits.appspot.com/scikits
  5. Knowing scikits: scikit-learn. Machine learning algorithms + the scientific Python packages (NumPy, SciPy and Matplotlib). http://scikit-learn.sourceforge.net/ Our goal: make Crab a scikit and contribute parts of it to scikit-learn.
  6. Why recommendations? The world is an over-crowded place!
  7. Why recommendations? We are overloaded: thousands of news articles and blog posts each day; millions of movies, books and music tracks online; in Hanoi, ~75 TV channels with thousands of programs each day; in New York, several thousand ad messages sent to us per day; several places, offers and events. Even friends: sometimes we are overloaded!
  8. Why recommendations? We really need and consume only a few of them! "A lot of times, people don't know what they want until you show it to them." (Steve Jobs) "We are leaving the Information Age and entering the Recommendation Age." (Chris Anderson, The Long Tail)
  9. Why recommendations? Can Google help? Yes, but only when we really know what we are looking for; and what does "interesting" mean? Can Facebook help? Yes, I tend to find my friends' stuff interesting; but what if I had only a few friends, and what they like did not always attract me? Can experts help? Yes, but it won't scale well, and it is what they like, not what I like: exactly the same advice for everyone!
  10. Why recommendations? Recommendation systems: systems designed to recommend me something I may like.
  11. Why recommendations? (Diagram: graph representation of a recommendation system.)
  12. The current Crab. Collaborative filtering algorithms: user-based, item-based and matrix factorization (SVD). Evaluation of the recommender algorithms: precision, recall, F1-score, RMSE, and precision-recall charts.
  13. The current Crab. (Chart: precision-recall.)
  14. Collaborative filtering. (Diagram: users Marcel, Rafael and Amanda; items Thor, O Vento Levou, Toy Store, Armagedon; "likes" edges between similar users drive "recommends" edges.)
  15-20. The current Crab: loading a dataset (built up line by line across the slides).

        >>> # load the dataset
        >>> from crab.datasets import load_sample_movies
        >>> data = load_sample_movies()
        >>> data
        {'DESCR': 'sample_movies data set was collected by the book called\n Programming the Collective Intelligence by Toby Segaran\n\nNotes\n-----\nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',
         'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
          2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
          3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
          4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
          5: {2: 4.5, 3: 1.0, 4: 4.0},
          6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
          7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},
         'item_ids': {1: 'Lady in the Water', 2: 'Snakes on a Planet', 3: 'You, Me and Dupree', 4: 'Superman Returns', 5: 'The Night Listener', 6: 'Just My Luck'},
         'user_ids': {1: 'Jack Matthews', 2: 'Mick LaSalle', 3: 'Claudia Puig', 4: 'Lisa Rose', 5: 'Toby', 6: 'Gene Seymour', 7: 'Michael Phillips'}}
  21-24. The current Crab: building a data model.

        >>> from crab.models import MatrixPreferenceDataModel
        >>> m = MatrixPreferenceDataModel(data.data)
        >>> print m
        MatrixPreferenceDataModel (7 by 6)
                 1         2         3         4         5         ...
        1        3.000000  4.000000  3.500000  5.000000  3.000000
        2        3.000000  4.000000  2.000000  3.000000  3.000000
        3        ---       3.500000  2.500000  4.000000  4.500000
        4        2.500000  3.500000  2.500000  3.500000  3.000000
        5        ---       4.500000  1.000000  4.000000  ---
        6        3.000000  3.500000  3.500000  5.000000  3.000000
        7        2.500000  3.000000  ---       3.500000  4.000000
  25-33. The current Crab: computing user-user similarities.

        >>> # import a pairwise distance metric
        >>> from crab.metrics.pairwise import euclidean_distances
        >>> # import a similarity
        >>> from crab.similarities import UserSimilarity
        >>> similarity = UserSimilarity(m, euclidean_distances)
        >>> similarity[1]
        [(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669),
         (3, 0.32037724101704074), (7, 0.32037724101704074),
         (2, 0.2857142857142857), (5, 0.2674788903885893)]
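  A quick sanity check on those scores (my own arithmetic, assuming the common transform s = 1/(1 + d), where d is the euclidean distance over co-rated items): users 1 and 2 disagree only on items 3 and 4, so d = sqrt(1.5^2 + 2.0^2) = 2.5 and s = 1/(1 + 2.5) = 0.2857..., matching the (2, 0.2857142857142857) entry above; users 1 and 6 differ only by 0.5 on item 2, giving s = 1/(1 + 0.5) = 0.6667, matching (6, 0.66666666666666663).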
  34-39. The current Crab: making recommendations. recommend(5) returns (item_id, predicted rating) pairs for user 5, and recommended_because lists the neighbours' (user_id, rating) entries that support recommending item 1 to user 5.

        >>> from crab.recommenders.knn import UserBasedRecommender
        >>> recsys = UserBasedRecommender(model=m, similarity=similarity,
        ...                               capper=True, with_preference=True)
        >>> recsys.recommend(5)
        array([[ 5.        ,  3.45712869],
               [ 1.        ,  2.78857832],
               [ 6.        ,  2.38193068]])
        >>> recsys.recommended_because(user_id=5, item_id=1)
        array([[ 2. ,  3. ],
               [ 1. ,  3. ],
               [ 6. ,  3. ],
               [ 7. ,  2.5],
               [ 4. ,  2.5]])
  40. The current Crab: using REST APIs to deploy the recommender, via django-piston, django-rest-framework or django-tastypie.
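  The deck does not show the REST glue itself; as a rough illustration only (my sketch, not Crab's API or the production code; the module path and endpoint are hypothetical), a plain Django view could expose the recommender built above as JSON:

        # views.py (hypothetical endpoint, not part of Crab)
        import json
        from django.http import HttpResponse
        from myproject.engine import recsys  # the UserBasedRecommender built above

        def recommend(request, user_id):
            # recommend() returns (item_id, predicted score) rows here
            # because the recommender was built with with_preference=True
            recs = recsys.recommend(int(user_id))
            payload = [{'item_id': int(i), 'score': float(s)} for i, s in recs]
            return HttpResponse(json.dumps(payload), content_type='application/json')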
  41. Crab is already in production: news recommendations for the Abril publisher! Collecting over 10 magazines, 20 books and 100+ articles. Running on Python + SciPy + Django. Content-based filtering. Easy-to-use interface. Still in development.
  42. Content-based filtering. (Diagram: user Marcel likes "Duro de Matar"; similar items "O Vento Levou", "Toy Store" and "Armagedon" are recommended.)
  43. Crab is already in production: PythonBrasil keynotes recommender. Recommending keynotes based on a hybrid approach: content-based filtering + collaborative filtering. Running on Python + SciPy + Django. Schedule your keynotes! Still in development.
  44. Crab is already in production: a mobile meta recommender (research prototype; the slide excerpts the accompanying paper). The proposed architecture combines content-based filtering over product/service catalogue repositories with collaborative filtering over a repository of user reviews collected from data sources such as the location-based social network Foursquare and Google HotPot; text mining classifies each review's polarity as positive or negative, and the final product recommendation score integrates the results of both recommenders. Symbolic data analysis (SDA) is considered for modelling the user ratings/reviews, and each recommendation is delivered with an explanation built from relevant reviews of similar users, which should increase confidence in the buying decision and the product acceptance rate. (Figures: "Meta Recommender Architecture", "User Reviews from Foursquare Social Network", "Mobile Recommender System Architecture".)
  45. Crab is already in production: the Brazilian social network Atepassar.com, an educational network with more than 60,000 students and 120 video classes. Running on Python + NumPy + SciPy and Django. Backend for recommendations: MongoDB (mongoengine). Daily recommendations with explanations.
  46. Evaluating your recommender. Crab implements the most used recommender metrics: precision, recall, F1-score, RMSE, with matplotlib as a plotting utility. Planned: implement new metrics, and maybe simulation support (??).
  47-53. Evaluating your recommender.

        >>> from crab.metrics.classes import CfEvaluator
        >>> evaluator = CfEvaluator()
        >>> evaluator.evaluate(recommender=recsys, metric='rmse')
        {'rmse': 0.69467177857026907}
        >>> evaluator.evaluate_on_split(recommender=recsys, at=2)
        ({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568},
                    {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788},
                    {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}],
          'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall': 0.55677},
                 {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955},
                 {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]},
         {'final_score': {'avg': {'f1score': 0.495955, 'mae': 0.429292,
                                  'nmae': 0.373739, 'precision': 0.63932929,
                                  'recall': 0.729939393, 'rmse': 0.3466868},
                          'stdev': {'f1score': 0.09938383, 'mae': 0.0593933,
                                    'nmae': 0.03393939, 'precision': 0.0192929,
                                    'recall': 0.031293939, 'rmse': 0.234949494}}})
  54. Distributing the recommendation computations. Use Hadoop and MapReduce intensively; we are investigating Yelp's mrjob framework (https://github.com/Yelp/mrjob). Implement the state-of-the-art techniques popularized by the Netflix Prize: matrix factorization, singular value decomposition (SVD), Boltzmann machines. The most commonly used is the Slope One technique: simple algebra, with linear predictors of the form y = a*x + b where the slope a is 1, hence "slope one".
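  The deck never shows Slope One code; purely as an illustration (my sketch, not Crab's implementation), a weighted Slope One over the same {user: {item: rating}} dicts used earlier could look like this:

        from collections import defaultdict

        def slope_one_train(data):
            """Average deviation devs[i][j] of item i's rating over item j's,
            across all users who rated both items."""
            freqs = defaultdict(lambda: defaultdict(int))
            devs = defaultdict(lambda: defaultdict(float))
            for ratings in data.values():
                for i, ri in ratings.items():
                    for j, rj in ratings.items():
                        if i != j:
                            freqs[i][j] += 1
                            devs[i][j] += ri - rj
            for i in devs:
                for j in devs[i]:
                    devs[i][j] /= freqs[i][j]
            return devs, freqs

        def slope_one_predict(user_ratings, devs, freqs, item):
            """Predict a rating for `item`: each item j the user rated votes
            rj + devs[item][j], weighted by how many users back that deviation."""
            num = den = 0.0
            for j, rj in user_ratings.items():
                if j != item and item in devs and j in devs[item]:
                    num += (rj + devs[item][j]) * freqs[item][j]
                    den += freqs[item][j]
            return num / den if den else None

  With the sample dataset above, slope_one_train(data.data) followed by slope_one_predict(data.data[5], devs, freqs, 1) would predict user 5's (Toby's) rating for item 1.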
  55-61. Cache/parallelism with joblib. http://packages.python.org/joblib/index.html

        from joblib import Memory
        memory = Memory(cachedir='', verbose=0)

        class UserSimilarity(BaseSimilarity):
            ...
            @memory.cache
            def get_similarity(self, source_id, target_id):
                source_preferences = self.model.preferences_from_user(source_id)
                target_preferences = self.model.preferences_from_user(target_id)
                ...
                return self.distance(source_preferences, target_preferences) \
                    if not source_preferences.shape[1] == 0 \
                    and not target_preferences.shape[1] == 0 else np.array([[np.nan]])

            def get_similarities(self, source_id):
                return [(other_id, self.get_similarity(source_id, other_id))
                        for other_id, v in self.model]

        >>> # without memory.cache
        >>> timeit similarity.get_similarities('marcel_caraciolo')
        100 loops, best of 3: 978 ms per loop

        >>> # with memory.cache
        >>> timeit similarity.get_similarities('marcel_caraciolo')
        100 loops, best of 3: 434 ms per loop
  62. Cache/parallelism with joblib. http://packages.python.org/joblib/index.html Investigate how to use the multiprocessing and parallel packages for the similarity computations:

        from joblib import Parallel, delayed
        ...
        def get_similarities(self, source_id):
            # Parallel consumes delayed(...) calls; pair the scores
            # back up with their user ids afterwards
            other_ids = [other_id for other_id, v in self.model]
            scores = Parallel(n_jobs=3)(
                delayed(self.get_similarity)(source_id, other_id)
                for other_id in other_ids)
            return zip(other_ids, scores)
  63-66. Distributed computing with mrjob. https://github.com/Yelp/mrjob It supports Amazon's Elastic MapReduce (EMR) service, your own Hadoop cluster, or local execution (for testing).

        """The classic MapReduce job: count the frequency of words."""
        from mrjob.job import MRJob
        import re

        WORD_RE = re.compile(r"[\w']+")

        class MRWordFreqCount(MRJob):

            def mapper(self, _, line):
                for word in WORD_RE.findall(line):
                    yield (word.lower(), 1)

            def reducer(self, word, counts):
                yield (word, sum(counts))

        if __name__ == '__main__':
            MRWordFreqCount.run()
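  For reference (standard mrjob usage, not shown on the slides; assuming the job above is saved as mr_word_freq_count.py), the same script runs locally by default and on EMR or a Hadoop cluster via the -r switch:

        $ python mr_word_freq_count.py input.txt
        $ python mr_word_freq_count.py -r emr input.txt
        $ python mr_word_freq_count.py -r hadoop input.txt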
  67-68. Distributed computing with mrjob. https://github.com/Yelp/mrjob Elsayed et al., "Pairwise Document Similarity in Large Collections with MapReduce".
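  The deck only cites the paper; the following is my own compressed mrjob adaptation of Elsayed et al.'s idea (the input format and all names are assumptions): step 1 groups term weights into postings lists, step 2 emits one partial product per document pair sharing a term, and the final reduce sums the partials into each pair's inner product.

        from itertools import combinations
        from mrjob.job import MRJob
        from mrjob.step import MRStep

        class MRPairwiseSimilarity(MRJob):

            def steps(self):
                return [MRStep(mapper=self.mapper_postings,
                               reducer=self.reducer_partial_products),
                        MRStep(reducer=self.reducer_sum)]

            def mapper_postings(self, _, line):
                # assumed input lines: "doc_id<TAB>term<TAB>weight"
                doc_id, term, weight = line.split('\t')
                yield term, (doc_id, float(weight))

            def reducer_partial_products(self, term, postings):
                # one partial dot product per pair of documents sharing this term
                for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
                    yield (d1, d2), w1 * w2

            def reducer_sum(self, doc_pair, partials):
                yield doc_pair, sum(partials)

        if __name__ == '__main__':
            MRPairwiseSimilarity.run()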
  69-71. Future studies with sparse matrices. Real datasets come with lots of empty values (see http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html). Solutions: the scipy.sparse package, sharding operations, and matrix factorization techniques (SVD). Crab implements a matrix factorization with the expectation-maximization algorithm, in the scikits.crab.svd package. (Chart: Apontador Reviews Dataset.)
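  As a small illustration of the first solution (my sketch; Crab's own models are not shown here), the {user: {item: rating}} dicts from the sample dataset map directly onto a scipy.sparse matrix:

        from scipy.sparse import csr_matrix

        # a slice of the sample_movies ratings, {user_id: {item_id: rating}}
        ratings = {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
                   5: {2: 4.5, 3: 1.0, 4: 4.0}}
        rows, cols, vals = [], [], []
        for user_id, prefs in ratings.items():
            for item_id, rating in prefs.items():
                rows.append(user_id - 1)   # 0-based indices
                cols.append(item_id - 1)
                vals.append(rating)
        m = csr_matrix((vals, (rows, cols)), shape=(7, 6))
        print m.nnz, 'stored ratings instead of', 7 * 6, 'dense cells'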
  72-74. Optimizations with Cython. http://cython.org/ Cython is a Python extension that lets developers annotate functions so they can be compiled to C. http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html

        # setup.py
        from distutils.core import setup
        from distutils.extension import Extension
        from Cython.Distutils import build_ext

        # for notes on compiler flags see:
        # http://docs.python.org/install/index.html
        setup(
            cmdclass={'build_ext': build_ext},
            ext_modules=[Extension("spearman_correlation_cython",
                                   ["spearman_correlation_cython.pyx"])],
        )
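  The .pyx file itself is not shown on the slides; a minimal kernel of the kind that setup.py would compile might look like this (my sketch, not Crab's actual Spearman code):

        # spearman_correlation_cython.pyx (illustrative sketch)
        def sum_squared_rank_diffs(ranks_a, ranks_b):
            # typed locals let Cython generate a tight C loop
            cdef int i
            cdef int n = len(ranks_a)
            cdef double d
            cdef double total = 0.0
            for i in range(n):
                d = ranks_a[i] - ranks_b[i]
                total += d * d
            return total

  Spearman's coefficient over distinct ranks then follows as rho = 1 - 6 * total / (n * (n*n - 1)), and the module builds in place with: python setup.py build_ext --inplace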
  75-78. Benchmarks: time elapsed to recommend 5 items on the MovieLens 100k dataset (http://www.grouplens.org/node/73).

        Dataset           Old Crab (pure Python with dicts)   New Crab (Python with SciPy and NumPy)
        MovieLens 100k    15.32 s                             9.56 s
  79. Why migrate? The old Crab ran on pure Python only, while recommendations demand heavy math and lots of processing. Compatibility with NumPy and SciPy: high-standard, popular libraries optimized for scientific computing in Python. Scikits projects are amazing: active communities, scientific conferences and actively maintained projects (e.g. scikit-learn). Make the Crab framework visible to the community: join the scientific researchers and machine learning developers around the globe coding in Python to help us with this project. Be fast and furious.
  80. Why migrate? NumPy optimized with PyPy: 2x to 48x faster. http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html
  81. How are we working? Sprints, online discussions and issues. https://github.com/muricoca/crab/wiki/UpcomingEvents
  82. How are we working? Our project's home page: http://muricoca.github.com/crab
  83. Future releases. Planned release 0.1: collaborative filtering algorithms working, sample datasets to load and test. Planned release 0.11: sparse matrix and database model support. Planned release 0.12: Slope One algorithm, new factorization techniques implemented, ...
  84. Join us!
      1. Read our wiki page: https://github.com/muricoca/crab/wiki/Developer-Resources
      2. Check out our current sprints and open issues: https://github.com/muricoca/crab/issues
      3. Forks and pull requests are mandatory!
      4. Join us at irc.freenode.net #muricoca, or on our discussion list: http://groups.google.com/group/scikit-crab
  85. Recommended books: Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007; Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009. Also: ACM RecSys, KDD, SBSC, ...
  86. Crab: A Python Framework for Building Recommendation Engines. https://github.com/muricoca/crab Marcel Caraciolo (@marcelcaraciolo), Ricardo Caspirro (@ricardocaspirro), Bruno Melo (@brunomelo). {marcel, ricardo, bruno}@muricoca.com
