- 1. New Directions in Mahout's Recommenders
  Sebastian Schelter, Apache Software Foundation
  Recommender Systems Get-together Berlin
- 2. New Directions?
  Mahout in Action is the prime source of information for using Mahout in practice.
  As it is more than two years old, it is missing a lot of recent developments.
  This talk describes what has been added to the recommenders of Mahout since then.
- 3. Single machine recommenders
- 4. MyMediaLite, scientific library of recommender system algorithms.
  Mahout now features a couple of popular latent factor models, mostly ported by Zeno Gantner.
- 5. New recommenders and factorizers
  - BiasedItemBasedRecommender, item-based kNN with user-item-bias estimation
    (Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD '09)
  - RatingSGDFactorizer, biased matrix factorization
    (Koren et al.: Matrix Factorization Techniques for Recommender Systems, IEEE Computer '09)
  - SVDPlusPlusFactorizer, SVD++
    (Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD '08)
  - ALSWRFactorizer, matrix factorization using Alternating Least Squares
    (Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM '08;
     Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM '08)
- 6. Batch item similarities on a single machine
  Simple but powerful way to deploy Mahout: use item-based collaborative filtering with periodically precomputed item similarities.
  Mahout now supports multithreaded item similarity computation on a single machine for data sizes that don't require a Hadoop-based solution.

    DataModel dataModel = new FileDataModel(new File("movielens.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
    ItemBasedRecommender recommender =
        new GenericItemBasedRecommender(dataModel, similarity);
    BatchItemSimilarities batch =
        new MultithreadedBatchItemSimilarities(recommender, k);
    batch.computeItemSimilarities(numThreads, maxDurationInHours,
        new FileSimilarItemsWriter(resultFile));
- 7. Parallel processing
- 8. Collaborative Filtering
  Idea: infer recommendations from patterns found in the historical user-item interactions.
  Data can be explicit feedback (ratings) or implicit feedback (clicks, pageviews), represented in the interaction matrix A:

            item1  ...  item3  ...
    user1     3    ...    4    ...
    user2     -    ...    4    ...
    user3     5    ...    1    ...
     ...     ...   ...   ...   ...

  Row a_i denotes the interaction history of user i.
  We target use cases with millions of users and hundreds of millions of interactions.
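A minimal sketch of such a row-wise sparse interaction matrix, where each user's interaction history a_i is one inner map. This is an illustrative data structure, not Mahout's DataModel API; all names here are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Mahout's DataModel): a sparse interaction matrix A
// stored row-wise, so a user's interaction history a_i is one inner map.
public class InteractionMatrix {
  private final Map<Long, Map<Long, Double>> rows = new HashMap<>();

  // record that user userId interacted with item itemId with the given strength
  public void set(long userId, long itemId, double value) {
    rows.computeIfAbsent(userId, k -> new HashMap<>()).put(itemId, value);
  }

  // the interaction history a_i of a user (empty map if the user is unknown)
  public Map<Long, Double> row(long userId) {
    return rows.getOrDefault(userId, new HashMap<>());
  }

  public static void main(String[] args) {
    InteractionMatrix a = new InteractionMatrix();
    a.set(1, 1, 3); a.set(1, 3, 4);   // user1 rated item1=3, item3=4
    a.set(2, 3, 4);                   // user2 rated item3=4
    a.set(3, 1, 5); a.set(3, 3, 1);
    System.out.println(a.row(1));     // user1's interaction history
  }
}
```

Storing only the observed cells matters at this scale: with millions of users, the dense matrix would not fit in memory, while the observed interactions do.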
- 9. MapReduce
  Paradigm for data-intensive parallel processing:
  - data is partitioned in a distributed file system
  - computation is moved to the data
  - system handles distribution, execution, scheduling, failures
  Fixed processing pipeline where the user specifies two functions:
    map    : (k1, v1) -> list(k2, v2)
    reduce : (k2, list(v2)) -> list(v2)
  [diagram: input partitions in the DFS flow through parallel map tasks, a shuffle, and reduce tasks back to DFS output]
- 10. Scalable neighborhood methods
- 11. Neighborhood Methods
  Item-based collaborative filtering is one of the most widely deployed CF algorithms, because it is:
  - simple and intuitively understandable
  - additionally gives non-personalized, per-item recommendations (people who like X might also like Y)
  - recommendations for new users without model retraining
  - comprehensible explanations (we recommend Y because you liked X)
- 12. Cooccurrences
  Start with a simplified view: imagine the interaction matrix A was binary
  -> we look at cooccurrences only.
  Item similarity computation becomes matrix multiplication:

    r_i = (AᵀA) a_i

  Scale-out of the item-based approach reduces to finding an efficient way to compute the item similarity matrix

    S = AᵀA
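A small in-memory sketch of this view, using dense arrays purely for clarity (real data is sparse and distributed): S = AᵀA counts, for every item pair, how many users interacted with both, and S·a_i then scores items for user i.

```java
// Sketch only: cooccurrence view of item-based CF on a tiny binary matrix.
public class Cooccurrences {
  // S = AᵀA for a binary users-by-items matrix
  static int[][] cooccurrences(int[][] a) {
    int items = a[0].length;
    int[][] s = new int[items][items];
    for (int[] row : a)                    // each user's interaction history
      for (int f = 0; f < items; f++)
        for (int j = 0; j < items; j++)
          s[f][j] += row[f] * row[j];      // counts users who have both f and j
    return s;
  }

  // score vector r_i = S a_i for one user's history a_i
  static int[] score(int[][] s, int[] ai) {
    int[] r = new int[s.length];
    for (int f = 0; f < s.length; f++)
      for (int j = 0; j < s.length; j++)
        r[f] += s[f][j] * ai[j];
    return r;
  }

  public static void main(String[] args) {
    int[][] a = { {1, 0, 1}, {0, 0, 1}, {1, 0, 1} };   // 3 users, 3 items
    int[][] s = cooccurrences(a);
    System.out.println(java.util.Arrays.toString(score(s, a[1])));
  }
}
```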
- 13. Parallelizing S = AᵀA
  The standard approach of computing item cooccurrences requires random access to both users and items:

    foreach item f do
      foreach user i who interacted with f do
        foreach item j that i also interacted with do
          S_fj = S_fj + 1

  -> not efficiently parallelizable on partitioned data.
  The row outer product formulation of matrix multiplication is efficiently parallelizable on a row-partitioned A:

    S = AᵀA = Σ_{i∈A} a_i a_iᵀ

  Mappers compute the outer products of rows of A, emit the results row-wise, reducers sum these up to form S.
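The outer-product formulation can be sketched locally: each "mapper" sees only one row a_i of A and emits the outer product a_i a_iᵀ, and summing those partial matrices (the reducer's job) yields S = AᵀA without any random access across users.

```java
// Sketch of the row outer-product formulation S = Σ a_i a_iᵀ. The per-row loop
// plays the role of a map task, the summation the role of the reduce side.
public class OuterProductSum {
  // outer product a_i a_iᵀ of a single user's history
  static int[][] outerProduct(int[] ai) {
    int n = ai.length;
    int[][] p = new int[n][n];
    for (int f = 0; f < n; f++)
      for (int j = 0; j < n; j++)
        p[f][j] = ai[f] * ai[j];
    return p;
  }

  // reducer-side summation of partial matrices
  static int[][] sum(int[][] s, int[][] p) {
    for (int f = 0; f < s.length; f++)
      for (int j = 0; j < s.length; j++)
        s[f][j] += p[f][j];
    return s;
  }

  public static void main(String[] args) {
    int[][] a = { {1, 0, 1}, {0, 0, 1}, {1, 0, 1} };
    int[][] s = new int[3][3];
    for (int[] row : a)              // one "map" call per partitioned row of A
      sum(s, outerProduct(row));
    System.out.println(s[0][2]);     // cooccurrence count of items 0 and 2
  }
}
```

The key property is that each row contributes independently, so A can be partitioned by user across machines.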
- 14. Parallel similarity computation
  Real datasets are not binary and we want to use a variety of similarity measures, e.g. Pearson correlation.
  Express similarity measures by 3 canonical functions, which can be efficiently embedded into the computation (cf. VectorSimilarityMeasure):
  - preprocess adjusts an item rating vector:
      f̂ = preprocess(f)    ĵ = preprocess(j)
  - norm computes a single number from the adjusted vector:
      n_f = norm(f̂)    n_j = norm(ĵ)
  - similarity computes the similarity of two vectors from the norms and their dot product:
      S_fj = similarity(dot_fj, n_f, n_j)
- 15. Example: Jaccard coefficient
  preprocess binarizes the rating vectors:
      f = (3, -, 5)ᵀ    j = (4, 4, 1)ᵀ
      f̂ = bin(f) = (1, 0, 1)ᵀ    ĵ = bin(j) = (1, 1, 1)ᵀ
  norm computes the number of users that rated each item:
      n_f = ||f̂||₁ = 2    n_j = ||ĵ||₁ = 3
  similarity finally computes the Jaccard coefficient from the norms and the dot product of the vectors:
      jaccard(f, j) = |f ∩ j| / |f ∪ j| = dot_fj / (n_f + n_j − dot_fj) = 2 / (2 + 3 − 2) = 2/3
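The three canonical functions for the Jaccard case can be sketched directly, mirroring the slide's example. This is an illustrative decomposition, not Mahout's actual VectorSimilarityMeasure interface; missing ratings are encoded as 0 here.

```java
// Sketch of preprocess / norm / similarity for the Jaccard coefficient.
public class JaccardMeasure {
  // preprocess: binarize the rating vector (0 marks a missing rating)
  static double[] preprocess(double[] v) {
    double[] b = new double[v.length];
    for (int k = 0; k < v.length; k++) b[k] = v[k] != 0 ? 1 : 0;
    return b;
  }

  // norm: number of users that rated the item (L1 norm of the binarized vector)
  static double norm(double[] b) {
    double n = 0;
    for (double x : b) n += x;
    return n;
  }

  // similarity: Jaccard coefficient from the two norms and the dot product
  static double similarity(double dot, double nf, double nj) {
    return dot / (nf + nj - dot);
  }

  public static void main(String[] args) {
    double[] f = preprocess(new double[]{3, 0, 5});   // f = (3, -, 5)
    double[] j = preprocess(new double[]{4, 4, 1});   // j = (4, 4, 1)
    double dot = 0;
    for (int k = 0; k < f.length; k++) dot += f[k] * j[k];
    System.out.println(similarity(dot, norm(f), norm(j)));   // 2/3
  }
}
```

The point of the decomposition is that preprocess and norm run once per item vector, while only the cheap similarity call runs per item pair.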
- 16. Implementation in Mahout
  - o.a.m.math.hadoop.similarity.cooccurrence.RowSimilarityJob computes the top-k pairwise similarities for each row of a matrix using some similarity measure
  - o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob computes the top-k similar items per item using RowSimilarityJob
  - o.a.m.cf.taste.hadoop.item.RecommenderJob computes recommendations and similar items using RowSimilarityJob
- 17. MapReduce pass 1
  - data partitioned by items (row-partitioned Aᵀ)
  - invokes preprocess and norm for each item vector
  - transposes the input to form A
  [diagram: map, combine, shuffle and reduce turn the binarized Aᵀ (pointing from items to users) into A (pointing from users to items), and emit the per-item "norms" vector]
- 18. MapReduce pass 2
  - data partitioned by users (row-partitioned A)
  - computes dot products of columns
  - loads norms and invokes similarity
  - implementation contains several optimizations (sparsification, exploiting symmetry and thresholds)
  [diagram: map, combine, shuffle and reduce sum the row outer products of the binarized A into the upper triangle of "AᵀA", which holds the item similarities, combined with the per-item "norms"]
- 19. Cost of the algorithm
  The major cost in our algorithm is the communication in the second MapReduce pass: for each user, we have to process the square of the number of his interactions.

    S = Σ_{i∈A} a_i a_iᵀ

  -> cost is dominated by the densest rows of A (the users with the highest number of interactions).
  The distribution of interactions per user is usually heavy tailed
  -> a small number of power users with a disproportionately high amount of interactions drastically increases the runtime.
  If a user has more than p interactions, only use a random sample of size p of his interactions.
  We saw negligible effect on prediction quality for moderate p.
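The sampling step above can be sketched as follows. Names here are illustrative, not Mahout's: the idea is simply that capping every user's history at p interactions bounds each row's contribution to the outer-product cost at p².

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the interaction cut: users with more than p interactions are
// down-sampled to a random subset of size p before the similarity computation.
public class InteractionCut {
  static List<Long> sampleDown(List<Long> interactions, int p, Random random) {
    if (interactions.size() <= p) {
      return interactions;                 // ordinary users stay untouched
    }
    List<Long> copy = new ArrayList<>(interactions);
    Collections.shuffle(copy, random);     // uniform random sample of size p
    return copy.subList(0, p);
  }

  public static void main(String[] args) {
    List<Long> powerUser = new ArrayList<>();
    for (long item = 0; item < 1000; item++) powerUser.add(item);
    // a power user with 1000 interactions is cut down to 500
    System.out.println(sampleDown(powerUser, 500, new Random(42)).size());
  }
}
```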
- 20. Scalable Neighborhood Methods: Experiments
  Setup:
  - 26 machines running Java 7 and Hadoop 1.0.4
  - two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives per machine
  Results:
  - Yahoo Songs dataset (700M datapoints, 1.8M users, 136K items), 26 machines: similarity computation takes less than 40 minutes
- 21. Scalable matrix factorization
- 22. Latent factor models: idea
  - interactions are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action or complexity of characters in movies)
  - these factors are in general not obvious; we might be able to think of some of them, but it's hard to estimate their impact on the interactions
  - need to infer those so-called latent factors from the interaction data
- 23. Low-rank matrix factorization
  Approximately factor A into the product of two rank-r feature matrices U and M such that A ≈ UM.
  U models the latent features of the users, M models the latent features of the items.
  The dot product uᵢᵀmⱼ in the latent feature space predicts the strength of interactions between user i and item j.
  To obtain a factorization, minimize the regularized squared error over the observed interactions, e.g.:

    min_{U,M}  Σ_{(i,j)∈A} (a_ij − uᵢᵀmⱼ)²  +  λ ( Σ_i n_{u_i} uᵢᵀuᵢ  +  Σ_j n_{m_j} mⱼᵀmⱼ )
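Prediction and objective can be sketched in a few lines. For brevity this sketch uses plain L2 regularization; the slide's objective additionally weights each term by the interaction counts n_{u_i} and n_{m_j} (the ALS-WR variant). Unobserved cells are encoded as 0.

```java
// Sketch: predictions and regularized squared error for a rank-r factorization.
public class LowRankModel {
  // predicted interaction strength uᵢᵀmⱼ
  static double predict(double[] ui, double[] mj) {
    double dot = 0;
    for (int k = 0; k < ui.length; k++) dot += ui[k] * mj[k];
    return dot;
  }

  // squared error over observed interactions plus simple L2 regularization
  // (the weighting by n_{u_i}, n_{m_j} from the slide is omitted here)
  static double loss(double[][] a, double[][] u, double[][] m, double lambda) {
    double err = 0;
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[i].length; j++)
        if (a[i][j] != 0) {                       // only observed cells
          double d = a[i][j] - predict(u[i], m[j]);
          err += d * d;
        }
    for (double[] ui : u) err += lambda * predict(ui, ui);   // λ ||uᵢ||²
    for (double[] mj : m) err += lambda * predict(mj, mj);   // λ ||mⱼ||²
    return err;
  }

  public static void main(String[] args) {
    double[][] u = { {1, 2} };                    // one user, r = 2
    double[][] m = { {3, 1} };                    // one item
    System.out.println(predict(u[0], m[0]));      // 1*3 + 2*1 = 5.0
  }
}
```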
- 24. Alternating Least Squares
  ALS rotates between fixing U and M. When U is fixed, the system recomputes M by solving a least-squares problem per item, and vice versa.
  - easy to parallelize, as all users (and vice versa, items) can be recomputed independently
  - additionally, ALS is able to solve non-sparse models from implicit data

    A (u × i)  ≈  U (u × k)  ×  M (k × i)
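One per-user update can be sketched for k = 2 latent factors: with M fixed, uᵢ is recomputed by solving the normal equations (MᵢᵀMᵢ + λ n_{u_i} I) uᵢ = Mᵢᵀ aᵢ over the items the user rated. This is an illustrative sketch, not Mahout's solver; the 2×2 system is solved with Cramer's rule to stay dependency-free.

```java
// Sketch of one ALS half-step for a single user with k = 2 latent factors.
public class AlsUserUpdate {
  // ratedItemFeatures holds the feature vectors of the items the user rated,
  // ratings the corresponding ratings; lambda is the regularization weight.
  static double[] recomputeUser(double[][] ratedItemFeatures, double[] ratings, double lambda) {
    int n = ratings.length;                       // n_{u_i}
    double a = 0, b = 0, d = 0, r0 = 0, r1 = 0;   // system [a b; b d] u = [r0; r1]
    for (int t = 0; t < n; t++) {
      double[] m = ratedItemFeatures[t];
      a += m[0] * m[0]; b += m[0] * m[1]; d += m[1] * m[1];   // MᵢᵀMᵢ
      r0 += m[0] * ratings[t]; r1 += m[1] * ratings[t];       // Mᵢᵀaᵢ
    }
    a += lambda * n; d += lambda * n;             // + λ n_{u_i} I (ALS-WR style)
    double det = a * d - b * b;
    return new double[] { (r0 * d - r1 * b) / det, (a * r1 - b * r0) / det };
  }

  public static void main(String[] args) {
    double[][] m = { {1, 0}, {0, 1} };            // features of the two rated items
    double[] u = recomputeUser(m, new double[]{4, 2}, 0.0);
    System.out.println(u[0] + " " + u[1]);        // 4.0 2.0 for λ = 0
  }
}
```

Because each uᵢ depends only on M and the user's own ratings, all user updates can run in parallel, which is exactly what the broadcast-join implementation on the following slides exploits.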
- 25. Implementation in Mahout
  - o.a.m.cf.taste.hadoop.als.ParallelALSFactorizationJob computes a factorization using Alternating Least Squares; has different solvers for explicit and implicit data
    (Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM '08;
     Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM '08)
  - o.a.m.cf.taste.hadoop.als.FactorizationEvaluator computes the prediction error of a factorization on a test set
  - o.a.m.cf.taste.hadoop.als.RecommenderJob computes recommendations from a factorization
- 26. Scalable Matrix Factorization: Implementation
  Recompute the user feature matrix U using a broadcast-join:
  1. Run a map-only job using multithreaded mappers.
  2. Load the item-feature matrix M into memory from HDFS to share it among the individual mappers.
  3. Mappers read the interaction histories of the users.
  4. Multithreaded: solve a least-squares problem per user to recompute its feature vector.
  [diagram: M is broadcast to every machine; each machine hash-joins its local partition of the user histories A with M and recomputes its share of the user features U]
- 27. Scalable Matrix Factorization: Experiments
  Setup:
  - 26 machines running Java 7 and Hadoop 1.0.4
  - two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives per machine
  - configured Hadoop to reuse JVMs, ran multithreaded mappers
  Results:
  - Yahoo Songs dataset (700M datapoints), 26 machines: a single iteration (two map-only jobs) takes less than 2 minutes
- 28. Thanks for listening!
  Follow me on twitter at http://twitter.com/sscdotopen
  Join Mahout's mailing lists at http://s.apache.org/mahout-lists
  Picture on slide 3 by Tim Abott, http://www.flickr.com/photos/theabbott/
  Picture on slide 21 by Crimson Diabolics, http://crimsondiabolics.deviantart.com/
