Slope One Recommender on Hadoop

Introduction about the MapReduce distributed version of SlopeOne in Mahout

Published in: Technology
Transcript

  • 1. Slope One Recommender on Hadoop. Yong Zheng, Center for Web Intelligence, DePaul University. Nov 15, 2012
  • 2. Overview
    - Introduction
    - Recommender Systems & the Slope One Recommender
    - Distributed Slope One on Mahout and Hadoop
    - Experimental Setup and Analyses
    - Drive Mahout on Hadoop
    - Interesting Communities
    Center for Web Intelligence, DePaul University, USA
  • 3. Introduction
    About me: a recommendation guy.
    My research: data mining and recommender systems.
    Typical experimental research:
    1) Design or improve an algorithm;
    2) Run the algorithm and baseline algorithms on datasets;
    3) Compare experimental results;
    4) Try different parameters, find reasons, and even re-design and improve the algorithm itself;
    5) Repeat steps 2-4, and so on, until it approaches the expected results.
  • 4. Introduction
    Sometimes the data is large-scale: one algorithm may take days to complete. And what if the
    experimental results are not as expected? Then we improve the algorithm and run it for days
    again, and again.
    What could we do previously (for tasks that were not too complicated)?
    1) Parallelize, but with complicated synchronization and limited resources such as CPU and memory;
    2) Take advantage of PC labs: let's do it with 10 PCs.
    Nearly all research will ultimately face large-scale problems, especially in the domain of data mining.
    But we have MapReduce now!
  • 5. Introduction
    We do not need to distribute data and tasks manually; instead, we simply supply configurations.
    We do not need to care about further details, e.g. how the data is distributed, when a specific
    task will be run on which machine, or how tasks are conducted one by one.
    Instead, we can pre-define the workflow and take advantage of the functional contributions of
    mappers and reducers.
    More benefits: replication, load balancing, robustness, etc.
  • 6. Recommender Systems
    - Collaborative Filtering
    - Slope One and Simple Weighted Slope One
    - Slope One in Mahout
    - Distributed Slope One in Mahout
    - Mappers and Reducers
    Center for Web Intelligence, DePaul University, USA
  • 7. Recommender Systems
  • 8. Collaborative Filtering (CF)
    One of the most popular families of recommendation algorithms.
    User-based: User-CF. Item-based: Item-CF, Slope One.
    (Figure: example ratings matrix illustrating user-based collaborative filtering, where a user's
    unknown rating is predicted from the ratings of similar users.)
  • 9. Slope One Recommender
    Reference: Daniel Lemire, Anna Maclachlan. Slope One Predictors for Online Rating-Based
    Collaborative Filtering. SIAM Data Mining (SDM'05), April 21-23, 2005.
    http://lemire.me/fr/abstracts/SDM2005.html

    User  Batman  Spiderman
    U1    3       4
    U2    2       4
    U3    2       ?

    1) How differently were the two movies rated?
    U1 rated Spiderman higher by (4-3) = 1.
    U2 rated Spiderman higher by (4-2) = 2.
    On average, Spiderman is rated (1+2)/2 = 1.5 higher.
    2) The rating difference can drive predictions.
    If we know U3 gave Batman 2 stars, he will probably rate Spiderman (2+1.5) = 3.5 stars.
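The arithmetic on this slide can be sketched as a tiny standalone Java program. This is an illustrative sketch of mine, not Mahout's implementation; the class and variable names are assumptions for the example.

```java
// Plain-Java sketch of the basic Slope One idea from this slide
// (illustrative only; not Mahout code).
public class SlopeOneBasic {
    public static void main(String[] args) {
        // Ratings for Batman and Spiderman by U1 and U2.
        double[] batman = {3, 2};
        double[] spiderman = {4, 4};

        // Average difference: how much higher Spiderman is rated than Batman.
        double diffSum = 0;
        for (int u = 0; u < batman.length; u++) {
            diffSum += spiderman[u] - batman[u];
        }
        double avgDiff = diffSum / batman.length; // (1 + 2) / 2 = 1.5

        // U3 gave Batman a 2; predict Spiderman as 2 + 1.5 = 3.5.
        double u3Batman = 2;
        double prediction = u3Batman + avgDiff;
        System.out.println(prediction); // prints 3.5
    }
}
```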
  • 10. Simple Weighted Slope One
    Usually a user has rated multiple items.

    User  HarryPotter  Batman  Spiderman
    U1    5            3       4
    U2    ?            2       4
    U3    4            2       ?

    1) How differently were the two movies rated?
    Diff(Batman, Spiderman) = [(4-3)+(4-2)]/2 = 1.5
    Diff(HarryPotter, Spiderman) = (4-5)/1 = -1
    The "2" and "1" here are what we call the "count".
    2) Weighted rating differences can drive predictions.
    We use a simple weighted approach:
    Referring to Batman only, rating = 2+1.5 = 3.5.
    Referring to HarryPotter only, rating = 4-1 = 3.
    Considering them all, predicted rating = (3.5*2 + 3*1) / (2+1) = 3.33.
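The weighted prediction above can be written out in plain Java as well. Again a minimal sketch under the slide's numbers, not Mahout's code; the map-based layout is my own choice for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the simple weighted Slope One prediction from this slide
// (illustrative only; not Mahout code).
public class WeightedSlopeOne {
    public static void main(String[] args) {
        // U3's known ratings.
        Map<String, Double> u3 = new HashMap<>();
        u3.put("Batman", 2.0);
        u3.put("HarryPotter", 4.0);

        // Average diff toward Spiderman and the co-rating count, per item.
        Map<String, double[]> diffAndCount = new HashMap<>();
        diffAndCount.put("Batman", new double[]{1.5, 2});      // {diff, count}
        diffAndCount.put("HarryPotter", new double[]{-1.0, 1});

        // Weighted average: sum((rating + diff) * count) / sum(count).
        double weightedSum = 0, countSum = 0;
        for (Map.Entry<String, Double> e : u3.entrySet()) {
            double[] dc = diffAndCount.get(e.getKey());
            weightedSum += (e.getValue() + dc[0]) * dc[1];
            countSum += dc[1];
        }
        double prediction = weightedSum / countSum; // (3.5*2 + 3*1)/3
        System.out.printf("%.2f%n", prediction);    // prints 3.33
    }
}
```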
  • 11. Simple Weighted Slope One

    User  HarryPotter  Batman  Spiderman
    U1    5            3       4
    U2    ?            2       4
    U3    4            2       ?

    Question: online or offline?
    To calculate the predicted ratings, we need two matrices:
    1) The difference matrix:

            Movie1  Movie2  Movie3  Movie4
    Movie1
    Movie2  -1.5
    Movie3   2      1
    Movie4  -1      0.5     -2

    2) The count matrix: simply the number of users who co-rated each pair of items.
  • 12. Slope One in Mahout
    Mahout is an open-source machine learning library.
    1) Recommendation algorithms: User-based CF, Item-based CF, Slope One, etc.
    2) Clustering: KMeans, Fuzzy KMeans, etc.
    3) Classification: Decision Trees, Naive Bayes, SVM, etc.
    4) Latent factor models: LDA, SVD, Matrix Factorization, etc.
  • 13. Slope One in Mahout
    org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender

    Pre-processing stage (class MemoryDiffStorage, with a Map):
      for every item i
        for every other item j
          for every user u expressing a preference for both i and j
            add the difference in u's preferences for i and j to an average

    Recommendation stage:
      for every item i the user u expresses no preference for
        for every item j that user u expresses a preference for
          find the average preference difference between j and i
          add this diff to u's preference value for j
          add this to a running average
      return the top items, ranked by these averages

    Simple weighting: as introduced previously.
    StdDev weighting: item-item rating diffs with a lower standard deviation should be weighted more heavily.
  • 14. Distributed Slope One in Mahout
    Similar to our previous practice (e.g. the matrix factorization process), what we need is the
    difference matrix.
    Suppose M users rated N items; the matrix requires N(N-1)/2 cells. The density, i.e. how many
    items each user rated, is another aspect: if there are many items and the rating matrix is dense,
    the computational cost increases accordingly.
    Question again: online or offline? It depends on the task and the data.
    For large-scale data, let's do it offline!
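To make the N(N-1)/2 figure concrete, here is a one-line computation for the MovieLens-1M item count used in the experiments later (N = 3,900); the class name is mine:

```java
// Number of cells in the (triangular) item-item difference matrix for N items.
public class DiffMatrixSize {
    public static void main(String[] args) {
        long n = 3900; // movie count from the MovieLens-1M setup
        long cells = n * (n - 1) / 2;
        System.out.println(cells); // prints 7603050
    }
}
```

So even at MovieLens-1M scale the difference matrix has about 7.6 million candidate cells, which is why the deck opts for offline, distributed computation.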
  • 15. Distributed Slope One in Mahout
    package org.apache.mahout.cf.taste.hadoop.slopeone;
      class SlopeOneAverageDiffsJob
      class SlopeOnePrefsToDiffsReducer
      class SlopeOneDiffsToAveragesReducer
    package org.apache.mahout.cf.taste.hadoop;
      class ToItemPrefsMapper
    org.apache.hadoop.mapreduce.Mapper

    Two mapper-reducer stages:
    1) Create the diff matrix for each user;
    2) Collect the average diff info, counts, and StdDev.
    Let's see how it works…
  • 16. Mapper and Reducer - 1

    User  HarryPotter  Batman  Spiderman
    U1    5            3       4
    U2    ?            2       4
    U3    4            2       ?

    Mapper1 (ToItemPrefsMapper) emits <UserID, Pair<ItemID, Rating>>.
    Reducer1 (PrefsToDiffsReducer) emits <Pair<Item1, Item2>, Diff> (for all three users), e.g.:

    <U1>    Potter  Bat  Spider      <U2>    Potter  Bat  Spider
    Potter                           Potter
    Bat     -2                       Bat     NULL
    Spider  -1      1                Spider  NULL    2
  • 17. Mapper and Reducer - 2

    <U1>    Potter  Bat  Spider      <U2>    Potter  Bat  Spider
    Potter                           Potter
    Bat     -2                       Bat     NULL
    Spider  -1      1                Spider  NULL    2

    Mapper2 (org.apache.hadoop.mapreduce.Mapper)
    Reducer2 (DiffsToAveragesReducer) computes the average diffs, counts, and StdDev:

    <Aggregate>  Potter  Bat     Spider
    Potter
    Bat          -2, 1
    Spider       -1, 1   1.5, 2

    Here an <a, b> pair denotes a = average diff, b = count.
    Notice: in practice we should use three matrices; here I used two.
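The two stages above can be simulated in-memory with plain Java maps standing in for Hadoop's shuffle. This is a sketch of the data flow only; the real job uses Hadoop's Mapper/Reducer classes on HDFS, and all names below are mine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory simulation of the two MapReduce stages sketched above.
// Stage 1 groups ratings by user and emits per-user item-pair diffs;
// stage 2 averages the diffs per item pair and keeps a count.
public class SlopeOneStages {
    public static void main(String[] args) {
        // (user, item, rating) triples from the slide's example.
        String[][] ratings = {
            {"U1", "Potter", "5"}, {"U1", "Bat", "3"}, {"U1", "Spider", "4"},
            {"U2", "Bat", "2"}, {"U2", "Spider", "4"},
            {"U3", "Potter", "4"}, {"U3", "Bat", "2"}
        };

        // Stage 1 map: group into <UserID, Map<ItemID, Rating>>.
        Map<String, Map<String, Double>> byUser = new HashMap<>();
        for (String[] r : ratings) {
            byUser.computeIfAbsent(r[0], k -> new HashMap<>())
                  .put(r[1], Double.parseDouble(r[2]));
        }

        // Stage 1 reduce: per user, emit <Pair<Item1,Item2>, diff>.
        // Stage 2 reduce: sum the diffs and count co-ratings per pair.
        Map<String, double[]> diffs = new HashMap<>(); // pair -> {sum, count}
        for (Map<String, Double> prefs : byUser.values()) {
            List<String> items = new ArrayList<>(prefs.keySet());
            Collections.sort(items);
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String pair = items.get(i) + "," + items.get(j);
                    double d = prefs.get(items.get(j)) - prefs.get(items.get(i));
                    double[] sc = diffs.computeIfAbsent(pair, k -> new double[2]);
                    sc[0] += d;
                    sc[1] += 1;
                }
            }
        }
        // Average diffs and counts, matching the <Aggregate> matrix above,
        // e.g. (Bat, Spider) -> avgDiff=1.5 count=2, (Potter, Spider) -> avgDiff=-1 count=1.
        for (Map.Entry<String, double[]> e : diffs.entrySet()) {
            double[] sc = e.getValue();
            System.out.printf("%s: avgDiff=%.2f count=%d%n",
                              e.getKey(), sc[0] / sc[1], (long) sc[1]);
        }
    }
}
```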
  • 18. Predictions

    User  HarryPotter  Batman  Spiderman
    U1    5            3       4
    U2    ?            2       4
    U3    4            2       ?

    <Aggregate>  Potter  Bat     Spider
    Potter
    Bat          -2, 1
    Spider       -1, 1   1.5, 2

    Here an <a, b> pair denotes a = average diff, b = count.
    Notice: in practice we should use three matrices; here I used two.

    Prediction(U3, Spiderman) = [(4-1)*1 + (2+1.5)*2] / (1+2) = 3.33
  • 19. Experiments
    - Data
    - Hadoop Setup
    - Running Performances
    Center for Web Intelligence, DePaul University, USA
  • 20. Experiment Setup
    Data: MovieLens-1M ratings
      # of users: 6,040
      # of movies: 3,900
      # of ratings: 1,000,209
    Density of the ratings: each user has at least 20 ratings; obviously, some users have many more.
    Rating format: UserID, ItemID, Rating (scale 1-5)
    Data split: 80% training, 20% testing
  • 21. Experiment Setup
    Hadoop cluster setup: IBM SmartCloud, 1 master node, 7 slave nodes.
    Each node runs SUSE Linux Enterprise Server v11 SP1.
    Server configuration: 64-bit (vCPU: 2, RAM: 4 GiB, Disk: 60 GiB).
    Hadoop v0.20.205.0, Mahout distribution 0.6.
    The environment setup follows the typical workflow described at:
    http://irecsys.blogspot.com/2012/11/configurate-map-reduce-environment-on.html
    Thanks Scott Young, neat writeup!
  • 22. Experimental Analyses
    Stage 1: SlopeOneAverageDiffsJob by MapReduce
      Goal: build the DiffStorage
      Output: DiffStorage txt file, 1.45 GB
      Running time: real 13m 34.228s, user 0m 5.136s, sys 0m 1.028s

      Item1  Item2  Diff   Count  StdDev
      221    223    -1.02  197    0.5

    Stage 2: Java evaluator to measure MAE on the testing set
      Running time:
        Load testing set (21K records): 299 ms
        Load training set (79K records): 1,771 ms
        Load DiffStorage: 176,352 ms = 2.9 m
        Prediction (21K records): 18,182 ms = 0.3 m
      MAE = 0.71330756
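The MAE reported above is the mean absolute error between predicted and actual test ratings. A minimal sketch with made-up illustrative values (not the experiment's data, and not the deck's evaluator code):

```java
// Minimal MAE computation, as used conceptually by the stage-2 evaluator.
// The rating values here are illustrative only.
public class MaeExample {
    public static void main(String[] args) {
        double[] actual = {4, 2, 5, 3};
        double[] predicted = {3.5, 2.5, 4.0, 3.5};

        // MAE = mean of |actual - predicted| over the test set.
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        double mae = sum / actual.length;
        System.out.println(mae); // prints 0.625
    }
}
```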
  • 23. Experimental Experiences
    1. Why not the MovieLens-10M data?
       MapReduce on the 10M data may cost several hours; running time depends on the cluster and its
       configuration. Also, the DiffStorage file would be too large.
    2. Java evaluator
       Loading the full DiffStorage file is time-consuming, and incurs Java heap space and GC overhead
       limit errors; those errors cannot be fixed by -Xmx or other such remedies.
       Two solutions: 1) just use simple weighting and discard StdDev weighting; 2) a simple mapper
       and reducer, run on the cluster.
       For MovieLens-1M this is not that efficient compared with the live SlopeOne recommendation;
       the 10M data may fare better, and I will try MovieLens-10M later.
       Slope One is simple but memory-expensive.
  • 24. More…
    - Drive Mahout on Hadoop
    - Interesting Communities
    Center for Web Intelligence, DePaul University, USA
  • 25. Mahout + Hadoop
    How to put more Mahout algorithms on Hadoop?
    1. Pre-set commands in Mahout
       Run bin/mahout --help for a list of available programs, such as svd, fkmeans, etc.
       Some are basic functions, such as splitDataset; some can be executed as Hadoop tasks.
       E.g., run and evaluate matrix factorization on a rating dataset:

       bin/mahout parallelALS --input inputSource --output outputSource \
         --tempDir tmpFolder --numFeatures 20 --numIterations 10
       bin/mahout evaluateFactorization --input inputSource --output outputSource \
         --userFeatures als/out/U/ --itemFeatures als/out/M/ --tempDir tmpFolder
  • 26. Mahout + Hadoop
    2. More algorithms on Hadoop
       Mahout provides a way to run more of its algorithms on Hadoop. Simply:

       $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<version>.jar \
         <Job Class> --recommenderClassName Class <OPTIONS>

       Which kinds of jobs does it support? Mahout implements several. Some popular ones:
       1) org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob --recommenderClassName ClassName
       2) org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
       3) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
       4) org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob
  • 27. Interesting Communities
    Beyond the Hadoop and Mahout official sites:
    1. Data mining: KDnuggets, http://www.kdnuggets.com
       A popular community for data mining & analytics, with lots of useful information such as news,
       materials, datasets, jobs, etc.
    2. Big data:
       SmartData Collective, http://smartdatacollective.com/
       Smarter Computing, http://www.smartercomputingblog.com/
       Big Data Meetup, http://big-data.meetup.com/
    3. Recommender systems:
       ACM official site, http://recsys.acm.org/
       RecSys Wiki, http://recsyswiki.com/
  • 28. Thank You! Center for Web Intelligence, DePaul University, USA