
Collaborative filtering20081111


Collaborative filtering

Published in: Education, Technology

  1. 1. DCFLA: A Distributed Collaborative-Filtering Neighbor-Locating Algorithm. Authors: Bo Xie, Peng Han, Fan Yang, Rui-Min Shen, Hua-Jun Zeng, and Zheng Chen. Source: Information Sciences, vol. 177, no. 6, pp. 1349-1363, 2007. Professor: Dr. Shu-Ching Wang (王淑卿). Speaker: Yu-Chien Chou (周裕健). Date: Nov. 11, 2008
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Related Work </li></ul><ul><li>Novel Collaborative-Filtering </li></ul><ul><li>Recommendation Systems </li></ul><ul><li>Experiments </li></ul><ul><li>Conclusions </li></ul>
  3. 3. Introduction (1/2) <ul><li>Motivation </li></ul><ul><ul><li>Information overload </li></ul></ul><ul><ul><li>Comparison of recommendation approaches </li></ul></ul><ul><ul><ul><li>CB (content-based) – information filtering </li></ul></ul></ul><ul><ul><ul><li>CF (collaborative filtering) – memory-based </li></ul></ul></ul><ul><ul><ul><li>CF – model-based </li></ul></ul></ul><ul><ul><li>Focus on the scalability issue of CF </li></ul></ul><ul><ul><ul><li>Time and space complexity of the similarity calculation </li></ul></ul></ul><ul><ul><li>Design and implement a decentralized CF system </li></ul></ul>
  4. 4. Introduction (2/2) <ul><li>Purpose </li></ul><ul><ul><li>Reduce the time complexity </li></ul></ul><ul><ul><ul><li>O(M² N) → O(N + M log M) </li></ul></ul></ul><ul><ul><li>Increase the algorithm’s scalability </li></ul></ul><ul><ul><li>Propose the concept of most same opinion (MSO) </li></ul></ul><ul><ul><li>Propose the average rating normalization (ARN) technique to improve the MSO </li></ul></ul><ul><ul><li>Use a normalized rating instead of the raw rating </li></ul></ul>
  5. 5. Related work (1/5) <ul><li>Proposing a recommendation system </li></ul><ul><ul><li>Use the memory-based CF algorithm </li></ul></ul><ul><ul><li>Built on the peer-to-peer (P2P) architecture </li></ul></ul><ul><ul><ul><li>Distributed hash table (DHT) routing algorithms </li></ul></ul></ul><ul><li>Memory-based CF algorithm (a) </li></ul><ul><ul><li>Average of the votes </li></ul></ul>P_{a,j} is the predicted vote of active user a on item j, v'_a is the mean vote of user a, N is the number of users in the user database, v_{i,j} is the vote of user i on item j, v'_i is the mean vote of user i, and I_i is the set of items on which user i has voted.
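The formulas on this slide were images and did not survive extraction. The sketch below assumes the standard memory-based formulation: a user's mean vote is the average over the items that user rated, and the prediction is the active user's mean plus a normalized, similarity-weighted sum of the other users' deviations from their own means. The names `mean_vote`, `predict`, and the precomputed `weights` are hypothetical and are not taken from the slides.

```python
def mean_vote(votes):
    """Mean vote of one user; `votes` maps item id -> vote."""
    return sum(votes.values()) / len(votes)


def predict(active, others, weights, item):
    """Predict the active user's vote on `item`.

    active  : dict item -> vote for the active user a
    others  : dict user id -> (dict item -> vote)
    weights : dict user id -> similarity w(a, i), assumed precomputed
    """
    weighted_sum, norm = 0.0, 0.0
    for user, votes in others.items():
        if item not in votes:
            continue  # user i has no vote on the target item
        w = weights[user]
        weighted_sum += w * (votes[item] - mean_vote(votes))
        norm += abs(w)
    if norm == 0:
        return mean_vote(active)  # no usable neighbor: fall back to the mean
    return mean_vote(active) + weighted_sum / norm


active = {"m1": 4, "m2": 2}
others = {"u1": {"m1": 5, "m2": 3, "m3": 4}, "u2": {"m1": 1, "m3": 2}}
print(predict(active, others, {"u1": 1.0, "u2": 0.5}, "m3"))
```

Normalizing by the sum of absolute weights is one common choice; the original paper may use a different normalizing factor.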
  6. 6. Related work (2/5) <ul><li>Memory-based CF algorithm (b) </li></ul><ul><ul><li>2. Pearson correlation coefficient </li></ul></ul><ul><ul><li>3. Vector similarity </li></ul></ul><ul><ul><li>j ranges over all items, </li></ul></ul><ul><ul><li>a and i are the users. </li></ul></ul><ul><ul><li>j ranges over all items, </li></ul></ul><ul><ul><li>k is the target item, </li></ul></ul><ul><ul><li>I is the total number of users, </li></ul></ul><ul><ul><li>a and i are the users. </li></ul></ul>
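The two similarity formulas were also images. As a hedged illustration, the sketch below uses the textbook definitions of the Pearson correlation coefficient and of vector (cosine) similarity between two users' vote vectors. The function names are hypothetical, and computing the Pearson means over co-rated items only is an assumption of this sketch.

```python
import math


def pearson(a, i):
    """Pearson correlation between users a and i over co-rated items."""
    common = [j for j in a if j in i]
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    mean_a = sum(a[j] for j in common) / len(common)
    mean_i = sum(i[j] for j in common) / len(common)
    num = sum((a[j] - mean_a) * (i[j] - mean_i) for j in common)
    den = math.sqrt(sum((a[j] - mean_a) ** 2 for j in common)
                    * sum((i[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0


def vector_similarity(a, i):
    """Cosine of the angle between the two vote vectors."""
    num = sum(a[j] * i[j] for j in a if j in i)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in i.values())))
    return num / den if den else 0.0


a = {1: 4, 2: 4, 3: 5, 4: 5, 5: 6, 6: 6}
b = {1: 3, 2: 3, 3: 4, 4: 4, 5: 5, 6: 5}
print(pearson(a, b), vector_similarity(a, b))
```

Users a and b here are the example pair used later in the ARN slides: their raw votes differ on every item, yet their Pearson correlation is 1 because b is a shifted copy of a.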
  7. 7. Related work (3/5) <ul><li>P2P system and DHT routing algorithm </li></ul><ul><ul><li>DHT routing algorithm </li></ul></ul><ul><ul><ul><li>Decentralization </li></ul></ul></ul><ul><ul><ul><li>Scalability </li></ul></ul></ul><ul><ul><ul><li>Fault tolerance </li></ul></ul></ul><ul><li>P2P benefits </li></ul><ul><ul><li>Avoiding dependence on a centralized point </li></ul></ul><ul><ul><li>Allowing direct communication </li></ul></ul><ul><ul><li>Enabling resource aggregation </li></ul></ul>
  8. 8. Related work (4/5) <ul><li>P2P applications </li></ul><ul><ul><li>Parallelizable applications </li></ul></ul><ul><ul><ul><li>Split a large computation-intensive task </li></ul></ul></ul><ul><ul><li>Content and file management applications </li></ul></ul><ul><ul><ul><li>Storing information </li></ul></ul></ul><ul><ul><ul><li>Retrieving information </li></ul></ul></ul><ul><ul><li>Collaborative applications </li></ul></ul><ul><ul><ul><li>Allow users to collaborate in real time </li></ul></ul></ul>
  9. 9. Related work (5/5) <ul><li>The primary goals of DHT </li></ul><ul><ul><li>Provide an efficient, scalable, and robust routing algorithm </li></ul></ul><ul><ul><ul><li>Reduce the number of P2P hops </li></ul></ul></ul><ul><ul><li>Reduce the amount of routing state that must be kept at each peer </li></ul></ul><ul><li>Distributed collaborative filtering (DCF) systems </li></ul><ul><ul><li>Advantage in terms of scalability </li></ul></ul>
  10. 10. Novel collaborative-filtering (1/10) <ul><li>Distributed collaborative-filtering neighbor-locating algorithm (DCFLA) </li></ul><ul><ul><li>Applies to DCF recommendation systems </li></ul></ul><ul><ul><li>Most same opinion (MSO) </li></ul></ul><ul><ul><li>Average rating normalization (ARN) </li></ul></ul><ul><li>The main goal </li></ul><ul><ul><li>Reduce the network traffic and time costs </li></ul></ul>
  11. 11. Novel collaborative-filtering (2/10) <ul><li>Basic DHT-based CF algorithm </li></ul><ul><ul><li>Divide the original centralized user database into fractions (buckets) </li></ul></ul>Gets rid of some noisy users
  12. 12. Novel collaborative-filtering (3/10) <ul><li>Locating neighbors in DHT-based CF </li></ul><ul><ul><li>Most same opinion (MSO) (a) </li></ul></ul><ul><ul><ul><li>Inverse preference frequency (IPF) </li></ul></ul></ul>M is the total number of users in the system, n_{i,v} is the number of users who voted for item i with rating v.
  13. 13. Novel collaborative-filtering (4/10) <ul><li>Locating neighbors in DHT-based CF </li></ul><ul><ul><li>Most same opinion (MSO) (b) </li></ul></ul><ul><ul><ul><li>The consistency of users i and j </li></ul></ul></ul>C_{i,j} is the consistency of users i and j, v_{i,k} is the vote of user i for item k, v_{j,k} is the vote of user j for item k, N is the total number of items in the system.
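The IPF and consistency formulas are missing from the transcript. The sketch below is an illustration under two loudly stated assumptions: that IPF follows the usual inverse-frequency form, IPF(i, v) = log(M / n_{i,v}), and that the consistency C_{i,j} sums the IPF of every item on which users i and j gave the same vote. Neither form is confirmed by the slides, and the function names are hypothetical.

```python
import math
from collections import Counter


def ipf_table(ratings):
    """Return {(item, vote): IPF} for ratings: user -> {item: vote}.

    ASSUMPTION: IPF(i, v) = log(M / n_iv), by analogy with inverse
    document frequency; the slide's own formula was lost.
    """
    m = len(ratings)  # M: total number of users
    counts = Counter((item, vote)
                     for votes in ratings.values()
                     for item, vote in votes.items())
    return {key: math.log(m / n) for key, n in counts.items()}


def consistency(votes_i, votes_j, ipf):
    """ASSUMPTION: sum of IPF over items where both users gave the same vote."""
    return sum(ipf[(k, v)] for k, v in votes_i.items() if votes_j.get(k) == v)


ratings = {
    "a": {1: 5, 2: 3},
    "b": {1: 5, 2: 1},
    "c": {1: 5, 2: 3},
    "d": {1: 2, 2: 3},
}
ipf = ipf_table(ratings)
print(consistency(ratings["a"], ratings["c"], ipf))  # two shared opinions
print(consistency(ratings["a"], ratings["b"], ipf))  # one shared opinion
```

Under these assumptions a rare shared opinion contributes more to the consistency than a common one, which is the usual motivation for inverse-frequency weighting.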
  14. 14. Novel collaborative-filtering (5/10) <ul><li>Time complexity of the similarity calculation for traditional memory-based CF </li></ul>Program SimilarityCalculation (Output) For User = 1:M For OtherUser = 1:M Calculate similarity between User and OtherUser by "Pearson correlation coefficient or vector similarity" End End. <ul><ul><li>Time complexity: O(M² N) </li></ul></ul><ul><ul><li>M is the number of users, N is the number of items, </li></ul></ul><ul><ul><li>When M or N grows to millions, it’s almost impossible to make a real-time prediction using the traditional centralized method. </li></ul></ul>
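A runnable rendering of the SimilarityCalculation pseudocode may make the O(M² N) cost easier to see: two nested loops over the M users, each pair costing up to N item comparisons. The `dot_similarity` helper is a hypothetical stand-in for the Pearson or vector similarity named on the slide.

```python
def dot_similarity(a, b):
    """Placeholder similarity: dot product over co-rated items, O(N)."""
    return sum(a[j] * b[j] for j in a if j in b)


def all_pairs_similarity(ratings, similarity=dot_similarity):
    """Compute the similarity of every ordered pair of distinct users.

    ratings: dict user -> {item: vote}. Cost is O(M^2 * N).
    """
    sims = {}
    for user in ratings:                 # M iterations
        for other in ratings:            # M iterations
            if user != other:
                # Each call touches up to N items.
                sims[(user, other)] = similarity(ratings[user], ratings[other])
    return sims


ratings = {"a": {1: 4, 2: 5}, "b": {1: 3, 2: 4}, "c": {2: 1}}
sims = all_pairs_similarity(ratings)
print(len(sims), sims[("a", "b")])
```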
  15. 15. Novel collaborative-filtering (6/10) <ul><li>Time complexity of the similarity calculation for the DHT-MSO-based CF neighbor-locating algorithm </li></ul>Program DHT_MSO_NeighborLocating (Output) For each rated item, fetch vector <USERID,IPF> from bucket; End Merge vectors to get consistency by IPF <ul><ul><li>Time complexity: O(N + M log M) </li></ul></ul>
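The DHT_MSO_NeighborLocating pseudocode can be sketched as follows, with an in-memory dictionary standing in for the DHT buckets. This is a minimal sketch under assumptions: buckets are keyed by (item, vote) and hold (user id, IPF) pairs, merging means summing IPF per user, and `locate_neighbors` and `top_k` are hypothetical names.

```python
from collections import defaultdict


def locate_neighbors(active_votes, buckets, top_k=None):
    """Rank candidate neighbors of the active user by accumulated IPF.

    active_votes: {item: vote} for the active user
    buckets     : {(item, vote): [(user_id, ipf), ...]}, a stand-in for
                  the vectors fetched from the DHT
    """
    merged = defaultdict(float)
    for item, vote in active_votes.items():       # at most N bucket fetches
        for user_id, ipf in buckets.get((item, vote), []):
            merged[user_id] += ipf                # merge vectors by summing IPF
    # Sorting up to M candidates costs O(M log M).
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


buckets = {
    (1, 5): [("b", 0.3), ("c", 0.3)],
    (2, 3): [("c", 0.3), ("d", 0.3)],
}
print(locate_neighbors({1: 5, 2: 3}, buckets))
```

User c shares both opinions with the active user and therefore ranks first. No pairwise comparison against every other user is needed, which is where the saving over the all-pairs loop comes from.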
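  16. 16. Novel collaborative-filtering (7/10) <ul><li>Average rating normalization (ARN) (1/3) </li></ul><ul><ul><li>For example -- a and b will not be in the same bucket </li></ul></ul><table><tr><th>Item ID</th><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td></tr><tr><th>Vote(a)</th><td>4</td><td>4</td><td>5</td><td>5</td><td>6</td><td>6</td></tr><tr><th>Vote(b)</th><td>3</td><td>3</td><td>4</td><td>4</td><td>5</td><td>5</td></tr></table>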
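  17. 17. Novel collaborative-filtering (8/10) <ul><li>Average rating normalization (ARN) (2/3) </li></ul><ul><ul><li>For example </li></ul></ul><table><tr><th>Item ID</th><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td></tr><tr><th>Vote(a)</th><td>4</td><td>4</td><td>5</td><td>5</td><td>6</td><td>6</td></tr><tr><th>Vote(b)</th><td>3</td><td>3</td><td>4</td><td>4</td><td>5</td><td>5</td></tr></table>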
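  18. 18. Novel collaborative-filtering (9/10) <ul><li>Average rating normalization (ARN) (3/3) </li></ul><ul><ul><li>a and b will never be in the same bucket (based on basic DHT-based CF algorithm) </li></ul></ul><ul><ul><li><ITEM_ID, VOTE> vs. <ITEM_ID, ARN_VOTE> </li></ul></ul><ul><ul><li>For example -- a and b will be in the same bucket </li></ul></ul>New approach (ARN votes): <table><tr><th>Item ID</th><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td></tr><tr><th>Vote(a)</th><td>-1</td><td>-1</td><td>0</td><td>0</td><td>1</td><td>1</td></tr><tr><th>Vote(b)</th><td>-1</td><td>-1</td><td>0</td><td>0</td><td>1</td><td>1</td></tr></table> Former approach (raw votes): <table><tr><th>Item ID</th><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td></tr><tr><th>Vote(a)</th><td>4</td><td>4</td><td>5</td><td>5</td><td>6</td><td>6</td></tr><tr><th>Vote(b)</th><td>3</td><td>3</td><td>4</td><td>4</td><td>5</td><td>5</td></tr></table>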
  19. 19. Novel collaborative-filtering (10/10) <ul><li>ARN_VOTE </li></ul><ul><li>The ARN approach constructs N buckets instead of the N * C buckets of the basic DHT-based CF algorithm </li></ul><ul><ul><li>N is the number of items </li></ul></ul><ul><ul><li>C is the number of possible ratings for each item </li></ul></ul>v'_{i,j} is the ARN_VOTE of user i on item j, v_{i,j} is the vote of user i on item j, v'_i is the mean vote of user i. v_req: retrieve similar neighbors; v_ack: return the users' degree of satisfaction; δ: threshold.
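The ARN example from the preceding slides can be checked with a few lines of code. The sketch assumes ARN_VOTE is the raw vote minus the user's mean vote, which is consistent with the symbol definitions and reproduces the example table; the function name `arn_votes` is hypothetical.

```python
def arn_votes(votes):
    """Subtract the user's mean vote from each raw vote."""
    mean = sum(votes.values()) / len(votes)
    return {item: vote - mean for item, vote in votes.items()}


a = {1: 4, 2: 4, 3: 5, 4: 5, 5: 6, 6: 6}
b = {1: 3, 2: 3, 3: 4, 4: 4, 5: 5, 6: 5}

# Raw <ITEM_ID, VOTE> pairs never coincide, so a and b share no bucket.
print(set(a.items()) & set(b.items()))

# After normalization both users map to the same <ITEM_ID, ARN_VOTE> pairs.
print(arn_votes(a))
print(arn_votes(b))
```

In real data the mean is rarely an integer, so normalized votes will not match exactly. The threshold δ listed on the slide presumably controls how close two normalized votes must be to count as a match.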
  20. 20. Recommendation systems (1/7) <ul><li>Architecture of DCF system </li></ul>[Figure: DCF system architecture with peers User1 to User5]
  21. 21. Recommendation systems (2/7) <ul><li>DCF system vs. traditional centralized CF system </li></ul><ul><ul><li>Differences </li></ul></ul><ul><ul><ul><li>Maintenance of the user database </li></ul></ul></ul><ul><ul><ul><li>Complex computation task of making predictions </li></ul></ul></ul><ul><ul><li>Similarities </li></ul></ul><ul><ul><ul><li>Unique key </li></ul></ul></ul><ul><ul><ul><ul><li>Each user has a unique key </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Construct a DHT overlay network </li></ul></ul></ul></ul>
  22. 22. Recommendation systems (3/7) <ul><li>DHT-MSO-based CF algorithm </li></ul><ul><ul><li>MSO is used as the distributed neighbor-locating algorithm </li></ul></ul>Input: training set, test set, target item. Output: mean absolute error of prediction. Steps: (1) construct the DHT overlay network, which provides put (key) and lookup (key); (2) for the training set, execute the put (key) function with key <ITEM_ID, VOTE>; (3) fetch similar neighbors by executing the lookup (key) function; (4) compute the corresponding prediction.
  23. 23. Recommendation systems (4/7) <ul><li>DCF puts the vote vector for peer P into the DHT overlay network </li></ul>Input: test set (P’s vote vector). Output: NULL. Steps: (1) P generates a unique 128-bit DHT key k_local; (2) P hashes one <ITEM_ID, VOTE> combination to key K and finds the similar neighbor, that is, the peer whose local key k_local is most similar to K; (3) that peer receives the put message with K; (4) repeat steps 2 and 3.
  24. 24. Recommendation systems (5/7) <ul><li>DCF looks up similar users for peer P </li></ul>Input: test set (P’s vote vector). Output: training set (vote vectors retrieved for similar users, received from those users). Steps: (1) P generates a unique 128-bit DHT key k_local; (2) P hashes one <ITEM_ID, VOTE> combination to key K and finds the similar neighbor, that is, the peer whose local key k_local is most similar to K; (3) that peer receives the lookup message with K; (4) repeat steps 2 and 3.
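The put and lookup flows on the last two slides can be illustrated with a toy, single-process stand-in for the DHT. Everything here is an assumption made for illustration: MD5 is used only because it yields a 128-bit value, "most similar key" is approximated as the numerically closest key, routing is a single hop, and `ToyDHT`, `dht_key`, `put`, and `lookup` are hypothetical names rather than the authors' implementation.

```python
import hashlib


def dht_key(item_id, vote):
    """Hash an <ITEM_ID, VOTE> pair to a 128-bit integer key (MD5 as a stand-in)."""
    digest = hashlib.md5(f"{item_id}:{vote}".encode()).digest()
    return int.from_bytes(digest, "big")


class ToyDHT:
    """In-memory stand-in for a DHT overlay; no real routing or networking."""

    def __init__(self, peer_names):
        # Each peer gets a 128-bit local key and an empty bucket store.
        self.peers = {dht_key("peer", name): {} for name in peer_names}

    def _closest(self, key):
        # ASSUMPTION: "most similar" means numerically closest local key.
        return min(self.peers, key=lambda local: abs(local - key))

    def put(self, user_id, votes):
        """Store the user's vote vector under every <ITEM_ID, VOTE> key."""
        for item, vote in votes.items():
            k = dht_key(item, vote)
            self.peers[self._closest(k)].setdefault(k, {})[user_id] = votes

    def lookup(self, votes):
        """Return vote vectors of users sharing at least one <ITEM_ID, VOTE>."""
        found = {}
        for item, vote in votes.items():
            k = dht_key(item, vote)
            found.update(self.peers[self._closest(k)].get(k, {}))
        return found


dht = ToyDHT(["p1", "p2", "p3"])
dht.put("u1", {1: 5, 2: 3})
dht.put("u2", {1: 5, 2: 1})
dht.put("u3", {1: 2})
print(sorted(dht.lookup({1: 5})))
```

The lookup result plays the role of the local training set: only users who share at least one opinion with the requester are returned, so the requester never has to scan the whole user database.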
  25. 25. Recommendation systems (6/7) <ul><li>DCF put algorithm </li></ul><ul><ul><li>Construct a DHT overlay network </li></ul></ul><ul><ul><li>Fill data into DHT </li></ul></ul><ul><li>DCF lookup algorithm </li></ul><ul><ul><li>Fetch similar users with high consistency </li></ul></ul><ul><ul><li>Construct a local training set to make recommendations </li></ul></ul>
  26. 26. Recommendation systems (7/7) <ul><li>DHT-MSO-ARN-based CF </li></ul><ul><ul><li>Almost the same as the DHT-MSO-based algorithm </li></ul></ul>Input: training set, test set, target item. Output: mean absolute error of prediction. Steps: (1) construct the DHT overlay network, which provides put (key) and lookup (key); (2) for the training set, execute the put (key) function with key <ITEM_ID, ARN_VOTE>; (3) fetch similar neighbors by executing the lookup (key) function; (4) compute the corresponding prediction.
  27. 27. Experiments (1/10) <ul><li>CF algorithms </li></ul><ul><ul><li>Traditional memory-based CF algorithm (baseline) </li></ul></ul><ul><ul><li>Basic DHT-based CF </li></ul></ul><ul><ul><li>DHT-based CF with MSO </li></ul></ul><ul><ul><li>DHT-based CF with MSO and ARN </li></ul></ul><ul><li>Data set </li></ul><ul><ul><li>EachMovie data set </li></ul></ul><ul><ul><li>72,916 users, 1,628 movies </li></ul></ul><ul><ul><li>2,811,983 ratings ranging from 0 to 5 </li></ul></ul>
  28. 28. Experiments (2/10) <ul><li>Metrics and methodology </li></ul><ul><ul><li>MAE (mean absolute error) </li></ul></ul>v_{a,j} is the actual rating user a gives to item j, p_{a,j} is the predicted value, A is the active user set, T is the test item set.
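The MAE formula itself was an image. A minimal sketch, assuming the usual definition (the average of |v_{a,j} - p_{a,j}| over all tested user and item pairs), with ratings keyed by hypothetical (user, item) tuples:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted ratings.

    Both arguments map (user, item) -> rating; every key of `actual`
    must also appear in `predicted`.
    """
    errors = [abs(actual[key] - predicted[key]) for key in actual]
    return sum(errors) / len(errors)


actual = {("a", 1): 4, ("a", 2): 2, ("b", 1): 5}
predicted = {("a", 1): 3.5, ("a", 2): 3, ("b", 1): 5}
print(mean_absolute_error(actual, predicted))
```

A lower MAE means the predictions sit closer to the held-out ratings, which is how the four algorithms are compared in the result slides.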
  29. 29. Experiments (3/10) <ul><li>Experimental results (1/8) </li></ul><ul><ul><li>Efficiency of neighbor selection (a) </li></ul></ul>
  30. 30. Experiments (4/10) <ul><li>Experimental results (2/8) </li></ul><ul><ul><li>Efficiency of neighbor selection (b) </li></ul></ul>
  31. 31. Experiments (5/10) <ul><li>Experimental results (3/8) </li></ul><ul><ul><li>Comparison of the prediction accuracy of four CF algorithms (all-but-one protocol) </li></ul></ul>
  32. 32. Experiments (6/10) <ul><li>Experimental results (4/8) </li></ul><ul><ul><li>Comparison of the prediction accuracy of four CF algorithms (given 5 protocol) </li></ul></ul>
  33. 33. Experiments (7/10) <ul><li>Experimental results (5/8) </li></ul><ul><ul><li>Comparison of the fetch by four algorithms (all-but-one protocol) </li></ul></ul>
  34. 34. Experiments (8/10) <ul><li>Experimental results (6/8) </li></ul><ul><ul><li>Comparison of the fetch by four algorithms (given 5 protocol) </li></ul></ul>
  35. 35. Experiments (9/10) <ul><li>Experimental results (7/8) </li></ul><ul><ul><li>Comparison of different threshold values for ARN (all-but-one protocol) </li></ul></ul>
  36. 36. Experiments (10/10) <ul><li>Experimental results (8/8) </li></ul><ul><ul><li>Comparison of different threshold values for ARN (given 5 protocol) </li></ul></ul>
  37. 37. Conclusions <ul><li>Proposed a new algorithm </li></ul><ul><ul><li>Based on a DHT peer-to-peer routing method </li></ul></ul><ul><ul><li>Distributed collaborative-filtering neighbor-locating algorithm (DCFLA) </li></ul></ul><ul><ul><ul><li>Most same opinion (MSO) </li></ul></ul></ul><ul><ul><ul><li>Average rating normalization (ARN) </li></ul></ul></ul><ul><ul><li>Reduced the network traffic and time cost </li></ul></ul>
  38. 38. Q & A <ul><li>Thank you for listening! </li></ul>
