ISSOME 2011: A comparison of different user-similarity measures

Presentation at ISSOME 2011, Åbo/Turku, Finland, August 24-26

  1. Tamara Heck, Dept. of Information Science, Heinrich-Heine-University Düsseldorf
     A Comparison of Different User-Similarity Measures as Basis for Research and Scientific Cooperation
     Information Science and Social Media – International Conference, August 24-26, Åbo/Turku, Finland
  2. Agenda
     - New needs for scientific cooperation
     - Forms of similarity measurements
     - Similarity coefficients
     - Database settings
     - Results and discussion I: similarity coefficients
     - Results and discussion II: resource- and tag-based similarity
     - Results and discussion III: combined methods
     - Conclusion
     Tamara Heck - A comparison of different user-similarity measures, 2011
  3. Scientific Cooperation
     Need for:
     - literature storage
     - research literature
     - project teams
     - community network
     - cooperation partners
  4. Scientific Cooperation
     Web technologies facilitate scientific work. Online user networks help with:
     - finding literature
     - categorizing important items
     - searching for important people
     Difficulties:
     - mostly implicit networks
     - "new" researchers are not aware of the networks
     - senior researchers only know familiar networks
     - information overload
  5. Scientific Cooperation
     Recommendation:
     - can help to find resources, documents and people that are important for scientific research
     - filters information
     - shows only relevant items
     → The basis for recommendation is similarity.
  6. Similarity Measurements
     Approach: user and resource recommendation for scientists.
     Social Bookmarking Services (SBS) for academic literature management:
     - BibSonomy
     - CiteULike
     - Connotea
  7. Similarity Measurements
     Folksonomy as basis:
     - F := (U, T, R, Y), where U, T and R are finite sets with the elements usernames, tags and resources, and Y ⊆ U × T × R is a ternary relation between them; the elements of Y are called tag actions or assignments
     - docsonomy: D_F := (T, R, Z), Z ⊆ T × R
     - personomy: P_UT := (U, T, X), X ⊆ U × T
     - personal bookmark list (PBL): PBL_UR := (U, R, W), W ⊆ U × R
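These structures can be sketched directly as sets of tuples. A minimal illustration; the usernames, tags and resources here are invented toy data, not the study's dataset:

```python
# A folksonomy as a set of ternary tag assignments Y ⊆ U x T x R,
# plus the three projections defined above. Toy data only.
Y = {
    ("dchen", "ferromagnetism", "article1"),
    ("dchen", "theory", "article2"),
    ("weeks", "ferromagnetism", "article1"),
}

docsonomy = {(t, r) for (_, t, r) in Y}       # Z ⊆ T x R
personomy = {(u, t) for (u, t, _) in Y}       # X ⊆ U x T
bookmark_list = {(u, r) for (u, _, r) in Y}   # W ⊆ U x R (the PBL)
```

Note that the docsonomy collapses duplicate (tag, resource) pairs from different users, which is exactly why it is a projection rather than the full relation.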
  8. Similarity Coefficients
     For users a and b, let R(a) and R(b) be their sets of elements (bookmarks or tags):
     - Dice: sim(a, b) = 2 |R(a) ∩ R(b)| / (|R(a)| + |R(b)|)
     - Jaccard-Sneath: sim(a, b) = |R(a) ∩ R(b)| / (|R(a)| + |R(b)| − |R(a) ∩ R(b)|)
     - Cosine: sim(a, b) = |R(a) ∩ R(b)| / √(|R(a)| · |R(b)|)
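A minimal sketch of the three coefficients applied to two users' bookmark (or tag) sets; the function and variable names are my own, not from the slides:

```python
from math import sqrt

def dice(a: set, b: set) -> float:
    # 2|a ∩ b| / (|a| + |b|)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard_sneath(a: set, b: set) -> float:
    # |a ∩ b| / |a ∪ b|
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def cosine(a: set, b: set) -> float:
    # |a ∩ b| / sqrt(|a| * |b|)
    return len(a & b) / sqrt(len(a) * len(b))
```

With |a| = 214, |b| = 58 and 18 common bookmarks (the dchen/weeks pair from the example below), Dice gives 36/272 ≈ 0.1324 and Cosine 18/√12412 ≈ 0.1616, matching the table on slide 15.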
  9. Database Settings
     Raw data: bookmarked articles of 45 physics journals from 3 SBS
     - 13,762 bookmarks
     - 10,498 distinct bookmarks
     - 2,473 unique users, of whom 1,974 used tags (71 usernames were found in more than one service)
     - 36,433 tags before clearing
     Tag clearing (→ 8,233 unique tags):
     - removal of system tags like "%import%", "%jabref%", "%upload%"
     - deletion of lines/underlines
     - change from plural to singular form
     - change from British to American spelling, e.g. "s" → "z"
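The clearing steps above could be sketched as follows; the exact rules and word lists used for the dataset are not given, so the regex and substring checks here are illustrative assumptions:

```python
import re

# Assumption: system tags are matched as substrings, as in "%import%"
SYSTEM_TAG_PARTS = ("import", "jabref", "upload")

def clean_tag(tag: str):
    """Return a normalized tag, or None if the tag should be dropped."""
    tag = tag.lower().strip()
    if any(part in tag for part in SYSTEM_TAG_PARTS):
        return None                                # drop "%import%" etc.
    tag = tag.replace("_", "").replace("-", "")    # delete lines/underlines
    tag = re.sub(r"isation$", "ization", tag)      # British "s" -> American "z"
    if tag.endswith("s") and not tag.endswith("ss"):
        tag = tag[:-1]                             # naive plural -> singular
    return tag
```

A real pipeline would need a proper stemmer and spelling dictionary; the point is only that each listed step is a simple string transformation.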
 10. Database Settings
     Raw data: bookmarked articles of 45 physics journals from 3 SBS
     - Users with only one bookmark were left out because they would bias the results, i.e. produce user-pairs with similarity = 1
     - 1,262 unique users with more than one bookmark remain
     A user recommendation system should offer threshold settings:
     - users with a minimum number of bookmarks (e.g. CiteULike)
     - users with a minimum number of common bookmarks (adjustable with a slider)
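The threshold idea can be sketched as a filter over user-pairs; the data structure and limits below are invented examples, not the study's implementation:

```python
from itertools import combinations

def candidate_pairs(bookmarks, min_bookmarks=2, min_common=1):
    """bookmarks: dict mapping username -> set of bookmarked resources.

    Keep only users with at least `min_bookmarks` bookmarks, then only
    pairs with at least `min_common` bookmarks in common (slider value).
    """
    users = sorted(u for u, bms in bookmarks.items()
                   if len(bms) >= min_bookmarks)
    return [(u, v) for u, v in combinations(users, 2)
            if len(bookmarks[u] & bookmarks[v]) >= min_common]
```

A user with a single bookmark never enters the candidate list, which removes exactly the degenerate similarity-1 pairs described above.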
 11. Results
     Differences:
     - I. between coefficients (Jaccard-Sneath, Dice, Cosine)
     - II. between resource- and tag-based similarity
     - III. between combined methods
 12. Results I
     Similarity coefficients, with g = number of common bookmarks/tags:
     - Dice: D = 2g / (a + b)
     - Jaccard-Sneath: J = g / (a + b − g)
     The two are monotone transformations of each other: D = 2J / (J + 1) [Egghe10].
     Differences between Dice and Cosine follow on the next slides.
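The Egghe relation can be checked numerically; the counts used below are the dchen/weeks pair from slide 15:

```python
def dice_from_counts(a, b, g):
    return 2 * g / (a + b)            # D = 2g / (a + b)

def jaccard_from_counts(a, b, g):
    return g / (a + b - g)            # J = g / (a + b - g)

# dchen (a = 214), weeks (b = 58), g = 18 common bookmarks
D = dice_from_counts(214, 58, 18)
J = jaccard_from_counts(214, 58, 18)
assert abs(D - 2 * J / (J + 1)) < 1e-12   # Egghe (2010): D = 2J / (J + 1)
```

Because D is a monotone function of J, both coefficients always produce the same ranking of user-pairs; they differ only in scale.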
 13. Results I
     [Figure: coefficient differences]
 14. Results I
     Dice vs. Cosine: Cosine makes a stronger distinction regarding the allocation of resources and tags between the two users.
     Example with g = 5 common items and a + b = 100:
     - Dice: 2·5 / (10 + 90) = 10/100 = 0.1, independent of how a and b split the total
     - Cosine: 5 / √(10·90) ≈ 0.167 for a = 10, b = 90; 5 / √(40·60) ≈ 0.102 for a = 40, b = 60; 5 / √(50·50) = 0.1 for a = 50, b = 50
 15. Results I
     Example with target user "dchen" (bm = bookmarks; data from BibSonomy, CiteULike, Connotea):

     common bm   bm dchen   bm user2   user1   user2          Dice     Cosine
         18        214         58      dchen   weeks          0.1324   0.1616
         17        214         58      dchen   ghunter        0.125    0.1526
         11        214         52      dchen   kdesmond       0.0827   0.1043
          8        214         66      dchen   kkims          0.0571   0.0673
          6        214         26      dchen   kedmond        0.05     0.0804
          5        214         25      dchen   katiehumphry   0.0418   0.0684
          4        214         15      dchen   tathabhatt     0.0349   0.0706
          5        214        105      dchen   rodney         0.0313   0.0334
          3        214          9      dchen   waitonhill     0.0269   0.0684
          2        214          2      dchen   caortiz        0.0185   0.0967
 16. Discussion I
     What is the difference for a user who should be recommended similar users or resources?
     - dchen (214 bookmarks) and waitonhill (9 bookmarks) have 3 bookmarks in common
     - → waitonhill is rather uninteresting for dchen
     - → but dchen might be very interesting for waitonhill
     - Dice: 0.0269, Cosine: 0.0684
     If dchen had fewer bookmarks, he would be more similar to waitonhill → a positive result?
     - dchen with 190 bookmarks: Dice: 0.03, Cosine: 0.073
 17. Results
     Differences:
     - I. between coefficients
     - II. between resource- and tag-based similarity (bookmark overlap vs. tag overlap)
     - III. between combined methods
 18. Results II
     Resource- or tag-based similarity? Tag-based similarity for target user "dchen" (similarity based on common tags; data from BibSonomy, CiteULike, Connotea):

     common tags   tags dchen   tags user2   user2             Dice     Cosine
         31           175           64       weeks             0.2594   0.2929
         25           175           68       ghunter           0.2058   0.2292
         20           175           29       kedmond           0.1961   0.2807
         41           175          259       rodney            0.1889   0.1926
         25           175          102       andreab           0.1805   0.1871
         16           175           35       kkims             0.1524   0.2044
         54           175          564       michaelbussmann   0.1461   0.1719
         20           175          107       paulschlesinger   0.1418   0.1462
         14           175           36       jeevanjyoti       0.1327   0.1764
         23           175          176       bronckobuster     0.1311   0.1311

     For comparison, the bookmark-based ranking from slide 15: weeks 0.1324, ghunter 0.125, kdesmond 0.0827, kkims 0.0571, kedmond 0.05, katiehumphry 0.0418, tathabhatt 0.0349, rodney 0.0313, waitonhill 0.0269, caortiz 0.0185.
     → Different users are recommended with the tag-based similarity measure.
 19. Results II
     Example: user "dchen"
     - 30 users with at least one common bookmark
     - 560 users with at least one common tag
     - 23 users with at least one common bookmark and one common tag
 20. Results II
     Database:
     - 1,262 unique users
     - 6,430 user-pairs with at least one common bookmark
     - 54,413 user-pairs with at least one common tag
     - 0.15: average bookmarks in common
     - 1.3: average bookmarks in common, leaving out user-pairs with no bookmark in common
     - 1.37: average tags in common
     - 1.44: average tags in common, leaving out user-pairs with no tags in common
     [Chart: common bookmarks and common tags per user-pair]
 21. Discussion II
     Resources (scientific articles):
     - explicit hint of similar interest if an article discusses more or less one definite topic
     - but what about standard works or surveys?
     Tags:
     - inform more precisely in which context other users read the article: 'biological', 'absorption', 'ferromagnetism', 'theory', 'method'
     - can identify a user's research field
     - but: no uniform vocabulary is used
     - tags might be meaningless for others: 'to read', 'mypaper'
 22. Results
     Differences:
     - I. between coefficients
     - II. between resource- and tag-based similarity
     - III. between mixed methods (combining bookmark and tag overlap)
 23. Results III
     Similarity based on common resources and common tags:
     1. Coefficients are added: Dice + Dice, or Cosine + Cosine
     2. Search for users who have assigned at least one common tag to one common bookmark
     [Diagram: similarity relations between users A-D via shared tags and bookmarks]
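Step 2 can be sketched by comparing the users' sets of (tag, bookmark) assignments; the function name and toy data below are illustrative, not from the study:

```python
from collections import Counter
from itertools import combinations

def accordances(assignments):
    """Count shared (tag, bookmark) pairs per user-pair.

    assignments: iterable of (user, tag, resource) triples.
    Returns {(userA, userB): number of same-bookmark-same-tag accordances}.
    """
    per_user = {}
    for u, t, r in assignments:
        per_user.setdefault(u, set()).add((t, r))
    counts = Counter()
    for u, v in combinations(sorted(per_user), 2):
        shared = per_user[u] & per_user[v]   # same tag on the same bookmark
        if shared:
            counts[(u, v)] = len(shared)
    return counts
```

Only pairs with at least one accordance appear in the result, mirroring the sparse sets reported on the next slides.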
 24. Results III
     - 159 unique users within the set
     - 389 user-pairs with a same-bookmark-same-tag accordance
     - 273 user-pairs with more than one accordance
     - average same-bookmark-same-tag appearances: 3.42
     - average common bookmarks: 3.1
     - average common tags: 4.91
 25. Results III
     [Chart: distribution of common bookmarks and common tags compared to the combined method, per user-pair]
 26. Results III
     Comparison of methods:
     1. Added Dice values from common bookmarks and common tags of a user-pair: D_bm(a,b) + D_t(a,b)
     2. Same-bookmark-same-tag appearance: count of pairs (T_i, B_i) where both a and b assigned tag T_i to bookmark B_i
     3. Mixed method with same-bookmark-same-tag factor: (number of same-bookmark-same-tag appearances + 1) · (D_bm(a,b) + D_t(a,b))
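The three scores for one user-pair, sketched in code. The grouping of method 3 as (accordances + 1) · (D_bm + D_t) is inferred from the numbers on the next slide (e.g. weeks: 25 · 0.3918 = 9.795), so treat it as a reconstruction:

```python
def sum_dice(d_bm, d_t):
    """Method 1: added Dice values of common bookmarks and common tags."""
    return d_bm + d_t

def mixed(d_bm, d_t, n_accordances):
    """Method 3: same-bookmark-same-tag factor times the summed Dice values."""
    return (n_accordances + 1) * (d_bm + d_t)

# dchen/weeks from the slides: D_bm = 0.1324, D_t = 0.2594, 24 accordances
print(round(sum_dice(0.1324, 0.2594), 4))   # 0.3918
print(round(mixed(0.1324, 0.2594, 24), 4))  # 9.795
```

Pairs with no accordance keep their plain summed Dice value (factor 1), so method 3 only re-ranks pairs that actually share tagged bookmarks.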
 27. Results III
     Comparison of user ranking order with the different methods, example user "dchen" (each column is an independent ranking):

     bm Dice                   tag Dice                    sum bm + tag Dice           mixed method                same bm-same tag
     weeks          0.1324     weeks            0.2594    weeks            0.3918     weeks            9.795      weeks          24
     ghunter        0.125      ghunter          0.2058    ghunter          0.3308     ghunter          5.2928     ghunter        15
     kdesmond       0.0827     kedmond          0.1961    kedmond          0.2461     kedmond          2.2149     kedmond         8
     kkims          0.0571     rodney           0.1889    rodney           0.2202     kkims            1.257      kkims           5
     kedmond        0.05       andreab          0.1805    kkims            0.2095     kdesmond         0.9615     kdesmond        4
     katiehumphry   0.0418     kkims            0.1524    kdesmond         0.1923     rodney           0.8808     rodney          3
     tathabhatt     0.0349     michaelbussmann  0.1461    andreab          0.1805     katiehumphry     0.502      katiehumphry    3
     rodney         0.0313     paulschlesinger  0.1418    jeevanjyoti      0.1502     jeevanjyoti      0.3004     softsimu        1
     waitonhill     0.0269     jeevanjyoti      0.1327    michaelbussmann  0.1461     softsimu         0.1962     tathabhatt      1
     caortiz        0.0185     bronckobuster    0.1311    paulschlesinger  0.1418     andreab          0.1805     jeevanjyoti     1
     knordstr       0.0184     chiufanlee       0.1181    bronckobuster    0.1311     tathabhatt       0.1482
     peteryunker    0.0179     barrat           0.1111    katiehumphry     0.1255     michaelbussmann  0.1461
     jeevanjyoti    0.0175     pbuczek          0.1106    chiufanlee       0.1254     paulschlesinger  0.1418
     sobolevnrm     0.015      kdesmond         0.1096    barrat           0.1111     bronckobuster    0.1311
     softsimu       0.0132     gdurin           0.1057    pbuczek          0.1106     chiufanlee       0.1254
     lgolick        0.0092     cgguido          0.0952    gdurin           0.1057     barrat           0.1111
     whitead        0.0092     Tomste           0.0923    6rheology        0.0987     pbuczek          0.1106
     ccthomas       0.0091     6rheology        0.0909    softsimu         0.0981     gdurin           0.1057
     devries        0.0091     mattroche        0.0909    cgguido          0.0952     6rheology        0.0987
     governmentmen  0.0091     edws             0.087     kaigrass         0.0924     cgguido          0.0952
     …                         softsimu         0.0849    tathabhatt       0.0741     …
                               katiehumphry     0.0837
                               …
                               tathabhatt       0.0392
 28. Discussion III
     Assumption:
     - if the same tags are assigned to the same bookmarks, the users must be very similar
     - such users read articles within the same context
     - therefore, the mixed method may provide a further ranking improvement
     Challenges:
     - some tags are inappropriate
     - users might have copied another user's tags without checking their adequacy for themselves
     - sparse data for the mixed method
 29. Conclusion
     Discussion I, similarity coefficients:
     - Cosine considers the bookmark/tag distribution between two users
     - the choice of coefficient may depend on the target user's interests
     Discussion II, resource- and tag-based similarity:
     - more users are found with similarity based on common tags
     - different user rankings
     - tags give an indication of a user's reading context if they are context-sensitive
 30. Conclusion
     Discussion III, mixed methods:
     - a mixed similarity measurement based on common bookmarks and tags may be more appropriate, because many users do not share common bookmarks AND common tags
     - users who use the same tags to describe the same bookmark may be more similar, and this relation should be considered
 31. Thank you for your attention. Do you have any questions?
     Contact:
     Tamara Heck
     Heinrich-Heine-University Düsseldorf, Dept. of Information Science
     D-40225 Düsseldorf, Germany
     Phone: +49 211 81-14137
     tamara.heck@uni-duesseldorf.de
     Twitter: tamaraheck / #iwhhu
 32. Bibliography
     Ahlgren, Per; Jarneving, Bo; Rousseau, Ronald (2003): Requirements for a Cocitation Similarity Measure, with Special Reference to Pearson's Correlation Coefficient. In: Journal of the American Society for Information Science and Technology, 54/6, 550-560.
     Cacheda, Fidel; Carneiro, Víctor; Fernández, Diego; Formoso, Vreixo (2011): Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. In: ACM Transactions on the Web, 5/1, article 2.
     Egghe, Leo (2010): Good Properties of Similarity Measures and Their Complementarity. In: Journal of the American Society for Information Science and Technology, 61/10, 2151-2160.
     Hamers, Lieve; Hemeryck, Yves; Herweyers, Guido; Janssen, Marc (1989): Similarity Measures in Scientometric Research: The Jaccard Index Versus Salton's Cosine Formula. In: Information Processing & Management, 25/3, 315-318.
     Heck, Tamara; Peters, Isabella (2010): Expert Recommender Systems: Establishing Communities of Practice Based on Social Bookmarking Systems. In: Proceedings of I-KNOW 2010, 10th International Conference on Knowledge Management and Knowledge Technologies, 458-464.
     Knautz, Kathrin; Soubusta, Simone; Stock, Wolfgang G. (2010): Tag clusters as information retrieval interfaces. In: Proceedings of the 43rd Annual Hawaii International Conference on System Sciences (HICSS-43), 10 pages.
     Leydesdorff, Loet (2005): Similarity Measures, Author Cocitation Analysis, and Information Theory. In: Journal of the American Society for Information Science and Technology, 56/7, 769-772.
     Leydesdorff, Loet; Vaughan, Liwen (2006): Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment. In: Journal of the American Society for Information Science and Technology, 57/12, 1616-1628.
     Liang, Huizhi; Xu, Yue; Li, Yuefeng; Nayak, Richi (2008): Collaborative Filtering Recommender Systems Using Tag Information. In: 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 59-62.
     Luo, Heng; Niu, Changyong; Shen, Ruimin; Ullrich, Carsten (2008): A collaborative filtering framework based on both local user similarity and global user similarity. In: Machine Learning, 72/3, 231-245.
     Marinho, Leandro B.; Nanopoulos, Alexandros; Schmidt-Thieme, Lars; Jäschke, Robert; Hotho, Andreas; Stumme, Gerd; Symeonidis, Panagiotis (2011): Social Tagging Recommender Systems. In: Recommender Systems Handbook. Springer, New York.
 33. Bibliography
     Peters, Isabella (2009): Folksonomies. Indexing and Retrieval in Web 2.0 (Knowledge and Information). De Gruyter Saur, Berlin.
     Rorvig, Mark (1999): Images of Similarity: A Visual Exploration of Optimal Similarity Metrics and Scaling Properties of TREC Topic-Document Sets. In: Journal of the American Society for Information Science, 50/8, 639-651.
     Schneider, Jesper W.; Borlund, Pia (2007a): Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results. In: Journal of the American Society for Information Science and Technology, 58/11, 1586-1596.
     Schneider, Jesper W.; Borlund, Pia (2007b): Matrix comparison, Part 2: Measuring the resemblance between proximity measures or ordination results by use of the Mantel and Procrustes statistics. In: Journal of the American Society for Information Science and Technology, 58/11, 1596-1609.
     Szomszor, M.; Cattuto, C.; Alani, H.; O'Hara, K.; Baldassarri, A.; Loreto, V.; Servedio, V. D. P. (2007): Folksonomies, the Semantic Web, and Movie Recommendation. In: 4th European Semantic Web Conference, Bridging the Gap between Semantic Web and Web 2.0, Innsbruck, Austria, 71-84.
     Van Eck, Nees Jan; Waltman, Ludo (2008): Appropriate Similarity Measures for Author Co-Citation Analysis. In: Journal of the American Society for Information Science and Technology, 59/10, 1653-1661.
     Van Eck, Nees Jan; Waltman, Ludo (2009): How to Normalize Cooccurrence Data? An Analysis of Some Well-Known Similarity Measures. In: Journal of the American Society for Information Science and Technology, 60/8, 1635-1651.
     Zanardi, Valentina; Capra, Licia (2008): Social Ranking: Uncovering Relevant Content Using Tag-based Recommender Systems. In: Proceedings of the 2008 ACM Conference on Recommender Systems. ACM, New York, NY, 51-58.
     Zhen, Yi; Li, Wu-Jun; Yeung, Dit-Yan (2009): TagiCoFi: Tag Informed Collaborative Filtering. In: Proceedings of the Third ACM Conference on Recommender Systems, 69-76.
