Harnessing Folksonomies for Resource Classification

Transcript

  1. 1. Harnessing Folksonomies for Resource Classification. PhD Thesis. Arkaitz Zubiaga, UNED, July 12th, 2011. Advisors: Raquel Martínez Unanue and Víctor Fresno Fernández.
  2. 2. Table of Contents: (1) Motivation; (2) Selection of a Classifier; (3) STS & Datasets; (4) Representing the Aggregation of Tags; (5) Tag Distributions on STS; (6) User Behavior on STS; (7) Conclusions & Outlook; (8) Publications.
  3. 3. Index: section 1, Motivation.
  4. 4. Motivation: Resource Classification.
  7. 7. Motivation: Resource Classification. Classifying resources (web pages, books, movies, files, ...) is a common task. Large collections of resources are expensive and effortful to classify manually: the Library of Congress (LoC) reported an average cost of $94.58 for cataloging each book in 2002. Enormous costs and efforts → automatic classification.
  8. 8. Motivation: Resource Classification. Representation of resources → self-content. Using the self-content of resources presents some issues: it is not always representative enough, and it is not always accessible (e.g., books). Social tags provided by users → an alternative to solve the problem.
  9. 9. Motivation: Tagging. T1, T2, T3 = sets of tags.
  10. 10. Motivation: Social Tagging. Aggregation of user annotations → folksonomy. Folksonomy: Folk (People) + Taxis (Classification) + Nomos (Management).
  11. 11. Motivation: Organization of Resources. User annotations → each user's own organization of resources. A user's tags (tag: # resources): research: 82; twitter: 28; web2.0: 35; language: 42; english: 64; ...
  12. 12. Motivation: Example of Bookmarks.
        #  User   Resource     Tags
        1  user1  flickr.com   photo, web2.0, social
        2  user2  flickr.com   photography, images
        3  user1  google.com   searchengine
        4  user3  twitter.com  microblogging, twitter
        Bookmark: (1) the user u_i ∈ U who annotates, (2) the resource r_j ∈ R being annotated, and (3) the tags T_ij = {t_1, ..., t_n} ⊆ T utilized. b_ij : u_i × r_j × T_ij
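The bookmark triples defined on this slide map naturally onto a small data structure. The following is a minimal sketch, not code from the thesis: the names `Bookmark`, `BOOKMARKS` and `aggregate_tags` are illustrative assumptions, and the aggregation assumes each user bookmarks a given resource at most once, so a tag's count on a resource equals the number of users who applied it.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class Bookmark(NamedTuple):
    """One bookmark b_ij: user u_i annotates resource r_j with the tag set T_ij."""
    user: str
    resource: str
    tags: FrozenSet[str]


# The four example bookmarks listed on the slide.
BOOKMARKS: List[Bookmark] = [
    Bookmark("user1", "flickr.com", frozenset({"photo", "web2.0", "social"})),
    Bookmark("user2", "flickr.com", frozenset({"photography", "images"})),
    Bookmark("user1", "google.com", frozenset({"searchengine"})),
    Bookmark("user3", "twitter.com", frozenset({"microblogging", "twitter"})),
]


def aggregate_tags(bookmarks: List[Bookmark]) -> Dict[str, Counter]:
    """Count, per resource, how many bookmarks use each tag."""
    per_resource: Dict[str, Counter] = defaultdict(Counter)
    for bookmark in bookmarks:
        # Each tag appears once per bookmark, so it adds 1 to the count.
        per_resource[bookmark.resource].update(bookmark.tags)
    return dict(per_resource)


if __name__ == "__main__":
    for resource, counts in aggregate_tags(BOOKMARKS).items():
        print(resource, counts.most_common())
```

Summing these per-resource counters over many users is what produces a ranked tag list for a resource.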
  13. 13. Motivation: Sum of Annotations. Top tags (79,681 users):
        Rank  Tag          User count
        1     photos       22,712
        2     flickr       19,046
        3     photography  15,968
        4     photo        15,225
        5     sharing      10,648
        6     images        9,637
        7     web2.0        9,528
        8     community     4,571
        9     social        3,798
        10    pictures      3,115
  14. 14. Motivation: Tag-based Resource Classification.
  15. 15. Motivation: Problem Statement. How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?
  16. 16. Motivation: Related Work. Social tags for information management. Search: Bao et al. (2007) & Heymann et al. (2008). Recommender systems: Shepitsen et al. (2008) & Li et al. (2008). Enhanced browsing: Smith (2008).
  17. 17. Motivation: Related Work. Classification: Noll and Meinel (2008) → statistical analysis of matches between tags & taxonomies. Tags are useful for broad categorization, but not for narrower categorization. Lack of further research with: actual classification experiments; other types of resources; different representations of social tags.
  18. 18. Index: section 2, Selection of a Classifier.
  19. 19. Selection of a Classifier: Characteristics of the task. We have a large set of resources (some labeled + many unlabeled) and a multiclass taxonomy. Automated classifiers learn a model from labeled resources. This model is used to classify unlabeled resources afterward. 2 learning settings: supervised (only labeled resources are considered for learning) and semi-supervised (unlabeled resources are also taken into account).
  20. 20. Selection of a Classifier: Support Vector Machines (SVM). Hyperplane that separates with the largest margin. Use of kernels → redimensions the space. Resource/hyperplane margin → classifier's reliability.
  21. 21. Selection of a Classifier: Selection of a Classifier. SVMs solve binary problems by default. To solve multiclass tasks: a native multiclass classifier (mSVM), or a combination of binary classifiers, either one-against-all (oaaSVM) or one-against-one (oaoSVM). Both supervised (s) and semi-supervised (ss).
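The two ways of combining binary classifiers can be sketched as follows. This is a hedged illustration only: the slides do not state how the binary outputs were combined, so picking the largest margin (one-against-all) and majority voting (one-against-one) are assumed here as common choices, and all function names are hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple


def one_against_all(margins: Dict[Hashable, float]) -> Hashable:
    """margins[c] is the margin of the binary 'c versus the rest' classifier.

    Assumed decision rule: pick the class with the largest margin.
    """
    return max(margins, key=margins.get)


def one_against_one(
    pair_margins: Dict[Tuple[Hashable, Hashable], float],
    classes: Sequence[Hashable],
) -> Hashable:
    """pair_margins[(a, b)] > 0 favours class a, otherwise class b.

    Assumed decision rule: majority vote over all pairwise classifiers.
    """
    votes = {c: 0 for c in classes}
    for (first, second), margin in pair_margins.items():
        winner = first if margin > 0 else second
        votes[winner] += 1
    return max(classes, key=lambda c: votes[c])


def binary_classifier_count(n_classes: int) -> Dict[str, int]:
    """Number of binary classifiers each combination scheme has to train."""
    return {
        "one-against-all": n_classes,
        "one-against-one": n_classes * (n_classes - 1) // 2,
    }


if __name__ == "__main__":
    print(one_against_all({"a": -0.2, "b": 0.7, "c": 0.1}))
    print(binary_classifier_count(10))
```

For a 10-class taxonomy, one-against-all needs 10 binary classifiers and one-against-one needs 45, which is one practical reason to consider a native multiclass SVM.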
  22. 22. Selection of a Classifier: Experiment Settings. 3 benchmark datasets to analyze the suitability of classifiers:
        Dataset      # web pages  # trainset  # categories
        BankSearch   10,000       3,000       10
        WebKB        4,518        1,000       6
        Y! Science   788          100         6
        We present accuracy to show performance. We perform 6 runs, and show the average accuracy.
  23. 23. Selection of a Classifier: Results.
                      BankSearch  WebKB  Y! Science
        mSVM (s)      .925        .810   .825
        mSVM (ss)     .923        .778   .836
        oaaSVM (s)    .843        .776   .536
        oaaSVM (ss)   .842        .773   .565
        oaoSVM (s)    .826        .775   .483
        oaoSVM (ss)   .811        .754   .514
        The native multiclass classifier performs best, while the supervised and semi-supervised settings perform comparably. We used the supervised approach, as it is computationally less expensive.
  24. 24. Index: section 3, STS & Datasets.
  25. 25. STS & Datasets: Requirements. Selected STS should have: large communities involved; public access to data; consolidated taxonomies as a ground truth. We chose Delicious, LibraryThing & GoodReads.
  26. 26. STS & Datasets: Characteristics of STS.
        Resources: Delicious: web documents; LibraryThing: books; GoodReads: books.
        Tag suggestions: Delicious: based on earlier bookmarks on the resource; LibraryThing: no; GoodReads: based on earlier tags utilized by the user.
        Tag insertion: Delicious: space-separated; LibraryThing: comma-separated; GoodReads: one by one, in a text box.
        Saving a resource: Delicious: prompts user to add tags; LibraryThing: prompts user to add tags at a second step; GoodReads: user needs to click again to add tags.
  31. 31. STS & Datasets: Retrieval of Categorized Resources. Retrieval of popular annotated resources, which were also categorized by experts.
                      Top level (L1)       Second level (L2)
                      Resources  Classes   Resources  Classes
        Web    ODP    12,616     17        12,286     243
        Books  DDC    27,299     10        27,040     99
        Books  LCC    24,861     20        23,565     204
  32. 32. STS & Datasets: Retrieval of Additional User Annotations. Delicious: 300,571,231 bookmarks → 273,478,137 annotated (91.00%). LibraryThing: 44,612,784 bookmarks → 22,343,427 annotated (50.08%). GoodReads: 47,302,861 bookmarks → 9,323,539 annotated (19.71%). This shows the importance of the system's encouragement to tag resources.
  33. 33. STS & Datasets: Tag Popularity on Resources. [Plot: average usage (0 to 1) against tag rank on resources, for Delicious, LibraryThing and GoodReads.]
  34. 34. STS & Datasets: Tag Novelty in Bookmarks by Rank. [Plot: % of novelty (0% to 100%) against bookmark rank (1 to 99), for Delicious and LibraryThing.]
  35. 35. STS & Datasets: Retrieval of Additional Data. URLs: self-content (by crawling the URLs) and user reviews (Delicious & StumbleUpon). Books: self-content (unavailable), represented by synopses (Barnes&Noble) and editorial reviews (Amazon); user reviews (LibraryThing, GoodReads & Amazon).
  36. 36. STS & Datasets: Summary of the Analysis of Datasets. Few users annotate resources when the system does not encourage them to do so. Resource-based tag suggestions → repeated use of popular tags.
  37. 37. Index: section 4, Representing the Aggregation of Tags.
  38. 38. Representing the Aggregation of Tags: Representing Resources Using Tags. There are different ways to aggregate user annotations on a vectorial representation. 2 major factors to consider: What tags to use? How to weigh those tags?
  39. 39. Representing the Aggregation of Tags: Representing Resources Using Tags. Use of all tags (FTA), or just the top 10 tags for each resource. 4 different weightings. Example of a resource (100 users): t1 (50), t2 (30), t3 (20), ..., t9 (2), t10 (1), ..., tn (1). Top 10 covers t1 to t10; FTA covers t1 to tn.
                    t1    t2    t3    ...  t9    t10   ...  tn
        Ranks       1     0.9   0.8   ...  0.2   0.1   ...  0
        Fractions   0.5   0.3   0.2   ...  0.02  0.01  ...  0.01
        Binary      1     1     1     ...  1     1     ...  1
        TF          50    30    20    ...  2     1     ...  1
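The four weightings in the table can be written down compactly. The sketch below is an illustration under stated assumptions rather than the thesis implementation: the function name `tag_weights`, the alphabetical tie-breaking, and the rank formula (11 - rank) / 10 (inferred from the 1, 0.9, ..., 0.1 pattern in the table) are all assumptions.

```python
from typing import Dict, Optional


def tag_weights(
    tag_counts: Dict[str, int],
    n_users: int,
    scheme: str,
    top_n: Optional[int] = None,
) -> Dict[str, float]:
    """Weight the tags of one resource.

    tag_counts maps each tag to the number of users who applied it.
    top_n=10 keeps only the top 10 tags; top_n=None keeps all tags (FTA).
    """
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(tag_counts.items(), key=lambda item: (-item[1], item[0]))
    if top_n is not None:
        ranked = ranked[:top_n]

    weights: Dict[str, float] = {}
    for rank, (tag, count) in enumerate(ranked, start=1):
        if scheme == "ranks":
            # 1.0 for the top tag down to 0.1 for the 10th; 0 beyond that.
            weights[tag] = max(0, 11 - rank) / 10
        elif scheme == "fractions":
            weights[tag] = count / n_users
        elif scheme == "binary":
            weights[tag] = 1.0
        elif scheme == "tf":
            weights[tag] = float(count)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights


if __name__ == "__main__":
    counts = {"t1": 50, "t2": 30, "t3": 20}
    for scheme in ("ranks", "fractions", "binary", "tf"):
        print(scheme, tag_weights(counts, n_users=100, scheme=scheme))
```

Each returned dictionary is one sparse vector for the resource, which is the form a classifier such as an SVM would consume after mapping tags to feature indices.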
  40. 40. Representing the Aggregation of Tags: Representing Resources Using Other Data Sources. To represent resources using content and reviews: (1) removal of HTML tags; (2) removal of stopwords; (3) stemming of the remaining words; (4) TF-IDF weighting of words.
  41. 41. Representing the Aggregation of Tags: Experiment Setup. Multiclass SVMs. We show the average accuracy of 6 runs. For clarity of presentation, we limit results to: the LCC taxonomy for books; training sets of 6,000 URLs (6,616 (L1)/6,286 (L2) for test); training sets of 18,000 books (6,861 (L1)/5,565 (L2) for test).
  42. 42. Representing the Aggregation of Tags: Experiment Setup. Compared representations: self-content (baseline); reviews; tags: Ranks (Top 10), Fractions (Top 10 & FTA), Binary (Top 10 & FTA), TF (Top 10 & FTA).
  43. 43. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources. Number of classes in parentheses; rows from Ranks to TF (FTA) are tag representations.
                          Delicious          LThing             GReads
                          L1 (17)  L2 (243)  L1 (20)  L2 (204)  L1 (20)  L2 (204)
        Content           .610     .470      .807     .673      .807     .673
        Reviews           .646     .524      .828     .705      .828     .705
        Ranks             .484     .360      .795     .511      .630     .405
        Fractions (10)    .464     .349      .738     .411      .663     .427
        Fractions (FTA)   .461     .336      .712     .409      .654     .432
        Binary (10)       .531     .361      .770     .550      .623     .422
        Binary (FTA)      .572     .529      .655     .606      .639     .481
        TF (10)           .654     .545      .855     .722      .713     .491
        TF (FTA)          .680     .568      .857     .736      .731     .517
        Usually, FTA > 10.
  44. 44. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources (same table as slide 43). TF (FTA) is the best approach for tags.
  45. 45. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources.
                   Delicious          LThing             GReads
                   L1 (17)  L2 (243)  L1 (20)  L2 (204)  L1 (20)  L2 (204)
        Content    .610     .470      .807     .673      .807     .673
        Reviews    .646     .524      .828     .705      .828     .705
        Tags       .680     .568      .857     .736      .731     .517
        Tags clearly outperform content and reviews on Delicious and LibraryThing.
  46. 46. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources (same table as slide 45). GoodReads discourages tagging, so its tags are insufficient to outperform content and reviews.
  47. 47. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources (same table as slide 45). Tags are also useful for deeper categorization (L2).
  48. 48. Representing the Aggregation of Tags: Classifier Committees. Despite the superiority of social tags, all data sources perform well. Their outputs can be combined by using classifier committees. Classifier committees add up the margins (i.e., reliability values) output by several classifiers, and provide a single combined prediction.
                              Cat. #1  Cat. #2  Cat. #3
        Classif. A            1.2      1.1      0.6
        Classif. B            0.5      1.0      1.2
        Classif. committees   1.7      2.1      1.8
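The committee rule on this slide, adding up per-category margins and predicting the category with the largest sum, can be sketched in a few lines. The function name `committee_predict` is a hypothetical choice, and the example reuses the margins from the slide's table.

```python
from typing import List, Sequence, Tuple


def committee_predict(
    margins_per_classifier: Sequence[Sequence[float]],
) -> Tuple[List[float], int]:
    """Sum the margins of several classifiers category by category.

    Each inner sequence holds one classifier's margin for every category,
    in the same category order. Returns the combined margins and the
    0-based index of the winning category.
    """
    combined = [sum(column) for column in zip(*margins_per_classifier)]
    best = max(range(len(combined)), key=lambda i: combined[i])
    return combined, best


if __name__ == "__main__":
    classifier_a = [1.2, 1.1, 0.6]  # alone, A would predict Cat. #1
    classifier_b = [0.5, 1.0, 1.2]  # alone, B would predict Cat. #3
    combined, best = committee_predict([classifier_a, classifier_b])
    print([round(value, 1) for value in combined], "-> Cat. #%d" % (best + 1))
```

In this example each classifier on its own prefers a different category, yet the committee settles on Cat. #2, the one category both classifiers score as fairly reliable.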
  49. 49. Representing the Aggregation of Tags: Results of Classifier Committees. Rows C+R to C+R+T are committees.
                       Delicious          LThing             GReads
                       L1 (17)  L2 (243)  L1 (20)  L2 (204)  L1 (20)  L2 (204)
        Content (C)    .610     .470      .807     .673      .807     .673
        Reviews (R)    .646     .524      .828     .705      .828     .705
        Tags (T)       .680     .568      .857     .736      .731     .517
        C+R            .670     .547      .817     .704      .817     .704
        C+T            .696     .587      .821     .720      .832     .696
        R+T            .694     .584      .859     .755      .857     .730
        C+R+T          .699     .588      .827     .732      .843     .727
        Classifier committees successfully improve performance, even on GoodReads, where tags were not good enough on their own.
  50. 50. Representing the Aggregation of Tags: Results of Classifier Committees (same table as slide 49). Data sources must be chosen with care: all 3 are helpful on Delicious; content is harmful for books. Is it inappropriate to consider synopses and editorial reviews as a summary of content?
  51. 51. Representing the Aggregation of Tags: Summary of Results. Resources are best represented using all tags with TF weighting. Tags perform accurately even for deeper levels. The system must encourage users to tag for the tags to be useful enough. Tags can be combined with other data to improve performance. Combined data sources must be chosen with care.
  52. 52. Index: section 5, Tag Distributions on STS.
  53. 53. Tag Distributions on STS: Tag Distributions. So far, we have considered that tags annotated by the same number of users are equally representative of the resource. The distribution of tags in a collection could help determine how representative each tag is.
  54. 54. Tag Distributions on STS: TF-IDF. TF-IDF is an inverse weighting function (IWF) that combines the term frequency (TF) and the inverse document frequency (IDF): $\text{tf-idf}_{ij} = \text{tf}_{ij} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$. High IDF value for terms appearing in few documents. Low IDF value for terms appearing in many documents.
  55. 55. Tag Distributions on STS: Tag Weighting Functions. Analogous to TF-IDF on folksonomies: TF-IRF → distributions across resources; TF-IUF → distributions across users; TF-IBF → distributions across bookmarks. TF-IRF and TF-IUF had been barely used, and their suitability was yet unexplored. TF-IBF had not been used.
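By analogy with the TF-IDF formula, the three inverse factors can be computed by swapping the document collection for resources, users or bookmarks. The sketch below assumes exactly that analogy (for instance, IRF = log(|R| / number of resources annotated with the tag)) and uses the natural logarithm; the slides do not spell out the exact definitions, and the names `inverse_weights` and `weighted_tf` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (user, resource, tags) triples, as in the bookmark definition.
BookmarkTriple = Tuple[str, str, Set[str]]


def inverse_weights(bookmarks: Iterable[BookmarkTriple]) -> Dict[str, Dict[str, float]]:
    """Compute IRF, IUF and IBF for every tag in a collection of bookmarks."""
    bookmarks = list(bookmarks)
    users = {user for user, _, _ in bookmarks}
    resources = {resource for _, resource, _ in bookmarks}

    tag_users = defaultdict(set)       # users who applied each tag
    tag_resources = defaultdict(set)   # resources annotated with each tag
    tag_bookmarks = defaultdict(int)   # bookmarks containing each tag
    for user, resource, tags in bookmarks:
        for tag in set(tags):
            tag_users[tag].add(user)
            tag_resources[tag].add(resource)
            tag_bookmarks[tag] += 1

    return {
        tag: {
            "irf": math.log(len(resources) / len(tag_resources[tag])),
            "iuf": math.log(len(users) / len(tag_users[tag])),
            "ibf": math.log(len(bookmarks) / tag_bookmarks[tag]),
        }
        for tag in tag_bookmarks
    }


def weighted_tf(tf: float, inverse_factor: float) -> float:
    """TF-IRF, TF-IUF or TF-IBF: the tag count on a resource times the inverse factor."""
    return tf * inverse_factor


if __name__ == "__main__":
    sample = [
        ("u1", "r1", {"a", "b"}),
        ("u2", "r1", {"a"}),
        ("u1", "r2", {"a"}),
    ]
    for tag, factors in sorted(inverse_weights(sample).items()):
        print(tag, factors)
```

As with IDF, a tag that appears on every resource, for every user, or in every bookmark receives a factor of zero, so widely used tags are down-weighted relative to rarer ones.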
  56. 56. Tag Distributions on STS: Results Using IWFs.
                   Delicious     LThing        GReads
                   L1     L2     L1     L2     L1     L2
        TF         .680   .568   .857   .736   .731   .517
        TF-IRF     .639   .529   .894   .809   .799   .622
        TF-IBF     .641   .532   .895   .811   .800   .628
        TF-IUF     .661   .555   .892   .803   .794   .623
        All 3 IWFs clearly outperform TF for LibraryThing and GoodReads. The 3 IWFs perform similarly.
  57. 57. Tag Distributions on STS: Results Using IWFs (same table as slide 56). IWFs underperform on Delicious, due to tag suggestions that make the top tags extremely popular. IUF is superior to IBF and IRF there: users who make their own choices make the difference.
  58. 58. Tag Distributions on STS: Using Classifier Committees with IWFs. What about using tags represented with IWFs in classifier committees?
  59. 59. Tag Distributions on STS PhD Thesis Results Using IWF with Committees Arkaitz Zubiaga Motivation Selection of a Delicious LThing GReads Classifier STS & L1 L2 L1 L2 L1 L2 Datasets TF .699 .588 .859 .755 .857 .730 Representing the TF-IRF .697 .592 .885 .793 .864 .748 IWFs Aggregation of Tags TF-IBF .698 .592 .887 .797 .866 .751 Tag TF-IUF .700 .595 .885 .792 .864 .749 Distributions on STS User Behavior IWF-based committes are even better than TF-based ones. on STS Conclusions & Outlook Even on Delicious, where IWFs were not appropriate, Publications committees perform slightly better. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 59 / 98
  60. 60. Tag Distributions on STS: Results Using IWF with Committees

                      Delicious       LibraryThing    GoodReads
                      L1     L2       L1     L2       L1     L2
      TF              .699   .588     .859   .755     .857   .730
      TF-IRF (IWF)    .697   .592     .885   .793     .864   .748
      TF-IBF (IWF)    .698   .592     .887   .797     .866   .751
      TF-IUF (IWF)    .700   .595     .885   .792     .864   .749

      Although IWFs outperform TF within committees, IWFs on their own perform better on LibraryThing (.895 & .811).
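As a rough illustration of what a classifier committee does, the sketch below assumes that each member classifier (e.g. one per data source) outputs a margin per class and that the committee adds the margins and picks the class with the largest total. The combination rule, the function name `committee_predict`, and the numbers are assumptions for illustration; the slides in this section do not spell out the procedure.

```python
def committee_predict(margin_lists):
    """Sum per-class margins from several classifiers and return the best class.

    margin_lists: one dict per classifier, mapping class label -> margin.
    """
    totals = {}
    for margins in margin_lists:
        for label, margin in margins.items():
            totals[label] = totals.get(label, 0.0) + margin
    return max(totals, key=totals.get)

# Hypothetical margins from three member classifiers for one resource.
tags_clf = {"Arts": 0.9, "Science": 0.2}
content_clf = {"Arts": -0.5, "Science": 0.6}
reviews_clf = {"Arts": 0.1, "Science": 0.3}
predicted = committee_predict([tags_clf, content_clf, reviews_clf])
```

In this toy case the tag-based classifier alone would choose "Arts", while the summed margins favor "Science", so a committee can overturn a single member's decision.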
  61. 61. Tag Distributions on STS: Summary of Results Using IWFs

      IWFs are an appropriate way to weight tags when used on classifier committees.
      The exception is LibraryThing, where tags on their own perform better.
      Combined data sources must be appropriately chosen (e.g., synopses & ed. reviews are harmful with books).
  62. 62. Index: 6 User Behavior on STS
  63. 63. User Behavior on STS: Categorizers and Describers

      Körner [1] suggested 2 kinds of user behavior:

                                  Categorizer       Describer
      Goal of Tagging             later browsing    later retrieval
      Change of Tag Vocabulary    costly            cheap
      Size of Tag Vocabulary      limited           open
      Tags                        subjective        objective

      They found that Describers help infer semantic relations among tags.
      Do these tagging behaviors affect the usefulness of tags for resource classification?

      [1] C. Körner. Understanding the Motivation behind Tagging. Hypertext 2009.
  64. 64. User Behavior on STS: Categorizers and Describers
  65. 65. User Behavior on STS: Categorizers and Describers
  66. 66. User Behavior on STS: Weighting Measures

      We use 3 measures to weight users, based on Körner et al. (2010).
      2 factors are considered: verbosity & diversity.
  67. 67. User Behavior on STS: Weighting Measures: TPP

      Tags per Post (TPP) – Verbosity

      $TPP(u) = \frac{\sum_{r} |T_{ur}|}{|R_u|}$
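A minimal sketch of TPP for a single user, assuming that T_ur denotes the set of tags user u assigned to resource r and R_u the set of resources u annotated (the slide does not define the symbols). The function name `tpp` and the toy posts are hypothetical.

```python
def tpp(posts):
    """Tags per post: total tag assignments divided by resources annotated.

    posts: dict mapping resource -> set of tags the user assigned to it.
    """
    return sum(len(tags) for tags in posts.values()) / len(posts)

# Toy user with two posts (illustrative only).
user_posts = {
    "flickr.com": {"photo", "web2.0", "social"},
    "google.com": {"searchengine"},
}
verbosity = tpp(user_posts)  # 4 tag assignments over 2 resources
```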
  68. 68. User Behavior on STS: Weighting Measures: ORPHAN

      Orphan Ratio (ORPHAN) – Diversity

      $n = \frac{|R(t_{max})|}{100}$

      $ORPHAN(u) = \frac{|T_u^o|}{|T_u|}, \quad T_u^o = \{ t \mid |R(t)| \le n \}$
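A minimal sketch of the orphan ratio for one user, assuming that R(t) is the set of resources the user annotated with tag t, t_max is the user's most used tag, and T_u is the user's tag vocabulary. No rounding of n is applied because the slide shows none; the function name `orphan_ratio` and the toy data are hypothetical.

```python
def orphan_ratio(tag_resources, divisor=100):
    """Share of a user's tags that were used on at most n resources.

    tag_resources: dict mapping tag -> set of resources the user tagged with it.
    n is the resource count of the user's most used tag divided by `divisor`.
    """
    n = max(len(rs) for rs in tag_resources.values()) / divisor
    orphans = [t for t, rs in tag_resources.items() if len(rs) <= n]
    return len(orphans) / len(tag_resources)

# Toy vocabulary: the most used tag covers 200 resources, so n = 2.
vocabulary = {
    "research": set(range(200)),
    "web2.0": set(range(50)),
    "twitter": {0, 1},
    "todo": {7},
}
diversity = orphan_ratio(vocabulary)  # "twitter" and "todo" are orphans
```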
  69. 69. User Behavior on STS: Weighting Measures: TRR

      Tag Resource Ratio (TRR) – Verbosity + Diversity

      $TRR(u) = \frac{|T_u|}{|R_u|}$
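A minimal sketch of TRR for one user, assuming T_u is the user's set of distinct tags and R_u the set of resources the user annotated. The function name `trr` and the toy posts are hypothetical.

```python
def trr(posts):
    """Tag resource ratio: distinct tags divided by resources annotated.

    posts: dict mapping resource -> set of tags the user assigned to it.
    """
    distinct_tags = set().union(*posts.values())
    return len(distinct_tags) / len(posts)

# Toy user who reuses "python" on every post (illustrative only).
user_posts = {
    "r1": {"python", "code"},
    "r2": {"python"},
    "r3": {"python", "web"},
}
ratio = trr(user_posts)  # 3 distinct tags over 3 resources
```

Because repeated tags are counted once, a user who reuses a small vocabulary scores lower here than on a per-post count of tag assignments.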
  70. 70. User Behavior on STS: Use of Weighting Measures

      These 3 measures provide:
        A weight for each user.
        Ranking of users according to each measure.

      From rankings → subsets of users as extreme Categorizers (highest-ranked) and extreme Describers (lowest-ranked).

      Subsets range from 10% to 100% (step size = 10%).
  71. 71. User Behavior on STS: Experiment Setup

      We select subsets of users according to number of tag assignments.
      Selecting by percentage of users would be unfair → subsets would contain different amounts of data.
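One way to read this setup is sketched below: walk down a user ranking and keep adding users until they cover a given percentage of all tag assignments. The stopping rule, the function name `subset_by_assignments`, and the toy counts are assumptions for illustration, not the exact procedure of the thesis.

```python
def subset_by_assignments(ranking, assignments, percent):
    """Take users from the front of `ranking` until they cover `percent`
    of all tag assignments.

    ranking:     user ids ordered by a measure (e.g. TPP), most extreme first
    assignments: user id -> number of tag assignments made by that user
    """
    budget = sum(assignments.values()) * percent / 100
    chosen, covered = [], 0
    for user in ranking:
        if covered >= budget:
            break
        chosen.append(user)
        covered += assignments[user]
    return chosen

# Toy ranking with very uneven activity (illustrative only).
ranking = ["u1", "u2", "u3", "u4"]
assignments = {"u1": 10, "u2": 10, "u3": 30, "u4": 50}
smallest = subset_by_assignments(ranking, assignments, 10)
half = subset_by_assignments(ranking, assignments, 50)
other_end = subset_by_assignments(ranking[::-1], assignments, 50)
```

In the toy data, half of the tag assignments correspond to three users from one end of the ranking but a single user from the other end, which shows why equal percentages of users would yield unequal amounts of data.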
  72. 72. User Behavior on STS: Experiments

      Classification
        We use a multiclass SVM, with TF weighting of tags.

      Descriptivity
        Vectorial representations of resources:
          T_r → tag frequencies.
          R_r → term frequencies on descriptive data (self-content).

        Cosine similarity between T_r and R_r:

        $\cos(\theta_r) = \frac{\sum_{i=1}^{n} T_{ri} \times R_{ri}}{\sqrt{\sum_{i=1}^{n} (T_{ri})^2} \times \sqrt{\sum_{i=1}^{n} (R_{ri})^2}}$
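The cosine similarity above can be sketched for sparse vectors stored as dictionaries. The function name `cosine`, the zero-vector convention (returning 0.0), and the toy vectors are assumptions for illustration.

```python
import math

def cosine(tag_vec, term_vec):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(w * term_vec.get(k, 0) for k, w in tag_vec.items())
    norm_t = math.sqrt(sum(w * w for w in tag_vec.values()))
    norm_r = math.sqrt(sum(w * w for w in term_vec.values()))
    if norm_t == 0 or norm_r == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_t * norm_r)

# Toy vectors (illustrative only).
tags = {"photo": 3, "sharing": 4}
terms = {"photo": 3, "sharing": 4}
same = cosine(tags, terms)                       # identical vectors
disjoint = cosine({"photo": 1}, {"search": 2})   # no shared dimensions
```

Identical vectors score 1 and vectors with no shared terms score 0, so higher values indicate tags that overlap more with the descriptive text of the resource.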
  73. 73. User Behavior on STS: Descriptivity Results

      [Figure panels: columns TPP (Verb.), ORPHAN (Div.), TRR (V. + D.); rows Delicious, LibraryThing, GoodReads]
  74. 74. User Behavior on STS: Classification Results

      [Figure panels: columns TPP (Verb.), ORPHAN (Div.), TRR (V. + D.); rows Delicious, LibraryThing, GoodReads]
  75. 75. User Behavior on STS: Classification Results

      [Figure panels: columns TPP (Verb.), ORPHAN (Div.), TRR (V. + D.); rows Delicious, LibraryThing, GoodReads]
  76. 76. User Behavior on STS: Overall Categorizers/Describers Results

      Discriminating by verbosity (TPP) does best for finding extreme Categorizers.
      The use of non-descriptive tags provides more accurate classification.
  77. 77. Index: 7 Conclusions & Outlook
  78. 78. Conclusions & Outlook: Contributions

      Generation & analysis of 3 large-scale social tagging datasets.
      Release of some tagging datasets, used by Godoy and Amandi (2010), Strohmaier et al. (2010), Li et al. (2011), and Ares et al. (2011).
  79. 79. Conclusions & Outlook: Contributions

      First research work performing actual classification experiments using social tags.
        Analysis of different representations of social tags.
        Analysis of the effect of tag distributions.
        Study of user behavior.
      It paves the way for future researchers interested in the task & in the exploration of STS.
  80. 80. Conclusions & Outlook: Research Questions

      Apart from the Problem Statement:
        How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?

      We set forth 10 research questions.
  81. 81. Conclusions & Outlook: Research Questions (1)

      RQ 1  What is a suitable SVM classifier for the task?
        Native multiclass SVM >> Combinations of binary SVMs.
  82. 82. Conclusions & Outlook: Research Questions (2)

      RQ 2  What is a suitable learning method for the task?
        Supervised ≈ Semi-supervised.
        Unlike for binary tasks, where Semi-supervised >> Supervised (Joachims, 1999).
  83. 83. Conclusions & Outlook: Research Questions (3)

      RQ 3  How do the settings of STS affect folksonomies?
        Great impact of tag suggestions.
        Importance of encouraging users to annotate.
  84. 84. Conclusions & Outlook: Research Questions (4)

      RQ 4  How to amalgamate annotations to get a representation of a resource?
        Considering all the tags rather than only the top ones.
        Weighting tags according to number of users annotating them.
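A minimal sketch of this kind of aggregation for a single resource, assuming each bookmark is a (user, tags) pair and each user bookmarks the resource at most once. The function name `aggregate_tags` and the toy bookmarks are hypothetical.

```python
from collections import Counter

def aggregate_tags(bookmarks):
    """Count, for one resource, how many users assigned each tag.

    bookmarks: iterable of (user, tags) pairs for the same resource.
    All tags are kept, not only the most popular ones.
    """
    counts = Counter()
    for _user, tags in bookmarks:
        counts.update(set(tags))  # a tag counts once per bookmark
    return counts

# Toy bookmarks for one resource (illustrative only).
bookmarks = [
    ("user1", ["photo", "web2.0", "social"]),
    ("user2", ["photography", "images", "photo"]),
]
tag_weights = aggregate_tags(bookmarks)
```

The resulting counts can serve directly as a TF-style vector for the resource, with rarely used tags retained at low weight.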
  85. 85. Conclusions & Outlook: Research Questions (5)

      RQ 5  Is it worthwhile combining tags with other data sources?
        Combining different data sources helps improve performance.
        Data sources must be appropriately chosen.
  86. 86. Conclusions & Outlook: Research Questions (6)

      RQ 6  Are social tags specific enough to classify into narrower categories?
        Tags are as useful as for the top level.
        Noll and Meinel (2008) → suggested tags were probably not useful for deeper levels.
  87. 87. Conclusions & Outlook: Research Questions (7)

      RQ 7  Can we consider tag distributions to get the representativity of each tag?
        LibraryThing & GoodReads: really useful.
        Delicious: not useful, because of tag suggestions → need of committees to make them useful.
  88. 88. Conclusions & Outlook: Research Questions (8)

      RQ 8  What approach to use to weigh the representativity of tags?
        LibraryThing & GoodReads: IBF, IRF & IUF are very similar.
        Delicious: IUF clearly superior, because of users who ignore suggestions.
  89. 89. Conclusions & Outlook: Research Questions (9)

      RQ 9  Can we identify users whose tagging more closely resembles an expert classification?
        Categorizers > Describers for classification.
        Need of appropriate measure for discriminating.
  90. 90. Conclusions & Outlook: Research Questions (10)

      RQ 10  What features identify a Categorizer?
        Categorizers can be found when discriminating by verbosity.
        Non-descriptive tags produce more accurate classification.
  91. 91. Conclusions & Outlook: Future Directions

      Increase of interest in the field, still much work to do.

      We have considered each tag as a different token.
        → Considering semantic meanings of social tags could help.

      Tag suggestions raise several issues in folksonomies.
        → Looking for a weighting function that fits the characteristics of systems with tag suggestions, e.g., Delicious.
  92. 92. Index: 8 Publications
  93. 93. Publications: Peer-Reviewed Conferences (I)

      Arkaitz Zubiaga, Christian Körner, Markus Strohmaier. 2011. Tags vs Shelves: From Social Tagging to Social Classification. In Proceedings of Hypertext 2011, the 22nd ACM Conference on Hypertext and Hypermedia, Eindhoven, Netherlands. (acceptance rate: 35/104, 34%)

      Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Getting the Most Out of Social Annotations for Web Page Classification. In Proceedings of DocEng 2009, the 9th ACM Symposium on Document Engineering, pp. 74-83, Munich, Germany. (acceptance rate: 16/54, 29.6%) [15 citations]
  94. 94. Publications: Peer-Reviewed Conferences (II)

      Arkaitz Zubiaga. 2009. Enhancing Navigation on Wikipedia with Social Tags. Wikimania 2009, Buenos Aires, Argentina. [6 citations]

      Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Content-based Clustering for Tag Cloud Visualization. In Proceedings of ASONAM 2009, International Conference on Advances in Social Networks Analysis and Mining, pp. 316-319, Athens, Greece. [3 citations]
  95. 95. Publications: Journals (I)

      Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2011. Augmenting Web Page Classifiers with Social Annotations. Procesamiento del Lenguaje Natural. (acceptance rate: 33/60, 55%)

      Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Clasificación de Páginas Web con Anotaciones Sociales. Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233. (acceptance rate: 36/72, 50%)

      Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Comparativa de Aproximaciones a SVM Semisupervisado Multiclase para Clasificación de Páginas Web. Procesamiento del Lenguaje Natural, vol. 42, pp. 63-70.
  96. 96. Publications: Journals (II)

      Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. Harnessing Folksonomies to Produce a Social Classification of Resources. IEEE Transactions on Knowledge and Data Engineering. (pending notification)
  97. 97. Publications: Book Chapters and Workshops

      Book Chapters
        Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2011. Exploiting Social Annotations for Resource Classification. Social Network Mining, Analysis and Research Trends: Techniques and Applications. IGI Global.

      Workshops
        Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification? In Proceedings of the NAACL-HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pp. 28-36, Boulder, CO, United States.
  98. 98. Publications: Thank You

      Achiu, Arigato, Danke, Dhannvaad, Dua Netjer en ek, Efcharisto, Gracias, Gràcies, Gratia, Grazie, Guishepeli, Hvala, Kiitos, Köszönöm, Mercé, Merci, Mila esker, Obrigado, Shukran, Tack, Tak, Takk, Tänan, Tapadh leat, Shukriya, Tesekkür ederim, Thank you, Toda

      http://thesis.zubiaga.org/

  7. 7. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier Classifying resources is a common task. STS & Datasets Web pages, books, movies, files,... Representing the Aggregation of Large collections of resources → expensive & effortful Tags to classify manually. Tag Distributions LoC reported an average cost of $94.58 for cataloging on STS each book in 2002. User Behavior on STS Conclusions & Enormous costs and efforts → automatic classification. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 7 / 98
  8. 8. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier Representation of resources → self-content. STS & Datasets Representing Use of self-content of resources presents some issues: the Aggregation of Not always representative enough. Tags Not always accessible (e.g., books). Tag Distributions on STS User Behavior Social tags provided by users → alternative to solve the on STS problem. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 8 / 98
  9. 9. Motivation PhD Thesis Tagging Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications T1, T2, T3 = sets of tags. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 9 / 98
  10. 10. Motivation PhD Thesis Social Tagging Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Aggregation of user annotations → folksonomy. Folksonomy: Folk (People) + Taxis (Classification) + Nomos (Management). Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 10 / 98
  11. 11. Motivation PhD Thesis Organization of Resources Arkaitz Zubiaga Motivation Selection of a User annotations → own organization of resources. Classifier STS & Datasets A user’s tags Representing the Tag # Resources Aggregation of Tags research 82 Tag twitter 28 Distributions on STS web2.0 35 User Behavior language 42 on STS english 64 Conclusions & Outlook ... ... Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 11 / 98
  12. 12. Motivation PhD Thesis Example of Bookmarks Arkaitz Zubiaga Motivation User Resource Tags Selection of a 1 user1 flickr.com photo, web2.0, social Classifier 2 user2 flickr.com photography, images STS & Datasets 3 user1 google.com searchengine Representing 4 user3 twitter.com microblogging, twitter the Aggregation of Tags Tag Bookmark: (1) user ui ∈ U who annotates (2) resource rj ∈ R being annotated Distributions on STS User Behavior (3) tags Tij = {t1 , ..., tn } ∈ T utilized. on STS Conclusions & Outlook Publications bij : ui × rj × Tij Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 12 / 98
  13. 13. Motivation PhD Thesis Sum of Annotations Arkaitz Zubiaga Motivation Top tags (79,681 users) Selection of a Classifier Tag Rank Tag User Count STS & 1 photos 22,712 Datasets 2 flickr 19,046 Representing the 3 photography 15,968 Aggregation of Tags 4 photo 15,225 Tag 5 sharing 10,648 Distributions on STS 6 images 9,637 User Behavior 7 web2.0 9,528 on STS Conclusions & 8 community 4,571 Outlook 9 social 3,798 Publications 10 pictures 3,115 Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 13 / 98
  14. 14. Motivation PhD Thesis Tag-based Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 14 / 98
  15. 15. Motivation PhD Thesis Problem Statement Arkaitz Zubiaga Motivation How can the annotations provided by users on social tagging Selection of a systems be exploited to improve the accuracy of a resource Classifier classification task? STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 15 / 98
  16. 16. Motivation PhD Thesis Related Work Arkaitz Zubiaga Motivation Selection of a Classifier STS & Social tags for information management: Datasets Search: Bao et al. (2007) & Heymann et al. (2008). Representing the Recommender Systems: Shepitsen et al. (2008) & Li Aggregation of Tags Tag et al. (2008). Distributions on STS User Behavior on STS Enhanced Browsing: Smith (2008). Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 16 / 98
  17. 17. Motivation PhD Thesis Related Work Arkaitz Zubiaga Motivation Selection of a Classifier Classification: Noll and Meinel (2008) → statistical STS & Datasets analysis of matches between tags & taxonomies. Representing Tags are useful for broad categorization. the Not for narrower categorization. Aggregation of Tags Tag Distributions Lack of further research with: on STS Actual classification experiments. User Behavior Other types of resources. on STS Different representations of social tags. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 17 / 98
  18. 18. Selection of a Classifier PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 18 / 98
  19. 19. Selection of a Classifier PhD Thesis Characteristics of the task Arkaitz Zubiaga Motivation We have: Selection of a Classifier Large set of resources: some labeled + many unlabeled. STS & Multiclass taxonomy. Datasets Representing the Automated classifiers learn a model from labeled Aggregation of resources. Tags Tag This model is used to classify unlabeled resources Distributions afterward. on STS User Behavior on STS 2 learning settings: Conclusions & Outlook Supervised: only labeled resources considered for learning. Semi-supervised: unlabeled resources are also taken into Publications account. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 19 / 98
  20. 20. Selection of a Classifier PhD Thesis Support Vector Machines (SVM) Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Hyperplane that separates with largest margin. Outlook Publications Use of kernels → redimensions the space. Resource/Hyperplane margin → Classifier’s reliability. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 20 / 98
  21. 21. Selection of a Classifier PhD Thesis Selection of a Classifier Arkaitz Zubiaga Motivation Selection of a Classifier SVMs solve binary problems by default. STS & Datasets Representing To solve multiclass tasks: the Aggregation of Native multiclass classifier (mSVM). Tags Combining binary classifiers: Tag one-against-all (oaaSVM). Distributions on STS one-against-one (oaoSVM). User Behavior on STS Conclusions & Both supervised (s) and semi-supervised (ss). Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 21 / 98
  22. 22. Selection of a Classifier PhD Thesis Experiment Settings Arkaitz Zubiaga Motivation Selection of a 3 benchmark datasets to analyze suitability of Classifier classifiers: STS & Datasets Representing Dataset # web pages # trainset # categories BankSearch the Aggregation of 10,000 3,000 10 WebKB Tags 4,518 1,000 6 Tag Distributions Y! Science 788 100 6 on STS User Behavior on STS We present accuracy to show performance. Conclusions & Outlook Publications We perform 6 runs, and show the average accuracy. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 22 / 98
  23. 23. Selection of a Classifier PhD Thesis Results Arkaitz Zubiaga Motivation BankSearch WebKB Y! Science Selection of a mSVM (s) .925 .810 .825 Classifier mSVM (ss) .923 .778 .836 STS & Datasets oaaSVM (s) .843 .776 .536 Representing the oaaSVM (ss) .842 .773 .565 Aggregation of Tags oaoSVM (s) .826 .775 .483 Tag oaoSVM (ss) .811 .754 .514 Distributions on STS User Behavior Native multiclass classifier performs best, while on STS Conclusions & supervised semi-supervised. Outlook Publications We used the supervised approach, as it is computationally less expensive. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 23 / 98
  24. 24. STS & Datasets PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 24 / 98
  25. 25. STS & Datasets PhD Thesis Requirements Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Selected STS should have: Representing Large communities involved. the Aggregation of Public access to data. Tags Consolidated taxonomies as a ground truth. Tag Distributions We chose Delicious, LibraryThing & GoodReads. on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 25 / 98
  26. 26. STS & Datasets PhD Thesis Characteristics of STS Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LibraryThing GoodReads Datasets Resources web documents books books Representing the Tag suggestions based on earlier no based on earlier Aggregation of bookmarks on the tags utilized by Tags resource the user Tag Distributions Tag insertion space-separated comma-separated one by one text- on STS box User Behavior Saving a resource prompts user to prompts user to user needs to on STS add tags add tags at sec- click again to add Conclusions & ond step tags Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 26 / 98
  31. 31. STS & Datasets: Retrieval of Categorized Resources
      Retrieval of popular annotated resources, which were also categorized by experts.
      | Source | Taxonomy | Top level (L1) resources | L1 classes | Second level (L2) resources | L2 classes |
      |---|---|---|---|---|---|
      | Web | ODP | 12,616 | 17 | 12,286 | 243 |
      | Books | DDC | 27,299 | 10 | 27,040 | 99 |
      | Books | LCC | 24,861 | 20 | 23,565 | 204 |
  32. 32. STS & Datasets: Retrieval of Additional User Annotations
      - Delicious: 300,571,231 bookmarks → 273,478,137 annotated (91.00%)
      - LibraryThing: 44,612,784 bookmarks → 22,343,427 annotated (50.08%)
      - GoodReads: 47,302,861 bookmarks → 9,323,539 annotated (19.71%)
      This shows how much it matters that the system encourages users to tag resources.
  33. 33. STS & Datasets: Tag Popularity on Resources
      [Figure: average usage (0 to 1) against tag rank on resources, for Delicious, LibraryThing and GoodReads.]
  34. 34. STS & Datasets: Tag Novelty in Bookmarks by Rank
      [Figure: % of novelty (0% to 100%) against bookmark rank (1 to 99), for Delicious and LibraryThing.]
  35. 35. STS & Datasets: Retrieval of Additional Data
      URLs:
      - Self-content, by crawling URLs.
      - User reviews (Delicious & StumbleUpon).
      Books:
      - Self-content (unavailable):
        - Synopses (Barnes&Noble).
        - Editorial reviews (Amazon).
      - User reviews (LibraryThing, GoodReads & Amazon).
  36. 36. STS & Datasets: Summary of the Analysis of Datasets
      - Few users annotate resources when the system does not encourage them to do so.
      - Resource-based tag suggestions → repeated use of popular tags.
  37. 37. Representing the Aggregation of Tags: Index
  38. 38. Representing the Aggregation of Tags: Representing Resources Using Tags
      There are different ways to aggregate user annotations into a vector representation.
      2 major factors to consider:
      - What tags to use?
      - How to weigh those tags?
  39. 39. Representing the Aggregation of Tags: Representing Resources Using Tags
      Use of all tags (FTA), or just the top 10 tags for each resource.
      4 different weightings.
      Example of a resource (100 users): t1 (50), t2 (30), t3 (20), ..., t9 (2), t10 (1), ..., tn (1)
      | Weighting | t1 | t2 | t3 | ... | t9 | t10 | ... | tn |
      |---|---|---|---|---|---|---|---|---|
      | Ranks | 1 | 0.9 | 0.8 | ... | 0.2 | 0.1 | ... | 0 |
      | Fractions | 0.5 | 0.3 | 0.2 | ... | 0.02 | 0.01 | ... | 0.01 |
      | Binary | 1 | 1 | 1 | ... | 1 | 1 | ... | 1 |
      | TF | 50 | 30 | 20 | ... | 2 | 1 | ... | 1 |
      Top 10 covers columns t1 to t10; FTA covers all columns, t1 to tn.
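
The four weightings on slide 39 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tag_weights`, its parameters, and the tie-breaking rule (ties keep their input order) are hypothetical and not taken from the thesis.

```python
def tag_weights(tag_counts, n_users, scheme, top_k=None):
    """Return {tag: weight} for one resource under a given weighting scheme.

    tag_counts maps each tag to the number of users who assigned it.
    top_k=None keeps all tags (FTA); top_k=10 keeps only the top 10.
    """
    # Order tags from most to least used; ties keep their input order.
    ranked = sorted(tag_counts, key=lambda t: -tag_counts[t])
    if top_k is not None:
        ranked = ranked[:top_k]
    weights = {}
    for position, tag in enumerate(ranked):
        count = tag_counts[tag]
        if scheme == "ranks":
            # 1.0 for the top tag, 0.9 for the second, ..., 0.1 for the tenth, 0 beyond.
            weights[tag] = max(10 - position, 0) / 10
        elif scheme == "fractions":
            weights[tag] = count / n_users
        elif scheme == "binary":
            weights[tag] = 1
        elif scheme == "tf":
            weights[tag] = count
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights


# Hypothetical resource annotated by 100 users.
counts = {"t1": 50, "t2": 30, "t3": 20, "t4": 1}
tf_all = tag_weights(counts, 100, "tf")
fractions_top3 = tag_weights(counts, 100, "fractions", top_k=3)
ranks = tag_weights(counts, 100, "ranks")
```

Here `top_k` switches between the Top 10 and FTA variants, so each of the compared representations is one call with a different `scheme` and `top_k`.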
  40. 40. Representing the Aggregation of Tags: Representing Resources Using Other Data Sources
      To represent resources using content and reviews:
      1. Removal of HTML tags.
      2. Removal of stopwords.
      3. Stemming of the remaining words.
      4. TF-IDF weighting of words.
  41. 41. Representing the Aggregation of Tags: Experiment Setup
      - Multiclass SVMs.
      - We show the average accuracy of 6 runs.
      - For clarity of presentation, we limit results to:
        - LCC taxonomy for books.
        - Training sets of 6,000 URLs (6,616 (L1)/6,286 (L2) for test).
        - Training sets of 18,000 books (6,861 (L1)/5,565 (L2) for test).
  42. 42. Representing the Aggregation of Tags: Experiment Setup
      Compared representations:
      - Self-content (baseline).
      - Reviews.
      - Tags:
        - Ranks (Top 10).
        - Fractions (Top 10 & FTA).
        - Binary (Top 10 & FTA).
        - TF (Top 10 & FTA).
  43. 43. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources
      | Representation | Delicious L1 (17) | Delicious L2 (243) | LThing L1 (20) | LThing L2 (204) | GReads L1 (20) | GReads L2 (204) |
      |---|---|---|---|---|---|---|
      | Content | .610 | .470 | .807 | .673 | .807 | .673 |
      | Reviews | .646 | .524 | .828 | .705 | .828 | .705 |
      | Tags: Ranks | .484 | .360 | .795 | .511 | .630 | .405 |
      | Tags: Fractions (10) | .464 | .349 | .738 | .411 | .663 | .427 |
      | Tags: Fractions (FTA) | .461 | .336 | .712 | .409 | .654 | .432 |
      | Tags: Binary (10) | .531 | .361 | .770 | .550 | .623 | .422 |
      | Tags: Binary (FTA) | .572 | .529 | .655 | .606 | .639 | .481 |
      | Tags: TF (10) | .654 | .545 | .855 | .722 | .713 | .491 |
      | Tags: TF (FTA) | .680 | .568 | .857 | .736 | .731 | .517 |
      Usually, FTA > Top 10.
  44. 44. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources
      (Same results table as on slide 43.)
      TF (FTA) is the best approach for tags.
  45. 45. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources
      | Representation | Delicious L1 (17) | Delicious L2 (243) | LThing L1 (20) | LThing L2 (204) | GReads L1 (20) | GReads L2 (204) |
      |---|---|---|---|---|---|---|
      | Content | .610 | .470 | .807 | .673 | .807 | .673 |
      | Reviews | .646 | .524 | .828 | .705 | .828 | .705 |
      | Tags | .680 | .568 | .857 | .736 | .731 | .517 |
      Tags clearly outperform content and reviews on Delicious and LibraryThing.
  46. 46. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources
      (Same results table as on slide 45.)
      Because GoodReads does not encourage tagging, its tags are insufficient to outperform content and reviews.
  47. 47. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources
      (Same results table as on slide 45.)
      Tags are also useful for deeper categorization (L2).
  48. 48. Representing the Aggregation of Tags: Classifier Committees
      Despite the superiority of social tags, all data sources perform well.
      Their outputs can be combined by using classifier committees.
      Classifier committees add up the margins (i.e., reliability values) output by several classifiers, and provide a single combined prediction.
      | | Cat. #1 | Cat. #2 | Cat. #3 |
      |---|---|---|---|
      | Classif. A | 1.2 | 1.1 | 0.6 |
      | Classif. B | 0.5 | 1.0 | 1.2 |
      | Classif. committees | 1.7 | 2.1 | 1.8 |
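
The margin-summing rule on slide 48 can be sketched as follows, reusing the slide's own numbers. The function name `committee_predict` is hypothetical, and the margins are assumed to be already available as one list per classifier; how each SVM produces them is outside this sketch.

```python
def committee_predict(margins_per_classifier):
    """Sum per-category margins across classifiers and pick the best category.

    margins_per_classifier holds one list of margins per classifier,
    all of the same length (one margin per category).
    Returns (combined margins, zero-based index of the winning category).
    """
    n_categories = len(margins_per_classifier[0])
    combined = [0.0] * n_categories
    for margins in margins_per_classifier:
        for category, margin in enumerate(margins):
            combined[category] += margin
    best = max(range(n_categories), key=lambda c: combined[c])
    return combined, best


classifier_a = [1.2, 1.1, 0.6]  # margins for Cat. #1, #2, #3
classifier_b = [0.5, 1.0, 1.2]
combined, best = committee_predict([classifier_a, classifier_b])
# best is a zero-based index, so best == 1 corresponds to Cat. #2.
```

In this example classifier A alone would pick Cat. #1 and classifier B alone would pick Cat. #3, while the committee settles on Cat. #2, the category both rate fairly highly.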
  49. 49. Representing the Aggregation of Tags: Results of Classifier Committees
      | Representation | Delicious L1 (17) | Delicious L2 (243) | LThing L1 (20) | LThing L2 (204) | GReads L1 (20) | GReads L2 (204) |
      |---|---|---|---|---|---|---|
      | Content (C) | .610 | .470 | .807 | .673 | .807 | .673 |
      | Reviews (R) | .646 | .524 | .828 | .705 | .828 | .705 |
      | Tags (T) | .680 | .568 | .857 | .736 | .731 | .517 |
      | Committee: C+R | .670 | .547 | .817 | .704 | .817 | .704 |
      | Committee: C+T | .696 | .587 | .821 | .720 | .832 | .696 |
      | Committee: R+T | .694 | .584 | .859 | .755 | .857 | .730 |
      | Committee: C+R+T | .699 | .588 | .827 | .732 | .843 | .727 |
      Classifier committees successfully improve performance, even on GoodReads, where tags were not good enough on their own.
  50. 50. Representing the Aggregation of Tags: Results of Classifier Committees
      (Same results table as on slide 49.)
      Data sources must be chosen with care:
      - All 3 are helpful on Delicious.
      - Content is harmful for books. Is it inappropriate to treat synopses and editorial reviews as a summary of content?
  51. 51. Representing the Aggregation of Tags: Summary of Results
      - Resources are best represented using all tags with TF weighting.
      - Tags perform accurately even for deeper levels.
      - The system must encourage users to tag for the tags to be useful enough.
      - Tags can be combined with other data to improve performance.
      - Combined data sources must be chosen with care.
  52. 52. Tag Distributions on STS: Index
  53. 53. Tag Distributions on STS: Tag Distributions
      So far, we have assumed that tags annotated by the same number of users are equally representative of the resource.
      Distributions of tags in a collection could help determine the representativity of tags.
  54. 54. Tag Distributions on STS: TF-IDF
      TF-IDF is an inverse weighting function (IWF) that combines:
      - the term frequency (TF).
      - the inverse document frequency (IDF).
      $\text{tf-idf}_{ij} = \text{tf}_{ij} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$
      High IDF value for terms appearing in few documents.
      Low IDF value for terms appearing in many documents.
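
As a worked illustration of the TF-IDF formula on slide 54, here is a minimal Python sketch. It assumes raw counts for TF and the natural logarithm (the slide does not state the base); the function name `tf_idf` and the toy documents are hypothetical.

```python
import math


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)  # |D|
    # Document frequency: number of documents in which each term appears.
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weighted = []
    for tokens in documents:
        weights = {}
        for term in tokens:
            weights[term] = weights.get(term, 0) + 1  # raw term frequency
        for term in weights:
            weights[term] *= math.log(n_docs / doc_freq[term])
        weighted.append(weights)
    return weighted


docs = [["svm", "tags", "tags"], ["svm", "books"], ["svm", "web"]]
weights = tf_idf(docs)
```

The term "svm" appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing, while "tags" appears in only one document and receives the highest weight there.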
  55. 55. Tag Distributions on STS: Tag Weighting Functions
      Analogous to TF-IDF on folksonomies:
      - TF-IRF → distributions across resources.
      - TF-IUF → distributions across users.
      - TF-IBF → distributions across bookmarks.
      TF-IRF and TF-IUF had barely been used, and their suitability was still unexplored.
      TF-IBF had not been used.
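
The three inverse frequencies on slide 55 can be sketched by direct analogy with IDF, counting resources, users, or bookmarks instead of documents. The exact definitions used in the thesis are not given in this transcript, so the formulas below (log of total count over count containing the tag, natural logarithm) are an assumption, as are the function name and the toy data.

```python
import math
from collections import defaultdict


def inverse_frequencies(bookmarks):
    """bookmarks: list of (user, resource, tags) triples.

    Returns three {tag: value} dicts: (irf, iuf, ibf).
    """
    resources_with = defaultdict(set)   # tag -> resources annotated with it
    users_with = defaultdict(set)       # tag -> users who used it
    bookmarks_with = defaultdict(int)   # tag -> number of bookmarks containing it
    all_resources, all_users = set(), set()
    for user, resource, tags in bookmarks:
        all_users.add(user)
        all_resources.add(resource)
        for tag in set(tags):
            resources_with[tag].add(resource)
            users_with[tag].add(user)
            bookmarks_with[tag] += 1
    irf = {t: math.log(len(all_resources) / len(r)) for t, r in resources_with.items()}
    iuf = {t: math.log(len(all_users) / len(u)) for t, u in users_with.items()}
    ibf = {t: math.log(len(bookmarks) / n) for t, n in bookmarks_with.items()}
    return irf, iuf, ibf


# Hypothetical folksonomy with 3 users, 3 resources and 4 bookmarks.
bookmarks = [
    ("alice", "r1", ["python", "web"]),
    ("bob", "r1", ["python"]),
    ("alice", "r2", ["python", "books"]),
    ("carol", "r3", ["books"]),
]
irf, iuf, ibf = inverse_frequencies(bookmarks)
```

Multiplying a resource's TF value for a tag by `irf[tag]`, `iuf[tag]` or `ibf[tag]` would then give the TF-IRF, TF-IUF or TF-IBF weight respectively.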
  56. 56. Tag Distributions on STS: Results Using IWFs
      | Weighting | Delicious L1 | Delicious L2 | LThing L1 | LThing L2 | GReads L1 | GReads L2 |
      |---|---|---|---|---|---|---|
      | TF | .680 | .568 | .857 | .736 | .731 | .517 |
      | IWF: TF-IRF | .639 | .529 | .894 | .809 | .799 | .622 |
      | IWF: TF-IBF | .641 | .532 | .895 | .811 | .800 | .628 |
      | IWF: TF-IUF | .661 | .555 | .892 | .803 | .794 | .623 |
      All 3 IWFs clearly outperform TF for LibraryThing and GoodReads.
      The IWFs perform similarly to one another.
  57. 57. Tag Distributions on STS: Results Using IWFs
      (Same results table as on slide 56.)
      IWFs underperform on Delicious, due to tag suggestions that make the top tags overwhelmingly popular.
      There, IUF is superior to IBF and IRF: users who make their own choices make the difference.
  58. 58. Tag Distributions on STS: Using Classifier Committees with IWFs
      How about using tags represented with IWFs on classifier committees?
  59. 59. Tag Distributions on STS: Results Using IWF with Committees
      | Weighting | Delicious L1 | Delicious L2 | LThing L1 | LThing L2 | GReads L1 | GReads L2 |
      |---|---|---|---|---|---|---|
      | TF | .699 | .588 | .859 | .755 | .857 | .730 |
      | IWF: TF-IRF | .697 | .592 | .885 | .793 | .864 | .748 |
      | IWF: TF-IBF | .698 | .592 | .887 | .797 | .866 | .751 |
      | IWF: TF-IUF | .700 | .595 | .885 | .792 | .864 | .749 |
      IWF-based committees are even better than TF-based ones.
      Even on Delicious, where IWFs were not appropriate, committees perform slightly better.
  60. 60. Tag Distributions on STS: Results Using IWF with Committees
      (Same results table as on slide 59.)
      Although IWFs improve committees, IWFs on their own perform better on LibraryThing (.895 & .811).
  61. 61. Tag Distributions on STS: Summary of Results Using IWFs
      - IWFs are an appropriate way to weight tags when used on classifier committees.
      - The exception is LibraryThing, where tags on their own perform better.
      - Combined data sources must be appropriately chosen (e.g., synopses & ed. reviews are harmful with books).
  62. 62. User Behavior on STS: Index
  63. 63. User Behavior on STS: Categorizers and Describers
      Körner [1] suggested 2 kinds of user behavior:
      | | Categorizer | Describer |
      |---|---|---|
      | Goal of Tagging | later browsing | later retrieval |
      | Change of Tag Vocabulary | costly | cheap |
      | Size of Tag Vocabulary | limited | open |
      | Tags | subjective | objective |
      They found that Describers help infer semantic relations among tags.
      Do these tagging behaviors affect the usefulness of tags for resource classification?
      [1] C. Körner. Understanding the Motivation behind Tagging. Hypertext 2009.
  64. 64. User Behavior on STS: Categorizers and Describers
  66. 66. User Behavior on STS: Weighting Measures
      We use 3 measures to weight users, based on Körner et al. (2010).
      2 factors are considered: verbosity & diversity.
  67. 67. User Behavior on STS: Weighting Measures: TPP
      Tags per Post (TPP), a measure of verbosity:
      $TPP(u) = \frac{\sum_{r} |T_{ur}|}{|R_u|}$
  68. 68. User Behavior on STS: Weighting Measures: ORPHAN
      Orphan Ratio (ORPHAN), a measure of diversity:
      $n = \frac{|R(t_{max})|}{100}$
      $ORPHAN(u) = \frac{|T_u^o|}{|T_u|}, \quad T_u^o = \{t : |R(t)| \le n\}$
  69. 69. User Behavior on STS: Weighting Measures: TRR
      Tag Resource Ratio (TRR), a measure of verbosity + diversity:
      $TRR(u) = \frac{|T_u|}{|R_u|}$
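
The three user measures on slides 67 to 69 can be computed together from one user's bookmarks. The sketch below assumes the following reading of the symbols, which the slides do not spell out: R_u is the set of resources the user bookmarked, T_ur the tags the user put on resource r, T_u the user's distinct tags, R(t) the user's resources carrying tag t, and t_max the user's most used tag. The function name and the toy data are hypothetical, and the user is assumed to have at least one tagged bookmark.

```python
def user_measures(bookmarks):
    """bookmarks: {resource: set of tags} for a single user.

    Returns (tpp, orphan, trr).
    """
    n_resources = len(bookmarks)                       # |R_u|
    resources_per_tag = {}                             # |R(t)| for each tag t
    for tags in bookmarks.values():
        for tag in tags:
            resources_per_tag[tag] = resources_per_tag.get(tag, 0) + 1
    n_tags = len(resources_per_tag)                    # |T_u|

    # TPP: average number of tags per bookmarked resource.
    tpp = sum(len(tags) for tags in bookmarks.values()) / n_resources

    # ORPHAN: share of tags used on at most n resources,
    # where n is 1/100 of the usage of the most used tag.
    n = max(resources_per_tag.values()) / 100
    orphans = [t for t, count in resources_per_tag.items() if count <= n]
    orphan = len(orphans) / n_tags

    # TRR: vocabulary size relative to the number of resources.
    trr = n_tags / n_resources
    return tpp, orphan, trr


# Hypothetical user: 200 resources tagged "web", two of them with extra tags.
bookmarks = {f"r{i}": {"web"} for i in range(200)}
bookmarks["r0"] = {"web", "python"}
bookmarks["r1"] = {"web", "python", "todo"}
tpp, orphan, trr = user_measures(bookmarks)
```

For this user the most used tag covers 200 resources, so n = 2 and both "python" (2 resources) and "todo" (1 resource) count as orphan tags.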
  70. 70. User Behavior on STS: Use of Weighting Measures
      These 3 measures provide:
      - A weight for each user.
      - A ranking of users according to each measure.
      From rankings → subsets of users as extreme Categorizers (highest-ranked) and extreme Describers (lowest-ranked).
      Subsets range from 10% to 100% (step size = 10%).
  71. 71. User Behavior on STS: Experiment Setup
      We select subsets of users according to their number of tag assignments.
      Selecting by percentages of users would be unfair → different amounts of data.
  72. 72. User Behavior on STS: Experiments
      Classification:
      - We use a multiclass SVM, with TF weighting of tags.
      Descriptivity:
      - Vector representations of resources:
        - $T_r$ → tag frequencies.
        - $R_r$ → term frequencies on descriptive data (self-content).
      - Cosine similarity between $T_r$ and $R_r$:
        $\cos(\theta_r) = \frac{\sum_{i=1}^{n} T_{ri} \times R_{ri}}{\sqrt{\sum_{i=1}^{n} (T_{ri})^2} \times \sqrt{\sum_{i=1}^{n} (R_{ri})^2}}$
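
The descriptivity measure on slide 72 is a plain cosine similarity between the tag-frequency vector and the term-frequency vector of a resource. A minimal sketch follows, assuming both vectors are stored as sparse dictionaries over a shared vocabulary; the function name, the zero-vector convention and the toy data are illustrative choices, not taken from the thesis.

```python
import math


def cosine_similarity(tag_freqs, term_freqs):
    """Cosine between two sparse frequency vectors given as {term: frequency} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(freq * term_freqs.get(term, 0) for term, freq in tag_freqs.items())
    norm_t = math.sqrt(sum(f * f for f in tag_freqs.values()))
    norm_r = math.sqrt(sum(f * f for f in term_freqs.values()))
    if norm_t == 0 or norm_r == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_t * norm_r)


# Hypothetical resource: tag frequencies vs. term frequencies of its content.
tags = {"python": 3, "tutorial": 4}
content = {"python": 4, "tutorial": 3}
similarity = cosine_similarity(tags, content)
```

A value close to 1 means the tags describe the content closely, which is how a high score would indicate Describer-like tagging in this setting.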
  73. 73. User Behavior on STS: Descriptivity Results
      [Figure: grid of plots with columns TPP (Verb.), ORPHAN (Div.) and TRR (V. + D.), and rows Delicious, LibraryThing and GoodReads.]
  74. 74. User Behavior on STS: Classification Results
      [Figure: grid of plots with columns TPP (Verb.), ORPHAN (Div.) and TRR (V. + D.), and rows Delicious, LibraryThing and GoodReads.]
  76. 76. User Behavior on STS: Overall Categorizers/Describers Results
      - Discriminating by verbosity (TPP) does best for finding extreme Categorizers.
      - The use of non-descriptive tags provides more accurate classification.
  77. 77. Conclusions & Outlook: Index
  78. 78. Conclusions & Outlook: Contributions
      - Generation & analysis of 3 large-scale social tagging datasets.
      - Release of some tagging datasets, used by Godoy and Amandi (2010), Strohmaier et al. (2010), Li et al. (2011), and Ares et al. (2011).
  79. 79. Conclusions & Outlook: Contributions
      First research work performing actual classification experiments using social tags:
      - Analysis of different representations of social tags.
      - Analysis of the effect of tag distributions.
      - Study of user behavior.
      It paves the way for future researchers interested in the task & in the exploration of STS.
  80. 80. Conclusions & Outlook: Research Questions
      Apart from the Problem Statement:
      How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?
      We set forth 10 research questions.
  81. 81. Conclusions & Outlook: Research Questions (1)
      RQ 1: What is a suitable SVM classifier for the task?
      - Native multiclass SVM >> Combinations of binary SVMs.
  82. 82. Conclusions & Outlook: Research Questions (2)
      RQ 2: What is a suitable learning method for the task?
      - Supervised ≈ Semi-supervised.
      - Unlike for binary tasks, where Semi-supervised >> Supervised (Joachims, 1999).
  83. 83. Conclusions & Outlook: Research Questions (3)
      RQ 3: How do the settings of STS affect folksonomies?
      - Great impact of tag suggestions.
      - Importance of encouraging users to annotate.
  84. 84. Conclusions & Outlook: Research Questions (4)
      RQ 4: How to amalgamate annotations to get a representation of a resource?
      - Considering all the tags rather than only the top ones.
      - Weighting tags according to the number of users annotating them.
  85. 85. Conclusions & Outlook: Research Questions (5)
      RQ 5: Is it worthwhile to combine tags with other data sources?
      - Combining different data sources helps improve performance.
      - Data sources must be appropriately chosen.
  86. 86. Conclusions & Outlook: Research Questions (6). RQ 6: Are social tags specific enough to classify into narrower categories? Tags are as useful as they are for the top level. Noll and Meinel (2008) → tags were probably not useful for deeper levels.
  87. 87. Conclusions & Outlook: Research Questions (7). RQ 7: Can we consider tag distributions to get the representativity of each tag? LibraryThing & GoodReads: really useful. Delicious: not useful, because of tag suggestions → need for committees to make them useful.
  88. 88. Conclusions & Outlook: Research Questions (8). RQ 8: What approach to use to weigh the representativity of tags? LibraryThing & GoodReads: IBF, IRF & IUF are very similar. Delicious: IUF clearly superior, because of users that get rid of suggestions.
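As a rough illustration of the kind of weighting discussed for RQ 8, the sketch below computes an inverse-frequency weight per tag. It rests on assumptions that this slide does not state: IUF is read as an "inverse user frequency" defined by analogy with IDF, i.e. log(total users / users who used the tag), and the `(user, resource, tag)` triple format and both function names are hypothetical.

```python
import math
from collections import defaultdict


def inverse_user_frequency(bookmarks):
    """Compute an IDF-style weight per tag over users (assumed definition).

    bookmarks: iterable of (user, resource, tag) triples (hypothetical format).
    Returns a dict mapping tag -> log(total users / users who used the tag).
    """
    all_users = set()
    users_per_tag = defaultdict(set)
    for user, _resource, tag in bookmarks:
        all_users.add(user)
        users_per_tag[tag].add(user)
    total = len(all_users)
    return {tag: math.log(total / len(users))
            for tag, users in users_per_tag.items()}


def weight_resource(tag_counts, iuf):
    """Scale a resource's per-tag user counts by the tag weights."""
    return {tag: count * iuf.get(tag, 0.0)
            for tag, count in tag_counts.items()}


# Toy data: "web" is used by all four users, "svm" by only one.
bookmarks = [
    ("u1", "r1", "web"), ("u2", "r1", "web"),
    ("u3", "r2", "web"), ("u4", "r2", "web"),
    ("u4", "r2", "svm"),
]
iuf = inverse_user_frequency(bookmarks)
weighted = weight_resource({"web": 3, "svm": 2}, iuf)
```

Under this assumed definition, a tag used by every user gets weight zero, while a tag used by few users is boosted. Analogous weights over bookmarks or resources would only change which set is counted per tag.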
  89. 89. Conclusions & Outlook: Research Questions (9). RQ 9: Can we discriminate users whose annotations more closely resemble an expert classification? Categorizers > Describers for classification. Need for an appropriate measure for discriminating.
  90. 90. Conclusions & Outlook: Research Questions (10). RQ 10: What features identify a Categorizer? Categorizers can be found when discriminating by verbosity. Non-descriptive tags produce more accurate classification.
  91. 91. Conclusions & Outlook: Future Directions. Increase of interest in the field, still much work to do. We have considered each tag as a different token. → Considering semantic meanings of social tags could help. Tag suggestions leverage several issues in folksonomies. → Looking for a weighting function that fits the characteristics of systems with tag suggestions, e.g., Delicious.
  92. 92. Publications: Index. 1 Motivation. 2 Selection of a Classifier. 3 STS & Datasets. 4 Representing the Aggregation of Tags. 5 Tag Distributions on STS. 6 User Behavior on STS. 7 Conclusions & Outlook. 8 Publications.
  93. 93. Publications: Peer-Reviewed Conferences (I). Arkaitz Zubiaga, Christian Körner, Markus Strohmaier. 2011. Tags vs Shelves: From Social Tagging to Social Classification. In Proceedings of Hypertext 2011, the 22nd ACM Conference on Hypertext and Hypermedia, Eindhoven, Netherlands. (acceptance rate: 35/104, 34%) Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Getting the Most Out of Social Annotations for Web Page Classification. In Proceedings of DocEng 2009, the 9th ACM Symposium on Document Engineering, pp. 74-83, Munich, Germany. (acceptance rate: 16/54, 29.6%) [15 citations]
  94. 94. Publications: Peer-Reviewed Conferences (II). Arkaitz Zubiaga. 2009. Enhancing Navigation on Wikipedia with Social Tags. Wikimania 2009, Buenos Aires, Argentina. [6 citations] Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Content-based Clustering for Tag Cloud Visualization. In Proceedings of ASONAM 2009, International Conference on Advances in Social Networks Analysis and Mining, pp. 316-319, Athens, Greece. [3 citations]
  95. 95. Publications: Journals (I). Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2011. Augmenting Web Page Classifiers with Social Annotations. Procesamiento del Lenguaje Natural. (acceptance rate: 33/60, 55%) Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Clasificación de Páginas Web con Anotaciones Sociales. Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233. (acceptance rate: 36/72, 50%) Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Comparativa de Aproximaciones a SVM Semisupervisado Multiclase para Clasificación de Páginas Web. Procesamiento del Lenguaje Natural, vol. 42, pp. 63-70.
  96. 96. Publications: Journals (II). Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. Harnessing Folksonomies to Produce a Social Classification of Resources. IEEE Transactions on Knowledge and Data Engineering. (pending notification)
  97. 97. Publications: Book Chapters. Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2011. Exploiting Social Annotations for Resource Classification. Social Network Mining, Analysis and Research Trends: Techniques and Applications. IGI Global. Workshops. Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?. In Proceedings of the NAACL-HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pp. 28-36, Boulder, CO, United States.
  98. 98. Publications: Thank You. Achiu, Arigato, Danke, Dhannvaad, Dua Netjer en ek, Efcharisto, Gracias, Gràcies, Gratia, Grazie, Guishepeli, Hvala, Kiitos, Köszönöm, Mercé, Merci, Mila esker, Obrigado, Shukran, Shukriya, Tack, Tak, Takk, Tänan, Tapadh leat, Tesekkür ederim, Thank you, Toda. http://thesis.zubiaga.org/
