Harnessing Folksonomies for Resource Classification

  1. Harnessing Folksonomies for Resource Classification. PhD Thesis. Arkaitz Zubiaga, UNED, July 12th, 2011. Advisors: Raquel Martínez Unanue and Víctor Fresno Fernández.
  2. Table of Contents: 1 Motivation; 2 Selection of a Classifier; 3 STS & Datasets; 4 Representing the Aggregation of Tags; 5 Tag Distributions on STS; 6 User Behavior on STS; 7 Conclusions & Outlook; 8 Publications.
  3. Motivation: Index.
  4. Motivation: Resource Classification.
  7. Motivation: Resource Classification. Classifying resources (web pages, books, movies, files, ...) is a common task. Large collections of resources → expensive & effortful to classify manually. The Library of Congress (LoC) reported an average cost of $94.58 for cataloging each book in 2002. Enormous costs and efforts → automatic classification.
  8. Motivation: Resource Classification. Representation of resources → self-content. Using the self-content of resources presents some issues: it is not always representative enough, and not always accessible (e.g., books). Social tags provided by users → an alternative to solve the problem.
  9. Motivation: Tagging. T1, T2, T3 = sets of tags.
  10. Motivation: Social Tagging. Aggregation of user annotations → folksonomy. Folksonomy: Folk (People) + Taxis (Classification) + Nomos (Management).
  11. Motivation: Organization of Resources. User annotations → own organization of resources. A user's tags:

          Tag        # Resources
          research   82
          twitter    28
          web2.0     35
          language   42
          english    64
          ...        ...
  12. Motivation: Example of Bookmarks.

          #   User    Resource      Tags
          1   user1   flickr.com    photo, web2.0, social
          2   user2   flickr.com    photography, images
          3   user1   google.com    searchengine
          4   user3   twitter.com   microblogging, twitter

      A bookmark consists of (1) the user $u_i \in U$ who annotates, (2) the resource $r_j \in R$ being annotated, and (3) the tags $T_{ij} = \{t_1, \ldots, t_n\} \subseteq T$ utilized: $b_{ij} : u_i \times r_j \times T_{ij}$.
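The bookmark triple above can be sketched as a small data structure. This is a minimal illustration, not the thesis code: the `Bookmark` class and the `aggregate_tags` helper are hypothetical names, and the data are the four example bookmarks from the slide. Aggregating the tag sets per resource gives, for each tag, the number of users who applied it.

```python
from collections import Counter, defaultdict
from typing import FrozenSet, NamedTuple


class Bookmark(NamedTuple):
    """One annotation b_ij: a user, a resource, and the set of tags used."""
    user: str
    resource: str
    tags: FrozenSet[str]


# The four example bookmarks from the slide.
bookmarks = [
    Bookmark("user1", "flickr.com", frozenset({"photo", "web2.0", "social"})),
    Bookmark("user2", "flickr.com", frozenset({"photography", "images"})),
    Bookmark("user1", "google.com", frozenset({"searchengine"})),
    Bookmark("user3", "twitter.com", frozenset({"microblogging", "twitter"})),
]


def aggregate_tags(items):
    """Count, per resource, how many bookmarks (users) applied each tag."""
    counts = defaultdict(Counter)
    for bookmark in items:
        counts[bookmark.resource].update(bookmark.tags)
    return counts


tag_counts = aggregate_tags(bookmarks)
print(tag_counts["flickr.com"].most_common())
```

Because each bookmark holds a set of tags, a tag is counted at most once per bookmark, so the totals can be read as user counts per tag.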
  13. Motivation: Sum of Annotations. Top tags (79,681 users):

          Rank   Tag           User Count
          1      photos        22,712
          2      flickr        19,046
          3      photography   15,968
          4      photo         15,225
          5      sharing       10,648
          6      images        9,637
          7      web2.0        9,528
          8      community     4,571
          9      social        3,798
          10     pictures      3,115
  14. Motivation: Tag-based Resource Classification.
  15. Motivation: Problem Statement. How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?
  16. Motivation: Related Work. Social tags for information management. Search: Bao et al. (2007) & Heymann et al. (2008). Recommender Systems: Shepitsen et al. (2008) & Li et al. (2008). Enhanced Browsing: Smith (2008).
  17. Motivation: Related Work. Classification: Noll and Meinel (2008) → statistical analysis of matches between tags & taxonomies. Tags are useful for broad categorization, not for narrower categorization. Lack of further research with: actual classification experiments; other types of resources; different representations of social tags.
  18. Selection of a Classifier: Index.
  19. Selection of a Classifier: Characteristics of the task. We have a large set of resources (some labeled + many unlabeled) and a multiclass taxonomy. Automated classifiers learn a model from labeled resources; this model is used to classify unlabeled resources afterward. 2 learning settings. Supervised: only labeled resources are considered for learning. Semi-supervised: unlabeled resources are also taken into account.
  20. Selection of a Classifier: Support Vector Machines (SVM). Hyperplane that separates with the largest margin. Use of kernels → redimensions the space. Resource/hyperplane margin → classifier's reliability.
  21. Selection of a Classifier: Selection of a Classifier. SVMs solve binary problems by default. To solve multiclass tasks: a native multiclass classifier (mSVM), or a combination of binary classifiers, either one-against-all (oaaSVM) or one-against-one (oaoSVM). Both supervised (s) and semi-supervised (ss).
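To make the two binary decompositions concrete, the sketch below enumerates the binary subproblems each one creates and shows one common way to combine their outputs. The function names are hypothetical, and the decision rules (largest margin for one-against-all, majority vote for one-against-one) are the usual textbook conventions, assumed here rather than taken from the slides.

```python
from collections import Counter
from itertools import combinations


def one_against_all_tasks(classes):
    """One binary task per class: that class versus all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_against_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


def predict_one_against_all(margins):
    """Pick the class whose binary classifier reports the largest margin.

    `margins` maps each class to the signed margin of its classifier.
    """
    return max(margins, key=margins.get)


def predict_one_against_one(pair_winners):
    """Pick the class that wins the most pairwise duels (majority vote)."""
    return Counter(pair_winners).most_common(1)[0][0]


classes = list(range(10))  # e.g. a 10-category taxonomy
print(len(one_against_all_tasks(classes)))  # k binary classifiers
print(len(one_against_one_tasks(classes)))  # k * (k - 1) / 2 binary classifiers
```

For k classes, one-against-all trains k binary classifiers and one-against-one trains k(k-1)/2, so the pairwise scheme grows much faster with the size of the taxonomy.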
  22. Selection of a Classifier: Experiment Settings. 3 benchmark datasets to analyze the suitability of classifiers:

          Dataset      # web pages   # trainset   # categories
          BankSearch   10,000        3,000        10
          WebKB        4,518         1,000        6
          Y! Science   788           100          6

      We present accuracy to show performance. We perform 6 runs and show the average accuracy.
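The evaluation measure can be sketched as follows, assuming accuracy means the fraction of test resources whose predicted category matches the expected one, averaged with equal weight over the runs. The function names are hypothetical and the labels are toy values, not results from the thesis.

```python
def accuracy(predicted, expected):
    """Fraction of resources whose predicted class equals the expected class."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def mean_accuracy(runs):
    """Average accuracy over several runs; each run is (predicted, expected)."""
    scores = [accuracy(predicted, expected) for predicted, expected in runs]
    return sum(scores) / len(scores)


# Two toy runs with made-up labels.
toy_runs = [
    (["a", "b"], ["a", "b"]),
    (["a", "b"], ["a", "a"]),
]
print(mean_accuracy(toy_runs))
```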
  23. Selection of a Classifier: Results.

                        BankSearch   WebKB   Y! Science
          mSVM (s)      .925         .810    .825
          mSVM (ss)     .923         .778    .836
          oaaSVM (s)    .843         .776    .536
          oaaSVM (ss)   .842         .773    .565
          oaoSVM (s)    .826         .775    .483
          oaoSVM (ss)   .811         .754    .514

      The native multiclass classifier performs best, while supervised ≈ semi-supervised. We used the supervised approach, as it is computationally less expensive.
  24. STS & Datasets: Index.
  25. STS & Datasets: Requirements. Selected social tagging systems (STS) should have: large communities involved; public access to data; consolidated taxonomies as a ground truth. We chose Delicious, LibraryThing & GoodReads.
  26. STS & Datasets: Characteristics of STS.

                              Delicious                    LibraryThing               GoodReads
          Resources           web documents                books                      books
          Tag suggestions     based on earlier bookmarks   no                         based on earlier tags
                              on the resource                                         utilized by the user
          Tag insertion       space-separated              comma-separated            one-by-one text box
          Saving a resource   prompts user to add tags     prompts user to add tags   user needs to click
                                                           at second step             again to add tags
  31. STS & Datasets: Retrieval of Categorized Resources. Retrieval of popular annotated resources, which were also categorized by experts.

                          Top level (L1)          Second level (L2)
                          Resources   Classes     Resources   Classes
          Web     ODP     12,616      17          12,286      243
          Books   DDC     27,299      10          27,040      99
          Books   LCC     24,861      20          23,565      204
  32. STS & Datasets: Retrieval of Additional User Annotations. Delicious: 300,571,231 bookmarks → 273,478,137 annotated (91.00%). LibraryThing: 44,612,784 bookmarks → 22,343,427 annotated (50.08%). GoodReads: 47,302,861 bookmarks → 9,323,539 annotated (19.71%). Importance of the system's encouragement to tag resources.
  33. STS & Datasets: Tag Popularity on Resources. [Figure: average usage (0 to 1) against tag rank on resources, for Delicious, LibraryThing and GoodReads.]
  34. STS & Datasets: Tag Novelty in Bookmarks by Rank. [Figure: % of novelty (0% to 100%) against bookmark rank (1 to 99), for Delicious and LibraryThing.]
  35. STS & Datasets: Retrieval of Additional Data. URLs: self-content, by crawling URLs; user reviews (Delicious & StumbleUpon). Books: self-content (unavailable): synopses (Barnes&Noble) and editorial reviews (Amazon); user reviews (LibraryThing, GoodReads & Amazon).
  36. STS & Datasets: Summary of the Analysis of Datasets. Few users annotate resources when the system does not encourage them to do so. Resource-based tag suggestions → repeated use of popular tags.
  37. Representing the Aggregation of Tags: Index.
  38. Representing the Aggregation of Tags: Representing Resources Using Tags. Different ways to aggregate user annotations on a vectorial representation. 2 major factors to consider: what tags to use, and how to weigh those tags.
  39. Representing the Aggregation of Tags: Representing Resources Using Tags. Use of all tags (FTA), or just the top 10 tags for each resource. 4 different weightings. Example of a resource (100 users): t1 (50), t2 (30), t3 (20), ..., t9 (1), t10 (1), ..., tn (1).

                      t1    t2    t3    ...   t9     t10    ...   tn
          Ranks       1     0.9   0.8   ...   0.2    0.1    ...   0
          Fractions   0.5   0.3   0.2   ...   0.02   0.01   ...   0.01
          Binary      1     1     1     ...   1      1      ...   1
          TF          50    30    20    ...   2      1      ...   1

      Top 10 covers t1 to t10; FTA covers t1 to tn.
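The four weightings can be sketched as one small function. This is an illustration under stated assumptions, not the thesis implementation: `weight_tags` and `example_counts` are hypothetical names, the Fractions weight is taken to be the tag's user count divided by the number of users who annotated the resource, the Ranks weight is taken to fall linearly from 1 for the top tag to 1/k for the k-th, and ties are broken alphabetically.

```python
def weight_tags(tag_counts, n_users, scheme, top_k=None):
    """Turn per-resource tag counts into a weighted tag vector.

    tag_counts: dict mapping tag -> number of users who applied it.
    n_users:    number of users who annotated the resource.
    scheme:     "tf", "fractions", "binary" or "ranks".
    top_k:      keep only the k most used tags (None keeps all tags).
    """
    # Most used tags first; ties broken alphabetically (an assumption).
    ranked = sorted(tag_counts.items(), key=lambda item: (-item[1], item[0]))
    if top_k is not None:
        ranked = ranked[:top_k]

    if scheme == "tf":
        return {tag: count for tag, count in ranked}
    if scheme == "fractions":
        return {tag: count / n_users for tag, count in ranked}
    if scheme == "binary":
        return {tag: 1 for tag, _ in ranked}
    if scheme == "ranks":
        if top_k is None:
            raise ValueError("the ranks scheme needs a top_k cut in this sketch")
        return {tag: (top_k - i) / top_k for i, (tag, _) in enumerate(ranked)}
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical counts for a resource annotated by 100 users.
example_counts = {"t1": 50, "t2": 30, "t3": 20, "t4": 5}
print(weight_tags(example_counts, 100, "ranks", top_k=10))
print(weight_tags(example_counts, 100, "fractions"))
```

Leaving `top_k` as `None` corresponds to using all tags of the resource, while `top_k=10` corresponds to the Top 10 variants.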
  40. Representing the Aggregation of Tags: Representing Resources Using Other Data Sources. To represent resources using content and reviews: (1) removal of HTML tags; (2) removal of stopwords; (3) stemming of remaining words; (4) TF-IDF weighting of words.
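The four steps above can be sketched with the standard library alone. Several pieces here are stand-ins chosen for illustration and are not taken from the slides: the tiny stopword list, the naive suffix-stripping `naive_stem` (a real stemmer such as Porter's would normally be used), and the textbook weight tf × log(N / df) for TF-IDF.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the thesis).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def strip_html(text):
    """Step 1: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def naive_stem(word):
    """Step 3 stand-in: strip a few common suffixes from longer words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    """Steps 1 to 3: strip tags, drop stopwords, stem what remains."""
    words = re.findall(r"[a-z]+", strip_html(html).lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


def tfidf_vectors(documents):
    """Step 4: weight each term by tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


pages = ["<p>The cats are sleeping</p>", "<b>Dogs</b> and cats"]
print(tfidf_vectors(pages))
```

With this weight, a term that occurs in every document gets weight 0, so it contributes nothing to telling resources apart.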
  41. Representing the Aggregation of Tags: Experiment Setup
      - Multiclass SVMs.
      - We show the average accuracy of 6 runs.
      - For clarity of presentation, we limit results to:
        - LCC taxonomy for books.
        - Training sets of 6,000 URLs (6,616 (L1) / 6,286 (L2) for test).
        - Training sets of 18,000 books (8,861 (L1) / 5,565 (L2) for test).
  42. Representing the Aggregation of Tags: Experiment Setup
      Compared representations:
      - Self-content (baseline).
      - Reviews.
      - Tags:
        - Ranks (Top 10).
        - Fractions (Top 10 & FTA).
        - Binary (Top 10 & FTA).
        - TF (Top 10 & FTA).
  43. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources

                           Delicious       LThing          GReads
                           L1     L2       L1     L2       L1     L2
                           (17)   (243)    (20)   (204)    (20)   (204)
      Content              .610   .470     .807   .673     .807   .673
      Reviews              .646   .524     .828   .705     .828   .705
      Tags:
        Ranks              .484   .360     .795   .511     .630   .405
        Fractions (10)     .464   .349     .738   .411     .663   .427
        Fractions (FTA)    .461   .336     .712   .409     .654   .432
        Binary (10)        .531   .361     .770   .550     .623   .422
        Binary (FTA)       .572   .529     .655   .606     .639   .481
        TF (10)            .654   .545     .855   .722     .713   .491
        TF (FTA)           .680   .568     .857   .736     .731   .517

      Usually, FTA > 10.
  44. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources (same table as slide 43)
      TF (FTA) is the best approach for tags.
  45. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources

                   Delicious       LThing          GReads
                   L1     L2       L1     L2       L1     L2
                   (17)   (243)    (20)   (204)    (20)   (204)
      Content      .610   .470     .807   .673     .807   .673
      Reviews      .646   .524     .828   .705     .828   .705
      Tags         .680   .568     .857   .736     .731   .517

      Tags clearly outperform content and reviews on Delicious and LibraryThing.
  46. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources (same table as slide 45)
      GoodReads discourages tagging, so its tags are insufficient to outperform content and reviews.
  47. Representing the Aggregation of Tags: Results of Tags vs Other Data Sources (same table as slide 45)
      Tags are also useful for deeper categorization (L2).
  48. Representing the Aggregation of Tags: Classifier Committees
      - Despite the superiority of social tags, all data sources perform well.
      - Their outputs can be combined by using classifier committees.
      - Classifier committees add up the margins (i.e., reliability values) output by several classifiers, and provide a single combined prediction.

                             Cat. #1   Cat. #2   Cat. #3
      Classif. A             1.2       1.1       0.6
      Classif. B             0.5       1.0       1.2
      Classif. committees    1.7       2.1       1.8
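The margin-summing rule described on this slide can be sketched as below. The function name `committee_predict` and the list-of-lists input format are assumptions for illustration; the slide only states that margins are added and a single prediction is returned.

```python
def committee_predict(margins_per_classifier):
    """Sum per-category margins over classifiers and pick the best category.

    margins_per_classifier -- one list of margins per classifier, all of
                              the same length (one margin per category)
    Returns (index of the winning category, list of combined margins).
    """
    n_categories = len(margins_per_classifier[0])
    combined = [
        sum(margins[c] for margins in margins_per_classifier)
        for c in range(n_categories)
    ]
    best = max(range(n_categories), key=lambda c: combined[c])
    return best, combined
```

With the slide's example, `committee_predict([[1.2, 1.1, 0.6], [0.5, 1.0, 1.2]])` selects index 1 (Cat. #2), even though each classifier alone would have picked a different category.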
  49. Representing the Aggregation of Tags: Results of Classifier Committees

                       Delicious       LThing          GReads
                       L1     L2       L1     L2       L1     L2
                       (17)   (243)    (20)   (204)    (20)   (204)
      Content (C)      .610   .470     .807   .673     .807   .673
      Reviews (R)      .646   .524     .828   .705     .828   .705
      Tags (T)         .680   .568     .857   .736     .731   .517
      Committees:
        C+R            .670   .547     .817   .704     .817   .704
        C+T            .696   .587     .821   .720     .832   .696
        R+T            .694   .584     .859   .755     .857   .730
        C+R+T          .699   .588     .827   .732     .843   .727

      Classifier committees successfully improve performance, even on GoodReads, where tags were not good enough on their own.
  50. Representing the Aggregation of Tags: Results of Classifier Committees (same table as slide 49)
      Data sources must be chosen with care:
      - All 3 are helpful on Delicious.
      - Content is harmful for books. Is it inappropriate to consider synopses and editorial reviews as a summary of content?
  51. Representing the Aggregation of Tags: Summary of Results
      - Resources are better represented using all tags with TF weighting.
      - Tags perform accurately even for deeper levels.
      - The system must encourage users to tag for tags to be useful enough.
      - Tags can be combined with other data to improve performance.
      - Combined data sources must be chosen with care.
  52. Tag Distributions on STS: Index (section 5 of 8)
  53. Tag Distributions on STS: Tag Distributions
      - So far, we have assumed that tags annotated by the same number of users are equally representative of the resource.
      - The distribution of tags in a collection could help determine how representative each tag is.
  54. Tag Distributions on STS: TF-IDF
      TF-IDF is an inverse weighting function (IWF) that computes:
      - the term frequency (TF).
      - the inverse document frequency (IDF).
      $$\text{tf-idf}_{ij} = \text{tf}_{ij} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$$
      High IDF value for terms appearing in few documents. Low IDF value for terms appearing in many documents.
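As a minimal sketch of the formula on this slide, the following computes TF-IDF over tokenized documents using raw term counts and the natural logarithm. The function name `tf_idf` and the choice of log base are assumptions; the slide does not specify either.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf} dict per document.

    documents -- list of token lists; tf is the raw count of a term in a
                 document, idf is log(|D| / number of documents with the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return vectors
```

A term present in every document gets weight 0, which matches the slide's remark that terms appearing in many documents have a low IDF value.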
  55. Tag Distributions on STS: Tag Weighting Functions
      Analogous to TF-IDF on folksonomies:
      - TF-IRF → distributions across resources.
      - TF-IUF → distributions across users.
      - TF-IBF → distributions across bookmarks.
      TF-IRF and TF-IUF had barely been used, and their suitability was still unexplored.
      TF-IBF had not been used.
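One plausible reading of the three inverse factors, by direct analogy with IDF, is sketched below: each is the logarithm of the total number of resources, users, or bookmarks divided by the number in which the tag occurs. The exact definitions used in the thesis are not given on the slide, so the formulas, the function name `inverse_weights`, and the `(user, resource, tags)` bookmark format are all assumptions.

```python
import math
from collections import defaultdict


def inverse_weights(bookmarks):
    """Compute IRF, IUF and IBF for every tag (illustrative sketch).

    bookmarks -- list of (user, resource, tags) triples, tags being a set
    """
    users, resources = set(), set()
    tag_users = defaultdict(set)
    tag_resources = defaultdict(set)
    tag_bookmarks = defaultdict(int)

    for user, resource, tags in bookmarks:
        users.add(user)
        resources.add(resource)
        for tag in tags:
            tag_users[tag].add(user)
            tag_resources[tag].add(resource)
            tag_bookmarks[tag] += 1

    return {
        tag: {
            "irf": math.log(len(resources) / len(tag_resources[tag])),
            "iuf": math.log(len(users) / len(tag_users[tag])),
            "ibf": math.log(len(bookmarks) / tag_bookmarks[tag]),
        }
        for tag in tag_bookmarks
    }
```

Multiplying a tag's TF on a resource by one of these factors would then give the corresponding TF-IRF, TF-IUF, or TF-IBF weight.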
  56. Tag Distributions on STS: Results Using IWFs

                   Delicious       LThing          GReads
                   L1     L2       L1     L2       L1     L2
      TF           .680   .568     .857   .736     .731   .517
      IWFs:
        TF-IRF     .639   .529     .894   .809     .799   .622
        TF-IBF     .641   .532     .895   .811     .800   .628
        TF-IUF     .661   .555     .892   .803     .794   .623

      All 3 IWFs clearly outperform TF for LibraryThing and GoodReads. The 3 IWFs perform similarly.
  57. Tag Distributions on STS: Results Using IWFs (same table as slide 56)
      - IWFs underperform on Delicious, due to tag suggestions that make the top tags extremely popular.
      - IUF is superior to IBF and IRF. Users who make their own choices make the difference.
  58. Tag Distributions on STS: Using Classifier Committees with IWFs
      How about using tags represented with IWFs in classifier committees?
  59. Tag Distributions on STS: Results Using IWF with Committees

                   Delicious       LThing          GReads
                   L1     L2       L1     L2       L1     L2
      TF           .699   .588     .859   .755     .857   .730
      IWFs:
        TF-IRF     .697   .592     .885   .793     .864   .748
        TF-IBF     .698   .592     .887   .797     .866   .751
        TF-IUF     .700   .595     .885   .792     .864   .749

      - IWF-based committees are even better than TF-based ones.
      - Even on Delicious, where IWFs were not appropriate, committees perform slightly better.
  60. Tag Distributions on STS: Results Using IWF with Committees (same table as slide 59)
      Although IWF-based committees outperform TF-based ones, IWFs on their own perform better on LibraryThing (.895 & .811).
  61. Tag Distributions on STS: Summary of Results Using IWFs
      - IWFs are an appropriate way to weight tags when used on classifier committees.
      - The exception is LibraryThing, where tags on their own perform better.
      - Combined data sources must be appropriately chosen (e.g., synopses & editorial reviews are harmful with books).
  62. User Behavior on STS: Index (section 6 of 8)
  63. User Behavior on STS: User Behavior: Categorizers and Describers
      Körner [1] suggested 2 kinds of user behavior:

                                 Categorizer       Describer
      Goal of tagging            later browsing    later retrieval
      Change of tag vocabulary   costly            cheap
      Size of tag vocabulary     limited           open
      Tags                       subjective        objective

      - They found that Describers help infer semantic relations among tags.
      - Do these tagging behaviors affect the usefulness of tags for resource classification?
      [1] C. Körner. Understanding the Motivation behind Tagging. Hypertext 2009.
  64. User Behavior on STS: Categorizers and Describers
  65. User Behavior on STS: Categorizers and Describers
  66. User Behavior on STS: Weighting Measures
      - We use 3 measures to weight users, based on Körner et al. (2010).
      - 2 factors are considered: verbosity & diversity.
  67. User Behavior on STS: Weighting Measures: TPP
      Tags per Post (TPP), a measure of verbosity:
      $$\mathrm{TPP}(u) = \frac{\sum_{r} |T_{ur}|}{|R_u|}$$
      where T_ur is the set of tags user u assigned to resource r, and R_u is the set of resources annotated by u.
  68. User Behavior on STS: Weighting Measures: ORPHAN
      Orphan Ratio (ORPHAN), a measure of diversity:
      $$n = \frac{|R(t_{max})|}{100}$$
      $$\mathrm{ORPHAN}(u) = \frac{|T_u^o|}{|T_u|}, \quad T_u^o = \{t : |R(t)| \le n\}$$
  69. User Behavior on STS: Weighting Measures: TRR
      Tag Resource Ratio (TRR), a measure of verbosity + diversity:
      $$\mathrm{TRR}(u) = \frac{|T_u|}{|R_u|}$$
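The three user measures (TPP, ORPHAN, TRR) can be sketched for a single user as follows. The input format (a dict mapping each resource to the set of tags the user gave it) and the function names are assumptions. For ORPHAN, the threshold n is rounded up with `math.ceil`, following the definition in Körner et al. as far as I recall it; the slide text itself only shows the plain fraction, so treat the ceiling as an assumption.

```python
import math
from collections import Counter


def tpp(posts):
    """Tags per post: average number of tags per annotated resource."""
    return sum(len(tags) for tags in posts.values()) / len(posts)


def trr(posts):
    """Tag resource ratio: vocabulary size over number of resources."""
    vocabulary = set().union(*posts.values())
    return len(vocabulary) / len(posts)


def orphan(posts):
    """Orphan ratio: share of the user's tags that are rarely used."""
    # usage[t] is |R(t)|, the number of the user's resources tagged with t
    usage = Counter(tag for tags in posts.values() for tag in tags)
    # Assumption: ceiling applied so that n >= 1 for small vocabularies.
    n = math.ceil(max(usage.values()) / 100)
    orphans = [tag for tag, count in usage.items() if count <= n]
    return len(orphans) / len(usage)
```

All three functions assume the user has at least one post with at least one tag; empty input would need separate handling.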
  70. User Behavior on STS: Use of Weighting Measures
      These 3 measures provide:
      - A weight for each user.
      - A ranking of users according to each measure.
      From rankings → subsets of users as extreme Categorizers (highest-ranked) and extreme Describers (lowest-ranked).
      Subsets range from 10% to 100% (step size = 10%).
  71. User Behavior on STS: Experiment Setup
      - We select subsets of users according to their number of tag assignments.
      - Selecting by percentage of users would be unfair, since the subsets would contain different amounts of data.
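One way to read this selection rule is sketched below: walk down a ranking of users and stop once the chosen users account for the target share of all tag assignments. The function name `select_users`, its arguments, and the stopping rule (stop as soon as the share is reached) are assumptions; the slide does not describe the procedure in detail.

```python
def select_users(ranked_users, n_assignments, fraction):
    """Take users from the top of a ranking until they cover `fraction`
    of all tag assignments.

    ranked_users  -- users ordered from the extreme end of interest
    n_assignments -- dict mapping user -> number of tag assignments
    fraction      -- target share of tag assignments, e.g. 0.1 to 1.0
    """
    budget = fraction * sum(n_assignments[u] for u in ranked_users)
    selected, covered = [], 0
    for user in ranked_users:
        if covered >= budget:
            break
        selected.append(user)
        covered += n_assignments[user]
    return selected
```

Passing the ranking in reverse order would select from the opposite extreme, so the same helper can serve for both Categorizer and Describer subsets.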
  72. User Behavior on STS: Experiments
      Classification
      - We use a multiclass SVM, with TF weighting of tags.
      Descriptivity
      - Vectorial representations of resources:
        - T_r → tag frequencies.
        - R_r → term frequencies on descriptive data (self-content).
      - Cosine similarity between T_r and R_r:
      $$\cos(\theta_r) = \frac{\sum_{i=1}^{n} T_{ri} \times R_{ri}}{\sqrt{\sum_{i=1}^{n} (T_{ri})^2} \times \sqrt{\sum_{i=1}^{n} (R_{ri})^2}}$$
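The descriptivity score is a standard cosine similarity, which can be sketched over sparse vectors stored as dicts. The function name `cosine`, the dict representation, and returning 0.0 for an empty vector are assumptions made for this illustration.

```python
import math


def cosine(tag_vector, term_vector):
    """Cosine similarity between two sparse vectors given as dicts
    mapping a term to its frequency."""
    shared = set(tag_vector) & set(term_vector)
    dot = sum(tag_vector[k] * term_vector[k] for k in shared)
    norm_t = math.sqrt(sum(v * v for v in tag_vector.values()))
    norm_r = math.sqrt(sum(v * v for v in term_vector.values()))
    if norm_t == 0 or norm_r == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_t * norm_r)
```

A value near 1 means the tags of a resource closely mirror the words of its content, while a value near 0 means they share almost no vocabulary.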
  73. User Behavior on STS: Descriptivity Results
      [Figure grid. Columns: TPP (Verb.), ORPHAN (Div.), TRR (V. + D.). Rows: Delicious, LibraryThing, GoodReads.]
  74. User Behavior on STS: Classification Results
      [Figure grid. Columns: TPP (Verb.), ORPHAN (Div.), TRR (V. + D.). Rows: Delicious, LibraryThing, GoodReads.]
  75. User Behavior on STS: Classification Results
      [Figure grid. Columns: TPP (Verb.), ORPHAN (Div.), TRR (V. + D.). Rows: Delicious, LibraryThing, GoodReads.]
  76. User Behavior on STS: Overall Categorizers/Describers Results
      - Discriminating by verbosity (TPP) does best for finding extreme Categorizers.
      - The use of non-descriptive tags provides more accurate classification.
  77. Conclusions & Outlook: Index (section 7 of 8)
  78. Conclusions & Outlook: Contributions
      - Generation & analysis of 3 large-scale social tagging datasets.
      - Release of some tagging datasets, used by Godoy and Amandi (2010), Strohmaier et al. (2010), Li et al. (2011), and Ares et al. (2011).
  79. Conclusions & Outlook: Contributions
      - First research work performing actual classification experiments using social tags:
        - Analysis of different representations of social tags.
        - Analysis of the effect of tag distributions.
        - Study of user behavior.
      - It paves the way for future researchers interested in the task & in the exploration of STS.
  80. Conclusions & Outlook: Research Questions
      Apart from the Problem Statement:
        How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?
      We set forth 10 research questions.
  81. Conclusions & Outlook: Research Questions (1)
      RQ 1 What is a suitable SVM classifier for the task?
      - Native multiclass SVM >> Combinations of binary SVMs.
  82. Conclusions & Outlook: Research Questions (2)
      RQ 2 What is a suitable learning method for the task?
      - Supervised vs. semi-supervised: the outcome differs from binary tasks, where Semi-supervised >> Supervised (Joachims, 1999).
  83. Conclusions & Outlook: Research Questions (3)
      RQ 3 How do the settings of STS affect folksonomies?
      - Great impact of tag suggestions.
      - Importance of encouraging users to annotate.
  84. Conclusions & Outlook: Research Questions (4)
      RQ 4 How to amalgamate annotations to get a representation of a resource?
      - Considering all the tags rather than only the top ones.
      - Weighting tags according to the number of users annotating them.
  85. Conclusions & Outlook: Research Questions (5)
      RQ 5 Is it worthwhile combining tags with other data sources?
      - Combining different data sources helps improve performance.
      - Data sources must be appropriately chosen.
  86. Conclusions & Outlook: Research Questions (6)
      RQ 6 Are social tags specific enough to classify into narrower categories?
      - Tags are as useful for deeper levels as for the top level.
      - Noll and Meinel (2008) → tags were probably not useful for deeper levels.
  87. Conclusions & Outlook: Research Questions (7)
      RQ 7 Can we consider tag distributions to get the representativity of each tag?
      - LibraryThing & GoodReads: really useful.
      - Delicious: not useful, because of tag suggestions → committees are needed to make them useful.
  88. Conclusions & Outlook: Research Questions (8)
      RQ 8 What approach to use to weigh the representativity of tags?
      - LibraryThing & GoodReads: IBF, IRF & IUF are very similar.
      - Delicious: IUF clearly superior, because of users who disregard the suggestions.
  89. Conclusions & Outlook: Research Questions (9)
      RQ 9 Can we discriminate users whose tagging more closely resembles an expert classification?
      - Categorizers > Describers for classification.
      - An appropriate measure is needed for discriminating.
  90. Conclusions & Outlook: Research Questions (10)
      RQ 10 What features identify a Categorizer?
      - Categorizers can be found when discriminating by verbosity.
      - Non-descriptive tags produce more accurate classification.
  91. Conclusions & Outlook: Future Directions
      - Increase of interest in the field, still much work to do.
      - We have considered each tag as a different token.
        → Considering semantic meanings of social tags could help.
      - Tag suggestions raise several issues in folksonomies.
        → Looking for a weighting function that fits the characteristics of systems with tag suggestions, e.g., Delicious.
  92. Publications: Index (section 8 of 8)
  93. Publications: Peer-Reviewed Conferences (I)
      - Arkaitz Zubiaga, Christian Körner, Markus Strohmaier. 2011. Tags vs Shelves: From Social Tagging to Social Classification. In Proceedings of Hypertext 2011, the 22nd ACM Conference on Hypertext and Hypermedia, Eindhoven, Netherlands. (acceptance rate: 35/104, 34%)
      - Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Getting the Most Out of Social Annotations for Web Page Classification. In Proceedings of DocEng 2009, the 9th ACM Symposium on Document Engineering, pp. 74-83, Munich, Germany. (acceptance rate: 16/54, 29.6%) [15 citations]
  94. Publications: Peer-Reviewed Conferences (II)
      - Arkaitz Zubiaga. 2009. Enhancing Navigation on Wikipedia with Social Tags. Wikimania 2009, Buenos Aires, Argentina. [6 citations]
      - Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Content-based Clustering for Tag Cloud Visualization. In Proceedings of ASONAM 2009, International Conference on Advances in Social Networks Analysis and Mining, pp. 316-319, Athens, Greece. [3 citations]
  95. Publications: Journals (I)
      - Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2011. Augmenting Web Page Classifiers with Social Annotations. Procesamiento del Lenguaje Natural. (acceptance rate: 33/60, 55%)
      - Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Clasificación de Páginas Web con Anotaciones Sociales. Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233. (acceptance rate: 36/72, 50%)
      - Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Comparativa de Aproximaciones a SVM Semisupervisado Multiclase para Clasificación de Páginas Web. Procesamiento del Lenguaje Natural, vol. 42, pp. 63-70.
  96. Publications: Journals (II)
      - Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. Harnessing Folksonomies to Produce a Social Classification of Resources. IEEE Transactions on Knowledge and Data Engineering. (pending notification)
  97. Publications: Book Chapters and Workshops
      Book Chapters
      - Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2011. Exploiting Social Annotations for Resource Classification. Social Network Mining, Analysis and Research Trends: Techniques and Applications. IGI Global.
      Workshops
      - Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification? In Proceedings of the NAACL-HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pp. 28-36, Boulder, CO, United States.
  98. Publications: Thank You
      Achiu, Arigato, Danke, Dhannvaad, Dua Netjer en ek, Efcharisto, Gracias, Gràcies, Gratia, Grazie, Guishepeli, Hvala, Kiitos, Köszönöm, Mercé, Merci, Mila esker, Obrigado, Shukran, Tack, Tak, Takk, Tänan, Tapadh leat, Shukriya, Tesekkür ederim, Thank you, Toda
      http://thesis.zubiaga.org/