
An integrated approach to discover tag semantics



Talk presentation at SAC 2011. From the paper abstract: "Tag-based systems have become very common for online classification thanks to their intrinsic advantages such as self-organization and rapid evolution. However, they are still affected by some issues that limit their utility, mainly due to the inherent ambiguity in the semantics of tags. Synonyms, homonyms, and polysemous words, while not harmful for the casual user, strongly affect the quality of search results and the performances of tag-based recommendation systems. In this paper we rely on the concept of tag relatedness in order to study small groups of similar tags and detect relationships between them. This approach is grounded on a model that builds upon an edge-colored multigraph of users, tags, and resources. To put our thoughts in practice, we present a modular and extensible framework of analysis for discovering synonyms, homonyms and hierarchical relationships amongst sets of tags. Some initial results of its application to the delicious database are presented, showing that such an approach could be useful to solve some of the well known problems of folksonomies."

Published in: Technology

  1. An Integrated Approach to Discover Tag Semantics
     SAC 2011, Web Technologies Track, March 24th 2011
     • Antonina Dattolo, University of Udine, Department of Mathematics and Computer Science
     • Davide Eynard, USI - University of Lugano, ITC - Institute for Communication Technologies
     • Luca Mazzola, USI - University of Lugano, ITC - Institute for Communication Technologies
  2. Talk outline
     • Properties of tags
     • Folksonomies as edge-colored multigraphs
     • Framework design and implementation
     • Tests and evaluations
     • Conclusions
     24/03/2011 An integrated approach to discover tag semantics 2/27
  3. Tags properties
     Tags:
     • are democratic and bottom-up (vs hierarchical)
     • are inclusive and current
     • follow desire lines
     • are easy to use
  4. Tags cons
     • Lexical ambiguities:
         • Synonyms: game and juego, or web2.0 and web_2
         • Homonyms: check as in chess and in "to check" (polysemous); sf as scifi or san_francisco
         • Basic level variations: dog and poodle
     • Ambiguities due to different purposes:
         • blog to tag a blog software (e.g. Wordpress), a blog service, a blog post, something to blog later, ...
  5. Advantages of disambiguation
     • Synonym detection:
         • increases recall
         • allows for better recommendation systems
     • Homonym detection:
         • makes it possible to find different contexts of use
         • increases precision
     • Basic level variations detection:
         • identifies a hierarchy
         • increases recall (e.g. by automatically searching for subclasses)
         • provides a means to browse search results
  6. Approaches to tag disambiguation
     • Roughly two main families of approaches:
         • theoretical ones, aiming at describing the system as a whole
         • more practical, ad-hoc ones (often addressing one or few issues at a time)
     • Our approach:
         • main assumption: lexical ambiguities are not independent from each other
         • solution based on a theoretical framework and a modular, extensible analysis tool
  7. Folksonomies as edge-colored multigraphs
     • Def. 1: An edge-colored multigraph is a triple ECMG = (MG, C, c) where:
         • MG = (V, E, f) is a multigraph
         • C is a set of colors
         • c : E → C is an assignment of colors to multigraph edges
     • Def. 2: A personomy related to user u is a non-directed edge-colored graph of color C_u: P_u = (T, R, E, C_u)
  8. Folksonomies as edge-colored multigraphs
     • Def. 3: Given a set of users U and the family of personomies P_u (u ∈ U), a folksonomy is defined as F = ⋃_{u ∈ U} P_u, that is, an edge-colored multigraph where:
         • vertices are tags + resources
         • edges are tag assignments made on resources by each user
         • every color is a different user
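The definitions above can be made concrete with a small sketch. It stores the folksonomy as a mapping from each (tag, resource) pair to the set of users (colors) who made that assignment, so parallel edges of different colors share one key. The sample data and the function names `build_folksonomy` and `personomy` are assumptions made for illustration, not part of the talk.

```python
from collections import defaultdict

# Hypothetical tag assignments (user, tag, resource), invented for illustration.
assignments = [
    ("alice", "python", "r1"),
    ("bob", "python", "r1"),
    ("bob", "code", "r1"),
    ("alice", "sf", "r2"),
]


def build_folksonomy(assignments):
    """Map each (tag, resource) vertex pair to the set of colors (users) on it.

    Two users tagging the same resource with the same tag yield two parallel
    edges of different colors, which is what makes the structure a multigraph.
    """
    edges = defaultdict(set)
    for user, tag, resource in assignments:
        edges[(tag, resource)].add(user)
    return dict(edges)


def personomy(folksonomy, user):
    """Return the edges of a single color, i.e. the personomy of one user."""
    return {edge for edge, users in folksonomy.items() if user in users}


folksonomy = build_folksonomy(assignments)
```

Under this representation the folksonomy is, by construction, the union of the per-user edge sets returned by `personomy`.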
  9. First simplification step
     • As we are only interested in relationships between tags, we need to perform two simplification steps on the edge-colored multigraph
     • Step 1: colored edges are collapsed and substituted by weighted edges
         • potentially, every color (user) might be assigned a different weight w_u
         • the weight w of the collapsed edge is the sum of all the w_u linking the same two vertices
         • when w_u = 1 for each user, w = number of times a tag is used on a resource
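A minimal sketch of Step 1, assuming the colored edges are stored as a mapping from (tag, resource) to the set of users who made that assignment. The function name `collapse_colors` and the sample data are hypothetical.

```python
# Hypothetical colored edges: (tag, resource) -> set of users (colors).
colored_edges = {
    ("python", "r1"): {"alice", "bob"},
    ("code", "r1"): {"bob"},
}


def collapse_colors(colored_edges, user_weights=None):
    """Collapse parallel colored edges into one weighted edge per vertex pair.

    Each user u contributes w_u (default 1.0), so with default weights the
    result is the number of times a tag was used on a resource.
    """
    user_weights = user_weights or {}
    return {
        edge: sum(user_weights.get(user, 1.0) for user in users)
        for edge, users in colored_edges.items()
    }


weighted_edges = collapse_colors(colored_edges)
# Down-weighting one user, e.g. a suspected spammer (assumed scenario).
reweighted_edges = collapse_colors(colored_edges, {"bob": 0.5})
```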
  10. Second simplification step
     • Step 2:
         • a link is created between t_a and t_b if they share a resource
         • resource nodes are dropped
     • Edge weights can be calculated in different ways:
         • number of triples (t_i, r, t_j) where (t_i, r), (r, t_j) ∈ E => co-occurrence
         • normalized co-occurrence (e.g. using the Jaccard index)
         • distributional measures
         • custom metrics (e.g. sum of products of connecting edge weights)
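Step 2 and three of the listed weightings can be sketched as follows, assuming the tag-resource graph is already weighted and stored as a mapping from (tag, resource) to a weight. The function names and the sample data are assumptions; the distributional measures are not covered here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical weighted tag-resource edges, invented for illustration.
weighted_edges = {
    ("python", "r1"): 2.0,
    ("code", "r1"): 1.0,
    ("python", "r2"): 1.0,
    ("snake", "r2"): 1.0,
}


def project_on_tags(weighted_edges):
    """Drop resource nodes and link tags that share a resource.

    Returns two tag-tag weightings keyed by alphabetically sorted tag pairs:
    co-occurrence (number of shared resources) and a custom metric (sum over
    shared resources of the product of the two connecting edge weights).
    """
    tags_by_resource = defaultdict(dict)
    for (tag, resource), weight in weighted_edges.items():
        tags_by_resource[resource][tag] = weight

    cooccurrence = defaultdict(int)
    sum_of_products = defaultdict(float)
    for tags in tags_by_resource.values():
        for a, b in combinations(sorted(tags), 2):
            cooccurrence[(a, b)] += 1
            sum_of_products[(a, b)] += tags[a] * tags[b]
    return dict(cooccurrence), dict(sum_of_products)


def jaccard(weighted_edges, tag_a, tag_b):
    """Normalized co-occurrence: shared resources over all resources of both tags."""
    res_a = {r for (t, r) in weighted_edges if t == tag_a}
    res_b = {r for (t, r) in weighted_edges if t == tag_b}
    union = res_a | res_b
    return len(res_a & res_b) / len(union) if union else 0.0


cooccurrence, sum_of_products = project_on_tags(weighted_edges)
```

Keying pairs by sorted tag names keeps the projected graph undirected, which matches the non-directed personomies it is derived from.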
  11. The whole process at a glance
     [Figure: the process in four numbered steps, 1 to 4]
  12. System architecture
     • Basic assumption:
         • ambiguous tags should be related (either by co-occurrence or by presence in the same context)
     • Three main components:
         • tag analysis tool
         • disambiguation tool
         • front-end
  13. Synonyms detection / 1
     • Natural text ...
         • Two words are considered synonyms if they can be replaced by each other without affecting the meaning of a sentence
     • ... vs. tag-based systems
         • It is possible to swap two tags within a "sentence" (i.e. a tagging action) without affecting its meaning when we have:
             • variations of a word (e.g. blog, blogs, blogging)
             • translations into other languages (e.g. game, juego, spiel)
             • terms joined by non-alphabetic characters (e.g. web2, web_2)
     • No "one size fits all" solution
  14. Synonyms detection / 2
     • A modular solution for synonyms detection:
         • different heuristics, each one returning the likelihood of tags to be synonyms
         • results are weighted to obtain an overall likelihood
     • Suggested heuristics:
         • an edit distance such as Levenshtein's (normalized to account for short strings)
         • synonym search in WordNet (good precision, low recall)
         • online translation bases (top-down, such as dictionaries, or bottom-up, such as collaboratively grown vocabularies like Wikipedia)
         • stemming with NLP algorithms
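The modular scheme can be sketched with two simple heuristics combined by a weighted average. The normalization by the longer string length, the second heuristic (comparing tags after stripping non-alphanumeric characters) and the equal weights are assumptions made for illustration; the slides do not specify them.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def same_when_normalized(a, b):
    """1.0 if the tags match after dropping non-alphanumeric characters."""
    def strip(tag):
        return re.sub(r"[^a-z0-9]", "", tag.lower())
    return 1.0 if strip(a) == strip(b) else 0.0


def synonym_likelihood(a, b, heuristics):
    """Weighted average of the likelihoods returned by each heuristic."""
    total_weight = sum(weight for _, weight in heuristics)
    return sum(weight * h(a, b) for h, weight in heuristics) / total_weight


# Hypothetical configuration: both heuristics weighted equally.
heuristics = [(edit_similarity, 1.0), (same_when_normalized, 1.0)]
score = synonym_likelihood("web2", "web_2", heuristics)
```

Further heuristics (WordNet lookup, translation bases, stemming) would plug in as additional `(function, weight)` pairs with the same two-argument signature.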
  15. Homonyms detection
     • Check if the tag t has been used in different contexts:
         • cluster tags related to t in groups
         • the most frequent tags in these groups are used to name and disambiguate the contexts
     • Clustering algorithm:
         • an overlapping one, also used in social network analysis*
         • a cluster is a subgraph G identified by the maximization of a fitness property f_G = s_in / (s_in + s_out)^α, where s = strength of internal (in) or external (out) links and α = tweaking parameter
     * A. Lancichinetti et al.: "Detecting the overlapping and hierarchical community structure of complex networks"
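A minimal sketch of the fitness computation, assuming it takes the form used in the cited Lancichinetti et al. paper, f_G = s_in / (s_in + s_out)^α, with internal strength counted once per endpoint (so each internal edge contributes twice its weight). The graph representation and the name `fitness` are illustrative assumptions, and the greedy cluster-growing loop of the full algorithm is omitted.

```python
# Hypothetical weighted tag-tag graph: a simple chain a - b - c - d.
graph = {
    ("a", "b"): 1.0,
    ("b", "c"): 1.0,
    ("c", "d"): 1.0,
}


def fitness(graph, cluster, alpha=1.0):
    """Fitness of a candidate cluster: s_in / (s_in + s_out) ** alpha.

    s_in sums the strength of links with both endpoints in the cluster
    (counted from both endpoints), s_out the strength of links with exactly
    one endpoint in the cluster.
    """
    s_in = 0.0
    s_out = 0.0
    for (a, b), weight in graph.items():
        if a in cluster and b in cluster:
            s_in += 2 * weight
        elif a in cluster or b in cluster:
            s_out += weight
    total = s_in + s_out
    return s_in / total ** alpha if total else 0.0


small = fitness(graph, {"a", "b"})
larger = fitness(graph, {"a", "b", "c"})
```

In this toy chain, adding `c` raises the fitness, so a greedy procedure would accept it. Larger values of α penalize the total strength more heavily and therefore favor smaller clusters.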
  16. Hierarchy detection
     • Hierarchy is a specific case of basic level variation
     • A possible approach: Hearst patterns on the Web, such as:
         • C1 (and|or) other C2 (e.g. "poodles and other dogs")
         • C1 such as I (e.g. "cities such as San Francisco")
         (note: C_i are concepts, I is a concept instance)
     • Search for the patterns, and use the number of results as an indicator for their strength
     • Pros: the Web is as up-to-date as folksonomies
     • Cons: O(n²) complexity, not really scalable
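The pattern-based approach can be sketched as below. The `hit_count` argument stands for whatever search back-end returns a result count for a quoted phrase; it is a hypothetical callable, and the counts in the example are invented. No pluralization is attempted, which a real implementation would need.

```python
from itertools import permutations


def hearst_queries(narrower, broader):
    """Build quoted Hearst-pattern queries for a (narrower, broader) pair."""
    return [
        f'"{narrower} and other {broader}"',
        f'"{narrower} or other {broader}"',
        f'"{broader} such as {narrower}"',
    ]


def hierarchy_strength(narrower, broader, hit_count):
    """Sum the result counts of all patterns as a strength indicator."""
    return sum(hit_count(q) for q in hearst_queries(narrower, broader))


def candidate_pairs(tags):
    """Every ordered pair must be tested: n * (n - 1) pairs, hence O(n^2)."""
    return list(permutations(tags, 2))


# Invented result counts standing in for a real search engine.
fake_hits = {'"poodles and other dogs"': 120, '"dogs such as poodles"': 80}
strength = hierarchy_strength("poodles", "dogs", lambda q: fake_hits.get(q, 0))
reverse = hierarchy_strength("dogs", "poodles", lambda q: fake_hits.get(q, 0))
pairs = candidate_pairs(["dogs", "poodles", "cats"])
```

The asymmetry between `strength` and `reverse` is what suggests the direction of the hierarchy, and the size of `pairs` shows why the number of queries grows quadratically with the number of tags.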
  17. Prototype development
     • Dataset:
         • data from more than 30K users of delicious
         • ignored the system:unfiled tag
         • for the calculation of Tag Context Similarity, we only took into account the top 10K tags
     • Prototype:
         • tag analysis tool, calculating CO, NCO, and TCS (takes time, runs as a batch job and saves matrices in the DB)
         • disambiguation with homonyms plugin, implementing the overlapping clustering algorithm, and Wikipedia synonym discovery
         • front-end is currently a command-line application
  18. Experimental results / 1
     • System tested against three different sets of tags:
         • top 20 tags in delicious
         • a group of tags known to be ambiguous (apple, cambridge, sf, stream, turkey, tube)
         • a set of subjective tags, chosen among the most popular ones in delicious (cool, fun, funny, interesting, toread)
     • For each tag:
         • we calculated the top n (with n = 50) related tags with the three metrics (CO, NCO, TCS)
         • we performed synonym and homonym analyses
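Selecting the top n related tags from a precomputed similarity matrix can be sketched as follows. The sparse dictionary representation, the function name `top_related` and the similarity values are assumptions made for illustration and are not results from the talk.

```python
import heapq

# Invented similarity scores between tag pairs (not data from the talk).
similarity = {
    ("toread", "read"): 0.9,
    ("to_read", "toread"): 0.8,
    ("toread", "cool"): 0.1,
    ("cool", "fun"): 0.7,
}


def top_related(similarity, tag, n=50):
    """Return the n tags most related to `tag`, best first, with their scores."""
    scores = {}
    for (a, b), score in similarity.items():
        if a == tag:
            scores[b] = score
        elif b == tag:
            scores[a] = score
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


best_two = top_related(similarity, "toread", n=2)
```

The same function applies to any of the three metrics (CO, NCO, TCS), since only the contents of the similarity mapping change.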
  19. Experimental results / 2
     • Tag Context Similarity already tends to provide synonyms as top-related tags
         • e.g. toread related: read, read_later, to_read, etc.
     • Analyzing a less popular synonym (@readit):
         • 9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
         • reason: as less popular tags are less spread across contexts, they tend to have a higher similarity with other less popular synonyms
     • Wikipedia results:
         • analyzing the 31 tags in our three sets, we got 215 new words
         • of those 215, only 83 are valid tags in our delicious dataset
         • of those 83, only 20 belong to the 10K most-used tags
         • only 2 belong to the set of the top-related tags of their English synonym
  20. Experimental results / 3
     • Homonyms detection:
         • we tested the algorithm with different values of α
         • meaningful results in a relatively short time (but we are working only on the top related tags...)
         • limit: the graphs of top related tags differ in connectivity, so there is no single value of α that is good for all of them (α_sf = 1.4, α_stream = 1.74)
  21. Conclusions
     • Model:
         • flexible enough to support other kinds of metrics
         • multigraph can be simplified in other ways
         • user-related weights still have to be taken into account
     • Tool:
         • still in the prototype phase, but it already provided useful results and allowed us to compare:
             • metrics: different metrics provide very different results, which might be more or less useful according to the user's needs
             • tag behaviors: different depending on their popularity and the use people make of them
  22. Conclusions
     • Ongoing work:
         • clustering evaluation metrics to find the best α
         • applications (e.g. for tag grouping and visualization*)
         • user- and resource-specific projections**
     • Future work:
         • development of other plugins and front-end
         • play with user-related weights to focus on specific communities / filter spam
     * Mazzola, Eynard, Mazza: "GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation"
     ** Dattolo, Ferrara, Tasso: "On social semantic relations for recommending tags and resources using folksonomies"
  23. Thank you! Thanks for your attention! Questions?
  24. toread top 20 related tags
  25. @readit top 20 related tags
  26. sf top 20 related tags
  27. stream top 20 related tags