An Integrated Approach
       to Discover Tag Semantics
            SAC 2011, Web Technologies Track, March 24th 2011



   Antonina Dattolo, University of Udine, Department of Mathematics and Computer Science, antonina.dattolo@uniud.it
   Davide Eynard, USI - University of Lugano, ITC - Institute for Communication Technologies, davide.eynard@usi.ch
   Luca Mazzola, USI - University of Lugano, ITC - Institute for Communication Technologies, luca.mazzola@usi.ch
Talk outline

    Properties of tags
    Folksonomies as edge-colored multigraphs
    Framework design and implementation
    Tests and evaluations
    Conclusions




24/03/2011           An integrated approach to discover tag semantics   2/27
Properties of tags

     Tags:
            are democratic and bottom-up (vs hierarchical)
            are inclusive and current
            follow desire lines
            are easy to use




Drawbacks of tags

    Lexical ambiguities:
            Synonyms
                 game and juego, or web2.0 and web_2
            Homonyms
                 check as in chess and in “to check” (polysemous)
                 sf as scifi or san_francisco
            Basic level variations
                 dog and poodle
    Ambiguities due to different purposes:
                 blog to tag blog software (e.g. WordPress), a blog service, a blog
                   post, something to blog about later, ...

Advantages of disambiguation

    Synonym detection:
            increases recall
            allows for better recommendation systems
    Homonym detection:
            makes it possible to find different contexts of use
            increases precision
    Basic level variations detection:
            identifies a hierarchy
            increases recall (e.g. by automatically searching for subclasses)
            provides a means to browse search results
Approaches to tag disambiguation

    Roughly two main families of approaches
            Theoretical ones, aiming at describing the system as a
             whole
            More practical, ad-hoc ones (often addressing one or few
             issues at a time)
    Our approach
            Main assumption: lexical ambiguities are not independent
             from each other
            Solution based on
                 a theoretical framework
                 a modular, extensible analysis tool

Folksonomies as edge-colored
multigraphs
    Def.1: An edge-colored multigraph is a triple
             ECMG = (MG, C, c)
     where:
            MG = (V,E,f) is a multigraph
            C is a set of colors
            c : E→C is an assignment of colors to multigraph edges

    Def.2: A personomy related to user u is a non-directed
     edge-colored graph of color C_u:
             P_u = (T, R, E, C_u)

Folksonomies as edge-colored
multigraphs
    Def.3: Given a set of users U and the family of
     personomies P_u (u ∈ U), a folksonomy is defined as

             [formula not preserved in this transcript]

     that is, an edge-colored multigraph where:
            vertices are tags + resources
            edges are tag assignments made on
             resources by each user
            every color is a different user
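
A minimal sketch of the definitions above, assuming tag assignments are available as (user, tag, resource) triples. The sample data and the function names `personomies` and `folksonomy` are illustrative and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical tag assignments: (user, tag, resource) triples.
ASSIGNMENTS = [
    ("alice", "python", "r1"),
    ("alice", "code", "r1"),
    ("bob", "python", "r1"),
    ("bob", "sf", "r2"),
]


def personomies(assignments):
    """Group edges by color: one set of (tag, resource) edges per user."""
    by_user = defaultdict(set)
    for user, tag, resource in assignments:
        by_user[user].add((tag, resource))
    return dict(by_user)


def folksonomy(assignments):
    """Build the edge-colored multigraph as plain Python collections.

    Vertices are tags plus resources, every edge keeps its color (the user),
    and parallel edges between the same tag and resource are allowed.
    """
    tags = {tag for _, tag, _ in assignments}
    resources = {resource for _, _, resource in assignments}
    edges = [(tag, resource, user) for user, tag, resource in assignments]
    colors = {user for user, _, _ in assignments}
    return tags, resources, edges, colors


tags, resources, edges, colors = folksonomy(ASSIGNMENTS)
```

Here the tag python and the resource r1 are linked by two parallel edges, one colored alice and one colored bob.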


First simplification step

    As we are only interested in relationships between
     tags, we need to perform two simplification steps on
     the edge-colored multigraph
    Step 1: colored edges are collapsed and substituted
     by weighted edges
        potentially, every color (user) might be
         assigned a different weight w_u
        the weight w of the collapsed edge is the sum
         of all the w_u linking the same two vertices
        when w_u = 1 for each user, w is the number of
         times a tag is used on a resource
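
Step 1 can be sketched as follows, assuming colored edges are stored as (tag, resource, user) triples. The name `collapse_colors` and the `user_weight` mapping are hypothetical; users without an explicit weight default to 1.

```python
from collections import defaultdict


def collapse_colors(edges, user_weight=None):
    """Collapse parallel colored edges into one weighted edge.

    edges: iterable of (tag, resource, user) triples.
    user_weight: optional {user: w_u}; missing users count as 1.0.
    Returns {(tag, resource): w}, where w sums the w_u of every user
    who linked that tag to that resource.
    """
    user_weight = user_weight or {}
    weights = defaultdict(float)
    for tag, resource, user in edges:
        weights[(tag, resource)] += user_weight.get(user, 1.0)
    return dict(weights)


# Illustrative data only.
EDGES = [
    ("python", "r1", "alice"),
    ("python", "r1", "bob"),
    ("code", "r1", "alice"),
]
uniform = collapse_colors(EDGES)                # w = number of users
damped = collapse_colors(EDGES, {"bob": 0.5})   # bob counts half
```

With uniform weights, the edge between python and r1 gets w = 2, the number of times the tag was used on that resource.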

Second simplification step
    Step 2:
            a link is created between t_a and t_b if they
             share a resource
            resource nodes are dropped
    Edges' weights can be calculated
     in different ways:
            number of triples (t_i, r, t_j) where (t_i, r), (r, t_j) ∈ E
             => co-occurrence
            normalized co-occurrence (e.g. using the
             Jaccard index)
            distributional measures
            custom metrics (e.g. sum of products of
             connecting edges' weights)
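
A minimal sketch of Step 2 for the first two weighting schemes, assuming the graph from Step 1 is reduced to distinct (tag, resource) edges. The Jaccard normalization is computed here over the sets of resources each tag is attached to, which is one possible reading of "normalized co-occurrence"; all function names are illustrative.

```python
from itertools import combinations


def tag_resources(edges):
    """edges: iterable of (tag, resource) pairs -> {tag: set of resources}."""
    index = {}
    for tag, resource in edges:
        index.setdefault(tag, set()).add(resource)
    return index


def cooccurrence(index):
    """Count the resources shared by each pair of tags.

    Pairs that share no resource get no link, so they are omitted.
    Resource nodes disappear: only tag pairs remain as keys.
    """
    co = {}
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])
        if shared:
            co[(a, b)] = shared
    return co


def jaccard(index, a, b):
    """Shared resources divided by all resources carrying either tag."""
    union = index[a] | index[b]
    return len(index[a] & index[b]) / len(union) if union else 0.0


# Illustrative data only.
EDGES = [
    ("python", "r1"), ("code", "r1"),
    ("python", "r2"), ("code", "r2"),
    ("python", "r3"), ("sf", "r4"),
]
index = tag_resources(EDGES)
co = cooccurrence(index)
```

In this toy data, code and python co-occur on two resources out of three, so their Jaccard weight is 2/3, while python and sf are not linked at all.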
The whole process at a glance
    [Figure: four numbered panels (1 to 4) showing the steps of the process; images not preserved]
System architecture

    Basic assumption:
            ambiguous tags should be related (either by co-occurrence or
             by presence in the same context)
    Three main components:
            tag analysis tool
            disambiguation tool
            front-end




Synonym detection / 1

    Natural text …
            Two words are considered synonyms if they can be replaced
             by each other without affecting the meaning of a sentence
    … vs. Tag-based systems
            It is possible to swap two tags within a “sentence” (i.e. a
             tagging action) without affecting its meaning when we have:
                 variations of a word (e.g. blog, blogs, blogging)
                 translations into other languages (e.g. game, juego, spiel)
                 terms joined by non-alphabetic characters (e.g. web2, web_2)
            No “one size fits all” solution


Synonym detection / 2

    A modular solution for synonym detection:
            different heuristics, each one returning the likelihood that two tags are
             synonyms
            results are weighted to obtain an overall likelihood


    Suggested heuristics:
            an edit distance such as Levenshtein's (normalized to account for short
             strings)
            synonym search in WordNet (good precision, low recall)
            online translation bases (top-down, such as dictionaries, or bottom-up,
             such as collaboratively grown vocabularies like Wikipedia)
            stemming with NLP algorithms
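
A minimal sketch of the modular scheme, assuming each heuristic is a function that returns a likelihood in [0, 1] and that the overall likelihood is a weighted average. Only two simple heuristics are shown; `same_after_stripping` and the equal weights are illustrative choices, not part of the original tool.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalized by the longer string, so short tags are not favored."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_after_stripping(a, b):
    """1.0 if the tags match once non-alphanumeric characters are dropped."""
    def strip(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return 1.0 if strip(a) == strip(b) else 0.0


def synonym_likelihood(a, b, heuristics):
    """Weighted average of the scores returned by (heuristic, weight) pairs."""
    total = sum(weight for _, weight in heuristics)
    return sum(weight * h(a, b) for h, weight in heuristics) / total


# Hypothetical configuration: both heuristics weighted equally.
HEURISTICS = [(edit_similarity, 1.0), (same_after_stripping, 1.0)]
score = synonym_likelihood("web2", "web_2", HEURISTICS)
```

Further heuristics (WordNet lookup, translation bases, stemming) would plug in as additional `(function, weight)` pairs without changing `synonym_likelihood`.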

Homonym detection

    Check whether the tag t has been used in different contexts
            cluster the tags related to t into groups
            the most frequent tags in these groups are used to name
             and disambiguate the contexts
    Clustering algorithm:
            an overlapping one, also used in social network analysis*
            a cluster is a subgraph G identified by the maximization of a
             fitness property
             s = strength of internal (in) or external (out) links
             α = tweaking parameter


     * A. Lancichinetti et al.: “Detecting the overlapping and hierarchical community structure of complex networks”
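
The fitness formula itself is not preserved in this transcript. The sketch below assumes it has the form used in the cited paper, with link strengths in place of degrees: f = s_in / (s_in + s_out)^α. The greedy growth shown is a simplification (it only adds nodes and never re-checks or removes them), so it illustrates the idea rather than reproducing the cited algorithm; graph layout and function names are illustrative.

```python
def fitness(graph, cluster, alpha):
    """Assumed fitness: s_in / (s_in + s_out) ** alpha.

    graph: {node: {neighbor: weight}}, undirected and symmetric.
    s_in sums internal link weights (each link seen from both ends),
    s_out sums the weights of links leaving the cluster.
    """
    s_in = s_out = 0.0
    for node in cluster:
        for neighbor, weight in graph[node].items():
            if neighbor in cluster:
                s_in += weight
            else:
                s_out += weight
    total = s_in + s_out
    return s_in / total ** alpha if total else 0.0


def grow_cluster(graph, seed, alpha):
    """Greedily add the neighbor that most increases fitness, until none does."""
    cluster = {seed}
    while True:
        current = fitness(graph, cluster, alpha)
        frontier = {n for node in cluster for n in graph[node]} - cluster
        best, best_gain = None, 0.0
        for candidate in sorted(frontier):
            gain = fitness(graph, cluster | {candidate}, alpha) - current
            if gain > best_gain:
                best, best_gain = candidate, gain
        if best is None:
            return cluster
        cluster.add(best)


def make_graph(links):
    """Build a symmetric adjacency dict from (a, b, weight) links."""
    graph = {}
    for a, b, w in links:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


# Toy graph: two triangles joined by a single bridge (c-d).
GRAPH = make_graph([
    ("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
    ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),
    ("c", "d", 1.0),
])
found = grow_cluster(GRAPH, "a", alpha=1.0)
```

Running the growth from several seeds can yield clusters that share nodes, which is what makes the clustering overlapping; larger values of α tend to produce smaller clusters.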


Hierarchy detection

    Hierarchy is a specific case of basic level variation
    A possible approach: Hearst patterns on the Web,
     such as:
            C_1 (and|or) other C_2 (e.g. “poodles and other dogs”)
            C_1 such as I (e.g. “cities such as San Francisco”)
             (note: C_i are concepts, I is a concept instance)

    Search for the patterns, and use the number of results
     as an indicator for their strength
    Pros: the Web is as up-to-date as folksonomies
    Cons: O(n^2) complexity, not really scalable
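
A minimal sketch of the pattern-based approach, assuming result counts come from a caller-supplied `hit_count` function (hypothetical: no particular search API is implied). Tags are inserted into the patterns as they are, without pluralization, and only the first pattern family is covered.

```python
from itertools import permutations

# Query templates for the "C_1 (and|or) other C_2" pattern.
PATTERNS = ("{c1} and other {c2}", "{c1} or other {c2}")


def hearst_queries(sub, sup):
    """Build the search phrases suggesting that `sub` is a kind of `sup`."""
    return [pattern.format(c1=sub, c2=sup) for pattern in PATTERNS]


def hierarchy_scores(tags, hit_count):
    """Score every ordered tag pair by the total number of search results.

    Every ordered pair is queried, i.e. n * (n - 1) pairs for n tags,
    which is the O(n^2) cost mentioned above.
    """
    scores = {}
    for sub, sup in permutations(tags, 2):
        scores[(sub, sup)] = sum(hit_count(q) for q in hearst_queries(sub, sup))
    return scores


# Stand-in for a real search engine: made-up counts, for illustration only.
FAKE_HITS = {"poodle and other dog": 120, "poodle or other dog": 30}
scores = hierarchy_scores(["dog", "poodle", "sf"],
                          lambda query: FAKE_HITS.get(query, 0))
```

A pair whose score is much higher in one direction than in the other, such as (poodle, dog) in this made-up data, is a candidate subclass relation.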
Prototype development

       Dataset
               Data from more than 30K users of
                http://www.delicious.com
               Ignored the system:unfiled tag
               For the calculation of Tag Context Similarity,
                we only took into account the top 10K tags



       Prototype
               Tag analysis tool, calculating co-occurrence (CO), normalized co-occurrence
                (NCO), and Tag Context Similarity (TCS); it takes time, so it runs as a
                batch job and saves the matrices in the DB
               Disambiguation tool with a homonym plugin, implementing the overlapping
                clustering algorithm, and Wikipedia-based synonym discovery
               Front-end is currently a command-line application
Experimental results / 1

    System tested against three different sets of tags:
            Top 20 tags in delicious
            A group of tags known to be ambiguous (apple, cambridge, sf,
             stream, turkey, tube)
            A set of subjective tags, chosen among the most popular ones in
             delicious (cool, fun, funny, interesting, toread)
    For each tag:
            we calculated the top n (with n = 50) related tags with the three metrics
             (CO, NCO, TCS)
            we performed synonym and homonym analyses
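
The slides do not define Tag Context Similarity. The sketch below assumes a common distributional measure, the cosine similarity between the co-occurrence vectors of two tags, and shows how a top-n list of related tags could be derived from it. The vectors and the names `cosine` and `top_related` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def top_related(tag, vectors, n=50):
    """Return up to n tags most similar to `tag`, dropping zero-similarity ones."""
    scored = [(cosine(vectors[tag], vec), other)
              for other, vec in vectors.items() if other != tag]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:n] if score > 0]


# Hypothetical co-occurrence vectors: {tag: {co-occurring tag: count}}.
VECTORS = {
    "toread": {"article": 3, "later": 2},
    "to_read": {"article": 3, "later": 2},
    "read": {"article": 1},
    "sf": {"california": 4},
}
related = top_related("toread", VECTORS, n=50)
```

Because the measure compares contexts rather than direct co-occurrence, two tags that are never used together (such as spelling variants) can still rank as closely related.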




Experimental results / 2
    Tag Context Similarity already tends to provide
     synonyms as top-related tags
            e.g. tags related to toread: read, read_later, to_read, etc.
    Analyzing a less popular synonym (@readit):
            9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
            reason: as less popular tags are less spread across contexts, they tend
             to have a higher similarity with other less popular synonyms
    Wikipedia results:
            analyzing the 31 tags in our three sets, we got 215 new words;
            of those 215, only 83 are valid tags in our delicious dataset;
            of those 83, only 20 belong to the 10K most-used tags;
            only 2 belong to the set of the top-related tags of their English synonym.
Experimental results / 3

    Homonym detection:
            we tested the algorithm with
             different values of α
            meaningful results in a relatively
             short time (but we are working
             only on the top related tags...)
            limitation: the graphs of top related
             tags differ in connectivity, so
             no single value of α works well
             for all of them (α_sf = 1.4,
             α_stream = 1.74).




Conclusions

    Model
            Flexible enough to support other kinds of metrics
            Multigraph can be simplified in other ways
            User-related weights still have to be taken into account
    Tool
            Still a prototype, but it has already provided useful results
             and allowed us to compare
                 metrics: different metrics provide very different results, which may be
                   more or less useful depending on the user's needs
                 tag behaviors: these differ depending on the tags' popularity and on how
                    people use them


Conclusions

    Ongoing work
            Clustering evaluation metrics to find best α
            Applications (e.g. for tag grouping and visualization*)
            User- and resource-specific projections**
    Future work
            Development of other plugins and front-end
            Experiment with user-related weights to focus on specific
             communities / filter spam

     * Mazzola, Eynard, Mazza: “GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation”

     ** Dattolo, Ferrara, Tasso: “On social semantic relations for recommending tags and resources using folksonomies”

Thank you!



             Thanks for your attention!

                        Questions?




Backup slides (figures not preserved)

    toread: top 20 related tags
    @readit: top 20 related tags
    sf: top 20 related tags
    stream: top 20 related tags
