An integrated approach to discover tag semantics

SAC 2011, Web Technologies Track, March 24th 2011

Antonina Dattolo, University of Udine, Department of Mathematics and Computer Science, antonina.dattolo@uniud.it
Davide Eynard, USI - University of Lugano, ITC - Institute for Communication Technologies, davide.eynard@usi.ch
Luca Mazzola, USI - University of Lugano, ITC - Institute for Communication Technologies, luca.mazzola@usi.ch

Talk outline

 Properties of tags
 Folksonomies as edge-colored multigraphs
 Framework design and implementation
 Tests and evaluations
 Conclusions


Properties of tags

Tags:
 are democratic and bottom-up (vs hierarchical)
 are inclusive and current
 follow desire lines
 are easy to use


Drawbacks of tags

 Lexical ambiguities:
 Synonyms
 game and juego, or web2.0 and web_2
 Homonyms
 check as in chess and in “to check” (polysemous)
 sf as scifi or san_francisco
 Basic level variations
 dog and poodle
 Ambiguities due to different purposes:
 blog used to tag blogging software (e.g. WordPress), a blog service, a blog post, something to blog about later, ...


Advantages of disambiguation

 Synonym detection:
 increases recall
 allows for better recommendation systems
 Homonym detection:
 makes it possible to find different contexts of use
 increases precision
 Basic level variations detection:
 identifies a hierarchy
 increases recall (e.g. by automatically searching for subclasses)
 provides a means to browse search results

Approaches to tag disambiguation

 Roughly two main families of approaches
 Theoretical ones, aiming at describing the system as a
whole
 More practical, ad-hoc ones (often addressing one or a few issues at a time)
 Our approach
 Main assumption: lexical ambiguities are not independent
from each other
 Solution based on
 a theoretical framework
 a modular, extensible analysis tool


Folksonomies as edge-colored multigraphs
 Def.1: An edge-colored multigraph is a triple
ECMG = (MG, C, c)
where:
 MG = (V,E,f) is a multigraph
 C is a set of colors
 c : E→C is an assignment of colors to multigraph edges

 Def.2: A personomy related to user u is an undirected edge-colored graph of color C_u:
P_u = (T, R, E, C_u)


Folksonomies as edge-colored multigraphs
 Def.3: Given a set of users U and the family of personomies P_u (u ∈ U), a folksonomy is defined as
F = ⋃_{u ∈ U} P_u
that is, an edge-colored multigraph where:
 vertices are tags + resources
 edges are tag assignments made on
resources by each user
 every color is a different user
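
The three definitions can be sketched in a few lines of Python. This is only an illustrative data layout, not the authors' implementation: the toy tag assignments and the function names are assumptions, and the multigraph is represented simply as a list of (tag, resource, user) edges, with the user acting as the edge color.

```python
from collections import defaultdict

# Hypothetical toy data: (user, tag, resource) tag assignments.
ASSIGNMENTS = [
    ("alice", "blog", "r1"),
    ("alice", "web2.0", "r1"),
    ("bob", "blog", "r1"),
    ("bob", "sf", "r2"),
]

def build_personomies(assignments):
    """Group tag-resource edges by user: one edge set per color."""
    personomies = defaultdict(set)
    for user, tag, resource in assignments:
        personomies[user].add((tag, resource))
    return dict(personomies)

def build_folksonomy(personomies):
    """Merge all personomies into one multigraph: a list of colored edges."""
    return [
        (tag, resource, user)
        for user, edges in sorted(personomies.items())
        for tag, resource in sorted(edges)
    ]

personomies = build_personomies(ASSIGNMENTS)
folksonomy = build_folksonomy(personomies)
```

Note that the pair (blog, r1) appears twice in the result, once per user: these parallel edges are what makes the structure a multigraph.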


First simplification step

 As we are only interested in relationships between
tags, we need to perform two simplification steps on
the edge-colored multigraph
 Step 1: colored edges are collapsed and substituted
by weighted edges
 potentially, every color (user) might be assigned a different weight w_u
 the weight w of the collapsed edge is the sum of all the w_u linking the same two vertices
 when w_u = 1 for every user, w is the number of times a tag has been used on a resource
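
A minimal sketch of Step 1, assuming the multigraph is given as (tag, resource, user) triples. The function name and the default weight of 1 for users without an explicit weight are assumptions made for illustration.

```python
from collections import defaultdict

def collapse_colors(colored_edges, user_weights=None):
    """Replace parallel colored edges with a single weighted edge.

    colored_edges: iterable of (tag, resource, user) triples.
    user_weights: optional dict user -> w_u; missing users default to 1.0.
    """
    user_weights = user_weights or {}
    weights = defaultdict(float)
    for tag, resource, user in colored_edges:
        weights[(tag, resource)] += user_weights.get(user, 1.0)
    return dict(weights)

edges = [("blog", "r1", "alice"), ("blog", "r1", "bob"), ("sf", "r2", "bob")]
uniform = collapse_colors(edges)
weighted = collapse_colors(edges, {"alice": 0.5, "bob": 2.0})
```

With uniform weights the edge (blog, r1) gets weight 2, the number of users who applied that tag to that resource; with the per-user weights it gets 0.5 + 2.0.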


Second simplification step
 Step 2:
 a link is created between ta and tb if they
share a resource
 resource nodes are dropped
 Edge weights can be calculated in different ways:
 number of triples (t_i, r, t_j) where (t_i, r), (r, t_j) ∈ E => co-occurrence
 normalized co-occurrence (e.g. using the Jaccard index)
 distributional measures
 custom metrics (e.g. sum of products of connecting edges' weights)
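
Step 2 can be sketched as a projection of the weighted tag-resource graph onto tags. The code below is an assumption-laden illustration, not the tool described in the talk: it takes a dict of (tag, resource) weights as input and computes plain co-occurrence, the Jaccard-normalized variant, and the sum-of-products custom metric.

```python
from collections import defaultdict
from itertools import combinations

def project_on_tags(tag_resource_weights):
    """Drop resource nodes and link tags that share a resource.

    Returns (co, custom):
      co[(ta, tb)]     = number of resources shared by ta and tb
      custom[(ta, tb)] = sum over shared resources of w(ta, r) * w(r, tb)
    """
    by_resource = defaultdict(dict)
    for (tag, resource), w in tag_resource_weights.items():
        by_resource[resource][tag] = w
    co = defaultdict(int)
    custom = defaultdict(float)
    for tags in by_resource.values():
        for ta, tb in combinations(sorted(tags), 2):
            co[(ta, tb)] += 1
            custom[(ta, tb)] += tags[ta] * tags[tb]
    return dict(co), dict(custom)

def jaccard(tag_resource_weights, ta, tb):
    """Shared resources divided by all resources tagged with either tag."""
    ra = {r for (t, r) in tag_resource_weights if t == ta}
    rb = {r for (t, r) in tag_resource_weights if t == tb}
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0

weights = {("blog", "r1"): 2, ("web2.0", "r1"): 1,
           ("blog", "r2"): 1, ("sf", "r2"): 3}
co, custom = project_on_tags(weights)
```

Tag pairs are stored once, with the two tags in alphabetical order, since the resulting tag graph is undirected.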

The whole process at a glance

[Figure: the process shown in four numbered panels, 1 to 4]

System architecture

 Basic assumption:
 ambiguous tags should be related (either by co-occurrence or by presence in the same context)
 Three main components:
 tag analysis tool
 disambiguation tool
 front-end


Synonym detection / 1

 Natural text …
 Two words are considered synonyms if they can be replaced
by each other without affecting the meaning of a sentence
 … vs. Tag-based systems
 It is possible to swap two tags within a “sentence” (i.e. a tagging action) without affecting its meaning when we have:
 variations of a word (e.g. blog, blogs, blogging)
 translations into other languages (e.g. game, juego, spiel)
 terms joined by non-alphabetic characters (e.g. web2, web_2)
 No “one size fits all” solution


Synonym detection / 2

 A modular solution for synonym detection:
 different heuristics, each one returning the likelihood that two tags are synonyms
 results are weighted to obtain an overall likelihood

 Suggested heuristics:
 an edit distance such as Levenshtein's (normalized to account for short
strings);
 synonym search in WordNet (good precision, low recall);
 online translation bases (top-down, such as dictionaries, or bottom-up, such as collaboratively grown vocabularies like Wikipedia);
 stemming with NLP algorithms.
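
The modular scheme above can be sketched as a weighted average of pluggable heuristics. Only the normalized edit distance comes from the list of suggested heuristics; the second heuristic (`same_after_stripping`), the equal weights, and all function names are hypothetical and chosen only to show how scores are combined.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0

def same_after_stripping(a, b):
    """Hypothetical heuristic: equal once non-alphanumeric characters are removed."""
    def strip(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return 1.0 if strip(a) == strip(b) else 0.0

def synonym_likelihood(a, b, heuristics):
    """Weighted average of heuristic scores; heuristics is a list of (function, weight)."""
    total = sum(weight for _, weight in heuristics)
    return sum(weight * fn(a, b) for fn, weight in heuristics) / total

heuristics = [(edit_similarity, 0.5), (same_after_stripping, 0.5)]
score = synonym_likelihood("web2", "web_2", heuristics)
```

Normalizing by the longer string keeps a single edit on a short tag from looking as harmless as a single edit on a long one. A WordNet or translation lookup could be plugged in as another (function, weight) pair.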


Homonym detection

 Check if the tag t has been used in different contexts
 cluster tags related to t in groups
 the most frequent tags in these groups are used to name
and disambiguate the contexts
 Clustering algorithm:
 an overlapping one, also used in social network analysis*
 a cluster is a subgraph G identified by maximizing a fitness function, where:
s = strength of internal (in) or external (out) links
α = tuning parameter
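
The fitness formula itself does not survive in the slide text. The cited paper by Lancichinetti et al. defines it on node degrees as f_G = k_in / (k_in + k_out)^α; the sketch below assumes the weighted analogue suggested by the legend, f_G = s_in / (s_in + s_out)^α, and a simplified greedy growth step that only adds nodes (the published algorithm also removes nodes whose fitness turns negative). The toy graph and function names are assumptions.

```python
def fitness(graph, members, alpha=1.0):
    """f_G = s_in / (s_in + s_out) ** alpha for the node set `members`.

    graph: dict node -> dict neighbor -> weight (undirected, stored symmetrically).
    s_in counts each internal edge from both endpoints, as node strengths do.
    """
    members = set(members)
    s_in = s_out = 0.0
    for node in members:
        for neighbor, weight in graph[node].items():
            if neighbor in members:
                s_in += weight
            else:
                s_out += weight
    total = s_in + s_out
    return s_in / total ** alpha if total else 0.0

def grow_cluster(graph, seed, alpha=1.0):
    """Greedy growth: keep adding the neighbor that most increases fitness."""
    cluster = {seed}
    while True:
        frontier = {n for m in cluster for n in graph[m]} - cluster
        best = max(sorted(frontier),
                   key=lambda n: fitness(graph, cluster | {n}, alpha),
                   default=None)
        if best is None or fitness(graph, cluster | {best}, alpha) <= fitness(graph, cluster, alpha):
            return cluster
        cluster.add(best)

# Toy graph: two triangles joined by a single edge (c-d).
graph = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
cluster = grow_cluster(graph, "a")
```

Running the growth from different seeds can yield clusters that share nodes, which is what makes the method an overlapping one. Larger values of α penalize external links more and give smaller clusters.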

* A. Lancichinetti et al.: "Detecting the overlapping and hierarchical community structure of complex networks"


Hierarchy detection

 Hierarchy is a specific case of basic level variation
 A possible approach: Hearst patterns on the Web, such as:
 C1 (and|or) other C2 (e.g. “poodles and other dogs”)
 C1 such as I (e.g. “cities such as San Francisco”)
(note: Ci are concepts, I is a concept instance)

 Search for the patterns, and use the number of results
as an indicator for their strength
 Pros: the Web is as up-to-date as folksonomies
 Cons: O(n^2) complexity, not really scalable
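
A hedged sketch of how the pattern queries could be generated and scored. No real search API is assumed: `hit_count` is a hypothetical caller-supplied function mapping a query string to a number of results, and the templates ignore pluralization, which a real tool would have to handle.

```python
from itertools import permutations

# Templates derived from the two example patterns; {a} is the narrower
# term and {b} the broader one.
PATTERNS = ['"{a} and other {b}"', '"{a} or other {b}"', '"{b} such as {a}"']

def hearst_queries(tags):
    """Yield (narrower, broader, query) for every ordered pair of tags."""
    for a, b in permutations(tags, 2):
        for pattern in PATTERNS:
            yield a, b, pattern.format(a=a, b=b)

def hierarchy_strength(tags, hit_count):
    """Sum result counts per (narrower, broader) pair.

    hit_count: hypothetical function query -> int, e.g. a wrapper around
    a search engine chosen by the caller.
    """
    strength = {}
    for a, b, query in hearst_queries(tags):
        strength[(a, b)] = strength.get((a, b), 0) + hit_count(query)
    return strength

# Stand-in for a search engine, with made-up counts for illustration only.
FAKE_COUNTS = {'"poodles and other dogs"': 100, '"dogs such as poodles"': 50}
queries = list(hearst_queries(["poodles", "dogs"]))
strength = hierarchy_strength(["poodles", "dogs"], lambda q: FAKE_COUNTS.get(q, 0))
```

For n tags the generator produces n(n-1) ordered pairs, each needing several queries, which is where the quadratic cost comes from.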

Prototype development

 Dataset
 Data from more than 30K users of
http://www.delicious.com
 Ignored the system:unfiled tag
 For the calculation of Tag Context Similarity,
we only took into account the top 10K tags

 Prototype
 Tag analysis tool, calculating co-occurrence (CO), normalized co-occurrence (NCO), and Tag Context Similarity (TCS); it takes time, runs as a batch job and saves matrices in the DB
 Disambiguation with homonyms plugin, implementing the overlapping
clustering algorithm, and Wikipedia synonym discovery
 Front-end is currently a command-line application

Experimental results / 1

 System tested against three different sets of tags:
 Top 20 tags in delicious
 A group of tags known to be ambiguous (apple, cambridge, sf,
stream, turkey, tube)
 A set of subjective tags, chosen among the most popular ones in delicious (cool, fun, funny, interesting, toread)
 For each tag:
 we calculated the top n (with n = 50) related tags with the three metrics
(CO, NCO, TCS)
 we performed synonym and homonym analyses
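
Ranking the top n related tags can be sketched as follows. The slides do not define how Tag Context Similarity is computed, so cosine similarity between co-occurrence vectors is used here purely as one common distributional measure, not as the authors' metric; the vectors are made-up toy data.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def top_related(tag, vectors, n=50, metric=cosine):
    """Return the n tags most similar to `tag`, best first."""
    scores = ((metric(vectors[tag], vec), other)
              for other, vec in vectors.items() if other != tag)
    return [(other, score) for score, other in heapq.nlargest(n, scores)]

# Hypothetical co-occurrence vectors: tag -> {co-occurring tag: count}.
vectors = {
    "toread": {"article": 3, "book": 2},
    "to_read": {"article": 3, "book": 2},
    "read_later": {"article": 1},
    "funny": {"video": 5},
}
related = top_related("toread", vectors, n=2)
```

Any of the three metrics can be passed as `metric`, so the same ranking code serves all of them.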


Experimental results / 2

 Tag Context Similarity already tends to provide synonyms as top-related tags
 e.g. tags related to toread: read, read_later, to_read, etc.
 Analyzing a less popular synonym (@readit):
 9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
 reason: as less popular tags are less spread across contexts, they tend
to have a higher similarity with other less popular synonyms
 Wikipedia results:
 analyzing the 31 tags in our three sets, we got 215 new words;
 of those 215, only 83 are valid tags in our delicious dataset;
 of those 83, only 20 belong to the 10K most-used tags;
 only 2 belong to the set of the top-related tags of their English synonym.


Experimental results / 3

 Homonym detection:
 we tested the algorithm with different values of α
 meaningful results in a relatively short time (but we are working only on the top related tags...)
 limit: the graphs of top related tags differ in connectivity, so no single value of α works well for all of them (α = 1.4 for sf, α = 1.74 for stream).


Conclusions

 Model
 Flexible enough to support other kinds of metrics
 Multigraph can be simplified in other ways
 User-related weights still have to be taken into account
 Tool
 Still at the prototype stage, but it has already provided useful results and allowed us to compare
 metrics: different metrics provide very different results, which may be more or less useful depending on the user's needs
 tag behaviors: these differ depending on tag popularity and on how people use the tags


Conclusions

 Ongoing work
 Clustering evaluation metrics to find best α
 Applications (e.g. for tag grouping and visualization*)
 User- and resource-specific projections**
 Future work
 Development of other plugins and front-end
 Play with user-related weights to focus on specific
communities / filter spam

* Mazzola, Eynard, Mazza: "GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation"

** Dattolo, Ferrara, Tasso: "On social semantic relations for recommending tags and resources using folksonomies"


Thank you!

Thanks for your attention!

Questions?


toread top 20 related tags


@readit top 20 related tags


sf top 20 related tags


stream top 20 related tags

