4.
How can we measure semantic stability?
How can we compare the semantic stabilization process in
different systems?
What impacts semantic stability?
5. Measuring Semantic Stability
State of the Art
• Relative tag proportions per resource become stable with
an increasing number of tag assignments [Golder and
Huberman, 2006]
• KL-divergence of rank-ordered tag frequency distribution per
resource at different time points converges towards zero
[Halpin et al., 2007]
• Power-law distributions [Cattuto et al., 2006] – the scale-invariance
property ensures that, regardless of how large the system grows,
the shape of the distribution stays the same
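The Golder and Huberman criterion can be checked directly by tracking the relative proportion of each tag in growing prefixes of the tag stream. A minimal sketch (the stream and checkpoint sizes below are illustrative, not from the cited study):

```python
from collections import Counter

def relative_proportions(stream, tag, checkpoints):
    """Relative frequency of `tag` within each prefix of the stream."""
    return [Counter(stream[:n])[tag] / n for n in checkpoints]

# Toy stream: the proportion of "web" settles as assignments accumulate.
stream = ["web", "tech", "web", "news", "web", "tech", "web", "web"]
print(relative_proportions(stream, "web", [2, 4, 8]))  # [0.5, 0.5, 0.625]
```

Stability in this sense means these proportion curves flatten out as the number of tag assignments grows.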
6. Some Limitations
• Don’t allow comparing the semantic
stabilization process of different systems
• Prune tag distributions to top-k tags
– Cannot handle non-conjoint lists of tags
• A random tagging process also produces a
“stable” description
– A tag assignment at timepoint t+1 has less impact
on the tag distribution of a resource than one at
timepoint t
7. Example
KL-Divergence
• KL-divergence converges
towards zero.
• But random baseline also
converges towards zero if
we assume a constant
tagging rate.
• We do not always know
the top k tags!
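The random-baseline caveat is easy to reproduce. The sketch below (vocabulary size, stream length, and top-k are arbitrary choices, not values from the paper) compares rank-ordered top-k distributions at growing prefixes of a uniformly random tag stream; the KL-divergence still shrinks as the stream grows:

```python
import math
import random
from collections import Counter

def rank_freq(tags, k=25):
    """Rank-ordered relative frequencies of the top-k tags."""
    counts = [c for _, c in Counter(tags).most_common(k)]
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """KL-divergence between two rank-ordered distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Random tagging: uniform draws from a fixed vocabulary of 50 tags.
rng = random.Random(1)
stream = [rng.randrange(50) for _ in range(4000)]
early = kl(rank_freq(stream[:200]), rank_freq(stream[:400]))
late = kl(rank_freq(stream[:2000]), rank_freq(stream[:4000]))
# `late` comes out smaller than `early`: the random baseline also "converges".
```

Because the comparison is by rank position rather than tag identity, any process with a roughly constant tagging rate flattens into a near-uniform rank profile, which is exactly the limitation the slide points out.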
[Figure: KL-divergence over the number of consecutive tag assignments]
8. Example
Relative Tag Proportion
[Figure: relative tag proportions of the top user list names (e.g. socialmedia, tech, technology, web) over consecutive tag assignments, shown for the full stream and for the first 10,000 assignments]
9. Intuition and Approach
• Stable: some descriptors are more important than others, and the ranking of the (top) descriptors remains stable over time.
• Less stable: all descriptors are equally important, and the ranking of the (top) descriptors changes over time.
[Figure: P(T) distributions at tn and tn+m for a stable and a less stable resource]
11. Requirements
• Rank agreement of the descriptors of a
resource over time
• Weighted rank agreement
• Non-conjoint lists of descriptors
• Random Baseline
12. Rank Biased Overlap (RBO)
[Webber et al., 2010]
• RBO falls in the range [0, 1], where 0 means
disjoint, and 1 means identical
• p lies between 0 and 1 and determines how steep
the decline in weights is
• The smaller p, the more top-weighted the metric
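The measure can be written down in a few lines. This is the finite-prefix truncation (Webber et al. additionally extrapolate the unseen tail overlap, which is omitted here for brevity):

```python
def rbo(s, t, p=0.9):
    """Rank-biased overlap of two rankings, truncated at the longer list.

    The set overlap at each depth d is weighted by p**(d-1); the (1-p)
    factor normalizes the geometric series so scores fall in [0, 1].
    """
    depth = max(len(s), len(t))
    seen_s, seen_t, score = set(), set(), 0.0
    for d in range(1, depth + 1):
        if d <= len(s):
            seen_s.add(s[d - 1])
        if d <= len(t):
            seen_t.add(t[d - 1])
        score += (p ** (d - 1)) * len(seen_s & seen_t) / d
    return (1 - p) * score
```

Disjoint lists score 0; identical lists approach 1 as they grow (the truncation leaves the residual weight p**depth unassigned); and reversing a list lowers the score, because early depths carry the most weight.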
17. Tie correction for
Rank Biased Overlap
• RBO does not penalize ties
• We want to penalize ties since they show that
users have not agreed on a ranking
• Sum only over those depths which occur in at
least one of the two rankings
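One possible reading of this correction in code (a sketch, not necessarily the authors' exact formulation): represent each ranking as a list of tie groups and sum the weighted overlap only at depths that close a group in at least one ranking, keeping the original (1 − p) normalization so that skipped depths cost score:

```python
def group_depths(groups):
    """Depths at which a tie group ends; only these ranks are well-defined."""
    depths, d = set(), 0
    for g in groups:
        d += len(g)
        depths.add(d)
    return depths

def rbo_tie_corrected(groups_s, groups_t, p=0.9):
    """RBO over rankings given as lists of tie groups (lists of items)."""
    s = [x for g in groups_s for x in g]
    t = [x for g in groups_t for x in g]
    valid = group_depths(groups_s) | group_depths(groups_t)
    score = 0.0
    for d in range(1, max(len(s), len(t)) + 1):
        if d not in valid:
            continue  # depth lies inside a tie group in both rankings
        score += (p ** (d - 1)) * len(set(s[:d]) & set(t[:d])) / d
    return (1 - p) * score
```

With singleton groups this reduces to plain truncated RBO; a ranking whose head is one big tie loses the contribution of the skipped depths, which is the intended penalty.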
18. Same concordant pairs: (A,D), (B,D), and (C,D)
[Figure: tag-frequency bar charts for two resources at tn and tn+m.
No ties: rankings A B C D and C B A D give RBOorig = 0.2 and RBOmod = 0.2.
Ties: RBOorig = 0.34 but RBOmod = 0.17, i.e. the modified measure penalizes the ties.]
19. Semantic Stabilization on a
Resource Level
[Figure: RBO over the number of consecutive tag assignments for individual Twitter users and a random baseline]
• Tag distributions of Twitter
users become semantically
stable between 1k and 2k
tag assignments
• The RBO values of random
tagging distributions
increase more slowly and are
significantly lower
20. Semantic Stabilization
on a System Level
• How can we compare the semantic stabilization
process in different systems?
• We call a resource description semantically stable
after tn+m tag assignments if the RBO value
between its tag distributions at points tn and tn+m is
equal to or greater than k.
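This definition turns directly into a system-level summary: given one RBO value per resource (computed between its distributions at tn and tn+m), report the share of resources at or above the threshold k. A sketch with made-up values:

```python
def fraction_stable(rbo_values, k):
    """Share of resources whose RBO between t_n and t_n+m is >= k."""
    return sum(v >= k for v in rbo_values) / len(rbo_values)

# Hypothetical per-resource RBO values after some number of assignments.
scores = [0.9, 0.75, 0.61, 0.55, 0.3]
print(fraction_stable(scores, 0.61))  # 0.6
```

Sweeping k and the number of tag assignments t yields the contour plots used to compare systems.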
21. Semantic Stabilization
on a System Level
After 1250 tag assignments, 90% of all
resources have a stability above 0.61
27. What causes semantic stability?
• Simulations based on the epistemic tagging model
[Dellschaft and Staab, 2008].
• Use the parameter I as imitation rate and produce tag
distributions for I = 0, 0.1, …, 1
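A toy stand-in for the imitation mechanism (not the full Dellschaft–Staab model, which additionally grounds background knowledge in text): with probability I copy a previously assigned tag, otherwise draw a fresh tag from a fixed background vocabulary:

```python
import random

def simulate_stream(n, imitation, vocab_size=1000, seed=42):
    """Generate a tag stream where each assignment imitates an earlier
    one with probability `imitation`, else draws uniformly from the
    background vocabulary (a crude proxy for background knowledge)."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        if stream and rng.random() < imitation:
            stream.append(rng.choice(stream))  # preferential reuse
        else:
            stream.append(rng.randrange(vocab_size))
    return stream

# Higher imitation concentrates the distribution on fewer tags.
low = len(set(simulate_stream(500, imitation=0.0)))
high = len(set(simulate_stream(500, imitation=0.9)))
```

Choosing an earlier assignment uniformly at random makes reuse proportional to a tag's current frequency, which is the preferential-attachment flavor of imitation the simulations rely on.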
30. What causes stability?
If imitation and background knowledge (BK) are
combined and imitation is dominant, higher levels of
semantic stability are reached faster
31. What causes stability?
• Combination of shared background knowledge and imitation
behaviour (where imitation is more important) leads to the fastest
and highest stabilization.
• Natural language systems show stabilization similar to that of social
tagging systems where no imitation is supported
32. Conclusions & Implications
• Attempt to formalize semantic stability in social streams
• Novel approach to measure and compare the semantic
stabilization process in different social streams
Why is that useful?
• Identify social streams (e.g. tag stream of URL or word stream
of hashtags) which are semantically stable
– Extract shared and agreed-upon semantic knowledge from
social streams
• Select systems that provide semantically stable streams
33. References
• D. Bollen and H. Halpin. The role of tag suggestions in folksonomies. In Proceedings of the 20th ACM
conference on Hypertext and hypermedia, HT ’09, pages 359–360, New York, NY, USA, 2009. ACM.
• C. Cattuto. Semiotic dynamics on social tagging communities. The European Physical Journal C –
Particles and Fields, 46(2, Supplement):33–37, August 2006.
• A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Rev.,
51(4):661–703, Nov. 2009.
• K. Dellschaft and S. Staab. An epistemic dynamic model for tagging systems. In HT ’08: Proceedings of
the nineteenth ACM conference on Hypertext and hypermedia, pages 71–80, New York, NY, USA, 2008.
ACM.
• S. Golder and B. A. Huberman. Usage patterns of collaborative tagging systems. Journal of Information
Science, 32(2):198–208, April 2006.
• H. Halpin, V. Robu, and H. Shepherd. The complex dynamics of collaborative tagging. In Proceedings of
the 16th international conference on World Wide Web, WWW ’07, pages 211–220, New York, NY, USA,
2007. ACM.
• A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Bibsonomy: A social bookmark and publication
sharing system. In Proceedings of the Conceptual Structures Tool Interoperability Workshop at the 14th
International Conference on Conceptual Structures, pages 87-102, 2006.
• C. T. Kello, G. D. A. Brown, R. Ferrer-i-Cancho, J. G. Holden, K. Linkenkaer-Hansen, T. Rhodes, and G.
C. Van Orden. Scaling laws in cognitive sciences. Trends in Cognitive Sciences, 14(5):223–232, May
2010.
• W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Trans. Inf. Syst.,
28(4):20:1–20:38, Nov. 2010.
35. Limitations and Future Work
• RBO measures ranking but ignores the differences
in the frequencies
• Decay function to weight tag counts
– old tag assignments are less important than new ones
• Number and diversity of users who tag a resource
might impact the semantic stabilization process
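The proposed decay idea can be sketched as an exponentially decayed tag count, where old assignments contribute less than new ones (the half-life below is a free parameter of the sketch, not something fixed in the talk):

```python
from collections import defaultdict

def decayed_counts(stream, half_life=1000.0):
    """Tag counts where an assignment's weight halves every
    `half_life` later assignments; the newest one has weight 1."""
    counts = defaultdict(float)
    n = len(stream)
    for i, tag in enumerate(stream):
        age = n - 1 - i  # how many assignments arrived after this one
        counts[tag] += 0.5 ** (age / half_life)
    return dict(counts)

# With a short half-life, recent tags outweigh equally frequent old ones.
weights = decayed_counts(["old", "old", "new", "new"], half_life=1.0)
```

Ranking by these decayed counts instead of raw counts would let the stability measure react to drift in how a resource is described.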
36. Alternatives to RBO
• Unweighted and conjoint measures
– Kendall tau, Spearman rho
• Weighted and conjoint measures
– Weighted Kendall tau
• Unweighted and non-conjoint measures
– Intersection metric
• Weighted and non-conjoint measures
– Cumulative overlap at increasing depths
38. Categories of Semantically
Unstable Resources
• Entity to which a resource refers changes
• Resource (i.e. website) changes
• Entity/Topic to which a resource refers is controversial
– website refers to controversial entity/topic on which
different viewpoints exist
• External conditions which impact viewpoints on
entity/topic change
– Website remains stable but viewpoint of taggers on the
entity or topic related with the site change
40. Relative Tag Proportion
[Golder and Huberman, 2006]
[Figure: relative tag proportions of the top user list names over consecutive tag assignments]
41. KL-Divergence
[Halpin et al., 2007]
• KL divergence between the rank-ordered frequency
distribution of the top 25 tags at different time points
[Figure: rank-ordered tag frequency distributions at tn and tn+m for a stable and a less stable resource]
43. Power Law
[Cattuto, 2006]
• Is the rank-ordered frequency distribution a power law
distribution?
• Is the frequency y of a tag inversely proportional to its
rank r?
44. Power Law
[Cattuto, 2006]
• Is it really a power law?
– Very likely yes, according to the maximum
likelihood estimator and the Kolmogorov–
Smirnov statistic [Clauset et al., 2009]
– Estimate alpha and xmin over some
reasonable range
– Compare power law fit to the fit of the
exponential function, the lognormal
function and the stretched exponential
(Weibull) function. Use the log-likelihood
ratios to indicate which fit is better.
– We do not find significant differences
between the power law fit and the
lognormal fit
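The Clauset-style check can be sketched with the continuous maximum-likelihood estimator for the exponent (the full recipe also scans xmin and runs a Kolmogorov–Smirnov goodness-of-fit test; only the alpha step is shown here):

```python
import math
import random

def fit_alpha(data, xmin=1.0):
    """Continuous MLE for the power-law exponent alpha (Clauset et al.)."""
    xs = [x for x in data if x >= xmin]
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)

# Sanity check: sample from a known power law via inverse-transform sampling.
rng = random.Random(0)
alpha_true = 2.5
sample = [(1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0)) for _ in range(20000)]
alpha_hat = fit_alpha(sample)  # recovers a value close to 2.5
```

Comparing fits against the lognormal and stretched-exponential alternatives, as the slide describes, is then done via log-likelihood ratios over the same tail.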
There are many social media apps which allow users to tag and talk… From these distributions… user activities… Folksonomies are collaboratively generated, fuzzy categorization schemas. Ontologies are formally defined classification schemas. Usually they are generated in advance by a group of experts who conceptualize a domain of interest. Ideally this conceptualization represents the agreed-upon and shared semantic view of the domain. If the domain of interest is huge and constantly changing, the manual construction of ontologies fails. Therefore there was a lot of research in the Semantic Web community… Since ontologies represent shared and agreed-upon semantics, we need to ask ourselves to what extent the descriptions of resources which emerge from social media streams represent shared and stable semantic descriptions of resources.
Better example: resources which are very subjective. Ask 100 people and everyone describes them differently.
Some resources change over time (and by resource I mean everything that can have a URI), and therefore their semantic description changes too. People would have described him as a bodybuilder in the 1970s, as an actor in the 1990s, and as a politician nowadays.
Other resources don't change, but people's viewpoints on them may change, or people may have contradicting viewpoints on the same resource. Therefore, again, the descriptions may stabilize but also destabilize over time.
Other resources don't change, and people's viewpoints on them don't change either. Therefore, as people keep tagging them, the description will converge over time to a stable and shared semantic description. In our work we are interested in measuring semantic stability, comparing the semantic stabilization process in different social media systems, and exploring the factors that might impact semantic stability.
Previous researchers also recognized the need to explore the semantic stability of the resource descriptions which emerge when a large number of users tag a resource, since this stability is a prerequisite for learning ontologies from folksonomies.
Three methods have been proposed to measure semantic stability. However, as we show in our paper, those methods have certain limitations; to overcome them, we present a novel approach.
Existing methods have certain limitations: they do not allow… they operate on a resource level.
Their measure converges by definition towards 0 if the number of tag assignments remains stable over time. Only if the number of tags assigned to a resource varies a lot over the time bins can convergence be interpreted as a sign of semantic stability. A single tag assignment in month j always has more impact on the shape of the distribution than a single tag assignment in month j+1.
If we look at the relative proportions of the top k tags (which Scott Golder and Bernardo Huberman did), we see that…
To summarize: Kello et al. provide a good critical reflection on the informativeness of scaling laws. For example, researchers have shown that random sequences of characters also exhibit Zipf's law. So there is an ongoing discussion about scaling laws, since idiosyncratic ways of producing power laws exist. Further, the question of what produces a power law remains open. Additive summation of components as well as systems dominated by multiplicative interactions are known to produce heavy tails.
Given these limitations, we thought we should come up with an alternative approach for measuring semantic stability in social streams. The intuition behind our approach is the following: for one given resource we observe, after tn and tn+m tag assignments, a ranked list of tags which reflects by how many users a tag was assigned to the resource. We consider a resource description semantically stable if… Unstable if all descriptors are equally important or unimportant; this is a sign of disagreement.
Our intuition of semantic stability incorporates two aspects: implicit consensus and stability. Stability means tolerance to perturbations over time, and implicit consensus means that users agree on the relative importance of tags: some tags are picked much more frequently than others.
From this intuition we can infer some requirements for a new measure.
To operationalize our intuition of semantic stability and meet these requirements, we propose a modified version of RBO. It measures the agreement between two ranked lists of items and is based on the cumulative set overlap. It can handle non-conjoint rankings. The set overlap at each rank is weighted by a geometric sequence, providing both top-weightedness and convergence.
Consider this resource, for which we observe the following two tag distributions after tn and tn+m tag assignments. If we want to assess the stability of the resource description, we first compare the overlap at depth 1.
And so on; this is the cumulative set overlap. In the plain cumulative set overlap, the weight of the overlap at depth d depends on the total number of depths D: the element at rank 1 has weight D/D, the element at rank 2 has weight (D-1)/D, and so on. In addition we use p, which defines how fast the weights decline.
In addition, the parameter p defines a convergent series of weights (i.e., a series of weights whose sum is bounded). RBO biases the proportional overlap at each depth by this convergent series of weights. Therefore p ensures that the infinite tail of tags does not dominate the finite head, and this is important since we have heavy-tailed distributions.
The smaller p, the more top-weighted the metric.
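Since RBO's weight at depth d is (1 − p)·p^(d−1), the mass carried by the first d depths is 1 − p^d, which makes the top-weightedness of small p concrete:

```python
def head_weight(p, d):
    """Fraction of the total RBO weight carried by the first d depths."""
    return 1.0 - p ** d

# With p = 0.5 the top 10 depths carry ~99.9% of the total weight;
# with p = 0.9 they carry ~65%.
```

So p trades off how far down the two rankings the comparison effectively looks.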
RBO does not penalize ties.
So we have two resources R1 and R2. For both resources we observe their tag distributions after tn and tn+m tag assignments. Looking at R1 we see concordant pairs: A has a higher rank than D after tn and also after tn+m tag assignments, and the same is true for (B,D) and (C,D). The description of R1 contains ties; the description of R2 does not. But we find the same concordant pairs between the two tag distributions of R2.
Here it makes a difference how we rank the ties. If we say C=1, B=1, A=1 and D=4, we produce the same result as the original measure. If we say C=1/3, B=1/3, A=1/3 and D=4, we produce 0.24, which does not really penalize ties either. Alternative: we could sum only over those depths which occur in the second ranking. Then we would penalize the emergence of ties over time, but not the existence of ties.
So now we have a measure that operationalizes our intuition of semantic stability, and we can use it to explore the stabilization process of individual resources. This figure shows the stabilization of the descriptions of Twitter users. On Twitter, people can tag… One can see that between 1k and 2k tag assignments, high or medium levels of stability are reached, depending on which resource we look at. We can also see that a random baseline process does not really stabilize. If we only looked at the shape of the distribution, a random baseline process would also appear to stabilize, since it would produce a flat distribution which would stay flat over time.
Now we have seen how to use our approach on a resource level; however, it is still unclear… To address this problem we introduce a flexible definition of semantic stability which allows us to compare the semantic stability of different resource streams stemming from different social tagging systems.
This definition allows us to explore the semantic stabilization process per system by looking at the proportion of resources that have stabilized according to our parameters k and t. This figure shows the percentage of resources (in this case heavily tagged Twitter users) stabilized at time t with stability threshold k. For example, point P indicates that after 1250 tag assignments, 90% of resources have an RBO value of 0.61 or higher. The contour lines illustrate the curves along which the function has constant values; the depicted values represent the percentage of stabilization f.
We used this approach to compare the semantic stabilization process in different social media systems. First let's look at tag streams on Twitter; here, tag streams are user list name streams.
Let's compare the stabilization process on Twitter with the stabilization process on Delicious. We can see that resource descriptions on Delicious stabilize faster and reach significantly higher levels of semantic stability.
Next we looked at the semantic stability of book descriptions on LibraryThing.
We also added a random baseline, and we can see that it does not stabilize. This shows that our approach is able to differentiate between real semantic stabilization and the stability which we observe for random tagging. If we only looked at the shape of the distribution, a random tagging process would also appear to stabilize, since its relatively flat list of tags would stay relatively flat. It's important that…
It is actually surprising that tag streams of Twitter users stabilize that much. We wanted to know if this is because people TAG other people, or if we would observe the same stabilization if people just TALKED about other people. We created a dataset of tweets in which a random set of users is mentioned, i.e., a stream of tweets where people talk about users, and used the words in these tweets as descriptors of the person. One can see a stabilization process similar to when people tag other users. This suggests that a medium level of stability can also be explained by the properties of natural language.
This leads me to our final question…
The epistemic tagging model is a generative model which includes both background knowledge (BK) and the influence of previously assigned tags. Since BK is encoded in natural language, we cannot distinguish between natural language and BK at this stage.
Klaas and Steffen showed that a mixture of BK and imitation is best for reproducing the shape of the tag frequency distribution. However, they focus on reproducing the shape of the rank-ordered frequency distribution, while we explore the stabilization process over time. Further, previous research considered the sharp drop between ranks 7 and 10 a typical characteristic of tagging streams which distinguishes them from word-frequency distributions. However, Bollen and Halpin's work suggests that this might only be caused by the user interface, which suggests up to 10 tags; if no tags are suggested, there is no sharp drop.
First we consider a tagging model where people rely ONLY on their BK. This model reflects the properties of natural language, since we use Wikipedia as BK.
We then add a bit of imitation: now people imitate others 30% of the time and use their BK 70% of the time. We do not see differences.
Next we use a 70% imitation rate and 30% BK. We see faster and higher stabilization.
Finally, we use 100% imitation and observe that in this case no stabilization happens, since people fail to introduce new tags if they don't use their BK at all. Overall, our empirical results as well as our simulation results suggest that…
To sum up: where do we go from here? I have presented a simple method for measuring semantic stability in social streams which is quite flexible, can easily be adapted to other count/frequency functions, and can be used on a resource and on a system level. I have shown that existing methods have certain limitations and that the notion of semantic stability requires both concepts: stability and implicit consensus.
So why should we care about semantic stability?
First, because it helps us learn something about the nature of resources on the web. Second, it helps to identify streams which are in a stable phase. Finally, it helps to identify applications with a community that produces stable descriptions.
Empirical results as well as simulation results show that the stabilization process benefits from combining …
Some interesting avenues for future work
The problem with the cumulative overlap is that, if we assume a long and potentially infinite tail, the tail will dominate the head. RBO biases the proportional overlap at each depth by a convergent series of weights (i.e., a series of weights whose sum is bounded).
We used this method on different social resource streams. Here you can see a plot for one Twitter user. I use this Twitter dataset here since it was the starting point for this project, because we thought the dataset was interesting. One can see that the lines indeed become straight for many sample users. Since I did this work during my internship at HP, I had the chance to discuss the plots with Bernardo, and he said that it looks stable, but less stable than Delicious. So how can we quantify that?
A power law is a functional relationship between two quantities, where one quantity varies as a power of the other. The scale-invariance property of power laws makes them interesting, since it suggests that no matter how much the system grows, the shape of the distribution remains the same.
The probability of measuring a particular value of some quantity varies (inversely) as a power of that value.
The probability of observing a tag with frequency y varies as a power of its rank. There are only a few very frequent tags but many less frequent ones; therefore the probability of observing a high-frequency tag is low.
The cumulative distribution function (CDF) (also called the rank-frequency distribution) describes the probability that a random variable X will be found at a value less than or equal to x.
The complementary cumulative distribution function (CCDF) asks how often the random variable is above a particular level.
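For plotting power-law tails, the empirical CCDF can be computed directly; a simple quadratic-time sketch over distinct values:

```python
def ccdf(data):
    """Empirical CCDF: pairs (x, P(X >= x)) for each distinct value x."""
    n = len(data)
    return [(x, sum(v >= x for v in data) / n) for x in sorted(set(data))]

points = ccdf([1, 2, 2, 5])
# [(1, 1.0), (2, 0.75), (5, 0.25)]
```

On log-log axes these points fall on a straight line when the data follow a power law, which is the visual check the cumulative plots discussed here provide.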
Cumulative distributions are sometimes also called rank/frequency plots. Cumulative distributions with a power-law form are sometimes said to follow Zipf's law or a Pareto distribution, after two early researchers.
“Zipf’s law” and “Pareto distribution” are effectively synonymous with “power-law distribution”.
Zipf’s law and the Pareto distribution differ from one another in the way the cumulative distribution is plotted—Zipf made his plots with x on the horizontal axis and P(x) on the vertical one; Pareto did it the other way around. This causes much confusion in the literature, but the data depicted in the plots are of course identical.
Empirical power-law distributions hold only approximately or over a limited range.