SDOW (ISWC2011)

DIGITAL
Institute for Information and Communication Technologies

Pragmatic metadata matters:
How data about the usage of data affects
semantic user models
Claudia Wagner, Markus Strohmaier, Yulan He

Sunday, October 23, 2011

Example
Semantic Metadata

[Figure: RDF graph linking a sioc:Post, a sioc:UserAccount, and a foaf:Person through the properties sioc:content, sioc:name, sioc:has_creator, rdf:type, and sioc:account_of.]

Example
Pragmatic Metadata



Aim
Can pragmatic metadata support the generation of semantic
metadata, and if so, how?

[Figure: RDF graph of a sioc:Post, a sioc:UserAccount, and a foaf:Person (properties sioc:name, sioc:content, sioc:has_creator, rdf:type, sioc:account_of), with sioc:topic and foaf:interest pointing to unknown values marked "?".]
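As a rough illustration of the aim, the sketch below turns topic annotations into sioc:topic and foaf:interest statements, represented as plain tuples. The URIs, the topic labels, the probabilities, and the 0.5 threshold are all hypothetical and are not taken from the slides.

```python
def annotate(post_uri, person_uri, post_topics, user_topics, threshold=0.5):
    """Emit (subject, predicate, object) tuples for sufficiently likely topics.

    post_topics and user_topics map a topic label to a probability.
    The threshold is an assumption made for this sketch.
    """
    triples = []
    for topic, prob in post_topics.items():
        if prob >= threshold:
            triples.append((post_uri, "sioc:topic", topic))
    for topic, prob in user_topics.items():
        if prob >= threshold:
            triples.append((person_uri, "foaf:interest", topic))
    return triples


# Hypothetical input values.
triples = annotate(
    "ex:post1",
    "ex:person1",
    post_topics={"T1": 0.9, "T2": 0.1},
    user_topics={"T1": 0.6, "T3": 0.4},
)
for triple in triples:
    print(triple)
```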

Experimental Setup
§ Methodology
§ Topic Modeling Algorithms to learn topics (probability
distributions of words) and annotate users and posts
with topics
§ Incorporated different types of pragmatic metadata
into the Topic Models
§ Compared different models via their predictive
performance

§ Dataset
§ Boards.ie
§ Forums, Posts and Users
§ Users' authoring and replying behavior
§ Training Dataset: First and last week of February 2006
§ Test Dataset: 3 future posts of each user
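The train/test split described above could be sketched as follows. The exact week boundaries (1 to 7 and 22 to 28 February 2006), the reading of "future" as "after the last training day", and the record layout are assumptions made for this example; the slides do not specify them.

```python
from datetime import date

# Assumed training windows: first and last week of February 2006.
TRAIN_WINDOWS = [
    (date(2006, 2, 1), date(2006, 2, 7)),
    (date(2006, 2, 22), date(2006, 2, 28)),
]


def split_posts(posts, n_test=3):
    """posts: list of (user, day, text). Returns (train, test).

    train: posts inside a training window.
    test: the first n_test posts of each user after the last training day.
    """
    train = [p for p in posts if any(a <= p[1] <= b for a, b in TRAIN_WINDOWS)]
    cutoff = max(end for _, end in TRAIN_WINDOWS)
    test, taken = [], {}
    later = sorted((p for p in posts if p[1] > cutoff), key=lambda p: p[1])
    for post in later:
        user = post[0]
        if taken.get(user, 0) < n_test:
            test.append(post)
            taken[user] = taken.get(user, 0) + 1
    return train, test


# Hypothetical toy records.
posts = [
    ("u1", date(2006, 2, 3), "a"),
    ("u1", date(2006, 2, 15), "b"),
    ("u1", date(2006, 3, 1), "c"),
    ("u1", date(2006, 3, 2), "d"),
    ("u1", date(2006, 3, 3), "e"),
    ("u1", date(2006, 3, 4), "f"),
    ("u2", date(2006, 2, 25), "g"),
    ("u2", date(2006, 3, 5), "h"),
]
train, test = split_posts(posts)
```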


Evaluation

§ Compare different models by testing their predictive
performance on held out posts.

Log likelihood of a word of a user's
future post given the model we learned

Sum over all words in a user's future post

§ Assumption: a better user topic model is less
perplexed by future posts authored by a user and needs
fewer training samples.
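The evaluation idea can be sketched as a standard perplexity computation: sum the log likelihood of every word in the held-out posts and exponentiate the negative per-word average. The function names and the uniform toy model are hypothetical; the slides do not give the exact formula used.

```python
import math


def log_likelihood(post_words, word_prob):
    """Sum of log p(word | model) over all words of one held-out post."""
    return sum(math.log(word_prob(w)) for w in post_words)


def perplexity(held_out_posts, word_prob):
    """exp(-total log likelihood / total word count); lower is better."""
    total_ll = sum(log_likelihood(post, word_prob) for post in held_out_posts)
    total_words = sum(len(post) for post in held_out_posts)
    return math.exp(-total_ll / total_words)


def uniform(word):
    """Hypothetical toy model: every word of a 4-word vocabulary is equally likely."""
    return 0.25


future_posts = [["mac", "pc"], ["computer", "imac", "mac"]]
# For a uniform model the perplexity is approximately the vocabulary size (4).
print(perplexity(future_posts, uniform))
```

A model that assigns higher probability to the words a user actually writes later yields a lower perplexity, which is the sense in which models are compared here.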


Methodology
LDA
§ How to learn topics and annotate users with topics?

§ Latent Dirichlet Allocation (LDA) (Blei et al., 2003)

[Figure: a text annotated with topics T1, T2 and T3. Example topic T1: mac: 0.3, iMac: 0.13, PC: 0.03, computer: 0.04, ...]
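To make "topics as probability distributions of words" concrete, the sketch below represents topics as word-probability dictionaries and a user as a mixture of topics. Topic T1 reuses the word probabilities shown on the slide; topic T2 and the user's topic mixture are hypothetical values.

```python
# T1 follows the slide; T2 and the user mixture are made up for illustration.
topics = {
    "T1": {"mac": 0.3, "iMac": 0.13, "PC": 0.03, "computer": 0.04},
    "T2": {"football": 0.2, "match": 0.1, "computer": 0.01},
}
user_topics = {"T1": 0.8, "T2": 0.2}


def word_prob(word, user_topics, topics):
    """p(word | user) = sum over topics of p(topic | user) * p(word | topic)."""
    return sum(p_t * topics[t].get(word, 0.0) for t, p_t in user_topics.items())


def top_topic(user_topics):
    """The topic a user would be annotated with if only one label is kept."""
    return max(user_topics, key=user_topics.get)


print(top_topic(user_topics))
print(word_prob("computer", user_topics, topics))
```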


Methodology
DMR
§ How to incorporate metadata into topic models?

§ Dirichlet Multinomial Regression (DMR) Topic Models
(Mimno et al, 2008)

§ Observe a feature vector x_d per document d
§ Draw a "fresh" alpha for each document, which depends
on the observed features x_d and the feature distribution per
topic λ_t:

$\alpha_{dt} = \exp(x_d^{\top} \lambda_t)$
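The document-specific prior can be computed directly from the formula alpha_dt = exp(x_d · λ_t). The sketch below assumes three features (a default feature plus two author indicators) and two topics; the feature layout and the λ values are hypothetical.

```python
import math


def dmr_alpha(x_d, lambdas):
    """Return [alpha_d1, ..., alpha_dT] with alpha_dt = exp(x_d . lambda_t).

    x_d: feature vector of document d.
    lambdas: one weight vector lambda_t per topic, same length as x_d.
    """
    return [math.exp(sum(x * l for x, l in zip(x_d, lam))) for lam in lambdas]


# Hypothetical values: default feature on, first author indicator on.
x_d = [1.0, 1.0, 0.0]
lambdas = [
    [0.0, 0.5, -0.5],   # topic 1
    [0.0, -1.0, 1.0],   # topic 2
]
alpha_d = dmr_alpha(x_d, lambdas)
print(alpha_d)
```

Because each alpha_dt is an exponential, the prior stays positive for any feature weights, and a document with all-zero features falls back to a symmetric prior of 1 for every topic.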


Methodology

ID   Alg   Doc    Metadata
M1   LDA   Post   -
M2   LDA   User   -
M3   DMR   Post   author
M4   DMR   User   author
M5   DMR   Post   reply-user
M6   DMR   User   reply-user
M7   DMR   Post   related-user
M8   DMR   User   related-user

[Figure: timeline of past posts (Post 1 to Post 6) and a future post (Post 7); User 1 and User 2 are connected to posts by "authored" and "replies to" edges.]
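The "Doc" column distinguishes two training schemes: one document per post, or one document per user. A minimal sketch of both, covering only the author and reply-user metadata, might look like this. The record layout and the sample posts are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (post_id, author, users who replied, text)
posts = [
    ("p1", "user1", ["user2"], "new imac arrived"),
    ("p2", "user1", [], "mac or pc"),
    ("p3", "user2", ["user1"], "football tonight"),
]


def post_scheme(posts):
    """One training document per post, with the post's own metadata."""
    return [
        {"doc_id": pid, "words": text.split(),
         "author": author, "reply_users": sorted(repliers)}
        for pid, author, repliers, text in posts
    ]


def user_scheme(posts):
    """One training document per user: all of the user's posts merged."""
    words, repliers = defaultdict(list), defaultdict(set)
    for _pid, author, reps, text in posts:
        words[author].extend(text.split())
        repliers[author].update(reps)
    return [
        {"doc_id": user, "words": words[user],
         "author": user, "reply_users": sorted(repliers[user])}
        for user in words
    ]


post_docs = post_scheme(posts)
user_docs = user_scheme(posts)
```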


§ Different user activities performed on content

[Figure: plot comparing the models. Annotations: post training scheme (M3, M5 and M7); baseline LDA (M1 and M2); models which take user replies into account (M6 and M8).]




Results
§ The topics of users who reply to a user are also likely for
this user
§ Therefore, if 2 users get replies from the same users,
then they are more likely to talk about the same topics

§ Topic models which incorporate pragmatic metadata per
user can indeed improve models
§ Topic models which incorporate pragmatic metadata per
post often over-fit data
§ Model Assumptions are too strict!

§ Idea: Incorporate behavioral user similarities
§ Intuition: users who are similar are more likely to talk
about the same topics
§ How to measure behavioral similarity?
§ forum usage
§ communication behavior


Methodology
ID    Alg   Doc    Metadata
M9    DMR   Post   top 10 forums
M10   DMR   User   top 10 forums
M11   DMR   Post   top 10 communication partner
M12   DMR   User   top 10 communication partner

[Figure: timeline of past posts (Post 1 to Post 6) and a future post (Post 7) authored by User 1 and User 2, with two top-10 forum lists: f1, f2, f3, f4, f5, f6, f7, f8, f9, f10 and f15, f20, f31, f12, f5, f6, f17, f18, f19, f10.]
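One way to turn "top 10 forums" or "top 10 communication partner" into a feature vector is a binary indicator per forum or partner. The sketch below shows this under assumed data; the function names, the toy activity lists, and the use of k=2 instead of 10 are choices made for illustration only.

```python
from collections import Counter


def top_k(items, k=10):
    """The k most frequent items (forums posted to, or users talked to)."""
    return [item for item, _count in Counter(items).most_common(k)]


def indicator_features(top_items, vocabulary):
    """Binary vector: 1.0 where a vocabulary entry is among the top items."""
    top = set(top_items)
    return [1.0 if entry in top else 0.0 for entry in vocabulary]


# Hypothetical forum activity: one entry per post.
user1_forums = ["f1", "f2", "f1", "f3"]
user2_forums = ["f2", "f2", "f5"]

top1 = top_k(user1_forums, k=2)
top2 = top_k(user2_forums, k=2)
vocabulary = sorted(set(top1) | set(top2))

x_user1 = indicator_features(top1, vocabulary)
x_user2 = indicator_features(top2, vocabulary)
shared = sum(a * b for a, b in zip(x_user1, x_user2))
print(vocabulary, x_user1, x_user2, shared)
```

Users with overlapping indicator vectors share features, which is how behavioral similarity can influence the topics assigned to them.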


[Figure: plot comparing the models. Annotations: post training scheme (M3, M9 and M11); baseline LDA (M1 and M2); user training scheme (M4, M10 and M12); model M12 incorporates user similarities based on their communication behavior.]


Results
§ Topic models seem to benefit from taking behavioral
user similarities into account

§ Users who behave similarly (regarding their forum usage
and communication behavior) are likely to talk about the
same topics

§ Common communication partners seem to be more
predictive of common topics than common forums



Conclusions
§ Pragmatic metadata may help to learn better semantic
user models

§ But pragmatic metadata observed on a post level often
over-fits data

§ Pragmatic Metadata on a user level seems to improve
the predictive performance of topic models
§ If posts of 2 users are "used" in a similar way, then
they are more likely to talk about the same topics
§ If 2 users behave similarly (tend to post to the same forums
or tend to talk to the same users), they are more likely to
talk about the same topics.
§ Common communication partners seem to be more
predictive of common topics than common forums



Limitations and Future Work
§ Perplexity and semantic interpretability of topics do not
necessarily correlate (Chang et al., 2009)
§ Separate evaluation of semantic coherence of topics

§ Analyze different types of behavior- and usage-related
metadata and explore to what extent they may reveal
information about the semantics of data
§ behavior on social streams such as Twitter
§ tagging behavior
§ navigation behavior



References
§ Blei, D.M., Ng, A. and Jordan, M. Latent Dirichlet Allocation. Journal of Machine
Learning Research 3 (2003), pp. 993-1022

§ Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C. and Blei, D. Reading Tea
Leaves: How Humans Interpret Topic Models. In Neural Information Processing
Systems, NIPS (2009)

§ Mimno, D.M. and McCallum, A. Topic Models Conditioned on Arbitrary Features
with Dirichlet-multinomial Regression. In Proceedings of UAI (2008), pp. 411-418
