"Augmenting Social Media Items with Metadata using Related Web Content" - Slides from the public presentation part of my PhD defense. Presented at DERI, NUI Galway, on August 30, 2011.
Sheila Kinsella PhD Defense
1. Digital Enterprise Research Institute www.deri.ie
Augmenting Social Media Items with
Metadata using Related Web Content
Sheila Kinsella
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
2. Outline
Motivation
Example Scenario
Tag prediction: Approach, Evaluation, Summary
Geolocation: Approach, Evaluation, Summary
Topic classification: Approach, Evaluation, Summary
Combining the approaches
Impact
Conclusions
3. Motivation
Social media is an important information source
e.g., real time citizen journalism, Q&A sites, niche topics
Search and navigation can be challenging
Short and informal posts
Items are not curated and often lack metadata
Users conversing share a common context and therefore omit
relevant information, e.g. location
Making use of related Web data can help us to infer
such context information
e.g., hyperlinks, posts with similar content
4. Example Scenario:
Adding metadata to a blog post
Blog post:
Last night I saw Connacht play at The Sportsground. The match started
well for Connacht with a great try but after half time the opposition
closed the gap. Finally we managed to hold out for the win. It was a
great game from both sides. Here's a clip of the first try.
Metadata to add: tags? topic? location?
5. Example Scenario:
Possible clues from content
Blog post:
Last night I saw Connacht play at The Sportsground. The match started
well for Connacht with a great try but after half time the opposition
closed the gap. Finally we managed to hold out for the win. It was a
great game from both sides. Here's a clip of the first try.
Metadata to infer from clues in the content: tags? topic? location?
8. Example Scenario:
Exploiting related Web content
Blog post:
Last night I saw Connacht play at The Sportsground. The match started
well for Connacht with a great try but after half time the opposition
closed the gap. Finally we managed to hold out for the win. It was a
great game from both sides. Here's a clip of the first try.
Inlinking page (href to the post):
...didn’t see the match but here’s a summary from John..
Inlinking page (href to the post):
..............This review of the Connacht match shows that they are
getting back in form!......
tags from anchortext
9. Example Scenario:
Exploiting related Web content
Blog post:
Last night I saw Connacht play at The Sportsground. The match started
well for Connacht with a great try but after half time the opposition
closed the gap. Finally we managed to hold out for the win. It was a
great game from both sides. Here's a clip of the first try.
Geotagged post by JohnSmith (John Smith):
I’m at the Galway Sportsground
location from geotagged social media
10. Example Scenario:
Exploiting related Web content
Blog post:
Last night I saw Connacht play at The Sportsground. The match started
well for Connacht with a great try but after half time the opposition
closed the gap. Finally we managed to hold out for the win. It was a
great game from both sides. Here's a clip of the first try.
Hyperlinked object (href): YouTube video
Title: Fionn Carr try
Category: Sport
Tags: rugby, try, carr, connacht
topic from hyperlinked objects
11. Example Scenario:
Overview of approaches
Blog post:
Last night I saw Connacht play at The Sportsground. The match started
well for Connacht with a great try but after half time the opposition
closed the gap. Finally we managed to hold out for the win. It was a
great game from both sides. Here's a clip of the first try.
TAG PREDICTION: anchortext of inlinking pages (href to the post)
...didn’t see the match but here’s a summary from John..
..............This review of the Connacht match shows that they are
getting back in form!......
GEOLOCATION: geotagged post by JohnSmith (John Smith)
I’m at the Galway Sportsground
TOPIC CLASSIFICATION: hyperlinked YouTube video (href)
Title: Fionn Carr try
Category: Sport
Tags: rugby, try, carr, connacht
12. Tag Prediction: Approach
Aim: Automatic tag generation based on anchortext
1. Data collection and preprocessing
Retrieve document and extract META information
Retrieve inlinking documents and extract anchortext
Preprocessing (e.g. stemming, stopword removal)
2. Tag indexing and ranking
Generate term vectors from the preprocessed annotations
Ranking: tf and tf-idf
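The indexing and ranking step can be illustrated with a minimal sketch. It assumes the anchortext of each document's inlinks has already been collected; the stopword list, the function names, and the idf formula (log of collection size over document frequency) are illustrative assumptions, not the exact configuration used in the thesis, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not the one used in the thesis).
STOPWORDS = {"a", "the", "of", "from", "here", "this", "that", "they", "are", "in"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def rank_tags(anchortexts_by_doc, doc_id, k=5, use_idf=True):
    """Rank candidate tags for doc_id by tf or tf-idf over its inlink anchortext."""
    tf = Counter()
    for text in anchortexts_by_doc[doc_id]:
        tf.update(tokenize(text))

    # Document frequency: number of target documents whose anchortext has the term.
    df = Counter()
    for texts in anchortexts_by_doc.values():
        terms = set()
        for text in texts:
            terms.update(tokenize(text))
        df.update(terms)
    n_docs = len(anchortexts_by_doc)

    def score(term):
        if not use_idf:
            return tf[term]
        return tf[term] * math.log(n_docs / df[term])

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]

anchors = {
    "post1": ["summary", "review of the Connacht match"],
    "post2": ["match report", "match highlights"],
}
top_tags = rank_tags(anchors, "post1", k=3)
```

With tf-idf, a term such as "match" that appears in the anchortext of every document is pushed down the ranking, while plain tf treats all four candidate terms equally.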
13. Tag Prediction: Evaluation (1)
Datasets:
Web: WEBSPAM-2007, 12M pages from .uk domain
Delicious: 2007 Crawl containing tags for 4.5M URLs
Overlap between datasets: 192k URLs
Goals:
Compare overlap of predicted tags and delicious tags
Assess relevance of predicted tags and relevance of delicious
tags
14. Tag Prediction: Evaluation (2)
Automatic Evaluation
Relative precision@k (Average proportion of predicted tags
that are also among delicious tags)
k=1 k=2 k=3 k=4 k=5
0.48 0.45 0.42 0.39 0.37
Relative recall@k (Average proportion of delicious tags that can
also be inferred from anchortext)
k=1 k=2 k=3 k=4 k=5
0.41 0.35 0.32 0.29 0.28
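The two automatic measures can be sketched as below. The slides do not spell out how k is applied to recall; because the reported recall falls as k grows, this sketch assumes that recall@k considers the top k Delicious tags and checks whether each appears among the anchortext terms. That reading, and the function names, are assumptions.

```python
def relative_precision_at_k(predicted, reference, k):
    """Share of the top-k predicted tags that also occur among the reference tags."""
    top = predicted[:k]
    if not top:
        return 0.0
    reference_set = set(reference)
    return sum(tag in reference_set for tag in top) / len(top)

def relative_recall_at_k(predicted, reference, k):
    """Share of the top-k reference tags that also occur among the predicted terms.

    Assumed interpretation: k limits the reference (Delicious) tags.
    """
    top = reference[:k]
    if not top:
        return 0.0
    predicted_set = set(predicted)
    return sum(tag in predicted_set for tag in top) / len(top)

def mean(values):
    """Average over documents, as in the 'average proportion' on the slide."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

predicted = ["connacht", "review", "summary", "match"]  # ranked anchortext terms
delicious = ["rugby", "connacht", "match"]              # ranked reference tags
p_at_2 = relative_precision_at_k(predicted, delicious, 2)
r_at_2 = relative_recall_at_k(predicted, delicious, 2)
```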
15. Tag Prediction: Evaluation (3)
Human Evaluation
80 documents, each assessed by 3 judges
– 0: not relevant; 1: quite relevant; 2: very relevant
Evaluator agreement
– In 85% of cases, judges at least almost agree
– i.e., two agree and the third differs by just one point
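The agreement criterion can be made concrete with a small check. This sketch assumes that full agreement among the three judges also counts as "at least almost agree"; the function names are illustrative.

```python
def at_least_almost_agree(scores):
    """True if all three judges agree, or two agree and the third is one point away."""
    low, mid, high = sorted(scores)
    if low == high:
        return True  # all three scores identical
    return (low == mid and high - mid == 1) or (mid == high and mid - low == 1)

def agreement_rate(all_scores):
    """Fraction of judged items on which the judges at least almost agree."""
    return sum(at_least_almost_agree(s) for s in all_scores) / len(all_scores)

judgements = [(2, 2, 2), (1, 1, 2), (0, 0, 2), (0, 1, 2)]
rate = agreement_rate(judgements)
```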
16. Tag Prediction: Evaluation (4)
Human Evaluation
Precision@k (Average proportion of tags judged relevant by
evaluators)
– Relevance threshold: 1
k=1 k=2 k=3 k=4 k=5
Delicious 0.86 0.84 0.82 0.80 0.78
predicted 0.78 0.76 0.69 0.67 0.66
Not feasible to measure recall
17. Tag Prediction: Summary
Substantial overlap between tags assigned on a social
bookmarking site and terms from anchortext
Human evaluators rate relevance of terms from
anchortext as not much lower than tags
This approach can provide useful and novel
annotations for untagged social media items, if other
users link to them with anchortext
18. Geolocation: Approach (1)
Aim: Location prediction based on models built from
geotagged social media
Enables detection of implicit location clues such as slang,
venues, other terms of local relevance
1. Reverse Geocoding
Filter geotagged tweets from Twitter stream
Reverse-geocode each coordinate to corresponding places
– Postal code, City, State, Country
– Yahoo! Geoplanet service
Aggregate all of the text from each place together for model
building
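The aggregation step can be sketched as follows. The `reverse_geocode` argument is a hypothetical stand-in for a lookup service; it does not reproduce the Yahoo! GeoPlanet API, and the toy lookup below exists only to make the example runnable.

```python
from collections import defaultdict

def build_place_documents(tweets, reverse_geocode):
    """Concatenate tweet text per place, one 'document' per (place type, name).

    tweets: iterable of (text, latitude, longitude).
    reverse_geocode: hypothetical callable returning a list of
    (place_type, place_name) pairs for a coordinate.
    """
    texts_by_place = defaultdict(list)
    for text, lat, lon in tweets:
        for place in reverse_geocode(lat, lon):
            texts_by_place[place].append(text)
    return {place: " ".join(texts) for place, texts in texts_by_place.items()}

def toy_reverse_geocode(lat, lon):
    """Toy lookup for illustration only (not a real geocoder)."""
    city = "Galway" if lat > 53 else "Cork"
    return [("city", city), ("country", "Ireland")]

tweets = [
    ("at the sportsground", 53.27, -9.05),
    ("rugby tonight", 53.28, -9.04),
    ("hello from cork", 51.90, -8.47),
]
place_docs = build_place_documents(tweets, toy_reverse_geocode)
```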
19. Geolocation: Approach (2)
2. Language Modelling
Approach from information retrieval – given a query, find the
most relevant document in a collection
Model each document and query as bag of words
For each document, calculate probability that a random sampling
would result in the query
Based on the intuition that users create queries by guessing
words that would occur in the document
For our geolocation task: estimates the probability that a random
sampling of a location would result in the social media post
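A minimal query-likelihood sketch of this idea follows. The slides do not say which smoothing was used; Jelinek-Mercer smoothing against a background model built from all places, the weight `lam=0.5`, and whitespace tokenisation are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(place_docs):
    """Build a unigram count model per place plus a background model."""
    models = {place: Counter(text.lower().split()) for place, text in place_docs.items()}
    background = Counter()
    for counts in models.values():
        background.update(counts)
    return models, background

def log_likelihood(post, counts, background, lam=0.5):
    """log P(post | place) under a smoothed unigram model (assumed smoothing)."""
    doc_len = sum(counts.values())
    bg_len = sum(background.values())
    score = 0.0
    for word in post.lower().split():
        p_bg = background[word] / bg_len
        if p_bg == 0:
            continue  # word unseen in every place: carries no evidence, skip it
        p_doc = counts[word] / doc_len if doc_len else 0.0
        score += math.log(lam * p_doc + (1 - lam) * p_bg)
    return score

def predict(post, models, background):
    """Return the place whose model gives the post the highest likelihood."""
    return max(models, key=lambda place: log_likelihood(post, models[place], background))

models, background = train({
    "Galway": "connacht sportsground rugby galway",
    "Dublin": "leinster aviva rugby dublin",
})
predicted_place = predict("Last night I saw Connacht play at The Sportsground", models, background)
```

The post never names Galway explicitly; the prediction rests on locally relevant terms such as the team and venue, which is the kind of implicit clue the approach is meant to pick up.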
20. Geolocation: Evaluation (1)
Dataset
Twitter Firehose stream
7.3 million geotagged tweets posted during Summer 2010
Retweets removed, #hashtags and @usernames preserved
Place type # Tweets # Distinct places
Country 7.3m 222
State 7.3m 2.3k
City 6.3m 72.6k
Postal code 7.2m 104.7k
Baseline: Yahoo! Placemaker
identifies and disambiguates placenames in text and returns the
spatial entity most likely to encompass them
21. Geolocation: Evaluation (2)
Prediction Methods
Trivial Classifier
– Each tweet assigned to the most common place in training set
Placemaker (Tweet)
– Each tweet is submitted to Placemaker and the most probable
candidate is selected. Allows detection of explicit geographic
references in the tweet
Language Model
– Locations are ranked according to their query likelihood and the
location whose model ranks highest is selected
Placemaker (Location)
– The location field from the tweet is submitted to Placemaker and the
most probable candidate is selected. Allows detection of explicit
geographic references in the self-reported location
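The trivial classifier baseline and the accuracy measure can be sketched in a few lines. The function names and the toy data are illustrative assumptions.

```python
from collections import Counter

def train_trivial_classifier(training_places):
    """Return a predictor that always answers with the most common training place."""
    most_common_place = Counter(training_places).most_common(1)[0][0]
    return lambda tweet_text: most_common_place

def accuracy(predictions, truth):
    """Fraction of tweets whose predicted place equals the geotagged place."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

classify = train_trivial_classifier(["US", "US", "US", "IE", "GB"])
test_tweets = ["at the sportsground", "good morning", "coffee time"]
test_places = ["IE", "US", "US"]
baseline_accuracy = accuracy([classify(t) for t in test_tweets], test_places)
```

Such a baseline does well only where one place dominates the data, which is consistent with its comparatively high country-level score and near-zero postal-code score.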
22. Geolocation: Evaluation (3)
Tweet location prediction accuracy
Common location focused services removed
Method                  Zip     Town    State   Country
Trivial Classifier      0.005   0.061   0.060   0.434
Placemaker (Tweet)      0.018   0.060   0.076   0.120
Language Model          0.052   0.217   0.246   0.514
Placemaker (Location)   0.017   0.269   0.401   0.518
23. Geolocation: Summary
Language models of geotagged tweets enable the
location of non-geotagged items to be predicted
The approach gives large improvements compared to
parsing for explicit placenames
City-level accuracy: 21.7% (Language Model) versus 6.0% (Placemaker on tweet text)
The approach can be used to detect implicit
geographical information in social media posts
24. Topic Classification: Approach
Aim: Improve topic classification using structured data from
hyperlinks
1. Identify sources of structured data from hyperlinks
Based on domains, e.g., wikipedia.org
2. Retrieve structured data for these hyperlinks
From Linked Data/APIs, e.g., dbpedia.org
3. Perform text classification
Requires set of already categorised posts for training
Post content and external metadata as sources of textual features
Compare accuracy achieved by different metadata types
4. Related to IR studies that classify documents based on fielded text
from hyperlinked pages, but they consider structural rather than
semantic fields
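Step 1 can be sketched as a domain lookup. The mapping below is a hypothetical illustration built from the examples on the slides (wikipedia.org, dbpedia.org, YouTube); the source labels are made-up names, and the full list of domains used in the thesis is not given here.

```python
from urllib.parse import urlparse

# Hypothetical mapping from hyperlink domain to a structured data source label.
STRUCTURED_SOURCES = {
    "wikipedia.org": "dbpedia",
    "youtube.com": "youtube-api",
}

def structured_source(url):
    """Return the structured data source for a hyperlink, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    for domain, source in STRUCTURED_SOURCES.items():
        # Match the domain itself or any subdomain (e.g. en.wikipedia.org).
        if host == domain or host.endswith("." + domain):
            return source
    return None

source = structured_source("http://en.wikipedia.org/wiki/Connacht_Rugby")
```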
25. Topic Classification: Evaluation (1)
Datasets
Forum Twitter
Data source message board microblogging site
Ground truth topics forums #hashtags
# classes (topics) 10 6
# posts 6,626 2,415
External data sources
Linked Data Web APIs
26. Topic Classification: Evaluation (2)
Experimental Setup
Multinomial Naïve Bayes classifier (WEKA)
10-fold cross-validation
Compared classification accuracy for different post
representations based on:
– post content
– hyperlinked HTML pages
– hyperlinked object metadata
– combinations of these
Experimented to find optimal ways of combining feature vectors
(e.g., weightings)
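The idea of combining weighted feature vectors before classification can be sketched with a small multinomial Naïve Bayes written from scratch. This is not the WEKA implementation used in the experiments; the metadata weight of 2.0, the add-one smoothing, and the toy training posts are assumptions, and cross-validation is omitted.

```python
import math
from collections import Counter, defaultdict

def combine(content, metadata, metadata_weight=2.0):
    """Merge post content and external metadata into one weighted bag of words."""
    features = Counter(content.lower().split())
    for term in metadata.lower().split():
        features[term] += metadata_weight  # assumed weighting scheme
    return features

class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for features, label in zip(docs, labels):
            self.term_counts[label].update(features)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, features):
        n_docs = sum(self.class_counts.values())

        def log_posterior(label):
            total = sum(self.term_counts[label].values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / n_docs)
            for term, weight in features.items():
                if term in self.vocab:  # ignore terms never seen in training
                    score += weight * math.log((self.term_counts[label][term] + 1) / total)
            return score

        return max(self.class_counts, key=log_posterior)

train_docs = [
    combine("great try last night", "rugby connacht sport"),
    combine("new album out", "music rock"),
]
classifier = MultinomialNB().fit(train_docs, ["sport", "music"])
predicted_topic = classifier.predict(combine("here is a clip", "rugby try"))
```

In the toy example the post text alone ("here is a clip") contains no term seen in training, so the prediction depends entirely on the metadata of the hyperlinked object.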
27. Topic Classification: Evaluation (3)
Results
Data Source Forum Twitter
Content (no URLs) 0.745 0.722
Content (with URLs) 0.811 0.759
HTML 0.730 0.645
Metadata 0.835 0.683
Content + HTML 0.832 0.784
Content + Metadata 0.899 0.820
(micro-averaged F1)
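For reference, micro-averaged F1 pools true positives, false positives and false negatives over all classes before computing precision and recall. The sketch below assumes each post has exactly one true and one predicted topic; the labels are toy data.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1: pool TP/FP/FN across classes, then compute F1."""
    tp = fp = fn = 0
    for cls in set(true_labels) | set(predicted_labels):
        for truth, pred in zip(true_labels, predicted_labels):
            if pred == cls and truth == cls:
                tp += 1
            elif pred == cls:
                fp += 1
            elif truth == cls:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = ["sport", "sport", "music", "film"]
predicted = ["sport", "music", "music", "film"]
score = micro_f1(truth, predicted)
```

Under the single-label assumption every misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy.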
28. Topic Classification: Evaluation (4)
Results – comparing metadata types
Wikipedia (Content (no URLs): 0.761)
Metadata type   Metadata only   Content+M’data
Category        0.811           0.851
Description     0.798           0.850
Title           0.685           0.809
YouTube (Content (no URLs): 0.709)
Metadata type   Metadata only   Content+M’data
Tag             0.838           0.864
Title           0.773           0.824
Description     0.752           0.810
Category        0.514           0.753
29. Topic Classification: Summary
Topic classification in social media can be improved by
making use of structured metadata from hyperlinked
objects
The most useful metadata types can be found
experimentally, but for different objects, the usefulness
of metadata types varies
The categories assigned by this approach would allow a
user to browse social media posts with hyperlinks by
topic, even if the text of the post itself is not
sufficient for accurate automatic categorisation of
the post.
30. Combining the approaches (1)
location
topic
tags
31. Combining the approaches (2)
Blog post:
Last night I watched Connacht play at The Sportsground. The match
started well for Connacht with a great try but after half time the
opposition closed the gap. Finally we managed to hold out for the win.
It was a great game from both sides. Here's a clip of the first try.
Metadata to add: tags? topic? location?
32. Combining the approaches (3)
@prefix ex: <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix content: <http://purl.org/rss/1.0/modules/content/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
ex:post1 rdf:type sioc:Post .
ex:post1 content:encoded """Last night I watched Connacht play at The
Sportsground. The match started well for Connacht with a great try but
after half time the opposition closed the gap. Finally we managed to hold
out for the win. It was a great game from both sides. Here's a
[url="http://www.youtube.com/watch?v=[...]"]clip of the first try.[/url]""" .
ex:post1 sioc:links_to <http://www.youtube.com/watch?v=[...]> .
ex:post1 dc:subject "connacht" .
ex:post1 dc:subject "match" .
ex:post1 dc:subject "review" .
ex:post1 dc:subject "summary" .
ex:post1 dc:spatial <http://sws.geonames.org/2964180/> .
ex:post1 sioc:topic <http://www.dmoz.org/Sports/Football/Rugby_Union/> .
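As a rough illustration of how the outputs of the three approaches might be turned into triples like these, the sketch below formats predicted tags, a location URI and a topic URI as Turtle statements. The function name and signature are hypothetical, the prefix declarations are assumed to be written separately, and only minimal string escaping is applied.

```python
def to_turtle(post_id, tags, location_uri, topic_uri):
    """Format predicted metadata for one post as Turtle statements.

    Assumes the ex:, rdf:, dc: and sioc: prefixes are declared elsewhere.
    """
    def literal(value):
        # Minimal escaping of backslashes and double quotes for a Turtle literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [f"ex:{post_id} rdf:type sioc:Post ."]
    lines += [f"ex:{post_id} dc:subject {literal(tag)} ." for tag in tags]
    lines.append(f"ex:{post_id} dc:spatial <{location_uri}> .")
    lines.append(f"ex:{post_id} sioc:topic <{topic_uri}> .")
    return "\n".join(lines)

turtle = to_turtle(
    "post1",
    ["connacht", "match"],
    "http://sws.geonames.org/2964180/",
    "http://www.dmoz.org/Sports/Football/Rugby_Union/",
)
```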
33. Combining the approaches (4)
Use-case 1: Local search
A blogger is looking for media to enhance a post about the Volvo Ocean
Race
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?post WHERE {
  ?post rdf:type sioc:Post .
  ?post dc:spatial <http://sws.geonames.org/2964180/> .
  ?post dc:created ?date .
  FILTER (str(?date) > "2009-05-23T00:00:00") .
  FILTER (str(?date) < "2009-06-06T23:59:59") .
  FILTER EXISTS {
    { ?post dc:subject "volvooceanrace" } UNION
    { ?post dc:subject "vor" } UNION
    { ?post dc:subject "oceanrace" } UNION
    { ?post dc:subject "yacht" }
  }
}
34. Combining the approaches (5)
Use-case 2: local browsing
A sports fan wants to follow conversations about sports in their
local area
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?post WHERE {
?post rdf:type sioc:Post .
?post dc:spatial ?location .
?location gn:parentADM1 <http://sws.geonames.org/2963597/> .
?post sioc:topic ?topic .
?topic skos:broader+ <http://www.dmoz.org/Sports/> .
}
35. Impact
5 conference papers
ESWC 2011, ECIR 2011, I-Semantics 2010, IV 2008, ASNA 2008
2 workshop papers
WIDM @ CIKM 2008, SMUC @ CIKM 2011
2 book chapters
Advances in Computers 76 (Elsevier)
Reasoning Web (Springer)
Tutorial
"Combining the Social and the Semantic Web", ESWC 2011
36. Summary
Proposed approaches for automatically generating
metadata for social media posts using related Web
content
Tags, location and topic
Evaluated the accuracy of each approach
Illustrated how the approaches can be used in
combination in order to semantically enrich social
media posts and enable enhanced search and
browsing in a social media dataset