Sheila Kinsella PhD Defense

"Augmenting Social Media Items with Metadata using Related Web Content" - Slides from the public presentation part of my PhD defense. Presented at DERI, NUI Galway, on August 30, 2011.

  1. Augmenting Social Media Items with Metadata using Related Web Content
     Sheila Kinsella
     Digital Enterprise Research Institute, www.deri.ie
     Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  2. Outline
     • Motivation
     • Example Scenario
     • Tag prediction
     • Geolocation
     • Topic classification
       – each of these three covered as: Approach, Evaluation, Summary
     • Combining the approaches
     • Impact
     • Conclusions
  3. Motivation
     • Social media is an important information source
       – e.g., real-time citizen journalism, Q&A sites, niche topics
     • Search and navigation can be challenging
       – Short and informal posts
       – Items are not curated and often lack metadata
       – Users conversing share a common context and therefore omit relevant information, e.g. location
     • Making use of related Web data can help us to infer such context information
       – e.g., hyperlinks, posts with similar content
  4. Example Scenario: Adding metadata to a blog post
     Blog post:
       "Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Heres a clip of the first try."
     Metadata to infer: tags? topic? location?
  5. Example Scenario: Possible clues from content
     Same blog post as slide 4, with the open questions: tags? topic? location?
  6. Example Scenario: Possible clues from content
     Same blog post as slide 4, with the open questions: tags? topic? location?
  7. Example Scenario: Possible clues from content
     Same blog post as slide 4, with the open questions: tags? topic? location?
  8. Example Scenario: Exploiting related Web content
     Tags from anchortext. The blog post from slide 4 is the target of hyperlinks (href) from other pages:
       • "...didn't see the match but here's a summary from John.."
       • "..............This review of the Connacht match shows that they are getting back in form!......"
  9. Example Scenario: Exploiting related Web content
     Location from geotagged social media. Alongside the blog post from slide 4, a geotagged post:
       • JohnSmith (John Smith): "I'm at the Galway Sportsground"
  10. Example Scenario: Exploiting related Web content
      Topic from hyperlinked objects. The blog post from slide 4 links (href) to a YouTube video:
        • Title: Fionn Carr try
        • Category: Sport
        • Tags: rugby, try, carr, connacht
  11. Example Scenario: Overview of approaches
      The blog post from slide 4, shown together with all three kinds of related Web content:
        • TAG PREDICTION: anchortext of pages that link (href) to the post
          – "...didn't see the match but here's a summary from John.."
          – "..............This review of the Connacht match shows that they are getting back in form!......"
        • GEOLOCATION: geotagged social media
          – JohnSmith (John Smith): "I'm at the Galway Sportsground"
        • TOPIC CLASSIFICATION: metadata of the hyperlinked (href) YouTube video
          – Title: Fionn Carr try; Category: Sport; Tags: rugby, try, carr, connacht
  12. Tag Prediction: Approach
      • Aim: Automatic tag generation based on anchortext
      1. Data collection and preprocessing
         – Retrieve document and extract META information
         – Retrieve inlinking documents and extract anchortext
         – Preprocessing (e.g. stemming, stopword removal)
      2. Tag indexing and ranking
         – Generate term vectors from the preprocessed annotations
         – Ranking: tf and tf-idf
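The ranking step on this slide can be illustrated with a small Python sketch. It is not the code used in the thesis: the function name `rank_terms`, the dictionary layout, the logarithmic idf and the alphabetical tie-breaking are all assumptions made for illustration, and the anchortext terms are assumed to be already stemmed and stopword-filtered.

```python
import math
from collections import Counter


def rank_terms(anchor_terms_by_doc, doc_id, k=5, use_idf=True):
    """Rank candidate tags for doc_id from its preprocessed anchortext terms.

    anchor_terms_by_doc maps a target document id to the list of terms found
    in the anchortext of its inlinks (hypothetical layout, for illustration).
    """
    n_docs = len(anchor_terms_by_doc)

    # Document frequency: in how many documents' anchortext a term occurs.
    df = Counter()
    for terms in anchor_terms_by_doc.values():
        df.update(set(terms))

    # Term frequency within the anchortext of the requested document.
    tf = Counter(anchor_terms_by_doc[doc_id])

    scores = {}
    for term, freq in tf.items():
        idf = math.log(n_docs / df[term]) if use_idf else 1.0
        scores[term] = freq * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


corpus = {
    "post-a": ["rugbi", "connacht", "match", "match"],
    "post-b": ["match", "review"],
    "post-c": ["match", "news"],
}
print(rank_terms(corpus, "post-a", k=2, use_idf=False))
print(rank_terms(corpus, "post-a", k=2, use_idf=True))
```

With plain tf the frequent term "match" ranks first, while tf-idf pushes it down because it occurs in the anchortext of every document in this toy corpus.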
  13. Tag Prediction: Evaluation (1)
      • Datasets:
        – Web: WEBSPAM-2007, 12M pages from .uk domain
        – Delicious: 2007 crawl containing tags for 4.5M URLs
        – Overlap between datasets: 192k URLs
      • Goals:
        – Compare overlap of predicted tags and Delicious tags
        – Assess relevance of predicted tags and relevance of Delicious tags
  14. Tag Prediction: Evaluation (2)
      • Automatic Evaluation
        – Relative precision@k (average proportion of predicted tags that are also among Delicious tags)

            k=1    k=2    k=3    k=4    k=5
            0.48   0.45   0.42   0.39   0.37

        – Relative recall@k (average proportion of Delicious tags that can also be inferred from anchortext)

            k=1    k=2    k=3    k=4    k=5
            0.41   0.35   0.32   0.29   0.28
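A per-document version of these two measures might look like the Python sketch below. The slide does not say exactly how k is applied for recall; the sketch assumes that the top-k Delicious tags are checked against all anchortext terms, which is an interpretation and not something stated in the source. All function names are hypothetical.

```python
def relative_precision_at_k(predicted_ranked, delicious_tags, k):
    """Share of the top-k predicted tags that also appear among the Delicious tags."""
    top_k = predicted_ranked[:k]
    if not top_k:
        return 0.0
    reference = set(delicious_tags)
    return sum(tag in reference for tag in top_k) / len(top_k)


def relative_recall_at_k(delicious_ranked, anchortext_terms, k):
    """Share of the top-k Delicious tags that also occur among the anchortext terms.

    Assumption: k truncates the Delicious tag list, not the predicted list.
    """
    top_k = delicious_ranked[:k]
    if not top_k:
        return 0.0
    candidates = set(anchortext_terms)
    return sum(tag in candidates for tag in top_k) / len(top_k)


def average(values):
    """Mean over documents; an empty collection yields 0.0."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


predicted = ["rugby", "match", "clip"]
delicious = ["rugby", "sport", "connacht"]
print(relative_precision_at_k(predicted, delicious, 2))
print(relative_recall_at_k(delicious, predicted, 3))
```

The figures reported on the slide would then correspond to `average(...)` taken over all documents in the overlap between the two datasets.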
  15. Tag Prediction: Evaluation (3)
      • Human Evaluation
        – 80 documents, each assessed by 3 judges
        – Scale: 0 = not relevant; 1 = quite relevant; 2 = very relevant
      • Evaluator agreement
        – In 85% of cases, judges at least almost agree
        – i.e., two agree and the third differs by just one point
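The agreement criterion can be written down compactly. The following Python sketch assumes that "at least almost agree" also covers the case where all three judges give the same score, and that scores lie on the 0 to 2 scale above; the function name is hypothetical.

```python
def at_least_almost_agree(scores):
    """True if three judges agree, or two agree and the third differs by one point.

    On an integer scale this is the same as the highest and lowest score
    differing by at most one point.
    """
    lowest, _, highest = sorted(scores)
    return highest - lowest <= 1


judgements = [(2, 2, 2), (1, 1, 2), (0, 1, 2), (0, 0, 2)]
share = sum(at_least_almost_agree(j) for j in judgements) / len(judgements)
print(share)
```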
  16. Tag Prediction: Evaluation (4)
      • Human Evaluation
        – Precision@k (average proportion of tags judged relevant by evaluators)
        – Relevance threshold: 1

                        k=1    k=2    k=3    k=4    k=5
            Delicious   0.86   0.84   0.82   0.80   0.78
            Predicted   0.78   0.76   0.69   0.67   0.66

      • Not feasible to measure recall
  17. Tag Prediction: Summary
      • Substantial overlap between tags assigned on a social bookmarking site and terms from anchortext
      • Human evaluators rate relevance of terms from anchortext as not much lower than tags
      • This approach can provide useful and novel annotations for untagged social media items, if other users link to them with anchortext
  18. Geolocation: Approach (1)
      • Aim: Location prediction based on models built from geotagged social media
        – Enables detection of implicit location clues such as slang, venues, other terms of local relevance
      1. Reverse Geocoding
         – Filter geotagged tweets from Twitter stream
         – Reverse-geocode each coordinate to corresponding places (Postal code, City, State, Country) using the Yahoo! Geoplanet service
         – Aggregate all of the text from each place together for model building
  19. Geolocation: Approach (2)
      2. Language Modelling
         – Approach from information retrieval: given a query, find the most relevant document in a collection
         – Model each document and query as bag of words
         – For each document, calculate probability that a random sampling would result in the query
         – Based on the intuition that users create queries by guessing words that would occur in the document
         – For our geolocation task: estimates the probability that a random sampling of a location would result in the social media post
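As a rough illustration of query likelihood applied to locations, the Python sketch below builds one unigram model per place and scores a post against each. The slides do not state which smoothing method was used; additive (Laplace-style) smoothing with parameter `alpha` is an assumption chosen here for brevity, as are the whitespace tokenisation and all function names.

```python
import math
from collections import Counter


def build_models(text_by_place):
    """One bag-of-words model per place, from the aggregated text of its tweets."""
    return {place: Counter(text.lower().split()) for place, text in text_by_place.items()}


def log_likelihood(tokens, model, vocab_size, alpha=1.0):
    """Log probability of sampling the tokens from a place model.

    Additive smoothing (an assumption, not from the source) keeps unseen
    words from producing a zero probability.
    """
    total = sum(model.values())
    return sum(
        math.log((model[tok] + alpha) / (total + alpha * vocab_size))
        for tok in tokens
    )


def predict_place(post, models, alpha=1.0):
    """Return the place whose model makes the post most likely."""
    tokens = post.lower().split()
    vocab = set(tokens)
    for model in models.values():
        vocab.update(model)
    return max(
        sorted(models),  # fixed iteration order so ties resolve deterministically
        key=lambda place: log_likelihood(tokens, models[place], len(vocab), alpha),
    )


models = build_models({
    "Galway": "connacht sportsground galway rugby connacht",
    "Dublin": "leinster rds dublin rugby",
})
print(predict_place("great try for connacht at the sportsground", models))
```

The example post never names a city, but the locally relevant terms "connacht" and "sportsground" are enough for the Galway model to score highest, which is the kind of implicit clue the approach is meant to pick up.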
  20. Geolocation: Evaluation (1)
      • Dataset
        – Twitter Firehose stream
        – 7.3 million geotagged tweets posted during Summer 2010
        – Retweets removed, #hashtags and @usernames preserved

            Place type    # Tweets   # Distinct places
            Country       7.3m       222
            State         7.3m       2.3k
            City          6.3m       72.6k
            Postal code   7.2m       104.7k

      • Baseline: Yahoo! Placemaker
        – Identifies and disambiguates placenames in text and returns the spatial entity most likely to encompass them
  21. Geolocation: Evaluation (2)
      • Prediction Methods
        – Trivial Classifier: each tweet is assigned to the most common place in the training set
        – Placemaker (Tweet): each tweet is submitted to Placemaker and the most probable candidate is selected. Allows detection of explicit geographic references in the tweet
        – Language Model: locations are ranked according to their query likelihood and the location whose model ranks highest is selected
        – Placemaker (Location): the location field from the tweet is submitted to Placemaker and the most probable candidate is selected. Allows detection of explicit geographic references in the self-reported location
  22. Geolocation: Evaluation (3)
      • Tweet location prediction accuracy
        – Common location focused services removed

            Method                  Zip     Town    State   Country
            Trivial Classifier      0.005   0.061   0.060   0.434
            Placemaker (Tweet)      0.018   0.060   0.076   0.120
            Language Model          0.052   0.217   0.246   0.514
            Placemaker (Location)   0.017   0.269   0.401   0.518
  23. Geolocation: Summary
      • Language models of geotagged tweets enable the location of non-geotagged items to be predicted
      • The approach gives large improvements compared to parsing for explicit placenames
        – City-level accuracy: 21.7% versus 6%
      • The approach can be used to detect implicit geographical information in social media posts
  24. Topic Classification: Approach
      • Aim: Improve topic classification using structured data from hyperlinks
      1. Identify sources of structured data from hyperlinks
         – Based on domains, e.g., wikipedia.org
      2. Retrieve structured data for these hyperlinks
         – From Linked Data/APIs, e.g., dbpedia.org
      3. Perform text classification
         – Requires set of already categorised posts for training
         – Post content and external metadata as sources of textual features
         – Compare accuracy achieved by different metadata types
      4. Related to IR studies that classify documents based on fielded text from hyperlinked pages, but they consider structural rather than semantic fields
  25. Topic Classification: Evaluation (1)
      • Datasets

                                  Forum           Twitter
            Data source           message board   microblogging site
            Ground truth topics   forums          #hashtags
            # classes (topics)    10              6
            # posts               6,626           2,415

      • External data sources
        – Linked Data
        – Web APIs
  26. Topic Classification: Evaluation (2)
      • Experimental Setup
        – Multinomial Naïve Bayes classifier (WEKA)
        – 10-fold cross-validation
      • Compared classification accuracy for different post representations based on:
        – post content
        – hyperlinked HTML pages
        – hyperlinked object metadata
        – combinations of these
      • Experimented to find optimal ways of combining feature vectors (e.g., weightings)
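The experiments themselves used WEKA; purely to illustrate the idea of combining post content with hyperlinked object metadata, here is a self-contained Python sketch of a multinomial Naïve Bayes classifier over weighted term counts. The `combine` helper, the `metadata_weight` parameter and the add-one smoothing are assumptions for this example and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def combine(content_tokens, metadata_tokens, metadata_weight=1.0):
    """Merge content and metadata terms into one weighted feature vector."""
    features = Counter(content_tokens)
    for tok in metadata_tokens:
        features[tok] += metadata_weight
    return features


class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, feature_vectors, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_vectors, labels):
            self.term_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        n_posts = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            total = sum(self.term_counts[label].values())
            score = math.log(self.class_counts[label] / n_posts)
            for term, weight in feats.items():
                if term in self.vocab:  # terms unseen in training are ignored
                    prob = (self.term_counts[label][term] + 1) / (total + vocab_size)
                    score += weight * math.log(prob)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train = [
    combine(["great", "try"], ["rugby", "sport"], metadata_weight=2.0),
    combine(["new", "album"], ["music", "rock"], metadata_weight=2.0),
]
clf = MultinomialNB().fit(train, ["Sport", "Music"])
print(clf.predict(combine(["great", "clip"], ["rugby"])))
```

Raising `metadata_weight` above 1.0 makes metadata terms count for more than terms from the post text, which is one simple way to realise the "weightings" mentioned on the slide.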
  27. Topic Classification: Evaluation (3)
      • Results (micro-averaged F1)

            Data Source           Forum   Twitter
            Content (no URLs)     0.745   0.722
            Content (with URLs)   0.811   0.759
            HTML                  0.730   0.645
            Metadata              0.835   0.683
            Content + HTML        0.832   0.784
            Content + Metadata    0.899   0.820
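For readers unfamiliar with the measure, micro-averaged F1 pools true positives, false positives and false negatives over all classes before computing precision and recall. The Python sketch below is a generic illustration with a hypothetical function name, not the evaluation code behind the table.

```python
from collections import Counter


def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    if tp_sum == 0:
        return 0.0
    precision = tp_sum / (tp_sum + fp_sum)
    recall = tp_sum / (tp_sum + fn_sum)
    return 2 * precision * recall / (precision + recall)


truth = ["sport", "sport", "music", "film"]
predicted = ["sport", "music", "music", "film"]
print(micro_f1(truth, predicted))
```

When every post receives exactly one predicted label, each error counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with plain accuracy.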
  28. Topic Classification: Evaluation (4)
      • Results: comparing metadata types
        – Wikipedia (Content (no URLs), shared across rows: 0.761)

            Metadata type   Metadata only   Content+M'data
            Category        0.811           0.851
            Description     0.798           0.850
            Title           0.685           0.809

        – YouTube (Content (no URLs), shared across rows: 0.709)

            Metadata type   Metadata only   Content+M'data
            Tag             0.838           0.864
            Title           0.773           0.824
            Description     0.752           0.810
            Category        0.514           0.753
  29. Topic Classification: Summary
      • Topic classification in social media can be improved by making use of structured metadata from hyperlinked objects
      • The most useful metadata types can be found experimentally, but for different objects, the usefulness of metadata types varies
      • The categories assigned by this approach would allow a user to browse social media posts with hyperlinks by topic, even if the text of the post itself is not sufficient for accurate automatic categorisation of the post
  30. Combining the approaches (1)
      Diagram labels: location, topic, tags
  31. Combining the approaches (2)
      Blog post:
        "Last night I watched Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Heres a clip of the first try."
      Metadata to infer: tags? topic? location?
  32. Combining the approaches (3)

      @prefix ex: <http://example.org/> .
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix content: <http://purl.org/rss/1.0/modules/content/> .
      @prefix dc: <http://purl.org/dc/terms/> .
      @prefix sioc: <http://rdfs.org/sioc/ns#> .

      # The post itself, with its original text
      ex:post1 rdf:type sioc:Post .
      ex:post1 content:encoded """Last night I watched Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Heres a [url="http://www.youtube.com/watch?v=[...]"]clip of the first try.[/url]""" .
      ex:post1 sioc:links_to <http://www.youtube.com/watch?v=[...]> .

      # Tags predicted from anchortext
      ex:post1 dc:subject "connacht" .
      ex:post1 dc:subject "match" .
      ex:post1 dc:subject "review" .
      ex:post1 dc:subject "summary" .

      # Predicted location and topic
      ex:post1 dc:spatial <http://sws.geonames.org/2964180/> .
      ex:post1 sioc:topic <http://www.dmoz.org/Sports/Football/Rugby_Union/> .
  33. Combining the approaches (4)
      Use-case 1: Local search
      A blogger is looking for media to enhance a post about the Volvo Ocean Race

      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX sioc: <http://rdfs.org/sioc/ns#>
      PREFIX dc: <http://purl.org/dc/terms/>
      SELECT ?post WHERE {
        ?post rdf:type sioc:Post .
        ?post dc:spatial <http://sws.geonames.org/2964180/> .
        ?post dc:created ?date .
        # Restrict to posts created within the date range of interest
        FILTER (str(?date) > "2009-05-23T00:00:00")
        FILTER (str(?date) < "2009-06-06T23:59:59")
        # Keep posts carrying at least one of the relevant tags
        FILTER EXISTS {
          { ?post dc:subject "volvooceanrace" } UNION
          { ?post dc:subject "vor" } UNION
          { ?post dc:subject "oceanrace" } UNION
          { ?post dc:subject "yacht" }
        }
      }
  34. Combining the approaches (5)
      Use-case 2: Local browsing
      A sports fan wants to follow conversations about sports in their local area

      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX sioc: <http://rdfs.org/sioc/ns#>
      PREFIX dc: <http://purl.org/dc/terms/>
      PREFIX gn: <http://www.geonames.org/ontology#>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?post WHERE {
        ?post rdf:type sioc:Post .
        # Posts located anywhere within the given first-level administrative area
        ?post dc:spatial ?location .
        ?location gn:parentADM1 <http://sws.geonames.org/2963597/> .
        # Posts whose topic falls anywhere below the Sports category
        ?post sioc:topic ?topic .
        ?topic skos:broader+ <http://www.dmoz.org/Sports/> .
      }
  35. Impact
      • 5 conference papers
        – ESWC 2011, ECIR 2011, I-Semantics 2010, IV 2008, ASNA 2008
      • 2 workshop papers
        – WIDM @ CIKM 2008, SMUC @ CIKM 2011
      • 2 book chapters
        – Advances in Computers 76 (Elsevier)
        – Reasoning Web (Springer)
      • Tutorial
        – "Combining the Social and the Semantic Web", ESWC 2011
  36. Summary
      • Proposed approaches for automatically generating metadata for social media posts using related Web content
        – Tags, location and topic
      • Evaluated the accuracy of each approach
      • Illustrated how the approaches can be used in combination in order to semantically enrich social media posts and enable enhanced search and browsing in a social media dataset
