Mining Social Media with Linked Open Data,
Entity Recognition and Event Extraction
Leon Derczynski
Kalina Bontcheva

Presented at the 4th DEOS workshop, http://diadem.cs.ox.ac.uk/deos13/

Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges due to its short, noisy, context-dependent, and dynamic nature.

This talk will first discuss how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE also includes entity disambiguation (covering products as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.

The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).

Transcript of "Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction"

  1. Mining Social Media with Linked Open Data, Entity Recognition and Event Extraction
     Leon Derczynski, Kalina Bontcheva
     Third Workshop on Data Extraction and Object Search, Oxford, 7 July 2013
  2. Social Media = Big Data
     - Gartner "3V" definition:
       1. Volume
       2. Velocity
       3. Variety
     - High volume & velocity of messages:
       - Twitter has ~20 000 000 users per month
       - They write ~500 000 000 messages per day
     - Massive variety:
       - Stock markets; Earthquakes; Social arrangements; … Bieber
  3. What resources do we have now?
     - Large, content-rich, connected, digital streams of human discourse
     - We transfer knowledge via communication
     - Sampling communication gives a sample of human knowledge
       - "You've only done that which you can communicate"
     - The metadata (time – place – imagery) gives a richer resource:
       → A sampling of human behaviour
  4. Entity annotation components
     - Named entity recognition
     - Linking entities: dbpedia.org/resource/.....
       - Michael_Jackson
       - Michael_Jackson_(writer)
  5. Named Entity Recognition
     - Goal is to find entities we might like to link
     - General accuracy on newswire: 89% F1
     - General accuracy on microblogs: 41% F1
     - L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. "Microblog-Genre Noise and Impact on Semantic Annotation Accuracy." 24th ACM Conference on Hypertext and Social Media. 2013
     - Microblog: Gotta dress up for london fashion week and party in style!!!
     - Newswire: London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
  6. NER difficulties
     - Rule-based systems get the bulk of entities (newswire 77% F1)
     - ML-based systems do well at the remainder (newswire 89% F1)
     - Small proportion of difficult entities
     - Many complex issues
     - Using improved pipeline:
       - ML struggles, even with in-genre data: 49% F1
       - Rules cut through microblog noise: 80% F1
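
The F1 figures quoted on these slides combine precision and recall into a single score. As a minimal sketch of how such a score is typically computed for entity annotation, assuming exact-match scoring over (start, end, type) tuples (the slides do not state the matching criterion, and the annotations below are invented for illustration):

```python
def f1_score(gold, predicted):
    """Return (precision, recall, F1) over exact-match entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start_token, end_token, entity_type)
gold = [(0, 2, "LOC"), (5, 6, "PER"), (9, 11, "ORG")]
predicted = [(0, 2, "LOC"), (5, 6, "ORG"), (9, 11, "ORG"), (13, 14, "PER")]

precision, recall, f1 = f1_score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here the mistyped span `(5, 6, "ORG")` counts against both precision and recall, which is one reason type confusions such as the LOC/PROD example on a later slide are costly.
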
  7. Word-level linking performance
     - Dataset: Ritter NER + DBpedia URIs
       - Detect mentions of entity in tweets
       - Crowdsourced annotations
       - Expert gold standard
       - Discard after disagreement or ambiguity
     - We disambiguate mentions to DBpedia / Wikipedia (easy to map)
     - General performance: F1 81%
  8. Word-level linking issues
     - Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter!
     - Actual: Branching out from Lincoln park after dark(PROD) ... Hello "Russian Navy(PROD)", it's like the same thing but with glitter!
     - Clue in unusual collocations + ?
  9. LODIE: LOD-based Information Extraction
     - Uses DBpedia as reference knowledge graph
     - Why DBpedia?
       - Regularly updated (from Wikipedia)
       - Good source for named entities
       - A hierarchy of concepts: a capital is also a city, but not vice versa
       - Relations between concepts: Paris locatedIn France; ParisHilton bornIn NewYorkCity
     - Demo: http://demos.gate.ac.uk/trendminer/obie/
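
To make the "hierarchy of concepts" and "relations between concepts" points concrete, here is a minimal sketch of a toy knowledge graph in plain Python. The class names, the `SUBCLASS_OF` table, and the helper functions are assumptions for illustration only; they do not reproduce DBpedia's actual ontology or LODIE's data structures.

```python
# Hypothetical mini-ontology: child class -> parent class.
SUBCLASS_OF = {"Capital": "City", "City": "Place", "Country": "Place"}

# Relations as (subject, predicate, object) triples, taken from the slide.
TRIPLES = {
    ("Paris", "locatedIn", "France"),
    ("ParisHilton", "bornIn", "NewYorkCity"),
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def objects(subject, predicate):
    """All objects linked to subject by predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


print(is_a("Capital", "City"))   # a capital is also a city
print(is_a("City", "Capital"))   # but not vice versa
print(objects("Paris", "locatedIn"))
```
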
  10. LODIE: LOD-based Information Extraction
      - We increase recall by:
        - Deriving abbreviations from link anchor texts in Wikipedia, e.g. 'She was born in <a href="New_York_(city)">NYC</a>'
        - Rank boosting terms using redirect pages
        - Matching NE candidates using wildcard queries (e.g. Burton upon Trent and Burton-on-Trent)
      - This makes disambiguation harder (precision)
        - Use naive string, latent semantic, and contextual similarity metrics + URI commonness to disambiguate
        - This is what achieved our good results!
      - Demo: http://demos.gate.ac.uk/trendminer/obie/
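
The disambiguation step on this slide combines several similarity signals with URI commonness. The sketch below shows one way such a combination could look. It is not LODIE's implementation: the candidate table, the context words, the commonness priors, and the weights are all hypothetical, `difflib` stands in for the "naive string" metric, a token-overlap (Jaccard) score stands in for contextual similarity, and latent semantic similarity is omitted.

```python
import re
from difflib import SequenceMatcher

# Hypothetical candidates: URI -> (label, context words, commonness prior).
CANDIDATES = {
    "dbpedia.org/resource/Michael_Jackson":
        ("Michael Jackson", {"singer", "pop", "thriller", "album"}, 0.9),
    "dbpedia.org/resource/Michael_Jackson_(writer)":
        ("Michael Jackson", {"writer", "beer", "whisky", "journalist"}, 0.1),
}


def normalise(text):
    """Lower-case and replace punctuation runs with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def string_similarity(a, b):
    """Naive string similarity on normalised surface forms."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def context_similarity(tokens, context):
    """Jaccard overlap between message tokens and candidate context words."""
    union = tokens | context
    return len(tokens & context) / len(union) if union else 0.0


def rank(mention, message, weights=(1.0, 3.0, 0.5)):
    """Return candidate URIs, best first. The weights are illustrative."""
    w_string, w_context, w_common = weights
    tokens = set(normalise(message).split())
    scored = []
    for uri, (label, context, commonness) in CANDIDATES.items():
        score = (w_string * string_similarity(mention, label)
                 + w_context * context_similarity(tokens, context)
                 + w_common * commonness)
        scored.append((score, uri))
    return [uri for _, uri in sorted(scored, reverse=True)]


print(rank("michael jackson", "reading michael jackson on whisky and beer")[0])
print(rank("michael jackson", "michael jackson")[0])
```

With no useful context, the commonness prior decides; with matching context words, the less common candidate can win. The same `normalise` step also brings variants such as "Burton upon Trent" and "Burton-on-Trent" close together under `string_similarity`.
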
  11. Social media contains events
      - How are events differently described in social media and news?
      - Conventional docs (e.g. newswire) have contextual info
        - Central event in distinct document segment (e.g. headline)
        - Location
        - Actors / participants
        - Causes
        - Outcomes
        - Similar prior events
      - This kind of description not found in social media
        - No editing guidelines
        - Often limited message length
      - Instead, event facets are represented sparsely
        - Only 1-2 facets per message about the event
  12. Event extraction
      - Social media streams are punctuated with descriptions of events
      - … Accompanied by event facets
        - "Obama is visiting Russia"
        - "The US president has not visited Putin before"
      - Many viewpoints on the same temporal entity (like triples)
      - How can we extract these?
      - We use the TimeML definitions of events in text:
        - Minimal lexicalisation (i.e. annotate one word)
        - Event classes: we focus on ACTIONs and OCCURRENCEs
  13. Event extraction
      - How can we extract event mentions?
      - Conventional approaches are hybrid:
        - Statistical learning
        - Syntactic structures
      - Existing TimeML resources
        - TimeBank corpus (newswire)
        - Evita event extraction tool
      - Adapting to social media text
        - Negatively impacted by problems with NER
        - Short sentence structure
        → Use shallow linguistic techniques and fuzzy matches
      - Evita: F1 80.1
      - TIPSem: F1 81.4 (on well-formed text)
      - USFD Arcomem: F1 81.1 (noise-resilient)
  14. LOD for event reassembly
      - What is needed to reassemble events from social media?
        - Identify mentions of the same event
        - Collect facets and integrate them
      - LOD gives unique identifiers for facet values
      - Many possible lexicalisations for the same event (run, control)
      - Identify co-referring mentions through:
        - Shared actors
        - Consistent facets (i.e. non-conflicting)
        - Lexical event similarity (e.g. WordNet)
      - This helps cluster mentions of the same event
      - Agglomerate facets
      - Final product: event description grounded in linked open data
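
The reassembly criteria on this slide (shared actors, non-conflicting facets, lexical similarity) can be sketched as a greedy clustering pass. This is an illustration under stated assumptions, not the authors' system: the `Mention` type, the `dbpedia:`-prefixed identifiers, the `time` facet value, and the small `SYNONYMS` table (standing in for a WordNet lookup) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    verb: str          # minimal lexicalisation of the event (one word)
    actors: frozenset  # LOD identifiers of participants
    facets: dict       # facet name -> LOD identifier or literal value


# Hypothetical synonym groups standing in for a WordNet-based similarity.
SYNONYMS = [{"visit", "visiting", "visited"}, {"run", "control"}]


def lexically_similar(a, b):
    return a == b or any(a in group and b in group for group in SYNONYMS)


def consistent(new_facets, known_facets):
    """Facets are consistent if no shared facet name has a different value."""
    return all(known_facets[k] == v
               for k, v in new_facets.items() if k in known_facets)


def cluster_mentions(mentions):
    """Greedily merge each mention into the first compatible event cluster."""
    clusters = []
    for m in mentions:
        for c in clusters:
            if (m.actors & c["actors"]
                    and consistent(m.facets, c["facets"])
                    and any(lexically_similar(m.verb, v) for v in c["verbs"])):
                c["verbs"].add(m.verb)
                c["actors"] |= m.actors
                c["facets"].update(m.facets)  # agglomerate facets
                break
        else:
            clusters.append({"verbs": {m.verb},
                             "actors": set(m.actors),
                             "facets": dict(m.facets)})
    return clusters


mentions = [
    Mention("visiting", frozenset({"dbpedia:Barack_Obama"}),
            {"location": "dbpedia:Russia"}),
    Mention("visit", frozenset({"dbpedia:Barack_Obama", "dbpedia:Vladimir_Putin"}),
            {"time": "2013-07"}),
    Mention("visit", frozenset({"dbpedia:Barack_Obama"}),
            {"location": "dbpedia:Germany"}),
]
clusters = cluster_mentions(mentions)
for c in clusters:
    print(sorted(c["actors"]), c["facets"])
```

The third mention shares an actor and a verb with the first cluster but conflicts on `location`, so it starts a new event. Because facet values are unique identifiers rather than strings, the conflict check is a plain equality test.
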
  15. Conclusion
      - Event extraction from social media using linked open data enables extraction of rich event descriptions
  16. Thank you!
      - Thank you for listening! Do you have any questions?