Mining Social Media with Linked Open Data,
Entity Recognition and Event Extraction
Leon Derczynski
Kalina Bontcheva
Third Workshop on Data Extraction and Object Search,
Oxford,
7 July 2013
Social Media = Big Data
Gartner ''3V'' definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber
What resources do we have now?
Large, content-rich, connected, digital streams of human discourse
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
''You've only done that which you can communicate''
The metadata (time – place – imagery) gives a richer resource:
→ A sampling of human behaviour
Entity annotation components
Named entity recognition
Linking entities:
dbpedia.org/resource/.....
Michael_Jackson
Michael_Jackson_(writer)
Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. ''Microblog-Genre Noise and Impact on Semantic Annotation Accuracy.'' 24th ACM Conference on Hypertext and Social Media. 2013
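The F1 figures quoted here are the harmonic mean of precision and recall over recognised entities. A minimal sketch of that calculation, assuming exact-match scoring over (start, end, type) spans; the scoring scheme and the example spans are assumptions made for illustration and are not taken from the cited evaluation:

```python
def f1_score(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical spans: (start_token, end_token, entity_type)
gold = {(4, 7, "ORG"), (9, 10, "LOC")}
predicted = {(4, 7, "ORG"), (12, 13, "PER")}
precision, recall, f1 = f1_score(gold, predicted)
```

One of the two predictions is correct and one of the two gold entities is found, so precision, recall and F1 all come out at one half in this toy case.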
Newswire:
London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
Microblog:
Gotta dress up for london fashion week and party in style!!!
NER difficulties
Rule-based systems get the bulk of entities (newswire 77% F1)
ML-based systems do well at the remainder (newswire 89% F1)
Small proportion of difficult entities
Many complex issues
Using improved pipeline:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
Word-level linking performance
Dataset: Ritter NER + DBpedia URIs
Detect mentions of entity in tweets
Crowdsourced annotations
Expert gold standard
Discard after disagreement or ambiguity
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 81%
Word-level linking issues
Automatic annotation:
Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter!
Actual:
Branching out from Lincoln park after dark(PROD) ... Hello "Russian Navy(PROD)", it's like the same thing but with glitter!
Clue in unusual collocations
+ ?
LODIE: LOD-based Information Extraction
Uses DBpedia as reference knowledge graph
Why DBpedia?
Regularly updated (from Wikipedia)
Good source for named entities
A hierarchy of concepts
A capital is also a city, but not vice versa
Relations between concepts
Paris locatedIn France
ParisHilton bornIn NewYorkCity
Demo: http://demos.gate.ac.uk/trendminer/obie/
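The concept hierarchy and the relations between concepts can both be pictured as triples. A minimal sketch, assuming a toy in-memory store; the identifiers below mirror the slide's examples and are illustrative names, not real DBpedia URIs, and the helper functions are hypothetical:

```python
# Toy class hierarchy: child class -> parent class (illustrative only).
SUBCLASS_OF = {"Capital": "City", "City": "Place"}

# Toy (subject, predicate, object) triples in the spirit of the slide.
TRIPLES = {
    ("Paris", "type", "Capital"),
    ("Paris", "locatedIn", "France"),
    ("ParisHilton", "bornIn", "NewYorkCity"),
    ("NewYorkCity", "type", "City"),
}


def is_subclass(cls, ancestor):
    """Walk up the hierarchy: a capital is also a city, but not vice versa."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def instances_of(ancestor):
    """All subjects typed with the given class or any of its subclasses."""
    return {s for (s, p, o) in TRIPLES
            if p == "type" and is_subclass(o, ancestor)}
```

Asking for instances of "City" returns both Paris and NewYorkCity, while asking for instances of "Capital" returns only Paris.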
LODIE: LOD-based Information Extraction
We increase recall by:
Deriving abbreviations from link anchor texts in Wikipedia
''She was born in <a href=''New_York_(city)''>NYC</a>''
Rank boosting terms using redirect pages
Matching NE candidates using wildcard queries (e.g. Burton upon Trent and Burton-on-Trent)
This makes disambiguation harder (precision)
Use naive string, latent semantic, and contextual similarity metrics + URI commonness to disambiguate
This is what achieved our good results!
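The recall and precision steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the LODIE implementation: the alias table, link counts, context words, the first-token/last-token wildcard pattern and the unweighted sum used for scoring are all invented here, and word overlap stands in for the latent semantic and contextual metrics.

```python
import re
from difflib import SequenceMatcher
from fnmatch import fnmatchcase

# Hypothetical alias table built from link anchor texts:
# surface form -> {candidate URI: number of links using that surface form}.
ALIASES = {
    "nyc": {"New_York_(city)": 950, "New_York_City_Ballet": 50},
    "burton upon trent": {"Burton_upon_Trent": 400},
}
# Hypothetical context words associated with each candidate.
CONTEXT = {
    "New_York_(city)": {"born", "city", "manhattan"},
    "New_York_City_Ballet": {"ballet", "dance", "season"},
    "Burton_upon_Trent": {"staffordshire", "brewery", "town"},
}


def normalise(text):
    """Lowercase and treat hyphens and whitespace alike."""
    return re.sub(r"[\s\-]+", " ", text.lower()).strip()


def candidates(mention):
    """Wildcard lookup: first token, anything, last token."""
    tokens = normalise(mention).split()
    if not tokens:
        return {}
    pattern = tokens[0] if len(tokens) == 1 else tokens[0] + "*" + tokens[-1]
    merged = {}
    for surface, uris in ALIASES.items():
        if fnmatchcase(surface, pattern):
            for uri, count in uris.items():
                merged[uri] = merged.get(uri, 0) + count
    return merged


def disambiguate(mention, context_words):
    """Pick the candidate with the best combined score, or None."""
    cands = candidates(mention)
    total = sum(cands.values())
    best, best_score = None, -1.0
    for uri, count in cands.items():
        commonness = count / total
        label = normalise(uri.replace("_", " "))
        string_sim = SequenceMatcher(None, normalise(mention), label).ratio()
        overlap = len(CONTEXT.get(uri, set()) & set(context_words))
        score = commonness + string_sim + overlap
        if score > best_score:
            best, best_score = uri, score
    return best
```

With these toy tables, "Burton-on-Trent" reaches the Burton_upon_Trent entry through the wildcard pattern, and "NYC" resolves to the city or to the ballet company depending on the surrounding words. Widening the match raises recall, which is why the scoring step is needed to recover precision.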
Social media contains events
How are events described differently in social media and in news?
Conventional docs (e.g. newswire) have contextual info
Central event in distinct document segment (e.g. headline)
Location
Actors / participants
Causes
Outcomes
Similar prior events
This kind of description is not found in social media
No editing guidelines
Often limited message length
Instead, event facets are represented sparsely
Only 1-2 facets per message about the event
Event extraction
Social media streams are punctuated with descriptions of events
… Accompanied by event facets
''Obama is visiting Russia''
''The US president has not visited Putin before''
Many viewpoints on the same temporal entity
(like triples)
How can we extract these?
We use the TimeML definitions of events in text:
Minimal lexicalisation (i.e. annotate one word)
Event classes: we focus on ACTIONs and OCCURRENCEs
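One way to picture a minimally lexicalised event mention is as a single annotated token plus whatever facets the message happens to carry. A minimal sketch, assuming a hypothetical EventMention record; the class labels follow the slide's wording, and the token indices, classes and facets below were assigned by hand for illustration (the third sentence is invented):

```python
from dataclasses import dataclass, field

# Event classes in focus, named as on the slide.
FOCUS_CLASSES = {"ACTION", "OCCURRENCE"}


@dataclass
class EventMention:
    tokens: list
    head_index: int      # minimal lexicalisation: exactly one token is annotated
    event_class: str
    facets: dict = field(default_factory=dict)

    @property
    def head(self):
        """The single word that carries the event annotation."""
        return self.tokens[self.head_index]


def in_focus(mentions):
    """Keep only mentions whose class is one of the classes in focus."""
    return [m for m in mentions if m.event_class in FOCUS_CLASSES]


mentions = [
    EventMention("Obama is visiting Russia".split(), 2, "OCCURRENCE",
                 {"actor": "Obama", "location": "Russia"}),
    EventMention("The US president has not visited Putin before".split(), 5,
                 "OCCURRENCE", {"actor": "US president"}),
    EventMention("Everyone loves glitter".split(), 1, "STATE"),  # invented example
]
heads = [m.head for m in in_focus(mentions)]
```

Each message contributes only one or two facets, which is why mentions of the same event have to be combined later.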
Event extraction
How can we extract event mentions?
Conventional approaches are hybrid:
Statistical learning
Syntactic structures
Existing TimeML resources
TimeBank corpus (newswire)
Evita event extraction tool
Adapting to social media text
Negatively impacted by problems with NER
Short sentence structure
→ Use shallow linguistic techniques and fuzzy matches
Evita: F1 80.1
TIPSem: F1 81.4 (on well-formed text)
USFD Arcomem: F1 81.1 (noise-resilient)
LOD for event reassembly
What is needed to reassemble events from social media?
Identify mentions of the same event
Collect facets and integrate them
LOD gives unique identifiers for facet values
Many possible lexicalisations for the same event (run, control)
Identify co-referring mentions through:
Shared actors
Consistent facets (i.e. non-conflicting)
Lexical event similarity (e.g. WordNet)
This helps:
Cluster mentions of the same event
Agglomerate facets
Final product: Event description grounded in linked open data
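The reassembly steps can be sketched as a greedy clustering pass. This is a minimal illustration under stated assumptions rather than the authors' method: the similarity table is a hand-made stand-in for a WordNet lookup, the "dbr:" identifiers and facet values are invented, and a mention joins the first compatible event it meets.

```python
# Hypothetical pairs of similar event lemmas, standing in for a WordNet lookup.
SIMILAR = {frozenset({"visit", "trip"}), frozenset({"visit", "meet"})}


def lexically_similar(a, b):
    return a == b or frozenset({a, b}) in SIMILAR


def consistent(new_facets, known_facets):
    """Facets are consistent when no shared key has conflicting values."""
    return all(known_facets[k] == v
               for k, v in new_facets.items() if k in known_facets)


def reassemble(mentions):
    """Cluster co-referring mentions and agglomerate their facets."""
    events = []
    for m in mentions:
        for ev in events:
            if (m["actors"] & ev["actors"]
                    and consistent(m["facets"], ev["facets"])
                    and any(lexically_similar(m["lemma"], lemma)
                            for lemma in ev["lemmas"])):
                ev["actors"] |= m["actors"]
                ev["facets"].update(m["facets"])
                ev["lemmas"].add(m["lemma"])
                break
        else:
            # No compatible event found: start a new one.
            events.append({"actors": set(m["actors"]),
                           "facets": dict(m["facets"]),
                           "lemmas": {m["lemma"]}})
    return events


# Invented mentions; facet values are illustrative identifiers.
mentions = [
    {"lemma": "visit", "actors": {"dbr:Barack_Obama"},
     "facets": {"location": "dbr:Russia"}},
    {"lemma": "trip", "actors": {"dbr:Barack_Obama"},
     "facets": {"counterpart": "dbr:Vladimir_Putin"}},
    {"lemma": "visit", "actors": {"dbr:Barack_Obama"},
     "facets": {"location": "dbr:France"}},
]
events = reassemble(mentions)
```

The first two mentions merge into one event carrying both facets, while the third stays separate because its location conflicts. Using unique identifiers as facet values is what makes the conflict check a simple equality test.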
Conclusion
Event extraction from social media using linked open data enables extraction of rich event descriptions
Thank you!
Thank you for listening!
Do you have any questions?

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 

Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

  • 1. Mining Social Media with Linked Open Data, Entity Recognition and Event Extraction. Leon Derczynski, Kalina Bontcheva. Third Workshop on Data Extraction and Object Search, Oxford, 7 July 2013.
  • 2.
  • 3. Social Media = Big Data. Gartner "3V" definition: 1. Volume; 2. Velocity; 3. Variety. High volume & velocity of messages: Twitter has ~20 000 000 users per month; they write ~500 000 000 messages per day. Massive variety: stock markets; earthquakes; social arrangements; … Bieber.
  • 4. What resources do we have now? Large, content-rich, connected, digital streams of human discourse. We transfer knowledge via communication. Sampling communication gives a sample of human knowledge. "You've only done that which you can communicate." The metadata (time – place – imagery) gives a richer resource: → a sampling of human behaviour.
  • 5. Entity annotation components: Named entity recognition; Linking entities (dbpedia.org/resource/.....: Michael_Jackson, Michael_Jackson_(writer)).
  • 6. Named Entity Recognition. Goal is to find entities we might like to link. General accuracy on newswire: 89% F1. General accuracy on microblogs: 41% F1. (L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. "Microblog-Genre Noise and Impact on Semantic Annotation Accuracy." 24th ACM Conference on Hypertext and Social Media. 2013.) Newswire: "London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness." Microblog: "Gotta dress up for london fashion week and party in style!!!"
  • 7. NER difficulties. Rule-based systems get the bulk of entities (newswire 77% F1). ML-based systems do well at the remainder (newswire 89% F1). Small proportion of difficult entities; many complex issues. Using improved pipeline: ML struggles, even with in-genre data: 49% F1. Rules cut through microblog noise: 80% F1.
  • 8. Word-level linking performance. Dataset: Ritter NER + DBpedia URIs. Detect mentions of entity in tweets. Crowdsourced annotations; expert gold standard; discard after disagreement or ambiguity. We disambiguate mentions to DBpedia / Wikipedia (easy to map). General performance: F1 81%.
  • 9. Word-level linking issues. Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter! Actual: Branching out from Lincoln park after dark(PROD) ... Hello "Russian Navy(PROD)", it's like the same thing but with glitter! Clue in unusual collocations + ?
  • 10. LODIE: LOD-based Information Extraction. Uses DBpedia as reference knowledge graph. Why DBpedia? Regularly updated (from Wikipedia); good source for named entities; a hierarchy of concepts (a capital is also a city, but not vice versa); relations between concepts (Paris locatedIn France; ParisHilton bornIn NewYorkCity). Demo: http://demos.gate.ac.uk/trendminer/obie/
  • 11. LODIE: LOD-based Information Extraction. We increase recall by: deriving abbreviations from link anchor texts in Wikipedia ('She was born in <a href="New_York_(city)">NYC</a>'); rank boosting terms using redirect pages; matching NE candidates using wildcard queries (e.g. Burton upon Trent and Burton-on-Trent). This makes disambiguation harder (precision). Use naive string, latent semantic, and contextual similarity metrics + URI commonness to disambiguate. This is what achieved our good results! Demo: http://demos.gate.ac.uk/trendminer/obie/
  • 12. Social media contains events. How are events differently described in social media and news? Conventional docs (e.g. newswire) have contextual info: central event in distinct document segment (e.g. headline); location; actors / participants; causes; outcomes; similar prior events. This kind of description is not found in social media: no editing guidelines; often limited message length. Instead, event facets are represented sparsely: only 1-2 facets per message about the event.
  • 13. Event extraction. Social media streams are punctuated with descriptions of events … accompanied by event facets: "Obama is visiting Russia"; "The US president has not visited Putin before". Many viewpoints on the same temporal entity (like triples). How can we extract these? We use the TimeML definitions of events in text: minimal lexicalisation (i.e. annotate one word); event classes: we focus on ACTIONs and OCCURRENCEs.
  • 14. Event extraction. How can we extract event mentions? Conventional approaches are hybrid: statistical learning; syntactic structures. Existing TimeML resources: TimeBank corpus (newswire); Evita event extraction tool. Adapting to social media text: negatively impacted by problems with NER; short sentence structure → use shallow linguistic techniques and fuzzy matches. Evita: F1 80.1. TIPSem: F1 81.4 (on well-formed text). USFD Arcomem: F1 81.1 (noise-resilient).
  • 15. LOD for event reassembly. What is needed to reassemble events from social media? Identify mentions of the same event; collect facets and integrate them. LOD gives unique identifiers for facet values. Many possible lexicalisations for the same event (run, control). Identify co-referring mentions through: shared actors; consistent facets (i.e. non-conflicting); lexical event similarity (e.g. WordNet). This helps cluster mentions of the same event. Agglomerate facets. Final product: event description grounded in linked open data.
  • 16. Conclusion: event extraction from social media using linked open data enables extraction of rich event descriptions.
  • 17. Thank you! Thank you for listening! Do you have any questions?
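
Slide 11 says candidates are disambiguated with string, latent semantic, and contextual similarity plus URI commonness. The following is a minimal Python sketch of that idea, not the LODIE implementation: it leaves out the latent semantic metric, and the `Candidate` class, the weights, the context word sets and the commonness values are all hypothetical toy assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

DBR = "http://dbpedia.org/resource/"


@dataclass
class Candidate:
    uri: str            # DBpedia resource URI
    label: str          # human-readable label of the resource
    context: set[str]   # toy stand-in for words from the resource's description
    commonness: float   # assumed prior in [0, 1] that the mention links here


def string_similarity(mention: str, label: str) -> float:
    """Naive string similarity between the mention and the candidate label."""
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()


def context_similarity(tweet_words: set[str], cand_words: set[str]) -> float:
    """Overlap coefficient between tweet words and candidate context words."""
    if not tweet_words or not cand_words:
        return 0.0
    return len(tweet_words & cand_words) / min(len(tweet_words), len(cand_words))


def score(mention, tweet_words, cand, weights=(0.2, 0.5, 0.3)):
    """Weighted sum of the three signals; the weights are illustrative only."""
    w_str, w_ctx, w_com = weights
    return (w_str * string_similarity(mention, cand.label)
            + w_ctx * context_similarity(tweet_words, cand.context)
            + w_com * cand.commonness)


def disambiguate(mention, tweet, candidates):
    """Return the highest-scoring candidate for a mention in a tweet."""
    tweet_words = set(tweet.lower().split())
    return max(candidates, key=lambda c: score(mention, tweet_words, c))


# Toy candidates for the two resources named on slide 5 (values are made up).
candidates = [
    Candidate(DBR + "Michael_Jackson", "Michael Jackson",
              {"singer", "pop", "music", "thriller"}, 0.8),
    Candidate(DBR + "Michael_Jackson_(writer)", "Michael Jackson (writer)",
              {"beer", "whisky", "writer", "guide"}, 0.2),
]

best_with_context = disambiguate(
    "Michael Jackson",
    "reading the beer and whisky guide by michael jackson",
    candidates)
best_without_context = disambiguate(
    "Michael Jackson", "michael jackson was great", candidates)

print(best_with_context.uri)
print(best_without_context.uri)
```

With these toy numbers, context words pull the first tweet towards the writer, while the second tweet has no usable context and falls back on the commonness prior.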
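
The reassembly criteria on slide 15 (shared actors, non-conflicting facets, lexical event similarity) can be sketched as a greedy clustering pass. This is an illustrative sketch under stated assumptions rather than the authors' method: the `Mention` class, the hand-written `SYNONYM_GROUPS` (standing in for a resource such as WordNet), the facet names and the example mentions are all hypothetical.

```python
from dataclasses import dataclass, field

DBR = "http://dbpedia.org/resource/"

# Toy stand-in for a lexical resource such as WordNet (hypothetical groups).
SYNONYM_GROUPS = [{"visit", "trip", "travel"}, {"run", "control"}]


@dataclass
class Mention:
    trigger: str                                          # one-word event trigger
    actors: set[str]                                      # LOD URIs of participants
    facets: dict[str, str] = field(default_factory=dict)  # facet name -> LOD URI


def lexically_similar(a: str, b: str) -> bool:
    """True if the triggers are equal or share a synonym group."""
    return a == b or any(a in g and b in g for g in SYNONYM_GROUPS)


def facets_consistent(a: dict[str, str], b: dict[str, str]) -> bool:
    """Facets conflict only when both sides give different values for one facet."""
    return all(b[k] == v for k, v in a.items() if k in b)


def corefer(m: Mention, event: Mention) -> bool:
    """Apply the three criteria: shared actors, consistent facets, similar trigger."""
    return (bool(m.actors & event.actors)
            and facets_consistent(m.facets, event.facets)
            and lexically_similar(m.trigger, event.trigger))


def reassemble(mentions: list[Mention]) -> list[Mention]:
    """Greedy single pass: merge each mention into the first matching event."""
    events: list[Mention] = []
    for m in mentions:
        for e in events:
            if corefer(m, e):
                e.actors |= m.actors       # agglomerate actors
                e.facets.update(m.facets)  # agglomerate facets (already non-conflicting)
                break
        else:
            # No matching event: start a new one from a copy of the mention.
            events.append(Mention(m.trigger, set(m.actors), dict(m.facets)))
    return events


mentions = [
    Mention("visit", {DBR + "Barack_Obama"}, {"location": DBR + "Russia"}),
    Mention("trip", {DBR + "Barack_Obama", DBR + "Vladimir_Putin"},
            {"host": DBR + "Vladimir_Putin"}),
    Mention("visit", {DBR + "Barack_Obama"}, {"location": DBR + "Germany"}),
]

events = reassemble(mentions)
for e in events:
    print(e.trigger, sorted(e.actors), e.facets)
```

Because every facet value is a LOD URI, the consistency check reduces to comparing identifiers: the third mention is kept as a separate event since its location conflicts with the first. A merged event keeps the trigger of its first mention, which is a simplification.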