University of Sheffield, NLP
TwitIE: An Open-Source Information Extraction
Pipeline for Microblog Text
Kalina Bontcheva
Leon Derczynski
Adam Funk
Mark A. Greenwood
Diana Maynard
Niraj Aswani
© The University of Sheffield, 1995-2013
This work is licensed under
the Creative Commons Attribution-NonCommercial-NoDerivs Licence
The Problem
• Running ANNIE on 300 news articles – 87% F-score
• Running ANNIE on some tweets – < 40% F-score
Example: Persons in news articles
Example: Persons in tweets
Genre Differences in Entity Types
Entity type | News | Tweets
PER | Politicians, business leaders, journalists, celebrities | Sportsmen, actors, TV personalities, celebrities, names of friends
LOC | Countries, cities, rivers, and other places related to current affairs | Restaurants, bars, local landmarks/areas, cities, rarely countries
ORG | Public and private companies, government organisations | Bands, internet companies, sports clubs
Tweet-specific NER challenges
• Capitalisation is not indicative of named entities
• All uppercase, e.g. APPLE IS AWSOME
• All lowercase, e.g. all welcome, joe included
• Every word capitalised, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School
• Unusual spelling, acronyms, and abbreviations
• Social media conventions:
• Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance
• @Mentions, e.g. @edchi (PER), @mcg_graz (LOC),
@BBC (ORG)
TwitIE: GATE’s new Twitter NER pipeline
Importing tweets into GATE
• GATE now supports JSON format import for tweets
• Located in the Format_Twitter plugin
• Automatically used for files *.json
• Alternatively, specify text/x-json-twitter as a mime type
• The tweet text becomes the document, all other JSON
fields become features
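The mapping described above (tweet text becomes the document, every other JSON field becomes a feature) can be sketched outside GATE in a few lines of Python. This is only an illustration of the idea, not the Format_Twitter plugin's code; the function name `tweet_to_document` and the returned dictionary layout are assumptions made for the example.

```python
import json


def tweet_to_document(raw_json):
    """Split one tweet's JSON into document content and document features.

    Hypothetical helper: it mirrors the behaviour described on the slide,
    it is not GATE's implementation.
    """
    tweet = json.loads(raw_json)
    # The "text" field holds the tweet body in Twitter's JSON format.
    content = tweet.get("text", "")
    # Every remaining top-level field is kept as a feature.
    features = {key: value for key, value in tweet.items() if key != "text"}
    return {"content": content, "features": features}


if __name__ == "__main__":
    raw = '{"text": "hello #world", "lang": "en", "id": 1}'
    doc = tweet_to_document(raw)
    print(doc["content"])
    print(doc["features"])
```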
Language Detection: Less than 50% English
• The main challenges with tweets and Facebook status updates:
– the small number of tokens (10 tokens per tweet on average)
– the noisy nature of the words (abbreviations, misspellings)
• Because the text is so short, we can assume that each tweet is written in only one language
• We have adapted the TextCat language identification plugin
• Fingerprints are provided for 5 languages: DE, EN, FR, ES, NL
• You can extend it to new languages easily
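TextCat-style identification compares a ranked character n-gram "fingerprint" of the input against one fingerprint per language and picks the closest. The sketch below shows that idea with the out-of-place rank distance. It is not the adapted GATE plugin: the one-sentence training samples, the profile size and all function names are assumptions chosen to keep the example short, and real fingerprints are built from far more text.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending frequency, then alphabetically, so ranks are stable.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a fixed maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


# Toy training text per language (assumption: far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what we were talking about",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist nicht alles",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et ce n'est pas tout",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y eso no es todo",
    "nl": "de snelle bruine vos springt over de luie hond en dat is nog niet alles",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect_language(tweet):
    """Return the language whose fingerprint is closest to the tweet's."""
    doc_profile = ngram_profile(tweet)
    return min(PROFILES, key=lambda lang: out_of_place(doc_profile, PROFILES[lang]))
```

Adding a language in this sketch means adding one more entry to `SAMPLES`. Because a tweet contributes only a handful of n-grams, the quality of the language fingerprints matters a great deal.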
Language Detection Examples
Tokenisation
• Splitting a text into its constituent parts
• Plenty of “unusual”, but very important tokens in social media:
– @Apple – mentions of company/brand/person names
– #fail, #SteveJobs – hashtags expressing sentiment, person or company names
– :-(, :-), :-P – emoticons (punctuation and optionally letters)
– URLs
• Tokenisation is key for entity recognition and opinion mining
• A study of 1.1 million tweets found that 26% of English tweets contain a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter, 2013].
Example
– #WiredBizCon #nike vp said when @Apple saw what
http://nikeplus.com did, #SteveJobs was like wow I didn't
expect this at all.
– Tokenising on white space doesn't work that well:
• Nike and Apple are company names, but tokens such as #nike and @Apple make entity recognition harder, because the recogniser has to look at the sub-token level
– Tokenising on white space and punctuation characters
doesn't work well either: URLs get separated (http,
nikeplus), as are emoticons and email addresses
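Both failure modes can be reproduced on the example tweet with a few lines of Python. The two functions below are deliberately naive baselines written for this illustration; neither is part of GATE or TwitIE.

```python
import re

TWEET = (
    "#WiredBizCon #nike vp said when @Apple saw what "
    "http://nikeplus.com did, #SteveJobs was like wow I didn't "
    "expect this at all."
)


def whitespace_tokens(text):
    """Split on white space only: markers and punctuation stay glued to words."""
    return text.split()


def punctuation_tokens(text):
    """Split on white space and punctuation: every symbol becomes a token."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    # "#nike", "@Apple" and "did," each survive as a single token here.
    print(whitespace_tokens(TWEET))
    # The URL is broken into "http", ":", "/", "/", "nikeplus", ".", "com".
    print(punctuation_tokens(TWEET))
```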
The TwitIE Tokeniser
• Treat RTs and URLs as 1 token each
• #nike is two tokens (# and nike) plus a separate HashTag annotation covering both; likewise, @mentions get a UserID annotation
• Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase
• Date and phone number normalisation, lowercasing, and emoticon handling are optionally done later in separate modules
• Consequently, tokenisation is faster and more generic
• It is also more tailored to our NER module
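A rough Python sketch of these tokenisation rules follows. It is not the TwitIE tokeniser (which is a GATE component): the regular expression, the `Token` and `Annotation` types, and the orthography labels `allCaps`, `lowercase` and `mixCase` are assumptions modelled on the wording of the slide.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    start: int
    end: int
    text: str
    orth: Optional[str]  # orthography feature; None for tokens without letters


class Annotation(NamedTuple):
    kind: str  # "HashTag" or "UserID"
    start: int
    end: int


TOKEN_RE = re.compile(
    r"""
    https?://\S+    # a URL stays one token
  | [#@]            # hashtag and mention markers are tokens of their own
  | \w+             # words and numbers (so "RT" is one token)
  | [^\w\s]         # any other single punctuation character
    """,
    re.VERBOSE,
)
WORD_RE = re.compile(r"\w+")
MARKER_KINDS = {"#": "HashTag", "@": "UserID"}


def orthography(text):
    """Describe the casing of a token without changing the token itself."""
    if not any(ch.isalpha() for ch in text):
        return None
    if text.isupper():
        return "allCaps"
    if text.islower():
        return "lowercase"
    return "mixCase"


def tokenise(text):
    """Return (tokens, annotations) for one tweet."""
    tokens = [
        Token(m.start(), m.end(), m.group(), orthography(m.group()))
        for m in TOKEN_RE.finditer(text)
    ]
    annotations = []
    for marker, word in zip(tokens, tokens[1:]):
        # A marker directly followed by a word gets one covering annotation.
        if (marker.text in MARKER_KINDS and marker.end == word.start
                and WORD_RE.fullmatch(word.text)):
            annotations.append(
                Annotation(MARKER_KINDS[marker.text], marker.start, word.end)
            )
    return tokens, annotations
```

Keeping `#` and `nike` as separate tokens lets gazetteer lookup match the bare name, while the covering annotation still records that the pair was written as a hashtag.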
POS Tagging
• The accuracy of the Stanford POS tagger drops from about
97% on news to 80% on tweets (Ritter, 2011)
• Need for an adapted POS tagger, specifically for tweets
• We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts
• Next we compare the differences between the ANNIE POS
Tagger and the Tweet POS Tagger on the example tweets
POS Tagging Example
• TwitIE POS tagger on the left
• ANNIE POS tagger on the right
• The TwitIE POS tagger is described in a separate paper at RANLP’2013
• Beats Ritter (2011); uses a grown-up tag set (cf. Gimpel, 2011)
Tweet Normalisation
• “RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”
• OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
• Similar to SMS normalisation
• For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token
• BUT uppercasing, and letter and exclamation mark repetition, often convey strong sentiment
• Therefore some choose not to normalise, while others keep both versions of the tokens
A normalised example
• Normaliser currently based on spelling correction and some lists of common abbreviations
• Outstanding issues:
– Insert new Token annotations, so easier to POS tag, etc.? For example: “trying to” is now 1 annotation
– Some abbreviations which span token boundaries (e.g. gr8, do n’t) are difficult to handle
– Capitalisation and punctuation normalisation
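One way to keep both versions of each token is to store the normalised form as an extra feature next to the original string. The Python sketch below illustrates that with a tiny abbreviation list and a repetition-collapsing rule. It is not the TwitIE normaliser: the `ABBREVIATIONS` entries, the feature names `string` and `normalised`, and the collapsing heuristic are assumptions, and the heuristic stands in for real spelling correction (it would wrongly turn "cooool" into "col").

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"gr8": "great", "thx": "thanks", "u": "you"}

# Three or more repeats of the same letter, "!" or "?".
REPEAT_RE = re.compile(r"([A-Za-z!?])\1{2,}")


def normalise(token):
    """Collapse long character repetitions, then expand known abbreviations."""
    collapsed = REPEAT_RE.sub(r"\1", token)
    return ABBREVIATIONS.get(collapsed.lower(), collapsed)


def annotate(tokens):
    """Keep the original string and add the normalised form beside it."""
    return [{"string": tok, "normalised": normalise(tok)} for tok in tokens]


if __name__ == "__main__":
    for features in annotate(["thankkkks", "a", "bunch", "ARGHHHHHH", "!!!!!!", "gr8"]):
        print(features)
```

Because the original string is never overwritten, a sentiment component can still see the repeated letters and exclamation marks, while the POS tagger reads the normalised form.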
TwitIE NER Results
Trying TwitIE
• Plugin in the latest GATE snapshot and forthcoming 7.2
release
• Download details at: https://gate.ac.uk/wiki/twitie.html
• Available soon as a web service on the forthcoming
AnnoMarket NLP cloud marketplace:
• https://annomarket.com/
Coming Soon: TwitIE-as-a-Service
Preview of some text analytics services on AnnoMarket.com
Acknowledgements
• Kalina Bontcheva is supported by a Career Acceleration
Fellowship from the Engineering and Physical Sciences
Research Council (grant EP/I004327/1)
• This research is also partially supported by the EU-funded
FP7 TrendMiner project (http://www.trendminer-project.eu)
and the CHIST-ERA uComp project (http://www.ucomp.eu)
Thank you for your time!

More Related Content

What's hot

Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
Sanzid Kawsar
 
The Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information RetrievalThe Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information Retrieval
Tony Russell-Rose
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
prashantdahake
 
Natural Language Processing: Definition and Application
Natural Language Processing: Definition and ApplicationNatural Language Processing: Definition and Application
Natural Language Processing: Definition and Application
Stephen Shellman
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
Jaganadh Gopinadhan
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
Abash shah
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
Jose Zagal
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
Minh Pham
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysis
Diana Maynard
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
ananth
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
David Rostcheck
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
Yogendra Tamang
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
ananth
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
Marina Santini
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
KarenVacca
 
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
PhD Assistance
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
Marina Santini
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
AkhilPolisetty
 
Natural language processing and its application in ai
Natural language processing and its application in aiNatural language processing and its application in ai
Natural language processing and its application in ai
Ram Kumar
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Yuriy Guts
 

What's hot (20)

Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
The Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information RetrievalThe Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information Retrieval
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing: Definition and Application
Natural Language Processing: Definition and ApplicationNatural Language Processing: Definition and Application
Natural Language Processing: Definition and Application
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysis
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing and its application in ai
Natural language processing and its application in aiNatural language processing and its application in ai
Natural language processing and its application in ai
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 

Viewers also liked

Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
Atul Shridhar
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
toncho11
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
eSAT Journals
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
Ontotext
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
NUST School of Electrical Engineering and Computer Science
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?
Irfan Ullah
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Mauro Dragoni
 
Semantics And Search
Semantics And SearchSemantics And Search
Semantics And Search
Vestforsk.no
 
Semantic data mining: an ontology based approach
Semantic data mining: an ontology based approachSemantic data mining: an ontology based approach
Semantic data mining: an ontology based approach
Agnieszka Ławrynowicz
 
Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...
Natalia Díaz Rodríguez
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
Peter Mika
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
ATHMAN HAJ-HAMOU
 
Semantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and RefinementSemantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and Refinement
Andre Freitas
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
Ontotext
 
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
Martin Goodson
 
Ontology-based Classification and Faceted Search Interface for APIs
Ontology-based Classification and Faceted Search Interface for APIsOntology-based Classification and Faceted Search Interface for APIs
Ontology-based Classification and Faceted Search Interface for APIs
New York City College of Technology Computer Systems Technology Colloquium
 
Indexing in Search Engine
Indexing in Search EngineIndexing in Search Engine
Indexing in Search Engine
Shikha Gupta
 

Viewers also liked (17)

Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Semantics And Search
Semantics And SearchSemantics And Search
Semantics And Search
 
Semantic data mining: an ontology based approach
Semantic data mining: an ontology based approachSemantic data mining: an ontology based approach
Semantic data mining: an ontology based approach
 
Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...Semantic security framework and context-aware role-based access control ontol...
Semantic security framework and context-aware role-based access control ontol...
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
 
Use of ontologies in natural language processing
Use of ontologies in natural language processingUse of ontologies in natural language processing
Use of ontologies in natural language processing
 
Semantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and RefinementSemantic Relation Classification: Task Formalisation and Refinement
Semantic Relation Classification: Task Formalisation and Refinement
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
 
Ontology-based Classification and Faceted Search Interface for APIs
Ontology-based Classification and Faceted Search Interface for APIsOntology-based Classification and Faceted Search Interface for APIs
Ontology-based Classification and Faceted Search Interface for APIs
 
Indexing in Search Engine
Indexing in Search EngineIndexing in Search Engine
Indexing in Search Engine
 

Similar to TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
ananth
 
Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)
Olga Melnikova
 
Starting to Process Social Media
Starting to Process Social MediaStarting to Process Social Media
Starting to Process Social Media
Leon Derczynski
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
Alia Hamwi
 
Email Tips and Trends 2010
Email Tips and Trends 2010Email Tips and Trends 2010
Email Tips and Trends 2010
Christy Broccardo
 
Effective communication via email
Effective communication via emailEffective communication via email
Effective communication via email
Marianna Semenova
 
Email Tips 2010
Email Tips 2010Email Tips 2010
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Max Irwin
 
Electronic writing processes
Electronic writing processesElectronic writing processes
Electronic writing processes
Rabin Bhandari
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
rohithprabhas1
 
Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2
Paige Morgan
 
MODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptxMODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptx
nikshaikh786
 
Ir 03
Ir   03Ir   03
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Gail Houston
 
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
Cornelius Puschmann
 
Technical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_septTechnical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_sept
Holly M. Justice
 
LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
MedBelatrach
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
CzechDreamin
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.ppt
SamuelKetema1
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
eLearning Consortium 電子學習聯盟
 

Similar to TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text (20)

Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
 
Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)Growing Your Freelance Business (Olga Melnikova)
Growing Your Freelance Business (Olga Melnikova)
 
Starting to Process Social Media
Starting to Process Social MediaStarting to Process Social Media
Starting to Process Social Media
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Email Tips and Trends 2010
Email Tips and Trends 2010Email Tips and Trends 2010
Email Tips and Trends 2010
 
Effective communication via email
Effective communication via emailEffective communication via email
Effective communication via email
 
Email Tips 2010
Email Tips 2010Email Tips 2010
Email Tips 2010
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Electronic writing processes
Electronic writing processesElectronic writing processes
Electronic writing processes
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
 
Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2Feb.2016 Demystifying Digital Humanities - Workshop 2
Feb.2016 Demystifying Digital Humanities - Workshop 2
 
MODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptxMODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptx
 
Ir 03
Ir   03Ir   03
Ir 03
 
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
Resume Advice - Intuit Careers Facebook Video Chat Feb 2011
 
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
SchemaCMD - An XML-based storage schema for the compilation of mixed-source C...
 
Technical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_septTechnical resumes with Dean Liesl Folks_fall2014_sept
Technical resumes with Dean Liesl Folks_fall2014_sept
 
LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.ppt
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
 

More from Leon Derczynski

Joint Rumour Stance and Veracity
Joint Rumour Stance and VeracityJoint Rumour Stance and Veracity
Joint Rumour Stance and Veracity
Leon Derczynski
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018
Leon Derczynski
 
RumourEval
RumourEvalRumourEval
RumourEval
Leon Derczynski
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Leon Derczynski
 
Handling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCHandling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGC
Leon Derczynski
 
Efficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingEfficient named entity annotation through pre-empting
Efficient named entity annotation through pre-empting
Leon Derczynski
 
Leveraging the Power of Social Media
Leveraging the Power of Social MediaLeveraging the Power of Social Media
Leveraging the Power of Social Media
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

  • 1. University of Sheffield, NLP TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text Kalina Bontcheva Leon Derczynski Adam Funk Mark A. Greenwood Diana Maynard Niraj Aswani © The University of Sheffield, 1995-2013 This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence
  • 2. University of Sheffield, NLP The Problem • Running ANNIE on 300 news articles – 87% f-score • Running ANNIE on some tweets - < 40% f-score
  • 3. University of Sheffield, NLP Example: Persons in news articles
  • 4. University of Sheffield, NLP Example: Persons in tweets
  • 5. University of Sheffield, NLP Genre Differences in Entity Types News Tweets PER Politicians, business leaders, journalists, celebrities Sportsmen, actors, TV personalities, celebrities, names of friends LOC Countries, cities, rivers, and other places related to current affairs Restaurants, bars, local landmarks/areas, cities, rarely countries ORG Public and private companies, government organisations Bands, internet companies, sports clubs
  • 6. University of Sheffield, NLP Tweet-specific NER challenges • Capitalisation is not indicative of named entities • All uppercase, e.g. APPLE IS AWSOME • All lowercase, e.g. all welcome, joe included • All letters upper initial, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School • Unusual spelling, acronyms, and abbreviations • Social media conventions: • Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance • @Mentions, e.g. @edchi (PER), @mcg_graz (LOC), @BBC (ORG)
  • 7. University of Sheffield, NLP TwitIE: GATE’s new Twitter NER pipeline
  • 8. University of Sheffield, NLP Importing tweets into GATE • GATE now supports JSON format import for tweets • Located in the Format_Twitter plugin • Automatically used for files *.json • Alternatively, specify text/x-json-twitter as a mime type • The tweet text becomes the document, all other JSON fields become features
  • 9. University of Sheffield, NLP Language Detection: Less than 50% English • The main challenges on tweets/Facebook status updates: the small number of tokens (10 tokens/tweet on average) and the noisy nature of the words (abbreviations, misspellings) • Because the text is so short, we can assume that one tweet is written in only one language • We have adapted the TextCat language identification plugin • Provided fingerprints for 5 languages: DE, EN, FR, ES, NL • You can extend it to new languages easily
  • 10. University of Sheffield, NLP Language Detection Examples
  • 11. University of Sheffield, NLP Tokenisation • Splitting a text into its constituent parts • Plenty of “unusual”, but very important tokens in social media: – @Apple – mentions of company/brand/person names – #fail, #SteveJobs – hashtags expressing sentiment, person or company names – :-(, :-), :-P – emoticons (punctuation and optionally letters) – URLs • Tokenisation is key for entity recognition and opinion mining • A study of 1.1 million tweets: 26% of English tweets have a URL, 16.6% have a hashtag, and 54.8% have a user name mention [Carter, 2013].
  • 12. University of Sheffield, NLP Example – #WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, #SteveJobs was like wow I didn't expect this at all. – Tokenising on white space doesn't work that well: • Nike and Apple are company names, but if we have tokens such as #nike and @Apple, this will make the entity recognition harder, as it will need to look at sub-token level – Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as do emoticons and email addresses
  • 13. University of Sheffield, NLP The TwitIE Tokeniser • Treat RTs and URLs as 1 token each • #nike is two tokens (# and nike) plus a separate annotation HashTag covering both. Same for @mentions -> UserID • Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase • Date and phone number normalisation, lowercasing, and emoticons are optionally done later in separate modules • Consequently, tokenisation is faster and more generic • Also, more tailored to our NER module
  • 14. University of Sheffield, NLP POS Tagging • The accuracy of the Stanford POS tagger drops from about 97% on news to 80% on tweets (Ritter, 2011) • Need for an adapted POS tagger, specifically for tweets • We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts • Next we compare the differences between the ANNIE POS Tagger and the Tweet POS Tagger on the example tweets
  • 15. University of Sheffield, NLP POS Tagging Example • TwitIE POS tagger on the left • ANNIE POS tagger on the right • The TwitIE POS tagger is a separate paper at RANLP’2013 • Beats Ritter (2011); uses a grown-up tag set (cf. Gimpel, 2011)
  • 16. University of Sheffield, NLP Tweet Normalisation • “RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))” • OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!! • Similar to SMS normalisation • For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token • BUT uppercasing, and letter and exclamation mark repetition often convey strong sentiment • Therefore some choose not to normalise, while others keep both versions of the tokens
  • 17. University of Sheffield, NLP A normalised example • Normaliser currently based on spelling correction and some lists of common abbreviations • Outstanding issues: – Insert new Token annotations, so easier to POS tag, etc? For example: “trying to” is now 1 annotation – Some abbreviations which span token boundaries (e.g. gr8, do n’t) are difficult to handle – Capitalisation and punctuation normalisation
  • 18. University of Sheffield, NLP TwitIE NER Results
  • 19. University of Sheffield, NLP Trying TwitIE • Plugin in the latest GATE snapshot and forthcoming 7.2 release • Download details at: https://gate.ac.uk/wiki/twitie.html • Available soon as a web service on the forthcoming AnnoMarket NLP cloud marketplace: • https://annomarket.com/
  • 20. University of Sheffield, NLP Coming Soon: TwitIE-as-a-Service Preview of some text analytics services on AnnoMarket.com
  • 21. University of Sheffield, NLP Acknowledgements • Kalina Bontcheva is supported by a Career Acceleration Fellowship from the Engineering and Physical Sciences Research Council (grant EP/I004327/1) • This research is also partially supported by the EU-funded FP7 TrendMiner project (http://www.trendminer-project.eu) and the CHIST-ERA uComp project (http://www.ucomp.eu) Thank you for your time!
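
The tokenisation rules on slide 13 can be illustrated with a minimal Python sketch. This is not the TwitIE implementation (TwitIE is a GATE plugin); the regular expression, the `Annotation` class, and the feature names `string` and `orth` are assumptions made for illustration only. The sketch keeps URLs as single tokens, splits `#nike` into `#` and `nike`, adds a covering `HashTag` (or `UserID`) annotation, and records an orthography feature instead of changing the case. "RT" needs no special rule here because it already matches as one word.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token pattern (an assumption, not TwitIE's rules):
# a whole URL | a lone '#' or '@' | a word (optionally with an apostrophe part)
# | any other single non-space character.
TOKEN_RE = re.compile(r"https?://\S+|[#@]|\w+(?:'\w+)?|\S")


@dataclass
class Annotation:
    kind: str            # "Token", "HashTag" or "UserID"
    start: int           # character offset where the span begins
    end: int             # character offset just past the span
    features: dict = field(default_factory=dict)


def orthography(text):
    """Return the orthography label named on the slide, or None for non-letters."""
    if text.isupper():
        return "allCaps"
    if text.islower():
        return "lowercase"
    if any(c.isalpha() for c in text):
        return "mixCase"
    return None


def tokenise(tweet):
    """Return Token annotations plus covering HashTag/UserID annotations."""
    annotations = []
    prev = None
    for match in TOKEN_RE.finditer(tweet):
        text = match.group()
        token = Annotation("Token", match.start(), match.end(), {"string": text})
        orth = orthography(text)
        if orth is not None:
            token.features["orth"] = orth
        annotations.append(token)
        # A plain word glued to a preceding '#' or '@' gets a covering annotation.
        if prev is not None and prev.end == token.start and re.fullmatch(r"\w+", text):
            marker = tweet[prev.start:prev.end]
            if marker == "#":
                annotations.append(Annotation("HashTag", prev.start, token.end))
            elif marker == "@":
                annotations.append(Annotation("UserID", prev.start, token.end))
        prev = token
    return annotations
```

Applied to the example tweet from slide 12, this sketch yields `HashTag` annotations over `#WiredBizCon`, `#nike` and `#SteveJobs`, a `UserID` annotation over `@Apple`, and a single token for `http://nikeplus.com`. Because the hashtag body is its own token, a downstream entity recogniser can match `nike` without looking below token level.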

Editor's Notes

  1. Leon, in the paper you show ANNIE 60% on the dev set. The above 40% is on the entire ds that’s in svn. Feel free to replace that table, as you like. I could not load the dev set into GATE, due to its strange format. I am sure there’s a script somewhere that’ll convert it into a proper .conll format, I just had no time to find and run it. It’s ok, nobody will notice perhaps :)
  2. These are mostly politicians. Often their names are preceded by their titles. There is also bigger context, within which entity coreference helps with detection (e.g. Atef and Mohammed Atef; bin Laden and Osama bin Laden).
  3. These are names of friends, singers, artists, sportspeople, and celebrities. Often in lowercase, referred to by first or surname only and sometimes misspelled.
  4. Hashtags: some contain locations, some contain person names, and others are phrases. For the @Mentions: IIRC Ritter (or some similar recent paper on Twitter NER) wrote that @mentions were excluded from their evaluation, since they are trivially recognisable as persons. Well, the point is that they are not all persons (this used to be true). Now we have locations/facilities, organisations, as well as some products, research projects, and the like. Hence, even though it’s trivial to identify an @mention as an NE, assigning it the appropriate NE type is far from a solved problem!