University of Sheffield, NLP
Text Analysis with GATE
Diana Maynard
University of Sheffield
Big Social Data Workshop
Reading, UK, April 2015
© The University of Sheffield, 2015
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
Outline of the Tutorial
• Introduction to NLP and Information Extraction
• Introduction to GATE
• Social media analysis
• Example applications: semantic search, Nesta, DecarboNet
We are all connected to each other...
● Information, thoughts and opinions are shared prolifically on the social web these days
● 72% of online adults use social networking sites
Your grandmother is three times as likely to
use a social networking site now as in 2009
There are hundreds of tools for social media
analytics
● Most of them are commercial and not freely available
● The research tools tend to focus on specific topics and
scenarios, and aren't easily adaptable
● The analysis they do often doesn't go much beyond number
crunching, e.g.
– look at number of tweets, retweets, favourites
– filter by hashtag or keyword for topic categorisation
– use off-the-shelf sentiment tools
– use counts of word length, POS categories etc
– very little semantics, don't deal with variation, ambiguity,
slang, sarcasm etc.
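To make the "number crunching" point concrete, here is a minimal Python sketch of hashtag counting and keyword filtering. The function names and the sample list are illustrative assumptions, not part of any particular analytics tool; the tweets are shortened versions of examples used elsewhere in these slides.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def hashtag_counts(tweets):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

def filter_by_keyword(tweets, keyword):
    """Keep tweets that contain the keyword as a case-insensitive substring."""
    needle = keyword.lower()
    return [t for t in tweets if needle in t.lower()]

# Hypothetical sample data for illustration only.
tweets = [
    "Want to solve the problem of #ClimateChange? Just #vote for a #politician! #sarcasm",
    "Human Caused #ClimateChange is a Monumental Scam!",
    "politics makes #climatechange scientific issue",
]

counts = hashtag_counts(tweets)
matches = filter_by_keyword(tweets, "climatechange")
print(counts.most_common(3))
print(len(matches), "tweets match the keyword")
```

All three tweets pass the keyword filter, yet one is sarcastic, one is hostile and one is a comment on politics. Counting and filtering alone cannot tell them apart, which is the gap that semantic analysis is meant to fill.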
Analysing Social Media is harder than it sounds
There are lots of things to think about!
Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people
don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
Let's search for keywords like “Arctic”
Oops!
Seems like we need something to help!
How about NLP?
It is difficult to access unstructured information
efficiently
Information extraction tools can help you:
● Save time and money on management of text and data from
multiple sources
● Find hidden links scattered across huge volumes of diverse
information
● Integrate structured data from a variety of sources
● Interlink text and data
● Collect information and extract new facts
What is Entity Recognition?
● Entity Recognition is about recognising and classifying key Named Entities and terms in the text
● A Named Entity is a Person, Location, Organisation, Date etc.
● A term is a key concept or phrase that is representative of the text
● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference.
Example (labels from the slide figure):
Mitt Romney, the favorite to win the Republican nomination for president in 2012
[Person, Term, Date]
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
[Organisation, co-reference, Location]
What is Event Recognition?
● An event is an action or situation relevant to the domain expressed by some relation between entities or terms.
● It is always grounded in time, e.g. the performance of a band, an election, the death of a person
Example (labels from the slide figure):
Mitt Romney, the favorite to win the Republican nomination for president in 2012
[Person, Event, Date; Relation, Relation]
Why are Entities and Events Useful?
● They can help answer the “Big 5” journalism questions
(who, what, when, where, why)
● They can be used to categorise the texts in different ways
– look at all texts about Obama.
● They can be used as targets for opinion mining
– find out what people think about President Obama
● When linked to an ontology and/or combined with other
information, they can be used for reasoning about things
not explicit in the text
– seeing how opinions about different American
presidents have changed over the years
Approaches to Information Extraction
Knowledge Engineering
● rule based
● developed by experienced language engineers
● make use of human intuition
● easier to understand results
● development could be very time consuming
● some changes may be hard to accommodate
Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● requires large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus
Seems like we need a tool to do this clever
stuff for us.
How about GATE?
What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield
over the last 20 years.
It's open source and freely available. http://gate.ac.uk
• components for language processing, e.g. parsers, machine
learning tools, stemmers, IR tools, IE components for various
languages...
• tools for visualising and manipulating text, annotations,
ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
GATE components
● Language Resources (LRs), e.g. lexicons, corpora,
ontologies
● Processing Resources (PRs), e.g. parsers, generators,
taggers
● Visual Resources (VRs), i.e. visualisation and editing components
● Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic preprocessing components for tokenisation,
sentence splitting etc
– components for performing named entity (NE) recognition on general open domain text
ANNIE Modules
Document with Tokens
Document with Sentences
Gazetteer editor
[Screenshot labels: definition file entries; entries for selected list]
Named Entity Grammars
• Hand-coded rules written in JAPE applied to annotations to
identify NEs
• Phases run sequentially and constitute a cascade of finite-state transducers (FSTs) over annotations
• Input annotations come from format analysis, tokeniser, sentence splitter, POS tagger, morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over
a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates,
addresses, money
• Basic NE grammars can be adapted for new applications, domains
and languages
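The idea of phases building on each other's annotations can be sketched in plain Python. This is not JAPE and not ANNIE's actual grammar: the gazetteer lists, the rule and all function names below are hypothetical, chosen only to show a lookup phase feeding a later named entity phase.

```python
import re

# Hypothetical gazetteer lists; ANNIE's real lists are far larger.
GAZETTEER = {
    "title": {"Mr", "Mrs", "Dr", "President"},
    "location": {"Ohio", "Sheffield"},
}

def tokenise(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def gazetteer_phase(tokens):
    """Phase 1: add a Lookup annotation for every token found in a list."""
    annots = []
    for i, tok in enumerate(tokens):
        for list_name, entries in GAZETTEER.items():
            if tok in entries:
                annots.append({"type": "Lookup", "kind": list_name,
                               "start": i, "end": i + 1})
    return annots

def ne_phase(tokens, annots):
    """Phase 2: derive NE annotations from the Lookups made in phase 1."""
    result = list(annots)
    for a in annots:
        if a["kind"] == "location":
            result.append({"type": "Location", "start": a["start"], "end": a["end"]})
        elif a["kind"] == "title":
            # Toy rule: a title followed by capitalised tokens is a Person.
            end = a["end"]
            while end < len(tokens) and tokens[end][0].isupper():
                end += 1
            if end > a["end"]:
                result.append({"type": "Person", "start": a["end"], "end": end})
    return result

def run(text):
    """Run both phases and return (type, text) pairs for the NE annotations."""
    tokens = tokenise(text)
    annots = ne_phase(tokens, gazetteer_phase(tokens))
    return [(a["type"], " ".join(tokens[a["start"]:a["end"]]))
            for a in annots if a["type"] != "Lookup"]

entities = run("President Barack Obama spoke in Ohio.")
print(entities)
```

The second phase never looks at raw text, only at tokens and the Lookup annotations left by the first phase. That is the sense in which annotations are built up over a sequence of phases.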
Document with NEs
Coreference
Hands-on: Running ANNIE on news texts
● Start GATE by double clicking on the icon
● Because ANNIE is a ready-made application, we can just load it directly from the menu
● Right click on Applications, select “Ready-made applications” and then ANNIE
● Create a new corpus (right click on Language Resources → New → GATE corpus and click “OK”)
● Populate the corpus (right click on the corpus, and use the file chooser in Directory URL to select the hands-on/corpora/news-texts directory. Click “Open” and then “OK”)
● Double click on ANNIE and select “Run this application”
● This will now run the application on all the documents in your corpus
Hands-on: Looking at the results
● Now you can inspect your annotated data: double click on a
document to open it
● Select “Annotation Sets” and “Annotation List” from the tabs
● Click on the top arrow above “Original markups” in the right
pane
● You should see a mixture of Named Entity annotations
(Person, Location etc) and some other linguistic annotations
(Token, Sentence etc) in the Default annotation set
● Click on the name of an annotation in the right hand pane that
you want to view (you can select multiple annotations)
● Hover over the annotation in the text to get more info
● Try the Annotation Stack for a different view
3. Analysing Social Media
Challenges for NLP
● Noisy language: unusual punctuation, capitalisation, spelling, use
of slang, sarcasm etc.
● Terse nature of microposts such as tweets
● Use of hashtags, @mentions etc causes problems for
tokenisation #thisistricky
● Lack of context gives rise to ambiguities
● NER performs poorly on microposts, mainly because of linguistic
pre-processing failure
– Performance of standard IE tools decreases from ~90% to
~40% when run on tweets rather than news articles
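As an illustration of why hashtags are awkward for tokenisation, the sketch below splits a hashtag such as #thisistricky into words by dictionary lookup with dynamic programming. The tiny vocabulary and the function name are assumptions made for this example; this is not how TwitIE itself is implemented.

```python
# Hypothetical mini-vocabulary; a real system would use a large word list.
VOCAB = {"this", "is", "tricky", "his", "trick", "climate", "change"}

def segment(tag, vocab=VOCAB):
    """Split a hashtag body into known words.

    Uses dynamic programming over prefixes and prefers the split with the
    fewest words. Returns None when no complete split exists.
    """
    body = tag.lstrip("#").lower()
    best = [None] * (len(body) + 1)   # best[i] = best split of body[:i]
    best[0] = []
    for end in range(1, len(body) + 1):
        for start in range(end):
            word = body[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(body)]

print(segment("#thisistricky"))
print(segment("#ClimateChange"))
print(segment("#TVP"))
```

Acronym-style tags such as #TVP have no dictionary split at all, so a tokeniser needs a fallback that keeps them as a single token.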
Lack of context causes ambiguity
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
??
Getting the NEs right is crucial
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
Hands-on with TwitIE
● First we need to load the TwitIE plugin
● Click on the green jigsaw-shaped icon at the top to load the
plugins manager (this takes a few seconds to load)
● Scroll down to the Twitter plugin and select “Load now”.
● Click “Apply All” and then “Close”.
● Now right-click on “Applications”, select “Ready-made
applications” and “TwitIE”
● Create another new corpus, name it “Tweets” and populate
with texts from hands-on/corpora/tweets
● Once loaded, double click on TwitIE to open it, and then select
“Run this application” (make sure the tweets corpus is chosen)
● Use the Annotation Stack view to look at Tokens in hashtags
Semantic Search with GATE
GATE Mímir
• can be used to index and search over text, annotations,
semantic metadata (concepts and instances)
• allows queries that arbitrarily mix full-text, structural, linguistic
and semantic annotations
• is open source
What can GATE Mímir do that Google can't?
Show me:
• all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
• all abstracts written in French on patent documents from the last
6 months which mention any form of the word “transistor” in the
English abstract
• the names of the patent inventors of those abstracts
• all documents mentioning steel industries in the UK, along with
their location
Search News Articles for
Politicians born in Sheffield
http://demos.gate.ac.uk/mimir/gpd/search/gus
Mímir: Searching Text Mining Results
● Searching and managing text annotations, semantic information, and full text documents in one search engine
● Queries over annotation graphs
● Regular expressions, Kleene operators
● Designed to be integrated as a web service in custom end-user systems with bespoke interfaces
● Demos at http://services.gate.ac.uk/mimir/
Hands-on with Semantic Search
Try these queries on the BBC News demo:
http://services.gate.ac.uk/mimir/gpd/search/index
● Gordon Brown
● Gordon Brown said
● Gordon Brown [0..3] root:say
● {Person} [0..3] root:say
● {Person.gender=female}[0..3] root:say
● Make sure you type the queries EXACTLY or they probably won't work!
● Try making up some of your own queries.
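One way to read a query like {Person} [0..3] root:say is: a Person annotation, then a gap of zero to three tokens, then a token whose root is "say". The Python sketch below imitates that reading over a hand-built token list. It is an assumption-laden toy, not Mímir's implementation; the data structures and the function name are invented for illustration.

```python
def tok(string, root=None):
    """Build a toy token; the root defaults to the lower-cased string."""
    return {"string": string, "root": root or string.lower()}

def match_person_say(tokens, persons, min_gap=0, max_gap=3, root="say"):
    """Find Person spans followed, after min_gap..max_gap tokens,
    by a token with the given root.

    persons is a list of (start, end) token offsets, end exclusive.
    Returns (person text, matched token string) pairs.
    """
    hits = []
    for start, end in persons:
        for gap in range(min_gap, max_gap + 1):
            i = end + gap
            if i < len(tokens) and tokens[i]["root"] == root:
                person = " ".join(t["string"] for t in tokens[start:end])
                hits.append((person, tokens[i]["string"]))
    return hits

# Hypothetical pre-annotated sentence: "Gordon Brown has also said that ..."
tokens = [tok("Gordon"), tok("Brown"), tok("has"), tok("also"),
          tok("said", root="say"), tok("that")]
persons = [(0, 2)]  # token span of the Person annotation

hits = match_person_say(tokens, persons)
print(hits)
```

Because the match is on the root rather than the surface string, "said", "says" and "saying" would all be found, which is something a plain full-text search for "Gordon Brown said" cannot do.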
Try your hand with some SPARQL
(for the more adventurous!)
● {Person inst ="http://dbpedia.org/resource/Gordon_Brown"} [0..3] root:say
● {Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} [0..3] root:say
● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> }" } [0..3] root:say
● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> . ?inst :almaMater <http://dbpedia.org/resource/University_of_Edinburgh> }" } [0..3] root:say
Can you work out what these do?
We don't just have to look at politicians saying and
measuring things
● If we first process the text with other NLP tools such as
sentiment analysis, we can also search for positive or negative
documents
● Or positive/negative comments about certain
people/topics/things/companies
● In the DecarboNet project, we're looking at people's opinions
about topics relating to climate change, e.g. fracking
● We could index on the sentiment annotations too
● Other people are combining opinion mining and Mímir to analyse, for example, customer feedback about products and companies in a huge corpus
A positive tweet
A negative tweet
A Sarcastic Tweet
What diseases are in these documents?
What Pathogens?
Disease vs Disease Co-occurrences
Diseases vs Pathogens
The Political Futures Tracker
● Example of using the technology on a real scenario - analysing
political tweets in the run-up to the UK elections
● Project funded by Nesta http://www.nesta.org.uk/
● Series of blog posts about the project, leading up to the
election
http://www.nesta.org.uk/news/political-futures-tracker
● Some of our own more technical blog posts about the
underlying tools:
http://gate.ac.uk/projects/pft/
Recognition of MPs / Candidates
Topic recognition
This term is found by first recognising the head word “job” from a list under the theme “employment” and matching against its root form in the text, i.e. “job”. It is then extended to include the adjectival modifier “British”, which is not present in a list anywhere.
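The two steps described above (match a head word by root form, then extend leftwards over adjectival modifiers) can be sketched as follows. The theme list, the token format with Penn Treebank style POS tags and the sample sentence are all assumptions for illustration; the project's actual grammar is not shown in these slides.

```python
# Hypothetical theme list: theme name -> head words in root form.
THEMES = {"employment": {"job", "wage"}}

def find_topic_terms(tokens, themes=THEMES):
    """Return (theme, term text) pairs.

    A term starts from a token whose root is a listed head word and is
    extended leftwards over any directly preceding adjectives (POS "JJ").
    """
    terms = []
    for i, t in enumerate(tokens):
        for theme, heads in themes.items():
            if t["root"] in heads:
                start = i
                while start > 0 and tokens[start - 1]["pos"] == "JJ":
                    start -= 1
                text = " ".join(x["string"] for x in tokens[start:i + 1])
                terms.append((theme, text))
    return terms

# Hypothetical pre-processed sentence: "We need British jobs"
tokens = [
    {"string": "We", "root": "we", "pos": "PRP"},
    {"string": "need", "root": "need", "pos": "VBP"},
    {"string": "British", "root": "british", "pos": "JJ"},
    {"string": "jobs", "root": "job", "pos": "NNS"},
]

terms = find_topic_terms(tokens)
print(terms)
```

Only the head word needs to be in a list. The modifier is picked up from the POS tags, so new combinations are recognised without having to list them in advance.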
Positive opinion about science and innovation
What do the different parties talk about?
Conservative vs Labour
Transport, Europe and employment are
mentioned more by Conservatives
NHS is mentioned more by Labour
Where did people tweet about the economy?
Terms that co-occur with immigration topic
Which tweets express sentiment?
Summary
● Text mining is a very useful pre-requisite for doing all kinds of
more interesting things, like semantic search
● Semantic web search allows you to do much more interesting
kinds of search than a standard text-based search
● Text mining is hard, so it won't always be correct
● This is especially true on lower quality text such as social media
● On the other hand, social media has some of the most
interesting material to look at
● Still plenty of work to be done, but there are lots of tools that
you can use now and get useful results
● We run annual GATE training courses where you can spend a
whole week learning all this and more!
Acknowledgements and further information
● Research partially supported by the European Union under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7), project DecarboNet (610829) http://www.decarbonet.eu, and by Nesta http://nesta.org.uk
● Annual GATE training course in June in Sheffield: https://gate.ac.uk/family/training.html
● Download GATE: http://gate.ac.uk/download
Relevant publications
● K. Bontcheva, V. Tablan, H. Cunningham. Semantic Search over Documents and Ontologies. Bridging Between Information Retrieval and Databases, 31-53, 2014.
● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open-source semantic search framework for interactive information seeking and discovery. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2014.
● A. Dietzel and D. Maynard. Climate Change: A Chance for Political Re-Engagement? In Proc. of the Political Studies Association 65th Annual International Conference, April 2015, Sheffield, UK.
● D. Maynard. Challenges in Analysing Social Media. In Adrian Duşa, Dietrich Nelle, Günter Stock and Gert G. Wagner (eds.): Facing the Future: European Research Infrastructures for the Humanities and Social Sciences. SCIVERO Verlag, Berlin, 2014.
● These and other relevant papers on the GATE website: http://gate.ac.uk/gate/doc/papers.html
