Tools for (Almost) Real-Time Social Media Analysis

University of Sheffield, NLP
Dr. Diana Maynard
Dept of Computer Science
University of Sheffield, UK
19 March 2015, Vienna

We are all connected to each other...
● Information, thoughts and opinions are shared prolifically on the
social web these days
● 72% of online adults use social networking sites

Your grandmother is three times as likely to
use a social networking site now as in 2009

There are hundreds of tools for social media
analytics
● Most of them are commercial and not freely available
● The research tools tend to focus on specific topics and
scenarios, and aren't easily adaptable
● The analysis they do often doesn't go much beyond number
crunching, e.g.
– look at number of tweets, retweets, favourites
– filter by hashtag or keyword for topic categorisation
– use off-the-shelf sentiment tools
– use counts of word length, POS categories etc
– very little semantics, don't deal with variation, ambiguity,
slang, sarcasm etc.

Analysing Social Media is harder than it sounds
There are lots of things to think about!

Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people
don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!

Let's search for keywords like “Arctic”
Oops!

Seems like we need something to help!
How about NLP?

It is difficult to access unstructured information
efficiently
Information extraction tools can help you:
● Save time and money on management of text and data from
multiple sources
● Find hidden links scattered across huge volumes of diverse
information
● Integrate structured data from variety of sources
● Interlink text and data
● Collect information and extract new facts

What is Entity Recognition?
● Entity Recognition is about recognising and classifying key Named
Entities and terms in the text
● A Named Entity is a Person, Location, Organisation, Date etc.
● A term is a key concept or phrase that is representative of the text
● Entities and terms may be described in different ways but refer to
the same thing. We call this co-reference.
Example:
Mitt Romney, the favorite to win the Republican nomination for president in 2012
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
Annotations shown: Person, Term, Date, Organisation, Location, co-reference
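A minimal sketch of the idea in Python, assuming a tiny hand-made gazetteer and a regular expression for years. This is an illustration only, not how ANNIE or GATE work internally; the `GAZETTEER` entries, the `YEAR` pattern and the `annotate` function are all hypothetical names introduced here.

```python
import re

# Hypothetical mini-gazetteer: surface form -> entity type (illustrative only).
GAZETTEER = {
    "Mitt Romney": "Person",
    "GOP": "Organisation",
    "Ohio": "Location",
}

# Assumption: a four-digit number starting with 19 or 20 is treated as a Date.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def annotate(text):
    """Return (start, end, type, surface) tuples, sorted by start offset."""
    found = []
    for surface, kind in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            found.append((m.start(), m.end(), kind, surface))
    for m in YEAR.finditer(text):
        found.append((m.start(), m.end(), "Date", m.group()))
    return sorted(found)


s1 = "Mitt Romney, the favorite to win the Republican nomination for president in 2012"
s2 = "The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior."
for start, end, kind, surface in annotate(s1) + annotate(s2):
    print(kind, surface)
```

A lookup like this cannot find terms or co-reference (e.g. linking "GOP" to "Republican"); those need the richer processing described in the following slides.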

What is Event Recognition?
● An event is an action or situation relevant to the domain
expressed by some relation between entities or terms.
● It is always grounded in time, e.g. the performance of a
band, an election, the death of a person
Example:
Mitt Romney, the favorite to win the Republican nomination for president in 2012
Annotations shown: Event, Person, Date, linked by Relations

Why are Entities and Events Useful?
● They can help answer the “Big 5” journalism questions
(who, what, when, where, why)
● They can be used to categorise the texts in different ways
– look at all texts about Obama.
● They can be used as targets for opinion mining
– find out what people think about President Obama
● When linked to an ontology and/or combined with other
information, they can be used for reasoning about things
not explicit in the text
– seeing how opinions about different American
presidents have changed over the years

Approaches to Information Extraction
Knowledge Engineering
● rule based
● developed by experienced language engineers
● make use of human intuition
● easier to understand results
● development could be very time consuming
● some changes may be hard to accommodate

Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● requires large amounts of annotated training data
● some changes may require re-annotation of the entire training
corpus

Seems like we need a tool to do this clever
stuff for us.
How about GATE?

What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield
over the last 20 years.
It's open source and freely available. http://gate.ac.uk
• components for language processing, e.g. parsers, machine
learning tools, stemmers, IR tools, IE components for various
languages...
• tools for visualising and manipulating text, annotations,
ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools

GATE components
● Language Resources (LRs), e.g. lexicons, corpora,
ontologies
● Processing Resources (PRs), e.g. parsers, generators,
taggers
● Visual Resources (VRs), i.e. visualisation and editing
components
● Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource

ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic preprocessing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text

ANNIE Modules

Document with Tokens

Document with Sentences

Gazetteer editor
[Screenshot: definition file, entries, entries for selected list]

Named Entity Grammars
• Hand-coded rules written in JAPE applied to annotations to
identify NEs
• Phases run sequentially and constitute a cascade of FSTs over
annotations
• Annotations from format analysis, tokeniser, splitter, POS tagger,
morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over
a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates,
addresses, money
• Basic NE grammars can be adapted for new applications, domains
and languages
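The cascade idea can be sketched in Python as a list of phases that each read the annotations made so far and add new ones. This is only an analogue of the approach: it is not JAPE syntax, and the phase functions, the `TITLES` list and the annotation dictionaries are assumptions made for illustration.

```python
import re


def tokenise(text, anns):
    # Phase 1: one Token annotation per word.
    for m in re.finditer(r"\w+", text):
        anns.append({"type": "Token", "start": m.start(),
                     "end": m.end(), "string": m.group()})


TITLES = {"Dr", "Mr", "Mrs", "Prof"}  # hypothetical gazetteer list


def lookup_titles(text, anns):
    # Phase 2: mark tokens that appear in the title list.
    for tok in [a for a in anns if a["type"] == "Token"]:
        if tok["string"] in TITLES:
            anns.append({"type": "Title", "start": tok["start"], "end": tok["end"]})


def person_rule(text, anns):
    # Phase 3: a Title followed by one or more capitalised tokens -> Person.
    tokens = [a for a in anns if a["type"] == "Token"]
    title_starts = {a["start"] for a in anns if a["type"] == "Title"}
    for i, tok in enumerate(tokens):
        if tok["start"] in title_starts:
            j = i + 1
            while j < len(tokens) and tokens[j]["string"][0].isupper():
                j += 1
            if j > i + 1:
                anns.append({"type": "Person", "start": tok["start"],
                             "end": tokens[j - 1]["end"]})


def run_cascade(text, phases):
    # Phases run in order; later phases see annotations from earlier ones.
    anns = []
    for phase in phases:
        phase(text, anns)
    return anns


text = "Dr Diana Maynard works in Sheffield"
anns = run_cascade(text, [tokenise, lookup_titles, person_rule])
people = [text[a["start"]:a["end"]] for a in anns if a["type"] == "Person"]
print(people)
```

The Person rule only works because the Token and Title annotations already exist when it runs, which is the point of running phases sequentially.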

Document with NEs

Coreference

Right, so we have a technology, and we have
a tool to apply the technology.
Now how do we do it?

Framework
● Data collection (via Twitter streaming API)
● Documents stored as JSON and processed (annotated) via
GCP
● Documents indexed via MIMIR
● Search and visualisation via MIMIR/Prospector

Live streaming (coming soon)
● If we want to process the tweets in real time, we can use the
Twitter streaming client to feed the incoming tweets to a
message queue.
● A separate process then reads messages from the queue,
annotates them and pushes them into Mimir.
● If the rate of incoming tweets exceeds the capacity of the
processing side, we can simply launch more instances of the
message consumer across different machines to scale the
capacity.
● Query and visualisation can then be performed as before on
whatever data we currently have available
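The queue-and-consumers pattern described above can be sketched with Python's standard library. This is a single-process stand-in under stated assumptions: threads play the role of consumer instances on separate machines, `annotate` stands in for the real annotation pipeline, and a plain list stands in for the Mimir index. None of these names come from the actual system.

```python
import queue
import threading


def annotate(tweet):
    # Stand-in for the real annotation pipeline (assumption).
    return {"text": tweet, "n_tokens": len(tweet.split())}


def consumer(q, index, lock):
    # Read messages until the sentinel (None) arrives.
    while True:
        tweet = q.get()
        if tweet is None:
            q.task_done()
            break
        doc = annotate(tweet)
        with lock:
            index.append(doc)  # stand-in for pushing the document to the index
        q.task_done()


def run(tweets, n_consumers=2):
    q = queue.Queue()
    index, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(q, index, lock))
               for _ in range(n_consumers)]
    for w in workers:
        w.start()
    for t in tweets:          # the streaming client would do this continuously
        q.put(t)
    for _ in workers:         # one sentinel per consumer
        q.put(None)
    for w in workers:
        w.join()
    return index


indexed = run(["It's cold in my flat", "Vote on fracking today", "Hello Kitty"],
              n_consumers=3)
print(len(indexed))
```

Raising `n_consumers` is the in-process equivalent of launching more consumer instances when the incoming rate exceeds processing capacity.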

Let's look at some examples
(For anyone who grew up in the UK):
“Here's one I prepared earlier”

DecarboNet project: what do people think about climate change?
And how much do we really know about it?
How do we know what's really true?
“It's cold in my flat”

Political Futures Tracker Application
● Example of using the technology on a real scenario - analysing
political tweets in the run-up to the UK elections
● Project funded by Nesta http://www.nesta.org.uk/
● Series of blog posts about the project, leading up to the
election, see e.g.
http://www.nesta.org.uk/blog/silver-surfers-and-westminster-twitterati

Twitter collection
● collected Tweets using Twitter's “statuses/filter” streaming API
● can follow up to 5000 user IDs and receive in real time
● collected all tweets and retweets posted by these users
● also retweets of, and replies to, any tweet posted by these
users

Twitter collection (2)
● Initial list of 506 UK MPs' Twitter accounts, extracted from a
CSV file made available by BBC News Labs and cleaned
● Also added list of UK election candidates collected and made
available at https://yournextmp.com, and updated periodically
– 1,504 on 13th January 2015
– 1,811 on 2nd February 2015
● Added 21 official party accounts
● Total number of accounts followed at 16th Feb: 1,894
– 444 MPs standing again are included in both the MP and
candidate lists
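The figures above can be checked with a short calculation. It assumes the candidate list still held the 1,811 entries of 2nd February when the total was taken on 16th February; the variable names are illustrative.

```python
mps = 506            # MPs' Twitter accounts
candidates = 1811    # candidate accounts (2nd February 2015 figure, assumed unchanged)
parties = 21         # official party accounts
overlap = 444        # MPs standing again, present in both lists

# Count the overlapping accounts once rather than twice.
total = mps + candidates + parties - overlap
print(total)

# The "statuses/filter" limit quoted earlier is 5000 followed user IDs.
fits_in_one_stream = total <= 5000
print(fits_in_one_stream)
```

Under that assumption the numbers agree with the reported total of 1,894, which sits well within the 5000-ID limit.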

Tweets per hour collected
Government U-turn
on fracking
Douglas Carswell's accidental
“Hello Kitty” tweet

Longer web documents
● Also crawled websites of UK political parties (Con, Lab, LD,
Green, UKIP, BNP, SNP, PC, plus the NI parties and various
smaller parties)
● Initial crawl on 28th-29th October retrieved 375MB (compressed)
● Re-crawled regularly to pick up new pages

Politician and candidate annotation
● Acquired and corrected a list of UK MPs and election candidates
and their affiliations, twitter accounts and DBpedia URIs
● Converted to gazetteers so that MPs in various forms (name or
twitter handle) can be recognised in tweets and annotated with
the relevant info (URI, full name, constituency etc.)
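A rough Python sketch of this kind of lookup, assuming each politician record is indexed under both name and Twitter handle. The record shown is a placeholder (not a real politician, handle or DBpedia URI), and `build_gazetteer` and `find_politicians` are hypothetical helper names, not GATE components.

```python
import re

# Hypothetical record; name, handle, URI and constituency are placeholders.
POLITICIANS = [
    {"name": "Jane Example",
     "handle": "@jane_example",
     "uri": "http://example.org/resource/Jane_Example",
     "constituency": "Exampleton"},
]


def build_gazetteer(records):
    # Index every record under each of its surface forms, lowercased.
    gaz = {}
    for rec in records:
        gaz[rec["name"].lower()] = rec
        gaz[rec["handle"].lower()] = rec
    return gaz


def find_politicians(text, gaz):
    """Return sorted (start, end, uri) tuples for every gazetteer match."""
    hits = []
    lowered = text.lower()
    for surface, rec in gaz.items():
        # Require a non-word, non-@ character (or the text edge) on both sides.
        pattern = r"(?<![\w@])" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, lowered):
            hits.append((m.start(), m.end(), rec["uri"]))
    return sorted(hits)


gaz = build_gazetteer(POLITICIANS)
tweet = "Great speech by @Jane_Example today; Jane Example was on form"
hits = find_politicians(tweet, gaz)
for start, end, uri in hits:
    print(tweet[start:end], "->", uri)
```

Both mentions resolve to the same URI, so later analysis can treat the handle and the name as the same person.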

Recognition of MPs / Candidates

Topic Recognition
● A set of themes was taken from the categories used on http://www.gov.uk
● For each theme, a gazetteer list was developed containing typical
keywords and phrases representative of that theme
● e.g. “asylum seeker” indicates the topic “borders and immigration”
● Each list was expanded via:
– automatic term recognition (based on tf.idf) over a corpus of
manifestos and other political documents
– manual additions
● Each list also contains potential head terms and modifiers which can be
expanded into longer terms on the fly from the text during analysis stage
● e.g. “terrorist” can modify many other words (terrorist attack, terrorist
threat, ...)
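As a rough illustration of tf.idf scoring, the sketch below uses one common textbook formulation (relative term frequency times the log of inverse document frequency). The slides do not say which variant was used, so the formula, the `tf_idf` function and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: count each term once per doc
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores


# Toy corpus (made up for illustration).
docs = [
    "asylum seeker numbers and border checks".split(),
    "nhs funding and hospital waiting times".split(),
    "border checks and asylum policy".split(),
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

Words that occur in every document, such as "and", score zero, while words concentrated in few documents score highest and become candidate terms for a theme list.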

Topic recognition
This term is found by first
recognising the head
word “job” from a list
under the theme
“employment” and
matching against its root
form in the text, i.e. “job”.
It is then extended to
include the adjectival
modifier “British”, which
is not present in a list
anywhere.
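The expansion step can be sketched as follows. The lookup tables stand in for the theme gazetteer, the morphological analyser and the POS tagger, and the input sentence is invented; the sketch assumes the term in the text is "British jobs", which the slide does not spell out.

```python
HEAD_TERMS = {"job": "employment"}        # hypothetical theme list: head word -> theme
ROOTS = {"jobs": "job"}                   # stand-in for a morphological analyser
ADJECTIVES = {"british", "new", "local"}  # stand-in for a POS tagger


def find_terms(tokens):
    """Return (term, theme) pairs, extending head words with preceding adjectives."""
    terms = []
    for i, tok in enumerate(tokens):
        root = ROOTS.get(tok.lower(), tok.lower())
        theme = HEAD_TERMS.get(root)
        if theme is None:
            continue
        start = i
        # Walk left over adjectival modifiers, which need not be in any list.
        while start > 0 and tokens[start - 1].lower() in ADJECTIVES:
            start -= 1
        terms.append((" ".join(tokens[start:i + 1]), theme))
    return terms


terms = find_terms("We will protect British jobs".split())
print(terms)
```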

Sentiment annotation
● Annotations are created over the whole sentence and contain
the following features:
– sentiment_kind: optimism / pessimism
– holder: the person holding the opinion (MP's name)
– holder_URI: the URI of the holder
– target: the target of the opinion, e.g. MP or topic
– target_URI: if appropriate, the URI of the target
– score: a positive/negative value reflecting the strength of opinion
– sarcasm: yes / no (whether sarcasm is present)
– sentiment_string: the main word(s) that contain sentiment
● These annotations and features will be used as input to MIMIR
to facilitate analysis/aggregation
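One way to picture such an annotation is as a small record with the features listed above. The Python dataclass below is only a sketch: the field types are assumptions, and the example values (names, URIs, score) are placeholders rather than real output.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SentimentAnnotation:
    sentiment_kind: str          # "optimism" or "pessimism"
    holder: str                  # person holding the opinion
    holder_URI: str
    target: str                  # e.g. an MP or a topic
    target_URI: Optional[str]    # only when the target has a URI
    score: float                 # sign gives polarity, magnitude gives strength
    sarcasm: str                 # "yes" or "no"
    sentiment_string: str        # main word(s) carrying the sentiment


# Placeholder values for illustration only.
ann = SentimentAnnotation(
    sentiment_kind="optimism",
    holder="Jane Example",
    holder_URI="http://example.org/resource/Jane_Example",
    target="science and innovation",
    target_URI=None,
    score=0.5,
    sarcasm="no",
    sentiment_string="great",
)
features = asdict(ann)
print(features["sentiment_kind"], features["score"])
```

Keeping the features in a flat key/value form like this is what makes them straightforward to index and aggregate later.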

Positive opinion about science and innovation

GATE Mímir:
Answering Questions Google Can't

GATE Mimir
● can be used to index and search over text,
annotations, semantic metadata (concepts and
instances)
● allows queries that arbitrarily mix full-text,
structural, linguistic and semantic annotations
● is open source

Show me:
● all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
● all abstracts written in French on Patent Documents from the
last 6 months which mention any form of the word “transistor”
in the English abstract
● the names of the patent inventors of those abstracts
● all documents mentioning steel industries in the UK, along with
their location

Search news articles for politicians born in Sheffield

Document Indexing with MIMIR
● MIMIR allows for indexing and querying text, annotations and
semantic knowledge
– this gives a rich source of data for analysis
● Currently we have used MIMIR to index
– the raw collected text
– annotations provided by Twitter and by the applications

Examples of Mimir queries on our corpus
● Get all documents which talk about the borders/immigration topic
{Topic theme = "borders_and_immigration"}
● Get all documents where the author of the document is a candidate
{DocumentAuthor sparql = "?c nesta:candidate ?author_uri"}
● Get all documents where the author is an MP standing for re-election for the
same seat
{DocumentAuthor sparql="?c nesta:candidate ?author_uri . ?c dbp-prop:mp ?author_uri"}
● Get all documents where the author is a candidate for the Sheffield Hallam
constituency
{DocumentAuthor sparql="<http://dbpedia.org/resource/Sheffield_Hallam_(UK_Parliament_constituency)> nesta:candidate ?author_uri"}

What do the different parties talk about?
Conservative vs Labour
Transport, Europe and employment are
mentioned more by Conservatives
NHS is mentioned more by Labour

Topics mentioned by UKIP

Topics mentioned by the Green Party
Expected topics are high on the list

How old are MPs?

Where did people tweet about the economy?

Measuring engagement about climate change
● We also used the tools to measure how engaged both MPs
and the public are about the topic of climate change
● Comparison of climate change with other political topics
● Theory is that people are quite apathetic about most political
topics in general
● But people are more enthusiastic about climate change,
because it's something they can actively do something about
● Results showed that climate change is not frequently tweeted
about by most politicians apart from the Green Party, but is in
top 3 topics for most of the engagement criteria we applied
(retweets, replies, sentiment, optimism, @mentions, URLs)
● Climate change tweets contained the highest number of URLs
- direct engagement with additional information
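One of the engagement criteria, average retweets per tweet for each topic, could be computed along these lines. The tweet records and their counts are made up for illustration, and the dictionary layout is an assumption rather than the project's actual data format.

```python
from collections import defaultdict


def average_retweets(tweets):
    """Return {topic: mean retweet count over tweets annotated with that topic}."""
    totals = defaultdict(lambda: [0, 0])   # topic -> [retweet sum, tweet count]
    for t in tweets:
        for topic in t["topics"]:
            totals[topic][0] += t["retweets"]
            totals[topic][1] += 1
    return {topic: s / n for topic, (s, n) in totals.items()}


# Made-up sample data.
tweets = [
    {"topics": ["climate_change"], "retweets": 12},
    {"topics": ["climate_change", "economy"], "retweets": 4},
    {"topics": ["economy"], "retweets": 2},
]
averages = average_retweets(tweets)
print(averages)
```

A tweet annotated with several topics contributes to each of them, so a topic can rank low on raw frequency but high on an engagement measure like this one.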

Average retweets per tweet

Terms that co-occur with environmental topic

Terms that co-occur with environment topic

Terms that co-occur with immigration topic

Climate change tweets often express sentiment

Summary
● Once you have the indexed data, you can carry on doing all
kinds of interesting comparisons and analysis.
● Simple analysis tools can give you pretty pictures, but you can
do much more interesting things when you delve a bit deeper
and make use of information not explicit in the text
● For this you need both NLP and Linked Open Data
● Our tools are all freely available and open source
