The document describes a presentation given by Dr. Diana Maynard from the University of Sheffield on tools for social media analysis. It discusses how social media use has increased and the challenges of analyzing language from social media. It introduces the GATE natural language processing toolkit developed at the University of Sheffield that can be used to extract entities, events, and sentiments from social media texts through machine learning and rule-based approaches. An example application of these tools to analyze political tweets for a project tracking the UK election is also mentioned.
Tools for (Almost) Real-Time Social Media Analysis
1. University of Sheffield, NLP
Tools for (Almost) Real-Time Social Media Analysis
Dr. Diana Maynard
Dept of Computer Science
University of Sheffield, UK
19 March 2015, Vienna
2. University of Sheffield, NLP
We are all connected to each other...
● Information, thoughts and opinions are shared prolifically on the social web these days
● 72% of online adults use social networking sites
3. University of Sheffield, NLP
Your grandmother is three times as likely to use a social networking site now as in 2009
4. University of Sheffield, NLP
There are hundreds of tools for social media analytics
● Most of them are commercial and not freely available
● The research tools tend to focus on specific topics and scenarios, and aren't easily adaptable
● The analysis they do often doesn't go much beyond number crunching, e.g.
– look at number of tweets, retweets, favourites
– filter by hashtag or keyword for topic categorisation
– use off-the-shelf sentiment tools
– use counts of word length, POS categories etc.
– very little semantics, don't deal with variation, ambiguity, slang, sarcasm etc.
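The kind of number crunching listed above can be sketched in a few lines. This is a minimal illustration only, not taken from any particular analytics tool; the function names and the sample tweets are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")

def hashtag_counts(tweets):
    """Count how often each hashtag occurs, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

def filter_by_hashtag(tweets, tag):
    """Keep only the tweets that contain the given hashtag (case-insensitive)."""
    wanted = tag.lower()
    return [t for t in tweets
            if wanted in (h.lower() for h in HASHTAG.findall(t))]

# Hypothetical sample data for illustration.
sample = [
    "Want to solve the problem of #ClimateChange? Just #vote for a #politician! #sarcasm",
    "Human Caused #ClimateChange is a Monumental Scam!",
    "Nothing to see here",
]
print(hashtag_counts(sample).most_common(1))
print(len(filter_by_hashtag(sample, "#climatechange")))
```

Both the sarcastic tweet and the hostile one land in the same "climate change" bucket, which is the limitation the slide points at: counting and filtering say nothing about what was meant.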
5. University of Sheffield, NLP
Analysing Social Media is harder than it sounds
There are lots of things to think about!
6. University of Sheffield, NLP
Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
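One small part of the difficulty shown in these tweets is tokenisation: a tokeniser built for news text tends to break URLs, @mentions and hashtags apart. The sketch below is a hedged illustration of a tweet-aware tokeniser using only Python's standard `re` module; the token categories and the `tokenize` function are assumptions for this example and do not reproduce GATE's tokeniser.

```python
import re

# Alternatives are tried in order, so URLs, mentions and hashtags
# are matched as whole units before generic words and punctuation.
TOKEN = re.compile(
    r"""
    (?P<url>https?://\S+)
  | (?P<mention>@\w+)
  | (?P<hashtag>\#\w+)
  | (?P<word>\w+(?:['’]\w+)*)
  | (?P<punct>[^\w\s])
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return (category, token) pairs for a tweet."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

for kind, tok in tokenize("@adambation Try reading this article , http://t.co/mo3vODoX #sarcasm"):
    print(kind, tok)
```

Tokenising is only the first step; misspellings such as "wat 2do" and the sarcasm in the second and third tweets still need handling further down the pipeline.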
9. University of Sheffield, NLP
It is difficult to access unstructured information efficiently
Information extraction tools can help you:
● Save time and money on management of text and data from multiple sources
● Find hidden links scattered across huge volumes of diverse information
● Integrate structured data from a variety of sources
● Interlink text and data
● Collect information and extract new facts
10. University of Sheffield, NLP
What is Entity Recognition?
● Entity Recognition is about recognising and classifying key Named Entities and terms in the text
● A Named Entity is a Person, Location, Organisation, Date etc.
● A term is a key concept or phrase that is representative of the text
● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference.
Mitt Romney, the favorite to win the Republican nomination for president in 2012
[Annotations shown on the slide: Person, Term, Date]
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
[Annotations shown on the slide: Organisation, Location, with a co-reference link]
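The definitions above can be made concrete with a toy recogniser. The sketch below is a minimal illustration, assuming a tiny hand-made gazetteer (a lookup list of known names) and a simple year pattern; it is not how GATE implements entity recognition. Treating "Republican" and "GOP" as mentions of the same organisation is an assumption made here to show co-reference.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int
    end: int
    canonical: str  # mentions sharing this id are treated as co-referent

# Hypothetical toy gazetteer: surface form -> (label, canonical id).
GAZETTEER = {
    "Mitt Romney": ("Person", "Mitt_Romney"),
    "Republican": ("Organisation", "Republican_Party"),
    "GOP": ("Organisation", "Republican_Party"),
    "Ohio": ("Location", "Ohio"),
}
# Assumption: a four-digit year starting with 19 or 20 is a Date.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def recognise(text):
    """Return entities found by gazetteer lookup and the year pattern, in text order."""
    found = []
    for surface, (label, canonical) in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            found.append(Entity(m.group(), label, m.start(), m.end(), canonical))
    for m in YEAR.finditer(text):
        found.append(Entity(m.group(), "Date", m.start(), m.end(), m.group()))
    return sorted(found, key=lambda e: e.start)

text = ("Mitt Romney, the favorite to win the Republican nomination for president in 2012. "
        "The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.")
for e in recognise(text):
    print(e.label, e.text, e.canonical)
```

A lookup list like this misses every name it has not seen and cannot resolve ambiguity, which is why real systems combine gazetteers with context rules or learned models.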
11. University of Sheffield, NLP
What is Event Recognition?
● An event is an action or situation relevant to the domain expressed by some relation between entities or terms.
● It is always grounded in time, e.g. the performance of a band, an election, the death of a person
Mitt Romney, the favorite to win the Republican nomination for president in 2012
[Annotations shown on the slide: Person, Event, Date, with two Relation links]
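One way to picture an event as described above is as a small record: a trigger, a date that grounds it in time, and named relations to the entities involved. The sketch below is an assumption-laden illustration; the class names, the role name `candidate` and the choice of "nomination" as the trigger are all hypothetical and not read off the slide.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Mention:
    text: str
    label: str  # e.g. "Person", "Date"

@dataclass
class Event:
    trigger: str
    date: Mention                                   # the temporal grounding
    relations: dict = field(default_factory=dict)   # role name -> Mention

def make_event(trigger, participants, date):
    """Build an event, refusing any that is not grounded in time."""
    if date is None or date.label != "Date":
        raise ValueError("an event must be grounded in time")
    return Event(trigger, date, dict(participants))

event = make_event(
    "nomination",
    {"candidate": Mention("Mitt Romney", "Person")},
    Mention("2012", "Date"),
)
print(event.trigger, event.relations["candidate"].text, event.date.text)
```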
12. University of Sheffield, NLP
Why are Entities and Events Useful?
● They can help answer the “Big 5” journalism questions (who, what, when, where, why)
● They can be used to categorise the texts in different ways
– look at all texts about Obama.
● They can be used as targets for opinion mining
– find out what people think about President Obama
● When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text
– seeing how opinions about different American presidents have changed over the years
13. University of Sheffield, NLP
Approaches to Information Extraction
Knowledge Engineering
● rule based
● developed by experienced language engineers
● make use of human intuition
● easier to understand results
● development could be very time consuming
● some changes may be hard to accommodate
Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● requires large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus
14. University of Sheffield, NLP
Seems like we need a tool to do this clever
stuff for us.
How about GATE?
15. University of Sheffield, NLP
What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield
over the last 20 years.
It's open source and freely available. http://gate.ac.uk
• components for language processing, e.g. parsers, machine
learning tools, stemmers, IR tools, IE components for various
languages...
• tools for visualising and manipulating text, annotations,
ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
16. University of Sheffield, NLP
GATE components
● Language Resources (LRs), e.g. lexicons, corpora,
ontologies
● Processing Resources (PRs), e.g. parsers, generators,
taggers
● Visual Resources (VRs), i.e. visualisation and editing
components
● Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
17. University of Sheffield, NLP
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic preprocessing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text
22. University of Sheffield, NLP
Named Entity Grammars
• Hand-coded rules written in JAPE applied to annotations to
identify NEs
• Phases run sequentially and constitute a cascade of FSTs over
annotations
• Annotations from format analysis, tokeniser, splitter, POS tagger,
morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over
a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates,
addresses, money
• Basic NE grammars can be adapted for new applications, domains
and languages
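The idea of sequential phases building up annotations can be illustrated with a minimal Python sketch. This is not JAPE and not GATE's API: the annotation tuples, the three phase functions and the tiny title list are all hypothetical, chosen only to show how a later phase can use annotations created by an earlier one.

```python
import re

# Hypothetical in-memory annotation: (start, end, type, features)
TITLES = {"Mr", "Mrs", "Dr"}  # toy gazetteer list, for illustration only

def tokenise(text, annotations):
    """Phase 1: add a Token annotation for every word."""
    for m in re.finditer(r"\w+", text):
        annotations.append((m.start(), m.end(), "Token", {"string": m.group()}))

def gazetteer(text, annotations):
    """Phase 2: mark Tokens that appear in the title list as Lookup."""
    for start, end, kind, feats in list(annotations):
        if kind == "Token" and feats["string"] in TITLES:
            annotations.append((start, end, "Lookup", {"majorType": "title"}))

def person_rule(text, annotations):
    """Phase 3: a title Lookup followed by a capitalised Token becomes a Person."""
    tokens = sorted((a for a in annotations if a[2] == "Token"), key=lambda a: a[0])
    lookup_starts = {a[0] for a in annotations if a[2] == "Lookup"}
    for current, following in zip(tokens, tokens[1:]):
        if current[0] in lookup_starts and following[3]["string"][0].isupper():
            annotations.append((current[0], following[1], "Person", {"rule": "TitlePerson"}))

def run_cascade(text):
    """Run the phases in order; each one sees the annotations added before it."""
    annotations = []
    for phase in (tokenise, gazetteer, person_rule):
        phase(text, annotations)
    return annotations

text = "Yesterday Dr Maynard gave a talk"
people = [text[s:e] for s, e, kind, _ in run_cascade(text) if kind == "Person"]
print(people)
```

The Person rule never looks at the raw text: it only consumes the Token and Lookup annotations produced by the earlier phases, which is the point of running phases as a cascade.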
25. University of Sheffield, NLP
Right, so we have a technology, and we have
a tool to apply the technology.
Now how do we do it?
26. University of Sheffield, NLP
Framework
● Data collection (via Twitter streaming API)
● Documents stored as JSON and processed (annotated) via
GCP
● Documents indexed via MIMIR
● Search and visualisation via MIMIR/Prospector
27. University of Sheffield, NLP
Live streaming (coming soon)
● If we want to process the tweets in real time, we can use the
Twitter streaming client to feed the incoming tweets to a
message queue.
● A separate process then reads messages from the queue,
annotates them and pushes them into Mimir.
● If the rate of incoming tweets exceeds the capacity of the
processing side, we can simply launch more instances of the
message consumer across different machines to scale the
capacity.
● Query and visualisation can then be performed as before on
whatever data we currently have available
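The queue-based design described above might look roughly like the following Python sketch. It is a single-process simplification under stated assumptions: the standard-library queue stands in for a real message queue, threads stand in for consumer instances on different machines, and annotate() and the index list are hypothetical placeholders for the annotation pipeline and Mimir.

```python
import queue
import threading

def annotate(tweet):
    """Placeholder for the annotation step (no real NLP happens here)."""
    return {"text": tweet, "length": len(tweet)}

def consumer(tweets, index, lock):
    """Read tweets from the queue, annotate them and store the result."""
    while True:
        tweet = tweets.get()
        if tweet is None:          # sentinel: no more tweets for this consumer
            break
        result = annotate(tweet)
        with lock:
            index.append(result)   # stands in for pushing into the index

def process(incoming, n_consumers=2):
    """Feed tweets to a queue and drain it with n_consumers workers."""
    tweets = queue.Queue()
    index, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(tweets, index, lock))
               for _ in range(n_consumers)]
    for worker in workers:
        worker.start()
    for tweet in incoming:         # stands in for the streaming client
        tweets.put(tweet)
    for _ in workers:
        tweets.put(None)           # one sentinel per consumer
    for worker in workers:
        worker.join()
    return index

indexed = process(["tweet one", "tweet two", "tweet three"], n_consumers=2)
print(len(indexed))
```

Raising n_consumers is the sketch's equivalent of launching more message consumers: the producer side does not change, only the number of readers on the queue.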
28. University of Sheffield, NLP
Let's look at some examples
(For anyone who grew up in the UK):
“Here's one I prepared earlier”
29. University of Sheffield, NLP
DecarboNet project: what do people think about climate change?
And how much do we really know about it?
How do we know what's really true?
“It's cold in my flat”
30. University of Sheffield, NLP
Political Futures Tracker Application
● Example of using the technology on a real scenario - analysing
political tweets in the run-up to the UK elections
● Project funded by Nesta http://www.nesta.org.uk/
● Series of blog posts about the project, leading up to the
election, see e.g.
http://www.nesta.org.uk/blog/silver-surfers-and-westminster-twitterati
31. University of Sheffield, NLP
Twitter collection
● collected Tweets using Twitter's “statuses/filter” streaming API
● can follow up to 5000 user IDs and receive their tweets in real time
● collected all tweets and retweets posted by these users
● also retweets of, and replies to, any tweet posted by these
users
32. University of Sheffield, NLP
Twitter collection (2)
● Initial list of 506 UK MPs' Twitter accounts, extracted from a
CSV file made available by BBC News Labs and cleaned
● Also added list of UK election candidates collected and made
available at https://yournextmp.com, and updated periodically
– 1,504 on 13th January 2015
– 1,811 on 2nd February 2015
● Added 21 official party accounts
● Total number of accounts followed at 16th Feb: 1,894
– 444 MPs standing again are included in both the MP and
candidate lists
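The total appears to follow from the other figures on this slide once the accounts that occur in both lists are counted only once. The check below is a small sketch of that arithmetic; it assumes the 2nd February candidate figure is the one that applied on 16th February, which the slide does not state explicitly.

```python
mps = 506              # MP accounts in the initial list
candidates = 1811      # candidate accounts on 2nd February 2015 (assumed still current)
party_accounts = 21    # official party accounts
in_both_lists = 444    # MPs standing again, present in both the MP and candidate lists

# Subtract the overlap so that each account is counted once.
total = mps + candidates + party_accounts - in_both_lists
print(total)  # 1894
```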
33. University of Sheffield, NLP
Tweets per hour collected
[Chart of tweets per hour; annotated events: Government U-turn on fracking; Douglas Carswell's accidental “Hello Kitty” tweet]
34. University of Sheffield, NLP
Longer web documents
● Also crawled websites of UK political parties (Con, Lab, LD,
Green, UKIP, BNP, SNP, PC, plus the NI parties and various
smaller parties)
● Initial crawl on 28th-29th October retrieved 375MB
(compressed)
● Re-crawled regularly to pick up new pages
35. University of Sheffield, NLP
Politician and candidate annotation
● Acquired and corrected a list of UK MPs and election candidates
and their affiliations, twitter accounts and DBpedia URIs
● Converted to gazetteers so that MPs in various forms (name or
twitter handle) can be recognised in tweets and annotated with
the relevant info (URI, full name, constituency etc.)
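A gazetteer of this kind can be thought of as a mapping from every surface form to a record of metadata. The following Python sketch shows that idea only; it is not GATE's gazetteer format, and the entry (name, handle, URI and constituency) is an invented placeholder.

```python
import re

# Hypothetical gazetteer entries; names, handles and URIs are placeholders.
ENTRIES = [
    {"name": "Jane Example", "handle": "@jane_example",
     "uri": "http://example.org/Jane_Example", "constituency": "Exampleton"},
]

def build_gazetteer(entries):
    """Map every surface form (full name or Twitter handle) to its entry."""
    lookup = {}
    for entry in entries:
        lookup[entry["name"].lower()] = entry
        lookup[entry["handle"].lower()] = entry
    return lookup

def annotate_politicians(text, lookup):
    """Return (start, end, entry) for each surface form found in the text."""
    found = []
    lowered = text.lower()
    for form, entry in lookup.items():
        for m in re.finditer(re.escape(form), lowered):
            found.append((m.start(), m.end(), entry))
    return sorted(found, key=lambda f: f[0])

gazetteer = build_gazetteer(ENTRIES)
tweet = "Great debate with @jane_example tonight"
matches = annotate_politicians(tweet, gazetteer)
for start, end, entry in matches:
    print(tweet[start:end], entry["uri"], entry["constituency"])
```

Because both the name and the handle point at the same record, a mention in either form ends up annotated with the same URI and constituency.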
37. University of Sheffield, NLP
Topic Recognition
● A set of themes was taken from the categories used on http://www.gov.uk
● For each theme, a gazetteer list was developed containing typical
keywords and phrases representative of that theme
● e.g. “asylum seeker” indicates the topic “borders and immigration”
● Each list was expanded via:
– automatic term recognition (based on tf.idf) over a corpus of
manifestos and other political documents
– manual additions
● Each list also contains potential head terms and modifiers which can be
expanded into longer terms on the fly from the text during analysis stage
● e.g. “terrorist” can modify many other words (terrorist attack, terrorist
threat, ...)
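As a rough illustration of tf.idf-based term scoring, the Python sketch below ranks single words in a toy corpus. It assumes one common tf.idf variant (relative term frequency times log(N / document frequency)) and whitespace tokenisation; the slides do not say which variant or which term recognition tool was used, and the three sentences are invented stand-ins for manifestos.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each term per document as tf * log(N / document frequency)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                       for term, count in counts.items()})
    return scores

# Toy corpus, for illustration only.
corpus = [
    "we will protect the nhs and fund the nhs",
    "we will control immigration and the borders",
    "we will create jobs and cut the deficit",
]
scores = tf_idf(corpus)
top_term = max(scores[0], key=scores[0].get)
print(top_term)
```

Words that occur in every document (here "we", "will", "and", "the") score zero, so the terms that remain are the ones that distinguish a document, which is what makes them candidates for a theme's keyword list.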
38. University of Sheffield, NLP
Topic recognition
This term is found by first recognising the head word “job” from a list
under the theme “employment” and matching against its root form in the
text, i.e. “job”. It is then extended to include the adjectival modifier
“British”, which is not present in a list anywhere.
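The head-word matching and modifier extension described here can be sketched in Python as follows. The token triples, the JJ adjective tag and the example sentence are assumptions made for illustration; in the real application the root forms and part-of-speech tags would come from the morphological analyser and POS tagger, and the head-term list from the theme gazetteers.

```python
# Toy head-term list keyed by root form, mapping to a theme (illustrative only).
HEAD_TERMS = {"job": "employment"}

def find_terms(tokens):
    """Find head terms by root form and extend them leftwards over adjectives.

    Each token is a (string, root form, part-of-speech tag) triple.
    """
    terms = []
    for i, (string, root, pos) in enumerate(tokens):
        if root in HEAD_TERMS:
            start = i
            # Absorb any adjectives directly before the head word.
            while start > 0 and tokens[start - 1][2] == "JJ":
                start -= 1
            phrase = " ".join(t[0] for t in tokens[start:i + 1])
            terms.append((phrase, HEAD_TERMS[root]))
    return terms

# Hypothetical pre-analysed sentence: "We need British jobs"
tokens = [("We", "we", "PRP"), ("need", "need", "VBP"),
          ("British", "British", "JJ"), ("jobs", "job", "NNS")]
terms = find_terms(tokens)
print(terms)
```

Only the head word needs to be in a list: the plural "jobs" matches through its root form, and the modifier is picked up from the text itself.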
39. University of Sheffield, NLP
Sentiment annotation
● Annotations are created over the whole sentence and contain
the following features:
– sentiment_kind: optimism / pessimism
– holder: the person holding the opinion (MP's name)
– holder_URI: the URI of the holder
– target: the target of the opinion, e.g. MP or topic
– target_URI: if appropriate, the URI of the target
– score: a positive/negative value reflecting the strength of opinion
– sarcasm: yes / no (whether sarcasm is present)
– sentiment_string: the main word(s) that contain sentiment
● These annotations and features will be used as input to MIMIR
to facilitate analysis/aggregation
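To make the feature set concrete, the sketch below models one sentiment annotation as a Python dataclass and shows a simple aggregation over targets. The field names follow the slide; the Python types (for example a float score and a boolean sarcasm flag), the example values and the mean_score_by_target helper are assumptions, not part of GATE or Mimir.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentimentAnnotation:
    sentiment_kind: str           # "optimism" or "pessimism"
    holder: str                   # person holding the opinion
    holder_URI: str
    target: str                   # e.g. an MP or a topic
    target_URI: Optional[str]     # only where appropriate
    score: float                  # sign gives polarity, magnitude gives strength
    sarcasm: bool
    sentiment_string: str         # main word(s) carrying the sentiment

def mean_score_by_target(annotations):
    """Average the sentiment score for each target."""
    totals = {}
    for ann in annotations:
        total, count = totals.get(ann.target, (0.0, 0))
        totals[ann.target] = (total + ann.score, count + 1)
    return {target: total / count for target, (total, count) in totals.items()}

# Invented example annotations with placeholder holders and URIs.
examples = [
    SentimentAnnotation("optimism", "MP A", "http://example.org/MP_A",
                        "climate change", None, 0.5, False, "hopeful"),
    SentimentAnnotation("pessimism", "MP B", "http://example.org/MP_B",
                        "climate change", None, -1.0, False, "disaster"),
    SentimentAnnotation("optimism", "MP A", "http://example.org/MP_A",
                        "NHS", None, 1.0, False, "proud"),
]
means = mean_score_by_target(examples)
print(means)
```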
42. University of Sheffield, NLP
GATE Mimir
● can be used to index and search over text,
annotations, semantic metadata (concepts and
instances)
● allows queries that arbitrarily mix full-text,
structural, linguistic and semantic annotations
● is open source
43. University of Sheffield, NLP
Show me:
● all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
● all abstracts written in French on Patent Documents from the
last 6 months which mention any form of the word “transistor”
in the English abstract
● the names of the patent inventors of those abstracts
● all documents mentioning steel industries in the UK, along with
their location
45. University of Sheffield, NLP
Document Indexing with MIMIR
● MIMIR allows for indexing and querying text, annotations and
semantic knowledge
– this gives a rich source of data for analysis
● Currently we have used MIMIR to index
– the raw collected text
– annotations provided by Twitter and by the applications
46. University of Sheffield, NLP
Examples of Mimir queries on our corpus
● Get all documents which talk about the borders/immigration topic
{Topic theme = "borders_and_immigration"}
● Get all documents where the author of the document is a candidate
{DocumentAuthor sparql = "?c nesta:candidate ?author_uri"}
● Get all documents where the author is an MP standing for re-election for the
same seat
{DocumentAuthor sparql="?c nesta:candidate ?author_uri . ?c dbp-prop:mp ?author_uri"}
● Get all documents where the author is a candidate for the Sheffield Hallam
constituency
{DocumentAuthor sparql="<http://dbpedia.org/resource/Sheffield_Hallam_(UK_Parliament_constituency)> nesta:candidate ?author_uri"}
47. University of Sheffield, NLP
What do the different parties talk about?
Conservative vs Labour
Transport, Europe and employment are
mentioned more by Conservatives
NHS is mentioned more by Labour
52. University of Sheffield, NLP
Measuring engagement about climate change
● We also used the tools to measure how engaged both MPs
and the public are about the topic of climate change
● Comparison of climate change with other political topics
● Theory is that people are quite apathetic about most political
topics in general
● But people are more enthusiastic about climate change,
because it's something they can actively do something about
● Results showed that climate change is not frequently tweeted
about by most politicians apart from the Green Party, but is in
top 3 topics for most of the engagement criteria we applied
(retweets, replies, sentiment, optimism, @mentions, URLs)
● Climate change tweets contained the highest number of URLs
- direct engagement with additional information
58. University of Sheffield, NLP
Summary
● Once you have the indexed data, you can carry on doing all
kinds of interesting comparisons and analysis.
● Simple analysis tools can give you pretty pictures, but you can
do much more interesting things when you delve a bit deeper
and make use of information not explicit in the text
● For this you need both NLP and Linked Open Data
● Our tools are all freely available and open source