SlideShare a Scribd company logo
1 of 154
Download to read offline
University of Sheffield, NLP 
Text Analysis with GATE 
Diana Maynard 
University of Sheffield 
Search Solutions Tutorial 2014 
London, UK 
© The University of Sheffield, 2014 
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
University of Sheffield, NLP 
Outline of the Tutorial 
• Introduction to NLP and Information Extraction 
• Introduction to GATE 
• Social media analysis 
• Text Analysis for Semantic Search 
• Semantic Search 
• Semantic Annotation 
• GATE MIMIR 
• Example applications 
• ExoPatent, TNA Web Archive, BBC News, Envilod, 
Prospector
University of Sheffield, NLP 
1. NLP and Information Extraction
University of Sheffield, NLP 
Oddly enough, people have successfully combined 
information and toast...
University of Sheffield, NLP 
The weather-forecasting toaster 
● This weather-forecasting toaster, connected to a phone point, 
was designed in 2001 by a PhD student 
● It accessed the MetOffice website via a modem inside the 
toaster and translated the information into a 1, 2 or 3 for rain, 
cloud or sun 
● The relevant symbol was then branded into the toast in the last 
few seconds of toasting
University of Sheffield, NLP 
With tools such as these, why do we need text 
mining? 
● It turns out that toast isn't actually a very good medium 
for finding information
7 
University of Sheffield, NLP 
Information overload or filter failure? 
● We all know that there's 
masses of information 
available online these days, 
and it's constantly growing 
● You often hear people talk 
about “information overload” 
● But the real problem is not 
the amount of information, 
but our inability to filter it 
correctly 
● Clay Shirky has an excellent 
talk on this topic 
http://bit.ly/oWJTNZ
University of Sheffield, NLP 
It is difficult to access unstructured information 
efficiently 
Information extraction tools can help you: 
● Save time and money on management of text and data from 
multiple sources 
● Find hidden links scattered across huge volumes of diverse 
information 
● Integrate structured data from variety of sources 
● Interlink text and data 
● Collect information and extract new facts
University of Sheffield, NLP 
What is information extraction? 
● Information Extraction (IE) is the automatic discovery of new, 
previously unknown information, by automatically extracting 
information from different textual resources. 
● A key element is the linking together of the extracted 
information together to form new facts or new hypotheses to 
be explored further 
● In IE, the goal is to discover previously unknown information, 
i.e. something that no one yet knows
10 
University of Sheffield, NLP 
IE is not Data Mining 
Data mining is about using analytical techniques to find 
interesting patterns from large structured databases 
Examples: 
• using consumer purchasing 
patterns to predict which products 
to place close together on shelves 
in supermarkets 
• analysing spending patterns on 
credit cards to detect fraudulent 
card use.
11 
University of Sheffield, NLP 
IE is not Web Search 
● IE is also different from traditional web search or IR. 
● In search, the user is typically looking for something that is 
already known and has been written by someone else. 
● The problem lies in sifting through all the material that currently 
isn't relevant to your needs, in order to find the information that 
is. 
● The solution often lies in better ways to ask the right question 
● You can't ask Google to tell you about 
– all the speeches made by Tony Blair about foot and mouth 
disease while he was Prime Minister 
– all the documents in which a politician born in Sheffield is 
quoted as saying something about hospitals.
University of Sheffield, NLP 
More about how we can do this will be revealed later... 
GATE Mímir: 
Answering Questions 
Google Can't
University of Sheffield, NLP 
Information Extraction Basics 
● Entity recognition 
is required for 
● Relation extraction 
which is required for 
● Event recognition 
which is required for 
● Summarisation tasks
University of Sheffield, NLP 
What is Entity Recognition? 
● Entity Recognition is about recogising and classifying key Named 
Entities and terms in the text 
● A Named Entity is a Person, Location, Organisation, Date etc. 
● A term is a key concept or phrase that is representative of the text 
Mitt Romney, the favorite to win the Republican nomination for president in 2012 
Person Term Date 
● Entities and terms may be described in different ways but refer to 
the same thing. We call this co-reference. 
co-reference 
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior. 
Organisation 
Location
University of Sheffield, NLP 
What is Event Recognition? 
● An event is an action or situation relevant to the domain 
expressed by some relation between entities or terms. 
● It is always grounded in time, e.g. the performance of a 
band, an election, the death of a person 
Relation Relation 
Mitt Romney, the favorite to win the Republican nomination for president in 2012 
Person Event Date
University of Sheffield, NLP 
Why are Entities and Events Useful? 
● They can help answer the “Big 5” journalism questions 
(who, what, when, where, why) 
● They can be used to categorise the texts in different ways 
– look at all texts about Obama. 
● They can be used as targets for opinion mining 
– find out what people think about President Obama 
● When linked to an ontology and/or combined with other 
information, they can be used for reasoning about things 
not explicit in the text 
– seeing how opinions about different American 
presidents have changed over the years
17 
University of Sheffield, NLP 
NLP components for text mining 
● A text mining system is usually built up from a number of 
different NLP components 
● First, you need some Information Extraction tools to do the 
donkey work of getting all the relevant pieces of information 
and facts. 
● Then you need some tools to apply the reasoning to the facts, 
e.g. opinion mining, information aggregation, semantic 
technologies, dynamics analysis and so on. 
● GATE is an example of a tool for text mining which allows you 
to combine all the necessary NLP components
University of Sheffield, NLP 
Approaches to Information Extraction 
Knowledge Engineering 
 rule based 
 developed by 
experienced language 
engineers 
 make use of human 
intuition 
 easier to understand 
results 
 development could be 
very time consuming 
 some changes may be 
hard to accommodate 
Learning Systems 
 use statistics or other 
machine learning 
 developers do not need 
LE expertise 
 requires large amounts of 
annotated training data 
 some changes may 
require re-annotation of 
the entire training corpus
University of Sheffield, NLP 
2. GATE
University of Sheffield, NLP 
What is GATE? 
GATE is an NLP toolkit developed at the University of Sheffield 
over the last 20 years 
It includes: 
• components for language processing, e.g. parsers, machine 
learning tools, stemmers, IR tools, IE components for various 
languages... 
• tools for visualising and manipulating text, annotations, 
ontologies, parse trees, etc. 
• various information extraction tools 
• evaluation and benchmarking tools
University of Sheffield, NLP 
GATE components 
● Language Resources (LRs), e.g. lexicons, corpora, 
ontologies 
● Processing Resources (PRs), e.g. parsers, generators, 
taggers 
● Visual Resources (VRs), i.e. visualisation and editing 
components 
● Algorithms are separated from the data, which means: 
– the two can be developed independently by users with 
different expertise. 
– alternative resources of one type can be used without 
affecting the other, e.g. a different visual resource can be 
used with the same language resource
University of Sheffield, NLP 
ANNIE 
• ANNIE is GATE's rule-based IE system 
• It uses the language engineering approach (though we also 
have tools in GATE for ML) 
• Distributed as part of GATE 
• Uses a finite-state pattern-action rule language, JAPE 
• ANNIE contains a reusable and easily extendable set of 
components: 
– generic preprocessing components for tokenisation, 
sentence splitting etc 
– components for performing NE on general open domain text
University of Sheffield, NLP 
What's in ANNIE? 
• The ANNIE application contains a set of core PRs: 
• Tokeniser 
• Sentence Splitter 
• POS tagger 
• Gazetteers 
• Named entity tagger (JAPE transducer) 
• Orthomatcher (orthographic coreference) 
• There are also other PRs available in the ANNIE plugin, which 
are not used in the default application, but can be added if 
necessary 
• NP and VP chunker, morphological analysis
University of Sheffield, NLP 
ANNIE Modules
University of Sheffield, NLP 
Let's look at the PRs 
• Each PR in the ANNIE pipeline creates some new 
annotations, or modifies existing ones 
• Document Reset → removes annotations 
• Tokeniser → Token annotations 
• Gazetteer → Lookup annotations 
• Sentence Splitter → Sentence, Split annotations 
• POS tagger → adds category features to Token 
annotations 
• NE transducer → Date, Person, Location, Organisation, 
Money, Percent annotations 
• Orthomatcher → adds match features to NE annotations
University of Sheffield, NLP 
Tokeniser 
 Tokenisation based on Unicode classes 
 Declarative token specification language 
 Produces Token and SpaceToken annotations with 
features orthography and kind 
 Length and string features are also produced 
 Rule for a lowercase word with initial uppercase 
letter: 
"UPPERCASE_LETTER" LOWERCASE_LETTER"* > 
Token; orthography=upperInitial; kind=word
University of Sheffield, NLP 
27 
Document with Tokens
University of Sheffield, NLP 
ANNIE English Tokeniser 
 The English Tokeniser is a slightly enhanced version of the 
Unicode tokeniser 
 It comprises an additional JAPE transducer which adapts 
the generic tokeniser output for the POS tagger 
requirements 
 It converts constructs involving apostrophes into more 
sensible combinations 
 don’t → do + n't 
 you've → you + 've
University of Sheffield, NLP 
Sentence Splitter 
• The default splitter finds sentences based on Tokens 
• Creates Sentence annotations and Split annotations on the 
sentence delimiters 
• Uses a gazetteer of abbreviations etc. and a set of JAPE 
grammars which find sentence delimiters and then annotate 
sentences and splits 
• There are a few sentence splitter variants available which work 
in slightly different ways
University of Sheffield, NLP 
30 
Document with Sentences
University of Sheffield, NLP 
POS tagger 
 ANNIE POS tagger is a Java implementation of Brill's 
transformation based tagger 
 Previously known as Hepple Tagger 
 Trained on WSJ, uses Penn Treebank tagset 
 Default ruleset and lexicon can be modified manually (with a 
little deciphering) 
 Adds category feature to Token annotations 
 Requires Tokeniser and Sentence Splitter to be run first
Morphological analyser 
 Not an integral part of ANNIE, but can be found in the 
Tools plugin as an “added extra” 
 Flex based rules: can be modified by the user 
(instructions in the User Guide) 
 Generates “root” feature on Token annotations 
 Requires Tokeniser to be run first 
 Requires POS tagger to be run first if the 
considerPOSTag parameter is set to true
University of Sheffield, NLP 
Gazetteers 
● Gazetteers are plain text files containing lists of names (e.g 
rivers, cities, people, …) 
● The lists are compiled into Finite State Machines 
● Each gazetteer has an index file listing all the lists, plus 
features of each list (majorType, minorType and language) 
● Lists can be modified either internally using the Gazetteer 
Editor, or externally in your favourite editor 
● Gazetteers generate Lookup annotations with relevant features 
corresponding to the list matched 
● You can also add extra features and values to individual lists 
● Lookup annotations are used primarily by the NE transducer 
● The ANNIE gazetteer has about 60,000 entries arranged in 80 
lists
University of Sheffield, NLP 
Gazetteer editor 
definition file 
entries entries for selected list
University of Sheffield, NLP 
Named Entity Recognition 
• Gazetteers can be used to find terms that suggest entities 
• However, the entries can often be ambiguous 
– “May Jones” vs “May 2010” vs “May I be excused?” 
– “Mr Parkinson” vs “Parkinson's Disease” 
– “General Motors” vs. “General Smith” 
• Handcrafted grammars are used to define patterns over the 
Lookups and other annotations 
• These patterns can help disambiguate, and they can combine 
different annotations, e.g. Dates can be comprised of day + 
number + month
University of Sheffield, NLP 
Named Entity Grammars 
• Hand-coded rules written in JAPE applied to annotations to 
identify NEs 
• Phases run sequentially and constitute a cascade of FSTs over 
annotations 
• Annotations from format analysis, tokeniser. splitter, POS tagger, 
morphological analysis, gazetteer etc. 
• Because phases are sequential, annotations can be built up over 
a period of phases, as new information is gleaned 
• Standard named entities: persons, locations, organisations, dates, 
addresses, money 
• Basic NE grammars can be adapted for new applications, domains 
and languages
University of Sheffield, NLP 
Example of a JAPE pattern-action rule 
Rule: PersonName 
( 
{Lookup.majorType == firstname} 
{Token.category == NNP} 
):tag 
--> 
:tag.Person = {kind = fullName} 
Look for an entry in a 
gazetter list of first 
names 
followed by a proper noun 
create an annotation of type “Person” give the annotation the feature kind 
and value “fullName”
University of Sheffield, NLP 
JAPE rules are usually a bit more complex 
● They can make use of any annotations already created, e.g. 
– strings of text 
– POS categories 
– gazetteer lookup 
– root forms of words 
– document metadata, e.g. HTML tags 
– annotations “in context” 
● They can use regular expression operators, including negative 
constraints, “contains”, “within” etc. 
● And the RHS of a rule can contain any Java code you like
University of Sheffield, NLP 
Document with NEs
Using co-reference 
 Different expressions may refer to the same entity 
 Orthographic co-reference module (orthomatcher) 
matches proper names and their variants in a 
document 
 [Mr Smith] and [John Smith] will be matched as the 
same person 
 [International Business Machines Ltd.] will match 
[IBM]
Orthomatcher PR 
• Performs co-reference resolution based on orthographical 
information of entities 
• Produces a list of annotation IDs that form a co-reference “chain” 
• List of such lists stored as a document feature named 
“MatchesAnnots” 
• Improves results by assigning entity type to previously unclassified 
names, based on relations with classified entities 
• May not reclassify already classified entities 
• Classification of unknown entities very useful for surnames which 
match a full name, or abbreviations, 
e.g. “Bonfield” <Unknown> will match “Sir Peter Bonfield” <Person> 
• A pronominal PR is also available
University of Sheffield, NLP 
Coreference
University of Sheffield, NLP 
Other NLP Toolkits 
• UIMA 
• OpenCalais 
• Lingpipe 
• OpenNLP 
• Stanford Tools 
All integrated into GATE as plugins
University of Sheffield, NLP 
3. Analysing Social Media
University of Sheffield, NLP 
Analysing language in social media is hard 
● Grundman:politics makes #climatechange scientific issue,people 
don’t like knowitall rational voice tellin em wat 2do 
● @adambation Try reading this article , it looks like it would be 
really helpful and not obvious at all. http://t.co/mo3vODoX 
● Want to solve the problem of #ClimateChange? Just #vote for a 
#politician! Poof! Problem gone! #sarcasm #TVP #99% 
● Human Caused #ClimateChange is a Monumental Scam! 
http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! 
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
University of Sheffield, NLP 
Challenges for NLP 
● Noisy language: unusual punctuation, capitalisation, spelling, use 
of slang, sarcasm etc. 
● Terse nature of microposts such as tweets 
● Use of hashtags, @mentions etc causes problems for 
tokenisation #thisistricky 
● Lack of context gives rise to ambiguities 
● NER performs poorly on microposts, mainly because of linguistic 
pre-processing failure 
– Performance of standard IE tools decreases from ~90% to 
~40% when run on tweets rather than news articles
University of Sheffield, NLP 
Persons in news articles
University of Sheffield, NLP 
Persons in tweets
University of Sheffield, NLP 
Lack of context causes ambiguity 
Branching out from Lincoln park after dark ... Hello Russian 
Navy, it's like the same thing but with glitter! 
??
University of Sheffield, NLP 
Getting the NEs right is crucial 
Branching out from Lincoln park after dark ... Hello Russian 
Navy, it's like the same thing but with glitter!
University of Sheffield, NLP 
How do we deal with this kind of text? 
● Typical NLP pipeline means that degraded performance has a 
knock-on effect along the chain 
● Short sentences confuse language identification tools 
● Linguistic processing tools have to be adapted to the domain 
● Retraining individual components on large volumes of data 
● Adaptation of techniques from e.g. SMS analysis 
● Development of new Twitter-specific tools (e.g. GATE's TwitIE) 
● But....lack of standards, easily accessible data, common 
evaluation etc. are holding back development
University of Sheffield, NLP 
TwitIE to the rescue
University of Sheffield, NLP 
4. Text Analytics for Semantic Search
University of Sheffield, NLP 
Text Search isn't Enough 
“I like the Internet. Really, I do. Any time I 
need a piece of shareware or I want to find 
out the weather in Bogota... I'm the first guy 
to get the modem humming. But as a 
source of information, it sucks. You got a 
billion pieces of data, struggling to be heard 
and seen and downloaded, and anything I 
want to know seems to get trampled 
underfoot in the crowd." 
Michael Marshall, The Straw Men. HarperCollins 
Publishers, 2002.
University of Sheffield, NLP 
Semantic Queries in Google
University of Sheffield, NLP 
Searching for Things, Not Strings 
• 500 million entities that Google 
“knows” about 
• Used to provide more accurate 
search results 
• Summaries of information about 
the entity being searched 
http://googleblog.blogspot.it/2012/05/introducing-knowledge-graph-things-not.html
University of Sheffield, NLP 
Facebook Graph Search
University of Sheffield, NLP 
Semantic Enrichment 
● Textual mentions aren't actually that useful in isolation 
– knowing that something is a “Person" isn't very helpful 
– knowing which Person the mention refers to can be very useful 
● Disambiguating mentions against an ontology provides extra context 
● This is where semantic enrichment comes in 
● The end product is a set of textual mentions linked to an ontology, 
otherwise known as semantic annotations 
● Annotations on their own can be useful but they can also 
– be used to generate corpus level statistIcs 
– be used for further ontology population 
– form the basis of summaries 
– be indexed to provide semantic search
University of Sheffield, NLP 
Automatic Semantic Enrichment 
• Use Text Mining, e.g. 
• Information Extraction – recognise names of people, 
organisations, locations, dates, references, etc. 
• Term recognition – identify domain-specific terms 
• Automatically extend article metadata to improve search quality 
• Example: using a customised GATE text mining pipeline to enrich 
metadata in the Envia environmental science repository 
http://www.bl.uk/reshelp/experthelp/science/eventsandprojects/en 
viatbl/index.html
University of Sheffield, NLP 
Semantic Enrichment in Envia 
60
University of Sheffield, NLP 
Mining medical records 
● Medical records contain a large amount of unstructured text 
– letters between hopitals and GPs 
– discharge summaries 
● These documents might contain information not recorded 
elsewhere 
– it turns out doctors don't like forms! 
– often information-specific fields are ignored, with everything 
put in the free text area
University of Sheffield, NLP 
Medical Records at SLAM 
● NIHR Biomedical Research Centre at the South London and 
Maudsley Hospital are using text mining in a number of their 
studies 
● They have developed applications to extract: 
– the results of mental state tests, and the date the test was 
administered 
– education level (high school, university, etc.) 
– smoking status 
– medication history 
● They have even had promising results predicting suicides!
University of Sheffield, NLP 
Cancer Research 
● Genome Wide Association Studies (GWAS) aim to investigate 
genetic variants across the whole genome 
– With enough cases and controls, this allows them to state 
that a given SNP (Single Nucleotide Polymorphism) is 
related to a given disease. 
– A single study can be very expensive in both time and 
money to collect the required samples. 
● Can we reduce the costs by analysing published articles to 
gnerate prior probabilities for each SNP?
University of Sheffield, NLP 
Can Semantic Annotation Cure Cancer? 
● In conjunction with IARC (International Agency for Research on 
Cancer, part of the WHO) we developed a text analysis 
approach to mine PubMed 
● We showed retrospectively that our approach would have saved 
over a year's worth of work and more than 1.5 million Euros 
● We completed a new study which found a new cause for oral 
cancer 
– Oral cancer is rare enough that traditional methods would 
have failed to find enough cases to make the study plausible
University of Sheffield, NLP 
Government Web Archive 
● We developed a semantic annotation application to process 
every crawled page in the archive. 
● Entities annotated included; people, companies, locations, 
government departments, ministerial positions, social 
documents, dates, money.... 
● Where possible, annotations were linked to an ontology which 
– was based on DBpedia 
– was extended with UK government-specific concepts 
– included the modelling of the evolution of government 
● Annotations were indexed to allow for complex semantic 
querying of the collection 
● Demo coming later.....
University of Sheffield, NLP 
5.1 Semantic Annotation
University of Sheffield, NLP 
Why ontologies for semantic search? 
 Semantic annotation: rather than just annotating the word “Cambridge” as a 
location, link it to an ontology instance 
 Differentiate between Cambridge, UK and Cambridge, Mass. 
 Semantic search via reasoning 
 So we can infer that this document mentions a city in Europe. 
 Ontologies tell us that this particular Cambridge is part of the country called 
the UK, which is part of the continent Europe. 
 Knowledge source 
 If I want to annotate strikes in baseball reports, the ontology will tell me that 
a strike involves a batter who is a person 
 In the text “BA went on strike”, using the knowledge that BA is a company 
and not a person, the IE system can conclude that this is not the kind of 
strike it is interested in
University of Sheffield, NLP 
More semantic search examples 
Q: {ScalarValue}{MeasurementUnit} -> 
A: “12 cm”, “190 g”, “two hours” 
--- 
Q: {Reference} -> 
A: JP-A-60-180889 
A: Kalderon et al. (1984) Cell 39:499-509 
68
University of Sheffield, NLP 
Example Semantic Search Architecture 
69
University of Sheffield, NLP 
What is Semantic Annotation? 
Annotation: 
The process of adding metadata to [parts of] a document. 
Semantic Annotation: 
Annotation process where [parts of] the annotation schema 
(annotation types, annotation features) are ontological objects.
University of Sheffield, NLP 
Semantic Annotation
University of Sheffield, NLP 
What is an Ontology? 
 Set of concepts (instances and 
classes) 
 Relationships between them (is-a, 
part-of, located-in) 
 Multiple inheritance 
 Classes can have more than one 
parent 
 Instances can have more than one 
class 
 Ontologies are graphs, not trees
University of Sheffield, NLP 
URIs, Labels, Comments 
 The names of a classes or instance shown are URIs 
(Universal Resource Identifier) 
 http://dbpedia.org/page/Paris 
 The linguistic lexicalisation is typically encoded in the label 
property, as a string 
 The comment property is often used 
for documentation purposes, similarly 
a string 
 Comments and labels are 
annotation properties
University of Sheffield, NLP 
Datatype Properties 
 Datatype properties link individuals to data values 
 Datatype properties can be of type boolean, date, int, 
 e.g. a person can have an age property 
 Available datatypes taken from XMLSchema 
http://dbpedia.org/page/Paris
University of Sheffield, NLP 
Object Properties 
 Object properties link instances together 
 They describe relationships between instances, e.g. people work for 
organisations 
 Domain is the subject of the relation (the thing it applies to) 
 Range is the object of the relation (the possible”values”) 
Person work_for Organisation 
Domain Property Range 
Similar to domains, multiple alternative ranges can be specified by using a 
class description of the owl:unionOf, but this raises the complexity of the 
ontology and makes reasoning harder
University of Sheffield, NLP 
DBpedia 
● Machine readable knowledge on various entities and topics, 
including: 
– 410,000 places/locations, 
– 310,000 persons 
– 140,000 organisations 
● For each entity we have: 
– entity name variants (e.g. IBM, Int. Business Machines) 
– a textual abstract 
– reference(s) to corresponding Wikipedia page(s) 
– entity-specific properties (e.g. latitude and longitude for 
places)
University of Sheffield, NLP 
Example from DBpedia 
… 
Links to GeoNames 
And Freebase 
Latitude & Longitude
University of Sheffield, NLP 
GeoNames 
● 2.8 million populated places 
– 5.5 million alternate names 
● Knowledge about NUTS country sub-divisions 
– use for enrichment of recognised locations with the implied 
higher-level country sub-divisions 
● However, the sheer size of GeoNames creates a lot of ambiguity 
during semantic enrichment 
● We use it as an additional knowledge source, but not as a primary 
source (DBpedia)
University of Sheffield, NLP 
5.2 Semantic Annotation In GATE 
(and other tools)
University of Sheffield, NLP 
Information Extraction for the Semantic Web 
● Traditional IE is based on a flat structure, e.g. recognising 
Person, Location, Organisation, Date, Time etc. 
● For the Semantic Web, we need information in a hierarchical 
structure 
● Idea is that we attach semantic metadata to the documents, 
pointing to concepts in an ontology 
● Information can be exported as an ontology annotated with 
instances, or as text annotated with links to the ontology
University of Sheffield, NLP 
Traditional NE Recognition 
John lives in London . He works there for Polar Bear Design . 
PERSON LOCATION ORGANISATION
University of Sheffield, NLP 
Co-reference 
same_as 
John lives in London . He works there for Polar Bear Design . 
PER LOC ORG
University of Sheffield, NLP 
Relations 
live_in 
John lives in London . He works there for Polar Bear Design . 
PER LOC ORG
University of Sheffield, NLP 
Relations (2) 
employee_of 
John lives in London . He works there for Polar Bear Design . 
PER LOC ORG
University of Sheffield, NLP 
Relations (3) 
based_in 
John lives in London . He works there for Polar Bear Design . 
PER LOC ORG
University of Sheffield, NLP 
Richer NE Tagging 
● Attachment of instances in 
the text to concepts in the 
domain ontology 
● Disambiguation of 
instances, e.g. Cambridge, 
MA vs Cambridge, UK
University of Sheffield, NLP 
Ontology-based IE 
John lives in London. He works there for Polar Bear Design.
University of Sheffield, NLP 
Ontology-based IE (2) 
John lives in London. He works there for Polar Bear Design.
University of Sheffield, NLP 
Typical Semantic Annotation pipeline 
Analyse document structure 
Linguistic Pre-processing 
NE recognition 
Ontology Lookup 
Ontology-based IE 
Populate ontology (optional) 
Corpus 
Export as RDF 
RDF
University of Sheffield, NLP 
OpenCalais 
http://viewer.opencalais.com/ 
Paste text of http://www.membranes.com/ 
Not easily customised/extended 
Domain-specific coverage varies
University of Sheffield, NLP 
Zemanta 
● Paste text from 
www.membranes.com 
● The main entity of 
interest “Hydranautics” 
is missed 
● Common problem with 
general purpose, open-domain 
semantic 
annotation tools 
● Best results require 
bespoke customisation 
http://www.zemanta.com/demo/
University of Sheffield, NLP 
Semantic Annotation in GATE 
● Insert GATE screenshot
University of Sheffield, NLP 
Automatic Semantic Annotation 
● Locations (linked to DBpedia and GeoNames) 
– Annotate the place name itself (e.g. Norwich) with the 
corresponding DBpedia and GeoNames URIs 
– Also use knowledge of the implied reference to the levels 1, 2, 
and 3 sub-divisions from the Nomenclature of Territorial Units for 
Statistics (NUTS). 
– For Norwich, these are East of England (UKH – level 1), East 
Anglia (UKH1 – level 2), and Norfolk (UKH13 – level 3). 
– Similarly use knowledge to retrieve nearby places
University of Sheffield, NLP 
“South Gloucestershire” Example
University of Sheffield, NLP 
Semantic Annotation (2) 
● Organisations (linked to DBpedia) 
– Names of companies, government organisations, committees, 
agencies, universities, and other organisations 
● Dates 
– Absolute (e.g. 31/03/2012) and relative (yesterday) 
● Measurements and Percentages 
– e.g. 8,596 km2 , 1 km, one fifth, 10%
University of Sheffield, NLP 
The YODIE Service from GATE 
http://demos.gate.ac.uk/trendminer/obie/ 
96 
http://demos.gate.c.uk/trendminer/obie/
University of Sheffield, NLP 
Evaluation: YODIE on TAC-KBP’2010 
PER LOC ORG TOTAL 
DB Spotlight 0.97 / 0.40 0.82 / 0.46 0.86 / 0.31 0.85 / 0.39 
Zemanta 0.96 / 0.84 0.89 / 0.62 0.82 / 0.57 0.90 / 0.68 
LODIE 0.81 / 0.82 0.73 / 0.76 0.56 / 0.59 0.71 / 0.74 
Zemanta ∩ LODIE 1.00 / 0.74 0.95 / 0.45 0.97 / 0.42 0.97 / 0.54 
Zemanta U LODIE 0.94 / 0.93 0.77 / 0.76 0.72 / 0.71 0.82 / 0.81 
● Precision and Recall figures shown above 
● Similar results on a set of metadata records and scientific papers 
in environmental science (EnviLOD) 
● hard when candidate ambiguity is high
University of Sheffield, NLP 
Recognition of term variants in Decarbonet
University of Sheffield, NLP 
ClimaTerm Web service 
● http://services.gate.ac.uk/decarbonet/term-recognition/ 
● Input: a document 
● Output: the text as an XML file annotated with term and URI 
“Cars pollute by emitting carbon 
dioxide.” 
<Document> 
<TermInstance=" 
http://www.eionet.europa.eu/gemet/concept/1148">cars 
</Term> pollute by emitting 
<Term Instance="http://reegle.info/glossary/1064">carbon 
dioxide</Term> 
</Document>
University of Sheffield, NLP 
Term recognition demo
University of Sheffield, NLP 
Semantic Search: An Overview
University of Sheffield, NLP 
GATE Mímir 
• can be used to index and search over text, annotations, 
semantic metadata (concepts and instances) 
• allows queries that arbitrarily mix full-text, structural, linguistic 
and semantic annotations 
• is open source
University of Sheffield, NLP 
What can GATE Mímir do that Google can't? 
Show me: 
• all documents mentioning a temperature between 30 and 90 
degrees F (expressed in any unit) 
• all abstracts written in French on patent documents from the last 
6 months which mention any form of the word “transistor” in the 
English abstract 
• the names of the patent inventors of those abstracts 
• all documents mentioning steel industries in the UK, along with 
their location
University of Sheffield, NLP 
Search News Articles for 
Politicians born in Sheffield 
http://demos.gate.ac.uk/mimir/gpd/search/gus
University of Sheffield, NLP 
Easily Create Your Own 
Custom GATE Mímir Interfaces 
http://demos.gate.ac.uk/pin/
University of Sheffield, NLP 
MIMIR: Searching Text Mining Results 
● Searching and managing text annotations, semantic 
information, and full text documents in one search engine 
● Queries over annotation graphs 
● Regular expressions, Kleene operators 
● Designed to be integrated as a web service in custom end-user 
systems with bespoke interfaces 
● Demos at http://services.gate.ac.uk/mimir/
University of Sheffield, NLP 
Modelling Text Annotations and Additional 
Knowledge Sources
University of Sheffield, NLP 
MIMIR: the technical bit 
● Uses a combination of an FTS engine (MG4J), a database for 
indexing the semantic annotations (H2), and a knowledge repository 
(OWLIM/Sesame) 
● Use database to indexing annotation kinds, then the additional 
linguistic data for words/annotations are stored in MG4J (POS, 
morphological root, etc) 
● Semantic query: against the database/ knowledge base for the 
additional knowledge 
● A query: decompose into the respective parts, query the appropriate 
component, then merge the results
University of Sheffield, NLP 
Clustering and Scalability 
● All searches are local to a document, so we can use clusters of 
semantic repositories: 
● Faster indexing (less data in each repository) 
● Faster searching (search space is broken into slices that are 
searched in parallel – like Google). 
● Joining results is trivial: union of result sets. 
● Simple scalability: just add more nodes. 
● Search speed stays almost constant while the data increases (each 
individual repository has the same amount of data; there are just 
more repositories)
University of Sheffield, NLP 
Ranking of Results 
● By default documents are returned in index order 
● The following ranking algorithms are also available (others can be 
implemented via plugins if required) 
– Count 
– Hit length 
– TF.IDF 
– BM25 
• These algorithms treat annotations within the queries in the same way 
as words, using index level frequencies etc. 
• The ranking algorithm can be changed without re-indexing, allowing for 
easy experimentation
University of Sheffield, NLP 
Scaling Up 
● We annotated 1.08 million web pages using a GATE language 
analysis pipeline. 
– Documents crawled using Heritrix10 , with total content size of 
57 GiB or 6.6 billions plain text characters. 
– The indexing server has 2 Intel Xeon 2.8GHz CPUs 11 GB of 
RAM, and runs 64 bit Ubuntu Linux. Indexing process was 94 
hours. 
● We also indexed 150 million similar web pages, using two hundred 
Amazon EC2 Large Instances running for a week to produce a 
federated index 
● Mimir runs on GateCloud.net, so easy to scale up
University of Sheffield, NLP 
Search for string “Harriet Harman”
University of Sheffield, NLP 
Search for string “Harriet Harman says”
University of Sheffield, NLP 
Search with morphological variants: 
Harriet Harman root:say
University of Sheffield, NLP 
Replace strings with NEs: {Person} root:say
University of Sheffield, NLP 
{Person} AND root:say – 11803 hits
University of Sheffield, NLP 
{Person} [0..5] root: say – 5495 hits
University of Sheffield, NLP 
Patent Annotation: Data Model 
118
University of Sheffield, NLP 
An Example Text 
119
University of Sheffield, NLP 
Hands-On with patent data 
http://demos.gate.ac.uk/mimir/patents/search/index 
Text. Matches plain text. 
Example: nanomaterial 
Linguistic variations of text 
Example: (root:nanomaterial | root:nanoparticle ) 
Annotation. Matches semantic annotations. 
Syntax: {Type feature1=value1 feature2=value2...} 
Example: {Abstract lang="DE"} 
Sequence Query. Sequence of other queries. 
Syntax: Query1 [n..m] Query2... 
Example: from {Measurement} [1..5] {Measurement}
University of Sheffield, NLP 
Inclusion Queries 
IN Query. Hits of one query only if inside another. 
Syntax: Query1 IN Query2 
Example: (root:nanomaterial | root:nanoparticle ) IN {Abstract} 
Number of times these words are mentioned in patent abstracts (as 
well as links to the actual documents) 
OVER Query. Hits of a query, only if overlapping hits of another. 
Syntax: Query1 OVER Query2 
Example: {Abstract} OVER (root:nanomatrial | root:nanoparticle ) 
Finds all abstracts that contain nanomaterial(s) or nanoparticle(s)
University of Sheffield, NLP 
Date restrictions 
( 
{Abstract lang="EN"} OVER 
(root:nanomaterial | root:nanoparticle ) 
) 
IN 
{PatentDocument date > 20050000} 
YYYYMMDD
University of Sheffield, NLP 
Find references to literature or patents in 
the prior art or background sections, which 
contain nanomaterial/nanoparticle 
({Reference type="Literature"} 
| 
{Reference type="Patent"} 
) IN 
({Section type="PriorArt"} 
| 
{Section type="BackgroundArt"} 
) 
OVER 
(root:nanomaterial | root:nanoparticle)
University of Sheffield, NLP 
Queries Using External Knowledge 
{Measurement spec="1 to 100 volts“} 
● Uses GNU Units (http://www.gnu.org/software/units/) to convert 
measurements and normalise them to ISI units 
{Measurement spec="1 to 100 kg m^2 / A s^3"} 
● Example hits: 10 volts, 2V, +20 and -20 volts; ±10V; +/- 100V; +3.3 
volts 
{Measurement spec="1 to 100 m / s"} 
● Example hits: 40 km/hr, 60m/min, 100cm/sec, 60 fps; 10 to 2000 
cm/sec
University of Sheffield, NLP 
Searching LOD with SPARQL 
• SQL-like query language for RDF data 
• Simple protocol for querying remote databases over HTTP 
• Query types 
• select: projections of variables and expressions 
• construct: create triples (or graphs) based on query results 
• ask: whether a query returns results (result is true/false) 
• describe: describes resources in the graph
University of Sheffield, NLP 
SPARQL Example 
# Software companies founded in the US 
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX dbpedia: <http://dbpedia.org/resource/> 
PREFIX dbp-ont: <http://dbpedia.org/ontology/> 
PREFIX geo-ont: <http://www.geonames.org/ontology#> 
PREFIX umbel-sc: <http://umbel.org/umbel/sc/> 
SELECT DISTINCT ?Company ?Location 
WHERE { 
?Company rdf:type dbp-ont:Company ; 
dbp-ont:industry dbpedia:Computer_software ; 
dbp-ont:foundationPlace ?Location . 
?Location geo-ont:parentFeature dbpedia:United_States . 
}
University of Sheffield, NLP 
SPARQL Results 
● Try at: http://factforge.net/sparql
University of Sheffield, NLP 
Documents mentioning Persons born in Sheffield 
{Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace 
<http://dbpedia.org/resource/Sheffield>}"} 
The BBC web document does not mention Sheffield at all: 
http://www.bbc.co.uk/news/uk-politics-11494915 
The relevant text snippet is:
University of Sheffield, NLP 
Try these on the BBC News demo: 
http://services.gate.ac.uk/mimir/gpd/search/index 
● Gordon Brown [0..3] root:say 
● {Person} [0..3] root:say 
● {Person inst ="http://dbpedia.org/resource/Gordon_Brown"} 
[0..3] root:say 
● {Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} 
[0..3] root:say 
● {Person sparql = "SELECT ?inst WHERE { ?inst :party 
<http://dbpedia.org/resource/Labour_Party_%28UK%29> }" } 
[0..3] root:say 
● {Person sparql = "SELECT ?inst WHERE { 
?inst :party <http://dbpedia.org/resource/Labour_Party_ 
%28UK%29> . 
?inst :almaMater 
<http://dbpedia.org/resource/University_of_Edinburgh> }" 
} [0..3] root:say
University of Sheffield, NLP 
User Interfaces for SPARQL-based Semantic 
Search 
• SPARQL-based semantic searches, tapping into LOD resources are 
extremely powerful 
• However, impossible to write by the vast majority of users 
• User interfaces for SPARQL-based semantic search: 
• Faceted searches (see ExoPatent next) 
• Form-based searches (see EnviLOD, PIN) 
• Text-based searches (natural language interfaces for querying 
ontologies), e.g. FREyA
University of Sheffield, NLP 
Faceted Search: ExoPatent Example 
● Use semantic information to expose linkages between 
documents based on the intersecting relationships between 
various sets of data from 
● FDA Orange Book (23,000 patented drugs 
● Unified Medical Language System (UMLS) – database of medical 
terms (370,000) 
● Patent bibliographic information 
● Search for diseases, drug names, body parts, references to 
literature and other patents, numeric values, ranges 
● Demo uses a small set of patents (40,000)
University of Sheffield, NLP 
ExoPatent: Faceted Search
University of Sheffield, NLP 
Annotated Patent Document
University of Sheffield, NLP 
Find all applicants who filed patents related 
to mitochondria, as well as drug names and 
active ingredients
University of Sheffield, NLP 
Semantic Search over Content and Annotations 
http://demos.gate.ac.uk/trendminer/envilod
University of Sheffield, NLP 
EnviLOD Semantic Search UI
University of Sheffield, NLP 
EnviLOD Semantic Search UI
University of Sheffield, NLP 
EnviLOD Semantic Search UI
University of Sheffield, NLP 
Example Results
University of Sheffield, NLP 
…and the underlying SPARQL Query 
{Sem_Location dbpediaSparql="select distinct ?inst 
where { 
{{ ?inst <http://dbpedia.org/property/north> ?loc} UNION 
{ ?inst <http://dbpedia.org/property/east> ?loc } UNION 
{ ?inst <http://dbpedia.org/property/west> ?loc } UNION 
{ ?inst <http://dbpedia.org/property/south> ?loc } UNION 
{ ?inst <http://dbpedia.org/property/northeast> ?loc } UNION 
{ ?inst <http://dbpedia.org/property/northwest> ?loc } UNION 
{ ?inst <http://dbpedia.org/property/southeast> ?loc } UNION 
{ ?inst <http://dbpedia.org/property/southwest> ?loc } 
FILTER(REGEX(STR(?loc), "Sheffield", "i"))} }"} AND (root:"flood") 
Ongoing work: Use GeoSparql instead and be able to specify distances and 
reason with the richer information in GeoNames
University of Sheffield, NLP 
Flooding in Oxford
University of Sheffield, NLP 
Some Caveats…. 
● EnviLOD demonstrator built in a 6 month project 
● VERY limited environmental science content indexed 
– Some Defra, Environment Agency, Scottish Government reports 
● PDFs of the articles are not connected to
University of Sheffield, NLP 
We don't just have to look at politicians saying and 
measuring things 
● If we first process the text with other NLP tools such as 
sentiment analysis, we can also search for positive or negative 
documents 
● Or positive/negative comments about certain 
people/topics/things/companies 
● In the Decarbonet project, we're looking at people's opinions 
about topics relating to climate change, e.g. fracking 
● We could index on the sentiment annotations too 
● Other people are using the combination of opinion mining and 
MIMIR to look at e.g. customer feedback about products / 
companies on a huge corpus
University of Sheffield, NLP 
A positive tweet
University of Sheffield, NLP 
A negative tweet
University of Sheffield, NLP 
A Sarcastic Tweet
University of Sheffield, NLP 
Explicitly Choosing The Search Classes
University of Sheffield, NLP 
Choosing A Specific Instance
University of Sheffield, NLP 
What diseases are in these documents?
University of Sheffield, NLP 
What Pathogens?
University of Sheffield, NLP 
Disease vs Disease Co-ocurrences
University of Sheffield, NLP 
Diseases vs Pathogens
University of Sheffield, NLP 
Summary 
● Text mining is a very useful pre-requisite for doing all kinds of 
more interesting things, like semantic search 
● Semantic web search allows you to do much more interesting 
kinds of search than a standard text-based search 
● Text mining is hard, so it won't always be correct, however 
● This is especially true on lower quality text such as social 
media 
● On the other hand, social media has some of the most 
interesting material to look at 
● Still plenty of work to be done, but there are lots of tools that 
you can use now and get useful results 
● We run annual GATE training courses where you can spend a 
whole week learning all this and more!
University of Sheffield, NLP 
Acknowledgements and further information 
● Research partially supported by the European Union/EU under the Information 
and Communication Technologies (ICT) theme of the 7th Framework Programme 
for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and TrendMiner 
(287863) 
● Semantic Search over Documents and Ontologies (2014) K Bontcheva, V Tablan, 
H Cunningham. Bridging Between Information Retrieval and Databases, 31-53 
● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open-source 
semantic search framework for interactive information seeking and discovery. 
Journal of Web Semantics: Science, Services and Agents on the World Wide 
Web, 2014 
● These and other relevant papers on the GATE website: 
http://gate.ac.uk/gate/doc/papers.html 
● Annual GATE training course in June in 
Sheffield:https://gate.ac.uk/family/training.html 
● Download gate: http://gate.ac.uk/download

More Related Content

What's hot

Repository open access-المستودعات الرقمية
Repository open access-المستودعات الرقميةRepository open access-المستودعات الرقمية
Repository open access-المستودعات الرقميةabedelaziz benzine
 
نظم المعلومات الصحيه
نظم المعلومات الصحيهنظم المعلومات الصحيه
نظم المعلومات الصحيهahmed khanjar
 
المستودعات الرقمية
المستودعات الرقميةالمستودعات الرقمية
المستودعات الرقميةtassadit ALLOUNE
 
نظم معلومات
نظم معلومات نظم معلومات
نظم معلومات moez1969
 
Presentation 1 آوعية معلومات
Presentation 1 آوعية معلوماتPresentation 1 آوعية معلومات
Presentation 1 آوعية معلوماتmashael alomar
 
Promoting the re use of open data through ODIP
Promoting the re use of open data through ODIPPromoting the re use of open data through ODIP
Promoting the re use of open data through ODIPOpen Data Support
 
انطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقمي
انطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقميانطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقمي
انطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقميDr. Ahmed Farag
 
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات   E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات Essam Obaid
 
تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...
تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...
تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...mohamed Elzalabany
 
البيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوض
البيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوضالبيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوض
البيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوضMuhammad Muawwad
 
تسويق خدمات المعلومات
 تسويق خدمات المعلومات تسويق خدمات المعلومات
تسويق خدمات المعلوماتMajida Mourad
 
مهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنت
مهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنتمهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنت
مهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنتghadeermagdy
 
Database system concepts and architecture
Database system concepts and architectureDatabase system concepts and architecture
Database system concepts and architectureMahmoud Almadhoun
 
مهارات البحث في مصادر المعلومات الالكترونية
مهارات البحث في مصادر المعلومات الالكترونيةمهارات البحث في مصادر المعلومات الالكترونية
مهارات البحث في مصادر المعلومات الالكترونيةرؤية للحقائب التدريبية
 
الادارة الحديثة والامتياز الاداري
الادارة الحديثة والامتياز الاداري الادارة الحديثة والامتياز الاداري
الادارة الحديثة والامتياز الاداري tanmya-eg
 

What's hot (20)

Repository open access-المستودعات الرقمية
Repository open access-المستودعات الرقميةRepository open access-المستودعات الرقمية
Repository open access-المستودعات الرقمية
 
نظم المعلومات الصحيه
نظم المعلومات الصحيهنظم المعلومات الصحيه
نظم المعلومات الصحيه
 
المستودعات الرقمية
المستودعات الرقميةالمستودعات الرقمية
المستودعات الرقمية
 
الفهرسة المقروءة آلياً Marc
الفهرسة المقروءة آلياً Marcالفهرسة المقروءة آلياً Marc
الفهرسة المقروءة آلياً Marc
 
القمة في العمل المؤسسي
القمة في العمل المؤسسيالقمة في العمل المؤسسي
القمة في العمل المؤسسي
 
نظم معلومات
نظم معلومات نظم معلومات
نظم معلومات
 
Presentation 1 آوعية معلومات
Presentation 1 آوعية معلوماتPresentation 1 آوعية معلومات
Presentation 1 آوعية معلومات
 
Promoting the re use of open data through ODIP
Promoting the re use of open data through ODIPPromoting the re use of open data through ODIP
Promoting the re use of open data through ODIP
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
انطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقمي
انطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقميانطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقمي
انطولوجيا الويب الدلالي ودورها في تعزيز المحتوى الرقمي
 
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات   E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
 
تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...
تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...
تطبيقات النظم مفتوحة المصدر في المكتبات ومراكز المعلومات Open source applicat...
 
البيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوض
البيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوضالبيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوض
البيانات المترابطة في المكتبات / ترجمة محمد عبدالحميد معوض
 
الرامات
الراماتالرامات
الرامات
 
تسويق خدمات المعلومات
 تسويق خدمات المعلومات تسويق خدمات المعلومات
تسويق خدمات المعلومات
 
مهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنت
مهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنتمهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنت
مهارات البحث الفعال عن المصادر الأكاديمية عبر شبكة الإنترنت
 
Database system concepts and architecture
Database system concepts and architectureDatabase system concepts and architecture
Database system concepts and architecture
 
معيقات المسار الوظيفي
معيقات المسار الوظيفيمعيقات المسار الوظيفي
معيقات المسار الوظيفي
 
مهارات البحث في مصادر المعلومات الالكترونية
مهارات البحث في مصادر المعلومات الالكترونيةمهارات البحث في مصادر المعلومات الالكترونية
مهارات البحث في مصادر المعلومات الالكترونية
 
الادارة الحديثة والامتياز الاداري
الادارة الحديثة والامتياز الاداري الادارة الحديثة والامتياز الاداري
الادارة الحديثة والامتياز الاداري
 

Viewers also liked

GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaDiana Maynard
 
Text Analysis and Semantic Search with GATE
Text Analysis and Semantic Search with GATEText Analysis and Semantic Search with GATE
Text Analysis and Semantic Search with GATEDiana Maynard
 
GATE : General Architecture for Text Engineering
GATE : General Architecture for Text EngineeringGATE : General Architecture for Text Engineering
GATE : General Architecture for Text EngineeringAhmed Magdy Ezzeldin, MSc.
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextLeon Derczynski
 
Tools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media AnalysisTools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media AnalysisDiana Maynard
 
Monografia pipeline
Monografia pipelineMonografia pipeline
Monografia pipelinevaneyui
 
Sem Perio2009 Mela Motores Semanticos
Sem Perio2009 Mela Motores SemanticosSem Perio2009 Mela Motores Semanticos
Sem Perio2009 Mela Motores SemanticosMela Bosch
 
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...Diana Maynard
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]Sagar Ahire
 
Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search EnginesAtul Shridhar
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search resultseSAT Journals
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryOntotext
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformtoncho11
 
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]Sagar Ahire
 

Viewers also liked (20)

GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social media
 
Text Analysis and Semantic Search with GATE
Text Analysis and Semantic Search with GATEText Analysis and Semantic Search with GATE
Text Analysis and Semantic Search with GATE
 
GATE : General Architecture for Text Engineering
GATE : General Architecture for Text EngineeringGATE : General Architecture for Text Engineering
GATE : General Architecture for Text Engineering
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
 
NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 
Applying sw mikel_egana
Applying sw mikel_eganaApplying sw mikel_egana
Applying sw mikel_egana
 
Tools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media AnalysisTools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media Analysis
 
Semantic web
Semantic webSemantic web
Semantic web
 
Monografia pipeline
Monografia pipelineMonografia pipeline
Monografia pipeline
 
Sem Perio2009 Mela Motores Semanticos
Sem Perio2009 Mela Motores SemanticosSem Perio2009 Mela Motores Semanticos
Sem Perio2009 Mela Motores Semanticos
 
sent_analysis_report
sent_analysis_reportsent_analysis_report
sent_analysis_report
 
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
 
Nltk
NltkNltk
Nltk
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]
 
Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
 
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
 

Similar to Text analysis and Semantic Search with GATE

Arcomem training entities-and-events_advanced
Arcomem training entities-and-events_advancedArcomem training entities-and-events_advanced
Arcomem training entities-and-events_advancedarcomem
 
Text analysis-semantic-search
Text analysis-semantic-searchText analysis-semantic-search
Text analysis-semantic-searchDiana Maynard
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introductionananth
 
Arcomem training opinions_advanced
Arcomem training opinions_advancedArcomem training opinions_advanced
Arcomem training opinions_advancedarcomem
 
Veda Semantics - introduction document
Veda Semantics - introduction documentVeda Semantics - introduction document
Veda Semantics - introduction documentrajatkr
 
NATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxNATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxsaivinay93
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for developmentAravind Reddy
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for developmentAravind Reddy
 
Natural language processing and search
Natural language processing and searchNatural language processing and search
Natural language processing and searchNathan McMinn
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位eLearning Consortium 電子學習聯盟
 
Building NLP solutions for Davidson ML Group
Building NLP solutions for Davidson ML GroupBuilding NLP solutions for Davidson ML Group
Building NLP solutions for Davidson ML Groupbotsplash.com
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdfAnime196637
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
 
Rob Hanna: Leveraging Cognitive Science to Improve Topic- Based Authoring
Rob Hanna: Leveraging Cognitive Science to Improve Topic- Based AuthoringRob Hanna: Leveraging Cognitive Science to Improve Topic- Based Authoring
Rob Hanna: Leveraging Cognitive Science to Improve Topic- Based AuthoringJack Molisani
 
Introduction to NLP_1.pptx
Introduction to NLP_1.pptxIntroduction to NLP_1.pptx
Introduction to NLP_1.pptxjkamble
 

Similar to Text analysis and Semantic Search with GATE (20)

Arcomem training entities-and-events_advanced
Arcomem training entities-and-events_advancedArcomem training entities-and-events_advanced
Arcomem training entities-and-events_advanced
 
Text analysis-semantic-search
Text analysis-semantic-searchText analysis-semantic-search
Text analysis-semantic-search
 
Entity Search Engine
Entity Search Engine Entity Search Engine
Entity Search Engine
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Arcomem training opinions_advanced
Arcomem training opinions_advancedArcomem training opinions_advanced
Arcomem training opinions_advanced
 
Veda Semantics - introduction document
Veda Semantics - introduction documentVeda Semantics - introduction document
Veda Semantics - introduction document
 
NATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptxNATURAL LANGUAGE PROCESSING.pptx
NATURAL LANGUAGE PROCESSING.pptx
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
 
Natural language processing and search
Natural language processing and searchNatural language processing and search
Natural language processing and search
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Top 10 Must-Know NLP Techniques for Data Scientists
Top 10 Must-Know NLP Techniques for Data ScientistsTop 10 Must-Know NLP Techniques for Data Scientists
Top 10 Must-Know NLP Techniques for Data Scientists
 
subrat
 subrat subrat
subrat
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
 
Building NLP solutions for Davidson ML Group
Building NLP solutions for Davidson ML GroupBuilding NLP solutions for Davidson ML Group
Building NLP solutions for Davidson ML Group
 
Unit 5f.pptx
Unit 5f.pptxUnit 5f.pptx
Unit 5f.pptx
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdf
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Rob Hanna: Leveraging Cognitive Science to Improve Topic- Based Authoring
Rob Hanna: Leveraging Cognitive Science to Improve Topic- Based AuthoringRob Hanna: Leveraging Cognitive Science to Improve Topic- Based Authoring
Rob Hanna: Leveraging Cognitive Science to Improve Topic- Based Authoring
 
Introduction to NLP_1.pptx
Introduction to NLP_1.pptxIntroduction to NLP_1.pptx
Introduction to NLP_1.pptx
 

More from Diana Maynard

Filth and lies: analysing social media
Filth and lies: analysing social mediaFilth and lies: analysing social media
Filth and lies: analysing social mediaDiana Maynard
 
Adding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long wayAdding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long wayDiana Maynard
 
Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...Diana Maynard
 
Getting the-most-out-of-conferences
Getting the-most-out-of-conferencesGetting the-most-out-of-conferences
Getting the-most-out-of-conferencesDiana Maynard
 
Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Diana Maynard
 
The language of social media
The language of social mediaThe language of social media
The language of social mediaDiana Maynard
 
Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Diana Maynard
 
Adapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutionsAdapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutionsDiana Maynard
 
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...Diana Maynard
 
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...Diana Maynard
 
Challenges of social media analysis in the real world
Challenges of social media analysis in the real worldChallenges of social media analysis in the real world
Challenges of social media analysis in the real worldDiana Maynard
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATEDiana Maynard
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Diana Maynard
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Diana Maynard
 
Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?Diana Maynard
 
Disability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged SwordDisability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged SwordDiana Maynard
 
Multimodal opinion mining from social media
Multimodal opinion mining from social mediaMultimodal opinion mining from social media
Multimodal opinion mining from social mediaDiana Maynard
 
Practical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaPractical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaDiana Maynard
 
What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...Diana Maynard
 

More from Diana Maynard (20)

Filth and lies: analysing social media
Filth and lies: analysing social mediaFilth and lies: analysing social media
Filth and lies: analysing social media
 
Adding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long wayAdding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long way
 
Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...
 
Getting the-most-out-of-conferences
Getting the-most-out-of-conferencesGetting the-most-out-of-conferences
Getting the-most-out-of-conferences
 
Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...
 
The language of social media
The language of social mediaThe language of social media
The language of social media
 
Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...
 
Adapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutionsAdapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutions
 
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
 
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
 
Challenges of social media analysis in the real world
Challenges of social media analysis in the real worldChallenges of social media analysis in the real world
Challenges of social media analysis in the real world
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATE
 
Cls8 decarbonet
Cls8 decarbonetCls8 decarbonet
Cls8 decarbonet
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
 
Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?
 
Disability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged SwordDisability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged Sword
 
Multimodal opinion mining from social media
Multimodal opinion mining from social mediaMultimodal opinion mining from social media
Multimodal opinion mining from social media
 
Practical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaPractical Opinion Mining for Social Media
Practical Opinion Mining for Social Media
 
What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...
 

Recently uploaded

Gram Darshan PPT cyber rural in villages of india
Gram Darshan PPT cyber rural  in villages of indiaGram Darshan PPT cyber rural  in villages of india
Gram Darshan PPT cyber rural in villages of indiaimessage0108
 
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls KolkataVIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
AlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with FlowsAlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with FlowsThierry TROUIN ☁
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...SofiyaSharma5
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya Shirtrahman018755
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebJames Anderson
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 
VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...
VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...
VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...aditipandeya
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girlsstephieert
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGAPNIC
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$kojalkojal131
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Roomgirls4nights
 

Recently uploaded (20)

Gram Darshan PPT cyber rural in villages of india
Gram Darshan PPT cyber rural  in villages of indiaGram Darshan PPT cyber rural  in villages of india
Gram Darshan PPT cyber rural in villages of india
 
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls KolkataVIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
AlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with FlowsAlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with Flows
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
 
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...
VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...
VIP 7001035870 Find & Meet Hyderabad Call Girls Dilsukhnagar high-profile Cal...
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOG
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
 

Text analysis and Semantic Search with GATE

  • 1. University of Sheffield, NLP Text Analysis with GATE Diana Maynard University of Sheffield Search Solutions Tutorial 2014 London, UK © The University of Sheffield, 2014 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
  • 2. University of Sheffield, NLP Outline of the Tutorial • Introduction to NLP and Information Extraction • Introduction to GATE • Social media analysis • Text Analysis for Semantic Search • Semantic Search • Semantic Annotation • GATE MIMIR • Example applications • ExoPatent, TNA Web Archive, BBC News, Envilod, Prospector
  • 3. University of Sheffield, NLP 1. NLP and Information Extraction
  • 4. University of Sheffield, NLP Oddly enough, people have successfully combined information and toast...
  • 5. University of Sheffield, NLP The weather-forecasting toaster ● This weather-forecasting toaster, connected to a phone point, was designed in 2001 by a PhD student ● It accessed the MetOffice website via a modem inside the toaster and translated the information into a 1, 2 or 3 for rain, cloud or sun ● The relevant symbol was then branded into the toast in the last few seconds of toasting
  • 6. University of Sheffield, NLP With tools such as these, why do we need text mining? ● It turns out that toast isn't actually a very good medium for finding information
  • 7. 7 University of Sheffield, NLP Information overload or filter failure? ● We all know that there's masses of information available online these days, and it's constantly growing ● You often hear people talk about “information overload” ● But the real problem is not the amount of information, but our inability to filter it correctly ● Clay Shirky has an excellent talk on this topic http://bit.ly/oWJTNZ
  • 8. University of Sheffield, NLP It is difficult to access unstructured information efficiently Information extraction tools can help you: ● Save time and money on management of text and data from multiple sources ● Find hidden links scattered across huge volumes of diverse information ● Integrate structured data from variety of sources ● Interlink text and data ● Collect information and extract new facts
  • 9. University of Sheffield, NLP What is information extraction? ● Information Extraction (IE) is the automatic discovery of new, previously unknown information, by automatically extracting information from different textual resources. ● A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further ● In IE, the goal is to discover previously unknown information, i.e. something that no one yet knows
  • 10. 10 University of Sheffield, NLP IE is not Data Mining Data mining is about using analytical techniques to find interesting patterns from large structured databases Examples: • using consumer purchasing patterns to predict which products to place close together on shelves in supermarkets • analysing spending patterns on credit cards to detect fraudulent card use.
  • 11. 11 University of Sheffield, NLP IE is not Web Search ● IE is also different from traditional web search or IR. ● In search, the user is typically looking for something that is already known and has been written by someone else. ● The problem lies in sifting through all the material that currently isn't relevant to your needs, in order to find the information that is. ● The solution often lies in better ways to ask the right question ● You can't ask Google to tell you about – all the speeches made by Tony Blair about foot and mouth disease while he was Prime Minister – all the documents in which a politician born in Sheffield is quoted as saying something about hospitals.
  • 12. University of Sheffield, NLP More about how we can do this will be revealed later... GATE Mímir: Answering Questions Google Can't
  • 13. University of Sheffield, NLP Information Extraction Basics ● Entity recognition is required for ● Relation extraction which is required for ● Event recognition which is required for ● Summarisation tasks
  • 14. University of Sheffield, NLP What is Entity Recognition? ● Entity Recognition is about recogising and classifying key Named Entities and terms in the text ● A Named Entity is a Person, Location, Organisation, Date etc. ● A term is a key concept or phrase that is representative of the text Mitt Romney, the favorite to win the Republican nomination for president in 2012 Person Term Date ● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference. co-reference The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior. Organisation Location
  • 15. University of Sheffield, NLP What is Event Recognition? ● An event is an action or situation relevant to the domain expressed by some relation between entities or terms. ● It is always grounded in time, e.g. the performance of a band, an election, the death of a person Relation Relation Mitt Romney, the favorite to win the Republican nomination for president in 2012 Person Event Date
  • 16. University of Sheffield, NLP Why are Entities and Events Useful? ● They can help answer the “Big 5” journalism questions (who, what, when, where, why) ● They can be used to categorise the texts in different ways – look at all texts about Obama. ● They can be used as targets for opinion mining – find out what people think about President Obama ● When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text – seeing how opinions about different American presidents have changed over the years
  • 17. 17 University of Sheffield, NLP NLP components for text mining ● A text mining system is usually built up from a number of different NLP components ● First, you need some Information Extraction tools to do the donkey work of getting all the relevant pieces of information and facts. ● Then you need some tools to apply the reasoning to the facts, e.g. opinion mining, information aggregation, semantic technologies, dynamics analysis and so on. ● GATE is an example of a tool for text mining which allows you to combine all the necessary NLP components
  • 18. University of Sheffield, NLP Approaches to Information Extraction Knowledge Engineering  rule based  developed by experienced language engineers  make use of human intuition  easier to understand results  development could be very time consuming  some changes may be hard to accommodate Learning Systems  use statistics or other machine learning  developers do not need LE expertise  requires large amounts of annotated training data  some changes may require re-annotation of the entire training corpus
  • 20. University of Sheffield, NLP What is GATE? GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years It includes: • components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages... • tools for visualising and manipulating text, annotations, ontologies, parse trees, etc. • various information extraction tools • evaluation and benchmarking tools
  • 21. University of Sheffield, NLP GATE components ● Language Resources (LRs), e.g. lexicons, corpora, ontologies ● Processing Resources (PRs), e.g. parsers, generators, taggers ● Visual Resources (VRs), i.e. visualisation and editing components ● Algorithms are separated from the data, which means: – the two can be developed independently by users with different expertise. – alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
  • 22. University of Sheffield, NLP ANNIE • ANNIE is GATE's rule-based IE system • It uses the language engineering approach (though we also have tools in GATE for ML) • Distributed as part of GATE • Uses a finite-state pattern-action rule language, JAPE • ANNIE contains a reusable and easily extendable set of components: – generic preprocessing components for tokenisation, sentence splitting etc – components for performing NE on general open domain text
  • 23. University of Sheffield, NLP What's in ANNIE? • The ANNIE application contains a set of core PRs: • Tokeniser • Sentence Splitter • POS tagger • Gazetteers • Named entity tagger (JAPE transducer) • Orthomatcher (orthographic coreference) • There are also other PRs available in the ANNIE plugin, which are not used in the default application, but can be added if necessary • NP and VP chunker, morphological analysis
  • 24. University of Sheffield, NLP ANNIE Modules
  • 25. University of Sheffield, NLP Let's look at the PRs • Each PR in the ANNIE pipeline creates some new annotations, or modifies existing ones • Document Reset → removes annotations • Tokeniser → Token annotations • Gazetteer → Lookup annotations • Sentence Splitter → Sentence, Split annotations • POS tagger → adds category features to Token annotations • NE transducer → Date, Person, Location, Organisation, Money, Percent annotations • Orthomatcher → adds match features to NE annotations
  • 26. University of Sheffield, NLP Tokeniser  Tokenisation based on Unicode classes  Declarative token specification language  Produces Token and SpaceToken annotations with features orthography and kind  Length and string features are also produced  Rule for a lowercase word with initial uppercase letter: "UPPERCASE_LETTER" LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word
  • 27. University of Sheffield, NLP 27 Document with Tokens
  • 28. University of Sheffield, NLP ANNIE English Tokeniser  The English Tokeniser is a slightly enhanced version of the Unicode tokeniser  It comprises an additional JAPE transducer which adapts the generic tokeniser output for the POS tagger requirements  It converts constructs involving apostrophes into more sensible combinations  don’t → do + n't  you've → you + 've
  • 29. University of Sheffield, NLP Sentence Splitter • The default splitter finds sentences based on Tokens • Creates Sentence annotations and Split annotations on the sentence delimiters • Uses a gazetteer of abbreviations etc. and a set of JAPE grammars which find sentence delimiters and then annotate sentences and splits • There are a few sentence splitter variants available which work in slightly different ways
  • 30. University of Sheffield, NLP 30 Document with Sentences
  • 31. University of Sheffield, NLP POS tagger  ANNIE POS tagger is a Java implementation of Brill's transformation based tagger  Previously known as Hepple Tagger  Trained on WSJ, uses Penn Treebank tagset  Default ruleset and lexicon can be modified manually (with a little deciphering)  Adds category feature to Token annotations  Requires Tokeniser and Sentence Splitter to be run first
  • 32. Morphological analyser  Not an integral part of ANNIE, but can be found in the Tools plugin as an “added extra”  Flex based rules: can be modified by the user (instructions in the User Guide)  Generates “root” feature on Token annotations  Requires Tokeniser to be run first  Requires POS tagger to be run first if the considerPOSTag parameter is set to true
  • 33. University of Sheffield, NLP Gazetteers ● Gazetteers are plain text files containing lists of names (e.g rivers, cities, people, …) ● The lists are compiled into Finite State Machines ● Each gazetteer has an index file listing all the lists, plus features of each list (majorType, minorType and language) ● Lists can be modified either internally using the Gazetteer Editor, or externally in your favourite editor ● Gazetteers generate Lookup annotations with relevant features corresponding to the list matched ● You can also add extra features and values to individual lists ● Lookup annotations are used primarily by the NE transducer ● The ANNIE gazetteer has about 60,000 entries arranged in 80 lists
  • 34. University of Sheffield, NLP Gazetteer editor definition file entries entries for selected list
  • 35. University of Sheffield, NLP Named Entity Recognition • Gazetteers can be used to find terms that suggest entities • However, the entries can often be ambiguous – “May Jones” vs “May 2010” vs “May I be excused?” – “Mr Parkinson” vs “Parkinson's Disease” – “General Motors” vs. “General Smith” • Handcrafted grammars are used to define patterns over the Lookups and other annotations • These patterns can help disambiguate, and they can combine different annotations, e.g. Dates can be comprised of day + number + month
  • 36. University of Sheffield, NLP Named Entity Grammars • Hand-coded rules written in JAPE applied to annotations to identify NEs • Phases run sequentially and constitute a cascade of FSTs over annotations • Annotations from format analysis, tokeniser. splitter, POS tagger, morphological analysis, gazetteer etc. • Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned • Standard named entities: persons, locations, organisations, dates, addresses, money • Basic NE grammars can be adapted for new applications, domains and languages
  • 37. University of Sheffield, NLP Example of a JAPE pattern-action rule Rule: PersonName ( {Lookup.majorType == firstname} {Token.category == NNP} ):tag --> :tag.Person = {kind = fullName} Look for an entry in a gazetter list of first names followed by a proper noun create an annotation of type “Person” give the annotation the feature kind and value “fullName”
  • 38. University of Sheffield, NLP JAPE rules are usually a bit more complex ● They can make use of any annotations already created, e.g. – strings of text – POS categories – gazetteer lookup – root forms of words – document metadata, e.g. HTML tags – annotations “in context” ● They can use regular expression operators, including negative constraints, “contains”, “within” etc. ● And the RHS of a rule can contain any Java code you like
  • 39. University of Sheffield, NLP Document with NEs
  • 40. Using co-reference  Different expressions may refer to the same entity  Orthographic co-reference module (orthomatcher) matches proper names and their variants in a document  [Mr Smith] and [John Smith] will be matched as the same person  [International Business Machines Ltd.] will match [IBM]
  • 41. Orthomatcher PR • Performs co-reference resolution based on orthographical information of entities • Produces a list of annotation IDs that form a co-reference “chain” • List of such lists stored as a document feature named “MatchesAnnots” • Improves results by assigning entity type to previously unclassified names, based on relations with classified entities • May not reclassify already classified entities • Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. “Bonfield” <Unknown> will match “Sir Peter Bonfield” <Person> • A pronominal PR is also available
  • 42. University of Sheffield, NLP Coreference
  • 43. University of Sheffield, NLP Other NLP Toolkits • UIMA • OpenCalais • Lingpipe • OpenNLP • Stanford Tools All integrated into GATE as plugins
  • 44. University of Sheffield, NLP 3. Analysing Social Media
  • 45. University of Sheffield, NLP Analysing language in social media is hard ● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do ● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX ● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99% ● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
  • 46. University of Sheffield, NLP Challenges for NLP ● Noisy language: unusual punctuation, capitalisation, spelling, use of slang, sarcasm etc. ● Terse nature of microposts such as tweets ● Use of hashtags, @mentions etc causes problems for tokenisation #thisistricky ● Lack of context gives rise to ambiguities ● NER performs poorly on microposts, mainly because of linguistic pre-processing failure – Performance of standard IE tools decreases from ~90% to ~40% when run on tweets rather than news articles
  • 47. University of Sheffield, NLP Persons in news articles
  • 48. University of Sheffield, NLP Persons in tweets
  • 49. University of Sheffield, NLP Lack of context causes ambiguity Branching out from Lincoln park after dark ... Hello Russian Navy, it's like the same thing but with glitter! ??
  • 50. University of Sheffield, NLP Getting the NEs right is crucial Branching out from Lincoln park after dark ... Hello Russian Navy, it's like the same thing but with glitter!
  • 51. University of Sheffield, NLP How do we deal with this kind of text? ● Typical NLP pipeline means that degraded performance has a knock-on effect along the chain ● Short sentences confuse language identification tools ● Linguistic processing tools have to be adapted to the domain ● Retraining individual components on large volumes of data ● Adaptation of techniques from e.g. SMS analysis ● Development of new Twitter-specific tools (e.g. GATE's TwitIE) ● But....lack of standards, easily accessible data, common evaluation etc. are holding back development
  • 52. University of Sheffield, NLP TwitIE to the rescue
  • 53. University of Sheffield, NLP 4. Text Analytics for Semantic Search
  • 54. University of Sheffield, NLP Text Search isn't Enough “I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogota... I'm the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd." Michael Marshall, The Straw Men. HarperCollins Publishers, 2002.
  • 55. University of Sheffield, NLP Semantic Queries in Google
  • 56. University of Sheffield, NLP Searching for Things, Not Strings • 500 million entities that Google “knows” about • Used to provide more accurate search results • Summaries of information about the entity being searched http://googleblog.blogspot.it/2012/05/introducing-knowledge-graph-things-not.html
  • 57. University of Sheffield, NLP Facebook Graph Search
  • 58. University of Sheffield, NLP Semantic Enrichment ● Textual mentions aren't actually that useful in isolation – knowing that something is a “Person" isn't very helpful – knowing which Person the mention refers to can be very useful ● Disambiguating mentions against an ontology provides extra context ● This is where semantic enrichment comes in ● The end product is a set of textual mentions linked to an ontology, otherwise known as semantic annotations ● Annotations on their own can be useful but they can also – be used to generate corpus level statistIcs – be used for further ontology population – form the basis of summaries – be indexed to provide semantic search
  • 59. University of Sheffield, NLP Automatic Semantic Enrichment • Use Text Mining, e.g. • Information Extraction – recognise names of people, organisations, locations, dates, references, etc. • Term recognition – identify domain-specific terms • Automatically extend article metadata to improve search quality • Example: using a customised GATE text mining pipeline to enrich metadata in the Envia environmental science repository http://www.bl.uk/reshelp/experthelp/science/eventsandprojects/en viatbl/index.html
  • 60. University of Sheffield, NLP Semantic Enrichment in Envia 60
  • 61. University of Sheffield, NLP Mining medical records ● Medical records contain a large amount of unstructured text – letters between hopitals and GPs – discharge summaries ● These documents might contain information not recorded elsewhere – it turns out doctors don't like forms! – often information-specific fields are ignored, with everything put in the free text area
  • 62. University of Sheffield, NLP Medical Records at SLAM ● NIHR Biomedical Research Centre at the South London and Maudsley Hospital are using text mining in a number of their studies ● They have developed applications to extract: – the results of mental state tests, and the date the test was administered – education level (high school, university, etc.) – smoking status – medication history ● They have even had promising results predicting suicides!
  • 63. University of Sheffield, NLP Cancer Research ● Genome Wide Association Studies (GWAS) aim to investigate genetic variants across the whole genome – With enough cases and controls, this allows them to state that a given SNP (Single Nucleotide Polymorphism) is related to a given disease. – A single study can be very expensive in both time and money to collect the required samples. ● Can we reduce the costs by analysing published articles to gnerate prior probabilities for each SNP?
  • 64. University of Sheffield, NLP Can Semantic Annotation Cure Cancer? ● In conjunction with IARC (International Agency for Research on Cancer, part of the WHO) we developed a text analysis approach to mine PubMed ● We showed retrospectively that our approach would have saved over a year's worth of work and more than 1.5 million Euros ● We completed a new study which found a new cause for oral cancer – Oral cancer is rare enough that traditional methods would have failed to find enough cases to make the study plausible
  • 65. University of Sheffield, NLP Government Web Archive ● We developed a semantic annotation application to process every crawled page in the archive. ● Entities annotated included; people, companies, locations, government departments, ministerial positions, social documents, dates, money.... ● Where possible, annotations were linked to an ontology which – was based on DBpedia – was extended with UK government-specific concepts – included the modelling of the evolution of government ● Annotations were indexed to allow for complex semantic querying of the collection ● Demo coming later.....
  • 66. University of Sheffield, NLP 5.1 Semantic Annotation
  • 67. University of Sheffield, NLP Why ontologies for semantic search?  Semantic annotation: rather than just annotating the word “Cambridge” as a location, link it to an ontology instance  Differentiate between Cambridge, UK and Cambridge, Mass.  Semantic search via reasoning  So we can infer that this document mentions a city in Europe.  Ontologies tell us that this particular Cambridge is part of the country called the UK, which is part of the continent Europe.  Knowledge source  If I want to annotate strikes in baseball reports, the ontology will tell me that a strike involves a batter who is a person  In the text “BA went on strike”, using the knowledge that BA is a company and not a person, the IE system can conclude that this is not the kind of strike it is interested in
  • 68. University of Sheffield, NLP More semantic search examples Q: {ScalarValue}{MeasurementUnit} -> A: “12 cm”, “190 g”, “two hours” --- Q: {Reference} -> A: JP-A-60-180889 A: Kalderon et al. (1984) Cell 39:499-509 68
  • 69. University of Sheffield, NLP Example Semantic Search Architecture 69
  • 70. University of Sheffield, NLP What is Semantic Annotation? Annotation: The process of adding metadata to [parts of] a document. Semantic Annotation: Annotation process where [parts of] the annotation schema (annotation types, annotation features) are ontological objects.
  • 71. University of Sheffield, NLP Semantic Annotation
  • 72. University of Sheffield, NLP What is an Ontology?  Set of concepts (instances and classes)  Relationships between them (is-a, part-of, located-in)  Multiple inheritance  Classes can have more than one parent  Instances can have more than one class  Ontologies are graphs, not trees
  • 73. University of Sheffield, NLP URIs, Labels, Comments  The names of a classes or instance shown are URIs (Universal Resource Identifier)  http://dbpedia.org/page/Paris  The linguistic lexicalisation is typically encoded in the label property, as a string  The comment property is often used for documentation purposes, similarly a string  Comments and labels are annotation properties
  • 74. University of Sheffield, NLP Datatype Properties  Datatype properties link individuals to data values  Datatype properties can be of type boolean, date, int,  e.g. a person can have an age property  Available datatypes taken from XMLSchema http://dbpedia.org/page/Paris
  • 75. University of Sheffield, NLP Object Properties  Object properties link instances together  They describe relationships between instances, e.g. people work for organisations  Domain is the subject of the relation (the thing it applies to)  Range is the object of the relation (the possible”values”) Person work_for Organisation Domain Property Range Similar to domains, multiple alternative ranges can be specified by using a class description of the owl:unionOf, but this raises the complexity of the ontology and makes reasoning harder
  • 76. University of Sheffield, NLP DBpedia ● Machine readable knowledge on various entities and topics, including: – 410,000 places/locations, – 310,000 persons – 140,000 organisations ● For each entity we have: – entity name variants (e.g. IBM, Int. Business Machines) – a textual abstract – reference(s) to corresponding Wikipedia page(s) – entity-specific properties (e.g. latitude and longitude for places)
  • 77. University of Sheffield, NLP Example from DBpedia … Links to GeoNames And Freebase Latitude & Longitude
  • 78. University of Sheffield, NLP GeoNames ● 2.8 million populated places – 5.5 million alternate names ● Knowledge about NUTS country sub-divisions – use for enrichment of recognised locations with the implied higher-level country sub-divisions ● However, the sheer size of GeoNames creates a lot of ambiguity during semantic enrichment ● We use it as an additional knowledge source, but not as a primary source (DBpedia)
  • 79. University of Sheffield, NLP 5.2 Semantic Annotation In GATE (and other tools)
  • 80. University of Sheffield, NLP Information Extraction for the Semantic Web ● Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc. ● For the Semantic Web, we need information in a hierarchical structure ● Idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology ● Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
  • 81. University of Sheffield, NLP Traditional NE Recognition John lives in London . He works there for Polar Bear Design . PERSON LOCATION ORGANISATION
  • 82. University of Sheffield, NLP Co-reference same_as John lives in London . He works there for Polar Bear Design . PER LOC ORG
  • 83. University of Sheffield, NLP Relations live_in John lives in London . He works there for Polar Bear Design . PER LOC ORG
  • 84. University of Sheffield, NLP Relations (2) employee_of John lives in London . He works there for Polar Bear Design . PER LOC ORG
  • 85. University of Sheffield, NLP Relations (3) based_in John lives in London . He works there for Polar Bear Design . PER LOC ORG
  • 86. University of Sheffield, NLP Richer NE Tagging ● Attachment of instances in the text to concepts in the domain ontology ● Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
  • 87. University of Sheffield, NLP Ontology-based IE John lives in London. He works there for Polar Bear Design.
  • 88. University of Sheffield, NLP Ontology-based IE (2) John lives in London. He works there for Polar Bear Design.
  • 89. University of Sheffield, NLP Typical Semantic Annotation pipeline Analyse document structure Linguistic Pre-processing NE recognition Ontology Lookup Ontology-based IE Populate ontology (optional) Corpus Export as RDF RDF
  • 90. University of Sheffield, NLP OpenCalais http://viewer.opencalais.com/ Paste text of http://www.membranes.com/ Not easily customised/extended Domain-specific coverage varies
  • 91. University of Sheffield, NLP Zemanta ● Paste text from www.membranes.com ● The main entity of interest “Hydranautics” is missed ● Common problem with general purpose, open-domain semantic annotation tools ● Best results require bespoke customisation http://www.zemanta.com/demo/
  • 92. University of Sheffield, NLP Semantic Annotation in GATE ● Insert GATE screenshot
  • 93. University of Sheffield, NLP Automatic Semantic Annotation ● Locations (linked to DBpedia and GeoNames) – Annotate the place name itself (e.g. Norwich) with the corresponding DBpedia and GeoNames URIs – Also use knowledge of the implied reference to the levels 1, 2, and 3 sub-divisions from the Nomenclature of Territorial Units for Statistics (NUTS). – For Norwich, these are East of England (UKH – level 1), East Anglia (UKH1 – level 2), and Norfolk (UKH13 – level 3). – Similarly use knowledge to retrieve nearby places
  • 94. University of Sheffield, NLP “South Gloucestershire” Example
  • 95. University of Sheffield, NLP Semantic Annotation (2) ● Organisations (linked to DBpedia) – Names of companies, government organisations, committees, agencies, universities, and other organisations ● Dates – Absolute (e.g. 31/03/2012) and relative (yesterday) ● Measurements and Percentages – e.g. 8,596 km2 , 1 km, one fifth, 10%
  • 96. University of Sheffield, NLP The YODIE Service from GATE http://demos.gate.ac.uk/trendminer/obie/ 96 http://demos.gate.c.uk/trendminer/obie/
  • 97. University of Sheffield, NLP Evaluation: YODIE on TAC-KBP’2010 PER LOC ORG TOTAL DB Spotlight 0.97 / 0.40 0.82 / 0.46 0.86 / 0.31 0.85 / 0.39 Zemanta 0.96 / 0.84 0.89 / 0.62 0.82 / 0.57 0.90 / 0.68 LODIE 0.81 / 0.82 0.73 / 0.76 0.56 / 0.59 0.71 / 0.74 Zemanta ∩ LODIE 1.00 / 0.74 0.95 / 0.45 0.97 / 0.42 0.97 / 0.54 Zemanta U LODIE 0.94 / 0.93 0.77 / 0.76 0.72 / 0.71 0.82 / 0.81 ● Precision and Recall figures shown above ● Similar results on a set of metadata records and scientific papers in environmental science (EnviLOD) ● hard when candidate ambiguity is high
  • 98. University of Sheffield, NLP Recognition of term variants in Decarbonet
  • 99. University of Sheffield, NLP ClimaTerm Web service ● http://services.gate.ac.uk/decarbonet/term-recognition/ ● Input: a document ● Output: the text as an XML file annotated with term and URI “Cars pollute by emitting carbon dioxide.” <Document> <TermInstance=" http://www.eionet.europa.eu/gemet/concept/1148">cars </Term> pollute by emitting <Term Instance="http://reegle.info/glossary/1064">carbon dioxide</Term> </Document>
  • 100. University of Sheffield, NLP Term recognition demo
  • 101. University of Sheffield, NLP Semantic Search: An Overview
  • 102. University of Sheffield, NLP GATE Mímir • can be used to index and search over text, annotations, semantic metadata (concepts and instances) • allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations • is open source
  • 103. University of Sheffield, NLP What can GATE Mímir do that Google can't? Show me: • all documents mentioning a temperature between 30 and 90 degrees F (expressed in any unit) • all abstracts written in French on patent documents from the last 6 months which mention any form of the word “transistor” in the English abstract • the names of the patent inventors of those abstracts • all documents mentioning steel industries in the UK, along with their location
  • 104. University of Sheffield, NLP Search News Articles for Politicians born in Sheffield http://demos.gate.ac.uk/mimir/gpd/search/gus
  • 105. University of Sheffield, NLP Easily Create Your Own Custom GATE Mímir Interfaces http://demos.gate.ac.uk/pin/
  • 106. University of Sheffield, NLP MIMIR: Searching Text Mining Results ● Searching and managing text annotations, semantic information, and full text documents in one search engine ● Queries over annotation graphs ● Regular expressions, Kleene operators ● Designed to be integrated as a web service in custom end-user systems with bespoke interfaces ● Demos at http://services.gate.ac.uk/mimir/
  • 107. University of Sheffield, NLP Modelling Text Annotations and Additional Knowledge Sources
  • 108. University of Sheffield, NLP MIMIR: the technical bit ● Uses a combination of an FTS engine (MG4J), a database for indexing the semantic annotations (H2), and a knowledge repository (OWLIM/Sesame) ● Use database to indexing annotation kinds, then the additional linguistic data for words/annotations are stored in MG4J (POS, morphological root, etc) ● Semantic query: against the database/ knowledge base for the additional knowledge ● A query: decompose into the respective parts, query the appropriate component, then merge the results
  • 109. University of Sheffield, NLP Clustering and Scalability ● All searches are local to a document, so we can use clusters of semantic repositories: ● Faster indexing (less data in each repository) ● Faster searching (search space is broken into slices that are searched in parallel – like Google). ● Joining results is trivial: union of result sets. ● Simple scalability: just add more nodes. ● Search speed stays almost constant while the data increases (each individual repository has the same amount of data; there are just more repositories)
  • 110. University of Sheffield, NLP Ranking of Results ● By default documents are returned in index order ● The following ranking algorithms are also available (others can be implemented via plugins if required) – Count – Hit length – TF.IDF – BM25 • These algorithms treat annotations within the queries in the same way as words, using index level frequencies etc. • The ranking algorithm can be changed without re-indexing, allowing for easy experimentation
  • 111. University of Sheffield, NLP Scaling Up ● We annotated 1.08 million web pages using a GATE language analysis pipeline. – Documents crawled using Heritrix10 , with total content size of 57 GiB or 6.6 billions plain text characters. – The indexing server has 2 Intel Xeon 2.8GHz CPUs 11 GB of RAM, and runs 64 bit Ubuntu Linux. Indexing process was 94 hours. ● We also indexed 150 million similar web pages, using two hundred Amazon EC2 Large Instances running for a week to produce a federated index ● Mimir runs on GateCloud.net, so easy to scale up
  • 112. University of Sheffield, NLP Search for string “Harriet Harman”
  • 113. University of Sheffield, NLP Search for string “Harriet Harman says”
  • 114. University of Sheffield, NLP Search with morphological variants: Harriet Harman root:say
  • 115. University of Sheffield, NLP Replace strings with NEs: {Person} root:say
  • 116. University of Sheffield, NLP {Person} AND root:say – 11803 hits
  • 117. University of Sheffield, NLP {Person} [0..5] root: say – 5495 hits
  • 118. University of Sheffield, NLP Patent Annotation: Data Model 118
  • 119. University of Sheffield, NLP An Example Text 119
  • 120. University of Sheffield, NLP Hands-On with patent data http://demos.gate.ac.uk/mimir/patents/search/index Text. Matches plain text. Example: nanomaterial Linguistic variations of text Example: (root:nanomaterial | root:nanoparticle ) Annotation. Matches semantic annotations. Syntax: {Type feature1=value1 feature2=value2...} Example: {Abstract lang="DE"} Sequence Query. Sequence of other queries. Syntax: Query1 [n..m] Query2... Example: from {Measurement} [1..5] {Measurement}
  • 121. University of Sheffield, NLP Inclusion Queries IN Query. Hits of one query only if inside another. Syntax: Query1 IN Query2 Example: (root:nanomaterial | root:nanoparticle ) IN {Abstract} Number of times these words are mentioned in patent abstracts (as well as links to the actual documents) OVER Query. Hits of a query, only if overlapping hits of another. Syntax: Query1 OVER Query2 Example: {Abstract} OVER (root:nanomatrial | root:nanoparticle ) Finds all abstracts that contain nanomaterial(s) or nanoparticle(s)
  • 122. University of Sheffield, NLP Date restrictions ( {Abstract lang="EN"} OVER (root:nanomaterial | root:nanoparticle ) ) IN {PatentDocument date > 20050000} YYYYMMDD
  • 123. University of Sheffield, NLP Find references to literature or patents in the prior art or background sections, which contain nanomaterial/nanoparticle ({Reference type="Literature"} | {Reference type="Patent"} ) IN ({Section type="PriorArt"} | {Section type="BackgroundArt"} ) OVER (root:nanomaterial | root:nanoparticle)
  • 124. University of Sheffield, NLP Queries Using External Knowledge {Measurement spec="1 to 100 volts“} ● Uses GNU Units (http://www.gnu.org/software/units/) to convert measurements and normalise them to ISI units {Measurement spec="1 to 100 kg m^2 / A s^3"} ● Example hits: 10 volts, 2V, +20 and -20 volts; ±10V; +/- 100V; +3.3 volts {Measurement spec="1 to 100 m / s"} ● Example hits: 40 km/hr, 60m/min, 100cm/sec, 60 fps; 10 to 2000 cm/sec
  • 125. University of Sheffield, NLP Searching LOD with SPARQL • SQL-like query language for RDF data • Simple protocol for querying remote databases over HTTP • Query types • select: projections of variables and expressions • construct: create triples (or graphs) based on query results • ask: whether a query returns results (result is true/false) • describe: describes resources in the graph
  • 126. University of Sheffield, NLP SPARQL Example # Software companies founded in the US PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX dbpedia: <http://dbpedia.org/resource/> PREFIX dbp-ont: <http://dbpedia.org/ontology/> PREFIX geo-ont: <http://www.geonames.org/ontology#> PREFIX umbel-sc: <http://umbel.org/umbel/sc/> SELECT DISTINCT ?Company ?Location WHERE { ?Company rdf:type dbp-ont:Company ; dbp-ont:industry dbpedia:Computer_software ; dbp-ont:foundationPlace ?Location . ?Location geo-ont:parentFeature dbpedia:United_States . }
  • 127. University of Sheffield, NLP SPARQL Results ● Try at: http://factforge.net/sparql
  • 128. University of Sheffield, NLP Documents mentioning Persons born in Sheffield {Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield>}"} The BBC web document does not mention Sheffield at all: http://www.bbc.co.uk/news/uk-politics-11494915 The relevant text snippet is:
  • 129. University of Sheffield, NLP Try these on the BBC News demo: http://services.gate.ac.uk/mimir/gpd/search/index ● Gordon Brown [0..3] root:say ● {Person} [0..3] root:say ● {Person inst ="http://dbpedia.org/resource/Gordon_Brown"} [0..3] root:say ● {Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} [0..3] root:say ● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> }" } [0..3] root:say ● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_ %28UK%29> . ?inst :almaMater <http://dbpedia.org/resource/University_of_Edinburgh> }" } [0..3] root:say
  • 130. University of Sheffield, NLP User Interfaces for SPARQL-based Semantic Search • SPARQL-based semantic searches, tapping into LOD resources are extremely powerful • However, impossible to write by the vast majority of users • User interfaces for SPARQL-based semantic search: • Faceted searches (see ExoPatent next) • Form-based searches (see EnviLOD, PIN) • Text-based searches (natural language interfaces for querying ontologies), e.g. FREyA
  • 131. University of Sheffield, NLP Faceted Search: ExoPatent Example ● Use semantic information to expose linkages between documents based on the intersecting relationships between various sets of data from ● FDA Orange Book (23,000 patented drugs ● Unified Medical Language System (UMLS) – database of medical terms (370,000) ● Patent bibliographic information ● Search for diseases, drug names, body parts, references to literature and other patents, numeric values, ranges ● Demo uses a small set of patents (40,000)
  • 132. University of Sheffield, NLP ExoPatent: Faceted Search
  • 133. University of Sheffield, NLP Annotated Patent Document
  • 134. University of Sheffield, NLP Find all applicants who filed patents related to mitochondria, as well as drug names and active ingredients
  • 135. University of Sheffield, NLP Semantic Search over Content and Annotations http://demos.gate.ac.uk/trendminer/envilod
  • 136. University of Sheffield, NLP EnviLOD Semantic Search UI
  • 137. University of Sheffield, NLP EnviLOD Semantic Search UI
  • 138. University of Sheffield, NLP EnviLOD Semantic Search UI
  • 139. University of Sheffield, NLP Example Results
  • 140. University of Sheffield, NLP …and the underlying SPARQL Query {Sem_Location dbpediaSparql="select distinct ?inst where { {{ ?inst <http://dbpedia.org/property/north> ?loc} UNION { ?inst <http://dbpedia.org/property/east> ?loc } UNION { ?inst <http://dbpedia.org/property/west> ?loc } UNION { ?inst <http://dbpedia.org/property/south> ?loc } UNION { ?inst <http://dbpedia.org/property/northeast> ?loc } UNION { ?inst <http://dbpedia.org/property/northwest> ?loc } UNION { ?inst <http://dbpedia.org/property/southeast> ?loc } UNION { ?inst <http://dbpedia.org/property/southwest> ?loc } FILTER(REGEX(STR(?loc), "Sheffield", "i"))} }"} AND (root:"flood") Ongoing work: Use GeoSparql instead and be able to specify distances and reason with the richer information in GeoNames
  • 141. University of Sheffield, NLP Flooding in Oxford
  • 142. University of Sheffield, NLP Some Caveats…. ● EnviLOD demonstrator built in a 6 month project ● VERY limited environmental science content indexed – Some Defra, Environment Agency, Scottish Government reports ● PDFs of the articles are not connected to
  • 143. University of Sheffield, NLP We don't just have to look at politicians saying and measuring things ● If we first process the text with other NLP tools such as sentiment analysis, we can also search for positive or negative documents ● Or positive/negative comments about certain people/topics/things/companies ● In the Decarbonet project, we're looking at people's opinions about topics relating to climate change, e.g. fracking ● We could index on the sentiment annotations too ● Other people are using the combination of opinion mining and MIMIR to look at e.g. customer feedback about products / companies on a huge corpus
  • 144. University of Sheffield, NLP A positive tweet
  • 145. University of Sheffield, NLP A negative tweet
  • 146. University of Sheffield, NLP A Sarcastic Tweet
  • 147. University of Sheffield, NLP Explicitly Choosing The Search Classes
  • 148. University of Sheffield, NLP Choosing A Specific Instance
  • 149. University of Sheffield, NLP What diseases are in these documents?
  • 150. University of Sheffield, NLP What Pathogens?
  • 151. University of Sheffield, NLP Disease vs Disease Co-ocurrences
  • 152. University of Sheffield, NLP Diseases vs Pathogens
  • 153. University of Sheffield, NLP Summary ● Text mining is a very useful pre-requisite for doing all kinds of more interesting things, like semantic search ● Semantic web search allows you to do much more interesting kinds of search than a standard text-based search ● Text mining is hard, so it won't always be correct, however ● This is especially true on lower quality text such as social media ● On the other hand, social media has some of the most interesting material to look at ● Still plenty of work to be done, but there are lots of tools that you can use now and get useful results ● We run annual GATE training courses where you can spend a whole week learning all this and more!
  • 154. University of Sheffield, NLP Acknowledgements and further information ● Research partially supported by the European Union/EU under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and TrendMiner (287863) ● Semantic Search over Documents and Ontologies (2014) K Bontcheva, V Tablan, H Cunningham. Bridging Between Information Retrieval and Databases, 31-53 ● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open-source semantic search framework for interactive information seeking and discovery. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2014 ● These and other relevant papers on the GATE website: http://gate.ac.uk/gate/doc/papers.html ● Annual GATE training course in June in Sheffield:https://gate.ac.uk/family/training.html ● Download gate: http://gate.ac.uk/download