These slides describe the work done at the CLARIN Talk of Europe Creative Camp, in which groups from various countries worked with European Parliament speeches.
Our work covers term extraction, term organisation and term linking between the European Parliament and UK Parliament data sets.
Information Extraction from EuroParliament and UK Parliament data
1. Computational Support in eHumanities
Proof of concept produced during
CLARIN’s Creative Camp
Talk of Europe
Wim Peters
Adam Funk
University of Sheffield, UK
w.peters@sheffield.ac.uk
a.funk@sheffield.ac.uk
2. CLARIN’s Creative Camp
Talk of Europe
• Our main aim in this event:
• Term identification and structuring
in ToE and UK Parliament data
• Linking ToE and UK Parliament terminology
• Automatic enrichment of ToE data set
• http://linkedpolitics.ops.few.vu.nl/home
3. Data set 1
• Talk of Europe data set
• Plenary debates of the European Parliament
as Linked Open Data
• http://linkedpolitics.ops.few.vu.nl/
10. Output
• For terms in each data set:
– Terms
– Term hierarchies
– Term clusters
– Sentence-based sentiment context
• Between data sets:
– Relatedness between terms
11. • To identify and extract relevant information from the source
material, we use the GATE architecture for the production of
semantic metadata in the form of text annotations.
• GATE is a framework for language engineering applications, which
supports efficient and robust text processing including functionality
for both manual and automatic annotation.
• It is highly scalable and has been applied in many large text
processing projects.
• It is an open source desktop application written in Java that
provides a user interface for professional linguists and text
engineers to bring together a wide variety of natural language
processing tools and apply them to a set of documents.
General Architecture for Text
Engineering
12. • General Architecture for Text Engineering (GATE)
• open source framework which
supports plug-in NLP components
to process a corpus of text.
http://gate.ac.uk/
Free system download and training courses
LEX 2014, Ravenna, Italy
13. Advantages
• Reproducibility
• Reusability
• Flexibility
• Customisability to scholarly requirements
regarding research questions and analysis
methodology
• http://www.gate.ac.uk
15. Term Extraction
• TermRaider
• http://www.dcs.shef.ac.uk/~wim/termraider.html
• automatically provides domain-specific noun phrase
term candidates from a text corpus together with a
statistically derived termhood score.
• Possible terms are filtered by means of a multi-word-
unit grammar that defines the possible sequences of
part of speech tags constituting noun phrases.
• It computes various termhood scores such as Kyoto
Domain Relevance and term frequency/inverse document
frequency (TF-IDF). The scores indicate the salience of
each term candidate for each document in the corpus.
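A minimal sketch of a TF-IDF score per term candidate and document. It uses the textbook formulation (raw term frequency times log of inverse document frequency); the exact variant computed by TermRaider is not stated on the slide, and the function name and toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of lists of term candidates, one list per document.

    Returns one {term: score} dict per document (textbook TF-IDF,
    assumed here for illustration only).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw frequency of each term in this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


# Toy data: three "documents", already reduced to term candidates.
docs = [
    ["human right", "fight", "human right"],
    ["fight", "pension"],
    ["fight", "internet"],
]
scores = tf_idf(docs)
```

A term that occurs in every document ("fight" here) gets a score of zero, while a term concentrated in one document scores highest there.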
16. KYOTO domain relevance score
• df × (1 + nh)
– df: number of documents in the current corpus
containing the term
– nh: number of hyponymic term candidates
• W. Bosma and P. Vossen. Bootstrapping language-neutral term extraction.
In 7th Language Resources and Evaluation Conference (LREC), Valletta,
Malta (2010)
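A small sketch of the raw score df × (1 + nh) from the formula on this slide. The function name, the toy documents and the precomputed hyponym table are assumptions for illustration. Other slides quote scores on a 0-100 scale; how raw values are normalised is not stated, so the sketch returns the raw product.

```python
def kyoto_relevance(term, docs, hyponyms):
    """Raw score df * (1 + nh) for one term candidate.

    docs: list of sets of term candidates (one set per document).
    hyponyms: {term: set of hyponymic term candidates}; assumed to be
    computed beforehand, e.g. by head phrase matching.
    """
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    nh = len(hyponyms.get(term, ()))            # number of hyponymic candidates
    return df * (1 + nh)


# Toy data (hypothetical).
docs = [
    {"fight", "fight against corruption and organised crime"},
    {"fight", "fight against serious crime and terrorism", "control"},
    {"efficient control", "control"},
]
hyponyms = {
    "fight": {
        "fight against corruption and organised crime",
        "fight against serious crime and terrorism",
    },
    "control": {"efficient control"},
}
fight_score = kyoto_relevance("fight", docs, hyponyms)
control_score = kyoto_relevance("control", docs, hyponyms)
leaf_score = kyoto_relevance("efficient control", docs, hyponyms)
```

Terms that are both widespread and head many more specific candidates rank highest; a candidate with no hyponyms simply scores its document frequency.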
18. Term Relatedness 1: Hyponyms
(rdf: skos:narrowerTransitive)
• Hierarchical relations between terms based on head phrase matching
• fight
– fight against all form of intolerance
• fight
– fight against serious crime and terrorism
• fight
– fight against all form of intolerance and discrimination
• fight
– fight against illegal drug and the organised crime
• fight
– fight against corruption and organised crime
• control
– efficient control
• efficient control of EU fund
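The examples above can be approximated with a simple rule: a term is broader than another if it matches the end of the other term's head phrase, taken here as the words before the first preposition. This is a sketch under that assumption, not the grammar actually used; the preposition list and function names are hypothetical.

```python
# Illustrative, incomplete preposition list (assumption).
PREPOSITIONS = {"against", "of", "for", "in", "on", "to", "with"}


def head_phrase(term):
    """Tokens before the first preposition, e.g. 'efficient control'."""
    tokens = term.split()
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS:
            return tokens[:i]
    return tokens


def is_broader(broader, narrower):
    """True if `broader` matches the end of the head phrase of `narrower`."""
    b = broader.split()
    head = head_phrase(narrower)
    return (
        bool(b)
        and broader != narrower
        and len(b) <= len(head)
        and head[-len(b):] == b
    )


def hierarchy_triples(terms):
    """(broader, skos:narrowerTransitive, narrower) triples, sorted."""
    return sorted(
        (a, "skos:narrowerTransitive", b)
        for a in terms
        for b in terms
        if is_broader(a, b)
    )


terms = [
    "fight",
    "fight against corruption and organised crime",
    "control",
    "efficient control",
    "efficient control of EU fund",
]
triples = hierarchy_triples(terms)
```

Because the relation is transitive, "control" is linked both to "efficient control" and to "efficient control of EU fund".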
19. Term relatedness 2: Clusters
• Compute Pointwise Mutual Information
– Pair-wise association score for terms that co-occur
within a context window (in our case sentences)
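A minimal sketch of sentence-level PMI, using the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))) with probabilities estimated as the fraction of sentences containing the term or the pair. The function name and toy sentences are assumptions, and the sketch does not rescale scores to the 0-100 range used for thresholds elsewhere in the slides.

```python
import math
from collections import Counter
from itertools import combinations


def sentence_pmi(sentences):
    """sentences: list of sets of terms found in each sentence.

    Returns {(a, b): pmi} for every co-occurring pair, with a < b.
    """
    n = len(sentences)
    term_count = Counter()
    pair_count = Counter()
    for terms in sentences:
        terms = set(terms)
        term_count.update(terms)  # sentences containing each term
        pair_count.update(combinations(sorted(terms), 2))  # co-occurrences
    return {
        (a, b): math.log2((c / n) / ((term_count[a] / n) * (term_count[b] / n)))
        for (a, b), c in pair_count.items()
    }


# Toy data (hypothetical): four sentences reduced to their terms.
sentences = [
    {"human right", "freedom"},
    {"human right", "freedom"},
    {"pension"},
    {"human right", "pension"},
]
pmi = sentence_pmi(sentences)
```

Pairs that co-occur more often than chance would predict get a positive score, and pairs that co-occur less often get a negative one.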
20. Cluster creation
• Simple clique algorithm
• https://en.wikipedia.org/wiki/Cluster_analysis
• Each cluster member (a term candidate with a Kyoto
Domain Relevance score > 70/100) is connected to all
other cluster members by means of a PMI score >
70/100
– Result: “statistical thesaurus”
– strongly associated groups of words
– Use: enhance data exploration by expanding
searches with related terms (query expansion)
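One way to read the clique condition is as a graph problem: keep terms above the relevance threshold, connect pairs above the PMI threshold, and list the maximal cliques. The sketch below uses the basic Bron-Kerbosch algorithm as a stand-in, since the slide does not name the exact algorithm; function names and all scores are invented for illustration.

```python
def build_graph(relevance, pmi, threshold=70):
    """Adjacency sets over terms whose relevance and pairwise PMI exceed threshold."""
    nodes = {t for t, s in relevance.items() if s > threshold}
    adj = {t: set() for t in nodes}
    for (a, b), s in pmi.items():
        if s > threshold and a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Basic Bron-Kerbosch (no pivoting); returns a list of frozensets."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), set(adj), set())
    return cliques


# Toy scores on a 0-100 scale (hypothetical).
relevance = {"human right": 90, "freedom": 85, "internet": 80,
             "pension": 75, "word": 10}
pmi = {
    ("human right", "freedom"): 88,
    ("human right", "internet"): 81,
    ("freedom", "internet"): 76,
    ("human right", "pension"): 72,
    ("freedom", "pension"): 40,
    ("human right", "word"): 95,
}
clusters = [c for c in maximal_cliques(build_graph(relevance, pmi)) if len(c) > 1]
```

With this reading a term can belong to several clusters, which fits the presentation of multiple clusters including the same term.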
21. Clusters including “human rights”
ToE data
(manually highlighted elements indicative of contrast with UK
perspective)
• end, vote, commission, network, programme, funding, proposal, report, text, level, service, freedom, fund, concern, president, access, basis, internet, enforcement, example, instrument, plastic, money, EU policy
• recommendation, position, level, change, community, right, part, approach, discussion, dossier, regard, opinion, policy, force, negotiation, account, public, opportunity, fight
22. Clusters including “human rights”
UK data
(manually highlighted elements indicative of contrast with EU
perspective)
• foreign, press, answer, election
• realise, MPs, politician, consequence, claim, interest, lesson, pension, employment
• incentive, accountability, movement, treatment, word, young people, assessment
23. Term Relatedness 3: Links between ToE
and UK terms
(rdf: skos:related)
• For now the link is limited to orthographic
overlap of terms’ canonical forms
– lemmatised
– decapitalised
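The linking step can be sketched as an exact match on canonical forms. The lemma table below is a toy stand-in for a real lemmatiser, and the function names and example terms are assumptions.

```python
# Toy lemma table standing in for a real lemmatiser (assumption).
LEMMAS = {"rights": "right", "funds": "fund", "crimes": "crime"}


def canonical(term):
    """Lowercase the term and map each token to its lemma."""
    return " ".join(LEMMAS.get(tok, tok) for tok in term.lower().split())


def link_terms(toe_terms, uk_terms):
    """(ToE term, skos:related, UK term) for identical canonical forms."""
    uk_index = {}
    for t in uk_terms:
        uk_index.setdefault(canonical(t), []).append(t)
    return [
        (toe, "skos:related", uk)
        for toe in toe_terms
        for uk in uk_index.get(canonical(toe), [])
    ]


links = link_terms(["Human Rights", "EU funds"], ["human right", "pension"])
```

Only terms whose canonical forms are identical are linked, so "EU funds" stays unlinked here because no UK term canonicalises to "eu fund".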
24. Sentiment Context for Terminology
• Sentences have a sentiment value of positive,
negative or neutral
• This allows the exploration of the emotional
load of the context in which terminology is
used
26. Why RDF output?
• Standard knowledge representation
• Queryable in SPARQL
• Slots additional knowledge into Talk of Europe
data model
27. Coverage of results
• Proof of concept
• EuroParliament
– 2 months (6546 speeches)
– 7900 term candidates
• UK Parliament
– 1 month (January 2014, 7571 UK speeches)
– 28000 term candidates
• Around 750000 triples
• 2900 relations between EU and UK terminology
28. Usability of data and methodology
• Assists further exploration of
parliamentarians’ styles, priorities and
perspectives through term usage and context
– E.g. compare cluster members of terms in order to
detect contrastive perspectives between ToE and
UK terminological use
– (see “human rights” example)
• Flexible methodology, re-usable on other data