Text Encoding and Enrichment for Linguistic 
Analysis: Archives on the policy of Armaments 
within Western European Union 
Centre Virtuel de la Connaissance sur l’Europe (CVCE), Luxembourg 
Florentina Armaselu (DHLab) 
Verónica Martins (EIS)
Catherine Jones (DHLab)
Exploring Historical Sources with Language Technology: Results and Perspectives
Huygens ING, The Hague, December 8-9, 2014
Summary
1. About the CVCE
2. Overview of the WEU-DIPLO project
3. XML-TEI Encoding
4. Named Entity Recognition (NER)
5. Corpus Analysis
6. Future work
7. References
CVCE - Centre Virtuel de la 
Connaissance sur l'Europe 
An interdisciplinary centre of e-research and documentation on the European integration process.
Two key areas of activity:
- Interdisciplinary research on the European integration process in the 20th and 21st centuries;
- Research, development and integration of digital tools and methods to support advancement in European Integration Studies.
Overview of the WEU-DIPLO project 
1. Goal: XML-TEI encoding, corpus analysis and Web publication of institutional documents 
of the W.E.U. (Western European Union): 
• Topics: armament production, standardization, control in the period from 1954 to 1982; 
• Source: Archives nationales de Luxembourg, W.E.U. collection.
2. Format: 
• digitized versions (JPEG) of typewritten materials (one file per page). 
3. Size:
                   Number of documents            Number of pages
Category           Total   EN   FR   FR proc.*    Total   EN   FR   FR proc.*
Note                  89   43   46   34             395  191  204  144
Minutes               30   15   15   15             256  138  118  118
Memorandum             3    1    2    2              16    7    9    9
Study                  2    0    2    1              12    0   12    8
Discourse              1    0    1    0               4    0    4    0
Draft protocol         2    1    1    0               4    2    2    0
Total                127   60   67   52             687  338  349  279
*proc. = processed
Overview of the WEU-DIPLO project 
5. Corpus Selection 
• Form and content
o Form
- OCR experiment conditions: need to diversify the form of the documents;
- Bilingual.
o Content
- The archives' 30-year rule corresponds with the 30-year time period of the corpus (1954-1982): selection of documents from the 1950s, 1960s, 1970s and 1980s;
- Case study: armaments production and control within the WEU;
- Selection based on the research question and more specific topics: French and British positions, WEU's role/competences, nature of the debates within the Council/Standing Armaments Committee;
- Need for the documents to cover all the available material categories (minutes, notes, memorandum…).
• Resources
o Limited time and human resources.
Overview of the WEU-DIPLO project: examples (memorandum, minutes, study, notes) ©WEU-UEO
Overview of the WEU-DIPLO project: workflow
XML-TEI Encoding: WEU-DIPLO
Why TEI encoding?
• structured and retrievable metadata (title, author, place of origin of the document, availability date, confidentiality status, document reference, etc.);
• clear representation of the document structure (header, footer, divisions: section, subsection, paragraph, line);
• identification of semantic elements (discourse of country representatives; entities: names of organisations, persons, places, functions, dates, etc.).
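The three benefits listed above can be illustrated with a minimal sketch in Python. The TEI fragment below is a hypothetical sample written for this illustration, not one of the project's files; the element names (`titleStmt`, `div`, `p`, `orgName`, `placeName`) are standard TEI, but the project's actual schema may differ.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# Hypothetical sample document (assumption: not taken from the WEU-DIPLO corpus).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical note on armaments control</title></titleStmt>
      <publicationStmt><p>Illustrative sample only</p></publicationStmt>
      <sourceDesc><p>Archives nationales de Luxembourg, W.E.U. collection</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="section">
        <p>Le représentant de la <placeName>France</placeName> rappelle le rôle de l'<orgName>Agence pour le contrôle des armements</orgName>.</p>
      </div>
    </body>
  </text>
</TEI>"""

ENTITY_TAGS = {"orgName", "persName", "placeName", "date"}


def summarize(tei_xml):
    """Return the title (metadata), paragraph count (structure) and entities (semantics)."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    body = root.find(".//tei:body", TEI_NS)
    paragraphs = list(body.iter("{%s}p" % TEI_URI))
    entities = []
    for el in body.iter():
        local_name = el.tag.split("}")[-1]  # strip the namespace prefix
        if local_name in ENTITY_TAGS:
            entities.append((local_name, "".join(el.itertext())))
    return {"title": title, "paragraphs": len(paragraphs), "entities": entities}


result = summarize(SAMPLE)
```

Because metadata, structure and entities each live in their own elements, each can be retrieved with a single targeted query rather than by searching raw text.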
XML-TEI Encoding: WEU-DIPLO metadata, structure 
XML-TEI Encoding: WEU-DIPLO semantics 
Named Entity Recognition (NER): GATE
Named Entity Recognition (NER/GATE): WEU-DIPLO 
Named Entity Recognition (NER/XML-TEI): WEU-DIPLO 
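The step from recognised entities to inline XML-TEI markup can be sketched as follows. This is only an illustration of the idea: it uses a small hand-made gazetteer (a hypothetical name list) and plain string matching, and it does not call GATE or reproduce the project's actual pipeline.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer mapping entity names to TEI element names (assumption).
GAZETTEER = {
    "Agence pour le contrôle des armements": "orgName",
    "Comité permanent des armements": "orgName",
    "Luxembourg": "placeName",
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Wrap every gazetteer match in its TEI element and XML-escape the rest."""
    # Longest names first, so a long name wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[position:match.start()]))
        tag = gazetteer[match.group(0)]
        pieces.append("<%s>%s</%s>" % (tag, escape(match.group(0)), tag))
        position = match.end()
    pieces.append(escape(text[position:]))
    return "".join(pieces)


tagged = tag_entities(
    "Archives nationales de Luxembourg: note de l'Agence pour le contrôle des armements."
)
```

A real NER component would also use context rules to disambiguate names; a gazetteer lookup alone only finds exact matches.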
Corpus Analysis: TXM - Textométrie
Corpus Analysis - TXM: WEU-DIPLO 
Corpus WEU-DIPLO: 52 documents, French; 6,905 items for 101,965 occurrences (content and metadata); 6,417 items for 74,287 occurrences (content only).
Lexicon (functional/lexical forms; lemmatised, POS-tagged, lower case).
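The distinction between items (distinct forms) and occurrences (running words) can be shown with a minimal Python sketch. The tokenisation here is a plain regular expression on a made-up phrase, which is an assumption for illustration; TXM's own tokenisation, lemmatisation and POS tagging are more elaborate.

```python
import re
from collections import Counter


def lexicon(text):
    """Count lower-cased word forms in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Made-up sample phrase (assumption: not a quotation from the corpus).
sample = "Le contrôle des armements et la standardisation des armements"
lex = lexicon(sample)

items = len(lex)                 # number of distinct forms
occurrences = sum(lex.values())  # number of running words
```

On this sample, `des` and `armements` each occur twice, so the lexicon holds fewer items than there are occurrences.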
Corpus Analysis - TXM: WEU-DIPLO 
• WEU-DIPLO: content – Index (by type of entity) 
Corpus Analysis: WEU-DIPLO 
• Participants description 
Corpus Analysis – TXM: WEU-DIPLO 
Partition: representatives’ discourse by country/organisation 
Corpus Analysis - TXM: WEU-DIPLO 
• Specificity score (log10):
o overuse (+) or deficit (-) of a form in a part/subcorpus, as compared with the parent corpus and a threshold.
• Statistical model (Lafon, 1980), a hypergeometric distribution:
\[ \mathrm{Prob}(X = k) = \frac{\binom{f}{k}\,\binom{T-f}{t-k}}{\binom{T}{t}} \]
where: T = number of occurrences in the parent corpus;
t = number of occurrences in the part/subcorpus;
f = frequency of a form F in the parent corpus;
X = random variable taking the values 0, 1, 2, …, k, …, f;
Prob(X = k) = probability that F occurs k times in the part/subcorpus of size t.
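The model described above can be sketched in Python using the symbols T, t, f and k as defined on the slide. The probability function follows the hypergeometric distribution. The signed score, however, is an assumption about how such a score is commonly derived (log10 of a tail probability, with the sign chosen by comparing k to the expected count), and TXM's exact computation may differ.

```python
from math import comb, log10


def prob(k, T, t, f):
    """Probability that a form with f occurrences among T occurs exactly k times in a part of size t."""
    if k < max(0, t - (T - f)) or k > min(f, t):
        return 0.0
    return comb(f, k) * comb(T - f, t - k) / comb(T, t)


def specificity(k, T, t, f):
    """Signed log10 score: positive for overuse, negative for deficit (assumed convention)."""
    expected = f * t / T
    if k >= expected:
        # Overuse: probability of observing k or more occurrences.
        tail = sum(prob(i, T, t, f) for i in range(k, min(f, t) + 1))
        return -log10(tail)
    # Deficit: probability of observing k or fewer occurrences.
    tail = sum(prob(i, T, t, f) for i in range(0, k + 1))
    return log10(tail)


# Toy numbers (assumption): corpus of 10 occurrences, part of 5, form frequency 4.
total_probability = sum(prob(k, 10, 5, 4) for k in range(0, 5))
overuse_score = specificity(4, 10, 5, 4)
deficit_score = specificity(0, 10, 5, 4)
```

A form would then be reported as specific to a part when the absolute value of its score exceeds the chosen threshold.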
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by part of speech 
Corpus Analysis - TXM: WEU-DIPLO
Specificities: by part of speech (Verb) 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities (verb) by representative and mode/tense (Grevisse, 1993).
Representative: France
Mode/tense: CONDITIONAL: attenuation (wish, advice, necessity, certainty)
Forms: serait (37); aurait (19); seraient (17); pourrait (16); devrait (13); voudrais (11); …
Examples: le gouvernement français serait partisan d'accélérer …; cette réunion se déroulerait selon la formule …; qu'il ne faudrait pas trop ralentir l'opération envisagée …
Representative: UK delegation
Mode/tense: PAST PARTICIPLE: passive/past perfect, adjectives
Forms: été (20); donné (6); destinés (5); placées (5); établi (4); révisé (4); chargé (3); …
Examples: le produit final devrait être mis à la disposition de …; les accords auxquels elles ont abouti n'ont pas encore donné de résultats suffisamment …; projectiles nucléaires destinés à ces armes …
Representative: C.P.A. (Comité permanent des armements)
Mode/tense: SIMPLE PAST: narration, succession of past actions
Forms: exposa (1); fut (1); intervint (1); posa (1); prirent (1); soutinrent (1); …
Examples: une première proposition (belge) tendit à la réunion des hautes autorités …; luxembourg et france soutinrent, sans insistance, ce point de vue …; les pays-bas prirent la même attitude …
Representative: A.C.A. (Agence pour le contrôle des armements)
Mode/tense: IMPERFECT: description, explanation
Forms: était (11); avait (4); étaient (3); présidait (2); affectait (1); ajoutait (1); dépasseraient (1); …
Examples: le retrait des forces françaises de l'organisation intégrée de l'o.t.a.n. n'affectait nullement l'exécution des tâches …; il est bien évident que, s'il était adopté, il cesserait d'être inexact …; il résultait de cette étude que "le problème du stockage des armes nucléaires …
Representative: Conseil de …
Mode/tense: FUTURE: actions/goals to be accomplished
Forms: sera (11); seront (7); pourra (5); devront (3); pourront (3); auront (2); donnera (2); …
Examples: les principes généraux ci-après devront gouverner nos travaux …; cela nous fournira la transition entre les sections a et b de notre mandat …; le conseil procédera à un examen attentif de la …
Corpus Analysis - TXM: WEU-DIPLO 
Concordances: use of conditional, French representatives/name/document 
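A concordance of this kind, often called keyword in context (KWIC), can be sketched in a few lines of Python. The function and variable names are illustrative assumptions; the token sequence is the first example phrase quoted in the verb table above, and the form list is taken from the conditional forms reported there.

```python
def kwic(tokens, is_hit, width=3):
    """Return (left context, keyword, right context) for every token accepted by is_hit."""
    lines = []
    for i, token in enumerate(tokens):
        if is_hit(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Conditional forms reported for the French representatives.
CONDITIONAL_FORMS = {"serait", "aurait", "seraient", "pourrait", "devrait", "voudrais"}

tokens = "le gouvernement français serait partisan d'accélérer".split()
concordance = kwic(tokens, lambda token: token in CONDITIONAL_FORMS, width=2)
```

In a tool such as TXM the hits would be selected by a query on annotated properties (for example part of speech) rather than by a fixed list of forms.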
Corpus Analysis - TXM: WEU-DIPLO 
Context: conditional forms (French representative/Beaumarchais), vo-CR-73-10_FR 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by lemma, representatives partition (selection), group (contrôle)
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by lemma, representatives (selection), group (contrôle) - Discussion
• Predictable results:
o Positive specificity (overuse) in the discourse of the A.C.A. (Agence pour le contrôle des armements):
- contrôle/contrôler/contrôlable - inspection - vérification/vérifier;
- limitation/limite/limiter - restriction/restreindre/restrictif.
(A.C.A.'s role)
o Negative specificity (scarcity) in the discourse of the UK representatives/delegation:
- arme/armement nucléaire/abc/atomique.
(interested in the topic but not mainly concerned)
• Less predictable results:
o Negative specificity in the discourse of the UK and French representatives:
- contrôle/contrôler/contrôlable - inspection - vérification/vérifier;
- A.C.A. - agence pour le contrôle des armements.
(possible cause: the selection of documents in the sample?)
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by lemma, representatives partition (selection), group (standardisation)
Corpus Analysis - TXM: WEU-DIPLO 
Cooccurrences: for ‘standard*’ sorted by co-frequency 
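Counting co-occurrents of forms matching 'standard*' and sorting them by co-frequency can be sketched as follows. The window size, function name and sample phrase are assumptions for illustration; this sketch only counts co-frequency and does not compute any association score that a corpus tool may report alongside it.

```python
from collections import Counter


def cooccurrences(tokens, prefix, window=2):
    """Count tokens found within `window` positions of any token starting with `prefix`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.startswith(prefix):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    # most_common() sorts by descending co-frequency.
    return counts.most_common()


# Made-up sample phrase (assumption: not a quotation from the corpus).
tokens = "la standardisation des armements et les standards des armements".split()
ranked = cooccurrences(tokens, "standard", window=2)
```

On this sample, `des` and `armements` come first because each falls inside the window of both matching forms.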
Corpus Analysis - TXM: WEU-DIPLO 
Concordances: ‘standard*’ – ‘armements’ 
Corpus Analysis: WEU-DIPLO 
Partition: representatives’ discourse (by name) 
Corpus Analysis - TXM: WEU-DIPLO 
Lexical profile (Guyard, 1981): positive specificities (>2.0), lemmas, names partition 
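Columns: noun; proper noun; adjective; verb; adverb (one row per name; "-" marks an empty cell).
Row 1: nouns: commun; arme; accord; recensement; mise; choix; point; centre; opération; déclaration; système d'armes | proper nouns: - | adjectives: commun; équitable; secret | verbs: procéder | adverbs: -
Row 2: nouns: pays; discussion; britannique; partenaire | proper nouns: - | adjectives: bilatéral; analogue; final | verbs: offrir; devoir
Row 3: nouns: ministre belge; avis; gouvernement français; idée; opération; désir | proper nouns: - | adjectives: autonome | adverbs: trop; pas; ne
Row 4: nouns: britannique; doctrine | proper nouns: M. Van | adjectives: - | verbs: exister | adverbs: -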
Corpus Analysis - TXM: WEU-DIPLO 
Lexical profile (Guyard, 1981): positive specificities, lemmas, names partition - Discussion 
• Chauvel (FR) / Lloyd (UK) : 
o Commun (rank 1) / bilatéral (rank 1) 
• production en commun; programme (régional), intérêt, défense, fonds commun(e)(s) 
• base, discussion, arrangements, comités directeurs bilatéra(l)(le)(ux) 
• Destremau (FR) / Callaghan (UK): 
o C.P.A – Comité permanent des armements (specificity score 1.44) / Eurogroupe (rank 1) 
(French attempts to revive CPA / UK’s Atlanticist preference - creation of Eurogroup in 1968 which did 
not include France). 
• Why is standard(isation)(iser) not specific to any individual speaker's discourse (by name), although it shows high specificity for the French representatives' discourse as a whole?
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: standard(isation)(iser) lemmas, names partition 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: standard(isation)(iser) lemmas, documents subtypes partition 
Future work 
1. Corpus analysis and interpretation (in progress).
2. Choice and adaptation of a Web publication platform (in progress):
• EVT (Edition Visualization Technology);
• XTF;
• TEI Boilerplate.
References
• GATE:
• Grevisse, Le bon usage. Grammaire française, Duculot, Paris, 1993.
• Guyard, Marie-Renée. Spécificités d'auteurs dans Le Surréalisme au service de la Révolution. In: Mots, mars 1981, N°2. Qu'est-ce que le vocabulaire spécifique d'un texte politique? pp. 95-122.
• Lafon, Pierre. Sur la variabilité de la fréquence des formes dans un corpus. In: Mots, octobre 1980, N°1. Saussure, Zipf, Lagado, des méthodes, des calculs, des doutes et le vocabulaire de quelques textes politiques. pp.
• TEI:
• TXM:

Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union

  • 1. Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union Centre Virtuel de la Connaissance sur l’Europe (CVCE), Luxembourg Florentina Armaselu (DHLab) Verónica Martins (EIS) - Catherine Jones (DHLab) - 1 Exploring Historical Sources with Language Technology: Results and Perspectives Huygens ING , The Hague, December 8, 9, 2014
  • 2. Summary 1. About the CVCE 2. Overview of the WEU-DIPLO project 3. XML-TEI Encoding 4. Named Entity Recognition (NER) 5. Corpus Analysis 6. Future work 7. References Summary 2
  • 3. CVCE - Centre Virtuel de la Connaissance sur l'Europe An interdisciplinary centre of e-research and documentation on the European Integration Process. two key areas of activity: - Interdisciplinary research on the European integration process in the XX and XXI centuries; - Research, development and integration of digital tools and methods to support advancement in European Integration Studies. About the CVCE 3
  • 4. Overview of the WEU-DIPLO project 1. Goal: XML-TEI encoding, corpus analysis and Web publication of institutional documents of the W.E.U. (Western European Union): • Topics: armament production, standardization, control in the period from 1954 to 1982; • Source: Archives nationales de Luxembourg, W.E.U collection. 2. Format: • digitized versions (JPEG) of typewritten materials (one file per page). 3. Size: Category Number of documents Note 89 43 46 34 395 191 204 144 Minutes 30 15 15 15 256 138 118 118 Memorandum 3 1 2 2 16 7 9 9 Study 2 0 2 1 12 0 12 8 Discourse 1 0 1 0 4 0 4 0 Draft protocol 2 1 1 0 4 2 2 0 Total 127 60 67 52 687 338 349 279 *proc. = processed Number of documents per language Number of pages Number of pages per language EN FR FR proc.* EN FR FR proc.* Overview WEU-DIPLO 4
  • 5. Overview of the WEU-DIPLO project
    5. Corpus Selection
       • Form and content
         o Form: OCR experiment conditions (need to diversify the form of the documents); bilingual.
         o Content:
           - the archives’ 30-year rule corresponds with the 30-year time period of the corpus (1954-1982): selection of documents from the 1950s, 1960s, 1970s and 1980s;
           - case study: armaments production and control within WEU;
           - selection based on the research question and more specific topics: French and British positions, WEU’s role/competences, nature of the debates within the Council/Standing Armament Committee;
           - need for the documents to cover all the available material categories (minutes, notes, memorandum…).
       • Resources: limited time and human resources.
  • 6. Overview of the WEU-DIPLO project: examples (©WEU-UEO): Memorandum, Minutes, Study, Notes
  • 7. Overview of the WEU-DIPLO project: workflow
  • 8. XML-TEI Encoding: WEU-DIPLO. Why TEI encoding? • structured and retrievable metadata (title, author, place of origin of the document, availability date, confidentiality status, document reference, etc.); • clear representation of the document structure (header, footer, divisions: section, subsection, paragraph, line); • identification of semantic elements (discourse of country representatives; entities: names of organisations, persons, places, functions, dates, etc.).
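To make the three benefits above concrete, here is a minimal sketch that reads metadata and tagged entities back out of a small TEI fragment. The element names (`teiHeader`, `titleStmt`, `orgName`, `placeName`, `persName`, `date`) are standard TEI, but the sample text is invented and the exact markup chosen by WEU-DIPLO is an assumption, since the slides showing the project's encoding are not part of this transcript.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, NOT taken from the WEU-DIPLO corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample minutes</title></titleStmt>
      <publicationStmt><p>Unpublished sample</p></publicationStmt>
      <sourceDesc><p>Typewritten original</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="section">
        <p>The <orgName>Council</orgName> met in <placeName>London</placeName>
        on <date when="1955-05-07">7 May 1955</date>; <persName>M. Dupont</persName> spoke first.</p>
      </div>
    </body>
  </text>
</TEI>"""


def extract(xml_text):
    """Return (title, [(element name, text), ...]) from a TEI string."""
    root = ET.fromstring(xml_text)
    # Metadata lives in the header and is addressable by path.
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    entities = []
    # Semantic elements are found by element name, grouped by entity type.
    for tag in ("orgName", "placeName", "persName", "date"):
        for el in root.iterfind(f".//tei:{tag}", TEI_NS):
            entities.append((tag, "".join(el.itertext()).strip()))
    return title, entities


title, entities = extract(SAMPLE)
print(title)
for tag, text in entities:
    print(tag, "->", text)
```

Because every entity is an element rather than a typographic convention, the same query works unchanged across all documents in a corpus.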
  • 9. XML-TEI Encoding: WEU-DIPLO metadata, structure
  • 10. XML-TEI Encoding: WEU-DIPLO semantics
  • 11. Named Entity Recognition (NER): GATE
  • 12. Named Entity Recognition (NER/GATE): WEU-DIPLO
  • 13. Named Entity Recognition (NER/XML-TEI): WEU-DIPLO
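The step from NER output to XML-TEI can be sketched as turning stand-off annotations (character offsets plus a label) into inline TEI elements. This is an illustration only: it does not use GATE's API, and the label names and the `LABEL_TO_TEI` mapping are assumptions, not the project's actual configuration.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from NER labels to TEI element names.
LABEL_TO_TEI = {
    "Person": "persName",
    "Organization": "orgName",
    "Location": "placeName",
    "Date": "date",
}


def to_tei_inline(text, spans):
    """Wrap each (start, end, label) span of `text` in its TEI element.

    Spans are character offsets; overlapping spans are rejected because
    inline XML cannot represent them without extra conventions.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        tag = LABEL_TO_TEI[label]
        out.append(escape(text[cursor:start]))  # plain text before the entity
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))  # remaining plain text
    return "".join(out)


sentence = "M. Chauvel addressed the Council in Paris."
annotations = [
    (0, 10, "Person"),
    (sentence.index("Council"), sentence.index("Council") + 7, "Organization"),
    (sentence.index("Paris"), sentence.index("Paris") + 5, "Location"),
]
print(to_tei_inline(sentence, annotations))
```

Escaping the text segments keeps characters such as `&` from breaking the resulting XML.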
  • 14. Corpus Analysis: TXM (Textométrie)
  • 15. Corpus Analysis - TXM: WEU-DIPLO. Corpus WEU-DIPLO: 52 French documents; 6,905 items (distinct forms) for 101,965 occurrences (content and metadata); 6,417 items for 74,287 occurrences (content only). Lexicon of functional and lexical forms (lemmatised, POS-tagged, lower case).
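The distinction between items and occurrences can be illustrated with a small frequency lexicon. This is a simplified sketch, not TXM's implementation: it only lower-cases and splits on word characters, with no lemmatisation or POS tagging, and the sample sentence is invented.

```python
import re
from collections import Counter


def lexicon(text):
    """Count lower-cased word forms; elided forms like d’exécution stay whole."""
    tokens = re.findall(r"\w+(?:['’-]\w+)*", text.lower())
    return Counter(tokens)


counts = lexicon("Le Conseil examine le rapport. Le rapport est adopté.")
items = len(counts)                 # number of distinct forms
occurrences = sum(counts.values())  # number of running words
print(items, occurrences)
print(counts.most_common(2))
```

The ratio of the two numbers gives a first indication of how repetitive the vocabulary of a corpus is.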
  • 16. Corpus Analysis - TXM: WEU-DIPLO • WEU-DIPLO: content – Index (by type of entity)
  • 17. Corpus Analysis: WEU-DIPLO • Participants description
  • 18. Corpus Analysis – TXM: WEU-DIPLO Partition: representatives’ discourse by country/organisation
  • 19. Corpus Analysis - TXM: WEU-DIPLO Specificities • Specificity score (log10): overuse (+) or deficit (-) of a form in a part/subcorpus as compared with the parent corpus and a threshold. • Statistical model (Lafon, 1980), with the following notation: T = number of occurrences in the parent corpus; t = number of occurrences in a part/subcorpus; f = frequency of a form F in the parent corpus; X = variable of value 0, 1, 2, …, k, …, f; Prob(X = k) = probability that F occurs k times in the part/subcorpus of size t.
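The formula on this slide was an image and is not preserved in the transcript. Lafon's model is generally presented as the hypergeometric distribution, which in the notation above would read Prob(X = k) = C(f, k) · C(T − f, t − k) / C(T, t), where C(n, r) is the binomial coefficient. The sketch below computes it and derives a signed log10 score; the sign convention (take the smaller tail, positive for overuse, negative for deficit) is an assumption about how such scores are usually reported, not a transcription of TXM's code, and the function names are illustrative.

```python
from math import comb, log10


def hypergeom_pmf(k, T, t, f):
    """Prob(X = k): form F (frequency f in a corpus of T occurrences)
    appears exactly k times in a part of t occurrences."""
    if k < max(0, t - (T - f)) or k > min(f, t):
        return 0.0
    return comb(f, k) * comb(T - f, t - k) / comb(T, t)


def specificity(k, T, t, f):
    """Signed log10 score: > 0 for overuse, < 0 for deficit (assumed convention)."""
    upper = sum(hypergeom_pmf(j, T, t, f) for j in range(k, min(f, t) + 1))  # Prob(X >= k)
    lower = sum(hypergeom_pmf(j, T, t, f) for j in range(0, k + 1))          # Prob(X <= k)
    if upper < lower:
        return -log10(upper) if upper > 0 else float("inf")
    return log10(lower) if lower > 0 else float("-inf")


# Toy numbers: corpus of 10 occurrences, part of 5, form seen 4 times overall.
print(specificity(4, T=10, t=5, f=4))  # all 4 occurrences fall in the part: overuse
print(specificity(0, T=10, t=5, f=4))  # none fall in the part: deficit
```

A threshold such as the 2.0 used for the lexical profiles later in the deck then corresponds to a tail probability of at most 0.01 under this convention.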
  • 20. Corpus Analysis - TXM: WEU-DIPLO Specificities: by part of speech
  • 21. Corpus Analysis - TXM: WEU-DIPLO – Specificities: by part of speech (Verb)
  • 22. Corpus Analysis - TXM: WEU-DIPLO Specificities (Verb) by representatives and mode/tense (Grevisse, 1993).
    • France: CONDITIONAL (attenuation: wish, advice, necessity, certainty)
      Forms: serait (37); aurait (19); seraient (17); pourrait (16); devrait (13); voudrais (11); …
      Examples: le gouvernement français serait partisan d'accélérer …; cette réunion se déroulerait selon la formule …; qu'il ne faudrait pas trop ralentir l'opération envisagée …
    • UK delegation: PAST PARTICIPLE (passive/past perfect, adjectives)
      Forms: été (20); donné (6); destinés (5); placées (5); établi (4); révisé (4); chargé (3); …
      Examples: le produit final devrait être mis à la disposition de …; les accords auxquels elles ont abouti n'ont pas encore donné de résultats suffisamment …; projectiles nucléaires destinés à ces armes …
    • C.P.A. (Comité permanent des armements): SIMPLE PAST (narration, succession of past actions)
      Forms: exposa (1); fut (1); intervint (1); posa (1); prirent (1); soutinrent (1); …
      Examples: une première proposition (belge) tendit à la réunion des hautes autorités …; luxembourg et france soutinrent, sans insistance, ce point de vue …; les pays-bas prirent la même attitude …
    • A.C.A. (Agence pour le contrôle des armements): IMPERFECT (description, explanation)
      Forms: était (11); avait (4); étaient (3); présidait (2); affectait (1); ajoutait (1); dépasseraient (1); …
      Examples: le retrait des forces françaises de l’organisation intégrée de l’o.t.a.n. n'affectait nullement l'exécution des tâches …; il est bien évident que, s’il était adopté, il cesserait d’être inexact …; il résultait de cette étude que " le problème du stockage des armes nucléaires …
    • Conseil de l'U.E.O.: FUTURE (actions/goals to be accomplished)
      Forms: sera (11); seront (7); pourra (5); devront (3); pourront (3); auront (2); donnera (2); …
      Examples: les principes généraux ci-après devront gouverner nos travaux …; cela nous fournira la transition entre les sections a et b de notre mandat …; le conseil procédera à un examen attentif de la …
  • 23. Corpus Analysis - TXM: WEU-DIPLO Concordances: use of conditional, French representatives/name/document
  • 24. Corpus Analysis - TXM: WEU-DIPLO Context: conditional forms (French representative/Beaumarchais), vo-CR-73-10_FR
  • 25. Corpus Analysis - TXM: WEU-DIPLO Specificities: by lemma, representatives partition (selection), groupe (contrôle)
  • 26. Corpus Analysis - TXM: WEU-DIPLO Specificities: by lemma, representatives (selection), groupe (contrôle) - Discussion
    • Predictable results:
      o A.C.A. (Agence pour le contrôle des armements) discourse, positive specificity (overuse): contrôle/contrôler/contrôlable, inspection, vérification/vérifier; limitation/limite/limiter, restriction/restreindre/restrictif (reflects the A.C.A.’s role).
      o UK representatives’/delegation’s discourse, negative specificity (scarcity): arme/armement nucléaire/abc/atomique (interested in the topic but not mainly concerned).
    • Less predictable results:
      o UK and France representatives’ discourse, negative specificity: contrôle/contrôler/contrôlable, inspection, vérification/vérifier; A.C.A. - agence pour le contrôle des armements (possible cause: selection of documents in the sample?).
  • 27. Corpus Analysis - TXM: WEU-DIPLO Specificities: by lemma, representatives partition (selection), groupe (standardisation)
  • 28. Corpus Analysis - TXM: WEU-DIPLO Cooccurrences: for ‘standard*’ sorted by co-frequency
  • 29. Corpus Analysis - TXM: WEU-DIPLO Concordances: ‘standard*’ – ‘armements’
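The two views named on slides 28 and 29 can be approximated in a few lines. This is a rough sketch under stated assumptions, not TXM's query engine: the wildcard is matched with `fnmatchcase`, the context is a fixed window of neighbouring tokens, cooccurrents are ranked by raw co-frequency only, and the sample sentence is invented.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, pattern, width=3):
    """Keyword-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if fnmatchcase(tok, pattern):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def cooccurrences(tokens, pattern, width=3):
    """Tokens found within `width` positions of a match, by co-frequency."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if fnmatchcase(tok, pattern):
            counts.update(tokens[max(0, i - width):i])
            counts.update(tokens[i + 1:i + 1 + width])
    return counts.most_common()


sample = ("La standardisation des armements et la production "
          "des armements standardisés")
toks = tokenize(sample)
for left, key, right in kwic(toks, "standard*", width=2):
    print(f"{left:>20} [{key}] {right}")
print(cooccurrences(toks, "standard*", width=2))
```

Ranking by raw co-frequency favours function words such as des, which is why a statistical score is often used alongside it.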
  • 30. Corpus Analysis: WEU-DIPLO Partition: representatives’ discourse (by name)
  • 31. Corpus Analysis - TXM: WEU-DIPLO Lexical profile (Guyard, 1981): positive specificities (>2.0), lemmas, names partition, by part of speech ("-" = none)
    • Chauvel (FR)
      Noun: commun; arme; accord d’exécution; recensement; mise; choix; point; centre; opération; déclaration; système d’armes
      Proper noun: -
      Adjective: commun; équitable; secret; suivant
      Verb: procéder
      Adverb: -
    • Lloyd (UK)
      Noun: pays; discussion; arrangement; coopération; gouvernement britannique; partenaire; estime
      Proper noun: -
      Adjective: bilatéral; déterminé; multilatéral; analogue; final; européen
      Verb: engager; associer; offrir; devoir
      Adverb: -
    • Destremau (FR)
      Noun: ministre belge; avis; gouvernement français; idée; opération; désir
      Proper noun: -
      Adjective: autonome; américain; industriel
      Verb: falloir; mériter; envisager
      Adverb: trop; pas; ne
    • Callaghan (UK)
      Noun: gouvernement britannique; doctrine; industrie
      Proper noun: Eurogroupe; M. Van Elslande
      Adjective: -
      Verb: exister
      Adverb: -
  • 32. Corpus Analysis - TXM: WEU-DIPLO Lexical profile (Guyard, 1981): positive specificities, lemmas, names partition - Discussion
    • Chauvel (FR) / Lloyd (UK): commun (rank 1) / bilatéral (rank 1)
      o commun: production en commun; programme (régional), intérêt, défense, fonds commun(e)(s)
      o bilatéral: base, discussion, arrangements, comités directeurs bilatéra(l)(le)(ux)
    • Destremau (FR) / Callaghan (UK): C.P.A. – Comité permanent des armements (specificity score 1.44) / Eurogroupe (rank 1). Context: French attempts to revive the C.P.A. versus the UK’s Atlanticist preference (creation of the Eurogroup in 1968, which did not include France).
    • Open question: why is standard(isation)(iser) not specific to any individual speaker's discourse, although it shows high specificity for the French representatives' discourse as a whole?
  • 33. Corpus Analysis - TXM: WEU-DIPLO Specificities: standard(isation)(iser) lemmas, names partition
  • 34. Corpus Analysis - TXM: WEU-DIPLO Specificities: standard(isation)(iser) lemmas, documents subtypes partition
  • 35. Future work 1. Corpus analysis and interpretation (in progress). 2. Choice and adaptation of Web publication platform (in progress). Options listed: EVT (Edition Visualization Technology), KILN, PhiloLOGIC, XTF, TEIBoilerplate.
  • 36. References
    • GATE
    • Grevisse, Le bon usage. Grammaire française, Duculot, Paris, 1993.
    • Guyard, Marie-Renée. Spécificités d'auteurs dans Le Surréalisme au service de la Révolution. In: Mots, mars 1981, N°2. Qu'est-ce que le vocabulaire spécifique d'un texte politique? pp. 95-122.
    • Lafon, Pierre. Sur la variabilité de la fréquence des formes dans un corpus. In: Mots, octobre 1980, N°1. Saussure, Zipf, Lagado, des méthodes, des calculs, des doutes et le vocabulaire de quelques textes politiques. pp. 127-165.
    • TEI
    • TXM