Presentation at "Exploring Historical Sources with Language Technology: Results and Perspectives", The Hague, December 8, 2014. Overview of 'Diplomacy within WEU' project: http://www.cvce.eu/en/project/weu/presentation.
Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union
1. Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union
Centre Virtuel de la Connaissance sur l’Europe (CVCE), Luxembourg
Florentina Armaselu (DHLab) - florentina.armaselu@cvce.eu
Verónica Martins (EIS) - veronica.martins@cvce.eu
Catherine Jones (DHLab) - catherine.jones@cvce.eu
www.cvce.eu
Exploring Historical Sources with Language Technology: Results and Perspectives
Huygens ING, The Hague, December 8-9, 2014
2. Summary
1. About the CVCE
2. Overview of the WEU-DIPLO project
3. XML-TEI Encoding
4. Named Entity Recognition (NER)
5. Corpus Analysis
6. Future work
7. References
3. CVCE - Centre Virtuel de la Connaissance sur l'Europe
An interdisciplinary centre of e-research and documentation on the European integration process.
Two key areas of activity:
- Interdisciplinary research on the European integration process in the 20th and 21st centuries;
- Research, development and integration of digital tools and methods to support advancement in European Integration Studies.
4. Overview of the WEU-DIPLO project
1. Goal: XML-TEI encoding, corpus analysis and Web publication of institutional documents of the W.E.U. (Western European Union):
• Topics: armament production, standardization and control in the period from 1954 to 1982;
• Source: Archives nationales de Luxembourg, W.E.U. collection.
2. Format:
• digitized versions (JPEG) of typewritten materials (one file per page).
3. Size:
                  Documents                      Pages
Category          Total   EN   FR   FR proc.*    Total   EN    FR    FR proc.*
Note                 89   43   46   34             395   191   204   144
Minutes              30   15   15   15             256   138   118   118
Memorandum            3    1    2    2              16     7     9     9
Study                 2    0    2    1              12     0    12     8
Discourse             1    0    1    0               4     0     4     0
Draft protocol        2    1    1    0               4     2     2     0
Total               127   60   67   52             687   338   349   279
*proc. = processed
5. Overview of the WEU-DIPLO project
5. Corpus Selection
• Form and content
Form
OCR experiment conditions: need to diversify the form of the documents;
Bilingual.
Content
The archives' 30-year rule corresponds to the 30-year period covered by the corpus (1954-1982); documents were selected from the 1950s, 1960s, 1970s and 1980s;
Case study: Armaments production and control within WEU
Selection based on the research question and on more specific topics: French and British positions, WEU's role/competences, nature of the debates within the Council/Standing Armament Committee;
Need for the documents to cover all the available material categories (minutes, notes, memorandum…).
• Resources
Limited time and human resources.
18. Corpus Analysis - TXM: WEU-DIPLO
Partition: representatives’ discourse by country/organisation
19. Corpus Analysis - TXM: WEU-DIPLO
Specificities
• Specificity score (log10):
o overuse (+)/deficit (-) of a form in a part/subcorpus as compared with the parent corpus and a threshold.
• Statistical model (Lafon, 1980):
Where: T = number of occurrences in the parent corpus;
t = number of occurrences in a part/subcorpus;
f = frequency of a form F in the parent corpus;
X = variable taking the values 0, 1, 2, …, k, …, f;
Prob(X = k) = probability that F occurs k times in the part/subcorpus of size t.
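The formula itself is not reproduced in the text of this slide. Lafon's model is generally stated as the hypergeometric distribution, Prob(X = k) = C(f, k) · C(T − f, t − k) / C(T, t), where C(n, r) is the binomial coefficient. The Python sketch below works under that assumption. The function names, the use of the expected frequency f·t/T to decide between overuse and deficit, and the cumulative-tail form of the score are illustrative assumptions and may differ from what TXM actually computes.

```python
from math import comb, log10

def hypergeom_pmf(k, T, t, f):
    """Prob(X = k): a form of frequency f in a corpus of T occurrences
    appears exactly k times in a part of t occurrences."""
    low, high = max(0, t - (T - f)), min(f, t)
    if k < low or k > high:
        return 0.0
    return comb(f, k) * comb(T - f, t - k) / comb(T, t)

def specificity_score(k, T, t, f):
    """Signed log10 score: positive for overuse, negative for deficit.
    Assumption: the sign is chosen by comparing k with the expected
    frequency f * t / T, and the score uses the cumulative tail."""
    low, high = max(0, t - (T - f)), min(f, t)
    expected = f * t / T
    if k >= expected:
        # Overuse: probability of observing k or more occurrences.
        tail = sum(hypergeom_pmf(i, T, t, f) for i in range(k, high + 1))
        sign = 1.0
    else:
        # Deficit: probability of observing k or fewer occurrences.
        tail = sum(hypergeom_pmf(i, T, t, f) for i in range(low, k + 1))
        sign = -1.0
    tail = min(tail, 1.0)  # guard against floating-point rounding above 1
    if tail == 0.0:
        return sign * float("inf")
    return -sign * log10(tail)
```

With such a score, a form is reported as specific to a part when the absolute value exceeds the chosen threshold.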
20. Corpus Analysis - TXM: WEU-DIPLO
Specificities: by part of speech
21. Corpus Analysis - TXM: WEU-DIPLO
Specificities: by part of speech (Verb)
22. Corpus Analysis - TXM: WEU-DIPLO
Specificities (Verb) by representatives and mode/tense (Grevisse, 1993).

Representative: France
Mode / Tense: CONDITIONAL: attenuation (wish, advice, necessity, certainty)
Forms: serait (37); aurait (19); seraient (17); pourrait (16); devrait (13); voudrais (11); …
Examples: le gouvernement français serait partisan d'accélérer …; cette réunion se déroulerait selon la formule …; qu'il ne faudrait pas trop ralentir l'opération envisagée …

Representative: UK delegation
Mode / Tense: PAST PARTICIPLE: passive/past perfect, adjectives
Forms: été (20); donné (6); destinés (5); placées (5); établi (4); révisé (4); chargé (3); …
Examples: le produit final devrait être mis à la disposition de …; les accords auxquels elles ont abouti n'ont pas encore donné de résultats suffisamment …; projectiles nucléaires destinés à ces armes …

Representative: C.P.A. (Comité permanent des armements)
Mode / Tense: SIMPLE PAST: narration, succession of past actions
Forms: exposa (1); fut (1); intervint (1); posa (1); prirent (1); soutinrent (1); …
Examples: une première proposition (belge) tendit à la réunion des hautes autorités …; luxembourg et france soutinrent, sans insistance, ce point de vue …; les pays-bas prirent la même attitude …

Representative: A.C.A. (Agence pour le contrôle des armements)
Mode / Tense: IMPERFECT: description, explanation
Forms: était (11); avait (4); étaient (3); présidait (2); affectait (1); ajoutait (1); dépasseraient (1); …
Examples: le retrait des forces françaises de l’organisation intégrée de l’o.t.a.n. n'affectait nullement l'exécution des tâches …; il est bien évident que, s’il était adopté, il cesserait d’être inexact …; il résultait de cette étude que " le problème du stockage des armes nucléaires …

Representative: Conseil de l'U.E.O.
Mode / Tense: FUTURE: actions/goals to be accomplished
Forms: sera (11); seront (7); pourra (5); devront (3); pourront (3); auront (2); donnera (2); …
Examples: les principes généraux ci-après devront gouverner nos travaux …; cela nous fournira la transition entre les sections a et b de notre mandat …; le conseil procédera à un examen attentif de la …
23. Corpus Analysis - TXM: WEU-DIPLO
Concordances: use of conditional, French representatives/name/document
24. Corpus Analysis - TXM: WEU-DIPLO
Context: conditional forms (French representative/Beaumarchais), vo-CR-73-10_FR
25. Corpus Analysis - TXM: WEU-DIPLO
Specificities: by lemma, representatives partition (selection), groupe (contrôle)
26. Corpus Analysis - TXM: WEU-DIPLO
Specificities: by lemma, representatives (selection), groupe (contrôle) - Discussion
• Predictable results:
o A.C.A.’s (Agence pour le contrôle des armements) discourse positive specificity (overuse):
contrôle/contrôler/contrôlable – inspection - vérification/vérifier;
limitation/limite/limiter-restriction/restreindre/restrictif.
(A.C.A.’s role)
o UK representatives'/delegation's discourse negative specificity (scarcity):
arme/armement nucléaire/abc/atomique.
(interested in the topic but not mainly concerned)
• Less predictable results:
o UK and France representatives’ discourse negative specificity:
contrôle/contrôler/contrôlable – inspection - vérification/vérifier;
A.C.A. - agence pour le contrôle des armements.
(possible cause: selection of documents in the sample?)
27. Corpus Analysis - TXM: WEU-DIPLO
Specificities: by lemma, representatives partition (selection), groupe (standardisation)
28. Corpus Analysis - TXM: WEU-DIPLO
Cooccurrences: for ‘standard*’ sorted by co-frequency
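As a rough illustration of what a co-frequency ranking involves, the following Python sketch counts the words found within a fixed window around every token matching a pattern such as 'standard*'. It is not how TXM computes cooccurrences. The function name, the window size, the whitespace tokenization and the toy sentence are all assumptions made for the example; the sentence is invented and is not taken from the WEU corpus.

```python
import re
from collections import Counter

def cooccurrences(tokens, pivot_pattern, window=5):
    """Return (word, co-frequency) pairs for words found within `window`
    tokens of any token fully matching `pivot_pattern`, most frequent first."""
    rx = re.compile(pivot_pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if not rx.fullmatch(tok):
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            # Skip the pivot itself and any other pivot forms.
            if j != i and not rx.fullmatch(tokens[j]):
                counts[tokens[j]] += 1
    return counts.most_common()

# Invented toy sentence, for illustration only.
toy_tokens = ("la standardisation des armements et la production "
              "des armements standardisés").split()
toy_result = cooccurrences(toy_tokens, r"standard.*", window=2)
```

The wildcard 'standard*' is written here as the regular expression `standard.*`. A larger window or a sentence-based context would change the counts.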
29. Corpus Analysis - TXM: WEU-DIPLO
Concordances: ‘standard*’ – ‘armements’
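A concordance of this kind lists each match of the pivot pattern with its left and right context, optionally keeping only the lines where a second word appears nearby. The Python sketch below shows the idea on a pre-tokenized text. The function name, the context width and the toy sentence are assumptions made for illustration (the sentence is invented, not corpus data), and TXM's own concordancer works from corpus queries rather than from this code.

```python
import re

def concordance(tokens, pivot_pattern, context_word=None, width=4):
    """Return (left context, pivot, right context) triples for every token
    fully matching `pivot_pattern`. If `context_word` is given, keep only
    lines whose left or right context contains it."""
    rx = re.compile(pivot_pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if not rx.fullmatch(tok):
            continue
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        if context_word is not None and context_word not in left + right:
            continue
        lines.append((" ".join(left), tok, " ".join(right)))
    return lines

# Invented toy sentence, for illustration only.
toy_tokens = ("la standardisation des armements et la production "
              "des armements standardisés").split()
toy_lines = concordance(toy_tokens, r"standard.*",
                        context_word="armements", width=2)
for left, pivot, right in toy_lines:
    print(f"{left:>20} | {pivot} | {right}")
```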
30. Corpus Analysis: WEU-DIPLO
Partition: representatives’ discourse (by name)
31. Corpus Analysis - TXM: WEU-DIPLO
Lexical profile (Guyard, 1981): positive specificities (>2.0), lemmas, names partition
Name / Part of speech

Chauvel (FR)
Noun: commun; arme; accord d’exécution; recensement; mise; choix; point; centre; opération; déclaration; système d’armes
Proper Noun: -
Adjective: commun; équitable; secret; suivant
Verb: procéder
Adverb: -

Lloyd (UK)
Noun: pays; discussion; arrangement; coopération; gouvernement britannique; partenaire; estime
Proper Noun: -
Adjective: bilatéral; déterminé; multilatéral; analogue; final; européen
Verb: engager; associer; offrir; devoir
Adverb: -

Destremau (FR)
Noun: ministre belge; avis; gouvernement français; idée; opération; désir
Proper Noun: -
Adjective: autonome; américain; industriel
Verb: falloir; mériter; envisager
Adverb: trop; pas; ne

Callaghan (UK)
Noun: gouvernement britannique; doctrine; industrie
Proper Noun: Eurogroupe; M. Van Elslande
Adjective: -
Verb: exister
Adverb: -
32. Corpus Analysis - TXM: WEU-DIPLO
Lexical profile (Guyard, 1981): positive specificities, lemmas, names partition - Discussion
• Chauvel (FR) / Lloyd (UK) :
o Commun (rank 1) / bilatéral (rank 1)
• production en commun; programme (régional), intérêt, défense, fonds commun(e)(s)
• base, discussion, arrangements, comités directeurs bilatéra(l)(le)(ux)
• Destremau (FR) / Callaghan (UK):
o C.P.A – Comité permanent des armements (specificity score 1.44) / Eurogroupe (rank 1)
(French attempts to revive CPA / UK’s Atlanticist preference - creation of Eurogroup in 1968 which did
not include France).
• Why is standard(isation)(iser) not specific to the discourse of any individual representative (by name), although it is highly specific to the French representatives' discourse as a whole?
33. Corpus Analysis - TXM: WEU-DIPLO
Specificities: standard(isation)(iser) lemmas, names partition
34. Corpus Analysis - TXM: WEU-DIPLO
Specificities: standard(isation)(iser) lemmas, documents subtypes partition
35. Future work
1. Corpus analysis and interpretation (in progress).
2. Choice and adaptation of a Web publication platform (in progress):
EVT (Edition Visualization Technology): http://sourceforge.net/projects/evt-project/
KILN: http://kiln.readthedocs.org/en/latest/#
PhiloLOGIC: https://sites.google.com/site/philologic3/home
XTF: http://xtf.cdlib.org/about/
TEIBoilerplate: http://dcl.ils.indiana.edu/teibp/
36. References
• GATE: https://gate.ac.uk/
• Grevisse, Maurice. Le bon usage. Grammaire française. Duculot, Paris, 1993.
• Guyard, Marie-Renée. Spécificités d'auteurs dans Le Surréalisme au service de la Révolution. In: Mots, N°2, mars 1981 (Qu'est-ce que le vocabulaire spécifique d'un texte politique?), pp. 95-122.
• Lafon, Pierre. Sur la variabilité de la fréquence des formes dans un corpus. In: Mots, N°1, octobre 1980 (Saussure, Zipf, Lagado, des méthodes, des calculs, des doutes et le vocabulaire de quelques textes politiques), pp. 127-165.
• TEI: http://www.tei-c.org
• TXM: http://textometrie.ens-lyon.fr/