Mining Paper Catalogues
A Multilingual Solution to Reduce Verbose Fields to Consistent Terminology
Tim Evans [1], Felix Kußmaul [2]
23rd Annual Meeting EAA, Maastricht
31 August 2017
[1] Archaeology Data Service, University of York
[2] Archaeological Institute, University of Cologne
Data Source
Figure 1: Sample from Ettlinger, Conspectus.
T. Evans & F. Kußmaul (York, Cologne) Mining Paper Catalogues: 31 August 2017 2/22
Oh dear!
Problem
Running texts contain a lot of irrelevant information (for machine processing).
This makes database lookups without keywords extremely inefficient.
What we have: UNSTRUCTURED DATA
What we want: STRUCTURED DATA
[
  {
    "form": "23.1",
    "origin": "Italy",
    "decoration": "none",
    "occurs": "uncommon"
  },
  {
    "form": "23.2",
    "origin": "Italy, not Padana",
    "occurs": "Mediterranean region; North-Italy"
  }
]
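To illustrate why the structured form helps with lookups, here is a minimal Python sketch. The two records are copied from the slide; the helper name `find_forms` and the substring matching are assumptions made for illustration, not part of the project's tooling.

```python
import json

# Records copied from the slide, wrapped in a list so they parse as one JSON document.
RECORDS_JSON = """
[
  {"form": "23.1", "origin": "Italy", "decoration": "none", "occurs": "uncommon"},
  {"form": "23.2", "origin": "Italy, not Padana",
   "occurs": "Mediterranean region; North-Italy"}
]
"""


def find_forms(records, field, needle):
    """Return the form numbers whose `field` contains `needle` (case-insensitive).

    Hypothetical helper: records that lack the field are simply skipped.
    """
    needle = needle.lower()
    return [r["form"] for r in records if needle in r.get(field, "").lower()]


records = json.loads(RECORDS_JSON)
print(find_forms(records, "occurs", "north-italy"))  # ['23.2']
print(find_forms(records, "decoration", "none"))     # ['23.1']
```

Once the fields are named, a query touches only the relevant field instead of scanning the running text.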
Information Extraction
Definition: Information Extraction (IE)
“[IE refers to] the identification and extraction of instances of a particular class
of events or relationships in a natural language text and their transformation
into a structured representation.” – Grishman 1997, Eikvil 1999
IE Process Pipeline
unstructured document → Tokenisation → Lemmatisation → POS-Tagging → NER → Information Extraction → structured data
Tools named in the figure: Stanford, ClearNLP, OpenNLP, CoreNLP, UIMA Ruta, PosMapper, MatePos
Figure 2: IE process pipeline.
Named Entity Recognition
The  quick  brown  fox  jump  over  the  lazy  dog  .
DT   JJ     JJ     NN   VBD   IN    DT   JJ    NN   .
jumps (original word form, lemmatised to "jump")
Figure 3: POS-tagging examples after lemmatisation.
Adapting the NER
Most NERs (e. g. Stanford CoreNLP) recognise only 8 entity types:
PERSON, ORGANIZATION, LOCATION, PERCENT, DATE, TIME, MONEY, MISC
So we have to add the custom entity type FORM.
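As a rough illustration of what a custom FORM entity could look like in the rule-based case, the following Python sketch tags form references with a single regular expression. It does not use CoreNLP or UIMA Ruta; the trigger words ("Form", "Subform"), the pattern, and the example sentence are all assumptions made for this sketch.

```python
import re

# Assumed rule: the word "Form" or "Subform" followed by a number such as 23 or 23.2.
FORM_PATTERN = re.compile(r"\b(?:Subform|Form)\s+\d+(?:\.\d+)?")


def tag_forms(text):
    """Return (start, end, matched_text, label) for every FORM mention found."""
    return [(m.start(), m.end(), m.group(0), "FORM")
            for m in FORM_PATTERN.finditer(text)]


# Hypothetical sentence, not taken from the catalogue.
sentence = "Subform 23.2 occurs in North Italy, while Form 23.1 is uncommon."
for start, end, mention, label in tag_forms(sentence):
    print(label, mention, start, end)
```

A single rule like this only catches mentions that follow the assumed wording; forms that are described rather than named would need further rules or a trained model.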
Two approaches for NER
Rule-based approach
• High precision, but lower recall
⇒ Many many rules?!
Figure 4: Excerpt from Gempeler,
Elephantine X.
Machine-learning approach
• Lower precision, but high recall
• Needs to be trained!
Figure 5: Manually annotated sentence
from Ettlinger, Conspectus in iepy.
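The precision/recall trade-off between the two approaches can be made concrete with a small Python sketch. The gold annotations and both system outputs below are invented for illustration; they are not results from the project.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for two collections of entity mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system outputs.
gold = {"Form 23.1", "Form 23.2", "K 612", "Form 24"}
rule_based = {"Form 23.1", "Form 23.2"}  # few hits, all correct
learned = {"Form 23.1", "Form 23.2", "K 612", "Form 24", "North Italy"}  # all hits, one wrong

print(precision_recall(rule_based, gold))  # (1.0, 0.5)
print(precision_recall(learned, gold))     # (0.8, 1.0)
```

In this toy setting the rule-based output is fully precise but misses half of the entities, while the learned output finds everything at the cost of one false positive.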
Temporal Expressions
With HeidelTime, temporal expressions are mapped to the TIMEX3 standard:
around 140 B. C. → APPROX BC0140
Spätes 3.–4. Jh. n. Chr. (late 3rd–4th century A. D.) → END 02; 03
second quarter first century B. C. → XXXX-Q2 BC00
first half third century A. D. → XXXX-H1 02
HeidelTime supports many other languages, e. g. German, Italian, French, …
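To show the kind of normalisation involved, here is a toy Python sketch that handles only the simplest case from the list above, a year followed by "B. C." or "A. D.", optionally preceded by "around". It is not HeidelTime and does not reproduce its rules; the function name, the regular expression, and the zero-padded value for A. D. years are assumptions.

```python
import re

# Assumed pattern: optional "around", a year of up to four digits, then the era.
YEAR_PATTERN = re.compile(
    r"(?P<approx>around\s+)?(?P<year>\d{1,4})\s*(?P<era>B\.\s?C\.|A\.\s?D\.)"
)


def normalise_year(expression):
    """Return a dict with a TIMEX3-style value (and modifier), or None if unsupported."""
    match = YEAR_PATTERN.search(expression)
    if match is None:
        return None
    year = int(match.group("year"))
    prefix = "BC" if match.group("era").startswith("B") else ""
    result = {"value": f"{prefix}{year:04d}"}
    if match.group("approx"):
        result["mod"] = "APPROX"
    return result


print(normalise_year("around 140 B. C."))
print(normalise_year("second quarter first century B. C."))  # unsupported by this sketch
```

Expressions such as quarters and halves of centuries need many more rules, which is where a dedicated tagger earns its place.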
Relation Extraction
Subject          Relation   Object
quick brown fox  jump over  lazy dog
K 612            dates      03 [1]
Subform 23.2     occurs     North Italy
Subform 23.2     dates      XXXX-Q2 00 [2]
⇒ e. g.
{ "form": "23.2",
  "dating": "XXXX-Q2 00" }
[1] “4th century A. D.”
[2] “second and third quarters of the first century A. D.”
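The step from extracted triples to a structured record can be sketched in Python as follows. The triples are taken from the table above; the relation-to-field mapping, the stripping of the "Subform " prefix, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed mapping from extracted relation names to record fields.
RELATION_TO_FIELD = {"dates": "dating", "occurs": "occurs"}
PREFIX = "Subform "


def triples_to_records(triples):
    """Group (subject, relation, object) triples into one record per form."""
    fields_by_form = defaultdict(dict)
    for subject, relation, obj in triples:
        field = RELATION_TO_FIELD.get(relation)
        if field is None:
            continue  # relation is not relevant for the catalogue record
        form = subject[len(PREFIX):] if subject.startswith(PREFIX) else subject
        fields_by_form[form][field] = obj
    return [{"form": form, **fields} for form, fields in fields_by_form.items()]


triples = [
    ("quick brown fox", "jump over", "lazy dog"),
    ("Subform 23.2", "occurs", "North Italy"),
    ("Subform 23.2", "dates", "XXXX-Q2 00"),
]
print(triples_to_records(triples))
```

Triples with a relation outside the mapping, such as the fox example, are dropped rather than turned into records.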
MULTILINGUALISM
Background
Two problems:
• Linguistic
• Conceptual
Different languages
Different traditions
Figure 6: Plate, platter or dish?
Creating controlled vocabularies
Creating wordlists of the terms the project team thought would be most useful to
describe the key features of a vessel or sherd:
• Sherd type (e.g. rim or handle)
• Form (e.g. plate or bowl)
• Decoration form (e.g. burnished)
• Decoration color (e.g. yellow)
• Fabric (e.g. Dressel 28 fabric)
Lessons from ARIADNE
Used tools and methodology developed for the ARIADNE project by the
Hypermedia Research Group at the University of South Wales
• Created a neutral spine based on the Getty Research Institute’s Art & Architecture
Thesaurus (AAT)
• This spine was populated by members from partner organisations,
identifying common terms and concepts within it
• Project partners then mapped terms in their language to this neutral spine
• French terms supplied courtesy of a 2001 Master’s thesis by Caroline Sourzat
(thanks to Eleni Schindler Kaudelka for identifying this on the ArchAIDE blog!)
Mapping terms and concepts (part 1)
Often this was very straightforward, for example:
• The Italian terms graffita, graffita a punta, graffita a stecca = “sgraffito”
(http://vocab.getty.edu/aat/300266416)
• The Spanish term Cántaro = “jars” (http://vocab.getty.edu/aat/300195348)
• The German terms gebogener Henkel, Ohrförmiger Henkel, langer
Vertikalhenkel = “handles” (http://vocab.getty.edu/aat/300266416)
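A straightforward mapping of this kind amounts to a lookup table from language-specific terms to a shared concept. The Python sketch below uses the Italian and Spanish examples and the two AAT URIs quoted above; the table layout, the language codes, and the function name are assumptions, not the ARIADNE tooling.

```python
# Concept labels and URIs as quoted on the slide.
AAT_CONCEPTS = {
    "sgraffito": "http://vocab.getty.edu/aat/300266416",
    "jars": "http://vocab.getty.edu/aat/300195348",
}

# (language code, lower-cased term) -> concept label; layout is an assumption.
TERM_TO_CONCEPT = {
    ("it", "graffita"): "sgraffito",
    ("it", "graffita a punta"): "sgraffito",
    ("it", "graffita a stecca"): "sgraffito",
    ("es", "cántaro"): "jars",
}


def map_term(language, term):
    """Return (concept label, AAT URI) for a partner term, or None if unmapped."""
    concept = TERM_TO_CONCEPT.get((language, term.strip().lower()))
    if concept is None:
        return None
    return concept, AAT_CONCEPTS[concept]


print(map_term("es", "Cántaro"))
print(map_term("it", "graffita a punta"))
```

Because every partner term resolves to the same concept identifier, records written in different languages become comparable.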
Mapping terms and concepts (part 2)
Often this was more complicated, with partners having differing perceptions of
what to call something (e.g. “plate” versus “platter”).
In truth, this confusion may also be reflected by what has come out of the ground!
An advantage of using the AAT (a “SKOS’d” thesaurus) is that ambiguity or
difference in nomenclature can be resolved by a broader term or concept, so for
example …
Mapping terms and concepts (part 3)
Looking at the hierarchies for plate and platter in the AAT, we can see that both
are “dishes (vessels for food)” or, broader still, “culinary containers”. So while
we can retain our original classifications (and this is essential for text mining), we
can agree at a fundamental level on what these objects are.
Figure 7: AAT Hierarchies for Plate and Platter
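Resolving two competing terms through a shared broader concept can be sketched in Python as a walk up the hierarchy. The three-entry hierarchy below is a simplified stand-in built from the labels mentioned above; the real AAT hierarchy may contain further levels, and the function names are assumptions.

```python
# Simplified, assumed excerpt of the hierarchy: term -> broader term.
BROADER = {
    "plates": "dishes (vessels for food)",
    "platters": "dishes (vessels for food)",
    "dishes (vessels for food)": "culinary containers",
}


def ancestors(term):
    """Return the chain of broader terms, nearest first."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def nearest_common_broader(term_a, term_b):
    """Return the closest broader term shared by both terms, or None."""
    broader_of_b = set(ancestors(term_b))
    for candidate in ancestors(term_a):
        if candidate in broader_of_b:
            return candidate
    return None


print(nearest_common_broader("plates", "platters"))  # dishes (vessels for food)
```

The original labels stay on the records; the shared broader term is only used when two records need to be compared.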
Outlook
• Recognize reigns of emperors as DATE entities
• Coreferences in general
• HeidelTime:
second and third quarter of the first century A. D. −→ XXXX-Q3; 00
• Returning to difference in ceramic recording details
• Fabric names often contain locations, e. g. Magdalensberg xyz
• Location sometimes narrow, sometimes whole regions
• In many cases the form is not named in particular but just described
References
Ettlinger, Elisabeth. Conspectus formarum terrae sigillatae Italico modo confectae.
Ed. by Deutsches Archäologisches Institut zu Frankfurt and
Römisch-Germanische Kommission. Materialien zur römisch-germanischen
Keramik. Bonn: Habelt, 1990.
Gempeler, Robert D. Elephantine X. Die Keramik römischer bis früharabischer Zeit.
Mainz: Von Zabern, 1992.
Thank you very much for your attention!
Questions?
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement № 693548