STANDOFF ANNOTATION FOR THE ANCIENT
GREEK AND LATIN DEPENDENCY TREEBANK
DATECH2019, 9.5.2019
Giuseppe G. A. Celano
DFG PROJECT
1. Revise: correct the errors
2. Standardize: convert the AGLDT to standoff annotation in PAULA XML (and convert it into UD)
   1. standoff for multiple annotations and/or multiple interpretations of the same token
   2. standoff to overcome the problem of conflicting hierarchies
3. Expand: add new annotations
(https://git.informatik.uni-leipzig.de/celano/agldt1)
THE AGLDT
▸ Ancient Greek texts: 557,922 tokens
▸ Latin texts: 79,697 tokens
▸ available in GitHub/GitLab:
▸ https://perseusdl.github.io/treebank_data/
▸ https://git.informatik.uni-leipzig.de/celano/agldt1
LABELED DIRECTED ACYCLIC GRAPHS
THE PERSEUS TREEBANK (LAST RELEASE, 2.1)
▸ 12 texts
composition date   text                         token number
63 BC              Cicero, In Catilinam         6,652
51 BC              Caesar, De Bello Gallico     1,556
post 44 BC         Sallust, Bellum Catilinae    13,191
ca 25 BC           Prop., Elegiae               5,297
29-19 BC           Vergil, Aeneid               2,839
ca 8 AD            Ov., Metamorphoses           5,209
14 AD              Aug., Res Gestae             3,035
15-50 AD           Ph., Fabulae                 6,588
ca 100 AD?         Petr., Satyricon             14,177
ca 100-110 AD      Tac., Historiae              3,531
117-138 AD         Suet., Vita Divi Augusti     8,313
ca 400 AD          Ger., Vulgata                9,309
TREEBANK PIPELINE
1. start
2. choose a TEI/XML text
3. preliminary automatic annotation: tokenize it, then rule-based POS tagger/parser
4. manual correction
5. end
THE PERSEUS TREEBANK: TEI XML TEXT
THE PERSEUS TREEBANK: INLINE ANNOTATION
INLINE ANNOTATION: ADVANTAGES
1. easy to add
2. easy to query
3. well supported by annotation tools
INLINE ANNOTATION: DISADVANTAGES
1. the tokenized text becomes the new base text
2. after text extraction from a TEI text, links to the original text are virtually lost (e.g., amabam-que and the content of some editorial markup)
3. it is not feasible to connect such base texts to other annotation layers with different tokenization schemes. For example:
   ‣ amabamque: one phonetic word
   ‣ amabam-que: two syntactic words
   ‣ am-a-ba-m-que: five morphemes
   ‣ verse vs. sentence
STANDOFF ANNOTATION
1. each annotation layer is attached separately to the original text (i.e., the base text)
2. an annotation layer references either the original text or another annotation layer that references the original text
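
The two principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Span` class, the layer names, and the morpheme boundaries (taken from the am-a-ba-m-que example) are hypothetical and are not the project's data model. The point it shows is that several layers with different segmentations can coexist because each one only stores character offsets into the same untouched base text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """One standoff annotation: offsets into the base text plus a label."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str

# The base text is never modified; every layer only points into it.
BASE_TEXT = "amabamque"

# Three independent layers with conflicting segmentations (hypothetical names).
phonetic_words = [Span(0, 9, "phonetic word")]
syntactic_words = [Span(0, 6, "syntactic word"), Span(6, 9, "syntactic word")]
morphemes = [Span(0, 2, "morpheme"), Span(2, 3, "morpheme"),
             Span(3, 5, "morpheme"), Span(5, 6, "morpheme"),
             Span(6, 9, "morpheme")]

def resolve(span, base=BASE_TEXT):
    """Return the stretch of base text that a span refers to."""
    return base[span.start:span.end]

def layer_text(layer, base=BASE_TEXT):
    """Resolve every span of a layer against the base text."""
    return [resolve(span, base) for span in layer]

print(layer_text(phonetic_words))
print(layer_text(syntactic_words))
print(layer_text(morphemes))
```

Because no layer rewrites the base text, adding a fourth layer (for example verse lines) would not disturb the existing three.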
STANDOFF ANNOTATION: PAULA XML
1. open format based on the principles of LAF (ISO 24612:2012)
2. already employed in a number of historical language corpora
3. the base text is a bare XML text, which is referenced virtually only via offsets
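
As a rough sketch of offset-based referencing, the Python below builds a list of `mark` elements whose `xlink:href` values are XPointer `string-range` expressions with a 1-based start and a length. The element and attribute names are modeled loosely on PAULA mark lists and are an assumption here: the output is simplified (no header, no `xml:base`) and has not been validated against the PAULA DTDs. The helper names `offsets` and `mark_list` and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

def offsets(base, words):
    """Locate successive words in the base text.

    Returns (start, length) pairs with a 1-based start, as assumed
    for the string-range expressions below.
    """
    position = 0
    result = []
    for word in words:
        index = base.index(word, position)   # raises ValueError if missing
        result.append((index + 1, len(word)))
        position = index + len(word)
    return result

def mark_list(token_offsets, layer_type="tok"):
    """Build a simplified, PAULA-like markList element (assumed layout)."""
    root = ET.Element("markList", {"type": layer_type})
    for number, (start, length) in enumerate(token_offsets, 1):
        href = f"#xpointer(string-range(//body,'',{start},{length}))"
        ET.SubElement(root, "mark", {
            "id": f"{layer_type}_{number}",
            f"{{{XLINK}}}href": href,
        })
    return root

base = "arma virumque cano"
tokens = offsets(base, ["arma", "virum", "que", "cano"])
print(ET.tostring(mark_list(tokens), encoding="unicode"))
```

Note that the enclitic que gets its own mark even though the base text keeps virumque as a single string.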
THE CASE STUDY: CAESAR’S DE BELLO CIVILI
1. the base text is a ‘complex’ TEI XML file
   ‣ reference is made via XPath coinciding with CTS divisions
(https://git.informatik.uni-leipzig.de/celano/latinnlp/tree/master/case-study)
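
One way to picture XPath references that coincide with CTS divisions is the sketch below. It makes several assumptions: the sample TEI fragment (nested `div` elements carrying `n` attributes, with placeholder content instead of Caesar's text), the mapping of a passage such as 1.2 onto one `div` step per component, and the function names are all hypothetical and may differ from the files in the case-study repository. Only the limited XPath subset of Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment; real files carry far more markup.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="textpart" subtype="book" n="1">
<div type="textpart" subtype="chapter" n="1"><p>placeholder for chapter 1.1</p></div>
<div type="textpart" subtype="chapter" n="2"><p>placeholder for <hi>chapter</hi> 1.2</p></div>
</div>
</body></text></TEI>"""

def cts_to_xpath(passage):
    """Turn a passage reference such as '1.2' into one div step per level."""
    steps = "/".join(f"tei:div[@n='{n}']" for n in passage.split("."))
    return f".//tei:body/{steps}"

def passage_text(root, passage):
    """Return the whitespace-normalized text of the referenced division."""
    node = root.find(cts_to_xpath(passage), TEI_NS)
    if node is None:
        raise KeyError(f"no division for passage {passage}")
    # itertext() also collects text nested inside inline markup such as <hi>.
    return " ".join("".join(node.itertext()).split())

root = ET.fromstring(SAMPLE)
print(cts_to_xpath("1.2"))
print(passage_text(root, "1.2"))
```

The text is read out of the TEI file on demand, so the TEI file itself remains the base text.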
TOKENIZATION/WORD SEGMENTATION
▸ Latin: rule-based
▸ select the text to annotate from the TEI XML file
▸ identify abbreviations (word list + regular expressions)
▸ Cn. = Gnaeus
▸ list of not-to-tokenize words (e.g., Antigone, aeque)
▸ tokens ending with ne/que/ve
▸ list of to-tokenize words (e.g., nequis, nobiscum)
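
The rules listed above could be combined roughly as in the following Python sketch. It is not the project's tokenizer. The word lists contain only the examples from the slide, the way nequis and nobiscum are split (ne + quis, nobis + cum) is an assumption, and abbreviations are matched by a plain word list only, whereas the slide also mentions regular expressions. With lists this short the enclitic rule overgenerates (it would also split a word like bene), which is exactly what a full not-to-tokenize list is meant to prevent.

```python
import re

# Tiny illustrative lists; real ones would be much longer.
ABBREVIATIONS = {"Cn."}                      # Cn. = Gnaeus
DO_NOT_SPLIT = {"antigone", "aeque"}         # end in ne/que/ve but are one word
FORCED_SPLITS = {                            # assumed segmentations
    "nequis": ["ne", "quis"],
    "nobiscum": ["nobis", "cum"],
}
ENCLITIC = re.compile(r"^(\w+?)(ne|que|ve)$", re.IGNORECASE)

def split_word(word):
    """Apply the list lookups first, then the ne/que/ve rule."""
    lower = word.lower()
    if lower in DO_NOT_SPLIT:
        return [word]
    if lower in FORCED_SPLITS:
        return list(FORCED_SPLITS[lower])    # note: returned in lower case
    match = ENCLITIC.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]

def tokenize(text):
    """Whitespace split, abbreviation check, punctuation split, enclitic split."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)             # keep the period with the abbreviation
            continue
        for piece in re.findall(r"\w+|[^\w\s]", chunk):
            if piece.isalpha():
                tokens.extend(split_word(piece))
            else:
                tokens.append(piece)
    return tokens

print(tokenize("Cn. Pompeius amabamque aeque nobiscum"))
```

Checking the exception lists before the suffix rule keeps the general rule simple and moves all lexical knowledge into data that can be extended without touching the code.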
PAULA: TEI BASE TEXT
PAULA: TOKENIZATION
PAULA: SENTENCE SPLIT
CURRENT CHALLENGES
▸ extraction of text from TEI texts may require different scripts
▸ what is the ideal tokenization/word segmentation?
▸ annotation tools do not support standoff annotation
▸ lack of support for XPointer
THANK YOU FOR YOUR ATTENTION!
