STANDOFF ANNOTATION FOR THE ANCIENT
GREEK AND LATIN DEPENDENCY TREEBANK
DATECH2019, 9.5.2019
Giuseppe G. A. Celano
DFG PROJECT
1. Revise: correct the errors
2. Standardize: convert the AGLDT to standoff annotation in PAULA XML (and convert it into UD)
   1. standoff for multiple annotations and/or multiple interpretations of the same token
   2. standoff to overcome the problem of conflicting hierarchies
3. Expand: add new annotations
(https://git.informatik.uni-leipzig.de/celano/agldt1)
THE AGLDT
▸ Ancient Greek texts: 557,922 tokens
▸ Latin texts: 79,697 tokens
▸ available in GitHub/GitLab:
▸ https://perseusdl.github.io/treebank_data/
▸ https://git.informatik.uni-leipzig.de/celano/agldt1
LABELED DIRECTED ACYCLIC GRAPHS
THE PERSEUS TREEBANK (LAST RELEASE, 2.1)
▸ 12 texts
composition date  text                        token number
63 BC             Cicero, In Catilinam               6,652
51 BC             Caesar, De Bello Gallico           1,556
post 44 BC        Sallust, Bellum Catilinae         13,191
ca 25 BC          Prop. Elegiae                      5,297
29-19 BC          Vergil, Aeneid                     2,839
ca 8 AD           Ov., Metamorphoses                 5,209
14 AD             Aug., Res Gestae                   3,035
15-50 AD          Ph., Fabulae                       6,588
ca 100 AD?        Petr., Satyricon                  14,177
ca 100-110 AD     Tac. Historiae                     3,531
117-138 AD        Suet., Vita Divi Augusti           8,313
ca 400 AD         Ger. Vulgata                       9,309
TREEBANK PIPELINE
▸ start
▸ choose a TEI/XML text
▸ preliminary automatic annotation:
  ▸ tokenize it
  ▸ rule-based POS tagger/parser
▸ manual correction
▸ end
THE PERSEUS TREEBANK: TEI XML TEXT
THE PERSEUS TREEBANK: INLINE ANNOTATION
INLINE ANNOTATION: ADVANTAGES
1. easy to add
2. easy to query
3. well supported by annotation tools
INLINE ANNOTATION: DISADVANTAGES
1. the tokenized text becomes the new base text
2. after text extraction from a TEI text, the link to the original text is virtually lost
(e.g., amabam-que and the content of some editorial markup)
3. it is infeasible to connect such base texts to other annotation layers with
different tokenization schemes. For example:
‣ amabamque: one phonetic word
‣ amabam-que: two syntactic words
‣ am-a-ba-m-que: five morphemes
‣ verse vs. sentence
STANDOFF ANNOTATION
1. each annotation layer is attached separately to the original text
(i.e., the base text)
2. an annotation layer references either the original text or another
annotation layer that references the original text
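
As a minimal sketch of the idea, assuming character offsets as the reference mechanism, the three segmentations of amabamque from the previous slide can live side by side without touching the base text. The names `Span`, `text_of`, and the layer variables are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

BASE_TEXT = "amabamque"  # the base text is never modified


@dataclass(frozen=True)
class Span:
    """One annotation unit: a half-open character range into the base text."""
    start: int  # inclusive offset
    end: int    # exclusive offset


def text_of(span: Span, base: str = BASE_TEXT) -> str:
    """Resolve a span back to the characters it covers."""
    return base[span.start:span.end]


# Three independent layers over the same base text (offsets are illustrative).
PHONETIC = [Span(0, 9)]                                   # amabamque
SYNTACTIC = [Span(0, 6), Span(6, 9)]                      # amabam | que
MORPHEMES = [Span(0, 2), Span(2, 3), Span(3, 5),
             Span(5, 6), Span(6, 9)]                      # am | a | ba | m | que

if __name__ == "__main__":
    for name, layer in [("phonetic", PHONETIC),
                        ("syntactic", SYNTACTIC),
                        ("morphemes", MORPHEMES)]:
        print(name, [text_of(s) for s in layer])
```

Because every layer points at the same unmodified string, adding a further layer (for example verse lines next to sentences) does not require any existing layer to change.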
STANDOFF ANNOTATION: PAULA XML
1. open format based on the principles of LAF (ISO 24612:2012)
2. already employed in a number of historical language corpora
3. the base text is a bare XML text, which is referenced virtually only
via offsets
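
The offset-based referencing can be sketched as follows. This is a simplified approximation of a PAULA-style mark list, not validated against the PAULA schema: the helper names `xpointer` and `mark_list`, the id pattern, and the file name `doc.text.xml` are assumptions made for illustration. It assumes XPointer `string-range` expressions with a 1-based start index and a length.

```python
def xpointer(start: int, end: int) -> str:
    """Turn a 0-based half-open range into a 1-based string-range pointer.

    The fourth argument of string-range is a length, not an end offset.
    """
    return f"#xpointer(string-range(//body,'',{start + 1},{end - start}))"


def mark_list(spans, base_file: str, layer: str = "tok") -> str:
    """Serialize (start, end) spans as a PAULA-like list of <mark> elements."""
    lines = [
        f'<markList xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'type="{layer}" xml:base="{base_file}">'
    ]
    for i, (start, end) in enumerate(spans, start=1):
        lines.append(
            f'  <mark id="{layer}_{i}" xlink:href="{xpointer(start, end)}"/>'
        )
    lines.append("</markList>")
    return "\n".join(lines)


if __name__ == "__main__":
    # amabam | que, as offsets into a base text "amabamque"
    print(mark_list([(0, 6), (6, 9)], base_file="doc.text.xml"))
```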
THE CASE STUDY: CAESAR’S DE BELLO CIVILI
1. the base text is a ‘complex’ TEI XML file
‣ reference is made via XPath coinciding with CTS divisions
(https://git.informatik.uni-leipzig.de/celano/latinnlp/tree/master/case-study)
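
A minimal sketch of what XPath references aligned with CTS divisions could look like, assuming that each level of a passage reference such as 1.1.1 (book, chapter, section) corresponds to a nested `div` carrying an `n` attribute. The TEI fragment, its text snippet, and the helpers `cts_to_xpath` and `passage_text` are illustrative assumptions, not taken from the project's files.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily reduced TEI file with nested textpart divisions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="textpart" subtype="book" n="1">
  <div type="textpart" subtype="chapter" n="1">
    <div type="textpart" subtype="section" n="1">
      <p>Litteris C. Caesaris consulibus redditis</p>
    </div>
    <div type="textpart" subtype="section" n="2">
      <p>(text of section 2)</p>
    </div>
  </div>
</div>
</body></text></TEI>"""


def cts_to_xpath(passage: str) -> str:
    """Map a dotted passage reference to one nested div step per level."""
    steps = "/".join(f"tei:div[@n='{n}']" for n in passage.split("."))
    return f".//tei:body/{steps}"


def passage_text(root: ET.Element, passage: str) -> str:
    """Return the whitespace-normalized text of the referenced division."""
    node = root.find(cts_to_xpath(passage), TEI_NS)
    if node is None:
        raise KeyError(passage)
    return " ".join("".join(node.itertext()).split())


ROOT = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(cts_to_xpath("1.1.1"))
    print(passage_text(ROOT, "1.1.1"))
```

The standard library supports only a subset of XPath, which is enough for attribute predicates like these; a full XPath engine would be needed for more complex expressions.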
TOKENIZATION/WORD SEGMENTATION
▸ Latin: rule-based
▸ select the text to annotate from the TEI XML file
▸ identify abbreviations (word list + regular expressions)
▸ Cn. = Gnaeus
▸ list of not-to-tokenize words (e.g., Antigone, aeque)
▸ tokens ending with ne/que/ve
▸ list of to-tokenize words (e.g., nequis, nobiscum)
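
The rules above can be sketched roughly as follows. This is a toy illustration, not the project's tokenizer: the word lists contain only the examples named on the slide, the abbreviation step uses a word list without the regular expressions, and the way nequis and nobiscum are split (ne + quis, nobis + cum) is an assumption.

```python
import re

ABBREVIATIONS = {"Cn.": "Gnaeus"}            # abbreviation -> expansion
DO_NOT_SPLIT = {"antigone", "aeque"}         # look like word + enclitic, but are not
SPLIT_TABLE = {                              # assumed segmentations
    "nequis": ["ne", "quis"],
    "nobiscum": ["nobis", "cum"],
}
ENCLITIC = re.compile(r"^(.+?)(que|ne|ve)$", re.IGNORECASE)


def segment(word: str) -> list[str]:
    """Split one orthographic word into syntactic words."""
    key = word.lower()
    if key in DO_NOT_SPLIT:
        return [word]
    if key in SPLIT_TABLE:
        return list(SPLIT_TABLE[key])
    match = ENCLITIC.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]


def tokenize(text: str) -> list[str]:
    """Whitespace-split, protect abbreviations, then segment each word."""
    tokens: list[str] = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the final dot with the abbreviation
            continue
        for piece in re.findall(r"\w+|[^\w\s]", chunk):
            tokens.extend(segment(piece))
    return tokens


if __name__ == "__main__":
    print(tokenize("Cn. Pompeius senatusque aeque nobiscum"))
```

With lists this short the enclitic rule overgenerates (any word ending in ne, que, or ve is split), which is why the not-to-tokenize list has to be far more complete in practice.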
PAULA: TEI BASE TEXT
PAULA: TOKENIZATION
PAULA: SENTENCE SPLIT
CURRENT CHALLENGES
▸ extraction of text from TEI texts may require different scripts
▸ what is the ideal tokenization/word segmentation?
▸ annotation tools do not support standoff annotation
▸ lack of support for XPointer
THANK YOU FOR YOUR ATTENTION!
