SlideShare a Scribd company logo
1 of 18
Download to read offline
Collatinus :
Lemmatizer and
morphological
analyzer for
Latin texts.
Yves Ouvrard
Philippe Verkerk
Digital Classics III: Re-thinking Text Analysis May 12th 2017
Who are we ?
Yves Ouvrard
Professor of latin
(retired)
Philippe Verkerk
Physicist
Lab. PhLAM
●
Reading support of
latin texts
●
Prepare the wordlist
associated to a text
What was initially Collatinus ?
Collatinus 11
●
Free and Open program (GNU GPL)
- written in C++ (Qt 5)
- resources as plain text files
●
Stand-alone software (Mac OS X, Windows,
Linux)
http://outils.biblissima.fr/en/collatinus/
●
On-line version (Collatinus 10.2)
http://outils.biblissima.fr/en/collatinus-web/
Main functions
●
Lemmatization
Analysis
Reading support
●
Dictionaries
- Gaffiot 2016
- Lewis & Short
- Georges...
●
Scansion●
Inflection
Ārmă (Ārmā) vĭrūmquĕ cānō (cănō̆ ), Trōjāe quī prīmŭs ăb ōrīs (ōrĭs)
--̆ u-u -̆ -̆ -- - -u u --̆
Ītălĭām, fātō prŏfŭgūs, Lāvīnĭăquĕ (Lāvīnĭāquĕ) vēnĭt (vĕnĭt)
-uu- -- uu- --u-̆ u -̆ u
lītŏră, mūlt[um] īll[e] ēt tērrīs jāctātŭs (jāctātūs) ĕt āltō
-uu -` -` - -- ---̆ u --
vī sŭpĕrūm sāevāe mĕmŏrēm Jūnōnĭs ŏb īrăm;
- uu- -- uu- --u u -u
mūltă (mūltā) quōqu[e] (quŏqu[e]) ēt bēllō pāssūs, dūm cōndĕrĕt ūrbĕm,
--̆ -̆ ` - -- -- - -uu -u
īnfērrētquĕ dĕōs Lătĭō, gĕnūs (gĕnŭs) ūndĕ Lătīnŭm,
---u u- uu- u-̆ -u u-u
Ālbānīquĕ pātrēs (pā̆ trēs), ātqu[e] āltāe mōenĭă Rōmāe.
---u -̆ - -` -- -uu --
What makes the difference ?
●
Dictionaries :
a single click opens
the dictionary
●
Allows to compare
two dictionaries
●
Scansion :
knows a priori if a
vowel is short or long
→ not limited to
metrical verses
●
Integrated server
→ can answer a question from another program
●
Expandable(one can add lemmas and/or dictionaries)
Principle of the analysis
●
Not a list of inflected forms
→ closer to the human mechanism
●
Split the word in 2 parts (all possibilities)
●
Look for the 2 parts in the proper list
●
Check for the compatibility
●
24 100 lemmas in the “main lexicon”
●
58 000 lemmas in the “extended lexicon”
The figures
●
24 100 lemmas
(classical Latin)
●
58 000 lemmas in the
extended lexicon
(unusual words)
●
140 “paradigms”
- Lewis & Short ≈ 60 000 entries
- Gaffiot 2016 ≈ 60 000 entries
- K.E. Georges ≈ 50 000 entries
- Gérard Jeanneau ≈ 50 000 entries
Generate automatically the
lemmas in the format of Collatinus
7 500 entries still waiting
●
Contraction in the declension
ĭĭ → ī
amasse↔ amavisse
●
Assimilation of the prefix
(without an extra entry)
obfero↔ offero
Ordering the analysis
●
“Ordering” the analysis
→ a step to desambiguization
≠ from a usual POS tagger (no list of forms)
●
Statistics on the texts lemmatized by the
LASLA (almost 2 000 000 forms)
●
Counting the tags, the sequences of 3 tags
and each lemma (mapped in the lexicon)
●
91 tags : more than the POS
TextiColor
●
Students are
supposed to know
a list of words
●
The text is
prepared with
different colors :
- known
- not known
- medieval form
]
From quantities to rhythm
●
Collatinus knows the quantity of the vowels
→ easy to determine the tonical accent.
●
Can split the syllabes→ hyphenation
abs·cí·di (abs-cido) ≠ áb·sci·di (ab-scindo)
●
Opens the way to the study of rhythmic prose
●
Statistics on paroxytons and proparoxytons.
●
Rhythm everywhere, not only inclausulae
New feature : probabilistic tagger
●
No one knows if a probabilistic tagger works
with Latin (“free” order of the words)
●
No evaluation of the error rate (not yet)
●
2nd order hidden Markov model
●
Training corpus : LASLA.
●
For each sentence, chooses the best
sequence of tags and thus the best analysis
Internal server
●
Collatinus can answer questions asked from
another program. For instance, LibreOffice.
Internal server (2)
●
A “standard” tagger with the data from LASLA
- List of forms, with lemma and analysis
- Exact number of occurrences
- Tagset and statistics on sequences of tags
●
Problem of the forms that have not been seen
before : Ask Collatinus !
→ Answers a LASLA-like analysis
LASLA-Tagger
●
Supervised lemmatization.
●
Edition/correction of the result of the tagger.
→ Hope to reach zero error !
●
Exact number of occurrences. For “patres” :
253 voc. 110 nom. and 75 acc.
Collatinus expects 6 voc. 200 nom. & 320 acc
LASLA-Tagger : screen-shot
What is going on ?
Praelector:
●
“Grammatical assistant” to help pupils in
reading Latin / understanding the sentence.
→“Translation” in French of the sentence
●
Medieval spelling :
e forae, o forau, f forph, etc...
Reduce the medieval forms and the classical
words to the same “phonetic” form.
Future plans
●
Coupling of the probabilistic tagger with the
gramatical analysis of the sentence
●
“Universal” output format to gather all the
information about the text (lemmas,
analysis, scanned and accented forms...)
●
Statistics on rhythmic prose (cursus)
●
Looking for verses in prose
Suggestions
●
Yves.Ouvrard@collatinus.org
●
Philippe.Verkerk@univ-lille1.fr
Thank you for your attention
● http://outils.biblissima.fr/en/collatinus/
● http://outils.biblissima.fr/en/collatinus-web/
● For ancient Greek : http://outils.biblissima.fr/en/eulexis/

More Related Content

Similar to Collatinus: Lemmatizer and morphological analyzer for Latin texts

Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyRobert Viseur
 
A practical parser with combined parsing
A practical parser with combined parsingA practical parser with combined parsing
A practical parser with combined parsingijseajournal
 
Developing OpenResty Framework
Developing OpenResty FrameworkDeveloping OpenResty Framework
Developing OpenResty FrameworkAapo Talvensaari
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLEric Evans
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testingKnoldus Inc.
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text ProcessingSuneel Marthi
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextDataWorks Summit
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyOlivier Bourgeois
 
Rust All Hands Winter 2011
Rust All Hands Winter 2011Rust All Hands Winter 2011
Rust All Hands Winter 2011Patrick Walton
 
Devfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-KoansDevfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-KoansTomoki Aburatani
 
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...Apache OpenNLP
 
LVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYLVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYdmbaturin
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingFlorian Leitner
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeProf. Wim Van Criekinge
 
ANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayMichael Yarichuk
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfeliasabdi2024
 

Similar to Collatinus: Lemmatizer and morphological analyzer for Latin texts (20)

Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
A practical parser with combined parsing
A practical parser with combined parsingA practical parser with combined parsing
A practical parser with combined parsing
 
Developing OpenResty Framework
Developing OpenResty FrameworkDeveloping OpenResty Framework
Developing OpenResty Framework
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQL
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testing
 
AINL 2016: Kuznetsova
AINL 2016: KuznetsovaAINL 2016: Kuznetsova
AINL 2016: Kuznetsova
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text Processing
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
 
parsing.pptx
parsing.pptxparsing.pptx
parsing.pptx
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
Rust All Hands Winter 2011
Rust All Hands Winter 2011Rust All Hands Winter 2011
Rust All Hands Winter 2011
 
Devfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-KoansDevfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-Koans
 
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
 
Static analysis for beginners
Static analysis for beginnersStatic analysis for beginners
Static analysis for beginners
 
Bioinformatics v2014 wim_vancriekinge
Bioinformatics v2014 wim_vancriekingeBioinformatics v2014 wim_vancriekinge
Bioinformatics v2014 wim_vancriekinge
 
LVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYLVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLY
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String Processing
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekinge
 
ANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy Way
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdf
 

More from Equipex Biblissima

Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteDa Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteEquipex Biblissima
 
eScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysiseScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysisEquipex Biblissima
 
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...Equipex Biblissima
 
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...Equipex Biblissima
 
Représentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIFReprésentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIFEquipex Biblissima
 
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...Equipex Biblissima
 
Mise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documentsMise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documentsEquipex Biblissima
 
Actualités et perspectives de IIIF
Actualités et perspectives de IIIFActualités et perspectives de IIIF
Actualités et perspectives de IIIFEquipex Biblissima
 
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIFMieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIFEquipex Biblissima
 
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...Equipex Biblissima
 
IIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in FranceIIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in FranceEquipex Biblissima
 
The Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical NamesThe Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical NamesEquipex Biblissima
 
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...Equipex Biblissima
 
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)Equipex Biblissima
 
Biblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts CollectionsBiblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts CollectionsEquipex Biblissima
 
A la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail BiblissimaA la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail BiblissimaEquipex Biblissima
 
Browse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIFBrowse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIFEquipex Biblissima
 
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...Equipex Biblissima
 

More from Equipex Biblissima (20)

Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteDa Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
 
eScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysiseScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document Analysis
 
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
 
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
 
Représentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIFReprésentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIF
 
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
 
Mise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documentsMise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documents
 
Nakala et IIIF
Nakala et IIIFNakala et IIIF
Nakala et IIIF
 
Actualités et perspectives de IIIF
Actualités et perspectives de IIIFActualités et perspectives de IIIF
Actualités et perspectives de IIIF
 
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIFMieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
 
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
 
IIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in FranceIIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in France
 
The Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical NamesThe Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical Names
 
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
 
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
 
Biblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts CollectionsBiblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts Collections
 
IIIF et Biblissima
IIIF et BiblissimaIIIF et Biblissima
IIIF et Biblissima
 
A la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail BiblissimaA la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail Biblissima
 
Browse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIFBrowse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIF
 
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
 

Recently uploaded

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Collatinus: Lemmatizer and morphological analyzer for Latin texts

  • 1. Collatinus : Lemmatizer and morphological analyzer for Latin texts. Yves Ouvrard Philippe Verkerk Digital Classics III: Re-thinking Text Analysis May 12th 2017
  • 2. Who are we ? Yves Ouvrard Professor of latin (retired) Philippe Verkerk Physicist Lab. PhLAM ● Reading support of latin texts ● Prepare the wordlist associated to a text What was initially Collatinus ?
  • 3. Collatinus 11 ● Free and Open program (GNU GPL) - written in C++ (Qt 5) - resources as plain text files ● Stand-alone software (Mac OS X, Windows, Linux) http://outils.biblissima.fr/en/collatinus/ ● On-line version (Collatinus 10.2) http://outils.biblissima.fr/en/collatinus-web/
  • 4. Main functions ● Lemmatization Analysis Reading support ● Dictionaries - Gaffiot 2016 - Lewis & Short - Georges... ● Scansion● Inflection Ārmă (Ārmā) vĭrūmquĕ cānō (cănō̆ ), Trōjāe quī prīmŭs ăb ōrīs (ōrĭs) --̆ u-u -̆ -̆ -- - -u u --̆ Ītălĭām, fātō prŏfŭgūs, Lāvīnĭăquĕ (Lāvīnĭāquĕ) vēnĭt (vĕnĭt) -uu- -- uu- --u-̆ u -̆ u lītŏră, mūlt[um] īll[e] ēt tērrīs jāctātŭs (jāctātūs) ĕt āltō -uu -` -` - -- ---̆ u -- vī sŭpĕrūm sāevāe mĕmŏrēm Jūnōnĭs ŏb īrăm; - uu- -- uu- --u u -u mūltă (mūltā) quōqu[e] (quŏqu[e]) ēt bēllō pāssūs, dūm cōndĕrĕt ūrbĕm, --̆ -̆ ` - -- -- - -uu -u īnfērrētquĕ dĕōs Lătĭō, gĕnūs (gĕnŭs) ūndĕ Lătīnŭm, ---u u- uu- u-̆ -u u-u Ālbānīquĕ pātrēs (pā̆ trēs), ātqu[e] āltāe mōenĭă Rōmāe. ---u -̆ - -` -- -uu --
  • 5. What makes the difference ? ● Dictionaries : a single click opens the dictionary ● Allows to compare two dictionaries ● Scansion : knows a priori if a vowel is short or long → not limited to metrical verses ● Integrated server → can answer a question from another program ● Expandable(one can add lemmas and/or dictionaries)
  • 6. Principle of the analysis ● Not a list of inflected forms → closer to the human mechanism ● Split the word in 2 parts (all possibilities) ● Look for the 2 parts in the proper list ● Check for the compatibility ● 24 100 lemmas in the “main lexicon” ● 58 000 lemmas in the “extended lexicon”
  • 7. The figures ● 24 100 lemmas (classical Latin) ● 58 000 lemmas in the extended lexicon (unusual words) ● 140 “paradigms” - Lewis & Short ≈ 60 000 entries - Gaffiot 2016 ≈ 60 000 entries - K.E. Georges ≈ 50 000 entries - Gérard Jeanneau ≈ 50 000 entries Generate automatically the lemmas in the format of Collatinus 7 500 entries still waiting ● Contraction in the declension ĭĭ → ī amasse↔ amavisse ● Assimilation of the prefix (without an extra entry) obfero↔ offero
  • 8. Ordering the analysis ● “Ordering” the analysis → a step to desambiguization ≠ from a usual POS tagger (no list of forms) ● Statistics on the texts lemmatized by the LASLA (almost 2 000 000 forms) ● Counting the tags, the sequences of 3 tags and each lemma (mapped in the lexicon) ● 91 tags : more than the POS
  • 9. TextiColor ● Students are supposed to know a list of words ● The text is prepared with different colors : - known - not known - medieval form ]
  • 10. From quantities to rhythm ● Collatinus knows the quantity of the vowels → easy to determine the tonical accent. ● Can split the syllabes→ hyphenation abs·cí·di (abs-cido) ≠ áb·sci·di (ab-scindo) ● Opens the way to the study of rhythmic prose ● Statistics on paroxytons and proparoxytons. ● Rhythm everywhere, not only inclausulae
  • 11. New feature : probabilistic tagger ● No one knows if a probabilistic tagger works with Latin (“free” order of the words) ● No evaluation of the error rate (not yet) ● 2nd order hidden Markov model ● Training corpus : LASLA. ● For each sentence, chooses the best sequence of tags and thus the best analysis
  • 12. Internal server ● Collatinus can answer questions asked from another program. For instance, LibreOffice.
  • 13. Internal server (2) ● A “standard” tagger with the data from LASLA - List of forms, with lemma and analysis - Exact number of occurrences - Tagset and statistics on sequences of tags ● Problem of the forms that have not been seen before : Ask Collatinus ! → Answers a LASLA-like analysis
  • 14. LASLA-Tagger ● Supervised lemmatization. ● Edition/correction of the result of the tagger. → Hope to reach zero error ! ● Exact number of occurrences. For “patres” : 253 voc. 110 nom. and 75 acc. Collatinus expects 6 voc. 200 nom. & 320 acc
  • 16. What is going on ? Praelector: ● “Grammatical assistant” to help pupils in reading Latin / understanding the sentence. →“Translation” in French of the sentence ● Medieval spelling : e forae, o forau, f forph, etc... Reduce the medieval forms and the classical words to the same “phonetic” form.
  • 17. Future plans ● Coupling of the probabilistic tagger with the gramatical analysis of the sentence ● “Universal” output format to gather all the information about the text (lemmas, analysis, scanned and accented forms...) ● Statistics on rhythmic prose (cursus) ● Looking for verses in prose
  • 18. Suggestions ● Yves.Ouvrard@collatinus.org ● Philippe.Verkerk@univ-lille1.fr Thank you for your attention ● http://outils.biblissima.fr/en/collatinus/ ● http://outils.biblissima.fr/en/collatinus-web/ ● For ancient Greek : http://outils.biblissima.fr/en/eulexis/