Resources for historical Slovene

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana


                  IMPACT Conference 2011
                 October 24-25, 2011, London
Tomaž Erjavec: Slovene language resources   2




Background
• Pre-story: AHLib (2004–08)
  (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
   • Corpus / DL of ger→slv books
   • AAS: transcription correction and markup (TEI P4)
   • JSI: automatic annotation and editing environment
• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction
• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene


Methodology
[Diagram: Texts, Annotators, Corpus, Historical lexicon, ToTrTaLe, Contemporary models]
• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words
• Develop annotation tool, ToTrTaLe
  • How to tag and lemmatise historical Slovene?
    Little chance of developing training data comparable to that for
    contemporary Slovene
  • Basic idea:
     • modernise words, then use models for modern Slovene
     • transcription is via fixed lexicon + transcription patterns
     • patterns implemented via LMU Vaam
     • mostly OK for XIX and XVIII century language
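The basic idea can be sketched in a few lines of Python. This is only an illustration, not ToTrTaLe itself: the lexicon pairs are word forms listed on the Issues slide, while the two rewrite rules and the name `modernise` are hypothetical, invented here so that one of those forms is handled by the pattern fallback.

```python
import re

# Toy "fixed lexicon": historical form -> modern form.
# The pairs are word forms listed on the Issues slide.
LEXICON = {
    "lubesen": "ljubezen",
    "ljubesen": "ljubezen",
    "ljubezin": "ljubezen",
}

# Hypothetical rewrite rules, invented for this sketch only;
# they are NOT the project's actual transcription patterns.
PATTERNS = [
    (re.compile("ſ"), "z"),
    (re.compile("^lub"), "ljub"),
]


def modernise(word):
    """Return a modernised form: lexicon lookup first, then patterns."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    for pattern, replacement in PATTERNS:
        key = pattern.sub(replacement, key)
    return key


if __name__ == "__main__":
    for w in ["Lubesen", "lubeſni", "ljubezen"]:
        print(w, "->", modernise(w))
```

The modernised form would then be passed to a tagger and lemmatiser trained on contemporary Slovene; that step is left out of the sketch.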




Issues
• Tokenisation: words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči
• Variability:
  • archaic forms:
    ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection:
    ljubezen ← ljubezni, ljubeznijo
  • both:
    ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn,
    ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
  • zajhen / cajhen / znamenje
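The tokenisation issue can be illustrated with a small normalisation pass that splits and merges tokens before any modernisation. This is a sketch under assumptions: the two table entries are the examples from this slide, but the table-driven design and the names `SPLIT`, `MERGE` and `normalise_tokens` are hypothetical and not taken from ToTrTaLe.

```python
# One historical token that corresponds to two modern tokens.
SPLIT = {"žnjo": ["z", "njo"]}

# Two historical tokens that correspond to one modern token.
MERGE = {("po", "noči"): "ponoči"}


def normalise_tokens(tokens):
    """Re-tokenise a list of historical tokens to modern word boundaries."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE:
            # Merge the current token with the next one.
            out.append(MERGE[pair])
            i += 2
        elif tokens[i] in SPLIT:
            # Split the current token into several.
            out.extend(SPLIT[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(normalise_tokens(["žnjo", "po", "noči"]))
```

A fixed table like this only covers listed cases; productive splits and merges would need patterns rather than an enumeration.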




Transcribed historical texts
• AHLib corpus/DL:
  90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD:
  5,000 pages, 1M words
• Google Books:
  30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni):
  200 books, 5M words (in progress)
Total: ~10M words

• most texts have associated facsimiles
• can be made freely available




Initial Lexicon
• Development of initial lexicon (2010), using the data and tools at hand
• AHLib collection (70 books > 1850)
• Transcription rules + FidaPLUS lexicon of contemporary slv
• LMU LeXtractor editing tool
• produced 3,000 entries (word-forms)


Reference corpus goo300k
• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5: bibliography, links to facsimiles & DL

Period       Units   Pages   Tokens
1584             1       8     6000
1695             1      27    10000
1751-1800        8     155    27000
1801-1850       12     206    74000
1851-1875       36     380   126000
1876-1900       23     224    51000
∑               81    1000   296000
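The totals row of this slide's period table can be checked with a short script. The figures are copied from the slide; the name `column_totals` is hypothetical. The units and pages add up exactly, while the per-period token counts sum to 294000 rather than the stated 296000, presumably because the per-period figures are rounded.

```python
# (period, units, pages, tokens) as given on the slide.
ROWS = [
    ("1584", 1, 8, 6000),
    ("1695", 1, 27, 10000),
    ("1751-1800", 8, 155, 27000),
    ("1801-1850", 12, 206, 74000),
    ("1851-1875", 36, 380, 126000),
    ("1876-1900", 23, 224, 51000),
]


def column_totals(rows):
    """Sum the units, pages and tokens columns."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


if __name__ == "__main__":
    units, pages, tokens = column_totals(ROWS)
    print(units, pages, tokens)
```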



INL Cobalt lexicon building tool




TEI corpus dump




Final lexicon
Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection
Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples

goo300k          All   Historical
Lex. entries   56346        22849
Word-forms     53853        19627
Normalised     46996        15402
Modernised     37334        11396
Lemmas         19569         8605
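A lemma-oriented lexicon dump from an annotated corpus could look roughly like the sketch below. It is an assumption-laden illustration: the actual lexicon is encoded in TEI P5, whereas the `LexEntry` class, its field names and `build_lexicon` are hypothetical, and the token triples only reuse word forms from the Issues slide with plausible modern equivalents.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LexEntry:
    lemma: str
    # modern word-form -> set of attested historical spellings
    forms: dict = field(default_factory=lambda: defaultdict(set))


def build_lexicon(annotated_tokens):
    """Group (historical form, modern form, lemma) triples by lemma."""
    lexicon = {}
    for historical, modern, lemma in annotated_tokens:
        entry = lexicon.setdefault(lemma, LexEntry(lemma))
        entry.forms[modern].add(historical)
    return lexicon


# Illustrative triples only; not real corpus annotations.
TOKENS = [
    ("lubesen", "ljubezen", "ljubezen"),
    ("ljubesni", "ljubezni", "ljubezen"),
    ("ljubeznijo", "ljubeznijo", "ljubezen"),
]

if __name__ == "__main__":
    for lemma, entry in build_lexicon(TOKENS).items():
        print(lemma, dict(entry.forms))
```

Grammatical properties, glosses and corpus examples would be further fields on each entry; they are omitted here to keep the sketch short.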




Results
• Language resources for historical Slovene:
   • Text Collection hs5M:
     • facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k:
     • page-sampled, hand-annotated
  • Structured Lexicon imp20k:
     • grammar + glosses + forms + attestations
  • TEI P5, CC BY
• ToTrTaLe + resources for HS:
   • tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012




Further work
• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian

IMPACT Final Conference - Language Parallel Sessions - Erjavec
