
IMPACT Final Conference - Language Parallel Sessions - Erjavec

Transcript of "IMPACT Final Conference - Language Parallel Sessions - Erjavec"

  1. Resources for historical Slovene
     Tomaž Erjavec
     Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
     IMPACT Conference 2011, October 24-25, 2011, London
  2. Tomaž Erjavec: Slovene language resources
     Background
     • Pre-story: AHLib (2004–08) (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
       • Corpus / DL of ger→slv books
       • AAS: transcription correction and markup (TEI P4)
       • JSI: automatic annotation and editing environment
     • Story: EU IP IMPACT (ext. 2010–2011)
       • Better OCR for historical texts
       • NUK: GTD transcriptions (PAGE/Aletheia)
       • JSI: (semi)manual lexicon construction
     • Co-story: Google award (2011)
       • Developing language models for historical Slovene
       • ZRC SAZU: transcriptions of old texts (TEI P5)
       • JSI: annotating a corpus of old Slovene
  3. Methodology
     [Diagram labels: Annotators, Historical Texts, Corpus, Lexicon, ToTrTaLe, Contemporary models]
     • Develop 3 resources:
       • transcribed texts
       • hand-annotated corpus
       • lexicon of historical words
     • Develop annotation tool, ToTrTaLe
       • How to tag and lemmatise historical Slovene? Little chance of developing training data comparable to that for contemporary Slovene
       • Basic idea:
         • modernise words, then use models for modern Slovene
         • transcription is via fixed lexicon + transcription patterns
         • patterns implemented via LMU Vaam
         • mostly OK for XIX and XVIII century language
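The "basic idea" on this slide (fixed lexicon first, transcription patterns as fallback) can be illustrated with a minimal Python sketch. It is not ToTrTaLe or Vaam code: the names `LEXICON`, `PATTERNS` and `modernise` are hypothetical, the lexicon pairs are borrowed from the Issues slide, and the single rewrite rule (long s to s) is an assumed toy pattern, not one taken from the talk.

```python
import re

# Hypothetical fixed lexicon: historical form -> contemporary form.
# The pairs are taken from the examples on the "Issues" slide.
LEXICON = {
    "lubesen": "ljubezen",
    "ljubesen": "ljubezen",
    "cajhen": "znamenje",
}

# Hypothetical transcription patterns, applied in order when the lexicon
# has no entry. This toy rule only replaces the long s; real pattern sets
# (implemented via LMU Vaam in the talk) are not shown in the slides.
PATTERNS = [
    (re.compile("ſ"), "s"),
]


def modernise(word: str) -> str:
    """Return a contemporary-looking form for a historical word."""
    key = word.lower()
    # 1. Fixed lexicon lookup takes priority.
    if key in LEXICON:
        return LEXICON[key]
    # 2. Otherwise fall back to the rewrite patterns.
    for pattern, replacement in PATTERNS:
        key = pattern.sub(replacement, key)
    return key


if __name__ == "__main__":
    for w in ["Lubesen", "lubeſn", "cajhen"]:
        print(w, "->", modernise(w))
```

The modernised form would then be passed to a tagger and lemmatiser trained on contemporary Slovene; that step is left out here because the slides do not describe its interface.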
  4. Issues
     • Tokenisation: words were split differently in historical language:
       • žnjo → z njo
       • po noči → ponoči
     • Variability:
       • archaic forms: ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
       • inflection: ljubezen ← ljubezni, ljubeznijo
       • both: ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
     • Extinct words:
       • zajhen / cajhen / znamenje
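The tokenisation issue can be sketched as a retokenisation pass that splits some historical tokens and merges some token pairs. This is only an illustration under stated assumptions: the tables `SPLIT` and `MERGE` and the function `retokenise` are hypothetical names, populated with just the two examples from the slide, and say nothing about how ToTrTaLe actually implements this step.

```python
# Hypothetical rule tables built from the two examples on the slide.
SPLIT = {"žnjo": ["z", "njo"]}          # one historical token -> several modern tokens
MERGE = {("po", "noči"): "ponoči"}      # two historical tokens -> one modern token


def retokenise(tokens: list[str]) -> list[str]:
    """Apply merge rules to adjacent pairs, then split rules to single tokens."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in MERGE:
            # Two historical tokens correspond to one contemporary word.
            out.append(MERGE[pair])
            i += 2
        elif tokens[i] in SPLIT:
            # One historical token corresponds to several contemporary words.
            out.extend(SPLIT[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(retokenise(["žnjo", "po", "noči"]))
```

Checking merges before splits is a design choice of this sketch; with only these two rules the order makes no difference.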
  5. Transcribed historical texts
     • AHLib corpus/DL: 90 books, 10,000 pages, 2M words (> 1850)
     • NUK GTD: 5,000 pages, 1M words
     • Google Books: 30 books, 10,000 pages, 2M words (in progress)
     • WikiSource (Lj Uni): 200 books, 5M words (in progress)
     ~ 10M words
     • most texts have associated facsimiles
     • can be made freely available
  6. Initial Lexicon
     • Development of initial lexicon (2010), using the data and tools at hand
     • AHLib collection (70 books > 1850)
     • Transcription rules + FidaPLUS lexicon of contemporary slv
     • LMU LeXtractor editing tool
     • produced 3,000 entries (word-forms)
  7. Reference corpus
     goo300k
     • Page sampled
     • Each word annotated with:
       • Contemporary equivalent
       • Modern lemma
       • Part-of-speech tag
     • First with ToTrTaLe
     • Then manually correct
       • INL Cobalt Lexicon Tool
       • A team of annotators
       • Also correcting errors in transcription
       • Manual, cookbook, FAQ, mailing list, meetings…
     • TEI P5 – bibliography, links to facsimiles & DL

     Period      Units  Pages  Tokens
     1584            1      8    6000
     1695            1     27   10000
     1751-1800       8    155   27000
     1801-1850      12    206   74000
     1851-1875      36    380  126000
     1876-1900      23    224   51000
     ∑              81   1000  296000
  8. INL Cobalt lexicon building tool
  9. TEI corpus dump
  10. Final lexicon
     Composition:
     • Initial LeXtractor lexicon (3k entries)
     • Lexicon dump from goo300k
     • Additional lexicon from full text collection
     Format:
     • TEI P5
     • lemma oriented
     • grammatical properties, glosses, historical spelling, (corpus) examples

     goo300k         All  Historical
     Lex. entries  56346       22849
     Word-forms    53853       19627
     Normalised    46996       15402
     Modernised    37334       11396
     Lemmas        19569        8605
  11. Results
     • Language resources for historical Slovene:
       • Text Collection hs5M:
         • facsimile + transcription, DL (+ automatic annotation)
       • Annotated Corpus goo300k:
         • page-sampled, hand-annotated
       • Structured Lexicon imp20k:
         • grammar + glosses + forms + attestations
       • TEI P5, CC BY
     • ToTrTaLe + resources for HS:
       • tokenisation & transcription patterns
     • Services: CUWI, (moderniser+archaiser)
     • all still work in progress, available mid-2012
  12. Further work
     • Better IR for Digital Libraries: NUK
     • Dictionary of historical Slovene: ZRC
     • Beyond words: changes in syntax
     • MT paradigm
     • tweets & Croatian