
IMPACT Final Conference - Language Parallel Sessions - Erjavec

Published in: Education, Technology

  1. Resources for historical Slovene
     Tomaž Erjavec
     Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
     IMPACT Conference 2011, October 24-25, 2011, London
  2. Tomaž Erjavec: Slovene language resources
     Background
     • Pre-story: AHLib (2004–08) (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
       • Corpus / DL of ger→slv books
       • AAS: transcription correction and markup (TEI P4)
       • JSI: automatic annotation and editing environment
     • Story: EU IP IMPACT (ext. 2010–2011)
       • Better OCR for historical texts
       • NUK: GTD transcriptions (PAGE/Aletheia)
       • JSI: (semi)manual lexicon construction
     • Co-story: Google award (2011)
       • Developing language models for historical Slovene
       • ZRC SAZU: transcriptions of old texts (TEI P5)
       • JSI: annotating a corpus of old Slovene
  3. Methodology
     [Diagram labels: Historical texts, Annotators, ToTrTaLe, Contemporary models, Corpus, Lexicon]
     • Develop 3 resources:
       • transcribed texts
       • hand-annotated corpus
       • lexicon of historical words
     • Develop annotation tool, ToTrTaLe
       • How to tag and lemmatise historical Slovene? Little chance of developing training data comparable to that for contemporary Slovene
       • Basic idea:
         • modernise words, then use models for modern Slovene
         • transcription is via fixed lexicon + transcription patterns
         • patterns implemented via LMU Vaam
         • mostly OK for XIX and XVIII century language
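The "basic idea" on the Methodology slide (fixed lexicon first, transcription patterns as a fallback) can be sketched roughly as follows. This is a minimal illustration, not the ToTrTaLe or LMU Vaam implementation: the lexicon entries reuse word forms from the slides, while the rewrite patterns and the sample word `zhaſ` are assumptions loosely based on older Slovene orthography.

```python
import re

# Hypothetical fixed lexicon: historical word-form -> modern word-form.
LEXICON = {
    "lubesen": "ljubezen",
    "ljubesin": "ljubezen",
}

# Hypothetical ordered rewrite patterns (assumed, not the project's rules).
# Longer graphemes come first so that "ſh" is handled before plain "ſ".
PATTERNS = [
    (r"ſh", "š"),
    (r"zh", "č"),
    (r"sh", "ž"),
    (r"ſ", "s"),
]

def modernise(word: str) -> str:
    """Return a modernised form: lexicon lookup, else pattern rewriting."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    for pattern, replacement in PATTERNS:
        key = re.sub(pattern, replacement, key)
    return key

if __name__ == "__main__":
    for w in ["lubesen", "zhaſ", "ponoči"]:
        print(w, "->", modernise(w))
```

The modernised form can then be passed to a tagger and lemmatiser trained on contemporary Slovene, which is the point of the approach: no historical training data is needed for that step.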
  4. Issues
     • Tokenisation: words were split differently in historical language:
       • žnjo → z njo
       • po noči → ponoči
     • Variability:
       • archaic forms: ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
       • inflection: ljubezen ← ljubezni, ljubeznijo
       • both: ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
     • Extinct words:
       • zajhen / cajhen / znamenje
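The tokenisation issue on the Issues slide amounts to splitting some historical tokens and merging others before any further processing. A small sketch under stated assumptions: the two rule tables below contain only the two examples from the slide, and the function name and the sample sentence are hypothetical.

```python
# Hypothetical rule tables, filled only with the examples from the slide.
SPLIT_RULES = {"žnjo": ["z", "njo"]}          # one historical token -> several modern tokens
MERGE_RULES = {("po", "noči"): "ponoči"}      # two historical tokens -> one modern token

def retokenise(tokens: list[str]) -> list[str]:
    """Apply merge rules to adjacent token pairs, then split rules to single tokens."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in MERGE_RULES:
            out.append(MERGE_RULES[pair])
            i += 2
        elif tokens[i] in SPLIT_RULES:
            out.extend(SPLIT_RULES[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

if __name__ == "__main__":
    print(retokenise(["šel", "je", "žnjo", "po", "noči"]))
```

Merges are checked before splits so that a pair rule is not missed when its first token would otherwise be consumed on its own.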
  5. Transcribed historical texts
     • AHLib corpus/DL: 90 books, 10,000 pages, 2M words (> 1850)
     • NUK GTD: 5,000 pages, 1M words
     • Google Books: 30 books, 10,000 pages, 2M words (in progress)
     • WikiSource (Lj Uni): 200 books, 5M words (in progress)
     • ~ 10M words
     • most texts have associated facsimiles
     • can be made freely available
  6. Initial lexicon
     • Development of initial lexicon (2010), using the data and tools at hand
     • AHLib collection (70 books > 1850)
     • Transcription rules + FidaPLUS lexicon of contemporary slv
     • LMU LeXtractor editing tool
     • produced 3,000 entries (word-forms)
  7. Reference corpus goo300k
     Period      Units  Pages  Tokens
     1584            1      8    6000
     1695            1     27   10000
     1751-1800       8    155   27000
     1801-1850      12    206   74000
     1851-1875      36    380  126000
     1876-1900      23    224   51000
     ∑              81   1000  296000
     • Page sampled
     • Each word annotated with:
       • Contemporary equivalent
       • Modern lemma
       • Part-of-speech tag
     • First with ToTrTaLe
     • Then manually correct
       • INL Cobalt Lexicon Tool
       • A team of annotators
       • Also correcting errors in transcription
       • Manual, cookbook, FAQ, mailing list, meetings…
     • TEI P5 – bibliography, links to facsimiles & DL
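The three annotation layers listed on the Reference corpus slide (contemporary equivalent, modern lemma, part-of-speech tag) suggest a simple per-token record. The sketch below is only one possible in-memory representation, not the corpus's TEI encoding; the class name, field names, and the tag strings are placeholders chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedToken:
    historical: str   # word-form as transcribed from the source page
    modern: str       # contemporary equivalent of that word-form
    lemma: str        # modern lemma
    tag: str          # part-of-speech tag (placeholder values below)

def is_historical(token: AnnotatedToken) -> bool:
    """A token counts as historical when its transcribed form differs from the modern one."""
    return token.historical != token.modern

# Illustrative tokens; the word-forms come from the slides, the tags are assumed.
tokens = [
    AnnotatedToken("lubesni", "ljubezni", "ljubezen", "Ncfsg"),
    AnnotatedToken("z", "z", "z", "Si"),
]

historical_forms = [t.historical for t in tokens if is_historical(t)]
print(historical_forms)
```

Keeping the transcribed form next to its modernised form makes it straightforward to count or extract the historical word-forms after manual correction.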
  8. INL Cobalt lexicon building tool
  9. TEI corpus dump
  10. Final lexicon
      goo300k          All  Historical
      Lex. entries   56346       22849
      Word-forms     53853       19627
      Normalised     46996       15402
      Modernised     37334       11396
      Lemmas         19569        8605
      Composition:
      • Initial LeXtractor lexicon (3k entries)
      • Lexicon dump from goo300k
      • Additional lexicon from full text collection
      Format:
      • TEI P5
      • lemma oriented
      • grammatical properties, glosses, historical spelling, (corpus) examples
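A lemma-oriented lexicon of the kind described on the Final lexicon slide can be pictured as grouping attested historical forms under their modern lemma and serialising each group as one entry. The sketch below uses Python's standard `xml.etree.ElementTree` and the TEI dictionary element names `entry`, `form`, and `orth`; the `type` values and the overall entry layout are assumptions, not the actual schema of the lexicon.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# (historical form, modern lemma) pairs; the forms are taken from the Issues slide.
PAIRS = [
    ("lubesen", "ljubezen"),
    ("ljubesin", "ljubezen"),
    ("lubesen", "ljubezen"),  # a repeated attestation is stored only once
]

def group_by_lemma(pairs):
    """Collect distinct historical forms under each lemma, keeping first-seen order."""
    lexicon = defaultdict(list)
    for form, lemma in pairs:
        if form not in lexicon[lemma]:
            lexicon[lemma].append(form)
    return dict(lexicon)

def to_entry(lemma, forms):
    """Build one lemma-oriented entry element (assumed layout)."""
    entry = ET.Element("entry")
    lemma_form = ET.SubElement(entry, "form", type="lemma")
    ET.SubElement(lemma_form, "orth").text = lemma
    for form in forms:
        hist = ET.SubElement(entry, "form", type="historical")
        ET.SubElement(hist, "orth").text = form
    return entry

lexicon = group_by_lemma(PAIRS)
xml_text = ET.tostring(to_entry("ljubezen", lexicon["ljubezen"]), encoding="unicode")
print(xml_text)
```

Grammatical properties, glosses, and corpus examples would be further children of each entry; they are left out here because the slides do not show how they are encoded.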
  11. Results
      • Language resources for historical Slovene:
        • Text Collection hs5M:
          • facsimile + transcription, DL (+ automatic annotation)
        • Annotated Corpus goo300k:
          • page-sampled, hand-annotated
        • Structured Lexicon imp20k:
          • grammar + glosses + forms + attestations
        • TEI P5, CC BY
      • ToTrTaLe + resources for HS:
        • tokenisation & transcription patterns
      • Services: CUWI, (moderniser+archaiser)
      • all still work in progress, available mid-2012
  12. Further work
      • Better IR for Digital Libraries: NUK
      • Dictionary of historical Slovene: ZRC
      • Beyond words: changes in syntax
      • MT paradigm
      • tweets & Croatian
