Ivan Derganskyi


Bilingual and multilingual electronic language resources


  1. 1. Bilingual and multilingual electronic language resources. Ivan A. Derzhanski ( [email_address] ), Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Department of Mathematical Linguistics
  2. 2. Resources for language engineering <ul><li>lexical databases (LDBs) </li></ul><ul><li>electronic dictionaries </li></ul><ul><ul><li>monolingual </li></ul></ul><ul><ul><li>bilingual and multilingual </li></ul></ul><ul><li>corpora </li></ul>
  3. 3. Corpus annotation <ul><li>Def: the process of adding linguistic information in electronic form to a text corpus. </li></ul><ul><li>Most common types: </li></ul><ul><ul><li>morphosyntactic (grammatical, PoS) annotation </li></ul></ul><ul><ul><li>lemma annotation </li></ul></ul>
  4. 4. PoS tagging <ul><li>Def: the task of labelling each word in a sequence of words with its appropriate part of speech. </li></ul><ul><li>Ambiguity: </li></ul><ul><ul><li>вероятно ‘probable (sg. n.), probably’ </li></ul></ul><ul><ul><ul><li>вероятно -> PoS: adjective, Gender: neuter, Number: singular, Definiteness: no </li></ul></ul></ul><ul><ul><ul><li>вероятно -> PoS: adverb, Type: adjectival </li></ul></ul></ul><ul><li>Def (tagset): the set of PoS tags used for the annotation </li></ul>
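The ambiguity on this slide can be sketched as a lexicon lookup that returns every analysis of a word form. This is a minimal illustration, not the project's tagger: the `Analysis` class, the `LEXICON` table and the helper names are hypothetical, and only the two analyses of вероятно come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible reading of a word form: a PoS tag plus feature values."""
    pos: str
    features: tuple = ()


# Hypothetical toy lexicon; the two readings are the ones given on the slide.
LEXICON = {
    "вероятно": [
        Analysis("adjective", (("Gender", "neuter"),
                               ("Number", "singular"),
                               ("Definiteness", "no"))),
        Analysis("adverb", (("Type", "adjectival"),)),
    ],
}


def analyses(word_form):
    """Return all readings of a word form (empty list if unknown)."""
    return LEXICON.get(word_form.lower(), [])


def is_ambiguous(word_form):
    """A form is PoS-ambiguous if its readings carry more than one PoS tag."""
    return len({a.pos for a in analyses(word_form)}) > 1


def tagset(lexicon):
    """The tagset is simply the set of PoS tags that occur in the lexicon."""
    return {a.pos for readings in lexicon.values() for a in readings}
```

A tagger then has to choose one of the readings returned by `analyses` from the context; that disambiguation step is not shown here.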
  5. 5. Electronic corpora of Bulgarian <ul><li>The first two electronic corpora of the Bulgarian language were created in the framework of two EU projects on language technologies: </li></ul><ul><li>MULTEXT-East ( http://nl.ijs.si/IME ); </li></ul><ul><li>CONCEDE. </li></ul>
  6. 6. MULTEXT-East <ul><li>The project MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages, 1995–1997) produced resources for six Central and Eastern European languages: </li></ul><ul><li>Bulgarian, </li></ul><ul><li>Slovene, </li></ul><ul><li>Czech, </li></ul><ul><li>Romanian, </li></ul><ul><li>Hungarian, </li></ul><ul><li>Estonian, </li></ul><ul><li>as well as English (as the ‘hub language’ of the project). </li></ul>
  7. 7. MULTEXT-East (continued) <ul><li>The extended results of the project were made available in 1998, first on CD-ROM and then via TRACTOR, the TELRI (Trans-European Language Resources Infrastructure) Research Archive of Computational Tools and Resources. </li></ul><ul><li>Version 3 (2004) includes material in five more languages (Croatian, Lithuanian, Resian, Russian, Serbian). </li></ul>
  8. 8. MULTEXT-East (continued) <ul><li>The corpus of Bulgarian, developed according to the methodology and requirements of the project, contains three parts: </li></ul><ul><li>Bulgarian Language-Specific Resources, </li></ul><ul><li>a Parallel Annotated 1984 Corpus, </li></ul><ul><li>a Comparative Corpus. </li></ul>
  9. 9. The Parallel Annotated 1984 Corpus <ul><li>The Parallel Annotated 1984 Corpus consists of </li></ul><ul><li>the Bulgarian translation of George Orwell’s novel Nineteen Eighty-Four (approximately 87,000 words); </li></ul><ul><li>Bulgarian-English aligned texts. </li></ul>
  10. 10. The Parallel Annotated 1984 Corpus ( continued ) <ul><li>The material was formatted as a well-structured, lemmatised, Corpus Encoding Standard (CES) corpus (Ide, 1998). </li></ul><ul><li>That is, each word form is accompanied by the corresponding lemma and grammatical information that constitute its standard lexical description. </li></ul>
  11. 11. The Parallel Annotated 1984 Corpus ( continued ) <ul><li>The lexical descriptions for Bulgarian are in line with the terminology and the methodology used by MULTEXT. </li></ul><ul><li>The corpus was marked and validated for alignment and sentence boundaries. </li></ul>
  12. 12. The Comparative Corpus <ul><li>The Comparative Corpus contains two subsets of about 100,000 words each: fiction (excerpts from two contemporary Bulgarian novels) and newspaper text (excerpts). </li></ul><ul><li>The data is comparable across the six languages in terms of the number and size of texts. </li></ul>
  13. 13. The Comparative Corpus (continued) <ul><li>The entire multilingual Comparative Corpus was prepared in CES (Corpus Encoding Standard) format, manually or using ad-hoc tools, and was automatically annotated for tokenisation, sentence boundaries, and part of speech using the project tools. </li></ul>
  14. 14. Bulgarian Language-Specific Resources <ul><li>The Bulgarian Language-Specific Resources are data required by the segmentation procedure, morphological analyser and disambiguator. </li></ul><ul><li>This includes a lexical list and lists of special tokens (frequent abbreviations and names, titles, patterns for proper names, etc.) with their types. </li></ul>
  15. 15. The lexicon <ul><li>The lexical list (lexicon) contains about 242,000 lemmata. </li></ul><ul><li>Each lemma in the lexicon is associated with its part(s) of speech and lexical characteristics. </li></ul><ul><li>156,000 morphosyntactic descriptions were provided for Bulgarian. </li></ul>
  16. 16. The lexicon (continued) <ul><li>Each lexicon entry includes the following information: </li></ul><ul><li>word form; </li></ul><ul><li>lemma; </li></ul><ul><li>part of speech; </li></ul><ul><li>further morphological information (feature values). </li></ul>
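The four fields of a lexicon entry listed above map naturally onto a small record type. The sketch below is an illustration only: `LexiconEntry`, `build_indexes` and the feature names are assumptions, not the MULTEXT-East file format. The sample forms цел and целта (feminine, as marked in the CONCEDE entry later in the slides) serve as data.

```python
from collections import defaultdict
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: word form, lemma, part of speech, feature values."""
    word_form: str
    lemma: str
    pos: str
    features: dict


def build_indexes(entries):
    """Index entries both ways: by word form (analysis) and by lemma (generation)."""
    by_form = defaultdict(list)
    by_lemma = defaultdict(list)
    for entry in entries:
        by_form[entry.word_form].append(entry)
        by_lemma[entry.lemma].append(entry)
    return by_form, by_lemma


# Illustrative entries; the feature names are hypothetical.
ENTRIES = [
    LexiconEntry("цел", "цел", "noun",
                 {"Gender": "feminine", "Number": "singular", "Definiteness": "no"}),
    LexiconEntry("целта", "цел", "noun",
                 {"Gender": "feminine", "Number": "singular", "Definiteness": "yes"}),
]

BY_FORM, BY_LEMMA = build_indexes(ENTRIES)
```

Looking up `BY_FORM["целта"]` yields the lemma and features of that form, while `BY_LEMMA["цел"]` lists all stored forms of the lemma.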
  17. 17. The lexicon (continued) <ul><li>part of speech </li></ul><ul><ul><li>the traditional set of 10 parts of speech </li></ul></ul><ul><ul><li>punctuation </li></ul></ul><ul><ul><li>abbreviations </li></ul></ul><ul><ul><li>numbers written in digits </li></ul></ul><ul><ul><li>unidentified objects (residuals) </li></ul></ul><ul><li>same system for all languages of the project (though different interpretations) </li></ul>
  18. 18. Lexicography ↔ Linguistic Theory <ul><li>lexicography requires linguistic theory (analysis, methodology) </li></ul><ul><ul><li>but also serves as a touchstone, because what can be represented must have been studied, understood, formalised to a sufficient extent </li></ul></ul><ul><li>lexicography supports linguistic theory (data for research) </li></ul>
  19. 19. Dictionary ↔ Grammar <ul><li>mutually complementary, mutually indispensable components of integrated linguistic description </li></ul><ul><li>lexicographic type (unification) </li></ul><ul><li>lexicographic portrait (individualisation) </li></ul>
  20. 20. Computational lexicography <ul><li>digital (machine-readable) dictionaries: </li></ul><ul><li>digital versions of traditional dictionaries for human use </li></ul><ul><li>computer dictionaries as components of information systems </li></ul>
  21. 21. Advantages of digital dictionaries <ul><li>size not an issue </li></ul><ul><ul><li>potential for infinite growth in depth and breadth (a dictionary needn’t be small, medium or large by design) </li></ul></ul><ul><ul><li>many purposes served (explanatory dictionary, grammatical dictionary, dictionary of synonyms, antonyms, phraseology, etymology, etc., all as one integrated system) </li></ul></ul>
  22. 22. Advantages of digital dictionaries (continued) <ul><li>easy update possible, incl. by continued distributed collective effort (wiki-style) </li></ul><ul><li>flexible search (incl. bidirectional) and presentation of results </li></ul><ul><li>audio-, video- etc. material can be added </li></ul><ul><li>requirement: definitions must be simpler, but at the same time more comprehensive </li></ul>
  23. 23. Dictionary (definition) <ul><li>an aggregate of linguistic units (forms) </li></ul><ul><ul><li>established in the language system as represented by the usage of a certain language community, </li></ul></ul><ul><ul><li>put in a predetermined order and </li></ul></ul><ul><ul><li>accompanied by formal (orthographic, phonetic, grammatical, etymological, stylistic, etc.) and semantic information </li></ul></ul><ul><ul><ul><li>on the linguistic units themselves or </li></ul></ul></ul><ul><ul><ul><li>on the denoted entities or phenomena, </li></ul></ul></ul>
  24. 24. Dictionary (definition, continued) <ul><li>an aggregate of linguistic units (forms) </li></ul><ul><ul><li>put in a predetermined order and </li></ul></ul><ul><ul><li>accompanied by formal and semantic information, </li></ul></ul><ul><ul><li>arranged and ordered in a certain way within the entry , </li></ul></ul><ul><li>… almost always supplemented by auxiliary material </li></ul><ul><ul><li>introduction, criteria, sources, list of abbreviations, structure of the dictionary entry, grammar tables </li></ul></ul>
  25. 25. Structure of the dictionary entry <ul><li>register part (on the left) </li></ul><ul><li>interpretation part (on the right) </li></ul><ul><li>all the register parts together form the dictionary’s register </li></ul><ul><li>the set of rules and methods used when composing the entries forms the metalanguage </li></ul>
  26. 26. The register <ul><li>designing the register (needn’t be a one-time event in the case of an electronic dictionary) </li></ul><ul><ul><li>from other dictionaries </li></ul></ul><ul><ul><li>from a corpus of texts </li></ul></ul><ul><li>editing the register: eliminating obsolete words, arbitrary neologisms, suspected non-words </li></ul><ul><li>automatic extension: productive derivation made into procedures </li></ul>
  27. 27. Structural aspects of lexicography <ul><li>macrostructure: nature and purpose of the dictionary, place within the typology of dictionaries, choice of register, choice of illustrations, order, metalanguage </li></ul><ul><li>mediostructure: relations between language units, e.g., derivation, families of words </li></ul><ul><li>microstructure: setup of the entry, hierarchy of meanings; requirements: standardisation, economy, simplicity, completeness </li></ul>
  28. 28. An example of a lexical entry: CONCEDE Bulgarian dictionary <ul><li><entry> </li></ul><ul><li><hw>цел</hw> </li></ul><ul><li><gen>ж.</gen> </li></ul><ul><li><struc type="Sense" n="1"> </li></ul><ul><li><def>Това, към което е насочена някаква дейност, към което </li></ul><ul><li>някой се стреми; умисъл, намерение.</def> </li></ul><ul><li><eg><q>С каква цел отиваш в града?</q></eg> </li></ul><ul><li><eg><q>Вървя без цел.</q></eg> </li></ul><ul><li><eg><q>Постигнах целта си.</q></eg> </li></ul><ul><li><eg><q>Целта оправдава средствата.</q></eg></struc> </li></ul><ul><li><struc type="Sense" n="2"> </li></ul><ul><li><def>Предмет или точка, в която някой стреля, към </li></ul><ul><li>която е насочено определено действие, движение, удар и под.; </li></ul><ul><li>прицел.</def> </li></ul><ul><li><eg><q>Улучих целта.</q></eg></struc> </li></ul><ul><li><struc type="Phrases"> </li></ul><ul><li><struc type="Phrase" n="1"><orth>Имам (нямам) [за] цел.</orth> </li></ul><ul><li><def>стремя се (не се стремя) към нещо.</def> </li></ul><ul><li><eg><q>Нямам за цел да му навредя.</q></eg></struc> </li></ul><ul><li><struc type="Phrase" n="2"><orth>Попадам в целта.</orth> </li></ul><ul><li><def>улучвам, умервам.</def></struc> </li></ul><ul><li></struc> </li></ul><ul><li><etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym> </li></ul><ul><li></entry> </li></ul>
  29. 29. An example of a lexical entry (zoom, part 1: head word, gender) <ul><li><entry> </li></ul><ul><li><hw> цел </hw> </li></ul><ul><li><gen> ж. </gen> </li></ul><ul><li>[ … ] </li></ul><ul><li></entry> </li></ul>
  30. 30. An example of a lexical entry (zoom, part 2) <ul><li><struc type="Sense" n="1"> </li></ul><ul><li><def> Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение. </def> </li></ul><ul><li><eg><q> С каква цел отиваш в града? </q></eg> </li></ul><ul><li><eg><q> Вървя без цел. </q></eg> </li></ul><ul><li><eg><q> Постигнах целта си. </q></eg> </li></ul><ul><li><eg><q> Целта оправдава средствата. </q></eg></struc> </li></ul>
  31. 31. An example of a lexical entry (zoom, part 3) <ul><li><struc type="Sense" n="2"> </li></ul><ul><li><def> Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел. </def> </li></ul><ul><li><eg><q> Улучих целта. </q></eg></struc> </li></ul>
  32. 32. An example of a lexical entry (zoom, part 4) <ul><li><struc type="Phrases"> </li></ul><ul><li><struc type="Phrase" n="1"><orth> Имам (нямам) [за] цел. </orth> </li></ul><ul><li><def> стремя се (не се стремя) към нещо. </def> </li></ul><ul><li><eg><q> Нямам за цел да му навредя. </q></eg></struc> </li></ul><ul><li><struc type="Phrase" n="2"><orth> Попадам в целта. </orth> </li></ul><ul><li><def> улучвам, умервам. </def></struc> </li></ul><ul><li></struc> </li></ul>
  33. 33. An example of a lexical entry (zoom, part 5: etymology) <ul><li><entry> </li></ul><ul><li>[…] </li></ul><ul><li><etym><lang> нем. </lang> &gt; <lang> рус. </lang></etym> </li></ul><ul><li></entry> </li></ul>
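Because the entry is marked up as XML, its parts can be extracted mechanically. The sketch below parses an abridged copy of the цел entry with Python's standard `xml.etree.ElementTree`. The `parse_entry` function and the shape of its result are assumptions made for illustration, not part of the CONCEDE tools.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry shown on the slides (two senses, one example each).
ENTRY_XML = """<entry>
<hw>цел</hw>
<gen>ж.</gen>
<struc type="Sense" n="1">
<def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>
<eg><q>Вървя без цел.</q></eg>
</struc>
<struc type="Sense" n="2">
<def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>
<eg><q>Улучих целта.</q></eg>
</struc>
<etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>
</entry>"""


def parse_entry(xml_text):
    """Turn one <entry> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    senses = []
    # Only top-level <struc type="Sense"> elements are senses.
    for struc in root.findall("struc[@type='Sense']"):
        senses.append({
            "n": int(struc.get("n")),
            "definition": struc.findtext("def").strip(),
            "examples": [q.text.strip() for q in struc.findall("eg/q")],
        })
    return {
        "headword": root.findtext("hw").strip(),
        "gender": root.findtext("gen").strip(),
        "senses": senses,
        "etymology": [lang.text.strip() for lang in root.findall("etym/lang")],
    }


ENTRY = parse_entry(ENTRY_XML)
```

The order of the senses in the result follows the order of the `<struc>` elements in the entry, so the hierarchy of meanings chosen by the lexicographer is preserved.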
  34. 34. ABBYY Lingvo (Ru–It)
  35. 35. ABBYY Lingvo (Ru–Et) <ul><li>цель </li></ul><ul><li>[m1][trn]eesmärk, märk, otstarve, siht[/trn][/m] </li></ul>
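The card body above uses bracketed tags, so the translation equivalents can be pulled out with a regular expression. This is a rough sketch that assumes `[trn]…[/trn]` delimits the translation zone and that equivalents are separated by commas; it is not a full parser for the dictionary format, and the function name is hypothetical.

```python
import re

# Assumption: the text between [trn] and [/trn] holds the translation equivalents.
TRN_ZONE = re.compile(r"\[trn\](.*?)\[/trn\]", re.DOTALL)


def translations(card_body):
    """Return the comma-separated equivalents found in all [trn] zones, in order."""
    result = []
    for zone in TRN_ZONE.findall(card_body):
        result.extend(part.strip() for part in zone.split(",") if part.strip())
    return result


CARD = "[m1][trn]eesmärk, märk, otstarve, siht[/trn][/m]"
EQUIVALENTS = translations(CARD)
```

The list keeps the order in which the equivalents appear in the card, which matters for the question raised on the next slides.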
  36. 36. Why is order important?
  37. 38. Why is order important? (continued) <ul><li>Ингредиенты : сахар, глюкоза, мука, милая , корица, какао, сода, маргарин </li></ul>
  38. 39. Why is order important? (continued) <ul><li>Ингредиенты: бикарбонат натрия, ароматы, студень , молочный порошок, эмульгатор </li></ul>
  39. 40. wash (En – Ru )
  40. 41. honey (En – Ru )
  41. 42. jelly (En – Ru )
  42. 43. Digital grammatical dictionaries <ul><li>modelling of inflexion </li></ul><ul><ul><li>(essential for inflecting languages) </li></ul></ul><ul><li>word form ↔ lemma + grammatical meaning </li></ul><ul><ul><li>built upon a formal model of inflexion: a division of the set of words into inflexional paradigmatic classes (non-intersecting subsets with algorithmically described rules) </li></ul></ul>
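The model described above, in which every word belongs to exactly one paradigmatic class with rule-described endings, can be sketched in a few lines. The class names, the table layout and the restriction to two present-tense cells are assumptions made for illustration. The two verbs and their endings (-я/-иш, -м/-ш) are taken from the Bulgarian–Polish dictionary sample later in the slides.

```python
# Hypothetical paradigm classes: grammatical meaning -> ending.
PARADIGMS = {
    "ya-ish": {("present", "1sg"): "я", ("present", "2sg"): "иш"},
    "m-sh": {("present", "1sg"): "м", ("present", "2sg"): "ш"},
}

# Each lemma is assigned a stem and exactly one paradigm class.
STEMS = {
    "претоваря": ("претовар", "ya-ish"),
    "претопявам": ("претопява", "m-sh"),
}


def generate(lemma):
    """Lemma -> all word forms, keyed by grammatical meaning."""
    stem, paradigm = STEMS[lemma]
    return {meaning: stem + ending
            for meaning, ending in PARADIGMS[paradigm].items()}


def analyse(word_form):
    """Word form -> list of (lemma, grammatical meaning) pairs."""
    hits = []
    for lemma in STEMS:
        for meaning, form in generate(lemma).items():
            if form == word_form:
                hits.append((lemma, meaning))
    return hits
```

`generate` and `analyse` are the two directions of the word form ↔ lemma + grammatical meaning correspondence. A real grammatical dictionary would cover the whole paradigm and would index the forms instead of regenerating them on every lookup.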
  43. 44. Bi- and multilingual dictionaries <ul><li>translation: </li></ul><ul><li>most general member(s) of the corresponding synset </li></ul><ul><li>grammatical semantics (incl. valency, subcategorisation) </li></ul><ul><li>pragmatic context (sublanguage of most frequent usage) </li></ul>
  44. 45. Bi- and multilingual dictionaries (continued) <ul><li>bilingual dictionary: </li></ul><ul><li>two integrated linguistic systems (explanatory dictionary, grammatical dictionary, dictionary of synonyms, of antonyms, of phraseology) </li></ul><ul><li>complemented by </li></ul><ul><ul><li>comparable monolingual corpora and </li></ul></ul><ul><ul><li>a parallel bilingual corpus and </li></ul></ul><ul><li>linked by an interface </li></ul>
  45. 46. Bi- and multilingual dictionaries (continued) <ul><li>Integrating a synonym and a translation linguistic system: EuroWordNet (an assembly of WordNets using a common ontology and indexing) </li></ul>
  46. 47. Bi- and multilingual dictionaries (continued) <ul><li>multilingual dictionary: </li></ul><ul><ul><li>a set of pairs of bilingual dictionaries </li></ul></ul><ul><ul><li>interlingua </li></ul></ul><ul><ul><ul><li>one of the target languages </li></ul></ul></ul><ul><ul><ul><li>an external natural language </li></ul></ul></ul><ul><ul><ul><li>an artificial but speakable language (e.g., Esperanto) </li></ul></ul></ul><ul><ul><ul><li>a semantic interlingua (a digital concept dictionary) </li></ul></ul></ul>
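The difference between the two designs can be made concrete. With a semantic interlingua, each language needs only one mapping to shared concepts, whereas the pairwise design needs a dictionary for every ordered pair of languages. In the sketch below the concept identifier `AIM`, the table `TO_CONCEPT` and the word-to-concept links are hypothetical and serve only as an illustration.

```python
# Hypothetical concept identifiers; the links are illustrative, not lexicographic data.
TO_CONCEPT = {
    "bg": {"цел": ["AIM"]},
    "ru": {"цель": ["AIM"]},
    "et": {"eesmärk": ["AIM"]},
}


def translate(word, source, target):
    """Translate through the interlingua: word -> concepts -> target-language words."""
    concepts = set(TO_CONCEPT[source].get(word, []))
    return [candidate
            for candidate, candidate_concepts in TO_CONCEPT[target].items()
            if concepts & set(candidate_concepts)]


def directed_bilingual_dictionaries(n_languages):
    """Number of source->target dictionaries needed without an interlingua."""
    return n_languages * (n_languages - 1)
```

For four languages the pairwise design needs 12 directed dictionaries, while the interlingua design needs four mappings, one per language.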
  47. 48. Plans <ul><li>of the joint research project “Semantics and Contrastive linguistics with a focus on a bilingual electronic dictionary” between IMI-BAS and ISS-PAS: </li></ul><ul><li>Bulgarian–Polish/Polish–Bulgarian dictionaries </li></ul><ul><li>Bulgarian–Polish–Ukrainian dictionary </li></ul><ul><li>Bulgarian–Polish–Ukrainian–Lithuanian … </li></ul><ul><li>… more? </li></ul>
  48. 49. Bulgarian–Polish/Polish–Bulgarian dictionaries … on the basis of (1) <ul><li>the most recent paper bilingual dictionaries (1987, 1988) </li></ul><ul><li>volume ≈60 000 words </li></ul><ul><li>already dated </li></ul><ul><li>of questionable reliability to boot </li></ul>
  49. 50. Bulgarian–Polish/Polish–Bulgarian dictionaries … on the basis of (2) <ul><li>a bilingual corpus (3 000 000 words envisaged) consisting of </li></ul><ul><li>fiction </li></ul><ul><ul><li>Polish to Bulgarian (easy to find) </li></ul></ul><ul><ul><li>Bulgarian to Polish (hard to find) </li></ul></ul><ul><ul><li>third-language original, translated into Bulgarian and Polish </li></ul></ul><ul><li>EU/EC documents </li></ul><ul><li>texts in Bulgarian and Polish of similar sizes </li></ul><ul><ul><li>excerpts from newspapers </li></ul></ul><ul><ul><li>literary works available on the Internet </li></ul></ul>
  50. 51. Bulgarian –Polish dictionary (after OCR and proofreading) <ul><li>претовар|я, -иш vp. v. претоварям </li></ul><ul><li>претоп|я, -иш vp. v. претапям, претопявам </li></ul><ul><li>претопява|м, -ш vi . przetapiać; przen. asymilować </li></ul><ul><li>претор, -и т hist. pretor m </li></ul><ul><li>преториан|ец, -ци т pretorianin m </li></ul><ul><li>преториански adi. pretoriański </li></ul><ul><li>I преточ|а, -иш vp. v. npe такам </li></ul><ul><li>II преточ|а, -иш vp. v. II преточвам </li></ul><ul><li>I преточвам v. претакам </li></ul><ul><li>II преточва | м, -ш vi. ostrzyć nadmiernie </li></ul><ul><li>претрайва|м, -ш vi. v. npe трая </li></ul><ul><li>претра|я, -еш vp. lud. przetrwać </li></ul><ul><li>претрива|м, -ш vi. przecierać, przecinać, przepiłowywać; ~м праговете wycieram (obijam) cudze progi </li></ul><ul><li>претри|я, -еш vp. v. претривам </li></ul>
  51. 52. Bulgarian –Polish dictionary (after first round of markup) <ul><li>[b]претовар|я, -иш[/b] [i]vp.[/i] v. [b] претоварям[/b] </li></ul><ul><li>[b]претоп|я, -иш[/b] [i]vp.[/i] v. [b] претапям, претопявам[/b] </li></ul><ul><li>[b]претопява|м, -ш[/b] [i]vi .[/i] przetapiać; [i]przen.[/i] asymilować </li></ul><ul><li>[b]претор, -и[/b] [i]m[/i] [i] hist.[/i] [b]pretor[/b] [i]m[/i] </li></ul><ul><li>[b]преториан|ец, -ци[/b] [i]m[/i] pretorianin [i]m[/i] </li></ul><ul><li>[b]преториански[/b] [i]adi.[/i] pretoriański </li></ul><ul><li>[b] I преточ|а, -иш[/b] [i]vp.[/i] v. [b] пре такам[/b] </li></ul><ul><li>[b] II преточ|а, -иш[/b] [i]vp.[/i] v. [b] II преточвам[/b] </li></ul><ul><li>[b] I преточвам[/b] v. [b] претакам [/b] </li></ul><ul><li>[b]II преточва | м, -ш[/b] [i]vi.[/i] [b]ostrzyć nadmiernie [/b] </li></ul><ul><li>[b] претрайва|м, -ш[/b] [i]vi.[/i] v. [b] прет рая[/b] </li></ul><ul><li>[b]претра|я, -еш[/b] [i]vp.[/i] [i]lud.[/i] przetrwać </li></ul><ul><li>[b]претрива|м, -ш[/b] [i]vi.[/i] przecierać, przecinać, przepiłowywać; [b]~м праговете[/b] wycieram (obijam) cudze progi </li></ul><ul><li>[b]претри|я, -еш[/b] [i]vp.[/i] v. [b] претривам[/b] </li></ul>
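Once the first round of markup is in place, each line can be split into headword, labels and gloss, and the compact headword notation can be expanded. The sketch below makes two assumptions that the slides do not state: that `|` marks the end of the stem and that `-xx` after the comma is an ending attached to that stem. The function names are hypothetical.

```python
import re

BOLD = re.compile(r"\[b\](.*?)\[/b\]")
ITALIC = re.compile(r"\[i\](.*?)\[/i\]")
ANY_TAG = re.compile(r"\[/?[bi]\]")


def parse_line(line):
    """Split a marked-up line into (headword, labels, gloss)."""
    line = line.strip()
    match = BOLD.match(line)
    if match is None:
        raise ValueError("line does not start with a [b]headword[/b]")
    headword = match.group(1).strip()
    rest = line[match.end():]
    labels = [label.strip() for label in ITALIC.findall(rest)]
    # Drop the italic labels, strip remaining tags, normalise whitespace.
    gloss = " ".join(ANY_TAG.sub("", ITALIC.sub(" ", rest)).split())
    return headword, labels, gloss


def expand_headword(headword):
    """'претра|я, -еш' -> ['претрая', 'претраеш'] (assumed reading of the notation)."""
    first, *others = [part.strip() for part in headword.split(",")]
    stem, _, ending = first.partition("|")
    return [stem + ending] + [stem + other.lstrip("-") for other in others]


LINE = "[b]претра|я, -еш[/b] [i]vp.[/i] [i]lud.[/i] przetrwać"
HEADWORD, LABELS, GLOSS = parse_line(LINE)
```

Lines that are pure cross-references (`v. …`) or that contain spacing damage from the OCR stage would need extra handling before a parser of this kind could be trusted on the whole dictionary.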
  52. 53. Adding procedurality? <ul><li>по газва|м, -ш vi. deptać, brodzić (trochę) </li></ul><ul><li>по гор|я, -иш vp. popalić się (trochę, krótko) ; […] </li></ul><ul><li>по гъделичква|м, -ш vi. łaskotać, łechtać (trochę, lekko) </li></ul><ul><li>по гълта|м, -ш vp. łyknąć trochę </li></ul><ul><li>по гърмява|м, -ш vi. pogrzmiewać, grzmieć od czasu do czasu , […] </li></ul><ul><li>по дадва|м, -ш vi. lud. dawać po trochę, od czasu do czasu </li></ul>
  53. 54. Polyprefixation <ul><li>по за газ|я, -иш vp. zabrnąć, wpaść w ciężkie położenie (trochę) </li></ul><ul><li>по за гатн|а, -еш vp. napomknąć, wspomnieć mimochodem </li></ul><ul><li>по за гледа|м, -ш vp. spoglądnąć, spojrzeć, popatrzyć (trochę, od czasu do czasu) </li></ul><ul><li>по на тежава|м, -ш vi. stawać się trochę cięższym, ciążyć trochę </li></ul><ul><li>по на тисн|а, -еш vp. nacisnąć, przycisnąć trochę </li></ul><ul><li>по на товар|я, -иш vp. naładować trochę , obciążyć, obarczyć trochę </li></ul>
  54. 55. Adding procedurality? (continued) <ul><li>пре търкаля|м, -ш vp. przetoczyć, przesunąć tocząc </li></ul><ul><li>Likewise perhaps: </li></ul><ul><li>evaluatives </li></ul><ul><li>words for females </li></ul><ul><li>abstract nouns </li></ul><ul><li>… and other productive derivatives </li></ul>
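The idea of replacing stored entries for productive derivatives with a procedure can be sketched for the prefix по-. In this illustration an unlisted verb in по- is looked up by its base, and the base glosses are returned with the attenuative remark "(trochę)". The base entries are inferred from the по-на- examples above by dropping "trochę", which is an assumption, and the function name is hypothetical.

```python
# Assumed base entries, reconstructed from the derived entries on the slides.
BASE_DICTIONARY = {
    "натисна": ["nacisnąć", "przycisnąć"],
    "натоваря": ["naładować", "obciążyć", "obarczyć"],
}

ATTENUATIVE_PREFIX = "по"


def lookup(verb, dictionary=BASE_DICTIONARY):
    """Stored entries win; otherwise try to derive a gloss for a по- verb."""
    if verb in dictionary:
        return list(dictionary[verb])
    if verb.startswith(ATTENUATIVE_PREFIX):
        base = verb[len(ATTENUATIVE_PREFIX):]
        if base in dictionary:
            return [f"{gloss} (trochę)" for gloss in dictionary[base]]
    return []
```

Checking the stored entries first means that lexicalised verbs in по- with an unpredictable meaning can still be listed explicitly and will override the procedure.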
  55. 56. Applications of the electronic LDB <ul><li>lexicography: </li></ul><ul><ul><li>creation of electronic bilingual dictionaries for research and teaching </li></ul></ul><ul><ul><li>specialised reference works, e.g., valency dictionaries </li></ul></ul><ul><li>education: training skills of independent investigation with the help of the computer </li></ul>