LET ME INTRODUCE MYSELF! Xavier Blanco - [email_address] Autonomous University of Barcelona Laboratory of Phonetics, Lexicology and Semantics fLexSem My research areas: Lexicography Machine Translation My theoretical background: Lexicon-Grammar Meaning-Text Theory
MY SPEECH AT A GLANCE WHY MULTI-LEXEMIC UNITS? 4 TYPES OF MULTI-LEXEMIC UNITS CONSEQUENCES FOR TRANSLATION
Why are multi-lexemic units so important for MT? All machine translation software needs dictionaries (i.e. complete linguistic descriptions of its working languages, formalized with identical procedures and criteria and linked by means of translation equivalence relations). A natural question concerns the nature of the macrostructure elements (lemmas) in these dictionaries, i.e. the lexical units. Lexicon items are not always just words but very often sequences of words. Many technical terms have been coined to refer to these complex lexical items: compounds, collocations, idioms, frozen expressions...
Why are multi-lexemic units so important for MT? In fact, multi-lexemic units are important because they are linguistic signs, and linguistic signs are the natural unit of treatment for dictionaries.
What is the main problem? The main problem is that multi-lexemic units are (or seem) not as easy to identify and classify as mono-lexemic units. N.B.: this is true only of the segmentation (tokenization) question, not of the polysemy question.
WHAT IS A MULTI-LEXEMIC UNIT? A multi-lexemic unit is a sequence of word-forms P whose meaning cannot be built (by the general rules of the language L) from the meanings of the constituent lexemes of P, their semantically loaded morphological means (if any) and their combinatorial properties.
FIRST TYPE OF MULTI-LEXEMIC UNIT: “COMPOUND” UNITS In most electronic dictionaries lexical units are typically classified as nouns, verbs, adjectives and adverbs. In addition to simple forms, the languages we work with (Romance languages) have, for each of these categories, multi-lexemic items (i.e. sequences of word-forms separated by a blank or a hyphen). Let’s begin with compound nouns...
“ COMPOUND NOUNS” Examples of compound nouns are: hard drug high school black hole
“ COMPOUND NOUNS” <ul><li>A large number of compound nouns: </li></ul><ul><li>have simple variants or synonyms ( demócrata cristiano, demócrata-cristiano, demócratacristiano ). </li></ul><ul><li>can (sometimes must) be translated into simple nouns ( high school = lycée, instituto // escalera mecánica, escalier roulant = escalator ). </li></ul><ul><li>have acronyms as variants ( acquired immune deficiency syndrome = AIDS ). </li></ul>
“ COMPOUND NOUNS” Systematic descriptions of compound nouns have been proposed, ranging from just a few types to over 700. For most applications, it is probably enough to take into account only a dozen cases, such as: N-Adj, Adj-N ( Christian name // nombre de pila ) N-Prep-N ( quality of life // calidad de vida ) N-N ( family doctor // médico de cabecera ) Prep-N ( under age // menor de edad ) V-N ( washing machine // lavadora )
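As a minimal sketch of how such a pattern inventory might be used, the Python snippet below checks whether a POS-tagged sequence matches one of the patterns listed on this slide. The tag names (`N`, `Adj`, `Prep`, `V`) and the helper `matches_compound_pattern` are assumptions made for illustration; they are not taken from any particular tagger or dictionary.

```python
# Hypothetical inventory of compound-noun patterns, copied from the slide.
COMPOUND_NOUN_PATTERNS = {
    ("N", "Adj"),
    ("Adj", "N"),
    ("N", "Prep", "N"),
    ("N", "N"),
    ("Prep", "N"),
    ("V", "N"),
}


def matches_compound_pattern(tagged):
    """Return True if the tag sequence of `tagged` is a listed pattern.

    `tagged` is a list of (word_form, tag) pairs; the tags are assumed
    to be supplied by some upstream tagger.
    """
    tags = tuple(tag for _, tag in tagged)
    return tags in COMPOUND_NOUN_PATTERNS


candidate = [("calidad", "N"), ("de", "Prep"), ("vida", "N")]
print(matches_compound_pattern(candidate))
```

A match only makes the sequence a candidate. Whether it really is a compound still has to be recorded in the dictionary.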
“ COMPOUND NOUNS” It is often necessary to treat compound nouns as having a recursive structure: auditory canal = Adj-N => N external auditory canal = Adj-(Adj-N) => Adj-N => N
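The recursive reduction on this slide can be sketched as a small rewriting loop. This is only an illustration under stated assumptions: the rule table `RULES` holds the single rule Adj-N => N from the slide, and the function name `reduce_tags` and the right-to-left scanning strategy are choices made for this sketch.

```python
# Hypothetical rewrite rules: a pair of adjacent categories reduces to one.
RULES = {
    ("Adj", "N"): "N",
}


def reduce_tags(tags):
    """Apply the rules repeatedly and return every intermediate sequence.

    The scan goes from right to left so that Adj-(Adj-N) reduces the
    inner pair first, as in the bracketing on the slide.
    """
    current = list(tags)
    steps = [list(current)]
    changed = True
    while changed:
        changed = False
        for i in range(len(current) - 2, -1, -1):
            pair = (current[i], current[i + 1])
            if pair in RULES:
                current = current[:i] + [RULES[pair]] + current[i + 2:]
                steps.append(list(current))
                changed = True
                break
    return steps


# external auditory canal: Adj-Adj-N => Adj-N => N
for step in reduce_tags(["Adj", "Adj", "N"]):
    print("-".join(step))
```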
“ COMPOUND NOUNS” VERY IMPORTANT: The number of compound nouns has often been underestimated. Typically, even in large traditional dictionaries, only a very small percentage seems to have caught the lexicographers' attention. We think that, if we consider technical languages, a few million compounds occur in texts. Any serious dictionary project in this area must therefore be very large.
“ COMPOUND NOUNS” When we say that a compound noun is a linguistic unit, we mean that we are obliged to describe its form, its meaning and its combinatory properties INDEPENDENTLY of any properties that the forms it contains may have. A regular construction like a black sweater is reducible to a predication like black(sweater). But compounds like a black hole or a black box are not.
“ COMPOUND NOUNS” <ul><li>Compound nouns can be immune to a number of syntactic modifications that similar but regular constructions can undergo: </li></ul><ul><li>MODIFICATION: * a very black hole </li></ul><ul><li>PREDICATIVITY: * this hole is black </li></ul><ul><li>SYNONYMY: a black hole vs a black orifice, a black opening </li></ul><ul><li>COORDINATION: * a black and deep hole </li></ul><ul><li>DELETION: ...a black hole. This hole... </li></ul><ul><li>NOMINALIZATION: * the blackness of the hole </li></ul>
“ COMPOUND NOUNS” Clearly these tests have varying degrees of precision depending on the semantic opaqueness of the compound. It is nevertheless obvious that we need to list them in a dictionary. Not only high school , but also driving school or private school need to be associated with a specific meaning description. How else would we know that a driving school is not a school in which one takes courses while sitting in a car? Or that a private school is not a school for soldiers of that rank?
“ COMPOUND VERBS” He flogs a dead horse. I gave him a taste of his own medicine. I put my shoulder to the wheel.
“ COMPOUND VERBS” Compound verbs are typically verbs with frozen arguments (1, 2 or 3). More often than not their degree of semantic opacity is much higher than in the case of compound nouns. The bad news: they are more difficult to extract from corpora (e.g. insertions, inflectional patterns...). The good news: the number of elements of this class is considerably smaller than in the nominal counterpart. It is likely that they can easily be kept far below 100,000.
“ COMPOUND ADVERBS” on foot at your own risk in cold blood in spite of N Probably below 10,000, except for a few compound adverb schemas that are productive: from (9 a.m.) till (5 p.m.) Compound adjectives: out of order, de moda (fashionable)... Compound determiners: a lot of, a flurry of criticism...
SECOND TYPE OF MULTI-LEXEMIC UNIT: “COLLOCATIONS” Compounds need to be regarded as units in connection with almost every linguistic operation. They are macrostructural elements of the dictionaries and are typically translated as a whole, without any attempt to maintain either the internal structure or the meaning of their particular parts. By contrast, collocations involve 2 linguistic signs: the base of the collocation and the value of the collocation. We are going to discuss only two classes of collocations.
COLLOCATIONS: FROZEN MODIFIERS to condemn strongly, to endorse heartily, to laugh heartily, to laugh one’s head off... easy as pie, as 1-2-3... (smb) thin as a rake... it rains cats and dogs... heavy smoker, ______ liar admirer profondément ; aimer passionnément ; remercier chaleureusement ; surveiller étroitement ...
COLLOCATIONS: FROZEN MODIFIERS Translation is a good indicator of the frozen status of these constructions. The translation of these expressions must proceed by first identifying the type of modification and then reconstructing that modification in the target language on the basis of the translation of the base term: miedo cerval -> INT (miedo) = INT (fear) -> mortal fear * deer fear peur bleue -> INT (peur) = INT (fear) -> mortal fear * blue fear
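The two-step procedure on this slide (identify the type of modification, then rebuild it on the translated base) can be sketched as three dictionary lookups. Everything below is a toy under stated assumptions: the table names, the language codes and the function `translate_collocation` are hypothetical, the entries contain only the examples from the slide, and the modifier is simply placed before the English base.

```python
# Hypothetical bilingual entries: translation of the base term.
BASE_TRANSLATIONS = {
    ("es", "miedo"): "fear",
    ("fr", "peur"): "fear",
}

# Hypothetical source-side coding: (language, base, modifier) -> semantic value.
SOURCE_COLLOCATIONS = {
    ("es", "miedo", "cerval"): "INT",
    ("fr", "peur", "bleue"): "INT",
}

# Hypothetical target-side coding: (base, semantic value) -> frozen modifier.
TARGET_COLLOCATIONS = {
    ("fear", "INT"): "mortal",
}


def translate_collocation(lang, base, modifier):
    """Translate a base + frozen modifier via its semantic value."""
    # Step 1: identify the type of modification in the source language.
    value = SOURCE_COLLOCATIONS[(lang, base, modifier)]
    # Step 2: translate the base, then look up how the target language
    # expresses that value for the translated base.
    target_base = BASE_TRANSLATIONS[(lang, base)]
    target_modifier = TARGET_COLLOCATIONS[(target_base, value)]
    return f"{target_modifier} {target_base}"


print(translate_collocation("es", "miedo", "cerval"))
print(translate_collocation("fr", "peur", "bleue"))
```

The modifier itself (cerval, bleue) is never translated word for word, which is how the sketch avoids * deer fear and * blue fear.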
COLLOCATIONS: FROZEN MODIFIERS There seems to be a restricted number of meanings that are likely to function as values of collocations. Examples of such semantic values are intensity, anti-intensity, praise and anti-praise. Collocations are not so difficult to understand (for a human being), but they are difficult to produce (for a non-native speaker).
COLLOCATIONS: FROZEN MODIFIERS These modifiers need to be coded for each lexical unit separately. Not every lexical unit will have instantiations for every semantic value of a possible frozen modifier, and some lexical units will have more than one modifier for a given semantic value. These frozen modifiers range from highly idiosyncratic ones to almost regular ones, but they always need to be explicitly coded.
COLLOCATIONS: SUPPORT VERBS The main predicate of a sentence can be realized not just by verbs, but also by nouns, adjectives and prepositions. In the latter cases, an additional lexical element, called support verb , is usually associated with the real semantic predicate to form the predicational basis of the simple sentence. Particularly for nouns, these support verbs cannot always be predicted just from the nature of the main predicate.
COLLOCATIONS: SUPPORT VERBS to play a role to give advice to take a look at to do someone a favor to put a question “The man who makes no mistakes does not usually make anything”
COLLOCATIONS: SUPPORT VERBS the war broke out I keep my calm to reduce to despair to raise hope in to draw smb's attention to
COLLOCATIONS: SUPPORT VERBS to fulfil a promise to answer a question to follow advice his dream came true
COLLOCATIONS: SUPPORT VERBS The artillery _________ a heavy bombardment over the town. The artillery __________ the town to a heavy bombardment. The town _________ a heavy bombardment (of the artillery). A heavy bombardment (of the artillery) ________ over the town.
THIRD TYPE OF MULTI-LEXEMIC UNIT: “FROZEN SENTENCES” Proverbs: A bird in the hand is worth two in the bush; A rolling stone gathers no moss. Pragmatemes: Staff only, Can I help you? N.B.: Frozen sentences often undergo variations, which can involve creative mechanisms for defreezing the ordinary accepted patterns.
FOURTH TYPE OF MULTI-LEXEMIC UNIT: “GRAMMATICAL UNITS” Empirical Grammatical Expressions : has been, could have been, may have been / either... or / if... then... Theoretical Grammatical Expressions: <Adj_colour> <clothes>... but blue jeans! <Noun_animate> drink <beverages>
Here I should present the calculus of Grammatical Meanings, but... zzz PISS: Powerpoint Induced Sleep Syndrome
FINAL OVERVIEW <ul><li>Let a linguistic sign be an ordered triple </li></ul><ul><li>A = <‘A’, /A/, ∑A> where: </li></ul><ul><ul><li>‘A’ is the signified of A </li></ul></ul><ul><ul><li>/A/ is the signifier (signifiant) of A </li></ul></ul><ul><ul><li>∑A is the set of combinatory properties of A </li></ul></ul><ul><ul><li>Basic types of linguistic signs are: morphs, modifications, conversions, supramorphs, word-forms, phrasemes and syntagms. </li></ul></ul>
FINAL OVERVIEW Free sequences: AB = <‘A ⊕ B’; /A ⊕ B/; ∑A ∪ ∑B> Full idioms: AB = <‘C’; /A ⊕ B/> | ‘A’ ⊄ ‘C’ & ‘B’ ⊄ ‘C’ Quasi-idioms: AB = <‘A ⊕ B ⊕ C’; /A ⊕ B/> | ‘C’ ≠ ‘A’ & ‘C’ ≠ ‘B’ Semi-idioms: AB = <‘A ⊕ C’; /A ⊕ B/> The signified of the semi-idiom includes, intact, the signified of one of its two constituents. A is chosen by the speaker strictly because of its signified, but B is used to express ‘C’ contingent on A; otherwise B would not be the signifier of ‘C’.
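The four-way typology on this slide can be illustrated with a toy classifier. This is a sketch under strong assumptions, not the formal calculus: meanings are modelled as plain Python sets of semantic components, "includes" is read as set inclusion, and the function name `classify_phrase` and the component labels are hypothetical. Treating black hole as a full idiom, and the gloss given to it, are also assumptions of the example.

```python
def classify_phrase(meaning_a, meaning_b, meaning_ab):
    """Classify a two-constituent expression AB.

    Each argument is a set of semantic components (a toy stand-in
    for the signified of A, of B and of the whole expression).
    """
    has_a = meaning_a <= meaning_ab  # meaning of A is kept intact
    has_b = meaning_b <= meaning_ab  # meaning of B is kept intact
    extra = meaning_ab - (meaning_a | meaning_b)  # anything beyond A and B
    if has_a and has_b:
        return "quasi-idiom" if extra else "free sequence"
    if has_a or has_b:
        return "semi-idiom"
    return "full idiom"


# black sweater: the meaning is just the sum of the parts.
print(classify_phrase({"black"}, {"sweater"}, {"black", "sweater"}))
# black hole (assumed gloss): neither constituent meaning survives.
print(classify_phrase({"black"}, {"hole"}, {"collapsed-star region"}))
# heavy smoker: 'smoker' survives, 'heavy' expresses intensity.
print(classify_phrase({"heavy"}, {"smoker"}, {"smoker", "intense"}))
# Abstract case with both meanings plus an unpredictable addition C.
print(classify_phrase({"A"}, {"B"}, {"A", "B", "C"}))
```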
CONCLUSIONS <ul><li>Tests with large collections of texts at the LADL and the CIS have shown that at least one-third of any natural language corpus must be analyzed in terms of multi-lexemic units. </li></ul><ul><li>The characteristics of multi-lexemic units are such that there is no alternative to the lexicographic solution. </li></ul><ul><li>The availability of large-scale multi-lexemic dictionaries will significantly improve the quality of machine translation systems. </li></ul>