Corpus Linguistics and Lexicography *
WOLFGANG TEUBERT
INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS Vol. 6 (special issue), 2001. 125–153. John Benjamins Publishing Co.

Corpus Linguistics: More Than a Slogan?

During the last decade, it has been common practice among the linguistic community in Europe, both on the continent and in the British Isles, to use corpus linguistics to verify the results of classical linguistics. In North America, however, the situation is different. There, the Philadelphia-based Linguistic Data Consortium, responsible for the dissemination of language resources, addresses the commercially oriented market of language engineering rather than academic research, the latter often being more interested in universal grammar or semantic universals than in the idiosyncrasies of natural languages. American corpus linguists such as Doug Biber or Nancy Ide, and general linguists who are corpus users by conviction such as Charles Fillmore, are almost better known in Europe than in the United States, which is all the more astonishing given that the first real corpus in the modern sense, the Brown Corpus, was compiled in Providence, R.I., during the sixties.

Meanwhile, European corpus linguistics is gradually becoming a subdiscipline in its own right. Unfortunately, during the last few years, this has led to a slight bias towards ‘self-centred’ issues such as the problems of corpus compilation, encoding, annotation and validation, the procedures needed for transforming raw corpus data into artificial intelligence applications and automatic language processing software, not to mention the problem of standardisation with regard to form and content (cf. the long-term project EAGLES [Expert Advisory Group on Language Engineering Standards] and
the transatlantic TEI [Text Encoding Initiative]). Today, these issues often tend to prevail over the original gain that the analysis of corpora may contribute to our knowledge of language. But it was exactly this corpus-specific knowledge that the first generation of European corpus linguists such as Sture Allén, Vladislav Andrushenko, Stig Johansson, Ferenc Kiefer, Bernard Quemada, Helmut Schnelle, or John Sinclair had in mind. In West Germany, the Institut für Deutsche Sprache was among the first institutions that considered the collection of corpus data as one of their permanent tasks; its corpora date back to the late sixties, although at that time most corpus data was still only used for the verification of research results gained from traditional methods. But has today’s corpus linguistics really advanced from there?

The recent textbooks claiming to provide an introduction to corpus linguistics still do not add up to more than a dozen, all of them in English. Unfortunately, except for the commendable books of Stubbs 1996 and Biber, Conrad and Reppen 1998, they do deplorably little to establish corpus linguistics as a linguistic discipline in its own right. Instead, they focus on the use of corpora and corpus analysis in traditional linguistics (syntax, lexicology, stylistics, diachrony, variety research) and applied linguistics (language teaching, translation, language technology). Corpus Linguistics by Tony McEnery and Andrew Wilson (McEnery and Wilson 1996) may serve as an example of this kind.
Forty pages describe the aspects of encoding; 20 pages deal with quantitative analysis; 25 pages describe the usefulness of corpus data for computational linguistics; another 30 pages cover the use of corpora in speech, lexicology, grammar, semantics, pragmatics, discourse analysis, sociolinguistics, stylistics, language teaching, diachrony, dialectology, language variation studies, psycholinguistics, cultural anthropology and social psychology; and the final 20 pages contain a case study on sublanguages and closure. McEnery and Wilson’s book reflects the current state of corpus linguistics. In fact, it more or less corresponds to the topics covered at the annual meetings held by the venerable ICAME, an association dealing with English language corpora (cf. Renouf 1998). Semantics is mainly left aside.

Surprisingly, when judged by their commercial value, it is not the written language corpora that are most successful, but rather speech corpora, which command the highest prices. Speech corpora are special collections of carefully selected text samples (words, phrases, sentences) spoken by numerous different speakers under various acoustic conditions. They caused the final
breakthrough in automatic speech recognition that computer models based on cognitive linguistics had failed to achieve for many years. The recognition of speech patterns was only made possible by a combination of categorial and probabilistic approaches towards a connectionist model trained on large speech corpora. Speech analysis can thus be seen as an early impetus for the establishment of corpus linguistics as an independent discipline with its own theoretical background.

Lexicography is the second major field where corpus linguistics not only introduced new methods but also extended the entire scope of research, albeit without putting much emphasis on the theoretical aspects of corpus-based lexicography. Here again, it was John Sinclair who led the way as initiator of the first strictly corpus-based dictionary of general language (COBUILD 1987). Britain was also the site of the first corpus-based collocation dictionaries (such as Kjellmer 1994). Bilingual lexicography may also benefit from a corpus-oriented approach, a fact that is evident when comparing the traditional Le Robert & Collins English-French Dictionary edited by B.T.S. Atkins with Valerie Grundy and Marie-Hélène Corréard’s Oxford-Hachette Dictionary, which covers the same language pair. Here, the use of (monolingual) corpora led to a remarkably greater number of multi-word translation units (collocations, set phrases) and to context profiles written with the target language in mind. Wörter und Wortgebrauch in Ost und West [Words and Word Usage in East and West Germany] (1992) by Manfred W. Hellmann may serve as the only German example of that era, using the corpus for lemma selection rather than semantic description. Only recently, in 1997, did a true corpus-based dictionary appear: Schlüsselwörter der Wendezeit [Keywords during German Unification] by Dieter Herberg, Doris Steffens and Elke Tellenbach.

Thus, at least in the field of written language, corpus linguistics is still in its infancy as a discipline with its own theoretical background, a statement which holds true not only for Germany but also for most other European countries. In this orientation phase, where corpus linguistics is still in the process of defining its position, most publications are in English, the language that has become the interlingua of the modern world. But this does not mean that corpus linguistics is dominated mainly by English and American scholars, as can be clearly seen when browsing through any issue of the International Journal of Corpus Linguistics. Still, German linguistics appears somewhat underrepresented in this discussion. One exception is Hans Jürgen Heringer.
His innovative study on ‘distributive semantics’ shows a growing reception of the programme for corpus linguistics which is outlined below. In his book Das höchste der Gefühle [The most sublime of feelings] (Heringer 1999), he describes the validation of semantic cohesion between adjacent words on the basis of larger corpora. Above all, it is this area between lexis and syntax where corpus linguistics offers new insights.

Corpus Linguistics: A Programme

Corpus linguistics believes in structuralism as defined by John R. Firth; therefore, it insists on the notion that language as a research object can only be observed in the form of written or spoken texts. Neither language-independent cognition nor propositional logic can provide information on the nature of natural languages. For these are, as stated in an apophthegm by Mario Wandruszka, characterised by a mixture of analogy and anomaly. The quest for a universal structure of grammar and lexicon, which is typical of the followers of Chomsky or Lakoff, cannot meet the demands of these two aspects.¹ Instead, corpus linguistics is closer to the semantic concept inherent in the continental European structuralism of Ferdinand de Saussure, which regards meaning as inseparable from form, that is, from the word, the phrase, the text. In this theory, meaning does not exist per se. Corpus linguistics rejects the ubiquitous concept of meaning as ‘pure information,’ encoded into language by the sender and decoded by the receiver. Corpus linguistics holds instead that content cannot be separated from form; rather, they constitute the two aspects under which texts can be analysed. The word, the phrase, the text is both form and meaning.

The above statement clearly outlines the programme of corpus linguistics. It is mainly interested in those phenomena on the fringe between syntax and lexicon, the two subjects of classical linguistics.
It deals with the patterns and structures of semantic cohesion between text elements that are interpreted as compounds, multi-word units, collocations and set phrases. In these phenomena, the importance of the context for the meaning becomes evident. Corpus linguistics extends our knowledge of language by combining three different approaches: the (procedural) identification of language data by categorial analysis, the correlation of language data by statistical methods
and finally the (intellectual) interpretation of the results. Whilst the first two steps should be automated as far as possible, the last step requires human intentionality, as any interpretation is an act involving consciousness and therefore cannot be converted into an algorithmic procedure. This is the main difference between corpus linguistics and computational linguistics, which reduces language to a set of procedures.

Corpus linguistics assumes that language is a social phenomenon, to be observed and described above all in accessible empirical data, that is, in communication acts. Corpora are cross-sections through a universe of discourse which incorporates virtually all communication acts of any selected language community, be it monolingual (e.g., German or English), bilingual (e.g., South Tyrolean, Welsh) or multilingual (e.g., Western European). However, the majority of texts that are preserved and made accessible through corpora in principle only have a limited life-span: most printed texts such as newspaper texts are out of public reach within a very short time.

If we consider language as a social phenomenon, we do not know, and do not want to know, what is going on in the minds of the people, how the speaker or the hearer understands the words, sentences and texts that they speak or hear. Language as a social phenomenon manifests itself only in texts that can be observed, recorded, described and analysed. Most texts happen to be communication acts, that is, interactions between members of a language community. An ideal universe of discourse would be the sum of all communication acts ever uttered by members of a language community. Therefore, it has an inherent diachronic dimension. Of course, this ideal universe of discourse would be far too large for linguistics to explore in its entirety. It would have to be broken down into cross-sections with regard to the phenomena that we want to describe.
There is no such thing as a ‘one-size-fits-all’ corpus. It is the responsibility of the linguist to limit the scope of the universe of discourse in such a way that it may be reduced to a manageable corpus, by means of parameters such as language (sociolect, terminology, jargon), time, region, situation, external and internal textual characteristics, to mention just a few.

When looking at language as a social phenomenon, we assume that meaning is expressed in texts. What a text element or text segment means is the result of negotiation among the members of a language community, and these negotiations are also part of the discourse. Thus, the language community sets the conventions on the formal correctness of sentences and on
their meaning. Those conventions are both implicit and dynamic; they are not engraved in stone like commandments. Any communication act may utilise syntactic structures in a new way, create new collocations, introduce new words or redefine existing ones. If those modifications are used in a sufficient number of other communication acts or texts, they may well result in the modification or amendment of an existing convention. One basic difference between natural and formal languages is the fact that natural language not only permits but actually integrates metalinguistic statements without explicitly marking the metalinguistic level. There is no separation between object language and metalanguage. Any convention may be discussed, questioned or even rejected in a text. Above all, discourses deal with meaning, and it is corpus linguistics that is best suited to deal with this dynamic aspect of meaning.

We, as linguists, have no access to the cognitive encoding of the conventions of a language community. We only know what is expressed in texts. Dictionaries, grammars, and language textbooks are also texts; therefore, they are part of the universe of discourse. As long as they represent socially accepted standards, we have to consider their special status. Still, their contents are neither comprehensive nor always based on factual evidence. Corpus linguistics, on the other hand, aims to reveal the conventions of a certain language community on the basis of a relevant corpus. In a corpus, words are embedded in their context. Corpus linguistics is, therefore, especially suited to describe gradual changes in meaning: it is the context which determines the concrete meaning in most areas of the vocabulary.

Cognitive Linguistics, Logical Semantics and Corpus Linguistics

People normally (if they are not linguists, that is) listen to or read texts because of their meaning.
They are interested in the syntactic features of phrases, sentences or texts only insofar as is necessary for understanding them. Meaning is the core feature of natural language, and this is the reason why semantics is the central linguistic discipline. Still, regardless of the enormous progress that phonology, syntax and many other disciplines have made, when it comes to explaining and describing the meaning of phrases, sentences, and texts, we are far from a consensus.

As said above, corpus linguistics regards language as a social phenomenon. This implies a strict division between meaning and understanding. Is it really the task of linguistics to investigate how the speaker and the listener understand the words, sentences or texts that they utter or perceive? Understanding is a psychological, a mental, or, in modern terms, a cognitive phenomenon. This is why no bond exists between cognitive linguistics and corpus linguistics. Language as a social phenomenon is laid down in texts and only there. If we, as corpus linguists, wish to find out how a text is understood, we have to ask the listeners for paraphrases; these paraphrases, being texts themselves, again become part of the discourse and can become the object of linguistic analysis.

The difference between cognitive linguistics and corpus linguistics lies in how each deals with the unique property of language to signify. Any text element is inevitably both form (expression) and meaning. If you delete the form, the meaning is deleted as well. There is no meaning without form, without an expression. Text elements and segments are symbols, and being symbols, linguistic signs, they can be analysed in principle under two aspects: the form aspect or the meaning aspect. The consequence of this stance is that the only way to express the meaning of a text element or a text segment is to interpret it, that is, to paraphrase it. This is the stance of hermeneutic philosophy, as opposed to analytic philosophy (cf. Keller 1995, Jäger).

In cognitive linguistics, which is embedded in analytic philosophy, meaning and understanding are seen as one. Here, text elements and text segments correspond to conceptual representations on the mental level. Within this system, however, it is not clear what the term ‘representation’ means.
Does it refer to content linked with a form (what we could call presentations) or does it refer to pure content disconnected from form (what we could call ideations)? This ambiguity is of vast consequence (Janik and Toulmin 1973: 133), as presentations themselves are signs, that is, symbols, and thus need to be understood, that is, interpreted. Cognitive linguistics, however, does not tell us how this is to happen. Rather, it describes the manipulation of mental representations as a process (whereas an interpretation is an act, presupposing intentionality). Processes themselves are meaningless. It is only the act of interpretation that assigns meaning to them. Both Daniel Dennett and John Searle point out this aporia of the cognitive approach. In their opinion, the mental processes would again require a central meaner (Dennett 1998: 287f.) or homunculus (Searle 1992: 212f.) on a level higher than cognition, that is,
for understanding mental representations, and the same would then apply to that level, too, and so on, ad infinitum.

On the other hand, if we translate ‘representation’ as ‘ideation,’ we dismiss the assumption of the symbolic character of language. The meaning of a word, a sentence or a text would then correspond to something immaterial, something without form, formulated in a so-called ‘mental language,’ whose elements would consist of either complex or atomistic concepts, depending on whether one refers to Anna Wierzbicka and the early Jerry Fodor (Wierzbicka 1996, Fodor 1975) or to the later Jerry Fodor (Fodor 1998). By and large, these concepts of cognitive linguistics seem to correspond to words, but the difference lies in the fact that they are not material symbols which call for interpretation; instead, they are pure astral ideation, not contaminated by any form (cf. Teubert 1999).

In practice, particularly in artificial intelligence and automatic translation, this cognitive approach has failed. Alan Melby gave a plausible explanation of why it was bound to fail no matter which formal language had been defined for encoding the conceptual representations: “The real problem could be that the language-independent universal sememes we were looking for do not exist ... [O]ur approach to word senses was dead wrong.” (Melby 1995: 48.)

It seems that the idea behind cognitive linguistics is the transduction or translation of phrases, sentences and texts in natural language, that is, of symbolic units, into an obviously language-independent ‘language of thought’ or ‘mentalese,’ which is non-symbolic and is exclusively defined by syntax. This transduction or translation is seen as a process and does not involve intentionality. Cognitive linguistics is committed to the computational model of mind.
According to this theory, mental representations are seen as structures consisting of what are called uninterpreted symbols, while mental processes are caused by the manipulation of these representations according to rule-based, that is, exclusively syntactic, algorithms. But does it really make sense to use the term ‘symbols’ for these mental representation units, just as we call words ‘linguistic signs’? On a cognitive (or computational) level, those entities are only symbols inasmuch as a content can be assigned to them from outside the mental (or computational) calculus. This content or meaning, however, does not affect the permissibility of manipulations with regard to their representation. The content of a text consisting of linguistic signs, on the other hand, is something inherent to the text itself (and not assigned from
the outside), a feature we can and must investigate if we want to make sense of a text. As Rudi Keller has pointed out, the symbols of natural language are suitable for and in need of interpretation (Keller 1995).

What appeals to many researchers of semantics is the fact that in cognitive semantics the meaning of a text is expressed through a calculation whose expressions are based exclusively on syntactic rules, or in other words, that semantics is transformed into syntax. They take it for granted that this is possible, as they claim that both natural and formal languages work with symbols. But in natural language, these symbols need to be interpreted, whereas symbols in formal languages work without being assigned a certain (external) definition. Whether a formal language, a calculus, permits a certain permutation of symbols or not has nothing to do with the meaning or the definition of these symbols; it is just a question of syntax. As early as 1847, George Boole stated: “Those who are acquainted with the present state of the theory of Symbolic Algebra, are aware that the validity of the processes of analysis does not depend upon the interpretation of the symbols which are employed, but solely upon the laws of their combination.” Richard Montague also believes in the possibility of describing natural language semantics in the same way as formal language semantics: “There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians; indeed, I consider it possible to comprehend the syntax and semantics of both kinds of languages within a single natural and mathematically precise theory. On this point I differ from a number of philosophers, but agree, I believe, with Chomsky and his associates.” (Both quotes from Devlin 1997: 73 and 117.)

From the point of view of corpus linguistics, the meaning of natural language symbols, of text elements or text segments, is negotiated by the discourse participants and can be found in the paraphrases they offer, and it is contained in language usage, that is, in context patterns. Natural language symbols refer not so much to language-external facts; rather, they create semantic links to other language signs. The meaning of a text segment is the history of the use of its constituents.

Linguistic signs always require interpretation. Whoever understands a text is able to interpret it. This interpretation can be communicated as a text in itself, a paraphrase of the original text. The act of interpretation requires intentionality and therefore cannot be reduced to a rule-based, algorithmic, ‘mathematically precise’ procedure. If we see language as a social phenomenon, natural language semantics can leave aside the mental or cognitive level. Everything that can be said about the meaning of words, phrases or sentences will be found in the discourse. Anything that cannot be paraphrased in natural language has nothing to do with meaning. In a nutshell, this is the core programme that distinguishes corpus linguistics from cognitive linguistics.

Collocation and Meaning

In traditional linguistics, it is rather difficult to pinpoint the difference between a collocation such as harte Auseinandersetzung (hefty discussion) and a free combination such as harte Matratze (hard mattress). In corpus linguistics, on the other hand, it is possible to trace the awareness among the members of a language community of a distinct semantic cohesion between the lexical elements of a collocation by statistical means, that is, by detecting a significant co-occurrence of these elements within a sufficiently large corpus. Before it was possible to process large amounts of language data procedurally and systematically, syntactic rules had been the only way to describe the complex behaviour of co-occurrence between textual elements (i.e., words). Such rules describe the relation between different classes of elements, for instance, between nouns and modifying adjectives. Still, syntactic descriptions such as ‘Adjective + Noun’ are not specific enough to detect collocations as distinct types of semantic relationships. Traditional lexicology fails to come up with a feasible definition for collocations that would allow their automatic identification in a corpus. To classify certain co-occurring textual elements as semantic units, that is, as collocations, it is necessary to recognise these text segments as recurrent phenomena, which is only possible within a sufficiently large corpus. Therefore, we must complement the intratextual perspective with its intertextual counterpart.
By applying probabilistic methods, it is possible to measure recurrence within a virtual universe of discourse, or more precisely, within a real corpus. Collocation dictionaries in the strict sense are always corpus-based. Even so, the speaker’s competence is still needed to check statistically determined collocation candidates for their relevant semantic cohesion. The following case study aims to illustrate the potential of the corpus linguistic approach.
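
The idea of measuring recurrence probabilistically can be made concrete with a small sketch. It is not the procedure used for the IDS dictionary, whose statistics are not specified here; it assumes one common significance measure, the log-likelihood ratio over a 2x2 contingency table, and the adjective-noun counts are invented purely for illustration. The function names `log_likelihood` and `rank_collocates` are likewise hypothetical.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G2) for a 2x2 table of co-occurrence counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def rank_collocates(pairs, adjective):
    """Rank the nouns seen with `adjective` by significance, not by frequency."""
    pair_counts = Counter(pairs)
    adj_counts = Counter(adj for adj, _ in pairs)
    noun_counts = Counter(noun for _, noun in pairs)
    total = len(pairs)
    scored = []
    for (adj, noun), k11 in pair_counts.items():
        if adj != adjective:
            continue
        k12 = adj_counts[adj] - k11    # adjective with other nouns
        k21 = noun_counts[noun] - k11  # noun with other adjectives
        k22 = total - k11 - k12 - k21  # all remaining pairs
        expected = adj_counts[adj] * noun_counts[noun] / total
        if k11 > expected:  # keep only pairs more frequent than chance predicts
            scored.append((noun, k11, log_likelihood(k11, k12, k21, k22)))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Invented toy counts (assumption): each tuple is one adjective + noun occurrence.
pairs = (
    [("hart", "Kern")] * 6 + [("hart", "Arbeit")] * 8
    + [("schwer", "Arbeit")] * 20 + [("gut", "Arbeit")] * 20
    + [("klein", "Kern")] * 1 + [("gut", "Idee")] * 30
    + [("schwer", "Last")] * 15
)
ranking = rank_collocates(pairs, "hart")
for noun, frequency, score in ranking:
    print(f"{noun}: frequency={frequency}, G2={score:.2f}")
```

In this toy data, Kern is ranked above Arbeit although it occurs less often with hart, because Arbeit also combines freely with many other adjectives. The output is only a list of candidates; whether a candidate shows real semantic cohesion remains a matter for the competent speaker.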

Case study 1: hart as collocator

The collocation dictionary Kollokationswörterbuch Adjektive mit ihren Begleitsubstantiven (Teubert, Kervio-Berthou and Windisch [in preparation]), which is currently being compiled at the Institut für Deutsche Sprache, is based on the IDS corpora of about 320 million words. The 400 adjectives were mainly selected from basic vocabulary lists. Candidates for collocations were combinations of adjectives and nouns showing a significantly higher frequency than the frequency expected on the basis of the occurrence of the relevant single words. The occurrences are ranked according to significance; their overall frequency thus has no direct influence on the ranking. The concept for the statistical procedures applied here was designed by Cyril Belica. It is up to the competent speaker to decide whether a sufficient lexical cohesion can be seen in the collocation candidates detected by the computer. Manually selected citations are provided in order to facilitate this interpretation. If a collocation candidate is translated into a foreign language as a whole rather than word by word, this can be seen as evidence of a distinct semantic cohesion; therefore, we have added the English translation equivalents to our German examples. The example below covers ranks 1-10 [for an explanation of the abbreviations see http://www.ids-mannheim.de/kt/cosmas.html]:

Kern   Rank: 1   Frequency: 63
   WKB In der Treuhand selbst hat sich ein harter Kern aus früheren SED-Betonköpfen eingegraben.
   WKB Dennoch enthalte der Bericht einen “harten Kern an Wahrheit.”
   H68 Die “Kommandoebene,” der harte Kern der RAF, umfaßt 25 bis 30 Mitglieder.
   H87 Der “harte Kern” umfaßt 187 Personen.
   H87 [...] ein sicherer Hinweis, daß sich die Betreffenden dem harten Kern der RAF angeschlossen haben.
   H87 140 eingeschriebene Soulmänner kamen regelmäßig, ein harter Kern von 50 Jugendlichen fast täglich.
   (Engl.: diehards / hard core)

Arbeit   Rank: 2   Frequency: 94
   WKD In harter Arbeit haben wir unseren Staat aufgebaut. (Überschrift)
   WKD Aber wir haben eben in dieser harten Arbeit alle noch ein bißchen zu lernen.
   H85 [...] Risikobereitschaft und harte Arbeit sollen sich in Malaysia wieder lohnen.
   H86 Mangelnde persönliche Ausstrahlung machte er durch harte Arbeit, eiserne Disziplin und Willensstärke wett.
   WKD Ein Sommer härtester Arbeit steht bevor.
   H85 Die Technik macht es möglich, den Menschen von harter und übermäßiger Arbeit auch zeitlich zu entlasten.
   (Engl.: hard work)

Währung   Rank: 3   Frequency: 40
   WKB Die Deutschen würden nicht nur durch eine harte Währung vereint.
   WKB Harte Währung soll mangelnden Geist wettmachen.
   WKD Doch wundersam ist die Umwandlung der Ostmark in harte Währung allemal.
   H87 Dann wäre es endgültig vorbei mit dem Glanz der einst härtesten Währung der Welt.
   (Engl.: hard currency)

Schlag   Rank: 4   Frequency: 24
   BZK Das war für ihn ein harter Schlag.
   MK1 Ich habe eine junge Mannschaft, die einen harten Schlag verkraften kann, ohne zu zerbrechen.
   MK2 Es war ein harter, gezielter Schlag, der mich prompt von den Beinen holte.
   (Engl.: heavy blow)

Drogen   Rank: 5   Frequency: 20
   H88 Außerdem sei ein immer stärker werdender Trend zu harten Drogen zu beobachten.
   H87 Kontakt zu harten Drogen hatte der Jugendliche bald bekommen [...]
   (Engl.: hard drugs)

Kritik   Rank: 6   Frequency: 34
   H86 Aber sie erfuhren schon damals von vielen Seiten harte Kritik.
   MK2 Harte Kritik am Biedenkopf-Plan. (Überschrift)
   H88 Zugleich übte er harte Kritik an der Landesregierung [...]
   (Engl.: harsh criticism)

Bandagen   Rank: 7   Frequency: 12
   H86 Beide Seiten schlagen derweil mit harten Bandagen zu [...]
   WKB Der Kampf um Berlin als Hauptstadt wird mit harten Bandagen geführt. (Überschrift)
   (Engl.: taking one’s gloves off)

Kampf   Rank: 8   Frequency: 30
   MK1 Amerika müsse notfalls auf einen langen harten Kampf vorbereitet sein.
   H86 Die meisten sehen zu, daß sie im harten Kampf um die Zehntel für sich das Beste rausholen.
   BZK Verkaufsförderung gewinnt immer mehr Bedeutung im harten Kampf um die Gunst der Verbraucher.
   WKD Für sie geht es jetzt nicht einfach um einen harten Kampf um Arbeitsplätze.
   (Engl.: close fight)

D-Mark   Rank: 9   Frequency: 22
   WKB Dann bekämen die DDR-Bürger harte D-Mark in die Hand und würden drübenbleiben.
   WKB Nichts hat Vormarsch und Endsieg der harten D-Mark aufhalten können.
   WKD Die harte D-Mark dient als Schmiedehammer.
   (Engl.: strong Deutschmark)

Worte   Rank: 10   Frequency: 25
   H85 Harte Worte - Berliner Verhältnisse?
   H86 Selbst Außenminister Shultz benutzte harte Worte.
   H85 [... der] erste Vorsitzende der Gesellschaft, findet nicht minder harte Worte, um den Bruch zu begründen [...]
   (Engl.: bitter words)

Discourse and Meaning

One of corpus linguistics’ most essential tenets is the assumption that the meaning of text elements and segments can be found solely in discourse. This assumption makes sense if we call to mind that in principle, every word or combination of words was once a neologism. Neologisms are introduced
to the discourse by explicitly assigning certain meanings to new expressions, that is, by paraphrasing what a new word is supposed to mean. As stated above, we can determine meaning in two ways: by paraphrase and by usage. Neologisms, however, still lack usage. They only become established once other participants of the discourse start using them, either by accepting the proposed meaning or by negotiating the meaning by offering a new paraphrase. This also applies to those cases where a new meaning is assigned to an already existing word.

It is obvious that we cannot go ‘back to the roots’ for all our established vocabulary; also, this is not how children learn the meaning of words. But even so, it is not simply the usage of words that leads to their meaning. In most cases, an act of explanation, very often by the parents, but sometimes also through picture-books, sets the starting point for language acquisition. Obviously, deictic references to reality (or images thereof) are of the highest importance, but they are not understood without narrative explanations of words that describe what we have to watch out for in reality (or in images of reality). The meaning of school, for instance, cannot be explained by pictures of the building, classroom, teachers or pupils. In fact, only very few words relate to images unambiguously. Picture-book texts play a more important role with regard to the acquisition of word meanings than dictionaries.

Since the times of the German lexicographers Adelung and Campe, the basic principle of German lexicography has been the assumption that the meaning of words can be found in text samples, a basic principle also for corpus linguistics. Nevertheless, corpus linguistics differs from traditional lexicography in various details. Firstly, corpus linguistics does not use corpora merely for examples: it explores them systematically. Secondly, corpus linguistics does not try to decontextualise the objects it describes.
In other words, it does not abstract the meaning from the context. Thirdly, corpus linguistics tries to capture different usages in their correlation to different contexts, unlike traditional lexicography, which tries to position word meanings upon a blueprint of a language-independent ontological concept (for instance, by genus proximum and differentia specifica). Fourthly, corpus linguistics is less interested in the single text element or word than in the semantic interaction between text elements and context.

The following case study of Globalisierung [globalisation] aims to demonstrate that it is indeed the discourse (or in other words: our corpus) where information about the meaning of words can be found. The reason why
we all seem to know the meaning of Globalisierung as it is currently used is the fact that we have all read those texts that explain Globalisierung. We cannot depict Globalisierung, any more than we can point at it. In its current use, Globalisierung is certainly a neologism. It is characteristic of the introductory phase of new words that the first citations show a large number of paraphrases, a fact that demonstrates the role of the discourse participants in negotiating meaning.

Case study 2: Globalisierung

Globalisierung (Engl.: globalisation) as a non-lexicalised derivation has long been part of our vocabulary. Its semantic vagueness is indicative of its non-lexicalised status. As nomen actionis or nomen resultativum, it has long been nothing more than the nominalisation of globalisieren. The presence of descriptive attributes is indicative of its lack of semantic specification; metalingual indicators (like paraphrases), on the other hand, are almost totally absent. The following examples were found in the German daily Tageszeitung:

    Die Vorstellung [...] der Globalisierung der Kleistschen Verzückung [...] scheint mir denn doch eher märchenhaft. [14.10.89]

    Aber die Globalisierung von Politik, Ökonomie und Technologie dulde keinen partikularen Bezugspunkte mehr [...] [05.06.92]

    Mit der Globalisierung der Lebensweise der modernen Zivilisation geht die Selbstaufhebung der [...] Ideale und Grundüberzeugungen einher. [25.02.95]

As a neologism, Globalisierung manages to almost completely displace the original, non-lexicalised derivation only as late as 1996. Suddenly, there is a distinct rise in frequency: whereas we have only about 160 citations from 1988 to the end of 1995, there are about 320 citations for 1996 alone. Also, most citations come without descriptive attributes: apparently, it is no longer necessary to explain what is being globalised.
Finally, many citations show metalingual indicators (printed in italics below) that demonstrate how the discourse participants take part in assigning a meaning to the word, as in the following examples:

    Die “Globalisierung” – ein etwas unscharfer Begriff, mit dem zugleich die Ausweitung des Handels, die Liberalisierung der Finanzmärkte, der Sieg der
    Freiheitsideologie, die unkontrollierte Macht der multinationalen Unternehmen, die Internationalisierung des Arbeitsmarktes und die Umstrukturierung der Volkswirtschaften gemeint ist – hat die Gewerkschaften weiter geschwächt. [12.01.96]

    Verbissener Konkurrenzkampf im Inneren und nach außen hin eine maximale Öffnung für Kapitel, Güter und Dienstleistungen. So lautet eine der möglichen Definitionen der Globalisierung. [12.01.96]

    [...] die Globalisierung, das heißt die vollständige Liberalisierung aller Märkte auf der Welt [...] [10.05.96]

    Lisa Maza [...] sieht die Globalisierung völlig anders: Sie sei eine Fortsetzung der Kolonialisierung mit anderen Mitteln – zum Nachteil des Südens, der Armen und der Frauen. [08.06.96]

    Stichwort Globalisierung: In einer globalen Wirtschaft wird es auf Dauer kein geschütztes Umfeld für die Wirtschaft irgendeines Landes mehr geben. [27.07.96]

    Globalisierung bedeutet auch die Europäisierung des Globus, Kolonialismus, ökonomischer und ökologischer Imperialismus. [04.05.96]

    Denn in der Tat bedeutet Globalisierung Amerikanisierung, und zwar nicht nur der Weltwirtschaft, sondern auch eine normative Amerikanisierung. [11.10.96]

    Das Stichwort “Maastricht” und das Modewort “Globalisierung” sind zu Synonymen für sozialen Rückschritt geworden. [18.10.96]

    Typischerweise schweigen die Intellektuellen in Deutschland beharrlich zu Europa, Globalisierung und Zukunft der Arbeit [...] [13.12.96]

This is a brief list of comparable English citations taken from the Bank of English and shortened:

    What does globalisation mean? The term can happily accommodate all manner of things: expanding international trade, the growth of multinational business, the rise in international joint ventures and increasing interdependence through capital flows.

    Globalisation: Low wages in other countries contribute to low wages in the United States.
    Words like globalisation and outsourcing are now in common use.

    Watkins sees globalisation as a euphemism for a race to maximise profit by lowering workers’ pay and condition.

    As Mr. Keegan says, globalisation means that tax cuts for business are crucial.

    Globalisation represents an attempt to exploit South Korea’s enormous potential.

    But doesn’t globalisation mean world-wide sameness?

    Globalisation is still more a philosophy than a business reality.

    Globalisation comes in many flavours.

More so than other words, neologisms show that the meaning of words is to be found in the texts rather than in some discourse-external reality. The citations, be it in their virtual entirety within the universe of discourse or in some cross-section in a real corpus, are the meaning, and we may understand this meaning by interpreting the citations.

The formulation of a dictionary entry for globalisation, however, is the responsibility of lexicography, not of corpus linguistics, whose main task, apart from finding the references, would instead be the correlation (by systematic context analysis) of the various sets of paraphrases and usage patterns with different parameters such as text type (newspaper), genre (politics/society), ideological stance and so on. Particularly in the area of ideologically controversial keywords, it seems that a useful selection of citations can be more helpful to the user than traditional definitions.

Linguistic Knowledge and Encyclopaedic Knowledge

Corpus linguistics aims to analyse the meaning of words within texts, or rather, within their individual context. First and foremost, words are text elements, not lexicon or dictionary entries. Corpus linguistics is interested in text segments whose elements exhibit an inherent semantic cohesion which can be made visible through quantitative analyses of discourse or corpus (Biber, Conrad and Reppen 1998).

If the research focus is shifted from single words to text segments, the distinction between linguistic and encyclopaedic knowledge gradually becomes fuzzy. The word Machtergreifung (seizure of power), outside its context, may be described as an incident where a certain group, previously excluded from political influence, seizes power by its own force and without democratic legitimation.
However, we will interpret text segments such as braune Machtergreifung or die Machtergreifung im Jahre 1933 as referring to the ‘seizure of power by the Nazis’ without hesitation. Is this because these texts refer to an extralingual reality, to a language-independent
knowledge? Although the majority of linguists would agree with this assumption, there may well be another, simpler explanation: we have learned from a large number of citations that whenever braune Machtergreifung or Machtergreifung im Jahre 1933 is mentioned, this refers to the seizure of power by the Nazis and to nothing else. There is a co-occurrence between both expressions that may result, for instance, in an anaphoric situation: the expressions are paraphrases of each other.

In the tradition of German lexicography, linguistic knowledge is separated from encyclopaedic knowledge by the process of decontextualisation, by the endeavour to describe the meaning of words unadulterated by the context in which they occur. If we detach all references from their relevant context, the isolated meaning remains. The different events of Machtergreifung that are dealt with in texts are viewed as references to a discourse-external reality. Corpus linguistics, on the other hand, is interested above all in the meaning of textual segments displaying a distinct semantic cohesion. Machtergreifung im Jahre 1933 is such a segment, and by projecting it upon our discourse (i.e., linguistic) knowledge, we are able to interpret it as ‘Nazi seizure of power’ without any problem. If we are no longer limited to single words detached from their contexts and if we do away with decontextualisation, we can give up the distinction between linguistic and encyclopaedic knowledge. For what we normally call encyclopaedic knowledge is in fact nothing but discourse knowledge. Everything we know and are able to know about the Nazi seizure of power is based on texts. Although some may even have witnessed one relevant incident or another, their ability to interpret the whole course of events as Machtergreifung is also based on texts from other persons. If we reduce encyclopaedic knowledge to discourse knowledge, the distinction disappears.
Let us take a look at the example klassische Rollenverteilung (traditional role allocation) (Spiegel 13, 1999: 128):

    Ein Zuhause wie ein Bilderbuchideal. Hier [...] ist die klassische Rollenverteilung die Regel: Ein Elternteil kümmert sich um Haushalt und Kindererziehung, der andere verdient das Geld. Auch dieser traditionellen Familienvorstellung entspricht das Leben im Reihenhaus.

    [A home like a picture-book cliché. Here [...] the traditional role allocation is still the rule: one parent takes care of the household and of bringing up the children, the other parent earns the family income. Also living in a terraced house contributes to this traditional image of family.]
Within the context of family/home, the meaning of the collocation klassische Rollenverteilung in the above example corresponds exactly to the sentence that may serve as a definition: Ein Elternteil... [One parent...]. Note the subtle subversive touch that is present here, characteristic of so many Spiegel articles: what seems to be a generally acceptable definition actually deviates from the traditional meaning of klassische Rollenverteilung in one essential respect, in that it does not distinguish between male and female.

The above example aptly illustrates the challenges and achievements of corpus linguistics. Firstly, it is not interested in the meaning of isolated words outside their relevant contexts, but instead in the meaning of semantically connected text segments, extracted from discourse or, in practice, from the corpus. In the context of home and family, klassische Rollenverteilung can be interpreted in different ways with regard to period and genre. If the above Spiegel definition becomes the accepted one, we may apply the term klassische Rollenverteilung even to gay or lesbian partnerships. For corpus linguistics, this implies a dynamic view of meaning. Every new reference may add to the meaning of a certain text segment; older meanings may fall into oblivion if they are not sanctioned by new evidence. The above example also shows that the ways in which meaning can be negotiated within the language community can be controversial indeed. It is not so long ago that lesbian partnership and family were two different meanings that could not be imagined, let alone used, synonymously. Corpus linguistics may thus serve as a useful instrument to detect changes of meaning that are essential to neology.
Secondly, corpus linguistics is developing devices for the identification and extraction of potentially metalinguistic elements of citations, that is, of text elements co-occurring with a paraphrase, thus enabling the automatic extraction, processing and presentation of semantically relevant material from corpora. Phrases such as something is the rule; x means y; this is to say; we understand it as; it can be said etc. point to metalingual content. If the meaning of a semantically controversial textual segment is negotiated, we often find indicators such as: some time ago; in fact; strictly speaking; without doubt; wrongly etc. These indicators can give us important clues. Above all, it must be realised that just as the meaning of a text segment is a paraphrase found in earlier citations, people’s interpretations are also paraphrases and therefore part of the discourse. In principle, the meaning of a text element or a text segment is everything that has been said about it, in terms of a paraphrase or as a matter of usage; it is the result of the
negotiation of the meaning within the discourse community. Indeed, this is the difference between natural language words and technical terms. Technical terms are defined by experts, and their meaning is restricted to that definition (and thus is discourse-external). For instance, if a tree meets the criteria for elm-trees listed in the expert’s definition, it is rightly called an elm-tree no matter what the citations say. Any terminological definition is, at least in principle, an algorithmic instruction for the usage of the relevant term. This explains why it is possible to automatically translate technical texts, provided they are monosemous and only use specialist vocabulary. Lexicographic definitions, on the other hand, are interpretations of citations, that is, results of intentional acts. They cannot be derived automatically from corpus citations, because every citation can be interpreted in various ways. Therefore, an automatic translation of general language texts is not feasible.

Thirdly, corpus linguistics uses the context to distinguish between usages. For example, the collocation klassische Rollenverteilung is not only found in the family context but also at work or in society in general. Its meaning differs according to the context.

Fourthly, corpus linguistics is interested in larger units of meaning, namely, in text segments. The traditional lexicographic practice of decontextualisation and isolation of single words prevents us from knowing the meaning of larger units such as klassische Rollenverteilung. As a rule, the meaning of text segments such as multi-word units, collocations or set phrases is far more specific than that of single words. The reason why traditional linguistics focuses on the single word, isolated from its context, can only be explained by space constraints in the past, as it is impossible to list all collocations and set phrases even in a dictionary consisting of several volumes.
But is klassische Rollenverteilung really a true collocation? Is corpus linguistics really able to provide a credible validation of semantic cohesion? Is the co-occurrence klassische Rollenverteilung more than a mere addition of klassisch and Rollenverteilung? In a sufficiently large corpus, if the frequency of klassische Rollenverteilung differs significantly from the statistically expected frequency of this combination, this can be seen as one sign of a possible collocation. Another sign would be the occurrence of a special meaning that cannot be derived from the sum of the individual meanings of the text elements. For instance, if we find six tokens of klassische Rollenverteilung within the corpus although we would only expect three, given
the frequency of the constituents, and if they all suggest that one parent is the wage-earner whereas the other is bringing up the children, then we may regard this co-occurrence as a collocation.

Finally, corpus linguistics considers meaning as a feature of language, of text elements, segments, and texts, and not as an external feature existing only in the human mind or in reality. The meaning of klassische Rollenverteilung in the context of family is represented in texts, and only there; it is not the reflection of a non-textual external reality that we could point our fingers at. There is no meaning outside language, outside the discourse. We know what globalisation means today because we have read the texts that explain it, but we cannot see globalisation.

Multilingual Corpus Linguistics

When translating a text into another language, we paraphrase the source text. The translation represents the meaning of the original text just like a paraphrase within the source language. Translation requires understanding and thus intentionality. Only if we understand a text can we interpret or even paraphrase it. This implies that different translations will yield different versions of the same text, which again shows that translation or paraphrasing cannot be reduced to algorithmic procedures.

The universe of discourse, containing all texts ever translated along with their translations, is the empirical base for multilingual corpus linguistics. It is a virtual universe, and it can be realised by multilingual parallel corpora (or a collection of bilingual parallel corpora). Parallel corpora consist of source texts along with their translations into other languages, whereas reciprocal parallel corpora contain the source texts in two languages along with their translations into the target languages.

Just as in monolingual corpus linguistics, meaning is also seen as a strictly linguistic (or better, textual) term here. Meaning is paraphrase.
The entire meaning of a text segment within a multilingual universe of discourse is enclosed in the history of all translation equivalents of the segment.

The translation unit, that is, the text segment completely represented by the translation equivalent, is the base unit of multilingual corpus semantics. Translation units, consisting of a single word or of several words, are the minimal units of translation. If they consist of several words, they are translated as a whole and not word by word. Therefore, translation equivalents correspond to the text segments of monolingual corpus linguistics.

Within the framework of multilingual corpus linguistics, we assume that the meaning of translation units is contained in their translation equivalents in other languages. This corresponds to the basic assumption of corpus linguistics, which does not regard semantic cohesion as something fixed but as belonging to a large spectrum reaching from unalterable units to text segments whose elements can be varied, expanded or omitted. Identifying these translation units (or text segments) again involves interpretation. The translation shows us whether a given co-occurrence of words is a single translation equivalent or a combination of them, that is, merely a chain of text elements. This leads to two consequences. What can be seen as an integral translation equivalent in one target language may be a simple word-by-word translation in another. This may even be the case within a single target language, depending on the stylistic preferences of different translators. In fact, it is the community of translators (along with the translation critics) who in their daily practice decide what is the translation equivalent, just as the monolingual language community decides what is a text segment.

The definition of a translation unit therefore depends both on the target language and on the common practice of translation. A virtual text segment is a translation unit only with respect to those languages into which it is translated as a whole. Translation units and their equivalents are not metaphysical entities; they are the contingent results of translation acts. According to the analysis of parallel corpora, more than half of the translation units are larger than the single word, another example of how corpus linguistics may help to investigate the nature of text segments.
The meaning of a translation unit is its paraphrase, that is, the translation equivalent in the target language. For ambiguous translation units, this implies that there are as many meanings to the unit as there are non-synonymous translation equivalents. If the phenomenon of meaning is thus operationalised, the meaning of a translation unit depends on the selected target language. A given translation unit in language A may have two non-synonymous equivalents in language B, but three non-synonymous equivalents in language C.

Let us look at an example. The English word sorrow (a translation unit consisting only of a single word) will usually be translated into French by one of the three equivalents chagrin, peine or tristesse; the first two, chagrin and peine, are obviously synonymous in a variety of contexts. They both
point at a cause for this emotion and, therefore, are sometimes interchangeable with deuil (‘loss,’ the term for the cause). Tristesse, on the other hand, is the variety of sorrow which is not caused by a special incident. In German, there are also three standard equivalents for sorrow, namely, Trauer (caused by loss), Kummer (caused by an adverse incident, intense and usually limited in duration) and finally Gram (caused by unhappiness resulting from an incident, not very intense, more a disposition than a feeling, but often of long duration). Those three German equivalents are neither synonymous with nor corresponding to the three French equivalents. By the way, the different senses of sorrow usually found in English monolingual dictionaries and thesauri correspond to neither the French nor the German distinctions.

The above example of sorrow shows that the concept of synonymy cannot be expressed in an algorithm. To call two expressions synonymous requires a prior understanding of their meaning, that is, an act of interpretation. For instance, if we look at how the Greek verb proséuchomai in the first sentence of Plato’s Republic is translated into English, we will find five different equivalents in eight different translations of this book: to make my prayers, to say a prayer, to offer up my prayers, to worship, to pay my devoirs and to pay my devotions. We, as human beings, must decide whether we consider the Greek verb ambiguous or just fuzzy and whether the relevant equivalents can be seen as synonyms. This is something computers cannot do.
The example also shows that the concept of synonymy can only be applied locally, referring to translation equivalents or text segments within a defined context. Although we may assume that Plato’s contemporary audience considered the verb proséuchomai as unambiguous within the above context, this is not the case with native speakers of English, for whom there is no synonymy between to make my prayers and to pay my devotions. It can be clearly seen that meaning has a dynamic quality and also that the act of translation requires intention and thus cannot be reduced to a mere procedure. We will never find the correct German equivalent for sorrow or the correct English equivalent for proséuchomai just by defining formal instructions for a machine. Before we can translate texts and their elements, we must understand them.
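
The operationalisation used in this section, where a translation unit has as many meanings as it has non-synonymous equivalents in a given target language, can be illustrated with a minimal Python sketch. The function name count_meanings and the data layout are assumptions made for illustration. The synonym groupings are entered by hand, since deciding on synonymy is exactly the interpretive step that, according to the argument above, cannot be automated; treating chagrin and peine as a single group also simplifies the observation that they are synonymous only in a variety of contexts.

```python
# Hand-made synonym groupings of the equivalents of "sorrow" per target
# language. These are human judgements taken from the discussion above,
# not something the code computes.
EQUIVALENTS = {
    "fr": [{"chagrin", "peine"}, {"tristesse"}],
    "de": [{"Trauer"}, {"Kummer"}, {"Gram"}],
}


def count_meanings(groups):
    """Return the number of non-synonymous groups of equivalents."""
    return len(groups)


meanings_per_language = {
    lang: count_meanings(groups) for lang, groups in EQUIVALENTS.items()
}
print(meanings_per_language)
```

Under these hand-made groupings, the same English unit counts as having a different number of meanings depending on the target language chosen.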
Multilingual Corpus Linguistics in Practice

Neither a lexicon derived from a bilingual dictionary nor the supposedly language-neutral conceptual ontologies applied within Artificial Intelligence will solve the problem of machine translation of general language texts. By now, this fact is acknowledged by the experts. Therefore, they focus on the machine translation of texts written in a controlled documentation language, which is a more or less formal language in which all technical terms are defined unambiguously, along with a syntax that rejects all ambiguous expressions as non-grammatical.

General language texts written in natural languages cannot be translated without interpretation. Here, multilingual corpus linguistics steers clear of this obstacle in an elegant way. Unlike disciplines such as Artificial Intelligence and Machine Translation, which are based on cognitive linguistics, it does not try to model and emulate mental processes, but instead tries to support the translator by processing parallel corpora. These corpora contain the practice of previous human translation. In them, those translation equivalents that are proven to be reliable and accepted will, in the long run, outweigh equivalents that have been dismissed as inadequate. If, for instance, proséuchomai is translated as to make my prayers three times out of eight, it may well be assumed that it is an accepted (albeit not the ideal) equivalent within the given context.

Parallel corpora are translation repositories. They link translation units with their equivalents. As first studies have shown (Steyer and Teubert 1998), we may assume that 90 percent of all translation units along with their relevant equivalents may be found in a carefully compiled corpus of about 20 million words per language, provided that the text to be translated is sufficiently close to the corpus with regard to text type and genre.
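
The idea that accepted equivalents outweigh dismissed ones in a translation repository can be made concrete by counting how often each equivalent is chosen. The following Python sketch is only an illustration: equivalent_shares is a hypothetical helper, and the list of eight renderings is constructed so that to make my prayers occurs three times out of eight, as in the example above; the distribution of the remaining five renderings is an assumption.

```python
from collections import Counter


def equivalent_shares(renderings):
    """Map each translation equivalent to its relative frequency,
    ordered from most to least frequent."""
    counts = Counter(renderings)
    total = sum(counts.values())
    return {equiv: n / total for equiv, n in counts.most_common()}


# Assumed distribution over eight translations (illustration only).
renderings = ["to make my prayers"] * 3 + [
    "to say a prayer",
    "to offer up my prayers",
    "to worship",
    "to pay my devoirs",
    "to pay my devotions",
]

shares = equivalent_shares(renderings)
for equiv, share in shares.items():
    print(f"{equiv}: {share:.3f}")
```

A relative frequency of this kind says only which equivalent translators have preferred so far; whether it fits a new context remains a matter for the translator.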
Multilingual corpus linguistics does not pretend to solve the problem of machine translation of general language. But it may help the human translator to find a suitable equivalent for the unit to be translated more efficiently than traditional bilingual dictionaries, because it includes the context even in those cases where the translation equivalent is not a syntagmatically defined collocation but a certain textual element within a sequence. The goal is to select from among all given elements the one whose contextual profile is closest to that of the textual segment to be translated.
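
The selection step described here, choosing the equivalent whose contextual profile is closest to the context of the segment to be translated, can be sketched in Python. Everything specific in the sketch is an assumption: the profiles are tiny invented English word sets rather than corpus output, closeness is measured with the Jaccard coefficient (the text does not name a measure), and closest_equivalent is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_equivalent(context_words, profiles):
    """Return (equivalent, score) for the profile most similar to the context."""
    context = set(context_words)
    scores = {eq: jaccard(context, profile) for eq, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented toy profiles, for illustration only (not derived from a corpus).
profiles = {
    "Trauer": {"anger", "fear", "death", "pain", "despair"},
    "Kummer": {"worries", "stress", "trouble", "pain"},
    "Gram": {"bitterness", "hate", "shame"},
}

context = ["job", "full", "stress", "challenges"]
best, score = closest_equivalent(context, profiles)
print(best, round(score, 3))
```

If the context shares no word with any profile, all scores are zero and the returned equivalent carries no information, so a sketch like this can at most support the translator's choice.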
Case study 3: The translation into German of sorrow and grief

For the two words sorrow and grief, we find three common non-synonymous German translation equivalents: Trauer, Kummer and Gram. An analysis of the contexts of all references of these German words as found in the IDS corpora, based on a method designed by Cyril Belica (see http://www.ids-mannheim.de/cgi-bin/idsforms/cosmas-www-client), gives us the context profiles listed below. In our example, the number of neighbouring words (i.e. span) has been restricted to 5 words on each side. The context profiles given below have been slightly edited for the sake of clarity.

    Context profile for Trauer: Wut, Angst, Betroffenheit, Schmerz, Tod, Bestürzung, Freude, Hoffnung, Verzweiflung, Scham; tragen, empfinden; tief, groß-

    Context profile for Kummer: Sorgen, Schmerz, Leid, Seele, Freude, Stress, Ärger, Not; bereiten, machen, gewohnt/gewöhnt sein; viel, groß-

    Context profile for Gram: Leid, Hass, Bitterkeit, Scham; sterben; gebeugt, lauter, voll-

In an English-German parallel corpus we would distinguish between three translations for sorrow and grief: the first group would contain those cases where sorrow or grief is translated by Trauer; the second group those where it is translated by Kummer; and finally, the third group those where it is translated by Gram. For each of the above cases, we could compute a context profile similar to the ones quoted above for the German words from the IDS corpus. We may assume that the context profile for sorrow and grief, as taken from the parallel corpus, in the case of the translation equivalent Kummer, will not differ much from the context profile for Kummer extracted from the German reference corpus, apart from its being in English instead of German.

Unfortunately, a sufficiently large English-German parallel corpus that would allow the extraction of English context profiles for German translation equivalents on the basis of recurrence is not yet available.
As an alternative, I have searched the Bank of English for those instances of sorrow and grief whose contexts are similar to our context profiles for Trauer, Kummer and Gram. So far, these results are not thoroughly convincing: one reason is the different composition of the IDS corpora compared to the Bank of English, which results in a clear imbalance of the German and English instances with regard to text type and genre; also, the search criteria for the
English contexts have been too narrow; and last but not least, sorrow and grief along with their German counterparts Trauer, Kummer, and Gram belong to an area of vocabulary which is highly culture-specific and is almost impossible to reduce to a common denominator.

Still, the following instances taken from the Bank of English show that, in practice, the approach for the detection of equivalents outlined above will function to some extent. The words in square brackets are the German equivalents of the context words contained within the context profiles.

    (1) Trauer
    So on the night of the crucifixion I place Simon in the home in Bethany of Mary called Magdalene and her sister Maria. I envision a scene in which trauma, grief, anger [Wut], and despair [Bestürzung] were all present, to say nothing of fear [Angst].

    (2) Kummer
    She enjoys her job though it is full of stress [Stress], sorrow and never-ending challenges.

    (3) Gram
    The terrible affliction [Leid] that has fallen so suddenly upon our unhappy country fills and monopolises my thoughts. My soul is full of grief and bitterness [Bitterkeit] and hate [Hass] and vengeance.

Although matching the context of the element to be translated against the context profiles of all possible equivalents may suggest a method for the automatic selection of suitable equivalents, this only works in those cases where we have clear selection-relevant contextual information at our disposal. As stated above, this is not always the case, especially if the text element to be translated refers to earlier instances within the same text. In these cases, we may assume that, provided the intratextual continuity is sufficiently high, the text element (sorrow or grief in our example) can always be translated by the same equivalent in the target language, be it Trauer, Kummer or Gram.
In most cases, whenever a word with a fuzzy, strongly context-dependent meaning appears in a text for the first time, the information needed for the specification of its meaning will be found within the context. Later instances of the word within the text often tend to omit this information as redundant. Within a text, we must find one or two references where a
suitable translation equivalent is indicated by the context profile and apply the result to the other instances. This shows that it is imperative to include only complete texts in the corpus.

Future Prospects

Corpus linguistics sees itself not in opposition to but as a complement to traditional linguistics. Corpus linguistics helps to make us aware not only of the interaction between text element and context but also of text segments, that is, larger, flexible units whose elements are semantically linked in a certain way: multi-word units, collocations, set phrases. It explains the repeated co-occurrence of text elements as a discourse phenomenon that can be explored by statistical means, and it makes those co-occurrence patterns visible by a combination of quantitative and categorial devices.

The investigation of the context enables us to better cope with words displaying fuzzy meanings, words of the ‘Thespian vocabulary,’ as John Sinclair called them (Sinclair 1996), by generating context profiles as presented above on the basis of sufficiently large corpora. Especially when these context profiles are combined with those citations containing a paraphrase of the meaning or aspects thereof (cf. our case study of globalisation), this may lead to descriptions of meaning enabling the user to participate in the discourse.

Corpus linguistics distinguishes between text segments on the one hand and text elements embedded in context on the other, depending on how they can be described. Context profiles are only statistically defined. Within a context profile, there is no such thing as an obligatory element that is indispensable within the context of a citation. The lexical constituents of text segments, however, can be defined either as indispensable or as optional. But there is still another difference between the text element with its context profile and the text segment: the latter is defined not only on a lexical but also on a syntactic level.
The collocation Kummer gewöhnt ceases to be a collocation as soon as the verb gewöhnt sein is replaced by gewöhnen: Er hatte sich an seinen Kummer gewöhnt is not a collocation. The same applies to collocations such as geheimer Kummer, Kummer bereiten, Kummer und Sorgen. If we change the syntagma or even just the word order (for example, into Sorgen und Kummer), the words lose their collocational character.
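
The contrast between a statistically defined context profile and a lexically and syntactically defined text segment can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the window size, the naive whitespace tokenisation and the three toy sentences are invented and are not taken from the corpora discussed in this article. The profile simply counts whatever occurs near the keyword, so no single neighbour is obligatory; the segment matcher requires the exact word sequence, so a reversed word order no longer counts.

```python
from collections import Counter


def tokenize(sentence):
    # Naive whitespace tokenisation, lower-cased; real corpus work needs more care.
    return sentence.lower().split()


def context_profile(sentences, keyword, window=3):
    """Count words within `window` tokens of `keyword` (hypothetical helper)."""
    profile = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                profile.update(left + right)
    return profile


def contains_segment(sentence, segment):
    """True only if the words of `segment` occur contiguously and in the given order."""
    tokens = tokenize(sentence)
    target = tokenize(segment)
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


# Invented toy sentences, for illustration only.
toy_corpus = [
    "er hatte Kummer und Sorgen",
    "sie kannte Sorgen und Kummer",
    "das bereitet ihm Kummer",
]

profile = context_profile(toy_corpus, "kummer")
matches = [contains_segment(s, "Kummer und Sorgen") for s in toy_corpus]
```

In this toy setting the profile treats Sorgen as a frequent neighbour of Kummer regardless of order, whereas the segment test accepts only the first sentence. A realistic procedure would also have to handle inflection and optional constituents, which this sketch ignores.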
During the last decades, we have witnessed a growing interest in semantic cohesion, in the special semantic relations between words within sentences and phrases, even in traditional linguistics. Among the relatively new concepts are lexical solidarities, collocations, set phrases, valency, case roles, semantic frames and scripts. They all try to demonstrate that language is more than just the assembling of context-free words using semantics-free rules. The co-occurrence patterns developed by corpus linguistics may help to clarify heuristically the concept of text segments defined by semantic cohesion.

When it comes to the identification of text segments, multilingual corpus linguistics holds a privileged position. Within monolingual corpora, this identification is a gruelling task that can only be turned into an automatic procedure by a painstaking combination of various procedures based on frequencies, lists or rules. The use of parallel corpora makes it easier to identify text segments (as translation units or equivalents), as they are the true practical results of interpretation and paraphrase. They show what usually takes place within the minds of the speakers without leaving any traces in texts. Parallel corpora, therefore, provide direct access to the translation practice of human translators. If we assume that we may find the meaning of a textual element through its paraphrase, which is also a text, then we may describe parallel corpora as repositories for such paraphrases. Obviously, dictionaries also attempt to list those paraphrases. However, since their size is limited, they need to decontextualise and isolate the lexical units, whereas the paraphrases of translators display the text elements embedded within their contexts, along with whole text segments. Parallel corpus evidence helps us to trace the phenomenon of semantic cohesion.
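
How a parallel corpus might yield candidate translation equivalents can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this article: the Dice coefficient is merely one common association measure chosen here for simplicity, the function name is hypothetical, the sentence pairs are invented, and the sketch works on single words, whereas real translation units are often whole text segments.

```python
from collections import Counter


def dice_candidates(aligned_pairs, source_word):
    """Rank target words by Dice coefficient with `source_word` (hypothetical helper).

    `aligned_pairs` is a list of (source_sentence, target_sentence) tuples that
    are assumed to be sentence-aligned already.
    """
    source_hits = 0          # aligned pairs whose source side contains the word
    target_freq = Counter()  # aligned pairs whose target side contains each word
    joint_freq = Counter()   # aligned pairs containing both
    for source_sentence, target_sentence in aligned_pairs:
        source_tokens = set(source_sentence.lower().split())
        target_tokens = set(target_sentence.lower().split())
        target_freq.update(target_tokens)
        if source_word in source_tokens:
            source_hits += 1
            joint_freq.update(target_tokens)
    scores = {
        word: 2 * joint / (source_hits + target_freq[word])
        for word, joint in joint_freq.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented, sentence-aligned toy pairs (German / English), for illustration only.
toy_pairs = [
    ("der Kummer war groß", "the grief was great"),
    ("sein Kummer blieb", "his grief remained"),
    ("der Hund war groß", "the dog was great"),
]

ranked = dice_candidates(toy_pairs, "kummer")
```

Here the word that co-occurs with Kummer in every aligned pair, and nowhere else, is ranked first. With real data the ranking would only propose candidates that still have to be checked against the citations themselves.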
Meanwhile, with the availability of large corpora and improved software for their exploration, corpus linguistics has become part of general lexicography. Linguistics is gradually becoming more interested in larger units of meaning and the use of context for their definition. Also, it is generally accepted that the next generation of dictionaries, both monolingual and bilingual, needs to be corpus-validated, if not entirely corpus-based. But there is more to the corpus linguistic approach. By interactive procedures, the ambitious user should be able to have direct access to corpus evidence instead of being confronted with the subjective findings provided by lexicographers. Such a corpus platform would allow the members of the language community
to participate in the social activity of negotiating meanings in a committed and informed way.

Notes

* This contribution is a revised version of my article ‘Korpuslinguistik und Lexikographie’ in Deutsche Sprache 4/99, pp. 292–313, translated into English by Norbert Volz.

1. The rules that those followers of a universal grammar hope to find in their quest for the language organ are not based on deductions of analogy. Whereas rules based on innateness had been the central factor in Chomskyan language theory until recently (cf. Steven Pinker in The Language Instinct [Pinker 1994]), Pinker now sees the language faculty as an interaction between ‘distinct mental mechanisms’ which is not yet fully explored, namely, ‘symbolic computation’ [i.e., the algorithmic processing of uninterpreted symbols] as opposed to ‘memory’ [i.e., recollection], the latter being responsible for the assignment of form and meaning of symbols (Pinker 1999). The memory is seen as partly associative; an appropriate term for its description could be ‘connectionist network’. However, Pinker still sees ‘symbolic computation’ as a strictly rule-based process. We may assume that this tentative change in attitude towards the language faculty and the extent of its genetic embedding might be partly due to Terrence W. Deacon’s convincing explanation of first language acquisition which does without any language organ (Deacon 1997).

References

Biber, Douglas; Conrad, Susan; Reppen, Randi. 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Collins COBUILD. 1987. English Language Dictionary. Editor in Chief: John Sinclair.
Deacon, Terrence W. 1997. The Symbolic Species. New York: Norton.
Dennett, Daniel C. 1998. “Reflections on Language and Mind.” In: Peter Carruthers/Jill Boucher (Eds.): Language and Thought. Interdisciplinary Themes. Cambridge: Cambridge University Press, 284–294.
Devlin, Keith. 1997. Goodbye, Descartes.
New York: Wiley.
Fodor, Jerry A. 1975. The Language of Thought. New York: Crowell.
Fodor, Jerry A. 1998. Concepts. Where Cognitive Science Went Wrong. Oxford: Clarendon Press.
Hellmann, Manfred W. 1992. Wörter und Wortgebrauch in Ost und West. Vol. 1–3. Tübingen: Narr.
Herberg, Dieter; Steffens, Doris; Tellenbach, Elke. 1997. Schlüsselwörter der Wendezeit. Wörter-Buch zum öffentlichen Sprachgebrauch 1989/90. Berlin: Walter de Gruyter.
Heringer, Hans Jürgen. 1999. Das höchste der Gefühle. Empirische Studien zur distributiven Semantik. Tübingen: Stauffenburg Verlag.
Jäger, Ludwig. 2000. “Die Sprachvergessenheit der Medientheorie. Ein Plädoyer für das Medium Sprache.” In: Werner Kallmeyer (Ed.): Sprache und neue Medien. Jahrbuch 1999 des Instituts für Deutsche Sprache. Berlin/New York: de Gruyter, 9–30.
Janik, Allan; Toulmin, Stephen. 1973. Wittgenstein’s Vienna. New York: Simon & Schuster.
Keller, Rudi. 1995. Zeichentheorie. Tübingen: Francke.
Kjellmer, Göran. 1994. A Dictionary of English Collocations. Based on the Brown Corpus. Oxford: Clarendon Press.
Lenz, Susanne. 2000. Studienbibliographie Korpuslinguistik. Heidelberg: Groos.
McEnery, Tony; Wilson, Andrew. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.
Melby, Alan K. 1995. The Possibility of Language. A Discussion of the Nature of Language with Implications for Human and Machine Translation. Amsterdam: John Benjamins.
The Oxford-Hachette French Dictionary. 1994. French-English/English-French. Marie-Hélène Corréard, Valerie Grundy (Eds.). Oxford: Oxford University Press.
Pinker, Steven. 1994. The Language Instinct. New York: William Morrow.
Pinker, Steven. 1999. “Regular habits. How we learn language by mixing memory and rules.” In: Times Literary Supplement, October 29, 1999, 11–13.
Renouf, Antoinette (Ed.). 1998. Working with Corpora. Selected Papers from the 18th ICAME Conference. Amsterdam: Rodopi.
Le Robert & Collins. 1993. Dictionnaire Français–Anglais/Anglais–Français. 4th Edition. Editor in Chief: Beryl S. Atkins.
Searle, John R. 1992. The Rediscovery of the Mind. Cambridge, Mass.: The MIT Press.
Sinclair, John M. 1996. “The Empty Lexicon.” In: International Journal of Corpus Linguistics I(1): 99–120.
Steyer, Kathrin; Teubert, Wolfgang. 1998. “Deutsch-Französische Übersetzungsplattform. Ansätze, Methoden, empirische Möglichkeiten.” In: Deutsche Sprache 4(97): 343–359.
Stubbs, Michael. 1996. Text and Corpus Analysis. Oxford: Blackwell.
Teubert, Wolfgang. 1999.
In: Modelle der Übersetzung – Grundlagen der Methodik. Frankfurt/M.: Lang, 118–135.
Teubert, Wolfgang; Kervio-Berthou, Valérie; Windisch, Eric. To be published. Kollokationswörterbuch Adjektive und ihre Begleitsubstantive.
Wierzbicka, Anna. 1996. Semantics. Primes and Universals. Oxford: Oxford University Press.