Intercultural Faultlines

Stubbs, Michael (1995a) Collocations and Semantic Profiles: On the Cause of the Trouble with Quantitative Studies, Functions of Language 2(1): 23-55.
------ (1995b) Collocations and Cultural Connotations of Common Words, Linguistics and Education 7: 379-390.
Toury, Gideon (1995) Descriptive Translation Studies - and Beyond, Amsterdam and Philadelphia: John Benjamins.
Vanderauwera, Ria (1985) Dutch Novels Translated into English: The Transformation of a "Minority" Literature, Amsterdam: Rodopi.
Wodin, Natascha (1983) Die gläserne Stadt, Leipzig: Reclam Verlag.
Woolls, David (1997) Multiconcord Version 1.5, Birmingham: CFL Software Development.
Zürn, Unica (1977) Der Mann im Jasmin, Frankfurt am Main and Berlin: Verlag Ullstein; translated by Malcolm Green as The House of Illnesses, 1993, and The Man of Jasmine & Other Texts, 1994, London: Atlas Press.

8 Parallel Corpora in Translation Studies: Issues in Corpus Design and Analysis

FEDERICO ZANETTIN

Abstract: This chapter deals with the design and analysis of translation-driven corpora, i.e. principled collections of electronic texts compiled with the aim of studying translation products and processes, with special reference to parallel corpora. The comparison and contrast of paired translation units is, of course, not new to translation research, but the possibility of retrieving on a computer screen hundreds of similar contexts and their translations, and the relative ease of combining this with statistical analysis and data manipulation, allow hypotheses to be tested on a larger scale as well as tentative generalizations to be made. Corpus linguistics is seen as a [...] in the same way as it is applied to the study of [...] corpora, as well as other [...] complement other types of investigation of printed texts in translation studies.
It is suggested in this chapter that issues of corpus design (which types of texts are included, what languages are involved, which criteria are used for sampling, what the research aims and applications of the projects are), and corpus encoding (how the translation from printed texts to electronic corpus comes about) deserve careful consideration insofar as these issues are likely to affect findings based on corpora. It is also argued that, in order to enhance and maximize the advantage which can be derived from research based on electronic texts in translation studies, there is a need for greater standardization and interchange of corpus resources.

1. Introduction

In the past ten years or so there has been a growing interest in the application of computer-assisted methods of investigation to the study of translation and translated texts. In particular, corpus linguistics has served as a methodological framework for the creation and use of what I would like to call translation-driven corpora. The corpora developed and used within this approach have quite different characteristics, depending on the aims and research interests upon which the various research projects are based, but they generally involve a comparison between two different corpus components.

A first type of translation-driven corpus is the monolingual comparable corpus, consisting of a set of translations and a comparable set of texts spontaneously created in the same language and selected according to similar design criteria. Corpora belonging to this first type include the English Comparable Corpus (ECC),
developed at UMIST, Manchester under the direction of Mona Baker, and the Finnish Comparable Corpus at the University of Joensuu (see Jantunen 2000; Kemppanen 2000; Tirkkonen-Condit 2000). These monolingual comparable corpora are translation-driven or translation dependent in that the "non-translational component is modelled on the composition of the translational set" (Laviosa 1997: 293). The motivations behind the construction of a corpus with these characteristics are mostly theoretical; the comparison between the two components of the corpus should enable investigation of the linguistic features of translated texts as opposed to those found in spontaneous text production. Since translation is "a communicative event which is shaped by its own goals, pressures and context of production" (Baker 1996: 175), texts produced as a result of this activity should show a distinctive linguistic make-up, which can be [...] comparable corpora.

A second type of translation-driven corpus is the bilingual comparable corpus. Neither of the two components includes translated texts; the corpus is composed of texts spontaneously produced in two languages under similar circumstances and within the same domains. Corpora of this kind have been created and used mainly with applied objectives in mind, either for the extraction of bilingual terminology (Laffling 1992) or for the training of translators (Zanettin 1994, 1998, forthcoming; Gavioli and Zanettin 1997; Peters and Picchi 1998). A bilingual comparable corpus is translation-driven in that the ultimate aim of its creation is to develop a tool and a resource for trainees and practitioners in the translation profession, and its composition is dictated by the need to translate texts belonging to a specific genre. This type of corpus, when in printed rather than in electronic format, is also referred to in translation studies as parallel texts (e.g. Piotrowska 1997; Schäffner 1998).

A third type of translation-driven corpus, and the one on which this chapter will focus, is the parallel corpus, comprising a set of translations in one language and their respective source texts in another language. Parallel corpora can be used for studying translated texts with two different goals in mind: they can be used by researchers to describe what translators actually do with texts and how they transform them in the process of translation, and they can help practitioners to make informed choices based on translation traditions and norms while translating or learning to translate. A parallel corpus can also be created to contain both directions of translation, thereby forming a bi-directional parallel (Aston 1999) or reciprocal (Teubert 1996) corpus, which appears to encompass all the different varieties described above.

The design of such a corpus is not devoid of theoretical problems, as will be argued in the following section. However, an examination of these problems may also provide a means of improving the design of the different kinds of translation-driven corpora. In this chapter I will briefly exemplify some of the issues which arise in the design of bi-directional parallel corpora with reference to English and Italian, focusing on some of those corpus design issues which are related to the specific translation-driven nature of a corpus and which may be different from concerns which surface in the design of general-language corpora. These issues are the aspects of representativeness, diversification and encoding.

2. Representativeness in bi-directional parallel corpora

One of the major issues in corpus design is that of representativeness; what distinguishes a corpus from a collection of electronic texts (or a text archive) is that a corpus is put together in a principled way so as to be representative of a larger textual population, in order to make it possible to generalize findings concerning that population. Thus, the most appropriate design for a corpus depends on what it is meant to represent (Biber 1993; Halverson 1998; Kennedy 1998; Biber et al. 1998). This should be remembered before making any general statements about language, texts, or translations based on corpus analysis; what is found in a corpus will only apply to what that particular corpus represents. The more general, large and varied the textual population to be represented, the more variables must be taken into account in the selection of the texts to be included in the corpus. To ensure representativeness of a certain genre, decisions must be taken as to sampling criteria:

- Should translations be chosen at random from within the total population to be represented, or should a motivated choice be made based on criteria such as text status and reception (e.g. prestige, readership)? Even when making a random selection, we still need to define the boundaries and internal categories of the total population, so all selections are bound to be subjective to a certain extent.
- Should the corpus be composed of complete texts or samples, and if the latter, what size and type should the samples be? There seems to be general agreement among translation scholars that the basic unit in translation is a full text, but practical limitations may still lead corpus designers to opt for samples.
- In any case, a compromise has to be reached between what is desirable and what is feasible on practical grounds; constraints include availability of texts, copyright restrictions and project funding.

One of the best known parallel corpora is the English Norwegian Parallel Corpus (ENPC), developed at the University of Oslo and documented in a number of publications (e.g. Johansson et al. 1996; Ebeling 1998). The ENPC has been taken as a model in a number of projects for other language pairs (Johansson 1998: 9-10) and as a starting point for the English Italian Translational Corpus (CEXI) project at the School for Translators and Interpreters (SSLMIT) of the University of Bologna at Forlì, which is currently at the stage of corpus design.
Figure 1, adapted from Johansson and Hofland (1994: 26), shows the composition of a bi-directional parallel corpus based on the design of the ENPC. Each corpus component (CC) has a number from 1 to 4.

    Language A                          Language B
    CC1                                 CC2
    Translations in Language A          Translations in Language B
    CC3                                 CC4
    Source Texts for                    Source Texts for
    Translations in Language B          Translations in Language A

Figure 1: A bi-directional parallel corpus

In principle, such a composition should not only allow comparison of translations and source texts (CC1:CC4, CC2:CC3) but also of translations and non-translations in the same language (CC1:CC3, CC2:CC4), as one would do using a monolingual comparable corpus. Furthermore, it would also seem possible to compare non-translated texts in the two different languages (CC3:CC4), as one would using a bilingual comparable corpus.[1] All three types of translation-driven corpora described in Section 1 are then ideally represented in a bi-directional parallel corpus and, in addition, for two different languages. A closer examination, however, reveals that the different components of a bi-directional parallel corpus are hardly comparable as they stand.

[1] Yet another possibility would be to compare translated texts in two different languages (CC1:CC2), i.e. a comparable bilingual corpus of translations (Johansson 1998: 8). This type of comparison has not yet attracted the attention of scholars, however.

Like the ENPC, the CEXI is a bi-directional parallel corpus, comprising both English translations and Italian source texts and Italian translations and English source texts. Like most translation-driven corpora, the Italian-English translational corpus only aims at representing written, published translations, and within these only the universe of translated books (to the exclusion of journals or magazines), rather than translation in its entirety. Translated books may represent only a small percentage of all translated texts, but they constitute data which are more readily available. In Italy, one in four published books is a translation, and half of these are translated from English (Vigini 1999: 87). On the other hand, only about 2% - or one in fifty - of the books published in English-speaking countries such as the UK and the USA are translations (Venuti 1995: 12), of which only about 4% are from Italian (Index Translationum 1998). The difference is not only one of relative quantity, but also one of quality; that is to say, if we compare translated narrative fiction in the two languages, we find that, while most genres are translated into Italian, and within popular fiction (detective, romance and science-fiction novels) translations may even constitute the majority of all published books, the vast majority of prose fiction translated from Italian into English is what may be called high-quality, 'literary' fiction, by which I mean authors like Eco, Calvino or Tabucchi.

This means that in a bi-directional parallel corpus of English and Italian narrative fiction, if the two translational components of the corpus are representative of the respective populations of translated books, strictly speaking they will not be comparable with the non-translational components, i.e. the source texts for the translations in the two languages respectively. In short, for Italian, mostly translated popular fiction would be compared with original literary fiction, while the reverse would be true for English, in which translated literary fiction would be compared with mostly original popular fiction. Since the flow of translations and the policies regarding them differ in different cultures and for different languages, it is difficult to envisage the same design for parallel corpora in different directions of translation and in different language pairs.

3. Diversification in translation-driven corpora

Another key issue in the design of translation-driven corpora concerns the need to use different types of corpora according to researchers' aims and objectives. Full reciprocity in a bi-directional parallel corpus would, in fact, appear impossible, as the concern with representativeness clashes with the concern with comparability. Is there a way out of this apparent dead end? I believe this can only be found through diversification, i.e. by using the same translated texts in corpora of different types created ad hoc according to specific design criteria. The comparison of translated texts with their source texts should be complemented with the comparison between these translated texts and a reference corpus in the same language. The ideal translational corpus is then not a pre-formed set of texts but an open-ended corpus comprising different components which allow different types of comparison to be made.

As an example, let us consider a simple quantitative analysis involving type/token ratio for a quite straightforward single source author parallel corpus, comprising five novels and a short story by Salman Rushdie and their Italian published translations. Type/token ratio is taken to be an indicator of lexical variety in texts (Baker 1995 and 1996; Laviosa 1998a); the more word forms are used with respect to the total number of words in a text, the wider the range of vocabulary used in the text, and more effort may be required on the part of the reader to process it than texts with less varied vocabulary. Type/token ratio is also a function of the total length of a text, so that to have comparable figures for texts of different length we
need a standard measure. Wordsmith Tools (Scott 1996) allows the standardized type/token ratio to be calculated, and the base length chosen for this investigation was the suggested 1000 words. Table 1 shows the standardized type/token ratio for the Rushdie parallel corpus:

    Texts                          Tokens            Types          T/T ratio       T/T ratio
                                   (running words)   (word forms)   (whole texts)   (std 1000)
    I figli della mezzanotte       224,398           23,999         10.69           52.69
    I versi satanici               196,646           23,536         11.97           53.41
    L'ultimo sospiro del Moro      167,630           21,843         13.03           53.63
    La vergogna                    107,946           15,884         14.71           53.57
    Harun e il mar delle storie     44,624            7,487         16.78           49.86
    Chekov e Zulu                    5,025            1,946         38.73           55.54
    Italian translations (total)   746,269           47,762          6.40           53.09
    [rows for the English source texts: illegible]

Table 1: Type/token ratio for Rushdie parallel corpus

The higher the figure obtained by dividing types by tokens, the higher the lexical variety. As can be seen from the last column, the standardized type/token ratio for the translations is higher than that of the source texts, both as a whole and with regard to individual text pairs. This may appear to indicate that the vocabulary used in the translations is more varied than that of the source texts, and it runs counter to the suggestion that translations are lexically less varied than originals as a result of the translation process (Baker 1996; Kohn 1996).

The data, however, are not directly comparable, as different values could be due to structural differences between the two languages rather than to the process of translation (Munday 1998: 545). Therefore the type/token ratio for the texts in the two languages has to be related to reference corpora for English and Italian.

    Corpus                                   Tokens     Types      T/T ratio   T/T ratio
                                                                               (std 1000)
    Rushdie corpus (Italian translations)    746,269    47,762     6.40        53.09
    Italian Reference Corpus                 856,001    58,864     6.88        49.99
    Rushdie corpus (English source texts)    [illegible]
    English Reference Corpus                 843,629    27,46[?]   3.26        44.44

Table 2: Type/token ratio: Rushdie parallel corpus vs. comparable reference corpora

The reference corpus for English was made up of 20 novels randomly selected from the written imaginative component of the British National Corpus (1995). Six of these novels are full texts, while the remaining fourteen are extracts of about 40,000 words each. The reference corpus for Italian comprises seventeen full text novels by contemporary Italian writers downloaded from the Internet. Thus, the two reference corpora used in this experiment are only loosely comparable, as the main criterion for the inclusion of texts in the corpus was their availability in electronic format. They should however provide useful indicative reference values.

The overall type/token ratio for the Italian reference corpus (49.99) is 5.55 points higher than that of the English reference corpus (44.44). We can also see that the ratio for the two components of the Rushdie corpus (translations and source texts) is higher than the average for both Italian and English. However, the difference between the source texts by Rushdie and the reference corpus for English is higher (5.2) than that between the translations and the reference corpus for Italian (3.1), as can be seen in Figure 2.

A further check is necessary at this point to make sure that the results are statistically significant. What must be computed is the standard deviation for the two reference corpora, that is to say, the range outside which variation between a reference corpus and the texts in the corresponding component of the Rushdie parallel corpus cannot be ascribed to chance. The standard deviation is calculated through a formula which correlates the average standardized type/token ratio for a sample to the size of that sample.[2]

[2] The formula used to calculate the standard deviation was the following: [formula illegible]

Table 3 shows the standardized type/token ratio for each text in the parallel corpus (column 2), followed by the average standardized type/
token ratio for the reference corpora (column 3: 44.44 for English and 49.99 for Italian) and the standard deviation (column 4: ± 0.62 for English, ± 0.88 for Italian). The fifth column shows the gap between each text and the average, which is, in all but one case, significantly higher than the standard deviation for the reference corpora.

[Bar chart with one group of bars for English and one for Italian]

Figure 2: Type/token ratio (std. 1,000): Rushdie parallel corpus vs. comparable reference corpora

[Table contents illegible]

Table 3: Lexical variation in a parallel corpus

As we can see, the difference between the English source texts and the average for the English reference corpus is almost double that between the Italian translations and the average for the Italian reference corpus, except for the first and last parallel texts. Significantly, the first text, Chekov and Zulu, is a short story with a rather high type/token ratio, while the last one, Haroun and the Sea of Stories, is, at least on one level of reading, a story for children, with a rather low type/token ratio, and the difference from the average in the Italian translation is not statistically significant, i.e. the gap is within the range of the standard deviation. This could also be taken to suggest that above and below a certain threshold, there is no significant variation in type/token ratio between translations and source texts.

Summing up, we can say that, for at least four novels by Rushdie, while both translations and source texts have a higher type/token ratio than the average for the respective languages, the translations are much closer to this average than source texts, leading to the tentative conclusion that they are lexically less varied than the source texts as a result of the process of translation. It is true that type/token ratio is not by itself conclusive evidence of lexical variety. It should be considered alongside other indexes such as lexical density, i.e. the ratio of lexical to grammatical words, or the ratio of hapaxes (i.e. those words occurring only once in a text) to the total number of words (Baker 1996; Laviosa 1998b). However, this example has been used for methodological purposes to demonstrate that parallel corpora need to be used in conjunction with other corpus resources. In this case, the use of a reference corpus enabled us to go beyond a provisional finding based on the figures for a parallel corpus alone.

As much as analyses based on parallel corpora can profit from the comparison with data taken from monolingual reference corpora, other types of translation-driven corpora could benefit from diversification of analysis. For instance, if we were to compare the corpus of the Italian translations of Rushdie's novels with the Italian reference corpus alone, we would also have to hypothesize that (these) translations are more lexically varied than (the reference corpus of) texts spontaneously produced in the same language.

The reference corpora used in this experiment were composed of non-translated texts only. However, other kinds of reference corpora are also possible, depending on the purpose of the investigation. To be representative of a particular genre and thus reflect its norms and conventions, a reference corpus will have to be created by looking at the actual textual population for that genre, regardless of the status of the texts with respect to translation. If, for instance, what one wants to see is the expectancy norm for the language of popular narrative fiction in Italian, a reference corpus might well include translations rather than containing only texts spontaneously produced in Italian, since expectations for this kind of texts are created partly by translations themselves (Chesterman 1997: 67).

4. Corpus encoding

After texts have been selected for inclusion in a corpus, they need to be acquired in electronic format, and the process of translating texts from paper into digital format, as in all kinds of intersemiotic translation, may involve losses as well as gains.
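As an aside to the quantitative example in Section 3, the standardized type/token ratio can be sketched in a few lines of Python. This is a minimal illustration and not the Wordsmith Tools implementation: it assumes a naive regular-expression tokenizer, averages the type/token ratio over consecutive chunks of 1,000 tokens, and discards any shorter final chunk. The function names are hypothetical. The only figures taken from the chapter are the overall standardized ratios (53.09 for the translations, 49.99 for the Italian reference corpus) and the ± 0.88 deviation for Italian.

```python
import re
from statistics import mean


def tokenize(text):
    """Naive tokenizer (an assumption): lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Types (distinct word forms) as a percentage of tokens (running words)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return 100.0 * len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, base=1000):
    """Mean type/token ratio over consecutive chunks of `base` tokens.

    A final chunk shorter than `base` is discarded (an assumption made
    for this sketch).
    """
    chunks = [tokens[i:i + base]
              for i in range(0, len(tokens) - base + 1, base)]
    if not chunks:
        raise ValueError("text is shorter than the base length")
    return mean(type_token_ratio(chunk) for chunk in chunks)


def gap_from_reference(text_ratio, reference_ratio, deviation):
    """Return the gap from the reference average and whether it lies
    outside the deviation range reported for the reference corpus."""
    gap = text_ratio - reference_ratio
    return gap, abs(gap) > deviation


if __name__ == "__main__":
    # Toy text: the whole-text ratio collapses as the text repeats itself,
    # while the standardized ratio stays the same for every chunk.
    toy = tokenize("the cat sat on the mat " * 50)
    print(type_token_ratio(toy))
    print(standardized_ttr(toy, base=6))

    # Figures quoted in the chapter: Italian translations vs. Italian reference.
    gap, outside = gap_from_reference(53.09, 49.99, 0.88)
    print(round(gap, 2), outside)
```

The toy text shows why a standard measure is needed: the whole-text ratio falls as a text grows longer, whereas the chunk-based figure does not depend on overall length.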
What may be lost is useful or even essential information concerning the context of reception of the printed texts, such as paratextual and visual information. This is why the investigations of electronic corpora should, if possible, be coupled with the analysis of the printed source texts, for which the electronic versions are not a substitute but a complement. On the other hand, an electronic text can be enriched with linguistic and extralinguistic information, providing a means of carrying out analyses which would not be possible without using texts in electronic format.

At the linguistic level, corpora can be annotated, adding to each running word part-of-speech tagging as well as syntactic and semantic information. The parsing and tagging of the corpus can be automated to a large degree (Wilson and McEnery 1996; Kennedy 1998), and computer applications have been developed to facilitate the insertion of annotation of discourse features, such as referring expressions (Biber et al. 1998). A linguistically annotated corpus makes possible more refined analyses; for instance, type/token ratio statistics carried out on a lemmatized corpus could probably provide a more accurate picture of lexical variety than figures based on mere word-form counts. A corpus tagged for parts of speech could be searched for just one use of a word form, i.e. to obtain a concordance of set as a noun rather than an adjective or a verb.

As regards extralinguistic information, all the variables considered in the phase of corpus design can be encoded in each electronic text so that they can be retrieved and used as selection criteria for inclusion in a different corpus. In a corpus of translated books, for example, coding could include bibliographical information about the printed text, information about the translator and the process of translation, and about the source text for the translation. We may, for example, want to select translations not only according to when they were published, but also according to the amount of time that has elapsed between their publication and that of the source texts.[3]

In order to be able to exchange and re-use corpus resources and in order to replicate and compare findings based on different corpora, the adoption of a common standard for the encoding of translated texts seems advisable. The Text Encoding Initiative (Sperberg-McQueen and Burnard 1994) provides useful guidelines for the encoding of electronic texts in accordance with the international SGML standard (ISO 8879: 1986).[4] A Corpus Encoding Standard (CES) compliant with these guidelines has been developed and is already available in XML format (Ide and Bonhomme 2000), enabling the encoding of corpora in one or more languages, and also of parallel corpora. XML is a simplified subset of SGML designed for use on the web; corpora encoded in this format could be accessed through the Internet, effectively constituting a large textual database. For instance, the same translated text could be part of a comparable monolingual corpus or of one or more bilingual parallel corpora, with different overall designs. The benefits to be derived from the encoding of texts, and specifically of translations, in accordance with international standards, making possible the exchange of primary data (the electronic texts) as well as of secondary data (findings based on such data) among researchers, may well compensate for the time and effort required to create corpus resources in accordance with international standards.

[3] The Translational English Corpus (TEC) header (Baker 1999) includes many of these specifications.
[4] International Organization for Standardization, ISO 8879: Information processing - Text and office systems - Standard Generalized Markup Language (SGML), ([Geneva]: ISO, 1986).

Parallel corpora allow not only quantitative analyses of the kind illustrated through the Rushdie example to be carried out. They also facilitate qualitative analyses, based on the examination of parallel concordances. A search in an aligned parallel corpus could go in both directions, from source to target texts or vice versa. In the first case, a concordance of a source language item or pattern would highlight recurrent translation choices made by the translators; in the second case, a concordance of target language items or patterns would yield as a result the range of source language features from which those items or patterns derive.

In any case, in order to carry out parallel concordances, parallel corpora need to be aligned. Alignment procedures can, to a large extent, be automated, and may be performed on the basis of statistical elaboration, taking into account the number of sentences, words or even characters in the pairs of texts to be aligned (Church and Gale 1991), or using bilingual lexicons (Peters and Picchi 1998) or 'anchor lists' (Hofland and Johansson 1998). However, while advancements in automatic corpus alignment techniques will certainly help enhance not only machine translation research but also descriptive studies, a certain degree of interaction between machines and humans during the alignment process will not only continue to be necessary but may also provide a way of examining in some detail how translations map onto source texts. Manual proofreading of aligned texts may help the researcher to observe any regularities which may characterize a certain body of translations as regards the segmentation, completeness and distribution of translated segments, in order to investigate what Toury (1995: 58-59) calls 'matricial norms'.

The output of the alignment procedure can be either bilingual texts with alternating source text segments and their translations, or monolingual texts in which segments which correspond are numbered, paired, and retrieved on the basis of an index. In order to allow different types of comparison to be made, it should be possible to take individual electronic texts from translation-driven corpora, including parallel corpora, and re-use them in different projects. The Corpus Encoding Standard mentioned above makes possible the creation of parallel corpora in which translations and source texts are encoded as individual entities and from which the retrieval of parallel concordances is carried out on the basis of an external index generated during the alignment process. The same source text could thus be used in a monolingual comparable corpus, aligned with translations in different languages or with different translations in the same language.
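The idea of an external alignment index can be pictured with a small sketch. It is only an illustration under stated assumptions: the segment identifiers, the `links` structure and both function names are hypothetical, the sentences are invented, and nothing here reproduces the actual markup of the Corpus Encoding Standard. Source and target texts are stored as separate entities, and a stand-off list of links pairs their segments, so that a parallel concordance can be retrieved in either direction.

```python
from typing import Dict, List, Tuple

Segments = Dict[str, str]            # segment id -> segment text
Link = Tuple[List[str], List[str]]   # (ids on one side, ids on the other side)

# Invented toy data: each text is stored on its own and knows nothing
# about the other.
source: Segments = {
    "s1": "The house is old.",
    "s2": "She bought a house and then sold it.",
}
target: Segments = {
    "t1": "La casa è vecchia.",
    "t2": "Ha comprato una casa.",
    "t3": "Poi l'ha venduta.",
}

# External (stand-off) index, as an aligner might produce it. The second
# link pairs one source segment with two target segments.
links: List[Link] = [
    (["s1"], ["t1"]),
    (["s2"], ["t2", "t3"]),
]


def invert(links: List[Link]) -> List[Link]:
    """Swap the two sides of every link, to search from target to source."""
    return [(right, left) for left, right in links]


def parallel_concordance(query: str, search_side: Segments,
                         other_side: Segments,
                         links: List[Link]) -> List[Tuple[str, str]]:
    """Return (searched text, aligned text) pairs for every link whose
    searched side contains `query` (simple case-insensitive substring match)."""
    hits = []
    for search_ids, other_ids in links:
        searched = " ".join(search_side[i] for i in search_ids)
        if query.lower() in searched.lower():
            aligned = " ".join(other_side[i] for i in other_ids)
            hits.append((searched, aligned))
    return hits


if __name__ == "__main__":
    # From source to target: how is "house" rendered?
    for pair in parallel_concordance("house", source, target, links):
        print(pair)
    # From target to source: which source segments lie behind "casa"?
    for pair in parallel_concordance("casa", target, source, invert(links)):
        print(pair)
```

Because the index lives outside the texts, the same `source` dictionary could be linked to a second translation simply by adding another list of links, which is the kind of re-use the paragraph above describes.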
5. Conclusion

Computerized parallel corpora together with other types of translation-driven corpora can be an invaluable resource in both descriptive and applied translation studies, in that they allow the investigation of linguistic and extralinguistic features of translated texts on a much larger scale than can be achieved by manual analysis of printed texts. However, as more and more translation-driven corpora are created, there is a need for the criteria adopted in the design of these corpora to be carefully considered and made transparent, so that research can be replicated and findings can be compared and evaluated. Compiling corpora can involve a lot of time and effort, and in order to avoid ending up with strands of isolated experiments, we should make sure that the maximum advantage can be derived from the enterprise. The criteria used in selecting the texts to be included in a corpus, and those adopted in preparing them for use with corpus software, e.g. the procedures for alignment of a parallel corpus, have implications for the findings based on the corpora themselves. I would like to suggest that corpus design and encoding should not be seen as merely preliminary to the actual analysis of a corpus, but as important moments in themselves in the study of translation and translated texts.

References

Aston, Guy (1999) Corpus Use and Learning to Translate, Textus XII(2): 289-314.
Baker, Mona (1995) Corpora in Translation Studies: An Overview and Some Suggestions for Future Research, Target 7(2): 223-243.
------ (1996) Corpus-Based Translation Studies: The Challenges that Lie Ahead, in Harold Somers (ed.) Terminology, LSP, and Translation: Studies in Language Engineering in Honour of Juan C. Sager, Amsterdam and Philadelphia: John Benjamins, 175-186.
------ (1999) The Role of Corpora in Investigating the Linguistic Behaviour of Professional Translators, International Journal of Corpus Linguistics 4(2): 1-18.
Biber, Douglas (1993) Representativeness in Corpus Design, Literary and Linguistic Computing 8(4): 243-257.
------, Susan Conrad and Randi Reppen (1998) Corpus Linguistics: Investigating Language Structure and Use, Cambridge: Cambridge University Press.
British National Corpus, version 1.0, 1995, Oxford: Oxford University Computing Services.
Chesterman, Andrew (1997) Memes of Translation: The Spread of Ideas in Translation Theory, Amsterdam and Philadelphia: John Benjamins.
Church, Kenneth W. and William A. Gale (1991) Concordances for Parallel Texts, in Using Corpora: Proceedings of the 7th Annual Conference of the UW Centre for the New OED and Text Research, Oxford: Oxford University Press, 40-62.
Ebeling, Jarle (1998) Contrastive Linguistics, Translation, and Parallel Corpora, Meta 43(4): 602-615.
Gavioli, Laura and Federico Zanettin (1997) Comparable Corpora and Translation: A Pedagogic Perspective, http://www.sslmit.unibo.it/cultpaps/ (last checked on 15 September 2000).
Halverson, Sandra (1998) Translation Studies and Representative Corpora: Establishing Links between Translation Corpora, Theoretical/Descriptive Categories and a Conception of the Object of Study, Meta 43(4): 494-514.
Hofland, Knut and Stig Johansson (1998) The Translation Corpus Aligner: A Program for Automatic Alignment of Parallel Texts, in Stig Johansson and Signe Oksefjell (eds.) Corpora and Cross-Linguistic Research: Theory, Method, and Case Studies, Amsterdam and Atlanta: Rodopi, 87-100.
Ide, Nancy and Patrice Bonhomme (2000) XML Corpus Encoding Standard Document XCES 0.2, http://www.cs.vassar.edu/XCES/ (last checked on 15 September 2000).
Index Translationum, 5th edition, UNESCO 1998.
Jantunen, Jarmo (2000) What Can Corpora Tell Us about Translated Language: A Comparable Corpus of Finnish in Use for Making Hypotheses, paper presented at the UMIST/UCL Research Models in Translation Studies conference, Manchester, 28-30 April 2000.
Johansson, Stig and Knut Hofland (1994) Towards an English-Norwegian Parallel Corpus, in Udo Fries, Gunnel Tottie and Peter Schneider (eds.) Creating and Using English Language Corpora, Amsterdam: Rodopi, 25-37.
------, Jarle Ebeling and Knut Hofland (1996) Coding and Aligning the English-Norwegian Parallel Corpus, in Karin Aijmer, Bengt Altenberg and Mats Johansson (eds.) Languages in Contrast, Lund: Lund University Press, 87-112.
------ (1998) On the Role of Corpora in Cross-Linguistic Research, in Stig Johansson and Signe Oksefjell (eds.) Corpora and Cross-Linguistic Research: Theory, Method, and Case Studies, Amsterdam and Atlanta: Rodopi, 3-24.
Kemppanen, Hannu (2000) Looking for Evaluative Keywords in Authentic and Translated Finnish: Corpus Research on Finnish History Texts, paper presented at the UMIST/UCL Research Models in Translation Studies conference, Manchester, 28-30 April 2000.
Kennedy, Graeme (1998) An Introduction to Corpus Linguistics, London and New York: Longman.
Kohn, János (1996) What Can (Corpus) Linguistics Do for Translation?, in Kinga Klaudy, José Lambert and Anikó Sohár (eds.) Translation Studies in Hungary, Budapest: Scholastica, 39-52.
Laffling, John (1992) On Constructing a Transfer Dictionary for Man and Machine, Target 4(1): 11-31.
Laviosa, Sara (1997) How Comparable Can Comparable Corpora Be?, Target 9(2): 289-319.
------ (1998a) The English Comparable Corpus: A Resource and a Methodology, in Lynne Bowker, Michael Cronin, Dorothy Kenny and Jennifer Pearson (eds.) Unity in Diversity: Current Trends in Translation Studies, Manchester: St. Jerome Publishing, 101-112.
------ (1998b) Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose, Meta 43(4): 551-570.