1. Translation universals: do they exist?
A corpus-based NLP study of convergence and simplification
Gloria Corpas*, Ruslan Mitkov**, Naveed Afzal**, Viktor Pekar***
* University of Málaga
** University of Wolverhampton
*** Oxford University Press
2. Translation universals
(Baker 1993, 1996; Toury 1995)
Translated texts tend to be simpler than non-
translated, original texts (simplification)
Translated texts tend to be more explicit than
non-translated texts (explicitation)
Translated texts tend to be more similar than
non-translated texts (convergence)
3. Previous research on translation
universals
Formulation and initial explanation have been based
on intuition and introspection
Follow-up corpus research limited to
comparatively small-size corpora, literary or
newswire texts and semi-manual analysis
No clear guidance as to which features
should account for these universals if they are
to be regarded as valid
4. Objective of this study
To test the validity of convergence (translated
texts tend to be more similar than non-translated
texts)
To test the validity of simplification (translated
texts tend to be simpler than non-translated
texts)
To propose features which account for
convergence and simplification
Test (target) language: Spanish
5. General methodology: convergence
Employment of NLP techniques on corpora of
translated Spanish and on comparable corpora
of non-translated (original) Spanish
Similarity between every pair of corpora of
translated texts and between every pair of
corpora of original texts computed
Similarity is measured in terms of both style and
syntax
6. General methodology: simplification
Employment of NLP techniques on corpora of
translated Spanish and on comparable corpora
of non-translated (original) Spanish
For every corpus a set of lexical and stylistic
features computed and compared with its
comparable counterpart
7. Corpora used
Corpus of Medical Spanish Translations by Professionals
(MSTP: 1,058,122)
Corpus of Medical Spanish Translations by Students (MSTS:
1,058,122)
Corpus of Technical Spanish Translations (TST: 1,736,027)
Corpus of Original Medical Spanish Comparable to
Translations by Professionals (MSTPC: 1,402,172)
Corpus of Original Medical Spanish Comparable to
Translations by Students (MSTSC: 1,164,435)
Corpus of Original Technical Spanish Comparable to
Technical Translations (TSTC: 1,986,651)
8. Comparability of corpora
Comparability in terms of
(i) Text types and forms
(ii) Domains and sub domains
(iii) Level of specialisation and formality
(iv) Diatopic restrictions (Peninsular Spanish)
(v) Time span (2005-2008)
(vi) Similar size
10. Study 1: Convergence
Specific methodology (1)
Compared:
all 3 pairs of translated texts
(MSTP-MSTS; MSTS-TST; MSTP-TST)
all 3 pairs of comparable non-translated texts
(MSTPC-MSTSC; MSTSC-TSTC; MSTPC-TSTC)
Premise: if the convergence universal holds, higher
similarity is expected for pairs of translated texts.
11. Study 1: Convergence
Specific methodology (2)
Texts compared on the basis of
(i) style (stylistic features)
(ii) syntax (syntactic features).
Our proposal for stylistic and syntactic features
12. Style comparison: stylistic features
Lexical density:
(number of types)/
(total number of tokens present in corpus)
Lexical richness:
(number of lemmas)/
(number of tokens present in corpus)
Sentence length:
(number of tokens in corpus)/
(number of sentences)
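The three ratios above can be sketched in a few lines of Python. This is a minimal illustration, not the study's pipeline: the regex tokenizer, the naive sentence splitter and the `lemmatize` callback are all assumptions, since the slides do not say which tools were used.

```python
import re

def stylistic_features(text, lemmatize=None):
    """Compute the slide's ratios for one corpus given as a single string.

    `lemmatize` is a hypothetical callback mapping a token to its lemma;
    without it, lexical richness is left as None.
    """
    # Naive sentence split on ., ! and ? (assumption, not the study's splitter).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased word tokens; \w is Unicode-aware, so accented letters are kept.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or not sentences:
        raise ValueError("text contains no tokens")
    features = {
        # number of types / number of tokens
        "lexical_density": len(set(tokens)) / len(tokens),
        # number of tokens / number of sentences
        "sentence_length": len(tokens) / len(sentences),
        "lexical_richness": None,
    }
    if lemmatize is not None:
        # number of distinct lemmas / number of tokens
        lemmas = {lemmatize(t) for t in tokens}
        features["lexical_richness"] = len(lemmas) / len(tokens)
    return features
```

In practice the callback would wrap a real Spanish lemmatizer; a dictionary lookup is enough to try the function on a toy sentence.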
13. Style comparison: stylistic features (2)
Simple/complex sentences
Discourse markers (Spanish)
Two statistical tests (Chi-Square test and T-test)
employed
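For orientation, the two test statistics can be computed with the standard library alone. This sketch makes assumptions the slides do not confirm: that the chi-square test is Pearson's test on a contingency table of feature counts, and that the t-test is Welch's unequal-variance variant applied to per-text feature values. Both function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumption: rows are corpora and columns are feature counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

def welch_t(a, b):
    """Welch's t statistic for two independent samples of feature values."""
    # statistics.variance is the sample variance (n - 1 denominator).
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))
```

Turning either statistic into a p-value additionally needs the chi-square or t distribution, which a statistics package would normally supply.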
14. Syntax comparison
Sequences of POS tags for every pair of corpora
compared
Corpora represented as frequency vectors of
3-grams (Nerbonne and Wiersma, 2006)
Measures:
Cosine
Recurrence metrics R and Rsq (Kessler, 2001)
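A minimal sketch of the vector representation and the cosine measure follows. It assumes each corpus is already POS-tagged and given as a list of tag sequences, one per sentence; the tagger and tag set are not specified in the slides, and the function names are illustrative. The recurrence metrics R and Rsq are not reproduced here.

```python
from collections import Counter
from math import sqrt

def pos_trigrams(tagged_sentences):
    """Count POS-tag 3-grams over a corpus given as lists of tags per sentence."""
    counts = Counter()
    for tags in tagged_sentences:
        # Slide a window of three tags; 3-grams do not cross sentence boundaries.
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors (Counters)."""
    # Only 3-grams present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A distance in the sense of the 1-C measure reported in the discussion is then `1 - cosine_similarity(a, b)`.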
15. Experimental results
Computation of stylistic features
Chi-square values for global comparison
T-test values for statistical significance
Measuring vector differences for syntax
comparison
20. Convergence: discussion (1)
Stylistic features: translated texts included in
experiment are more similar than non-translated
texts (Chi-square test)
21. Convergence: discussion (2)
T-test observations
There are non-translated texts which are not statistically
different in terms of stylistic features whereas the
corresponding translated texts differ statistically
There are non-translated texts which are statistically
different in terms of only one stylistic feature whereas the
corresponding translated texts differ statistically with
regard to two stylistic features
Translated texts could often differ significantly with regard
to certain style features (lexical density).
22. Convergence: discussion (3)
Translated texts differ more in terms of syntax
for all compared pairs and from the point of view
of all measures (1-C, R and Rsq)
23. Study 2: Simplification
Specific methodology
Stylistic features accounting for ‘simple’ texts
Sentence length
Simple vs. Complex sentences
Readability
Automated Readability Index (ARI)
Coleman-Liau Index (CLI)
Flesch-Kincaid Grade Level Readability Test (FK)
Results compared across pairs of corpora
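The three indices can be written down from raw counts as below. The coefficients are the standard published ones, which were calibrated on English; whether the study used these or a Spanish adaptation is not stated in the slides, so treat this as an illustration. Counting characters, letters and syllables is left to the caller, and the function names are illustrative.

```python
def ari(chars, words, sentences):
    """Automated Readability Index from character, word and sentence counts."""
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43

def coleman_liau(letters, words, sentences):
    """Coleman-Liau Index from letter, word and sentence counts."""
    letters_per_100 = letters / words * 100      # L: letters per 100 words
    sentences_per_100 = sentences / words * 100  # S: sentences per 100 words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from word, sentence and syllable counts."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59
```

For all three indices a higher score means a harder text, so under simplification the translated corpus would be expected to score lower than its comparable original corpus.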
24. Comparison of mean values of the lexical and stylistic features
between corresponding comparable corpora
Features                    MTP-MTPC                MTS-MTSC                TT-TTC
                            MTP     MTPC    α       MTS     MTSC    α       TT      TTC     α
Lexical Density             .027    .042    0.005   .052    .041    0.4     .02     .025    0.001
Lexical Richness            .016    .029    0.005   .037    .028    0.4     .013    .015    0.001
Average Sentence Length     25.25   20.70   0.2     28.49   26.44   0.1     27.29   18.12   0.001
Simple Sentences (%)        .441    .638    0.01    .507    .521    0.7     .476    .592    0.002
Discourse Markers (Ratio)   .0012   .002    0.05    .0018   .0021   0.2     .0007   .0016   0.002
ARI                         16.85   15.08   0.4     19.14   19.01   0.75    17.85   12.85   0.001
CLI                         16.27   16.9    0.3     17.16   18.28   0.05    16.28   15.5    0.1
FK                          19.53   18.21   0.5     21.32   21.51   0.5     20.03   15.46   0.001
26. Implications for translation universals
Convergence
Style: convergence appears to hold broadly, but
it cannot be concluded that convergence is
a clear-cut universal
Syntax: there is no evidence that convergence holds in
terms of syntax
General: results do not provide sufficient support for the
convergence ‘universal’
Simplification
Mixed picture: insufficient support for simplification
27. Implications for Machine Translation
Given the mixed picture, not many
But: translated texts have to be more readable
than non-translated texts
More research is needed as to which features
are ‘stable’
Should these be included in an MT model?
28. Conclusions
There is insufficient evidence that
translation universals (convergence,
simplification) hold
Features which appear to be ‘stable’ (e.g.
readability) could be built into MT systems