Translation universals: do they exist?
A corpus-based NLP study of convergence and simplification
Gloria Corpas*, Ruslan Mitkov**, Naveed Afzal**, Viktor Pekar***
* University of Málaga
** University of Wolverhampton
*** Oxford University Press
Translation universals
(Baker 1993, 1996; Toury 1995)
Translated texts tend to be simpler than non-
translated, original texts (simplification)
Translated texts tend to be more explicit than
non-translated texts (explicitation)
Translated texts tend to be more similar to one another
than non-translated texts are (convergence)
Previous research on translation
universals
Formulation and initial explanation have been based
on intuition and introspection
Follow-up corpus research limited to
comparatively small-size corpora, literary or
newswire texts and semi-manual analysis
Insufficient guidance as to which features
must be accounted for before these universals
can be regarded as valid
Objective of this study
To test the validity of convergence (translated
texts tend to be more similar than non-translated
texts)
To test the validity of simplification (translated
texts tend to be simpler than non-translated
texts)
To propose features which account for
convergence and simplification
Test (target) language: Spanish
General methodology: convergence
Employment of NLP techniques on corpora of
translated Spanish and on comparable corpora
of non-translated (original) Spanish
Similarity between every pair of corpora of
translated texts and between every pair of
corpora of original texts computed
Similarity is measured in terms of both style and
syntax
General methodology: simplification
Employment of NLP techniques on corpora of
translated Spanish and on comparable corpora
of non-translated (original) Spanish
For every corpus a set of lexical and stylistic
features computed and compared with its
comparable counterpart
Corpora used
Corpus of Medical Spanish Translations by Professionals
(MSTP: 1,058,122)
Corpus of Medical Spanish Translations by Students (MSTS:
1,058,122)
Corpus of Technical Spanish Translations (TST: 1,736,027)
Corpus of Original Medical Spanish Comparable to
Translations by Professionals (MSTPC: 1,402,172)
Corpus of Original Medical Spanish Comparable to
Translations by Students (MSTSC: 1,164,435)
Corpus of Original Technical Spanish Comparable to
Technical Translations (TSTC: 1,986,651)
Comparability of corpora
Comparability in terms of
(i) Text types and forms
(ii) Domains and sub domains
(iii) Level of specialisation and formality
(iv) Diatopic restrictions (Peninsular Spanish)
(v) Time span (2005-2008)
(vi) Similar size
Corpus design
[Diagram: translated Spanish corpora, ES (TT), set against the non-translated Spanish corpora, ES (NT). Labels in the non-translated part: MSC, MSTSC, MSTPC, TSTC]
Study 1: Convergence
Specific methodology (1)
Compared:
all 3 pairs of translated texts (MSTP-MSTS; MSTS-TST; MSTP-TST)
all 3 pairs of comparable non-translated texts
(MSTPC-MSTSC; MSTSC-TSTC; MSTPC-TSTC)
Premise: if the convergence universal holds, higher
similarity is expected for pairs of translated texts.
Study 1: Convergence
Specific methodology (2)
Texts compared on the basis of
(i) style (stylistic features)
(ii) syntax (syntactic features).
Our proposal for stylistic and syntactic features
Style comparison: stylistic features
Lexical density:
(number of types)/
(total number of tokens present in corpus)
Lexical richness:
(number of lemmas)/
(number of tokens present in corpus)
Sentence length:
(number of tokens in corpus)/
(number of sentences)
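The three ratios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is assumed to be already tokenised, sentence-split and lemmatised, and the function names and the toy Spanish sentences are hypothetical, not taken from the study.

```python
def lexical_density(tokens):
    """Number of distinct word forms (types) divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def lexical_richness(lemmas, tokens):
    """Number of distinct lemmas divided by number of tokens."""
    return len(set(lemmas)) / len(tokens)


def average_sentence_length(sentences):
    """Number of tokens in the corpus divided by number of sentences."""
    return sum(len(s) for s in sentences) / len(sentences)


# Toy input (hypothetical): two tokenised sentences and their lemmas.
sentences = [
    ["el", "paciente", "toma", "el", "fármaco"],
    ["los", "pacientes", "toman", "fármacos"],
]
lemmas_by_sentence = [
    ["el", "paciente", "tomar", "el", "fármaco"],
    ["el", "paciente", "tomar", "fármaco"],
]
tokens = [t for s in sentences for t in s]
lemmas = [l for s in lemmas_by_sentence for l in s]

density = lexical_density(tokens)             # 8 types over 9 tokens
richness = lexical_richness(lemmas, tokens)   # 4 lemmas over 9 tokens
mean_length = average_sentence_length(sentences)  # 9 tokens over 2 sentences
```

Because richness counts lemmas rather than surface forms, it can never exceed density for the same corpus, which matches the pattern in the reported feature values.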
Style comparison: stylistic features (2)
Simple/complex sentences
Discourse markers (Spanish)
Two statistical tests (Chi-Square test and T-test)
employed
Syntax comparison
Sequences of POS tags for every pair of corpora
compared
Corpora represented as frequency vectors of 3-
grams (Nerbonne and Wiersma, 2006)
Measures:
Cosine
Recurrence metrics R and Rsq (Kessler, 2001)
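As a rough sketch of the vector comparison, the following Python builds POS 3-gram frequency vectors and computes the cosine-based distance 1-C. The tag names, the toy inputs and the choice not to let 3-grams cross sentence boundaries are assumptions made for illustration; the recurrence metrics R and Rsq are not reproduced here.

```python
from collections import Counter
from math import sqrt


def pos_trigram_counts(tagged_sentences):
    """Count POS-tag 3-grams; each sentence is a list of POS tags.

    Assumption: 3-grams do not span sentence boundaries.
    """
    counts = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty vector")
    return 1 - dot / (norm_a * norm_b)


# Toy corpora (hypothetical tag sequences).
corpus_a = [["DET", "NOUN", "VERB", "DET", "NOUN"]]
corpus_b = [["DET", "NOUN", "VERB", "ADJ"]]

vec_a = pos_trigram_counts(corpus_a)
vec_b = pos_trigram_counts(corpus_b)
distance = cosine_distance(vec_a, vec_b)
```

A distance of 0 means the two corpora have the same relative distribution of POS 3-grams; larger values mean the syntactic profiles differ more.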
Experimental results
Computation of stylistic features
Chi-square values for global comparison
T-test values for statistical significance
Measuring vector differences for syntax
comparison
Style comparison: Stylistic Features
Feature                     MSTP          MSTS          TST           MSTPC         MSTSC         TSC
Lexical density             0.027954      0.052715      0.020679      0.042505      0.041159      0.025529
Lexical richness            0.016929      0.037709      0.013281      0.029992      0.028905      0.015591
Average sentence length     25.256248     28.499456     27.292782     20.702349     26.442412     18.124363
Simple sentences (%)        0.441768121   0.507205751   0.476949103   0.638889238   0.52120611    0.592110096
Discourse markers (ratio)   0.001268941   0.001852604   0.000763805   0.002022331   0.002099085   0.001649655
Style comparison: Chi-Square Values
Translated corpora
Corpora           Chi-square value
MSTP - MSTS       0.010622566
MSTP - TST        0.00266151
MSTS - TST        0.023731912
Total             0.037015988
Average           0.012338663

Non-translated corpora
Corpora           Chi-square value
MSTPC - MSTSC     0.059779549
MSTPC - TSC       0.006140764
MSTSC - TSC       0.07122404
Total             0.137144352
Average           0.045714784
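The totals and averages can be checked with a short Python sketch using the pairwise values reported on this slide. The variable names are illustrative, and reading a lower chi-square value as "more similar" is the interpretation implied by the discussion slides rather than something stated here.

```python
# Pairwise chi-square values as reported on the slide.
translated = {
    ("MSTP", "MSTS"): 0.010622566,
    ("MSTP", "TST"): 0.00266151,
    ("MSTS", "TST"): 0.023731912,
}
non_translated = {
    ("MSTPC", "MSTSC"): 0.059779549,
    ("MSTPC", "TSC"): 0.006140764,
    ("MSTSC", "TSC"): 0.07122404,
}


def average(values):
    """Arithmetic mean of the pairwise values."""
    values = list(values)
    return sum(values) / len(values)


avg_translated = average(translated.values())
avg_non_translated = average(non_translated.values())

# Under the assumed reading, a smaller average means the group is more homogeneous.
translated_more_similar = avg_translated < avg_non_translated
```

On these figures the translated group has the smaller average, which is the basis for the chi-square observation in the convergence discussion.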
Style comparison: T-Test Values
Translated corpora (T-test values)
Feature              MSTP - MSTS     MSTS - TST      MSTP - TST
Lexical density      0.002545387     0.000123172     0.079875166
Lexical richness     0.0006604       0.000006.9792   0.140236542
Sentence length      0.011826639     0.522122939     0.202480843
Simple sentences     0.057465277     0.673936375     0.202830407
Discourse markers    0.001048007     0.005746253     0.351552034

Non-translated corpora (T-test values)
Feature              MSTPC - MSTSC   MSTSC - TSC     MSTPC - TSC
Lexical density      0.140348431     0.201151185     0.000748439
Lexical richness     0.140711253     0.015893183     0.00009.71905
Sentence length      0.145216739     0.002807505     0.368840258
Simple sentences     0.096465071     0.462960518     0.21217697
Discourse markers    0.063428055     0.00084074      0.072337471
Syntax comparison: results of measuring vector differences
Corpora             1-C                R                Rsq
Translated texts
MSTP - MSTS         0.206015066283     252526.914323    638848591.082
MSTP - TST          0.337626383799     388466.504863    3146471863.13
MSTS - TST          0.176310545152     432725.578482    2643068563.82
Non-translated texts
MSTPC - MSTSC       0.0176469276126    98448.0858054    82218137.9687
MSTPC - TSC         0.150912596476     364322.217714    851312764.364
MSTSC - TSC         0.167167511143     372940.61477     1008322991.78
Convergence: discussion (1)
Stylistic features: the translated texts included in the
experiment are more similar to one another than the
non-translated texts (Chi-square test)
Convergence: discussion (2)
T-test observations
There are non-translated texts which are not statistically
different in terms of stylistic features, whereas the
corresponding translated texts differ statistically
There are non-translated texts which are statistically
different in terms of only one stylistic feature, whereas the
corresponding translated texts differ statistically with
regard to two stylistic features
Translated texts can often differ significantly with regard
to certain style features (lexical density).
Convergence: discussion (3)
Translated texts differ more in terms of syntax
for all compared pairs and from the point of view
of all measures (1-C, R and Rsq)
Study 2: Simplification
Specific methodology
Stylistic features accounting for ‘simple’ texts
Sentence length
Simple vs. Complex sentences
Readability
Automated Readability Index (ARI)
Coleman-Liau Index (CLI)
Flesch-Kincaid Grade Level Readability Test (FK)
Results compared across pairs of corpora
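For reference, the three readability indices can be sketched in Python from raw counts. The coefficients below are the standard published ones, which were calibrated on English; the slides do not say whether any adaptation for Spanish was applied, so this is an assumption. The function names and the sample counts are hypothetical.

```python
def automated_readability_index(characters, words, sentences):
    """ARI from character, word and sentence counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def coleman_liau_index(letters, words, sentences):
    """CLI from letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def flesch_kincaid_grade(words, sentences, syllables):
    """FK grade level from word, sentence and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a small text sample.
ari = automated_readability_index(characters=500, words=100, sentences=5)
cli = coleman_liau_index(letters=500, words=100, sentences=5)
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=200)
```

All three indices return an approximate school grade level, so a higher score indicates a text that is harder to read.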
Comparison of mean values of the lexical and stylistic features
between corresponding comparable corpora
                            MTP-MTPC                  MTS-MTSC                  TT-TTC
Feature                     MTP     MTPC    α         MTS     MTSC    α         TT      TTC     α
Lexical density             .027    .042    0.005     .052    .041    0.4       .02     .025    0.001
Lexical richness            .016    .029    0.005     .037    .028    0.4       .013    .015    0.001
Average sentence length     25.25   20.70   0.2       28.49   26.44   0.1       27.29   18.12   0.001
Simple sentences (%)        .441    .638    0.01      .507    .521    0.7       .476    .592    0.002
Discourse markers (ratio)   .0012   .002    0.05      .0018   .0021   0.2       .0007   .0016   0.002
ARI                         16.85   15.08   0.4       19.14   19.01   0.75      17.85   12.85   0.001
CLI                         16.27   16.9    0.3       17.16   18.28   0.05      16.28   15.5    0.1
FK                          19.53   18.21   0.5       21.32   21.51   0.5       20.03   15.46   0.001
Simplification: discussion
Mixed picture
Simplification confirmed on
Lexical richness
Lexical density
Readability
Simplification not confirmed on
Sentence length
Proportion of simple sentences
Implications for translation universals
Convergence
Style: convergence appears to hold broadly, but
no definite conclusion can be drawn that convergence is
a clear-cut universal
Syntax: there is no evidence that convergence holds in
terms of syntax
General: results do not provide sufficient support for the
convergence ‘universal’
Simplification
Mixed picture: insufficient support for simplification
Implication for Machine Translation
Given the mixed picture, not many
But: translated texts have to be more readable
than non-translated texts
More research is needed as to which features
are ‘stable’
Could they be included in an MT model?
Conclusions
There is insufficient evidence that
translation universals (convergence,
simplification) hold
Features which appear to be ‘stable’ (e.g.
readability) could be modelled in MT systems

AMTA'2008 translation universals
