Identification of Translationese: A Machine Learning Approach

CICLing 2010 Conference: presentation of the paper 'Identification of Translationese: A Machine Learning Approach'

Published in: Technology, Business
Transcript of "Identification of Translationese: A Machine Learning Approach"

  1. Identification of Translationese: A Machine Learning Approach
     Introduction / Methodology / Evaluation / Conclusions
     Iustina Ilisei (1), Diana Inkpen (2), Gloria Corpas (3) and Ruslan Mitkov (1)
     (1) University of Wolverhampton, United Kingdom
     (2) University of Ottawa, Canada
     (3) University of Malaga, Spain
     CICLing 2010, Iasi, Romania
  2. Outline
     1 Introduction
         Introduction in Translation Studies
         Universals of Translation
         Related Studies
           Corpus-based Approach
           Machine-Learning Approach
     2 Methodology
         Objective
         Resources
         Data Representation
     3 Evaluation
         Classification
         Results Analysis
     4 Conclusions
  3. Introduction
     Translationese Effect
       • Translations exhibit their own unnatural language, their own peculiar lexico-grammatical and syntactic characteristics (Gellerstam, 1986).
       • Translational language cannot avoid the effect of translationese (Baker, 1993; Laviosa, 1997; McEnery & Xiao, 2002, 2007).
     Intrigue
       Since two languages cannot be perfectly mapped onto each other, a translated text and its original cannot be perfectly matched.
  4. Language Universals in Translation
     Mona Baker
       “it will be necessary to develop tools that will enable us to identify universal features of translation, that is features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems”. (Baker, 1993:243)
     Practical Perspective
       • a (self-)assessment tool for translators
       • multilingual plagiarism detection
       • direction of translation detection can improve SMT performance
  5. Translation Universals
     According to Baker (1993, 1996):
       • Simplification: translations tend to be simpler and easier-to-follow texts
       • Explicitation: translations tend to spell things out rather than leave them implicit
       • Convergence: translations tend to be more similar to each other than non-translations are
       • Normalisation: translations conform to patterns typical of the target language, even to the point of exaggerating them
  6. Related studies: Corpus-Based Approach
     S. Laviosa (2008)
       In translations: a low proportion of lexical words to function words, a high proportion of high-frequency words compared to low-frequency words, relatively great repetition of the most frequent words, and less variety in the most frequently used words.
     G. Corpas (2008)
       Simplification confirmed for lexical richness, and contradicted in terms of complex sentences, information load, sentence length, depth of trees, and senses per word.
     G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008)
       Translations exhibit lower lexical density and richness, seem to be more readable, have a smaller proportion of simple sentences, and use fewer discourse markers.
  7. Related studies: Machine-Learning Approach
     Supervised Learning Approach
     Baroni & Bernardini (2006), “A new approach to the study of translationese: Machine Learning the difference between original and translated texts”
       • An SVM classifier distinguishes professional translations from original texts with accuracy above the chance level
       • It depends heavily on lexical cues, the distribution of n-grams of function words, morpho-syntactic categories, personal pronouns and adverbs in general
       • Human accuracy is much lower than the accuracy of the system
  8. Aim of the Study
     Objective
       • To build a language-independent learning system able to distinguish between translated and non-translated texts.
       • To investigate the validity of the simplification hypothesis.
       • To explore the characteristic features which most influence translational language.
  9. Methodology
     Our assumption
       if (addition of the simplification features improves learning accuracy)
       then this is an argument towards the existence of the Simplification Universal
       else “further research required”
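The assumption on this slide amounts to comparing two accuracies for the same classifier. A minimal sketch of that comparison follows; the function name and the return strings are illustrative only, and the sketch ignores the statistical significance testing that the results slides mark with asterisks.

```python
def simplification_verdict(acc_with_sf: float, acc_without_sf: float) -> str:
    """Apply the slide's if/else to one classifier's two accuracies (in %).

    Hypothetical helper: the name and signature are not from the slides.
    """
    if acc_with_sf > acc_without_sf:
        return "argument towards the Simplification Universal"
    return "further research required"


# 10-fold cross-validation accuracies taken from the results table.
svm_verdict = simplification_verdict(79.11, 69.33)
tree_verdict = simplification_verdict(78.22, 78.22)
```

With the reported numbers, the SVM falls in the first branch, while the Decision Tree, whose accuracy does not change, falls in the second.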
  10. Translational Corpora
      Resources
        Comparable corpora: translated texts vs. non-translated texts
      Spanish Monolingual Comparable Corpora
        • Medical Translations by professionals (MTP) vs. Comparable Original Medical texts by professionals (MTPC)
        • Medical Translations by translation students (MTS) vs. Comparable Original Medical texts by translation students (MTSC)
        • Technical Translations by professionals (TT) vs. Comparable Original Technical texts by professionals (TTC)
  11. Datasets: training and testing
      Training set
        450 instances (156 translation class, 294 non-translation class)
      Testing set
        148 instances (52 translation class, 96 non-translation class)
        • Set pair one: MTP-MTPC (2 + 2, translation vs. non-translation)
        • Set pair two: MTS-MTSC (36 + 66, translation vs. non-translation)
        • Set pair three: TT-TTC (14 + 28, translation vs. non-translation)
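The class counts above are enough to work out what a trivial classifier would score. The sketch below assumes a majority-class baseline (always predicting the larger class); the slides do not say how the Baseline row was obtained, but the values computed this way coincide with it.

```python
def majority_baseline(n_translation: int, n_non_translation: int) -> float:
    """Accuracy in % of always predicting the larger of the two classes."""
    total = n_translation + n_non_translation
    return 100.0 * max(n_translation, n_non_translation) / total


# (translation, non-translation) counts as given on the datasets slide.
splits = {
    "training": (156, 294),
    "testing": (52, 96),
    "MTS-MTSC": (36, 66),
    "TT-TTC": (14, 28),
}

baselines = {name: round(majority_baseline(*counts), 2)
             for name, counts in splits.items()}
```

The three test pairs also sum to the full test set: 2 + 36 + 14 = 52 translations and 2 + 66 + 28 = 96 non-translations.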
  12. Data Representation
      Data Representation without Simplification Features (DR - SF)
        Proportion in each text of: grammatical words, nouns, finite verbs, auxiliary verbs, adjectives, adverbs, numerals, pronouns, prepositions, determiners, conjunctions; plus the grammatical words/lexical words ratio
      Data Representation with Simplification Features (DR + SF)
        All of the above (DR - SF) + simplification features
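As an illustration of the DR - SF representation, the sketch below computes part-of-speech proportions and the grammatical/lexical ratio from an already tagged text. The tag names, the choice of which tags count as lexical, and the function names are assumptions made for the example; the slides do not specify the tagger or tag set used.

```python
from collections import Counter

# Assumption: content-word tags. Everything else is treated as grammatical.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def pos_proportions(tagged):
    """Share of tokens carrying each tag, for a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged)
    total = len(tagged)
    return {tag: n / total for tag, n in counts.items()}


def grammatical_lexical_ratio(tagged):
    """Number of grammatical tokens divided by number of lexical tokens."""
    lexical = sum(1 for _, tag in tagged if tag in LEXICAL_TAGS)
    grammatical = len(tagged) - lexical
    return grammatical / lexical if lexical else float("inf")


# Invented toy sentence with hypothetical tags.
tagged = [("el", "DET"), ("paciente", "NOUN"), ("toma", "VERB"),
          ("la", "DET"), ("medicina", "NOUN")]
proportions = pos_proportions(tagged)
ratio = grammatical_lexical_ratio(tagged)
```

Each text then becomes one feature vector of such proportions, which is the form a classifier expects.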
  13. Simplification Features
      Proposed features to grasp simplicity in texts
        • Sentence Length: average number of words per sentence in texts
        • Sentence Depth: average of the maximum parse tree depth per sentence in texts
        • Types of Sentence: proportion of sentences without finite verbs / simple sentences / complex sentences in texts
        • Ambiguity: average number of senses per word in texts
        • Word Length: average number of syllables per word in texts
        • Lexical Richness: proportion of type lemmas per tokens in texts
        • Information Load: proportion of lexical words per tokens in texts
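Three of the features listed above can be computed directly from lemmatised, tagged sentences. The following is a minimal sketch under stated assumptions: the input format (lists of (lemma, tag) pairs), the tag names and the set of lexical tags are hypothetical, and the features that need a parser, a sense inventory or a syllabifier are left out.

```python
# Assumption: tags that mark lexical (content) words.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def sentence_length(sentences):
    """Average number of tokens per sentence."""
    return sum(len(s) for s in sentences) / len(sentences)


def lexical_richness(sentences):
    """Distinct lemmas divided by total tokens."""
    lemmas = [lemma for s in sentences for lemma, _ in s]
    return len(set(lemmas)) / len(lemmas)


def information_load(sentences):
    """Lexical tokens divided by total tokens."""
    tags = [tag for s in sentences for _, tag in s]
    return sum(1 for tag in tags if tag in LEXICAL_TAGS) / len(tags)


# Invented two-sentence text, already lemmatised and tagged.
text = [
    [("el", "DET"), ("paciente", "NOUN"), ("tomar", "VERB"),
     ("el", "DET"), ("medicina", "NOUN")],
    [("el", "DET"), ("medicina", "NOUN"), ("ayudar", "VERB")],
]
features = {
    "sentenceLength": sentence_length(text),
    "lexicalRichness": lexical_richness(text),
    "informationLoad": information_load(text),
}
```

Here the toy text has 8 tokens, 5 distinct lemmas and 5 lexical tokens, so both ratios come out at 0.625 and the average sentence length at 4.0.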
  14. Classification Experiments
      Experiments
        • Trained/tested on the entire dataset
        • Trained on the entire dataset and tested on separate test datasets
            Set MTS-MTSC (medical texts)
            Set TT-TTC (technical texts)
  15. Classification Experiments

                        Including Simplification Features   Excluding Simplification Features
      Classifier        10-fold CV        Test set          10-fold CV        Test set
      Baseline           65.33%            64.86%            65.33%            64.86%
      Naive Bayes       *76.67%            79.05%            69.33%            75.00%
      BayesNet           78.67%            79.73%            75.11%            77.03%
      JRip               79.56%            83.11%            73.33%            77.03%
      Decision Tree      78.22%            81.76%            78.22%            81.76%
      Simple Logistic   *77.33%            83.11%            71.11%            80.41%
      SVM               *79.11%           *81.76%            69.33%            73.65%
      Meta-classifier   *80.00%            87.16%            73.33%            85.81%

      Table: Classification Results: Accuracies for several classifiers (CV = cross-validation)
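To make the effect of the simplification features easier to read off, the sketch below computes the per-classifier difference between the two 10-fold cross-validation columns. The numbers are copied from the table; the variable names are illustrative, and the differences say nothing about statistical significance.

```python
# classifier: (accuracy with SF, accuracy without SF), 10-fold CV, in %
cv_results = {
    "Naive Bayes": (76.67, 69.33),
    "BayesNet": (78.67, 75.11),
    "JRip": (79.56, 73.33),
    "Decision Tree": (78.22, 78.22),
    "Simple Logistic": (77.33, 71.11),
    "SVM": (79.11, 69.33),
    "Meta-classifier": (80.00, 73.33),
}

# Gain in percentage points from adding the simplification features.
gains = {name: round(with_sf - without_sf, 2)
         for name, (with_sf, without_sf) in cv_results.items()}

largest_gain = max(gains, key=gains.get)
```

On these figures the SVM gains the most (9.78 points) and the Decision Tree is unchanged.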
  16. Classification Experiments

                        Including Simplification Features   Excluding Simplification Features
      Classifier        MTS-MTSC          TT-TTC            MTS-MTSC          TT-TTC
      Baseline           64.71%            66.67%            64.71%            66.67%
      Naive Bayes        71.57%            95.24%            71.57%            80.95%
      BayesNet           73.53%            97.62%            71.57%            92.86%
      JRip               79.42%            95.24%            72.55%            92.86%
      Decision Tree      77.45%            92.86%            75.49%            95.24%
      Simple Logistic    77.45%            97.62%            79.41%            83.33%
      SVM                75.49%           *97.62%            74.51%            69.05%
      Meta-classifier    82.35%            97.62%            78.43%            92.86%

      Table: Classification accuracy results on the medical and technical test datasets.
  17. Decision Tree
      Exploit features in categorisation task:
        • First level: Lexical Richness
        • Secondly: Sentence Length (words/sentence); Grammatical words/Lexical words proportion
        • Thirdly: Pronoun proportion in texts; Conjunction proportion in texts
  18. JRip Classifier Rules
      Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08) => class=translation
      Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46) and (sentenceLength >= 20.7) and (ratioNouns >= 0.33) => class=translation
      Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13) => class=translation
      Rule 4: => class=non-translation
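The four rules form an ordered list: the first rule whose conditions all hold decides the class, and Rule 4 is the default. A minimal sketch of applying them to one text follows; the thresholds and feature names are those on the slide, while the function name, the dictionary input format and the two example feature vectors are invented for illustration.

```python
def classify_with_rules(f: dict) -> str:
    """Apply the slide's rules in order to a dict of feature values."""
    # Rule 1
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"
    # Rule 2
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"
    # Rule 3
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"
    # Rule 4: default class
    return "non-translation"


# Invented feature vectors, not taken from the corpora.
low_richness_text = {
    "lexicalRichness": 0.12, "ratioFiniteVerbs": 0.07, "simpleSentences": 0.2,
    "wordLength": 2.6, "sentenceLength": 18.0, "ratioNouns": 0.30,
    "ratioPreps": 0.15,
}
rich_text = dict(low_richness_text, lexicalRichness=0.25, ratioFiniteVerbs=0.11)

label_a = classify_with_rules(low_richness_text)
label_b = classify_with_rules(rich_text)
```

The first vector is caught by Rule 1; the second matches none of Rules 1 to 3 and falls through to the default.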
  19. Attributes Ranking Filters

      Information Gain      Chi squared
      lexicalRichness       lexicalRichness
      grammsPerLexics       grammsPerLexics
      ratioFiniteVerbs      ratioFiniteVerbs
      ratioNumerals         ratioNumerals
      ratioAdjectives       ratioAdjectives
      sentenceLength        sentenceLength
      ratioProns            ratioProns
      simpleSentences       wordLength
      wordLength            simpleSentences
      grammaticalWords      zeroSentences
      zeroSentences         ratioNouns
      ratioNouns            lexicalWords
      .....                 .....
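For readers unfamiliar with the first ranking criterion, the sketch below computes information gain for a numeric feature split at a single threshold. This is a simplified illustration, not the filter used in the study: the slides do not describe how numeric attributes were discretised, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the data at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Invented lexicalRichness values for four texts.
richness = [0.10, 0.12, 0.30, 0.40]
classes = ["translation", "translation", "non-translation", "non-translation"]

gain_good_split = information_gain(richness, classes, threshold=0.16)
gain_useless_split = information_gain(richness, classes, threshold=0.50)
```

A threshold that separates the two classes perfectly recovers the full 1 bit of class entropy, while a threshold that leaves all texts on one side gains nothing. Features are ranked by how much of this gain they achieve.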
  20. Conclusions
      Summary
        • A learning system able to distinguish between translated and non-translated text for the Spanish language.
        • On the technical dataset, accuracy reaches up to 97.62%.
        • Adding the features related to simplification increases the accuracy of the classifiers; the SVM shows a statistically significant improvement.
        • The results may be considered an argument for the existence of the Simplification Universal.
  21. Thank you for your attention!
