Identification of Translationese: A Machine Learning Approach
CICLing 2010 Conference. Presentation of the following paper: 'Identification of Translationese: A Machine Learning Approach'

Published in: Technology, Business

Transcript

  • 1. Identification of Translationese: A Machine Learning Approach. Iustina Ilisei (1), Diana Inkpen (2), Gloria Corpas (3) and Ruslan Mitkov (1). (1) University of Wolverhampton, United Kingdom; (2) University of Ottawa, Canada; (3) University of Malaga, Spain. CICLing 2010, Iasi, Romania.
  • 2. Outline
       1 Introduction: Introduction in Translation Studies; Universals of Translation; Related Studies (Corpus-based Approach, Machine-Learning Approach)
       2 Methodology: Objective; Resources; Data Representation
       3 Evaluation: Classification Results; Analysis
       4 Conclusions
  • 3. Introduction
       Translationese effect: Translations exhibit their own unnatural language, their own peculiar lexico-grammatical and syntactic characteristics (Gellerstam, 1986). Translational language cannot avoid the effect of translationese (Baker, 1993; Laviosa, 1997; McEnery & Xiao, 2002, 2007).
       Intrigue: Because two languages cannot be perfectly mapped onto each other, a translated text and its original cannot be perfectly matched.
  • 4. Language Universals in Translation
       Mona Baker: “it will be necessary to develop tools that will enable us to identify universal features of translation, that is features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems” (Baker, 1993:243).
       Practical perspective: a (self-)assessment tool for translators; multilingual plagiarism detection; direction of translation detection; can improve SMT performance.
  • 5. Translation Universals according to Baker (1993, 1996)
       Simplification: translations tend to be simpler and easier-to-follow texts.
       Explicitation: translations tend to spell things out rather than leave them implicit.
       Convergence: translations tend to be more similar to each other than non-translations are.
       Normalisation: translations conform to patterns typical of the target language, even to the point of exaggerating them.
  • 6. Related studies: corpus-based approach
       S. Laviosa (2008): In translations, a low proportion of lexical words to function words, a high proportion of high-frequency words compared to low-frequency words, relatively great repetition of the most frequent words, and less variety in the most frequently used words.
       G. Corpas (2008): Simplification is confirmed for lexical richness, and contradicted in terms of complex sentences, information load, sentence length, depth of trees, and senses per word.
       G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008): Translations exhibit lower lexical density and richness, seem to be more readable, have a smaller proportion of simple sentences, and use fewer discourse markers.
  • 7. Related studies: machine-learning approach
       Supervised learning approach: Baroni & Bernardini (2006), “A new approach to the study of translationese: Machine Learning the difference between original and translated texts”.
       An SVM classifier distinguishes professional translations from original texts with accuracy above the chance level.
       It depends heavily on lexical cues, the distribution of n-grams of function words, morpho-syntactic categories, personal pronouns and adverbs in general.
       Human accuracy is much lower than the accuracy of the system.
  • 8. Aim of the Study
       Objective: a language-independent learning system able to distinguish between translated and non-translated texts.
       To investigate the validity of the simplification hypothesis.
       To explore the characteristic features which most influence translational language.
  • 9. Methodology: our assumption
       if (addition of the simplification features improves learning accuracy)
       then this is an argument towards the existence of the Simplification Universal
       else “further research required”
  • 10. Translational Corpora: Resources
       Comparable corpora: translated texts vs. non-translated texts.
       Spanish monolingual comparable corpora:
       Medical Translations by professionals (MTP) vs. Comparable Original Medical texts by professionals (MTPC)
       Medical Translations by translation students (MTS) vs. Comparable Original Medical texts by translation students (MTSC)
       Technical Translations by professionals (TT) vs. Comparable Original Technical texts by professionals (TTC)
  • 11. Datasets: training and testing
       Training set: 450 instances (156 translation class, 294 non-translation class)
       Testing set: 148 instances (52 translation class, 96 non-translation class)
       Set pair one: MTP-MTPC (2 + 2, translation vs. non-translation)
       Set pair two: MTS-MTSC (36 + 66, translation vs. non-translation)
       Set pair three: TT-TTC (14 + 28, translation vs. non-translation)
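The slides do not say how the baseline rows of the result tables were computed. Assuming a majority-class baseline (always predict the larger class), the class counts on slide 11 give the following accuracies; the dictionary and function names are illustrative only.

```python
# Class counts copied from slide 11: (translation, non-translation).
DATASETS = {
    "training": (156, 294),
    "testing": (52, 96),
    "MTP-MTPC": (2, 2),
    "MTS-MTSC": (36, 66),
    "TT-TTC": (14, 28),
}


def majority_baseline(translated: int, original: int) -> float:
    """Accuracy (in %) of always predicting the larger of the two classes."""
    total = translated + original
    return 100.0 * max(translated, original) / total


for name, (translated, original) in DATASETS.items():
    print(f"{name:10s} {majority_baseline(translated, original):6.2f}%")
```

Under this assumption the training, testing, MTS-MTSC and TT-TTC values agree with the baseline figures reported on slides 15 and 16, and the three set pairs add up to the 148 test instances.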
  • 12. Data Representation
       Data Representation without Simplification Features (DR - SF): proportion in each text of grammatical words, nouns, finite verbs, auxiliary verbs, adjectives, adverbs, numerals, pronouns, prepositions, determiners and conjunctions, plus the grammatical words/lexical words ratio.
       Data Representation with Simplification Features (DR + SF): all of the above (DR - SF) plus the simplification features.
  • 13. Simplification Features: proposed features to grasp simplicity in texts
       Sentence Length: average number of words per sentence
       Parse Tree Depth: the average of the maximum parse tree depth per sentence in texts
       Types of sentence: proportion of sentences without finite verbs / simple sentences / complex sentences in texts
       Ambiguity: average number of senses per word in texts
       Word Length: average number of syllables per word in texts
       Lexical Richness: proportion of type lemmas per tokens in texts
       Information Load: proportion of lexical words per tokens in texts
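Three of the features on slide 13 can be approximated without a parser or a sense inventory. The following is a rough sketch, not the authors' implementation: it assumes regex tokenisation, uses lower-cased word forms in place of lemmas, and relies on a tiny hypothetical list of Spanish function words where a real system would use a POS tagger. The key `informationLoad` is an illustrative name.

```python
import re

# Hypothetical, deliberately tiny list of Spanish grammatical (function) words.
# A real system would decide this from POS tags instead.
FUNCTION_WORDS = {
    "el", "la", "los", "las", "un", "una", "de", "del",
    "en", "a", "por", "con", "y", "o", "que", "se",
}


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption, not a real segmenter)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    """Lower-cased word tokens; word forms stand in for lemmas here."""
    return re.findall(r"\w+", text.lower())


def simplification_features(text):
    sentences = split_sentences(text)
    tokens = tokenize(text)
    if not sentences or not tokens:
        raise ValueError("text must contain at least one word")
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        # Sentence Length: average number of words per sentence.
        "sentenceLength": len(tokens) / len(sentences),
        # Lexical Richness: distinct types per token.
        "lexicalRichness": len(set(tokens)) / len(tokens),
        # Information Load: lexical words per token.
        "informationLoad": len(lexical) / len(tokens),
    }


print(simplification_features("El gato come. El perro come pan."))
```

Parse tree depth, sentence types, ambiguity and syllable counts need linguistic resources (a parser, a sense inventory, a syllabifier) and are left out of the sketch.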
  • 14. Classification Experiments
       Trained/tested on the entire dataset
       Trained on the entire dataset and tested on separate test datasets:
       Set MTS-MTSC (medical texts)
       Set TT-TTC (technical texts)
  • 15. Classification Experiments
       Table: Classification results: accuracies for several classifiers, including vs. excluding simplification features (SF)

       Classifier      | Incl. SF, 10-fold CV | Incl. SF, test set | Excl. SF, 10-fold CV | Excl. SF, test set
       Baseline        | 65.33%               | 64.86%             | 65.33%               | 64.86%
       Naive Bayes     | *76.67%              | 79.05%             | 69.33%               | 75.00%
       BayesNet        | 78.67%               | 79.73%             | 75.11%               | 77.03%
       JRip            | 79.56%               | 83.11%             | 73.33%               | 77.03%
       Decision Tree   | 78.22%               | 81.76%             | 78.22%               | 81.76%
       Simple Logistic | *77.33%              | 83.11%             | 71.11%               | 80.41%
       SVM             | *79.11%              | *81.76%            | 69.33%               | 73.65%
       Meta-classifier | *80.00%              | 87.16%             | 73.33%               | 85.81%
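The assumption on slide 9 can be checked directly against the test-set columns of the slide 15 table. The sketch below only subtracts the reported accuracies; it does not reproduce the experiments or any significance test, and the names are illustrative.

```python
# Test-set accuracies (%) copied from the table on slide 15:
# classifier -> (including simplification features, excluding them).
TEST_ACCURACY = {
    "Naive Bayes": (79.05, 75.00),
    "BayesNet": (79.73, 77.03),
    "JRip": (83.11, 77.03),
    "Decision Tree": (81.76, 81.76),
    "Simple Logistic": (83.11, 80.41),
    "SVM": (81.76, 73.65),
    "Meta-classifier": (87.16, 85.81),
}


def gains(table):
    """Accuracy gain in percentage points from adding the simplification features."""
    return {
        name: round(with_sf - without_sf, 2)
        for name, (with_sf, without_sf) in table.items()
    }


# Print classifiers from largest to smallest gain.
for name, gain in sorted(gains(TEST_ACCURACY).items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {gain:+.2f}")
```

On these figures no classifier loses accuracy when the simplification features are added: the decision tree is unchanged and the SVM gains the most.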
  • 16. Classification Experiments
       Table: Classification accuracy results on the medical and technical test datasets, including vs. excluding simplification features (SF)

       Classifier      | Incl. SF, MTS-MTSC | Incl. SF, TT-TTC | Excl. SF, MTS-MTSC | Excl. SF, TT-TTC
       Baseline        | 64.71%             | 66.67%           | 64.71%             | 66.67%
       Naive Bayes     | 71.57%             | 95.24%           | 71.57%             | 80.95%
       BayesNet        | 73.53%             | 97.62%           | 71.57%             | 92.86%
       JRip            | 79.42%             | 95.24%           | 72.55%             | 92.86%
       Decision Tree   | 77.45%             | 92.86%           | 75.49%             | 95.24%
       Simple Logistic | 77.45%             | 97.62%           | 79.41%             | 83.33%
       SVM             | 75.49%             | *97.62%          | 74.51%             | 69.05%
       Meta-classifier | 82.35%             | 97.62%           | 78.43%             | 92.86%
  • 17. Decision Tree: features exploited in the categorisation task
       First level: Lexical Richness
       Second level: Sentence Length (words/sentence); Grammatical words/Lexical words proportion
       Third level: Pronoun proportion in texts; Conjunction proportion in texts
  • 18. JRip Classifier Rules
       Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08) => class=translation
       Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46) and (sentenceLength >= 20.7) and (ratioNouns >= 0.33) => class=translation
       Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13) => class=translation
       Rule 4: => class=non-translation
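The four rules on slide 18 form an ordered rule list: the first rule whose conditions hold decides the class, and the last rule is the default. A direct transcription might look like this. The thresholds and feature names come from the slide, while the function name and the sample feature values are invented for illustration.

```python
def classify(f):
    """Apply the slide 18 rules in order; the first rule that fires decides."""
    # Rule 1
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"
    # Rule 2
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"
    # Rule 3
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"
    # Rule 4: default class
    return "non-translation"


# Hypothetical feature vector, not taken from the corpora.
sample = {
    "lexicalRichness": 0.30, "ratioFiniteVerbs": 0.12, "simpleSentences": 0.10,
    "wordLength": 2.80, "sentenceLength": 15.0, "ratioNouns": 0.25,
    "ratioPreps": 0.20,
}
print(classify(sample))
```

Because every explicit rule predicts the translation class, a text is labelled non-translation only when none of the three conditions holds.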
  • 19. Attributes Ranking Filters

       Rank | Information Gain | Chi squared
       1    | lexicalRichness  | lexicalRichness
       2    | grammsPerLexics  | grammsPerLexics
       3    | ratioFiniteVerbs | ratioFiniteVerbs
       4    | ratioNumerals    | ratioNumerals
       5    | ratioAdjectives  | ratioAdjectives
       6    | sentenceLength   | sentenceLength
       7    | ratioProns       | ratioProns
       8    | simpleSentences  | wordLength
       9    | wordLength       | simpleSentences
       10   | grammaticalWords | zeroSentences
       11   | zeroSentences    | ratioNouns
       12   | ratioNouns       | lexicalWords
       ...  | .....            | .....
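Slide 19 ranks attributes by information gain but does not say how the numeric features were discretised. As a minimal sketch, assuming a single threshold split per feature, information gain is the class entropy minus the weighted entropy of the two partitions. The toy values below are invented, not corpus data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain from splitting a numeric feature at `threshold` (value <= threshold vs. >)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    # Weighted entropy of the non-empty partitions.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remainder


# Toy example: a hypothetical lexicalRichness column and its class labels.
richness = [0.10, 0.12, 0.30, 0.35]
classes = ["translation", "translation", "non-translation", "non-translation"]
print(information_gain(richness, classes, threshold=0.2))
```

A threshold that separates the classes perfectly yields the full class entropy as gain, while a threshold that leaves all instances on one side yields zero.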
  • 20. Conclusions: Summary
       A learning system able to distinguish between translated and non-translated text for Spanish.
       On a technical dataset, the accuracy reaches up to 97.62%.
       Adding the simplification features increases the accuracy of the classifiers; the improvement for SVM is statistically significant.
       The results may be considered an argument for the existence of the Simplification Universal.
  • 21. Thank you for your attention!
