Identification of Translationese: A Machine Learning Approach

CICLing 2010 Conference: presentation of the paper 'Identification of Translationese: A Machine Learning Approach'

    Identification of Translationese: A Machine Learning Approach (Presentation Transcript)

    • Identification of Translationese: A Machine Learning Approach
        Iustina Ilisei (1), Diana Inkpen (2), Gloria Corpas (3) and Ruslan Mitkov (1)
        (1) University of Wolverhampton, United Kingdom
        (2) University of Ottawa, Canada
        (3) University of Malaga, Spain
        CICLing 2010, Iasi, Romania
    • Outline
        1 Introduction
            Introduction in Translation Studies
            Universals of Translation
            Related Studies
                Corpus-based Approach
                Machine-Learning Approach
        2 Methodology
            Objective
            Resources
            Data Representation
        3 Evaluation
            Classification
            Results Analysis
        4 Conclusions
    • Introduction
        Translationese Effect
        • Translations exhibit their own unnatural language, their own peculiar lexico-grammatical and syntactic characteristics. (Gellerstam, 1986)
        • Translational language cannot avoid the effect of translationese. (Baker, 1993; Laviosa, 1997; McEnery & Xiao, 2002, 2007)
        Intrigue
        • As two languages cannot be perfectly mapped onto each other, a translated text and its original cannot be perfectly matched.
    • Language Universals in Translation
        Mona Baker
        “it will be necessary to develop tools that will enable us to identify universal features of translation, that is features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems”. (Baker, 1993:243)
        Practical Perspective
        • a (self)assessment tool for translators
        • multilingual plagiarism detection
        • direction of translation detection can improve SMT performance
    • Translation Universals
        According to Baker (1993, 1996):
        • Simplification: translations tend to be simpler and easier-to-follow texts
        • Explicitation: translations tend to spell things out rather than leave them implicit
        • Convergence: translations tend to be more similar than non-translations
        • Normalisation: translations conform to patterns typical of the target language, even to the point of exaggerating them
    • Related studies: Corpus-Based Approach
        • S. Laviosa (2008): translations show a low proportion of lexical words to function words, a high proportion of high-frequency words compared to low-frequency words, relatively greater repetition of the most frequent words, and less variety in the most frequently used words.
        • G. Corpas (2008): simplification is confirmed for lexical richness, and contradicted in terms of complex sentences, information load, sentence length, depth of trees, and senses per word.
        • G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008): translations exhibit lower lexical density and richness, seem to be more readable, have a smaller proportion of simple sentences, and use fewer discourse markers.
    • Related studies: Machine-Learning Approach
        Supervised Learning Approach
        Baroni & Bernardini (2006), “A new approach to the study of translationese: Machine Learning the difference between original and translated texts”
        • An SVM classifier distinguishes professional translations from original texts with accuracy above the chance level
        • It depends heavily on lexical cues, the distribution of n-grams of function words, morpho-syntactic categories, personal pronouns and adverbs in general
        • Human accuracy is much lower than the accuracy of the system
    • Aim of the Study
        Objective
        • To build a language-independent learning system able to distinguish between translated and non-translated texts.
        • To investigate the validity of the simplification hypothesis.
        • To explore the characteristic features which most influence translational language.
    • Methodology
        Our assumption
            if (addition of the simplification features improves learning accuracy)
            then this is an argument towards the existence of the Simplification Universal
            else “further research required”
    • Translational Corpora: Resources
        Comparable corpora: translated texts vs. non-translated texts
        Spanish Monolingual Comparable Corpora
        • Medical Translations by professionals (MTP) vs. Comparable Original Medical texts by professionals (MTPC)
        • Medical Translations by translation students (MTS) vs. Comparable Original Medical texts by translation students (MTSC)
        • Technical Translations by professionals (TT) vs. Comparable Original Technical texts by professionals (TTC)
    • Datasets: training and testing
        Training set: 450 instances (156 translation class, 294 non-translation class)
        Testing set: 148 instances (52 translation class, 96 non-translation class)
        • Set pair one: MTP-MTPC (2 + 2, translation vs. non-translation)
        • Set pair two: MTS-MTSC (36 + 66, translation vs. non-translation)
        • Set pair three: TT-TTC (14 + 28, translation vs. non-translation)
    • Data Representation
        Data Representation without Simplification Features (DR - SF)
        Proportion in each text of: grammatical words, nouns, finite verbs, auxiliary verbs, adjectives, adverbs, numerals, pronouns, prepositions, determiners, conjunctions; plus the grammatical words/lexical words ratio
        Data Representation with Simplification Features (DR + SF)
        All of the above (DR - SF) + simplification features
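The DR - SF representation above is a vector of per-text proportions. A minimal Python sketch of how such a vector could be computed is given below. It assumes the text has already been tagged into `(token, tag)` pairs; the tag names, the `LEXICAL_TAGS` set and the function name `pos_proportions` are illustrative assumptions, since the slides do not name the tagger or say which tags count as lexical words.

```python
from collections import Counter

# Assumption: these coarse tags count as lexical (content) words.
# The slides do not specify the tag set, so this is illustrative only.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def pos_proportions(tagged_tokens):
    """Return per-tag proportions and the grammatical/lexical ratio for one text."""
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("cannot compute proportions for an empty text")

    counts = Counter(tag for _, tag in tagged_tokens)
    # One feature per tag: share of tokens in the text carrying that tag.
    features = {f"ratio_{tag}": n / total for tag, n in counts.items()}

    lexical = sum(n for tag, n in counts.items() if tag in LEXICAL_TAGS)
    grammatical = total - lexical
    features["grammaticalWords"] = grammatical / total
    # Ratio of grammatical to lexical words; infinite if the text has no lexical words.
    features["grammsPerLexics"] = grammatical / lexical if lexical else float("inf")
    return features


example = [("el", "DET"), ("paciente", "NOUN"), ("recibe", "VERB"),
           ("el", "DET"), ("tratamiento", "NOUN")]
print(pos_proportions(example))
```

Each text then becomes one row of such features, labelled as translation or non-translation.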
    • Simplification Features
        Proposed features to capture simplicity in texts
        • Sentence Length: average number of words per sentence
        • Sentence Length: average of the maximum parse tree depth per sentence in texts
        • Types of sentence: proportion of sentences without finite verbs / simple sentences / complex sentences in texts
        • Ambiguity: average number of senses per word in texts
        • Word Length: average number of syllables per word in texts
        • Lexical Richness: proportion of type lemmas per tokens in texts
        • Information Load: proportion of lexical words per tokens in texts
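Three of the simplification features listed above need nothing beyond lemmas and part-of-speech tags, so they can be sketched briefly in Python. The sketch below assumes each text is a list of sentences and each sentence a list of `(lemma, tag)` pairs; the `LEXICAL_TAGS` set and the function name are assumptions made for illustration. Parse depth, senses per word and syllable counts need a parser, a sense inventory and a syllabifier, and are left out.

```python
# Assumption: these coarse tags count as lexical (content) words.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def simplification_features(sentences):
    """Compute three simplification features for one lemmatised, tagged text."""
    tokens = [token for sentence in sentences for token in sentence]
    if not tokens:
        raise ValueError("cannot compute features for an empty text")

    distinct_lemmas = {lemma.lower() for lemma, _ in tokens}
    lexical_count = sum(1 for _, tag in tokens if tag in LEXICAL_TAGS)

    return {
        # Average number of words per sentence.
        "sentenceLength": len(tokens) / len(sentences),
        # Distinct lemmas (types) divided by the number of tokens.
        "lexicalRichness": len(distinct_lemmas) / len(tokens),
        # Lexical words divided by the number of tokens.
        "informationLoad": lexical_count / len(tokens),
    }


text = [
    [("el", "DET"), ("paciente", "NOUN"), ("dormir", "VERB")],
    [("el", "DET"), ("médico", "NOUN"), ("observar", "VERB"),
     ("el", "DET"), ("paciente", "NOUN")],
]
print(simplification_features(text))
```

Counting lemmas rather than surface forms keeps inflected variants of one word from inflating lexical richness, which matters for a morphologically rich language such as Spanish.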
    • Classification Experiments
        Experiments
        • Trained/tested on the entire dataset
        • Trained on the entire dataset and tested on separate test datasets
            Set MTS-MTSC (medical texts)
            Set TT-TTC (technical texts)
    • Classification Experiments

                          Including Simplification Features    Excluding Simplification Features
        Classifier        10-fold CV       Test set            10-fold CV       Test set
        Baseline          65.33%           64.86%              65.33%           64.86%
        Naive Bayes       *76.67%          79.05%              69.33%           75.00%
        BayesNet          78.67%           79.73%              75.11%           77.03%
        JRip              79.56%           83.11%              73.33%           77.03%
        Decision Tree     78.22%           81.76%              78.22%           81.76%
        Simple Logistic   *77.33%          83.11%              71.11%           80.41%
        SVM               *79.11%          *81.76%             69.33%           73.65%
        Meta-classifier   *80.00%          87.16%              73.33%           85.81%

        Table: Classification results: accuracies for several classifiers (10-fold CV = 10-fold cross-validation)
    • Classification Experiments

                          Including Simplification Features    Excluding Simplification Features
        Classifier        MTS-MTSC         TT-TTC              MTS-MTSC         TT-TTC
        Baseline          64.71%           66.67%              64.71%           66.67%
        Naive Bayes       71.57%           95.24%              71.57%           80.95%
        BayesNet          73.53%           97.62%              71.57%           92.86%
        JRip              79.42%           95.24%              72.55%           92.86%
        Decision Tree     77.45%           92.86%              75.49%           95.24%
        Simple Logistic   77.45%           97.62%              79.41%           83.33%
        SVM               75.49%           *97.62%             74.51%           69.05%
        Meta-classifier   82.35%           97.62%              78.43%           92.86%

        Table: Classification accuracy results on the medical and technical test datasets.
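The slides do not say how the Baseline rows were obtained, but the figures are consistent with a majority-class baseline, that is, always predicting "non-translation". The short Python check below recomputes that baseline from the class counts given on the datasets slide; the function name `majority_baseline` is an illustrative assumption.

```python
def majority_baseline(n_translation, n_non_translation):
    """Accuracy (in percent) of always predicting the larger class."""
    total = n_translation + n_non_translation
    return 100.0 * max(n_translation, n_non_translation) / total


# (translation, non-translation) counts taken from the datasets slide.
splits = {
    "training set": (156, 294),
    "test set": (52, 96),
    "MTS-MTSC": (36, 66),
    "TT-TTC": (14, 28),
}

for name, (n_trans, n_non) in splits.items():
    print(f"{name}: {majority_baseline(n_trans, n_non):.2f}%")
```

The four values come out as 65.33%, 64.86%, 64.71% and 66.67%, matching the Baseline rows, so every classifier above should be read against roughly a 65% floor rather than 50%.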
    • Decision Tree
        Features exploited in the categorisation task:
        • First level: Lexical Richness
        • Second level: Sentence Length (words/sentence); Grammatical words/Lexical words proportion
        • Third level: Pronoun proportion in texts; Conjunction proportion in texts
    • JRip Classifier Rules
        Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08) => class=translation
        Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46) and (sentenceLength >= 20.7) and (ratioNouns >= 0.33) => class=translation
        Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13) => class=translation
        Rule 4: => class=non-translation
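The rules above form an ordered list: the first rule whose conditions all hold decides the class, and Rule 4 is the default. A direct Python transcription is sketched below. The thresholds and feature names are copied from the slide; the function name `classify_with_rules` and the two example feature vectors are made up for illustration.

```python
def classify_with_rules(f):
    """Apply the four rules in order to one feature dictionary."""
    # Rule 1
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"
    # Rule 2
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"
    # Rule 3
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"
    # Rule 4: default class when no earlier rule fires
    return "non-translation"


# Hypothetical feature vectors, not taken from the corpora.
text_a = {"lexicalRichness": 0.15, "ratioFiniteVerbs": 0.07, "simpleSentences": 0.1,
          "wordLength": 2.6, "sentenceLength": 18.0, "ratioNouns": 0.30, "ratioPreps": 0.20}
text_b = {"lexicalRichness": 0.30, "ratioFiniteVerbs": 0.12, "simpleSentences": 0.1,
          "wordLength": 2.6, "sentenceLength": 18.0, "ratioNouns": 0.30, "ratioPreps": 0.20}

print(classify_with_rules(text_a))  # Rule 1 fires
print(classify_with_rules(text_b))  # falls through to Rule 4
```

Rules 1 and 2 both use simplification features (lexical richness, sentence types, word length, sentence length), which is in line with the deck's argument that these features carry signal.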
    • Attributes Ranking Filters

        Rank   Information Gain    Chi squared
        1      lexicalRichness     lexicalRichness
        2      grammsPerLexics     grammsPerLexics
        3      ratioFiniteVerbs    ratioFiniteVerbs
        4      ratioNumerals       ratioNumerals
        5      ratioAdjectives     ratioAdjectives
        6      sentenceLength      sentenceLength
        7      ratioProns          ratioProns
        8      simpleSentences     wordLength
        9      wordLength          simpleSentences
        10     grammaticalWords    zeroSentences
        11     zeroSentences       ratioNouns
        12     ratioNouns          lexicalWords
        ...    ...                 ...
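Information gain scores an attribute by how much knowing its value reduces the entropy of the class label. The Python sketch below illustrates the idea for a numeric attribute using a single threshold split. This is a simplification chosen for brevity: the slides do not describe how the numeric attributes were discretised, and the function names, values and threshold are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the texts at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical lexicalRichness values for four texts.
values = [0.10, 0.12, 0.30, 0.35]
labels = ["translation", "translation", "non-translation", "non-translation"]
print(information_gain(values, labels, threshold=0.16))
```

An attribute that separates the two classes cleanly gets the maximum gain for a balanced two-class problem (1 bit), while one that leaves both partitions as mixed as the whole set gets 0. Ranking attributes by this score gives a list like the Information Gain column above.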
    • Conclusions
        Summary
        • The learning system is able to distinguish between translated and non-translated text for the Spanish language.
        • On the technical dataset, accuracy reaches up to 97.62%.
        • Adding the features related to simplification increases the accuracy of the classifiers; for the SVM the improvement is statistically significant.
        • The results may be considered an argument for the existence of the Simplification Universal.
    • Thank you for your attention!