Introduction
Methodology
Evaluation
Conclusions
Identification of Translationese:
A Machine Learning Approach
Iustina Ilisei¹, Diana Inkpen², Gloria Corpas³ and
Ruslan Mitkov¹
¹ University of Wolverhampton, United Kingdom
² University of Ottawa, Canada
³ University of Malaga, Spain
CICLing 2010, Iasi, Romania
Iustina Ilisei, Diana Inkpen, Gloria Corpas, Ruslan Mitkov Identification of Translationese: A Machine Learning Approach
Outline
1 Introduction
Introduction in Translation Studies
Universals of Translation
Related Studies
Corpus-based Approach
Machine-Learning Approach
2 Methodology
Objective
Resources
Data Representation
3 Evaluation
Classification
Results Analysis
4 Conclusions
Introduction
Translationese Effect
Translations exhibit their own unnatural language, their
own peculiar lexico-grammatical and syntactic
characteristics (Gellerstam, 1986).
Translational language cannot avoid the effect of
translationese (Baker, 1993; Laviosa, 1997; McEnery &
Xiao, 2002, 2007).
Intrigue
Since two languages cannot be perfectly mapped onto each
other → a translated text and its original cannot be perfectly
matched
Language Universals in Translation
Mona Baker
“it will be necessary to develop tools that will enable us to
identify universal features of translation, that is features which
typically occur in translated text rather than original utterances
and which are not the result of interference from specific
linguistic systems”. (Baker, 1993:243)
Practical Perspective
a (self)assessment tool for translators
multilingual plagiarism detection
direction of translation detection can improve SMT
performance
Translation Universals
According to Baker (1993, 1996)
Simplification
Translations tend to be simpler and easier-to-follow
texts
Explicitation
Translations tend to spell things out rather than leave them
implicit
Convergence
Translations tend to be more similar than non-translations
Normalisation
Translations conform to patterns typical of the target
language, even to the point of exaggerating them
Related studies
Corpus-Based approach
S. Laviosa (2008)
In translations: a low proportion of lexical words to function words, a high
proportion of high-frequency words compared to low-frequency words,
relatively high repetition of the most frequent words, and less variety
in the most frequently used words
G. Corpas (2008)
Simplification confirmed for lexical richness, but contradicted in terms
of complex sentences, information load, sentence length, depth of
trees, and senses per word.
G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008)
Translations exhibit lower lexical density and richness, seem to be more
readable, have a smaller proportion of simple sentences, and use fewer
discourse markers.
Related studies: Machine-Learning Approach
Supervised Learning Approach
Baroni & Bernardini (2006) “A new approach to the study of
translationese: Machine Learning the difference between
original and translated texts”
SVM classifier distinguishes professional translations from
original texts with accuracy above the chance level
Depends heavily on lexical cues, the distribution of
n-grams of function words, morpho-syntactic categories,
personal pronouns and adverbs in general
Human accuracy is much lower than the accuracy of the
system
Aim of the Study
Objective
A language-independent learning system able to distinguish
between translated and non-translated texts.
To investigate the validity of the simplification
hypothesis.
To explore the characteristic features which most influence
translational language.
Methodology
Our assumption
if (the addition of the simplification features
    improves learning accuracy)
then this is an argument for the existence
    of the Simplification Universal
else “further research required”
Translational Corpora
Resources
Comparable corpora: translated texts vs. non-translated
texts
Spanish Monolingual Comparable Corpora
Medical Translations by professionals (MTP) vs.
Comparable Original Medical texts by professionals (MTPC)
Medical Translations by translation students (MTS) vs.
Comparable Original Medical texts by translation students
(MTSC)
Technical Translations by professionals (TT) vs. Comparable
Original Technical texts by professionals (TTC)
Datasets: training and testing
Training set
450 instances (156 translation class, 294 non-translation class)
Testing set
148 instances (52 translation class, 96 non-translation class)
Set pair one: MTP-MTPC (2 + 2 translation vs non-translation)
Set pair two: MTS-MTSC (36 + 66 translation vs non-translation)
Set pair three: TT-TTC (14 + 28 translation vs non-translation)
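The class counts above fix the accuracy of a majority-class predictor, which can be checked with a few lines of arithmetic. The sketch below assumes the Baseline reported in the results tables is such a predictor (the slides do not say so explicitly); the names SPLITS and majority_baseline are illustrative only.

```python
# Class counts (translation, non-translation) as listed on the slide.
SPLITS = {
    "training set": (156, 294),
    "testing set": (52, 96),
    "MTP-MTPC": (2, 2),
    "MTS-MTSC": (36, 66),
    "TT-TTC": (14, 28),
}


def majority_baseline(translated: int, original: int) -> float:
    """Accuracy (in %) of always predicting the larger class."""
    return 100.0 * max(translated, original) / (translated + original)


if __name__ == "__main__":
    for name, (t, o) in SPLITS.items():
        print(f"{name}: {t + o} texts, baseline {majority_baseline(t, o):.2f}%")
```

Under that assumption the training and testing sets give 65.33% and 64.86%, and the three set pairs add up to the 148 test instances.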
Data Representation
Data Representation without Simplification Features (DR - SF)
Proportion in each text of: grammatical words, nouns, finite
verbs, auxiliary verbs, adjectives, adverbs, numerals,
pronouns, prepositions, determiners, conjunctions,
grammatical words/lexical words ratio
Data Representation with Simplification Features (DR + SF)
All of the above (DR - SF) + simplification features
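A minimal sketch of how per-text proportions of this kind could be computed from part-of-speech-tagged tokens. The tag names, the split into grammatical and lexical tags, and the function pos_proportions are assumptions made for illustration; the slides do not specify the tagger or tagset used.

```python
from collections import Counter

# Hypothetical coarse tags; a real tagger's tagset would need mapping onto these.
GRAMMATICAL = {"PRON", "PREP", "DET", "CONJ", "AUX"}
LEXICAL = {"NOUN", "VERB", "ADJ", "ADV"}


def pos_proportions(tagged):
    """Map a list of (token, tag) pairs to per-text proportion features."""
    total = len(tagged)
    if total == 0:
        raise ValueError("text contains no tokens")
    counts = Counter(tag for _, tag in tagged)
    # One proportion per tag that occurs in the text.
    features = {f"ratio_{tag}": n / total for tag, n in counts.items()}
    grammatical = sum(counts[t] for t in GRAMMATICAL)
    lexical = sum(counts[t] for t in LEXICAL)
    features["grammaticalWords"] = grammatical / total
    # Grammatical words/lexical words ratio; 0.0 if the text has no lexical words.
    features["grammsPerLexics"] = grammatical / lexical if lexical else 0.0
    return features


example = [("el", "DET"), ("paciente", "NOUN"), ("recibe", "VERB"),
           ("un", "DET"), ("tratamiento", "NOUN"), ("nuevo", "ADJ")]
print(pos_proportions(example))
```

Each text becomes one fixed-length vector of such proportions, which is what makes the representation independent of text length.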
Simplification Features
Proposed features to capture simplicity in texts
Sentence Length: average number of words per sentence
Sentence Depth: the average of the maximum parse tree depth
per sentence in texts
Types of sentence: proportion of sentences without finite
verbs / simple sentences / complex sentences in texts
Ambiguity: average number of senses per word in texts
Word Length: average number of syllables per word in texts
Lexical Richness: ratio of lemma types to tokens in texts
Information Load: proportion of lexical words among tokens in texts
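Three of these features need nothing more than a lemmatised, sentence-split text, so they can be sketched directly. The example below is an illustration under stated assumptions: the input is already lemmatised, lexical words are approximated as anything outside a small hypothetical function-word list (FUNCTION_WORDS), and the features that need a parser, a sense inventory or a syllabifier are left out.

```python
# Hypothetical, deliberately tiny Spanish function-word list.
FUNCTION_WORDS = {"el", "la", "los", "las", "un", "una",
                  "de", "en", "y", "que", "a", "se"}


def simplification_features(sentences, function_words=FUNCTION_WORDS):
    """Compute three simplification features from a list of lemmatised sentences."""
    tokens = [lemma.lower() for sent in sentences for lemma in sent]
    if not tokens:
        raise ValueError("text contains no tokens")
    lexical = [t for t in tokens if t not in function_words]
    return {
        # Average number of words per sentence.
        "sentenceLength": len(tokens) / len(sentences),
        # Lemma types divided by tokens.
        "lexicalRichness": len(set(tokens)) / len(tokens),
        # Lexical words divided by tokens.
        "informationLoad": len(lexical) / len(tokens),
    }


text = [["el", "paciente", "recibir", "el", "tratamiento"],
        ["el", "tratamiento", "ser", "eficaz"]]
print(simplification_features(text))
```

The toy text has 9 tokens in 2 sentences and 6 distinct lemmas, so the sentence length is 4.5 and the lexical richness is 6/9.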
Classification Experiments
Experiments
Trained/tested on the entire dataset
Trained on the entire dataset and tested on separate test
datasets
Set MTS-MTSC (medical texts)
Set TT-TTC (technical texts)
Classification Experiments
                   Including Simplification Features     Excluding Simplification Features
Classifier         10-fold cross-validation   Test set   10-fold cross-validation   Test set
Baseline            65.33%                     64.86%     65.33%                     64.86%
Naive Bayes        *76.67%                     79.05%     69.33%                     75.00%
BayesNet            78.67%                     79.73%     75.11%                     77.03%
JRip                79.56%                     83.11%     73.33%                     77.03%
Decision Tree       78.22%                     81.76%     78.22%                     81.76%
Simple Logistic    *77.33%                     83.11%     71.11%                     80.41%
SVM                *79.11%                    *81.76%     69.33%                     73.65%
Meta-classifier    *80.00%                     87.16%     73.33%                     85.81%
Table: Classification results: accuracies for several classifiers
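The effect of the simplification features is easier to read as a per-classifier difference. The sketch below only re-expresses the accuracies from the table as gains in percentage points; RESULTS and gains are illustrative names, and no significance test is attempted.

```python
# Accuracies in % from the table, per classifier:
# (with SF, 10-fold CV), (with SF, test set),
# (without SF, 10-fold CV), (without SF, test set).
RESULTS = {
    "Naive Bayes": (76.67, 79.05, 69.33, 75.00),
    "BayesNet": (78.67, 79.73, 75.11, 77.03),
    "JRip": (79.56, 83.11, 73.33, 77.03),
    "Decision Tree": (78.22, 81.76, 78.22, 81.76),
    "Simple Logistic": (77.33, 83.11, 71.11, 80.41),
    "SVM": (79.11, 81.76, 69.33, 73.65),
    "Meta-classifier": (80.00, 87.16, 73.33, 85.81),
}


def gains(results):
    """Percentage-point gain from adding simplification features: (CV, test set)."""
    return {
        name: (round(cv_sf - cv_no, 2), round(test_sf - test_no, 2))
        for name, (cv_sf, test_sf, cv_no, test_no) in results.items()
    }


if __name__ == "__main__":
    for name, (cv_gain, test_gain) in gains(RESULTS).items():
        print(f"{name:16s} CV {cv_gain:+.2f}  test {test_gain:+.2f}")
```

On these figures the SVM gains the most in cross-validation, while the decision tree is unchanged in both settings.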
Classification Experiments
                   Including Simplification Features     Excluding Simplification Features
Classifier         MTS-MTSC          TT-TTC              MTS-MTSC          TT-TTC
Baseline            64.71%            66.67%              64.71%            66.67%
Naive Bayes         71.57%            95.24%              71.57%            80.95%
BayesNet            73.53%            97.62%              71.57%            92.86%
JRip                79.42%            95.24%              72.55%            92.86%
Decision Tree       77.45%            92.86%              75.49%            95.24%
Simple Logistic     77.45%            97.62%              79.41%            83.33%
SVM                 75.49%           *97.62%              74.51%            69.05%
Meta-classifier     82.35%            97.62%              78.43%            92.86%
Table: Classification accuracy results on the medical and technical
test datasets.
Decision Tree
Features exploited in the categorisation task:
First level
Lexical Richness
Second level
Sentence Length (words/sentence)
Grammatical words/Lexical words proportion
Third level
Pronoun proportion in texts
Conjunction proportion in texts
JRip Classifier Rules
Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08) => class=translation
Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46) and (sentenceLength >= 20.7) and (ratioNouns >= 0.33) => class=translation
Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13) => class=translation
Rule 4: => class=non-translation
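The rules form an ordered list: the first rule whose conditions hold decides the class, and Rule 4 is the default. A minimal sketch of that reading follows; the thresholds are copied from the rules above, while the function name classify and the two feature vectors are hypothetical.

```python
def classify(f):
    """Apply the four rules in order; the first rule that fires decides."""
    # Rule 1
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"
    # Rule 2
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"
    # Rule 3
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"
    # Rule 4: default class
    return "non-translation"


# Hypothetical feature vectors, not taken from the corpora.
low_richness = {"lexicalRichness": 0.15, "ratioFiniteVerbs": 0.07,
                "simpleSentences": 0.2, "wordLength": 2.5,
                "sentenceLength": 18.0, "ratioNouns": 0.3, "ratioPreps": 0.2}
high_richness = dict(low_richness, lexicalRichness=0.30, ratioFiniteVerbs=0.12)

print(classify(low_richness))
print(classify(high_richness))
```

The first vector satisfies Rule 1; the second fails all three translation rules and falls through to the default.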
Attributes Ranking Filters
Information Gain     Chi squared
lexicalRichness      lexicalRichness
grammsPerLexics      grammsPerLexics
ratioFiniteVerbs     ratioFiniteVerbs
ratioNumerals        ratioNumerals
ratioAdjectives      ratioAdjectives
sentenceLength       sentenceLength
ratioProns           ratioProns
simpleSentences      wordLength
wordLength           simpleSentences
grammaticalWords     zeroSentences
zeroSentences        ratioNouns
ratioNouns           lexicalWords
.....                .....
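For readers unfamiliar with the first filter, information gain measures how much knowing a feature reduces the entropy of the class label. The sketch below scores a numeric feature by a single fixed threshold, which is a simplifying assumption; the slides do not say how the ranking tool discretised the features, and the sample values are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the texts at value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Invented lexicalRichness values: T = translation, N = non-translation.
richness = [0.12, 0.14, 0.15, 0.22, 0.25, 0.30]
classes = ["T", "T", "T", "N", "N", "N"]
print(information_gain(richness, classes, threshold=0.16))
```

A threshold that separates the two classes perfectly yields the full label entropy as gain; a threshold that leaves all texts on one side yields zero.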
Conclusions
Summary
A learning system able to distinguish between translated
and non-translated texts in Spanish.
On the technical dataset, accuracy reaches 97.62%.
Adding the features related to simplification increases
the accuracy of the classifiers; for the SVM the improvement
is statistically significant.
The results may be considered an argument for the
existence of the Simplification Universal.
Thank you for your attention!
Iustina Ilisei, Diana Inkpen, Gloria Corpas, Ruslan Mitkov Identification of Translationese: A Machine Learning Approach

More Related Content

What's hot

Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languageshs0041
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...inscit2006
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Saurabh Kaushik
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
 
Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—
Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—
Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—Yu Tamura
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationElaheh Barati
 
Using translog to investigate self correctionsin translation
Using translog to investigate self  correctionsin translationUsing translog to investigate self  correctionsin translation
Using translog to investigate self correctionsin translationRusdi Noor Rosa
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...Lifeng (Aaron) Han
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology miningEstelle Delpech
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
 
Validation of the grammatical carefulness scale using a discourse completion ...
Validation of the grammatical carefulness scale using a discourse completion ...Validation of the grammatical carefulness scale using a discourse completion ...
Validation of the grammatical carefulness scale using a discourse completion ...Yu Tamura
 
Learning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryLearning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryRoelof Pieters
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageRoelof Pieters
 

What's hot (20)

Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languages
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Language models
Language modelsLanguage models
Language models
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—
Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—
Word Frequency Effects and Plurality in L2 Word Recognition—A Preliminary Study—
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text Summarization
 
Using translog to investigate self correctionsin translation
Using translog to investigate self  correctionsin translationUsing translog to investigate self  correctionsin translation
Using translog to investigate self correctionsin translation
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
 
Bilingual terminology mining
Bilingual terminology miningBilingual terminology mining
Bilingual terminology mining
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
 
Validation of the grammatical carefulness scale using a discourse completion ...
Validation of the grammatical carefulness scale using a discourse completion ...Validation of the grammatical carefulness scale using a discourse completion ...
Validation of the grammatical carefulness scale using a discourse completion ...
 
Learning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryLearning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionary
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 

Similar to Identification of Translationese: A Machine Learning Approach

Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to HindiRajat Jain
 
Method: Approach, Design, Procedure
Method: Approach, Design, ProcedureMethod: Approach, Design, Procedure
Method: Approach, Design, Proceduremahdihasanpour66
 
A Corpus-based Study of EFL Learners Errors in IELTS Essay Writing.pdf
A Corpus-based Study of EFL Learners  Errors in IELTS Essay Writing.pdfA Corpus-based Study of EFL Learners  Errors in IELTS Essay Writing.pdf
A Corpus-based Study of EFL Learners Errors in IELTS Essay Writing.pdfSarah Marie
 
1315 estella ma_motorlearning
1315 estella ma_motorlearning1315 estella ma_motorlearning
1315 estella ma_motorlearningTian Stella
 
The effect of authentic/inauthentic materials in EFL classroom
The effect of authentic/inauthentic materials in EFL classroomThe effect of authentic/inauthentic materials in EFL classroom
The effect of authentic/inauthentic materials in EFL classroomfirdausabdmunir85
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indianeSAT Publishing House
 
Automated Language Assessment Scoring and impact on instruction
Automated Language Assessment Scoring and impact on instructionAutomated Language Assessment Scoring and impact on instruction
Automated Language Assessment Scoring and impact on instructiontfarny
 
Corpus study design
Corpus study designCorpus study design
Corpus study designbikashtaly
 
Translation and Semantics.pdf
Translation and Semantics.pdfTranslation and Semantics.pdf
Translation and Semantics.pdfAhmedMoneus2
 
Support for foreign language listeners: Its effectiveness and limitations
Support for foreign language listeners: Its effectiveness and limitationsSupport for foreign language listeners: Its effectiveness and limitations
Support for foreign language listeners: Its effectiveness and limitationsCindy Shen
 
Interpreters and Emotional Intelligence How do we use it and why does it matter?
Interpreters and Emotional Intelligence How do we use it and why does it matter?Interpreters and Emotional Intelligence How do we use it and why does it matter?
Interpreters and Emotional Intelligence How do we use it and why does it matter?Diana Singureanu
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficultiesijtsrd
 
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxSyedNadeemAbbas6
 
Article - An Annotated Translation of How to Succeed as a Freelance Translato...
Article - An Annotated Translation of How to Succeed as a Freelance Translato...Article - An Annotated Translation of How to Succeed as a Freelance Translato...
Article - An Annotated Translation of How to Succeed as a Freelance Translato...Cynthia Velynne
 
NLP applicata a LIS
NLP applicata a LISNLP applicata a LIS
NLP applicata a LISnoemiricci2
 
Translation coverage
Translation coverageTranslation coverage
Translation coverageRudi Hartono
 
Researching Multilingually: Possibilities and Complexities
Researching Multilingually:  Possibilities and Complexities Researching Multilingually:  Possibilities and Complexities
Researching Multilingually: Possibilities and Complexities RMBorders
 
Natural Language Processing and Language Learning
Natural Language Processing and Language LearningNatural Language Processing and Language Learning
Natural Language Processing and Language Learningantonellarose
 

Similar to Identification of Translationese: A Machine Learning Approach (20)

Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to Hindi
 
Method: Approach, Design, Procedure
Method: Approach, Design, ProcedureMethod: Approach, Design, Procedure
Method: Approach, Design, Procedure
 
A Corpus-based Study of EFL Learners Errors in IELTS Essay Writing.pdf
A Corpus-based Study of EFL Learners  Errors in IELTS Essay Writing.pdfA Corpus-based Study of EFL Learners  Errors in IELTS Essay Writing.pdf
A Corpus-based Study of EFL Learners Errors in IELTS Essay Writing.pdf
 
1315 estella ma_motorlearning
1315 estella ma_motorlearning1315 estella ma_motorlearning
1315 estella ma_motorlearning
 
The effect of authentic/inauthentic materials in EFL classroom
The effect of authentic/inauthentic materials in EFL classroomThe effect of authentic/inauthentic materials in EFL classroom
The effect of authentic/inauthentic materials in EFL classroom
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
Automated Language Assessment Scoring and impact on instruction
Automated Language Assessment Scoring and impact on instructionAutomated Language Assessment Scoring and impact on instruction
Automated Language Assessment Scoring and impact on instruction
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
 
RusLTC at TSD-2014 (Brno)
RusLTC at TSD-2014 (Brno)RusLTC at TSD-2014 (Brno)
RusLTC at TSD-2014 (Brno)
 
Translation and Semantics.pdf
Translation and Semantics.pdfTranslation and Semantics.pdf
Translation and Semantics.pdf
 
Support for foreign language listeners: Its effectiveness and limitations
Support for foreign language listeners: Its effectiveness and limitationsSupport for foreign language listeners: Its effectiveness and limitations
Support for foreign language listeners: Its effectiveness and limitations
 
Interpreters and Emotional Intelligence How do we use it and why does it matter?
Interpreters and Emotional Intelligence How do we use it and why does it matter?Interpreters and Emotional Intelligence How do we use it and why does it matter?
Interpreters and Emotional Intelligence How do we use it and why does it matter?
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
 
Article - An Annotated Translation of How to Succeed as a Freelance Translato...
Article - An Annotated Translation of How to Succeed as a Freelance Translato...Article - An Annotated Translation of How to Succeed as a Freelance Translato...
Article - An Annotated Translation of How to Succeed as a Freelance Translato...
 
NLP applicata a LIS
NLP applicata a LISNLP applicata a LIS
NLP applicata a LIS
 
Translation coverage
Translation coverageTranslation coverage
Translation coverage
 
The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...
 
Researching Multilingually: Possibilities and Complexities
Researching Multilingually:  Possibilities and Complexities Researching Multilingually:  Possibilities and Complexities
Researching Multilingually: Possibilities and Complexities
 
Natural Language Processing and Language Learning
Natural Language Processing and Language LearningNatural Language Processing and Language Learning
Natural Language Processing and Language Learning
 

Identification of Translationese: A Machine Learning Approach

  • 1. Identification of Translationese: A Machine Learning Approach
       Iustina Ilisei (1), Diana Inkpen (2), Gloria Corpas (3) and Ruslan Mitkov (1)
       (1) University of Wolverhampton, United Kingdom; (2) University of Ottawa, Canada; (3) University of Malaga, Spain
       CICLing 2010, Iasi, Romania
  • 2. Outline
       1 Introduction: Introduction in Translation Studies; Universals of Translation; Related Studies (Corpus-based Approach, Machine-Learning Approach)
       2 Methodology: Objective; Resources; Data Representation
       3 Evaluation: Classification; Results Analysis
       4 Conclusions
  • 3. Introduction
       Translationese Effect: Translations exhibit their own unnatural language, with their own peculiar lexico-grammatical and syntactic characteristics (Gellerstam, 1986). Translational language cannot avoid the effect of translationese (Baker, 1993; Laviosa, 1997; McEnery & Xiao, 2002, 2007).
       Intrigue: Because two languages cannot be perfectly mapped onto each other, a translated text and its original cannot be perfectly matched.
  • 4. Language Universals in Translation
       Mona Baker: “it will be necessary to develop tools that will enable us to identify universal features of translation, that is features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems”. (Baker, 1993:243)
       Practical Perspective: a (self-)assessment tool for translators; multilingual plagiarism detection; direction of translation detection can improve SMT performance.
  • 5. Translation Universals According to Baker (1993, 1996)
       Simplification: Translations tend to be simpler and easier-to-follow texts.
       Explicitation: Translations tend to spell things out rather than leave them implicit.
       Convergence: Translations tend to be more similar to each other than non-translations are.
       Normalisation: Translations conform to patterns typical of the target language, even to the point of exaggerating them.
  • 6. Related studies: Corpus-Based Approach
       S. Laviosa (2008): In translations, a low proportion of lexical words over function words, a high proportion of high-frequency words compared to low-frequency words, a relatively great repetition of the most frequent words, and less variety in the most frequently used words.
       G. Corpas (2008): Simplification confirmed for lexical richness, and contradicted in terms of complex sentences, information load, sentence length, depth of trees, and senses per word.
       G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008): Translations exhibit lower lexical density and richness, seem to be more readable, have a smaller proportion of simple sentences, and use fewer discourse markers.
  • 7. Related studies: Machine-Learning Approach
       Supervised Learning Approach: Baroni & Bernardini (2006), “A new approach to the study of translationese: Machine Learning the difference between original and translated texts”
       An SVM classifier distinguishes professional translations from original texts with accuracy above the chance level.
       The classifier depends heavily on lexical cues, the distribution of n-grams of function words, morpho-syntactic categories, personal pronouns, and adverbs in general.
       Human accuracy is much lower than the accuracy of the system.
  • 8. Aim of the Study
       Objective:
       To build a language-independent learning system able to distinguish between translated and non-translated texts.
       To investigate the validity of the simplification hypothesis.
       To explore the characteristic features which most influence translational language.
  • 9. Methodology
       Our assumption:
       if (addition of the simplification features improves learning accuracy)
       then this is an argument towards the existence of the Simplification Universal
       else “further research required”
  • 10. Translational Corpora
       Resources: comparable corpora of translated texts vs. non-translated texts
       Spanish Monolingual Comparable Corpora:
       Medical Translations by professionals (MTP) vs. Comparable Original Medical texts by professionals (MTPC)
       Medical Translations by translation students (MTS) vs. Comparable Original Medical texts by translation students (MTSC)
       Technical Translations by professionals (TT) vs. Comparable Original Technical texts by professionals (TTC)
  • 11. Datasets: training and testing
       Training set: 450 instances (156 translation class, 294 non-translation class)
       Testing set: 148 instances (52 translation class, 96 non-translation class)
       Set pair one: MTP-MTPC (2 + 2 translation vs. non-translation)
       Set pair two: MTS-MTSC (36 + 66 translation vs. non-translation)
       Set pair three: TT-TTC (14 + 28 translation vs. non-translation)
  • 12. Data Representation
       Data Representation without Simplification Features (DR - SF): proportion in each text of grammatical words, nouns, finite verbs, auxiliary verbs, adjectives, adverbs, numerals, pronouns, prepositions, determiners, and conjunctions, plus the grammatical words/lexical words ratio.
       Data Representation with Simplification Features (DR + SF): all of the above (DR - SF) plus the simplification features.
  • 13. Simplification Features
       Proposed features to capture simplicity in texts:
       Sentence Length: average number of words per sentence
       Sentence Depth: average of the maximum parse tree depth per sentence in texts
       Types of sentence: proportion of sentences without finite verbs / simple sentences / complex sentences in texts
       Ambiguity: average number of senses per word in texts
       Word Length: average number of syllables per word in texts
       Lexical Richness: proportion of type lemmas per tokens in texts
       Information Load: proportion of lexical words per tokens in texts
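Three of the features above can be sketched in a few lines of Python. This is a rough illustration only, not the authors' pipeline: the regex tokeniser, the naive sentence splitter, and the tiny `FUNCTION_WORDS` list are all assumptions, and lexical richness is approximated over word forms rather than lemmas because no lemmatiser is used here.

```python
import re

# Hypothetical, deliberately tiny list of Spanish grammatical (function) words.
# A real system would use a POS tagger to separate lexical from grammatical words.
FUNCTION_WORDS = {"el", "la", "los", "las", "un", "una", "de", "del",
                  "en", "y", "o", "que", "a", "por", "con", "se", "es"}


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased alphabetic tokens; punctuation and digits are dropped."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def simplification_features(text):
    """Return sentence length, lexical richness and information load for a text."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    tokens = [t for s in sentences for t in s]
    if not tokens:
        raise ValueError("text contains no word tokens")
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        # average number of words per sentence
        "sentenceLength": len(tokens) / len(sentences),
        # types per token, computed on word forms instead of lemmas
        "lexicalRichness": len(set(tokens)) / len(tokens),
        # lexical words per token
        "informationLoad": len(lexical) / len(tokens),
    }


features = simplification_features("El gato come. El perro come pan.")
print(features)
```

The remaining features (parse tree depth, senses per word, syllables per word, sentence types) need a parser, a sense inventory, and a syllabifier, so they are left out of this sketch.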
  • 14. Classification Experiments
       Experiments:
       Trained/tested on the entire dataset
       Trained on the entire dataset and tested on separate test datasets:
       Set MTS-MTSC (medical texts)
       Set TT-TTC (technical texts)
  • 15. Classification Experiments
       Table: Classification Results: Accuracies for several classifiers

                          Including Simplification Features       Excluding Simplification Features
       Classifier         10-fold cross-validation   Test set     10-fold cross-validation   Test set
       Baseline           65.33%                     64.86%       65.33%                     64.86%
       Naive Bayes        *76.67%                    79.05%       69.33%                     75.00%
       BayesNet           78.67%                     79.73%       75.11%                     77.03%
       Jrip               79.56%                     83.11%       73.33%                     77.03%
       Decision Tree      78.22%                     81.76%       78.22%                     81.76%
       Simple Logistic    *77.33%                    83.11%       71.11%                     80.41%
       SVM                *79.11%                    *81.76%      69.33%                     73.65%
       Meta-classifier    *80.00%                    87.16%       73.33%                     85.81%
  • 16. Classification Experiments
       Table: Classification accuracy results on the medical and technical test datasets.

                          Including Simplification Features   Excluding Simplification Features
       Classifier         MTS-MTSC     TT-TTC                 MTS-MTSC     TT-TTC
       Baseline           64.71%       66.67%                 64.71%       66.67%
       Naive Bayes        71.57%       95.24%                 71.57%       80.95%
       BayesNet           73.53%       97.62%                 71.57%       92.86%
       Jrip               79.42%       95.24%                 72.55%       92.86%
       Decision Tree      77.45%       92.86%                 75.49%       95.24%
       Simple Logistic    77.45%       97.62%                 79.41%       83.33%
       SVM                75.49%       *97.62%                74.51%       69.05%
       Meta-classifier    82.35%       97.62%                 78.43%       92.86%
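The Baseline rows in both results tables are consistent with a majority-class baseline, that is, always predicting "non-translation". The slides do not state how the baseline was computed, so this reading is an inference; the short check below reproduces the reported figures from the class counts on the datasets slide. The function name `majority_baseline` is ours.

```python
def majority_baseline(n_translation, n_non_translation):
    """Accuracy (in %) obtained by always predicting the larger class."""
    total = n_translation + n_non_translation
    return 100 * max(n_translation, n_non_translation) / total


# (translation, non-translation) instance counts from the datasets slide
datasets = {
    "training set": (156, 294),
    "full test set": (52, 96),
    "MTS-MTSC test set": (36, 66),
    "TT-TTC test set": (14, 28),
}

for name, (n_t, n_nt) in datasets.items():
    print(f"{name}: {majority_baseline(n_t, n_nt):.2f}%")
```

The four printed values match the baseline accuracies in the tables (65.33%, 64.86%, 64.71%, and 66.67%), which is why any classifier of interest has to be read against roughly 65% rather than 50%.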
  • 17. Decision Tree
       Features exploited in the categorisation task, by tree level:
       First level: Lexical Richness
       Second level: Sentence Length (words/sentence); Grammatical words/Lexical words proportion
       Third level: Pronoun proportion in texts; Conjunction proportion in texts
  • 18. JRip Classifier Rules
       Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08) => class=translation
       Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46) and (sentenceLength >= 20.7) and (ratioNouns >= 0.33) => class=translation
       Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13) => class=translation
       Rule 4: => class=non-translation
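The rule list reads as an ordered decision list: rules are tried top to bottom, the first one that fires decides the class, and Rule 4 is the default. A minimal Python sketch of that reading follows. The thresholds are copied from the slide; the function name `classify` and the two feature vectors are hypothetical and serve only to exercise the rules.

```python
def classify(f):
    """Apply the four rules in order; the first matching rule decides the class."""
    # Rule 1
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"
    # Rule 2
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"
    # Rule 3
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"
    # Rule 4: default class
    return "non-translation"


# Hypothetical feature vectors, not taken from the corpora
doc_a = {"lexicalRichness": 0.15, "ratioFiniteVerbs": 0.07, "simpleSentences": 0.2,
         "wordLength": 2.6, "sentenceLength": 18.0, "ratioNouns": 0.30, "ratioPreps": 0.15}
doc_b = {"lexicalRichness": 0.25, "ratioFiniteVerbs": 0.12, "simpleSentences": 0.2,
         "wordLength": 2.6, "sentenceLength": 18.0, "ratioNouns": 0.30, "ratioPreps": 0.15}

print(classify(doc_a))  # fires Rule 1
print(classify(doc_b))  # falls through to Rule 4
```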
  • 19. Attributes Ranking Filters

       Rank   Information Gain    Chi squared
       1      lexicalRichness     lexicalRichness
       2      grammsPerLexics     grammsPerLexics
       3      ratioFiniteVerbs    ratioFiniteVerbs
       4      ratioNumerals       ratioNumerals
       5      ratioAdjectives     ratioAdjectives
       6      sentenceLength      sentenceLength
       7      ratioProns          ratioProns
       8      simpleSentences     wordLength
       9      wordLength          simpleSentences
       10     grammaticalWords    zeroSentences
       11     zeroSentences       ratioNouns
       12     ratioNouns          lexicalWords
       ...    .....               .....
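For readers unfamiliar with the Information Gain filter, the sketch below shows how the score of a single numeric attribute can be computed for one binary split. It is illustrative only: the slides do not say how numeric attributes were discretised, so the fixed `threshold` is an assumption, and the attribute values and labels are toy data rather than corpus measurements.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Toy data: three translations (T) and three non-translations (N)
labels = ["T", "T", "T", "N", "N", "N"]
separating = [0.12, 0.14, 0.15, 0.22, 0.25, 0.30]   # splits the classes cleanly
uninformative = [0.1, 0.2, 0.1, 0.2, 0.1, 0.2]      # barely related to the class

gain_a = information_gain(separating, labels, threshold=0.16)
gain_b = information_gain(uninformative, labels, threshold=0.15)
print(gain_a, gain_b)
```

An attribute that separates the classes cleanly receives a higher gain than one that does not, and ranking attributes by this score gives an ordering of the kind shown in the Information Gain column.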
  • 20. Conclusions
       Summary:
       A learning system able to distinguish between translated and non-translated text for the Spanish language.
       On a technical dataset, the accuracy reaches up to 97.62%.
       The addition of the features related to simplification leads to an increased accuracy of the classifiers: SVM reports a statistically significant improvement.
       The results may be considered as an argument for the existence of the Simplification Universal.
  • 21. Thank you for your attention!