Variation-influenced quality for MT: General vs.
specialised corpora

Alejandro Curado
Martín Garay
University of Extremadura, Spain

Theoretical background / method: Context-based MT
> Data retrieved from a massive corpus: a lot of data to compare n-grams (the more context, the better the correspondences)
> No need for a parallel corpus (unlike SMT, which translates from aligned / parallel corpora)
> Best reported score on the BLEU (Bilingual Evaluation Understudy) scale for this kind of MT: Carbonell, 2006 ("Meaningful Machines") = 0.66 in a "blind test", against 0.7 for human translation, with 53 GB of target text
> General translation (vs. specialised translation?)

Resources:
• English dictionary table with 200,000 entries (single and compound words / idioms).
• Spanish dictionary table with more than 5,000,000 entries.
• Large general numerical corpus that may reach up to 100 GB (end of July 2010), indexed by text, sentence, and word.

Improve / increase resources:
• Dictionary: web pages have been developed to:
1. Add all missing word units (with their equivalents)
2. Add word meanings that are not yet in the dictionaries (WordReference)

Improve / increase resources:
• The large corpus.
• Indexing:
1. Books on the web.
2. Wikipedia.
3. Sketch Engine-retrieved texts (seed keywords).

Types of Spanish corpora used for the translation tests:
• The large corpus (late May 2010): nearly 73 million words, 11,490 texts, 3,900,000 sentences (+ 1,256 news texts indexed in June = +4 million words)
• Experiment corpus (1) with apartment / housing ads (March 2010): 70 texts, 5,455 sentences, 87,353 words
• Experiment corpus (2) with international news (June 2010): 286 texts, 2,791 sentences, 125,936 words



Translation procedure.

1st step. Inserting the sentence or text.

The nice big house is located near the sea.



2nd step. Dividing the text into phrases / sentences.
• The segmentation is carried out using the following punctuation symbols: . ; : ¿? ¡!
• In our case: The nice big house is located near the sea.
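The segmentation step can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function name `split_segments` and the rule of cutting after a closing mark followed by whitespace are assumptions.

```python
import re

# Closing marks taken from the slide: . ; : ? !
# Assumption: the opening Spanish marks (¿ ¡) begin a segment, so only the
# closing marks are treated as boundaries here.
_BOUNDARY = re.compile(r"(?<=[.;:?!])\s+")


def split_segments(text):
    """Split text after a closing punctuation mark that is followed by whitespace."""
    return [seg.strip() for seg in _BOUNDARY.split(text.strip()) if seg.strip()]


segments = split_segments(
    "The nice big house is located near the sea. Is it big? Yes; very big."
)
print(segments)
```

The example sentence from the slides contains a single full stop, so it comes back as one segment.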



3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.

Word    The    nice   big   house  is    located  near   the    sea
Number  44634  30497  6962  22817  3456  27139    30255  44634  39064
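A minimal sketch of this lookup, assuming the dictionary is a plain in-memory mapping. The name `ENGLISH_IDS`, the lower-casing, and the handling of unknown words are assumptions; only the numbers come from the slide, and compound units / idioms are not handled.

```python
from typing import List, Optional

# IDs copied from the slide. The real table is described as having
# 200,000 entries; this toy mapping covers the example sentence only.
ENGLISH_IDS = {
    "the": 44634, "nice": 30497, "big": 6962, "house": 22817, "is": 3456,
    "located": 27139, "near": 30255, "sea": 39064,
}


def encode(sentence: str) -> List[Optional[int]]:
    """Map each word to its dictionary number (None if the word is unknown)."""
    words = sentence.lower().rstrip(".").split()
    return [ENGLISH_IDS.get(word) for word in words]


print(encode("The nice big house is located near the sea."))
```

Lower-casing matches what the slide shows: "The" and "the" share the number 44634.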



4th step. We remove the function / nexus words (statistically the most frequent words in the language) from the sentence and store them in a separate table.

Sentence: The nice big house is located near the sea.
Final phrase: nice big house located sea.
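As an illustration of this step, the sketch below separates content words from function words and keeps the removed words with their positions so they can be restored later. The set `FUNCTION_WORDS` is hypothetical (the slides derive it from frequency statistics), and so is the position-based "separate table".

```python
# Hypothetical function-word list for the example sentence only.
FUNCTION_WORDS = {"the", "is", "near"}


def strip_function_words(tokens):
    """Return (content_words, removed), where removed holds (position, word) pairs."""
    content = []
    removed = []  # stands in for the "separate table" mentioned on the slide
    for position, token in enumerate(tokens):
        if token.lower() in FUNCTION_WORDS:
            removed.append((position, token))
        else:
            content.append(token)
    return content, removed


tokens = "The nice big house is located near the sea".split()
content, removed = strip_function_words(tokens)
print(content)
print(removed)
```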



5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.
Restriction in the tests: only two Spanish equivalents per word.

6th step. Each n-gram is divided into sub-n-grams (different combinations of the correspondences), which are then sent to the corpus.

N-gram: nice big house located sea.

Sub-n-gram  Words                     Numbers
1           bonito gran casa situado  1043795 284672 839170 1098037
2           bonito gran casa situada  1043795 284672 839170 1098063
...
n           bonita gran casa situada  1043794 284672 839170 1098063
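One way to enumerate such combinations is a Cartesian product over the equivalents of each content word. The sketch below assumes that reading. The `EQUIVALENTS` mapping lists only the Spanish forms that appear on the slide, and the cap of two equivalents per word follows the restriction stated in the 5th step.

```python
from itertools import product

# Spanish equivalents as they appear on the slide (illustrative, not the
# full dictionary entries).
EQUIVALENTS = {
    "nice": ["bonito", "bonita"],
    "big": ["gran"],
    "house": ["casa"],
    "located": ["situado", "situada"],
}


def sub_ngrams(ngram, max_equivalents=2):
    """Return every combination of equivalents, one per source word."""
    options = [EQUIVALENTS[word][:max_equivalents] for word in ngram]
    return [list(combo) for combo in product(*options)]


candidates = sub_ngrams(["nice", "big", "house", "located"])
for number, candidate in enumerate(candidates, start=1):
    print(number, " ".join(candidate))
```

With two equivalents for two of the four words, this yields four candidates, running from "bonito gran casa situado" to "bonita gran casa situada".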



7th step. A score is given to each result obtained; each sub-n-gram thus receives a final score and is ranked by that score.

Sub-n-gram  Words                     Score
1           bonito gran casa situado  2.5
2           bonito gran casa situada  3.1
n           bonita gran casa situada  7

- Parameters that decide the score given to each sub-n-gram:
• Number of needed words found in the sentence.
• Distance found between the words.
• Number of needed words found together inside the sentence.
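The slides do not give the scoring formula, so the following is only a toy illustration of how the three parameters could be combined. The function `score_sentence`, the weights, and the gap penalty are all assumptions, and the resulting numbers are not meant to reproduce the scores shown above.

```python
def score_sentence(needed, sentence):
    """Toy score for one corpus sentence against the words of a sub-n-gram.

    Combines: how many needed words occur, the longest run of needed words
    at consecutive positions, and a penalty for gaps between them.
    """
    unique_needed = list(dict.fromkeys(needed))  # drop repeats, keep order
    positions = sorted(sentence.index(w) for w in unique_needed if w in sentence)
    found = len(positions)
    if found == 0:
        return 0.0
    # Positions skipped between the first and last needed word.
    extra_gap = (positions[-1] - positions[0]) - (found - 1)
    # Longest run of needed words sitting next to each other.
    best_run = run = 1
    for previous, current in zip(positions, positions[1:]):
        run = run + 1 if current == previous + 1 else 1
        best_run = max(best_run, run)
    return found + best_run - 0.5 * extra_gap  # hypothetical weights


needed = ["bonita", "gran", "casa", "situada"]
close_match = score_sentence(needed, "la bonita gran casa situada cerca del mar".split())
loose_match = score_sentence(needed, "una casa muy bonita".split())
print(close_match, loose_match)
```

A sentence that contains all four words side by side scores higher than one that contains two of them separated by another word, which is the ordering the three parameters suggest.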

8th step. Scoring the n-gram in relation to the best scores obtained by the sub-n-grams.

N-gram: nice big house located (score: 30)

Sub-n-gram  Words                     Score
1           bonito gran casa situado  2.5
2           bonito gran casa situada  3.1
n           bonita gran casa situada  7

N-gram: big house located sea. (score: 50)



9th step. Combining the n-grams integrated in the sentence / text.

nice big house located sea.

Parameters for the combination / overlapping:
• Scoring the texts that repeat for the n-grams
• Scoring the sentences that repeat for the n-grams
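The two n-grams scored in the 8th step ("nice big house located" and "big house located sea") overlap by three words. The sketch below shows one simple way to produce such overlapping windows and to stitch them back together. The window size of 4 is inferred from the slides, and both function names are hypothetical. The text- and sentence-repetition scoring listed above is not modelled here.

```python
def content_ngrams(words, n=4):
    """Return sliding windows of n content words (one window if fewer words)."""
    if len(words) <= n:
        return [list(words)]
    return [list(words[i:i + n]) for i in range(len(words) - n + 1)]


def merge_overlapping(ngrams):
    """Join consecutive n-grams on their longest shared suffix/prefix."""
    merged = list(ngrams[0])
    for following in ngrams[1:]:
        k = min(len(merged), len(following))
        # Shrink k until the tail of merged equals the head of the next n-gram.
        while k > 0 and merged[-k:] != following[:k]:
            k -= 1
        merged.extend(following[k:])
    return merged


windows = content_ngrams(["nice", "big", "house", "located", "sea"])
print(windows)
print(merge_overlapping(windows))
```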



10th step. We add the function words previously removed from the sentence. We search for these words in the best sub-n-grams used.

The nice big house is located near the sea.



11th step. Obtaining the translated sentence.
The nice big house is located near the sea.

La gran y bonita casa está situada cerca del mar
In the housing ads (first specialised corpus):
The nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar.
The white house has an old gate that is broken and ugly.
La casa blanca tiene una vieja verja averiada y fea.

• Time used by the system: 0.98 seconds / 1.2 seconds
In the large corpus (end of May):
The nice big house is located near the sea.
La casa grande está cerca del mar.
The white house has an old gate that is broken and ugly.
La casa blanca tiene una valla vieja que se rompe y es fea
• Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds
Other problems in the large corpus (June: + news):
The director checked the mail and said he had no new mail
Nuevos directores comprobaban correo y la dijo no hay correo
The salesperson decided to stop doing business with them
El vendedor decidió parar a hacer negocios con ellos
• Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds
Some linguistic / technical conclusions:
> Data retrieved from the massive corpus: important for obtaining more common phrases / familiar expressions / overlapping connectors
> Data retrieved from the specialised corpus: important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections
< Problems: verb agreement in indirect clauses? / Lower probabilities for open content combinations (e.g., new + mail)
< Ever-important need to improve dictionary entries for the general corpus
< Scores according to context: texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)
