Normalization of translation memories - TAUS User Conference 2009 - Presentation Transcript
USER CONFERENCE 2009 BEFORE MT
Normalization of translation memories/training data for MT
Moderator: Karen R. Combe, PTC
Ryan Martin, Intel
Chris Wendt, Microsoft
William Wong, Language Weaver
Olga Beregovaya, ProMT
Agenda
TM Data for MT training
Examples – clean or normalize?
PTC and Intel
Solutions/suggestions from technology providers
Microsoft
ProMT
Data Examples
Issue: Excessive number of internal tags
French: Pour effectuer la plupart de ces tâches, vous pouvez utiliser {1}{2} Fichier (File) {3}{4} Traitement des instances (Instance Operations) {5}{6} Actualiser l'index (Update Index) {7}{8} ou {9}{10} Fichier (File) {11}{12} Traitement des instances (Instance Operations) {13}{14} Options d'accélérateur (Accelerator Options) {15}{16} afin d'ouvrir la boîte de dialogue {17} Accélérateur d'instances (Instance Accelerator) {18}
English: You can use {1}{2} File {3}{4} Instance Operations {5}{6} Update Index {7}{8}{9}{10} File {11}{12} Instance Operations {13} {14} Accelerator Options {15}{16} (which opens the {17} Instance Accelerator {18} dialog box) to perform most instance operations.
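A minimal sketch of how segments with an excessive number of internal tags might be flagged, assuming the tags appear as numbered {n} placeholders as in the example above. The function names and the threshold of 10 are hypothetical and do not come from the presentation.

```python
import re

# Matches numbered placeholders such as {1}, {2}, ... {18}
PLACEHOLDER = re.compile(r"\{\d+\}")

def count_placeholders(segment: str) -> int:
    """Count the internal-tag placeholders in one segment."""
    return len(PLACEHOLDER.findall(segment))

def has_excessive_tags(source: str, target: str, max_tags: int = 10) -> bool:
    """Flag a unit when either side carries more placeholders than max_tags.

    max_tags is an illustrative threshold, not a value from the slides.
    """
    return max(count_placeholders(source), count_placeholders(target)) > max_tags

source = ("You can use {1}{2} File {3}{4} Instance Operations {5}{6} "
          "Update Index {7}{8}{9}{10} File {11}{12}")
target = "vous pouvez utiliser {1}{2} Fichier (File) {3}{4}"
print(has_excessive_tags(source, target))
```

A flagged unit could then be dropped from the training data or routed to tag normalization, depending on the engine.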
Issue: homonyms
English: This figure shows that after midsurface compression, the resulting model develops a gap between the collet and the bracket.
French: Cette figure montre qu'après la compression en feuillet moyen, le modèle obtenu crée un jeu entre le collet et le gousset.
English: All data in brackets [] are optional.
French: Toutes les données entre crochets [] sont facultatives.
Bracket #1 (gousset): An overhanging member that projects from a structure (as a wall) and is usually designed to support a vertical load or to strengthen an angle.
Bracket #2 (crochet): The bracket character, such as [ or (.
Issue: Acronyms spelled out in the target
English: You cannot propagate SDTAEs and DTAEs in a DTAF.
French: Vous ne pouvez propager ni des éléments d'annotation d'étiquette de référence ni des éléments d'annotation de référence de positionnement à l'intérieur d'une FARP.
Issue: Mismatching number of sentences English: You can have multiple entries for the same pipe size in the bend file, that is, a single pipe size can have multiple bend radius values associated with it, as shown in the following example of a bend file. French: Vous pouvez avoir plusieurs entrées pour la même taille de tuyau dans le fichier de pliage. En d'autres termes, une même taille de tuyau peut être associée à plusieurs valeurs de rayon de pliage, comme dans le fichier de pliage d'exemple suivant.
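One way to detect this kind of unit automatically is to compare sentence counts on both sides. The sketch below assumes a deliberately naive splitter (a sentence ends at ".", "!" or "?" followed by whitespace or end of string); abbreviations and decimal numbers would need a real sentence breaker. All names are hypothetical.

```python
import re

# Naive sentence terminator: ., ! or ? followed by whitespace or end of string
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")

def sentence_count(segment: str) -> int:
    """Approximate number of sentences in a segment."""
    return len(SENTENCE_END.findall(segment.strip()))

def sentence_mismatch(source: str, target: str) -> bool:
    """True when source and target contain a different number of sentences."""
    return sentence_count(source) != sentence_count(target)

english = ("You can have multiple entries for the same pipe size, that is, "
           "a single pipe size can have multiple values.")
french = ("Vous pouvez avoir plusieurs entrées. En d'autres termes, "
          "une même taille peut avoir plusieurs valeurs.")
print(sentence_mismatch(english, french))
```

Units flagged this way can be re-aligned at sentence level or discarded.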
Issue: Inconsistent double quote usage
French: Ainsi, si vous créez une pièce portant le nom "bracket", elle est tout d'abord enregistrée dans le fichier {1}.
English: For example, if you create a part with the name bracket, it initially saves to the file name {1}.
Issue: Entity mismatch English: One way is to create a " flexible model. French: Une méthode consiste à créer un modèle souple.
Issue: Punctuation mismatch (brace vs. dash) English: {1}Copy as Skeleton{2} ( the option cannot be changed ) to create a skeleton model. French: Cliquez sur {1}Copier en tant que squelette (Copy as Skeleton){2} - option non modifiable - pour créer un modèle squelette.
Issue: Punctuation mismatch (dash vs. colon) English: {1}Additional Rotation{2} — Enter a real-number value for the number of degrees to rotate the spring's Y axis. French: {1}Rotation supplémentaire (Additional Rotation){2} : entrez un nombre réel pour indiquer le nombre de degrés de rotation de l'axe Y du ressort.
Issue: Capitalization mismatch English: Piping Master Catalog Directory File French: Fichier répertoire du catalogue principal de tuyauterie
Issue: English UI strings in the translation
English: Click View > Color and Appearance to create or modify colors.
French: Cliquez sur Affichage (View) > Couleur et apparence (Color and Appearance) pour créer ou modifier les couleurs.
Issue: Fix common entity issues
English: System without Intel ® vPro technology
Portuguese: Sistema sem a tecnologia Intel ® vPro
Corrected: English: System without Intel ® vPro technology Portuguese: Sistema sem a tecnologia Intel ® vPro
German: Leistung neu entdeckt, Effizienz neu definiert
Data Issues
Control characters that break well-formedness (here, an invisible form feed character between “CONDITIONS” and “How”) and that may cause undetected problems in translation, since they are often invisible:
<TrU>
<CrD>08052004, 05:54:40
<CrU>REYNA
<Seg L=EN-US>END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs
<Seg L=ES-EM>FIN DE LOS TÉRMINOS Y CONDICIONES Aplicación de estos términos en sus programas nuevos
<Seg L=EN-US>Intel® Xeon™ processors - System Boots With No Problem, But Crashes Or Freezes After Several Minutes Of Operation Or The System Is Unstable
<Seg L=ES-EM>Procesadores Intel® Xeon™ : El sistema arranca sin problemas, pero deja de funcionar o se bloquea después de varios minutos de funcionamiento o el sistema es inestable.
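A minimal sketch of how invisible control characters such as the form feed above could be located and replaced before export. It assumes that tab, line feed and carriage return are the only control characters worth keeping (these are the C0 controls XML 1.0 permits); the function names are hypothetical.

```python
import unicodedata

# Control characters that are normally legitimate in segment text
ALLOWED = {"\t", "\n", "\r"}

def is_suspect(ch: str) -> bool:
    """True for control characters (Unicode category Cc) outside ALLOWED."""
    return unicodedata.category(ch) == "Cc" and ch not in ALLOWED

def find_control_chars(segment: str):
    """Return (position, code point) pairs for every suspect character."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(segment) if is_suspect(ch)]

def strip_control_chars(segment: str, replacement: str = " ") -> str:
    """Replace each suspect character with a visible replacement."""
    return "".join(replacement if is_suspect(ch) else ch for ch in segment)

# \x0c is the form feed character hidden between the two sentences
segment = "END OF TERMS AND CONDITIONS\x0cHow to Apply These Terms"
print(find_control_chars(segment))
print(strip_control_chars(segment))
```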
<column name="quelltext">Der Menüpunkt &lt;<010035762_GE_1048565/>&gt; wird in der vorliegenden Fehlersuchanleitung nicht genauer beschrieben.</column>
<column name="zieltext">The menu item &lt;<010035762_GE_1048565/>&gt; is not described in greater detail in these trouble-shooting instructions.</column>
<column name="quelltext">Die Komponente &lt;<010035731_TERM_1048521/>&gt; schrittweise über den Taster betätigen.</column>
<column name="zieltext">Gradually actuate the component &lt;<010035731_TERM_1048521/>&gt; by way of the button.</column>
Training Data example
Comparable data – not parallel data:
Metadata handling by PROMT
Standard TM verification/normalization process
During TM verification the following is addressed through automatic steps
Irregular characters get flagged and replaced
Incomplete sentences get flagged
Punctuation suspects get flagged
UI strings and other irregular sentences get added to phrase tables
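The flagging steps above might look roughly like the sketch below. It is not PROMT's verification process: it assumes that an "incomplete sentence" is a segment without terminal punctuation and that a "punctuation suspect" is a unit whose source and target end differently. The names and flag labels are hypothetical.

```python
# Characters treated as valid sentence-final punctuation (an assumption)
TERMINAL = ".!?:;"

def final_punct(segment: str) -> str:
    """Return the terminal punctuation mark of a segment, or '' if there is none."""
    stripped = segment.rstrip()
    return stripped[-1] if stripped and stripped[-1] in TERMINAL else ""

def verify_unit(source: str, target: str) -> list:
    """Return a list of flags raised for one translation unit."""
    flags = []
    if not final_punct(source) or not final_punct(target):
        flags.append("incomplete-sentence")
    if final_punct(source) != final_punct(target):
        flags.append("punctuation-suspect")
    return flags

print(verify_unit("Piping Master Catalog Directory File",
                  "Fichier répertoire du catalogue principal de tuyauterie"))
print(verify_unit("Done.", "Terminé"))
```

Segments flagged as incomplete, such as UI strings and titles, are the natural candidates for phrase-table entries rather than sentence-level training data.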
PROMT handling of internal tags – not excessive but useful
Original Source Segment in File
Check <codeph class="+ topic/ph pr-d/codeph">NativeApplication.supportsSystemTrayIcon</codeph> to determine whether system tray icons are supported on the current system.
Converted to GMS Segment format (after GMS-native segmentation)
Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether system tray icons are supported on the current system.
Pre‐Processed String in XLIFF Segment format is sent to PROMT.
Check <ph i=1 x=”<codeph class="+ topic/ph pr-d/codeph" >”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 </codeph>>{2}</ph> to determine whether system tray icons are supported on the current system.
Format of the translated XLIFF Segment returned by PROMT to Idiom
Проверить <ph i=1 x=”<codeph class="+ topic/ph pr-d/codeph" >”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 </codeph>>{2}</ph> для определения системном трее иконки поддерживает нынешнюю систему.
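The round trip above (inline tags replaced by numbered placeholders, then restored after translation) can be sketched as follows. This illustrates the general technique only, not the GMS or XLIFF connector implementation; the regular expression and function names are assumptions.

```python
import re

# Any single inline tag, opening or closing
TAG = re.compile(r"<[^<>]+>")
PLACEHOLDER = re.compile(r"\{(\d+)\}")

def to_placeholders(segment: str):
    """Replace each inline tag with {n} and return the text plus the saved tags."""
    tags = []

    def replace(match):
        tags.append(match.group(0))
        return "{%d}" % len(tags)

    return TAG.sub(replace, segment), tags

def restore_tags(segment: str, tags) -> str:
    """Put the saved tags back in place of their {n} placeholders."""
    return PLACEHOLDER.sub(lambda m: tags[int(m.group(1)) - 1], segment)

source = ('Check <codeph class="+ topic/ph pr-d/codeph">'
          'NativeApplication.supportsSystemTrayIcon</codeph> to determine '
          'whether system tray icons are supported on the current system.')
text, tags = to_placeholders(source)
print(text)
print(restore_tags(text, tags) == source)
```

Because the tag content travels alongside the placeholder, the MT engine can still see what kind of element it is translating around, which is why the metadata matters.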
GMS Integration with XLIFF Connector – Why is metadata so Important?
PROMT handling of irrelevant data
Scenario 1: We can leave the irrelevant data untouched and let it propagate from TM or be handled through special formatting rules
Scenario 2: We will clean it up and add to the phrase table
Our system will perform well in either scenario; the course of action is the client's call
PROMT handling of homonyms
PROMT system is specially tailored to handle one-to-many translations and homonymy
PROMT's approach is to create context-based dictionary entries, whether single words or multiword expressions (MWEs), which allows the system to identify the correct translation for ambiguous entries
PROMT also uses XML metadata when assigning a semantic class to an entry
PROMT handling of expanding acronyms
The PROMT system handles acronym expansion, as well as acronyms that differ between languages, by creating explicit mappings
This is a rather standard task in the process of PROMT engine customization, along with DoNotTranslate and variable lists
Should an abbreviation or the expanded version change, this can be fixed through the client interface in a matter of seconds
PROMT handling of locale-specific punctuation
Quotation mark usage for a specific small group of terms can be defined on a dictionary level
If the use of quotation marks or other punctuation is universal for a specific locale, it will be defined at the linguistic-rules level
PROMT handling of Entity and Capitalization mismatch
Locale-specific differences in Entity and Capitalization rules are already built into the baseline engine and are controlled through regional settings in the product interface
All additional differences between locales are learnt from the TM during the engine customization phase and then are added to the client profile template
Issue: English UI strings in the translation
English: Click View > Color and Appearance to create or modify colors.
French: Cliquez sur Affichage (View) > Couleur et apparence (Color and Appearance) pour créer ou modifier les couleurs.
PROMT suggestion for UI string handling
All the UI strings will be automatically added to DoNotTranslate lists when appearing in the appropriate context
The context can be detected semantically or through formatting and punctuation
Intuitive contextual identification
Any word that occurs in a recognized context, such as “show” in “show command,” remains in English per the UI, whereas the word “command” gets translated. In other contexts, both words, “show” and “command,” are translated as regular words.
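As a rough illustration of detecting UI strings through formatting and punctuation, the sketch below extracts a menu path that follows "Click" and is separated by ">". It is a crude heuristic under stated assumptions (UI labels are capitalized, optionally joined by "and", "or" or "of"), not PROMT's detection logic; all names are hypothetical.

```python
import re

# A UI label: capitalized words, optionally joined by "and", "or" or "of"
TERM = r"[A-Z][\w-]*(?:\s+(?:(?:and|or|of)\s+)?[A-Z][\w-]*)*"
# A menu path: "Click" followed by one or more labels separated by ">"
MENU_PATH = re.compile(rf"\bClick\s+({TERM}(?:\s*>\s*{TERM})*)")

def extract_ui_strings(sentence: str) -> list:
    """Return the UI labels found in 'Click A > B' style contexts."""
    found = []
    for match in MENU_PATH.finditer(sentence):
        found.extend(part.strip() for part in match.group(1).split(">"))
    return found

sentence = "Click View > Color and Appearance to create or modify colors."
print(extract_ui_strings(sentence))
```

The extracted labels could then be added to a DoNotTranslate list for that context only, so the same words are still translated normally elsewhere.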
PROMT approach to entities
We can address this through a set of special rules; however, these issues are typically addressed at the GMS level
If TM cleanup is a part of the specific project scope we take this task on and address all similar issues through automated scripts
PROMT handling of internal markup
This step is not necessary for PROMT translation process
Scenario 1: the markup is handled by PROMT TMX Level 2 extensive TM metadata support
Scenario 2: if we need to create phrase table entries from these strings we will normalize, but the markup will still be preserved in the translation process
PROMT handling of empty fields
Scenario 1: During TM verification an automatic script will render a warning message and the empty unit will not be propagated
Scenario 2: We also can send the empty segment to the customized engine and obtain a translation which will be propagated into the TM for further verification
This is how PROMT pre-project dictionaries are created
Microsoft Translator
In General
Liberal at throwing away training data
Automatic filters only
No human cleaning
If it is likely there is a clean variant of almost the same sentence in the data, no harm in throwing it away
In-domain diversity is a plus
Example: Several versions of the same product have close to no effect
Cleaning issues 1/3
Issue | Training action | Runtime action
Excessive # of internal tags | Remove segment | Preserve and ignore
Irrelevant data | Fails in ratio filter | Apply factoids
Homonyms | n/a | Target language model
Acronyms spelled out | May be caught in word alignment | Project dictionary
# sentences mismatch | Sentence break, align, discard | n/a
Inconsistent quote usage | n/a | Not handled
Entities | Unescape | Unescape-reescape XML-safe
Punctuation mismatch | n/a | Needs special code (i.e. French “ :”)
Cleaning issues 2/3
Issue | Training action | Runtime action
Capitalization mismatch | Ignored | Apply language logic and target language model
English UI strings | Factoid | Preprocess and escape
Internal markup | Escape to single tag | Pass through
Empty field | Fails size delta filter | Pass through
Suspect character | May fail language check | Pass through
HTML escapes | Unescape | Unescape-reescape for XML
Trivial segment | Fails length or ratio filter | Pass through
Missing punctuation | None | Apply language-appropriate punctuation
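The "unescape, then re-escape for XML" runtime action listed for entities and HTML escapes can be sketched with the Python standard library. The translate argument is a stand-in for the MT engine and is hypothetical; inline markup would need separate handling, since this sketch escapes every "<" and ">" on the way out.

```python
import html
from xml.sax.saxutils import escape

def translate_xml_safe(segment: str, translate) -> str:
    """Unescape entities, translate plain text, then re-escape for XML.

    translate is any callable taking and returning a string; it stands in
    for the MT engine and is an assumption of this sketch.
    """
    plain = html.unescape(segment)   # "&reg;" -> "®", "&amp;" -> "&"
    translated = translate(plain)
    return escape(translated)        # re-escape &, < and > only

# Identity "translation" to show the escaping round trip in isolation
result = translate_xml_safe("Intel&reg; vPro &amp; more", lambda text: text)
print(result)
```

The engine sees the real character (®) rather than the entity, and the output stays XML-safe.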
Cleaning issues 3/3
Issue | Training action | Runtime action
Newline in string | New sentence | New sentence
Program code | Avoided or learned | Needs markup
Comparable data | Currently fails length filter; research item | Handled like “parallel”
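The length and ratio filters named in the tables above could be approximated as follows. This is a sketch, not Microsoft Translator's filter: tokens are split on whitespace, and the limits (1 to 100 tokens, maximum ratio 3.0) are hypothetical values chosen for illustration.

```python
def token_count(segment: str) -> int:
    """Number of whitespace-separated tokens in a segment."""
    return len(segment.split())

def passes_length_filter(source: str, target: str,
                         min_tokens: int = 1, max_tokens: int = 100) -> bool:
    """Both sides must fall inside the allowed token range (illustrative limits)."""
    return all(min_tokens <= token_count(s) <= max_tokens for s in (source, target))

def passes_ratio_filter(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """The longer side may be at most max_ratio times the shorter side."""
    src, tgt = token_count(source), token_count(target)
    if min(src, tgt) == 0:
        return False
    return max(src, tgt) / min(src, tgt) <= max_ratio

def keep_pair(source: str, target: str) -> bool:
    """Keep a training pair only when it passes both automatic filters."""
    return passes_length_filter(source, target) and passes_ratio_filter(source, target)

print(keep_pair("System without Intel vPro technology",
                "Sistema sem a tecnologia Intel vPro"))
print(keep_pair("Click View", "a b c d e f g h i j"))
```

Filters like these are cheap and fully automatic, which fits the stated policy of discarding data liberally with no human cleaning.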