Normalization of translation memories - TAUS User Conference 2009

    Notes on slide 1

    Visit www.ptc.com Nov 4, 2009 © Copyright 2000 Parametric Technology Corporation Page

    Visit www.ptc.com Nov 4, 2009 © Copyright 2000 Parametric Technology Corporation Page - Punctuation marks vary between languages and sometimes even product versions. Some languages put a space between the word and the colon (:), some others don’t.

    Normalization of translation memories - TAUS User Conference 2009 - Presentation Transcript

    1. USER CONFERENCE 2009 BEFORE MT
    2. Normalization of translation memories/training data for MT
      • Moderator: Karen R. Combe, PTC
      • Ryan Martin, Intel
      • Chris Wendt, Microsoft
      • William Wong, Language Weaver
      • Olga Beregovaya, ProMT
    3. Agenda
      • TM Data for MT training
      • Examples – clean or normalize?
        • PTC and Intel
      • Solutions/suggestions from technology providers
        • Microsoft
        • ProMT
      • Data Examples
    4. Issue: Excessive number of internal tags
      • French: Pour effectuer la plupart de ces tâches, vous pouvez utiliser {1}{2} Fichier (File) {3}{4} Traitement des instances (Instance Operations) {5}{6} Actualiser l'index (Update Index) {7}{8} ou {9}{10} Fichier (File) {11}{12} Traitement des instances (Instance Operations) {13}{14} Options d'accélérateur (Accelerator Options) {15}{16} afin d'ouvrir la boîte de dialogue {17} Accélérateur d'instances (Instance Accelerator) {18}
      • English: You can use {1}{2} File {3}{4} Instance Operations {5}{6} Update Index {7}{8}{9}{10} File {11}{12} Instance Operations {13} {14} Accelerator Options {15}{16} (which opens the {17} Instance Accelerator {18} dialog box) to perform most instance operations.
    5. Issue: Irrelevant data
      • English: 0.31%
      • French: 0,31 %
      • English: &asm.mbr.name==part*
      • French: &asm.mbr.name==pièce*
      • English: (Windows NT/95/98/2000)D:partlib{1}objects
      • French: (Windows NT/95/98/2000)D:partlib{1}objects
    6. Issue: Homonyms
      • English: This figure shows that after midsurface compression, the resulting model develops a gap between the collet and the bracket.
      • French: Cette figure montre qu'après la compression en feuillet moyen, le modèle obtenu crée un jeu entre le collet et le gousset.
      • English: All data in brackets [] are optional.
      • French: Toutes les données entre crochets [] sont facultatives.
      • Bracket #1 (gousset): An overhanging member that projects from a structure (as a wall) and is usually designed to support a vertical load or to strengthen an angle.
      • Bracket #2 (crochet): The bracket character, such as [ or (.
    7. Issue: Acronyms spelled out in the target
      • English: You cannot propagate SDTAEs and DTAEs in a DTAF.
      • French: Vous ne pouvez propager ni des éléments d'annotation d'étiquette de référence ni des éléments d'annotation de référence de positionnement à l'intérieur d'une FARP.
    8. Issue: Mismatching number of sentences
      • English: You can have multiple entries for the same pipe size in the bend file, that is, a single pipe size can have multiple bend radius values associated with it, as shown in the following example of a bend file.
      • French: Vous pouvez avoir plusieurs entrées pour la même taille de tuyau dans le fichier de pliage. En d'autres termes, une même taille de tuyau peut être associée à plusieurs valeurs de rayon de pliage, comme dans le fichier de pliage d'exemple suivant.
    9. Issue: Inconsistent double quote usage
      • French: Ainsi, si vous créez une pièce portant le nom "bracket", elle est tout d'abord enregistrée dans le fichier {1}.
      • English: For example, if you create a part with the name bracket, it initially saves to the file name {1}.
    10. Issue: Entity mismatch
      • English: One way is to create a " flexible model.
      • French: Une méthode consiste à créer un modèle souple.
    11. Issue: Punctuation mismatch (parentheses vs. dash)
      • English: {1}Copy as Skeleton{2} (the option cannot be changed) to create a skeleton model.
      • French: Cliquez sur {1}Copier en tant que squelette (Copy as Skeleton){2} - option non modifiable - pour créer un modèle squelette.
    12. Issue: Punctuation mismatch (dash vs. colon)
      • English: {1}Additional Rotation{2} — Enter a real-number value for the number of degrees to rotate the spring's Y axis.
      • French: {1}Rotation supplémentaire (Additional Rotation){2} : entrez un nombre réel pour indiquer le nombre de degrés de rotation de l'axe Y du ressort.
    13. Issue: Capitalization mismatch
      • English: Piping Master Catalog Directory File
      • French: Fichier répertoire du catalogue principal de tuyauterie
    14. Issue: English UI strings in the translation
      • English: Click View > Color and Appearance to create or modify colors.
      • French: Cliquez sur Affichage (View) > Couleur et apparence (Color and Appearance) pour créer ou modifier les couleurs.
    15. Issue: Fix common entity issues
      • English: System without Intel ® vPro technology
      • Portuguese: Sistema sem a tecnologia Intel ® vPro
      • Corrected:
        • English: System without Intel ® vPro technology
        • Portuguese: Sistema sem a tecnologia Intel ® vPro
    16. Issue: Remove internal markup
      • Original: <tuv xml:lang="ZH-CN"> <seg> <bpt i="1">&lt;span style='font-size:10.0pt; font-family:Verdana'></bpt> 在默认情况下,节点 <bpt i="2" type="bold">&lt;b></bpt> 应用程序 <ept i="2">&lt;/b></ept> 之下没有任何应用程序,如下图所示。 <ept i="1">&lt;/span></ept> </seg></tuv></tu>
      • Corrected: <tuv xml:lang="ZH-CN"> <seg> 在默认情况下,节点应用程序之下没有任何应用程序,如下图所示。 </seg> </tuv> </tu>
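The markup removal shown on this slide can be sketched in a few lines of Python. This is only an illustration, not the tool the panelists used: the function name `seg_plain_text` and the sample segment are made up here, and the sketch assumes the TMX inline code elements `bpt`, `ept`, `it`, `ph`, and `ut` carry formatting codes whose content should be dropped.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not translatable text
# (assumption for this sketch; other elements such as <hi> keep their text).
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}

def seg_plain_text(seg):
    """Return the text of a <seg> element with inline code elements removed."""
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in CODE_ELEMENTS:
            # Keep the text of non-code children, recursing into them
            parts.append(seg_plain_text(child))
        # Text that follows the child still belongs to the segment
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical sample segment, modeled on the slide's structure
SEG = ET.fromstring(
    '<seg><bpt i="1">&lt;span&gt;</bpt>By default, the node '
    '<bpt i="2" type="bold">&lt;b&gt;</bpt>Applications<ept i="2">&lt;/b&gt;</ept>'
    ' is empty.<ept i="1">&lt;/span&gt;</ept></seg>'
)
print(seg_plain_text(SEG))
```

Whether to strip the codes or keep them as placeholders is a per-project decision; the sketch only covers the stripping case from this slide.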
    17. Issue: Empty field
      • <tu creationdate="20040727T134835Z" creationid="RICHARD">
      • <prop type="x-error">empty field</prop>
      • <tuv xml:lang="EN-US">
      • <seg>Must re-verify if changing motherboard brands</seg>
      • </tuv>
      • <tuv xml:lang="ZH-CN">
      • <seg></seg>
      • </tuv>
      • </tu>
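A check for empty fields like the one above might look as follows. This is a minimal sketch under stated assumptions: the function name `empty_segments` and the sample TMX are hypothetical, and the input is assumed to be well-formed TMX that the standard library parser can read.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def empty_segments(tmx_text):
    """Return (tu_index, language) for every <tuv> whose <seg> has no text."""
    root = ET.fromstring(tmx_text)
    problems = []
    for index, tu in enumerate(root.iter("tu")):
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            if not text.strip():
                problems.append((index, tuv.get(XML_LANG)))
    return problems

# Hypothetical sample modeled on the slide
SAMPLE = """<tmx version="1.4"><body>
<tu><tuv xml:lang="EN-US"><seg>Must re-verify if changing motherboard brands</seg></tuv>
<tuv xml:lang="ZH-CN"><seg></seg></tuv></tu>
</body></tmx>"""

print(empty_segments(SAMPLE))
```

Note that `itertext()` also counts text inside inline code elements, so a segment containing only formatting codes would not be reported by this sketch.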
    18. Issue: Suspect character
      • English: “xxx”or “yyy” for a description of each.
      • Turkish: Her seçeneÄŸe iliÅŸkin açıklamayı görmek için bkz. “xxx” veya “yyy”.
    19. Issue: Suspect character
      • <tuv xml:lang="EN-US">
      • <seg>Yes ? </seg>
      • </tuv>
      • <tuv xml:lang="ZH-CN">
      • <seg> 是 ¹ </seg>
      • </tuv>
    20. Issue: Escape character in translation
      • English: Mode:
      • Turkish: Geliu351 şmiu351 ş Mod:
    21. Issue: Trivial segment; missing sentence features
      • <tuv xml:lang="EN-US">
      • <seg>put_brand_logo</seg>
      • </tuv>
      • <tuv xml:lang="DE-DE">
      • <seg>put_brand_logo</seg>
      • </tuv>
    22. Issue: Incomplete translation, missing punctuation
      • English: performance, redefine efficiency.
      • German: Leistung neu entdeckt, Effizienz neu definiert
    23. Data Issues
      • Control characters that break well-formedness (here, an invisible form feed character between “CONDITIONS” and “How”). Because they are often invisible, they may cause problems in translation that go undetected:
        • <TrU>
        • <CrD>08052004, 05:54:40
        • <CrU>REYNA
        • <Seg L=EN-US>END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs
        • <Seg L=ES-EM>FIN DE LOS TÉRMINOS Y CONDICIONES Aplicación de estos términos en sus programas nuevos
        • </TrU>
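Invisible control characters such as the form feed above can be located with a simple scan. The sketch below is an illustration only; the function names are hypothetical. It relies on the XML 1.0 rule that, below U+0020, only tab, line feed, and carriage return are permitted.

```python
import re

# C0 control characters that XML 1.0 does not allow (everything below U+0020
# except tab U+0009, line feed U+000A, and carriage return U+000D)
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def find_control_chars(text):
    """Return (offset, code point) for each disallowed control character."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in INVALID_XML_CHARS.finditer(text)]

def replace_control_chars(text, replacement=" "):
    """Replace each disallowed control character, here with a space by default."""
    return INVALID_XML_CHARS.sub(replacement, text)

sample = "END OF TERMS AND CONDITIONS\x0cHow to Apply These Terms"
print(find_control_chars(sample))
print(replace_control_chars(sample))
```

Replacing with a space rather than deleting keeps the two sentences from running together; which choice is right depends on the data.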
    24. Data Issues
      • Non-standard entities need to be converted. The XML standard predefines only &amp; &lt; &gt; &quot; &apos; ; any others, e.g. &reg; &trade; &nbsp; &copy; &lang; &rang; , need to be converted explicitly:
        • <TrU>
        • <CrD>22112002, 14:48:14
        • <CrU>JANICE
        • <Seg L=EN-US>Intel&reg; Xeon&trade; processors - System Boots With No Problem, But Crashes Or Freezes After Several Minutes Of Operation Or The System Is Unstable
        • <Seg L=ES-EM>Procesadores Intel&reg; Xeon&trade; : El sistema arranca sin problemas, pero deja de funcionar o se bloquea después de varios minutos de funcionamiento o el sistema es inestable.
        • </TrU>
      • Keyword lists that are not full translations:
        • <TrU>
        • <CrD>13112003, 13:28:28
        • <CrU>REYNA
        • <ChD>14112003, 10:32:01
        • <ChU>DJN
        • <Seg L=EN-US>Generic,cache,chip,core,cpu,ghz,giga,l2,mega,mhz,package,pentium,processor, processors,specifications,specs,speed,voltage
        • <Seg L=ES-EM>Generic, cache, chip, core, cpu, ghz, giga, l2, mega, mhz, package, pentium, processor, processors, specifications, specs, speed, voltage, genérico, caché, chip, núcleo, cpu, ghz, giga, l2, mega, mhz, paquete, pentium, procesador, procesadores, especificaciones, velocidad, voltaje
        • </TrU>
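The conversion of non-standard entities mentioned on this slide could be sketched like this. It is an illustration, not the presenters' script: the function name `convert_named_entities` is hypothetical, and the sketch assumes the entities in the data are HTML 4 named entities, which Python lists in `html.entities.name2codepoint`.

```python
import re
from html.entities import name2codepoint

# The five entities XML predefines; these are left untouched
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def convert_named_entities(text):
    """Replace known HTML named entities with their characters, keeping XML's own."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names as they are
        return chr(name2codepoint[name])
    return ENTITY_RE.sub(repl, text)

print(convert_named_entities("Intel&reg; Xeon&trade; processors &amp; boards"))
```

Unknown entity names are left in place so that they can be reviewed by hand instead of being silently dropped.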
    25. Data Issues
      • Mapped junk characters to proper characters. Mapped to sensible equivalents: U+0092 → '; U+0096 and U+0097 → -; U+008C → Œ; U+0099 → ™.
      • Mapped junk characters to proper characters. The junk characters caused by converting UTF-8 (misidentified as ISO-8859-1) to UTF-8 needed to be mapped back to their true characters. For example, Ã© → é, Ã¡ → á, Ã» → û, Ã§ → ç, Ã® → î.
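Both mappings above can be sketched in Python. The names `C1_MAP`, `map_c1`, and `undo_double_utf8` are hypothetical, and the second function assumes the whole string went through exactly one wrong ISO-8859-1 decode; it returns the input unchanged when that assumption does not hold.

```python
# Mapping of stray C1 code points to the equivalents listed on the slide
C1_MAP = {
    "\u0092": "'",
    "\u0096": "-",
    "\u0097": "-",
    "\u008c": "\u0152",  # Œ
    "\u0099": "\u2122",  # ™
}
C1_TABLE = str.maketrans(C1_MAP)

def map_c1(text):
    """Replace the listed C1 code points with sensible equivalents."""
    return text.translate(C1_TABLE)

def undo_double_utf8(text):
    """Reverse UTF-8 bytes that were decoded as ISO-8859-1 and re-encoded."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not (purely) double-encoded text: leave it alone
        return text

print(undo_double_utf8("\u00c3\u00a9"))  # the two-character junk form of é
print(map_c1("it\u0092s"))
```

Because the repair is all-or-nothing per string, a segment that mixes junk with characters outside ISO-8859-1 is left untouched and would need finer-grained handling.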
    26. Data Issues
      • Code in TM data
      • r1 = CLSIDFromProgID(L"OPC.SimaticNET", &clsid); if (r1 != S_OK) { MessageBox("Retrival of CLSID failed", "Error CLSIDFromProgID()", MB_OK+MB_ICONERROR); CoUninitialize(); SendMessage(WM_CLOSE); return; }
      • //******************************************************************FUNCTION_BLOCK FB 23
      • XXXXXXXXEAX= XXXXXXXX EBX= XXXXXXXX ECX= XXXXXXXX EDX= XXXXXXXX EDI= XXXXXXXX ESI= XXXXXXXX FLAGS= XXXXXXXX DS= XXXX ES= XXXX SS= XXXX ESP=XXXXXXXX EBP= XXXXXXXX FS= XXXX GS= XXXX
    27. Data Issues
      • Inconsistent translations
        • (100 to 120, 220 to 264)
        • (100 -120 , 220 и 264 )
        • (0.197 to 0.236)
        • (0,197 -0,236 )
        • (0.276 to 0.335)
        • (0,276 -0,335 )
        • (176 and 212°F)
        • (176 -212°F)
    28. Data Issues
      • Double escaped tags
        • <column name=&quot;quelltext&quot;>Der Menüpunkt &amp;lt;&lt;010035762_GE_1048565/&gt;&amp;gt; wird in der vorliegenden Fehlersuchanleitung nicht genauer beschrieben.</column>
        • <column name=&quot;zieltext&quot;>The menu item &amp;lt;&lt;010035762_GE_1048565/&gt;&amp;gt; is not described in greater detail in these trouble-shooting instructions.</column>
        • <column name=&quot;quelltext&quot;>Die Komponente &amp;lt;&lt;010035731_TERM_1048521/&gt;&amp;gt; schrittweise über den Taster betätigen.</column>
        • <column name=&quot;zieltext&quot;>Gradually actuate the component &amp;lt;&lt;010035731_TERM_1048521/&gt;&amp;gt; by way of the button.</column>
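One way to undo double escaping like this is to unescape repeatedly until the text stops changing. The sketch below is an illustration with a hypothetical function name, `unescape_fully`; it uses the standard library's `html.unescape`.

```python
import html

def unescape_fully(text, max_rounds=5):
    """Apply html.unescape until the text no longer changes (bounded by max_rounds)."""
    for _ in range(max_rounds):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text

print(unescape_fully("The menu item &amp;lt;010035762_GE_1048565/&amp;gt; is not described."))
```

A caveat: text that is meant to contain a literal `&amp;` or `&lt;` is unescaped as well, so the result has to be escaped again before it is written back to XML.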
    29. Training Data example
      • Comparable data – not parallel data:
    30. Metadata handling by PROMT
    31. Standard TM verification/normalization process
      • During TM verification the following is addressed through automatic steps:
      • Irregular characters get flagged and replaced
      • Incomplete sentences get flagged
      • Punctuation suspects get flagged
      • UI strings and other irregular sentences get added to phrase tables
    32. PROMT handling of internal tags – not excessive but useful
      • Original Source Segment in File
      • Check <codeph class="+ topic/ph pr-d/codeph">NativeApplication.supportsSystemTrayIcon</codeph> to determine whether system tray icons are supported on the current system.
      • Converted to GMS Segment format (after GMS-native segmentation)
      • Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether system tray icons are supported on the current system.
      • Pre-Processed String in XLIFF Segment format is sent to PROMT.
      • Check <ph i=1 x=”&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot; &gt;”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 &lt/codeph&gt;>{2}</ph> to determine whether system tray icons are supported on the current system.
      • Format of the translated XLIFF Segment returned by PROMT to Idiom
      • Проверить <ph i=1 x=”&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot; &gt;”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 &lt/codeph&gt;>{2}</ph> для определения системном трее иконки поддерживает нынешнюю систему.
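The placeholder idea behind the {1} and {2} markers above can be sketched as a mask-and-restore round trip. This is not the GMS, XLIFF connector, or PROMT implementation; `mask_tags` and `unmask_tags` are hypothetical helpers, and the regular expression assumes simple, non-nested angle-bracket tags.

```python
import re

TAG_RE = re.compile(r"<[^<>]+>")
PLACEHOLDER_RE = re.compile(r"\{(\d+)\}")

def mask_tags(segment):
    """Replace each inline tag with a numbered placeholder; return text and tag list."""
    tags = []
    def repl(match):
        tags.append(match.group(0))
        return "{%d}" % len(tags)
    return TAG_RE.sub(repl, segment), tags

def unmask_tags(masked, tags):
    """Put the original tags back in place of their placeholders."""
    def repl(match):
        index = int(match.group(1)) - 1
        # Leave placeholders without a stored tag untouched
        return tags[index] if 0 <= index < len(tags) else match.group(0)
    return PLACEHOLDER_RE.sub(repl, masked)

source = ('Check <codeph class="+ topic/ph pr-d/codeph">'
          'NativeApplication.supportsSystemTrayIcon</codeph> to determine '
          'whether system tray icons are supported on the current system.')
masked, tags = mask_tags(source)
print(masked)
print(unmask_tags(masked, tags) == source)
```

Because the placeholders are numbered, the MT output can reorder them and the tags still return to the right positions.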
    33. GMS Integration with XLIFF Connector – Why is metadata so Important?
    34. PROMT handling of irrelevant data
      • Scenario 1: We can leave the irrelevant data untouched and let it propagate from TM or be handled through special formatting rules
      • Scenario 2: We will clean it up and add to the phrase table
      • Our system will perform well in either scenario, and the course of action needs to be the client's call
    35. PROMT handling of homonyms
      • The PROMT system is specially tailored to handle one-to-many translations and homonymy
      • The PROMT approach is to create context-based dictionary entries, whether single words or multi-word expressions (MWEs), which allows the system to properly identify the correct translation for ambiguous entries
      • PROMT also uses XML metadata when assigning a semantic class to an entry
    36. PROMT handling of expanding acronyms
      • The PROMT system handles expansion of acronyms, or different acronyms between languages, by creating explicit mappings
      • This is a rather standard task in the process of PROMT engine customization, along with DoNotTranslate and variable lists
      • Should an abbreviation or the expanded version change, this can be fixed through the client interface in a matter of seconds
    37. PROMT handling of locale-specific punctuation
      • Quotation mark usage for a specific small group of terms can be defined on a dictionary level
      • If the use of quotation marks or other punctuation is universal for a specific locale, it will be defined on the linguistic rules level
    38. PROMT handling of Entity and Capitalization mismatch
      • The differences in locale setting for Entities and Capitalization rules are already pre-built in the baseline engine and are regulated through regional settings in the product interface
      • All additional differences between locales are learnt from the TM during the engine customization phase and then are added to the client profile template
    39. Issue: English UI strings in the translation
      • English: Click View > Color and Appearance to create or modify colors.
      • French: Cliquez sur Affichage (View) > Couleur et apparence (Color and Appearance) pour créer ou modifier les couleurs.
    40. PROMT suggestion for UI string handling
      • All the UI strings will be automatically added to DoNotTranslate lists when appearing in the appropriate context
      • The context can be detected semantically, through formatting and punctuation
    41. Intuitive contextual identification
      • Any word that occurs as part of a context, such as “show” in “show command,” remains in English per the UI, whereas the word “command” gets translated.
      • In other contexts, both words, “show” and “command,” are translated as regular words.
    42. PROMT approach to entities
      • We can address entity issues through a set of special rules; typically, however, these issues are addressed on the GMS level
      • If TM cleanup is a part of the specific project scope we take this task on and address all similar issues through automated scripts
    43. PROMT handling of internal markup
      • This step is not necessary for the PROMT translation process
      • Scenario 1: the markup is handled by PROMT TMX Level 2 extensive TM metadata support
      • Scenario 2: if we need to create phrase table entries from these strings we will normalize, but the markup will still be preserved in the translation process
    44. PROMT handling of empty fields
      • Scenario 1: During TM verification an automatic script will render a warning message and the empty unit will not be propagated
      • Scenario 2: We also can send the empty segment to the customized engine and obtain a translation which will be propagated into the TM for further verification
      • This is how PROMT pre-project dictionaries are created
    45. Microsoft Translator
    46. In General
      • Liberal at throwing away training data
        • Automatic filters only
        • No human cleaning
        • If it is likely there is a clean variant of almost the same sentence in the data, no harm in throwing it away
        • In-domain diversity is a plus
        • Example: Several versions of the same product have close to no effect
    47. Automatic training data filtering and conversion
      • Remove
        • Low text ratio (characters vs. markup, punc.)
        • High length delta
        • Less than minimal length
        • Unexpected language
      • Convert
        • Character encoding
        • Named entities
      • Escape Factoids
        • Numbers, dates, URLs, email, etc.
        • Ex: “5/16/2009” | “June 28, 1998” → <factoid_date>
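The filters and factoid escaping listed on this slide can be sketched as follows. This is not Microsoft Translator's implementation: the thresholds, the function names, and the date pattern are all assumptions chosen for illustration, and the language check and encoding conversion are left out.

```python
import re

# Hypothetical thresholds; real values would be tuned on the data
MIN_CHARS = 3
MIN_TEXT_RATIO = 0.5
MAX_LENGTH_RATIO = 3.0

def text_ratio(segment):
    """Share of alphabetic characters in a segment (markup and punctuation count against it)."""
    if not segment:
        return 0.0
    return sum(ch.isalpha() for ch in segment) / len(segment)

def keep_pair(source, target):
    """Apply the minimal-length, text-ratio, and length-delta filters to one pair."""
    source, target = source.strip(), target.strip()
    if len(source) < MIN_CHARS or len(target) < MIN_CHARS:
        return False
    if text_ratio(source) < MIN_TEXT_RATIO or text_ratio(target) < MIN_TEXT_RATIO:
        return False
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= MAX_LENGTH_RATIO

MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS)

def escape_dates(segment):
    """Replace two common English date formats with a single factoid token."""
    return DATE_RE.sub("<factoid_date>", segment)

print(keep_pair("0.31%", "0,31 %"))
print(keep_pair("Must re-verify if changing motherboard brands", ""))
print(escape_dates("Released 5/16/2009 and June 28, 1998."))
```

Escaping factoids before training means the engine learns one token per category instead of thousands of distinct dates or numbers.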
    48. Cleaning issues 1/3
      Issue | Training action | Runtime action
      Excessive # of internal tags | Remove segment | Preserve and ignore
      Irrelevant data | Fails in ratio filter | Apply factoids
      Homonyms | n/a | Target language model
      Acronyms spelled out | May be caught in word alignment | Project dictionary
      # sentences mismatch | Sentence break, align, discard | n/a
      Inconsistent quote usage | n/a | Not handled
      Entities | Unescape | Unescape-reescape → XML-safe
      Punctuation mismatch | n/a | Needs special code (i.e. French “ :”)
    49. Cleaning issues 2/3
      Issue | Training action | Runtime action
      Capitalization mismatch | Ignored | Apply language logic and target language model
      English UI strings | Factoid | Preprocess and escape
      Internal markup | Escape to single tag | Pass through
      Empty field | Fails size delta filter | Pass through
      Suspect character | May fail language check | Pass through
      HTML escapes | Unescape | Unescape – reescape for XML
      Trivial segment | Fails length or ratio filter | Pass through
      Missing punctuation | None | Apply language-appropriate punctuation
    50. Cleaning issues 3/3
      Issue | Training action | Runtime action
      Newline in string | New sentence | New sentence
      Program code | Avoided or learned | Needs markup
      Comparable data | Currently fails length filter; research item | Handled like “parallel”
