SlideShare a Scribd company logo
1 of 39
Download to read offline
User Manual Araya
   Bilingual Term
   Extraction

            Description how to use the Araya
            Terminology Extraction Tool


© Dr. Klemens Waldhör, Heartsome Europe GmbH, 2006, 2007, 2009
Heartsome Europe GmbH
                                                               www.heartsome.de


Bilingual Extractor
    The bilingual extractor is a simple to use, but
    efficient tool to generate automatically term
    pairs from translated documents (TMX files)
         A term pair consists of a source and target term
         A term can consists of several words
    These term pairs can be used to create a
    new terminology or add new terminology to
    an existing terminology database.


18.10.2009             (c) 2006, 2007, Heartsome Europe GmbH                           2
Heartsome Europe GmbH
                                                                www.heartsome.de


Versions
    The Extractor was developed by Heartsome
    Europe GmbH and LNE International.
    It is available as
         Single user version.
         It is part of the Araya Server translation tools.




18.10.2009              (c) 2006, 2007, Heartsome Europe GmbH                           3
Heartsome Europe GmbH
                                                                  www.heartsome.de


Short introduction how to extract
    Extract Terms
         File -> Extract Bilingual Terminology from File
             (Option: Open after extraction)
    Check extracted terms
         Mark correct translation as „validated“
    Export terms
         Export validated terms …



18.10.2009                (c) 2006, 2007, Heartsome Europe GmbH                           4
Heartsome Europe GmbH
                                                                www.heartsome.de


Installation




      Installation goes into directory c:/Program Files/Araya. It is
      recommended not to change this as all initialization files map
      towards this directory.



18.10.2009              (c) 2006, 2007, Heartsome Europe GmbH                           5
Heartsome Europe GmbH
                                                          www.heartsome.de


Starting Araya Extraction tool

     Go to directory:
     c:/Program Files/Araya
     Start: BiEdit.exe
     Or double click :




18.10.2009        (c) 2006, 2007, Heartsome Europe GmbH                           6
Heartsome Europe GmbH
                                                          www.heartsome.de


The Extraction Approach

    Based on a TMX file all possible relevant
    term pairs are computed. This is based on a
    statistical approach which determines the
    frequency of source and target terms.
    TMX = XML Exchange format for translation
    memory databases




18.10.2009        (c) 2006, 2007, Heartsome Europe GmbH                           7
Heartsome Europe GmbH
                                                         www.heartsome.de


Segment
    Extraction is based on segments which are
    stored in a TMX file.
    A segment can either be a sentence or a
    whole paragraph.
    Formats in TMX files are ignored.




18.10.2009       (c) 2006, 2007, Heartsome Europe GmbH                           8
Heartsome Europe GmbH
                                                                   www.heartsome.de


Evaluating and Validating
    Each found term pair is associated with a quality
    measure.
         2. column of the extraction table
         Value is between1,0 (highest probability that the term pair
         is a translation) and 0,5 (lowest probability that the term
         pair is a translation)
    Terms can be validated as correct translations.
         Last column of table
             Approved = checked = validated
             Unapproved = not validated
    Validated terms can be exported.

18.10.2009                 (c) 2006, 2007, Heartsome Europe GmbH                           9
Heartsome Europe GmbH
                                                                     www.heartsome.de


Validating a Term Extraction Pair
    Select line with term extraction pair
    Validate = approving using
         Double click term pair
         Right mouse click
    Remove validation mark using
          a double or
         right mouse click




                       Validated terms appear in green


18.10.2009                   (c) 2006, 2007, Heartsome Europe GmbH                           10
Heartsome Europe GmbH
                                                                      www.heartsome.de


 The Extraction file
     The extraction file has the following format
     Encoding format is UTF-8!


nr;score;status;term1.LangCode;term1.wordGroup;term1.wordGroup
Len;term1.wFreq;term2.LangCode;term2.wordGroup;term2.wordGroup
Len;term2.wFreq;sentLinked

              Example line
                                                          Start line = Marker of extraction file



0;1.0;approved;de;DIN;1;15;en;DIN;1;15;11




 18.10.2009                  (c) 2006, 2007, Heartsome Europe GmbH                                 11
Heartsome Europe GmbH
                                                                              www.heartsome.de


Extraction User Interface

                                        Frequency of source          Table:
                                        and target terms             This window contains the
             Links
                                                                     source and target pairs
                                                                     found

                     Quality value


         Term number                 Source term                Target term         Validation




                                       Status window




18.10.2009                            (c) 2006, 2007, Heartsome Europe GmbH                           12
Heartsome Europe GmbH
                                                                      www.heartsome.de


Columns
    Value
         Statistical measure that the source and target term are translation of each
         other (quality measure)
    SL
         Number of segments where both source and target term appear in.
    Freq 1
         Number of segment where source term appears in
    Freq 2
         Number of segment where target term appears in
    Source term
         The source term
    Target term
         The translation of the source term
    Validation
         Check box, for marking correct term pairs



18.10.2009                    (c) 2006, 2007, Heartsome Europe GmbH                           13
Heartsome Europe GmbH
                                                                   www.heartsome.de


File Menu 1                                                   Open an extraction file


                                                              Close an extraction file

                                                           Save an open extraction file

                                                     Save extraction file with a new name

                                                          Save validated terms in a new
                                                                  extraction file




                                        Save selected terms in a new
                                               extraction file


             The opened file is shown in the title bar.



18.10.2009                 (c) 2006, 2007, Heartsome Europe GmbH                            14
Heartsome Europe GmbH
                                                                      www.heartsome.de


File Menu 2
                                                                      Extract term pairs from
                                                                            a TMX file



                                                                       Export terms from open
                                                                       extraction file into TBX
                                                                               Format




     Export validated      Export terms from open               Export validated terms
    terms from open          extraction file into              from open extraction file
    extraction file into      Multiterm Format                     into TBX Format
    Multiterm Format



18.10.2009                    (c) 2006, 2007, Heartsome Europe GmbH                             15
Heartsome Europe GmbH
                                                      www.heartsome.de


File Menu 3
                                                          Export entries from
                                                         open extraction file in
                                                          Araya CSV Format




                                                        Export validated entries
                                                        from open extraction file
                                                          in Araya CSV Format




18.10.2009    (c) 2006, 2007, Heartsome Europe GmbH                            16
Heartsome Europe GmbH
                                                                   www.heartsome.de


Extracting Term Pairs from TMX File

                                     Source language                      Target language




             Open after extraction




                                                                  Use Lucene Tokenizer for word
                                                                           detecion



18.10.2009                (c) 2006, 2007, Heartsome Europe GmbH                              17
Heartsome Europe GmbH
                                                                 www.heartsome.de


Extraction Parameters 1
    Minimum / Maximum words
         Controls how many words should be contained min/max in
         the found term pair
    Minimum / Maximum frequency
         Controls how often the term should appear min/max for the
         found term pair
    Maximum Translations
         Controls how many translation should be found at
         maximum
    Source/Target terms in lower case
         Controls if source and/or target terms should be converted
         to lower case

18.10.2009               (c) 2006, 2007, Heartsome Europe GmbH                           18
Heartsome Europe GmbH
                                                                www.heartsome.de


Extraction Parameters 2
    Validated terminology to ignore
         If a terminology extraction file is specified here, all
         terms which are marked as “validated” will be
         ignored.
         Thru this know translations are ignored.




18.10.2009              (c) 2006, 2007, Heartsome Europe GmbH                           19
Heartsome Europe GmbH
                                                                 www.heartsome.de


Extraction Parameters 3
    Upper / Lowercase
         Not valid in combination with option Lucene-Tokenizer! If
         this option is selected for one or both languages the
         segment is converted to lowercase.
    Segment Paragraphs
         If the segment consists of several sentences the segment
         is “sentence segmented”. This only applies if source and
         target segment have the same number of sentences.
    Lucene Tokenizer
         Term extraction creates words from segments in a first
         step. For CJK languages the usage of the Lucene tokenizer
         is recommended, as it uses additional criteria for detecting
         word boundaries. It should be noted that the tokenizer is
         applied to both languages.

18.10.2009               (c) 2006, 2007, Heartsome Europe GmbH                           20
Heartsome Europe GmbH
                                                               www.heartsome.de


Extraction Parameters 4
    Count TMX TU Entries
         Count the number of TU entries in the TMX file.
    Number of analysis blocks
         Currently not used. It is planed to use this feature
         for splitting up the tmx file in several sub-buckets
         for testing the stability of the extraction results.
    Remove identical terms
         If one uses identical source and target language
         this option removes trivially found identical source
         and target terms.

18.10.2009             (c) 2006, 2007, Heartsome Europe GmbH                           21
Heartsome Europe GmbH
                                                                www.heartsome.de


Exporting
    Exporting can be done in different formats.
      TBX
         Name of extraction file + „.tbx“
       Multiterm (™ of Trados/SDL International)
          Name of extraction file + „.multiterm“
       Araya CSV
          Name of extraction file + „araya.csv“
       Character encoding is always UTF-8
    Either all or only the validated entries can be exported
    In addition the selection filter (Options -> Export Score Filter)
    controls the exported terms
       Depending on chosen value only the term pairs with a minimum
       score get exported (e.g. score higher than 0.6).

18.10.2009              (c) 2006, 2007, Heartsome Europe GmbH                           22
Heartsome Europe GmbH
                                                                   www.heartsome.de


Araya CSV Format
    Araya CSV Format contains the languages in
    the first line followed by the extracted terms

             Beispiel
                               Languages separated by ;
             de;en
             Anschlussplan;Connection diagram
             DIN;DIN
             Dr;Dr
                                           Extracted terms separated by ;




18.10.2009                 (c) 2006, 2007, Heartsome Europe GmbH                           23
Heartsome Europe GmbH
                                                               www.heartsome.de


The Options Menu 1                                         Character set for the
                                                             extraction table




                                                                             Settings for
                                                                           minimum quality
                                                                           for the export of
                                                                              term pairs



             User Interface
              Language




18.10.2009         (c) 2006, 2007, Heartsome Europe GmbH                               24
Heartsome Europe GmbH
                                                           www.heartsome.de


The Options Menu 2                                       Remove all term entries from the
                                                           term table contained in term
                                                                  extraction file



                                                         Remove all term entries from the
                                                           term table contained in term
                                                             extraction file which are
                                                                     approved

                                                         Search Terms in the extraction
                                                                    table

                                                          Complex search for based on
                                                            source and target term


                                                           Deselects all selected in the
             Select Terms based on a word                       extraction table
                       list in a file



18.10.2009       (c) 2006, 2007, Heartsome Europe GmbH                                 25
Heartsome Europe GmbH
                                                          www.heartsome.de


Filter Term Functions
    The filter functions filters all those term
    entries which are contained in another term
    extraction file.
    The identical terms are removed from the
    term table.
    Depending on the chosen filter method the
    approved or all entries are used from the
    specified term extraction file.


18.10.2009        (c) 2006, 2007, Heartsome Europe GmbH                           26
Heartsome Europe GmbH
                                                                          www.heartsome.de


Search Term Functions



                                      Search this term and select all the matching entries in the table.
                                       The selected term candidates can then be saved with “File ->
                                               Save Selected Terms into Extraction File…”


                      Find the next matching term



 Start searching this term




18.10.2009                        (c) 2006, 2007, Heartsome Europe GmbH                               27
Heartsome Europe GmbH
                                                                         www.heartsome.de


Advanced Search Term Functions


                                                                         Use regular expressions for
                                                                           searching term entries




                                        Search this term and select all the matching entries in the table.
                                         The selected term candidates can then be saved with “File ->
                                                 Save Selected Terms into Extraction File…”

                      Find the next matching terms

 Start searching these combination of
        source and target terms

18.10.2009                       (c) 2006, 2007, Heartsome Europe GmbH                                 28
Heartsome Europe GmbH
                                                                   www.heartsome.de


Select Terms Based on a Word List
    The filter functions selects all the matching terms from an
    UTF-8 encoded file. The filter search the given string in both
    source and target language string
         It also matches substrings
    The selected term candidates can then be saved with “File ->
    Save Selected Terms into Extraction File…”
    Each word (string) in the file must be written on a single line.




18.10.2009                 (c) 2006, 2007, Heartsome Europe GmbH                           29
Heartsome Europe GmbH
                                                                www.heartsome.de


The Plugins Menu
                                                                Conversion between
                                                                  character sets


                                                                Check and recode
                                                              languages in TMX files

                                                              Split TMX files in several
                                                                      TMX file


               Count TU and TUV entries        Merge several TMX files in
                      in TMX file                    one TMX file

  Display of
  Araya Log-
     File




18.10.2009            (c) 2006, 2007, Heartsome Europe GmbH                                30
Heartsome Europe GmbH
                                                     www.heartsome.de


The Help Menu
                                               Display user manual



                                                Araya / Heartsome
                                                   Homepage


                                                About the bilingual
                                                   extraction




18.10.2009   (c) 2006, 2007, Heartsome Europe GmbH                           31
Heartsome Europe GmbH
                                                     www.heartsome.de


About the Extraction
                                            Basic information about the
                                            Araya configuration is shown
                                            here.
                                            This can be important for
                                            error diagnosis
                                            Using „Save…“ the
                                            information is stored in the
                                            specified file.




18.10.2009   (c) 2006, 2007, Heartsome Europe GmbH                           32
Heartsome Europe GmbH
                                                                               www.heartsome.de


Converting Files between different
Character Encodings                                                                  Source file name


                                                                              Target file; will be created based
                                                                                on source file name and as
                                                                               extension the target encoding
                                                                                         character set

                                                                              Source encoding character set


                                                                               Target encoding character set

This function supports converting files between different character sets. Depending on the target character
up to two files are written. If the target file is a UTF-8, 16, 32 or UCS file, two files are written. The second file
with the extension „.nobom“ is a copy of the first target file, the only difference is that the BOM (Byte Order
Marks) are removed from this file. This file should be used for importing, esp. when an import of an UTF-8
file is done in Araya, as the Java reading functions for UTF-8 does not over read the BOM characters. This
could lead to problems when reading normal strings from those file as the BOMs are read as normal
characters resulting in invalid entries.
(This is a known bug in Java UTF-8 file reading, but will not be corrected by SUN!).

18.10.2009                            (c) 2006, 2007, Heartsome Europe GmbH                                  33
Heartsome Europe GmbH
                                                                       www.heartsome.de


Check and Recode TMX Files                                                       TMX file name


                                                                             Recoded TMX file name;
                                                                                Consists of tmx file
                                                                                 name, any “.tmx”
                                                                             extension removed and
                                                                              “.recoded.tmx” added

                                             Light yellow lines                 Language codes
                     Green lines            indicate language                  supported by Araya
                  indicate language        codes not supported                 Combo box appears
                   codes supported               by Araya                     only in case language
                       by Araya                                               codes in TMX file not
                                                                               supported by Araya




    Start checking          Recode languages
      languages           using new language for
   contained in TMX       the specified TMX File
          File                   language

18.10.2009                     (c) 2006, 2007, Heartsome Europe GmbH                             34
Heartsome Europe GmbH
                                                                       www.heartsome.de


Split TMX File
                                                                      TMX file name


                                                                  Buttons to increase /
                                                                 decrease number of tmx
                                                                      files to create

                                                                  Number of TMX files to
                                                                   create from the given
                                                                          tmx file

       Start splitting       Close Window




  The created tmx files are composed of the original tmx file name plus a number from 1
  till the number of chosen tmx file


18.10.2009                    (c) 2006, 2007, Heartsome Europe GmbH                            35
Heartsome Europe GmbH
                                                                                www.heartsome.de


Merge TMX Files
                                                                                Choose new TMX file name


                                                                                    List of chosen files




                                                                               Remove a file from the file list




             Add a file to the file list




    Start merging                     Close Window


18.10.2009                             (c) 2006, 2007, Heartsome Europe GmbH                                  36
Heartsome Europe GmbH
                                                              www.heartsome.de


Count TUs/TUVs in a TMX File
                                                             Choose TMX file to analyse




                                                                         Result:
                                                         First line: Number of TUs and TUVs
                                                                     Following lines:
                                                         Number of language specific TUVs




   Start counting   Close Window


18.10.2009           (c) 2006, 2007, Heartsome Europe GmbH                                37
Heartsome Europe GmbH
                                                                 www.heartsome.de


Adding Language Codes
    A set of predefined language codes is
    defined in the file „ini/lancodes.txt“.
    Additional language codes can be added by
    modifying this file.       Adding de-DE: This requires to add a line
                               like that – where = separates the name
    Here is an example: displayed an the actual language code
                               German(DE)=de-DE

                                            Galician=gl
                                            German (Standard)=de
                                            German(DE)=de-DE
                                            German (Austrian)=de-at
                                            German (Liechtenstein)=de-li
                                            German (Luxembourg)=de-lu
                                            German (Standard)=de
                                            German (Swiss)=de-ch



18.10.2009              (c) 2006, 2007, Heartsome Europe GmbH                            38
Heartsome Europe GmbH
                                                       www.heartsome.de


Impressum
    Heartsome Europe GmbH
    Friedrichstr. 17
    D-90574 Roßtal

    Email: info@heartsome.de
    www.heartsome.de
    © 2007, 2009 Heartsome Europe GmbH


18.10.2009     (c) 2006, 2007, Heartsome Europe GmbH                           39

More Related Content

More from Klemens Waldhör

Heartsome Europe Xliff Editor User Guide German
Heartsome Europe Xliff Editor User Guide GermanHeartsome Europe Xliff Editor User Guide German
Heartsome Europe Xliff Editor User Guide GermanKlemens Waldhör
 
Heartsome Europe Xliff Editor User Guide English
Heartsome Europe Xliff Editor User Guide EnglishHeartsome Europe Xliff Editor User Guide English
Heartsome Europe Xliff Editor User Guide EnglishKlemens Waldhör
 
Bilingual TMX EditorTool (in German)
Bilingual TMX EditorTool (in German)Bilingual TMX EditorTool (in German)
Bilingual TMX EditorTool (in German)Klemens Waldhör
 
Bilingual Terminology Extraction
Bilingual Terminology ExtractionBilingual Terminology Extraction
Bilingual Terminology ExtractionKlemens Waldhör
 
Vortrag Ostbayrischer Tourismustag2008 Waldhoer
Vortrag Ostbayrischer Tourismustag2008 WaldhoerVortrag Ostbayrischer Tourismustag2008 Waldhoer
Vortrag Ostbayrischer Tourismustag2008 WaldhoerKlemens Waldhör
 

More from Klemens Waldhör (6)

Heartsome Europe Xliff Editor User Guide German
Heartsome Europe Xliff Editor User Guide GermanHeartsome Europe Xliff Editor User Guide German
Heartsome Europe Xliff Editor User Guide German
 
Heartsome Europe Xliff Editor User Guide English
Heartsome Europe Xliff Editor User Guide EnglishHeartsome Europe Xliff Editor User Guide English
Heartsome Europe Xliff Editor User Guide English
 
Bilingual TMX EditorTool (in German)
Bilingual TMX EditorTool (in German)Bilingual TMX EditorTool (in German)
Bilingual TMX EditorTool (in German)
 
Bilingual Terminology Extraction
Bilingual Terminology ExtractionBilingual Terminology Extraction
Bilingual Terminology Extraction
 
Heartsome Portfolio
Heartsome PortfolioHeartsome Portfolio
Heartsome Portfolio
 
Vortrag Ostbayrischer Tourismustag2008 Waldhoer
Vortrag Ostbayrischer Tourismustag2008 WaldhoerVortrag Ostbayrischer Tourismustag2008 Waldhoer
Vortrag Ostbayrischer Tourismustag2008 Waldhoer
 

Recently uploaded

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Bilingual Term Extraction Tool (in English)

  • 1. User Manual Araya Bilingual Term Extraction Description how to use the Araya Terminology Extraction Tool © Dr. Klemens Waldhör, Heartsome Europe GmbH, 2006, 2007, 2009
  • 2. Heartsome Europe GmbH www.heartsome.de Bilingual Extractor The bilingual extractor is a simple to use, but efficient tool to generate automatically term pairs from translated documents (TMX files) A term pair consists of a source and target term A term can consists of several words These term pairs can be used to create a new terminology or add new terminology to an existing terminology database. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 2
  • 3. Heartsome Europe GmbH www.heartsome.de Versions The Extractor was developed by Heartsome Europe GmbH and LNE International. It is available as Single user version. It is part of the Araya Server translation tools. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 3
  • 4. Heartsome Europe GmbH www.heartsome.de Short introduction how to extract Extract Terms File -> Extract Bilingual Terminology from File (Option: Open after extraction) Check extracted terms Mark correct translation as „validated“ Export terms Export validated terms … 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 4
  • 5. Heartsome Europe GmbH www.heartsome.de Installation Installation goes into directory c:/Program Files/Araya. It is recommended not to change this as all initialization files map towards this directory. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 5
  • 6. Heartsome Europe GmbH www.heartsome.de Starting Araya Extraction tool Go to directory: c:/Program Files/Araya Start: BiEdit.exe Or double click : 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 6
  • 7. Heartsome Europe GmbH www.heartsome.de The Extraction Approach Based on a TMX file all possible relevant term pairs are computed. This is based on a statistical approach which determines the frequency of source and target terms. TMX = XML Exchange format for translation memory databases 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 7
  • 8. Heartsome Europe GmbH www.heartsome.de Segment Extraction is based on segments which are stored in a TMX file. A segment can either be a sentence or a whole paragraph. Formats in TMX files are ignored. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 8
  • 9. Heartsome Europe GmbH www.heartsome.de Evaluating and Validating Each found term pair is associated with a quality measure. 2. column of the extraction table Value is between1,0 (highest probability that the term pair is a translation) and 0,5 (lowest probability that the term pair is a translation) Terms can be validated as correct translations. Last column of table Approved = checked = validated Unapproved = not validated Validated terms can be exported. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 9
  • 10. Heartsome Europe GmbH www.heartsome.de Validating a Term Extraction Pair Select line with term extraction pair Validate = approving using Double click term pair Right mouse click Remove validation mark using a double or right mouse click Validated terms appear in green 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 10
  • 11. Heartsome Europe GmbH www.heartsome.de The Extraction file The extraction file has the following format Encoding format is UTF-8! nr;score;status;term1.LangCode;term1.wordGroup;term1.wordGroup Len;term1.wFreq;term2.LangCode;term2.wordGroup;term2.wordGroup Len;term2.wFreq;sentLinked Example line Start line = Marker of extraction file 0;1.0;approved;de;DIN;1;15;en;DIN;1;15;11 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 11
  • 12. Heartsome Europe GmbH www.heartsome.de Extraction User Interface Frequency of source Table: and target terms This window contains the Links source and target pairs found Quality value Term number Source term Target term Validation Status window 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 12
  • 13. Heartsome Europe GmbH www.heartsome.de Columns Value Statistical measure that the source and target term are translation of each other (quality measure) SL Number of segments where both source and target term appear in. Freq 1 Number of segment where source term appears in Freq 2 Number of segment where target term appears in Source term The source term Target term The translation of the source term Validation Check box, for marking correct term pairs 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 13
  • 14. Heartsome Europe GmbH www.heartsome.de File Menu 1 Open an extraction file Close an extraction file Save an open extraction file Save extraction file with a new name Save validated terms in a new extraction file Save selected terms in a new extraction file The opened file is shown in the title bar. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 14
  • 15. Heartsome Europe GmbH www.heartsome.de File Menu 2 Extract term pairs from a TMX file Export terms from open extraction file into TBX Format Export validated Export terms from open Export validated terms terms from open extraction file into from open extraction file extraction file into Multiterm Format into TBX Format Multiterm Format 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 15
  • 16. Heartsome Europe GmbH www.heartsome.de File Menu 3 Export entries from open extraction file in Araya CSV Format Export validated entries from open extraction file in Araya CSV Format 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 16
  • 17. Heartsome Europe GmbH www.heartsome.de Extracting Term Pairs from TMX File Source language Target language Open after extraction Use Lucene Tokenizer for word detecion 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 17
  • 18. Heartsome Europe GmbH www.heartsome.de Extraction Parameters 1 Minimum / Maximum words Controls how many words should be contained min/max in the found term pair Minimum / Maximum frequency Controls how often the term should appear min/max for the found term pair Maximum Translations Controls how many translation should be found at maximum Source/Target terms in lower case Controls if source and/or target terms should be converted to lower case 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 18
  • 19. Heartsome Europe GmbH www.heartsome.de Extraction Parameters 2 Validated terminology to ignore If a terminology extraction file is specified here, all terms which are marked as “validated” will be ignored. Thru this know translations are ignored. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 19
  • 20. Heartsome Europe GmbH www.heartsome.de Extraction Parameters 3 Upper / Lowercase Not valid in combination with option Lucene-Tokenizer! If this option is selected for one or both languages the segment is converted to lowercase. Segment Paragraphs If the segment consists of several sentences the segment is “sentence segmented”. This only applies if source and target segment have the same number of sentences. Lucene Tokenizer Term extraction creates words from segments in a first step. For CJK languages the usage of the Lucene tokenizer is recommended, as it uses additional criteria for detecting word boundaries. It should be noted that the tokenizer is applied to both languages. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 20
  • 21. Heartsome Europe GmbH www.heartsome.de Extraction Parameters 4 Count TMX TU Entries Count the number of TU entries in the TMX file. Number of analysis blocks Currently not used. It is planed to use this feature for splitting up the tmx file in several sub-buckets for testing the stability of the extraction results. Remove identical terms If one uses identical source and target language this option removes trivially found identical source and target terms. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 21
  • 22. Heartsome Europe GmbH www.heartsome.de Exporting Exporting can be done in different formats. TBX Name of extraction file + „.tbx“ Multiterm (™ of Trados/SDL International) Name of extraction file + „.multiterm“ Araya CSV Name of extraction file + „araya.csv“ Character encoding is always UTF-8 Either all or only the validated entries can be exported In addition the selection filter (Options -> Export Score Filter) controls the exported terms Depending on chosen value only the term pairs with a minimum score get exported (e.g. score higher than 0.6). 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 22
  • 23. Heartsome Europe GmbH www.heartsome.de Araya CSV Format Araya CSV Format contains the languages in the first line followed by the extracted terms Beispiel Languages separated by ; de;en Anschlussplan;Connection diagram DIN;DIN Dr;Dr Extracted terms separated by ; 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 23
  • 24. Heartsome Europe GmbH www.heartsome.de The Options Menu 1 Character set for the extraction table Settings for minimum quality for the export of term pairs User Interface Language 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 24
  • 25. Heartsome Europe GmbH www.heartsome.de The Options Menu 2 Remove all term entries from the term table contained in term extraction file Remove all term entries from the term table contained in term extraction file which are approved Search Terms in the extraction table Complex search for based on source and target term Deselects all selected in the Select Terms based on a word extraction table list in a file 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 25
  • 26. Heartsome Europe GmbH www.heartsome.de Filter Term Functions The filter functions filters all those term entries which are contained in another term extraction file. The identical terms are removed from the term table. Depending on the chosen filter method the approved or all entries are used from the specified term extraction file. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 26
  • 27. Heartsome Europe GmbH www.heartsome.de Search Term Functions Search this term and select all the matching entries in the table. The selected term candidates can then be saved with “File -> Save Selected Terms into Extraction File…” Find the next matching term Start searching this term 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 27
  • 28. Heartsome Europe GmbH www.heartsome.de Advanced Search Term Functions Use regular expressions for searching term entries Search this term and select all the matching entries in the table. The selected term candidates can then be saved with “File -> Save Selected Terms into Extraction File…” Find the next matching terms Start searching these combination of source and target terms 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 28
  • 29. Heartsome Europe GmbH www.heartsome.de Select Terms Based on a Word List The filter functions selects all the matching terms from an UTF-8 encoded file. The filter search the given string in both source and target language string It also matches substrings The selected term candidates can then be saved with “File -> Save Selected Terms into Extraction File…” Each word (string) in the file must be written on a single line. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 29
  • 30. Heartsome Europe GmbH www.heartsome.de The Plugins Menu Conversion between character sets Check and recode languages in TMX files Split TMX files in several TMX file Count TU and TUV entries Merge several TMX files in in TMX file one TMX file Display of Araya Log- File 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 30
  • 31. Heartsome Europe GmbH www.heartsome.de The Help Menu Display user manual Araya / Heartsome Homepage About the bilingual extraction 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 31
  • 32. Heartsome Europe GmbH www.heartsome.de About the Extraction Basic information about the Araya configuration is shown here. This can be important for error diagnosis Using „Save…“ the information is stored in the specified file. 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 32
  • 33. Heartsome Europe GmbH www.heartsome.de Converting Files between different Character Encodings Source file name Target file; will be created based on source file name and as extension the target encoding character set Source encoding character set Target encoding character set This function supports converting files between different character sets. Depending on the target character up to two files are written. If the target file is a UTF-8, 16, 32 or UCS file, two files are written. The second file with the extension „.nobom“ is a copy of the first target file, the only difference is that the BOM (Byte Order Marks) are removed from this file. This file should be used for importing, esp. when an import of an UTF-8 file is done in Araya, as the Java reading functions for UTF-8 does not over read the BOM characters. This could lead to problems when reading normal strings from those file as the BOMs are read as normal characters resulting in invalid entries. (This is a known bug in Java UTF-8 file reading, but will not be corrected by SUN!). 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 33
  • 34. Heartsome Europe GmbH www.heartsome.de Check and Recode TMX Files TMX file name Recoded TMX file name; Consists of tmx file name, any “.tmx” extension removed and “.recoded.tmx” added Light yellow lines Language codes Green lines indicate language supported by Araya indicate language codes not supported Combo box appears codes supported by Araya only in case language by Araya codes in TMX file not supported by Araya Start checking Recode languages languages using new language for contained in TMX the specified TMX File File language 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 34
  • 35. Heartsome Europe GmbH www.heartsome.de Split TMX File TMX file name Buttons to increase / decrease number of tmx files to create Number of TMX files to create from the given tmx file Start splitting Close Window The created tmx files are composed of the original tmx file name plus a number from 1 till the number of chosen tmx file 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 35
  • 36. Heartsome Europe GmbH www.heartsome.de Merge TMX Files Choose new TMX file name List of chosen files Remove a file from the file list Add a file to the file list Start merging Close Window 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 36
  • 37. Heartsome Europe GmbH www.heartsome.de Count TUs/TUVs in a TMX File Choose TMX file to analyse Result: First line: Number of TUs and TUVs Following lines: Number of language specific TUVs Start counting Close Window 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 37
  • 38. Heartsome Europe GmbH www.heartsome.de Adding Language Codes A set of predefined language codes is defined in the file „ini/lancodes.txt“. Additional language codes can be added by modifying this file. Adding de-DE: This requires to add a line like that – where = separates the name Here is an example: displayed an the actual language code German(DE)=de-DE Galician=gl German (Standard)=de German(DE)=de-DE German (Austrian)=de-at German (Liechtenstein)=de-li German (Luxembourg)=de-lu German (Standard)=de German (Swiss)=de-ch 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 38
  • 39. Heartsome Europe GmbH www.heartsome.de Impressum Heartsome Europe GmbH Friedrichstr. 17 D-90574 Roßtal Email: info@heartsome.de www.heartsome.de © 2007, 2009 Heartsome Europe GmbH 18.10.2009 (c) 2006, 2007, Heartsome Europe GmbH 39