OpenTMS Software Architecture
The paper describes the basic architecture of the open source translation memory system OpenTMS.


    • Software Architecture 04/2008/KW
      OPENTMS SOFTWARE ARCHITECTURE
      Roßtal, 29/08/2008
      Doc. Nr.: HEA-1.1-2008, Version 1.3
      Author: Dr. Klemens Waldhör / klemens.waldhoer@heartsome.de
      Location: OpenTMS_Software_Architecure_v1.3.doc
      www.folt.de
    • 1 VERSIONING INFORMATION
      • V0.1 – April/May/June 2008: Start version; Klemens Waldhör, Heartsome Europe; TOSS_Software_Architecure.doc
      • V1.0 – 05.08.2008: Initial version; Klemens Waldhör, Heartsome Europe; based on discussion with Michael Schneider, beodoc, 04.07.2008; OpenTMS_Software_Architecure_v1.0.doc
      • V1.1 – 30.08.2008: Modifications based on the FOLT internal architecture discussion meeting, 29.08.2008, Acolada GmbH, Nürnberg. Participants: Ulrike Baral, beodoc; Torsten Kuprat; Michael Schneider, beodoc; Klemens Waldhör, Heartsome Europe; Thomas Wedde, euroscript; OpenTMS_Software_Architecure_v1.1.doc
      Dok. Nr.: HEA-1-2008; Version 1.0; April/May/June/August 2008
    • 2 PREFACE
      This manual gives an overview of the OpenTMS software architecture. It is based on the requirements defined in the FOLT Open Source Initiative (Folt, 2007b).
      The architecture of OpenTMS is built on several models, which describe the key components of OpenTMS. Each model handles a specific aspect of the translation process and its requirements. Together the models form a framework which guides the construction of language specific software tools. The following core models are identified:
      • Security model
      • Document model
      • Process model
      • User model
      • Data model
      • GUI model
      • Interface model
      On top of those models, the application model organises real applications (the GUI model is one example). OpenTMS uses a data source in the data model, which organises access to a database or to any kind of device that can store (TM or terminology) data.
      The architecture also describes some basic functions which can form the core of translation tools. It is defined in such a way that it can easily be extended with new functions or by combining existing functions into new functionality.
    • CONTENTS
      1 VERSIONING INFORMATION
      2 PREFACE
      3 LIST OF TABLES AND FIGURES
      4 DEFINITIONS
      5 INTRODUCTION
        5.1 Arguments for an OpenTMS Software Architecture
        5.2 Basics
          5.2.1 Naming conventions
          5.2.2 Naming of OpenTMS specific functions/methods
        5.3 Character set
        5.4 Standards
        5.5 Basic Requirements
        5.6 Architecture
      6 OPENTMS ARCHITECTURE AND MODELS
        6.1 Parameters in OpenTMS models
        6.2 Core Models of OpenTMS
        6.3 OpenTMS Core Library
        6.4 The Application Model
        6.5 Implementation Languages
      7 SECURITY MODEL
        7.1 Security, OpenTMS and Programming Languages
        7.2 Communication Level
        7.3 Document Level
        7.4 Database Level
        7.5 Security Level
      8 BASIC OPENTMS COMPONENTS
      9 DOCUMENT MODEL
        9.1 Documents
        9.2 Character Sets
        9.3 XML document handling
        9.4 XLIFF Documents
          9.4.1 OpenTMS and Skeleton files
          9.4.2 Security and encryption in XLIFF – secureXLIFF
        9.5 TMX Documents
          9.5.1 Security and encryption in TMX – secureTMX
        9.6 TBX Documents
          9.6.1 Security and encryption in TBX – secure TBX
        9.7 Other Documents
        9.8 Basic Document Access Functionality
      10 OPENTMS AS A CLIENT/SERVER ARCHITECTURE
      11 DATA MODEL
        11.1 Data sources
        11.2 TM Matches
        11.3 Basic data source access functionality
        11.4 Databases
          11.4.1 Open source SQL databases
          11.4.2 Closed source SQL databases
          11.4.3 Alternatives
          11.4.4 Database Access
          11.4.5 Database and data source configuration
      12 TRANSLATION OBJECTS
        12.1 Format information
        12.2 Terminology versus Translation Memory
        12.3 Variables, placeholders, replacement classes
      13 PROCESS MODEL
        13.1 OpenTMS Process
        13.2 OpenTMS Scripting Language
        13.3 OpenTMSL Communication Methods
      14 USER MODEL
        14.1 User roles
        14.2 Basic user functionality
      15 GUI MODEL
      16 INTERFACE MODEL
      17 CONFIGURING OPENTMS
        17.1 Naming of the configuration file
        17.2 Structure of the configuration file
        17.3 Configuration Options
      18 DMS INTERFACE
      19 BIBLIOGRAPHY
      20 APPENDIX
        20.1 Multiple translations for a linguistic concept
    • 3 LIST OF TABLES AND FIGURES
      Fig 1: OpenTMSName defined as a regular expression
      Fig 2: Naming of OpenTMS functions for export
      Fig 3: OpenTMS Procedure description
      Fig 4: OpenTMS Models
      Fig 5: Example securing XLIFF document exchange
      Fig 6: OpenTMS Objects
      Fig 7: XLIFF File
      Fig 8: Some basic XLIFF File functions
      Fig 9: Hierarchy of processes
      Fig 10: Applications
      Fig 11: Pipeline Architecture
      Fig 12: Data sources and data components
      Fig 13: Data sources with several data components
      Fig 14: Data source access types
      Fig 15: Data source access types
      Fig 16: Configuring different database types
      Fig 17: Representation of linguistic entities as General Linguistic Object
      Fig 18: Conversions of linguistic entities
      Fig 19: OpenTMS Scripting Language
      Fig 20: OpenTMSL Inter-process and computer communication
      Fig 21: Some basic user functions
      Fig 22: Configuration of OpenTMS
      Fig 23: Configuration file naming example
      Fig 24: Configuration option structure
      Fig 25: OpenTMS options table
    • 4 DEFINITIONS
      Client: A client is an application or system that accesses a (remote) service on another computer system, known as a server, by way of a network. URL: http://en.wikipedia.org/wiki/Client_%28computing%29
      Client-Server: Client-server is a computing architecture which separates a client from a server, and is almost always implemented over a computer network. A client-server application is a distributed system that consists of both client and server software. A client is a piece of software or a process that may initiate a communication session, while a server cannot initiate sessions but waits for requests from clients. The terms client and server may also refer to the host computers connected to a network on which the client and server software respectively reside. URL: http://en.wikipedia.org/wiki/Client-server
      Doclet: By analogy with applets, doclets are modules used by documentation tools to process and automatically generate documentation and possibly also code. Doclets are best known in the context of the Java programming language, where they are used as modules in the documentation tool Javadoc. URL: http://de.wikipedia.org/wiki/Doclet
      GUI: Graphical User Interface. An application which allows a human user to interact with a program through windows, menus etc. “A graphical user interface (GUI) (IPA: /ˈguːiː/) is a type of user interface which allows people to interact with electronic devices like computers, hand-held devices (MP3 Players, Portable Media Players, Gaming devices), household appliances and office equipment. A GUI offers graphical icons, and visual indicators as opposed to text-based interfaces, typed command labels or text navigation to fully represent the information and actions available to a user. The actions are usually performed through direct manipulation of the graphical elements.” URL: http://en.wikipedia.org/wiki/GUI
      FOLT: Forum Open Language Tools. URL: www.folt.org
      HTTP: Hypertext Transfer Protocol (HTTP) is a communications protocol for the transfer of information on intranets and the World Wide Web. Its original purpose was to provide a way to publish and retrieve hypertext pages over the Internet. URL: http://en.wikipedia.org/wiki/HTTP
      HTTPS: Hypertext Transfer Protocol over Secure Socket Layer or HTTPS is a URI scheme used to indicate a secure HTTP connection. It is syntactically identical to the http:// scheme normally used for accessing resources using HTTP. Using an https: URL indicates that HTTP is to be used, but with a different default TCP port (443) and an additional encryption/authentication layer between HTTP and TCP. This system was designed by Netscape Communications Corporation to provide authentication and encrypted communication and is widely used on the World Wide Web for security-sensitive communication such as payment transactions and corporate logons. URL: http://en.wikipedia.org/wiki/Https
      Open Source: Open source is a development methodology which offers practical accessibility to a product's source (goods and knowledge). Some consider open source as one of various possible design approaches, while others consider it a critical strategic element of their operations. Before open source became widely adopted, developers and producers used a variety of phrases to describe the concept; the term open source gained popularity with the rise of the Internet, which provided access to diverse production models, communication paths, and interactive communities. The open source model of operation and decision making allows concurrent input of different agendas, approaches and priorities, and differs from the more closed, centralized models of development. The principles and practices are commonly applied to the development of source code for software that is made available for public collaboration, and it is usually released as open-source software. URL: http://en.wikipedia.org/wiki/Open_source
      RPC: Remote procedure call (RPC) is a technology that allows a computer program to cause a subroutine or procedure to execute in another address space (commonly on another computer on a shared network) without the programmer explicitly coding the details for this remote interaction. That is, the programmer would write essentially the same code whether the subroutine is local to the executing program, or remote. When the software in question is written using object-oriented principles, RPC may be referred to as remote invocation or remote method invocation. URL: http://en.wikipedia.org/wiki/Remote_procedure_call
    • Server: In information technology, a server is an application or device that performs services for connected clients as part of a client-server architecture. A server application, as defined by RFC 2616 (HTTP/1.1), is "an application program that accepts connections in order to service requests by sending back responses." Server computers are devices designed to run such an application or applications, often for extended periods of time with minimal human direction. Examples of servers include web servers, e-mail servers, and file servers. URL: http://en.wikipedia.org/wiki/Server_%28computing%29
      Software Architecture: The software architecture of a program or computing system is the structure or structures of the system, which comprise software components, the externally visible properties of those components, and the relationships between them. The term also refers to documentation of a system's software architecture. Documenting software architecture facilitates communication between stakeholders, documents early decisions about high-level design, and allows reuse of design components and patterns between projects. URL: http://en.wikipedia.org/wiki/Software_architecture
      TOMCAT: Apache Tomcat is a Servlet container developed by the Apache Software Foundation (ASF). Tomcat implements the Java Servlet and the JavaServer Pages (JSP) specifications from Sun Microsystems, and provides a "pure Java" HTTP web server environment for Java code to run. … Apache Tomcat includes tools for configuration and management, but can also be configured by editing configuration files that are normally XML-formatted. URL: http://en.wikipedia.org/wiki/Apache_Tomcat
      UML (Unified Modeling Language): In the field of software engineering, the Unified / Universal Modeling Language (UML) is a standardized visual specification language for object modeling. UML is a general-purpose modeling language that includes a graphical notation used to create an abstract model of a system, referred to as a UML model. UML is officially defined at the Object Management Group (OMG) by the UML metamodel, a Meta-Object Facility metamodel (MOF). Like other MOF-based specifications, UML has allowed software developers to concentrate more on design and architecture. URL: http://en.wikipedia.org/wiki/Unified_Modeling_Language
    • Unicode: In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts). URL: http://en.wikipedia.org/wiki/Unicode
      UTF-8: UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed. URL: http://en.wikipedia.org/wiki/UTF-8
      XML-RPC: XML-RPC is a remote procedure call protocol which uses XML to encode its calls and HTTP as a transport mechanism. URL: http://en.wikipedia.org/wiki/Xml-rpc
    • 5 INTRODUCTION
      5.1 Arguments for an OpenTMS Software Architecture
      The arguments for an open source based localization tool have been discussed in FOLT, 2007a. Software design principles:
      • For end users (translators): easy to install
      • For translation providers: server version, networking
      • For customers: running own servers; secure interfaces
      5.2 Basics
      5.2.1 Naming conventions
      OpenTMS uses a standardized naming scheme for variables, names in XML files etc.
      Each legal OpenTMS name (string, literal, variable name, function name) consists of one or more words. Variables start with an uppercase letter. Function names (e.g. identifying processes) start with a lowercase letter. Every word after the first starts with one of the characters [A-Z]. The remaining characters of a word are either [a-z] or [0-9]. No blanks are allowed between words.
        Word := [A-Z]([a-z]|[0-9])*
        word := [a-z]([a-z]|[0-9])*
        OpenTMSName := Word+
        OpenTMSFunctionName := word Word*
      Examples:
      • The variable: XliffDocument
      • The function: openXliffDocument
      Fig 1: OpenTMSName defined as a regular expression
      Exceptions from the naming conventions could be introduced if acronyms etc. are used for words (e.g. TMX). Nevertheless, this is not recommended.
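The grammar in Fig 1 translates almost directly into regular expressions. The following is a minimal sketch in Java; the class name `OpenTmsNames` and its two methods are illustrative assumptions and are not defined by the architecture document.

```java
import java.util.regex.Pattern;

/** Illustrative check of the naming grammar in Fig 1; all names here are hypothetical. */
final class OpenTmsNames {
    // Word := [A-Z]([a-z]|[0-9])*
    private static final String WORD = "[A-Z][a-z0-9]*";
    // word := [a-z]([a-z]|[0-9])*
    private static final String LOWER_WORD = "[a-z][a-z0-9]*";

    // OpenTMSName := Word+
    private static final Pattern NAME = Pattern.compile("(?:" + WORD + ")+");
    // OpenTMSFunctionName := word Word*
    private static final Pattern FUNCTION_NAME =
            Pattern.compile(LOWER_WORD + "(?:" + WORD + ")*");

    /** True if the whole string is a legal variable-style name (starts with uppercase). */
    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    /** True if the whole string is a legal function name (starts with lowercase). */
    static boolean isFunctionName(String s) {
        return FUNCTION_NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isName("XliffDocument"));
        System.out.println(isFunctionName("openXliffDocument"));
        System.out.println(isName("xliff document"));
    }
}
```

Because `matches()` anchors the pattern to the whole input, blanks and other characters outside the grammar are rejected without extra code.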
    • 5.2.2 Naming of OpenTMS specific functions/methods
      A consistent OpenTMS naming system is suggested for functions and variables which are exported from OpenTMS. Exported functions are functions which can be used in applications (similar to the public concept in Java or C++). This immediately helps to identify code which is used in systems outside of OpenTMS. The special string “OpenTMS_” is used for this purpose.
        ExportOpenTMSName := “OpenTMS_” Word+
        ExportOpenTMSFunctionName := “OpenTMS_” word Word*
      Examples:
      • The variable: OpenTMS_Encoding
      • The function: OpenTMS_openXliffDocument
      Fig 2: Naming of OpenTMS functions for export
      5.3 Character set
      OpenTMS uses UTF-8 as its basic character set, especially for exchanging files.
      5.4 Standards
      FOLT builds heavily on the idea of open source and on the use of standards. Therefore the FOLT requirements use well-established, XML based localization standards to represent various types of localization information.
      • XLIFF - XML based localization exchange format
      • TTX – Trados TM format
      • TMX - XML based localization translation memory exchange format
      • SRX - XML based format for describing segmentation rules
      • GMX – standard for measuring quantitative aspects in the translation process
      • TBX / MARTIF / OLIF – formats for representing terminology
      • CSV
      • Language Encoding ISO 639…
      In general the basic architecture makes heavy use of XML. XML based structures are used as the basic mechanism to exchange information between different applications (->Translets). Using XML has the advantage that many (open source) parsers are available for different programming languages, which makes it possible to implement the core OpenTMS architecture in different languages and environments.
      5.5 Basic Requirements
      The following main requirements are taken from FOLT (2007b):
      • Software: Web based application; thin client; no installation; no proprietary run-time components; open source software preferred (FOLT, 2007b, p. 17)
      • Operating System: OS independent
      • Hardware: standard hardware (FOLT, 2007b, p. 17)
      • Interfaces: Integration into CMS and workflow management should be supported (FOLT, 2007b, p. 17).
      • Product interfaces: Exchange supported through XLIFF and TMX (FOLT, 2007b, p. 18).
      • Database: Open source database (FOLT, 2007b, p. 21); basically all SQL databases should be supported, therefore a generic database interface is required.
      • Scalability: single and multi user requirement
      5.6 Architecture
      The architecture is described mainly in diagrams and text. This document is aimed mainly at non-technical readers, so it is kept as informal as possible without losing the necessary precision. Further documents or versions of this document may add more details to the various items discussed. Where possible, the basic methods and classes are written in Java, but this does not imply that an implementation requires Java as its implementation language.
      The various components described in the document are called models. A model organizes a certain functionality or aspect of the OpenTMS system. An example of a model is the security model of OpenTMS. This model describes all functions and structures necessary to implement the OpenTMS security system.
      There are several methods to describe the architecture, methods and objects of a piece of software. Within this document mainly diagrams and block diagrams are used to show the structure of the software. For describing methods and objects an XML based methodology is used (taken from Tomcat).
      The following is an example of a method call description using the Tomcat interface description. The description method will be extended to cover the possible return values as well.
        <translet>
          <translet-name>ApplyTranslationMemoryToSegment</translet-name>
          <translet-class>com.OpenTMS.translet.translateSegment</translet-class>
          <init-param>
            <param-name>TMXDB</param-name>
            <param-value>OpenTMSexampledatabase</param-value>
          </init-param>
          <init-param>
            <param-name>SEGMENT</param-name>
            <param-value>This segment needs to be translated.</param-value>
          </init-param>
          <init-param>
            <param-name>FUZZYQUALITY</param-name>
            <param-value>70</param-value>
          </init-param>
        </translet>
      Fig 3: OpenTMS Procedure description
      Annotation: To keep the text more compact, function names in this document do not follow the naming scheme described in chapter 5.2.2. This is just for readability; the real implementation should adhere to the naming scheme.
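A description like the one in Fig 3 can be read with any standard XML parser. The sketch below uses the DOM parser shipped with the Java platform to pull the translet name, class and init parameters into a map. The class `TransletDescriptor` and its fields are assumptions made for illustration; the document does not define such a reader.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustrative reader for a translet description; not part of any OpenTMS API. */
final class TransletDescriptor {
    /** The example from Fig 3, written as a Java string. */
    static final String SAMPLE =
          "<translet>"
        + "<translet-name>ApplyTranslationMemoryToSegment</translet-name>"
        + "<translet-class>com.OpenTMS.translet.translateSegment</translet-class>"
        + "<init-param><param-name> TMXDB </param-name>"
        + "<param-value> OpenTMSexampledatabase </param-value></init-param>"
        + "<init-param><param-name> SEGMENT </param-name>"
        + "<param-value> This segment needs to be translated. </param-value></init-param>"
        + "<init-param><param-name> FUZZYQUALITY </param-name>"
        + "<param-value> 70 </param-value></init-param>"
        + "</translet>";

    final String name;
    final String className;
    final Map<String, String> params = new LinkedHashMap<>();

    private TransletDescriptor(String name, String className) {
        this.name = name;
        this.className = className;
    }

    /** Parses the XML text; any parser failure is reported as an unchecked exception. */
    static TransletDescriptor parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            TransletDescriptor d = new TransletDescriptor(
                    childText(root, "translet-name"), childText(root, "translet-class"));
            NodeList inits = root.getElementsByTagName("init-param");
            for (int i = 0; i < inits.getLength(); i++) {
                Element init = (Element) inits.item(i);
                d.params.put(childText(init, "param-name"), childText(init, "param-value"));
            }
            return d;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable translet description", e);
        }
    }

    /** Returns the trimmed text of the first descendant element with the given tag. */
    private static String childText(Element parent, String tag) {
        Node n = parent.getElementsByTagName(tag).item(0);
        if (n == null) {
            throw new IllegalArgumentException("Missing element: " + tag);
        }
        return n.getTextContent().trim();
    }
}
```

All values arrive as strings, so a caller would still have to convert `FUZZYQUALITY` to a number itself; trimming is applied because the original example pads names and values with blanks.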
    • 6 OPENTMS ARCHITECTURE AND MODELS
      The OpenTMS architecture is composed of several models. Each model implements a specific aspect and behavior of the OpenTMS system. The models communicate with each other through parameters and values.
      6.1 Parameters in OpenTMS models
      Realizing parameters, and especially their types, independently of a specific programming language is not trivial, apart from simple types like characters, strings, integers or other numbers. More complex structured information has to be transferred on the basis of those primitive types. Programming languages typically use “serialization” approaches to transfer data from one application instance to another.
      OpenTMS tries to use a general parameter / value model which addresses both programming language specific and programming language independent parameter / value transfer. To make the integration of existing applications possible, OpenTMS supports different options for parameter representation.
      The following methods should be supported:
      • XML based parameters: All values are transferred through XML elements, where the value is given by the element content (string), and the name and the type of the parameter are each given as an attribute. XML based parameter / value transfer is especially useful when transferring complex structured values (e.g. objects) between functions. Nevertheless, complex parameters (objects) need to be serialized. It is suggested that OpenTMS defines some additional basic parameter types which often occur in translation tools (e.g. date type, TransUnits from XLIFF, tu or tuv elements in TMX).
      • Tomcat parameters: This follows the way the Tomcat server engine defines method calls with parameter values. It is also XML based.
      • XML-RPC parameters: This follows the way XML-RPC defines method calls with parameter values. It supports some basic types like integer etc. More complex parameters have to be serialized.
      • Programming language specific parameters: These parameters should be wrapped in a specific object through serialization. This parameter type should only be used within a specific implementation where it is very unlikely that it will be used by other programming languages.
      • Hash tables: Hash tables are supported by most programming languages, and transfer to and from databases is often supported. An entry in the table contains a key (the name of the parameter) and the value of the parameter (the value of the key).
      The kernel of each language specific OpenTMS implementation contains a basic library which supports creating, reading and writing OpenTMS parameters.
        Type        Comment
        int         Integer as in Java
        float       Float as in Java
        char        Character as in Java
        String      String as in Java
        Time        Date
        TransUnit   XML based XLIFF TransUnit structure
        tu          XML based TMX tu structure
        GLO         General Linguistic Object - see chapter 12
        MoLo        Monolingual Object - see chapter 12
        Mulo        Multilingual Object - see chapter 12
      Fig 4: Table of Core OpenTMS parameter types
      An example of how parameters are used is given in Fig 3.
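The XML based and hash table representations described in 6.1 could look roughly like this in Java. This is a sketch under stated assumptions: the element name `parameter`, the attribute names `name` and `type`, and the class `OpenTmsParameter` are hypothetical, since the document only says that the value is the element content and that name and type are attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parameter holder; element and attribute names are assumptions. */
final class OpenTmsParameter {
    final String name;
    final String type;   // e.g. "int", "String", as listed in the table of core types
    final String value;  // always carried as a string, as in the XML based variant

    OpenTmsParameter(String name, String type, String value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }

    /** XML based form: value as element content, name and type as attributes. */
    String toXml() {
        return "<parameter name=\"" + escape(name) + "\" type=\"" + escape(type) + "\">"
                + escape(value) + "</parameter>";
    }

    /** Hash table form: key is the parameter name, value is the parameter value. */
    static Map<String, String> toHashTable(OpenTmsParameter... parameters) {
        Map<String, String> table = new LinkedHashMap<>();
        for (OpenTmsParameter p : parameters) {
            table.put(p.name, p.value);
        }
        return table;
    }

    /** Escapes the characters that are special in XML content and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")   // must come first so later entities stay intact
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

Note that the hash table form drops the type information, which is one reason a shared parameter library in the kernel of each implementation is useful.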
    • 6.2 Core Models of OpenTMS
      This chapter describes the core models of OpenTMS. The key idea is that OpenTMS uses an extensible architecture, which allows new models to be added to the kernel architecture in an easy yet compatible way. A new model has to fulfill some basic requirements, e.g. that parameters are defined and used as described in chapter 6.1.
      Fig 5: OpenTMS Models and their relations
      The OpenTMS models are arranged in a kind of “onion model”. The kernel is represented by the process model, which in turn builds on the user, document and data models, each of which models a specific aspect of the OpenTMS system. These kernel models are “shielded” by the security model, which is responsible for ensuring that only allowed operations are performed.
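One way to picture the shielding layer of the onion model is a wrapper that every call into the kernel has to pass through. The sketch below is only an illustration of that idea: `SecurityShield`, the notion of a per-user set of allowed operation names, and the operation name used in the usage note are all assumptions, not part of the OpenTMS specification.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Illustrative outer layer of the onion model; all names are hypothetical. */
final class SecurityShield {
    // user name -> names of the operations this user may perform
    private final Map<String, Set<String>> allowedOperations;

    SecurityShield(Map<String, Set<String>> allowedOperations) {
        this.allowedOperations = allowedOperations;
    }

    /**
     * Runs the given kernel process only if the user is allowed to perform the
     * named operation; otherwise the process is never started.
     */
    <T> T invoke(String user, String operation, Supplier<T> process) {
        Set<String> operations = allowedOperations.get(user);
        if (operations == null || !operations.contains(operation)) {
            throw new SecurityException(user + " may not perform " + operation);
        }
        return process.get();
    }
}
```

A caller would wrap each kernel function, for example `shield.invoke("translator", "searchTranslationMemory", () -> search(segment))`, where `search` stands for whatever process model function is being protected. The check happens before the process runs, so a denied call has no side effects on the data or document models.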
    • • Security Model: This model describes the security aspects and requirements of OpenTMS. Other models use the security model to allow or restrict access to OpenTMS specific functions. The security model secures the communication channel on the one hand and the data on the other (e.g. the value of elements in an XML file or the values in a property file).
      • User Model: This model represents the user in OpenTMS. The user model works in close connection with the security model. “User” does not only mean human users, but also other processes. User models have rights attached to them, which in turn support the security model of OpenTMS.
      • Process Model: This model implements the functions of OpenTMS (which are finally combined into applications, see application model), e.g. a converter or a translation memory search.
      • Data Model: Basically this model implements the database side of OpenTMS. It uses a generalized database model, called data sources. Data sources are any kind of storage medium for data, ranging from plain text files to SQL and other types of databases.
      • Document Model: The document model describes the core documents used in OpenTMS. Basically this is based on XLIFF and TMX. The document model could also be seen as part of the data model, but because documents are one of the core outputs of the translation and localization process, they are modeled separately.
      • GUI Model: This model specifies editors and other functionality which requires a GUI. The GUI model is not detailed further in this architecture specification; it should be defined in a separate document.
      • Interface Model: This model describes how to extend OpenTMS with new models. The interface model is an abstract model and needs further inspection. An example of such an extension is the interface to CMS systems. Interface models are also important because they serve as the connection to other applications (e.g. web servers, CMS systems) and in general to scripting languages like Perl, PHP etc.
      • Application Model: This model realizes programs which perform tasks like translation etc.
      6.3 OpenTMS Core Library
      To achieve a consistent implementation and to foster quick implementation, OpenTMS implements its key functions in a core library. Functions implemented in the core library should not be re-implemented (“reinvented”) in external functions or processes. Obviously the set of key functions will evolve over time. The functionality and implementation of the core should not be changed without important reasons (similar to the LINUX implementation process).
      By using a core library, OpenTMS ensures that certain functions behave in the same way across applications. It also assures developers and users that functionality does not change unpredictably.
      Core library functions should be the first to be realized when OpenTMS is implemented in a different programming language.
      6.4 The Application Model
      The OpenTMS architecture just serves as a model of how the different aspects of tools supporting the translation process can be implemented. As a model it is independent of any programming language.
      Applications need to be written to make the functionality of OpenTMS accessible to users. This is realized in the application model. The GUI model can be seen as an example of an application model.
      Applications obviously depend on a concrete implementation in an existing programming language (Java, C#, Perl or whatever). In this sense OpenTMS provides a programming framework which allows language support tools to be constructed.
In the beginning OpenTMS will come with some basic applications (editors etc.). But the main idea is that a sound framework is defined and specified which allows the construction of new language applications.

OpenTMS also supports its own scripting language (OpenTMSL). This language makes the OpenTMS functions accessible through simple calls (similar to batch files). The scripting language can also be used to construct applications.

6.5 Implementation Languages

As a first step it is suggested to implement a Java version of OpenTMS. Compared to other languages, Java has the advantage that it runs on several operating systems (which is one of the goals of FOLT and OpenTMS). Tools written in other languages can be integrated because OpenTMS is designed, from its basic model onwards, around XML-RPC and similar communication modes.

The basic Java implementation can serve as the basis for other implementations (C, C#, C++, Perl, PHP etc.).

For security issues associated with choosing a suitable programming language, see chapter 7.
7 SECURITY MODEL

A key success factor of the OpenTMS system is security. As translation can always involve documents of various security levels, proper handling of the documents and of document transmission is required. Depending on the security level, data can be encoded/encrypted. It is suggested to use three different levels, plus an optional GUI-related level:

• Level 0: No security procedures are applied; data are transferred as they are.
• Level 1: The communication channel is secured, using standard secure protocols.
• Level 2: Security is applied at the data level. Basically this means that strings are encrypted when they are sent through a communication channel or are written to or retrieved from a database. This also involves encrypted XLIFF files (or parts of them).
• Level 4: GUI-level security.

Levels 1 and 2 can be used together to achieve optimal security where necessary.

Security is attached to the OpenTMS user model.

A key feature of the OpenTMS architecture is that the security model is transparent. When writing a (new) application, the programmer does not need to take care of the security aspect. The OpenTMS kernel provides all the functions and interfaces to make those calls transparent; supplying the correct parameters is sufficient.

A further type of security level (Level 4) can be introduced at GUI level. At this level functions like copy and paste are secured in addition. This should prevent users from copying and pasting the content of text windows (editing windows) into other applications. Defining this security level is left to the GUI model definition.
The following diagram shows how several methods can be combined to achieve high security during the transmission of an XLIFF file. In this example the XLIFF file is first secured (encrypted). Once the file has to be transferred over the network, the channel as such is also secured. Once the XLIFF file is received, it is decoded by the OpenTMS system. From a programmatic point of view this is realised simply by setting and defining the security to be used.

Fig 6: Example securing XLIFF document exchange

7.1 Security, OpenTMS and Programming Languages

In the previous chapter the issue of programming languages was discussed. A commonly known problem with programming languages (more precisely, with applications written in those languages, and often only in connection with specific operating systems) is that security measures are often not properly implemented (e.g. the very old problem of “buffer overflows” in C). OpenTMS overcomes this problem by clearly defining specific modules which are encapsulated and follow modern software development rules (e.g. access only through well-defined interfaces); a special security layer wraps the various modules.

This architecture specification is mainly targeted at the server part of OpenTMS. Thus it is independent of any GUI application. GUIs can use OpenTMS basically in two ways:

a) Through the OpenTMS server functionality: This approach encapsulates all modules and functions and gives the highest possible level of security. Here only “public server-side functionality” can be used.
b) By directly calling functions from the OpenTMS library: Obviously this can cause problems if the GUI does not call the functions properly (especially in programming languages like C or C++).

One of the OpenTMS target GUIs are web-based (browser-based) applications. These will call all functionality through a web server, SOAP or XML-RPC interfaces. This minimises the danger of introducing security problems on the client side (e.g. for GUIs which have to follow requirements like ZDv 54/100 VS-NfD „IT-Sicherheit in der Bundeswehr“). By restricting the GUI to “plain HTML” one can reduce the risk to a minimum. Obviously increasing the security level goes along with a decrease in comfort and user friendliness. This decision is up to the end user and his organisation.

7.2 Communication Level

Communication which goes through TCP/IP should support (strong) encryption of the data transmitted. This is done in addition to using protocols like https, secureFTP etc.

7.3 Document Level

Documents are the basis of most activities in OpenTMS.
A key problem is the transfer of XLIFF files. The content of the segments is normally readable by humans. If required, the segments in the XLIFF files (as well as in TMX or TBX files) can be encrypted (creating something like a secureXLIFF, secureTMX, secureTBX). The segments can then only be read in conjunction with a user and password. The users who have regular access to the content can be stored in encrypted form in the header of the XLIFF file or be supplied when opening the XLIFF document.

7.4 Database Level

Database entries follow the same procedure. If required, the entries should be encrypted. At this level database-specific security functionality can and should be applied too. Without knowledge of the user/password combination, an export etc. of the database does not reveal any information in case of an attack. In addition, any database security layers need to be supported too.

7.5 Security Level

The following functions assume that each encryption and decryption process associates the relevant user and his roles with the security function. At this point no function parameters are defined. This will be done in an implementation manual.
Security functions (function: comment):

• Encrypt / Decrypt: General function which encrypts and decrypts any type of document.
• Encrypt XLIFF / Decrypt XLIFF: This function encrypts the texts (segments) of an XLIFF document. The XML structure as such is still visible. Depending on the parameters supplied, attributes etc. are secured too.
• Encrypt TMX / Decrypt TMX: This function encrypts the texts (segments) of a TMX document. The XML structure as such is still visible. Depending on the parameters supplied, attributes etc. are secured too.
• Encrypt TBX / Decrypt TBX: This function encrypts the texts (segments) of a TBX document. The XML structure as such is still visible. Depending on the parameters supplied, attributes etc. are secured too.
• Establish Secure Communication: Establishes a secure communication channel. The type of security depends on the supplied parameters.
• Terminate Secure Communication: Terminates a secure communication channel.
• Secure Data Source: Enables the encryption / decryption of database entries.
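The behaviour described for "Encrypt XLIFF" and "Decrypt XLIFF" (segment texts are protected while the XML structure stays visible) could be sketched roughly as follows. This is only an illustration in Python, not part of the specification: the function names, the password parameter and the XOR-based `toy_cipher` are assumptions made for this sketch. The toy cipher is not secure and merely stands in for a real, openly specified encryption method. Inline elements inside a segment are ignored here for brevity.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def toy_cipher(data: bytes, password: str) -> bytes:
    """INSECURE placeholder: XOR with a key derived from the password.

    Applying it twice with the same password restores the input.
    A real implementation would call a proper encryption library here.
    """
    key = hashlib.sha256(password.encode("utf-8")).digest()
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encrypt_xliff(xliff_text: str, password: str) -> str:
    """Encrypt the text of <source>/<target> elements; keep the structure."""
    root = ET.fromstring(xliff_text)
    for element in root.iter():
        if element.tag in ("source", "target") and element.text:
            raw = toy_cipher(element.text.encode("utf-8"), password)
            element.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def decrypt_xliff(xliff_text: str, password: str) -> str:
    """Reverse encrypt_xliff for the same password."""
    root = ET.fromstring(xliff_text)
    for element in root.iter():
        if element.tag in ("source", "target") and element.text:
            raw = base64.b64decode(element.text)
            element.text = toy_cipher(raw, password).decode("utf-8")
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    '<xliff version="1.0"><file source-language="de" target-language="es"><body>'
    '<trans-unit id="0"><source xml:lang="de">Das ist ein Segment</source>'
    '<target xml:lang="es"/></trans-unit></body></file></xliff>'
)

secured = encrypt_xliff(SAMPLE, "secret")
restored = decrypt_xliff(secured, "secret")
```

In the secured document the `trans-unit` elements and their attributes are still readable, so tools can still navigate the file, while the segment text itself is no longer legible without the password.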
8 BASIC OPENTMS COMPONENTS

The OpenTMS framework is organized around a set of basic components called models (see chapter 6), which interact and to which processes can be applied. The following is a brief overview of the basic models:

• Documents: Documents form one key feature of the architecture. Basically a document is any form of text. Translations and other modification processes (e.g. segmentation) are applied to documents. A key document type in OpenTMS is the XLIFF document, which is the main paradigm for communicating text between the various processes.
• Database: Database refers to any kind of storage which can be used to retrieve a specific text or sub-text (like a paragraph or segment). Database in the OpenTMS context is understood broadly, ranging from simple text files to highly sophisticated SQL or object-oriented database systems. OpenTMS uses a general database object which can come in various flavors, e.g. a translation memory, a phrase database or a terminology database. The OpenTMS database architecture supports various security levels. Encryption of entries should be supported. OpenTMS uses the notion of “data source” for these generalized databases.
• Processes: Processes apply operations to documents and databases. Operations could be modifying, inserting, searching, editing, converting etc. A key process in OpenTMS is the translation process. OpenTMS processes are named “Translets” (Translet in the singular). An example of a Translet is a Doclet, a module which is applied for the conversion, modification etc. of documents. Processes in OpenTMS are normally accessible through the OpenTMS Scripting Language, a language which gives access to the core operations of the OpenTMS architecture (similar to JavaScript).
Fig 7: OpenTMS Objects

From a certain perspective processes can be seen as a special type of communication. Within OpenTMS three different communication types can be distinguished. Communication is used here in a broad sense.

• Command (file) based process: Here an executable is run (batch mode). Command processes use XML-based command files as input parameters.
• Function based process: Here the specific process is called either as a function or as a method within a piece of software.
• Net (TCP/IP) based process: Here a process is run through a network (TCP/IP) using SOAP, RPC, XML-RPC or similar communication methods. The method is activated in one process while the actual execution is run in another process (which could be a server, a virtual machine, another thread or similar).

• Workflow: A workflow is a set of processes which are applied in a specific sequence. A workflow may also involve humans. A typical workflow could be: the PM receives a document to translate, determines the document characteristics, computes statistics, provides an offer, the client accepts the offer, the PM determines a translator, converts the document for the translator, sends it to the translator, and so on. This means that a workflow can also contain purely human actions interwoven with computer processes. In any case each human process must be mapped to a computer process.
Later in the document it is mentioned that processes can be organized in pipelines. This means that one process can take the output of another process, do some computation on this output and create a new output, which itself can then form the input to another process.
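The pipeline idea described above can be sketched as follows. The sketch is written in Python purely for illustration. The three stage functions, the dictionary used as the exchanged structure and the tiny translation memory are hypothetical stand-ins, not OpenTMS functions. In OpenTMS the exchanged structure would typically be an XLIFF-based XML structure rather than a dictionary.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Process = Callable[[Document], Document]


def convert(doc: Document) -> Document:
    # Stand-in for a converter: only records the new exchange format.
    return {**doc, "format": "xliff"}


def segment(doc: Document) -> Document:
    # Stand-in for segmentation: split the text at full stops.
    parts = [part.strip() for part in doc["text"].split(".") if part.strip()]
    return {**doc, "segments": parts}


def pretranslate(doc: Document) -> Document:
    # Stand-in for a TM lookup: unknown segments get an empty target.
    memory = {"Hello": "Hallo"}
    return {**doc, "targets": [memory.get(s, "") for s in doc["segments"]]}


def run_pipeline(doc: Document, processes: List[Process]) -> Document:
    """Feed the output of each process into the next one."""
    return reduce(lambda current, process: process(current), processes, doc)


result = run_pipeline({"text": "Hello. World."}, [convert, segment, pretranslate])
```

Each stage returns a new structure instead of changing its input, so the stages can be reordered or replaced without affecting one another.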
9 DOCUMENT MODEL

9.1 Documents

Documents (“texts”) are a core concept in OpenTMS. Documents are normally the main focus of interest, as it is documents that need to be translated. Documents normally enter OpenTMS as input or leave it as output. Documents are normally processed in OpenTMS through XLIFF (chapter 9.4): they are converted into XLIFF and back. Documents come in various formats, e.g.:

• WinWord
• RTF
• Plain text
• HTML
• XML
• OpenOffice
• program texts
• resource files
• property files
• database entries
• any other common localization industry formats
• any other document type

The simplest type of document is a string, a sequence of characters. For OpenTMS processes strings are packed into XML structures, mainly a subset of XLIFF.

A key property of a document is the language associated with it, although the language itself may vary within the document. If a document gets translated, at least a second language is associated with it.
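As a rough illustration of packing a plain string, together with its language, into an XLIFF-like structure, the following Python sketch builds a single `trans-unit` element. The function name and its parameters are assumptions made for this example. Only the element names `trans-unit` and `source` and the `xml:lang` attribute are taken from XLIFF.

```python
import xml.etree.ElementTree as ET

# Qualified name of the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def string_to_trans_unit(text: str, language: str, unit_id: str = "0") -> ET.Element:
    """Wrap a plain string into a minimal trans-unit with a source element."""
    unit = ET.Element("trans-unit", {"id": unit_id})
    source = ET.SubElement(unit, "source", {XML_LANG: language})
    source.text = text
    return unit


unit = string_to_trans_unit("Das ist ein Segment", "de")
xml_text = ET.tostring(unit, encoding="unicode")
```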
9.2 Character Sets

OpenTMS uses the Unicode character set for all (internal) representation purposes. This has the advantage that most of the characters used worldwide can be processed with OpenTMS. Also, most programming languages nowadays use Unicode as their internal character representation.

UTF-8 formatted text is used as the core character encoding when OpenTMS produces and delivers files which are some kind of final document (e.g. statistics output). Deviations occur if the original character set differs.

The core library of OpenTMS contains basic functions to convert from one character set to another. In addition the kernel library should contain some functions which allow the character format of a document to be detected.

9.3 XML document handling

OpenTMS heavily uses XML-based standards (XLIFF, TMX, TBX). There are several good open source implementations for XML handling available (DOM model, SAX parser, JDOM, just to name a few). Obviously those functions should be used to manipulate these documents.

On top of the standard XML library functionality, functions are required to support the manipulation of the translation / localization XML standards. Those functions will also be part of the core library.

9.4 XLIFF Documents

XLIFF documents form the core document type to which most of the processes are applied (segmentation, translation etc.). XLIFF documents are created by converters. Converters take different document formats (RTF, XML, HTML etc.) and convert them to the XML-based XLIFF format (XLIFF, 2008).

The following shows a very simple example of an XLIFF document.
<?xml version="1.0" encoding="UTF-8" ?>
<xliff version="1.0">
  <file datatype="XML" original="D:arayatestsimplexmlsimplexml.xml" source-language="de" target-language="es">
    <header>
      <!-- Header of the XLIFF file -->
      <phase-group>
        <phase company-name="Araya" date="Sun May 11 11:29:11 CEST 2008" phase-name="1" process-name="pre-process" tool="XML2XLIFF version 2.0"/>
        <phase company-name="Araya" date="Sun May 11 11:29:11 CEST 2008" phase-name="2" process-name="Segmentation" tool="SEGMENTER version 2.0"/>
      </phase-group>
      <skl>
        <!-- Reference to an external file -->
        <external-file href="C:arayasklsimplexml.xml.27120.skl"/>
        <!-- Internal file -->
        <internal-file form="mimestring">PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiID8+DQo8c2ltcGxleG1sPg0KPHNlZ21lbnQ+JSUlMCUlJQo8L3NlZ21lbnQ+DQo8c2VnbWVudD4lJSUxJSUlCjwvc2VnbWVudD4NCjwvc2ltcGxleG1sPg==</internal-file>
      </skl>
      <!-- Properties of the XLIFF file -->
      <prop-group name="encoding"><prop prop-type="encoding">UTF-8</prop></prop-group>
      <prop-group name="xmlformat">
        <prop prop-type="donotresolveentitiesfile">C:arayainiedqm-ent.txt</prop>
        <prop prop-type="iniFile">c:/Araya/ini/config_simplexml.xml</prop>
      </prop-group>
      <prop-group name="specialinfo"> </prop-group>
    </header>
    <body>
      <!-- Segments -->
      <trans-unit approved="no" help-id="0" id="0" xml:space="preserve">
        <source xml:lang="de">Das ist ein Segment</source>
        <target xml:lang="es" xml:space="preserve"/>
        <prop-group><prop prop-type="segmentid">1067381512</prop></prop-group>
      </trans-unit>
      <trans-unit approved="no" help-id="1" id="1" xml:space="preserve">
        <source xml:lang="de">Das ist ein <ph id="0">&lt;b&gt;</ph>Segment mit<ph id="1">&lt;/b&gt;</ph> Format</source>
        <target xml:lang="es" xml:space="preserve"/>
        <prop-group><prop prop-type="segmentid">1067381512</prop></prop-group>
      </trans-unit>
    </body>
  </file>
</xliff>

Fig 8: XLIFF File

9.4.1 OpenTMS and Skeleton files

Skeleton files are one of the key features of XLIFF.
In order to reduce the size of the content of a segment (trans-unit, source and target), most converters move the non-relevant part (e.g. format information) of an (external) document into an external representation. They then use a kind of referencing scheme to specify where parts of the text and the segment come together (mainly for back conversion). Skeleton files mainly contain the format (non-textual) part of a document. Often this part is bigger than the core text.

One can distinguish between internal and external skeleton files (also called skl files). External skl files keep the XLIFF file small, while internal skl files create a bigger XLIFF file. With external files back conversion is more complicated, as the back converter requires the skl file. One way to overcome this problem is to use an internal skl file that is compressed and encoded appropriately.

OpenTMS supports the back conversion of a document independently of the place where it was created. Thus XLIFF files in OpenTMS normally use internal skl files. Where this is not possible or not wanted, a procedure must be supplied which allows the skl file to be reintegrated into the XLIFF file before it is transmitted to another machine, user etc.

9.4.2 Security and encryption in XLIFF: secureXLIFF

As described in the chapter on security, XLIFF documents must follow the security architecture of OpenTMS. XLIFF documents are a potential security threat. If they are transmitted via the web or by another transport method (USB stick etc.), other persons may read the XLIFF document. In order to prevent access by unauthorized users it is proposed to encrypt the relevant parts of the document (especially the source and target elements). Only specified users with the correct password will gain access to the content of the XLIFF document through an editor or similar. XLIFF editors reading the file must support the OpenTMS security layer. Using such a security approach one could also forbid copy and paste etc. for a given XLIFF document.
Annotation: Obviously an open source encryption method should be used. Using a secureXLIFF may be a good argument for industrial users to adopt the OpenTMS concept and architecture.

9.5 TMX Documents
TMX documents form the core document type to which database operations apply (fuzzy search, word-based search etc.). TMX documents, or rather their entries, are stored in databases. Converters take different translation memory exchange formats (Trados etc.) and convert them to the XML-based TMX format (TMX, 2008).

Databases store the TMX entries. While there is no problem with the meta information associated with each TMX entry (tu), the global meta information of the TMX document creates a problem. As databases are organized around entries, this meta information must be stored in separate tables and referenced by each entry.

TMX files are normally imported into databases to support high access speed (see footnote 1).

9.5.1 Security and encryption in TMX: secureTMX

The same security architecture as for XLIFF should be applied to TMX.

9.6 TBX Documents

TBX documents form the core document type for terminology data. TBX documents are imported into an OpenTMS database. TMX and TBX documents are internally stored in the same entry structure. They can be distinguished by specific markers.

The reason for storing both TMX and TBX documents in the same type of database is that this allows both kinds of data to be re-used in similar situations. Obviously the database functions need to support reading and writing the entries according to the context. Thus an (originally) TBX entry may be used as a TMX entry (translation memory match) in one context, while a TMX entry could be used as a terminology match in another context. This internally identical handling does not imply that both entry types are the same, but practice shows that the usage patterns often require them to be interchangeable.

9.6.1 Security and encryption in TBX: secureTBX

The same security architecture as for XLIFF should be applied to TBX.
1 A key question is whether OpenTMS should also allow direct access to TMX files (like Star text files) without the need to import them into a database. The advantage would be that, especially for small TMX files, there is no real need to store them in a database. It would also not require any database drivers; XML access functions would be sufficient. One could see this as a special type of database.
9.7 Other Documents

OpenTMS is required to process all other types of documents. Once such files are brought into the OpenTMS system, they are converted to XLIFF (except in the cases discussed above). Once processed, the XLIFF documents are converted back to their original format.

Ideally OpenTMS should contain or interact with a CMS system which provides a convenient way of storing all kinds of documents. Interfaces to CMS systems will be defined, although the implementation of these interfaces is not part of the OpenTMS implementation (see chapter 18).

9.8 Basic Document Access Functionality

In the following some basic XLIFF file functions are described. These functions should go into the core library of OpenTMS. They are by no means exhaustive. A more detailed function library for XLIFF will be defined later. Although most of the functions can be realised using DOM functionality, a function library which makes it easy to handle XLIFF files should be provided.

As the functions will involve complex parameter combinations, the parameters will be supplied as XML constructs. For performance reasons one will not actually supply flat XML files, but an in-memory version of the XML file (nodes etc.).

Basic translation functions for XLIFF documents (function: comment):

• Convert Document: Converts a given document to XLIFF.
• Backconvert Document: Converts a given document back from XLIFF.
• CreateXLIFFDocument: Creates an empty XLIFF document. This function may be questionable, as XLIFF documents normally have only a temporary status and normally come into existence through a converter call. Nevertheless such a function may be helpful. Pure text conversion can be achieved anyway.
• GetProperties: Retrieves the (general) properties of the XLIFF document.
• SetProperties: Sets the (general) properties of the XLIFF document.
• Segment: Segments the XLIFF document based on some SRX rules (configuration file).
• AddTransUnit: Adds a new TransUnit at a certain position. This function also depends on the original format. Depending on the format, it may cause problems in the back conversion process.
• RetrieveTransUnit: Retrieves a segment of the XLIFF document; this includes all the information of the segment (thus the whole trans-unit is returned).
• RemoveTransUnit: Removes a TransUnit. Here one could distinguish between executing the operation immediately (and therefore permanently) and making the change only in memory and saving it later.
• ModifyTransUnit: Modifies a TransUnit. Here one could distinguish between executing the operation immediately (and therefore permanently) and making the change only in memory and saving it later.
• TranslateTransUnit: The TransUnit is translated based on the parameters supplied. This can include TM translation, term translation, machine translation or basically any other kind of translation or invocation.
• SplitTransUnit: Splits the source part of a TransUnit. Care has to be taken with regard to validity.
• CombineTransUnit: Combines the source parts of TransUnits. Care has to be taken with regard to validity.
• SaveDocument: Saves the XLIFF document.
• GetStatistics: Returns some statistics of the translation process (GMX based).

Fig 9: Some basic XLIFF File functions
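A few of the functions listed above (GetProperties, RetrieveTransUnit and an in-memory ModifyTransUnit) might look roughly like the following when built on a standard XML library. This is a sketch under assumptions: the Python function names and signatures are invented for illustration, the sample document is a shortened, namespace-free XLIFF 1.0 fragment modelled on Fig 8, and the parameters are passed directly rather than as XML constructs.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

XLIFF = """<xliff version="1.0">
  <file datatype="XML" source-language="de" target-language="es">
    <header>
      <prop-group name="encoding"><prop prop-type="encoding">UTF-8</prop></prop-group>
    </header>
    <body>
      <trans-unit approved="no" id="0">
        <source xml:lang="de">Das ist ein Segment</source>
        <target xml:lang="es"/>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def get_properties(root: ET.Element) -> Dict[str, str]:
    """Collect the header properties as a prop-type -> value mapping."""
    props = root.iterfind("./file/header/prop-group/prop")
    return {prop.get("prop-type"): (prop.text or "") for prop in props}


def retrieve_trans_unit(root: ET.Element, unit_id: str) -> Optional[ET.Element]:
    """Return the whole trans-unit with the given id, or None if absent."""
    for unit in root.iter("trans-unit"):
        if unit.get("id") == unit_id:
            return unit
    return None


def modify_trans_unit(root: ET.Element, unit_id: str, target_text: str) -> bool:
    """Set the target text in memory only; saving is a separate step."""
    unit = retrieve_trans_unit(root, unit_id)
    if unit is None:
        return False
    target = unit.find("target")
    if target is None:
        target = ET.SubElement(unit, "target")
    target.text = target_text
    return True


root = ET.fromstring(XLIFF)
properties = get_properties(root)
changed = modify_trans_unit(root, "0", "Esto es un segmento")
```

The change made by `modify_trans_unit` lives only in the in-memory tree, which corresponds to the "change in memory, save later" variant mentioned for ModifyTransUnit.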
10 OPENTMS AS A CLIENT/SERVER ARCHITECTURE

The kernel OpenTMS architecture is based on the client/server principle. Using a client/server architecture brings many advantages, the most critical one being that processes can be spread over several computers, or over several threads in modern operating systems and hardware architectures.

This does not imply that the OpenTMS architecture can only be implemented on a client/server basis. All the processes (Translets) can also run in a single-user environment (e.g. through a procedural call within an editor). But by using a client/server framework one avoids having to re-program or re-implement a piece of software which was designed to run in a single-threaded environment only. From an implementation point of view this concerns, for example, the use of global or static variables.

Each procedure developed for OpenTMS should be designed with multi-threading in mind. Each procedure should be encapsulated in such a way that it can be surrounded by a process wrapper which allows it either to run as a thread in the same software or computer environment or to be distributed over several computers. In practice this means that “globally defined variables” should be avoided as far as possible.

As described before, the key functions are implemented in the OpenTMS core library. All (main) procedures should also be written in such a way that they can be called easily from the OpenTMS scripting language.
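The design rule above (no global state, so that the same procedure can be called directly or wrapped to run in parallel) can be illustrated with a small Python sketch. The `SegmentationRequest` type and the `segment` procedure are hypothetical examples, not OpenTMS functions. The thread pool from the standard library stands in for the process wrapper.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentationRequest:
    """All inputs travel in the request; nothing is read from globals."""
    text: str
    delimiter: str = "."


def segment(request: SegmentationRequest) -> List[str]:
    # Depends only on its argument, so concurrent calls cannot interfere.
    parts = request.text.split(request.delimiter)
    return [part.strip() for part in parts if part.strip()]


requests = [
    SegmentationRequest("One. Two."),
    SegmentationRequest("Drei! Vier!", delimiter="!"),
]

# Direct procedural call, e.g. from within an editor.
direct = [segment(request) for request in requests]

# The same procedure, unchanged, run by a thread pool wrapper.
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = list(pool.map(segment, requests))
```

Because `segment` keeps no shared state, both ways of calling it give the same results, and the wrapper could equally be replaced by a remote call.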
Fig 10: Hierarchy of processes

Processes have to adhere to the security concept of OpenTMS. Processes can only be executed if they (and the user associated with the process) have the appropriate rights (gained through the security model). This applies especially to processes which use network connections.

Fig 11: Applications

Most of the processes are based on XLIFF exchange (in terms of functions and variables this means that the parameters of functions are XLIFF documents or substructures of XLIFF). The processes thus mainly operate on XLIFF-based XML structures, which they add to or modify. In principle the operations should be non-destructive, that is, information is not deleted or removed but only added. In some cases this cannot be fully maintained: e.g. if a translator modifies a translation (in a destructive way), the older information is lost. The same may apply to database entries. This also depends on the use of a proper versioning system. As a consequence of using XLIFF-related structures internally, conversions to related XML-based formats like TMX, TBX etc. must be supported. This can be realized by attaching import and export procedures to the OpenTMS kernel.
Exceptions are, for example, converters, which take a document in any format as input and produce an XLIFF document. The same applies to back conversion.

Please note that the above figure also represents a kind of workflow. Basic workflow support can be part of the OpenTMS architecture (e.g. each process applying changes to an XLIFF document should document this in the XLIFF header). But it is not intended that OpenTMS as such comes with its own workflow solution. More complex workflow procedures should be modeled using either proprietary or open source software.

OpenTMS also follows the “old style” of UNIX pipelining. Processes (see the chapter about the process model) take an input and produce an output. The next process takes the output of the previous process, applies some further transformation to it and creates new output. Nevertheless there is a difference: as parameters can become quite complex, the UNIX style of interpreting the input just as “a string” is extended here to support input and output in the form of the parameters described before.
Fig 12: Pipeline Architecture

Figure 12 shows a typical pipelining of several processes (Translets) during a translation process.

OpenTMS differentiates between two basic types of Translets:

• Human Initiated Translets: These are Translets which are invoked and (fully) controlled by humans. Examples are a translation editor or operations which insert or update entries in a database.
• Automated Translets: These are processes which normally run automatically and do not require human interaction. Examples are the steps conversion, segmentation and pre-translation. Automated procedures (e.g. pre-translating a project, i.e. Translets applied to a set of documents) also have to be mentioned here.
11 DATA MODEL

11.1 Data sources

Data (mostly databases) are modeled through data sources. Data sources are the basic objects which allow access to all kinds of data, especially databases. Data sources mainly store segments from TMX files or TBX entries. Data sources are XML oriented: depending on the XML document supplied, a data source converts the entry in such a way that it can be transferred to a data component.

Fig 13: Data sources and data components

Why not refer directly to databases? The basic idea behind using a data source as the core data object in OpenTMS (representing databases etc.) is that such a layer between the real databases (e.g. MySQL) and the OpenTMS software makes adding new types of data quite easy. The various types of data are referred to as data components. Thus an SQL database is a data component, but a TMX file can also be seen as a data component if the relevant access operations are supported. Similarly an Excel file can be considered as a data source. Using this approach OpenTMS is not restricted to SQL databases, but can also use flat files, spreadsheets etc. It can also support direct access to vendor specific databases or systems. A server-side installation of OpenTMS can also act as a data source.

[Figure: the OpenTMS software accesses data sources through a standardised interface. The OpenTMS Data Source Layer provides data type specific access functions and maps the OpenTMS access functions to the specific data component (various data components like files etc.).]

Fig 14: Data sources with several data components
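The layering of data sources over data components can be sketched as follows. This is only an illustration under assumptions: the class and method names (DataComponent, HashTableComponent, FlatFileComponent, DataSource, lookup, store) are hypothetical and do not come from the OpenTMS code base.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DataComponent(ABC):
    """Hypothetical minimal contract every data component has to fulfil."""

    @abstractmethod
    def lookup(self, source: str) -> Optional[str]:
        """Return the stored target for a source string, or None."""

    @abstractmethod
    def store(self, source: str, target: str) -> None:
        """Store a source/target pair."""


class HashTableComponent(DataComponent):
    """Component backed by an in-memory hash table."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def lookup(self, source: str) -> Optional[str]:
        return self._entries.get(source)

    def store(self, source: str, target: str) -> None:
        self._entries[source] = target


class FlatFileComponent(DataComponent):
    """Component backed by tab-separated lines, as a flat file would hold them."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def lookup(self, source: str) -> Optional[str]:
        for line in self.lines:
            key, _, value = line.partition("\t")
            if key == source:
                return value
        return None

    def store(self, source: str, target: str) -> None:
        self.lines.append(f"{source}\t{target}")


class DataSource:
    """Layer the rest of the software talks to; it hides the component type."""

    def __init__(self, component: DataComponent) -> None:
        self._component = component

    def add_entry(self, source: str, target: str) -> None:
        self._component.store(source, target)

    def retrieve(self, source: str) -> Optional[str]:
        return self._component.lookup(source)
```

The calling code only sees DataSource; exchanging the hash table for the flat file (or for an SQL database) does not change the caller.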
A data component which is connected through a data source must support a core functionality. This core functionality is divided into three types of functions (methods):

• Read methods: These comprise all functions retrieving data from a data component. Read methods also map the results into the form the caller needs (e.g. TBX or TMX).
• Write methods: These comprise all functions writing, updating and deleting data in a data component. Write methods also take into account which input format is used (e.g. TMX or TBX) and convert it into the internal data source format.
• Select methods: These methods are part of the read methods and allow specific entries to be selected from the data source.

Care has to be taken regarding the security level which has been chosen. Depending on the level, the data have to be encrypted and decrypted.

Two types of data components can be distinguished:

• Read only data components: This type of component can only retrieve data, but not store data. An example is a plain TMX file used as a data component.
• Full data components: Here both read and write methods are supported.

Depending on the user configuration, data components can be configured to behave differently. A component can appear as a read only data component for one user, while for another user it is accessible as a full data component.

11.2 TM Matches

OpenTMS differentiates between three types of matches:

• Perfect Match: A match where the segment searched for matches the segment in the TM with regard to both text content and format.
• Exact Match: In this case only the text part of the segment matches the database entry perfectly; the format information differs.
• Fuzzy Match: In this case there are some deviations between the search segment and the match in the TM. The difference is usually stated as a percentage. This type of match is also often called an inexact match.

In the future one may consider other types of matches too, e.g. replacement class matches where only the “blank characters (white spaces)” differ. For this see also chapter 12.3.

11.3 Basic data source access functionality

The following (read and write) access functions are the core functions needed. Access results in matches. A basic idea is that the function decides, based on the input supplied, how the entry is interpreted and written into the database. This means that TMX entries are handled differently from TBX entries etc.

Please note that in the description of the functions no explicit reference is made to the security model. It is assumed that the security level is set before or together with the invocation of the database function.
Access Type: Comment

Exact Access: A given entry is found by the “string=segment” supplied, independently of the format.
Exact Format Access: A given entry is found by the “string” supplied, taking format information into account.
Fuzzy Access: A given entry is found by using a similarity search. Similarity is measured in %, where 100% is identical to an exact access.
Fuzzy Format Access: A given entry is found by using a similarity search, taking the format into account. Similarity is measured in %, where 100% is identical to an exact format access.
Word Based Access: A search is done by splitting the string into individual words. The word identification is language dependent. The words can be combined using either OR or AND (see footnote 2). Word based access could be enhanced by supporting stemming (e.g. the Porter stemming algorithm).
Regular Expression Access: A regular expression is used to retrieve the result set. Such a function is quite resource consuming.
Sub segment Access: Segments are retrieved based on some sub segments of a given search string. This can be seen as a more specialized form of the regular expression search or word based search. This type of search is especially important if a segment actually represents a paragraph and may contain several sentences.

Fig 15: Data source access types

Footnote 2: It is suggested to use a logical representation of the query similar to Google (www.google.com). Here + denotes “word must exist”, while - denotes that the word is not allowed to exist in the result set.
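Some of the access types of Fig 15 can be sketched in a few lines. The sketch rests on assumptions which are not part of the architecture: format information is assumed to be carried by inline tags, similarity is computed with the SequenceMatcher of the Python standard library (OpenTMS does not prescribe a similarity measure), and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

# Assumption: format information is carried by inline tags such as <b>...</b>.
TAG = re.compile(r"<[^>]+>")


def plain(segment: str) -> str:
    """Strip format information and keep only the text content."""
    return TAG.sub("", segment)


def exact_format_access(query: str, entries: List[str]) -> List[str]:
    """Text and format must both be identical."""
    return [e for e in entries if e == query]


def exact_access(query: str, entries: List[str]) -> List[str]:
    """Only the text has to be identical; the format is ignored."""
    return [e for e in entries if plain(e) == plain(query)]


def fuzzy_access(query: str, entries: List[str], threshold: float = 70.0) -> List[Tuple[float, str]]:
    """Return (similarity in %, entry) pairs at or above the threshold, best first."""
    scored = [(100.0 * SequenceMatcher(None, plain(query), plain(e)).ratio(), e) for e in entries]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


def word_based_access(query: str, entries: List[str]) -> List[str]:
    """'+word' must occur in the entry, '-word' must not (see footnote 2)."""
    required = {w[1:].lower() for w in query.split() if w.startswith("+")}
    forbidden = {w[1:].lower() for w in query.split() if w.startswith("-")}
    hits = []
    for entry in entries:
        words = set(re.findall(r"\w+", plain(entry).lower()))
        if required <= words and not (forbidden & words):
            hits.append(entry)
    return hits


TM_ENTRIES = [
    "Press the <b>red</b> button.",
    "Press the red button.",
    "Press the green button.",
]
```

A production system would use an index instead of scanning all entries, and word identification would have to be language dependent as stated in the table.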
Access Functions for TM and TBX data: Comment

RetrieveTMMatch: Gets a match from the Translation Memory. The actual result depends on the data source access type chosen. Parameters involve match quality etc.
RetrieveTBXMatch: Gets a TBX match from the terminology database. The actual result depends on the data source access type chosen.
AddEntry: A generic function adding data (e.g. TMX entries) to data sources. The function is generic in the sense that it decides, based on the type of the XML document to be added, how the entry is stored (TMX, TBX etc.).
CreateEntry: Creates an empty data source entry of a specific type.
AddTMEntry: Adds a TM entry; a specialization of AddEntry.
AddTBXEntry: Adds a TBX entry; a specialization of AddEntry.
RemoveEntry: A generic function removing data (e.g. TMX entries) from data sources. The function is generic in the sense that it decides, based on the type of the XML document supplied, how the entry is handled (TMX, TBX etc.).
ModifyEntry: A generic function modifying data (e.g. TMX entries) in data sources. The function is generic in the sense that it decides, based on the type of the XML document supplied, how the entry is handled (TMX, TBX etc.).
CopyEntry: A generic function copying data (e.g. TMX entries) to data sources. The function is generic in the sense that it decides, based on the type of the XML document supplied, how the entry is handled (TMX, TBX etc.).

Fig 16: Data source access functions
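How a generic AddEntry could decide on the entry type is sketched below. The sketch assumes that a TMX entry arrives as a single tu element and a TBX entry as a single termEntry element; the class DataSource and its method names are hypothetical, and the storage is reduced to two Python lists.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


class DataSource:
    """Toy data source which stores TM and term entries in two lists."""

    def __init__(self) -> None:
        self.tm_entries: List[Dict[str, str]] = []
        self.term_entries: List[Dict[str, str]] = []

    def add_entry(self, xml_text: str) -> str:
        """Generic function: dispatch on the root element of the document."""
        root = ET.fromstring(xml_text)
        if root.tag == "tu":
            return self.add_tm_entry(root)
        if root.tag == "termEntry":
            return self.add_tbx_entry(root)
        raise ValueError(f"unsupported entry type: {root.tag}")

    def add_tm_entry(self, tu: ET.Element) -> str:
        """Specialization for TMX translation units."""
        entry = {tuv.get(XML_LANG): tuv.findtext("seg", default="") for tuv in tu.findall("tuv")}
        self.tm_entries.append(entry)
        return "TM"

    def add_tbx_entry(self, term_entry: ET.Element) -> str:
        """Specialization for TBX term entries."""
        entry = {ls.get(XML_LANG): ls.findtext(".//term", default="") for ls in term_entry.findall("langSet")}
        self.term_entries.append(entry)
        return "TBX"


TMX_ENTRY = '<tu><tuv xml:lang="de"><seg>Haus</seg></tuv><tuv xml:lang="en"><seg>house</seg></tuv></tu>'
TBX_ENTRY = '<termEntry><langSet xml:lang="de"><tig><term>Wald</term></tig></langSet></termEntry>'
```

RemoveEntry, ModifyEntry and CopyEntry could follow the same dispatch pattern.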
11.4 Databases

A key principle of the OpenTMS architecture is its independence from database products. OpenTMS defines a core subset of access functions (based on SQL) which can be implemented by nearly all database systems.

The following gives a (non-exhaustive) list of database types which should be supported (see footnote 3).

11.4.1 Open source SQL databases

• MySQL - www.mysql.de
• Postgres - www.postgresql.org
• H2 - www.h2database.com
• Cloudscape - www.ibm.com/software/data/cloudscape (IBM)
• …

11.4.2 Closed source SQL databases

• SQL Server (different flavors) - www.microsoft.com/germany/sql/default.mspx
• Oracle - www.oracle.com
• …

11.4.3 Alternatives

SQL databases are not the only databases available. Other database formats could be:

• Spreadsheets (like SQL)
• Object oriented databases
• XML database systems (e.g. XINDICE)
• Plain text files

Footnote 3: A key question at this point is whether OpenTMS should implement something like an “internal database”, which would simply mean storing the database as “simple hash tables” which can be serialised and de-serialised. See also the discussion of TMX documents (Footnote 1). Alternatively the internal database could just consist of an XML file.
11.4.4 Database Access

Internally all main access functions of OpenTMS are based on specific objects (see page 51) and all access happens through these objects. By using this additional abstraction level (interfaces, as they are called in most programming languages nowadays) one becomes independent even of SQL and is open to future advances in database development.

All access functions are mapped to SQL statements (or their equivalents) which are not hardcoded but stored in XML database configuration files. Up to this point there is no real necessity to realize the database only in SQL. The advantages of using SQL as the language describing the access functions are that it is a) widespread and b) standardized.

Fig 17: Configuring different database types

11.4.5 Database and data source configuration

As OpenTMS needs to support many different database / data source types, adding a new database type should not require changing the data source code kernel. Therefore for each data source type a configuration file defines the main parameters of the database. Depending on security requirements, the configuration file can be secured using the security model functions for documents. The configuration includes:

• Database class driver, e.g. com.mysql.jdbc.Driver
• Connection string, e.g. jdbc:mysql:
• Any other connection string specific commands (e.g. buffer size)
• Commit support
• Unicode support
• Server address
• Port
• User (encrypted)
• Password (encrypted)
• Mapping of OpenTMS database access functions to database specific access code (e.g. SQL code like <command step="1">DROP TABLE IF EXISTS MONO</command>). Access functions can be organized in groups if a specific functionality requires running several database functions (e.g. creating all the necessary tables for a new database). This is mainly important for SQL databases, as a variety of SQL dialects exists.
• Reference to code (e.g. jar file, dll etc.) if a specific function needs to run at a specific point in time (e.g. when creating a new database). This should make it possible to inject specific implementation code for specific tasks (e.g. if some functionality cannot be executed through SQL commands).

In addition a more generic interface can be called if a database cannot be integrated with the configuration file specifications above. In this case the whole interface for the new database needs to be implemented and made available to OpenTMS.
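The idea of keeping the SQL statements in a configuration file rather than in the code can be sketched as follows. Only the command element with its step attribute is taken from the text above; the surrounding elements (database, function), the table layout and the helper names are assumptions, and SQLite is used merely because it ships with Python.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Dict, List, Sequence

# Hypothetical configuration: each OpenTMS access function maps to one or
# more SQL commands which are executed in the order of their step attribute.
CONFIG = """
<database name="sqlite-demo">
  <function name="CreateDatabase">
    <command step="2">CREATE TABLE MONO (id INTEGER PRIMARY KEY, lang TEXT, segment TEXT)</command>
    <command step="1">DROP TABLE IF EXISTS MONO</command>
  </function>
  <function name="AddEntry">
    <command step="1">INSERT INTO MONO (lang, segment) VALUES (?, ?)</command>
  </function>
  <function name="ExactAccess">
    <command step="1">SELECT segment FROM MONO WHERE lang = ? AND segment = ?</command>
  </function>
</database>
"""


def load_functions(config_xml: str) -> Dict[str, List[str]]:
    """Read the mapping from access function name to ordered SQL commands."""
    functions: Dict[str, List[str]] = {}
    for function in ET.fromstring(config_xml).findall("function"):
        commands = sorted(function.findall("command"), key=lambda c: int(c.get("step")))
        functions[function.get("name")] = [c.text.strip() for c in commands]
    return functions


def run(conn: sqlite3.Connection, functions: Dict[str, List[str]], name: str,
        params: Sequence[str] = ()) -> list:
    """Execute all commands of an access function; return the rows of the last one."""
    rows: list = []
    for sql in functions[name]:
        rows = conn.execute(sql, params).fetchall()
    conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
functions = load_functions(CONFIG)
run(conn, functions, "CreateDatabase")
run(conn, functions, "AddEntry", ("de", "Haus"))
```

Supporting another SQL dialect would then only require a different configuration file, not a change in the kernel code.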
12 TRANSLATION OBJECTS

Translations are a key entity in the translation process. Translations (inherently multilingual) usually consist of segments (monolingual) and the languages associated with those segments. As a consequence the architecture uses three types of language related entities. These objects are used by processes to create the translation functionality.

A “General Linguistic Object” (GLO) contains information (features, attributes) which is common to all linguistic information types. Examples are: unique id, creation and modification dates, authors etc. Linguistic Objects can always be serialized to XML. The main supported formats are XLIFF, TMX and TBX. From that object two objects are derived:

• A “Monolingual Object” (MoLO) represents a linguistic entity for a given language. It inherits all the features of the GLO and adds, for example, the language of the entity (segment).
• A “Multilingual Object” (MuLO) represents translations by linking one or more MoLOs into one object. A MuLO consists of at least one MoLO and can contain up to n MoLOs. It is not required that each MoLO of a MuLO has a different language (see footnote 4).

Each of those object types contains a unique id; in addition a MoLO receives a MuLO related id so that it can easily be associated with its translations.

Footnote 4: The behaviour of multilingual objects can be configured. One option is to treat all entries as bilingual objects only. Thus one MuLO would contain only two MoLOs, a source and a target MoLO. Normally options like this should be used with caution as they introduce problems in managing real multilingual databases. This is especially true if one source segment may have several translations (target MoLOs). Nevertheless there may be cases where several translations for a source segment are required, e.g. something like a temporary translation. In this case it is suggested to associate “status attributes” with the MoLO. These could then be used on the one hand as a sorting criterion for matches and on the other hand for identifying problem translations.
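The relation between GLO, MoLO and MuLO can be sketched with three small classes. The attribute names (uid, author, created, language, text, status, mulo_id) are assumptions chosen for illustration; the architecture only states which kind of information each object carries.

```python
import itertools
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

_next_id = itertools.count(1)  # simple stand-in for a unique id generator


@dataclass
class GeneralLinguisticObject:
    """GLO: features common to all linguistic information types."""
    author: str = "unknown"
    uid: int = field(default_factory=lambda: next(_next_id))
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class MonolingualObject(GeneralLinguisticObject):
    """MoLO: a linguistic entity (segment) in one language."""
    language: str = ""
    text: str = ""
    status: str = "final"           # status attribute as suggested in footnote 4
    mulo_id: Optional[int] = None   # id of the MuLO this MoLO belongs to


@dataclass
class MultilingualObject(GeneralLinguisticObject):
    """MuLO: links one or more MoLOs into one translation object."""
    molos: List[MonolingualObject] = field(default_factory=list)

    def add(self, molo: MonolingualObject) -> None:
        molo.mulo_id = self.uid
        self.molos.append(molo)

    def translations(self, language: str) -> List[str]:
        return [m.text for m in self.molos if m.language == language]


mulo = MultilingualObject(author="kw")
haus = MonolingualObject(language="de", text="Haus")
mulo.add(haus)
mulo.add(MonolingualObject(language="en", text="house"))
mulo.add(MonolingualObject(language="en", text="building", status="temporary"))
```

The example MuLO holds two English MoLOs, which illustrates that the MoLOs of a MuLO need not all have different languages.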
Obviously attributes are associated with Linguistic Objects. As several standards are used (TMX, XLIFF and TBX), a mapping of the attributes between the different types is required. Within the object the attributes may be identified through their namespace.

Fig 18: Representation of linguistic entities as General Linguistic Object

12.1 Format information

Format information (e.g. transported through the <ph> tag in XLIFF) and its correct handling is a key kernel function of OpenTMS. The core OpenTMS library contains all the necessary functions to handle format information correctly. OpenTMS should aim at providing the highest possible support in format handling.

12.2 Terminology versus Translation Memory
Within computational linguistics a key distinction is made between terminology and translation memory. The two concepts are clearly used in different contexts. This is also reflected in the fact that there are (at least) two standards: TMX (TMX, 2008) and TBX (TBX, 2008). Nevertheless, from a conceptual and software engineering point of view the two concepts share more than separates them. Both have “strings” as their basic representation, either as terms or as segments, and their meta information also matches in most cases. A main difference is their context of usage. TMs are normally applied at segment level (and normally consist of more characters), while terms are used at sub segment (word, phrase) level.

As these differences only appear at the usage level, OpenTMS consequently implements the same underlying (database) structure for TM and term entries. Using special markers a distinction can be made at run time (= usage time). The immediate advantage is that with this approach both concepts can be used in different usage contexts. Search and retrieval functionality is available for both concepts (e.g. fuzzy search is rarely available for term databases; a common internal representation overcomes this drawback).

Fig 19: Conversions of linguistic entities

12.3 Variables, placeholders, replacement classes

Translation memory entries, and sometimes also terminology entries, often contain textual parts which can act as placeholders. Typical examples of placeholders are numbers, month names, acronyms etc. In many cases it is possible to replace those “variable parts” automatically with their actual counterpart in a segment. This is especially useful in matching, e.g. simply replacing the numbers in a match with the correct values yields a better match, possibly even a perfect match.

For this reason OpenTMS supports the concept of replacement classes. A replacement class is a specific construct which generalizes a certain type of string or information. A replacement class basically consists of two parts:

• A class name (e.g. number)
• A procedure describing the replacement class. In many cases the procedure can be defined through a regular expression. Another option may be that specific strings (e.g. terms from a terminology database) act as a replacement class.

A procedure may be language dependent. If a procedure is language dependent, transformation rules have to be defined which describe how a value of language A is transformed into language B.

Example:

Class: GeneralNumber
Procedures:
  General:
    Definition: ([0-9]+)(\.)([0-9]+)
    Transform: $1.$3
  German:
    Definition: ([0-9]+)(,)([0-9]+)
    Transform: $1,$3

The basic idea is that a language specific procedure involves two parts:

• a definition part which describes how to detect (evaluate) an instance of a replacement class
• a transformation part which describes how to compute the instance of a replacement class given that a replacement class has been detected (e.g. in another language)
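The GeneralNumber example can be tried out with a small sketch. It assumes that the definition of the source language is used for detection and the transformation of the target language for output; the dictionary layout and the function name transfer are hypothetical. Python writes group references as \1 and \3 where the example uses $1 and $3.

```python
import re
from typing import Dict, Tuple

# language -> (definition, transform) for the class GeneralNumber
GENERAL_NUMBER: Dict[str, Tuple[str, str]] = {
    "general": (r"([0-9]+)(\.)([0-9]+)", r"\1.\3"),
    "german": (r"([0-9]+)(,)([0-9]+)", r"\1,\3"),
}


def transfer(segment: str, source_lang: str, target_lang: str,
             procedures: Dict[str, Tuple[str, str]] = GENERAL_NUMBER) -> str:
    """Detect instances with the source definition, rewrite them with the target transform."""
    definition, _ = procedures[source_lang]
    _, transform = procedures[target_lang]
    return re.sub(definition, transform, segment)


german_to_general = transfer("Der Preis beträgt 3,50 Euro", "german", "general")
general_to_german = transfer("Version 1.25", "general", "german")
```

In this sketch the second group (the separator) is detected but not reused, since the target transformation supplies its own separator.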
When a replacement class matches parts of a segment, the matching part is replaced with the replacement class, carrying forward the class name and the original value.

Replacement classes raise two main challenges:

• A key problem in defining replacement classes is the order in which they are applied (checked). Depending on the definition of the regular expressions, several expressions may match (e.g. numbers without and with decimal points). OpenTMS should apply a strict linear order: the first matching expression is applied and used.
• The other key problem is checking whether all the replacement classes appear a) in both the source and the target of the match and b) in the source segment (the one which requires translation). For OpenTMS the proposed solution is that the replacement classes in source and target have to match exactly. If this is the case, the replacement classes also have to match those of the source segment to be translated. It has to be noted that another approach could be used too: removing the non-matching replacement classes in all three strings involved.
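Both challenges can be illustrated with a short sketch. The class names DecimalNumber and Integer, the {ClassName} notation for a generalized part and the function names are assumptions. The strict linear order is realized here through the order of the alternatives in one combined regular expression, which is only one possible way of implementing it.

```python
import re
from typing import List, Tuple

# Ordered list of replacement classes: the first matching expression wins,
# so decimal numbers have to be listed before plain integers.
CLASSES: List[Tuple[str, str]] = [
    ("DecimalNumber", r"[0-9]+\.[0-9]+"),
    ("Integer", r"[0-9]+"),
]
COMBINED = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in CLASSES))


def generalize(segment: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace each instance by {ClassName}; keep (class name, original value) pairs."""
    found: List[Tuple[str, str]] = []

    def replace(match: "re.Match[str]") -> str:
        found.append((match.lastgroup, match.group()))
        return "{" + match.lastgroup + "}"

    return COMBINED.sub(replace, segment), found


def class_names(segment: str) -> List[str]:
    """Sorted class names occurring in a segment."""
    return sorted(name for name, _ in generalize(segment)[1])


def is_usable(tm_source: str, tm_target: str, segment: str) -> bool:
    """Classes must agree in TM source and target, and then with the new segment."""
    return class_names(tm_source) == class_names(tm_target) == class_names(segment)


generalized, values = generalize("Set 12 valves to 3.5 bar")
```

With the order reversed, “3.5” would be split into two Integer instances, which shows why the order of the classes matters.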
13 PROCESS MODEL

13.1 OpenTMS Process

An OpenTMS process realizes the functionality of the OpenTMS system, mainly supporting the translation process. Examples of processes are converters, segmenters, translation memories, machine translation, statistics modules etc. OpenTMS processes build on the core library functions and move them into a process environment. In many cases this does not mean that a process is created in the strict sense of the word; it could also mean that a function of the core library (or any other function defined in another OpenTMS context) is called from an application.

13.2 OpenTMS Scripting Language

Most OpenTMS processes are available through the OpenTMS Scripting Language (OpenTMSL). OpenTMSL enables developers and users to write their own scripts to adapt the OpenTMS processes to their needs. OpenTMSL is defined in a programming language independent way and should be implemented in different programming languages. It basically makes the functions defined in the core library accessible to the public through an easy-to-learn scripting language.

Fig 20: OpenTMS Scripting Language
OpenTMSL itself is defined within an OpenTMSL XML document and can be read by different XML parsers. Reference implementations should be done in Java, Perl, C# etc.

Fig 21: OpenTMSL Inter-process and computer communication

OpenTMSL also supports multithreading. It takes a procedure (see the chapter about Basic Architecture) and enriches it with multithreading capabilities as well as inter-process communication capabilities. This requires that the procedures interpreted and executed by the scripting language can be run in such an environment.

OpenTMSL is designed in such a way that it can communicate with other OpenTMSL instances on the same or other machines. Running different OpenTMSL engines on the same machine should enhance reliability and scalability of the overall system. One might think of one OpenTMSL engine (instance) dedicated to TM translation and another OpenTMSL instance dedicated to MT translation. This approach should also support the option of running older software (e.g. an old MT system; legacy software) in a special environment so that the overall goal of FOLT, the exchange of information through standards, is nevertheless met.

13.3 OpenTMSL Communication Methods

OpenTMS supports the following communication methods:

• XML-RPC Interface
• SOAP
• HTTP Interface
• Servlet Implementations
• Batch File Processing
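As an illustration of the XML-RPC communication method, the following sketch encodes a call, dispatches it and decodes the answer without opening a network connection. The function table, the toy memory and the helper handle_request are assumptions; only the method name RetrieveTMMatch is taken from the access function table, and its real parameters are not specified here.

```python
import xmlrpc.client
from typing import Callable, Dict

MEMORY = {"Haus": "house"}  # toy translation memory


def retrieve_tm_match(segment: str) -> str:
    """Return the stored translation or an empty string."""
    return MEMORY.get(segment, "")


# Hypothetical table of functions an OpenTMSL instance makes available.
FUNCTIONS: Dict[str, Callable[..., str]] = {"RetrieveTMMatch": retrieve_tm_match}


def handle_request(request_xml: str) -> str:
    """Decode an XML-RPC call, run the named function, encode the response."""
    params, method = xmlrpc.client.loads(request_xml)
    result = FUNCTIONS[method](*params)
    return xmlrpc.client.dumps((result,), methodresponse=True)


# What a client instance would send, and what it would get back.
request = xmlrpc.client.dumps(("Haus",), methodname="RetrieveTMMatch")
response = handle_request(request)
(answer,), _ = xmlrpc.client.loads(response)
```

In a deployment the request string would travel over HTTP between two OpenTMSL instances; the encoding and decoding steps stay the same.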
14 USER MODEL

Users, either human users or processes, are key components of OpenTMS. Whenever a process attaches to an OpenTMS instance, an OpenTMS user name is attached to the process.

Normally the user login is used to identify the OpenTMS user name. If OpenTMS runs as a server process, the OpenTMS user name is assigned at service start time. OpenTMS user names are case insensitive.

An OpenTMS user basically consists of a user name (together with one or more aliases), a password (or any other secure identification method), a set of rights and a set of roles as well as a set of groups the user belongs to.

Rights are the usual rights like read, write, delete. Most operating systems have their own user rights system. OpenTMS reuses those rights systems.

If a user uses several machines, aliases can be defined to allow identification across machines. See page 63.

Depending on security requirements, the user model configuration files can be secured using the security model functions for documents.

14.1 User roles

Users can have different roles attached to them and can appear differently depending on their roles. Each role may have specific rights assigned. User roles are (not exhaustive!):

• Translator
• Evaluator
• Project Manager
• Customer

Users always have to be associated with passwords.

14.2 Basic user functionality

In the following some basic user functions are described. Those functions should go into the core library of OpenTMS. The list is by far not exhaustive.

Basic User Functions: Comment

CreateUser: Creates a user
RemoveUser: Removes a user
ModifyUser: Modifies the user properties

Fig 22: Some basic user functions
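The three basic user functions can be sketched as follows. The class names, the attribute names and the choice of PBKDF2 for storing the password are assumptions; the architecture only requires a password or another secure identification method, case insensitive user names and aliases.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

SET_PROPERTIES = {"aliases", "rights", "roles", "groups"}


@dataclass
class User:
    name: str
    salt: bytes
    password_hash: bytes
    aliases: Set[str] = field(default_factory=set)
    rights: Set[str] = field(default_factory=set)
    roles: Set[str] = field(default_factory=set)
    groups: Set[str] = field(default_factory=set)


def _hash(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class UserRegistry:
    def __init__(self) -> None:
        self._users: Dict[str, User] = {}

    def create_user(self, name: str, password: str, **properties: Set[str]) -> User:
        key = name.lower()  # user names are case insensitive
        if key in self._users:
            raise ValueError(f"user already exists: {name}")
        salt = os.urandom(16)
        self._users[key] = User(name=key, salt=salt, password_hash=_hash(password, salt))
        return self.modify_user(name, **properties)

    def modify_user(self, name: str, **properties: Set[str]) -> User:
        user = self._users[name.lower()]
        for attribute, values in properties.items():
            if attribute not in SET_PROPERTIES:
                raise AttributeError(attribute)
            cleaned = {v.lower() for v in values} if attribute == "aliases" else set(values)
            setattr(user, attribute, cleaned)
        return user

    def remove_user(self, name: str) -> None:
        del self._users[name.lower()]

    def find(self, name_or_alias: str) -> Optional[User]:
        key = name_or_alias.lower()
        for user in self._users.values():
            if key == user.name or key in user.aliases:
                return user
        return None

    def check_password(self, name_or_alias: str, password: str) -> bool:
        user = self.find(name_or_alias)
        return user is not None and hmac.compare_digest(user.password_hash, _hash(password, user.salt))
```

A user created as “Klemens” with the alias “KW” can then be found as “KLEMENS” or “kw” alike.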
15 GUI MODEL

The GUI model realizes editors and similar applications which support the interaction of the human user with the OpenTMS software.

This document is not intended to discuss or describe in detail how to implement one or more GUIs for OpenTMS. However, if such applications are defined and realized they should adhere to the principles of the OpenTMS architecture.

16 INTERFACE MODEL

It should be easy to integrate OpenTMS software into other software systems. An interface model realizes this aspect of the behavior of OpenTMS. Interface models are used for functions which do not model kernel aspects of OpenTMS (e.g. workflow management, CMS integration), but are nevertheless of major interest to the OpenTMS community.

An example of an interface model is given in the section about CMS interfaces.
17 CONFIGURING OPENTMS

A key feature of the OpenTMS architecture is the ability to configure the system.

Fig 23: Configuration of OpenTMS

Due to its broad usage requirements (stand-alone, server etc.) several different configuration methods should be supported.

• General configuration (GC): This configuration contains all configuration options which are used when no user related configuration is available. The GC can also define whether an option can be overwritten by the user configuration file or not.
• Server configuration (SC): This is a configuration which resides on a server and is mainly targeted at controlling server-side options. It is a subset of the GC.
• User configuration (UC): This configuration stores the user specific configuration options.

Depending on security requirements, the configuration file can be secured using the security model functions for documents. Each option of the configuration file can be secured separately.

17.1 Naming of the configuration file

The name of the user configuration file depends on the user name. The general configuration file is always named OpenTMSConfig.xml. The file is always located in the direct sub directory config (i.e. one level below the main OpenTMS directory). A distribution mechanism for transferring user profiles between different machines should be supported.

Examples:

c:/Program Files/OpenTMS/config: Configuration directory
c:/Program Files/OpenTMS/config/OpenTMSConfig.xml: General configuration file
c:/Program Files/OpenTMS/config/OpenTMSConfig.klemens.xml: User configuration file

Fig 24: Configuration file naming example

In addition OpenTMS can support the storage of configuration options in databases. This has the advantage that a user can work with his personal environment on different machines.

17.2 Structure of the configuration file

A configuration file is written in an XML based format. The location of the configuration file is relative to the start directory of the main OpenTMS application; it should be stored in a config directory.

The configuration file uses an XML schema (xsd) to restrict the possible values and to support error detection. Each option supports the overwrite attribute. This attribute defines whether a user has the right to modify the option or not. The admin attribute allows the overwrite attribute to be overruled: users mentioned in this list (separated by “;”) always have the right to modify the option.
Changes made in the user configuration have no influence on (do not modify) the general configuration.

<option name=optionname [admin=<list of admins>] [overwrite=true|false]> …value… </option>

or alternatively

<option name=optionname [overwrite=true|false] value=value>

Fig 25: Configuration option structure

17.3 Configuration Options

The following table describes some main options.

Option name: Description; Values; Option Type

OpenTMSDir: Location of the OpenTMS directory; Directory name; GC, SC, UC
LogDir: Log directory name, relative to OpenTMSDir; Directory name
ErrorDir: Error directory name, relative to OpenTMSDir; Directory name
LogLevel: Controls the amount of logging; <number> 0, 1, 2, 3…
ConfigDir: Configuration directory; Directory name

Fig 26: OpenTMS options table
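The interplay of general configuration, user configuration and the overwrite and admin attributes can be sketched as follows. The root element config, the quoting of the attribute values, the default of overwrite when the attribute is missing and the function names are assumptions; only the option element and its attributes follow Fig 25.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

GENERAL = """<config>
  <option name="LogLevel" overwrite="true">1</option>
  <option name="LogDir" overwrite="false" admin="klemens;michael">log</option>
  <option name="ErrorDir" overwrite="false">error</option>
</config>"""

USER = """<config>
  <option name="LogLevel">3</option>
  <option name="LogDir">mylog</option>
  <option name="ErrorDir">myerror</option>
</config>"""


def config_file_name(user: Optional[str] = None) -> str:
    """General file name, or the user specific variant following Fig 24."""
    return "OpenTMSConfig.xml" if user is None else f"OpenTMSConfig.{user}.xml"


def _value(option: ET.Element) -> str:
    """An option value is either the value attribute or the element text."""
    return option.get("value") or (option.text or "").strip()


def effective_options(general_xml: str, user_xml: str, user: str) -> Dict[str, str]:
    """Merge GC and UC; the GC itself is never modified."""
    result: Dict[str, str] = {}
    may_modify: Dict[str, bool] = {}
    for option in ET.fromstring(general_xml).findall("option"):
        name = option.get("name")
        admins = [a.strip().lower() for a in option.get("admin", "").split(";") if a.strip()]
        # Assumption: a missing overwrite attribute means the option may be overwritten.
        allowed = option.get("overwrite", "true") == "true"
        may_modify[name] = allowed or user.lower() in admins
        result[name] = _value(option)
    for option in ET.fromstring(user_xml).findall("option"):
        name = option.get("name")
        if may_modify.get(name, True):
            result[name] = _value(option)
    return result
```

For the user klemens the admin list overrules overwrite="false" for LogDir, while ErrorDir stays at its general value for every user.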
18 DMS INTERFACE

As explained in the chapter about “Documents”, OpenTMS may need to store various documents. Ideally this document repository should be a DMS system.

Initially only some very basic functions should be supported by OpenTMS to retrieve documents from and insert documents into a DMS system. Please note that this does not imply a very complex DMS system; a flat system based on a directory structure can be sufficient. It has to be clearly stated here that these functions must be implemented as part of the DMS system and are not implemented as part of OpenTMS. OpenTMS just uses (calls) these functions (methods) when storing documents is required within OpenTMS. Document storage is not considered to be a core functionality of OpenTMS. OpenTMS will provide this functionality through an XML-RPC interface.

The basic idea is that documents are organized in repositories which contain the documents.

The DMS system must be able to handle any input document supplied. No restriction is made on the format of the document.

The DMS interface can also be used for versioning of documents. During the translation process XLIFF files etc. change due to the translations. The different versions can be kept in the DMS system.

DMS document handling should be supported through WebDAV too.

The following core functions should be supported by the DMS system. Each function normally returns a unique identifier.

Function: Comment

Connect: Connects to CMS system
Create Repository: Creates a repository to which documents can be added
Remove Repository: Removes a repository
Add Document: Adds a document to a given repository and identifies it by a unique identifier.
Add Document Version: Adds a new version of the document. Each version is identified through its original id augmented by a version code.
Remove Document: Removes a document from a repository
Replace Document: Replaces an existing document in the repository
Retrieve Document: Returns a document from the repository; if it is a versioned document the most recent version is returned.
Retrieve Document Version: Returns a specific version of the document from the repository
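A flat, directory based DMS as mentioned above could look like the following sketch. The class name DirectoryDMS, the directory layout (one directory per document, one file per version) and the form of the version code are assumptions; only a part of the function list is covered.

```python
import uuid
from pathlib import Path
from typing import List


class DirectoryDMS:
    """Toy DMS: <root>/<repository>/<document id>/v<n> holds version n."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def create_repository(self, name: str) -> str:
        (self.root / name).mkdir(exist_ok=True)
        return name

    def add_document(self, repository: str, content: bytes) -> str:
        """Store a new document and return its unique identifier."""
        doc_id = uuid.uuid4().hex
        (self.root / repository / doc_id).mkdir()
        (self.root / repository / doc_id / "v1").write_bytes(content)
        return doc_id

    def add_document_version(self, repository: str, doc_id: str, content: bytes) -> str:
        """Store a new version; the id is the original id plus a version code."""
        version = max(self._versions(repository, doc_id)) + 1
        (self.root / repository / doc_id / f"v{version}").write_bytes(content)
        return f"{doc_id}.v{version}"

    def retrieve_document(self, repository: str, doc_id: str) -> bytes:
        """Return the most recent version of a document."""
        latest = max(self._versions(repository, doc_id))
        return self.retrieve_document_version(repository, doc_id, latest)

    def retrieve_document_version(self, repository: str, doc_id: str, version: int) -> bytes:
        return (self.root / repository / doc_id / f"v{version}").read_bytes()

    def _versions(self, repository: str, doc_id: str) -> List[int]:
        return [int(p.name[1:]) for p in (self.root / repository / doc_id).iterdir()]
```

The content is handled as raw bytes, so no restriction is placed on the document format.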
19 BIBLIOGRAPHY

FOLT (2007a). Arguments for the development of an open source software translation memory technology to support translators. Stuttgart, February 2007. FOLT - Accessed on 13 April 2007 on http://www.folt.org/index.php?option=com_docman&task=doc_download&gid=12&Itemid=39.

FOLT (2007b). ExposéTranslationMemoryOpenSourceSystemTMOSS. Stuttgart, 28 October 2007. FOLT - Accessed on 13 April 2007 on http://www.folt.org/index.php?option=com_docman&task=doc_download&gid=16&Itemid=39.

GMX (2008). Global information management Metrics eXchange (GMX). LISA Standard - Accessed on 13 April 2007 on http://www.lisa.org/Global-information-m.104.0.html.

MARTIF (1999). ISO 12200 Terminology - Computer applications - Machine-readable Terminology Interchange Format (MARTIF) - Negotiated Interchange. ISO TC 37 - http://www.ttt.org/clsframe/negotiated.html.

OLIF (2008). OLIF - The open XML language data standard. OLIF 2 Consortium - Accessed on 13 April 2007 on http://www.olif.net/.

SRX (2008). Segmentation Rules eXchange (SRX). LISA Standard - Accessed on 13 April 2007 on http://www.lisa.org/Segmentation-Rules-e.40.0.html.

TBX (2008). Term Base eXchange (TBX). LISA Standard - Accessed on 13 April 2007 on http://www.lisa.org/Term-Base-eXchange.32.0.html.

TMX (2008). Translation Memory eXchange (TMX). LISA Standard - Accessed on 13 April 2007 on http://www.lisa.org/Translation-Memory-e.34.0.html.

XLIFF (2008). XLIFF Version 1.2. OASIS Standard, 1 February 2008 - Accessed on 13 April 2007 on http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html.

Xml:tm (2008). XML Text Memory (xml:tm). LISA Standard - Accessed on 13 April 2007 on http://www.lisa.org/XML-Text-Memory-xml.107.0.html.
    • Software Architecture 04/2008/KW 20 APPENDIX 20.1 Multiple translations for a linguistic concept A key problem in translation is the handling of multiple translations for a linguis- tic concept, e.g. two German translations for an English concept. This is an in- herent problem as the situation becomes more complex if more than two lan- guages are involved. This can easily lead to a situation (-> Process) where two MuLOS mayhave to be merged into one MuLO as a newly entered translation pair refers for each of translations to a MoLO which links to different MuLOs. This shown by the following examples: In the following we assume DE - EN as source and target language. Segments are identified as follows: S-<language code>-<n>: <string> - n is the sequence number. In the examples below segments which result in unifications/merges are coloured. Time 1: S-DE-1: Haus S-EN-1: house After accepted the database will contain these two entries as a translation of the segments. In TMX term this is now one TU TU 1: TUV: S-DE-1: Haus TUV: S-EN-1: house Next another language pair is added. Time 2: S-DE-2: Heim S-EN-2: home After accepting this translation pair the database will contain now two TU en- tries. TU 1: Dok. Nr.: HEA-1-2008; Version 1.0; April/May/June/August 2008 69/72
    •   TUV: S-DE-1: Haus
        TUV: S-EN-1: house
      TU 2:
        TUV: S-DE-2: Heim
        TUV: S-EN-2: home
      Now another pair is added:
      Time 3:
        S-DE-3: Haus
        S-EN-3: building
      After accepting this translation pair the database still contains two TU entries, but TU 1 is extended with the new translation S-EN-3, because S-DE-3 is identical to S-DE-1.
      TU 1:
        TUV: S-DE-1: Haus
        TUV: S-EN-1: house
        TUV: S-EN-3: building
      TU 2:
        TUV: S-DE-2: Heim
        TUV: S-EN-2: home
      Now another translation pair comes in:
      Time 4:
        S-DE-4: Heim
        S-EN-4: house
      S-DE-4 is already contained in TU 2 (as S-DE-2), while S-EN-4 is already contained in TU 1 (as S-EN-1). As there is no obvious reason to prefer one of the two entries as the place to add the translation pair, TU 1 and TU 2 are unified, meaning both entries are merged into one. The result, with TU 2 removed, is:
      TU 1:
        TUV: S-DE-1: Haus
        TUV: S-EN-1: house
        TUV: S-EN-3: building
        TUV: S-DE-2: Heim
    •   TUV: S-EN-2: home
      This means that Haus can now be translated into English as house, building or home (and vice versa), Heim likewise as house, building or home, and house can be translated into German as Haus or Heim. And so on.
      Although this sounds quite simple for two languages, it immediately gets complicated if several languages are involved. Here one language can act as a “pivot language”, meaning that, although not really intended, whole sets of entries get merged that were distinct entries before. This can be especially confusing if several translators work on the same database. The DE-EN translator may be surprised by a unified entry, as he was never the source of the merge and never produced a double translation. This can be seen in the following example:
      Time x: Initial entries
      TU 1:
        TUV: S-DE-1: Haus
        TUV: S-EN-1: house
        TUV: S-EN-3: building
        TUV: S-DE-2: Heim
        TUV: S-EN-2: home
        TUV: S-LA-6: domus
      TU 2:
        TUV: S-DE-1: Wald
        TUV: S-EN-1: wood
        TUV: S-LA-6: silva
      Assume now that the EN-LA translator makes an error in his translation and adds the following combination (the argument holds for other combinations too):
        S-EN-11: wood
        S-LA-11: domus
      This immediately results in just one TU, with TU 2 being removed.
    • TU 1:
        TUV: S-DE-1: Haus
        TUV: S-EN-1: house
        TUV: S-EN-3: building
        TUV: S-DE-2: Heim
        TUV: S-EN-2: home
        TUV: S-LA-6: domus
        TUV: S-DE-1: Wald
        TUV: S-EN-1: wood
        TUV: S-LA-6: silva
      The DE-EN translator will be confused the next time he searches for “Haus”, as he will now get the following EN proposals: house, building, home, wood. The cause is the entry made by the EN-LA translator. One has to add that in some cases the corresponding EN-LA translation pair may be perfectly correct, while for DE-EN the resulting link may be totally wrong and confusing. As a consequence, translators should be careful with their translations in order to avoid unexpected translation links.
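      The unification behaviour described in this appendix can be illustrated with a small sketch. It is not the openTMS implementation: the class TranslationMemory and its methods add_pair and proposals are hypothetical names chosen for this illustration, a TU is modelled as a plain list of (language, text) variants, and segment identifiers such as S-DE-1 are left out. The sketch only assumes the rule stated above: if the two sides of a new pair are found in different TUs, those TUs are merged into one.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationMemory:
    """Toy store (hypothetical): each TU is a list of (language, text) variants."""
    units: dict[int, list[tuple[str, str]]] = field(default_factory=dict)
    next_id: int = 1

    def _find(self, variant):
        """Return the ids of all TUs that already contain this variant."""
        return [tu_id for tu_id, tuvs in self.units.items() if variant in tuvs]

    def add_pair(self, source, target):
        """Add a translation pair; merge TUs if the pair links two of them."""
        hits = sorted(set(self._find(source)) | set(self._find(target)))
        if not hits:
            # Neither side is known yet: create a new TU.
            tu_id = self.next_id
            self.next_id += 1
            self.units[tu_id] = []
        else:
            # Keep the oldest TU and fold every other matching TU into it.
            tu_id = hits[0]
            for other in hits[1:]:
                for variant in self.units.pop(other):
                    if variant not in self.units[tu_id]:
                        self.units[tu_id].append(variant)
        for variant in (source, target):
            if variant not in self.units[tu_id]:
                self.units[tu_id].append(variant)
        return tu_id

    def proposals(self, source, target_lang):
        """Return all variants in target_lang that share a TU with source."""
        for tuvs in self.units.values():
            if source in tuvs:
                return [text for lang, text in tuvs if lang == target_lang]
        return []


tm = TranslationMemory()
tm.add_pair(("DE", "Haus"), ("EN", "house"))     # Time 1: creates TU 1
tm.add_pair(("DE", "Heim"), ("EN", "home"))      # Time 2: creates TU 2
tm.add_pair(("DE", "Haus"), ("EN", "building"))  # Time 3: extends TU 1
tm.add_pair(("DE", "Heim"), ("EN", "house"))     # Time 4: merges TU 2 into TU 1
print(tm.proposals(("DE", "Haus"), "EN"))
```

      Replaying the pivot-language example in the same way (adding domus and silva as LA variants and then the erroneous pair wood / domus) collapses the Haus and Wald entries into a single TU, so a DE-EN lookup for “Haus” also returns “wood”. A real system might prefer to flag such merges for review instead of applying them silently; that is a design option, not something the source prescribes.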