This document summarizes research on structuring translation memory (TM) at the clause level rather than the sentence level. The research found that clause-level structuring of both query data and TM database contents increased the percentage of relevant matches returned from the TM. Three TM configurations were tested on different data sets, all showing that clause splitting resulted in more matches compared to sentence-level data. This study supports using clause-level structuring of TM to improve retrieval of relevant segments, increasing translator productivity.
1. BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE
PILANI (RAJASTHAN)
April, 2018
Structuring of Translation Memory
By
Ashutosh Kumar
2015HT13439
BITS ZG628T: Dissertation
2. Introduction
What is Translation Memory (TM)?
Translation Memory (TM) is an archive of previously translated segments. It stores each
source-language segment together with its translation into the target language. Here, a
segment refers to a single sentence or a single paragraph.
Uses of TM
When a translator uses a TM tool to translate a new segment, the tool identifies
similarities between the Query data segment (the segment to be translated) and
the segments stored in the TM database. The translator may then choose one of the matches
and insert it as is, or make slight changes to fit the given segment.
Benefits of TM
TM helps human translators increase their productivity.
It helps to ensure that the same term is used consistently across translations.
It helps to ensure a uniform style of translation across a large document.
3. Problem statement
Sentence-level TM
Most TM tools store sentence-level segments in the TM database. Hence the benefits of TM
are realized only for identical or similar sentences, which may occur rarely because sentences
are usually complex, while sentence fragments (clauses) may match more often.
In a TM comprising sentence-level segments, it may happen that an input sentence
contains a sub-segment (clause) whose translation is available in the TM. Yet the
search-and-retrieval function will not show any result, because the matching percentage (36%)
is lower than the defined threshold (75%) for the TM.
Query data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep
TM data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
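The gap between the 36% score and the 75% threshold can be illustrated with a small sketch. The source does not say how the matching percentage is computed. The word-overlap ratio below (shared words divided by the length of the longer segment) is an assumption, chosen because it happens to reproduce the 36% figure for this example (5 shared words out of 14). The names `match_percentage` and `THRESHOLD` are hypothetical.

```python
from collections import Counter

THRESHOLD = 75.0  # TM threshold quoted in the text


def match_percentage(query: str, tm_segment: str) -> float:
    """Assumed metric: shared words / length of the longer segment, in percent."""
    q = query.lower().split()
    t = tm_segment.lower().split()
    # Multiset intersection counts each shared word at most as often as it
    # occurs in both segments.
    shared = sum((Counter(q) & Counter(t)).values())
    return 100.0 * shared / max(len(q), len(t))


query = "It doesn't disturb others' sleep"
tm_sentence = ("Earphone is the best option available for them "
               "as it doesn't disturb others' sleep")

score = match_percentage(query, tm_sentence)
print(round(score), score >= THRESHOLD)  # prints: 36 False
```

Under this assumed metric the clause is fully contained in the stored sentence, yet the pair scores far below the threshold, so a sentence-level TM returns nothing.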
4. Introduction
Structuring of TM (clause-level)
In our approach we used clause splitting to define a clause-level structure for the TM. In this
structure, we split each sentence into clauses and store the clauses along with the full
sentence. So in a clause-level TM, the TM contains the given sentences along with their
clauses.
Retrieving clauses is desirable because there is a higher chance of finding a match for a
clause than for a complex sentence (one that contains more than one clause). A clause expresses
a complete thought because it comprises a subject and a verb. Hence, even if a translator
does not find a match for the entire sentence, he or she might still get matches for its clauses
and therefore benefit from the TM.
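A minimal sketch of this structure, assuming a clause splitter is available as a plain function: each sentence is stored as one entry, and each clause of a complex sentence is stored as a further entry that points back to its sentence. The names `Segment`, `structure_tm` and `toy_split` are hypothetical. Target-language translations are left out because the text does not describe how clause translations are stored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Segment:
    source: str
    parent: Optional[str] = None  # full sentence a clause came from; None for sentences


def structure_tm(sentences: List[str],
                 split: Callable[[str], List[str]]) -> List[Segment]:
    """Store every sentence, plus the clauses of each complex sentence."""
    tm: List[Segment] = []
    for sentence in sentences:
        tm.append(Segment(sentence))
        clauses = split(sentence)
        if len(clauses) > 1:  # only complex sentences contribute clause entries
            tm.extend(Segment(clause, parent=sentence) for clause in clauses)
    return tm


# Hand-written stand-in for a real clause splitter (illustration only).
COMPLEX = ("Earphone is the best option available for them "
           "as it doesn't disturb others' sleep")
KNOWN_SPLITS = {
    COMPLEX: ["Earphone is the best option available for them",
              "as it doesn't disturb others' sleep"],
}


def toy_split(sentence: str) -> List[str]:
    return KNOWN_SPLITS.get(sentence, [sentence])


tm = structure_tm([COMPLEX, "It doesn't disturb others' sleep"], toy_split)
for segment in tm:
    print("clause  " if segment.parent else "sentence", "|", segment.source)
```

Keeping the `parent` link lets a tool show the translator which full sentence a matched clause came from.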
5. Experiments and their Results
Clause Extraction
For the clause extraction task we used the parsing module of OpenNLP. OpenNLP provides machine
learning models which are trained on the "Penn Treebank POS" tagged corpus. OpenNLP[6] creates a
"constituency parse tree", also known as a "phrase parse tree".
Example: For the sentence
"Earphone is the best option available for them as it doesn't disturb others' sleep"
the generated bracket-notation tree is given below:
[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] [NN option]] [ADJP
[ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] [SBAR [IN as] [S [NP [PRP it]] [VP
[VBD doesn't] [NP [ADJP [NN disturb] [JJ others']] [NN sleep]]]]]]]] [. .] ]]
6. Experiments and their Results
From the pictorial representation we can see that there are two clauses:
Clause 1: Earphone is the best option available for them
Clause 2: as it doesn't disturb others' sleep
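One way to get from the bracket-notation tree to these two clauses is to read the tree into a nested structure and cut it at each SBAR (subordinate clause) node. The text does not state the splitting rule actually used, so cutting at SBAR is an assumption that happens to reproduce Clause 1 and Clause 2 for this example. The sketch works on the bracket string only and does not call OpenNLP. `parse` and `split_clauses` are hypothetical helper names.

```python
import re

TREE = ("[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] "
        "[NN option]] [ADJP [ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] "
        "[SBAR [IN as] [S [NP [PRP it]] [VP [VBD doesn't] [NP [ADJP [NN disturb] "
        "[JJ others']] [NN sleep]]]]]]]] [. .] ]]")


def parse(text):
    """Read bracket notation into nested (label, children) tuples."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1                      # skip "]"
        return (label, children)

    return node()


def split_clauses(tree):
    """Assumed rule: words under an SBAR node form their own clause."""
    main, subordinate = [], []

    def walk(n, bucket):
        label, children = n
        if label == "SBAR":
            bucket = []               # start a new clause at this node
            subordinate.append(bucket)
        for child in children:
            if isinstance(child, str):
                if label != ".":      # drop sentence-final punctuation
                    bucket.append(child)
            else:
                walk(child, bucket)

    walk(tree, main)
    return [" ".join(main)] + [" ".join(words) for words in subordinate]


for i, clause in enumerate(split_clauses(parse(TREE)), start=1):
    print(f"Clause {i}: {clause}")
```

A production splitter would need more rules (coordinated clauses, relative clauses, internal punctuation); this sketch covers only the subordinate-clause case shown on the slide.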
7. Experiments and their Results
Experimental Data
For our experiments we created three different test data sets for the English language. Set-A
contains 100 input sentences, Set-B contains 200 input sentences and Set-C contains 300 input
sentences. For the TM we have a data set which contains 3,500 sentences. This data contains both
simple and complex sentences. A complex sentence contains more than one clause.
TM Configuration
In this dissertation we have performed experiments with three kinds of TM configurations.
TM configuration 1
In TM configuration 1, Query data contains full sentences. TM data also contains full sentences. In
this configuration, there is no clause level splitting either in Query data or in TM data.
TM configuration 2
In TM configuration 2, Query data contains full sentences. TM data contains full sentences as well as
the clauses of those sentences. We used the clause splitter to structure the TM data in this configuration.
TM configuration 3
In TM configuration 3, Query data contains the original sentence and, for complex sentences, its
clauses. So there is not just one query but a set of queries when searching the TM database.
Similarly, TM data contains the original sentences as well as the clauses of those sentences. We used
a clause splitter to structure the Query data in this configuration.
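The three configurations differ only in which side is expanded with clauses, which the following sketch makes explicit. It rests on several assumptions not stated in the text: the word-overlap similarity, the reading of "% Match" as the share of query sentences with at least one TM hit at or above the 75% threshold, and a hand-written clause lookup standing in for the clause splitter. The data is a toy example (the second query is invented for illustration), so the printed rates are not the dissertation's results.

```python
from collections import Counter

THRESHOLD = 75.0

# (split Query data?, split TM data?) for each configuration in the text.
CONFIGS = {1: (False, False), 2: (False, True), 3: (True, True)}


def similarity(a, b):
    """Assumed word-overlap score, in percent (not specified in the source)."""
    ta, tb = a.lower().split(), b.lower().split()
    shared = sum((Counter(ta) & Counter(tb)).values())
    return 100.0 * shared / max(len(ta), len(tb))


def expand(sentences, clauses_of, with_clauses):
    """Return the sentences, each followed by its clauses when requested."""
    segments = []
    for sentence in sentences:
        segments.append(sentence)
        if with_clauses:
            segments.extend(clauses_of.get(sentence, []))
    return segments


def match_rate(queries, tm_sentences, clauses_of, config):
    """Percentage of query sentences with at least one TM hit >= THRESHOLD."""
    split_query, split_tm = CONFIGS[config]
    tm = expand(tm_sentences, clauses_of, split_tm)
    hits = 0
    for query in queries:
        candidates = expand([query], clauses_of, split_query)
        if any(similarity(c, seg) >= THRESHOLD for c in candidates for seg in tm):
            hits += 1
    return 100.0 * hits / len(queries)


# Toy data: one TM sentence, one query from the slides, one invented query.
TM_SENTENCE = ("Earphone is the best option available for them "
               "as it doesn't disturb others' sleep")
QUERY_1 = "It doesn't disturb others' sleep"
QUERY_2 = "A dim lamp is fine as it doesn't disturb others' sleep"  # invented
CLAUSES_OF = {  # hand-written stand-in for the clause splitter
    TM_SENTENCE: ["Earphone is the best option available for them",
                  "as it doesn't disturb others' sleep"],
    QUERY_2: ["A dim lamp is fine", "as it doesn't disturb others' sleep"],
}

rates = {c: match_rate([QUERY_1, QUERY_2], [TM_SENTENCE], CLAUSES_OF, c)
         for c in CONFIGS}
print(rates)  # {1: 0.0, 2: 50.0, 3: 100.0}
```

On this toy data, splitting the TM recovers the simple query, and splitting the query as well recovers the complex one through its subordinate clause.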
8. Experiment and Results
"S" denotes the sentences, "C" denotes the clauses of these sentences, and "S, C" denotes the sentences
and their clauses.
Result on Set-A
Table 1
10. Experiment and Results
Conclusion 1
Table 1, Table 2 and Table 3 show that whenever there is clause splitting, either in the
Query data or in the TM data, there is an increase in % Match from the TM.
Conclusion 2
The Query data set is different in each of Table 1, Table 2 and Table 3. In spite of the
different data sets (Set-A, Set-B, Set-C), we observe that for each data set there is an
increasing trend in % Match from the TM. Hence we conclude that, for the data sets tested,
clause splitting increases the % Match over sentence-level data.
Conclusion 3
In Table 1, Table 2 and Table 3, the increase in % Match is not uniform. This is
because the % Match for a given data set depends on two factors,
the TM data and the Query data, which is to be expected.
11. Summary
We have seen that when we use clause-level structuring in the TM, relevant matches for
Query data that were earlier dropped due to a low match percentage at the sentence level
are also retrieved in the result set.
So we get more relevant matches for Query data from the TM database.
This study uses different TM configurations (TM configuration 1, TM configuration 2,
TM configuration 3) to support the above claim on different test data sets. A translator
might not get a match for a complete sentence but can still get a match for a
clause, which helps the translator perform the translation task better, thereby increasing
productivity (translated words per hour).
12. Acknowledgments
Firstly, I would like to express my sincere gratitude to my Supervisor Dr. Pawan Kumar for
his continuous support of my M.Tech. study and related research, and for his patience,
motivation, and immense knowledge. His guidance helped me throughout the research and
writing of this dissertation.
I would also like to thank Dr. Mukul Kumar Sinha for his insightful comments and
encouragement, and for asking hard questions which helped me widen my research from
various perspectives.
I would also like to thank my colleague, Ms. Himanshi Thapliyal, for editing and
proofreading the dissertation.
Last but not least, I would like to express my love and gratitude to my beloved family for
their understanding and motivation throughout the duration of this project.