2015ht13439 final presentation

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE
PILANI (RAJASTHAN)
April, 2018
Structuring of Translation Memory
By
Ashutosh Kumar
2015HT13439
BITS ZG628T: Dissertation

Introduction
What is Translation Memory (TM)?
Translation Memory (TM) is an archive of previously translated segments. It stores each
source-language segment together with its translation into the target language. Here, a
segment is a single sentence or a single paragraph.
Uses of TM
When a translator uses a TM tool to translate a new segment, the tool identifies
similarities between the query segment (the segment that has to be translated) and
the segments stored in the TM database. The translator may then choose one of the matches
and insert it as is or make slight changes to fit the given segment.
Benefits of TM
TM helps human translators increase their productivity.
It helps to ensure that the same term is used consistently across translations.
It helps to ensure a uniform style of translation across a large document.

Problem statement
Sentence-level TM
Most TM tools store sentence-level segments in the TM database. Hence the benefits of TM
are realized only for identical or similar sentences, which may occur rarely because sentences
are usually complex, while sentence fragments (clauses) may match more often.
In a TM made up of sentence-level segments, an input sentence may consist of a sub-segment
(clause) whose translation is available in the TM. Yet the search-and-retrieval function
will not return any result, because the matching percentage (36%) is lower than
the defined TM threshold (75%).
Query data                                 TM data
1. Earphone is the best option             1. Earphone is the best option
   available for them as it doesn't           available for them as it doesn't
   disturb others' sleep                      disturb others' sleep
2. It doesn't disturb others' sleep
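The slides do not say which similarity measure produces the matching percentage. As a minimal sketch, assuming a word-level edit distance normalized by the length of the longer segment (a common choice in TM tools, not confirmed by the source), the comparison against the threshold could look like this. The function names `word_edit_distance` and `fuzzy_match` are illustrative.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete a token from a
                           cur[j - 1] + 1,             # insert a token into a
                           prev[j - 1] + (wa != wb)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_match(query, segment):
    """Assumed score in [0, 1]: 1 - distance / length of the longer segment."""
    q, s = query.lower().split(), segment.lower().split()
    longest = max(len(q), len(s))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(q, s) / longest


THRESHOLD = 0.75  # the 75% threshold quoted in the slides

tm_sentence = ("Earphone is the best option available for them "
               "as it doesn't disturb others' sleep")
query = "It doesn't disturb others' sleep"

score = fuzzy_match(query, tm_sentence)
print(f"{score:.0%}", "retrieved" if score >= THRESHOLD else "dropped")
```

Under this assumed measure, query 2 scored against the stored sentence rounds to 36%, so it falls below the 75% threshold and is dropped even though the clause itself is present in the TM.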

Introduction
Structuring of TM (clause-level)
In our approach we use clause splitting to define a clause-level structure for the TM. We split
each sentence into clauses and store the clauses along with the full sentence. So in the
clause-level structure, the TM contains the given sentences along with their clauses.
Retrieving clauses is desirable because a match is more likely to be found for a
clause than for a complex sentence (one that contains more than one clause). A clause expresses
a complete thought because it comprises a subject and a verb. Hence even if a translator
does not find a match for the entire sentence, he or she might still get matches for its clauses
and will therefore benefit.

Experiments and their Results
Clause Extraction
For clause extraction we used the parsing module of OpenNLP. OpenNLP provides machine
learning models that are trained on a “Penn Treebank POS” tagged corpus. OpenNLP[6] creates a
“constituency parse tree”, also known as a “phrase parse tree”.
Example: for the sentence
“Earphone is the best option available for them as it doesn't disturb others' sleep”
the generated bracket-notation tree is given below:
[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] [NN option]] [ADJP
[ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] [SBAR [IN as] [S [NP [PRP it]] [VP
[VBD doesn't] [NP [ADJP [NN disturb] [JJ others']] [NN sleep]]]]]]]] [. .] ]]

From the parse tree we can see that there are two clauses:
Clause 1: Earphone is the best option available for them
Clause 2: as it doesn't disturb others' sleep
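The slides do not spell out how the clauses are read off the parse tree. A minimal sketch, assuming that every SBAR subtree is treated as a separate clause and that the remaining words form the main clause, could look like this. The helper names `parse_brackets` and `split_clauses` and the SBAR rule itself are assumptions made for illustration.

```python
import re

TREE = """[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] [NN option]] [ADJP
[ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] [SBAR [IN as] [S [NP [PRP it]] [VP
[VBD doesn't] [NP [ADJP [NN disturb] [JJ others']] [NN sleep]]]]]]]] [. .] ]]"""

PUNCTUATION_LABELS = {".", ",", ":"}


def parse_brackets(text):
    """Parse bracket notation into nested (label, children) tuples."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1                      # skip the closing "]"
        return (label, children)

    return node()


def split_clauses(tree):
    """Assumed rule: each SBAR subtree is one clause, the rest is the main clause."""
    main, subordinate = [], []

    def walk(node, out):
        label, children = node
        if label in PUNCTUATION_LABELS:
            return
        if label == "SBAR":
            out = []                  # words below this node go to a new clause
            subordinate.append(out)
        for child in children:
            if isinstance(child, str):
                out.append(child)
            else:
                walk(child, out)

    walk(tree, main)
    clauses = [" ".join(main)] + [" ".join(words) for words in subordinate]
    return [c for c in clauses if c]


for number, clause in enumerate(split_clauses(parse_brackets(TREE)), 1):
    print(f"Clause {number}: {clause}")
```

Applied to the tree above, this rule yields the same two clauses listed in the slides. Other clause markers (for example coordinated S nodes) are not handled in this sketch.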

Experimental Data
For our experiments we created three different test data sets for the English language. Set-A
contains 100 input sentences, Set-B contains 200 input sentences and Set-C contains 300 input
sentences. For the TM we have a data set of 3,500 sentences. This data contains both simple
and complex sentences. A complex sentence contains more than one clause.
TM Configuration
In this dissertation we have performed experiments with three kinds of TM configurations.
TM configuration 1
In TM configuration 1, Query data contains full sentences. TM data also contains full sentences. In
this configuration, there is no clause level splitting either in Query data or in TM data.
TM configuration 2
In TM configuration 2, Query data contains full sentences. TM data contains full sentences as well as
the clauses of those sentences. We used the clause splitter to structure the TM data in this configuration.
TM configuration 3
In TM configuration 3, Query data contains the original sentence and, for complex
sentences, its clauses. So there is not just one query but a set of queries when looking into the TM database.
Similarly, TM data contains the original sentences as well as the clauses of those sentences. We used
the clause splitter to structure the Query data as well in this configuration.
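The three configurations can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: `% Match` is assumed to be the share of query sentences for which at least one variant reaches the threshold, `similarity` uses `difflib.SequenceMatcher` as a stand-in scorer, and `toy_split` is a hypothetical splitter that replaces the OpenNLP-based one.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in fuzzy score in [0, 1] over lowercase word tokens."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def expand(sentences, split_clauses):
    """Return every sentence, followed by its clauses when it has more than one."""
    segments = []
    for sentence in sentences:
        segments.append(sentence)
        clauses = split_clauses(sentence)
        if len(clauses) > 1:
            segments.extend(clauses)
    return segments


def percent_match(queries, tm_sentences, config, split_clauses, threshold=0.75):
    """Assumed % Match: share of queries with at least one hit at or above threshold."""
    if not queries:
        return 0.0
    # Configuration 1 keeps the TM at sentence level; 2 and 3 add clauses to the TM.
    tm_segments = tm_sentences if config == 1 else expand(tm_sentences, split_clauses)
    hits = 0
    for query in queries:
        # Only configuration 3 also splits the query into a set of queries.
        variants = expand([query], split_clauses) if config == 3 else [query]
        if any(similarity(v, seg) >= threshold for v in variants for seg in tm_segments):
            hits += 1
    return 100.0 * hits / len(queries)


def toy_split(sentence):
    """Hypothetical splitter: cut before the first ' as ' (illustration only)."""
    parts = sentence.split(" as ", 1)
    return [parts[0], "as " + parts[1]] if len(parts) == 2 else [sentence]


tm = ["Earphone is the best option available for them "
      "as it doesn't disturb others' sleep"]
queries = ["It doesn't disturb others' sleep"]

for config in (1, 2, 3):
    print(config, percent_match(queries, tm, config, toy_split))
```

With this toy data, the query is not retrieved in configuration 1 but is retrieved in configurations 2 and 3, because the stored clause is a much closer match than the full sentence. The numbers come from the toy example only and say nothing about the reported results.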

Experiment and Results

‘S’ denotes the sentences, ‘C’ denotes the clauses of those sentences, and ‘S, C’ denotes the sentences
together with their clauses.
Table 1: Result on Set-A
Table 2: Result on Set-B
Table 3: Result on Set-C

Conclusion 1
Table 1, Table 2 and Table 3 show that whenever there is clause splitting, either in the
Query data or in the TM data, there is an increase in % Match from the TM.
Conclusion 2
In Table 1, Table 2 and Table 3, the Query data set is different in each case. In spite of the
different data sets (Set-A, Set-B, Set-C), we observe for each data set an
increasing trend in % Match from the TM. Hence we conclude that, across all three data sets,
clause splitting increases the % Match over sentence-level data.
Conclusion 3
In Table 1, Table 2 and Table 3, the increase in % Match is not uniform across cases. This is
because the % Match for a given data set depends on two factors,
the TM data and the Query data, which is to be expected.

Summary
We have seen that when we use clause-level structuring in the TM, relevant matches for
Query data that were earlier dropped because of a low sentence-level match percentage
are also retrieved in the result set.
So we get more relevant matches for Query data from the TM database.
This study uses different TM configurations (TM configuration 1, TM configuration 2,
TM configuration 3) to support the above claim on different test data sets. A translator
might not get a match for a complete sentence but will still get a match for a
clause, which helps him or her perform the translation task better, thereby increasing
productivity (translated words per hour).

Acknowledgments
Firstly, I would like to express my sincere gratitude to my supervisor, Dr. Pawan Kumar, for
his continuous support of my M.Tech. study and related research, and for his patience,
motivation, and immense knowledge. His guidance helped me throughout the research and
the writing of this dissertation.
I would also like to thank Dr. Mukul Kumar Sinha for his insightful comments and
encouragement, and for asking hard questions that helped me widen my research from
various perspectives.
I would also like to thank my colleague, Ms. Himanshi Thapliyal, for editing and
proofreading the dissertation.
Last but not least, I would like to express my love and gratitude to my beloved family for
their understanding and motivation throughout the duration of this project.

References
[1] Reinke, U. (2013). State of the Art in Translation Memory Technology. Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23.
[2] Grönroos, Mickel, and Becks, Ari (2005). Bringing Intelligence to Translation Memory Technology. Translating and the Computer 27, November 2005. London: Aslib.
[3] Timonera, Katerina, and Mitkov, Ruslan (2015). Improving Translation Memory Matching through Clause Splitting. Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.
[4] Sharma, Sanjeev Kumar (2016). Clause Boundary Identification for Different Languages: A Survey. International Journal of Computer Applications & Information Technology, Vol. 8, Issue II (ISSN: 2278-7720).
[5] https://stanfordnlp.github.io/CoreNLP/
[6] https://opennlp.apache.org/
[7] https://www.ibm.com/developerworks/library/x-localis3/
[8] LeBlanc, Matthieu. Translators on translation memory (TM). Results of an ethnographic study in three translation services and agencies. Université de Moncton.
[9] Christensen, Tina Paulsen, and Schjoldager, Anne (2011). The Impact of Translation-Memory (TM) Technology on Cognitive Processes. NLPSC 2011.
[10] Zerfass, A. (2002). Evaluating Translation Memory Systems. Proceedings of the LREC 2002 Workshop, Las Palmas, Canary Islands, Spain.
[11] Baldwin, Timothy, and Tanaka, Hozumi (2001). Balancing up Efficiency and Accuracy in Translation Retrieval. Journal of Natural Language Processing, vol. 8.
