ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
1. Use of Paraphrasing to Improve Matching and Retrieval in Translation Memory
Rohit Gupta, University of Wolverhampton
Supervisors:
Dr Constantin Orasan, University of Wolverhampton
Prof Josef van Genabith, Saarland University and DFKI
Prof Ruslan Mitkov, University of Wolverhampton
3. Objective
- Improving matching and retrieval in Translation Memory with the help of advanced language technology. This is achieved by:
  - using paraphrases
  - using semantic information
5. Limitations of current TMs
- Surface form comparison
- No or very limited linguistic information
- Paraphrased segments either not retrieved or ranked incorrectly among the retrieved segments
6. Limitations of current TMs
- Fuzzy scores are really fuzzy
  - Input_1: the period laid down in article 4(3)
  - Input_2: the responsible person defined in article 4(3)
  - TM: the duration set forth in article 4(3)
57% fuzzy score as per word-based edit-distance for both input sentences
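The 57% figure can be reproduced with a plain word-level Levenshtein distance. The sketch below is illustrative only: the function names are hypothetical, and normalising by the length of the longer segment is an assumption, since the slide does not state the exact formula.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (unit cost per insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        cur = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def fuzzy_score(input_segment, tm_segment):
    """Similarity in [0, 1]: 1 - distance / length of the longer segment (assumed normalisation)."""
    a, b = input_segment.split(), tm_segment.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / longest


tm = "the duration set forth in article 4(3)"
input_1 = "the period laid down in article 4(3)"
input_2 = "the responsible person defined in article 4(3)"
score_1 = fuzzy_score(input_1, tm)
score_2 = fuzzy_score(input_2, tm)
print(round(score_1 * 100), round(score_2 * 100))  # 57 57
```

Both inputs need three word substitutions out of seven words, so the surface score cannot tell the paraphrase (Input_1) from the unrelated segment (Input_2).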
14. Our Approach
1. Dynamic programming and Greedy approximation
2. Classification of paraphrases
3. Dealing with different paraphrase types in different ways
4. Filtering
18. Classification of Paraphrases: 4 Types
i. One word paraphrases
  - “period” => “duration”
ii. Multiple words but differing in one word
  - “in the period” => “during the period”
iii. Differing in multiple words but having same number of words
  - “laid down in article” => “set forth in article”
iv. Differing in multiple words with different number of words
  - “a reasonable period of time to” => “a reasonable period to”
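The four types can be told apart from token counts and the number of differing positions. The following is a minimal sketch under assumptions not stated on the slide: the name `classify_paraphrase` is hypothetical, equal-length phrases are compared position by position, and any pair with different word counts is treated as type (iv).

```python
def classify_paraphrase(source, target):
    """Return 'i', 'ii', 'iii' or 'iv' for a source => target paraphrase pair."""
    src, tgt = source.split(), target.split()
    if len(src) != len(tgt):
        return "iv"   # different number of words
    if len(src) == 1:
        return "i"    # one-word paraphrase
    # Assumption: equal-length phrases are compared word by word, in order.
    differing = sum(1 for s, t in zip(src, tgt) if s != t)
    return "ii" if differing <= 1 else "iii"


examples = [
    ("period", "duration"),
    ("in the period", "during the period"),
    ("laid down in article", "set forth in article"),
    ("a reasonable period of time to", "a reasonable period to"),
]
types = [classify_paraphrase(s, t) for s, t in examples]
print(types)  # ['i', 'ii', 'iii', 'iv']
```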
20. Example
The period laid down in article 4(3) of decision 468 …
[Figure: paraphrase lattice. Alternatives for “period”: “duration”, “time”.]
21. Example
The period laid down in article 4(3) of decision 468 …
[Figure: paraphrase lattice. Alternatives for “period”: “duration”, “time”. Alternatives for “laid down in article”: “referred to in article”, “provided for by article”.]
23. Example
The period laid down in article 4(3) of decision 468 …
[Figure: paraphrase lattice. “period” | “duration” | “time”; “laid down” | “referred to” | “provided for”; “in” | “by”; “article”. Source length labels: 2, 3.]
33. Edit-distance Calculation
             0  1    2         3     4     5   31        41  32        42   52
TM           #  the  period    laid  down  in  referred  to  provided  for  by
                     duration
                     time
Input
0 #          0  1    2         3     4     5   3         4   3         4    5
1 the        1  0    1         2     3     4   2         3   2         3    4
2 period     2  1    0         1     2     3   1         2   1         2    3
3 referred   3  2    1         1     2     3   0         1   1         2    3
4 to         4  3    2         2     2     3   1         0   2         2    3
5 in         5  4    3         3     3     2   2         1   3         3    3
40. Edit-distance Calculation
             0  1    2         3     4     5   31        41  32        42   52   5
TM           #  the  period    laid  down  in  referred  to  provided  for  by   in
                     duration
                     time
Input
0 #          0  1    2         3     4     5   3         4   3         4    5    5
1 the        1  0    1         2     3     4   2         3   2         3    4    4
2 period     2  1    0         1     2     3   1         2   1         2    3    3
3 referred   3  2    1         1     2     3   0         1   1         2    3    2
4 to         4  3    2         2     2     3   1         0   2         2    3    1
5 in         5  4    3         3     3     2   2         1   3         3    3    0
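One way to read the table is as an edit distance computed over a lattice of TM variants rather than a single string. The sketch below illustrates that idea and is not necessarily the exact column layout or the greedy approximation used in the talk. The function names and the `(start, end, replacement)` encoding of paraphrases are assumptions made for this example.

```python
def extend(column, token, input_tokens):
    """Advance one Levenshtein column by consuming one TM-side token."""
    new = [column[0] + 1]
    for k, word in enumerate(input_tokens, start=1):
        cost = 0 if word == token else 1
        new.append(min(column[k] + 1, new[k - 1] + 1, column[k - 1] + cost))
    return new


def paraphrase_edit_distance(tm_tokens, input_tokens, paraphrases):
    """Minimum word edit distance between the input and any paraphrased TM variant.

    paraphrases: list of (start, end, replacement_tokens) with start < end,
    meaning tm_tokens[start:end] may be replaced by replacement_tokens.
    """
    n = len(tm_tokens)
    # Lattice edges grouped by the TM position at which they end.
    incoming = {j: [(j - 1, [tm_tokens[j - 1]])] for j in range(1, n + 1)}
    for start, end, replacement in paraphrases:
        incoming[end].append((start, list(replacement)))

    # columns[j][k]: best distance between any path covering tm_tokens[:j]
    # and the input prefix input_tokens[:k].
    columns = {0: list(range(len(input_tokens) + 1))}
    for j in range(1, n + 1):
        best = None
        for start, tokens in incoming[j]:
            col = columns[start]
            for token in tokens:
                col = extend(col, token, input_tokens)
            best = col if best is None else [min(x, y) for x, y in zip(best, col)]
        columns[j] = best
    return columns[n][-1]


tm = "the period laid down in".split()
inp = "the period referred to in".split()
paraphrases = [
    (1, 2, ["duration"]),
    (1, 2, ["time"]),
    (2, 4, ["referred", "to"]),         # replaces "laid down"
    (2, 5, ["provided", "for", "by"]),  # replaces "laid down in"
]
baseline = paraphrase_edit_distance(tm, inp, [])
with_paraphrases = paraphrase_edit_distance(tm, inp, paraphrases)
print(baseline, with_paraphrases)  # 2 0
```

The baseline distance of 2 matches the bottom cell of column 5 in the table, and the distance of 0 matches the bottom cell of the final "in" column, which continues from the "referred to" paraphrase.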
42. Computational Complexity
- Only type (i) and type (ii) paraphrases:
  - O(mn log(p)), p: paraphrases of types (i) and (ii)
- All paraphrases:
  - O(lmn(log(p) + q)), q: paraphrases of types (iii) and (iv), l: length of paraphrase
46. Filtering
1. Filter out the segments based on length (39%)
2. Filter out the candidates based on baseline edit-distance similarity (39%)
3. Pick the top 100 segments
4. Segments within a certain range of similarity with the most similar segment are selected for paraphrasing (35%)
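The four filtering steps can be sketched as a small pipeline. How the percentages are applied is not spelled out on the slide, so the readings here are assumptions: 39% is taken as a minimum ratio of shorter to longer segment length in step 1 and as a minimum baseline similarity in step 2, and 35% as the maximum similarity gap to the best candidate in step 4. All function and parameter names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        cur = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1 - distance / length of the longer segment (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - word_edit_distance(a, b) / longest


def select_for_paraphrasing(input_segment, tm_segments, length_ratio=0.39,
                            min_similarity=0.39, top_k=100, window=0.35):
    inp = input_segment.split()
    # Step 1 (assumed reading): drop segments whose length is too different.
    kept = []
    for segment in tm_segments:
        tokens = segment.split()
        if not tokens or not inp:
            continue
        if min(len(tokens), len(inp)) / max(len(tokens), len(inp)) >= length_ratio:
            kept.append((segment, tokens))
    # Step 2: drop candidates below the baseline edit-distance similarity.
    scored = [(similarity(inp, tokens), segment) for segment, tokens in kept]
    scored = [(score, segment) for score, segment in scored if score >= min_similarity]
    # Step 3: keep the top_k most similar candidates.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    scored = scored[:top_k]
    # Step 4 (assumed reading): keep candidates close to the best one.
    if not scored:
        return []
    best = scored[0][0]
    return [segment for score, segment in scored if best - score <= window]


tm_segments = [
    "the duration set forth in article 4(3)",
    "the period laid down in article 5",
    "completely unrelated",
    "the period",
]
query = "the period laid down in article 4(3)"
selected = select_for_paraphrasing(query, tm_segments)
selected_narrow = select_for_paraphrasing(query, tm_segments, window=0.2)
```

Only the segments returned by the last step would then be expanded with paraphrases, which keeps the more expensive paraphrase-aware matching off most of the TM.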
47. Experiments
- Corpus Used:
  - Europarl V7.0
  - English-German pairs
More results on DGT-TM (English-French) in:
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT-2014, Dubrovnik, Croatia.
48. Corpus statistics: Europarl
                        TM           Test
Segments                1,565,194    9,981
Source words            37,824,634   240,916
Target words            36,267,909   230,620
Source average length   24.16        24.13
Target average length   23.17        23.10
55. Evaluations
- Post-Editing time
- Keystrokes
- Subjective Evaluation, 2 Options
  - A is better
  - B is better
- Subjective Evaluation, 3 Options (one more option added)
  - Both are equal
56. Experimental Settings: Post-editing time and Keystrokes
- Each file contains segments of both types (ED+PP)
- Each file is post-edited by 5 translation students
  - German: Native
  - English: C1
67. Segment-wise analysis
- Statistical significance testing per segment
  - Welch's t-test (one-tailed, p < 0.05)
- Paraphrasing (Keystrokes/Post-Editing Time):
  - Twelve segments are significantly better
  - For ten segments, all other evaluations also show them to be better
- Edit-Distance (Keystrokes/Post-Editing Time):
  - Three segments are significantly better
  - Not all evaluations show them to be better
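For reference, Welch's t-test compares two samples without assuming equal variances. The sketch below computes only the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library. The keystroke counts are made-up illustrative numbers, not data from the experiments, and the function name is hypothetical. The one-tailed p-value would then be read from a t-distribution with the computed degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean (sample variance / n).
    se_a = variance(sample_a) / len(sample_a)
    se_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical keystroke counts for one segment, five post-editors per condition.
keystrokes_pp = [10, 12, 11, 13, 9]    # paraphrasing-based match (made-up values)
keystrokes_ed = [20, 22, 19, 24, 21]   # edit-distance match (made-up values)
t_stat, dof = welch_t(keystrokes_pp, keystrokes_ed)
```

A negative t statistic here means fewer keystrokes for the paraphrasing condition. Whether that is significant at p < 0.05 depends on the one-tailed critical value for the computed degrees of freedom.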
68. Conclusion
- Presented an approach to include paraphrasing in TM matching and retrieval
- Presented human evaluations
- In future, we will use deep learning for TM matching and retrieval
69. Related Publications
- Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT-2014, Dubrovnik, Croatia.
- Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT-2015, Antalya, Turkey.
- Rohit Gupta, Hanna Bechara, Ismail El Maarouf, and Constantin Orasan. 2014a. UoW: NLP techniques developed at the University of Wolverhampton for Semantic Similarity and Textual Entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), COLING-2014, Dublin, Ireland.
- Rohit Gupta, Hanna Bechara, and Constantin Orasan. 2014b. Intelligent Translation Memory Matching and Retrieval Metric Exploiting Linguistic Technology. In Proceedings of the Thirty-Sixth Conference on Translating and the Computer, London, UK.
70. References
- Jane Bradbury and Ismaïl El Maarouf. 2013. An empirical classification of verbs based on Semantic Types: the case of the ’poison’ verbs. In Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora, pages 70–74.
- Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL-HLT, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.
- Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014).
- Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC 2014.
- Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2012. DGT-TM: A freely available Translation Memory in 22 languages. In Proceedings of LREC, pages 454–459.