Use of Paraphrasing to Improve Matching and Retrieval in Translation Memory
Rohit Gupta, University of Wolverhampton
Supervisors:
Dr Constantin Orasan, University of Wolverhampton
Prof Josef van Genabith, Saarland University and DFKI
Prof Ruslan Mitkov, University of Wolverhampton
Outline
S  Objective
S  Translation Memory
S  Incorporating Paraphrasing
S  Human Evaluation
S  Conclusion
Objective
S  Improving matching and retrieval in Translation Memory
with the help of advanced language technology. This is
achieved by:
S  using paraphrases
S  using semantic information
Limitations of current TMs
S  Surface form comparison
S  No or very limited linguistic information
S  Paraphrased segments either not retrieved or ranked
incorrectly among the retrieved segments
Limitations of current TMs
S  Fuzzy scores are really fuzzy
S  Input_1: the period laid down in article 4(3)
S  Input_2: the responsible person defined in article 4(3)
S  TM: the duration set forth in article 4(3)
57% fuzzy score as per word-based edit-distance for
both input sentences
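The 57% figure can be reproduced with a word-level edit distance. The sketch below is a minimal illustration, assuming whitespace tokenisation, non-empty segments, and the common normalisation 1 - distance / length of the longer segment; the function names are hypothetical, and the exact formula used by a particular TM tool may differ.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word lists (all costs equal to 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (wa != wb),  # substitution or match
                           prev[j] + 1,               # deletion
                           cur[j - 1] + 1))           # insertion
        prev = cur
    return prev[-1]


def fuzzy_score(inp, tm):
    """Similarity in [0, 1]: 1 - distance / length of the longer segment."""
    a, b = inp.split(), tm.split()
    return 1 - word_edit_distance(a, b) / max(len(a), len(b))


tm = "the duration set forth in article 4(3)"
for inp in ("the period laid down in article 4(3)",
            "the responsible person defined in article 4(3)"):
    print(f"{fuzzy_score(inp, tm):.0%}  {inp}")
```

Both inputs differ from the TM segment in three of seven words, so they receive the same score even though only the first one is a paraphrase of it.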
Paraphrasing in TM
Matching and Retrieval
Paraphrases
S  PPDB: The paraphrase database (Ganitkevitch et al., 2013)
S  Phrasal and lexical paraphrases
S  L size (2 million)
Concept behind paraphrases
Figure from Ganitkevitch et al., 2013
Trivial Approach
S  Generate additional segments based on the available paraphrases
Complexity of Trivial Approach
W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 W14 W15 W16 W17 W18 W19 W20 W21 W22 W23 W24 W25
W1 W2 W3 W4 W5 | W6 W7 W8 W9 W10 | W11 W12 W13 W14 W15 | W16 W17 W18 W19 W20 | W21 W22 W23 W24 W25
Paraphrases per chunk: 5 5 5 5 5
(5+1)^5 - 1 = 7775 more segments
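The count above follows from each chunk contributing its original wording plus its paraphrases. A minimal sketch of that arithmetic, and of the expansion itself, is given here; the helper names and the small paraphrase dictionary are hypothetical and serve only as illustration.

```python
from itertools import product


def count_new_segments(paraphrases_per_chunk):
    """Number of extra segments generated by exhaustive expansion."""
    total = 1
    for n in paraphrases_per_chunk:
        total *= n + 1   # n paraphrases plus the original chunk
    return total - 1     # exclude the unchanged segment


def expand(chunks, paraphrases):
    """Yield every variant of the segment, the original one first."""
    choices = [[chunk] + paraphrases.get(chunk, []) for chunk in chunks]
    for combo in product(*choices):
        yield " ".join(combo)


print(count_new_segments([5, 5, 5, 5, 5]))  # 7775

demo = {"the period": ["the duration", "the time"],
        "laid down in": ["referred to in", "provided for by"]}
for variant in expand(["the period", "laid down in"], demo):
    print(variant)
```

Even this two-chunk demo turns one segment into nine, which is why the expansion is not practical for a TM with more than a million segments.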
Our Approach
1.  Dynamic programming and greedy approximation
2.  Classification of paraphrases
3.  Handling different types of paraphrases in different ways
4.  Filtering
Classification of Paraphrases:
4 Types
i.  One word paraphrases
S  “period” => “duration”
ii.  Multiple words but differing in one word
S  “in the period” => “during the period”
iii.  Differing in multiple words but having same number of words
S  “laid down in article” => “set forth in article”
iv.  Differing in multiple words with different number of words
S  “a reasonable period of time to” => “a reasonable period to”
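The four types can be told apart from the word counts and the number of differing positions. The function below is a sketch under assumptions that are not stated on the slides: whitespace tokenisation, position-wise comparison for equal-length paraphrases, and any pair with different word counts being treated as type (iv). The name `paraphrase_type` is hypothetical.

```python
def paraphrase_type(source, target):
    """Return 'i', 'ii', 'iii' or 'iv' for a source => target paraphrase pair."""
    src, tgt = source.split(), target.split()
    if src == tgt:
        raise ValueError("source and target are identical")
    if len(src) == 1 and len(tgt) == 1:
        return "i"                       # one-word paraphrase
    if len(src) != len(tgt):
        return "iv"                      # different number of words
    differing = sum(s != t for s, t in zip(src, tgt))
    return "ii" if differing == 1 else "iii"


examples = [("period", "duration"),
            ("in the period", "during the period"),
            ("laid down in article", "set forth in article"),
            ("a reasonable period of time to", "a reasonable period to")]
for source, target in examples:
    print(paraphrase_type(source, target), "|", source, "=>", target)
```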
Example
The period laid down in article 4(3) of decision 468 …
Paraphrases attached to the segment:
S  period: duration, time
S  laid down: referred to (source length 2)
S  laid down in: provided for by (source length 3)
General Edit-distance
Implementation
Insertion cost = Deletion cost = Substitution cost =1
Edit-distance Calculation
Columns: TM segment with its paraphrases. Rows: Input segment.

Index        0   1    2        3     4     5   3_1       4_1  3_2       4_2  5_2  5
TM           #   the  period*  laid  down  in  referred  to   provided  for  by   in
0 #          0   1    2        3     4     5   3         4    3         4    5    5
1 the        1   0    1        2     3     4   2         3    2         3    4    4
2 period     2   1    0        1     2     3   1         2    1         2    3    3
3 referred   3   2    1        1     2     3   0         1    1         2    3    2
4 to         4   3    2        2     2     3   1         0    2         2    3    1
5 in         5   4    3        3     3     2   2         1    3         3    3    0

* "period" also carries its one-word paraphrases "duration" and "time".
Columns 3_1 and 4_1 hold the paraphrase "referred to" (replacing "laid down"); columns 3_2, 4_2 and 5_2 hold "provided for by" (replacing "laid down in").
The final column 5 ("in") continues from column 4_1, giving an edit distance of 0 instead of 2.
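The table can be reproduced with a small dynamic program that extends the usual edit-distance columns with extra columns for each paraphrase. The sketch below is an illustration only, not the implementation behind the slides: it takes the exact minimum over all paraphrase paths rather than the greedy approximation listed under "Our Approach", and the names `next_column`, `paraphrase_edit_distance`, `one_word` and `multi_word` are hypothetical.

```python
def next_column(prev, tm_words, inp):
    """Extend a DP column by one TM position that accepts any word in tm_words."""
    col = [prev[0] + 1]
    for i, word in enumerate(inp, start=1):
        sub = 0 if word in tm_words else 1
        col.append(min(prev[i - 1] + sub,  # substitution or match
                       prev[i] + 1,        # TM word left unmatched
                       col[i - 1] + 1))    # input word left unmatched
    return col


def paraphrase_edit_distance(tm, inp, one_word=None, multi_word=None):
    """Edit distance between inp and the best paraphrased variant of tm.

    one_word:   {tm_word: [alternative words]}            (types i and ii)
    multi_word: {tuple of tm words: [tuples of words]}    (types iii and iv)
    """
    one_word = one_word or {}
    multi_word = multi_word or {}
    cols = {0: list(range(len(inp) + 1))}  # cols[j]: column after j TM words
    pending = {}  # TM end position -> columns reached through a paraphrase
    for j in range(len(tm) + 1):
        if j > 0:
            words = {tm[j - 1]} | set(one_word.get(tm[j - 1], ()))
            candidates = [next_column(cols[j - 1], words, inp)]
            candidates += pending.pop(j, [])
            cols[j] = [min(values) for values in zip(*candidates)]
        for src, targets in multi_word.items():
            if tuple(tm[j:j + len(src)]) == src:
                for tgt in targets:
                    col = cols[j]
                    for word in tgt:
                        col = next_column(col, {word}, inp)
                    pending.setdefault(j + len(src), []).append(col)
    return cols[len(tm)][len(inp)]


tm = "the period laid down in".split()
inp = "the period referred to in".split()
one_word = {"period": ["duration", "time"]}
multi_word = {("laid", "down"): [("referred", "to")],
              ("laid", "down", "in"): [("provided", "for", "by")]}

print(paraphrase_edit_distance(tm, inp))                        # 2
print(paraphrase_edit_distance(tm, inp, one_word, multi_word))  # 0
```

Taking the element-wise minimum where a paraphrase path rejoins the original segment keeps the result exact; a greedy variant would instead commit to the single best paraphrase at that point.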
Computational Complexity
S  Only type (i) and type (ii) paraphrases:
S  O(mn log(p)), p: number of paraphrases of types (i) and (ii)
S  All paraphrases:
S  O(lmn(log(p) + q)), q: number of paraphrases of types (iii) and (iv), l: length of paraphrase
Filtering
1.  Filter out the segments based on length (39%)
2.  Filter out the candidates based on baseline edit-distance
similarity (39%)
3.  Pick the top 100 segments
4.  Segments within a certain range of similarity with the most
similar segment are selected for paraphrasing (35%)
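The four filtering steps can be sketched as a single pipeline. The slides give only the percentages, so the interpretation here is an assumption: 39% is read as a minimum ratio between the shorter and the longer segment, and again as a minimum baseline similarity, while 35% is read as the allowed gap below the best candidate's similarity. All function and parameter names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word lists (all costs equal to 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (wa != wb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def filter_candidates(inp, tm_segments, length_ratio=0.39, min_sim=0.39,
                      top_k=100, sim_range=0.35):
    """Return (similarity, segment) pairs that are worth paraphrasing."""
    inp_words = inp.split()
    kept = []
    for seg in tm_segments:
        seg_words = seg.split()
        shorter, longer = sorted((len(inp_words), len(seg_words)))
        # Step 1: drop segments whose length is too different (assumed ratio test).
        if longer == 0 or shorter / longer < length_ratio:
            continue
        # Step 2: drop segments with a low baseline edit-distance similarity.
        sim = 1 - word_edit_distance(inp_words, seg_words) / longer
        if sim < min_sim:
            continue
        kept.append((sim, seg))
    # Step 3: keep the top_k most similar segments.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    kept = kept[:top_k]
    if not kept:
        return []
    # Step 4: keep segments within sim_range of the most similar one.
    best = kept[0][0]
    return [(sim, seg) for sim, seg in kept if sim >= best - sim_range]


tm_segments = ["the duration set forth in article 4(3)",
               "the period laid down in article 5",
               "article 4(3)",
               "a completely different sentence about something else entirely"]
for sim, seg in filter_candidates("the period laid down in article 4(3)", tm_segments):
    print(f"{sim:.2f}  {seg}")
```

Only the surviving candidates need the more expensive paraphrase-aware edit distance, which is the point of filtering first.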
Experiments
S  Corpus Used:
S  Europarl V7.0
S  English-German pairs
More results on DGT-TM (English-French) in:
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory
Matching and Retrieval. In Proceedings of EAMT-2014, Dubrovnik, Croatia.
Corpus statistics: Europarl
                       TM          Test
Segments               1,565,194   9,981
Source words           37,824,634  240,916
Target words           36,267,909  230,620
Source average length  24.16       24.13
Target average length  23.17       23.10
Results: Europarl dataset
TH                100    95     90     85     80     75     70
Edit Retrieved    117    127    163    215    257    337    440
+Para Retrieved   16     16     22     33     49     79     102
% Improve         13.68  12.6   13.5   15.35  19.07  23.44  23.18
Rank Change (RC)  9      19     16     25     36     65     97
METEOR-Edit-RC    45.48  46.48  45.59  39.24  37.32  34.02  31.10
METEOR-Para-RC    68.08  67.03  61.09  50.07  44.16  38.35  33.19
BLEU-Edit-RC      31.88  32.37  27.70  21.71  19.32  14.98  12.25
BLEU-Para-RC      52.00  47.92  43.90  31.76  25.24  19.75  15.28
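The "% Improve" row is consistent with dividing the additional segments retrieved through paraphrasing by the segments retrieved with plain edit distance. The short check below assumes that reading of the two rows; the variable and function names are hypothetical.

```python
thresholds = [100, 95, 90, 85, 80, 75, 70]
edit_retrieved = [117, 127, 163, 215, 257, 337, 440]
para_retrieved = [16, 16, 22, 33, 49, 79, 102]


def percent_improve(extra, baseline):
    """Additional retrievals as a percentage of the edit-distance baseline."""
    return 100 * extra / baseline


for th, edit, para in zip(thresholds, edit_retrieved, para_retrieved):
    print(th, round(percent_improve(para, edit), 2))
```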
Results: Europarl dataset
TH                100    [85, 100)  [70, 85)
Edit Retrieved    117    98         225
+Para Retrieved   16     30         98
% Improve         13.68  30.61      43.55
Rank Change (RC)  9      14         55
METEOR-Edit-RC    45.48  34.37      25.76
METEOR-Para-RC    68.08  40.00      25.82
BLEU-Edit-RC      31.88  13.18      6.85
BLEU-Para-RC      52.00  17.10      8.37
Human Evaluation
Dataset: Human Evaluation
TH     100  [85, 100)  [70, 85)  Total
Set1   2    6          6         14
Set2   5    4          7         16
Total  7    10         13        30
Evaluations
S  Post-editing time
S  Keystrokes
S  Subjective evaluation with two options:
S  A is better
S  B is better
S  Subjective evaluation with three options (one more added):
S  Both are equal
Experimental Settings: Post-editing time and Keystrokes
S  Each file contains segments of both types (ED+PP)
S  Each file is post-edited by 5 translation students
S  German: Native
S  English: C1
Screen: Editing…
Screen: Resting or Start
Results: Keystrokes
[Stacked bar chart: number of keystrokes, Edit-Distance vs. Paraphrasing]
      Edit-Distance  Paraphrasing
Set1  532.6          356.2
Set2  570.6          468.59
25.23% fewer keystrokes
Results: Post-Editing Time
[Stacked bar chart: post-editing time in seconds, Edit-Distance vs. Paraphrasing]
      Edit-Distance  Paraphrasing
Set1  520.02         466.44
Set2  657.75         603.17
9.18% time saved
Results: Subjective Evaluation (Two Options, 17 Translators)
[Stacked bar chart: number of replies per option]
                         Set1  Set2
Edit-Distance is better  66    110
Paraphrasing is better   172   162
Results: Subjective Evaluation (Three Options, Seven Translators)
[Stacked bar chart: number of replies per option]
                         Set1  Set2
Edit-Distance is better  12    26
Paraphrasing is better   46    53
Both are equal           40    33
H-TER and H-METEOR
           Set1                         Set2
           Edit Distance  Paraphrasing  Edit Distance  Paraphrasing
HMETEOR5   59.82          81.44         69.81          80.60
HTER5      39.72          17.63         27.81          18.71
HMETEOR10  59.82          81.44         69.81          80.61
HTER10     36.93          18.46         27.26          18.40
Segment-wise analysis
S  Statistical significance testing per segment
S  Welch's t-test (one-tailed, p<0.05)
S  Paraphrasing (Keystrokes/Post-Editing Time):
S  Twelve segments are significantly better
S  For ten segments, all other evaluations also show them to be better
S  Edit-Distance (Keystrokes/Post-Editing Time):
S  Three segments are significantly better
S  Not all evaluations show them to be better
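A per-segment one-tailed Welch's t-test can be sketched with the standard library alone. The keystroke counts below are invented for illustration and are not data from the study; the function names are hypothetical, and the p-value is obtained by numerically integrating the Student's t density, which is an implementation choice made here rather than something the slides specify.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = abs(t) / steps
    total = pdf(0.0) + pdf(abs(t))
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3          # integral of the density from 0 to |t|
    tail = 0.5 - area
    return tail if t >= 0 else 1 - tail


# Hypothetical keystroke counts from five post-editors for one segment.
paraphrasing = [310, 295, 340, 360, 320]
edit_distance = [520, 480, 610, 450, 555]

t, df = welch_t(paraphrasing, edit_distance)
p = upper_tail(-t, df)  # one-tailed: paraphrasing needs fewer keystrokes
print(f"t = {t:.3f}, df = {df:.2f}, p = {p:.4f}, significant: {p < 0.05}")
```

Welch's variant does not assume equal variances in the two groups, which suits small samples such as five post-editors per segment.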
Conclusion
S  Presented an approach to include paraphrasing in TM matching and retrieval
S  Presented human evaluations
S  In future, we will use deep learning for TM matching and retrieval
Related Publications
S  Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation
Memory Matching and Retrieval. In Proceedings of EAMT-2014, Dubrovnik, Croatia.
S  Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van
Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of
EAMT-2015, Antalya, Turkey.
S  Rohit Gupta, Hanna Bechara, Ismail El Maarouf, and Constantin Orasan. 2014a. UoW:
NLP techniques developed at the University of Wolverhampton for Semantic Similarity
and Textual Entailment. In Proceedings of the 8th International Workshop on Semantic
Evaluation (SemEval-2014), COLING-2014, Dublin, Ireland.
S  Rohit Gupta, Hanna Bechara, and Constantin Orasan. 2014b. Intelligent Translation
Memory Matching and Retrieval Metric Exploiting Linguistic Technology. In Proceedings
of the thirty-sixth Conference on Translating and the Computer, London, UK.
References
S  Jane Bradbury and Ismaïl El Maarouf. 2013. An empirical classification of verbs based on
Semantic Types: the case of the 'poison' verbs. In Proceedings of the Joint Symposium on Semantic
Processing. Textual Inference and Structures in Corpora, pages 70–74.
S  Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The
paraphrase database. In Proceedings of NAACL-HLT, pages 758–764, Atlanta, Georgia.
Association for Computational Linguistics.
S  Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and
Roberto Zamparelli. 2014a. SemEval-2014 task 1: Evaluation of compositional distributional
semantic models on full sentences through semantic relatedness and textual entailment. In
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014).
S  Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and
Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional
semantic models. In Proceedings of LREC 2014.
S  Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2012.
DGT-TM: A freely available Translation Memory in 22 languages. In Proceedings of LREC, pages 454–459.
Thank you!

ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015

  • 1. S Use of Paraphrasing to Improve Matching and Retrieval in Translation Memory Rohit Gupta, University of Wolverhampton Supervisors: Dr Constantin Orasan, University of Wolverhampton Prof Josef van Genabith, Saarland University and DFKI Prof Ruslan Mitkov, University of Wolverhampton
  • 2. Outline S  Objective S  Translation Memory S  Incorporating Paraphrasing S  Human Evaluation S  Conclusion
  • 3. Objective S  Improving matching and retrieval in Translation Memory with the help of advanced language technology. This is achieved by: S  using paraphrases S  using semantic information
  • 4. Limitations of current TMs S  Surface form comparison S  No or very limited linguistic information
  • 5. Limitations of current TMs S  Surface form comparison S  No or very limited linguistic information S  Paraphrased segments either not retrieved or ranked incorrectly among the retrieved segments
  • 6. Limitations of current TMs S  Fuzzy scores are really fuzzy S  Input_1: the period laid down in article 4(3) S  Input_2: the responsible person defined in article 4(3) S  TM: the duration set forth in article 4(3) 57% fuzzy score as per word-based edit-distance for both input sentences
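The 57% figure on slide 6 can be reproduced with a small word-level edit-distance sketch. The slide does not say how the distance is normalised; the version below assumes the score is 1 minus the distance divided by the longer segment's length, and the function names `word_edit_distance` and `fuzzy_score` are illustrative only.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word lists, all operation costs equal to 1."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def fuzzy_score(segment, tm_segment):
    """Assumed normalisation: 1 - distance / length of the longer segment."""
    a, b = segment.split(), tm_segment.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / longest


tm = "the duration set forth in article 4(3)"
input_1 = "the period laid down in article 4(3)"
input_2 = "the responsible person defined in article 4(3)"
scores = [fuzzy_score(input_1, tm), fuzzy_score(input_2, tm)]
```

Both inputs differ from the TM segment in three of seven words, so a surface-level score cannot tell the paraphrase (Input_1) from the unrelated segment (Input_2).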
  • 8. Paraphrases
      - PPDB: The Paraphrase Database (Ganitkevitch et al., 2013)
      - Phrasal and lexical paraphrases
      - L size (2 million)
  • 9. Concept behind paraphrases
      Figure from Ganitkevitch et al., 2013
  • 10. Trivial Approach
      - Generate additional segments based on the paraphrases available
  • 12. Complexity of Trivial Approach
      Segment: W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 W14 W15 W16 W17 W18 W19 W20 W21 W22 W23 W24 W25
      Phrases: W1 W2 W3 W4 W5 | W6 W7 W8 W9 W10 | W11 W12 W13 W14 W15 | W16 W17 W18 W19 W20 | W21 W22 W23 W24 W25
      Paraphrases per phrase: 5 5 5 5 5
      (5+1)^5 - 1 = 7775 more segments
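The count on the slide follows from choosing, for each phrase, either the original or one of its paraphrases, and then discarding the one combination that leaves the segment unchanged. A minimal sketch of that arithmetic, with `extra_segments` as an illustrative name:

```python
from math import prod


def extra_segments(paraphrases_per_phrase):
    """Number of additional segments when every phrase can independently
    keep its original wording or take one of its paraphrases."""
    return prod(k + 1 for k in paraphrases_per_phrase) - 1


# Five phrases with five paraphrases each, as on the slide.
slide_example = extra_segments([5, 5, 5, 5, 5])
```

The growth is exponential in the number of paraphrasable phrases, which is why generating all variants up front does not scale to a large TM.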
  • 14. Our Approach
      1. Dynamic programming and greedy approximation
      2. Classification of paraphrases
      3. Dealing with different types of paraphrases in different ways
      4. Filtering
  • 18. Classification of Paraphrases: 4 Types
      i. One-word paraphrases
          - “period” => “duration”
      ii. Multiple words but differing in one word
          - “in the period” => “during the period”
      iii. Differing in multiple words but having the same number of words
          - “laid down in article” => “set forth in article”
      iv. Differing in multiple words with a different number of words
          - “a reasonable period of time to” => “a reasonable period to”
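The four types can be told apart from word counts and the number of differing positions. The sketch below is one possible reading, not the authors' implementation: it assumes that equal-length phrases are compared position by position, and `classify_paraphrase` is a hypothetical helper name.

```python
def classify_paraphrase(source, target):
    """Return 'i', 'ii', 'iii' or 'iv' for a paraphrase pair.

    Assumes the two phrases are not identical and that "differing in one
    word" means differing at exactly one aligned position.
    """
    src, tgt = source.split(), target.split()
    if len(src) == 1 and len(tgt) == 1:
        return "i"    # one-word paraphrase
    if len(src) != len(tgt):
        return "iv"   # different number of words
    differing = sum(1 for s, t in zip(src, tgt) if s != t)
    return "ii" if differing <= 1 else "iii"


examples = [
    ("period", "duration"),
    ("in the period", "during the period"),
    ("laid down in article", "set forth in article"),
    ("a reasonable period of time to", "a reasonable period to"),
]
types = [classify_paraphrase(s, t) for s, t in examples]
```

Types (i) and (ii) change a single word in place, which is what allows them to be handled more cheaply than types (iii) and (iv) in the complexity figures given later in the talk.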
  • 19. Example
      The period laid down in article 4(3) of decision 468 …
  • 21. Example
      The period laid down in article 4(3) of decision 468 …
      Alternatives for “period”: duration, time
      Alternatives for “laid down in article”: referred to in article, provided for by article
  • 23. Example
      The period laid down in article 4(3) of decision 468 …
      Alternatives for “period”: duration, time
      Alternatives starting at “laid”: referred to (source length 2), provided for by (source length 3)
  • 24. General Edit-distance Implementation
      Insertion cost = Deletion cost = Substitution cost = 1
  • 25. Edit-distance Calculation
      Input (columns 0-5): # the period laid down in
      TM (rows 0-5): # the period referred to in
  • 27. Edit-distance Calculation
      Column 2 (“period”) also carries the alternatives “duration” and “time”.
      Paraphrase columns: 31 41 = “referred to”; 32 42 52 = “provided for by”.
  • 40. Edit-distance Calculation
      Column      0  1    2       3     4     5   31        41  32        42   52  5
      Input       #  the  period  laid  down  in  referred  to  provided  for  by  in
      0 #         0  1    2       3     4     5   3         4   3         4    5   5
      1 the       1  0    1       2     3     4   2         3   2         3    4   4
      2 period    2  1    0       1     2     3   1         2   1         2    3   3
      3 referred  3  2    1       1     2     3   0         1   1         2    3   2
      4 to        4  3    2       2     2     3   1         0   2         2    3   1
      5 in        5  4    3       3     3     2   2         1   3         3    3   0
      The first six columns are the plain edit-distance table (slides 28-32); the paraphrase columns are added on slides 33-39; the final column is input column 5 (“in”) again, filled in after the paraphrase columns.
  • 42. Computational Complexity
      - Only type (i) and type (ii) paraphrases:
          - O(mn log(p)), p: paraphrases of types (i) and (ii)
      - All paraphrases:
          - O(lmn(log(p) + q)), q: paraphrases of types (iii) and (iv), l: length of paraphrase
  • 46. Filtering
      1. Filter out the segments based on length (39%)
      2. Filter out the candidates based on baseline edit-distance similarity (39%)
      3. Pick the top 100 segments
      4. Segments within a certain range of similarity with the most similar segment are selected for paraphrasing (35%)
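The four filtering steps can be sketched as a short pipeline. The slide gives only the percentages, so their exact meaning here is an assumption: 39% is read as the maximum relative length difference in step 1 and as the minimum baseline similarity in step 2, and 35% as the allowed gap to the best candidate in step 4. The function name, parameter names and candidate tuple layout are hypothetical.

```python
def filter_candidates(input_len, candidates, length_tol=0.39,
                      min_sim=0.39, top_k=100, sim_window=0.35):
    """Select TM candidates that are worth the costlier paraphrase matching.

    `candidates` is a list of (segment_id, segment_len, baseline_similarity)
    tuples with similarity in [0, 1]. Threshold meanings are assumptions.
    """
    # 1. Drop segments whose length differs too much from the input.
    kept = [c for c in candidates
            if abs(c[1] - input_len) <= length_tol * input_len]
    # 2. Drop candidates with a low baseline edit-distance similarity.
    kept = [c for c in kept if c[2] >= min_sim]
    # 3. Keep the top-k candidates by baseline similarity.
    kept = sorted(kept, key=lambda c: c[2], reverse=True)[:top_k]
    if not kept:
        return []
    # 4. Keep only candidates close enough to the most similar one.
    best = kept[0][2]
    return [c for c in kept if best - c[2] <= sim_window]


# Made-up candidates for illustration only.
demo = [("a", 10, 0.9), ("b", 20, 0.95), ("c", 9, 0.3),
        ("d", 11, 0.5), ("e", 12, 0.6)]
selected = [c[0] for c in filter_candidates(10, demo)]
```

Ordering the cheap checks first means the paraphrase-aware distance only has to be computed for a small number of segments per input.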
  • 47. Experiments
      - Corpus used:
          - Europarl V7.0
          - English-German pairs
      More results on DGT-TM (English-French) in: Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT-2014, Dubrovnik, Croatia.
  • 48. Corpus statistics: Europarl
                              TM          Test
      Segments                1,565,194   9,981
      Source words            37,824,634  240,916
      Target words            36,267,909  230,620
      Source average length   24.16       24.13
      Target average length   23.17       23.10
  • 51. Results: Europarl dataset
      TH                  100    95     90     85     80     75     70
      Edit Retrieved      117    127    163    215    257    337    440
      +Para Retrieved     16     16     22     33     49     79     102
      % Improve           13.68  12.6   13.5   15.35  19.07  23.44  23.18
      Rank Change (RC)    9      19     16     25     36     65     97
      METEOR-Edit-RC      45.48  46.48  45.59  39.24  37.32  34.02  31.10
      METEOR-Para-RC      68.08  67.03  61.09  50.07  44.16  38.35  33.19
      BLEU-Edit-RC        31.88  32.37  27.70  21.71  19.32  14.98  12.25
      BLEU-Para-RC        52.00  47.92  43.90  31.76  25.24  19.75  15.28
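The slide does not define “% Improve”, but the row is numerically consistent with the additional segments retrieved through paraphrasing expressed as a percentage of those retrieved by plain edit distance. A short check of that reading, using the figures from the slide:

```python
thresholds = [100, 95, 90, 85, 80, 75, 70]
edit_retrieved = [117, 127, 163, 215, 257, 337, 440]
para_retrieved = [16, 16, 22, 33, 49, 79, 102]


def percent_improve(para, edit):
    """Extra segments from paraphrasing as a percentage of the baseline count."""
    return round(100.0 * para / edit, 2)


improvement = {th: percent_improve(p, e)
               for th, p, e in zip(thresholds, para_retrieved, edit_retrieved)}
```

Under this reading the relative gain grows as the threshold is lowered, from roughly 13% at threshold 100 to about 23% at thresholds 75 and 70.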
  • 52. Results: Europarl dataset
      TH                  100    [85, 100)  [70, 85)
      Edit Retrieved      117    127        163
      +Para Retrieved     16     30         98
      % Improve           13.67  30.61      43.55
      Rank Change (RC)    9      14         55
      METEOR-Edit-RC      45.48  34.37      25.76
      METEOR-Para-RC      68.08  40.00      25.82
      BLEU-Edit-RC        31.88  13.18      6.85
      BLEU-Para-RC        52.00  17.10      8.37
  • 54. Dataset: Human Evaluation
      TH      100  [85, 100)  [70, 85)  Total
      Set1    2    6          6         14
      Set2    5    4          7         16
      Total   7    10         13        30
  • 55. Evaluations
      - Post-editing time
      - Keystrokes
      - Subjective evaluation, 2 options
          - A is better
          - B is better
      - Subjective evaluation, 3 options (one more added)
          - Both are equal
  • 56. Experimental Settings: Post-editing time and Keystrokes
      - Each file contains segments of both types (ED+PP)
      - Each file is post-edited by 5 translation students
          - German: Native
          - English: C1
  • 60. Results: Post-Editing Time
      [Bar chart: post-editing time in seconds, stacked by Set1 and Set2]
      Edit-Distance: 520.02 (Set1), 657.75 (Set2)
      Paraphrasing: 466.44 (Set1), 603.17 (Set2)
      9.18% time saved
  • 61. Results: Subjective Evaluation (Two Options, 17 Translators)
      [Bar chart: number of replies, stacked by Set1 and Set2]
      Edit-Distance is better: 66 (Set1), 110 (Set2)
      Paraphrasing is better: 172 (Set1), 162 (Set2)
  • 62. Results: Subjective Evaluation (Three Options, Seven Translators)
      [Bar chart: number of replies, stacked by Set1 and Set2]
      Edit-Distance is better: 12 (Set1), 26 (Set2)
      Paraphrasing is better: 46 (Set1), 53 (Set2)
      Both are equal: 40 (Set1), 33 (Set2)
  • 63. H-TER and H-METEOR
                   Set1                          Set2
                   Edit Distance  Paraphrasing   Edit Distance  Paraphrasing
      HMETEOR5     59.82          81.44          69.81          80.60
      HTER5        39.72          17.63          27.81          18.71
      HMETEOR10    59.82          81.44          69.81          80.61
      HTER10       36.93          18.46          27.26          18.40
  • 67. Segment-wise analysis
      - Statistical significance testing per segment
          - Welch's t-test (one-tailed, p < 0.05)
      - Paraphrasing (Keystrokes/Post-Editing Time):
          - Twelve segments are significantly better
          - For ten segments, all other evaluations also show them to be better
      - Edit-Distance (Keystrokes/Post-Editing Time):
          - Three segments are significantly better
          - Not all evaluations show them to be better
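For readers unfamiliar with Welch's t-test, the sketch below computes its statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The timing values are invented for illustration and are not data from the study; turning the statistic into a one-tailed p-value additionally needs the t-distribution's survival function, which is not in the standard library and is left out here.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples
    with possibly unequal variances."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical post-editing times (seconds) for one segment, five translators each.
edit_distance_times = [52.0, 48.0, 55.0, 50.0, 45.0]
paraphrasing_times = [40.0, 42.0, 38.0, 44.0, 36.0]
t_stat, dof = welch_t(edit_distance_times, paraphrasing_times)
```

A one-tailed test then asks whether `t_stat` is large enough, at `dof` degrees of freedom, to conclude that one matching method needs less time for that segment.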
  • 68. Conclusion
      - Presented an approach to include paraphrasing in TM matching and retrieval
      - Presented human evaluations
      - In future, we will use deep learning for TM matching and retrieval
  • 69. Related Publications
      - Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT-2014, Dubrovnik, Croatia.
      - Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT-2015, Antalya, Turkey.
      - Rohit Gupta, Hanna Bechara, Ismail El Maarouf, and Constantin Orasan. 2014a. UoW: NLP techniques developed at the University of Wolverhampton for Semantic Similarity and Textual Entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), COLING-2014, Dublin, Ireland.
      - Rohit Gupta, Hanna Bechara, and Constantin Orasan. 2014b. Intelligent Translation Memory Matching and Retrieval Metric Exploiting Linguistic Technology. In Proceedings of the Thirty-Sixth Conference on Translating and the Computer, London, UK.
  • 70. References
      - Jane Bradbury and Ismaïl El Maarouf. 2013. An empirical classification of verbs based on Semantic Types: the case of the 'poison' verbs. In Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora, pages 70–74.
      - Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.
      - Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014).
      - Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC 2014.
      - Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2012. DGT-TM: A freely available Translation Memory in 22 languages. In Proceedings of LREC, pages 454–459.