SPIDER is a system for paraphrasing in document editing and revision. It was designed to help with writing optimization, but its applicability extends to MT pre-editing.
SPIDER: a System for Paraphrasing - Applicability in Machine Translation Pre-Editing - Anabela Barreiro
1. SPIDER: A SYSTEM FOR PARAPHRASING
IN DOCUMENT EDITING AND REVISION
APPLICABILITY IN MACHINE TRANSLATION PRE-EDITING
Anabela Barreiro
ab@metatrad.com
CICLing 2011 February 20-26, 2011
Anabela Barreiro Tokyo, Japan
2. OUTLINE
INTRODUCTION
PARAPHRASES IN NLP
PARAPHRASES IN PEDAGOGICAL AND PROFESSIONAL CONTEXTS
SPIDER
FIRST STEPS
IMPORTANT FEATURES
PARAPHRASES COVERED BY SPIDER
INTERFACE
LINGUISTIC RESOURCES
EVALUATION RESULTS
THE FUTURE
FUTURE APPLICATIONS?
FUTURE RESEARCH
3. IMPORTANCE OF PARAPHRASES IN NLP TASKS
Question Answering
[Ibrahim et al., 2003], [Paşca, 2003], [Duboué & Chu-Carroll, 2006]
Information Extraction and Text Mining
[Ibrahim et al., 2003], [Shinyama et al., 2002], [Shinyama & Sekine, 2003], [Sekine, 2005], [Paşca, 2005], [Paşca & Dienes, 2005]
Summarization
[McKeown et al., 2002], [Barzilay, 2001, 2003], [Hirao et al., 2004], [Zhou et al., 2006b]
Natural Language Generation
[Iordanskaja et al. 1991]
Plagiarism Detection
[Potthast et al., 2010], [Vila et al., 2010]
Machine Translation
[Zhou et al., 2006], [Callison-Burch et al., 2006a, 2006b, 2007, 2008], [Barreiro, 2008, 2009, 2011]
4. THE PRACTICAL NEED FOR PARAPHRASES
IN PEDAGOGICAL CONTEXTS
Text Processing and Authoring Aids
Writing and revision of original/creative/customized texts
Learning Tools
Native and second language learning
Creation of clear and understandable text content
e.g. students learning language and writing skills
Style Editors
Uniformity / consistency of style
5. THE PRACTICAL NEED FOR PARAPHRASES
IN PROFESSIONAL CONTEXTS
Technical Writing
Professional high quality documentation and domain-specific texts
Controlled language
Linguistic Quality Assurance
Linguistic quality of generic texts and specialized documentation
Verification/validation of meaningful content
Text Optimization
Readable / publishable texts (business-oriented or purpose-oriented content)
Terminology
Search for the “exact” term or relevant keywords
Translation
Indispensable for human and machine translation (pre-editing and post-editing)
7. SPIDER PARAPHRASING SYSTEM
FIRST STEPS
Initially developed for Portuguese
1st version – ReEscreve
publicly available service at http://www.linguateca.pt/ReEscreve/
2nd version – eSPERTo (Portuguese: the smart/clever one; expert)
currently being integrated in a cyber school project within the scope of an
educational program
Writing exercises – students learning how to improve their writing skills in
the Portuguese language
English SPIDER
prototype to assist writing of domain-specific texts
8. SPIDER
IMPORTANT FEATURES
Applies linguistic knowledge to recognize and generate paraphrases
automatically, preserving the semantics and grammaticality (inflectional
features) of the source text in the suggestions provided, including
transformations of multi-word units
Uses text-editing mechanisms which provide a variety of alternatives for
each expression and the possibility to choose among them (according to
personal preferences, style, idiomaticity, etc.)
Allows users to suggest new expressions that can be immediately applied
to their text, making the text editing process easier, more flexible, and
upgradable
Designed to help with writing optimization, understandability and
translatability (improving the quality of the source text so that it has
a positive impact on translation)
9. PARAPHRASES COVERED BY SPIDER
Synonyms in context (ex: phrasal verbs into equivalent expressions)
to clear up (weather) = (weather) to become better/brighter
Support verb constructions into single verbs and stylistic variants
to make a decision = to decide; to make an audit = to perform an audit
Aspectual constructions into single verbs
to launch an attack = to attack
Adverbials (compounds into single adverbs)
in a constructive way = constructively
Relatives into participial adjectives
the president that was elected = the president elect
Relatives into possessives
the role that Europe plays/has = the role of Europe
Relatives into compound nouns (and vice-versa)
a container for the milk = a milk container; a bottle made of plastic = a plastic bottle
Agentive passives into actives
the man was released by the police officer = the police officer released the man
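As a rough illustration of the support verb construction case above, the following minimal sketch rewrites "support verb + determiner + predicate noun" into a single verb using a small lookup table. It is not SPIDER's implementation, which relies on NooJ dictionaries and grammars; the names SVC_TABLE and paraphrase_svc are hypothetical, and the sketch handles only uninflected verb forms.

```python
import re

# Hypothetical lookup table: (support verb, predicate noun) -> single verb.
# The two pairs are taken from the examples on this slide.
SVC_TABLE = {
    ("make", "decision"): "decide",
    ("launch", "attack"): "attack",
}

# Build one pattern from the table: verb, a determiner, then the noun.
_VERBS = "|".join(sorted({verb for verb, _ in SVC_TABLE}))
_NOUNS = "|".join(sorted({noun for _, noun in SVC_TABLE}))
_SVC = re.compile(rf"\b({_VERBS})\s+(?:a|an|the)\s+({_NOUNS})\b")


def paraphrase_svc(text: str) -> str:
    """Replace known support verb constructions with single verbs."""
    def replace(match: re.Match) -> str:
        key = (match.group(1), match.group(2))
        # Unknown verb/noun combinations are left untouched.
        return SVC_TABLE.get(key, match.group(0))
    return _SVC.sub(replace, text)


print(paraphrase_svc("They need to make a decision today."))
print(paraphrase_svc("The army is about to launch an attack."))
```

A real system would also have to carry tense and agreement over to the single verb (made a decision > decided), which this sketch deliberately leaves out.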
10. INTERFACE
SUGGESTIONS FOR EXAMPLE SENTENCES
Suggestions for general language
linguistic phenomena
Compound adverbs >
single adverbs
Relatives >
participial adjectives
Support verb constructions >
single verbs
11. INTERFACE
SELECTION OF PARAPHRASING GRAMMARS FOR SPECIFIC
LINGUISTIC PHENOMENA
Users can select among general and technical dictionaries (more than one
selection allowed), grammars for specific linguistic transformations (one, several
or all grammars can be selected). The interface provides sample texts for testing.
Informative details about the
linguistic resources selected
Sample LEGAL text
12. INTERFACE
SELECTION OF A DOMAIN DICTIONARY
Identification of legal terms in the text
Suggestions for the term “breach of law”
Users can select one term from the list of suggestions or provide a new suggestion
13. INTERFACE
SUGGESTIONS PROVIDED AND USER’S CAPABILITY TO ADD NEW REWRITING
OPTIONS
The user can suggest new words or
expressions (synonyms or paraphrases)
It is possible to go back and change the user
option as many times as necessary
Text rewritten
• In red, the expressions in the source text
• In green, suggestions provided by SPIDER and selected by the user
14. LINGUISTIC RESOURCES
Eng4NooJ – linguistic knowledge system
• OpenLogos dictionary (http://logos-os.dfki.de/)
• converted into NooJ format, and enhanced with new
properties, including derivational and morpho-syntactic
and semantic relations
• Morphological system
• Contextual rules and grammars
• Domain specific dictionary (sample “legal terms”)
15. LINGUISTIC RESOURCES
General language dictionary entries
impress,V+FLX=POLISH+SAL=PVPCpleasetype+PT=impressionar+DRV=NDRV01:BOOK+VSUP=make+VSUP=cause+NPREP=on
aesthetic,AFLX=NATURAL+SAL=AVstate+PT=aesthetically+DRV=AVDRV03
skepticism,N+FLX=BOOK+SAL=ABcause+PT=cepticismo+DRV=NAVDRV02
(entries encode morpho-syntactic and semantic relations)
Rules to transform morpho-syntactically and semantically related words of different parts of speech
NDRV04 = <B>ion/Npred+Nom
ADRV02 = <B>icable
AVDRV01 = <E>ly/ADV
AVDRV04 = <B>tically/ADV
Grammar to recognize adverbial compounds and transform them into equivalent single adverbs
Contextual rules
Rules to improve precision in specific contexts
[bring(vt) N(charge; action) > present(vt) N(idem)]
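The dictionary entries and derivation rules on this slide can be read mechanically. The sketch below is an assumption-laden illustration, not part of Eng4NooJ: parse_entry splits an entry into lemma, category and feature lists, and apply_rule interprets <B> as "delete the last character" and <E> as "the empty string", which is how these operators are generally described for NooJ. Both function names are hypothetical, and the test words (create, apply, quick, chaos) are chosen for illustration rather than taken from the slides.

```python
from collections import defaultdict


def parse_entry(line: str):
    """Split 'lemma,CAT+KEY=VALUE+...' into (lemma, category, features)."""
    lemma, _, info = line.partition(",")
    parts = info.split("+")
    category = parts[0]
    features = defaultdict(list)  # a key such as VSUP may occur more than once
    for part in parts[1:]:
        key, _, value = part.partition("=")
        features[key].append(value)
    return lemma, category, dict(features)


def apply_rule(word: str, rule: str) -> str:
    """Apply a derivation rule such as '<B>ion/Npred+Nom' to a word."""
    commands = rule.split("/")[0]  # drop the output tag after the slash
    result = word
    i = 0
    while i < len(commands):
        if commands.startswith("<B>", i):    # assumed: backspace operator
            result = result[:-1]
            i += 3
        elif commands.startswith("<E>", i):  # assumed: empty string, no change
            i += 3
        else:                                # literal character to append
            result += commands[i]
            i += 1
    return result


entry = ("impress,V+FLX=POLISH+SAL=PVPCpleasetype+PT=impressionar"
         "+DRV=NDRV01:BOOK+VSUP=make+VSUP=cause+NPREP=on")
lemma, category, features = parse_entry(entry)
print(lemma, category, features["VSUP"])
print(apply_rule("create", "<B>ion/Npred+Nom"))
print(apply_rule("quick", "<E>ly/ADV"))
```

Under this reading, the VSUP and NPREP features of "impress" are what link the verb to support verb constructions such as "make an impression on".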
16. LINGUISTIC RESOURCES
Sample of terms classified
as Information +
Instructional/legal
17. EVALUATION RESULTS: PARAPHRASING
PRECISION
Corpus: 500 sentences
100 sentences for each of 5 elementary support verbs
Support verb   SVC recognition precision   SVC recognition recall   SVC paraphrasing precision
Pôr            73/73 (100%)                73/100 (73%)             72/73 (98.6%)
Tomar          75/75 (100%)                75/100 (75%)             68/73 (93.1%)
Ter            65/65 (100%)                65/100 (65%)             59/65 (90.7%)
Dar            57/60 (95%)                 57/100 (57%)             46/51 (90.1%)
Fazer          43/45 (95.5%)               43/100 (43%)             40/45 (88.8%)
Average        62.6/63.6 (98.4%)           62.6/100 (62.6%)         57/61 (93.4%)
Evaluation of recognition and paraphrasing
of support verb constructions
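The recognition figures above follow the usual definitions: precision is correct recognitions over all recognitions, and recall is correct recognitions over the 100 sentences per verb. The short sketch below recomputes the averages from the per-verb counts on this slide; the variable and function names are illustrative only.

```python
# Per-verb recognition counts from the slide:
# (correctly recognized, total recognized, sentences in the corpus)
RECOGNITION = {
    "Pôr": (73, 73, 100),
    "Tomar": (75, 75, 100),
    "Ter": (65, 65, 100),
    "Dar": (57, 60, 100),
    "Fazer": (43, 45, 100),
}


def percent(numerator: float, denominator: float) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


correct = sum(c for c, _, _ in RECOGNITION.values())
recognized = sum(r for _, r, _ in RECOGNITION.values())
sentences = sum(s for _, _, s in RECOGNITION.values())

avg_precision = percent(correct, recognized)  # 313 / 318
avg_recall = percent(correct, sentences)      # 313 / 500

for verb, (c, r, s) in RECOGNITION.items():
    print(f"{verb}: precision {percent(c, r):.1f}%, recall {percent(c, s):.1f}%")
print(f"Average: precision {avg_precision:.1f}%, recall {avg_recall:.1f}%")
```

Dividing the summed counts by five gives the 62.6 and 63.6 shown in the Average row, so the reported averages are pooled over all 500 sentences rather than being a mean of the per-verb percentages.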
18. EVALUATION RESULTS: IMPACT ON
TRANSLATABILITY (MT)
Same corpus, 50 sentences selected randomly
(i) support verb constructions were automatically pre-processed with SPIDER and
converted into equivalent single verbs
(ii) the pre-processed sentences (automatically generated paraphrases) and the original
sentences were submitted to MT, and the output translations for both were compared
• 29 (58%) of the best translations were of automatically generated paraphrases
• 9 (18%) were of support verb constructions
• 12 (24%) were equally bad or equally good
CONCLUSION
The experiment indicates that paraphrases such as those generated by SPIDER help
improve translation scores
• The automated paraphrasing of support verb constructions through SPIDER
allowed a significant improvement of the quality of the MT results in that context
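The comparison procedure in steps (i) and (ii) can be sketched as a small loop. This is only an outline under stated assumptions: paraphrase, translate and judge are hypothetical callables standing in for SPIDER, the MT system and the human comparison, none of which is specified in the slides.

```python
from collections import Counter
from typing import Callable, Iterable


def compare_pre_editing(
    sentences: Iterable[str],
    paraphrase: Callable[[str], str],
    translate: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> Counter:
    """Tally which input (original or paraphrase) yields the better translation.

    judge(original_mt, paraphrase_mt) is assumed to return one of
    "original", "paraphrase" or "tie".
    """
    tally = Counter()
    for sentence in sentences:
        pre_edited = paraphrase(sentence)
        verdict = judge(translate(sentence), translate(pre_edited))
        tally[verdict] += 1
    return tally


def shares(tally: Counter) -> dict:
    """Convert counts to whole-number percentages of the total."""
    total = sum(tally.values())
    return {label: round(100 * count / total) for label, count in tally.items()}


# Counts reported on this slide for the 50 sampled sentences.
reported = Counter(paraphrase=29, original=9, tie=12)
print(shares(reported))
```

Applied to the reported counts, shares reproduces the 58%, 18% and 24% figures listed above.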
20. FUTURE APPLICATIONS?
• Writing / authoring aid (word processing applications)
• Language composition tool - general and technical language (e.g. student texts or legal
texts)
• Text production and style editor
• Terminology verification tool - professional use of terminology in technical domains
(elimination of informal, idiomatic, slang use of language)
• Empirical testbed for linguistic quality assurance (source and target texts)
• Text editing (machine translation pre-editing and post-editing) and translation aid
• Controlled language tool
• Consistent, direct, and simple language
• Restricted grammar (avoid certain types of construction)
• Avoid complex reasoning, figures of speech, metaphors, etc.
• Elimination of wordiness
• “Revision memory” tool (≈ “translation memory”) - recycling of validated reviewed
sentences, structures or phrases
21. FUTURE RESEARCH
FROM SPIDER TO MACHINE TRANSLATION
a fazer um estágio para dar aulas de / tutor Religião
a fazer um estágio para dar aulas de / lecture Religião
a fazer um estágio para dar aulas de / teach Religião
começa a dar exemplos / exemplify :
sentia-se capaz de dar um murro em / punch quem quisesse detê-lo
gostávamos de lhe dar uma palavrinha / speak .