In this paper, we target the extraction of whole-part relations involving human entities and body-part nouns in SYSTEM, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. A whole-part relation is a semantic relation in which one entity is perceived as a constituent part of another entity, or as a member of a set.
Body-Part Nouns and Whole-Part Relations in Portuguese
1. PROPOR2014 - Intl. Conference on Computational Processing of Portuguese
October 6-8, 2014, ICMC, São Carlos, SP, Brazil
Body part nouns and Whole-Part Relations
in Portuguese
Ilia Markov [1,2,3], Nuno Mamede [2,3], Jorge Baptista [1,2,3]
[1] U. Algarve/CECL  [2] U. Lisboa/IST  [3] INESC-ID Lisboa/L2F
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology
from seed
L2 F - Spoken Language Systems Laboratory 1
2. Objectives
• Improve the automatic extraction of semantic relations
between textual elements in an existing NLP system,
STRING
• Part-whole relations (meronymy)
• Human body-part nouns (Nbp)
O Pedro partiu o braço
‘Pedro broke the arm’
WHOLE-PART(Pedro,braço)
3. Objectives (cont.)
• Development of a rule-based meronymy detection
module for Human-Nbp relations
• Implementation in STRING (Mamede et al., 2012)
STRING: a hybrid, statistical and rule-based, Natural
Language Processing (NLP) system for Portuguese
string.l2f.inesc-id.pt
4. Motivation
Semantic relations are a device for structuring texts:
they contribute to the cohesion and coherence of a text.
Automatic extraction of semantic relations is useful for
some NLP tasks:
• Anaphora Resolution
O Pedro lavou a cara
‘Pedro washed the face’
WHOLE-PART(Pedro,cara)
O Pedro lavou a sua cara
‘Pedro washed his face’
WHOLE-PART(sua,cara) & ANTECEDENT(?,sua)
5. Motivation (cont.)
• Semantic Role Labeling
O Pedro partiu um braço
‘Pedro broke an arm’
WHOLE-PART(Pedro,braço)
➢ Pedro is an experiencer.
O Pedro partiu o braço do João
‘Pedro broke João’s arm’
WHOLE-PART(João,braço)
➢ Pedro is an agent.
6. Motivation (cont.)
• Opinion mining
É um bom hotel: o quarto era limpo, as camas eram feitas
de lavado todos os dias, e os pequenos-almoços eram
opíparos
‘It is a nice hotel: the room was clean, the beds (bed
sheets) were changed every day, and the breakfasts were
sumptuous’
7. Related Work
In NLP, various information extraction techniques have
been developed in order to capture part-whole relations
from texts:
• Hearst, 1992
Lexico-syntactic patterns to capture hyponymic (type-of) relations
• Girju et al., 2003, 2006
The method semi-automatically identifies patterns that encode part-whole
relations and learns automatically the classification rules
needed for the extraction of part-whole relations from these
patterns. The authors report an overall average precision of 80.95%
and recall of 75.91%.
8. Related Work (cont.)
• Van Hage et al., 2006
A method for learning part-whole relations from vocabularies and
text sources; the authors were able to acquire 503 part-whole pairs
from the AGROVOC Thesaurus to learn 91 reliable part-whole
patterns.
• Pantel and Pennacchiotti, 2006
The Espresso algorithm: takes as input a few seed instances of a
particular relation and learns surface patterns to extract more
instances. The algorithm obtains a precision of 80%.
9. Related Work (cont.)
• Lexical ontologies for Portuguese:
- WordNet.PT
- PAPEL
- Onto.PT
• Parsers of Portuguese:
- The PALAVRAS parser (Bick, 2000), using
the Visual Interactive Syntax Learning (VISL) environment;
- LX Semantic Role Labeler (Branco & Costa, 2010).
10. Dependency Rule in STRING
O Pedro partiu o braço do João
‘Pedro broke João’s arm’
IF( MOD[POST](#2[UMB-Anatomical-human],#1[human]) &
    PREPD(#1,?[lemma:de]) &
    CDIR[POST](#3,#2) &
    ~WHOLE-PART(#1,#2)
)
    WHOLE-PART(#1,#2)
WHOLE-PART(João,braço)
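The logic of this rule can be paraphrased outside the parser. The following is a minimal Python sketch, not STRING's implementation: the flat `(name, head, dependent)` tuple encoding, the function name `whole_part_from_de_phrase`, and the `human` and `body_parts` sets are all assumptions made for illustration.

```python
# Dependencies are encoded as (name, head, dependent) tuples.
# This flat encoding is an illustrative assumption, not STRING's format.
def whole_part_from_de_phrase(deps, human, body_parts):
    """Mimic the rule: a body-part noun that is a direct object and is
    post-modified by a human noun introduced by 'de' yields WHOLE-PART."""
    found = []
    for name, part, whole in deps:
        if name != "MOD_POST":
            continue
        if part not in body_parts or whole not in human:
            continue
        # PREPD(#1, ?[lemma:de]): the human noun is introduced by 'de'
        has_de = ("PREPD", whole, "de") in deps
        # CDIR[POST](#3, #2): the body part is the direct object of some verb
        is_direct_object = any(n == "CDIR_POST" and d == part for n, _, d in deps)
        # ~WHOLE-PART(#1, #2): do not extract the same relation twice
        already = ("WHOLE-PART", whole, part) in deps or (whole, part) in found
        if has_de and is_direct_object and not already:
            found.append((whole, part))
    return found


# O Pedro partiu o braço do João
deps = [
    ("SUBJ_PRE", "partiu", "Pedro"),
    ("CDIR_POST", "partiu", "braço"),
    ("MOD_POST", "braço", "João"),
    ("PREPD", "João", "de"),
]
relations = whole_part_from_de_phrase(
    deps, human={"Pedro", "João"}, body_parts={"braço"}
)
print(relations)
```

Under these assumptions the only pair extracted is (João, braço); Pedro is not linked to braço because the body part already has an explicit human modifier.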
11. Fixed Phrases and Frozen Sentences
involving Nbp
‣400 semi-automatically crafted rules,
based on available lexicon-grammar of European Portuguese idioms
12. Other phenomena
• DET=um and bilateral symmetry
O Pedro partiu um braço
‘Pedro broke an arm’
• relations between 2 Nbp
A Ana pinta as unhas dos pés
‘Ana paints the nails of the feet’
• part-of Nbp
O Pedro tocou com a ponta da língua no gelado
‘Pedro touched with the tip of the tongue on the ice cream’
• “hidden” Nbp with disease nouns
O Pedro tem uma gastrite (estômago)
‘Pedro has gastritis (stomach)’
13. Evaluation
• First fragment of the CETEMPúblico corpus (Rocha & Santos,
2000): 14.7 M tokens; 6.3 M simple words; and 300 K sentences.
• Using a Nbp lexicon (151 lemmas); 16,746 sentences with Nbp
were extracted.
• A random stratified sample of 1,000 sentences with Nbp,
keeping the proportion of their total frequency in the source
corpus.
• Divided between 4 annotators – 4 subsets of 225 sentences
each, with a common set of 100 sentences to assess inter-annotator
agreement.
‣WHOLE-PART, FIXED, nothing
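The sampling and splitting steps above can be sketched as follows. This is only an illustration on toy data: the function names, the per-lemma dictionary layout, and the toy lemma counts are assumptions, and proportional quotas are rounded, so on real data the total may differ slightly from the requested size.

```python
import random


def stratified_sample(sentences_by_lemma, sample_size, seed=0):
    """Sample so that each lemma keeps roughly its share of the source."""
    rng = random.Random(seed)
    total = sum(len(sents) for sents in sentences_by_lemma.values())
    sample = []
    for sents in sentences_by_lemma.values():
        # Proportional quota for this lemma (rounded to a whole sentence count)
        quota = round(sample_size * len(sents) / total)
        sample.extend(rng.sample(sents, min(quota, len(sents))))
    rng.shuffle(sample)
    return sample


def split_for_annotators(sample, n_annotators=4, common_size=100):
    """Reserve a common set, then give each annotator an equal share."""
    common = sample[:common_size]
    rest = sample[common_size:]
    size = len(rest) // n_annotators
    subsets = [rest[i * size:(i + 1) * size] for i in range(n_annotators)]
    return common, subsets


# Toy corpus: three lemmas with invented frequencies (hypothetical data)
toy = {
    "braço": [f"braço-{i}" for i in range(6000)],
    "cara": [f"cara-{i}" for i in range(3000)],
    "língua": [f"língua-{i}" for i in range(1000)],
}
sample = stratified_sample(toy, 1000)
common, subsets = split_for_annotators(sample)
print(len(sample), len(common), [len(s) for s in subsets])
```

With 1,000 sentences, the split matches the figures on the slide: 100 common sentences plus 4 × 225 = 900 individual ones.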
14. Inter-annotator Agreement
Average Pairwise Percent Agreement
Fleiss’ Kappa
Average Pairwise Cohen’s Kappa
http://dfreelon.org/utils/recalfront/recal3/
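For readers who want to see what these three measures compute, here is a small pure-Python sketch using their standard textbook definitions. It is not the tool linked above; the function names and the toy ratings are assumptions, and each item is assumed to carry exactly one label per annotator.

```python
from collections import Counter
from itertools import combinations


def pairwise_percent_agreement(ratings):
    """ratings: list of items, each a list with one label per annotator."""
    pairs = list(combinations(range(len(ratings[0])), 2))
    scores = [
        sum(item[a] == item[b] for item in ratings) / len(ratings)
        for a, b in pairs
    ]
    return sum(scores) / len(scores)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def average_pairwise_cohen_kappa(ratings):
    n_raters = len(ratings[0])
    cols = [[item[r] for item in ratings] for r in range(n_raters)]
    kappas = [
        cohen_kappa(cols[a], cols[b])
        for a, b in combinations(range(n_raters), 2)
    ]
    return sum(kappas) / len(kappas)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of annotators per item."""
    n_items, n_raters = len(ratings), len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing annotator pairs for this item
        agreeing = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Toy ratings by 4 annotators (invented labels, for illustration only)
toy_ratings = [
    ["WHOLE-PART"] * 4,
    ["FIXED", "FIXED", "FIXED", "nothing"],
    ["nothing"] * 4,
]
print(pairwise_percent_agreement(toy_ratings))
print(average_pairwise_cohen_kappa(toy_ratings))
print(fleiss_kappa(toy_ratings))
```

The kappa variants correct raw agreement for the agreement expected by chance, which matters here because the three labels are unlikely to be equally frequent.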
15. Results
(1st evaluation)
System’s performance for Nbp
16. Error Analysis
false-positives
• Disambiguation of Nbp in context
- língua ‘tongue/language’
- língua portuguesa ‘Portuguese language’
- língua de Camões ‘language of Camões’
• New idioms have been encoded in the lexicon
- abrir o coração a ‘to open one’s heart to sb.’
- fazer face a ‘to face sth./to deal with’
• Nbp used figuratively
Além disso, a nova face desta Igreja chilena…
‘Moreover, the new face of this Chilean Church…’
17. Error Analysis
false-negatives
• The whole and the part are not syntactically related and may
be quite far away from each other:
O facto do corpo ter sido encontrado na cozinha, leva os bombeiros a
suspeitar que a vítima, com graves problemas de saúde, tenha
desmaiado e caído à lareira, o que poderá ter estado na origem do
incêndio.
‘The fact that the body was found in the kitchen leads the firefighters to suspect
that the victim, who had serious health problems, fainted and fell into the fireplace,
which may have been the origin of the fire.’
WHOLE-PART(vítima,corpo)
18. Error Analysis
false-negatives (cont.)
• Some human nouns and all pronouns (including personal,
relative and demonstrative) are unmarked with the human
feature (even if anaphora resolution performs well);
Segundo o responsável do hospital, o doente – que também sofreu
graves ferimentos na cabeça – poderia ser ainda sujeito a uma segunda
intervenção cirúrgica
‘According to the head of the hospital, the patient - who also suffered
serious head injuries – could still be subjected to a second surgical
intervention’
ANTECEDENT(doente,que)
WHOLE-PART(que,cabeça)
‣inheritance of features and relative placing of AR and WP
modules within STRING architecture
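One way to picture the interaction between anaphora resolution (AR) and the whole-part (WP) module is the sketch below. It is an assumption-laden illustration, not a description of STRING: it assumes AR runs first, keys words by their surface form, and uses the hypothetical helpers `inherit_human` and `resolve_wholes`.

```python
def inherit_human(human, antecedent):
    """Let anaphors inherit the human feature from their antecedents.

    human: set of words already marked as human.
    antecedent: dict mapping an anaphor to its antecedent.
    """
    inherited = set(human)
    changed = True
    while changed:  # repeat so that chains of anaphors are covered
        changed = False
        for anaphor, ante in antecedent.items():
            if ante in inherited and anaphor not in inherited:
                inherited.add(anaphor)
                changed = True
    return inherited


def resolve_wholes(whole_part, antecedent):
    """Replace a pronominal whole by its antecedent, following chains."""
    resolved = []
    for whole, part in whole_part:
        seen = set()
        while whole in antecedent and whole not in seen:
            seen.add(whole)  # guard against cyclic links
            whole = antecedent[whole]
        resolved.append((whole, part))
    return resolved


# Relative-pronoun example from this slide: 'que' refers back to 'doente'
antecedent = {"que": "doente"}
human = inherit_human({"doente"}, antecedent)
resolved = resolve_wholes([("que", "cabeça")], antecedent)
print(sorted(human), resolved)
```

With the feature inherited, a rule that requires a human whole can fire on the pronoun, and the resulting relation can then be rewritten in terms of the antecedent.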
19. Error Analysis
false-negatives (cont.)
• A modifier of a noun or an adjective (and not a verb):
Um mágico com um barrete (enfiado) na cabeça
‘A magician with a hat (stuck) in the head’
WHOLE-PART(mágico,cabeça)
20. Results
(2nd evaluation)
System’s performance for Nbp
+0.13 +0.11 +0.12
21. Thank you!
echo "O Pedro penteou o cabelo do filho com os dedos" | xip/string.sh
TOP
+------------+----------+----------------+-------------------+
| | | | |
NP VF NP PP PP
+-------+ + +-------+ +----+-------+ +----+-------+
| | | | | | | | | | |
ART NOUN VERB ART NOUN PREP ART NOUN PREP ART NOUN
+ +- +- +- + + + +- +- + +-
| | | | | | | | | | |
O Pedro penteou o cabelo de o filho com os dedos
MAIN(penteou)
MOD_POST(cabelo,filho)
MOD_POST(penteou,dedos)
SUBJ_PRE(penteou,Pedro)
CDIR_POST(penteou,cabelo)
WHOLE-PART(filho,cabelo)
WHOLE-PART(Pedro,dedos)
0>TOP{NP{O Pedro} VF{penteou} NP{o cabelo} PP{de o filho} PP{com os dedos}}
string.l2f.inesc-id.pt
Questions please!
22. References
Berland, M. and Charniak, E. 1999. Finding parts in very large corpora. In Proceedings
of the 37th annual meeting of the Association for Computational Linguistics on
Computational Linguistics, pages 57–64. Morristown, NJ, USA. Association for
Computational Linguistics.
Bick, E. 2000. The Parsing System "Palavras": Automatic Grammatical Analysis of
Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University.
Aarhus, Denmark: Aarhus University Press. November 2000.
Branco, A. and Costa, F. 2010. A Deep Linguistic Processing Grammar for Portuguese.
In Pardo et al. (eds.), Computational Processing of Portuguese, LNAI 6001, Springer,
pp. 86–89.
Girju, R., Badulescu, A., and Moldovan, D. 2006. Automatic discovery of part-whole
relations. Computational Linguistics, 32(1):83–135.
Nascimento, M., Veloso, R., Marrafa, P., Pereira, L., Ribeiro, R., and Wittmann, L. 1998.
LE-PAROLE: do Corpus à Modelização da Informação Lexical num Sistema-multifunção.
Actas do XIII Encontro Nacional da Associação Portuguesa de
Linguística, 2:115–134.
Mamede, N., Baptista, J., Diniz, C. and Cabarrão, V. 2012. STRING: An hybrid statistical
and rule-based natural language processing chain for Portuguese.
http://www.propor2012.org/demos/DemoSTRING.pdf
23. References (cont.)
Pantel, P. and Pennacchiotti, M. 2006. Espresso: Leveraging generic patterns for
automatically harvesting semantic relations. In Proceedings of Conference on
Computational Linguistics / Association for Computational Linguistics (COLING/
ACL-06), pages 113–120. Sydney, Australia.
Rocha, P. and Santos, D. 2000. "CETEMPúblico: Um corpus de grandes dimensões de
linguagem jornalística portuguesa". In Maria das Graças Volpe Nunes (ed.), V
Encontro para o processamento computacional da língua portuguesa escrita e falada
(PROPOR 2000) (São Paulo, Brasil, 19-22 de Novembro de 2000), São Paulo:
ICMC/USP, pp. 131-140.
Widlöcher, A. and Mathet, Y. 2012. The Glozz Platform: a Corpus Annotation and Mining
Tool. In Proceedings of the 2012 ACM Symposium
on Document Engineering, DocEng ’12, pages 171–180, Paris, France. Telecom
ParisTech, ACM.
Winston, M., Chaffin, R. and Herrmann, D. 1987. A Taxonomy of Part-Whole Relations.
Cognitive Science, 11:417–444.