This paper describes a methodology for automatically identifying proverbs and their variants in running text. The methodology starts from existing compilations of proverbs, exploits the regular syntactic structures that most proverbs present, and intersects those structures with the lexical units of the proverbs. Based on these syntactic regularities, the data were divided into 13 classes. Finite-state automata are used to represent the regular patterns found in each class. Tested on a Brazilian Portuguese journalistic corpus, the method reached a precision of 74.68%.
Rassi et-al propor-2014
1. AUTOMATIC IDENTIFICATION OF PROVERB VARIANTS:
AN EXPERIMENT WITH BRAZILIAN PORTUGUESE
Amanda Rassi
Jorge Baptista
Oto Vale
PROPOR 2014 - International Conference
on Computational Processing of Portuguese
October 6-9, 2014 USP-São Carlos, SP, Brazil
2. Proverbs
• Definition
• a type of multiword expression (micro-texts)
• special citation status
• express atemporal truths
• combinatorial and lexical constraints
• sentences syntactically identical to ordinary sentences
• common lexicon
• Delimitation
• Proverbs ≠ frozen sentences (or idioms)
• Proverbs: subject position necessarily filled
by a fixed element vs. Idioms: subject position
distributionally free (in most cases)
3. Goals
• Automatically detect proverbs in texts, even when they are
not introduced by any linguistic “quoting” device, such as:
Como dizem ‘as they say’
Como dizia minha avó ‘as my grandmother used to say’
Dizem por aí ‘people say/they say’
Costuma dizer-se ‘it is often said’
• Identify the variants of proverbs,
considering both formal and lexical variations.
4. Related work
• For French and Italian: Conenna (1998, 2000, 2004) and Lacavalla (2007)
• For French and Spanish: Brotons (2008)
• For European Portuguese (EP): Chacoto (2006, 2007, 2008)
• For Brazilian Portuguese (BP): no formal description
5. Motivation
• though relatively rare, proverbs are “islands of
meaning” in texts (citation status)
• often difficult to spot:
• they lack formal marks
• they show formal and lexical variation
• they are often involved in wordplay
• their discursive function is complex (entailment)
• their relation with other textual elements is disturbed
(no coreference)
6. Methods
• create a database with proverbs;
• define syntactic criteria to organize the collected
proverbs into similar formal classes;
• organize the elements according to POS;
• produce tables of core elements;
with Unitex 3.1 (Paumier 2003, 2014)
• create reference graphs with the basic syntactic
structures for each class;
• intersect the graphs with the tables of the proverbs’
core elements to produce finite-state transducers,
which can then be applied to texts.
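The last two steps can be pictured with a small sketch. It is not Unitex code: Python's `re` module stands in for the finite-state transducers, and the template, the table rows, the identifiers and the names `fill_template` and `identify` are all assumptions made for illustration.

```python
import re

# Assumed reference structure for one class: "Quem V1 (x) N1 V2 (x) N2",
# where (x) is at most one free word (determiner or contracted preposition).
FREE_WORD = r"(?:\w+\s+)?"

def fill_template(v1, n1, v2, n2):
    """Combine the class template with one row of core elements."""
    pattern = rf"\bquem\s+{v1}\s+{FREE_WORD}{n1}\s+{v2}\s+{FREE_WORD}{n2}\b"
    return re.compile(pattern, re.IGNORECASE)

# Hypothetical table of core elements; the keys are invented identifiers.
CORE_ELEMENTS = {
    "conto_ponto": ("conta", "conto", "aumenta", "ponto"),
    "vento_assento": ("sai", "vento", "perde", "assento"),
}

# One compiled pattern per proverb, produced from template + table row.
PATTERNS = {key: fill_template(*row) for key, row in CORE_ELEMENTS.items()}

def identify(text):
    """Return the identifiers of all proverbs whose pattern occurs in the text."""
    return [key for key, pattern in PATTERNS.items() if pattern.search(text)]

print(identify("Como dizem, quem conta um conto aumenta um ponto."))
print(identify("Quem conta um conto aumenta o ponto"))
```

Both calls report the first proverb: because the determiner slots are left open, the second sentence is picked up as a variant even though it is not listed in the table.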
7. Collection of proverbs
• 5 different sources:
• list of proverbs in Wikipedia
• grand book of proverbs (Teixeira, 1942)
• 1001 proverbs (Steinberg, 1985)
• book of proverbs (Pinto, 2003)
• dictionary of proverbs (Magalhães Jr., 1974)
• Original list of 3,502 proverbs (and their variants)
• Final list of 594 proverbs (types or base-forms)
8. Classification criteria
• number of verb phrases/clauses (P1, P2 and P3)
• in P1
• impersonal constructions
• the verb is a copula verb
• obligatory negation (Neg)
• obligatory fronting of PP verb complement
• in P2
• comparatives
• coordinate/subordinate clauses
• verbless coordinated phrases
• obligatory fronting of 2nd verb phrase
• in P3 (no subclasses)
9. Formal classes
Proverbs that did not fit in any of the categories above were added to a
residual class. Table 1 shows the breakdown of the proverbs (base-forms) per
class.
Table 1. Formal Classification of Brazilian Portuguese Proverbs
Class     Structure           Example (approximate translation)                 Count
P1F1      Ø V w               Não há parto sem dor                                 20
          (impersonal)        ‘There is no painless childbirth’
P1F2      N0 Vcop Adj/N w     O silêncio é de ouro                                 53
                              ‘Silence is golden’
P1F3      N0 V w              Uma mão lava a outra                                 80
                              ‘One hand washes the other’
P1F4      N0 Neg V w          Cão que ladra não morde                              53
                              ‘A barking dog seldom bites’
P1F5      Prep Ni N0 V w      Em terra de cego, quem tem um olho é rei             45
                              ‘In the land of the blind, the one-eyed is king’
P2F1      F1 Conjs-comp F2    Antes só que mal acompanhado                         39
          (comparatives)      ‘Better alone than in bad company’
P2F2      F1 Conjc F2         Aqui se faz e aqui se paga                           71
          (coordinated)       ‘What goes around comes around’
P2F3      NP1, NP2            Cada cabeça, uma sentença                            48
                              ‘Each head its sentence’
P2F4      Qu- F1 F2           Quem ri por último ri melhor                         90
          (subordinated)      ‘Who laughs last laughs best’
P2F5      F1 Conjs F2         Pense duas vezes antes de agir                       20
          (subordinated)      ‘Look before you leap’
P2F6      Conjs F2, F1        Quando o gato sai de casa, os ratos fazem festa      28
          (fronted subord.)   ‘When the cat’s away, the mice will play’
P3        F1, F2, F3          Mãos frias, coração quente, amor ardente             24
                              ‘Cold hands, warm heart, burning love’
Residual  not specified       Comer e coçar é só começar                           43
                              ‘To keep eating and scratching, just start’
Total                                                                             614
10. Core elements
• Noun phrases (NP), subject (N0) or complement (N1):
• noun (N) or pronoun (PRO)
• adjective (Adj)
• optional determiners (Det) or modifiers (Mod)
• Verbal phrases (VP):
• main verb (V)
• optional auxiliaries (Aux)
• adverbial modifiers (Mod)
11. Graphs and Transducers
Quem conta um conto aumenta um ponto
‘Who tells a tale adds a point’
[Figure: example of a reference graph for the P2F4 class]
[Figure: example of a finite-state transducer for proverb 0023 in the P2F4 class]
14. Error analysis
• Specific subsets in P2F4 class:
Quem <MOT>* V <MOT>* V    e.g. Quem tem boca vai a Roma
Quem <V> <V>              e.g. Quem cala consente
• Constraints on V tense:
Quem(<V:P3s>+<V:J3s>+<V:W>)(<V:P3s>+<V:F3s>+<V:W>)
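The effect of the tense constraint can be illustrated with a toy matcher. The two tag sets are copied from the pattern above; everything else (the mini-lexicon, its tag assignments and the function name `matches_quem_v_v`) is invented for illustration, whereas a real run would rely on the Unitex dictionaries.

```python
# Tag sets copied from the constrained pattern; one set per verb slot.
FIRST_VERB = {"V:P3s", "V:J3s", "V:W"}
SECOND_VERB = {"V:P3s", "V:F3s", "V:W"}

# Hypothetical form -> tag entries, standing in for a real dictionary lookup.
LEXICON = {
    "cala": "V:P3s",
    "consente": "V:P3s",
    "calou": "V:J3s",
    "consentiam": "V:I3p",
}

def matches_quem_v_v(sentence):
    """True if the sentence is 'Quem' + verb + verb with allowed tags."""
    words = sentence.lower().split()
    if len(words) != 3 or words[0] != "quem":
        return False
    # Unknown words map to None, which is in neither tag set.
    return (LEXICON.get(words[1]) in FIRST_VERB
            and LEXICON.get(words[2]) in SECOND_VERB)

print(matches_quem_v_v("Quem cala consente"))
print(matches_quem_v_v("Quem cala consentiam"))
```

Under these assumed tags the first sentence is accepted and the second is rejected, because the tag of its second verb is not in the allowed set.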
P2F4 class                  Matches   FP    Precision (P2F4)   Precision (all classes)
Quem <MOT>* V <MOT>* V      276       200   27.5%              60.15%
Quem V V (no insertions)    56        26    53.57%             73.55%
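The P2F4 precision column follows directly from the match and false-positive (FP) counts. A minimal sketch of that arithmetic, assuming precision is computed as true positives over all matches (the function name `precision` is illustrative):

```python
def precision(matches, false_positives):
    """Share of matches that are true positives, as a percentage."""
    return 100.0 * (matches - false_positives) / matches

# Counts taken from the table above: (matches, false positives).
rows = {
    "Quem <MOT>* V <MOT>* V": (276, 200),
    "Quem V V (no insertions)": (56, 26),
}

for pattern, (matches, fp) in rows.items():
    print(f"{pattern}: {precision(matches, fp):.2f}%")
```

This gives 27.54% and 53.57%, in line with the P2F4 figures reported in the table.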
15. Discussion - New variants
• The matches found allowed us to identify other
variants of the same proverb that were not in the initial
list:
Antes tarde do que nunca
‘Better late than never’
[Figure: new variants]
16. Discussion (cont.) - New proverbs
• It was also possible to find proverbs
that were not in the previous list.
P2F4 class: quem V V ‘who V V’
Quem sabe faz
‘Who knows does’
Quem sabe faz ao vivo
‘Who knows does it live’
17. Discussion (cont.) – Window insertion length
• The length of the insertion window can vary depending
on the type of proverb involved (in general, at most
5 words).
O buraco [das negociações com o Congresso] é muito mais embaixo
‘the hole [of the negotiations with Congress] is much further down’
a justiça [que o brasileiro tanto almeja] começa dentro de casa
‘the justice [that the Brazilian so much craves] begins at home’
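A bounded insertion window can be sketched with a counted repetition between the core chunks. This is only an approximation in Python's `re`, not the reference graphs themselves; the choice of core chunks and the name `with_window` are assumptions.

```python
import re

MAX_INSERTED = 5  # upper bound mentioned above; it may differ per class

def with_window(core_chunks, max_inserted=MAX_INSERTED):
    """Join core chunks, allowing up to max_inserted free tokens between them."""
    gap = r"(?:\s+\S+){0,%d}?" % max_inserted
    body = (gap + r"\s+").join(re.escape(chunk) for chunk in core_chunks)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

# Assumed core chunks for the first example above.
chunks = ["o buraco", "é", "mais embaixo"]
text = "O buraco das negociações com o Congresso é muito mais embaixo"

print(with_window(chunks).search(text) is not None)     # window of 5 tokens
print(with_window(chunks, 4).search(text) is not None)  # window of 4 tokens
```

With a window of 5 the sentence is matched, since the bracketed insertion is exactly five words long; with a window of 4 it is missed, which is the kind of trade-off the window length controls.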
18. Discussion (cont.) – Separators
• In Portuguese proverbs, the use of the comma is not
systematic, and in many cases it can be considered
optional.
• The reference graphs allow the optional presence
of punctuation between the core words.
Quem sai ao vento (,) perde o assento (optional comma)
‘Who leaves to the wind, loses the seat’
Quando a esmola é demais (,) o santo desconfia (optional comma)
‘When the alms are too much, the saint suspects’
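One way to picture optional punctuation between core words is a separator that accepts at most one punctuation mark before the whitespace. The sketch below is a hypothetical regex rendering of that idea; the set of accepted marks and the name `join_core` are assumptions.

```python
import re

# Assumed separator: optional comma, semicolon or colon, then whitespace.
SEP = r"\s*[,;:]?\s+"

def join_core(words):
    """Build a pattern where punctuation between core words is optional."""
    body = SEP.join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pattern = join_core(["quem", "sai", "ao", "vento", "perde", "o", "assento"])

for sentence in ("Quem sai ao vento, perde o assento",
                 "Quem sai ao vento perde o assento"):
    print(sentence, "->", pattern.search(sentence) is not None)
```

Both spellings are matched by the same pattern, so the comma does not need to be listed as a separate variant.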
19. Discussion (cont.) – Transformations
• Some proverbs of the P1F2 class
allow a mirror permutation:
O ataque é a melhor defesa
[Mirror Permut.]= A melhor defesa é o ataque
‘The attack is the best defense = The best defense is the attack’
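The mirror permutation can be generated mechanically for copula sentences of this shape. A minimal sketch, assuming the copula is the single word "é" and that swapping the two sides is enough (the function name `mirror` is illustrative):

```python
def mirror(sentence, copula="é"):
    """Swap the phrases on either side of the copula; None if there is no copula."""
    left, sep, right = sentence.partition(f" {copula} ")
    if not sep or not left or not right:
        return None
    # Move the sentence-initial capital to the new first word.
    left = left[0].lower() + left[1:]
    right = right[0].upper() + right[1:]
    return f"{right} {copula} {left}"

print(mirror("O ataque é a melhor defesa"))
```

For the example above this yields the permuted form "A melhor defesa é o ataque"; both forms could then be stored as variants of the same base-form.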
20. Discussion (cont.) - Negation
• Negation may not be an obligatory element, since
wordplay often involves removing it to produce
some type of effect:
Beleza não põe mesa
‘Beauty does not set the table’
Como a maioria das outras entrevistadas,
Astrid diz que beleza põe mesa, sim
‘Like most other interviewees,
Astrid says that beauty does set the table’
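Treating the negation as optional lets a single pattern cover both the canonical proverb and the wordplay. The sketch below is a hypothetical regex version; the labels it returns and the name `find_variant` are assumptions.

```python
import re

# The negation is captured in an optional named group.
PATTERN = re.compile(r"\bbeleza\s+(?P<neg>não\s+)?põe\s+mesa\b", re.IGNORECASE)

def find_variant(text):
    """Report whether the proverb occurs and whether its negation is present."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return "canonical" if match.group("neg") else "negation removed"

print(find_variant("Beleza não põe mesa"))
print(find_variant("Astrid diz que beleza põe mesa, sim"))
```

The first call reports the canonical form and the second reports the variant with the negation removed, so the wordplay is still linked to its proverb.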
21. Discussion (cont.) - Implicit clauses
• Some proverbs in the P2F2 class, formed by two
propositions, may result from coordinating two simple
proverbs with one proposition each:
Quem casa não pensa, quem pensa não casa
‘Who gets married doesn't think, who thinks doesn't get married’
Quem casa não pensa
‘Who gets married doesn't think’
Quem pensa não casa
‘Who thinks doesn't get married’
22. Synopsis
(1) the formal (syntactic) classification of proverbs in 13
classes: this classification may serve as a starting point
for deeper analysis on each one of these proverbial
structures;
(2) the identification of the core elements of each proverb:
the methodology presented to extract keywords can be
replicated for other corpora in order to check different text
types and domains;
(3) the definition of an adequate length for the insertion
window (words and punctuation), which may vary
depending on the class of proverbs.