Seminário de aula na disciplina de Natural Language Processing, com a Profa. Dra. Aline Villavicencio na UFRGS, a apresentação de um capítulo do livro de Jurafsky & Martin.
Seminar presented on the Natural Language Processing course at UFRGS.
1. Tópicos Especiais em Computação: Introdução ao
Processamento da Linguagem Natural - Profa. Dra. Aline
Villavicencio
Speech and Language Processing: An introduction to natural
language processing, computational linguistics, and speech
recognition - Daniel Jurafsky & James H. Martin.
Chapter 12 - FORMAL GRAMMARS OF ENGLISH
por Luciano Sclovsky - setembro de 2015
1
3. Introduction
Study of grammar dates back to Panini's study of Sanskrit
The grammar is wrong!
Syntax (greek) - setting out together or arrangement
ATIS domain example (Air Traffic Information Systems)
3
4. Main ideas in this chapter
Constituent (such as noun phrase)
Grammatical relations (such as subject and object from traditional grammar)
Subcategorization and dependency relations - kinds of relations between
words and phrases
Context Free Grammars are the backbone of formal models
grammar checking, semantic interpretation, dialogue understanding, machine translation
4
5. Constituency
Noun phrase: sequence of words around at least one noun
Can appear in similar syntactic environments (such as before a verb)
They
Harry the horse
a high-class spot such as Mandy's
Preposed
On september 17th, I'd like to fly from Atlanta to Denver
Postposed 5
6. Context-Free Grammars (CFG)
The most commonly used mathematical system
AKA: Phrase-Structure Grammar
Equivalent to: Backus-Naur Form (BNF)
Dates back to 1900 (psy Wilhelm Wundt), formalized by Chomsky (1956)
is a set of rules or productions
expresses the way symbols can be grouped and ordered together
includes a lexicon of words
can be hierarchically embedded
6
7. Context-Free Grammars (CFG)
terminals: correspond to words in the language
non-terminals: expresses generalization
production:
left: non-terminal
right: ordered list of terminals and non-terminals
the production associated to a single word is associated with its lexical category
A CFG can be thought as
a device for generating sentenced
7
8. Context-Free Grammars (CFG)
NP immediately dominates Det and Nom
NP dominates all nodes
Each CFG has a start symbol: often S (sentence node)
Example:
sentence can consist of a noun phrase followed by VERB PHRASE by a verb phrase
S → NP VP I prefer a morning flight
VP → Verb NP prefer a morning flight
VP → Verb NP PP leave Boston in the morning
VP → Verb PP leaving on Thursday
PP → Preposition NP from Los Angeles (locations, times, dates) 8
11. Formal definition of a CFG
Parsing: mapping a string of words into a parse tree
Direct derivation: product of a single production
Derivation: product of n productions
11
12. Some Grammar rules for English: Sentence Level Constructions
Four important sentence structures:
declarative:
subject NP + VP I prefer a morning flight
imperative
S → VP (and no subject) Show the lowest fare
yes-no question
S → Aux NP VP (aux verb) Do any of these flights have stops
wh-question 12
13. Some Grammar rules for English: Sentence Level Constructions
wh-question
mainly wh-subject-question: identical to declarative, with the wh-word
S → Wh-NP VP What airlines fly from Burbank to Denver?
wh-non-subject-question: sentence includes another subject
S → Wh-NP Aux NP VP What flights do you have from Burbank to Tacoma Washington?
wh-non-subject-question is an example of a long-distance dependency
semantic interpretation: parse that flights is the argument of have
13
14. English: Clauses and Sentences
Clause: forms a complete thought
a clause is the smallest grammatical unit that can express a complete proposition (Wikipedia)
Sentence: may form larger structures
S may be in the right side of a production
14
15. English: the Noun Phrase
Most frequent types: pronouns, proper-nouns and the NP construction
NP → Det Nominal
Determiner: NP may begin with simple lexical determiners
a stop these flights some flights
may have more complex structures (recursive)
Det → NP ′s Denver's mayor's mother's canceled flight
Nominal: follows the determiner and contains pre- and post-head noun
modifiers
Nominal → Noun
15
16. English: the Noun Phrase
Determiner: before the Head Noun
cardinal numbers: two friends
ordinal numbers: the first one the last flight
quantifiers: many fares much luggage
adjective phrase: the least expensive fare
NP → (Det) (Card) (Ord) (Quant) (AP) Nominal
16
17. English: the Noun Phrase
Determiner: after the Head Noun (postmodifiers)
prepositional phrases: all flights from Cleveland
common in ATIS corpus to mark origin and destination
Nominal → Nominal PP
non-finite clauses any flights arriving after eleven a.m.
gerundive (-ing, -ed) any of those leaving on Thursday
Nominal → Nominal GerundVP
infinitive the last flight to arrive in Boston
relative clauses a flight that serves breakfast 17
18. English: the Noun Phrase
predeterminer: word classes that modify and appear before the NP
all the flights
18
20. English: the Agreement
Agreement: in 3rd person singular the final verb form is different
the flight does the flights do
Occurs also when the verb has a noun acting as a subject
What flight leave in the morning?
One simple solution is to have different productions, but it doubles the size of
the grammar!
S → 3sgAux 3sgNP VP
S → Non3sgAux Non3sgNP VP
Rule proliferation would happen also in the noun case: nominative (I, she, he,
they) and accusative (me, her, him, them) pronouns
20
21. English: Verb Phrase and Subcategorization
Verb Phrase may contain NPs and PPs and combinations of both
VP → Verb disappear
VP → Verb NP prefer a morning flight
VP → Verb NP PP leave Boston in the morning
VP → Verb PP leaving on Thursday
Sentential complement: an entire embedded sentence, and others
VP → Verb S You said there were two flights that were the cheapest
A VP may include another VP
I want to fly from Milwaukee to Orlando
Not every verb is compatible with every verb phrase
21
22. English: Verb Phrase and Subcategorization
transitive verb: takes a direct object I found a flight
intransitive verb: I
disappeared
modern grammars sub categorizes verbs into more than 100 subcategories
complements of the verb
Verb-with-NP-complement → find | leave | repeat | . . .
Verb-with-S-complement → think | believe | say | . . .
Verb-with-Inf-VP-complement → want | try | need | . . .
VP → Verb-with-no-complement disappear
VP → Verb-with-NP-comp NP prefer a morning flight
VP → Verb-with-S-comp S said there
were two flights
22
23. English: Auxiliaries
auxiliary or helping verbs: places a constraint on the form of the following
verb, and must combine in a particular order
modal verbs: can, could, may, might, must, will, would, shall, should
perfect auxiliary: have
progressive auxiliary: be
passive auxiliary: be
A sentence can have multiple auxiliary verbs, but they must occur in a particular order: modal <
perfect < progressive < passive
modal perfect could have been a contender
modal passive will be married
perfect progressive have been feasting
modal perfect passive might have been prevented
23
24. English: Auxiliaries
verb group: includes the main verb and its auxiliaries
active sentence: subject is the semantic event described by the verb
I prevented a catastrophe
passive sentence: different than other auxiliaries: subject is the undergoer of
the event
a catastrophe was prevented
24
25. English: Coordination
Conjunctions: conjoins major phrase types
Coordinate noun phrase
NP → NP and NP Please repeat the flights and the costs
VP → VP and VP What flights do you have leaving Denver and arriving in
San Francisco
S → S and S
Metarule: X → X and X
25
26. Treebanks
A treebank is a corpus in which each sentence is syntactically annotated with
a parse tree
The Penn Treebank Project
26
32. Heads
Head: the word in the phrase which is grammatically the most important
Dates back to 1914 (Bloomfield)
Central to Head-Driven Phrase Structure Grammar
32
33. Heads finding
Heads are passed up the parse tree
Finding heads is simple for textbooks examples
Modern linguistic theories of syntax generally include a component that define
heads
33
34. Head finding rule example
If the last word is tagged POS, return last-word.
Else search from right to left for the first child which is an NN, NNP, NNPS, NX, POS,
or JJR.
Else search from left to right for the first child which is an NP.
Else search from right to left for the first child which is a $, ADJP, or PRN.
Else search from right to left for the first child which is a CD.
Else search from right to left for the first child which is a JJ, JJS, RB or QP.
Else return the last word
34
35. Grammar equivalence and normal form
A formal language is defined as a possibly infinite set of strings of word
Two grammars are equivalent if they generate the same set of strings
Two CFGs may generate the same language
weak equivalence - do not assign the same phrase structure
strong equivalence - assign the same phrase structure
Normal form: each of productions take the same form
Chomsky Normal Form (CNF)
right-side: two non-terminal symbols, or one terminal symbol 35
36. Finite-State and Context-Free Grammars
Adequate models of grammar need to be able to represent the complexities of
a natural language
Why can't we use finite-state methods?
mathematical reason: certains structures present in English make them non regular languages
subjective reason: non expressive
A finite state machine does not have center-embedded recursions
The luggage arrived.
The luggage that the passengers checked arrived.
The luggage that the passengers that the storm delayed checked arrived.
For many practical purposes matching semantic and syntactic rules aren't 36
37. Dependency Grammars
the syntactic structure of a sentence is described purely in terms of words and
binary semantic or syntactic relations between these words
since 1959 (Tesnière), trace back to ancient Greek and Indian linguistic
traditions
handle languages with relative free word order, such as Czech
37
39. Categorial grammar
combinatory categorial grammar (CCG)
two types of arguments: functors and arguments, that can be combined
For example, a determiner can be thought of as a function that applies to an N on its right to
produce an NP
39
41. Spoken language syntax
Utterance is the unit for spoken language (fala)
Fragments are common: pauses, non-verbal sounds, partial words
The subject of a sentence is often a pronoun
Disfluencies
words uh and um
word repetitions
restarts
word fragments: wha-, betwee- 41
42. Spoken language syntax
Repair, reparandum, interruption point
Edit terms: you know, I mean, uh, um
Find the interruption point, delete the reparandum, replace with repair
42
43. Treebanks of Spoken Language
Augmented notation with
disfluencies and other
phenomena
43
44. Grammars and Human processing
Do people use context-free grammars in their mental processing of language?
It has proved very difficult to find clear-cut evidence that they do.
Constituent: perceptual unit
constituents are often semantic units as well as syntactic units
It is necessary to find evidence for a constituent which is not a semantic unit
ditransitive alternation: change two arguments of a verb
The wealthy widow gave [NP the church] [NP her Mercedes].
The wealthy widow gave [NP her Mercedes] [PP to the church].
44
45. Summary
In many languages, groups of consecutive words act as a group or a constituent, which can be
modeled by context-free grammars (also known as phrase-structure grammars).
A context-free grammar consists of a set of rules or productions, expressed over a set of non-
terminal symbols and a set of terminal symbols. Formally, a particular context-free language is the
set of strings which can be derived from a particular context-free grammar.
A generative grammar is a traditional name in linguistics for a formal language which is used to
model the grammar of a natural language.
There are many sentence-level grammatical constructions in English; declarative, imperative, yes-no-
question, and wh-question are four very common types, which can be modeled with context-free
rules.
An English noun phrase can have determiners, numbers, quantifiers, and adjective phrases preceding
the head noun, which can be followed by a number of postmodifiers; gerundive VPs, infinitives
VPs, and past participial VPs are common possibilities.
45
46. Summary
Subjects in English agree with the main verb in person and number.
Verbs can be subcategorized by the types of complements they expect. Simple subcategories are
transitive and intransitive; most grammars include many more categories than these.
The correlate of sentences in spoken language are generally called utterances. Utterances may be
disfluent, containing filled pauses like um and uh, restarts, and repairs.
Treebanks of parsed sentences exist for many genres of English and for many languages. Treebanks
can be searched using tree-search tools.
Any context-free grammar can be converted to Chomsky normal form, in which the right-hand-side
of each rule has either two non-terminals or a single terminal.
Context-free grammars are more powerful than finite-state automata, but it is nonetheless possible
to approximate a context-free grammar with a FSA.
There is some evidence that constituency plays a role in the human processing of language.
46