Unsupervised Partial Parsing: Thesis defense

Unsupervised Partial Parsing
Elias Ponvert
Department of Linguistics
The University of Texas at Austin
Dissertation Defense
July 27, 2011

Elias Ponvert (UT Austin)




Outline
1 Goals and contributions
2 Unsupervised partial parsing
    Main results
    Discussion
3 Cascaded parsing
    Main results
    Discussion
4 Concluding remarks





Research goals
Generally:
Develop computational models to learn human
language
Hello!





Research goals
Specifically:
Learn to predict constituent structure from raw text
the cat saw the red dog run
⇓





Why unsupervised parsing?
1 Less reliance on annotated training
Hello!

2 Apply to new languages and domains
Særær man
annær man
mæþæn





Assumptions made in parser learning
Getting these labels right AS WELL AS the structure
of the tree is hard
(S (PP (P on) (NP (N Sunday))) (, ,) (NP (Det the) (A brown) (N bear)) (VP (V sleeps)))






So the task is to identify the structure alone

[Tree diagram: unlabeled constituent structure over "on Sunday , the brown bear sleeps", with POS tags P N , Det A N V at the leaves]




Learning operates from gold-standard parts-of-speech
(POS) rather than raw text
P N , Det A N V

on Sunday , the brown bear sleeps

Learning from POS:
- Klein & Manning 2003 (CCM)
- Bod 2006a, 2006b
- Klein & Manning 2005 (DMV)
- Successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al. 2009, Spitkovsky et al. 2010a,b, etc.

Learning from raw text:
- J. Gao et al. 2003, 2004
- Seginer 2007
- this work

Unsupervised parsing: desiderata

Raw text
Standard NLP / extensible
Scalable and fast





Contributions

• Unsupervised parsing satisfying these desiderata is possible
• Unsupervised partial parsing: predicting local constituents with high accuracy
• Cascaded models: building constituent structure bottom up

Outline
1 Goals and contributions
2 Unsupervised partial parsing
    Main results
    Discussion
3 Cascaded parsing
    Main results
    Discussion
4 Concluding remarks

A new approach: start from the bottom

Unsupervised Partial Parsing =
segmentation of (non-overlapping) multiword constituents





Unsupervised segmentation of constituents
leaves some room for interpretation
Possible segmentations
• ( the cat ) in ( the hat ) knows ( a lot ) about that
• ( the cat ) ( in the hat ) knows ( a lot ) ( about that )
• ( the cat in the hat ) knows ( a lot about that )
• ( the cat in the hat ) ( knows a lot about that )
• ( the cat in the hat ) ( knows a lot ) ( about that )





Defining UPP by evaluation
1. Constituent chunks:
non-hierarchical multiword constituents
(S (NP (D The) (N Cat) (PP (P in) (NP (D the) (N hat)))) (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))

Defining UPP by evaluation
2. Base NPs:
non-recursive noun phrases
(S (NP (D The) (N Cat) (PP (P in) (NP (D the) (N hat)))) (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))

Multilingual data for direct evaluation

Language  Corpus  Sentences  Types  Tokens
English   WSJ     49K        44K    1M
German    Negra   21K        49K    300K
Chinese   CTB     19K        37K    430K

WSJ    Penn Treebank
Negra  Negra German Corpus
CTB    Penn Chinese Treebank

Constituent chunks and NPs in the data

Corpus  Chunks  NPs   Chunks ∩ NPs
WSJ     203K    172K  161K
Negra   59K     33K   23K
CTB     92K     56K   43K

The benchmark: CCL parser
[Figure: the sentence "the cat saw the red dog run" shown as a constituency tree and in the common cover links representation]
Seginer (2007 ACL; 2007 PhD UvA)




Hypothesis

Segmentation can be learned by
generalizing on phrasal boundaries





UPP as a tagging problem
the  cat  in  the  hat
B    I    O   B    I
( the cat ) in ( the hat )

B Beginning of a constituent
I Inside a constituent
O Not inside a constituent
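
The tag legend above can be made concrete with a short sketch. The function names `spans_to_bio` and `bio_to_spans` are illustrative, not taken from the thesis, and spans are assumed to be half-open `(start, end)` word offsets.

```python
def spans_to_bio(n_words, spans):
    """Encode non-overlapping chunk spans as one B/I/O tag per word."""
    tags = ["O"] * n_words
    for start, end in spans:
        tags[start] = "B"                  # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                  # remaining words of the chunk
    return tags


def bio_to_spans(tags):
    """Decode a B/I/O tag sequence back into (start, end) chunk spans."""
    spans, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final chunk
        if tag != "I" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


words = "the cat in the hat".split()
tags = spans_to_bio(len(words), [(0, 2), (3, 5)])
```

Here `tags` is the sequence B I O B I from the slide, and `bio_to_spans(tags)` recovers the two chunks ( the cat ) and ( the hat ).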




Learning from boundaries

      the  cat  in  the  hat
STOP  B    I    O   B    I    STOP
#     the  cat  in  the  hat  #

Unsupervised learning tag model for UPP

[Figure: lattice of candidate tags (B, I, O) over each word of "# the cat in the hat #", with STOP at the sentence boundaries]

Decoding the tag model for UPP

STOP  B    I    O   B    I    STOP
#     the  cat  in  the  hat  #

Learning from punctuation

      on  sunday  ,     the  brown  bear  sleeps
STOP  B   I       STOP  B    I      I     O       STOP
#     on  sunday  ,     the  brown  bear  sleeps  #

UPP: Models
Hidden Markov Model
B    I    O   B    I
the  cat  in  the  hat

P(B → the I) ≈ P(I | B) P(the | B)

Probabilistic right linear grammar

P(B → the I) = P(I | B) P(the | B, I)


Learning: expectation maximization (EM) via
forward-backward (run to convergence)





Decoding: Viterbi
Smoothing: additive smoothing on emissions
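
As a rough illustration of Viterbi decoding with additively smoothed emissions, here is a minimal sketch for a first-order HMM over the B/I/O tags. All counts and probabilities are hand-set toy values (assumptions, not learned parameters from the thesis), the STOP state is omitted for brevity, and the function names are hypothetical.

```python
import math

TAGS = ["B", "I", "O"]


def smoothed_emission(counts, tag, word, vocab_size, alpha=0.1):
    """Additively smoothed P(word | tag) from (possibly fractional) counts."""
    total = sum(counts[tag].values())
    return (counts[tag].get(word, 0.0) + alpha) / (total + alpha * vocab_size)


def viterbi(words, start, trans, emit):
    """Most probable tag sequence under a first-order HMM, in log space.

    start[t] = P(t at position 0), trans[s][t] = P(t | s), emit(t, w) = P(w | t).
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    best = [{t: log(start.get(t, 0.0)) + log(emit(t, words[0])) for t in TAGS}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in TAGS:
            # Best predecessor tag for t at this position.
            prev = max(TAGS, key=lambda s: best[-1][s] + log(trans[s].get(t, 0.0)))
            scores[t] = best[-1][prev] + log(trans[prev].get(t, 0.0)) + log(emit(t, w))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Toy parameters (assumed for illustration only).
counts = {
    "B": {"the": 9.0, "a": 1.0},
    "I": {"cat": 5.0, "hat": 5.0},
    "O": {"in": 8.0, "knows": 2.0},
}
start = {"B": 0.6, "O": 0.4}
trans = {
    "B": {"I": 1.0},
    "I": {"B": 0.2, "I": 0.3, "O": 0.5},
    "O": {"B": 0.6, "O": 0.4},
}
emit = lambda t, w: smoothed_emission(counts, t, w, vocab_size=6)
decoded = viterbi("the cat in the hat".split(), start, trans, emit)
```

With these toy values the decoder returns B I O B I. Working in log space avoids underflow on long sentences, and the smoothing constant `alpha` keeps unseen word/tag pairs from zeroing out a path.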





UPP: Constraints on sequences
      the  cat  in  the  hat
STOP  B    I    O   B    I    STOP
#     the  cat  in  the  hat  #
[Diagram: permitted transitions among the states STOP, O, B, and I]
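
One way to read the sequence constraints is as a table of permitted tag transitions. The table below is an assumption inferred from the definitions on earlier slides (chunks are multiword, so B must be followed by I; I can only continue a chunk, so it cannot follow O or STOP); it is not copied from the slide's diagram.

```python
# Assumed transition table: ALLOWED[current] is the set of tags that may follow.
ALLOWED = {
    "STOP": {"B", "O", "STOP"},
    "B": {"I"},                        # chunks are multiword
    "I": {"B", "I", "O", "STOP"},
    "O": {"B", "O", "STOP"},
}


def is_valid(tags):
    """True if every adjacent pair of tags is a permitted transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(tags, tags[1:]))


ok = is_valid("STOP B I O B I STOP".split())
one_word_chunk = is_valid("STOP B O STOP".split())
dangling_inside = is_valid("STOP O I STOP".split())
```

In a tagging model, the same table could be imposed by fixing the probability of every disallowed transition at zero before training, so that EM never assigns mass to ill-formed sequences.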




UPP evaluation: Setup

• Evaluation by comparison to treebank data
• Standard train / development / test splits
• Precision and recall on matched constituents
• Benchmark: CCL
• Both get tokenization, punctuation, sentence boundaries

UPP evaluation: Chunking (F-score)
[Bar chart: chunking F-score (0–80) on WSJ, Negra, and CTB for CCL∗, the HMM chunker, and the PRLG chunker]
CCL∗: CCL non-hierarchical constituents (first-level parsing output)

UPP evaluation: Base NPs (F-score)
[Bar chart: base NP F-score (0–80) on WSJ, Negra, and CTB for CCL∗, the HMM chunker, and the PRLG chunker]
CCL∗: CCL non-hierarchical constituents (first-level parsing output)

PRLG example output
(the seeds) already are in (the script)
(little chance) that (shane longman) is going to recoup today
it would have (severe implications) for (farmers ’ policy) holders
(thames ’s u.s. marketing agent) (donald taffner) is preparing to do just that
and all (the while) (the bonds) are in (the baby ’s diaper)
(mr. rustin) is (senior correspondent) in (the journal ’s london bureau)

UPP: Review

• Sequence models can generalize on indicators for phrasal boundaries
• Leads to improved unsupervised segmentation
• Learn to predict NPs with high accuracy (English and German especially)

Outline
1 Goals and contributions
2 Unsupervised partial parsing
    Main results
    Discussion
3 Cascaded parsing
    Main results
    Discussion
4 Concluding remarks

Question

How do UPP models capture
noun phrase structure?





What UPP models learn
B     100 · P(w|B)    I        100 · P(w|I)    O     100 · P(w|O)
the   21.0            %        1.8             of    5.8
a     8.7             million  1.6             and   4.0
to    6.5             be       1.3             in    3.7
’s    2.8             company  0.9             that  2.2
in    1.9             year     0.8             to    2.1
mr.   1.8             market   0.7             for   2.0
its   1.6             billion  0.6             is    2.0
of    1.4             share    0.5             it    1.7
an    1.4             new      0.5             said  1.7
and   1.4             than     0.5             on    1.5

HMM Emissions: WSJ

B     (gloss)  100 · P(w|B)    I            (gloss)    100 · P(w|I)    O      (gloss)   100 · P(w|O)
der   the      13.0            uhr          o’clock    0.8             in     in        3.4
die   the      12.2            juni         June       0.6             und    and       2.7
den   the      4.4             jahren       years      0.4             mit    with      1.7
und   and      3.3             prozent      percent    0.4             für    for       1.6
im    in       3.2             mark         currency   0.3             auf    on        1.5
das   the      2.9             stadt        city       0.3             zu     to        1.4
des   the      2.7             000                     0.3             von    of        1.3
dem   the      2.4             millionen    millions   0.3             sich   oneself   1.3
eine  a        2.1             jahre        year       0.3             ist    is        1.3
ein   a        2.0             frankfurter  Frankfurt  0.3             nicht  not       1.2

HMM Emissions: Negra

B     (gloss)  100 · P(w|B)    I   (gloss)        100 · P(w|I)    O     (gloss)    100 · P(w|O)
的    de, of   14.3            的  de             3.9             在    at, in     3.4
一    one      3.1             了  (perf. asp.)   2.2             是    is         2.4
和    and      1.1             个  ge (measure)   1.5             中国  China      1.4
两    two      0.9             年  year           1.3             也    also       1.2
这    this     0.8             说  say            1.0             不    no         1.2
有    have     0.8             中  middle         0.9             对    pair       1.1
经济  economy  0.7             上  on, above      0.9             和    and        1.0
各    each     0.7             人  person         0.7             的    de         1.0
全    all      0.7             大  big            0.7             将    fut. tns.  1.0
不    no       0.6             国  country        0.6             有    have       1.0

HMM Emissions: CTB

Question

What about the PRLG? Why does it do so
much better than the HMM?





Question

Hidden Markov Model
B    I    O   B    I
the  cat  in  the  hat

P(B → the I) ≈ P(I | B) P(the | B)

Probabilistic right linear grammar

P(B → the I) = P(I | B) P(the | B, I)


What’s wrong with this picture?
B     100 · P(w|B)    I        100 · P(w|I)    O     100 · P(w|O)
the   21.0            %        1.8             of    5.8
a     8.7             million  1.6             and   4.0
to    6.5             be       1.3             in    3.7
’s    2.8             company  0.9             that  2.2
in    1.9             year     0.8             to    2.1
mr.   1.8             market   0.7             for   2.0
its   1.6             billion  0.6             is    2.0
of    1.4             share    0.5             it    1.7
an    1.4             new      0.5             said  1.7
and   1.4             than     0.5             on    1.5

• ’s occurs (immediately) before several terms that appear after B

PRLG rule probabilities

100 · P(B → w q)       100 · P(I → w q)            100 · P(O → w q)
B → the I    28.2      I → ’s I            2.6     O → of B    3.8
B → a I      11.7      I → and I           1.3     O → to O    3.6
B → mr. I    2.4       I → % O             1.1     O → in B    2.5
B → its I    2.2       I → million O       0.6     O → and O   1.7
B → an I     1.9       I → new I           0.5     O → to B    1.7
B → his I    1.0       I → million STOP    0.5     O → of O    1.6
B → this I   1.0       I → company O       0.5     O → in O    1.5
B → their I  1.0       I → year O          0.4     O → and B   1.4
B → some I   0.7       I → [?] I           0.4     O → for B   1.3
B → new I    0.6       I → million I       0.4     O → it O    1.3

Learning curves: Base NPs
[Plots, one set per corpus: base NP F-score as a function of the number of training sentences and of EM iterations]
PRLG chunking model: WSJ
PRLG chunking model: Negra
PRLG chunking model: CTB

Question

How much can these models learn?





Against a supervised benchmark

[Line charts, one per corpus: base NP F-score of the supervised PRLG vs. the unsupervised PRLG as the number of training sentences grows. WSJ: ∼4500 to 40K sentences; Negra: ∼2200 to 15K sentences; CTB: 5K to 15K sentences]

Negra/CTB training much smaller than WSJ
[Line chart: base NP F-score vs. number of training sentences (10K–40K) for WSJ PRLG, Negra PRLG, and CTB PRLG]

Treebank precision
(S (NP (D The) (N Cat) (PP (P in) (NP (D the) (N hat)))) (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))


(the cat in the hat) knows (a lot) (about that)
• Constituent chunks: Prec = 2/3, Rec = 2/3, F = 2/3
• Base NPs: Prec = 1/3, Rec = 1/2
• Treebank precision: 3/3
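
The figures on this slide can be reproduced with a short span-matching sketch. The gold span sets below are my reading of the slide's tree (half-open word offsets over "the cat in the hat knows a lot about that"), and the helper name `prf` is illustrative.

```python
from fractions import Fraction


def prf(predicted, gold):
    """Precision, recall, and F-score over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    p = Fraction(matched, len(predicted)) if predicted else Fraction(0)
    r = Fraction(matched, len(gold)) if gold else Fraction(0)
    f = 2 * p * r / (p + r) if p + r else Fraction(0)
    return p, r, f


# Word offsets: the(0) cat(1) in(2) the(3) hat(4) knows(5) a(6) lot(7) about(8) that(9)
predicted = {(0, 5), (6, 8), (8, 10)}        # (the cat in the hat) (a lot) (about that)
gold_chunks = {(3, 5), (6, 8), (8, 10)}      # multiword constituents with no multiword sub-constituent
gold_base_nps = {(3, 5), (6, 8)}             # multiword non-recursive NPs
all_constituents = {(0, 10), (0, 5), (2, 5), (3, 5), (5, 10), (6, 8), (8, 10)}

chunk_scores = prf(predicted, gold_chunks)
np_scores = prf(predicted, gold_base_nps)
treebank_precision = Fraction(len(predicted & all_constituents), len(predicted))
```

Treebank precision differs from chunk precision only in what counts as a match: any constituent in the gold tree, rather than only the lowest multiword ones.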




On chunking the CTB
[Line chart: treebank precision, base NP F-score, and constituent chunk F-score on CTB over EM iterations]

Question

Do these models scale?





Chunking with training from Gigaword NYT
[Line chart: treebank precision, base NP F-score, and constituent chunk F-score (50–90) when training on WSJ plus 160K, 320K, 480K, and 640K additional NYT sentences]

Outline
1 Goals and contributions
2 Unsupervised partial parsing
    Main results
    Discussion
3 Cascaded parsing
    Main results
    Discussion
4 Concluding remarks

Question

Are we limited to segmentation?





Hypothesis

Identification of higher-level constituents
can also be learned by generalizing on
phrasal boundaries





Cascaded UPP: 1 Segment raw text
Cascaded UPP: 2 Choose stand-ins for phrases
Cascaded UPP: 3 Segment text + phrasal stand-ins
Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4
Cascaded UPP: 5 Unwind to output tree
[Figures: each step illustrated on the sentence "there is no asbestos in our products now"]

Cascaded UPP: Review

• Separate models learned at each cascade level
• Models share hyper-parameters (smoothing etc)
• Choice of pseudowords as phrasal stand-ins
• Pseudoword identification: corpus frequency
• Cascade run to convergence
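
The cascade loop summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: a single hypothetical `toy_chunker` (a bigram lookup) stands in for the separately learned model at each level, the frequency table is invented, and the stand-in for a chunk is assumed to be its member word with the highest corpus frequency.

```python
from collections import Counter

# Hypothetical corpus frequencies and chunkable bigrams, for illustration only.
FREQ = Counter({"is": 5, "in": 4, "no": 3, "products": 3,
                "our": 2, "there": 2, "now": 2, "asbestos": 1})
CHUNKABLE = {("no", "asbestos"), ("our", "products"), ("in", "products")}


def toy_chunker(words):
    """Stand-in for a learned chunker: returns non-overlapping (start, end) spans."""
    spans, i = [], 0
    while i < len(words) - 1:
        if (words[i], words[i + 1]) in CHUNKABLE:
            spans.append((i, i + 2))
            i += 2
        else:
            i += 1
    return spans


def cascade(tokens, chunker, freq, max_levels=10):
    """Build a bracketing bottom-up by chunking, substituting stand-ins, and repeating."""
    # Each item pairs a surface stand-in with the subtree it represents.
    items = [(tok, tok) for tok in tokens]
    for _ in range(max_levels):
        spans = chunker([w for w, _ in items])
        if not spans:
            break                                   # no new chunks: converged
        new_items, pos = [], 0
        for start, end in sorted(spans):
            new_items.extend(items[pos:start])
            group = items[start:end]
            # Assumption: the most frequent member word represents the chunk.
            stand_in = max((w for w, _ in group), key=lambda w: freq[w])
            new_items.append((stand_in, [tree for _, tree in group]))
            pos = end
        new_items.extend(items[pos:])
        items = new_items
    return [tree for _, tree in items]              # unwind to nested lists


tree = cascade("there is no asbestos in our products now".split(), toy_chunker, FREQ)
```

With these toy inputs the first level chunks "no asbestos" and "our products", the second level chunks "in" with the stand-in for "our products", and the third level finds nothing new, so the loop stops.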





Right-branching baseline
the quick brown fox jumped over the lazy dog
( the ( quick ( brown ( fox ( jumped ( over ( the ( lazy dog ) ) ) ) ) ) ) )

Right-branching baseline
a Lorillard spokeswoman said , this is an old story

( a ( Lorillard ( spokeswoman said ) ) ) , ( this ( is ( an ( old story ) ) ) )
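
A right-branching baseline of this kind takes only a few lines. The sketch below assumes, based on the second example, that the sentence is first split at punctuation and each segment is bracketed separately; the punctuation set and function names are illustrative.

```python
def right_branch(words):
    """Right-branching bracketing of a non-empty word list, as nested lists."""
    if len(words) == 1:
        return words[0]
    if len(words) == 2:
        return list(words)
    return [words[0], right_branch(words[1:])]


def baseline(tokens, punct=frozenset({",", ";", ":", "."})):
    """Bracket each punctuation-delimited segment as a right-branching tree."""
    segments, current = [], []
    for tok in tokens:
        if tok in punct:
            if current:
                segments.append(right_branch(current))
                current = []
        else:
            current.append(tok)
    if current:
        segments.append(right_branch(current))
    return segments


fox = right_branch("the quick brown fox".split())
story = baseline("a Lorillard spokeswoman said , this is an old story".split())
```

The baseline needs no training, which makes it a useful floor for English, where right-branching structure is common.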


Cascaded UPP: Evaluation
[Bar chart: constituent F-score (0–50) on WSJ, Negra, and CTB for the baseline, CCL, cascaded HMM, and cascaded PRLG]

Another benchmark: CCM

Constituent-context model (Klein & Manning, 2002)
• Generative probabilistic model
• Gold-standard POS
• Short sentences




Evaluation on ≤10 word sentences
[Bar chart: constituent F-score (0–70) on WSJ, Negra, and CTB for the baseline, CCM, CCL, cascaded HMM, and cascaded PRLG]

Example parses
[Parse diagrams: gold standard vs. cascaded PRLG output for each sentence, with correct and incorrect constituents marked]

Cascaded PRLG – WSJ
• two share a house almost devoid of furniture
• what is one to think of all this

Cascaded PRLG – Negra
• die csu tut das in bayern doch auch sehr erfolgreich
  Nevertheless, the CSU does this in Bavaria very successfully as well
• bei den windsors bleibt alles in der familie
  With the Windsors everything stays in the family.
• immer mehr anlagenteile überaltern
  (with) more and more machine parts over-age

Outline
1 Goals and contributions
2 Unsupervised partial parsing
    Main results
    Discussion
3 Cascaded parsing
    Main results
    Discussion
4 Concluding remarks

Question

How do these cascaded chunkers work?





Recall of NPs and PPs

               NPs             PPs
               Lev 1  Lev 2    Lev 1  Lev 2
WSJ    PRLG    77.5   78.3     9.1    77.6
Negra  HMM     54.7   62.3     24.8   48.1
CTB    PRLG    30.9   33.6     31.6   47.1

Prec / Rec trade-offs in the cascade
[Line charts, one per corpus: precision, recall, and F-score at cascade levels 1–7 for WSJ PRLG, Negra PRLG, and CTB PRLG]

Learning curves

[Line charts, one per corpus: F-score of PRLG, HMM, and CCL vs. number of training sentences. WSJ: 10K–40K; Negra: 5K–15K; CTB: 5K–15K]

Outline
1 Goals and contributions
2 Unsupervised partial parsing
    Main results
    Discussion
3 Cascaded parsing
    Main results
    Discussion
4 Concluding remarks

What we’ve learned

• Unsupervised identification of base NPs and local constituents is possible
• A cascade of chunking models for raw text parsing has state-of-the-art results

Future directions

• Improvements to the sequence models
• Better phrasal stand-in (pseudoword) construction
• Learning joint models rather than a cascade

Historical note

First known computational natural language parser
Transformations and Discourse Analysis Project
Zellig Harris & colleagues, UPenn, 1950s–1960s





Historical note
To the best of our knowledge, this is the first
application of FSTs to parsing. The program
consisted of the following phases:
1. Dictionary look-up.
2. Replacement of some ‘grammatical idioms’ by a
single part of speech.
3. Rule based part of speech disambiguation.
4. A right to left FST composed with a left to right
FST for computing ‘simple noun phrases’.
5. A left to right FST for computing ‘simple
adjuncts’ such as prepositional phrases and
adverbial phrases.
6. A left to right FST for computing simple verb
clusters.
7. A left to right ‘FST’ for computing clauses.
Joshi & Hopely 1997

Thanks!



