Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Slides from my 2011 Association for Computational Linguistics paper and talk (joint work with Jason Baldridge and Katrin Erk). Presents Unsupervised Partial Parsing, a simple but very effective method for discovering grammatical phrases such as noun phrases.

Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Elias Ponvert, Jason Baldridge, Katrin Erk
Department of Linguistics, The University of Texas at Austin
Association for Computational Linguistics, 19–24 June 2011

Why unsupervised parsing?

1. Less reliance on annotated training data
2. Apply to new languages and domains (e.g. a historical-language sample: "Særær man annær man mæþæn")

Assumptions made in parser learning

Getting these labels right AS WELL AS the structure of the tree is hard.

[Tree: (S (PP (P on) (N Sunday)) (, ,) (NP (Det the) (A brown) (N bear)) (VP (V sleeps)))]

Assumptions made in parser learning

So the task is to identify the structure alone.

[The same tree with the phrase labels stripped: ( (on Sunday) , ( (the brown bear) sleeps ) )]

Assumptions made in parser learning

Learning operates from gold-standard parts-of-speech (POS) rather than raw text.

From POS tags (P N , Det A N V):
  Klein & Manning 2003 (CCM); Klein & Manning 2005 (DMV); Bod 2006a, 2006b
  Successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al. 2009, Spitkovsky et al. 2010a, 2010b, etc.

From raw text (on Sunday , the brown bear sleeps):
  J. Gao et al. 2003, 2004; Seginer 2007; this work

Unsupervised parsing: desiderata

- Raw text
- Standard NLP / extensible
- Scalable and fast

A new approach: start from the bottom

Unsupervised Partial Parsing (UPP) = segmentation of (non-overlapping) multiword constituents

Unsupervised segmentation of constituents leaves some room for interpretation.

Possible segmentations:
  ( the cat ) in ( the hat ) knows ( a lot ) about that
  ( the cat ) ( in the hat ) knows ( a lot ) ( about that )
  ( the cat in the hat ) knows ( a lot about that )
  ( the cat in the hat ) ( knows a lot about that )
  ( the cat in the hat ) ( knows a lot ) ( about that )

Defining UPP by evaluation

1. Constituent chunks: non-hierarchical multiword constituents

[Tree for "The cat in the hat knows a lot about that":
 (S (NP (D The) (N cat) (PP (P in) (NP (D the) (N hat))))
    (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))]

Defining UPP by evaluation

2. Base NPs: non-recursive noun phrases

[The same tree, this time with the base NPs picked out]

Multilingual data for direct evaluation

  Corpus                         Sentences  Types  Tokens
  WSJ    Penn Treebank           49K        44K    1M
  Negra  Negra German Corpus     21K        49K    300K
  CTB    Penn Chinese Treebank   19K        37K    430K

Constituent chunks and NPs in the data

           Chunks  NPs    Chunks ∩ NPs
  WSJ      203K    172K   161K
  Negra    59K     33K    23K
  CTB      92K     56K    43K

The benchmark: CCL parser

[Figure: a constituency tree for "the cat saw the red dog run" and its common cover links representation]

Common cover links representation: Seginer (2007 ACL; 2007 PhD, UvA)

Hypothesis

Segmentation can be learned by generalizing on phrasal boundaries.

UPP as a tagging problem

  the  cat  in  the  hat
  B    I    O   B    I

B: beginning of a constituent
I: inside a constituent
O: not inside a constituent

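To make the tagging view concrete, here is a minimal sketch (my code, not the paper's upparse implementation) converting non-overlapping chunk spans to B/I/O tags and back; the (start, end) span format is an assumption:

    def chunks_to_bio(n_tokens, chunks):
        """chunks: list of (start, end) spans, end exclusive, non-overlapping."""
        tags = ["O"] * n_tokens
        for start, end in chunks:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return tags

    def bio_to_chunks(tags):
        """Recover (start, end) spans from a B/I/O tag sequence."""
        chunks, start = [], None
        for i, tag in enumerate(tags):
            if tag == "B":
                if start is not None:
                    chunks.append((start, i))
                start = i
            elif tag == "O":
                if start is not None:
                    chunks.append((start, i))
                start = None
        if start is not None:
            chunks.append((start, len(tags)))
        return chunks

    tokens = "the cat in the hat".split()
    print(chunks_to_bio(len(tokens), [(0, 2), (3, 5)]))  # ['B', 'I', 'O', 'B', 'I']
    print(bio_to_chunks(["B", "I", "O", "B", "I"]))      # [(0, 2), (3, 5)]

The two functions are inverses over valid tag sequences, which is what lets a sequence tagger stand in for a segmenter.
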
Learning from boundaries

  #     the  cat  in  the  hat  #
  STOP  B    I    O   B    I    STOP

Learning from punctuation

  #     on  sunday  ,     the  brown  bear  sleeps  #
  STOP  B   I       STOP  B    I      I     O       STOP

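A sketch of the corresponding preprocessing, under my reading of these two slides: sentences are wrapped in "#" boundary markers, and boundary markers plus phrasal punctuation are constrained to the STOP tag. The exact punctuation set below is an assumption (the paper discusses which punctuation counts as phrasal):

    # Boundary markers and punctuation as STOP indicators (sketch).
    PHRASAL_PUNCT = {",", ".", ";", ":", "--", "?", "!"}

    def with_boundaries(tokens):
        """Wrap a sentence in '#' markers; tokens stay in the sequence."""
        return ["#"] + [t.lower() for t in tokens] + ["#"]

    def stop_positions(obs):
        """Positions whose tag is constrained to STOP."""
        return [tok == "#" or tok in PHRASAL_PUNCT for tok in obs]

    obs = with_boundaries("on Sunday , the brown bear sleeps".split())
    print(obs)                   # ['#', 'on', 'sunday', ',', 'the', ...]
    print(stop_positions(obs))   # True at the '#' and ',' positions

The output of stop_positions feeds directly into the constrained decoder sketched a few slides below.
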
UPP: Models

Hidden Markov model: the emission depends on the current tag only, e.g.
  P( I, the | B ) ≈ P( I | B ) · P( the | I )

Probabilistic right-linear grammar (PRLG): the emission is conditioned on the previous tag as well, e.g.
  P( I, the | B ) = P( I | B ) · P( the | B, I )

Learning: expectation maximization (EM) via forward-backward, run to convergence.

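Written out in general form (my notation: y_i are tags, w_i words), the two factorizations the slide instantiates are:

    % HMM: the emission depends on the current tag only
    P(y_i, w_i \mid y_{i-1}) \approx P(y_i \mid y_{i-1})\, P(w_i \mid y_i)

    % PRLG: the emission is conditioned on the current and previous tag
    P(y_i, w_i \mid y_{i-1}) = P(y_i \mid y_{i-1})\, P(w_i \mid y_{i-1}, y_i)

The PRLG thus keeps the same transition structure but gives each word a tag-bigram context, which is what makes it a right-linear grammar rather than a plain HMM.
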
UPP: Models (continued)

Decoding: Viterbi
Smoothing: additive smoothing on emissions

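Additive smoothing on emissions presumably takes the standard form below; the additive constant α and the vocabulary V are my notation, and the talk does not give a value for α:

    \hat{P}(w \mid y) = \frac{c(y, w) + \alpha}{c(y) + \alpha\,\lvert V \rvert}

Here c(·) are the (expected) counts accumulated during EM.
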
UPP: Constraints on sequences

  #     the  cat  in  the  hat  #
  STOP  B    I    O   B    I    STOP

Tag bigrams that cannot occur in a valid chunking are disallowed: since chunks are multiword, B must be followed by I, and I can only follow B or I (never O or STOP).

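A minimal sketch of constrained Viterbi decoding over B/I/O/STOP, with the constraints above encoded as a transition mask. The log_trans and log_emit callables stand in for a model trained with EM as above; their signatures are illustrative, not upparse's API, and must_stop can be produced by stop_positions from the earlier sketch:

    import math

    TAGS = ["B", "I", "O", "STOP"]

    # Allowed tag bigrams: I may only follow B or I, and (since chunks
    # are multiword) B may only be followed by I.
    ALLOWED = {
        "B": {"I"},
        "I": {"B", "I", "O", "STOP"},
        "O": {"B", "O", "STOP"},
        "STOP": {"B", "O"},
    }

    def viterbi(obs, log_trans, log_emit, must_stop):
        """obs: tokens incl. '#' boundaries; must_stop[i] forces tag STOP."""
        n = len(obs)
        score = [dict.fromkeys(TAGS, -math.inf) for _ in range(n)]
        back = [{} for _ in range(n)]
        for y in TAGS:
            if (y == "STOP") == must_stop[0]:
                score[0][y] = log_emit(y, obs[0])
        for i in range(1, n):
            for y in TAGS:
                if (y == "STOP") != must_stop[i]:
                    continue  # boundaries and punctuation must be STOP
                cands = [p for p in TAGS
                         if y in ALLOWED[p] and score[i - 1][p] > -math.inf]
                if not cands:
                    continue
                p = max(cands, key=lambda q: score[i - 1][q] + log_trans(q, y))
                score[i][y] = (score[i - 1][p] + log_trans(p, y)
                               + log_emit(y, obs[i]))
                back[i][y] = p
        # trace back from the best final tag
        y = max(TAGS, key=lambda t: score[n - 1][t])
        tags = [y]
        for i in range(n - 1, 0, -1):
            y = back[i][y]
            tags.append(y)
        return tags[::-1]

With trained tables plugged in, decoding "# the cat in the hat #" under these constraints can only produce sequences like STOP B I O B I STOP, matching the slides.
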
UPP evaluation: Setup

- Evaluation by comparison to treebank data
- Standard train / development / test splits
- Precision and recall on matched constituents
- Benchmark: CCL
- Both systems get tokenization, punctuation, sentence boundaries

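For reference, precision and recall on matched constituents are the standard bracket-matching measures, and F-score is their harmonic mean:

    P = \frac{\lvert \text{predicted} \cap \text{gold} \rvert}{\lvert \text{predicted} \rvert}
    \qquad
    R = \frac{\lvert \text{predicted} \cap \text{gold} \rvert}{\lvert \text{gold} \rvert}
    \qquad
    F_1 = \frac{2\,P\,R}{P + R}
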
UPP evaluation: Chunking (F-score)

[Bar chart: chunking F-scores (0–80 scale) on WSJ, Negra, and CTB for CCL*, the HMM chunker, and the PRLG chunker]

*CCL's non-hierarchical constituents, i.e. its first-level parsing output.

UPP evaluation: Base NPs (F-score)

[Bar chart: base-NP F-scores (0–80 scale) on WSJ, Negra, and CTB for CCL*, the HMM chunker, and the PRLG chunker]

*CCL's non-hierarchical constituents, i.e. its first-level parsing output.

UPP: Review

- Sequence models can generalize on indicators for phrasal boundaries
- This leads to improved unsupervised segmentation

Question

Are we limited to segmentation?

Hypothesis

Identification of higher-level constituents can also be learned by generalizing on phrasal boundaries.

Cascaded UPP: 1. Segment raw text

[Figure: the chunker segments "there is no asbestos in our products now" into first-level chunks]

Cascaded UPP: 2. Choose stand-ins for phrases

[Figure: each first-level chunk (e.g. "no asbestos", "our products") is collapsed to a single stand-in token]

Cascaded UPP: 3. Segment text + phrasal stand-ins

[Figure: the chunker is re-run over the sequence of remaining words and stand-ins]

Cascaded UPP: 4. Choose stand-ins and repeat steps 3–4

[Figure: the new, higher-level chunks are again collapsed to stand-ins and the process iterates]

Cascaded UPP: 5. Unwind to output tree

[Figure: the nested levels of chunks are unwound into a constituency tree over "there is no asbestos in our products now"]

Cascaded UPP: Review

- Separate models learned at each cascade level
- Models share hyper-parameters (smoothing etc.)
- Chunks are replaced by pseudowords as phrasal stand-ins
- Pseudoword identification: corpus frequency

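A compact sketch of the cascade loop (steps 1–5 above). Since the talk does not show model code, the per-level chunker is passed in: train(corpus) -> model and chunk(model, sent) -> list of tokens and chunk sublists are assumed signatures, not the upparse API:

    from collections import Counter

    def pseudoword(chunk_tokens, freq):
        # the frequency-based stand-in rule named on this slide
        return max(chunk_tokens, key=lambda w: freq[w])

    def cascaded_upp(corpus, train, chunk, max_levels=10):
        """corpus: list of token lists; returns the per-level bracketings."""
        freq = Counter(t for sent in corpus for t in sent)
        levels = []
        for _ in range(max_levels):
            model = train(corpus)           # a separate model per level
            out = [chunk(model, s) for s in corpus]
            if all(not any(isinstance(c, list) for c in s) for s in out):
                break                       # nothing left to group: converged
            levels.append(out)
            # step 4: collapse each chunk to a single pseudoword stand-in
            corpus = [[pseudoword(c, freq) if isinstance(c, list) else c
                       for c in s] for s in out]
        return levels  # step 5 unwinds these levels into output trees

Since stand-ins are real corpus words, the frequency table built once over the raw text also covers every later level.
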
Cascaded UPP: Evaluation

[Bar chart: all-constituent F-scores (0–60 scale) on WSJ, Negra, and CTB for CCL, the cascaded HMM, and the cascaded PRLG]

Cascade run to convergence.

More example parses

[Figure: gold-standard vs. Cascaded PRLG bracketings (Negra) of a glossed sentence translated "Nevertheless, the CSU does this in Bavaria very successfully as well", with correct and incorrect brackets marked]

More example parses

[Figure: gold-standard vs. Cascaded PRLG bracketings (Negra) of "bei den windsors bleibt alles in der familie" ("With the Windsors everything stays in the family."), with correct and incorrect brackets marked]

More example parses

[Figure: gold-standard vs. Cascaded PRLG bracketings (Negra) of "immer mehr anlagenteile überaltern" ("(with) more and more machine parts over-age"), with correct and incorrect brackets marked]

What we've learned

- Unsupervised identification of base NPs and local constituents is possible
- A cascade of chunking models achieves state-of-the-art results for raw-text parsing

Future directions

- Improvements to the sequence models
- Better phrasal stand-in (pseudoword) construction
- Learning joint models rather than a cascade

What's in the paper

- Comparison to Klein & Manning's CCM
- Discussion of phrasal punctuation (the chunkers still do well without punctuation)
- Analysis of chunking and parsing Chinese
- Error analysis

Thanks!

Contact: eponvert@utexas.edu
Code: elias.ponvert.net/upparse

This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship through the Office of Graduate Studies at The University of Texas at Austin.

Appendices

More example parses

[Figure: gold-standard vs. Cascaded PRLG bracketings (WSJ) of "two share a house almost devoid of furniture", with correct and incorrect brackets marked]

More example parses

[Figure: gold-standard vs. Cascaded PRLG bracketings (WSJ) of "what is one to think of all this", with correct and incorrect brackets marked]

Learning curves: Base NPs

[Plots: base-NP F-score for the PRLG chunking model on WSJ, as a function of training data size (10–40K sentences) and of EM iterations (0–100)]

Learning curves: Base NPs

[Plots: base-NP F-score for the PRLG chunking model on Negra, as a function of training data size (5–15K sentences) and of EM iterations (0–150)]

Learning curves: Base NPs

[Plots: base-NP F-score for the PRLG chunking model on CTB, as a function of training data size (5–15K sentences) and of EM iterations (0–100)]

What are the models learning?

HMM emissions, WSJ:
  P(w|B): the 21.0, a 8.7, to 6.5, 's 2.8, in 1.9, mr. 1.8, its 1.6, of 1.4, an 1.4, and 1.4
  P(w|I): % 1.8, million 1.6, be 1.3, company 0.9, year 0.8, market 0.7, billion 0.6, share 0.5, new 0.5, than 0.5
  P(w|O): of 5.8, and 4.0, in 3.7, that 2.2, to 2.1, for 2.0, is 2.0, it 1.7, said 1.7, on 1.5

What are the models learning?

HMM emissions, Negra:
  P(w|B): der (the) 13.0, die (the) 12.2, den (the) 4.4, und (and) 3.3, im (in) 3.2, das (the) 2.9, des (the) 2.7, dem (the) 2.4, eine (a) 2.1, ein (a) 2.0
  P(w|I): uhr (o'clock) 0.8, juni (June) 0.6, jahren (years) 0.4, prozent (percent) 0.4, mark (currency) 0.3, stadt (city) 0.3, 000 0.3, millionen (millions) 0.3, jahre (year) 0.3, frankfurter (Frankfurt) 0.3
  P(w|O): in (in) 3.4, und (and) 2.7, mit (with) 1.7, für (for) 1.6, auf (on) 1.5, zu (to) 1.4, von (of) 1.3, sich (refl.) 1.3, ist (is) 1.3, nicht (not) 1.2

What are the models learning?

HMM emissions, CTB:
  P(w|B): 的 (de, of) 14.3, 一 (one) 3.1, 和 (and) 1.1, 两 (two) 0.9, 这 (this) 0.8, 有 (have) 0.8, 经济 (economy) 0.7, 各 (each) 0.7, 全 (all) 0.7, 不 (no) 0.6
  P(w|I): 的 (de) 3.9, 了 (perf. asp.) 2.2, 个 (ge, measure) 1.5, 年 (year) 1.3, 说 (say) 1.0, 中 (middle) 0.9, 上 (on, above) 0.9, 人 (person) 0.7, 大 (big) 0.7, 国 (country) 0.6
  P(w|O): 在 (at, in) 3.4, 是 (is) 2.4, 中国 (China) 1.4, 也 (also) 1.2, 不 (no) 1.2, 对 (pair) 1.1, 和 (and) 1.0, 的 (de) 1.0, 将 (fut. tns.) 1.0, 有 (have) 1.0
