Main Goal:
Improve automatic syntactic parsing of spontaneous spoken sentences using prosodic cues
Theoretical Motivation:
Automatic parsing is negatively affected by syntactic ambiguity (Kummerfeld et al., 2012)
Prosody can help resolve some syntactic ambiguities (Cutler et al., 1997)
Syntactic structure is related to prosodic structure (Selkirk, 1986, among many other studies)
Natural Language Processing: Parts of Speech Tagging, its Classes, and How to ... (Rajnish Raj)
Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag, such as noun, verb, or adjective, to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and of words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
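The "highest joint probability" search described above is typically done with the Viterbi algorithm. The following is a minimal sketch; the tagset and all transition/emission probabilities are invented for illustration, not estimated from a real corpus.

```python
# Toy Viterbi decoder: finds the tag sequence with the highest joint
# probability under bigram (tag-to-tag) and lexical (tag-to-word) models.
trans = {  # P(tag_i | tag_{i-1}), with "<s>" as the start symbol
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.5,
}
emit = {  # P(word | tag)
    ("DET", "the"): 0.7,
    ("NOUN", "dog"): 0.4, ("NOUN", "barks"): 0.1,
    ("VERB", "barks"): 0.6,
}
TAGS = ["DET", "NOUN", "VERB"]

def viterbi(words):
    # best[tag] = (probability of best path ending in tag, that path)
    best = {t: (trans.get(("<s>", t), 0.0) * emit.get((t, words[0]), 0.0), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            p, path = max(
                ((best[prev][0] * trans.get((prev, t), 0.0)
                  * emit.get((t, w), 0.0), best[prev][1] + [t])
                 for prev in TAGS),
                key=lambda x: x[0])
            new[t] = (p, path)
        best = new
    prob, path = max(best.values(), key=lambda x: x[0])
    return path, prob

tags, p = viterbi(["the", "dog", "barks"])
print(tags)  # ['DET', 'NOUN', 'VERB']
```

Dynamic programming keeps one best path per tag at each position, so the search stays linear in sentence length rather than exponential.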
This document describes a study that developed a generic transliteration tool to transliterate words from English to Hindi for cross-lingual information access applications. It discusses the challenges of transliteration from English to Hindi due to differences in phonetic properties between the two languages. The study then evaluates the performance of four existing transliteration editor tools - Xlit, GIST, Google Transliteration, and Microsoft Hindi writing tool - on test data comprising English words, personal names, and other terms. Test results show variations in the transliterated output for the same input across the different tools.
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra... (Normunds Grūzītis)
We present a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generate the grammar based on semantico-syntactic valence patterns extracted from FrameNet-annotated corpora. Second, it provides a proof of concept for two use cases illustrating how the acquired multilingual grammar can be exploited in different CNL applications in the domains of arts and tourism.
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks (Lifeng (Aaron) Han)
Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually use different phrase tagsets for different languages, which complicates multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs of the parsing models and even improve parsing accuracy.
This document is a lecture on tokenization and word counts in natural language processing. It discusses concepts like types and tokens, and Zipf's law and Heaps' law, which relate the number of word types to the number of tokens in a text. The document also covers challenges in tokenization like sentence segmentation and provides examples of rule-based and machine learning approaches to tokenization. It introduces word normalization techniques like lemmatization and stemming and provides exercises for students to practice word counting, lemmatization, stemming, and removing stop words from texts.
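The type/token distinction in the lecture can be shown in a few lines. The tokenizer here is a deliberately simple rule-based split; real tokenizers handle punctuation, clitics, and sentence boundaries more carefully. The example text is arbitrary.

```python
# Counting word tokens (running words) vs. word types (distinct words).
import re
from collections import Counter

text = "The cat sat on the mat. The mat was flat."
tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
counts = Counter(tokens)

print(len(tokens))              # 10 tokens
print(len(counts))              # 7 distinct types
print(counts.most_common(2))    # [('the', 3), ('mat', 2)]
```

As the text grows, Heaps' law predicts that the number of types grows roughly as a power of the number of tokens.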
Annotated text corpora are an important resource for natural language processing research and technologies. Corpora can be annotated with linguistic information like parts of speech, morphology, syntax, and semantics through a layered approach. This involves manually or automatically tagging words, sentences, and texts with linguistic metadata. Well-annotated corpora are essential for tasks like morphological analysis, part-of-speech tagging, parsing, and machine translation model training.
A Vietnamese Language Model Based on Recurrent Neural Network (Viet-Trung TRAN)
Language modeling plays a critical role in many natural language processing (NLP) tasks such as text prediction, machine translation, and speech recognition. Traditional statistical language models (e.g. n-gram models) can only offer words that have been seen before and cannot capture long-range word context. Neural language models provide a promising way to overcome this shortcoming of statistical language models. This paper investigates Recurrent Neural Network (RNN) language models for Vietnamese at the character and syllable levels. Experiments were conducted on a large dataset of 24M syllables, constructed from 1,500 movie subtitles. The experimental results show that our RNN-based language models yield reasonable performance on the movie subtitle dataset; concretely, our models outperform n-gram language models in terms of perplexity.
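Perplexity, the metric used in the comparison above, is the inverse probability of the test data normalized by token count; lower is better. A minimal sketch, with made-up per-token probabilities standing in for the two models:

```python
# Perplexity: PP = exp(-(1/N) * sum(log p_i)) over the test tokens.
import math

def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

ngram_probs = [0.10, 0.05, 0.20, 0.08]   # hypothetical n-gram model
rnn_probs   = [0.15, 0.12, 0.25, 0.10]   # hypothetical RNN model
print(perplexity(ngram_probs) > perplexity(rnn_probs))  # True
```

The model that assigns higher probability to the held-out text gets the lower perplexity, which is why it serves as a direct comparison between n-gram and RNN models.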
This thesis project will analyze interference from Spanish (L1) to English (L2) for the personal pronoun "it" among English language learners. The researcher, Lorgia Rueda de León Barbosa, will conduct a descriptive study to determine how factors like English proficiency level, exposure to English, and age influence misuse of the pronoun. The study aims to provide useful information for both students and teachers on common errors and how to explicitly teach pronoun functions to improve language acquisition.
This document provides an introduction to prosodic morphology. It defines prosodic morphology as a theory that posits underlying morpheme representations as templates defined by prosodic units like feet and syllables. The principles of prosodic morphology state that morphological processes involving sound shape are defined by categories in the prosodic hierarchy. Examples are given of reduplication and truncation processes across languages that are explained by prosodic structure. The role of feet, minimal words, and how quantity sensitivity affects word structure are also discussed.
This document discusses using machine learning techniques like neural networks to help decipher ancient scripts and languages. It describes how character-level sequence-to-sequence models can be used to identify cognates between related languages. Additional techniques like network flows and dynamic programming are used to model monotonic character alignments and jointly segment and match tokens between known and unknown languages. The approaches are able to identify cognates between languages like Ugaritic and Hebrew as well as segment and match the unknown Iberian language. Neural models that incorporate linguistic features like phonological embeddings are shown to improve decipherment performance.
The document discusses several key topics in natural language processing and computational linguistics:
1. It defines the basic units of language like words, tokens, types and texts.
2. It describes techniques for extracting text from various sources like files, web pages and corpora and preprocessing the text by removing HTML tags and normalizing whitespace.
3. It discusses empirical observations about word frequencies like Zipf's Law and Heaps' Law, which state that a small number of words occur very frequently while most words occur rarely.
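Zipf's Law can be stated as rank × frequency ≈ constant. A quick sketch with synthetic counts chosen to be exactly Zipfian (real corpora only approximate this):

```python
# For an ideal Zipfian distribution, rank * frequency is constant,
# so the few top-ranked words account for most tokens.
counts = [1200, 600, 400, 300, 240, 200]  # synthetic, already rank-sorted
products = [rank * freq for rank, freq in enumerate(counts, start=1)]
print(products)  # every rank*frequency product is 1200 here
```

On real text the products drift rather than staying exactly constant, but plotting rank against frequency on log-log axes still yields a roughly straight line.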
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA (ijnlc)
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
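The BIO scheme mentioned above tags each character as B(egin) or I(nside) of a morpheme; decoding the tags recovers the segmentation. A small sketch of that decoding step, using an invented English example rather than Tigrinya:

```python
# Decode character-level BIO tags into morpheme segments: a "B" tag
# closes the current morpheme and starts a new one.
def bio_to_morphemes(chars, tags):
    morphemes, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B" and current:
            morphemes.append(current)
            current = ch
        else:
            current += ch
    if current:
        morphemes.append(current)
    return morphemes

# "unpacking" segmented as un + pack + ing (illustrative only)
chars = list("unpacking")
tags  = ["B", "I", "B", "I", "I", "I", "B", "I", "I"]
print(bio_to_morphemes(chars, tags))  # ['un', 'pack', 'ing']
```

The boundary-detection models in the paper predict exactly these per-character tags; the decoding itself is deterministic.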
BERT is a language representation model that was pre-trained using two unsupervised prediction tasks: masked language modeling and next sentence prediction. It uses a multi-layer bidirectional Transformer encoder based on the original Transformer architecture. BERT achieved state-of-the-art results on a wide range of natural language processing tasks including question answering and language inference. Extensive experiments showed that both pre-training tasks, as well as a large amount of pre-training data and steps, were important for BERT to achieve its strong performance.
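The masked-language-modeling objective works by hiding a fraction of the input tokens and training the model to recover them. This sketch only builds the masked input (it omits BERT's 80/10/10 replacement split and the model itself); the sentence, mask rate, and seed are arbitrary.

```python
# Build a masked copy of a token sequence, remembering the hidden originals.
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok           # the model must predict this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)   # some positions replaced by [MASK]
```

Because the objective conditions on context from both directions at once, it pairs naturally with the bidirectional Transformer encoder described above.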
Introduction to NLP with some practical exercises (tokenization, keyword extraction, topic modelling) using Python libraries like NLTK, Gensim and TextBlob, plus a general overview of the field.
1) The document outlines a study that examines whether working memory capacity is related to L2 reading and listening comprehension, and whether this relationship is mediated by proficiency level.
2) It presents background on working memory and its role in L1 and L2 language processing, then describes the study's research questions and methodology which involves administering working memory and proficiency tests to upper-intermediate and advanced English learners.
3) The study aims to provide insight into how working memory capacity and proficiency level interact in L2 reading and listening comprehension.
An Improved Approach to Word Sense Disambiguation (Surabhi Verma)
This document presents a knowledge-based algorithm for word sense disambiguation that uses WordNet. It computes the similarity between a target word and nearby words based on their intersection in WordNet hierarchies, distance between the words, and hierarchical level. The algorithm was evaluated on the SemCor corpus and performed better than existing supervised and unsupervised methods by frequently ranking the correct sense first or within the top three results.
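The hierarchy-based scoring idea can be illustrated in miniature: score two senses by where their lowest common ancestor sits in a hypernym hierarchy. The tiny hierarchy and the scoring function below are toy stand-ins, not the paper's algorithm or WordNet itself.

```python
# Toy hypernym hierarchy (child -> parent) and a path-based similarity:
# 1 / (edges between the two words via their lowest common ancestor).
hypernym = {
    "dog": "canine", "canine": "mammal", "mammal": "animal",
    "cat": "feline", "feline": "mammal",
    "car": "vehicle", "vehicle": "artifact",
}

def ancestors(word):
    chain = [word]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

def similarity(a, b):
    pa, pb = ancestors(a), ancestors(b)
    common = [x for x in pa if x in pb]
    if not common:
        return 0.0                     # no shared ancestor: unrelated
    lca = common[0]                    # lowest common ancestor
    return 1.0 / (pa.index(lca) + pb.index(lca) + 1)

print(similarity("dog", "cat") > similarity("dog", "car"))  # True
```

The paper's actual measure additionally weighs the distance between the words in the sentence and the hierarchical level of the intersection, but the intuition is the same: senses whose hierarchies intersect low (specifically) score higher than senses that only meet at the top.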
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute pronunciation-by-pronunciation phonetic distances, with a critical-distance threshold as the classification criterion. The proposed method consists of two steps: a training step that estimates the critical-distance parameter from transcribed data, and a classification step that uses this criterion to sort input utterances into pronunciation variants and OOV words. The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. The confusion matrix and the precision, recall, and accuracy metrics are used for performance evaluation. Experimental results show significant improvement over existing classifiers.
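The distance-plus-threshold idea can be sketched with plain edit distance over phone symbols standing in for Dynamic Phone Warping; the phone strings, the one-entry lexicon, and the threshold value are all invented for illustration.

```python
# Classify an utterance as a pronunciation variant of a dictionary entry
# (distance within a critical threshold) or as OOV.
def phone_distance(a, b):
    # classic Levenshtein distance over phone symbols, rolling single row
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

CRITICAL = 2  # hypothetical critical-distance parameter from training

def classify(utterance, dictionary):
    best = min(dictionary, key=lambda p: phone_distance(utterance, p))
    d = phone_distance(utterance, best)
    return ("variant", best) if d <= CRITICAL else ("OOV", None)

lexicon = [["T", "AH", "M", "EY", "T", "OW"]]                # "tomato"
print(classify(["T", "AH", "M", "AA", "T", "OW"], lexicon))  # a variant
print(classify(["Z", "IH", "B", "R", "AH"], lexicon))        # OOV
```

In the paper, the training step tunes the threshold from transcribed data rather than fixing it by hand, and DPW aligns phones with phonetically informed costs rather than unit costs.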
This document provides an introduction to natural language processing (NLP). It discusses the brief history of NLP, major NLP tasks such as machine translation and text classification, common NLP techniques like part-of-speech tagging and parsing, main problems in NLP including ambiguity, and an overview of the topics to be covered in the course such as tokenization, parsing, and topic modeling. The course aims to use Python and R to complete various NLP tasks.
Hindi digits recognition system on speech data collected in different natural... (csandit)
This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs across speakers, since the data was collected in their respective homes: vehicle horn noise in some road-facing rooms, internal background noise such as opening doors in others, and silence elsewhere. All these recordings are used for training the acoustic model, which is trained on audio data from 8 speakers. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate. The efficiency of the recognizer on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.
The document provides an overview of the Natural Language Toolkit (NLTK), a Python library for natural language processing that includes corpora, tokenizers, stemmers, part-of-speech taggers, parsers, and other tools. The document outlines the modules in NLTK and their functionality, such as the nltk.corpus module for corpora, nltk.tokenize and nltk.stem for tokenizers and stemmers, and nltk.tag for part-of-speech tagging. It also provides instructions on installing NLTK and downloading its data.
The document summarizes a presentation on a neural response study investigating how Taiwanese Mandarin speakers process two types of syllable gaps - tonal gaps (TGs) and segmental gaps (SGs). Two ERP experiments were conducted - a passive listening task and a lexical decision task. The results showed different ERP responses for TGs, SGs, and real syllables between the two tasks. The findings provide insights into how task demands can influence speech perception processing and the potential separate representations of tones and segments in the mental lexicon of Mandarin speakers.
This document summarizes a study on how the structural position of sounds affects their acquisition by English learners of Spanish. It tested if learners rely on distributional information when acquiring sounds. The study found that learners were most successful with sounds that have overlapping distributions in English and Spanish, and least successful with sounds only in Spanish. This suggests learners do use distribution to learn sounds and confirms the importance of comparing sound systems between languages.
What can typological knowledge bases and language representations tell us abo... (Isabelle Augenstein)
One of the core challenges in typology is to record properties of languages in a structured way. As a result of manual efforts, typological knowledge bases have emerged, which contain information about languages’ phonological, morphological, and syntactic properties, as well as information about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
This document provides an overview of a research talk on human-in-the-loop speech synthesis technology given by Yuki Saito from the University of Tokyo. The talk was organized in two parts, with the first part presented by Saito covering human-in-the-loop deep speaker representation learning and speaker adaptation for multi-speaker text-to-speech. Saito's research group at the University of Tokyo works on text-to-speech and voice conversion using deep learning techniques. Their recent work focuses on incorporating human listeners into the training process to learn speaker representations that better capture perceptual speaker similarity.
65 - An Empirical Simulation-based Study of Real-Time Speech Translation for ... (ESEM 2014)
Context: Real-time speech translation technology is available today, but we still lack a complete understanding of how it may affect communication in global software projects. Goal: To investigate the adoption of combined speech recognition and machine translation to overcome language barriers among stakeholders who are remotely negotiating software requirements.
Method: We performed an empirical simulation-based study including: Google Web Speech API and Google Translate service, two groups of four subjects, speaking Italian and Brazilian Portuguese, and a test set of 60 technical and non-technical utterances.
Results: Our findings revealed that, overall: (i) a satisfactory accuracy in terms of speech recognition was achieved, although significantly affected by speaker and utterance differences; (ii) adequate translations tend to follow accurate transcripts, meaning that speech recognition is the most critical part for speech translation technology.
Conclusions: Results provide positive albeit initial evidence for the possibility of using speech translation technologies to help globally distributed team members communicate in their native languages.
International Refereed Journal of Engineering and Science (IRJES)
This document summarizes a study that analyzed the effect of prosody on the temporal realization of segments in Chinese. The study examined how prosodic word boundaries and prosodic phrase boundaries impact the voice onset time (VOT) of consonants and the duration of vowels. Key findings include: 1) Vowels preceding prosodic phrase boundaries were longer than those preceding prosodic word boundaries; 2) Place of articulation of the second consonant also impacted vowel duration; 3) VOT of initial consonants was affected by prosody but not place of articulation; 4) VOT of final consonants was impacted by place of articulation but not prosody. The results demonstrate the interaction between prosodic structure and segmental temporal realization.
This document summarizes a study that investigated how modifying the duration of acoustic cues in fricative consonants affects perception of voicing and place of articulation. The study synthesized fricative-vowel syllables with selective time expansions of fricative noise duration and vowel formant transition duration. Listeners then identified voicing and place of articulation in the syllables in quiet and noise conditions. Results showed that lengthening formant transitions significantly improved place of articulation identification, while lengthening noise duration had little effect on voicing or place cues. The study aimed to determine how clear speech production features like expanded durations can enhance perception of fricatives.
65 - An Empirical Simulation-based Study of Real-Time Speech Translation for ...ESEM 2014
Context: Real-time speech translation technology is today available but still lacks a complete understanding of how such technology may affect communication in global software projects. Goal: To investigate the adoption of combining speech recognition and machine translation in order to overcome language barriers among stakeholders who are remotely negotiating software requirements.
Method: We performed an empirical simulation-based study including: Google Web Speech API and Google Translate service, two groups of four subjects, speaking Italian and Brazilian Portuguese, and a test set of 60 technical and non-technical utterances.
Results: Our findings revealed that, overall: (i) a satisfactory accuracy in terms of speech recognition was achieved, although significantly affected by speaker and utterance differences; (ii) adequate translations tend to follow accurate transcripts, meaning that speech recognition is the most critical part for speech translation technology.
Conclusions: Results provide a positive albeit initial evidence towards the possibility to use speech translation technologies to help globally distributed team members to communicate in their native languages.
International Refereed Journal of Engineering and Science (IRJES)irjes
This document summarizes a study that analyzed the effect of prosody on the temporal realization of segments in Chinese. The study examined how prosodic word boundaries and prosodic phrase boundaries impact the voice onset time (VOT) of consonants and duration of vowels. Key findings include: 1) Vowels preceding prosodic phrase boundaries were longer than those preceding prosodic word boundaries; 2) Place of articulation of the second consonant also impacted vowel duration; 3) VOT of initial consonants was affected by prosody but not place of articulation; 4) VOT of final consonants was impacted by place of articulation but not prosody. The results demonstrate the interaction between prosodic structure and segmental temporal realization
This document summarizes a study that investigated how modifying the duration of acoustic cues in fricative consonants affects perception of voicing and place of articulation. The study synthesized fricative-vowel syllables with selective time expansions of fricative noise duration and vowel formant transition duration. Listeners then identified voicing and place of articulation in the syllables in quiet and noise conditions. Results showed that lengthening formant transitions significantly improved place of articulation identification, while lengthening noise duration had little effect on voicing or place cues. The study aimed to determine how clear speech production features like expanded durations can enhance perception of fricatives.
This document describes a study on voice conversion using sequence-to-sequence learning. The researchers propose converting context posterior probabilities from the source to target speaker using sequence-to-sequence learning to allow for variable-length conversion. They also propose jointly training the recognition and synthesis models to better relate recognition accuracy to synthesis accuracy. Experimental results found that sequence-to-sequence learning enabled variable-length conversion and joint training improved speaker similarity and quality of converted speech over conventional methods.
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningCITE
HU, Yuxiu (Harbin Institute of Technology Shenzhen Graduate School, China)
BODOMO, Adams (The University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_603.htm
---------------------------
Author(s) bear(s) the responsibility in case of any infringement of the Intellectual Property Rights of third parties.
---------------------------
CITE was notified by the author(s) that if the presentation slides contain any personal particulars, records and personal data (as defined in the Personal Data (Privacy) Ordinance) such as names, email addresses, photos of students, etc, the author(s) have/has obtained the corresponding person's consent.
Development of text to speech system for yoruba languageAlexander Decker
This document describes the development of a text-to-speech (TTS) system for the Yoruba language. It begins with background on TTS systems and an overview of previous work developing TTS for other languages but not extensively for Yoruba. The authors then describe the architecture and design of the Yoruba TTS system they developed using a concatenative synthesis method. This includes analyzing the phonology and syllable structure of Yoruba, and developing components for syllable identification, prosody assignment, and speech signal processing. An evaluation of the system found 70% of respondents found it usable.
Louise Stringer and Paul Iverson from UCL investigated how accent influences word recognition and electrophysiological measures of speech processing for native English and Spanish listeners. They found that a regional Scottish accent and non-native Spanish accent showed some influence on early phonological and lexical processing even in quiet conditions. More intelligible accents in noise elicited larger brain responses, suggesting processing difficulties with accented speech occur even without noise. Accents may affect listeners' expectations about upcoming words.
Predictability of Consonant Perception Ability Through a Listening Comprehens...Kosuke Sugai
This study examined whether a typical English listening comprehension test can predict learners' ability to perceive English consonants. The researchers administered a 30-item listening comprehension test to 107 Japanese EFL learners and selected 22 learners who scored between 25-27. These learners then completed a phoneme judgment task with 17 minimal word pairs differing in initial consonants. The results showed the listening test did not predict learners' consonant perception abilities and that learners with similar listening scores varied in their overall and individual consonant perception skills. The study supports the idea that common listening tests do not measure phonetic abilities.
This study investigated how the brain integrates linguistic and perceptual information during language comprehension using electro- and magnetoencephalography. The researchers found:
1) Linguistically complex words (inflected verbs) engaged a left-lateralized network including temporal and frontal regions, whereas perceptually complex words activated a bilateral network.
2) Functional connectivity analysis revealed partially overlapping neural networks supporting linguistic and perceptual processing, with both enhancing connections between left temporal regions and bilateral frontal regions.
3) Connectivity between left temporal and frontal regions specifically increased for linguistically complex words, suggesting their role in morphosyntactic computations during language comprehension.
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...IJERA Editor
This work presents an application of Fundamental Frequency (Pitch), Linear Predictive Cepstral Coefficient
(LPCC) and Mel Frequency Cepstral Coefficient (MFCC) in identification of sex of the speaker in speech
recognition research. The aim of this article is to compare the performance of these three methods for
identification of sex of the speakers. A successful speech recognition system can help in non critical operations
such as presenting the driving route to the driver, dialing a phone number, light switch turn on/off, the coffee
machine on/off etc. apart from speaker verification-caste wise, community wise and locality wise including
identification of sex. Here an attempt has been made to identify the sex of Bodo speakers through vowel
utterance by following Pitch value, LPCC and MFCC techniques. It is found here that the feature vector
organization of LPCC coefficients provides a more promising way of speech-speaker recognition in case of
Bodo Language than that of Pitch and MFCC.
Portrait poster on
"Text matters but speech influences: A computational analysis of syntactic ambiguity resolution"
in CogSci 2020
Paper available at:
https://cognitivesciencesociety.org/cogsci20/papers/0448/index.html
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
Abstract: In speech synthesis in text to speech systems, the words usually break to different parts and use from recorded sound of each part for play words. This paper use silent in word's pronunciation for better quality of speech. Most algorithms divide words to syllable and some of them divide words to phoneme, but This paper benefit from silent in intonation and divide words at silent region and then set equivalent sound of each parts whereupon joining the parts is trusty and speech quality being more smooth . this paper concern Persian language but extendable to another language. This method has been tested with MOS test and intelligibility, naturalness and fluidity are better.
Keywords:TTS, SBS, Sillable, Diphone.
This document provides an overview of a course on methods and algorithms for speech recognition. The 10-week course covers topics like speech production acoustics, time/frequency representation using digital filters, linear predictive modeling, speech coding, phonetics, speech synthesis, and speech recognition. It requires 3 practical homework assignments and a final assessment. References for further reading on speech processing topics are also provided.
江振宇/It's Not What You Say: It's How You Say It!台灣資料科學年會
This document discusses prosody modeling for Mandarin Chinese speech. It begins with an introduction to prosody and its importance in communication. Prosody can be measured acoustically using features like fundamental frequency, duration, intensity, and pause. A prosodic hierarchy for Mandarin is proposed with different levels like syllable, prosodic word, phrase, and breath group. Unsupervised joint prosody labeling and modeling is introduced as an approach that models observed prosodic features to determine prosodic tags without human perception. Parameters and a hierarchical model are used to represent prosodic structures and model relationships between linguistic information and prosodic-acoustic features.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words which facilitate preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming technique for global alignment and shortest distance measurements. The DPW algorithm can be used to enhance the pronunciation dictionaries of the well-known languages like English or to build pronunciation dictionaries to the less known sparse languages. The precision measurement experiments show 88.9% accuracy.
Won Ik Cho presented on his research related to intention understanding in Korean natural language processing. He discussed developing annotation guidelines and corpora to classify Korean utterances by speech act, considering factors like intonation, context, and rhetoricalness. He proposed a method using text-based analysis combined with speech-aided disambiguation. Future work includes developing structured paraphrasing for argument extraction and an improved dialog manager.
COMPUTATIONAL APPROACHES TO THE SYNTAX-PROSODY INTERFACE: USING PROSODY TO IMPROVE PARSING
1. Computational Approaches to the Syntax-Prosody Interface: Using Prosody to Improve Parsing
Dissertation Defense
Hussein Ghaly
December 12th, 2019
2. Goal and Motivation
Main Goal:
Improve automatic syntactic parsing of spontaneous spoken sentences using prosodic cues
Theoretical Motivation:
● Automatic parsing is negatively affected by syntactic ambiguity (Kummerfeld et al., 2012)
● Prosody can help resolve some syntactic ambiguities (Cutler et al., 1997)
● Syntactic structure is related to prosodic structure (Selkirk, 1986, among many other studies)
3. Challenges and Opportunities
Challenges
- Lack of congruence between syntactic and prosodic structures
- Lack of interdisciplinary engagement in prosody research between computational linguistics and other branches of linguistics
Opportunities
- Availability of parsing frameworks
- Availability of ToBI annotation
- Availability of speech corpora (e.g. the Switchboard Corpus)
- Interest in Natural Language Understanding for speech
4. What is prosody
● “(1) acoustic patterns of F0, duration, amplitude, spectral tilt, and segmental
reduction, and their articulatory correlates, that can be best accounted for by
reference to higher-level structures, and (2) the higher-level structures that
best account for these patterns.” (Shattuck-Hufnagel and Turk, 1996)
● Includes a number of speech phenomena, including: Prosodic phrasing,
Stress, Intonation, Rhythm
● Autosegmental-Metrical Theory (Ladd, 2008) was proposed to organize these
components together
5. Prosodic Structure is a Hierarchy of Constituents
● all languages have hierarchically ordered prosodic structure
● languages make use of the same set of prosodic categories (Elfner, 2018)
(Illustration of prosodic hierarchy, from Elfner (2018))
6. Prosodic structure is much flatter than syntactic structure
Example: "As a matter of fact that's what I'm doing"
● each word is a prosodic word (⍵)
● the prosodic words group into two intermediate (phonological) phrases (iP)
● the iPs group into a single intonational phrase (IP)
(Prosodic structure shown alongside the corresponding syntactic structure)
7. Prosody is influenced by syntax and other factors
Syntax
● Suci (1967): non-syntactically structured word lists have more prosodic variation than syntactically structured sentences
● Prosody can resolve some syntactic ambiguities
● Some syntactic structures are marked prosodically (e.g. parentheticals "John, said Mary, was nice", and tag questions "She's Italian, isn't she?")
● Clause boundaries are marked prosodically (e.g. when John left, I cried)
Other factors
● prosodic grouping can be different from syntactic grouping
○ syntax follows the grouping S (V O), while prosody follows the grouping (S V) O (Martin, 1970)
● speech rate
● utterance length and constituent length
● semantic and pragmatic factors
8. A theoretical model depicts factors affecting prosody
Model for factors influencing prosody (from Turk and Shattuck-Hufnagel, 2014)
● prosodic structure as a theoretical construct, representing the convergence of all these factors
● Constituent length can be added to utterance length factors
9. Syntax-Prosody Interface - some phonological theories
● Indirect Reference: phonological processes apply to prosodic domains
(constituents), which are related to syntactic constituents
○ Selkirk (1986) Align-XP: syntactic constituents share one edge with prosodic constituents
(Align-R or Align-L depending on the language)
○ Truckenbrodt (1995, 1999) Wrap-XP: a constraint that demands that each syntactic phrase is
contained within a phonological phrase
○ Match Theory (Selkirk 2006, 2009, Elfner 2012, Myrberg 2013): syntactic clauses map to
Intonational phrases, syntactic phrases map to phonological phrases, and morphosyntactic
words map to prosodic words
10. ToBI is a system for annotating prosody
ToBI: Tones and Break Indexes
A system for annotating prosodic
information (Silverman et al., 1992)
Based on theories of prosodic
structure by Beckman and
Pierrehumbert (1986)
Break indexes (0-4) reflect
disjuncture levels between words
from (Veilleux et al., 2006)
11. Part 1 - The Effect of Syntactic Phrase
Length on Prosody
12. Phrase length affects prosody of double center embedded sentences in English
Double Center Embedded sentences (from Fodor and Nickels, 2011):
- Encouraging phrase length (ENC) (short inner phrases): split into 3 chunks
- Discouraging phrase length (DISC) (long inner phrases): split into 4+ chunks
ENC:  NP1 | NP2 | NP3 | VP1 | VP2 | VP3
      the rusty old ceiling pipes | that the plumber | my dad | trained | fixed | continue to leak occasionally
DISC: NP1 | NP2 | NP3 | VP1 | VP2 | VP3
      the pipes | that the unlicensed plumber | the new janitor | reluctantly assisted | tried to repair | burst
13. No difference in prosody due to phrase length was found in French
Desroses (2014) manually examined the frequency of pauses (silent intervals >= 250 ms) at the edges of syntactic constituents and found no difference between the two sentence types:
ENC: Le joli ballon jaune vif (1) que l'enfant (2) que le maître (3) punit (4) lâcha (5) est vraiment coincé dans l'arbre.
DISC: Le ballon (1) que le jeune enfant (2) que le maître d'école (3) punit très souvent (4) lâcha bêtement (5) est jaune.

       Before NP2 (loc. 1)   After NP2 (loc. 2)   Before VP2 (loc. 4)   After VP2 (loc. 5)
ENC    6.7 %                 14.07 %              22.6 %                28.15 %
DISC   8.5 %                 10.4 %               19.63 %               27.04 %
14. Data reanalyzed using judge annotation and forced alignment
Re-analyzing recordings collected by Desroses (271 ENC and 272 DISC
recordings) to identify the prosodic boundaries at the edges of syntactic phrases by:
● Obtaining judgments by two native speakers of French, of where they perceive
the prosodic boundaries, in a subset of recordings (48 recordings)
● Using forced alignment (automatically mapping each word to its corresponding
portion of the audio file), to obtain silent pause durations between words (397
recordings)
15. Sentences were presented to judges for annotation
An example sentence of the set presented to judges
16. Forced alignment indicated pauses between words
● Montreal Forced Alignment (McAuliffe et al., 2017) was used
● edges of syntactic phrases are identified manually in a copy of the sentence,
for example:
○ Le ballon <1> que le jeune enfant <2> que le maître d'école <3> punit très souvent <4> lâcha
bêtement <5> est jaune.
● Words in the forced alignment with their start and end times are mapped to
those in the copy
● Pause values are calculated at the five locations
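The pause calculation at the marked locations can be sketched as follows (a minimal illustration, not the original implementation; the alignment tuple format and the `<n>` marker convention are assumptions based on the example above):

```python
import re

def pauses_at_locations(aligned_words, marked_sentence):
    """Compute silent-pause durations at marked phrase edges.

    aligned_words: list of (word, start_sec, end_sec) tuples from the
    forced alignment, in sentence order.
    marked_sentence: sentence text with boundary markers such as <1> ... <5>
    inserted at the edges of syntactic phrases (markers are assumed to fall
    between words, never sentence-initially).
    Returns {location: pause_sec}: the gap between the end of the word
    before the marker and the start of the word after it.
    """
    pauses = {}
    word_idx = 0  # position in aligned_words of the next word to consume
    for token in marked_sentence.split():
        marker = re.fullmatch(r"<(\d+)>", token)
        if marker:
            prev_end = aligned_words[word_idx - 1][2]
            next_start = aligned_words[word_idx][1]
            pauses[int(marker.group(1))] = max(0.0, next_start - prev_end)
        else:
            word_idx += 1
    return pauses
```

For example, an alignment in which "ballon" ends at 0.5 s and "que" starts at 0.7 s yields a 200 ms pause at location 1.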
17. Judges indicated a difference in the average number of breaks for each sentence type
- average number of prosodic boundaries over all five syntactic boundary locations, for each judge (48 recordings)
- first judge: ENC 2.43, DISC 2.92; second judge: ENC 2.5, DISC 3.2
18. Forced alignment indicated more pauses before VP2 for DISC sentences
● At location 4, the percentage of ENC sentences with a pause of 250 ms or greater was 14%, versus 19.1% for DISC sentences (397 recordings)
● Average pause duration at location 4: 105 ms (ENC), 154 ms (DISC)
○ After excluding recordings with pauses > 1 second: 80 ms (ENC, 189 recordings), 110 ms (DISC, 202 recordings); p-value .056
19. Part 2 - Resolving Syntactic Ambiguities
Using Prosody
20. Can prosody resolve syntactic ambiguity?
Goal: examine whether it is possible to identify the syntactic attachment using prosody, both by human listeners and by computers
I saw the boy with the telescope
21. Comma ambiguity and PP-attachment ambiguity are investigated
An experiment for production and perception of sentences with ambiguities:
● Comma ambiguity:
○ John, said Mary, was the nicest person at the party.
○ John said Mary was the nicest person at the party.
● PP-attachment ambiguity:
○ I have a new telescope. I saw the boy with the telescope.
○ One of the boys got a telescope. I saw the boy with the telescope.
22. Ambiguous sentences were recorded by speakers and presented to listeners
Using crowdsourcing, through Amazon Mechanical Turk (MTurk), workers were
recruited for the production and perception experiment
Production Experiment
● Ambiguous sentences were recorded by a number of naive native speakers
● Recordings by 6 speakers, with the clearest recordings, were selected
Perception Experiment
● Ambiguous sentences were presented to naive human participants both in
audio and text formats, to answer comprehension questions
● Experiment was organized in phases, where in each phase questions were
based on the recordings of only one of the speakers
23. Listeners answer questions about ambiguous sentences
Question Types
1. Comma-ambiguity - Text
2. PP-attachment ambiguity with context - Text
3. PP-attachment ambiguity without context - Text
4. Comma-ambiguity - Audio
5. PP-attachment ambiguity with context - Audio
6. (and 7) PP-attachment ambiguity without context - Audio (two
different sentences of this question type)
Example: https://champolu.net/mturk/listen.html?abcd
24. PP-attachment ambiguity is more accurately resolved in audio than in text
● Participants' disambiguation accuracy: text 49%, audio 63% (p-value < .001, independent t-test)
● Results are after excluding sentences not understood properly even with context, and listeners with overall low comprehension accuracy

Question Type                                     Accuracy
comma ambiguity - text                            99%
comma ambiguity - audio                           92%
PP-attachment ambiguity - text - with context     97%
PP-attachment ambiguity - audio - with context    98%
PP-attachment ambiguity - text - no context       49%
PP-attachment ambiguity - audio - no context      63%
25. Larger pauses yield better accuracy for high attachment sentences
● Higher pause values and higher normalized duration of the last NP word lead to higher disambiguation accuracy by listeners for sentences with high attachment
Normalized duration: actual duration of the word divided by its expected duration, where expected duration is the sum of the speaker's average duration for each phoneme

Pause (ms)    high attachment           low attachment
              N     avg. accuracy       N     avg. accuracy
0             73    35.28%              103   82.26%
10            11    41.41%              11    76.73%
20+           34    66.93%              4     57.29%

Normalized    high attachment           low attachment
duration      N     avg. accuracy       N     avg. accuracy
<1.0          10    22.08%              32    84.84%
1             17    28.53%              39    77.16%
1.1           18    52.03%              23    85.44%
1.2           22    44.29%              15    79.63%
>1.2          51    52.74%              7     19.72%
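The normalized-duration measure can be sketched as follows (a minimal illustration; the phoneme labels and the per-speaker average-duration table are assumptions, not the original implementation):

```python
def normalized_duration(word_phonemes, actual_duration, avg_phoneme_dur):
    """Normalized duration = actual duration / expected duration.

    word_phonemes: list of phoneme labels for the word (e.g. from the aligner).
    actual_duration: measured duration of the word in seconds.
    avg_phoneme_dur: speaker-specific mapping phoneme -> average duration (sec).
    Expected duration is the sum of the speaker's average duration for each
    phoneme in the word; values above 1.0 mean the word was spoken longer
    than expected.
    """
    expected = sum(avg_phoneme_dur[p] for p in word_phonemes)
    return actual_duration / expected
```

For instance, a word whose phonemes average 0.2 s in total for this speaker, but which was spoken in 0.3 s, gets a normalized duration of 1.5.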
26. Machine learning predicts attachment of recorded sentences
● Based on the data just presented, a machine learning system (decision trees) used pause and duration values as features to predict the attachment.
● System accuracy ranged from 63% to 73%, depending on how the data are split into training and test portions (e.g. the system performs better when training and testing on different portions of the recordings of the same speaker)

Speaker ID   shuffled all   intra-speaker    odd speaker   odd sentence   odd recording   Listener
                            classification   out           out            out             accuracy
my64         75.0%          50.0%            67.5%         66.3%          60.0%           64.0%
wdn          62.5%          67.5%            55.0%         53.8%          70.0%           54.0%
ds           70.0%          90.0%            70.0%         66.3%          85.0%           83.0%
dz           57.5%          60.0%            52.5%         57.5%          65.0%           56.0%
mm           70.0%          87.5%            70.0%         70.0%          85.0%           52.0%
tk           69.4%          83.2%            61.1%         65.3%          77.5%           69.0%
Average      67.4%          73.0%            62.7%         63.2%          73.8%           63.0%
27. Does attachment affect prosody in spontaneous sentences?
● A corpus analysis was conducted for the syntactic and prosodic data in the ToBI-annotated subset of the Switchboard corpus (SWB) (Godfrey et al., 1992), covering 150 different speakers
● The focus was on PP-attachment ambiguity and relative clause attachment (RC-attachment) ambiguity
● An algorithm was developed to identify instances of such ambiguities in the syntactic data:
○ PP-attachment: instances of a noun phrase (NP) immediately followed by a prepositional phrase (PP)
○ RC-attachment: instances of an NP immediately followed by a relative clause (SBAR)
○ Low attachment is when there is a larger NP spanning both constituents (NP + PP or NP + SBAR); otherwise high attachment
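The identification algorithm can be sketched as follows (a minimal illustration over bracketed constituency trees, not the original implementation; the nested-list tree encoding is an assumption):

```python
def leaves(node):
    """Collect the words under a constituency-tree node.
    Trees are nested lists [label, child, ...]; children are
    subtrees or word strings."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def attachment_instances(node, right_label="PP", found=None):
    """Scan a tree for an NP immediately followed by a sibling with
    right_label ("PP" for PP-attachment, "SBAR" for RC-attachment).
    Low attachment when the parent node is itself an NP spanning both
    constituents; otherwise high attachment."""
    if found is None:
        found = []
    if isinstance(node, str):
        return found
    kids = [c for c in node[1:] if not isinstance(c, str)]
    for left, right in zip(kids, kids[1:]):
        if left[0] == "NP" and right[0] == right_label:
            label = "low" if node[0] == "NP" else "high"
            found.append((" ".join(leaves(left)), " ".join(leaves(right)), label))
    for child in kids:
        attachment_instances(child, right_label, found)
    return found
```

For "I saw the boy with the telescope", the low-attachment parse places the PP inside a larger NP ("the boy with the telescope"), while the high-attachment parse makes the PP a sibling of the NP under the VP, and the sketch labels the instances accordingly.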
28. Examples of sentences with PP-attachment identified by the algorithm
(low-attachment and high-attachment parse trees shown)
29. Examples of sentences with RC-attachment identified by the algorithm
(low-attachment and high-attachment parse trees shown)
30. Attachment affects the distribution of prosodic breaks
● At the end of NP (before PP or SBAR), identify ToBI break index (from the
Switchboard corpus)
● Effect of RC-attachment is much stronger than PP-attachment
(break-index distributions shown for PP-attachment and RC-attachment)
31. Phrase length also affects the distribution of prosodic breaks
● Consistent with Shafran and Fodor (2016) and Watson and Gibson (2004), phrase length affects the likelihood of prosodic breaks
● More than 75% of high-attachment instances with ToBI 1 have short NPs (<3 words)
● 50% of low-attachment instances with ToBI 3, 4 have longer PPs (4+ words)

attachment    ToBI 0        ToBI 1          ToBI 2       ToBI 3        ToBI 4         Grand Total
high          27 (0.96%)    842 (30.04%)    44 (1.57%)   145 (5.17%)   212 (7.56%)    1270 (45.31%)
low           166 (5.92%)   1139 (40.64%)   43 (1.53%)   89 (3.18%)    96 (3.42%)     1533 (54.69%)
Grand Total   193 (6.89%)   1981 (70.67%)   87 (3.10%)   234 (8.35%)   308 (10.99%)   2803 (100.00%)

NP phrase length, high-attachment instances with ToBI 1:
NP length   Count (%)
1 word      425 (15.16%)
2 words     228 (8.13%)
3 words     101 (3.60%)
4+ words    88 (3.14%)
Total       842 (30.04%)

PP phrase length, low-attachment instances with ToBI 3 and 4:
PP length   ToBI 3        ToBI 4
1 word      1 (0.04%)     -
2 words     21 (0.75%)    16 (0.57%)
3 words     23 (0.82%)    32 (1.14%)
4+ words    44 (1.57%)    48 (1.71%)
Total       89 (3.18%)    96 (3.42%)
32. Can we predict attachment from phrase length and prosody?
● 2803 instances of PP-attachment:
○ 1270 high attachment (45%)
○ 1533 low attachment (55%)
● 1559 instances of RC-attachment:
○ 739 high attachment (47%)
○ 820 low attachment (53%)

Sample of the data compiled (features and label):
ambiguity   sentence ID     NP size (words)   PP/SBAR size (words)   ToBI break index   low attachment
ppa         sw4890.A-s89    1                 3                      1                  FALSE
ppa         sw4890.B-s72    2                 3                      4                  FALSE
ppa         sw2018.A-s144   1                 2                      1                  TRUE
ppa         sw2018.A-s145   1                 2                      1                  TRUE
ppa         sw2018.A-s157   1                 2                      3                  TRUE
rca         sw4890.B-s72    1                 4                      3                  FALSE
rca         sw4890.B-s73    3                 4                      4                  FALSE
rca         sw4890.B-s8     1                 4                      4                  FALSE
rca         sw2018.A-s131   3                 4                      1                  TRUE
rca         sw2018.B-s163   3                 4                      4                  TRUE
33. Machine learning predicts attachment based on prosody and phrase length
● Using machine learning (decision trees), with different feature combinations
● PP-attachment prediction using prosody only is statistically significant (p-value < .001, independent t-test)
● There is an improvement when prosody is combined with phrase length, but it is not statistically significant (p-value .078)

Set Description             Features                             Accuracy (%)
RC-attachment instances     ToBI                                 69.02
                            Length of NP                         62.80
                            Length of NP, ToBI                   71.14
                            Length of NP, Length of SBAR         64.85
                            Length of NP, Length of SBAR, ToBI   71.20
                            Length of SBAR                       57.79
                            Length of SBAR, ToBI                 69.60
PP-attachment instances     ToBI                                 60.54
                            Length of NP                         60.93
                            Length of NP, ToBI                   63.47
                            Length of NP, Length of PP           61.04
                            Length of NP, Length of PP, ToBI     63.40
                            Length of PP                         55.19
                            Length of PP, ToBI                   60.86
35. Can prosody be used to improve parsing?
Goal: build a computational system that uses prosody to improve parsing of spontaneous sentences in the Switchboard Corpus
Motivation: previous computational approaches (e.g. Kahn et al. (2005), Huang and Harper (2010), Tran et al. (2017)) attempted this. This work proceeds in the same direction, informed by the theoretical foundation of the syntax-prosody relationship, mainly semantic coherence
36. Hypothesis: Syntax-prosody correspondences improve parsing
Hypothesis 1: There are elements of correspondence between prosody and syntax that can be extracted from the syntactic structure
Hypothesis 2: Using these correspondences, along with prosodic information, we can select the most appropriate parse for an utterance
37. Parsing is identifying the structure of a sentence
● Constituency parsing: a hierarchy of syntactic constituents
● Dependency parsing: dependent-head relationships
○ Main metric: Unlabeled Attachment Score (UAS), the percentage of heads identified correctly
● Dependency parsing is now the norm in computational linguistics:
○ Faster, scalable to new languages, represents semantic relationships
○ Provides the same information as constituency parsing, plus head information
● Dependency structure has not been used much in prosody research
○ Exception: Pate and Goldwater, 2014
(constituency and dependency parses of an example sentence shown)
38. Semantic coherence affects likelihood of prosodic breaks
● Selkirk (1984): distribution of intonational phrase boundaries can be accounted
for by a semantic constraint called the Sense Unit Condition (SUC):
○ The immediate constituents of an intonational phrase must be semantically related
■ a. John gave the book // to Mary.
■ b. * John gave // the book to Mary.
■ c. John gave // the book // to Mary (examples from Watson and Gibson, 2004)
● Ferreira (1988) and Watson and Gibson (2004): developed algorithms for
predicting likelihood of prosodic breaks, predicting higher likelihood when there
is no dependency and semantic coherence between words
39. Dependency configurations correspond to semantic coherence
● The concept of “dependency configurations” is proposed here to quantify
semantic coherence between adjacent words, based on dependency
structure
● It is defined in terms of dependency offsets: the distance (measured by
number of words) between a word and its head
● For each word, the offset is quantified as:
○ 0 if the word is root
○ +1 if it depends on the word immediately to the right
○ +2 if it depends on a word further to the right
○ -1 if it depends on the word immediately to the left
○ -2 if it depends on a word further to the left
● Each pair of consecutive words is characterized by a duple of offsets (e.g. (+1, -2)) describing its configuration
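The offset quantization above can be sketched as follows (a minimal illustration; the CoNLL-style head encoding, with 0 for the root, is an assumption):

```python
def dependency_offsets(heads):
    """heads[i] gives the 1-based position of the head of word i+1,
    with 0 for the root (CoNLL-style). Quantize each word's offset:
    0 for the root; +1/-1 if the head is the adjacent word to the
    right/left; +2/-2 if the head is further to the right/left."""
    offsets = []
    for pos, head in enumerate(heads, start=1):
        if head == 0:
            offsets.append(0)
        elif head == pos + 1:
            offsets.append(+1)
        elif head > pos:
            offsets.append(+2)
        elif head == pos - 1:
            offsets.append(-1)
        else:
            offsets.append(-2)
    return offsets

def dependency_configurations(heads):
    """Pair consecutive words' offsets into configuration duples."""
    offsets = dependency_offsets(heads)
    return list(zip(offsets, offsets[1:]))
```

For "I saw the boy" with heads [2, 0, 4, 2] (I→saw, saw=root, the→boy, boy→saw), the offsets are (+1, 0, +1, -2) and the configurations are (+1, 0), (0, +1), (+1, -2).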
40. There are 12 different observed dependency configurations
Examples from the Switchboard Corpus, converted to dependency structure by
Honnibal and Johnson (2014)
41. Dependency configurations correspond to prosodic breaks
● Configuration (-1, +1) accounts for 35% of ToBI 4 and 26% of ToBI 3
● Configurations (+2, +1) and (-1, -2) combined account for 41% of ToBI 4 and 38% of ToBI 3
● If there is a direct dependency between two consecutive words, there is a smaller likelihood of a prosodic break between them

configuration   ToBI 1   ToBI 2   ToBI 3   ToBI 4   Grand Total
(+2, +1)        15657    1550     602      860      18669
(+1, -2)        10328    402      203      135      11068
(-1, +1)        6125     557      726      1456     8864
(-1, -1)        6062     281      308      358      7009
(+1, 0)         6032     233      114      94       6473
(-1, -2)        3268     334      465      829      4896
(+1, +1)        3308     102      42       30       3482
(0, -1)         2802     130      120      115      3167
(0, +1)         2645     130      174      154      3103
(+2, -1)        1011     25       31       22       1089
(-1, 0)         117      11       27       62       217
(0, 0)          74       30       2        7        113
Grand Total     57429    3785     2814     4122     68150
42. Features are extracted from parse hypotheses and prosodic information
Lexical features: word, head word; syntactic: POS, configuration; prosodic: normalized duration, pause after

word   head word   POS   config     normalized duration   pause after
so     know        RB    (+2, +1)   3.23                  0.31
i      know        PRP   (+1, 0)    0.45                  0
know   -           VBP   (0, +1)    1.13                  0
what   said        WP    (+2, +1)   0.97                  0
they   said        PRP   (+2, +1)   1.62                  0
've    said        VBP   (+1, -2)   0.71                  0
said   know        VBN   N/A        2.03                  0
44. Recurrent Neural Networks offer a lot of flexibility
- RNNs accept inputs of variable length, with categorical and continuous features, and produce variable-length output
- Long Short-Term Memory (LSTM), a variant of RNNs designed to better capture long-range dependencies, is used here
source: https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/03/1-768x421.png
45. System extracts features from parses and predicts correct heads
Training stage: sentence + parse hypothesis + sentence acoustics → feature extraction; the gold-standard parse provides the correct heads as the training outcome
Testing stage: sentence + parse hypothesis + sentence acoustics → feature extraction → predicted correct heads for each parse hypothesis
46. System scores parses and selects the most likely parse
Using syntactic features from the parse hypotheses and acoustic information, the system makes per-word predictions about which parse is more likely; the predictions are summed into a parse score:
spaCy (UAS: 0.83):     0.75 0.87 0.69 1.01 0.75 1.04   →  Sum: 5.11
clearNLP (UAS: 0.67):  0.75 0.77 0.78 1.02 0.58 0.88   →  Sum: 4.78
syntaxnet (UAS: 0.33): 0.03 0.76 0.30 0.89 0.33 0.84   →  Sum: 3.15
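The scoring-and-selection step can be sketched as follows (a hypothetical helper, not the original implementation; the per-word scores are the predictions shown on this slide):

```python
def select_parse(parse_scores):
    """parse_scores: mapping parser name -> list of per-word prediction
    scores for that parser's hypothesis. Sum each hypothesis's per-word
    scores and return (best_parser, totals), where the best parse is the
    one with the highest total."""
    totals = {name: round(sum(scores), 2) for name, scores in parse_scores.items()}
    best = max(totals, key=totals.get)
    return best, totals
```

With the scores above, spaCy's hypothesis gets the highest sum (5.11) and is selected, matching the parser with the highest UAS.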
47. Prosody can improve parsing for Switchboard data
System      Text features                  Prosodic features     UAS Dev   UAS Test
clearnlp    -                              -                     79.76     79.59
spacy       -                              -                     79.06     78.91
syntaxnet   -                              -                     72.54     72.81
Oracle      -                              -                     85.93     85.89
Ensemble    POS, configs                   -                     80.69     80.73
Ensemble    POS, configs                   Dur, dur log, pause   81.21     81.17
Ensemble    Lexical, POS, configs, links   -                     83.47     83.36
Ensemble    Lexical, POS, configs, links   Dur, pause            83.51     83.39
48. Further improvements are possible
● Only duration and pauses were used; pitch, intensity, and other acoustic information can still be used in further work
● Phrase length information was not used in any of the features
● Speech repairs and disfluencies are marked by prosodic cues but were not addressed in this study
● Other correspondences between prosody and dependency structure were suggested in this study but need further development (dependency chunks)
49. Output analysis doesn't indicate clear improvement patterns
● The output analysis did not show clear improvement patterns in the following:
○ sentences with PP-attachment
○ sentences with RC-attachment
○ sentences with parentheticals
○ sentences with speech repairs
● By sentence size, the largest improvement was for sentences of 3-8 words; patterns for larger sentences are unclear, but the improvement is mainly smaller

                                                            UAS Dev   UAS Test
UAS improvement (POS + configs) with prosody                0.51      0.44
Sentences with improved UAS                                 268       240
Sentences with worse UAS                                    178       180
Sentences with the same UAS                                 4970      5036
p-value (paired t-test comparing UAS for all sentences)     < .001    < .001
50. Conclusions
● Part 1:
○ Syntactic phrase length affects prosodic phrasing, also in French
● Part 2:
○ Syntactic ambiguity can be resolved prosodically by speakers
○ Prosodic cues can be used by human listeners and computers to predict the syntactic structure
○ Syntactic phrase length also affects prosodic phrasing in speaking, and can be used by
computers as a factor, along with prosody, to improve prediction of the structure
● Part 3:
○ Certain syntactic information (dependency configurations), based on dependency structure,
relates to prosodic breaks
○ Using this information together with timing (pause and duration) is more useful for selecting
better parses than syntactic information only
○ The ensemble system yields better performance than any individual parser in the ensemble
51. Final Note
● This dissertation is an interdisciplinary work, building on prosody research
from phonology and psycholinguistics towards computational goals
● Using dependency structure can provide a new perspective for investigating the syntax-prosody relationship