More than Words: Advancing Prosodic Analysis

Andrew Rosenberg
City Tech Colloquium
February 5, 2015

Prosody
Syntax Semantics Pragmatics Paralinguistics
Mary knows; you can do it. 
Mary knows you can do it.
Bill doesn’t drink because
he’s unhappy
Going to Boston.
Going to Boston?
Three Hundred Twelve.
Three Thousand Twelve.

Prosody in Text
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY
THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND
THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS
REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT
PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION
NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK
TO THE RED LINE JUST AS EASILY

Prosody in Text
Also, from the North Station...
(I think the Orange Line runs by there too so you can also catch the
Orange Line... )
And then instead of transferring
(um I- you know, the map is really obvious about this but)
Instead of transferring at Park Street, you can transfer at (uh what’s the
station name) Downtown Crossing and (um) that’ll get you back to the
Red Line just as easily.

Prosody in Text
I sooo hate you right now :-)
mondays :,(
Conner Thiele @St04hoEs:
Madison people are so funny #sarcasm
Dodie Clark @doddleoddle:
RePlAcEmEnT bus SerVicEs are mY fAvOURITE
#sARcASM.
Michelle Lee @mlee418
finding someone who loves makeup just as much as me
makes me feel warm inside #notkidding

Prosody in Spoken Language Processing
• Recognizing Emotions.  
Frustration and Anger in Call Centers
• Inserting punctuation in speech transcripts. 
Notably, not in mobile voice input yet…
• Speaker Recognition
• Speaking Style Recognition
• Recognizing Native Language, Gender, Speaker Roles
• Improving performance of other spoken language processing
tasks. Parsing, Discourse Structure, Intent Recognition.  
Today: Identifying (possibly misrecognized) names in speech

Dimensions of Prosodic Variation
Pitch (in blue)
Intensity (in green)
Duration of words/syllables
Presence of Silence
Spectral Qualities

ToBI
• High level dimensions of prosodic variation.
• Tones and Break Indices
• High and Low tones describe prosodic events,
pitch accent and phrasing.
• Break indices describe the degree of disjuncture
between words.
• Two hierarchical levels of phrasing: intermediate
and intonational

Dimensions of Prosodic Variation
Prominence (bold word)
H* L* L*+H L+H* H+!H*
Phrasing (end of phrase)
L-L% L-H% H-H% H-L% !H-L%
Example utterances:
Mother Theresa
Give me the brown one
is that Mariana’s money?
do you really think it’s that one? (x2)
get on the harvard square T stop
leave the government center T stop
we will go through central
through Boylston
go from Harvard Square

How is prosody used?
Symbolic
• Modular
• Linguistically Meaningful
• Reduced Dimensionality
Acoustic Features (D = 100s-1000s) → Symbolic Analysis (D = 10-20) → Task Specific
Direct
• Task-Appropriate
• Lower information loss (general)
• High Dimensionality
Acoustic Features (D = 100s-1000s) → Task Specific
Learned Representations
• Modular
• Task-Appropriate
• Linguistically Meaningful
• Low information loss
• Reduced Dimensionality
Acoustic Features (D = 100s-1000s) → Learned Representation (D = 10-20) → Task Specific
Goal: compact, consistent, universal

Direct Modeling
• Topic and Sentence Segmentation. 
[Liu et al. 2008, Rosenberg et al. 2006, Ostendorf et al. 2008 etc.]
• Lexical: n-grams, POS-tags, TextTiling, Lexical Chains and
other Coherence measures
• Prosodic: measures of acoustic “reset” across candidate
boundaries.
• Question Recognition for Spoken Dialog Systems 
[Liscombe et al. 2006]
• Lexical: n-grams, POS tags, filled pauses
• Prosodic: pitch slope in last 200 ms, pausing, loudness

Contour Modeling
[Figure: pitch in blue, intensity in green]

Quantized Contour Modeling
• Each syllabic contour is laid onto an N-by-M grid with normalized
time and range. Results in an M element vector with an N-sized
vocabulary.
Rosenberg 2010
• This allows for a simple classification strategy:
$$\mathrm{type}^* = \arg\max_{\mathrm{type}} \; p(\mathrm{type}) \prod_{i}^{M} p(C_i \mid \mathrm{type}, i)$$
$$\mathrm{type}^* = \arg\max_{\mathrm{type}} \; p(\mathrm{type}) \prod_{i}^{M} p(C_i \mid C_{i-1}, \mathrm{type}, i)$$
[Figure panels: L-L%, L-H%]
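The quantization and the first (position-only) classification rule can be sketched as follows. This is a minimal illustration, not the implementation from Rosenberg 2010: the function names, the nearest-sample time normalization, the add-one smoothing, and the labels "rise" and "fall" are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def quantize(contour, n_bins, m_steps):
    """Map a contour (list of floats) to m_steps symbols drawn from range(n_bins)."""
    lo, hi = min(contour), max(contour)
    span = (hi - lo) or 1.0  # avoid division by zero on flat contours
    symbols = []
    for j in range(m_steps):
        # Normalized time: take the sample nearest the centre of time slice j.
        pos = (j + 0.5) / m_steps * (len(contour) - 1)
        value = contour[round(pos)]
        # Normalized range: scale to [0, 1] and bucket into n_bins levels.
        level = min(int((value - lo) / span * n_bins), n_bins - 1)
        symbols.append(level)
    return symbols

def train(examples, n_bins, m_steps):
    """examples: list of (label, contour). Returns (log_prior, log_cond)."""
    type_counts = Counter()
    cell_counts = defaultdict(Counter)  # (label, position) -> symbol counts
    for label, contour in examples:
        type_counts[label] += 1
        for i, c in enumerate(quantize(contour, n_bins, m_steps)):
            cell_counts[(label, i)][c] += 1
    total = sum(type_counts.values())
    log_prior = {t: math.log(k / total) for t, k in type_counts.items()}

    def log_cond(c, t, i):
        # log p(C_i = c | type = t, i) with add-one smoothing (assumption).
        return math.log((cell_counts[(t, i)][c] + 1) / (type_counts[t] + n_bins))

    return log_prior, log_cond

def classify(contour, log_prior, log_cond, n_bins, m_steps):
    """argmax over types of log p(type) + sum_i log p(C_i | type, i)."""
    symbols = quantize(contour, n_bins, m_steps)
    return max(
        log_prior,
        key=lambda t: log_prior[t] + sum(log_cond(c, t, i) for i, c in enumerate(symbols)),
    )

examples = [("rise", [0, 1, 2, 3, 4]), ("fall", [4, 3, 2, 1, 0])]
log_prior, log_cond = train(examples, n_bins=3, m_steps=3)
predicted = classify([0, 2, 4, 6, 8], log_prior, log_cond, n_bins=3, m_steps=3)
```

The second rule on the slide additionally conditions each symbol on the previous one; that would need counts keyed by (label, position, previous symbol).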

Approximate Curve Fitting
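• Polynomial fitting
$$\tilde{x}(t) = \sum_{i=0}^{k} a_i t^i$$
• Legendre polynomials [orthogonal bases]
$$\tilde{x}(t) = \sum_{i=0}^{k} a_i L_i(t)$$
$$L_0 = 1; \quad L_1 = x; \quad L_2 = \frac{1}{2}(3x^2 - 1)$$
$$L_n = \frac{1}{2^n} \sum_{k=0}^{n} \binom{n}{k}^2 (x - 1)^{n-k} (x + 1)^k$$
• Coefficients become the representation
$$f(\vec{x}) = \vec{a}$$
[Figure: Legendre polynomials, from Wikipedia]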
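A Legendre fit of a contour can be sketched with NumPy's `numpy.polynomial.legendre` module. The function name `legendre_coefficients` and the choice to map the contour's time axis linearly onto [-1, 1] are assumptions of this example; the slides do not specify how time is normalized.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coefficients(contour, order=2):
    """Least-squares Legendre coefficients a_0..a_order of a sampled contour."""
    contour = np.asarray(contour, dtype=float)
    # Legendre polynomials are orthogonal on [-1, 1], so map time onto it.
    t = np.linspace(-1.0, 1.0, len(contour))
    return legendre.legfit(t, contour, order)

# Build a synthetic contour from known coefficients and recover them.
t = np.linspace(-1.0, 1.0, 50)
true_coeffs = np.array([2.0, 0.5, -1.0])
contour = legendre.legval(t, true_coeffs)
coeffs = legendre_coefficients(contour, order=2)
```

With order 2 the three coefficients roughly describe the contour's level, slope, and curvature, which is what makes them usable as a compact shape representation.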

Interactions
• Most shape representations ignore the interaction
between different information streams.
• Pitch is assumed to be the most relevant dimension of
intonation.
• Combined Pitch and Energy contour. 
Can be viewed as weighting the importance of pitch
values by the energy.
• Energy and Duration (Area under Contour)
• Very simple feature.
• Improves pitch accent detection 
by >3% absolute
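The two interaction features mentioned above could be computed roughly as below. The slides give no formulas, so both definitions are assumptions: the area is taken as a trapezoidal integral of the energy contour over time, and the combined contour is summarized as an energy-weighted mean of pitch.

```python
def area_under_contour(energy, frame_step_s):
    """Trapezoidal area under an energy contour sampled every frame_step_s seconds."""
    return sum((a + b) / 2.0 * frame_step_s for a, b in zip(energy, energy[1:]))

def energy_weighted_mean_pitch(pitch, energy):
    """Mean pitch where each frame is weighted by its energy."""
    total = sum(energy)
    return sum(p * e for p, e in zip(pitch, energy)) / total

area = area_under_contour([1.0, 1.0, 1.0], frame_step_s=0.01)
weighted = energy_weighted_mean_pitch([100.0, 200.0], [1.0, 3.0])
```

A longer or louder syllable yields a larger area, so the single number captures energy and duration jointly.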

Symbolic Modeling: AuToBI
• Automatic ToBI labeling toolkit.
• Unified feature extraction and ToBI label prediction
• Released under Apache 2.0
• Extensible Feature Extraction Framework
• Low-level digital signal processing: pitch, spectrum, intensity, FFV
• Unique features: Automatic syllabification; shape modeling; context-
sensitive features
• Applied to English, German, Spanish, Portuguese, Mandarin, French
Acoustic Features (D = 100s-1000s) → Symbolic Analysis (D = 10-20) → Task Specific

Feature Extraction in AuToBI
Requested Features
mean[context[norm[log[F0]],A]]
mean[context[norm[log[F0]],B]]
mean[context[norm[log[F0]],C]]
[Figure: dependency graph of the requested features: F0 → log F0 → normalized log F0 → Context → Mean]
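One way to read the nested feature names is as a dependency chain in which shared intermediate features appear only once. The sketch below parses the names and produces an extraction order. It is an illustration only and is not AuToBI's code: `split_call` and `extraction_plan` are hypothetical helpers, and treating the first argument as the input feature and the remaining arguments as parameters is an assumption.

```python
def split_call(name):
    """Split 'op[arg1,arg2]' into ('op', ['arg1', 'arg2']); return None for a leaf such as 'F0'."""
    if not name.endswith("]") or "[" not in name:
        return None
    op, inner = name[:-1].split("[", 1)
    args, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            # Only commas outside nested brackets separate arguments.
            args.append(inner[start:i])
            start = i + 1
    args.append(inner[start:])
    return op, args

def extraction_plan(requested):
    """Return feature names ordered so each feature follows the feature it is computed from."""
    order = []

    def visit(name):
        if name in order:
            return  # already scheduled, shared by an earlier request
        call = split_call(name)
        if call is not None:
            _, args = call
            visit(args[0])  # first argument is the input feature (assumption)
        order.append(name)

    for name in requested:
        visit(name)
    return order

plan = extraction_plan([
    "mean[context[norm[log[F0]],A]]",
    "mean[context[norm[log[F0]],B]]",
    "mean[context[norm[log[F0]],C]]",
])
```

For the three requested features, the plan lists F0, log[F0], and norm[log[F0]] once each, followed by the three context and mean features.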

Correcting Classifiers for Prominence Detection
• Examine the predictive power of Intensity drawn
from 210 different spectral regions. 
[Rosenberg & Hirschberg 2006, 2007]
etc.
[My name is Randy Keller]

Correcting Classifiers
• For each ensemble member, train an additional correcting
classifier — using pitch, and duration features.

• Predict if an ensemble member will be correct or incorrect
• Invert the prediction if the correcting classifier predicts
incorrect.
$$\mathrm{score}(A) = \sum_{i}^{N} \theta(A \mid x_i)\,\psi(C \mid y_i) + (1 - \theta(\neg A \mid x_i))\,(1 - \psi(\neg C \mid y_i))$$
θ: Energy Classifier; ψ: Correcting Classifier
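The hard-decision version of the bullets above (predict, invert when the corrector says "incorrect", then aggregate) can be sketched as follows. This is a simplified illustration with majority voting, not the weighted score from the slide; the function name and the list-of-lists input layout are assumptions.

```python
def corrected_vote(member_preds, corrector_says_correct):
    """Majority vote over ensemble members after applying each member's corrector.

    member_preds[m][j]: member m's boolean prediction for item j.
    corrector_says_correct[m][j]: True if member m's corrector expects that
    prediction to be correct.
    """
    n_items = len(member_preds[0])
    votes = [0] * n_items
    for preds, oks in zip(member_preds, corrector_says_correct):
        for j, (pred, ok) in enumerate(zip(preds, oks)):
            # Invert the prediction when the corrector predicts "incorrect".
            corrected = pred if ok else not pred
            votes[j] += int(corrected)
    return [v * 2 > len(member_preds) for v in votes]

member_preds = [[True, False], [True, True], [False, False]]
corrector_ok = [[True, True], [False, True], [True, False]]
decisions = corrected_vote(member_preds, corrector_ok)
```

In this toy input the uncorrected majority for the first item is True, but the second member's vote is flipped by its corrector, so the corrected majority becomes False.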

Correcting Classifier Diagram
[Diagram components: Filters, Energy Classifiers, Correctors, Aggregator (∑)]

Correcting Classifier Performance
Speaker Dependent Performance
Corpus     Unfiltered Energy   Voting   Corrected Voting   Change
BDC-read   79.80               79.87    84.38              +4.51
BDC-spon   79.12               80.67    83.20              +2.53
BURNC      82.90               83.18    85.51              +2.33

Learning Representations
• Find redundancy in the data.
• Correlated dimensions — like PCA
• Irrelevant dimensions — L1 or L0 regularization
• Goal here: learn discrete categories, with no
discriminative labels (as in MDS or LDA)
• Clustering or Codebook learning

Clustering as a Representation
$x \in \mathbb{R}^2$
$f(x) \in \{A, B, C\}$
$g(x) \in \mathbb{R}^3$
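A clustering gives two representations of a point: a discrete one, f(x), and a continuous one, g(x). In the sketch below f(x) is the label of the nearest centroid and g(x) is the vector of distances to all three centroids. The slide does not define g, so the distance-based form is an assumption, as are the centroid values and function names.

```python
import math

def soft_representation(x, centroids):
    """g(x): distance from x to each centroid (one value per cluster)."""
    return [math.dist(x, c) for c in centroids]

def hard_representation(x, centroids, labels="ABC"):
    """f(x): label of the nearest centroid."""
    d = soft_representation(x, centroids)
    return labels[d.index(min(d))]

# Hypothetical centroids, as if produced by k-means with k = 3 on 2-D data.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
f_x = hard_representation((1.0, 1.0), centroids)
g_x = soft_representation((1.0, 1.0), centroids)
```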

• Neural Net Representations
• Autoencoder
$x \in \mathbb{R}^D$
$g(x) \in \mathbb{R}^k$
[Diagram: x → W1 → W2 → x]
$g(x) = s(W_1 s(W_2 x))$

• Neural Net Representations
• Bottleneck layer
$x \in \mathbb{R}^D$
$g(x) \in \mathbb{R}^k$
[Diagram: x → W1 → W2 → t]
$g(x) = s(W_1 s(W_2 x))$

Applications of Prosodic Representations
• Candidate Representations:
• Manual ToBI Labels
• Automatically hypothesized ToBI Labels
• Codebook/Clusters of acoustic features
(k-means, DPGMM)
• Named Entity Tagging
• Sarcasm
• Prosody Sequence Modeling
• Speaking Style; Nativeness; Speaker

Name Tagging
• Names: Persons, Geopolitical Entities (Places),
Organizations.
• These are often misrecognized, and sometimes
completely unknown.
• (Most) speech recognition systems will never
recognize a word they have never heard before: the
“out-of-vocabulary” problem.
• Goal: Use prosody to help identify which words in a
transcript are actually names, despite this.
work with Denys Katerenchuk

Approach
• CRF-based Tagger 
from Heng Ji’s (RPI) group
• Lexical Features
• n-grams, POS, Brown clusters, syntactic
chunking, known dictionaries (place names,
etc.)
• Prosodic Features
• AuToBI hypotheses: 6 features.
• K-means codebook of the input features used
by AuToBI with k=2-10: 8 features.

Results
Name Tagging
• Prosody helps; it is likely approximating punctuation.
• AuToBI features are robust even at worse ASR performance
(still higher WER!)
[Bar chart: F1-score, axis from 20 to 50. Series as extracted: Text Features; +Prosodic Clusters & AuToBI Features; +AuToBI Features; +Prosodic Clusters. Values as extracted: 39.94, 45.02, 44.34, 39.38]
WER: 49.13%
Ground Truth: marines battling for control of the bridges in
the southern city of Nasiriyah
Hypothesis: marines battling for control the bridges in the
southern city of non <GPE> sir </GPE> re f

Recognizing Sarcasm
• Sarcasm: the use of irony to indicate scorn or disdain
• Clips from Daria
• Rated by 165 participants as sarcastic or sincere
• Features:
• Baseline: mean pitch, pitch range, standard deviation of
pitch, mean intensity, intensity range, speaking rate
• Prosodic Representations: k=3 clustering of order-2
Legendre polynomial coefficients based on pitch and
intensity
• unigram and bigram rates of both pitch and intensity
representations
work with Rachel Rakov
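The unigram and bigram rate features can be sketched as relative frequencies over a sequence of cluster labels. Reading "rate" as relative frequency within the utterance is an assumption, as are the function name and the example labels.

```python
from collections import Counter

def ngram_rates(labels, n):
    """Relative frequency of each n-gram in a sequence of cluster labels."""
    grams = [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}

# Hypothetical per-word pitch cluster labels for one utterance.
pitch_labels = ["fast_rise", "fast_rise", "fast_fall"]
unigram = ngram_rates(pitch_labels, 1)
bigram = ngram_rates(pitch_labels, 2)
```

Each distinct n-gram becomes one feature dimension, so with k=3 clusters there are at most 3 unigram and 9 bigram features per stream.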

Results
Recognizing Sarcasm
• Learned representations:
• Pitch: Fast Rise, Slow Rise, Fast Fall
• Intensity: Fast Rise, Stable, Moderate Fall
Logistic Regression
Feature Set                             Accuracy
Chance Baseline                         55.26
Standard Acoustic                       65.78
+Unigram Features                       78.31
+Unigram Features +Intensity Bigrams    81.57
+Unigram Features +Both Bigrams         76.31

Modeling Prosodic Sequences
• Prosodic Recognition of:
• Speaking Style - Read, Spontaneous, Dialog,
News
• Speaker - 4 speakers all Spontaneous speech
• Nativeness - Native vs. Non-native American
English Speakers, reading the same material.

Prosodic Sequence Modeling
• 3-gram model with backoff
• Clusters trained over all material.
• Sequence models trained on training splits.
• Automatic syllabification
• Only 7 acoustic features:
mean pitch and intensity and delta, duration, preceding/following silence
$$C^* = \arg\max_{C} \; p(x_0 \mid C)\, p(x_1 \mid x_0, C) \prod_{i=2}^{N} p(x_i \mid x_{i-1}, x_{i-2}, C)$$
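The decision rule above can be sketched with one trigram model per class. Note the swap: this example uses add-one smoothing in place of the backoff scheme mentioned on the slide, to keep it short. The class name `TrigramModel`, the class labels, and the toy sequences are assumptions.

```python
import math
from collections import Counter

class TrigramModel:
    """Per-class trigram model over cluster IDs, with add-one smoothing."""

    def __init__(self, sequences, vocab_size):
        self.vocab_size = vocab_size
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        for seq in sequences:
            for i, x in enumerate(seq):
                # Context is the up-to-two preceding symbols: (), (x0,), (x_{i-2}, x_{i-1}).
                context = tuple(seq[max(0, i - 2):i])
                self.context_counts[context] += 1
                self.ngram_counts[context + (x,)] += 1

    def log_prob(self, seq):
        """log p(x0|C) + log p(x1|x0,C) + sum_{i>=2} log p(x_i | x_{i-1}, x_{i-2}, C)."""
        total = 0.0
        for i, x in enumerate(seq):
            context = tuple(seq[max(0, i - 2):i])
            numerator = self.ngram_counts[context + (x,)] + 1
            denominator = self.context_counts[context] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

def classify(seq, models):
    """Return the class whose model assigns the sequence the highest likelihood."""
    return max(models, key=lambda c: models[c].log_prob(seq))

models = {
    "alternating": TrigramModel([[0, 1, 0, 1, 0, 1]], vocab_size=2),
    "paired": TrigramModel([[0, 0, 1, 1, 0, 0, 1, 1]], vocab_size=2),
}
best = classify([0, 1, 0, 1], models)
```

As in the equation, no class prior is used, which amounts to assuming the classes are equally likely.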

Dirichlet Process GMMs
$$G \mid \{\alpha, G_0\} \sim \mathrm{DP}(\alpha, G_0)$$
$$\theta_n \mid G \sim G$$
$$X_n \mid \theta_n \sim p(x_n \mid \theta_n)$$
$$p(x) = \sum_{n}^{\infty} \pi_n \, \mathcal{N}(x; \mu_n, \Sigma_n)$$
[Figure: plate notation, from M. Jordan 2005 NIPS tutorial]
• Non-parametric infinite mixture model
• No need to specify the number of clusters.
• Need a prior for π: the Dirichlet process
• and a prior over N: a zero-mean Gaussian
• Still need to set hyperparameters α and G0
• Stick-breaking and Chinese Restaurant metaphors
• Blei and Jordan 2005: Variational Inference
• “Rich get Richer”
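The stick-breaking metaphor for the mixture weights π can be sketched in a few lines: each weight is a Beta(1, α) fraction of whatever remains of a unit-length stick. This truncated sampler is only an illustration of the prior; it is not the variational inference procedure of Blei and Jordan, and the function name and truncation level are assumptions.

```python
import random

def stick_breaking_weights(alpha, n_components, rng=None):
    """Sample the first n_components mixture weights of a Dirichlet process.

    Returns (weights, remaining), where remaining is the stick mass left
    for all components beyond the truncation.
    """
    rng = rng or random.Random(0)
    remaining = 1.0
    weights = []
    for _ in range(n_components):
        fraction = rng.betavariate(1.0, alpha)  # break off a Beta(1, alpha) share
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining

weights, leftover = stick_breaking_weights(alpha=1.0, n_components=20, rng=random.Random(1))
```

Smaller α tends to put most of the mass on the first few components, while larger α spreads it over more clusters, which is why α still has to be chosen even though the number of clusters does not.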

Results
Prosodic Sequences
[Figure panels: Speaking Style (of 4), Nativeness (of 2), Speaker (of 6). Series: ToBI, K-means, DPGMM]
• K-means is a clear winner on all tasks
• DPGMM here fails to find effective representations
Variable-length sequences with repetition

Common Representations
• Previous experiments generated representations
from a wide range of material.  
(3 corpora: 1) spontaneous/read; 2) dialog; 3) news)
• Here: we repeat these experiments with
representations learned from material from a single
corpus (only news)
• Also include AuToBI hypotheses, and clusters are
based on full feature set. (compared to 7 before)

Results
Prosodic Sequences
[Figure panels: Speaking Style (of 4), Speaker (of 12). Series: K-means]
• K-means provides a robust representation of prosody.
• All speaker material is unknown during representation generation

Next Problems
• Hunting for Language Universals
• Additional Applications
• Automatically identifying the unit of analysis.
• Too short - low information; Too long - low
generalization
• Unify with representation learning
• Identifying “discriminative” prosodic events.
• In emotion, deception, foreign accent recognition, the
relevant signal is rare but important.
• Discriminative modeling
• Anomaly detection (one class modeling)

Thanks
Denys Katerenchuk, Rachel Rakov
Adam Goodkind, Ali Raza Syed, David Guy Brizan, Felix Grezes,
Guozhen An, Michelle Morales, Min Ma, Justin Richards, Syed Reza
andrew@cs.qc.cuny.edu
speech.cs.qc.cuny.edu 
eniac.cs.qc.cuny.edu/andrew
Questions?
