2. Spoken Content Retrieval
• Spoken term detection
– to detect whether a target term (e.g., "COVID-19") was spoken in any of the utterances in an audio dataset, based on similarity scores
3. What can Spoken Content Retrieval do for us?
• Google reads all text over the Internet
– can find any text over the Internet for the user
• All Roles of Text can be realized by Voice
• Machines can listen to all voices over the Internet
– can find any utterance over the Internet for the user
• A Spoken Version of Google
4. What can we do with a Spoken Version of Google?
• Multimedia Content exponentially increasing over the Internet
– nobody can go through so much multimedia information, but machines can
• Machines may be able to listen to and comprehend the entire multimedia knowledge treasury over the Internet
– extracting desired information for each individual user
– the unique treasury of the entire global human knowledge is here
– desired information for each individual deeply buried under huge quantities of unrelated information
5. A Target Application Example: Personalized Education Environment
• For each individual user
– user: "I wish to learn how machines can listen to human voice. I can spend 3 hrs to learn."
– system, drawing on information from the Internet: "This is the 3-hr personalized course for you. I'll be your personalized teaching assistant. Ask me when you have questions."
• Comprehension, Summarization and Question Answering for Spoken Content
– proper use of semantics in spoken content
6. Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 1999]
• Unsupervised Topic Analysis from a text corpus
• [figure] documents D1 … Di … DN connect to terms t1 … tj … tn through latent topics T1 … Tk … TK, with probabilities P(Tk|Di) and P(tj|Tk) (equivalently P(z|d) and P(w|z) for document d, topic z, word w)
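The diagram's two sets of distributions combine into the standard PLSA decomposition of each document's term distribution (the parameters are typically fitted with EM to maximize the corpus likelihood):

```latex
P(t_j \mid D_i) \;=\; \sum_{k=1}^{K} P(t_j \mid T_k)\, P(T_k \mid D_i)
```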
7. Latent Dirichlet Allocation (LDA) [Blei 2003]
• Unsupervised Topic Analysis from a text corpus
• Generative model with Dirichlet priors
– topic-word distributions drawn as P(φk|β), a Dirichlet distribution; per-document topic mixtures as P(θ|α), also a Dirichlet distribution
– topic assignments drawn from P(zm,n|θm) and words from P(wm,n|zm,n, φk)
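As an illustration only (gensim is a real library, but not what the slides used), unsupervised topic analysis of this kind can be run off the shelf; a minimal sketch on a toy corpus:

```python
# Minimal LDA sketch with gensim (illustrative, not the slides' implementation)
from gensim import corpora
from gensim.models import LdaModel

docs = [["speech", "recognition", "audio"],
        ["news", "broadcast", "archive"],
        ["speech", "audio", "signal"]]           # toy tokenized documents
dictionary = corpora.Dictionary(docs)            # term index
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               alpha="auto", passes=10)          # Dirichlet prior alpha learned
print(lda.print_topics())                        # top words per latent topic
```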
8. Clustering and Structuring the Spoken Content Segments (Spoken Documents) Based on Topics [Interspeech 2006]
• Global Semantic Structuring
• Local Semantic Structuring
– fine local structure around the user query on top of the global structure
• [figure] a Chinese Broadcast News Archive undergoes Semantic Analysis, feeding Global Semantic Structuring, Query-based Local Semantic Structuring, Automatic Generation of Key Terms, Titles and Summaries, and Information Retrieval driven by the user's query
9. Global Semantic Structure of Spoken Content [Eurospeech 2005]
• Example Approach: Spoken Documents categorized by Layered Topics and organized in a Two-dimensional Tree
– topics nearby on the map are more related semantically
– each topic expanded into another map in the next layer
10. An Example Screenshot of Global Semantic Structure in a Two-dimensional Tree [Interspeech 2006]
• Broadcast News Browser (2006)
– each topic labeled by a set of key terms
11. Clustering and Structuring the Segments of Spoken Content (Spoken Documents) Based on Topics [Interspeech 2006]
• (same overview as slide 8, now focusing on Local Semantic Structuring)
– fine local structure around the user query on top of the global structure
12. Local Semantic Structure of Spoken Content [Interspeech 2006]
• Fine structure based on a user query
• Example Approach: Topic Hierarchy constructed with key terms (Example Query: George Bush)
13. An Example Screenshot of Local Semantic Structure
in a Topic Hierarchy
[Interspeech 2006]
• Query: “White House of United States”
– some key terms under another key term on a higher level
14. Online Courses: A Well Organized Spoken Knowledge Treasury [ICASSP 2009][IEEE Trans ASL 2014]
• Spoken Knowledge in courses: in sequential form
– an individual user may need only a small part
– not understandable without listening to previous lectures
• Example Approach: Key Term Graph (2009)
– each spoken slide labeled by a set of key terms (topics)
– relationships between key terms represented by a graph
– [figure] spoken slides (plus audio/video) linked to a key term graph with nodes such as Acoustic Modeling, Viterbi search, HMM, Language Modeling and Perplexity
15. Interconnection between Temporal and Semantic Structures [IEEE Trans ASL 2014]
• Temporal Structure
– chapters, sections, slides
• Semantic Structure
– key term graph
• Interconnection between the Two Structures
16. An Example Automatically Generated Key Term
Graph
[IEEE Trans ASL 2014]
• Relationship scores evaluated between each pair of key
terms
– an edge if exceeding a threshold
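A minimal sketch of this thresholding step (the key terms and scores below are hypothetical; the paper's actual relationship-scoring function is not shown here):

```python
# Build a key-term graph: add an edge whenever the relationship score
# between two key terms exceeds a threshold (scores are toy values)
import networkx as nx

scores = {("HMM", "Viterbi search"): 0.82,
          ("HMM", "Acoustic Modeling"): 0.74,
          ("Language Modeling", "Perplexity"): 0.91,
          ("HMM", "Perplexity"): 0.12}      # hypothetical pairwise scores

THRESHOLD = 0.5
g = nx.Graph()
g.add_nodes_from({term for pair in scores for term in pair})
for (a, b), s in scores.items():
    if s > THRESHOLD:                        # edge only if score exceeds threshold
        g.add_edge(a, b, weight=s)

print(list(g.edges(data="weight")))
```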
17. An Example Screenshot from an Online Course Browser [ICASSP 2009][IEEE Trans ASL 2014]
• User clicks a key term "entropy"
– possible learning path through the selected spoken slides
– related key terms via the key term graph
18. Spoken Knowledge Structuring for an Example Online
Course
• Based on a course recorded in 2006
[ICASSP 2009] [IEEE Trans ASL 2014]
19. Thousands of Online Courses over the Internet [Interspeech 2015]
• Machines listen to all online courses
– [figure] three courses on some similar topic, containing lectures with very similar content
20. Thousands of Online Courses over the Internet [Interspeech 2015]
• Machines listen to all online courses
• Learning map for a given query
– sequential order for learning (prerequisite conditions) across three courses on some similar topic
• More precise semantic analysis for speech needed
22. Word Embeddings (Word2Vec) as Vector Representations for Words [Mikolov 2013]
• Continuous bag of words (CBOW)
– a neural network predicts the word given its context: …… wi-1 ____ wi+1 …… → wi
• Skip-gram
– a neural network predicts the context given a word: wi → wi-1, wi+1
• Prediction based on some hidden structure within the language
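Both variants can be trained with the gensim library (an illustrative choice, not mentioned in the slides); a minimal sketch where the `sg` flag switches between CBOW and skip-gram:

```python
# Train CBOW and skip-gram word embeddings with gensim on a toy corpus
from gensim.models import Word2Vec

sentences = [["machines", "listen", "to", "human", "voice"],
             ["machines", "can", "find", "any", "utterance"]]

cbow = Word2Vec(sentences, vector_size=50, sg=0, min_count=1)  # sg=0: CBOW
skip = Word2Vec(sentences, vector_size=50, sg=1, min_count=1)  # sg=1: skip-gram

print(cbow.wv["voice"][:5])   # first 5 dims of the vector for "voice"
```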
23. Word Embeddings (Word2Vec) [Mikolov 2013]
• Carry some Semantic Structure among Words
– semantic relationship is kind of "additive" or "parallel"
– V(Berlin) – V(Germany) + V(France) ≈ V(Paris)
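The "additive" relationship on this slide can be checked with pre-trained vectors; a sketch assuming gensim's downloader and the word2vec-google-news-300 vectors (real resources, though not necessarily what the slide used):

```python
# Check the Berlin - Germany + France ≈ Paris analogy with pre-trained vectors
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")        # pre-trained Google News vectors
result = wv.most_similar(positive=["Berlin", "France"],
                         negative=["Germany"], topn=1)
print(result)   # "Paris" is expected to rank first
```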
24. Word2Vec: Word Semantics learned from Text Context
• Example sentence: "Berlin is the capital city of Germany", each word a 1-of-N vector x1 … x7
– the context words x2, x3, x5, x6 ("is", "the", "city", "of") enter the Input Layer through a shared matrix W (V × J); the Hidden Layer predicts the target word "capital" at the Output Layer through W′ (J × V)
26. Audio Word2Vec: Sequence-to-sequence Autoencoder [Interspeech 2016]
• RNN encoder and decoder
– learn some hidden structure within signals
– [figure] input acoustic features x1 x2 x3 x4 of an audio segment (a segmented spoken word) enter the RNN Encoder; the RNN Decoder reconstructs them as y1 y2 y3 y4
27. Audio Word2Vec: Sequence-to-sequence Autoencoder [Interspeech 2016]
• (same autoencoder as slide 26, highlighting the RNN Encoder reading the acoustic features x1 x2 x3 x4 of the segmented spoken word)
28. Audio Word2Vec: Sequence-to-sequence Autoencoder [Interspeech 2016]
• The RNN Encoder's output is taken as the vector representation of the audio segment
• Unsupervised / Self-supervised
– model learns from data itself
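A minimal PyTorch sketch of such a sequence-to-sequence autoencoder (the dimensions, GRU cells and zeroed decoder inputs are simplifications, not the paper's exact configuration):

```python
# Seq2seq autoencoder: encode an acoustic-feature sequence into one
# fixed-length vector, then reconstruct the sequence from it
import torch
import torch.nn as nn

class AudioWord2Vec(nn.Module):
    def __init__(self, feat_dim=39, hid_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, feat_dim)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        _, h = self.encoder(x)            # h: the fixed-length embedding
        dec_in = torch.zeros_like(x)      # teacher forcing omitted for brevity
        y, _ = self.decoder(dec_in, h)    # decoder conditioned on the embedding
        return self.out(y), h.squeeze(0)  # reconstruction + vector representation

model = AudioWord2Vec()
x = torch.randn(2, 50, 39)                # 2 toy segments, 50 frames, 39-dim MFCC
recon, vec = model(x)
loss = nn.functional.mse_loss(recon, x)   # self-supervised reconstruction loss
```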
29. What was learned? [Interspeech 2016]
• Sequential Phonetic structure! Not semantics
– learned hidden structure within speech signals is from isolated segmented spoken words only
– learning semantics from context of word sequences may be easier
– [figure] phonetically similar words cluster together, e.g., new / few, fear / near, fame / name, night / fight, and singular/plural pairs hand(s), word(s), thing(s), day(s), say(s)
30. Segmental Audio Word2Vec — extended to an utterance [Interspeech 2018]
• [figure] the acoustic features x1 … xt, xt+1 … xT of an utterance pass through encoders (ER) and decoders (DR); a Segmentation Gate detects word boundaries, and the recurrent state is reset to the initial state (0) at each boundary
• The utterance is thereby divided into Segment 1 … Segment N, each mapped to one embedding e1 … eN
• Each color block performs seq2seq training individually
31. Feature Disentanglement for Audio Word2Vec [SLT 2018]
• Audio signals include information irrelevant to semantics
– speaker and other acoustic information
• [figure] the input audio passes through two encoders: Encoder 1 yields a phonetic vector, Encoder 2 a speaker vector; the Decoder reconstructs the audio from the two vectors
33. Disentanglement of Speaker Information [SLT 2018]
• Phonetic Vectors without Speaker Information
• [figure] utterances from speaker 1 and speaker 2 each pass through a Phonetic Encoder and a Speaker Encoder before the Decoder; training keeps the speaker vectors far enough apart for different speakers
36. Phonetic-and-Semantic Embedding [SLT 2018]
• Phonetic Space vs. Semantic Space
– [figure] "Brother" lies close to "Bother" in the phonetic space but close to "Sister" in the semantic space
• Two types of information sometimes disturb each other
– aligning the word embeddings in the two spaces is challenging
• A unique ID for a text word in training
– unlimited number of audio realizations for a given text word
37. BERT (Bidirectional Encoder Representations from Transformers) [Devlin 2018]
• Some hidden structure within the language learned in an unsupervised way
– by estimating masked tokens from unlabeled text data
– the representations carry the context information
• Self-supervised learning: learns some hidden structure within the language from the dataset itself, without labels
– [figure] Transformer encoders map an input such as "How are [M] today ?" (with [M] = "you" masked) to representations, and pre-training estimates the masked token
• Useful in many different downstream tasks, achievable with much simpler models, smaller labeled datasets and faster convergence
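The masked-token estimation above can be reproduced with the Hugging Face transformers library (a real library, though not named on the slide); a minimal sketch:

```python
# Ask a pre-trained BERT to estimate a masked token
# (Hugging Face BERT writes the mask as [MASK] rather than the slide's [M])
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("How are [MASK] today?")[:3]:
    print(cand["token_str"], round(cand["score"], 3))   # "you" should rank high
```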
38. Self-supervised Learning for Text: BERT
• Pre-training: unsupervised, on unlabeled text data
• Example downstream task (1): slot filling
– input: "arrive Taipei on November 2nd"; output slot classes: other, dest, other, time, time
– a linear classifier (Linear + Cls) on each BERT output representation predicts the slot class of the corresponding word, with [CLS] prepended to the sentence w1 w2 w3
• Semantics considered
• Classifiers may have simpler models trained with smaller labeled datasets
39. Self-supervised Learning for Text: BERT
• Example downstream task (2): sentiment classification
– BERT (pre-trained on unlabeled text data) reads "[CLS] It's a nice day" and a downstream model classifies the [CLS] representation as positive or negative
• Semantics considered
• Simpler models trained with smaller labeled datasets
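A minimal sketch of such a downstream classifier on the [CLS] representation (the frozen-BERT setup and the tiny linear head are illustrative choices, not the slide's exact recipe):

```python
# Sentiment classifier on top of a frozen BERT [CLS] representation
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(bert.config.hidden_size, 2)     # simple downstream model

inputs = tok("It's a nice day", return_tensors="pt")
with torch.no_grad():                            # pre-trained BERT kept frozen
    cls = bert(**inputs).last_hidden_state[:, 0] # the [CLS] position
logits = head(cls)                               # train `head` on a small labeled set
print(logits.shape)                              # torch.Size([1, 2])
```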
40. Self-supervised Learning for Speech [IEEE JSTSP 2022]
• To learn the hidden structure within speech signals without considering any specific downstream task
• Phase 1 (Pre-training): a model trained on unlabeled data yields representations
– generative: mask the input signals and then reconstruct them
– predictive
– contrastive
41. Self-supervised Learning for Speech [IEEE JSTSP 2022]
• Phase 2 (Downstream): for a given downstream task (e.g., ASR mapping speech to "How are you?"), a downstream model is trained with labeled data on top of the pre-trained model
• With the hidden structure learned from the signals, the given task becomes easier
44. Mockingjay [ICASSP 2020]
• The model was able to reconstruct the Mel-spectrogram from hidden representations
– [figure] masked frames of the real Mel-spectrogram (Real) are recovered in the prediction (Pred) from the representations (Repr.)
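A minimal sketch of this generative objective, masking Mel-spectrogram frames and reconstructing them (the toy MLP stands in for Mockingjay's Transformer encoder; all dimensions are assumptions):

```python
# Generative self-supervised objective in the Mockingjay style:
# mask some spectrogram frames and train a model to reconstruct them
import torch
import torch.nn as nn

mel = torch.randn(4, 200, 80)                 # toy batch: 200 frames of 80-dim Mel
mask = torch.rand(4, 200) < 0.15              # mask ~15% of the frames
masked = mel.clone()
masked[mask] = 0.0                            # zero out the masked frames

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
pred = model(masked)                          # reconstruct every frame
loss = nn.functional.l1_loss(pred[mask], mel[mask])  # loss on masked frames only
loss.backward()
```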
52. Supervised vs. Self-supervised: ASR [Lüscher 2019] [Baevski 2020] [Hsu 2021] [Interspeech 2021]
• LibriSpeech
– [figure] WER bar chart contrasting supervised systems trained on 100 hours of labeled data (6-layer / 2-layer LSTM) with self-supervised systems using only 10 minutes of labeled data; values shown: 5.8, 2.9, 3.1, 4.8, 4.6 (Lüscher et al., Yang et al., Baevski et al., Hsu et al.)
53. SUPERB: SLT 2022 Challenge
• Welcome to Join: https://superbbenchmark.org/
56. Supervised/Unsupervised ASR
• Supervised ASR
– has been very successful
– problem: requires a huge quantity of annotated data, i.e., audio paired with text ("How are you.", "He thinks it's…", "Thanks for…")
• Unsupervised ASR
– trained without annotated data, from unpaired audio data and text data
– unlabeled, unpaired data are easier to collect
– thousands of languages are spoken over the world; most are low-resourced without enough annotated data
57. Use of Generative Adversarial Networks (GAN)
• Generator (ASR): maps acoustic features to generated phoneme sequences and tries to "fool" the Discriminator
• Discriminator: tries to distinguish real phoneme sequences from generated ones
• The two are trained iteratively, improving themselves individually
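A minimal sketch of this iterative adversarial training (toy dimensions and MLPs; the real generator, discriminator and phoneme-sequence handling are more elaborate):

```python
# Adversarial loop: generator maps acoustic features to phoneme posteriors;
# discriminator scores phoneme sequences as real (1) or generated (0)
import torch
import torch.nn as nn

N_PHONES, FEAT = 48, 39
G = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(),
                  nn.Linear(256, N_PHONES), nn.Softmax(dim=-1))
D = nn.Sequential(nn.Linear(N_PHONES, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

feats = torch.randn(8, 100, FEAT)                      # toy acoustic features
real = nn.functional.one_hot(                          # toy real phoneme text
    torch.randint(N_PHONES, (8, 100)), N_PHONES).float()

for _ in range(5):                                     # train iteratively
    # 1) Discriminator step: real -> 1, generated -> 0
    d_loss = bce(D(real).mean(1), torch.ones(8, 1)) + \
             bce(D(G(feats).detach()).mean(1), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Generator step: try to make the Discriminator output 1
    g_loss = bce(D(G(feats)).mean(1), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```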
58. Model 1 (2018) [Interspeech 2018]
• Waveform segmentation and embedding
– divide the acoustic features X1 X2 X3 … XM into acoustically similar segments of different lengths
– transform each segment into a fixed-length vector (audio embedding) with Audio Word2Vec, giving the audio embedding sequence z1 z2 z3 … zM
59. Model 1 (2018) [Interspeech 2018]
• Cluster the embeddings into groups with K-means
– the audio embedding sequence z1 z2 z3 … zM becomes a cluster index sequence, e.g., 16 25 2 … 2
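A minimal sketch of this quantization step with scikit-learn (the cluster count and embedding dimension are illustrative):

```python
# Quantize audio embeddings into a cluster index sequence with K-means
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.randn(500, 128)          # toy audio embeddings z1..zM
kmeans = KMeans(n_clusters=40, n_init=10).fit(embeddings)

utterance = embeddings[:6]                      # embeddings of one utterance
index_seq = kmeans.predict(utterance)           # e.g., [16 25 2 ... 2]
print(index_seq)
```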
60. Model 1 (2018) [Interspeech 2018]
• Learning the mapping between cluster indices and phonemes with a GAN
– embedding clustering followed by (cascaded with) a GAN
– Generator: a Lookup Table maps the cluster index sequence (16 25 2 … 2) to a generated phoneme sequence (sil hh ih … sil)
– Discriminator: a CNN network judging real / generated against real phoneme sequences
61. Model 2 (2019) [Interspeech 2019]
• A GAN (Generator and Discriminator) trained end-to-end on unpaired audio data and text data
– DNN trained in an unsupervised way
• Generator consists of two parts: (a) a Phoneme Classifier (DNN) and (b) a Sampling Process
• Discriminator is a two-layer 1-D CNN
62. The Progress of Supervised Learning on TIMIT [Keynote, Interspeech 2020]
• Milestones in phone recognition accuracy [Lopes, C. and Perdigão, F., "Phone recognition on the TIMIT database," Speech Technologies, Vol. 1, pp. 285-302, 2011]
– unsupervised learning (Model 2, 2019) is as good as supervised learning (HMM) was 30 years ago
– will it take another 30 years for unsupervised learning to achieve the performance of supervised learning today?
63. Unsupervised Speech Recognition [facebook 2021]
• LibriSpeech (2021)
– Word Error Rate as low as 5.9% with zero hours of annotated data!
– https://ai.facebook.com/blog/wav2vec-unsupervised-speech-recognition-without-supervision/
65. Unsupervised Speech Recognition [ICASSP 2022]
• Reasonable (though still relatively high) error rates achievable given
– good acoustic conditions (Libri) with reasonably different linguistic styles, or
– relatively poor acoustic conditions (SB) but the same linguistic styles
– will be lowered sooner or later…
– [figure] PER vs. 4-gram JSD for Libri960 and SB300_w2v2
67. A Target Application Example: Personalized Education Environment
• (revisiting slide 5) For each individual user
– user: "I wish to learn how machines can listen to human voice. I can spend 3 hrs to learn."
– system, drawing on information from the Internet: "This is the 3-hr personalized course for you. I'll be your personalized teaching assistant. Ask me when you have questions."
• Comprehension, Summarization and Question Answering for Spoken Content
– proper use of semantics in spoken content
71. End-to-end Spoken QA: DUAL [Interspeech 2022]
• Pre-trained with unlabeled audio/text corpora
– audio represented in HuBERT units (frame level)
– fine-tuned downstream with (question, passage, answer) sets
• Not limited by ASR errors or OOV words
– no ASR here
72. End-to-end Spoken QA: DUAL [Interspeech 2022]
• HuBERT-based speech encoder
– the pre-trained Speech Encoder maps the spoken question and passage to clustered units (e.g., 3 9 11 31 31)
• BERT pre-trained on text
– reads the unit sequences and finds the answer span (start, end)
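A sketch of how frame-level discrete units of this kind are commonly obtained, assuming the public facebook/hubert-base-ls960 checkpoint and a k-means quantizer (standard practice, but not necessarily the paper's exact recipe):

```python
# Turn speech into frame-level discrete units: HuBERT features + K-means
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
wav = torch.randn(1, 160000)                     # 10 s of toy 16 kHz audio
with torch.no_grad():
    feats = hubert(wav).last_hidden_state[0]     # (frames, 768) hidden features

# In practice the K-means codebook is fitted on features from a large corpus
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
units = kmeans.predict(feats.numpy())            # clustered unit sequence
print(units[:10])                                # e.g., [3 9 11 31 31 ...]
```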
73. End-to-end Spoken QA: DUAL [Interspeech 2022]
• ASR cascaded with Text QA: the baseline
– performance directly limited by WER (many errors due to OOV words)
• End-to-end Spoken QA (DUAL)
– performance independent of WER, because semantics are extracted from audio
– no ASR here; many answer spans include OOV words
– [figure] Frame-level F1 score (FF1) vs. Word Error Rate (WER) for ASR + Text QA and End-to-end DUAL
76. Textless Spoken QA with Speech Discrete Units: TlDu [submitted to SLT 2022]
• Audio represented in HuBERT units
• Retrieving relevant passages and identifying answers jointly handled
• No ASR transcriptions or errors
• Pipeline: (a) a Speech Signal Encoder turns the Spoken Archive and Spoken Question into discrete unit sequences; (b) a Spoken Content Retriever returns the top-K passages; (c) a Spoken Content Reader outputs the answer span
77. Preliminary Results for TlDu [submitted to SLT 2022]
• Retrieval Accuracy and SQA Accuracy reported
• Performance not limited by ASR errors or OOV
– no ASR here
– semantics extracted directly from audio (not words)
78. Beyond
• What's next? Don't know
• Depending on new technologies to be developed in the future
– by capable researchers in the future
– capable students in schools today (and in the future)
• How students can learn effectively and efficiently is important
– people asked: are you still teaching HMM and MFCC in your course today?
• Just to share my thoughts
– purely my personal imagination
79. Speech Technologies shown in a 1-dim Scale
• [figure] a scale running from "never change" (fundamental: mathematics, Fourier transform) through "changing slower" (signal processing, filter banks, pitch estimation) and "changing faster" to "changing very fast" (the cutting edge)
• The cutting edge keeps moving: LPC 40 years ago; ML, HMM, MFCC 20 years ago; NN, CNN, RNN, WFST, i-vector, DNN 10 years ago; GAN, Meta Learning, Self-supervised Learning today; "?" x years later
80. • Students today (researchers in the future) have to explore new knowledge and solve new problems in the future (20 years?)
• Primary Goal for the students to learn in school: to learn
– not how to do research today, but how to do it in the future
– not just to run deep learning packages today, which produces good results, papers and good jobs: those may become obsolete very soon
– if they are too focused on those deep learning packages, quick results and achievement, will it be possible that they become "overfitted" on these targets?
– may need to learn skills "more generalizable" to future technologies not existing today (unseen during training)… what are such skills?
81. Analogy 1: How do human babies learn the language?
• Listening to many voices! Reading many books!
• Then their parents teach them.
83. Analogy 1: How do human babies learn the language?
• May be kind of Self-Supervised Learning…
– self-supervised pre-training: a model learns some hidden structure of the language from unlabeled speech and unlabeled text
– downstream: the model then learns from limited labels
84. Analogy 2: Reading articles in a certain unknown language, generalizable to reading arbitrary articles in that language
• To start with
– learning the alphabet: a, b, c
– finding a dictionary for the language
– learning to look up the dictionary: <word>, <unknown>
– learning the basic words, function words, keywords: <this> <he> <is>
– learning the grammar
– these are systematic approaches to learning a new (unknown) language
• These systematic approaches are kind of based on some "hidden structure" of the language?
– having to do with pre-training?
– reading articles is some kind of downstream task?
85. Analogy 3: How did old-generation researchers face the Deep Learning era?
• We didn't learn deep learning in the early days
– we are working with it now
– we seem to have generalized skills learned earlier to handle the new knowledge of today?
• What did we learn in the early days?
– mathematics, programming, fundamentals (a, b, c)
– successful stories of the past (e.g., HMM, MFCC, although not useful any more today)
– these may include components generalizable to technologies useful today: the forward-backward algorithm, vector quantization, cepstral mean and variance normalization are still useful today, although in different contexts
– [figure] What? solutions; Why? principles; How? ideas
– have we gone through some kind of pre-training when we were young? are we specially lucky to survive the technology revolutions?
• Learning new technologies is the downstream task?
• We don't know, just to share some thoughts
87. Concluding Remarks
• Semantics of Speech yet to be explored
– plenty of unknown space
– may offer a bridge towards "a spoken version of Google"
• Self-supervised Learning for Speech
– pre-training with unlabeled data
– universally makes all downstream tasks easier
– various new technologies blooming
• x years ago we never knew what kind of technologies we could have today
– today we never know what kind of technologies we may have x years from now
– anything in our mind could be possible
• This is the golden age we never had for speech research
– very deep learning, very big data, very powerful machines, very strong industry, which we never had before
• Let's all treasure and enjoy this golden age!