2. Spoken Content Retrieval
• Spoken term detection
– to detect whether a target term (e.g., "COVID-19") was spoken in any of the utterances in an audio dataset, based on similarity scores
3. What can Spoken Content Retrieval do for us?
• Google reads all text over the Internet
– can find any text over the Internet for the user
• All Roles of Text can be realized by Voice
• Machines can listen to all voices over the Internet
– can find any utterance over the Internet for the user
• A Spoken Version of Google
4. What can we do with a Spoken Version of Google?
• Multimedia Content exponentially increasing over the Internet
– nobody can go through so much multimedia information, but machines can
• Machines may be able to listen to and comprehend the entire multimedia knowledge treasury over the Internet
– extracting desired information for each individual user
– the unique treasury of the entire global human knowledge is here
– desired information for each individual deeply buried under huge quantities of unrelated information
5. A Target Application Example: Personalized Education Environment
• For each individual user
– user: "I wish to learn how machines can listen to human voice. I can spend 3 hrs to learn."
– system, drawing on information from the Internet: "This is the 3-hr personalized course for you. I'll be your personalized teaching assistant. Ask me when you have questions."
• Comprehension, Summarization and Question Answering for Spoken Content
– proper use of semantics in spoken content
6. Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 1999]
• Unsupervised Topic Analysis from a text corpus
• [figure] documents D1 … Di … DN connect to terms t1 … tj … tn through latent topics T1 … Tk … TK, with probabilities P(Tk|Di) and P(tj|Tk) (equivalently P(z|d) and P(w|z) for document d, topic z, word w)
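The diagram's two sets of distributions combine into the standard PLSA decomposition of each document's term distribution (the parameters are typically fitted with EM to maximize the corpus likelihood):

```latex
P(t_j \mid D_i) \;=\; \sum_{k=1}^{K} P(t_j \mid T_k)\, P(T_k \mid D_i)
```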
7. Latent Dirichlet Allocation (LDA) [Blei 2003]
• Unsupervised Topic Analysis from a text corpus
• Generative model with Dirichlet priors
– topic-word distributions drawn as P(φk|β), a Dirichlet distribution; per-document topic mixtures as P(θ|α), also a Dirichlet distribution
– topic assignments drawn from P(zm,n|θm) and words from P(wm,n|zm,n, φk)
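As an illustration only (gensim is a real library, but not what the slides used), unsupervised topic analysis of this kind can be run off the shelf; a minimal sketch on a toy corpus:

```python
# Minimal LDA sketch with gensim (illustrative, not the slides' implementation)
from gensim import corpora
from gensim.models import LdaModel

docs = [["speech", "recognition", "audio"],
        ["news", "broadcast", "archive"],
        ["speech", "audio", "signal"]]           # toy tokenized documents
dictionary = corpora.Dictionary(docs)            # term index
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               alpha="auto", passes=10)          # Dirichlet prior alpha learned
print(lda.print_topics())                        # top words per latent topic
```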
8. Clustering and Structuring the Spoken Content Segments (Spoken Documents) Based on Topics [Interspeech 2006]
• Global Semantic Structuring
• Local Semantic Structuring
– fine local structure around the user query on top of the global structure
• [figure] a Chinese Broadcast News Archive undergoes Semantic Analysis, feeding Global Semantic Structuring, Query-based Local Semantic Structuring, Automatic Generation of Key Terms, Titles and Summaries, and Information Retrieval driven by the user's query
9. Global Semantic Structure of Spoken Content [Eurospeech 2005]
• Example Approach: Spoken Documents categorized by Layered Topics and organized in a Two-dimensional Tree
– topics nearby on the map are more related semantically
– each topic expanded into another map in the next layer
10. An Example Screenshot of Global Semantic Structure in a Two-dimensional Tree [Interspeech 2006]
• Broadcast News Browser (2006)
– each topic labeled by a set of key terms
11. Clustering and Structuring the Segments of Spoken Content (Spoken Documents) Based on Topics [Interspeech 2006]
• (same overview as slide 8, now focusing on Local Semantic Structuring)
– fine local structure around the user query on top of the global structure
12. Local Semantic Structure of Spoken Content [Interspeech 2006]
• Fine structure based on a user query
• Example Approach: Topic Hierarchy constructed with key terms (Example Query: George Bush)
13. An Example Screenshot of Local Semantic Structure
in a Topic Hierarchy
[Interspeech 2006]
• Query: “White House of United States”
– some key terms under another key term on a higher level
14. Online Courses: A Well Organized Spoken Knowledge Treasury [ICASSP 2009][IEEE Trans ASL 2014]
• Spoken Knowledge in courses: in sequential form
– an individual user may need only a small part
– not understandable without listening to previous lectures
• Example Approach: Key Term Graph (2009)
– each spoken slide labeled by a set of key terms (topics)
– relationships between key terms represented by a graph
– [figure] spoken slides (plus audio/video) linked to a key term graph with nodes such as Acoustic Modeling, Viterbi search, HMM, Language Modeling and Perplexity
15. Interconnection between Temporal and Semantic Structures [IEEE Trans ASL 2014]
• Temporal Structure
– chapters, sections, slides
• Semantic Structure
– key term graph
• Interconnection between the Two Structures
16. An Example Automatically Generated Key Term
Graph
[IEEE Trans ASL 2014]
• Relationship scores evaluated between each pair of key
terms
– an edge if exceeding a threshold
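A minimal sketch of this thresholding step (the key terms and scores below are hypothetical; the paper's actual relationship-scoring function is not shown here):

```python
# Build a key-term graph: add an edge whenever the relationship score
# between two key terms exceeds a threshold (scores are toy values)
import networkx as nx

scores = {("HMM", "Viterbi search"): 0.82,
          ("HMM", "Acoustic Modeling"): 0.74,
          ("Language Modeling", "Perplexity"): 0.91,
          ("HMM", "Perplexity"): 0.12}      # hypothetical pairwise scores

THRESHOLD = 0.5
g = nx.Graph()
g.add_nodes_from({term for pair in scores for term in pair})
for (a, b), s in scores.items():
    if s > THRESHOLD:                        # edge only if score exceeds threshold
        g.add_edge(a, b, weight=s)

print(list(g.edges(data="weight")))
```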
17. An Example Screenshot from an Online Course Browser [ICASSP 2009][IEEE Trans ASL 2014]
• User clicks a key term "entropy"
– possible learning path through the selected spoken slides
– related key terms via the key term graph
18. Spoken Knowledge Structuring for an Example Online
Course
• Based on a course recorded in 2006
[ICASSP 2009] [IEEE Trans ASL 2014]
19. Thousands of Online Courses over the Internet [Interspeech 2015]
• Machines listen to all online courses
– [figure] three courses on some similar topic, containing lectures with very similar content
20. Thousands of Online Courses over the Internet [Interspeech 2015]
• Machines listen to all online courses
• Learning map for a given query
– sequential order for learning (prerequisite conditions) across three courses on some similar topic
• More precise semantic analysis for speech needed
22. Word Embeddings (Word2Vec) as Vector Representations for Words [Mikolov 2013]
• Continuous bag of words (CBOW)
– a neural network predicts the word given its context: …… wi-1 ____ wi+1 …… → wi
• Skip-gram
– a neural network predicts the context given a word: wi → wi-1, wi+1
• Prediction based on some hidden structure within the language
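Both variants can be trained with the gensim library (an illustrative choice, not mentioned in the slides); a minimal sketch where the `sg` flag switches between CBOW and skip-gram:

```python
# Train CBOW and skip-gram word embeddings with gensim on a toy corpus
from gensim.models import Word2Vec

sentences = [["machines", "listen", "to", "human", "voice"],
             ["machines", "can", "find", "any", "utterance"]]

cbow = Word2Vec(sentences, vector_size=50, sg=0, min_count=1)  # sg=0: CBOW
skip = Word2Vec(sentences, vector_size=50, sg=1, min_count=1)  # sg=1: skip-gram

print(cbow.wv["voice"][:5])   # first 5 dims of the vector for "voice"
```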
23. Word Embeddings (Word2Vec) [Mikolov 2013]
• Carry some Semantic Structure among Words
– semantic relationship is kind of "additive" or "parallel"
– V(Berlin) – V(Germany) + V(France) ≈ V(Paris)
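The "additive" relationship on this slide can be checked with pre-trained vectors; a sketch assuming gensim's downloader and the word2vec-google-news-300 vectors (real resources, though not necessarily what the slide used):

```python
# Check the Berlin - Germany + France ≈ Paris analogy with pre-trained vectors
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")        # pre-trained Google News vectors
result = wv.most_similar(positive=["Berlin", "France"],
                         negative=["Germany"], topn=1)
print(result)   # "Paris" is expected to rank first
```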
24. Word2Vec: Word Semantics learned from Text Context
• Example sentence: "Berlin is the capital city of Germany", each word a 1-of-N vector x1 … x7
– the context words x2, x3, x5, x6 ("is", "the", "city", "of") enter the Input Layer through a shared matrix W (V × J); the Hidden Layer predicts the target word "capital" at the Output Layer through W′ (J × V)
26. Audio Word2Vec: Sequence-to-sequence Autoencoder [Interspeech 2016]
• RNN encoder and decoder
– learn some hidden structure within signals
– [figure] input acoustic features x1 x2 x3 x4 of an audio segment (a segmented spoken word) enter the RNN Encoder; the RNN Decoder reconstructs them as y1 y2 y3 y4
27. Audio Word2Vec: Sequence-to-sequence Autoencoder [Interspeech 2016]
• (same autoencoder as slide 26, highlighting the RNN Encoder reading the acoustic features x1 x2 x3 x4 of the segmented spoken word)
28. Audio Word2Vec: Sequence-to-sequence Autoencoder [Interspeech 2016]
• The RNN Encoder's output is taken as the vector representation of the audio segment
• Unsupervised / Self-supervised
– model learns from data itself
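A minimal PyTorch sketch of such a sequence-to-sequence autoencoder (the dimensions, GRU cells and zeroed decoder inputs are simplifications, not the paper's exact configuration):

```python
# Seq2seq autoencoder: encode an acoustic-feature sequence into one
# fixed-length vector, then reconstruct the sequence from it
import torch
import torch.nn as nn

class AudioWord2Vec(nn.Module):
    def __init__(self, feat_dim=39, hid_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, feat_dim)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        _, h = self.encoder(x)            # h: the fixed-length embedding
        dec_in = torch.zeros_like(x)      # teacher forcing omitted for brevity
        y, _ = self.decoder(dec_in, h)    # decoder conditioned on the embedding
        return self.out(y), h.squeeze(0)  # reconstruction + vector representation

model = AudioWord2Vec()
x = torch.randn(2, 50, 39)                # 2 toy segments, 50 frames, 39-dim MFCC
recon, vec = model(x)
loss = nn.functional.mse_loss(recon, x)   # self-supervised reconstruction loss
```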
29. What was learned? [Interspeech 2016]
• Sequential Phonetic structure! Not semantics
– learned hidden structure within speech signals is from isolated segmented spoken words only
– learning semantics from context of word sequences may be easier
– [figure] phonetically similar words cluster together, e.g., new / few, fear / near, fame / name, night / fight, and singular/plural pairs hand(s), word(s), thing(s), day(s), say(s)
30. Segmental Audio Word2Vec — extended to an utterance [Interspeech 2018]
• [figure] the acoustic features x1 … xt, xt+1 … xT of an utterance pass through encoders (ER) and decoders (DR); a Segmentation Gate detects word boundaries, and the recurrent state is reset to the initial state (0) at each boundary
• The utterance is thereby divided into Segment 1 … Segment N, each mapped to one embedding e1 … eN
• Each color block performs seq2seq training individually
31. Feature Disentanglement for Audio Word2Vec [SLT 2018]
• Audio signals include information irrelevant to semantics
– speaker and other acoustic information
• [figure] the input audio passes through two encoders: Encoder 1 yields a phonetic vector, Encoder 2 a speaker vector; the Decoder reconstructs the audio from the two vectors
33. Disentanglement of Speaker Information [SLT 2018]
• Phonetic Vectors without Speaker Information
• [figure] utterances from speaker 1 and speaker 2 each pass through a Phonetic Encoder and a Speaker Encoder before the Decoder; training keeps the speaker vectors far enough apart for different speakers
36. Phonetic-and-Semantic Embedding [SLT 2018]
• Phonetic Space vs. Semantic Space
– [figure] "Brother" lies close to "Bother" in the phonetic space but close to "Sister" in the semantic space
• Two types of information sometimes disturb each other
– aligning the word embeddings in the two spaces is challenging
• A unique ID for a text word in training
– unlimited number of audio realizations for a given text word
37. BERT (Bidirectional Encoder Representations from Transformers) [Devlin 2018]
• Some hidden structure within the language learned in an unsupervised way
– by estimating masked tokens from unlabeled text data
– the representations carry the context information
• Self-supervised learning: learns some hidden structure within the language from the dataset itself, without labels
– [figure] Transformer encoders map an input such as "How are [M] today ?" (with [M] = "you" masked) to representations, and pre-training estimates the masked token
• Useful in many different downstream tasks, achievable with much simpler models, smaller labeled datasets and faster convergence
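The masked-token estimation above can be reproduced with the Hugging Face transformers library (a real library, though not named on the slide); a minimal sketch:

```python
# Ask a pre-trained BERT to estimate a masked token
# (Hugging Face BERT writes the mask as [MASK] rather than the slide's [M])
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("How are [MASK] today?")[:3]:
    print(cand["token_str"], round(cand["score"], 3))   # "you" should rank high
```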
38. Self-supervised Learning for Text: BERT
• Pre-training: unsupervised, on unlabeled text data
• Example downstream task (1): slot filling
– input: "arrive Taipei on November 2nd"; output slot classes: other, dest, other, time, time
– a linear classifier (Linear + Cls) on each BERT output representation predicts the slot class of the corresponding word, with [CLS] prepended to the sentence w1 w2 w3
• Semantics considered
• Classifiers may have simpler models trained with smaller labeled datasets
39. Self-supervised Learning for Text: BERT
• Example downstream task (2): sentiment classification
– BERT (pre-trained on unlabeled text data) reads "[CLS] It's a nice day" and a downstream model classifies the [CLS] representation as positive or negative
• Semantics considered
• Simpler models trained with smaller labeled datasets
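A minimal sketch of such a downstream classifier on the [CLS] representation (the frozen-BERT setup and the tiny linear head are illustrative choices, not the slide's exact recipe):

```python
# Sentiment classifier on top of a frozen BERT [CLS] representation
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(bert.config.hidden_size, 2)     # simple downstream model

inputs = tok("It's a nice day", return_tensors="pt")
with torch.no_grad():                            # pre-trained BERT kept frozen
    cls = bert(**inputs).last_hidden_state[:, 0] # the [CLS] position
logits = head(cls)                               # train `head` on a small labeled set
print(logits.shape)                              # torch.Size([1, 2])
```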
40. Self-supervised Learning for Speech [IEEE JSTSP 2022]
• To learn the hidden structure within speech signals without considering any specific downstream task
• Phase 1 (Pre-training): a model trained on unlabeled data yields representations
– generative: mask the input signals and then reconstruct them
– predictive
– contrastive
41. Self-supervised Learning for Speech [IEEE JSTSP 2022]
• Phase 2 (Downstream): for a given downstream task (e.g., ASR mapping speech to "How are you?"), a downstream model is trained with labeled data on top of the pre-trained model
• With the hidden structure learned from the signals, the given task becomes easier
44. Mockingjay [ICASSP 2020]
• The model was able to reconstruct the Mel-spectrogram from hidden representations
– [figure] masked frames of the real Mel-spectrogram (Real) are recovered in the prediction (Pred) from the representations (Repr.)
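A minimal sketch of this generative objective, masking Mel-spectrogram frames and reconstructing them (the toy MLP stands in for Mockingjay's Transformer encoder; all dimensions are assumptions):

```python
# Generative self-supervised objective in the Mockingjay style:
# mask some spectrogram frames and train a model to reconstruct them
import torch
import torch.nn as nn

mel = torch.randn(4, 200, 80)                 # toy batch: 200 frames of 80-dim Mel
mask = torch.rand(4, 200) < 0.15              # mask ~15% of the frames
masked = mel.clone()
masked[mask] = 0.0                            # zero out the masked frames

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
pred = model(masked)                          # reconstruct every frame
loss = nn.functional.l1_loss(pred[mask], mel[mask])  # loss on masked frames only
loss.backward()
```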
52. Supervised vs. Self-supervised: ASR [Lüscher 2019] [Baevski 2020] [Hsu 2021] [Interspeech 2021]
• LibriSpeech
– [figure] WER bar chart contrasting supervised systems trained on 100 hours of labeled data (6-layer / 2-layer LSTM) with self-supervised systems using only 10 minutes of labeled data; values shown: 5.8, 2.9, 3.1, 4.8, 4.6 (Lüscher et al., Yang et al., Baevski et al., Hsu et al.)
53. SUPERB: SLT 2022 Challenge
• Welcome to Join: https://superbbenchmark.org/
56. Supervised/Unsupervised ASR
• Supervised ASR
– has been very successful
– problem: requires a huge quantity of annotated data, i.e., audio paired with text ("How are you.", "He thinks it's…", "Thanks for…")
• Unsupervised ASR
– trained without annotated data, from unpaired audio data and text data
– unlabeled, unpaired data are easier to collect
– thousands of languages are spoken over the world; most are low-resourced without enough annotated data
57. Use of Generative Adversarial Networks (GAN)
• Generator (ASR): maps acoustic features to generated phoneme sequences and tries to "fool" the Discriminator
• Discriminator: tries to distinguish real phoneme sequences from generated ones
• The two are trained iteratively, improving themselves individually
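A minimal sketch of this iterative adversarial training (toy dimensions and MLPs; the real generator, discriminator and phoneme-sequence handling are more elaborate):

```python
# Adversarial loop: generator maps acoustic features to phoneme posteriors;
# discriminator scores phoneme sequences as real (1) or generated (0)
import torch
import torch.nn as nn

N_PHONES, FEAT = 48, 39
G = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(),
                  nn.Linear(256, N_PHONES), nn.Softmax(dim=-1))
D = nn.Sequential(nn.Linear(N_PHONES, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

feats = torch.randn(8, 100, FEAT)                      # toy acoustic features
real = nn.functional.one_hot(                          # toy real phoneme text
    torch.randint(N_PHONES, (8, 100)), N_PHONES).float()

for _ in range(5):                                     # train iteratively
    # 1) Discriminator step: real -> 1, generated -> 0
    d_loss = bce(D(real).mean(1), torch.ones(8, 1)) + \
             bce(D(G(feats).detach()).mean(1), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Generator step: try to make the Discriminator output 1
    g_loss = bce(D(G(feats)).mean(1), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```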
58. Model 1 (2018) [Interspeech 2018]
• Waveform segmentation and embedding
– divide the acoustic features X1 X2 X3 … XM into acoustically similar segments of different lengths
– transform each segment into a fixed-length vector (audio embedding) with Audio Word2Vec, giving the audio embedding sequence z1 z2 z3 … zM
59. Model 1 (2018) [Interspeech 2018]
• Cluster the embeddings into groups with K-means
– the audio embedding sequence z1 z2 z3 … zM becomes a cluster index sequence, e.g., 16 25 2 … 2
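A minimal sketch of this quantization step with scikit-learn (the cluster count and embedding dimension are illustrative):

```python
# Quantize audio embeddings into a cluster index sequence with K-means
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.randn(500, 128)          # toy audio embeddings z1..zM
kmeans = KMeans(n_clusters=40, n_init=10).fit(embeddings)

utterance = embeddings[:6]                      # embeddings of one utterance
index_seq = kmeans.predict(utterance)           # e.g., [16 25 2 ... 2]
print(index_seq)
```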
60. Model 1 (2018) [Interspeech 2018]
• Learning the mapping between cluster indices and phonemes with a GAN
– embedding clustering followed by (cascaded with) a GAN
– Generator: a Lookup Table maps the cluster index sequence (16 25 2 … 2) to a generated phoneme sequence (sil hh ih … sil)
– Discriminator: a CNN network judging real / generated against real phoneme sequences
61. Model 2 (2019) [Interspeech 2019]
• A GAN (Generator and Discriminator) trained end-to-end on unpaired audio data and text data
– DNN trained in an unsupervised way
• Generator consists of two parts: (a) a Phoneme Classifier (DNN) and (b) a Sampling Process
• Discriminator is a two-layer 1-D CNN
62. The Progress of Supervised Learning on TIMIT [Keynote, Interspeech 2020]
• Milestones in phone recognition accuracy [Lopes, C. and Perdigão, F., "Phone recognition on the TIMIT database," Speech Technologies, Vol. 1, pp. 285-302, 2011]
– unsupervised learning (Model 2, 2019) is as good as supervised learning (HMM) was 30 years ago
– will it take another 30 years for unsupervised learning to achieve the performance of supervised learning today?
63. Unsupervised Speech Recognition [facebook 2021]
• LibriSpeech (2021)
– Word Error Rate as low as 5.9% with zero hours of annotated data!
– https://ai.facebook.com/blog/wav2vec-unsupervised-speech-recognition-without-supervision/
65. Unsupervised Speech Recognition [ICASSP 2022]
• Reasonable (though still relatively high) error rates achievable given
– good acoustic conditions (Libri) with reasonably different linguistic styles, or
– relatively poor acoustic conditions (SB) but the same linguistic styles
– will be lowered sooner or later…
– [figure] PER vs. 4-gram JSD for Libri960 and SB300_w2v2
67. A Target Application Example: Personalized Education Environment
• (revisiting slide 5) For each individual user
– user: "I wish to learn how machines can listen to human voice. I can spend 3 hrs to learn."
– system, drawing on information from the Internet: "This is the 3-hr personalized course for you. I'll be your personalized teaching assistant. Ask me when you have questions."
• Comprehension, Summarization and Question Answering for Spoken Content
– proper use of semantics in spoken content
71. End-to-end Spoken QA: DUAL [Interspeech 2022]
• Pre-trained with unlabeled audio/text corpora
– audio represented in HuBERT units (frame level)
– fine-tuned downstream with (question, passage, answer) sets
• Not limited by ASR errors or OOV words
– no ASR here
72. End-to-end Spoken QA: DUAL [Interspeech 2022]
• HuBERT-based speech encoder
– the pre-trained Speech Encoder maps the spoken question and passage to clustered units (e.g., 3 9 11 31 31)
• BERT pre-trained on text
– reads the unit sequences and finds the answer span (start, end)
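A sketch of how frame-level discrete units of this kind are commonly obtained, assuming the public facebook/hubert-base-ls960 checkpoint and a k-means quantizer (standard practice, but not necessarily the paper's exact recipe):

```python
# Turn speech into frame-level discrete units: HuBERT features + K-means
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
wav = torch.randn(1, 160000)                     # 10 s of toy 16 kHz audio
with torch.no_grad():
    feats = hubert(wav).last_hidden_state[0]     # (frames, 768) hidden features

# In practice the K-means codebook is fitted on features from a large corpus
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
units = kmeans.predict(feats.numpy())            # clustered unit sequence
print(units[:10])                                # e.g., [3 9 11 31 31 ...]
```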
73. End-to-end Spoken QA: DUAL [Interspeech 2022]
• ASR cascaded with Text QA: the baseline
– performance directly limited by WER (many errors due to OOV words)
• End-to-end Spoken QA (DUAL)
– performance independent of WER, because semantics are extracted from audio
– no ASR here; many answer spans include OOV words
– [figure] Frame-level F1 score (FF1) vs. Word Error Rate (WER) for ASR + Text QA and End-to-end DUAL
76. Textless Spoken QA with Speech Discrete Units: TlDu [submitted to SLT 2022]
• Audio represented in HuBERT units
• Retrieving relevant passages and identifying answers jointly handled
• No ASR transcriptions or errors
• Pipeline: (a) a Speech Signal Encoder turns the Spoken Archive and Spoken Question into discrete unit sequences; (b) a Spoken Content Retriever returns the top-K passages; (c) a Spoken Content Reader outputs the answer span
77. Preliminary Results for TlDu [submitted to SLT 2022]
• Retrieval Accuracy and SQA Accuracy reported
• Performance not limited by ASR errors or OOV
– no ASR here
– semantics extracted directly from audio (not words)
78. Beyond
• What's next? Don't know
• Depending on new technologies to be developed in the future
– by capable researchers in the future
– capable students in schools today (and in the future)
• How students can learn effectively and efficiently is important
– people asked: are you still teaching HMM and MFCC in your course today?
• Just to share my thoughts
– purely my personal imagination
79. Speech Technologies shown in a 1-dim Scale
• [figure] a scale running from "never change" (fundamental: mathematics, Fourier transform) through "changing slower" (signal processing, filter banks, pitch estimation) and "changing faster" to "changing very fast" (the cutting edge)
• The cutting edge keeps moving: LPC 40 years ago; ML, HMM, MFCC 20 years ago; NN, CNN, RNN, WFST, i-vector, DNN 10 years ago; GAN, Meta Learning, Self-supervised Learning today; "?" x years later
80. • Students today (researchers in the future) have to explore new knowledge and solve new problems in the future (20 years?)
• Primary Goal for the students to learn in school: to learn
– not how to do research today, but how to do it in the future
– not just to run deep learning packages today, which produces good results, papers and good jobs: those may become obsolete very soon
– if they are too focused on those deep learning packages, quick results and achievement, will it be possible that they become "overfitted" on these targets?
– may need to learn skills "more generalizable" to future technologies not existing today (unseen during training)… what are such skills?
81. Analogy 1: How do human babies learn the language?
• Listening to many voices! Reading many books!
• Then their parents teach them.
83. Analogy 1: How do human babies learn the language?
• May be kind of Self-Supervised Learning…
– self-supervised pre-training: a model learns some hidden structure of the language from unlabeled speech and unlabeled text
– downstream: the model then learns from limited labels
84. Analogy 2: Reading articles in a certain unknown language, generalizable to reading arbitrary articles in that language
• To start with
– learning the alphabet: a, b, c
– finding a dictionary for the language
– learning to look up the dictionary: <word>, <unknown>
– learning the basic words, function words, keywords: <this> <he> <is>
– learning the grammar
– these are systematic approaches to learning a new (unknown) language
• These systematic approaches are kind of based on some "hidden structure" of the language?
– having to do with pre-training?
– reading articles is some kind of downstream task?
85. Analogy 3: How did old-generation researchers face the Deep Learning era?
• We didn't learn deep learning in the early days
– we are working with it now
– we seem to have generalized skills learned earlier to handle the new knowledge of today?
• What did we learn in the early days?
– mathematics, programming, fundamentals (a, b, c)
– successful stories of the past (e.g., HMM, MFCC, although not useful any more today)
– these may include components generalizable to technologies useful today: the forward-backward algorithm, vector quantization, cepstral mean and variance normalization are still useful today, although in different contexts
– [figure] What? solutions; Why? principles; How? ideas
– have we gone through some kind of pre-training when we were young? are we specially lucky to survive the technology revolutions?
• Learning new technologies is the downstream task?
• We don't know, just to share some thoughts
87. Concluding Remarks
• Semantics of Speech yet to be explored
– plenty of unknown space
– may offer a bridge towards "a spoken version of Google"
• Self-supervised Learning for Speech
– pre-training with unlabeled data
– universally makes all downstream tasks easier
– various new technologies blooming
• x years ago we never knew what kind of technologies we could have today
– today we never know what kind of technologies we may have x years from now
– anything in our mind could be possible
• This is the golden age we never had for speech research
– very deep learning, very big data, very powerful machines, very strong industry, which we never had before
• Let's all treasure and enjoy this golden age!