From Semantics to Self-supervised Learning
for Speech and Beyond
Spoken Content Retrieval
• Spoken term detection
– to detect if a target term was spoken in any of the utterances in an audio dataset
– e.g., target term "COVID-19"; each utterance receives a similarity score
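One classical way to produce such similarity scores is template matching by dynamic time warping (DTW) between the acoustic features of the target term and each candidate segment. A minimal sketch, where the 1-D "features" are made up purely for illustration (real systems use MFCC-like frame vectors):

```python
import numpy as np

def dtw(a, b):
    """DTW distance between two feature sequences (rows = frames).
    Lower distance = better match for the target term."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-pair distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.array([[0.0], [1.0], [2.0]])         # the target term
match    = np.array([[0.0], [1.0], [1.0], [2.0]])  # similar, time-stretched
mismatch = np.array([[5.0], [5.0], [5.0]])
print(dtw(template, match) < dtw(template, mismatch))  # → True
```

DTW tolerates the time-stretching that makes two spoken instances of the same term differ in length.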
What can Spoken Content Retrieval do for us ?
• Google reads all text over the Internet
– can find any text over the Internet for the user
• All Roles of Text can be realized by Voice
• Machines can listen to all voices over the Internet
– can find any utterance over the Internet for the user
• A Spoken Version of Google
• Multimedia Content exponentially increasing over the Internet
– no human can go through it all, but Machines can
What can we do with a Spoken Version of Google ?
• Machines may be able to listen to and comprehend
the entire multimedia knowledge treasury over the
Internet
– extracting desired information for each individual user
– the unique treasury of the entire global human
knowledge is here
– desired information for each individual deeply buried
under huge quantities of unrelated information
• Nobody can go through so much multimedia
information
A Target Application Example : Personalized
Education Environment
• For each individual user
– user: "I wish to learn how machines can listen to human voice. I can spend 3 hrs to learn."
– system (with Information from Internet): "This is the 3-hr personalized course for you. I'll be your personalized teaching assistant. Ask me when you have questions."
• Comprehension, Summarization and Question
Answering for Spoken Content
– Proper use of semantics in spoken content
Probabilistic Latent Semantic Analysis (PLSA)
[Hofmann 1999]
• Unsupervised Topic Analysis from a text corpus
– each document Di is a mixture of latent topics Tk with probabilities P(Tk|Di); each topic generates terms tj with probabilities P(tj|Tk)
– Di: documents, Tk: latent topics, tj: terms
– equivalently written P(z|d) and P(w|z), with d: document, z: topic, w: word
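The two PLSA distributions P(z|d) and P(w|z) can be estimated by the EM algorithm on a term-document count matrix. A minimal numpy sketch on a made-up 4-document, 4-term toy matrix (not the corpus used in the slides):

```python
import numpy as np

# minimal PLSA via EM; N is a made-up term-document count matrix
# (documents x terms), K = 2 latent topics
rng = np.random.default_rng(0)
N = np.array([[4, 3, 0, 0],
              [5, 2, 1, 0],
              [0, 0, 3, 4],
              [1, 0, 2, 5]], dtype=float)
D, W, K = N.shape[0], N.shape[1], 2
Pz_d = rng.dirichlet(np.ones(K), size=D)   # P(z|d), one row per document
Pw_z = rng.dirichlet(np.ones(W), size=K)   # P(w|z), one row per topic

for _ in range(100):
    # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
    post = Pz_d[:, :, None] * Pw_z[None, :, :]   # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate both distributions from expected counts
    C = N[:, None, :] * post                     # expected counts (D, K, W)
    Pz_d = C.sum(axis=2)
    Pz_d /= Pz_d.sum(axis=1, keepdims=True)
    Pw_z = C.sum(axis=0)
    Pw_z /= Pw_z.sum(axis=1, keepdims=True)
```

After convergence each row of Pz_d shows which latent topics a document mixes, entirely unsupervised.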
Latent Dirichlet Allocation (LDA)
[Blei 2003]
• Unsupervised Topic Analysis from a text corpus
– topic-word distributions drawn from a Dirichlet prior: P(φk|β)
– per-document topic distributions drawn from a Dirichlet prior: P(θm|α)
– topic assignments P(z_m,n|θm) and words P(w_m,n|z_m,n, φk)
Clustering and Structuring the Spoken Content
Segments (Spoken Documents) Based on Topics
[Interspeech 2006]
• Global Semantic Structuring
• Local Semantic Structuring
– overall system: a Chinese Broadcast News Archive goes through Semantic Analysis into Global Semantic Structuring and Query-based Local Semantic Structuring, with Automatic Generation of Key terms, Titles and Summaries, and Information Retrieval driven by the User's query
– fine local structure around the user query on top of the global structure
• Example Approach : Spoken Documents categorized by
Layered Topics and organized in a Two-dimensional Tree
– topics nearby on the map are more related semantically
– each topic expanded into another map in the next layer
Global Semantic Structure of Spoken Content
[Eurospeech 2005]
• Broadcast News Browser (2006)
– each topic labeled by a set of key terms
An Example Screenshot of Global Semantic Structure
in a Two-dimensional Tree
[Interspeech 2006]
• Fine structure based on a user query
• Example Approach : Topic Hierarchy constructed with
key terms (Example Query: George Bush)
Local Semantic Structure of Spoken Content
[Interspeech 2006]
An Example Screenshot of Local Semantic Structure
in a Topic Hierarchy
[Interspeech 2006]
• Query: “White House of United States”
– some key terms under another key term on a higher level
• Spoken Knowledge in courses: in sequential form
– an individual user may need only a small part
– not understandable without listening to previous lectures
• Example Approach: Key Term Graph (2009)
– each spoken slide labeled by a set of key terms (topics)
– relationships between key terms represented by a graph
Online Courses : A Well Organized Spoken Knowledge
Treasury
[ICASSP 2009][IEEE Trans ASL 2014]
– each spoken slide (plus audio/video) labeled with key terms connected in a key term graph
– example key terms in the graph: Acoustic Modeling, Viterbi search, HMM, Language Modeling, Perplexity, …
• Temporal Structure
– chapters, sections, slides
Interconnection between Temporal and Semantic
Structures
[IEEE Trans ASL 2014]
• Semantic Structure
– key term graph
• Interconnection between the Two Structures
An Example Automatically Generated Key Term
Graph
[IEEE Trans ASL 2014]
• Relationship scores evaluated between each pair of key
terms
– an edge if exceeding a threshold
• User clicks a key term “entropy”
– possible learning path through the selected spoken slides
– related key terms via the key term graph
[ICASSP 2009][IEEE Trans ASL 2014]
An Example Screenshot from an Online Course Browser
Spoken Knowledge Structuring for an Example Online
Course
• Based on a course recorded in 2006
[ICASSP 2009] [IEEE Trans ASL 2014]
Thousands of Online Courses over the Internet
[Interspeech 2015]
• Machines listen to all online courses
– lectures with very similar content grouped together
– e.g., three courses on some similar topic, linked in a sequential order for learning (prerequisite conditions)
• Learning map for a given query
• More precise semantic analysis for speech needed
Hung-yi Lee (left) and Lin-shan Lee
Word Embeddings (Word2Vec) as Vector Representations for Words
[Mikolov 2013]
• Continuous bag of words (CBOW): predicting the word given its context
…… wi-1 ____ wi+1 ……
• Skip-gram: predicting the context given a word
…… ____ wi ____ ……
• Prediction based on some hidden structure within the language
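The skip-gram objective above can be sketched with negative sampling; the toy corpus, embedding size, learning rate, and sample counts are all illustrative assumptions, not the setup from the slides:

```python
import numpy as np

# minimal skip-gram with negative sampling on a made-up toy corpus
corpus = ("paris is the capital of france "
          "berlin is the capital of germany "
          "rome is the capital of italy").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # target-word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(50):                          # a few passes over the toy corpus
    for t, word in enumerate(corpus):
        wi = idx[word]
        for c in corpus[max(0, t - 2):t] + corpus[t + 1:t + 3]:  # window = 2
            ci = idx[c]
            g = sigmoid(W_in[wi] @ W_out[ci]) - 1.0   # pull true pair together
            grad_in = g * W_out[ci]
            W_out[ci] -= lr * g * W_in[wi]
            for ni in rng.integers(0, V, size=3):     # push random pairs apart
                gn = sigmoid(W_in[wi] @ W_out[ni])    # (toy negative sampling;
                grad_in = grad_in + gn * W_out[ni]    #  may hit true contexts)
                W_out[ni] -= lr * gn * W_in[wi]
            W_in[wi] -= lr * grad_in
```

Each row of W_in becomes the embedding of one word; the "hidden structure" is whatever makes context prediction succeed.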
Word Embeddings (Word2Vec)
[Mikolov 2013]
• Carry some Semantic Structure among Words
– semantic relationships are kind of "additive" or "parallel"
– V(Berlin) – V(Germany) + V(France) ≈ V(Paris)
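The "additive" relation can be illustrated with hand-crafted toy vectors; real Word2Vec embeddings are learned from large corpora, so both the coordinates and the 2-D dimensionality below are assumptions for illustration only:

```python
import numpy as np

# hand-crafted 2-D vectors: cities sit one unit "above" their countries
vec = {
    "Germany": np.array([1.0, 0.0]), "Berlin": np.array([1.0, 1.0]),
    "France":  np.array([2.0, 0.0]), "Paris":  np.array([2.0, 1.0]),
    "Italy":   np.array([3.0, 0.0]), "Rome":   np.array([3.0, 1.0]),
}

def nearest(query, exclude):
    # word whose vector has the highest cosine similarity to the query
    best, best_sim = None, -np.inf
    for w, v in vec.items():
        if w in exclude:
            continue
        sim = (v @ query) / (np.linalg.norm(v) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

q = vec["Berlin"] - vec["Germany"] + vec["France"]
print(nearest(q, exclude={"Berlin", "Germany", "France"}))  # → Paris
```

Because "capital-of" is encoded as a constant offset, vector arithmetic recovers the analogy.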
• CBOW network example: "Berlin is the capital city of Germany"
– Input Layer: the context words as 1-of-N vectors x2, x3, x5, x6 ("is", "the", "city", "of")
– Hidden Layer: shared projection matrix W (V × J)
– Output Layer: matrix W′ (J × V) predicts the center word "capital"
Word2Vec : Word Semantics learned from Text Context
• Shared common context in Big Data offers some hidden structure within the language
– "Paris is the capital city of France", "Rome is the capital city of Italy", "Berlin is the capital city of Germany"
– after Dimension Reduction, country words (Italy, France, Germany) and city words (Rome, Paris, Berlin) line up in parallel directions
– the embeddings carry some semantics
Audio Word2Vec : Sequence-to-sequence Autoencoder
[Interspeech 2016]
• RNN encoder and decoder
– the RNN Encoder reads the input acoustic features x1 x2 x3 x4 of an audio segment (a segmented spoken word) into a fixed-length vector
– the RNN Decoder reconstructs y1 y2 y3 y4 from that vector
– learn some hidden structure within signals
• The vector representation
• Unsupervised
• Self-supervised
– model learns from data itself
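The encoder half of this autoencoder can be sketched as a minimal Elman-style RNN; the weights here are untrained and random, and the 13-dim "MFCC" frames and 32-dim hidden state are illustrative assumptions. The point is that segments of different lengths all map to a vector of the same fixed length:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hid_dim = 13, 32   # e.g., 13 MFCC coefficients per frame (assumption)
Wx = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
Wh = rng.normal(scale=0.1, size=(hid_dim, hid_dim))

def encode(frames):
    h = np.zeros(hid_dim)
    for x in frames:                  # one recurrent step per acoustic frame
        h = np.tanh(Wx @ x + Wh @ h)
    return h                          # final state = fixed-length "audio word vector"

short = rng.normal(size=(20, feat_dim))   # a short spoken word
long_ = rng.normal(size=(55, feat_dim))   # a longer spoken word
e1, e2 = encode(short), encode(long_)     # both are 32-dim vectors
```

In training, a mirror-image decoder would reconstruct the frames from this vector, and the reconstruction loss shapes the encoder.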
What was learned ?
[Interspeech 2016]
• Sequential Phonetic
structure !
– not semantics
– learned hidden structure
within speech signals is
from isolated segmented
spoken words only
– learning semantics from
context of word sequences
may be easier
– e.g., embeddings of new/few, fear/near, fame/name, night/fight, hand/hands, word/words, thing/things, day/days, say/says end up close to each other
Segmental Audio Word2Vec — extended to an utterance
[Interspeech 2018]
• a Segmentation Gate estimates word boundaries within the utterance x1 … xT; the encoder state is reset to its initial state at each boundary
• each resulting segment (Segment 1 … Segment N) is encoded into an embedding e1 … eN; each color block performs seq2seq training individually
Feature Disentanglement for Audio Word2Vec
[SLT 2018]
• Audio Signals include information irrelevant to semantics
– speaker and other acoustic information
• the input audio is encoded by two encoders: a Phonetic Encoder (phonetic vector) and a Speaker Encoder (speaker vector); the Decoder reconstructs the audio from both
Disentanglement of Speaker Information
[SLT 2018]
• similar speaker vectors for the same speaker
• speaker vectors far apart enough for different speakers
• Phonetic Vectors without Speaker Information
Skip-gram for Text Words
• Predicting the context wt-2, wt-1, wt+1, wt+2 given a word wt
…… ____ wi ____ ……
• One-hot vectors used to train semantic embeddings through linear layers
– the hidden representations : Semantic Embeddings
Audio Skip-gram : Phonetic-and-Semantic Embedding
[SLT 2018]
• Predicting the context (wt-2, wt-1, wt+1, wt+2) given an audio segment wt
• Phonetic Vectors from the Phonetic Encoder used as input, with 2 hidden layers
– the resulting vectors : Phonetic-and-semantic Embeddings
Phonetic-and-Semantic Embedding
[SLT 2018]
• Phonetic Space vs. Semantic Space
– e.g., Brother/Bother close in the phonetic space, Brother/Sister close in the semantic space
• Two types of information sometimes disturb each other
– aligning the word embeddings in the two spaces is challenging
• A unique ID for a text word in training
– unlimited number of audio realizations for a given text word
BERT (Bidirectional Encoder Representations from
Transformers)
• Some hidden structure within the language learned in an
unsupervised way
– by estimating masked token from unlabeled text data
– the representations carry the context information
[Devlin 2018]
• Useful in many different downstream tasks achievable with
much simpler models, smaller labeled datasets and faster
convergence
• Transformer Encoders produce Representations R for the input tokens A B C D E
– e.g., input "How are [M] today ?" with "you" masked; estimating the masked token gives [M] = you
• Self-supervised learning: learns some hidden structure within the language from the dataset itself without labels
– Pre-training on unlabeled text data (unsupervised), then Downstream tasks
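The input-corruption step of this pre-training can be sketched as below; the 15% rate follows the usual BERT recipe, while the example sentence and the mask-only strategy are simplifications (real BERT also sometimes substitutes random tokens or keeps the original):

```python
import random

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace ~rate of the tokens with [MASK]; the originals become labels."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    n = max(1, int(len(tokens) * rate))
    for pos in rng.sample(range(len(tokens)), n):
        targets[pos] = tokens[pos]   # the original token is the training label
        masked[pos] = "[MASK]"
    return masked, targets

masked, targets = mask_tokens("how are you doing today my friend ?".split())
```

The model never needs human labels: the hidden words themselves are the supervision signal.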
Self-supervised Learning for Text : BERT
• Example downstream task (1): slot filling
– input sentence "arrive Taipei on November 2nd" → output slot classes "other dest other time time"
– a simple Linear classifier on each BERT representation ([CLS] w1 w2 w3 …) predicts the slot class
• Semantics considered
• Classifiers may have simpler models trained with smaller labeled datasets
Self-supervised Learning for Text : BERT
• Example downstream task (2): sentiment classification
– input "[CLS] It's a nice day" → a Downstream Model on the [CLS] representation predicts positive/negative (here: Positive)
• Semantics considered
• Simpler models trained with smaller labeled datasets
Self-supervised Learning for Speech
[IEEE JSTSP 2022]
• Phase 1: Pre-training: a Pre-trained Model learns representations from Unlabeled Data
• To learn the hidden structure within speech signals without considering any specific downstream task
• Three families of objectives
– Generative: mask the input signals and then reconstruct them
– Predictive
– Contrastive
Self-supervised Learning for Speech
[IEEE JSTSP 2022]
• Phase 2: Downstream: for a given downstream task (e.g., ASR, "How are you?"), a Downstream Model is trained on the pre-trained representations with labelled data
• With the hidden structure learned from the signals, the given task becomes easier
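The downstream phase can be sketched as a small linear classifier ("probe") trained on top of frozen representations. Here random synthetic vectors stand in for real pre-trained speech representations, and the labels are made up; only the two-phase pattern is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))              # frozen upstream representations
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy downstream labels (assumption)

# Phase 2: train only this tiny logistic-regression probe; X stays fixed
w, b, lr = np.zeros(64), 0.0, 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    g = p - y                               # gradient of the cross-entropy loss
    w -= lr * (X.T @ g) / len(y)
    b -= lr * g.mean()
acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
```

If the pre-trained representations already encode the relevant structure, such a small probe with little labeled data is enough.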
Mockingjay
[ICASSP 2020]
• Speech version of BERT (text token → signal frame)
– frame level
– no segmentation problem any more (same as ASR)
• Masked Frames of the Mel-spectrogram reconstructed from the hidden representations (Repr.)
– the model was able to reconstruct the spectrogram from hidden representations (Pred vs. Real)
[IEEE JSTSP 2022]
• directions of progress: more data, larger models, more objectives; word-level and frame-level representations
Speech processing Universal PERformance Benchmark (SUPERB)
[Interspeech 2021] [ACL 2022]
• Presented at INTERSPEECH 2021 (https://arxiv.org/abs/2105.01051) and ACL 2022 (https://arxiv.org/abs/2203.06849)
• https://superbbenchmark.org/
• Five categories of downstream tasks: phonetic, speaker, paralinguistic, semantic, and synthesis
PR KS ASR QbE SID ASV SD IC SF ER
fbank 82.01 8.63 15.21 0.0058 8.50E-04 9.56 10.05 9.1 69.64 35.39
PASE+ 58.87 82.54 16.62 0.0072 37.99 11.61 8.68 29.82 62.14 57.86
APC 41.98 91.01 14.74 0.0310 60.42 8.56 10.53 74.69 70.46 59.33
VQ-APC 41.08 91.11 15.21 0.0251 60.15 8.72 10.45 74.48 68.53 59.66
NPC 43.81 88.96 13.91 0.0246 55.92 9.40 9.34 69.44 72.79 59.08
Mockingjay 70.19 83.67 15.48 6.60E-04 32.29 11.66 10.54 34.33 61.59 50.28
TERA 49.17 89.48 12.16 0.0013 57.57 15.89 9.96 58.42 67.50 56.27
DeCoAR 2.0 14.93 94.48 9.07 0.0406 74.42 7.16 6.59 90.80 83.28 62.47
modified CPC 42.54 91.88 13.53 0.0326 39.63 12.86 10.38 64.09 71.19 60.96
wav2vec 31.58 95.59 11.00 0.0485 56.56 7.99 9.90 84.92 76.37 59.79
vq-wav2vec 33.48 93.38 12.80 0.0410 38.80 10.38 9.93 85.68 77.68 58.24
wav2vec 2.0 base 5.74 96.23 4.79 0.0233 75.18 6.02 6.08 92.35 88.30 63.43
wav2vec 2.0 large 4.75 96.66 3.10 0.0489 86.14 5.65 5.62 95.28 87.11 65.64
HuBERT base 5.41 96.30 4.79 0.0736 81.42 5.11 5.88 98.34 88.53 64.92
HuBERT large 3.53 95.29 2.94 0.0353 90.33 5.98 5.75 98.76 89.81 67.62
(columns grouped into Phonetic, Speaker, Semantic, and Emotion tasks)
Initial Test Results of Round 2
[Interspeech 2021]
• fbank as the baseline
• Self-supervised representations outperformed fbank in most
cases
• Black blocks: worse than fbank baseline
• Several self-supervised models are all-around
– good hidden structure generalizable to all tasks
• Any one good in all tasks?
Supervised vs. Self-supervised: ASR
[Lüscher 2019] [Baevski 2020] [Hsu 2021] [Interspeech 2021]
• WER on LibriSpeech (chart values: 5.8, 2.9, 3.1, 4.8, 4.6)
– supervised: 100 hours of labeled data, 6-layer LSTM (Lüscher, et al.)
– self-supervised: as little as 10 minutes of labeled data, 2-layer LSTM (Yang, et al.; Baevski, et al.; Hsu, et al.)
Welcome to Join!
• https://superbbenchmark.org/
• SLT 2022 Challenge
• SUPERB: 15 Downstream Tasks, 16+ Upstream models
• Toolkit – S3PRL (https://github.com/s3prl/s3prl)
More Applications…
Example 1:
Unsupervised ASR
How are you.
He thinks it’s…
Thanks for…
Data
(Audio)
Unsupervised ASR
Data
(Text)
Supervised/Unsupervised ASR
• Supervised ASR
– Has been very successful
– Problem : requiring a huge quantity of annotated data
• Unsupervised ASR
– Train without annotated data
– unlabeled, unpaired data are easier to collect
– thousands of languages spoken over the world; most are low-resourced without enough annotated data
How are you.
He thinks it’s…
Thanks for…
Data
(Annotated)
Supervised ASR
Generator (ASR)
Tries to “fool” Discriminator
Discriminator
Tries to distinguish real or
generated phoneme sequence.
Acoustic Features
Generated
Phoneme Sequences
Real / Generated
Real / Generated
Phoneme Sequences
Train
Iteratively
Use of Generative Adversarial Networks (GAN)
• Discriminator / Generator improve themselves individually
and iteratively
• Generative Adversarial Network (GAN)
Model 1 (2018)
• Acoustic Features X1, X2, X3, …, XM transformed by Audio word2vec into an audio embedding sequence z1, z2, z3, …, zM
• Waveform segmentation and embedding
– divide the features into acoustically similar segments of different lengths
– transform each segment into a fixed-length vector (audio embedding)
[Interspeech 2018]
Model 1 (2018)
• the embeddings z1 z2 z3 … zM mapped by K-means to a cluster index sequence (e.g., 16 25 2 … 2)
• Cluster the embeddings into groups
[Interspeech 2018]
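This quantization step can be sketched with a plain K-means loop; the 2-D "embeddings" in three well-separated groups below are a made-up stand-in for real audio embeddings:

```python
import numpy as np

# toy version of the clustering step: K-means turns each embedding z_t into a
# discrete cluster index, giving the cluster index sequence for the utterance
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(loc, 0.1, size=(10, 2))
               for loc in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])

K = 3
centers = z[rng.choice(len(z), size=K, replace=False)].copy()
for _ in range(20):
    # assign each embedding to its nearest center
    labels = np.argmin(((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # move each center to the mean of its assigned embeddings
    for k in range(K):
        if (labels == k).any():
            centers[k] = z[labels == k].mean(axis=0)
# "labels" is now the discrete index sequence fed to the GAN stage
```

The discrete indices make the audio side look like a symbol sequence, which is what lets a GAN match it against real phoneme sequences.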
Model 1 (2018)
• Generator: a Lookup Table maps the cluster index sequence (16 25 2 … 2) to a generated phoneme sequence
• Discriminator: a CNN network outputs Real / Generated, comparing against real phoneme sequences
• Learning the mapping between cluster indices and
phonemes with a GAN
– embedding clustering followed by (cascaded with) a GAN
[Interspeech 2018]
• Generated phoneme sequence, e.g., sil … hh ih sil
• Generator consists of two parts
(a) Phoneme Classifier (DNN)
(b) Sampling Process
• Discriminator is a two-layer 1-D CNN
• trained on unpaired Data (Audio) and Data (Text)
Model 2 (2019)
• A GAN (Generator and Discriminator) trained
End-to-end
– DNN trained in an unsupervised way
[Interspeech 2019]
• Unsupervised learning (Model 2, 2019) is as good in accuracy as supervised learning (HMM) 30 years ago
The Progress of Supervised Learning on TIMIT
• Milestones in phone recognition accuracy
[Phone recognition on the TIMIT database, Lopes, C. and Perdigão, F., 2011. Speech
Technologies, Vol 1, pp. 285--302.]
– Will it take another 30 years for unsupervised learning to achieve the
performance of supervised learning today ?
[Keynote, Interspeech 2020]
Unsupervised Speech Recognition
• Librispeech (2021)
https://ai.facebook.com/blog/wav2vec-unsupervised-speech-recognition-without-supervision/
– Word Error Rate as low as 5.9% with zero hr of annotated
data !
[facebook 2021]
Unsupervised Speech Recognition
[ICASSP 2022]
• Audio/Text unlabeled corpora respectively extended to different styles/domains
– audio: SwitchBoard (telephone conversation), TED-LIUM (live talk), Librispeech (read literature)
– text: LibriLM (literature), Wiki (encyclopedia), NewsCrawl (news), ImageC (image caption)
Unsupervised Speech Recognition
[ICASSP 2022]
• Reasonable (relatively high) Error Rate achievable if
– good acoustics conditions (Libri) with reasonably
different linguistic styles
– relatively poor acoustic conditions (SB) but with same
linguistic styles
– will be lowered sooner or later…
(chart: PER and 4-gram JSD for Libri960 and SB300_w2v2)
More Applications…
Example 2:
End-to-end
Spoken Question Answering
A Target Application Example : Personalized
Education Environment
• For each individual user
– user: "I wish to learn how machines can listen to human voice. I can spend 3 hrs to learn."
– system (with Information from Internet): "This is the 3-hr personalized course for you. I'll be your personalized teaching assistant. Ask me when you have questions."
• Comprehension, Summarization and Question
Answering for Spoken Content
– Proper use of semantics in spoken content
Question Answering
• Machine answering questions from the user
– question → answer, from a Knowledge source of unstructured documents / spoken content (passages), accessed through a search engine
• Initial Work
– assuming the passages are given
Text v.s. Spoken QA (Cascading v.s. End-to-end)
[Interspeech 2020]
• Text QA: Question + Retrieved Text → Question Answering → Answer
• Spoken QA, Cascading: Spoken Content → Speech Recognition (ASR) → Question Answering → Answer (ASR Errors propagate)
• Spoken QA, End-to-end: Question + Spoken Content → End-to-end Spoken Question Answering → Answer
End-to-end Spoken QA : DUAL
• Pre-trained with unlabeled audio/text corpora
– audio represented in HuBERT units (frame level)
– fine-tuned in downstream by (question, passage, answer) sets
• Not limited by ASR Errors or OOV
– no ASR here [Interspeech 2022]
End-to-end Spoken QA: DUAL
• HuBERT-based speech encoder
• BERT pre-trained on text
[Interspeech 2022]
– the question and the passage are encoded by the Pre-trained Speech Encoder into clustered units (e.g., 3 9 11 31 31)
– BERT (pre-trained on text) reads the units and finds the answer span (start, end)
End-to-end Spoken QA: DUAL
• ASR cascaded with Text QA: Baseline
– performance directly limited by WER (many errors due to OOV)
• End-to-end Spoken QA (DUAL)
– performance independent of WER, because semantics extracted from
audio
• No ASR here
– many answer spans include OOV
[Interspeech 2022]
(chart: Frame-level F1 score (FF1) vs. Word Error Rate (WER), comparing ASR + Text QA against End-to-end DUAL)
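The "Find Ans Span" step on top of BERT can be sketched as picking the best-scoring (start, end) pair; the logits below are made-up numbers for illustration:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=10):
    """Highest-scoring span with start <= end and bounded length."""
    best, best_score = (0, 0), -np.inf
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            if s_score + end_logits[e] > best_score:
                best_score = s_score + end_logits[e]
                best = (s, e)
    return best

start = np.array([0.1, 2.0, 0.3, 0.2, 0.1])   # per-position start scores
end   = np.array([0.0, 0.1, 1.5, 3.0, 0.2])   # per-position end scores
print(best_span(start, end))  # → (1, 3)
```

In DUAL the positions index discrete speech units rather than words, so the predicted span is a stretch of audio.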
Question Answering
• Machine answering questions from the user
– now retrieval and question answering jointly solved (no longer assuming the passages are given)
– question → answer, from a Knowledge source of unstructured documents / spoken content (passages), accessed through a search engine
Textless Spoken QA with Speech Discrete Units: TlDu
[submitted to SLT 2022]
• Audio represented in HuBERT units
• Retrieving relevant passages and identifying answers jointly handled
• No ASR transcriptions or errors
– (a) Speech Signal Encoder: the Spoken Archive and the Spoken Question encoded into discrete unit sequences
– (b) Spoken Content Retriever: retrieves the Top-K passages
– (c) Spoken Content Reader: extracts the Answer span
Preliminary Results for TlDu
• Retrieval Accuracy • SQA Accuracy
• Performance not limited by ASR errors or OOV
– No ASR here
– semantics extracted directly from audio (not words)
[submitted to SLT 2022]
• What’s next?
– don’t know
• Depending on new technologies to be developed in the
future
– by capable researchers in the future
– capable students in schools today (and in the future)
• How students can learn effectively and efficiently is
important
– people asked : are you still teaching HMM and MFCC in
your course today?
• Just to share my thoughts
– purely my personal imagination
Beyond
Speech Technologies shown in a 1-dim Scale
• a scale from "changing very fast" (cutting-edge) to "never change" (fundamental)
– cutting-edge, changing very fast: 40 years ago LPC, pitch estimation, filter banks; 20 years ago ML, HMM, MFCC; 10 years ago NN, CNN, RNN, WFST, i-vector, DNN; today GAN, Self-supervised Learning, Meta Learning; x years later "?"
– changing slower: signal processing
– never change: Fourier transform, mathematics
• Students today (researchers in the future) have to explore new knowledge and solve new problems in the future (20 years?)
• Primary Goal for the students to learn in school: to learn
– not how to do research today, but how to do it in the future
– not just to run deep learning packages today, which produces good results, papers and brings good jobs: those may become obsolete very soon
– if they are too much focused on those deep learning packages, quick results and achievement, will it be possible that they may be "overfitted" on these targets?
– may need to learn skills "more generalizable" to future technologies not existing today (unseen during training)… what are such skills ?
Reading many books!
Listening to many voices!
Then their parents
teach them.
Analogy 1:
How do human babies learn the language ?
• may be kind of Self-Supervised Learning…
– Self-Supervised Pre-training: a Model learns some hidden structure of the language from unlabeled data (Unlabeled Speech, Unlabeled Text)
– Self-supervised Downstream: the Model then learns from Input with limited Labels
• To start with
– learning the alphabet: a, b, c
– finding a dictionary for the language
– learning to look up through the dictionary: <word>, <unknown>
– learning the basic words, function words, keywords:
<this> <he> <is>
– learning the grammar
– these are systematic approaches to learn a new (unknown) language
• These systematic approaches are kind of based on
some “hidden structure” for the language?
– having to do with pre-training ?
– reading articles is some kind of downstream ?
Analogy 2:
Reading articles in a certain unknown language
generalizable to reading arbitrary articles in that
language
• We didn’t learn deep learning in early days
– we are working with it now
– We seemed to have generalized our skills learned earlier to handle the
new knowledgies today ?
Analogy 3:
How old generation researchers faced the Deep
Learning era ?
• What did we learn in early days ?
– mathematics, programming, fundamentals (a, b, c)
– successful stories in the past
– may include components generalizable to technologies useful today ?
– are they specially lucky to survive the technology revolutions?
– have we gone through some kind of pre-training when we were young?
• Learning new technologies is the downstream task ?
• We don’t know, just to share some thoughts
– What ? (solutions), Why ? (ideas), How ? (principles)
– (e.g. HMM, MFCC, although not useful any more today)
– forward-backward algorithm, vector quantization, cepstral mean and variance normalization still useful today, although in different contexts
Concluding Remarks
• Semantics of Speech yet to be explored
– plenty of unknown space
– may offer a bridge towards “a spoken version of Google”
• Self-supervised learning for Speech
– pre-training with unlabeled data
– universally make all downstream tasks easier
– various new technologies blooming
• x years ago we never knew what kind of technologies
we could have today
– today we never know what kind of technologies we may have x
years from now
– anything in our mind could be possible
• This is a golden age for speech research
– very deep learning, very big data, very powerful machines, very strong industry
– which we never had before
• Let’s all treasure and enjoy this golden age !
 
wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep ...
Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep ...Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep ...
Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep ...
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflow
 

More from linshanleearchive

星雲教育獎頒獎典禮手冊
星雲教育獎頒獎典禮手冊星雲教育獎頒獎典禮手冊
星雲教育獎頒獎典禮手冊linshanleearchive
 
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdflinshanleearchive
 
新科學創造新文明 Part 2
新科學創造新文明 Part 2新科學創造新文明 Part 2
新科學創造新文明 Part 2linshanleearchive
 
新科學創造新文明 Part 1
新科學創造新文明 Part 1新科學創造新文明 Part 1
新科學創造新文明 Part 1linshanleearchive
 
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論linshanleearchive
 
2022 國際語音學會科學成就獎章得獎致詞
2022 國際語音學會科學成就獎章得獎致詞2022 國際語音學會科學成就獎章得獎致詞
2022 國際語音學會科學成就獎章得獎致詞linshanleearchive
 
琳山老師榮退感言.pptx
琳山老師榮退感言.pptx琳山老師榮退感言.pptx
琳山老師榮退感言.pptxlinshanleearchive
 
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」linshanleearchive
 
芝麻開門:語音技術的前世今生
芝麻開門:語音技術的前世今生芝麻開門:語音技術的前世今生
芝麻開門:語音技術的前世今生linshanleearchive
 
Towards A Spoken Version of Google
Towards A Spoken Version of GoogleTowards A Spoken Version of Google
Towards A Spoken Version of Googlelinshanleearchive
 
2016《華語語音辨識研究的先驅者》科學月刊專訪
2016《華語語音辨識研究的先驅者》科學月刊專訪2016《華語語音辨識研究的先驅者》科學月刊專訪
2016《華語語音辨識研究的先驅者》科學月刊專訪linshanleearchive
 
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪linshanleearchive
 
2017《推動產業轉型 大學必修課程先鬆綁》
2017《推動產業轉型 大學必修課程先鬆綁》2017《推動產業轉型 大學必修課程先鬆綁》
2017《推動產業轉型 大學必修課程先鬆綁》linshanleearchive
 
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談linshanleearchive
 
芝麻開門 - 語音技術的前世今生
芝麻開門 - 語音技術的前世今生芝麻開門 - 語音技術的前世今生
芝麻開門 - 語音技術的前世今生linshanleearchive
 
Spoken Content Retrieval - Lattices and Beyond
Spoken Content Retrieval - Lattices and BeyondSpoken Content Retrieval - Lattices and Beyond
Spoken Content Retrieval - Lattices and Beyondlinshanleearchive
 
105-08-17 輕舟已過萬重山
105-08-17 輕舟已過萬重山105-08-17 輕舟已過萬重山
105-08-17 輕舟已過萬重山linshanleearchive
 

More from linshanleearchive (19)

星雲教育獎頒獎典禮手冊
星雲教育獎頒獎典禮手冊星雲教育獎頒獎典禮手冊
星雲教育獎頒獎典禮手冊
 
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
國立臺灣大學電機資訊學院學術貢獻獎設置辦法.pdf
 
新科學創造新文明 Part 2
新科學創造新文明 Part 2新科學創造新文明 Part 2
新科學創造新文明 Part 2
 
新科學創造新文明 Part 1
新科學創造新文明 Part 1新科學創造新文明 Part 1
新科學創造新文明 Part 1
 
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
2013《無涯學海渡扁舟:課本論文中不曾討論的電機資訊經驗談》 電機系大學部專題討論
 
2022 國際語音學會科學成就獎章得獎致詞
2022 國際語音學會科學成就獎章得獎致詞2022 國際語音學會科學成就獎章得獎致詞
2022 國際語音學會科學成就獎章得獎致詞
 
琳山老師榮退感言.pptx
琳山老師榮退感言.pptx琳山老師榮退感言.pptx
琳山老師榮退感言.pptx
 
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
2021《芝麻開門——語音的聲音開啟人類文明的無限空間》台大科學教育中心「探索科學講座」
 
芝麻開門:語音技術的前世今生
芝麻開門:語音技術的前世今生芝麻開門:語音技術的前世今生
芝麻開門:語音技術的前世今生
 
Spoken Content Retrieval
Spoken Content RetrievalSpoken Content Retrieval
Spoken Content Retrieval
 
Towards A Spoken Version of Google
Towards A Spoken Version of GoogleTowards A Spoken Version of Google
Towards A Spoken Version of Google
 
輕舟已過萬重山
輕舟已過萬重山輕舟已過萬重山
輕舟已過萬重山
 
2016《華語語音辨識研究的先驅者》科學月刊專訪
2016《華語語音辨識研究的先驅者》科學月刊專訪2016《華語語音辨識研究的先驅者》科學月刊專訪
2016《華語語音辨識研究的先驅者》科學月刊專訪
 
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
2017《推動產業轉型 大學必修課程先鬆綁》自由時報星期專訪
 
2017《推動產業轉型 大學必修課程先鬆綁》
2017《推動產業轉型 大學必修課程先鬆綁》2017《推動產業轉型 大學必修課程先鬆綁》
2017《推動產業轉型 大學必修課程先鬆綁》
 
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
無涯學海渡扁舟 - 課本論文中不曾討論的電機資訊經驗談
 
芝麻開門 - 語音技術的前世今生
芝麻開門 - 語音技術的前世今生芝麻開門 - 語音技術的前世今生
芝麻開門 - 語音技術的前世今生
 
Spoken Content Retrieval - Lattices and Beyond
Spoken Content Retrieval - Lattices and BeyondSpoken Content Retrieval - Lattices and Beyond
Spoken Content Retrieval - Lattices and Beyond
 
105-08-17 輕舟已過萬重山
105-08-17 輕舟已過萬重山105-08-17 輕舟已過萬重山
105-08-17 輕舟已過萬重山
 

Recently uploaded

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonJericReyAuditor
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 

Recently uploaded (20)

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lesson
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 

From Semantics to Self-supervised Learning for Speech and Beyond (Opening Keynote, Interspeech 2022)

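The unsupervised topic analysis described on the PLSA/LDA slides can be sketched with scikit-learn's LDA implementation; the toy corpus and topic count below are illustrative only, not from the talk.

```python
# Minimal sketch of unsupervised topic analysis in the spirit of PLSA/LDA.
# Assumes scikit-learn is available; corpus and topic count are toy values.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market trading prices rise",
    "market prices fall stock investors",
    "team wins game season players",
    "players score game team coach",
]

# Term-document counts: the factors P(t_j | T_k) and P(T_k | D_i) are fit from these.
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # rows ~ P(z | d), one per document
topic_word = lda.components_       # unnormalized topic-word weights ~ P(w | z)

print(doc_topic.shape)   # (4, 2)
```

Each row of `doc_topic` is a normalized topic mixture for one document, which is exactly the per-document latent structure the slides describe.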
  • 8. Clustering and Structuring the Spoken Content Segments (Spoken Documents) Based on Topics [Interspeech 2006] • Global Semantic Structuring • Local Semantic Structuring Chinese Broadcast News Archive Semantic Analysis Global Semantic Structuring Query-based Local Semantic Structuring Automatic Generation of Key terms, Titles and Summaries Information Retrieval User’s query – fine local structure around the user query on top of the global structure
  • 9. • Example Approach : Spoken Documents categorized by Layered Topics and organized in a Two-dimensional Tree – topics nearby on the map are more related semantically – each topic expanded into another map in the next layer Global Semantic Structure of Spoken Content [Eurospeech 2005]
  • 10. • Broadcast News Browser (2006) – each topic labeled by a set of key terms An Example Screenshot of Global Semantic Structure in a Two-dimensional Tree [Interspeech 2006]
  • 11. Clustering and Structuring the Segments of Spoken Content (Spoken Documents) Based on Topics [Interspeech 2006] • Global Semantic Structuring • Local Semantic Structuring Chinese Broadcast News Archive Semantic Analysis Global Semantic Structuring Query-based Local Semantic Structuring Automatic Generation of Key terms, Titles and Summaries Information Retrieval User’s query – fine local structure around the user query on top of the global structure
  • 12. • Fine structure based on a user query • Example Approach : Topic Hierarchy constructed with key terms (Example Query: George Bush) Local Semantic Structure of Spoken Content [Interspeech 2006]
  • 13. An Example Screenshot of Local Semantic Structure in a Topic Hierarchy [Interspeech 2006] • Query: “White House of United States” – some key terms under another key term on a higher level
  • 14. Online Courses : A Well Organized Spoken Knowledge Treasury [ICASSP 2009][IEEE Trans ASL 2014] • Spoken Knowledge in courses: in sequential form – an individual user may need only a small part – not understandable without listening to previous lectures • Example Approach: Key Term Graph (2009) – each spoken slide (plus audio/video) labeled by a set of key terms (topics) – relationships between key terms (e.g., Acoustic Modeling, Viterbi search, HMM, Language Modeling, Perplexity) represented by a graph
  • 15. • Temporal Structure – chapters, sections, slides Interconnection between Temporal and Semantic Structures [IEEE Trans ASL 2014] • Semantic Structure – key term graph • Interconnection between the Two Structures
  • 16. An Example Automatically Generated Key Term Graph [IEEE Trans ASL 2014] • Relationship scores evaluated between each pair of key terms – an edge if exceeding a threshold
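The key-term-graph construction on this slide (score every pair of key terms, add an edge when the score exceeds a threshold) can be sketched as follows; co-occurrence across slides is a simplifying stand-in for the actual relationship score, and the terms, slides, and threshold are illustrative.

```python
# Sketch of key-term-graph construction: score each pair of key terms and
# add an edge whenever the score exceeds a threshold. Co-occurrence is an
# assumed scoring function; the slide sets and threshold are toy values.
from itertools import combinations

slides = [
    {"HMM", "Acoustic Modeling", "Viterbi search"},
    {"Acoustic Modeling", "Viterbi search"},
    {"Language Modeling", "Perplexity"},
    {"HMM", "Language Modeling"},
]

def relationship_score(a, b):
    # fraction of slides in which both key terms appear
    return sum(1 for s in slides if a in s and b in s) / len(slides)

terms = sorted(set().union(*slides))
threshold = 0.2
edges = [(a, b) for a, b in combinations(terms, 2)
         if relationship_score(a, b) > threshold]
print(edges)
```

The resulting edge list is the graph a browser can use to suggest related key terms when the user clicks one.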
  • 17. An Example Screenshot from an Online Course Browser [ICASSP 2009][IEEE Trans ASL 2014] • User clicks a key term “entropy” – possible learning path through the selected spoken slides – related key terms via the key term graph
  • 18. Spoken Knowledge Structuring for an Example Online Course • Based on a course recorded in 2006 [ICASSP 2009] [IEEE Trans ASL 2014]
  • 19. Thousands of Online Courses over the Internet Lectures with very similar content [Interspeech 2015] • Machines listen to all online courses three courses on some similar topic
  • 20. sequential order for learning (prerequisite conditions) three courses on some similar topic [Interspeech 2015] Thousands of Online Courses over the Internet • Machines listen to all online courses • Learning map for a given query • More precise semantic analysis for speech needed
  • 21. Hung-yi Lee (left) and Lin-shan Lee
  • 22. Word Embeddings (Word2Vec) as Vector Representations for Words [Mikolov 2013] • Continuous bag of words (CBOW): predicting the word wi given its context (…… wi-1 ____ wi+1 ……) • Skip-gram: predicting the context given a word • Prediction based on some hidden structure within the language
  • 23. Word Embeddings (Word2Vec) [Mikolov 2013] • Carry some Semantic Structure among Words – the semantic relationship is kind of “additive” or “parallel”: V(Berlin) – V(Germany) + V(France) ≈ V(Paris)
  • 24. Word2Vec : Word Semantics learned from Text Context – a network with an input layer (1-of-N context vectors, W of size V×J), a hidden layer, and an output layer (W′ of size J×V) predicts the word “capital” from the context “Berlin is the ___ city of Germany”
  • 25. Word2Vec : Word Semantics learned from Text Context – “Paris is the capital city of France”, “Rome is the capital city of Italy”, “Berlin is the capital city of Germany” – after dimension reduction, the (Country, City) pairs line up: Germany–Berlin, France–Paris, Italy–Rome • Shared common context in Big Data offers some hidden structure within the language – carries some semantics
  • 26. RNN Decoder x1 x2 x3 x4 y1 y2 y3 y4 x1 x2 x3 x4 RNN Encoder audio segment (a segmented spoken word) acoustic features • RNN encoder and decoder – learn some hidden structure within signals Input acoustic features Audio Word2Vec : Sequence-to-sequence Autoencoder [Interspeech 2016]
  • 27. x1 x2 x3 x4 RNN Encoder audio segment (a segmented spoken word) acoustic features • RNN encoder and decoder – learn some hidden structure within signals Audio Word2Vec : Sequence-to-sequence Autoencoder [Interspeech 2016]
  • 28. x1 x2 x3 x4 RNN Encoder audio segment (a segmented spoken word) acoustic features • RNN encoder and decoder – learn some hidden structure within signals Audio Word2Vec : Sequence-to-sequence Autoencoder [Interspeech 2016] vector The vector representation • Unsupervised • Self-supervised – model learns from data itself
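The sequence-to-sequence autoencoder behind Audio Word2Vec can be sketched as a forward pass: an RNN encoder compresses a variable-length acoustic-feature sequence into one fixed-length vector, and an RNN decoder reconstructs the sequence from it. The weights here are random and untrained, and the feature/hidden sizes are illustrative; in practice both networks are trained jointly to minimize reconstruction error (self-supervised).

```python
import numpy as np

# Forward-pass sketch of an RNN encoder-decoder autoencoder; sizes and
# weights are illustrative placeholders, not a trained model.
rng = np.random.default_rng(0)
feat_dim, hid = 39, 16   # e.g. MFCC-like acoustic features (assumption)

def rnn_step(x, h, Wx, Wh, b):
    return np.tanh(x @ Wx + h @ Wh + b)

Wx_e, Wh_e, b_e = rng.normal(size=(feat_dim, hid)), rng.normal(size=(hid, hid)), np.zeros(hid)
Wx_d, Wh_d, b_d = rng.normal(size=(feat_dim, hid)), rng.normal(size=(hid, hid)), np.zeros(hid)
W_out, b_out = rng.normal(size=(hid, feat_dim)), np.zeros(feat_dim)

def encode(frames):          # frames: (T, feat_dim) -> (hid,)
    h = np.zeros(hid)
    for x in frames:
        h = rnn_step(x, h, Wx_e, Wh_e, b_e)
    return h                 # the fixed-length audio embedding

def decode(z, T):            # unroll T steps from the embedding
    h, x, out = z, np.zeros(feat_dim), []
    for _ in range(T):
        h = rnn_step(x, h, Wx_d, Wh_d, b_d)
        x = h @ W_out + b_out        # predicted frame, fed back in
        out.append(x)
    return np.stack(out)

segment = rng.normal(size=(12, feat_dim))   # one segmented spoken word
z = encode(segment)
recon = decode(z, len(segment))
print(z.shape, recon.shape)
```

Note that segments of any length map to the same fixed-size vector, which is what makes the embedding usable as a word-level representation.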
  • 29. What was learned? [Interspeech 2016] • Sequential Phonetic structure ! – not semantics – the learned hidden structure within speech signals comes from isolated segmented spoken words only – learning semantics from the context of word sequences may be easier – e.g., embeddings cluster phonetically similar words: new / few / near, fear / fame / name, night / fight, hand / hands, word / words, thing / things, day / days / say / says
  • 30. Segmental Audio Word2Vec — extended to an utterance [Interspeech 2018] – a segmentation gate estimates word boundaries within the utterance and resets the encoder to its initial state at each boundary – each segment performs seq2seq training individually, yielding one embedding e_1 … e_N per segment
  • 31. Feature Disentanglement for Audio Word2Vec [SLT 2018] • Audio signals include information irrelevant to semantics – speaker and other acoustic information • Two encoders disentangle the input audio into a phonetic vector and a speaker vector, from which the decoder reconstructs the audio
  • 32. speaker 1 Phonetic Encoder Decoder Speaker Encoder similar speaker vectors for the same speaker speaker 1 Phonetic Encoder Decoder Speaker Encoder Disentanglement of Speaker Information [SLT 2018]
  • 33. Disentanglement of Speaker Information speaker vectors far apart enough for different speakers speaker 1 Phonetic Encoder Decoder Speaker Encoder speaker 2 Phonetic Encoder Decoder Speaker Encoder [SLT 2018] • Phonetic Vectors without Speaker Information
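The speaker-disentanglement criteria on these slides can be sketched as loss functions: a reconstruction loss on the decoder output, plus a speaker loss that pulls speaker vectors from the same speaker together and pushes vectors from different speakers apart. The vectors below are placeholders for encoder outputs, and the margin is an illustrative hyperparameter.

```python
import numpy as np

# Sketch of the disentanglement training criteria; all values are toy
# placeholders for real encoder outputs.
def reconstruction_loss(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def speaker_loss(s_a, s_b, same_speaker, margin=1.0):
    d = np.linalg.norm(s_a - s_b)
    if same_speaker:
        return d ** 2                     # similar vectors for the same speaker
    return max(0.0, margin - d) ** 2      # far apart for different speakers

s1a, s1b = np.array([1.0, 0.0]), np.array([0.9, 0.1])   # speaker 1, two utterances
s2 = np.array([-1.0, 0.2])                              # speaker 2

print(speaker_loss(s1a, s1b, same_speaker=True))    # small: same speaker
print(speaker_loss(s1a, s2, same_speaker=False))    # zero once far enough apart
```

Trained this way, speaker information is pushed into the speaker vector, leaving phonetic vectors free of it.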
  • 34. Skip-gram for Text Words • Predicting the context wt-2, wt-1, wt+1, wt+2 given a word wt • One-hot input vectors used to train semantic embeddings
  • 35. Phonetic-and-Semantic Embedding [SLT 2018] • Audio Skip-gram: predicting the context wt-2, wt-1, wt+1, wt+2 given an audio segment wt • Phonetic vectors (from the phonetic encoder) used as input, giving phonetic-and-semantic embeddings
  • 36. Phonetic-and-Semantic Embedding [SLT 2018] • Two types of information sometimes disturb each other – in the phonetic space “Brother” is close to “Bother”, while in the semantic space “Brother” is close to “Sister” – aligning the word embeddings in the two spaces is challenging • A unique ID for a text word in training, but an unlimited number of audio realizations for a given text word
  • 37. BERT (Bidirectional Encoder Representations from Transformers) [Devlin 2018] • Some hidden structure within the language learned in an unsupervised way – by estimating masked tokens from unlabeled text data, e.g., “How are [M] today ?”, [M] = “you” – the representations produced by the transformer encoders carry the context information • Useful in many different downstream tasks, achievable with much simpler models, smaller labeled datasets and faster convergence • Self-supervised learning: learns some hidden structure within the language from the dataset itself, without labels
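The masked-prediction objective behind BERT can be sketched in a few lines: hide a fraction of the input tokens and keep the originals as the prediction targets. The 15% rate follows the original recipe; the whitespace "tokenizer" and the sample sentence are simplifications.

```python
import random

# Sketch of BERT-style input masking; the sentence and whitespace
# tokenization are simplifying assumptions.
random.seed(0)
tokens = "how are you today my friend".split()
mask_rate = 0.15

k = max(1, round(mask_rate * len(tokens)))            # at least one mask
positions = set(random.sample(range(len(tokens)), k))
masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
targets = {i: tokens[i] for i in positions}           # what the model must predict

print(masked)
print(targets)
```

During pre-training, the model sees `masked` and is trained to recover `targets` from bidirectional context; no labels beyond the text itself are needed.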
  • 38. Self-supervised Learning for Text : BERT • Pre-training on unlabeled text data (unsupervised) • Example downstream task (1): slot filling – input “arrive Taipei on November 2nd” → slots (other, dest, other, time, time) via linear classifiers on top of the BERT representations • Semantics considered • Classifiers may have simpler models trained with smaller labeled datasets
  • 39. Self-supervised Learning for Text : BERT • Example downstream task (2): sentiment classification – input “It’s a nice day” → Positive, via a downstream model on top of BERT • Semantics considered • Simpler models trained with smaller labeled datasets
  • 40. Pre-trained Model Unlabeled Data Phase 1: Pre-training representations • Mask the input signals and then reconstruct them (generative) • Predictive • Contrastive Self-supervised Learning for Speech • To learn the hidden structure within speech signals without considering any specific downstream task [IEEE JSTSP 2022]
  • 41. Pre-trained Model Phase 2: Downstream For a given downstream task (e.g., ASR) Downstream Model “How are you?” Labelled data Self-supervised Learning for Speech • With the hidden structure learned from the signals, the given task becomes easier [IEEE JSTSP 2022]
  • 42. Masked Frames Mockingjay Mockingjay [ICASSP 2020] • Speech version of BERT (text token → signal frame) – frame level – no segmentation problem any more (same as ASR) Mel-spectrogram
  • 43. Repr. Masked Frames Mockingjay [ICASSP 2020] Mockingjay Mel-spectrogram • Speech version of BERT (text token → signal frame) – frame level – no segmentation problem any more (same as ASR)
  • 44. Mockingjay [ICASSP 2020] • The model was able to reconstruct the mel-spectrogram from the hidden representations of the masked frames
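The frame-level masking that Mockingjay applies can be sketched directly on a spectrogram: zero out contiguous spans of frames and keep their indices as the reconstruction targets. The utterance length, mel dimension, span length, and span count below are illustrative.

```python
import numpy as np

# Sketch of masking contiguous spectrogram frames, Mockingjay-style;
# all sizes are toy values.
rng = np.random.default_rng(0)
T, n_mels = 100, 80
spec = rng.normal(size=(T, n_mels))          # mel-spectrogram, one utterance

def mask_spans(spec, n_spans=2, span_len=7):
    out, masked_idx = spec.copy(), []
    for _ in range(n_spans):
        start = rng.integers(0, spec.shape[0] - span_len)
        out[start:start + span_len] = 0.0    # frames the model must reconstruct
        masked_idx.extend(range(start, start + span_len))
    return out, sorted(set(masked_idx))

masked_spec, idx = mask_spans(spec)
# model input: masked_spec; reconstruction target: spec[idx]
print(len(idx), masked_spec.shape)
```

Because the unit is a signal frame rather than a text token, no word segmentation is needed, exactly as the slides note.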
  • 46. More data Larger model Word-level Frame-level More objectives [IEEE JSTSP 2022]
  • 47. Presented at INTERSPEECH 2021 https://arxiv.org/abs/2105.01051 Presented at ACL 2022 https://arxiv.org/abs/2203.06849 [Interspeech 2021] [ACL 2022]
  • 48. https://superbbenchmark.org/ IS 2021 ACL 2022 Phonetic Paralinguistic Speaker Semantic Synthesis Speech processing Universal PERformance Benchmark (SUPERB) [Interspeech 2021] [ACL 2022] • Five categories of downstream tasks: phonetic, speaker, paralinguistic, semantic, and synthesis
  • 49. Initial Test Results of Round 2 [Interspeech 2021] • fbank as the baseline

  Model             |    PR |    KS |   ASR |      QbE |      SID |   ASV |    SD |    IC |    SF |    ER
  ------------------+-------+-------+-------+----------+----------+-------+-------+-------+-------+------
  fbank             | 82.01 |  8.63 | 15.21 |   0.0058 | 8.50E-04 |  9.56 | 10.05 |  9.1  | 69.64 | 35.39
  PASE+             | 58.87 | 82.54 | 16.62 |   0.0072 |    37.99 | 11.61 |  8.68 | 29.82 | 62.14 | 57.86
  APC               | 41.98 | 91.01 | 14.74 |   0.0310 |    60.42 |  8.56 | 10.53 | 74.69 | 70.46 | 59.33
  VQ-APC            | 41.08 | 91.11 | 15.21 |   0.0251 |    60.15 |  8.72 | 10.45 | 74.48 | 68.53 | 59.66
  NPC               | 43.81 | 88.96 | 13.91 |   0.0246 |    55.92 |  9.40 |  9.34 | 69.44 | 72.79 | 59.08
  Mockingjay        | 70.19 | 83.67 | 15.48 | 6.60E-04 |    32.29 | 11.66 | 10.54 | 34.33 | 61.59 | 50.28
  TERA              | 49.17 | 89.48 | 12.16 |   0.0013 |    57.57 | 15.89 |  9.96 | 58.42 | 67.50 | 56.27
  DeCoAR 2.0        | 14.93 | 94.48 |  9.07 |   0.0406 |    74.42 |  7.16 |  6.59 | 90.80 | 83.28 | 62.47
  modified CPC      | 42.54 | 91.88 | 13.53 |   0.0326 |    39.63 | 12.86 | 10.38 | 64.09 | 71.19 | 60.96
  wav2vec           | 31.58 | 95.59 | 11.00 |   0.0485 |    56.56 |  7.99 |  9.90 | 84.92 | 76.37 | 59.79
  vq-wav2vec        | 33.48 | 93.38 | 12.80 |   0.0410 |    38.80 | 10.38 |  9.93 | 85.68 | 77.68 | 58.24
  wav2vec 2.0 base  |  5.74 | 96.23 |  4.79 |   0.0233 |    75.18 |  6.02 |  6.08 | 92.35 | 88.30 | 63.43
  wav2vec 2.0 large |  4.75 | 96.66 |  3.10 |   0.0489 |    86.14 |  5.65 |  5.62 | 95.28 | 87.11 | 65.64
  HuBERT base       |  5.41 | 96.30 |  4.79 |   0.0736 |    81.42 |  5.11 |  5.88 | 98.34 | 88.53 | 64.92
  HuBERT large      |  3.53 | 95.29 |  2.94 |   0.0353 |    90.33 |  5.98 |  5.75 | 98.76 | 89.81 | 67.62

  (PR, KS, ASR, QbE: phonetic; SID, ASV, SD: speaker; IC, SF: semantic; ER: emotion)
  • 50. Initial Test Results of Round 2 [Interspeech 2021] (same table as slide 49) • Self-supervised representations outperformed fbank in most cases • Black blocks: worse than fbank baseline
  • 51. Initial Test Results of Round 2 [Interspeech 2021] (same table as slide 49) • Several self-supervised models are all-around – good hidden structure generalizable to all tasks • Any one good in all tasks?
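SUPERB evaluates a frozen self-supervised model by training only a small downstream head on its representations. A minimal stand-in for that protocol: synthetic "utterance representations" probed with a linear classifier on a 2-class task. The features here are random and the label rule is contrived, purely to show the probing setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear-probe sketch of the SUPERB evaluation protocol; representations
# and labels are synthetic placeholders.
rng = np.random.default_rng(0)
n, d = 200, 32
reps = rng.normal(size=(n, d))                 # stand-in for frozen model outputs
labels = (reps[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

probe = LogisticRegression().fit(reps[:150], labels[:150])   # train the head only
acc = probe.score(reps[150:], labels[150:])                  # held-out accuracy
print(round(acc, 2))
```

High probe accuracy means the task label is linearly readable from the frozen representation, which is what the benchmark numbers above measure per task.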
• 52. Supervised vs. Self-supervised: ASR [Lüscher 2019] [Baevski 2020] [Hsu 2021] [Interspeech 2021]
[chart: LibriSpeech WER — supervised systems (6-layer and 2-layer LSTM) trained on 100 hours of labeled data (Lüscher et al.; Yang et al.) vs. self-supervised systems fine-tuned on only 10 minutes of labeled data (Baevski et al.; Hsu et al.); WER values shown: 5.8, 2.9, 3.1, 4.8, 4.6]
• 53. SUPERB Challenge at SLT 2022 — Welcome to Join!  https://superbbenchmark.org/
• 56. Supervised / Unsupervised ASR
• Supervised ASR – has been very successful – problem: requires a huge quantity of annotated data (audio paired with transcriptions, e.g. “How are you.”, “He thinks it’s…”, “Thanks for…”)
• Unsupervised ASR – trained without annotated data – unlabeled, unpaired audio and text are easier to collect – thousands of languages are spoken over the world; most are low-resourced, without enough annotated data
• 57. Use of Generative Adversarial Networks (GAN)
• Generator (ASR): maps acoustic features to generated phoneme sequences, and tries to “fool” the Discriminator
• Discriminator: tries to distinguish real phoneme sequences from generated ones
• Trained iteratively: the Discriminator and Generator improve themselves individually and iteratively
• 58. Model 1 (2018): Audio Word2Vec [Interspeech 2018]
• Waveform segmentation and embedding – divide the acoustic features 𝑋1 𝑋2 𝑋3 … 𝑋𝑀 into acoustically similar segments of different lengths – transform each segment into a fixed-length vector (audio embedding), giving the embedding sequence 𝑧1 𝑧2 𝑧3 … 𝑧𝑀
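The embedding step above can be sketched as follows, assuming the segment boundaries have already been found (in the actual system both the segmentation and the embedding are learned; the function name and mean-pooling here are illustrative simplifications):

```python
# Toy sketch of Model 1's embedding step: given acoustic feature vectors
# and (already-found) segment boundaries, mean-pool each variable-length
# segment into one fixed-length audio embedding.

def mean_pool_segments(features, boundaries):
    """features: list of equal-length feature vectors (lists of floats).
    boundaries: segment end indices (exclusive), e.g. [3, 5].
    Returns one mean vector per segment."""
    embeddings = []
    start = 0
    for end in boundaries:
        segment = features[start:end]
        dim = len(segment[0])
        embeddings.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
        start = end
    return embeddings

# Example: five 2-dimensional frames, split into segments [0:3] and [3:5].
frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [1.0, 1.0], [3.0, 3.0]]
print(mean_pool_segments(frames, [3, 5]))  # [[2.0, 4.0], [2.0, 2.0]]
```

The real system replaces the fixed boundaries and mean-pooling with learned segmentation and a learned sequence-to-vector encoder.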
• 59. Model 1 (2018) [Interspeech 2018]
• Cluster the embeddings into groups – K-means maps the audio embedding sequence 𝑧1 𝑧2 𝑧3 … 𝑧𝑀 to a cluster index sequence, e.g. 2 16 25 2
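The clustering step can be sketched with plain K-means (a minimal pure-Python Lloyd's algorithm; the real system clusters far more embeddings into far more groups, and `embs` below is toy data):

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)

    def nearest(v):
        return min(range(k),
                   key=lambda c: sum((v[d] - centroids[c][d]) ** 2
                                     for d in range(len(v))))

    assign = []
    for _ in range(iters):
        assign = [nearest(v) for v in vectors]   # assignment step
        for c in range(k):                       # centroid update step
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(len(members[0]))]
    return assign

# Six 1-D "audio embeddings" forming two obvious groups.
embs = [[0.0], [0.1], [0.2], [5.0], [5.1], [4.9]]
assign = kmeans(embs, 2)
print(assign)  # first three share one index, last three share the other
```

The resulting index sequence is what Model 1 passes on to the GAN stage.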
• 60. Model 1 (2018) [Interspeech 2018]
• Learning the mapping between cluster indices and phonemes with a GAN – embedding clustering followed by (cascaded with) a GAN
• Generator: a lookup table mapping the cluster index sequence (e.g. 2 16 25 2) to a generated phoneme sequence (e.g. 𝑠𝑖𝑙 ℎℎ 𝑖ℎ 𝑠𝑖𝑙)
• Discriminator: a CNN network distinguishing real from generated phoneme sequences
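The generator's lookup step can be sketched as below. The table contents are made up for illustration (learning this table is exactly what the GAN is for), and the collapsing of repeated consecutive indices is an assumed detail:

```python
# Toy sketch of slide 60's generator: a lookup table from cluster indices
# to phoneme labels, with repeated consecutive indices collapsed.
# The table values here are hand-written illustrations, not learned.

def generate_phonemes(index_seq, table):
    collapsed = [idx for i, idx in enumerate(index_seq)
                 if i == 0 or idx != index_seq[i - 1]]
    return [table[idx] for idx in collapsed]

table = {2: "sil", 16: "hh", 25: "ih"}
print(generate_phonemes([2, 16, 16, 25, 2, 2], table))  # ['sil', 'hh', 'ih', 'sil']
```

The discriminator then judges whether such output sequences look like real phoneme sequences from text.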
• 61. Model 2 (2019) [Interspeech 2019]
• A GAN (Generator and Discriminator) trained end-to-end on unpaired audio and text data – a DNN trained in an unsupervised way
• Generator consists of two parts: (a) phoneme classifier (DNN); (b) sampling process
• Discriminator is a two-layer 1-D CNN
• 62. The Progress of Supervised Learning on TIMIT [Keynote, Interspeech 2020]
• Milestones in phone recognition accuracy [Lopes, C. and Perdigão, F., “Phone recognition on the TIMIT database,” Speech Technologies, Vol. 1, pp. 285–302, 2011]
• Unsupervised learning (Model 2, 2019) is as good as supervised learning (HMM) was 30 years ago – Will it take another 30 years for unsupervised learning to reach the performance of supervised learning today?
• 63. Unsupervised Speech Recognition [facebook 2021]
• LibriSpeech (2021) – Word Error Rate as low as 5.9% with zero hours of annotated data! https://ai.facebook.com/blog/wav2vec-unsupervised-speech-recognition-without-supervision/
• 64. Unsupervised Speech Recognition [ICASSP 2022]
• Audio/Text unlabeled corpora respectively extended to different styles/domains – Audio: SwitchBoard (telephone conversation), TED-LIUM (live talk), Librispeech (read literature) – Text: LibriLM (literature), Wiki (encyclopedia), NewsCrawl (news), ImageC (image caption)
• 65. Unsupervised Speech Recognition [ICASSP 2022]
• Reasonable (though still relatively high) error rates achievable if – good acoustic conditions (Libri) with reasonably different linguistic styles – relatively poor acoustic conditions (SB) but the same linguistic styles – will be lowered sooner or later…
[chart: PER vs. 4-gram LM JSD for Libri960 and SB300_w2v2]
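The chart on this slide relates a 4-gram language model's Jensen-Shannon divergence (JSD) to phone error rate. As a reference point, JSD between two discrete distributions can be computed as follows (a minimal sketch, in bits; the slide's exact evaluation setup is not reproduced here):

```python
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete
    distributions given as dicts mapping symbols to probabilities."""
    symbols = set(p) | set(q)
    m = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in symbols}

    def kl(a, b):
        # KL divergence, skipping zero-probability symbols
        return sum(a[s] * log2(a[s] / b[s]) for s in a if a[s] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# identical distributions -> 0; fully disjoint distributions -> 1 bit
print(jsd({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}))  # 0.0
print(jsd({"a": 1.0}, {"b": 1.0}))  # 1.0
```

Because JSD is bounded and symmetric, it is a convenient way to compare generated phoneme statistics against real text statistics without any transcriptions.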
  • 67. A Target Application Example : Personalized Education Environment • For each individual user I wish to learn how machines can listen to human voice I can spend 3 hrs to learn user This is the 3-hr personalized course for you. I’ll be your personalized teaching assistant. Ask me when you have questions. Information from Internet • Comprehension, Summarization and Question Answering for Spoken Content – Proper use of semantics in spoken content
• 68. Question Answering
• Machine answering questions from the user [diagram: question → search engine over a knowledge source of unstructured documents / spoken content (passages) → Question Answering → answer]
• 69. Question Answering
• Machine answering questions from the user [diagram: question → Question Answering over a knowledge source of unstructured documents / spoken content (passages) → answer]
• Initial Work – assuming the passages are given
• 70. Text QA vs. Spoken QA (Cascading vs. End-to-end) [Interspeech 2020]
• Text QA: Question → Question Answering (over retrieved text) → Answer
• Spoken QA, cascading: Spoken Content → Speech Recognition (ASR) → retrieved text → Question Answering → Answer – ASR errors propagate into QA
• Spoken QA, end-to-end: Question + Spoken Content → End-to-end Spoken Question Answering → Answer
• 71. End-to-end Spoken QA: DUAL [Interspeech 2022]
• Pre-trained with unlabeled audio/text corpora – audio represented in HuBERT units (frame level) – fine-tuned downstream on (question, passage, answer) sets
• Not limited by ASR errors or OOV – no ASR here
• 72. End-to-end Spoken QA: DUAL [Interspeech 2022]
• HuBERT-based speech encoder: the spoken question and passage are mapped by the pre-trained speech encoder to sequences of clustered units (e.g. 3 9 11 31 31)
• BERT pre-trained on text: takes the unit sequences and predicts the start and end of the answer span
• 73. End-to-end Spoken QA: DUAL [Interspeech 2022]
• ASR cascaded with Text QA (baseline) – performance directly limited by WER; many errors due to OOV, and many answer spans include OOV words
• End-to-end Spoken QA (DUAL) – performance independent of WER, because semantics are extracted directly from audio – no ASR here
• Metrics: Word Error Rate (WER), Frame-level F1 score (FF1)
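The frame-level F1 (FF1) metric mentioned above can be sketched as below, assuming FF1 scores a predicted answer span against the gold span over frame indices (the exact evaluation details of the paper may differ):

```python
def frame_f1(pred_span, gold_span):
    """pred_span / gold_span: (start, end) frame indices, end exclusive.
    Treat each span as a set of frames and compute set-overlap F1."""
    pred = set(range(*pred_span))
    gold = set(range(*gold_span))
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Half of each 10-frame span overlaps -> precision = recall = F1 = 0.5.
print(frame_f1((10, 20), (15, 25)))  # 0.5
```

A frame-based metric is what makes end-to-end spoken QA measurable at all: without ASR there are no words to match, only time regions in the audio.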
• 74. Question Answering
• Machine answering questions from the user [diagram: question → Question Answering over a knowledge source of unstructured documents / spoken content (passages) → answer]
• Initial Work – assuming the passages are given
• 75. Question Answering
• Machine answering questions from the user – retrieval and question answering jointly solved [diagram: question → search engine over a knowledge source of unstructured documents / spoken content (passages) → Question Answering → answer]
• 76. Textless Spoken QA with Speech Discrete Units: TlDu [submitted to SLT 2022]
• Audio represented in HuBERT units – no ASR transcriptions or errors
• Retrieving relevant passages and identifying answers jointly handled: (a) Speech Signal Encoder maps the spoken question and the spoken archive to discrete unit sequences; (b) Spoken Content Retriever returns the top-K passages; (c) Spoken Content Reader outputs the answer span
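The retrieval idea can be illustrated (this is not the actual trained TlDu retriever) by ranking passages with cosine similarity between bag-of-units histograms; the function names, unit ids, and scoring below are all illustrative assumptions:

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two Counters of discrete unit ids."""
    dot = sum(c1[u] * c2.get(u, 0) for u in c1)
    norm = (sqrt(sum(v * v for v in c1.values()))
            * sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def retrieve(question_units, passages, k=1):
    """Rank passages (lists of discrete unit ids) against the question
    and return the indices of the top-k passages."""
    q = Counter(question_units)
    ranked = sorted(range(len(passages)),
                    key=lambda i: -cosine(q, Counter(passages[i])))
    return ranked[:k]

# A question sharing units 9, 11, 31 with the first passage only.
passages = [[3, 9, 11, 31, 31, 9], [70, 70, 71, 5, 5, 6]]
print(retrieve([9, 11, 31], passages))  # [0]
```

TlDu instead learns dense representations of unit sequences, but the interface is the same: spoken question in, ranked spoken passages out, with no text anywhere.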
• 77. Preliminary Results for TlDu [submitted to SLT 2022]
• Retrieval accuracy • SQA accuracy
• Performance not limited by ASR errors or OOV – no ASR here – semantics extracted directly from audio (not words)
• 78. Beyond
• What’s next? – don’t know
• Depending on new technologies to be developed in the future – by capable researchers in the future – capable students in schools today (and in the future)
• How students can learn effectively and efficiently is important – people asked: are you still teaching HMM and MFCC in your course today?
• Just to share my thoughts – purely my personal imagination
• 79. Speech Technologies shown in a 1-dim Scale
[diagram: technologies placed on a scale from “changing very fast” (cutting-edge) to “never change” (fundamental) — cutting-edge today: self-supervised learning, meta learning, GAN, and “?” for x years later; 10 years ago: NN, CNN, RNN, DNN, WFST, i-vector; 20 years ago: ML, HMM, MFCC; 40 years ago: LPC, pitch estimation, filter banks; changing slower / never changing: signal processing, Fourier transform, mathematics]
• 80. • Students today (researchers in the future) will have to explore new knowledge and solve new problems in the future (20 years?)
• Primary goal for the students to learn in school: to learn – not how to do research today, but how to do it in the future – not just to run deep learning packages today, which produce good results, papers and good jobs: those may become obsolete very soon
• If they are too focused on those deep learning packages, quick results and achievement, will it be possible that they become “overfitted” on these targets?
• They may need to learn skills “more generalizable” to future technologies not existing today (unseen during training)… but what are such skills?
• 81. Analogy 1: How do human babies learn the language?
• Listening to many voices! Reading many books! Then their parents teach them.
• 82. Analogy 1: How do human babies learn the language?
• Self-supervised pre-training: a model learns some hidden structure of the language from unlabeled data (unlabeled speech and unlabeled text)
• 83. Analogy 1: How do human babies learn the language?
• Self-supervised pre-training: a model learns some hidden structure of the language from unlabeled data (unlabeled speech and unlabeled text)
• Self-supervised downstream: the model is then trained with limited labels (input → labels)
• The way babies learn may be kind of self-supervised learning…
• 84. Analogy 2: Reading articles in a certain unknown language – generalizable to reading arbitrary articles in that language
• To start with – learning the alphabet: a, b, c – finding a dictionary for the language – learning to look up the dictionary: <word>, <unknown> – learning the basic words, function words, keywords: <this> <he> <is> – learning the grammar – these are systematic approaches to learning a new (unknown) language
• These systematic approaches are kind of based on some “hidden structure” of the language? – having to do with pre-training? – reading articles is some kind of downstream task?
• 85. Analogy 3: How did old-generation researchers face the Deep Learning era?
• What did we learn in early days? – mathematics, programming, fundamentals (a, b, c) – successful stories in the past: solutions (What?), ideas (Why?), principles (How?) (e.g. HMM, MFCC, although not useful any more today) – these may include components generalizable to technologies useful today? – forward-backward algorithm, vector quantization, cepstral mean and variance normalization are still useful today, although in different contexts
• We didn’t learn deep learning in early days – yet we are working with it now – we seem to have generalized skills learned earlier to handle the new knowledge of today? – or are we just specially lucky to have survived the technology revolutions? – have we gone through some kind of pre-training when we were young? – is learning new technologies the downstream task?
• We don’t know; just to share some thoughts
• 87. Concluding Remarks
• Semantics of speech yet to be explored – plenty of unknown space – may offer a bridge towards “a spoken version of Google”
• Self-supervised learning for speech – pre-training with unlabeled data – universally makes all downstream tasks easier – various new technologies blooming
• x years ago we never knew what kind of technologies we could have today – today we never know what kind of technologies we may have x years from now – anything in our mind could be possible
• This is a golden age for speech research we never had before – very deep learning, very big data, very powerful machines, very strong industry
• Let’s all treasure and enjoy this golden age!