Doing Something We Never Could
with
Spoken Language Technologies
Lin-shan Lee
National Taiwan University
• Some research effort tries to do something Better
– having aircraft fly faster
– having images look more beautiful
– always very good
• Some tries to do something we Never could
– developing the Internet to connect everyone around the world
– selecting information from the Internet with Google
– usually challenging
What and Why
• Those we could Never do before
– very far from realization
– wait for the right industry to appear at the right time
– new generations of technologies, very different from the earlier
solutions found in research, are eventually used
• Interesting and Exciting !
What and Why
• Three examples
– (1) Teaching machines to listen to Mandarin Chinese
– (2) Towards a Spoken Version of Google
– (3) Unsupervised ASR (phoneme recognition)
(1) Teaching Machines to Listen to
Mandarin Chinese
Talk to Machines in Writing – Typewriting
• English typewriter
• Chinese typewriter
Chinese Language is Not Alphabetic
• Every Chinese character is a square graph
– many thousands of such characters
Talk to Machines in Writing – Typewriting
• English typewriter
• Chinese typewriter
Age of Chinese Typewriters (1980)
• Many people tried to represent Chinese characters by
code sequences
– radicals (字根) : 口 火 木 山 人 手
– phonetic symbols
– corner codes (四角號碼)
• A Voice-driven Typewriter for Chinese ?
– monosyllable per character
– total number of distinct monosyllables limited
– something we could Never do before
• To Teach Machines to Listen to Mandarin ?
– too difficult (1980)
– to teach machines to speak Mandarin first
→ Speech Synthesis (TTS)
Hardware-assisted Voice Synthesizer
• Real-time requirement
– producing 1 sec of voice data within 1 sec
– from Linear Predictive Coding (LPC) coefficients
• Bit-slice microprocessor far too weak
• Completed in 1981, used for the next several years
Initial Effort in Chinese Text-to-speech Synthesis
• Calculated and stored the LPC coefficients and tone
patterns for all Mandarin monosyllables
– concatenating isolated monosyllables into utterances directly
– Never worked, stuck for long
• Learned from a linguist (Prof. Chiu-yu Tseng)
– the prosody (pitch, energy, duration, etc.) of the same
monosyllable is different in different contexts in continuous
speech (context dependency)
– made sentences, recorded voice, analyzed the prosody
– derived prosody rules for each monosyllable in continuous speech
based on context
– synthesized voice very poor, but intelligible
– data science in early years
• Lesson learned
– linguistics is important for engineers
Mandarin Prosody Rules
• Complete prosody rules (1984)
[ICASSP 1986][IEEE Trans ASSP 1989]
Tone Sandhi Rule Example
• Concatenation rule for Tone 3: 3 3 → 2 3 (a minimal sketch follows below)
• Example sentence: 我 有 好 幾 把 小 雨 傘
[Figure: parse tree of the example sentence; applying the rule yields the
tone pattern 2 2 / 3 3 / 2 3 3 / 2]
– the rule applies across certain syntactic boundaries, but not others
(illustrated with 雨傘 in the figure)
[ICASSP 1986][IEEE Trans ASSP 1989]
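To make the rule concrete, here is a minimal sketch assuming a flat,
left-to-right application over a bare tone sequence; the function name is
illustrative, and the actual 1984/1986 rules additionally consult the
syntactic structure shown in the figure.

    # Minimal sketch of the basic Tone 3 sandhi rule (3 3 -> 2 3), applied
    # left to right over a flat tone sequence. Illustration only: the rules
    # on this slide also take syntactic boundaries into account.
    def apply_tone3_sandhi(tones):
        out = list(tones)
        for i in range(len(out) - 1):
            if out[i] == 3 and out[i + 1] == 3:
                out[i] = 2      # 3 3 -> 2 3
        return out

    print(apply_tone3_sandhi([3, 3]))     # [2, 3]
    print(apply_tone3_sandhi([3, 3, 3]))  # [2, 2, 3]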
Chinese Text-to-speech Synthesis
• Concatenating stored monosyllables with adjusted prosody
(1984)
– reproduced from a tape in 1984
[ICASSP 1986] [IEEE Trans ASSP 1989]
Speech Prosody Depends on Sentence Structure
• Chinese Sentence Grammar and Parser (1986)
• Empty category
被李四叫去吃飯的小孩
(The children who were asked by Lisi (李四) to go to dinner)
[AAAI 1986] [Computational Linguistics 1991]
– natural language
processing in early
years
Mandarin Speech Recognition
• Very Large Vocabulary and Unlimited Text
• Isolated Monosyllable Input
– each character pronounced as a monosyllable
– limited number of distinct monosyllables
– a voice-driven typewriter
• Decomposing the problem
– recognition of consonants/vowels/tones
– identifying the character out of many homonym characters
• System Integration very hard
– software very weak, each part assisted by different hardware
[Concept IJCAI 1987] [Initial Results ICASSP 1990]
• 1989
An Example Circuit Diagram
• Completed in 1989, but Never worked
Photo of Completed Hardware
• 1992
– far from a complete system
LPC Analysis Chip Viterbi Chip
Chip Design
• Implemented with Transputer (10 CPUs in parallel,
everything in software) purchased in 1990
• March 1992
• 金聲玉振,金玉之聲 ("sounds of gold and jade", a classical allusion
punning on the system's name)
• Isolated monosyllable input
• Several seconds per monosyllable
• Lesson learned: software more powerful than hardware
Golden Mandarin I (金聲一號)
[ICASSP 1993] [IEEE Trans SAP 1993]
Golden Mandarin III (金聲三號)
• March 1995
• Continuous read speech input
• Limited by computation resources
– Version(a) on PC 486 for short utterance input
– Version(b) on Workstation for long utterance input
[ICASSP 1995] [IEEE Trans SAP 1997] [IEEE SP Magazine 1997]
Mini-Remarks
• Today, a dream from many years ago has been realized by the right
industry at the right time
– with new generations of technologies
• No need to worry about realization during research
– someone will solve the problem in the future
(2) Towards
a Spoken Version of Google
Spoken Content Retrieval
• Spoken term detection
– to detect whether a target term was spoken in any of the utterances
in an audio data set
[Figure: utterances in the data set scored by similarity to the target
term "COVID-19"]
Spoken Content Retrieval – Basic Approach
[Diagram: the user's query goes to a text-based search engine; the spoken
content goes through a recognition engine (acoustic models, lexicon,
language model) to produce transcriptions, over which the search engine
returns the retrieval results]
• Transcribe the spoken content
• Search over the transcriptions as if they were text (a minimal sketch
follows below)
• Recognition errors cause serious performance degradation
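A minimal sketch of this cascade, assuming a placeholder recognize()
function standing in for any ASR engine (an assumption, not the original
system); the toy example shows how one recognition error turns a true hit
into a miss.

    # Transcribe-then-search cascade for spoken term detection.
    # `recognize` is a placeholder for an ASR engine.
    def spoken_term_detection(query, utterances, recognize):
        hits = []
        for utt_id, audio in utterances.items():
            transcript = recognize(audio)              # recognition stage
            if query.lower() in transcript.lower():    # text search stage
                hits.append(utt_id)
        return hits

    # Toy data: utt2's transcript has an ASR error ("coded" for "covid"),
    # so the cascade misses it even though the term was actually spoken.
    transcripts = {"utt1": "covid nineteen cases rise",
                   "utt2": "coded nineteen vaccine news"}
    print(spoken_term_detection("covid", transcripts, recognize=lambda x: x))
    # -> ['utt1']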
• Low recognition accuracies for speech signals under
various acoustic conditions
– considering lattices rather than 1-best output
– lattices of subword units
Lattices for Spoken Content Retrieval
[Figure: a lattice of word hypotheses Wi between the start node and the
end node along the time index]
[Ng 2000, Saraclar 2004, Chelba 2005, etc.]
• Recognition stage cascaded with retrieval stage
– retrieval performance limited by recognition accuracy
• Considering recognition and retrieval processes as a
whole (2010)
– acoustic models re-estimated by optimizing overall retrieval
performance
ASR Error Problem Comes from Cascading Two Stages
[Diagram: recognition stage (recognition engine with acoustic models →
lattices) cascaded with retrieval stage (search engine with retrieval
model → retrieval output for the user's query Q); the retrieval
performance is fed back to re-estimate the acoustic models]
[ICASSP 2010]
• End-to-End Spoken Content Retrieval in early years
What can Spoken Content Retrieval do for us ?
• Google reads all text over the Internet
– can find any text over the Internet for the user
• All Roles of Text can be realized by Voice
• Machines can listen to all voices over the Internet
– can find any utterances over the Internet for the user
• A Spoken Version of Google
• Machines may be able to listen to and understand the entire
multimedia knowledge archive over the Internet
– extracting desired information for each individual user
• Multimedia Content exponentially increasing over the Internet
– best archive of global human knowledge is here
– desired information deeply buried under huge quantities of
unrelated information
– 300 hrs of video uploaded per minute (2015.01); roughly 2,000
online courses on Coursera (2016.04)
• Nobody can go through so much multimedia information, but Machines can
What can we do with a Spoken Version of Google ?
A Target Application Example : Personalized
Education Environment
• For each individual user
I wish to learn about Wolfgang
Amadeus Mozart and his music
I can spend 3 hrs to learn
user
This is the 3-hr personalized
course for you. I’ll be your
personalized teaching assistant.
Ask me when you have questions.
Information
from Internet
• Understanding, Summarization and Question Answering for
Spoken Content
– something we could Never do (even today)
– semantic analysis for spoken content
Probabilistic Latent Semantic Analysis (PLSA)
• Unsupervised Learning of Topics from a text corpus
[Figure: documents Di linked to terms tj through latent topics Tk, with
probabilities P(Tk|Di) and P(tj|Tk); equivalently P(z|d) and P(w|z) with
d: document, z: topic, w: word]
[Hofmann 1999]
Latent Dirichlet Allocation (LDA)
• Unsupervised Learning of Topics from a text corpus (see the sketch below)
[Figure: graphical model with Dirichlet priors P(φk|β) and P(θm|α),
topic assignments P(zm,n|θm), and words P(wm,n|zm,n, φk)]
[Blei 2003]
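As an illustration, a hedged sketch of unsupervised topic learning with
gensim's LDA implementation; the toy corpus and the number of topics are
arbitrary, and this stands in for, rather than reproduces, the models
used in the work above.

    # Unsupervised topic learning with LDA (gensim); toy corpus only.
    from gensim import corpora, models

    docs = [["speech", "recognition", "acoustic", "model"],
            ["mozart", "symphony", "music", "opera"],
            ["speech", "synthesis", "prosody", "tone"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
    print(lda.get_document_topics(corpus[0]))  # P(z|d): topic mixture of a document
    print(lda.show_topic(0))                   # P(w|z): top words of a topic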
Key Term Extraction and Summarization for
Spoken Content
• Key term extraction based on semantic features from
PLSA or LDA
– key terms are usually concentrated on a small number of topics
(see the sketch below)
[Figure: the topic distribution P(Tk|ti) over topics k is peaked for a
key term and flat for a non-key term]
[ICASSP 2006] [SLT 2010]
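A hedged sketch of this idea: measure how concentrated a term's topic
distribution P(Tk|ti) is via its entropy, and treat low-entropy terms as
key-term candidates; the toy distributions are illustrative, not from the
cited papers.

    import numpy as np

    # Entropy of a term's topic distribution P(T_k | t_i): key terms are
    # concentrated on few topics, hence lower entropy.
    def topic_entropy(p):
        p = np.asarray(p, dtype=float)
        p = p / p.sum()
        return float(-np.sum(p * np.log(p + 1e-12)))

    key_term_dist = [0.90, 0.05, 0.03, 0.02]    # peaked -> low entropy
    plain_term_dist = [0.25, 0.25, 0.25, 0.25]  # flat   -> high entropy
    for name, dist in [("key", key_term_dist), ("plain", plain_term_dist)]:
        print(name, topic_entropy(dist))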
• Summarization
– selecting the most representative utterances while avoiding redundancy
• Constructing the Semantic Structures of the Spoken Content
• Example Approach 1: Spoken Content categorized by Topics
and organized in a Two-dimensional Tree Structure (2005)
– each category labeled by a set of key terms (topic) located on a map
– categories nearby on the map are more related semantically
– each category expanded into another map in the next layer
Semantic Structuring of Spoken Content (1/2)
[Eurospeech 2005]
• Broadcast News Browser (2006)
An Example of Two-dimensional Trees
[Interspeech 2006]
• Sequential knowledge transfer, lecture by lecture
• When a lecture in an online course is retrieved for a
user
– difficult for the user to understand this lecture without
listening to previous related lectures
– not easy to find the background or related knowledge
Online Courses
• Example Approach 2: Key Term Graph (2009)
– each spoken slide labeled by a set of key terms (topics)
– relationships between key terms represented by a graph
Semantic Structuring of Spoken Content (2/2)
[Diagram: spoken slides (plus audio/video) mapped onto a key term graph;
example key terms: Acoustic Modeling, Viterbi search, HMM, Language
Modeling, Perplexity]
[ICASSP 2009][IEEE Trans ASL 2014]
• Very Similar to Knowledge Graph
• Course : Digital Speech Processing (2009)
– Query : “triphone”
– retrieved utterances shown with the spoken slides they belong to
specified by the titles and key terms
An Example of Retrieving with an Online Course
Browser (1/2)
[ICASSP 2009][IEEE Trans ASL 2014]
• User clicks to view the spoken slide (2009)
– including a summary, key terms and related key terms from the graph
– recommended learning path for a specific key term
[ICASSP 2009][IEEE Trans ASL 2014]
An Example of Retrieving with an Online Course
Browser (2/2)
A Huge Number of Online Courses
• A user enters a keyword or a key phrase into Coursera
[Screenshot: 752 matches returned]
Having Machines Listen to all the Online Courses
[Figure: three courses on some similar topic; lectures with very similar
content are linked across courses]
[Interspeech 2015]
Having Machines Listen to all the Online Courses
[Figure: three courses on some similar topic, with the sequential order
for learning (prerequisite conditions) marked]
[Interspeech 2015]
Learning Map Produced by Machine (2014)
• Demonstration
[Interspeech 2015]
Question Answering
• Machine answering questions from the user
[Diagram: question → search engine over a knowledge source (unstructured
documents, or spoken content) → answer]
Text v.s. Spoken QA (Cascading v.s. End-to-end)
• Text QA: question + retrieved text → Question Answering → answer
• Spoken QA (cascading): retrieved spoken content → Speech Recognition
(ASR) → text → Question Answering → answer; ASR errors propagate to
the answer
• Spoken QA (end-to-end): question + retrieved spoken content →
End-to-end Spoken Question Answering → answer
[Interspeech 2020]
Audio-and-Text Jointly Learned SpeechBERT
• Pre-training: reconstruction of randomly masked inputs
• Fine-tuning: predicting the answer's start/end position
[Interspeech 2020]
• End-to-end Globally Optimized for Overall QA Performance
– not limited by ASR errors (no ASR here)
– extracting semantics directly from speech, not from words via ASR
• The Internet is the single largest archive of global human
knowledge
– spoken language technologies offer a bridge to that knowledge
• Still a long way to go towards the personalized education
environment considered
– many small steps may lead to the final destination some day
– someone may realize it at the right time in the future
• Machines are capable of handling huge quantities of Data
– use machines in the ways they are specially capable and
efficient
Mini-Remarks
(3) Unsupervised ASR
(Phoneme Recognition)
Hung-yi Lee (left) and Lin-shan Lee
• Supervised ASR
– Has been very successful
– Problem : requiring a huge quantity of annotated data
How are you.
He thinks it’s…
Thanks for…
Data
(Annotated)
Supervised ASR
• Unsupervised ASR
– Train without annotated data
Supervised/Unsupervised ASR
Thousands of languages are spoken around the world
‐ most are low-resourced, without enough annotated data
How are you.
He thinks it’s…
Thanks for…
Data
(Audio)
Unsupervised ASR
Data
(Text)
Supervised/Unsupervised ASR
• Supervised ASR
– Has been very successful
– Problem : requiring a huge quantity of annotated data
• Unsupervised ASR
– Train without annotated data
– Unlabeled, unpaired data are easier to
collect
– something we could Never do before
(2018)
Thousands of languages are spoken around the world
‐ most are low-resourced, without enough annotated data
Use of Generative Adversarial Networks (GAN)
[Diagram: audio data → acoustic features → Generator (ASR) → generated
phoneme sequences (e.g. "hh aw aa r y uw"); text data (e.g. "How are
you.") → real phoneme sequences; both are fed to the Discriminator]
Use of Generative Adversarial Networks (GAN)
• Generator (ASR) tries to "fool" the Discriminator
• Discriminator tries to distinguish real from generated phoneme
sequences
• Discriminator / Generator improve themselves individually and
iteratively (trained iteratively; a minimal sketch follows below)
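A minimal PyTorch sketch of this adversarial loop, with tiny placeholder
networks and random stand-in data; the shapes, architectures, and losses
are illustrative assumptions, not the published models of the following
slides.

    import torch
    import torch.nn as nn

    T, F, P = 50, 39, 48   # frames, acoustic feature dim, number of phonemes
    G = nn.Sequential(nn.Linear(F, 128), nn.ReLU(), nn.Linear(128, P))  # Generator (ASR)
    D = nn.Sequential(nn.Linear(P, 128), nn.ReLU(), nn.Linear(128, 1))  # Discriminator
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    for step in range(100):
        feats = torch.randn(T, F)                   # acoustic features (stand-in)
        real = nn.functional.one_hot(
            torch.randint(P, (T,)), P).float()      # real phoneme seq (from text)
        fake = torch.softmax(G(feats), dim=-1)      # generated phoneme posteriors

        # Discriminator step: label real sequences 1, generated ones 0
        d_loss = (bce(D(real), torch.ones(T, 1)) +
                  bce(D(fake.detach()), torch.zeros(T, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: try to make the Discriminator call the output real
        g_loss = bce(D(fake), torch.ones(T, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()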
Model 1 (2018)
[Diagram: acoustic feature sequence X1 X2 X3 … XM → Audio word2vec →
audio embedding sequence z1 z2 z3 … zM]
• Waveform segmentation and embedding
– divide the features into acoustically similar segments of different lengths
– transform each segment into a fixed-length vector (audio embedding)
[Interspeech 2018]
Model 1 (2018)
• Cluster the embeddings into groups with K-means (a minimal sketch
follows below)
[Diagram: audio embedding sequence z1 z2 z3 … zM → K-means → cluster
index sequence, e.g. 16 25 2 … 2]
[Interspeech 2018]
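A hedged sketch of this clustering step with scikit-learn; the random
embeddings stand in for the audio word2vec outputs, and the number of
clusters is arbitrary.

    import numpy as np
    from sklearn.cluster import KMeans

    # Cluster audio embeddings (one vector per segment) and replace each
    # segment by its cluster index. Embeddings here are random stand-ins.
    embeddings = np.random.randn(1000, 64)             # z_1 ... z_M
    kmeans = KMeans(n_clusters=40, n_init=10).fit(embeddings)
    cluster_index_sequence = kmeans.predict(embeddings[:5])
    print(cluster_index_sequence)                      # e.g. [16 25 2 ...]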
Model 1 (2018)
• Learning the mapping between cluster indices and
phonemes with a GAN
– embedding clustering followed by (cascaded with) a GAN
– the trained Generator maps audio all the way to phonemes:
ASR ! (phoneme recognition)
[Diagram: cluster index sequence (16 25 2 … 2) → Generator (lookup
table) → generated phoneme sequence (sil hh ih … sil); a Discriminator
(CNN network) judges real vs. generated phoneme sequences]
[Interspeech 2018]
Model 2 (2019)
• Generator consists of two parts
(a) Phoneme Classifier (DNN)
(b) Sampling Process
• Discriminator is a two-layer 1-D CNN
• A GAN (Generator and Discriminator) trained
End-to-end
– DNN trained in an unsupervised way
[Interspeech 2019]
Model 2 (2019)
• GAN iterated with HMMs
1. Obtain the initial phoneme boundaries
2. Train the GAN in an unsupervised way
3. Obtain transcriptions (pseudo labels) from the GAN model
4. Train the HMMs on these transcriptions
5. Obtain new boundaries from the HMMs, then repeat from step 2
[Interspeech 2019]
Experimental Results for Models 1, 2 on TIMIT
• Phoneme Error Rate (PER, %)

  Approaches                                    Matched   Unrelated
  Supervised
    RNN Transducer                              17.7      -
    Standard HMMs                               21.5      -
  Completely unsupervised (no label at all)
    Model 1                                     76.0      -
    Model 2   Iteration 1   GAN                 48.6      50.0
              Iteration 1   HMM                 30.7      39.5
              Iteration 2   GAN                 41.0      44.3
              Iteration 2   HMM                 27.0      35.5
              Iteration 3   GAN                 38.4      44.2
              Iteration 3   HMM                 26.1      33.1

[Interspeech 2018] [Interspeech 2019]
• Model 1 cascaded clustering with a GAN, while Model 2 did
everything end-to-end with a GAN
The Progress of Supervised Learning on TIMIT
• Milestones in phone recognition accuracy
[Figure: accuracy of supervised phone recognition on TIMIT over the
years; unsupervised learning today is as good as supervised learning
30 years ago]
[Phone recognition on the TIMIT database, Lopes, C. and Perdigão, F.,
2011. Speech Technologies, Vol. 1, pp. 285-302.]
– Will it take another 30 years for unsupervised learning to achieve the
performance of supervised learning today ?
Model 3 (2019)
• A Cascading Recognition-Synthesis Framework
– learning basic sound units by listening and reproducing, the way
babies learn to speak from their parents
– audio data only (no text), plus a small set of paired data
– the recognition half is ASR ! (phoneme recognition)
[Diagram: audio → ASR → phoneme sequence in a vector space (e.g. B AE T)
→ TTS → reproduced audio]
[ICASSP 2020]
Model 3 (2019)
• Implemented with a Sequential Quantization AutoEncoder
[Diagram: Encoder (RNNs) → frame-synchronous encoder output sequence →
Quantizer → phoneme representation sequence → Decoder (seq-to-seq model)]
[ICASSP 2020]
Model 3 (2019)
• Implemented with a Sequential Quantization AutoEncoder
– Quantizer: for each frame-synchronous encoder output, take the argmin
of the L2 distances to a learnable codebook and look up the codeword
(Phonetic Clustering); repeated codewords give the Temporal
Segmentation, e.g. the phoneme representation sequence
[e4, e4, e4, e2, e2]
[ICASSP 2020]
Model 3 (2019)
• Sequential Quantization AutoEncoder – Phonetic Clustering
– with a small quantity of paired data, train the codebook to match
real phonemes (a minimal sketch follows below)
1. Assign one phoneme to each codeword e_v
2. Define the probability that vector h_t belongs to phoneme v:
   Pr(h_t belongs to v) = exp(−‖h_t − e_v‖²) / Σ_u exp(−‖h_t − e_u‖²)
3. Match with the paired data by maximizing this probability
[ICASSP 2020]
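A minimal NumPy sketch of step 2, assuming a random codebook with one
codeword per phoneme; it is the softmax over negative squared L2
distances written out, not the authors' training code.

    import numpy as np

    # Pr(h_t belongs to v) = exp(-||h_t - e_v||^2) / sum_u exp(-||h_t - e_u||^2)
    def phoneme_posterior(h_t, codebook):
        d2 = np.sum((codebook - h_t) ** 2, axis=1)  # squared L2 to each e_v
        logits = -d2
        p = np.exp(logits - logits.max())           # numerically stable softmax
        return p / p.sum()

    codebook = np.random.randn(48, 64)              # one codeword per phoneme
    h_t = np.random.randn(64)                       # one encoder output vector
    posterior = phoneme_posterior(h_t, codebook)
    print(posterior.argmax(), posterior.max())      # most likely phoneme for h_t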
Model 3 (2019)
• Implemented with a Sequential Quantization AutoEncoder
– End-to-end trained (with enough audio data plus small paired data)
[Diagram: Encoder → Quantizer (learnable codebook; L2-distance argmin and
lookup for Phonetic Clustering and Temporal Segmentation) → Decoder; the
forward pass goes through the quantizer, while the backward pass uses
straight-through gradient estimation (a minimal sketch follows below)]
[ICASSP 2020]
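A minimal PyTorch sketch of the quantizer with straight-through gradient
estimation; the dimensions and random codebook are illustrative. The
trick: the forward pass outputs the nearest codeword, while the backward
pass copies gradients past the non-differentiable argmin (the codebook
itself would be trained by a separate loss, e.g. the phoneme-matching
loss above).

    import torch

    def quantize(h, codebook):
        # h: (T, d) encoder outputs; codebook: (V, d) learnable codewords
        d2 = torch.cdist(h, codebook) ** 2    # (T, V) squared L2 distances
        idx = d2.argmin(dim=1)                # argmin: nearest codeword per frame
        e = codebook[idx]                     # lookup
        # Straight-through: forward value is e, gradient flows as if it were h
        return h + (e - h).detach(), idx

    h = torch.randn(5, 64, requires_grad=True)
    codebook = torch.nn.Parameter(torch.randn(48, 64))
    q, idx = quantize(h, codebook)
    q.sum().backward()                        # gradients reach h despite the argmin
    print(idx, h.grad.shape)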
Model 3 (2019)
• After training, the Encoder plus Quantizer alone already perform
ASR ! (phoneme recognition)
[ICASSP 2020]
Experiment Results for Model 3 on LJSpeech
• Phoneme Error Rate (PER, %)
– when different amounts of paired data are available (single speaker)

                     Unlabeled Audio   20 min.   15 min.   10 min.   5 min.
                                       Paired    Paired    Paired    Paired
  Proposed Method    22 hrs.           25.5      29.0      35.2      49.3

– initial recognition observed, although only 20 min of paired data
were used
– related work reported recently
[ICASSP 2020]
• Deep learning makes tasks that were impossible before
realizable today
• Many more crazy concepts are yet to be explored !
• Unpaired data may be more useful than we thought
• End-to-end training is more attractive than
cascading in the context of deep learning
– overall performance is optimized globally rather than locally
Mini-Remarks
• Doing something we never could is interesting and
exciting !
• Though challenging, the difficulties may be reduced by
fast-advancing technologies, including deep learning
• 15 years ago we never knew what kind of technologies
we could have today
– today we never know what kind of technologies we may have 15
years from now
– anything in our mind could be possible
• This may be the golden age we never had for research
– very deep learning, very big data, very powerful machines, very
strong industry
– which we never had before
– possible to do something we Never could !
Concluding Remarks