Deep Learning
and its Application on
Speech Processing
Hung-yi	Lee
[Figure: Spoken Content → Speech Recognition → Recognition Output]
Speech Recognition
How to do speech recognition with deep learning?
Deep		
Learning
People imagine ……
This	is	not	true!
DNN	can	only	take	fixed-length	
vectors	as	input	and	output.
“大家好 我今天 ….” (“Hello everyone, today I ….”)
DNN
Input	and	output	are	sequences	
with	different	lengths.
Recurrent Neural Network
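[Figure: RNN unrolled over time, with inputs x1, x2, x3 and outputs y1, y2, y3; the same weights Wi (input), Wh (hidden to hidden) and Wo (output) are used at every step]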
How	about	Recurrent	Neural	Network	(RNN)?
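A minimal sketch of the recurrence, to show why an RNN can take sequences of any length: the same Wi, Wh and Wo are applied at every step. The tanh activation, the layer sizes and the random weights are assumptions for illustration; they are not given on the slide.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, Wi, Wh, Wo):
    """Run a simple RNN over the input vectors xs, returning one output per input."""
    h = [0.0] * len(Wh)                      # initial hidden state
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(Wi, x), matvec(Wh, h))]
        h = [math.tanh(p) for p in pre]      # assumed activation: tanh
        ys.append(matvec(Wo, h))             # same Wo at every step
    return ys

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
Wi = random_matrix(4, 3, rng)   # input (3-dim) to hidden (4-dim)
Wh = random_matrix(4, 4, rng)   # hidden to hidden
Wo = random_matrix(2, 4, rng)   # hidden to output (2-dim)
three_frames = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
seven_frames = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
print(len(rnn_forward(three_frames, Wi, Wh, Wo)))
print(len(rnn_forward(seven_frames, Wi, Wh, Wo)))
```

Note that the output sequence is always as long as the input sequence.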
Recurrent Neural Network
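Input: (vector sequence), one vector every 0.01s
Output: (character sequence), one character per input vector
[Figure: the RNN outputs 好 好 好 棒 棒 棒 棒 棒]
Trimming the repeated characters gives “好棒” (“great”).
Problem? Why can’t it be “好棒棒” (the same phrase with 棒 doubled)?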
Recurrent Neural Network
•  Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
•  Add an extra symbol “φ” representing “null”
•  好 φ φ 棒 φ φ φ φ → “好棒”
•  好 φ φ 棒 φ 棒 φ φ → “好棒棒”
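A small sketch of how a frame-level output with the null symbol is turned into the final character sequence: merge consecutive repeats first, then drop “φ”. The function name `ctc_collapse` is made up for this example, and only this collapsing step is shown, not CTC training.

```python
from itertools import groupby

BLANK = "φ"  # the extra "null" symbol

def ctc_collapse(frames, blank=BLANK):
    """Merge consecutive repeated symbols, then remove the null symbol."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(symbol for symbol in merged if symbol != blank)

print(ctc_collapse("好φφ棒φφφφ"))    # 好棒
print(ctc_collapse("好φφ棒φ棒φφ"))   # 好棒棒
```

The null symbol between the two 棒 is what keeps them from being merged into one.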
Sequence-to-sequence Learning
•  Sequence to sequence learning: Both input and output are sequences with different lengths.
•  acoustic feature sequence → character sequence
[Figure: the input utterance “機器學習” (“machine learning”) is encoded into a vector containing all information about the input utterance]
Sequence-to-sequence Learning
•  Sequence to sequence learning: Both input and output are sequences with different lengths.
[Figure: for the utterance “機器學習” (“machine learning”), the decoder outputs 機 器 學 習 and then keeps going: 慣 性 ……]
Don’t know when to stop
Sequence-to-sequence Learning
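•  Sequence to sequence learning: Both input and output are sequences with different lengths.
•  Add a symbol “。” (句點, full stop) so that the output can end
[Figure: for the utterance “機器學習” (“machine learning”), the decoder outputs 機 器 學 習 。 and stops]
[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]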
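A toy sketch of the stopping rule: generation ends as soon as the end symbol “。” is produced. `toy_decoder` is a hypothetical stand-in that replays a fixed script; a trained decoder would predict each symbol from the encoded utterance and the previous outputs. The `max_len` safety limit is also an assumption.

```python
END = "。"  # end-of-sequence symbol

def greedy_decode(next_symbol, max_len=50):
    """Generate symbols one at a time until END is produced (or max_len is hit)."""
    output = []
    prev = None
    while len(output) < max_len:
        symbol = next_symbol(prev, len(output))
        if symbol == END:
            break                      # the decoder decided to stop here
        output.append(symbol)
        prev = symbol
    return "".join(output)

def toy_decoder(prev, position):
    """Hypothetical stand-in for a trained decoder: replays a fixed script."""
    script = "機器學習。慣性"
    return script[position] if position < len(script) else END

print(greedy_decode(toy_decoder))  # 機器學習
```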
[Figure: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result]
Spoken Content Retrieval
People think ……
•  Transcribe spoken content into text by speech recognition
•  Use text retrieval approach to search the transcriptions
[Figure: Spoken Content → Speech Recognition (with Recognition Models) → Text → Text Retrieval (given the Query from the learner) → Retrieval Result; a “Black Box” label marks part of the pipeline]
People think ……
Spoken Content Retrieval = Speech Recognition + Text Retrieval
•  Good spoken content retrieval needs a good speech recognition system.
•  In real applications, such high-quality recognition models are not available
•  E.g., YouTube
•  Different languages/accents
•  Different recording environments
•  Hope for spoken content retrieval
•  Don’t completely rely on accurate speech recognition
•  Accurate spoken content retrieval, even under poor speech recognition
Problem?
[Figure: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result (Spoken Content Retrieval), with “Beyond Cascading?” marked on the pipeline]
•  Is the cascading of speech recognition and text retrieval the only solution to spoken content retrieval?
Beyond Cascading Speech Recognition and Text Retrieval
•  5 directions
•  Modified Speech Recognition for Retrieval Purposes
•  Exploiting Information not present in ASR outputs
•  Directly Matching on Acoustic Level without ASR
•  Semantic Retrieval of Spoken Content
•  Interactive Retrieval and Efficient Presentation of Retrieved Objects
Overview paper "Spoken Content Retrieval —Beyond Cascading Speech Recognition with Text Retrieval"
http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
Our Point
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
[Figure: Spoken Content → Speech Recognition (Beyond Cascading?) → Recognition Output → Retrieval → Retrieval Result → Interaction with the user]
Interact with Humans
[Figure: Spoken Content → Speech Recognition (Beyond Cascading?) → Recognition Output → Retrieval and Semantic Analysis → Retrieval Result → Interaction with the user]
Semantic Analysis
Unsupervised Learning
•  Machine reads lots of text on the Internet ……
蔡英文 520宣誓就職 (“蔡英文 sworn into office on May 20”)
馬英九 520宣誓就職 (“馬英九 sworn into office on May 20”)
→ 蔡英文 and 馬英九 are something very similar
“You shall know a word by the company it keeps”
Semantic Analysis
•  Let machine read lots of documents.
•  Each word is represented as a vector
[Figure: word vectors plotted in a space; example words: dog, cat, rabbit, jump, run, flower, tree]
Semantic Analysis
•  Even the distances between the vectors have some meaning.
Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
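A minimal sketch of comparing word vectors by cosine similarity. The three 3-dimensional vectors are invented for illustration; real word vectors are learned from large amounts of text and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not learned from data.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "run": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["cat"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["run"]))
```

With these toy numbers, dog is closer to cat than to run.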
[Figure: Spoken Content → Speech Recognition (Beyond Cascading?) → Recognition Output → Retrieval, Semantic Analysis and Key Term Extraction → Retrieval Result → Interaction with the user]
Key Term Extraction
[Interspeech 2015] (with 沈昇勳)
[Figure: Spoken Content → Speech Recognition (Beyond Cascading?) → Recognition Output → Retrieval, Semantic Analysis, Key Term Extraction and Summarization → Retrieval Result → Interaction with the user]
Summarization
Speech Summarization
Extractive Summaries
•  Select the most informative segments to form a compact version
[Figure: Retrieved Audio File (1 hour long) → Summary (10 minutes)]
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
Speech Summarization
•  Write the summary in your own words (Abstractive Summaries)
•  Machine learns to do abstractive summarization from 2,000,000 training examples
[Figure: example summaries written by Human and by Machine]
NTU EE: 盧柏儒, 徐翊祥
NTU CSIE: 葉正杰, 周儒杰
(TA: 余朗祺)
[Figure: Spoken Content → Speech Recognition (Beyond Cascading?) → Recognition Output → Retrieval, Semantic Analysis, Key Term Extraction, Summarization and Question-answering → Retrieval Result → Interaction with the user (question, answer)]
Question Answering
[Figure: Spoken Content → Speech Recognition (Beyond Cascading?) → Recognition Output → Retrieval, Semantic Analysis, Key Term Extraction, Summarization and Question-answering → Retrieval Result → Interaction with the user (question, answer)]
Without Speech Recognition?
Outline
Very Brief Introduction of Deep Learning
Towards Machine Comprehension of Spoken Content
•  Overview
•  Example I: Speech Question Answering
•  Example II: Interactive Spoken Content Retrieval
•  Example III: What can machine learn from audio without any supervision
Speech Question Answering
•  Machine answers questions based on the information in spoken content
[Figure: the question “What is a possible origin of Venus’ clouds?” and the spoken content go into the machine, which outputs the answer]
Speech Question Answering
•  TOEFL Listening Comprehension Test by Machine
•  Example:
Audio Story: (The original story is 5 min long.)
Question: “What is a possible origin of Venus’ clouds?”
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
Simple Baselines
[Figure: bar chart of Accuracy (%) for Naive Approaches (1) to (7), with random as reference]
(2) select the shortest choice as answer
(4) select the choice whose semantics are most similar to the other choices
Experimental setup: 717 for training, 124 for validation, 122 for testing
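A sketch of two naive baselines and of how accuracy is scored. Measuring “shortest” by word count is an assumption (the slide does not say how length is measured), and the function names are made up for this example.

```python
import random

def pick_shortest(choices):
    """Baseline: index of the choice with the fewest words (assumed length measure)."""
    return min(range(len(choices)), key=lambda i: len(choices[i].split()))

def pick_random(choices, rng):
    """Baseline: index of a uniformly random choice."""
    return rng.randrange(len(choices))

def accuracy(predictions, answers):
    """Percentage of questions where the predicted index equals the answer index."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

choices = [
    "gases released as a result of volcanic activity",
    "chemical reactions caused by high surface temperatures",
    "bursts of radio energy from the planet's surface",
    "strong winds that blow dust into the atmosphere",
]
print("ABCD"[pick_shortest(choices)])                # B
print("ABCD"[pick_random(choices, random.Random(0))])
```

Neither baseline listens to the audio story at all, which is why they serve only as a floor for comparison.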
Supervised Learning
[Figure: bar chart of Accuracy (%) for Naive Approaches (1) to (7)]
Memory Network: 39.2% (proposed by FB AI group)
Interspeech 2016 (with 曾柏翔)
Model Architecture
Question: “what is a possible origin of Venus’ clouds”
Question → Semantic Analysis → Question Semantics
Audio Story: → Speech Recognition → Semantic Analysis
…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth ……
Attention (highlighting the key points): Question Semantics attend over the story → Answer
Select the choice most similar to the answer
Similar to Memory Network
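A rough sketch of the attention idea: the question vector scores each part of the story, the scores are normalized with softmax, and the weighted sum serves as the answer vector, which is then compared with each choice. Dot-product scoring, cosine similarity for choice selection and the 2-dimensional toy vectors are all assumptions for illustration; the slide does not specify these details.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question, story):
    """Weight each story vector by its match with the question; return the weighted sum."""
    weights = softmax([dot(question, s) for s in story])
    answer = [sum(w * s[d] for w, s in zip(weights, story))
              for d in range(len(question))]
    return answer, weights

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def select_choice(answer, choices):
    """Index of the choice vector most similar to the answer vector."""
    return max(range(len(choices)), key=lambda i: cosine(answer, choices[i]))

# Made-up toy vectors standing in for the outputs of semantic analysis.
question = [1.0, 0.0]
story = [[2.0, 0.1], [0.0, 1.0], [0.1, 2.0]]
choices = [[0.0, 1.0], [1.0, 0.0]]
answer, weights = attend(question, story)
print(weights)
print(select_choice(answer, choices))
```

The story vector that matches the question best receives the largest weight, so it dominates the answer vector.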
Model Architecture
Word-based Attention
Model Architecture
Sentence-based Attention
Supervised Learning
[Figure: bar chart of Accuracy (%) for Naive Approaches (1) to (7)]
Memory Network: 39.2% (proposed by FB AI group)
Word-based Attention: 48.3%
Interspeech 2016 (with 曾柏翔)
Outline
Very Brief Introduction of Deep Learning
Towards Machine Comprehension of Spoken Content
•  Overview
•  Example I: Speech Question Answering
•  Example II: Interactive Spoken Content Retrieval
•  Example III: What can machine learn from audio without any supervision
Interact with Users
•  Interactive retrieval is helpful.
user: “深度學習” (“deep learning”)
system: Do you mean the “deep learning” related to machine learning, or the “deep learning” related to education?
Audio is hard to browse
•  When the system returns the retrieval results, the user doesn’t know what he/she gets at first glance
[Figure: Retrieval Result]
[Figure: the user sends a query; Spoken Content Retrieval over the Spoken Content returns Retrieval Results to the user]
Interactive retrieval of spoken content
Directly showing the retrieval results is probably not a good idea.
[Figure: the user sends a query; Spoken Content Retrieval over the Spoken Content produces Retrieval Results]
Interactive retrieval of spoken content
“Give me an example.”
“Is it relevant to XXX?”
“Can you give me another query?”
“Show the results.”
Given the current situation, which action should be taken?
……
[Figure: the user sends a query; Spoken Content Retrieval over the Spoken Content produces Retrieval Results]
Interactive retrieval of spoken content
•  State Estimation: features → state
•  state: the degree of clarity from the retrieval results
•  Action Decision: state → action
•  The policy π(s) is a function
•  Input: state s, output: action a
Decide the actions by intrinsic policy π(s)
[Interspeech 2012][ICASSP 2013]
[Figure: the user sends a query; Spoken Content Retrieval over the Spoken Content produces Retrieval Results]
Interactive retrieval of spoken content
[Figure: features → DNN (State Estimation + Action Decision) → one output per action (“Is it relevant to XXX?”, “Give me an example.”, “Show the results.”) → Max]
[Figure: the user sends a query; Spoken Content Retrieval over the Spoken Content produces Retrieval Results]
Interactive retrieval of spoken content
[Figure: features → DNN → one output per action (“Is it relevant to XXX?”, “Give me an example.”, “Show the results.”) → Max]
Learned from historical interaction
Goal: maximizing return (Retrieval Quality - User labor)
Deep Reinforcement Learning
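A small sketch of the two pieces named on the slide: picking the action with the largest network output (the “Max” step) and computing the return as retrieval quality minus user labor. The example numbers are hypothetical, and summing a cost per interaction turn is an assumed way of measuring user labor.

```python
ACTIONS = [
    "Is it relevant to XXX?",
    "Give me an example.",
    "Can you give me another query?",
    "Show the results.",
]

def choose_action(action_values):
    """Pick the action whose estimated value (one network output per action) is largest."""
    best = max(range(len(ACTIONS)), key=lambda i: action_values[i])
    return ACTIONS[best]

def episode_return(retrieval_quality, user_labor_costs):
    """Return = retrieval quality minus the total user labor over the interaction."""
    return retrieval_quality - sum(user_labor_costs)

# Hypothetical network outputs for the current features.
print(choose_action([0.1, 0.7, 0.2, 0.4]))   # Give me an example.
print(episode_return(0.6, [0.05, 0.1]))
```

Because every question costs the user some effort, the learned behavior has to trade off asking more against showing the results early.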
Experimental Results
•  Broadcast news, semantic retrieval
•  Optimization Target: Retrieval Quality - User labor
[Figure: Retrieval Quality (MAP); labels: Hand-crafted, Deep Learning, Previous Method (state + decision)]
submitted to Interspeech 2016 (with 吳彥諶、林子翔)
Experimental Results
Outline
Very Brief Introduction of Deep Learning
Towards Machine Comprehension of Spoken Content
•  Overview
•  Example I: Speech Question Answering
•  Example II: Interactive Spoken Content Retrieval
•  Example III: What can machine learn from audio without any supervision
Unsupervised Learning
Machine listens to lots of audio books
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder (accepted by Interspeech 2016)
Audio Word to Vector
•  Consider an audio segment corresponding to an unknown word
[Figure: audio segment → Deep Learning → vector]
(with TA 沈家豪)
Audio Word to Vector
•  The audio segments corresponding to words with similar pronunciations are close to each other.
[Figure: audio segments → Deep Learning → vectors; segments of “ever” and “never” lie near each other, as do segments of “dog” and “dogs”]
Sequence Auto-encoder
How to evaluate
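[Figure: audio segments of “never” and “ever” each go through Deep Learning to give a vector; the Cosine Similarity of the two vectors is compared with the Phoneme sequence edit distance of the two words]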
Experimental Results
More similar pronunciation → Larger cosine similarity.
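A sketch of the phoneme sequence edit distance used as the reference for pronunciation similarity. It is the standard Levenshtein distance with unit costs, which is an assumption about the exact cost scheme; the ARPAbet-style phoneme sequences are illustrative transcriptions, not taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit insert/delete/substitute costs)."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,         # delete x
                            curr[j - 1] + 1,     # insert y
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]

# Illustrative phoneme sequences.
never = ["N", "EH", "V", "ER"]
ever = ["EH", "V", "ER"]
dog = ["D", "AO", "G"]

print(edit_distance(never, ever))  # 1
print(edit_distance(never, dog))   # 4
```

A small edit distance means similar pronunciation, so such pairs are the ones expected to have a large cosine similarity between their audio vectors.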
Interesting Observation
•  Projecting the embedding vectors to 2-D
[Figure: 2-D projection of the embedding vectors of day, days, say, says]
Spoken Content Retrieval without
Speech Recognition
[Figure: the user speaks the spoken query “US President” into a handheld device; the query is matched against the Spoken Content]
Computing similarity between spoken queries and audio files on signal level
[Hazen, ASRU 09] [Zhang & Glass, ASRU 09] [Chan & Lee, Interspeech 10] [Zhang & Glass, ICASSP 11] [Gupta, Interspeech 11] [Zhang & Glass, Interspeech 11] [Huijbregts, ICASSP 11] [Chan & Lee, Interspeech 11]
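The slide does not name the matching algorithm. One common way to compare two feature sequences of different lengths directly on the signal level is dynamic time warping (DTW); the sketch below assumes DTW with Euclidean frame distance, as an illustration rather than the method of any particular cited paper. The 1-dimensional toy frames are invented.

```python
import math

def frame_distance(u, v):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(query, segment):
    """Cost of the best monotonic alignment between two frame sequences."""
    inf = float("inf")
    n, m = len(query), len(segment)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], segment[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stay on the same segment frame
                                 cost[i][j - 1],      # stay on the same query frame
                                 cost[i - 1][j - 1])  # advance in both sequences
    return cost[n][m]

# Toy 1-dimensional "features": the same pattern spoken at half speed, and an unrelated one.
query = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
other = [[5.0], [5.0], [5.0]]

print(dtw_distance(query, slow))    # 0.0
print(dtw_distance(query, other))
```

The alignment absorbs differences in speaking rate, so the slower copy of the query still matches with zero cost.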
Spoken Content Retrieval without
Speech Recognition
•  Why spoken content retrieval without speech recognition?
•  Lots of audio files in different languages on the Internet
•  Most languages have little annotated data for training speech recognition systems.
•  Some audio files are produced in several different languages
•  Some languages do not even have a written form
Spoken Content Retrieval without
Speech Recognition
Retrieval Performance
Concluding Remarks
Very Brief Introduction of Deep Learning
Towards Machine Comprehension of Spoken Content
•  Overview
•  Example I: Speech Question Answering
•  Example II: Interactive Spoken Content Retrieval
•  Example III: What can machine learn from audio without any supervision
Thank You for Your Attention

李宏毅 (Hung-yi Lee) / When Speech Processing Meets Deep Learning