Kaldi-voice
Your personal speech recognition
server using open source code
Xavier Anguera
CTO - CSO, ELSA Corp.
xavier@elsanow.io
Outline
•  Intro
•  What is speech recognition
–  Applications
•  Approaches to ASR
–  Pattern matching approaches
–  Statistical-based approaches
•  Available speech recognition engines
–  “open” source
–  Online commercial systems
•  Building your own online system
–  Live demo
Automatic Speech Recognition
•  Automatic Speech Recognition (ASR) is the
process of converting an unknown speech
waveform into the corresponding orthographic
transcription.
Image: http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx
•  Content
–  Search
–  Summary
–  Transcripts
–  Meaning
•  Personal context
–  Age
–  Gender
–  Height
–  Spoken language
–  Spoken dialect
–  Spoken accent
–  Literacy level
–  Speaker ID
–  Personality traits (OCEAN)
–  Speech likability
–  Speech intelligibility
–  Sleepiness/tiredness
–  Intoxication level
–  Emotion
–  State of interest
Image: Telefonica I+D
Applications of Speech Recognition/Understanding (ASR/ASU)
•  Dictation
•  Telephone-based information
–  directions, air travel, banking, etc.
•  Polls, online shopping
•  Call routing
•  Hands-free
–  in car, computer, home (domotics), controlling tools
•  Second language (accent reduction)
•  Audio archive searching
•  Help for disabled people
How do humans do it?
The articulation system of one
person produces sound waves,
which the ear of another person
conveys to the brain for
processing.
How can computers do it?
•  Digitization
•  Acoustic analysis of the
speech signal
•  Linguistic interpretation
[Diagram labels: Acoustic waveform, Acoustic signal, Speech recognition]
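The first two steps above can be sketched on a synthetic signal. This is only an illustration in plain Python: the 16 kHz sampling rate, the 25 ms frame with a 10 ms hop, the Hamming window, and the helper names `frames` and `magnitude_spectrum` are assumptions chosen for the example, not values taken from the talk.

```python
import cmath
import math

def frames(samples, frame_len, hop):
    """Split a signal into overlapping frames of `frame_len` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Apply a Hamming window and return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
                for k, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(windowed)))
            for b in range(n // 2 + 1)]

SAMPLE_RATE = 16000  # assumed sampling rate in Hz
FRAME_LEN = 400      # 25 ms at 16 kHz (assumed)
HOP = 160            # 10 ms at 16 kHz (assumed)

# Stand-in for digitization: 100 ms of a sampled 1 kHz tone.
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1600)]

# Acoustic analysis: short-time frames, then one spectrum per frame.
chunks = frames(signal, FRAME_LEN, HOP)
spectrum = magnitude_spectrum(chunks[0])
peak_hz = spectrum.index(max(spectrum)) * SAMPLE_RATE / FRAME_LEN
```

The naive DFT is quadratic in the frame length; it is used here only to keep the sketch free of external libraries.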
Challenges in ASR processing
•  Speaker variability
–  Inter-speaker: vocal tract, gender, dialects
–  Intra-speaker: stress, age, mood, changes of articulation due to
environment influence, …
•  Language variability
–  From isolated words to continuous speech
–  Out-of-vocabulary words
•  Vocabulary size and domain
–  From just a few words (e.g. isolated numbers) to large vocabulary speech
recognition
–  Domain that is being recognized (medical, social, engineering, …)
•  Noise
–  Convolutive: recording/transmission conditions, reverberation
–  Additive: recording environment, transmission SNR
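The difference between the two noise types can be shown on a toy signal. This is a minimal sketch in plain Python; the function names, the 440 Hz test tone, and the short impulse response are hypothetical and serve only as an illustration.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)

def add_noise_at_snr(clean, noise, snr_db):
    """Additive noise: scale `noise` so the clean-to-noise power ratio is snr_db."""
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]

def convolve(signal, impulse_response):
    """Convolutive distortion: filter the signal with a channel or room response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in clean]

noisy = add_noise_at_snr(clean, noise, snr_db=10)                 # additive
reverberant = convolve(clean, [1.0, 0.0, 0.0, 0.5, 0.0, 0.25])    # convolutive (toy echo)
```

Additive noise is summed onto the waveform, while convolutive distortion reshapes the waveform itself, which is why the two are usually compensated with different techniques.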
Approaches to ASR
•  Pattern-based approaches
•  Statistics-based approaches
Pattern-based speech recognition
•  Feature measurement: filter bank, MFCC, LPC, DFT, ...
•  Pattern training: creation of a reference pattern derived from an averaging technique
•  Pattern classification: compare speech patterns with a local distance measure and a
global time alignment procedure (DTW)
•  Decision logic: similarity scores are used to decide which is the best reference
pattern.
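The classification and decision steps above can be sketched with a small dynamic time warping (DTW) routine. This is a simplified illustration: each frame is a single number rather than a feature vector, the local distance is the absolute difference, and the templates and the names `dtw_distance` and `classify` are made up for the example.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])        # local distance measure
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def classify(utterance, templates):
    """Decision logic: pick the reference pattern with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Hypothetical reference patterns, one per word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}
query = [1, 1, 3, 4, 4, 3, 1]   # same shape as "yes", spoken more slowly
best = classify(query, templates)
```

Because the alignment may repeat frames, the slower query still matches its template with zero cost, which is the property that makes DTW useful for speech of varying speed.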
Template Matching Mechanism
TDP:  Speech Recognition
Alignment Example
Statistics-based approaches
•  Can be seen as an extension of the template-based approach,
using more powerful mathematical and statistical tools
•  Sometimes seen as an “anti-linguistic” approach
–  Fred Jelinek (IBM, 1988): “Every time I fire a linguist my
system improves”
•  Process:
1.  Collect a large corpus of transcribed speech recordings
2.  Train the computer to learn the correspondences
(“machine learning”)
3.  At run time, apply statistical processes to search through
the space of all possible solutions, and pick the
statistically most likely one
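Step 3 is commonly written as a maximum a posteriori decision rule. The notation below is the usual textbook formulation rather than something shown on the slides: X stands for the sequence of acoustic observations and W for a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} p(X \mid W)\, P(W)
```

Here p(X | W) plays the role of the acoustic model and P(W) that of the language model; p(X) can be dropped because it does not depend on W.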
Statistics-based approaches
•  Hidden Markov Models (HMM)
•  Gaussian Mixture Models (GMM)
•  Deep Neural Networks (DNN)
Markov model
Output = sequence of states
Image: http://madhukaudantha.blogspot.pt/2014/05/markov-models-and-hidden-markov-models.html
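A Markov model whose output is the state sequence itself can be sketched in a few lines. The three states and their transition probabilities are invented for illustration, as are the helper names.

```python
import random

# Hypothetical transition probabilities: transitions[current][next].
transitions = {
    "A": {"A": 0.6, "B": 0.3, "C": 0.1},
    "B": {"A": 0.2, "B": 0.5, "C": 0.3},
    "C": {"A": 0.1, "B": 0.4, "C": 0.5},
}

def sample_chain(transitions, start, length, rng):
    """Generate a state sequence; each state depends only on the previous one."""
    path = [start]
    while len(path) < length:
        options = transitions[path[-1]]
        path.append(rng.choices(list(options), weights=list(options.values()))[0])
    return path

def sequence_probability(transitions, path):
    """Probability of a given state sequence, conditioned on its first state."""
    p = 1.0
    for prev, cur in zip(path, path[1:]):
        p *= transitions[prev][cur]
    return p

rng = random.Random(0)
path = sample_chain(transitions, "A", 10, rng)
p_path = sequence_probability(transitions, path)
```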
Hidden Markov Models (HMM)
Output = observations linked to the states through a predefined
probability distribution, modeled using GMM or DNN models
Image: http://izanami.tl.fukuoka-u.ac.jp/SLPL/HMM/HTKBook/node5.html
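In an HMM the states are not observed, so the likelihood of an observation sequence sums over all state paths. The forward algorithm does this efficiently. The sketch below uses a small discrete emission table as a stand-in for the GMM or DNN scores mentioned above; all states, symbols, and probabilities are invented for illustration.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """P(observations) under the HMM, summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state left-to-right model with two output symbols.
states = ("s1", "s2")
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {"s1": {"s1": 0.5, "s2": 0.5},
           "s2": {"s1": 0.0, "s2": 1.0}}
emit_p = {"s1": {"a": 0.9, "b": 0.1},
          "s2": {"a": 0.2, "b": 0.8}}

likelihood = forward_likelihood(["a", "b"], states, start_p, trans_p, emit_p)
```

In a recognizer, one such model per word or phone would be scored against the same observations and the best-scoring one kept; real systems replace the emission table with continuous densities and work in the log domain.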
HMMs for some words
Gaussian Mixture Models (GMM)
1D GMM
2D GMM
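A one-dimensional GMM is a weighted sum of Gaussian densities. The sketch below evaluates such a mixture; the two components and their parameters are arbitrary values picked for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)

def gmm_pdf(x, weights, means, variances):
    """Density of a 1D Gaussian mixture: weighted sum of component densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture; the weights must sum to 1.
weights = [0.3, 0.7]
means = [-2.0, 1.5]
variances = [0.5, 1.0]

density_at_zero = gmm_pdf(0.0, weights, means, variances)
```

The 2D case follows the same pattern, with a mean vector and a covariance matrix per component in place of the scalar mean and variance.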
Deep neural networks
Image: http://www.amax.com/blog/
A neuron in our brain
Image: http://www.medicalsciencenavigator.com/how-to-study-for-anatomy-and-physiology/why-sleep-improves-memory
Classical representation of a neuron
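The classical artificial neuron computes a weighted sum of its inputs plus a bias and passes it through a nonlinearity; stacking layers of such neurons gives a multilayer network. The sketch below assumes a sigmoid activation, and all weights are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, followed by the activation."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def layer(inputs, weight_rows, biases):
    """One layer: every neuron sees the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def mlp(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output.
network = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
output = mlp([1.0, 2.0], network)
```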
Long short-term memory cells
DNN evolution
•  We started to use multilayer perceptrons
(MLPs) about 25 years ago [1]
–  Neural networks with 1 or few hidden layers
•  Around 2010 G. Hinton and Y. Bengio
(separately) proposed methods to effectively
train many hidden layers
–  Machines have become much more powerful
–  Lots of audio data with transcriptions are available
[1] “Merging Multilayer perceptrons and Hidden Markov Models: some experiments in continuous
speech recognition”, Herve Bourlard and Nelson Morgan, Technical report ICSI, 1989
Image: http://whatsnext.nuance.com/category/in-the-labs/
Processing power evolution
Image: http://whatsnext.nuance.com/category/in-the-labs/
ASR performance evolution
Speech recognition engines
•  HTK (http://htk.eng.cam.ac.uk/), non-commercial license
•  Sphinx (http://cmusphinx.sourceforge.net/), BSD-style license
•  Julius (http://julius.osdn.jp/en_index.php), open
•  Kaldi (http://www.kaldi-asr.org/), Apache license
Online ASR-STT services
•  Google voice (https://console.developers.google.com/project)
•  ATT voice recognition (http://developer.att.com/apis/speech)
•  Wit.ai (https://wit.ai/)
Building an ASR with open source tools
•  We need:
–  Speech recognition engine
–  Speech databases / models
–  Online speech server
–  Frontend interfaces
Kõnele app
Dictate.js
My toolchain
•  Kaldi ASR: http://www.kaldi-asr.org/
•  Kaldi gstreamer server: https://github.com/alumae/kaldi-gstreamer-server
•  Dictate.js: http://kaljurand.github.io/dictate.js/
•  Kõnele app: https://kaljurand.github.io/K6nele/
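Besides the two frontends, the server can also be queried from a script. The sketch below assumes a kaldi-gstreamer-server instance with at least one worker running on localhost port 8888, and it assumes the HTTP endpoint `/client/dynamic/recognize` and the JSON reply shape (`status`, `hypotheses`, `utterance`) described in that project's README; check them against your installed version. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed host, port and endpoint; adjust to your own deployment.
SERVER_URL = "http://localhost:8888/client/dynamic/recognize"

def build_request(audio_bytes, url=SERVER_URL):
    """Prepare an HTTP PUT request carrying the audio file contents."""
    return urllib.request.Request(url, data=audio_bytes, method="PUT")

def parse_hypotheses(response_body):
    """Extract the transcripts from the JSON reply (assumed reply format)."""
    reply = json.loads(response_body)
    if reply.get("status") != 0:   # any non-zero status is treated as a failure here
        return []
    return [h["utterance"] for h in reply.get("hypotheses", [])]

def recognize(wav_path, url=SERVER_URL):
    """Send one audio file to the server and return its transcripts."""
    with open(wav_path, "rb") as audio_file:
        request = build_request(audio_file.read(), url)
    with urllib.request.urlopen(request) as response:
        return parse_hypotheses(response.read().decode("utf-8"))

# Usage, with the server running:
#   print(recognize("test.wav"))
```

For live dictation the frontends use the server's WebSocket interface instead, which streams audio and returns partial results; the HTTP route is the simpler option for whole files.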
Demo

Kaldi-voice: Your personal speech recognition server using open source code