Visual-speech to text conversion applicable to telephone communication for deaf individuals
30TH APRIL 2013
INTRODUCTION 
 In lip-reading, speech is understood by interpreting the movements of the lips, face and tongue.
 The mapping between visible lip movements and phonemes is not one-to-one.
 It is therefore impossible to distinguish all phonemes using visual information alone.
 The Cued Speech system
 Developed by Cornett.
 Contains two components: the hand shape and the hand position relative to the face.
 Hand shapes code consonant phonemes.
 Hand positions code vowel phonemes.
 Improves speech perception to a large extent.
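A toy sketch of why adding a hand cue resolves lip-reading ambiguity: consonants that look alike on the lips are assigned to different hand shapes, so intersecting the two sets leaves one candidate. The lip groups, hand-shape numbers, and consonant assignments below are invented for illustration and are not the actual Cued Speech chart.

```python
# Hypothetical groups for illustration only (not an official cue chart).
LIP_GROUPS = {
    "bilabial": {"p", "b", "m"},      # look the same on the lips
    "labiodental": {"f", "v"},
}
HAND_SHAPES = {
    1: {"p", "d"},
    2: {"v", "k"},
    4: {"b", "n"},
    5: {"m", "f", "t"},
}


def decode_consonant(lip_group, hand_shape):
    """Return the consonants compatible with both the lip group and the hand shape."""
    candidates = LIP_GROUPS[lip_group] & HAND_SHAPES[hand_shape]
    return sorted(candidates)
```

With these toy groups, the lips alone leave three candidates for a bilabial closure, while the lips plus hand shape 4 leave only "b".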
[Figure: the Cued Speech system]
AIM OF NEW SYSTEM 
 To investigate the design of a system able to automatically recognize Cued Speech and convert it to text.
 To make it possible for deaf or speech-impaired individuals to communicate with each other and with normal-hearing persons,
 using gestures
 captured by devices equipped with a camera.
METHODS 
 Corpus, feature extraction, and statistical modeling
 The data were derived from a video recording of the cuers pronouncing and coding in Cued Speech.
 The speakers’ lips were painted blue, and landmarks of different colors were placed on the speakers’ fingers.
 The color markings allow a faster and more accurate image processing stage.
 The audio part of the video recording was 
synchronized with the image. 
 An automatic image processing method was applied to the video to extract the lip parameters:
 lip width (A),
 lip aperture (B),
 lip area (S),
 pinching of the upper lip (Bsup),
 pinching of the lower lip (Binf).
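A minimal sketch of how three of these parameters could be computed once a lip contour has been segmented. The slides do not describe the actual image processing method, so the contour representation and the function name `lip_parameters` are assumptions; the pinching parameters (Bsup, Binf) are left out because their definitions are not given.

```python
def lip_parameters(contour):
    """Compute lip width (A), aperture (B) and area (S) from a closed contour.

    contour: list of (x, y) points ordered around the lip contour
    (an assumed representation, in pixel units).
    """
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    width = max(xs) - min(xs)       # A: horizontal extent
    aperture = max(ys) - min(ys)    # B: vertical extent

    # S: polygon area via the shoelace formula.
    twice_area = 0.0
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return {"A": width, "B": aperture, "S": abs(twice_area) / 2.0}
```

For a diamond-shaped contour `[(0, 1), (2, 0), (4, 1), (2, 2)]` this gives a width of 4, an aperture of 2, and an area of 4.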
 Concatenative feature fusion
 Tracks the finger landmarks and extracts their xy coordinates at each time frame,
 uses those values as features in the HMM modeling,
 uses the concatenation of the synchronous lip shape and hand features as the joint feature vector.
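The equation itself did not survive in the slide text. A common way to write concatenative feature fusion, with symbols assumed here rather than taken from the slides, is:

```latex
O_t^{LH} = \left[ O_t^{(L)\top},\; O_t^{(H)\top} \right]^{\top} \in \mathbb{R}^{D},
\qquad D = D_L + D_H
```

Here $O_t^{LH}$ is the joint lip-hand feature vector at time frame $t$, $O_t^{(L)}$ the lip shape feature vector of dimension $D_L$, $O_t^{(H)}$ the hand feature vector of dimension $D_H$, and $D$ the dimensionality of the joint feature vector.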
Terms of the fusion equation:
 the joint lip-hand feature vector,
 the lip shape feature vector,
 the hand feature vector,
 the dimensionality of the joint feature vector.
[Figure: parameters used for lip shape modeling]
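Concatenative fusion can be sketched in a few lines: for every time frame, the lip shape features and the hand features are joined into one vector that is then passed to the HMM. The function name `fuse_features` and the example values are hypothetical; the slides do not give the exact feature dimensions.

```python
def fuse_features(lip_frames, hand_frames):
    """Concatenate synchronous lip and hand feature vectors frame by frame.

    lip_frames, hand_frames: equal-length lists of per-frame feature lists.
    Returns one joint feature vector per frame.
    """
    if len(lip_frames) != len(hand_frames):
        raise ValueError("lip and hand streams must have the same number of frames")
    return [list(lip) + list(hand) for lip, hand in zip(lip_frames, hand_frames)]


# Hypothetical single frame: five lip parameters (A, B, S, Bsup, Binf)
# followed by the xy coordinates of one finger landmark.
joint = fuse_features([[4.0, 2.0, 4.0, 0.5, 0.3]], [[120.0, 85.0]])
```

The joint vector has as many dimensions as the two streams combined, which is why this scheme requires the streams to be synchronous.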
RESULTS 
 Isolated word recognition 
1. Recognition in normal-hearing subject
2. Recognition in deaf subject
3. Multi-speaker isolated word recognition
 Aim: to investigate whether it is possible to train speaker-independent HMMs for Cued Speech recognition.
 The training data consisted of 750 words from the normal-hearing subject and 750 words from the deaf subject.
 For testing, 700 words from the normal-hearing subject and 700 words from the deaf subject were used.
 Each state was modeled with a mixture of 4 Gaussian distributions.
 For lip shape and hand shape integration, concatenative feature fusion was used.
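To make the state model concrete, here is a sketch of the emission log-likelihood of one HMM state modeled as a Gaussian mixture. Diagonal covariances are an assumption (the slides only state that 4 Gaussians per state were used), and the function names are illustrative.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """Log-likelihood of x under a mixture of diagonal Gaussians.

    Uses the log-sum-exp trick so that small component likelihoods
    do not underflow.
    """
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```

In an HMM recognizer, this value would be evaluated for every state and every frame of the joint lip-hand feature sequence.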
4. Continuous phoneme recognition 
[Figure: phoneme correct for continuous phoneme recognition in the case of the normal-hearing subject]
[Figure: phoneme correct for continuous phoneme recognition in the case of the deaf subject]
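The slides report "phoneme correct" and "accuracy" without defining them. Assuming the conventional definitions used by many HMM toolkits, percent correct is (N - D - S) / N and accuracy is (N - D - S - I) / N, where N is the number of reference labels and S, D, I are substitutions, deletions and insertions. The sketch below counts them with a unit-cost edit-distance alignment, which is also an assumption.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference label
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis label
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (e, s, d, ins)              # match
            else:
                diagonal = (e + 1, s + 1, d, ins)      # substitution
            e, s, d, ins = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, inss = cost[n][m]
    return subs, dels, inss


def percent_correct(reference, hypothesis):
    subs, dels, _ = align_counts(reference, hypothesis)
    return 100.0 * (len(reference) - subs - dels) / len(reference)


def accuracy(reference, hypothesis):
    subs, dels, inss = align_counts(reference, hypothesis)
    return 100.0 * (len(reference) - subs - dels - inss) / len(reference)
```

Because accuracy also penalizes insertions, it is never higher than percent correct for the same output.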
CONCLUSION 
 Hand shapes and lip shapes were integrated using concatenative feature fusion, and HMM-based automatic recognition was conducted.
 For continuous phoneme recognition, 86% phoneme correct was achieved for the normal-hearing cuer and 82.7% for the deaf cuer.
 Isolated word recognition experiments in both normal-hearing and deaf subjects were also conducted, obtaining 94.9% and 89% accuracy, respectively.
 A multi-speaker experiment using data from both the normal-hearing and the deaf subject showed 89.6% word accuracy on average.
 This result indicates that training speaker-independent HMMs for Cued Speech using a large number of subjects should not face particular difficulties.
REFERENCES 
 G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
 S. Nakamura, K. Kumatani, and S. Tamura, “Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition,” in Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces (ICMI’02), p. 305, 2002.
 R. O. Cornett, “Cued speech,” American Annals of the Deaf, vol. 112, pp. 3–13, 1967.
 J. Leybaert, “Phonology acquired through the eyes and spelling in deaf children,” Journal of Experimental Child Psychology, vol. 75, pp. 291–318, 2000.
Thank you!
ANY QUESTIONS?
