Speech To Sign Language Interpreter System
Speech To Sign Language Interpreter System: Presentation Transcript

  • Speech to Sign Language Interpreter System
    By: Khalid El-Darymli (G0327887)
    Supervisor: Dr. Othman O. Khalifa
    International Islamic University Malaysia, Kulliyyah of Engineering, ECE Dept.
  • OUTLINE
    • Problem statement.
    • Research goal and objectives.
    • Main parts of our system.
    • The structure of ASR:
      • SP,
      • Training: AM, Dictionary and LM,
      • and Decoding: the Viterbi beam search.
    • Sign Language, ASL and ASL alphabets.
    • Signed English.
    • Demo of ASL in our SW.
    • Milestone.
  • Problem Statement
    • There is no free, or even reasonably priced, software to convert speech into sign language in live mode.
    • Only one commercial software package is available to convert uttered speech into video sign language in live mode.
    • This software is called iCommunicator, and to purchase it a deaf person has to pay USD 6,499!
    IS IT FAIR?
  • RESEARCH GOAL AND OBJECTIVES
    • Design and implementation of a Speech to Sign Language Interpreter System.
    • The SW is open source and freely available, which in turn will benefit the deaf community.
    • To fill the gap between deaf and non-deaf people in two senses: first, by using this SW for educational purposes for deaf people, and second, by facilitating communication between deaf and non-deaf people.
    • To increase independence and self-confidence of the deaf person.
    • To increase opportunities for advancement and success in education, employment, personal relationships, and public access venues.
    • To improve quality of life.
  • Main Parts of the Speech to Sign Language Interpreter System: Continuous input speech is fed to the Speech-Recognition Engine, which produces recognized text; the recognized text is then translated into ASL using a database of pre-recorded ASL video clips.
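As a rough illustration of this two-stage pipeline, the following sketch wires a recognizer to an ASL clip lookup. It is only an assumed outline: the names recognize_speech and AslClipDatabase are hypothetical placeholders, not the project's actual API.

```python
# Hypothetical sketch of the two-stage pipeline: speech recognition, then ASL translation.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AslClipDatabase:
    """Maps lower-case English words to file paths of pre-recorded ASL video clips."""
    clips: dict

    def lookup(self, word: str) -> Optional[str]:
        return self.clips.get(word.lower())


def recognize_speech(audio) -> List[str]:
    """Placeholder for the SR engine: continuous input speech in, recognized words out."""
    raise NotImplementedError


def interpret(audio, db: AslClipDatabase) -> List[str]:
    """Recognize speech, then map each word to an ASL clip (or mark it for fingerspelling)."""
    words = recognize_speech(audio)
    return [db.lookup(w) or f"<fingerspell:{w}>" for w in words]
```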
  • Automatic Speech Recognition ( ASR ):
    • SR systems are classified according to three categories: isolated vs. continuous, speaker-dependent vs. speaker-independent, and small vs. large vocabulary.
    • The task expected of our software entails a large-vocabulary, speaker-independent, continuous speech recognizer.
    (Block diagram: Input Voice → SR Engine → Recognized Text.)
  • The Structure of the SR Engine (LVCSR): Signal processing converts the input audio into a feature sequence X = {x_1, x_2, ..., x_T}. Training produces the acoustic model (AM) P(A_1, ..., A_T | P_1, ..., P_k), the dictionary P(P_1, P_2, ..., P_k | W), and the language model (LM) P(W_n | W_1, ..., W_{n-1}). Decoding evaluates the hypotheses H = {W_1, W_2, ..., W_k} with the score P(X | W) · P(W) and outputs the best hypothesis W_BEST.
  • SIGNAL PROCESSING (FRONT-END): The speech waveform x[n] (16-bit integer data) is pre-emphasized to give y[n], framed to give y_t'[n], and windowed to give y_t[n]; the power spectrum S_t[k] is then computed and passed through the mel filterbank to give S_t[m], followed by ln|·|^2 and an IDFT, yielding per frame 13 cepstral coefficients c_t[n], 13 delta coefficients Δc_t[n], and 13 double-delta coefficients ΔΔc_t[n]. In the pre-emphasis step, α is the pre-emphasis parameter.
    • MFCC computation: The MFCC is a representation defined as the real cepstrum of a windowed short-time signal derived from the FFT of that signal. MFCC computation consists of performing the inverse DFT on the logarithm of the magnitude of the filterbank output. Typically, for speech recognition only the first 13 coefficients are used.
    • Framing and windowing: The typical frame duration in speech recognition is 10 ms, while the typical window duration is 25 ms.
    • The mel filterbank: It is used to extract spectral features of speech by properly integrating the spectrum over defined frequency ranges. The weighting is given by the transfer functions of the triangular mel-weighting filters H_m[k]. The mel-spectrum of the power spectrum is computed by summing the power spectrum weighted by H_m[k], where k is the DFT domain index, N is the length of the DFT, and M is the total number of triangular mel-weighting filters.
    • Power spectrum: The STFT is calculated; to reduce computational complexity, it is evaluated only at a discrete number of frequencies ω = 2πk/N, and the DFT of all frames of the signal is obtained. The phase information of the DFT samples of each frame is discarded, and the final output of this stage is the power spectrum of each frame.
    • Delta and double-delta computation: First- and second-order differences may be used to capture the dynamic evolution of the signal; the first- and second-order delta MFCCs are computed from differences of the static MFCCs. The final output of the front-end processing comprises a 39-dimensional feature vector (observation vector x_t) per processed frame (a minimal code sketch of the whole front-end follows below).
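To make the front-end concrete, here is a minimal NumPy sketch of the steps above (pre-emphasis, framing, Hamming windowing, power spectrum, mel filterbank, log, DCT, and first-order deltas). It is an illustrative approximation rather than the actual Sphinx front-end; parameter values such as the 0.97 pre-emphasis coefficient and the 26 mel filters are common defaults assumed here, not taken from the slides.

```python
import numpy as np

def mfcc_frames(x, fs=16000, frame_ms=10, win_ms=25, n_fft=512,
                n_filt=26, n_ceps=13, alpha=0.97):
    """Illustrative MFCC front-end: one 13-coefficient vector per 10 ms frame."""
    x = np.asarray(x, dtype=float)
    shift, win = int(fs * frame_ms / 1000), int(fs * win_ms / 1000)

    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    y = np.pad(y, (0, max(0, win - len(y))))  # ensure at least one full window

    # Framing (10 ms shift) and Hamming windowing (25 ms window)
    n_frames = 1 + (len(y) - win) // shift
    frames = np.stack([y[i * shift: i * shift + win] for i in range(n_frames)])
    frames *= np.hamming(win)

    # Power spectrum (the phase of the DFT samples is discarded)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank between 0 Hz and fs/2
    mel = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_filt + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies, then a DCT (real cepstrum); keep the first 13 coefficients
    logmel = np.log(np.maximum(power @ fbank.T, 1e-10))
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return logmel @ dct.T

def delta(c):
    """First-order differences; applying this twice gives the double deltas (39 features total)."""
    return np.vstack([np.zeros_like(c[:1]), c[1:] - c[:-1]])
```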
  • Explanatory example: speech waveform of the phoneme "ae", shown after pre-emphasis and Hamming windowing, together with its power spectrum and MFCCs (figures).
  • TRAINING
    • Acoustic Model (AM):
    • The AM provides a mapping between a unit of speech and an HMM that can be scored against incoming features provided by the Front-End.
    • It contains a pool of Hidden Markov Models (HMMs).
    • For large vocabularies, each word is represented as a sequence of phonemes; accordingly, there has to be an AM for each phoneme. Moreover, the AM has to depend on the context (e.g., co-articulation), and the context dependence may even cross word boundaries.
    • Phones are then further refined into context-dependent triphones, i.e., phones occurring in given left and right phonetic contexts.
    • Training is the process of learning the AM, the dictionary, and the LM.
  • HMM s
    • An HMM is defined by the model parameters λ = (A, B, π).
    • For each acoustic segment (state i), there is a probability distribution b_i(k) over the acoustic observations.
    • The leading technique is to represent the acoustic observation distribution as a mixture of Gaussians, or Gaussian Mixtures (GM) for short.
    (Diagram: a left-to-right HMM with states S_0, S_1, S_2, S_3, self-loop probabilities a_00, a_11, a_22, and output distributions b_0(k), b_1(k), b_2(k).)
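As a small illustration of how an observation is scored against a GM output distribution b_i(k), the sketch below computes the log-likelihood of a feature vector under a diagonal-covariance Gaussian mixture. This is a generic textbook computation, not Sphinx's internal scoring code.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log b_i(x) for a diagonal-covariance Gaussian mixture.

    x:         (D,) feature vector (e.g., the 39-dimensional MFCC vector)
    weights:   (K,) mixture weights summing to 1
    means:     (K, D) component means
    variances: (K, D) component variances (diagonal covariances)
    """
    diff = x - means                                           # (K, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances) + diff ** 2 / variances, axis=1))
    m = np.max(log_comp)                                       # log-sum-exp for stability
    return m + np.log(np.sum(np.exp(log_comp - m)))
```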
  • Dictionary :
    • The dictionary is a file that contains pronunciations for all the words of interest to the decoder.
    • For large-vocabulary speech recognizers, pronunciations are specified as a linear sequence of phonemes.
    • Some digit pronunciations:
      ZERO → Z IH R O        EIGHT → EY TD
    • Multiple pronunciations:
      ACTUALLY → AE K CH AX W AX L IY
      ACTUALLY(2nd) → AE K SH AX L IY
      ACTUALLY(3rd) → AE K SH L IY
    • Compound words:
      WANT_TO → W AA N AX
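A minimal sketch of loading such a pronunciation dictionary into a word-to-phonemes map is shown below. It assumes a simple "WORD PH1 PH2 ..." entry per line, with alternate pronunciations marked as WORD(2nd) or WORD(3rd) as in the examples above; the real dictionary format used by the decoder may differ.

```python
from collections import defaultdict

def load_dictionary(path):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    prons = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            word, phones = parts[0], parts[1:]
            base = word.split("(")[0]   # strip alternate markers such as "(2nd)"
            prons[base].append(phones)
    return prons

# Example: prons["ACTUALLY"] would then hold all three pronunciation variants.
```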
  • Language Model (LM):
    • A statistical LM is used, since the speaker could be talking about any arbitrary topic.
    • The main model used is n-gram statistics, in particular the trigram (n = 3): P(W_t | W_{t-1}, W_{t-2}).
    • Bigram and unigram LMs have to be employed as well.
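The toy sketch below illustrates the idea of a count-based trigram LM that falls back to bigram and then unigram estimates when a trigram is unseen. Practical LMs use smoothed and discounted probabilities (e.g., stored in ARPA format), so this is only an illustration of the P(W_t | W_{t-1}, W_{t-2}) statistic, not the system's actual LM.

```python
from collections import Counter

def train_ngrams(sentences):
    """Count unigrams, bigrams, and trigrams over tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        w = ["<s>", "<s>"] + list(words) + ["</s>"]
        uni.update(w)
        bi.update(zip(w, w[1:]))
        tri.update(zip(w, w[1:], w[2:]))
    return uni, bi, tri

def p_trigram(w2, w1, w, uni, bi, tri):
    """P(w | w2, w1) with a naive fallback to the bigram and then the unigram estimate."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return bi[(w1, w)] / uni[w1]
    return uni[w] / max(sum(uni.values()), 1)
```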
  • RECOGNITION
    • Given an input speech utterance, the goal is to UNVEIL the BEST hidden state sequence.
    • Let S = (s_1, s_2, ..., s_T) be the sequence of states that are recognized, and let x_t be the feature vector computed at time t, where the feature sequence from time 1 to t is denoted X = (x_1, x_2, ..., x_t).
    • Accordingly, the sequence of recognized states S* is obtained by S* = argmax_S P(S, X | λ).
    (Diagram: the decoder's dynamic structure (the search algorithm) queries the static structure (the models) for the scores P(x_t, {s_t} | {s_{t-1}}, λ), expanding the active states {s_{t-1}} to {s_t} for each frame x_t and returning S*.)
  • The Viterbi Beam Search
    • Initialization: for each state, initialize V_1(i) from the initial state probabilities and the first observation, then go to the pruning step (XX).
    • Recursive step: for each frame t and each state j reachable from the surviving states, compute V_t(j) = max_i [V_{t-1}(i) a_ij] b_j(x_t) and record the best predecessor, then go to the pruning step (XX).
    • Pruning step (XX): find p_t(s_t*) = Max[V_t(i)] over all states i; calculate the beam threshold from this maximum; for each state j, if p_t(s_t = j) exceeds the threshold, MEMORIZE both V_t(j) and path "j", else DISCARD V_t(j).
    • Backtracking: after the last frame, return the best-scoring final state and trace the memorized paths back to recover the best state sequence (see the code sketch below).
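A compact sketch of Viterbi decoding with beam pruning is given below, working in log-probabilities with a fixed relative beam width. It illustrates the memorize/discard logic described above; the actual Sphinx decoder's search structures are considerably more involved.

```python
import numpy as np

def viterbi_beam(log_pi, log_A, log_b, beam=10.0):
    """Viterbi search with beam pruning.

    log_pi: (N,)   initial log state probabilities
    log_A:  (N, N) log transition probabilities a_ij
    log_b:  (T, N) log observation scores b_j(x_t) for each frame
    beam:   states scoring more than `beam` below the frame maximum are discarded
    """
    T, N = log_b.shape
    V = log_pi + log_b[0]                                   # V_1(i)
    back = np.zeros((T, N), dtype=int)
    active = np.flatnonzero(V >= V.max() - beam)            # initial pruning

    for t in range(1, T):
        scores = V[active, None] + log_A[active]            # (|active|, N)
        best = np.argmax(scores, axis=0)                    # best surviving predecessor of j
        V = scores[best, np.arange(N)] + log_b[t]           # V_t(j)
        back[t] = active[best]                              # memorize path "j"
        active = np.flatnonzero(V >= V.max() - beam)        # discard states outside the beam

    # Backtracking: trace the memorized predecessors from the best final state
    path = [int(np.argmax(V))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(V.max())
```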
  • SIGN LANGUAGE
    • Sign Language is a communication system using gestures that are interpreted visually.
    • As a whole, sign languages share the same modality, the sign, but they differ from country to country.
  • AMERICAN SIGN LANGUAGE ( ASL )
    • ASL is the dominant sign language in the US, anglophone Canada and parts of Mexico.
    • Currently, approximately 450,000 deaf people in the United States use ASL as their primary language.
    • ASL signs follow a certain order, just as words do in spoken English. However, in ASL one sign can express a meaning that would require the use of several words in speech.
    • The grammar of ASL uses spatial locations, motion, and context to indicate syntax.
  • ASL ALPHABETS
    • It is a manual alphabet representing all the letters of the English alphabet, using only the hands.
    • Making words using a manual alphabet is called fingerspelling .
    • Manual alphabets are a part of sign languages.
    • For ASL, the one-handed manual alphabet is used.
    • Fingerspelling is used to complement the vocabulary of ASL when spelling individual letters of a word is the preferred or only option, such as with proper names or the titles of works.
    (Chart: the ASL one-handed manual alphabet for the letters A through Z.)
  • SIGNED ENGLISH ( SE ):
    • SE is a reasonable manual parallel to English.
    • The idea behind SE and other signing systems parallel to English is that deaf people will learn English better if they are exposed, visually through signs, to the grammatical features of English.
    • SE uses two kinds of gestures: sign words and sign markers .
    • Each sign word stands for a separate entry in a Standard English dictionary.
    • The sign words are signed in the same order as words appear in an English sentence. Sign words are presented in singular, non-past form.
    • Sign markers are added to these basic signs to show, for example, that you are talking about more than one thing or that some thing has happened in the past.
    • When these do not represent the intended word, the manual alphabet can be used to fingerspell the word.
    • Most signs in SE are taken from American Sign Language, but they are used in the same order as English words and with the same meanings.
  • ASL vs. SE (an example): for the sentence "It is alright if you have a lot", the SE translation signs every English word in order: IT IS ALL RIGHT IF YOU HAVE A LOT; the corresponding ASL translation is shown alongside in the slide.
  • DEMONSTRATION OF THE ASL IN OUR SW: The database holds 2,600 pre-recorded ASL video clips. For each recognized word (the SR engine's output), if the word is non-basic, the basic word is first extracted from it. The system then checks whether the basic word is within the ASL database vocabulary. If yes, the equivalent ASL video clip of the word is output, with a suitable marker appended when the input word was non-basic. If none of the database contents match the basic word, the original input word is fingerspelled using the American Manual Alphabet.
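A small sketch of this lookup-or-fingerspell decision is shown below. The clip_db and letter_db dictionaries and the stem function are hypothetical placeholders standing in for the database of 2,600 ASL clips, the American Manual Alphabet clips, and the basic-word extraction step.

```python
def translate_word(word, clip_db, letter_db, stem):
    """Return the list of video clips to play for one recognized word.

    clip_db:   dict mapping basic words to ASL clip file names
    letter_db: dict mapping letters A-Z to manual-alphabet clip file names
    stem:      function extracting the basic word (e.g., "played" -> "play")
    """
    basic = stem(word)
    clip = clip_db.get(basic.lower())
    if clip is not None:
        clips = [clip]
        if basic.lower() != word.lower():
            clips.append("marker:" + word.lower())   # append a suitable sign marker
        return clips
    # No match in the database: fingerspell the original input word
    return [letter_db[ch] for ch in word.upper() if ch in letter_db]
```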
  • Speech to Sign Language Interpreter System - MILESTONE
    • Thesis writing outline and progress (% drafted): Chapter 1: Introduction; Chapter 2: State-of-the-Art of SR; Chapter 3: Sphinx SR; Chapter 4: Sphinx Decoder; Chapter 5: Sign Language; Chapter 6: SW Demo, Conclusions and Further Work; Appendices.
    • SW development and progress (% completed): SR Engine; ASL Database; Overall Integrated SW.
  • Thank You
    • Your questions are most welcome.