Ming Rutar has shared 10 slides on Sign Language Recognition with Python. Sign Language Recognition uses computer vision to extract measurements from video of signing; a mathematical model then translates those measurements into words.
2. ASL Recognizer is a Udacity AI Course Project
Udacity is an online school founded by top AI gurus. http://www.udacity.com
[Diagram: zillions of ideas float in the academic world, but few make it into industry's cutting-edge technologies. Udacity bridges science/theory and technology/practice by teaching cutting-edge technologies with academic depth and hands-on practice.]
❖ A course lasts 3-6 months with 3-7 projects.
❖ The projects are product-like.
❖ Courses focus on core technologies and provide helpers for utility tasks, such as environment setup.
❖ Very active online communities; course instructors also participate.
❖ Student projects are reviewed by subject-matter experts.
❖ Graduates keep access to the course materials, which track technology trends and are updated accordingly.
❖ Affordable price.
3. The task
The overall goal of this project is to build a word recognizer for American Sign Language video sequences, demonstrating the power of probabilistic models. In particular, the project employs hidden Markov models (HMMs) to analyze a series of measurements taken from videos of American Sign Language (ASL) collected for research (see the RWTH-BOSTON-104 database). In the video, the right-hand x and y locations are plotted as the speaker signs the sentence. The raw data, train, and test sets are pre-defined; you will derive a variety of feature sets.
4. The Dataset
We recognize the meaning of ASL by watching the signer's hand movements, and the computer mimics us. Today's technology can tag video automatically, but that was not possible in the 1990s. The hand gesture data, such as the Cartesian coordinates of the left and right hands and of the nose (which serves as a reference point), are preprocessed (extracted from the video). After loading the data, the 'asl' dataframe looks like this:
[Table: the asl dataframe, one row per frame, with nose coordinates (nx, ny), left-hand coordinates (lx, ly), and right-hand coordinates (rx, ry).]
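For readers following along, this is roughly how the data gets loaded in the course project; a minimal sketch, assuming the project's AslDb helper from asl_data.py:
# AslDb is a helper class shipped with the Udacity project (asl_data.py);
# it wraps the preprocessed RWTH-BOSTON-104 coordinates in a DataFrame.
from asl_data import AslDb

asl = AslDb()  # builds the dataframe from the bundled CSV files

# Each row is one video frame, indexed by (video, frame), with the raw
# hand and nose coordinates plus the speaker as columns.
print(asl.df.head())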
5. More about the data
The training input file:
video,speaker,word,startframe,endframe
1,woman-1,JOHN,8,17
1,woman-1,WRITE,22,50
1,woman-1,HOMEWORK,51,77
3,woman-2,IX-1P,4,11
3,woman-2,SEE,12,20
3,woman-2,JOHN,20,31
3,woman-2,YESTERDAY,31,40
3,woman-2,IX,44,52
4,woman-1,JOHN,2,13
4,woman-1,IX-1P,13,18
4,woman-1,SEE,19,27
4,woman-1,IX,28,35
4,woman-1,YESTERDAY,36,47
5,woman-2,LOVE,12,21
The test input file:
video,speaker,word,startframe,endframe
2,woman-1,JOHN,7,20
2,woman-1,WRITE,23,36
2,woman-1,HOMEWORK,38,63
7,man-1,JOHN,22,39
7,man-1,CAN,42,47
7,man-1,GO,48,56
7,man-1,CAN,62,73
12,woman-2,JOHN,9,15
12,woman-2,CAN,19,24
12,woman-2,GO,25,34
12,woman-2,CAN,35,51
21,woman-2,JOHN,6,26
The training data contains 112 unique words; the test data contains 66 unique words. In the test data, we have 40 sentences made of 178 words.
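These counts are easy to reproduce with pandas; a quick sketch, assuming the split files are named train_words.csv and test_words.csv (an assumption based on the project layout):
import pandas as pd

# Hypothetical file names; columns are video,speaker,word,startframe,endframe
train = pd.read_csv('train_words.csv')
test = pd.read_csv('test_words.csv')

print(train['word'].nunique())             # unique words in the training data
print(test['word'].nunique())              # unique words in the test data
print(test['video'].nunique(), len(test))  # sentences (one per video) and word instances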
6. Feature Extraction
Features are the data we feed into the model. Feature selection is crucial to a model's success. Use common sense to select features. Examples:
[Plot: Feature_ground, the left- and right-hand x/y coordinates grounded relative to the nose (g-lx, g-ly, g-rx, g-ry).]
features_ground = ['grnd-rx', 'grnd-ry', 'grnd-lx', 'grnd-ly']
# Subtract the nose position so hand coordinates are relative to the reference point
asl.df['grnd-ly'] = asl.df['left-y'] - asl.df['nose-y']
asl.df['grnd-lx'] = asl.df['left-x'] - asl.df['nose-x']
...
[Plot: feature_polar, the hand positions in polar coordinates around the nose (lr, ltheta, rr, rtheta).]
features_polar = ['polar-rr', 'polar-rtheta', 'polar-lr', 'polar-ltheta']
# Radial distance of the right hand from the nose
asl.df['polar-rr'] = np.sqrt((asl.df['right-x'] - asl.df['nose-x'])**2 + (asl.df['right-y'] - asl.df['nose-y'])**2)
# Angle of the right hand around the nose (arctan2 of the x offset over the y offset)
asl.df['polar-rtheta'] = np.arctan2(asl.df['right-x'] - asl.df['nose-x'], asl.df['right-y'] - asl.df['nose-y'])
...
7. HMMLearn
hmmlearn is a library for unsupervised learning. HMM stands for Hidden Markov Model. Like a neural network, an HMM can be represented as a Bayesian network.
We use hmmlearn's GaussianHMM model class. The Gaussian curve is the famous bell curve. Below are the curves for the word 'Chocolate' with different numbers of hidden states.
● We instantiate the class with the number of hidden states, the number of iterations, and more; see the reference at http://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.GaussianHMM
● For training we call the method fit() and pass in the training data; it returns the model itself.
● For inference, we call the method score() with the word's data; it returns a float log-likelihood of the input (see the sketch below).
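A minimal sketch of that fit/score cycle, using dummy data in place of a real word's sequences (the variable names are illustrative):
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Illustrative stand-in for one word's training data: a list of
# (n_frames, n_features) arrays, one array per video clip of the word.
word_sequences = [np.random.rand(10, 4), np.random.rand(12, 4)]

# hmmlearn expects all frames stacked row-wise plus the per-sequence lengths.
X = np.vstack(word_sequences)
lengths = [len(seq) for seq in word_sequences]

model = GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
model = model.fit(X, lengths)    # fit() returns the model itself

logL = model.score(X, lengths)   # log-likelihood of the data under this model
print(logL)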
8. How do we do it
● We train the model one word at time with the training data.
● The words are encoded by associated with a unique integer, the word id
● A word has an associated list of feature set
● We train GaussianHMM model with a word feature set. Try with difference number of hidden states, then
select the best model for the word
● So after training, each word has a model.
● We test the models by building a recognizer that
○ Pick a feature and a model, test them with full sentences:
■ For each word in a sentence, ‘reading’ feature set
■ Pick the model with highest score model
■ From the model we find the word id
○ We decode the sequence of word id to a sentence
○ Company the synthesized sentence with the original sentence and get the Error Rate
● The criteria for passing the project is < 60 % error rate, or recognize 40+% words correctly
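A sketch of such a recognizer loop, assuming models maps each word to its trained GaussianHMM and test_sequences yields one (X, lengths) pair per test word (these names are illustrative, not the project's exact API):
import math

def recognize(models, test_sequences):
    """For each test sequence, score every word model and keep the best guess."""
    guesses = []
    for X, lengths in test_sequences:          # one entry per word instance in the test set
        best_word, best_logL = None, -math.inf
        for word, model in models.items():
            try:
                logL = model.score(X, lengths) # log-likelihood under this word's model
            except Exception:
                logL = -math.inf               # some models cannot score a given sequence
            if logL > best_logL:
                best_word, best_logL = word, logL
        guesses.append(best_word)              # the highest-scoring model wins
    return guesses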
9. Model Selection
The raw Gaussian model is a rough cut. In my test, it correctly recognized 58 words out of 178 (about a 67% error rate). We improve model selection by using two popular information criteria (a BIC selection sketch follows this list):
● Bayesian Information Criterion (BIC)
○ The purpose is to penalize models with more parameters, to prevent overfitting.
○ BIC = −2 log L + p log N
■ where L is the model's likelihood (the Gaussian score), p is the number of free parameters, and N is the number of data points (frames) for the word.
■ p is very magical!!!
■ To learn more, check this link: http://www2.imm.dtu.dk/courses/02433/doc/ch6_slides.pdf
● Discriminative Information Criterion (DIC)
○ DIC scores a model's ability to discriminate one word's training data from that of competing words.
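A sketch of BIC-based selection for one word; the free-parameter count assumes a diagonal-covariance GaussianHMM with n states and d features, so treat that formula as an assumption:
import numpy as np
from hmmlearn.hmm import GaussianHMM

def select_by_bic(X, lengths, min_states=2, max_states=15):
    """Fit models with different numbers of hidden states and keep the lowest BIC."""
    n_samples, d = X.shape
    best_model, best_bic = None, np.inf
    for n in range(min_states, max_states + 1):
        try:
            model = GaussianHMM(n_components=n, covariance_type='diag', n_iter=1000).fit(X, lengths)
            logL = model.score(X, lengths)
        except Exception:
            continue  # skip state counts the model cannot fit
        # Free parameters p: transitions n*(n-1), initial probs n-1,
        # means n*d, diagonal covariances n*d.
        p = n * (n - 1) + (n - 1) + 2 * n * d
        bic = -2 * logL + p * np.log(n_samples)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model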
10. Testing and Output
model_selector=SelectorBIC_orig, features=scale_podel
**** WER = 0.43258426966292135
Total correct: 101 out of 178
Video | Recognized | Correct (* marks a misrecognized word)
2 | JOHN WRITE HOMEWORK | JOHN WRITE HOMEWORK
7 | JOHN *HAVE GO *ARRIVE | JOHN CAN GO CAN
12 | JOHN *WHAT *GO1 CAN | JOHN CAN GO CAN
21 | JOHN FISH WONT *WHO BUT *CAR *CHICKEN CHICKEN | JOHN FISH WONT EAT BUT CAN EAT CHICKEN
25 | JOHN *TELL *LOVE *WHO IX | JOHN LIKE IX IX IX
28 | JOHN *WHO *WHO *WHO IX | JOHN LIKE IX IX IX
30 | JOHN *MARY *MARY *MARY *MARY | JOHN LIKE IX IX IX
36 | MARY VEGETABLE *GIRL *GIVE *MARY *MARY | MARY VEGETABLE KNOW IX LIKE CORN1
40 | JOHN *VISIT *CORN *JOHN *MARY | JOHN IX THINK MARY LOVE
43 | JOHN *SHOULD BUY HOUSE | JOHN MUST BUY HOUSE
50 | *JOHN *SEE BUY CAR SHOULD | FUTURE JOHN BUY CAR SHOULD
54 | JOHN *JOHN *MARY BUY HOUSE | JOHN SHOULD NOT BUY HOUSE
57 | JOHN *PREFER VISIT MARY | JOHN DECIDE VISIT MARY
67 | JOHN *YESTERDAY NOT BUY HOUSE | JOHN FUTURE NOT BUY HOUSE
71 | JOHN *FUTURE VISIT MARY | JOHN WILL VISIT MARY
74 | *IX *MARY *MARY MARY | JOHN NOT VISIT MARY
77 | *JOHN BLAME MARY | ANN BLAME MARY
11. The Results
features_customer2 is the winner: it is scaled Cartesian coordinates + time deltas.
By just scaling the values of features_podel, scale_podel outperforms features_podel, 101 vs. 89 words correctly recognized.
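The slides do not show the code for these custom features, so the following is only one plausible construction (the per-speaker min-max scaling and the delta definition are assumptions):
# Assumes the asl.df dataframe from earlier, indexed by (video, frame).
coords = ['right-x', 'right-y', 'left-x', 'left-y']
df = asl.df
for col in coords:
    grouped = df.groupby('speaker')[col]
    # Min-max scale each coordinate per speaker to remove body-size differences
    df['scale-' + col] = (df[col] - grouped.transform('min')) / \
                         (grouped.transform('max') - grouped.transform('min'))
    # Time delta: frame-to-frame change in the scaled coordinate, within each video
    df['delta-' + col] = df.groupby(level=0)['scale-' + col].diff().fillna(0)

features_custom2 = ['scale-' + c for c in coords] + ['delta-' + c for c in coords]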