"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Streams" Poster
1. ISOLATED SIGN RECOGNITION WITH A SIAMESE
NEURAL NETWORK OF RGB AND DEPTH STREAMS
TUR Anil Osman, YALIM KELES Hacer
Ankara University Computer Engineering Department
anilosmantur@gmail.com, hkeles@ankara.edu.tr
Contact Information
Anil Osman TUR Hacer YALIM KELES
e-mail: anilosmantur@gmail.com hkeles@ankara.edu.tr
LinkedIn: linkedin.com/in/anilosmantur linkedin.com/in/haceryalimkeles
Research Gate: researchgate.net/profile/Anil_Tur researchgate.net/profile/Hacer_Keles
GitHub: github.com/AnilOsmanTur
AU CVML LAB
Website: cvml.ankara.edu.tr
LinkedIn: linkedin.com/in/aucvmllab
GitHub: github.com/aucvmllab
References
[1] S. Escalera, X. Baró, J. González, M. A. Bautista, M. Madadi, M. Reyes, V. Ponce, H. J. Escalante, J. Shotton, I. Guyon, "ChaLearn Looking at People Challenge 2014: Dataset and Results". In: ECCV Workshops, 2014.
[2] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[3] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv preprint arXiv:1409.1556, 2014.
[4] I. Sutskever, O. Vinyals, Q. V. Le, "Sequence to Sequence Learning with Neural Networks". In: Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
[5] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv preprint arXiv:1412.3555, 2014.
[6] TensorFlow machine learning framework. https://www.tensorflow.org
[7] F. Chollet et al., "Keras". https://keras.io, 2015.
Abstract
Sign recognition is a challenging problem due to the high variance of the signs among different signers and the multiple modalities of the input information. In addition, the challenges that exist in action classification problems in computer vision, such as variations in illumination and background, apply in this domain as well. In this work, we propose a Siamese Neural Network (SNN) architecture that extracts features from the RGB and the depth streams of a sign frame in parallel. We use a pretrained model for the SNN without any fine-tuning on our training data. We then apply global feature pooling to the depth and color features that the SNN generates and feed the concatenation of the selected features to a recurrent neural network (RNN) to discriminate the signs. We trained our model parameters on the Montalbano dataset and achieved 93.19% test accuracy with the ResNet-50 and 91.61% with the VGG-16 network models.
Introduction
Motivation
• To solve communication problems between the deaf and the hearing communities.
• To build a human-machine interface that can be used to control machines with human gestures for other purposes.
Problem and challenges
Recognizing signs independently of each other.
• High variance of the signs among different signers, e.g. body and pose variations, variance in the duration of the signs, etc.
• Multiple modalities of the input information, with e.g. illumination changes, occlusion problems, etc.
Solution
1. To represent the inputs in a more effective feature space, we employ pretrained Convolutional Neural Networks (CNNs).
2. To classify the feature vectors generated by the CNNs, we need to interpret sequences, so we use Recurrent Neural Networks (RNNs), in particular the Long Short-Term Memory (LSTM) model.
3. To generalize across inputs and be robust to changes and variations, e.g. in lighting or in the person performing the sign, we use regularization methods.
Dataset
The Montalbano gesture dataset [1] is used in our experiments.
• Video samples are 640×480 pixels, recorded at 20 fps.
• 20 different Italian hand gestures performed by 27 different users.
• The dataset includes clothing, lighting, and background changes.
[Figure: sample frames from the RGB, depth, user-index, and skeletal modalities]
Preprocessing
• The RGB and depth inputs are cropped to 400×400 squares using the x-coordinate of the shoulder-center joint from the skeletal data.
• All samples are fixed to 40 frames by temporal sampling.
• A median filter is applied to both inputs.
• The user-index data is used as a mask on the depth input for background subtraction.
[Figures: frame cutting and sequence-length distribution]
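The steps above can be summarized in a short sketch. This is a minimal illustration assuming a NumPy/OpenCV implementation; the 640×480 input size, the 400×400 crop, and the 40-frame target come from the poster, while the function name, the 3×3 median kernel, and the choice of which 400 rows to keep are assumptions.

import numpy as np
import cv2

CROP, TARGET_FRAMES = 400, 40  # crop size and fixed sequence length (from the poster)

def preprocess(rgb, depth, user_index, shoulder_x):
    """rgb: (T, 480, 640, 3) uint8; depth, user_index: (T, 480, 640).

    shoulder_x is the x-coordinate of the shoulder-center joint
    taken from the skeletal data.
    """
    # Crop a 400x400 square horizontally centered on the shoulder-center
    # joint; keeping the lower 400 of the 480 rows is an assumption.
    x0 = int(np.clip(shoulder_x - CROP // 2, 0, 640 - CROP))
    rgb = rgb[:, 80:, x0:x0 + CROP]
    depth = depth[:, 80:, x0:x0 + CROP]
    mask = user_index[:, 80:, x0:x0 + CROP] > 0

    # Median-filter both modalities to suppress sensor noise.
    rgb = np.stack([cv2.medianBlur(np.ascontiguousarray(f), 3) for f in rgb])
    depth = np.stack([cv2.medianBlur(np.ascontiguousarray(f), 3) for f in depth])

    # Background subtraction: zero out depth pixels outside the user mask.
    depth = depth * mask

    # Fix every sample to 40 frames by uniform temporal sampling.
    idx = np.linspace(0, len(rgb) - 1, TARGET_FRAMES).astype(int)
    return rgb[idx], depth[idx]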
Model Architecture and Training
• The convolutional parts of the pretrained ResNet-50 [2] and VGG-16 [3] models are used.
• Global max pooling or global average pooling layers are applied to the outputs of the pretrained networks.
• The pooling layer outputs are connected to fully-connected (FC) layers.
• We experimented with ReLU, Sigmoid, and ReLU + Batch Normalization configurations for the FC layers.
• The RGB and depth outputs from the FC layers are concatenated and connected to an LSTM.
• The output of the LSTM is connected to a Softmax layer to classify the gestures.
• We used the Adam optimizer with a learning rate of 1e-3.
• We chose a batch size of 16.
• The pretrained models are used only as feature extractors; no fine-tuning is applied to them.
• We experimented with the L2 norm and Dropout as regularization methods.
• We chose a lambda constant of 0.2 for the L2 norm and a probability rate of 0.5 for Dropout; a sketch of this pipeline is given below.
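Under the TensorFlow/Keras stack cited in [6, 7], the pipeline could be sketched as follows. The frozen ResNet-50 backbone, the Adam optimizer with a 1e-3 learning rate, the 0.5 dropout rate, and the 0.2 L2 constant come from the poster; the FC width of 256, the LSTM size of 256, the 224×224 input resolution, and all names are illustrative assumptions, and the single-channel depth stream is assumed to be replicated to 3 channels to match the backbone input.

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

NUM_FRAMES, H, W, NUM_CLASSES = 40, 224, 224, 20

# Pretrained ResNet-50 [2] convolutional part, frozen (no fine-tuning).
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
backbone.trainable = False

def stream(name):
    """One branch of the Siamese pair: shared frozen CNN -> pooling -> FC."""
    inp = layers.Input(shape=(NUM_FRAMES, H, W, 3), name=name)
    x = layers.TimeDistributed(backbone)(inp)                       # per-frame CNN features
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)  # or GlobalMaxPooling2D
    x = layers.TimeDistributed(
        layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(0.2)))(x)   # FC, ReLU configuration
    return inp, x

rgb_in, rgb_feat = stream("rgb")        # both branches reuse the same backbone weights
depth_in, depth_feat = stream("depth")  # depth assumed replicated to 3 channels

# Concatenate the per-frame RGB and depth features and classify the sequence.
merged = layers.Concatenate()([rgb_feat, depth_feat])
seq = layers.LSTM(256, dropout=0.5)(merged)
out = layers.Dense(NUM_CLASSES, activation="softmax")(seq)

model = models.Model([rgb_in, depth_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

Training would then call model.fit on the preprocessed sequences with batch_size=16, per the poster.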
Results
Test accuracy (%): VGG-16 features with an LSTM
                     Global Average Pooling       Global Max Pooling
                     ReLU   Sigmoid  ReLU+BN      ReLU   Sigmoid  ReLU+BN
No Regularization    87.27  88.15    87.56        85.45  83.32    85.39
L2                   88.35  86.57    84.60        87.46  85.78    84.01
Dropout              89.24  89.14    87.86         9.08  86.28    88.25
Dropout + L2         89.73  88.25    87.86         5.63  88.55    85.88
(ReLU+BN = ReLU + Batch Normalization)
Test accuracy (%): VGG-16 features with a GRU
                     Global Average Pooling       Global Max Pooling
                     ReLU   Sigmoid  ReLU+BN      ReLU   Sigmoid  ReLU+BN
No Regularization    89.63  87.07    85.39        82.43  87.96    84.50
L2                   86.48  81.44    68.41        87.46  81.34    54.59
Dropout              91.51  90.82    89.34        81.84  90.03    89.24
Dropout + L2         91.61  87.27    87.46        86.38  87.46    79.76
Test accuracy (%): ResNet-50 features with an LSTM (— = not available)
                     Global Average Pooling       Global Max Pooling
                     ReLU   Sigmoid  ReLU+BN      ReLU   Sigmoid  ReLU+BN
No Regularization    85.49  84.70    87.36        87.86  79.96    90.03
L2                   87.17  86.97    85.88        86.08  87.46    —
Dropout              89.34  89.34     6.12         6.12  93.19    —
Dropout + L2         90.92  54.89    89.04        —      —        —
Test accuracy (%): ResNet-50 features with a GRU (— = not available)
                     Global Average Pooling       Global Max Pooling
                     ReLU   Sigmoid  ReLU+BN      ReLU   Sigmoid  ReLU+BN
No Regularization    89.04  88.15    85.59        90.03  86.87    90.92
L2                   85.19  80.75    79.17        82.43  67.82    —
Dropout              90.92  85.09    27.34        91.91  —        —
Dropout + L2         89.26  89.63    82.92        81.54  —        —
• The pretrained ResNet-50 and VGG-16 networks are used as feature extractors.
• We obtained the best results, i.e. 93.19% accuracy, using ResNet-50 with an LSTM.
• We did not apply hand or face segmentation to the inputs.
• We proposed a simple yet effective architecture.
• We observed that when the LSTM model starts to memorize the training data, the GRU model resolves this memorization problem, as sketched below.
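In the GRU [5] experiments only the recurrent layer changes relative to the architecture sketch above. A minimal sketch of the GRU classifier head, assuming concatenated per-frame RGB and depth features as input; the feature width of 512 and the GRU size of 256 are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, FEAT_DIM, NUM_CLASSES = 40, 512, 20

feats = layers.Input(shape=(NUM_FRAMES, FEAT_DIM))  # concatenated RGB + depth features
seq = layers.GRU(256, dropout=0.5)(feats)           # GRU in place of the LSTM
out = layers.Dense(NUM_CLASSES, activation="softmax")(seq)
gru_head = models.Model(feats, out)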
Future Work
In this research we wanted to create a baseline classifier for the sign recognition task. As future work, we are planning to use it with our Turkish Sign Language (TSL) dataset.
• 228 words were recorded with general use and challenging factors in mind.
• We recorded our videos in 5 different modalities (i.e. HD RGB, depth, infrared, skeletal, user mask) with a Microsoft Kinect v2.
• Our signers consist of 6 sign language mentors, one deaf person, and 5 trained signers: 12 people in total.
• Each sign is recorded with 10 repetitions; the professional signers also provided 10 repetitions wearing black clothes.
• 228 words × ~150 samples ≈ 34,200 sample videos.
The signs in the TSL corpus are divided into 7 categories:
• Group 1 has no occlusion, crossing, or contact with other body parts; it includes 63 words.
• Group 2: the hands can occlude each other, or contact can occur between the hands; it includes 52 words.
• Group 3: the hands can occlude the face of the signer or can touch it; it includes 58 words.
• Group 4 contains crossing hands, and occlusions can occur; it includes 14 words.
• Group 5: depth information is essential; it includes 22 words.
• Group 6: compound words, consisting of words that contain more than one sign; it includes 19 words.
• Group 7: similar signs; it is especially challenging because of the similar sign patterns.
[Figure: sample frames from the TSL dataset]
Acknowledgement
The research presented is part of a project funded
by TUBITAK (The Scientific and Technological
Research Council of Turkey) under grant number
217E022.