Machine translation and computer vision have benefited greatly from advances in deep learning. Large and diverse collections of textual and visual data have been used to train neural networks, in both supervised and self-supervised settings. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, such as scarce video resources, limitations in hand pose estimation, and 3D spatial grounding from poses.
6. Classic Motivation: Accessibility to basic services
“World Report on Hearing”. World Health Organization, 2021.
https://whereistheinterpreter.com/
#whereistheinterpreter
14. A crash course on Sign Languages (SL)
Cultural diversity of sign languages, similar to spoken languages
○ American (ASL), British (BSL), German (DGS), Chinese (CSL)… sign languages.
Irish Sign Language (ISL) Catalan Sign Language (LSC)
15. A crash course on Sign Languages (SL)
Sign languages are NOT a one-to-one mapping from spoken languages.
Look-Up Table
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Sign Language (video)
16. A crash course on Sign Languages (SL)
There exists a textual transcription method named “glosses”.
Sign Language (transcription): HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
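A gloss line like the one above can be handled as a plain token sequence. The sketch below is only illustrative: it assumes that an "FS-" (or "fs-") prefix marks a fingerspelled item, as the examples in these slides suggest, and the names `GlossToken` and `parse_gloss` are hypothetical helpers, not part of any dataset tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GlossToken:
    text: str            # gloss label in upper case, without the FS- prefix
    fingerspelled: bool  # True when the token was marked as fingerspelled


def parse_gloss(line: str) -> List[GlossToken]:
    """Split a gloss transcription into tokens (assumed convention: FS- = fingerspelling)."""
    tokens = []
    for raw in line.replace(",", " ").split():
        upper = raw.upper().rstrip(".")
        if upper.startswith("FS-"):
            # Letters may be written joined (FS-AMELIA) or hyphenated (fs-A-M-I-R).
            tokens.append(GlossToken(upper[3:].replace("-", ""), True))
        else:
            tokens.append(GlossToken(upper, False))
    return tokens


tokens = parse_gloss("HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR")
for token in tokens:
    print(token.text, "(fingerspelled)" if token.fingerspelled else "")
```

Real gloss annotation schemes carry more information (for example non-manual markers), so this tokenizer should be read as a starting point only.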
17. A crash course on Sign Languages (SL)
● Manual features:
○ Handshape
○ Palm orientation
○ Movement
○ Location
● Non-manual features:
○ Head (nod / shake / tilt)
○ Mouth
○ Eyebrows
○ Cheeks
○ Facial grammar (or expressions)
○ Body position
Stokoe Jr, William C. "Sign language structure: An outline of the visual communication systems of the American deaf." Journal of
deaf studies and deaf education (2005).
Figure: Arizona State University
18. A crash course on Sign Languages (SL)
SLs use persistent spatial grounding (e.g. by pointing & placing)!
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Right along here…” ...immobile entity is located here,
19. A crash course on Sign Languages (SL)
SLs use persistent spatial grounding (e.g. by pointing & placing)!
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Not far and to the right of…” ...tall, vertical entity at this place.
21. Sign-to-Spoken Language Tasks
● SL Translation: Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
● Continuous SL Recognition: HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
● Isolated SL Recognition: “I”
● Finger-spelling: A, B, C, D...
Figure: GIPHY/SIGNN WITH ROBERT
23. Sign-Spoken Language Tasks
SL Production
SL Translation
Sign Language (video)
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
24. Neural Machine Translation
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NeurIPS 2014.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase
representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Encoder → Representation → Decoder
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Dia duit, is mise Amelia agus beidh mé ag caint leat faoi conas guma a bhaint de ghruaig.
25. Automatic Speech Recognition (ASR)
Encoder → Representation → Decoder
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." ICML 2014.
#LAS Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary conversational speech
recognition." ICASSP 2016.
26. Image Captioning
Encoder → Representation → Decoder
A group of people shopping at an outdoor market.
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
27. Neural Sign Language Translation
Encoder → Representation → Decoder
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
28. Neural Sign Language Translation
Camgoz, Necati Cihan, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden.
"Neural sign language translation." CVPR 2018.
29. Neural Sign Language Translation
Camgoz, Necati Cihan, Oscar Koller, Simon Hadfield, and Richard Bowden. "Sign language
transformers: Joint end-to-end sign language recognition and translation." CVPR 2020.
30. Neural Sign Language Production
Encoder → Representation → Decoder
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
31. Neural Sign Language Production
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Mixed SIGNals: Sign Language Production via
a Mixture of Motion Primitives." ICCV 2021.
32. Neural Sign Language Production
Encoder → Representation → Decoder
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
33. Neural Sign Language Production
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end
sign language production." ECCV 2020.
34. Neural Sign Language Production
Stoll, Stephanie, Necati Cihan Camgoz, Simon Hadfield, and Richard Bowden. "Text2Sign: Towards sign
language production using neural machine translation and generative adversarial networks." IJCV 2020.
35. Neural Sign Language Production
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Signing at Scale: Learning to Co-Articulate
Signs for Large-Scale Photo-Realistic Sign Language Production." CVPR 2022.
39. The How2Sign dataset
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
40. The How2Sign dataset
● How2 dataset [1]: Instructional videos, Speech Signal, English Transcription
● English Transcription: Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
● Gloss Annotation: HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
● Multi-view RGB videos, RGB-D videos
● Body-face-hands keypoints: 2D keypoints estimation from OpenPose [2]
● Multi-view recordings (only for a subset): Multi-view VGA and HD videos [3], 3D keypoints estimation
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
41. Continuous Sign Language Datasets
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
42. The largest dataset in ASL
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
43. Built on top of How2
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
44. Built on top of How2
Spoken Language
(speech)
SL Production
SL Translation
Sign Language
(video)
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Synthesis
ASR
#How2 Sanabria, Ramon, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. "How2: a large-scale dataset for
multimodal language understanding." arXiv 2018.
45. Built on top of How2
How2 dataset [1]
Speech Signal
English Transcription
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Instructional videos
English Speech
Speech track available for end-to-end English to ASL.
English Transcriptions
Automatically generated subtitles aligned at the sentence level.
English to Brazilian Portuguese Translations
Allows multilingual research.
#How2 Sanabria, Ramon, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. "How2: a large-scale dataset for
multimodal language understanding." arXiv 2018.
48. Green Studio
Multi-view RGB videos
RGB-D videos
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic Studio: A massively multiview system for social motion capture. In: ICCV, 2015.
Panoptic Studio
Multi-view recordings (only for a subset)
Multi-view VGA and HD videos
49. 2D & 3D pose estimation
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.
50. 2D & 3D pose estimation
Multi-view RGB videos
Body-face-hands keypoints
2D keypoints estimation from OpenPose [1]
Multi-view recordings (only for a subset)
3D keypoints estimation [2]
[1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei and Y. A. Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" in TPAMI, 2019.
[2] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic Studio: A massively multiview system for social motion capture. In: ICCV, 2015.
Multi-view VGA and HD videos
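As a rough illustration of what 2D body-face-hands keypoints look like in practice, the sketch below reads one frame in the JSON layout that OpenPose writes (a "people" list whose entries hold flat x, y, confidence arrays such as "pose_keypoints_2d"). Whether the keypoint files distributed with How2Sign follow exactly this layout is an assumption here, and `parse_frame` and `to_triplets` are hypothetical helper names.

```python
import json
from typing import Dict, List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)

# Keys used by OpenPose for body, face and hand keypoints of each detected person.
PARTS = (
    "pose_keypoints_2d",
    "face_keypoints_2d",
    "hand_left_keypoints_2d",
    "hand_right_keypoints_2d",
)


def to_triplets(flat: Sequence[float]) -> List[Keypoint]:
    """Reshape a flat [x0, y0, c0, x1, y1, c1, ...] list into (x, y, c) tuples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoint list length must be a multiple of 3")
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]


def parse_frame(json_text: str) -> List[Dict[str, List[Keypoint]]]:
    """Return one dict per detected person, mapping part name to keypoints."""
    frame = json.loads(json_text)
    return [
        {part: to_triplets(person.get(part, [])) for part in PARTS}
        for person in frame.get("people", [])
    ]


# Tiny synthetic frame (made-up numbers) standing in for a real output file.
example = json.dumps({
    "people": [{
        "pose_keypoints_2d": [10.0, 20.0, 0.9, 30.0, 40.0, 0.0],
        "hand_left_keypoints_2d": [1.0, 2.0, 0.5],
    }]
})
people = parse_frame(example)
print(people[0]["pose_keypoints_2d"])
```

A confidence of 0.0 typically signals a keypoint that was not detected, which is common for hands in signing videos and worth filtering before any downstream use.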
55. Outline
Motivation
A crash course on sign languages (SL)
State of the art
How2Sign dataset
Application: Human motion transfer
Challenges
Conclusion
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign language video
generation from 2D poses." ECCV 2020 SLRTP Workshop.
56. Application: Human motion transfer
2D Pose estimation [OpenPose]
GAN-generated [Everybody Dance Now]
59. Can ASL signers understand our generated videos?
“Choose one category”
Classification accuracy: Skeleton vs. GAN-generated
60. Can ASL signers understand our generated videos?
“How well could you understand the video?”
Mean Opinion Score: Skeleton vs. GAN-generated
61. Can ASL signers understand our generated videos?
“Translate the ASL signs into written English.”
Skeleton vs. GAN-generated
62. Outline
Motivation
A crash course on sign languages (SL)
State of the art
How2Sign dataset
Application: Sign Language Video Retrieval
Challenges
Conclusion
Duarte, Amanda, Samuel Albanie, Xavier Giró-i-Nieto, and Gül Varol. "Sign Language Video Retrieval with Free-Form
Textual Queries." arXiv preprint arXiv:2201.02495 (2022).
63. Sign Language Video Retrieval
Encoder → Representation ← Encoder
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
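With two encoders mapping text and sign video into a shared representation, retrieval reduces to ranking videos by similarity to the query embedding. The following is a minimal sketch under that assumption: the embeddings are made-up three-dimensional vectors, cosine similarity is one common choice of score rather than necessarily the one used in the cited work, and `rank_videos` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(
    query_emb: Sequence[float],
    video_embs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (video id, score) pairs, best match first."""
    scored = [(vid, cosine(query_emb, emb)) for vid, emb in video_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for the outputs of the text and video encoders.
video_embs = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]
ranking = rank_videos(query, video_embs, top_k=2)
print(ranking)
```

In a real system the embeddings would come from trained encoders and the ranking would usually be computed with batched matrix operations rather than Python loops.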
65. Sign Language Video Retrieval
Challenge: How to train without annotated datasets for continuous SL?
Approach: Produce pseudo-annotations from How2 + How2Sign.
66. Sign Spotting: Mouthing (M)
Albanie, S., Varol, G., Momeni, L., Afouras, T., Chung, J. S., Fox, N., & Zisserman, A. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. ECCV 2020.
67. Sign Spotting: Visual Dictionaries (D)
Momeni, L., Varol, G., Albanie, S., Afouras, T., & Zisserman, A. Watch, read and lookup: learning to spot signs from multiple supervisors. ACCV 2020.
68. Sign Language Video Retrieval
Sign Video Embeddings are learned with automatic annotations from:
1. Sign spotting: Mouthing (M)
2. Sign spotting from visual dictionaries WLASL & MSASL (D)
69. Sign Language Video Retrieval
Effect of retraining an I3D backbone with automatic annotations.
1079 words, 1887 words
72. Sign Language Video Retrieval
More qualitative results.
Top Hit #1 (query), Top Hit #2, Top Hit #3
Video category: 1 (Sports and Fitness)
"Then bring your feet together and by this time you should be able to have built up enough strength to do a full push up."
Video category: 1 (Sports and Fitness)
"A proper cardiovascular program should incorporate various aspects of training through intensity, frequency, as well as time..."
Video category: 1 (Sports and Fitness)
"Then when you get strong, then you can start picking up your feet."
73. Outline
Motivation
A crash course on sign languages (SL)
State of the art
How2Sign dataset
Applications
Open Challenges
Conclusion
Duarte, Amanda, Samuel Albanie, Xavier Giró-i-Nieto, and Gül Varol. "Sign Language Video Retrieval with Free-Form
Textual Queries." arXiv preprint arXiv:2201.02495 (2022).
75. Challenges in Computer Vision
Off-the-shelf pose detectors and generators struggle with hands.
76. Challenges in Computer Vision
Zhou, Yuxiao, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. "Monocular real-time hand shape and motion capture using multi-modal data." CVPR 2020.
77. Challenges in Computer Vision
Weinzaepfel, Philippe, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, and Grégory Rogez. "Dope: Distillation of part experts for whole-body 3d pose estimation in the wild." ECCV 2020.
78. Challenges in Computer Vision
Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end sign language production." ECCV 2020.
79. Challenges in Computer Vision
Ng, Evonne, Shiry Ginosar, Trevor Darrell, and Hanbyul Joo. "Body2hands: Learning to infer 3d hands from conversational gesture body dynamics." CVPR 2021.
81. Challenges in NLP
Sign languages are: 🤔
● (Very) low-resource languages…
● ...in a (very) high-dimensional space (video).
82. Challenges in NLP
Figure: TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of machine learning
research 3, no. Feb (2003): 1137-1155.
What are “language models” in sign language? 🤔
83. Challenges in NLP
How to transfer from large pre-trained (“foundation”) models?
#GPT-3 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. Language models
are few-shot learners. NeurIPS 2020 (best paper award).
Source: [OpenAI API]
English: My name is Barbara.
ASL: ME NAME fs-B-A-R-B-A-R-A.
English: Is he a teacher?
ASL: HE TEACHER HE
English: Amir is tall.
ASL: fs-A-M-I-R, HE TALL HE
English: I’m not sad.
ASL: ME SAD ME 🤔
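The English/ASL pairs above follow the usual few-shot prompting pattern: a handful of solved examples followed by an open query that the language model is asked to complete. A minimal sketch of assembling such a prompt is given below; `build_prompt` is a hypothetical helper, and sending the text to a model is deliberately left out because the exact API used for the slide is not stated.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format (English, ASL gloss) pairs followed by an open query line."""
    lines = []
    for english, gloss in examples:
        lines.append(f"English: {english}")
        lines.append(f"ASL: {gloss}")
    lines.append(f"English: {query}")
    lines.append("ASL:")  # left open for the model to complete
    return "\n".join(lines)


# Example pairs copied from the slide; the last sentence is the query.
examples = [
    ("My name is Barbara.", "ME NAME fs-B-A-R-B-A-R-A."),
    ("Is he a teacher?", "HE TEACHER HE"),
    ("Amir is tall.", "fs-A-M-I-R, HE TALL HE"),
]
prompt = build_prompt(examples, "I'm not sad.")
print(prompt)
```

The completion shown on the slide, ME SAD ME, appears to drop the negation, which illustrates why transferring from text-only foundation models to sign language glosses remains an open question.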
85. Challenges in Speech Translation
Jia, Ye, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. "Translatotron 2: Robust direct speech-to-speech
translation." arXiv preprint arXiv:2107.08661 (2021).
Speech → Speech: End-to-end
Speech → Video: End-to-end 🤔
87. Challenges in Training Data
Damen, Dima, and Michael Wray. "Supervision Levels Scale (SLS)." arXiv (2020). [tweet]
X
88. Challenges in Training Data: Pseudo-glosses
Yin, Kayo, and Jesse Read. "Better Sign Language Translation with STMC-Transformer." COLING 2020. [talk]
Moryossef, Amit, Kayo Yin, Graham Neubig, and Yoav Goldberg. "Data Augmentation for Sign Language Gloss Translation." arXiv 2021.
Generation of gloss pseudo-labels by training a transformer.
Moreno D, Duarte A, Costa-jussà MR, Giró-i-Nieto X. English to ASL Translator for Speech2Signs. UPC 2018.
89. Challenges in Training Data: Self-supervision
#SignBERT Hu, Hezhen, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. "SignBERT: Pre-Training of Hand-Model-Aware
Representation for Sign Language Recognition." ICCV 2021.
91. Conclusion: Speech2Signs (and Signs2Speech)
End-to-end translation & production
Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Speech Language Gloss [1] Sign transcription [2] Video
3D Poses 2D Poses Segments [3]
Multiple vision, natural language & speech challenges for a societally impactful task.
[1] Yin, Kayo, and Jesse Read. "Better Sign Language Translation with STMC-Transformer." COLING 2020.
[2] Hanke, Thomas. "HamNoSys-representing sign language data in language resources and language processing contexts." In LREC, vol. 4, pp. 1-6. 2004.
[3] Renz, Katrin, Nicolaj C. Stache, Samuel Albanie, and Gül Varol. "Sign language segmentation with temporal convolutional networks." ICASSP 2021.