Towards Sign Language Translation & Production | Xavier Giro-i-Nieto

Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Towards
Sign Language Translation & Production
Universitat de Barcelona
April 1, 2022

PhD candidate Amanda Duarte
[Timeline figure, 2016 to 2022: Caffe2 Research Award 2017, Facebook Research grant, CVPR.]

Outline
Motivation
A crash course on sign languages (SL)
State of the art
How2Sign dataset
Applications
Open Challenges
Conclusion

Motivation: Accessibility
“World Report on Hearing”. World Health Organization 2021.
60.5M people in
Severe > Complete hearing loss

Shelly Chadha, “Launch of the World Report on Hearing”. World Health Organization 2021.

Classic Motivation: Accessibility to basic services
“World Report on Hearing”.
World Health Organization 2021.
https://whereistheinterpreter.com/
#whereistheinterpreter

Amit Moryossef, “Google Translate for Sign Language”. 2021. [talk] [code]

Motivation: Learning SL from Personal Assistants
[Figure: human-computer interaction and human-human interaction. Teaching that scales.]

Motivation: Multimodal translation in a Metaverse
Helping Hands, “Using ASL in Virtual Reality (VRChat)” (2020)

Motivation: Multimodal translation in a Metaverse
Spoken language
Sign language

Outline
Motivation
State of the art
How2Sign dataset
Applications
Open Challenges
Conclusion

A crash course on Sign Languages (SL)
Cultural diversity of sign languages, similar to spoken languages
○ American (ASL), British (BSL), German (GSL), Chinese (CSL)… sign languages.
Irish Sign Language (ISL) Catalan Sign Language (LSC)

Sign languages are NOT a one-to-one mapping from spoken languages.
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Look-Up Table
Sign Language (video)

There exists a textual transcription method named “glosses”.
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Sign Language (transcription): HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
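
As a rough illustration of how a gloss line like the one on this slide might be handled in code, the sketch below splits a gloss string into tokens and flags finger-spelled items. It assumes that an `FS-` prefix marks finger-spelling, as the `FS-AMELIA` example suggests. The function name `parse_gloss` and the tuple layout are hypothetical and not part of any annotation standard.

```python
def parse_gloss(line):
    """Split a gloss line into (token, is_fingerspelled) pairs.

    Assumption: a leading "FS-" (any case) marks a finger-spelled word,
    following the FS-AMELIA example. Commas are treated as separators.
    """
    tokens = []
    for raw in line.replace(",", " ").split():
        if raw.upper().startswith("FS-"):
            # Drop the 3-character prefix and keep the spelled word.
            tokens.append((raw[3:], True))
        else:
            tokens.append((raw, False))
    return tokens


gloss = "HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR"
parsed = parse_gloss(gloss)
```

A real gloss convention carries more markers than this (for example for pointing or classifiers), so the sketch covers only the one prefix visible in the example.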

● Manual features:
○ Handshape
○ Palm orientation, movement, location
● Non-manual features:
○ Head (nod / shake / tilt)
○ Mouth
○ Eyebrows
○ Cheeks
○ Facial grammar (or expressions)
○ Body position
Stokoe Jr, William C. "Sign language structure: An outline of the visual communication systems of the American deaf." Journal of Deaf Studies and Deaf Education (2005).
Figure: Arizona State University

SLs use persistent spatial grounding (e.g. by pointing & placing)!
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Right along here…” ...immobile entity is
located here,

SLs use persistent spatial grounding (e.g. by pointing & placing)!
Liddell, Scott K. "Spatial representations in discourse: Comparing spoken and signed language." Lingua (1996).
“Not far and to the
right of,
...tall, vertical entity at this place.

Outline
Motivation
State of the art
How2Sign dataset
Applications
Challenges
Conclusion

Sign-to-Spoken Language Tasks
Finger-spelling: A, B, C, D...
Isolated SL Recognition: “I”
Continuous SL Recognition: HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
SL Translation: Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Figure: GIPHY / SIGN WITH ROBERT

Sign-to-Spoken Language Tasks
SL Translation Hi, I’m Amelia and I’m going to talk to you
about how to remove gum from hair.

Sign-Spoken Language Tasks
SL Translation: Sign Language (video) to Spoken Language (transcription)
SL Production: Spoken Language (transcription) to Sign Language (video)
Spoken Language (transcription): Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.

Neural Machine Translation
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NeurIPS 2014.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase
representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Encoder Decoder
Representation
I’m going to talk to you about how to remove gum from hair.
Dia duit, is mise Amelia agus beidh mé ag caint leat faoi conas guma a bhaint de ghruaig.

Automatic Speech Recognition (ASR)
Encoder Decoder
Representation
you about how to
remove gum from
hair.
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." ICML 2014.
#LAS Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary conversational speech
recognition." ICASSP 2016.

Image Captioning
Encoder Decoder
Representation
A group of people shopping at an outdoor market.
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

Neural Sign Language Translation
Encoder Decoder
Representation
you about how to
remove gum from
hair.

Camgoz, Necati Cihan, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. "Neural sign language translation." CVPR 2018.

Camgoz, Necati Cihan, Oscar Koller, Simon Hadfield, and Richard Bowden. "Sign language transformers: Joint end-to-end sign language recognition and translation." CVPR 2020.

Neural Sign Language Production
Encoder Decoder
Representation
you about how to
remove gum from
hair.

Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Mixed SIGNals: Sign Language Production via
a Mixture of Motion Primitives." ICCV 2021.

Encoder Decoder
Representation
you about how to
remove gum from
hair.

Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end
sign language production." ECCV 2020.

Stoll, Stephanie, Necati Cihan Camgoz, Simon Hadfield, and Richard Bowden. "Text2Sign: Towards sign language production using neural machine translation and generative adversarial networks." IJCV 2020.

Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Signing at Scale: Learning to Co-Articulate
Signs for Large-Scale Photo-Realistic Sign Language Production." CVPR 2022.

Outline
Motivation
State of the art
How2Sign dataset
Applications
Challenges
Conclusion

Parallel corpus
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.

Continuous Sign Language Datasets

The How2Sign dataset
Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., ... & Giro-i-Nieto, X.
How2Sign: a large-scale multimodal dataset for continuous American sign language. CVPR 2021.

The How2Sign dataset
How2 dataset [1]: Instructional videos, Speech Signal, English Transcription
English Transcription: Hi, I’m Amelia and I’m going to talk to you about how to remove gum from hair.
Multi-view RGB videos, RGB-D videos
Body-face-hands keypoints: 2D keypoints estimation from OpenPose [2]
Multi-view VGA and HD videos [3]: multi-view recordings (only for a subset), 3D keypoints estimation
Gloss Annotation: HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR

Continuous Sign Language Datasets

The largest dataset in ASL

Built on top of How2

Spoken Language
(speech)
SL Production
SL Translation
Sign Language
(video)
Spoken Language
(transcription)
Hi, I’m Amelia and I’m going to
talk to you about how to
Synthesis
ASR
#How2 Sanabria, Ramon, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. "How2: a large-scale dataset for
multimodal language understanding." arXiv 2018.

How2 dataset [1]
Speech Signal
English Transcription
Instructional videos
English Speech: Speech track available for end-to-end English to ASL.
English Transcriptions: Automatically generated subtitles aligned at the sentence level.
English to Brazilian Portuguese Translations: Allows multilingual research.
#How2 Sanabria, Ramon, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. "How2: a large-scale dataset for
multimodal language understanding." arXiv 2018.

Front+side RGB, Front Depth & Multi-view RGB

Green Studio
Multi-view RGB videos
RGB-D videos
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In: ICCV, 2015.
Panoptic Studio
Multi-view VGA and HD videos

2D & 3D pose estimation
Multi-view RGB videos
Body-face-hands keypoints
2D keypoints estimation from OpenPose [1]
3D keypoints estimation [2]
[1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei and Y. A. Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" in TPAMI, 2019.
[2] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In: ICCV, 2015.
Multi-view VGA and HD videos

Dataset hierarchy
Camera view
Recording
Video
Clip
Frame
Green studio: Frontal or side
Panoptic: Multi-view
ASL Gloss
English transcription
RGB, Depth
OpenPose
Category
Signer
Studio
Green studio
Panoptic (multi-view)

Dataset statistics
Figure panels: clip length; sentence length.

Outline
Motivation
State of the art
How2Sign dataset
Application: Human motion transfer
Challenges
Conclusion
Ventura, Lucas, Amanda Duarte, and Xavier Giró-i-Nieto. "Can everybody sign now? Exploring sign language video
generation from 2D poses." ECCV 2020 SLRTP Workshop.

2D pose estimation [OpenPose]
GAN-generated [Everybody Dance Now]

Can ASL signers understand our generated videos?
“Choose one category”
Classification accuracy: Skeleton vs. GAN-generated

“How well could you understand the video?”
Mean Opinion Score: Skeleton vs. GAN-generated

“Translate the ASL signs into written English.”
Skeleton
GAN-generated

Outline
Motivation
State of the art
How2Sign dataset
Application: Sign Language Video Retrieval
Challenges
Conclusion
Duarte, Amanda, Samuel Albanie, Xavier Giró-i-Nieto, and Gül Varol. "Sign Language Video Retrieval with Free-Form
Textual Queries." arXiv preprint arXiv:2201.02495 (2022).

Sign Language Video Retrieval
Encoder Encoder
Representation
you about how to
remove gum from
hair.
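
The two-encoder setup on this slide maps a text query and each sign video into a shared representation, after which retrieval reduces to ranking by similarity. The sketch below shows only that ranking step. Cosine similarity is an assumption here, since the slide does not name the metric, and the hand-made vectors stand in for real encoder outputs. The names `cosine` and `rank_videos` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(query_embedding, video_embeddings):
    """Return video ids sorted from most to least similar to the query."""
    scored = [
        (cosine(query_embedding, emb), vid)
        for vid, emb in video_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored]


# Toy embeddings standing in for the outputs of the two encoders.
videos = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.5, 0.5, 0.0],
}
query = [1.0, 0.0, 0.0]
ranking = rank_videos(query, videos)
```

In practice the embeddings would come from trained text and video encoders, and the ranking would be computed in batch over the whole collection.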

Challenge: How to train without annotated datasets for continuous SL?
Approach: Produce pseudo-annotations from How2 + How2Sign.

Sign Spotting: Mouthing (M)
Albanie, S., Varol, G., Momeni, L., Afouras, T., Chung, J. S., Fox, N., & Zisserman, A. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. ECCV 2020.

Sign Spotting: Visual Dictionaries (D)
Momeni, L., Varol, G., Albanie, S., Afouras, T., & Zisserman, A. Watch, read and lookup: learning to spot signs from multiple supervisors. ACCV 2020.

Sign Video Embeddings are learned with automatic annotations from:
1. Sign spotting: Mouthing (M)
2. Sign spotting from visual dictionaries, WLASL & MSASL (D)

Effect of retraining an I3D backbone with automatic annotations.
1079 words
1887 words

More qualitative results.
Top Hit #1 (query), Top Hit #2, Top Hit #3
Video category: 1 (Sports and Fitness)
"Then bring your feet together and by this time you should be able to have built up enough strength to do a full push up."
"A proper cardio vascular program should incorporate various aspects of training through intensity, frequency, as well as time..."
"Then when you get strong, then you can start picking up your feet."

Outline
Motivation
State of the art
How2Sign dataset
Applications
Open Challenges
Conclusion

Challenges
Computer Vision
Speech
NLP
Training Data

Challenges in Computer Vision
Off-the-shelf pose detectors and generators struggle with hands.

Zhou, Yuxiao, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. "Monocular real-time
hand shape and motion capture using multi-modal data." CVPR 2020.

Weinzaepfel, Philippe, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, and Grégory Rogez. "Dope: Distillation of
part experts for whole-body 3d pose estimation in the wild." ECCV 2020.

Saunders, Ben, Necati Cihan Camgoz, and Richard Bowden. "Progressive transformers for end-to-end sign language
production." ECCV 2020.

Ng, Evonne, Shiry Ginosar, Trevor Darrell, and Hanbyul Joo. "Body2hands: Learning to infer 3d hands from
conversational gesture body dynamics." CVPR 2021.

Challenges
Computer Vision
Speech
NLP
Training Data

Challenges in NLP
Sign Languages are: 🤔
(Very) low-resource languages…
...in a (very) high-dimensional space (video).

Challenges in NLP
Figure: TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of machine learning
research 3, no. Feb (2003): 1137-1155.
🤔 What are “language models” in sign language?

Challenges in NLP
How to transfer from large pre-trained (“foundation”) models?
#GPT-3 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. Language models
are few-shot learners. NeurIPS 2020 (best paper award).
Source: [OpenAI API]
English: My name is Barbara.
ASL: ME NAME fs-B-A-R-B-A-R-A.
English: Is he a teacher?
ASL: HE TEACHER HE
English: Amir is tall.
ASL: fs-A-M-I-R, HE TALL HE
English: I’m not sad.
ASL: ME SAD ME 🤔

Challenges
Computer Vision
Speech
NLP
Training Data

Challenges in Speech Translation
Jia, Ye, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. "Translatotron 2: Robust direct speech-to-speech
translation." arXiv preprint arXiv:2107.08661 (2021).
Speech to speech: end-to-end.
Speech to video: end-to-end? 🤔

Challenges
Computer Vision
Speech
NLP
Training Data

Challenges in Training Data
Damen, Dima, and Michael Wray. "Supervision Levels Scale (SLS)." arXiv (2020). [tweet]

Challenges in Training Data: Pseudo-glosses
Yin, Kayo, and Jesse Read. "Better Sign Language Translation with
STMC-Transformer." COLING 2020. [talk]
Moryossef, Amit, Kayo Yin, Graham Neubig, and Yoav Goldberg. "Data
Augmentation for Sign Language Gloss Translation." arXiv 2021.
Generation of gloss pseudo-labels by training a transformer.
Moreno D, Duarte A, Costa-jussà MR, Giró-i-Nieto X.
English to ASL Translator for Speech2Signs. UPC 2018.

Challenges in Training Data: Self-supervision
#SignBERT Hu, Hezhen, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. "SignBERT: Pre-Training of Hand-Model-Aware
Representation for Sign Language Recognition." ICCV 2021.

Outline
Motivation
State of the art
How2Sign dataset
Applications
Open Challenges
Conclusion

Conclusion: Speech2Signs (and Signs2Speech)
End-to-end translation & production
HI, ME FS-AMELIA WILL EXPLAIN HOW REMOVE GUM FROM YOUR HAIR
Speech Language Gloss [1] Sign transcription [2] Video
3D Poses 2D Poses Segments [3]
Multiple vision, natural language & speech challenges for a societally impactful task.
[1] Yin, Kayo, and Jesse Read. "Better Sign Language Translation with STMC-Transformer." COLING 2020.
[2] Hanke, Thomas. "HamNoSys-representing sign language data in language resources and language processing contexts." In LREC, vol. 4, pp. 1-6. 2004.
[3] Renz, Katrin, Nicolaj C. Stache, Samuel Albanie, and Gül Varol. "Sign language segmentation with temporal convolutional networks." ICASSP 2021.

Fellow researchers
Shruti Palaskar, Deepti Ghadiyaram, Florian Metze, Francesc Moreno, Jordi Torres, Kevin McGuinness, Gül Varol, Samuel Albanie, Marta R. Costa-jussà, Kenneth DeHaan

Current & former students
Benet Oriol, Jordi Aguilar, Cayetana López, Lucas Ventura, Sandra Roca, Daniel Moreno, Janna Escur, Mireia Hernández, Peter Muschick, Pol Pérez, Görkem Camli, Jordi López, Gerard Gállego
Amanda Duarte, Laia Tarrés, Cristina Puntí, Andrea Iturralde, Maram A. Mohamed, Álvaro Budria, Patricia Cabot, Divya Chhipani, Javier Sanz
Thank you
Supported by
Facebook AI
