Master defence 2020 - Roman Riazantsev - 3D Reconstruction of Video Sign Language Dictionaries

3D Reconstruction of Video Sign Language Dictionaries
author: Roman Riazantsev
supervisor: Maksym Davydov

Context & Problems
VR / AR technologies
Sign Language Dictionaries
Different data formats
How to create a unified database of
sign language dictionaries?

Research Questions & Goals
How does the addition of synthetic video frames affect the performance of 3D reconstruction methods?
How does refinement of results affect the performance of 3D reconstruction methods?
Create a system for 3D reconstruction of sign language dictionaries.

How can 3D reconstruction be done?
[Jameel Malik et al. DeepHPS: End-to-end
Estimation of 3D Hand Pose and Shape by
Learning from Synthetic Depth]

Modular approach
Generalized schema of 3D hand pose estimation.

Stage 1. Object search with sequential
refinement
[Joseph Redmon et al. YOLOv3: An Incremental
Improvement]

Stage 2. 2D key points detection
Step 1, Step 2, Step 3 (figure panels)
[Zhe Cao et al. OpenPose: Realtime Multi-Person 2D Pose Estimation
using Part Affinity Fields]

Stage 3. 3D key points estimation
[Paschalis Panteleris et al. Using a single RGB frame for real time
3D hand pose estimation in the wild.]

Stage 4. Shape parameterization
MANO hand generator
Vector of parameters
[Javier Romero et al. Embodied Hands: Modeling and Capturing
Hands and Bodies Together]

Another Approach
[Liuhao Ge et al. 3D Hand Shape and Pose Estimation from a
Single RGB Image]

“3D Hand Pose Estimation from Single RGB Camera” by Olha CHERNYTSKA
Schema of the 3D key-point detection method.

Takeaways from literature
- The modular approach suffers from error accumulation;
- Using previous frames can improve the quality of 3D reconstruction;
- The 3D shape of the hand can be used for regression of 3D key points;
- Refinement of results can improve final accuracy;
- The network for estimation of 3D key points can be connected to
previous stages.

Hypotheses
1) Using synthetic video frames will improve the accuracy of the
method.
2) Refining the results helps maintain higher accuracy.

Studied Datasets
a) FreiHAND b) SynthHands

13 studied architectures:
6 for 2D key-points detection (Stage A)
5 for depth estimation (Stage B)
2 for shape parameterization (Stage C)

VIDEO networks take as input the RGB data of hand n
concatenated with the ground truth value for hand k, the hand with
the closest 3D configuration (Synthetic Previous Frame).
Networks with REFINEMENT sequentially recalculate
the result multiple times.
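The two mechanisms described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis implementation: the distance used to find the "closest 3D configuration", the exclusion of hand n itself, the flat-list concatenation, and all function names (`pose_distance`, `closest_other_hand`, `build_video_input`, `refine`) are hypothetical.

```python
from typing import Callable, List, Sequence

Pose = Sequence[float]  # flattened 3D key points of one hand


def pose_distance(a: Pose, b: Pose) -> float:
    """Mean absolute difference between two poses (assumed distance)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def closest_other_hand(n: int, poses: List[Pose]) -> int:
    """Index k != n whose 3D configuration is closest to that of hand n."""
    candidates = [k for k in range(len(poses)) if k != n]
    return min(candidates, key=lambda k: pose_distance(poses[n], poses[k]))


def build_video_input(rgb: Sequence[float], previous_truth: Pose) -> List[float]:
    """Concatenate RGB data of hand n with the ground truth of hand k.

    The ground truth of hand k plays the role of a synthetic previous frame.
    """
    return list(rgb) + list(previous_truth)


def refine(step: Callable[[Sequence[float], Pose], Pose],
           rgb: Sequence[float], initial: Pose, n_refinements: int) -> Pose:
    """Recalculate the estimate n_refinements times, feeding it back in."""
    estimate = initial
    for _ in range(n_refinements):
        estimate = step(rgb, estimate)
    return estimate
```

Here `step` stands in for one forward pass of a network; with `n_refinements = 0` the initial estimate is returned unchanged.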
Used Metrics
Mean Squared Error Loss
L1 Loss
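For reference, a minimal sketch of the two metrics over flat lists of predicted and target values. Averaging over all elements is an assumption (the slides do not state the reduction), and the function names are illustrative.

```python
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error: average of squared element-wise differences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """L1 loss: average of absolute element-wise differences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Because it does not square the differences, the L1 loss penalizes large errors less heavily than the mean squared error and is expressed in the same units as the data.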

- Usage of synthetic video frames will improve accuracy of the
method.
- Usage of results refinement helps maintain higher accuracy.
Studied networks for
2D key-points detection
(Stage A)

Studied networks for depth estimation (Stage B)

Studied networks for shape parameterization (Stage C)
21 key points in 3D (vector of size 63)
Vector of parameters of size 51
MANO hand generator
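The size 63 follows from 21 key points with 3 coordinates each. A small sketch of the conversion between the two layouts; the interleaved x, y, z ordering and the helper names are assumptions, not taken from the thesis.

```python
from typing import List, Sequence, Tuple

NUM_KEY_POINTS = 21
Point3D = Tuple[float, float, float]


def flatten_key_points(points: Sequence[Point3D]) -> List[float]:
    """Turn 21 (x, y, z) points into one vector of 63 values."""
    if len(points) != NUM_KEY_POINTS:
        raise ValueError("expected 21 key points")
    return [coordinate for point in points for coordinate in point]


def unflatten_key_points(vector: Sequence[float]) -> List[Point3D]:
    """Turn a vector of 63 values back into 21 (x, y, z) points."""
    if len(vector) != NUM_KEY_POINTS * 3:
        raise ValueError("expected a vector of size 63")
    return [
        (vector[i], vector[i + 1], vector[i + 2])
        for i in range(0, len(vector), 3)
    ]
```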

Tests of 2D key-points detection (stage A)
L1 loss decrease due to synthetic video frames:
46.15 % on UNET
32.27 % on STH 2
28.66 % on STH 3
Effects of refinement on L1 loss for single image networks
(mean distances to annotated points in pixels for 224x224 image):
24.221 on UNET (0 refinements)
21.201 on STH 2 (1 refinement)
20.305 on STH 3 (2 refinements)
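To put the refinement figures on this slide on the same footing as the percentages, the pixel errors can be converted to a relative decrease against the network with 0 refinements. This is a worked-arithmetic sketch only: the slides do not state how their percentages were computed, so the formula and the helper name `relative_decrease` are assumptions.

```python
def relative_decrease(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Mean L1 distances in pixels, as reported on the slide.
errors = {
    "UNET (0 refinements)": 24.221,
    "STH 2 (1 refinement)": 21.201,
    "STH 3 (2 refinements)": 20.305,
}

baseline = errors["UNET (0 refinements)"]
for name, error in errors.items():
    # Also express the error as a share of the 224-pixel image side.
    share_of_side = error / 224 * 100.0
    print(f"{name}: {error:.3f} px, "
          f"{relative_decrease(baseline, error):.2f} % below baseline, "
          f"{share_of_side:.2f} % of image side")
```

Under this formula, one refinement lowers the error by about 12.5 % and two refinements by about 16.2 % relative to the network without refinement.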

Tests of depth estimation (stage B)
L1 loss decrease due to synthetic video frames:
67.67 % on B without refinement
64 % on B with refinement
L1 loss decrease due to results refinement:
10.96 % for single RGB image
0.85 % for video sequences

Tests of shape parameterization (stage C)

Final 3D position analysis
Deviation of L1 due to addition of hand surface:
~ 3.85% on FreiHAND dataset
~ 13.5% on SynthHands dataset

American Sign Language data

Achievements:
- The thesis introduces a method for hand shape parameterization;
- A complete method for 3D hand shape reconstruction from video
sequences was developed, and its performance was studied
with different ANN architectures based on UNET, STH, and a
newly introduced RNN.

Conclusion:
- Using synthetic video frames improves the quality of 3D
reconstruction;
- Additional refinement of results leads to higher
accuracy on single images.
GitHub: https://github.com/roman-riazantsev/sign-pose
Future work:
- Retrain the system with data augmentation on multiple datasets.
- Compare to state-of-the-art methods.
- Redesign certain stages to reconstruct multiple hands at once.

Thank you for your attention!

Tests of 2D key-points detection (stage A)

Tests of depth estimation (stage B)

Example of incorrect 3D reconstruction
Tests of the official GitHub implementation of ‘Using a single RGB frame for real time 3D hand pose
estimation in the wild’.

Talk Structure:
- Introduction to 3D hand modeling
- Background
- Proposed methods
- Experiments & Results
- Conclusions
