Ukrainian Catholic University
Faculty of Applied Sciences
Data Science Master Program
January 22nd
Abstract. Virtual and augmented reality applications are becoming more and more popular. This trend creates demand for 3D processing algorithms that can be applied in many areas. This work focuses on sign language video sequences. Many prerecorded photo and video dictionaries exist that can be transformed into 3D and unified in one place. We study the nuances of hand pose video sequence analysis, as well as the influence of results refinement on 2D and 3D keypoint detection. In addition, we designed a solution for the parametrization of hand shape and engineered a system for 3D hand pose reconstruction. The model shows good results on training data but lacks generalization. Retraining on multiple datasets and using various data augmentation techniques are expected to improve performance.
Master defence 2020 - Roman Riazantsev - 3D Reconstruction of Video Sign Language Dictionaries
1. 3D Reconstruction of Video Sign Language Dictionaries
author: Roman Riazantsev
supervisor: Maksym Davydov
2. Context & Problems
VR / AR technologies
Sign Language Dictionaries
Different data formats
How to create a unified database of sign language dictionaries?
3. Research Questions & Goals
How does the addition of synthetic video frames affect the performance of 3D reconstruction methods?
How does refinement of results affect the performance of 3D reconstruction methods?
Create a system for 3D reconstruction of sign language dictionaries.
4. How can 3D reconstruction be done?
[Jameel Malik et al. DeepHPS: End-to-end
Estimation of 3D Hand Pose and Shape by
Learning from Synthetic Depth]
6. Stage 1. Object search with sequential refinement
[Joseph Redmon et al. YOLOv3: An Incremental
Improvement]
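A detector such as YOLOv3 returns a bounding box, and later stages work on a fixed-size hand crop. The helper below is a minimal sketch of that hand-off, not the thesis code: the box format (x_min, y_min, x_max, y_max), the function name square_crop_box and the 20 % margin are all assumptions made for illustration.

```python
def square_crop_box(box, image_w, image_h, margin=0.2):
    """Turn a detected box into a square crop region that stays inside the image.

    box is assumed to be (x_min, y_min, x_max, y_max) in pixels (hypothetical format).
    margin enlarges the longer side so that fingertips are not cut off.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0

    # Square side: longer box side plus margin, never larger than the image.
    side = max(x_max - x_min, y_max - y_min) * (1.0 + margin)
    side = min(side, image_w, image_h)
    half = side / 2.0

    # Shift the centre so the square does not leave the image.
    cx = min(max(cx, half), image_w - half)
    cy = min(max(cy, half), image_h - half)

    return (
        int(round(cx - half)),
        int(round(cy - half)),
        int(round(cx + half)),
        int(round(cy + half)),
    )
```

The resulting region can then be cropped and resized to the input resolution of the key point network.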
7. Stage 2. 2D key points detection
(Figure panels: Step 1, Step 2, Step 3)
[Zhe Cao et al. OpenPose: Realtime Multi-Person 2D Pose Estimation
using Part Affinity Fields]
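OpenPose-style detectors predict one confidence map (heatmap) per key point, and the 2D location is read off as the position of the maximum. The sketch below shows that read-out step with NumPy. Whether the networks studied in the thesis output heatmaps or regress coordinates directly is not stated on the slides, so the (K, H, W) input layout and the function name are assumptions.

```python
import numpy as np


def keypoints_from_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """Return a (K, 2) array of (x, y) pixel coordinates.

    heatmaps is assumed to have shape (K, H, W), one map per key point.
    """
    k, h, w = heatmaps.shape
    # Index of the maximum in each flattened map.
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    # Convert flat indices back to (row, column) = (y, x).
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1).astype(np.float32)
```

For a hand with 21 key points and a 224x224 input, the call takes a (21, 224, 224) array and returns 21 coordinate pairs.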
8. Stage 3. 3D key points estimation
[Paschalis Panteleris et al. Using a single RGB frame for real time
3D hand pose estimation in the wild.]
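The cited method recovers 3D pose by fitting a hand model to 2D key points. A simpler route, closer to a pipeline that estimates a depth value per key point, is pinhole back-projection. The sketch below illustrates only that geometric step; the intrinsics fx, fy, cx, cy and the function name are hypothetical and do not come from the thesis.

```python
import numpy as np


def backproject(keypoints_2d, depths, fx, fy, cx, cy):
    """Lift (u, v) pixel key points with per-point depth z to camera-space (x, y, z).

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    kp = np.asarray(keypoints_2d, dtype=np.float64)
    z = np.asarray(depths, dtype=np.float64)
    x = (kp[:, 0] - cx) * z / fx
    y = (kp[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

A key point at the principal point maps to (0, 0, z), which gives a quick sanity check for the intrinsics.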
9. Stage 4. Shape parameterization
Vector of parameters -> MANO hand generator
[Javier Romero et al. Embodied Hands: Modeling and Capturing
Hands and Bodies Together]
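MANO describes a hand with a low-dimensional vector: 10 shape coefficients, 45 articulation values (15 joints x 3 axis-angle components) and a 3-value global rotation. The sketch below only shows how such a vector can be packed and unpacked. The ordering of the parts and the class name HandParams are assumptions for illustration, and the sketch does not call the MANO model itself.

```python
from dataclasses import dataclass

import numpy as np

N_GLOBAL_ROT = 3   # axis-angle rotation of the whole hand
N_POSE = 45        # 15 joints x 3 axis-angle components
N_SHAPE = 10       # shape coefficients


@dataclass
class HandParams:
    global_rot: np.ndarray
    pose: np.ndarray
    betas: np.ndarray

    def to_vector(self) -> np.ndarray:
        """Pack into one flat vector (order is a hypothetical convention)."""
        return np.concatenate([self.global_rot, self.pose, self.betas])

    @classmethod
    def from_vector(cls, vec: np.ndarray) -> "HandParams":
        """Split a flat vector of length 58 back into its three parts."""
        expected = N_GLOBAL_ROT + N_POSE + N_SHAPE
        if vec.shape != (expected,):
            raise ValueError(f"expected shape ({expected},), got {vec.shape}")
        a = N_GLOBAL_ROT
        b = a + N_POSE
        return cls(global_rot=vec[:a], pose=vec[a:b], betas=vec[b:])
```

A regression network for this stage would predict the flat vector, and a hand generator would consume the three parts.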
11. “3D Hand Pose Estimation from Single RGB
Camera” by Olha CHERNYTSKA
Scheme of the 3D key-point detection method.
12. Takeaways from literature
- A modular approach suffers from error accumulation;
- Using previous frames can improve the quality of 3D reconstruction;
- The 3D shape of the hand can be used for regression of 3D key points;
- Refinement of results can be used to improve final accuracy;
- The network for estimation of 3D key points can be connected to previous stages.
13. Hypotheses
1) Usage of synthetic video frames will improve the accuracy of the method.
2) Usage of results refinement helps maintain higher accuracy.
15. Studied architectures (13 in total)
6 for 2D key-point detection (stage A)
5 for depth estimation (stage B)
2 for shape parameterization (stage C)
16. VIDEO networks take as input the RGB data of hand n concatenated with the ground truth value for hand k, the hand with the closest 3D configuration (Synthetic Previous Frame).
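The Synthetic Previous Frame idea can be sketched as a nearest-neighbour search over 3D poses followed by a channel-wise concatenation. This is a minimal sketch under stated assumptions: the distance measure (mean per-joint Euclidean distance), the representation of the ground truth as maps with the same height and width as the image, and both function names are hypothetical.

```python
import numpy as np


def closest_pose_index(poses_3d: np.ndarray, n: int) -> int:
    """Return index k != n of the hand whose 3D key points are closest to hand n.

    poses_3d is assumed to have shape (N, J, 3) with N >= 2.
    """
    diffs = poses_3d - poses_3d[n]
    # Mean Euclidean distance over the J joints, one value per hand.
    dists = np.linalg.norm(diffs, axis=2).mean(axis=1)
    dists[n] = np.inf  # never pick the hand itself
    return int(dists.argmin())


def build_video_input(rgb_n: np.ndarray, ground_truth_k: np.ndarray) -> np.ndarray:
    """Concatenate the RGB image of hand n with ground-truth maps of hand k.

    rgb_n: (H, W, 3); ground_truth_k: (H, W, C) (assumed layout).
    """
    return np.concatenate([rgb_n, ground_truth_k], axis=2)
```

With 21 key point maps, a 224x224 RGB image becomes a 24-channel input.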
Networks with REFINEMENT sequentially recalculate the result multiple times.
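Sequential refinement can be sketched as a loop that feeds the current estimate back into the predictor. The slides do not specify how the studied networks consume their previous output, so the step(image, estimate) interface and the function name below are assumptions.

```python
from typing import Callable

import numpy as np


def predict_with_refinement(
    step: Callable[[np.ndarray, np.ndarray], np.ndarray],
    image: np.ndarray,
    initial_estimate: np.ndarray,
    n_refinements: int,
) -> np.ndarray:
    """Run one initial pass, then n_refinements extra passes on the previous result."""
    estimate = step(image, initial_estimate)  # first pass (0 refinements)
    for _ in range(n_refinements):
        estimate = step(image, estimate)      # each pass sees the last estimate
    return estimate
```

With n_refinements set to 0 this reduces to a single forward pass, matching the "0 refinements" baseline.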
Metrics used:
- Mean Squared Error loss
- L1 loss
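For reference, both metrics in their standard form, written with NumPy. Averaging over all elements is an assumption here. The reduction used in the thesis is not stated on the slides.

```python
import numpy as np


def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean of squared element-wise differences."""
    return float(np.mean((pred - target) ** 2))


def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean of absolute element-wise differences."""
    return float(np.mean(np.abs(pred - target)))
```

MSE penalizes large errors more strongly, while L1 stays in the units of the prediction (for example pixels), which makes it easier to interpret.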
17. Studied networks for 2D key-point detection (Stage A)
- Usage of synthetic video frames will improve the accuracy of the method.
- Usage of results refinement helps maintain higher accuracy.
20. Tests of 2D key-point detection (stage A)
L1 loss decrease due to synthetic video frames:
- 46.15 % on UNET
- 32.27 % on STH 2
- 28.66 % on STH 3
Effects of refinement on L1 loss for single image networks (mean distance to annotated points, in pixels, for a 224x224 image):
- 24.221 on UNET (0 refinements)
- 21.201 on STH 2 (1 refinement)
- 20.305 on STH 3 (2 refinements)
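A "loss decrease" in percent is usually computed relative to the baseline loss. The sketch below applies that formula to the refinement figures reported on this slide. That the thesis uses exactly this definition is an assumption, and the function name is illustrative.

```python
def relative_decrease(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline * 100.0


# Values from the slide: mean pixel error on 224x224 images.
unet_0_refinements = 24.221
sth2_1_refinement = 21.201
sth3_2_refinements = 20.305

one_step = relative_decrease(unet_0_refinements, sth2_1_refinement)
two_steps = relative_decrease(unet_0_refinements, sth3_2_refinements)
```

Under this definition, one refinement lowers the error by roughly 12.5 % relative to the UNET baseline and two refinements by roughly 16.2 %, so the second refinement adds less than the first.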
21. Tests of depth estimation (stage B)
L1 loss decrease due to synthetic video frames:
- 67.67 % on B without refinement
- 64 % on B with refinement
L1 loss decrease due to results refinement:
- 10.96 % for single RGB image
- 0.85 % for video sequences
27. Achievements:
- The thesis introduces a method for hand shape parametrization;
- A complete method for 3D hand shape reconstruction from video sequences was developed, and its performance was studied with different ANN architectures based on UNET, STH, and a newly introduced RNN.
28. Conclusion:
- Usage of synthetic video frames improves the quality of 3D reconstruction;
- Usage of additional results refinement leads to higher accuracy on single images.
GitHub: https://github.com/roman-riazantsev/sign-pose
Future work:
- Retrain the system with data augmentation on multiple datasets.
- Compare to state-of-the-art methods.
- Redesign certain stages to reconstruct multiple hands at once.
29. Thank you for your attention!
33. Example of incorrect 3D reconstruction
Tests of the official GitHub implementation of ‘Using a single RGB frame for real time 3D hand pose estimation in the wild’.
34. Talk Structure:
- Introduction to 3D hand modeling
- Background
- Proposed methods
- Experiments & Results
- Conclusions