Free-viewpoint video (FVV) is an advanced form of media that provides a more immersive user experience than traditional media. Because users can view the content from any desired viewpoint, FVV enables genuine interaction and is emerging as a next-generation medium.
To create FVV content, existing systems require complex, specialized capturing equipment and offer low end-user usability, because considerable expertise is needed to operate them. This inconveniences individuals and small organizations who want to create content, limits end users' ability to create FVV-based user-generated content (UGC), and inhibits the creation and sharing of diverse content.
To tackle these problems, this work proposes ParaPara, an end-to-end system that, unlike previously proposed systems, uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos. First, the system detects persons in the monocular video with a deep neural network, calculates a real-world homography matrix from minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using general image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The synthesized content runs on Microsoft HoloLens; the user can freely place it in the real world and watch it from a free viewpoint.
Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality
2. • Education
• Korea University of Technology and Education (Mar. 2010 ~ Feb. 2017)
• Bachelor of Science in Computer Engineering
• Tokyo Institute of Technology (Apr. 2017 ~ )
• Master of Science in Computer Science candidate
• Koike laboratory (Vision-based human-computer interaction)
• Research assistant, Team Koike, CREST, JST
• Publications
[1] MonoEye: Monocular Fisheye Camera based 3D Human Pose Estimation, IEEE VR 2019 (poster abstract, accepted)
Dong-Hyun Hwang, Kohei Aso, and Hideki Koike
[2] ParaPara: Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality, ACM CHI 2018 Extended Abstract
Dong-Hyun Hwang and Hideki Koike
[3] MlioLight: Projector-camera Based Multi-layered Image Overlay System for Multiple Flashlights Interaction, ACM ISS 2018
Toshiki Sato, Dong-Hyun Hwang, and Hideki Koike
[4] AR based Self-sports Learning System using Decayed Dynamic Time Warping Algorithm, Eurographics ICAT-EGVE 2018
Atsuki Ikeda, Dong-Hyun Hwang, and Hideki Koike
• Research Interests
• AR/MR-based interactive content synthesis
• Egocentric vision
• Computer-vision-based interactive systems
Dong-Hyun Hwang (황동현)
hwang.d.ab@m.titech.ac.jp
3. • Introduction
• Reviewing Free-viewpoint video synthesis systems
• FVV Synthesis with Multiple Cameras
• FVV Synthesis with a Monocular Video
• ParaPara System
• System Implementation
• Applications
• Discussion
• Interactive Demo
• Conclusions
Contents
5. • Free-viewpoint video (FVV) is an advanced medium that provides an immersive user experience and interaction.
• Provides flexible viewpoint navigation in 3D space.
• Moves from limited linear viewpoint selection to intuitive free-viewpoint selection.
• Two ways to synthesize content:
• With multiple imaging devices.
• With a monocular camera / video.
Free-viewpoint Video
6. • Reconstructs 3D information from multiple images.
• Structure from Motion (SfM), point clouds, image-based visual hulls, etc.
• Synthesized content is accurate and impressive.
• An accurate synchronization method and a complex configuration are required.
FVV Synthesis with Multiple Cameras
Photo Tourism (SfM-based, 2005); hardware configuration of Goorts et al.'s system (2013)
7. • Virtualized Reality and EyeVision (Kanade et al. 1995 and 2001)
• Captures a target using 51 cameras mounted on a dome structure.
• Synthesizes images from virtual viewpoints.
• EyeVision, based on this technology, was used at Super Bowl XXXV.
• Image-Based Visual Hulls (Matusik et al. 2000)
• Inverse projection (silhouette cones) from foreground silhouettes using camera parameters.
• The visual hull is the intersection of the silhouette cones.
FVV Synthesis with Multiple Cameras (cont'd)
Virtualized Reality; EyeVision; IBVH
8. • High-Quality Streamable Free-Viewpoint Video (Collet et al. 2015)
• Captures a target with 106 RGB and infrared cameras.
• Constructs a 3D mesh model from captured point clouds.
• Encodes the content for streaming via MPEG-DASH.
• Holoportation (Orts-Escolano et al. 2016)
• Reduces the number of cameras (from 106 to 24).
• Real-time 3D reconstruction (34 fps).
FVV Synthesis with Multiple Cameras (cont'd)
Collet et al.'s system (2015); Holoportation (2016)
9. • Reconstructs 3D information from a single monocular video or camera.
• A typical under-constrained problem with inherent ambiguity.
• Requires no special capturing equipment or environment.
• Allows existing content to be reused.
FVV Synthesis with a Monocular Video
Input and output results of ParaPara
10. • Tour Into the Picture (Horry et al. 1997)
• Adds a virtual vanishing point through user interaction.
• Builds a simple 3D model of the scene.
• Synthesizes views from virtual viewpoints using a homography transform.
FVV Synthesis with a Monocular Video (cont'd)
Original image and synthesized images from novel viewpoints.
http://andyzeng.github.io/homography
Modeling procedure
11. • Soccer on Your Tabletop (Rematas et al. 2018)
• A deep-neural-network-based system.
• Detects and tracks players.
• Reconstructs each player's depth map and mesh.
• Estimates the camera pose from landmarks on the soccer field.
• Generative Query Network (Eslami et al. 2018)
• A representation network produces a vector that describes the observations.
• A generation network predicts the scene from an unobserved viewpoint.
FVV Synthesis with a Monocular Video (cont’d)
Procedures of the Soccer on Your Tabletop system; input and output results of GQN
13. • User-generated content (UGC) is content created on the user side.
• It rose to prominence with Web 2.0.
• It ranges from simple photo sharing to 360-degree VR content.
• The impact of UGC lies in delivering a tremendous amount of content.
User-generated Content
14. • Most FVV synthesis systems have problems:
• They require multiple imaging devices.
• Their configuration is not end-user friendly.
• They cannot use existing content.
• The goal of our research is to develop an end-to-end system that synthesizes pseudo-2.5D free-viewpoint content from monocular videos for creating and disseminating UGC.
Problem Definition and Research Objective
System overview: monocular video → ParaPara → pseudo-2.5D FVV content
18. • A hybrid module (a DNN combined with common image processing algorithms) that synthesizes an FVV scene from monocular videos.
Scene Synthesizer
19. • OpenPose (DNN-based) is used to detect persons in the video sequence.
• Two modes (normal mode and precision mode) are provided, based on the network input resolution.
• To compensate for detection failures, each detected person is tracked across frames.
• Each bounding box is refined from the detected joints (see the sketch below).
Person Detecting and Tracking
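As an illustration of the joint-based box refinement, here is a minimal sketch assuming OpenPose-style (x, y, confidence) keypoints; the confidence threshold and padding margin are hypothetical parameters, not values from the system:

```python
import numpy as np

def refine_bounding_box(keypoints, conf_thresh=0.3, margin=0.1):
    """Refine a person's bounding box from detected 2D joints.

    `keypoints` is an (N, 3) array of (x, y, confidence) joints in the
    OpenPose output format; low-confidence joints are ignored.
    """
    valid = keypoints[keypoints[:, 2] > conf_thresh]
    if len(valid) == 0:
        return None  # detection failure: fall back to the tracker
    x_min, y_min = valid[:, :2].min(axis=0)
    x_max, y_max = valid[:, :2].max(axis=0)
    # Pad the joint extent slightly so the box covers the full silhouette.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)
```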
20. • With minimal user interaction, a homography matrix (from the image plane to the real-world ground plane) is calculated.
• The pseudo-3D position is computed by applying the homography to the ankle positions of each detected person (see the sketch below).
Pseudo-3D Position Estimation
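A minimal sketch of this step with OpenCV, assuming the user clicks four ground-plane points whose real-world coordinates are known; the point values below are made-up examples:

```python
import cv2
import numpy as np

# Four ground-plane points clicked by the user in the image, paired with
# their known real-world coordinates in meters (example values only).
image_pts = np.float32([[412, 633], [1507, 641], [1215, 402], [705, 398]])
world_pts = np.float32([[0, 0], [20, 0], [20, 10], [0, 10]])

# Image-to-world homography from the minimal user interaction.
H, _ = cv2.findHomography(image_pts, world_pts)

def pseudo_3d_position(ankle_xy, H):
    """Project a detected person's ankle pixel onto the ground plane."""
    point = np.float32([[ankle_xy]])                  # shape (1, 1, 2)
    gx, gz = cv2.perspectiveTransform(point, H)[0, 0]
    # The person is assumed to stand on the ground, so height is 0.
    return np.array([gx, 0.0, gz])
```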
21. • Persons' textures are extracted by KNN-based background subtraction within the refined bounding boxes (see the sketch below).
• A 1-D mean filter smooths the extracted contours.
• A texture-size correction method minimizes the distortion caused by perspective error.
Texture Extraction and Size Correction
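A sketch of the extraction step with OpenCV 4.x; the filter width and minimum contour length are hypothetical parameters:

```python
import cv2
import numpy as np

# KNN-based background subtractor used for texture extraction.
subtractor = cv2.createBackgroundSubtractorKNN(detectShadows=False)

def smooth_contour(contour, k=5):
    """1-D mean filter over contour points, wrapped so the loop stays closed."""
    pts = contour.reshape(-1, 2).astype(np.float32)
    kernel = np.ones(k) / k
    for axis in range(2):
        padded = np.concatenate([pts[-k:, axis], pts[:, axis], pts[:k, axis]])
        pts[:, axis] = np.convolve(padded, kernel, mode="same")[k:-k]
    return pts.reshape(-1, 1, 2).astype(np.int32)

def extract_texture(frame, bbox, min_contour_len=15):
    """RGBA person texture from the foreground mask inside the refined bbox."""
    foreground = subtractor.apply(frame)
    x0, y0, x1, y1 = map(int, bbox)
    mask = foreground[y0:y1, x0:x1]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    clean = np.zeros_like(mask)
    for c in contours:
        if len(c) >= min_contour_len:  # drop tiny noise blobs
            cv2.drawContours(clean, [smooth_contour(c)], -1, 255, -1)
    # Alpha channel carries the silhouette: transparent outside the person.
    rgba = cv2.cvtColor(frame[y0:y1, x0:x1], cv2.COLOR_BGR2BGRA)
    rgba[:, :, 3] = clean
    return rgba
```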
22. • Each texture is placed in the 3D world at its calculated pseudo-3D position (a minimal sketch of this assembly step follows below).
• The extracted background image or a custom image is used as the ground texture.
Scene Synthesis
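To make the assembly concrete, here is a hypothetical container that ties the previous sketches together; `detections`, `det.bbox`, `det.ankle_xy`, and `det.size_m` (the corrected texture size) are assumed outputs of the earlier steps, not names from the actual system:

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class Billboard:
    texture: np.ndarray                    # RGBA person texture
    position: Tuple[float, float, float]   # pseudo-3D position (x, 0, z)
    size: Tuple[float, float]              # corrected width/height (meters)

@dataclass
class Scene:
    ground_texture: np.ndarray             # extracted background or custom image
    billboards: List[Billboard] = field(default_factory=list)

def synthesize_frame(frame, detections, H, ground_texture):
    """Assemble one FVV frame, reusing the sketch functions defined above."""
    scene = Scene(ground_texture=ground_texture)
    for det in detections:
        texture = extract_texture(frame, det.bbox)       # texture extraction
        position = pseudo_3d_position(det.ankle_xy, H)   # ground-plane position
        scene.billboards.append(Billboard(texture, tuple(position), det.size_m))
    return scene
```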
24. • Plays content produced by the scene synthesizer.
• Runs on a mixed reality head-mounted display.
• Provides interaction between the user and the content.
Content Player
25. • Generated content is displayed in the real world at 30 fps.
• Spatial mapping allows the content to be freely placed on real-world objects.
• A head-related transfer function (HRTF) based spatial sound method synthesizes directional sound according to the position of the content.
Playing Synthesized Content
26. • The textures are extracted from a monocular video, so they contain no information that was not captured in the original video.
• Axial billboard rendering about the y axis minimizes the unnaturalness of the generated content's 2D textures when the viewpoint changes (see the sketch below).
Billboard Rendering
w/o billboard rendering w/ billboard rendering
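A minimal sketch of the y-axis (axial) billboard constraint: the quad rotates about the vertical axis only, so the texture stays upright while always facing the camera horizontally:

```python
import numpy as np

def y_axis_billboard_yaw(object_pos, camera_pos):
    """Yaw (radians) that turns a quad about the y axis toward the camera."""
    to_camera = np.asarray(camera_pos, float) - np.asarray(object_pos, float)
    # Drop the y component so the quad only spins about the vertical axis,
    # keeping the 2D person texture upright.
    return float(np.arctan2(to_camera[0], to_camera[2]))
```

In an engine such as Unity, the returned yaw (converted to degrees) would be applied each frame as a rotation about the y axis only.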
28. • Accuracy of Depth Estimation
• Accuracy of Texture Extraction
• Processing Speed
Evaluation Metrics
29. • Ground truth is generated under short-range (5 participants) and long-range (virtual environment) conditions.
• The average z-axis error is calculated.
• Evaluation function: mean absolute error (written out below).
Accuracy of Depth Estimation
Short-range condition; long-range condition
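Written out, the evaluation function above is the mean absolute error over the depth axis, where $z_i^{\mathrm{est}}$ is the estimated depth, $z_i^{\mathrm{gt}}$ the ground truth, and $N$ the number of samples:

```latex
\mathrm{MAE}_z = \frac{1}{N} \sum_{i=1}^{N} \left| z_i^{\mathrm{est}} - z_i^{\mathrm{gt}} \right|
```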
30. • Average z-axis error (short range): 24.57 cm
• Average z-axis error (long range): 76.04 cm
• The error increased as the model moved away from the camera.
• As the distance in the ground-truth images increases, the quantization error also increases.
Accuracy of Depth Estimation
31. • The proposed method (background-subtraction-based) is compared against Mask R-CNN (state-of-the-art, DNN-based), which serves as ground truth.
• Mask IoU = 0.72 (SD = 0.05).
• An IoU score above 0.5 is normally considered a "good" prediction; a minimal implementation of the metric appears below.
Accuracy of Texture Extraction
(a) Mask R-CNN (blue region), (b) ours (red region)
(c) overlapped two methods (purple region is the intersection area).
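Mask IoU here is the standard region-overlap measure between the two segmentation masks; a minimal numpy version:

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over Union of two boolean segmentation masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both masks empty
    return float(np.logical_and(a, b).sum()) / float(union)
```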
32. • The average processing speed is approximately 180 ms per frame (450 ms in precision mode).
• The proposed texture extraction method is faster than Mask R-CNN (average processing time with Mask R-CNN: 2683 ms).
• By combining a DNN with common image processing algorithms, the system runs faster than fully DNN-based systems.
Processing Speed
Average processing time per frame: 179.36 ms (normal mode), 448.47 ms (precision mode).
Processing time of the texture extraction procedure: 41.43 ms (proposed method), 2545 ms (Mask R-CNN).
34. • Evaluates the effectiveness of the content synthesized by the proposed system against the original monocular videos.
• Three comparison conditions (C1, C2, C3).
• Twelve participants (three female, mean age = 26, SD = 9.23).
• A 5-point Likert-scale questionnaire was used (1 = strongly disagree to 5 = strongly agree).
Experiment Design
35. • Users can visually recognize spatial information more easily with the proposed method (C3) than with C1 and C2 (p ≤ 0.001).
• Users could feel a stereoscopic effect in the content created by the proposed system, compared with the original monocular video.
Visual Depth Perception (Stereoscopy)
Conditions: C1 = monocular video + 2D display; C2 = monocular video + MR HMD; C3 = synthesized content + MR HMD (marked with a red box in the chart).
36. • A significant difference was observed between C3 (the proposed method) and the other conditions (p ≤ 0.001).
• Content generated by the proposed system increases the user's immersion.
Degree of Immersion (Immersion)
Conditions: C1 = monocular video + 2D display; C2 = monocular video + MR HMD; C3 = synthesized content + MR HMD (marked with a red box in the chart).
37. • A significant difference was observed between C3 (the proposed method) and the other conditions (p ≤ 0.001).
• C3 provided the most interesting experience among the methods.
• The stereoscopic experience elicited positive responses from the subjects.
Attractiveness
Conditions: C1 = monocular video + 2D display; C2 = monocular video + MR HMD; C3 = synthesized content + MR HMD (marked with a red box in the chart).
39. • The user can easily convert a monocular sports video into immersive sports content.
Immersive Sports Broadcasting
Input video (ISSIA-CNR dataset); synthesized content
40. • Various entertainment videos on the Internet can be converted into attractive FVV content with ParaPara.
Dynamic Entertainment Content
Input video Synthesized content
41. • ParaPara can synthesize multiple videos into a single scene.
• The user can intuitively perceive spatial information and track a target as it moves from one camera's viewpoint to another's.
Effective Surveillance System
Synthesized content from CMUSRD dataset
43. • High versatility and low cost
• Creates FVV content from monocular videos.
• High usability
• End users without expertise can use the system.
• Reasonable quality
• Monocular videos can be converted into immersive content.
• Fast processing speed
• The system is faster than fully DNN-based systems.
Advantages of ParaPara
44. • Limited camera posture
• Only fixed-viewpoint videos are supported.
• Pseudo-3D position
• The y-axis (height) position cannot be estimated, since persons are assumed to stand on the ground plane.
• Texture artifacts
• Detection failures cause artifacts.
• 2D textures
• Information about surfaces not facing the camera cannot be restored.
Technical Challenges and Limitations
45. • Applying deep neural networks to a wider range of procedures:
• Depth estimation.
• Camera pose estimation.
• Recovering lost information:
• Converting a detected person's silhouette into a fitted 3D model.
• Using generative models (GANs, autoencoders, etc.) for texture recovery.
Future Work
Pipeline of Photo Wake-Up (2018); Warping-GAN (2018)
47. Summary
Problems
• Requiring multiple imaging devices.
• Non-end-user-friendly system configuration.
• Inability to use existing content.
Research Goals
• Creating FVV content without multiple imaging devices.
• Increasing the system's usability.
• Utilizing existing content and equipment.
Contributions
• ParaPara, an alternative system that synthesizes FVV content from single or multiple monocular videos,
• performance evaluations of the proposed system,
• a user study assessing the usability of the synthesized content, and
• sample applications for which the proposed system is well suited.
Part of this work has been presented in the ACM CHI 2018 Extended Abstracts:
ParaPara: Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality. Dong-Hyun Hwang and Hideki Koike.
This work was supported in part by JST CREST Grant Number JPMJCR17A3, Japan, "A study on skill acquisition mechanism and development of skill transfer systems."