Free-viewpoint video (FVV) is an advanced form of media that provides a more immersive user experience than traditional media. Because users can view the content from any desired viewpoint, FVV enables genuine interaction and is emerging as a next-generation medium.
To create FVV content, existing systems require complex, specialized capturing equipment and offer low end-user usability, because considerable expertise is needed to operate them. This inconveniences individuals and small organizations who want to create content, limits end users' ability to create FVV-based user-generated content (UGC), and inhibits the creation and sharing of diverse content.
To tackle these problems, this work proposes ParaPara, an end-to-end system that, unlike previously proposed systems, uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos. First, the system detects persons in the monocular video with a deep neural network, calculates a real-world homography matrix from minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using general image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The synthesized content runs on Microsoft HoloLens; the user can freely place it in the real world and watch it from a free viewpoint.
Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality
2. • Education
• Korea University of Technology and Education (Mar. 2010 ~ Feb. 2017)
• Bachelor of Science in Computer Engineering
• Tokyo Institute of Technology (Apr. 2017 ~ )
• Master of Science in Computer Science candidate
• Koike laboratory (Vision-based human-computer interaction)
• Research assistant, Team Koike, CREST, JST
• Publications
[1] MonoEye: Monocular Fisheye Camera based 3D Human Pose Estimation, IEEE VR 2019 (poster abstract, accepted)
Dong-Hyun Hwang, Kohei Aso, and Hideki Koike
[2] ParaPara: Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality, ACM CHI 2018 Extended Abstract
Dong-Hyun Hwang and Hideki Koike
[3] MlioLight: Projector-camera Based Multi-layered Image Overlay System for Multiple Flashlights Interaction, ACM ISS 2018
Toshiki Sato, Dong-Hyun Hwang, and Hideki Koike
[4] AR based Self-sports Learning System using Decayed Dynamic Time Warping Algorithm, Eurographics ICAT-EGVE 2018
Atsuki Ikeda, Dong-Hyun Hwang, and Hideki Koike
• Research Interests
• AR/MR-based interactive content synthesis
• Egocentric vision
• Computer-vision-based interactive systems
Dong-Hyun Hwang (황동현)
hwang.d.ab@m.titech.ac.jp
3. • Introduction
• Reviewing Free-viewpoint video synthesis systems
• FVV Synthesis with Multiple Cameras
• FVV Synthesis with a Monocular Video
• ParaPara System
• System Implementation
• Applications
• Discussion
• Interactive Demo
• Conclusions
Contents
5. • Free-viewpoint video (FVV) is an advanced medium that provides an immersive user experience and interaction.
• Provides flexible viewpoint navigation in 3D space.
• Moves from limited linear viewpoint selection to intuitive free-viewpoint selection.
• Two ways to synthesize content:
• With multiple imaging devices.
• With a monocular camera / video.
Free-viewpoint Video
6. • Reconstructs 3D information from multiple images.
• Structure from Motion (SfM), point clouds, image-based visual hulls, etc.
• Synthesized content is accurate and impressive.
• An accurate synchronization method and a complex configuration are required.
FVV Synthesis with Multiple Cameras
Photo Tourism (SfM-based, 2005); hardware configuration of Goorts et al.'s system (2013)
7. • Virtualized Reality and EyeVision (Kanade et al. 1995 and 2001)
• Captures a target using 51 cameras mounted on a dome structure.
• Synthesizes images from virtual viewpoints.
• EyeVision, based on this technology, was used at Super Bowl XXXV.
• Image-Based Visual Hulls (Matusik et al. 2000)
• Inverse projection (silhouette cones) from foreground silhouettes using camera parameters.
• The visual hull is the intersection of the silhouette cones.
FVV Synthesis with Multiple Cameras (cont'd)
Virtualized Reality; EyeVision; IBVH
8. • High-Quality Streamable Free-Viewpoint Video (Collet et al. 2015)
• Captures a target with 106 RGB and infrared cameras.
• Constructs a 3D mesh model from captured point clouds.
• Encodes the content for streaming via MPEG-DASH.
• Holoportation (Orts-Escolano et al. 2016)
• Reduces the number of cameras (from 106 to 24).
• Real-time 3D reconstruction (34 fps).
FVV Synthesis with Multiple Cameras (cont'd)
Collet et al.'s system (2015); Holoportation (2016)
9. • Reconstructs 3D information from a single monocular video or camera.
• A typical under-constrained problem with inherent ambiguity.
• Requires no special capturing equipment or environment.
• Allows existing content to be reused.
FVV Synthesis with a Monocular Video
Input and output results of ParaPara
10. • Tour Into the Picture (Horry et al. 1997)
• Adds a virtual vanishing point through user interaction.
• Builds a simple 3D model of the scene.
• Synthesizes views from virtual viewpoints using a homography transform.
FVV Synthesis with a Monocular Video (cont'd)
Original image and synthesized images from novel viewpoints.
http://andyzeng.github.io/homography
Modeling procedure
11. • Soccer on Your Tabletop (Rematas et al. 2018)
• A deep-neural-network-based system.
• Detects and tracks players.
• Reconstructs each player's depth map and mesh.
• Estimates the camera pose from landmarks on the soccer field.
• Generative Query Network (Eslami et al. 2018)
• A representation network produces a vector that describes the observations.
• A generation network predicts the scene from an unobserved viewpoint.
FVV Synthesis with a Monocular Video (cont’d)
Procedures of the Soccer on Your Tabletop system; input and output results of GQN
13. • User-generated content (UGC) is content created on the user side.
• It rose to prominence with Web 2.0.
• It ranges from simple photo sharing to 360-degree VR content.
• The impact of UGC lies in delivering a tremendous amount of content.
User-generated Content
14. • Most FVV synthesis systems have problems:
• They require multiple imaging devices.
• Their configuration is not end-user friendly.
• They cannot use existing content.
• The goal of our research is to develop an end-to-end system that synthesizes pseudo-2.5D free-viewpoint content from monocular videos for creating and disseminating UGC.
Problem Definition and Research Objective
System overview: monocular video → ParaPara → pseudo-2.5D FVV content
18. • A hybrid module (a DNN combined with common image processing algorithms) that synthesizes an FVV scene from monocular videos.
Scene Synthesizer
19. • OpenPose (DNN-based) is used to detect persons in the video sequence.
• Two modes (normal mode and precision mode) are provided, based on the network input resolution.
• To compensate for detection failures, each detected person is tracked across frames.
• Each bounding box is refined from the detected joints (see the sketch below).
Person Detecting and Tracking
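As an illustration of the joint-based box refinement, here is a minimal sketch assuming OpenPose-style (x, y, confidence) keypoints; the confidence threshold and padding margin are hypothetical parameters, not values from the system:

```python
import numpy as np

def refine_bounding_box(keypoints, conf_thresh=0.3, margin=0.1):
    """Refine a person's bounding box from detected 2D joints.

    `keypoints` is an (N, 3) array of (x, y, confidence) joints in the
    OpenPose output format; low-confidence joints are ignored.
    """
    valid = keypoints[keypoints[:, 2] > conf_thresh]
    if len(valid) == 0:
        return None  # detection failure: fall back to the tracker
    x_min, y_min = valid[:, :2].min(axis=0)
    x_max, y_max = valid[:, :2].max(axis=0)
    # Pad the joint extent slightly so the box covers the full silhouette.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)
```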
20. • With minimal user interaction, a homography matrix (from the image plane to the real-world ground plane) is calculated.
• The pseudo-3D position is computed by applying the homography to the ankle positions of each detected person (see the sketch below).
Pseudo-3D Position Estimation
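A minimal sketch of this step with OpenCV, assuming the user clicks four ground-plane points whose real-world coordinates are known; the point values below are made-up examples:

```python
import cv2
import numpy as np

# Four ground-plane points clicked by the user in the image, paired with
# their known real-world coordinates in meters (example values only).
image_pts = np.float32([[412, 633], [1507, 641], [1215, 402], [705, 398]])
world_pts = np.float32([[0, 0], [20, 0], [20, 10], [0, 10]])

# Image-to-world homography from the minimal user interaction.
H, _ = cv2.findHomography(image_pts, world_pts)

def pseudo_3d_position(ankle_xy, H):
    """Project a detected person's ankle pixel onto the ground plane."""
    point = np.float32([[ankle_xy]])                  # shape (1, 1, 2)
    gx, gz = cv2.perspectiveTransform(point, H)[0, 0]
    # The person is assumed to stand on the ground, so height is 0.
    return np.array([gx, 0.0, gz])
```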
21. • Persons' textures are extracted by KNN-based background subtraction within the refined bounding boxes (see the sketch below).
• A 1-D mean filter smooths the extracted contours.
• A texture-size correction method minimizes the distortion caused by perspective error.
Texture Extraction and Size Correction
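A sketch of the extraction step with OpenCV 4.x; the filter width and minimum contour length are hypothetical parameters:

```python
import cv2
import numpy as np

# KNN-based background subtractor used for texture extraction.
subtractor = cv2.createBackgroundSubtractorKNN(detectShadows=False)

def smooth_contour(contour, k=5):
    """1-D mean filter over contour points, wrapped so the loop stays closed."""
    pts = contour.reshape(-1, 2).astype(np.float32)
    kernel = np.ones(k) / k
    for axis in range(2):
        padded = np.concatenate([pts[-k:, axis], pts[:, axis], pts[:k, axis]])
        pts[:, axis] = np.convolve(padded, kernel, mode="same")[k:-k]
    return pts.reshape(-1, 1, 2).astype(np.int32)

def extract_texture(frame, bbox, min_contour_len=15):
    """RGBA person texture from the foreground mask inside the refined bbox."""
    foreground = subtractor.apply(frame)
    x0, y0, x1, y1 = map(int, bbox)
    mask = foreground[y0:y1, x0:x1]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    clean = np.zeros_like(mask)
    for c in contours:
        if len(c) >= min_contour_len:  # drop tiny noise blobs
            cv2.drawContours(clean, [smooth_contour(c)], -1, 255, -1)
    # Alpha channel carries the silhouette: transparent outside the person.
    rgba = cv2.cvtColor(frame[y0:y1, x0:x1], cv2.COLOR_BGR2BGRA)
    rgba[:, :, 3] = clean
    return rgba
```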
22. • Each texture is placed in the 3D world at its calculated pseudo-3D position (a minimal sketch of this assembly step follows below).
• The extracted background image or a custom image is used as the ground texture.
Scene Synthesis
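To make the assembly concrete, here is a hypothetical container that ties the previous sketches together; `detections`, `det.bbox`, `det.ankle_xy`, and `det.size_m` (the corrected texture size) are assumed outputs of the earlier steps, not names from the actual system:

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class Billboard:
    texture: np.ndarray                    # RGBA person texture
    position: Tuple[float, float, float]   # pseudo-3D position (x, 0, z)
    size: Tuple[float, float]              # corrected width/height (meters)

@dataclass
class Scene:
    ground_texture: np.ndarray             # extracted background or custom image
    billboards: List[Billboard] = field(default_factory=list)

def synthesize_frame(frame, detections, H, ground_texture):
    """Assemble one FVV frame, reusing the sketch functions defined above."""
    scene = Scene(ground_texture=ground_texture)
    for det in detections:
        texture = extract_texture(frame, det.bbox)       # texture extraction
        position = pseudo_3d_position(det.ankle_xy, H)   # ground-plane position
        scene.billboards.append(Billboard(texture, tuple(position), det.size_m))
    return scene
```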
24. • Plays content produced by the scene synthesizer.
• Runs on a mixed reality head-mounted display.
• Provides interaction between the user and the content.
Content Player
25. • Generated content is displayed in the real world at 30 fps.
• Spatial mapping allows the content to be freely placed on real-world objects.
• A head-related transfer function (HRTF) based spatial sound method synthesizes directional sound according to the position of the content.
Playing Synthesized Content
26. • The textures are extracted from a monocular video, so they contain no information that was not captured in the original video.
• Axial billboard rendering about the y axis minimizes the unnaturalness of the generated content's 2D textures when the viewpoint changes (see the sketch below).
Billboard Rendering
w/o billboard rendering w/ billboard rendering
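A minimal sketch of the y-axis (axial) billboard constraint: the quad rotates about the vertical axis only, so the texture stays upright while always facing the camera horizontally:

```python
import numpy as np

def y_axis_billboard_yaw(object_pos, camera_pos):
    """Yaw (radians) that turns a quad about the y axis toward the camera."""
    to_camera = np.asarray(camera_pos, float) - np.asarray(object_pos, float)
    # Drop the y component so the quad only spins about the vertical axis,
    # keeping the 2D person texture upright.
    return float(np.arctan2(to_camera[0], to_camera[2]))
```

In an engine such as Unity, the returned yaw (converted to degrees) would be applied each frame as a rotation about the y axis only.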
28. • Accuracy of Depth Estimation
• Accuracy of Texture Extraction
• Processing Speed
Evaluation Metrics
29. • Ground truth is generated under short-range (5 participants) and long-range (virtual environment) conditions.
• The average z-axis error is calculated.
• Evaluation function: mean absolute error (written out below).
Accuracy of Depth Estimation
Short-range condition; long-range condition
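Written out, the evaluation function above is the mean absolute error over the depth axis, where $z_i^{\mathrm{est}}$ is the estimated depth, $z_i^{\mathrm{gt}}$ the ground truth, and $N$ the number of samples:

```latex
\mathrm{MAE}_z = \frac{1}{N} \sum_{i=1}^{N} \left| z_i^{\mathrm{est}} - z_i^{\mathrm{gt}} \right|
```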
30. • Average z-axis error (short range): 24.57 cm
• Average z-axis error (long range): 76.04 cm
• The error increased as the model moved away from the camera.
• As the distance in the ground-truth images increases, the quantization error also increases.
Accuracy of Depth Estimation
31. • The proposed method (background-subtraction-based) is compared against Mask R-CNN (state-of-the-art, DNN-based), which serves as ground truth.
• Mask IoU = 0.72 (SD = 0.05).
• An IoU score above 0.5 is normally considered a "good" prediction; a minimal implementation of the metric appears below.
Accuracy of Texture Extraction
(a) Mask R-CNN (blue region), (b) ours (red region)
(c) overlapped two methods (purple region is the intersection area).
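Mask IoU here is the standard region-overlap measure between the two segmentation masks; a minimal numpy version:

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over Union of two boolean segmentation masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both masks empty
    return float(np.logical_and(a, b).sum()) / float(union)
```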
32. • The average processing speed is approximately 180 ms per frame (450 ms in precision mode).
• The proposed texture extraction method is faster than Mask R-CNN (average processing time with Mask R-CNN: 2683 ms).
• By combining a DNN with common image processing algorithms, the system runs faster than fully DNN-based systems.
Processing Speed
Average processing time per frame: 179.36 ms (normal mode), 448.47 ms (precision mode).
Processing time of the texture extraction procedure: 41.43 ms (proposed method), 2545 ms (Mask R-CNN).
34. • Evaluates the effectiveness of the content synthesized by the proposed system against the original monocular videos.
• Three comparison conditions (C1, C2, C3).
• Twelve participants (three female, mean age = 26, SD = 9.23).
• A 5-point Likert-scale questionnaire was used (1 = strongly disagree to 5 = strongly agree).
Experiment Design
35. • Users can visually recognize spatial information more easily with the proposed method (C3) than with C1 and C2 (p ≤ 0.001).
• Users could feel a stereoscopic effect in the content created by the proposed system, compared with the original monocular video.
Visual Depth Perception (Stereoscopy)
Conditions: C1 = monocular video + 2D display; C2 = monocular video + MR HMD; C3 = synthesized content + MR HMD (marked with a red box in the chart).
36. • A significant difference was observed between C3 (the proposed method) and the other conditions (p ≤ 0.001).
• Content generated by the proposed system increases the user's immersion.
Degree of Immersion (Immersion)
Conditions: C1 = monocular video + 2D display; C2 = monocular video + MR HMD; C3 = synthesized content + MR HMD (marked with a red box in the chart).
37. • A significant difference was observed between C3 (the proposed method) and the other conditions (p ≤ 0.001).
• C3 provided the most interesting experience among the methods.
• The stereoscopic experience elicited positive responses from the subjects.
Attractiveness
Conditions: C1 = monocular video + 2D display; C2 = monocular video + MR HMD; C3 = synthesized content + MR HMD (marked with a red box in the chart).
39. • The user can easily convert a monocular sports video into immersive sports content.
Immersive Sports Broadcasting
Input video (ISSIA-CNR dataset); synthesized content
40. • Various entertainment videos on the Internet can be converted into attractive FVV content with ParaPara.
Dynamic Entertainment Content
Input video Synthesized content
41. • ParaPara can synthesize multiple videos into a single scene.
• The user can intuitively perceive spatial information and track a target as it moves from one camera's viewpoint to another's.
Effective Surveillance System
Synthesized content from CMUSRD dataset
43. • High versatility and low cost
• Creates FVV content from monocular videos.
• High usability
• End users without expertise can use the system.
• Reasonable quality
• Monocular videos can be converted into immersive content.
• Fast processing speed
• The system is faster than fully DNN-based systems.
Advantages of ParaPara
44. • Limited camera posture
• Only fixed-viewpoint videos are supported.
• Pseudo-3D position
• The y-axis (height) position cannot be estimated, since persons are assumed to stand on the ground plane.
• Texture artifacts
• Detection failures cause artifacts.
• 2D textures
• Information about surfaces not facing the camera cannot be restored.
Technical Challenges and Limitations
45. • Applying deep neural networks to a wider range of procedures:
• Depth estimation.
• Camera pose estimation.
• Recovering lost information:
• Converting a detected person's silhouette into a fitted 3D model.
• Using generative models (GANs, autoencoders, etc.) for texture recovery.
Future Work
Pipeline of Photo Wake-Up (2018); Warping-GAN (2018)
47. Summary
Problems
• Requiring multiple imaging devices.
• Non-end-user-friendly system configuration.
• Inability to use existing content.
Research Goals
• Creating FVV content without multiple imaging devices.
• Increasing the system's usability.
• Utilizing existing content and equipment.
Contributions
• ParaPara, an alternative system that synthesizes FVV content from single or multiple monocular videos,
• performance evaluations of the proposed system,
• a user study assessing the usability of the synthesized content, and
• sample applications for which the proposed system is well suited.
Part of this work has been presented in the ACM CHI 2018 Extended Abstracts:
ParaPara: Synthesizing Pseudo-2.5D Content from Monocular Videos for Mixed Reality. Dong-Hyun Hwang and Hideki Koike.
This work was supported in part by JST CREST Grant Number JPMJCR17A3, Japan, "A study on skill acquisition mechanism and development of skill transfer systems."