One of the most exciting promises of Virtual Reality is to take people to places and events they can't otherwise go. Whether it's wandering among the ruins at Machu Picchu, being at the recording of a top TV show, at a gig by your favourite artist, in front-row seats at the FIFA final or standing on the red carpet at the Oscars, VR has the potential to make you feel like you're truly there, possibly in real time as the events unfold.
Delivering on this promise with current technology introduces a number of challenges: the resolution of cameras and headsets, the bandwidth available to consumers, and how to generate a real sense of presence.
This talk will provide an overview of the current live VR video pipeline, then dig into technical detail on two or three key areas:
* Live Stream Compression (codecs, stereo, projections and packing schemes)
* Dealing with bandwidth challenges for live stream upload and consumer download
* Future opportunities from stereo depth estimation and freeing head position
6. What is Presence?
Wikipedia: defined as a person's subjective sensation of being there in a scene depicted by a medium
Michael Abrash: “Presence is VR Magic… it engages you at a deeper, more visceral level than any other form of entertainment”
7. Presence Requirements
Feature                  | VR Today   | Human Perception
Field of View (per eye)  | ~80° x 90° | 160° x 130°
Acuity (pixels / degree) | 12 - 18    | ~60 (and True HD)
Resolution (per eye)     | ~1k x 1k   | ~10k x 8k
Refresh Rate             | 90 Hz      | 120 Hz ?
Tracking / latency       | 5 - 20 ms  | 4 ms ?
Michael Abrash at Steam Dev Days 2014
http://media.steampowered.com/apps/abrashblog/Abrash%20Dev%20Days%202014.pdf
8. Video Mechanics - Capture
Samsung Beyond, iZugur Z63DC
Google Jump / GoPro Odyssey
10. Stitching / Projection
Stitch images together
To map onto a sphere surrounding viewer
Just like map projection in geography
Most common is equirectangular projection
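The mapping the slide describes can be sketched directly: a unit view direction converts to longitude and latitude, which index into the equirectangular frame. This is a minimal sketch; the axis convention (y up, -z straight ahead) is an assumption, not something the slides specify.

```python
import math

def equirect_uv(x, y, z):
    """Map a unit view direction (x, y, z) to equirectangular
    texture coordinates (u, v) in [0, 1].
    Assumes y is up and (0, 0, -1) is the image centre."""
    lon = math.atan2(x, -z)                   # longitude in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y)))   # latitude in [-pi/2, pi/2]
    u = 0.5 + lon / (2 * math.pi)
    v = 0.5 - lat / math.pi
    return u, v

# Looking straight ahead lands in the middle of the frame
print(equirect_uv(0.0, 0.0, -1.0))  # (0.5, 0.5)
```

Rendering in the headset is just the inverse: for each screen pixel, compute its view direction and sample the frame at (u, v).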
16. Market changing fast
Capture
Huge variety of cameras
No camera meets all needs
Next VR, Nokia, Samsung, GoPro, Ricoh, Kodak, Sphericam, Vuze
Stitching and Projection
Some cameras have it built in
VideoStitch have Vahana
Broadcast
YouTube, Facebook and many video streaming companies
17. Videos Today
Max resolution 4k x 2k video
Mix of mono and stereo
Almost all using equirectangular
18. Challenges
Many choices
Capture quality
Dynamic range
Resolution / Bandwidth
Head Movement
Stereo Quality
…
20. Resolution / Bandwidth
4k video is normally 3840 x 2160 x 8 bit
H.264 good quality 18 mbps
Bandwidth for 1 hour of video at 18 mbps
60 * 60 * 18 / 8 = 8 gigabytes
For 100,000 viewers
8 GB * 100000 = 800 terabytes
Bandwidth might be 5p per GB
Cost = 0.05 * 800,000 = £40,000
20x cost of equivalent SD broadcast (4x 1080p)
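The slide's arithmetic can be checked with a quick script. The bitrate, viewer count and 5p/GB price are the slide's own assumptions; the slide rounds the results to 8 GB, 800 TB and £40,000.

```python
# Back-of-envelope cost of streaming 1 hour of 18 Mbps video.
bitrate_mbps = 18
seconds = 60 * 60
gb_per_viewer = bitrate_mbps * seconds / 8 / 1000  # megabits -> gigabytes

viewers = 100_000
price_per_gb = 0.05  # 5p

total_gb = gb_per_viewer * viewers
cost = total_gb * price_per_gb
print(gb_per_viewer, total_gb / 1000, cost)
# ~8.1 GB per viewer, ~810 TB total, ~£40,500
```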
21. Target is Headset Resolution
Gear VR has highest pixel density
H.FoV = 72.9° & H.Res = 1280
~17.5 pixels per degree
Target resolution ~6.3k x 3.2k per eye
Many H.264 codecs won’t handle this
4K video on Gear VR gives
~10.5 pixels per degree horizontally
~5.4 vertically
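The Gear VR target reduces to a short calculation. The 72.9° horizontal FoV and 1280-pixel figures are taken from the slide; this is just a sketch of the arithmetic, not a measurement.

```python
# Pixels per degree on Gear VR, and the 360 x 180 video resolution
# needed to match it.
h_fov_deg = 72.9
h_res = 1280
ppd = h_res / h_fov_deg       # ~17.6 pixels per degree

target_w = 360 * ppd          # -> "~6.3k"
target_h = 180 * ppd          # -> "~3.2k"

# What a 3840-wide 360° video actually delivers horizontally
ppd_4k = 3840 / 360
print(round(ppd, 1), round(target_w), round(target_h), round(ppd_4k, 1))
# 17.6 6321 3160 10.7
```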
23. Technical Challenge #1
Bandwidth / Resolution
Native headset resolution video in Stereo
Equivalent quality to 18 Mbps 4K video
But at much lower bandwidth – ideally 3-4 Mbps
24. Look for Redundancy
Native resolution 17.5 pixels per degree
Equirectangular texture = 6.3k x 3.2k x 2
Notice how stretched it is at poles
25. Look for redundancy – Projection
Native resolution for Gear VR
6k x 3k x 2 => ~40 Megapixels for stereo pair
Actual pixels needed is much less
Surface of a sphere with circumference 6k (area = circumference² / π)
~24.5 Megapixels (equirectangular wastes ~60% extra pixels)
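The redundancy claim follows from geometry: a sphere of circumference C has surface area C²/π, while the 2:1 equirectangular texture covers C²/2, so equirectangular carries π/2 ≈ 1.57x the ideal pixel count. A quick check (the slide rounds this ~57% overhead up to ~60%):

```python
import math

width = 6300                       # pixels round the equator (~6.3k)
height = width // 2                # 2:1 equirectangular
equirect_px = width * height * 2   # stereo pair -> ~39.7 MP

ideal_px = width**2 / math.pi * 2  # sphere area for circumference C is C²/π
waste = equirect_px / ideal_px - 1 # equals pi/2 - 1
print(round(equirect_px / 1e6, 1), round(ideal_px / 1e6, 1), round(waste * 100))
# 39.7 25.3 57
```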
26. Why use equirectangular?
Pros
Plenty of software out there to generate it
Fairly simple to render
Creates one continuous rectangular array
Simple for highly optimised video codecs
Cons
requires 60% extra pixels to achieve equivalent quality
Big distortion – straight lines become curves
Video codecs optimised for straight lines
Rendering artefacts caused by non-linearity
27. Are there alternatives?
Cube-maps?
+ Minimal distortion – straight lines stay straight
+ Hardware accelerated rendering
- nearly 2x pixels of ideal minimum
Pyramids?
Facebook have blogged about pyramids
Cube-maps in disguise
5 planar projections instead of 6
Compress more efficiently
Problem is as old as astronomy
28. Optimise Equirectangular?
Too much horizontal resolution at poles
Resolution is about 2x above 60 degrees
Chop the top and bottom off and halve their width
29. Optimise Equirectangular
Halve width of polar regions
Removes 30/180 of image => 5/6 * 40 = ~34 Megapixels
Now we’re only 35% worse than ideal
General lesson
We can divide sphere into regions
Change projection and resolution
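The saving from halving the polar bands tallies the same way. The figures follow the slides (a ~40 MP stereo pair, a ~24.5 MP ideal minimum); the slide rounds the result to ~34 MP and ~35% overhead.

```python
# Megapixels after halving the width of the bands above 60° latitude.
full_mp = 40.0                 # stereo equirect pair at ~6.3k x 3.2k
polar_fraction = 60 / 180      # top 30° + bottom 30° of latitude
saved = polar_fraction / 2     # those bands keep half their width
optimised_mp = full_mp * (1 - saved)   # 5/6 of 40

ideal_mp = 24.5                # sphere-surface minimum from earlier
overhead = optimised_mp / ideal_mp - 1
print(round(optimised_mp, 1), round(overhead * 100))  # 33.3 36
```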
30. Can we do better?
Divide into multiple regions
Remove the downward view
Vary resolution
Base on projection
And area of interest
31. Other Options
Lots of redundancy between left / right eye
Stereo aware compression as in 3D movies
Reduction can be as much as 60%
Viewer often cares about one direction much more than another
Broadcast of this event, screens and speaker more important
Give them more bandwidth
Reduce resolution of off directions or reduce codec quality
Send area around direction user is looking
Minimise switching latency
Better codecs
H.265/HEVC – 50% if you’re lucky
32. The Future
This Year
1k x 1k per eye
3 years
2k x 2k per eye (4k screens here now)
5+ years
4k x 4k per eye (wider field of view?)
Human vision Target per eye
8k x 8k may be sufficient?
34. Stereo VR Videos
Effectively a video for each eye
Parallax comes from camera positioning
Packed vertically (left = top, right = bottom)
Much stronger sense of presence
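Vertical packing means each decoded frame simply splits in half by rows. A minimal illustration with a plain 2-D array (the 3840 x 2160 frame dimensions are illustrative):

```python
# A top/bottom packed stereo frame: the top rows are the left eye,
# the bottom rows the right eye.
packed_h, width = 2160, 3840
frame = [[0] * width for _ in range(packed_h)]  # one greyscale frame

half = packed_h // 2
left_eye = frame[:half]     # top half  -> left eye
right_eye = frame[half:]    # bottom half -> right eye
print(len(left_eye), len(right_eye), len(left_eye[0]))  # 1080 1080 3840
```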
37. Stitch and Project
Add a camera top and bottom
Stitch all the left eyes together
Stitch all the right eyes together
Stereo Vision
38. Truth about 3D VR Video
Creates a convincing sense of depth
Increases sense of presence
This is good. Yay!
39. Truth about 3D VR Video
Up and down are mono
Unavoidably – look up, turn 90°, look up again
Effective stereo separation varies with viewing angle
40. Truth about 3D VR Video
No toe in
Human eyes track together
Don’t look straight forward
This impacts all VR for now
41. Truth about 3D VR Video
Camera is fixed position
Don’t move your head
Camera pairs have fixed separation and orientation
Don’t roll/tip your head
42. Truth about 3D VR Video
Camera positions fixed
Position
Roll
IPD (based on view angle)
Perfect when eyes aligned with camera
Less perfect elsewhere
More cameras and clever processing can improve
Still limited by fixed view in each half of stereo video
43. What can be done?
Need more 3D information
Depth and Occlusion
Reconstruct view each frame
44. Reconstruction
With depth and occlusion (geometry)
Generate right eye from left
Correct stereo for up and roll
Reconstruct different positions and orientations
Some head movement
45. Practical?
Challenging computer vision problem
Probably not full-scene in real-time yet
Multiple inward facing cameras
Motion capture suites
Potentially laser scan fixed scene in advance
Capture foreground objects live
Examples from HoloLens, 8i and others
Specular lighting difficult to reconstruct
I’m Jules, from Focal Point VR. I seem to have the last slot before lunch. I can’t promise to be scintillating but shall try for pace.
One way of describing what VR is for is teleportation
To imaginary worlds
Or for video; it’s teleportation to somewhere real
For live video, it’s presence at an event or experience
Where time and place both matter
Presence matters
Presence is the convincing illusion of being there.
Difficult to define but people know it when they experience it.
Much research has been done on what is needed to create this illusion. This is a subset. VR today is often hitting the minimum, in the long term we’ll be aiming for human perception.
We may return to this later.
VR Video starts with capture.
There are a variety of cameras, from the handmade to the fully productised, and from cheap consumer to full professional.
None of them are perfect, but in the end you have video coming from multiple cameras.
Here’s some example frames from a 12 camera rig.
5 stereo pairs, one up and one down.
The goal is to combine them into a single image for each eye. In this case an equirectangular projection, like many maps of the world.
When we come to render the view in the headset, it's easy to project this onto the inside of a globe.
Stitching process reverses distortion from a wide angle or fish-eye lens.
It stitches the images together and applies projection.
For Live video we have to do this at least in real-time
And for stereo we do it once for left eye cameras and again for right eye cameras.
Then compress and upload to the internet.
Once in the cloud, the video is downloaded or streamed to a headset
A sphere is placed around each eye.
Each view direction inside the sphere can be projected onto a video frame.
Now we have a VR video.
For live, the whole pipeline has to happen in real-time and with minimal latency.
Things are changing fast, new solutions coming up all the time.
There are many consumer quality options, but not really any simple options to achieve really high quality.
[Soon, consumers will be able to broadcast with a cheap camera and youtube or facebook.]
Today videos normally max out at 4k x 2k
With a few exceptions they use equirectangular
And for stereo, two views are just squashed in vertically
If the goal is teleportation, to bring the visceral feeling of presence at a live event, then we have many challenges.
I’m going to try to talk a little about two.
First problem is simply the reality of 4k video and beyond
4k broadcast is pretty expensive. It can be 20x cost of equivalent standard definition video
Maybe we can accept lower quality, perhaps half bandwidth.
Ideally, we should have our video at headset resolution.
For Gear VR, that’s about 17.5 pixels per degree.
A video resolution of, not 4k x 2k, but about 6k by 3k per eye.
Most video players max out at 4K, much lower than target
Blurred images are not so noticeable at 10 meters
But...when it’s literally on your face it definitely impacts presence
So, technical challenge 1
We want native resolution, today 6k x 3k
But we want low bandwidth
Ideally much lower than 4k
How can this be achieved?
My first thought: what information can we throw away?
The equirectangular projection has 1:1 pixels at the equator, but stretches enormously as you approach the poles
As a projection equirectangular has plenty of advantages
But do the maths and we have 60 percent more pixels than we need
Unfortunately no simple projection can achieve this ideal from sphere surface to rectangle
Equirectangular is simple. Pretty easy to understand but it has some costs.
The distortion costs bandwidth. Both in the codec and in number of pixels we start with.
There are many map projections and many trade offs. Planar, cylindrical and azimuthal.
This is an old problem, and there’s been plenty of recent research
Equirectangular is great at the equator
How about we just try to fix it at the poles (up and down)
In this example, we just chop the bottom and top sixth off
And pack them side by side.
No loss of resolution, they were really stretched already.
This solution knocks a sixth off our area
Now we’re only 35% worse than the ideal
So there’s a lesson here. We can chop the sphere up into pieces and change the projection.
In this example keep native resolution for the centre of interest
But reduce it where projection is bad
Or where it’s less interesting, like the floor
This example reduces size to less than 40% of original
There’s plenty of redundancy here and many options to capitalise on it. Reducing bandwidth by a factor of 10 is probably achievable. Which would be a bandwidth of 5 mbps for native resolution on a Gear VR.
Today if you sit about 2 meters from a 55” True HD screen, then the pixels are at the limit of human vision.
Higher resolution is not a win.
To get the equivalent from VR it would require more than 8k per eye.
Reference keynote
On another challenge. 3D in VR videos
We achieve stereo by having a video for each eye.
The win is much greater sense of presence.
Our key goal
Stereo 360 camera seems pretty simple. Place cameras at the eye positions.
Have multiple pairs of cameras
Stitch them together
And you're done
And this works
Convincing sense of depth
And presence
But....there are a few problems
Up and down are mono. They have to be.
And actually stereo separation varies with viewing angle as you turn your head