A talk from the Tools & Products Track at AWE USA 2017, the largest conference for AR+VR, held in Santa Clara, California, May 31 to June 2, 2017.
Hiren Bhinde (Qualcomm): On-device Motion Tracking for Immersive VR
As the industry strives toward immersive augmented and virtual reality experiences, we are guided by the extreme requirements associated with intuitive interactions and visual and sound quality, in order to achieve the ultimate untethered user experience. Precise, low-latency motion tracking of head movements is crucial for intuitive interactions with the virtual world, and visual-inertial odometry (VIO) is the ideal complementary subsystem to achieve this goal. VIO enables six degrees of freedom in VR/AR experiences, reduces latency, and cuts the cord. In this session, developers will learn about the evolution of motion tracking and dive into six degrees of freedom and its impact on VR/AR content development and user experiences.
http://AugmentedWorldExpo.com
1. On-device motion tracking for immersive mobile VR
Hiren Bhinde
XR Product Manager
Qualcomm Technologies, Inc.
June 2, 2017
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc.
2.
VR will provide the ultimate level of immersion
Creating physical presence in real or imagined worlds: intuitive interactions required
Interactions: so intuitive that they become second nature
Sounds: so accurate that they are true to life
Visuals: so vibrant that they are eventually indistinguishable from the real world
3.
3-DoF vs. 6-DoF
3 degrees of freedom (3-DoF)
• “In which direction am I looking?”
• Detect rotational head movement
• Look around the virtual world from a fixed point
6 degrees of freedom (6-DoF)
• “Where am I and in which direction am I looking?”
• Detect rotational movement and translational movement
• Move in the virtual world like you move in the real world
4.
6-DoF allows developers to bring the user into their story
3 degrees of freedom (3-DoF)
• Can only watch
6 degrees of freedom (6-DoF)
• Full immersion
• Can become part of the story
• Can now interact and change the story
5.
6-DoF motion tracking evolution
• 2014: Outside-in. External sensors, markers, cameras, or lasers to be set up throughout the room.
• 2017: Monocular fisheye camera. Inside-out solution requiring no room setup.
• 2017: Stereo fisheye camera. More accurate motion tracking for immersive interactions (today’s discussion).
• Future: Room-scale VR. Object detection, boundary augmentation, and safety.
• Future: World VR. Limitless movement and robust safety.
8.
Mobile 6-DoF: “Inside-out” tracking
Visual inertial odometry (VIO) for rapid and accurate 6-DoF pose
Inputs:
• Mono or stereo camera data, captured from the tracking camera image sensor at ~30 fps
• Accelerometer & gyroscope data, sampled from external sensors at 800/1000 Hz
Snapdragon “VIO” subsystem: camera feature processing and inertial data processing, with Qualcomm® Hexagon™ DSP algorithms for:
• Camera and inertial sensor data fusion
• Continuous localization
• Accurate, high-rate “pose” generation & prediction
Output: 6-DoF position & orientation (aka “6-DoF pose”), so the new frame is accurately displayed
Qualcomm Hexagon is a product of Qualcomm Technologies, Inc.
9.
Minimizing motion to photon (MTP) latency is crucial
~15-16 ms MTP latency on the Qualcomm® Snapdragon™ 835 Mobile Platform shows our mobile VR leadership
[Figure: side-by-side comparison of low latency vs. noticeable latency]
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc.
10.
Low latency mobile 6-DoF inside-out tracking
Many workloads must run efficiently for an immersive VR experience:
1. Motion detection (“motion”): sensor sampling, sensor fusion
2. Visual processing: view generation, render/decode
3. Display (“photon”, the new pixels’ light emitted from the screen): image correction, quality enhancement and display
Total time (motion to photon latency) for all steps above must be less than 20 milliseconds
11.
Stereo VIO for rapid, robust, and accurate 6-DoF pose
Sensor fusion of stereo camera features and high-rate IMU data
Benefits of stereo (wide-angle lenses) over monocular 6-DoF:
• Instant, accurate scene depth
• Faster initialization
• Better performance with quick and rotational motions
• Improved tolerance to camera occlusion
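The “instant accurate scene depth” benefit follows from classic stereo triangulation: a feature matched between the left and right images yields metric depth from a single stereo frame, with no camera motion required. A minimal sketch using the standard pinhole-stereo formula (the numbers below are illustrative, not from the talk):

```python
def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Pinhole stereo triangulation: depth = f * B / d. A feature's
    horizontal shift (disparity) between the left and right images gives
    metric depth from a single stereo frame, with no camera motion,
    which is why stereo initializes faster than monocular VIO."""
    if disparity_px <= 0:
        raise ValueError("feature must be matched with positive disparity")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 450 px focal length, 6.4 cm baseline, 12 px
# disparity -> 450 * 0.064 / 12 = 2.4 m to the feature.
print(stereo_depth_m(450.0, 0.064, 12.0))
```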
12.
Some key things to consider
Developing 6-DoF content
Scale and space
• Issue: Untethered mobile allows for limitless movement, but available space could change between runs of the app
• DO consider limits to the amount of full-body movement
• DO provide movement alternatives
Tracking
• Issue: Device cameras can become occluded or the environment can be too dark
• DO fade to black or a fixed image in the event of lost head tracking to avoid nauseating jumps and judder
• DON’T fall back to tracking only orientation (3-DoF). Jumps in position or seeing the virtual world respond differently to movement can be uncomfortable for users
Storytelling
• Issue: Users may not be looking or positioned in the desired location to advance the story
• DO use audio and visual cues to guide user focus
• DON’T take over control of the virtual camera in order to force focus on a story element
13.
How to get started with 6-DoF for mobile VR
Snapdragon VR SDK: access to advanced VR features to optimize applications and simplify development
Snapdragon 835 VR HMD: accelerating the development of standalone head-mounted displays
URL: developer.qualcomm.com
Tools & resources to help developers build, integrate, and optimize
VR will provide the ultimate level of immersion. For virtual reality to be truly immersive and live up to its name, it needs to create physical presence in real or imagined worlds. VR does this by stimulating our human senses with realistic feedback (inputs and responses).
Humans are very perceptive at noticing things that look wrong, sound wrong, feel out of place, or do not behave in a natural way – and this takes us out of the moment.
When the sound is right and matches the visual, you are truly immersed in the scene. In contrast, when the sound is wrong – think of a movie with sound dubbing where audio mismatches the lip movements – we immediately know and focus on it. For 3D surround sound, we have a similar experience of feeling the direction of the sound and whether it is appropriate for the visual being shown. Virtual reality is the ultimate test for this. Our brain will tell us when the visuals and sounds are right or wrong, ultimately determining the level of immersion we feel.
To create truly immersive experiences, VR places high emphasis on the three pillars of immersion to stimulate our senses, providing:
Visuals so vibrant and so realistic that they are eventually indistinguishable from the real world
Sounds so accurate that they are true to life
Interactions so intuitive that they become second nature and you forget there is even an interface.
Intuitive interactions keep us immersed in the moment. For a VR headset, your primary control is actually your head movement. What’s more natural than just looking around? You might also use gestures or a control pad to provide other inputs. As we will get to later, one of the biggest challenges for VR is the response time from your head moving to the display actually being updated, which is known as “motion to photon”.
It is the level of immersion that significantly sets VR apart from anything else people have experienced before – and it is why we are so excited about VR’s potential. VR will take our experiences to the next level, not by an incremental amount but by a huge amount.
Precise (and of course low latency) motion tracking of head movements is crucial for accurate and intuitive interactions with the virtual world. Precise is key for both large and small movements. If you turn your head to explore the scene, you want very accurate control. Similarly, if your head is stable and not moving, you want the visuals to stay completely still, otherwise you might feel like you are on a boat in an unstable virtual world.
Motion is often characterized by how many degrees of freedom are possible in movement: either 3 degrees of freedom or 6 degrees of freedom.
3-DoF answers “In which direction am I looking?”. This means that you can detect rotational movement around the X, Y, and Z axes – the orientation. For your head movements, that means being able to rotate, nod, and bend your neck/head, while keeping the rest of your body still.
3-DoF is very beneficial in VR. You will be able to look around the virtual world from fixed points – think of a camera on a tripod. For many 360° spherical videos, this will provide very immersive content, such as viewing sporting events or nature.
[Click]
6-DoF answers “Where am I and in which direction am I looking?”. It means that you can detect rotational movement and translational movement – the orientation and the position. This means that your body can now move from fixed viewpoints in the virtual world – it can move in the X, Y, and Z directions.
6-DoF is very beneficial in VR. Gaming is the obvious use case where you can move freely in the virtual world and look around corners. However, even simple things, like looking at objects at your desk or shifting your head side-to-side can be compelling with 6-DoF.
Tower defense games with positional tracking are compelling -- leaning in works well to zoom.
Looking at side view mirrors in a car racing game.
6-DoF is more realistic because it captures our real movement and does not cause a sensory conflict between our vision and vestibular system.
360° spherical video that supports 6-DoF will also be very compelling. The challenge is creating that content, which is an area of active research.
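To make the distinction concrete in code, here is a minimal sketch of the data each tracking mode reports; the types and the camera object are hypothetical, not from any Snapdragon API. 3-DoF carries orientation only, while 6-DoF adds a position vector that lets the viewpoint move.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Pose3DoF:
    """Orientation only: which direction the user is looking.
    Unit quaternion as (w, x, y, z)."""
    orientation: Tuple[float, float, float, float] = (1.0, 0.0, 0.0, 0.0)

@dataclass
class Pose6DoF(Pose3DoF):
    """Orientation plus position: where the user is AND where they look."""
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # meters

def apply_head_pose(camera, pose):
    """Drive the virtual camera from a tracked pose (camera is hypothetical)."""
    camera.set_rotation(pose.orientation)
    if isinstance(pose, Pose6DoF):
        # Translation is reflected in the view: lean, crouch, step, walk.
        camera.set_translation(pose.position)
    # With plain 3-DoF the viewpoint stays pinned, like a camera on a
    # tripod: the user can look around but cannot move through the scene.
```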
Visual-inertial odometry, or VIO, provides precise on-device motion tracking. By using the on-device camera and inertial sensors, a 6-DoF pose can be generated. This allows you to have those immersive experiences we just talked about, plus the ability to use the VR headset anywhere – and it is much less expensive than other solutions.
So how does VIO work?
VIO estimates relative position and orientation of a moving device in an unknown environment using a camera and motion sensors.
VIO takes advantage of the complementary strengths of the camera and inertial sensors. For example, a single camera can estimate relative position, but it cannot provide absolute scale—the actual distances between objects, or the size of objects in meters or feet. Inertial sensors provide absolute scale and take measurement samples at a higher rate, thereby improving robustness for fast device motion. However, inertial sensors, particularly low-cost MEMS varieties, are prone to substantial drifts in position estimates when compared with cameras. So VIO blends together the best of both worlds to accurately estimate device pose.
Looking at the diagram, you can see that the camera, accelerometers, and gyroscopes are the inputs to VIO. Accelerometers and gyroscopes are running at much higher frequency than the camera.
For the Snapdragon VIO subsystem:
The camera feature processing block detects features and tracks them.
The inertial data processing block runs at the high frequencies of the inertial sensors and fuses the data.
The VIO algorithms run on the Hexagon DSP for efficiency.
The magic really happens when fusing the camera and inertial sensor data.
Fusing this data allows for continuous localization of the head position.
Finally, we need the head position at a high frequency, so we generate both an accurate, high-rate pose and a predicted pose.
This 6-DoF pose can then be used to quickly update the user’s view in the VR world with the precise visuals (and sound).
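As a rough illustration of that division of labor, here is a toy loosely-coupled fusion sketch. It assumes simplified inputs (world-frame acceleration, with orientation already resolved by the gyroscope); a production subsystem like Snapdragon’s would use a tightly coupled filter rather than this simple blend.

```python
import numpy as np

# What a resting accelerometer reads along +z (specific force), in m/s^2.
GRAVITY = np.array([0.0, 0.0, 9.81])

class ToyVIO:
    """Toy loosely-coupled fusion: propagate with IMU samples at ~800-1000 Hz,
    correct drift with camera position estimates at ~30 fps."""

    def __init__(self):
        self.position = np.zeros(3)  # meters, in the tracking frame
        self.velocity = np.zeros(3)  # m/s

    def on_imu(self, accel_world, dt):
        # High-rate propagation: remove gravity from the specific-force
        # reading and double-integrate. Low-cost MEMS sensors drift
        # quickly when integrated this way.
        self.velocity += (accel_world - GRAVITY) * dt
        self.position += self.velocity * dt

    def on_camera(self, visual_position, blend=0.1):
        # Low-rate correction: nudge the drifting inertial estimate toward
        # the camera's estimate, which is slower but does not accumulate
        # drift the way integrated IMU data does.
        self.position = (1.0 - blend) * self.position + blend * visual_position

    def predict(self, horizon_s):
        # Predicted pose: extrapolate to where the head will be at display
        # time, so the renderer can target that instant and hide latency.
        return self.position + self.velocity * horizon_s
```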
Motion-to-photon latency is the amount of time between the movement of the VR user’s head and the time that new content is correctly displayed.
Minimizing motion to photon latency is absolutely crucial for VR
Your vestibular system (which provides a sense of balance and spatial orientation) signals to your brain that your head has moved; if the visual stimulus from your eyes to the brain does not match, it can not only make the experience less immersive but also make some users sick.
[Click]
On the left, we can see what LOW motion to photon latency looks like. As you move your head, everything is updated with minimal lag.
[Click]
On the right, the head moves, but the display update to the VR headset is delayed. This is high motion to photon latency, and it produces a physiological sensory conflict that is uncomfortable for most people.
The best way to both minimize this discomfort and create immersion is obviously to reduce the latency. In Snapdragon 835, we’ve decreased it by about 20% through efficiency improvements to sensor processing and interaction with the single-buffered rendering subsystem, which involves a clever way of concurrently rendering and time-warping one eye buffer while sending the other finished eye buffer to the display.
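The talk does not spell out the time-warp math, but the rotation-only correction at its heart can be sketched as follows; the quaternion convention and function names are illustrative, not the Snapdragon implementation.

```python
import numpy as np

def quat_multiply(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def timewarp_delta(render_q, latest_q):
    """Rotation between the head pose a frame was rendered with and the
    freshest predicted pose. A shader would apply this small delta as a
    per-pixel reprojection of the finished eye buffer just before scan-out,
    so the displayed image matches the newest head orientation."""
    conj = np.array([render_q[0], -render_q[1], -render_q[2], -render_q[3]])
    return quat_multiply(latest_q, conj)
```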
An end-to-end approach is required to minimize motion to photon latency (in fact, an end-to-end approach is required for many VR requirements!). The time from the head movement to the screen being updated, known as “motion to photon”, needs to be very fast. If not, the user will see the display lag, making the experience less immersive and possibly causing discomfort.
Note that the motion to audio update is also important.
There are many processing steps required before updating the display, which is why an end-to-end optimization for latency is required to provide a responsive interface. Let’s consider what happens when your head moves. At a high level, there is motion detection, visual processing, and finally displaying the updated image.
Motion detection detects the head movement.
This is accomplished by sampling the sensor inputs, such as the gyroscope, accelerometer, and camera, fusing them together, and determining where you are looking – the head pose.
Visual processing generates the correct image based on where you are looking.
View generation takes the head pose and creates the commands required to render the correct image. The actual image is then rendered for graphics or decoded for pictures/video.
Display processing enhances the rendered or decoded image and scans it out to the display.
This includes image correction for the VR headset itself, such as lens distortion, but also quality enhancement for the image and the actual display panel.
For VR, the lower the latency the better, but studies have shown that under 20 ms is an important number to hit.
To put the challenge in perspective, a display running at 60 Hz is updated every 17 ms and a display running at 90 Hz is updated every 11 ms. The display update is just one part of the motion to photon path.
To achieve this goal requires us to change how we think about rendering and taking some new approaches. We need to run as many tasks in parallel as possible and also reduce individual latencies.
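As a back-of-the-envelope illustration of that budget, with hypothetical per-stage timings (the real numbers depend on the app and the platform):

```python
# Hypothetical per-stage timings in milliseconds; real numbers vary by app.
BUDGET_MS = 20.0  # the under-20 ms target cited above

stages_ms = {
    "motion detection (sensor sampling + fusion)": 2.0,
    "visual processing (view generation + render/decode)": 9.0,
    "display (correction + enhancement + scan-out)": 5.5,
}

total = sum(stages_ms.values())
print(f"motion-to-photon: {total:.1f} ms "
      f"({'within budget' if total <= BUDGET_MS else 'over budget'})")

# For context: a 60 Hz panel refreshes every 1000/60 = 16.7 ms and a 90 Hz
# panel every 1000/90 = 11.1 ms, so the refresh interval alone eats much of
# the 20 ms budget and the stages must overlap rather than run serially.
```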
When developing 6-DoF content, there are a few things to consider.
First, scale and space
Issue: Untethered mobile allows for limitless movement, but available space could change between app runs
While developing Reef, we learned a few things. The space the user is in is not well defined. They could be on a couch, in a small room, or in a giant room. When developing, you need to consider that the user may be in a 2x2 space, an 8x8 space, etc.
DO consider limits to amount of full body movement
DO provide movement alternatives such as teleportation if real world space is too small to represent virtual world
Second, tracking
Issue: Device cameras can become occluded or environment can be too dark
DO fade to black or fixed image in event of lost head tracking to avoid nauseating jumps and judder
DON’T fall back to tracking only orientation (3-DoF); jumps in position, or seeing the virtual world respond differently to movement, can be uncomfortable for users.
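A minimal sketch of how an app might follow this tracking guidance; the tracking-state callback and fade API here are hypothetical, not part of any particular SDK.

```python
from enum import Enum, auto

class TrackingState(Enum):
    FULL_6DOF = auto()
    LOST = auto()  # cameras occluded, or the environment is too dark

def on_tracking_changed(app, state):
    """Hypothetical callback following the DO/DON'T guidance above."""
    if state is TrackingState.LOST:
        # DO: fade to black (or a fixed image) rather than letting the view
        # jump or judder while the pose stream is unreliable.
        app.fade_to_black(duration_s=0.15)
    else:
        # Resume only when a fresh, valid 6-DoF pose is available again.
        # DON'T silently drop to orientation-only (3-DoF) tracking: position
        # jumps and mismatched motion response are uncomfortable for users.
        app.fade_back_in(duration_s=0.3)
```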
Third, storytelling
Issue: Users may not be looking or positioned in desired location to advance story
You don’t control where the user is looking or moving. You can’t take control of their head. You need the ability to adapt at any point – the permutations of different scenarios make things much more complicated.
DO use audio and visual cues to guide user focus
DON’T take over control of virtual camera in order to force focus on story element
Other issues:
Safety
Develop based on the environment: consider that a user may not be able to move around, and be aware of the couch, chair, etc. so that the user does not get hurt. It is the responsibility of the developer to take care of the user, especially as VR becomes more immersive. Mobile VR will be used in a variety of locations and is immersive (lightweight, no cable).
Mobile challenges:
Power and thermal constraints of a mobile HMD are significant
Could mention our tools for helping (Symphony, Profiler, VR SDK, Adreno GPU)
Snapdragon VR Development Kit
Snapdragon 835 is purpose-built silicon for superior mobile VR
Snapdragon VR SDK provides easy developer access to Snapdragon-accelerated VR libraries that simplify application development; it is compatible with existing toolchains
Snapdragon VR Reference Design is an HMD reference design for Snapdragon 835, for OEMs and 3rd party content development
Snapdragon 835 hardware and software is “Daydream-ready”