A keynote talk given by Mark Billinghurst at the ICMI 2013 conference, December 12th 2013. The talk covers how speech and gesture input can be used for interaction with Augmented Reality interfaces.
Hands and Speech in Space: Multimodal Input for Augmented Reality
1. Hands and Speech in Space:
Multimodal Interaction for AR
Mark Billinghurst
mark.billinghurst@hitlabnz.org
The HIT Lab NZ, University of Canterbury
December 12th 2013
3. Augmented Reality Definition
Defining Characteristics
Combines Real and Virtual Images
- Both can be seen at the same time
Interactive in real-time
- The virtual content can be interacted with
Registered in 3D
- Virtual objects appear fixed in space
Azuma, R. T. (1997). A survey of augmented reality. Presence, 6(4), 355-385.
6. AR Interaction Metaphors
Information Browsing
View AR content
3D AR Interfaces
3D UI interaction techniques
Augmented Surfaces
Tangible UI techniques
Tangible AR
Tangible UI input + AR output
7. VOMAR Demo (Kato 2000)
AR Furniture Arranging
Elements + Interactions
Book:
- Turn over the page
Paddle:
- Push, shake, incline, hit, scoop
Kato, H., Billinghurst, M., et al. 2000. Virtual Object Manipulation on a Table-Top AR
Environment. In Proceedings of the International Symposium on Augmented Reality
(ISAR 2000), Munich, Germany, 111--119.
8. Opportunities for Multimodal Input
Multimodal interfaces are a natural fit for AR
Need for non-GUI interfaces
Natural interaction with real world
Natural support for body input
Previous work has shown the value of multimodal input with 3D graphics
9. Related Work
Related work in 3D graphics/VR
Interaction with 3D content [Chu 1997]
Navigating through virtual worlds [Krum 2002]
Interacting with virtual characters [Billinghurst 1998]
Little earlier work in AR
Required additional input devices
Few formal usability studies
E.g. Olwal et al. [2003] SenseShapes
11. Marker Based Multimodal Interface
Add speech recognition to VOMAR
Paddle + speech commands
Irawati, S., Green, S., Billinghurst, M., Duenser, A., & Ko, H. (2006, October). "Move the couch where?": Developing an augmented reality multimodal interface. In Mixed and Augmented Reality, 2006. ISMAR 2006. IEEE/ACM International Symposium on (pp. 183-186). IEEE.
13. Commands Recognized
Create Command "Make a blue chair": to create a virtual
object and place it on the paddle.
Duplicate Command "Copy this": to duplicate a virtual object
and place it on the paddle.
Grab Command "Grab table": to select a virtual object and
place it on the paddle.
Place Command "Place here": to place the attached object in
the workspace.
Move Command "Move the couch": to attach a virtual object
in the workspace to the paddle so that it follows the paddle
movement.
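As an illustration of how such speech commands could be interpreted, the sketch below maps recognised phrases onto the five commands above with simple patterns. This is not the original system's grammar; the slot names and pattern set are hypothetical.
```python
# Illustrative sketch (not the original VOMAR code): map recognized speech
# phrases to the paddle commands listed above. Object names, colours and the
# parse structure are hypothetical.
import re

COMMANDS = {
    "make":  r"make a (?P<colour>\w+) (?P<object>\w+)",   # Create
    "copy":  r"copy this",                                # Duplicate
    "grab":  r"grab (?P<object>\w+)",                     # Grab
    "place": r"place here",                               # Place
    "move":  r"move the (?P<object>\w+)",                 # Move
}

def parse_speech(phrase: str):
    """Return (command, slots) for a recognised phrase, or (None, {})."""
    phrase = phrase.lower().strip()
    for command, pattern in COMMANDS.items():
        match = re.fullmatch(pattern, phrase)
        if match:
            return command, match.groupdict()
    return None, {}

# Example: "Make a blue chair" -> ("make", {"colour": "blue", "object": "chair"})
print(parse_speech("Make a blue chair"))
```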
22. AR MicroMachines
AR experience with environment awareness
and physically-based interaction
Based on MS Kinect RGB-D sensor
Augmented environment supports
occlusion, shadows
physically-based interaction between real and
virtual objects
Clark, A., & Piumsomboon, T. (2011). A realistic augmented reality racing game using a
depth-sensing camera. In Proceedings of the 10th International Conference on Virtual
Reality Continuum and Its Applications in Industry (pp. 499-502). ACM.
25. System Flow
The system flow consists of three sections:
Image Processing and Marker Tracking
Physics Simulation
Rendering
26. Physics Simulation
Create a virtual mesh over the real world
Updated at 10 fps – real objects can be moved
Used by the physics engine for collision detection (virtual/real)
Used by OpenSceneGraph for occlusion and shadows
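A minimal sketch of the "virtual mesh over the real world" step, assuming a depth image in metres and placeholder camera intrinsics (not the calibrated values). The original system fed such a mesh to Bullet and OpenSceneGraph from C++; numpy is used here only for brevity.
```python
# Sketch of the "virtual mesh over the real world" idea: turn a Kinect depth
# frame into a triangle mesh that a physics engine (Bullet in the original
# system) could use for real/virtual collision detection. The camera
# intrinsics below are placeholder values, not the calibrated ones.
import numpy as np

FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5   # placeholder Kinect intrinsics

def depth_to_mesh(depth_m: np.ndarray, step: int = 8):
    """Back-project a (H, W) depth image (metres) into vertices + triangles."""
    h, w = depth_m.shape
    vs, us = np.mgrid[0:h:step, 0:w:step]
    z = depth_m[vs, us]
    x = (us - CX) * z / FX
    y = (vs - CY) * z / FY
    vertices = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    rows, cols = vs.shape
    idx = np.arange(rows * cols).reshape(rows, cols)
    # Two triangles per grid cell.
    tri_a = np.stack([idx[:-1, :-1], idx[1:, :-1], idx[:-1, 1:]], axis=-1)
    tri_b = np.stack([idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]], axis=-1)
    triangles = np.concatenate([tri_a, tri_b]).reshape(-1, 3)
    return vertices, triangles

# At ~10 fps the mesh is rebuilt each frame, so moved real objects stay solid.
verts, tris = depth_to_mesh(np.full((480, 640), 1.5))
```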
29. Natural Hand Interaction
Using bare hands to interact with AR content
MS Kinect depth sensing
Real time hand tracking
Physics based simulation model
30. Hand Interaction
Represent models as collections of spheres
Bullet physics engine for interaction with real world
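A rough sketch of the sphere-proxy idea, assuming a tracked hand point cloud. The grid binning used here to place the spheres is one simple illustrative choice, not the original implementation.
```python
# Sketch of the sphere-proxy idea: approximate a tracked hand's point cloud
# with a small set of spheres that a physics engine (Bullet in the original
# system) can push against virtual objects. Grid-based binning is just one
# simple way to place the spheres; it is not the original implementation.
import numpy as np

def sphere_proxies(points: np.ndarray, cell: float = 0.02):
    """Bin 3D points (N, 3) into cells of size `cell` metres and return
    one (centre, radius) sphere per occupied cell."""
    keys = np.floor(points / cell).astype(int)
    spheres = []
    for key in np.unique(keys, axis=0):
        members = points[np.all(keys == key, axis=1)]
        centre = members.mean(axis=0)
        radius = max(np.linalg.norm(members - centre, axis=1).max(), cell / 2)
        spheres.append((centre, radius))
    return spheres   # each sphere becomes a kinematic collision shape

hand_points = np.random.rand(500, 3) * 0.1          # stand-in for tracked hand
proxies = sphere_proxies(hand_points)
```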
31. Scene Interaction
Render AR scene with OpenSceneGraph
Using depth map for occlusion
Shadows yet to be implemented
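A minimal sketch of depth-map occlusion, assuming a registered camera image, a real depth map, and rendered virtual colour/depth buffers; the original system performs this test inside OpenSceneGraph.
```python
# Sketch of depth-map occlusion: keep a virtual pixel only where the virtual
# fragment is closer to the camera than the real surface measured by the
# Kinect. The original system does this inside OpenSceneGraph; this numpy
# version just shows the per-pixel test.
import numpy as np

def composite(camera_rgb, virtual_rgb, virtual_depth, real_depth, virtual_mask):
    """All inputs are (H, W[, 3]) arrays; depths in metres, mask is boolean."""
    visible = virtual_mask & (virtual_depth < real_depth)   # occlusion test
    out = camera_rgb.copy()
    out[visible] = virtual_rgb[visible]
    return out
```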
33. Architecture
Layer stack:
5. Gesture
• Static Gestures
• Dynamic Gestures
• Context based Gestures
4. Modeling
• Hand recognition/modeling
• Rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
Hardware Interface layer:
o Supports PCL, OpenNI, OpenCV, and Kinect SDK.
o Provides access to depth, RGB, XYZRGB.
o Usage: Capturing color image, depth image and concatenated point clouds from a single or multiple cameras.
o For example: Kinect for Xbox 360, Kinect for Windows, Asus Xtion Pro Live.
34. Architecture
Layer stack:
5. Gesture
• Static Gestures
• Dynamic Gestures
• Context based Gestures
4. Modeling
• Hand recognition/modeling
• Rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
Segmentation layer:
o Segment images and point clouds based on color, depth and space.
o Usage: Segmenting images or point clouds using color models, depth, or spatial properties such as location, shape and size.
o For example: skin color segmentation, depth threshold.
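A small OpenCV sketch of the two segmentation examples named above (skin colour segmentation and a depth threshold); the colour bounds and depth cut-off are illustrative values, not tuned ones from the original system.
```python
# Sketch of the two segmentation examples on this slide: a skin-colour mask in
# YCrCb space and a depth threshold. The threshold values are illustrative,
# not tuned ones from the original system.
import cv2
import numpy as np

SKIN_LO = np.array([0, 133, 77], dtype=np.uint8)     # YCrCb lower bound
SKIN_HI = np.array([255, 173, 127], dtype=np.uint8)  # YCrCb upper bound

def segment_hand(bgr: np.ndarray, depth_mm: np.ndarray, max_mm: int = 800):
    """Return a binary mask of pixels that look like skin AND lie closer
    than `max_mm` millimetres to the sensor."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, SKIN_LO, SKIN_HI)
    near = ((depth_mm > 0) & (depth_mm < max_mm)).astype(np.uint8) * 255
    return cv2.bitwise_and(skin, near)
```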
35. Architecture
Layer stack:
5. Gesture
• Static Gestures
• Dynamic Gestures
• Context based Gestures
4. Modeling
• Hand recognition/modeling
• Rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
Classification/Tracking layer:
o Identify and track objects between frames based on XYZRGB.
o Usage: Identifying the current position/orientation of the tracked object in space.
o For example: a training set of hand poses, where colors represent unique regions of the hand; raw output (without cleaning) classified on real hand input (depth image).
36. Architecture
5. Gesture
• Static Gestures
• Dynamic Gestures
• Context based Gestures
4. Modeling
• Hand recognition/
modeling
• Rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
Modeling layer:
o Hand Recognition/Modeling
Skeleton based (for low resolution approximation)
Model based (for more accurate representation)
o Object Modeling (identification and tracking of rigid-body objects)
o Physical Modeling (physical interaction)
Sphere Proxy
Model based
Mesh based
o Usage: For general spatial interaction in AR/VR
environment
37. Architecture
5. Gesture
• Static Gestures
• Dynamic Gestures
• Context based Gestures
4. Modeling
• Hand recognition/
modeling
• Rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
Gesture layer:
o Static (hand pose recognition)
o Dynamic (meaningful movement recognition)
o Context-based gesture recognition (gestures with context,
e.g. pointing)
o Usage: Issuing commands/anticipating user intention and high
level interaction.
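As a sketch of the "dynamic gesture" case, the snippet below matches a recorded 2D hand trajectory against stored templates with dynamic time warping; the templates and labels are hypothetical and this is not the framework's own recogniser.
```python
# Minimal sketch of a dynamic-gesture recogniser: compare a recorded 2D hand
# trajectory against stored templates with dynamic time warping (DTW) and
# return the closest label. Templates and labels here are hypothetical; the
# original framework is not reproduced.
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between two (N, 2) and (M, 2) trajectories."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def classify(trajectory, templates):
    """templates: dict of label -> (N, 2) template trajectory."""
    return min(templates, key=lambda label: dtw(trajectory, templates[label]))

templates = {
    "swipe_right": np.column_stack([np.linspace(0, 1, 20), np.zeros(20)]),
    "swipe_up": np.column_stack([np.zeros(20), np.linspace(0, 1, 20)]),
}
print(classify(np.column_stack([np.linspace(0, 0.9, 15), np.zeros(15)]), templates))
```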
38. Skeleton Based Interaction
3Gear Systems
Kinect/Primesense Sensor
Two hand tracking
http://www.threegear.com
39. Skeleton Interaction + AR
HMD AR View
Viewpoint tracking
Two hand input
Skeleton interaction, occlusion
40. What Gestures do People Want to Use?
Limitations of Previous work in AR
Limited range of gestures
Gestures designed for optimal recognition
Gestures studied as add-on to speech
Solution – elicit desired gestures from users
E.g. gestures for surface computing [Wobbrock]
Previous work on unistroke gestures, mobile gestures
41. User Defined Gesture Study
Use AR view
HMD + AR tracking
Present AR animations
40 tasks in six categories
- Editing, transforms, menu, etc
Ask users to produce
gestures causing animations
Record gesture (video, depth)
Piumsomboon, T., Clark, A., Billinghurst, M., & Cockburn, A. (2013, April). User-defined gestures for augmented reality. In CHI '13 Extended Abstracts on Human Factors in Computing Systems (pp. 955-960). ACM.
42. Data Recorded
20 participants
Gestures recorded (video, depth data)
800 gestures from 40 tasks
Subjective rankings
Likert ranking of goodness, ease of use
Think aloud transcripts
44. Results - Gestures
Gestures grouped according to
similarity – 320 groups
44 consensus gestures (62% of all gestures)
276 low similarity (discarded)
11 hand poses seen
Degree of consensus (A) using
guessability score [Wobbrock]
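For reference, the agreement score A from Wobbrock et al. can be computed as below; the group sizes in the example are made up and are not the study's data.
```python
# Sketch of the agreement score A from Wobbrock et al.: for each task, sum the
# squared fraction of proposals falling into each identical-gesture group,
# then average over tasks. The example group sizes below are made up, not the
# study's data.
def agreement(groups_per_task):
    """groups_per_task: list of lists, one list of group sizes per task."""
    scores = []
    for sizes in groups_per_task:
        total = sum(sizes)
        scores.append(sum((s / total) ** 2 for s in sizes))
    return sum(scores) / len(scores)

# e.g. one task where 20 proposals split into groups of 12, 5 and 3:
print(agreement([[12, 5, 3]]))   # 0.445
```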
46. Usability Results
                      Consensus   Discarded
Ease of Performance   6.02        5.50
Good Match            6.17        5.83
Likert Scale [1-7], 7 = Very Good
Significant difference between consensus and discarded gesture sets (p < 0.0001)
Gestures in consensus set better than discarded gestures in perceived performance and goodness
47. Lessons Learned
AR animation can elicit desired gestures
For some tasks there is a high degree of
similarity in user defined gestures
Especially command gestures (e.g. Open), selection
Less agreement on manipulation gestures
Move (40%), rotate (30%), grouping (10%)
Small proportion of two-handed gestures (22%)
Scaling, group selection
49. Multimodal Interaction
Combined speech input
Gesture and Speech are complementary
Speech
- modal commands, quantities
Gesture
- selection, motion, qualities
Previous work found multimodal interfaces
intuitive for 2D/3D graphics interaction
50. Wizard of Oz Study
What speech and gesture input
would people like to use?
Wizard
Perform speech recognition
Command interpretation
Domain
3D object interaction/modelling
Lee, M., & Billinghurst, M. (2008, October). A Wizard of Oz study for an AR multimodal interface.
In Proceedings of the 10th international conference on Multimodal interfaces (pp. 249-256). ACM.
54. Experiment
12 participants
Two display conditions (HMD vs. Desktop)
Three tasks
Task 1: Change object color/shape
Task 2: 3D positioning of objects
Task 3: Scene assembly
55. Key Results
Most commands multimodal
Multimodal (63%), Gesture (34%), Speech (4%)
Most spoken phrases short
74% of phrases were short, averaging 1.25 words
Sentences (26%) averaged 3 words
Main gestures deictic (65%), metaphoric (35%)
In multimodal commands gesture was issued first
94% of the time gesture began before speech
Multimodal window 8s – speech 4.5s after gesture
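These timing results suggest a simple time-window fusion strategy, sketched below under the assumption of a fixed 8 s window; the event structure and command fields are hypothetical, not the system's actual fusion code.
```python
# Hedged sketch of time-window fusion suggested by the timing results above:
# hold a gesture event open and pair it with speech that arrives within a
# fixed window (8 s here, mirroring the observed multimodal window). Event
# fields and the fused-command format are hypothetical.
from dataclasses import dataclass
from typing import Optional

WINDOW_S = 8.0

@dataclass
class FusionEngine:
    pending_gesture: Optional[tuple] = None     # (time, gesture, target)

    def on_gesture(self, t: float, gesture: str, target: str):
        self.pending_gesture = (t, gesture, target)

    def on_speech(self, t: float, phrase: str):
        if self.pending_gesture and t - self.pending_gesture[0] <= WINDOW_S:
            g_time, gesture, target = self.pending_gesture
            self.pending_gesture = None
            return {"action": phrase, "target": target, "via": gesture}
        return {"action": phrase, "target": None, "via": "speech-only"}

engine = FusionEngine()
engine.on_gesture(0.0, "point", "chair")
print(engine.on_speech(4.5, "make it blue"))   # fused multimodal command
```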
56. Free Hand Multimodal Input
Use free hand to interact with AR content
Recognize simple gestures
Open hand, closed hand, pointing
Interactions: Point, Move, Pick/Drop
Lee, M., Billinghurst, M., Baek, W., Green, R., & Woo, W. (2013). A usability study of multimodal
input in an augmented reality environment. Virtual Reality, 17(4), 293-305.
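A minimal sketch of how the three poses above could be told apart once fingertips have been counted from the segmented hand; the thresholds are an assumption, not the cited system's rule.
```python
# Minimal sketch: once fingertips have been counted (e.g. from the segmented
# hand contour), map the count to the three poses used here. The thresholds
# are an assumption, not the rule used in the cited system.
def classify_pose(num_fingertips: int) -> str:
    if num_fingertips == 0:
        return "closed hand"    # grab / pick
    if num_fingertips == 1:
        return "pointing"       # select
    return "open hand"          # release / drop
```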
57. Speech Input
MS Speech + MS SAPI (> 90% accuracy)
Single word speech commands
62. User Evaluation
25 subjects, 10 task trials x 3, 3 conditions
Change object shape, colour and position
Conditions
Speech only, gesture only, multimodal
Measures
performance time, errors (system/user), subjective survey
63. Results - Performance
Average performance time
Gesture: 15.44s
Speech: 12.38s
Multimodal: 11.78s
Significant difference across conditions (p < 0.01)
Difference between gesture and speech/MMI
64. Errors
User errors – errors per task
Gesture (0.50), Speech (0.41), MMI (0.42)
No significant difference
System errors
Speech accuracy – 94%, Gesture accuracy – 85%
MMI accuracy – 90%
65. Subjective Results (Likert 1-7)
                  Gesture   Speech   MMI
Naturalness       4.60      5.60     5.80
Ease of Use       4.00      5.90     6.00
Efficiency        4.45      5.15     6.05
Physical Effort   4.75      3.15     3.85
User subjective survey
Gesture significantly worse, MMI and Speech same
MMI perceived as most efficient
Preference
70% MMI, 25% speech only, 5% gesture only
66. Observations
Significant difference in number of commands
Gesture (6.14), Speech (5.23), MMI (4.93)
MMI Simultaneous vs. Sequential commands
79% sequential, 21% simultaneous
Reaction to system errors
Almost always repeated the same command
In MMI users rarely changed modalities
67. Lessons Learned
Multimodal interaction significantly better than
gesture alone in AR interfaces for 3D tasks
Short task time, more efficient
Users felt that MMI was more natural, easier,
and more effective than gesture/speech only
Simultaneous input rarely used
More studies need to be conducted
69. Intelligent Interfaces
Most AR systems stupid
Don’t recognize user behaviour
Don’t provide feedback
Don’t adapt to user
Especially important for training
Scaffolded learning
Moving beyond check-lists of actions
70. Intelligent Interfaces
AR interface + intelligent tutoring system
ASPIRE constraint-based system (from the University of Canterbury)
Constraints
- relevance condition, satisfaction condition, feedback
Westerfield, G., Mitrovic, A., & Billinghurst, M. (2013). Intelligent Augmented Reality Training for
Assembly Tasks. In Artificial Intelligence in Education (pp. 542-551). Springer Berlin Heidelberg.
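A small sketch of the constraint structure used by constraint-based tutors such as ASPIRE (relevance condition, satisfaction condition, feedback); the assembly-step predicates below are invented for illustration.
```python
# Sketch of the constraint structure used by constraint-based tutors such as
# ASPIRE: each constraint has a relevance condition, a satisfaction condition
# and a feedback message. The assembly-step predicates below are invented for
# illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    relevance: Callable[[dict], bool]      # does this constraint apply?
    satisfaction: Callable[[dict], bool]   # is it met in the current state?
    feedback: str                          # shown in AR when violated

constraints = [
    Constraint(
        relevance=lambda s: s.get("step") == "attach_motor",
        satisfaction=lambda s: s.get("bracket_fitted", False),
        feedback="Fit the bracket before attaching the motor.",
    ),
]

def check(state: dict):
    """Return feedback for every relevant but unsatisfied constraint."""
    return [c.feedback for c in constraints
            if c.relevance(state) and not c.satisfaction(state)]

print(check({"step": "attach_motor", "bracket_fitted": False}))
```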
74. Evaluation Results
16 subjects, with and without ITS
Improved task completion
Improved learning
75. Intelligent Agents
AR characters
Virtual embodiment of system
Multimodal input/output
Examples
AR Lego, Welbo, etc
Mr Virtuoso
- AR character more real, more fun
- On-screen 3D and AR similar in usefulness
Wagner, D., Billinghurst, M., & Schmalstieg, D. (2006). How real should virtual characters be?. In
Proceedings of the 2006 ACM SIGCHI international conference on Advances in computer
entertainment technology (p. 57). ACM.
77. Directions for Future Research
Mobile Gesture Interaction
Tablet, phone interfaces
Wearable Systems
Google Glass
Novel Displays
Contact lens
Environmental Understanding
Semantic representation
78. Mobile Gesture Interaction
Motivation
Richer interaction with handheld devices
Natural interaction with handheld AR
2D tracking
- Fingertip tracking [Hürst and van Wezel 2013]
3D tracking
- Hand tracking [Henrysson et al. 2007]
Henrysson, A., Marshall, J., & Billinghurst, M. (2007). Experiments in 3D interaction for mobile phone AR.
In Proceedings of the 5th international conference on Computer graphics and interactive techniques in
Australia and Southeast Asia (pp. 187-194). ACM.
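A hedged OpenCV sketch of 2D fingertip detection on a binary hand mask, the kind of processing fingertip-based handheld AR builds on; this is not the cited system's implementation and the defect-depth threshold is illustrative.
```python
# Sketch of 2D fingertip detection on a binary hand mask with OpenCV, the
# kind of processing fingertip-based handheld AR relies on. This is not the
# cited system's implementation; thresholds are illustrative.
import cv2
import numpy as np

def find_fingertips(mask: np.ndarray):
    """Return candidate fingertip points from a binary (uint8) hand mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    hand = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    tips = []
    if defects is not None:
        for start, end, _far, depth in defects[:, 0]:
            if depth > 10 * 256:                 # deep valleys => finger gaps
                tips.append(tuple(hand[start][0]))
                tips.append(tuple(hand[end][0]))
    return tips
```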
79. Fingertip Based Interaction
Running System
System Setup
Mobile Client + PC Server
Bai, H., Gao, L., El-Sana, J., & Billinghurst, M. (2013). Markerless 3D gesture-based interaction for
handheld augmented reality interfaces. In SIGGRAPH Asia 2013 Symposium on Mobile Graphics
and Interactive Applications (p. 22). ACM.
81. 3D Prototype System
3Gear + Vuforia
Hand tracking + phone tracking
Freehand interaction on phone
Skeleton model
3D interaction
20 fps performance
84. User Experience
Truly Wearable Computing
Less than 46 ounces
Hands-free Information Access
Voice interaction, Ego-vision camera
Intuitive User Interface
Touch, Gesture, Speech, Head Motion
Access to all Google Services
Map, Search, Location, Messaging, Email, etc
85. Contact Lens Display
Babak Parviz
University of Washington
MEMS components
Transparent elements
Micro-sensors
Challenges
Miniaturization
Assembly
Eye safety
87. Environmental Understanding
Semantic understanding of environment
What are the key objects?
What are their relationships?
How can they be represented in a form suitable for multimodal interaction?
89. Conclusions
AR experiences need new interaction methods
Enabling technologies are advancing quickly
Displays, tracking, depth capture devices
Natural user interfaces possible
Free hand gesture, speech, intelligent interfaces
Important research for the future
Mobile, wearable, displays
90. More Information
• Mark Billinghurst
– Email: mark.billinghurst@hitlabnz.org
– Twitter: @marknb00
• Website
– http://www.hitlabnz.org/