Natural Interfaces for
 Augmented Reality

      Mark Billinghurst
       HIT Lab NZ
  University of Canterbury
Augmented Reality Definition
 Defining Characteristics [Azuma 97]
   Combines Real and Virtual Images
     - Both can be seen at the same time
   Interactive in real-time
     - The virtual content can be interacted with
   Registered in 3D
     - Virtual objects appear fixed in space
AR Today
 Most widely used AR is mobile or web based
 Mobile AR
   Outdoor AR (GPS + compass)
     - Layar (10 million+ users), Junaio, etc
   Indoor AR (image based tracking)
     - QCAR, String etc
 Web based (Flash)
   FLARToolKit marker tracking
   Markerless tracking
AR Interaction
 You can see spatially registered AR..
          how can you interact with it?
AR Interaction Today
 Mostly simple interaction
 Mobile
   Outdoor (Junaio, Layar, Wikitude, etc)
     - Viewing information in place, touch virtual tags
   Indoor (Invizimals, Qualcomm demos)
     - Change viewpoint, screen based (touch screen)
 Web based
   Change viewpoint, screen interaction (mouse)
History of AR Interaction
1. AR Information Viewing
 Information is registered to
  real-world context
    Hand held AR displays
 Interaction
    Manipulation of a window
     into information space
    2D/3D virtual viewpoint control
 Applications
    Context-aware information displays
 Examples
    NaviCam, Cameleon, etc
 [Image: NaviCam, Rekimoto et al. 1997]
Current AR Information Browsers
 Mobile AR
   GPS + compass
 Many Applications
     Layar
     Wikitude
     Acrossair
     PressLite
     Yelp
     AR Car Finder
     …
2. 3D AR Interfaces
 Virtual objects displayed in 3D
  physical space and manipulated
    HMDs and 6DOF head-tracking
    6DOF hand trackers for input
 Interaction
    Viewpoint control
    Traditional 3D UI interaction:
      manipulation, selection, etc.
 [Image: Kiyokawa et al. 2000]
 Requires custom input devices
VLEGO - AR 3D Interaction
3. Augmented Surfaces and
Tangible Interfaces
 Basic principles
   Virtual objects are projected
    on a surface
   Physical objects are used as
    controls for virtual objects
   Support for collaboration
Augmented Surfaces
 Rekimoto, et al. 1998
   Front projection
   Marker-based tracking
   Multiple projection surfaces
Tangible User Interfaces (Ishii 97)
 Create digital shadows
  for physical objects
 Foreground
   graspable UI
 Background
   ambient interfaces
Tangible Interface: ARgroove
 Collaborative Instrument
 Exploring Physically Based Interaction
    Move and track physical record
     Map physical actions to MIDI output
       - Translation, rotation
       - Tilt, shake
 Limitation
    AR output shown on screen
    Separation between input and output
Lessons from Tangible Interfaces
 Benefits
    Physical objects make us smart (affordances, constraints)
    Objects aid collaboration (shared meaning)
    Objects increase understanding (cognitive artifacts)
 Limitations
    Difficult to change object properties
    Limited display capabilities (project onto surface)
    Separation between object and display
4. Tangible AR
 AR overcomes limitations of TUIs
   enhance display possibilities
   merge task/display space
   provide public and private views


 TUI + AR = Tangible AR
   Apply TUI methods to AR interface design
Example Tangible AR Applications
 Use of natural physical object manipulations to
  control virtual objects
 LevelHead (Oliver)
    Physical cubes become rooms
 VOMAR (Kato 2000)
    Furniture catalog book:
      - Turn over the page to see new models
    Paddle interaction:
      - Push, shake, incline, hit, scoop
VOMAR Interface
Evolution of AR Interaction
1. Information Viewing Interfaces
   simple (conceptually!), unobtrusive
2. 3D AR Interfaces
   expressive, creative, require attention
3. Tangible Interfaces
   Embedded into conventional environments
4. Tangible AR
   Combines TUI input + AR display
Limitations
 Typical limitations
     Simple/No interaction (viewpoint control)
     Require custom devices
     Single mode interaction
     2D input for 3D (screen based interaction)
     No understanding of real world
     Explicit vs. implicit interaction
     Unintelligent interfaces (no learning)
Natural Interaction
The Vision of AR
To Make the Vision Real..
 Hardware/software requirements
     Contact lens displays
     Free space hand/body tracking
     Environment recognition
     Speech/gesture recognition
     Etc..
Natural Interaction
 Automatically detecting real environment
   Environmental awareness
   Physically based interaction
 Gesture Input
   Free-hand interaction
 Multimodal Input
   Speech and gesture interaction
   Implicit rather than Explicit interaction
Environmental Awareness
AR MicroMachines
 AR experience with environment awareness
  and physically-based interaction
   Based on MS Kinect RGB-D sensor
 Augmented environment supports
   occlusion, shadows
   physically-based interaction between real and
    virtual objects
Operating Environment
Architecture
 Our framework uses five libraries:

     OpenNI
     OpenCV
     OPIRA
     Bullet Physics
     OpenSceneGraph
System Flow
 The system flow consists of three sections:
    Image Processing and Marker Tracking
    Physics Simulation
    Rendering
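 A minimal C++ sketch of how these three stages could be sequenced each frame; the Frame type and stage functions are hypothetical stubs standing in for the real modules:

    // Hypothetical per-frame loop illustrating the three stages of the system flow.
    struct Frame { /* colour image, depth image, marker-derived camera pose */ };

    Frame capture()                     { return Frame(); }  // image processing + marker tracking
    void  simulatePhysics(const Frame&) {}                   // rebuild terrain mesh, step the physics world
    void  render(const Frame&)          {}                   // draw video background, virtual content

    int main()
    {
        for (;;) {                     // runs until the application exits
            Frame f = capture();       // 1. image processing and marker tracking
            simulatePhysics(f);        // 2. physics simulation
            render(f);                 // 3. rendering
        }
    }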
Physics Simulation




 Create virtual mesh over real world
 Update at 10 fps – can move real objects
 Used by the physics engine for collision detection (virtual/real)
 Used by OpenSceneGraph for occlusion and shadows
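 A sketch of how a depth-derived terrain mesh can be handed to Bullet as a static collision shape so that virtual rigid bodies collide with the real scene; the Vertex type, function name, and index layout are assumptions, only the overall idea follows the slides:

    #include <btBulletDynamicsCommon.h>
    #include <cstddef>
    #include <vector>

    // Hypothetical vertex format produced by the depth-to-mesh step (metres, camera frame).
    struct Vertex { float x, y, z; };

    // Build a static Bullet collision body from the reconstructed real-world mesh.
    btRigidBody* makeTerrainBody(const std::vector<Vertex>& verts,
                                 const std::vector<int>& indices)
    {
        btTriangleMesh* tris = new btTriangleMesh();
        for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
            const Vertex& a = verts[indices[i]];
            const Vertex& b = verts[indices[i + 1]];
            const Vertex& c = verts[indices[i + 2]];
            tris->addTriangle(btVector3(a.x, a.y, a.z),
                              btVector3(b.x, b.y, b.z),
                              btVector3(c.x, c.y, c.z));
        }
        btBvhTriangleMeshShape* shape = new btBvhTriangleMeshShape(tris, true);
        // Mass 0 makes the body static; the mesh is rebuilt on each update (~10 fps)
        // rather than moved, so real objects can be repositioned between updates.
        btRigidBody::btRigidBodyConstructionInfo info(0.0f, new btDefaultMotionState(), shape);
        return new btRigidBody(info);
    }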
Rendering




[Images: occlusion (left) and shadows (right)]
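 One common way to get this effect in OpenSceneGraph, consistent with the alpha-zero trimesh described in the notes, is to render the reconstructed real-world mesh into the depth buffer only, so it occludes virtual objects without covering the live video; a sketch, not the project's exact shader setup:

    #include <osg/ColorMask>
    #include <osg/Node>
    #include <osg/StateSet>

    // Make a node write depth but no colour: real-world geometry then hides
    // virtual objects behind it while the video background stays visible.
    void makeDepthOnlyOccluder(osg::Node* node)
    {
        osg::StateSet* ss = node->getOrCreateStateSet();
        // Disable writes to all colour channels; depth writes remain enabled.
        ss->setAttributeAndModes(new osg::ColorMask(false, false, false, false),
                                 osg::StateAttribute::ON);
        // Draw the occluder before the virtual content.
        ss->setRenderBinDetails(-1, "RenderBin");
    }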
Natural Gesture Interaction

 HIT Lab NZ AR Gesture Library
Motivation
 AR MicroMachines and PhobiAR
   AR MicroMachines treated the environment as static – no tracking
   PhobiAR tracked objects only in 2D
 More realistic interaction requires 3D gesture tracking
Motivation
 Occlusion Issues
   AR MicroMachines only achieved realistic occlusion because the user’s viewpoint matched the Kinect’s
   Proper occlusion requires a more complete model of scene objects
HITLabNZ’s Gesture Library




Architecture
HITLabNZ’s Gesture Library




Architecture
   o   Supports PCL, OpenNI, OpenCV, and Kinect SDK.
   o   Provides access to depth, RGB, XYZRGB.
   o   Usage: Capturing color image, depth image and
       concatenated point clouds from a single or multiple cameras
   o   For example:




                                 Kinect for Xbox 360


                                  Kinect for Windows


                                 Asus Xtion Pro Live
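 A minimal capture sketch using PCL's OpenNI grabber, one of the supported back ends listed above; the callback name and single-camera setup are assumptions:

    #include <pcl/io/openni_grabber.h>
    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>
    #include <boost/function.hpp>
    #include <boost/thread/thread.hpp>

    // Receives one XYZRGB(A) point cloud per frame from the depth sensor.
    void onCloud(const pcl::PointCloud<pcl::PointXYZRGBA>::ConstPtr& cloud)
    {
        // Hand the frame to the segmentation / tracking stages here.
    }

    int main()
    {
        pcl::OpenNIGrabber grabber;  // one Kinect/Xtion; use several grabbers for multi-camera capture
        boost::function<void (const pcl::PointCloud<pcl::PointXYZRGBA>::ConstPtr&)> cb = &onCloud;
        grabber.registerCallback(cb);
        grabber.start();
        while (true)
            boost::this_thread::sleep(boost::posix_time::seconds(1));  // frames arrive on the grabber thread
    }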
HITLabNZ’s Gesture Library




Architecture
  o    Segment images and point clouds based on color, depth and
       space.
  o    Usage: Segmenting images or point clouds using color
       models, depth, or spatial properties such as location, shape
       and size.
  o    For example:




                                   Skin color segmentation



                                       Depth threshold
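 A rough sketch of the two cues named above using OpenCV; the HSV skin range and the 1 m depth cut-off are illustrative values, not the library's tuned parameters:

    #include <opencv2/opencv.hpp>

    // Combine a loose skin-colour mask with a near-depth mask to isolate the hand.
    cv::Mat segmentHand(const cv::Mat& bgr, const cv::Mat& depthMm)  // depth in millimetres (CV_16UC1)
    {
        cv::Mat hsv, skinMask, depthMask, mask;
        cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

        // Keep pixels whose hue/saturation/value fall inside a loose skin range.
        cv::inRange(hsv, cv::Scalar(0, 40, 60), cv::Scalar(25, 180, 255), skinMask);

        // Keep pixels closer than roughly 1 m to the sensor.
        cv::inRange(depthMm, cv::Scalar(1), cv::Scalar(1000), depthMask);

        cv::bitwise_and(skinMask, depthMask, mask);
        return mask;  // binary mask of candidate hand pixels
    }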
HITLabNZ’s Gesture Library




Architecture
  o    Identify and track objects between frames based on
       XYZRGB.
  o    Usage: Identifying current position/orientation of the
       tracked object in space.
  o    For example:



 [Image: training set of hand poses; colors represent unique regions of the hand]
 [Image: raw output (without cleaning) classified on real hand input (depth image)]
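 A deliberately reduced stand-in for this stage: once the hand cloud has been segmented, its per-frame position can be reported as the cloud centroid with PCL; the real module also classifies hand regions and recovers orientation, which this sketch omits:

    #include <pcl/point_cloud.h>
    #include <pcl/point_types.h>
    #include <pcl/common/centroid.h>
    #include <Eigen/Core>

    // Report the tracked hand position for the current frame as the cloud centroid.
    Eigen::Vector4f trackPosition(const pcl::PointCloud<pcl::PointXYZRGB>& handCloud)
    {
        Eigen::Vector4f centroid;
        pcl::compute3DCentroid(handCloud, centroid);  // homogeneous (x, y, z, 1)
        return centroid;
    }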
HITLabNZ’s Gesture Library




Architecture
   o Hand Recognition/Modeling
        Skeleton based (for low resolution
          approximation)
        Model based (for more accurate
          representation)
   o Object Modeling (identification and tracking of rigid-body objects)
   o Physical Modeling (physical interaction)
        Sphere Proxy
        Model based
        Mesh based
   o Usage: For general spatial interaction in AR/VR
     environment
Method
Represent models as collections of spheres moving with
   the models in the Bullet physics engine
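 A sketch of the sphere-proxy idea in Bullet, with placeholder centres and a single shared radius; in the library the sphere layout comes from the modelling stage and its motion from tracking:

    #include <btBulletDynamicsCommon.h>
    #include <vector>

    // Approximate a tracked hand or object as a kinematic compound of spheres.
    btRigidBody* makeSphereProxy(const std::vector<btVector3>& centres, float radius)
    {
        btCompoundShape* compound = new btCompoundShape();
        for (const btVector3& c : centres) {
            btTransform local;
            local.setIdentity();
            local.setOrigin(c);
            compound->addChildShape(local, new btSphereShape(radius));
        }
        // Kinematic: its transform is set from tracking each frame, and Bullet
        // resolves collisions between it and the dynamic virtual objects.
        btRigidBody::btRigidBodyConstructionInfo info(0.0f, new btDefaultMotionState(), compound);
        btRigidBody* body = new btRigidBody(info);
        body->setCollisionFlags(body->getCollisionFlags() | btCollisionObject::CF_KINEMATIC_OBJECT);
        body->setActivationState(DISABLE_DEACTIVATION);
        return body;
    }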
Method
Render AR scene with OpenSceneGraph, using depth map
   for occlusion




              Shadows yet to be implemented
Results
HITLabNZ’s Gesture Library




Architecture
  o   Static (hand pose recognition)
  o   Dynamic (meaningful movement recognition)
  o   Context-based gesture recognition (gestures with context,
      e.g. pointing)
  o   Usage: Issuing commands/anticipating user intention and
      high level interaction.
Multimodal Interaction
Multimodal Interaction
 Combined speech input
 Gesture and speech are complementary
   Speech
     - modal commands, quantities
   Gesture
     - selection, motion, qualities
 Previous work found multimodal interfaces
  intuitive for 2D/3D graphics interaction
1. Marker Based Multimodal Interface




  Add speech recognition to VOMAR
  Paddle + speech commands
Commands Recognized
 Create Command "Make a blue chair": to create a virtual
  object and place it on the paddle.
 Duplicate Command "Copy this": to duplicate a virtual object
  and place it on the paddle.
 Grab Command "Grab table": to select a virtual object and
  place it on the paddle.
 Place Command "Place here": to place the attached object in
  the workspace.
 Move Command "Move the couch": to attach a virtual object
  in the workspace to the paddle so that it follows the paddle
  movement.
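 A toy sketch of how these speech commands could be mapped to actions before fusion with the paddle context; the enum, prefix matching, and deictic resolution note are all hypothetical, and a real system would act on the speech recognizer's grammar output rather than raw strings:

    #include <iostream>
    #include <string>

    enum class Action { Create, Duplicate, Grab, Place, Move, None };

    // Map a lower-case speech result to one of the five command types.
    Action parseCommand(const std::string& speech)
    {
        if (speech.rfind("make", 0) == 0)  return Action::Create;     // "make a blue chair"
        if (speech.rfind("copy", 0) == 0)  return Action::Duplicate;  // "copy this"
        if (speech.rfind("grab", 0) == 0)  return Action::Grab;       // "grab table"
        if (speech.rfind("place", 0) == 0) return Action::Place;      // "place here"
        if (speech.rfind("move", 0) == 0)  return Action::Move;       // "move the couch"
        return Action::None;
    }

    int main()
    {
        // The deictic part ("this", "here") would be resolved from the paddle pose,
        // e.g. whichever virtual object or location the paddle currently points at.
        std::cout << static_cast<int>(parseCommand("copy this")) << "\n";  // prints 1 (Duplicate)
    }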
System Architecture
Object Relationships




"Put chair behind the table”
Where is behind?
                               View specific regions
User Evaluation
 Performance time
    Speech + static paddle significantly faster




 Gesture-only condition less accurate for position/orientation
 Users preferred speech + paddle input
Subjective Surveys
2. Free Hand Multimodal Input
 Use free hand to interact with AR content
 Recognize simple gestures
 No marker tracking




        Point         Move          Pick/Drop
Multimodal Architecture
Multimodal Fusion
Hand Occlusion
User Evaluation



 Change object shape, colour and position
 Conditions
   Speech only, gesture only, multimodal
 Measure
   performance time, error, subjective survey
Experimental Setup




Change object shape
  and colour
Results
 Average performance time (MMI, speech fastest)
   Gesture: 15.44s
   Speech: 12.38s
   Multimodal: 11.78s
 No difference in user errors
 User subjective survey
   Q1: How natural was it to manipulate the object?
     - MMI, speech significantly better
   70% preferred MMI, 25% speech only, 5% gesture only
Future Directions
Future Research
   Mobile real world capture
   Mobile gesture input
   Intelligent interfaces
   Virtual characters
Natural Gesture Interaction on Mobile




 Use mobile camera for hand tracking
   Fingertip detection
Evaluation




 Gesture input more than twice as slow as touch
 No difference in naturalness
Intelligent Interfaces
 Most AR systems stupid
   Don’t recognize user behaviour
   Don’t provide feedback
   Don’t adapt to user
 Especially important for training
   Scaffolded learning
   Moving beyond check-lists of actions
Intelligent Interfaces




 AR interface + intelligent tutoring system
   ASPIRE constraint-based system (from UC)
   Constraints
     - relevance condition, satisfaction condition, feedback
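 A minimal sketch of the constraint idea, with each constraint holding a relevance condition, a satisfaction condition, and feedback text; the AssemblyState type and the evaluation loop are assumptions for illustration, not ASPIRE's actual representation:

    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical snapshot of the learner's current solution in the AR task.
    struct AssemblyState { std::vector<std::string> placedParts; };

    struct Constraint {
        std::function<bool(const AssemblyState&)> relevance;     // when does this constraint apply?
        std::function<bool(const AssemblyState&)> satisfaction;  // what must then be true?
        std::string feedback;                                    // corrective message if violated
    };

    // Collect feedback for every relevant constraint that is not satisfied.
    std::vector<std::string> evaluate(const AssemblyState& s,
                                      const std::vector<Constraint>& constraints)
    {
        std::vector<std::string> messages;
        for (const auto& c : constraints)
            if (c.relevance(s) && !c.satisfaction(s))
                messages.push_back(c.feedback);
        return messages;
    }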
Domain Ontology
Intelligent Feedback




 Actively monitors user behaviour
    Implicit vs. explicit interaction
 Provides corrective feedback
Evaluation Results
 16 subjects, with and without ITS
 Improved task completion




 Improved learning
Intelligent Agents
 AR characters
   Virtual embodiment of system
   Multimodal input/output
 Examples
   AR Lego, Welbo, etc
   Mr Virtuoso
     - AR character more real, more fun
     - On-screen 3D and AR similar in usefulness
Conclusions
Conclusions
 AR traditionally involves tangible interaction
 New technologies support natural interaction
   Environment capture
   Natural gestures
   Multimodal interaction
 Opportunities for future research
   Mobile, intelligent systems, characters
More Information
• Mark Billinghurst
  – mark.billinghurst@hitlabnz.org
• Website
  – http://www.hitlabnz.org/


Editor's Notes

  • #31 - To create an interaction volume, the Kinect is positioned above the desired interaction space facing downwards. - A reference marker is placed in the interaction space to calculate the transform between the Kinect coordinate system and the coordinate system used by the AR viewing camera. - Users can also wear color markers on their fingers for pre-defined gesture interaction.
  • #35 - The OpenSceneGraph framework is used for rendering. The input video image is rendered as the background, with all the virtual objects rendered on top. - At the top level of the scene graph, the viewing transformation is applied such that all virtual objects are transformed so as to appear attached to the real world. - The trimesh is rendered as an array of quads, with an alpha value of zero. This allows realistic occlusion effects of the terrain and virtual objects, while not affecting the users’ view of the real environment. - A custom fragment shader was written to allow rendering of shadows to the invisible terrain.
  • #37 Appearance-based interaction has been used at the Lab before, both in AR Micromachines and PhobiAR. Flaws in these applications have motivated my work on advanced tracking and modeling. AR Micromachines did not allow for dynamic interaction – a car could be picked up, but because the motion of the hand was not known, friction could not be simulated between the car and the hand. PhobiAR introduced tracking for dynamic interaction, but it really only tracked objects in 2D. I’ll show you what I mean.. As soon as the hand is flipped the tracking fails and the illusion of realistic interaction is broken. 3D tracking was required to make the interaction in both of these applications more realistic
  • #38 Another issue with typical AR applications is the handling of occlusion. The Kinect allows a model of the environment to be developed, which can help in determining whether a real object is in front of a virtual one. MicroMachines had good success by assuming a situation such as that shown on the right, with all objects in the scene in contact with the ground. This was a fair assumption when most of the objects were books etc. However, in PhobiAR the user’s hands were often above the ground, more like the scene on the left. The thing to notice is that these two scenes are indistinguishable from the Kinect’s point of view, but completely different from the observer’s point of view. The main problem is that we don’t know enough about the shape of real-world objects to handle occlusion properly. My work aims to model real-world objects by combining views of the objects across multiple frames, allowing better occlusion.
  • #39 The gesture library will provide a C++ API for real-time recognition and tracking of hands and rigid-body objects in 3D environments. The library will support usage of single and multiple depth sensing cameras. Collision detection and physics simulation will be integrated for realistic physical interaction. Finally, learning algorithms will be implemented for recognizing hand gestures.
  • #40 The library will support usage of single and multiple depth sensing cameras. Aim for general consumer hardware.
  • #44 Interaction between real objects and the virtual balls was achieved by representing objects as collections of spheres. The location of the spheres was determined by the modeling stage while their motion was found during tracking. I used the Bullet physics engine for physics simulation.
  • #45 The AR scene was rendered using OpenSceneGraph. Because the Kinect’s viewpoint was also the user’s viewpoint, realistic occlusion was possible using the Kinect’s depth data. I did not have time to experiment with using the object models to improve occlusion from other viewpoints. Also, the addition of shadows could have significantly improved the realism of the application.