Neural Networks for Semantic Gaze Analysis in XR Settings
Lena Stubbemann, Dominik Dürrschnabel, Robert Refflinghaus (2021)
ETRA 2021: ACM Symposium on Eye Tracking Research and Applications
Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Presented by Jeong Jae-Yeop, Interaction Lab., Seoul National University of Science and Technology
Agenda
■Intro
■Approach
■Evaluation
■Conclusion and future work
■Semantic gaze analysis
The process of identifying the objects or features that receive visual and cognitive attention
• Well-controlled settings
• Visual patterns and oculometric parameters
• What users are looking at
Intro(1/6)
■Semantic gaze analysis in XR settings
ROI (Region of Interest)
• Two-dimensional depiction of an object of interest
VOI (Volume of Interest)
• Three-dimensional object that emerges from the intersection of the gaze ray with the scene
Intro(2/6)
■Annotating VOI data(1/2)
VOI data for gaze analysis
• User-specific gaze videos with constantly changing perspectives on the target object
• Objects move, vanish, reappear and change shape, size or illumination …
• Annotation is a time-consuming process
• Manual annotation is thus still considered the standard procedure
Intro(3/6)
■Annotating VOI data(2/2)
VOI annotation problem → image classification
• CAD (Computer Aided Design) model
• CNN (Convolutional Neural Network)
• The three-dimensional problem is simplified to a two-dimensional one
• A CNN can also recognize different perspectives of the same three-dimensional body
Intro(4/6)
■Data augmentation
GAN (Generative Adversarial Network)
• Image augmentation technique that adapts the training data to real environmental factors
• Overcomes the need for challenging photorealistic simulations
• Enables VOI annotation not only at the object level but also at the product-feature level
Intro(5/6)
■Addressing the annotation problem with object recognition
Methodological details (sketched in code below)
• Use a CAD model to prepare training data for Cycle-GAN
• Use Cycle-GAN to create a realistic synthetic data set
• Use the synthetic data set to train a CNN (Convolutional Neural Network)
• Predict VOIs of experimental data with the trained CNN model
Approach(1/10)
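The four steps can be read as the following pipeline skeleton. This is a minimal sketch for orientation only; all function names are hypothetical placeholders rather than the authors' code, and the bodies are deliberately elided.

```python
# Hypothetical skeleton of the four-step pipeline; names are illustrative.
from typing import List
import numpy as np

def render_cad_views(cad_path: str) -> List[np.ndarray]:
    """Step 1: render labeled 2D views of the CAD model as training images."""
    ...

def train_cyclegan(renderings: List[np.ndarray], real_frames: List[np.ndarray]):
    """Step 2: learn a mapping from CAD renderings to realistic images."""
    ...

def train_cnn(images: List[np.ndarray], labels: List[int]):
    """Step 3: train an image classifier (ResNet50v2 in the paper) on the synthetic set."""
    ...

def predict_vois(model, fixation_crops: List[np.ndarray]) -> List[int]:
    """Step 4: classify fixation-centered crops of the experimental frames as VOIs."""
    ...
```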
■Use a CAD model to prepare training data for Cycle-GAN(1/2)
The essential resource for object recognition algorithms is a suitable database
Feature-level annotation
• Based on a CAD model or virtual prototype
Approach(2/10)
■Use a CAD model to prepare training data for Cycle-GAN(2/2)
Training data
Approach(3/10)
■Experimental data
Egocentric videos, which are split into frames
Only a fixation marker, not the scan path
• Exactly one fixation marker is contained in each frame
Gaze coordinates (x, y)
Approach(4/10)
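A minimal sketch of this preprocessing step, assuming the egocentric video is a file readable by OpenCV and that the per-frame gaze coordinates are exported separately by the eye tracking software (the paper's exact recording format is not specified here):

```python
import cv2  # OpenCV: pip install opencv-python

def video_to_frames(video_path: str):
    """Split an egocentric gaze video into individual frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        frames.append(frame)
    cap.release()
    return frames

# Each frame is later paired with its gaze coordinates (x, y).
frames = video_to_frames("participant_01.mp4")  # hypothetical file name
```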
■Use Cycle-GAN to create a realistic synthetic data set
Approach(5/10)
■GAN (Generative Adversarial Network)
Approach(6/10)
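For reference, the standard GAN objective from Goodfellow et al. (2014), which this slide's figure presumably illustrates: a generator G and a discriminator D play the minimax game

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] +
\mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```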
■Cycle-GAN (Cycle-Consistent Generative Adversarial Network)
Approach(7/10)
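Cycle-GAN (Zhu et al., 2017) combines two adversarial losses with a cycle-consistency term, so that translating a CAD rendering into the realistic domain and back reproduces the original image:

```latex
\mathcal{L}(G, F, D_X, D_Y) =
\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) +
\mathcal{L}_{\text{GAN}}(F, D_X, Y, X) +
\lambda\,\mathcal{L}_{\text{cyc}}(G, F),
\qquad
\mathcal{L}_{\text{cyc}}(G, F) =
\mathbb{E}_{x}\left[\lVert F(G(x)) - x \rVert_1\right] +
\mathbb{E}_{y}\left[\lVert G(F(y)) - y \rVert_1\right]
```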
■Use synthetic data set to train CNN (Convolutional Neural Network)
Approach(8/10)
■Object recognition
Object localization combined with image classification
• Segmentation groups pixels into instances by means of adjacent pixels that share textures, colors, or intensities
• Enables feature-level recognition
Eye tracking data
• Makes semantic or instance segmentation dispensable
• Provides the exact coordinates of the fixation relative to the gaze replay
Approach(9/10)
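This is the key simplification: because the tracker already supplies the fixation point, a fixed-size crop around it can replace any segmentation stage. A minimal sketch below; the 224 px size comes from the evaluation slides, while the centering and zero-padding behavior are assumptions:

```python
import numpy as np

def fixation_crop(frame: np.ndarray, x: float, y: float, size: int = 224) -> np.ndarray:
    """Cut a size x size thumbnail centered on the gaze point (x, y).

    The frame is zero-padded first so that crops near the image
    border keep the full thumbnail size.
    """
    half = size // 2
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)))
    cx, cy = int(x) + half, int(y) + half
    return padded[cy - half:cy + half, cx - half:cx + half]
```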
■Predict VOIs of experimental data with trained CNN model
ResNet50v2
Approach(10/10)
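A sketch of the prediction step, assuming a trained Keras ResNet50V2 classifier saved as model.h5 and the fixation_crop helper sketched earlier (both names are illustrative, not from the paper):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("model.h5")  # trained ResNet50v2 classifier

def predict_voi(frame: np.ndarray, gaze_x: float, gaze_y: float) -> int:
    """Classify the VOI under the fixation in one experimental frame."""
    thumb = fixation_crop(frame, gaze_x, gaze_y)         # 224 x 224 x 3 thumbnail
    batch = thumb[np.newaxis].astype("float32") / 255.0  # add batch axis, scale to [0, 1]
    probs = model.predict(batch, verbose=0)[0]
    return int(np.argmax(probs))                         # index of the predicted VOI class
```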
■Experimental setup
Real-world and virtual-reality settings
Fully automated coffee machine
VOI annotation at the feature level
Evaluation(1/7)
■Conditions/baseline
Comparison with an existing method
• EyeSee3D (https://eyesee3d.eyemovementresearch.com/)
Ground truth: manual annotation
Performance metrics
• Weighted precision and recall, weighted F1-score
Evaluation(2/7)
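The weighted variants average the per-class scores using the class support as weights, so frequently fixated VOIs contribute more; for class c with n_c ground-truth samples out of N in total:

```latex
P_{\text{weighted}} = \sum_{c} \frac{n_c}{N} P_c,
\qquad
R_{\text{weighted}} = \sum_{c} \frac{n_c}{N} R_c,
\qquad
\text{F1}_{\text{weighted}} = \sum_{c} \frac{n_c}{N} \cdot \frac{2 P_c R_c}{P_c + R_c}
```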
■User study design
Participants
• 24 (6 female, 18 male)
• 3-point calibration of the eye tracking system
• Interact with the product in both the virtual and the real setting
First phase of the experiment
• Freely explore the object for 60 seconds
• Free movement around the machine
Second phase of the experiment
• Subjects are asked about their perceptual impressions
• They are led to certain product features as they solve tasks such as brewing coffee
Evaluation(3/7)
■Apparatus
Unity3D
• Two projectors with a resolution of 1920 × 1200 pixels each
SMI mobile eye tracking glasses + SMI 3D-6D head tracking
Outside-in motion tracking: OptiTrack PrimeX 13W
Fixation detection with BeGaze 3.7
Desktop
• Nvidia GeForce RTX 2060 SUPER
• 8 GB RAM
Evaluation(4/7)
■Network training(1/2)
Thumbnail size: 224 × 224 px
Image augmentation using Cycle-GAN
• Simulation images: 1,000
• Virtual images: 1,000
• Real images: 1,000
• Default Cycle-GAN settings, except trained for 50 epochs
Total training data after augmentation
• Simulation images: 100,000
• Virtual images: 100,000
• Real images: 100,000
Evaluation(5/7)
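How the 1,000 source images per domain become 100,000 is not detailed on the slide; one plausible reading is roughly 100 augmented variants per image. A minimal sketch under that assumption, using standard Keras preprocessing layers rather than the paper's exact pipeline (the transform choices and parameter ranges are illustrative):

```python
import tensorflow as tf

# Random geometric/photometric transforms; ranges are assumptions.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomBrightness(0.2),
])

def expand(images: tf.Tensor, factor: int = 100) -> tf.Tensor:
    """Produce `factor` randomly augmented variants of every source image."""
    out = []
    for img in images:
        batch = tf.repeat(img[tf.newaxis], factor, axis=0)
        out.append(augment(batch, training=True))  # training=True activates randomness
    return tf.concat(out, axis=0)  # e.g. 1,000 images -> 100,000
```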
■Data preparation
Evaluation(6/7)
■Network training(2/2)
CNN classification
• ResNet50v2 architecture
• Output layer with 12 neurons (10 VOIs + "coffee machine but no VOI" and "no coffee machine")
• Input size 224 × 224
• Adam optimizer, learning rate 0.001, 20 epochs, sparse categorical cross-entropy loss
Evaluation(7/7)
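These hyperparameters map directly onto a Keras setup. A minimal sketch; dataset loading is elided, and training from scratch (weights=None) is an assumption:

```python
import tensorflow as tf

# ResNet50V2 backbone with a 12-class softmax head
# (10 VOIs + "coffee machine but no VOI" + "no coffee machine").
backbone = tf.keras.applications.ResNet50V2(
    include_top=False, weights=None,  # from-scratch training is an assumption
    input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(12, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # integer class labels
    metrics=["accuracy"],
)

# train_images: (N, 224, 224, 3) float32, train_labels: (N,) integer VOI ids
# model.fit(train_images, train_labels, epochs=20)
```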
■Result
The CNN approach performs slightly better in virtual reality than in the real world
Human annotation
• About 30,000 images took roughly 25 hours (20 images per minute)
Conclusion and future work(1/7)
■Discussion(1/3)
In the failure cases, the fixation marker is ambiguously located between four different VOIs and the default classes
• Some of these are adjacent, while others are simultaneously hidden due to depth effects
Conclusion and future work(2/7)
■Discussion(2/3)
Some VOIs are recognized well and some are not
• Well classified: the display
A standard classification problem
Conclusion and future work(3/7)
■Discussion(3/3)
Cycle-GAN can also degrade image quality
• Possible remedy: use the raw gaze coordinates instead of a rendered fixation marker
Conclusion and future work(4/7)
■Limitation
The study gives a proof of concept for two different domains (real and virtual)
• Evaluated on only a single object, a coffee machine
Conclusion and future work(5/7)
■Conclusion
Proposes a method for semantic gaze analysis using machine learning, eliminating the resource-intensive process of human annotation
Neither markers nor motion tracking systems are required
Contains no personal bias and is thus not prone to evaluator effects
The same methodical evaluation can be used across platforms
Conclusion and future work(6/7)
■Future work
Our work is to be seen as a proof of concept.
• Potential future work: further increasing the accuracy of predictions
Chances for improving our approach
• More advanced image classification methods, or further improving the image augmentation techniques
Conclusion and future work(7/7)