Neural Networks for Semantic Gaze Analysis in XR Settings
Lena Stubbemann, Dominik Dürrschnabel, Robert Refflinghaus (2021)
ETRA 2021: ACM Symposium on Eye Tracking Research and Applications
Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Presented by Jeong Jae-Yeop, Interaction Lab., Seoul National University of Science and Technology
Agenda
■Intro
■Approach
■Evaluation
■Conclusion and future work
■Semantic gaze analysis
The process of identifying the objects or features that receive visual and cognitive attention
• Well-controlled settings
• Visual patterns and oculometric parameters
• What users are looking at
Intro(1/6)
■Semantic gaze analysis in XR settings
ROI (Region of Interest)
• Two-dimensional depiction of an object of interest
VOI (Volume of Interest)
• Three-dimensional object that emerges from the intersection of the gaze ray with the scene
Intro(2/6)
■Annotating VOI data(1/2)
VOI data for gaze analysis
• User-specific gaze videos with constantly changing perspectives on the target object
• Objects move, vanish, reappear and change shape, size or illumination …
• Annotation is a time-consuming process
• Manual annotation is thus still considered the standard procedure
Intro(3/6)
■Annotating VOI data(2/2)
VOI annotation problem → image classification
• CAD (Computer Aided Design) model
• CNN (Convolutional Neural Network)
• The three-dimensional problem is simplified to a two-dimensional one
• A CNN can also recognize different perspectives of the same three-dimensional body
Intro(4/6)
■Data augmentation
GAN (Generative Adversarial Network)
• Image augmentation technique that adapts the training data to real environmental factors
• Overcomes the need for challenging photorealistic simulations
• Enables VOI annotation not only at the object level but also at the product-feature level
Intro(5/6)
■Addressing the annotation problem with object recognition
Methodological details (sketched in code below)
• Use a CAD model to prepare training data for Cycle-GAN
• Use Cycle-GAN to create a realistic synthetic data set
• Use the synthetic data set to train a CNN (Convolutional Neural Network)
• Predict VOIs of experimental data with the trained CNN model
Approach(1/10)
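The four steps can be read as the following pipeline skeleton. This is a minimal sketch for orientation only; all function names are hypothetical placeholders rather than the authors' code, and the bodies are deliberately elided.

```python
# Hypothetical skeleton of the four-step pipeline; names are illustrative.
from typing import List
import numpy as np

def render_cad_views(cad_path: str) -> List[np.ndarray]:
    """Step 1: render labeled 2D views of the CAD model as training images."""
    ...

def train_cyclegan(renderings: List[np.ndarray], real_frames: List[np.ndarray]):
    """Step 2: learn a mapping from CAD renderings to realistic images."""
    ...

def train_cnn(images: List[np.ndarray], labels: List[int]):
    """Step 3: train an image classifier (ResNet50v2 in the paper) on the synthetic set."""
    ...

def predict_vois(model, fixation_crops: List[np.ndarray]) -> List[int]:
    """Step 4: classify fixation-centered crops of the experimental frames as VOIs."""
    ...
```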
■Use a CAD model to prepare training data for Cycle-GAN(1/2)
The essential resource for object recognition algorithms is a suitable database
Feature-level annotation
• Based on a CAD model or virtual prototype
Approach(2/10)
■Use a CAD model to prepare training data for Cycle-GAN(2/2)
Training data
Approach(3/10)
■Experimental data
Egocentric videos, which are split into frames
Only a fixation marker, not the scan path
• Exactly one fixation marker is contained in each frame
Gaze coordinates (x, y)
Approach(4/10)
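A minimal sketch of this preprocessing step, assuming the egocentric video is a file readable by OpenCV and that the per-frame gaze coordinates are exported separately by the eye tracking software (the paper's exact recording format is not specified here):

```python
import cv2  # OpenCV: pip install opencv-python

def video_to_frames(video_path: str):
    """Split an egocentric gaze video into individual frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        frames.append(frame)
    cap.release()
    return frames

# Each frame is later paired with its gaze coordinates (x, y).
frames = video_to_frames("participant_01.mp4")  # hypothetical file name
```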
■Use Cycle-GAN to create a realistic synthetic data set
Approach(5/10)
■GAN (Generative Adversarial Network)
Approach(6/10)
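For reference, the standard GAN objective from Goodfellow et al. (2014), which this slide's figure presumably illustrates: a generator G and a discriminator D play the minimax game

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] +
\mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```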
■Cycle-GAN (Cycle-Consistent Generative Adversarial Network)
Approach(7/10)
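Cycle-GAN (Zhu et al., 2017) combines two adversarial losses with a cycle-consistency term, so that translating a CAD rendering into the realistic domain and back reproduces the original image:

```latex
\mathcal{L}(G, F, D_X, D_Y) =
\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) +
\mathcal{L}_{\text{GAN}}(F, D_X, Y, X) +
\lambda\,\mathcal{L}_{\text{cyc}}(G, F),
\qquad
\mathcal{L}_{\text{cyc}}(G, F) =
\mathbb{E}_{x}\left[\lVert F(G(x)) - x \rVert_1\right] +
\mathbb{E}_{y}\left[\lVert G(F(y)) - y \rVert_1\right]
```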
■Use synthetic data set to train CNN (Convolutional Neural Network)
Approach(8/10)
■Object recognition
Object localization combined with image classification
• Segmentation groups pixels into instances by means of adjacent pixels that share textures, colors, or intensities
• Enables feature-level recognition
Eye tracking data
• Makes semantic or instance segmentation dispensable
• Provides the exact coordinates of the fixation relative to the gaze replay
Approach(9/10)
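This is the key simplification: because the tracker already supplies the fixation point, a fixed-size crop around it can replace any segmentation stage. A minimal sketch below; the 224 px size comes from the evaluation slides, while the centering and zero-padding behavior are assumptions:

```python
import numpy as np

def fixation_crop(frame: np.ndarray, x: float, y: float, size: int = 224) -> np.ndarray:
    """Cut a size x size thumbnail centered on the gaze point (x, y).

    The frame is zero-padded first so that crops near the image
    border keep the full thumbnail size.
    """
    half = size // 2
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)))
    cx, cy = int(x) + half, int(y) + half
    return padded[cy - half:cy + half, cx - half:cx + half]
```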
■Predict VOIs of experimental data with trained CNN model
ResNet50v2
Approach(10/10)
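A sketch of the prediction step, assuming a trained Keras ResNet50V2 classifier saved as model.h5 and the fixation_crop helper sketched earlier (both names are illustrative, not from the paper):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("model.h5")  # trained ResNet50v2 classifier

def predict_voi(frame: np.ndarray, gaze_x: float, gaze_y: float) -> int:
    """Classify the VOI under the fixation in one experimental frame."""
    thumb = fixation_crop(frame, gaze_x, gaze_y)         # 224 x 224 x 3 thumbnail
    batch = thumb[np.newaxis].astype("float32") / 255.0  # add batch axis, scale to [0, 1]
    probs = model.predict(batch, verbose=0)[0]
    return int(np.argmax(probs))                         # index of the predicted VOI class
```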
■Experimental setup
Real-world and virtual-reality settings
Fully automated coffee machine
VOI annotation at the feature level
Evaluation(1/7)
■Conditions/baseline
Comparison with an existing method
• EyeSee3D (https://eyesee3d.eyemovementresearch.com/)
Ground truth: manual annotation
Performance metrics
• Weighted precision and recall, weighted F1-score
Evaluation(2/7)
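The weighted variants average the per-class scores using the class support as weights, so frequently fixated VOIs contribute more; for class c with n_c ground-truth samples out of N in total:

```latex
P_{\text{weighted}} = \sum_{c} \frac{n_c}{N} P_c,
\qquad
R_{\text{weighted}} = \sum_{c} \frac{n_c}{N} R_c,
\qquad
\text{F1}_{\text{weighted}} = \sum_{c} \frac{n_c}{N} \cdot \frac{2 P_c R_c}{P_c + R_c}
```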
■User study design
Participants
• 24 (6 female, 18 male)
• 3-point calibration of the eye tracking system
• Interact with the product in both the virtual and the real setting
First phase of the experiment
• Freely explore the object for 60 seconds
• Free movement around the machine
Second phase of the experiment
• Subjects are asked about their perceptual impressions
• They are led to certain product features as they solve tasks such as brewing coffee
Evaluation(3/7)
■Apparatus
Unity3D
• Two projectors with a resolution of 1920 × 1200 pixels each
SMI mobile eye tracking glasses + SMI 3D-6D head tracking
Outside-in motion tracking: OptiTrack PrimeX 13W
Fixation detection with BeGaze 3.7
Desktop
• Nvidia GeForce RTX 2060 SUPER
• 8 GB RAM
Evaluation(4/7)
■Network training(1/2)
Thumbnail size: 224 × 224 px
Image augmentation using Cycle-GAN
• Simulation images: 1,000
• Virtual images: 1,000
• Real images: 1,000
• Default Cycle-GAN settings, except trained for 50 epochs
Total training data after augmentation
• Simulation images: 100,000
• Virtual images: 100,000
• Real images: 100,000
Evaluation(5/7)
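How the 1,000 source images per domain become 100,000 is not detailed on the slide; one plausible reading is roughly 100 augmented variants per image. A minimal sketch under that assumption, using standard Keras preprocessing layers rather than the paper's exact pipeline (the transform choices and parameter ranges are illustrative):

```python
import tensorflow as tf

# Random geometric/photometric transforms; ranges are assumptions.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomBrightness(0.2),
])

def expand(images: tf.Tensor, factor: int = 100) -> tf.Tensor:
    """Produce `factor` randomly augmented variants of every source image."""
    out = []
    for img in images:
        batch = tf.repeat(img[tf.newaxis], factor, axis=0)
        out.append(augment(batch, training=True))  # training=True activates randomness
    return tf.concat(out, axis=0)  # e.g. 1,000 images -> 100,000
```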
■Data preparation
Evaluation(6/7)
■Network training(2/2)
CNN classification
• ResNet50v2 architecture
• Output layer with 12 neurons (10 VOIs + "coffee machine but no VOI" and "no coffee machine")
• Input size 224 × 224
• Adam optimizer, learning rate 0.001, 20 epochs, sparse categorical cross-entropy loss
Evaluation(7/7)
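These hyperparameters map directly onto a Keras setup. A minimal sketch; dataset loading is elided, and training from scratch (weights=None) is an assumption:

```python
import tensorflow as tf

# ResNet50V2 backbone with a 12-class softmax head
# (10 VOIs + "coffee machine but no VOI" + "no coffee machine").
backbone = tf.keras.applications.ResNet50V2(
    include_top=False, weights=None,  # from-scratch training is an assumption
    input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(12, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # integer class labels
    metrics=["accuracy"],
)

# train_images: (N, 224, 224, 3) float32, train_labels: (N,) integer VOI ids
# model.fit(train_images, train_labels, epochs=20)
```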
■Result
The CNN approach performs slightly better in virtual reality than in the real world
Human annotation
• About 30,000 images took roughly 25 hours (20 images per minute)
Conclusion and future work(1/7)
■Discussion(1/3)
In the failure cases, the fixation marker is ambiguously located between four different VOIs and the default classes
• Some of these are adjacent, while others are simultaneously hidden due to depth effects
Conclusion and future work(2/7)
■Discussion(2/3)
Some VOIs are recognized well and some are not
• Well classified: the display
A standard classification problem
Conclusion and future work(3/7)
■Discussion(3/3)
Cycle-GAN can also degrade image quality
• Possible remedy: use the raw gaze coordinates instead of a rendered fixation marker
Conclusion and future work(4/7)
■Limitation
The study gives a proof of concept for two different domains (real and virtual)
• Evaluated on only a single object, a coffee machine
Conclusion and future work(5/7)
■Conclusion
Proposes a method for semantic gaze analysis using machine learning, eliminating the resource-intensive process of human annotation
Neither markers nor motion tracking systems are required
Contains no personal bias and is thus not prone to evaluator effects
The same methodical evaluation can be used across platforms
Conclusion and future work(6/7)
■Future work
Our work is to be seen as a proof of concept.
• Potential future work: further increasing the accuracy of predictions
Chances for improving our approach
• More advanced image classification methods, or further improving the image augmentation techniques
Conclusion and future work(7/7)