Screen2Vec: Semantic Embedding of GUI
Screens and GUI Components
Toby Jia-Jun Li, Lindsay Popowski, Tom M. Mitchell, Brad A. Myers
2021 CHI Conference on Human Factors in Computing Systems
Background
• Existing approaches to representing GUI screens are limited
  ◦ Capturing only the text on the screen
    • Missing the information encoded in layout and design patterns
  ◦ Focusing on visual design patterns and GUI layouts
    • Not capturing the content in the GUI
• Prior approaches use supervised learning with large datasets for specific task objectives
  ◦ Requiring labeling effort
  ◦ Inapplicable to different downstream tasks
→ Goal: semantic representations of GUI screens and components
Contribution
• Presenting a self-supervised technique that requires no human-labeled data
• Generating more comprehensive semantic embeddings of GUI screens and components using
  ◦ Textual content
  ◦ Visual design
  ◦ Layout patterns
  ◦ App metadata
• Training an open-sourced GUI embedding model using Screen2Vec on the RICO dataset
• Providing sample downstream tasks such as
  ◦ Nearest neighbor retrieval
  ◦ Composability-based retrieval
  ◦ Representing mobile tasks
Architecture of Screen2Vec
• Two-level architecture
[Architecture diagram: GUI Component level and GUI Screen level]
Architecture of Screen2Vec: GUI Component Level
• Input
  ◦ 768-dimensional embedding vector of the GUI component's text label
    • Encoded using a pre-trained Sentence-BERT model
  ◦ 6-dimensional class embedding vector
    • Representing the class type of the GUI component
• The weights of the class embeddings and of the linear layer (text + class) are optimized jointly
• Output
  ◦ 768-dimensional embedding vector (see the sketch below)
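Below is a minimal PyTorch sketch of this component-level encoder. The module and variable names are illustrative, not taken from the released Screen2Vec code.

```python
import torch
import torch.nn as nn

class ComponentEncoder(nn.Module):
    """Sketch of the GUI-component-level encoder (names are illustrative)."""
    def __init__(self, num_classes=26, text_dim=768, class_dim=6):
        super().__init__()
        # Learned 6-dim embedding for each of the 26 GUI class categories
        self.class_embedding = nn.Embedding(num_classes, class_dim)
        # Linear layer projecting the concatenated (768 + 6)-dim vector back to 768 dims
        self.project = nn.Linear(text_dim + class_dim, text_dim)

    def forward(self, text_vec, class_id):
        # text_vec: (batch, 768) Sentence-BERT embedding of the component's text label
        # class_id: (batch,)    integer GUI class type
        cls_vec = self.class_embedding(class_id)            # (batch, 6)
        combined = torch.cat([text_vec, cls_vec], dim=-1)   # (batch, 774)
        return self.project(combined)                       # (batch, 768)

# Example: one component with a stand-in text vector and class id 3
enc = ComponentEncoder()
print(enc(torch.randn(1, 768), torch.tensor([3])).shape)   # torch.Size([1, 768])
```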
Architecture of Screen2Vec: GUI Screen Level
1) Collection of GUI component embedding vectors
   ◦ Combined into a single 768-dimensional vector using an RNN
2) 64-dimensional layout embedding vector
   ◦ Encoding the screen's visual layout
3) 768-dimensional embedding vector of the textual App Store description
   ◦ Encoded with a pre-trained Sentence-BERT model
• The GUI (1) and layout (2) vectors are combined using a linear layer → 768-dimensional embedding vector
• After training, the description (3) vector is concatenated → 1536-dimensional embedding vector
• The RNN weights and the linear layer weights are trained on a Continuous Bag of Words-style prediction task (see the sketch below)
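A minimal PyTorch sketch of this screen-level encoder. The slides only specify an RNN; the GRU cell and all names here are assumptions.

```python
import torch
import torch.nn as nn

class ScreenEncoder(nn.Module):
    """Sketch of the GUI-screen-level encoder (GRU as a stand-in RNN, names illustrative)."""
    def __init__(self, comp_dim=768, layout_dim=64):
        super().__init__()
        # RNN folding the sequence of component embeddings into one 768-dim vector
        self.rnn = nn.GRU(comp_dim, comp_dim, batch_first=True)
        # Linear layer combining the RNN output with the 64-dim layout embedding
        self.combine = nn.Linear(comp_dim + layout_dim, comp_dim)

    def forward(self, comp_seq, layout_vec, desc_vec):
        # comp_seq:   (batch, n_components, 768), in pre-order traversal order
        # layout_vec: (batch, 64)  autoencoder embedding of the screen layout
        # desc_vec:   (batch, 768) Sentence-BERT embedding of the app-store description
        _, h = self.rnn(comp_seq)                         # h: (1, batch, 768)
        content = torch.cat([h[-1], layout_vec], dim=-1)  # (batch, 832)
        screen = self.combine(content)                    # (batch, 768) content + layout
        # After training, the description vector is concatenated on top
        return torch.cat([screen, desc_vec], dim=-1)      # (batch, 1536)

enc = ScreenEncoder()
print(enc(torch.randn(1, 12, 768), torch.randn(1, 64), torch.randn(1, 768)).shape)
```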
Dataset
• RICO dataset
  ◦ Contains interaction traces covering 66,261 unique GUI screens
  ◦ From 9,384 free Android apps
• Specifics
  ◦ Each screen comes with a screenshot image
  ◦ The screen's "view hierarchy" (analogous to a DOM tree in HTML) is stored in a JSON file
    • Each node includes
      • Class type
      • Textual content
      • Location as a bounding box on the screen
      • Properties such as whether it is clickable, focused, or scrollable
  ◦ Each interaction trace is represented as a sequence of GUI screens
    • Annotated with which location was clicked or swiped
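A small sketch of reading one RICO view-hierarchy file. The field names ("activity", "root", "class", "text", "bounds", "clickable", "children") reflect the RICO JSON format as commonly documented and should be checked against the dataset; the filename is a placeholder.

```python
import json

def iter_nodes(node):
    """Pre-order traversal over a view-hierarchy node and its children."""
    yield node
    for child in node.get("children") or []:
        yield from iter_nodes(child)

with open("hierarchy.json") as f:          # placeholder path to one RICO JSON file
    hierarchy = json.load(f)

root = hierarchy["activity"]["root"]       # root node of the screen's view hierarchy
for n in iter_nodes(root):
    # class type, text label, bounding box, and clickability of each GUI component
    print(n.get("class"), n.get("text"), n.get("bounds"), n.get("clickable"))
```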
Implementation Details: GUI Class Type Embeddings
• Encoding the 26 class categories into a vector space
• Mapping each category to a continuous 6-dimensional vector
• The embedding values are optimized by training on the GUI component prediction task
  ◦ Semantically similar categories end up close together in the vector space (see the sketch below)
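A sketch of how the class categories could map to learned 6-dimensional vectors. The class names listed are an illustrative subset of the 26 categories, and the similarities are only meaningful after the embedding has been trained on the component prediction task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical subset of the 26 GUI class categories (names illustrative)
CLASS_TYPES = ["Button", "ImageButton", "TextView", "EditText", "CheckBox"]
class_to_id = {name: i for i, name in enumerate(CLASS_TYPES)}

# One learned 6-dim vector per category; trained jointly with the prediction task
class_embedding = nn.Embedding(num_embeddings=26, embedding_dim=6)

def class_similarity(a, b):
    va = class_embedding(torch.tensor(class_to_id[a]))
    vb = class_embedding(torch.tensor(class_to_id[b]))
    return F.cosine_similarity(va, vb, dim=0).item()

# After training, related types (e.g. Button vs. ImageButton) should score higher
print(class_similarity("Button", "ImageButton"))
```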
Implementation Details: GUI Component Context
• The context of a component is defined as its 16 nearest components
• Two measures of on-screen distance for determining the context (see the sketch below)
  ◦ Euclidean: straight-line distance on the screen
    • Measured in pixels
  ◦ Hierarchical: distance between two GUI components in the hierarchical view tree
    • Parent and child are at distance 1
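A short sketch of the Euclidean variant, assuming each component carries a `bounds = [left, top, right, bottom]` box in pixels; the hierarchical variant would instead count edges between nodes in the view tree (parent-child = 1).

```python
import math

def center(bounds):
    # bounds = [left, top, right, bottom] in screen pixels
    l, t, r, b = bounds
    return ((l + r) / 2, (t + b) / 2)

def euclidean_context(target, components, k=16):
    """Return the k components closest to `target` by straight-line pixel distance."""
    cx, cy = center(target["bounds"])
    def dist(c):
        x, y = center(c["bounds"])
        return math.hypot(x - cx, y - cy)
    others = [c for c in components if c is not target]
    return sorted(others, key=dist)[:k]
```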
Implementation Details: Linear Layer
• Combining multiple vectors into a lower-dimensional vector
• GUI component level
  ◦ Concatenating the 768-dimensional text vector with the 6-dimensional class vector
  ◦ Shrinking the result back down to 768 dimensions
  ◦ Creating a 774 x 768 weight matrix
• GUI screen level
  ◦ Combining the 768-dimensional content vector with the 64-dimensional layout vector
  ◦ Producing a single 768-dimensional vector for screen content and layout
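A minimal snippet making the dimension bookkeeping concrete (note that PyTorch stores the weight as out x in, i.e. the transpose of the 774 x 768 figure above).

```python
import torch.nn as nn

# Component level: (768-dim text + 6-dim class) -> 768 dims
component_linear = nn.Linear(768 + 6, 768)

# Screen level: (768-dim RNN output + 64-dim layout vector) -> 768 dims
screen_linear = nn.Linear(768 + 64, 768)

print(component_linear.weight.shape)  # torch.Size([768, 774])
print(screen_linear.weight.shape)     # torch.Size([768, 832])
```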
Implementation Details: Text Embeddings
• Using a pre-trained Sentence-BERT language model
  ◦ Trained on the SNLI and Multi-Genre NLI datasets with mean pooling
• Encoding text labels and app descriptions into 768-dimensional vectors
• Mapping semantically similar sentences and phrases to nearby vectors
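A usage sketch with the sentence-transformers library. The checkpoint name "bert-base-nli-mean-tokens" matches a 768-dimensional Sentence-BERT model trained on SNLI / Multi-Genre NLI with mean pooling; whether it is the exact checkpoint used by Screen2Vec is an assumption.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

texts = ["Sign in", "Log in to your account", "Request a ride"]
embeddings = model.encode(texts)   # numpy array of shape (3, 768)
print(embeddings.shape)
```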
Implementation Details: Layout Embeddings
• Extracting the layout from a screenshot
  ◦ Differentiating between text and non-text GUI components
• Using an autoencoder to encode each layout into a 64-dimensional embedding vector (sketched below)
  ◦ Encoder input dimension: 11,200
  ◦ Two hidden layers of 2,048 and 256
  ◦ ReLU activations to remove negative values
  ◦ Reconstruction loss measured by MSE
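A sketch of such an autoencoder using the layer sizes from this slide; the exact placement of activations is an assumption.

```python
import torch
import torch.nn as nn

class LayoutAutoencoder(nn.Module):
    """Sketch of the layout autoencoder: 11,200 -> 2,048 -> 256 -> 64 and back."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(11200, 2048), nn.ReLU(),
            nn.Linear(2048, 256), nn.ReLU(),
            nn.Linear(256, 64),
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 2048), nn.ReLU(),
            nn.Linear(2048, 11200),
        )

    def forward(self, x):
        z = self.encoder(x)            # 64-dim layout embedding
        return self.decoder(z), z

model = LayoutAutoencoder()
x = torch.rand(1, 11200)               # flattened text / non-text layout representation
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction (MSE) loss
```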
Implementation Details: GUI Embedding Combining Layer
• Combining the embedding vectors of multiple GUI components
• The GUI component embeddings are fed into the RNN
  ◦ In the pre-order traversal order of the hierarchy tree
• The RNN starts with a hidden state of zero; its final ((n−1)-th) output is fed into the linear layer (see the sketch below)
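A short sketch of this combining step, using a GRU as a stand-in for the unspecified RNN cell.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=768, hidden_size=768, batch_first=True)  # GRU as a stand-in RNN

comp_seq = torch.randn(1, 9, 768)   # component embeddings in pre-order traversal order
h0 = torch.zeros(1, 1, 768)         # hidden state starts at zero
outputs, _ = rnn(comp_seq, h0)      # outputs: (1, n, 768)

last = outputs[:, -1, :]            # the (n-1)-th, i.e. last, output
combine = nn.Linear(768 + 64, 768)  # linear layer that also takes the 64-dim layout vector
screen_vec = combine(torch.cat([last, torch.randn(1, 64)], dim=-1))
```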
Training Configuration
• Training: 90% of the data; validation: 10%
• Cross-entropy loss with the Adam optimizer
• Learning rate: 0.001; batch size: 256
• GUI component model: 120 epochs; GUI screen model: 80-120 epochs
• Total loss (see the sketch below)
  ◦ Component
    • Total loss = Loss(text prediction) + Loss(class type prediction)
  ◦ Screen
    • Negative sampling
    • The prediction is compared against the correct screen and a sample of negative data
    • Negatives: a random sample of 128 other screens from the same app
    • Helps differentiate between different screens within the same app
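A rough sketch of how the two losses could be computed. The negative-sampling scoring (dot products followed by cross entropy) and all function names are assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

# Component model: total loss = text-prediction loss + class-type-prediction loss
def component_loss(text_logits, text_target, class_logits, class_target):
    return (F.cross_entropy(text_logits, text_target)
            + F.cross_entropy(class_logits, class_target))

# Screen model: the predicted vector is scored against the correct screen plus
# 128 negative screens sampled from the same app; the correct screen is index 0.
def screen_loss(pred, correct, negatives):
    # pred: (768,), correct: (768,), negatives: (128, 768)
    candidates = torch.cat([correct.unsqueeze(0), negatives], dim=0)  # (129, 768)
    scores = candidates @ pred                                        # (129,)
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
```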
Baselines
• Text Embedding Only (similar textual content)
  ◦ Screen embedding method used in SOVITE
  ◦ Computed by averaging the text embedding vectors of all the text on the screen (see the sketch below)
• Layout Embedding Only (similar layout)
  ◦ Screen embedding method used in the original RICO paper
  ◦ Computed by the layout autoencoder to represent the screen
• Visual Embedding Only (similar visuals)
  ◦ Uses the raw screenshot image instead of the layout
  ◦ Inspired by VASTA, Sikuli, and HILC
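A one-function sketch of the TextOnly baseline, reusing the Sentence-BERT checkpoint assumption from earlier.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

def text_only_embedding(screen_texts):
    """TextOnly baseline: average the Sentence-BERT vectors of all on-screen text."""
    if not screen_texts:
        return np.zeros(768)
    return model.encode(screen_texts).mean(axis=0)

print(text_only_embedding(["Checkout", "Total: $12.99", "Pay now"]).shape)  # (768,)
```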
Results
• Task: predicting each GUI screen in all the GUI interaction traces in the RICO dataset from its context
  ◦ 3 model versions compared
    • EUCLIDEAN context with spatial info (locations of GUI components and screen layouts)
    • HIERARCHICAL context with the same spatial info
    • EUCLIDEAN context without spatial info
Sample Downstream Tasks: Nearest Neighbors
• The main purpose is to produce distributed vector representations that encode useful semantic, layout, and design properties
• Compare the similarity of the nearest-neighbor results produced by the different models (see the retrieval sketch below)
Methods
• Select 50 screens spanning different apps and app domains
• Retrieve the top-5 most similar screens using each of the 3 models
• 79 Mechanical Turk workers participated
• Each worker saw the top-5 most similar screens for 5 source screens, produced by the 3 models
• The questionnaire covered
  ◦ (1) App similarity, (2) screen type similarity, (3) content similarity
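A small sketch of nearest-neighbor retrieval by cosine similarity over precomputed Screen2Vec screen vectors; all names are illustrative.

```python
import numpy as np

def top_k_neighbors(query, screen_embeddings, k=5):
    """Indices of the k screens most similar to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    M = screen_embeddings / np.linalg.norm(screen_embeddings, axis=1, keepdims=True)
    return np.argsort(-(M @ q))[:k]

# Stand-in corpus of 1536-dim Screen2Vec screen embeddings
screens = np.random.rand(1000, 1536)
# If the query itself is in the corpus, it will rank first
print(top_k_neighbors(screens[0], screens, k=5))
```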
Sample Downstream Tasks: Nearest Neighbors
Results
• The differences between the mean ratings of the Screen2Vec model and both the TextOnly and LayoutOnly models are significant (non-parametric Mann-Whitney U test)
• The top-5 most similar screens were retrieved using each of the 3 models
Sample Downstream Tasks: Nearest Neighbors
Observation
• Screen2Vec generates more comprehensive representations
  ◦ For the "Request ride" screen in Lyft, the nearest neighbors include
    • "Get direction" in Uber Driver
    • "Select navigation type" in the Waze app
    • "Request ride" in Free Now
  ◦ A MapView takes up the majority of each screen
  ◦ All feature a menu/information card in the bottom 1/3-1/4 of the screen
• The TextOnly-generated results are semantically similar to "payment"
• The LayoutOnly-generated results score lower on content and app-context similarity
Sample Downstream Tasks: Embedding Composability
Word2Vec
• "Man is to woman as brother is to sister"
• (brother − man + woman) results in an embedding vector representing "sister"
Screen2Vec
• Marriott app's "hotel booking" screen + (Cheapoair app's "search result" screen − Cheapoair app's "hotel booking" screen)
• The top result is the "search result" screen in the Marriott app (see the sketch below)
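The same analogy expressed as vector arithmetic, assuming hypothetical precomputed embeddings for the named screens.

```python
import numpy as np

def nearest(vec, screens):
    sims = screens @ vec / (np.linalg.norm(screens, axis=1) * np.linalg.norm(vec))
    return int(np.argmax(sims))

# Hypothetical precomputed 1536-dim Screen2Vec embeddings for the screens involved
marriott_booking, cheapoair_results, cheapoair_booking = np.random.rand(3, 1536)

# marriott_booking + (cheapoair_results - cheapoair_booking) ≈ marriott_results
query = marriott_booking + (cheapoair_results - cheapoair_booking)

all_screens = np.random.rand(5000, 1536)   # stand-in corpus of screen embeddings
print(nearest(query, all_screens))
```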
Sample Downstream Tasks: Screen Embedding Sequences for Representing Mobile Tasks
• Preliminary evaluation of embedding mobile tasks as sequences of Screen2Vec screen embedding vectors
• Recorded scripts of completing 10 common smartphone tasks
• Each task is represented as the average of its Screen2Vec vectors (see the sketch below)
• Querying for the nearest neighbor among 20 task variations yields 18/20 accuracy
  ◦ TextOnly: 14/20 accuracy
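A minimal sketch of this task representation, assuming each recorded trace is already a sequence of Screen2Vec screen vectors.

```python
import numpy as np

def task_embedding(screen_sequence):
    """Represent a recorded task as the average of its Screen2Vec screen vectors."""
    return np.mean(screen_sequence, axis=0)

def match_task(query_trace, stored_traces):
    """Return the index of the stored task closest to the query by cosine similarity."""
    q = task_embedding(query_trace)
    embs = np.stack([task_embedding(t) for t in stored_traces])
    sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))
```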
Potential Application
• Designers can query for example designs that display similar content, or for screens in apps of a similar domain
• Composability helps find a specific page within an app
  ◦ Suppose a designer searches for the checkout page of app A
  ◦ A's order page + (app B's checkout page − app B's order page)
• LayoutGAN can generate realistic GUI layouts based on user-specified constraints
  ◦ Applying Screen2Vec could incorporate the semantics of GUIs and the context of user interaction
Limitation
• Only trained and tested on Android app GUIs
• RICO dataset
  ◦ Contains interaction traces within single apps → needs to generalize to tasks spanning multiple apps
  ◦ Does not contain paid apps
• Screen2Vec does not encode the semantics of graphic icons that have no textual information
Editor's Notes

  • #17 The correct GUI component is among the top 0.01% in the prediction results; aggregating textual information is useful for representing the topic of a screen → good top-0.1% and top-1% / NRMSE results
  • #20 Textual content, visual design, layout patterns, and app context
  • #21 Add, subtract, and average embeddings to form meaningful new ones
  • #22 Add, subtract, and average embeddings to form meaningful new ones
  • #23 Add, subtract, and average embeddings to form meaningful new ones