240408_Thanh_LabSeminar[Region Graph Embedding Network for Zero-Shot Learning].pptx
Region Graph Embedding
Network for Zero-Shot Learning
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/04/08
Guo-Sen Xie et al.
ECCV 2020
Introduction
● What is Zero-Shot Learning?
○ ZSL is a model’s ability to recognize classes never seen during training
○ The condition is that these classes are unknown during supervised learning
○ Earlier work in ZSL uses attributes in a two-step approach to infer unknown classes
○ In the CV context, more recent advances learn mappings from the image feature space to the
semantic space
○ Other approaches learn non-linear multimodal embeddings
○ In the modern NLP context, language models can be evaluated on downstream tasks
without fine-tuning
● What is Generalized ZSL?
○ The set of classes is split into seen and unseen classes; training relies on the
semantic features of both seen and unseen classes but the visual representations of only
the seen classes, while testing uses the visual representations of both seen and unseen
classes
Previous Approaches
● Early ZSL methods rely on learning attribute classifiers, from which the class posterior of a test image is deduced;
however, the associations among these attributes are not well exploited
● Embedding based methods
○ Accompanied by a compatibility loss, these methods can effectively address the association issue [50]
○ Leverage a compatibility hinge loss for learning the association between images and attributes
[2]
○ [39,66,41,4,14] are also competitive embedding-based models; however, these methods usually
achieve relatively inferior results, since they adopt global features and/or exploit shallow models
○ End-to-end CNN models [33,26,42,28] obtain the best performance: they extend the compatibility
loss with the seen-class attributes and advocate learning more discriminative features;
however, they struggle to focus on the discriminative parts that intrinsically account for
better semantic transfer
● Part-based ZSL
○ [11,1,64] utilized part annotations to discover discriminative part features for tackling fine-grained
ZSL; however, part annotations are costly and labor-dependent
○ Pursuing automatic part discovery [53], attention mechanisms [57,56,55,25] have been applied
to ZSL and GZSL [52,80,78,30] for capturing multiple semantic regions, which can facilitate
desirable knowledge transfer; these methods achieve remarkable improvements on ZSL, but the
performance gains on GZSL are not satisfactory, i.e., they fail to solve the domain bias issue
How is the graph created?
● Each node in the graph represents an attended region in the image
● Edges between region nodes are weighted by their pairwise appearance similarities
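This construction can be sketched in a few lines of numpy: given K part feature vectors (assumed already extracted, one per attended region), l2-normalization followed by a dot product yields the cosine-similarity adjacency matrix, including self-connections on the diagonal.

```python
import numpy as np

def build_region_graph(parts):
    """Build the region graph: each row of `parts` is one attended
    region's feature; edge weights are pairwise cosine similarities."""
    # l2-normalize each part feature so dot products equal cosine similarity
    norms = np.linalg.norm(parts, axis=1, keepdims=True)
    normed = parts / np.clip(norms, 1e-12, None)
    # adjacency: K x K pairwise similarities (self-connections on the diagonal)
    return normed @ normed.T

K, C = 4, 8
rng = np.random.default_rng(0)
adj = build_region_graph(rng.standard_normal((K, C)))
```

The resulting matrix is symmetric with unit diagonal, which is exactly the Γ fed to the GCN later.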
Method
Task Definitions
● There are N_s training samples from C_s seen classes, defined as S = {(x^s_i, y^s_i)}_{i=1}^{N_s}
● X^S = {x^s_i}_{i=1}^{N_s} and Y^S are the training image set and its label set
● The seen-class label of the i-th sample x^s_i is y^s_i ∈ Y^S
● A^s = {a^s_i}_{i=1}^{C_s} represents the semantic vector set of the seen classes
● Given an unseen test set U = {(x^u_i, y^u_i)}_{i=1}^{N_u} with N_u samples, the task is to predict the label y^u_i ∈ Y^U for each sample
● Further knowledge about U is provided by the semantic vector set A^u = {a^u_i}_{i=1}^{C_u} for the C_u unseen classes
● The label sets of seen and unseen classes are disjoint
● For GZSL, the searched label space is expanded to Y = Y^S ∪ Y^U by taking samples from both seen and unseen classes as the test data
● Each semantic vector a^s_i / a^u_i ∈ R^Q
Method
Overview
● RGEN consists of two sub-branches:
○ Constrained Part Attention (CPA) branch
■ Capable of automatically discovering discriminative regions to generate attended object
regions; it differs from [52]
● Unlike [52], which places no regularization on the attention masks, compactness and diversity
constraints are introduced for learning desirable parts
● Transfer and balance losses are leveraged, compared to [52], which uses an attribute-incorporated
cross-entropy loss
○ Parts Relation Reasoning (PRR) branch
■ Aims at capturing appearance relationships among the discovered parts via GCN-based graph
reasoning
■ The outputs of these GCNs are updated node features, which are further used to learn an
embedding into the semantic space
● Both branches are jointly trained by the proposed transfer and balance losses
Constrained Part Attention Branch
● Attention Parts Generation
○ Leverage soft spatial attention to map image x into a set of K part features
○ Suppose the last convolutional feature map w.r.t. x is Z(x) ∈ R^{H×W×C}, with H, W, C being its height,
width, and channel number
○ K attention masks {M_i(x)}_{i=1}^{K} are obtained by a 1×1 convolution G on Z(x) followed by sigmoid
thresholding, where M_i(x) ∈ R^{H×W} is the i-th attention mask of input x
○ Based on these masks, obtain K corresponding attentive feature maps {T_i(x)}_{i=1}^{K} w.r.t. Z(x):
T_i(x) = R(M_i(x)) ⊙ Z(x), where R reshapes the input to the same shape as Z(x) and ⊙ is an
element-wise multiplication
○ Applying global max-pooling to each T_i(x) yields K part features {f_i(x)}_{i=1}^{K}, f_i(x) ∈ R^C
● {f_i(x)}_{i=1}^{K} have 2 functions
○ They are concatenated into a vector f ∈ R^{KC}, which is connected to the bottleneck layer and then to the
semantic space; the semantic-layer output is supervised by the transfer and balance losses
○ They are taken as nodes to construct the region graph, which is fed to GCNs in the PRR
branch for parts relation reasoning
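The mask-and-pool pipeline can be sketched in numpy; as an assumption for illustration, the 1×1 convolution is modeled as a per-pixel channel matmul with random stand-in weights rather than learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_parts(Z, G_weights):
    """Soft spatial attention sketch: a 1x1 conv (a C -> K matmul applied
    at every pixel) plus sigmoid yields K masks; each mask gates Z and
    global max-pooling over space gives one part feature per mask."""
    H, W, C = Z.shape
    K = G_weights.shape[1]
    # 1x1 convolution == per-pixel linear map over channels, then sigmoid
    masks = sigmoid(Z.reshape(-1, C) @ G_weights).reshape(H, W, K)
    parts = np.empty((K, C))
    for i in range(K):
        # broadcast the mask over channels (the paper's R(.)), gate, max-pool
        T_i = masks[:, :, i:i + 1] * Z
        parts[i] = T_i.reshape(-1, C).max(axis=0)
    return masks, parts
```

Concatenating the rows of `parts` gives the vector f ∈ R^{KC}; the rows themselves are the graph nodes for the PRR branch.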
● Constrained Attention Masks
○ To discover more compact and divergent parts, the attention masks are constrained beyond the
channel clustering
○ The masks are additionally constrained from the spatial-attention side
○ The compact loss and divergent loss for the K masks on n_b batch data pull each mask M_i toward
M̂_i, an ideal peaked attention map for the i-th part, while suppressing, at each coordinate (h, w),
the maximum activation of the other masks
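An illustrative sketch of the two constraints for a single image, under stated assumptions: the compact loss is taken here as a mean squared distance between each mask and a one-hot map peaked at its own maximum, and the divergent loss as the mask's mean overlap with the element-wise maximum of the other masks; the paper's exact weightings may differ.

```python
import numpy as np

def compact_divergent_losses(masks):
    """masks: (K, H, W) attention masks for one image.
    Compact: pull each mask toward an ideal map peaked at its own argmax.
    Divergent: penalize overlap with the max of the *other* masks."""
    K, H, W = masks.shape
    compact = 0.0
    divergent = 0.0
    for i in range(K):
        # ideal peaked map: 1 at the mask's own maximum, 0 elsewhere
        peaked = np.zeros((H, W))
        h, w = np.unravel_index(np.argmax(masks[i]), (H, W))
        peaked[h, w] = 1.0
        compact += np.mean((masks[i] - peaked) ** 2)
        # maximum activation of the other masks at each coordinate (h, w)
        others = np.max(np.delete(masks, i, axis=0), axis=0)
        divergent += np.mean(masks[i] * others)
    return compact / K, divergent / K
```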
Parts Relation Reasoning Branch
● Each of the K part features {f_i(x)}_{i=1}^{K} represents one attended region
● Employ a GCN to perform region-based relation modeling, leading to the PRR branch
● Region graph Γ ∈ R^{K×K}, with the K part features as its K nodes
● In Γ, similar regions are linked by high-confidence edges and dissimilar regions by low-confidence edges
● Conduct l2-normalization on each f_i(x), then leverage the dot product to calculate the pairwise similarities
● The dot-product calculation then equals the cosine similarity metric, and the graph has self-connections as well
● Calculate the degree matrix D of Γ with D_ii = Σ_j Γ_ij
● Leverage the GCN to perform reasoning on the region graph, using 2-layer GCN propagation
F^{(l+1)} = σ(D^{-1/2} Γ D^{-1/2} F^{(l)} W^{(l)}), where F^{(0)} ∈ R^{K×C} stacks the K part features (C is their
dimension), W^{(l)} (l = 0, 1) are learnable parameters, and σ is the ReLU activation function
● The updated features undergo a concatenation, a bottleneck layer, and an embedding into the semantic
space
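The 2-layer propagation can be sketched directly in numpy, assuming the adjacency already contains self-connections; the weight matrices here are random stand-ins for the learned W^{(l)}.

```python
import numpy as np

def gcn_two_layer(F0, adj, W0, W1):
    """Two rounds of F <- relu(D^{-1/2} Gamma D^{-1/2} F W) on the
    region graph; `adj` is assumed to include self-connections."""
    deg = adj.sum(axis=1)                      # diagonal of the degree matrix D
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(deg, 1e-12, None))
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    relu = lambda x: np.maximum(x, 0.0)
    F1 = relu(norm_adj @ F0 @ W0)              # layer l = 0
    F2 = relu(norm_adj @ F1 @ W1)              # layer l = 1: updated node features
    return F2
```

The rows of the returned F^{(2)} are the updated part features that get concatenated and embedded into the semantic space.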
The Transfer and Balance Losses
The Transfer Loss
● To make ZSL and GZSL feasible, the obtained features should be further embedded into a certain
subspace
● Given the i-th seen image and its ground-truth semantic vector a^s_* ∈ A^S, suppose its embedded feature is
collectively denoted as ε(x^s_i), which equals the concatenated rows of F^{(2)} or the concatenated K part
features (Φ / f)
● Revisiting the ACE loss, to associate image x^s_i with its true attribute information, the compatibility score Γ*_i is
formulated as the inner product between the embedded feature W ε(x^s_i) and the true semantic vector a^s_*,
where W are the embedding weights that are learned jointly (a two-layer MLP in the
implementation)
● Considering Γ*_i as a classification score in the cross-entropy loss, for seen data from a batch the Attribute-incorporated
CE loss (ACE) becomes a softmax cross-entropy over the scores {Γ_ij}_{j=1}^{C_s} on the C_s seen semantic vectors
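A minimal numpy sketch of this loss, assuming the embeddings have already been mapped into the semantic space (i.e., W has been applied) so that compatibility scores are plain dot products against the C_s seen attribute vectors.

```python
import numpy as np

def ace_loss(embeddings, attributes, labels):
    """Attribute-incorporated cross-entropy sketch: compatibility scores
    are dot products between semantic-space embeddings (n_b x Q) and the
    seen-class attribute matrix (Cs x Q); then softmax cross-entropy."""
    scores = embeddings @ attributes.T                 # (n_b, Cs) scores
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # negative log-probability of each sample's ground-truth class
    return -log_probs[np.arange(len(labels)), labels].mean()
```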
● There are 2 drawbacks
○ The learned models are still biased towards seen classes
○ The performances of these deep models are inferior on GZSL
● To alleviate these problems, the unseen attributes A^u are incorporated into RGEN
● Leverage least-squares regression to obtain the reconstruction coefficients V ∈ R^{Cu×Cs} of each seen-class
attribute w.r.t. all unseen-class attributes: V = (B^T B + βI)^{-1} B^T A, obtained by solving the ridge-regression
objective min_V ||A − BV||²_F + β||V||²_F, where the columns of A are the seen-class attributes and the columns
of B are the unseen-class attributes
● The i-th column of V represents the contrasting class similarity of a^s_i w.r.t. B
● The transfer loss then matches the softmax-normalized scores S_ij of x^s_i w.r.t. the C_u unseen semantic
vectors against the column of V indexed by y_i, the column location in V w.r.t. the ground-truth semantic vector of x^s_i
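The closed-form solution V = (B^T B + βI)^{-1} B^T A is a one-liner; a sketch with numpy, where A stacks seen-class attributes column-wise (Q × Cs) and B stacks unseen-class attributes column-wise (Q × Cu):

```python
import numpy as np

def reconstruction_coefficients(A_seen, B_unseen, beta=1.0):
    """V = (B^T B + beta*I)^(-1) B^T A: ridge-regression coefficients
    expressing each seen-class attribute (a column of A) as a
    combination of the unseen-class attributes (columns of B)."""
    Cu = B_unseen.shape[1]
    # solve the normal equations instead of forming an explicit inverse
    return np.linalg.solve(B_unseen.T @ B_unseen + beta * np.eye(Cu),
                           B_unseen.T @ A_seen)
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to computing the inverse explicitly.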
The Balance Loss
● To tackle the extreme domain bias in GZSL, a balance loss is proposed by pursuing maximum-response
consistency among the seen and unseen outputs
● Given an input seen sample x^s_i, obtain its prediction scores on the seen-class and unseen-class
attributes, P^s_i and P^u_i
● To balance the scores from the two sides, the balance loss penalizes, over batch data, the gap between
max P^s_i and max P^u_i, where max P outputs the maximum value of the input vector P
● The balance loss is only utilized for GZSL, not ZSL, since balancing is not required when only unseen
test images are available
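A sketch of the maximum-response-consistency idea; the absolute-difference penalty used here is an illustrative choice, and the paper's exact penalty may differ.

```python
import numpy as np

def balance_loss(seen_scores, unseen_scores):
    """For each seen sample (one row per sample), penalize the gap
    between its top seen-class score and its top unseen-class score,
    averaged over the batch."""
    gap = seen_scores.max(axis=1) - unseen_scores.max(axis=1)
    return np.abs(gap).mean()
```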
Training Objective
● The two branches are guided by the proposed transfer and balance losses during end-to-end training
● Only one stream of data is fed as the input of the network; the backbone is shared
● The final RGEN loss combines the branch losses LCPA and LPRR
● In the formulations of LCPA and LPRR, λ1 and λ2 take the same values for the 2 branches
● The difference between LCPA and LPRR lies in the concatenated embedding features f and θ
Zero-Shot Prediction
● In the RGEN framework, the unseen test image xu is predicted in a fused manner
● After obtaining the embedding features of xu in the semantic space w.r.t. the CPA and PRR branches,
calculate their fused result using the same combination coefficients as in the training phase, then predict
its label by taking the class in Yu / Y with the highest compatibility score,
where Yu / Y corresponds to ZSL / GZSL
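A sketch of the fused prediction, assuming both branch embeddings already live in the semantic space, compatibility is a dot product against the candidate attribute matrix, and a single fusion coefficient `alpha` stands in for the training-phase combination weights.

```python
import numpy as np

def predict_label(cpa_embed, prr_embed, attributes, alpha=0.5):
    """Fuse the CPA and PRR semantic-space embeddings with the same
    combination coefficient as training, then pick the class whose
    attribute vector gives the highest compatibility score.
    `attributes` holds unseen-class attributes for ZSL, or all classes
    for GZSL."""
    fused = alpha * cpa_embed + (1.0 - alpha) * prr_embed
    return int(np.argmax(attributes @ fused))
```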
Datasets
4 datasets are used:
• SUN [36], CUB [44], AWA2 [50], APY [12]
• The Proposed Split [50] is used for evaluation; it is stricter and does not contain any class overlapping with
ImageNet classes
Conclusion
• RGEN is proposed for tackling the ZSL and GZSL tasks
• RGEN contains the constrained part attention and the parts relation reasoning branches
• To guide RGEN training, the transfer and balance losses are integrated into the framework
○ The balance loss is especially valuable for alleviating the extreme bias in the deep GZSL models,
providing intrinsic insights for solving GZSL