240401_Thanh_LabSeminar[Person Re-identification using Heterogeneous Local Graph Attention Networks].pptx
1. Person Re-identification using Heterogeneous Local Graph Attention Networks
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/04/01
Zhong Zhang et al.
CVPR 2021
2. 2
Introduction
● What is person re-identification (Re-ID)?
○ Re-ID aims to match a given person of interest across different scenarios
3. 3
Previous Approaches
● Part-based methods for person Re-ID
○ Direct partition: images or feature maps are divided into horizontal stripes [22, 34, 44]
○ Deep metric learning: partition pedestrian images into 3 parts => feed them into a siamese convolutional neural network to learn local features [44]
○ CNN feature maps: extract feature maps => divide them into several fixed-size grids to obtain local features from each grid independently [34]
○ Horizontal Pyramid Matching: partition feature maps into multi-scale horizontal regions => use global average pooling and global max pooling to extract local features at multiple scales [9]
● Relation learning for person Re-ID
○ Pairwise relation
■ Similarity-Guided Graph Neural Network to model the pairwise relation among different
probe-gallery pairs so as to learn the probe-gallery relation features [30]
○ Local relation [12, 26]
○ Global relation [47]
4. 4
How is the graph created?
● A node is a local feature extracted from a pedestrian image
● Edge
○ An inter-local edge connects nodes belonging to different pedestrian images
○ An intra-local edge connects nodes belonging to the same pedestrian image
6. 6
Method
Feature Extractor
● ResNet-50 is the backbone of the Feature Extractor because of its strong feature representation capacity
● Resize the image to 384×128 => feed it into the backbone => feature maps of size 2048×24×8, where 2048 is the channel number, 24 is the height and 8 is the width of the feature map
● Split the feature maps into P uniform horizontal grids and use global max pooling to obtain the local features F = {f_i^p ∈ R^2048}, where p = 1,...,P and f_i^p denotes the feature extracted from the p-th part of image x_i
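The partition-and-pool step above can be sketched in plain Python. This is a minimal illustration, assuming the feature map is a nested list indexed as [channel][row][column] and that the height divides evenly by the number of parts; the function name `part_max_pool` is hypothetical and not taken from the paper.

```python
def part_max_pool(feature_map, num_parts):
    """Split a [C][H][W] feature map into horizontal stripes and max-pool each.

    Returns a list of num_parts local features, each a list of C values.
    Assumes H is divisible by num_parts.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts")
    rows_per_part = height // num_parts

    local_features = []
    for p in range(num_parts):
        top = p * rows_per_part
        bottom = top + rows_per_part
        # Global max pooling over every spatial position inside stripe p.
        feature = [
            max(max(row) for row in feature_map[c][top:bottom])
            for c in range(channels)
        ]
        local_features.append(feature)
    return local_features


# Toy feature map: 2 channels, 4 rows, 2 columns (the slide uses 2048 x 24 x 8).
toy_map = [
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    [[8, 7], [6, 5], [4, 3], [2, 1]],
]
local_feats = part_max_pool(toy_map, num_parts=2)
```

Each entry of `local_feats` plays the role of one f_i^p: one vector per horizontal part, with one value per channel.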
7. 7
Method
Graph Attention Subnet
● Treat the local features F as nodes and construct the completed local graph G = (F, E)
● The authors divide neighbor nodes into 3 types: N_SP, N_AP, N_SI
● To learn the inter-local and intra-local relations in the completed local graph, they start from the traditional GAT, which differentiates the importance of neighbor nodes and aggregates information from them
Notation: h_k is the feature vector of node k, M is the transformation matrix, α_jk is the attention weight for node k, and σ denotes the non-linear activation function
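With the symbols listed on this slide, the usual GAT aggregation can be written as below. This is the standard textbook form, given as a sketch; the neighbor set N_j and the output symbol h'_j are notation assumed here, not taken from the slide.

```latex
h_j' = \sigma\Big( \sum_{k \in \mathcal{N}_j} \alpha_{jk} \, M h_k \Big)
```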
8. 8
Method
Graph Attention Subnet
● The traditional GAT treats all nodes equally when computing the attention weights, without discriminating among them
● Two limitations
○ Intra-local relation: the differences between the relative parts of images are neglected when learning the attention weights => the structure information of images is lost
○ Inter-local relation: the traditional GAT ignores the difference between attention weights for the same identity and for different identities => inaccurate attention weights
● The authors propose to differentiate the attention weights when learning the inter-local and intra-local relations
● New attention weight: computed from Φ(.,.), the cosine similarity function; N(f_i^p), the neighbor node set of node f_i^p; and K, a regulation coefficient that accounts for the different types of neighbor nodes
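One way the three ingredients named above (cosine similarity, a neighbor set, and a per-neighbor coefficient K) could combine is a softmax over K-scaled similarities. The sketch below assumes that softmax normalization and omits the transformation matrix for brevity; both choices are assumptions of this example, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention_weights(node, neighbors, coefficients):
    """Softmax over K * cosine similarity for each neighbor of `node`.

    Assumption: weights are normalized with a softmax over the neighbor
    set; features are used as-is, without a learned transformation.
    """
    scores = [
        k * cosine_similarity(node, neighbor)
        for neighbor, k in zip(neighbors, coefficients)
    ]
    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


node = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# The third neighbor is identical to the first but has a smaller K.
weights = attention_weights(node, neighbors, coefficients=[1.0, 1.0, 0.5])
```

In this toy case the first and third neighbors are equally similar to the node, so the smaller K alone lowers the third neighbor's weight.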
9. 9
Method
Graph Attention Subnet
● There are 3 types of neighbor nodes; the authors define a different K depending on which neighbor node set f_j^q belongs to
○ When i ≠ j and p = q, f_i^p and f_j^q are from the corresponding parts of different images and f_j^q belongs to the neighbor node set N_SP => K = 1
○ When i ≠ j and p = q + 1 or p = q - 1, f_i^p and f_j^q are from adjacent parts of different images and f_j^q belongs to the neighbor node set N_AP => K is a constant smaller than the value for N_SP, because the correlation decreases as the spatial distance increases
○ When i = j and p ≠ q, f_i^p and f_i^q are from different parts of the same image and f_i^q belongs to the neighbor node set N_SI, i.e. N(f_i^p) = N_SI. Here K is defined from s and |p - q|, where s is the adjustment coefficient and |p - q| is the relative spatial distance between the p-th part and the q-th part
● They learn 3 independent transformation matrices W for the 3 kinds of neighbor nodes in Eqs. 2 and 3, so the model can distinguish the different relative parts of pedestrian images when learning the intra-local relation, which overcomes the first limitation
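The three neighbor cases above can be sketched as a small classifier. This is an illustration only: the function names are hypothetical, `k_ap` stands for the unspecified constant used for adjacent parts, and the same-image coefficient is passed in as a callable because its exact formula is not given here.

```python
def neighbor_type(i, p, j, q):
    """Classify node f_j^q relative to node f_i^p.

    i, j are image indices; p, q are part indices.
    Returns "SP", "AP", "SI", or None when the pair matches none of
    the three cases.
    """
    if i != j and p == q:
        return "SP"  # corresponding parts of different images
    if i != j and abs(p - q) == 1:
        return "AP"  # adjacent parts of different images
    if i == j and p != q:
        return "SI"  # different parts of the same image
    return None


def regulation_coefficient(i, p, j, q, k_ap, k_si):
    """Return K for the pair, or None if f_j^q is not a neighbor of f_i^p.

    k_ap: constant for adjacent parts (assumed smaller than 1).
    k_si: callable mapping the spatial distance |p - q| to K for
          same-image neighbors; its form is left to the caller.
    """
    kind = neighbor_type(i, p, j, q)
    if kind == "SP":
        return 1.0
    if kind == "AP":
        return k_ap
    if kind == "SI":
        return k_si(abs(p - q))
    return None
```

Keeping the type lookup separate from the coefficient makes it easy to also pick a per-type transformation matrix from the same label.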
10. 10
Method
Graph Attention Subnet
● To integrate the inter-local and intra-local relations, they aggregate the local features with the attention weights to obtain the features z_i^p, using a non-linear activation function σ and a transformation matrix V. As with W, they learn a different V for each type of neighbor node
● In the inter-local relation, nodes that belong to the same identity are highly correlated, so their attention weights should be large => the authors propose the attention regularization loss to constrain the attention weights. It is built on L_bce(.,.), the binary cross-entropy loss, and τ, the ground truth value: τ = 1 if node f_i^p and node f_j^q belong to the same identity, otherwise τ = 0
● As a result, the attention weights between nodes from the same identity are enlarged and those between nodes from different identities are reduced, which overcomes the second limitation
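The aggregation and the attention regularization loss described on this slide can be sketched as follows. The sketch assumes ReLU for the activation, a single shared transformation matrix instead of one per neighbor type, and a mean over neighbors in the loss; none of these details are stated on the slide, and the function names are hypothetical.

```python
import math


def aggregate(neighbors, weights, transform):
    """Compute sigma(sum_k alpha_k * V f_k), with ReLU standing in for sigma.

    neighbors: list of feature vectors f_k
    weights:   attention weight alpha_k for each neighbor
    transform: matrix V as a list of rows (one shared V, for brevity)
    """
    out_dim = len(transform)
    summed = [0.0] * out_dim
    for feature, alpha in zip(neighbors, weights):
        for r in range(out_dim):
            projected = sum(v * x for v, x in zip(transform[r], feature))
            summed[r] += alpha * projected
    return [max(0.0, value) for value in summed]  # ReLU is an assumption


def attention_regularization_loss(weights, same_identity, eps=1e-12):
    """Mean binary cross-entropy between attention weights and tau.

    same_identity[k] is True when neighbor k shares the node's identity
    (tau = 1) and False otherwise (tau = 0).
    """
    total = 0.0
    for alpha, same in zip(weights, same_identity):
        alpha = min(max(alpha, eps), 1.0 - eps)  # keep log() finite
        tau = 1.0 if same else 0.0
        total += -(tau * math.log(alpha) + (1.0 - tau) * math.log(1.0 - alpha))
    return total / len(weights)
```

Minimizing this loss pushes the weights of same-identity neighbors toward 1 and the weights of different-identity neighbors toward 0, which matches the behavior described in the last bullet.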
11. 11
Method
Embedding Subnet
● Use P independent FC layers to reduce the dimension of z_i^p from 2048 to 256, giving the final feature e_i^p for the p-th part of image x_i => identity prediction
● Combine the cross-entropy loss L_ce with the proposed attention regularization loss, with λ as the balance coefficient. In L_ce, y_i is the ground truth label of e_i^p and Q(e_i^p) ∈ [0,1] denotes the prediction probability
● In the test stage, they compute the distance between query image x_q and gallery image x_g to measure their similarity, using e_q^p and e_g^p, the final features extracted from the p-th parts of x_q and x_g, respectively
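The training objective and the test-time distance can be sketched as below. Several details are assumptions of this example rather than statements from the slide: the cross-entropy is summed over the P parts, λ scales the attention regularization term, and the per-part distance is Euclidean. The function names are hypothetical.

```python
import math


def cross_entropy(probabilities, label):
    """Cross-entropy for one part: -log of the probability of the true class."""
    return -math.log(probabilities[label])


def total_loss(part_probabilities, label, attention_loss, lam):
    """Sum the per-part cross-entropy and add lam * attention_loss.

    Assumptions: cross-entropy is summed over the P parts, and lam
    scales the attention regularization term.
    """
    ce = sum(cross_entropy(probs, label) for probs in part_probabilities)
    return ce + lam * attention_loss


def part_distance(query_parts, gallery_parts):
    """Sum of Euclidean distances between matching parts of two images.

    The Euclidean metric is an assumption made for this sketch.
    """
    return sum(
        math.dist(eq, eg) for eq, eg in zip(query_parts, gallery_parts)
    )
```

At test time, gallery images would be ranked by ascending `part_distance` to the query, so the most similar candidates come first.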
12. 12
Datasets
4 datasets (pairs in parentheses are images, identities)
• Market-1501: 32,668 pedestrian images (1,501 identities), split into a training set (12,936, 751) and a test set (19,732, 750); the test set contains 3,368 query images and 15,913 gallery images
• CUHK03: (14,097, 1,467). Training set (7,365, 767); test set with 1,400 query images and 5,332 gallery images of 700 identities
• DukeMTMC-reID: training set (16,522, 702); test set with 2,228 query images of 702 identities and 17,661 gallery images of 1,110 identities (including 408 distractor identities)
• MSMT17: training set (32,621, 1,041); test set with 11,659 query images and 82,161 gallery images of 3,060 identities
15. 15
Conclusion
• HLGAT models the inter-local and intra-local relations in a unified framework for person Re-ID
• Local features are regarded as nodes to construct the completed local graph, on which the model learns
○ Inter-local relation among corresponding and adjacent parts from different images
○ Intra-local relation among different parts from the same image
• An attention regularization loss constrains the attention weights for the inter-local relation
• Contextual information is injected into the attention weights for the intra-local relation