Visual-Textual Joint Relevance Learning
for Tag-Based Social Image Search
Yue Gao, Meng Wang, Zheng-Jun Zha, Jialie Shen, Xuelong Li, Xindong Wu
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2013
Introduction
Tag-based social image search
• Social media data (Flickr, YouTube, etc.)
• Associated with user generated tags, meta information (date, location, etc.)
• Conventional tag-based social image search
• Too much noise in tags
• Lacks an optimal ranking strategy (e.g., Flickr's time-based ranking and
interestingness-based ranking)
• Existing relevance-based ranking methods
• Explore visual content and tags separately or sequentially
Proposed schema
• a hypergraph-based approach to simultaneously utilize visual information and tags
Vertex: social image
Hyperedge: visual word / tag
Learn the weights (importance of different visual
words and tags)
Relevance scores of images
Related works
Social image search
• Separated Methods
• Only the textual content or the
visual content is employed for
tag analysis
• Useful information from the
other modality is missed
Social image search
• Sequential Methods
• The visual content and the tags
are sequentially employed for
image search
• The correlation between visual
content and tags is ignored
Social image search
Joint method
Hypergraph learning
• A hypergraph is a generalization of a graph in which an edge can connect
more than two vertices
• Used for data mining and information retrieval tasks
• Effective in capturing higher-order relationships
Hypergraph analysis
Definition
Image from Wikipedia
• Vertex set 𝒱 = {𝑣1, 𝑣2, 𝑣3, 𝑣4, 𝑣5, 𝑣6, 𝑣7}
• Hyperedge set ℰ = {𝑒1, 𝑒2, 𝑒3, 𝑒4} =
{{𝑣1, 𝑣2, 𝑣3}, {𝑣2, 𝑣3}, {𝑣3, 𝑣5, 𝑣6}, {𝑣4}}
• A hyperedge is able to link more than two
vertices.
• Edge weight set 𝓌
Hypergraph 𝒢 = (𝒱, ℰ, 𝓌)
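As a concrete illustration, the toy hypergraph above can be represented by a |𝒱| × |ℰ| incidence matrix. The sketch below (hypothetical helper code, not from the paper) builds that matrix and reads off vertex and hyperedge degrees:

```python
import numpy as np

# The example hypergraph from the slide:
# V = {v1..v7}, E = {e1={v1,v2,v3}, e2={v2,v3}, e3={v3,v5,v6}, e4={v4}}
vertices = list(range(1, 8))
hyperedges = {
    "e1": {1, 2, 3},
    "e2": {2, 3},
    "e3": {3, 5, 6},
    "e4": {4},
}

# Incidence matrix H: H[v, e] = 1 iff vertex v belongs to hyperedge e
H = np.zeros((len(vertices), len(hyperedges)), dtype=int)
for j, members in enumerate(hyperedges.values()):
    for v in members:
        H[v - 1, j] = 1

# Vertex degree d(v): number of hyperedges containing v (all weights 1 here);
# hyperedge degree δ(e): number of vertices in e
vertex_degrees = H.sum(axis=1)
edge_degrees = H.sum(axis=0)
```

Note that v7 has degree 0: it belongs to no hyperedge, matching the picture.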
Hypergraph analysis
• Learning with hypergraphs
• Binary classification with a hypergraph
• The normalized Laplacian method is formulated as a regularization framework
argmin_f {λ R_emp(f) + Ω(f)}
• Ω(f): regularizer
• R_emp(f): empirical loss
• λ: weighting parameter
• f: to-be-learned classification function
Visual-textual
joint relevance learning
Hypergraph construction
• Vertex construction
• Vertices : Social image set
• The number of vertices in the
hypergraph equals the number
of images in the image dataset.
Hypergraph construction
• Hyperedge construction
• Feature 1. visual contents
• Bag of Visual Words
• Extracts local SIFT descriptors from
each image
• Trains visual vocabularies with the
descriptors
• 𝑓𝑖^𝑏𝑜𝑤(𝑘, 1) = 1 if the i-th image has the k-th visual word, 0 otherwise
Hypergraph construction
• Hyperedge construction
• Feature 2. Textual information
• Bag of Textual Words
• Tags in each image are ranked by
TagRanking
• For further processing, the top 𝑛𝑙 tags
of each image are kept
• For further hyperedge construction,
the tags with the highest TF-IDF
scores are kept in the database
• 𝑓𝑖^𝑡𝑎𝑔(𝑘, 1) = 1 if the i-th image has the k-th tag, 0 otherwise
Hypergraph construction
• Hyperedge construction
• If two images contain the same
visual word, they are connected
by a hyperedge.
• If two images contain the same
tag, they are connected by a
hyperedge.
• If 𝑓𝑖^𝑏𝑜𝑤(𝑘, 1) = 𝑓𝑗^𝑏𝑜𝑤(𝑘, 1) = 1, images 𝑖 and 𝑗 are connected.
• If 𝑓𝑖^𝑡𝑎𝑔(𝑘, 1) = 𝑓𝑗^𝑡𝑎𝑔(𝑘, 1) = 1, images 𝑖 and 𝑗 are connected.
𝑛𝑐 visual-content-based hyperedges + 𝑛𝑡 tag-based hyperedges
= 𝑛𝑐 + 𝑛𝑡 hyperedges in total
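The construction above can be sketched in code: one hyperedge per visual word or per tag, containing every image whose binary feature entry is 1. The feature matrices below are random toy data standing in for the paper's BoVW/tag features:

```python
import numpy as np

# Toy binary feature matrices (random stand-ins, not real features)
rng = np.random.default_rng(0)
n_images, n_c, n_t = 6, 4, 3
f_bow = rng.integers(0, 2, size=(n_images, n_c))  # f_bow[i, k] = 1 iff image i has visual word k
f_tag = rng.integers(0, 2, size=(n_images, n_t))  # f_tag[i, k] = 1 iff image i has tag k

def build_hyperedges(features):
    """One hyperedge per column: the set of images whose entry is 1,
    i.e., all images sharing that visual word (or tag) are connected."""
    return [set(np.flatnonzero(features[:, k]).tolist())
            for k in range(features.shape[1])]

edges = build_hyperedges(f_bow) + build_hyperedges(f_tag)  # n_c + n_t hyperedges
```

Two images then tend to be connected by more hyperedges the more visual words and tags they share.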
Hypergraph construction
Example of textual hyperedge construction
Example of visual hyperedge construction
Example of the connection between two images
Social image relevance learning
• Social image search task
• Binary classification problem
• Measure the relevance scores among all vertices in the hypergraph
• Transductive inference is also formulated as a regularization framework
• Objective function
• argmin_{f,ω} {Ω(f) + λ R_emp(f) + μ Ψ(ω)}
• The regularizer term indicates that highly related vertices should have close label results
• Ω(f): regularizer term
• R_emp(f): empirical loss term
• Ψ(ω): weight regularizer term
• ω: weight vector
• f: to-be-learned relevance score vector
Social image relevance learning
• Objective function
• argmin_{f,ω} {Ω(f) + λ R_emp(f) + μ Ψ(ω)}
• Ω(f) = fᵀΔf
• R_emp(f) = ‖f − y‖² = Σ_{u∈𝒱} (f(u) − y(u))²
• guarantees that the newly generated labeling results are not far away from the initial label information
• Ψ(ω) = Σ_{i=1}^{n_e} ω_i²  s.t.  Σ_{i=1}^{n_e} ω_i = 1
• argmin_{f,ω} Φ(f, ω) = argmin_{f,ω} {fᵀΔf + λ‖f − y‖² + μ Σ_{i=1}^{n_e} ω_i²}  s.t.  Σ_{i=1}^{n_e} ω_i = 1
(Δ: the normalized hypergraph Laplacian)
(y: the n × 1 initial label vector)
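A minimal sketch of this cost function, assuming the standard normalized hypergraph Laplacian Δ = I − Dv^(−1/2) H W De^(−1) Hᵀ Dv^(−1/2) (the helper names below are my own, not the paper's):

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Normalized hypergraph Laplacian Δ = I - Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2).

    H: |V| x |E| incidence matrix; w: hyperedge weight vector.
    Assumes every vertex lies in at least one hyperedge (positive degree).
    """
    dv = H @ w                      # weighted vertex degrees
    de = H.sum(axis=0)              # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    Theta = Dv_inv_sqrt @ H @ np.diag(w) @ np.diag(1.0 / de) @ H.T @ Dv_inv_sqrt
    return np.eye(H.shape[0]) - Theta

def phi(f, y, w, H, lam, mu):
    """Cost Φ(f, ω) = f^T Δ f + λ‖f − y‖² + μ Σ ω_i²."""
    Delta = hypergraph_laplacian(H, w)
    return (float(f @ Delta @ f)
            + lam * float(np.sum((f - y) ** 2))
            + mu * float(np.sum(w ** 2)))
```

The first term rewards smooth relevance scores over the hypergraph, the second keeps f near the initial labels y, and the third spreads weight across hyperedges.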
Optimization
• Alternating optimization strategy
• There are two to-be-learned
variables, w and f; at each step we
fix one and optimize the other
• Iterating this optimization yields
both w and f
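The alternating scheme can be sketched generically. This is not the paper's closed-form update: it takes finite-difference gradient steps on an arbitrary cost `phi(f, w)` purely to illustrate the fix-one, optimize-the-other loop, with a projection enforcing Σωᵢ = 1:

```python
import numpy as np

def alternating_optimize(phi, y, n_edges, n_iter=10, step=0.05, eps=1e-4):
    """Alternately minimize phi(f, w): fix w and take gradient steps on f,
    then fix f and take a gradient step on w, projecting w so sum(w) = 1.
    Gradients are approximated by central finite differences (illustration only)."""
    f = y.astype(float).copy()
    w = np.full(n_edges, 1.0 / n_edges)          # uniform initial weights
    for _ in range(n_iter):
        # f-step: a few gradient steps with w fixed
        for _ in range(5):
            g = np.array([(phi(f + eps * e, w) - phi(f - eps * e, w)) / (2 * eps)
                          for e in np.eye(len(f))])
            f = f - step * g
        # w-step: one gradient step with f fixed, then project
        g = np.array([(phi(f, w + eps * e) - phi(f, w - eps * e)) / (2 * eps)
                      for e in np.eye(n_edges)])
        w = np.clip(w - step * g, 1e-6, None)
        w = w / w.sum()                          # enforce the sum-to-one constraint
    return f, w
```

In the paper's setting each sub-problem has its own efficient update; the loop structure, however, is the same.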
Probabilistic explanation
• Probabilistic perspective
• Deriving the optimal f and w as those with the maximum posterior probability given
the samples X and the label vector y
(f, w)* = argmax p(f, w | X, y)
• Equivalent to the objective function
argmin_{f,ω} {fᵀΔf + λ‖f − y‖² + μ Σ_{i=1}^{n_e} ω_i²}  s.t.  Σ_{i=1}^{n_e} ω_i = 1
Pseudo-relevant sample selection
• Pseudo-relevant samples
• Associated with the query tag
• Have high relevance probabilities
• Not far away from the true relevant results
• Used for noise reduction
Pseudo-relevant sample selection
• Semantic Relevance Measuring
• All the social images associated
with the query tag are ranked in
descending order of semantic relevance
• The top K results are selected as
the pseudo-relevant images
• Semantic similarity
• Flickr Distance between two tags
• Based on a latent-topic-based
visual language model
s(x_i, t_q) = (1/n_i) Σ_{t∈T_i} s_tag(t_q, t),   s_tag(t_1, t_2) = exp(−FD(t_1, t_2))
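The scoring rule above translates directly into code; the `fd` lookup standing in for the Flickr Distance below is hypothetical toy data, purely for illustration:

```python
import math

def s_tag(fd_value):
    """s_tag(t1, t2) = exp(-FD(t1, t2)), FD being the Flickr Distance."""
    return math.exp(-fd_value)

def relevance(query_tag, image_tags, fd):
    """s(x_i, t_q) = (1 / n_i) * sum over t in T_i of s_tag(t_q, t),
    where T_i are the image's n_i tags and fd(t1, t2) returns the
    Flickr Distance (here a made-up lookup, not a real API)."""
    return sum(s_tag(fd(query_tag, t)) for t in image_tags) / len(image_tags)

# Toy distance table with invented values
table = {frozenset({"apple", "fruit"}): 0.5,
         frozenset({"apple", "car"}): 3.0}
fd = lambda t1, t2: 0.0 if t1 == t2 else table[frozenset({t1, t2})]

score = relevance("apple", ["fruit", "car"], fd)
```

An image tagged with terms semantically close to the query (small FD) thus receives a score near 1, while distant tags push the score toward 0.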
Experiments
Experimental settings
• Dataset: Flickr dataset (104,000 images, 83,999 tags) + NUS-WIDE (370K+ images)
• Labeling: three relevance levels: very relevant (2), relevant (1), and irrelevant (0)
• Compared algorithms
• Graph-based semi-supervised learning (Graph)
• Sequential social image relevance learning (Sequential)
• Tag ranking (TagRanking)
• Tag relevance combination (Uniform Tagger)
• Hypergraph based relevance learning (HG)
• HG + hyperedge weight estimation (HG+WE)
• HG + WE (visual contents only)
• HG + WE (textual contents only)
• Performance evaluation metric
• Normalised Discounted Cumulative Gain (NDCG)
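For reference, NDCG@k over the graded relevance levels used here (2/1/0) can be computed as follows; this uses one common DCG variant, which may differ in detail from the paper's exact formula:

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = sum over ranks i = 1..k of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the ranking, normalized by the DCG of the ideal
    (relevance-sorted) ranking, so a perfect ranking scores 1.0."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([2, 0, 1, 2], 4)` penalizes the irrelevant image at rank 2 and the very relevant image pushed down to rank 4.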
The NDCG@20 results of different methods
[Figure: bar chart of the average NDCG@20 results of the eight compared methods;
the scores are 0.8814, 0.8578, 0.8463, 0.7418, 0.6281, 0.5994, 0.5778, and 0.5727
(y-axis: 0 to 1).]
Average NDCG@k comparison
• This approach consistently
outperforms the other methods
[Figure: average NDCG@k curves; x-axis: depth k for NDCG]
• Top results obtained by different
methods for the query weapon.
• The final ranking list can preserve
images from all the different
meanings
• Top results obtained by different
methods for the query apple.
• The proposed method can return
relevant results with different
meanings
The effects of hyperedge weight learning
Top 100 visual words
with the highest weights
after the hypergraph
learning process
The effects of hyperedge weight learning
Ten tags with the highest
weights after the
hypergraph learning
process for the queries (a)
car and (b) weapon.
Variation of weighting parameters
Average NDCG@20 performance curves with respect to the variation of λ and μ.
Variation of dictionary size
NDCG@20 comparison of the proposed method with different sizes
of the visual word and tag dictionaries, i.e., 𝑛𝑐 and 𝑛𝑡.
Variation of max. number of tags
NDCG@20 comparison of the proposed method with different 𝑛𝑙 selections
The parameter 𝑛𝑙 is employed to filter out noisy tags
Computational cost comparison
Conclusion
Conclusion
• Proposal: joint utilization of both visual content and tags through a
hypergraph-based relevance learning procedure for social image search
• Consideration of the weights of hyperedges
• Differs from previous hypergraph learning algorithms
• Minimizes the effects of uninformative features
• Future work
• Diversity of search results is the next issue
Thank you !
Q&A


Editor's Notes

  • #5 we first identify a set of pseudo relevant samples based on tags. Then, we calculate the relevance scores of images by iteratively updating them and the weights of hyperedges.
  • #8 the tag information is first employed to generate initial relevance scores, and then the visual contents of images are used to refine the scores
  • #9 Different from these two types of methods, we propose a joint method which integrates both the visual content and the tags
  • #13 many machine learning tasks can be performed, i.e. clustering, classification, and ranking
  • #16 Construct hyperedges by using 𝑓𝑖^𝑏𝑜𝑤: the images sharing the same visual word are connected by one hyperedge.
  • #17 Construct hyperedges by using 𝑓𝑖^𝑡𝑎𝑔: the images sharing the same textual word are connected by one hyperedge.
  • #20 two social images tend to be connected with more hyperedges if they share a lot of tags or visual words
  • #22 In the constructed hypergraph, all the hyperedges are initialized with an identical weight, so performing a weighting or selection on the hyperedges is helpful. A 2-norm regularizer is imposed on w, and then w and f are optimized simultaneously, seeking the optimal results that minimize the cost function, which includes the loss cost, the hypergraph regularizer, and the hypergraph weight regularizer.
  • #23 With the hypergraph edge weight w, the hypergraph structure is optimized for the query. Under this hypergraph structure, f is the optimal to-be-learned relevance score vector for social image search
  • #25 To generate y, we usually need a set of relevant samples. A straightforward approach is to regard all the images that have the query tag as relevant. To reduce the noise, a set of pseudo-relevant samples is selected: images associated with the query tag that also have high relevance probabilities. The corresponding elements of these images are set to 1, and the other elements to 0.
  • #26 In Flickr distance, a group of images are first obtained from Flickr for each tag, and a latent topic based visual language model is built to model the visual characteristic of the tag. The Flickr distance is calculated by using Jensen-Shannon divergence between the two visual language models.
  • #28 Each image is labeled with three relevance levels with respect to the corresponding query: very relevant, relevant, and irrelevant. For tags with more than one meaning, the images corresponding to the different meanings are all regarded as relevant (Apple could refer to a fruit, a mobile phone, or a computer). Sequential: initial relevance scores are estimated based on tags, and the scores are then refined with graph-based learning on the images' visual content. Tag ranking: initial relevance scores of the tags are estimated, and a random walk on a tag graph refines the relevance scores of the tags in each image. Tag relevance combination: tag relevance estimates are combined based on the largest-entropy assumption. HG: all hyperedge weights are treated equally.
  • #29 hypergraph learning is effective in social image modeling. hyperedge weighting method can greatly improve the performance of hypergraph learning. Show the effectiveness of combining visual and tag information
  • #33 enhance the descriptive visual words for a given query
  • #34 also see that they are closely related to the queries
  • #35 its value determines the closeness of f and y our approach is able to outperform the other methods when the two parameters vary in a wide range
  • #36 determine the number of hyperedges When nt increases from 0 to 2000, the image search performance in terms of NDCG@20 becomes better, while the growth speed is slower when nt is larger.
  • #37 When nl is larger than 10, the image search performance is relatively steady which can employ more meaningful tags for further processing and improve the image search performance.
  • #38 We have not taken the visual feature extraction step into consideration. HG+WE requires the highest computational cost and also achieves the best retrieval performance.
  • #40 The method learns not only the relevance scores among images but also the weights of hyperedges. Through the learning of hyperedge weights, the effects of uninformative features are reduced. The proposed method, which simultaneously utilizes both the visual content and the textual information, achieves better results than using either individually.