IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 11, NOVEMBER 2015 3321
Adaptive Metric Learning for Saliency Detection
Shuang Li, Huchuan Lu, Senior Member, IEEE, Zhe Lin, Member, IEEE,
Xiaohui Shen, Member, IEEE, and Brian Price
Abstract—In this paper, we propose a novel adaptive metric
learning algorithm (AML) for visual saliency detection. A key
observation is that the saliency of a superpixel can be estimated
by its distances from the most certain foreground and background
seeds. Instead of measuring distances in Euclidean space,
we present a learning method based on two complementary
Mahalanobis distance metrics: 1) generic metric learning (GML)
and 2) specific metric learning (SML). GML aims at the global
distribution of the whole training set, while SML considers the
specific structure of a single image. Considering that multiple
similarity measures from different views may enhance the
relevant information and suppress the irrelevant information,
we fuse GML and SML together and experimentally find that the
combined result works well. Unlike most existing methods,
which are directly based on low-level features,
we devise a superpixelwise Fisher vector coding approach to
better distinguish salient objects from the background. We also
propose an accurate seeds selection mechanism and exploit
contextual and multiscale information when constructing the final
saliency map. Experimental results on various image sets
show that the proposed AML performs favorably against
state-of-the-art methods.
Index Terms—Metric learning, saliency detection,
Mahalanobis distance, Fisher vector.
I. INTRODUCTION
VISUAL saliency aims at finding the regions on an image
that are more visually distinctive or important and often
serves as a pre-processing procedure for many vision tasks,
such as image categorization [1], image retrieval [2], image
compression [3], content-aware image/video resizing [4], etc.
Visual saliency basically breaks down into the problem of
separating the salient regions from the non-salient ones by
measuring differences in their features. Numerous models and
algorithms have been proposed for this task. Unsupervised
approaches [5]–[9] are stimuli-driven and rely largely on
distinguishing low-level visual features. Early unsupervised
models, such as Gaussian pyramids [5], center-surround contrast [5], and fuzzy growing [10], are mainly inspired by biological vision stimuli.
Manuscript received August 21, 2014; revised February 10, 2015 and April 10, 2015; accepted May 26, 2015. Date of publication June 3, 2015; date of current version June 23, 2015. This work was supported in part by the Natural Science Foundation of China under Grant 61472060 and in part by the Fundamental Research Funds for the Central Universities under Grant DUT14YQ101. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. Pierre-Marc Jodoin.
S. Li and H. Lu are with the School of Information and Communication Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China (e-mail: shuangli59app@gmail.com; lhchuan@dlut.edu.cn).
Z. Lin, X. Shen, and B. Price are with Adobe Research, San Jose, CA 95110 USA (e-mail: zlin@adobe.com; xshen@adobe.com; bprice@adobe.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2015.2440755
Fig. 1. The comparison between the Euclidean distance space and the Mahalanobis distance space. The Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient.
Later studies address saliency detection from
broader views, e.g., convex hull [7], [11] and frequency
domain [12], [13]. In contrast, supervised methods [14]–[16]
incorporate high-level and known information to better
distinguish the salient regions by learning salient visual
information from a large number of images with ground truth
labels.
Despite the differences in these methods, they all require
the basic ability to compute a difference measure on region
features to distinguish them. To the best of our
knowledge, all existing models address saliency detection
based on the Euclidean distance. However, the Euclidean distance
weights features equally without considering the distribution of
the data, and thus becomes unreliable when detecting objects in
complex images. This phenomenon happens frequently in the
saliency detection process, especially when the salient regions
and backgrounds are similar, which leads to the problem
that the Euclidean distances between the foregrounds and
the similar backgrounds are smaller than the distances within
the foregrounds. Figure 1 illustrates this problem. Given an
image, we first select some initial seeds, including foreground
and background seeds. The seeds selection process is the
same as that described in Section III-C. We compute the distance
between each superpixel and the seeds and draw the distance
distribution in Figure 1. We observe that the Mahalanobis
distance is more distinctive than the Euclidean distance, since
its background part is less salient. This motivates us to train
a discriminative distance metric to assign appropriate weights
to features so that the objects can be precisely separated from
the background.
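For a concrete sense of the difference, the following sketch (ours, not the paper's code) contrasts the two distances; the metric M here is a hand-set placeholder standing in for a learned metric.

```python
import numpy as np

def euclidean_dist(x, y):
    # Plain Euclidean distance: all feature dimensions weighted equally.
    return np.sqrt(np.sum((x - y) ** 2))

def mahalanobis_dist(x, y, M):
    # Mahalanobis distance under a metric M (positive semi-definite):
    # M re-weights and correlates feature dimensions, so directions that
    # separate foreground from background can count more.
    d = x - y
    return np.sqrt(d @ M @ d)

# Toy example: a metric that emphasizes the first feature dimension.
x, y = np.array([0.9, 0.2]), np.array([0.1, 0.25])
M = np.diag([4.0, 0.25])
print(euclidean_dist(x, y), mahalanobis_dist(x, y, M))
```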
We use metric learning to compute a more discriminative
distance measure. Distance metric learning has been widely
Fig. 2. The comparison between low-level features and our SFV feature. (a) input image. (b) saliency map based on low-level features. (c) saliency map
based on SFV. (d) ground truth.
Fig. 3. Pipeline of the adaptive metric learning algorithm. IT [5], GB [19], LR [14], and RC [6] are four other saliency methods.
adopted for this purpose in different applications since it takes
into account the covariance information when estimating the
data distributions and improves the performance of learning
methods significantly. To our knowledge, we are the first to
successfully formulate saliency detection within a
metric learning framework, and our method works well on
different databases. We also propose a Superpixel-wise Fisher
Vector (SFV) coding approach, which maps low-level features,
such as RGB and LAB, to a high-dimensional sparse vector.
Compared with using low-level features directly, the SFV is
more discriminative in challenging environments as shown
in Figure 2. Thus we use SFV features to describe each
superpixel.
In this paper, we adopt an effective feature coding method
and propose a novel metric learning based saliency detection
model, which incorporates both supervised and
semi-supervised information. Our algorithm considers both
the global distribution of the whole training dataset (GML)
and the typical structure of a specific image (SML), and
we successfully fuse them together to extract the clustering
characteristics for estimating the final saliency map. Figure 3
shows the pipeline of our method. First, as an extension of the
traditional Fisher Vector coding [17], Superpixel-wise Fisher
Vector coding is proposed to describe superpixels by learning
the parameters of a Gaussian mixture model (Section III-A).
Second, we train a Generic metric from the training
set (Section III-B1) and apply it to a single image to find
the saliency seeds with the assistance of the superpixel-wise
objectness map generated by [18] (Section III-C).
Third, a Specific metric based on kernel classification is
learnt from the chosen seeds for each image (Section III-B2).
Finally, by integrating the Generic metric and Specific
metric together (Section III-D), we obtain the clustering
information for each superpixel and use it to generate the
final saliency map (Section III-E). The GML and SML maps
shown in Figure 3 are two intermediate images that are
not actually generated when computing saliency maps; they
serve as comparisons to demonstrate the effectiveness of the
fused results in Section IV-A. The main contributions of our
work include:
• Two metric learning approaches are applied to
saliency detection for the first time, serving as the optimal
distance measure between two superpixels. GML is learnt from the global training
set while SML is learnt from the specific image training
samples. They are complementary to each other and
achieve promising results after the affinity aggregation.
• A superpixel-wise Fisher Vector coding method is first
put forward, which incorporates image contextual information
when representing superpixels and makes supervised
learning methods more suitable for single-image
processing.
• An accurate seeds selection method is first presented
based on the Mahalanobis distance metric. The selected
seeds serve as training samples of the Specific metric
learning and reference nodes when evaluating saliency
values.
Experimental results on various image sets show that our
method is comparable with most state-of-the-art methods, and
the proposed metric learning approaches can be extended to
other fields as well.
II. RELATED WORK
Significant improvements in saliency detection have been
witnessed in recent years. Numerous
unsupervised approaches have been proposed under different
theoretical models. Cheng et al. [6] propose a global region
contrast algorithm which simultaneously considers the spatial
coherence across the regions and the global contrast over
the entire image. However, low-level color contrast becomes
invalid when dealing with challenging scenes. Li et al. [20]
compute the dense and sparse reconstruction errors based
on background templates which are extracted from image
boundaries. They propose several integration strategies, such
as multi-scale reconstruction error and Bayesian integration,
which improve the performance of saliency detection
significantly. In [21], boundary connectivity, a robust
background measure, is first applied to saliency detection.
It characterizes the spatial layout of image regions and
provides a specific geometrical explanation to its definition.
Perazzi et al. [22] formulate saliency estimation and
contrast computation using high-dimensional Gaussian filters.
They modify SLIC [23] and demonstrate the effectiveness of
their superpixel segmentation approach in detecting salient
objects.
Furthermore, lacking knowledge of the sizes and locations
of objects, boundary priors and objectness are often adopted
to highlight the salient regions or suppress the background.
Jiang et al. [18] construct saliency by integrating
three visual cues, including uniqueness, focusness and
objectness (UFO), where uniqueness represents color contrast;
focusness indicates the degree of focus, often appearing as the
reverse of blurriness; objectness proposed by Alexe et al. [24]
is the likelihood of a given image window containing an
object. In [25], Wei et al. define the saliency value of
each patch as the shortest distance to the image boundary,
observing that image boundaries are more likely to be the
background. However, this assumption is less convincing,
especially when the scene is challenging.
Compared with unsupervised approaches, supervised
methods are relatively rare. In [26] and [27], Jiang et al.
also propose a multi-scale learning approach, which maps the
regional feature vector to a saliency score and fuses these
scores across multiple levels to generate the final saliency
map. They introduce a novel feature vector, which integrates
the regional contrast, regional property and regional
backgroundness descriptors together, to represent each region
and learn a discriminative random forest regressor to predict
regional scores. Shen and Wu [14] treat an image as a
combination of sparse noise and a low-rank matrix. They
extract low-level features to form high-level priors and then
incorporate the priors to a low-rank matrix recovery model
for constructing the saliency map. However, the saliency
assignment near the object is unsatisfactory due to the ambiguity
of prior maps. Liu et al. [28] formulate the saliency detection
as a partial differential equation problem and solve it under
an adaptive PDE learning framework. They learn the optimal
saliency seeds via discrete submodularity and use the seeds as
boundary conditions to solve a linear elliptic system.
Inspired by these works, we construct a metric fusion
framework which contains two complementary metric learning
approaches to generate robust and accurate saliency maps even
in complex scenes. Our method encodes low-level features into
a high-dimensional feature space and incorporates multi-scale
and objectness information when measuring saliency values.
Therefore, our method can uniformly highlight objects with
explicit object boundaries.
III. PROPOSED ALGORITHM
In this section, we present an effective and robust adaptive
metric learning method for visual saliency detection. The
proposed algorithm proceeds through five steps to generate
the final saliency map. Firstly, we extract low-level features
to encode the superpixels generated by the simple
linear iterative clustering (SLIC) [23] algorithm
with a Superpixel-wise Fisher Vector representation.
Secondly, two Mahalanobis distance metric learning
approaches, Generic metric learning and Specific metric
learning, are introduced to learn the optimal distance measure
of superpixels. Thirdly, we propose a novel seeds selection
strategy based on the Mahalanobis distance to generate
saliency seeds, which serve as training samples for the Specific
metric and as reference nodes when evaluating saliency
values. Fourthly, a metric fusion framework is
presented to fuse the Generic and Specific metrics together.
Finally, we obtain smooth and coherent saliency maps by
combining spectral clustering and multi-scale information.
A. Superpixel-Wise Fisher Vector Coding (SFV)
Appropriate feature coding approaches can effectively
extract main information and remove the redundancies, thus
greatly improving the performance of saliency detection.
Fisher Vector can be regarded as an extension of the
well-known bag-of-words representation, since it captures
the first-order and second-order differences between local
features and the centers of a Mixture of Gaussian Distributions.
Recently, Chen et al. [29] extend Fisher Vector to the point
level image representation for object detection. For a different
purpose, we propose to further extend the FV coding to
superpixel level and experimentally verify the superiority of
our Superpixel-wise Fisher Vector coding method.
Given a superpixel $i = \{p_t, t = 1, \ldots, T\}$, where $p_t$ is a $d$-dimensional image pixel and $T$ is the number of pixels within $i$, we train a Gaussian mixture model (GMM) $\lambda(p_t) = \sum_{k=1}^{K} \upsilon_k \psi_k(p_t)$ from all the pixels of an image using the Maximum Likelihood (ML) criterion. The parameters of the $K$-component GMM are defined as $\lambda = \{\upsilon_k, \mu_k, \Sigma_k, k = 1, \ldots, K\}$, where $\upsilon_k$, $\mu_k$ and $\Sigma_k$ are the mixture weight, mean vector and covariance matrix of Gaussian $k$ respectively. Similar to the FV coding method, the SFV representation can be written in a $D = 2dK$-dimensional concatenated form:
$$\varphi_i = \{\zeta_{\mu_1}, \zeta_{\sigma_1}, \ldots, \zeta_{\mu_K}, \zeta_{\sigma_K}\} \qquad (1)$$
where $\zeta_{\mu_k}$ and $\zeta_{\sigma_k}$ are defined as
$$\zeta_{\mu_k} = \frac{1}{T\sqrt{\upsilon_k}} \sum_{t=1}^{T} \eta_t(k)\, \frac{p_t - \mu_k}{\sigma_k}, \qquad \zeta_{\sigma_k} = \frac{1}{T\sqrt{\upsilon_k}} \sum_{t=1}^{T} \eta_t(k)\, \frac{1}{\sqrt{2}} \Big\{\frac{(p_t - \mu_k)^2}{\sigma_k^2} - 1\Big\},$$
$\sigma_k$ is the square root of the diagonal values of $\Sigma_k$, and $\eta_t(k)$ is the soft assignment of $p_t$ to Gaussian $k$.
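As an illustration, a minimal SFV encoder for one superpixel might look as follows; this is a sketch of Eqn 1 assuming diagonal-covariance GMMs fitted with scikit-learn on 6-D per-pixel features, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def sfv_encode(pixels, gmm):
    """Superpixel-wise Fisher Vector (Eqn 1): pixels is a (T, d) array of
    low-level features for one superpixel; gmm is fitted on the whole image."""
    T, d = pixels.shape
    eta = gmm.predict_proba(pixels)              # soft assignments eta_t(k), (T, K)
    w, mu = gmm.weights_, gmm.means_             # upsilon_k, mu_k
    sigma = np.sqrt(gmm.covariances_)            # diagonal covariances assumed
    phi = []
    for k in range(gmm.n_components):
        u = (pixels - mu[k]) / sigma[k]          # normalized first-order residual
        zeta_mu = (eta[:, [k]] * u).sum(0) / (T * np.sqrt(w[k]))
        zeta_sig = (eta[:, [k]] * (u**2 - 1)).sum(0) / (T * np.sqrt(2 * w[k]))
        phi.extend([zeta_mu, zeta_sig])
    return np.concatenate(phi)                   # D = 2*d*K dimensional

# Usage: fit a K=1 diagonal GMM on all image pixels, then encode each superpixel.
all_pixels = np.random.rand(10000, 6)            # stand-in for RGB+LAB features
gmm = GaussianMixture(n_components=1, covariance_type="diag").fit(all_pixels)
phi_i = sfv_encode(all_pixels[:200], gmm)        # 12-D SFV for one superpixel
```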
The SFV representation ϕi is hereby used to describe
superpixel i in this paper. It has several advantages:
• As an extension of Fisher Vector coding,
SFV successfully realizes superpixel level coding
representation, making Fisher Vector more suitable for
single image processing. Instead of averaging low-level
features of contained pixels, SFV statistically analyzes
the internal feature distribution of each superpixel,
providing a more accurate and reliable representation
for it. Experiments show that our SFV generates smoother
and more uniform saliency maps and improves the
precision-recall curve by about 2 percent compared with
low-level features on the MSRA-1000 database, as
shown in Figure 7.
• SFV can be regarded as an adaptive Fisher Vector coding,
since the parameters of the GMM model are trained
on a specific image online. This means even the same
superpixels in different images have different coding
representations. Therefore, our SFV better considers
image contextual information.
• Due to the small number of superpixels in an image and
their disjoint nature, SFV is much faster than existing
state-of-the-art FV variants. Furthermore, besides saliency
detection, SFV can also be applied to other vision tasks,
such as image segmentation and content-aware image
resizing, etc.
B. Adaptive Metric Learning
Learning a discriminative metric can better distinguish the
samples in different classes, as well as shortening the distance
within the same class. Numerous models and methods
have been proposed in the last decade, especially for the
Mahalanobis distance metric learning, such as information
theoretic metric learning (ITML) [30], large margin nearest
neighbor (LMNN) [31], [32], and logistic discriminative based
metric learning (LDML) [33].
However, most existing metric learning approaches learn a
fixed metric for all samples without considering the deeper
structure of the data, thereby breaking down in the presence of
irrelevant or unreliable features. In this paper, we propose an
adaptive metric learning approach, which considers both the
global distribution of the whole training set (GML) and the
specific structure of a single image (SML) to better separate
objects from the background. Our approach can also be viewed
as an integration of a supervised distance metric learning
model (GML) and a semi-supervised distance metric learning
model (SML). Since GML and SML are complementary to
each other, we obtain promising results after fusing
them together under an affinity aggregation framework
(Section III-D).
1) Generic Metric Learning (GML): Metric learning has
been widely applied to vision tasks, but never been used for
saliency detection because of its long training time, which is
infeasible for single image processing. In this part, we solve
this problem by pre-training a Generic metric Mg from the
first 500 images of the MSRA-1000 database using gradient
descent, and we verify, both experimentally and empirically,
that Mg is generally suitable for all images.
First, we construct a training set $\{\varphi_i, i = 1, 2, \ldots, M\}$ consisting of superpixels extracted from all training images, where $\varphi_i$ is the SFV representation of superpixel $i$. To find the most discriminative $M_g$, we minimize
$$M_g^* = \arg\min_{M_g} \frac{1}{2}\alpha \|M_g\|^2 + \sum_n \sum_{\{ij \mid \delta_i^n = 1,\, \delta_j^n = 0\}} D(i, j) \qquad (2)$$
$$D(i, j) = \exp\{-(\varphi_i - \varphi_j)^T M_g (\varphi_i - \varphi_j)/\sigma_1^2\} \qquad (3)$$
where $\delta_i^n$ indicates whether the $i$th superpixel in the $n$th image belongs to the foreground or the background, and $D(i, j)$ is the exponential Mahalanobis distance between $i$ and $j$ under the distance metric $M_g$. We set $\sigma_1 = 0.1$ to control the strength of distances.
Considering that the background is diverse and cluttered, and
that different object regions are distinctive as well, we only impose
restrictions on pairwise distances between positive samples
and negative ones, which is more reliable and reasonable
given that salient objects are always distinct from the
background. This minimization maximizes the feature
distances between foreground and background samples,
thereby significantly improving the performance of saliency
detection. Eqn 2 can be easily solved by gradient descent.
The Generic metric incorporates the information of all superpixels
in the whole training set, and is thus appropriate for most
images.
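A minimal sketch of this training step, assuming the objective of Eqns 2-3 and plain full-batch gradient descent (the PSD projection is a common safeguard we add; the paper does not detail this step):

```python
import numpy as np

def train_gml(pairs, dim, alpha=1e-3, sigma1=0.1, lr=0.1, iters=200):
    """Gradient descent for Eqns 2-3. `pairs` is a list of (phi_fg, phi_bg)
    SFV pairs, one foreground and one background superpixel per pair."""
    M = np.eye(dim)                                # start from the Euclidean metric
    for _ in range(iters):
        grad = alpha * M                           # gradient of (alpha/2)*||M||^2
        for phi_i, phi_j in pairs:
            d = phi_i - phi_j
            e = np.exp(-d @ M @ d / sigma1**2)     # D(i, j) in Eqn 3
            grad -= (e / sigma1**2) * np.outer(d, d)
        M -= lr * grad
        # Project back onto the PSD cone so M stays a valid metric
        # (a common safeguard; not described in the paper).
        vals, vecs = np.linalg.eigh(M)
        M = (vecs * np.clip(vals, 0, None)) @ vecs.T
    return M
```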
2) Specific Metric Learning (SML): Recently,
Wang et al. [34] propose a novel doublet-SVM metric
learning approach based on a kernel classification framework,
formulating metric learning as an SVM problem and
achieving desirable results with less training time. However,
experiments show that directly applying doublet-SVM
to saliency detection cannot ensure good detection
accuracy. Therefore, we modify this approach by adding
a weight $\omega(\tau_1,\tau_2)$, which significantly improves the
performance of the final saliency map.
Let {ϕi, i = 1, 2, . . . , m} be the training dataset, where ϕi is
the SFV representation of a labeled superpixel extracted from
a specific image. The detailed process of extracting labeled
superpixels from an image will be discussed in Section III-C.
We first divide these samples into foreground seeds and
background seeds and label them as 1 and 0 respectively.
Given a training sample ϕi with label hi , we find its q1 nearest
neighbors with the same label and q2 nearest neighbors with
different labels, and then (q1 + q2) doublets are constructed
for it. Each doublet consists of the training sample ϕi and
one of its nearest neighbors. By combining the doublets of
all samples together, a doublet set χ = {x1, x2, . . . , xZ } is
established, where xτ = (ϕτ,1, ϕτ,2), τ = 1, 2, . . . Z is one
of the doublets, and ϕτ,1 and ϕτ,2 are the SFV of superpixel
τ1 and τ2 in doublet xτ , We assign xτ a label as follows:
lτ = −1 if hτ,1 = hτ,2, and lτ = 1 if hτ,1 = hτ,2.
LI et al.: ADAPTIVE METRIC LEARNING FOR SALIENCY DETECTION 3325
As an extension of degree-2 polynomial kernel, we define
the doublet level degree-2 polynomial kernel as:
Kp(xτ , xι)
= tr
ω(τ1,τ2)(ϕτ,1 − ϕτ,2)(ϕτ,1 − ϕτ,2)T
ω(ι1,ι2)(ϕι,1 − ϕι,2)(ϕι,1 − ϕι,2)T
= ω(τ1,τ2)ω(ι1,ι2){(ϕτ,1 − ϕτ,2)T
(ϕι,1 − ϕι,2)}2
(4)
where $\omega(\tau_1,\tau_2) = \theta(\tau_1,\tau_2) \cdot O(\tau_1,\tau_2)$ is a weight parameter, with
$$\theta(\tau_1,\tau_2) = 1 - \exp(-\mathrm{dist}(\tau_1,\tau_2)/\sigma_2) \qquad (5)$$
$$O(\tau_1,\tau_2) = 1 - \exp\{-(O_{\tau_1} - O_{\tau_2})^2/\sigma_2\} \qquad (6)$$
Here $\mathrm{dist}(\tau_1,\tau_2)$ is the spatial distance between superpixels $\tau_1$ and $\tau_2$, and $\theta(\tau_1,\tau_2)$ is the corresponding exponential spatial distance. $O_{\tau_1}$ is the objectness score of superpixel $\tau_1$ as defined in Eqn 11, and $O(\tau_1,\tau_2)$ is the superpixel-wise objectness distance between $\tau_1$ and $\tau_2$. We set $\sigma_2 = 0.1$. The weight parameter $\omega(\tau_1,\tau_2)$ provides crucial spatial and prior information regarding the objects of interest, and is thus more robust for evaluating the similarity between a pair of superpixels than the feature distance alone. In order to determine the similarity of the two samples in a doublet, we further define a kernel decision function as follows:
$$E(x) = \mathrm{sgn}\Big\{\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta\Big\} \qquad (7)$$
where $\alpha_\tau$ is the weight of doublet $x_\tau$ and $\beta$ is a bias parameter. We have
$$\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta = \omega(x_1,x_2)(\varphi_{x,1} - \varphi_{x,2})^T M_s (\varphi_{x,1} - \varphi_{x,2}) + \beta \qquad (8)$$
$$M_s = \sum_\tau \alpha_\tau l_\tau\, \omega(\tau_1,\tau_2)(\varphi_{\tau,1} - \varphi_{\tau,2})(\varphi_{\tau,1} - \varphi_{\tau,2})^T \qquad (9)$$
For ease of computation, we set $\omega(x_1,x_2) = 1$. The
proposed Specific metric $M_s$ can be easily solved by existing
SVM solvers. The Specific metric is trained only on the
test image, and training it is much faster than with existing metric
learning approaches: according to [34], the doublet-SVM
is on average 2000 times faster than ITML [30].
Therefore, it is feasible to train a Specific metric for each
image to better distinguish its objects from the background.
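A condensed sketch of SML training follows; it uses scikit-learn's SVC with a precomputed kernel as a stand-in for whichever SVM solver the authors used, and the inputs (doublet differences, weights, labels) are assumed to come from the construction above.

```python
import numpy as np
from sklearn.svm import SVC

def train_sml(diffs, weights, labels, C=1.0):
    """diffs: (Z, D) array of doublet differences (phi_{t,1} - phi_{t,2});
    weights: (Z,) array of omega(t1, t2) from Eqns 5-6;
    labels: (Z,) array with -1 for same-label doublets, +1 otherwise."""
    # Doublet-level degree-2 polynomial kernel (Eqn 4):
    # K[t, i] = w_t * w_i * (diffs[t] @ diffs[i])**2
    G = diffs @ diffs.T
    K = np.outer(weights, weights) * G**2
    svm = SVC(kernel="precomputed", C=C).fit(K, labels)
    # Recover the Specific metric M_s from the dual solution (Eqn 9).
    alpha_l = svm.dual_coef_.ravel()               # alpha_tau * l_tau for SVs
    sv = svm.support_
    Ms = np.zeros((diffs.shape[1], diffs.shape[1]))
    for a, t in zip(alpha_l, sv):
        Ms += a * weights[t] * np.outer(diffs[t], diffs[t])
    return Ms
```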
In this part, we propose two metric learning approaches:
GML and SML. The first one considers more about the global
distribution of the whole training set, while the second one
aims at exploring the deeper structure of a specific image.
GML can be pretrained offline and is generally suitable for
all images, while SML is much faster, since it can be solved
by existing SVM solvers. We should mention that the
image-specific metric is not always better than the Generic metric, as it has
fewer training samples and less reliable labels. Instead, these
and can be fused together to improve the performance of the
final detection results.
C. Iterative Seeds Selection by Mahalanobis Distance (ISMD)
As a preliminary criterion of saliency detection, saliency
seeds directly influence the performance of seeds-based
solutions. Recently, Liu et al. [28] propose an optimal
seeds selection strategy via submodularity. By adding a stop
criterion, the submodularity problem can be solved and then
the optimal seed set is obtained accordingly. In [35], Lu et al.
learn optimal seeds by combining bottom-up saliency maps
and mid-level vision cues. Inspired by their works, we propose
a compact but efficient iterative seeds selection scheme based
on the Mahalanobis distance assessment (ISMD).
Alexe et al. [24] present a novel objectness method to
measure the likelihood of a given image window containing
an object. Jiang et al. [18] extend the original objectness to
Pixel-level Objectness $O(p)$ and Region-level Objectness $O_i$ by defining:
$$O(p) = \sum_{w=1}^{W} P(w) \qquad (10)$$
$$O_i = \frac{1}{T} \sum_{p \in i} O(p) \qquad (11)$$
where $W$ is the number of sampling windows that contain pixel $p$, $P(w)$ is the probability score of the $w$th window, and $T$ is the number of pixels within region $i$. We refer to the region-level objectness as superpixel-wise objectness in this paper.
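As a sketch (with hypothetical window inputs; the actual window sampling and scoring come from [24]), Eqns 10-11 can be accumulated as:

```python
import numpy as np

def objectness_maps(windows, scores, shape, superpixels):
    """windows: list of (x0, y0, x1, y1) sampled boxes; scores: P(w) per box;
    superpixels: (H, W) label map. Returns per-pixel O(p) and per-superpixel O_i."""
    H, W = shape
    O_p = np.zeros((H, W))
    for (x0, y0, x1, y1), p_w in zip(windows, scores):
        O_p[y0:y1, x0:x1] += p_w              # Eqn 10: sum scores of covering windows
    n_sp = superpixels.max() + 1
    O_i = np.array([O_p[superpixels == i].mean() for i in range(n_sp)])  # Eqn 11
    return O_p, O_i
```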
Motivated by the fact that highlighted regions of the superpixel-wise objectness map are more likely to be foreground seeds, a set of initial foreground seeds is constructed from the lightest two percent of regions of the objectness map. Considering that the background is massive and scattered, we pick out several superpixels with the lowest objectness values from each boundary of the superpixel-wise objectness map as initial background seeds. The intuition is that if superpixel $i$ is a foreground seed, the ratio of its distances from foreground seeds to those from background seeds should be small. We formulate the ratio as follows:
$$\Lambda_i = \frac{\sum_{fs} d_{rat}(i, fs)}{\sum_{bs} d_{rat}(i, bs)} \qquad (12)$$
where
$$d_{rat}(i, fs) = \phi(i, fs)(\varphi_i - \varphi_{fs}) M_g (\varphi_i - \varphi_{fs})^T \qquad (13)$$
is the Mahalanobis distance between superpixel $i$ and one of the foreground seeds $fs$ under the Generic metric $M_g$, and $\phi(i, fs) = d(i, fs) \cdot O(i, fs)$ is a weight parameter, where
$$d(i, fs) = \exp(-\mathrm{dist}^2(i, fs)/\sigma_2) \qquad (14)$$
is another kind of exponential spatial distance between superpixel $i$ and $fs$. Only when $\Lambda_i \le \theta_0$ or $\Lambda_i \ge \theta_1$ can superpixel $i$ be added to the foreground seed set or the background seed set respectively, where $\theta_0$ and $\theta_1$ are two thresholds. With the new seeds added each time, we iterate this process $N_1$ times.
Since most of the area in an image belongs to the background, in order to generate more background seeds, the iteration continues $N_2$ more times, but only selects seeds with $\sum_{bs} d_{rat}(i, bs) \le \theta_2$, where $\theta_2$ is a threshold. Then we obtain the final seeds set as illustrated in Figure 4.
Fig. 4. Iterative seeds selection by Mahalanobis distance. Initial saliency seeds are first selected from the lightest and the darkest parts of the superpixel-wise objectness map. By computing the Mahalanobis distance between any superpixel and the chosen seeds, we iteratively increase the foreground and background seeds.
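The selection loop can be sketched as follows; the threshold and iteration-count values are illustrative placeholders, and d_rat follows Eqns 12-14:

```python
import numpy as np

def ismd(phi, Mg, weights, fg, bg, theta0=0.2, theta1=1.0, theta2=5.0, N1=3, N2=2):
    """phi: (r, D) SFV features; weights: (r, r) phi(i, j) weights from Eqn 14;
    fg, bg: sets of initial seed indices taken from the objectness map."""
    def d_rat(i, seeds):                      # Eqn 13, summed over a seed set
        return sum(weights[i, s] * (phi[i]-phi[s]) @ Mg @ (phi[i]-phi[s]) for s in seeds)

    for it in range(N1 + N2):
        for i in range(len(phi)):
            if i in fg or i in bg:
                continue
            if it < N1:                       # Eqn 12: ratio-based selection
                ratio = d_rat(i, fg) / d_rat(i, bg)
                if ratio <= theta0:
                    fg.add(i)
                elif ratio >= theta1:
                    bg.add(i)
            elif d_rat(i, bg) <= theta2:      # extra rounds: background seeds only
                bg.add(i)
    return fg, bg
```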
As elaborated in Section III-B2, the Specific metric Ms
can be learnt from the labeled seeds via doublet-SVM.
One may be concerned that $M_s$ will rely too much on $M_g$, since the
labeled seeds are generated under $M_g$. Fortunately, by learning
a generally suitable metric, we can enforce a very high seeds
accuracy (98.82% on the MSRA-1000 database), which means the
seeds-based Specific metric is reliable enough to measure the
distance.
D. Metric Fusion for Extracting Spectral
Clustering Characteristics
Aggregating several affinity matrices appropriately may
enhance the relevant and useful information and, at the same
time, suppress the irrelevant and unreliable information. Spectral
clustering is an important unsupervised clustering algorithm
for transferring the feature representation into a more
discriminative indicator space; we refer to this property as
"spectral clustering characteristics". Spectral clustering has
been applied to many fields for its effective and outstanding
performance.
In this section, we merge the metric fusion into a spectral clustering feature extraction process [36] and learn the optimal aggregation weight for each affinity matrix. The fusion strategy significantly improves the results of saliency detection, as shown in Figure 5. Based on the two metrics learnt above, two affinity matrices $\Pi_g$ and $\Pi_s$ are constructed with the corresponding $(i,j)$th elements
$$\pi^g_{i,j} = \exp\{-\phi(i, j)(\varphi_i - \varphi_j) M_g (\varphi_i - \varphi_j)^T/\sigma_3\}$$
$$\pi^s_{i,j} = \exp\{-\phi(i, j)(\varphi_i - \varphi_j) M_s (\varphi_i - \varphi_j)^T/\sigma_3\} \qquad (15)$$
where $\sigma_3 = 0.1$.
Fig. 5. Evaluation of metrics. (a) Input images. (b) Generic metric. (c) Specific metric. (d) Fused results. (e) Ground truth.
The affinity aggregation strategy aims at finding the optimal clustering characteristic vector $U$ of all the superpixels in an image and the weight parameter $\vartheta = [\vartheta_g, \vartheta_s]^T$ associated with $\Pi_g$ and $\Pi_s$, so the fusion problem can be formulated as:
$$\min_{\vartheta_g, \vartheta_s, u_1, \ldots, u_r} \Big\{ \sum_{i,j} \vartheta_g^2\, \pi^g_{i,j}\, \|u_i - u_j\|^2 + \sum_{i,j} \vartheta_s^2\, \pi^s_{i,j}\, \|u_i - u_j\|^2 \Big\} = \min_{\vartheta_g, \vartheta_s, U} \big\{ \vartheta_g^2\, U^T (H_g - \Pi_g) U + \vartheta_s^2\, U^T (H_s - \Pi_s) U \big\} = \min_{\vartheta_g, \vartheta_s} \big( \beta_g \vartheta_g^2 + \beta_s \vartheta_s^2 \big) \qquad (16)$$
where $u_i$ is the clustering characteristic indicator of superpixel $i$, $r$ is the number of superpixels in an image, $H_g = \mathrm{diag}\{h_{11}, \ldots, h_{rr}\}$ is the diagonal matrix of $\Pi_g$ with diagonal elements $h_{ii} = \sum_j \pi^g_{i,j}$, and $\beta_g = U^T (H_g - \Pi_g) U$. To solve this problem, we first employ two constraints: the normalized weight constraint $\vartheta_g + \vartheta_s = 1$ and the normalized spectral clustering constraint $U^T H U = 1$. By fixing $\vartheta$, the clustering characteristic vector $U$ can be easily obtained using standard spectral clustering. If $U$ is given, Eqn 16 can be formulated as:
$$\min_{\vartheta_g, \vartheta_s} (\beta_g \vartheta_g^2 + \beta_s \vartheta_s^2) = \min_{\mu_g, \mu_s} (\rho_g \mu_g^2 + \rho_s \mu_s^2) \qquad (17)$$
subject to
$$\mu_g^2 + \mu_s^2 = 1, \qquad \frac{\mu_g}{\sqrt{\alpha_g}} + \frac{\mu_s}{\sqrt{\alpha_s}} = 1 \qquad (18)$$
where $\alpha_g = U^T H_g U$, $\rho_g = \beta_g / \alpha_g$, and $\mu_g = \sqrt{\alpha_g}\, \vartheta_g$. This can be easily solved by existing 1D line-search methods.
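A compact sketch of this two-step alternation, with the smallest generalized eigenvectors standing in for "standard spectral clustering" and a grid search standing in for the 1D line search:

```python
import numpy as np
from scipy.linalg import eigh

def fuse_metrics(Pi_g, Pi_s, n_iters=3, dim=4):
    """Two-step alternation for Eqns 16-18: fix the weights and solve the
    spectral problem for the indicators U, then fix U and update the weights."""
    th_g = th_s = 0.5
    for _ in range(n_iters):
        # Spectral step: Laplacian of the weighted combined affinity.
        Pi = th_g**2 * Pi_g + th_s**2 * Pi_s
        H = np.diag(Pi.sum(1))
        # Smallest generalized eigenvectors of (H - Pi) u = lam * H u.
        _, U = eigh(H - Pi, H, subset_by_index=[0, dim - 1])
        # Weight step: beta_m = tr(U^T (H_m - Pi_m) U), then minimize
        # beta_g * t^2 + beta_s * (1 - t)^2 under th_g + th_s = 1.
        def beta(Pm):
            return np.trace(U.T @ (np.diag(Pm.sum(1)) - Pm) @ U)
        bg_, bs_ = beta(Pi_g), beta(Pi_s)
        th_g = min(np.linspace(0.0, 1.0, 101), key=lambda t: bg_*t**2 + bs_*(1-t)**2)
        th_s = 1.0 - th_g
    return U, (th_g, th_s)
```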
To summarize, metric fusion tries to find the optimal
clustering characteristic vector and the optimal weight
parameter ϑ via a two-step iterative strategy. Since affinity
matrices incorporate φ(i, j) in Eqn 15, the convergence
can be very fast, about three iterations in each image.
We use the indicator representation to compute saliency maps
(Section III-E).
E. Context-Based Multi-Scale Saliency Detection
In this section, we propose a context-based multi-scale
saliency detection algorithm to compute the saliency map for
each image. Lacking the knowledge of sizes of objects, we first
generate superpixels in S different scales. Then the K-means
algorithm is applied in each scale to segment an image into $N$ clusters via their SFV features.
Fig. 6. The distribution of saliency values of ground truth foregrounds and backgrounds. (a) Generic metric on MSRA-1000. (b) Specific metric on MSRA-1000. (c) AML on MSRA-1000. (d) AML on MSRA-5000.
According to the intuition
that a superpixel is salient if its cluster neighbors are close
to the foreground seeds and far from the background seeds,
we define the distances between superpixel $i$ and the saliency seeds at scale $s$ as:
$$D^{(s)}_{i,fs} = \sum_{q=1}^{fn(s)} \Big\{ \gamma\, \|u_i - u_q\| + (1 - \gamma) \sum_{j=1}^{N_c^{(s)}} W_{i,j}\, \|u_j - u_q\| \Big\}$$
$$D^{(s)}_{i,bs} = \sum_{q=1}^{bn(s)} \Big\{ \gamma\, \|u_i - u_q\| + (1 - \gamma) \sum_{j=1}^{N_c^{(s)}} W_{i,j}\, \|u_j - u_q\| \Big\} \qquad (19)$$
where
$$W_{i,j} = Q_1 \exp\{-\mathrm{dist}(i,j)/\sigma_2\} \cdot Q_2 \exp\{-(O_i - O_j)^2/\sigma_2\} \qquad (20)$$
is the weighted distance between superpixel $i$ and its cluster neighbor $j$, $u_i$ is the clustering characteristic indicator of superpixel $i$, and $fn$ and $bn$ are the numbers of foreground and background seeds chosen by our ISMD seeds selection approach. $Q_1$, $Q_2$ and $\gamma$ are weight parameters, and $N_c$ is the number of cluster neighbors of superpixel $i$. The saliency value of superpixel $i$ can be formulated as:
$$\mathrm{sal}(i) = \sum_{s=1}^{S} \frac{\nu_s \cdot \exp(O_i)}{1 + \{1 - \exp(-D^{(s)}_{i,fs}/\sigma_4)\}/D^{(s)}_{i,bs}} = \sum_{s=1}^{S} \frac{\nu_s \cdot \exp(O_i) \cdot D^{(s)}_{i,bs}}{D^{(s)}_{i,bs} + 1 - \exp(-D^{(s)}_{i,fs}/\sigma_4)} \qquad (21)$$
where $\nu_s$ is the weight of scale $s$, and $\sigma_4 = 0.1$.
Considering all the other superpixels belonging to the same
cluster, together with multiple scales, smooths the saliency map
effectively and makes our approach more robust in dealing with
complicated scenes.
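A sketch of Eqns 19-21, assuming the indicators, weights, objectness scores, and seed lists come from the previous steps, and simplifying to a shared superpixel set across scales:

```python
import numpy as np

def saliency(u_scales, W_scales, O, fg_scales, bg_scales, neigh_scales,
             gamma=0.5, nu=None, sigma4=0.1):
    """u_scales[s]: (r, dim) clustering indicators at scale s; W_scales[s]:
    Eqn 20 weights; fg/bg_scales[s]: seed index lists; neigh_scales[s][i]:
    cluster neighbors of superpixel i. Simplification: the same r superpixels
    are assumed at every scale."""
    S = len(u_scales)
    if nu is None:
        nu = [1.0 / S] * S                         # uniform scale weights
    def D(i, seeds, u, W, neigh):                  # Eqn 19 for one superpixel
        total = 0.0
        for q in seeds:
            own = gamma * np.linalg.norm(u[i] - u[q])
            ctx = (1 - gamma) * sum(W[i, j] * np.linalg.norm(u[j] - u[q])
                                    for j in neigh[i])
            total += own + ctx
        return total
    sal = np.zeros(len(u_scales[0]))
    for s in range(S):
        u, W, neigh = u_scales[s], W_scales[s], neigh_scales[s]
        for i in range(len(sal)):
            d_fs = D(i, fg_scales[s], u, W, neigh)
            d_bs = D(i, bg_scales[s], u, W, neigh)
            # Eqn 21: high objectness, small fg distance, large bg distance.
            sal[i] += nu[s] * np.exp(O[i]) * d_bs / (d_bs + 1 - np.exp(-d_fs / sigma4))
    return sal
```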
IV. EXPERIMENTS
We evaluate the proposed method on four benchmark
datasets. The first one is MSRA-1000 [13], a subset of MSRA-5000, which has been widely used in previous works for its accurate human-labelled masks. The second one is the MSRA-5000 dataset [15], which includes 5000 more comprehensive images. The third one is THUS-10000 [37], which consists of 10000 images, each of which has an unambiguous salient object with pixel-wise ground truth labeling. The last one is Berkeley-300 [38], which contains more challenging scenes with multiple objects of different sizes and locations.
Fig. 7. (a) Precision-recall curves for the Generic metric, the Specific metric, and the fused results without neighbor smoothness (MSRA-1000 and Berkeley-300); precision-recall curves based on SFV and low-level features; precision-recall curves for two other fusion methods. (b) Images of fused results based on SFV and low-level features.
Since we have already used the first 500 images
of MSRA-1000 for training, we evaluate our algorithm
and compare it with other methods on the remaining 500 images of
MSRA-1000; on the 4500 images of MSRA-5000 that exclude
the 500 training images (MSRA-5000 contains all the images of
MSRA-1000); on the 9501 images of THUS-10000 (THUS-10000
contains 499 of the training images); and on Berkeley-300.
A. Evaluation of Metrics
We perform several comparative experiments, shown
in Figure 5, Figure 6 and Figure 7(a), to demonstrate the
effectiveness of the Generic metric (GML), the Specific metric (SML),
and their combination (AML based on SFV). In order to
eliminate the influence of the neighbor smoothness in Eqn 19 when
comparing metrics, we compute only the distance between each
superpixel and the seeds, instead of the sum of weighted distances
of its cluster neighbors:
$$D^{(s)}_{i,fs} = \sum_{q=1}^{fn(s)} \|u_i - u_q\|, \qquad D^{(s)}_{i,bs} = \sum_{q=1}^{bn(s)} \|u_i - u_q\| \qquad (22)$$
The precision-recall curves of the Generic metric and Specific
metric are almost the same, but their combination outperforms
both of them. We also try to add or multiply saliency maps
generated by these two metrics directly, but the PR curves are
much lower than our fusion approach in Figure 7(a). This
is consistent with our motivation: $M_g$ is trained from the whole training dataset, capturing the global distribution of the data, while $M_s$ targets a single image, capturing the specific structure of its samples.
Fig. 8. Results of different methods. (a), (b) Precision-recall curves on MSRA-1000. (c) Average precisions, recalls, F-measures and AUC on MSRA-1000. (d), (e) Precision-recall curves on MSRA-5000. (f) Average precisions, recalls, F-measures and AUC on MSRA-5000.
Fig. 9. Results of different methods. (a), (b) Precision-recall curves on THUS-10000. (c) Average precisions, recalls, F-measures and AUC on THUS-10000. (d), (e) Precision-recall curves on Berkeley-300. (f) Average precisions, recalls, F-measures and AUC on Berkeley-300.
Figure 5 demonstrates that the fused results significantly
remove the light saliency values in the background regions
produced by GML and SML. Since most parts in computing
saliency maps under different metrics are the same,
e.g., objectness prior map, seeds selection, etc., it is reasonable
that Figure 5 (b) and (c) are similar, but there are still
differences between them. To further prove this, we conduct an
extra experiment as shown in Figure 11. The second line of Figure 11
shows the results generated by fusing the GML with itself, the third line
shows the results generated by fusing the SML with itself, and the
fourth line is obtained by fusing the GML and SML. We refer to
them as GG, SS, and AML respectively. Limited by the image
resolution, some differences between the GML and SML may
not be visible in Figure 5, but the integration of a metric with
itself can apparently enlarge their distinctiveness. Furthermore,
if one metric is incorrect, the other can compensate for it.
The SS performs better than the GG in Figure 11 (a)-(e),
while the GG is better in (f)-(g), and the AML tends to
take the best results of both, which demonstrates that the
GML and SML are indeed complementary to each other and
improve the performance of saliency detection after fusion.
Figure 11 (k)-(m) show that if both the GML and SML produce
bad results, the fused results are still bad.
In addition, we plot the distribution of saliency values in
Figure 6. Ground truth masks provide a specific label, 1 or 0,
for each pixel, and we regard a superpixel as foreground when
more than 80% of its pixels are labelled 1; otherwise, the
superpixel is regarded as background. We put all the foreground
superpixels from the whole dataset together and get the
distribution of their saliency values computed by different
saliency methods as the red line. The blue line is the
distribution of saliency values of background superpixels.
Figure 6(a), (b), and (c) show the saliency distributions produced
by GML, SML and AML on MSRA-1000 respectively, and
Figure 6(d) shows AML on MSRA-5000. These plots show that AML is
better than GML and SML, since its background saliency
values are closer to 0.
Furthermore, our Generic metric is robust across different
databases. We apply the metric trained on MSRA-1000 to
all the databases, including MSRA-1000, MSRA-5000,
THUS-10000, and Berkeley-300. As shown in
Figure 8 and Figure 9, the results are still promising even
on different databases, which demonstrates the effectiveness
and adaptiveness of our Generic metric. Overall, the fused
results based on two outstanding and complementary metrics
achieve higher precision and recall values and generate more
accurate saliency maps.
B. Evaluation of Superpixel-Wise Fisher Vector
We have mentioned that our Superpixel-wise Fisher Vector
coding approach can improve the performance of saliency
detection by capturing the average first-order and second-order
differences between local features and the centers of a Mixture
of Gaussian Distributions. In our experiments, we extract the
low-level features RGB and LAB to learn a 12-D SFV
representation for each superpixel ($d = 6$, $K = 1$,
$D = 2dK = 12$). Figure 7(a) shows the effectiveness of our
SFV coding approach by comparing the precision-recall curves
of low-level features and the SFV on the MSRA-1000 database.
Figure 7(b) shows the corresponding images.
C. Evaluation of Saliency Maps
We compare the proposed saliency detection model with
several state-of-the-art methods: IT [5], GB [19], FT [13], GC [39], UFO [18], SVO [40], HS [41], PD [42], AMC [43], RCJ [37], DSR [20], DRFI [26], CB [44], RC [6], LR [14], and XL [45]. We use the source code provided by the authors or implement the methods based on the available code or software.
Fig. 10. The comparison of previous methods, our algorithm and ground truth. (a) Test image. (b) IT [5]. (c) GB [19]. (d) GC [39]. (e) CB [44]. (f) UFO [18]. (g) Proposed. (h) Ground truth.
We conduct several quantitative comparisons of some
typical saliency detection methods. Figure 8(a), (b), (d) and (e)
show that the proposed AML is comparable with most
state-of-the-art methods on the MSRA-1000 and MSRA-5000 databases.
Figure 8(c) and (f) show the comparisons of average precision,
recall, F-measure and AUC. We use AUC as an evaluation
criterion, since it represents the area under the PR curve
and can effectively reflect the global properties of different
algorithms. Instead of using bounding boxes to evaluate
saliency detection performance on the MSRA-5000 database,
we adopt the accurate human-labeled masks provided by [26]
to ensure more reliable comparative results. We also perform
experiments on THUS-10000 and Berkeley-300 databases
as shown in Figure 9. The precision-recall curves show that
AML reaches precision rates of 97.4%, 94.0%, 96.5%, and 81.5% on
MSRA-1000, MSRA-5000, THUS-10000, and Berkeley-300
respectively. All of these results demonstrate the effectiveness of our
method.
Figure 10 shows some sample results of five previous
approaches and our AML algorithm. The IT and GB methods
are capable of finding the salient regions in most cases, but
they tend to highlight the boundaries and miss much of the object
information because of the blurriness of their saliency maps. The
GC method cannot capture all the salient pixels and often
mislabels small background patches as salient regions. The
CB and UFO models can highlight the objects uniformly, but
they become ineffective in dealing with challenging scenes. Our
method can capture both small and large salient objects even
in complex environments. In addition, we can highlight the
objects uniformly with accurate boundaries and do not need
to care about the number and locations of the salient objects.
We also test the average computational cost on different
datasets: 18.15s on MSRA-1000, 18.42s on MSRA-5000,
17.90s on THUS-10000 and 18.78s on Berkeley-300. The
proposed algorithm is implemented in MATLAB on a PC
with an Intel i7-3370 CPU (3.4 GHz) and 32 GB of memory.
D. Evaluation of Selected Seeds
We train an effective Specific metric based on the
assumption that the selected seeds are correct. In our experiments, we cannot ensure that the chosen seeds are completely accurate, but we can enforce a very high seeds accuracy.
Fig. 11. Example results of different metrics. The first line is the input images, the second line shows the results generated by fusing the GML with itself, the third line shows the results generated by fusing the SML with itself, the fourth line is obtained by fusing the GML and SML, and the last line is the ground truth images.
The accuracy of the selected seeds is defined as follows:
$$sa = \frac{fs_c + bs_c}{fs_t + bs_t} = \frac{fs_c + bs_c}{(fs_c + fs_{ic}) + (bs_c + bs_{ic})} \qquad (23)$$
where
$$fs_c = \sum_n \sum_i (gt_i^n \,\&\, seed_i^n), \qquad bs_c = \sum_n \sum_i (\overline{gt_i^n} \,\&\, \overline{seed_i^n}) \qquad (24)$$
Here $i$ indexes the $i$th superpixel extracted from the $n$th image of a given database, and $gt_i^n$ and $seed_i^n$ are the ground-truth label and the label assigned by our seeds selection mechanism for superpixel $i$.
accuracy rates of four databases are: 0.9882 on MSRA-1000,
0.9769 on MSRA-5000, 0.9822 on THUS-10000 and 0.8874
on Berkeley-300. We experimentally verify that the seeds are
accurate enough to generate a reliable Specific metric for
each image.
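A small sketch of this accuracy computation, with boolean arrays standing in for the per-superpixel ground-truth and seed labels:

```python
import numpy as np

def seed_accuracy(gt_labels, seed_labels, is_seed):
    """gt_labels, seed_labels: boolean arrays over all superpixels of a dataset
    (True = foreground); is_seed: mask of superpixels actually chosen as seeds."""
    gt, lab = gt_labels[is_seed], seed_labels[is_seed]
    fs_c = np.sum(gt & lab)                # correctly selected foreground seeds
    bs_c = np.sum(~gt & ~lab)              # correctly selected background seeds
    return (fs_c + bs_c) / is_seed.sum()   # Eqn 23: correct seeds / total seeds
```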
V. CONCLUSION
In this paper, we propose two Mahalanobis
distance metric learning models and a superpixel-wise Fisher
Vector representation for visual saliency detection. To our
knowledge, we are the first to apply metric learning to
saliency detection and to construct a metric fusion mechanism
to improve the detection accuracy. Different from previous
methods, we adopt a new feature coding strategy and make
the supervised metric learning more suitable for single image
processing. In addition, we propose an accurate seeds selection
method based on the Mahalanobis distance measure to train
the Specific metric and construct the final saliency map.
We estimate the saliency value of each superpixel from a
multi-scale view and include the contextual information when
computing it. Experimental results against sixteen state-of-the-art
algorithms on four benchmark image databases demonstrate
the effectiveness of our metric learning approach and the saliency
detection model. In the future, we plan to explore more robust
object detection approaches to further improve the accuracy
of saliency detection.
REFERENCES
[1] C. Siagian and L. Itti, “Rapid biologically-inspired scene classification
using features shared with visual attention,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 29, no. 2, pp. 300–312, Feb. 2007.
[2] H. Liu, X. Xie, X. Tang, Z.-W. Li, and W.-Y. Ma, “Effective browsing
of Web image search results,” in Proc. 6th ACM SIGMM Int. Workshop
Multimedia Inf. Retr., 2004, pp. 84–90.
[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, “The JPEG2000 still
image coding system: An overview,” IEEE Trans. Consum. Electron.,
vol. 46, no. 4, pp. 1103–1127, Nov. 2000.
[4] Y. Niu, F. Liu, X. Li, and M. Gleicher, “Warp propagation for video
resizing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010,
pp. 537–544.
[5] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual
attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[6] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu,
“Global contrast based salient region detection,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., Jun. 2011, pp. 409–416.
[7] Y. Xie, H. Lu, and M.-H. Yang, “Bayesian saliency via low and mid
level cues,” IEEE Trans. Image Process., vol. 22, no. 5, pp. 1689–1698,
May 2013.
[8] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection
via graph-based manifold ranking,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., Jun. 2013, pp. 3166–3173.
[9] J. Sun, H. Lu, and X. Liu, “Saliency region detection based on Markov
absorption probabilities,” IEEE Trans. Image Process., vol. 24, no. 5,
pp. 1639–1649, May 2015.
[10] Y.-F. Ma and H.-J. Zhang, “Contrast-based image attention analysis by
using fuzzy growing,” in Proc. 11th ACM Int. Conf. Multimedia, 2003,
pp. 374–381.
[11] J. Sun, H. Lu, and S. Li, “Saliency detection based on integration
of boundary and soft-segmentation,” in Proc. IEEE Int. Conf. Image
Process., Sep./Oct. 2012, pp. 1085–1088.
[12] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007,
pp. 1–8.
[13] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned
salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2009, pp. 1597–1604.
[14] X. Shen and Y. Wu, “A unified approach to salient object detection via
low rank matrix recovery,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2012, pp. 853–860.
[15] T. Liu et al., “Learning to detect a salient object,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[16] J. Yang and M.-H. Yang, “Top-down visual saliency via joint CRF and
dictionary learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Jun. 2012, pp. 2296–2303.
[17] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image
classification with the Fisher vector: Theory and practice,” Int. J.
Comput. Vis., vol. 105, no. 3, pp. 222–245, 2013.
[18] P. Jiang, H. Ling, J. Yu, and J. Peng, “Salient region detection by UFO:
Uniqueness, focusness and objectness,” in Proc. IEEE Int. Conf. Comput.
Vis., Dec. 2013, pp. 1976–1983.
[19] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Proc.
Adv. Neural Inf. Process. Syst., 2006, pp. 545–552.
[20] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection
via dense and sparse reconstruction,” in Proc. IEEE Int. Conf. Comput.
Vis., Dec. 2013, pp. 2976–2983.
[21] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from
robust background detection,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2014, pp. 2814–2821.
[22] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters:
Contrast based filtering for salient region detection,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., Jun. 2012, pp. 733–740.
[23] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk,
“SLIC superpixels compared to state-of-the-art superpixel methods,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282,
Nov. 2012.
[24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of
image windows,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34,
no. 11, pp. 2189–2202, Nov. 2012.
[25] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using
background priors,” in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), 2012,
pp. 29–42.
[26] H. Jiang, Z. Yuan, M.-M. Cheng, Y. Gong, N. Zheng, and J. Wang.
(2014). “Salient object detection: A discriminative regional feature inte-
gration approach.” [Online]. Available: http://arxiv.org/abs/1410.5926
[27] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient
object detection: A discriminative regional feature integration approach,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013,
pp. 2083–2090.
[28] R. Liu, J. Cao, Z. Lin, and S. Shan, “Adaptive partial differential
equation learning for visual saliency detection,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3866–3873.
[29] Q. Chen et al., “Efficient maximum appearance search for large-scale
object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Jun. 2013, pp. 3190–3197.
[30] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon,
“Information-theoretic metric learning,” in Proc. 24th Int. Conf. Mach.
Learn., 2007, pp. 209–216.
[31] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning
for large margin nearest neighbor classification,” in Proc. Adv. Neural
Inf. Process. Syst., 2005, pp. 1473–1480.
[32] K. Q. Weinberger and L. K. Saul, “Fast solvers and efficient
implementations for distance metric learning,” in Proc. 25th Int. Conf.
Mach. Learn., 2008, pp. 1160–1167.
[33] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? Metric
learning approaches for face identification,” in Proc. IEEE 12th Int.
Conf. Comput. Vis., Sep./Oct. 2009, pp. 498–505.
[34] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang. (2013). “A kernel
classification framework for metric learning.” [Online]. Available:
http://arxiv.org/abs/1309.5823
[35] S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal seeds for
diffusion-based salient object detection,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2014, pp. 2790–2797.
[36] H.-C. Huang, Y.-Y. Chuang, and C.-S. Chen, “Affinity aggregation for
spectral clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Jun. 2012, pp. 773–780.
[37] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu,
“Global contrast based salient region detection,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2014.
[38] V. Movahedi and J. H. Elder, “Design and perceptual validation
of performance measures for salient object segmentation,” in Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops,
Jun. 2010, pp. 49–56.
[39] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook,
“Efficient salient region detection with soft image abstraction,” in Proc.
IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1529–1536.
[40] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, “Fusing generic
objectness and visual saliency for salient object detection,” in Proc. IEEE
Int. Conf. Comput. Vis., Nov. 2011, pp. 914–921.
[41] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013,
pp. 1155–1162.
[42] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch
distinct?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2013, pp. 1139–1146.
[43] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, “Saliency detection
via absorbing Markov chain,” in Proc. IEEE Int. Conf. Comput. Vis.,
Dec. 2013, pp. 1665–1672.
[44] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic
salient object segmentation based on context and shape prior,” in Proc.
BMVC, 2011, pp. 110.1–110.12.
[45] Y. Xie and H. Lu, “Visual saliency detection based on Bayesian model,”
in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 645–648.
Shuang Li is currently pursuing the
B.E. degree with the School of Information
and Communication Engineering, Dalian University
of Technology (DUT), China. From 2012 to 2015,
she was a Research Assistant with the Computer
Vision Group, DUT. Her research interests focus
on saliency detection and object recognition.
Huchuan Lu (SM’12) received the M.Sc. degree
in signal and information processing and the
Ph.D. degree in system engineering from the Dalian
University of Technology (DUT), Dalian, China,
in 1998 and 2008, respectively. He joined as a
Faculty Member in 1998, and is currently a Full
Professor with the School of Information and
Communication Engineering, DUT. His current
research interests include the areas of computer
vision and pattern recognition with a focus on visual
tracking, saliency detection, and segmentation.
He is also a member of the Association for Computing Machinery and
an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN AND
CYBERNETICS—PART B.
Zhe Lin (M’10) received the B.Eng. degree in
automatic control from the University of Science
and Technology of China, in 2002, the M.S. degree
in electrical engineering from the Korea Advanced
Institute of Science and Technology, in 2004, and the
Ph.D. degree in electrical and computer engineering
from the University of Maryland, College Park,
in 2009. He has been a Research Intern with
Microsoft Live Labs Research. He is currently a
Senior Research Scientist with Adobe Research,
San Jose, CA. His research interests include deep
learning, object detection and recognition, image classification and tagging,
content-based image and video retrieval, human motion tracking, and activity
analysis.
Xiaohui Shen (M’11) received the B.S. and
M.S. degrees from the Department of Automation,
Tsinghua University, China, and the Ph.D. degree
from the Department of Electrical Engineering
and Computer Sciences, Northwestern University,
in 2013. He is currently a Research Scientist with
Adobe Research, San Jose, CA. He is generally
interested in the research problems in the area of
computer vision, in particular, image retrieval, object
detection, and image understanding.
Brian Price received the Ph.D. degree in computer
science from Brigham Young University under the
advisement of Dr. B. Morse. He has contributed
new features to many Adobe products, such as
Photoshop, Photoshop Elements, and After-Effects,
mostly involving interactive image segmentation and
matting. He is currently a Senior Research Scientist
with Adobe Research, specializing in computer
vision. His research interests include semantic seg-
mentation, interactive object selection and matting,
stereo and RGBD, and broad interest in computer
vision and its intersections with machine learning and computer graphics.
Fig. 1. Comparison between the Euclidean distance space and the Mahalanobis distance space. The Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient.

Later studies address saliency detection from broader views, e.g., convex hull [7], [11] and the frequency domain [12], [13]. In contrast, supervised methods [14]–[16] incorporate high-level prior information to better distinguish the salient regions by learning salient visual information from a large number of images with ground-truth labels. Despite the differences among these methods, they all require the basic ability to compute a difference measure on region features.

To the best of our knowledge, all existing models address saliency detection based on the Euclidean distance. However, the Euclidean distance weights features equally without considering the distribution of the data, and therefore becomes unreliable when detecting objects in complex images. This situation arises frequently in saliency detection, especially when the salient regions and the background are similar, in which case the Euclidean distances between the foreground and the similar background can be smaller than the distances within the foreground itself. Figure 1 illustrates this problem. Given an image, we first select some initial seeds, including foreground and background seeds; the seed selection process is the one described in Section III-C. We then compute the distance between each superpixel and the seeds and plot the distance distributions in Figure 1. We observe that the Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient. This motivates us to train a discriminative distance metric that assigns appropriate weights to features so that objects can be precisely separated from the background.

We use metric learning to compute a more discriminative distance measure.
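To make this contrast concrete, here is a minimal numpy sketch (ours, not from the paper) comparing the two distances; the feature vectors and the matrix M are synthetic placeholders, with M standing in for a learned metric such as the Generic or Specific metric introduced below.

```python
import numpy as np

def euclidean_dist(x, y):
    """Plain Euclidean distance: every feature dimension weighted equally."""
    return np.sqrt(np.sum((x - y) ** 2))

def mahalanobis_dist(x, y, M):
    """Mahalanobis distance under a PSD matrix M: sqrt((x - y)^T M (x - y))."""
    d = x - y
    return np.sqrt(d @ M @ d)

rng = np.random.default_rng(0)
x, y = rng.normal(size=16), rng.normal(size=16)

# A toy PSD metric standing in for a learned metric matrix.
A = rng.normal(size=(16, 16))
M = A @ A.T + 1e-3 * np.eye(16)

print(euclidean_dist(x, y))       # all dimensions weighted equally
print(mahalanobis_dist(x, y, M))  # dimensions re-weighted by the metric
```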
Fig. 2. Comparison between low-level features and our SFV feature. (a) input image. (b) saliency map based on low-level features. (c) saliency map based on SFV. (d) ground truth.

Fig. 3. Pipeline of the adaptive metric learning algorithm. IT [5], GB [19], LR [14], RC [6] are four other saliency methods.

Distance metric learning has been widely adopted for this purpose in different applications, since it takes the covariance information into account when estimating data distributions and significantly improves the performance of learning methods. To our knowledge, we are the first to successfully formulate the saliency detection problem within a metric learning framework, and our method works well on different databases.

We also propose a Superpixel-wise Fisher Vector (SFV) coding approach which maps low-level features, such as RGB and LAB, to a high-dimensional sparse vector. Compared with using low-level features directly, SFV is more discriminative in challenging environments, as shown in Figure 2. We therefore use SFV features to describe each superpixel.

In this paper, we adopt an effective feature coding method and propose a novel metric learning based saliency detection model which incorporates both supervised and semi-supervised information. Our algorithm considers both the global distribution of the whole training set (GML) and the typical structure of a specific image (SML), and fuses the two to extract the clustering characteristics used to estimate the final saliency map. Figure 3 shows the pipeline of our method. First, as an extension of traditional Fisher Vector coding [17], Superpixel-wise Fisher Vector coding is proposed to describe superpixels by learning the parameters of a Gaussian mixture model (Section III-A). Second, we train a Generic metric on the training set (Section III-B1) and apply it to a single image to find the saliency seeds, assisted by the superpixel-wise objectness map generated by [18] (Section III-C). Third, a Specific metric based on kernel classification is learned from the chosen seeds of each image (Section III-B2). Finally, by integrating the Generic and Specific metrics (Section III-D), we obtain the clustering information for each superpixel and use it to generate the final saliency map (Section III-E). The GML and SML maps shown in Figure 3 are intermediate results that are not actually generated when computing saliency maps; they serve as comparisons to demonstrate the effectiveness of the fused results in Section IV-A.

The main contributions of our work include:

• Two metric learning approaches are applied to saliency detection for the first time as the optimal distance measure between two superpixels. GML is learned from the global training set, while SML is learned from image-specific training samples. The two are complementary and achieve promising results after affinity aggregation.

• A superpixel-wise Fisher Vector coding method is put forward which captures image contextual information when representing superpixels and makes supervised learning methods more suitable for single-image processing.

• An accurate seed selection method is presented based on the Mahalanobis distance metric. The selected seeds serve as training samples for the Specific metric learning and as reference nodes when evaluating saliency values.
Experimental results on various image sets show that our method is comparable with most state-of-the-art approaches, and that the proposed metric learning approaches can be extended to other fields as well.

II. RELATED WORK

Saliency detection has seen significant improvement and activity in recent years. Numerous unsupervised approaches have been proposed under different theoretical models. Cheng et al. [6] propose a global region contrast algorithm which simultaneously considers the spatial coherence across regions and the global contrast over the entire image; however, low-level color contrast becomes unreliable in challenging scenes. Li et al. [20] compute dense and sparse reconstruction errors based on background templates extracted from image boundaries, and propose several integration strategies, such as multi-scale reconstruction error and Bayesian integration, which significantly improve detection performance. In [21], boundary connectivity, a robust background measure, is first applied to saliency detection; it characterizes the spatial layout of image regions and comes with a clear geometrical interpretation. Perazzi et al. [22] formulate saliency estimation as contrast computation with high-dimensional Gaussian filters; they modify SLIC [23] and demonstrate the effectiveness of their superpixel segmentation approach for detecting salient objects.

Furthermore, lacking knowledge of the sizes and locations of objects, boundary priors and objectness are often adopted to highlight the salient regions or suppress the background. Jiang et al. [18] construct saliency by integrating three visual cues: uniqueness, focusness and objectness (UFO), where uniqueness represents color contrast, focusness indicates the degree of focus (often appearing as the reverse of blurriness), and objectness, proposed by Alexe et al. [24], is the likelihood that a given image window contains an object. In [25], Wei et al. define the saliency value of each patch as its shortest distance to the image boundary, observing that image boundaries are more likely to be background. However, this assumption is less convincing, especially in challenging scenes.

Compared with unsupervised approaches, supervised methods are relatively rare. In [26] and [27], Jiang et al. propose a multi-scale learning approach which maps a regional feature vector to a saliency score and fuses these scores across multiple levels to generate the final saliency map. They introduce a novel feature vector, integrating regional contrast, regional property and regional backgroundness descriptors, to represent each region, and learn a discriminative random forest regressor to predict regional scores. Shen and Wu [14] treat an image as the combination of sparse noise and a low-rank matrix; they extract low-level features to form high-level priors and then incorporate the priors into a low-rank matrix recovery model to construct the saliency map. However, the saliency assignment near object boundaries is unsatisfying due to the ambiguity of the prior maps. Liu et al. [28] formulate saliency detection as a partial differential equation problem and solve it under an adaptive PDE learning framework; they learn the optimal saliency seeds via discrete submodularity and use the seeds as boundary conditions for solving a linear elliptic system.
Inspired by these works, we construct a metric fusion framework which contains two complementary metric learning approaches and generates robust and accurate saliency maps even in complex scenes. Our method encodes low-level features into a high-dimensional feature space and incorporates multi-scale and objectness information when measuring saliency values. Therefore, our method can uniformly highlight objects with explicit object boundaries.

III. PROPOSED ALGORITHM

In this section, we present an effective and robust adaptive metric learning method for visual saliency detection. The proposed algorithm proceeds through five steps to generate the final saliency map. First, we extract low-level features and encode the superpixels generated by the simple linear iterative clustering (SLIC) [23] algorithm with a Superpixel-wise Fisher Vector representation. Second, two Mahalanobis distance metric learning approaches, Generic metric learning and Specific metric learning, are introduced to learn the optimal distance measure between superpixels. Third, we propose a novel seed selection strategy based on the Mahalanobis distance to generate saliency seeds, which are used both as training samples for the Specific metric and as reference nodes when evaluating saliency values. Fourth, a metric fusion framework is presented to fuse the Generic and Specific metrics. Finally, we obtain smooth saliency maps by combining spectral clustering and multi-scale information.

A. Superpixel-Wise Fisher Vector Coding (SFV)

Appropriate feature coding can effectively extract the main information and remove redundancies, thus greatly improving the performance of saliency detection. The Fisher Vector can be regarded as an extension of the well-known bag-of-words representation, since it captures the first-order and second-order differences between local features and the centers of a mixture of Gaussian distributions. Recently, Chen et al. [29] extended the Fisher Vector to a point-level image representation for object detection. For a different purpose, we propose to further extend FV coding to the superpixel level and experimentally verify the superiority of our Superpixel-wise Fisher Vector coding method.

Given a superpixel $i = \{p_t,\ t = 1, \ldots, T\}$, where $p_t$ is a $d$-dimensional image pixel and $T$ is the number of pixels within $i$, we train a Gaussian mixture model (GMM) $\lambda(p_t) = \sum_{k=1}^{K} \upsilon_k \psi_k(p_t)$ on all pixels of the image using the maximum likelihood (ML) criterion. The parameters of the $K$-component GMM are $\lambda = \{\upsilon_k, \mu_k, \Sigma_k,\ k = 1, \ldots, K\}$, where $\upsilon_k$, $\mu_k$ and $\Sigma_k$ are the mixture weight, mean vector and covariance matrix of Gaussian $k$, respectively. Similar to FV coding, the SFV representation can be written as a $2dK$-dimensional concatenated form:

$$\varphi_i = \{\zeta_{\mu_1}, \zeta_{\sigma_1}, \ldots, \zeta_{\mu_K}, \zeta_{\sigma_K}\} \tag{1}$$
where $\zeta_{\mu_k}$ and $\zeta_{\sigma_k}$ are defined as

$$\zeta_{\mu_k} = \frac{1}{T\sqrt{\upsilon_k}} \sum_{t=1}^{T} \eta_t(k)\, \frac{p_t - \mu_k}{\sigma_k}, \qquad \zeta_{\sigma_k} = \frac{1}{T\sqrt{\upsilon_k}} \sum_{t=1}^{T} \eta_t(k)\, \frac{1}{\sqrt{2}} \left[ \frac{(p_t - \mu_k)^2}{\sigma_k^2} - 1 \right],$$

where $\sigma_k$ is the square root of the diagonal of $\Sigma_k$, and $\eta_t(k)$ is the soft assignment of $p_t$ to Gaussian $k$. The SFV representation $\varphi_i$ is hereby used to describe superpixel $i$ in this paper. It has several advantages:

• As an extension of Fisher Vector coding, SFV realizes a superpixel-level coding representation, making the Fisher Vector more suitable for single-image processing. Instead of averaging the low-level features of the contained pixels, SFV statistically analyzes the internal feature distribution of each superpixel, providing a more accurate and reliable representation. Experiments show that our SFV generates smoother and more uniform saliency maps and improves the precision-recall curve on the MSRA-1000 database by about 2 percent compared with low-level features, as shown in Figure 7.

• SFV can be regarded as an adaptive Fisher Vector coding, since the parameters of the GMM are trained online on a specific image. Identical superpixels in different images therefore have different coding representations, so SFV better captures image contextual information.

• Owing to the small number of superpixels in an image and their disjoint nature, SFV is much faster than existing state-of-the-art FV variants. Beyond saliency detection, SFV can also be applied to other vision tasks, such as image segmentation and content-aware image resizing.
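As a concrete illustration, the following sketch implements the SFV computation of Eqn. (1) with scikit-learn's diagonal-covariance GMM. Variable names and the exact normalization details are our assumptions, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def sfv_encode(pixels, labels, K=4):
    """Superpixel-wise Fisher Vector coding (Section III-A, Eqn. 1).
    pixels: (N, d) low-level features (e.g. RGB/LAB) of one image.
    labels: (N,) superpixel index of each pixel.
    Returns a (num_superpixels, 2*d*K) matrix of SFV descriptors."""
    gmm = GaussianMixture(n_components=K, covariance_type='diag').fit(pixels)
    w, mu, sigma = gmm.weights_, gmm.means_, np.sqrt(gmm.covariances_)
    resp = gmm.predict_proba(pixels)          # soft assignments eta_t(k)
    sfvs = []
    for sp in np.unique(labels):
        p, eta = pixels[labels == sp], resp[labels == sp]
        T = len(p)
        parts = []
        for k in range(K):
            u = (p - mu[k]) / sigma[k]        # normalized first-order term
            zeta_mu = (eta[:, k:k+1] * u).sum(0) / (T * np.sqrt(w[k]))
            zeta_sig = (eta[:, k:k+1] * (u**2 - 1)).sum(0) / (T * np.sqrt(2 * w[k]))
            parts += [zeta_mu, zeta_sig]
        sfvs.append(np.concatenate(parts))    # 2*d*K-dimensional descriptor
    return np.array(sfvs)
```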
B. Adaptive Metric Learning

Learning a discriminative metric can better separate samples of different classes while shortening the distances within the same class. Numerous models have been proposed in the last decade, especially for Mahalanobis distance metric learning, such as information theoretic metric learning (ITML) [30], large margin nearest neighbor (LMNN) [31], [32], and logistic discriminative based metric learning (LDML) [33]. However, most existing metric learning approaches learn a fixed metric for all samples without considering the deeper structure of the data, and therefore break down in the presence of irrelevant or unreliable features. In this paper, we propose an adaptive metric learning approach which considers both the global distribution of the whole training set (GML) and the specific structure of a single image (SML) to better separate objects from the background. Our approach can also be viewed as an integration of a supervised distance metric learning model (GML) and a semi-supervised one (SML). Since GML and SML are complementary, we obtain promising results after fusing them under an affinity aggregation framework (Section III-D).

1) Generic Metric Learning (GML): Metric learning has been widely applied to vision tasks, but never to saliency detection, because its long training time is infeasible for single-image processing. We address this problem by pre-training a Generic metric $M_g$ on the first 500 images of the MSRA-1000 database using gradient descent, and we verify, both experimentally and empirically, that $M_g$ is generally suitable for all images.

First, we construct a training set $\{\varphi_i,\ i = 1, 2, \ldots, M\}$ consisting of superpixels extracted from all training images, where $\varphi_i$ is the SFV representation of superpixel $i$. To find the most discriminative $M_g$, we minimize

$$M_g^* = \arg\min_{M_g} \; \frac{1}{2}\,\alpha \|M_g\|^2 + \sum_n \sum_{\{ij \,|\, \delta_i^n = 1,\ \delta_j^n = 0\}} D(i, j) \tag{2}$$

$$D(i, j) = \exp\{ -(\varphi_i - \varphi_j)^T M_g (\varphi_i - \varphi_j) / \sigma_1^2 \} \tag{3}$$

where $\delta_i^n$ indicates whether the $i$th superpixel in the $n$th image belongs to the foreground or the background, and $D(i, j)$ is the exponential Mahalanobis distance between $i$ and $j$ under the metric $M_g$. We set $\sigma_1 = 0.1$ to control the strength of the distances. Considering that the background is varied and cluttered, and that different object regions are distinctive as well, we only impose restrictions on pairwise distances between positive and negative samples, which is more reliable and reasonable given that salient objects are always distinct from the background. This minimization maximizes the feature distances between foreground and background samples, thereby significantly improving detection performance. Eqn. (2) can be easily solved by gradient descent. The Generic metric incorporates information from all superpixels of the training images, so it is appropriate for most images.
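A minimal gradient-descent sketch of the GML objective in Eqns. (2)-(3) follows. The learning rate, iteration count, and the PSD projection step are our own choices; the paper only states that Eqn. (2) is solved by gradient descent.

```python
import numpy as np

def train_gml(fg, bg, alpha=1.0, sigma1=0.1, lr=0.01, iters=100):
    """Gradient descent on Eqns. (2)-(3):
    (alpha/2)*||Mg||^2 + sum over (fg, bg) pairs of
    exp{-(phi_i - phi_j)^T Mg (phi_i - phi_j) / sigma1^2}.
    fg: (nf, D) foreground SFVs; bg: (nb, D) background SFVs."""
    D = fg.shape[1]
    Mg = np.eye(D)
    # All fg/bg pair differences (subsample pairs in practice for memory).
    diffs = (fg[:, None, :] - bg[None, :, :]).reshape(-1, D)
    for _ in range(iters):
        d2 = np.einsum('nd,de,ne->n', diffs, Mg, diffs)  # squared Mahalanobis distances
        w = np.exp(-d2 / sigma1**2)                      # D(i, j) terms
        grad = alpha * Mg - (diffs * w[:, None]).T @ diffs / sigma1**2
        Mg -= lr * grad
        # Keep Mg symmetric PSD; this projection is our addition,
        # not spelled out in the paper.
        Mg = (Mg + Mg.T) / 2
        evals, evecs = np.linalg.eigh(Mg)
        Mg = (evecs * np.clip(evals, 0, None)) @ evecs.T
    return Mg
```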
2) Specific Metric Learning (SML): Recently, Wang et al. [34] proposed a doublet-SVM metric learning approach based on a kernel classification framework, formulating metric learning as an SVM problem and achieving desirable results with much less training time. However, experiments show that directly applying doublet-SVM to saliency detection cannot ensure good detection accuracy. We therefore modify this approach by adding a constraint $\omega_{(\tau_1, \tau_2)}$, which significantly improves the quality of the final saliency map.

Let $\{\varphi_i,\ i = 1, 2, \ldots, m\}$ be the training set, where $\varphi_i$ is the SFV representation of a labeled superpixel extracted from a specific image. The detailed process of extracting labeled superpixels from an image is discussed in Section III-C. We first divide these samples into foreground seeds and background seeds, labeled 1 and 0 respectively. Given a training sample $\varphi_i$ with label $h_i$, we find its $q_1$ nearest neighbors with the same label and $q_2$ nearest neighbors with different labels, and construct $(q_1 + q_2)$ doublets for it, each consisting of $\varphi_i$ and one of its nearest neighbors. By combining the doublets of all samples, a doublet set $\chi = \{x_1, x_2, \ldots, x_Z\}$ is established, where $x_\tau = (\varphi_{\tau,1}, \varphi_{\tau,2})$, $\tau = 1, 2, \ldots, Z$, and $\varphi_{\tau,1}$ and $\varphi_{\tau,2}$ are the SFVs of superpixels $\tau_1$ and $\tau_2$ in doublet $x_\tau$. We assign $x_\tau$ the label $l_\tau = -1$ if $h_{\tau,1} = h_{\tau,2}$, and $l_\tau = 1$ if $h_{\tau,1} \neq h_{\tau,2}$.

As an extension of the degree-2 polynomial kernel, we define the doublet-level degree-2 polynomial kernel as

$$K_p(x_\tau, x_\iota) = \mathrm{tr}\!\left( \omega_{(\tau_1,\tau_2)} (\varphi_{\tau,1} - \varphi_{\tau,2})(\varphi_{\tau,1} - \varphi_{\tau,2})^T \, \omega_{(\iota_1,\iota_2)} (\varphi_{\iota,1} - \varphi_{\iota,2})(\varphi_{\iota,1} - \varphi_{\iota,2})^T \right) = \omega_{(\tau_1,\tau_2)}\, \omega_{(\iota_1,\iota_2)} \{ (\varphi_{\tau,1} - \varphi_{\tau,2})^T (\varphi_{\iota,1} - \varphi_{\iota,2}) \}^2 \tag{4}$$

where $\omega_{(\tau_1,\tau_2)} = \theta_{(\tau_1,\tau_2)} \cdot O_{(\tau_1,\tau_2)}$ is a weight parameter, with

$$\theta_{(\tau_1,\tau_2)} = 1 - \exp(-\mathrm{dist}_{(\tau_1,\tau_2)} / \sigma_2) \tag{5}$$

$$O_{(\tau_1,\tau_2)} = 1 - \exp\{ -(O_{\tau_1} - O_{\tau_2})^2 / \sigma_2 \} \tag{6}$$

where $\mathrm{dist}_{(\tau_1,\tau_2)}$ is the spatial distance between superpixels $\tau_1$ and $\tau_2$, and $\theta_{(\tau_1,\tau_2)}$ is the corresponding exponential spatial distance. $O_{\tau_1}$ is the objectness score of superpixel $\tau_1$ defined in Eqn. (11), and $O_{(\tau_1,\tau_2)}$ is the superpixel-wise objectness distance between $\tau_1$ and $\tau_2$. We set $\sigma_2 = 0.1$. The weight $\omega_{(\tau_1,\tau_2)}$ provides crucial spatial and prior information about the objects of interest, making it more robust for evaluating the similarity of a pair of superpixels than the feature distance alone.

To determine the similarity of the two samples in a doublet, we further define a kernel decision function

$$E(x) = \mathrm{sgn}\Big\{ \sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta \Big\} \tag{7}$$

where $\alpha_\tau$ is the weight of doublet $x_\tau$ and $\beta$ is a bias parameter. We have

$$\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta = \omega_{(x_1,x_2)} (\varphi_{x,1} - \varphi_{x,2})^T M_s (\varphi_{x,1} - \varphi_{x,2}) + \beta \tag{8}$$

$$M_s = \sum_\tau \alpha_\tau l_\tau \, \omega_{(\tau_1,\tau_2)} (\varphi_{\tau,1} - \varphi_{\tau,2})(\varphi_{\tau,1} - \varphi_{\tau,2})^T \tag{9}$$

For ease of computation, we set $\omega_{(x_1,x_2)} = 1$. The proposed Specific metric $M_s$ can be solved by existing SVM solvers. The Specific metric is trained only on the test image, and it is obtained much faster than with existing metric learning approaches; according to [34], the doublet-SVM is, on average, 2000 times faster than ITML [30]. It is therefore feasible to train a Specific metric for each image to better distinguish its objects from the background.

In summary, we propose two metric learning approaches, GML and SML. The first captures the global distribution of the whole training set, while the second explores the deeper structure of a specific image. GML can be pre-trained offline and is generally suitable for all images, while SML is much faster, since it can be solved by existing SVM solvers. We note that the image-specific metric is not always better than the Generic metric, as it has fewer training samples and less reliable labels. Instead, the two metrics are complementary and can be fused to improve the final detection results.
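Given the doublets, Eqns. (4)-(9) reduce to a standard SVM with a precomputed kernel. The sketch below follows that reduction with scikit-learn; the doublet construction (nearest-neighbor pairing) is assumed to have been done beforehand, and the variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def train_sml(deltas, omegas, labels, C=1.0):
    """Specific metric via the doublet-SVM formulation (Eqns. 4-9).
    deltas: (Z, D) differences phi_{tau,1} - phi_{tau,2}, one per doublet.
    omegas: (Z,) spatial/objectness weights omega_(tau1,tau2) (Eqns. 5-6).
    labels: (Z,) doublet labels, -1 for same-class pairs, +1 otherwise
            (both classes must be present)."""
    G = deltas @ deltas.T                    # inner products (Delta_tau^T Delta_iota)
    K = np.outer(omegas, omegas) * (G ** 2)  # degree-2 doublet kernel, Eqn. (4)
    svm = SVC(C=C, kernel='precomputed').fit(K, labels)
    # Eqn. (9): Ms = sum_tau alpha_tau * l_tau * omega_tau * Delta Delta^T,
    # where dual_coef_ holds alpha_tau * l_tau for the support doublets.
    D = deltas.shape[1]
    Ms = np.zeros((D, D))
    for coef, idx in zip(svm.dual_coef_[0], svm.support_):
        Ms += coef * omegas[idx] * np.outer(deltas[idx], deltas[idx])
    return Ms
```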
C. Iterative Seeds Selection by Mahalanobis Distance (ISMD)

As a preliminary step in saliency detection, saliency seeds directly influence the performance of seed-based methods. Recently, Liu et al. [28] proposed an optimal seed selection strategy via submodularity: by adding a stopping criterion, the submodularity problem can be solved and the optimal seed set obtained. In [35], Lu et al. learn optimal seeds by combining bottom-up saliency maps and mid-level vision cues. Inspired by these works, we propose a compact but efficient iterative seed selection scheme based on Mahalanobis distance assessment (ISMD).

Alexe et al. [24] present an objectness measure of the likelihood that a given image window contains an object. Jiang et al. [18] extend the original objectness to pixel-level objectness $O(p)$ and region-level objectness $O_i$ by defining

$$O(p) = \sum_{w=1}^{W} P(w) \tag{10}$$

$$O_i = \frac{1}{T} \sum_{p \in i} O(p) \tag{11}$$

where $W$ is the number of sampling windows containing pixel $p$, $P(w)$ is the probability score of the $w$th window, and $T$ is the number of pixels within region $i$. We redefine the region-level objectness as superpixel-wise objectness in this paper.

Motivated by the fact that highlights of the superpixel-wise objectness map are more likely to be foreground, a set of initial foreground seeds is constructed from the lightest two percent of regions of the objectness map. Considering that the background is large and scattered, we pick several of the lowest objectness values from each boundary of the superpixel-wise objectness map as initial background seeds.

The intuition is that if superpixel $i$ is a foreground seed, the ratio of its distances to the foreground and background seeds should be small. We formulate the ratio as

$$\Gamma_i = \frac{\sum_{fs} d_{rat}(i, fs)}{\sum_{bs} d_{rat}(i, bs)} \tag{12}$$

where

$$d_{rat}(i, fs) = \phi(i, fs)\, (\varphi_i - \varphi_{fs})\, M_g\, (\varphi_i - \varphi_{fs})^T \tag{13}$$

is the Mahalanobis distance between superpixel $i$ and one of the foreground seeds $fs$ under the Generic metric $M_g$, and $\phi(i, fs) = d(i, fs) \cdot O(i, fs)$ is a weight parameter, where

$$d(i, fs) = \exp(-\mathrm{dist}^2(i, fs) / \sigma_2) \tag{14}$$

is another exponential spatial distance between superpixel $i$ and $fs$. Only when $\Gamma_i \leq \theta_0$ or $\Gamma_i \geq \theta_1$ is superpixel $i$ added to the foreground or background seed set, where $\theta_0$ and $\theta_1$ are two thresholds. With the newly added seeds, we iterate this process $N_1$ times. Since most of the area of an image belongs to the background, the iteration then continues $N_2$ more times to generate additional background seeds, selecting only seeds with $\sum_{bs} d_{rat}(i, bs) \leq \theta_2$, where $\theta_2$ is a threshold.
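A compact sketch of the ISMD loop (Eqn. 12 and the growth rules above) follows. For brevity the spatial/objectness weight $\phi(i, fs)$ is fixed at 1, and the thresholds and iteration counts are placeholders rather than the paper's settings.

```python
import numpy as np

def maha(x, y, M, w=1.0):
    """Weighted Mahalanobis distance of Eqn. (13), with the weight
    phi(i, fs) exposed as w (set to 1 in this sketch)."""
    d = x - y
    return w * (d @ M @ d)

def ismd(phis, Mg, fg, bg, th0=0.2, th1=1.0, th2=1.0, N1=3, N2=2):
    """Iterative seed selection by Mahalanobis distance (Section III-C).
    phis: (r, D) SFVs of all superpixels; fg, bg: initial seed index lists
    taken from the objectness map."""
    fg, bg = set(fg), set(bg)
    for _ in range(N1):
        for i in range(len(phis)):
            if i in fg or i in bg:
                continue
            df = sum(maha(phis[i], phis[f], Mg) for f in fg)
            db = sum(maha(phis[i], phis[b], Mg) for b in bg)
            ratio = df / max(db, 1e-12)            # Gamma_i, Eqn. (12)
            if ratio <= th0:
                fg.add(i)                          # close to foreground seeds
            elif ratio >= th1:
                bg.add(i)                          # close to background seeds
    for _ in range(N2):  # extra rounds that only grow the background set
        for i in range(len(phis)):
            if i in fg or i in bg:
                continue
            if sum(maha(phis[i], phis[b], Mg) for b in bg) <= th2:
                bg.add(i)
    return sorted(fg), sorted(bg)
```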
Fig. 4. Iterative seed selection by Mahalanobis distance. Initial saliency seeds are first selected from the lightest and darkest parts of the superpixel-wise objectness map. By computing the Mahalanobis distance between any superpixel and the chosen seeds, we iteratively grow the foreground and background seed sets.

We then obtain the final seed set, as illustrated in Figure 4. As elaborated in Section III-B2, the Specific metric $M_s$ can be learned from the labeled seeds via doublet-SVM. One might worry that $M_s$ relies too heavily on $M_g$, since the labeled seeds are generated under $M_g$. Fortunately, by learning a generally suitable metric, we achieve a very high seed accuracy (98.82% on the MSRA-1000 database), which means the seed-based Specific metric is reliable enough to measure the distance.

D. Metric Fusion for Extracting Spectral Clustering Characteristics

Appropriately aggregating several affinity matrices can enhance the relevant and useful information while alleviating the irrelevant and unreliable parts. Spectral clustering is an important unsupervised clustering algorithm that transfers the feature representation into a more discriminative indicator space; we call this property the "spectral clustering characteristics". Spectral clustering has been applied to many fields for its effective and outstanding performance. In this section, we merge the metric fusion into a spectral clustering feature extraction process [36] and learn the optimal aggregation weight for each affinity matrix. The fusion strategy significantly improves the results of saliency detection, as shown in Figure 5.

Fig. 5. Evaluation of metrics. (a) input images. (b) Generic metric. (c) Specific metric. (d) fused results. (e) ground truth.

Based on the two metrics learned above, two affinity matrices $\Pi^g$ and $\Pi^s$ are constructed, with the corresponding $ij$th elements

$$\pi_{i,j}^g = \exp\{ -\phi(i, j)\, (\varphi_i - \varphi_j) M_g (\varphi_i - \varphi_j)^T / \sigma_3 \}, \qquad \pi_{i,j}^s = \exp\{ -\phi(i, j)\, (\varphi_i - \varphi_j) M_s (\varphi_i - \varphi_j)^T / \sigma_3 \} \tag{15}$$

where $\sigma_3 = 0.1$. The affinity aggregation strategy seeks the optimal clustering characteristic vector $\Lambda$ of all superpixels in an image and the weight parameter $\vartheta = [\vartheta_g, \vartheta_s]^T$ associated with $\Pi^g$ and $\Pi^s$, so the fusion problem can be written as

$$\min_{\vartheta_g, \vartheta_s,\, \Lambda_1, \ldots, \Lambda_r} \Big\{ \sum_{i,j} \vartheta_g^2\, \pi_{i,j}^g \|\Lambda_i - \Lambda_j\|^2 + \sum_{i,j} \vartheta_s^2\, \pi_{i,j}^s \|\Lambda_i - \Lambda_j\|^2 \Big\} = \min_{\vartheta_g, \vartheta_s,\, \Lambda} \big\{ \vartheta_g^2\, \Lambda^T (H_g - \Pi^g) \Lambda + \vartheta_s^2\, \Lambda^T (H_s - \Pi^s) \Lambda \big\} = \min_{\vartheta_g, \vartheta_s} \big( \beta_g \vartheta_g^2 + \beta_s \vartheta_s^2 \big) \tag{16}$$

where $\Lambda_i$ is the clustering characteristic indicator of superpixel $i$, $r$ is the number of superpixels in an image, $H_g = \mathrm{diag}\{h_{11}, \ldots, h_{rr}\}$ is the degree matrix of $\Pi^g$ with diagonal elements $h_{ii} = \sum_j \pi_{i,j}^g$, and $\beta_g = \Lambda^T (H_g - \Pi^g) \Lambda$.

To solve this problem, we employ two constraints: the normalized weight constraint $\vartheta_g + \vartheta_s = 1$ and the normalized spectral clustering constraint $\Lambda^T H \Lambda = 1$. With $\vartheta$ fixed, the clustering characteristic vector $\Lambda$ can be obtained by standard spectral clustering. With $\Lambda$ given, Eqn. (16) can be formulated as

$$\min_{\vartheta_g, \vartheta_s} \big( \beta_g \vartheta_g^2 + \beta_s \vartheta_s^2 \big) = \min_{\mu_g, \mu_s} \big( \rho_g \mu_g^2 + \rho_s \mu_s^2 \big) \tag{17}$$

$$\text{subject to} \quad \mu_g^2 + \mu_s^2 = 1, \qquad \frac{\mu_g}{\sqrt{\alpha_g}} + \frac{\mu_s}{\sqrt{\alpha_s}} = 1 \tag{18}$$

where $\alpha_g = \Lambda^T H_g \Lambda$, $\rho_g = \beta_g / \alpha_g$ and $\mu_g = \sqrt{\alpha_g}\, \vartheta_g$. This can be easily solved by existing 1D line-search methods. In summary, metric fusion finds the optimal clustering characteristic vector $\Lambda$ and the optimal weight parameter $\vartheta$ via a two-step iterative strategy.
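The two-step alternation can be sketched as follows. We use scipy's generalized symmetric eigensolver for the spectral step and, for the weight step, the closed form under the constraint $\vartheta_g + \vartheta_s = 1$ instead of the paper's 1D line search over Eqns. (17)-(18); that substitution is an implementation choice of ours.

```python
import numpy as np
from scipy.linalg import eigh

def fuse_metrics(Pg, Ps, n_vec=2, iters=3):
    """Two-step affinity aggregation (Section III-D, Eqn. 16).
    Pg, Ps: (r, r) affinity matrices from the Generic/Specific metrics (Eqn. 15).
    Alternates a spectral embedding (weights fixed) with a weight update
    (embedding fixed); returns the embedding and the weights."""
    tg = ts = 0.5
    for _ in range(iters):
        P = tg**2 * Pg + ts**2 * Ps                    # fused affinity
        H = np.diag(P.sum(axis=1))                     # degree matrix
        # Smallest generalized eigenvectors of (H - P, H): the clustering
        # characteristics (here n_vec columns instead of a single Lambda).
        _, V = eigh(H - P, H, subset_by_index=[0, n_vec - 1])
        beta_g = np.trace(V.T @ (np.diag(Pg.sum(1)) - Pg) @ V)
        beta_s = np.trace(V.T @ (np.diag(Ps.sum(1)) - Ps) @ V)
        # Minimize beta_g*tg^2 + beta_s*ts^2 subject to tg + ts = 1.
        tg = beta_s / (beta_g + beta_s + 1e-12)
        ts = 1.0 - tg
    return V, (tg, ts)
```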
Since the affinity matrices incorporate \phi(i, j) in Eqn. (15), convergence is very fast, typically about three iterations per image. We use the indicator representation to compute saliency maps (Section III-E).

E. Context-Based Multi-Scale Saliency Detection

In this section, we propose a context-based multi-scale saliency detection algorithm to compute the saliency map for each image. Lacking knowledge of the sizes of objects, we first generate superpixels at S different scales. Then the K-means algorithm is applied at each scale to segment an image into N clusters via their SFV features. According to the intuition that a superpixel is salient if its cluster neighbors are close to the foreground seeds and far from the background seeds, we define the distance between superpixel i and the saliency seeds at scale s as:

    D^{(s)}_{i,fs} = \sum_{q=1}^{fn^{(s)}} \{ \gamma \|u_i - u_q\| + (1 - \gamma) \sum_{j=1}^{N_c^{(s)}} W_{i,j} \|u_j - u_q\| \}
    D^{(s)}_{i,bs} = \sum_{q=1}^{bn^{(s)}} \{ \gamma \|u_i - u_q\| + (1 - \gamma) \sum_{j=1}^{N_c^{(s)}} W_{i,j} \|u_j - u_q\| \}    (19)

where

    W_{i,j} = Q_1 \exp\{-dist(i, j)/\sigma^2\} * Q_2 \exp\{-(O_i - O_j)^2/\sigma^2\}    (20)

is the weighted distance between superpixel i and its cluster neighbor j, u_i is the clustering characteristic indicator of superpixel i, and fn and bn are the numbers of foreground and background seeds chosen by our ISMD seeds selection approach. Q_1, Q_2 and \gamma are weight parameters, and N_c is the number of cluster neighbors of superpixel i. The saliency value of superpixel i can be formulated as:

    sal(i) = \sum_{s=1}^{S} \nu_s * \frac{\exp(O_i)}{1 + \{1 - \exp(-D^{(s)}_{i,fs}/\sigma_4)\}/D^{(s)}_{i,bs}}
           = \sum_{s=1}^{S} \nu_s * \frac{\exp(O_i) * D^{(s)}_{i,bs}}{D^{(s)}_{i,bs} + 1 - \exp(-D^{(s)}_{i,fs}/\sigma_4)}    (21)

where \nu_s is the weight of scale s and \sigma_4 = 0.1. Considering all the other superpixels belonging to the same cluster, as well as multiple scales, smooths the saliency map effectively and makes our approach more robust in dealing with complicated scenes.
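A compact sketch of Eqs. (19)-(21) follows, assuming precomputed per-scale indicators u, cluster-neighbor weights W, neighbor index lists, seed index lists, and per-superpixel objectness O; a common superpixel indexing across scales and uniform scale weights nu are simplifying assumptions, and sigma4 follows the text.

    import numpy as np

    def seed_distance(u, W, i, neighbors, seeds, gamma=0.5):
        """Eq. (19): context-weighted indicator distance from superpixel i to seeds."""
        d = 0.0
        for q in seeds:
            own = np.linalg.norm(u[i] - u[q])
            ctx = sum(W[i, j] * np.linalg.norm(u[j] - u[q]) for j in neighbors[i])
            d += gamma * own + (1.0 - gamma) * ctx
        return d

    def saliency(u_scales, W_scales, nbr_scales, fg_scales, bg_scales, O,
                 nu=None, sigma4=0.1, gamma=0.5):
        """Eq. (21): multi-scale saliency of every superpixel over S scales."""
        S = len(u_scales)
        nu = nu if nu is not None else [1.0 / S] * S   # assumed uniform weights
        n = len(O)
        sal = np.zeros(n)
        for s in range(S):
            for i in range(n):
                d_fs = seed_distance(u_scales[s], W_scales[s], i,
                                     nbr_scales[s], fg_scales[s], gamma)
                d_bs = seed_distance(u_scales[s], W_scales[s], i,
                                     nbr_scales[s], bg_scales[s], gamma)
                sal[i] += nu[s] * np.exp(O[i]) * d_bs / (
                    d_bs + 1.0 - np.exp(-d_fs / sigma4))
        return sal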
Fig. 6. The distribution of saliency values of ground truth foregrounds and backgrounds. (a) Generic metric on MSRA-1000. (b) Specific metric on MSRA-1000. (c) AML on MSRA-1000. (d) AML on MSRA-5000.

IV. EXPERIMENTS

We evaluate the proposed method on four benchmark datasets. The first is MSRA-1000 [13], a subset of MSRA-5000, which has been widely used in previous works for its accurate human-labelled masks. The second is the MSRA-5000 dataset [15], which includes 5000 more comprehensive images. The third is THUS-10000 [37], which consists of 10000 images, each containing an unambiguous salient object with pixel-wise ground truth labeling. The last is Berkeley-300 [38], which contains more challenging scenes with multiple objects of different sizes and locations. Since we have already used the first 500 images of MSRA-1000 for training, we evaluate our algorithm and compare it with other methods on the remaining 500 images of MSRA-1000, the 4500 images of MSRA-5000 that exclude the 500 training images (MSRA-5000 contains all the images of MSRA-1000), the 9501 images of THUS-10000 (THUS-10000 contains 499 of the training images), and Berkeley-300.

Fig. 7. (a) Precision-recall curves for the Generic metric, the Specific metric, and fused results without neighbor smoothness (MSRA-1000 and Berkeley-300); precision-recall curves based on SFV and low-level features; precision-recall curves for two other fusion methods. (b) Images of fused results based on SFV and low-level features.

A. Evaluation of Metrics

We perform several comparative experiments, as shown in Figure 5, Figure 6 and Figure 7(a), to demonstrate the efficiency of the Generic metric (GML), the Specific metric (SML), and their combination (AML based on SFV). In order to eliminate the influence of neighbor smoothness (Eqn. (19)) when comparing metrics, we compute only the distance between each superpixel and the seeds, instead of the sum of weighted distances over its cluster neighbors:

    D^{(s)}_{i,fs} = \sum_{q=1}^{fn^{(s)}} \|u_i - u_q\|,    D^{(s)}_{i,bs} = \sum_{q=1}^{bn^{(s)}} \|u_i - u_q\|    (22)

The precision-recall curves of the Generic metric and the Specific metric are almost the same, but their combination outperforms both of them.
We also try adding or multiplying the saliency maps generated by these two metrics directly, but the resulting PR curves are much lower than those of our fusion approach in Figure 7(a). This is consistent with our motivation: M_g is trained from the whole training dataset and captures the global distribution of the data, while M_s aims at a single image and considers the specific structure of its samples.
Fig. 8. Results of different methods. (a), (b) Precision-recall curves on MSRA-1000. (c) Average precisions, recalls, F-measures and AUC on MSRA-1000. (d), (e) Precision-recall curves on MSRA-5000. (f) Average precisions, recalls, F-measures and AUC on MSRA-5000.

Fig. 9. Results of different methods. (a), (b) Precision-recall curves on THUS-10000. (c) Average precisions, recalls, F-measures and AUC on THUS-10000. (d), (e) Precision-recall curves on Berkeley-300. (f) Average precisions, recalls, F-measures and AUC on Berkeley-300.

Figure 5 demonstrates that the fused results significantly suppress the light saliency values in the background regions produced by GML and SML. Since most parts of computing the saliency maps under different metrics are the same, e.g., the objectness prior map, seeds selection, etc., it is reasonable that Figure 5 (b) and (c) are similar, but there are still differences between them. To further prove this, we conduct an extra experiment, as shown in Figure 11. The second line shows the results generated by fusing the GML with itself, the third line shows the results generated by fusing the SML with itself, and the fourth line is obtained by fusing the GML and SML. We refer to them as GG, SS, and AML, respectively. Limited by the image resolution, some differences between the GML and SML may not be visible in Figure 5, but fusing a metric with itself apparently enlarges their distinctiveness. Furthermore, if one metric is incorrect, the other can compensate for it. The SS performs better than the GG in Figure 11 (a)-(e), while the GG is better in (f)-(g), and the AML tends to take the best of both, which demonstrates that the GML and SML are indeed complementary to each other and improve the performance of saliency detection after fusion. Figure 11 (k)-(m) show that if both the GML and SML obtain bad results, the fused results are still bad.

In addition, we plot the distribution of saliency values in Figure 6. Ground truth masks provide a specific label, 1 or 0, for each pixel, and we regard a superpixel as foreground when more than 80% of its pixels are labelled 1; otherwise, the superpixel is background. We put all the foreground superpixels from the whole dataset together and plot the distribution of their saliency values computed by different saliency methods as the red line. The blue line is the distribution of the saliency values of the background superpixels. Figure 6(a), (b) and (c) show the saliency distributions produced by GML, SML and AML on MSRA-1000, respectively. Figure 6(d) is AML on MSRA-5000. This shows that AML is better than GML and SML, since its background saliency values are closer to 0.

Furthermore, our Generic metric is robust across different databases. We apply the metric trained on MSRA-1000 to all the databases, including MSRA-1000, MSRA-5000, THUS-10000, and Berkeley-300. As shown in Figure 8 and Figure 9, the results are still promising even on different databases, which demonstrates the effectiveness and adaptiveness of our Generic metric. Overall, the fused results based on two outstanding and complementary metrics achieve higher precision and recall values and generate more accurate saliency maps.
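For reference, the precision-recall curves used throughout this evaluation follow the standard protocol: a saliency map is thresholded at every level, precision and recall are computed against the binary ground truth mask, and AUC is the area under the resulting curve. A minimal sketch, not the authors' evaluation code, assuming saliency maps normalized to [0, 255]:

    import numpy as np

    def pr_curve(sal_map, gt_mask, n_thresholds=256):
        """Precision/recall of a saliency map against a binary ground truth mask."""
        gt = gt_mask.astype(bool)
        precisions, recalls = [], []
        for t in range(n_thresholds):
            pred = sal_map >= t
            tp = np.logical_and(pred, gt).sum()
            precisions.append(tp / max(pred.sum(), 1))
            recalls.append(tp / max(gt.sum(), 1))
        return np.array(precisions), np.array(recalls)

    def auc_pr(precisions, recalls):
        """Area under the PR curve via the trapezoidal rule."""
        order = np.argsort(recalls)
        return float(np.trapz(precisions[order], recalls[order]))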
B. Evaluation of Superpixel-Wise Fisher Vector

We have mentioned that our Superpixel-wise Fisher Vector coding approach can improve the performance of saliency detection by capturing the average first-order and second-order differences between local features and the centers of a mixture of Gaussian distributions. In the experiments, we extract the low-level features RGB and LAB to learn a 12-D SFV representation for each superpixel (D = 6, K = 1, 2DK = 12). Figure 7(a) shows the efficiency of our SFV coding approach by comparing the precision-recall curves of low-level features and the SFV on the MSRA-1000 database. Figure 7(b) shows the corresponding images.
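As an illustration of the coding step, the sketch below computes a 12-D Fisher vector for one superpixel from its pixels' 6-D RGB+LAB features under a single Gaussian (K = 1), following the standard first- and second-order FV statistics [17]; the normalization details of the actual SFV may differ, so treat this as an assumption-laden sketch.

    import numpy as np

    def superpixel_fisher_vector(feats, mu, sigma, w=1.0):
        """First- and second-order Fisher vector statistics for one superpixel.

        feats: (n_pixels, D) local features (e.g., D = 6 for RGB + LAB),
        mu, sigma: (D,) mean and standard deviation of the single Gaussian (K = 1),
        w: mixture weight (1.0 when K = 1). Returns a 2*D = 12 dimensional vector.
        """
        n, D = feats.shape
        z = (feats - mu) / sigma                       # standardized residuals
        g_mu = z.mean(axis=0) / np.sqrt(w)             # first-order statistic
        g_sigma = (z ** 2 - 1.0).mean(axis=0) / np.sqrt(2.0 * w)  # second-order
        fv = np.concatenate([g_mu, g_sigma])
        fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
        return fv / max(np.linalg.norm(fv), 1e-12)     # L2 normalization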
Fig. 10. The comparison of previous methods, our algorithm and ground truth. (a) Test image. (b) IT [5]. (c) GB [19]. (d) GC [39]. (e) CB [44]. (f) UFO [18]. (g) Proposed. (h) Ground truth.

C. Evaluation of Saliency Maps

We compare the proposed saliency detection model with several state-of-the-art methods: IT [5], GB [19], FT [13], GC [39], UFO [18], SVO [40], HS [41], PD [42], AMC [43], RCJ [37], DSR [20], DRFI [26], CB [44], RC [6], LR [14] and XL [45]. We use the source codes provided by the authors or implement the methods based on the available codes or software.

We conduct several quantitative comparisons of some typical saliency detection methods. Figure 8(a), (b), (d) and (e) show that the proposed AML is comparable with most of the state-of-the-art methods on the MSRA-1000 and MSRA-5000 databases. Figure 8(c) and (f) compare the average precision, recall, F-measure and AUC. We use AUC as an evaluation criterion, since it represents the area under the PR curve and effectively reflects the global properties of different algorithms. Instead of using bounding boxes to evaluate the saliency detection performance on the MSRA-5000 database, we adopt the accurate human-labeled masks provided by [26] to ensure more reliable comparative results. We also perform experiments on the THUS-10000 and Berkeley-300 databases, as shown in Figure 9. The precision-recall curves show that AML reaches precision rates of 97.4%, 94.0%, 96.5% and 81.5% on MSRA-1000, MSRA-5000, THUS-10000 and Berkeley-300, respectively. All of these results demonstrate the efficiency of our method.

Figure 10 shows some sample results of five previous approaches and our AML algorithm. The IT and GB methods are capable of finding the salient regions in most cases, but they tend to highlight the boundaries and miss much of the object information because of the blurriness of their saliency maps. The GC method cannot cover all the salient pixels and often mislabels small background patches as salient regions. The CB and UFO models can highlight the objects uniformly, but they fail in challenging scenes. Our method can capture both small and large salient objects even in complex environments. In addition, it highlights the objects uniformly with accurate boundaries and does not depend on the number or locations of the salient objects.

We also test the average computational cost on the different datasets: 18.15s on MSRA-1000, 18.42s on MSRA-5000, 17.90s on THUS-10000 and 18.78s on Berkeley-300. The proposed algorithm is implemented in MATLAB on a PC with an Intel i7-3370 CPU (3.4 GHz) and 32 GB of memory.

D. Evaluation of Selected Seeds

We train an effective Specific metric based on the assumption that the selected seeds are correct.
Fig. 11. Example results of different metrics. The first line is the input images, the second line is the results generated by fusing the GML with itself, the third line is the results generated by fusing the SML with itself, the fourth line is obtained by fusing the GML and SML, and the last line is the ground truth images.

In the experiments, we cannot ensure that the chosen seeds are completely accurate, but we can enforce a very high seed accuracy. The accuracy of the selected seeds is defined as follows:

    sa = \frac{fs_c + bs_c}{fs_t + bs_t} = \frac{fs_c + bs_c}{(fs_c + fs_{ic}) + (bs_c + bs_{ic})}    (23)

where

    fs_c = \sum_n \sum_i (gt^n_i \,\&\, seed^n_i),    bs_c = \sum_n \sum_i (\overline{gt^n_i} \,\&\, \overline{seed^n_i})    (24)

Here i indexes the i-th superpixel extracted from the n-th image of a given database, and gt^n_i and seed^n_i are its ground truth label and the label assigned by our seeds selection mechanism; fs_c and bs_c count the correctly selected foreground and background seeds, while fs_{ic} and bs_{ic} count the incorrectly selected ones. The accuracy rates on the four databases are: 0.9882 on MSRA-1000, 0.9769 on MSRA-5000, 0.9822 on THUS-10000 and 0.8874 on Berkeley-300. We experimentally verify that the seeds are accurate enough to generate a reliable Specific metric for each image.
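A small sketch of Eqs. (23)-(24), assuming per-image boolean arrays gt (ground truth foreground per superpixel) and seed_fg/seed_bg (labels produced by the seed selection), collected over a database:

    import numpy as np

    def seed_accuracy(gt_list, seed_fg_list, seed_bg_list):
        """Eq. (23): fraction of selected seeds whose label matches the ground truth.

        gt_list: per-image boolean arrays, True where a superpixel is foreground.
        seed_fg_list / seed_bg_list: boolean arrays marking selected fg/bg seeds.
        """
        correct, total = 0, 0
        for gt, fg, bg in zip(gt_list, seed_fg_list, seed_bg_list):
            correct += np.logical_and(gt, fg).sum()        # fs_c
            correct += np.logical_and(~gt, bg).sum()       # bs_c
            total += fg.sum() + bg.sum()                   # fs_t + bs_t
        return correct / max(total, 1)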
V. CONCLUSION

In this paper, we explicitly propose two Mahalanobis distance metric learning models and a superpixel-wise Fisher vector representation for visual saliency detection. To our knowledge, we are the first to apply metric learning to saliency detection and to employ a metric fusion mechanism to improve the detection accuracy. Different from previous methods, we adopt a new feature coding strategy and make the supervised metric learning more suitable for single image processing. In addition, we propose an accurate seeds selection method based on the Mahalanobis distance measure to train the Specific metric and construct the final saliency map. We estimate the saliency value of each superpixel from a multi-scale view and include the contextual information when computing it. Experimental results against sixteen state-of-the-art algorithms on four benchmark image databases demonstrate the efficiency of our metric learning approach and the saliency detection model. In the future, we plan to explore more robust object detection approaches to further improve the accuracy of saliency detection.

REFERENCES

[1] C. Siagian and L. Itti, "Rapid biologically-inspired scene classification using features shared with visual attention," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 300-312, Feb. 2007.
[2] H. Liu, X. Xie, X. Tang, Z.-W. Li, and W.-Y. Ma, "Effective browsing of Web image search results," in Proc. 6th ACM SIGMM Int. Workshop Multimedia Inf. Retr., 2004, pp. 84-90.
[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An overview," IEEE Trans. Consum. Electron., vol. 46, no. 4, pp. 1103-1127, Nov. 2000.
[4] Y. Niu, F. Liu, X. Li, and M. Gleicher, "Warp propagation for video resizing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 537-544.
[5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[6] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 409-416.
[7] Y. Xie, H. Lu, and M.-H. Yang, "Bayesian saliency via low and mid level cues," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1689-1698, May 2013.
[8] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3166-3173.
[9] J. Sun, H. Lu, and X. Liu, "Saliency region detection based on Markov absorption probabilities," IEEE Trans. Image Process., vol. 24, no. 5, pp. 1639-1649, May 2015.
[10] Y.-F. Ma and H.-J. Zhang, "Contrast-based image attention analysis by using fuzzy growing," in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 374-381.
[11] J. Sun, H. Lu, and S. Li, "Saliency detection based on integration of boundary and soft-segmentation," in Proc. IEEE Int. Conf. Image Process., Sep./Oct. 2012, pp. 1085-1088.
[12] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1-8.
[13] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1597-1604.
[14] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 853-860.
[15] T. Liu et al., "Learning to detect a salient object," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353-367, Feb. 2011.
[16] J. Yang and M.-H. Yang, "Top-down visual saliency via joint CRF and dictionary learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2296-2303.
[17] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," Int. J. Comput. Vis., vol. 105, no. 3, pp. 222-245, 2013.
[18] P. Jiang, H. Ling, J. Yu, and J. Peng, "Salient region detection by UFO: Uniqueness, focusness and objectness," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1976-1983.
[19] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 545-552.
[20] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency detection via dense and sparse reconstruction," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2976-2983.
[21] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2814-2821.
[22] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 733-740.
[23] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274-2282, Nov. 2012.
[24] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189-2202, Nov. 2012.
[25] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), 2012, pp. 29-42.
[26] H. Jiang, Z. Yuan, M.-M. Cheng, Y. Gong, N. Zheng, and J. Wang, "Salient object detection: A discriminative regional feature integration approach," 2014. [Online]. Available: http://arxiv.org/abs/1410.5926
[27] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2083-2090.
[28] R. Liu, J. Cao, Z. Lin, and S. Shan, "Adaptive partial differential equation learning for visual saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3866-3873.
[29] Q. Chen et al., "Efficient maximum appearance search for large-scale object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3190-3197.
[30] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209-216.
[31] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 1473-1480.
[32] K. Q. Weinberger and L. K. Saul, "Fast solvers and efficient implementations for distance metric learning," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1160-1167.
[33] M. Guillaumin, J. Verbeek, and C. Schmid, "Is that you? Metric learning approaches for face identification," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 498-505.
[34] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang, "A kernel classification framework for metric learning," 2013. [Online]. Available: http://arxiv.org/abs/1309.5823
[35] S. Lu, V. Mahadevan, and N. Vasconcelos, "Learning optimal seeds for diffusion-based salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2790-2797.
[36] H.-C. Huang, Y.-Y. Chuang, and C.-S. Chen, "Affinity aggregation for spectral clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 773-780.
[37] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569-582, Mar. 2015.
[38] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 49-56.
[39] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, "Efficient salient region detection with soft image abstraction," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1529-1536.
[40] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, "Fusing generic objectness and visual saliency for salient object detection," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 914-921.
[41] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1155-1162.
[42] R. Margolin, A. Tal, and L. Zelnik-Manor, "What makes a patch distinct?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1139-1146.
[43] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, "Saliency detection via absorbing Markov chain," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1665-1672.
[44] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, "Automatic salient object segmentation based on context and shape prior," in Proc. BMVC, 2011, pp. 110.1-110.12.
[45] Y. Xie and H. Lu, "Visual saliency detection based on Bayesian model," in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 645-648.

Shuang Li is currently pursuing the B.E. degree with the School of Information and Communication Engineering, Dalian University of Technology (DUT), China. From 2012 to 2015, she was a Research Assistant with the Computer Vision Group, DUT. Her research interests focus on saliency detection and object recognition.

Huchuan Lu (SM'12) received the M.Sc. degree in signal and information processing and the Ph.D. degree in system engineering from the Dalian University of Technology (DUT), Dalian, China, in 1998 and 2008, respectively. He joined DUT as a Faculty Member in 1998 and is currently a Full Professor with the School of Information and Communication Engineering. His current research interests include computer vision and pattern recognition, with a focus on visual tracking, saliency detection, and segmentation. He is a member of the Association for Computing Machinery and an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B.

Zhe Lin (M'10) received the B.Eng. degree in automatic control from the University of Science and Technology of China in 2002, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology in 2004, and the Ph.D. degree in electrical and computer engineering from the University of Maryland, College Park, in 2009. He was a Research Intern with Microsoft Live Labs Research. He is currently a Senior Research Scientist with Adobe Research, San Jose, CA. His research interests include deep learning, object detection and recognition, image classification and tagging, content-based image and video retrieval, human motion tracking, and activity analysis.
Xiaohui Shen (M'11) received the B.S. and M.S. degrees from the Department of Automation, Tsinghua University, China, and the Ph.D. degree from the Department of Electrical Engineering and Computer Science, Northwestern University, in 2013. He is currently a Research Scientist with Adobe Research, San Jose, CA. He is generally interested in research problems in computer vision, in particular image retrieval, object detection, and image understanding.

Brian Price received the Ph.D. degree in computer science from Brigham Young University under the advisement of Dr. B. Morse. He has contributed new features to many Adobe products, such as Photoshop, Photoshop Elements, and After Effects, mostly involving interactive image segmentation and matting. He is currently a Senior Research Scientist with Adobe Research, specializing in computer vision. His research interests include semantic segmentation, interactive object selection and matting, stereo and RGBD, and broad interests in computer vision and its intersections with machine learning and computer graphics.