1. Enhanced Max Margin Learning on Multimodal Data Mining
in a Multimedia Database
Zhen Guo, Zhongfei (Mark) Zhang Eric P. Xing, Christos Faloutsos
Computer Science Department School of Computer Science
SUNY Binghamton Carnegie Mellon University
Binghamton, NY, 13902 Pittsburgh, PA, 15213
{zguo,zhongfei}@cs.binghamton.edu {epxing,christos}@cs.cmu.edu
ABSTRACT General Terms
The problem of multimodal data mining in a multimedia Algorithms, experimentation
database can be addressed as a structured prediction prob-
lem where we learn the mapping from an input to the struc- Keywords
tured and interdependent output variables. In this paper,
built upon the existing literature on the max margin based Multimodal data mining, image annotation, image retrieval,
learning, we develop a new max margin learning approach max margin
called Enhanced Max Margin Learning (EMML) framework.
In addition, we apply EMML framework to developing an 1. INTRODUCTION
eﬀective and eﬃcient solution to the multimodal data min-
Multimodal data mining in a multimedia database is a
ing problem in a multimedia database. The main contri-
challenging topic in data mining research. Multimedia data
butions include: (1) we have developed a new max mar-
may consist of data in diﬀerent modalities, such as digital
gin learning approach — the enhanced max margin learning
images, audio, video, and text data. In this context, a multi-
framework that is much more eﬃcient in learning with a
media database refers to a data collection in which there are
much faster convergence rate, which is veriﬁed in empirical
multiple modalities of data such as text and imagery. In this
evaluations; (2) we have applied this EMML approach to
database system, the data in diﬀerent modalities are related
developing an eﬀective and eﬃcient solution to the multi-
to each other. For example, the text data are related to im-
modal data mining problem that is highly scalable in the
ages as their annotation data. By multimodal data mining
sense that the query response time is independent of the
in a multimedia database it is meant that the knowledge
database scale, allowing facilitating a multimodal data min-
discovery to the multimedia database is initiated by a query
ing querying to a very large scale multimedia database, and
that may also consist of multiple modalities of data such as
excelling many existing multimodal data mining methods in
text and imagery. In this paper, we focus on a multimedia
the literature that do not scale up at all; this advantage is
database as an image database in which each image has a
also supported through the complexity analysis as well as
few textual words given as annotation. We then address
empirical evaluations against a state-of-the-art multimodal
the problem of multimodal data mining in such an image
data mining method from the literature. While EMML is
database as the problem of retrieving similar data and/or
a general framework, for the evaluation purpose, we apply
inferencing new patterns to a multimodal query from the
it to the Berkeley Drosophila embryo image database, and
database.
report the performance comparison with a state-of-the-art
Speciﬁcally, in the context of this paper, multimodal data
multimodal data mining method.
mining refers to two aspects of activities. The ﬁrst is the
multimodal retrieval. This is the scenario where a mul-
Categories and Subject Descriptors timodal query consisting of either textual words alone, or
H.2.8 [Database Management]: Database Applications— imagery alone, or in any combination is entered and an ex-
Data mining,Image databases; H.3.3 [Information Stor- pected retrieved data modality is speciﬁed that can also be
age and Retrieval]: Information Search and Retrieval— text alone, or imagery alone, or in any combination; the re-
Retrieval models; I.5.1 [Pattern Recognition]: Models— trieved data based on a pre-deﬁned similarity criterion are
Structural ; J.3 [Computer Applications]: Life and Med- returned back to the user. The second is the multimodal in-
ical Sciences—Biology and genetics ferencing. While the retrieval based multimodal data mining
has its standard deﬁnition in terms of the semantic similarity
between the query and the retrieved data from the database,
the inferencing based mining depends on the speciﬁc applica-
Permission to make digital or hard copies of all or part of this work for tions. In this paper, we focus on the application of the fruit
personal or classroom use is granted without fee provided that copies are ﬂy image database mining. Consequently, the inferencing
not made or distributed for proﬁt or commercial advantage and that copies based multimodal data mining may include many diﬀerent
bear this notice and the full citation on the ﬁrst page. To copy otherwise, to scenarios. A typical scenario is the across-stage multimodal
republish, to post on servers or to redistribute to lists, requires prior speciﬁc inferencing. There are many interesting questions a biologist
permission and/or a fee.
KDD’07, August 12–15, 2007, San Jose, California, USA. may want to ask in the fruit ﬂy research given such a mul-
Copyright 2007 ACM 978-1-59593-609-7/07/0008 ...$5.00. timodal mining capability. For example, given an embryo
2. image in stage 5, what is the corresponding image in stage multimodal approaches.
7 for an image-to-image three-stage inferencing? What is The learning with structured output variables covers many
the corresponding annotation for this image in stage 7 for natural learning tasks including named entity recognition,
an image-to-word three-stage inferencing? The multimodal natural language parsing, and label sequence learning. There
mining technique we have developed in this paper also ad- have been many studies on the structured model which in-
dresses this type of across-stage inferencing capability, in clude conditional random ﬁelds [14], maximum entropy model
addition to the multimodal retrieval capability. [15], graph model [8], semi-supervised learning [6] and max
In the image retrieval research area, one of the notorious margin approaches [13, 21, 20, 2]. The challenge of learning
bottlenecks is the semantic gap [18]. Recently, it is reported with structured output variables is that the number of the
that this bottleneck may be reduced by the multimodal data structures is exponential in terms of the size of the struc-
mining approaches [3, 11] which take advantage of the fact ture output space. Thus, the problem is intractable if we
that in many applications image data typically co-exist with treat each structure as a separate class. Consequently, the
other modalities of information such as text. The synergy multiclass approach is not well ﬁtted into the learning with
between diﬀerent modalities may be exploited to capture the structured output variables.
high level conceptual relationships. As an eﬀective approach to this problem, the max margin
To exploit the synergy among the multimodal data, the principle has received substantial attention since it was used
relationships among these diﬀerent modalities need to be in the support vector machine (SVM) [22]. In addition, the
learned. For an image database, we need to learn the rela- perceptron algorithm is also used to explore the max margin
tionship between images and text. The learned relationship classiﬁcation [12]. Taskar et al. [19] reduce the number of the
between images and text can then be further used in mul- constraints by considering the dual of the loss-augmented
timodal data mining. Without loss of generality, we start problem. However, the number of the constraints in their
with a special case of the multimodal data mining problem approach is still large for a large structured output space
— image annotation, where the input is an image query and and a large training set.
the expected output is the annotation words. We show later For learning with structured output variables, Tsochan-
that this approach is also valid to the general multimodal taridis et al. [21] propose a cutting plane algorithm which
data mining problem. The image annotation problem can ﬁnds a small set of active constraints. One issue of this al-
be formulated as a structured prediction problem where the gorithm is that it needs to compute the most violated con-
input (image) x and the output (annotation) y are struc- straint which would involve another optimization problem in
tures. An image can be partitioned into blocks which form the output space. In EMML, instead of selecting the most
a structure. The word space can be denoted by a vector violated constraint, we arbitrarily select a constraint which
where each entry represents a word. Under this setting, the violates the optimality condition of the optimization prob-
learning task is therefore formulated as ﬁnding a function lem. Thus, the selection of the constraints does not involve
f : X × Y → R such that any optimization problem. Osuna et al. [16] propose the de-
composition algorithm for the support vector machine. In
ˆ
y = arg max f (x, y) (1) EMML, we generalize their idea to the scenario of learning
y∈Y
with structured output variables.
is the desired output for any input x.
In this paper, built upon the existing literature on the
max margin learning, we propose a new max margin learn- 3. HIGHLIGHTS OF THIS WORK
ing approach on the structured output space to learn the This work is based on the existing literature on max mar-
above function. Like the existing max margin learning meth- gin learning, and aims at solving for the problem of multi-
ods, the image annotation problem may be formulated as a modal data mining in a multimedia database deﬁned in this
quadratic programming (QP) problem. The relationship be- paper. In comparison with the existing literature, the main
tween images and text is discovered once this QP problem contributions of this work include: (1) we have developed
is solved. Unlike the existing max margin learning meth- a new max margin learning approach — the enhanced max
ods, the new max margin learning method is much more margin learning framework that is much more eﬃcient in
eﬃcient with a much faster convergence rate. Consequently, learning with a much faster convergence rate, which is veri-
we call this new max margin learning approach as Enhanced ﬁed in empirical evaluations; (2) we have applied this EMML
Max Margin Learning (EMML). We further apply EMML approach to developing an eﬀective and eﬃcient solution to
to solving the multimodal data mining problem eﬀectively the multimodal data mining problem that is highly scalable
and eﬃciently. in the sense that the query response time is independent of
Note that the proposed approach is general that can be the database scale, allowing facilitating a multimodal data
applied to any structured prediction problems. For the eval- mining querying to a very large scale multimedia database,
uation purpose, we apply this approach to the Berkeley and excelling many existing multimodal data mining meth-
Drosophila embryo image database. Extensive empirical ods in the literature that do not scale up at all; this advan-
evaluations against a state-of-the-art method on this database tage is also supported through the complexity analysis as
are reported. well as empirical evaluations against a state-of-the-art mul-
timodal data mining method from the literature.
2. RELATED WORK
4. LEARNING IN THE STRUCTURED OUT-
Multimodal approaches have recently received the sub-
stantial attention since Barnard and Duygulu et al. started PUT SPACE
their pioneering work on image annotation [3, 10]. Recently Assume that the image database consists of a set of in-
there have been many studies [4, 17, 11, 7, 9, 23] on the stances S = {(Ii , Wi )}L , where each instance consists of
i=1
3. can be expressed as a linear combination of each component
¯ ¯
in the joint feature representation: f (x, y) = α, Φ(x, y) .
Then the learning task is to ﬁnd the optimal weight vector
α such that the prediction error is minimized for all the
training instances. That is
¯
arg max f (xi , y) ≈ yi , i = 1, · · · , n
y ∈Y (i)
¯
where Yi = {¯ | W yj = j=1 yij }. We use Φi (¯ ) to de-
y j=1 ¯
W
y
¯
note Φ(xi , y). To make the prediction to be the true output
yi , we must follow
α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ),
y ∀¯ ∈ Yi {yi }
y
Figure 1: An illustration of the image partitioning where Yi {y i } denotes the removal of the element yi from
and the structured output word space the set Yi . In order to accommodate the prediction error
on the training examples, we introduce the slack variable ξi .
The above constraint then becomes
an image object Ii and the corresponding annotation word
α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ) − ξi ,
y ξi ≥ 0 ∀¯ ∈ Yi {y i }
y
set Wi . First we partition an image into a set of blocks.
Thus, an image can be represented by a set of sub-images. We measure the prediction error on the training instances
The feature vector in the feature space for each block can by the loss function which is the distance between the true
be computed from the selected feature representation. Con- ¯
output yi and the prediction y . The loss function measures
sequently, an image is represented as a set of feature vectors the goodness of the learning model. The standard zero-one
in the feature space. A clustering algorithm is then applied classiﬁcation loss is not suitable for the structured output
to the whole feature space to group similar feature vectors space. We deﬁne the loss function l (¯ , yi ) as the number
y
together. The centroid of a cluster represents a visual rep- of the diﬀerent entries in these two vectors. We include the
resentative (we refer it to VRep in this paper) in the image loss function in the constraints as is proposed by Taskar et
space. In Figure 1, there are two VReps, water and duck al. [19]
in the water. The corresponding annotation word set can
be easily obtained for each VRep. Consequently, the image α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ) + l (¯ , yi ) − ξi
y y
database becomes the VRep-word pairs S = {(xi , yi )}n , i=1 We interpret 1
α⊤ [Φi (yi )−Φi (¯ )]
y as the margin of yi over
α
where n is the number of the clusters, xi is a VRep ob- (i)
¯
another y ∈ Y . We then rewrite the above constraint as
ject and yi is the word annotation set corresponding to this 1
α⊤ [Φi (yi ) − Φi (¯ )] ≥ α [l (¯ , yi ) − ξi ]. Thus, minimiz-
y 1
y
VRep object. Another simple method to obtain the VRep- α
word pairs is that we randomly select some images from the ing α maximizes such margin.
image database and each image is viewed as a VRep. The goal now is to solve the optimization problem
Suppose that there are W distinct annotation words. An 1 2 n
r
arbitrary subset of annotation words is represented by the min α +C ξi (2)
2
¯
binary vector y whose length is W ; the j-th component i=1
¯
yj = 1 if the j-th word occurs in this subset, and 0 oth- s.t. α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ) + l (¯ , yi ) − ξi
y y
erwise. All possible binary vectors form the word space Y.
∀¯ ∈ Yi {yi }, ξi ≥ 0, i = 1, · · · , n
y
We use wj to denote the j-th word in the whole word set.
We use x to denote an arbitrary vector in the feature space. where r = 1, 2 corresponds to the linear or quadratic slack
Figure 1 shows an illustrative example in which the original variable penalty. In this paper, we use the linear slack vari-
image is annotated by duck and water which are represented able penalty. For r = 2, we obtain similar results. C > 0
by a binary vector. There are two VReps after the clustering is a constant that controls the tradeoﬀ between the training
and each has a diﬀerent annotation. In the word space, a error minimization and the margin maximization.
word may be related to other words. For example, duck and Note that in the above formulation, we do not introduce
water are related to each other because water is more likely the relationships between diﬀerent words in the word space.
to occur when duck is one of the annotation words. Conse- However, the relationships between diﬀerent words are im-
quently, the annotation word space is a structured output plicitly included in the VRep-word pairs because the related
space where the elements are interdependent. words is more likely to occur together. Thus, Eq. (2) is in
The relationship between the input example VRep x and fact a structured optimization problem.
¯
an arbitrary output y is represented as the joint feature
mapping Φ(x, y), Φ : X × Y → Rd where d is the dimension
¯ 4.1 EMML Framework
of the joint feature space. It can be expressed as a linear One can solve the optimization problem Eq. (2) in the
combination of the joint feature mapping between x and all primal space — the space of the parameters α. In fact this
the unit vectors. That is problem is intractable when the structured output space is
W large because the number of the constraints is exponential
¯
Φ(x, y) = ¯
yj Φ(x, ej ) in terms of the size of the output space. As in the tradi-
j=1
tional support vector machine, the solution can be obtained
by solving this quadratic optimization problem in the dual
¯
where ej is the j-th unit vector. The score between x and y space — the space of the Lagrange multipliers. Vapnik [22]
4. and Boyd et al. [5] have an excellent review for the related able set µB when keeping µN =0 as follows
optimization problem.
1 ⊤
The dual problem formulation has an important advan- min µ Dµ − µ⊤ S (5)
tage over the primal problem: it only depends on the inner 2
products in the joint feature representation deﬁned by Φ, s.t. Aµ C
allowing the use of a kernel function. We introduce the µB 0, µN = 0
Lagrange multiplier µi,¯ for each constraint to form the La-
y
grangian. We deﬁne Φi,yi ,¯ = Φi (yi ) − Φi (¯ ) and the kernel
y It is clearly true that we can move those µi,¯ = 0, µi,¯ ∈
y y
y
¯ ˜
function K ((xi , y), (xj , y)) = Φi,yi ,¯ , Φj,yj ,˜ . The deriva- µB to set µN without changing the objective function. Fur-
y y
tives of the Lagrangian over α and ξi should be equal to thermore, we can move those µi,¯ ∈ µN satisfying certain
y
zero. Substituting these conditions into the Lagrangian, we conditions to set µB to form a new optimization subprob-
obtain the following Lagrange dual problem lem which yields a strict decrease in the objective function
in Eq. (4) when the new subproblem is optimized. This
property is guaranteed by the following theorem.
1
min y y ¯ ˜
µi,¯ µj,˜ K ((xi , y), (xj , y))− µi,¯ l (¯ , yi ) (3)
y y Theorem 1. Given an optimal solution of the subprob-
2
¯
i,j
y=yi ¯
i
y=yi
lem deﬁned on µB in Eq. (5), if the following conditions
˜
y=yj hold true:
s.t. µi,¯ ≤ C
y µi,¯ ≥ 0
y i = 1, · · · , n ∃i, µi,¯ < C
¯
y y
¯
y =yi
∃µi,¯ ∈ µN ,
y α⊤ Φi,yi ,¯ − l (¯, yi ) < 0
y y (6)
After this dual problem is solved, we have α = i,¯ µi,¯ Φi,yi ,¯ .
y y y the operation of moving the Lagrange multiplier µi,¯ satisfy-
y
For each training example, there are a number of con- ing Eq. (6) from set µN to set µB generates a new optimiza-
straints related to it. We use the subscript i to represent the tion subproblem that yields a strict decrease in the objective
part related to the i-th example in the matrix. For example, function in Eq. (4) when the new subproblem in Eq.(5) is
let µi be the vector with entries µi,¯ . We stack the µi to-
y optimized.
gether to form the vector µ. That is µ = [µ1 ⊤ · · · µn ⊤ ]⊤ .
Similarly, let Si be the vector with entries l (¯ , yi ). We stack
y Proof. Suppose that the current optimal solution is µ.
Si together to form the vector S. That is S = [S⊤ · · · S⊤ ]⊤ . ¯
Let δ be a small positive number. Let µ = µ + δer , where er
1 n
The lengths of µ and S are the same. We deﬁne Ai as ¯
is the r-th unit vector and r = (i, y) denotes the Lagrange
the vector which has the same length as that of µ, where multiplier satisfying condition Eq. (6). Thus, the objective
Ai,¯ = 1 and Aj,¯ = 0 for j = i. Let A = [A1 · · · An ]⊤ . function becomes
y y
Let matrix D represent the kernel matrix where each entry 1
¯
W(µ) = (µ + δer )⊤ D(µ + δer ) − (µ + δer )⊤ S
¯ ˜
is K ((xi , y), (xj , y)). Let C be the vector where each entry 2
is constant C. 1 ⊤
With the above notations we rewrite the Lagrange dual = (µ Dµ + δer Dµ + δµ⊤ Der + δ 2 er Der )
⊤ ⊤
2
problem as follows −µ⊤ S − δer S
⊤
1
1 ⊤ = W(µ) + (δer Dµ + δµ⊤ Der + δ 2 er Der )
⊤ ⊤
min µ Dµ − µ⊤ S (4) 2
2 ⊤
−δer S
s.t. Aµ C
1
µ 0 = W(µ) + δer Dµ − δer S + δ 2 er Der
⊤ ⊤ ⊤
2
1
where and represent the vector comparison deﬁned as = W(µ) + δ(α⊤ Φi,yi ,¯ − l (¯ , yi )) + δ 2 Φi,y i ,¯
y y y
2
2
entry-wise less than or equal to and greater than or equal
to, respectively.
Since α⊤ Φi,yi ,¯ − l (¯ , yi ) < 0, for small enough δ, we
y y
Eq. (4) has the same number of the constraints as Eq. (2).
¯
have W(µ) < W(µ). For small enough δ, the constraints
However, in Eq. (4) most of the constraints are lower bound
Aµ ¯ C is also valid. Therefore, when the new optimiza-
constraints (µ 0) which deﬁne the feasible region. Other
tion subproblem in Eq. (5) is optimized, there must be an
than these lower bound constraints, the rest constraints de-
optimal solution no worse than µ. ¯
termine the complexity of the optimization problem. There-
fore, the number of constraints is considered to be reduced
in Eq. (4). However, the challenge still exists to solve it ef- In fact, the optimal solution is obtained when there is no
ﬁciently since the number of the dual variables is still huge. Lagrange multiplier satisfying the condition Eq. (6). This is
Osuna et al. [16] propose a decomposition algorithm for the guaranteed by the following theorem.
support vector machine learning over large data sets. We Theorem 2. The optimal solution of the optimization prob-
generalize this idea to learning with the structured output lem in Eq. (4) is achieved if and only if the condition Eq. (6)
space. We decompose the constraints of the optimization does not hold true.
problem Eq. (2) into two sets: the working set B and the
nonactive set N. The Lagrange multipliers are also corre- ˆ
Proof. If the optimal solution µ is achieved, the condi-
spondingly partitioned into two parts µB and µN . We are ˆ
tion Eq. (6) must not hold true. Otherwise, µ is not op-
interested in the subproblem deﬁned only for the dual vari- timal according to the Theorem 1. To prove in the reverse
5. direction, we consider the Karush-Kuhn-Tucker (KKT) con- In the recent literature, there are also other methods at-
ditions [5] of the optimization problem Eq. (4). tempting to reduce the number of the constraints. Taskar
et al. [19] reduce the number of the constraints by consider-
Dµ − S + A⊤ γ − π = 0 ing the dual of the loss-augmented problem. However, the
γ ⊤ (C − Aµ) = 0 number of the constraints in their approach is still large for
π⊤µ = 0 a large structured output space and a large training set.
They do not use the fact that only some of the constraints
γ 0 are active in the optimization problem. Tsochantaridis et
π 0 al. [21] also propose a cutting plane algorithm which ﬁnds a
small set of active constraints. One issue of this algorithm is
For the optimization problem Eq. (4), the KKT conditions
that it needs to compute the most violated constraint which
provide necessary and suﬃcient conditions for optimality.
would involve another optimization problem in the output
One can check that the condition Eq. (6) violates the KKT
space. In EMML, instead of selecting the most violated con-
conditions. On the other hand, one can check that the KKT
straint, we arbitrarily select a constraint which violates the
conditions are satisﬁed when the condition Eq. (6) does not
optimality condition of the optimization problem. Thus, the
hold true. Therefore, the optimal solution is achieved when
selection of the constraint does not involve any optimization
the condition Eq. (6) does not hold true.
problem. Therefore, EMML is much more eﬃcient in learn-
The above theorems suggest the Enhanced Max Margin ing with a much faster convergence rate.
Learning (EMML) algorithm listed in Algorithm 1. The
correctness (convergence) of EMML algorithm is provided 5. MULTIMODAL DATA MINING
by Theorem 3. The solution to the Lagrange dual problem makes it pos-
sible to capture the semantic relationships among diﬀerent
Algorithm 1 EMML Algorithm data modalities. We show that the developed EMML frame-
Input: n labeled examples, dual variable set µ. work can be used to solve for the general multimodal data
Output: Optimized µ mining problem in all the scenarios. Speciﬁcally, given a
1: procedure training data set, we immediately obtain the direct rela-
2: Arbitrarily decompose µ into two sets: µB and µN . tionship between the VRep space and the word space using
3: Solve the subproblem in Eq. (5) deﬁned by the vari- the EMML framework in Algorithm 1. Given this obtained
ables in µB . direct relationship, we show below that all the multimodal
4: While there exists µi,¯ ∈ µB such that µi,¯ = 0,
y y data mining scenarios concerned in this paper can be facili-
move it to set µN . tated.
5: While there exists µi,¯ ∈ µN satisfying condition
y
Eq. (6), move it to set µB . If no such µi,¯ ∈ µN exists,
y 5.1 Image Annotation
the iteration exits. Image annotation refers to generating annotation words
6: Goto Step 3. for a given image. First we partition the test image into
7: end procedure blocks and compute the feature vector in the feature space
for each block. We then compute the similarity between
feature vectors and the VReps in terms of the distance. We
Theorem 3. EMML algorithm converges to the global op- return the top n most-relevant VReps. For each VRep, we
timal solution in a ﬁnite number of iterations. compute the score between this VRep and each word as the
Proof. This is the direct result from Theorems 1 and 2. function f in Eq. (1). Thus, for each of the top n most
Step 3 in Algorithm 1 strictly decreases the objective func- relevant VReps, we have the ranking-list of words in terms
tion of Eq. (4) at each iteration and thus the algorithm does of the score. We then merge these n ranking-lists and sort
not cycle. Since the objective function of Eq. (4) is convex them to obtain the overall ranking-list of the whole word
and quadratic, and the feasible solution region is bounded, space. Finally, we return the top m words as the annotation
the objective function is bounded. Therefore, the algorithm result.
must converge to the global optimal solution in a ﬁnite num- In this approach, the score between the VReps and the
ber of iterations. words can be computed in advance. Thus, the computa-
tion complexity of image annotation is only related to the
Note that in Step 5, we only need ﬁnd one dual variable number of the VReps. Under the assumption that all the im-
satisfying Eq. (6). We need examine all the dual variables ages in the image database follow the same distribution, the
in the set µN only when no dual variable satisﬁes Eq. (6). number of the VReps is independent of the database scale.
It is fast to examine the dual variables in the set µN even if Therefore, the computation complexity in this approach is
the number of the dual variables is large. O(1) which is independent of the database scale.
4.2 Comparison with other methods 5.2 Word Query
In the max margin optimization problem Eq. (2), only Word query refers to generating corresponding images in
some of the constraints determine the optimal solution. We response to a query word. For a given word input, we com-
call these constraints active constraints. Other constraints pute the score between each VRep and the word as the func-
are automatically met as long as these active constraints are tion f in Eq. (1). Thus, we return the top n most relevant
valid. EMML algorithm uses this fact to solve the optimiza- VReps. Since for each VRep, we compute the similarity
tion problem by substantially reducing the number of the between this VRep and each image in the image database
dual variables in Eq. (3). in terms of the distance, for each of those top n most rele-
6. (a) (b)
Figure 2: An example of 3-stage image-to-image in-
ferencing.
vant VReps, we have the ranking-list of images in terms of
the distance. Then we merge these n ranking-lists and sort
them to obtain the overall ranking-list in the image space.
Finally, we return the top m images as the query result.
For each VRep, the similarity between this VRep and each
image in the image database can be computed in advance.
Similar to the analysis in Sec. 5.1, the computation com-
plexity is only related to the number of the VReps, which is
O(1).
5.3 Image Retrieval
Image retrieval refers to generating semantically similar Figure 3: An illustrative diagram for image-to-
images to a query image. Given a query image, we annotate image across two stages inferencing
it using the procedure in Sec. 5.1. In the image database,
for each annotation word j there are a subset of images Sj
in which this annotation word appears. We then have the both correspond to the same gene as a result of the image-
union set S = ∪j Sj for all the annotation words of the query to-image inferencing between the two stages1 . However, it
image. is clear that they exhibit a very large visual dissimilarity.
On the other hand, for each annotation word j of the Consequently, it is not appropriate to use any pure visual
query image, the word query procedure in Sec. 5.2 is used feature based similarity retrieval method to identify such
to obtain the related sorted image subset Tj from the image image-to-image correspondence across stages. Furthermore,
database. We then merge these subsets Tj to form the sorted we also expect to have the word-to-image and image-to-word
image set T in terms of their scores. The ﬁnal image retrieval inferencing capabilities across diﬀerent stages, in addition to
result is R = S ∩ T . the image-to-image inferencing.
In this approach, the synergy between the image space and Given this consideration, this is exactly where the pro-
the word space is exploited to reduce the semantic gap based posed approach for multimodal data mining can be applied
on the developed learning approach. Since the complexity to complement the existing pure retrieval based methods to
of the retrieval methods in Secs. 5.1 and 5.2 are both O(1), identify such correspondence. Typically in such a fruit ﬂy
and since these retrievals are only returned for the top few embryo image database, there are textual words for anno-
items, respectively, ﬁnding the intersection or the union is tation to the images in each stage. These annotation words
O(1). Consequently, the overall complexity is also O(1). in one stage may or may not have the direct semantic cor-
respondence to the images in another stage. However, since
5.4 Multimodal Image Retrieval the data in all the stages are from the same fruit ﬂy embryo
The general scenario of multimodal image retrieval is a image database, the textual annotation words between two
query as a combination of a series of images and a series of diﬀerent stages share a semantic relationship which can be
words. Clearly, this retrieval is simply a linear combination obtained by a domain ontology.
of the retrievals in Secs. 5.2 and 5.3 by merging the retrievals In order to apply our approach to this across-stage infer-
together based on their corresponding scores. Since each encing problem, we treat each stage as a separate multime-
individual retrieval is O(1), the overall retrieval is also O(1). dia database, and map the across-stage inferencing problem
to a retrieval based multimodal data mining problem by ap-
5.5 Across-Stage Inferencing plying the approach to the two stages such that we take the
multimodal query as the data from one stage and pose the
For a fruit ﬂy embryo image database such as the Berke-
query to the data in the other stage for the retrieval based
ley Drosophila embryo image database which is used for our
multimodal data mining. Figure 3 illustrates the diagram
experimental evaluations, we have embryo images classiﬁed
of the two stages (state i and state j where i = j) image-to-
in advance into diﬀerent stages of the embryo development
image inferencing.
with separate sets of textual words as annotation to those
Clearly, in comparison with the retrieval based multi-
images in each of these stages. In general, images in diﬀerent
modal data mining analyzed in the previous sections, the
stages may or may not have the direct semantic correspon-
only additional complexity here in across-stage inferencing
dence (e.g., they all correspond to the same gene), not even
speaking that images in diﬀerent stages may necessarily ex- 1
The Berkeley Drosophila embryo image database is given
hibit any visual similarity. Figure 2 shows an example of a in such a way that images from several real stages are mixed
pair of identiﬁed embryo images at stages 9-10 (Figure 2(a)) together to be considered as one “stage”. Thus, stages 9-10
and stages 13-16 (Figure 2(b)), respectively, in which they are considered as one stage, and so are stages 13-16.
7. is the inferencing part using the domain ontology in the word
space. Typically this ontology is small in scale. In fact, in Table 1: Comparison of scalability
our evaluations for the Berkeley Drosophila embryo image Database Size 50 100 150
database, this ontology is handcrafted and is implemented EMML 1 1 1
as a look-up table for word matching through an eﬃcient MBRM 1 2.2 3.3
hashing function. Thus, this part of the computation may
be ignored. Consequently, the complexity of the across-stage
inferencing based multimodal data mining is the same as 0.24
EMML
Image Annotation Average Precision(n)
that of the retrieval based multimodal data mining which is MBRM
1.0
Image Annotation Average Recall(n)
0.22
independent of database scale.
0.20
0.8
6. EMPIRICAL EVALUATIONS 0.18
0.16 0.6
While EMML is a general learning framework, and it can
also be applied to solve for a general multimodal data min- 0.14
ing problem in any application domains, for the evaluation 0.12
0.4
purpose, we apply it to the Berkeley Drosophila embryo im- 0.10
age database [1] for the multimodal data mining task de- 0.08
0.2
ﬁned in this paper. We evaluate this approach’s perfor- 0.06
mance using this database for both the retrieval based and 0.0
0 10 20 30 40 50
the across-stage inferencing based multimodal data mining
Top(n)
scenarios. We compare this approach with a state-of-the-art
multimodal data mining method MBRM [11] for the mining
performance.
In this image database, there are in total 16 stages of the Figure 4: Precisions and Recalls of image annotation
embryo images archived in six diﬀerent folders with each between EMML and MBRM (the solid lines are for
folder containing two to four real stages of the images; there precisions and the dashed lines are for recalls)
are in total 36,628 images and 227 words in all the six folders;
not all the images have annotation words. For the retrieval
based multimodal data mining evaluations, we use the ﬁfth parison with MBRM model. Figure 6 reports the precisions
folder as the multimedia database, which corresponds to and recalls averaged over 1648 queries for image retrieval in
stages 11 and 12. There are about 5,500 images that have comparison with MBRM model.
annotation words and there are 64 annotation words in this For the 2-stage inferencing, Figure 7 reports the precisions
folder. We split the whole folder’s images into two parts (one and recalls averaged over 1648 queries for image-to-word in-
third and two thirds), with the two thirds used in the train- ferencing in comparison with MBRM model, and Figure 8
ing and the one third used in the evaluation testing. For reports the precisions and recalls averaged over 64 queries
the across-stage inferencing based multimodal data mining for word-to-image inferencing in comparison with MBRM
evaluations, we use the fourth and the ﬁfth folders for the model. Figure 9 reports the precisions and recalls averaged
two stages inferencing evaluations, and use the third, the over 1648 queries for image-to-image inferencing in compari-
fourth and the ﬁfth folders for the three stages inferencing son with MBRM model. Finally, for the 3-stage inferencing,
evaluations. Consequently, each folder here is considered as Figure 10 reports precisions and recalls averaged over 1100
a “stage” in the across-stage inferencing based multimodal queries for image-to-image inferencing in comparison with
data mining evaluations. In each of the inferencing scenar- MBRM model.
ios, we use the same split as we do in the retrieval based In summary, there is no single winner for all the cases.
multimodal data mining evaluations for training and test- Overall, EMML outperforms MBRM substantially in the
ing. scenarios of word query and image retrieval, and slightly
In order to facilitate the across-stage inferencing capabil- in the scenario of 2-stage word-to-image inferencing and
ities, we handcraft the ontology of the words involved in 3-stage image-to-image inferencing. On the other hand,
the evaluations. This is simply implemented as a simple MBRM has a slight better performance than EMML in the
look-up table indexed by an eﬃcient hashing function. For scenario of 2-stage image-to-word inferencing. For all other
example, cardiac mesoderm primordium in the fourth folder scenarios the two methods have a comparable performance.
is considered as the same as circulatory system in the ﬁfth In order to demonstrate the strong scalability of EMML
folder. With this simple ontology and word matching, the approach to multimodal data mining, we take image anno-
proposed approach may be well applied to this across-stage tation as a case study and compare the scalability between
inferencing problem for the multimodal data mining. EMML and MBRM. We randomly select three subsets of the
The EMML algorithm is applied to obtain the model pa- embryo image database in diﬀerent scales (50, 100, 150 im-
rameters. In the ﬁgures below, the horizonal axis denotes ages, respectively), and apply both methods to the subsets
the number of the top retrieval results. We investigate the to measure the query response time. The query response
performance from top 2 to top 50 retrieval results. Fig- time is obtained by taking the average response time over
ure 4 reports the precisions and recalls averaged over 1648 1648 queries. Since EMML is implemented in MATLAB
queries for image annotation in comparison with MBRM environment and MBRM is implemented in C in Linux en-
model where the solid lines are for precisions and the dashed vironment, to ensure a fair comparison, we report the scala-
lines are for recalls. Similarly, Figure 5 reports the precisions bility as the relative ratio of a response time to the baseline
and recalls averaged over 64 queries for word query in com- response time for the respective methods. Here the baseline
8. response time is the response time to the smallest scale sub-
set (i.e., 50 images). Table 1 documents the scalability com-
0.10
parison. Clearly, MBRM exhibits a linear scalability w.r.t
Single Word Query Average Precision(n)
EMML 0.035
the database size while that of EMML is constant. This is 0.09 MBRM
Single Word Query Average Recall(n)
0.030
consistent with the scalability analysis in Sec. 5. 0.08
In order to verify the fast learning advantage of EMML 0.07
0.025
in comparison with the existing max margin based learn-
0.020
ing literature, we have implemented one of the most re- 0.06
cently proposed max margin learning methods by Taskar 0.05 0.015
et al. [19]. For the reference purpose, in this paper we 0.04 0.010
call this method as TCKG. We have applied both EMML
0.03
and TCKG to a small data set randomly selected from the 0.005
whole Berkeley embryo database, consisting of 110 images 0.02
0.000
along with their annotation words. The reason we use this
0 10 20 30 40 50
small data set for the comparison is that we have found that Top(n)
in MATLAB platform TCKG immediately runs out of mem-
ory when the data set is larger, due to the large number of
the constraints, which is typical for the existing max margin
learning methods. Under the environment of 2.2GHz CPU Figure 5: Precisions and Recalls of word query be-
and 1GB memory, TCKG takes about 14 hours to complete tween EMML and MBRM
the learning for such a small data set while EMML only
takes about 10 minutes. We have examined the number of
the constraints reduced in both methods during their exe-
0.9
cutions for this data set. EMML has reduced the number EMML
0.035
of the constraints in a factor of 70 times more than that
Image Retrieval Average Precision(n)
MBRM
0.8
Image Retrieval Average Recall(n)
reduced by TCKG. This explains why EMML is about 70 0.030
times faster than TCKG in learning for this data set. 0.7
0.025
0.6
0.020
7. CONCLUSION
0.5
We have developed a new max margin learning framework 0.015
— the enhanced max margin learning (EMML), and applied 0.4
0.010
it to developing an eﬀective and eﬃcient multimodal data 0.3 0.005
mining solution. EMML attempts to ﬁnd a small set of
active constraints, and thus is more eﬃcient in learning than 0.2 0.000
the existing max margin learning literature. Consequently, 0 10 20 30 40 50
it has a much faster convergence rate which is veriﬁed in Top(n)
empirical evaluations. The multimodal data mining solution
based on EMML is highly scalable in the sense that the
query response time is independent of the database scale.
Figure 6: Precisions and Recalls of image retrieval
This advantage is also supported through the complexity
between EMML and MBRM
analysis as well as empirical evaluations. While EMML is
a general learning framework and can be used for general
multimodal data mining, for the evaluation purpose, we have
applied it to the Berkeley Drosophila embryo image database
0.32
and have reported the evaluations against a state-of-the-art EMML 2-stage inferencing
Image Annotation Average Precision(n)
MBRM 2-stage inferencing 1.0
multimodal data mining method. Image Annotation Average Recall(n)
0.28
0.8
8. ACKNOWLEDGEMENT 0.24
This work is supported in part by NSF (IIS-0535162, EF- 0.6
0331657), AFRL (FA8750-05-2-0284), AFOSR (FA9550-06-
0.20
1-0327), and a CAREER Award to EPX by the National
0.4
Science Foundation under Grant No. DBI-054659.
0.16
0.2
9. REFERENCES 0.12
[1] http://www.fruitfly.org/.
0.0
[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden 0 10 20 30 40 50
markov support vector machines. In Proc. ICML, Top(n)
Washington DC, 2003.
[3] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas,
D. M. Blei, and M. I. Jordan. Matching words and Figure 7: Precisions and Recalls of 2-stage image to
pictures. Journal of Maching Learning Research, word inferencing between EMML and MBRM
3:1107–1135, 2003.
9. [4] D. Blei and M. Jordan. Modeling annotated data. In
Proceedings of the 26th annual International ACM
0.18 0.030
EMML 2-stage inferencing SIGIR Conference on Research and Development in
Single Word Query Average Precision(n)
MBRM 2-stage inferencing
Information Retrieval, pages 127–134, 2003.
Single Word Query Average Recall(n)
0.16 0.025
[5] S. Boyd and L. Vandenberghe. Convex Optimization.
0.14
0.020 Cambridge University Press, 2004.
0.12
[6] U. Brefeld and T. Scheﬀer. Semi-supervised learning
0.015
for structured output variables. In Proc. ICML,
0.10
0.010
Pittsburgh, PA, 2006.
0.08
[7] E. Chang, K. Goh, G. Sychay, and G. Wu. Cbsa:
0.005 content-based soft annotation for multimodal image
0.06 retrieval using bayes point machines. IEEE Trans. on
0.000
Circuits and Systems for Video Technology, 13:26–38,
0.04
0 10 20 30 40 50
Jan 2003.
Top(n) [8] W. Chu, Z. Ghahramani, and D. L. Wild. A graphical
model for protein secondary structure prediction. In
Proc. ICML, Banﬀ, Canada, 2004.
Figure 8: Precisions and Recalls of 2-stage word to [9] R. Datta, W. Ge, J. Li, and J. Z. Wang. Toward
image inferencing between EMML and MBRM bridging the annotation-retrieval gap in image search
by a generative modeling approach. In Proc. ACM
Multimedia, Santa Barbara, CA, 2006.
[10] P. Duygulu, K. Barnard, N. de Freitas, and
0.45 0.040
D. Forsyth. Object recognition as machine translation:
EMML 2-stage inferencing
MBRM 2-stage inferencing
Learning a lexicon for a ﬁxed image vocabulary. In
Image Retrieval Average Precision(n)
0.035
0.40
Seventh European Conference on Computer Vision,
Image Retrieval Average Recall(n)
0.35
0.030 volume IV, pages 97–112, 2002.
0.025
[11] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple
0.30
bernoulli relevance models for image and video
0.020
0.25
annotation. In International Conference on Computer
0.015 Vision and Pattern Recognition, Washington DC,
0.20
0.010
2004.
0.15
[12] Y. Freund and R. E. Schapire. Large margin
0.005
classiﬁcation using the perceptron algorithm. In
0.10 0.000
Maching Learning, volume 37, 1999.
0 10 20 30 40 50 [13] H. D. III and D. Marcu. Learning as search
Top(n) optimization: Approximate large margin methods for
structured prediction. In Proc. ICML, Bonn,
Germany, 2005.
Figure 9: Precisions and Recalls of 2-stage image to [14] J. Laﬀerty, A. McCallum, and F. Pereira. Conditional
image inferencing between EMML and MBRM random ﬁelds: Probabilistic models for segmenting
and labeling sequence data. In Proc. ICML, 2001.
[15] A. McCallum, D. Freitag, and F. Pereira. Maximum
entropy markov models for information extraction and
0.50 segmentation. In Proc. ICML, 2000.
EMML 3-stage inferencing
0.48 [16] E. Osuna, R. Freund, and F. Girosi. An improved
Image Retrieval Average Precision(n)
MBRM 3-stage inferencing 0.04
training algorithm for support vector machines. In
Image Retrieval Average Recall(n)
0.46
0.44
Proc. of IEEE NNSP’97, Amelia Island, FL, Sept.
0.03
1997.
0.42
[17] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu.
0.40
0.02
Automatic multimedia cross-modal correlation
0.38
discovery. In Proceedings of the 10th ACM SIGKDD
0.36
0.01 Conference, Seattle, WA, 2004.
0.34
[18] A. W. M. Smeulders, M. Worring, S. Santini,
0.32
0.00
A. Gupta, and R. Jain. Content-based image retrieval
0.30 at the end of the early years. IEEE Trans. on Pattern
0 10 20 30 40 50 Analysis and Machine Intelligence, 22:1349–1380,
Top(n) 2000.
[19] B. Taskar, V. Chatalbashev, D. Koller, and
C. Guestrin. Learning structured prediction models: A
Figure 10: Precisions and Recalls of 3-stage image large margin approach. In Proc. ICML, Bonn,
to image inferencing between EMML and MBRM Germany, 2005.
10. [20] B. Taskar, C. Guestrin, and D. Koller. Max-margin
markov networks. In Neural Information Processing
Systems Conference, Vancouver, Canada, 2003.
[21] I. Tsochantaridis, T. Hofmann, T. Joachims, and
Y. Altun. Support vector machine learning for
interdependent and structured output spaces. In Proc.
ICML, Banﬀ, Canada, 2004.
[22] V. N. Vapnik. The nature of statistical learning theory.
Springer, 1995.
[23] Y. Wu, E. Y. Chang, and B. L. Tseng. Multimodal
metadata fusion using causal strength. In Proc. ACM
Multimedia, pages 872–881, Hilton, Singapore, 2005.
Be the first to comment