Enhanced Max Margin Learning on Multimodal Data Mining in a ...
Upcoming SlideShare
Loading in...5
×
 

Enhanced Max Margin Learning on Multimodal Data Mining in a ...

on

  • 1,049 views

 

Statistics

Views

Total Views
1,049
Views on SlideShare
1,049
Embed Views
0

Actions

Likes
0
Downloads
1
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Enhanced Max Margin Learning on Multimodal Data Mining in a ... Enhanced Max Margin Learning on Multimodal Data Mining in a ... Document Transcript

  • Enhanced Max Margin Learning on Multimodal Data Mining in a Multimedia Database Zhen Guo, Zhongfei (Mark) Zhang Eric P. Xing, Christos Faloutsos Computer Science Department School of Computer Science SUNY Binghamton Carnegie Mellon University Binghamton, NY, 13902 Pittsburgh, PA, 15213 {zguo,zhongfei}@cs.binghamton.edu {epxing,christos}@cs.cmu.edu ABSTRACT General Terms The problem of multimodal data mining in a multimedia Algorithms, experimentation database can be addressed as a structured prediction prob- lem where we learn the mapping from an input to the struc- Keywords tured and interdependent output variables. In this paper, built upon the existing literature on the max margin based Multimodal data mining, image annotation, image retrieval, learning, we develop a new max margin learning approach max margin called Enhanced Max Margin Learning (EMML) framework. In addition, we apply EMML framework to developing an 1. INTRODUCTION effective and efficient solution to the multimodal data min- Multimodal data mining in a multimedia database is a ing problem in a multimedia database. The main contri- challenging topic in data mining research. Multimedia data butions include: (1) we have developed a new max mar- may consist of data in different modalities, such as digital gin learning approach — the enhanced max margin learning images, audio, video, and text data. In this context, a multi- framework that is much more efficient in learning with a media database refers to a data collection in which there are much faster convergence rate, which is verified in empirical multiple modalities of data such as text and imagery. In this evaluations; (2) we have applied this EMML approach to database system, the data in different modalities are related developing an effective and efficient solution to the multi- to each other. For example, the text data are related to im- modal data mining problem that is highly scalable in the ages as their annotation data. By multimodal data mining sense that the query response time is independent of the in a multimedia database it is meant that the knowledge database scale, allowing facilitating a multimodal data min- discovery to the multimedia database is initiated by a query ing querying to a very large scale multimedia database, and that may also consist of multiple modalities of data such as excelling many existing multimodal data mining methods in text and imagery. In this paper, we focus on a multimedia the literature that do not scale up at all; this advantage is database as an image database in which each image has a also supported through the complexity analysis as well as few textual words given as annotation. We then address empirical evaluations against a state-of-the-art multimodal the problem of multimodal data mining in such an image data mining method from the literature. While EMML is database as the problem of retrieving similar data and/or a general framework, for the evaluation purpose, we apply inferencing new patterns to a multimodal query from the it to the Berkeley Drosophila embryo image database, and database. report the performance comparison with a state-of-the-art Specifically, in the context of this paper, multimodal data multimodal data mining method. mining refers to two aspects of activities. The first is the multimodal retrieval. This is the scenario where a mul- Categories and Subject Descriptors timodal query consisting of either textual words alone, or H.2.8 [Database Management]: Database Applications— imagery alone, or in any combination is entered and an ex- Data mining,Image databases; H.3.3 [Information Stor- pected retrieved data modality is specified that can also be age and Retrieval]: Information Search and Retrieval— text alone, or imagery alone, or in any combination; the re- Retrieval models; I.5.1 [Pattern Recognition]: Models— trieved data based on a pre-defined similarity criterion are Structural ; J.3 [Computer Applications]: Life and Med- returned back to the user. The second is the multimodal in- ical Sciences—Biology and genetics ferencing. While the retrieval based multimodal data mining has its standard definition in terms of the semantic similarity between the query and the retrieved data from the database, the inferencing based mining depends on the specific applica- Permission to make digital or hard copies of all or part of this work for tions. In this paper, we focus on the application of the fruit personal or classroom use is granted without fee provided that copies are fly image database mining. Consequently, the inferencing not made or distributed for profit or commercial advantage and that copies based multimodal data mining may include many different bear this notice and the full citation on the first page. To copy otherwise, to scenarios. A typical scenario is the across-stage multimodal republish, to post on servers or to redistribute to lists, requires prior specific inferencing. There are many interesting questions a biologist permission and/or a fee. KDD’07, August 12–15, 2007, San Jose, California, USA. may want to ask in the fruit fly research given such a mul- Copyright 2007 ACM 978-1-59593-609-7/07/0008 ...$5.00. timodal mining capability. For example, given an embryo
  • image in stage 5, what is the corresponding image in stage multimodal approaches. 7 for an image-to-image three-stage inferencing? What is The learning with structured output variables covers many the corresponding annotation for this image in stage 7 for natural learning tasks including named entity recognition, an image-to-word three-stage inferencing? The multimodal natural language parsing, and label sequence learning. There mining technique we have developed in this paper also ad- have been many studies on the structured model which in- dresses this type of across-stage inferencing capability, in clude conditional random fields [14], maximum entropy model addition to the multimodal retrieval capability. [15], graph model [8], semi-supervised learning [6] and max In the image retrieval research area, one of the notorious margin approaches [13, 21, 20, 2]. The challenge of learning bottlenecks is the semantic gap [18]. Recently, it is reported with structured output variables is that the number of the that this bottleneck may be reduced by the multimodal data structures is exponential in terms of the size of the struc- mining approaches [3, 11] which take advantage of the fact ture output space. Thus, the problem is intractable if we that in many applications image data typically co-exist with treat each structure as a separate class. Consequently, the other modalities of information such as text. The synergy multiclass approach is not well fitted into the learning with between different modalities may be exploited to capture the structured output variables. high level conceptual relationships. As an effective approach to this problem, the max margin To exploit the synergy among the multimodal data, the principle has received substantial attention since it was used relationships among these different modalities need to be in the support vector machine (SVM) [22]. In addition, the learned. For an image database, we need to learn the rela- perceptron algorithm is also used to explore the max margin tionship between images and text. The learned relationship classification [12]. Taskar et al. [19] reduce the number of the between images and text can then be further used in mul- constraints by considering the dual of the loss-augmented timodal data mining. Without loss of generality, we start problem. However, the number of the constraints in their with a special case of the multimodal data mining problem approach is still large for a large structured output space — image annotation, where the input is an image query and and a large training set. the expected output is the annotation words. We show later For learning with structured output variables, Tsochan- that this approach is also valid to the general multimodal taridis et al. [21] propose a cutting plane algorithm which data mining problem. The image annotation problem can finds a small set of active constraints. One issue of this al- be formulated as a structured prediction problem where the gorithm is that it needs to compute the most violated con- input (image) x and the output (annotation) y are struc- straint which would involve another optimization problem in tures. An image can be partitioned into blocks which form the output space. In EMML, instead of selecting the most a structure. The word space can be denoted by a vector violated constraint, we arbitrarily select a constraint which where each entry represents a word. Under this setting, the violates the optimality condition of the optimization prob- learning task is therefore formulated as finding a function lem. Thus, the selection of the constraints does not involve f : X × Y → R such that any optimization problem. Osuna et al. [16] propose the de- composition algorithm for the support vector machine. In ˆ y = arg max f (x, y) (1) EMML, we generalize their idea to the scenario of learning y∈Y with structured output variables. is the desired output for any input x. In this paper, built upon the existing literature on the max margin learning, we propose a new max margin learn- 3. HIGHLIGHTS OF THIS WORK ing approach on the structured output space to learn the This work is based on the existing literature on max mar- above function. Like the existing max margin learning meth- gin learning, and aims at solving for the problem of multi- ods, the image annotation problem may be formulated as a modal data mining in a multimedia database defined in this quadratic programming (QP) problem. The relationship be- paper. In comparison with the existing literature, the main tween images and text is discovered once this QP problem contributions of this work include: (1) we have developed is solved. Unlike the existing max margin learning meth- a new max margin learning approach — the enhanced max ods, the new max margin learning method is much more margin learning framework that is much more efficient in efficient with a much faster convergence rate. Consequently, learning with a much faster convergence rate, which is veri- we call this new max margin learning approach as Enhanced fied in empirical evaluations; (2) we have applied this EMML Max Margin Learning (EMML). We further apply EMML approach to developing an effective and efficient solution to to solving the multimodal data mining problem effectively the multimodal data mining problem that is highly scalable and efficiently. in the sense that the query response time is independent of Note that the proposed approach is general that can be the database scale, allowing facilitating a multimodal data applied to any structured prediction problems. For the eval- mining querying to a very large scale multimedia database, uation purpose, we apply this approach to the Berkeley and excelling many existing multimodal data mining meth- Drosophila embryo image database. Extensive empirical ods in the literature that do not scale up at all; this advan- evaluations against a state-of-the-art method on this database tage is also supported through the complexity analysis as are reported. well as empirical evaluations against a state-of-the-art mul- timodal data mining method from the literature. 2. RELATED WORK 4. LEARNING IN THE STRUCTURED OUT- Multimodal approaches have recently received the sub- stantial attention since Barnard and Duygulu et al. started PUT SPACE their pioneering work on image annotation [3, 10]. Recently Assume that the image database consists of a set of in- there have been many studies [4, 17, 11, 7, 9, 23] on the stances S = {(Ii , Wi )}L , where each instance consists of i=1
  • can be expressed as a linear combination of each component ¯ ¯ in the joint feature representation: f (x, y) = α, Φ(x, y) . Then the learning task is to find the optimal weight vector α such that the prediction error is minimized for all the training instances. That is ¯ arg max f (xi , y) ≈ yi , i = 1, · · · , n y ∈Y (i) ¯ where Yi = {¯ | W yj = j=1 yij }. We use Φi (¯ ) to de- y j=1 ¯ W y ¯ note Φ(xi , y). To make the prediction to be the true output yi , we must follow α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ), y ∀¯ ∈ Yi {yi } y Figure 1: An illustration of the image partitioning where Yi {y i } denotes the removal of the element yi from and the structured output word space the set Yi . In order to accommodate the prediction error on the training examples, we introduce the slack variable ξi . The above constraint then becomes an image object Ii and the corresponding annotation word α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ) − ξi , y ξi ≥ 0 ∀¯ ∈ Yi {y i } y set Wi . First we partition an image into a set of blocks. Thus, an image can be represented by a set of sub-images. We measure the prediction error on the training instances The feature vector in the feature space for each block can by the loss function which is the distance between the true be computed from the selected feature representation. Con- ¯ output yi and the prediction y . The loss function measures sequently, an image is represented as a set of feature vectors the goodness of the learning model. The standard zero-one in the feature space. A clustering algorithm is then applied classification loss is not suitable for the structured output to the whole feature space to group similar feature vectors space. We define the loss function l (¯ , yi ) as the number y together. The centroid of a cluster represents a visual rep- of the different entries in these two vectors. We include the resentative (we refer it to VRep in this paper) in the image loss function in the constraints as is proposed by Taskar et space. In Figure 1, there are two VReps, water and duck al. [19] in the water. The corresponding annotation word set can be easily obtained for each VRep. Consequently, the image α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ) + l (¯ , yi ) − ξi y y database becomes the VRep-word pairs S = {(xi , yi )}n , i=1 We interpret 1 α⊤ [Φi (yi )−Φi (¯ )] y as the margin of yi over α where n is the number of the clusters, xi is a VRep ob- (i) ¯ another y ∈ Y . We then rewrite the above constraint as ject and yi is the word annotation set corresponding to this 1 α⊤ [Φi (yi ) − Φi (¯ )] ≥ α [l (¯ , yi ) − ξi ]. Thus, minimiz- y 1 y VRep object. Another simple method to obtain the VRep- α word pairs is that we randomly select some images from the ing α maximizes such margin. image database and each image is viewed as a VRep. The goal now is to solve the optimization problem Suppose that there are W distinct annotation words. An 1 2 n r arbitrary subset of annotation words is represented by the min α +C ξi (2) 2 ¯ binary vector y whose length is W ; the j-th component i=1 ¯ yj = 1 if the j-th word occurs in this subset, and 0 oth- s.t. α⊤ Φi (yi ) ≥ α⊤ Φi (¯ ) + l (¯ , yi ) − ξi y y erwise. All possible binary vectors form the word space Y. ∀¯ ∈ Yi {yi }, ξi ≥ 0, i = 1, · · · , n y We use wj to denote the j-th word in the whole word set. We use x to denote an arbitrary vector in the feature space. where r = 1, 2 corresponds to the linear or quadratic slack Figure 1 shows an illustrative example in which the original variable penalty. In this paper, we use the linear slack vari- image is annotated by duck and water which are represented able penalty. For r = 2, we obtain similar results. C > 0 by a binary vector. There are two VReps after the clustering is a constant that controls the tradeoff between the training and each has a different annotation. In the word space, a error minimization and the margin maximization. word may be related to other words. For example, duck and Note that in the above formulation, we do not introduce water are related to each other because water is more likely the relationships between different words in the word space. to occur when duck is one of the annotation words. Conse- However, the relationships between different words are im- quently, the annotation word space is a structured output plicitly included in the VRep-word pairs because the related space where the elements are interdependent. words is more likely to occur together. Thus, Eq. (2) is in The relationship between the input example VRep x and fact a structured optimization problem. ¯ an arbitrary output y is represented as the joint feature mapping Φ(x, y), Φ : X × Y → Rd where d is the dimension ¯ 4.1 EMML Framework of the joint feature space. It can be expressed as a linear One can solve the optimization problem Eq. (2) in the combination of the joint feature mapping between x and all primal space — the space of the parameters α. In fact this the unit vectors. That is problem is intractable when the structured output space is W large because the number of the constraints is exponential ¯ Φ(x, y) = ¯ yj Φ(x, ej ) in terms of the size of the output space. As in the tradi- j=1 tional support vector machine, the solution can be obtained by solving this quadratic optimization problem in the dual ¯ where ej is the j-th unit vector. The score between x and y space — the space of the Lagrange multipliers. Vapnik [22]
  • and Boyd et al. [5] have an excellent review for the related able set µB when keeping µN =0 as follows optimization problem. 1 ⊤ The dual problem formulation has an important advan- min µ Dµ − µ⊤ S (5) tage over the primal problem: it only depends on the inner 2 products in the joint feature representation defined by Φ, s.t. Aµ C allowing the use of a kernel function. We introduce the µB 0, µN = 0 Lagrange multiplier µi,¯ for each constraint to form the La- y grangian. We define Φi,yi ,¯ = Φi (yi ) − Φi (¯ ) and the kernel y It is clearly true that we can move those µi,¯ = 0, µi,¯ ∈ y y y ¯ ˜ function K ((xi , y), (xj , y)) = Φi,yi ,¯ , Φj,yj ,˜ . The deriva- µB to set µN without changing the objective function. Fur- y y tives of the Lagrangian over α and ξi should be equal to thermore, we can move those µi,¯ ∈ µN satisfying certain y zero. Substituting these conditions into the Lagrangian, we conditions to set µB to form a new optimization subprob- obtain the following Lagrange dual problem lem which yields a strict decrease in the objective function in Eq. (4) when the new subproblem is optimized. This property is guaranteed by the following theorem. 1 min y y ¯ ˜ µi,¯ µj,˜ K ((xi , y), (xj , y))− µi,¯ l (¯ , yi ) (3) y y Theorem 1. Given an optimal solution of the subprob- 2 ¯ i,j y=yi ¯ i y=yi lem defined on µB in Eq. (5), if the following conditions ˜ y=yj hold true: s.t. µi,¯ ≤ C y µi,¯ ≥ 0 y i = 1, · · · , n ∃i, µi,¯ < C ¯ y y ¯ y =yi ∃µi,¯ ∈ µN , y α⊤ Φi,yi ,¯ − l (¯, yi ) < 0 y y (6) After this dual problem is solved, we have α = i,¯ µi,¯ Φi,yi ,¯ . y y y the operation of moving the Lagrange multiplier µi,¯ satisfy- y For each training example, there are a number of con- ing Eq. (6) from set µN to set µB generates a new optimiza- straints related to it. We use the subscript i to represent the tion subproblem that yields a strict decrease in the objective part related to the i-th example in the matrix. For example, function in Eq. (4) when the new subproblem in Eq.(5) is let µi be the vector with entries µi,¯ . We stack the µi to- y optimized. gether to form the vector µ. That is µ = [µ1 ⊤ · · · µn ⊤ ]⊤ . Similarly, let Si be the vector with entries l (¯ , yi ). We stack y Proof. Suppose that the current optimal solution is µ. Si together to form the vector S. That is S = [S⊤ · · · S⊤ ]⊤ . ¯ Let δ be a small positive number. Let µ = µ + δer , where er 1 n The lengths of µ and S are the same. We define Ai as ¯ is the r-th unit vector and r = (i, y) denotes the Lagrange the vector which has the same length as that of µ, where multiplier satisfying condition Eq. (6). Thus, the objective Ai,¯ = 1 and Aj,¯ = 0 for j = i. Let A = [A1 · · · An ]⊤ . function becomes y y Let matrix D represent the kernel matrix where each entry 1 ¯ W(µ) = (µ + δer )⊤ D(µ + δer ) − (µ + δer )⊤ S ¯ ˜ is K ((xi , y), (xj , y)). Let C be the vector where each entry 2 is constant C. 1 ⊤ With the above notations we rewrite the Lagrange dual = (µ Dµ + δer Dµ + δµ⊤ Der + δ 2 er Der ) ⊤ ⊤ 2 problem as follows −µ⊤ S − δer S ⊤ 1 1 ⊤ = W(µ) + (δer Dµ + δµ⊤ Der + δ 2 er Der ) ⊤ ⊤ min µ Dµ − µ⊤ S (4) 2 2 ⊤ −δer S s.t. Aµ C 1 µ 0 = W(µ) + δer Dµ − δer S + δ 2 er Der ⊤ ⊤ ⊤ 2 1 where and represent the vector comparison defined as = W(µ) + δ(α⊤ Φi,yi ,¯ − l (¯ , yi )) + δ 2 Φi,y i ,¯ y y y 2 2 entry-wise less than or equal to and greater than or equal to, respectively. Since α⊤ Φi,yi ,¯ − l (¯ , yi ) < 0, for small enough δ, we y y Eq. (4) has the same number of the constraints as Eq. (2). ¯ have W(µ) < W(µ). For small enough δ, the constraints However, in Eq. (4) most of the constraints are lower bound Aµ ¯ C is also valid. Therefore, when the new optimiza- constraints (µ 0) which define the feasible region. Other tion subproblem in Eq. (5) is optimized, there must be an than these lower bound constraints, the rest constraints de- optimal solution no worse than µ. ¯ termine the complexity of the optimization problem. There- fore, the number of constraints is considered to be reduced in Eq. (4). However, the challenge still exists to solve it ef- In fact, the optimal solution is obtained when there is no ficiently since the number of the dual variables is still huge. Lagrange multiplier satisfying the condition Eq. (6). This is Osuna et al. [16] propose a decomposition algorithm for the guaranteed by the following theorem. support vector machine learning over large data sets. We Theorem 2. The optimal solution of the optimization prob- generalize this idea to learning with the structured output lem in Eq. (4) is achieved if and only if the condition Eq. (6) space. We decompose the constraints of the optimization does not hold true. problem Eq. (2) into two sets: the working set B and the nonactive set N. The Lagrange multipliers are also corre- ˆ Proof. If the optimal solution µ is achieved, the condi- spondingly partitioned into two parts µB and µN . We are ˆ tion Eq. (6) must not hold true. Otherwise, µ is not op- interested in the subproblem defined only for the dual vari- timal according to the Theorem 1. To prove in the reverse
  • direction, we consider the Karush-Kuhn-Tucker (KKT) con- In the recent literature, there are also other methods at- ditions [5] of the optimization problem Eq. (4). tempting to reduce the number of the constraints. Taskar et al. [19] reduce the number of the constraints by consider- Dµ − S + A⊤ γ − π = 0 ing the dual of the loss-augmented problem. However, the γ ⊤ (C − Aµ) = 0 number of the constraints in their approach is still large for π⊤µ = 0 a large structured output space and a large training set. They do not use the fact that only some of the constraints γ 0 are active in the optimization problem. Tsochantaridis et π 0 al. [21] also propose a cutting plane algorithm which finds a small set of active constraints. One issue of this algorithm is For the optimization problem Eq. (4), the KKT conditions that it needs to compute the most violated constraint which provide necessary and sufficient conditions for optimality. would involve another optimization problem in the output One can check that the condition Eq. (6) violates the KKT space. In EMML, instead of selecting the most violated con- conditions. On the other hand, one can check that the KKT straint, we arbitrarily select a constraint which violates the conditions are satisfied when the condition Eq. (6) does not optimality condition of the optimization problem. Thus, the hold true. Therefore, the optimal solution is achieved when selection of the constraint does not involve any optimization the condition Eq. (6) does not hold true. problem. Therefore, EMML is much more efficient in learn- The above theorems suggest the Enhanced Max Margin ing with a much faster convergence rate. Learning (EMML) algorithm listed in Algorithm 1. The correctness (convergence) of EMML algorithm is provided 5. MULTIMODAL DATA MINING by Theorem 3. The solution to the Lagrange dual problem makes it pos- sible to capture the semantic relationships among different Algorithm 1 EMML Algorithm data modalities. We show that the developed EMML frame- Input: n labeled examples, dual variable set µ. work can be used to solve for the general multimodal data Output: Optimized µ mining problem in all the scenarios. Specifically, given a 1: procedure training data set, we immediately obtain the direct rela- 2: Arbitrarily decompose µ into two sets: µB and µN . tionship between the VRep space and the word space using 3: Solve the subproblem in Eq. (5) defined by the vari- the EMML framework in Algorithm 1. Given this obtained ables in µB . direct relationship, we show below that all the multimodal 4: While there exists µi,¯ ∈ µB such that µi,¯ = 0, y y data mining scenarios concerned in this paper can be facili- move it to set µN . tated. 5: While there exists µi,¯ ∈ µN satisfying condition y Eq. (6), move it to set µB . If no such µi,¯ ∈ µN exists, y 5.1 Image Annotation the iteration exits. Image annotation refers to generating annotation words 6: Goto Step 3. for a given image. First we partition the test image into 7: end procedure blocks and compute the feature vector in the feature space for each block. We then compute the similarity between feature vectors and the VReps in terms of the distance. We Theorem 3. EMML algorithm converges to the global op- return the top n most-relevant VReps. For each VRep, we timal solution in a finite number of iterations. compute the score between this VRep and each word as the Proof. This is the direct result from Theorems 1 and 2. function f in Eq. (1). Thus, for each of the top n most Step 3 in Algorithm 1 strictly decreases the objective func- relevant VReps, we have the ranking-list of words in terms tion of Eq. (4) at each iteration and thus the algorithm does of the score. We then merge these n ranking-lists and sort not cycle. Since the objective function of Eq. (4) is convex them to obtain the overall ranking-list of the whole word and quadratic, and the feasible solution region is bounded, space. Finally, we return the top m words as the annotation the objective function is bounded. Therefore, the algorithm result. must converge to the global optimal solution in a finite num- In this approach, the score between the VReps and the ber of iterations. words can be computed in advance. Thus, the computa- tion complexity of image annotation is only related to the Note that in Step 5, we only need find one dual variable number of the VReps. Under the assumption that all the im- satisfying Eq. (6). We need examine all the dual variables ages in the image database follow the same distribution, the in the set µN only when no dual variable satisfies Eq. (6). number of the VReps is independent of the database scale. It is fast to examine the dual variables in the set µN even if Therefore, the computation complexity in this approach is the number of the dual variables is large. O(1) which is independent of the database scale. 4.2 Comparison with other methods 5.2 Word Query In the max margin optimization problem Eq. (2), only Word query refers to generating corresponding images in some of the constraints determine the optimal solution. We response to a query word. For a given word input, we com- call these constraints active constraints. Other constraints pute the score between each VRep and the word as the func- are automatically met as long as these active constraints are tion f in Eq. (1). Thus, we return the top n most relevant valid. EMML algorithm uses this fact to solve the optimiza- VReps. Since for each VRep, we compute the similarity tion problem by substantially reducing the number of the between this VRep and each image in the image database dual variables in Eq. (3). in terms of the distance, for each of those top n most rele-
  • (a) (b) Figure 2: An example of 3-stage image-to-image in- ferencing. vant VReps, we have the ranking-list of images in terms of the distance. Then we merge these n ranking-lists and sort them to obtain the overall ranking-list in the image space. Finally, we return the top m images as the query result. For each VRep, the similarity between this VRep and each image in the image database can be computed in advance. Similar to the analysis in Sec. 5.1, the computation com- plexity is only related to the number of the VReps, which is O(1). 5.3 Image Retrieval Image retrieval refers to generating semantically similar Figure 3: An illustrative diagram for image-to- images to a query image. Given a query image, we annotate image across two stages inferencing it using the procedure in Sec. 5.1. In the image database, for each annotation word j there are a subset of images Sj in which this annotation word appears. We then have the both correspond to the same gene as a result of the image- union set S = ∪j Sj for all the annotation words of the query to-image inferencing between the two stages1 . However, it image. is clear that they exhibit a very large visual dissimilarity. On the other hand, for each annotation word j of the Consequently, it is not appropriate to use any pure visual query image, the word query procedure in Sec. 5.2 is used feature based similarity retrieval method to identify such to obtain the related sorted image subset Tj from the image image-to-image correspondence across stages. Furthermore, database. We then merge these subsets Tj to form the sorted we also expect to have the word-to-image and image-to-word image set T in terms of their scores. The final image retrieval inferencing capabilities across different stages, in addition to result is R = S ∩ T . the image-to-image inferencing. In this approach, the synergy between the image space and Given this consideration, this is exactly where the pro- the word space is exploited to reduce the semantic gap based posed approach for multimodal data mining can be applied on the developed learning approach. Since the complexity to complement the existing pure retrieval based methods to of the retrieval methods in Secs. 5.1 and 5.2 are both O(1), identify such correspondence. Typically in such a fruit fly and since these retrievals are only returned for the top few embryo image database, there are textual words for anno- items, respectively, finding the intersection or the union is tation to the images in each stage. These annotation words O(1). Consequently, the overall complexity is also O(1). in one stage may or may not have the direct semantic cor- respondence to the images in another stage. However, since 5.4 Multimodal Image Retrieval the data in all the stages are from the same fruit fly embryo The general scenario of multimodal image retrieval is a image database, the textual annotation words between two query as a combination of a series of images and a series of different stages share a semantic relationship which can be words. Clearly, this retrieval is simply a linear combination obtained by a domain ontology. of the retrievals in Secs. 5.2 and 5.3 by merging the retrievals In order to apply our approach to this across-stage infer- together based on their corresponding scores. Since each encing problem, we treat each stage as a separate multime- individual retrieval is O(1), the overall retrieval is also O(1). dia database, and map the across-stage inferencing problem to a retrieval based multimodal data mining problem by ap- 5.5 Across-Stage Inferencing plying the approach to the two stages such that we take the multimodal query as the data from one stage and pose the For a fruit fly embryo image database such as the Berke- query to the data in the other stage for the retrieval based ley Drosophila embryo image database which is used for our multimodal data mining. Figure 3 illustrates the diagram experimental evaluations, we have embryo images classified of the two stages (state i and state j where i = j) image-to- in advance into different stages of the embryo development image inferencing. with separate sets of textual words as annotation to those Clearly, in comparison with the retrieval based multi- images in each of these stages. In general, images in different modal data mining analyzed in the previous sections, the stages may or may not have the direct semantic correspon- only additional complexity here in across-stage inferencing dence (e.g., they all correspond to the same gene), not even speaking that images in different stages may necessarily ex- 1 The Berkeley Drosophila embryo image database is given hibit any visual similarity. Figure 2 shows an example of a in such a way that images from several real stages are mixed pair of identified embryo images at stages 9-10 (Figure 2(a)) together to be considered as one “stage”. Thus, stages 9-10 and stages 13-16 (Figure 2(b)), respectively, in which they are considered as one stage, and so are stages 13-16.
  • is the inferencing part using the domain ontology in the word space. Typically this ontology is small in scale. In fact, in Table 1: Comparison of scalability our evaluations for the Berkeley Drosophila embryo image Database Size 50 100 150 database, this ontology is handcrafted and is implemented EMML 1 1 1 as a look-up table for word matching through an efficient MBRM 1 2.2 3.3 hashing function. Thus, this part of the computation may be ignored. Consequently, the complexity of the across-stage inferencing based multimodal data mining is the same as 0.24 EMML Image Annotation Average Precision(n) that of the retrieval based multimodal data mining which is MBRM 1.0 Image Annotation Average Recall(n) 0.22 independent of database scale. 0.20 0.8 6. EMPIRICAL EVALUATIONS 0.18 0.16 0.6 While EMML is a general learning framework, and it can also be applied to solve for a general multimodal data min- 0.14 ing problem in any application domains, for the evaluation 0.12 0.4 purpose, we apply it to the Berkeley Drosophila embryo im- 0.10 age database [1] for the multimodal data mining task de- 0.08 0.2 fined in this paper. We evaluate this approach’s perfor- 0.06 mance using this database for both the retrieval based and 0.0 0 10 20 30 40 50 the across-stage inferencing based multimodal data mining Top(n) scenarios. We compare this approach with a state-of-the-art multimodal data mining method MBRM [11] for the mining performance. In this image database, there are in total 16 stages of the Figure 4: Precisions and Recalls of image annotation embryo images archived in six different folders with each between EMML and MBRM (the solid lines are for folder containing two to four real stages of the images; there precisions and the dashed lines are for recalls) are in total 36,628 images and 227 words in all the six folders; not all the images have annotation words. For the retrieval based multimodal data mining evaluations, we use the fifth parison with MBRM model. Figure 6 reports the precisions folder as the multimedia database, which corresponds to and recalls averaged over 1648 queries for image retrieval in stages 11 and 12. There are about 5,500 images that have comparison with MBRM model. annotation words and there are 64 annotation words in this For the 2-stage inferencing, Figure 7 reports the precisions folder. We split the whole folder’s images into two parts (one and recalls averaged over 1648 queries for image-to-word in- third and two thirds), with the two thirds used in the train- ferencing in comparison with MBRM model, and Figure 8 ing and the one third used in the evaluation testing. For reports the precisions and recalls averaged over 64 queries the across-stage inferencing based multimodal data mining for word-to-image inferencing in comparison with MBRM evaluations, we use the fourth and the fifth folders for the model. Figure 9 reports the precisions and recalls averaged two stages inferencing evaluations, and use the third, the over 1648 queries for image-to-image inferencing in compari- fourth and the fifth folders for the three stages inferencing son with MBRM model. Finally, for the 3-stage inferencing, evaluations. Consequently, each folder here is considered as Figure 10 reports precisions and recalls averaged over 1100 a “stage” in the across-stage inferencing based multimodal queries for image-to-image inferencing in comparison with data mining evaluations. In each of the inferencing scenar- MBRM model. ios, we use the same split as we do in the retrieval based In summary, there is no single winner for all the cases. multimodal data mining evaluations for training and test- Overall, EMML outperforms MBRM substantially in the ing. scenarios of word query and image retrieval, and slightly In order to facilitate the across-stage inferencing capabil- in the scenario of 2-stage word-to-image inferencing and ities, we handcraft the ontology of the words involved in 3-stage image-to-image inferencing. On the other hand, the evaluations. This is simply implemented as a simple MBRM has a slight better performance than EMML in the look-up table indexed by an efficient hashing function. For scenario of 2-stage image-to-word inferencing. For all other example, cardiac mesoderm primordium in the fourth folder scenarios the two methods have a comparable performance. is considered as the same as circulatory system in the fifth In order to demonstrate the strong scalability of EMML folder. With this simple ontology and word matching, the approach to multimodal data mining, we take image anno- proposed approach may be well applied to this across-stage tation as a case study and compare the scalability between inferencing problem for the multimodal data mining. EMML and MBRM. We randomly select three subsets of the The EMML algorithm is applied to obtain the model pa- embryo image database in different scales (50, 100, 150 im- rameters. In the figures below, the horizonal axis denotes ages, respectively), and apply both methods to the subsets the number of the top retrieval results. We investigate the to measure the query response time. The query response performance from top 2 to top 50 retrieval results. Fig- time is obtained by taking the average response time over ure 4 reports the precisions and recalls averaged over 1648 1648 queries. Since EMML is implemented in MATLAB queries for image annotation in comparison with MBRM environment and MBRM is implemented in C in Linux en- model where the solid lines are for precisions and the dashed vironment, to ensure a fair comparison, we report the scala- lines are for recalls. Similarly, Figure 5 reports the precisions bility as the relative ratio of a response time to the baseline and recalls averaged over 64 queries for word query in com- response time for the respective methods. Here the baseline
  • response time is the response time to the smallest scale sub- set (i.e., 50 images). Table 1 documents the scalability com- 0.10 parison. Clearly, MBRM exhibits a linear scalability w.r.t Single Word Query Average Precision(n) EMML 0.035 the database size while that of EMML is constant. This is 0.09 MBRM Single Word Query Average Recall(n) 0.030 consistent with the scalability analysis in Sec. 5. 0.08 In order to verify the fast learning advantage of EMML 0.07 0.025 in comparison with the existing max margin based learn- 0.020 ing literature, we have implemented one of the most re- 0.06 cently proposed max margin learning methods by Taskar 0.05 0.015 et al. [19]. For the reference purpose, in this paper we 0.04 0.010 call this method as TCKG. We have applied both EMML 0.03 and TCKG to a small data set randomly selected from the 0.005 whole Berkeley embryo database, consisting of 110 images 0.02 0.000 along with their annotation words. The reason we use this 0 10 20 30 40 50 small data set for the comparison is that we have found that Top(n) in MATLAB platform TCKG immediately runs out of mem- ory when the data set is larger, due to the large number of the constraints, which is typical for the existing max margin learning methods. Under the environment of 2.2GHz CPU Figure 5: Precisions and Recalls of word query be- and 1GB memory, TCKG takes about 14 hours to complete tween EMML and MBRM the learning for such a small data set while EMML only takes about 10 minutes. We have examined the number of the constraints reduced in both methods during their exe- 0.9 cutions for this data set. EMML has reduced the number EMML 0.035 of the constraints in a factor of 70 times more than that Image Retrieval Average Precision(n) MBRM 0.8 Image Retrieval Average Recall(n) reduced by TCKG. This explains why EMML is about 70 0.030 times faster than TCKG in learning for this data set. 0.7 0.025 0.6 0.020 7. CONCLUSION 0.5 We have developed a new max margin learning framework 0.015 — the enhanced max margin learning (EMML), and applied 0.4 0.010 it to developing an effective and efficient multimodal data 0.3 0.005 mining solution. EMML attempts to find a small set of active constraints, and thus is more efficient in learning than 0.2 0.000 the existing max margin learning literature. Consequently, 0 10 20 30 40 50 it has a much faster convergence rate which is verified in Top(n) empirical evaluations. The multimodal data mining solution based on EMML is highly scalable in the sense that the query response time is independent of the database scale. Figure 6: Precisions and Recalls of image retrieval This advantage is also supported through the complexity between EMML and MBRM analysis as well as empirical evaluations. While EMML is a general learning framework and can be used for general multimodal data mining, for the evaluation purpose, we have applied it to the Berkeley Drosophila embryo image database 0.32 and have reported the evaluations against a state-of-the-art EMML 2-stage inferencing Image Annotation Average Precision(n) MBRM 2-stage inferencing 1.0 multimodal data mining method. Image Annotation Average Recall(n) 0.28 0.8 8. ACKNOWLEDGEMENT 0.24 This work is supported in part by NSF (IIS-0535162, EF- 0.6 0331657), AFRL (FA8750-05-2-0284), AFOSR (FA9550-06- 0.20 1-0327), and a CAREER Award to EPX by the National 0.4 Science Foundation under Grant No. DBI-054659. 0.16 0.2 9. REFERENCES 0.12 [1] http://www.fruitfly.org/. 0.0 [2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden 0 10 20 30 40 50 markov support vector machines. In Proc. ICML, Top(n) Washington DC, 2003. [3] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan. Matching words and Figure 7: Precisions and Recalls of 2-stage image to pictures. Journal of Maching Learning Research, word inferencing between EMML and MBRM 3:1107–1135, 2003.
  • [4] D. Blei and M. Jordan. Modeling annotated data. In Proceedings of the 26th annual International ACM 0.18 0.030 EMML 2-stage inferencing SIGIR Conference on Research and Development in Single Word Query Average Precision(n) MBRM 2-stage inferencing Information Retrieval, pages 127–134, 2003. Single Word Query Average Recall(n) 0.16 0.025 [5] S. Boyd and L. Vandenberghe. Convex Optimization. 0.14 0.020 Cambridge University Press, 2004. 0.12 [6] U. Brefeld and T. Scheffer. Semi-supervised learning 0.015 for structured output variables. In Proc. ICML, 0.10 0.010 Pittsburgh, PA, 2006. 0.08 [7] E. Chang, K. Goh, G. Sychay, and G. Wu. Cbsa: 0.005 content-based soft annotation for multimodal image 0.06 retrieval using bayes point machines. IEEE Trans. on 0.000 Circuits and Systems for Video Technology, 13:26–38, 0.04 0 10 20 30 40 50 Jan 2003. Top(n) [8] W. Chu, Z. Ghahramani, and D. L. Wild. A graphical model for protein secondary structure prediction. In Proc. ICML, Banff, Canada, 2004. Figure 8: Precisions and Recalls of 2-stage word to [9] R. Datta, W. Ge, J. Li, and J. Z. Wang. Toward image inferencing between EMML and MBRM bridging the annotation-retrieval gap in image search by a generative modeling approach. In Proc. ACM Multimedia, Santa Barbara, CA, 2006. [10] P. Duygulu, K. Barnard, N. de Freitas, and 0.45 0.040 D. Forsyth. Object recognition as machine translation: EMML 2-stage inferencing MBRM 2-stage inferencing Learning a lexicon for a fixed image vocabulary. In Image Retrieval Average Precision(n) 0.035 0.40 Seventh European Conference on Computer Vision, Image Retrieval Average Recall(n) 0.35 0.030 volume IV, pages 97–112, 2002. 0.025 [11] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple 0.30 bernoulli relevance models for image and video 0.020 0.25 annotation. In International Conference on Computer 0.015 Vision and Pattern Recognition, Washington DC, 0.20 0.010 2004. 0.15 [12] Y. Freund and R. E. Schapire. Large margin 0.005 classification using the perceptron algorithm. In 0.10 0.000 Maching Learning, volume 37, 1999. 0 10 20 30 40 50 [13] H. D. III and D. Marcu. Learning as search Top(n) optimization: Approximate large margin methods for structured prediction. In Proc. ICML, Bonn, Germany, 2005. Figure 9: Precisions and Recalls of 2-stage image to [14] J. Lafferty, A. McCallum, and F. Pereira. Conditional image inferencing between EMML and MBRM random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001. [15] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy markov models for information extraction and 0.50 segmentation. In Proc. ICML, 2000. EMML 3-stage inferencing 0.48 [16] E. Osuna, R. Freund, and F. Girosi. An improved Image Retrieval Average Precision(n) MBRM 3-stage inferencing 0.04 training algorithm for support vector machines. In Image Retrieval Average Recall(n) 0.46 0.44 Proc. of IEEE NNSP’97, Amelia Island, FL, Sept. 0.03 1997. 0.42 [17] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu. 0.40 0.02 Automatic multimedia cross-modal correlation 0.38 discovery. In Proceedings of the 10th ACM SIGKDD 0.36 0.01 Conference, Seattle, WA, 2004. 0.34 [18] A. W. M. Smeulders, M. Worring, S. Santini, 0.32 0.00 A. Gupta, and R. Jain. Content-based image retrieval 0.30 at the end of the early years. IEEE Trans. on Pattern 0 10 20 30 40 50 Analysis and Machine Intelligence, 22:1349–1380, Top(n) 2000. [19] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A Figure 10: Precisions and Recalls of 3-stage image large margin approach. In Proc. ICML, Bonn, to image inferencing between EMML and MBRM Germany, 2005.
  • [20] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Neural Information Processing Systems Conference, Vancouver, Canada, 2003. [21] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proc. ICML, Banff, Canada, 2004. [22] V. N. Vapnik. The nature of statistical learning theory. Springer, 1995. [23] Y. Wu, E. Y. Chang, and B. L. Tseng. Multimodal metadata fusion using causal strength. In Proc. ACM Multimedia, pages 872–881, Hilton, Singapore, 2005.