Paper Introduction: Fast Image Tagging
Presentation Transcript

  • Fast Image Tagging. M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.). ICML 2013. Presented at the ICML 2013 reading group, 2013-07-09. Preferred Infrastructure, Inc. Takashi Abe <tabe@preferred.jp>
  • About the presenter: Takashi Abe (阿部厳). Twitter: @tabe2314. Okatani Lab, Tohoku University (computer vision) → intern at PFI → new PFI employee.
  • Paper introduced: M. Chen, A. Zheng and K. Weinberger. Fast Image Tagging. ICML, 2013. Note: all figures and tables in these slides are taken from this paper.
  • Image Tagging (1)
     Task: predict the set of relevant tags (usually several) for a given image
     Training: input: {(image, tags), …}; output: a mapping image → tag set
     Testing: input: an image; output: the predicted tag set
     (Figure: training/testing examples, e.g. an image tagged "bear, polar, snow, tundra" and a street scene tagged "buildings, clothes, shops, street")
  • Image Tagging (2): What makes it hard?
     Effective features differ by object (e.g. color vs. edges) → we want to combine many feature types
     Large variation in appearance → we want to train on large datasets
     Incomplete annotation data
       Precision aside, only low-recall tag data is available
       (i.e., the available tags are the true tags with some entries dropped)
       Example: Flickr tags
     Highly skewed tag frequencies
  • FastTag
  • Basic idea
     Jointly complete the annotated tag sets and learn a (linear) mapping from images to the completed tags
     B: observed tag set → completed tag set
     W: image features → completed tag set
     Training: train both classifiers simultaneously and force their outputs to agree by minimizing

       (1/n) Σ_{i=1}^{n} ||B yᵢ − W xᵢ||²   (1)

     where yᵢ is the observed (incomplete) tag vector of the i-th training image and each row of W is a linear classifier predicting one (completed) tag from image features
     At testing time, a simple linear mapping x → W x predicts the tags (Figure 1 of the paper gives a schematic illustration)
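As a concrete illustration, the agreement objective of Eq. (1) takes only a few lines of NumPy. This is a hedged toy sketch: the dimensions, random data, and starting values of B and W are made-up choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): n images, d feature dimensions, T tags.
n, d, T = 100, 20, 10
X = rng.normal(size=(d, n))            # image features, one column per image
Y = (rng.random((T, n)) < 0.3) * 1.0   # observed (incomplete) tag indicators

def joint_loss(B, W, X, Y):
    """(1/n) * sum_i ||B y_i - W x_i||^2 -- the agreement term, Eq. (1)."""
    R = B @ Y - W @ X
    return (R ** 2).sum() / Y.shape[1]

B = np.eye(T)                      # identity: no enrichment yet
W = rng.normal(size=(T, d)) * 0.01
print(joint_loss(B, W, X, Y))
```

Note that `joint_loss` is zero at B = 0 = W, which is exactly the trivial solution the slide says must be ruled out by regularizing B.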
  • Marginalized blank-out regularization (1)
     Minimizing (1) alone has the trivial solution B = 0 = W, so B must be constrained
     We want B to map the annotation yᵢ to the true tag set zᵢ
     zᵢ is unavailable, so instead: create a corrupted version ỹᵢ by removing (setting to zero) each entry of yᵢ independently with probability p ≥ 0, i.e. p(ỹₜ = 0) = p and p(ỹₜ = yₜ) = 1 − p, and train B to reconstruct yᵢ from ỹᵢ:

       B = argmin_B (1/n) Σ_{i=1}^{n} ||yᵢ − B ỹᵢ||²

     Each row of B is then an ordinary least-squares regressor predicting the presence of one tag from all the tags present in ỹ
     To reduce variance in B, consider repeatedly sampling ỹ; in the limit (infinitely many corrupted versions of y), the expected reconstruction error under the corrupting distribution is

       r(B) = (1/n) Σ_{i=1}^{n} E[ ||yᵢ − B ỹᵢ||² ]_{p(ỹᵢ|yᵢ)}   (2)
  • Marginalized blank-out regularization (2)
     Let Y ≡ [y₁, …, yₙ] be the matrix of observed tags (one column per image), and define P ≡ Σᵢ yᵢ E[ỹᵢ]ᵀ and Q ≡ Σᵢ E[ỹᵢ ỹᵢᵀ]. Then (2) can be rewritten as

       r(B) = (1/n) trace(B Q Bᵀ − 2 P Bᵀ + Y Yᵀ)   (3)

     For the uniform blank-out noise above, E[ỹ] = (1 − p) y, so P and Q have closed forms: there is no need to actually sample ỹ (only p must be chosen)
     This expected reconstruction error is added to the loss as a regularizer on B
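The marginalization trick can be checked numerically: the closed-form expectation (3) should match the average reconstruction error over many explicitly sampled corruptions. A sketch under assumed toy data (the moment formulas for blank-out noise, E[ỹ] = (1−p)y and E[ỹỹᵀ] = (1−p)² yyᵀ + p(1−p) diag(y²), follow from entry-wise independence):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, p = 5, 50, 0.5
Y = (rng.random((T, n)) < 0.4) * 1.0   # toy binary annotations

# Closed-form corruption moments for blank-out noise with probability p:
P = (1 - p) * (Y @ Y.T)                                   # sum_i y_i E[y~_i]^T
Q = (1 - p) ** 2 * (Y @ Y.T) + p * (1 - p) * np.diag((Y * Y).sum(axis=1))

def expected_loss(B):
    """r(B) = (1/n) trace(B Q B^T - 2 P B^T + Y Y^T) -- Eq. (3)."""
    return np.trace(B @ Q @ B.T - 2 * P @ B.T + Y @ Y.T) / n

def monte_carlo_loss(B, n_samples=5000):
    """Average (1/n) sum_i ||y_i - B y~_i||^2 over sampled corruptions."""
    total = 0.0
    for _ in range(n_samples):
        Yc = Y * (rng.random(Y.shape) >= p)   # blank out each entry w.p. p
        total += ((Y - B @ Yc) ** 2).sum() / n
    return total / n_samples

B = rng.normal(size=(T, T)) * 0.3
print(expected_loss(B), monte_carlo_loss(B))  # should agree closely
```

This is exactly why the slide says only p has to be chosen: the corrupted copies never need to be materialized.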
  • Optimization
     With B fixed, the W minimizing the objective (Eq. (5) in the paper) has a closed-form solution
     Likewise, with W fixed, the optimal B is given in closed form
     Alternating these updates converges to the global optimum (the objective is jointly convex)
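The alternating scheme can be sketched end-to-end. This is a hedged reconstruction, not the paper's exact Eq. (5): the ridge weight `lam` on W and the blank-out weight `gamma` are hypothetical choices, and the two closed-form updates below are derived by setting the gradient of the assumed objective to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, n, p = 5, 8, 60, 0.5           # toy sizes (hypothetical)
lam, gamma = 0.1, 1.0                # hypothetical regularization weights
X = rng.normal(size=(d, n))                  # image features
Y = (rng.random((T, n)) < 0.4) * 1.0         # observed tags

# Closed-form blank-out moments (no sampling needed)
P = (1 - p) * (Y @ Y.T)
Q = (1 - p) ** 2 * (Y @ Y.T) + p * (1 - p) * np.diag((Y * Y).sum(axis=1))

def loss(B, W):
    fit = ((B @ Y - W @ X) ** 2).sum() / n                    # agreement term
    r_B = np.trace(B @ Q @ B.T - 2 * P @ B.T + Y @ Y.T) / n   # blank-out term
    return fit + gamma * r_B + lam * (W ** 2).sum()

B, W = np.eye(T), np.zeros((T, d))
losses = [loss(B, W)]
for _ in range(20):
    # W-step: ridge regression of the enriched tags B @ Y onto X
    W = (B @ Y @ X.T) @ np.linalg.inv(X @ X.T + n * lam * np.eye(d))
    # B-step: zero gradient in B gives  B (Y Y^T + gamma Q) = W X Y^T + gamma P
    B = (W @ X @ Y.T + gamma * P) @ np.linalg.inv(Y @ Y.T + gamma * Q)
    losses.append(loss(B, W))
```

Because each step minimizes its block exactly and the objective is jointly convex, the recorded `losses` are non-increasing.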
  • Extensions
     Tag bootstrapping
       B is learned from tag co-occurrence, so similar tags that rarely co-occur are not completed from each other (e.g. lake and pond)
     Stacking
       Treat B yᵢ as the new annotations and retrain, repeatedly
       Intuition: propagates co-occurrence relations
       The number of stacks is chosen experimentally
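A minimal sketch of the stacking idea, under assumptions: each stage fits only the blank-out enricher (whose minimizer is B = P Q⁻¹, from setting the gradient of Eq. (3) to zero; a small ridge term is added here for invertibility) and then feeds the enriched tags B @ Y back in as the next stage's annotations. The number of stages and all settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 6, 40
Y = (rng.random((T, n)) < 0.3) * 1.0   # toy binary annotations

def enrich_once(Y, p=0.5):
    """One hypothetical stacking stage: closed-form blank-out enricher
    B = P Q^{-1}, then replace the annotations with B @ Y."""
    P = (1 - p) * (Y @ Y.T)
    Q = (1 - p) ** 2 * (Y @ Y.T) + p * (1 - p) * np.diag((Y * Y).sum(axis=1))
    B = P @ np.linalg.inv(Q + 1e-6 * np.eye(Y.shape[0]))
    return B @ Y

Y_enriched = Y
for _ in range(2):                     # number of stacks chosen experimentally
    Y_enriched = enrich_once(Y_enriched)
```

Since B mixes tags through their co-occurrence statistics, repeating the stage lets co-occurrence information propagate beyond directly co-occurring pairs, which matches the intuition on this slide.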
  • Image features
     A combination of several features (the same set as prior work):
       GIST
       6 kinds of color histograms
       Bags of words over 8 kinds of local features
     Features are mapped in advance into a space where the inner product approximates the χ² distance:
       Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
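The explicit feature map for the additive χ² kernel can be sketched directly. This follows the Vedaldi & Zisserman construction in spirit, but the sampling period `L` and number of frequencies `N` below are hypothetical settings, not taken from the paper: the χ² kernel k(x, y) = 2xy/(x+y) equals √(xy)·sech((log x − log y)/2), whose spectrum is κ(ω) = sech(πω), so sampling that spectrum gives a finite map whose inner products approximate the kernel.

```python
import numpy as np

def chi2_feature_map(x, L=0.4, N=5):
    """Explicit feature map for the additive chi^2 kernel (sketch).
    Each nonnegative entry of x becomes 2N+1 coordinates; inner products
    of mapped vectors approximate k(x, y) = sum_t 2 x_t y_t / (x_t + y_t)."""
    x = np.asarray(x, dtype=float)
    kappa = lambda w: 1.0 / np.cosh(np.pi * w)   # spectrum of the chi^2 kernel
    logx = np.log(np.maximum(x, 1e-12))          # guard: zeros get zero weight
    feats = [np.sqrt(x * L * kappa(0.0))]
    for j in range(1, N + 1):
        scale = np.sqrt(2 * x * L * kappa(j * L))
        feats.append(scale * np.cos(j * L * logx))
        feats.append(scale * np.sin(j * L * logx))
    return np.concatenate(feats)

def chi2_kernel(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (2 * x * y / (x + y + 1e-12)).sum()

x = np.array([0.2, 0.5, 0.3])   # e.g. an L1-normalized histogram
y = np.array([0.1, 0.6, 0.3])
print(chi2_feature_map(x) @ chi2_feature_map(y), chi2_kernel(x, y))
```

After this preprocessing, a plain linear model such as W effectively operates with a χ² similarity, which is why the slide says the mapping is applied in advance.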
  • Experimental results
  • Accuracy evaluation
     leastSquares: FastTag without tag completion (the baseline)
     TagProp: the previous state of the art; training is O(n²), testing O(n)
     FastTag's accuracy is roughly on par with TagProp

    Table 1. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the Corel5K dataset, with previously reported results for reference. Higher is better for all metrics; N+ is the number of keywords with non-zero recall.

    Name                                 P   R   F1  N+
    leastSquares                         29  32  30  125
    CRM (Lavrenko et al., 2003)          16  19  17  107
    InfNet (Metzler & Manmatha, 2004)    17  24  20  112
    NPDE (Yavlinsky et al., 2005)        18  21  19  114
    SML (Carneiro et al., 2007)          23  29  26  137
    MBRM (Feng et al., 2004)             24  25  24  122
    TGLM (Liu et al., 2009)              25  29  27  131
    JEC (Makadia et al., 2008)           27  32  29  139
    TagProp (Guillaumin et al., 2009)    33  42  37  160
    FastTag                              32  43  37  166

    Table 2. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the ESP game and IAPRTC-12 datasets.

                    ESP game            IAPR
    Name            P   R   F1  N+      P   R   F1  N+
    leastSquares    35  19  25  215     40  19  26  198
    MBRM            18  19  18  209     24  23  23  223
    JEC             24  19  21  222     29  19  23  211
    TagProp         39  27  32  238     45  34  39  260
    FastTag         46  22  30  247     47  26  34  280

    From the paper (Sec. 4.2): FastTag's performance aligns with that of TagProp (so far the most accurate algorithm on these datasets) and significantly outperforms the other methods. The leastSquares baseline, i.e. FastTag without the tag enricher, performs surprisingly well compared to existing approaches, suggesting the advantage of a simple model that can use many visual descriptors over a complex model that can afford fewer. TagProp is compared against directly because most existing work does not provide publicly available implementations.
  • Maximum number of tags (figure)
  • Example predictions (figure)
    Figure 2 of the paper: predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords), grouped into high-F1-score, low-F1-score and random examples, e.g. "bug, green, insect, tree, wood" or "blue, cloud, ocean, sky, water".