
[DL輪読会]Generative Models of Visually Grounded Imagination


Published on 2017/6/2
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Published in: Technology


  1. 1. Generative Models of Visually Grounded Imagination (reading-group presenter: Masahiro Suzuki, 2017/06/02)
  2. 2. About this paper
     ¤ Authors: Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, Kevin Murphy
        ¤ The first author is at Virginia Tech; the others are at Google.
        ¤ Murphy is also well known as the author of the textbook "Machine Learning".
     ¤ Posted to arXiv on 2017/05/30; under submission to NIPS.
     ¤ Contributions:
        ¤ A deep generative model for attribute-to-image generation in which the inference distribution over attributes is a product of experts.
        ¤ A new set of metrics (the 3 C's) for measuring how well attribute-to-image generation works.
     ¤ Why I picked it:
        ¤ The proposed method is almost identical to JMVAE [Suzuki+ 2017] (which is cited and compared against in the paper).
        ¤ Grounding images and attribute information is an interesting topic (it is, in fact, my own research topic).
  3. 3. Background
     ¤ When people hear "big bird", they imagine some image.
     ¤ Those imagined images vary, because details not specified by the text get filled in.
     ¤ Adding attributes, as in "big red bird", defines a more precise concept.
     (Figure: imagined samples for "big bird" versus "big red bird".)
  4. 4. Attribute-based description of concepts
     ¤ There are many ways to describe a concept; this work focuses on attribute-based descriptions.
     ¤ A concept is defined by one or more attributes.
     ¤ Attributes can be combined into a compositional concept hierarchy to design new concepts.
     ¤ Given a linguistic description, the proposed model "imagines" a concept from the concept hierarchy and "translates" that intermediate representation into an image.
     ¤ This ability is called visually grounded semantic imagination.
     (Slide quotes Section 1 of the paper: each node of the compositional concept hierarchy is a partial specification of attribute values; children refine a concept by specifying missing attributes, parents abstract it by unspecifying them, and leaf nodes are maximally specific. Unlike WordNet or ImageNet hierarchies, it is created algorithmically by combining primitive attributes in all possible ways, echoing natural language's capacity to "make infinite use of finite means" (Chomsky, 1965). Figure 1 illustrates such a hierarchy over bird size and color.)
  5. 5. The 3 C's: criteria for good "imagination" and "translation"
     ¤ Correctness:
        ¤ Generated images must exhibit the specified attributes.
        ¤ Example: if we specify "red bird", we want only images of red birds to be generated.
     ¤ Coverage:
        ¤ Generated images should be indifferent to the unspecified attributes and should cover all the variation within the concept's extension.
        ¤ Example: for "red bird" we want a variety of red birds to be generated, small ones, large ones, and so on.
     ¤ Compositionality:
        ¤ We want to imagine new concepts by adding or removing attributes we have already learned, and to generalize from concrete concepts to abstract ones.
        ¤ Example: having seen big red birds and small white birds, we can imagine big white birds and small red birds; having seen big red birds and big white birds, we can learn to abstract away color and imagine what a big bird looks like.
  6. 6. Variational autoencoder (VAE)
     ¤ Variational autoencoder [Kingma+ 13]
        ¤ Models the marginal distribution $p(x)$.
        ¤ $p_\theta(x|z)$ and $q_\phi(z|x)$ are each parameterized by deep neural networks.
        ¤ The lower bound $\mathcal{L}(x) = -D_{KL}[q_\phi(z|x)\,\|\,p(z)] + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is maximized.
     ¤ This work extends the VAE to handle both images and attributes.
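To make the bound above concrete, here is a minimal PyTorch sketch of a single-modality VAE objective with a diagonal-Gaussian encoder and a Bernoulli image decoder. This is illustrative only, not the paper's code; `encoder` and `decoder` are hypothetical modules standing in for the networks mentioned on the slide.

```python
import torch
import torch.nn.functional as F

def vae_elbo(x, encoder, decoder):
    """Single-modality ELBO: E_q(z|x)[log p(x|z)] - KL(q(z|x) || N(0, I)).

    `encoder` maps x to (mu, logvar) of q(z|x); `decoder` maps z to Bernoulli
    logits of p(x|z). Both are hypothetical placeholders.
    """
    mu, logvar = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    # Reconstruction term: Bernoulli log-likelihood of the (binarized) pixels
    recon = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="sum") / x.size(0)
    # Analytic KL between the diagonal-Gaussian posterior and the N(0, I) prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon - kl  # maximize this lower bound
```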
  7. 7. Proposed method
     ¤ Let $x$ be the input image, $y$ the attributes, and $z$ their shared intermediate representation (latent variable).
     ¤ Model their multimodal joint distribution (a Joint VAE).
     ¤ Features:
        ¤ Three approximate posteriors $q(z|x,y)$, $q(z|x)$, $q(z|y)$ are introduced so that inference is possible even when a modality is missing.
        ¤ $p(y|z)$ treats the attributes as conditionally independent: $p(y|z) = \prod_k p(y_k|z)$.
        ¤ The triple ELBO is proposed as the lower bound.
     (Figure: graphical model with generative network $p_\theta(x, y|z)$ and inference networks $q_\phi(z|x)$, $q_\phi(z|y)$, $q_\phi(z|x, y)$.)
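As a rough sketch of how these pieces could fit together (an assumption for illustration, not the architecture used in the paper), the container below holds the three inference networks and the two decoders, with p(y|z) factorized into one head per attribute; all submodule names (`enc_xy`, `enc_x`, `enc_y`, `dec_x`, `dec_y_heads`) are hypothetical.

```python
import torch.nn as nn

class JointVAE(nn.Module):
    """Container for the joint (image, attribute) VAE; submodules are placeholders."""

    def __init__(self, enc_xy, enc_x, enc_y, dec_x, dec_y_heads):
        super().__init__()
        self.enc_xy = enc_xy                           # q(z|x, y) -> (mu, logvar)
        self.enc_x = enc_x                             # q(z|x)    -> (mu, logvar)
        self.enc_y = enc_y                             # q(z|y)    -> (mu, logvar)
        self.dec_x = dec_x                             # p(x|z)
        self.dec_y_heads = nn.ModuleList(dec_y_heads)  # one head per attribute k: p(y_k|z)

    def attribute_logits(self, z):
        # p(y|z) = prod_k p(y_k|z): each head returns logits over attribute k's values
        return [head(z) for head in self.dec_y_heads]
```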
  8. 8. Product of experts
     ¤ We also want to infer $z$ when some attributes are missing.
        ¤ However, building a separate inference network for every missingness pattern would require $2^{|\mathcal{A}|}$ networks for an attribute set $\mathcal{A}$.
     ¤ The paper therefore proposes an approach based on a product of experts (PoE) [Hinton, 2002].
     ¤ The approximate posterior over attributes is $q(z|y_\mathcal{O}) \propto p(z) \prod_{k\in\mathcal{O}} q(z|y_k)$, where $\mathcal{O} \subseteq \mathcal{A}$ is the set of observed attributes.
        ¤ If no attributes are specified, the posterior equals the prior.
        ¤ As more attributes are specified, the posterior narrows, specifying a more precise concept.
     ¤ Unlike [Hinton, 2002], this work includes a universal expert $p(z) = N(0, I)$ in the product.
     ¤ The posterior has a closed form: with Gaussian experts $q(z|y_k) = N(z|\mu_k(y_k), C_k(y_k))$, the normalized product is $q(z|y_\mathcal{O}) = N(z|\mu, C)$ with $C^{-1} = \sum_k C_k^{-1}$ and $\mu = C\,(\sum_k C_k^{-1}\mu_k)$, where the sum runs over the observed attributes plus the universal expert.
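The closed-form product quoted above is easy to compute for diagonal Gaussians. The NumPy sketch below is a hedged illustration (the function name and the list-of-experts interface are my own): it always includes the universal expert N(0, I), so with no observed attributes it returns the prior.

```python
import numpy as np

def product_of_gaussian_experts(means, variances, latent_dim):
    """q(z|y_O) ∝ p(z) Π_{k∈O} q(z|y_k) for diagonal Gaussian experts.

    `means` and `variances` hold one 1-D array per observed attribute expert.
    The universal expert p(z) = N(0, I) is always part of the product.
    """
    precision = np.ones(latent_dim)          # start from p(z): unit precision, zero mean
    weighted_mean = np.zeros(latent_dim)
    for mu_k, var_k in zip(means, variances):
        precision += 1.0 / var_k             # C^{-1} = sum_k C_k^{-1}
        weighted_mean += mu_k / var_k        # accumulate C_k^{-1} mu_k
    var = 1.0 / precision                    # C
    mu = var * weighted_mean                 # mu = C (sum_k C_k^{-1} mu_k)
    return mu, var

# With no observed attributes the result is the prior itself:
# product_of_gaussian_experts([], [], latent_dim=2) -> (zeros, ones)
```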
  9. 9. Triple ELBO
     ¤ To train the generative model and the three inference networks, the paper introduces a lower bound called the triple ELBO:
       $\mathcal{L}(x, y) = \mathbb{E}_{q(z|x,y)}[\lambda^{xy}_x \log p(x|z) + \lambda^{yx}_y \log p(y|z)] - KL(q(z|x,y), p(z)) + \mathbb{E}_{q(z|x)}[\lambda^{x}_x \log p(x|z)] - KL(q(z|x), p(z)) + \mathbb{E}_{q(z|y)}[\lambda^{y}_y \log p(y|z)] - KL(q(z|y), p(z))$,
       where $\lambda^{x}_x, \lambda^{y}_y, \lambda^{xy}_x, \lambda^{yx}_y$ are scaling parameters.
     ¤ We want $q(z|y)$ to match the empirical posterior $q(z|D_{xy}) = \frac{1}{|D_{xy}|}\sum_{x\in D_{xy}} q(z|x, y)$, so that it covers all examples of the concept seen so far.
     ¤ However, we do not want it to put mass only on the observed examples (or worse, a single example), so we also want $q(z|y)$ to be as entropic as possible, i.e. close to the $N(0, I)$ prior.
     ¤ The scaling parameters control how strongly each modality shapes the latent space.
        ¤ Since an image carries more information than an attribute, $\lambda^{x}_x$ is kept small.
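A hedged sketch of the triple ELBO of Eq. (5), reusing the hypothetical `JointVAE` container from the earlier sketch; `log_px(z, x)` and `log_py(z, y)` stand for batch-averaged log-likelihoods under p(x|z) and p(y|z), and the default λ values are placeholders rather than the tuned settings reported in the paper.

```python
import torch

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims, averaged over the batch
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

def reparam(mu, logvar):
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def triple_elbo(x, y, model, log_px, log_py,
                lam_x_xy=1.0, lam_y_xy=50.0, lam_x=1.0, lam_y=50.0):
    """Sum of three ELBO-like terms, one per inference network (Eq. 5)."""
    mu_xy, lv_xy = model.enc_xy(x, y)
    mu_x, lv_x = model.enc_x(x)
    mu_y, lv_y = model.enc_y(y)
    z_xy, z_x, z_y = reparam(mu_xy, lv_xy), reparam(mu_x, lv_x), reparam(mu_y, lv_y)
    joint = (lam_x_xy * log_px(z_xy, x) + lam_y_xy * log_py(z_xy, y)
             - gaussian_kl(mu_xy, lv_xy))
    image_only = lam_x * log_px(z_x, x) - gaussian_kl(mu_x, lv_x)
    attr_only = lam_y * log_py(z_y, y) - gaussian_kl(mu_y, lv_y)
    return joint + image_only + attr_only  # maximize
```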
  10. 10. Related work: conditional models
     ¤ Conditional VAEs and conditional GANs
        ¤ Because $x$ and $y$ are treated asymmetrically, they cannot generate in both directions.
        ¤ They cannot handle missing inputs, so they cannot generate partially specified, abstract concepts such as "big bird" as opposed to "big red bird".
     (Figure: graphical model of a conditional VAE with $p_\theta(x|y, z)$ and $q_\phi(z|x, y)$.)
  11. 11. Related work: joint-distribution VAEs
     ¤ JMVAE [Suzuki+ 2017] uses the same model as the triple ELBO approach; only the lower bound differs.
     ¤ The JMVAE lower bound is
       $\mathbb{E}_{q(z|x,y)}[\log p(x,y|z)] - KL(q(z|x,y), p(z)) - \alpha KL(q(z|x,y), q(z|x)) - \alpha KL(q(z|x,y), q(z|y))$.
     ¤ The paper argues that the third term could reduce JMVAE's coverage.
        ¤ Experimentally, however, this problem does not arise.
     ¤ The bi-VCCA [Wang et al., 2016] lower bound is
       $\mu\,(\mathbb{E}_{q(z|x)}[\log p(x,y|z)] - KL(q(z|x), p(z))) + (1-\mu)\,(\mathbb{E}_{q(z|y)}[\log p(x,y|z)] - KL(q(z|y), p(z)))$;
       see the paper for its problems (the $\mathbb{E}_{q(z|y)}[\log p(x,y|z)]$ term makes a single $y$ explain all matching $x$'s, which yields blurry samples and low correctness).
     (Slide also quotes the paper's related-work discussion: conditional models cannot handle missing inputs, heuristics for imputing missing attributes, the risk that powerful decoders ignore $z$ when trained on unpaired data, and the notation $\mathrm{elbo}(a|b; c|d; c|e) = \mathbb{E}_{q(c|d)}[\log p(a|b)] - KL(q(c|d), p(c|e))$ used in Table 1 and Figure 2.)
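For comparison, here is a sketch of the JMVAE (Eq. 11) and bi-VCCA (Eq. 10) objectives under the same assumptions, reusing the hypothetical `reparam` and `gaussian_kl` helpers from the triple ELBO sketch; the default α and μ values are placeholders.

```python
import torch
# `reparam` and `gaussian_kl` are the helpers defined in the triple ELBO sketch above.

def kl_between_diag_gaussians(mu_q, lv_q, mu_p, lv_p):
    """KL( N(mu_q, diag(exp(lv_q))) || N(mu_p, diag(exp(lv_p))) ), averaged over the batch."""
    var_q, var_p = lv_q.exp(), lv_p.exp()
    kl = 0.5 * (lv_p - lv_q + (var_q + (mu_q - mu_p).pow(2)) / var_p - 1.0)
    return torch.mean(torch.sum(kl, dim=1))

def jmvae_objective(x, y, model, log_px, log_py, alpha=1.0):
    """elbo(x,y|z; z|x,y) - alpha*KL(q(z|x,y), q(z|x)) - alpha*KL(q(z|x,y), q(z|y))."""
    mu_xy, lv_xy = model.enc_xy(x, y)
    mu_x, lv_x = model.enc_x(x)
    mu_y, lv_y = model.enc_y(y)
    z = reparam(mu_xy, lv_xy)
    elbo_joint = log_px(z, x) + log_py(z, y) - gaussian_kl(mu_xy, lv_xy)
    return (elbo_joint
            - alpha * kl_between_diag_gaussians(mu_xy, lv_xy, mu_x, lv_x)
            - alpha * kl_between_diag_gaussians(mu_xy, lv_xy, mu_y, lv_y))

def bi_vcca_objective(x, y, model, log_px, log_py, mu_weight=0.7):
    """mu * elbo(x,y|z; z|x) + (1 - mu) * elbo(x,y|z; z|y)."""
    mu_x, lv_x = model.enc_x(x)
    mu_y, lv_y = model.enc_y(y)
    z_x, z_y = reparam(mu_x, lv_x), reparam(mu_y, lv_y)
    elbo_x = log_px(z_x, x) + log_py(z_x, y) - gaussian_kl(mu_x, lv_x)
    elbo_y = log_px(z_y, x) + log_py(z_y, y) - gaussian_kl(mu_y, lv_y)
    return mu_weight * elbo_x + (1.0 - mu_weight) * elbo_y
```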
  12. 12. VAE and friends (Table 1 of the paper)
     Summary of VAE variants: $x$ is some form of image, $y$ some form of annotation; scaling factors on the ELBO terms are omitted. The notation is $\mathrm{elbo}(a|b; c|d; c|e) = \mathbb{E}_{q(c|d)}[\log p(a|b)] - KL(q(c|d), p(c|e))$, with the unconditional prior $p(c)$ used when $p(c|e)$ is omitted.
     Name         Ref                     Model              Objective
     VAE          (Kingma et al., 2014)   p(z)p(x|z)         elbo(x|z; z|x)
     triple ELBO  this paper              p(z)p(x|z)p(y|z)   elbo(x,y|z; z|x,y) + elbo(x|z; z|x) + elbo(y|z; z|y)
     JMVAE        (Suzuki et al., 2017)   p(z)p(x|z)p(y|z)   elbo(x,y|z; z|x,y) - αKL(q(z|x,y), q(z|x)) - αKL(q(z|x,y), q(z|y))
     bi-VCCA      (Wang et al., 2016)     p(z)p(x|z)p(y|z)   μ elbo(x,y|z; z|x) + (1-μ) elbo(x,y|z; z|y)
     JVAE-Pu      (Pu et al., 2016)       p(z)p(x|z)p(y|z)   elbo(x,y|z; z|x) + elbo(x|z; z|x)
     JVAE-Kingma  (Kingma et al., 2014)   p(z)p(y)p(x|z,y)   elbo(x|y,z; z|x,y) + log p(y)
     CVAE-Yan     (Yan et al., 2016)      p(z)p(x|y,z)       elbo(x|y,z; z|x,y)
     CVAE-Sohn    (Sohn et al., 2015)     p(z|x)p(y|x,z)     elbo(y|x,z; z|x,y; z|x)
     CMMA         (Pandey et al., 2017)   p(z|y)p(x|z)       see the paper (not expressible in this notation)
     (Figure 2: graphical models of VAE, JVAE-Kingma, JVAE-Pu, bi-VCCA, and JMVAE/triple ELBO; downward arrows are the generative process, upward dotted arrows the inference process, and black squares are inference factors.)
  13. 13. Related work: disentangled representations
     ¤ Beta-VAE and InfoGAN can learn disentangled representations without using attribute information.
        ¤ Beta-VAE does so by increasing the coefficient on the KL term.
     ¤ However, it is not clear how such methods would learn semantic structure.
     ¤ Some labels or attributes are needed to obtain good representations [Soatto and Chiuso, 2016].
     (Figure: generated data varying rotation, translation, and color along disentangled dimensions.)
  14. 14. Experiments
     ¤ Experiments use datasets derived from MNIST.
        ¤ MNIST-2bit: MNIST images tagged with "small or large" and "even or odd".
        ¤ MNIST-a (MNIST with attributes): affine-transformed MNIST images tagged with class label, location, orientation, and scale.
     ¤ Evaluation uses the 3 C's (formalized on the next slide).
     (Figure 11: example binary MNIST-a images, e.g. "6, small, upright, top-right", "7, small, clockwise, bottom-left".)
     (Slide also reproduces the paper's architecture details: the image decoder $p(x|z)$ follows the standard DCGAN architecture of Radford et al. (2015); the label decoder $p(y|z)$ factorizes as $\prod_{k\in\mathcal{A}} p(y_k|z)$, with each $p(y_k|z)$ a two-layer MLP with 128 hidden units per layer and optional L1 regularization on its first layer; the image-and-label encoder $q(z|x, y)$ processes the image through convolutional layers with 32, 64, 128, 16 feature maps and strides 1, 2, 2, 2 (with batch normalization), processes the labels separately, and concatenates the two before a multi-layer perceptron.)
  15. 15. Formalizing the 3 C's
     ¤ Correctness:
       $\mathrm{correctness}(S, y_\mathcal{O}) = \frac{1}{|S|}\sum_{x\in S}\frac{1}{|\mathcal{O}|}\sum_{k\in\mathcal{O}} \mathbb{I}(\hat{y}(x)_k = y_k)$,
       where $S(y_\mathcal{O}) = \{x^{(n)} \sim p(x|y_\mathcal{O}) : n = 1, \dots, N\}$ is the set of generated images and $\hat{y}(x)$ is the prediction of a pre-trained observation classifier.
     ¤ Coverage:
       $\mathrm{coverage}(S, y_\mathcal{O}) = \frac{1}{|\mathcal{M}|}\sum_{k\in\mathcal{M}} (1 - \mathrm{JS}(p_k, q_k))$,
       where $\mathcal{M} = \mathcal{A} \setminus \mathcal{O}$ is the set of missing attributes.
        ¤ $q_k$ is the empirical distribution over values of attribute $k$ induced by the set $S$.
        ¤ $p_k$ is the true distribution over values of attribute $k$ for all images in the extension of $y_\mathcal{O}$.
        ¤ JS is the Jensen-Shannon divergence, which measures the gap between $q_k$ and $p_k$ (it is symmetric and satisfies $0 \le \mathrm{JS}(p, q) \le 1$).
        ¤ The intent is to measure diversity over the missing attributes while accounting for correlations with the other attributes.
     ¤ Compositionality:
        ¤ iid setting: the same attribute combinations are seen in training and test.
        ¤ comp setting: test on attribute combinations not seen during training.
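A small NumPy sketch of the correctness and coverage metrics as formalized above, assuming the observation classifier's predictions and the per-attribute value distributions are already available as arrays; the function names are mine, not the paper's.

```python
import numpy as np

def correctness(pred_attrs, target_attrs):
    """Fraction of observed attributes recovered by the observation classifier,
    averaged over the generated set S (Eq. 1).

    pred_attrs: array of shape (N, |O|) with predicted attribute values.
    target_attrs: length-|O| array with the specified values y_O.
    """
    return float(np.mean(np.asarray(pred_attrs) == np.asarray(target_attrs)))

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits (log base 2 keeps it within [0, 1])."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def coverage(true_dists, empirical_dists):
    """Mean of 1 - JS(p_k, q_k) over the missing attributes k in M (Eq. 2).

    true_dists / empirical_dists: one distribution (array over attribute values)
    per missing attribute, for p_k and q_k respectively.
    """
    return float(np.mean([1.0 - js_divergence(p_k, q_k)
                          for p_k, q_k in zip(true_dists, empirical_dists)]))
```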
  16. 16. Experiment 1-1: are attributes necessary?
     ¤ Comparison with beta-VAE (a disentangling VAE) on MNIST-2bit.
     ¤ The 2-D latent space is visualized; colors correspond to attribute values.
     ¤ (a): beta-VAE, (b): JVAE.
     ¤ With attribute information, the latent space is separated much more cleanly.
     (Figure 3: benefit of semantic annotations for the latent space; in (a), beta-VAE fit to images without annotations, the red region (large, even digits) is almost nonexistent; (b) shows the Joint VAE with $\lambda^{yx}_y = 50$.)
  17. 17. Experiment 1-2: is the universal expert necessary?
     ¤ Tests whether $p(z)$ is needed in the product of experts.
     ¤ (a): no product of experts, (b): product of experts without $p(z)$, (c): product of experts with $p(z)$.
     ¤ Without $p(z)$, the $q(z|y_k)$ posteriors stretch far along the vertical or horizontal axes (b).
     ¤ With $p(z)$, the posteriors stay within the range of the prior (c).
     (Figure 4: effect of the different inference networks, each shown with its best hyperparameters; with the universal expert, each long, thin Gaussian expert is multiplied by a fixed-size circular expert, giving well-shaped posteriors for both concrete and abstract queries.)
  18. 18. Experiment 1-3: are the scaling parameters necessary?
     ¤ Shows that tuning the λ's matters.
     ¤ Unless $\lambda^{x}_x$ is small, the latent space is not learned well and the $p(x|z)$ generations are poor.
     (Figure 5: impact of the likelihood scaling terms on the latent space; (a) all λ's equal to 1, (b) $\lambda^{x}_x = \lambda^{xy}_x = 1$, $\lambda^{y}_y = \lambda^{yx}_y = 100$, (c) $\lambda^{x}_x = 0.01$, $\lambda^{xy}_x = 1$, $\lambda^{y}_y = \lambda^{yx}_y = 100$.)
  19. 19. Experiment 1-4: interpreting JMVAE
     ¤ Examines what JMVAE learns in the latent space (bi-VCCA is omitted here).
     ¤ (c): α = 0.1, (d): α = 10.
     ¤ With small α, the model relies almost entirely on $q(z|x, y)$ and the variance of $q(z|y_k)$ becomes large (although $p(x|z)$ still generates well).
     ¤ With large α, $q(z|y_k)$ is learned well (although some of the distributions still overlap).
     (Figure 6: effect of hyperparameters on bi-VCCA and JMVAE; (a) bi-VCCA μ = 0.1, (b) bi-VCCA μ = 0.9, (c) JMVAE α = 0.1, (d) JMVAE α = 10.)
  20. 20. Experiment 2-1: results on MNIST-a
     ¤ triple ELBO, JMVAE, and bi-VCCA are compared on MNIST-a.
     ¤ The scaling hyperparameters were tuned on a validation set (search details on the next slide).
     ¤ bi-VCCA does badly across the board.
     ¤ triple ELBO outperforms JMVAE.
     Table 2 of the paper (MNIST-a test set; higher is better; parentheses give the standard error of the mean; for concrete concepts, where all 4 attributes are specified, no PoE inference network is used and coverage is not reported):
     Method        #Attr  Coverage (%)   Correctness (%)  PoE?  Training set
     triple ELBO   4      -              90.76 (0.11)     N     iid
     JMVAE         4      -              86.38 (0.14)     N     iid
     bi-VCCA       4      -              80.57 (0.26)     N     iid
     triple ELBO   3      90.76 (0.21)   77.79 (0.30)     Y     iid
     JMVAE         3      89.99 (0.20)   79.30 (0.26)     Y     iid
     bi-VCCA       3      85.60 (0.34)   75.52 (0.43)     Y     iid
     triple ELBO   2      90.58 (0.17)   80.10 (0.47)     Y     iid
     JMVAE         2      89.55 (0.30)   77.32 (0.44)     Y     iid
     bi-VCCA       2      85.75 (0.32)   75.98 (0.78)     Y     iid
     triple ELBO   1      91.55 (0.05)   81.90 (0.48)     Y     iid
     JMVAE         1      89.50 (0.09)   81.06 (0.23)     Y     iid
     bi-VCCA       1      87.77 (0.10)   76.33 (0.67)     Y     iid
     triple ELBO   4      -              83.10 (0.07)     N     comp
     JMVAE         4      -              79.34 (0.52)     N     comp
     bi-VCCA       4      -              75.18 (0.51)     N     comp
  21. 21. My take as the JMVAE author
     ¤ After hyperparameter tuning, JMVAE ends up with α = 1 in every experiment, whereas triple ELBO uses a different value of $\lambda^{y}_y$ in each experiment.
        ¤ If anything, triple ELBO looks the more hyperparameter-sensitive method, and harder to tune.
     ¤ If JMVAE's α were set larger (say 10), JMVAE would probably come out ahead.
        ¤ The paper goes out of its way to note that α ∈ {0.01, 0.1, 1.0} is "the same set of values used in (Suzuki et al., 2017)".
     (The paper's search spaces: label-likelihood weight $\lambda^{yx}_y \in \{1, 10, 50\}$ with $\lambda^{xy}_x = 1$ fixed, PoE on or off for $q(z|y)$; JMVAE α ∈ {0.01, 0.1, 1.0}; bi-VCCA μ ∈ {0.3, 0.5, 0.7}; triple ELBO $\lambda^{y}_y \in \{1, 50, 100\}$ with $\lambda^{x}_x = \lambda^{xy}_x = 1$. All methods therefore have the same number of hyperparameters. The best values are chosen on the validation set: by correctness on concrete validation concepts for concrete test concepts, and by coverage (near-ties broken by correctness) on abstract validation concepts for abstract test concepts.)
     Table 3 of the paper (hyperparameters behind each result in Table 2; $\lambda^{xy}_x = 1$ for all methods and $\lambda^{x}_x = \lambda^{xy}_x$ for triple ELBO):
     Method        #Attr  λ_y^{yx}  Private hyperparameter  L1     PoE?  Training set
     triple ELBO   4      10        λ_y^y = 100             5e-05  N     iid
     JMVAE         4      50        α = 1                   0      N     iid
     bi-VCCA       4      10        μ = 0.7                 5e-07  N     iid
     triple ELBO   3      50        λ_y^y = 1               5e-03  Y     iid
     JMVAE         3      50        α = 1                   5e-03  Y     iid
     bi-VCCA       3      50        μ = 0.7                 5e-04  Y     iid
     triple ELBO   2      50        λ_y^y = 1               5e-03  Y     iid
     JMVAE         2      50        α = 1                   5e-03  Y     iid
     bi-VCCA       2      50        μ = 0.7                 5e-04  Y     iid
     triple ELBO   1      50        λ_y^y = 1               5e-03  Y     iid
     JMVAE         1      50        α = 1                   5e-06  Y     iid
     bi-VCCA       1      50        μ = 0.7                 5e-04  Y     iid
     triple ELBO   4      10        λ_y^y = 100             0      Y     comp
     JMVAE         4      50        α = 1                   5e-03  Y     comp
     bi-VCCA       4      10        μ = 0.7                 5e-05  Y     comp
     (Figure 12: architecture of the q(z|y) network for MNIST-a; class has 10 possible values, scale 2, orientation 3, and location 4.)
  22. 22. Experiment 2-2: qualitative evaluation of correctness
     ¤ Images are generated from various attribute specifications and their attributes are predicted with the observation classifier.
     ¤ Red borders mark images whose predicted attributes are wrong.
     ¤ bi-VCCA does badly: its samples are much blurrier than the others and are judged incorrect by the classifier.
     ¤ Even JMVAE makes some mistakes, and some of its generations are not clean.
     (Figure 7: samples of two previously seen concrete concepts, "4, big, upright, bottom-left" and "1, big, counter-clockwise, top-left", for triple ELBO, JMVAE, and bi-VCCA; for each concept, four samples $z_i \sim q(z|y)$ are decoded to mean images $\mu_i = E[x|z_i]$; a black border means all attributes are correct, a red border means at least one is wrong.)
  23. 23. Experiment 2-3: qualitative evaluation of coverage
     ¤ Confirms that triple ELBO achieves high coverage.
     ¤ For attributes not given as input, the generated images show many different variations.
     (Figure 8: triple ELBO samples for three abstract concepts, a 3-bit concept "big, counter-clockwise, bottom-right" with the digit unspecified, a 2-bit concept "3, big" with location and orientation unspecified, and a 1-bit concept "bottom-left" with the remaining attributes unspecified; for each concept, 10 samples $z_i \sim q(z|y)$ are decoded to mean images and the 6 most diverse are shown, selected manually.)
  24. 24. Experiment 2-4: qualitative evaluation of compositionality
     ¤ Tests whether unseen combinations of attributes can be generated well.
     ¤ bi-VCCA does badly, and JMVAE likewise fails to generate these cleanly.
     ¤ triple ELBO generates them reasonably well.
     (Figure 9: samples of two compositionally novel concrete concepts, "0, big, upright, top-right" and "2, big, clockwise, bottom-left", for the three models; the color coding is the same as in Figure 7.)
  25. 25. Summary
     ¤ The paper proposes a method for learning representations of the semantic content of images and descriptions as probability distributions over a shared latent space.
        ¤ The triple ELBO.
        ¤ This makes it possible to "imagine" concrete and abstract concepts and to "ground" them in images.
     ¤ Impressions:
        ¤ The way the claims are validated is impressive, as you would expect from these authors.
        ¤ I would like to reproduce and verify the paper's results myself.
        ¤ It feels strange to come across my own name while reading a paper.
