
Graph Embedding of Multimodal Data and Approximation Theorems via Neural Networks (マルチモーダルデータのグラフ埋め込みとニューラルネットワークによる近似定理)

Slides by Professor Hidetoshi Shimodaira, presented at the Kyoto University x RIKEN AIP x ExaWizards Machine Learning Study Group #1.


  1. 1. Title slide: Graph Embedding of Multimodal Data and Approximation Theorems via Neural Networks — Kyoto University x RIKEN AIP x ExaWizards Machine Learning Study Group #1.
  2. 2. Cross-Domain Matching Correlation Analysis (CDMCA): linear maps A^{(1)}, ..., A^{(D)} (one per domain) are learned from the objective
      \phi(A^{(1)}, \dots, A^{(D)}) = \frac{1}{2} \sum_{d=1}^{D} \sum_{e=1}^{D} \sum_{i=1}^{n_d} \sum_{j=1}^{n_e} w_{ij}^{(de)} \left\| A^{(d)\top} x_i^{(d)} - A^{(e)\top} x_j^{(e)} \right\|_2^2 .
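A minimal NumPy sketch (mine, not the talk's code) of evaluating the CDMCA objective φ on synthetic data; the shapes, weights, and variable names are illustrative assumptions.

    # Evaluate phi(A^(1),...,A^(D)) = 1/2 sum_{d,e,i,j} w_ij^(de) ||A^(d)T x_i^(d) - A^(e)T x_j^(e)||^2
    import numpy as np

    rng = np.random.default_rng(0)
    D = 2                       # number of domains
    n = [5, 4]                  # n_d: number of vectors per domain
    p = [10, 7]                 # p_d: dimension per domain
    K = 3                       # dimension of the common embedding space

    X = [rng.normal(size=(n[d], p[d])) for d in range(D)]                 # data vectors x_i^(d)
    A = [rng.normal(size=(p[d], K)) for d in range(D)]                    # linear maps A^(d)
    W = [[rng.random((n[d], n[e])) for e in range(D)] for d in range(D)]  # weights w_ij^(de)

    def cdmca_objective(A, X, W):
        phi = 0.0
        for d in range(D):
            Yd = X[d] @ A[d]                                # projected vectors of domain d
            for e in range(D):
                Ye = X[e] @ A[e]
                diff = Yd[:, None, :] - Ye[None, :, :]      # all pairwise differences, (n_d, n_e, K)
                phi += 0.5 * np.sum(W[d][e] * np.sum(diff**2, axis=2))
        return phi

    print(cdmca_objective(A, X, W))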
  3. 3. Example of cross-domain retrieval with CDMCA: Tables 2 and 3 of the cited work list the top-5 tags obtained for image queries, comparing CDMCA-ITG and CDMCA-ITrG (e.g., a cheetah image retrieves leopard, puma, lion; a cannons image retrieves cannon, guns, fortifications). Section 5.5 lists the retrieval methods tested.
  4. 4. Multimodal Eigenwords: vector arithmetic across modalities, e.g. (image) − "brown" + "white" = ..., (image) − "day" + "night" = ... (Fukui, Oshikiri and Shimodaira, TextGraphs 2017).
  7. 7. Memo: notation collected for the CDMCA introduction. Domains are d = 1, ..., D; domain d has n_d data vectors of dimension p_d:
      x_i^{(d)} \in \mathbb{R}^{p_d}, \quad i = 1, \dots, n_d, \quad d = 1, \dots, D.
      The CDMCA objective is
      \phi(A^{(1)}, \dots, A^{(D)}) = \frac{1}{2} \sum_{d=1}^{D} \sum_{e=1}^{D} \sum_{i=1}^{n_d} \sum_{j=1}^{n_e} w_{ij}^{(de)} \left\| A^{(d)\top} x_i^{(d)} - A^{(e)\top} x_j^{(e)} \right\|^2 .
      With n = \sum_{d=1}^{D} n_d and p = \sum_{d=1}^{D} p_d, each x_{i'}^{(d)} \in \mathbb{R}^{p_d} is zero-padded into x_i \in \mathbb{R}^p,
      x_i^{\top} = \big( 0_{p_1}^{\top}, \dots, 0_{p_{d-1}}^{\top}, (x_{i'}^{(d)})^{\top}, 0_{p_{d+1}}^{\top}, \dots, 0_{p_D}^{\top} \big), \quad i = 1, \dots, n,
      the transformations are stacked as A^{\top} = ((A^{(1)})^{\top}, \dots, (A^{(D)})^{\top}) with A \in \mathbb{R}^{p \times K}, and the weights are collected in the block matrix
      W = \begin{pmatrix} W^{(11)} & W^{(12)} & \cdots & W^{(1D)} \\ W^{(21)} & W^{(22)} & \cdots & W^{(2D)} \\ \vdots & \vdots & \ddots & \vdots \\ W^{(D1)} & W^{(D2)} & \cdots & W^{(DD)} \end{pmatrix} \in \mathbb{R}^{n \times n}.
      The objective then takes the single-domain form
      \phi(A) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left\| A^{\top} x_i - A^{\top} x_j \right\|^2 ,
      and the description becomes much simpler. Notes: the update of \pi used to be heuristic, but an MM algorithm is now derived; (a point that matters little either way) the weights used to be rescaled so that \sum_{i,j} w_{ij} = 1, whereas the observed w_{ij} are now used as-is; the algorithm is implemented in cdmca3.R.
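A small NumPy check (illustrative; not the cdmca3.R implementation mentioned above) that the zero-padded stacked formulation φ(A) reproduces the per-domain objective φ(A^(1), ..., A^(D)); all data here are synthetic.

    import numpy as np

    rng = np.random.default_rng(1)
    D, n, p, K = 2, [3, 4], [5, 6], 2
    X = [rng.normal(size=(n[d], p[d])) for d in range(D)]
    A = [rng.normal(size=(p[d], K)) for d in range(D)]
    Wblk = [[rng.random((n[d], n[e])) for e in range(D)] for d in range(D)]

    # per-domain objective
    phi_multi = 0.0
    for d in range(D):
        for e in range(D):
            diff = (X[d] @ A[d])[:, None, :] - (X[e] @ A[e])[None, :, :]
            phi_multi += 0.5 * np.sum(Wblk[d][e] * np.sum(diff**2, axis=2))

    # stacked formulation: zero-pad x_i^(d) into R^p, stack A^(d) into A, build block W
    ntot, ptot = sum(n), sum(p)
    offsets = np.cumsum([0] + p)
    Xpad = np.zeros((ntot, ptot))
    row = 0
    for d in range(D):
        Xpad[row:row + n[d], offsets[d]:offsets[d + 1]] = X[d]
        row += n[d]
    Astack = np.vstack(A)                       # (p, K)
    Wfull = np.block(Wblk)                      # (n, n)

    Y = Xpad @ Astack
    diff = Y[:, None, :] - Y[None, :, :]
    phi_single = 0.5 * np.sum(Wfull * np.sum(diff**2, axis=2))

    print(np.isclose(phi_multi, phi_single))    # True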
  8. 8. Relation to a classical neural network model. CDMCA is interpreted as dimensionality reduction of a correlation matrix with a regularization term (involving matrices G and H). For CDMCA, the correlation matrix becomes
      \frac{1}{2} \sum_{d=1}^{D} \sum_{e=1}^{D} \sum_{i=1}^{n_d} \sum_{j=1}^{n_e} w_{ij}^{(de)} (\tilde{x}_i^{(d)} + \tilde{x}_j^{(e)}) (\tilde{x}_i^{(d)} + \tilde{x}_j^{(e)})^{\top},
      i.e., the correlation matrix of the input pattern of a pair of data vectors
      (\tilde{x}_i^{(d)} + \tilde{x}_j^{(e)})^{\top} = \big( 0, \dots, 0, (x_i^{(d)})^{\top}, 0, \dots, 0, (x_j^{(e)})^{\top}, 0, \dots, 0 \big)        (32)
      weighted by w_{ij}^{(de)}. Interestingly, the same correlation matrix is found in one of the classical neural network models: any part of a memorized vector can be used as a key for recalling the whole vector in the auto-associative correlation matrix memory (Kohonen, 1972), also known as the Associatron (Nakano, 1972). This associative memory may recall \tilde{x}_i^{(d)} + \tilde{x}_j^{(e)} for the input key \tilde{x}_i^{(d)} or \tilde{x}_j^{(e)} if w_{ij}^{(de)} > 0. In particular, the representation (32) of a pair of data vectors is equivalent to eq. (14) of Nakano (1972). Thus CDMCA is interpreted as dimensionality reduction of the auto-associative correlation matrix memory for pairs of data vectors.
      The slide reproduces the first page of Nakano (1972), "Associatron—A Model of Associative Memory," IEEE Transactions on Systems, Man, and Cybernetics: the Associatron stores entities represented by bit patterns in a distributed manner and recalls the whole of any entity from a part of it; the recall is accurate when the given part is large, ambiguous when it is small, and its accuracy decreases as the number of stored entities increases.
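A minimal NumPy sketch of the auto-associative correlation matrix memory idea cited above (Kohonen/Nakano), not the slide's exact construction: a few ±1 patterns are stored in a correlation matrix and a whole pattern is recalled from a partial key.

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_patterns = 40, 3
    patterns = rng.choice([-1.0, 1.0], size=(n_patterns, dim))

    M = sum(np.outer(x, x) for x in patterns)     # correlation matrix memory

    key = patterns[0].copy()
    key[dim // 2:] = 0.0                          # keep only the first half as the key

    recalled = np.sign(M @ key)
    print(np.mean(recalled == patterns[0]))       # fraction of correctly recalled entries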
  9. 9. Recommender systems (collaborative filtering via matrix factorization). Each item i is associated with a vector q_i \in \mathbb{R}^f and each user u with p_u \in \mathbb{R}^f; the predicted rating is \hat{r}_{ui} = q_i^{\top} p_u. To learn the factor vectors (p_u and q_i), the system minimizes the regularized squared error on the set of known ratings:
      \min_{q_*, p_*} \sum_{(u,i) \in \kappa} (r_{ui} - q_i^{\top} p_u)^2 + \lambda (\| q_i \|^2 + \| p_u \|^2),        (2)
      where \kappa is the set of (u, i) pairs for which r_{ui} is known (the training set). The system learns the model by fitting the previously observed ratings, but the goal is to generalize to future, unknown ratings, so overfitting is avoided by regularizing the learned parameters, whose magnitudes are penalized by the constant \lambda. (The slide reproduces Figures 1 and 2 of the cited paper: the user-oriented neighborhood method, and a simplified illustration of the latent factor approach with two axes, male versus female and serious versus escapist.)
      Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42(8) (2009): 30-37.
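An illustrative NumPy sketch (not from the talk or the cited paper) of regularized matrix factorization trained by SGD on observed (u, i, r_ui) triples; all data and hyperparameters are synthetic assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, f = 30, 20, 4
    # synthetic observed ratings on a random subset of user-item pairs
    P_true = rng.normal(size=(n_users, f))
    Q_true = rng.normal(size=(n_items, f))
    kappa = [(u, i) for u in range(n_users) for i in range(n_items) if rng.random() < 0.3]
    ratings = {(u, i): P_true[u] @ Q_true[i] + 0.1 * rng.normal() for (u, i) in kappa}

    P = 0.1 * rng.normal(size=(n_users, f))     # user factors p_u
    Q = 0.1 * rng.normal(size=(n_items, f))     # item factors q_i
    lam, lr = 0.05, 0.02

    for epoch in range(50):
        for (u, i), r in ratings.items():
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - qi @ pu
            # gradient steps on the regularized squared error
            P[u] += lr * (err * qi - lam * pu)
            Q[i] += lr * (err * pu - lam * qi)

    rmse = np.sqrt(np.mean([(r - Q[i] @ P[u]) ** 2 for (u, i), r in ratings.items()]))
    print("training RMSE:", rmse)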
  10. 10. Siamese neural networks (NN-based similarity learning). The slide reproduces Figure 1 of the cited paper: "Architecture 1 consists of two identical time delay neural networks. Each network has an input of 8 by 200 units, a first layer of 12 by 64 units with receptive fields for each unit being 8 by 11, and a second layer of 16 by 19 units with receptive fields 12 by 10."
      Bromley, Guyon, Le Cun, Sackinger, and Shah (1994). "Signature verification using a 'Siamese' time delay neural network." NIPS.
  11. 11. Okuno, Hada and Shimodaira (2018). A probabilistic framework for multi-view feature learning with many-to-many associations via neural networks, ICML.
  12. 12. PMvGE (Probabilistic Multi-view Graph Embedding). Each data point i has a p_{d_i}-dimensional vector representation x_i \in \mathbb{R}^{p_{d_i}}, which we call a data vector; the strength of association between x_i and x_j is denoted by the matching weight w_{ij} = w_{ji} \ge 0. Both \{x_i\} and \{w_{ij}\} are observed. Such multi-view data are hard to analyze because the dimension may differ across views, so each data vector is transformed into a feature vector y_i := f_\psi^{(d_i)}(x_i) \in \mathbb{R}^K by continuous maps f_\psi^{(d)} : \mathbb{R}^{p_d} \to \mathbb{R}^K (d = 1, 2, \dots, D), so that the inner-product similarity of \{y_i\} in \mathbb{R}^K represents the matching weights \{w_{ij}\}; the dimension K is shared across views. We call this procedure of transforming multi-view data vectors \{x_i\} into feature vectors \{y_i\} multi-view feature learning (MVFL). To realize MVFL while simultaneously solving the aforementioned problems (i)-(iii), we propose the probabilistic model of link weights
      w_{ij} \mid x_i, x_j \overset{\text{indep.}}{\sim} \mathrm{Po}\big( \mu(x_i, x_j; \theta) \big), \qquad \mu(x_i, x_j; \theta) := \alpha^{(d_i, d_j)} \exp\!\big( \langle f_\psi^{(d_i)}(x_i), f_\psi^{(d_j)}(x_j) \rangle \big),
      where \theta := (\psi, \{\alpha^{(d,e)}\}) is to be estimated and the inner product of the two networks is the siamese-network similarity. The representation power of real-valued NNs is well known (e.g., Funahashi, 1989; Telgarsky, 2017); however, h(x_i, x_j; \psi) = \langle f_\psi^{(d_i)}(x_i), f_\psi^{(d_j)}(x_j) \rangle is the inner product of vector-valued NNs, so existing theorems cannot be applied directly.
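An illustrative NumPy sketch of the PMvGE rate μ(x_i, x_j; θ) = α exp(⟨f_ψ(x_i), f_ψ(x_j)⟩) and the Poisson log-likelihood, using a single view and a one-hidden-layer network; shapes and initialization are assumptions, and a multi-view version would use one network per view.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, K, T = 12, 6, 3, 8              # data points, input dim, embedding dim, hidden units
    X = rng.normal(size=(n, p))
    W = np.triu(rng.poisson(1.0, size=(n, n)), 1)      # observed weights w_ij for i < j

    # feature map f_psi: a ReLU MLP with one hidden layer
    W1, b1 = 0.3 * rng.normal(size=(p, T)), np.zeros(T)
    W2 = 0.3 * rng.normal(size=(T, K))
    alpha = 1.0

    def f_psi(X):
        return np.maximum(X @ W1 + b1, 0.0) @ W2

    def log_likelihood(X, W):
        Y = f_psi(X)
        mu = alpha * np.exp(Y @ Y.T)                   # mu(x_i, x_j; theta) for all pairs
        iu = np.triu_indices(n, 1)                     # pairs with i < j
        return np.sum(W[iu] * np.log(mu[iu]) - mu[iu])

    print(log_likelihood(X, W))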
  13. 13. SGD. Log-likelihood:
      \ell(\theta) := \log P(\{w_{ij}\}) = \sum_{1 \le i < j \le n} \big\{ w_{ij} \log \mu(x_i, x_j; \theta) - \mu(x_i, x_j; \theta) \big\}.
      Gradient descent for \ell(\theta) requires summing up O(n^2) terms. Minibatch SGD:
      \theta^{(t+1)} = \theta^{(t)} + \gamma_t \left. \frac{\partial}{\partial \theta} \tilde{\ell}(\theta) \right|_{\theta^{(t)}}, \qquad \tilde{\ell}(\theta) := \sum_{(i,j) \in W'_n} w_{ij} \log \mu(x_i, x_j; \theta) - \lambda \sum_{(i,j) \in I'_n} \mu(x_i, x_j; \theta),
      where \lambda > 0 is a tuning parameter and W'_n, I'_n are mini-batches resampled from W_n := \{(i,j) \in I_n \mid w_{ij} > 0\} and I_n := \{(i,j) \mid 1 \le i < j \le n\}, respectively. Minibatch SGD requires O(|I'_n| + |W'_n|) operations regardless of n.
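A minimal sketch of the minibatch update above, under my own simplification of a linear feature map y = Bx (so the gradient is available in closed form); the talk instead uses neural networks trained in the same spirit.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, K = 50, 8, 3
    X = rng.normal(scale=0.3, size=(n, p))
    Wmat = np.triu(rng.poisson(0.2, size=(n, n)), 1)            # sparse observed weights

    I_n = [(i, j) for i in range(n) for j in range(i + 1, n)]   # all pairs
    W_n = [(i, j) for (i, j) in I_n if Wmat[i, j] > 0]          # linked pairs

    B = 0.1 * rng.normal(size=(K, p))
    lam, gamma, batch = 1.0, 0.01, 32

    def mu(i, j, B):
        return np.exp(B @ X[i] @ (B @ X[j]))                    # alpha fixed to 1 here

    for t in range(200):
        Wb = [W_n[k] for k in rng.integers(len(W_n), size=min(batch, len(W_n)))]
        Ib = [I_n[k] for k in rng.integers(len(I_n), size=batch)]
        grad = np.zeros_like(B)
        for (i, j) in Wb:       # attractive term on linked pairs
            grad += Wmat[i, j] * B @ (np.outer(X[i], X[j]) + np.outer(X[j], X[i]))
        for (i, j) in Ib:       # repulsive (normalization) term on random pairs
            grad -= lam * mu(i, j, B) * B @ (np.outer(X[i], X[j]) + np.outer(X[j], X[i]))
        B += gamma * grad       # ascent on the stochastic objective

    ell = sum(Wmat[i, j] * (B @ X[i]) @ (B @ X[j]) - mu(i, j, B) for (i, j) in I_n)
    print("full Poisson log-likelihood (up to constants):", ell)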
  14. 14. We call the above-mentioned procedure to obtain feature vectors Probabilistic Multi-view Graph Embedding (PMvGE). Comparison of methods — (Nv): number of views, (MM): many-to-many, (NL): non-linear, (Ind): inductive, (Lik): likelihood-based; an X marks a property among (MM)/(NL)/(Ind)/(Lik) that the method has, Nv = D means the method can deal with an arbitrary number of views, and PMvGE has all the properties:
      CCA (Hotelling 1936)              Nv = 2   X
      Deep CCA (Andrew et al. 2013)     Nv = 2   X X
      MCCA (Kettenring 1971)            Nv = D   X
      SGE (Belkin et al. 2001)          Nv = 0   X
      LINE (Tang et al. 2015)           Nv = 0   X X
      LPP (He et al. 2004)              Nv = 1   X X
      CvGE (Huang et al. 2013)          Nv = 2   X X
      CDMCA (Shimodaira 2016)           Nv = D   X X
      DeepWalk (Perozzi et al. 2014)    Nv = 0   X X
      SBM (Holland et al. 1983)         Nv = 1   X X X
      GCN (Kipf et al. 2017)            Nv = 1   X X X
      GraphSAGE (Hamilton et al. 2017)  Nv = 1   X X X X
      IDW (Dai et al. 2018)             Nv = 1   X X X X
      PMvGE (Proposed)                  Nv = D   X X X X
      As its closest existing counterpart, CDMCA with linear transformations is obtained as a quadratic approximation of PMvGE; since CDMCA includes HIMFAC and PCA, PMvGE approximately covers these methods as well.
  15. 15. Representation power of PMvGE with NNs (Mercer's theorem + universal approximation theorem).
      Theorem 1. Let f_*^{(d)} : [-M, M]^{p_d} \to [-M', M']^{K_*} (d = 1, 2, \dots, D) be continuous maps and g_* : [-M', M']^{2K_*} \to \mathbb{R} a positive-definite kernel, for some M, M' > 0 and K_* \in \mathbb{N}. For arbitrary \varepsilon > 0, by specifying sufficiently large T, K, there exist MLPs f_\psi^{(d)} : \mathbb{R}^{p_d} \to \mathbb{R}^K with T hidden units and ReLU or sigmoid activation such that
      \left| g_*\big( f_*^{(d)}(x), f_*^{(e)}(x') \big) - \big\langle f_\psi^{(d)}(x), f_\psi^{(e)}(x') \big\rangle \right| < \varepsilon, \quad \forall (x, x') \in [-M, M]^{p_d + p_e}, \; \forall d, e.
      We visualize Theorem 1: with cosine similarity g_*, define G_*(s, t) := g_*(f_*(s e_1), f_*(t e_2)) and \hat{G}_K(s, t) := \langle f_\psi(s e_1), f_\psi(t e_2) \rangle, where f_*(x) = (x_1, \cos x_2, \exp(-x_3), \sin(x_4 - x_5)) and e_1, e_2 \in \mathbb{R}^5.
  16. 16. With sufficiently large K = ♯output units and T = ♯hidden units of the NNs, the inner product of feature vectors approximates any positive-definite (PD) similarity g_*: the underlying true structure g_*(f_*(x_i), f_*(x_j)) in a K_*-dimensional latent space (illustrated by "dog" images in view 1 linked to the tags "dog" and "animal" in view 2) is estimated by PMvGE as \langle f_\psi^{(1)}(x_i), f_\psi^{(2)}(x_j) \rangle in a K-dimensional latent space (Mercer's theorem + universal approximation theorem). From PD to CPD (e.g., Poincaré distance): our new paper, introduced on the following slides.
      We model the similarity function as h(x_i, x_j) := g(f(x_i), f(x_j)), where f : \mathbb{R}^p \to \mathbb{R}^K is a continuous map and g : \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R} is a symmetric continuous function (defined later in Definition 3.1). Using a neural network y = f_\psi(x) with parameter \psi, we consider the model h(x_i, x_j) = g(f_\psi(x_i), f_\psi(x_j)), which is called a siamese network (Bromley et al., 1994) in the neural-network literature. The original siamese network uses the cosine similarity for g, but other similarity functions can be specified. Specifying the inner product g(y, y') = \langle y, y' \rangle, the similarity function becomes
      h(x_i, x_j) = \langle f_\psi(x_i), f_\psi(x_j) \rangle,        (2)
      which we call the Inner Product Similarity (IPS) model. IPS commonly appears in a broad range of methods, such as DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), node2vec (Grover and Leskovec, 2016), Variational Graph AutoEncoder (Kipf and Welling, 2016), and GraphSAGE (Hamilton et al., 2017). Multi-view extensions (Okuno et al., 2018) are easily obtained by preparing a different f for each view and restricting loss terms in the objective to specific pairs; for example, the skip-gram model considers a bipartite graph of two views with the conditional distribution of contexts given a word.
  17. 17. − 17
  18. 18. Experiment-1 (D = 1). We conduct label classification and clustering experiments on the Cora citation dataset (Sen et al., 2008), which consists of 2,708 nodes — document v_i has a 1,433-dimensional bag-of-words data vector x_i \in \{0, 1\}^{1433} and a class label out of 7 classes — and 5,278 directed edges, each representing a citation from v_i to v_j; we set w_{ij} = w_{ji} = 1 ignoring direction, and w_{ij} = 0 otherwise; there is no cross- or self-citation. 64% of the nodes with their edges are used for training, 16% for validation, and the remaining 20% for test.
      Label classification (Task 1): feature vectors are computed by each method; documents are classified into 7 classes by multi-class logistic regression whose input is the feature vectors; results are evaluated by classification accuracy (mean ± std. dev. over 10 runs); "Unseen" means feature vectors are obtained without using the test set.
                    Seen           Unseen
      SBM           -              -
      ISOMAP        54.5 ± 1.78    54.8 ± 2.43
      LLE           30.2 ± 1.91    31.9 ± 2.62
      SGE           47.6 ± 1.64    -
      MDS           29.8 ± 2.25    -
      DeepWalk      54.2 ± 2.04    -
      GraphSAGE     60.8 ± 1.73    57.1 ± 1.61
      PMvGE         74.8 ± 2.55    71.1 ± 2.10
      Clustering (Task 2): k-means clustering (♯clusters = 7) of the feature vectors is performed for unsupervised learning of document clusters; results are evaluated by Normalized Mutual Information (NMI), mean ± std. dev. over 10 runs.
                    Seen           Unseen
      SBM           4.37 ± 1.44    2.81 ± 0.10
      ISOMAP        13.0 ± 0.36    14.3 ± 1.98
      LLE           7.40 ± 3.40    9.47 ± 3.00
      SGE           1.41 ± 0.34    -
      MDS           2.81 ± 0.10    -
      DeepWalk      16.7 ± 1.05    -
      GraphSAGE     19.6 ± 0.93    12.4 ± 3.00
      PMvGE         35.9 ± 0.88    30.5 ± 3.90
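An illustrative sketch of the two evaluation protocols, with synthetic stand-ins for the learned feature vectors and labels; scikit-learn is assumed to be available, and this is not the experiments' actual pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.metrics import normalized_mutual_info_score

    rng = np.random.default_rng(0)
    n, K, n_classes = 500, 16, 7
    labels = rng.integers(n_classes, size=n)
    # pretend feature vectors: class centers plus noise
    centers = rng.normal(size=(n_classes, K))
    features = centers[labels] + 0.5 * rng.normal(size=(n, K))

    train = rng.random(n) < 0.8
    # Task 1: multi-class logistic regression on the feature vectors
    clf = LogisticRegression(max_iter=1000).fit(features[train], labels[train])
    print("classification accuracy:", clf.score(features[~train], labels[~train]))

    # Task 2: k-means clustering of the feature vectors, evaluated by NMI
    clusters = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(features)
    print("NMI:", normalized_mutual_info_score(labels, clusters))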
  19. 19. Experiment-2 (D = 2). Link prediction on the Animals with Attributes (AwA) dataset (Lampert et al., 2009), which consists of:
      (view-1) 2,500 images: each image has a class label out of 50 classes and a 4,096-dimensional DeCAF data vector (Donahue et al., 2014), associated with some attributes such as "black", "white", "brown", "stripes", "water", "eats fish" (the images used in the figure are under CC0 license);
      (view-2) 85 attributes: each with a 300-dimensional GloVe (Pennington et al., 2014) data vector.
      Link prediction (Task 3): feature vectors are computed by each method; for each query image, attributes are ranked according to the cosine similarity of feature vectors across views; results are evaluated by Average Precision (AP), mean ± std. dev. over 10 runs; "Unseen" means feature vectors are obtained without using the test set.
                    Seen           Unseen
      CCA           45.5 ± 0.20    42.4 ± 0.30
      Deep CCA      41.4 ± 0.30    41.2 ± 0.35
      SGE           43.5 ± 0.39    -
      PMvGE         71.5 ± 0.48    70.5 ± 0.53
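A minimal sketch of cross-view ranking by cosine similarity and the Average Precision metric, on synthetic feature vectors and links (not the AwA data or the original evaluation code).

    import numpy as np

    rng = np.random.default_rng(0)
    K, n_images, n_attrs = 8, 5, 20
    img_feat = rng.normal(size=(n_images, K))            # feature vectors of images (view 1)
    attr_feat = rng.normal(size=(n_attrs, K))             # feature vectors of attributes (view 2)
    true_links = rng.random((n_images, n_attrs)) < 0.2    # ground-truth image-attribute links

    def average_precision(scores, relevant):
        order = np.argsort(-scores)                       # rank attributes by descending score
        rel = relevant[order]
        hits = np.cumsum(rel)
        precision_at_k = hits / (np.arange(len(rel)) + 1)
        return precision_at_k[rel].mean() if rel.any() else 0.0

    # rank attributes for each query image by cosine similarity across views
    img_n = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    attr_n = attr_feat / np.linalg.norm(attr_feat, axis=1, keepdims=True)
    cos = img_n @ attr_n.T

    aps = [average_precision(cos[q], true_links[q]) for q in range(n_images)]
    print("mean AP:", np.mean(aps))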
  20. 20. 20
  21. 21. (From the paper "On representation power of neural network-based graph embedding and beyond".) To show the result in Theorem 3.1, we first define a kernel and its positive-definiteness.
      Definition 3.1. For some set Y, a symmetric continuous function g : Y^2 \to \mathbb{R} is called a kernel on Y^2.
      Definition 3.2. A kernel g on Y^2 is said to be Positive Definite (PD) if \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j g(y_i, y_j) \ge 0 for arbitrary c_1, \dots, c_n \in \mathbb{R} and y_1, \dots, y_n \in Y.
      For instance, cosine similarity g(y, y') := \langle y/\|y\|_2, y'/\|y'\|_2 \rangle is a PD kernel on (\mathbb{R}^p \setminus \{0\})^2; its PD-ness follows immediately from \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j g(y_i, y_j) = \| \sum_{i=1}^{n} c_i y_i / \|y_i\|_2 \|_2^2 \ge 0 for arbitrary \{c_i\} \subset \mathbb{R} and \{y_i\} \subset Y. Polynomial, Gaussian, and Laplacian kernels are also PD (Berg et al., 1984). Using such kernels, we define a similarity of data vectors.
      Definition 3.3. A function h(x, x') := g(f(x), f(x')) with a kernel g : Y^2 \to \mathbb{R} and a continuous map f : X \to Y is called a similarity on X^2.
      For a PD kernel g, the similarity h is also a PD kernel on X^2, since \sum_i \sum_j c_i c_j h(x_i, x_j) = \sum_i \sum_j c_i c_j g(f(x_i), f(x_j)) \ge 0. Briefly speaking, a similarity h is used for measuring how similar two data vectors are, while a kernel is used to compare feature vectors. Regarding PD similarities, Theorem 3.1 shows that IPS approximates any PD similarity arbitrarily well as the output dimension grows.
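A quick numerical illustration (mine, not from the paper): the Gram matrix of the cosine-similarity kernel is positive semi-definite, so its eigenvalues are nonnegative up to floating-point error.

    import numpy as np

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(50, 5))                       # feature vectors y_i in R^5
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    G = Yn @ Yn.T                                      # G_ij = cosine similarity g(y_i, y_j)

    eigvals = np.linalg.eigvalsh(G)
    print("min eigenvalue:", eigvals.min())            # ~0 or slightly negative from round-off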
  22. 22. Fundamental limitation of IPS (Section 4.1). Consider the negative squared distance (NSD) g(y, y') = -\|y - y'\|_2^2 and the identity map f(x) = x. Then the similarity function h(x, x') = g(f(x), f(x')) = -\|x - x'\|_2^2 defined on \mathbb{R}^p \times \mathbb{R}^p is not PD but CPD, which is defined later in Section 4.2. Regarding the NSD similarity, Proposition 4.1 shows a strictly positive lower bound on the approximation error of IPS.
      Proposition 4.1. Let \Phi(p, K) denote the set of all continuous maps from \mathbb{R}^p to \mathbb{R}^K. For all M > 0 and p, K \in \mathbb{N},
      \inf_{\varphi \in \Phi(p,K)} \frac{1}{(2M)^{2p}} \int_{[-M,M]^p} \int_{[-M,M]^p} \left| -\|x - x'\|_2^2 - \langle \varphi(x), \varphi(x') \rangle \right| \, dx \, dx' \ge \frac{2pM^2}{3}.
      Since \Phi(p, K) contains arbitrary continuous maps, including neural networks, Proposition 4.1 indicates that IPS does not approximate the NSD similarity arbitrarily well. The following sections extend the IPS model to SIPS and its simpler variant C-SIPS, give their interpretations, and prove that SIPS approximates CPD similarities arbitrarily well.
  23. 23. CPD (Conditionally Positive Definite) kernels and similarities (Section 4.2). IPS approximates any PD similarity, but similarities in general are not PD; we therefore consider similarities based on Conditionally PD (CPD) kernels (Berg et al., 1984; Schölkopf, 2001), which include PD kernels, and extend IPS to approximate them.
      Definition 4.1. A kernel g on Y^2 is called Conditionally PD (CPD) if \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j g(y_i, y_j) \ge 0 for arbitrary c_1, \dots, c_n \in \mathbb{R} and y_1, \dots, y_n \in Y under the constraint \sum_{i=1}^{n} c_i = 0.
      The difference between the definitions of CPD and PD kernels is whether the constraint \sum_{i=1}^{n} c_i = 0 is imposed; by these definitions, CPD kernels include PD kernels as special cases. For a CPD kernel g, the similarity h is also a CPD kernel on X^2.
      A simple example of a CPD kernel is g(y, y') = -\|y - y'\|_2^{\alpha} for 0 < \alpha \le 2 on \mathbb{R}^K \times \mathbb{R}^K; other examples are -(\sin(y - y'))^2 and -1_{(0,\infty)}(y + y') on \mathbb{R} \times \mathbb{R}. CPD-ness is a well-established concept with interesting properties (Berg et al., 1984): for any function u(\cdot), g(y, y') = u(y) + u(y') is CPD; constants are CPD; the sum of two CPD kernels is CPD; and for CPD kernels g with g(y, y') \le 0, CPD-ness holds for -(-g)^{\alpha} (\alpha \in (0, 1]) and -\log(1 - g).
  24. 24. The Poincaré distance is CPD.
      Example 4.1 (Poincaré distance). Let B^K := \{ y \in \mathbb{R}^K \mid \|y\|_2 < 1 \} be the K-dimensional open unit ball and define a distance between y, y' \in B^K as
      d_{\text{Poincaré}}(y, y') := \cosh^{-1}\!\left( 1 + \frac{2 \|y - y'\|_2^2}{(1 - \|y\|_2^2)(1 - \|y'\|_2^2)} \right),
      where \cosh^{-1}(z) = \log(z + \sqrt{z + 1}\sqrt{z - 1}). Considering the setting of Section 2 with 1-hot data vectors, Poincaré embedding (Nickel and Kiela, 2017) learns parameters y_i, i = 1, \dots, n, by fitting \sigma(-d_{\text{Poincaré}}(y_i, y_j)) to the observed w_{ij} \in \{0, 1\}.
      Interestingly, the negative Poincaré distance is proved to be CPD in Faraut and Harzallah (1974, Corollary 7.4).
      Proposition 4.2. -d_{\text{Poincaré}} is CPD on B^K \times B^K.
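An illustrative numerical check (mine) of Proposition 4.2: the Gram matrix of −d_Poincaré on sampled points of the open unit ball is positive semi-definite after projecting onto zero-sum coefficient vectors, which is exactly the CPD condition of Definition 4.1.

    import numpy as np

    def poincare_distance(y, yp):
        num = 2.0 * np.sum((y - yp) ** 2)
        den = (1.0 - np.sum(y ** 2)) * (1.0 - np.sum(yp ** 2))
        return np.arccosh(1.0 + num / den)

    rng = np.random.default_rng(0)
    n, K = 25, 3
    Y = rng.normal(size=(n, K))
    Y = 0.8 * Y / (1.0 + np.linalg.norm(Y, axis=1, keepdims=True))   # points inside B^K

    G = -np.array([[poincare_distance(Y[i], Y[j]) for j in range(n)] for i in range(n)])
    P = np.eye(n) - np.ones((n, n)) / n            # projector onto {c : sum(c) = 0}
    print("min eigenvalue on zero-sum subspace:", np.linalg.eigvalsh(P @ G @ P).min())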
  25. 25. Poincaré Embeddings for Learning Hierarchical Representations Nickel & Kiela (NIPS 2017) 25
  26. 26. 26
  27. 27. The Wasserstein distance is CPD. The negative Poincaré distance is strictly CPD in the sense that it is not PD; a counterexample to PD-ness is n = 2, K = 2, c_1 = c_2 = 1, y_1 = (1/2, 1/2), y_2 = (0, 0) \in B^2. Another interesting example of a CPD kernel is the negative Wasserstein distance.
      Example 4.2 (Wasserstein distance). For q \in (0, \infty), let Z be a metric space endowed with a metric d_Z, which we call the "ground distance," and let Y be the space of all measures \mu on Z satisfying \int_Z d_Z(z, z_0) \, d\mu(z) < \infty for all z_0 \in Z. The q-Wasserstein distance between y and y' is defined as
      d_W^{(q)}(y, y') := \inf_{\pi \in \Pi(y, y')} \left( \int_{Z \times Z} d_Z(z, z')^q \, d\pi(z, z') \right)^{1/q},
      where \Pi(y, y') is the set of joint probability measures on Z \times Z having marginals y and y'. The Wasserstein distance is used in a broad range of methods, such as Generative Adversarial Networks (Arjovsky et al., 2017) and AutoEncoders (Tolstikhin et al., 2018). Under some assumptions, the negative Wasserstein distance is proved to be CPD.
      Proposition 4.3. -d_W^{(1)} is CPD on Y^2 if -d_Z is CPD on Z^2; -d_W^{(2)} is CPD on Y^2 if Z is a subset of \mathbb{R}.
      -d_W^{(1)} is known as the negative earth mover's distance, and its CPD-ness is discussed in Gardner et al. (2017); the CPD-ness of -d_W^{(2)} is shown in Kolouri et al. (2016, Corollary 1). Therefore the negative Poincaré distance and the negative Wasserstein distance are CPD kernels, and the following slides propose a model that approximates any CPD similarity arbitrarily well.
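An illustrative NumPy sketch (mine): for empirical distributions on the real line with equally many, equally weighted samples, the q-Wasserstein distance reduces to matching sorted samples, so no transport problem needs to be solved.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=0.0, scale=1.0, size=1000)    # samples of measure y
    yp = rng.normal(loc=1.0, scale=1.0, size=1000)   # samples of measure y'

    def wasserstein_1d(y, yp, q):
        return np.mean(np.abs(np.sort(y) - np.sort(yp)) ** q) ** (1.0 / q)

    print("W_1:", wasserstein_1d(y, yp, 1))          # ~1.0 for these two Gaussians
    print("W_2:", wasserstein_1d(y, yp, 2))          # ~1.0 as well (pure mean shift)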
  28. 28. Proposed model: SIPS (Shifted Inner Product Similarity), Section 4.3. Extending the IPS model of eq. (2), we propose
      h(x_i, x_j) = \langle f_\psi(x_i), f_\psi(x_j) \rangle + u_\xi(x_i) + u_\xi(x_j),        (3)
      where f_\psi : \mathbb{R}^p \to \mathbb{R}^K and u_\xi : \mathbb{R}^p \to \mathbb{R} are neural networks whose parameter matrices are \psi and \xi, respectively. We call (3) the Shifted IPS (SIPS) model, because the inner product \langle f_\psi(x_i), f_\psi(x_j) \rangle is shifted by the offset u_\xi(x_i) + u_\xi(x_j). Theorem 4.1 later shows that SIPS approximates any CPD kernel arbitrarily well.
      We also consider a special case of SIPS: assuming u_\xi(x) = -\gamma/2 for all x, SIPS reduces to
      h(x_i, x_j) = \langle f_\psi(x_i), f_\psi(x_j) \rangle - \gamma,        (4)
      where \gamma \ge 0 is a parameter to be estimated. We call (4) the Constantly-Shifted IPS (C-SIPS) model.
      If we have no attributes, we use 1-hot vectors for x_i in \mathbb{R}^n instead, and f_\psi(x_i) = y_i \in \mathbb{R}^K, u_\xi(x_i) = u_i \in \mathbb{R} are model parameters. Then SIPS reduces to the matrix decomposition model with biases
      h(x_i, x_j) = \langle y_i, y_j \rangle + u_i + u_j,        (5)
      which is widely used for recommender systems (Koren et al., 2009) and word vectors (Pennington et al., 2014); SIPS is considered as its generalization.
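A small sketch (mine) of the SIPS similarity of eq. (3) and its C-SIPS special case (4); the network shapes are illustrative, and for brevity the offset network u_ξ shares its hidden layer with f_ψ here, which the paper does not require.

    import numpy as np

    rng = np.random.default_rng(0)
    p, K, T = 6, 4, 16
    W1, b1 = 0.3 * rng.normal(size=(p, T)), np.zeros(T)
    W2 = 0.3 * rng.normal(size=(T, K))        # feature network f_psi
    v = 0.3 * rng.normal(size=T)              # offset network u_xi (shared hidden layer here)

    def f_psi(x):
        return np.maximum(x @ W1 + b1, 0.0) @ W2

    def u_xi(x):
        return np.maximum(x @ W1 + b1, 0.0) @ v

    def sips(xi, xj):
        # h(x_i, x_j) = <f(x_i), f(x_j)> + u(x_i) + u(x_j)
        return f_psi(xi) @ f_psi(xj) + u_xi(xi) + u_xi(xj)

    def c_sips(xi, xj, gamma=1.0):
        # special case with u(x) = -gamma/2
        return f_psi(xi) @ f_psi(xj) - gamma

    xi, xj = rng.normal(size=p), rng.normal(size=p)
    print(sips(xi, xj), c_sips(xi, xj))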
  29. 29. Interpretation of SIPS and C-SIPS (Section 4.4). Returning to the setting of Section 2, consider a simple generative model of independent Poisson distributions with mean parameter E(w_{ij}) = \exp(h(x_i, x_j)). Then SIPS gives the generative model
      w_{ij} \overset{\text{indep.}}{\sim} \mathrm{Po}\big( \beta(x_i) \beta(x_j) \exp(\langle f_\psi(x_i), f_\psi(x_j) \rangle) \big),        (6)
      where \beta(x) := \exp(u_\xi(x)) > 0. Since \beta(x) can be regarded as the "importance weight" of data vector x, SIPS naturally incorporates the weight function \beta(x) into the probabilistic models used in a broad range of existing methods. Similarly, C-SIPS gives the generative model
      w_{ij} \overset{\text{indep.}}{\sim} \mathrm{Po}\big( \alpha \exp(\langle f_\psi(x_i), f_\psi(x_j) \rangle) \big),        (7)
      where \alpha := \exp(-\gamma) > 0 regulates the sparseness of \{w_{ij}\}. The generative model (7) was already proposed as the 1-view PMvGE (Okuno et al., 2018).
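An illustrative sketch (mine) of sampling link weights from the C-SIPS / 1-view PMvGE generative model (7), with a linear feature map standing in for f_ψ.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, K = 8, 5, 3
    X = 0.5 * rng.normal(size=(n, p))
    B = 0.5 * rng.normal(size=(p, K))          # a linear feature map stands in for f_psi
    alpha = 0.5                                # alpha = exp(-gamma) controls sparseness

    Y = X @ B
    rate = alpha * np.exp(Y @ Y.T)             # E(w_ij) for every pair
    W = rng.poisson(rate)
    W = np.triu(W, 1); W = W + W.T             # symmetric weights with zero diagonal
    print(W)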
  30. 30. Representation theorems: SIPS (proposed) approximates CPD similarities.
      Theorem 1 (representation theorem for SIPS). For some M, M' > 0 and K_* \in \mathbb{N}, let f_* : [-M, M]^p \to [-M', M']^{K_*} be a continuous map and g_* : [-M', M']^{2K_*} \to \mathbb{R} a CPD kernel. For arbitrary \varepsilon > 0, by specifying sufficiently large T, K, there exist multi-layer perceptrons (MLPs) f_\psi : \mathbb{R}^p \to \mathbb{R}^K and u_\xi : \mathbb{R}^p \to \mathbb{R} with T hidden units and ReLU or sigmoid activation such that
      \left| g_*(f_*(x), f_*(x')) - \big( \langle f_\psi(x), f_\psi(x') \rangle + u_\xi(x) + u_\xi(x') \big) \right| < \varepsilon \quad \text{for all } (x, x') \in [-M, M]^{2p}.
      Theorem 2. Similarly to Theorem 1, there exist an MLP f_\psi : \mathbb{R}^p \to \mathbb{R}^K and \gamma = O(r^2) such that
      \left| g_*(f_*(x), f_*(x')) - \big( \langle f_\psi(x), f_\psi(x') \rangle - \gamma \big) \right| < \varepsilon + O(r^{-2}).
      We propose SIPS for approximating arbitrary CPD similarities (e.g., the Poincaré distance). The slide also plots the approximation error (MSPE) against the output dimension, where approximation by IPS (existing) and SIPS (proposed) are similarly plotted.
  31. 31. 31
  32. 32. 32
  33. 33. Inner-product similarity h(x_i, x_j) with NNs: king − man + woman = queen. A similarity graph is first constructed from data vectors, and nodes are embedded into a lower-dimensional space where connected nodes are closer to each other (Cai et al., 2018). Embedding is often designed so that the inner product between two vector representations in Euclidean space expresses their similarity. In addition to its interpretability, the inner-product similarity has two desirable properties: (1) the vector representations are suitable for downstream tasks as feature vectors, because machine learning methods are often based on inner products (e.g., kernel methods); (2) simple vector arithmetic in the embedded space may represent similarity arithmetic, such as the "linguistic regularities" of word vectors (Mikolov et al., 2013b). The latter property comes from the distributive law of the inner product, \langle a + b, c \rangle = \langle a, c \rangle + \langle b, c \rangle, which decomposes the similarity of a + b and c into the sum of the two similarities. For seeking the word vector y' = y_{queen}, we maximize \langle y_{king} - y_{man} + y_{woman}, y' \rangle = \langle y_{king}, y' \rangle - \langle y_{man}, y' \rangle + \langle y_{woman}, y' \rangle, as in Eq. (3) of Levy and Goldberg (2014); thus analogy questions are solved with vector arithmetic.
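A tiny sketch (mine) of solving an analogy by maximizing the inner product ⟨y_king − y_man + y_woman, y'⟩ over a toy vocabulary; the word vectors are synthetic and constructed so that the analogy holds.

    import numpy as np

    rng = np.random.default_rng(0)
    words = ["king", "man", "woman", "queen", "apple", "car"]
    vecs = {w: rng.normal(size=8) for w in words}
    # make the toy vectors roughly satisfy king - man + woman ~ queen
    vecs["queen"] = vecs["king"] - vecs["man"] + vecs["woman"] + 0.05 * rng.normal(size=8)

    query = vecs["king"] - vecs["man"] + vecs["woman"]
    candidates = [w for w in words if w not in ("king", "man", "woman")]
    best = max(candidates, key=lambda w: query @ vecs[w])
    print(best)    # "queen" for this toy construction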
  34. 34. Minkowski IPS (MIPS). Consider a similarity h(x, x') = g_*(f_*(x), f_*(x')) with any kernel g_* : \mathbb{R}^{2K_*} \to \mathbb{R} and a continuous map f_* : \mathbb{R}^p \to \mathbb{R}^{K_*}. To approximate it, we consider the similarity model
      h(x_i, x_j) = \langle f_\psi(x_i), f_\psi(x_j) \rangle - \langle r_\zeta(x_i), r_\zeta(x_j) \rangle,        (9)
      where f_\psi : \mathbb{R}^p \to \mathbb{R}^{K_+} and r_\zeta : \mathbb{R}^p \to \mathbb{R}^{K_-} are neural networks whose parameters are \psi and \zeta, respectively. Since the kernel g(y, y') = \langle y_+, y'_+ \rangle - \langle y_-, y'_- \rangle with respect to y = (y_+, y_-) \in \mathbb{R}^{K_+ + K_-} is known as the inner product in Minkowski space (Naber, 2012), we call (9) the Minkowski IPS (MIPS) model.
      By replacing f_\psi(x) and r_\zeta(x) with (f_\psi(x)^\top, u_\xi(x), 1)^\top and u_\xi(x) - 1 \in \mathbb{R}, respectively, MIPS reduces to the SIPS model of eq. (3), so MIPS includes SIPS as a special case and therefore approximates any CPD similarity arbitrarily well; moreover, MIPS approximates more general similarities arbitrarily well. With its excessive degrees of freedom, the model is, in theory, shown to be capable of approximating the more general kernels considered in Ong et al. (2004). (The slide also shows the mean and standard deviation of MSPE against the output dimension, with ♯units = 1000, for the cosine similarity \langle y/\|y\|_2, y'/\|y'\|_2 \rangle, the negative squared distance -\|y - y'\|_2^2, and the Poincaré distance -d_{\text{Poincaré}}(y, y').)
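A sketch (mine) of the Minkowski inner product of eq. (9), with a numerical check of the stated reduction to SIPS via f_ψ(x) → (f_ψ(x), u_ξ(x), 1) and r_ζ(x) → u_ξ(x) − 1.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4
    f_i, f_j = rng.normal(size=K), rng.normal(size=K)     # f_psi(x_i), f_psi(x_j)
    u_i, u_j = rng.normal(), rng.normal()                  # u_xi(x_i), u_xi(x_j)

    def mips(fp_i, fp_j, r_i, r_j):
        # Minkowski inner product: <f, f'> - <r, r'>
        return fp_i @ fp_j - np.dot(np.atleast_1d(r_i), np.atleast_1d(r_j))

    sips = f_i @ f_j + u_i + u_j
    fp_i = np.concatenate([f_i, [u_i, 1.0]])
    fp_j = np.concatenate([f_j, [u_j, 1.0]])
    print(np.isclose(mips(fp_i, fp_j, u_i - 1.0, u_j - 1.0), sips))   # True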
  35. 35. Representation theorem for MIPS.
      Theorem A.1 (Representation theorem for MIPS). Symbols and assumptions are the same as those of Theorem 4.1, but g_* is a general kernel that is only required to be dominated by some PD kernel g (i.e., g - g_* is PD). For arbitrary \varepsilon > 0, by specifying sufficiently large K_+, K_- \in \mathbb{N} and T_+ = T_+(K_+), T_- = T_-(K_-) \in \mathbb{N}, there exist A \in \mathbb{R}^{K_+ \times T_+}, B \in \mathbb{R}^{T_+ \times p}, c \in \mathbb{R}^{T_+}, E \in \mathbb{R}^{K_- \times T_-}, F \in \mathbb{R}^{T_- \times p}, o \in \mathbb{R}^{T_-} such that
      \left| g_*(f_*(x), f_*(x')) - \big( \langle f_\psi(x), f_\psi(x') \rangle - \langle r_\zeta(x), r_\zeta(x') \rangle \big) \right| < \varepsilon
      for all (x, x') \in [-M, M]^{2p}, where f_\psi(x) = A\sigma(Bx + c) \in \mathbb{R}^{K_+} and r_\zeta(x) = E\sigma(Fx + o) \in \mathbb{R}^{K_-} are two-layer neural networks with T_+ and T_- hidden units, respectively, and \sigma(\cdot) is applied element-wise.
      In Theorem A.1 the kernel g_* is only required to be dominated by some PD kernel, so g_* is not limited to CPD kernels.
  36. 36. • • • 36
