深層意味表現学習 (Deep Semantic Representations)

This deck introduces research on representation learning, from semantic representations of words to semantic representations of relations, taking into account phenomena such as ambiguity, change of meaning, and analogy.

1. Deep Semantic Representation Learning
Danushka Bollegala, Associate Professor, University of Liverpool, UK
2. Does a word have meaning on its own?
No. Its meaning is determined only by the words that appear around it.
"You shall know a word by the company it keeps." (J. R. Firth, 1957)
Image credit: www.odlt.org
3. Quiz
• X can be carried around, lets you communicate with other people, lets you browse the web, and is convenient. Which of the following is X?
• a dog
• an airplane
• an iPhone
• a banana
4. But is that really true?
• After all, don't dictionaries define the meanings of words?
• A dictionary, too, explains a word's meaning by describing its relations to other words.
• Given a huge corpus, we can build a word's semantic representation just by collecting the words around it, which is convenient for NLP researchers.
• A practical approach to semantic representation:
• it has been applied successfully to many tasks, so as a semantic representation it is (quantitatively) valid.
• Does the meaning of a word depend on the task?
• Which tasks are good for this, and which are not?
5. Methods for constructing semantic representations
• Distributional semantic representations
• Represent a word x by the distribution of its co-occurrence frequencies with all the words that appear around it in a corpus.
• High-dimensional, sparse
• The classical approach
• Distributed semantic representations
• Represent the meaning of a word x as a combination/mixture of a limited number (10 to 1000) of dimensions/distributions/clusters.
• Low-dimensional, dense
• Recently popular thanks to the deep learning / representation learning boom
6. Approaches for building semantic representations
(The same overview as slide 5, shown again as a section divider: distributional representations are high-dimensional, sparse, and classical; distributed representations are low-dimensional, dense, and popular with deep learning.)
7. Constructing a distributional semantic representation
• Build a semantic representation for the word "apple" (リンゴ).
• S1 = Apples are red.
• S2 = Apples are delicious.
• S3 = Aomori Prefecture is famous as a production area for apples.
8. Constructing a distributional semantic representation
• Build a semantic representation for the word "apple".
• S1 = Apples are red.
• S2 = Red apples are delicious.
• S3 = Aomori Prefecture is famous as a production area for apples.
• apple = [(red, 2), (delicious, 1), (Aomori, 1), (production area, 1), (famous, 1)]
9. Application: measuring semantic similarity
• We want to measure the semantic similarity between "apple" and "mandarin" (みかん).
• First, build a semantic representation for "mandarin".
• S4 = Mandarins are orange-colored.
• S5 = Mandarins are delicious.
• S6 = Hyogo Prefecture is famous as a production area for mandarins.
• mandarin = [(orange-colored, 1), (delicious, 1), (Hyogo, 1), (production area, 1), (famous, 1)]
10. "Apple" and "mandarin"
• apple = [(red, 2), (delicious, 1), (Aomori, 1), (production area, 1), (famous, 1)]
• mandarin = [(orange-colored, 1), (delicious, 1), (Hyogo, 1), (production area, 1), (famous, 1)]
• The two words share the context words "delicious", "production area", and "famous", so "apple" and "mandarin" can be said to be fairly close in meaning.
• For a quantitative comparison, we can look at the overlap between the two sets of context words:
Jaccard coefficient = |apple AND mandarin| / |apple OR mandarin|
|apple AND mandarin| = |{delicious, production area, famous}| = 3
|apple OR mandarin| = |{red, delicious, Aomori, production area, famous, orange-colored, Hyogo}| = 7
sim(apple, mandarin) = 3/7 ≈ 0.429
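To make the recipe above concrete, here is a small Python sketch (not part of the original slides) that builds the two bag-of-context representations from pre-tokenized English versions of S1-S6 and computes the Jaccard coefficient; the tokenization and stop-word list are simplifying assumptions.

```python
from collections import Counter

def context_counts(target, tokenized_sentences,
                   stopwords=frozenset({"is", "are", "a", "as", "for", "of"})):
    """Count how often each word co-occurs with `target` in the same sentence."""
    counts = Counter()
    for tokens in tokenized_sentences:
        if target in tokens:
            counts.update(w for w in tokens if w != target and w not in stopwords)
    return counts

# Pre-tokenized toy sentences (S1-S3 and S4-S6 from the slides, in English).
apple_sents = [
    ["apples", "are", "red"],
    ["red", "apples", "are", "delicious"],
    ["Aomori", "is", "famous", "as", "a", "production-area", "for", "apples"],
]
mandarin_sents = [
    ["mandarins", "are", "orange-colored"],
    ["mandarins", "are", "delicious"],
    ["Hyogo", "is", "famous", "as", "a", "production-area", "for", "mandarins"],
]

apple = context_counts("apples", apple_sents)        # {'red': 2, 'delicious': 1, ...}
mandarin = context_counts("mandarins", mandarin_sents)

# Jaccard coefficient over the two sets of context words.
shared = set(apple) & set(mandarin)
union = set(apple) | set(mandarin)
print(sorted(shared), len(shared) / len(union))      # 3/7 ≈ 0.43
```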
11. Many finer design choices
• What to use as the context:
• the whole sentence (sentence-level co-occurrences)
• the n words before and after (proximity window)
• words in a dependency relation (dependencies)
• Weighting by the distance of the context:
• the farther away a co-occurrence is, the more its weight is discounted by the distance
• and so on
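A sketch of one of the design choices listed above: counting co-occurrences inside a proximity window of n tokens on each side and down-weighting each context word by the reciprocal of its distance to the target. The window size and the 1/distance weighting are illustrative choices, not prescribed by the slides.

```python
from collections import defaultdict

def windowed_cooccurrences(tokens, window=5):
    """Accumulate distance-weighted co-occurrence counts within a +/- `window` token window."""
    cooc = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[target][tokens[j]] += 1.0 / abs(i - j)  # closer contexts count more
    return cooc

tokens = ["私", "は", "みそ汁", "と", "ご飯", "を", "頂いた"]  # output of morphological analysis
cooc = windowed_cooccurrences(tokens, window=2)
print(dict(cooc["みそ汁"]))  # {'私': 0.5, 'は': 1.0, 'と': 1.0, 'ご飯': 0.5}
```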
12. Approaches for building semantic representations
(The overview slide repeated once more as a divider before moving on to distributed representations.)
13. Local vs. distributed representations (slide credit: Yoshua Bengio)
• Local representations: clustering, nearest neighbors, RBF SVMs, local non-parametric density estimation and prediction, decision trees, etc.
• Separate parameters for each distinguishable region; the number of distinguishable regions is linear in the number of parameters.
• Only a few neighboring points are involved when deciding the label of a given point.
• Distributed representations: factor models, PCA, RBMs, neural nets, sparse coding, deep learning, etc.
• Each parameter influences many regions, not just local neighbors; the number of distinguishable regions grows almost exponentially with the number of parameters.
• Generalize non-locally to never-seen regions.
• Example: three partitions define 8 regions (2^n representational capacity).
14. The skip-gram model
私はみそ汁とご飯を頂いた ("I had miso soup and rice.")
15. The skip-gram model
• Morphological analysis splits 私はみそ汁とご飯を頂いた into 私 / は / みそ汁 / と / ご飯 / を / 頂いた.
• Each word is assigned two d-dimensional vectors.
• When a word x is the target whose representation is being learned, its vector is called the target vector v(x) (shown in red on the slide).
• A context word c appearing around x is represented by the context vector v(c) (shown in blue).
16. The skip-gram model
• 私 / は / みそ汁 / と / ご飯 / を / 頂いた, with target vector v(x) and context vector v(c).
• For example, consider the problem of predicting whether ご飯 (rice) appears in the context of みそ汁 (miso soup).
17. The skip-gram model
• 私 / は / みそ汁 / と / ? / を / 頂いた
• Let c = ご飯 (rice) and c' = ケーキ (cake). We want to learn v(x), v(c), and v(c') that reflect the "meaning" that the combination (x = みそ汁, c = ご飯) is more plausible Japanese than (x = みそ汁, c' = ケーキ).
18. The skip-gram model
• Proposal 1: define this plausibility as the inner product of the two vectors:
score(x, c) = v(x)^T v(c)
19. The skip-gram model
• Proposal 2: the inner product, however, takes values in (-∞, +∞) and is not normalized, which is inconvenient. Dividing by the scores of all context words c' turns it into a probability.
20. The log-bilinear model [Mnih+Hinton ICML'07]
• The probability that c appears in the context of x:
p(c | x) = exp(v(x)^T v(c)) / Σ_{c' ∈ V} exp(v(x)^T v(c'))
• The numerator measures how readily x and c co-occur; the denominator sums the same score over every word c' in the vocabulary V.
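A minimal sketch of the log-bilinear probability above: p(c|x) is the softmax over inner products between the target vector v(x) and every context vector. The toy vocabulary, dimensionality, and random initialization are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["私", "は", "みそ汁", "と", "ご飯", "を", "頂いた", "ケーキ"]
d = 50
V = {w: rng.normal(size=d) for w in vocab}   # target vectors v(x)
C = {w: rng.normal(size=d) for w in vocab}   # context vectors v(c)

def p_context_given_target(c, x):
    """p(c | x) = exp(v(x)^T v(c)) / sum over c' of exp(v(x)^T v(c'))."""
    scores = np.array([V[x] @ C[cp] for cp in vocab])
    scores -= scores.max()                   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[vocab.index(c)]

print(p_context_given_target("ご飯", "みそ汁"))  # probability under the random, untrained vectors
```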
21. What is remarkable about this
• Visualizing skip-gram word vectors in two dimensions shows, for example, v(king) - v(man) + v(woman) ≈ v(queen).
• [Figure: two-dimensional PCA projection of the 1000-dimensional skip-gram vectors of countries and their capital cities (China, Japan, France, Russia, Germany, Italy, Spain, Greece, Turkey, Poland, Portugal and Beijing, Tokyo, Paris, Moscow, Berlin, Rome, Madrid, Athens, Ankara, Warsaw, Lisbon). The model organizes the concepts and implicitly learns the relationships between them, even though no supervised information about what a capital city means was provided during training.]
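A sketch of the analogy computation v(king) - v(man) + v(woman) ≈ v(queen). The function assumes a dict of pretrained word vectors (for example, from a trained skip-gram model), which is not provided here.

```python
import numpy as np

def solve_analogy(embeddings, a, b, c, topn=1):
    """Return the word(s) whose vector is most cosine-similar to v(b) - v(a) + v(c)."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    query /= np.linalg.norm(query)
    scores = []
    for word, vec in embeddings.items():
        if word in (a, b, c):                # exclude the query words themselves
            continue
        scores.append((float(query @ (vec / np.linalg.norm(vec))), word))
    return [w for _, w in sorted(scores, reverse=True)[:topn]]

# Usage (with pretrained embeddings): solve_analogy(emb, "man", "king", "woman") might return ["queen"].
```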
22. Our research results
23. The meaning of a word is not unique
• The same word can express different meanings depending on the situation in which it is used.
• 軽いノートPC "a light laptop" (positive) vs. 軽い男/女 "a frivolous man/woman" (negative)
• We therefore have to learn multiple semantic representations for the same word. [Neelakantan+ EMNLP-14]
• We also have to predict accurately the sense that is commonly used in a given domain.
• Domain adaptation of semantic representations [Bollegala+ ACL-15]
24. Pivots
• Words that carry similar meanings across different domains (semantically invariant words / semantic invariants)
• 値段 (price), 形 (shape), 安い (cheap), 高い (expensive) (excellent, cheap, digital)
• For pivots, we want their representations in the two domains to be close to each other.
• For non-pivot words, we want to be able to predict the pivots from them in each domain.
• Intuition: the different domains are brought close together through the pivots.
25. Loss function
• The loss is measured with a ranked hinge loss. [Collobert + Weston ICML-08]
• Using the pivots that appear in a review d, the representations are adjusted so that the prediction score of the non-pivots contained in d is higher than that of non-pivots that do not appear in d.
• The hinge loss associated with predicting a non-pivot w that co-occurs with a pivot c in a source document d ∈ D_S is
L(C_S, W_S) = Σ_{d ∈ D_S} Σ_{(c,w) ∈ d} Σ_{w* ~ p(w)} max(0, 1 - c_S^T w_S + c_S^T w*_S)
where (c, w) ∈ d denotes the co-occurrence of a pivot c and a non-pivot w in document d, c_S is the source-domain representation of the pivot c, w_S and w*_S are the source-domain representations of the non-pivots w ∈ d and w* ∉ d, and the negative non-pivots w* are randomly sampled.
26. The overall loss function
• Source and target losses:
L(C_S, W_S) = Σ_{d ∈ D_S} Σ_{(c,w) ∈ d} Σ_{w* ~ p(w)} max(0, 1 - c_S^T w_S + c_S^T w*_S)
L(C_T, W_T) = Σ_{d ∈ D_T} Σ_{(c,w) ∈ d} Σ_{w* ~ p(w)} max(0, 1 - c_T^T w_T + c_T^T w*_T)
where w* denotes non-pivots that do not occur in d, randomly sampled from p(w).
• Pivots are, by definition, common to both domains, so the source and target representations are linked through a pivot regularizer
R(C_S, C_T) = (1/2) Σ_{i=1}^{K} ||c_S^(i) - c_T^(i)||^2
where c^(i) is the i-th pivot in a collection of K pivots.
• The overall objective is L(C_S, W_S) + L(C_T, W_T) + λ R(C_S, C_T). It is minimized with mini-batch stochastic gradient descent (batches of 50, AdaGrad scheduling); it is not jointly convex, but it is convex in the representation of a single feature (pivot or non-pivot) when all the others are held fixed, and training converged in all cases.
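A sketch of one source-domain term of the ranked hinge loss and of the pivot regularizer defined above; the vectors are random stand-ins and the sampling of negative non-pivots w* is reduced to a single draw for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 30
c_S = rng.normal(size=d)        # source-domain representation of a pivot c
w_S = rng.normal(size=d)        # non-pivot w that co-occurs with c in document d
w_neg_S = rng.normal(size=d)    # non-pivot w* that does not occur in d

# Ranked hinge loss: c_S . w_S should exceed c_S . w*_S by a margin of 1.
loss = max(0.0, 1.0 - c_S @ w_S + c_S @ w_neg_S)

# Pivot regularizer linking the two domains: (1/2) * sum_i ||c_S^(i) - c_T^(i)||^2
C_S = rng.normal(size=(10, d))  # source-domain representations of 10 pivots
C_T = rng.normal(size=(10, d))  # target-domain representations of the same 10 pivots
reg = 0.5 * np.sum((C_S - C_T) ** 2)

print(loss, reg)
```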
27. Cross-domain sentiment classification results
• [Figure: accuracies obtained by NA, GloVe, SFA, SCL, CS, and the proposed method for each source-target domain pair (B, D, E, K). All methods use L2-regularized logistic regression as the binary sentiment classifier.]
• Using the appropriate domain-specific semantic representations improves sentiment classification performance!
28. Learning representations for relations between words
• How can we represent the relation that holds between two words? [Bollegala+ AAAI-15]
• If a word can be represented by a vector, then the relation between two words should be representable by a matrix.
• This "relation matrix" can be interpreted as selecting, from each word's semantic representation, only the attributes that contribute to the relation between the two words.
• [Figure: king and queen represented as binary vectors over attributes such as 男 (male), 女 (female), 王 (royal), with a relation matrix between them.]
29. Learning method
• Input: a relational graph whose vertices are words and whose edges are labelled with lexical patterns and weighted by co-occurrence strength. For example, from the context "ostrich is a large bird that lives in Africa" we extract the pattern "X is a large Y" between ostrich and bird and add an edge from ostrich to bird; observing "both ostrich and penguin are flightless birds" and "penguin is a bird" adds the remaining edges of Figure 1 (ostrich, bird, penguin with patterns "X is a large Y" [0.8], "X is a Y" [0.7], "both X and Y are flightless" [0.5]).
• Two words u and v are represented by vectors x(u), x(v) ∈ R^d and the pattern label l by a relation matrix G(l) ∈ R^{d×d}. Optimal word and pattern representations are learned by minimizing the squared error between the bilinear score and the co-occurrence strength w:
argmin_{x(u), G(l)} (1/2) Σ_{(u,v,l,w) ∈ E} ( x(u)^T G(l) x(v) - w )^2
where E is the edge set of the relational graph.
30. Optimization
• The objective is jointly non-convex in the variables x(u), G(l), and x(v).
• However, if any two of these variables are fixed, it becomes convex in the remaining one (provided that G(l) is positive semidefinite).
• We can therefore optimize the objective by taking its partial derivative with respect to each variable in turn and applying stochastic gradient descent.
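A toy sketch of gradient-based optimization of the squared loss (x(u)^T G(l) x(v) - w)^2 from slides 29-30. The dimensions, learning rate, and single training triple are illustrative assumptions, and the updates are joint rather than the alternating scheme described above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lr = 20, 0.01
x_u = rng.normal(scale=0.3, size=d)   # vector for word u (e.g. "ostrich")
x_v = rng.normal(scale=0.3, size=d)   # vector for word v (e.g. "bird")
G = 0.1 * np.eye(d)                   # relation matrix for pattern l (e.g. "X is a large Y")
w = 0.8                               # observed co-occurrence strength for (u, v, l)

for _ in range(500):
    err = x_u @ G @ x_v - w           # residual of the bilinear score
    # Partial derivatives of 0.5 * err**2 with respect to each variable.
    grad_xu = err * (G @ x_v)
    grad_xv = err * (G.T @ x_u)
    grad_G = err * np.outer(x_u, x_v)
    # Joint gradient steps for brevity; the paper alternates, fixing two variables at a time,
    # and the positive semidefiniteness of G is not enforced in this toy sketch.
    x_u -= lr * grad_xu
    x_v -= lr * grad_xv
    G -= lr * grad_G

print(x_u @ G @ x_v)                  # approaches the observed strength 0.8
```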
31. Analogy prediction performance
Accuracy (%) by relation category:
Method      capital-common  capital-world  city-in-state  family (gender)  currency  overall
SVD+LEX     11.43           5.43           0              9.52             0         3.84
SVD+POS     4.57            9.06           0              29.05            0         6.57
SVD+DEP     5.88            3.02           0              0                0         1.11
CBOW        8.49            5.26           4.95           47.82            2.37      10.58
skip-gram   9.15            9.34           5.97           67.98            5.29      14.86
GloVe       4.24            4.93           4.35           65.41            0         11.89
Prop+LEX    22.87           31.42          15.83          61.19            25.0      26.61
Prop+POS    22.55           30.82          14.98          60.48            20.0      25.35
Prop+DEP    20.92           31.40          15.27          56.19            20.0      24.68
32. Deriving relations from words
• v(king) - v(man) should represent the relation between "king" and "man"; otherwise analogy problems could not be solved (relational similarity could not be measured).
• If so, taking the difference between the semantic representation vectors of words connected by a particular relation should give a representation of that relation. [Bollegala+ IJCAI-15]
• The strength of association between a word pair (u, v) and a lexical pattern p is the positive pointwise mutual information
f(p, u, v) = max(0, log( g(p, u, v) g(*, *, *) / ( g(p, *, *) g(*, u, v) ) ))
where g counts co-occurrences and * denotes summation over the corresponding slot. A pattern p is represented by the set R(p) = {(u, v) | f(p, u, v) > 0} with norm |R(p)| = Σ_{(u,v) ∈ R(p)} f(p, u, v), and is embedded as the weighted sum of word-vector differences
p = (1 / |R(p)|) Σ_{(u,v) ∈ R(p)} f(p, u, v) (u - v)
33. Learning the representations
• A relation is represented as the set of lexical patterns it appears with, and the relation between u and v is given by the "subtraction" of their semantic representation vectors.
• Figure 1 example: the word pairs (lion, cat) and (ostrich, bird) co-occur respectively with the patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y". Assuming no other co-occurrences, the pattern representations are p1 = x1 - x2 and p2 = x3 - x4, and the relational similarity between the two pairs is measured by the inner product p1^T p2.
• Learning word representations is modelled as a binary classification task: learn vectors for lion, cat, ostrich, and bird such that they accurately predict whether a given pair of patterns is relationally similar (relationally similar and dissimilar pattern pairs are selected by an unsupervised method). With a target label t(p1, p2) ∈ {0, 1}, the prediction loss for a pattern pair is the squared loss
L(t(p1, p2), p1, p2) = (1/2) ( t(p1, p2) - σ(p1^T p2) )^2
where the non-linear prediction function σ is tanh (which worked best in preliminary experiments), and the word vectors are updated with the gradients of this loss.
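A sketch of the pattern representation from slides 32-33: a pattern is embedded as the PPMI-weighted average of the differences u - v over the word pairs it co-occurs with, and two patterns are compared through σ(p1^T p2). The word vectors and PPMI weights here are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 25
word_vec = {w: rng.normal(size=d) for w in ["lion", "cat", "ostrich", "bird"]}

def pattern_embedding(pairs_with_ppmi, word_vec):
    """p = (1 / |R(p)|) * sum over (u, v) in R(p) of f(p, u, v) * (u - v)."""
    norm = sum(f for _, _, f in pairs_with_ppmi)
    emb = np.zeros_like(next(iter(word_vec.values())))
    for u, v, f in pairs_with_ppmi:
        emb += f * (word_vec[u] - word_vec[v])
    return emb / norm

# R(p1): pairs co-occurring with "large Ys such as Xs"; R(p2): pairs with "X is a huge Y".
p1 = pattern_embedding([("lion", "cat", 0.8)], word_vec)
p2 = pattern_embedding([("ostrich", "bird", 0.7)], word_vec)

print(np.tanh(p1 @ p2))   # sigma(p1^T p2): predicted relational similarity of the two patterns
```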
34. Analogy prediction performance
Table: word analogy results on benchmark datasets.
Method               sem.   synt.  total  SAT    SemEval
ivLBL CosAdd         63.60  61.80  62.60  20.85  34.63
ivLBL CosMult        65.20  63.00  64.00  19.78  33.42
ivLBL PairDiff       52.60  48.50  50.30  22.45  36.94
skip-gram CosAdd     31.89  67.67  51.43  29.67  40.89
skip-gram CosMult    33.98  69.62  53.45  28.87  38.54
skip-gram PairDiff   7.20   19.73  14.05  35.29  43.99
CBOW CosAdd          39.75  70.11  56.33  29.41  40.31
CBOW CosMult         38.97  70.39  56.13  28.34  38.19
CBOW PairDiff        5.76   13.43  9.95   33.16  42.89
GloVe CosAdd         86.67  82.81  84.56  27.00  40.11
GloVe CosMult        86.84  84.80  85.72  25.66  37.56
GloVe PairDiff       45.93  41.23  43.36  44.65  44.67
Prop CosAdd          86.70  85.35  85.97  29.41  41.86
Prop CosMult         86.91  87.04  86.98  28.87  39.67
Prop PairDiff        41.85  42.86  42.40  45.99  44.88
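For reference, a sketch of the three analogy scoring functions compared in the table above (CosAdd, CosMult, PairDiff), written for a proportional analogy a : b :: c : d; shifting the cosines into [0, 1] for CosMult is a common implementation choice, not something specified on the slide.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def cos(u, v):
    return float(_unit(u) @ _unit(v))

# Scoring functions for a proportional analogy a : b :: c : d (higher = better candidate d).
def cos_add(a, b, c, d):
    return cos(d, b) - cos(d, a) + cos(d, c)

def cos_mult(a, b, c, d, eps=1e-8):
    f = lambda u, v: (cos(u, v) + 1.0) / 2.0   # shift cosines to [0, 1]
    return f(d, b) * f(d, c) / (f(d, a) + eps)

def pair_diff(a, b, c, d):
    return cos(b - a, d - c)

# Toy usage: in practice, score every candidate d in the vocabulary and pick the argmax.
rng = np.random.default_rng(5)
a, b, c, d = (rng.normal(size=8) for _ in range(4))
print(cos_add(a, b, c, d), cos_mult(a, b, c, d), pair_diff(a, b, c, d))
```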
35. Corpus vs. dictionary
• With only a corpus we can learn distributed semantic representations of words (and relations).
• However, word meanings are already defined in dictionaries, which humans have spent many years building.
• Can we learn more accurate semantic representations by using both? [Bollegala+ AAAI-15]
• In particular, when the corpus is incomplete, a dictionary (ontology) helps.
• Example: 私は犬と猫が好きだ. ("I like dogs and cats.")
36. JointReps
• Predict the words that co-occur within the same sentence in the corpus, and minimize the error (objective function) that arises in doing so.
• Add the semantic relations defined in a dictionary (WordNet) as constraints.
• Corpus objective (GloVe), where each word has a target vector w_i and a context vector w̃_j, X_ij is their distance-discounted co-occurrence count, b_i and b̃_j are bias terms, and f discounts co-occurrences between frequent words:
J_C = (1/2) Σ_{i ∈ V} Σ_{j ∈ V} f(X_ij) ( w_i^T w̃_j + b_i + b̃_j - log X_ij )^2
• Lexicon regularizer for a semantic relation R, where R(i, j) = 1 if the relation holds between w_i and w_j in the lexicon and 0 otherwise:
J_S = (1/2) Σ_{i ∈ V} Σ_{j ∈ V} R(i, j) ( w_i - w̃_j )^2
This captures the three-way co-occurrence between w_i, w_j, and the relation R, and helps when plain co-occurrences are rare.
37. Measuring semantic similarity between words
• Evaluation: Spearman correlation between human-assigned similarity scores and the similarities produced by the algorithm (RG, MC, RW, SCWS, MEN), accuracy on word-analogy questions (sem, syn, total), and MaxDiff scores on SemEval.
• Various semantic relations can be used as constraints; the synonymy relation is the most effective.
Table 1: performance of the proposed method with different semantic relation types.
Method            RG      MC      RW      SCWS    MEN     sem    syn    total  SemEval
corpus only       0.7523  0.6398  0.2708  0.460   0.6933  61.49  66.00  63.95  37.98
Synonyms          0.7866  0.7019  0.2731  0.4705  0.7090  61.46  69.33  65.76  38.65
Antonyms          0.7694  0.6417  0.2730  0.4644  0.6973  61.64  66.66  64.38  38.01
Hypernyms         0.7759  0.6713  0.2638  0.4554  0.6987  61.22  68.89  65.41  38.21
Hyponyms          0.7660  0.6324  0.2655  0.4570  0.6972  61.38  68.28  65.15  38.30
Member-holonyms   0.7681  0.6321  0.2743  0.4604  0.6952  61.69  66.36  64.24  37.95
Member-meronyms   0.7701  0.6223  0.2739  0.4611  0.6963  61.61  66.31  64.17  37.98
Part-holonyms     0.7852  0.6841  0.2732  0.4650  0.7007  61.44  67.34  64.66  38.07
Part-meronyms     0.7786  0.6691  0.2761  0.4679  0.7005  61.66  67.11  64.63  38.29
Table 2: comparison against prior work.
Method               RG     MEN    sem    syn
RCM                  0.471  0.501  -      29.9
R-NET                -      -      32.64  43.46
C-NET                -      -      37.07  40.06
RC-NET               -      -      34.36  44.42
Retro (CBOW)         0.577  0.605  36.65  52.5
Retro (SG)           0.745  0.657  45.29  65.65
Retro (corpus only)  0.786  0.673  61.11  68.14
Proposed (synonyms)  0.787  0.709  61.46  69.33
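To make the JointReps objective evaluated above concrete, here is a toy sketch of the corpus loss J_C plus the lexicon regularizer J_S from slide 36; the co-occurrence matrix, relation indicator, and all hyperparameters are invented for illustration and do not reproduce the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n_words, d, x_max, alpha, lam = 6, 10, 100.0, 0.75, 0.1

X = rng.integers(0, 50, size=(n_words, n_words)).astype(float) + 1.0   # toy co-occurrence counts
R = np.zeros((n_words, n_words))
R[0, 1] = R[1, 0] = 1.0                  # e.g. words 0 and 1 are synonyms in the lexicon
W = rng.normal(scale=0.1, size=(n_words, d))        # target-word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(n_words, d))  # context-word vectors w~_j
b = np.zeros(n_words)
b_tilde = np.zeros(n_words)

def discount(t):
    """GloVe-style discounting of frequent co-occurrences."""
    return np.minimum((t / x_max) ** alpha, 1.0)

def joint_objective(W, W_tilde, b, b_tilde):
    inner = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    J_C = 0.5 * np.sum(discount(X) * (inner - np.log(X)) ** 2)          # corpus (GloVe) loss
    diffs = W[:, None, :] - W_tilde[None, :, :]                          # w_i - w~_j for all i, j
    J_S = 0.5 * np.sum(R * np.sum(diffs ** 2, axis=-1))                  # lexicon regularizer
    return J_C + lam * J_S

print(joint_objective(W, W_tilde, b, b_tilde))
```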
38. Remaining open problems
• Is predicting word co-occurrences really the best task for learning semantic representations?
• We know nothing about the space formed by the word representation vectors.
• We do not even know whether vectors are sufficient in the first place.
• How should the meaning of sentences and documents be represented? (compositional semantics)
• How should multiple languages and ambiguity be handled?
39. 御免 - sorry + thanks = 有難う ("gomen" - sorry + thanks = "arigatou")
Danushka Bollegala
www.csc.liv.ac.uk/~danushka
danushka.bollegala@liverpool.ac.uk
@Bollegala