
Harnessing Deep Neural Networks with Logic Rules

Slides for "Harnessing Deep Neural Networks with Logic Rules" (ACL 2016), presented at the 8th Advanced NLP Study Group (最先端NLP勉強会).


  1. Harnessing Deep Neural Networks with Logic Rules
     Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric P. Xing (ACL 2016)
     Presenter: Sho Takase, Tohoku University
     2016/09/12, 8th Advanced NLP Study Group
     Figures and tables in these slides are taken from [Hu+ 16].
  2. Goal
     • Incorporate general rules and human intuition into neural networks
       – In sentiment analysis, the polarity of a sentence "A but B" should match that of B
       – In named entity recognition, I-ORG cannot follow B-PER
     • Express such rules and intuitions in first-order logic
       – equal(y_{i-1}, B-PER) ⇒ ¬equal(y_i, I-ORG)
     • Use the logic rules as constraints during training
  3. Method overview
     • p_θ(y|x): output of the model (any neural network, e.g., a CNN)
     • q(y|x): the model's output after the constraints are imposed
     • Alternate between computing q(y|x) and updating the parameters θ
       – Posterior regularization [Ganchev+ 10] applied to neural networks
  4. Method overview (cont.)
     • The same overview, with the figure annotated step by step:
       the update of θ and the computation of q(y|x)
  5. Learning the parameters
     • Two goals:
       1. Predict the correct labels on the training examples
       2. Produce outputs similar to the constrained model q(y|x)
     • Objective: weight goals 1 and 2 with π and add them;
       loss is a loss function (cross-entropy here)

       min_θ (1/N) Σ_{n=1}^{N} (1-π) loss(y_n, σ_θ(x_n)) + π loss(q(y|x_n), σ_θ(x_n))

       where σ_θ(x_n) is the model's prediction for x_n;
       the first term corresponds to goal 1, the second to goal 2
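The mixed objective on this slide can be sketched in a few lines. The helper names below are mine and the numbers purely illustrative; this is a minimal sketch, not the authors' implementation:

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -np.sum(target * np.log(pred + eps))

def mixed_loss(y_true, q_soft, p_pred, pi):
    """(1-π)·loss(y_n, σθ(x_n)) + π·loss(q(y|x_n), σθ(x_n)), as on the slide."""
    return ((1.0 - pi) * cross_entropy(y_true, p_pred)
            + pi * cross_entropy(q_soft, p_pred))

# Illustrative numbers: gold label positive, teacher q leans positive,
# current model p_θ still unsure.
y_true = np.array([1.0, 0.0])   # one-hot gold label
q_soft = np.array([0.9, 0.1])   # rule-constrained teacher output q(y|x_n)
p_pred = np.array([0.6, 0.4])   # model output σθ(x_n)
loss = mixed_loss(y_true, q_soft, p_pred, pi=0.5)
```

Setting π = 0 recovers plain supervised cross-entropy; setting π = 1 trains purely against the teacher.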
  6. Computing q(y|x) (1/2)
     • Two goals:
       1. Satisfy the constraints (logic rules)
       2. Stay close to the model's output p_θ(y|x)
     • Objective (Eq. 3 in the paper):

       min_{q, ξ≥0}  KL(q(Y|X) || p_θ(Y|X)) + C Σ_{l,g_l} ξ_{l,g_l}
       s.t.  λ_l (1 − E_q[r_{l,g_l}(X, Y)]) ≤ ξ_{l,g_l},
             g_l = 1, …, G_l,  l = 1, …, L

       – The KL term corresponds to goal 2; the constraints to goal 1
       – ξ are slack variables that relax the constraints;
         C is the regularization parameter
       – r_{l,g_l}(X, Y) is a continuous truth value in [0, 1], equal to 1
         when grounding g_l satisfies rule r_l
       – λ_l is the strength of rule l (large λ = a rule that must be satisfied)
  7. Computing q(y|x) (2/2)
     • q(y|x) can be solved analytically (via the Lagrangian dual)
       – Same solution technique as posterior regularization [Ganchev+ 10]

       q*(Y|X) ∝ p_θ(Y|X) exp{ − Σ_{l,g_l} C λ_l (1 − r_{l,g_l}(X, Y)) }

     • The strength of a constraint is determined by C (a constant)
       and λ_l (a per-rule value)
     • When a rule is violated, q(y|x) becomes small
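The closed-form projection is easy to compute once the soft truth values are known. A minimal sketch; the function name and all numbers are my own illustration:

```python
import numpy as np

def teacher_q(p, rule_truth, C, lam):
    """Closed-form teacher (Eq. 4): q*(y|x) ∝ p_θ(y|x) · exp(-C·λ·(1 - r(x, y))).

    p          -- model distribution over the K labels, shape (K,)
    rule_truth -- soft truth value r(x, y) in [0, 1] for each label, shape (K,)
    C, lam     -- regularization strength and per-rule confidence
    """
    unnorm = p * np.exp(-C * lam * (1.0 - rule_truth))
    return unnorm / unnorm.sum()

# An "A but B" sentence whose clause B looks positive: the rule's truth
# value is higher under the positive label (0.9 vs. 0.6, as on slide 9
# with σθ(B)+ = 0.8), so q shifts probability mass toward positive even
# though p_θ is undecided.
p = np.array([0.5, 0.5])    # [positive, negative]
r = np.array([0.9, 0.6])    # soft truth of the rule under each label
q = teacher_q(p, r, C=6.0, lam=1.0)
```

When every label satisfies the rule (r = 1 everywhere), the exponential factor is 1 and q reduces to p, matching the intuition that only violations are penalized.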
  8. The rules
     • Expressed in first-order logic
       – The polarity of a sentence "A but B" matches that of B
     • Converted to continuous truth values in [0, 1] in the framework of
       probabilistic soft logic; the Boolean operators are reformulated as:

       A & B = max{A + B − 1, 0}
       A ∨ B = min{A + B, 1}
       A_1 ∧ … ∧ A_N = Σ_i A_i / N
       ¬A = 1 − A

     • The "A but B" rule is written as (Eq. 5 in the paper):

       has-'A-but-B'-structure(S) ⇒
         (1(y = +) ⇒ σ_θ(B)_+  ∧  σ_θ(B)_+ ⇒ 1(y = +))
  9. Computing a rule: an example
     • The polarity of "A but B" matches that of B
       – 1(y = +) is 1 when the sentence is positive and 0 otherwise
       – σ_θ(B)_+ is the probability the model assigns to clause B being positive
     • Expanding the rule body with the soft-logic operators:

       (1(y = +) ⇒ σ_θ(B)_+) ∧ (σ_θ(B)_+ ⇒ 1(y = +))
       = (¬1(y = +) ∨ σ_θ(B)_+) ∧ (¬σ_θ(B)_+ ∨ 1(y = +))
       = min{1 − 1(y = +) + σ_θ(B)_+, 1} ∧ min{1 − σ_θ(B)_+ + 1(y = +), 1}

     • Sentence positive (y = +): truth value (1 + σ_θ(B)_+) / 2
     • Sentence negative: truth value (2 − σ_θ(B)_+) / 2
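The derivation above can be checked mechanically with the soft-logic operators from slide 8. A small sketch with helper names of my own choosing:

```python
def soft_neg(a):
    """¬A = 1 - A"""
    return 1.0 - a

def soft_or(a, b):
    """A ∨ B = min{A + B, 1}"""
    return min(a + b, 1.0)

def soft_avg_and(*terms):
    """A1 ∧ ... ∧ AN = Σ Ai / N (the averaging conjunction)"""
    return sum(terms) / len(terms)

def implies(a, b):
    """A ⇒ B, rewritten as ¬A ∨ B"""
    return soft_or(soft_neg(a), b)

def but_rule_truth(y_is_positive, sigma_b_pos):
    """Soft truth of (1(y=+) ⇒ σθ(B)+) ∧ (σθ(B)+ ⇒ 1(y=+))."""
    ind = 1.0 if y_is_positive else 0.0
    return soft_avg_and(implies(ind, sigma_b_pos),
                        implies(sigma_b_pos, ind))

# Matches the slide: (1 + σθ(B)+)/2 when y = +, (2 - σθ(B)+)/2 otherwise.
pos_truth = but_rule_truth(True, 0.8)    # ≈ 0.9
neg_truth = but_rule_truth(False, 0.8)   # ≈ 0.6
```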
  10. Experiments: overview
     • Verify whether training with the constraints improves performance
     • Tasks: sentiment analysis and named entity recognition
       – Evaluate the performance of both p and q
  11. Experimental setup (sentiment analysis)
     • Binary positive/negative classification
     • Datasets:
       – Stanford Sentiment Treebank (SST2)
       – Movie Review (MR)
       – Customer Review (CR)
     • Baseline: a (simple) CNN
       – Same method as [Kim+ 14]
     • Applied constraint (rule):
       – The polarity of "A but B" matches that of B
       – Rule strength λ = 1
     • Training follows Algorithm 1 of the paper:
       Input: training data D = {(x_n, y_n)}, rule set R = {(R_l, λ_l)},
              imitation parameter π, regularization strength C
       1: Initialize the network parameters θ
       2: repeat
       3:   Sample a minibatch (X, Y) ⊂ D
       4:   Construct the teacher network q with Eq. (4)
       5:   Transfer knowledge into p_θ by updating θ with Eq. (2)
       6: until convergence
       Output: distilled student network p_θ and teacher network q
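Algorithm 1 can be sketched end to end on a toy problem. Everything below (the data, the single rule, and the softmax-regression "network") is hypothetical and exists only to show the alternation of steps 4 and 5; it is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: 2 features, 2 classes; class 1 iff feature 1 > 0.
X = rng.normal(size=(200, 2))
Y = (X[:, 1] > 0).astype(int)
W = np.zeros((2, 2))                       # softmax-regression weights (K x d)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def p_theta(x):                            # the "student" network p_θ(y|x)
    return softmax(x @ W.T)

def rule_truth(x):                         # hypothetical rule: prefer class 1
    r = np.ones((len(x), 2))               # whenever feature 1 is positive
    r[x[:, 1] > 0, 0] = 0.2                # class 0 violates the rule there
    return r

C, lam, pi, lr = 2.0, 1.0, 0.5, 0.1
onehot = np.eye(2)[Y]
for step in range(200):
    p = p_theta(X)
    # Step 4: build the teacher q by projecting p into the rule space (Eq. 4).
    q = p * np.exp(-C * lam * (1.0 - rule_truth(X)))
    q /= q.sum(axis=1, keepdims=True)
    # Step 5: update θ on the mixed hard-label / teacher objective (Eq. 2);
    # for cross-entropy the gradient w.r.t. the logits is p - target.
    target = (1.0 - pi) * onehot + pi * q
    W -= lr * ((p - target).T @ X) / len(X)
```

Because cross-entropy is linear in the target distribution, the mixed objective gives the simple `p - target` gradient used above, with the full batch standing in for a minibatch.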
  12. Results (sentiment analysis) (1/3)
     • Improves over the baseline [Kim+ 14]
     • State-of-the-art on MR and CR
     • Comparable to MVCNN (multiple word-vector sets plus a more complex
       CNN: multichannel and multilayer)

       Accuracy (%) of sentiment classification (Table 1 in the paper):
          Model                                   SST2   MR        CR
        1 CNN (Kim, 2014)                         87.2   81.3±0.1  84.3±0.2
        2 CNN-Rule-p                              88.8   81.6±0.1  85.0±0.3
        3 CNN-Rule-q                              89.3   81.7±0.1  85.3±0.3
        4 MGNC-CNN (Zhang et al., 2016)           88.4   –         –
        5 MVCNN (Yin and Schutze, 2015)           89.4   –         –
        6 CNN-multichannel (Kim, 2014)            88.1   81.1      85.0
        7 Paragraph-Vec (Le and Mikolov, 2014)    87.8   –         –
        8 CRF-PR (Yang and Cardie, 2014)          –      –         82.7
        9 RNTN (Socher et al., 2013)              85.4   –         –
       10 G-Dropout (Wang and Manning, 2013)      –      79.0      82.1

       (CNN-Rule-p is the student network, CNN-Rule-q the teacher; MR and CR
        results are 10-fold cross-validation averages ± one standard deviation)
  13. Results (sentiment analysis) (2/3)
     • Is it necessary to compute p_θ(y|x) and q(y|x) alternately?
       – How do variants that optimize only one of them, or compute one
         after fully optimizing the other, perform?

       Accuracy (%) on SST2 (Table 2 in the paper):
       1  CNN (Kim, 2014)   87.2
       2  -but-clause       87.3
       3  -ℓ2-reg           87.5
       4  -project          87.9   (train the CNN, then compute q(y|x))
       5  -opt-project      88.3   (optimize q(y|x))
       6  -pipeline         87.9   (optimize q(y|x), then train the CNN)
       7  -Rule-p           88.8
       8  -Rule-q           89.3

     • Alternating the two computations performs best
  14. Results (sentiment analysis) (3/3)
     • Effect of training-data size, and use of unlabeled data
     • Using unlabeled data improves performance
       – The constraints let the model exploit unlabeled data effectively

       Accuracy (%) on SST2 with varying amounts of labeled data (Table 3):
       Data size        5%    10%   30%   100%
       1  CNN           79.9  81.6  83.6  87.2
       2  -Rule-p       81.5  83.2  84.5  88.8
       3  -Rule-q       82.5  83.9  85.6  89.3
       4  -semi-PR      81.5  83.1  84.6  –
       5  -semi-Rule-p  81.7  83.3  84.7  –
       6  -semi-Rule-q  82.7  84.2  85.7  –

       (rows 1–3: trained on the given percentage of labeled data;
        rows 4–6: trained on that labeled data plus the remaining
        examples as unlabeled data)
  15. Experimental setup (named entity recognition)
     • Task: recognize four entity types (PER, ORG, LOC, MISC)
     • Dataset: CoNLL-2003
       – BIOES tagging scheme (as in [Lample+ 16] and others)
     • Baseline: a bidirectional LSTM
       – [Chiu and Nichols, 15] with the CNN (character representation) removed
     • Applied constraints (rules):
       – The output tag sequence must be well formed
         • e.g., equal(y_{i−1}, B-PER) ⇒ ¬equal(y_i, I-ORG)
         • Rule strength λ = ∞ (a hard constraint, preventing any violation)
       – Within a list, entities at corresponding positions have the same type
         • In "1. Juventus, 2. Barcelona, 3. …", Juventus and Barcelona
           receive tags of the same type
         • Encoded as (Eq. 7 in the paper):
           is-counterpart(X, A) ⇒ 1 − ||c(e_y) − c(σ_θ(A))||₂,
           where e_y is the one-hot encoding of the class prediction of X
         • Rule strength λ = 1
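The hard transition rule can be made concrete as a validity check over adjacent BIOES tags. This is a sketch of the standard BIOES constraints with a helper name of my own, not the paper's encoding of its transition rule:

```python
def valid_bioes_transition(prev, cur):
    """May tag `cur` follow tag `prev` under the BIOES scheme?

    Encodes the slide's hard rule, e.g.
    equal(y_{i-1}, B-PER) ⇒ ¬equal(y_i, I-ORG): an I-* or E-* tag must
    continue a same-type entity opened by a B-* or I-* tag.
    """
    def leaves_entity_open(tag):
        return tag[0] in ("B", "I")        # B-*/I-* expect a continuation

    if cur == "O" or cur[0] in ("B", "S"):
        # O, B-*, S-* may only follow tags that close or avoid an entity.
        return not leaves_entity_open(prev)
    # cur is I-Y or E-Y: previous tag must be B-Y or I-Y of the same type Y.
    return leaves_entity_open(prev) and prev[2:] == cur[2:]

# e.g. valid_bioes_transition("B-PER", "I-ORG") is False
```

Setting λ = ∞ for this rule amounts to assigning q(y|x) zero mass to any tag sequence containing an invalid pair.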
  16. Results (named entity recognition)
     • Improves over the baseline
     • Comparable to a method using large external resources [Luo+ 15] and
       to a neural network with many more parameters [Ma and Hovy, 16]

       F1 on CoNLL-2003 (Table 4 in the paper):
       1  BLSTM                                     89.55
       2  BLSTM-Rule-trans (transition rules only)  p: 89.80, q: 91.11
       3  BLSTM-Rules (plus the list rule)          p: 89.93, q: 91.18
       4  NN-lex (Collobert et al., 2011)           89.59
       5  S-LSTM (Lample et al., 2016)              90.33
       6  BLSTM-lex (Chiu and Nichols, 2015)        90.77
       7  BLSTM-CRF1 (Lample et al., 2016)          90.94
       8  Joint-NER-EL (Luo et al., 2015)           91.20
       9  BLSTM-CRF2 (Ma and Hovy, 2016)            91.21
  17. Summary
     • Proposed a method for incorporating general rules and human
       intuition into neural networks
       – Rules are expressed in first-order logic
       – Converted to continuous values in [0, 1] with probabilistic soft logic
       – Used as constraints during training
     • Experiments on sentiment analysis and named entity recognition
       – Introducing the constraints improves performance
       – Performance comparable to much more complex networks
