



Slides for Paper Introduction 20181126

Published in: Education


  1. Presenter: Z. Chen
  2. Points
     • A neural discriminative constituency parser (F1 93.55)
     • Chart parser/decoder
     • Encoder-decoder style discriminative constituency parser: the architecture
     • Structural meaning of multi-headed self-attention for constituency parsing
     • 8-layer, 8-head Transformer + BiLSTM decoder
     • Analysis by input ablation: word, POS and position
     • Position or content (POS ⟺ morphology, ELMo/CharConcat)
     • Metric of tree structure accuracy: PARSEVAL
  3. Constituency Parsing
     Grammar structure: CKY algorithm, Chomsky CFG
     Transition-based parser vs. chart parser
     NLP tutorial 10 (11↑):
     • Probability as score
     • Bottom-up combine (bracketing per se)
     • Beam search
     Godfather: Transformer. Input: word + POS + position. Decomposition.
     (Diagram labels: 3 4 5 6 7)
     A BiLSTM for fence points
  4. Incrementally build up
     <bos> W0 W1 W2 W3 W4 <eos>
     CKY ⇊ fence points ⇊
  5. Incrementally build up
     Score for a bracket (decoder): s(i, j, l), where i, j are fence points and l is a label.
     How to deal with non-phrases?
     • CKY: small probability (PCFG)
     • Chen (me): <nil> tag / vector
     • This research: s(i, j, ∅) = 0
     ↕ Train with ∅ or <nil>.
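The bracket scoring on slide 5 can be sketched as a small CKY-style chart search. This is a minimal sketch under stated assumptions, not the paper's implementation: span scores arrive as a dictionary keyed by fence-point pairs, and the names `cky_decode`, `span_scores`, and `EMPTY` are hypothetical. Every span may fall back to the empty label at score 0, so a non-phrase costs nothing and needs no learned <nil> score.

```python
from functools import lru_cache

EMPTY = "<nil>"  # hypothetical name for the empty label (∅ on the slide)

def cky_decode(span_scores, n):
    """Find the best-scoring binary bracketing over fence points 0..n.

    span_scores maps (i, j) -> {label: score}. Labels that are absent
    are never chosen; the empty label is always available at score 0.
    Returns (total_score, tree) with tree = (label, i, j, children).
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        # Best label for span (i, j), starting from s(i, j, EMPTY) = 0.
        label, label_score = EMPTY, 0.0
        for candidate, score in span_scores.get((i, j), {}).items():
            if score > label_score:
                label, label_score = candidate, score
        if j - i == 1:
            return label_score, (label, i, j, [])
        # Best split point k strictly between the two fence points.
        split_score, k = max(
            (best(i, k)[0] + best(k, j)[0], k) for k in range(i + 1, j)
        )
        children = [best(i, k)[1], best(k, j)[1]]
        return label_score + split_score, (label, i, j, children)

    return best(0, n)

# Toy scores for a 3-word sentence (made-up numbers, for illustration only).
toy_scores = {
    (0, 2): {"NP": 2.0},
    (2, 3): {"VP": 1.0},
    (1, 3): {"VP": 0.5},
    (0, 3): {"S": 3.0},
}
total, tree = cky_decode(toy_scores, 3)
```

In this toy case the chart prefers the split at fence point 2, so the NP over (0, 2) and the VP over (2, 3) end up under S, and the single words keep the empty label.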
  6. Encoder: linguistic information
     Word embedding + POS embedding + position embedding → input Z (d_model × T)
     Component-wise add: z_t = w_t + m_t + p_t
     From here on, z_t is sent to the Transformer, and the dimension d_model is kept throughout the encoder.
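The component-wise add on slide 6 can be illustrated in a few lines. This is only a sketch with made-up numbers; the function name `add_embeddings` and the list-of-lists layout (one row per token) are assumptions for illustration.

```python
def add_embeddings(word_vecs, tag_vecs, position_vecs):
    """Return z_t = w_t + m_t + p_t for every position t.

    Each argument is a list of T vectors of length d_model, so the
    result keeps the same T x d_model shape as the inputs.
    """
    return [
        [w + m + p for w, m, p in zip(w_t, m_t, p_t)]
        for w_t, m_t, p_t in zip(word_vecs, tag_vecs, position_vecs)
    ]

# Toy example with T = 2 tokens and d_model = 3 (made-up values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # word embeddings w_t
M = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]   # POS embeddings m_t
P = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]   # position embeddings p_t
Z = add_embeddings(W, M, P)
```

Because the three sources are summed into the same d_model coordinates, nothing downstream can tell which part of z_t came from content and which from position, which is the issue the later ablation slides look at.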
  7. Encoder: linguistic information
     (Diagram: z_t, x_t, y_t)
  8. Encoder: linguistic information
     q_t = W_Q^T x_t, k_t = W_K^T x_t, v_t = W_V^T x_t
     (Diagram: x_i yields q_i, k_i, v_i; position j yields q_j, k_j, v_j; attention weight p(i → j); output v̄_i)
     “gather information from up to 8 remote locations”
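The projections on slide 8 can be sketched as a single attention head in plain Python. This is a sketch under assumptions: the slide does not spell out how p(i → j) is computed, so the code uses the standard Transformer choice of a softmax over dot products scaled by sqrt(d_k); the names `matvec_t` and `attention_head` are hypothetical.

```python
import math

def matvec_t(W, x):
    """Compute W^T x, with W stored as a list of len(x) rows."""
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(len(W[0]))]

def attention_head(X, W_Q, W_K, W_V):
    """One self-attention head over the token vectors X.

    Returns (outputs, probs), where probs[i][j] plays the role of
    p(i -> j) and outputs[i] is the weighted average of the v_j.
    """
    Q = [matvec_t(W_Q, x) for x in X]
    K = [matvec_t(W_K, x) for x in X]
    V = [matvec_t(W_V, x) for x in X]
    d_k = len(Q[0])
    outputs, probs = [], []
    for q in Q:
        # Scaled dot-product logits (assumed, as in the standard Transformer).
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]  # subtract max for stability
        total = sum(exps)
        p = [e / total for e in exps]
        probs.append(p)
        outputs.append(
            [sum(p[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, probs

# Toy input: two tokens, identity projections (made-up values).
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, probs = attention_head([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

With 8 such heads per layer, each position can pull in information from up to 8 different locations, which is the reading of the quote on the slide.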
  9. Decoder again
     W_i … W_j
     • Run a BiRNN once
     • Run an FFN several times: T*(T+1)/2 times (one per span in the triangular chart, Δ)
     “92.67 F1 on Penn Treebank WSJ dev set”
     “We must be the 2018 champion!” is what my heart wants to shout.
  10. Analysis by Input Ablation
      z_t = w_t + m_t + p_t
      Word, POS and position embeddings are added together, so they also overlap in the same vector:
      q_t = W_Q^T z_t, k_t = W_K^T z_t, v_t = W_V^T z_t
      Content disabled layer by layer (queries and keys see position only):
      q_t = W_Q^T p_t, k_t = W_K^T p_t, v_t = W_V^T z_t
      “it seems strange that content-based attention benefits our model to such a small degree.”
  11. Decomposition on i/w
      1. Decompose input
         z_t = w_t + m_t + p_t (F1 92.67)
         z_t = [w_t + m_t; p_t] (F1 92.60)
      2. Decompose attention
         q = q(c) + q(p), k = k(c) + k(p)
         q · k = (q(c) + q(p)) · (k(c) + k(p))
         All mixed up. Cross-terms: q(c) · k(p) + q(p) · k(c)
         An example of cross-terms: “the word the always attends to the 5th position in the sentence”
         x_t = [x(c); x(p)]
         c = W x = [c(c); c(p)] = [W(c) x(c); W(p) x(p)]
         F1 93.15 (+0.5)
      All on dev set.
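The cross-term argument on slide 11 can be checked with a small numeric example. This is an illustrative sketch with made-up vectors; the function names are hypothetical. When content and position parts are added, the dot product expands into four terms, two of which mix content with position. When they are concatenated and projected separately, only the content-content and position-position terms remain.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def summed_logit_terms(q_c, q_p, k_c, k_p):
    """Expand (q_c + q_p) . (k_c + k_p) into its separate terms."""
    return {
        "content-content": dot(q_c, k_c),
        "position-position": dot(q_p, k_p),
        "cross": dot(q_c, k_p) + dot(q_p, k_c),
    }

def factored_logit(q_c, q_p, k_c, k_p):
    """Dot product of the concatenations [q_c; q_p] and [k_c; k_p]."""
    q = list(q_c) + list(q_p)  # concatenation, not element-wise addition
    k = list(k_c) + list(k_p)
    return dot(q, k)

# Made-up content (c) and position (p) parts of one query and one key.
q_c, q_p = [1.0, 2.0], [3.0, 4.0]
k_c, k_p = [5.0, 6.0], [7.0, 8.0]

terms = summed_logit_terms(q_c, q_p, k_c, k_p)
summed = dot([a + b for a, b in zip(q_c, q_p)], [a + b for a, b in zip(k_c, k_p)])
factored = factored_logit(q_c, q_p, k_c, k_p)
```

Here the summed logit is 132, of which 62 comes from cross-terms, while the factored logit is 70, exactly the content-content plus position-position contribution.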
  12. Analysis by Constraints
      “When we began to investigate how the model makes use of long-distance attention, we found that there are particular attention heads at some layers in our model that almost always attend to the start token.”
      RECALL: There are 8 heads in each of the Transformer layers.
      “This suggests that the start token is being used as the location for some sentence-wide pooling/processing, or perhaps as a dummy target location when a head fails to find the particular phenomenon that it’s learned to search for.”
      In short, it is a dustbin for redundant attention.
      WinA; WinA + some spec ← Train with window and then test on dev. 8 layers :)
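The windowed-attention constraint on slide 12 can be sketched as a boolean mask. This is a hedged sketch: the slide does not define "WinA + some spec", so the `global_positions` parameter is a hypothetical illustration of one possible relaxation (letting chosen positions, such as the start token, see and be seen by every position), and `window_mask` is a made-up name.

```python
def window_mask(T, window, global_positions=()):
    """Build a T x T mask where allowed[i][j] says whether i may attend to j.

    Position i may attend to j when |i - j| <= window. Positions listed
    in global_positions (an assumption, not from the slides) may attend
    to, and be attended to by, every position.
    """
    allowed = [[abs(i - j) <= window for j in range(T)] for i in range(T)]
    for g in global_positions:
        for t in range(T):
            allowed[g][t] = True
            allowed[t][g] = True
    return allowed

strict = window_mask(5, 1)           # pure window of size 1
relaxed = window_mask(5, 1, (0,))    # same window, start token always visible
```

Attention logits at positions where the mask is False would be set to negative infinity before the softmax, so those positions receive zero weight.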
  13. 5 Lexical Models
      POS tags from Stanford parser
      z_t = [w_t + m_t; p_t]
      -4 layers at ELMo
      pneumonoultramicroscopicsilicovolcanoconiosis >> Longtu’s
  14. Finale