2. Issue
• A "straight" (direct) neural model, vs. the CKY / transition-based genres
1. A novel neural part:
• Parsing by syntactic distance, trained with a hinge loss for ranking
2. A greedy, recursive decoding part:
• Top-down splitting to avoid compounding errors
(bottom-up combination is feasible, but is less effective at avoiding such errors)
• F1 score: 91.8, in the mainstream of neural methods.
3. Syntactic Distance
• Represents the syntactic relationships between all successive pairs of words in a sentence.
1. Convert a parse tree into a representation based on distances between consecutive words;
2. Map the inferred representation back to a complete parse tree.
Training uses (1) + (2); parsing uses only (2).
4. Previous Studies - 1
• Serializing a tree into a sequence …
• … of syntactic tokens, using a generic seq2seq model as the parser (Vinyals et al., 2015); a rough linearization sketch follows below
• … of transitions / shift-reduce actions, producing an action/tag/label from the current state to form a tree (Stern et al., 2017; Cross and Huang, 2016)
• Chart parsers
• Non-linear triangular potentials + dynamic-programming search (Gaddy et al., 2018)
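As a rough illustration of the serialization idea (a sketch only; Vinyals et al. (2015) additionally normalize POS tags and use their own bracket vocabulary, and the nested-tuple tree format and function name here are my assumptions):

# Sketch: flatten a parse tree into a token sequence that a seq2seq model
# could be trained to emit. Assumed tree format:
# internal node = (label, child, child, ...), leaf = a plain word string.
def linearize(tree):
    if isinstance(tree, str):            # a word: emit it unchanged
        return [tree]
    label, *children = tree
    tokens = ["(" + label]               # opening bracket carries the label
    for child in children:
        tokens += linearize(child)
    tokens.append(label + ")")           # matching labeled closing bracket
    return tokens

tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP", ("VBZ", "enjoys"),
               ("S", ("VP", ("VBG", "playing"), ("NP", ("NN", "tennis"))))),
        (".", "."))
print(" ".join(linearize(tree)))
# (S (NP (PRP She PRP) NP) (VP (VBZ enjoys VBZ) (S (VP (VBG playing VBG) ...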
5. Treebank
( S
  ( NP ( PRP She ) )
  ( VP
    ( VBZ enjoys )
    ( S
      ( VP
        ( VBG playing )
        ( NP ( NN tennis ) )
      )
    )
  )
  ( . . )
)
Input: She enjoys playing tennis .
       + ( PRP VBZ VBG NN . )
Output: the tree-structured bracketing above
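For reference, bracketings like the one above can be loaded directly with NLTK (assuming the package is installed; Tree.fromstring, leaves() and pos() are its standard calls):

from nltk import Tree

# The bracketing from this slide, on one line.
s = ("(S (NP (PRP She)) "
     "(VP (VBZ enjoys) (S (VP (VBG playing) (NP (NN tennis))))) "
     "(. .))")
t = Tree.fromstring(s)
print(t.leaves())   # ['She', 'enjoys', 'playing', 'tennis', '.']
print(t.pos())      # [('She', 'PRP'), ('enjoys', 'VBZ'), ('playing', 'VBG'), ('tennis', 'NN'), ('.', '.')]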
7. Previous Studies - 2
• Transition-based models (Dyer et al., 2016)
• How about a mistake happening here, mid-way through the action sequence? 😈
• Compounding errors cause further compounding errors, because the model is never exposed to its own mistakes (a.k.a. exposure bias, as in all stateful applications).
• Dynamic oracles and beam search provide only limited improvements (Goldberg and Nivre, 2012; Cross and Huang, 2016).
8. Previous Studies - 3
Chart parsers
• The good old CKY algorithm
• Free from dependence on the parser's previous state: spans are scored from fencepost representations when training (see the sketch below).
• Decoding is time-consuming.
• SOTA (Kitaev and Klein, 2018):
F1 > 95 (Self-Att x8 + BERT/ELMo)
F1 ~ 93 (Self-Att x8)
decoder implemented in Cython
(Figure: fenceposts over BiLSTM states)
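A sketch of the fencepost idea (not Kitaev and Klein's actual code; the tensor names and shapes are my assumptions): the representation of a span depends only on the BiLSTM states at its two boundary fenceposts, never on previously built structure, so every span can be scored in parallel at training time.

import torch

def span_representation(fwd, bwd, i, j):
    """Fencepost span encoding in the style of Stern et al. (2017) /
    Gaddy et al. (2018): `fwd` and `bwd` are (n + 1, d) tensors holding the
    forward and backward LSTM states at the n + 1 fenceposts of an n-word
    sentence. The span (i, j) is represented by boundary differences."""
    return torch.cat([fwd[j] - fwd[i], bwd[i] - bwd[j]], dim=-1)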
11. Syntactic Distance (recap)
1. Convert a parse tree into a representation based on distances between consecutive words;
2. Map the inferred representation back to a complete parse tree.
Training uses (1) + (2); parsing uses only (2).
Def 2.1:
• a tree T;
• the leaves (w_0, …, w_n) of T;
• the height d̃(i, j) of the lowest common ancestor of two leaves (w_i, w_j);
• the syntactic distances d = (d_1, …, d_n).
Relationship between heights and distances: the distances only need to preserve the ranking of the heights over consecutive leaf pairs, i.e. sign(d_i − d_j) = sign(d̃(i−1, i) − d̃(j−1, j)) for all i, j in [1, n].
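A minimal sketch of step (1), using the same nested-tuple tree format as in the earlier linearization sketch: leaves, i.e. (POS, word) pre-terminals, sit at height 0, a node sits one level above its highest child, and the distance at each gap is the height of the lowest common ancestor of the two adjacent words (the function name is mine):

def tree_to_distances(tree):
    """Return (distances, height): the syntactic distances between consecutive
    leaves of `tree`, plus the height of `tree` itself."""
    if isinstance(tree, str) or (len(tree) == 2 and isinstance(tree[1], str)):
        return [], 0                     # a word or a (POS, word) pre-terminal
    _, *children = tree
    dists, height = [], 0
    for k, child in enumerate(children):
        child_dists, child_height = tree_to_distances(child)
        if k > 0:
            dists.append(None)           # gap between two children of this node
        dists += child_dists
        height = max(height, child_height)
    height += 1                          # this node is one level above its children
    # gaps directly under this node get its height; inner gaps keep theirs
    dists = [height if d is None else d for d in dists]
    return dists, height

On the slide-5 example ("She enjoys playing tennis .") this yields d = (5, 4, 2, 5) under the height convention above: the lowest merge point is between "playing" and "tennis", the highest ones are at the sentence level.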
12. Tree to tensors (for training) / Tensor to tree (for decoding / parsing)
Tree → distances (for training):
• A leaf node sits at height 0; each internal node sits one level higher than its highest child (height++).
• Concatenate the distances in word-sequential order, alongside the POS tags.
Distances → tree (for decoding / parsing):
• Greedily split at the largest distance, recurse on both sides, and append the sub-trees.
• Greedy top-down: n splits, each found by a search, roughly n × log n for balanced trees; compare the top-down decoder of Stern et al. (2017a).
• n-ary nodes: leftmost split (as in CNF binarization).
• Leaves carry no syntactic label, but do carry a POS tag.
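And a matching sketch of step (2), the greedy top-down decoder: the gap with the largest distance becomes the top split, and the two halves are decoded recursively (constituent labels are omitted here for brevity; the paper also predicts labels for the resulting spans):

def distances_to_tree(words, dists):
    """Greedy top-down decoding. `words` has len(dists) + 1 items; `dists[k]`
    is the predicted distance for the gap between words[k] and words[k + 1]."""
    if not dists:                              # a single word left: a leaf
        return words[0]
    # split at the largest gap; ties go to the leftmost index (leftmost split)
    k = max(range(len(dists)), key=dists.__getitem__)
    left = distances_to_tree(words[:k + 1], dists[:k])
    right = distances_to_tree(words[k + 1:], dists[k + 1:])
    return (left, right)                       # unlabeled binary node

print(distances_to_tree(["She", "enjoys", "playing", "tennis", "."], [5, 4, 2, 5]))
# ('She', (('enjoys', ('playing', 'tennis')), '.'))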
14. Hinge loss & ranking distances
• assert all(isinstance(di, int) for di in d), 'nature of heights'
• MSE loss: sum((di - dp) ** 2 for di, dp in zip(d, pred_d))
• Hinge loss for ranking: sum(max(0, 1 - (pred_d[i] - pred_d[j])) over every pair (i, j) with d[i] > d[j])
∵ recall Def. 2.1: only the ranking of the distances matters, not their exact values, so any prediction that orders the gaps correctly (with a margin of 1) incurs no loss.
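A plain-Python sketch of that pairwise ranking hinge loss (the pairing scheme and margin of 1 follow the intent described above; the function and variable names are mine):

def rank_hinge_loss(d, pred_d, margin=1.0):
    """For every pair of gaps (i, j) whose gold heights satisfy d[i] > d[j],
    the predicted distances should keep that order by at least `margin`;
    pairs already ordered correctly with a large enough gap cost nothing."""
    loss = 0.0
    for i in range(len(d)):
        for j in range(len(d)):
            if d[i] > d[j]:
                loss += max(0.0, margin - (pred_d[i] - pred_d[j]))
    return loss

print(rank_hinge_loss([5, 4, 2, 5], [4.7, 3.9, 1.0, 5.1]))   # ~0.2: the ranking is mostly respected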
18. Hardware
• An NVIDIA TITAN Xp GPU for the neural network; an Intel Core i7-6850K CPU (3.60 GHz) for tree inference.
19. Conclusion
• Parallelization & “neuralization”: greedy decoding
• A way to avoid exposure bias: decoupling
• Make “output variables conditionally independent given the inputs.”