2. Issue
• A "straight" (direct) neural model, vs. the CKY / transition-based genres
1. A novel neural part:
• Parsing by syntactic distance, trained with a hinge loss for ranking
2. A greedy, recursive decoding part:
• Top-down splitting to avoid compounding errors
(bottom-up combination is feasible, but is less effective at avoiding such errors)
• F1 score: 91.8, in the mainstream of neural methods.
3. Syntactic Distance
• Represents the syntactic relationships between all successive pairs of words in a sentence.
1. Convert a parse tree into a representation based on distances between consecutive words;
2. Map the inferred representation back to a complete parse tree.
Training uses (1) + (2); parsing uses only (2).
4. Previous Studies - 1
• Serializing a tree into a sequence …
• … of syntactic tokens, using a generic seq2seq model as the parser (Vinyals et al., 2015); a rough linearization sketch follows below
• … of transitions / shift-reduce actions, producing an action/tag/label from the current state to form a tree (Stern et al., 2017; Cross and Huang, 2016)
• Chart parsers
• Non-linear triangular potentials + dynamic-programming search (Gaddy et al., 2018)
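As a rough illustration of the serialization idea (a sketch only; Vinyals et al. (2015) additionally normalize POS tags and use their own bracket vocabulary, and the nested-tuple tree format and function name here are my assumptions):

# Sketch: flatten a parse tree into a token sequence that a seq2seq model
# could be trained to emit. Assumed tree format:
# internal node = (label, child, child, ...), leaf = a plain word string.
def linearize(tree):
    if isinstance(tree, str):            # a word: emit it unchanged
        return [tree]
    label, *children = tree
    tokens = ["(" + label]               # opening bracket carries the label
    for child in children:
        tokens += linearize(child)
    tokens.append(label + ")")           # matching labeled closing bracket
    return tokens

tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP", ("VBZ", "enjoys"),
               ("S", ("VP", ("VBG", "playing"), ("NP", ("NN", "tennis"))))),
        (".", "."))
print(" ".join(linearize(tree)))
# (S (NP (PRP She PRP) NP) (VP (VBZ enjoys VBZ) (S (VP (VBG playing VBG) ...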
5. Treebank
( S
  ( NP ( PRP She ) )
  ( VP
    ( VBZ enjoys )
    ( S
      ( VP
        ( VBG playing )
        ( NP ( NN tennis ) )
      )
    )
  )
  ( . . )
)
Input: She enjoys playing tennis .
       + ( PRP VBZ VBG NN . )
Output: the tree-structured bracketing above
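For reference, bracketings like the one above can be loaded directly with NLTK (assuming the package is installed; Tree.fromstring, leaves() and pos() are its standard calls):

from nltk import Tree

# The bracketing from this slide, on one line.
s = ("(S (NP (PRP She)) "
     "(VP (VBZ enjoys) (S (VP (VBG playing) (NP (NN tennis))))) "
     "(. .))")
t = Tree.fromstring(s)
print(t.leaves())   # ['She', 'enjoys', 'playing', 'tennis', '.']
print(t.pos())      # [('She', 'PRP'), ('enjoys', 'VBZ'), ('playing', 'VBG'), ('tennis', 'NN'), ('.', '.')]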
7. Previous Studies - 2
• Transition-based models (Dyer et al., 2016)
• How about a mistake happening here, mid-way through the action sequence? 😈
• Compounding errors cause further compounding errors, because the model is never exposed to its own mistakes (a.k.a. exposure bias, as in all stateful applications).
• Dynamic oracles and beam search provide only limited improvements (Goldberg and Nivre, 2012; Cross and Huang, 2016).
8. Previous Studies - 3
Chart parsers
• The good old CKY algorithm
• Free from dependence on the parser's previous state: spans are scored from fencepost representations when training (see the sketch below).
• Decoding is time-consuming.
• SOTA (Kitaev and Klein, 2018):
F1 > 95 (Self-Att x8 + BERT/ELMo)
F1 ~ 93 (Self-Att x8)
decoder implemented in Cython
(Figure: fenceposts over BiLSTM states)
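A sketch of the fencepost idea (not Kitaev and Klein's actual code; the tensor names and shapes are my assumptions): the representation of a span depends only on the BiLSTM states at its two boundary fenceposts, never on previously built structure, so every span can be scored in parallel at training time.

import torch

def span_representation(fwd, bwd, i, j):
    """Fencepost span encoding in the style of Stern et al. (2017) /
    Gaddy et al. (2018): `fwd` and `bwd` are (n + 1, d) tensors holding the
    forward and backward LSTM states at the n + 1 fenceposts of an n-word
    sentence. The span (i, j) is represented by boundary differences."""
    return torch.cat([fwd[j] - fwd[i], bwd[i] - bwd[j]], dim=-1)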
11. Syntactic Distance (recap)
1. Convert a parse tree into a representation based on distances between consecutive words;
2. Map the inferred representation back to a complete parse tree.
Training uses (1) + (2); parsing uses only (2).
Def 2.1:
• a tree T;
• the leaves (w_0, …, w_n) of T;
• the height d̃(i, j) of the lowest common ancestor of two leaves (w_i, w_j);
• the syntactic distances d = (d_1, …, d_n).
Relationship between heights and distances: the distances only need to preserve the ranking of the heights over consecutive leaf pairs, i.e. sign(d_i − d_j) = sign(d̃(i−1, i) − d̃(j−1, j)) for all i, j in [1, n].
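A minimal sketch of step (1), using the same nested-tuple tree format as in the earlier linearization sketch: leaves, i.e. (POS, word) pre-terminals, sit at height 0, a node sits one level above its highest child, and the distance at each gap is the height of the lowest common ancestor of the two adjacent words (the function name is mine):

def tree_to_distances(tree):
    """Return (distances, height): the syntactic distances between consecutive
    leaves of `tree`, plus the height of `tree` itself."""
    if isinstance(tree, str) or (len(tree) == 2 and isinstance(tree[1], str)):
        return [], 0                     # a word or a (POS, word) pre-terminal
    _, *children = tree
    dists, height = [], 0
    for k, child in enumerate(children):
        child_dists, child_height = tree_to_distances(child)
        if k > 0:
            dists.append(None)           # gap between two children of this node
        dists += child_dists
        height = max(height, child_height)
    height += 1                          # this node is one level above its children
    # gaps directly under this node get its height; inner gaps keep theirs
    dists = [height if d is None else d for d in dists]
    return dists, height

On the slide-5 example ("She enjoys playing tennis .") this yields d = (5, 4, 2, 5) under the height convention above: the lowest merge point is between "playing" and "tennis", the highest ones are at the sentence level.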
12. Tree to tensors (for training) / Tensor to tree (for decoding / parsing)
Tree → distances (for training):
• A leaf node sits at height 0; each internal node sits one level higher than its highest child (height++).
• Concatenate the distances in word-sequential order, alongside the POS tags.
Distances → tree (for decoding / parsing):
• Greedily split at the largest distance, recurse on both sides, and append the sub-trees.
• Greedy top-down: n splits, each found by a search, roughly n × log n for balanced trees; compare the top-down decoder of Stern et al. (2017a).
• n-ary nodes: leftmost split (as in CNF binarization).
• Leaves carry no syntactic label, but do carry a POS tag.
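And a matching sketch of step (2), the greedy top-down decoder: the gap with the largest distance becomes the top split, and the two halves are decoded recursively (constituent labels are omitted here for brevity; the paper also predicts labels for the resulting spans):

def distances_to_tree(words, dists):
    """Greedy top-down decoding. `words` has len(dists) + 1 items; `dists[k]`
    is the predicted distance for the gap between words[k] and words[k + 1]."""
    if not dists:                              # a single word left: a leaf
        return words[0]
    # split at the largest gap; ties go to the leftmost index (leftmost split)
    k = max(range(len(dists)), key=dists.__getitem__)
    left = distances_to_tree(words[:k + 1], dists[:k])
    right = distances_to_tree(words[k + 1:], dists[k + 1:])
    return (left, right)                       # unlabeled binary node

print(distances_to_tree(["She", "enjoys", "playing", "tennis", "."], [5, 4, 2, 5]))
# ('She', (('enjoys', ('playing', 'tennis')), '.'))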
14. Hinge loss & ranking distances
• assert all(isinstance(di, int) for di in d), 'nature of heights'
• MSE loss: sum((di - dp) ** 2 for di, dp in zip(d, pred_d))
• Hinge loss for ranking: sum(max(0, 1 - (pred_d[i] - pred_d[j])) over every pair (i, j) with d[i] > d[j])
∵ recall Def. 2.1: only the ranking of the distances matters, not their exact values, so any prediction that orders the gaps correctly (with a margin of 1) incurs no loss.
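A plain-Python sketch of that pairwise ranking hinge loss (the pairing scheme and margin of 1 follow the intent described above; the function and variable names are mine):

def rank_hinge_loss(d, pred_d, margin=1.0):
    """For every pair of gaps (i, j) whose gold heights satisfy d[i] > d[j],
    the predicted distances should keep that order by at least `margin`;
    pairs already ordered correctly with a large enough gap cost nothing."""
    loss = 0.0
    for i in range(len(d)):
        for j in range(len(d)):
            if d[i] > d[j]:
                loss += max(0.0, margin - (pred_d[i] - pred_d[j]))
    return loss

print(rank_hinge_loss([5, 4, 2, 5], [4.7, 3.9, 1.0, 5.1]))   # ~0.2: the ranking is mostly respected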
18. Hardware
• An NVIDIA TITAN Xp GPU for the neural network; an Intel Core i7-6850K CPU (3.60 GHz) for tree inference.
19. Conclusion
• Parallelization & “neuralization”: greedy decoding
• A way to avoid exposure bias: decoupling
• Make “output variables conditionally independent given the inputs.”