Transition-based Dependency Parsing with Selectional Branching

We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach. Selectional branching is guaranteed to perform fewer transitions than beam search yet parses as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.


Transcript

Transition-based Dependency Parsing with Selectional Branching
Jinho D. Choi and Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst

Introduction

Greedy vs. non-greedy dependency parsing
  • Speed vs. accuracy: greedy parsing is faster (about 10-35 times faster); non-greedy parsing is more accurate (about 1-2% more accurate).
  • Non-greedy parsing approaches: transition-based parsing using beam search; other approaches include graph-based parsing, linear programming, and dual decomposition.

How many beams do we need during parsing?
  • Intuition: simpler sentences need fewer beams than more complex sentences to generate the most accurate parse output.
  • Rule of thumb: a greedy parser performs as accurately as a non-greedy parser using beam search (beam size = 80) about 64% of the time.
  • Motivation: the average parsing speed can be improved without compromising the overall parsing accuracy by allocating different beam sizes to different sentences.
  • Challenge: how can we determine the appropriate beam size for a given sentence?

Selectional Branching

Branching strategy
  • s_ij denotes a parsing state, where i is the index of the current transition sequence and j is the index of the current parsing state; p_kj denotes the k'th-best prediction given s_1j.
  1. The one-best sequence T1 = [s_11, ..., s_1t] is generated by a greedy parser.
  2. While generating T1, the parser adds tuples (s_1j, p_2j), ..., (s_1j, p_kj) to a list λ for each low-confidence prediction p_1j given s_1j (in our case, k = 2).
  3. New transition sequences are then generated using the b highest-scoring predictions in λ, where b is the beam size.
  4. The same greedy parser is used to generate these new sequences, although it now starts with s_1j instead of an initial parsing state, applies p_kj to s_1j, and performs further transitions.
  5. Finally, a parse tree is built from the sequence with the highest score, where the score of Ti is the average score of all predictions that lead to generating the sequence.

[Figure: branching diagram of transition sequences T1, T2, ..., Td; for example, s_12 is the 2nd parsing state in the 1st transition sequence, and p_21 is the 2nd-best prediction given s_11.]

Finding low confidence predictions
  • The best prediction is low-confidence if there exists any other prediction whose margin (score difference) to the best prediction is less than a threshold m, i.e., |C_k(x, m)| > 1.
  • The optimal beam size and margin threshold are found on the development set using grid search.

Comparison to beam search
  • t is the maximum number of parsing states generated by any transition sequence.

                                    Beam search   Selectional branching
    Max. # of transition sequences  b             d = min(b, |λ| + 1)
    Max. # of parsing states        b·t           d·t − d(d−1)/2

A sketch of the branching procedure and a worked check of these bounds follow below.
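To make the branching strategy concrete, here is a minimal Python sketch under assumed interfaces: model.predict(state) returns candidate transitions with scores in descending order, and a State exposes apply, clone, and is_terminal. These names are hypothetical stand-ins for illustration, not the authors' ClearNLP API.

def low_confidence(predictions, margin):
    """A prediction is low-confidence if any other prediction scores
    within `margin` of the best one, i.e., |C_k(x, m)| > 1."""
    return (len(predictions) > 1
            and predictions[0][1] - predictions[1][1] < margin)

def greedy_parse(state, model, margin, scores, lam=None):
    """Perform greedy transitions from `state`. `scores` carries the
    prediction scores of the (possibly shared) prefix, so a sequence's
    score is the average over all predictions that lead to it (step 5)."""
    while not state.is_terminal():
        preds = model.predict(state)  # [(transition, score), ...], sorted
        if lam is not None and low_confidence(preds, margin):
            # step 2: record (s_1j, p_2j) and the shared prefix scores
            lam.append((state.clone(), preds[1], list(scores)))
        transition, score = preds[0]
        scores.append(score)
        state = state.apply(transition)
    return state, sum(scores) / max(len(scores), 1)

def selectional_branching(init_state, model, beam_size, margin):
    lam = []
    # step 1: generate the one-best sequence T1, collecting branch points
    final, score = greedy_parse(init_state.clone(), model, margin, [], lam)
    candidates = [(score, final)]
    # step 3: branch from the highest-scoring predictions in lambda;
    # d = min(b, |lambda| + 1) sequences in total, hence at most b - 1 branches
    lam.sort(key=lambda item: item[1][1], reverse=True)
    for state, (transition, t_score), prefix in lam[:beam_size - 1]:
        branched = state.apply(transition)  # step 4: apply p_kj to s_1j
        final, seq_score = greedy_parse(branched, model, margin,
                                        prefix + [t_score])
        candidates.append((seq_score, final))
    # step 5: the tree is read off the highest-scoring sequence's final state
    return max(candidates, key=lambda c: c[0])[1]

Each branched sequence reuses the prefix it shares with T1, which is why selectional branching performs fewer transitions than a constant-size beam.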
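The bounds in the comparison table can be sanity-checked numerically; the values b = 80, t = 40, and |λ| = 10 below are illustrative, not taken from the poster.

def beam_search_states(b, t):
    # beam search keeps b states alive at each of t steps
    return b * t

def selectional_branching_states(b, t, lam_size):
    # d transition sequences in total, and each branched sequence
    # skips the prefix it shares with the one-best sequence
    d = min(b, lam_size + 1)
    return d * t - d * (d - 1) // 2

print(beam_search_states(80, 40))                # 3200
print(selectional_branching_states(80, 40, 10))  # d = 11 -> 385

With only 10 low-confidence predictions, selectional branching touches roughly an eighth of the states beam search may generate, consistent with the speed-ups reported in the experiments below.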
Algorithm
  • A hybrid between Nivre's arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
  • When the training data contains only projective trees, it learns only projective transitions → gives a parsing complexity of O(n) for projective parsing.
  • When the training data contains both projective and non-projective trees, it learns both kinds of transitions → gives an expected linear-time parsing speed.

[Table: Transitions — the transition set of the hybrid dependency parsing algorithm is not legible in this transcript.]

A generic illustration of a transition-based state and its transitions follows below.
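Since the poster's transition table did not survive extraction, the sketch below shows a generic arc-eager system (Nivre, 2003) instead: a parsing state and its four transitions. It is an illustration of the transition-based family, not the authors' hybrid system, and the usual transition preconditions are omitted for brevity.

from dataclasses import dataclass, field

@dataclass
class ArcEagerState:
    """A parsing state: a stack, a buffer of token indices, and the
    set of dependency arcs (head, dependent) built so far."""
    stack: list = field(default_factory=lambda: [0])  # 0 = artificial root
    buffer: list = field(default_factory=list)
    arcs: set = field(default_factory=set)

    def is_terminal(self):
        return not self.buffer

    def apply(self, transition):
        if transition == "SHIFT":        # move buffer front onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LEFT-ARC":   # buffer front -> stack top
            self.arcs.add((self.buffer[0], self.stack.pop()))
        elif transition == "RIGHT-ARC":  # stack top -> buffer front
            self.arcs.add((self.stack[-1], self.buffer[0]))
            self.stack.append(self.buffer.pop(0))
        elif transition == "REDUCE":     # pop a token that already has a head
            self.stack.pop()
        return self

# Every token enters the stack once and leaves it at most once, so a
# projective parse takes at most 2n transitions: the O(n) bound above.
state = ArcEagerState(buffer=[1, 2, 3])
for t in ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]:
    state.apply(t)
print(sorted(state.arcs))  # [(0, 2), (2, 1), (2, 3)]

The authors' hybrid additionally learns list-based transitions for non-projective arcs when such trees appear in the training data, which this sketch omits.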

Experiments

Projective parsing experiments
  • The standard setup on WSJ (Yamada and Matsumoto's head rules, Nivre's labeling rules).
  • Our speeds (seconds per sentence) are measured on an Intel Xeon 2.57 GHz machine.
  • POS tags are generated by the ClearNLP POS tagger (97.5% accuracy on WSJ-23).
  • bt: beam size used during training, bd: beam size used during decoding.

    Approach               UAS    LAS    Speed  Note
    Zhang & Clark (2008)   92.10  –      –      Beam search
    Huang & Sagae (2010)   92.10  –      0.04   + Dynamic programming
    Zhang & Nivre (2011)   92.90  91.80  0.03   + Rich non-local features
    Bohnet & Nivre (2012)  93.38  92.44  0.40   + Joint inference
    bt = 80, bd = 80       92.96  91.93  0.009  Using beam sizes of 16 or above
    bt = 80, bd = 64       92.96  91.93  0.009  during decoding gave almost the
    bt = 80, bd = 32       92.96  91.94  0.009  same results.
    bt = 80, bd = 16       92.96  91.94  0.008
    bt = 80, bd = 1        92.26  91.25  0.002  Training with a higher beam size
    bt = 1, bd = 1         92.06  91.05  0.002  improved greedy parsing.

[Figure: parsing accuracy vs. margin threshold on the development set for beam sizes b = 16, 32, and 64|80.]

Non-projective parsing experiments
  • Danish, Dutch, Slovene, and Swedish data from the CoNLL-X shared task.
  • Nivre'06: pseudo-projective parsing; McDonald'06: 2nd-order graph-based parsing.
  • Nivre'09: swap transition; N&M'08: ensemble of Nivre'06 and McDonald'06.
  • F&G'12 (Fernández-González & Gómez-Rodríguez, 2012): buffer transitions; Martins'10: linear programming.

                      Danish        Dutch         Slovene       Swedish
    Approach          LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS
    Nivre'06          84.77  89.80  78.59  81.35  70.30  78.72  84.58  89.50
    McDonald'06       84.79  90.58  79.19  83.57  73.44  83.17  82.55  88.93
    Nivre'09          84.20  –      –      –      75.20  –      –      –
    F&G'12            85.17  90.10  –      –      –      –      83.55  89.30
    N&M'08            86.67  –      81.63  –      75.94  –      84.66  –
    Martins'10        –      91.50  –      84.91  –      85.53  –      89.80
    bt = 80, bd = 80  87.27  91.36  82.45  85.33  77.46  84.65  86.80  91.36
    bt = 80, bd = 1   86.75  91.04  80.75  83.59  75.66  83.29  86.32  91.12

  • The number of transitions performed during decoding with respect to sentence length for Dutch is shown in the figure below.
  • Dutch has the highest proportion of non-projective trees among the CoNLL-X languages (5.4% of arcs, 36.4% of trees).

[Figure: number of transitions vs. sentence length for projective and non-projective parsing of Dutch.]

Conclusion
  • Our non-projective parsing algorithm shows an expected linear-time parsing speed and gives state-of-the-art parsing accuracy compared to other non-projective parsing approaches.
  • Our selectional branching uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to greedy parsing.
  • Our parser is publicly available as part of the open-source project ClearNLP (clearnlp.com).

Acknowledgments
  • We gratefully acknowledge a grant from the Defense Advanced Research Projects Agency under the DEFT project, solicitation #: DARPA-BAA-12-47.