Transition-based Dependency Parsing with Selectional Branching
Jinho D. Choi and Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst
Introduction

Greedy vs. non-greedy dependency parsing
• Speed vs. Accuracy
: Greedy parsing is faster (by about 10-35 times).
: Non-greedy parsing is more accurate (by about 1-2%).
• Non-greedy parsing approaches
: Transition-based parsing using beam search.
: Other approaches: graph-based parsing, linear programming, dual decomposition.
How many beams do we need during parsing?
• Intuition: simpler sentences need smaller beam sizes than more complex sentences to
generate the most accurate parse output.
• Rule of thumb: a greedy parser performs as accurately as a non-greedy parser using beam
search (beam size = 80) about 64% of the time.
• Motivation: the average parsing speed can be improved without compromising the overall
parsing accuracy by allocating different beam sizes to different sentences.
• Challenge: how can we determine the appropriate beam size for a given sentence?
Selectional Branching

Branching strategy
• sij denotes a parsing state, where i is the index of the transition sequence and j is the
index of the parsing state within that sequence; pkj denotes the k-th best prediction given s1j.
1. The one-best sequence T1 = [s11, …, s1t] is generated by a greedy parser.
2. While generating T1, the parser adds the tuples (s1j, p2j), …, (s1j, pkj) to a list λ for each
low-confidence prediction p1j given s1j (in our case, k = 2).
3. New transition sequences are then generated from the b highest-scoring predictions in λ,
where b is the beam size.
4. The same greedy parser is used to generate these new sequences, except that it now starts
from s1j instead of the initial parsing state, applies pkj to s1j, and performs further transitions.
5. Finally, a parse tree is built from the highest-scoring sequence, where the score of Ti is
the average score of all predictions that led to the sequence (see the sketch below).

[Figure: transition sequences T1, …, Td branching from the one-best sequence; e.g., s12 is the
2nd parsing state in the 1st transition sequence, and p21 is the 2nd-best prediction given s11.]
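To make steps 1-5 concrete, here is a minimal Python sketch of selectional branching, not the
ClearNLP implementation. The names `model.predict(state)` (returning transition-score pairs
sorted best first) and the immutable state methods `is_terminal()`, `apply()`, and `tree()` are
hypothetical stand-ins for whatever classifier and parsing-state representation is used.

```python
def parse(model, init_state, b, margin, k=2):
    """Sketch of selectional branching; assumes a non-empty sentence."""
    lam = []      # λ: branch points (state, alt transition, alt score, prefix scores)
    scores1 = []  # prediction scores along the one-best sequence T1

    # Steps 1-2: generate T1 greedily, recording the 2nd..k-th best
    # predictions at each low-confidence state.
    state = init_state
    while not state.is_terminal():
        preds = model.predict(state)               # sorted by score, best first
        best_tr, best_score = preds[0]
        for alt_tr, alt_score in preds[1:k]:       # in our case, k = 2
            if best_score - alt_score < margin:    # low-confidence prediction
                lam.append((state, alt_tr, alt_score, list(scores1)))
        scores1.append(best_score)
        state = state.apply(best_tr)               # states are immutable
    candidates = [(sum(scores1) / len(scores1), state)]

    # Steps 3-4: rerun the same greedy parser from each of the d - 1
    # highest-scoring branch points in λ, where d = min(b, |λ| + 1).
    lam.sort(key=lambda x: x[2], reverse=True)
    for branch_state, alt_tr, alt_score, prefix in lam[:min(b, len(lam) + 1) - 1]:
        scores = prefix + [alt_score]
        state = branch_state.apply(alt_tr)
        while not state.is_terminal():
            tr, score = model.predict(state)[0]
            scores.append(score)
            state = state.apply(tr)
        candidates.append((sum(scores) / len(scores), state))

    # Step 5: return the tree of the sequence with the highest average score.
    return max(candidates, key=lambda c: c[0])[1].tree()
```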
Comparison to beam search
• t is the maximum number of parsing states generated by any transition sequence.

                                 Beam search   Selectional branching
Max. # of transition sequences   b             d = min(b, |λ| + 1)
Max. # of parsing states         b ∙ t         d ∙ t − d(d − 1)/2
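• For example, with b = 80 and t = 100, beam search may generate up to 80 ∙ 100 = 8,000
parsing states, whereas selectional branching with |λ| = 3 low-confidence predictions yields
only d = min(80, 4) = 4 sequences and at most 4 ∙ 100 − (4 ∙ 3)/2 = 394 states.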
Finding low-confidence predictions
• The best prediction has low confidence if there exists another prediction whose margin (score
difference) from the best prediction is less than a threshold m, i.e., |Ck(x, m)| > 1.
• The optimal beam size and margin threshold are found on the development set by grid search.

[Figure: parsing accuracy vs. margin threshold on the development set, with one curve per beam
size (b = 16, 32, 64|80); accuracy ranges between roughly 91.0 and 91.2 as the margin varies
from 0.83 to 0.92.]
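A small sketch of the margin test, under the same hypothetical `preds` representation as in
the sketch above (a best-first list of transition-score pairs):

```python
def is_low_confidence(preds, m, k=2):
    """True iff |Ck(x, m)| > 1: more than one of the top-k predictions
    lies within score margin m of the best prediction."""
    best_score = preds[0][1]
    ck = [s for _, s in preds[:k] if best_score - s < m]
    return len(ck) > 1
```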
Experiments

Projective parsing experiments
• The standard setup on WSJ (Yamada and Matsumoto’s head rules, Nivre’s labeling rules).
• Our speeds (seconds per sentence) are measured on an Intel Xeon 2.57GHz machine.
• POS tags are generated by the ClearNLP POS tagger (97.5% accuracy on WSJ-23).
• bt: beam size used during training; bd: beam size used during decoding.

Approach               UAS    LAS    Speed  Note
Zhang & Clark (2008)   92.10  –      –      Beam search
Huang & Sagae (2010)   92.10  –      0.04   + Dynamic programming
Zhang & Nivre (2011)   92.90  91.80  0.03   + Rich non-local features
Bohnet & Nivre (2012)  93.38  92.44  0.40   + Joint inference
bt = 80, bd = 80       92.96  91.93  0.009  Using beam sizes of 16 or above during
bt = 80, bd = 64       92.96  91.93  0.009  decoding gave almost the same results.
bt = 80, bd = 32       92.96  91.94  0.009
bt = 80, bd = 16       92.96  91.94  0.008
bt = 80, bd = 1        92.26  91.25  0.002  Training with a higher beam size
bt = 1, bd = 1         92.06  91.05  0.002  improved greedy parsing.
Non-projective parsing experiments
• Danish, Dutch, Slovene, and Swedish data from the CoNLL-X shared task.
• Nivre’06: pseudo-projective parsing; McDonald’06: 2nd-order graph-based parsing.
• Nivre’09: swap transition; N&M’08: ensemble of Nivre’06 and McDonald’06.
• F&G’12 (Fernández-González & Gómez-Rodríguez): buffer transitions; Martins’10: linear programming.
Approach            Danish         Dutch          Slovene        Swedish
                    LAS    UAS     LAS    UAS     LAS    UAS     LAS    UAS
Nivre’06            84.77  89.80   78.59  81.35   70.30  78.72   84.58  89.50
McDonald’06         84.79  90.58   79.19  83.57   73.44  83.17   82.55  88.93
Nivre’09            84.20  –       –      –       75.20  –       –      –
F&G’12              85.17  90.10   –      –       –      –       83.55  89.30
N&M’08              86.67  –       81.63  –       75.94  –       84.66  –
Martins’10          –      91.50   –      84.91   –      85.53   –      89.80
bt = 80, bd = 80    87.27  91.36   82.45  85.33   77.46  84.65   86.80  91.36
bt = 80, bd = 1     86.75  91.04   80.75  83.59   75.66  83.29   86.32  91.12

[Figure: number of transitions performed during decoding vs. sentence length for Dutch, shown
separately for projective and non-projective parsing.]
• Dutch has the highest proportion of non-projective trees among the CoNLL-X languages
(5.4% of arcs, 36.4% of trees).
Hybrid Dependency Parsing
Algorithm
• A hybrid between Nivre’s arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
• When the training data contains only projective trees, the parser learns only projective
transitions → gives a parsing complexity of O(n) for projective parsing.
• When the training data contains both projective and non-projective trees, it learns both
kinds of transitions → gives an expected linear-time parsing speed.
Transitions
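The poster’s table of transitions is not recoverable from this extraction. As a rough stand-in,
here is a minimal sketch of the standard arc-eager transitions (Nivre, 2003), one of the two
systems the hybrid combines; the list-based, non-projective transitions are omitted, so this is
not the authors’ hybrid transition set.

```python
# Arc-eager transitions over a stack, a buffer, and a set of (head, dep) arcs.
# Preconditions are noted in comments but not enforced in this sketch.

def shift(stack, buf, arcs):      # move the first buffer token onto the stack
    stack.append(buf.pop(0))

def reduce_(stack, buf, arcs):    # pop the stack top (requires: it already has a head)
    stack.pop()

def left_arc(stack, buf, arcs):   # arc buf[0] -> stack top (requires: top has no head)
    arcs.append((buf[0], stack.pop()))

def right_arc(stack, buf, arcs):  # arc stack top -> buf[0], then shift buf[0]
    arcs.append((stack[-1], buf[0]))
    stack.append(buf.pop(0))
```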
Conclusion

• Our non-projective parsing algorithm shows an expected linear-time parsing speed and gives
state-of-the-art parsing accuracy compared to other non-projective parsing approaches.
• Our selectional branching uses confidence estimates to decide when to employ a beam,
providing the accuracy of beam search at speeds close to greedy parsing.
• Our parser is publicly available as part of the open-source ClearNLP project (clearnlp.com).

Acknowledgments

• We gratefully acknowledge a grant from the Defense Advanced Research Projects Agency
under the DEFT project, solicitation #: DARPA-BAA-12-47.