Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal Dependency Parsing
Han He
Computer Science
Emory University
Atlanta GA 30322, USA
han.he@emory.edu
Jinho D. Choi
Computer Science
Emory University
Atlanta GA 30322, USA
jinho.choi@emory.edu
Abstract
This paper presents our enhanced dependency parsing approach using transformer encoders, coupled with a simple yet powerful ensemble [...]

[...] analyze gapping constructions in the enhanced UD representation. Nivre et al. (2018) evaluate both rule-based and data-driven systems for adding enhanced dependencies to existing treebanks. Apart from syntactic relations, researchers are moving to [...]
Enhanced Universal Dependency Parsing
Ellipsis
Conjoined subjects and objects
https://universaldependencies.org/u/overview/enhanced-syntax.html
Preprocessing
• Sentence split and tokenization
• UDPipe (itssearch-engine —> its search - engine)
• Remove multiword expressions
• vámonos —> vámos nos
• Collapse empty nodes
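As a rough illustration of the bullets above, here is a hedged sketch of how such CoNLL-U preprocessing could look. It is not the shared-task script; the function name is made up, and for brevity it simply drops empty-node lines instead of properly collapsing them into the enhanced dependencies.

# Illustrative sketch only: drop multiword-token lines (range IDs like "4-5")
# while keeping their word-level splits, and drop empty nodes (decimal IDs
# like "5.1") from one CoNLL-U sentence given as a list of lines.
def simplify_conllu(lines):
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            kept.append(line)          # keep comments and blank separators
            continue
        token_id = line.split("\t")[0]
        if "-" in token_id:            # multiword token, e.g. "4-5  vámonos ..."
            continue                   # its parts ("vámos", "nos") remain below
        if "." in token_id:            # empty node, e.g. "5.1"
            continue                   # a real script would collapse, not drop
        kept.append(line)
    return kept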
2 Approach

2.1 Preprocessing
The data in the training and development sets are already sentence segmented and tokenized. For the test set, UDPipe is used to segment raw input into sentences, where each sentence gets split into a list of tokens (Straka and Straková, 2017). A custom script written by us is used to remove multiwords but retain their splits (e.g., remove vámonos but retain vámos nos), as well as to collapse empty nodes in the CoNLL-U format.

2.2 Transformer Encoder
Our parsing models use contextualized embeddings [...] it can be easily adapted to languages that may not have dedicated POS taggers, and drops the Bidirectional LSTM encoder while integrating the transformer encoder directly into the biaffine decoder to minimize the redundancy of multiple encoders for the generation of contextualized embeddings.

Every token w_i in the input sentence is split into one or more sub-tokens by the transformer encoder (Section 2.2). The contextualized embedding that corresponds to the first sub-token of w_i is treated as the embedding of w_i, say e_i, and fed into four types of multilayer perceptron (MLP) layers to extract features for w_i being a head (*-h) or a dependent (*-d) for the arc relations (arc-*) and the label [...]
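A minimal sketch of the first-sub-token pooling described in Section 2.2, assuming the Hugging Face transformers library and the multilingual BERT checkpoint; the example sentence and variable names are illustrative, not the authors' code.

# Take the hidden state of the first sub-token of each word as its embedding e_i.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

words = ["Its", "search", "-", "engine", "works", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (num_subtokens, hidden_size)

word_ids = enc.word_ids()                        # maps each sub-token to its word index
first_subtoken = {}
for idx, wid in enumerate(word_ids):
    if wid is not None and wid not in first_subtoken:
        first_subtoken[wid] = idx                # remember the first sub-token per word
embeddings = torch.stack([hidden[first_subtoken[i]] for i in range(len(words))])
print(embeddings.shape)                          # torch.Size([6, 768])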
Encoder
• mBERT vs. language-specific Transformers
• ALBERT for English, RoBERTa for French
• mBERT for all languages
Decoder
• Biaffine DTP and DGP
• Tree Parsing vs. Graph Parsing
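As context for the "Biaffine DTP and DGP" bullet, a hedged PyTorch sketch of a biaffine arc scorer in the Dozat and Manning (2016) style; the MLP size, activation, and bias handling are assumptions rather than the authors' exact configuration, and the label scores S(rel) would come from an analogous bilinear layer over the rel-h/rel-d features.

import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, hidden_size: int, mlp_size: int = 500):
        super().__init__()
        # Separate MLPs extract head (*-h) and dependent (*-d) features from e_i.
        self.mlp_h = nn.Sequential(nn.Linear(hidden_size, mlp_size), nn.ReLU())
        self.mlp_d = nn.Sequential(nn.Linear(hidden_size, mlp_size), nn.ReLU())
        # The extra column on the dependent side acts as a per-head bias term.
        self.U = nn.Parameter(torch.zeros(mlp_size + 1, mlp_size))
        nn.init.xavier_uniform_(self.U)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (n, hidden_size) contextualized embeddings e_i of one sentence
        h = self.mlp_h(embeddings)                             # (n, mlp_size)
        d = self.mlp_d(embeddings)                             # (n, mlp_size)
        d = torch.cat([d, torch.ones_like(d[:, :1])], dim=-1)  # append bias column
        return d @ self.U @ h.t()                              # S(arc): (n, n), [dependent, head]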
(b) Labeled attachment score on enhanced dependencies where labels are restricted to the UD relation (EULAS).
Table 1: Parsing results on the test sets for all languages. For both (a) and (b), the rows 2-4 show the results by the multilingual encoder and the rows 5-7 show the results by the language-specific encoders if available.
Lang. Encoder Corpus Provider
AR BERT 8.2 B Hugging Face
EN ALBERT 16 GB Hugging Face
ET BERT N/A TurkuNLP
FR RoBERTa 138 GB Hugging Face
FI BERT 24 B Hugging Face
IT BERT 13 GB Hugging Face
NL BERT N/A Hugging Face
PL BERT 1.8 B Hugging Face
SV BERT 3 B Hugging Face
BG BERT N/A Hugging Face
CS BERT N/A Hugging Face
SK BERT N/A Hugging Face
Table 2: Language-specific transformer encoders to develop our models. The corpus column shows the corpus size used to pretrain each encoder (B: billion tokens, GB: gigabytes). BERT and RoBERTa adapt the base [...]
Figure 2: Percentages of tokens with multiple heads.
Ensemble
2.4 Dependency Tree & Graph Parsing
The arc score matrix S(arc) and the label score tensor S(rel) generated by the bilinear and biaffine classifiers can be used for both dependency tree parsing (DTP) and graph parsing (DGP). For DTP, which takes only the primary dependencies to learn tree structures during training, the Chu-Liu-Edmonds Maximum Spanning Tree (MST) algorithm is applied to S(arc) for the arc prediction, then the label with the largest score in S(rel) corresponding to the arc is taken for the label prediction (A_DTP: the list of predicted arcs, L_DTP: the labels predicted for A_DTP, I: the indices of A_DTP in S(rel)):

A_DTP = MST(S(arc))
L_DTP = argmax(S(rel)[I(A_DTP)])
For DGP, which takes the primary as well as the secondary dependencies in the enhanced types to learn graph structures during training, the sigmoid function is applied to S(arc) instead of the softmax function (Figure 1) so that zero to many heads can be predicted per node by measuring the pairwise losses. Then, the same logic can be used to predict the labels for those arcs as follows:

A_DGP = SIGMOID(S(arc))
L_DGP = argmax(S(rel)[I(A_DGP)])
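To make the contrast concrete, a small sketch of the two decoding modes over the same arc score matrix, assuming NumPy arrays indexed as [dependent, head]; the 0.5 threshold for DGP and the greedy per-token head choice standing in for Chu-Liu-Edmonds are simplifications for illustration only.

import numpy as np

def decode_dgp(s_arc):
    """Graph decoding: sigmoid per (dependent, head) cell, keep every arc above 0.5."""
    probs = 1.0 / (1.0 + np.exp(-s_arc))
    return [(d, h) for d in range(1, s_arc.shape[0])
            for h in range(s_arc.shape[1]) if probs[d, h] > 0.5]

def decode_dtp_greedy(s_arc):
    """Tree decoding, simplified: one argmax head per dependent. The paper instead
    runs the Chu-Liu-Edmonds MST algorithm on S(arc), which also repairs cycles."""
    return [(d, int(s_arc[d].argmax())) for d in range(1, s_arc.shape[0])]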
Finding the maximum spanning DAG (MSDAG) from the output of the DGP model is NP-hard (Schluter, 2014). Thus, we design an ensemble approach that computes approximate MSDAGs using a greedy algorithm. Given the score matrices S(arc)_DTP and S(arc)_DGP from the DTP and DGP models respectively, and the label score tensor S(rel)_DGP from the DGP model, Algorithm 1 is applied to find the MSDAG:
Algorithm 1: Ensemble parsing algorithm
Input: S(arc)_DTP, S(arc)_DGP, and S(rel)_DGP
Output: G, that is an approximate MSDAG
1  r ← root_index(A_DTP)
2  S(rel)_DGP[root, :, :] ← −∞
3  S(rel)_DGP[root, r, r] ← +∞
4  R ← argmax(S(rel)_DGP) ∈ R^{n×n}
5  A_DTP ← MST(S(arc)_DTP)
6  G ← ∅
7  foreach arc (d, h) ∈ A_DTP do
8      G ← G ∪ {(d, h, R[d, h])}
9  end
10 A_DGP ← sorted_descend(SIGMOID(S(arc)_DGP))
11 foreach arc (d, h) ∈ A_DGP do
12     G_(d,h) ← G ∪ {(d, h, R[d, h])}
13     if is_acyclic(G_(d,h)) then
14         G ← G_(d,h)
15     end
16 end
• DTP (Tree) + DGP (Graph)
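A hedged Python sketch of the "DTP (Tree) + DGP (Graph)" ensemble in Algorithm 1, assuming NumPy score arrays indexed as [dependent, head]; `mst` is a placeholder for the Chu-Liu-Edmonds decoder of the DTP model, the 0.5 sigmoid threshold is an assumption, and the root-forcing steps (lines 1-3 of the algorithm) are omitted for brevity.

import numpy as np

def is_acyclic(arcs, n):
    # Kahn's topological sort over head -> dependent edges; True if no cycle.
    children = [[] for _ in range(n)]
    indegree = [0] * n
    for d, h in arcs:
        children[h].append(d)
        indegree[d] += 1
    stack = [v for v in range(n) if indegree[v] == 0]
    visited = 0
    while stack:
        v = stack.pop()
        visited += 1
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                stack.append(c)
    return visited == n

def ensemble_parse(s_arc_dtp, s_arc_dgp, s_rel_dgp, mst):
    """Greedy approximate MSDAG combining the DTP tree with extra DGP arcs.

    s_arc_dtp, s_arc_dgp: (n, n) arc scores [dependent, head]; token 0 is the root
    s_rel_dgp:            (n, n, num_labels) label scores (layout is an assumption)
    mst:                  callable returning heads[d], a placeholder for Chu-Liu-Edmonds
    """
    n = s_arc_dtp.shape[0]
    heads = mst(s_arc_dtp)                      # heads[d] = predicted head of token d
    best_label = s_rel_dgp.argmax(axis=-1)      # R[d, h]: highest-scoring label per arc

    # Seed the graph with the tree predicted by the DTP model (lines 5-9).
    graph = {(d, int(heads[d])): int(best_label[d, heads[d]]) for d in range(1, n)}

    # Add DGP arcs in descending sigmoid score while the graph stays acyclic (lines 10-16).
    probs = 1.0 / (1.0 + np.exp(-s_arc_dgp))
    candidates = sorted(
        ((probs[d, h], d, h) for d in range(1, n) for h in range(n)
         if probs[d, h] > 0.5 and (d, h) not in graph),
        reverse=True)
    for _, d, h in candidates:
        if is_acyclic(list(graph) + [(d, h)], n):
            graph[(d, h)] = int(best_label[d, h])
    return graph                                # {(dependent, head): label_id}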
Results
• Officially ranked 3rd place according to Coarse ELAS F1 scores
• Officially ranked 1st place on the French treebank
Results
• On 13 languages, multilingual BERT outperforms language-specific encoders
• Exceptions are English, French, Finnish, and Italian
• On 15 languages, the ensemble method outperforms DTP/DGP
To be Improved
• The tree constraint is not necessary.
• Concatenation of all treebanks yields better performance.
Conclusion
• mBERT improves multilingual parsing
• DGP helps the prediction of enhanced dependencies
• Beyond the ensemble, a more advanced parsing algorithm is needed
References
• Straka, M., & Straková, J. (2017). Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 88-99).
• Dozat, T., & Manning, C. D. (2016). Deep Biaffine Attention for Neural Dependency Parsing. arXiv preprint arXiv:1611.01734.
• He, H., & Choi, J. (2020). Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT. In The Thirty-Third International FLAIRS Conference.
• Kondratyuk, D. (2019). 75 Languages, 1 Model: Parsing Universal Dependencies Universally. arXiv preprint arXiv:1904.02099.