4. SyntaxNet II (from Google)
● Major upgrade to SyntaxNet (Mar 15, 2017)
– Multilingual language understanding
– Joint modeling of multiple levels of linguistic structure
– Neural network architectures created dynamically during the processing of a sentence or document
● Implementation* : combination of the two components below
– Recurrent multi-task parsing model (with DRAGNN)
– Character-based representation (with LSTM) as input to DRAGNN (see the sketch below)
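A minimal sketch of the character-based input, assuming TensorFlow's Keras layers rather than the actual SyntaxNet code: each word is fed character by character through an LSTM, and the LSTM's final state serves as the word's representation for the downstream DRAGNN units. The names VOCAB and word_vector are illustrative, not from the repository.

# Hypothetical sketch: char-level LSTM word representations (not SyntaxNet's code)
import tensorflow as tf

VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

char_emb = tf.keras.layers.Embedding(len(VOCAB), 16)   # character embeddings
char_lstm = tf.keras.layers.LSTM(32)                   # final state = word vector

def word_vector(word):
    ids = tf.constant([[VOCAB[c] for c in word.lower()]])  # shape (1, n_chars)
    return char_lstm(char_emb(ids))                        # shape (1, 32)

print(word_vector("parsing").shape)  # (1, 32), usable as an input embedding m(s)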
● DRAGNN as the new core
– Architecture as a series of modular units (TBRUs)
– Connections between modules are unfolded dynamically
→ Dynamic Recurrent Acyclic Graphical Neural Networks
* Alberti et al. [2017], SyntaxNet Models for the CoNLL 2017 Shared Task
(paper) https://arxiv.org/abs/1703.04929 (code) https://github.com/tensorflow/models/tree/master/syntaxnet
7. Transition System
● Transition System : T
– T = {S, A, t}
– S(x) : Set of States ( s+ ∈ S(x) : start state )
– x : input (e.g., a sentence)
– A(s, x) : Set of allowed decisions for any s ∈ S
– transition function t(s, d, x)
● s′ = t(s, d, x) : new state s′ for any decision d ∈ A(s, x)
● s′ = t(s, d) for brevity
● Complete structure : sequence of state/decision pairs (see the sketch below)
– (s1, d1) ... (sn, dn)
– s1 = s+ , di ∈ A(si, x)
– si+1 = t(si, di)
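A minimal sketch of these definitions in Python, assuming a toy left-to-right tagging system; State, start_state, allowed, and transition are illustrative names, not the SyntaxNet API.

# Toy transition system: tag each token left to right
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    index: int = 0        # position of the next token to decide on
    tags: tuple = ()      # decisions taken so far

def start_state(x):       # s+ : start state for input x
    return State()

def allowed(s, x):        # A(s, x) : decisions permitted in state s
    return [] if s.index >= len(x) else ["DET", "NOUN", "VERB"]

def transition(s, d, x):  # t(s, d, x) : apply decision d in state s
    return State(s.index + 1, s.tags + (d,))

# Unroll a complete structure (s1, d1) ... (sn, dn) with s1 = s+
x = "the dog barks".split()
s, pairs = start_state(x), []
while allowed(s, x):
    d = allowed(s, x)[0]  # a trained model would score the decisions here
    pairs.append((s, d))
    s = transition(s, d, x)
print(s.tags)             # ('DET', 'DET', 'DET') under this dummy policy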
8. Transition-Based Recurrent Unit (TBRU)
● m(s) : Input embedding function
– e.g., a lookup op m(s) : S → R^k
● r(s) : recurrence function (connection to previous states)
– mapping : states → sets of previous time steps
– r(si) : S → P{1, …, i−1} ( P : power set, so the number of linked steps is variable )
● (RNN) Network Cell : computes new hidden representation
– hs = RNN( m(s), {hi | i ∈ r(s)} )
( Logically : r, m → h → d )
– decision : di ← argmax_{d ∈ A(si)} wd^T hi (see the sketch below)
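Putting m, r, and the cell together, here is a minimal numpy sketch of a TBRU unrolled for a few steps. The random weights, the fixed decision set, and the "link only to the previous step" recurrence are simplifying assumptions, not the paper's actual configuration.

# Hypothetical TBRU sketch (not the SyntaxNet/DRAGNN code)
import numpy as np

K, H = 8, 8                          # embedding size k, hidden size
DECISIONS = ["DET", "NOUN", "VERB"]  # stands in for A(si) from slide 7
rng = np.random.default_rng(0)
W_in = rng.normal(size=(H, K))       # cell parameters (random, untrained)
W_rec = rng.normal(size=(H, H))
w = {d: rng.normal(size=H) for d in DECISIONS}  # per-decision score vectors

def m(i):                            # m(s) : S -> R^k, random stand-in
    return rng.normal(size=K)        # for a real feature/embedding lookup

def r(i):                            # r(si) : subset of {1, ..., i-1}
    return [i - 1] if i > 0 else []  # here: link only to the previous step

def cell(m_s, linked):               # hs = RNN( m(s), {hi | i in r(s)} )
    rec = sum(linked) if linked else np.zeros(H)
    return np.tanh(W_in @ m_s + W_rec @ rec)

history, decisions = [], []
for i in range(3):                   # logically: r, m -> h -> d at each step
    h = cell(m(i), [history[j] for j in r(i)])
    d = max(DECISIONS, key=lambda dd: w[dd] @ h)  # di <- argmax wd^T hi
    history.append(h)
    decisions.append(d)
print(decisions)                     # some tag sequence; arbitrary, as nothing is trained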