Distributed perceptron
1. Distributed Perceptron
Introducing Distributed Training Strategies for the
Structured Perceptron
by R. McDonald, K. Hall & G. Mann,
published at NAACL 2010
2010-10-06 / 2nd seminar for State-of-the-Art NLP
2. Distributed training of perceptrons
with theoretical guarantees
Naive distribution strategy fails
Parameter mixing (or averaging)
Simple modification
Iterative parameter mixing
Proofs & Experiments
Convergence
Convergence speed
NER experiments
Dependency parsing experiments
3. Timeline
1958 F. Rosenblatt
The Perceptron: A Probabilistic Model for Information
Storage and Organization in the Brain
1962 H.D. Block and A.B. Novikoff (independently)
the perceptron convergence theorem for the
separable case
1999 Y. Freund & R.E. Schapire
voted perceptron with a bound to the generalization error
for the inseparable case
2002 M. Collins
Generalization to the structured prediction problem
2010 R. McDonald et al.
parallelization with parameter mixing and
synchronization
4. A new strategy of parallelization is
required for distributed perceptrons
Gradient-based batch training algorithms have been
parallelized in the form of MapReduce
Parameter mixing works for maximum entropy models
Divide the training data into a number of shards
Train separate models with the shards
Take average of the weights of the models
Perceptrons?
Non-convex objective function
Simple parameter mixing doesn't work
5. Parameter mixing (averaging) fails (1/6)
Parameter mixing (sketched in code below):
Train S perceptrons, one on each of S shards of the training data
Take a weighted average of their weights
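A minimal sketch of this strategy, using a plain binary perceptron with ±1 labels for brevity rather than the paper's structured perceptron; the function names and the uniform default for the mixing weights are illustrative assumptions, not from the paper:

    import numpy as np

    def perceptron_epochs(X, y, w, epochs=10):
        # Plain binary perceptron (labels +1/-1), trained for several epochs.
        for _ in range(epochs):
            for x_t, y_t in zip(X, y):
                if y_t * np.dot(w, x_t) <= 0:   # error -> update
                    w = w + y_t * x_t
        return w

    def parameter_mixing(X, y, S=10, epochs=10):
        # Naive parameter mixing: train one perceptron per shard to
        # completion, then average the weights once at the end.
        shards = zip(np.array_split(X, S), np.array_split(y, S))
        weights = [perceptron_epochs(Xi, yi, np.zeros(X.shape[1]), epochs)
                   for Xi, yi in shards]
        mu = np.full(S, 1.0 / S)                # uniform mixing weights
        return sum(m * w for m, w in zip(mu, weights))

In a MapReduce setting each shard would be trained by a separate mapper, with a single reducer computing the weighted average.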
6. Parameter mixing (averaging) fails (2/6)
Counterexample
Feature space (separated into observed and non-observed examples):
f(x1,1,0) = [1 1 0 0 0 0] f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0] f(x1,2,1) = [0 0 0 0 0 1]
f(x2,1,0) = [0 1 1 0 0 0] f(x2,1,1) = [0 0 0 1 1 1]
f(x2,2,0) = [1 0 0 0 0 0] f(x2,2,1) = [0 0 0 1 0 0]
Preview of the consequence:
Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)
Mixing of two local optima: smaller data can fool the algorithm,
because of the increased initializations and tie-breakings.
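The paper's remedy, analyzed in the slides that follow, is iterative parameter mixing: mix the shard weights after every epoch and restart each shard's OneEpochPerceptron from the mixed vector. A minimal sketch under the same simplifications as above (plain binary perceptron; error-proportional mixing weights μi,n = ki,n/Σj kj,n, with uniform weights as the obvious alternative):

    import numpy as np

    def one_epoch_perceptron(X, y, w):
        # One epoch from a given starting point; returns the new
        # weights and the number of updates (errors) k made.
        k = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:
                w = w + y_t * x_t
                k += 1
        return w, k

    def iterative_parameter_mixing(X, y, S=10, max_epochs=10):
        shards = list(zip(np.array_split(X, S), np.array_split(y, S)))
        w_avg = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            # Every shard restarts its epoch from the mixed weights.
            results = [one_epoch_perceptron(Xi, yi, w_avg) for Xi, yi in shards]
            k_total = sum(k for _, k in results)
            if k_total == 0:      # no errors anywhere: converged
                break
            # Error-proportional mixing: mu_i = k_i / sum_j k_j.
            w_avg = sum((k / k_total) * w for w, k in results)
        return w_avg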
12. Convergence theorem of iterative
parameter mixing (1/4)
Assumptions
u: separating weight vector (assume |u| = 1)
γ: margin, γ ≦ u·(f(xt,yt) − f(xt,y')) for all t and all y' ≠ yt
R: maxt,y' |f(xt,yt) − f(xt,y')|
ki,n: the number of updates (errors) that occur in the n-th
epoch of the i-th OneEpochPerceptron
13. Convergence theorem of iterative
parameter mixing (2/4)
Lower bound on the number of errors in an epoch
From the definition of the margin, γ ≦ u·(f(xt,yt) − f(xt,y')),
so each update adds at least γ to u·w
By induction on n: u·w(avg,N) ≧ ΣnΣi μi,n ki,n γ
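A sketch of the induction step behind this bound, with wi(n) denoting shard i's weights after its n-th epoch and w(avg,n) = Σi μi,n wi(n) the mixed vector (the superscript notation is mine):

    % Each of the k_{i,n} updates in shard i adds a vector v with u \cdot v \ge \gamma:
    u \cdot w_i^{(n)} \ge u \cdot w^{(avg,n-1)} + k_{i,n}\gamma
    % Mixing is a convex combination (\sum_i \mu_{i,n} = 1), so the bound survives:
    u \cdot w^{(avg,n)} = \sum_i \mu_{i,n}\, u \cdot w_i^{(n)}
                      \ge u \cdot w^{(avg,n-1)} + \sum_i \mu_{i,n} k_{i,n}\gamma
    % Unrolling over n = 1, ..., N with w^{(avg,0)} = 0:
    u \cdot w^{(avg,N)} \ge \sum_{n=1}^{N} \sum_i \mu_{i,n} k_{i,n}\gamma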
14. Convergence theorem of iterative
parameter mixing (3/4)
Upper bound on the number of errors in an epoch
From the definitions: R ≧ |f(xt,yt) − f(xt,y')|, and since
updates happen only on errors, y' = argmaxy w·f(xt,y) implies
w·(f(xt,yt) − f(xt,y')) ≦ 0
By induction on n: |w(avg,N)|² ≦ ΣnΣi μi,n ki,n R²
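A sketch of the corresponding induction step: each update adds v = f(xt,yt) − f(xt,y') with |v| ≦ R and w·v ≦ 0, and the mixed norm is handled with Jensen's inequality (again, the superscript notation is mine):

    % One update cannot grow the squared norm by more than R^2:
    \|w + v\|^2 = \|w\|^2 + 2\,w \cdot v + \|v\|^2 \le \|w\|^2 + R^2
    % So after the k_{i,n} updates of shard i in epoch n:
    \|w_i^{(n)}\|^2 \le \|w^{(avg,n-1)}\|^2 + k_{i,n} R^2
    % Mixing is a convex combination, so by Jensen's inequality:
    \|w^{(avg,n)}\|^2 \le \sum_i \mu_{i,n} \|w_i^{(n)}\|^2
                      \le \|w^{(avg,n-1)}\|^2 + \sum_i \mu_{i,n} k_{i,n} R^2
    % Unrolling over n = 1, ..., N:
    \|w^{(avg,N)}\|^2 \le \sum_{n=1}^{N} \sum_i \mu_{i,n} k_{i,n} R^2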
15. Convergence theorem of iterative
parameter mixing (4/4)
Since |u| = 1, the Cauchy–Schwarz inequality gives
|w(avg,N)|² ≧ (u·w(avg,N))² ≧ (ΣnΣi μi,n ki,n γ)² = (ΣnΣi μi,n ki,n)² γ²
Combining with the upper bound:
|w(avg,N)|² ≦ (ΣnΣi μi,n ki,n) R²
(ΣnΣi μi,n ki,n)² γ² ≦ (ΣnΣi μi,n ki,n) R²
(ΣnΣi μi,n ki,n) γ² ≦ R²
(ΣnΣi μi,n ki,n) ≦ R²/γ²
16. Convergence speed is predicted in two
ways (1/2)
Theorem 3 implies:
With uniform mixing weights (μi,n = 1/S), the number of
errors is proportional to the number of shards in the worst
case (when the equality holds), as shown below,
implying that we cannot benefit very much from the
parallelization:
#(errors per epoch) can be multiplied by S, while
the time required per epoch is only reduced to 1/S.
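The worst-case factor of S comes from substituting the uniform weights into the bound of slide 15:

    % With \mu_{i,n} = 1/S in \sum_n \sum_i \mu_{i,n} k_{i,n} \le R^2/\gamma^2:
    \frac{1}{S} \sum_n \sum_i k_{i,n} \le \frac{R^2}{\gamma^2}
    \quad\Longrightarrow\quad
    \sum_n \sum_i k_{i,n} \le \frac{S R^2}{\gamma^2}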
17. Convergence speed is predicted in two
ways (2/2)
Section 4.3
With error-proportional mixing weights, μi,n = ki,n / Σj kj,n,
the number of epochs Ndist is bounded by Ndist ≦ R²/γ²
(geometric mean ≦ arithmetic mean; see the sketch below)
Worst case (when the equality holds):
the same number of epochs as the vanilla serial perceptron,
but even then each epoch is S times faster because of the
parallelization
Ndist doesn't depend on the number of shards,
implying that we benefit well from the parallelization
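One elementary way to see the S-independent bound, assuming integer error counts (a sketch; the paper's own derivation via the mean inequality may be phrased differently):

    % Error-proportional mixing: \mu_{i,n} = k_{i,n}/k_n with k_n = \sum_i k_{i,n}.
    % For nonnegative integers k_{i,n}^2 \ge k_{i,n}, so while errors remain (k_n \ge 1):
    \sum_i \mu_{i,n} k_{i,n} = \frac{\sum_i k_{i,n}^2}{k_n} \ge \frac{\sum_i k_{i,n}}{k_n} = 1
    % Summing over the N_{dist} epochs that still contain an error:
    N_{dist} \le \sum_n \sum_i \mu_{i,n} k_{i,n} \le \frac{R^2}{\gamma^2}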
18. Experiments
Comparison
Serial (All Data)
Serial (Sub Sampling): use only one shard
Parallel (Parameter Mix)
Parallel (Iterative Parameter Mix)
Settings
Number of shards: 10
(see the paper for more details)
19. NER experiments: faster & better, close
to averaged perceptrons
20. NER experiments: faster & better, close
to averaged perceptrons
Non-averaged case: iterative mixing is faster and more
accurate than serial training.
Averaged case: iterative mixing is faster and similarly
accurate to serial training.
22. Different shard size: the more shards,
the slower convergence
23. Different shard size: the more shards,
the slower convergence
High parallelism leads to
slower convergence (at a rate
somewhere between the two
predictions above)
24. Conclusions
Distributed training of the structured perceptron via simple
parameter mixing strategies
Guaranteed to converge and separate the data (if
separable)
Results in fast and accurate classifiers
Trade-off between high parallelism and slow convergence
(+ applicable to the online passive-aggressive algorithm)
25. Presenter's comments
Parameter synchronization can be slow, especially when
the feature space or the number of epochs is large
Analysis of the generalization error (for inseparable case)?
Relation to voted perceptron?
Voted perceptron: weighting with survival time
Distributed perceptron: weighting with the number of
updates
Relation to Bayes point machines?