Distributed perceptron
1. Distributed Perceptron
Introducing Distributed Training Strategies for the
Structured Perceptron
by R. McDonald, K. Hall & G. Mann,
published at NAACL 2010
2010-10-06 / 2nd seminar for State-of-the-Art NLP
2. Distributed training of perceptrons
with theoretical guarantees
Naive distribution strategy fails
Parameter mixing (or averaging)
Simple modification
Iterative parameter mixing
Proofs & Experiments
Convergence
Convergence speed
NER experiments
Dependency parsing experiments
3. Timeline
1958 F. Rosenblatt
The Perceptron: A Probabilistic Model for Information
Storage and Organization in the Brain
1962 H.D. Block and A.B. Novikoff (independently)
the perceptron convergence theorem for the
separable case
1999 Y. Freund & R.E. Schapire
voted perceptron with a bound to the generalization error
for the inseparable case
2002 M. Collins
Generalization to the structured prediction problem
2010 R. McDonald et al.
parallelization with parameter mixing and
synchronization
4. A new strategy of parallelization is
required for distributed perceptrons
Gradient-based batch training algorithms have been
parallelized in the form of MapReduce
Parameter mixing works for maximum entropy models
Divide the training data into a number of shards
Train separate models with the shards
Take average of the weights of the models
Perceptrons?
Non-convex objective function
Simple parameter mixing doesn't work
5. Parameter mixing (averaging) fails (1/6)
Parameter mixing (sketched in code below):
Train S perceptrons, one on each of S shards of the training data
Take a weighted average of their weights
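A minimal sketch of this strategy, using a plain binary perceptron with ±1 labels for brevity rather than the paper's structured perceptron; the function names and the uniform default for the mixing weights are illustrative assumptions, not from the paper:

    import numpy as np

    def perceptron_epochs(X, y, w, epochs=10):
        # Plain binary perceptron (labels +1/-1), trained for several epochs.
        for _ in range(epochs):
            for x_t, y_t in zip(X, y):
                if y_t * np.dot(w, x_t) <= 0:   # error -> update
                    w = w + y_t * x_t
        return w

    def parameter_mixing(X, y, S=10, epochs=10):
        # Naive parameter mixing: train one perceptron per shard to
        # completion, then average the weights once at the end.
        shards = zip(np.array_split(X, S), np.array_split(y, S))
        weights = [perceptron_epochs(Xi, yi, np.zeros(X.shape[1]), epochs)
                   for Xi, yi in shards]
        mu = np.full(S, 1.0 / S)                # uniform mixing weights
        return sum(m * w for m, w in zip(mu, weights))

In a MapReduce setting each shard would be trained by a separate mapper, with a single reducer computing the weighted average.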
6. Parameter mixing (averaging) fails (2/6)
Counterexample
Feature space (separated into observed and non-observed examples):
f(x1,1,0) = [1 1 0 0 0 0] f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0] f(x1,2,1) = [0 0 0 0 0 1]
f(x2,1,0) = [0 1 1 0 0 0] f(x2,1,1) = [0 0 0 1 1 1]
f(x2,2,0) = [1 0 0 0 0 0] f(x2,2,1) = [0 0 0 1 0 0]
Preview of the consequence:
Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)
Mixing of two local optima: smaller data can fool the algorithm,
because of the increased initializations and tie-breakings.
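The paper's remedy, analyzed in the slides that follow, is iterative parameter mixing: mix the shard weights after every epoch and restart each shard's OneEpochPerceptron from the mixed vector. A minimal sketch under the same simplifications as above (plain binary perceptron; error-proportional mixing weights μi,n = ki,n/Σj kj,n, with uniform weights as the obvious alternative):

    import numpy as np

    def one_epoch_perceptron(X, y, w):
        # One epoch from a given starting point; returns the new
        # weights and the number of updates (errors) k made.
        k = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:
                w = w + y_t * x_t
                k += 1
        return w, k

    def iterative_parameter_mixing(X, y, S=10, max_epochs=10):
        shards = list(zip(np.array_split(X, S), np.array_split(y, S)))
        w_avg = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            # Every shard restarts its epoch from the mixed weights.
            results = [one_epoch_perceptron(Xi, yi, w_avg) for Xi, yi in shards]
            k_total = sum(k for _, k in results)
            if k_total == 0:      # no errors anywhere: converged
                break
            # Error-proportional mixing: mu_i = k_i / sum_j k_j.
            w_avg = sum((k / k_total) * w for w, k in results)
        return w_avg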
12. Convergence theorem of iterative
parameter mixing (1/4)
Assumptions
u: separating weight vector (assume |u| = 1)
γ: margin, γ ≦ u·(f(xt,yt) − f(xt,y')) for all t and all y' ≠ yt
R: maxt,y' |f(xt,yt) − f(xt,y')|
ki,n: the number of updates (errors) that occur in the n-th
epoch of the i-th OneEpochPerceptron
13. Convergence theorem of iterative
parameter mixing (2/4)
Lower bound on the number of errors in an epoch
From the definition of the margin, γ ≦ u·(f(xt,yt) − f(xt,y')),
so each update adds at least γ to u·w
By induction on n: u·w(avg,N) ≧ ΣnΣi μi,n ki,n γ
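A sketch of the induction step behind this bound, with wi(n) denoting shard i's weights after its n-th epoch and w(avg,n) = Σi μi,n wi(n) the mixed vector (the superscript notation is mine):

    % Each of the k_{i,n} updates in shard i adds a vector v with u \cdot v \ge \gamma:
    u \cdot w_i^{(n)} \ge u \cdot w^{(avg,n-1)} + k_{i,n}\gamma
    % Mixing is a convex combination (\sum_i \mu_{i,n} = 1), so the bound survives:
    u \cdot w^{(avg,n)} = \sum_i \mu_{i,n}\, u \cdot w_i^{(n)}
                      \ge u \cdot w^{(avg,n-1)} + \sum_i \mu_{i,n} k_{i,n}\gamma
    % Unrolling over n = 1, ..., N with w^{(avg,0)} = 0:
    u \cdot w^{(avg,N)} \ge \sum_{n=1}^{N} \sum_i \mu_{i,n} k_{i,n}\gamma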
14. Convergence theorem of iterative
parameter mixing (3/4)
Upper bound on the number of errors in an epoch
From the definitions: R ≧ |f(xt,yt) − f(xt,y')|, and since
updates happen only on errors, y' = argmaxy w·f(xt,y) implies
w·(f(xt,yt) − f(xt,y')) ≦ 0
By induction on n: |w(avg,N)|² ≦ ΣnΣi μi,n ki,n R²
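A sketch of the corresponding induction step: each update adds v = f(xt,yt) − f(xt,y') with |v| ≦ R and w·v ≦ 0, and the mixed norm is handled with Jensen's inequality (again, the superscript notation is mine):

    % One update cannot grow the squared norm by more than R^2:
    \|w + v\|^2 = \|w\|^2 + 2\,w \cdot v + \|v\|^2 \le \|w\|^2 + R^2
    % So after the k_{i,n} updates of shard i in epoch n:
    \|w_i^{(n)}\|^2 \le \|w^{(avg,n-1)}\|^2 + k_{i,n} R^2
    % Mixing is a convex combination, so by Jensen's inequality:
    \|w^{(avg,n)}\|^2 \le \sum_i \mu_{i,n} \|w_i^{(n)}\|^2
                      \le \|w^{(avg,n-1)}\|^2 + \sum_i \mu_{i,n} k_{i,n} R^2
    % Unrolling over n = 1, ..., N:
    \|w^{(avg,N)}\|^2 \le \sum_{n=1}^{N} \sum_i \mu_{i,n} k_{i,n} R^2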
15. Convergence theorem of iterative
parameter mixing (4/4)
Since |u| = 1, the Cauchy–Schwarz inequality gives
|w(avg,N)|² ≧ (u·w(avg,N))² ≧ (ΣnΣi μi,n ki,n γ)² = (ΣnΣi μi,n ki,n)² γ²
Combining with the upper bound:
|w(avg,N)|² ≦ (ΣnΣi μi,n ki,n) R²
(ΣnΣi μi,n ki,n)² γ² ≦ (ΣnΣi μi,n ki,n) R²
(ΣnΣi μi,n ki,n) γ² ≦ R²
(ΣnΣi μi,n ki,n) ≦ R²/γ²
16. Convergence speed is predicted in two
ways (1/2)
Theorem 3 implies:
With uniform mixing weights (μi,n = 1/S), the number of
errors is proportional to the number of shards in the worst
case (when the equality holds), as shown below,
implying that we cannot benefit very much from the
parallelization:
#(errors per epoch) can be multiplied by S, while
the time required per epoch is only reduced to 1/S.
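The worst-case factor of S comes from substituting the uniform weights into the bound of slide 15:

    % With \mu_{i,n} = 1/S in \sum_n \sum_i \mu_{i,n} k_{i,n} \le R^2/\gamma^2:
    \frac{1}{S} \sum_n \sum_i k_{i,n} \le \frac{R^2}{\gamma^2}
    \quad\Longrightarrow\quad
    \sum_n \sum_i k_{i,n} \le \frac{S R^2}{\gamma^2}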
17. Convergence speed is predicted in two
ways (2/2)
Section 4.3
With error-proportional mixing weights, μi,n = ki,n / Σj kj,n,
the number of epochs Ndist is bounded by Ndist ≦ R²/γ²
(geometric mean ≦ arithmetic mean; see the sketch below)
Worst case (when the equality holds):
the same number of epochs as the vanilla serial perceptron,
but even then each epoch is S times faster because of the
parallelization
Ndist doesn't depend on the number of shards,
implying that we benefit well from the parallelization
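One elementary way to see the S-independent bound, assuming integer error counts (a sketch; the paper's own derivation via the mean inequality may be phrased differently):

    % Error-proportional mixing: \mu_{i,n} = k_{i,n}/k_n with k_n = \sum_i k_{i,n}.
    % For nonnegative integers k_{i,n}^2 \ge k_{i,n}, so while errors remain (k_n \ge 1):
    \sum_i \mu_{i,n} k_{i,n} = \frac{\sum_i k_{i,n}^2}{k_n} \ge \frac{\sum_i k_{i,n}}{k_n} = 1
    % Summing over the N_{dist} epochs that still contain an error:
    N_{dist} \le \sum_n \sum_i \mu_{i,n} k_{i,n} \le \frac{R^2}{\gamma^2}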
18. Experiments
Comparison
Serial (All Data)
Serial (Sub Sampling): use only one shard
Parallel (Parameter Mix)
Parallel (Iterative Parameter Mix)
Settings
Number of shards: 10
(see the paper for more details)
19. NER experiments: faster & better, close
to averaged perceptrons
20. NER experiments: faster & better, close
to averaged perceptrons
Non-averaged case: iterative mixing is faster and more
accurate than serial training.
Averaged case: iterative mixing is faster and similarly
accurate to serial training.
22. Different shard size: the more shards,
the slower convergence
23. Different shard size: the more shards,
the slower convergence
High parallelism leads to
slower convergence (at a rate
somewhere between the two
predictions above)
24. Conclusions
Distributed training of the structured perceptron via simple
parameter mixing strategies
Guaranteed to converge and separate the data (if
separable)
Results in fast and accurate classifiers
Trade-off between high parallelism and slow convergence
(+ applicable to the online passive-aggressive algorithm)
25. Presenter's comments
Parameter synchronization can be slow, especially when
the feature space or the number of epochs is large
Analysis of the generalization error (for inseparable case)?
Relation to voted perceptron?
Voted perceptron: weighting with survival time
Distributed perceptron: weighting with the number of
updates
Relation to Bayes point machines?