Distributed Perceptron

Introducing Distributed Training Strategies for the
             Structured Perceptron,
  published by R. McDonald, K. Hall & G. Mann
                  in NAACL 2010

           2010-10-06 / 2nd seminar for State-of-the-Art NLP
Distributed training of perceptrons
in a theoretically-proven way
Naive distribution strategy fails
   Parameter mixing (or averaging)

Simple modification
   Iterative parameter mixing

Proofs & Experiments
   Convergence
   Convergence speed
   NER experiments
   Dependency parsing experiments
Timeline

 1958 F. Rosenblatt
    Principles of Neurodynamics: Perceptrons and the
    Theory of Brain Mechanisms
 1962 H.D. Block and A.B. Novikoff (independently)
    the perceptron convergence theorem for the
    separable case
 1999 Y. Freund & R.E. Schapire
    voted perceptron with a bound on the generalization error
    for the inseparable case
 2002 M. Collins
    Generalization to the structured prediction problem
 2010 R. McDonald et al.
    parallelization with parameter mixing and
    synchronization
A new strategy of parallelization is
required for distributed perceptrons
  Gradient-based batch training algorithms have been
  parallelized in the MapReduce framework
  Parameter mixing works for maximum entropy models
     Divide the training data into a number of shards
     Train separate models with the shards
     Take the average of the models' weights
  Perceptrons?
     Non-convex objective function
     Simple parameter mixing doesn't work
Parameter mixing (averaging) fails (1/6)

Parameter mixing:
Train S perceptrons on S shards of the training data, then
take a weighted average of their weights (sketched in code below)




                    Distributed Training Strategies for the Structured Perceptron
                    by R. McDonald, K. Hall & G. Mann, 2010
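
A minimal Python sketch of this naive strategy (an illustration, not the paper's code; the shard layout and the helper names perceptron_epochs and feat are assumptions): each shard trains its own perceptron independently, and the resulting weight vectors are averaged with mixing weights μ.

import numpy as np

def perceptron_epochs(shard, feat, n_feats, epochs=10):
    # Plain structured perceptron trained on a single shard.
    # shard: list of (x, gold_label, candidate_labels); feat(x, y) -> np.ndarray
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y_gold, labels in shard:
            y_hat = max(labels, key=lambda y: w @ feat(x, y))   # inference
            if y_hat != y_gold:
                w += feat(x, y_gold) - feat(x, y_hat)           # perceptron update
    return w

def parameter_mix(shards, feat, n_feats, mu=None):
    # Naive parameter mixing: train each shard independently (embarrassingly
    # parallel), then take a weighted average of the weight vectors.
    S = len(shards)
    mu = mu if mu is not None else [1.0 / S] * S                # uniform mixing by default
    ws = [perceptron_epochs(sh, feat, n_feats) for sh in shards]
    return sum(m * w for m, w in zip(mu, ws))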
Parameter mixing (averaging) fails (2/6)

Counter example

Feature space (the features for y = 0 and for y = 1 occupy disjoint blocks):
f(x1,1,0) = [1 1 0 0 0 0] f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0] f(x1,2,1) = [0 0 0 0 0 1]
f(x2,1,0) = [0 1 1 0 0 0] f(x2,1,1) = [0 0 0 0 1 1]
f(x2,2,0) = [1 0 0 0 0 0] f(x2,2,1) = [0 0 0 1 0 0]

Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)

Preview of the consequence: mixing two local optima. Small shards can fool
the algorithm because of the extra initializations and tie-breakings.
Parameter mixing (averaging) fails (3/6)

Counter example

Feature space:
f(x1,1,0) = [1 1 0 0 0 0] f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0] f(x1,2,1) = [0 0 0 0 0 1]

Shard 1: (x1,1, 0), (x1,2, 1)

w1 := [0 0 0 0 0 0]                          {initialization}
w1·f(x1,1,0) ≦ w1·f(x1,1,1)                  {error: update}
w1 := [1 1 0 0 0 0] - [0 0 0 1 1 0]
    = [1 1 0 -1 -1 0]
w1·f(x1,2,0) ≦ w1·f(x1,2,1)                  {tie-breaking: no update}
Parameter mixing (averaging) fails (4/6)

Counter example

Feature space:
f(x2,1,0) = [0 1 1 0 0 0] f(x2,1,1) = [0 0 0 0 1 1]
f(x2,2,0) = [1 0 0 0 0 0] f(x2,2,1) = [0 0 0 1 0 0]

Shard 2: (x2,1, 0), (x2,2, 1)

w2 := [0 0 0 0 0 0]                          {initialization}
w2·f(x2,1,0) ≦ w2·f(x2,1,1)                  {error: update}
w2 := [0 1 1 0 0 0] - [0 0 0 0 1 1]
    = [0 1 1 0 -1 -1]
w2·f(x2,2,0) ≦ w2·f(x2,2,1)                  {tie-breaking: no update}
Parameter mixing (averaging) fails (5/6)

Counter example

Feature space:
f(x1,1,0) = [1 1 0 0 0 0]   f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0]   f(x1,2,1) = [0 0 0 0 0 1]
f(x2,1,0) = [0 1 1 0 0 0]   f(x2,1,1) = [0 0 0 0 1 1]
f(x2,2,0) = [1 0 0 0 0 0]   f(x2,2,1) = [0 0 0 1 0 0]

Shard 1: (x1,1, 0), (x1,2, 1) ... w1 = [1 1 0 -1 -1 0]
Shard 2: (x2,1, 0), (x2,2, 1) ... w2 = [0 1 1 0 -1 -1]

Mixed weight (μ1 w1 + μ2 w2, with μ1 + μ2 = 1): [μ1 1 μ2 -μ1 -1 -μ2]
Parameter mixing (averaging) fails (6/6)

Counter example

Feature space:
f(x1,1,0) = [1 1 0 0 0 0]   f(x1,1,1) = [0 0 0 1 1 0] ... μ1+1, -μ1-1
f(x1,2,0) = [0 0 1 0 0 0]   f(x1,2,1) = [0 0 0 0 0 1] ... μ2, -μ2
f(x2,1,0) = [0 1 1 0 0 0]   f(x2,1,1) = [0 0 0 0 1 1] ... μ2+1, -μ2-1
f(x2,2,0) = [1 0 0 0 0 0]   f(x2,2,1) = [0 0 0 1 0 0] ... μ1, -μ1

The mixed weight [μ1 1 μ2 -μ1 -1 -μ2] does not separate positives and
negatives:
   the y=0 feature vectors (left column) always score at least as high as
   the y=1 vectors (right column), i.e. w·f(·,0) ≧ w·f(·,1), so the examples
   whose gold label is 1 are misclassified (checked numerically below)
But a separating weight vector does exist: u = [-1 2 -1 1 -2 1]
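
A quick numeric check of the counterexample (a sketch; the vectors are the ones listed above, and μ1 = μ2 = 0.5 is just one admissible choice): the mixed weight misclassifies the two examples whose gold label is 1, while u classifies all four correctly.

import numpy as np

f = {  # f[(example, label)], as on the slides
    ("x1,1", 0): np.array([1, 1, 0, 0, 0, 0]), ("x1,1", 1): np.array([0, 0, 0, 1, 1, 0]),
    ("x1,2", 0): np.array([0, 0, 1, 0, 0, 0]), ("x1,2", 1): np.array([0, 0, 0, 0, 0, 1]),
    ("x2,1", 0): np.array([0, 1, 1, 0, 0, 0]), ("x2,1", 1): np.array([0, 0, 0, 0, 1, 1]),
    ("x2,2", 0): np.array([1, 0, 0, 0, 0, 0]), ("x2,2", 1): np.array([0, 0, 0, 1, 0, 0]),
}
gold = {"x1,1": 0, "x1,2": 1, "x2,1": 0, "x2,2": 1}

mu1 = mu2 = 0.5                                     # any mu1 + mu2 = 1 behaves the same
w_mix = np.array([mu1, 1, mu2, -mu1, -1, -mu2])     # mixed weight from the slides
u = np.array([-1, 2, -1, 1, -2, 1])                 # separating vector from the slides

for x, y in gold.items():
    ok_mix = w_mix @ f[(x, y)] > w_mix @ f[(x, 1 - y)]
    ok_u = u @ f[(x, y)] > u @ f[(x, 1 - y)]
    print(f"{x}: mixed weight correct = {ok_mix},  u correct = {ok_u}")
# The mixed weight gets x1,2 and x2,2 wrong (the gold-label-1 examples);
# u gets all four examples right.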
Iterative parameter mixing
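
A minimal Python sketch of the iterative parameter mixing procedure described in these slides (a serial simulation of the parallel map step; the data layout and helper names such as one_epoch_perceptron and feat are assumptions): each epoch, every shard runs one perceptron epoch starting from the current mixed weights, and the per-shard results are re-mixed before the next epoch.

import numpy as np

def one_epoch_perceptron(shard, feat, w):
    # One perceptron epoch over a single shard, starting from the mixed weights w.
    w = w.copy()
    k = 0                                     # k_{i,n}: updates (errors) in this epoch
    for x, y_gold, labels in shard:
        y_hat = max(labels, key=lambda y: w @ feat(x, y))
        if y_hat != y_gold:
            w += feat(x, y_gold) - feat(x, y_hat)
            k += 1
    return w, k

def iterative_parameter_mix(shards, feat, n_feats, n_epochs=10):
    w = np.zeros(n_feats)
    for _ in range(n_epochs):
        # "map" step: one epoch per shard, each starting from the mixed weights
        # (run in parallel in the real setting; a serial loop here)
        results = [one_epoch_perceptron(sh, feat, w) for sh in shards]
        ws, ks = zip(*results)
        if sum(ks) == 0:
            break                             # no errors on any shard: converged
        # "reduce" step: error-proportional mixing, mu_{i,n} = k_{i,n} / sum_i k_{i,n}
        w = sum((k / sum(ks)) * wi for wi, k in zip(ws, ks))
    return w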
Convergence theorem of iterative
parameter mixing (1/4)
  Assumptions
     u: separating weight vector (a unit vector, |u| = 1)
     γ: margin, γ ≦ u·(f(xt,yt) - f(xt,y')) for all t and all y' ≠ yt
     R: maxt,y' |f(xt,yt) - f(xt,y')|
     ki,n: the number of updates (errors) that occur in the n-th
     epoch of the i-th OneEpochPerceptron




                                    Distributed Training Strategies for the
                                    Structured Perceptron
                                    by R. McDonald, K. Hall & G. Mann, 2010
Convergence theorem of iterative
parameter mixing (2/4)
   Lower bound for the number of errors in an epoch

   (the per-epoch inequalities are shown as images on the slide; see the
   reconstruction below)
      ← from the definition of the margin: γ ≦ u·(f(xt,yt) - f(xt,y'))

 By induction on n,  u·w(avg,N) ≧ ΣnΣi μi,n ki,n γ

                       Distributed Training Strategies for the Structured Perceptron
                       by R. McDonald, K. Hall & G. Mann, 2010
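
The omitted per-epoch inequalities are presumably the standard perceptron argument applied shard by shard; a LaTeX reconstruction (assuming Σi μi,n = 1 and w(avg,0) = 0, as in the paper):

u \cdot w^{(i,n)} \;\ge\; u \cdot w^{(avg,n-1)} + k_{i,n}\,\gamma
   % each of the k_{i,n} updates adds u\cdot(f(x_t,y_t)-f(x_t,y')) \ge \gamma
u \cdot w^{(avg,n)} \;=\; \sum_i \mu_{i,n}\, u \cdot w^{(i,n)}
   \;\ge\; u \cdot w^{(avg,n-1)} + \sum_i \mu_{i,n} k_{i,n}\,\gamma
   % since \sum_i \mu_{i,n} = 1

Unrolling over n gives u·w(avg,N) ≧ ΣnΣi μi,n ki,n γ.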
Convergence theorem of iterative
parameter mixing (3/4)
   Upper bound for the number of errors in an epoch

   (the per-epoch inequalities are shown as images on the slide; see the
   reconstruction below)
      ← from the definitions: R ≧ |f(xt,yt) - f(xt,y')|  and  y' = argmaxy' w·f(xt,y')

 By induction on n,  |w(avg,N)|² ≦ ΣnΣi μi,n ki,n R²

                        Distributed Training Strategies for the Structured Perceptron
                        by R. McDonald, K. Hall & G. Mann, 2010
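
Here the omitted inequalities are presumably: each update increases the squared norm by at most R² (the cross term is non-positive, because updates happen only on errors, when w·(f(xt,yt) - f(xt,y')) ≦ 0), and mixing cannot increase the squared norm beyond the weighted average (Jensen's inequality). A LaTeX reconstruction:

\|w^{(i,n)}\|^2 \;\le\; \|w^{(avg,n-1)}\|^2 + k_{i,n} R^2
   % per update: \|w + f(x_t,y_t) - f(x_t,y')\|^2 \le \|w\|^2 + R^2,
   % since w \cdot (f(x_t,y_t) - f(x_t,y')) \le 0 at an error
\|w^{(avg,n)}\|^2 \;=\; \Big\|\sum_i \mu_{i,n} w^{(i,n)}\Big\|^2
   \;\le\; \sum_i \mu_{i,n}\,\|w^{(i,n)}\|^2
   \;\le\; \|w^{(avg,n-1)}\|^2 + \sum_i \mu_{i,n} k_{i,n} R^2

Unrolling over n gives |w(avg,N)|² ≦ ΣnΣi μi,n ki,n R².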
Convergence theorem of iterative
parameter mixing (4/4)
 |w(avg,N)|² ≧ (u·w(avg,N))² ≧ (ΣnΣi μi,n ki,n γ)²
                              = (ΣnΣi μi,n ki,n)² γ²

 |w(avg,N)|² ≦ (ΣnΣi μi,n ki,n) R²

 (ΣnΣi μi,n ki,n)² γ² ≦ (ΣnΣi μi,n ki,n) R²
 (ΣnΣi μi,n ki,n) γ²  ≦ R²
 (ΣnΣi μi,n ki,n)     ≦ R²/γ²




                            Distributed Training Strategies for the Structured Perceptron
                            by R. McDonald, K. Hall & G. Mann, 2010
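
Written out as one chain (the first inequality below is Cauchy-Schwarz, which is where the unit-norm assumption |u| = 1 is used):

\Big(\sum_{n,i}\mu_{i,n}k_{i,n}\Big)^{2}\gamma^{2}
 \;\le\; \big(u \cdot w^{(avg,N)}\big)^{2}
 \;\le\; \|u\|^{2}\,\|w^{(avg,N)}\|^{2}
 \;=\; \|w^{(avg,N)}\|^{2}
 \;\le\; \Big(\sum_{n,i}\mu_{i,n}k_{i,n}\Big) R^{2}
 \quad\Longrightarrow\quad
 \sum_{n,i}\mu_{i,n}k_{i,n} \;\le\; \frac{R^{2}}{\gamma^{2}}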
Convergence speed is predicted in two
ways (1/2)



Theorem 3 implies
   When we take uniform mixing weights, the worst-case bound on the total
   number of errors is proportional to the number of shards S (when the
   equality holds)
      implying that we may not benefit much from the parallelization:
         #(errors per epoch) can be multiplied by S, while
         the time required per epoch is only reduced to 1/S (spelled out below).

                       Distributed Training Strategies for the Structured Perceptron
                       by R. McDonald, K. Hall & G. Mann, 2010
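
Spelled out, substituting uniform weights μi,n = 1/S into the bound of Theorem 3:

\mu_{i,n} = \tfrac{1}{S}
 \;\Longrightarrow\;
 \frac{1}{S}\sum_{n,i} k_{i,n} \;\le\; \frac{R^{2}}{\gamma^{2}}
 \;\Longrightarrow\;
 \sum_{n,i} k_{i,n} \;\le\; \frac{S\,R^{2}}{\gamma^{2}}

so in the worst case the total number of updates grows by a factor of S, canceling the 1/S wall-clock saving per epoch.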
Convergence speed is predicted in two
ways (2/2)
Section 4.3
   When we take error-proportional mixing weights, the number of epochs
   Ndist is bounded as follows (bound not reproduced here)

                                          ↑ error-proportional mixing
              geometric mean ≦ arithmetic mean
   Worst case (when the equality holds):
       the same number of epochs as the vanilla (serial) perceptron;
       even then, each epoch is S times faster because of the
       parallelization
   Ndist does not depend on the number of shards,
       implying that we can benefit well from parallelization
                      Distributed Training Strategies for the Structured Perceptron
                      by R. McDonald, K. Hall & G. Mann, 2010
Experiments

Comparison
   Serial (All Data)
   Serial (Sub Sampling): use only one shard
   Parallel (Parameter Mix)
   Parallel (Iterative Parameter Mix)
Settings
   Number of shards: 10
   (see the paper for more details)
NER experiments: faster & better, close
to averaged perceptrons




             Distributed Training Strategies for the Structured Perceptron
             by R. McDonald, K. Hall & G. Mann, 2010
NER experiments: faster & better, close
to averaged perceptrons




Non-averaged case: iterative mixing is faster and more accurate than serial.
Averaged case: iterative mixing is faster and similarly accurate to serial.

                      Distributed Training Strategies for the Structured Perceptron
                      by R. McDonald, K. Hall & G. Mann, 2010
Dependency parsing experiments:
similar improvements




            Distributed Training Strategies for the Structured Perceptron
            by R. McDonald, K. Hall & G. Mann, 2010
Different shard size: the more shards,
the slower convergence




              Distributed Training Strategies for the Structured Perceptron
              by R. McDonald, K. Hall & G. Mann, 2010
Different shard size: the more shards,
the slower convergence




High parallelism leads to slower convergence (at a rate
somewhere between the two theoretical predictions)


                    Distributed Training Strategies for the Structured Perceptron
                    by R. McDonald, K. Hall & G. Mann, 2010
Conclusions

 Distributed training of the structured perceptron via simple
 parameter mixing strategies
    Guaranteed to converge and separate the data (if
    separable)
    Results in fast and accurate classifiers
 Trade-off between higher parallelism and slower convergence


 (+ also applicable to the online passive-aggressive algorithm)
Presenter's comments

 Parameter synchronization can be slow, especially when
 the feature space or the number of epochs is large
 Analysis of the generalization error (for the inseparable case)?
 Relation to voted perceptron?
    Voted perceptron: weighting with survival time
    Distributed perceptron: weighting with the number of
    updates
 Relation to Bayes point machines?
