Distributed Perceptron

Introducing "Distributed Training Strategies for the Structured Perceptron",
published by R. McDonald, K. Hall & G. Mann in NAACL 2010

2010-10-06 / 2nd seminar for State-of-the-Art NLP
Distributed training of perceptrons
in a theoretically-proven way
Naive distribution strategy fails
   Parameter mixing (or averaging)

Simple modification
   Iterative parameter mixing

Proofs & Experiments
   Convergence
   Convergence speed
   NER experiments
   Dependency parsing experiments
Timeline

 1958 F. Rosenblatt
    Principles of Neurodynamics: Perceptrons and the
    Theory of Brain Mechanisms
 1962 H.D. Block and A.B. Novikoff (independently)
    the perceptron convergence theorem for the
    separable case
 1999 Y. Freund & R.E. Schapire
    voted perceptron with a bound to the generalization error
    for the inseparable case
 2002 M. Collins
    Generalization to the structured prediction problem
 2010 R. McDonald et al.
    parallelization with parameter mixing and
    synchronization
A new strategy of parallelization is
required for distributed perceptrons
  Gradient-based batch training algorithms have been
  parallelized in the Map-Reduce framework
  Parameter mixing works for maximum entropy models
     Divide the training data into a number of shards
     Train a separate model on each shard
     Take the average of the models' weights
  Perceptrons?
     Non-convex objective function
     Simple parameter mixing doesn't work
Parameter mixing (averaging) fails (1/6)

Parameter mixing:
Train S perceptrons, one on each of S shards of the training data,
then take a weighted average of their weights:

   w(mix) = Σi μi wi,   with μi ≥ 0 and Σi μi = 1
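
A minimal sketch of this naive strategy (my own illustration, not the paper's code). It assumes a feature function feats(x, y) that returns a NumPy vector, a small finite label set labels, and a weight dimension dim; all of these names are mine:

import numpy as np

def decode(w, x, labels, feats):
    # Prediction is argmax_y w . f(x, y); labels is assumed small enough to enumerate.
    return max(labels, key=lambda y: float(w @ feats(x, y)))

def train_perceptron(shard, dim, labels, feats, epochs=10):
    # Ordinary (serial) structured perceptron, trained to the end on one shard.
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in shard:
            y_hat = decode(w, x, labels, feats)
            if y_hat != y:
                w += feats(x, y) - feats(x, y_hat)
    return w

def parameter_mix(shards, dim, labels, feats, mix=None):
    # Naive parameter mixing: train each shard independently, then average once at the end.
    ws = [train_perceptron(s, dim, labels, feats) for s in shards]
    if mix is None:
        mix = [1.0 / len(ws)] * len(ws)          # uniform mixture weights, summing to 1
    return sum(mu * w for mu, w in zip(mix, ws))

The next slides show why this single, final average can fail even when the full data set is separable.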
Parameter mixing (averaging) fails (2/6)

Counter example

Feature space (split into features that fire for label 0 and features that fire for label 1):
f(x1,1,0) = [1 1 0 0 0 0]   f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0]   f(x1,2,1) = [0 0 0 0 0 1]
f(x2,1,0) = [0 1 1 0 0 0]   f(x2,1,1) = [0 0 0 0 1 1]
f(x2,2,0) = [1 0 0 0 0 0]   f(x2,2,1) = [0 0 0 1 0 0]

Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)

Preview of the consequence: the mixed weight is a mixture of two local optima.
Small shards can fool the algorithm because of the extra initializations and
tie-breakings they introduce.
Parameter mixing (averaging) fails (3/6)

Counter example

Feature space:
f(x1,1,0) = [1 1 0 0 0 0] f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0] f(x1,2,1) = [0 0 0 0 0 1]

Shard 1: (x1,1, 0), (x1,2, 1)        (ties are broken toward label 1)

w1 := [0 0 0 0 0 0]                               {initialization}
w1·f(x1,1,0) ≦ w1·f(x1,1,1)                       → predict y=1, wrong, update
w1 := [1 1 0 0 0 0] - [0 0 0 1 1 0]
    = [1 1 0 -1 -1 0]
w1·f(x1,2,0) ≦ w1·f(x1,2,1)                       {tie-breaking: predict y=1, correct, no update}
Parameter mixing (averaging) fails (4/6)

Counter example

Feature space:
f(x2,1,0) = [0 1 1 0 0 0]   f(x2,1,1) = [0 0 0 0 1 1]
f(x2,2,0) = [1 0 0 0 0 0]   f(x2,2,1) = [0 0 0 1 0 0]

Shard 2: (x2,1, 0), (x2,2, 1)

w2 := [0 0 0 0 0 0]                               {initialization}
w2·f(x2,1,0) ≦ w2·f(x2,1,1)                       → predict y=1, wrong, update
w2 := [0 1 1 0 0 0] - [0 0 0 0 1 1]
    = [0 1 1 0 -1 -1]
w2·f(x2,2,0) ≦ w2·f(x2,2,1)                       {tie-breaking: predict y=1, correct, no update}
Parameter mixing (averaging) fails (5/6)

Counter example

Feature space:
f(x1,1,0) = [1 1 0 0 0 0]   f(x1,1,1) = [0 0 0 1 1 0]
f(x1,2,0) = [0 0 1 0 0 0]   f(x1,2,1) = [0 0 0 0 0 1]
f(x2,1,0) = [0 1 1 0 0 0]   f(x2,1,1) = [0 0 0 0 1 1]
f(x2,2,0) = [1 0 0 0 0 0]   f(x2,2,1) = [0 0 0 1 0 0]

Shard 1: (x1,1, 0), (x1,2, 1)  ...  w1 = [1 1 0 -1 -1 0]
Shard 2: (x2,1, 0), (x2,2, 1)  ...  w2 = [0 1 1 0 -1 -1]

Mixed weight: μ1·w1 + μ2·w2 = [μ1 1 μ2 -μ1 -1 -μ2]    (μ1 + μ2 = 1)
Parameter mixing (averaging) fails (6/6)

Counter example

Feature space (the score of each vector under the mixed weight is shown after "..."):
f(x1,1,0) = [1 1 0 0 0 0]   f(x1,1,1) = [0 0 0 1 1 0] ... μ1+1, -μ1-1
f(x1,2,0) = [0 0 1 0 0 0]   f(x1,2,1) = [0 0 0 0 0 1] ... μ2, -μ2
f(x2,1,0) = [0 1 1 0 0 0]   f(x2,1,1) = [0 0 0 0 1 1] ... μ2+1, -μ2-1
f(x2,2,0) = [1 0 0 0 0 0]   f(x2,2,1) = [0 0 0 1 0 0] ... μ1, -μ1

The mixed weight w = [μ1 1 μ2 -μ1 -1 -μ2] doesn't separate positives and
negatives:
   the label-0 (left-hand) vectors always score at least as high,
      w·f(*,0) ≧ w·f(*,1),
   so the examples with gold label 1, (x1,2, 1) and (x2,2, 1), are misclassified.
But there is a separating weight vector: [-1 2 -1 1 -2 1]
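
A quick numerical check of the counter-example (my own script, not from the paper), instantiating the mixing with uniform weights μ1 = μ2 = 1/2:

import numpy as np

# Feature vectors of the counter-example, keyed by (example, label).
F = {
    ("x1,1", 0): np.array([1, 1, 0, 0, 0, 0]), ("x1,1", 1): np.array([0, 0, 0, 1, 1, 0]),
    ("x1,2", 0): np.array([0, 0, 1, 0, 0, 0]), ("x1,2", 1): np.array([0, 0, 0, 0, 0, 1]),
    ("x2,1", 0): np.array([0, 1, 1, 0, 0, 0]), ("x2,1", 1): np.array([0, 0, 0, 0, 1, 1]),
    ("x2,2", 0): np.array([1, 0, 0, 0, 0, 0]), ("x2,2", 1): np.array([0, 0, 0, 1, 0, 0]),
}
gold = {"x1,1": 0, "x1,2": 1, "x2,1": 0, "x2,2": 1}

w1 = np.array([1, 1, 0, -1, -1, 0])      # shard-1 perceptron after its pass
w2 = np.array([0, 1, 1, 0, -1, -1])      # shard-2 perceptron after its pass
w_mix = 0.5 * w1 + 0.5 * w2              # parameter mixing with mu1 = mu2 = 1/2
u = np.array([-1, 2, -1, 1, -2, 1])      # the separating vector claimed above

def mistakes(w):
    # Count examples whose wrong label scores at least as high as the gold label.
    return sum(w @ F[(x, 1 - y)] >= w @ F[(x, y)] for x, y in gold.items())

print("mixed weight mistakes:", mistakes(w_mix))   # 2: mixing fails to separate
print("separator mistakes:  ", mistakes(u))        # 0: the data is separable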
Iterative parameter mixing
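
The paper gives the algorithm as pseudo-code; below is my own minimal sketch, reusing the assumed feats/labels/dim interface from the naive-mixing sketch above. The only change from naive mixing is that weights are mixed and re-broadcast to every shard after each epoch; error-proportional mixing (μi,n = ki,n / Σj kj,n) is one of the mixing choices the slides discuss.

import numpy as np

def one_epoch_perceptron(w_init, shard, labels, feats):
    # One pass over one shard, starting from the shared weights.
    # Returns the shard-local weights and k = number of updates (errors).
    w, k = w_init.astype(float), 0
    for x, y in shard:
        y_hat = max(labels, key=lambda cand: float(w @ feats(x, cand)))
        if y_hat != y:
            w += feats(x, y) - feats(x, y_hat)
            k += 1
    return w, k

def iterative_parameter_mix(shards, dim, labels, feats, max_epochs=10):
    w = np.zeros(dim)
    for _ in range(max_epochs):
        # In a real system each call below runs on its own worker (e.g. one map task per shard).
        results = [one_epoch_perceptron(w, s, labels, feats) for s in shards]
        total_errors = sum(k for _, k in results)
        if total_errors == 0:
            break                                   # every shard was error-free: converged
        # Error-proportional mixing weights mu_{i,n} = k_{i,n} / sum_j k_{j,n}.
        w = sum((k / total_errors) * wi for wi, k in results)
    return w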
Convergence theorem of iterative
parameter mixing (1/4)
  Assumptions
     u: separating weight vector, with ‖u‖ = 1
     γ: margin, γ ≦ u·(f(xt,yt) - f(xt,y')) for all t and all y' ≠ yt
     R: maxt,y' |f(xt,yt) - f(xt,y')|
     ki,n: the number of updates (errors) that occur in the n-th epoch
     of the i-th OneEpochPerceptron




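With these definitions, the bound that the next three slides establish (Theorem 3 in the paper, restated here for reference) is:

\[
\sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n}\, k_{i,n} \;\le\; \frac{R^2}{\gamma^2},
\qquad \text{with } \mu_{i,n} \ge 0,\ \sum_{i=1}^{S} \mu_{i,n} = 1 \text{ for every epoch } n .
\]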
Convergence theorem of iterative
parameter mixing (2/4)
   Lower bound on the number of errors in an epoch

   From the margin definition, γ ≦ u·(f(xt,yt) - f(xt,y')), every update
   moves the weights at least γ further along u.

   By induction on n:  u·w(avg,N) ≧ ΣnΣi μi,n ki,n γ
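Spelling out the inductive step (a sketch using only the definitions above):

\[
u \cdot w^{(i,n)} \;\ge\; u \cdot w^{(\mathrm{avg},\,n-1)} + k_{i,n}\,\gamma
\qquad\text{(each of the $k_{i,n}$ updates adds $f(x_t,y_t)-f(x_t,y')$, and $u\cdot(f(x_t,y_t)-f(x_t,y')) \ge \gamma$)}
\]
\[
u \cdot w^{(\mathrm{avg},\,n)} \;=\; \sum_{i} \mu_{i,n}\, u \cdot w^{(i,n)}
\;\ge\; u \cdot w^{(\mathrm{avg},\,n-1)} + \sum_{i} \mu_{i,n}\, k_{i,n}\,\gamma
\qquad\text{(since $\textstyle\sum_i \mu_{i,n} = 1$)}
\]

Unrolling over $n = 1, \dots, N$ with $w^{(\mathrm{avg},\,0)} = 0$ gives
$u \cdot w^{(\mathrm{avg},\,N)} \ge \sum_{n}\sum_{i} \mu_{i,n} k_{i,n}\, \gamma$.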
Convergence theorem of iterative
parameter mixing (3/4)
   Upper bound on the number of errors in an epoch

   From the definitions, R ≧ |f(xt,yt) - f(xt,y')|, and because the
   prediction is y' = argmaxy w·f(xt,y), each update increases |w|²
   by at most R².

   By induction on n:  |w(avg,N)|² ≦ ΣnΣi μi,n ki,n R²
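And the corresponding sketch for the upper bound: since $y' = \arg\max_y w \cdot f(x_t, y)$, we have $w \cdot (f(x_t,y_t) - f(x_t,y')) \le 0$, so each update satisfies

\[
\| w + f(x_t,y_t) - f(x_t,y') \|^2 \;\le\; \|w\|^2 + R^2 ,
\qquad\text{hence}\qquad
\| w^{(i,n)} \|^2 \;\le\; \| w^{(\mathrm{avg},\,n-1)} \|^2 + k_{i,n} R^2 .
\]

Mixing, using convexity of $\|\cdot\|^2$ (Jensen's inequality with $\sum_i \mu_{i,n} = 1$):

\[
\| w^{(\mathrm{avg},\,n)} \|^2
\;=\; \Big\| \sum_i \mu_{i,n} w^{(i,n)} \Big\|^2
\;\le\; \sum_i \mu_{i,n} \| w^{(i,n)} \|^2
\;\le\; \| w^{(\mathrm{avg},\,n-1)} \|^2 + \sum_i \mu_{i,n} k_{i,n} R^2 ,
\]

and unrolling over $n$ gives $\| w^{(\mathrm{avg},\,N)} \|^2 \le \sum_n \sum_i \mu_{i,n} k_{i,n} R^2$.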
Convergence theorem of iterative
parameter mixing (4/4)
 |w(avg,N)|² ≧ (u·w(avg,N))² ≧ (ΣnΣi μi,n ki,n γ)²     (Cauchy–Schwarz, with ‖u‖ = 1)
                              = (ΣnΣi μi,n ki,n)² γ²

 |w(avg,N)|² ≦ (ΣnΣi μi,n ki,n) R²

 Combining the two bounds:
 (ΣnΣi μi,n ki,n)² γ² ≦ (ΣnΣi μi,n ki,n) R²
 (ΣnΣi μi,n ki,n) γ²  ≦ R²
  ΣnΣi μi,n ki,n      ≦ R²/γ²
Convergence speed is predicted in two
ways (1/2)



With uniform mixing weights μi,n = 1/S, the bound of Theorem 3 becomes ΣnΣi ki,n ≦ S·R²/γ².

Theorem 3 implies
   When we take uniform weights for mixing, the total number of
   errors is, in the worst case (when equality holds), proportional
   to the number of shards S
      implying that we cannot benefit much from parallelization:
         #(errors) can be multiplied by S, while
         the time required per epoch is only reduced to 1/S,
         so the two effects can cancel out
Convergence speed is predicted in two
ways (2/2)
Section 4.3 of the paper
   When we take error-proportional weighting for mixing,
   μi,n = ki,n / Σj kj,n, the number of epochs Ndist is bounded
   (the derivation uses the fact that the geometric mean ≦ the arithmetic mean)
   Worst case (when the equality holds)
       The same number of epochs as the serial (vanilla) perceptron
       Even then, each epoch is S times faster because of the
       parallelization
   The bound on Ndist doesn't depend on the number of shards
       implying that we benefit well from parallelization
Experiments

Comparison
   Serial (All Data)
   Serial (Sub Sampling): use only one shard
   Parallel (Parameter Mix)
   Parallel (Iterative Parameter Mix)
Settings
   Number of shards: 10
   (see the paper for more details)
NER experiments: faster & better, close
to averaged perceptrons




[Figure: NER experimental results (McDonald, Hall & Mann, 2010)]
NER experiments: faster & better, close
to averaged perceptrons




[Figure: NER test results (McDonald, Hall & Mann, 2010)]

Non-averaged case: iterative mixing is faster and more accurate than serial training.
Averaged case: iterative mixing is faster and similarly accurate to serial training.
Dependency parsing experiments:
similar improvements




[Figure: dependency parsing results (McDonald, Hall & Mann, 2010)]
Different shard size: the more shards,
the slower convergence




[Figure: convergence for different numbers of shards (McDonald, Hall & Mann, 2010)]
Different shard size: the more shards,
the slower convergence




[Figure: convergence for different numbers of shards (McDonald, Hall & Mann, 2010)]

High parallelism leads to slower convergence, at a rate somewhere
between the two predictions above.
Conclusions

 Distributed training of the structured perceptron via simple
 parameter mixing strategies
    Guaranteed to converge and separate the data (if
    separable)
    Results in fast and accurate classifiers
 Trade-off: higher parallelism means slower convergence


 (+ also applicable to the online passive-aggressive algorithm)
Presenter's comments

 Parameter synchronization can be slow, especially when
 the feature space or the number of epochs is large
 Analysis of the generalization error (for inseparable case)?
 Relation to voted perceptron?
    Voted perceptron: weighting with survival time
    Distributed perceptron: weighting with the number of
    updates
 Relation to Bayes point machines?
