TECHNOMETRICS ©, VOL. 23, NO. 2, MAY 1981




A Relative Offset Orthogonality Convergence Criterion for Nonlinear Least Squares

Douglas M. Bates
Department of Statistics
University of Wisconsin
Madison, WI

Donald G. Watts
Department of Mathematics and Statistics
Queen's University
Kingston, Ontario
Canada K7L 3N6


An orthogonality convergence criterion using relative offset is proposed. This criterion is compared to currently used criteria and its advantages are discussed.


                   KEY WORDS: Nonlinear; Regression; Least Squares; Convergence criterion; Orthogonality.



1. INTRODUCTION

A vital part of any nonlinear least squares estimation program is the test for convergence to the least squares estimates. Such a test, or convergence criterion, consists of an indicator calculated at each iteration and a tolerance level such that convergence is declared when the indicator falls below the tolerance level.

For the model

$$y_t = f(x_t, \theta) + \varepsilon_t \tag{1.1}$$

where $\theta = (\theta_1, \theta_2, \ldots, \theta_p)'$ is a set of unknown parameters, $y_t$ is the observed value on the $t$th experiment ($t = 1, 2, \ldots, n$) for which the control variables assume the values $x_t = (x_{1t}, x_{2t}, \ldots, x_{lt})'$, and $\varepsilon_t$ is an additive random error, the least squares estimates are the values $\hat{\theta}$ that minimize

$$S(\theta) = \sum_{t=1}^{n} \left(y_t - f(x_t, \theta)\right)^2. \tag{1.2}$$

Geometrically, we may consider the vector $y = (y_1, y_2, \ldots, y_n)'$ as a point in an $n$-dimensional sample space, and the expected responses conditional on the parameter vector $\theta$ as a vector

$$\eta(\theta) = (\eta_1(\theta), \eta_2(\theta), \ldots, \eta_n(\theta))' \tag{1.3}$$

where

$$\eta_t(\theta) = f(x_t, \theta). \tag{1.4}$$

Each value of $\theta$ defines a point in the sample space, $\eta(\theta)$, which lies on the solution locus (Box and Lucas 1959). Equation (1.2) can then be written

$$S(\theta) = \|y - \eta(\theta)\|^2, \tag{1.5}$$

where the double vertical bars denote the length of a vector, so the least squares estimates are those values $\hat{\theta}$ such that $\eta(\hat{\theta})$ is the point on the solution locus closest to $y$. This implies that the residual vector $y - \eta(\hat{\theta})$ is orthogonal to the tangent plane to the solution locus at $\eta(\hat{\theta})$, where the tangent plane is specified by the columns of the derivative matrix.

Many authors writing about nonlinear least squares (e.g., Bard 1974, Draper and Smith 1966, Jennrich and Sampson 1968, Ralston and Jennrich 1978, and Kennedy and Gentle 1980) recommend relative change convergence criteria based on changes in $S(\theta)$ and the parameters in going from the $i$th to the $(i+1)$th iteration. That is, if the relative change in the sum of squares at the $i$th iteration,

$$\frac{S(\theta^{(i)}) - S(\theta^{(i+1)})}{S(\theta^{(i)})}, \tag{1.6}$$

falls in the interval 0 to $\delta_s$, where $\delta_s$ is a preselected tolerance level such as $10^{-4}$, then the reduction in the sum of squares is considered insufficient to warrant continuing, and so computation may be halted. This is usually accompanied by a parameter relative change criterion such as

$$\frac{|\theta_j^{(i+1)} - \theta_j^{(i)}|}{|\theta_j^{(i)}|} < \delta_p, \qquad j = 1, \ldots, p \tag{1.7}$$

so that when every relative parameter change at the $i$th iteration is less than $\delta_p$ the parameter increments are too small to warrant continuing and the program terminates. Himmelblau (1972) recommends that both of these criteria be included since compliance to one does not imply compliance to the other. We wish to emphasize, however, that compliance to even both relative change criteria does not guarantee convergence.
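The two relative change criteria (1.6) and (1.7) can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not code from any of the programs cited: the function names, the one-parameter model exp(theta * x), the data vector y = (2, 5), the tolerance names delta_s and delta_p, and the artificially tiny step from 0.5 to 0.50001 are all hypothetical choices made for the example.

```python
import numpy as np

def model(theta, x):
    # Hypothetical one-parameter model f(x_t, theta) = exp(theta * x_t).
    return np.exp(theta * x)

def sum_of_squares(theta, x, y):
    # S(theta) = ||y - eta(theta)||^2, as in (1.5).
    r = y - model(theta, x)
    return float(r @ r)

def relative_change_in_s(s_old, s_new):
    # Indicator (1.6): relative reduction in the sum of squares.
    return (s_old - s_new) / s_old

def max_relative_parameter_change(theta_old, theta_new):
    # Largest left-hand side of (1.7) over the parameters j = 1, ..., p.
    t_old = np.atleast_1d(np.asarray(theta_old, dtype=float))
    t_new = np.atleast_1d(np.asarray(theta_new, dtype=float))
    return float(np.max(np.abs(t_new - t_old) / np.abs(t_old)))

def relative_change_stop(s_old, s_new, theta_old, theta_new,
                         delta_s=1e-4, delta_p=1e-4):
    # Terminate when both (1.6) and (1.7) fall below their tolerances.
    rc_s = relative_change_in_s(s_old, s_new)
    rc_p = max_relative_parameter_change(theta_old, theta_new)
    return bool(0.0 <= rc_s <= delta_s and rc_p < delta_p)

# Assumed data and an artificially small step (e.g. a heavily damped iteration).
x = np.array([1.0, 2.0])
y = np.array([2.0, 5.0])
theta_old, theta_new = 0.5, 0.50001

s_old = sum_of_squares(theta_old, x, y)
s_new = sum_of_squares(theta_new, x, y)
stop = relative_change_stop(s_old, s_new, theta_old, theta_new)

# Gradient of S at the new point: dS/dtheta = -2 * sum(r_t * x_t * exp(theta * x_t)).
r = y - model(theta_new, x)
gradient = float(-2.0 * np.sum(r * x * model(theta_new, x)))
print(stop, gradient)
```

With these assumed values both indicators fall below their tolerances while the gradient of the sum of squares is still far from zero, so the step would be accepted as final even though the point is not a minimum.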


Kennedy and Gentle (1980) mention a relative step size criterion like (1.7) as well as relative change in the sum of squares and gradient size criteria. However, they also state that "there is no known criterion that is absolutely satisfactory." Chambers (1973) quotes several other criteria, including the size of the gradient and the size of the Gauss-Newton step and the fact that the residual vector should be orthogonal to the derivative vectors, but there is no scale suggested for measuring "size."

All but the last of these are more correctly described as termination criteria (Bard 1974) since they merely indicate whether further iterations might be useful according to arbitrary tolerance levels: they are not convergence criteria since they do not necessarily indicate whether a local minimum has been attained. On the other hand, orthogonality is an absolute indicator of convergence and hence it implies that the other criteria will be satisfied. Furthermore, it is possible to develop a meaningful tolerance level for orthogonality based on statistical considerations.

2. DEVELOPING AND IMPLEMENTING THE CRITERION

2.1 Relative Offset

An important consequence of orthogonality is that the residual vector has zero projection onto the tangent plane, and so we may use the length of this projection as the indicator. An appropriate scale of measurement can be determined by considering the effect of this tangential component on inferences about the parameters.

To illustrate the orthogonality convergence criterion we use a simple example in which $f(x_t, \theta) = \exp(\theta x_t)$ with $n = 2$ observations. We suppose further that $x_1 = 1$ and $x_2 = 2$. In this case the solution locus is a parabola in the two-dimensional sample space (Fig. 1), which demonstrates the two major effects of nonlinear models, namely curving of the solution locus and nonuniformity of the spacing of the parameter values.

[Figure 1. Projecting the Residual Vector Onto the Tangent Plane]

In general, if $\theta^{(i)} = \hat{\theta}$ is the least squares estimate, then $e = y - \eta(\theta^{(i)})$ will have zero projection onto the tangent plane at $\eta(\theta^{(i)})$. If it is not, the tangential component $e_T$ will offset the confidence region. Assuming that near $\eta(\hat{\theta})$ the solution locus is relatively flat so it can be reasonably approximated by the tangent plane, a $(1 - \alpha)$ confidence region consists of those values of $\theta$ for which

$$\|\eta(\theta) - \eta(\hat{\theta})\|^2 \le p\,S(\hat{\theta})\,F(p, n - p; \alpha)/(n - p), \tag{2.1}$$

as discussed in Hamilton, Watts, and Bates (1981). That is, the approximate confidence region corresponds to a disk on the tangent plane with radius proportional to $(p\,S(\hat{\theta})/(n - p))^{1/2}$. If convergence is declared at a value $\theta^{(i)}$ close to $\hat{\theta}$, the tangential component $e_T$ will cause this disk to be slightly increased in radius and offset by an amount $\|e_T\|$. Therefore, we base our orthogonality criterion on the offset relative to the radius of the tangent plane confidence disk. To avoid dependence on the confidence level we remove the factor $F(p, n - p; \alpha)$ and define the proportional offset at $\theta^{(i)}$ as

$$P = \frac{\|e_T\|}{\left(p\,S(\theta^{(i)})/(n - p)\right)^{1/2}}. \tag{2.2}$$

The tolerance level for the criterion can now be set at, say, .001, since a confidence region will not be materially affected by the fact that the current parameter vector $\theta^{(i)}$ is less than one tenth of one percent of the confidence region radius away from the least squares point $\hat{\theta}$.

When the experimental design includes replications, the residual vector contains a component that is always orthogonal to the solution locus and hence to any tangent plane. This constant component, which contributes the pure error sum of squares $S_{rep}$, inflates the overall length of the residual vector but does not change $\|e_T\|$ and so the angle of the residual
vector to the tangent plane is inflated. To avoid this, we modify (2.2) and define the relative offset as

$$P = \frac{\|e_T\|/p^{1/2}}{\left((S(\theta^{(i)}) - S_{rep})/(n - p - \nu)\right)^{1/2}} \tag{2.3}$$

where $\nu$ is the degrees of freedom for replications. This definition includes the case for no replications, since then both $S_{rep}$ and $\nu$ will be zero and (2.3) reduces to (2.2).

2.2 Implementation

To implement the orthogonality criterion, one must calculate $\|e_T\|$ and $S(\theta^{(i)})$ at each iteration. (The replication information, $\nu$ and $S_{rep}$, can be determined by preprocessing the data manually or by using a computer method.) In the standard Gauss-Newton algorithm the derivative matrix

$$V = \left.\frac{\partial \eta}{\partial \theta}\right|_{\theta^{(i)}} \tag{2.4}$$

is evaluated at each iteration, and the parameter vector is changed to

$$\theta^{(i+1)} = \theta^{(i)} + k\,\delta^{(i+1)} \tag{2.5}$$

where

$$\delta^{(i+1)} = (V'V)^{-1}V'e \tag{2.6}$$

and $k$ is the increment factor adjusted to ensure that $S(\theta^{(i+1)}) < S(\theta^{(i)})$ (Box 1958, Hartley 1961). To test whether a point $\theta^{(i)}$ is close enough to $\hat{\theta}$ to declare convergence, one could evaluate (2.3) as

$$P = \left(\frac{(n - p - \nu)\,e'V(V'V)^{-1}V'e}{p\,(S(\theta^{(i)}) - S_{rep})}\right)^{1/2} \tag{2.7}$$

and compare this to the tolerance level. However, a more efficient and computationally stable procedure is to calculate an orthogonal-triangular decomposition of $V$ as

$$V = QR \tag{2.8}$$

where $Q$ is an $n$ by $p$ matrix with unit orthogonal columns and $R$ is a $p$ by $p$ upper triangular matrix (Chambers 1977). This can be done with a Gram-Schmidt, Householder, or Givens method to produce the increment

$$\delta^{(i+1)} = R^{-1}Q'e. \tag{2.9}$$

Because $Q$ has orthogonal columns that form a basis for $V$,

$$\|e_T\| = \|Q'e\|, \tag{2.10}$$

and so $P$ is easily obtained in an intermediate step. If $P$ exceeds the tolerance level, the iteration is completed and the next parameter vector examined. The criterion can be easily incorporated into programs based on the Levenberg-Marquardt procedure (Marquardt 1963) by using the implementation described in Golub and Pereyra (1973) in which the first step is a QR decomposition of $V$. Other iterative methods that involve at least calculating the gradient of $S(\theta)$, such as the gradient methods described in Kennedy and Gentle (1980), would require the calculation of $V$ so the information necessary to calculate $P$ would be available.

3. DISCUSSION

It is useful to compare the termination criteria recommended in the literature (e.g., Bard 1974, Chambers 1973, Draper and Smith 1966, and Meyer and Roth 1972) and used in programs (e.g., BMD P3R and BMD PAR from Dixon and Brown 1977 or TSP from Hall and Hall 1976) with the orthogonality criterion proposed here. The indicator, tolerance level, and implementation of the criteria provide grounds for comparison.

3.1 The Indicator

A key feature of the offset criterion is that it provides direct information about orthogonality of a residual vector to the solution locus, and hence relative offset is an unambiguous indicator of convergence. On the other hand, relative change criteria only indicate progress by the algorithm. Unfortunately, lack of progress does not imply convergence and so premature termination can occur. This is even more likely in the presence of severe nonlinearity, since then the increment factor $k$ may be forced to a small value and so relative changes in the sum of squares and the parameters may drop below the tolerance level, resulting in termination.

As discussed by Himmelblau (1972), a small relative change in the sum of squares does not imply convergence, but only that the sum of squares surface is quite flat. Similarly, a small relative change in the parameters does not imply convergence, but only that a small increment can be tolerated because of either intrinsic or parameter-effects nonlinearities, as discussed by Bates and Watts (1980). Orthogonality, on the other hand, directly measures convergence, and implies zero relative changes.

Finally, because relative offset is an absolute measure of convergence, it can be used to test whether any point is a point of convergence, regardless of the method of arriving at that point, and independently of the severity of the nonlinearity of the problem.

3.2 The Tolerance Level

Because relative change criteria are only indirect indicators of convergence, the choice of appropriate tolerance levels is difficult. The levels are often justified by such considerations as "There is no need to continue iterating if the relative change in the sum of
squares is less than one part in ten thousand." Such arbitrary choices of tolerance level can result in premature termination or unnecessary computation in the pursuit of unwarranted precision. Furthermore, in the case of the sum of squares, a large replication sum of squares will cause any relative change to appear small, thereby increasing the probability of premature termination.

In the case of the parameter relative change criterion, its form suggests that it is based on numerical considerations, since computation halts when the relative accuracy of each parameter estimate appears adequate. Statistical considerations suggest, however, that a more appropriate scale would be based on the inherent variability of the estimates. In our criterion, since convergence is measured by offset relative to the residual variation, the tolerance level is easily assigned a sensible value on a meaningful scale, thus avoiding both premature termination and unproductive computation.

3.3 Implementation

Various forms of the relative change criteria have been implemented. For example, the BMDP series programs use only a sum of squares termination criterion with a tolerance level of $10^{-5}$ subject to the requirement that the criterion be satisfied on five successive iterations. We suspect that while this would ensure convergence for most problems, it obviously involves arbitrary, and probably conservative, choices of the tolerance level and number of repetitions. (Even so, there is no guarantee of convergence since the criterion may be fooled by parameter-effects nonlinearity.)

The relative parameter increment criterion appears in different forms (Bard 1974, Draper and Smith 1966, Marquardt 1963, and Kennedy and Gentle 1980). One modification to avoid small parameter values inflates the denominator of (1.7) by a small positive quantity r so as to avoid division by zero. This version is clearly sensitive to scaling or transformation of the parameters and could result in different termination with different scaling or parameterization, especially since the choice of r is quite arbitrary.

Although the two relative-change criteria have been widely used and have probably been quite successful, they can be easily replaced by one simple relative-offset convergence criterion that does not require ad hoc modifications to avoid pathological complications and that provides a direct unequivocal indication of convergence.

There is, however, one type of problem for which a relative offset criterion or any other criterion based on orthogonality is inappropriate: the least squares problem using simulated data for which the residual sum of squares can be reduced to zero. In these problems the residual vector will be entirely the result of round-off error at the minimum, and it is possible that the desired degree of orthogonality could not be achieved because the round-off error would effectively produce a random orientation of the residual vector. Failure in these cases is not the fault of the criterion, but results from unrealistic "zero residual" problems that are often used to compare the behavior of different convergence algorithms. It is extremely unlikely to encounter such data in practice.

There can also be circumstances in which orthogonality cannot be achieved because the solution locus is finite in extent and the data vector lies "off the edge." These cases are often discovered when one parameter is forced out to infinity or negative infinity or the derivative matrix becomes singular. An example of this is example 5 from Meyer and Roth (1972), where the minimum actually occurs with $\theta_1$ at infinity. In such circumstances the data analyst should realize that the model does not fit well to the data and that alternate models should be employed. The relative offset criterion provides such information, but a relative change criterion based on lack of progress would not.

3.4 Examples

We were motivated to develop the orthogonality convergence criterion by the discovery that for several examples in the literature the residual vector was not at all close to being orthogonal at the reported parameter estimates. This usually occurred when the data were simulated with small residual variance and the estimates were rounded in reporting.

Example 8 of Meyer and Roth (1972) has a relative offset of 208 percent at the reported parameter values $\theta = (.0056, 6181.4, 345.2)'$, which corresponds to an angle of 12°. In addition, the reported parameter values produce a sum of squares of 2021.9, instead of the 88.0 quoted in the paper. Even when the estimates are recalculated and rounded to six significant digits to yield the parameter vector $\theta = (.00560964, 6181.35, 345.224)'$, there was still an offset of 6.9 percent although the sum of squares did in fact go down to 88.04. When the estimates were reported to a lower accuracy by rounding them to five significant digits, the offset increased to 27 percent and the sum of squares increased to 89.45. For this example there is a very high correlation between the parameter estimates, and the effect of rounding these estimates is to produce values that would not even fall in the proper 95 percent joint confidence region. Note that the relative offset can be measured at the rounded parameter values without any information on the way that those values were obtained, which is not the case for relative-change criteria.

A similar situation occurs with the example in Box
and Hunter (1963) which uses all 13 data points.                   of Canada, the University of Alberta General
There is a high replications sum of squares(the F for              ResearchFund, and the Queen’sUniversity Advisory
lack of fit is .056) that inflates the length of the resi-         ResearchCouncil.
dual vector, but not the projection, so the relative
                                                                       [Received June 1979. Revised December 1980.1
offset using (2.3) at the reported parameter values of
8 = (3.57, 12.77, 0.63) is 8.2 percent. Further itera-
tions indicate that five significant digits are required
to produce an offset less than .l percent for the par-
ameter vector 8 = (3.5691, 12.800,.62950)‘.We note
                                                                                            REFERENCES
that the second parameter does not round down to
the reported value if four significant digits are used.            BARD, Y. (1974) “Nonlinear Parameter Estimation,” New York:
                                                                     Academic Press.
                    4. SUMMARY                                     BATES, D. M., and WATTS, D. G. (1980), “Relative Curvature
                                                                      Measures of Nonlinearity” (with discussion), J. Roy. Statist. Sot.,
  A relative offset convergencecriterion for nonlinear               Ser. B, 42, l-25.
least squareshas been proposed, basedon the size of                BOX, G. E. P. (1958), “Use of Statistical Methods in the Elucidation
the projection of the residual vector in the tangent                 of Physical Mechanisms,” Bull. International Statistical Institute,
plane relative to the radius of the confidence region                36, 215-225.
                                                                   BOX, G. E. P., and HUNTER, W. G. (1963) “Sequential Design of
disk on that tangent plane. Becausethe criterion is                  Experiments for Nonlinear Models,” in Proc. IBM Scientijc
based on orthogonality, it embodies the following                    Computing Symposium Statistics, White Plains, N.Y.: IBM.
advantages:                                                        BOX, G. E. P., and LUCAS, H. L. (1959). “Design of Experiments
                                                                     in Nonlinear Situations,” Biometrika, 46, 77?-X1.
   I    It provides an absolute measure of conver-                 CHAMBERS, J. R. (1973) “Fitting Nonlinear Models: Numerical
          gence. On the other hand, relative change                  Techniques,” Biometrika, 60, 1-13.
          values are merely termination criteria, and              -         (1977) Computational Methods for Data Analysis, New
          compliance of relative changevalues to toler-              York: John Wiley.
                                                                   DIXON, W. J., and BROWN, M. B. (eds.) (1977), BMDP-77
          ance levels does not imply convergence.                    Biomedical Computer Programs, P-Series, Los Angeles: Univer-
   II     It is independent of any scaling of the data, of           sity of California Press.
          linear or nonlinear transformations of the par-          DRAPER, N. R., and SMITH, H. (1966) Applied Regression
          ameters, and of the method used to arrive at                Analysis, New York: John Wiley.
          the test point.                                          GOLUB, G. H., and PEREYRA, V. (1973) “The Differentiation of
                                                                     Pseudo-Inverses and Nonlinear Least Squares Problems Whose
  III     It is applicable to Gauss-Newton and gradient              Variables Separate,” J. SIAM, 10,413-432.
          methods.                                                 HALL, ROBERT, E., and HALL, BRONWYN, H. (1976) Time
  IV      It is independent of conditioning of the prob-             Series Processor, Cambridge: Harvard Institute of Economic
          lem and of parameter-effectsnonlinearities.                 Research, Harvard University.
   V      It provides a meaningful scaleof measurement             HAMILTON,        D. C., WATTS, D. G., and BATES, D. M. (1981)
                                                                     “Accounting for Intrinsic Nonlinearity in Nonlinear Regression
          for orthogonality based on statistical con-                Parameter Inference Regions,” unpublished paper.
          siderations. This permits one to specify appro-          HARTLEY, H. 0. (1961) “The Modified Gauss-Newton Method
          priate tolerance levels to avoid both                      for Fitting of Nonlinear Regression Functions by Least Squares,”
          premature termination and unproductive                     Technometrics, 3, 269.
          computation. On the other hand, relative change criteria
          involve arbitrary decisions concerning tolerance levels.
  VI      It provides a meaningful link between statistical measures
          of precision (i.e., the confidence region radius) and
          numerical measures of precision (i.e., the number of
          significant digits) and hence dictates the necessary and
          sufficient number of digits to be used in reporting each
          parameter value.

              5. ACKNOWLEDGMENTS

 This research was supported by grants from the
Natural Sciences and Engineering Research Council

HIMMELBLAU, D. M. (1972) “A Uniform Evaluation of Unconstrained
  Optimization Techniques,” in Numerical Methods for Nonlinear
  Optimization, ed. F. A. Lootsma, London: Academic Press.
JENNRICH, R. I., and SAMPSON, P. F. (1968) “Application of
  Stepwise Regression to Nonlinear Estimation,” Technometrics,
  10, 63-72.
KENNEDY, W. J., Jr., and GENTLE, J. E. (1980) Statistical
  Computing, New York: Marcel Dekker.
MARQUARDT, D. W. (1963) “An Algorithm for Least Squares
  Estimation of Nonlinear Parameters,” J. SIAM, 11, 431-441.
MEYER, R. R., and ROTH, P. M. (1972) “Modified Damped
  Least Squares: An Algorithm for Nonlinear Estimation,” J.
  Inst. Math. Applics., 9, 218-253.
RALSTON, M. L., and JENNRICH, R. I. (1978) “Dud, a
  Derivative-Free Algorithm for Nonlinear Least Squares,”
  Technometrics, 20, 7-14.
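
As an illustration only, the relative offset of (2.3) might be computed along the following lines, with the tangential component obtained as ||Q'e|| in the manner of (2.10). The sketch uses a modified Gram-Schmidt orthogonalization, one of the decompositions mentioned in Section 2.2. The function name relative_offset, its argument names, and the data in the demonstration are hypothetical and are not taken from any of the programs discussed.

```python
import math


def relative_offset(residual, jacobian, s_rep=0.0, nu=0):
    """Relative offset in the sense of (2.3), as a hedged sketch.

    residual : list of n floats, e = y - eta(theta) at the test point
    jacobian : list of n rows of p floats, the derivative matrix V
    s_rep    : replication (pure error) sum of squares, 0 if none
    nu       : degrees of freedom for replications, 0 if none
    """
    n = len(residual)
    p = len(jacobian[0])

    # Modified Gram-Schmidt: orthonormal columns q_j spanning the columns of V.
    q = []
    for j in range(p):
        w = [row[j] for row in jacobian]
        for u in q:
            dot = sum(a * b for a, b in zip(u, w))
            w = [a - dot * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            raise ValueError("derivative matrix is rank deficient")
        q.append([a / norm for a in w])

    # Length of the projection of e onto the tangent plane: ||e_T|| = ||Q'e||.
    e_t = math.sqrt(sum(sum(a * b for a, b in zip(u, residual)) ** 2 for u in q))

    # Scale: residual variation with the replication component removed.
    s = sum(r * r for r in residual)
    if s - s_rep <= 0.0 or n - p - nu <= 0:
        raise ValueError("criterion is not usable for zero-residual problems")
    scale = math.sqrt((s - s_rep) / (n - p - nu))

    return (e_t / math.sqrt(p)) / scale


if __name__ == "__main__":
    # Hypothetical data for the model f(x, theta) = exp(theta * x), x = (1, 2).
    x = [1.0, 2.0]
    y = [2.3, 4.9]          # made-up observations, for illustration only
    theta = 0.8             # made-up test point
    eta = [math.exp(theta * xt) for xt in x]
    e = [yt - et for yt, et in zip(y, eta)]
    v = [[xt * et] for xt, et in zip(x, eta)]   # d eta / d theta
    offset = relative_offset(e, v)
    print("relative offset:", offset, "converged:", offset < 0.001)
```

Because the function needs only the residual vector and the derivative matrix at the test point, it can be applied to any parameter vector, including rounded published estimates, without knowledge of how that vector was reached.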




