ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE
School of Architecture, Civil and Environmental Engineering
Semester Project in Civil Engineering
Enhancing the Serial Estimation of Discrete
Choice Model Sequences
by
Youssef Kitane
Under the direction of Prof. Michel Bierlaire
Under the supervision of Nicola Ortelli and Gael Lederrey
TRANSP-OR: Transport and Mobility Laboratory
Lausanne, June 2020
1 Introduction
Discrete Choice Models (DCMs) have played an essential role in transportation modeling
for the last 25 years [1]. Discrete choice modeling is a field designed to capture in detail the
underlying behavioral mechanisms at the foundation of the decision-making process that
drives consumers [2]. Because they must be behaviorally realistic while properly fitting
the data, appropriate utility specifications for discrete choice models are hard to develop.
In particular, modelers usually start by including a number of variables that are seen as
"essential" in the specification; these originate from their context knowledge or intuition.
Then, small changes are tested sequentially so as to improve the goodness of fit of the model
while ensuring its behavioral realism. As a result, many model specifications are usually
tested before the modeler is satisfied. This approach leads to extensive computational
time, because each model is estimated separately. A faster optimization procedure would
allow researchers to test many more specifications in the same amount of time.
In this project, the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm
is used to estimate the parameters of each DCM. Three techniques are implemented to
accelerate the process of estimating a sequence of DCMs:
• Standardization (ST) of the variables: The goal is to change the values of numeric
columns in the dataset to a common scale, without distorting differences in the ranges
of values.
• Warm Start (WS): This technique uses the knowledge acquired by the precedent model
to initialize the values of the parameters for the estimation of the next model.
• Early Stopping (ES): This consists in stopping the estimation of a model earlier than
expected, based on how promising the improvement in log likelihood looks in the last
iterations of the optimization algorithm.
The next section is dedicated to a literature review of existing methods that speed up
an optimization process. Then, in Section 3, the three techniques are presented in detail
for a sequence of DCMs. Section 4 presents the data considered in this project, as well as the
sequences of models used to measure the effectiveness of the three techniques. Section
5 gathers the results obtained by the implemented methods. The last section summarizes
the findings of this project and highlights possible improvements and directions for future
research.
2 Literature Review
In large-scale convex optimization, first-order methods are the methods of choice due to their
cheap iteration cost [3]. While second-order methods, such as Newton's method, make
use of curvature information, the cost of computing the Hessian can become prohibitive.
Quasi-Newton methods are thus a good compromise between curvature information
and low computation time: they use an approximation of the Hessian instead of
its exact computation. The BFGS algorithm, named after its inventors Broyden, Fletcher,
Goldfarb and Shanno [4], is one of the most well-known quasi-Newton methods. A new
method for solving linear systems has been proposed in [5]. The algorithm is specialized to
invert positive definite matrices in such a way that all iterates (approximate solutions)
generated by the algorithm are positive definite matrices themselves, which opens the way
for many applications in the field of optimization. Under a careful choice of the parameters
of the method, and depending on the problem structure and conditioning, acceleration
might result in significant speedups both for the matrix inversion problem and for the
stochastic BFGS algorithm. It is confirmed experimentally that these accelerated methods
can lead to speedups compared to the classical BFGS algorithm, but no convergence
analysis is yet provided.
The increase in the size of choice modeling datasets in recent years has led to a growing
research interest in accelerating the estimation of DCMs. Researchers have used techniques
inspired by Machine Learning (ML) to speed up the estimation of a single DCM [6].
This is achieved by proposing new efficient stochastic optimization algorithms and extensively
testing them alongside existing approaches. These algorithms are developed based on three
main contributions: the use of a stochastic Hessian, the modification of the batch size, and
a change of optimization algorithm depending on the batch size. This paper shows that
the use of a second-order method and a small batch size is a good starting point for DCM
optimization. It also shows that BFGS is an algorithm that works particularly well once
said starting point has been found.
The problem of initializing the parameters of a model is central in ML. One particularly
common scenario is where an ML algorithm must be constantly updated with new data. This
situation occurs frequently in finance, online advertising, recommendation systems, fraud
detection, and many other domains where machine learning systems are used for prediction
and decision making in the real world [7]. When new data arrive, the model needs to be
updated so that it can be as accurate as possible. While the majority of existing methods
start the configuration process of an algorithm from scratch by randomly initializing the
parameters, it is possible to exploit previously learned information in order to "warm start"
the configuration on new tasks.
In most common optimization algorithms, and more precisely in ML, the modeler may decide
to stop the optimization procedure before reaching the required tolerance on the solution [8].
Stopping an optimization process early is a trick used to control the generalization
performance of the model during the training phase and avoid over-fitting at test time.
In discrete choice modeling, however, the main objective is not to achieve the highest
accuracy but to obtain parameters that are behaviorally realistic.
3 Methodology
This section briefly introduces the principles underlying the BFGS algorithm before
presenting the techniques used to speed up the estimation of a sequence of DCMs.
As a reminder, the iterates {x_j} of a line search optimization method following a descent
direction d_j and a step size \alpha_j are defined as follows:

x_{j+1} = x_j + \alpha_j d_j    (1)

where the direction of descent is obtained by preconditioning the gradient and is defined
as:

d_j = -D_j \nabla f(x_j)    (2)

assuming that the matrix D_j at the iterate x_j is positive semi-definite.
For quasi-Newton methods, D_j is an approximation of the Hessian. A slightly different
version of BFGS consists in approximating the inverse of the Hessian. The BFGS^{-1}
algorithm uses the following approximation [9]:

D_{j+1}^{-1} = D_j^{-1} + \frac{(s_j^T y_j + y_j^T D_j^{-1} y_j)(s_j s_j^T)}{(s_j^T y_j)^2} - \frac{D_j^{-1} y_j s_j^T + s_j y_j^T D_j^{-1}}{s_j^T y_j}    (3)

where s_j = x_{j+1} - x_j and y_j = \nabla f(x_{j+1}) - \nabla f(x_j).
The step is calculated with an inexact line search method, based on the two Wolfe con-
ditions. The first condition [11], also known as the Armijo rule, guarantees that the step
gives a sufficient decrease in the objective function. The second condition [12], known as
the curvature condition, prevents the step length from being too short.
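As an illustration, the update (3) can be written in a few lines of NumPy. This is only a sketch (not the code used in this project), and it assumes that the curvature condition holds, i.e. s_j^T y_j > 0, so that the division is safe:

```python
import numpy as np

def bfgs_inverse_update(D_inv, s, y):
    """One BFGS update of the inverse-Hessian approximation, following Eq. (3).

    D_inv : current approximation D_j^{-1} (symmetric positive definite)
    s     : step x_{j+1} - x_j
    y     : gradient difference grad f(x_{j+1}) - grad f(x_j)
    Assumes the curvature condition s^T y > 0 (guaranteed by the Wolfe conditions).
    """
    sy = s @ y  # s^T y
    term1 = ((sy + y @ D_inv @ y) * np.outer(s, s)) / sy**2
    term2 = (D_inv @ np.outer(y, s) + np.outer(s, y) @ D_inv) / sy
    return D_inv + term1 - term2
```

One can check that the updated matrix satisfies the secant equation D_{j+1}^{-1} y_j = s_j, as expected for a quasi-Newton update.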
3.1 Standardization
The concept of standardization is relevant when continuous independent variables are
measured at different scales. Indeed, standardization is a technique often applied as part of
data preparation for ML. The goal is to change the values of numeric columns in the dataset
to a common scale, without distorting differences in the ranges of values. More formally,
let us suppose that a variable x takes values from the set S = {x_1, x_2, \ldots, x_n}. The
standardization of one value x_i in S is applied as follows:

\hat{x}_i = \frac{x_i - \bar{x}}{\sigma}    (4)

where \bar{x} is the mean of the values in S and \sigma is the corresponding standard
deviation. It consists in re-scaling the variable so as to obtain a mean equal to 0 and a
standard deviation of 1.
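As a minimal illustration of Eq. (4) (a sketch, not the project's implementation):

```python
import numpy as np

def standardize(x):
    """Rescale a variable to zero mean and unit standard deviation, as in Eq. (4)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```

Applied, for instance, to a travel-time column expressed in minutes and a cost column expressed in francs, both end up on the same scale before estimation.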
3.2 Warm Start
A method commonly used in the field of ML consists in initializing a set of parameters with
non-arbitrary values. In our case, we initialize the parameters of a model with the values
obtained by the BFGS^{-1} algorithm on the previous model.
Formally, we define the set of parameters of model m as x_m \in R^{N_m}, where N_m
corresponds to the number of parameters of model m. The set of parameters of the following
model is defined similarly, i.e. x_{m+1} \in R^{N_{m+1}}, where N_{m+1} corresponds to the
number of parameters of that model. To generate the initial variables of model m + 1, i.e.
x_{m+1}^0, we use the optimized variables of the previous model, i.e. x_m^*. In the case
where the Box-Cox parameter of a variable is incremented or decremented from model m to
m + 1, we initialize the corresponding parameter at 0 instead of using the previously
optimized value. We thus define the initialization of x_{m+1}^0 for each index
i \in \{1, \ldots, N_{m+1}\} such that:

x_{m+1,i}^0 = \begin{cases} 0 & \text{if } i \notin \{1, \ldots, N_m\} \text{ or the Box-Cox parameter was modified,} \\ x_{m,i}^* & \text{otherwise.} \end{cases}    (5)
The same procedure is used for the initialization of the Hessian approximation between
model m and the following model m + 1. Instead of initializing it with the identity matrix,
we define the initialization of H_{m+1}^0 for each combination of indices
i, k \in \{1, \ldots, N_{m+1}\} such that:

H_{m+1,(i,k)}^0 = \begin{cases} H_{m,(i,k)}^* & \text{if } i, k \in \{1, \ldots, N_m\}, \\ 1 & \text{if } i = k, \\ 0 & \text{otherwise.} \end{cases}    (6)
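Equations (5) and (6) can be sketched as follows. This is an illustration, not the project's code: the function name, the assumption that the N_m reused parameters occupy the first indices, and the `boxcox_changed` argument are ours.

```python
import numpy as np

def warm_start(x_prev, H_prev, n_new, boxcox_changed=()):
    """Warm-start initialization of model m+1 from model m, following Eqs. (5)-(6).

    x_prev         : optimized parameters x_m^* of the previous model (length N_m)
    H_prev         : its final inverse-Hessian approximation (N_m x N_m)
    n_new          : number of parameters N_{m+1} of the new model (>= N_m assumed)
    boxcox_changed : indices whose Box-Cox parameter was incremented/decremented
    """
    n_old = len(x_prev)
    x0 = np.zeros(n_new)
    x0[:n_old] = x_prev                 # Eq. (5): reuse the previous optimum
    x0[list(boxcox_changed)] = 0.0      # ...except modified Box-Cox parameters
    H0 = np.eye(n_new)                  # Eq. (6): identity for new parameters
    H0[:n_old, :n_old] = H_prev        # reuse previous curvature information
    return x0, H0
```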
3.3 Early Stopping
The early stopping method consists in stopping the estimation process before convergence
to the maximum is achieved. Because the objective is to select the best model among a
sequence of DCMs, the log likelihood LL(x_i) obtained at an iteration is compared
to the highest log likelihood LL_best over all previous models. In the case where LL(x_i)
is higher than LL_best, the estimation process is not stopped, because the
best value can only be further improved by the BFGS^{-1} algorithm until convergence;
LL_best is then updated with the newly estimated value. When LL(x_i) is lower than
LL_best, the estimation process may be stopped based on a criterion that estimates the
relative evolution of the function in order to detect a plateau, i.e. a region where the
function no longer experiences significant improvement. Three evaluations of the function
are considered in order to be confident of the convergence of the log likelihood.
Let us consider the last three evaluations of the log likelihood, LL(x_i), LL(x_{i-1})
and LL(x_{i-2}), during the estimation of one model. It is possible to assess stagnation by
evaluating the two following ratios and comparing them to a predefined threshold \varepsilon:

\left| \frac{LL(x_{i-1})}{LL(x_i)} - 1 \right| < \varepsilon    (7)

\left| \frac{LL(x_{i-2})}{LL(x_{i-1})} - 1 \right| < \varepsilon    (8)
Even though the goal of early stopping is to reduce the estimation time of DCMs,
it is important to ensure that no substantial difference arises between the solution obtained
by applying early stopping to the BFGS^{-1} algorithm and that of the standard BFGS^{-1}.
For example, Figure 1 shows the value of the log likelihood during the estimation
of a random model. As we can see in this example, there is a stagnation in the
middle of the estimation. We do not want early stopping to trigger at this moment, since the
estimation is far from finished. We thus have to be careful with the threshold and
perform a sensitivity analysis on this parameter.
Figure 1: Difference between a possible stagnation of the log likelihood and the real conver-
gence
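The stopping rule above can be sketched as follows. This is a minimal illustration, not the project's implementation; in particular, the function name is ours, and we read the stagnation test of Eqs. (7)-(8) as a relative-change criterion on the last three log likelihood evaluations:

```python
def should_stop_early(ll_history, ll_best, eps=2e-5):
    """Early-stopping test on a sequence of log likelihood evaluations.

    ll_history : log likelihoods LL(x_0), ..., LL(x_i) of the current model
    ll_best    : highest final log likelihood among all previous models
    eps        : plateau-detection threshold (Eqs. (7)-(8))
    """
    if len(ll_history) < 3:          # need three evaluations to assess stagnation
        return False
    ll2, ll1, ll0 = ll_history[-3:]  # LL(x_{i-2}), LL(x_{i-1}), LL(x_i)
    if ll0 > ll_best:                # still promising: let BFGS run to convergence
        return False
    # stop only if the last two relative changes both fall below eps
    return abs(ll1 / ll0 - 1) < eps and abs(ll2 / ll1 - 1) < eps
```

The guard on `ll_best` reflects the rule stated above: a model that already beats the best log likelihood seen so far is never stopped early.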
4 Case Study
4.1 Dataset
The Swissmetro dataset [10] corresponds to survey data collected in March 1998. It was
used to study the market penetration of the Swissmetro, a revolutionary maglev
underground system. Three alternatives (train, car and Swissmetro) were generated for
each of the 1192 respondents. A sample of 10'728 observations was obtained by generating 9
types of situations. The pre-selected attributes of the alternatives are categorical (travel card
ownership, gender, type of luggage, etc.) and continuous (travel time, cost and headway).
4.2 Sequence of Discrete Choice Models
For the purpose of this project, two sequences of a hundred DCMs, respectively denoted by
S1 and S2, are considered. Each sequence starts with a given choice model; then, random
perturbations are applied. These small modifications correspond to the typical elementary
perturbations that modelers consider when developing DCMs:
• Add a non-selected variable to the utility of an alternative
• Remove a variable from the utility of an alternative
• Increment the Box-Cox parameter of a given variable
• Decrement the Box-Cox parameter of a given variable
• Interact a variable with a socioeconomic variable
• Deactivate the interaction of the considered variable with a socioeconomic variable
The first sequence S1 begins with an alternative-specific-constant model and its complexity
increases along the sequence, while S2 starts with a random model and its complexity
remains approximately constant along the hundred models. The number of parameters for
each sequence of DCMs is shown in Figure 2:
Figure 2: Number of parameters for the two sequences S1 and S2.
5 Results
In order to avoid misunderstandings, abbreviations are given to the different methods. The
base method estimates the parameters without applying any of the previously mentioned
techniques and is denoted by Base. The standardization of the variables is denoted by
ST. The warm start of the parameters is denoted by WSx, the warm start of the Hessian
by WSH, and the combination of the two by WSxH. The early stopping method is denoted
by ES.
5.1 Standardization
A benchmark of ten estimations for the methods Base and ST is conducted for the sequences
S1 and S2. Tables 1 and 2 present a summary of the statistics for these methods.
The lowest, mean and highest times among the ten estimations are reported. The average
speedup corresponds to the ratio between the mean time of the Base method and the mean
time of the ST method.
Table 1: Summary of statistics for 10 estimations by method for the sequence S1
Statistics Base ST
Minimum [s] 224.9 198.1
Mean [s] 229.5 201.7
Maximum [s] 231.3 203.2
Average Speedup / 1.15
Table 2: Summary of statistics for 10 estimations by method for the sequence S2
Statistics Base ST
Minimum [s] 534.5 478.3
Mean [s] 536.2 480.9
Maximum [s] 539.4 485.9
Average Speedup / 1.12
It appears that the standardization of the variables is useful. For the first sequence S1, the
mean estimation time is reduced from 229.5 s to 201.7 s. Concerning the sequence S2, a
reduction of 10% of the estimation time is obtained. Since the initial Hessian approximation
of each model corresponds to the identity matrix, the initial direction of descent of the
BFGS^{-1} algorithm corresponds to that of gradient descent. Because gradient descent
does not take the curvature into account, an error surface with high curvature means that
many steps are taken which may not be in the optimal direction. When the variables are
scaled, the curvature is reduced, which makes methods that ignore curvature work much
better and reach convergence faster. The standardization of the variables yields interesting
results and should be applied beforehand for every sequence of DCMs whose variables
present differences in the range of values.
5.2 Warm Start
A benchmark of ten estimations for the methods Base, WSx, WSH and WSxH is conducted
for the sequences S1 and S2. Tables 3 and 4 present a summary of the statistics for these
methods.
Table 3: Summary of statistics for 10 estimations by method for the sequence S1
Statistics Base WSx WSH WSxH
Minimum [s] 224.9 187.4 104.9 60.8
Mean [s] 229.5 188.7 105.4 61.0
Maximum [s] 231.3 189.5 106.1 61.5
Average Speedup / 1.21 2.20 3.84
Table 4: Summary of statistics for 10 estimations by method for the sequence S2
Statistics Base WSx WSH WSxH
Minimum [s] 534.6 459.3 210.4 119.2
Mean [s] 536.2 460.3 211.3 119.7
Maximum [s] 539.4 461.4 212.5 120.4
Average Speedup / 1.17 2.56 4.50
The results show that the WSx and WSH methods achieve a gain of time compared to the
Base method. The WSx method reduces the estimation time of the Base method by 19%
and 15% for S1 and S2, respectively. The WSH method is also efficient, as it accelerates
the estimation by factors of 2.20 and 2.56 for S1 and S2, respectively. Indeed, the BFGS^{-1}
algorithm is a line search optimization method whose iterates use the Hessian information
as well as the first-order information given by the gradient. Since the parameters involved
do not differ significantly from one model to the next, and the majority of the parameters
are initialized with the estimated values of the previous model, the gradient and
approximated Hessian learned on the previous model lead to a near-optimal initial value.
The next iterate is then very close to the previous one, and subsequent update directions
are expected to differ only slightly. This explains the improvement in time of the WSx and
WSH methods compared to the Base method. The WSH method is more efficient than
WSx: the former uses the curvature information, while the latter initializes the Hessian
approximation with the identity matrix. Even though the iterates are close to the optimal
value, WSx then behaves like a gradient descent method, which is known to have a low
convergence speed compared to second-order methods.
Because the WSx and WSH methods both improve the estimation time, it is reasonable
that their combination, the WSxH method, leads to an even better average speedup ratio.
The WSxH method speeds up the estimation by a factor of 3.84 for S1 and 4.50 for S2.
This is not only the sum of the two speedup ratios of WSx and WSH, but an even better
improvement: by initially having access to the gradient information, the BFGS^{-1}
algorithm is faster in the first iterations, and it also converges faster by exploiting the
curvature information given by the Hessian.
5.3 Early Stopping
A sensitivity analysis is conducted for the ES method. A sequence of 20 thresholds ranging
from 10^{-7} to 5·10^{-4} is used in order to compare the performance of the ES method
to the Base method. Figures 3 and 4 present the relative estimation time of the ES method
for the 20 thresholds compared to the Base method, for S1 and S2 respectively. The black
lines correspond to the mean time observed for the Base method over 10 estimations. The
grey lines represent a 95% confidence interval around the mean estimation time of the
Base method. A box plot with a 95% confidence interval is plotted for every threshold.
As the value of the threshold parameter increases, the estimation time of the ES
method decreases. This is an expected behavior, since higher thresholds are more permissive
and the estimation process can stop at iterations where the log likelihood is still far
from the convergence region. For sequence S1, a threshold of 10^{-7} leads to a speedup
of 3%, while for S2 a speedup of approximately 4% is observed. It is possible to obtain a
better speedup by increasing the value of the threshold: a reduction of 35% and
15% of the optimization time is obtained for S1 and S2, respectively, when the higher
threshold of 5·10^{-4} is used.
Figure 3: Sensitivity analysis of the threshold parameter for S1
Figure 4: Sensitivity analysis of the threshold parameter for S2
In order to select the best threshold, the improvement of the estimation time is not
the only criterion that should be taken into account. Even though the number of models
stopped earlier increases with the threshold and the total optimization time decreases, the
main drawback is that the method could stop at a plateau that is far away from the real
convergence plateau of the log likelihood. Such models are falsely stopped early and should
be distinguished from the models that have reached the real convergence plateau. Figures
5 and 6 show that, from a certain threshold on, some models are falsely stopped. Indeed,
a threshold of 10^{-4} leads to 6 models, among the 76 models stopped early, that do not
reach the real convergence of the log likelihood for the sequence S1. Concerning sequence
S2, a higher threshold of 3·10^{-4} falsely stops 3 models among the 90 models stopped
early. Even though the main objective is to speed up the optimization of a sequence of
models, and higher thresholds lead to lower optimization times, the modeler has to be
careful with models that are falsely stopped early. We now want to select a threshold
parameter that provides both a good speedup ratio and a low number of falsely stopped
models. A threshold of 2·10^{-5} is acceptable in the sense that no model is falsely
stopped early for S1 and the speedup performance is equivalent to that of higher thresholds.
Concerning S2, higher thresholds produce no falsely stopped models but do not offer
a substantial improvement in terms of speedup ratio compared to the threshold 2·10^{-5}.
We therefore propose to use the threshold 2·10^{-5}.
Figure 5: Number of models falsely stopped earlier for S1
Figure 6: Number of models falsely stopped earlier for S2
5.4 All Methods Combined
We now want to combine all the methods that lead to an improvement of the estimation
time compared to the Base method. For both S1 and S2, the ST, WSxH and ES (with a
threshold of 2·10^{-5}) methods speed up the estimation process. The results obtained by
combining all these methods for S1 and S2 are compared to the Base method in Tables
5 and 6. The combination of the ST, WSxH and ES (threshold 2·10^{-5}) methods leads
to improvements by factors of 5.26 and 6.67 compared to the Base method for S1 and S2,
respectively. The WSxH method is the best of the three main techniques in terms of
speedup ratio, but its association with the ST and ES methods leads to an even better
improvement compared to the Base method.
Table 5: Summary of statistics for 10 estimations for the sequence S1: comparison between
the combined methods and the Base method
Statistics Base Final
Minimum [s] 224.9 44.3
Mean [s] 229.5 44.3
Maximum [s] 231.3 44.8
Average Speedup / 5.26
Table 6: Summary of statistics for 10 estimations for the sequence S2 : Comparison between
the combined methods and the Base method
Statistics Base Final
Minimum [s] 534.6 83.8
Mean [s] 536.2 84.2
Maximum [s] 539.4 84.6
Average Speedup / 6.67
6 Conclusion
Enhancing the estimation of a sequence of DCMs is a direction of research that had not yet
been explored. The objective of this project was to propose different methods to improve the
total estimation time of a sequence of DCMs. The BFGS^{-1} algorithm is used to estimate
the two sequences of DCMs, S1 and S2. The standardization of the variables slightly
accelerates the estimation and should be used at the beginning of every optimization task
because of its simplicity. The second approach implements a method commonly used in ML,
which exploits previously acquired knowledge for a new task. The WSxH method speeds up
the estimation by factors of 3.84 and 4.5 compared to the Base method for S1 and S2,
respectively. The last approach is the ES method, which shows interesting improvements
of the estimation time, but the applied threshold has to be carefully chosen in order not to
stop at a spurious convergence plateau of the log likelihood; a threshold of 2·10^{-5} is
chosen. The combination of all the methods that speed up the estimation of both S1
and S2 leads to improvements by factors of 5.26 and 6.67, respectively, compared to the
Base method.
Future directions of research include two main improvements. The first one concerns the
ES method, which may stop at a convergence plateau far away from the real convergence
of the log likelihood; a more robust ES method could be implemented by finding an
efficient way to detect these regions of convergence. The second possible improvement
concerns the warm start, for which a more detailed analysis could be done: even though
the total estimation time is reduced by the WSxH method, some models where this method
is applied have a higher optimization time than without warm start.
References
[1] Bierlaire M. (1998) Discrete Choice Models. In: Labbé M., Laporte G., Tanczos K.,
Toint P. (eds) Operations Research and Decision Aid Methodologies in Traffic and Trans-
portation Management. NATO ASI Series (Series F: Computer and Systems Sciences),
vol 166. Springer, Berlin, Heidelberg
[2] Ben-Akiva M., Bierlaire M. (1999) Discrete Choice Methods and their Applications to
Short Term Travel Decisions. In: Hall R.W. (eds) Handbook of Transportation Science.
International Series in Operations Research & Management Science, vol 23. Springer,
Boston, MA
[3] Devolder, O., Glineur, F. & Nesterov, Y. First-order methods of smooth convex opti-
mization with inexact oracle. Math. Program. 146, 37–75 (2014).
https://doi.org/10.1007/s10107-013-0677-5
[4] Hennig, P. and Kiefel, M. (2013). Quasi-Newton methods: A new direction. The Journal
of Machine Learning Research, 14(1):843-865
[5] Gower, R. M., Hanzely, F., Richtárik, P. and Stich, S. (2018). Accelerated Stochastic
Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order
Optimization.
[6] Lederrey G., Lurkin V. and Hillel T. and Bierlaire M (2020). Estimation of Discrete
Choice Models with Hybrid Stochastic Adaptive Batch Size Algorithms.
[7] Jordan T. Ash and Ryan P. Adams (2019). On the Difficulty of Warm-Starting Neural
Network Training.
[8] Prechelt L. (2012) Early Stopping - But When? In: Montavon G., Orr G.B., Müller
K.-R. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol
7700. Springer, Berlin, Heidelberg.
[9] Fletcher, R. (1987). Practical Methods of Optimization (2nd Ed.). Wiley-Interscience,
New York, NY, USA.
[10] Bierlaire, M., Axhausen, K. and Abay, G. (2001). The acceptance of modal innovation:
The case of Swissmetro. In: Proceedings of the Swiss Transport Research Conference,
Ascona, Switzerland.
[11] Wolfe, P. (1969). Convergence Conditions for Ascent Methods. SIAM Review,
11(2):226-235.
[12] Wolfe, P. (1971). Convergence Conditions for Ascent Methods. II: Some Corrections.
SIAM Review, 13(2):185-188.