An Introduction to Discrete Choice Modelling

AN INTRODUCTION TO DISCRETE
CHOICE MODELLING
Tony Fowkes
Visiting Reader
Institute for Transport Studies
University of Leeds
Internal Seminar, ITS, 07/04/16

WHAT DO YOU THINK OF BRITISH TV?
• How good are the BBC channels? – Think
of a number!
Specifically, how ‘satisfied’ are you with the BBC
channels (BBC1, BBC2 & BBC4)?
We will be dealing with comparisons, so any
number will do for now. Write down 100 if
you can think of nothing better.

• Relative to the number you gave for the
BBC channels, how good do you think the
ITV offering is (ITV1 – ITV4)?
If you think one is twice as good as another,
you might give it twice the number.
Be guided by how often you watch ITV
channels as against BBC channels.

• Now give me a third number for how good
you think all the other channels are.

• Lastly, taking the total time you spend
watching all channels in a typical week as
100%, please write down the 3
percentages of time you typically spend
watching each of the channel groups.
You do not need to be too exact, and if you
don’t watch TV in a typical week, choose a
non-typical one.

So, we have been able to measure
shares (also known as proportions,
probabilities and, if multiplied by 100,
percentages).
But we want to model the shares, so
that we understand how they vary from
one person to another, and over time
as things change. That will allow us to
make predictions.

HOW MIGHT WE RELATE THE
VIEWING % FIGURES TO THE
SATISFACTION NUMBERS?
• Each person will have used a different
scale (unknown to the analyst) when
selecting their satisfaction numbers, but
we might try to guess (FOR EACH
PERSON) the proportion of time they
spend watching each of the 3 groups of
channels.

A SHARE MODEL
The simplest way of looking at this problem
is to try to form a simple ‘share model’.
Let Hi denote the hours spent watching
channel i, Si satisfaction with channel i and
Pi denote the share of the hours watched for
channel i in the total. Then:
PBBC = HBBC/(HBBC+HITV+HELSE)

A SHARE MODEL
If hours watched are proportional to
Satisfaction, then:
PBBC = SBBC/(SBBC+SITV+SELSE)
BUT – is Usage always proportional to
Satisfaction?
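
A minimal sketch of the simple share model, assuming hours (or satisfaction scores) are held in a dictionary keyed by channel group. The function name `shares` and all the numbers are hypothetical, chosen only for illustration. It also shows that multiplying every score by the same constant leaves the shares unchanged, so each person's arbitrary scale does not matter in this model.

```python
def shares(values):
    """Return each value as a proportion of the total."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return {name: v / total for name, v in values.items()}

# Hypothetical weekly viewing hours for one respondent.
hours = {"BBC": 6.0, "ITV": 3.0, "ELSE": 3.0}
print(shares(hours))  # {'BBC': 0.5, 'ITV': 0.25, 'ELSE': 0.25}

# Hypothetical satisfaction scores on the respondent's own scale.
satisfaction = {"BBC": 100, "ITV": 80, "ELSE": 160}
print(shares(satisfaction))

# Doubling every score (a different personal scale) gives the same shares.
doubled = {name: 2 * s for name, s in satisfaction.items()}
print(shares(doubled))
```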

CONSIDER YOUR JOURNEY HOME
FROM THE UNIVERSITY
• If you had the choice of two alternative
routes, one of which is three times as
good as the other, would you ever willingly
choose the worse route?
• P1 = S1/(S1+S2) = 100/(100+300) = 0.25
Seems like we need a better share model.

TRY USING EXPONENTIALS
P1 = Exp(S1)/[Exp(S1)+Exp(S2)]
= Exp(100)/[Exp(100)+Exp(300)] ≈ 1.4 × 10^-87
Rather too extreme, but we can define a
Utility (U) as a function of the S values,
eg. U = θS
Let θ = 0.05 (just to try it), so U1 = 5 and U2 = 15
P1 = Exp(5)/[Exp(5)+Exp(15)] ≈ 0.00005
By changing θ we can get sensible Ps
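
A small sketch of how the exponential share responds to θ, using the route scores S1 = 100 and S2 = 300 from the journey-home example. The function name `exp_share` and the particular θ values tried are assumptions for illustration only.

```python
import math

def exp_share(s1, s2, theta):
    """Share of alternative 1 when utility is U = theta * S."""
    u1, u2 = theta * s1, theta * s2
    # Subtract the larger utility before exponentiating to avoid overflow;
    # this leaves the ratio unchanged.
    m = max(u1, u2)
    e1, e2 = math.exp(u1 - m), math.exp(u2 - m)
    return e1 / (e1 + e2)

# Smaller theta pulls the share of the worse route towards one half.
for theta in (1.0, 0.05, 0.01, 0.005, 0.0):
    print(theta, exp_share(100, 300, theta))
```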

BACK TO THE TV EXAMPLE
If you had given S1=100, S2=80, S3=160;
then with θ=0.01 (just as an example),
PBBC = Exp(1)/[Exp(1)+Exp(0.8)+Exp(1.6)]
= 0.27
PITV = 0.22
PELSE = 0.50
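
The three shares above can be reproduced with a few lines of code. This is a sketch only: the function name `logit_shares` is hypothetical, and the scores and θ = 0.01 are the example values from this slide.

```python
import math

def logit_shares(scores, theta):
    """Exponential (logit) shares with utility U = theta * S."""
    utilities = {k: theta * s for k, s in scores.items()}
    m = max(utilities.values())  # shift for numerical stability
    expu = {k: math.exp(u - m) for k, u in utilities.items()}
    total = sum(expu.values())
    return {k: e / total for k, e in expu.items()}

p = logit_shares({"BBC": 100, "ITV": 80, "ELSE": 160}, theta=0.01)
print({k: round(v, 2) for k, v in p.items()})
# {'BBC': 0.27, 'ITV': 0.22, 'ELSE': 0.5}
```

The rounded shares add to 0.99 rather than 1 purely because of rounding.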

THE SCALE FACTOR
We call θ the SCALE FACTOR, and it is a
crucial parameter that has to be estimated
when calibrating a Discrete Choice
forecasting model.
The scale factor determines the relative
weight we give to the deterministic part of
the model compared to everything else
(the unknown residual or ‘error’ term).

The Scale Factor Problem
Logit Models consist of 2 parts:
U = Deterministic part + Random error
U = ΩV + ε
where the Ω (the scale factor, written θ earlier) ‘scales’ the expression we use for V
to the scale of the random error.
Suppose V = β0 + β1X1 + β2X2
Then ΩV = Ωβ0 + Ωβ1X1 + Ωβ2X2
And so the modelled coefficients are estimates of
Ωβ0, Ωβ1, Ωβ2

Why does the scale factor problem matter?
• For attribute valuation, such as ‘value of time’, it
doesn’t matter since the scale factors cancel
• For mode choice forecasting it does matter,
unless the errors are the correct size. This may
well be the case for RP, but will not be the case
for SP, where the errors are likely to be greater
than real errors due to the hypothetical nature of
the experiment. That will mean that the formula
for P will overstate small probabilities and
understate the probability of the dominant mode.
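
Both points can be illustrated numerically. The sketch below assumes a binary logit with hypothetical 'true' coefficients for cost and time (the values, function names and the cost and time differences are all invented for illustration). Halving Ω, as noisier SP responses might, leaves the ratio of coefficients unchanged but moves the forecast probability towards one half.

```python
import math

# Hypothetical 'true' coefficients: utility per pound and per minute.
beta_cost, beta_time = -0.10, -0.02

def estimated(omega):
    """Coefficients as the logit model sees them: each scaled by omega."""
    return omega * beta_cost, omega * beta_time

def p_mode1(omega, d_cost, d_time):
    """Binary logit probability of mode 1, given (mode 1 - mode 2) differences."""
    b_c, b_t = estimated(omega)
    return 1.0 / (1.0 + math.exp(-(b_c * d_cost + b_t * d_time)))

for omega in (1.0, 0.5):  # 0.5 mimics larger errors, e.g. in SP data
    b_c, b_t = estimated(omega)
    # Mode 1 is 20 pounds dearer and 30 minutes slower, so P1 is small.
    print(omega, b_t / b_c, p_mode1(omega, d_cost=20.0, d_time=30.0))
```

The value of time (pounds per minute) printed in the second column is the same for both values of Ω, while the small probability in the third column is larger when Ω is smaller.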

Probability P varies with Ω
Pi = exp(ΩVi)/∑k exp(ΩVk)
As Ω → 0, Pi → 1/K, where K is the number of alternatives,
ie. complete ignorance – toss of a coin when K = 2.
As Ω increases, the more the model is
explaining what is going on – good.
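
A quick numerical check of these limits, as a sketch: the function name `mnl` and the three utilities are hypothetical values chosen for illustration.

```python
import math

def mnl(v, omega):
    """Logit probabilities for systematic utilities v and scale omega."""
    scaled = [omega * x for x in v]
    m = max(scaled)  # shift for numerical stability
    e = [math.exp(s - m) for s in scaled]
    total = sum(e)
    return [x / total for x in e]

v = [1.0, 0.8, 1.6]  # hypothetical utilities for 3 alternatives
for omega in (0.0, 1.0, 10.0, 100.0):
    print(omega, [round(p, 3) for p in mnl(v, omega)])
```

With Ω = 0 every alternative gets probability 1/3; as Ω grows, the probability concentrates on the alternative with the highest V.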

How can the Binary Logit model be derived?
P1 = Prob(U1 > U2)
= Prob(ΩV1+ε1 > ΩV2+ε2)
= Prob(ε2 = h AND ε1 > h + ΩV2 - ΩV1), summed over all h
Assume a Gumbel distribution for the ε’s.
Cumulative F(ε) = exp(-exp(-ε))
Density fn. dF(ε) = exp(-ε) exp(-exp(-ε)) dε
P1 = ∫ from minus infinity to plus infinity of
dF(ε2)[1 - F(ε1)], with ε2 = h and ε1 = h + ΩV2 - ΩV1, which on substitution gives
∫ exp(-h) exp(-exp(-h)) [1 - exp(-exp(-(h + ΩV2 - ΩV1)))] dh

which, after some tricky but conventional
manipulation gives:
P1 = 1/(1+exp(ΩV2-ΩV1))
Or
P1 = (exp(ΩV1))/[exp(ΩV1) + exp(ΩV2)]
which is the Binary Logit model.
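
One way to fill in the manipulation, sketched under the assumption of independent standard Gumbel errors as stated above, and using Prob(ε1 > x) = 1 - F(x). The shorthand a and the substitution t = exp(-h) are introduced here purely for illustration.

```latex
\begin{align*}
a   &= \Omega V_2 - \Omega V_1 \\
P_1 &= \int_{-\infty}^{\infty} e^{-h}\, e^{-e^{-h}}
       \left[ 1 - e^{-e^{-(h+a)}} \right] dh \\
    &= 1 - \int_{-\infty}^{\infty} e^{-h}
       \exp\!\left[ -\left(1 + e^{-a}\right) e^{-h} \right] dh \\
    &= 1 - \int_{0}^{\infty}
       \exp\!\left[ -\left(1 + e^{-a}\right) t \right] dt
       \qquad (t = e^{-h}) \\
    &= 1 - \frac{1}{1 + e^{-a}}
     = \frac{1}{1 + e^{a}}
     = \frac{1}{1 + \exp(\Omega V_2 - \Omega V_1)}
\end{align*}
```

The second line uses the fact that the Gumbel density integrates to one.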

Multinomial Logit Model (MNL)
• This brings us back to where we started, a
three way choice of TV channels. For
more than 2 choices we use a Multinomial
Logit model
P1 = exp(U1)/(exp(U1) + exp(U2) + …)

Problem with the MNL model
• A theoretical, and sometimes important
problem with MNL is the Red Bus – Blue
Bus problem, which arises from the
Independence of Irrelevant Alternatives
property.
• This can be avoided by using various
Nested Logits, Mixed Logit, Cascetta’s C-
Logit, or Fowkes & Toner’s Flat Logit.
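
A sketch of the Red Bus – Blue Bus problem, assuming all alternatives have the same utility (set to zero here for illustration; the function name `mnl` is hypothetical). Adding a blue bus that is identical to the red bus takes share from the car, because MNL keeps the car to red bus ratio fixed.

```python
import math

def mnl(utilities):
    """Multinomial logit probabilities from a dict of utilities."""
    m = max(utilities.values())  # shift for numerical stability
    e = {k: math.exp(u - m) for k, u in utilities.items()}
    total = sum(e.values())
    return {k: x / total for k, x in e.items()}

before = mnl({"car": 0.0, "red bus": 0.0})
after = mnl({"car": 0.0, "red bus": 0.0, "blue bus": 0.0})

print(before)  # {'car': 0.5, 'red bus': 0.5}
print(after)   # one third each, so car falls from 1/2 to 1/3

# Independence of Irrelevant Alternatives: the ratio is unchanged.
print(before["car"] / before["red bus"], after["car"] / after["red bus"])
```

Intuitively one would expect the car to keep about half the market, with the two buses splitting the other half.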

THE DETERMINISTIC PART
Here we seek to model Utility.
The current terminology we use is to regard
the 3 channel groups as 3
ALTERNATIVES, each described by a set
of ATTRIBUTES, each set to a particular
LEVEL.

Examples of ALTERNATIVES,
ATTRIBUTES and ATTRIBUTE LEVELS
Our Alternatives are BBC, ITV, ELSE
Important ATTRIBUTES might be:
(i) Availability
(ii) Cost
(iii) Variety of programmes
(iv) Quality of programmes

Possible attribute LEVELS for Availability
might be:
a) Freeview
b) Satellite
c) High Definition
d) On Demand

Possible attribute LEVELS for Variety might
be:
(a) Very good choice
(b) Good choice
(c) Average
(d) Poor range of programmes
(e) Very limited range of programmes
(f) Only phone-in shows

Possible attribute LEVELS for Quality might
be:
(a) International top quality
(b) Not bad for a national network
(c) Has occasional good programmes
(d) Only repeats
(e) Only phone-in shows
(f) Ant ‘n’ Dec

Transport Applications
In Transport there are many occasions
where we model Alternatives by their
Generalised Cost, GC:
eg. GC = αC + βT
Or, more generally,
GC = αC + β1T1 + β2T2... + βnTn

Excerpt from A Gray (1977)
“For the UK, the generalised cost concept was
perhaps invented by Quarmby in the famous 1967
article about modal choice, based on some earlier
work by Warner (1962) in the United States. In
Quarmby’s article the concept was described as
‘disutility’ and referred to a linear combination of
the time and money costs of a journey”.

VALUE OF TIME
In passing we note that the RATIO OF the
coefficient of the nth type of time (Tn) TO
the coefficient of cost is called the value of
the nth type of time, ie
VOT(n) = βn /α
This has kept some of us employed for a
good part of our working lives.
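
A minimal sketch of the generalised cost and value of time calculations. All coefficient values, costs and times are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def generalised_cost(cost, times, alpha, betas):
    """GC = alpha*C + beta_1*T_1 + ... + beta_n*T_n."""
    if len(times) != len(betas):
        raise ValueError("need one beta per type of time")
    return alpha * cost + sum(b * t for b, t in zip(betas, times))

def value_of_time(beta_n, alpha):
    """VOT(n) = beta_n / alpha, in money per unit of time."""
    return beta_n / alpha

alpha = 1.0            # hypothetical weight per pound of cost
betas = [0.10, 0.25]   # hypothetical weights per minute: in-vehicle, waiting

gc = generalised_cost(cost=3.0, times=[20.0, 5.0], alpha=alpha, betas=betas)
print(gc)  # 3.0 + 0.10*20 + 0.25*5 = 6.25

for name, b in zip(("in-vehicle", "waiting"), betas):
    print(name, value_of_time(b, alpha) * 60, "pounds per hour")
```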

WHAT IS THE VALUE OF TIME?
It is just the exchange rate (for a person, a
sample, or a population) between money
and spending extra time in an activity. It
has 2 parts.
There is always something we can do with
time so the Resource VOT is always +ve.
Usually more important is the (dis)utility of
the activity concerned. Most activities
have a –ve utility from time reduction, but
in transport they are mostly +ve.

Binary Choice
Let us estimate a model for 2 Alternatives: 1
& 2 (just 2, so we say “Binary”)
Suppose the Alternatives only differ in terms
of measured Generalised Cost.
We need to observe P1, the proportion
choosing Alternative 1 for various levels of
difference in GC between the Alternatives.

The Binary Logit Model
A Linear expression for P1 is not
satisfactory.
(eg. P1 has to lie between zero and one).
• A linear expression for
ln(P1/(1-P1))
seems much more satisfactory
Put this “logit” (or ‘log-odds’) equal to the
difference in Generalised Cost, GC2-GC1
(a higher cost for Alternative 1 should lower P1)

Equation for the Binary Logit Model
Ln(P1/(1-P1)) = GC2-GC1
P1/(1-P1) = exp(GC2-GC1)
P1 = exp(GC2-GC1) - P1.exp(GC2-GC1)
P1(1+exp(GC2-GC1)) = exp(GC2-GC1)
P1 = exp(GC2-GC1)/(1+exp(GC2-GC1))
P1 = exp(-GC1)/[exp(-GC1)+exp(-GC2)]
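
A short sketch of the log-odds transform and its inverse, to show why a linear expression for the logit is satisfactory: whatever value the linear expression x takes, the implied P1 stays strictly between zero and one. The function names are hypothetical, and x stands for whatever difference is put on the right-hand side.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Invert the log-odds: p = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = inv_logit(x)
    # The round trip through logit() recovers x.
    print(x, round(p, 4), round(logit(p), 4))
```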

Excerpt from D McFadden (2001)
“In 1965, a graduate student asked me how she
might analyze her thesis data in freeway routing
choices by the California Department of
Highways. This led me to consider the problem of
economic choice among discrete alternatives. The
problem was to devise a computationally tractable
model of economic decision making that yielded
choice probabilities for each alternative in a finite
feasible set. It was natural to think of highway
department decision-makers as maximizing
preferences that varied from one bureaucrat to
another.

“I drew on a classical psychological study of
perception, Thurstone’s Law of Comparative
Judgment. In this theory, the perceived level of a
stimulus equals its objective level plus a random
error. The probability that one object is judged
higher than a second is the probability that this
alternative has the higher perceived stimulus.
When the perceived stimuli are interpreted as
levels of satisfaction, or utility, this can be
interpreted as a model for economic choice in
which utility levels are random, and observed
choices pick out the alternative that has the
highest realized utility level. This connection
was made in the 1950’s by the economist Jacob
Marschak, who called this the random utility
maximization hypothesis, abbreviated to RUM.

“Another psychologist I relied on was Duncan Luce,
who in 1959 introduced an axiom that simplified
experimental collection of psychological choice data by
allowing choice probabilities for many alternatives to
be inferred from choices between pairs of alternatives.
Marschak showed that choice probabilities satisfying
Luce’s axiom were consistent with the RUM
hypothesis.
I proposed an econometric version of the Luce model
in which the utilities of alternatives depended on their
measured attributes, such as construction cost, route
length, and areas of parklands and open space taken. I
called this a conditional or multinomial logit model, and
developed a computer program to estimate it.”

DALY-ZACHARY-WILLIAMS
THEOREM
Andrew Daly & Stan Zachary (1976) and
Huw Williams (1977) added significantly to
Discrete Choice theory, particularly
providing a set of conditions that
Generalised Extreme Value models need
to meet in order to be a probability choice
model.
Williams also related the concept of
Consumer Surplus to Discrete Choice
Model parameters.

Revealed Preference Analysis
Key References
1. P Samuelson (1938). Econometrica.
Observing a consumer to have chosen one alternative
and, by so doing, have rejected a second alternative.
2. K Lancaster (1966). Journal of Political Economy.
Utility for a commodity determined by the
characteristics of that commodity. Then a small step
to modelling utility as a sum of ‘part-worths’ of these
characteristics individually.
3. D McFadden (1974). In: Zarembka (ed), Frontiers of
Econometrics.
‘Conditional Logit Analysis of Qualitative Choice
Behaviour’

Revealed Preference Data
TRAVELLERS ARE OBSERVED TO CHOOSE AN
OPTION (HAVING CERTAIN CHARACTERISTICS) IN
PREFERENCE TO ANOTHER OPTION (HAVING OTHER
CHARACTERISTICS)
e.g. Traveller chooses train with cost £30 and travel time 2
hours in preference to coach costing £15 and taking 4
hours.
EITHER Requires ‘Engineering’ data on costs, times,
etc. (Possibly from fare manuals, timetables
or modelled)
OR Requires traveller to report the costs and
times of both the chosen and rejected modes.
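
A sketch of what one such observation reveals, assuming the traveller compares the options on cost and time only (an assumption made for illustration; the function name `vot_bound` is hypothetical). Choosing the train means paying £15 more to save 2 hours, so this traveller's value of time is at least £7.50 per hour.

```python
def vot_bound(chosen, rejected):
    """Bound on value of time implied by one observed choice.

    chosen and rejected are (cost in pounds, time in hours).
    Returns a (kind, pounds_per_hour) pair.
    """
    extra_cost = chosen[0] - rejected[0]   # extra money paid for chosen option
    time_saved = rejected[1] - chosen[1]   # hours saved by chosen option
    if extra_cost > 0 and time_saved > 0:
        # Paid more to save time: VOT is at least the ratio.
        return ("at least", extra_cost / time_saved)
    if extra_cost < 0 and time_saved < 0:
        # Accepted a longer journey to save money: VOT is at most the ratio.
        return ("at most", extra_cost / time_saved)
    # One option is no worse on both cost and time: no trade-off is revealed.
    return ("no trade-off", None)

print(vot_bound(chosen=(30.0, 2.0), rejected=(15.0, 4.0)))  # ('at least', 7.5)
```

A single choice gives only a bound, which is one reason large samples are needed to pin down a value.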

Problems with Revealed Preference Data
– Self justification bias in reported data
– Many choices ‘dominated’
– Cost and time differences between modes may
be correlated
– Habit/inertia effects
– Respondent may not be able to give satisfactory
data about the alternative mode
Generally need very large samples

Transfer Price Data
TRAVELLERS ARE ASKED DIRECTLY FOR A
MEASURE OF UTILITY DIFFERENCE BETWEEN
TWO TRAVEL ALTERNATIVES
by questions such as:
‘How much would the cost of your chosen alternative
have to rise in order for you to switch to your rejected
alternative?’

Problems with Transfer Price Data
– Policy response bias
– Unconstrained response bias
– Self justification bias
– Requires data about the rejected alternative,
which may only be known very inexactly
– Respondent may not understand or be able to
relate to question

Stated Preference Data
TRAVELLERS ARE PRESENTED WITH A SET OF
HYPOTHETICAL TRAVEL CHOICES, EACH WITH
ITS OWN CHARACTERISTICS (e.g. Cost, Travel
time, etc), AND ASKED TO
- MAKE A CHOICE
- RANK ALTERNATIVES
- RATE ALTERNATIVES
THE CRUCIAL REQUIREMENT IS THAT THE
ABOVE INCORPORATE IMPLICIT TRADE-OFFS

Advantages of Stated Preference
– Can represent situations that do not yet exist
– No problem of reporting error/bias
– Can ‘design in’ interesting trade offs
– Can ensure low correlation between
characteristic differences
– Can ask ‘many’ choices of each individual
– Avoids requirement for ‘confidential’ information
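
A sketch of the low-correlation point, assuming a simple full factorial design over hypothetical cost and time differences (the levels and the helper `pearson` are invented for illustration; real SP designs are usually more elaborate). The designed differences are uncorrelated, whereas an RP-style sample in which the dearer mode is always the faster one is perfectly correlated.

```python
from itertools import product

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical SP design: every cost difference paired with every time difference.
cost_diffs = [-2.0, 0.0, 2.0]      # pounds
time_diffs = [-10.0, 0.0, 10.0]    # minutes
design = list(product(cost_diffs, time_diffs))  # 9 choice situations
sp_r = pearson([c for c, _ in design], [t for _, t in design])

# Hypothetical RP-style data: the dearer mode is always proportionally faster.
rp_r = pearson([2.0, 4.0, 6.0], [-10.0, -20.0, -30.0])

print(sp_r, rp_r)
```

With perfectly correlated differences the separate effects of cost and time cannot be distinguished, which is why the design matters.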

Problems with Stated Preference Data
– Response not rooted in an actual choice
– Questions may be difficult to understand
– Respondents may refuse to ‘play games’
– Relatively unimportant characteristics may be
ignored
– Design is (very?) difficult
– Scale factor problem
