The Underlying Math
So far we’ve seen that the beauty of logistic regression is that it outputs values bounded by 0 and 1; hence they can be directly interpreted as probabilities. Let’s get into the math behind it a bit. You want a function that takes the data and transforms it into a single value bounded inside the closed interval [0, 1]. For an example of a function bounded between 0 and 1, consider the inverse-logit function shown in Figure 5-2.
$$P(t) = \operatorname{logit}^{-1}(t) \equiv \frac{1}{1 + e^{-t}} = \frac{e^{t}}{1 + e^{t}}$$
Figure 5-2. The inverse-logit function
Logit Versus Inverse-logit
The logit function takes x values in the range (0, 1) and transforms them to y values along the entire real line:
$$\operatorname{logit}(p) = \log\frac{p}{1 - p} = \log p - \log(1 - p)$$
The inverse-logit does the reverse, and takes x values along the real line and transforms them to y values in the range (0, 1).
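To make the pair concrete, here is a minimal Python sketch (not from the text; the names logit and inv_logit are just illustrative) that implements both functions and checks numerically that they undo each other:

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to a value on the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(t):
    """Map any real number t back to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# The two functions undo each other (up to floating-point rounding).
for p in [0.01, 0.25, 0.5, 0.9]:
    t = logit(p)
    print(f"p = {p:.2f}  logit(p) = {t:+.3f}  inv_logit(logit(p)) = {inv_logit(t):.2f}")
```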
As t gets large, e^{-t} gets close to zero, so the denominator approaches 1 and the function approaches 1; as t gets very negative, e^{-t} gets huge, so the denominator is large, which makes the function close to zero. So that’s the inverse-logit function, which you’ll use to begin deriving a logistic regression model. In order to model the data, you need to work with a slightly more general function that expresses the relationship between the data and a probability of a click. Start by defining:
$$P(c_i \mid x_i) = \left[\operatorname{logit}^{-1}(\alpha + \beta^{\tau} x_i)\right]^{c_i} \cdot \left[1 - \operatorname{logit}^{-1}(\alpha + \beta^{\tau} x_i)\right]^{1 - c_i}$$
Here c_i is the label or class (clicked or not), and x_i is the vector of features for user i. Observe that c_i can only be 1 or 0, which means that if c_i = 1, the second term cancels out and you have:
$$P(c_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\alpha + \beta^{\tau} x_i)}} = \operatorname{logit}^{-1}(\alpha + \beta^{\tau} x_i)$$
And similarly, if c_i = 0, the first term cancels out and you have:
$$P(c_i = 0 \mid x_i) = 1 - \operatorname{logit}^{-1}(\alpha + \beta^{\tau} x_i)$$
To make this a linear model in the outcomes c_i, take the log of the odds ratio:
$$\log\left[\frac{P(c_i = 1 \mid x_i)}{1 - P(c_i = 1 \mid x_i)}\right] = \alpha + \beta^{\tau} x_i$$
Which can also be written as:
$$\operatorname{logit}\bigl(P(c_i = 1 \mid x_i)\bigr) = \alpha + \beta^{\tau} x_i$$
If it feels to you that we went in a bit of a circle here (this last equation was also implied by earlier equations), it’s because we did. The point of this was to show you how to go back and forth between the probabilities and the linearity.
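As a quick check on that back-and-forth, a few lines of Python (again with made-up parameter values) show that applying the inverse-logit to the linear predictor and then taking the logit of the resulting probability recovers the same linear quantity:

```python
import numpy as np

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    return np.log(p / (1 - p))

# Hypothetical parameters and features for one user.
alpha, beta = -4.0, np.array([0.8, -0.3, 1.2])
x_i = np.array([1.0, 0.0, 1.0])

linear = alpha + beta @ x_i    # the linear side: alpha + beta' x_i
p = inv_logit(linear)          # forward: linearity -> probability
recovered = logit(p)           # back: probability -> linearity
print(linear, p, recovered)    # linear and recovered agree up to rounding
```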
So the logit of the probability that user i clicks on the shoe ad is modeled as a linear function of the features, which were the URLs that user i visited. This model is called the logistic regression model. The parameter α is what we call the base rate, or the unconditional probability of “1” or “click” knowing nothing more about a given user’s
feature vector x_i. In the case of measuring the likelihood of an average user clicking on an ad, the base rate would correspond to the click-through rate, i.e., the tendency over all users to click on ads. This is typically on the order of 1%.
If you had no information about your specific situation except the base rate, the average prediction would be given by just α:
$$P(c_i = 1) = \frac{1}{1 + e^{-\alpha}}$$
The variable β defines the slope of the logit function. Note that in general it’s a vector that is as long as the number of features you are using for each data point. The vector β determines the extent to which certain features are markers for increased or decreased likelihood to click on an ad.
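A short sketch (illustrative numbers only, assuming the roughly 1% click-through rate mentioned above) shows how the base rate pins down α and how a coefficient vector β shifts the probability for a particular user:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

# A click-through rate of about 1% fixes the base rate alpha.
ctr = 0.01
alpha = logit(ctr)                 # roughly -4.6
print(alpha, inv_logit(alpha))     # inv_logit(alpha) recovers 0.01

# Made-up coefficients: a positive entry marks a feature associated with a
# higher click probability, a negative entry with a lower one.
beta = np.array([0.9, -0.4])
x_i = np.array([1.0, 1.0])         # this hypothetical user has both features
print(inv_logit(alpha + beta @ x_i))
```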
Estimating α and β
Your immediate modeling goal is to use the training data to find the best choices for α and β. In general you want to solve this with maximum likelihood estimation and use a convex optimization algorithm, because the negative log-likelihood is convex; you can’t just use derivatives and vector calculus like you did with linear regression because the likelihood is a complicated function of your data, and in particular there is no closed-form solution.
Denote by Θ the pair (α, β). The likelihood function L is defined by:
$$L(\Theta \mid X_1, X_2, \dots, X_n) = P(X \mid \Theta) = P(X_1 \mid \Theta) \cdots P(X_n \mid \Theta)$$
where you are assuming the data points X_i are independent, and i = 1, ..., n represent your n users. This independence assumption corresponds to saying that the click behavior of any given user doesn’t affect the click behavior of all the other users; in this case, “click behavior” means “probability of clicking.” It’s a relatively safe assumption.
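Because of that independence, the likelihood factors into a product over users, and in practice you work with its logarithm, a sum over users. The sketch below (made-up toy data, using the logistic-regression form of P(X_i | Θ) derived above) computes the negative log-likelihood that a convex optimizer would minimize:

```python
import numpy as np

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_likelihood(alpha, beta, X, c):
    """-log L(alpha, beta | data), summed over independent users.

    X is an (n, d) array of feature vectors; c is an array of 0/1 labels.
    """
    p = inv_logit(alpha + X @ beta)                        # P(c_i = 1 | x_i) for each user
    return -np.sum(c * np.log(p) + (1 - c) * np.log(1 - p))

# Toy data: four users, two features each, with invented labels.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
c = np.array([1, 0, 1, 0])
print(neg_log_likelihood(-1.0, np.array([2.0, -1.0]), X, c))
```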