Machine Learning Basics for Web Search
1. Machine Learning and Web Search
Part 1: Basics of Machine Learning
Hongyuan Zha
College of Computing
Georgia Institute of Technology
2. Outline
1 Classification Problems
Bayes Error and Risk
Naive Bayes Classifier
Logistic Regression
Decision Trees
2 Regression Problems
Least Squares Problem
Regularization
Bias-Variance Decomposition
3 Cross-Validation and Comparison
Cross-Validation
p-Value and Test of Significance
3. Supervised Learning
We predict a target variable based on a set of predictor variables, using
a training set of examples.
Classification: predict a discrete target variable
– spam filtering based on message contents
Regression: predict a continuous target variable
– predict income based on other demographic information
4. Probabilistic Setting for Classification
X is the predictor space, and C = {1, . . . , k} the set of class labels
P(x , j) a probability distribution on X × C
A classifier is a function h : X → C
We want to learn h from a training sample
D = {(x1 , j1 ), . . . , (xN , jN )}
How do we measure the performance of a classifier?
Misclassification error,
error_h = P({(x, j) | h(x) ≠ j}),
where (x, j) ∼ P(x, j)
5. Bayes Classifier
P(x , j) = P(x |j)P(j) = P(j|x )P(x )
Assume X continuous, and
P(x ∈ A, j) = ∫_A p(j|x) p(x) dx
error_h = P(h(x) ≠ j)
= ∫ p(h(x) ≠ j | x) p(x) dx = ∫ (1 − p(h(x)|x)) p(x) dx
Bayes error = minh errorh and Bayes classifier
h∗ (x ) = argmaxj p(j|x )
6. Risk Minimization
Loss function L(i, j) = C(i|j): cost if class j is predicted to be class i
Risk for h = expected loss for h,
R_h = ∫ ∑_{j=1}^{k} L(h(x), j) p(j|x) p(x) dx
Minimizing the risk ⇒ Bayes classifier
h∗(x) = argmin_j ∑_{ℓ=1}^{k} C(j|ℓ) p(ℓ|x)
0/1-loss ⇒ error_h
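As a minimal Python sketch (not part of the original slides): given the posterior p(j|x) and a cost matrix C(i|j), the Bayes decision picks the class with the smallest expected cost; with 0/1-loss this reduces to argmax_j p(j|x). The posterior and cost values below are made-up illustrations.
# Bayes decision under a cost matrix C, where C[i, j] = C(i|j):
# the cost of predicting class i when the true class is j.
import numpy as np

def bayes_decision(posterior, C):
    # posterior: length-k array with p(j|x); C: k x k cost matrix.
    expected_cost = C @ posterior        # expected_cost[i] = sum_j C[i, j] * p(j|x)
    return int(np.argmin(expected_cost))

posterior = np.array([0.3, 0.7])
zero_one = 1.0 - np.eye(2)               # 0/1-loss -> reduces to argmax_j p(j|x)
print(bayes_decision(posterior, zero_one))   # 1
asym = np.array([[0.0, 1.0],
                 [5.0, 0.0]])             # predicting class 1 when the truth is 0 costs 5
print(bayes_decision(posterior, asym))        # 0: asymmetric costs change the decision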
7. Naive Bayes Classifier
Use Bayes Rule
p(j|x) ∝ p(x|j) p(j)
Feature vector x = [t1 , . . . , tn ]. Conditional independence assumption
p(x |j) = p(t1 |j)p(t2 |j) . . . p(tn |j)
MLE for p(ti |j), smoothed version (m-estimate),
p(ti |j) = (nc + m·p) / (n + m)
n: the number of training examples in class j
nc : the number of examples in class j with attribute value ti
p: a priori estimate
m: the equivalent sample size
8. Naive Bayes Classifier: Example
Car theft example: the attributes are Color, Type, and Origin; the target, Stolen,
can be either Yes or No.
1 Data set
Example No.   Color    Type     Origin     Stolen?
1             Red      Sports   Domestic   Yes
2             Red      Sports   Domestic   No
3             Red      Sports   Domestic   Yes
4             Yellow   Sports   Domestic   No
5             Yellow   Sports   Imported   Yes
6             Yellow   SUV      Imported   No
7             Yellow   SUV      Imported   Yes
8             Yellow   SUV      Domestic   No
9             Red      SUV      Imported   No
10            Red      Sports   Imported   Yes
2 Training example estimates
We want to classify a Red Domestic SUV. Note there is no example of a Red
Domestic SUV in our data set. We need the probabilities P(Red|Yes),
P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No),
which we then multiply by P(Yes) and P(No) respectively. Each is estimated
with the smoothed formula from the previous slide, (nc + m·p)/(n + m).
Looking at P(Red|Yes), we have 5 cases where the class is Yes, and in 3 of those
cases the color is Red, so n = 5 and nc = 3. All attributes here are binary (two
possible values); assuming no other information, p = 1/(number of attribute
values) = 0.5 for all attributes. The value of m is arbitrary but must be
consistent across attributes; we use m = 3. Applying the formula with the
precomputed values of n, nc , p, and m:
P(Red|Yes)      = (3 + 3*.5)/(5 + 3) = .56    P(Red|No)      = (2 + 3*.5)/(5 + 3) = .43
P(SUV|Yes)      = (1 + 3*.5)/(5 + 3) = .31    P(SUV|No)      = (3 + 3*.5)/(5 + 3) = .56
P(Domestic|Yes) = (2 + 3*.5)/(5 + 3) = .43    P(Domestic|No) = (3 + 3*.5)/(5 + 3) = .56
We have P(Yes) = .5 and P(No) = .5. For v = Yes,
P(Yes) * P(Red|Yes) * P(SUV|Yes) * P(Domestic|Yes) = .5 * .56 * .31 * .43 = .037
and for v = No,
P(No) * P(Red|No) * P(SUV|No) * P(Domestic|No) = .5 * .43 * .56 * .56 = .069
Since 0.069 > 0.037, our example gets classified as ’NO’.
9. Naive Bayes Classifier: Example
To classify a Red Domestic SUV,
For Yes:
P(Yes) * P(Red | Yes) * P(SUV | Yes) * P(Domestic|Yes)
= .5 * .56 * .31 * .43 = .037
For No:
P(No) * P(Red | No) * P(SUV | No) * P (Domestic | No) =
.5 * .43 * .56 * .56 = .069
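The computation above can be reproduced with a short Python sketch (an illustration, not from the slides), using the m-estimate (nc + m·p)/(n + m) with m = 3 and p = 0.5:
# Naive Bayes scores for a Red Domestic SUV on the car theft data set.
data = [  # (color, type, origin, stolen)
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]
m, p = 3.0, 0.5

def cond_prob(attr_index, value, label):
    rows = [r for r in data if r[3] == label]
    n = len(rows)
    n_c = sum(1 for r in rows if r[attr_index] == value)
    return (n_c + m * p) / (n + m)          # m-estimate smoothing

query = ("Red", "SUV", "Domestic")
for label in ("Yes", "No"):
    prior = sum(1 for r in data if r[3] == label) / len(data)
    score = prior
    for i, value in enumerate(query):
        score *= cond_prob(i, value, label)
    print(label, round(score, 3))
# Yes ≈ 0.038, No ≈ 0.069 (the slides round intermediate factors, giving .037 and .069),
# so the example is classified as No.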
10. Learning to Classify Text
Target concept Interesting : Document → {+, −}
1 Example: document classification using BOW
— Multiple Bernoulli model
— Multinomial model
one attribute per word position in document
2 Learning: Use training examples to estimate
P(+), P(−), P(doc|+), P(doc|−)
Naive Bayes conditional independence assumption
P(doc|j) = ∏_{i=1}^{length(doc)} P(ti |j)
where P(ti |j) is probability that word i appears in class j
11. Learn_naive_Bayes_text(Examples, k)
1. collect all words and other tokens that occur in Examples
Vocabulary ← all distinct words and other tokens in Examples
2. calculate the required P(j) and P(ti |j) probability terms
For each target value j in {1, . . . , k} do
docsj ← subset of Examples for which the target value is j
P(j) ← |docsj | / |Examples|
Textj ← a single document created by concatenating all members of
docsj
n ← total number of words in Textj (counting duplicate words multiple
times)
for each word ti in Vocabulary
ni ← number of times word ti occurs in Textj
P(ti |j) ← (ni + 1) / (n + |Vocabulary|)
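A minimal Python sketch of this procedure (an illustration, not from the slides), assuming examples is a list of (token list, class label) pairs:
# Multinomial naive Bayes learner with add-one smoothing, as in the pseudocode above.
from collections import Counter

def learn_naive_bayes_text(examples, classes):
    vocabulary = {t for doc, _ in examples for t in doc}
    prior, cond = {}, {}
    for j in classes:
        docs_j = [doc for doc, label in examples if label == j]
        prior[j] = len(docs_j) / len(examples)
        text_j = [t for doc in docs_j for t in doc]    # concatenate all docs of class j
        n = len(text_j)                                # total words, counting duplicates
        counts = Counter(text_j)
        cond[j] = {t: (counts[t] + 1) / (n + len(vocabulary)) for t in vocabulary}
    return vocabulary, prior, cond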
12. Classify_naive_Bayes_text(Doc)
positions ← all word positions in Doc that contain tokens found in
Vocabulary
Return jNB , where
jNB = argmax_j P(j) ∏_{i ∈ positions} P(ti |j)
When k ≫ 1, a special smoothing method is needed,
Congle Zhang et al., Web-scale classification with Naive Bayes, WWW
2009
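A companion sketch of the classification step (not from the slides), reusing the learn_naive_bayes_text sketch from the previous slide; the toy documents are made up for illustration:
# Sum log-probabilities over the positions of Doc whose tokens are in Vocabulary,
# then return the argmax class.
import math

def classify_naive_bayes_text(doc, vocabulary, prior, cond):
    def score(j):
        s = math.log(prior[j])
        for t in doc:
            if t in vocabulary:
                s += math.log(cond[j][t])
        return s
    return max(prior, key=score)

examples = [(["cheap", "pills", "now"], "+"), (["meeting", "agenda", "notes"], "-")]
vocab, prior, cond = learn_naive_bayes_text(examples, ["+", "-"])
print(classify_naive_bayes_text(["cheap", "meds", "now"], vocab, prior, cond))  # '+'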
13. Logistic Regression
In Naive Bayes, the discriminant function is
P(j) ∏_i P(ti |j)
Let ni be the frequency of ti , and take the log of the above,
log P(j) + ∑_i ni log P(ti |j)
which is a linear function of the frequency vector x = [n1 , . . . , nV ]T
14. Logistic Regression
More generally,
P(j|x) = (1/Zw (x)) exp(wj^T x) ≡ (1/Zw (x)) exp(w^T f (x, j))
Given a training sample L = {(x1 , j1 ), . . . , (xN , jN )} we maximize the
conditional likelihood; equivalently, we minimize the regularized negative
conditional log-likelihood,
min_w (1/2) ||w||² + C ∑_i ( log Zw (xi ) − w^T f (xi , ji ) )
A convex function in w
C.J. Lin et al., Trust region Newton methods for large-scale logistic
regression, ICML, 2007 (several millions of features, N = 10^5 )
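A minimal Python sketch of this objective (plain gradient descent on a toy problem, not the trust-region Newton method of the citation); the feature map is simply f(x, j) = x with one weight vector per class, and the data are synthetic:
# Multiclass logistic regression: minimize (1/2)||W||^2 + C * sum_i [log Z_W(x_i) - w_{j_i}^T x_i].
import numpy as np

def train_logistic(X, y, k, C=1.0, lr=0.1, steps=500):
    N, d = X.shape
    W = np.zeros((k, d))
    Y = np.eye(k)[y]                               # one-hot targets, N x k
    for _ in range(steps):
        scores = X @ W.T                           # N x k class scores
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)          # P[i, j] = P(j | x_i)
        grad = W + C * (P - Y).T @ X               # gradient of the objective above
        W -= lr * grad / N
    return W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = train_logistic(X, y, k=2)
pred = (X @ W.T).argmax(axis=1)
print((pred == y).mean())                          # ~1.0 on the training data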
15. Generative vs. Discriminative Classifiers
Naive Bayes estimates parameters for P(j), P(x |j) while logistic
regression estimates parameters for P(j|x )
Naive Bayes: generative classifier
Logistic regression: discriminative classifier
Logistic regression is more general and attains a lower asymptotic error.
The convergence rates differ: GNB needs O(log n) examples, while logistic
regression needs O(n) examples, where n is the dimension of X
Ng and Jordan, On generative vs. discriminative classifiers: a comparison
of Naive Bayes and logistic regression, NIPS 2002
16. Tree-Based Models
CART: classification and regression trees (Breiman et al., 1984); there are many
other variants going by such names as ID3 and C4.5 (Quinlan, 1986; 1993).
(Bishop, Figures 14.5 and 14.6: a recursive binary partitioning of a
two-dimensional input space into regions using axis-aligned splits, and the
corresponding binary tree.)
The first step divides the whole input space into two regions according to
whether x1 ≤ θ1 or x1 > θ1 , where θ1 is a parameter of the model. Each region
can then be subdivided independently; for instance, the region x1 ≤ θ1 is further
subdivided according to whether x2 ≤ θ2 or x2 > θ2 , giving rise to the regions
denoted A and B. For any new input x, we determine which region it falls into by
starting at the root node and following a path down to a specific leaf node
according to the decision criterion at each node. (Such decision trees are not
probabilistic graphical models.)
Classifier
h(x) = ∑_i ji I(x ∈ Ri )
Within each region there is a separate model to predict the target variable: in
regression we might simply predict a constant over each region, and in
classification we might assign each region to a specific class.
A key property of tree-based models, which makes them popular in fields such as
medical diagnosis, is that they are readily interpretable by humans, because they
correspond to a sequence of binary decisions applied to the individual input
variables.
17. Decision Trees
Decision tree representation:
Each internal node tests an attribute (predictor variable)
Each branch corresponds to attribute value
– branching factor > 2 (discrete case)
– binary tree more common (split on a threshold)
Each leaf node assigns a class label
18. Top-Down Induction of Decision Trees
Main loop:
1 A ← the “best” decision attribute for next node
2 For each value of A, create new descendant of node
3 Sort training examples to leaf nodes
4 If training examples perfectly classified, Then STOP, Else iterate over
new leaf nodes
Which attribute is best? Binary classification with two attributes
A1: [29+,35-] splits into t: [21+,5-] and f: [8+,30-]
A2: [29+,35-] splits into t: [18+,33-] and f: [11+,2-]
19. Information Gain
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
Gain(S, A) = expected reduction in entropy due to sorting on A
Gain(S, A) ≡ Entropy(S) − ∑_{v ∈ Values(A)} (|Sv | / |S|) Entropy(Sv ) ≥ 0
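A small Python sketch (not from the slides) computing Gain(S, A) for the A1/A2 example of the previous slide, with each branch summarized by its (positive, negative) counts:
import math

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def gain(parent, branches):
    # parent, branches: (pos, neg) tuples; the branches partition the parent sample.
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in branches)

print(gain((29, 35), [(21, 5), (8, 30)]))   # A1: ≈ 0.27
print(gain((29, 35), [(18, 33), (11, 2)]))  # A2: ≈ 0.12, so A1 is the better split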
20. Extension
General predictor variables x ∈ X
A set of binary splits s at each node, based on a question: is x ∈ A?
where A ⊂ X
the split s sends all (xi , ji ) with "yes" answer to the left child and
"no" answer to the right child
Standard set of questions
Predictor variable xi continuous: is xi ≤ c?
Predictor variable xi categorical: is xi ∈ T ′ ? where T ′ ⊂ T and T is
the set of values of xi
21. Goodness of Split
The goodness of a split is measured by an impurity function defined for
each node
Intuitively, we want each leaf node to be "pure", that is, one class
dominates
Given class probabilities in a node S: p(1|S), . . . , p(J|S), the impurity
function for S (more generally, for the samples in a node) is
i(S) = φ(p(1|S), . . . , p(J|S))
(Source: Jia Li, http://www.stat.psu.edu/∼jiali)
22. Goodness of Split
Examples
Entropy
φ(p1 , . . . , pJ ) = − ∑_j pj log pj
Gini Index
φ(p1 , . . . , pJ ) = ∑_{i≠j} pi pj = 1 − ∑_j pj²
Goodness of a split s for node t,
Φ(s, t) = ∆i(s, t) = i(t) − pR i(tR ) − pL i(tL )
where pR and pL are the proportions of the samples in node t that go
to the right node tR and the left node tL , respectively.
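A small Python sketch (not from the slides) of these impurity functions and of Φ(s, t), with nodes represented as lists of class labels; the example split is made up:
import math
from collections import Counter

def impurity(labels, phi="gini"):
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    if phi == "gini":
        return 1.0 - sum(p * p for p in probs)
    return -sum(p * math.log(p) for p in probs)        # entropy

def goodness_of_split(node, left, right, phi="gini"):
    pL, pR = len(left) / len(node), len(right) / len(node)
    return impurity(node, phi) - pL * impurity(left, phi) - pR * impurity(right, phi)

node = [1, 1, 1, 2, 2, 2]
print(goodness_of_split(node, [1, 1, 1], [2, 2, 2]))   # 0.5: a perfect split
print(goodness_of_split(node, [1, 1, 2], [1, 2, 2]))   # ≈ 0.06: a poor split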
23. Stopping Criteria
A simple criterion: stop splitting a node t when
max_{s∈S} p(t) ∆i(s, t) < β
The above stopping criterion is unsatisfactory
— A node with a small decrease of impurity after one step of splitting
may have a large decrease after multiple levels of splits.
24. CART: Classification and Regression Trees
Two phases: growing and pruning
Growing: input space is recursively partitioned into cells, each cell
corresponding to a leaf node
– training data are fitted well
– but poor performance on test data (overfitting)
Pruning: objective function consists of empirical risk and penalty term
C(T ) = LN (T ) + α|T |
where T ranges over the set of all subtrees obtained by pruning the
original tree.
CART selects T to minimize C(T ), with α selected by cross-validation
25. Probabilistic Setting for Regression
X is the predictor space, and T = R the set of reals
P(x , t) a probability distribution on X × T
A regression function is a function y : X → T
We want to learn y from a training sample
D = {(x1 , t1 ), . . . , (xN , tN )}
How do we measure the performance of a regression function?
Mean Squared Error,
error_y = ∫ (t − y(x))² dP(x, t)
26. Conditional Mean as Optimal Regression Function
Mean Squared Error,
error_y = ∫ (t − y(x))² dP(x, t) = ∫∫ (t − y(x))² p(t|x) dt dP(x)
Optimal regression function
h∗(x) = ∫ t p(t|x) dt
27. Least Squares Problem
(Bishop, Figure 1.3: the sum-of-squares error corresponds to one half of the sum
of the squares of the displacements, shown by vertical green bars, of each data
point tn from the function y(xn , w).)
For a training set D = {(x1 , t1 ), . . . , (xN , tN )}, the empirical risk, or
training error, for y(x, w) is
E(w) = (1/2) ∑_{i=1}^{N} (ti − y(xi , w))²
Because E(w) is a quadratic function of the coefficients w, the minimization has
a unique solution w∗, which can be found in closed form.
28. Linear Least Squares Problem
y (x , w ) = w T f (x ), f (x ) = [f1 (x ), . . . , fn (x )] a set of basis functions,
E(w) = (1/2) ∑_{i=1}^{N} (ti − w^T f (xi ))²
Let t = [ti ], A = [f (xi )^T ], then
min_w ||t − Aw||²₂
Normal equation,
A^T A w = A^T t
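A minimal numpy sketch (not from the slides) solving the normal equations and checking the result against numpy's least-squares solver; the monomial basis and the synthetic data are assumptions for illustration:
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 20)

A = np.vander(x, 4, increasing=True)             # rows f(x_i) = [1, x_i, x_i^2, x_i^3]
w_normal = np.linalg.solve(A.T @ A, A.T @ t)     # solve A^T A w = A^T t
w_lstsq, *_ = np.linalg.lstsq(A, t, rcond=None)
print(np.allclose(w_normal, w_lstsq))            # True: both minimize ||t - A w||_2^2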
29. Overfitting and Regularization
Example taken from C. Bishop
(Bishop, Figure 1.2: a training data set of N = 10 points, shown as blue circles,
each comprising an observation of the input variable x along with the
corresponding target variable t. The green curve shows the function sin(2πx)
used to generate the data. The goal is to predict the value of t for some new
value of x, without knowledge of the green curve.)
Polynomial curve fitting
y(x, w) = w0 + w1 x + · · · + wM x^M
30. Example: Polynomial Curve Fitting
(Bishop, Figure 1.4: plots of polynomials of orders M = 0, 1, 3, 9, shown as red
curves, fitted to the data set shown in Figure 1.2.)
Root-mean-square (RMS) error:
ERMS = sqrt( 2 E(w∗) / N )
The division by N allows us to compare different sizes of data sets on an equal
footing, and the square root ensures that ERMS is measured on the same scale
(and in the same units) as the target variable t.
31. Training Error and Test Error
(Bishop, Figure 1.5: root-mean-square error ERMS , evaluated on the training set
and on an independent test set, plotted against M = 0, . . . , 9.)
We might expect results to improve monotonically as M increases, since the
power series expansion of sin(2πx) contains terms of all orders, and a polynomial
of a given order contains all lower-order polynomials as special cases; the M = 9
polynomial is therefore capable of doing at least as well as the M = 3 polynomial.
Examining the coefficients w∗ for polynomials of various order (Table 1.1), we see
that as M increases the magnitude of the coefficients typically gets larger. For
M = 9 the coefficients have become finely tuned to the data, developing large
positive and negative values: the training error goes to zero (the 10 coefficients
can be tuned exactly to the 10 data points), but y(x, w∗) exhibits wild
oscillations and the test error is very large.
Table 1.1 (Bishop): coefficients w∗ for polynomials of various order
        M = 0    M = 1      M = 3          M = 9
w0∗      0.19     0.82       0.31           0.35
w1∗              -1.27       7.99         232.37
w2∗                        -25.43       -5321.83
w3∗                         17.37       48568.31
w4∗                                   -231639.30
w5∗                                    640042.26
w6∗                                  -1061800.52
w7∗                                   1042400.18
w8∗                                   -557682.99
w9∗                                    125201.43
32. Increasing Training Set Size
(Bishop, Figure 1.6: plots of the solutions obtained by minimizing the
sum-of-squares error function using the M = 9 polynomial for N = 15 data
points (left) and N = 100 data points (right). Increasing the size of the data set
reduces the over-fitting problem.)
For N = 10, the fitted M = 9 polynomial matches each data point exactly, but
between data points (particularly near the ends of the range) it exhibits large
oscillations: the more flexible polynomials with larger values of M become
increasingly tuned to the random noise on the target values.
33. Regularization
(Bishop, Figure 1.7: plots of M = 9 polynomials fitted to the data set of Figure
1.2 using the regularized error function, for ln λ = −18 and ln λ = 0. The case of
no regularizer, i.e. λ = 0, corresponding to ln λ = −∞, is shown at the bottom
right of Figure 1.4.)
One technique often used to control the over-fitting phenomenon is
regularization, which adds a penalty term to the error function in order to
discourage the coefficients from reaching large values. The simplest such penalty
is the sum of squares of all of the coefficients, leading to the modified training
error for y(x, w),
E~(w) = (1/2) ∑_{n=1}^{N} ( y(xn , w) − tn )² + (λ/2) ||w||²
where ||w||² ≡ w^T w = w0² + w1² + · · · + wM², and the coefficient λ governs the
relative importance of the regularization term.
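A minimal numpy sketch (not from the slides) of the regularized solution w = (A^T A + λ I)^{-1} A^T t for the M = 9 polynomial; the synthetic data and the λ values are illustrative assumptions:
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)

A = np.vander(x, 10, increasing=True)                   # M = 9 polynomial basis
for lam in (np.exp(-18), np.exp(-5), 1.0):              # increasing regularization
    w = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ t)
    print(f"lambda = {lam:.0e}, ||w|| = {np.linalg.norm(w):.2f}")
# The norm of w shrinks as lambda increases, matching the behaviour in Table 1.2.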
34. Training Error and Test Error: Regularization
(Bishop, Figure 1.8: root-mean-square error on the training set and on the test
set, plotted against ln λ for the M = 9 polynomial.)
Table 1.2 (Bishop): coefficients w∗ for M = 9 polynomials with various values of
the regularization parameter λ. Note that ln λ = −∞ corresponds to a model
with no regularization, i.e., to the graph at the bottom right in Figure 1.4. As
the value of λ increases, the typical magnitude of the coefficients gets smaller.
        ln λ = −∞    ln λ = −18    ln λ = 0
w0∗          0.35          0.35        0.13
w1∗        232.37          4.74       -0.05
w2∗      -5321.83         -0.77       -0.06
w3∗      48568.31        -31.97       -0.05
w4∗    -231639.30         -3.89       -0.03
w5∗     640042.26         55.28       -0.02
w6∗   -1061800.52         41.32       -0.01
w7∗    1042400.18        -45.95       -0.00
w8∗    -557682.99        -91.53        0.00
w9∗     125201.43         72.68        0.01
The impact of the regularization term on the generalization error can be seen by
plotting the RMS error for both the training and test sets against ln λ (Figure
1.8): in effect λ now controls the effective complexity of the model and hence
determines the degree of over-fitting.
To solve a practical application by minimizing an error function, we need a way
to determine a suitable value for the model complexity. One approach is to
partition the available data into a training set, used to determine the coefficients
w, and a separate validation (hold-out) set, used to optimize the model
complexity. In many cases, however, this proves too wasteful of valuable training
data, and we have to seek more sophisticated approaches.
35. Bias-Variance Decomposition
Let h∗(x) be the conditional mean, h∗(x) = ∫ t p(t|x) dt
For any regression function y(x), taking the expectation over t given x,
E[ (y(x) − t)² | x ] = (y(x) − h∗(x))² + E[ (h∗(x) − t)² | x ]
Given training data D, the algorithm outputs y(x, D)
— average behavior over all D
ED (y(x, D) − h∗(x))² = (ED y(x, D) − h∗(x))² + ED (y(x, D) − ED y(x, D))²
                              (bias)²                    variance
Expected loss = (bias)² + variance + noise
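A small simulation sketch (not from the slides) checking the decomposition numerically for ridge-regularized polynomial fits to sin(2πx) plus noise; the basis, λ, and sample sizes are illustrative assumptions:
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 50)
h_star = np.sin(2 * np.pi * x_grid)                    # conditional mean h*(x)
Phi = lambda x: np.vander(x, 6, increasing=True)       # degree-5 polynomial basis
lam, n_sets, N = 1e-3, 200, 25

preds = []
for _ in range(n_sets):                                # many training sets D
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    A = Phi(x)
    w = np.linalg.solve(A.T @ A + lam * np.eye(6), A.T @ t)
    preds.append(Phi(x_grid) @ w)                      # y(x, D) evaluated on the grid
preds = np.array(preds)

avg = preds.mean(axis=0)                               # ED y(x, D)
bias2 = ((avg - h_star) ** 2).mean()
variance = ((preds - avg) ** 2).mean()
total = ((preds - h_star) ** 2).mean()
print(round(bias2 + variance, 4), round(total, 4))     # the two numbers agree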
36. Bias-Variance and Model Complexity
(Bishop, Figure 3.5: illustration of the dependence of bias and variance on model
complexity, governed by a regularization parameter λ, using the sinusoidal data
set from Chapter 1. There are L = 100 data sets, each having N = 25 data
points, and 24 Gaussian basis functions in the model. The rows correspond to
ln λ = 2.6, −0.31, and −2.4.)
37. Bias-Variance Decomposition
(Bishop, Figure 3.6: squared bias, variance, their sum, and the average test-set
error, plotted against ln λ for the model of Figure 3.5. The minimum value of
(bias)² + variance occurs around ln λ = −0.31, which is close to the value that
gives the minimum error on the test data.)
38. Overfitting
(Figure: accuracy on the training data and on the test data, plotted against the
size of the tree (number of nodes).)
Common problem with most learning algorithms
Given a function space H, a function h ∈ H is said to overfit the
training data if there exists some alternative function h′ ∈ H such
that h has smaller error than h′ over the training examples, but h′ has
smaller error than h over the entire distribution of instances
39. Cross-Validation
Performance on the training set is not a good indicator of predictive
performance on unseen data
If data is plentiful: split into training data, validation data, and test data
— models (different degree polynomials) compared on validation data
— high variance if validation data small
S-fold cross-validation
(Bishop, Figure 1.18: the technique of S-fold cross-validation, illustrated for the
case S = 4, involves taking the available data and partitioning it into S groups
(in the simplest case of equal size). Then S − 1 of the groups are used to train a
set of models that are then evaluated on the remaining group, indicated by the
red blocks. This procedure is repeated for all S possible choices of the held-out
group, and the performance scores from the S runs are averaged.)
When data is particularly scarce, it may be appropriate to consider the case
S = N, where N is the total number of data points, which gives
the leave-one-out technique.
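A minimal Python sketch of S-fold cross-validation (not from the slides); learn is assumed to return a classifier as a callable, and the nearest-class-mean learner and data below are made up for illustration:
import numpy as np

def cross_validate(learn, X, y, S=4, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, S)                     # S groups of roughly equal size
    errors = []
    for i in range(S):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(S) if j != i])
        h = learn(X[train], y[train])
        errors.append(np.mean(h(X[test]) != y[test]))  # error on the held-out group
    return float(np.mean(errors))

def learn_nearest_mean(Xtr, ytr):
    means = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    classes = sorted(means)
    return lambda X: np.array(
        [min(classes, key=lambda c: np.linalg.norm(x - means[c])) for x in X])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(cross_validate(learn_nearest_mean, X, y, S=4))   # average held-out error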
42. Two Definitions of Error
The true error of classifier h with respect to P(x , y ) is the probability
that h will misclassify an instance drawn at random according to P
(population).
errorP (h) ≡ P[h(x) ≠ y]
The sample error of h with respect to a data sample S = {(xi , yi )}_{i=1}^{N} is
the proportion of examples h misclassifies
errorS (h) ≡ (1/N) ∑_{i=1}^{N} δ(yi ≠ h(xi ))
where δ(y ≠ h(x)) is 1 if y ≠ h(x), and 0 otherwise.
How well does errorS (h) estimate errorP (h)?
43. Example
Hypothesis h misclassifies 12 of the 40 examples in S
errorS (h) = 12/40 = .30
What is errorP (h)?
44. Confidence Intervals
If
S contains n examples, drawn independently of h and each other
n ≥ 30
Then
With approximately N% probability, errorP (h) lies in interval
errorS (h) ± zN sqrt( errorS (h)(1 − errorS (h)) / n )
where
N%: 50% 68% 80% 90% 95% 98% 99%
zN : 0.67 1.00 1.28 1.64 1.96 2.33 2.58
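A small Python sketch (not from the slides) of this interval, applied to the example of the previous slide (12 errors out of n = 40):
import math

def error_confidence_interval(error_s, n, z=1.96):       # z = 1.96 for a 95% interval
    half = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half, error_s + half

lo, hi = error_confidence_interval(12 / 40, 40)
print(f"[{lo:.3f}, {hi:.3f}]")                            # roughly [0.158, 0.442]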
45. k-fold Cross-Validated Paired t-test
Comparing two algorithms A and B
— L(S) returns the classifier produced by algorithm L trained on training
data S
1 Randomly partition training data D into k disjoint test sets
T1 , T2 , . . . , Tk of equal size.
2 For i from 1 to k, do
use Ti for the test set, and the remaining data for training set Si
hA ← LA (Si ), hB ← LB (Si )
δi ← errorTi (hA ) − errorTi (hB )
3 Return the value δ̄ ≡ (1/k) ∑_{i=1}^{k} δi
4 Let s_δ̄ ≡ sqrt( (1/(k(k − 1))) ∑_{i=1}^{k} (δi − δ̄)² )
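A small Python sketch (not from the slides) of the resulting test statistic and one-sided p-value; scipy is assumed to be available for the t-distribution, and the per-fold differences δi below are hypothetical:
import math
from scipy import stats

def paired_t_test(deltas):
    k = len(deltas)
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t_hat = d_bar / s
    p_value = 1 - stats.t.cdf(t_hat, df=k - 1)     # one-sided: P(T >= t_hat) under H0
    return t_hat, p_value

deltas = [0.02, 0.03, 0.01, 0.04, 0.02]            # hypothetical errorA - errorB per fold
print(paired_t_test(deltas))                        # large t_hat, small p-value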
46. t-Distribution
Then t̂ ≡ δ̄ / s_δ̄ has an approximate t-distribution with k − 1 degrees of
freedom under the null hypothesis that there’s no difference in the true
errors
47. p-Value and Test of Significance
Null hypothesis: no difference in true errors, versus an alternative
hypothesis
We may be able to demonstrate that the alternative is much more
plausible than the null hypothesis given the data
This is done in terms of a probability (a p-value)
— quantifying the strength of the evidence against the null
hypothesis in favor of the alternative.
Are the data consistent with the null hypothesis?
— use a test statistic, like t̂
— need to know the null distribution of the test statistic (Student’s t
with k − 1 degrees of freedom)
48. p-Value and Test of Significance
For a given data set, we can compute the value of t̂ and see whether
it is
— in the middle of the distribution (consistent with the null
hypothesis)
— out in a tail of the distribution (making the alternative hypothesis
seem more plausible)
Alternative hypothesis ⇒ large positive t̂
— a measure of how far out t̂ is in the right-hand tail of the null
distribution
p-Value is the probability to the right of our test statistic (t̂),
calculated using the null distribution
The smaller the p-value, the stronger the evidence against the
null hypothesis in favor of the alternative
49. p-Value and Test of Significance
When reporting a P-value to persons unfamiliar with statistics, it is often
necessary to use descriptive language to indicate the strength of the evidence;
the cut-offs below are somewhat arbitrary.
P > 0.10: No evidence against the null hypothesis. The data appear to be
consistent with the null hypothesis.
0.05 < P < 0.10: Weak evidence against the null hypothesis in favor of the
alternative.
0.01 < P < 0.05: Moderate evidence against the null hypothesis in favor of the
alternative.
0.001 < P < 0.01: Strong evidence against the null hypothesis in favor of the
alternative.
P < 0.001: Very strong evidence against the null hypothesis in favor of the
alternative.
Level α test (usually α = 0.05 or 0.01): we reject the null hypothesis at level α if
the P-value is smaller than α.
In using this kind of language, one should keep in mind the difference between
statistical significance and practical significance: in a large study one may obtain
a small P-value even though the magnitude of the effect being tested is too small
to be of importance. It is a good idea to support a P-value with a confidence
interval for the parameter being tested.
50. The Power of Tests
When comparing algorithms
— null hypothesis: no difference
— alternative hypothesis: my new algorithm is better
We want a good chance of reporting a small P-value assuming
the alternative hypothesis is true
The power of a level α test: the probability that the null hypothesis will
be rejected at level α (i.e., the p-value will be less than α) assuming
the alternative hypothesis is true
— variability of the data: lower variance, higher power
— sample size: higher N, higher power
— the magnitude of the difference: larger difference, higher power