Machine Learning and Web Search
                          Part 1: Basics of Machine Learning


                                     Hongyuan Zha

                                    College of Computing
                               Georgia Institute of Technology




Outline

1    Classification Problems
       Bayes Error and Risk
       Naive Bayes Classifier
       Logistic Regression
       Decision Trees

2    Regression Problems
       Least Squares Problem
       Regularization
       Bias-Variance Decomposition

3    Cross-Validation and Comparison
       Cross-Validation
       p-Value and Test of Significance


Supervised Learning




We predict a target variable based on a set of predictor variables, using
a training set of examples.
     Classification: predict a discrete target variable
     – spam filtering based on message contents
     Regression: predict a continuous target variable
     – predict income based on other demographic information




Probabilistic Setting for Classification


     X is the predictor space, and C = {1, . . . , k} the set of class labels
     P(x , j) a probability distribution on X × C
     A classifier is a function h : X → C
     We want to learn h from a training sample

                                  D = {(x1 , j1 ), . . . , (xN , jN )}

How do we measure the performance of a classifier?
     Misclassification error,

                               errorh = P({(x , j) | h(x ) ≠ j}),

     where (x , j) ∼ P(x , j)


Bayes Classifier


     P(x , j) = P(x |j)P(j) = P(j|x )P(x )
     Assume X continuous, and

                               P(x ∈ A, j) = ∫_A p(j|x ) p(x ) dx

     errorh = P(h(x ) ≠ j)

              = ∫_X p(h(x ) ≠ j | x ) p(x ) dx = ∫_X (1 − p(h(x )|x )) p(x ) dx

     Bayes error = min_h errorh , attained by the Bayes classifier

                               h∗ (x ) = argmax_j p(j|x )



Risk Minimization


     Loss function L(i, j) = C (i|j): the cost when the true class j is predicted to be class i
     Risk for h = expected loss for h,

                               Rh = ∫ Σ_{j=1}^k L(h(x ), j) p(j|x ) p(x ) dx

     Minimizing the risk ⇒ Bayes classifier

                               h∗ (x ) = argmin_j Σ_{ℓ=1}^k C (j|ℓ) p(ℓ|x )

     0/1-loss ⇒ errorh



Naive Bayes Classifier

     Use Bayes rule
                               p(j|x ) ∝ p(x |j) p(j)
     Feature vector x = [t1 , . . . , tn ]. Conditional independence assumption

                               p(x |j) = p(t1 |j) p(t2 |j) · · · p(tn |j)

     MLE for p(ti |j), smoothed version (m-estimate),

                               p(ti |j) = (nc + m·p) / (n + m)

     n: the number of training examples in class j
     nc : the number of examples in class j with attribute value ti
     p: a prior estimate for p(ti |j)
     m: the equivalent sample size

Naive Bayes Classifier: Example

Car theft example: the attributes are Color, Type, and Origin; the target,
Stolen, is either Yes or No.

     Example No.   Color    Type     Origin     Stolen?
     1             Red      Sports   Domestic   Yes
     2             Red      Sports   Domestic   No
     3             Red      Sports   Domestic   Yes
     4             Yellow   Sports   Domestic   No
     5             Yellow   Sports   Imported   Yes
     6             Yellow   SUV      Imported   No
     7             Yellow   SUV      Imported   Yes
     8             Yellow   SUV      Domestic   No
     9             Red      SUV      Imported   No
     10            Red      Sports   Imported   Yes

We want to classify a Red Domestic SUV; note there is no example of a Red
Domestic SUV in the data. Each conditional probability is estimated with the
smoothed formula. For P(Red|Yes): n = 5 (cases with Stolen = Yes), nc = 3
(of those, 3 have Color = Red), p = 1/(number of attribute values) = 0.5, and
m = 3 (the value of m is arbitrary, but kept consistent for all attributes):

     P(Red|Yes)      = (3 + 3·0.5)/(5 + 3) = .56    P(Red|No)      = (2 + 3·0.5)/(5 + 3) = .43
     P(SUV|Yes)      = (1 + 3·0.5)/(5 + 3) = .31    P(SUV|No)      = (3 + 3·0.5)/(5 + 3) = .56
     P(Domestic|Yes) = (2 + 3·0.5)/(5 + 3) = .43    P(Domestic|No) = (3 + 3·0.5)/(5 + 3) = .56

With P(Yes) = P(No) = .5, we multiply each class prior by its three
conditionals and compare (next slide).
Naive Bayes Classifier: Example




     To classify a Red Domestic SUV:
     For Yes:
     P(Yes) · P(Red|Yes) · P(SUV|Yes) · P(Domestic|Yes)
     = .5 · .56 · .31 · .43 = .037
     For No:
     P(No) · P(Red|No) · P(SUV|No) · P(Domestic|No)
     = .5 · .43 · .56 · .56 = .069
     Since .069 > .037, the example is classified as No




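As a check on the arithmetic above, here is a minimal Python sketch (not from the slides) that recomputes the m-estimate conditionals and the two class scores for the car theft data; names such as `m_estimate` and `score` are illustrative.

```python
# Minimal sketch: Naive Bayes with m-estimate smoothing on the car theft data.
data = [  # (Color, Type, Origin, Stolen?)
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]

def m_estimate(attr_index, value, label, m=3, p=0.5):
    """Smoothed estimate (nc + m*p) / (n + m) of P(value | label)."""
    in_class = [row for row in data if row[-1] == label]
    n = len(in_class)                                           # examples in the class
    nc = sum(1 for row in in_class if row[attr_index] == value)
    return (nc + m * p) / (n + m)

def score(label, x):
    """Class prior times the product of smoothed conditionals for attributes x."""
    s = sum(1 for row in data if row[-1] == label) / len(data)  # P(label)
    for i, value in enumerate(x):
        s *= m_estimate(i, value, label)
    return s

x = ("Red", "SUV", "Domestic")
for label in ("Yes", "No"):
    # Yes ~ 0.038 and No ~ 0.069 (the slides round the factors first, giving .037/.069),
    # so the example is classified as No.
    print(label, round(score(label, x), 3))
```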
Learning to Classify Text

Target concept Interesting : Document → {+, −}
  1   Example: document classification using BOW
      — Multiple Bernoulli model
      — Multinomial model
            one attribute per word position in document
  2   Learning: Use training examples to estimate
            P(+), P(−), P(doc|+), P(doc|−)
Naive Bayes conditional independence assumption
                               P(doc|j) = ∏_{i=1}^{length(doc)} P(ti |j)

where P(ti |j) is the probability that word ti appears in class j


Learn_naive_Bayes_text(Examples, k)
  1. collect all words and other tokens that occur in Examples
     Vocabulary ← all distinct words and other tokens in Examples
  2. calculate the required P(j) and P(ti |j) probability terms
    For each target value j in {1, . . . , k} do
            docsj ← subset of Examples for which the target value is j
                      |docsj |
            P(j) ← |Examples|
            Textj ← a single document created by concatenating all members of
            docsj
            n ← total number of words in Textj (counting duplicate words multiple
            times)
             for each word ti in Vocabulary
                    ni ← number of times word ti occurs in Textj
                    P(ti |j) ← (ni + 1) / (n + |Vocabulary|)




Classify_naive_Bayes_text(Doc)
     positions ← all word positions in Doc that contain tokens found in
     Vocabulary
     Return jNB , where

                               jNB = argmax_j P(j) ∏_{i ∈ positions} P(ti |j)



When k ≫ 1, a special smoothing method is needed:
Congle Zhang et al., Web-scale classification with Naive Bayes, WWW
2009




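Below is a compact Python sketch of Learn_naive_Bayes_text and Classify_naive_Bayes_text, assuming documents are plain strings split on whitespace; the toy spam-style examples and the use of log-probabilities (rather than the literal product) are illustrative choices, not part of the slides.

```python
from collections import Counter
import math

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, label). Returns priors, word probs, vocabulary."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for j in {label for _, label in examples}:
        docs_j = [doc for doc, label in examples if label == j]
        priors[j] = len(docs_j) / len(examples)
        counts = Counter(w for doc in docs_j for w in doc.split())   # word counts of Text_j
        n = sum(counts.values())                                     # total words in Text_j
        word_probs[j] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return priors, word_probs, vocabulary

def classify_naive_bayes_text(doc, priors, word_probs, vocabulary):
    """Return argmax_j of log P(j) + sum over known word positions of log P(t_i | j)."""
    positions = [w for w in doc.split() if w in vocabulary]
    def log_score(j):
        return math.log(priors[j]) + sum(math.log(word_probs[j][w]) for w in positions)
    return max(priors, key=log_score)

examples = [("cheap pills buy now", "-"), ("meeting agenda attached", "+"),
            ("buy cheap meds", "-"), ("project meeting notes", "+")]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("cheap meds now", *model))           # prints '-'
```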
Logistic Regression



In Naive Bayes, the discriminant function is

                               P(j) ∏_i P(ti |j)

Let ni be the frequency of ti , and take the log of the above

                               log P(j) + Σ_i ni log P(ti |j)

which is a linear function of the frequency vector x = [n1 , . . . , nV ]T




Logistic Regression


     More generally,

              P(j|x ) = (1/Z (x )) exp(wj^T x ) ≡ (1/Zw (x )) exp(w^T f (x , j))

     Given a training sample L = {(x1 , j1 ), . . . , (xN , jN )}, we minimize the
     regularized negative conditional log-likelihood,

              min_w  (1/2) ||w ||² + C Σ_i [ log Zw (xi ) − w^T f (xi , ji ) ]

     a convex function of w
     C.J. Lin et al., Trust region Newton methods for large-scale logistic
     regression, ICML 2007 (several million features, N = 10^5)


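For concreteness, here is a minimal sketch of this kind of objective in the binary case, where the model reduces to the familiar sigmoid form and the regularized negative log-likelihood is minimized by plain gradient descent; the synthetic data, step size, and iteration count are illustrative.

```python
import numpy as np

# Binary L2-regularized logistic regression trained by gradient descent on
# (1/2)||w||^2 + C * sum_i log(1 + exp(-y_i w^T x_i)); data and hyperparameters
# are illustrative, not from the slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = np.where(X @ w_true + 0.3 * rng.normal(size=200) > 0, 1.0, -1.0)   # labels in {-1, +1}

C, lr = 1.0, 0.01
w = np.zeros(3)
for _ in range(1000):
    margins = y * (X @ w)
    grad = w + C * (X.T @ (-y / (1.0 + np.exp(margins))))   # regularizer + loss gradient
    w -= lr * grad

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```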
Generative vs. Discriminative Classifiers



     Naive Bayes estimates parameters for P(j), P(x |j) while logistic
     regression estimates parameters for P(j|x )
     Naive Bayes: generative classifier
     Logistic regression: discriminative classifier
     Logistic regression is more general and has a lower asymptotic error;
     the convergence rates differ: GNB needs O(log n) examples, logistic
     regression O(n) examples, where n is the dimension of X

Ng and Jordan, On generative vs. discriminative classifiers: a comparison
of Naive Bayes and logistic regression, NIPS 2002




Tree-Based Models

[Figure (Bishop, Figures 14.5 and 14.6): recursive binary partitioning of a
two-dimensional input space by axis-aligned splits at thresholds θ1 , . . . , θ4 ,
and the corresponding binary tree whose leaves are the regions A–E. The first
split divides the whole input space into two regions according to whether
x1 ≤ θ1 or x1 > θ1 ; each region can then be subdivided independently.]

     Partition the input space into cuboid regions, with edges aligned with
     the axes
     Classifier
                               h(x ) = Σ_i j_i I(x ∈ R_i )

     where R_i are the regions and j_i is the class label assigned to R_i
     CART (classification and regression trees, Breiman et al., 1984); other
     variants include ID3 and C4.5 (Quinlan)
Decision Trees




Decision tree representation:
     Each internal node tests an attribute (predictor variable)
     Each branch corresponds to attribute value
     – branching factor > 2 (discrete case)
     – binary tree more common (split on a threshold)
     Each leaf node assigns a class label




Top-Down Induction of Decision Trees


Main loop:
  1    A ← the “best” decision attribute for next node
  2    For each value of A, create new descendant of node
  3    Sort training examples to leaf nodes
  4    If training examples perfectly classified, Then STOP, Else iterate over
       new leaf nodes
Which attribute is best? Binary classification with two attributes

 [29+,35-]        A1=?          [29+,35-]        A2=?

           t      f                      t       f


      [21+,5-]    [8+,30-]         [18+,33-]     [11+,2-]




Information Gain

     S is a sample of training examples
     p⊕ is the proportion of positive examples in S
     p⊖ is the proportion of negative examples in S
     Entropy measures the impurity of S

                               Entropy (S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

     Gain(S, A) = expected reduction in entropy due to sorting on A

          Gain(S, A) ≡ Entropy (S) − Σ_{v ∈ Values(A)} (|Sv |/|S|) Entropy (Sv ) ≥ 0



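A quick sketch computing these quantities for the [29+, 35−] split example two slides back; the helper functions (`entropy`, `gain`) are illustrative.

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            h -= p * math.log2(p)
    return h

def gain(parent, children):
    """Entropy(parent) minus the size-weighted entropies of the child nodes."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

# The [29+, 35-] sample split on A1 versus A2 (from the previous slide)
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))    # A1: 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))   # A2: 0.12
```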
Extension



General predictor variables x ∈ X
     A set of binary splits s at each node, based on a question: is x ∈ A?
     where A ⊂ X
     the split s sends all (xi , ji ) with a "yes" answer to the left child and
     a "no" answer to the right child
     Standard set of questions
            Predictor variable xi continuous: is xi ≤ c?
            Predictor variable xi categorical: is xi ∈ T′ ? where T′ ⊂ T and T
            is the set of values of xi




Goodness of Split

     The goodness of a split is measured by an impurity function defined for
     each node
     Intuitively, we want each leaf node to be "pure", that is, one class
     dominates
     Given class probabilities in a node S: p(1|S), . . . , p(J|S), the impurity
     function for S (more generally, for the samples in a node) is

                               i(S) = φ(p(1|S), . . . , p(J|S))

     (source: Jia Li, http://www.stat.psu.edu/∼jiali)
Goodness of Split


     Examples
            Entropy
                               φ(p1 , . . . , pJ ) = − Σ_j pj log pj

            Gini index
                               φ(p1 , . . . , pJ ) = Σ_{i≠j} pi pj = 1 − Σ_j pj²

     Goodness of a split s for node t,

                       Φ(s, t) = ∆i(s, t) = i(t) − pR i(tR ) − pL i(tL )

     where pR and pL are the proportions of the samples in node t that go
     to the right child tR and the left child tL , respectively.



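A sketch of the two impurity functions and of Φ(s, t), applied to the [29+, 35−] split from the information-gain slide; representing a node by its vector of class counts is an assumption made for illustration.

```python
import math

def entropy_phi(probs):
    """phi(p_1, ..., p_J) = -sum_j p_j log p_j."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gini_phi(probs):
    """phi(p_1, ..., p_J) = 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in probs)

def impurity(counts, phi):
    """i(S) = phi(p(1|S), ..., p(J|S)) for a node given by its class counts."""
    total = sum(counts)
    return phi([c / total for c in counts])

def goodness_of_split(parent, left, right, phi=gini_phi):
    """Phi(s, t) = i(t) - p_R * i(t_R) - p_L * i(t_L)."""
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n
    return impurity(parent, phi) - p_r * impurity(right, phi) - p_l * impurity(left, phi)

# The A1 split of the [29+, 35-] sample, scored with both impurity functions
print(round(goodness_of_split([29, 35], [21, 5], [8, 30]), 3))                   # Gini
print(round(goodness_of_split([29, 35], [21, 5], [8, 30], phi=entropy_phi), 3))  # entropy (nats)
```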
Stopping Criteria




     A simple criterion: stop splitting a node t when

                               max_{s∈S} p(t) ∆i(s, t) < β

     This stopping criterion is unsatisfactory
     — a node with a small decrease of impurity after one step of splitting
     may have a large decrease after multiple levels of splits.




CART: Classification and Regression Trees


     Two phases: growing and pruning
     Growing: input space is recursively partitioned into cells, each cell
     corresponding to a leaf node
     – training data are fitted well
     – but poor performance on test data (overfitting)
     Pruning: the objective function consists of an empirical risk plus a
     penalty term,
                               C (T ) = LN (T ) + α|T |

     where T ranges over all subtrees obtained by pruning the original tree.
     CART selects the T that minimizes C (T ), with α chosen by
     cross-validation



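The grow-then-prune recipe is what scikit-learn's cost-complexity pruning implements (the `ccp_alpha` parameter, available in scikit-learn ≥ 0.22); the sketch below selects α by cross-validation as the slide describes, on a synthetic dataset chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate alphas from the cost-complexity pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Pick alpha by cross-validation, then refit on all the data
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=5)
search.fit(X, y)
print("chosen alpha:", search.best_params_["ccp_alpha"])
print("pruned tree leaves:", search.best_estimator_.get_n_leaves())
```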
Probabilistic Setting for Regression


     X is the predictor space, and T = R the set of reals
     P(x , t) a probability distribution on X × T
     A regression function is a function h : X → T
     We want to learn h from a training sample

                                 D = {(x1 , t1 ), . . . , (xN , tN )}

How do we measure the performance of a regression function?
     Mean squared error,

                               errorh = ∫ (t − h(x ))² dP(x , t)




Conditional Mean as Optimal Regression Function




     Mean squared error,

          errorh = ∫ (t − h(x ))² dP(x , t) = ∫ [ ∫ (t − h(x ))² p(t|x ) dt ] dP(x )

     Optimal regression function

                               h∗ (x ) = ∫ t p(t|x ) dt

Least Squares Problem

[Figure 1.3 (Bishop): the sum-of-squares error corresponds to (one half of)
the sum of the squares of the displacements (the vertical distances) of each
data point (xn , tn ) from the fitted function y (xn , w ).]

     For a training set D = {(x1 , t1 ), . . . , (xN , tN )}, the empirical risk, or
     training error, for y (x , w ) is

                               E (w ) = (1/2) Σ_{i=1}^N (ti − y (xi , w ))²

     Because E (w ) is quadratic in the coefficients w, the minimizer w∗ can
     be found in closed form.
Linear Least Squares Problem



     y (x , w ) = w^T f (x ), where f (x ) = [f1 (x ), . . . , fn (x )]^T is a set of basis functions,

                               E (w ) = (1/2) Σ_{i=1}^N (ti − w^T f (xi ))²

     Let t = [ti ] and A = [f (xi )^T ] (the i-th row of A is f (xi )^T ); then

                               min_w || t − Aw ||²₂

     Normal equation,
                               A^T A w = A^T t




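A minimal numpy sketch of this slide: a design matrix A built from an (arbitrarily chosen) polynomial basis, the normal-equation solution, and `np.linalg.lstsq` for comparison, which is the numerically preferred route in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy targets

# Design matrix A with rows f(x_i) = [1, x_i, x_i^2, x_i^3] (polynomial basis)
A = np.vander(x, N=4, increasing=True)

# Normal equations A^T A w = A^T t ...
w_normal = np.linalg.solve(A.T @ A, A.T @ t)
# ... versus the least-squares solver (better conditioned in practice)
w_lstsq, *_ = np.linalg.lstsq(A, t, rcond=None)

print(np.allclose(w_normal, w_lstsq))        # True: same minimizer
print("training error:", 0.5 * np.sum((t - A @ w_normal) ** 2))
```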
Overfitting and Regularization

Example taken from C. Bishop

[Figure 1.2 (Bishop): a training data set of N = 10 points (blue circles),
each an observation of the input variable x along with the corresponding
target t, generated from sin(2πx) plus noise. The goal is to predict t for
new values of x, without knowledge of the green curve sin(2πx).]

     Polynomial curve fitting

                  y (x , w ) = w0 + w1 x + · · · + wM x^M
[Figure 1.4 (Bishop): polynomials of order M = 0, 1, 3, and 9 (red curves)
fitted to the data set of Figure 1.2. The low-order fits are poor; M = 3 fits
the underlying curve well; M = 9 passes through every data point but
oscillates wildly between them.]

     Root-mean-square (RMS) error

                               E_RMS = sqrt( 2E (w∗ )/N )

     where the division by N allows us to compare different sizes of data
     sets on an equal footing, and the square root ensures that E_RMS is
     measured on the same scale (and in the same units) as the target t.
Training Error and Test Error

[Figure (Bishop): E_RMS on the training set and on the test set, plotted
against the polynomial order M = 0, . . . , 9. The training error keeps
decreasing and reaches zero at M = 9, while the test error becomes large.]

     Table 1.1 (Bishop): coefficients w∗ for polynomials of various order.
     The typical magnitude of the coefficients increases dramatically as the
     order of the polynomial increases.

                  M = 0      M = 1      M = 3       M = 9
     w0           0.19       0.82       0.31        0.35
     w1                     -1.27       7.99        232.37
     w2                                -25.43      -5321.83
     w3                                 17.37       48568.31
     w4                                            -231639.30
     w5                                             640042.26
     w6                                            -1061800.52
     w7                                             1042400.18
     w8                                            -557682.99
     w9                                             125201.43

     For M = 9 the training error goes to zero: the 10 coefficients give
     enough degrees of freedom to be tuned exactly to the 10 data points,
     but the coefficients become very large and, as seen in Figure 1.4, the
     fitted curve exhibits wild oscillations.
Increasing Training Set Size

[Figure 1.6 (Bishop): the M = 9 polynomial fitted by minimizing the
sum-of-squares error to N = 15 data points (left) and N = 100 data points
(right). Increasing the size of the data set reduces the over-fitting
problem.]

     With few data points, the flexible M = 9 polynomial becomes increasingly
     tuned to the random noise on the target values; with more data the same
     model no longer overfits.
Regularization

[Figure 1.7 (Bishop): M = 9 polynomials fitted to the data set of Figure 1.2
using the regularized error function, for ln λ = −18 and ln λ = 0. The case
of no regularizer, i.e., λ = 0 (ln λ = −∞), is the bottom-right plot of
Figure 1.4.]

     One technique often used to control over-fitting is regularization: add
     a penalty term to the error function to discourage the coefficients from
     reaching large values. For the training set D = {(x1 , t1 ), . . . , (xN , tN )},
     the simplest penalized training error for y (x , w ) is

          Ê(w ) = (1/2) Σ_{i=1}^N (ti − y (xi , w ))² + (λ/2) ||w ||²

     where ||w ||² ≡ w^T w = w0² + w1² + · · · + wM² , and the coefficient λ
     governs the relative importance of the penalty term.
Training Error and Test Error: Regularization

[Figure 1.8 (Bishop): E_RMS on the training and test sets plotted against
ln λ for the M = 9 polynomial. λ controls the effective complexity of the
model and hence determines the degree of over-fitting.]

     Table 1.2 (Bishop): coefficients w∗ for the M = 9 polynomial with
     various values of the regularization parameter λ (ln λ = −∞ corresponds
     to no regularization, the bottom-right plot in Figure 1.4). As λ
     increases, the typical magnitude of the coefficients gets smaller.

                  ln λ = −∞        ln λ = −18     ln λ = 0
     w0           0.35             0.35           0.13
     w1           232.37           4.74           -0.05
     w2           -5321.83         -0.77          -0.06
     w3           48568.31         -31.97         -0.05
     w4           -231639.30       -3.89          -0.03
     w5           640042.26        55.28          -0.02
     w6           -1061800.52      41.32          -0.01
     w7           1042400.18       -45.95         -0.00
     w8           -557682.99       -91.53         0.00
     w9           125201.43        72.68          0.01

     To solve a practical application this way we would need to determine a
     suitable value for the model complexity, e.g. by setting aside a
     hold-out (validation) set; when data is scarce this is wasteful, and
     more sophisticated approaches are needed.
Bias-Variance Decomposition


     Let h∗ (x ) be the conditional mean, h∗ (x ) = ∫ t p(t|x ) dt
     For any regression function y (x ),

          ∫ (y (x ) − t)² dP(x , t) = ∫ (y (x ) − h∗ (x ))² dP(x ) + ∫ (h∗ (x ) − t)² dP(x , t)

     Given training data D, the algorithm outputs y (x , D)
     — average behavior over all D:

          ED (y (x , D) − h∗ (x ))² = (ED y (x , D) − h∗ (x ))² + ED (y (x , D) − ED y (x , D))²
                                              (bias)²                         variance

     Expected loss = (bias)² + variance + noise



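The decomposition can be estimated by simulation, in the spirit of the Bishop figures on the next slides: draw many training sets from sin(2πx) plus noise, fit a model to each, and average. The data set sizes, noise level, and polynomial orders below are illustrative choices, not from the slides.

```python
import numpy as np

def h_star(x):                                   # conditional mean (noise-free target)
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x_test = np.linspace(0.05, 0.95, 50)
L, N, sigma = 100, 25, 0.3                       # number of data sets, points per set, noise level

for M in (1, 3, 9):                              # model complexity = polynomial order
    preds = np.empty((L, x_test.size))
    for l in range(L):
        x = rng.uniform(0, 1, N)
        t = h_star(x) + sigma * rng.normal(size=N)
        w = np.polyfit(x, t, deg=M)              # unregularized least-squares fit
        preds[l] = np.polyval(w, x_test)
    bias2 = np.mean((preds.mean(axis=0) - h_star(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"M={M}: (bias)^2 = {bias2:.3f}  variance = {variance:.3f}  noise = {sigma ** 2:.3f}")
```

Low-order models show high bias and low variance; high-order models the reverse, which is the trade-off the following slides illustrate with regularization.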
Bias-Variance and Model Complexity

[Figure 3.5 (Bishop): dependence of bias and variance on model complexity,
governed by the regularization parameter λ, on the sinusoidal data set. There
are L = 100 data sets, each with N = 25 data points, and 24 Gaussian basis
functions in the model. The three rows correspond to ln λ = 2.6, −0.31, and
−2.4: large λ gives smooth, very similar fits (high bias, low variance),
small λ gives wiggly fits that vary greatly across data sets (low bias, high
variance).]
Bias-Variance Decomposition

[Figure 3.6 (Bishop): squared bias, variance, their sum, and the average test
error, plotted against ln λ for the experiment of the previous slide. The
minimum of (bias)² + variance occurs around ln λ = −0.31, which is close to
the value of λ that gives the minimum test error.]
Overfitting

[Figure: accuracy of the learned tree on the training data and on the test
data, plotted against the size of the tree (number of nodes).]

     Common problem with most learning algorithms
     Given a function space H, a function h ∈ H is said to overfit the
     training data if there exists some alternative function h′ ∈ H such
     that h has smaller error than h′ over the training examples but h′ has
     smaller error than h over the entire distribution of instances


Cross-Validation


     Performance on the training set is not a good indicator of predictive
     performance on unseen data
     If data is plentiful, split it into training data, validation data and test data
     — models (different degree polynomials) compared on validation data
     — high variance if validation data small
     S-fold cross-validation (Bishop, Figure 1.18): partition the available
     data into S groups (in the simplest case, of equal size); train on S − 1
     of the groups and evaluate on the remaining group; repeat for all S
     possible choices of the held-out group and average the S performance
     scores
     — when data is particularly scarce, take S = N , the total number of
     data points, which gives the leave-one-out technique
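A plain-numpy sketch of the S-fold procedure, here used to compare polynomial orders on the running sin(2πx) example; the function name `s_fold_cv_error` and the particular data are illustrative.

```python
import numpy as np

def s_fold_cv_error(x, t, degree, S=4, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial over S folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, S)
    errors = []
    for i in range(S):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(S) if j != i])
        w = np.polyfit(x[train], t[train], deg=degree)
        errors.append(np.mean((t[test] - np.polyval(w, x[test])) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

for M in (1, 3, 9):
    print(f"M={M}: CV error = {s_fold_cv_error(x, t, M):.3f}")
```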
Cross-Validation

                       Which kind of cross-validation?

                   Downside                           Upside
     Test-set      Variance: unreliable estimate      Cheap
                   of future performance
     Leave-        Expensive.                         Doesn't waste data
     one-out       Has some weird behavior
     10-fold       Wastes 10% of the data.            Only wastes 10%. Only
                   10 times more expensive            10 times more expensive
                   than test-set                      instead of R times.
     3-fold        Wastes more data than 10-fold.     Slightly better than
                   More expensive than test-set       test-set
     R-fold        Identical to leave-one-out

Andrew Moore's slides
CV-based Model Choice

     Example: choosing the number of hidden units in a one-hidden-layer
     neural net
     Step 1: Compute the 10-fold CV error for six different model classes:

                  Algorithm        TRAINERR     10-FOLD-CV-ERR     Choice
                  0 hidden units
                  1 hidden unit
                  2 hidden units                                   ✓
                  3 hidden units
                  4 hidden units
                  5 hidden units

     Step 2: Whichever model class gave the best CV score: train it with all
     the data, and that is the predictive model you will use.

Andrew Moore's slides
Two Definitions of Error


The true error of classifier h with respect to P(x , y ) is the probability
that h will misclassify an instance drawn at random according to P
(population):
                               errorP (h) ≡ P[h(x ) ≠ y ]
The sample error of h with respect to a data sample S = {(xi , yi )}_{i=1}^N is
the proportion of examples h misclassifies:

                               errorS (h) ≡ (1/N) Σ_{i=1}^N δ(yi ≠ h(xi ))

where δ(y ≠ h(x )) is 1 if y ≠ h(x ), and 0 otherwise.
How well does errorS (h) estimate errorP (h)?



Example




Hypothesis h misclassifies 12 of the 40 examples in S

                               errorS (h) = 12/40 = .30
What is errorP (h)?




Confidence Intervals


If
         S contains n examples, drawn independently of h and each other
         n ≥ 30
Then
          With approximately N% probability, errorP (h) lies in the interval

               errorS (h) ± zN · sqrt( errorS (h)(1 − errorS (h)) / n )

         where
                     N%:           50%    68%       80%        90%          95%    98%    99%
                     zN :          0.67   1.00      1.28       1.64         1.96   2.33   2.58



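A quick check of the running example (12 errors out of n = 40) at the 95% level, using the zN table above.

```python
import math

error_s = 12 / 40                    # sample error from the example two slides back
n, z_95 = 40, 1.96                   # sample size and z_N for N = 95%
half_width = z_95 * math.sqrt(error_s * (1 - error_s) / n)
print(f"95% interval: {error_s:.2f} +/- {half_width:.2f}")   # 0.30 +/- 0.14
```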
k-fold Cross-Validated Paired t-test


Comparing two algorithms A and B
— L(S) returns the classifier produced by algorithm L when trained on
training data S
  1   Randomly partition training data D into k disjoint test sets
      T1 , T2 , . . . , Tk of equal size.
  2   For i from 1 to k, do
      use Ti for the test set, and the remaining data for training set Si
          hA ← LA (Si ), hB ← LB (Si )
           δi ← errorTi (hA ) − errorTi (hB )
  3   Return the value δ̄ ≡ (1/k) Σ_{i=1}^k δi
  4   Let sδ̄ ≡ sqrt( (1/(k(k−1))) Σ_{i=1}^k (δi − δ̄)² )

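A sketch of the whole procedure, with scikit-learn classifiers standing in for the two algorithms (Gaussian naive Bayes vs. logistic regression, echoing earlier slides) and scipy supplying the t-distribution; the synthetic data and the one-sided right-tail p-value anticipate the next slides.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
k = 10

deltas = []
for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    h_a = GaussianNB().fit(X[train], y[train])
    h_b = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    err_a = np.mean(h_a.predict(X[test]) != y[test])
    err_b = np.mean(h_b.predict(X[test]) != y[test])
    deltas.append(err_a - err_b)                 # delta_i = error_{T_i}(h_A) - error_{T_i}(h_B)

deltas = np.asarray(deltas)
delta_bar = deltas.mean()
s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
t_hat = delta_bar / s_delta
p_value = stats.t.sf(t_hat, df=k - 1)            # right-tail probability (see the next slides)
print(f"t_hat = {t_hat:.2f}, p-value = {p_value:.3f}")
```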
t-Distribution




Then t̂ ≡ δ̄ / s_δ̄ has an approximate t-distribution with k − 1 degrees of
freedom under the null hypothesis that there is no difference in the true
errors
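Continuing the sketch from the previous slide, t̂ and the corresponding right-tail p-value can be computed with SciPy; delta_bar, s_delta, and k are the hypothetical quantities returned by that sketch.

    from scipy.stats import t as t_dist

    def paired_t_pvalue(delta_bar, s_delta, k):
        t_hat = delta_bar / s_delta              # test statistic from the slide above
        p_value = t_dist.sf(t_hat, df=k - 1)     # probability to the right of t_hat under the null
        return t_hat, p_value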


 Hongyuan Zha (Georgia Tech)   Machine Learning and Web Search            46 / 50
p-Value and Test of Significance


     Null hypothesis: no difference in true errors; alternative hypothesis:
     there is a difference
     We may be able to demonstrate that the alternative is much more
     plausible than the null hypothesis given the data
     This is done in terms of a probability (a p-value)
     — quantifying the strength of the evidence against the null
     hypothesis in favor of the alternative.
     Are the data consistent with the null hypothesis?
     — use a test statistic, like t̂
     — need to know the null distribution of the test statistic (Student t
     with k − 1 degrees of freedom)



 Hongyuan Zha (Georgia Tech)   Machine Learning and Web Search          47 / 50
p-Value and Test of Significance

     For a given data set, we can compute the value of t̂, and see whether
     it is
     — in the middle of the distribution (consistent with the null
     hypothesis)
     — out in a tail of the distribution (making the alternative hypothesis
     seem more plausible)
     Alternative hypothesis ⇒ large positive t̂
     — a measure of how far out t̂ is in the right-hand tail of the null
     distribution
     p-Value is the probability to the right of our test statistic (t̂),
     calculated using the null distribution
     The smaller the P-value, the stronger the evidence against the
     null hypothesis in favor of the alternative


 Hongyuan Zha (Georgia Tech)   Machine Learning and Web Search          48 / 50
p-Value and Test of Significance

 When reporting a P-value to persons unfamiliar with statistics, it is often necessary to use
 descriptive language to indicate the strength of the evidence. I tend to use the following
 sort of language. Obviously the cut-offs are somewhat arbitrary and another person might
 use different language.

 P > 0.10                 No evidence against the null hypothesis. The data appear to be
                          consistent with the null hypothesis.
 0.05 < P < 0.10          Weak evidence against the null hypothesis in favor of the alternative.
 0.01 < P < 0.05          Moderate evidence against the null hypothesis in favor of the
                          alternative.
 0.001 < P < 0.01         Strong evidence against the null hypothesis in favor of the
                          alternative.
 P < 0.001                Very strong evidence against the null hypothesis in favor of the
                          alternative.

 In using this kind of language, one should keep in mind the difference between statistical
 significance and practical significance. In a large study one may obtain a small P-value
 even though the magnitude of the effect being tested is too small to be of importance (see
 the discussion of power below). It is a good idea to support a P-value with a confidence
 interval for the parameter being tested.

Level α (usually 0.05 or 0.01) test: We reject the null hypothesis at level
α if the P-value is smaller than α

 Hongyuan Zha (Georgia Tech)            Machine Learning and Web Search                49 / 50
The Power of Tests


     When comparing algorithms
     — null hypothesis: no difference
     — alternative hypothesis: my new algorithm is better
     We want to have a good chance of reporting a small P-value assuming
     the alternative hypothesis is true
     The power of a level α test: the probability that the null hypothesis will
     be rejected at level α (i.e., the p-value will be less than α) assuming
     the alternative hypothesis is true
     — variability of the data: lower variance, higher power
     — sample size: larger N, higher power
     — the magnitude of the difference: larger difference, higher power
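The effect of these three factors can be illustrated with a small simulation of the k-fold paired t-test; the Gaussian model for the per-fold differences and every parameter value below are assumptions chosen purely for illustration.

    import numpy as np
    from scipy.stats import t as t_dist

    def simulated_power(effect=0.02, sigma=0.03, k=10, alpha=0.05, runs=10000, seed=0):
        # Fraction of simulated experiments in which the null is rejected at level alpha,
        # when the per-fold differences really do favor the new algorithm on average.
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(runs):
            deltas = rng.normal(effect, sigma, size=k)
            delta_bar = deltas.mean()
            s_delta = np.sqrt(((deltas - delta_bar) ** 2).sum() / (k * (k - 1)))
            if t_dist.sf(delta_bar / s_delta, df=k - 1) < alpha:
                rejections += 1
        return rejections / runs

    # A larger effect, a lower variance, or more folds all push this fraction toward 1.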



 Hongyuan Zha (Georgia Tech)   Machine Learning and Web Search           50 / 50

Machine Learning Basics for Web Search

  • 1. Machine Learning and Web Search Part 1: Basics of Machine Learning Hongyuan Zha College of Computing Georgia Institute of Technology Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 1 / 50
  • 2. Outline 1 Classification Problems Bayes Error and Risk Naive Bayes Classifier Logistic Regression Decision Trees 2 Regression Problems Least Squares Problem Regularization Bias-Variance Decomposition 3 Cross-Validation and Comparison Cross-Validation p-Value and Test of Significance Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 2 / 50
  • 3. Supervised Learning We are predicting a target variable based a set of predictor variables using a training set of examples. Classification: predict a discrete target variable – spam filtering based on message contents Regression: predict a continuous target variable – predict income based on other demographic information Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 3 / 50
  • 4. Probabilistic Setting for Classification X is the predictor space, and C = {1, . . . , k} the set of class labels P(x , j) a probability distribution on X × C A classifier d(x ) is a function h : X → C We want to learn d from a training sample D = {(x1 , j1 ), . . . (xN , jN )} How do we measure the performance of a classifier? Miss-classification error, errorh = P({(x , j) | h(x ) = j}), where (x , j) ∼ P(x , j) Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 4 / 50
  • 5. Bayes Classifier P(x , j) = P(x |j)P(j) = P(j|x )P(x ) Assume X continuous, and P(x ∈ A, j) = p(j|x )p(x )dx A errorh = P(h(x ) = j) = p(h(x ) = j|x )p(x )dx = (1 − p(h(x )|x ))p(x )dx A A Bayes error = minh errorh and Bayes classifier h∗ (x ) = argmaxj p(j|x ) Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 5 / 50
  • 6. Risk Minimization Loss function L(i, j) = (C (i|j)): cost if class j is predicted to be class i Risk for h = expected loss for h, k Rf = L(h(x ), j)p(j|x )p(x )dx j=1 Minimizing the risk ⇒ Bayes classifier k h∗ (x ) = argminj C (j| )p( |x ) =1 0/1-loss ⇒ errorh Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 6 / 50
  • 7. Naive Bayes Classifier Use Bayes Rule p(j|x ) ∼ p(x |j)p(j) Feature vector x = [t1 , . . . , tn ]. Conditional independence assumption p(x |j) = p(t1 |j)p(t2 |j) . . . p(tn |j) MLE for p(ti |j), smoothed version, nc + mp p(ti |j) = n+m n: the number of training examples in class j nc : number of examples in class j with i attribute ti p: a priori estimate m: the equivalent sample size Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 7 / 50
  • 8. here: Yes: No: n= the number of training examples for which v = vj Naive Bayes Classifier: Example 5 Red: nc = n = 5 p= Red: number of examples for which v = vj and a = ai n = a priori estimate for P (ai |vj ) m= the equivalent sample size 3 n_c= n_c = 2 p = .5 p = .5 Car theft Example m = 3 m = 3 SUV: SUV: ttributes are Color , Type , Origin, and the subject, stolen can be either yes or no. n = 5 n = 5 n_c = 1 n_c = 3 1 data set Data p = .5 p =and multiply them by P(Yes) and P(No) respectively . We can estimate .5 Example No. Color m = 3 Origin Type Stolen? m =Yes:Red: 3 No: Red: 1 RedDomestic: Domestic Sports Yes Domestic: n = 5 n = 5 2 Red Sports 5 Domestic n = No n = 5 n_c= 3 n_c = 2 p = .5 p = .5 3 Red Sports = Domestic n_c 2 Yes n_c = 3 m = 3 m = 3 4 Yellow Sports Domestic No SUV: SUV: 5 p = .5 Yellow Sports Imported Yes p = .5 n = 5 n = 5 n_c = 1 n_c = 3 6 Yellow m = 3 Imported SUV No m =3 p = .5 p = .5 7 Yellow SUV Imported Yes m = 3 m = 3 8 Yellow P (Red|Y es), we No Looking atSUV Domestic have 5 cases where vj = Yes , and in 3 of those cases ai = Red. So for Domestic: Domestic: n = 5 n = 5 9 P (Red|Y es), n = 5 Imported 3. Note that all attribute are binary (two possible values). We are= 3 Red SUV and nc = No n_c = 2 n_c assuming 10 no Red information so, p = 1 /Yes other Sports Imported (number-of-attribute-values) = 0.5 for all of our attributes. = .5m value p = .5 m = 3 p m =3 Our is arbitrary, (We will use m = 3) but consistent for all attributes. Nowwe have 5 cases where v = Yes , and in 3 Looking at P (Red|Y es), we simply apply eqauation (3) 2 Training example j using the precomputed values of n , nc , p, andP (Red|Y es), n = 5 and nc = 3. Note that all attribute are binary (two po m. Estimates no other information so, p = 1 / (number-of-attribute-values) = 0.5 for a e want to classify a Red Domestic SUV. Note there is no example of a Red Domestic SUV in our data (We will use m = 3) but consistent for is arbitrary, all attributes. Now 3 + 3 ∗ .5 the probabilities 2+3∗ n using the precomputed values of.5 , nc , p, and m. t. Looking back at equation (2) we can see how to compute this. We need to calculate= .56 P (Red|Y es) = P (Red|N o) = = .43 5+3 5 + 33 + 3 ∗ .5 = .56 P (Red|Y es) = P (Red|N o) = (Red|Yes), P(SUV|Yes), P(Domestic|Yes) , 1 + 3 ∗ .5 3 + 3 ∗ .5+ ∗ .5 5 3 P (SU V |Y es) = = .31 P (SU V(SUo) |Y es) = 1 + 3= .56= .31 P |N V = P (SU V |N o) = (Red|No) , P(SUV|No), and P(Domestic|No) 5+3 5 + 35 + 3 2 + 3 ∗ .5 2 + 3 ∗ .5 P (Domestic|Y es) =3 + 3 ∗ .5 = .43 P (Domestic|N P (Domestic|Y es) = = .43 P (Domestic|N o) = 5 + 3 = .56 1 5+3 We have P (Y es) = .5 and P (N o) 5 + 3 so we can apply equation (2). F = .5, P(Yes) * P(Red | Yes) * P(SUV | Yes) * P(Domestic|Ye We have P (Y es) = .5 and P (N o) = .5, so we can apply equation (2). For v = Y es, we have = .5 * .56 * .31 * .43 = .037 P(Yes) * P(Red | Yes) * P(SUV | Yes) for P(Domestic|Yes) and * v = N o, we have P(No) * P(Red | No) * P(SUV | No) * P (Domestic | No = .5 * .56 * .31 * .43 = .037 = .5 * .43 * .56 * .56 = .069 Since 0.069 > 0.037, our example gets classified as ’NO’ and for v = N o, we have Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 8 / 50
  • 9. Naive Bayes Classifier: Example To classify a Red Domestic SUV, For Yes: P(Yes) * P(Red | Yes) * P(SUV | Yes) * P(Domestic|Yes) = .5 * .56 * .31 * .43 = .037 For No: P(No) * P(Red | No) * P(SUV | No) * P (Domestic | No) = .5 * .43 * .56 * .56 = .069 Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 9 / 50
  • 10. Learning to Classify Text Target concept Interesting : Document → {+, −} 1 Example: document classification using BOW — Multiple Bernoulli model — Multinomial model one attribute per word position in document 2 Learning: Use training examples to estimate P(+), P(−), P(doc|+), P(doc|−) Naive Bayes conditional independence assumption length(doc) P(doc|j) = P(ti |j) i=1 where P(ti |j) is probability that word i appears in class j Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 10 / 50
  • 11. Learn_naive_Bayes_text(Examples, k) 1. collect all words and other tokens that occur in Examples Vocabulary ← all distinct words and other tokens in Examples 2. calculate the required P(j) and P(ti |j) probability terms For each target value j in {1, . . . , k} do docsj ← subset of Examples for which the target value is j |docsj | P(j) ← |Examples| Textj ← a single document created by concatenating all members of docsj n ← total number of words in Textj (counting duplicate words multiple times) for each word ti in Vocabulary nk ← number of times word wk occurs in Textj ni +1 P(ti |j) ← n+|Vocabulary | Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 11 / 50
  • 12. Classify_naive_Bayes_text(Doc) positions ← all word positions in Doc that contain tokens found in Vocabulary Return vNB , where jNB = argmax P(j) P(ti |j) j i When k 1, need special smoothing method, Congle Zhang et. al. Web-scale classification with Naive Bayes, WWW 2009 Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 12 / 50
  • 13. Logistic Regression In Naive Bayes, the discriminant function is P(j) P(ti |j) i Let ni be frequency of ti , and take the log of the above log P(j) + ni log P(ti |j) i which is a linear function of the frequency vector x = [n1 , . . . , nV ]T Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 13 / 50
  • 14. Logistic Regression More generally, 1 1 P(j|x ) = exp(wjT x ) ≡ exp(w T f (x , j)) Z (x ) Zw (x ) Given a training sample L = {(x1 , j1 ), . . . (xN , jN )} we minimize the Conditional likelihood, 1 2 min w +C log Zw (xi ) − w T f (xi , ji ) 2 i A Convex function in w C.J. Lin et. al., Trust region Newton methods for large-scale logistic regression, ICML, 2007 (several millions of features, N = 105 ) Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 14 / 50
  • 15. Generative vs. Discriminative Classifiers Naive Bayes estimates parameters for P(j), P(x |j) while logistic regression estimates parameters for P(j|x ) Naive Bayes: generative classifier Logistic regression: discriminative classifier Logistic more general, gives better asymptotic error. Convergence rates are different, GNB with O(log n) examples, and logistic regression with O(n) examples, n dimension of X Ng and Jordan, On generative vs. discriminative classifiers: a comparison of Naive Bayes and logistic regession, NIPS 2002 Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 15 / 50
  • 16. fication and regression trees, or CART (Breiman et al., 1984), although ny other variants going by such names as ID3 and C4.5 (Quinlan, 1986; 3). Tree-Based Models 4.5 shows an illustration of a recursive binary partitioning of the input with the corresponding tree structure. In this example, the first step a two-dimensional in- x2 664 14. COMBINING MODELS at has been partitioned ions using axis-aligned E Figure 14.6 Binary tree corresponding to the par- θ3 titioning of input space shown in Fig- x1 > θ1 B ure 14.5. x2 θ2 x2 > θ3 C D θ2 x1 θ4 A θ1 θ4 x1 A B C D E Partition inputdivides the whole of thea input spaceregions, Thisaccordingtwo whether x each with the space into cuboid into two regions creates to subregions, θ or x > θ where θ is parameter of the model. 1 1 1 with edges aligned 1 1 of which can then be subdivided independently. For instance, the region x θ axes is further subdivided according to whether x θ or x > θ , giving rise to the 2 2 2 2 1 1 regions denoted A and B. The recursive subdivision can be described by the traversal Classifier of the binary tree shown in Figure 14.6. For any new input x, we determine which region it falls into by starting at the top of the tree at the root node and following h(x ) = i j I(x ∈ R )i a path down to a specific leaf node according to the decision criteria at each node. Note that such decision trees are not probabilistic graphical models. Within each region, there is a iseparate model to predict the target variable. For instance, in regression we might simply predict a constant over each region, or in CART (classification and regression trees) classification we might assign each region to a specific class. A key property of tree- based models, which makes them popular in fields such as medical diagnosis, for example, is that they are readily interpretable by humans because they correspond to a sequence of binary decisions applied to the individual input variables. For in- stance, to predict a patient’s disease, we might first ask “is their temperature greater Hongyuan Zha (Georgia Tech) than some threshold?”. If the answer isWeb Search might next ask “is their blood Machine Learning and yes, then we 16 / 50
  • 17. Decision Trees Decision tree representation: Each internal node tests an attribute (predictor variable) Each branch corresponds to attribute value – branching factor > 2 (discrete case) – binary tree more common (split on a threshold) Each leaf node assigns a class label Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 17 / 50
  • 18. Top-Down Induction of Decision Trees Main loop: 1 A ← the “best” decision attribute for next node 2 For each value of A, create new descendant of node 3 Sort training examples to leaf nodes 4 If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes Which attribute is best? Binary classification with two attributes [29+,35-] A1=? [29+,35-] A2=? t f t f [21+,5-] [8+,30-] [18+,33-] [11+,2-] Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 18 / 50
  • 19. Information Gain S is a sample of training examples p⊕ is the proportion of positive examples in S p is the proportion of negative examples in S Entropy measures the impurity of S Entropy (S) ≡ −p⊕ log2 p⊕ − p log2 p Gain(S, A) = expected reduction in entropy due to sorting on A |Sv | Gain(S, A) ≡ Entropy (S) − Entropy (Sv ) ≥ 0 v ∈Values(A) |S| Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 19 / 50
  • 20. Extension General predictor variables x ∈ X A set of binary splits s at each node based on a question: is x ∈ A? where A ∈ X the split s sends all (xi , ji ) with "yes" answer to the left child and "no" answer to the right child Standard set of questions Predictor variable x i continuous: is x i ≤ c? Predictor variable x i categorical: is x i ∈ T ? where T ⊂ T and T is the values of x i Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 20 / 50
  • 21. Goodness of Split goodness ofnode.is measured by an impurity function The defined for each split Intuitively, we want each leaf node to be “pure”, that is, one class dominates. The goodness of split is measured by an impurity function defined for Jia Li http://www.stat.psu.edu/∼jiali each node Intuitively, we want each leaf node to be "pure", that is, one class dominates Given class probabilities in a node S: p(1|S), . . . , p(J|S) Impurity function for S, (more generally for the samples in a node) i(S) = φ(p(1|S), . . . , p(J|S)) Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 21 / 50
  • 22. Goodness of Split Examples Entropy φ(p1 , . . . , pJ ) = − pj log pj j Gini Index φ(p1 , . . . , pJ ) = − pi pj = 1 − pj2 i=j j Goodness of a split s for node t, Φ(s, t) = ∆i(s, t) = i(t) − pR i(tR ) − pL i(tL ) where pR and pL are the proportions of the samples in node t that go to the right node tR and the left node tL , respectively. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 22 / 50
  • 23. Stopping Criteria A simple criteria: stop splitting a node t when max p(t)∆i(s, t) < β s∈S The above stopping criteria is unsatisfactory — A node with a small decrease of impurity after one step of splitting may have a large decrease after multiple levels of splits. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 23 / 50
  • 24. CART: Classification and Regression Trees Two phases: growing and pruning Growing: input space is recursively partitioned into cells, each cell corresponding to a leaf node – training data are fitted well – but poor performance on test data (overfitting) Pruning: objective function consists of empirical risk and penalty term C (T ) = LN (T ) + α|T | where T ∈ T , all possible subtrees obtained from prunning the original tree. CART select T to minimize C (T ) with α selected by cross-validation Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 24 / 50
  • 25. Probabilistic Setting for Regression X is the predictor space, and T = R the set of reals P(x , t) a probability distribution on X × T A regression function y (t) is a function h : X → T We want to learn h from a training sample D = {(x1 , t1 ), . . . (xN , tN )} How do we measure the performance of a classifier? Mean Squared Error, errorh = (t − y (x ))2 dP(x , t) Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 25 / 50
  • 26. Conditional Mean as Optimal Regression Function Mean Squared Error, errorh = (t − h(x ))2 dP(x , t) = (t − h(x ))2 p(t|x )dtdP(x ) Optimal regression function h∗ (x ) = tp(t|x )dt Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 26 / 50
  • 27. Least Squares Problem DUCTION The error function (1.2) corre- sponds to (one half of) the sum of t tn the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w). y(xn , w) xn x For training set, D = {(x1 , training . (xN , t The geomet- function y(x, w) were to pass exactly through each t1 ), . . data point.N )}, the empirical risk, or rical interpretation of the sum-of-squares error function is illustrated in Figure 1.3. training error, for y (x , w ), We can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible. Because the error function N a quadratic function of is 1 the coefficients w, its derivatives with respect to the coefficients will be linear in the2 has − y (x solution, elements of w, and so the minimization E (w error function (ti a unique i , w )) of the ) = denoted by w , which can be found in closed form. 2The resulting polynomial is i=1 given by the function y(x, w ). There remains the problem of choosing the order M of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection. In Figure 1.4, we show four examples of the results of fitting polynomials having orders M = Machine Learningto the data set shown in Hongyuan Zha (Georgia Tech) 0, 1, 3, and 9 and Web Search 27 / 50
  • 28. Linear Least Squares Problem y (x , w ) = w T f (x ), f (x ) = [f1 (x ), . . . , fn (x )] a set of basis functions, N 1 E (w ) = (ti − w T f (xi ))2 2 i=1 Let t = [ti ], A = [f (xi )], then 2 min t − Aw 2 w Normal equation, AT Aw = AT t Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 28 / 50
  • 29. Overfitting and Regularization DUCTION Example taken from C. Bishop Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation 1 of the input variable x along with the corresponding target variable t t. The green curve shows the function sin(2πx) used to gener- ate the data. Our goal is to pre- 0 dict the value of t for some new value of x, without knowledge of the green curve. −1 0 x 1 Polynomial curve fitting detailed treatment lies beyond the scope of this book. Although each of these tasks needs its own tools and techniques, many of the key ideas that underpin them are common to all such problems. One of the main M goals of this chapter is to introduce, y (xrelatively informal way, x + · · · the most x in a , w ) = w0 + w1 several of + wM important of these concepts and to illustrate them using simple examples. Later in the book we shall see these same ideas re-emerge in the context of more sophisti- cated models that are applicable to real-world pattern recognition applications. This chapter also provides a self-contained introduction to three important tools that will be used throughout the book, namely probability theory, decision theory, and infor- mation theory. Zha (Georgia Tech)might sound like daunting and Web they are in fact Hongyuan Although these Machine Learning topics, Search 29 / 50
  • 30. 1.1. Example: Polynomial Curve Fitting 7 1 M =0 1 M =1 t t 0 0 −1 −1 0 x 1 0 x 1 1 M =3 1 M =9 t t 0 0 −1 −1 0 x 1 0 x 1 Figure 1.4 Plots of polynomials having various orders M, shown as red curves, fitted to the data set shown in Figure 1.2. (RMS) error defined by ERMS = 2E(w )/N (1.3) in which the division by N allows us to compare different sizes of data sets on Hongyuan Zha (Georgia Tech) an equal footing, andLearning and Web Search RMS is measured on the same Machine the square root ensures that E 30 / 50
  • 31. from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains Training Error and Test Error terms of all orders, so we might expect that results should improve monotonically as we increase M . We can gain some insight into the problem by examining the values of the co- efficients w obtained from polynomials of various order, as shown in Table 1.1. We see that, as M increases, the magnitude of the coefficients typically gets larger. In particular for the M = 9 polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the correspond- uare ated 1 Table 1.1 Table of the coefficients w for M =0 M =1 M =6 M =9 nde- Training polynomials of various order. w0 0.19 0.82 0.31 0.35 alues Test Observe how the typical mag- nitude of the coefficients in- w1 -1.27 7.99 232.37 creases dramatically as the or- w2 -25.43 -5321.83 der of the polynomial increases. w3 17.37 48568.31 ERMS 0.5 w4 -231639.30 w5 640042.26 w6 -1061800.52 w7 1042400.18 w8 -557682.99 0 0 3 6 9 w9 125201.43 M ng set error goes to zero, as we might expect because degrees of freedom corresponding to the 10 coefficients uned exactly to the 10 data points in the training set. as become very large and, as we saw in Figure 1.4, the w ) exhibits wild oscillations. ical because a polynomial of given order contains all pecial cases. The M = 9 polynomial is therefore capa- ast as good as the M = 3 polynomial. Furthermore, we predictor of new data would be the function sin(2πx) nerated (and we shall see later that this isMachine Learning and Web Search Hongyuan Zha (Georgia Tech) indeed the 31 / 50
  • 32. Increasing Training Set Size 1.1. Example: Polynomial Curve Fitting 9 1 N = 15 1 N = 100 t t 0 0 −1 −1 0 x 1 0 x 1 Figure 1.6 Plots of the solutions obtained by minimizing the sum-of-squares error function using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem. ing polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscilla- tions observed in Figure 1.4. Intuitively, what is happening is that the more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 32 / 50
  • 33. Regularization 10 1. INTRODUCTION 1 ln λ = −18 1 ln λ = 0 t t 0 0 −1 −1 0 x 1 0 x 1 Figure 1.7 Plots of M = 9 polynomials fitted to the data set shown in Figure 1.2 using the regularized error function (1.4) for set, L the {(x , t parameter λ corresponding ln λ −18 and ln λ = The For case of no regularizer,values of= corresponding ),ln.λ.= −∞,N , tN )}, tothe=empirical 0.risk, or training two i.e., λ = 0, regularizationto . (x is shown at the bottom right of Figure 1.4. 1 1 training error, for y (x , w ), may wish to use relatively complex and flexible models. One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, N ˆ coefficients 1 λ which involves adding a penalty term to the error function (1.2) in order to discourage E (w ) = from reaching largey (xi ,The ))2 +such penalty 2 takes the the (ti − values. w simplest w term form of a sum of2 i=1of all of the coefficients, leading to 2 squares a modified error function of the form N 1 2 λ E(w) = {y(xn , w) − tn } + w 2 (1.4) 2 2 n=1 Hongyuan Zha (Georgia Tech) where w 2 Machine = w0 + w1 + . Webw2 , and the coefficient λ governs the rel- ≡ wT w Learning 2 . . + Search 2 and 33 / 50
  • 34. itable value for the model complexity. The results above chieving this, namely by taking the available data and Training Error and Test Error: Regularization g set, used to determine the coefficients w, and a separate a hold-out set, used to optimize the model complexity cases, however, this will prove to be too wasteful of we have to seek more sophisticated approaches. of polynomial curve fitting has appealed largely to in- ore principled approach to solving problems in pattern discussion of probability theory. As well as providing the the subsequent developments in this book, it will also 1.1. Example: Polynomial Curve Fitting 11 re er- ln λ = −∞ ln λ = −18 ln λ = 0 M = 9 1 Table 1.2 Table of the coefficients w for M = 9 polynomials with various values for w0 0.35 0.35 0.13 Training the regularization parameter λ. Note Test that ln λ = −∞ corresponds to a w1 232.37 4.74 -0.05 model with no regularization, i.e., to w2 -5321.83 -0.77 -0.06 the graph at the bottom right in Fig- w3 48568.31 -31.97 -0.05 ERMS 0.5 ure 1.4. We see that, as the value of w4 -231639.30 -3.89 -0.03 λ increases, the typical magnitude of w5 640042.26 55.28 -0.02 the coefficients gets smaller. w6 -1061800.52 41.32 -0.01 w7 1042400.18 -45.95 -0.00 0 w8 -557682.99 -91.53 0.00 −35 −30 ln λ −25 −20 w9 125201.43 72.68 0.01 the magnitude of the coefficients. The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error (1.3) for both training and test sets against ln λ, as shown in Figure 1.8. We see that in effect λ now controls the effective complexity of the model and hence determines the degree of over-fitting. The issue of model complexity is an important one and will be discussed at length in Section 1.3. Here we simply note that, if we were trying to solve a practical application using this approach of minimizing an error function, we would have to find a way to determine aLearning and Webfor the model complexity. The results above/ 50 Hongyuan Zha (Georgia Tech) Machine suitable value Search 34
  • 35. Bias-Variance Decomposition Let h∗ (x ) be the conditional mean, h∗ (x ) = tp(t|x )dt For any regression function y (x ), (y (x ) − t)2 = (y (x ) − h(x ))2 + (h(x ) − t)2 Given training data D, algorithm outputs y (x , D) — average behavior over all D ED (y (x , D)−h(x ))2 = (ED y (x , D) − h(x ))2 + ED (y (x , D) − ED y (x , D))2 (bias)2 variance Expected loss = (bias)2 + variance + noise Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 35 / 50
  • 36. Bias-Variance and Model Complexity 150 3. LINEAR MODELS FOR REGRESSION 1 1 ln λ = 2.6 t t 0 0 −1 −1 0 x 1 0 x 1 1 1 ln λ = −0.31 t t 0 0 −1 −1 0 x 1 0 x 1 1 1 ln λ = −2.4 t t 0 0 −1 −1 0 x 1 0 x 1 Figure 3.5 Illustration of the dependence of bias and variance on model complexity, governed by a regulariza- tion parameter λ, using the sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25 Hongyuan Zha (Georgia data points, and there are 24 Gaussian basis functions inand Web Search number of parameters is Tech) Machine Learning the model so that the total 36 / 50
  • 37. Bias-Variance Decomposition 3.2. The Bias-Variance Decomposition 151 as and variance, um, correspond- 0.15 shown in Fig- (bias)2 n is the average 0.12 variance est data set size (bias)2 + variance e minimum value 0.09 test error e occurs around h is close to the 0.06 e minimum error 0.03 0 −3 −2 −1 0 1 2 ln λ 4 Gaussian basis functions by minimizing the regularized error give Hongyuan Zha (Georgia Tech) y (l) (x) as shown Web Search 3.5. The a prediction function Machine Learning and in Figure 37 / 50
  • 38. Overfitting 0.9 0.85 0.8 0.75 Accuracy 0.7 0.65 0.6 On training data On test data 0.55 0.5 0 10 20 30 40 50 60 70 80 90 100 Size of tree (number of nodes) Common problem with most learning algorithms Given a function space H, a function h ∈ H is said to overfitthe training data if there exists some alternative function h ∈ H such that h has smaller error than h over the training examples but h has smaller error than h over the entire distribution of instances Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 38 / 50
  • 39. Cross-Validation Performance on the training set is not a good indicator of predictive performance on unseen data If data is plentiful, training data, validation data and test data — models (different degree polynomials) compared on validation data — high variance if validation data small S-fold cross-validation 1.4. The Curse of Dimensionality 33 Figure 1.18 The technique of S-fold cross-validation, illus- run 1 trated here for the case of S = 4, involves tak- ing the available data and partitioning it into S groups (in the simplest case these are of equal run 2 size). Then S − 1 of the groups are used to train a set of models that are then evaluated on the re- run 3 maining group. This procedure is then repeated for all S possible choices for the held-out group, run 4 indicated here by the red blocks, and the perfor- mance scores from the S runs are then averaged. data to assess performance. When data is particularly scarce, it may be appropriate to consider the case S = N , where N is the total number of data points, which gives the leave-one-out technique. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 39 / 50
  • 40. Cross-Validation Which kind of Cross Validation? Downside Upside Test-set Variance: unreliable Cheap estimate of future performance Leave- Expensive. Doesn’t waste data one-out Has some weird behavior 10-fold Wastes 10% of the data. Only wastes 10%. Only 10 times more expensive 10 times more expensive than test set instead of R times. 3-fold Wastier than 10-fold. Slightly better than test- Expensivier than test set set R-fold Identical to Leave-one-out Andrew Moore’s slides Copyright © Andrew W. Moore Slide 35 Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 40 / 50
  • 41. CV-based Model Choice CV-based Model Selection Example: •Choosing which model toofuse Example: Choosing number hidden units in a one- hidden-layer neural net. Step 1: Compute 10-fold CV error for six different model classes, • Step 1: Compute 10-fold CV error for six different model classes: Algorithm TRAINERR 10-FOLD-CV-ERR Choice 0 hidden units 1 hidden units 2 hidden units ! 3 hidden units 4 hidden units 5 hidden units • Step 2: Whichever model class gave best CV score: train it Step 2: Whichever model and that’s theCV score:model you’ll use. all the with all the data, gave best predictive train it with data, and that is the predictive model you will use. Copyright © Andrew W. Moore Slide 38 Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 41 / 50
  • 42. Two Definitions of Error The true error of classifier h with respect to P(x , y ) is the probability that h will misclassify an instance drawn at random according to P (population). errorP (h) ≡ P[h(x ) = y ] The sample error of h with respect to data sample S = {(xi , yi )}N is i=1 the proportion of examples h misclassifies N 1 errorS (h) ≡ δ(yi = h(xi )) n i=1 Where δ(y = h(x )) is 1 if y = h(x ), and 0 otherwise. How well does errorS (h) estimate errorP (h)? Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 42 / 50
• 43. Example. Hypothesis h misclassifies 12 of the 40 examples in S, so errorS(h) = 12/40 = 0.30. What is errorP(h)? Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 43 / 50
• 44. Confidence Intervals. If S contains n examples, drawn independently of h and of each other, with n ≥ 30, then with approximately N% probability, errorP(h) lies in the interval errorS(h) ± zN √(errorS(h)(1 − errorS(h))/n), where N%: 50% 68% 80% 90% 95% 98% 99% and zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 44 / 50
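Plugging the slide-43 numbers (12 misclassified out of n = 40) into this interval gives, for example, 0.30 ± 0.142 at the 95% level. The sketch below assumes scipy for the zN values; everything else follows the formula above.

```python
# Hedged sketch: normal-approximation confidence interval for errorP(h)
# using the slide-43 example (12 misclassified out of n = 40).
from math import sqrt
from scipy.stats import norm

n = 40
error_s = 12 / n                        # sample error errorS(h) = 0.30
for conf in [0.50, 0.68, 0.80, 0.90, 0.95, 0.98, 0.99]:
    z = norm.ppf(0.5 + conf / 2)        # two-sided z_N, e.g. 1.96 for 95%
    half_width = z * sqrt(error_s * (1 - error_s) / n)
    print(f"{conf:.0%}: {error_s:.2f} +/- {half_width:.3f}")
```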
• 45. k-fold Cross-Validated Paired t-test. Comparing two algorithms A and B, where L(S) returns the classifier produced by algorithm L trained on training data S.
1. Randomly partition the training data D into k disjoint test sets T1, T2, ..., Tk of equal size.
2. For i from 1 to k: use Ti as the test set and the remaining data as the training set Si; hA ← LA(Si), hB ← LB(Si); δi ← errorTi(hA) − errorTi(hB).
3. Return the value δ̄ ≡ (1/k) Σ_{i=1}^{k} δi.
4. Let sδ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² ).
Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 45 / 50
• 46. t-Distribution. Then t̂ ≡ δ̄/sδ̄ has an approximate t-distribution with k − 1 degrees of freedom under the null hypothesis that there is no difference in the true errors. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 46 / 50
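A sketch of the whole procedure from slides 45 and 46, assuming scikit-learn and scipy; the two algorithms being compared (logistic regression versus a decision tree) and the synthetic data are placeholder choices made for the example.

```python
# Hedged sketch of the k-fold cross-validated paired t-test.
import numpy as np
from scipy.stats import t
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.1, random_state=0)
k = 10
deltas = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    h_a = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    h_b = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    err_a = 1.0 - h_a.score(X[test_idx], y[test_idx])     # errorTi(hA)
    err_b = 1.0 - h_b.score(X[test_idx], y[test_idx])     # errorTi(hB)
    deltas.append(err_a - err_b)                          # delta_i

deltas = np.array(deltas)
delta_bar = deltas.mean()
s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
t_hat = delta_bar / s_delta
p_value = t.sf(t_hat, df=k - 1)        # one-sided: probability to the right of t_hat
print(t_hat, p_value)
```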
• 47. p-Value and Test of Significance. Null hypothesis: no difference in true errors; and an alternative hypothesis. We may be able to demonstrate that the alternative is much more plausible than the null hypothesis given the data. This is done in terms of a probability (a p-value), quantifying the strength of the evidence against the null hypothesis in favor of the alternative. Are the data consistent with the null hypothesis? Use a test statistic, like t̂, and the null distribution of the test statistic (Student's t with k − 1 degrees of freedom) must be known. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 47 / 50
• 48. p-Value and Test of Significance. For a given data set, we can compute the value of t̂ and see whether it is in the middle of the distribution (consistent with the null hypothesis) or out in a tail of the distribution (making the alternative hypothesis seem more plausible). The alternative hypothesis ⇒ a large positive t̂, a measure of how far out t̂ is in the right-hand tail of the null distribution. The p-value is the probability to the right of our test statistic t̂, calculated using the null distribution. The smaller the p-value, the stronger the evidence against the null hypothesis in favor of the alternative. Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 48 / 50
• 49. p-Value and Test of Significance. When reporting a p-value to persons unfamiliar with statistics, it is often necessary to use descriptive language to indicate the strength of the evidence; the cut-offs below are somewhat arbitrary and another person might use different language.
P > 0.10: no evidence against the null hypothesis; the data appear to be consistent with the null hypothesis.
0.05 < P < 0.10: weak evidence against the null hypothesis in favor of the alternative.
0.01 < P < 0.05: moderate evidence against the null hypothesis in favor of the alternative.
0.001 < P < 0.01: strong evidence against the null hypothesis in favor of the alternative.
P < 0.001: very strong evidence against the null hypothesis in favor of the alternative.
Level α (usually 0.05 or 0.01) test: we reject the null hypothesis at level α if the p-value is smaller than α. Keep in mind the difference between statistical significance and practical significance: in a large study one may obtain a small p-value even though the magnitude of the effect being tested is too small to be of importance (see the discussion of power below), so it is a good idea to support a p-value with a confidence interval for the parameter being tested.
Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 49 / 50
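If one wants to automate this reporting convention, a small helper such as the following maps a p-value to the descriptive language in the table above; this is a sketch, and the function name is invented here.

```python
# Hedged sketch: descriptive language for a p-value, following the cut-offs
# on the slide (which it notes are somewhat arbitrary).
def describe_p_value(p: float) -> str:
    if p > 0.10:
        return "no evidence against the null hypothesis"
    if p > 0.05:
        return "weak evidence against the null hypothesis"
    if p > 0.01:
        return "moderate evidence against the null hypothesis"
    if p > 0.001:
        return "strong evidence against the null hypothesis"
    return "very strong evidence against the null hypothesis"

print(describe_p_value(0.03))   # moderate evidence against the null hypothesis
```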
• 50. The Power of Tests. When comparing algorithms: null hypothesis: no difference; alternative hypothesis: my new algorithm is better. We want a good chance of reporting a small p-value assuming the alternative hypothesis is true. The power of a level α test is the probability that the null hypothesis will be rejected at level α (i.e., the p-value will be less than α) assuming the alternative hypothesis is true. Power depends on the variability of the data (lower variance, higher power), the sample size (higher N, higher power), and the magnitude of the difference (larger difference, higher power). Hongyuan Zha (Georgia Tech) Machine Learning and Web Search 50 / 50
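A rough way to see these effects is to simulate the test. The sketch below is an illustration, not from the slides: it assumes the per-fold differences δi are i.i.d. normal with mean equal to the true difference (`effect`) and a chosen standard deviation (`sigma`), and estimates the power of the level-α paired t-test by Monte Carlo.

```python
# Hedged sketch: Monte Carlo estimate of the power of the level-alpha
# k-fold cross-validated paired t-test under an assumed normal model
# for the per-fold differences.
import numpy as np
from scipy.stats import t

def estimated_power(effect, sigma, k=10, alpha=0.05, n_sim=10_000, seed=0):
    rng = np.random.default_rng(seed)
    deltas = rng.normal(effect, sigma, size=(n_sim, k))       # simulated delta_i
    delta_bar = deltas.mean(axis=1)
    s_delta = np.sqrt(((deltas - delta_bar[:, None]) ** 2).sum(axis=1) / (k * (k - 1)))
    p_values = t.sf(delta_bar / s_delta, df=k - 1)            # one-sided p-values
    return np.mean(p_values < alpha)                          # fraction rejected = power

print(estimated_power(effect=0.02, sigma=0.05))        # baseline
print(estimated_power(effect=0.04, sigma=0.05))        # larger difference, higher power
print(estimated_power(effect=0.02, sigma=0.02))        # lower variance, higher power
print(estimated_power(effect=0.02, sigma=0.05, k=20))  # more samples, higher power
```

The four calls illustrate the three factors on the slide: increasing the true difference, reducing the variance, or increasing the number of samples all raise the estimated power.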