conjunctive rules of the form $y = x_i \wedge x_j \wedge x_k \ldots$
No simple rule explains the data. The same is true for simple clauses.
Example   x1 x2 x3 x4   y
   1       0  0  1  0    0
   2       0  1  0  0    0
   3       0  0  1  1    1
   4       1  0  0  1    1
   5       0  1  1  0    0
   6       1  1  0  0    0
   7       0  1  0  1    0

Rule              Counterexample
y = c             no single constant fits (labels 0 and 1 both occur)
x1                1100 → 0
x2                0100 → 0
x3                0110 → 0
x4                0101 → 0
x1 x2             1100 → 0
x1 x3             0011 → 1
x1 x4             0011 → 1
x2 x3             0011 → 1
x2 x4             0011 → 1
x3 x4             1001 → 1
x1 x2 x3          0011 → 1
x1 x2 x4          0011 → 1
x1 x3 x4          0011 → 1
x2 x3 x4          0011 → 1
x1 x2 x3 x4       0011 → 1

Each counterexample is an instance whose true label contradicts the rule's prediction.
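The table can be verified mechanically. A minimal Python sketch (the data encoding is mine, taken from the table above) enumerates every non-empty conjunction over the four attributes and reports a counterexample for each; the counterexample printed is simply the first mismatch in listing order, so it may differ from the one shown in the table:

```python
from itertools import combinations

# Training data from the table above: (x1, x2, x3, x4) -> y
data = [
    ((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1), ((0, 1, 1, 0), 0), ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]

# Enumerate every non-empty conjunction y = x_i AND x_j AND ...
for size in range(1, 5):
    for idxs in combinations(range(4), size):
        for x, y in data:
            pred = int(all(x[i] == 1 for i in idxs))  # rule's prediction
            if pred != y:                             # mismatch found
                rule = " ".join(f"x{i + 1}" for i in idxs)
                print(f"{rule:<12} counterexample: {x} -> label {y}")
                break
```

Every conjunction prints a counterexample, which is the point of the slide: no conjunctive rule is consistent with these seven examples.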
Our prediction on the d-th example $\mathbf{x}$ is therefore: $o_d = \mathbf{w} \cdot \mathbf{x}$ (where $\mathbf{w}$ is the current weight vector).
Let $t_d$ be the target value for this example (a real value; it represents $\mathbf{u} \cdot \mathbf{x}$).
A convenient error function over the data set D is: $Err(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$
(Notation: subscript i – vector component; superscript j – time step; d – example number.) Assumption: $\mathbf{x} \in \mathbb{R}^n$; $\mathbf{u} \in \mathbb{R}^n$ is the target weight vector; the target (label) is $t_d = \mathbf{u} \cdot \mathbf{x}$. Noise has been added, so possibly no weight vector is consistent with the data.
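To make the setup concrete, here is a minimal sketch of the online LMS (Widrow–Hoff) update this error function leads to, $\mathbf{w} \leftarrow \mathbf{w} + r(t_d - o_d)\mathbf{x}_d$; the synthetic data and all names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200
u = rng.normal(size=n)                    # hidden target weight vector u
X = rng.normal(size=(m, n))               # examples x in R^n
t = X @ u + 0.1 * rng.normal(size=m)      # noisy targets t_d = u.x + noise

w = np.zeros(n)                           # current hypothesis
r = 0.05                                  # learning rate (fixed here)
for x_d, t_d in zip(X, t):
    o_d = w @ x_d                         # prediction o_d = w.x
    w += r * (t_d - o_d) * x_d            # gradient step on (t_d - o_d)^2 / 2

print(np.linalg.norm(w - u))              # small: w has approached u
```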
In the general (non-separable) case the learning rate R
must decrease to zero to guarantee convergence. It cannot decrease too quickly (the weights stop moving before reaching the minimum) nor too slowly (the weights keep oscillating around it).
The learning rate is also called the step size. There are more
sophisticated algorithms (e.g., Conjugate Gradient) that choose
the step size automatically and converge faster.
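A common schedule that meets the "not too quickly, not too slowly" requirement (the Robbins–Monro conditions $\sum_t r_t = \infty$ but $\sum_t r_t^2 < \infty$) is $r_t = r_0/(1+t)$; a minimal sketch, with the function name my own:

```python
def lms_rate(t, r0=0.1):
    # r_t -> 0, yet sum(r_t) diverges (not decreasing too quickly)
    # and sum(r_t**2) converges (not decreasing too slowly).
    return r0 / (1.0 + t)

# Usage: replace the fixed r in the LMS loop above with lms_rate(step).
```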
There is only one "basin" for linear threshold units, so a
local minimum is the global minimum. However, choosing
a good starting point can make the algorithm converge much faster.
Computational Issues
Assume the data is linearly separable.
Sample complexity: Suppose we want to ensure that our LTU has an error rate (on new examples) of less than $\epsilon$, with high probability (at least $1-\delta$). How large must m (the number of examples) be in order to achieve this? It can be shown that for n-dimensional problems
$m = O\left(\frac{1}{\epsilon}\left[\ln\frac{1}{\delta} + (n+1)\ln\frac{1}{\epsilon}\right]\right)$
Computational complexity: What can be said? It can be shown that there exists a polynomial-time algorithm for finding a consistent LTU, by reduction to linear programming. (Online algorithms have an inverse quadratic dependence on the margin.)
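To get a rough feel for the sample-complexity bound, one can evaluate it numerically; note that the $O(\cdot)$ hides an unspecified constant, taken as 1 below purely for illustration:

```python
from math import ceil, log

def m_bound(eps, delta, n, c=1.0):
    # Illustrative only: c stands in for the constant hidden by O().
    return ceil(c / eps * (log(1 / delta) + (n + 1) * log(1 / eps)))

print(m_bound(eps=0.1, delta=0.05, n=10))  # 284 examples with c = 1
```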
Set $\nabla J(\mathbf{w}) = 0$ and solve for $\mathbf{w}$. This can be accomplished using SVD.
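A minimal sketch of that direct computation in NumPy; the gradient condition $\nabla J(\mathbf{w}) = X^\top(X\mathbf{w} - \mathbf{t}) = 0$ is solved through the SVD pseudo-inverse (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                 # design matrix of examples
t = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# grad J(w) = X^T (X w - t) = 0  =>  w = pinv(X) @ t, computed via SVD:
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w = Vt.T @ ((U.T @ t) / s)

# np.linalg.lstsq solves the same system (it is also SVD-based):
w_ref, *_ = np.linalg.lstsq(X, t, rcond=None)
print(np.allclose(w, w_ref))                  # True
```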
Fisher Linear Discriminant:
A direct computation method.
Probabilistic methods (naive Bayes):
Produces a stochastic classifier that can be viewed as a linear threshold function.
Winnow:
A multiplicative update algorithm with the property that it can
handle large numbers of irrelevant attributes (see the sketch below).
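As a rough illustration of such a multiplicative update, here is a minimal Winnow sketch (Boolean attributes, promotion/demotion factor 2, threshold n; the function name and data are made up for illustration):

```python
import numpy as np

def winnow(examples, n, alpha=2.0):
    # Mistake-driven multiplicative updates: weights on irrelevant
    # attributes decay geometrically, so the mistake bound grows only
    # logarithmically with the number of attributes.
    w, theta = np.ones(n), float(n)
    for x, y in examples:
        pred = int(w @ x >= theta)
        if pred == 0 and y == 1:
            w[x == 1] *= alpha        # promotion on a missed positive
        elif pred == 1 and y == 0:
            w[x == 1] /= alpha        # demotion on a false positive
    return w

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 100))
y = (X[:, 0] | X[:, 1])               # only 2 of 100 attributes matter
w = winnow(zip(X, y), n=100)
print(w[:2], w[2:].max())             # relevant weights grow; rest stay small
```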
Summary of LMS algorithms for LTUs
Local search: Begins with an initial weight vector, which is modified iteratively to minimize an error function. The error function is loosely related to the goal of minimizing the number of classification errors.
Memory: The classifier is constructed from the training examples, which can then be discarded.
Online or Batch: Both online and batch variants of the algorithms can be used (a batch sketch follows).
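For contrast with the online rule sketched earlier, a minimal batch variant (one gradient step per pass over all examples; the function name is mine):

```python
import numpy as np

def lms_batch(X, t, r=0.1, epochs=200):
    # Batch gradient descent on Err(w) = 0.5 * sum_d (t_d - o_d)^2,
    # using the mean gradient over the whole training set per step.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - t) / len(t)   # mean of (o_d - t_d) x_d
        w -= r * grad
    return w
```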