WEKA: Algorithms - The Basic Methods
  • 1. Algorithms: The Basic Methods
  • 2. 1-rule Algorithm (1R)
    A way to find very simple classification rules
    Generates a one-level decision tree that tests just one attribute
    Consider each attribute in turn
    There will be one branch in the decision tree for each value of this attribute
    Allot the majority class to each branch
    Repeat the same for all attributes and choose the one with the minimum error
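The steps above can be sketched directly in Python. The weather data is the standard example used throughout these slides; the attribute names and tuple layout are assumptions for illustration:

```python
from collections import Counter

# The standard weather data from the slides: outlook, temperature,
# humidity, windy, and the class (play)
rows = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def one_r(rows, attrs):
    """For each attribute build a one-level tree: allot the majority class
    to each branch, count the errors, and keep the attribute with the
    fewest errors."""
    best = None
    for i, name in enumerate(attrs):
        errors, rules = 0, {}
        for v in {r[i] for r in rows}:
            classes = Counter(r[-1] for r in rows if r[i] == v)
            majority, count = classes.most_common(1)[0]
            rules[v] = majority
            errors += sum(classes.values()) - count  # minority instances are errors
        if best is None or errors < best[1]:
            best = (name, errors, rules)
    return best

attr, errors, rules = one_r(rows, attrs)
```

On this data both outlook and humidity give 4/14 errors; with the strict comparison above, the first one examined (outlook) is kept, mirroring the slides' arbitrary choice between equivalent options.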
  • 3. 1R Pseudo Code
    Pseudo code for 1R
  • 4. 1R in action
    Consider the problem of weather’s effect on play. Data is:
  • 5. 1R in action
    Let us consider the Outlook parameter first
    Total Error = 4/14
  • 6. 1R in action
    Consolidated table for all the attributes; ‘*’ represents an arbitrary choice from equivalent options:
  • 7. 1R in action
    From this table we can see that a decision tree on Outlook or Humidity gives the minimum error
    We can choose any of these two attributes and the corresponding rules as our classification rule
    Missing is treated as just another attribute value, with one branch in the decision tree dedicated to missing values like any other attribute value
  • 8. Numeric attributes and 1R
    To deal with numeric attributes, we discretize them
    The steps are:
    Sort instances on the basis of the attribute’s value
    Place breakpoints where the class changes
    These breakpoints give us discrete numerical ranges
    The majority class of each range is considered as its class
  • 9. Numeric attributes and 1R
    We have the following data for the weather example,
  • 10. Numeric attributes and 1R
    Applying the steps we get:
    The problem with this approach is that we can get a large number of divisions, i.e., overfitting
    Therefore we enforce a minimum number of instances per range; for example, taking min = 3 in the above example, we get:
  • 11. Numeric attributes and 1R
    When two adjacent divisions have the same majority class, we can join these two divisions
    So after this we will get:
    Which gives the following classification rules:
  • 12. Statistical Modeling
    Another classification technique
    Assumptions (for a given class):
    All attributes contributes equally to decision making
    All attributes are independent of each other
  • 13. Statistical Modeling: An example
    Given Data:
  • 14. Statistical Modeling: An example
    Data description:
    The upper half shows how many times a value of an attribute occurs for a class
    The lower half shows the same data as fractions
    For example, class = yes occurs 9 times
    For class = yes, outlook = sunny occurs 2 times
    So under outlook = sunny and class = yes we have 2/9
  • 15. Statistical Modeling
    Problem at hand:
    Taking into consideration that all attributes contribute equally and are independent:
    Likelihood of yes = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
    Likelihood of no = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
  • 16. Statistical Modeling: An example
    Solution continued:
    As can be observed, the likelihood of no is higher
    Using normalization, we can calculate probability as:
    Probability of yes = (.0053)/(.0053 + .0206) = 20.5%
    Probability of no = (.0206)/(.0053 + .0206) = 79.5%
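The arithmetic above can be checked directly; the fractions are the counts from the slides' weather table for the instance (outlook = sunny, temperature = cool, humidity = high, windy = true):

```python
# Likelihoods as products of per-attribute fractions times the class prior
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
like_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

# Normalization turns likelihoods into probabilities
p_yes = like_yes / (like_yes + like_no)
p_no = like_no / (like_yes + like_no)
```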
  • 17. Statistical Modeling: An example
    Derivation using Bayes’ rule:
    According to Bayes’ rule, for a hypothesis H and evidence E that bears on that hypothesis:
    P[H|E] = (P[E|H] x P[H]) / P[E]
    For our example, the hypothesis H is that play will be, say, yes, and E is the particular combination of attribute values at hand:
    Outlook = sunny(E1)
    Temperature = cool (E2)
    Humidity = high(E3)
    Windy = True (E4)
  • 18. Statistical Modeling: An example
    Derivation using Bayes’ rule:
    Now since E1, E2, E3 and E4 are independent, we have
    P[H|E] = (P[E1|H] x P[E2|H] x P[E3|H] x P[E4|H] x P[H]) / P[E]
    Replacing values from the table we get
    P[yes|E] = (2/9 x 3/9 x 3/9 x 3/9 x 9/14) / P[E]
    P[E] will be taken care of during the normalization of P[yes|E] and P[no|E]
    This method is called Naïve Bayes
  • 19. Problem and Solution for Naïve Bayes
    In case we have an attribute value Ea for which P[Ea|H] = 0, then irrespective of the other attributes, P[H|E] = 0
    We can add constants to the numerator and denominator, a technique called the Laplace estimator; for example, with weights p1, p2, p3 such that
    p1 + p2 + p3 = 1:
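A small sketch of the idea, assuming the common variant in which a constant mu is shared equally among the attribute's values (the exact weighting scheme on the image slide may differ):

```python
def laplace(count, total, n_values, mu=1.0):
    """Smoothed estimate: add mu to the denominator and mu/n_values to the
    numerator, so no conditional probability can ever be exactly zero."""
    return (count + mu / n_values) / (total + mu)

# Outlook counts for class = yes: sunny 2, overcast 4, rainy 3 (out of 9)
probs = [laplace(c, 9, 3) for c in (2, 4, 3)]
```

The smoothed estimates still sum to 1 over the three values, and a zero count no longer produces a zero probability.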
  • 20. Statistical Modeling: Dealing with missing attributes
    In case a value is missing for attribute Ea in the given data set, we just don’t count it while calculating P[Ea|H]
    In case an attribute is missing in the instance to be classified, its factor is simply left out of the expression for P[H|E]; for example, if outlook is missing we will have:
    Likelihood of Yes = 3/9 x 3/9 x 3/9 x 9/14 = 0.0238
    Likelihood of No = 1/5 x 4/5 x 3/5 x 5/14 = 0.0343
  • 21. Statistical Modeling: Dealing with numerical attributes
    Numeric values are handled by assuming that they have a normal (Gaussian) probability distribution
    For a normal distribution we have:
    f(x) = 1/(sigma*sqrt(2*pi)) * e^(-(x - u)^2 / (2*sigma^2))
    u = mean
    sigma = standard deviation
    x = instance under consideration
    f(x) = contribution of x to the likelihood figures
  • 22. Statistical Modeling: Dealing with numerical attributes
    An example, we have the data:
  • 23. Statistical Modeling: Dealing with numerical attributes
    So here we have calculated the mean and standard deviation for numerical attributes like temperature and humidity
    For temperature = 66
    So the contribution of temperature = 66 to P[yes|E] is 0.0340
    We do this similarly for other numerical attributes
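The 0.0340 figure can be reproduced with the normal density. The mean 73 and standard deviation 6.2 for temperature given class = yes are the values computed from the weather data; they come from the table slide that is missing from this transcript, so treat them as stated assumptions:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density: the contribution of a numeric value to the likelihood."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Assumed statistics for temperature given class = yes: mu = 73, sigma = 6.2
f = gaussian_pdf(66, 73, 6.2)
```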
  • 24. Divide-and-Conquer: Constructing Decision Trees
    Steps to construct a decision tree recursively:
    Select an attribute to be placed at the root node and make one branch for each possible value
    Repeat the process recursively at each branch, using only those instances that reach the branch
    If at any time all instances at a node have the same classification, stop developing that part of the tree
    Problem: how to decide which attribute to split on
  • 25. Divide-and-Conquer: Constructing Decision Trees
    Steps to find the attribute to split on:
    We consider all the possible attributes as options and branch them according to their different possible values
    Now for each possible attribute value we calculate the Information, and then find the Information gain for each attribute option
    Select for division the attribute that gives the maximum Information gain
    Do this until each branch terminates at a node which gives Information = 0
  • 26. Divide-and-Conquer: Constructing Decision Trees
    Calculation of Information and Gain:
    For data (P1, P2, …, Pn) such that P1 + P2 + … + Pn = 1:
    Information(P1, P2, …, Pn) = -P1 log P1 - P2 log P2 - … - Pn log Pn
    Gain = Information before division - Information after division
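The Information formula can be written directly, taking the class distribution as raw counts (log base 2 gives the answer in bits):

```python
import math

def info(counts):
    """Entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info([9, 5])  # entropy of the full weather data, about 0.940 bits
```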
  • 27. Divide-and-Conquer: Constructing Decision Trees
    Here we have considered each attribute individually
    Each is divided into branches according to its different possible values
    Below each branch the number of instances of each class is marked
  • 28. Divide-and-Conquer: Constructing Decision Trees
    Using the formulae for Information, initially we have
    Number of instances with class = Yes is 9
    Number of instances with class = No is 5
    So we have P1 = 9/14 and P2 = 5/14
    Info[9/14, 5/14] = -9/14log(9/14) -5/14log(5/14) = 0.940 bits
    Now, for example, let’s consider the Outlook attribute; we observe the following:
  • 29. Divide-and-Conquer: Constructing Decision Trees
    Example Contd.
    Gain by using Outlook for division = info([9,5]) – info([2,3],[4,0],[3,2])
    = 0.940 – 0.693 = 0.247 bits
    Gain (outlook) = 0.247 bits
    Gain (temperature) = 0.029 bits
    Gain (humidity) = 0.152 bits
    Gain (windy) = 0.048 bits
    So since Outlook gives the maximum gain, we will use it for division
    And we repeat the steps for Outlook = Sunny and Outlook = Rainy, and stop for Overcast since we have Information = 0 for it
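The gain figures above can be reproduced with a short helper; the per-branch class counts for Outlook ([2,3] sunny, [4,0] overcast, [3,2] rainy) are taken from the slides:

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(before, branches):
    """Information gain = info before the split minus the weighted info after."""
    total = sum(before)
    after = sum(sum(b) / total * info(b) for b in branches)
    return info(before) - after

g_outlook = gain([9, 5], [[2, 3], [4, 0], [3, 2]])  # about 0.247 bits
```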
  • 30. Divide-and-Conquer: Constructing Decision Trees
    Highly branching attributes: The problem
    If we follow the previously described method, it will always favor an attribute with the largest number of branches
    In extreme cases it will favor an attribute which has a different value for each instance: an identification code
  • 31. Divide-and-Conquer: Constructing Decision Trees
    Highly branching attributes: The problem
    Information for such an attribute is 0
    info([0,1]) + info([0,1]) + info([0,1]) + … + info([0,1]) = 0
    It will hence have the maximum gain and will be chosen for branching
    But such an attribute is neither good for predicting the class of an unknown instance nor does it tell anything about the structure of the division
    So we use gain ratio to compensate for this
  • 32. Divide-and-Conquer: Constructing Decision Trees
    Highly branching attributes: Gain ratio
    Gain ratio = gain/split info
    To calculate the split info, we just consider the number of instances covered by each attribute value, irrespective of the class
    Then we calculate the split info; so for an identification code with 14 different values we have:
    info([1,1,1,…,1]) = 14 x (-1/14 x log(1/14)) = 3.807
    For Outlook we will have the split info:
    info([5,4,5]) = -5/14 x log(5/14) - 4/14 x log(4/14) - 5/14 x log(5/14) = 1.577
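Both split-info figures check out numerically with the same entropy helper, applied to the branch sizes irrespective of class:

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

split_id = info([1] * 14)        # identification code: 14 singleton branches
split_outlook = info([5, 4, 5])  # outlook: sunny 5, overcast 4, rainy 5
```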
  • 33. Divide-and-Conquer: Constructing Decision Trees
    Highly branching attributes: Gain ratio
    So we have:
    And for the ‘highly branched attribute’, gain ratio = 0.940/3.807 = 0.247
  • 34. Divide-and-Conquer: Constructing Decision Trees
    Highly branching attributes: Gain ratio
    Though the ‘highly branched attribute’ still has the maximum gain ratio, its advantage is greatly reduced
    Problem with using gain ratio:
    In some situations the gain ratio modification overcompensates and can lead to preferring an attribute just because its intrinsic information is much lower than that of the other attributes
    A standard fix is to choose the attribute that maximizes the gain ratio, provided that the information gain of that attribute is at least as great as the average information gain over all the attributes examined
  • 35. Covering Algorithms: Constructing rules
    Consider each class in turn
    Seek a way of covering all instances in it, while excluding instances not belonging to this class
    Identify a rule to do so
    This is called a covering approach because at each stage we identify a rule that covers some of the instances
  • 36. Covering Algorithms: Constructing rules
    Rules for class = a:
    • If x > 1.2 then class = a
    • 37. If x > 1.2 and y > 2.6 then class = a
    • 38. If x > 1.2 and y > 2.6 then class = a
    If x > 1.4 and y < 2.4 then class = a
  • 39. Covering Algorithms: Constructing rules
    Rules vs. trees:
    A covering algorithm covers only a single class at a time, whereas division takes all the classes into account, since a decision tree creates a combined concept description
    The problem of replicated subtrees is avoided with rules
    Tree for the previous problem:
  • 40. Covering Algorithms: Constructing rules
    PRISM Algorithm: A simple covering algorithm
    Instance space after addition of rules:
  • 41. Covering Algorithms: Constructing rules
    PRISM Algorithm: Criteria to select an attribute for division
    Include as many instances of the desired class and exclude as many instances of other class as possible
    If a new rule covers t instances, of which p are positive examples of the class and t-p are instances of other classes (i.e. errors), then try to maximize p/t
  • 42. Covering Algorithms: Constructing rules
    PRISM Algorithm: Example data
  • 43. Covering Algorithms: Constructing rules
    PRISM Algorithm: In action
    We start with the class = hard and have the following rule:
    If ? then recommendation = hard
    Here ? represents an unknown condition
    For the unknown we have nine choices:
  • 44. Covering Algorithms: Constructing rules
    PRISM Algorithm: In action
    Here the maximum p/t ratio is for astigmatism = yes (choosing randomly between equivalent options in case their coverage is also the same)
    So we get the rule:
    If astigmatism = yes then recommendation = hard
    We won’t stop at this rule, as this rule gives only 4 correct results out of the 12 instances it covers
    We remove the correctly covered instances of the above rule from our example set and start with the rule:
    If astigmatism = yes and ? then recommendation = hard
  • 45. Covering Algorithms: Constructing rules
    PRISM Algorithm: In action
    Now we have the data as:
  • 46. Covering Algorithms: Constructing rules
    PRISM Algorithm: In action
    And the choices for this data are:
    We choose tear production rate = normal, which has the highest p/t
  • 47. Covering Algorithms: Constructing rules
    PRISM Algorithm: In action
    So we have the rule:
    If astigmatism = yes and tear production rate = normal then recommendation = hard
    Again, we remove matched instances, now we have the data:
  • 48. Covering Algorithms: Constructing rules
    PRISM Algorithm: In action
    Now again using t/p we finally have the rule (based on maximum coverage):
    If astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
    And so on. …..
  • 49. Covering Algorithms: Constructing rules
    PRISM Algorithm: Pseudo Code
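Since the pseudo-code slide is an image, here is a hedged Python sketch of the rule-growing loop: keep adding the attribute-value test with the best p/t until the rule is exact. The dataset and attribute names are toy assumptions, not the contact-lens data:

```python
def grow_rule(rows, cls, attrs, target):
    """PRISM-style growth of one rule for class `cls`: repeatedly add the
    attribute-value test that maximizes p/t (ties broken by larger p)
    until the rule covers only instances of `cls`."""
    conditions = {}
    covered = rows
    while any(r[target] != cls for r in covered):
        best = None
        for a in attrs:
            if a in conditions:
                continue
            for v in {r[a] for r in covered}:
                subset = [r for r in covered if r[a] == v]
                t = len(subset)
                p = sum(r[target] == cls for r in subset)
                score = (p / t, p)
                if best is None or score > best[0]:
                    best = (score, a, v, subset)
        if best is None:  # no test left to add
            break
        _, a, v, covered = best
        conditions[a] = v
    return conditions

# Toy data: class 'hard' is perfectly captured by x = b
rows = [
    {"x": "a", "y": "p", "c": "hard"},
    {"x": "a", "y": "q", "c": "soft"},
    {"x": "b", "y": "p", "c": "hard"},
    {"x": "b", "y": "q", "c": "hard"},
]
rule = grow_rule(rows, "hard", ["x", "y"], "c")
```

The full algorithm would then remove the covered instances and grow further rules until every instance of the class is covered.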
  • 50. Covering Algorithms: Constructing rules
    Rules Vs decision lists
    The rules produced, for example by the PRISM algorithm, are not necessarily meant to be interpreted in order like decision lists
    There is no order in which classes should be considered while generating the rules
    Using rules for classification, one instance may receive multiple classifications or no classification at all
    In such cases, go for the rule with the maximum coverage and the most training examples, respectively
    These difficulties are not there with decision lists, as they are interpreted in order and have a default rule at the end
  • 51. Mining Association Rules
    An association rule can predict any number of attributes and also any combination of attributes
    Parameters for selecting an association rule:
    Coverage: the number of instances the rule predicts correctly
    Accuracy: the ratio of the coverage to the total number of instances the rule is applicable to
    We want association rules with high coverage and a minimum specified accuracy
  • 52. Mining Association Rules
    Item set: a combination of attributes
    Item: an attribute–value pair
    An example:
    For the weather data we have a table with each column containing item sets with a different number of attributes
    With each entry the coverage is also given
    The table is not complete; it just gives us a good idea
  • 53. Mining Association Rules
  • 54. Mining Association Rules
    Generating association rules:
    We need to specify beforehand a minimum coverage and accuracy for the rules to be generated
    Generate the item sets
    Each item set can be permuted to generate a number of rules
    For each rule, check if the coverage and accuracy are appropriate
    This is how we generate association rules
  • 55. Mining Association Rules
    Generating Association rules:
    For example if we take the item set:
    humidity = normal, windy = false, play = yes
    This gives seven potential rules (with accuracy):
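The seven potential rules come from taking every proper subset of the three-item set as the antecedent (including the empty antecedent) with the remaining items as the consequent; a quick check:

```python
from itertools import combinations

items = [("humidity", "normal"), ("windy", "false"), ("play", "yes")]

rules = []
for size in range(len(items)):  # antecedent may be empty but not the full set
    for lhs in combinations(items, size):
        rhs = tuple(i for i in items if i not in lhs)
        rules.append((lhs, rhs))

len(rules)  # 1 + 3 + 3 = 7 potential rules
```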
  • 56. Linear models
    We will look at methods to deal with the prediction of numerical quantities
    We will see how to use numerical methods for classification
  • 57. Linear models
    Numerical Prediction: Linear regression
    Linear regression is a technique to predict numerical quantities
    Here we express the class (a numerical quantity) as a linear combination of the attributes with predetermined weights
    For example, if we have attributes a1, a2, a3, …, ak:
    x = w0 + w1*a1 + w2*a2 + … + wk*ak
    Here x represents the predicted class and w0, w1, …, wk are the predetermined weights
  • 58. Linear models
    Numerical Prediction: Linear regression
    The weights are calculated by using the training set
    To choose optimum weights we select the weights that minimize the sum of squared errors:
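For a single attribute the minimum-squared-error weights have a closed form; a minimal sketch:

```python
def fit_line(xs, ys):
    """Least-squares weights w0, w1 minimizing sum((y - (w0 + w1*x))^2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Training data lying exactly on the line y = 1 + 2x
w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With more than one attribute the same criterion leads to solving the normal equations for the full weight vector.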
  • 59. Linear models
    Linear classification: Multi response linear regression
    For each class we use linear regression to get a linear expression
    When the instance belongs to the class the output is 1, otherwise 0
    Now for an unclassified instance we evaluate the expression for each class and get an output
    The class whose expression gives the maximum output is selected as the predicted class
    This method has the drawback that the values produced are not proper probabilities
  • 60. Linear models
    Linear classification: Logistic regression
    To get the output as proper probabilities in the range 0 to 1, we use logistic regression
    Here the output y is defined as:
    y = 1/(1 + e^(-x))
    x = w0 + w1*a1 + w2*a2 + … + wk*ak
    So the output y will lie in the range (0, 1)
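The logistic transform itself is one line; note that the output approaches but never reaches 0 or 1:

```python
import math

def logistic(x):
    """Squash a linear score x into a probability-like value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```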
  • 61. Linear models
    Linear classification: Logistic regression
    To select appropriate weights for the expression for x, we maximize:
    To generalize logistic regression we can do the calculation like we did in multi-response linear regression
    Again the problem with this approach is that the probabilities of the different classes do not sum to 1
  • 62. Linear models
    Linear classification using the perceptron
    If instances belonging to different classes can be divided in the instance space by hyperplanes, then they are called linearly separable
    If instances are linearly separable then we can use the perceptron learning rule for classification
    Let’s assume that we have only 2 classes
    The equation of the hyperplane is (with a0 = 1):
    w0*a0 + w1*a1 + w2*a2 + … + wk*ak = 0
  • 63. Linear models
    Linear classification using the perceptron
    Steps (contd.):
    If the sum (mentioned in the previous step) is greater than 0 then we have the first class, else the second one
    The algorithm to get the weights, and hence the equation of the dividing hyperplane (the perceptron), is:
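A hedged sketch of the perceptron learning rule described above: labels are +1/-1, a0 = 1 folds the bias into the weights, and each mistake adds (label x instance) to the weight vector:

```python
def train_perceptron(data, epochs=100):
    """data: list of (attributes, label) pairs with label +1 or -1.
    On each misclassification, add label * instance to the weights."""
    k = len(data[0][0])
    w = [0.0] * (k + 1)  # w[0] is the bias weight for a0 = 1
    for _ in range(epochs):
        mistakes = 0
        for attrs, label in data:
            x = [1.0] + list(attrs)
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (s > 0) != (label > 0):  # misclassified
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged: the data was linearly separable
            break
    return w

# Linearly separable toy data (logical AND)
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(data)
```

On linearly separable data the loop is guaranteed to terminate with a separating hyperplane; on non-separable data it would cycle, which is why the epoch cap is there.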
  • 64. Instance-based learning
    General steps:
    No preprocessing of the training set; just store the training instances as they are
    To classify a new instance, calculate its distance to every stored training instance
    The unclassified instance is allotted the class of the instance which has the minimum distance from it
  • 65. Instance-based learning
    The distance function
    The distance function we use depends on our application
    Some of the popular distance functions are the Euclidean distance, the Manhattan distance metric, etc.
    The most popular distance metric is the Euclidean distance (between two instances), given by:
    k is the number of attributes
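The distance computation and the nearest-neighbor lookup described above, as a brute-force sketch:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two instances over k numeric attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(train, query):
    """Allot the class of the stored instance nearest to the query."""
    return min(train, key=lambda inst: euclidean(inst[0], query))[1]

# Two stored instances with their classes
train = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
```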
  • 66. Instance-based learning
    Normalization of data:
    We normalize attributes such that they lie in the range [0,1], by using the formula:
    a = (v - min v) / (max v - min v)
    Missing attributes:
    In case of nominal attributes, if either of the two values is missing, or if the values are different, the distance is taken as 1
    For numeric attributes, if both values are missing then the difference is 1. If only one value is missing then the difference is either the normalized value of the given attribute or one minus that value, whichever is bigger
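The normalization is the usual min-max rescaling; a one-line sketch (the slides' formula image is missing from this transcript, so this is the standard form):

```python
def normalize(v, vmin, vmax):
    """Min-max rescaling of attribute value v into [0, 1]."""
    return (v - vmin) / (vmax - vmin)
```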
  • 67. Instance-based learning
    Finding nearest neighbors efficiently:
    Finding the nearest neighbor by calculating the distance to every instance is linear in the number of instances
    We make this faster by using kd-trees
    They are binary trees that divide the input space with a hyperplane and then split each partition again, recursively
    A kd-tree stores the points in k-dimensional space, k being the number of attributes
  • 68. Instance-based learning
    Finding nearest neighbors efficiently:
  • 69. Instance-based learning
    Finding nearest neighbors efficiently:
    Here we see a kd-tree and the instances and splits with k = 2
    As you can see, not all child nodes are developed to the same depth
    We have mentioned the axis along which the division has been done (v or h in this case)
    Steps to find the nearest neighbor:
    Construct the kd-tree (explained later)
    Now start from the root node and, comparing the appropriate attribute (based on the axis along which the division has been done), move to the left or the right subtree
  • 70. Instance-based learning
    Steps to find the nearest neighbor (contd.):
    Repeat this step recursively till you reach a node which is either a leaf node or has no appropriate child node (left or right)
    Now you have found the region to which the new instance belongs
    You also have a probable nearest neighbor in the form of the region’s leaf node (or its immediate neighbor)
    Calculate the distance of the instance to the probable nearest neighbor. Any closer instance must lie in a circle with radius equal to this distance
  • 71. Instance-based learning
    Finding nearest neighbors efficiently:
    Steps to find the nearest neighbor (contd.):
    Now we redo our recursive trace, looking for an instance which is closer to our unclassified instance than the probable nearest neighbor we have
    We start with the immediate neighbor; if it lies in the circle then we will have to consider it and all its child nodes (if any)
    If the condition of the previous step is not true, then we check the siblings of the parent of our probable nearest neighbor
    We repeat these steps till we reach the root
    In case we find instance(s) which are nearer, we update the nearest neighbor
  • 72. Instance-based learning
    Steps to find the nearest neighbor (contd.):
  • 73. Instance-based learning
    Construction of a kd-tree:
    We need to figure out two things to construct a kd-tree:
    Along which dimension to make the cut
    Which instance to use to make the cut
    Deciding the dimension to make the cut:
    We calculate the variance along each axis
    The division is done perpendicular to the axis with the greatest variance
    Deciding the instance to be used for division:
    Just take the median as the point of division
    We repeat these steps recursively till all the points are exhausted
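A minimal construction sketch. For brevity it cycles through the axes with depth, a common variant; the slides instead pick the axis of greatest variance. Both use the median point as the cut:

```python
def build_kd(points, depth=0):
    """Recursively split on the median along the axis chosen by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],   # the cut point stored at this node
        "axis": axis,           # dimension perpendicular to the cut
        "left": build_kd(points[:mid], depth + 1),
        "right": build_kd(points[mid + 1:], depth + 1),
    }

tree = build_kd([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```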
  • 74. Clustering
    Clustering techniques apply when, rather than predicting the class, we just want the instances to be divided into natural groups
    Iterative instance-based learning: k-means
    Here k represents the number of clusters
    The instance space is divided into k clusters
    k-means forms the clusters so that the sum of squared distances of the instances from their cluster centers is minimized
  • 75. Clustering
    Decide the number of clusters, k, manually
    Now from the instance set to be clustered, randomly select k points. These will be the initial cluster centers of our k clusters
    Now take each instance one by one, calculate its distance from all the cluster centers, and allot it to the cluster for which it has the minimum distance
    Once all the instances have been assigned, take the centroid of all the points in each cluster. This centroid gives the new cluster center
    Again re-cluster all the instances, followed by taking the centroids to get yet another set of cluster centers
    Repeat the previous step till we reach the stage at which the cluster centers don’t change. Stop here; we have our k clusters
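The steps above as a sketch. The initial centers are a parameter here so the run is reproducible; the slides pick them randomly from the instance set:

```python
def kmeans(points, centers, iters=100):
    """Lloyd iteration: assign each point to its nearest center, then move
    each center to the centroid of its cluster, until nothing changes."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers no longer change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy groups; centers seeded one in each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, [(0, 0), (10, 10)])
```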