Decision tree lecture 3


Published on

Published in: Education, Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Decision tree lecture 3

  1. 1. 3/1/2012 Outline  Introduction  Basic Algorithm for Decision Tree Induction  Attribute Selection Measures – Information Gain – Gain Ratio – Gini Index  Tree Pruning  Scalable Decision Tree Induction Methods 2 1. Introduction Decision Tree Induction The Decision Tree is one of the most powerful and popular classification and prediction algorithms in current use in data mining and machine learning. The attractiveness of decision trees is due to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed so that humans can understand them or even directly used in a database access language like SQL so that records falling into a particular category may be retrieved.• A decision tree is a flowchart classifier like tree structure, where – each internal node (non-leaf node, decision node) denotes a test on an attribute – each branch represents an outcome of the test – each leaf node (or terminal node) indicates the value of the target attribute (class) of examples. 3 – The topmost node in a tree is the root node 1
  2. 2. 3/1/2012 A decision tree consists of nodes and arcs which connect nodes. To make a decision, one starts at the root node, and asks questions to determine which arc to follow, until one reaches a leaf node and the decision is made. How are decision trees used for classification? – Given an instance, X, for which the associated class label is unknown – The attribute values of the instance are tested against the decision tree – A path is traced from the root to a leaf node, which holds the class prediction for that instance. Applications Decision tree algorithms have been used for classification in many application areas, such as: – Medicine – Manufacturing and production – Financial analysis – Astronomy – Molecular biology. 4• Advantages of decision tree– The construction of decision tree classifiers does not parameter setting.– Decision trees can handle high dimensional data.– Easy to interpret for small-sized trees– The learning and classification steps of decision tree induction are simple and fast.– Accuracy is comparable to other classification techniques for many simple data sets– Convertible to simple and easy to understand classification rules 5 2
  3. 3. 3/1/2012 2. Basic Algorithm Decision Tree Algorithms• ID3 algorithm• C4.5 algorithm - A successor of ID3 – Became a benchmark to which newer supervised learning algorithms are often compared. – Commercial successor: C5.0• CART (Classification and Regression Trees) algorithm – The generation of binary decision trees – Developed by a group of statisticians 6 Basic Algorithm• Basic algorithm ,[ID3, C4.5, and CART], (a greedy algorithm)– Tree is constructed in a top-down recursive divide-and- conquer manner– At start, all the training examples are at the root– Attributes are categorical (if continuous-valued, they are discretized in advance)– Examples are partitioned recursively into smaller subsets as the tree is being built based on selected attributes– Test attributes are selected on the basis of a statistical measure (e.g., information gain) 7 3
  4. 4. 3/1/2012 ID3 Algorithm function ID3 (I, 0, T) { /* I is the set of input attributes (non-target attributes) * O is the output attribute (the target attribute) * S is a set of training data * function ID3 returns a decision tree */ begin if (S is empty) { return a single node with the value "Failure"; } if (all records in S have the same value for the target attribute O) { return a single leaf node with that value; if (I is empty) { return a single node with the value of the most frequent value of O that are found in records of S; /* Note: some elements in this node will be incorrectly classified */ } /* now handle the case where we can’t return a single node */ compute the information gain for each attribute in I relative to S; let A be the attribute with largest Gain(A, S) of the attributes in I; } let {aj| j=1,2, .., m} be the values of attribute A; let {Sj| j=1,2, .., m} be the subsets of S when S is partitioned according the value of A; Return a tree with the root node labeled A and arcs labeled a1, a2, .., am, where the arcs go to the trees ID3(I-{A}, O, S1), ID3(I-{A}, O, S2), .., ID3(I-{A}, O, Sm); Recursively apply ID3 to subsets {Sj| j=1,2, .., m} until they are empty end } 8 3.Attribute Selection MeasuresWhich is the best attribute? – Want to get the smallest tree – choose the attribute that produces the “purest” nodesThree popular attribute selection measures: – Information gain – Gain ratio – Gini index 9 4
  5. 5. 3/1/2012 Information gain• The estimation criterion in the decision tree algorithm is the selection of an attribute to test at each decision node in the tree.• The goal is to select the attribute that is most useful for classifying examples. A good quantitative measure of the worth of an attribute is a statistical property called information gain that measures how well a given attribute separates the training examples according to their target classification.• This measure is used to select among the candidate attributes at each step while growing the tree. 10Entropy - a measure of homogeneity of the set of examples • In order to define information gain precisely, we need to define a measure commonly used in information theory, called entropy (Expected information, info(),). • Given a set S, containing only positive and negative examples of some target concept (a 2 class problem), the entropy of set S relative to this simple, binary classification is defined as: Info(S) = • where pi is the proportion of S belonging to class i. Note the logarithm is still base 2 because entropy is a measure of the expected encoding length measured in bits. • In all calculations involving entropy we define 0log0 to be 0. 11 5
  6. 6. 3/1/2012• For example, suppose S is a collection of 25 examples, including 15 positive and 10 negative examples [15+, 10-]. Then the entropy of S relative to this classification is : Entropy(S) = - (15/25) log2 (15/25) - (10/25) log2 (10/25) = 0.970• Notice that the entropy is 0 if all members of S belong to the same class. For example, Entropy(S) = -1 log2(1) - 0 log20 = -1 0 - 0 log20 = 0.• Note the entropy is 1 (at its maximum!) when the collection contains an equal number of positive and negative examples.• If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1. Figure 1 shows the form of the entropy function relative to a binary classification, as p+ varies between 0 and 1. 12 Figure 1: The entropy function relative to a binary classification, as the proportion of positive examples pp varies between 0 and 1.Entropy of S = Info(S)-The average amount of information needed to identify the class label of an instance in D.- A measure of the impurity in a collection of training examples- The smaller information required, the greater the purity of the partitions. 13 6
  7. 7. 3/1/2012• Information gain measures the expected reduction in entropy caused by partitioning the examples according to this attribute.• The information gain, Gain (S, A) of an attribute A, relative to a collection of examples S, is defined as = info (S) – infoA (s) = information needed before splitting – information needed after splitting• where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s  S | A(s) = v }). Note the first term in the equation for Gain is just the entropy of the original collection S and the second term is infoA (S), the expected value of the entropy after S is partitioned using attribute A 14 An example: Weather Data The aim of this exercise is to learn how ID3 works. You will do this by building a decision tree by hand for a small dataset. At the end of this exercise you should understand how ID3 constructs a decision tree using the concept of Information Gain. You will be able to use the decision tree you create to make a decision about new data. 15 7
  8. 8. 3/1/2012 • In this dataset, there are five categorical attributes outlook, temperature, humidity, windy, and play. • We are interested in building a system which will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature, humidity, and windy. • We can think of the attribute we wish to predict, i.e. play, as the output attribute, and the other attributes as input attributes. • In this problem we have 14 examples in which: 9 examples with play= yes and 5 examples with play = no So, S={9,5}, and Entropy(S) = info (S) = info([9,5] ) = Entropy(9/14, 5/14) = -9/14 log2 9/14 – 5/14 log2 5/14 = 0.940 16consider splitting on Outlook attributeOutlook = Sunnyinfo([2; 3]) = entropy(2/5 ; 3/5 ) = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971 bitsOutlook = Overcastinfo([4; 0]) = entropy(4/4,0/4) = -1 log2 1 - 0 log2 0 = 0 bitsOutlook = Rainyinfo([3; 2]) = entropy(3/5,2/5)= - 3/5 log2 3/5 – 2/5 log2 2/5 =0.971 bitsSo, the expected information needed to classify objects in all sub trees of theOutlook attribute is :info outlook (S) = info([2; 3]; [4; 0]; [3; 2]) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693 bitsinformation gain = info before split - info after splitgain(Outlook) = info([9; 5]) - info([2; 3]; [4; 0]; [3; 2]) = 0.940 - 0.693 = 0.247 bits 17 8
  9. 9. 3/1/2012consider splitting on Temperature attributetemperature = Hotinfo([2; 2]) = entropy(2/4 ; 2/4 ) = -2/4 log2 2/4 - 2/4 log2 2/4 = = 1 bits temperature = mildinfo([4; 2]) = entropy(4/6,2/6) = -4/6 log2 4/6 - 2/6 log2 2/6 = = 0.918 bits temperature = coolinfo([3; 1]) = entropy(3/4,1/4)= - 3/4 log2 3/4 – 1/4 log2 1/4 =0.881 bitsSo, the expected information needed to classify objects in all sub trees of thetemperature attribute is:info([2; 2]; [4; 2]; [3; 1]) = 4/14 * 1 + 6/14 * 0.981 + 4/14 * 0.881= 0.911 bitsinformation gain = info before split - info after splitgain(temperature) = 0.940 - 0.911 = 0.029 bits 18 • Complete in the same way we get: gain(Outlook ) = 0.247 bits gain(Temperature ) = 0.029 bits gain(Humidity ) = 0.152 bits gain(Windy ) = 0.048 bit • And the selected attribute will be the one with largest information gain = outlook • Then Continuing to split ……. 19 9
  10. 10. 3/1/2012gain(temperature) = 0.571bits gain(humidity) = 0.971bits gain(Windy) = 0.020 bits 20 The output decision tree 21 10
  11. 11. 3/1/2012ID3 versus C4.5• ID3 uses information gain• C4.5 can use either information gain or gain ratio• C4.5 can deal with -numeric/continuous attributes -missing values -noisy data• Alternate method: classification and regression trees(CART) 22 Decision trees advantages• Requires little data preparation• Are able to handle both categorical and numerical data• Are simple to understand and interpret• Generate models that can be statistically validated• The construction of decision tree classifiers does not parameter setting.• Decision trees can handle high dimensional data• perform well with large data in a short time• The learning and classification steps of decision tree induction are simple and fast.• Accuracy is comparable to other classification techniques for many simple data sets• Convertible to simple and easy to understand classification 23 rules 11