Machine learning Lecture 2

Machine learning lecture series by Ravi Gupta, AU-KBC Research Centre, MIT Campus, Anna University


  1. 1. Lecture No. 2 Ravi Gupta AU-KBC Research Centre, MIT Campus, Anna University Date: 8.3.2008
  2. 2. Today’s Agenda • Recap (FIND-S Algorithm) • Version Space • Candidate-Elimination Algorithm • Decision Tree • ID3 Algorithm • Entropy
  3. 3. Concept Learning as Search Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation. The goal of the concept learning search is to find the hypothesis that best fits the training examples.
  4. 4. General-to-Specific Learning Every day Tom enjoys his sport, i.e., only positive examples. Most General Hypothesis: h = <?, ?, ?, ?, ?, ?> Most Specific Hypothesis: h = <Ø, Ø, Ø, Ø, Ø, Ø>
  5. 5. General-to-Specific Learning h2 is more general than h1: h2 imposes fewer constraints on the instance than h1.
  6. 6. Definition Given hypotheses hj and hk, hj is more_general_than_or_equal_to hk if and only if any instance that satisfies hk also satisfies hj. We can also say that hj is more_specific_than hk when hk is more_general_than hj.
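For the conjunctive hypotheses used in these slides, this ordering can be checked mechanically. The sketch below is a minimal illustration, assuming hypotheses are encoded as tuples with '?' for any value and '0' for the empty constraint Ø; the encoding and function names are not part of the lecture.

```python
# Minimal check of the more_general_than_or_equal_to relation for
# conjunctive hypotheses ('?' = any value, '0' = empty constraint Ø).
def more_general_or_equal(hj, hk):
    """True if every instance that satisfies hk also satisfies hj."""
    if '0' in hk:                      # hk satisfies no instance at all
        return True
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')
h2 = ('Sunny', '?', '?', 'Strong', '?', '?')
print(more_general_or_equal(h2, h1))   # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))   # False
```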
  7. 7. FIND-S: Finding a Maximally Specific Hypothesis
  8. 8. Step 1: FIND-S h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
  9. 9. Step 2: FIND-S h0 = <Ø, Ø, Ø, Ø, Ø, Ø> a1 a2 a3 a4 a5 a6 x1 = <Sunny, Warm, Normal, Strong, Warm, Same> Iteration 1 h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
  10. 10. h1 = <Sunny, Warm, Normal, Strong, Warm, Same> Iteration 2 x2 = <Sunny, Warm, High, Strong, Warm, Same> h2 = <Sunny, Warm, ?, Strong, Warm, Same>
  11. 11. Iteration 3 x3 = <Rainy, Cold, High, Strong, Warm, Change> is a negative example, so FIND-S ignores it: h3 = <Sunny, Warm, ?, Strong, Warm, Same>
  12. 12. h3 = < Sunny, Warm, ?, Strong, Warm, Same > Iteration 4 x4 = < Sunny, Warm, High, Strong, Cool, Change > Step 3 Output h4 = <Sunny, Warm, ?, Strong, ?, ?>
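A minimal Python sketch of the FIND-S loop just traced; the tuple encoding ('?' for any value, '0' for Ø) and the data layout are assumptions for illustration.

```python
# FIND-S sketch: start from the most specific hypothesis and generalize it
# just enough to cover each positive example; negative examples are ignored.
def find_s(examples):
    n = len(examples[0][0])
    h = ['0'] * n                       # h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
    for x, label in examples:
        if not label:                   # skip negative examples
            continue
        for i, value in enumerate(x):
            if h[i] == '0':             # first positive example: copy its value
                h[i] = value
            elif h[i] != value:         # disagreement: generalize to '?'
                h[i] = '?'
    return tuple(h)

training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
print(find_s(training))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```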
  13. 13. Unanswered Questions by FIND-S • Has the learner converged to the correct target concept? • Why prefer the most specific hypothesis? • Are the training examples consistent?
  14. 14. Version Space The set of all hypotheses consistent with the training examples is called the version space (VS) with respect to the hypothesis space H and the given example set D.
  15. 15. Candidate-Elimination Algorithm The Candidate-Elimination algorithm finds all describable hypotheses that are consistent with the observed training examples. Unlike FIND-S, the hypotheses are refined using every example x, regardless of whether x is a positive or a negative example.
  16. 16. Candidate-Elimination Algorithm Earlier (i.e., in FIND-S) a hypothesis only had to cover the positive examples; here a hypothesis h is consistent with the training data D if h(x) = c(x) for every example <x, c(x)> in D.
  17. 17. LIST-THEN-ELIMINATE Algorithm to Obtain Version Space
  18. 18. LIST-THEN-ELIMINATE Algorithm to Obtain Version Space [Figure: the version space VS_H,D as the subset of the hypothesis space H that is consistent with the example set D.]
  19. 19. LIST-THEN-ELIMINATE Algorithm to Obtain Version Space • In principle, the LIST-THEN-ELIMINATE algorithm can be applied whenever the hypothesis space H is finite. • It is guaranteed to output all hypotheses consistent with the training data. • Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.
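A minimal sketch of LIST-THEN-ELIMINATE, assuming the hypothesis space H can be handed over as an explicit list of conjunctive hypotheses (function names are illustrative).

```python
# LIST-THEN-ELIMINATE sketch: keep every hypothesis consistent with the data.
def satisfies(h, x):
    """True if instance x satisfies hypothesis h ('?' matches anything)."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, examples):
    """h is consistent if it classifies every training example correctly."""
    return all(satisfies(h, x) == label for x, label in examples)

def list_then_eliminate(H, examples):
    """Return the version space VS_H,D by brute-force enumeration of H."""
    return [h for h in H if consistent(h, examples)]
```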
  20. 20. Candidate-Elimination Algorithm • The CANDIDATE-ELIMINATION algorithm works on the same principle as the above LIST-THEN-ELIMINATE algorithm. • It employs a much more compact representation of the version space. • Here the version space is represented by its most general and its most specific (least general) members. • These members form general and specific boundary sets that delimit the version space within the partially ordered hypothesis space.
  21. 21. [Figure: the version space delimited by the most general boundary and the least general (specific) boundary.]
  22. 22. Candidate-Elimination Algorithm
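A simplified sketch of the boundary-set updates for the conjunctive representation used here. It assumes consistent training data and specializes G only with attribute values drawn from the specific boundary, so it is an illustration rather than the full algorithm; on the four EnjoySport examples it reproduces the S and G sets traced on the following slides.

```python
# Simplified CANDIDATE-ELIMINATION sketch ('?' = any value, '0' = Ø).
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    if '0' in hk:
        return True
    return all(a == '?' or a == b for a, b in zip(hj, hk))

def generalize(s, x):
    """Minimal generalization of s so that it covers positive example x."""
    return tuple(v if c == '0' else (c if c == v else '?') for c, v in zip(s, x))

def specializations(g, x, s):
    """Minimal specializations of g that exclude negative example x,
    using attribute values taken from the specific boundary hypothesis s."""
    return [g[:i] + (s[i],) + g[i + 1:]
            for i in range(len(g))
            if g[i] == '?' and s[i] != '?' and s[i] != x[i]]

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = [tuple('0' for _ in range(n))]        # most specific boundary
    G = [tuple('?' for _ in range(n))]        # most general boundary
    for x, label in examples:
        if label:                             # positive example
            G = [g for g in G if satisfies(g, x)]
            S = [generalize(s, x) for s in S]
        else:                                 # negative example
            S = [s for s in S if not satisfies(s, x)]
            G = [h for g in G
                 for h in (specializations(g, x, S[0]) if satisfies(g, x) else [g])
                 if more_general_or_equal(h, S[0])]
    return S, G
```

With the four training examples of the worked example below, this sketch returns S4 = {<Sunny, Warm, ?, Strong, ?, ?>} and G4 = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}.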
  23. 23. Example Initialization: G0 ← {<?, ?, ?, ?, ?, ?>}, S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø>}
  24. 24. G0 ← {<?, ?, ?, ?, ?, ?>} S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø >} x1 = <Sunny, Warm, Normal, Strong, Warm, Same> Iteration 1 G1 ← {<?, ?, ?, ?, ?, ?>} S1 ← {< Sunny, Warm, Normal, Strong, Warm, Same >} x2 = <Sunny, Warm, High, Strong, Warm, Same> Iteration 2 G2 ← {<?, ?, ?, ?, ?, ?>} S2 ← {< Sunny, Warm, ?, Strong, Warm, Same >}
  25. 25. G2 ← {<?, ?, ?, ?, ?, ?>} S2 ← {< Sunny, Warm, ?, Strong, Warm, Same >} Iteration 3 x3 = <Rainy, Cold, High, Strong, Warm, Change> (negative; S2 is already consistent with it) S3 ← {< Sunny, Warm, ?, Strong, Warm, Same >} G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
  26. 26. S3 ← {< Sunny, Warm, ?, Strong, Warm, Same >} G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>} Iteration 4 x4 = <Sunny, Warm, High, Strong, Cool, Change> S4 ← {< Sunny, Warm, ?, Strong, ?, ? >} G4 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
  27. 27. Remarks on Version Spaces and Candidate-Elimination The version space learned by the CANDIDATE-ELIMINATION algorithm will converge toward the hypothesis that correctly describes the target concept, provided (1) there are no errors in the training examples, and (2) there is some hypothesis in H that correctly describes the target concept.
  28. 28. What Will Happen if the Training Data Contains Errors? (Suppose the second training example is incorrectly labeled 'No'.)
  29. 29. G0 ← {<?, ?, ?, ?, ?, ?>} S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø >} x1 = <Sunny, Warm, Normal, Strong, Warm, Same> Iteration 1 G1 ← {<?, ?, ?, ?, ?, ?>} S1 ← {< Sunny, Warm, Normal, Strong, Warm, Same >} x2 = <Sunny, Warm, High, Strong, Warm, Same> Iteration 2 G2 ← {<?, ?, Normal, ?, ?, ?>} S2 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}
  30. 30. G2 ← {<?, ?, Normal, ?, ?, ?>} S2 ← {< Sunny, Warm, Normal, Strong, Warm, Same >} Iteration 3 x3 = <Rainy, Cold, High, Strong, Warm, Change> (negative; both boundaries are already consistent with it) S3 ← {< Sunny, Warm, Normal, Strong, Warm, Same >} G3 ← {<?, ?, Normal, ?, ?, ?>}
  31. 31. S3 ← {< Sunny, Warm, Normal, Strong, Warm, Same >} G3 ← {<?, ?, Normal, ?, ?, ?>} Iteration 4 x4 = <Sunny, Warm, High, Strong, Cool, Change> S4 ← { } (empty) G4 ← { } (empty): the version space collapses because the training data are inconsistent.
  32. 32. What Will Happen if the Target Concept is Not Contained in the Hypothesis Space?
  33. 33. Remarks on Version Spaces and Candidate-Elimination The target concept is exactly learned when the S and G boundary sets converge to a single, identical, hypothesis.
  34. 34. Remarks on Version Spaces and Candidate-Elimination How Can Partially Learned Concepts Be Used? Suppose that no additional training examples are available beyond the four in our example, and the learner is now required to classify new instances that it has not yet observed.
  35. 35. Remarks on Version Spaces and Candidate-Elimination
  36. 36. Remarks on Version Spaces and Candidate-Elimination [Figure: a new instance that satisfies all six hypotheses in the version space, and so is classified as positive by every member.]
  37. 37. Remarks on Version Spaces and Candidate-Elimination [Figure: instances on which the version space is split: one satisfies three of the six hypotheses and not the other three, another satisfies only two and not the other four, so neither can be classified unambiguously.]
  38. 38. Remarks on Version Spaces and Candidate-Elimination
  39. 39. Decision Trees
  40. 40. Decision Trees • Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. • Decision trees can also be represented by if-then-else rules. • Decision tree learning is one of the most widely used approaches for inductive inference.
  41. 41. Decision Trees An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.
  42. 42. Decision Trees <Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong> PlayTennis = No
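The traversal described above can be written directly. The nested-dict tree encoding and the classify helper below are assumptions for illustration; the tree is the PlayTennis tree used on these slides.

```python
# Classify an instance by walking a decision tree stored as nested dicts:
# an inner node is {attribute: {value: subtree_or_leaf}}, a leaf is a string.
tree = {'Outlook': {
    'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    'Overcast': 'Yes',
    'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
}}

def classify(node, instance):
    while isinstance(node, dict):              # descend until a leaf is reached
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

x = {'Outlook': 'Sunny', 'Temperature': 'Hot', 'Humidity': 'High', 'Wind': 'Strong'}
print(classify(tree, x))                       # 'No'  ->  PlayTennis = No
```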
  43. 43. Decision Trees [Figure: a generic decision tree. Intermediate nodes test attributes (A1, A2, A3), edges carry attribute values, and leaf nodes hold output values.]
  44. 44. Decision Trees Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
  45. 45. Decision Trees
  46. 46. Decision Trees (F = A ^ B') If-then-else form: If (A = True and B = False) then Yes, else No. [Tree: root tests A; A = False → No; A = True → test B; B = False → Yes, B = True → No.]
  47. 47. Decision Trees (F = A V (B ^ C)) If-then-else form: If (A = True) then Yes, else if (B = True and C = True) then Yes, else No. [Tree: root tests A; A = True → Yes; A = False → test B; B = False → No; B = True → test C; C = False → No, C = True → Yes.]
  48. 48. Decision Trees (F = A XOR B) F = (A ^ B') V (A' ^ B) If-then-else form: If (A = True and B = False) then Yes, else if (A = False and B = True) then Yes, else No. [Tree: root tests A; each branch then tests B; A = False: B = False → No, B = True → Yes; A = True: B = False → Yes, B = True → No.]
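The three Boolean trees above translate directly into nested if-else code; the sketch below (including the small truth-table check) is illustrative only.

```python
# The three example trees written as if-then-else rules.
def f1(a, b):            # F = A ^ B'
    if a:
        return not b
    return False

def f2(a, b, c):         # F = A v (B ^ C)
    if a:
        return True
    if b:
        return c
    return False

def f3(a, b):            # F = A XOR B = (A ^ B') v (A' ^ B)
    if a:
        return not b
    return b

for a in (False, True):
    for b in (False, True):
        print(a, b, f1(a, b), f3(a, b))
```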
  49. 49. Decision Trees as If-then-else Rules If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes; If (Outlook = Overcast) then PlayTennis = Yes; If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes. Each rule is a conjunction of attribute tests, and the rule set as a whole is a disjunction.
  50. 50. Problems Suitable for Decision Trees • Instances are represented by attribute-value pairs: instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing Temperature numerically). • The target function has discrete output values • Disjunctive descriptions may be required • The training data may contain errors • The training data may contain missing attribute values
  51. 51. Basic Decision Tree Learning Algorithm • ID3 algorithm (Quinlan 1986) and its successors C4.5 and C5.0 (http://www.rulequest.com/Personal/). • Employs a top-down, greedy search through the space of possible decision trees: the algorithm never backtracks to reconsider earlier choices. • A learned tree classifies an instance by starting at the root node, testing the attribute specified by that node, then moving down the branch corresponding to the attribute's value; the process repeats for the subtree rooted at the new node.
  52. 52. ID3 Algorithm
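A minimal sketch of the ID3 recursion for discrete attributes, assuming examples are stored as dicts of attribute values; entropy and information gain are defined on the slides that follow, and the function names here are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, target, attribute):
    """Expected reduction in entropy from partitioning on `attribute`."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                  # pure node: return the label
        return labels[0]
    if not attributes:                         # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, target, a))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, rest)   # grow one branch per value
    return tree
```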
  53. 53. Example
  54. 54. Attributes… The attributes are Outlook, Temperature, Humidity, and Wind.
  55. 55. Building Decision Tree
  56. 56. Building Decision Tree [Figure: the generic tree structure again; an attribute A1 at the root, with branches for its values leading either to output values or to further attribute tests A2, A3.]
  57. 57. Building Decision Tree Which attribute should be selected for the root node: Outlook, Temperature, Humidity, or Wind?
  58. 58. Which Attribute to Select? • We would like to select the attribute that is most useful for classifying examples. • What is a good quantitative measure of the worth of an attribute? ID3 uses a statistical measure called information gain to select among the candidate attributes at each step while growing the tree.
  59. 59. Information Gain Information gain is based on an information theory concept called entropy. "Nothing in life is certain except death, taxes and the second law of thermodynamics. All three are processes in which useful or accessible forms of some quantity, such as energy or money, are transformed into useless, inaccessible forms of the same quantity. That is not to say that these three processes don't have fringe benefits: taxes pay for roads and schools; the second law of thermodynamics drives cars, computers and metabolism; and death, at the very least, opens up tenured faculty positions." Seth Lloyd, writing in Nature 430, 971 (26 August 2004). Rudolf Julius Emanuel Clausius (January 2, 1822 – August 24, 1888) was a German physicist and mathematician and is considered one of the central founders of the science of thermodynamics. Claude Elwood Shannon (April 30, 1916 – February 24, 2001), an American electrical engineer and mathematician, has been called "the father of information theory".
  60. 60. Entropy • In information theory, the Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable. • It quantifies the information contained in a message, usually in bits or bits/symbol. • It is the minimum message length necessary to communicate information.
  61. 61. Why Shannon named his uncertainty function "entropy" [Photo: John von Neumann] Shannon recalled: "My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'"
  62. 62. Shannon's Mouse Shannon and his famous electromechanical mouse Theseus, named after the Greek mythological hero of Minotaur and Labyrinth fame, which he tried to teach to find its way out of a maze in one of the first experiments in artificial intelligence.
  63. 63. Entropy The information entropy of a discrete random variable X that can take on possible values {x1, ..., xn} is H(X) = E[I(X)] = - Σi p(xi) log2 p(xi), where I(X) is the information content or self-information of X, which is itself a random variable, and p(xi) = Pr(X = xi) is the probability mass function of X.
  64. 64. Entropy in our Context Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification (yes/no) is Entropy(S) = - p+ log2(p+) - p- log2(p-), where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
  65. 65. Example There are 14 examples: 9 positive and 5 negative [9+, 5-]. The entropy of S relative to this boolean (yes/no) classification is Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.
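A quick numeric check of this value (the helper below is an illustration, not from the slides):

```python
import math

def entropy(p_pos, p_neg):
    # 0 * log2(0) is taken to be 0, as defined on the previous slide
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(f"{entropy(9/14, 5/14):.3f}")   # 0.940
```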
  66. 66. Information Gain Measure Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) Entropy(Sv), where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v, i.e., Sv = {s in S | A(s) = v}.
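The same definition written in terms of the [p+, n-] counts used on the following slides (the helper names are assumptions):

```python
import math

def entropy(pos, neg):
    """Entropy of a [pos+, neg-] collection, with 0 * log2(0) taken as 0."""
    total = pos + neg
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n > 0)

def gain(pos, neg, partitions):
    """Gain(S, A) from the (pos_v, neg_v) counts of each value v of attribute A."""
    total = pos + neg
    return entropy(pos, neg) - sum(
        ((p + n) / total) * entropy(p, n) for p, n in partitions)
```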
  67. 67. Information Gain Measure In the definition above, the first term is the entropy of S and the second term is the expected entropy of S after the partition. Gain(S, A) is the expected reduction in entropy caused by knowing the value of attribute A. Gain(S, A) is the information provided about the target function value, given the value of some other attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
  68. 68. Example There are 14 examples: 9 positive and 5 negative [9+, 5-]. The entropy of S relative to this boolean (yes/no) classification is Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.
  69. 69. Gain (S, Attribute = Wind)
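Assuming the usual 14-example PlayTennis counts (Wind = Weak gives [6+, 2-], Wind = Strong gives [3+, 3-]), a quick check of Gain(S, Wind):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n > 0)

# S = [9+, 5-]; Wind = Weak -> [6+, 2-], Wind = Strong -> [3+, 3-]
gain_wind = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(f"{gain_wind:.3f}")   # 0.048
```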
  70. 70. Gain (S,A)
  71. 71. Gain (SSunny, A) Temperature: (Hot) {0+, 2-}, (Mild) {1+, 1-}, (Cool) {1+, 0-}. Humidity: (High) {0+, 3-}, (Normal) {2+, 0-}. Wind: (Weak) {1+, 2-}, (Strong) {1+, 1-}.
  72. 72. Gain (SSunny, A) Entropy(SSunny) = - { 2/5 log2(2/5) + 3/5 log2(3/5) } = 0.97095. Temperature: Entropy(Hot) = 0, Entropy(Mild) = 1, Entropy(Cool) = 0; Gain(SSunny, Temperature) = 0.97095 - 2/5*0 - 2/5*1 - 1/5*0 = 0.57095. Humidity: Entropy(High) = 0, Entropy(Normal) = 0; Gain(SSunny, Humidity) = 0.97095 - 3/5*0 - 2/5*0 = 0.97095. Wind: Entropy(Weak) = 0.9183, Entropy(Strong) = 1.0; Gain(SSunny, Wind) = 0.97095 - 3/5*0.9183 - 2/5*1 = 0.01997.
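These three gains can be re-checked from the subset counts above (a small verification script, assumed for illustration):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n > 0)

e_sunny = entropy(2, 3)                                               # 0.971
g_temp = e_sunny - (2/5)*entropy(0, 2) - (2/5)*entropy(1, 1) - (1/5)*entropy(1, 0)
g_hum = e_sunny - (3/5)*entropy(0, 3) - (2/5)*entropy(2, 0)
g_wind = e_sunny - (3/5)*entropy(1, 2) - (2/5)*entropy(1, 1)
print(f"{g_temp:.3f} {g_hum:.3f} {g_wind:.3f}")   # 0.571 0.971 0.020
```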
  73. 73. Modified Decision Tree
  74. 74. Gain (SRain, A) Temperature: (Hot) {0+, 0-}, (Mild) {2+, 1-}, (Cool) {1+, 1-}. Humidity: (High) {1+, 1-}, (Normal) {2+, 1-}. Wind: (Weak) {3+, 0-}, (Strong) {0+, 2-}.
  75. 75. Gain (SRain, A) Entropy(SRain) = - { 3/5 log2(3/5) + 2/5 log2(2/5) } = 0.97095. Temperature: Entropy(Hot) = 0, Entropy(Mild) = 0.9183, Entropy(Cool) = 1.0; Gain(SRain, Temperature) = 0.97095 - 0 - 3/5*0.9183 - 2/5*1.0 = 0.01997. Humidity: Entropy(High) = 1.0, Entropy(Normal) = 0.9183; Gain(SRain, Humidity) = 0.97095 - 2/5*1.0 - 3/5*0.9183 = 0.01997. Wind: Entropy(Weak) = 0.0, Entropy(Strong) = 0.0; Gain(SRain, Wind) = 0.97095 - 3/5*0 - 2/5*0 = 0.97095. Wind gives the largest gain, so it is the attribute tested at the Rain branch.
  76. 76. Final Decision Tree
  77. 77. Homework
  78. 78. Homework
  79. 79. Homework S = {3+, 3-} => Entropy(S) = 1. a1: (True) {2+, 1-}, (False) {1+, 2-}; Entropy(a1=True) = -{ 2/3 log2(2/3) + 1/3 log2(1/3) } = 0.9183; Entropy(a1=False) = 0.9183; Gain(S, a1) = 1 - 3/6*0.9183 - 3/6*0.9183 = 0.0817. a2: (True) {2+, 2-}, (False) {1+, 1-}; Entropy(a2=True) = 1.0; Entropy(a2=False) = 1.0; Gain(S, a2) = 1 - 4/6*1 - 2/6*1 = 0.0.
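Both homework gains can be verified from the counts alone (quick check, helper assumed as before):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n > 0)

# S = [3+, 3-]; a1 = True -> [2+, 1-], a1 = False -> [1+, 2-]
gain_a1 = entropy(3, 3) - (3/6)*entropy(2, 1) - (3/6)*entropy(1, 2)
# a2 = True -> [2+, 2-], a2 = False -> [1+, 1-]
gain_a2 = entropy(3, 3) - (4/6)*entropy(2, 2) - (2/6)*entropy(1, 1)
print(f"{gain_a1:.4f} {gain_a2:.4f}")   # 0.0817 0.0000
```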
  80. 80. Homework [Partial tree: split on a1; True branch: {D1, D2, D3}; False branch: {D4, D5, D6}.]
  81. 81. Homework [Tree: split on a1, then on a2 in each branch. a1 = True: a2 = True → + (Yes), a2 = False → - (No); a1 = False: a2 = True → - (No), a2 = False → + (Yes).]
  82. 82. Homework [Same tree as above.] The learned function is (a1 ^ a2) V (a1' ^ a2').
  83. 83. Some Insights into Capabilities and Limitations of ID3 Algorithm • ID3 searches a complete hypothesis space of decision trees. [Advantage] • ID3 maintains only a single current hypothesis as it searches through the space of decision trees. By maintaining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. [Disadvantage] • ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal. [Disadvantage]
  84. 84. Some Insights into Capabilities and Limitations of ID3 Algorithm • ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S or CANDIDATE-ELIMINATION). One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples. [Advantage]
