  1. 1. Learning
• Machine learning is an area of AI concerned with the automatic learning of knowledge
• Some ways that machine learning can be used in expert systems:
  1. increase efficiency of inference engine and knowledge base processing
  2. testing the knowledge base
  3. use learning principles to acquire knowledge itself
  4. ???
• Most learning techniques exploit heuristics: problem-specific information that makes the search for a solution more efficient
• Without heuristics, typical learning problems either take too long to execute effectively, or produce results that are too large and general to be useful
  2. 2. Learning
1. Increase inference engine efficiency
• 20-80 principle: 20% of the rules in the KB account for 80% of diagnoses
• these 20% of rules should take precedence in order to make execution faster
• otherwise, roughly half of the KB needs to be examined for every diagnosis, which is a waste of time for most (80%) problems
• However, the set of rules most often used can vary according to who, where, and how the expert system is used
• One way to fix this: keep a record of the number of times a high-level rule was successful in making a diagnosis, eg. record(rule12, 102). record(rule6, 25). etc.
• Save this information in a file, and reload it every session
• Use these rule statistics to determine the order in which diagnoses are executed
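The record-and-reorder idea above can be sketched in a few lines of Python. The rule names, counts, and file name here are hypothetical, purely for illustration:

```python
import json

# Hypothetical success counts, in the spirit of record(rule12, 102) above.
stats = {"rule12": 102, "rule6": 25, "rule31": 7}

def save_stats(stats, path="rule_stats.json"):
    # Persist the counts so they survive between sessions.
    with open(path, "w") as f:
        json.dump(stats, f)

def load_stats(path="rule_stats.json"):
    with open(path) as f:
        return json.load(f)

def order_rules(rules, stats):
    # Try the historically most successful rules first; unseen rules go last.
    return sorted(rules, key=lambda r: stats.get(r, 0), reverse=True)

save_stats(stats)
print(order_rules(["rule6", "rule12", "rule31"], load_stats()))
# → ['rule12', 'rule6', 'rule31']
```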
  3. 3. Learning
1. ordering rules: p.118 Schnupp
  4. 4. Learning
• Another possibility is to ask some preliminary questions to determine general high-level information, and then order the high-level inference accordingly (p.112-113 Schnupp)
  5. 5. Learning
2. Knowledge acquisition
(i) Learning rules from examples
• user inputs typical examples (a decision table) for a given rule domain; or the system can process a database to automatically generate production rules
• system constructs a rule or set of rules (a decision tree) for this example set
• can then generate production rules from this tree
• trivial to make a comprehensive tree, but more involved to make a minimal one
• inductive inference: a learning technique which constructs a general rule from specific examples (compare with mathematical induction)
• popular algorithm: Quinlan's ID3, used in shells such as VP-Expert, ExpertEase, RuleMaster, and others
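The trivial table-to-rules translation (before any induction) can be sketched as follows. The table contents and attribute names are made up for illustration:

```python
# A toy decision table: each row pairs attribute values with a conclusion.
HEADERS = ("sky", "wind")
TABLE = [
    (("cloudy", "north"), "rain"),
    (("clear",  "north"), "dry"),
    (("cloudy", "south"), "rain"),
]

def row_to_rule(attrs, conclusion):
    # One production rule per row: the comprehensive (non-generalizing)
    # translation mentioned above. Induction would later merge/prune these.
    premises = " AND ".join(f"{h} = {v}" for h, v in zip(HEADERS, attrs))
    return f"IF {premises} THEN {conclusion}"

for attrs, concl in TABLE:
    print(row_to_rule(attrs, concl))
# → IF sky = cloudy AND wind = north THEN rain   (etc.)
```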
  6. 6. Learning
• Decision table: a table of attribute values, with one or more conclusions; each row is an example or true instance of attribute values and a conclusion
• Convenient for classification & identification problems
• Could create production rules directly from the table: one rule per row
• induction: tries to generalize the information in the table, disregarding superfluous information, and yielding an efficient, smaller decision tree
  - results in a "smarter" system
  - a good example of machine learning: the computer tries to generalize and abstract from examples
• Given a table, can look at the conclusions and see if particular attributes have any effect on them. If not, then disregard those attributes when deriving that conclusion.
• There exist one or more "minimal" sets of tests for a table; however, finding a minimal set can be intractable in general
  7. 7. ID3 definitions
• entropy: a measure of how well an attribute's values match the conclusions
  - values match conclusions 1:1 --> low entropy (high information content, low uncertainty)
    eg. a:1, b:2, c:3, d:4
  - same value regardless of conclusion --> high entropy (low information content, high uncertainty)
    eg. a:1, a:2, a:3, a:4
• Information or entropy: a mathematical measure of the amount of information needed to answer a question, or certainty about an outcome, eg. to describe whether a coin will be heads or tails
  - for a yes/no question, a value between 0 and 1, measured in bits
  - 1 bit = enough information to answer a yes/no question about a fair, random event (because log2(2) = 1)
  - 4 events require log2(4) = 2 bits, etc.
  - Note: log2(x) = ln(x) / ln(2)
  8. 8. ID3 definitions
• Information content: the average entropy of the different events, weighted by the probabilities of those events
  - Formula: IC = - Pr(E1) log2 Pr(E1) - Pr(E2) log2 Pr(E2) - ... - Pr(Ek) log2 Pr(Ek)
    where attribute A has events E1, ..., Ek
  - eg. fair coin: IC(0.5, 0.5) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit
  - eg. heavily weighted coin: IC(0.01, 0.99) = -0.01 log2 0.01 - 0.99 log2 0.99 = 0.08 bits
  - eg. always heads: IC(0, 1) = 0 bits
  - Note: if an event has probability 0, don't use it in the equation (log 0 is undefined); instead, treat its term as 0
• Information gain: the difference between the original information content and the new information content after attribute A is selected
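The IC formula and the three coin examples above can be checked directly:

```python
from math import log2

def info_content(probs):
    # IC = -sum Pr(Ei) * log2 Pr(Ei); zero-probability events contribute 0
    # (the limit of p * log2(p) as p -> 0), so we skip them.
    return -sum(p * log2(p) for p in probs if p > 0)

print(info_content([0.5, 0.5]))              # fair coin: 1.0 bit
print(round(info_content([0.01, 0.99]), 2))  # weighted coin: 0.08 bits
print(info_content([0.0, 1.0]))              # always heads: 0 bits
```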
  9. 9. ID3 Algorithm
• minimal trees are useful when the example table is completely adequate for the observations of interest, and user input is certain
  - for uncertain input, additional tests are required for support
• ID3: an induction algorithm that tries to find a small tree efficiently
  - not guaranteed to be minimal, but it is generally small
1. Given the example set C, find the attribute A that yields the highest information gain
  --> this is the most discriminatory test for distinguishing the data
  --> ideally, if we have X entropy before, then after A we have 0 entropy (ie. a perfect decision strategy!)
2. Add it as the next node in the tree
3. Partition the examples into subtables, and recurse (fills in the remainder of the tree)
  10. 10. ID3 algorithm
• consider p positive and n negative examples
  - probability of positives: p+ = p / (p + n)
  - probability of negatives: p- = n / (p + n)
  - compute the above from the example set
• information content for each attribute value:
  - IC(Vi) = -(p+) log2 (p+) - (p-) log2 (p-)
  - where Vi is a value of attribute A, and p+, p- are computed over the subset of examples with that value
• information content for the table after attribute A is used as a test:
  - B(C, A) = sum_i [ Pr(value of A is Vi) * IC(Vi) ]
  - for attribute A with values Vi (i = 1, ..., #values), and the subset of examples Ci corresponding to each Vi
• We wish to select the attribute which maximizes the information gain at that node in the tree. Repeat the following for all attributes in the example set:
  - compute p+, p- for each value of the attribute
  - compute IC for each (attribute, value) pair
  - compute the overall B(C, A) for that attribute
• --> select the attribute A which maximizes information gain, ie. yields the lowest information content after it is applied --> lowest B(C, A) value
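The attribute-selection step can be sketched as follows. The data format (a list of attribute-dict/outcome pairs) and the tiny table are assumptions for illustration; attr1 predicts the outcome perfectly, while attr2 is constant:

```python
from math import log2

def ic(pos, neg):
    # IC of a pos/neg split; zero-probability terms are skipped.
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

def b(examples, attr):
    # B(C, A): IC of each subtable after testing `attr`, weighted by
    # the probability of each attribute value.
    total = len(examples)
    result = 0.0
    for v in {attrs[attr] for attrs, _ in examples}:
        subset = [out for attrs, out in examples if attrs[attr] == v]
        pos = sum(1 for out in subset if out == "+")
        result += len(subset) / total * ic(pos, len(subset) - pos)
    return result

def best_attribute(examples, candidates):
    # Lowest B(C, A) is equivalent to highest information gain.
    return min(candidates, key=lambda a: b(examples, a))

examples = [
    ({"attr1": "a", "attr2": "c"}, "+"),
    ({"attr1": "b", "attr2": "c"}, "-"),
    ({"attr1": "a", "attr2": "c"}, "+"),
    ({"attr1": "a", "attr2": "c"}, "+"),
]
print(best_attribute(examples, ["attr1", "attr2"]))  # → attr1
```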
  11. 11. ID3: entropy extremes
• Example table:
  attr. 1   attr. 2   value
  a         c         x
  b         c         y
  a         c         x
  a         c         x
• IC(C) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = -(.75)(-.415) - (.25)(-2) = 0.811
• attr 1:
  - IC(a) = -1 log2 1 - 0 log2 0 = 0
  - IC(b) = -1 log2 1 - 0 log2 0 = 0
  - B(C, attr1) = Pr(a)*IC(a) + Pr(b)*IC(b) = 0 + 0 = 0
  - Gain: IC(C) - B(C, attr1) = 0.811 - 0 = 0.811
  - --> maximum gain! All values are precisely predicted using attribute 1.
• attr 2:
  - IC(c) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811
  - B(C, attr2) = Pr(c)*IC(c) = 1 * 0.811 = 0.811
  - Gain: IC(C) - B(C, attr2) = 0.811 - 0.811 = 0
  - --> minimum gain; no information is gained at all from using attribute 2.
• Note: we can simply select the attribute yielding the minimum information content "B" for the table; computing the gain itself is redundant (it doesn't change the choice).
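Re-checking the arithmetic in Python (note that -(1/4) log2 (1/4) = (1/4)(2) = 0.5, so IC(C) for a 3-to-1 split comes out to about 0.811):

```python
from math import log2

def ic(probs):
    # Skip zero-probability terms, whose contribution is 0.
    return -sum(p * log2(p) for p in probs if p > 0)

# Whole table: three x's and one y out of four rows.
ic_c = ic([3/4, 1/4])
print(round(ic_c, 3))  # → 0.811

# attr 1 splits the table into pure subtables (B = 0); attr 2 has a single
# value, so it leaves the information content unchanged (B = IC(C)).
b_attr1 = (3/4) * ic([1.0]) + (1/4) * ic([1.0])
b_attr2 = 1.0 * ic([3/4, 1/4])
print(round(ic_c - b_attr1, 3), round(ic_c - b_attr2, 3))  # gains: 0.811 0.0
```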
  12. 12. ID3: example (from Durkin, pp.496-498)
• IC(C) = -(4/8) log2 (4/8) - (4/8) log2 (4/8) = 1 (initial info content of all examples)
• (a) test "wind":
  - IC(North) = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = .971
  - IC(South) = -(1/3) log2 (1/3) - (2/3) log2 (2/3) = .918
  - B(C, "wind") = (5/8)(.971) + (3/8)(.918) = .951
  - gain = IC(C) - B(C, "wind") = 1 - .951 = .049
• (b) test "sky":
  - "clear": all examples are negative (IC = 0)
  - IC(cloudy) = -(4/5) log2 (4/5) - (1/5) log2 (1/5) = .722
  - B(C, "sky") = (3/8)(0) + (5/8)(.722) = .451
  - gain = IC(C) - B(C, "sky") = 1 - .451 = .549
• (c) test "barometer": gain is .156
• therefore "sky" gives the highest info gain, and is selected
• The algorithm partitions the example set for each new subcategory, and recurses
• Note: we're simply finding the attribute that yields the smallest information content for the remaining table after it is applied
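The "wind" and "sky" gains above can be verified numerically from the positive/negative counts the slide gives:

```python
from math import log2

def ic(pos, neg):
    tot = pos + neg
    return -sum(p * log2(p) for p in (pos / tot, neg / tot) if p > 0)

# "wind": North covers 5 of 8 examples (3+/2-), South 3 of 8 (1+/2-).
b_wind = (5/8) * ic(3, 2) + (3/8) * ic(1, 2)
# "sky": clear is all negative (IC = 0), cloudy covers 5 of 8 (4+/1-).
b_sky = (3/8) * 0 + (5/8) * ic(4, 1)

print(round(1 - b_wind, 3), round(1 - b_sky, 3))  # gains: 0.049 0.549
```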
  13. 13. Example (cont)
• Must now apply ID3 to the remaining examples that have differing result values --> new layers of the decision tree
• (a) barometer:
  - IC(rising) = -1 log2 1 - 0 log2 0 = 0
  - IC(steady) = -1 log2 1 - 0 log2 0 = 0
  - IC(falling) = -(1/2) log2 (1/2) - (1/2) log2 (1/2) = 1
  - B(C, barometer) = (2/5)(0) + (1/5)(0) + (2/5)(1) = .4
• (b) wind:
  - IC(south) = -(1/2) log2 (1/2) - (1/2) log2 (1/2) = 1
  - IC(north) = -1 log2 1 - 0 log2 0 = 0
  - B(C, wind) = (2/5)(1) + (3/5)(0) = .4
• --> choose either
• note that you'll need both attributes together to classify the remaining table
  14. 14. Example
• If we choose 'barometer', the remaining table is:
  5   Cloudy, Falling, North   +
  7   Cloudy, Falling, South   -
• It should be obvious that 'wind' is the only possibility now.
  15. 15. Example: final tree
  16. 16. ID3
• ID3 generalizes to multivalued classifications (not just plus and minus): the information content expression extends to multiple categories...
  - IC(a, b, c) = -pa log2 pa - pb log2 pb - pc log2 pc
• Note: the final tree can have "no data" leaves, meaning that the example set does not cover that combination of tests
  - can presume that such a leaf is "impossible" wrt the examples
  - otherwise, it implies that the example set is missing information
  - --> must assume that the examples are complete and correct; this is the responsibility of the knowledge engineer!
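Putting the pieces together, the whole recursive ID3 procedure can be sketched compactly. The four-row weather table here is hypothetical (not Durkin's full table), ties between attributes break by list order, and the empty-subtable ("no data") case is not handled, as flagged in the comments:

```python
from math import log2
from collections import Counter

def ic(labels):
    # Multivalued information content of a list of class labels.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def b(rows, attr):
    # Weighted IC of the subtables produced by testing `attr`.
    n = len(rows)
    return sum(len(sub) / n * ic(sub)
               for v in {r[attr] for r in rows}
               for sub in [[r["label"] for r in rows if r[attr] == v]])

def id3(rows, attrs):
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                            # pure leaf
    best = min(attrs, key=lambda a: b(rows, a))     # lowest B == highest gain
    rest = [a for a in attrs if a != best]
    # Note: a value combination absent from `rows` would be a "no data" leaf;
    # this sketch only branches on values actually present.
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in {r[best] for r in rows}}}

rows = [
    {"sky": "clear",  "wind": "north", "label": "-"},
    {"sky": "clear",  "wind": "south", "label": "-"},
    {"sky": "cloudy", "wind": "north", "label": "+"},
    {"sky": "cloudy", "wind": "south", "label": "-"},
]
print(id3(rows, ["sky", "wind"]))
# → {'sky': {'clear': '-', 'cloudy': {'wind': {'north': '+', 'south': '-'}}}}
```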
  17. 17. Learning Inductive Inference: p.129-132
  18. 18. Learning Inductive Inference
  19. 19. Learning
  20. 20. Learning ID3 algorithm p. 134-5
  21. 21. Learning
  22. 22. Learning
  23. 23. Learning
  24. 24. Possible ID3 problems
• clashing examples: need more data attributes, or must correct the knowledge
• continuous values (floating point): must create ranges, eg. 0 < x < 5
• noise: if compressing a database, noise can unduly influence the decision tree
• trees can be too large: the resulting production rules are therefore large too
  - break up the table and create a hierarchy of tables (structured induction)
• flat rules: rules go directly from attributes to one or more conclusions, with no intermediate rules
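The range-creation fix for continuous values can be sketched as a discretization step before induction. The cut points and labels here are arbitrary assumptions:

```python
import bisect

# Assumed cut points: x < 5, 5 <= x < 10, x >= 10 (cf. "eg. 0 < x < 5").
BOUNDARIES = [5.0, 10.0]
LABELS = ["0-5", "5-10", "10+"]

def discretize(x):
    # Map a continuous value to a range label so ID3 can treat it
    # as an ordinary discrete attribute value.
    return LABELS[bisect.bisect_right(BOUNDARIES, x)]

print(discretize(3.2), discretize(7.0), discretize(12.5))
# → 0-5 5-10 10+
```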
  25. 25. ID3 enhancements
• If you have a large example set, can use a subset ("window") of examples
  - then need to verify that the resulting decision tree is valid, in case the window wasn't inclusive or contained noise
  - knowledge bases must be 100% correct!
• Data mining: finding trends in large databases
  - ID3 is one tool used there
  - don't care about 100% correctness
  - rather, a good random sample of the database may yield enough useful information
  - can also apply to the entire database, in an attempt to categorize information into useful classes and trends
• C4.5 is the successor of ID3. It uses a more advanced heuristic (gain ratio rather than raw information gain).
  26. 26. Learning: Genetic algorithms
• Another way to create small decision trees from examples: genetic programming
1. Create a population of random decision trees
2. Repeat until a suitably correct and small tree is found:
  a. Rate all trees based on: (i) size, (ii) how many examples they cover, (iii) how many examples they miss
     --> fitness score
  b. Create a new population:
     (i) mate trees using crossover: swap subtrees between parents
     (ii) mutate trees using mutation: make a random change to a tree
• This searches the space of decision trees for a correct, small tree
• Much slower than ID3, but possibly better results
• Remember: finding the smallest tree for an example set is NP-complete
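The fitness-rating step (2a) can be sketched as follows; this is an illustration, not a full GP system. The tree representation (nested dicts), the default leaf for unseen values, and the size-penalty weight are all assumptions:

```python
def classify(tree, row):
    # Walk the tree: internal nodes are {attr: {value: subtree}} dicts,
    # leaves are class labels. Unseen values fall through to a default leaf.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], "-")
    return tree

def size(tree):
    # Node count: the "small is better" part of the fitness score.
    if not isinstance(tree, dict):
        return 1
    attr = next(iter(tree))
    return 1 + sum(size(sub) for sub in tree[attr].values())

def fitness(tree, examples, size_penalty=0.01):
    # Reward examples covered, penalize size (assumed penalty weight).
    hits = sum(classify(tree, r) == r["label"] for r in examples)
    return hits / len(examples) - size_penalty * size(tree)

examples = [
    {"sky": "clear",  "label": "-"},
    {"sky": "cloudy", "label": "+"},
]
good = {"sky": {"clear": "-", "cloudy": "+"}}   # covers both examples
trivial = "-"                                   # covers only one
print(fitness(good, examples), fitness(trivial, examples))
```

Crossover and mutation (step 2b) would then build the next population by swapping and perturbing subtrees of the fittest candidates.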
  27. 27. Learning comments
• Inductive inference is a convenient way of obtaining productions automatically from databases or user examples
• Good for classification & diagnosis problems
• Assumes that the domain is deterministic: that particular premises lead to only one conclusion, not multiple ones
• Need to have good data: no noise, no clashes
  - assumes that the entire universe of interest is encapsulated in the example set or database
  - clashes probably mean you need to identify more attributes
• ID3 doesn't guarantee the minimal tree, but its entropy measure is a heuristic that often generates a small tree
• Note that a minimal rule is not necessarily desirable: when you discard attributes, you discard information which might be relevant, especially later when the system is being upgraded
• Not desirable to use this technique on huge tables with many attributes. A better approach is to modularise the data hierarchically.