Basics of Machine Learning



  1. 1. Advanced Artificial Intelligence Lecture 3: Learning <ul><li>Bob McKay </li></ul><ul><ul><li>School of Computer Science and Engineering </li></ul></ul><ul><ul><li>College of Engineering </li></ul></ul><ul><ul><li>Seoul National University </li></ul></ul>
  2. 2. Outline <ul><li>Defining Learning </li></ul><ul><li>Kinds of Learning </li></ul><ul><li>Generalisation and Specialisation </li></ul><ul><li>Some Simple Learning Algorithms </li></ul>
  3. 3. References <ul><li>Mitchell, Tom M: Machine Learning, McGraw-Hill, 1997, ISBN 0 07 115467 1 </li></ul>
  4. 4. Defining a Learning System (Mitchell) <ul><li>“A program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” </li></ul>
  5. 5. Specifying a Learning System <ul><li>Specifying the task T, the performance P and the experience E defines the learning problem. Specifying the learning system requires us to define: </li></ul><ul><ul><li>Exactly what knowledge is to be learnt </li></ul></ul><ul><ul><li>How this knowledge is to be represented </li></ul></ul><ul><ul><li>How this knowledge is to be learnt </li></ul></ul>
  6. 6. Specifying What is to be Learnt <ul><li>Usually, the desired knowledge can be represented as a target valuation function V: I -> D </li></ul><ul><ul><li>It takes in information about the problem and gives back a desired decision </li></ul></ul><ul><li>Often, it is unrealistic to expect to learn the ideal function V </li></ul><ul><ul><li>All that is required is a ‘good enough’ approximation, V’: I -> D </li></ul></ul>
  7. 7. Specifying How Knowledge is to be Represented <ul><li>The function V’ must be represented symbolically, in some language L </li></ul><ul><ul><li>The language may be a well-known language </li></ul></ul><ul><ul><ul><li>Boolean expressions </li></ul></ul></ul><ul><ul><ul><li>Arithmetic functions </li></ul></ul></ul><ul><ul><ul><li>… . </li></ul></ul></ul><ul><ul><li>Or for some systems, the language may be defined by a grammar </li></ul></ul>
  8. 8. Specifying How the Knowledge is to be Learnt <ul><li>If the learning system is to be implemented, we must specify an algorithm A, which defines the way in which the system is to search the language L for an acceptable V’ </li></ul><ul><ul><li>That is, we must specify a search algorithm </li></ul></ul>
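The pieces named so far (target V, approximation V', language L, search algorithm A) can be made concrete with a toy sketch. Everything specific here is an assumption chosen for illustration: the threshold rule used as V, the tiny language of thresholds in steps of 5, and the 95% "good enough" level. Note that V itself is not in L, so the search can only return an approximation.

```python
# Target V: unknown to the learner; a hypothetical rule used only to label examples.
def V(x):
    return x >= 37

# Language L: threshold rules "x >= t" for t = 0, 5, ..., 100 (an assumed, tiny language).
L = [lambda x, t=t: x >= t for t in range(0, 101, 5)]

# Experience: labelled examples drawn from V.
examples = [(x, V(x)) for x in range(0, 101, 3)]

def accuracy(h, data):
    """Fraction of examples on which hypothesis h agrees with the label."""
    return sum(h(x) == y for x, y in data) / len(data)

def search(language, data, good_enough=0.95):
    """Algorithm A: scan L in order and return the first acceptable V'."""
    for h in language:
        if accuracy(h, data) >= good_enough:
            return h
    return None

V_prime = search(L, examples)
```

The order in which `search` scans `L` is itself a design choice; a different ordering could return a different acceptable V'.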
  9. 9. Structure of a Learning System <ul><li>Four modules </li></ul><ul><ul><li>The Performance System </li></ul></ul><ul><ul><li>The Critic </li></ul></ul><ul><ul><li>The Generaliser (or sometimes Specialiser) </li></ul></ul><ul><ul><li>The Experiment Generator </li></ul></ul>
  10. 10. Performance Module <ul><li>This is the system which actually uses the function V’ as we learn it </li></ul><ul><ul><li>Learning Task </li></ul></ul><ul><ul><ul><li>Learning to play checkers </li></ul></ul></ul><ul><ul><li>Performance module </li></ul></ul><ul><ul><ul><li>System for playing checkers </li></ul></ul></ul><ul><ul><ul><ul><li>(i.e. it makes the checkers moves) </li></ul></ul></ul></ul>
  11. 11. Critic Module <ul><li>The critic module evaluates the performance of the current V’ </li></ul><ul><ul><li>It produces a set of data from which the system can learn further </li></ul></ul>
  12. 12. Generaliser/Specialiser Module <ul><li>Takes a set of data and produces a new V’ for the system to run again </li></ul>
  13. 13. Experiment Generator <ul><li>Takes the new V’ </li></ul><ul><ul><li>Maybe also uses the previous history of the system </li></ul></ul><ul><li>Produces a new experiment for the performance system to undertake </li></ul>
  14. 14. The Importance of Bias <ul><li>Important theoretical results from learning theory (PAC learning) tell us that learning without some presuppositions is infeasible. </li></ul><ul><ul><li>Practical experience, of both machine and human learning, confirms this. </li></ul></ul><ul><ul><ul><li>To learn effectively, we must limit the class of V’s. </li></ul></ul></ul><ul><li>Two kinds of bias are used in machine learning, separately or together: </li></ul><ul><ul><li>Language bias </li></ul></ul><ul><ul><li>Search bias </li></ul></ul><ul><ul><li>Combined bias </li></ul></ul><ul><ul><ul><li>Language and search bias are not mutually exclusive: most learning systems feature both </li></ul></ul></ul>
  15. 15. Language Bias <ul><li>The language L is restricted so that it cannot represent all possible target functions V </li></ul><ul><ul><li>This is usually on the basis of some knowledge we have about the likely form of V’ </li></ul></ul><ul><ul><li>It introduces risk </li></ul></ul><ul><ul><ul><li>Our system will fail if L does not contain an acceptable V’ </li></ul></ul></ul>
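To give a feel for how much a restricted language rules out, the sketch below counts concepts for a six-attribute weather domain of the kind used later in the lecture. The attribute domains (including the values `Cloudy` and `Weak`) are assumptions made for illustration; the slides do not list them. The restricted language is purely conjunctive: each attribute is either fixed to one value or left free.

```python
from math import prod

# Assumed attribute domains (illustrative; not given on the slides).
DOMAINS = {
    "Sky": ["Sun", "Cloudy", "Rain"],
    "AirT": ["Wrm", "Cold"],
    "Hum": ["Nml", "High"],
    "Wnd": ["Str", "Weak"],
    "Wtr": ["Wrm", "Cool"],
    "Fcst": ["Sam", "Chng"],
}

# Number of distinct instances the learner could ever see.
n_instances = prod(len(values) for values in DOMAINS.values())

# An unrestricted language can express every subset of the instances.
n_unrestricted = 2 ** n_instances

# Conjunctive language: each attribute is a specific value or '?',
# plus one extra hypothesis that accepts nothing at all.
n_conjunctive = prod(len(values) + 1 for values in DOMAINS.values()) + 1
```

Under these assumptions there are 96 instances and 973 conjunctive hypotheses, against 2^96 possible target concepts, which is why the system fails whenever the target is not one of the few the language can express.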
  16. 16. Search Bias <ul><li>The order in which the system searches L is controlled, so that promising areas for V’ are searched first </li></ul>
  17. 17. The Downside: No Free Lunches <ul><li>Wolpert and Macready’s No Free Lunch Theorem states, in effect, that averaged over all problems, all biases are equally good (or bad). </li></ul><ul><li>Conventional view </li></ul><ul><ul><li>The choice of a learning system cannot be universal </li></ul></ul><ul><ul><ul><li>It must be matched to the problem being solved </li></ul></ul></ul><ul><li>In most systems, the bias is not explicit </li></ul><ul><ul><li>The ability to identify the language and search biases of a particular system is an important aspect of machine learning </li></ul></ul><ul><li>Some more recent systems permit the explicit and flexible specification of both language and search biases </li></ul>
  18. 18. No Free Lunch: Does it Matter? <ul><li>Alternative view </li></ul><ul><ul><li>We aren’t interested in all problems </li></ul></ul><ul><ul><ul><li>We are only interested in problems which have solutions of less than some bounded complexity </li></ul></ul></ul><ul><ul><ul><ul><li>(so that we can understand the solutions) </li></ul></ul></ul></ul><ul><ul><li>The No Free Lunch Theorem may not apply in this case </li></ul></ul>
  19. 19. Some Dimensions of Learning <ul><li>Induction vs Discovery: </li></ul><ul><li>Guided learning vs learning from raw data </li></ul><ul><li>Learning How vs Learning That (vs Learning a Better That) </li></ul><ul><li>Stochastic vs Deterministic; Symbolic vs Subsymbolic </li></ul><ul><li>Clean vs Noisy Data </li></ul><ul><li>Discrete vs continuous variables </li></ul><ul><li>Attribute vs Relational Learning </li></ul><ul><li>The Importance of Background Knowledge </li></ul>
  20. 20. Induction vs Discovery <ul><li>Has the target concept been previously identified? </li></ul><ul><ul><li>Pearson: cloud classifications from satellite data </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Autoclass and H - R diagrams </li></ul></ul><ul><ul><li>AM and prime numbers </li></ul></ul><ul><ul><li>BACON and Boyle's Law </li></ul></ul>
  21. 21. Guided Learning vs Learning from Raw Data <ul><li>Does the learning system require carefully selected examples and counterexamples, as in a teacher – student situation? </li></ul><ul><ul><li>(allows fast learning) </li></ul></ul><ul><ul><li>CIGOL learning sort/merge </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Garvan institute's thyroid data </li></ul></ul>
  22. 22. Learning How vs Learning That vs Learning a Better That <ul><ul><li>Classifying handwritten symbols </li></ul></ul><ul><ul><li>Distinguishing vowel sounds (Sejnowski & Rosenberg) </li></ul></ul><ul><ul><li>Learning to fly a (simulated!) plane </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Michalski & learning diagnosis of soy diseases </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Mitchell & learning about chess forks </li></ul></ul>
  23. 23. Stochastic vs Deterministic; Symbolic vs Subsymbolic <ul><ul><li>Classifying handwritten symbols (stochastic, subsymbolic) </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Predicting plant distributions (stochastic, symbolic) </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Cloud classification (deterministic, symbolic) </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>? (deterministic, subsymbolic) </li></ul></ul>
  24. 24. Clean vs Noisy Data <ul><ul><li>Learning to diagnose errors in programs </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Greater gliders in the Coolangubra </li></ul></ul>
  25. 25. Discrete vs Continuous Variables <ul><ul><li>Quinlan's chess end games </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Pearson's clouds (eg cloud heights) </li></ul></ul>
  26. 26. Attribute vs Relational Learning <ul><ul><li>Predicting plant distributions </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Predicting animal distributions </li></ul></ul><ul><ul><ul><li>(because plants can’t move, they don’t care - much - about spatial relationships) </li></ul></ul></ul>
  27. 27. The importance of Background Knowledge <ul><li>Learning about faults in a satellite power supply </li></ul><ul><ul><li>general electric circuit theory </li></ul></ul><ul><ul><li>knowledge about the particular circuit </li></ul></ul>
  28. 28. Generalisation and Learning <ul><li>What do we mean when we say of two propositions, S and G, that G is a generalisation of S? </li></ul><ul><ul><li>Suppose Skippy is a grey kangaroo. </li></ul></ul><ul><ul><li>We would regard ‘Kangaroos are grey’ as a generalisation of ‘Skippy is grey’. </li></ul></ul><ul><ul><li>In any world in which ‘Kangaroos are grey’ is true, ‘Skippy is grey’ will also be true. </li></ul></ul><ul><li>In other words, if G is a generalisation of S, then S is 'at least as true' as G. </li></ul><ul><ul><li>That is, S is true in all states of the world in which G is, and perhaps in other states as well. </li></ul></ul>
  29. 29. Generalisation and Inference <ul><li>In logic, we assume that if S is true in all worlds in which G is, then </li></ul><ul><ul><li>G -> S </li></ul></ul><ul><li>That is, G is a generalisation of S exactly when G implies S </li></ul><ul><ul><li>So we can think of learning from S as a search for a suitable G for which G -> S </li></ul></ul><ul><li>In propositional learning, this is often used as a definition: </li></ul><ul><ul><li>G is more general than S if and only if G -> S </li></ul></ul>
  30. 30. Issues <ul><li>Equating generalisation and logical implication is only useful if the validity of an implication can be readily computed </li></ul><ul><ul><li>In the propositional calculus, checking validity takes exponential time in the worst case with known algorithms </li></ul></ul><ul><ul><li>In the predicate calculus, validity is undecidable </li></ul></ul><ul><li>So the definition is not universally useful </li></ul><ul><ul><li>(although for some parts of logic - eg learning rules - it is perfectly adequate). </li></ul></ul>
  31. 31. A Common Misunderstanding <ul><li>Suppose we have two rules, </li></ul><ul><ul><li>1) A ∧ B -> G </li></ul></ul><ul><ul><li>2) A ∧ B ∧ C -> G </li></ul></ul><ul><li>Clearly, we would want 1 to be a generalisation of 2 </li></ul><ul><li>This is OK with our definition, because </li></ul><ul><ul><li>((A ∧ B -> G) -> (A ∧ B ∧ C -> G)) </li></ul></ul><ul><li>is valid </li></ul><ul><ul><li>But the confusing thing is that ((A ∧ B ∧ C) -> (A ∧ B)) is also valid </li></ul></ul><ul><ul><ul><li>If you only look at the hypotheses of the rule, rather than the whole rule, the implication is the wrong way around </li></ul></ul></ul><ul><ul><ul><li>Note that some textbooks are themselves confused about this </li></ul></ul></ul>
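The two directions of implication on this slide can be checked mechanically with a truth table. The sketch below is a brute-force validity test (the helper names `implies` and `is_valid` are ours, not from the lecture); it enumerates all 2^n assignments, which also illustrates why the propositional case is expensive.

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

def is_valid(formula, n_vars):
    """True if formula holds under every truth assignment (2**n_vars rows)."""
    return all(formula(*row) for row in product([False, True], repeat=n_vars))

# Rule 1: A ^ B -> G        Rule 2: A ^ B ^ C -> G
def rule1(a, b, c, g):
    return implies(a and b, g)

def rule2(a, b, c, g):
    return implies(a and b and c, g)

# Whole rules: rule 1 implies rule 2, so rule 1 is the generalisation.
rule1_implies_rule2 = is_valid(lambda a, b, c, g: implies(rule1(a, b, c, g), rule2(a, b, c, g)), 4)
rule2_implies_rule1 = is_valid(lambda a, b, c, g: implies(rule2(a, b, c, g), rule1(a, b, c, g)), 4)

# Rule bodies only: the implication runs the other way.
body2_implies_body1 = is_valid(lambda a, b, c, g: implies(a and b and c, a and b), 4)
```

Here `rule1_implies_rule2` and `body2_implies_body1` both come out true, while `rule2_implies_rule1` is false (take A and B true, C and G false).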
  32. 32. Defining Generalisation <ul><li>We could try to define the properties that generalisation must satisfy, </li></ul><ul><li>so let's write down some axioms. We need some notation. </li></ul><ul><ul><li>We will write 'S <_G G' as shorthand for 'S is less general than G'. </li></ul></ul><ul><li>Axioms: </li></ul><ul><ul><li>Transitivity: If A <_G B and B <_G C then also A <_G C </li></ul></ul><ul><ul><li>Antisymmetry: If A <_G B then it's not true that B <_G A </li></ul></ul><ul><ul><li>Top: there is a unique element, ⊥, for which A <_G ⊥ holds for every other element A. </li></ul></ul><ul><ul><li>Bottom: there is a unique element, ⊤, for which ⊤ <_G A holds for every other element A. </li></ul></ul><ul><ul><ul><li>(With generalisation read as implication, ⊥ (falsity) implies everything and so is the most general element; ⊤ (truth) is implied by everything and so is the most specific.) </li></ul></ul></ul>
  33. 33. Picturing Generalisation <ul><li>We can draw a 'picture' of a generalisation hierarchy satisfying these axioms: </li></ul>
  34. 34. Specifying Generalisation <ul><li>In a particular domain, the generalisation hierarchy may be defined in either of two ways: </li></ul><ul><ul><li>By giving a general definition of what generalisation means in that domain </li></ul></ul><ul><ul><ul><li>Example: our earlier definition in terms of implication </li></ul></ul></ul><ul><ul><li>By directly specifying the specialisation and generalisation operators that may be used to climb up and down the links in the generalisation hierarchy </li></ul></ul>
  35. 35. Learning and Generalisation <ul><li>How does learning relate to generalisation? </li></ul><ul><ul><li>We can view most learning as an attempt to find an appropriate generalisation of the examples. </li></ul></ul><ul><ul><li>In noise free domains, we usually want the generalisation to cover all the examples. </li></ul></ul><ul><ul><li>Once we introduce noise, we want the generalisation to cover 'enough' examples, and the interesting bit is in defining what 'enough' is. </li></ul></ul><ul><li>In our picture of a generalisation hierarchy, most learning algorithms can be viewed as methods for searching the hierarchy. </li></ul><ul><ul><li>The examples can be pictured as locations low down in the hierarchy, and the learning algorithm attempts to find a location that is above all (or 'enough') of them in the hierarchy, but usually, no higher 'than it needs to be' </li></ul></ul>
  36. 36. Searching the Generalisation Hierarchy <ul><li>The commonest approaches are: </li></ul><ul><ul><li>generalising search </li></ul></ul><ul><ul><ul><li>the search is upward from the original examples, towards the more general hypotheses </li></ul></ul></ul><ul><ul><li>specialising search </li></ul></ul><ul><ul><ul><li>the search is downward from the most general hypothesis, towards the more specific examples </li></ul></ul></ul><ul><ul><li>Some algorithms use different approaches. Mitchell's version space approach, for example, tries to 'home in' on the right generalisation from both directions at once. </li></ul></ul>
  37. 37. Completeness and Generalisation <ul><li>Many approaches to axiomatising generalisation add an extra axiom: </li></ul><ul><ul><li>Completeness: For any set Σ of members of the generalisation hierarchy, there is a unique 'least general generalisation' L, which satisfies two properties: </li></ul></ul><ul><ul><ul><li>1) for every S in Σ, S <_G L </li></ul></ul></ul><ul><ul><ul><li>2) if any other L' satisfies 1), then L <_G L' </li></ul></ul></ul><ul><ul><li>If this definition is hard to understand, compare it with the definition of 'Least Upper Bound' in set theory, or of 'Least Common Multiple' in arithmetic </li></ul></ul>
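The Least Common Multiple analogy can be checked directly. In the sketch below, positive integers ordered by divisibility stand in for the generalisation hierarchy (an illustrative toy, not a learning language): "s is below g" means s divides g, and the least upper bound of a set is its LCM. The set `sigma` and the search limit of 1000 are arbitrary choices.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def below_or_equal(s, g):
    """In this toy hierarchy, s is below (or equal to) g when s divides g."""
    return g % s == 0

sigma = [4, 6, 10]
least = reduce(lcm, sigma)  # candidate 'least general generalisation'

# Property 1: the candidate is above every member of sigma.
prop1 = all(below_or_equal(s, least) for s in sigma)

# Property 2: every other upper bound (checked up to a limit) is above the candidate.
upper_bounds = [u for u in range(1, 1000) if all(below_or_equal(s, u) for s in sigma)]
prop2 = all(below_or_equal(least, u) for u in upper_bounds)
```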
  38. 38. Restricting Generalisation <ul><li>Let's go back to our original definition of generalisation: </li></ul><ul><ul><li>G generalises S iff G -> S </li></ul></ul><ul><li>In the general predicate calculus case, this relation is uncomputable, so it's not very useful </li></ul><ul><li>One approach to avoiding the problem is to limit the implications allowed </li></ul>
  39. 39. Generalisation and Substitution <ul><li>Very commonly, the generalisations we want to make involve turning a constant into a variable. </li></ul><ul><ul><li>Suppose we see a particular black crow, fred, and notice: </li></ul></ul><ul><ul><ul><li>crow(fred) -> black(fred) </li></ul></ul></ul><ul><ul><li>and we may wish to generalise this to </li></ul></ul><ul><ul><ul><li>∀ X(crow(X) -> black(X)) </li></ul></ul></ul><ul><li>Notice that the original proposition can be recovered from the generalisation by substituting 'fred' for the variable 'X' </li></ul><ul><ul><li>The original is a substitution instance of the generalisation </li></ul></ul><ul><ul><li>So we could define a new, restricted generalisation: </li></ul></ul><ul><ul><ul><li>G subsumes S if S is a substitution instance of G </li></ul></ul></ul><ul><li>This is a special case of our earlier definition, because a substitution instance is always implied by the proposition it was obtained from. </li></ul>
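Subsumption by substitution is cheap to test, which is the point of restricting generalisation this way. The sketch below is a minimal one-way matcher under assumed conventions: terms are nested tuples, and (Prolog-style) any string starting with an upper-case letter is a variable. The function names are ours.

```python
def is_variable(term):
    """Assumed convention: variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def match(general, specific, subst=None):
    """Return a substitution turning `general` into `specific`, or None if there is none."""
    subst = dict(subst or {})
    if is_variable(general):
        if general in subst:
            # A variable must be replaced by the same term everywhere.
            return subst if subst[general] == specific else None
        subst[general] = specific
        return subst
    if isinstance(general, tuple) and isinstance(specific, tuple):
        if len(general) != len(specific):
            return None
        for g, s in zip(general, specific):
            subst = match(g, s, subst)
            if subst is None:
                return None
        return subst
    return subst if general == specific else None

def subsumes(general, specific):
    """G subsumes S if S is a substitution instance of G."""
    return match(general, specific) is not None

rule_general = ("->", ("crow", "X"), ("black", "X"))
rule_fred = ("->", ("crow", "fred"), ("black", "fred"))
```

With these definitions, `match(rule_general, rule_fred)` yields the substitution `{"X": "fred"}`, while `subsumes(rule_fred, rule_general)` is false because a constant cannot be matched against a variable.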
  40. 40. Learning Algorithms <ul><li>For the rest of this lecture, we will work with a specific learning dataset (due to Mitchell): </li></ul><ul><ul><li>Item Sky AirT Hum Wnd Wtr Fcst Enjy </li></ul></ul><ul><ul><li>1 Sun Wrm Nml Str Wrm Sam Yes </li></ul></ul><ul><ul><li>2 Sun Wrm High Str Wrm Sam Yes </li></ul></ul><ul><ul><li>3 Rain Cold High Str Wrm Chng No </li></ul></ul><ul><ul><li>4 Sun Wrm High Str Cool Chng Yes </li></ul></ul><ul><li>First, we look at a really simple algorithm, Maximally Specific Learning </li></ul>
  41. 41. Maximally Specific Learning <ul><li>The learning language consists of tuples, with one position for each of these attributes </li></ul><ul><ul><li>A ‘?’ represents that any value is acceptable for this attribute </li></ul></ul><ul><ul><li>A particular value represents that only that value is acceptable for this attribute </li></ul></ul><ul><ul><li>A ‘φ’ represents that no value is acceptable for this attribute </li></ul></ul><ul><ul><li>Thus (?, Cold, High, ?, ?, ?) represents the hypothesis that water sport is enjoyed only on cold, humid days. </li></ul></ul><ul><li>Note that our language is already heavily biased: only conjunctive hypotheses (hypotheses built with ‘^’) are allowed. </li></ul>
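A hypothesis in this language can be read as a test applied attribute by attribute. The sketch below shows one way to encode that reading; the function name `matches` and the constants are our own choices.

```python
ANY = "?"    # any value is acceptable
NONE = "φ"   # no value is acceptable

def matches(hypothesis, instance):
    """True if the conjunctive hypothesis classifies the instance as positive."""
    return all(
        h == ANY or (h != NONE and h == x)
        for h, x in zip(hypothesis, instance)
    )

h = (ANY, "Cold", "High", ANY, ANY, ANY)
item1 = ("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam")
item3 = ("Rain", "Cold", "High", "Str", "Wrm", "Chng")
```

Here `matches(h, item3)` is true and `matches(h, item1)` is false; a hypothesis containing `φ` in any position matches nothing.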
  42. 42. Find-S <ul><li>Find-S is a simple algorithm: its initial hypothesis is that water sport is never enjoyed </li></ul><ul><ul><li>It expands the hypothesis as positive data items are noted </li></ul></ul>
  43. 43. Running Find-S <ul><li>Initial Hypothesis </li></ul><ul><ul><li>The most specific hypothesis (water sports are never enjoyed): </li></ul></ul><ul><ul><li>h ← (φ,φ,φ,φ,φ,φ) </li></ul></ul><ul><li>After First Data Item </li></ul><ul><ul><li>Water sport is enjoyed only under the conditions of the first item: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,Nml,Str,Wrm,Sam) </li></ul></ul><ul><li>After Second Data Item </li></ul><ul><ul><li>Water sport is enjoyed only under the common conditions of the first two items: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,?,Str,Wrm,Sam) </li></ul></ul>
  44. 44. Running Find-S <ul><li>After Third Data Item </li></ul><ul><ul><li>Since this item is negative, it has no effect on the learning hypothesis: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,?,Str,Wrm,Sam) </li></ul></ul><ul><li>After Final Data Item </li></ul><ul><ul><li>Further generalises the conditions encountered: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,?,Str,?,?) </li></ul></ul>
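The trace on these two slides can be reproduced with a few lines of code. This is a sketch of Find-S for the tuple language only; the helper names are ours, and the dataset is the four-item table given earlier.

```python
ANY, NONE = "?", "φ"

# The four-item dataset: (attribute values, enjoyed?)
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def generalise(h, x):
    """Minimal generalisation of hypothesis h that also covers instance x."""
    out = []
    for hv, xv in zip(h, x):
        if hv == NONE:
            out.append(xv)    # first positive example: adopt its value
        elif hv == ANY or hv == xv:
            out.append(hv)    # already covers this value
        else:
            out.append(ANY)   # conflicting values: relax to '?'
    return tuple(out)

def find_s(data):
    """Return the final hypothesis and the hypothesis after each data item."""
    h = (NONE,) * len(data[0][0])
    trace = [h]
    for x, positive in data:
        if positive:          # negative items are ignored
            h = generalise(h, x)
        trace.append(h)
    return h, trace

final_h, trace = find_s(DATA)
```

The entries of `trace` correspond to the initial hypothesis and the hypothesis after each of the four items.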
  45. 45. Discussion <ul><li>We have found the most specific hypothesis corresponding to the dataset and the restricted (conjunctive) language </li></ul><ul><li>It is not clear it is the best hypothesis </li></ul><ul><ul><li>If the best hypothesis is not conjunctive (eg if we enjoy swimming if it’s warm or sunny), it will not be found </li></ul></ul><ul><ul><li>Find-S will not handle noise and inconsistencies well. </li></ul></ul><ul><ul><li>In other languages (not using pure conjunction) there may be more than one maximally specific hypothesis; Find-S will not work well here </li></ul></ul>
  46. 46. Version Spaces <ul><li>One possible improvement on Find-S is to search many possible solutions in parallel </li></ul><ul><li>Consistency </li></ul><ul><ul><li>A hypothesis h is consistent with a dataset D of training examples iff h gives the same answer on every element of the dataset as the dataset does </li></ul></ul><ul><li>Version Space </li></ul><ul><ul><li>The version space with respect to the language L and the dataset D is the set of hypotheses h in the language L which are consistent with D </li></ul></ul>
  47. 47. List-then-Eliminate <ul><li>Obvious algorithm </li></ul><ul><ul><li>The list-then-eliminate algorithm aims to find the version space in L for the given dataset D </li></ul></ul><ul><ul><li>It can thus return all hypotheses which could explain D </li></ul></ul><ul><li>It works by beginning with L as its set of hypotheses H </li></ul><ul><ul><li>As each item d of the dataset D is examined in turn, any hypotheses in H which are inconsistent with d are eliminated </li></ul></ul><ul><li>The language L is usually large, and often infinite, so this algorithm is computationally infeasible as it stands </li></ul>
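For a language as small as the conjunctive tuple language, list-then-eliminate can actually be run. The sketch below assumes finite attribute domains (the values `Cloudy` and `Weak` are assumptions, not taken from the slides), enumerates every hypothesis, and discards those inconsistent with each item in turn.

```python
from itertools import product

ANY, NONE = "?", "φ"

# Assumed attribute domains (illustrative; not given on the slides).
DOMAINS = [
    ["Sun", "Cloudy", "Rain"], ["Wrm", "Cold"], ["Nml", "High"],
    ["Str", "Weak"], ["Wrm", "Cool"], ["Sam", "Chng"],
]

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def matches(h, x):
    """True if hypothesis h classifies instance x as positive (φ never matches)."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))

def consistent(h, example):
    """h gives the same answer on the example as the dataset does."""
    x, label = example
    return matches(h, x) == label

def list_then_eliminate(domains, data):
    # Start with the whole language: every value-or-'?' tuple, plus the all-φ hypothesis.
    hypotheses = set(product(*[values + [ANY] for values in domains]))
    hypotheses.add((NONE,) * len(domains))
    for example in data:
        hypotheses = {h for h in hypotheses if consistent(h, example)}
    return hypotheses

version_space = list_then_eliminate(DOMAINS, DATA)
```

Under these assumptions the language has 973 hypotheses, of which six survive all four items. The approach stops being practical as soon as the language grows much beyond this.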
  48. 48. Version Space Representation <ul><li>One of the problems with the previous algorithm is the representation of the search space </li></ul><ul><ul><li>We need to represent version spaces efficiently </li></ul></ul><ul><li>General Boundary </li></ul><ul><ul><li>The general boundary G with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more general hypothesis in L which is consistent with D </li></ul></ul><ul><li>Specific Boundary </li></ul><ul><ul><li>The specific boundary S with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more specific hypothesis in L which is consistent with D </li></ul></ul>
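The two boundary definitions translate directly into code once a "more general than" test is available. The sketch below applies them to the six conjunctive hypotheses that are consistent with the lecture's four-item dataset; the function names are ours, and the generality test is valid only for tuples without `φ`.

```python
ANY = "?"

def more_general_or_equal(g, s):
    """g covers every instance that s covers (conjunctive tuples without φ)."""
    return all(gv == ANY or gv == sv for gv, sv in zip(g, s))

# The hypotheses consistent with the four-item dataset in the conjunctive language.
VERSION_SPACE = {
    ("Sun", "Wrm", ANY, "Str", ANY, ANY),
    ("Sun", ANY, ANY, "Str", ANY, ANY),
    ("Sun", "Wrm", ANY, ANY, ANY, ANY),
    (ANY, "Wrm", ANY, "Str", ANY, ANY),
    ("Sun", ANY, ANY, ANY, ANY, ANY),
    (ANY, "Wrm", ANY, ANY, ANY, ANY),
}

def general_boundary(vs):
    """Members with no strictly more general member in the version space."""
    return {h for h in vs
            if not any(o != h and more_general_or_equal(o, h) for o in vs)}

def specific_boundary(vs):
    """Members with no strictly more specific member in the version space."""
    return {h for h in vs
            if not any(o != h and more_general_or_equal(h, o) for o in vs)}

G = general_boundary(VERSION_SPACE)
S = specific_boundary(VERSION_SPACE)

# Every member lies between some element of S and some element of G.
sandwiched = all(
    any(more_general_or_equal(h, s) for s in S)
    and any(more_general_or_equal(g, h) for g in G)
    for h in VERSION_SPACE
)
```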
  49. 49. Version Space Representation 2 <ul><li>A version space may be represented by its general and specific boundary </li></ul><ul><li>That is, given the general and specific boundaries, the whole version space may be recovered </li></ul><ul><li>The Candidate Elimination Algorithm traces the general and specific boundaries of the version space as more examples and counter-examples of the concept are seen </li></ul><ul><ul><li>Positive examples are used to generalise the specific boundary </li></ul></ul><ul><ul><li>Negative examples permit the general boundary to be specialised. </li></ul></ul>
  50. 50. Candidate Elimination Algorithm <ul><li>Set G to the set of most general hypotheses in L </li></ul><ul><li>Set S to the set of most specific hypotheses in L </li></ul><ul><li>For each example d in D: </li></ul>
  51. 51. Candidate Elimination Algorithm <ul><ul><li>If d is a positive example </li></ul></ul><ul><ul><ul><li>Remove from G any hypothesis inconsistent with d </li></ul></ul></ul><ul><ul><ul><li>For each hypothesis s in S that is not consistent with d </li></ul></ul></ul><ul><ul><ul><ul><li>Remove s from S </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Remove from S any hypothesis that is more general than another hypothesis in S </li></ul></ul></ul></ul>
  52. 52. Candidate Elimination Algorithm <ul><ul><li>If d is a negative example </li></ul></ul><ul><ul><ul><li>Remove from S any hypothesis inconsistent with d </li></ul></ul></ul><ul><ul><ul><li>For each hypothesis g in G that is not consistent with d </li></ul></ul></ul><ul><ul><ul><ul><li>Remove g from G </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Remove from G any hypothesis that is less general than another hypothesis in G </li></ul></ul></ul></ul>
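The three Candidate Elimination slides can be put together into a short program for the conjunctive tuple language. This is a sketch under stated assumptions: the attribute domains (including `Cloudy` and `Weak`) are assumed, the helper names are ours, and the minimal generalisation and specialisation operators are the ones natural to this particular language.

```python
ANY, NONE = "?", "φ"

# Assumed attribute domains (illustrative; not given on the slides).
DOMAINS = [
    ["Sun", "Cloudy", "Rain"], ["Wrm", "Cold"], ["Nml", "High"],
    ["Str", "Weak"], ["Wrm", "Cool"], ["Sam", "Chng"],
]

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def matches(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(g, s):
    """True if g covers every instance that s covers."""
    if NONE in s:
        return True  # s covers nothing
    return all(gv == ANY or gv == sv for gv, sv in zip(g, s))

def min_generalisation(s, x):
    """The single minimal generalisation of s that covers x."""
    return tuple(xv if sv == NONE else (sv if sv == xv else ANY)
                 for sv, xv in zip(s, x))

def min_specialisations(g, x, domains):
    """All minimal specialisations of g that exclude x."""
    return [g[:i] + (value,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == ANY
            for value in domains[i] if value != x[i]]

def candidate_elimination(data, domains):
    n = len(domains)
    G = {(ANY,) * n}   # most general hypothesis
    S = {(NONE,) * n}  # most specific hypothesis
    for x, positive in data:
        if positive:
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                if matches(s, x):
                    new_S.add(s)
                else:
                    h = min_generalisation(s, x)
                    if any(more_general_or_equal(g, h) for g in G):
                        new_S.add(h)
            # Drop members that are more general than another member of S.
            S = {s for s in new_S
                 if not any(o != s and more_general_or_equal(s, o) for o in new_S)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                else:
                    for h in min_specialisations(g, x, domains):
                        if any(more_general_or_equal(h, s) for s in S):
                            new_G.add(h)
            # Drop members that are less general than another member of G.
            G = {g for g in new_G
                 if not any(o != g and more_general_or_equal(o, g) for o in new_G)}
    return S, G

S, G = candidate_elimination(DATA, DOMAINS)
```

On the four-item dataset the specific boundary ends up at the Find-S answer, and the general boundary contains two hypotheses: one fixing only Sky to `Sun`, the other fixing only AirT to `Wrm`.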
  53. 53. Summary <ul><li>Defining Learning </li></ul><ul><li>Kinds of Learning </li></ul><ul><li>Generalisation and Specialisation </li></ul><ul><li>Some Simple Learning Algorithms </li></ul><ul><ul><li>Find-S </li></ul></ul><ul><ul><li>Version Spaces </li></ul></ul><ul><ul><ul><li>List-then-Eliminate </li></ul></ul></ul><ul><ul><ul><li>Candidate Elimination </li></ul></ul></ul>
  54. 54. Thank you