
Basics of Machine Learning



  1. 1. Advanced Artificial Intelligence Lecture 3: Learning <ul><li>Bob McKay </li></ul><ul><ul><li>School of Computer Science and Engineering </li></ul></ul><ul><ul><li>College of Engineering </li></ul></ul><ul><ul><li>Seoul National University </li></ul></ul>
  2. 2. Outline <ul><li>Defining Learning </li></ul><ul><li>Kinds of Learning </li></ul><ul><li>Generalisation and Specialisation </li></ul><ul><li>Some Simple Learning Algorithms </li></ul>
  3. 3. References <ul><li>Mitchell, Tom M: Machine Learning, McGraw-Hill, 1997, ISBN 0 07 115467 1 </li></ul>
  4. 4. Defining a Learning System (Mitchell) <ul><li>“A program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” </li></ul>
  5. 5. Specifying a Learning System <ul><li>Specifying the task T, the performance P and the experience E defines the learning problem. Specifying the learning system requires us to define: </li></ul><ul><ul><li>Exactly what knowledge is to be learnt </li></ul></ul><ul><ul><li>How this knowledge is to be represented </li></ul></ul><ul><ul><li>How this knowledge is to be learnt </li></ul></ul>
  6. 6. Specifying What is to be Learnt <ul><li>Usually, the desired knowledge can be represented as a target valuation function V: I -> D </li></ul><ul><ul><li>It takes in information about the problem and gives back a desired decision </li></ul></ul><ul><li>Often, it is unrealistic to expect to learn the ideal function V </li></ul><ul><ul><li>All that is required is a ‘good enough’ approximation, V’: I -> D </li></ul></ul>
  7. 7. Specifying How Knowledge is to be Represented <ul><li>The function V’ must be represented symbolically, in some language L </li></ul><ul><ul><li>The language may be a well-known language </li></ul></ul><ul><ul><ul><li>Boolean expressions </li></ul></ul></ul><ul><ul><ul><li>Arithmetic functions </li></ul></ul></ul><ul><ul><ul><li>… . </li></ul></ul></ul><ul><ul><li>Or for some systems, the language may be defined by a grammar </li></ul></ul>
  8. 8. Specifying How the Knowledge is to be Learnt <ul><li>If the learning system is to be implemented, we must specify an algorithm A, which defines the way in which the system is to search the language L for an acceptable V’ </li></ul><ul><ul><li>That is, we must specify a search algorithm </li></ul></ul>
  9. 9. Structure of a Learning System <ul><li>Four modules </li></ul><ul><ul><li>The Performance System </li></ul></ul><ul><ul><li>The Critic </li></ul></ul><ul><ul><li>The Generaliser (or sometimes Specialiser) </li></ul></ul><ul><ul><li>The Experiment Generator </li></ul></ul>
  10. 10. Performance Module <ul><li>This is the system which actually uses the function V’ as we learn it </li></ul><ul><ul><li>Learning Task </li></ul></ul><ul><ul><ul><li>Learning to play checkers </li></ul></ul></ul><ul><ul><li>Performance module </li></ul></ul><ul><ul><ul><li>System for playing checkers </li></ul></ul></ul><ul><ul><ul><ul><li>(i.e. it makes the checkers moves) </li></ul></ul></ul></ul>
  11. 11. Critic Module <ul><li>The critic module evaluates the performance of the current V’ </li></ul><ul><ul><li>It produces a set of data from which the system can learn further </li></ul></ul>
  12. 12. Generaliser/Specialiser Module <ul><li>Takes a set of data and produces a new V’ for the system to run again </li></ul>
  13. 13. Experiment Generator <ul><li>Takes the new V’ </li></ul><ul><ul><li>Maybe also uses the previous history of the system </li></ul></ul><ul><li>Produces a new experiment for the performance system to undertake </li></ul>
  14. 14. The Importance of Bias <ul><li>Important theoretical results from learning theory (PAC learning) tell us that learning without some presuppositions is infeasible. </li></ul><ul><ul><li>Practical experience, of both machine and human learning, confirms this. </li></ul></ul><ul><ul><ul><li>To learn effectively, we must limit the class of possible V’. </li></ul></ul></ul><ul><li>Two kinds of bias are used in machine learning, separately or in combination: </li></ul><ul><ul><li>Language bias </li></ul></ul><ul><ul><li>Search bias </li></ul></ul><ul><ul><li>Combined bias </li></ul></ul><ul><ul><ul><li>Language and search bias are not mutually exclusive: most learning systems feature both </li></ul></ul></ul>
  15. 15. Language Bias <ul><li>The language L is restricted so that it cannot represent all possible target functions V </li></ul><ul><ul><li>This is usually on the basis of some knowledge we have about the likely form of V’ </li></ul></ul><ul><ul><li>It introduces risk </li></ul></ul><ul><ul><ul><li>Our system will fail if L does not contain an acceptable V’ </li></ul></ul></ul>
  16. 16. Search Bias <ul><li>The order in which the system searches L is controlled, so that promising areas for V’ are searched first </li></ul>
  17. 17. The Downside: No Free Lunches <ul><li>Wolpert and Macready’s No Free Lunch Theorem states, in effect, that averaged over all problems, all biases are equally good (or bad). </li></ul><ul><li>Conventional view </li></ul><ul><ul><li>The choice of a learning system cannot be universal </li></ul></ul><ul><ul><ul><li>It must be matched to the problem being solved </li></ul></ul></ul><ul><li>In most systems, the bias is not explicit </li></ul><ul><ul><li>The ability to identify the language and search biases of a particular system is an important aspect of machine learning </li></ul></ul><ul><li>Some more recent systems permit the explicit and flexible specification of both language and search biases </li></ul>
  18. 18. No Free Lunch: Does it Matter? <ul><li>Alternative view </li></ul><ul><ul><li>We aren’t interested in all problems </li></ul></ul><ul><ul><ul><li>We are only interested in problems which have solutions of less than some bounded complexity </li></ul></ul></ul><ul><ul><ul><ul><li>(so that we can understand the solutions) </li></ul></ul></ul></ul><ul><ul><li>The No Free Lunch Theorem may not apply in this case </li></ul></ul>
  19. 19. Some Dimensions of Learning <ul><li>Induction vs Discovery: </li></ul><ul><li>Guided learning vs learning from raw data </li></ul><ul><li>Learning How vs Learning That (vs Learning a Better That) </li></ul><ul><li>Stochastic vs Deterministic; Symbolic vs Subsymbolic </li></ul><ul><li>Clean vs Noisy Data </li></ul><ul><li>Discrete vs continuous variables </li></ul><ul><li>Attribute vs Relational Learning </li></ul><ul><li>The Importance of Background Knowledge </li></ul>
  20. 20. Induction vs Discovery <ul><li>Has the target concept been previously identified? </li></ul><ul><ul><li>Pearson: cloud classifications from satellite data </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Autoclass and H - R diagrams </li></ul></ul><ul><ul><li>AM and prime numbers </li></ul></ul><ul><ul><li>BACON and Boyle's Law </li></ul></ul>
  21. 21. Guided Learning vs Learning from Raw Data <ul><li>Does the learning system require carefully selected examples and counterexamples, as in a teacher – student situation? </li></ul><ul><ul><li>(allows fast learning) </li></ul></ul><ul><ul><li>CIGOL learning sort/merge </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Garvan institute's thyroid data </li></ul></ul>
  22. 22. Learning How vs Learning That vs Learning a Better That <ul><ul><li>Classifying handwritten symbols </li></ul></ul><ul><ul><li>Distinguishing vowel sounds (Sejnowski & Rosenberg) </li></ul></ul><ul><ul><li>Learning to fly a (simulated!) plane </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Michalski & learning diagnosis of soy diseases </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Mitchell & learning about chess forks </li></ul></ul>
  23. 23. Stochastic vs Deterministic; Symbolic vs Subsymbolic <ul><ul><li>Classifying handwritten symbols (stochastic, subsymbolic) </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Predicting plant distributions (stochastic, symbolic) </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Cloud classification (deterministic, symbolic) </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>? (deterministic, subsymbolic) </li></ul></ul>
  24. 24. Clean vs Noisy Data <ul><ul><li>Learning to diagnose errors in programs </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Greater gliders in the Coolangubra </li></ul></ul>
  25. 25. Discrete vs Continuous Variables <ul><ul><li>Quinlan's chess end games </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Pearson's clouds (eg cloud heights) </li></ul></ul>
  26. 26. Attribute vs Relational Learning <ul><ul><li>Predicting plant distributions </li></ul></ul><ul><li>vs </li></ul><ul><ul><li>Predicting animal distributions </li></ul></ul><ul><ul><ul><li>(because plants can’t move, they don’t care - much - about spatial relationships) </li></ul></ul></ul>
  27. 27. The importance of Background Knowledge <ul><li>Learning about faults in a satellite power supply </li></ul><ul><ul><li>general electric circuit theory </li></ul></ul><ul><ul><li>knowledge about the particular circuit </li></ul></ul>
  28. 28. Generalisation and Learning <ul><li>What do we mean when we say of two propositions, S and G, that G is a generalisation of S? </li></ul><ul><ul><li>Suppose Skippy is a grey kangaroo. </li></ul></ul><ul><ul><li>We would regard ‘Kangaroos are grey’ as a generalisation of ‘Skippy is grey’. </li></ul></ul><ul><ul><li>In any world in which ‘Kangaroos are grey’ is true, ‘Skippy is grey’ will also be true. </li></ul></ul><ul><li>In other words, if G is a generalisation of S, then S is 'at least as true' as G: </li></ul><ul><ul><li>That is, S is true in all states of the world in which G is, and perhaps in other states as well. </li></ul></ul>
  29. 29. Generalisation and Inference <ul><li>In logic, we assume that if S is true in all worlds in which G is, then </li></ul><ul><ul><li>G -> S </li></ul></ul><ul><li>That is, G is a generalisation of S exactly when G implies S </li></ul><ul><ul><li>So we can think of learning from S as a search for a suitable G for which G -> S </li></ul></ul><ul><li>In propositional learning, this is often used as a definition: </li></ul><ul><ul><li>G is more general than S if and only if G -> S </li></ul></ul>
  30. 30. Issues <ul><li>Equating generalisation and logical implication is only useful if the validity of an implication can be readily computed </li></ul><ul><ul><li>In the propositional calculus, checking validity takes exponential time in the worst case with known algorithms </li></ul></ul><ul><ul><li>In the predicate calculus, validity is undecidable </li></ul></ul><ul><li>So the definition is not universally useful </li></ul><ul><ul><li>(although for some parts of logic - eg learning rules - it is perfectly adequate). </li></ul></ul>
  31. 31. A Common Misunderstanding <ul><li>Suppose we have two rules, </li></ul><ul><ul><li>1) A ∧ B -> G </li></ul></ul><ul><ul><li>2) A ∧ B ∧ C -> G </li></ul></ul><ul><li>Clearly, we would want 1 to be a generalisation of 2 </li></ul><ul><li>This is OK with our definition, because </li></ul><ul><ul><li>((A ∧ B -> G) -> (A ∧ B ∧ C -> G)) </li></ul></ul><ul><li>is valid </li></ul><ul><ul><li>But the confusing thing is that ((A ∧ B ∧ C) -> (A ∧ B)) is also valid </li></ul></ul><ul><ul><ul><li>If you only look at the hypotheses of the rules, rather than the whole rules, the implication is the wrong way around </li></ul></ul></ul><ul><ul><ul><li>Note that some textbooks are themselves confused about this </li></ul></ul></ul>
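The direction of the implications on this slide can be checked mechanically with a truth table. The following is only a minimal sketch: the helper names `implies` and `is_valid` are illustrative and not part of the lecture, and the brute-force enumeration of all 2^n assignments is exactly the exponential cost mentioned on the previous slide.

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

def is_valid(formula, n_vars):
    """True if the formula holds under every assignment of its n_vars Boolean variables."""
    return all(formula(*values) for values in product([False, True], repeat=n_vars))

# Rule 1: A and B -> G.  Rule 2: A and B and C -> G.
rule1 = lambda a, b, c, g: implies(a and b, g)
rule2 = lambda a, b, c, g: implies(a and b and c, g)

# Rule 1 generalises rule 2: (rule 1 -> rule 2) is valid ...
rule1_implies_rule2 = is_valid(
    lambda a, b, c, g: implies(rule1(a, b, c, g), rule2(a, b, c, g)), 4)
# ... but the converse is not.
rule2_implies_rule1 = is_valid(
    lambda a, b, c, g: implies(rule2(a, b, c, g), rule1(a, b, c, g)), 4)
# Looking only at the rule bodies, the implication runs the other way.
body2_implies_body1 = is_valid(
    lambda a, b, c: implies(a and b and c, a and b), 3)

print(rule1_implies_rule2, rule2_implies_rule1, body2_implies_body1)
```

The counterexample for the converse is A = B = true, C = G = false: rule 2 holds vacuously while rule 1 fails.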
  32. 32. Defining Generalisation <ul><li>We could try to define the properties that generalisation must satisfy. </li></ul><ul><li>So let's write down some axioms. We need some notation. </li></ul><ul><ul><li>We will write 'S &lt;<sub>G</sub> G' as shorthand for 'S is less general than G'. </li></ul></ul><ul><li>Axioms: </li></ul><ul><ul><li>Transitivity: If A &lt;<sub>G</sub> B and B &lt;<sub>G</sub> C then also A &lt;<sub>G</sub> C </li></ul></ul><ul><ul><li>Antisymmetry: If A &lt;<sub>G</sub> B then it's not true that B &lt;<sub>G</sub> A </li></ul></ul><ul><ul><li>Top: there is a unique element, ⊤, for which it is always true that A &lt;<sub>G</sub> ⊤. </li></ul></ul><ul><ul><li>Bottom: there is a unique element, ⊥, for which it is always true that ⊥ &lt;<sub>G</sub> A. </li></ul></ul>
  33. 33. Picturing Generalisation <ul><li>We can draw a 'picture' of a generalisation hierarchy satisfying these axioms: </li></ul>
  34. 34. Specifying Generalisation <ul><li>In a particular domain, the generalisation hierarchy may be defined in either of two ways: </li></ul><ul><ul><li>By giving a general definition of what generalisation means in that domain </li></ul></ul><ul><ul><ul><li>Example: our earlier definition in terms of implication </li></ul></ul></ul><ul><ul><li>By directly specifying the specialisation and generalisation operators that may be used to climb up and down the links in the generalisation hierarchy </li></ul></ul>
  35. 35. Learning and Generalisation <ul><li>How does learning relate to generalisation? </li></ul><ul><ul><li>We can view most learning as an attempt to find an appropriate hypothesis that generalises the examples. </li></ul></ul><ul><ul><li>In noise-free domains, we usually want the generalisation to cover all the examples. </li></ul></ul><ul><ul><li>Once we introduce noise, we want the generalisation to cover 'enough' examples, and the interesting bit is in defining what 'enough' is. </li></ul></ul><ul><li>In our picture of a generalisation hierarchy, most learning algorithms can be viewed as methods for searching the hierarchy. </li></ul><ul><ul><li>The examples can be pictured as locations low down in the hierarchy, and the learning algorithm attempts to find a location that is above all (or 'enough') of them in the hierarchy, but usually no higher 'than it needs to be' </li></ul></ul>
  36. 36. Searching the Generalisation Hierarchy <ul><li>The commonest approaches are: </li></ul><ul><ul><li>generalising search </li></ul></ul><ul><ul><ul><li>the search is upward from the original examples, towards the more general hypotheses </li></ul></ul></ul><ul><ul><li>specialising search </li></ul></ul><ul><ul><ul><li>the search is downward from the most general hypothesis, towards the more specific examples </li></ul></ul></ul><ul><ul><li>Some algorithms use different approaches. Mitchell's version space approach, for example, tries to 'home in' on the right generalisation from both directions at once. </li></ul></ul>
  37. 37. Completeness and Generalisation <ul><li>Many approaches to axiomatising generalisation add an extra axiom: </li></ul><ul><ul><li>Completeness: For any set Σ of members of the generalisation hierarchy, there is a unique 'least general generalisation' L, which satisfies two properties: </li></ul></ul><ul><ul><ul><li>1) for every S in Σ, S &lt;<sub>G</sub> L </li></ul></ul></ul><ul><ul><ul><li>2) if any other L' satisfies 1), then L &lt;<sub>G</sub> L' </li></ul></ul></ul><ul><ul><li>If this definition is hard to understand, compare it with the definition of 'Least Upper Bound' in set theory, or of 'Least Common Multiple' in arithmetic </li></ul></ul>
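The comparison with the Least Common Multiple can be made concrete. In this small sketch (an illustration only, not from the lecture) the ordering 'a is below b' is taken to mean 'a divides b', so the least upper bound of a set of numbers is their LCM; the example set `sigma` is arbitrary.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def least_upper_bound(numbers):
    """Least element, in the divisibility ordering, that lies above every number."""
    return reduce(lcm, numbers)

sigma = [4, 6, 10]            # arbitrary example set
L = least_upper_bound(sigma)

# Property 1: every member of sigma lies below L (it divides L).
property_1 = all(L % s == 0 for s in sigma)

# Property 2: any other upper bound L' lies above L (L divides L').
other_bounds = [m for m in range(1, 4 * L + 1) if all(m % s == 0 for s in sigma)]
property_2 = all(m % L == 0 for m in other_bounds)

print(L, property_1, property_2)
```

Property 2 is only checked here over a finite range of candidates, which is enough to show the pattern.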
  38. 38. Restricting Generalisation <ul><li>Let's go back to our original definition of generalisation: </li></ul><ul><ul><li>G generalises S iff G -> S </li></ul></ul><ul><li>In the general predicate calculus case, this relation is uncomputable, so it's not very useful </li></ul><ul><li>One approach to avoiding the problem is to limit the implications allowed </li></ul>
  39. 39. Generalisation and Substitution <ul><li>Very commonly, the generalisations we want to make involve turning a constant into a variable. </li></ul><ul><ul><li>Suppose we see a particular black crow, fred, and we notice: </li></ul></ul><ul><ul><ul><li>crow(fred) -> black(fred) </li></ul></ul></ul><ul><ul><li>and we may wish to generalise this to </li></ul></ul><ul><ul><ul><li>∀ X(crow(X) -> black(X)) </li></ul></ul></ul><ul><li>Notice that the original proposition can be recovered from the generalisation by substituting 'fred' for the variable 'X' </li></ul><ul><ul><li>The original is a substitution instance of the generalisation </li></ul></ul><ul><ul><li>So we could define a new, restricted generalisation: </li></ul></ul><ul><ul><ul><li>G subsumes S if S is a substitution instance of G </li></ul></ul></ul><ul><li>This is a special case of our earlier definition, because a substitution instance is always implied by the original proposition. </li></ul>
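Checking whether one proposition is a substitution instance of another is a simple one-way matching problem. The sketch below makes several assumptions that are not in the lecture: formulas are encoded as nested tuples, a string starting with an upper-case letter is treated as a variable, and the names `match` and `subsumes` are illustrative.

```python
def is_variable(term):
    """Assumed convention: variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def match(general, specific, subst=None):
    """Return a substitution turning `general` into `specific`, or None if there is none."""
    subst = dict(subst or {})
    if is_variable(general):
        if general in subst:
            # A variable must be replaced consistently everywhere it occurs.
            return subst if subst[general] == specific else None
        subst[general] = specific
        return subst
    if isinstance(general, tuple) and isinstance(specific, tuple):
        if len(general) != len(specific):
            return None
        for g, s in zip(general, specific):
            subst = match(g, s, subst)
            if subst is None:
                return None
        return subst
    return subst if general == specific else None

def subsumes(general, specific):
    """G subsumes S if S is a substitution instance of G."""
    return match(general, specific) is not None

rule = ("->", ("crow", "X"), ("black", "X"))             # crow(X) -> black(X)
observation = ("->", ("crow", "fred"), ("black", "fred"))
mismatch = ("->", ("crow", "fred"), ("black", "tom"))

print(match(rule, observation))
print(subsumes(rule, observation), subsumes(rule, mismatch), subsumes(observation, rule))
```

Unlike general implication in the predicate calculus, this test always terminates, which is the point of restricting generalisation to subsumption.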
  40. 40. Learning Algorithms <ul><li>For the rest of this lecture, we will work with a specific learning dataset (due to Mitchell): </li></ul><table><tr><th>Item</th><th>Sky</th><th>AirT</th><th>Hum</th><th>Wnd</th><th>Wtr</th><th>Fcst</th><th>Enjy</th></tr><tr><td>1</td><td>Sun</td><td>Wrm</td><td>Nml</td><td>Str</td><td>Wrm</td><td>Sam</td><td>Yes</td></tr><tr><td>2</td><td>Sun</td><td>Wrm</td><td>High</td><td>Str</td><td>Wrm</td><td>Sam</td><td>Yes</td></tr><tr><td>3</td><td>Rain</td><td>Cold</td><td>High</td><td>Str</td><td>Wrm</td><td>Chng</td><td>No</td></tr><tr><td>4</td><td>Sun</td><td>Wrm</td><td>High</td><td>Str</td><td>Cool</td><td>Chng</td><td>Yes</td></tr></table><ul><li>First, we look at a really simple algorithm, Maximally Specific Learning </li></ul>
  41. 41. Maximally Specific Learning <ul><li>The learning language consists of sets of tuples, representing the values of these attributes </li></ul><ul><ul><li>A ‘?’ represents that any value is acceptable for this attribute </li></ul></ul><ul><ul><li>A particular value represents that only that value is acceptable for this attribute </li></ul></ul><ul><ul><li>A ‘φ’ represents that no value is acceptable for this attribute </li></ul></ul><ul><ul><li>Thus (?, Cold, High, ?, ?, ?) represents the hypothesis that water sport is enjoyed only on cold, humid days. </li></ul></ul><ul><li>Note that our language is already heavily biased: only conjunctive hypotheses (hypotheses built with ‘∧’) are allowed. </li></ul>
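One possible encoding of this hypothesis language is sketched below. The function names `covers` and `more_general_or_equal` are illustrative and not from the lecture; tuples follow the attribute order Sky, AirT, Hum, Wnd, Wtr, Fcst.

```python
ANY, NONE = "?", "φ"   # 'any value is acceptable' and 'no value is acceptable'

def covers(h, x):
    """True if hypothesis h classifies data item x as positive."""
    # A 'φ' constraint never equals a real attribute value, so it covers nothing.
    return all(c == ANY or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 covers every item that h2 covers."""
    if NONE in h2:
        return True   # h2 covers nothing, so anything is at least as general
    return all(a == ANY or a == b for a, b in zip(h1, h2))

h = (ANY, "Cold", "High", ANY, ANY, ANY)          # 'only on cold, humid days'
item1 = ("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam")
item3 = ("Rain", "Cold", "High", "Str", "Wrm", "Chng")

print(covers(h, item1), covers(h, item3))
print(more_general_or_equal((ANY,) * 6, h), more_general_or_equal(h, (ANY,) * 6))
```

Because each hypothesis is a conjunction of per-attribute constraints, generality can be decided attribute by attribute, with no need for a general implication test.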
  42. 42. Find-S <ul><li>Find-S is a simple algorithm: its initial hypothesis is that water sport is never enjoyed </li></ul><ul><ul><li>It expands the hypothesis as positive data items are noted </li></ul></ul>
  43. 43. Running Find-S <ul><li>Initial Hypothesis </li></ul><ul><ul><li>The most specific hypothesis (water sports are never enjoyed): </li></ul></ul><ul><ul><li>h ← (φ,φ,φ,φ,φ,φ) </li></ul></ul><ul><li>After First Data Item </li></ul><ul><ul><li>Water sport is enjoyed only under the conditions of the first item: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,Nml,Str,Wrm,Sam) </li></ul></ul><ul><li>After Second Data Item </li></ul><ul><ul><li>Water sport is enjoyed only under the common conditions of the first two items: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,?,Str,Wrm,Sam) </li></ul></ul>
  44. 44. Running Find-S <ul><li>After Third Data Item </li></ul><ul><ul><li>Since this item is negative, it has no effect on the learning hypothesis: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,?,Str,Wrm,Sam) </li></ul></ul><ul><li>After Final Data Item </li></ul><ul><ul><li>Further generalises the conditions encountered: </li></ul></ul><ul><ul><li>h ← (Sun,Wrm,?,Str,?,?) </li></ul></ul>
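The run traced on these two slides can be reproduced with a few lines of code. This is a sketch rather than a definitive implementation: the names `generalise` and `find_s` and the data encoding are illustrative, with the four items copied from the dataset slide.

```python
ANY, NONE = "?", "φ"

# (Sky, AirT, Hum, Wnd, Wtr, Fcst), Enjy
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def generalise(h, x):
    """Minimally generalise h so that it covers the positive item x."""
    return tuple(
        v if c == NONE else (c if c == v else ANY)
        for c, v in zip(h, x)
    )

def find_s(data, n_attributes=6):
    """Return the final hypothesis and the hypothesis after each data item."""
    h = (NONE,) * n_attributes        # most specific hypothesis: never enjoyed
    trace = [h]
    for x, enjoyed in data:
        if enjoyed:                   # negative items are ignored
            h = generalise(h, x)
        trace.append(h)
    return h, trace

final, trace = find_s(DATA)
for step, h in enumerate(trace):
    print(step, h)
```

The loop never looks at negative items, which is why Find-S cannot detect that its hypothesis has become inconsistent with the data.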
  45. 45. Discussion <ul><li>We have found the most specific hypothesis corresponding to the dataset and the restricted (conjunctive) language </li></ul><ul><li>It is not clear it is the best hypothesis </li></ul><ul><ul><li>If the best hypothesis is not conjunctive (eg if we enjoy swimming if it’s warm or sunny), it will not be found </li></ul></ul><ul><ul><li>Find-S will not handle noise and inconsistencies well. </li></ul></ul><ul><ul><li>In other languages (not using pure conjunction) there may be more than one maximally specific hypothesis; Find-S will not work well here </li></ul></ul>
  46. 46. Version Spaces <ul><li>One possible improvement on Find-S is to search many possible solutions in parallel </li></ul><ul><li>Consistency </li></ul><ul><ul><li>A hypothesis h is consistent with a dataset D of training examples iff h gives the same answer on every element of the dataset as the dataset does </li></ul></ul><ul><li>Version Space </li></ul><ul><ul><li>The version space with respect to the language L and the dataset D is the set of hypotheses h in the language L which are consistent with D </li></ul></ul>
  47. 47. List-then-Eliminate <ul><li>Obvious algorithm </li></ul><ul><ul><li>The list-then-eliminate algorithm aims to find the version space in L for the given dataset D </li></ul></ul><ul><ul><li>It can thus return all hypotheses which could explain D </li></ul></ul><ul><li>It works by beginning with L as its set of hypotheses H </li></ul><ul><ul><li>As each item d of the dataset D is examined in turn, any hypotheses in H which are inconsistent with d are eliminated </li></ul></ul><ul><li>The language L is usually large, and often infinite, so this algorithm is computationally infeasible as it stands </li></ul>
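For the small conjunctive language used in this lecture, list-then-eliminate is actually feasible, since the language is finite. The sketch below assumes attribute domains that are not all visible in the four data items (for example the values `Cloud` and `Weak`); these domains and the function names are assumptions made for illustration.

```python
from itertools import product

ANY, NONE = "?", "φ"

# Assumed attribute domains; only some of these values appear in the dataset.
DOMAINS = [("Sun", "Cloud", "Rain"), ("Wrm", "Cold"), ("Nml", "High"),
           ("Str", "Weak"), ("Wrm", "Cool"), ("Sam", "Chng")]

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def covers(h, x):
    """True if hypothesis h classifies item x as positive."""
    return all(c == ANY or c == v for c, v in zip(h, x))

def list_then_eliminate(domains, data):
    # Begin with the whole language L: every tuple of '?' or a value,
    # plus the single 'never enjoyed' hypothesis.
    H = list(product(*[(ANY,) + d for d in domains]))
    H.append((NONE,) * len(domains))
    language_size = len(H)
    # Eliminate hypotheses that disagree with each data item in turn.
    for x, label in data:
        H = [h for h in H if covers(h, x) == label]
    return H, language_size

version_space, language_size = list_then_eliminate(DOMAINS, DATA)
print(language_size, len(version_space))
for h in sorted(version_space):
    print(h)
```

Even here the language has 973 hypotheses for six attributes, and it grows multiplicatively with each added attribute, which is the motivation for the boundary representation that follows.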
  48. 48. Version Space Representation <ul><li>One of the problems with the previous algorithm is the representation of the search space </li></ul><ul><ul><li>We need to represent version spaces efficiently </li></ul></ul><ul><li>General Boundary </li></ul><ul><ul><li>The general boundary G with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more general hypothesis in L which is consistent with D </li></ul></ul><ul><li>Specific Boundary </li></ul><ul><ul><li>The specific boundary S with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more specific hypothesis in L which is consistent with D </li></ul></ul>
  49. 49. Version Space Representation 2 <ul><li>A version space may be represented by its general and specific boundary </li></ul><ul><li>That is, given the general and specific boundaries, the whole version space may be recovered </li></ul><ul><li>The Candidate Elimination Algorithm traces the general and specific boundaries of the version space as more examples and counter-examples of the concept are seen </li></ul><ul><ul><li>Positive examples are used to generalise the specific boundary </li></ul></ul><ul><ul><li>Negative examples permit the general boundary to be specialised. </li></ul></ul>
  50. 50. Candidate Elimination Algorithm <ul><li>Set G to the set of most general hypotheses in L </li></ul><ul><li>Set S to the set of most specific hypotheses in L </li></ul><ul><li>For each example d in D: </li></ul>
  51. 51. Candidate Elimination Algorithm <ul><li>If d is a positive example </li></ul><ul><ul><li>Remove from G any hypothesis inconsistent with d </li></ul></ul><ul><ul><li>For each hypothesis s in S that is not consistent with d </li></ul></ul><ul><ul><ul><li>Remove s from S </li></ul></ul></ul><ul><ul><ul><li>Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h </li></ul></ul></ul><ul><ul><ul><li>Remove from S any hypothesis that is more general than another hypothesis in S </li></ul></ul></ul>
  52. 52. Candidate Elimination Algorithm <ul><li>If d is a negative example </li></ul><ul><ul><li>Remove from S any hypothesis inconsistent with d </li></ul></ul><ul><ul><li>For each hypothesis g in G that is not consistent with d </li></ul></ul><ul><ul><ul><li>Remove g from G </li></ul></ul></ul><ul><ul><ul><li>Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h </li></ul></ul></ul><ul><ul><ul><li>Remove from G any hypothesis that is less general than another hypothesis in G </li></ul></ul></ul>
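The three slides above can be put together as follows for the conjunctive tuple language. This is a sketch under stated assumptions, not a definitive implementation: the attribute domains (including `Cloud` and `Weak`) are assumed, all function names are illustrative, and the minimal generalisation and specialisation operators are the ones natural for this particular language.

```python
ANY, NONE = "?", "φ"

# Assumed attribute domains; only some of these values appear in the dataset.
DOMAINS = [("Sun", "Cloud", "Rain"), ("Wrm", "Cold"), ("Nml", "High"),
           ("Str", "Weak"), ("Wrm", "Cool"), ("Sam", "Chng")]
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def covers(h, x):
    return all(c == ANY or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    if NONE in h2:
        return True
    return all(a == ANY or a == b for a, b in zip(h1, h2))

def min_generalisation(s, x):
    """The unique minimal generalisation of s that covers x (conjunctive language)."""
    return tuple(v if c == NONE else (c if c == v else ANY) for c, v in zip(s, x))

def min_specialisations(g, x, domains):
    """Replace one '?' in g by a value different from x's, so that x is excluded."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == ANY
            for v in domains[i] if v != x[i]]

def candidate_elimination(data, domains):
    G = {(ANY,) * len(domains)}     # most general hypothesis
    S = {(NONE,) * len(domains)}    # most specific hypothesis
    for x, positive in data:
        if positive:
            G = {g for g in G if covers(g, x)}
            S = {s if covers(s, x) else min_generalisation(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
            S = {s for s in S
                 if not any(t != s and more_general_or_equal(s, t) for t in S)}
        else:
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                else:
                    new_G.update(h for h in min_specialisations(g, x, domains)
                                 if any(more_general_or_equal(h, s) for s in S))
            G = {g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)}
    return S, G

S, G = candidate_elimination(DATA, DOMAINS)
print("S =", S)
print("G =", G)
```

On the lecture's dataset the specific boundary coincides with the Find-S result, while the general boundary contains two hypotheses; every hypothesis lying between the two boundaries is consistent with the data.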
  53. 53. Summary <ul><li>Defining Learning </li></ul><ul><li>Kinds of Learning </li></ul><ul><li>Generalisation and Specialisation </li></ul><ul><li>Some Simple Learning Algorithms </li></ul><ul><ul><li>Find-S </li></ul></ul><ul><ul><li>Version Spaces </li></ul></ul><ul><ul><ul><li>List-then-Eliminate </li></ul></ul></ul><ul><ul><ul><li>Candidate Elimination </li></ul></ul></ul>
  54. 54. Thank you