CS364 Artificial Intelligence: Machine Learning

  • Which class boundary is better?
  • Least squares: linear models; given a set of points, fit a line/boundary by minimising the residual sum of squares.
  • Decision trees: our focus.
  • Support vector machines: non-linear boundaries represented as linear boundaries in a transformed (high-dimensional) space.
  • Boosting: combine weak classifiers, re-weighting the training data as learning proceeds.
  • Neural networks: biologically inspired supervised and unsupervised techniques.
  • K-means: given a set of points, partition them into k clusters around learned cluster centres.
  • Genetic algorithms: evolutionary techniques that evolve solutions using a fitness function, breeding and mutation.

    1. CS364 Artificial Intelligence: Machine Learning. Matthew Casey, portal.surrey.ac.uk/computing/resources/l3/cs364
    2. Learning Outcomes
       • Describe methods for acquiring human knowledge
         - Through experience
       • Evaluate which of the acquisition methods would be most appropriate in a given situation
         - Limited data available through example
    3. Learning Outcomes
       • Describe techniques for representing acquired knowledge in a way that facilitates automated reasoning over the knowledge
         - Generalise experience to novel situations
       • Categorise and evaluate AI techniques according to different criteria such as applicability and ease of use, and intelligently participate in the selection of the appropriate techniques and tools to solve simple problems
         - Strategies to overcome the 'knowledge engineering bottleneck'
    4. What is Learning?
       • 'The action of receiving instruction or acquiring knowledge'
       • 'A process which leads to the modification of behaviour or the acquisition of new abilities or responses, and which is additional to natural development by growth or maturation'
       Source: Oxford English Dictionary Online, http://www.oed.com/, accessed October 2003
    5. Machine Learning
       • Negnevitsky: 'In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy' (2005:165)
       • Callan: 'A machine or software tool would not be viewed as intelligent if it could not adapt to changes in its environment' (2003:225)
       • Luger: 'Intelligent agents must be able to change through the course of their interactions with the world' (2002:351)
    6. Types of Learning
       • Inductive learning
         - Learning from examples
       • Evolutionary/genetic learning
         - Shaping a population of individual solutions through survival of the fittest
       • Emergent learning
         - Learning through social interaction: game of life
    7. Inductive Learning
       • Supervised learning
         - Training examples with a known classification from a teacher
       • Unsupervised learning
         - No pre-classification of training examples
         - Competitive learning: learning through competition on training examples
    8. Key Concepts
       • Learn from experience
         - Through examples, analogy or discovery
       • To adapt
         - Changes in response to interaction
       • Generalisation
         - To use experience to form a response to novel situations
    9. Generalisation
    10. Machine Learning
       • Techniques and algorithms that adapt through experience
       • Used for:
         - Interpretation / visualisation: summarise data
         - Prediction: time series / stock data
         - Classification: malignant or benign tumours
         - Regression: curve fitting
         - Discovery: data mining / pattern discovery
    11. Why?
       • Complexity of task / amount of data
         - Other techniques fail or are computationally expensive
       • Problems that cannot be defined
         - Discovery of patterns / data mining
       • Knowledge engineering bottleneck
         - 'Cost and difficulty of building expert systems using traditional […] techniques' (Luger 2002:351)
    12. Common Techniques
       • Least squares
       • Decision trees
       • Support vector machines
       • Boosting
       • Neural networks
       • K-means
       • Genetic algorithms
       Hastie, T., Tibshirani, R. & Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag.
    13. Neural Networks
       • Developed from models of the biology of behaviour
         - The human brain contains of the order of 10^10 neurons, each connecting to around 10^4 others
         - Parallel processing
         - Inhibitory and excitatory
       • Supervised:
         - Perceptron, radial basis function
       • Unsupervised:
         - Self-organising map
       http://cwabacon.pearsoned.com/bookbind/pubbooks/pinel_ab/chapter3/deluxe.html
    14. Neural Networks
       • Output is a complex (often non-linear) combination of the inputs
       • [Diagram: input signals x_1 … x_m are multiplied by synaptic weights w_k1 … w_km, summed with the bias w_k0 at the summing junction Σ, and passed through the activation function φ(·) to give the output y_k, i.e. y_k = φ(w_k0 + Σ_j w_kj x_j)]
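       A minimal Python sketch of this computation (illustrative only, not part of the slides; the sigmoid is just one common choice for the activation function φ):

           import math

           def neuron_output(x, weights, bias):
               """Single neuron: weighted sum of the inputs plus the bias,
               passed through a sigmoid activation function."""
               v = bias + sum(w * xi for w, xi in zip(weights, x))  # summing junction
               return 1.0 / (1.0 + math.exp(-v))                    # activation function phi

           # Example with three inputs and arbitrary illustrative weights
           print(neuron_output([0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1))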
    15. Decision Trees
       • A map of the reasoning process, good at solving classification problems (Negnevitsky, 2005)
       • A decision tree represents a number of different attributes and values
         - Nodes represent attributes
         - Branches represent values of the attributes
       • A path through the tree represents a decision
       • A tree can be associated with rules
    16. Example 1
       • Consider one rule for an ice-cream seller (Callan 2003:241)
         IF Outlook = Sunny
         AND Temperature = Hot
         THEN Sell
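       As a quick illustration (my own sketch, not from the slides), this single rule is just a conjunction of attribute tests:

           def sell_ice_cream(outlook, temperature):
               """The single rule from Example 1: sell only when sunny and hot."""
               return outlook == "Sunny" and temperature == "Hot"

           print(sell_ice_cream("Sunny", "Hot"))      # True: sell
           print(sell_ice_cream("Overcast", "Hot"))   # False: this one rule does not fire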
    17. Example 1
       [Tree diagram: the rule drawn as a decision tree. The root node Outlook branches to Sunny and Overcast; below it, Temperature and Holiday Season nodes branch on their values (Hot, Mild, Cold; Yes, No), with each path ending in a Sell or Don't Sell leaf.]
    18. Construction
       • Concept learning:
         - Inducing concepts from examples
       • We can intuitively construct a decision tree for a small set of examples
       • Different algorithms are used to construct a tree based upon the examples
         - Most popular: ID3 (Quinlan, 1986)
    19. Example 2
       Draw a tree based upon the following data:

       Item   X      Y      Class
       1      False  False  +
       2      True   False  +
       3      False  True   -
       4      True   True   -
    20. Example 2: Tree 1
       Root node Y: if Y is True then Negative (items {3,4}); if Y is False then Positive (items {1,2}).
    21. Example 2: Tree 2
       Root node X. If X is True, test Y: Y True gives Negative (item {4}), Y False gives Positive (item {2}). If X is False, test Y: Y True gives Negative (item {3}), Y False gives Positive (item {1}).
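       Both trees can be written as tiny classification functions. A sketch (not from the slides) checking that each tree is consistent with the four examples above:

           # The four training examples from Example 2
           examples = [
               {"Item": 1, "X": False, "Y": False, "Class": "+"},
               {"Item": 2, "X": True,  "Y": False, "Class": "+"},
               {"Item": 3, "X": False, "Y": True,  "Class": "-"},
               {"Item": 4, "X": True,  "Y": True,  "Class": "-"},
           ]

           def tree1(e):
               """Tree 1: a single test on Y."""
               return "-" if e["Y"] else "+"

           def tree2(e):
               """Tree 2: test X first, then Y on each branch (both branches
               apply the same test, which is why Tree 1 is the simpler tree)."""
               if e["X"]:
                   return "-" if e["Y"] else "+"
               return "-" if e["Y"] else "+"

           assert all(tree1(e) == e["Class"] and tree2(e) == e["Class"] for e in examples)
           print("both trees fit the training data")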
    22. Which Tree?
       • Different trees can be constructed from the same set of examples
       • Which tree is the best?
         - Based upon the choice of attribute at each node in the tree
         - A split in the tree (branches) should correspond to the predictor with the maximum separating power
       • Examples can be contradictory
         - Real life is noisy
    23. Example 3
       • Callan (2003:242-247)
         - Locating a new bar
    24. Choosing Attributes
       • Entropy:
         - A measure of disorder (high is bad)
       • For c classification categories, an attribute a that has value v, and p_i the probability of an example with a=v being in category i, the entropy E is:
         E(a=v) = - Σ_{i=1..c} p_i log2 p_i
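       A minimal Python version of this definition (illustrative, not from the slides), checked against the City/Town figures used on the following slides:

           import math

           def entropy(counts):
               """E = -sum_i p_i * log2(p_i), given the number of examples
               in each classification category."""
               total = sum(counts)
               e = 0.0
               for n in counts:
                   if n:                        # treat 0 * log2(0) as 0
                       p = n / total
                       e -= p * math.log2(p)
               return e

           print(round(entropy([7, 3]), 3))     # 0.881  (City/Town = Y: 7 positive, 3 negative)
           print(round(entropy([4, 6]), 3))     # 0.971  (City/Town = N: 4 positive, 6 negative)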
    25. Entropy Example
       • Choice of attributes:
         - City/Town, University, Housing Estate, Industrial Estate, Transport and Schools
       • City/Town is either Y or N
       • For Y: 7 positive examples, 3 negative
       • For N: 4 positive examples, 6 negative
    26. Entropy Example
       • City/Town as root node:
         - For c=2 (positive and negative) classification categories
         - Attribute a = City/Town with value v = Y
         - Probability of v=Y being in category positive: 7/10 = 0.7
         - Probability of v=Y being in category negative: 3/10 = 0.3
    27. Entropy Example
       • City/Town as root node:
         - For c=2 (positive and negative) classification categories
         - Attribute a = City/Town with value v = Y
         - Entropy E is: E(City/Town=Y) = -(0.7 log2 0.7 + 0.3 log2 0.3) ≈ 0.881
    28. Entropy Example
       • City/Town as root node:
         - For c=2 (positive and negative) classification categories
         - Attribute a = City/Town with value v = N
         - Probability of v=N being in category positive: 4/10 = 0.4
         - Probability of v=N being in category negative: 6/10 = 0.6
    29. Entropy Example
       • City/Town as root node:
         - For c=2 (positive and negative) classification categories
         - Attribute a = City/Town with value v = N
         - Entropy E is: E(City/Town=N) = -(0.4 log2 0.4 + 0.6 log2 0.6) ≈ 0.971
    30. Choosing Attributes
       • Information gain:
         - Expected reduction in entropy (high is good)
       • The entropy of the whole example set T is E(T)
       • The examples with a=v, where v is the j-th value of attribute a, form the subset T_j, with entropy E(a=v) = E(T_j)
       • The gain is:
         Gain(T, a) = E(T) - Σ_j (|T_j| / |T|) E(T_j)
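       A self-contained Python sketch of this calculation (again illustrative rather than the course's code), using the City/Town counts from slides 25 and 31:

           import math

           def entropy(counts):
               """E = -sum_i p_i log2 p_i for the given per-category counts."""
               total = sum(counts)
               return -sum((n / total) * math.log2(n / total) for n in counts if n)

           def information_gain(total_counts, counts_per_value):
               """Gain(T, a) = E(T) - sum_j (|T_j| / |T|) * E(T_j), where each entry
               of counts_per_value gives the class counts of one subset T_j."""
               n_total = sum(total_counts)
               remainder = sum((sum(c) / n_total) * entropy(c) for c in counts_per_value)
               return entropy(total_counts) - remainder

           # Whole set: 11 positive, 9 negative; City/Town splits it into
           # Y = (7 positive, 3 negative) and N = (4 positive, 6 negative)
           print(round(information_gain([11, 9], [[7, 3], [4, 6]]), 3))   # ~0.067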
    31. Information Gain Example
       • For the root of the tree there are 20 examples:
         - For c=2 (positive and negative) classification categories
         - Probability of being positive, with 11 examples: 11/20 = 0.55
         - Probability of being negative, with 9 examples: 9/20 = 0.45
    32. Information Gain Example
       • For the root of the tree there are 20 examples:
         - For c=2 (positive and negative) classification categories
         - Entropy of all training examples: E(T) = -(0.55 log2 0.55 + 0.45 log2 0.45) ≈ 0.993
    33. Information Gain Example
       • City/Town as root node:
         - 10 examples for a = City/Town with value v = Y
         - 10 examples for a = City/Town with value v = N
         - Gain(T, City/Town) = 0.993 - (10/20 × 0.881) - (10/20 × 0.971) ≈ 0.067
    34. Example 4
       • Calculate the information gain for the Transport attribute
    35. Information Gain Example
       [Worked calculation for the Transport attribute; its information gain, 0.266, is used on the next slide.]
    36. Choosing Attributes
       • Choose as the root node the attribute that gives the highest information gain
         - In this case the attribute Transport, with 0.266
       • The branches from the root node then become the values associated with the attribute
         - Recursive calculation of attributes/nodes
         - Filter the examples by attribute value
    37. Recursive Example
       • With Transport as the root node:
         - Select the examples where Transport is Average (1, 3, 6, 8, 11, 15, 17)
         - Use only these examples to construct this branch of the tree
         - Repeat for each of the other values (Poor, Good)
    38. Final Tree (Callan 2003:243)
       [Tree diagram. Transport is the root, with branches Average, Poor and Good. Good leads directly to a Positive leaf (examples {7,12,16,19,20}). Poor leads to an Industrial Estate test on examples {2,4,5,9,10,13,14,18}, splitting into Positive {5,9,14} and Negative {2,4,10,13,18}. Average leads to a Housing Estate test on examples {1,3,6,8,11,15,17}, whose branches (L, M, S, N) lead to an Industrial Estate test on {11,17} (Positive {11}, Negative {17}), a University test on {1,3,15} (Positive {1,3}, Negative {15}), and the leaves Negative {8} and Negative {6}.]
    39. ID3
       Procedure Extend(Tree d, Examples T)
         • Choose the best attribute a for the root of d
           - Calculate E(a=v) and Gain(T, a) for each attribute
           - The attribute with the highest Gain(T, a) is selected as best
         • Assign the best attribute a to the root of d
         • For each value v of attribute a
           - Create a branch for a=v, giving sub-tree d_j
           - Assign to T_j the training examples from T where a=v
           - Recurse on the sub-tree with Extend(d_j, T_j)
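       A minimal runnable Python sketch of this procedure (my own illustration of ID3, not the course's code; it handles only the basic stopping cases and ignores missing values and pruning):

           import math
           from collections import Counter

           def entropy(examples):
               """E = -sum p_i log2 p_i over the class labels in the examples."""
               total = len(examples)
               return -sum((n / total) * math.log2(n / total)
                           for n in Counter(e["Class"] for e in examples).values())

           def gain(examples, attr):
               """Expected reduction in entropy from splitting on attr."""
               total = len(examples)
               remainder = 0.0
               for value in {e[attr] for e in examples}:
                   subset = [e for e in examples if e[attr] == value]
                   remainder += (len(subset) / total) * entropy(subset)
               return entropy(examples) - remainder

           def id3(examples, attributes):
               """Return a leaf label, or an (attribute, {value: subtree}) node."""
               classes = {e["Class"] for e in examples}
               if len(classes) == 1:                    # all examples agree: leaf
                   return classes.pop()
               if not attributes:                       # nothing left to split on: majority vote
                   return Counter(e["Class"] for e in examples).most_common(1)[0][0]
               best = max(attributes, key=lambda a: gain(examples, a))
               branches = {}
               for value in {e[best] for e in examples}:
                   subset = [e for e in examples if e[best] == value]
                   branches[value] = id3(subset, [a for a in attributes if a != best])
               return (best, branches)

           # The four examples from Example 2 (slide 19)
           data = [
               {"X": False, "Y": False, "Class": "+"},
               {"X": True,  "Y": False, "Class": "+"},
               {"X": False, "Y": True,  "Class": "-"},
               {"X": True,  "Y": True,  "Class": "-"},
           ]
           print(id3(data, ["X", "Y"]))   # e.g. ('Y', {False: '+', True: '-'})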
    40. Issues
       • Use prior knowledge where available
       • Not all the examples may be needed to construct a tree
         - Test the generalisation of the tree during training and stop when the desired performance is achieved
         - Prune the tree once constructed
       • Examples may be noisy
       • Examples may contain irrelevant attributes
    41. Extracting Rules
       • We can extract rules from decision trees
         - Create one rule for each root-to-leaf path
         - Simplify by combining rules
       • Other techniques are not so transparent:
         - Neural networks are often described as 'black boxes': it is difficult to understand what the network is doing
         - Extraction of rules from trees can help us to understand the decision process
    42. Rules Example
       [The final tree from slide 38 again (Callan 2003:243), used to read off one rule per root-to-leaf path.]
    43. Rules Example
       • IF Transport is Average AND Housing Estate is Large AND Industrial Estate is Yes THEN Positive
       • …
       • IF Transport is Good THEN Positive
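       Reading off the rules can be automated by walking every root-to-leaf path. A sketch (not from the slides) over trees in the (attribute, {value: subtree}) form used above; the tree fragment below is hand-coded for illustration, and the Yes/No orientation of the Industrial Estate branch is an assumption:

           def extract_rules(tree, conditions=()):
               """Return one rule string per root-to-leaf path of the tree."""
               if not isinstance(tree, tuple):          # leaf: emit the accumulated rule
                   tests = " AND ".join(f"{a} is {v}" for a, v in conditions)
                   return [f"IF {tests} THEN {tree}"]
               attr, branches = tree
               rules = []
               for value, subtree in branches.items():
                   rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
               return rules

           # Fragment of the final tree: the Good and Poor branches only
           tree = ("Transport", {
               "Good": "Positive",
               "Poor": ("Industrial Estate", {"Yes": "Positive", "No": "Negative"}),
           })
           for rule in extract_rules(tree):
               print(rule)
           # IF Transport is Good THEN Positive
           # IF Transport is Poor AND Industrial Estate is Yes THEN Positive
           # IF Transport is Poor AND Industrial Estate is No THEN Negative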
    44. Summary
       • What are the benefits/drawbacks of machine learning?
         - Are the techniques simple?
         - Are they simple to implement?
         - Are they computationally cheap?
         - Do they learn from experience?
         - Do they generalise well?
         - Can we understand how knowledge is represented?
         - Do they provide perfect solutions?
    45. Source Texts
       • Negnevitsky, M. (2005). Artificial Intelligence: A Guide to Intelligent Systems. 2nd Edition. Essex, UK: Pearson Education Limited.
         - Chapter 6, pp. 165-168; chapter 9, pp. 349-360.
       • Callan, R. (2003). Artificial Intelligence. Basingstoke, UK: Palgrave Macmillan.
         - Part 5, chapters 11-17, pp. 225-346.
       • Luger, G.F. (2002). Artificial Intelligence: Structures & Strategies for Complex Problem Solving. 4th Edition. London, UK: Addison-Wesley.
         - Part IV, chapters 9-11, pp. 349-506.
    46. Journals
       • Artificial Intelligence
         - http://www.elsevier.com/locate/issn/00043702
         - http://www.sciencedirect.com/science/journal/00043702
    47. Articles
       • Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, vol. 1, pp. 81-106.
       • Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers.
    48. Websites
       • UCI Machine Learning Repository
         - Example data sets for benchmarking
         - http://www.ics.uci.edu/~mlearn/MLRepository.html
       • Genetic Algorithms Archive
         - Programs, data, bibliographies, etc.
         - http://www.aic.nrl.navy.mil/galist/
       • Wonders of Math: Game of Life
         - Game of Life applet and details
         - http://www.math.com/students/wonders/life/life.html
