This is a heavily data-oriented



  1. 1. Machine Learning: Making Computer Science Scientific. Thomas G. Dietterich, Department of Computer Science, Oregon State University, Corvallis, Oregon 97331
  2. 2. Acknowledgements <ul><li>VLSI Wafer Testing </li></ul><ul><ul><li>Tony Fountain </li></ul></ul><ul><li>Robot Navigation </li></ul><ul><ul><li>Didac Busquets </li></ul></ul><ul><ul><li>Carles Sierra </li></ul></ul><ul><ul><li>Ramon Lopez de Mantaras </li></ul></ul><ul><li>NSF grants IIS-0083292 and ITR-085836 </li></ul>
  3. 3. Outline <ul><li>Three scenarios where standard software engineering methods fail </li></ul><ul><li>Machine learning methods applied to these scenarios </li></ul><ul><li>Fundamental questions in machine learning </li></ul><ul><li>Statistical thinking in computer science </li></ul>
  4. 4. Scenario 1: Reading Checks. Find and read the “courtesy amount” on checks
  5. 5. Possible Methods: <ul><li>Method 1: Interview humans to find out what steps they follow in reading checks </li></ul><ul><li>Method 2: Collect examples of checks and the correct amounts. Train a machine learning system to recognize the amounts </li></ul>
  6. 6. Scenario 2: VLSI Wafer Testing <ul><li>Wafer test: Functional test of each die (chip) while on the wafer </li></ul>
  7. 7. Which Chips (and how many) should be tested? <ul><li>Tradeoff: </li></ul><ul><ul><li>Test all chips on wafer? </li></ul></ul><ul><ul><ul><li>Avoid cost of packaging bad chips </li></ul></ul></ul><ul><ul><ul><li>Incur cost of testing all chips </li></ul></ul></ul><ul><ul><li>Test none of the chips on the wafer? </li></ul></ul><ul><ul><ul><li>May package some bad chips </li></ul></ul></ul><ul><ul><ul><li>No cost of testing on wafer </li></ul></ul></ul>
  8. 8. Possible Methods <ul><li>Method 1: Guess the right tradeoff point </li></ul><ul><li>Method 2: Learn a probabilistic model that captures the probability that each chip will be bad </li></ul><ul><ul><li>Plug this model into a Bayesian decision making procedure to optimize expected profit </li></ul></ul>
  9. 9. Scenario 3: Allocating mobile robot camera <ul><li>Binocular </li></ul><ul><li>No GPS </li></ul>
  10. 10. Camera tradeoff <ul><li>Mobile robot uses camera both for obstacle avoidance and landmark-based navigation </li></ul><ul><li>Tradeoff: </li></ul><ul><ul><li>If camera is used only for navigation, robot collides with objects </li></ul></ul><ul><ul><li>If camera is used only for obstacle avoidance, robot gets lost </li></ul></ul>
  11. 11. Possible Methods <ul><li>Method 1: Manually write a program to allocate the camera </li></ul><ul><li>Method 2: Experimentally learn a policy for switching between obstacle avoidance and landmark tracking </li></ul>
  12. 12. Software Engineering Methodology <ul><li>Analyze </li></ul><ul><ul><li>Interview experts, users, etc. to determine the actions the system must perform </li></ul></ul><ul><li>Design </li></ul><ul><ul><li>Apply CS knowledge to design a solution </li></ul></ul><ul><li>Implement </li></ul><ul><li>Test </li></ul>
  13. 13. Challenges for SE Methodology <ul><li>Standard SE methods fail when… </li></ul><ul><ul><li>System requirements are hard to collect </li></ul></ul><ul><ul><li>The system must resolve difficult tradeoffs </li></ul></ul>
  14. 14. (1) System requirements are hard to collect <ul><li>There are no human experts </li></ul><ul><ul><li>Cellular telephone fraud </li></ul></ul><ul><li>Human experts are inarticulate </li></ul><ul><ul><li>Handwriting recognition </li></ul></ul><ul><li>The requirements are changing rapidly </li></ul><ul><ul><li>Computer intrusion detection </li></ul></ul><ul><li>Each user has different requirements </li></ul><ul><ul><li>E-mail filtering </li></ul></ul>
  15. 15. (2) The system must resolve difficult tradeoffs <ul><li>VLSI Wafer testing </li></ul><ul><ul><li>Tradeoff point depends on probability of bad chips, relative costs of testing versus packaging </li></ul></ul><ul><li>Camera Allocation for Mobile Robot </li></ul><ul><ul><li>Tradeoff depends on probability of obstacles, number and quality of landmarks </li></ul></ul>
  16. 16. Machine Learning: Replacing guesswork with data <ul><li>In all of these cases, the standard SE methodology requires engineers to make guesses </li></ul><ul><ul><li>Guessing how to do character recognition </li></ul></ul><ul><ul><li>Guessing the tradeoff point for wafer test </li></ul></ul><ul><ul><li>Guessing the tradeoff for camera allocation </li></ul></ul><ul><li>Machine Learning provides a way of making these decisions based on data </li></ul>
  17. 17. Outline <ul><li>Three scenarios where software engineering methods fail </li></ul><ul><li>Machine learning methods applied to these scenarios </li></ul><ul><li>Fundamental questions in machine learning </li></ul><ul><li>Statistical thinking in computer science </li></ul>
  18. 18. Basic Machine Learning Methods <ul><li>Supervised Learning </li></ul><ul><li>Density Estimation </li></ul><ul><li>Reinforcement Learning </li></ul>
  19. 19. Supervised Learning [Diagram: Training Examples → Learning Algorithm → Classifier; New Examples → Classifier → predicted labels (digit examples shown: 8, 8, 3, 6, 0, 1)]
  20. 20. AT&T/NCR Check Reading System. The recognition transformer is a neural network trained on 500,000 examples of characters. The entire system is trained with entire checks as input and dollar amounts as output. Source: LeCun, Bottou, Bengio & Haffner (1998), Gradient-Based Learning Applied to Document Recognition
  21. 21. Check Reader Performance <ul><li>82% of machine-printed checks correctly recognized </li></ul><ul><li>1% of checks incorrectly recognized </li></ul><ul><li>17% “rejected” – check is presented to a person for manual reading </li></ul><ul><li>Fielded by NCR in June 1996; reads millions of checks per month </li></ul>
  22. 22. Supervised Learning Summary <ul><li>Desired classifier is a function y = f(x) </li></ul><ul><li>Training examples are desired input-output pairs (x_i, y_i) </li></ul>
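A minimal sketch of the supervised-learning setup summarized above: a learning algorithm takes (x, y) training pairs and returns a classifier f. The 1-nearest-neighbor learner and the toy 2-D inputs are assumptions chosen for brevity; they are not the neural network used in the check reader.

```python
import math


def train_nearest_neighbor(examples):
    """Return a classifier f built from (x, y) training pairs.

    Illustrative 1-nearest-neighbor learner (an assumption, not the
    method from the slides).
    """
    stored = list(examples)

    def f(x):
        # Predict the label of the closest stored training input.
        _, nearest_y = min(stored, key=lambda pair: math.dist(pair[0], x))
        return nearest_y

    return f


# Hypothetical training pairs: 2-D feature vectors with digit labels.
training_examples = [((0.0, 0.0), "0"), ((1.0, 1.0), "8"), ((0.0, 1.0), "3")]
f = train_nearest_neighbor(training_examples)
prediction = f((0.9, 0.8))  # a new example, closest to the "8" training input
```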
  23. 23. Density Estimation [Diagram: Training Examples → Learning Algorithm → Density Estimator; Partially-tested wafer → Density Estimator → P(chip i is bad) = 0.42]
  24. 24. On-Wafer Testing System <ul><li>Trained density estimator on 600 wafers from a mature product (HP; Corvallis, OR) </li></ul><ul><ul><li>Probability model is a “naïve Bayes” mixture model with four components (trained with EM) </li></ul></ul>[Diagram: node W linked to chip nodes C1, C2, C3, …, C209]
  25. 25. One-Step Value of Information <ul><li>Choose the larger of </li></ul><ul><ul><li>Expected profit if we predict remaining chips, package, and re-test </li></ul></ul><ul><ul><li>Expected profit if we test chip Ci, then predict remaining chips, package, and re-test [for all Ci not yet tested] </li></ul></ul>
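The one-step lookahead rule above can be sketched as follows. This is a simplified illustration, not the fielded system: it assumes chips fail independently (so testing one chip says nothing about the others, unlike the mixture model in the slides), it omits the re-test after packaging, and the names `revenue`, `package_cost`, and `test_cost` are hypothetical parameters.

```python
def profit_without_test(p_bad, revenue, package_cost):
    # Package only when the expected return covers the packaging cost.
    return max(0.0, (1.0 - p_bad) * revenue - package_cost)


def profit_with_test(p_bad, revenue, package_cost, test_cost):
    # Pay for the test, then package the chip only if it passed.
    return (1.0 - p_bad) * (revenue - package_cost) - test_cost


def choose_action(p_bad_by_chip, revenue, package_cost, test_cost):
    """Return ("stop", None) or ("test", chip) by one-step lookahead."""
    best_action, best_gain = ("stop", None), 0.0
    for chip, p_bad in p_bad_by_chip.items():
        # Expected improvement from testing this chip before deciding.
        gain = (profit_with_test(p_bad, revenue, package_cost, test_cost)
                - profit_without_test(p_bad, revenue, package_cost))
        if gain > best_gain:
            best_action, best_gain = ("test", chip), gain
    return best_action


# Hypothetical failure probabilities for three untested chips.
p_bad_by_chip = {"C1": 0.05, "C2": 0.42, "C3": 0.95}
action = choose_action(p_bad_by_chip, revenue=10.0, package_cost=3.0, test_cost=0.5)
```

In this toy setting only the uncertain chip is worth testing: chips that are almost surely good or almost surely bad are cheaper to decide without a test.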
  26. 26. On-Wafer Chip Test Results 3.8% increase in profit
  27. 27. Density Estimation Summary <ul><li>Desired output is a joint probability distribution P(C_1, C_2, …, C_203) </li></ul><ul><li>Training examples are points X = (C_1, C_2, …, C_203) sampled from this distribution </li></ul>
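A rough sketch of how a naïve Bayes mixture model turns a partially tested wafer into P(chip i is bad). The parameters are assumed to be already trained (the EM step is omitted), and the two-component, three-chip numbers are made up for illustration; the slides describe four components.

```python
def posterior_bad(weights, p_bad, observed, chip):
    """P(chip is bad | observed test results) under a naive Bayes mixture.

    weights[k]  : prior probability of mixture component k
    p_bad[k][i] : P(chip i is bad | component k)
    observed    : dict mapping chip index -> True if that chip tested bad
    """
    # Unnormalized posterior over components given the tested chips.
    joint = []
    for w, probs in zip(weights, p_bad):
        likelihood = w
        for i, is_bad in observed.items():
            likelihood *= probs[i] if is_bad else 1.0 - probs[i]
        joint.append(likelihood)
    total = sum(joint)
    # Mix the per-component failure probabilities of the untested chip.
    return sum(j / total * probs[chip] for j, probs in zip(joint, p_bad))


# Hypothetical model: a "good wafer" component and a "bad wafer" component.
weights = [0.5, 0.5]
p_bad = [[0.1, 0.1, 0.1], [0.8, 0.8, 0.8]]

before = posterior_bad(weights, p_bad, observed={}, chip=1)
after = posterior_bad(weights, p_bad, observed={0: True}, chip=1)
```

Observing one bad chip shifts weight toward the "bad wafer" component, so the estimated failure probability of a neighboring untested chip rises.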
  28. 28. Reinforcement Learning [Diagram: the agent sends action a to the environment, which returns state s and reward r] Agent’s goal: choose actions to maximize total reward. The action selection rule is called a “policy”: a = π(s)
  29. 29. Reinforcement Learning Methods <ul><li>Direct </li></ul><ul><ul><li>Start with initial policy π </li></ul></ul><ul><ul><li>Experiment with environment to decide how to improve π </li></ul></ul><ul><ul><li>Repeat </li></ul></ul><ul><li>Model Based </li></ul><ul><ul><li>Experiment with environment to learn how it behaves (dynamics + rewards) </li></ul></ul><ul><ul><li>Compute optimal policy π </li></ul></ul>
  30. 30. Reinforcement Learning for Robot Navigation <ul><li>Learning from rewards and punishments in the environment </li></ul><ul><ul><li>Give reward for reaching goal </li></ul></ul><ul><ul><li>Give punishment for getting lost </li></ul></ul><ul><ul><li>Give punishment for collisions </li></ul></ul>
  31. 31. Experimental Results [Chart: % of trials in which the robot reaches the goal] Source: Busquets, Lopez de Mantaras, Sierra, Dietterich (2002)
  32. 32. Reinforcement Learning Summary <ul><li>Desired output is an action selection policy π </li></ul><ul><li>Training examples are (s, a, r, s’) tuples collected by the agent interacting with the environment </li></ul>
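As an illustration of learning a policy from (s, a, r, s’) tuples, here is a small tabular Q-learning sketch. Q-learning is used here as one well-known model-free method; the slides do not say which algorithm the robot used. The state names, action names, and rewards are hypothetical stand-ins for the camera-allocation problem.

```python
from collections import defaultdict


def q_learning(transitions, actions, alpha=0.5, gamma=0.9, sweeps=50):
    """Estimate action values Q(s, a) from (s, a, r, s_next) tuples."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next in transitions:
            # s_next is None when the episode ended after this step.
            future = 0.0 if s_next is None else max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * future - q[(s, a)])
    return q


def greedy_policy(q, actions):
    """Return the policy pi(s) = argmax over a of Q(s, a)."""
    return lambda s: max(actions, key=lambda a: q[(s, a)])


actions = ["avoid", "track"]
# Hypothetical experience: tracking landmarks near an obstacle causes a
# collision (punishment); tracking in the clear reaches the goal (reward).
transitions = [
    ("near_obstacle", "track", -1.0, None),
    ("near_obstacle", "avoid", 0.0, "clear"),
    ("clear", "track", 1.0, None),
    ("clear", "avoid", 0.0, "clear"),
]
q = q_learning(transitions, actions)
pi = greedy_policy(q, actions)
```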
  33. 33. Outline <ul><li>Three scenarios where software engineering methods fail </li></ul><ul><li>Machine learning methods applied to these scenarios </li></ul><ul><li>Fundamental questions in machine learning </li></ul><ul><li>Statistical thinking in computer science </li></ul>
  34. 34. Fundamental Issues in Machine Learning <ul><li>Incorporating Prior Knowledge </li></ul><ul><li>Incorporating Learned Structures into Larger Systems </li></ul><ul><li>Making Reinforcement Learning Practical </li></ul><ul><li>Triple Tradeoff: accuracy, sample size, hypothesis complexity </li></ul>
  35. 35. Incorporating Prior Knowledge <ul><li>How can we incorporate our prior knowledge into the learning algorithm? </li></ul><ul><ul><li>Difficult for decision trees, neural networks, support-vector machines, etc. </li></ul></ul><ul><ul><ul><li>Mismatch between form of our knowledge and the way the algorithms work </li></ul></ul></ul><ul><ul><li>Easier for Bayesian networks </li></ul></ul><ul><ul><ul><li>Express knowledge as constraints on the network </li></ul></ul></ul>
  36. 36. Incorporating Learned Structures into Larger Systems <ul><li>Success story: Digit recognizer incorporated into check reader </li></ul><ul><li>Challenges: </li></ul><ul><ul><li>Larger system may make several coordinated decisions, but learning system treated each decision as independent </li></ul></ul><ul><ul><li>Larger system may have complex cost function: Errors in thousands place versus the cents place: $7,236.07 </li></ul></ul>
  37. 37. Making Reinforcement Learning Practical <ul><li>Current reinforcement learning methods do not scale well to large problems </li></ul><ul><li>Need robust reinforcement learning methodologies </li></ul>
  38. 38. The Triple Tradeoff <ul><li>Fundamental relationship between </li></ul><ul><ul><li>amount of training data </li></ul></ul><ul><ul><li>size and complexity of hypothesis space </li></ul></ul><ul><ul><li>accuracy of the learned hypothesis </li></ul></ul><ul><li>Explains many phenomena observed in machine learning systems </li></ul>
  39. 39. Learning Algorithms <ul><li>Set of data points </li></ul><ul><li>Class H of hypotheses </li></ul><ul><li>Optimization problem: Find the hypothesis h in H that best fits the data </li></ul>[Diagram: Training Data selects a hypothesis h from the Hypothesis Space]
  40. 40. Triple Tradeoff <ul><li>Amount of Data – Hypothesis Complexity – Accuracy </li></ul>[Plot: Accuracy versus Hypothesis Space Complexity, one curve each for N = 10, N = 100, N = 1000]
  41. 41. Triple Tradeoff (2) [Plot: Accuracy versus Number of training examples N, with curves H_1, H_2, H_3 ordered by Hypothesis Complexity]
  42. 42. Intuition <ul><li>With only a small amount of data, we can only discriminate between a small number of different hypotheses </li></ul><ul><li>As we get more data, we have more evidence, so we can consider more alternative hypotheses </li></ul><ul><li>Complex hypotheses give better fit to the data </li></ul>
  43. 43. Fixed versus Variable-Sized Hypothesis Spaces <ul><li>Fixed size </li></ul><ul><ul><li>Ordinary linear regression </li></ul></ul><ul><ul><li>Bayes net with fixed structure </li></ul></ul><ul><ul><li>Neural networks </li></ul></ul><ul><li>Variable size </li></ul><ul><ul><li>Decision trees </li></ul></ul><ul><ul><li>Bayes nets with variable structure </li></ul></ul><ul><ul><li>Support vector machines </li></ul></ul>
  44. 44. Corollary 1: Fixed H will underfit [Plot: Accuracy versus Number of training examples N for H_1 and H_2, with the underfit region marked]
  45. 45. Corollary 2: Variable-sized H will overfit [Plot: Accuracy versus Hypothesis Space Complexity for N = 100, with the overfit region marked]
  46. 46. Ideal Learning Algorithm: Adapt complexity to data [Plot: Accuracy versus Hypothesis Space Complexity for N = 10, N = 100, N = 1000]
  47. 47. Adapting Hypothesis Complexity to Data Complexity <ul><li>Find hypothesis h to minimize </li></ul><ul><ul><li>error(h) + λ · complexity(h) </li></ul></ul><ul><li>Many methods for adjusting λ </li></ul><ul><ul><li>Cross-validation </li></ul></ul><ul><ul><li>MDL </li></ul></ul>
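The penalized objective above, training error plus a weighted complexity term, can be sketched with a toy hypothesis class. Everything here is an assumption made for illustration: the hypotheses are piecewise-constant functions on [0, 1) with k equal-width bins, complexity(h) is taken to be k, and the weight `lam` is fixed by hand rather than tuned by cross-validation or MDL.

```python
def fit_piecewise_constant(points, k):
    """Fit a k-bin piecewise-constant hypothesis; return (h, training_error)."""
    bins = [[] for _ in range(k)]
    for x, y in points:
        bins[min(int(x * k), k - 1)].append(y)
    # Each bin predicts the mean of its training targets (0.0 if empty).
    means = [sum(b) / len(b) if b else 0.0 for b in bins]

    def h(x):
        return means[min(int(x * k), k - 1)]

    error = sum((y - h(x)) ** 2 for x, y in points) / len(points)
    return h, error


def select_complexity(points, candidate_ks, lam):
    """Pick k minimizing error(h) + lam * complexity(h), with complexity(h) = k."""
    return min(candidate_ks,
               key=lambda k: fit_piecewise_constant(points, k)[1] + lam * k)


# Toy data: a step function at x = 0.5 plus small alternating noise.
points = [((i + 0.5) / 20, (1.0 if i >= 10 else 0.0) + (0.1 if i % 2 == 0 else -0.1))
          for i in range(20)]
candidate_ks = [1, 2, 4, 5, 10, 20]

unpenalized = select_complexity(points, candidate_ks, lam=0.0)
penalized = select_complexity(points, candidate_ks, lam=0.01)
```

With no penalty the search picks the most complex hypothesis, which memorizes the noise; with a small penalty it picks the two-bin hypothesis that matches the underlying step.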
  48. 48. Corollary 3: It is optimal to be suboptimal <ul><li>Finding the smallest decision tree (or the smallest neural network) that fits N data points is NP-Hard </li></ul><ul><li>Heuristic greedy algorithms work well </li></ul><ul><li>Smarter algorithms do NOT work as well! </li></ul>
  49. 49. What’s going on? <ul><li>Heuristic algorithms do not consider all possible trees or neural networks </li></ul><ul><ul><li>They effectively consider a smaller H </li></ul></ul><ul><ul><li>They are less likely to overfit the data </li></ul></ul><ul><li>Conclusion: It is optimal (for accuracy) to be suboptimal (for fitting the data) </li></ul>
  50. 50. Outline <ul><li>Three scenarios where software engineering methods fail </li></ul><ul><li>Machine learning methods applied to these scenarios </li></ul><ul><li>Fundamental questions in machine learning </li></ul><ul><li>Statistical thinking in computer science </li></ul>
  51. 51. The Data Explosion <ul><li>NASA Data </li></ul><ul><ul><li>284 Terabytes (as of August, 1999) </li></ul></ul><ul><ul><li>Earth Observing System: 194 G/day </li></ul></ul><ul><ul><li>Landsat 7: 150 G/day </li></ul></ul><ul><ul><li>Hubble Space Telescope: 0.6 G/day </li></ul></ul>
  52. 52. The Data Explosion (2) <ul><li>Google indexes 2,073,418,204 web pages </li></ul><ul><li>US Year 2000 Census: 62 Terabytes of scanned images </li></ul><ul><li>Walmart Data Warehouse: 7 (500?) Terabytes </li></ul><ul><li>Missouri Botanical Garden TROPICOS plant image database: 700 Gbytes </li></ul>
  53. 53. The Data Explosion (3)
  54. 54. Old Computer Science Conception of Data [Diagram: Store → Retrieve]
  55. 55. New Computer Science Conception of Data [Diagram: Store → Build Models → Solve Problems; Problems go in, Solutions come out]
  56. 56. Machine Learning: Making Data Active <ul><li>Methods for building models from data </li></ul><ul><li>Methods for collecting and/or sampling data </li></ul><ul><li>Methods for evaluating and validating learned models </li></ul><ul><li>Methods for reasoning and decision-making with learned models </li></ul><ul><li>Theoretical analyses </li></ul>
  57. 57. Machine Learning and Computer Science <ul><li>Natural language processing </li></ul><ul><li>Databases and data mining </li></ul><ul><li>Computer architecture </li></ul><ul><li>Compilers </li></ul><ul><li>Computer graphics </li></ul>
  58. 58. Hardware Branch Prediction. Source: Jiménez & Lin (2000), Perceptron Learning for Predicting the Behavior of Conditional Branches
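To make the idea of a perceptron branch predictor concrete, here is a simplified software sketch: the input is the recent history of branch outcomes encoded as +1 (taken) or -1 (not taken), the prediction is the sign of a weighted sum, and weights are adjusted on a misprediction or a low-confidence output. It models a single perceptron with assumed `history_length` and `threshold` values; a hardware design would keep a table of perceptrons indexed by branch address, which is omitted here.

```python
class PerceptronBranchPredictor:
    """Single-perceptron predictor over a global outcome history (a sketch)."""

    def __init__(self, history_length=4, threshold=8):
        self.weights = [0] * (history_length + 1)  # weights[0] is the bias
        self.history = [-1] * history_length       # +1 = taken, -1 = not taken
        self.threshold = threshold

    def _output(self):
        return self.weights[0] + sum(
            w * h for w, h in zip(self.weights[1:], self.history))

    def predict(self):
        # Predict "taken" when the weighted sum is non-negative.
        return self._output() >= 0

    def update(self, taken):
        y = self._output()
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            self.weights[0] += t
            for i, h in enumerate(self.history):
                self.weights[i + 1] += t * h
        # Shift the newest outcome into the history.
        self.history = self.history[1:] + [t]


predictor = PerceptronBranchPredictor()
outcomes = [i % 2 == 0 for i in range(100)]  # taken, not taken, taken, ...
hits = []
for taken in outcomes:
    hits.append(predictor.predict() == taken)
    predictor.update(taken)
```

On this alternating pattern the predictor learns a negative weight on the most recent outcome and soon predicts every branch correctly.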
  59. 59. Instruction Scheduler for New CPU <ul><li>The performance of modern microprocessors depends on the order in which instructions are executed </li></ul><ul><li>Modern compilers rearrange instruction order to optimize performance (“instruction scheduling”) </li></ul><ul><li>Each new CPU design requires modifying the instruction scheduler </li></ul>
  60. 60. Instruction Scheduling <ul><li>Moss, et al. (1997): A machine learning scheduler can beat the performance of commercial compilers and match the performance of a research compiler. </li></ul><ul><li>Training examples: small basic blocks </li></ul><ul><ul><li>Experimentally determine optimal instruction order </li></ul></ul><ul><ul><li>Learn preference function </li></ul></ul>
  61. 61. Computer Graphics: Video Textures <ul><li>Generate new video by splicing together short stretches of old video </li></ul>[Diagram: source frames A B C D E F spliced into the new sequence B D E D E F A] Apply reinforcement learning to identify good transition points. Source: Arno Schödl, Richard Szeliski, David H. Salesin, Irfan Essa (SIGGRAPH 2000)
  62. 62. Video Textures. Arno Schödl, Richard Szeliski, David H. Salesin, Irfan Essa (SIGGRAPH 2000). You can find this video at Virtual Fish Tank Movie
  63. 63. Graphics: Image Analogies [Figure: A : A’ :: B : ?] Hertzmann, Jacobs, Oliver, Curless, Salesin (2000) SIGGRAPH
  64. 64. Learning to Predict Textures. Find p to minimize the Euclidean distance between A(p) and B(q), then set B’(q) := A’(p)
  65. 65. Image Analogies [Figure: A : A’ :: B : B’]
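The matching step on this slide can be sketched as a nearest-neighbor lookup: for each location q in B, find the location p in A whose features are closest in Euclidean distance, then copy A’(p) into B’(q). This is only an illustration of that one step. The one-dimensional feature tuples and the string "pixel values" are hypothetical, and anything else the published method does beyond this step is not modeled.

```python
import math


def synthesize(a, a_prime, b):
    """For each B(q), copy A'(p) from the p whose A(p) is nearest to B(q)."""
    b_prime = []
    for b_q in b:
        # Index p of the source feature vector closest to this target vector.
        p = min(range(len(a)), key=lambda i: math.dist(a[i], b_q))
        b_prime.append(a_prime[p])
    return b_prime


# Hypothetical features (brightness) for A and B, and filtered outputs A'.
a = [(0.0,), (0.5,), (1.0,)]
a_prime = ["dark", "mid", "light"]
b = [(0.9,), (0.1,), (0.45,)]

b_prime = synthesize(a, a_prime, b)
```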
  66. 66. A video can be found at Image Analogies Movie
  67. 67. Summary <ul><li>Standard Software Engineering methods fail in many application problems </li></ul><ul><li>Machine Learning methods can replace guesswork with data to make good design decisions </li></ul>
  68. 68. Machine Learning and Computer Science <ul><li>Machine Learning is already at the heart of speech recognition and handwriting recognition </li></ul><ul><li>Statistical methods are transforming natural language processing (understanding, translation, retrieval) </li></ul><ul><li>Statistical methods are creating opportunities in databases, computer graphics, robotics, computer vision, networking, and computer security </li></ul>
  69. 69. Computer Power and Data Power <ul><li>Data is a new source of power for computer science </li></ul><ul><li>Every computer science student should learn the fundamentals of machine learning and statistical thinking </li></ul><ul><li>By combining engineered frameworks with models learned from data, we can develop the high-performance systems of the future </li></ul>