Friday, 12 January 2007 William H. Hsu Department of Computing and Information Sciences, KSU http://www.kddresearch.org http://www.cis.ksu.edu/~bhsu Readings: Chapter 1, Mitchell A Brief Survey of Machine Learning CIS 732/830: Lecture 0
Lecture Outline Course Information: Format, Exams, Homework, Programming, Grading Class Resources: Course Notes, Web/News/Mail, Reserve Texts Overview Topics covered Applications Why machine learning? Brief Tour of Machine Learning A case study A taxonomy of learning Intelligent systems engineering:  specification of learning problems Issues in Machine Learning Design choices The performance element: intelligent systems Some Applications of Learning Database mining, reasoning (inference/decision support), acting Industrial usage of intelligent systems
Course Information and Administrivia Instructor: William H. Hsu E-mail:  [email_address]   Office phone: (785) 532-6350 x29; home phone: (785) 539-7180; ICQ 28651394 Office hours: Wed/Thu/Fri (213 Nichols); by appointment, Tue Grading Assignments (8): 32%, paper reviews (13): 10%, midterm: 20%, final: 25% Class participation: 5% Lowest homework score and lowest 3 paper summary scores dropped Homework Eight assignments: written (5) and programming (3) – drop lowest Late policy: due on Fridays Cheating: don’t do it; see class web page for policy Project Option CIS 690/890 project option for graduate students Term paper or semester research project; sign up by 09 Feb 2007
Class Resources Web Page (Required) http://www.kddresearch.org/Courses/Spring-2007/CIS732 Lecture notes (MS PowerPoint XP, PDF) Homeworks (MS PowerPoint XP, PDF) Exam and homework solutions (MS PowerPoint XP, PDF) Class announcements (students' responsibility to follow) and grade postings Course Notes at Copy Center (Required) Course Web Group Mirror of announcements from class web page Discussions (instructor and other students) Dated research announcements (seminars, conferences, calls for papers) Mailing List (Automatic) [email_address] Sign-up sheet Reminders, related research announcements
Course Overview Learning Algorithms and Models Models: decision trees, linear threshold units (winnow, weighted majority), neural networks, Bayesian networks (polytrees, belief networks, influence diagrams, HMMs), genetic algorithms, instance-based (nearest-neighbor) Algorithms (e.g., for decision trees): ID3, C4.5, CART, OC1 Methodologies: supervised, unsupervised, reinforcement; knowledge-guided Theory of Learning Computational learning theory (COLT): complexity, limitations of learning Probably Approximately Correct (PAC) learning Probabilistic, statistical, information theoretic results Multistrategy Learning: Combining Techniques, Knowledge Sources Data: Time Series, Very Large Databases (VLDB), Text Corpora Applications Performance element: classification, decision support, planning, control Database mining and knowledge discovery in databases (KDD) Computer inference: learning to reason
Why Machine Learning? New Computational Capability Database mining: converting (technical) records into knowledge Self-customizing programs: learning news filters, adaptive monitors Learning to act: robot planning, control optimization, decision support Applications that are hard to program: automated driving, speech recognition Better Understanding of Human Learning and Teaching Cognitive science: theories of knowledge acquisition (e.g., through practice) Performance elements: reasoning (inference) and  recommender  systems Time is Right Recent progress in algorithms and theory Rapidly growing volume of online data from various sources Available computational power Growth and interest of learning-based industries (e.g., data mining/KDD)
Rule and Decision Tree Learning Example: Rule Acquisition from Historical Data Data Patient 103 (time = 1): Age 23, First-Pregnancy: no, Anemia: no, Diabetes: no, Previous-Premature-Birth: no, Ultrasound:  unknown , Elective C-Section:  unknown , Emergency-C-Section:  unknown Patient 103 (time = 2): Age 23, First-Pregnancy: no, Anemia: no, Diabetes:  yes , Previous-Premature-Birth: no, Ultrasound:  abnormal , Elective C-Section: no, Emergency-C-Section:  unknown Patient 103 (time = n): Age 23, First-Pregnancy: no, Anemia: no, Diabetes: no, Previous-Premature-Birth: no, Ultrasound:  unknown , Elective C-Section: no, Emergency-C-Section:  YES Learned Rule IF no previous vaginal delivery , AND  abnormal 2nd trimester ultrasound ,   AND  malpresentation at admission , AND  no elective C-Section THEN probability of emergency C-Section is 0.6 Training set: 26/41 = 0.634 Test set: 12/20 = 0.600
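The learned rule above can be read as a simple predicate over patient attributes. A minimal sketch in Python, where the field names and dictionary schema are illustrative assumptions (only the four preconditions and the 0.6 probability come from the slide):

```python
def emergency_c_section_risk(patient):
    """Return the rule's predicted probability of an emergency C-section,
    or None when the rule's preconditions do not all hold."""
    if (not patient["previous_vaginal_delivery"]              # no previous vaginal delivery
            and patient["abnormal_2nd_trimester_ultrasound"]  # abnormal 2nd trimester ultrasound
            and patient["malpresentation_at_admission"]       # malpresentation at admission
            and not patient["elective_c_section"]):           # no elective C-section
        return 0.6
    return None

patient = {
    "previous_vaginal_delivery": False,
    "abnormal_2nd_trimester_ultrasound": True,
    "malpresentation_at_admission": True,
    "elective_c_section": False,
}
print(emergency_c_section_risk(patient))  # 0.6
```

Note that such a rule fires only when every antecedent is satisfied; the training/test accuracies quoted above (26/41 and 12/20) measure how often its prediction is correct when it does fire.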
Neural Network Learning Autonomous Learning Vehicle In a Neural Net (ALVINN): Pomerleau et al. http://www.ri.cmu.edu/projects/project_160.html Drives 70 mph on highways
Relevant Disciplines Artificial Intelligence Bayesian Methods Cognitive Science Computational Complexity Theory Control Theory Information Theory Neuroscience Philosophy Psychology Statistics Machine Learning Symbolic Representation Planning/Problem Solving Knowledge-Guided Learning Bayes’s Theorem Missing Data Estimators PAC Formalism Mistake Bounds Language Learning Learning to Reason Optimization Learning Predictors Meta-Learning Entropy Measures MDL Approaches Optimal Codes ANN Models Modular Learning Occam’s Razor Inductive Generalization Power Law of Practice Heuristic Learning Bias/Variance Formalism Confidence Intervals Hypothesis Testing
Specifying A Learning Problem Learning = Improving with Experience at Some Task Improve over task  T, with respect to performance measure  P , based on experience  E . Example: Learning to Play Checkers T : play games of checkers P : percent of games won in world tournament E : opportunity to play against self Refining the Problem Specification: Issues What experience? What  exactly  should be learned? How shall it be  represented ? What specific algorithm to learn it? Defining the Problem Milieu Performance element: How shall the results of learning be applied? How shall the performance element be evaluated?  The learning system?
Example: Learning to Play Checkers Type of Training Experience Direct or indirect? Teacher or not? Knowledge about the game (e.g., openings/endgames)? Problem: Is Training Experience Representative (of Performance Goal)? Software Design Assumptions of the learning system: legal move generator exists Software requirements: generator, evaluator(s), parametric target function Choosing a Target Function ChooseMove: Board → Move (action selection function, or policy) V: Board → ℝ (board evaluation function) Ideal target V; approximated target V̂ Goal of learning process: operational description (approximation) of V
A Target Function for Learning to Play Checkers Possible Definition If  b  is a final board state that is won, then  V(b)  = 100 If  b  is a final board state that is lost, then  V(b)  = -100 If  b  is a final board state that is drawn, then  V(b)  = 0 If  b  is  not a final board state in the game, then  V(b)  =  V(b’)  where  b’  is the best final board state that can be achieved starting from  b  and playing optimally until the end of the game Correct values, but not operational Choosing a Representation for the Target Function Collection of rules? Neural network? Polynomial function (e.g., linear, quadratic combination) of board features? Other? A Representation for Learned Function bp/rp = number of black/red pieces; bk/rk = number of black/red kings;  bt/rt = number of black/red pieces  threatened  (can be taken on next turn)
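Written out, the linear representation of the learned function over the six board features listed above (following the standard treatment of this example in Mitchell, Ch. 1; the weight symbols are the usual convention, not taken from the slide) is:

```latex
\hat{V}(b) = w_0 + w_1\,bp(b) + w_2\,rp(b) + w_3\,bk(b)
           + w_4\,rk(b) + w_5\,bt(b) + w_6\,rt(b)
```

where the weights $w_0, \ldots, w_6$ are the parameters to be learned.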
A Training Procedure for Learning to Play Checkers Obtaining Training Examples: V(b), the target function; V̂(b), the learned function; V_train(b), the training value One Rule for Estimating Training Values: V_train(b) ← V̂(Successor(b)) Choosing a Weight Tuning Rule Least Mean Square (LMS) weight update rule: REPEAT Select a training example b at random Compute error(b) = V_train(b) − V̂(b) for this training example For each board feature f_i, update weight w_i as follows: w_i ← w_i + c · f_i · error(b), where c is a small, constant factor to adjust the learning rate
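The LMS weight-update loop above can be sketched directly. A minimal Python version, using a made-up feature vector and training value rather than actual checkers features:

```python
def lms_update(weights, features, v_train, c=0.1):
    """One LMS step: w_i <- w_i + c * f_i * error(b),
    where error(b) = V_train(b) - V_hat(b) and V_hat is linear in the features."""
    v_hat = sum(w * f for w, f in zip(weights, features))
    error = v_train - v_hat
    return [w + c * f * error for w, f in zip(weights, features)]

# Toy check: repeated updates on a single example drive the error toward zero.
weights = [0.0, 0.0, 0.0]
features = [1.0, 2.0, -1.0]   # hypothetical board-feature values f_i(b)
target = 10.0                 # hypothetical training value V_train(b)
for _ in range(50):
    weights = lms_update(weights, features, target)
error = target - sum(w * f for w, f in zip(weights, features))
print(abs(error) < 1e-6)  # True
```

On a single example the error shrinks by a constant factor per step (here 1 − c·Σfᵢ² = 0.4), so the loop converges geometrically; with many examples drawn at random the same rule performs stochastic gradient descent on the squared error.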
Design Choices for Learning to Play Checkers Completed Design Determine Type of Training Experience Games against experts Games against self Table of correct moves Determine Target Function Board → value Board → move Determine Representation of Learned Function Polynomial Linear function of six features Artificial neural network Determine Learning Algorithm Gradient descent Linear programming
Some Issues in Machine Learning What Algorithms Can Approximate Functions Well? When? How Do Learning System Design Factors Influence Accuracy? Number of training examples Complexity of  hypothesis representation How Do Learning Problem Characteristics Influence Accuracy? Noisy data Multiple data sources What Are The Theoretical Limits of Learnability? How Can Prior Knowledge of Learner Help? What Clues Can We Get From Biological Learning Systems? How Can Systems Alter Their Own Representation?
Interesting Applications Reasoning (Inference, Decision Support): Cartia ThemeScapes - http://www.cartia.com - 6500 news stories from the WWW in 1997 Planning, Control: DC-ARM - http://www-kbs.ai.uiuc.edu [figure: fire-state diagram with states Normal, Ignited, Engulfed, Destroyed, Extinguished; events Fire Alarm, Flooding] Database Mining: NCSA D2K - http://alg.ncsa.uiuc.edu
Lecture Outline Read: Chapter 2, Mitchell; Section 5.1.2, Buchanan and Wilkins Suggested Exercises: 2.2, 2.3, 2.4, 2.6 Taxonomy of Learning Systems Learning from Examples (Supervised) concept learning framework Simple approach: assumes no noise; illustrates key concepts General-to-Specific Ordering over Hypotheses Version space: partially-ordered set (poset) formalism Candidate elimination algorithm Inductive learning Choosing New Examples Next Week The need for inductive bias: 2.7, Mitchell; 2.4.1-2.4.3, Shavlik and Dietterich Computational learning theory (COLT): Chapter 7, Mitchell PAC learning formalism: 7.2-7.4, Mitchell; 2.4.2, Shavlik and Dietterich
What to Learn? Classification Functions Learning hidden functions: estimating (“fitting”) parameters Concept learning (e.g., chair, face, game) Diagnosis, prognosis: medical, risk assessment, fraud, mechanical systems Models Map (for navigation) Distribution (query answering,  aka  QA) Language model (e.g., automaton/grammar) Skills Playing games Planning Reasoning (acquiring representation to use in reasoning) Cluster Definitions for Pattern Recognition Shapes of objects Functional or taxonomic definition Many Problems Can Be Reduced to Classification
How to Learn It? Supervised What is learned? Classification function; other models Inputs and outputs? Examples of the form ⟨x, f(x)⟩ How is it learned? Presentation of examples to learner (by teacher) Unsupervised Cluster definition, or vector quantization function (codebook) Learning: Formation, segmentation, labeling of clusters based on observations, metric Reinforcement Control policy (function from states of the world to actions) Learning: (Delayed) feedback of reward values to agent based on actions selected; model updated based on reward, (partially) observable state
(Supervised) Concept Learning Given: Training Examples < x, f(x) > of Some Unknown Function  f Find: A Good Approximation to  f Examples (besides Concept Learning) Disease diagnosis x  = properties of patient (medical history, symptoms, lab tests) f  = disease (or recommended therapy) Risk assessment  x  = properties of consumer, policyholder (demographics, accident history) f  = risk level (expected cost) Automatic steering x  = bitmap picture of road surface in front of vehicle f  = degrees to turn the steering wheel Part-of-speech tagging Fraud/intrusion detection Web log analysis Multisensor integration and prediction
A Learning Problem Unknown Function: y = f(x1, x2, x3, x4), where each input xi has type ti, y has type t, and f : (t1 × t2 × t3 × t4) → t Our learning function maps a vector of training examples to a hypothesis: Vector((t1 × t2 × t3 × t4) × t) → ((t1 × t2 × t3 × t4) → t)
Hypothesis Space: Unrestricted Case |A → B| = |B|^|A|, so |H| = |{0,1} × {0,1} × {0,1} × {0,1} → {0,1}| = 2^(2^4) = 65536 function values Complete Ignorance: Is Learning Possible? Need to see every possible input/output pair After 7 examples, still have 2^9 = 512 possibilities (out of 65536) for f
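The counting argument above can be checked directly: with 4 Boolean inputs there are 2^4 distinct input vectors, and an unrestricted Boolean function assigns each of them a 0 or 1 independently.

```python
n_inputs = 4
n_rows = 2 ** n_inputs       # 16 possible input vectors (truth-table rows)
n_functions = 2 ** n_rows    # each row independently mapped to 0 or 1
print(n_functions)           # 65536

# Each distinct training example pins down one truth-table row;
# after 7 examples, 16 - 7 = 9 rows remain unconstrained.
remaining = 2 ** (n_rows - 7)
print(remaining)             # 512
```

This is the point of the slide: without restricting the hypothesis space, the learner cannot generalize at all — every unseen row remains a coin flip.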
Training Examples for Concept  EnjoySport Specification for Examples Similar to a data type definition 6 attributes: Sky, Temp, Humidity, Wind, Water, Forecast Nominal-valued (symbolic) attributes - enumerative data type Binary (Boolean-Valued or  H  -Valued) Concept Supervised Learning Problem:  Describe the General Concept
Representing Hypotheses Many Possible Representations Hypothesis  h : Conjunction of Constraints on Attributes Constraint Values Specific value (e.g.,  Water = Warm ) Don’t care (e.g.,  “Water = ?” ) No value allowed (e.g., “Water = Ø”) Example Hypothesis for  EnjoySport Sky AirTemp Humidity Wind Water  Forecast <Sunny ? ? Strong ? Same> Is this consistent with the training examples? What are some hypotheses that are consistent with the examples?
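The conjunctive-hypothesis representation above is easy to make concrete. A minimal sketch, where `"?"` encodes "don't care", `None` encodes "no value allowed" (Ø), and anything else is a specific required value; the attribute order follows the slide (Sky, AirTemp, Humidity, Wind, Water, Forecast), and the sample instances are made up for illustration:

```python
def matches(hypothesis, instance):
    """True iff every attribute constraint in the hypothesis is satisfied."""
    return all(h == "?" or (h is not None and h == x)
               for h, x in zip(hypothesis, instance))

h = ("Sunny", "?", "?", "Strong", "?", "Same")
x1 = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
x2 = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")
print(matches(h, x1))  # True
print(matches(h, x2))  # False
```

A hypothesis containing Ø (`None`) in any position matches no instance at all, which is why it represents the maximally specific bound of the hypothesis space.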
