CS760 – Machine Learning Course Instructor: David Page email: dpage@cs.wisc.edu office: MSC 6743 (University & Charter)  hours: TBA Teaching Assistant: Daniel Wong email: wong@cs.wisc.edu office: TBA hours: TBA
Textbooks &  Reading Assignment Machine Learning   (Tom Mitchell) Selected on-line readings Read in Mitchell  (posted on class web page) Preface Chapter 1 Sections 2.1 and 2.2 Chapter 8
Monday, Wednesday,  and   Friday? We’ll meet 30 times this term (may or may not include exam in this count) We’ll meet on FRIDAY this and next week, in order to cover material for HW 1 (plus I have some business travel this term) Default : we WILL meet on Friday unless I announce otherwise
Course "Style" Primarily algorithmic & experimental Some theory, both mathematical & conceptual (much on  statistics ) "Hands on" experience, interactive lectures/discussions Broad survey of many ML subfields, including "symbolic"  (rules, decision trees, ILP) "connectionist"  (neural nets) support vector machines, nearest-neighbors theoretical  ("COLT") statistical  ("Bayes rule") reinforcement learning, genetic algorithms
"MS vs. PhD" Aspects MS'ish topics mature, ready for practical application first 2/3 – ¾ of semester Naive Bayes, Nearest-Neighbors, Decision Trees, Neural Nets, Suport Vector Machines, ensembles, experimental methodology (10-fold cross validation,  t -tests) PhD'ish topics inductive logic programming, statistical relational learning, reinforcement learning, SVMs,  use of prior knowledge Other machine learning material covered in Bioinformatics CS 576/776, Jerry Zhu’s CS 838
Two Major Goals to understand  what  a learning system should do to understand  how  (and how  well ) existing systems work Issues in algorithm design Choosing algorithms for applications
Background Assumed Languages Java  (see CS 368 tutorial online) AI Topics Search FOPC Unification Formal Deduction Math Calculus (partial derivatives) Simple prob & stats No previous ML experience assumed   (so some overlap with CS 540)
Requirements Bi-weekly programming HW's "hands on" experience valuable HW0 – build a dataset HW1 – simple ML algo's and exper. methodology HW2 – decision trees (?) HW3 – neural nets (?) HW4 – reinforcement learning (in a simulated world) "Midterm" exam  (in class, about 90% through semester) Find project of your choosing during last 4-5 weeks of class
Grading HW's 35% "Midterm" 40% Project 20% Quality Discussion   5%
Late HW's Policy HW's due @ 4pm you have  5   late days to use  over the semester (Fri 4pm  -> Mon 4pm is  1   late "day") SAVE UP late days! extensions only for  extreme  cases Penalty points after late days exhausted Can't be more than ONE WEEK late
Academic Misconduct  (also on course homepage) All examinations, programming assignments, and written homeworks must be done  individually . Cheating and plagiarism will be dealt with in accordance with University procedures (see the  Academic Misconduct Guide for Students ). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss with your peers, the TAs or the instructor ideas, approaches and techniques broadly, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act.
What Do You Think  Learning Means?
What is Learning? “ Learning denotes changes in the system that  …  enable the  system to do the same task …  more  effectively the next time.” -  Herbert Simon “ Learning is making useful changes in our minds.” -  Marvin Minsky
Today’s   Topics Memorization as Learning Feature Space Supervised ML K -NN ( K -Nearest Neighbor)
Memorization (Rote Learning) Employed by first machine learning systems, in 1950s Samuel’s Checkers program Michie’s MENACE: Matchbox Educable Noughts and Crosses Engine Prior to these, some people believed computers could not improve at a task with experience
Rote Learning is Limited Memorize I/O pairs and perform exact matching with new inputs If computer has not seen precise case before, it cannot apply its experience Want computer to “generalize” from prior experience
Some Settings in Which Learning May Help Given an input, what is appropriate response (output/action)? Game playing – board state/move Autonomous robots (e.g., driving a vehicle) -- world state/action Video game characters – state/action Medical decision support – symptoms/ treatment Scientific discovery – data/hypothesis Data mining – database/regularity
Broad Paradigms of Machine Learning Inducing Functions from I/O Pairs Decision trees (e.g., Quinlan’s C4.5 [1993]) Connectionism / neural networks (e.g., backprop) Nearest-neighbor methods Genetic algorithms SVM’s  Learning without Feedback/Teacher Conceptual clustering Self-organizing systems Discovery systems Not in Mitchell’s textbook (covered in CS 776)
IID (Completion of Lec #2) We are assuming examples are IID: independent and identically distributed E.g., we are ignoring temporal dependencies (covered in time-series learning) E.g., we assume the learner has no say in which examples it gets (covered in active learning)
Supervised Learning Task Overview: Real World → [Feature Selection, usually done by humans (HW 0)] → Feature Space → [Classification Rule Construction, done by learning algorithm (HW 1-3)] → Concepts/Classes/Decisions
Supervised Learning Task Overview (cont.) Note: mappings on previous slide are not necessarily 1-to-1 Bad for first mapping? Good for the second  (in fact, it’s the goal!)
Empirical Learning:  Task Definition Given  A collection of  positive  examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of the  negative  examples (i.e., non-members) Produce A description that  covers  (includes) all/most of the positive examples and non/few of the negative examples  (and, hopefully, properly categorizes most future examples!) Note: one can easily extend this definition to handle more than two classes The Key Point!
Example Positive Examples Negative Examples How does this symbol classify? Concept Solid Red Circle in a (Regular?) Polygon What about? Figures on left side of page Figures drawn before 5pm 2/2/89 <etc>
Concept Learning Learning systems differ in how they represent concepts. From the same training examples: Backpropagation → Neural Net; C4.5, CART → Decision Tree; AQ, FOIL → Rules (Φ <- X^Y, Φ <- Z); SVMs → If 5x₁ + 9x₂ – 3x₃ > 12 Then +
Feature Space If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space (e.g., axes Size, Weight, Color; a query point <Big, 2500, Gray> with unknown label ?). A “concept” is then a (possibly disjoint) volume in this space.
Learning from Labeled Examples Most common and successful form of ML [Venn diagram of + and – examples in feature space] Examples – points in a multi-dimensional “feature space” Concepts – “function” that labels every point in feature space (as +, -, and possibly ?)
Brief Review Conjunctive Concept Color(?obj1, red) ^ Size(?obj1, large) (“and”) Disjunctive Concept Color(?obj2, blue) v Size(?obj2, small) (“or”) More formally, a “concept” is of the form ∀x ∀y ∀z F(x, y, z) -> Member(x, Class1), where the variables range over instances
Empirical Learning and Venn Diagrams Concept = A or B (Disjunctive concept) Examples = labeled points in feature space Concept = a label for a set of points [Venn diagram: feature space with + examples inside regions A and B, – examples outside]
Aspects of an ML System “ Language” for representing classified examples “ Language” for representing “Concepts” Technique for producing concept “consistent” with the training examples Technique for classifying new instance Each of these limits the  expressiveness / efficiency   of the supervised learning algorithm. HW 0 Other HW’s
Nearest-Neighbor Algorithms (aka exemplar models, instance-based learning (IBL), case-based learning) Learning ≈ memorize training examples Problem solving = find most similar example in memory; output its category [Venn diagram: a ‘?’ query point among stored + and – examples] “Voronoi Diagrams” (pg 233)
Simple Example: 1-NN Training Set a=0, b=0, c=1   + a=0, b=0, c=0   - a=1, b=1, c=1   - Test Example a=0, b=1, c=0  ? “ Hamming Distance” Ex 1 = 2 Ex 2 = 1 Ex 3 = 2 So output - (1-NN ≡   one nearest neighbor)
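A minimal sketch of the slide's 1-NN computation in Python; the dictionary representation of the examples is only for readability:

```python
# 1-NN with Hamming distance on the toy dataset above.
train = [
    ({"a": 0, "b": 0, "c": 1}, "+"),
    ({"a": 0, "b": 0, "c": 0}, "-"),
    ({"a": 1, "b": 1, "c": 1}, "-"),
]
test = {"a": 0, "b": 1, "c": 0}

def hamming(x, y):
    # number of features on which the two examples disagree
    return sum(x[f] != y[f] for f in x)

dists = [(hamming(ex, test), label) for ex, label in train]
print(dists)          # [(2, '+'), (1, '-'), (2, '-')]
print(min(dists)[1])  # nearest neighbor is example 2, so output '-'
```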
Sample Experimental Results (see UCI archive for more) Simple algorithm works quite well! Testset correctness:
Testbed            1-NN   D-Trees   Neural Nets
Wisconsin Cancer    98%     95%        96%
Heart Disease       78%     76%         ?
Tumor               37%     38%         ?
Appendicitis        83%     85%        86%
K-NN Algorithm Collect K nearest neighbors, select majority classification (or somehow combine their classes) What should K be? It probably is problem dependent Can use tuning sets (later) to select a good setting for K [plot: tuning-set error rate vs. K = 1, 2, 3, 4, 5] Shouldn’t really “connect the dots” (Why?)
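A minimal sketch of this tuning-set procedure for choosing K, assuming numpy arrays and plain Euclidean distance; the candidate K values and function names below are illustrative, not from the course materials:

```python
import numpy as np

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(train_X - x, axis=1)          # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def choose_k(train_X, train_y, tune_X, tune_y, candidates=(1, 2, 3, 4, 5)):
    """Return the K with the lowest error rate on the tuning set."""
    errors = {}
    for k in candidates:
        preds = [knn_predict(train_X, train_y, x, k) for x in tune_X]
        errors[k] = np.mean(np.array(preds) != tune_y)
    best_k = min(errors, key=errors.get)
    return best_k, errors
```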
Data Representation Creating a dataset of fixed-length feature vectors HW0 out on-line, due next Friday Be sure to include – on separate 8x11 sheet – a photo and a brief bio
HW0 – Create Your Own Dataset  (repeated from lecture #1) Think about before next class Read HW0 (on-line) Google to find: UCI archive (or UCI KDD archive) UCI ML archive (UCI ML repository) More links in HW0’s web page
HW0 – Your “Personal Concept” Step 1:  Choose a Boolean (true/false) concept Books I like/dislike  Movies I like/dislike  www pages I like/dislike   Subjective judgment (can’t articulate) “ time will tell” concepts Stocks to buy Medical treatment at time  t , predict outcome at time ( t  + ∆ t) Sensory interpretation  Face recognition (see textbook) Handwritten digit recognition Sound recognition Hard-to-Program Functions
Some Real-World Examples Car Steering (Pomerleau, Thrun): digitized camera image → learned function → steering angle Medical Diagnosis (Quinlan): medical record (e.g., age=13, sex=M, wgt=18) → learned function → sick vs. healthy Also: DNA categorization, TV-pilot rating, chemical-plant control, backgammon playing
HW0 – Your “Personal Concept” Step 2:  Choosing a  feature space We will use  fixed-length feature vectors Choose  N   features Each feature has  V i   possible values Each example is represented by a vector of  N  feature values  (i.e.,  is a point in the feature space ) e.g.:  <red,  50,  round> color   weight  shape Feature Types Boolean Nominal Ordered Hierarchical Step 3:  Collect examples (“I/O” pairs) Defines a space In HW0 we will use a subset (see next slide)
Standard Feature Types for representing training examples – a source of “domain knowledge” Nominal No relationship among possible values e.g., color є {red, blue, green} (vs. color = 1000 Hertz) Linear (or Ordered) Possible values of the feature are totally ordered e.g., size є {small, medium, large} ← discrete, weight є [0…500] ← continuous Hierarchical Possible values are partially ordered in an ISA hierarchy e.g., for shape: closed → {polygon → {triangle, square}, continuous → {circle, ellipse}}
Our Feature Types (for CS 760 HW’s) Discrete tokens (char strings, w/o quote marks and spaces) Continuous numbers (int’s or float’s) If only a few possible values (e.g., 0 & 1) use discrete i.e., merge  nominal  and  discrete-ordered   (or convert  discrete-ordered  into 1,2,…) We will ignore hierarchical info and  only use the leaf values (common approach)
Example Hierarchy (KDD* Journal, Vol 5, No. 1-2, 2001, page 17) Structure of one feature! “the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.” - From eds.’ intro to special issue (on applications) of KDD journal, Vol 5, 2001 * Officially, “Data Mining and Knowledge Discovery”, Kluwer Publishers. Example: Product → 99 product classes (e.g., Pet Foods, Tea) → 2302 product subclasses (e.g., Canned Cat Food, Dried Cat Food) → ~30k products (e.g., Friskies Liver, 250g)
HW0: Creating Your Dataset Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for target function (category). Entities and their attributes: Studio (Name, Country, List of movies), Movie (Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season), Actor (Name, Year of birth, Gender, Oscar nominations, List of movies), Director/Producer (Name, Year of birth, List of movies). Relationships: Studio –Made→ Movie, Actor –Acted in→ Movie, Director/Producer –Directed/Produced→ Movie
HW0: Sample DB Choose a Boolean or binary-valued target function (category) Opening weekend box-office    receipts > $2 million Movie is drama? (action, sci-fi,…) Movies I like/dislike (e.g. Tivo)
HW0: Representing as a Fixed-Length Feature Vector <discuss on chalkboard> Note:  some advanced ML approaches do  not  require such “feature mashing” (eg, ILP)
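As a rough illustration of that “feature mashing”, here is one way a variable-length movie record might be collapsed into a fixed-length vector; the record layout and field names below are hypothetical, not IMDB’s actual schema:

```python
# Hypothetical movie record with variable-length, relational fields.
movie = {
    "title": "Example Movie",
    "genres": ["Drama", "Romance"],
    "actors": [{"name": "A", "oscar_nominations": 2},
               {"name": "B", "oscar_nominations": 0}],
    "release_season": "summer",
    "opening_receipts": 3.1e6,
}

def to_feature_vector(m):
    """Collapse lists/relations into a fixed set of scalar features."""
    return [
        len(m["actors"]),                                  # continuous
        sum(a["oscar_nominations"] for a in m["actors"]),  # continuous
        1 if "Drama" in m["genres"] else 0,                # Boolean
        m["release_season"],                               # discrete token
    ]

label = movie["opening_receipts"] > 2e6   # Boolean target from the slide above
print(to_feature_vector(movie), label)
```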
David Jensen’s group at UMass uses Naïve Bayes and other ML algo’s on the IMDB Opening weekend box-office receipts > $2 million: 25 attributes, Accuracy = 83.3%, Default accuracy = 56% (default algo?) Movie is drama?: 12 attributes, Accuracy = 71.9%, Default accuracy = 51% http://kdl.cs.umass.edu/proximity/about.html
First Algorithm in Detail K -Nearest Neighbors /  Instance-Based Learning ( k -NN/IBL) Distance functions Kernel functions Feature selection (applies to all ML algo’s) IBL Summary Chapter 8 of Mitchell
Some Common Jargon Classification Learning a  discrete  valued function Regression Learning a  real  valued function IBL easily extended to regression tasks (and to multi-category classification) Discrete/Real Outputs
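A sketch of the regression extension: replace the majority vote with the mean of the neighbors’ real-valued outputs (numpy arrays assumed; the function name is illustrative):

```python
import numpy as np

def knn_regress(train_X, train_y, x, k=3):
    """Predict a real value as the mean output of the k nearest neighbors."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()
```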
Variations on a Theme IB1  – keep all examples IB2  – keep next instance if  incorrectly  classified by using previous instances Uses less storage (good) Order dependent (bad) Sensitive to noisy data (bad) (From Aha, Kibler and Albert in ML Journal)
Variations on a Theme (cont.) IB3   – extend IB2 to more intelligently decide which examples to keep (see article) Better handling of noisy data Another Idea  -  cluster groups, keep example from each (median/centroid) Less storage, faster lookup
Distance Functions Key  issue in IBL  (instance-based learning) One approach:   assign weights to each feature
Distance Functions (sample) dist(ex₁, ex₂) = Σᵢ wᵢ · distᵢ(ex₁, ex₂), where dist(ex₁, ex₂) is the distance between examples 1 and 2, wᵢ is a numeric weighting factor, and distᵢ(ex₁, ex₂) is the distance for feature i only between examples 1 and 2
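A minimal sketch of such a weighted, per-feature distance, assuming discrete features contribute a 0/1 mismatch and continuous features an absolute difference; the weights below are arbitrary placeholders one would normally tune or set from domain knowledge:

```python
def per_feature_dist(v1, v2, feature_type):
    """Distance for one feature only: 0/1 mismatch for discrete tokens,
    absolute difference for continuous values (ideally pre-scaled)."""
    if feature_type == "discrete":
        return 0.0 if v1 == v2 else 1.0
    return abs(v1 - v2)

def weighted_dist(ex1, ex2, feature_types, weights):
    """dist(ex1, ex2) = sum_i w_i * dist_i(ex1, ex2)"""
    return sum(w * per_feature_dist(a, b, t)
               for a, b, t, w in zip(ex1, ex2, feature_types, weights))

# e.g., examples like <red, 50, round> from the earlier slide:
d = weighted_dist(("red", 50, "round"), ("blue", 45, "round"),
                  ("discrete", "continuous", "discrete"),
                  (1.0, 0.02, 1.0))     # weights are placeholders
print(d)   # 1.0 + 0.02*5 + 0.0 = 1.1
```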
Kernel Functions  and  k -NN Term “kernel” comes from statistics Major topic in support vector machines (SVMs) Weights the interaction between pairs of examples
Kernel Functions and k-NN (continued) Assume we have the k nearest neighbors e₁, ..., e_k with associated output categories O₁, ..., O_k Then the output for test case e_t is output(e_t) = argmax_c Σᵢ₌₁..k K(eᵢ, e_t) · δ(Oᵢ, c), where K is the kernel and δ is the “delta” function (= 1 if Oᵢ = c, else = 0)
Sample Kernel Functions K(eᵢ, e_t): K(eᵢ, e_t) = 1 → simple majority vote (? classified as –) K(eᵢ, e_t) = 1 / dist(eᵢ, e_t) → inverse distance weight (? could be classified as +) In the diagram to the right, example ‘?’ has three neighbors, two of which are ‘–’ and one of which is ‘+’
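A sketch of that kernel-weighted vote, reproducing the three-neighbor scenario above; the numeric distances are made up for illustration (one close ‘+’ neighbor, two farther ‘–’ neighbors):

```python
from collections import defaultdict

def kernel_vote(neighbors, kernel):
    """neighbors: list of (distance, category) pairs.
    Return the category c maximizing sum_i K(e_i, e_t) * delta(O_i, c)."""
    score = defaultdict(float)
    for dist, cat in neighbors:
        score[cat] += kernel(dist)
    return max(score, key=score.get)

# Two '-' neighbors far away, one '+' neighbor close by (illustrative distances).
neighbors = [(3.0, "-"), (3.5, "-"), (0.5, "+")]
print(kernel_vote(neighbors, lambda d: 1))         # majority vote     -> '-'
print(kernel_vote(neighbors, lambda d: 1.0 / d))   # inverse distance  -> '+'
```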
Gaussian Kernel Heavily used in SVMs K(eᵢ, e_t) = e^(−dist(eᵢ, e_t)² / (2σ²)), where e is Euler’s constant and σ controls the kernel width
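A small sketch of this kernel as a weighting function (σ is a parameter one would tune; the printed values just show how quickly distant neighbors are discounted):

```python
import math

def gaussian_kernel(dist, sigma=1.0):
    """K(e_i, e_t) = exp(-dist(e_i, e_t)**2 / (2 * sigma**2)):
    nearby neighbors get weight near 1, distant ones near 0."""
    return math.exp(-dist**2 / (2 * sigma**2))

print(gaussian_kernel(0.5), gaussian_kernel(3.0))   # ~0.88 vs ~0.011
```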
Local Learning Collect k nearest neighbors Give them to some supervised ML algo Apply learned model to test example [diagram: a ‘?’ query point; the model is trained only on its k nearest + / – neighbors]
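One hedged sketch of local learning, using scikit-learn’s logistic regression as a stand-in for “some supervised ML algo” (any classifier would do; assumes numpy arrays and that both classes appear among the k neighbors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_learn_predict(train_X, train_y, x, k=20):
    """Fit a model only on the k nearest neighbors of x, then classify x."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Assumes the k neighbors contain at least two classes.
    model = LogisticRegression().fit(train_X[nearest], train_y[nearest])
    return model.predict(x.reshape(1, -1))[0]
```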
Instance-Based Learning (IBL) and Efficiency IBL algorithms postpone work from training to testing Pure  k -NN/IBL just memorizes  the training data Sometimes called  lazy learning Computationally intensive Match all features of all training examples
Instance-Based Learning (IBL) and Efficiency Possible Speed-ups Use a subset of the training examples (Aha) Use clever data structures (A. Moore) KD trees, hash tables, Voronoi diagrams Use subset of the features
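For the “clever data structures” point, a sketch using SciPy’s KD-tree implementation; the random data is only a placeholder for real training examples:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_X = rng.random((10000, 5))       # 10k training examples, 5 features
tree = cKDTree(train_X)                # build the index once, at "training" time

query = rng.random(5)
dists, idxs = tree.query(query, k=3)   # 3 nearest neighbors without a full scan
print(idxs, dists)
```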
Number of Features and Performance Too many features can hurt  test  set performance Too many irrelevant features mean many spurious correlation possibilities for a ML algorithm to detect “Curse of dimensionality”
Feature Selection and ML (general issue for ML) Filtering-Based Feature Selection: all features → FS algorithm → subset of features → ML algorithm → model Wrapper-Based Feature Selection: all features → FS algorithm ⇄ ML algorithm → model; the FS algorithm calls the ML algorithm many times and uses it to help select features
Feature Selection as Search Problem State  = set of features Start state   =  empty   ( forward selection )     or  full  ( backward selection ) Goal test  = highest scoring state Operators  add/subtract features Scoring function  accuracy on training (or tuning) set of  ML algorithm using this state’s feature set
Forward and Backward Selection of Features Hill-climbing (“greedy”) search; states = feature sets to use, heuristic = accuracy on tuning set Forward: start from {} (50%); add F₁ → {F₁} (62%), ..., add F_N → {F_N} (71%), ... Backward: start from {F₁, F₂, ..., F_N} (73%); subtract F₁ → {F₂, ..., F_N} (79%), subtract F₂ → ...
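A sketch of greedy forward selection as the search just described; score() stands for “train the ML algorithm with this feature subset and return tuning-set accuracy” and is an assumed callback, not a specific library routine:

```python
def forward_select(all_features, score):
    """Greedy forward selection: start from {} and repeatedly add the single
    feature that most improves tuning-set accuracy; stop when nothing helps.
    all_features: set of feature names.
    score(features): trains the ML algorithm using only `features` and
    returns its accuracy on the tuning set."""
    current, best = set(), score(set())
    improved = True
    while improved:
        improved = False
        for f in all_features - current:
            acc = score(current | {f})
            if acc > best:
                best, best_add, improved = acc, f, True
        if improved:
            current.add(best_add)
    return current, best
```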
Forward vs. Backward Feature Selection Forward: faster in early steps because fewer features to test; fast for choosing a small subset of the features; misses useful features whose usefulness requires other features (feature synergy) Backward: fast for choosing all but a small subset of the features; preserves useful features whose usefulness requires other features (example: area is important, but the features are length and width)
Some Comments on k-NN Positive: easy to implement; good “baseline” algorithm / experimental control; incremental learning easy; psychologically plausible model of human memory Negative: no insight into domain (no explicit model); choice of distance function is problematic; doesn’t exploit/notice structure in examples
Questions about IBL  (Breiman et al. - CART book) Computationally expensive to save all examples; slow classification of new examples Addressed by IB2/IB3 of Aha et al. and work of A. Moore (CMU; now Google) Is this really a problem?
Questions about IBL  (Breiman et al. - CART book) Intolerant of Noise Addressed by IB3 of Aha et al. Addressed by  k -NN version Addressed by feature selection - can discard the noisy feature Intolerant of Irrelevant Features Since algorithm very fast, can experimentally choose good feature sets (Kohavi, Ph. D. – now at Amazon)
More IBL Criticisms High sensitivity to choice of similarity (distance) function Euclidean distance might not be best choice Handling non-numeric features and missing feature values is not natural, but doable How might we do this? (Part of HW1) No insight into task (learned concept not interpretable)
Summary IBL can be a very effective machine learning algorithm Good “baseline” for experiments
