www1.cs.columbia.edu

  1. Machine Learning. Reading: Chapter 18
  2. Machine Learning and AI
     - Improve task performance through observation or teaching
     - Acquire knowledge automatically for use in a task
     - Learning is a key component of intelligence
  3. What does it mean to learn?
  4. Different Kinds of Learning
     - Rote learning
     - Learning from instruction
     - Learning by analogy
     - Learning from observation and discovery
     - Learning from examples
  5. Inductive Learning
     - Input: examples (x, f(x)) of an unknown function f
     - Output: a function h that approximates f
     - A good hypothesis h is a generalization, or learned rule
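The input/output view above can be made concrete with a tiny sketch. The threshold learner and the data here are invented for illustration (they are not from the slides): given samples (x, f(x)) of an unknown boolean function, pick the hypothesis h that agrees with the most training examples.

```python
# A minimal sketch of inductive learning: search a small hypothesis space
# (threshold rules h(x) = x >= t) for the one most consistent with the samples.

def learn_threshold(samples):
    """samples: list of (x, f(x)) pairs with boolean labels."""
    best_t, best_correct = None, -1
    for t in (x for x, _ in samples):          # candidate thresholds
        correct = sum((x >= t) == label for x, label in samples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return lambda x: x >= best_t

data = [(1, False), (2, False), (5, True), (7, True)]
h = learn_threshold(data)   # h generalizes: it also answers for unseen x
```

The point of the sketch is the last line: h is defined for inputs that never appeared in the training data.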
  6. How do systems learn?
     - Supervised
     - Unsupervised
     - Reinforcement
  7. Three Types of Learning
     - Rule induction
       - E.g., decision trees
     - Knowledge-based
       - E.g., using a domain theory
     - Statistical
       - E.g., Naïve Bayes, nearest neighbor, support vector machines
  8. Applications
     - Language/speech: machine translation, summarization, grammars
     - IR: text categorization, relevance feedback
     - Medical: assessment of illness severity
     - Vision: face recognition, digit recognition, outdoor scene recognition
     - Security: intrusion detection, network traffic, credit fraud
     - Social networks: email traffic
     - To think about: applications to systems, computer engineering, software?
  9. Language Tasks
     - Text summarization
       - Task: given a document, which sentences could serve as the summary?
       - Training data: summary + document pairs
       - Output: rules that extract sentences from an unseen document
     - Grammar induction
       - Task: produce a tree representing syntactic structure, given a sentence
       - Training data: a set of sentences annotated with parse trees
       - Output: rules that generate a parse tree for an unseen sentence
  10. IR Task
      - Text categorization (e.g., http://www.yahoo.com)
        - Task: given a web page, is it news or not?
          - Binary classification (yes, no)
          - Or classify as one of business & economy, news & media, computers
        - Training data: documents labeled with categories
        - Output: a yes/no response for a new document, or a category for a new document
  11. Medical
      - Task: does a patient have heart disease (on a scale from 1 to 4)?
      - Training data:
        - Age, sex, cholesterol, chest pain location, chest pain type, resting blood pressure, smoker?, fasting blood sugar, etc.
        - Characterization of heart disease (0, 1-4)
      - Output: given a new patient, a classification by disease
  12. General Approach
      - Formulate the task
      - Choose a prior model (parameters, structure)
      - Obtain data
      - Decide what representation should be used (attribute/value pairs)
      - Annotate data
      - Learn/refine the model with the data (training)
      - Use the model for classification or prediction on unseen data (testing)
      - Measure accuracy
  13. Issues
      - Representation: how to map from a representation in the domain to a representation used for learning?
      - Training data: how can training data be acquired?
      - Amount of training data: how well does the algorithm do as we vary the amount of data?
      - Which attributes influence learning most?
      - Does the learning algorithm provide insight into the generalizations made?
  14. Classification Learning
      - Input: a set of attributes and their values
      - Output: a discrete-valued function
        - Learning a continuous-valued function is called regression
      - Binary or boolean classification: the category is either true or false
  15. Learning Decision Trees
      - Each node tests the value of an input attribute
      - Branches from the node correspond to possible values of the attribute
      - Leaf nodes supply the values to be returned if that leaf is reached
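The three bullets above map directly onto a small data structure. A sketch (the class names and the 'outlook' example are illustrative, not from the slides):

```python
# Internal nodes test an attribute, branches map attribute values to subtrees,
# and leaves hold the value to return.

class Leaf:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute   # attribute tested at this node
        self.branches = branches     # {attribute value: subtree}

def classify(tree, example):
    # Walk down the tree until a leaf is reached, then return its value.
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.attribute]]
    return tree.value

# Tiny example tree with a single test on the 'outlook' attribute.
tree = Node("outlook", {"sunny": Leaf("play"), "rainy": Leaf("stay in")})
```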
  16. Example
      - http://www.ics.uci.edu/~mlearn/MLSummary.html
      - Iris Plant Database: which of 3 classes is a given iris plant?
        - Iris Setosa
        - Iris Versicolour
        - Iris Virginica
      - Attributes
        - Sepal length in cm
        - Sepal width in cm
        - Petal length in cm
        - Petal width in cm
  17. Summary Statistics

                      Min   Max   Mean   SD     Class Correlation
      sepal length :  4.3   7.9   5.84   0.83    0.7826
      sepal width  :  2.0   4.4   3.05   0.43   -0.4194
      petal length :  1.0   6.9   3.76   1.76    0.9490  (high!)
      petal width  :  0.1   2.5   1.20   0.76    0.9565  (high!)

      Rules to learn:
      - If sepal length > 6 and sepal width > 3.8 and petal length < 2.5 and petal width < 1.5, then class = Iris Setosa
      - If sepal length > 5 and sepal width > 3 and petal length > 5.5 and petal width > 2, then class = Iris Versicolour
      - If sepal length < 5 and sepal width > 3 and petal length ≥ 2.5 and ≤ 5.5 and petal width ≥ 1.5 and ≤ 2, then class = Iris Virginica
  18. Data

      #   S-length  S-width  P-length  P-width  Class
      1   6.8       3.0      6.3       2.3      Versicolour
      2   7.0       3.9      2.4       2.2      Setosa
      3   2.0       3.0      2.6       1.7      Virginica
      4   3.0       3.4      2.5       1.1      Virginica
      5   5.5       3.6      6.8       2.4      Versicolour
      6   7.7       4.1      1.2       1.4      Setosa
      7   6.3       4.3      1.6       1.2      Setosa
      8   1.0       3.7      2.8       2.2      Virginica
      9   6.0       4.2      5.6       2.1      Versicolour
  19. Data (petal width removed)

      #   S-length  S-width  P-length  Class
      1   6.8       3.0      6.3       Versicolour
      2   7.0       3.9      2.4       Setosa
      3   2.0       3.0      2.6       Virginica
      4   3.0       3.4      2.5       Virginica
      5   5.5       3.6      6.8       Versicolour
      6   7.7       4.1      1.2       Setosa
      7   6.3       4.3      1.6       Setosa
      8   1.0       3.7      2.8       Virginica
      9   6.0       4.2      5.6       Versicolour
  20. Constructing the Decision Tree
      - Goal: find the smallest decision tree consistent with the examples
      - Find the attribute that best splits the examples
      - Form a tree with root = best attribute
      - For each value v_i (or range) of the best attribute:
        - Select those examples with best = v_i
        - Construct subtree_i by recursively calling the procedure with that subset of examples and all attributes except best
        - Add a branch to the tree with label = v_i and subtree = subtree_i
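The recursive procedure above can be sketched in a few lines. Assumptions (not from the slides): examples are dicts mapping attribute names to values plus a "class" key, and score_split is any function rating how well an attribute splits the examples; information gain, defined on a later slide, is the usual choice, so a simple purity score stands in here.

```python
from collections import Counter

def build_tree(examples, attributes, score_split):
    classes = [e["class"] for e in examples]
    if len(set(classes)) == 1:            # all examples agree: make a leaf
        return classes[0]
    if not attributes:                    # nothing left to test: majority leaf
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: score_split(examples, a))
    rest = [a for a in attributes if a != best]
    subtrees = {}
    for v in {e[best] for e in examples}:            # each value v_i of best
        subset = [e for e in examples if e[best] == v]
        subtrees[v] = build_tree(subset, rest, score_split)  # recurse
    return {"attribute": best, "branches": subtrees}

def purity(examples, attr):
    """Stand-in scorer: fewer mixed-class branches is better."""
    values = {e[attr] for e in examples}
    return -sum(len({e["class"] for e in examples if e[attr] == v}) > 1
                for v in values)

examples = [
    {"a": 0, "b": 0, "class": "x"},
    {"a": 0, "b": 1, "class": "x"},
    {"a": 1, "b": 0, "class": "y"},
]
tree = build_tree(examples, ["a", "b"], purity)   # picks "a": it splits cleanly
```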
  21. Construct example decision tree
  22. Tree and Rules Learned

      Tree: the root tests S-length.
      - S-length < 5.5: leaf Virginica (examples 3, 4, 8)
      - S-length ≥ 5.5: test P-length
        - P-length ≥ 5.6: leaf Versicolour (examples 1, 5, 9)
        - P-length ≤ 2.4: leaf Setosa (examples 2, 6, 7)

      Rules:
      - If S-length < 5.5, then Virginica
      - If S-length ≥ 5.5 and P-length ≥ 5.6, then Versicolour
      - If S-length ≥ 5.5 and P-length ≤ 2.4, then Setosa
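The learned rules can be checked directly against the nine training rows (a sketch; each tuple is (S-length, P-length, class), read off the data slide):

```python
# The three learned rules written as a classifier; the rules use only the
# attributes the tree tested, S-length and P-length.

def classify_iris(s_length, p_length):
    if s_length < 5.5:
        return "Virginica"
    if p_length >= 5.6:
        return "Versicolour"
    if p_length <= 2.4:
        return "Setosa"
    return None   # no rule fires

data = [
    (6.8, 6.3, "Versicolour"), (7.0, 2.4, "Setosa"),      (2.0, 2.6, "Virginica"),
    (3.0, 2.5, "Virginica"),   (5.5, 6.8, "Versicolour"), (7.7, 1.2, "Setosa"),
    (6.3, 1.6, "Setosa"),      (1.0, 2.8, "Virginica"),   (6.0, 5.6, "Versicolour"),
]
all_correct = all(classify_iris(s, p) == c for s, p, c in data)   # True
```

As expected for a tree grown until the leaves are pure, the rules classify every training example correctly.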
  23. Comparison of Target and Learned Rules

      Learned:
      - If S-length < 5.5, then Virginica
      - If S-length ≥ 5.5 and P-length ≥ 5.6, then Versicolour
      - If S-length ≥ 5.5 and P-length ≤ 2.4, then Setosa

      Target:
      - If sepal length > 6 and sepal width > 3.8 and petal length < 2.5 and petal width < 1.5, then class = Iris Setosa
      - If sepal length > 5 and sepal width > 3 and petal length > 5.5 and petal width > 2, then class = Iris Versicolour
      - If sepal length < 5 and sepal width > 3 and petal length ≥ 2.5 and ≤ 5.5 and petal width ≥ 1.5 and ≤ 2, then class = Iris Virginica
  24. Text Classification
      - Is text_i a finance news article?
      - Example documents are labeled positive (finance) or negative
  25. 20 attributes (word counts from a finance article)
      Investors 2, Dow 2, Jones 2, Industrial 1, Average 3, Percent 5, Gain 6, Trading 8, Broader 5, stock 5, Indicators 6, Standard 2, Rolling 1, Nasdaq 3, Early 10, Rest 12, More 13, first 11, Same 12, The 30
  26. 20 attributes (words from a sports article)
      Men’s, Basketball, Championship, UConn Huskies, Georgia Tech, Women, Playing, Crown, Titles, Games, Rebounds, All-America, early, rolling, Celebrates, Rest, More, First, The, same
  27. Example

      #    stock  rolling  the   class
      1    0      3        40    other
      2    6      8        35    finance
      3    7      7        25    other
      4    5      7        14    other
      5    8      2        20    finance
      6    9      4        25    finance
      7    5      6        20    finance
      8    0      2        35    other
      9    0      11       25    finance
      10   0      15       28    other
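Rows like those above come from counting each attribute word in a document. A sketch (the sentence is invented for illustration):

```python
# Turn a document into an attribute/value row: each attribute is a word and
# its value is the word's count in the document.

def to_counts(document, attributes):
    words = document.lower().split()
    return {a: words.count(a) for a in attributes}

row = to_counts(
    "Stock prices fell as the Dow slid and the broader stock indicators sank",
    ["stock", "rolling", "the"],
)
# row == {"stock": 2, "rolling": 0, "the": 2}
```

A real system would also strip punctuation and perhaps stem words; this sketch only lowercases and splits on whitespace.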
  28. Issues
      - Representation: how to map from a representation in the domain to a representation used for learning?
      - Training data: how can training data be acquired?
      - Amount of training data: how well does the algorithm do as we vary the amount of data?
      - Which attributes influence learning most?
      - Does the learning algorithm provide insight into the generalizations made?
  29. Constructing the Decision Tree
      - Goal: find the smallest decision tree consistent with the examples
      - Find the attribute that best splits the examples
      - Form a tree with root = best attribute
      - For each value v_i (or range) of the best attribute:
        - Select those examples with best = v_i
        - Construct subtree_i by recursively calling the procedure with that subset of examples and all attributes except best
        - Add a branch to the tree with label = v_i and subtree = subtree_i
  30. Choosing the Best Attribute: Binary Classification
      - We want a formal measure that returns a maximum value when an attribute makes a perfect split and a minimum when it makes no distinction
      - Information theory (Shannon and Weaver, 1949):

        H(P(v_1), ..., P(v_n)) = Σ_{i=1}^{n} −P(v_i) log2 P(v_i)

        H(1/2, 1/2) = −1/2 log2 1/2 − 1/2 log2 1/2 = 1 bit
        H(1/100, 99/100) = −0.01 log2 0.01 − 0.99 log2 0.99 ≈ 0.08 bits
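Both values can be checked numerically (a sketch; note the arguments must be probabilities summing to 1, so the lopsided case is H(1/100, 99/100)):

```python
from math import log2

def H(*probs):
    # H(P(v_1), ..., P(v_n)) = sum_i -P(v_i) * log2(P(v_i));
    # the p > 0 guard uses the convention 0 * log2(0) = 0.
    return sum(-p * log2(p) for p in probs if p > 0)

print(H(1/2, 1/2))                  # 1.0 bit: an even split needs a full bit
print(round(H(1/100, 99/100), 2))   # 0.08 bits: a lopsided split needs far less
```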
  31. Information still needed after testing attribute A = Remainder(A)

      Remainder(A) = Σ_i (p_i + n_i)/(p + n) · H(p_i/(p_i + n_i), n_i/(p_i + n_i))

      In the example, p = n = 5 (p + n = 10 documents), so H(1/2, 1/2) = 1 bit before any split.
  32. Information Gain
      - Information gain (from the attribute test) = difference between the original information requirement and the new requirement:

        Gain(A) = H(p/(p+n), n/(p+n)) − Remainder(A)
  33. Example

      #    stock  rolling  the   class
      1    0      3        40    other
      2    6      8        35    finance
      3    7      7        25    other
      4    5      7        14    other
      5    8      2        20    finance
      6    9      4        25    finance
      7    5      6        20    finance
      8    0      2        35    other
      9    0      11       25    finance
      10   0      15       28    other
  34. Computing the gains

      stock:   <5 → {1, 8, 9, 10}    5-10 → {2, 3, 4, 5, 6, 7}
      rolling: <5 → {1, 5, 6, 8}     5-10 → {2, 3, 4, 7}     ≥10 → {9, 10}

      Gain(stock)   = 1 − [4/10·H(1/4, 3/4) + 6/10·H(4/6, 2/6)]
                    = 1 − [0.4(0.811) + 0.6(0.918)] ≈ 0.125
      Gain(rolling) = 1 − [4/10·H(1/2, 1/2) + 4/10·H(1/2, 1/2) + 2/10·H(1/2, 1/2)] = 0
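These gains can be verified with a few lines (a sketch; the (p_i, n_i) subset counts are read from the splits above, with finance as positive):

```python
from math import log2

def H2(p, n):
    """Entropy of a split with p positive and n negative examples."""
    t = p + n
    return sum(-q / t * log2(q / t) for q in (p, n) if q > 0)

def gain(subsets, p, n):
    """subsets: (p_i, n_i) counts for each value of the attribute."""
    remainder = sum((pi + ni) / (p + n) * H2(pi, ni) for pi, ni in subsets)
    return H2(p, n) - remainder

# stock: <5 has 1 finance / 3 other; 5-10 has 4 finance / 2 other
print(round(gain([(1, 3), (4, 2)], 5, 5), 3))           # 0.125
# rolling: every subset is an even finance/other split, so nothing is gained
print(round(gain([(2, 2), (2, 2), (1, 1)], 5, 5), 3))   # 0.0
```

So stock is the more informative attribute and would be chosen as the root test.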
  35. Other cases
      - What if the class is discrete-valued, not binary?
      - What if an attribute has many values (e.g., one per instance)?
  36. Training vs. Testing
      - A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data:
        - Collect a large set of examples (with classifications)
        - Divide into two disjoint sets: the training set and the test set
        - Apply the learning algorithm to the training set, generating hypothesis h
        - Measure the percentage of examples in the test set that are correctly classified by h
        - Repeat for different sizes of training sets and different randomly selected training sets of each size
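The methodology above can be sketched as a small harness (the learner interface is an assumption: learn(train) returns a hypothesis h, and examples are (x, label) pairs):

```python
import random

def holdout_accuracy(learn, examples, train_fraction=0.8, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]   # two disjoint sets
    h = learn(train)                               # h sees only the training set
    return sum(h(x) == y for x, y in test) / len(test)

# Toy usage: a "learner" that ignores its training data and thresholds at 5.
data = [(i, i > 5) for i in range(20)]
acc = holdout_accuracy(lambda train: (lambda x: x > 5), data)
```

Repeating this for growing training-set sizes and plotting the accuracies gives the learning curve the slide alludes to.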
  38. Division into 3 Sets
      - Inadvertent peeking:
        - Some parameters must themselves be learned (e.g., how to split values)
        - We generate different hypotheses for different parameter values on the training data
        - Choosing the parameter values that perform best on the test data lets the test data influence the hypothesis, so reserve a third (validation) set for that choice
        - Why do we need to do this when selecting the best attributes?
  39. Overfitting
      - Learning algorithms may use irrelevant attributes to make decisions
        - For news, the day published and the newspaper
      - Decision tree pruning:
        - Prune away attributes with low information gain
        - Use statistical significance to test whether the gain is meaningful
  40. K-fold Cross-Validation
      - To reduce overfitting
      - Run k experiments:
        - Use a different 1/k of the data for testing each time
        - Average the results
      - Common choices: 5-fold, 10-fold, leave-one-out
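The k experiments can be sketched as follows (same hypothetical learner interface as before: learn(train) returns a hypothesis h; examples are (x, label) pairs). Each example is tested exactly once, on a hypothesis trained without it:

```python
def k_fold_accuracy(learn, examples, k=5):
    scores = []
    for i in range(k):
        test = examples[i::k]                      # a different 1/k each round
        train = [e for j, e in enumerate(examples) if j % k != i]
        h = learn(train)
        scores.append(sum(h(x) == y for x, y in test) / len(test))
    return sum(scores) / k                         # average the k results

data = [(i, i >= 5) for i in range(10)]
acc = k_fold_accuracy(lambda train: (lambda x: x >= 5), data, k=5)
```

With k = len(examples) this is leave-one-out; in practice one would also shuffle before slicing so the folds are not order-dependent.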
  41. Ensemble Learning
      - Learn from a collection of hypotheses
      - Combine them by majority voting
      - Enlarges the hypothesis space
  42. Boosting
      - Uses a weighted training set
        - Each example has an associated weight w_j ≥ 0
        - Higher-weighted examples have higher importance
      - Initially, w_j = 1 for all examples; generate hypothesis h_1
      - Next round: increase the weights of misclassified examples, decrease the other weights
      - From the new weighted set, generate hypothesis h_2
      - Continue until M hypotheses have been generated
      - Final ensemble hypothesis = weighted-majority combination of all M hypotheses
        - Weight each hypothesis according to how well it did on the training data
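The loop above can be sketched as follows (an AdaBoost-style variant; the weak-learner interface weak_learn(examples, weights) → hypothesis, and the specific multiplicative weight update, are assumptions rather than details given on the slide):

```python
import math

def boost(weak_learn, examples, M):
    w = [1.0] * len(examples)                      # initially w_j = 1
    ensemble = []                                  # (hypothesis, vote weight)
    for _ in range(M):
        h = weak_learn(examples, w)
        err = sum(wj for wj, (x, y) in zip(w, examples) if h(x) != y) / sum(w)
        if err == 0:                               # perfect on the weighted set
            ensemble.append((h, 1.0))
            break
        alpha = 0.5 * math.log((1 - err) / err)    # better hypotheses vote more
        for j, (x, y) in enumerate(examples):      # raise misclassified, lower rest
            w[j] *= math.exp(alpha if h(x) != y else -alpha)
        ensemble.append((h, alpha))
    def weighted_majority(x):
        return sum(a if h(x) else -a for h, a in ensemble) > 0
    return weighted_majority

# Toy usage with a trivially strong "weak" learner (illustrative only):
H = boost(lambda ex, w: (lambda x: x > 0), [(1, True), (-1, False)], M=3)
```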
  43. AdaBoost
      - If the input learning algorithm L is a weak learning algorithm
        - I.e., L always returns a hypothesis with weighted error on the training set slightly better than random
      - Then AdaBoost returns a hypothesis that classifies the training data perfectly, for large enough M
      - It boosts the accuracy of the original learning algorithm on the training data
  44. Issues
      - Representation: how to map from a representation in the domain to a representation used for learning?
      - Training data: how can training data be acquired?
      - Amount of training data: how well does the algorithm do as we vary the amount of data?
      - Which attributes influence learning most?
      - Does the learning algorithm provide insight into the generalizations made?
