1. Overview of Machine Learning
Raymond J. Mooney
Department of Computer Sciences, University of Texas at Austin

2. What is Learning?
- Definition by H. Simon: "Any process by which a system improves performance."
- What is the task?
  - Classification/categorization
  - Problem solving
  - Planning
  - Control
  - Language understanding

3. Classification Examples
- Medical diagnosis
- Credit card applications or transactions
- DNA sequences
  - Promoter
  - Splice-junction
  - Protein structure
- Spoken words
- Handwritten characters
- Astronomical images
- Market basket analysis

4. Other Tasks
- Solving calculus problems
- Playing games
  - Checkers
  - Chess
  - Backgammon
- Pole balancing
- Driving a car
- Flying a helicopter
- Robot navigation

5. How is Performance Measured?
- Classification accuracy
  - False positives
  - False negatives
- Precision/recall/F-measure
- Solution correctness and quality (optimality)
  - Number of questions answered correctly
  - Distance traveled for a navigation problem
- Percentage of games won against an opponent
- Time to find a solution

6. Training Experience
- Direct supervision
  - Checkers board positions labeled with the correct move.
  - Road images with the correct steering position.
- Indirect supervision (delayed reward, reinforcement learning)
  - Choose a sequence of checkers moves and eventually win or lose the game.
  - Drive a car and be rewarded for reaching the destination.

7. Types of Direct Supervision
- Examples chosen by a benevolent teacher
  - "Near miss" negative examples
- Random examples from the environment
  - Positive and negative examples
  - Positive examples only
- Choose examples for a teacher (oracle) to classify.
- Design and run one's own experiments.

8. Categorization
- Given:
  - A description of an instance, x ∈ X, where X is the instance language or instance space.
  - A fixed set of categories: C = {c1, c2, … cn}
  - A categorization function, c(x), whose domain is X and whose range is C.
- Determine:
  - The category of x: c(x) ∈ C

9. Learning for Categorization
- A training example is an instance x ∈ X paired with its correct category c(x), written <x, c(x)>, for an unknown categorization function c.
- Given:
  - A set of training examples, D.
  - A hypothesis space, H, of possible categorization functions, h(x).
- Find a consistent hypothesis, h(x) ∈ H; the consistency condition is spelled out below.
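In symbols, the consistency requirement is presumably the standard one: the hypothesis must agree with the target function on every training example.

```latex
\forall \langle x, c(x) \rangle \in D: \quad h(x) = c(x)
```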
10. Sample Category Learning Problem
- Instance language: <size, color, shape>
  - size ∈ {small, medium, large}
  - color ∈ {red, blue, green}
  - shape ∈ {square, circle, triangle}
- C = {positive, negative}
- D:

  Example  Size   Color  Shape     Category
  1        small  red    circle    positive
  2        large  red    circle    positive
  3        small  red    triangle  negative
  4        large  blue   circle    negative

11. General Learning Issues
- Many hypotheses are usually consistent with the training data.
- Bias
  - Any criterion other than consistency with the training data that is used to select a hypothesis.
- Classification accuracy (% of instances classified correctly).
  - Measured on independent test data.
- Training time (efficiency of the training algorithm).
- Testing time (efficiency of subsequent classification).

12. Learning as Search
- Learning for categorization requires searching for a consistent hypothesis in a given space, H.
- Enumerate-and-test is a possible algorithm for any finite or countably infinite H.
- Most hypothesis spaces are very large:
  - Conjunctions over n binary features: 3^n
  - All binary functions over n binary features: 2^(2^n)
- Efficient algorithms are needed for finding a consistent hypothesis without enumerating them all.
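The two counts follow from a short argument: in a conjunction, each feature is either required to be true, required to be false, or omitted, while an arbitrary boolean function independently chooses an output for each of the 2^n possible inputs.

```latex
|H_{\text{conj}}| = 3^{n}, \qquad |H_{\text{all}}| = 2^{\,2^{n}}
```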
13. Types of Bias
- Language bias: limit the hypothesis space a priori to a restricted set of functions.
- Search bias: employ a hypothesis space that includes all possible functions, but use a search algorithm that prefers simpler hypotheses.
  - Since finding the simplest hypothesis is usually intractable (e.g., NP-hard), satisficing heuristic search is usually employed.

14. Generalization
- Hypotheses must generalize to correctly classify instances not in the training data.
- Simply memorizing the training examples is a consistent hypothesis that does not generalize.
- Occam's razor:
  - Finding a simple hypothesis helps ensure generalization.

15. Over-Fitting
- Frequently, complete consistency with the training data is not desirable.
- A completely consistent hypothesis may be fitting errors and noise in the training data, preventing generalization.
- There is usually a trade-off between hypothesis complexity and degree of fit to the training data.
- Methods for preventing over-fitting:
  - A predetermined strong language bias.
  - "Pruning" or "early stopping" criteria to prevent learning overly complex hypotheses.

16. Learning Approaches

  Approach                                Representation         Search Method
  Rule Induction                          Rules                  Greedy set covering
  Decision tree induction                 Decision trees         Greedy divide & conquer
  Neural Network                          Artificial neural net  Gradient descent
  Nearest Neighbor (Instance/Case-based)  Stored instances       Memorize, then find closest match
  Bayes Net                               Bayesian Network       Maximum likelihood/EM
  Hidden Markov Model                     HMM                    EM (forward-backward)
  Probabilistic Grammar                   PCFG                   EM (inside-outside)

17. More Learning Approaches

  Approach                      Representation     Search Method
  Maximum Entropy (MaxEnt)      Exponential Model  Generalized/Improved Iterative Scaling
  Support Vector Machine (SVM)  Hyperplane         Quadratic optimization
  Prototype                     Average instance   Averaging
  Inductive Logic Programming   Prolog program     Greedy set covering
  Evolutionary computation      Rules/neural-nets  Genetic algorithm

18. Text Categorization
- Assigning documents to a fixed set of categories.
- Applications:
  - Web pages
    - Recommending
    - Yahoo-like classification
  - Newsgroup messages
    - Recommending
    - Spam filtering
  - News articles
    - Personalized newspaper
  - Email messages
    - Routing
    - Prioritizing
    - Folderizing
    - Spam filtering

19. Relevance Feedback Architecture
[Diagram: the user's query string goes to the IR system, which returns documents from the document corpus ranked 1. Doc1, 2. Doc2, 3. Doc3, ...; the user provides feedback on the ranked documents; query reformulation produces a revised query; the system returns re-ranked documents, e.g., 1. Doc2, 2. Doc4, 3. Doc5, ...]

20. Using Relevance Feedback (Rocchio)
- Relevance feedback methods can be adapted for text categorization.
- Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
- For each category, compute a prototype vector by summing the vectors of the training documents in the category.
- Assign test documents to the category with the closest prototype vector based on cosine similarity.

21. Illustration of Rocchio Text Categorization

22. Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c1, c2, … cn}.
For i from 1 to n, let pi = <0, 0, …, 0> (initialize the prototype vectors).
For each training example <x, c(x)> ∈ D:
  Let d be the frequency-normalized TF/IDF term vector for document x.
  Let i = j such that cj = c(x) (sum all the document vectors in ci to get pi):
    Let pi = pi + d

23. Rocchio Text Categorization Algorithm (Test)
Given test document x:
  Let d be the TF/IDF weighted term vector for x.
  Let m = -2 (initialize the maximum cosSim).
  For i from 1 to n (compute similarity to each prototype vector):
    Let s = cosSim(d, pi)
    If s > m:
      Let m = s
      Let r = ci (update the most similar class prototype)
  Return class r
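A minimal Python sketch of the two Rocchio procedures above. It assumes documents have already been converted to TF/IDF-weighted sparse vectors (dicts mapping term to weight); building those vectors is not shown.

```python
import math
from collections import defaultdict

def train_rocchio(examples):
    """examples: list of (tfidf_vector, category). Returns one prototype per category."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vector, category in examples:
        for term, weight in vector.items():
            prototypes[category][term] += weight   # sum document vectors per class
    return prototypes

def cos_sim(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify_rocchio(vector, prototypes):
    best_class, best_sim = None, -2.0              # -2 is below any cosine similarity
    for category, prototype in prototypes.items():
        s = cos_sim(vector, prototype)
        if s > best_sim:
            best_class, best_sim = category, s
    return best_class
```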
24. Rocchio Properties
- Does not guarantee a consistent hypothesis.
- Forms a simple generalization of the examples in each class (a prototype).
- The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
- Classification is based on similarity to the class prototypes.

25. Rocchio Time Complexity
- Note: the time to add two sparse vectors is proportional to the minimum number of non-zero entries in the two vectors.
- Training time: O(|D|(Ld + |Vd|)) = O(|D| Ld), where Ld is the average length of a document in D and Vd is the average vocabulary size for a document in D.
- Test time: O(Lt + |C||Vt|), where Lt is the average length of a test document and |Vt| is the average vocabulary size for a test document.
  - Assumes the lengths of the pi vectors are computed and stored during training, allowing cosSim(d, pi) to be computed in time proportional to the number of non-zero entries in d (i.e., |Vt|).

26. Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based
  - Instance-based
  - Memory-based
  - Lazy learning

27. K Nearest-Neighbor
- Using only the closest example to determine the categorization is subject to errors due to:
  - A single atypical example.
  - Noise (i.e., error) in the category label of a single training example.
- A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
- The value of k is typically odd to avoid ties; 3 and 5 are most common.

28. Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric.
- The simplest for a continuous m-dimensional instance space is Euclidean distance.
- The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
- For text, cosine similarity of TF-IDF weighted vectors is typically most effective.

29. 3 Nearest Neighbor Illustration (Euclidean Distance)

30. K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> ∈ D:
    Compute the corresponding TF-IDF vector, dx, for document x.
Test instance y:
  Compute the TF-IDF vector d for document y.
  For each <x, c(x)> ∈ D:
    Let sx = cosSim(d, dx)
  Sort the examples x in D by decreasing value of sx.
  Let N be the first k examples in D (the most similar neighbors).
  Return the majority class of the examples in N.
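A minimal Python sketch of the k-NN procedure above, again assuming documents are already TF-IDF-weighted sparse vectors (dicts); the training set D and the value of k are supplied by the caller.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(test_vector, D, k=3):
    """D: list of (tfidf_vector, category) training examples."""
    # Score every training example and keep the k most similar neighbors.
    scored = sorted(D, key=lambda ex: cosine(test_vector, ex[0]), reverse=True)
    neighbors = scored[:k]
    # Return the majority class among the neighbors.
    votes = Counter(category for _, category in neighbors)
    return votes.most_common(1)[0][0]
```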
31. Illustration of 3 Nearest Neighbor for Text

32. Rocchio Anomaly
- Prototype models have problems with polymorphic (disjunctive) categories.

33. 3 Nearest Neighbor Comparison
- Nearest neighbor tends to handle polymorphic categories better.

34. Nearest Neighbor Time Complexity
- Training time: O(|D| Ld) to compose the TF-IDF vectors.
- Testing time: O(Lt + |D||Vt|) to compare to all training vectors.
  - Assumes the lengths of the dx vectors are computed and stored during training, allowing cosSim(d, dx) to be computed in time proportional to the number of non-zero entries in d (i.e., |Vt|).
- Testing time can be high for large training sets.

35. Nearest Neighbor with Inverted Index
- Determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
- Use standard VSR inverted-index methods to find the k nearest neighbors.
- Testing time: O(B|Vt|), where B is the average number of training documents in which a test-document word appears.
- Therefore, overall classification is O(Lt + B|Vt|).
  - Typically B << |D|.

36. Bayesian Methods
- Learning and classification methods based on probability theory.
- Bayes' theorem plays a critical role in probabilistic learning and classification.
- Uses the prior probability of each category given no information about an item.
- Categorization produces a posterior probability distribution over the possible categories given a description of an item.

37. Conditional Probability
- P(A | B) is the probability of A given B.
- Assumes that B is all and only the information known.
- Defined by the formula below.
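The standard definition:

```latex
P(A \mid B) = \frac{P(A \wedge B)}{P(B)}
```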
38. Independence
- A and B are independent iff: (first equation below)
- Therefore, if A and B are independent: (second equation below)
- These two constraints are logically equivalent.
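Presumably the two constraints referred to are the standard ones: independence as invariance under conditioning, and the resulting product rule.

```latex
P(A \mid B) = P(A), \qquad P(B \mid A) = P(B)
```

```latex
P(A \wedge B) = P(A \mid B)\,P(B) = P(A)\,P(B)
```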
39. Bayes Theorem
- Simple proof from the definition of conditional probability (written out below).
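The theorem and its two-line proof, written out:

```latex
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}
% Proof: by the definition of conditional probability,
%   P(H \wedge E) = P(H \mid E)\,P(E)   and   P(H \wedge E) = P(E \mid H)\,P(H),
% so P(H \mid E)\,P(E) = P(E \mid H)\,P(H); dividing by P(E) gives the theorem. QED.
```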
40. Bayesian Categorization
- Let the set of categories be {c1, c2, … cn}.
- Let E be a description of an instance.
- Determine the category of E by determining P(ci | E) for each ci (see below).
- P(E) can be determined since the categories are complete and disjoint.
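In symbols; because the categories are complete and disjoint, P(E) is obtained by summing over them:

```latex
P(c_i \mid E) = \frac{P(c_i)\,P(E \mid c_i)}{P(E)}, \qquad
P(E) = \sum_{i=1}^{n} P(c_i \wedge E) = \sum_{i=1}^{n} P(c_i)\,P(E \mid c_i)
```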
41. Bayesian Categorization (cont.)
- Need to know:
  - Priors: P(ci)
  - Conditionals: P(E | ci)
- P(ci) are easily estimated from data.
  - If ni of the examples in D are in ci, then P(ci) = ni / |D|.
- Assume an instance is a conjunction of binary features: E = e1 ∧ e2 ∧ … ∧ em.
- There are too many possible instances (exponential in m) to estimate all P(E | ci).

42. Naïve Bayesian Categorization
- If we assume the features of an instance are independent given the category ci (conditionally independent), then P(E | ci) factors as shown below.
- Therefore, we then only need to know P(ej | ci) for each feature and category.
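The conditional-independence assumption in symbols:

```latex
P(E \mid c_i) = P(e_1 \wedge e_2 \wedge \cdots \wedge e_m \mid c_i)
             = \prod_{j=1}^{m} P(e_j \mid c_i)
```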
43. Naïve Bayes Example
- C = {allergy, cold, well}
- e1 = sneeze; e2 = cough; e3 = fever
- E = {sneeze, cough, ¬fever}

                 Well   Cold   Allergy
  P(ci)          0.9    0.05   0.05
  P(sneeze|ci)   0.1    0.9    0.9
  P(cough|ci)    0.1    0.8    0.7
  P(fever|ci)    0.01   0.7    0.4

44. Naïve Bayes Example (cont.)
E = {sneeze, cough, ¬fever}, with the probability table from the previous slide.
- P(well | E) = (0.9)(0.1)(0.1)(0.99)/P(E) = 0.0089/P(E)
- P(cold | E) = (0.05)(0.9)(0.8)(0.3)/P(E) = 0.01/P(E)
- P(allergy | E) = (0.05)(0.9)(0.7)(0.6)/P(E) = 0.019/P(E)
- Most probable category: allergy
- P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
- P(well | E) = 0.23
- P(cold | E) = 0.26
- P(allergy | E) = 0.50
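A small Python check of the arithmetic above, using the probability table from the slide. Note that the slide rounds the unnormalized scores before normalizing, so its final posteriors (0.23 / 0.26 / 0.50) differ slightly from the exact values printed here.

```python
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(feature is true | class)
    "sneeze": {"well": 0.1,  "cold": 0.9, "allergy": 0.9},
    "cough":  {"well": 0.1,  "cold": 0.8, "allergy": 0.7},
    "fever":  {"well": 0.01, "cold": 0.7, "allergy": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}

scores = {}
for c, prior in priors.items():
    score = prior
    for feature, present in evidence.items():
        p = cond[feature][c]
        score *= p if present else (1.0 - p)
    scores[c] = score                      # unnormalized P(c) * P(E | c)

total = sum(scores.values())               # = P(E)
for c, s in scores.items():
    print(c, round(s, 4), round(s / total, 2))
# well 0.0089 0.23, cold 0.0108 0.28, allergy 0.0189 0.49 -> allergy wins
```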
45. Estimating Probabilities
- Normally, probabilities are estimated based on observed frequencies in the training data.
- If D contains ni examples in category ci, and n_ij of these ni examples contain feature ej, then: P(ej | ci) = n_ij / ni.
- However, estimating such probabilities from small training sets is error-prone.
- If, due only to chance, a rare feature ek is always false in the training data, then ∀ci: P(ek | ci) = 0.
- If ek then occurs in a test example E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0.

46. Smoothing
- To account for estimation from small samples, probability estimates are adjusted, or smoothed.
- Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a "virtual" sample of size m (formula below).
- For binary features, p is simply assumed to be 0.5.
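Presumably the intended m-estimate is the usual one, with n_i and n_ij as defined on the previous slide:

```latex
P(e_j \mid c_i) = \frac{n_{ij} + m\,p}{n_i + m}
```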
47. Naïve Bayes for Text
- Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, … wm} based on the probabilities P(wj | ci).
- Smooth probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
  - Equivalent to a virtual sample of seeing each word in each category exactly once.

48. Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D.
For each category ci ∈ C:
  Let Di be the subset of documents in D in category ci.
    P(ci) = |Di| / |D|
  Let Ti be the concatenation of all the documents in Di.
  Let ni be the total number of word occurrences in Ti.
  For each word wj ∈ V:
    Let n_ij be the number of occurrences of wj in Ti.
    Let P(wj | ci) = (n_ij + 1) / (ni + |V|)

49. Text Naïve Bayes Algorithm (Test)
Given a test document X:
  Let n be the number of word occurrences in X.
  Return the category:
    argmax over ci ∈ C of P(ci) · ∏ from j = 1 to n of P(aj | ci)
  where aj is the word occurring in the jth position in X.
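A minimal Python sketch of the training and test procedures above. Documents are assumed to be lists of already-tokenized words, and D is a list of (words, category) pairs; the log-space variant discussed on the "Underflow Prevention" slide is sketched separately below.

```python
from collections import Counter, defaultdict

def train_text_nb(D):
    vocab = {w for words, _ in D for w in words}
    docs_in = defaultdict(list)
    for words, c in D:
        docs_in[c].append(words)
    priors, cond = {}, {}
    for c, docs in docs_in.items():
        priors[c] = len(docs) / len(D)
        counts = Counter(w for words in docs for w in words)   # occurrences in T_i
        n_i = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (n_i + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify_text_nb(words, priors, cond, vocab):
    best_c, best_score = None, -1.0
    for c in priors:
        score = priors[c]
        for w in words:
            if w in vocab:                     # ignore words never seen in training
                score *= cond[c][w]
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```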
50. Naïve Bayes Time Complexity
- Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  - Assumes V and all Di, ni, and n_ij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  - Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
- Test time: O(|C| Lt), where Lt is the average length of a test document.
- Very efficient overall; linearly proportional to the time needed just to read in all the data.
- Similar to Rocchio time complexity.

51. Underflow Prevention
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log probability score is still the most probable.
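A log-space version of the classifier sketched above (same assumed priors/cond/vocab dictionaries): summing log-probabilities instead of multiplying probabilities avoids floating-point underflow.

```python
import math

def classify_text_nb_log(words, priors, cond, vocab):
    def log_score(c):
        s = math.log(priors[c])
        s += sum(math.log(cond[c][w]) for w in words if w in vocab)
        return s
    # The class with the highest un-normalized log score is the most probable.
    return max(priors, key=log_score)
```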
52. Naïve Bayes Posterior Probabilities
- Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
- However, due to the inadequacy of the conditional-independence assumption, the actual posterior-probability numerical estimates are not.
  - Output probabilities are generally very close to 0 or 1.

53. Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
- Results can vary based on sampling error due to different training and test sets.
- Average results over multiple training and test sets (splits of the overall data) for the best results.

54. N-Fold Cross-Validation
- Ideally, test and training sets are independent on each trial.
  - But this would require too much labeled data.
- Partition the data into N equal-sized disjoint segments.
- Run N trials, each time using a different segment of the data for testing and training on the remaining N - 1 segments.
- This way, at least the test sets are independent.
- Report the average classification accuracy over the N trials.
- Typically, N = 10.
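A minimal Python sketch of N-fold cross-validation under stated assumptions: data is a list of (x, label) pairs with at least n_folds items, train_fn(train_set) returns a classifier, and classify_fn(model, x) returns a predicted label.

```python
import random

def cross_validate(data, train_fn, classify_fn, n_folds=10, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]   # N disjoint segments
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        correct = sum(classify_fn(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds                      # average accuracy over N trials
```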
55. Learning Curves
- In practice, labeled data is usually rare and expensive.
- We would like to know how performance varies with the number of training instances.
- Learning curves plot classification accuracy on independent test data (Y axis) versus the number of training examples (X axis).

56. N-Fold Learning Curves
- We want learning curves averaged over multiple trials.
- Use N-fold cross-validation to generate N full training and test sets.
- For each trial, train on increasing fractions of the training set, measuring accuracy on the test data for each point on the desired learning curve.

57. Sample Document Corpus
- 600 science pages from the web.
- 200 random samples each from the Yahoo indices for biology, physics, and chemistry.

58. Sample Learning Curve (Yahoo Science Data)

59. Clustering
- Partition unlabeled examples into disjoint subsets, or clusters, such that:
  - Examples within a cluster are very similar.
  - Examples in different clusters are very different.
- Discover new categories in an unsupervised manner (no sample category labels provided).

60. Clustering Example

61. Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
- Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
[Example dendrogram: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]

62. Agglomerative vs. Divisive Clustering
- Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
- Divisive (partitional, top-down) methods separate all examples immediately into clusters.

63. Direct Clustering Method
- Direct clustering methods require a specification of the number of clusters, k, desired.
- A clustering evaluation function assigns a real-valued quality measure to a clustering.
- The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

64. Hierarchical Agglomerative Clustering (HAC)
- Assumes a similarity function for determining the similarity of two instances.
- Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
- The history of merging forms a binary tree, or hierarchy.

65. HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
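A minimal Python sketch of the HAC loop above (naive, quadratic per merge, so O(n^3) overall rather than the maintained O(n^2) version discussed later). The pairwise instance-similarity function sim is supplied by the caller; link decides how cluster similarity is derived from it (max over pairs for single link, min for complete link).

```python
def hac(instances, sim, link=max):
    clusters = [[x] for x in instances]           # every instance starts in its own cluster
    merges = []                                   # record of the merge history (the hierarchy)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]        # replace c_i and c_j with c_i ∪ c_j
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```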
66. Cluster Similarity
- Assume a similarity function that determines the similarity of two instances: sim(x, y).
  - For example, cosine similarity of document vectors.
- How do we compute the similarity of two clusters, each possibly containing multiple instances?
  - Single link: similarity of the two most similar members.
  - Complete link: similarity of the two least similar members.
  - Group average: average similarity between members.

67. Single Link Agglomerative Clustering
- Use the maximum similarity of pairs: sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y).
- Can result in "straggly" (long and thin) clusters due to a chaining effect.
  - Appropriate in some domains, such as clustering islands.

68. Single Link Example

69. Complete Link Agglomerative Clustering
- Use the minimum similarity of pairs: sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y).
- Makes "tighter," more spherical clusters that are typically preferable.

70. Complete Link Example

71. Computational Complexity
- In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2).
- In each of the subsequent n - 2 merging iterations, the method must compute the distance between the most recently created cluster and all other existing clusters.
- In order to maintain an overall O(n^2) performance, computing the similarity to each other cluster must be done in constant time.

72. Computing Cluster Similarity
- After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by the single-link and complete-link update rules shown below.
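The update rules, written out; each follows directly from the single-link and complete-link definitions:

```latex
\operatorname{sim}\bigl((c_i \cup c_j), c_k\bigr) = \max\bigl(\operatorname{sim}(c_i, c_k),\, \operatorname{sim}(c_j, c_k)\bigr) \quad \text{(single link)}
\operatorname{sim}\bigl((c_i \cup c_j), c_k\bigr) = \min\bigl(\operatorname{sim}(c_i, c_k),\, \operatorname{sim}(c_j, c_k)\bigr) \quad \text{(complete link)}
```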
73. Group Average Agglomerative Clustering
- Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
- A compromise between single and complete link.
- Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters.

74. Computing Group Average Similarity
- Assume cosine similarity and normalized vectors with unit length.
- Always maintain the sum of the vectors in each cluster.
- Compute the similarity of clusters in constant time (see below).
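One standard way to write this (an assumption about the exact form intended): with s(c) denoting the maintained sum of the unit vectors in cluster c, the group-average similarity averages over ordered pairs of distinct members of the merged cluster, and the dot product of the summed vector with itself supplies all pairwise similarities plus the |c| unit self-similarities, which are subtracted off.

```latex
\operatorname{sim}(c_i, c_j)
  = \frac{1}{|c|\,(|c|-1)} \sum_{\vec{x} \in c} \; \sum_{\substack{\vec{y} \in c \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y},
  \qquad c = c_i \cup c_j

\operatorname{sim}(c_i, c_j)
  = \frac{\bigl(\vec{s}(c_i) + \vec{s}(c_j)\bigr) \cdot \bigl(\vec{s}(c_i) + \vec{s}(c_j)\bigr) - \bigl(|c_i| + |c_j|\bigr)}
         {\bigl(|c_i| + |c_j|\bigr)\,\bigl(|c_i| + |c_j| - 1\bigr)}
```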
75. Non-Hierarchical Clustering
- Typically must provide the number of desired clusters, k.
- Randomly choose k instances as seeds, one per cluster.
- Form initial clusters based on these seeds.
- Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
- Stop when the clustering converges or after a fixed number of iterations.

76. K-Means
- Assumes instances are real-valued vectors.
- Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c): μ(c) = (1/|c|) Σ over x ∈ c of x.
- Reassignment of instances to clusters is based on distance to the current cluster centroids.

77. Distance Metrics
- Euclidean distance (L2 norm)
- L1 norm
- Cosine similarity (transformed to a distance by subtracting from 1)
(Formulas below.)
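Written out for m-dimensional vectors:

```latex
L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}, \qquad
L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|, \qquad
d_{\cos}(\vec{x}, \vec{y}) = 1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}
```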
78. K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, … sk} as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj:
    Let sj = μ(cj)
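A minimal Python sketch of the k-means loop above for dense real-valued vectors (lists of floats), using Euclidean distance; the instances and k are supplied by the caller.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    return [sum(dim) / len(points) for dim in zip(*points)]

def k_means(instances, k, max_iters=100, seed=0):
    seeds = random.Random(seed).sample(instances, k)       # pick k random seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in instances:                                 # assignment step
            j = min(range(k), key=lambda i: euclidean(x, seeds[i]))
            clusters[j].append(x)
        # Update step: move each seed to the centroid of its cluster
        # (keep the old seed if a cluster happens to be empty).
        new_seeds = [centroid(c) if c else seeds[i] for i, c in enumerate(clusters)]
        if new_seeds == seeds:                              # converged
            break
        seeds = new_seeds
    return clusters, seeds
```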
79. K-Means Example (K=2)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]

80. Time Complexity
- Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(kn) distance computations, or O(knm).
- Computing centroids: each instance vector gets added once to some centroid: O(nm).
- Assume these two steps are each done once for I iterations: O(Iknm).
- Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n^2) HAC.

81. Seed Choice
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- Select good seeds using a heuristic or the results of another method.

82. Buckshot Algorithm
- Combines HAC and K-Means clustering.
- First randomly take a sample of instances of size √n.
- Run group-average HAC on this sample, which takes only O(n) time.
- Use the results of HAC as initial seeds for K-Means.
- The overall algorithm is O(n) and avoids problems of bad seed selection.

83. Text Clustering
- HAC and K-Means have been applied to text in a straightforward way.
- Typically use normalized, TF/IDF-weighted vectors and cosine similarity.
- Optimize computations for sparse vectors.
- Applications:
  - During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
  - Clustering of retrieval results to present more organized results to the user (à la Northern Light folders).
  - Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo and DMOZ).

84. Soft Clustering
- Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
- Does not allow uncertainty in class membership, or an instance belonging to more than one cluster.
- Soft clustering gives probabilities that an instance belongs to each of a set of clusters.
- Each instance is assigned a probability distribution across a set of discovered categories (the probabilities over all categories must sum to 1).

85. Expectation Maximization (EM)
- A probabilistic method for soft clustering.
- A direct method that assumes k clusters: {c1, c2, … ck}.
- A soft version of k-means.
- Assumes a probabilistic model of categories that allows computing P(ci | E) for each category ci for a given example E.
- For text, typically assume a naïve-Bayes category model.
  - Parameters θ = {P(ci), P(wj | ci) : i ∈ {1, … k}, j ∈ {1, …, |V|}}

86. EM Algorithm
- An iterative method for learning a probabilistic categorization model from unsupervised data.
- Initially assume a random assignment of examples to categories.
- Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
- Iterate the following two steps until convergence:
  - Expectation (E-step): Compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
  - Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data.

87. Learning from Probabilistically Labeled Data
- Instead of training data labeled with "hard" category labels, the training data is labeled with "soft" probabilistic category labels.
- When estimating the model parameters θ from the training data, weight the counts by the corresponding probability of the given category label.
- For example, if P(c1 | E) = 0.8 and P(c2 | E) = 0.2, each word wj in E contributes only 0.8 towards the counts n1 and n_1j, and 0.2 towards the counts n2 and n_2j.

88. Naïve Bayes EM
Randomly assign the examples probabilistic category labels.
Use standard naïve-Bayes training to learn a probabilistic model with parameters θ from the labeled data.
Until convergence or until the maximum number of iterations is reached:
  E-step: Use the naïve Bayes model θ to compute P(ci | E) for each category and example, and re-label each example using these probability values as soft category labels.
  M-step: Use standard naïve-Bayes training to re-estimate the parameters θ using these new probabilistic category labels.
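A compact Python sketch of the naïve-Bayes EM loop above for text, under the same assumptions as the earlier sketches: each document is a list of words, clusters are numbered 0..k-1, and a soft label is a list of k probabilities per document.

```python
import math
import random
from collections import defaultdict

def m_step(docs, soft_labels, k, vocab):
    """Re-estimate priors and word probabilities from soft (weighted) counts."""
    priors = [0.0] * k
    word_counts = [defaultdict(float) for _ in range(k)]
    totals = [0.0] * k
    for words, label in zip(docs, soft_labels):
        for c in range(k):
            w_c = label[c]
            priors[c] += w_c
            for w in words:
                word_counts[c][w] += w_c
                totals[c] += w_c
    priors = [p / len(docs) for p in priors]
    cond = [{w: (word_counts[c][w] + 1) / (totals[c] + len(vocab)) for w in vocab}
            for c in range(k)]
    return priors, cond

def e_step(docs, priors, cond, k):
    """Compute soft labels P(c | doc) under the current model, in log space."""
    soft_labels = []
    for words in docs:
        logs = [math.log(priors[c]) + sum(math.log(cond[c][w]) for w in words)
                for c in range(k)]
        m = max(logs)
        unnorm = [math.exp(l - m) for l in logs]
        z = sum(unnorm)
        soft_labels.append([u / z for u in unnorm])
    return soft_labels

def nb_em(docs, k, iterations=20, seed=0):
    rng = random.Random(seed)
    vocab = {w for words in docs for w in words}
    # Start from random soft labels, then alternate M- and E-steps.
    soft_labels = []
    for _ in docs:
        weights = [rng.random() for _ in range(k)]
        z = sum(weights)
        soft_labels.append([w / z for w in weights])
    for _ in range(iterations):
        priors, cond = m_step(docs, soft_labels, k, vocab)
        soft_labels = e_step(docs, priors, cond, k)
    return soft_labels, priors, cond
```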
89. Semi-Supervised Learning
- For supervised categorization, generating labeled training data is expensive.
- Idea: use unlabeled data to aid supervised categorization.
- Use EM in a semi-supervised mode by training EM on both labeled and unlabeled data.
  - Train the initial probabilistic model on the user-labeled subset of the data instead of randomly labeled unsupervised data.
  - The labels of user-labeled examples are "frozen" and never relabeled during EM iterations.
  - The labels of unsupervised data are constantly probabilistically relabeled by EM.

90. Semi-Supervised Example
- Assume "quantum" is present in several labeled physics documents, but "Heisenberg" occurs in none of the labeled data.
- From the labeled data, learn that "quantum" is indicative of a physics document.
- When labeling the unsupervised data, label several documents containing both "quantum" and "Heisenberg" correctly with the "physics" category.
- When retraining, learn that "Heisenberg" is also indicative of a physics document.
- The final learned model is able to correctly assign documents containing only "Heisenberg" to physics.

91. Semi-Supervision Results
- Experiments on assigning messages from 20 Usenet newsgroups to their proper newsgroup label.
- With very few labeled examples (2 examples per class), semi-supervised EM improved accuracy from 27% (supervised data only) to 43% (supervised + unsupervised data).
- With more labeled examples, semi-supervision can actually decrease accuracy, but refinements to standard EM can prevent this.
- For semi-supervised EM to work, the "natural clustering of the data" must be consistent with the desired categories.

92. Active Learning
- Select only the most informative examples for labeling.
- Initial methods:
  - Uncertainty sampling
  - Committee-based sampling
  - Error-reduction sampling

93. Weak Supervision
- Sometimes uncertain labeling can be inferred.
- Learning apprentices
- Inferred feedback
  - Click patterns, reading time, non-verbal cues
- Delayed feedback
  - Reinforcement learning
- Programming by demonstration

94. Prior Knowledge
- Use of prior declarative knowledge in learning.
- Initial methods:
  - Explanation-based learning
  - Theory refinement
  - Bayesian priors
  - Reinforcement learning with advice

95. Learning to Learn
- Many applications require learning for multiple, related problems.
- What can be learned from one problem that can aid learning for other problems?
- Initial approaches:
  - Multi-task learning
  - Life-long learning
  - Learning similarity metrics
  - Supra-classifiers
