Semi-supervised Learning

Transcript of "Semi-supervised Learning"

  1. Semi-Supervised Learning <ul><li>Qiang Yang </li></ul><ul><ul><li>Adapted from… </li></ul></ul><ul><li>Thanks </li></ul><ul><ul><li>Zhi-Hua Zhou </li></ul></ul><ul><ul><li>http://cs.nju.edu.cn/people/zhouzh/ </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>LAMDA Group, </li></ul></ul><ul><ul><li>National Laboratory for Novel Software Technology, Nanjing University, China </li></ul></ul>
  2. Supervised learning is a typical machine learning setting, where labeled examples are used as training examples. [Figure: labeled training data is used to train a model (decision trees, neural networks, support vector machines, etc.); the trained model then predicts the unknown label of unseen data, e.g. (Jeff, Professor, 7, ?) gives ? = yes.]
  3. Labeled vs. Unlabeled: In many practical applications, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain because labeling the unlabeled examples requires human effort. [Figure: a web page labeled class = “war” next to the (almost) infinite number of unlabeled web pages on the Internet.]
  4. Three main paradigms for Semi-supervised Learning: <ul><li>Transductive learning: </li></ul><ul><ul><ul><li>Unlabeled examples are exactly the test examples </li></ul></ul></ul><ul><li>Active learning: </li></ul><ul><ul><ul><li>Assume that a user can continue to label data </li></ul></ul></ul><ul><ul><ul><li>The learner actively selects some unlabeled examples to query from an oracle (assume the learner has some control over the input space) </li></ul></ul></ul><ul><li>Multi-view Learning </li></ul><ul><ul><ul><li>Unlabeled examples may be different from the test examples </li></ul></ul></ul><ul><ul><ul><li>Regularization (minimize error and maximize smoothness) </li></ul></ul></ul><ul><ul><ul><li>Multi-view Learning and Co-training </li></ul></ul></ul>
  5. SSL: Why can unlabeled data be helpful? [D.J. Miller & H.S. Uyar, NIPS’96] Suppose the data is well-modeled by a mixture density with parameters Θ = {θ_l}. The class labels are viewed as random quantities and are assumed chosen conditioned on the selected mixture component m_i ∈ {1, 2, …, L} and possibly on the feature value, i.e. according to the probabilities P[c_i | x_i, m_i]. The optimal classification rule for this model is then the MAP rule, and unlabeled examples can be used to help estimate one of the terms in that rule.
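A sketch of the model behind this slide, assuming the usual mixture notation. The mixing weights α_l and the decision function S are names chosen here for illustration and are not taken from the slide:

```latex
f(x_i \mid \Theta) = \sum_{l=1}^{L} \alpha_l \, f(x_i \mid \theta_l), \qquad \sum_{l=1}^{L} \alpha_l = 1

S(x_i) = \arg\max_{k} \sum_{j=1}^{L} P[c_i = k \mid m_i = j, x_i] \; P[m_i = j \mid x_i]

P[m_i = j \mid x_i] = \frac{\alpha_j \, f(x_i \mid \theta_j)}{\sum_{l=1}^{L} \alpha_l \, f(x_i \mid \theta_l)}
```

Under this reading, the factor P[m_i = j | x_i] depends only on the feature density and not on any class label, which is one way to see why unlabeled examples can help estimate it.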
  6. Transductive SVM: Takes into account a particular test set and tries to minimize misclassifications of just those particular examples. Concretely, it uses unlabeled examples to help identify the maximum-margin hyperplane. (Figure reprinted from [T. Joachims, ICML99].)
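One common way to write the transductive SVM training problem is sketched below. It is not the slide's own formula, and all symbols are assumptions introduced here: labeled pairs (x_i, y_i), unlabeled test points x_j^* with unknown labels y_j^*, slack variables ξ, and trade-off constants C and C^*.

```latex
\min_{y_1^*, \dots, y_k^*,\; w,\; b,\; \xi,\; \xi^*} \quad
  \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*

\text{subject to} \quad
  y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n

  y_j^* (w \cdot x_j^* + b) \ge 1 - \xi_j^*, \quad \xi_j^* \ge 0, \quad y_j^* \in \{-1, +1\}, \quad j = 1, \dots, k
```

The second set of constraints is what makes the hyperplane depend on the unlabeled test points: the margin is maximized over both the labeled data and the best labeling of the test set.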
  7. Active learning: Getting more from queries. The labels of the training examples are obtained by querying the oracle. Thus, for the same number of queries, more helpful information can be obtained by actively selecting some unlabeled examples to query. Key: to select the unlabeled examples on which the labeling will convey the most helpful information for the learner.
  8. Active Learning: Representative approaches <ul><li>Uncertainty sampling </li></ul><ul><ul><li>Train a single learner and then query the unlabeled instances on which the learner is the least confident </li></ul></ul><ul><ul><ul><li>[Lewis & Gale, SIGIR’94] </li></ul></ul></ul><ul><li>Committee-based sampling </li></ul><ul><ul><li>Generate a committee of multiple learners and select the unlabeled examples on which the committee members disagree the most [Abe & Mamitsuka, ICML’98; Seung et al., COLT’92] </li></ul></ul>
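The two query strategies can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementations: the function names, the use of vote entropy as the disagreement measure, and the example predictions are all assumptions made here.

```python
import math
from collections import Counter


def least_confident(probabilities, k=1):
    """Uncertainty sampling: indices of the k instances whose most likely
    class has the lowest predicted probability."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: max(probabilities[i]))
    return ranked[:k]


def vote_entropy(votes):
    """Disagreement of a committee on one instance, measured as the entropy
    of the label votes (zero means all members agree)."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(votes).values())


def most_disputed(committee_votes, k=1):
    """Committee-based sampling: indices of the k instances with the highest
    vote entropy."""
    ranked = sorted(range(len(committee_votes)),
                    key=lambda i: vote_entropy(committee_votes[i]),
                    reverse=True)
    return ranked[:k]


# Hypothetical predictions for four unlabeled instances.
probs = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20], [0.51, 0.49]]
votes = [["+", "+", "+"], ["+", "-", "+"], ["-", "-", "-"], ["+", "-", "-"]]

print(least_confident(probs, k=2))  # instances 3 and 1 are closest to 50/50
print(most_disputed(votes, k=2))    # instances 1 and 3 split the committee
```

In both cases the selected indices are the ones that would be sent to the oracle for labeling.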
  9. Active Learning Application: Image retrieval <ul><li>To retrieve images from a (usually large) image database according to user interest </li></ul><ul><ul><li>very useful in digital libraries, digital photo albums, etc. </li></ul></ul>Example query: “Where are my photos taken at Guilin?”
  10. Active Learning: Text-based image retrieval <ul><li>Every image is associated with a text annotation </li></ul><ul><li>User poses a keyword </li></ul><ul><li>The system retrieves images by matching the keyword with annotations </li></ul>[Figure: the query “tiger” goes through a text interface to a text-based retrieval engine over the database, which returns images annotated “tiger lily” and “white tiger”.]
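The keyword-matching step can be sketched as follows. This is a minimal illustration under stated assumptions: the `retrieve` function, the whole-word matching rule, and the image ids are hypothetical and not part of any system described on the slide.

```python
def retrieve(keyword, annotations):
    """Return the ids of images whose annotation contains the keyword
    as a whole word (case-insensitive)."""
    keyword = keyword.lower()
    return [image_id for image_id, text in annotations.items()
            if keyword in text.lower().split()]


# Hypothetical annotated database; the ids and file names are made up.
annotations = {
    "img1.jpg": "tiger lily",
    "img2.jpg": "white tiger",
    "img3.jpg": "Guilin river",
}

matches = retrieve("tiger", annotations)
print(matches)  # both the flower and the animal match the keyword
```

Because matching is purely textual, the query “tiger” returns every annotation containing that word, whatever the image actually shows.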
  11. Co-training <ul><li>In some applications, there are two sufficient and redundant views, i.e. two attribute sets each of which is sufficient for learning and conditionally independent of the other given the class label </li></ul><ul><ul><li>e.g. two views for web page classification: 1) the text appearing on the page itself, and 2) the anchor text attached to hyperlinks pointing to this page from other pages </li></ul></ul>
  12. Co-training (cont’d) [A. Blum & T. Mitchell, COLT98] [Figure: learner 1 (X1 view) and learner 2 (X2 view) are trained on the labeled training examples; each labels some of the unlabeled training examples, and those newly labeled examples are added to the training data.]
  13. Co-training (cont’d) <ul><li>Theoretical analysis [Blum & Mitchell, COLT’98; Dasgupta, NIPS’01; Balcan et al., NIPS’04; etc.] </li></ul><ul><li>Experimental studies [Nigam & Ghani, CIKM’00] </li></ul><ul><li>New algorithms </li></ul><ul><ul><li>Co-training without two views [Goldman & Zhou, ICML’00; Zhou & Li, TKDE’05] </li></ul></ul><ul><ul><li>Semi-supervised regression [Zhou & Li, IJCAI’05] </li></ul></ul><ul><li>Applications </li></ul><ul><ul><li>Statistical parsing [Sarkar, NAACL01; Steedman et al., EACL03; R. Hwa et al., ICML03w] </li></ul></ul><ul><ul><li>Noun phrase identification [Pierce & Cardie, EMNLP01] </li></ul></ul><ul><ul><li>Image retrieval [Zhou et al., ECML’04; Zhou et al., TOIS06] </li></ul></ul>
  14. Multi-view Learning and Co-training <ul><li>Multi-view learning describes the setting of learning from data where observations are represented by multiple independent sets of features. </li></ul><ul><li>An example of two views: </li></ul><ul><li>Features can be split into two sets: </li></ul><ul><ul><li>The instance space: X = X1 × X2 </li></ul></ul><ul><ul><li>Each instance: x = (x1, x2) </li></ul></ul>
  15. Inductive vs. Transductive <ul><li>Transductive: Produces labels only for the available unlabeled data. </li></ul><ul><ul><li>The output of the method is not a classifier. </li></ul></ul><ul><li>Inductive: Not only produces labels for the unlabeled data, but also produces a classifier. </li></ul>
  16. An Example of two views <ul><li>Web-page classification: e.g., find homepages of faculty members. </li></ul><ul><ul><li>Page text: words occurring on that page, e.g., “research interest”, “teaching” </li></ul></ul><ul><ul><li>Hyperlink text: words occurring in hyperlinks that point to that page, e.g., “my advisor” </li></ul></ul>
  17. Another Example: Classifying Jobs for FlipDog. X1: job title; X2: job description
  18. Two Views <ul><li>: the set of target functions over . </li></ul><ul><li>: the set of target functions over . </li></ul><ul><li>: the set of target functions over . </li></ul><ul><li>Instead of learning from , multi-view learning aims to learn a pair of functions from , such that . </li></ul>
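The symbols on this slide can plausibly be read as follows, assuming the instance space X = X_1 × X_2 with instances x = (x_1, x_2). The names F_1, F_2 and F for the function classes and Y for the label set are assumptions chosen here, not the slide's own notation:

```latex
F_1 = \{ f_1 : X_1 \to Y \}, \qquad
F_2 = \{ f_2 : X_2 \to Y \}, \qquad
F = \{ f : X \to Y \}

(f_1, f_2) \in F_1 \times F_2
\quad \text{such that} \quad
f_1(x_1) = f_2(x_2) \ \text{for instances } x = (x_1, x_2)
```

Under this reading, the learner searches the product class F_1 × F_2 for a pair of functions that agree, rather than searching F directly.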
  19. Co-training <ul><li>Proposed by (Blum and Mitchell 1998) </li></ul><ul><li>Combines multi-view learning and semi-supervised learning. </li></ul><ul><li>Related work: </li></ul><ul><ul><li>(Yarowsky 1995) </li></ul></ul><ul><ul><li>(Nigam and Ghani, 2000) </li></ul></ul><ul><ul><li>(Goldman and Zhou, 2000) </li></ul></ul><ul><ul><li>(Abney, 2002) </li></ul></ul><ul><ul><li>(Sarkar, 2002) </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Used in document classification, parsing, etc. </li></ul>
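  20. The Yarowsky Algorithm (Yarowsky 1995). Iteration 0: a classifier is trained by supervised learning (SL) on the labeled data. At each following iteration (1, 2, …), choose the instances labeled with high confidence and add them to the pool of current labeled training data. [Figure: the + / - decision boundary at iterations 0, 1 and 2.]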
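The self-training loop described on this slide can be sketched as below. This is a minimal illustration and not Yarowsky's original algorithm: the 1-D nearest-centroid classifier, the margin-based confidence, the `threshold` parameter, and the toy data are all assumptions made here. It also assumes binary labels "+" and "-" that are both present in the seed set.

```python
def train_centroids(labeled):
    """Fit a 1-D nearest-centroid classifier: the mean of each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, confidence); confidence is the margin between the
    distances to the two class centroids."""
    d_pos = abs(x - centroids["+"])
    d_neg = abs(x - centroids["-"])
    return ("+" if d_pos < d_neg else "-"), abs(d_pos - d_neg)


def self_train(labeled, unlabeled, threshold, max_iter=10):
    """Repeatedly retrain, then move confidently labeled instances from the
    unlabeled pool into the labeled pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        centroids = train_centroids(labeled)
        confident, rest = [], []
        for x in unlabeled:
            label, conf = predict(centroids, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing left that the classifier is sure about
        labeled.extend(confident)
        unlabeled = [x for x, _ in rest]
    return train_centroids(labeled), labeled, unlabeled


# Hypothetical toy data: two seed labels and six unlabeled points.
seed = [(0.0, "-"), (10.0, "+")]
pool = [1.0, 2.0, 4.0, 6.0, 8.5, 9.0]
centroids, labeled, remaining = self_train(seed, pool, threshold=3.0)
print(remaining)  # the two ambiguous middle points stay unlabeled
```

The threshold controls the trade-off between growing the labeled pool quickly and adding wrongly labeled instances to it.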
  21. Co-training Assumption 1: compatibility <ul><li>The instance distribution is compatible with the target function if for any with non-zero probability, . </li></ul><ul><li>Definition: compatibility of with : </li></ul>Each set of features is sufficient for classification
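Following the formulation in Blum and Mitchell's paper, the compatibility condition can be written as below. The symbols D for the instance distribution and f = (f_1, f_2) for the combined target function are assumptions about the slide's notation:

```latex
f(x) = f_1(x_1) = f_2(x_2)
\quad \text{for every } x = (x_1, x_2) \text{ with } \Pr_{D}[x] > 0
```

In words: on every instance that can actually occur, each view alone gives the same label as the full instance, which is what "each set of features is sufficient for classification" expresses.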
  22. Co-training Assumption 2: conditional independence <ul><li>Definition: A pair of views satisfy view independence when: </li></ul><ul><li>A classification problem instance satisfies view independence when all pairs satisfy view independence. </li></ul>
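A common way to state conditional independence of the two views given the label is sketched here. The random variables X_1 and X_2 for the views and Y for the class label are assumed names, not taken from the slide:

```latex
P(X_1 = x_1 \mid X_2 = x_2, Y = y) = P(X_1 = x_1 \mid Y = y)

P(X_2 = x_2 \mid X_1 = x_1, Y = y) = P(X_2 = x_2 \mid Y = y)
```

That is, once the label is known, seeing one view tells the learner nothing further about the other view.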
  23. Co-training Algorithm
  24. Co-Training (Blum and Mitchell 1998) <ul><li>Instances contain two sufficient sets of features </li></ul><ul><ul><li>i.e. an instance is x = (x1, x2) </li></ul></ul><ul><ul><li>Each set of features is called a View </li></ul></ul><ul><li>Two views are independent given the label: </li></ul><ul><li>Two views are consistent: </li></ul>
  25. Co-Training. C1: a classifier trained on view 1; C2: a classifier trained on view 2. At iteration t, allow C1 to label some instances and allow C2 to label some instances, then add the self-labeled instances to the pool of training data and continue with iteration t+1, and so on.
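The iteration on this slide can be sketched as below. This is a simplified variant for illustration, not the exact procedure from Blum and Mitchell's paper: each classifier adds only its single most confident instance per turn regardless of class, the classifiers are 1-D nearest-centroid models, and the function names and toy data are assumptions made here. The seed set must contain both labels "+" and "-".

```python
def fit(labeled, view):
    """Nearest-centroid classifier on one view: mean feature value per class.
    Assumes both classes occur in `labeled`."""
    by_class = {"+": [], "-": []}
    for x, y in labeled:
        by_class[y].append(x[view])
    return {y: sum(v) / len(v) for y, v in by_class.items()}


def predict(centroids, value):
    """Return (label, confidence) with confidence = distance margin."""
    d_pos = abs(value - centroids["+"])
    d_neg = abs(value - centroids["-"])
    return ("+" if d_pos < d_neg else "-"), abs(d_pos - d_neg)


def co_train(labeled, unlabeled, per_round=1, rounds=5):
    """Alternate between the two views; each classifier labels the instances
    it is most confident about and adds them to the shared training pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        for view in (0, 1):  # C1 uses view 0, C2 uses view 1
            if not unlabeled:
                break
            clf = fit(labeled, view)
            # Rank the unlabeled pool by this classifier's confidence.
            scored = sorted(unlabeled,
                            key=lambda x: predict(clf, x[view])[1],
                            reverse=True)
            for x in scored[:per_round]:
                label, _ = predict(clf, x[view])
                labeled.append((x, label))  # self-labeled instance
                unlabeled.remove(x)
    return fit(labeled, 0), fit(labeled, 1), labeled


# Hypothetical toy data: each instance is (view-1 feature, view-2 feature).
seed = [((0.0, 0.0), "-"), ((10.0, 10.0), "+")]
pool = [(1.0, 4.5), (5.5, 9.0), (9.0, 6.0), (4.0, 1.0)]
c1, c2, labeled = co_train(seed, pool)
labels = dict(labeled)
print(labels[(5.5, 9.0)])  # ambiguous in view 1, but labeled by C2
```

In this toy run the instance (5.5, 9.0) is ambiguous for C1 but easy for C2, so the label supplied by C2 becomes training data for both classifiers, which is the effect co-training relies on.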
  26. Agreement Maximization (Leskes 2005) <ul><li>A side effect of co-training: agreement between the two views. </li></ul><ul><li>Is it possible to pose agreement as the explicit goal? </li></ul><ul><ul><li>Yes. The resulting algorithm: Agreement Boost </li></ul></ul>
  27. What if the Co-training Assumption Is Not Perfectly Satisfied? <ul><li>Idea: Want classifiers that produce a maximally consistent labeling of the data </li></ul><ul><li>If learning is an optimization problem, what function should we optimize? </li></ul>
  28. Other Related Works <ul><li>Multi-view clustering (Bickel & Scheffer 2004) </li></ul><ul><li>Modified the co-training algorithm by replacing the class variable (class label) with a mixture coefficient to obtain a multi-view clustering algorithm. </li></ul><ul><li>Manifold co-regularization (Sindhwani et al., 2005) </li></ul><ul><li>Extended manifold regularization to multi-view learning. </li></ul><ul><li>Active multi-view learning (Muslea 2002) </li></ul><ul><li>Combines active learning and multi-view learning. </li></ul><ul><li>More related work can be found in the workshop on multi-view learning at ICML 2005: </li></ul><ul><li>http://www-ai.cs.uni-dortmund.de/MULTIVIEW2005/index.html </li></ul>
  29. References <ul><li>A. Blum and T. Mitchell, 1998. “Combining Labeled and Unlabeled Data with Co-Training,” In Proceedings of COLT 1998. </li></ul><ul><li>D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL 1995. </li></ul><ul><li>Nigam, K., & Ghani, R., 2000. Analyzing the effectiveness and applicability of co-training. In Proceedings of CIKM 2000. </li></ul><ul><li>Steven Abney, 2002. Bootstrapping. In Proceedings of ACL, 2002. </li></ul><ul><li>Ulf Brefeld and Tobias Scheffer. Co-EM support vector learning. In Proceedings of ICML, 2004. </li></ul><ul><li>Steffen Bickel and Tobias Scheffer. Multi-view clustering. In Proceedings of ICDM, 2004. </li></ul><ul><li>Sindhwani, V.; Niyogi, P.; and Belkin, M. 2005. A Co-Regularization Approach to Semi-supervised Learning with Multiple Views. In Workshop on Learning with Multiple Views at ICML 2005. </li></ul><ul><li>Ion Muslea. Active learning with multiple views. PhD thesis, University of Southern California, 2002. </li></ul>