Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

My presentation at SSLNLP Workshop (NAACL 2009) on June 4th, 2009

Transcript

  • 1. Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification? Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. Universidad Nacional de Educación a Distancia (UNED). June 4, 2009
  • 2. Index: 1 Text Classification; 2 Motivation; 3 Support Vector Machines; 4 Multiclass SVM; 5 S3VM; 6 Multiclass S3VM; 7 Compared Approaches: Multiclass SVM vs Multiclass S3VM; 8 Experiments; 9 Results; 10 Conclusions and Outlook; 11 Thank you. (Slide footer: A. Zubiaga, V. Fresno, R. Martínez (UNED), Unlabeled Data for Multiclass SVM, June 4, 2009)
  • 3. Text Classification: What is it? We have a set of documents, $D = \{d_1, \ldots, d_{|D|}\}$, and a set of predefined categories, $C = \{c_1, \ldots, c_{|C|}\}$. Classification decides, for each pair $\langle d_j, c_i \rangle \in D \times C$, whether document $d_j$ belongs to category $c_i$.
  • 5. Motivation. Several studies address plain-text classification (news), but only a few address web page classification. A typical web page classification task is semi-supervised (few labeled documents are available) and multiclass (the taxonomy has more than 2 categories). (Joachims, 1999) showed the suitability of unlabeled data for binary tasks. What about multiclass tasks? (Chapelle et al., 2006) did it over image datasets, but never for text/web pages.
  • 7. Support Vector Machines (SVM). An SVM looks for a hyperplane that separates the classes, chosen by margin maximization.
  • 9. Support Vector Machines (SVM). Optimization function: $\min \frac{1}{2}\|\omega\|^2 + C \cdot \sum_{i=1}^{n} \xi_i^d$, subject to: $y_i(\omega \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. By nature, it only handles binary, supervised problems.
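
To make the objective on this slide concrete, here is a minimal Python sketch that evaluates one half of the squared norm of ω plus C times the sum of slacks raised to the power d, for a fixed hyperplane. It assumes each slack takes the smallest value allowed by the constraints, max(0, 1 − y_i(ω · x_i + b)). The function name, the toy data and the default d = 1 are illustrative assumptions, not part of the talk, and the sketch only evaluates the objective rather than solving the optimization.

```python
def svm_objective(w, b, X, y, C=1.0, d=1):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi_i ** d) for a fixed hyperplane (w, b)."""
    slacks = []
    for x_i, y_i in zip(X, y):
        # Functional margin y_i * (w . x_i + b)
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        # Smallest slack that satisfies the constraint for this example
        slacks.append(max(0.0, 1.0 - margin))
    regularizer = 0.5 * sum(w_k * w_k for w_k in w)
    return regularizer + C * sum(s ** d for s in slacks)


# Toy data (assumed): the third point lies inside the margin.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, 1]
value = svm_objective((1.0, 0.0), 0.0, X, y)
print(value)
```

Only the third point contributes a non-zero slack here, which is what the margin-maximization picture on the previous slide suggests.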
  • 11. Multiclass SVM. Approaches to multiclass SVM: the direct approach, and combining binary classifiers (one-against-one or one-against-all). They are usually applied to supervised tasks, but hardly ever to semi-supervised ones.
  • 12. Multiclass SVM: Direct approach. The optimization function considers all the hyperplanes at the same time: $\min \frac{1}{2}\sum_{m=1}^{n}\|w_m\|^2 + C\sum_{i=1}^{l}\sum_{m \ne y_i}\xi_i^m$, subject to: $w_{y_i} \cdot x_i + b_{y_i} \ge w_m \cdot x_i + b_m + 2 - \xi_i^m$, $\xi_i^m \ge 0$.
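
The direct formulation can be illustrated in the same spirit. The sketch below evaluates the joint objective for a fixed set of per-class hyperplanes, again assuming each slack takes its smallest feasible value, max(0, w_m · x_i + b_m + 2 − (w_{y_i} · x_i + b_{y_i})) for every class m other than the true one. Classes are indexed from 0, and all names and data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def direct_multiclass_objective(W, b, X, y, C=1.0):
    """Evaluate 0.5 * sum_m ||w_m||^2 + C * sum_i sum_{m != y_i} xi_i^m."""
    regularizer = 0.5 * sum(dot(w_m, w_m) for w_m in W)
    slack_total = 0.0
    for x_i, y_i in zip(X, y):
        true_score = dot(W[y_i], x_i) + b[y_i]
        for m in range(len(W)):
            if m == y_i:
                continue  # the constraint only involves rival classes
            rival_score = dot(W[m], x_i) + b[m]
            # Slack needed for the true class to beat class m by a margin of 2
            slack_total += max(0.0, rival_score + 2.0 - true_score)
    return regularizer + C * slack_total


# Two classes, two examples (assumed toy data).
W = [(1.0, 0.0), (-1.0, 0.0)]
b = [0.0, 0.0]
X = [(2.0, 0.0), (-0.5, 0.0)]
y = [0, 1]
value = direct_multiclass_objective(W, b, X, y)
print(value)
```

The first example already beats its rival by more than 2, so only the second one adds slack.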
  • 16. Multiclass SVM: One-against-one. It creates $\frac{k \cdot (k-1)}{2}$ binary classifiers. Each classifier evaluates $\mathrm{sign}(\omega_{ij}^T \cdot x + b_{ij})$ and adds a vote for the winning class between $i$ and $j$. The class with the most votes is the output.
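
The voting scheme can be sketched in a few lines of Python. This is only an illustration under assumptions: the pairwise hyperplanes are given as a hypothetical dictionary mapping a class pair (i, j) to (w, b), a positive decision value counts as a vote for i, and ties are broken arbitrarily by insertion order.

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(x, classifiers, classes):
    """Vote over all k*(k-1)/2 pairwise classifiers and return the top class."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        w, b = classifiers[(i, j)]
        score = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        # A positive decision value votes for i, otherwise for j
        votes[i if score > 0 else j] += 1
    return votes.most_common(1)[0][0]


classes = ["a", "b", "c"]
# k = 3 classes, so 3 * 2 / 2 = 3 pairwise classifiers (toy hyperplanes, assumed).
classifiers = {
    ("a", "b"): ((1.0, 0.0), 0.0),
    ("a", "c"): ((0.0, 1.0), 0.0),
    ("b", "c"): ((-1.0, 1.0), 0.0),
}
predicted = one_against_one_predict((2.0, 1.0), classifiers, classes)
print(predicted)
```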
  • 20. Multiclass SVM: One-against-all. It creates $k$ binary classifiers. $\hat{C}_i = \arg\max_{i=1,\ldots,k} (\omega_i \cdot x + b_i)$
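
The arg max rule for one-against-all is equally short to sketch. As before, the dictionary of per-class hyperplanes and the sample point are hypothetical; in practice each (w, b) would come from a binary SVM trained to separate one class from all the others.

```python
def one_against_all_predict(x, classifiers):
    """Return the class whose binary classifier gives the largest decision value."""
    def decision_value(c):
        w, b = classifiers[c]
        return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b

    return max(classifiers, key=decision_value)


# One hyperplane per class, k = 3 (toy values, assumed).
classifiers = {
    "a": ((1.0, 0.0), 0.0),
    "b": ((0.0, 1.0), 0.0),
    "c": ((-1.0, -1.0), 0.5),
}
predicted = one_against_all_predict((0.2, 0.9), classifiers)
print(predicted)
```

Compared with one-against-one, this needs k classifiers instead of k·(k−1)/2, at the cost of training each one on the full, typically unbalanced, data set.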
  • 22. Semi-supervised SVM (S3VM). Unlabeled documents are considered during the learning phase. The resulting optimization function is: $\min \frac{1}{2} \cdot \|\omega\|^2 + C \cdot \sum_{i=1}^{l} \xi_i^d + C^* \cdot \sum_{j=1}^{u} \xi_j^{*d}$. Convex optimization algorithms required. Commonly used over binary taxonomies, but hardly ever with more classes.
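
To show how the unlabeled term enters the objective, the following sketch evaluates the regularizer plus C times the slack of the l labeled documents plus C* times the slack of the u unlabeled ones, for a fixed hyperplane. It assumes that each unlabeled document is given whichever label yields the smaller slack, so its slack is max(0, 1 − |ω · x_j + b|); any class-balance constraint a real S3VM solver may impose is ignored. Function name, parameter defaults and data are assumptions.

```python
def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.5, d=1):
    """Evaluate 0.5*||w||^2 + C*sum(xi_i^d) + C_star*sum(xi_j*^d) for fixed (w, b)."""
    def f(x):
        return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b

    # Labeled slack: the usual hinge term
    labeled = sum(max(0.0, 1.0 - y_i * f(x_i)) ** d
                  for x_i, y_i in zip(X_lab, y_lab))
    # Unlabeled slack: hinge term under the best of the two possible labels
    unlabeled = sum(max(0.0, 1.0 - abs(f(x_j))) ** d for x_j in X_unl)
    regularizer = 0.5 * sum(w_k * w_k for w_k in w)
    return regularizer + C * labeled + C_star * unlabeled


X_lab = [(2.0, 0.0), (-2.0, 0.0)]
y_lab = [1, -1]
X_unl = [(0.5, 0.0), (3.0, 0.0)]  # the first one falls inside the margin
value = s3vm_objective((1.0, 0.0), 0.0, X_lab, y_lab, X_unl)
print(value)
```

The unlabeled term penalizes hyperplanes that pass close to unlabeled documents, which is how they influence learning without having labels.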
  • 24. Multiclass S3VM. (Yajima and Kuo, 2006) present the following optimization function: $\min\left(\frac{1}{2}\sum_{i=1}^{h} \beta^{iT} K^{-1} \beta^{i} + C \sum_{j=1}^{l} \sum_{i \ne y_j} \max\left(0, 1 - (\beta_j^{y_j} - \beta_j^{i})\right)^2\right)$, where $\beta$ represents the product of a vector and a kernel matrix defined by the author. (Chapelle et al., 2006): direct approach by means of the Continuation Method. 2 steps: (Qi et al., 2004) use Fuzzy C-Means to predict new unlabeled documents. (Xu and Schuurmans, 2005) rely on a clustering-based approach to label the unlabeled data.
  • 26. Compared Approaches: Multiclass SVM vs Multiclass S3VM. 2-steps-SVM / 1-step-SVM (multiclass SVM): does an intermediate step that adds newly labeled data improve the classifier's performance? One-against-all-S3VM / One-against-all-SVM and One-against-one-S3VM / One-against-one-SVM: does unlabeled data help to improve the results of combined binary classifiers?
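
The slide does not spell out the 2-steps procedure. One plausible reading, assumed here, is: train on the labeled data, use that model to label the unlabeled documents, then retrain on the union. The sketch below follows that reading with a nearest-centroid learner standing in for the multiclass SVM so that it runs without external libraries; the learner, the function names and the data are all illustrative assumptions, not the authors' implementation.

```python
def train_centroids(X, y):
    """Stand-in learner (nearest centroid, NOT an SVM): class -> mean vector."""
    sums, counts = {}, {}
    for x, c in zip(X, y):
        acc = sums.setdefault(c, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))


def two_steps(X_lab, y_lab, X_unl, train=train_centroids):
    # Step 1: supervised model on the labeled documents only
    first = train(X_lab, y_lab)
    # Intermediate step: label the unlabeled documents with that model
    pseudo_labels = [predict(first, x) for x in X_unl]
    # Step 2: retrain on labeled plus newly labeled documents
    return train(list(X_lab) + list(X_unl), list(y_lab) + pseudo_labels)


X_lab, y_lab = [(0.0, 0.0), (4.0, 4.0)], ["a", "b"]
X_unl = [(1.0, 1.0), (3.0, 3.0)]
model = two_steps(X_lab, y_lab, X_unl)
print(model)
```

The 1-step variant corresponds to stopping after the first `train` call, which is the comparison the slide asks about.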
  • 28. Experimental settings. Datasets: BankSearch: 10,000 web documents / 10 categories (4,000 for the training set). WebKB: 4,518 web documents / 6 categories (2,000 for the training set). Yahoo! Science: 788 web documents / 6 categories (200 for the training set). Numerous labeled/unlabeled sets, with 9 executions for each. Representation: TF-IDF. Software: SVM-light (http://svmlight.joachims.org) and SVM-multiclass. Evaluation by means of accuracy (percentage of correct predictions).
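
For readers unfamiliar with the representation and the metric, here is a small self-contained sketch. The slide does not say which TF-IDF variant was used, so this assumes raw term counts multiplied by log(N / df); accuracy is the percentage of correct predictions, as stated on the slide. The tokenized toy documents and labels are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Map each tokenized document to {term: tf * log(N / df)} (assumed variant)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def accuracy(gold, predicted):
    """Percentage of predictions that match the gold labels."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


docs = [["svm", "web"], ["svm", "page"], ["svm", "web", "web"]]
vectors = tfidf(docs)
score = accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"])
print(vectors[2], score)
```

A term that occurs in every document ("svm" here) gets weight 0, which is the intended effect of the IDF factor.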
  • 30. Results: BankSearch
  • 31. Results: WebKB
  • 32. Results: Yahoo! Science
  • 33. Results. Supervised multiclass approaches (2-steps-SVM and 1-step-SVM) outperform the rest. Among binary combinations, one-against-all outperforms one-against-one. Unlabeled data slightly helps for one-against-all. 1-step-SVM and 2-steps-SVM show similar results, except on WebKB, where the former wins. This could be due to the homogeneous nature of the WebKB dataset.
  • 35. Conclusions. Comparison of multiclass SVM and S3VM approaches for web page classification, covering both direct and combining approaches. Direct approaches outperform the rest. Unlabeled data did not provide considerable improvements, and even worsened results in some cases.
  • 36. Future Work. To add more multiclass S3VM approaches to the study. To test with different SVM settings (kernel, parameters, ...).
  • 38. Thank you
