Text categorization
A Comparative Study on SVM, TSVM and SVM-kNN in Text Categorization


Slide notes:
  • This situation arises in many application areas of machine learning, for example:
  • For instance, in content-based image retrieval, a user usually poses several example images as a query and asks the system to return similar images. In this situation there are many unlabeled examples, i.e. the images in the database, but only a few labeled examples. Another instance is online web page recommendation: when a user is surfing the Internet, he may occasionally encounter some interesting web pages and may want the system to bring him similarly interesting pages. It is difficult to ask the user to confirm more interesting pages as training examples, because he may not know where they are. In this instance too there are many unlabeled examples but only a few labeled ones. In some cases there is only one labeled training example to rely on. If an initial weakly useful predictor cannot be generated from this single example, the above-mentioned SSL techniques cannot be applied. [4]
  • Because the RBF kernel nonlinearly maps examples into a higher-dimensional space, unlike the linear kernel it can handle cases where the relation between class labels and attributes is nonlinear.
  • This method is very simple: if example x1 has k nearest examples in the feature space and the majority of them share the same label y1, then x1 is assigned to y1. Although kNN rests on asymptotic results in theory, each decision involves only a small number of nearest neighbors, so the method can avoid the problem of imbalanced examples; moreover, kNN depends mainly on the limited number of neighbors around an example rather than on a decision boundary.
  • Because the examples located around the decision boundary are easy to misclassify, yet are likely to be support vectors, we call them boundary vectors; we pick out these boundary vectors, whose labels were fuzzily assigned by the weaker SVM classifier.

    1. 1. By: Sy-Quan Nguyen, Minh-Hoang Nguyen, Phi-Dung Tran. Instructor: Prof. Quang-Thuy Ha, Tuan-Quang Nguyen. A Comparative Study on SVM, TSVM and SVM-kNN in Text Categorization
    2. 2. Table of Contents <ul><li>Introduction to Text Categorization </li></ul><ul><li>Support Vector Machine (SVM) </li></ul><ul><li>Transductive SVM </li></ul><ul><li>SVM-kNN </li></ul><ul><li>Experiment </li></ul>
    3. 3. Document Classification: Motivation <ul><li>News article classification </li></ul><ul><li>Automatic email filtering </li></ul><ul><li>Webpage classification </li></ul><ul><li>Word sense disambiguation </li></ul><ul><li>… … </li></ul>
    4. 4. Text Categorization <ul><li>Pre-given categories and labeled document examples (categories may form a hierarchy) </li></ul><ul><li>Classify new documents </li></ul><ul><li>A standard classification (supervised learning) problem </li></ul>[Figure: a categorization system routing documents into categories such as Sports, Business, Education, Science]
    5. 5. Document Classification: Problem Definition <ul><li>Need to assign a boolean value {0,1} to each entry of the decision matrix </li></ul><ul><li>C = {c 1 ,....., c m } is a set of pre-defined categories </li></ul><ul><li>D = {d 1 ,..... d n } is a set of documents to be categorized </li></ul><ul><li>1 for a ij : d j belongs to c i </li></ul><ul><li>0 for a ij : d j does not belong to c i </li></ul>
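The decision matrix above can be sketched in a few lines of Python; the category and document names here are hypothetical illustrations:

```python
import numpy as np

categories = ["sports", "business", "education"]   # C = {c1, ..., cm} (hypothetical)
documents = ["doc1", "doc2", "doc3", "doc4"]       # D = {d1, ..., dn} (hypothetical)

# A[i][j] = 1 if document d_j belongs to category c_i, else 0
A = np.zeros((len(categories), len(documents)), dtype=int)
A[0, 0] = 1   # doc1 is about sports
A[1, 2] = 1   # doc3 is about business
A[2, 3] = 1   # doc4 is about education

print(A)
```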
    6. 6. Flavors of Classification <ul><li>Single Label </li></ul><ul><ul><li>For a given d i at most one (d i , c i ) is true </li></ul></ul><ul><ul><li>Train a system which takes a d i and C as input and outputs a c i </li></ul></ul><ul><li>Multi-label </li></ul><ul><ul><li>For a given d i zero, one or more (d i , c i ) can be true </li></ul></ul><ul><ul><li>Train a system which takes a d i and C as input and outputs C’, a subset of C </li></ul></ul><ul><li>Binary </li></ul><ul><ul><li>Build a separate system for each c i , such that it takes in as input a d i and outputs a boolean value for (d i , c i ) </li></ul></ul><ul><ul><li>The most general approach </li></ul></ul><ul><ul><li>Based on assumption that decision on (d i , c i ) is independent of (d i , c j ) </li></ul></ul>
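The binary approach (one independent yes/no system per category) can be sketched in Python. A deliberately simple nearest-centroid rule stands in for each per-category classifier here, and all data are hypothetical:

```python
import numpy as np

def train_binary(X, y):
    """Train one yes/no classifier for a single category:
    store the centroids of its positive and negative documents."""
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

def predict_binary(model, x):
    """Boolean decision for (d, c): 1 if x is closer to the positive centroid."""
    pos, neg = model
    return int(np.linalg.norm(x - pos) < np.linalg.norm(x - neg))

# Multi-label data: each row of Y is the binary decision vector over 2 categories
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# One separate system per category c_i, decisions made independently
models = [train_binary(X, Y[:, i]) for i in range(Y.shape[1])]
x_new = np.array([0.8, 0.2])
labels = [predict_binary(m, x_new) for m in models]
```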
    7. 7. Classification Methods <ul><li>Manual: Typically rule-based (KE Approach) </li></ul><ul><ul><li>Does not scale up (labor-intensive, rule inconsistency) </li></ul></ul><ul><ul><li>May be appropriate for special data on a particular domain </li></ul></ul><ul><li>Automatic: Typically exploiting machine learning techniques </li></ul><ul><ul><li>Vector space model based </li></ul></ul><ul><ul><ul><li>Prototype-based (Rocchio) </li></ul></ul></ul><ul><ul><ul><li>K-nearest neighbor (KNN) </li></ul></ul></ul><ul><ul><ul><li>Decision-tree (learn rules) </li></ul></ul></ul><ul><ul><ul><li>Neural Networks (learn non-linear classifier) </li></ul></ul></ul><ul><ul><ul><li>Support Vector Machines (SVM) </li></ul></ul></ul><ul><ul><li>Probabilistic or generative model based </li></ul></ul><ul><ul><ul><li>Naïve Bayes classifier </li></ul></ul></ul>
    8. 8. Steps in Document Classification <ul><li>Classification Process </li></ul><ul><ul><li>Data preprocessing </li></ul></ul><ul><ul><ul><li>E.g., Term Extraction, Dimensionality Reduction, Feature Selection, etc. </li></ul></ul></ul><ul><ul><li>Definition of training set and test sets </li></ul></ul><ul><ul><li>Creation of the classification model using the selected classification algorithm </li></ul></ul><ul><ul><li>Classification model validation </li></ul></ul><ul><ul><li>Classification of new/unknown text documents </li></ul></ul>
    9. 9. Support Vector Machine (SVM)
    10. 10. SVM — History and Applications <ul><li>Vapnik and colleagues (1992) — groundwork from Vapnik & Chervonenkis ’ statistical learning theory in 1960s </li></ul><ul><li>Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization) </li></ul><ul><li>Used both for classification and prediction </li></ul><ul><li>Applications: </li></ul><ul><ul><li>handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests </li></ul></ul>
    11. 11. SVM <ul><li>Given a training data set </li></ul><ul><li>D = {(x_1, y_1), ..., (x_n, y_n)} </li></ul><ul><li>Where: </li></ul><ul><ul><li>x_i ∈ R^d - input vector </li></ul></ul><ul><ul><li>y_i ∈ {+1, -1} - label of the class to classify </li></ul></ul><ul><li>SVM builds a linear function: f(x) = ⟨w, x⟩ + b </li></ul>
    12. 12. <ul><li>Vector x_i is positive if f(x_i) ≥ 0, and negative otherwise </li></ul><ul><li>where: </li></ul><ul><ul><li>w is the weight vector </li></ul></ul><ul><ul><li>b is the bias, and ⟨w, x⟩ is the dot product of w and x </li></ul></ul><ul><li>We have: y_i = sign(⟨w, x_i⟩ + b) </li></ul>
    13. 13. <ul><li>SVM finds a hyperplane: ⟨w, x⟩ + b = 0 </li></ul>
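A minimal Python sketch of this decision rule, with a hypothetical weight vector and bias:

```python
import numpy as np

def svm_decision(w, b, x):
    """f(x) = <w, x> + b: signed distance-like score of x w.r.t. the hyperplane."""
    return np.dot(w, x) + b

def svm_predict(w, b, x):
    """Predicted label is the sign of the decision value."""
    return 1 if svm_decision(w, b, x) >= 0 else -1

# Hypothetical separating hyperplane w.x + b = 0
w = np.array([1.0, -1.0])
b = 0.5
print(svm_predict(w, b, np.array([2.0, 1.0])))   # f = 1.5 >= 0, positive side
print(svm_predict(w, b, np.array([0.0, 3.0])))   # f = -2.5 < 0, negative side
```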
    14. 15. Problem <ul><li>There are an infinite number of lines that can separate the positive and negative data points, as illustrated by Fig. (B). Which line should we choose? </li></ul><ul><li>A hyperplane classifier is only applicable if the positive and negative data are linearly separable. How can we deal with data sets that require nonlinear decision boundaries? </li></ul>
    15. 16. SVM – Separable Case <ul><li>SVM chooses the hyperplane that maximizes the margin </li></ul><ul><li>Let d+ and d- be the shortest distances from the hyperplane to the closest positive and negative data points. </li></ul><ul><li>Margin of the hyperplane: d+ + d- </li></ul><ul><li>SVM looks for the separating hyperplane with the largest margin (i.e. the decision boundary or maximal margin hyperplane) </li></ul>
    16. 17. SVM Linear
    17. 18. <ul><li>Consider (x+, 1) and (x-, -1), the two points closest to the hyperplane </li></ul><ul><li>We define two parallel hyperplanes, H+ and H-. We rescale w and b to obtain: </li></ul><ul><ul><li>H+: ⟨w, x⟩ + b = +1 </li></ul></ul><ul><ul><li>H-: ⟨w, x⟩ + b = -1 </li></ul></ul>
    18. 19. <ul><li>If x_s belongs to H+, then d+ is the distance from x_s to the separating hyperplane </li></ul><ul><li>Margin of the hyperplane: d+ + d- = 2 / ||w|| </li></ul>
    19. 21. <ul><li>Solving the above problem, we can determine w and b: w = Σ_i α_i y_i x_i, where </li></ul><ul><ul><li>α_i &gt; 0 </li></ul></ul><ul><ul><li>x_i is a support vector </li></ul></ul><ul><li>b is computed based on the KKT (Karush-Kuhn-Tucker) conditions </li></ul>
    20. 22. SVM – Nonlinear <ul><li>Use a hyperplane with a soft margin: minimize (1/2)||w||² + C Σ_i ξ_i subject to y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0 </li></ul>
    21. 23. <ul><li>Use a kernel function: </li></ul><ul><ul><li>Polynomial: K(x, z) = (γ⟨x, z⟩ + r)^d </li></ul></ul><ul><ul><li>Gaussian RBF: K(x, z) = exp(−γ ||x − z||²) </li></ul></ul><ul><ul><li>Sigmoid: K(x, z) = tanh(γ⟨x, z⟩ + r) </li></ul></ul>
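The three kernels can be written directly in NumPy; the parameter values below are arbitrary illustrations:

```python
import numpy as np

def kernel_polynomial(x, z, gamma=1.0, r=1.0, d=2):
    # K(x, z) = (gamma * <x, z> + r)^d
    return (gamma * np.dot(x, z) + r) ** d

def kernel_rbf(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def kernel_sigmoid(x, z, gamma=1.0, r=0.0):
    # K(x, z) = tanh(gamma * <x, z> + r)
    return np.tanh(gamma * np.dot(x, z) + r)

x = np.array([1.0, 0.0])
z = np.array([1.0, 1.0])
print(kernel_polynomial(x, z), kernel_rbf(x, z), kernel_sigmoid(x, z))
```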
    22. 25. Transductive SVM
    23. 26. TSVM - Overview <ul><li>Partially Supervised Learning </li></ul><ul><ul><li>Manual labeling for pure SVM is time-consuming </li></ul></ul><ul><ul><li>→ use Partially Supervised Learning instead </li></ul></ul><ul><ul><li>Two tasks of this approach: </li></ul></ul><ul><ul><ul><li>Labeled and Unlabeled Learning (LU Learning) </li></ul></ul></ul><ul><ul><ul><li>Positive and Unlabeled Learning (PU Learning) </li></ul></ul></ul>
    24. 27. TSVM - Overview <ul><li>Labeled and Unlabeled Learning </li></ul><ul><ul><li>Known as semi-supervised learning </li></ul></ul><ul><ul><li>The training set contains: </li></ul></ul><ul><ul><ul><li>A small set of labeled examples </li></ul></ul></ul><ul><ul><ul><li>A large set of unlabeled examples </li></ul></ul></ul><ul><ul><ul><li>→ use the unlabeled examples to improve learning effectiveness </li></ul></ul></ul>
    25. 28. TSVM - Overview <ul><li>Positive and Unlabeled Learning </li></ul><ul><ul><li>Suppose the problem is two-class classification </li></ul></ul><ul><ul><li>The training set contains: </li></ul></ul><ul><ul><ul><li>A set of labeled positive examples </li></ul></ul></ul><ul><ul><ul><li>A set of unlabeled examples </li></ul></ul></ul><ul><ul><ul><li>No labeled negative examples </li></ul></ul></ul><ul><ul><ul><li>→ build a classifier without labeling any negative examples </li></ul></ul></ul>
    26. 29. TSVM - Overview <ul><li>Transductive SVM </li></ul><ul><ul><li>One of several methods for LU Learning </li></ul></ul><ul><ul><li>Very effective with small data sets (train and test) </li></ul></ul><ul><ul><li>Uses the unlabeled data in the training set to choose the classifier whose margin is maximized </li></ul></ul>[Figure: decision boundary after SVM vs. after TSVM]
    27. 30. TSVM - Algorithm <ul><li>Uses an iterative method </li></ul><ul><li>Three steps: </li></ul><ul><ul><li>Step 1: Learn a classifier using the labeled data </li></ul></ul><ul><ul><li>Step 2: </li></ul></ul><ul><ul><ul><li>Choose a subset of positive data from the unlabeled data using the learned classifier (from step 1) </li></ul></ul></ul><ul><ul><ul><li>Initialize the negative examples as the remaining unlabeled data </li></ul></ul></ul><ul><ul><li>Step 3: Improve the soft-margin cost function </li></ul></ul><ul><ul><ul><li>Switch the labels of one positive and one negative document that were labeled wrongly </li></ul></ul></ul><ul><ul><ul><li>Retrain </li></ul></ul></ul><ul><ul><ul><li>Stop when the algorithm's cost parameters reach the user-specified constants </li></ul></ul></ul>
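The three steps above can be sketched in Python. This is only a schematic: a centroid-difference rule stands in for the SVM, and the label-switching loop of the real TSVM is collapsed into a single retraining pass; `num_plus` plays the role of the user-supplied number of positive test examples:

```python
import numpy as np

def train_linear(X, y):
    """Stand-in for an inductive SVM: a centroid-difference linear classifier."""
    w = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    b = -np.dot(w, X.mean(axis=0))
    return w, b

def tsvm_sketch(X_train, y_train, X_test, num_plus):
    # Step 1: learn an initial classifier from the labeled data only
    w, b = train_linear(X_train, y_train)
    # Step 2: label the test set -- the num_plus highest-scoring examples
    # become positive, the remainder negative
    scores = X_test @ w + b
    y_test = -np.ones(len(X_test))
    y_test[np.argsort(scores)[-num_plus:]] = 1
    # Step 3 (simplified): retrain on labeled + newly labeled data; the real
    # TSVM also iteratively switches wrongly labeled positive/negative pairs
    X_all = np.vstack([X_train, X_test])
    y_all = np.concatenate([y_train, y_test])
    return train_linear(X_all, y_all), y_test

X_train = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_train = np.array([1.0, -1.0])
X_test = np.array([[1.5, 0.0], [-1.5, 0.0], [2.5, 0.0]])
(w, b), y_test = tsvm_sketch(X_train, y_train, X_test, num_plus=2)
```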
    28. 31. TSVM - Algorithm <ul><li>Input </li></ul><ul><ul><li>Training set </li></ul></ul><ul><ul><li>Testing set </li></ul></ul><ul><li>Parameters: </li></ul><ul><ul><li>User's constants </li></ul></ul><ul><ul><li>num+: the number of test examples to assign to the positive class </li></ul></ul><ul><li>Output </li></ul><ul><ul><li>Predicted labels of the test examples </li></ul></ul>
    29. 32. TSVM - Algorithm <ul><li>(1) </li></ul>
    30. 33. TSVM - Algorithm <ul><li>After (1): </li></ul><ul><ul><li>Classify the test set </li></ul></ul><ul><ul><li>The num+ test examples with the highest values of the decision function are assigned to the positive class </li></ul></ul><ul><ul><li>The remaining test examples are assigned to the negative class </li></ul></ul>
    31. 34. SVM - KNN
    32. 35. Problem <ul><li>Traditional classifiers are constructed from labeled data in supervised learning </li></ul><ul><li>Labeled examples are often difficult, expensive, or time-consuming to obtain </li></ul><ul><li>Semi-supervised learning addresses this problem by using a large number of unlabeled data together with a small number of labeled data </li></ul>
    33. 36. Semi-Supervised <ul><li>Semi-supervised learning is halfway between supervised and unsupervised learning [3] </li></ul><ul><li>The data of an SSL set X = (x_1, x_2, ..., x_n) can be separated into two parts: </li></ul><ul><ul><li>The points X_h = (x_1, ..., x_h), for which labels Y_h = (y_1, ..., y_h) are provided. </li></ul></ul><ul><ul><li>The points X_t = (x_{h+1}, ..., x_{h+t}), whose labels are not known. </li></ul></ul><ul><li>SSL is most useful when there are far more unlabeled data than labeled </li></ul>
    34. 37. Semi-Supervised <ul><li>A number of classification algorithms that use both labeled and unlabeled data have been proposed </li></ul><ul><ul><li>Self-training (or self-labeling) is probably the earliest SSL method, and it is still extensively used in natural language processing. </li></ul></ul><ul><li>One of the many approaches to SSL is to first train a weaker predictor, which is then used to exploit the unlabeled examples. </li></ul>
    35. 38. Algorithm SVM <ul><li>Presented by Mr. Quan </li></ul><ul><li>Note: in my experiment, I choose the RBF kernel as my kernel function. </li></ul>
    36. 39. Algorithm KNN <ul><li>The kNN algorithm was proposed by Cover and Hart (1967) [5]. </li></ul><ul><li>kNN distances are computed with the Euclidean distance between X_i = (x_{i1}, ..., x_{ip}) and X_j = (x_{j1}, ..., x_{jp}): d(X_i, X_j) = sqrt(Σ_k (x_{ik} − x_{jk})²) </li></ul><ul><li>kNN is suitable for classifying example sets with intersecting boundaries and overlapping examples. </li></ul>
    37. 40. Pseudocode KNN <ul><li>Step 1: Determine the parameter K, the number of nearest neighbors </li></ul><ul><li>Step 2: Calculate the distance between the query instance and all the training examples. </li></ul><ul><li>Step 3: Sort the distances and determine the nearest neighbors based on the K-th minimum distance. </li></ul><ul><li>Step 4: Gather the categories Y of the nearest neighbors. </li></ul><ul><li>Step 5: Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance. </li></ul>
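Steps 1-5 translate directly into Python. The four training samples below follow a common textbook kNN example with query x1 = 3, x2 = 7; the actual table on the slide is not reproduced in this transcript, so the data are assumed:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: Euclidean distance from the query to every training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote over the neighbors' categories
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for the four training samples on the slide
X_train = np.array([[7.0, 7.0], [7.0, 4.0], [3.0, 4.0], [1.0, 4.0]])
y_train = np.array(["bad", "bad", "good", "good"])
print(knn_predict(X_train, y_train, np.array([3.0, 7.0]), k=3))
```

The three nearest neighbors of (3, 7) are (3, 4), (1, 4) and (7, 7), so the majority vote is "good".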
    38. 41. Example <ul><li>Four training samples </li></ul><ul><li>Test with x 1 =3,x 2 =7. </li></ul>
    39. 42. <ul><li>Suppose k=3 and calculate the distance. </li></ul>
    40. 43. <ul><li>Sort distance </li></ul>
    41. 44. <ul><li>Gather the category Y of the nearest neighbors and the prediction value of the query instance. </li></ul>
    42. 45. Algorithm SVM-KNN <ul><li>Problem: </li></ul><ul><ul><li>Suppose we classify a data set that includes a large number of unlabeled data, using only the few training examples available. </li></ul></ul><ul><ul><ul><li>Then we cannot obtain a high-accuracy classifier from such inadequate training examples. </li></ul></ul></ul><ul><ul><ul><li>If we want to obtain a classifier of high performance, labeling the unlabeled data is necessary. </li></ul></ul></ul><ul><ul><ul><ul><li>Problem: labeling vast amounts of unlabeled data wastes time and effort. </li></ul></ul></ul></ul>
    43. 46. Pseudocode SVM-KNN [1] <ul><li>Step 1: Use the labeled data available in a data set as the initial training set and construct a weaker classifier SVM1 based on this training set. </li></ul><ul><li>Step 2: Use SVM1 to predict the labels of all the remaining unlabeled data in the data set, then pick out 2n examples located around the decision boundary as boundary vectors. </li></ul><ul><ul><ul><li>Choose an example x_i from class A (A is the label), calculate the distance between x_i and all the examples of class B (B is the label) using the Euclidean distance, and subsequently pick out the n examples of B corresponding to the n minimum distances. </li></ul></ul></ul><ul><ul><ul><li>Similarly, choose y_j from class B. </li></ul></ul></ul><ul><ul><ul><li>Call these 2n examples boundary vectors, and collect them together as a new testing set. </li></ul></ul></ul><ul><li>Step 3: A kNN classifier classifies the new testing set with the initial training set, and the boundary vectors get new labels. </li></ul><ul><li>Step 4: Put the boundary vectors and their new labels into the initial training set to enlarge it, then retrain a new SVM2. </li></ul><ul><li>Step 5: Iterate as above until the number of training examples is m times the whole data set. </li></ul>
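Step 2's boundary-vector selection can be sketched in NumPy. This simplified version picks, for each predicted class, the n examples closest to the other class; the per-example distance procedure of the original algorithm is condensed into one pairwise-distance matrix, so treat it as an illustration rather than the exact method:

```python
import numpy as np

def boundary_vectors(X_pred_a, X_pred_b, n):
    """Pick n examples from each predicted class that lie closest to the
    other class -- a simplified stand-in for Step 2's boundary-vector choice."""
    # Pairwise Euclidean distances: d[i, j] = ||X_pred_a[i] - X_pred_b[j]||
    d = np.linalg.norm(X_pred_a[:, None, :] - X_pred_b[None, :, :], axis=2)
    idx_a = np.argsort(d.min(axis=1))[:n]   # A-examples nearest to any B-example
    idx_b = np.argsort(d.min(axis=0))[:n]   # B-examples nearest to any A-example
    return idx_a, idx_b

# Hypothetical predicted classes: the first example of each lies near the boundary
X_a = np.array([[0.0, 0.0], [10.0, 0.0]])
X_b = np.array([[1.0, 0.0], [20.0, 0.0]])
idx_a, idx_b = boundary_vectors(X_a, X_b, 1)
```

The 2n selected examples would then be relabeled by kNN (Step 3) and appended to the training set (Step 4).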
    44. 47. Model SVM1
    45. 48. [Flow diagram] Initial training set → SVM1 predicts the labels of all the remaining unlabeled data → choose 2n boundary vectors as a new testing set → kNN gives the boundary vectors new labels → put them in the training set and retrain a new SVM2 → repeat until the training set is m times the whole data set
    46. 49. Experiment
    47. 50. <ul><li>The data set is Reuters-21578 </li></ul><ul><li>After preprocessing (stopword removal, stemming, ...), the ten classes containing the most documents are selected for testing. </li></ul><ul><li>Finally, for each class only the 800 most informative words (measured by the chi-square statistic) are selected. </li></ul>
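The chi-square score used here for term selection can be computed from a 2x2 term/class contingency table; the counts below are hypothetical:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of a term w.r.t. a class from a 2x2 contingency
    table: A = class docs containing the term, B = other docs containing it,
    C = class docs without it, D = other docs without it."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# A term concentrated in the class scores higher than a term spread evenly,
# so ranking terms by this statistic keeps the most informative words
print(chi_square(40, 5, 10, 45))   # concentrated term
print(chi_square(25, 25, 25, 25))  # uninformative term
```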
    48. 53. References <ul><li>Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No. 1, March 2002. </li></ul><ul><li>Yiming Yang, “An evaluation of statistical approaches to text categorization”, Journal of Information Retrieval, 1:67-88, 1999. </li></ul><ul><li>Yiming Yang and Xin Liu, “A re-examination of text categorization methods”, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999. </li></ul>
    49. 54. References <ul><li>[1] Kunlun Li, Xuerong Luo and Ming Jin. Semi-supervised Learning for SVM-KNN. Journal of Computers, 5(5):671-678, May 2010. </li></ul><ul><li>[2] X.J. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, December 2007. </li></ul><ul><li>[3] G. Fung and O. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Technical Report 99-05, Data Mining Institute, University of Wisconsin Madison, 1999. </li></ul><ul><li>[4] Z.-H. Zhou, D.-C. Zhan, and Q. Yang. Semi-supervised learning with very few labeled training examples. Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 2007. </li></ul><ul><li>[5] B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1990. </li></ul>
    50. 55. Thank You