A Comparative Study on SVM, TSVM and SVM-kNN in Text Categorization. By: Sy-Quan Nguyen, Minh-Hoang Nguyen, Phi-Dung Tran. Instructor: Prof. Quang-Thuy Ha, Tuan-Quang Nguyen.
Table of Contents: Introduction to Text Categorization; Support Vector Machine (SVM); Transductive SVM; SVM-kNN; Experiment
Document Classification: Motivation. News article classification, automatic email filtering, webpage classification, word sense disambiguation, …
Text Categorization: Pre-given categories and labeled document examples (categories may form a hierarchy). Classify new documents. A standard classification (supervised learning) problem. [Diagram: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science, …]
Document Classification: Problem Definition. Need to assign a boolean value {0,1} to each entry of the decision matrix. C = {c_1, ..., c_m} is a set of pre-defined categories; D = {d_1, ..., d_n} is a set of documents to be categorized. a_ij = 1: d_j belongs to c_i; a_ij = 0: d_j does not belong to c_i.
Flavors of Classification. Single label: for a given d_i, at most one (d_i, c_i) is true; train a system which takes a d_i and C as input and outputs a c_i. Multi-label: for a given d_i, zero, one or more (d_i, c_i) can be true; train a system which takes a d_i and C as input and outputs C', a subset of C. Binary: build a separate system for each c_i, which takes a d_i as input and outputs a boolean value for (d_i, c_i); the most general approach, based on the assumption that the decision on (d_i, c_i) is independent of (d_i, c_j).
Classification Methods Manual: Typically rule-based (KE Approach) Does not scale up (labor-intensive, rule inconsistency) May be appropriate for special data on a particular domain Automatic: Typically exploiting machine learning techniques Vector space model based Prototype-based (Rocchio) K-nearest neighbor (KNN) Decision-tree (learn rules) Neural Networks (learn non-linear classifier) Support Vector Machines (SVM) Probabilistic or generative model based Naïve Bayes classifier
Steps in Document Classification Classification Process Data preprocessing E.g., Term Extraction, Dimensionality Reduction, Feature Selection, etc. Definition of training set and test sets Creation of the classification model using the selected classification algorithm Classification model validation Classification of new/unknown text documents
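As an illustration of these steps, here is a minimal sketch using scikit-learn; the toy corpus, labels, and parameter choices below are assumptions made for the example, not taken from the slides.

```python
# Minimal sketch of the document-classification pipeline above, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Hypothetical toy corpus and labels, three documents per class.
docs = [
    "stocks fell sharply on wall street today",
    "central bank raises interest rates again",
    "quarterly earnings beat analyst expectations",
    "the home team won the championship match",
    "star striker scores twice in the final",
    "coach praises defence after narrow victory",
    "new study on protein folding published",
    "researchers observe a distant galaxy cluster",
    "vaccine trial reports promising early results",
]
labels = ["business"] * 3 + ["sports"] * 3 + ["science"] * 3

# Preprocessing (term extraction, stop word removal, tf-idf weighting)
# and model creation combined in one pipeline.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())

# Definition of training and test sets, then training and validation.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=3, stratify=labels, random_state=0
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Classification of a new/unknown document.
print(model.predict(["merger talks lift bank shares"]))
```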
Support Vector Machine (SVM)
SVM — History and Applications. Vapnik and colleagues (1992); groundwork from Vapnik and Chervonenkis's statistical learning theory in the 1960s. Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization). Used for both classification and prediction. Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests.
SVM: Given a training data set D of input vectors x_i, each with a label y_i (the class to be predicted), SVM builds a linear function f(x). A vector x_i is classified as positive if f(x_i) >= 0 and as negative otherwise, where w is the weight vector, b is the bias, and <w, x> is the dot product of w and x. SVM thus finds a separating hyperplane (the standard formulation is given below).
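The formulas themselves are not reproduced on the slides above; the following is the standard linear-SVM setup they refer to, in textbook notation rather than copied from the slide images.

```latex
% Standard linear SVM setup (textbook notation).
% Training data:
%   D = {(x_i, y_i)}, i = 1..n,   x_i in R^d,   y_i in {+1, -1}
% Linear scoring function built by the SVM:
\[
  f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b
\]
% Decision rule: x_i is labeled positive if f(x_i) >= 0, negative otherwise.
% The separating hyperplane itself:
\[
  \langle \mathbf{w}, \mathbf{x} \rangle + b = 0
\]
```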
 
Problem: There are an infinite number of lines that can separate the positive and negative data points, as illustrated by Fig. (B). Which line should we choose? A hyperplane classifier is only applicable if the positive and negative data are linearly separable. How can we deal with nonlinear separations, i.e. data sets that require nonlinear decision boundaries?
SVM – Separable Case. SVM chooses the hyperplane that maximizes the margin. Let d+ and d− be the shortest distances from the hyperplane to the closest positive and negative data points; the margin of the hyperplane is d+ + d−. SVM looks for the separating hyperplane with the largest margin (the maximal margin hyperplane, used as the decision boundary).
Linear SVM
Consider (x+, +1) and (x−, −1), the two points closest to the hyperplane. We define two parallel hyperplanes, H+ and H−, and rescale w and b to obtain:
x_s belongs to the hyperplane; d+ is the distance from x_s to H+. Margin of the hyperplane:
 
Solving the above problem, we can determine w and b. The x_i with α_i > 0 are the support vectors; b is computed based on the KKT (Karush-Kuhn-Tucker) conditions:
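For reference, the standard margin-maximization problem, its dual (whose solution gives the α_i), and the KKT relation used to recover b; this is the textbook formulation, stated here because the slide's own formulas are not reproduced above.

```latex
% Hard-margin primal (the H+ / H- construction above gives margin 2 / ||w||):
\[
  \min_{\mathbf{w}, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
  \quad \text{s.t.} \quad y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) \ge 1,\ \ i = 1,\dots,n
\]
% Wolfe dual; the x_i with alpha_i > 0 are the support vectors:
\[
  \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle
  \quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0
\]
% Recovering w and b; KKT complementary slackness gives b from any support vector x_s:
\[
  \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \qquad
  \alpha_i \left[\, y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 \,\right] = 0
  \ \Rightarrow\ b = y_s - \langle \mathbf{w}, \mathbf{x}_s \rangle
\]
```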
SVM – Nonlinear. Use a hyperplane with a soft margin:
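A standard form of the soft-margin objective referred to here (textbook formulation; C and the slack variables ξ_i are the usual trade-off parameters):

```latex
% Soft-margin primal: slack variables xi_i allow margin violations,
% C controls the trade-off between margin width and training errors.
\[
  \min_{\mathbf{w}, b, \boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
  \quad \text{s.t.} \quad y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0
\]
```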
Use a kernel function: Polynomial, Gaussian RBF, or Sigmoid (the sigmoid kernel uses tanh); the standard forms of all three are given below.
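Standard forms of the three kernels named above (the parameters d, γ, κ and θ are user-chosen; these are the usual textbook definitions, since the slide's formulas are not reproduced here):

```latex
% Common kernel functions K(x, z).
\begin{align*}
  \text{Polynomial:}   \quad & K(\mathbf{x}, \mathbf{z}) = \left(\langle \mathbf{x}, \mathbf{z} \rangle + 1\right)^{d} \\
  \text{Gaussian RBF:} \quad & K(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\gamma\,\|\mathbf{x} - \mathbf{z}\|^{2}\right) \\
  \text{Sigmoid:}      \quad & K(\mathbf{x}, \mathbf{z}) = \tanh\!\left(\kappa\,\langle \mathbf{x}, \mathbf{z} \rangle + \theta\right)
\end{align*}
```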
 
Transductive SVM
TSVM - Overview. Partially Supervised Learning: manual labeling for pure SVM is time-consuming, so Partially Supervised Learning is used. Two tasks of this approach: Labeled and Unlabeled Learning (LU Learning); Positive and Unlabeled Learning (PU Learning).
TSVM - Overview. Labeled and Unlabeled Learning: known as semi-supervised learning. The training set contains a small set of labeled examples and a large set of unlabeled examples; the unlabeled examples are used to improve learning effectiveness.
TSVM - Overview. Positive and Unlabeled Learning: suppose the problem is two-class classification. The training set contains a set of labeled positive examples and a set of unlabeled examples, but no labeled negative examples; the goal is to build a classifier without labeling any negative examples.
TSVM - Overview. Transductive SVM: one of several methods in LU Learning. Works very well with small data sets (both train and test). It uses the unlabeled data in the training set to choose the classifier whose margin is maximized. [Figure: decision boundary after SVM vs. after TSVM]
TSVM - Algorithm. An iterative method with three steps. Step 1: learn a classifier using the labeled data. Step 2: choose a subset of positive examples from the unlabeled data using the classifier learned in Step 1; the remaining unlabeled examples are initially labeled negative. Step 3: improve the soft-margin cost function by switching the labels of one positive and one negative document that appear to have been labeled wrongly, then retrain; stop when the algorithm's internal cost constants reach the user-specified constant.
TSVM - Algorithm. Input: a training set, a testing set, parameters, and a user constant: the number of test examples to assign to the positive class. Output: predicted labels of the test examples.
TSVM - Algorithm (1)
TSVM - Algorithm. After (1): classify the test set; the test examples with the highest decision values (as many as the user-specified number) are assigned to the positive class, and the remaining test examples are assigned to the negative class.
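A highly simplified sketch of this iterative procedure, assuming scikit-learn and labels in {+1, −1}. This is an illustrative approximation of the self-labeling loop, not the exact TSVM of (1): the pair-switching criterion, the iteration cap, and the function name tsvm_sketch are assumptions.

```python
# Simplified TSVM-style self-labeling sketch (binary problem, dense numpy inputs).
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_sketch(X_train, y_train, X_test, num_plus, max_iters=50):
    # Step 1: learn an initial classifier from the labeled data only.
    clf = LinearSVC().fit(X_train, y_train)

    # Step 2: tentative test labels -- the num_plus highest decision values
    # become positive, the remaining test examples negative.
    scores = clf.decision_function(X_test)
    y_test = np.full(len(X_test), -1)
    y_test[np.argsort(scores)[-num_plus:]] = 1

    for _ in range(max_iters):
        # Retrain on labeled data plus currently-labeled test data.
        clf = LinearSVC().fit(np.vstack([X_train, X_test]),
                              np.concatenate([y_train, y_test]))

        # Step 3 (simplified): find one positive and one negative test example
        # that the current classifier most disagrees with; if both look wrong,
        # swap their labels and retrain, otherwise stop.
        margins = clf.decision_function(X_test) * y_test
        pos, neg = np.where(y_test == 1)[0], np.where(y_test == -1)[0]
        worst_pos = pos[np.argmin(margins[pos])]
        worst_neg = neg[np.argmin(margins[neg])]
        if margins[worst_pos] < 0 and margins[worst_neg] < 0:
            y_test[worst_pos], y_test[worst_neg] = -1, 1
        else:
            break
    return clf, y_test
```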
SVM - KNN
Problem: Traditional classifiers are constructed from labeled data in supervised learning. Labeled examples are often difficult, expensive, or time-consuming to obtain. Semi-supervised learning addresses this problem by using a large number of unlabeled examples together with a small number of labeled examples.
Semi-Supervised: Semi-supervised learning is halfway between supervised and unsupervised learning [3]. The data of an SSL set X = (x_1, x_2, ..., x_n) can be separated into two parts: the points X_h = (x_1, ..., x_h), for which labels Y_h = (y_1, ..., y_h) are provided, and the points X_t = (x_{h+1}, ..., x_{h+t}), whose labels are not known. SSL is most useful whenever there is much more unlabeled data than labeled data.
Semi-Supervised: A number of classification algorithms that use both labeled and unlabeled data have been proposed. Self-training (self-labeling) is probably the earliest SSL method and is still extensively used in natural language processing. One of the many approaches to SSL is to first train a weaker predictor, which is then used to exploit the unlabeled examples.
Algorithm SVM: presented by Mr. Quan. Note: in my experiment, I choose the RBF kernel as the kernel function.
Algorithm KNN: The KNN algorithm was proposed by Cover and Hart (1968) [5]. Distances are computed with the Euclidean distance between feature vectors X_i = (x_{i1}, ..., x_{in}) and X_j = (x_{j1}, ..., x_{jn}): d(X_i, X_j) = sqrt(sum_{k=1}^{n} (x_{ik} - x_{jk})^2). KNN is suitable for classifying data sets whose class boundaries intersect or whose examples overlap.
Pseudocode KNN. Step 1: Determine the parameter K, the number of nearest neighbors. Step 2: Calculate the distance between the query instance and all the training examples. Step 3: Sort the distances and determine the nearest neighbors based on the K-th minimum distance. Step 4: Gather the categories Y of the nearest neighbors. Step 5: Use the simple majority of the categories of the nearest neighbors as the prediction value for the query instance.
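A minimal implementation of the five steps above, assuming numeric feature vectors and Euclidean distance; the function name and default k are illustrative.

```python
# Minimal KNN classifier following the five steps above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 2: Euclidean distance from the query to every training example.
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: sort and keep the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: gather their categories and take the majority vote.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```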
Example: four training samples (shown as a table on the slide). Test with x_1 = 3, x_2 = 7. Suppose k = 3 and calculate the distances. Sort the distances and determine the nearest neighbors based on the k-th minimum distance. Gather the category Y of the nearest neighbors and take their majority as the prediction value for the query instance.
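The slide's table of four training samples is not reproduced above, so the values below are hypothetical stand-ins chosen only to walk through the procedure with the query (x_1, x_2) = (3, 7) and k = 3, using the knn_predict sketch given earlier.

```python
# Hypothetical four training samples (the slide's actual table is not available);
# two numeric features per sample and a binary category.
X_train = [(7, 7), (7, 4), (3, 4), (1, 4)]
y_train = ["bad", "bad", "good", "good"]
query = (3, 7)

print(knn_predict(X_train, y_train, query, k=3))
# Distances to the query: 4.00, 5.00, 3.00, 3.61 -> the 3 nearest are
# (3, 4) good, (1, 4) good, (7, 7) bad, so the majority vote is "good".
```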
Algorithm SVM-KNN. Problem: when classifying a data set that includes a large number of unlabeled examples, using only the few available training examples cannot produce a high-accuracy classifier. To obtain a high-performance classifier, labeling the unlabeled data is necessary. Problem: labeling vast amounts of unlabeled data wastes time and effort.
Pseudocode SVM-KNN [1]
Step 1: Use the labeled data available in the data set as the initial training set and construct a weaker classifier SVM1 on this training set.
Step 2: Use SVM1 to predict the labels of all the remaining unlabeled data, then pick out 2n examples located around the decision boundary as boundary vectors: choose an example x_i from class A (A being its predicted label), compute the Euclidean distance between x_i and all the examples of class B, and pick out the n examples of B corresponding to the n minimum distances; similarly choose y_j from class B. Together these 2n examples form the boundary vectors, which make up a new testing set.
Step 3: A KNN classifier, trained on the initial training set, classifies the new testing set, so the boundary vectors get new labels.
Step 4: Put the boundary vectors with their new labels into the initial training set to enlarge it, then retrain a new classifier SVM2.
Step 5: Iterate as above until the number of training examples is m times the size of the whole data set.
Model SVM1
[Flow diagram: initial training set → SVM1 predicts labels for all remaining unlabeled data → choose 2n boundary vectors as a new testing set → KNN gives the boundary vectors new labels → put them in the training set and retrain a new SVM2 → repeat until the training set is m times the whole data set]
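A sketch of this loop, assuming scikit-learn and a two-class problem. The RBF kernel matches the note on the earlier SVM slide, but the simplified boundary-vector selection (the 2n unlabeled examples closest to the decision boundary, rather than the per-class nearest-pair construction of [1]), the parameter defaults, and the function name svm_knn are assumptions.

```python
# Sketch of the semi-supervised SVM-KNN loop described above (binary problem).
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def svm_knn(X_labeled, y_labeled, X_unlabeled, n=10, k=5, m=0.7):
    X_train, y_train = np.asarray(X_labeled, float), np.asarray(y_labeled)
    X_pool = np.asarray(X_unlabeled, float)
    total = len(X_train) + len(X_pool)

    while len(X_pool) > 0 and len(X_train) < m * total:
        # Steps 1/4: (re)train an SVM on the current training set (RBF kernel).
        svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

        # Step 2 (simplified): boundary vectors = the 2n unlabeled examples
        # with the smallest |decision value|, i.e. closest to the boundary.
        margins = np.abs(svm.decision_function(X_pool))
        idx = np.argsort(margins)[: 2 * n]

        # Step 3: KNN, trained on the current training set, labels them.
        knn = KNeighborsClassifier(n_neighbors=min(k, len(X_train)))
        knn.fit(X_train, y_train)
        new_labels = knn.predict(X_pool[idx])

        # Step 4: enlarge the training set with the newly labeled boundary vectors.
        X_train = np.vstack([X_train, X_pool[idx]])
        y_train = np.concatenate([y_train, new_labels])
        X_pool = np.delete(X_pool, idx, axis=0)

    # Step 5 reached: training set is m times the whole data set (or pool empty).
    return SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
```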
Experiment
The data set is Reuters-21578. After preprocessing (stop word removal, stemming, ...), the ten classes containing the most documents are selected for testing. Finally, for each class only the 800 most informative words (measured by the chi-square statistic) are selected.
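A sketch of how the chi-square feature selection described here could look with scikit-learn; the corpus loading, the stemming step (omitted below), and the one-vs-rest framing per class are assumptions.

```python
# Sketch: keep the 800 highest chi-square words for one class (one-vs-rest).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def top_words_for_class(docs, labels, target_class, n_words=800):
    # Term extraction with English stop word removal (stemming not shown).
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    y = np.array([1 if lab == target_class else 0 for lab in labels])
    # Chi-square statistic of each word against the one-vs-rest label.
    scores, _ = chi2(X, y)
    vocab = np.array(vec.get_feature_names_out())
    order = np.argsort(scores)[::-1][:n_words]
    return vocab[order]
```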
 
 
References
Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002.
Yiming Yang, "An evaluation of statistical approaches to text categorization", Journal of Information Retrieval, 1:67-88, 1999.
Yiming Yang and Xin Liu, "A re-examination of text categorization methods", Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999.
References
[1] Kunlun Li, Xuerong Luo and Ming Jin, "Semi-supervised Learning for SVM-KNN", Journal of Computers, 5(5): 671-678, May 2010.
[2] X. J. Zhu, "Semi-supervised learning literature survey", Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, December 2007.
[3] G. Fung and O. Mangasarian, "Semi-supervised support vector machines for unlabeled data classification", Technical Report 99-05, Data Mining Institute, University of Wisconsin Madison, 1999.
[4] Z.-H. Zhou, D.-C. Zhan and Q. Yang, "Semi-supervised learning with very few labeled training examples", Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 2007.
[5] B. V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, 1990.
Thank You


Editor's Notes

• #37 This is the situation in many application areas of machine learning, for example:
• #38 For instance, in content-based image retrieval, a user usually poses several example images as a query and asks the system to return similar images. In this situation there are many unlabeled examples (the images that exist in the database) but only several labeled examples. Another instance is online web page recommendation: when a user is surfing the Internet, he may occasionally encounter some interesting web pages and may want the system to bring him similarly interesting pages. It is difficult to require the user to confirm more interesting pages as training examples, because the user may not know where they are. In this instance there are a lot of unlabeled examples but only a few labeled ones. In such cases there may be only one labeled training example to rely on; if an initial weakly useful predictor cannot be generated from this single example, the above-mentioned SSL techniques cannot be applied. [4]
  • #39 Because the RBF kernel nonlinearly maps examples into a higher dimensional space, unlike the linear kernel, it can handle the situation when the relation between class labels and attributes is nonlinear.
• #40 This method is very simple: for example, if example x1 has k nearest examples in the feature space and the majority of them have the same label y1, then example x1 belongs to y1. Although the KNN method relies on limit theorems in theory, the decision for each example involves only a small number of nearest neighbors, so adopting this method can avoid the problem of imbalanced examples; moreover, KNN depends mainly on the limited number of surrounding nearest neighbors rather than on a decision boundary.
• #47 The examples located around the boundary are easy to misclassify, but they are likely to be the support vectors; we call them boundary vectors. We therefore pick out these boundary vectors, whose labels as assigned by the weaker SVM classifier are fuzzy.