Evaluating multi-class classification using binary SVMs
                        (IT-642: Course Project)

             Vijay T. Raisinghani                              Pradeep Jagannath
              rvijay@it.iitb.ac.in                            pradeep@it.iitb.ac.in

              Roll No: 01429703                                Roll No: 00329010

                                           Abstract

We study how SVM-based binary classifiers are used for multi-way classification. We present the
results of experiments run on various UCI and KDD datasets, using the SVMLight package. The
methods evaluated are 1-versus-1, 1-versus-many, and Error-Correcting Output Coding (ECOC).


1. Main Objectives
- Use three binary classification schemes (1-vs-1, 1-vs-many, and ECOC) with SVMLight [SVM02] on various UCI datasets and the KDD intrusion detection dataset.
- Report accuracy and run time for the various methods.

2. Status and other details
- Fully completed
- Percentage contribution of members:
   - Pradeep Jagannath – 50%
   - Vijay T. Raisinghani – 50%
- Total time spent on the project: ??

3. Major stumbling blocks
- SVM parameter estimation. We referred to [Dua01] and other papers (see section "Related Work") and had discussions with Shantanu Godbole (Ph.D. student, KR School of IT, IIT Bombay) to "estimate" the required kernel parameters. Still, with no prior estimate of the time required for the various datasets, especially the KDD dataset, we had to abort tests that had been running for days.
- The KDD dataset was tested using only 1% of the data (i.e. 50,000 records). The full dataset has 5 million records; even 10% of the records took a very long time to process.

4. Introduction
Many supervised learning tasks can be cast as the problem of assigning elements to a finite set of classes or categories. For example, the goal of optical character recognition (OCR) is to determine the digit value (0..9) from its image. A number of other applications also require such multi-way classification, e.g. text and speech categorisation, natural language processing tasks, and gesture and object recognition in machine vision [All00].

In designing machine learning algorithms, it is often easier to first devise algorithms for distinguishing between only two classes [All00]. Ensemble schemes have been proposed which use binary (two-class) classification algorithms to solve K-class classification problems. Decomposing a K-class classification problem into a number of binary classification problems allows an ensemble scheme to model binary class boundaries with much greater flexibility at a lower computational cost [Goh01].

Three representative ensemble schemes are one per class (1-vs-many), pairwise coupling (1-vs-1), and error-correcting output coding (ECOC) [Goh01].

1. One per class (OPC). This is also known as "one against others." OPC trains K binary classifiers, each of which separates one class from the other (K - 1) classes. Given a point X to classify, the binary classifier with the largest output determines the class of X.
2. Pairwise coupling (PWC). PWC constructs K(K-1)/2 pairwise binary classifiers, one for each pair of classes. The classifying decision is made by aggregating the outputs of the pairwise classifiers.
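As a minimal illustration of these two decision rules, the following C sketch applies OPC (pick the classifier with the largest output) and PWC (majority vote over the pairwise winners) to hypothetical real-valued classifier outputs; the class count and all numeric values are made up for illustration.

    #include <stdio.h>

    #define K 4 /* number of classes (illustrative) */

    /* OPC: the binary classifier with the largest output determines the class. */
    int opc_decide(const double out[K]) {
        int best = 0;
        for (int i = 1; i < K; i++)
            if (out[i] > out[best]) best = i;
        return best;
    }

    /* PWC: out[i][j] > 0 means the classifier for pair (i, j) prefers
       class i; the class collecting the most pairwise votes wins. */
    int pwc_decide(const double out[K][K]) {
        int votes[K] = {0};
        for (int i = 0; i < K; i++)
            for (int j = i + 1; j < K; j++) {
                if (out[i][j] > 0) votes[i]++;
                else               votes[j]++;
            }
        int best = 0;
        for (int i = 1; i < K; i++)
            if (votes[i] > votes[best]) best = i;
        return best;
    }

    int main(void) {
        double opc_out[K] = {-0.4, 1.2, 0.3, -1.0};   /* made-up outputs */
        double pwc_out[K][K] = {
            {0.0,  0.5, -0.2,  0.9},   /* only the upper triangle is used */
            {0.0,  0.0,  0.7,  0.4},
            {0.0,  0.0,  0.0, -0.1},
            {0.0,  0.0,  0.0,  0.0},
        };
        printf("OPC picks class %d\n", opc_decide(opc_out));
        printf("PWC picks class %d\n", pwc_decide(pwc_out));
        return 0;
    }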
3. Error-correcting output coding (ECOC). ECOC was first proposed by Dietterich and Bakiri [Die95] to reduce classification error by exploiting the redundancy of the coding scheme. ECOC assigns each class a binary codeword and employs one binary classifier per codeword bit, with the codewords chosen such that the Hamming distance between each pair is large enough to enable good error correction.
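The ECOC decision rule can be sketched in C as follows: a test point is labelled with the class whose codeword is nearest in Hamming distance to the vector of bits predicted by the binary classifiers. The 4-class, 7-bit code matrix is a small illustrative assumption, not the 31-bit BCH code used in our experiments (see section 6).

    #include <stdio.h>

    #define K 4   /* classes (illustrative)  */
    #define N 7   /* codeword length in bits */

    /* Illustrative code matrix: one N-bit codeword per class. */
    static const int code[K][N] = {
        {0,0,0,0,0,0,0},
        {0,1,1,1,1,0,0},
        {1,0,1,1,0,1,0},
        {1,1,0,1,0,0,1},
    };

    /* Decode: pick the class whose codeword has the smallest
       Hamming distance to the received bit vector. */
    int ecoc_decode(const int received[N]) {
        int best = 0, best_dist = N + 1;
        for (int c = 0; c < K; c++) {
            int dist = 0;
            for (int b = 0; b < N; b++)
                dist += (code[c][b] != received[b]);
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        return best;
    }

    int main(void) {
        /* Bits predicted by the N binary classifiers: class 2's
           codeword with its last bit flipped. */
        int received[N] = {1,0,1,1,0,1,1};
        printf("ECOC decodes to class %d\n", ecoc_decode(received));
        return 0;
    }

Even with one bit classifier wrong, the point still decodes to class 2, which is the error-correction property ECOC exploits.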
5. Related Work
[Die95] discusses the use of the ECOC method versus multi-way classification using decision trees. [Gha00] and [Ren01] present the use of ECOC for improving the performance of Naïve Bayes for text classification. [All00], [Wes99], and [Hsu02] propose extensions to SVMs for multi-way classification. [Goh01] provides details of how to boost the output of binary SVMs for image classification. [Mor98] discusses various methods of combining the outputs of 1-vs-1 classifiers.

6. Implementation details
All our test scripts were shell scripts which invoked SVMLight. Additionally, for ECOC we used a modified form of bch3.c [Zar02], an encoder/decoder program for BCH codes written in C. We modified the program to encode and decode in parts. The program generates the code matrix based on the input set of classes and accepts the 'received' code from our shell scripts for decoding. We hard-coded the other parameters: the code length to 31 bits and the error-correcting capability to 15 bits. This resulted in a maximum data length of 6 bits, i.e. we could encode a maximum of 64 classes with these settings. This was sufficient for the data sets we used, which had a maximum of 26 classes.
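To show how the pieces of this pipeline fit together, the sketch below converts the per-bit SVM outputs for one test record into the 'received' codeword that is handed to the decoder. Treating a positive SVMLight margin as bit 1 is an assumption of the sketch, and the margin values here are faked for illustration; in our pipeline they came from the per-bit svm_classify prediction files.

    #include <stdio.h>

    #define NBITS 31  /* code length used in our experiments */

    /* Hypothetical helper: threshold the NBITS signed SVM outputs
       for one test record into a 'received' codeword. */
    void margins_to_codeword(const double margin[NBITS], int bits[NBITS]) {
        for (int b = 0; b < NBITS; b++)
            bits[b] = (margin[b] > 0.0) ? 1 : 0;
    }

    int main(void) {
        double margin[NBITS];
        int bits[NBITS];
        for (int b = 0; b < NBITS; b++)          /* faked margins */
            margin[b] = (b % 3 == 0) ? 0.8 : -0.5;
        margins_to_codeword(margin, bits);
        for (int b = 0; b < NBITS; b++)
            printf("%d", bits[b]);
        printf("\n");
        return 0;
    }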
7. Experiments and Results
All our tests were run on cygnus (a three-processor PIII with 512 MB RAM, running Linux 2.4.17 on RedHat 7.1).

We experimented with various kernel settings. In some cases the test did not terminate and had to be aborted; for some settings, the accuracy of all the methods was very low. We exclude the results for which the test had to be aborted or the accuracy was low for all the methods.

Data Set                  No. of Classes           Train records   Test records
iris                      3                        100             50
dermatology               6                        244             122
glass                     7                        142             72
ecoli                     8                        224             112
yeast                     10                       989             495
KDD intrusion detection   23 (train), 38 (test)    4,898,431       311,029
letter                    26                       15,000          5,000

Figure 1: Data sets

The dermatology dataset had missing values in the age attribute. We substituted these with the most frequent value of age.

For the KDD dataset we did the following:
- Reduced the training data to only 1%, i.e. 50,000 records.
- Scanned this 1% training set for duplicates and found that 50% of the records were duplicates. After eliminating them, the training data had 23,000 records.
- One training record had 55 features while all others had 41. We eliminated this record, although it may not have contributed to any problems.
- Performed feature selection using "Inducer" [MLC++] and C4.5, selecting 16 features from the original set of 41.
- Stratified the de-duplicated file to at most 50 records per class to get 689 records, in order to run a simpler and faster test (a sketch of this capping step follows the results below).
- Used these 689 records to train with the RBF kernel with parameters -g 0.03 -c 10 -q 50 -n 40.

For KDD with feature subset selection (FSS), using only the 689 training records and testing on 10 percent of the test file (29,615 records):
- 5,000 test records had class labels that were non-existent in the training data; these records directly contributed to classification errors.
- 1-vs-1: almost all records got classified as class 1 (3.7% accuracy).
- 1-vs-many: 13,687 errors (53.9% accuracy).
- ECOC: 12,489 errors, i.e. (29,615 - 12,489) / 29,615 ≈ 57.8% accuracy.
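As referenced in the preprocessing list above, here is a minimal sketch of the per-class capping used for stratification. Reading one class label per input line (rather than parsing full KDD records) is a simplifying assumption of the sketch.

    #include <stdio.h>
    #include <string.h>

    #define MAX_CLASSES 64   /* KDD has at most 38 distinct labels */
    #define CAP 50           /* max records kept per class         */

    int main(void) {
        char label[MAX_CLASSES][256];
        int count[MAX_CLASSES] = {0};
        int nclasses = 0;
        char line[256];

        /* Echo an input line only while its class is under the cap. */
        while (fgets(line, sizeof line, stdin)) {
            int c;
            for (c = 0; c < nclasses; c++)
                if (strcmp(label[c], line) == 0) break;
            if (c == nclasses && nclasses < MAX_CLASSES)
                strcpy(label[nclasses++], line);    /* new class seen */
            if (c < MAX_CLASSES && count[c]++ < CAP)
                fputs(line, stdout);
        }
        return 0;
    }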
Data Set                  Parameters
iris                      kernel: poly d=3; other params: c=0.001
dermatology               kernel: RBF g=0.01; other params: c=10
glass                     kernel: RBF g=0.8; other params: c=10
ecoli                     kernel: poly d=3; other params: -
yeast                     kernel: RBF g=10; other params: c=10
KDD intrusion detection   kernel: RBF g=0.001; other params: c=1, q=50, n=40
letter                    kernel: RBF g=0.01; other params: c=1, q=50, n=40

Figure 2: SVM parameters for the various data sets
Dataset                      1-vs-1   1-vs-many   ECOC
iris                         94%      98%         98%
dermatology                  86.9%    89.3%       90.98%
glass                        66.7%    68%         73.6%
ecoli                        80.3%    55.3%       73.2%
yeast                        53.3%    44.6%       55.8%
KDD intrusion detection*     5.8%     47.5%       63.7%
KDD intrusion detection++    3.7%     53.9%       57.8%
letter                       34%      78.5%       88.38%

Figure 3: Accuracy with the various methods
* 1% train data, de-duplicated = 23,000 records; 10% test data = 29,615 records
++ 0.01% train data, de-duplicated = 689 records; 10% test data = 29,615 records

[Chart: "#Classes vs accuracy" — accuracy of 1-vs-1, 1-vs-many, and ECOC plotted against the number of classes (3, 6, 7, 8, 10, 23, 23, 26)]
The table below reports the file-conversion ("File Convert") and learning ("Learn") times, in seconds, for each method (* and ++ denote the same KDD subsets as in Figure 3):

Dataset                      1-vs-1               1-vs-many            ECOC
                             File Convert  Learn  File Convert  Learn  File Convert  Learn
iris                         0.09          0.11   0.06          0.17   0.79          1.21
dermatology                  0.43          3.1    0.29          1.19   2.1           9.79
glass                        0.29          1.14   0.13          0.38   0.96          3.45
ecoli                        0.65          1.22   0.19          0.45   1.16          2.53
yeast                        1.66          9.87   0.6           14.73  4.29          97.1
KDD intrusion detection*     75.96         6530   44.37         2980   81.25         13675
KDD intrusion detection++    6.63          101.9  1.44          19.87  1.2           83.96
letter                       105.4         1609   33.49         972.7  108           9230

[Chart: "convert vs rows/class" — file-conversion time per dataset for 1v1, 1vm, and ECOC]
[Chart: "learn vs rows/class" — learning time per dataset for 1v1, 1vm, and ECOC]
References

[All00]  E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.

[Die95]  T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

[Dua01]  Kaibo Duan, S. Sathiya Keerthi, and Aun Neow Poo. ICONIP 2001, 8th International Conference on Neural Information Processing, Shanghai, China, November 14-18, 2001.

[Gha00]  Rayid Ghani. Using error-correcting codes for text classification. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

[Goh01]  K. Goh, E. Chang, and K. Cheng. SVM binary classifier ensembles for image classification. CIKM'01, November 5-10, 2001, Atlanta, Georgia, USA.

[Hsu02]  C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2002), 415-425.

[MLC++]  Silicon Graphics, Inc. MLC++, http://www.sgi.com/tech/mlc/, 2002.

[Mor98]  M. Moreira and E. Mayoraz. Improving pairwise coupling classification with error correcting classifiers. Proceedings of the Tenth European Conference on Machine Learning, April 1998.

[Ren01]  Jason D. M. Rennie. Improving multi-class text classification with naïve Bayes. Master's thesis, Massachusetts Institute of Technology, 2001.

[SVM02]  Thorsten Joachims. SVMLight, http://svmlight.joachims.org/, Cornell University, Department of Computer Science.

[Wes99]  J. Weston. Extensions to the Support Vector Method. PhD thesis, Royal Holloway University of London, 1999.

[Zar02]  R. Morelos-Zaragoza. BCH codes, The Error Correcting Codes (ECC) Page, http://www.csl.sony.co.jp/person/morelos/ecc/codes.html, 2002.
