1. Learning on the Border: Active Learning in Imbalanced Data Classification
   Seyda Ertekin, Jian Huang, Léon Bottou, C. Lee Giles. CIKM '07.
   Presenter: Ping-Hua Yang
2. Abstract
   - This paper is concerned with the class imbalance problem, which is known to hinder the learning performance of classification algorithms.
   - The paper demonstrates that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes.
3. Outline
   - Introduction
   - Related work
   - Methodology
   - Performance metrics
   - Datasets
   - Experiments and empirical evaluation
   - Conclusions
4. Introduction
   - A training dataset is called imbalanced if at least one of the classes is represented by a significantly smaller number of instances than the others.
   - Examples of applications that may face the class imbalance problem:
     - Predicting pre-term births
     - Identifying fraudulent credit card transactions
     - Text categorization
     - Classification of protein databases
     - Detecting certain objects in satellite images
5. Introduction
   - In classification tasks it is generally more important to correctly classify the minority-class instances: mispredicting a rare event can have more serious consequences.
   - However, in classification problems with imbalanced data, minority-class examples are more likely to be misclassified than majority-class examples, owing to the design principles of most machine learning algorithms.
   - This paper proposes a framework with high prediction performance to overcome this serious data mining problem, namely:
     - an active learning strategy to deal with the class imbalance problem, and
     - an SVM-based active learning selection strategy.
6. Introduction
   - A common recent research direction for overcoming the class imbalance problem is to resample the original training dataset to create more balanced classes.
7. Related work
   - Assign distinct misclassification costs ([P. Domingos, 1999], [M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, C. Brunk, 1994]):
     - The misclassification penalty for the positive class is set higher than that for the negative class.
     - This method requires tuning to find good penalty parameters for the misclassified examples.
   - Resample the original training dataset ([N. V. Chawla, 2002], [N. Japkowicz, 1995], [M. Kubat, 1997], [C. X. Ling, 1998]), either by over-sampling the minority class or under-sampling the majority class:
     - Under-sampling may discard potentially useful data that could be important.
     - Over-sampling may suffer from over-fitting, and the increased number of samples lengthens the training process.
8. Related work
   - Use recognition-based, instead of discrimination-based, inductive learning ([N. Japkowicz, 1995], [B. Raskutti, 2004]):
     - These methods attempt to measure the amount of similarity between a query object and the target class.
     - Their major drawback is the need to tune the similarity threshold.
   - SMOTE, the synthetic minority over-sampling technique ([N. V. Chawla, 2002]):
     - The minority class is over-sampled by creating synthetic examples rather than by sampling with replacement.
     - Preprocessing the data with SMOTE may lead to improved prediction performance.
     - SMOTE brings more computational cost and increases the number of training examples.
9. Methodology
   - Active learning has access to a vast pool of unlabeled examples and tries to make a clever choice, selecting the most informative example and obtaining its label.
   - The strategy of selecting instances within the margin addresses imbalanced dataset classification very well.
10. Methodology (figure)
11. Support Vector Machines
   - SVMs are well known for their strong theoretical foundations, generalization performance, and ability to handle high-dimensional data.
   - Using the training set, the SVM builds an optimum hyper-plane, obtained by minimizing the following objective function:

     min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i
     s.t.  y_i (w · Φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0    (1)

   - w: the normal vector of the hyper-plane; y_i: labels; Φ(·): mapping from input space to feature space; b: offset; ξ_i: slack variables.
12. Support Vector Machines
   - The dual representation of equation 1:

     max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
     s.t.  Σ_{i=1}^{N} y_i α_i = 0,  0 ≤ α_i ≤ C    (3)

   - K(x_i, x_j) = Φ(x_i) · Φ(x_j); α_i: Lagrange multipliers.
   - After solving this QP problem, the normal vector of the hyper-plane w can be represented as

     w = Σ_{i=1}^{N} α_i y_i Φ(x_i)    (5)
13. Support Vector Machines (figure)
14. Active Learning
   - In equation 5, only the support vectors have an effect on the SVM solution: if the SVM is retrained on a new dataset consisting only of those support vectors, the learner finds the same hyper-plane (see the sketch below).
   - This paper focuses on a selection strategy called SVM-based active learning.
     - With SVMs, the most informative instance is the one closest to the hyper-plane.
     - More complex selection methods exist to handle the possibility of a non-symmetric version space.
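A minimal sketch of this property (not from the paper), using scikit-learn with an arbitrary synthetic dataset and fixed kernel parameters: refitting on the support vectors alone recovers the same decision boundary.

```python
# Minimal sketch (not from the paper): only the support vectors
# determine the SVM solution, so retraining on them alone recovers
# the same hyper-plane. Dataset and parameters are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

params = dict(kernel="rbf", C=1.0, gamma=0.5)   # fixed gamma keeps both kernels identical
svm_full = SVC(**params).fit(X, y)

sv = svm_full.support_                          # indices of the support vectors
svm_sv = SVC(**params).fit(X[sv], y[sv])        # retrain on the support vectors only

# The two models agree on (essentially) every point.
agree = np.mean(np.sign(svm_full.decision_function(X)) ==
                np.sign(svm_sv.decision_function(X)))
print(f"prediction agreement: {agree:.3f}")
```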
15. Active Learning with Small Pools
   - The basic working principle of SVM active learning (a sketch follows this slide):
     - Learn an SVM on the existing training data.
     - Select the instance closest to the hyper-plane.
     - Add the newly selected instance to the training set and train again.
   - In classical active learning, the search for the most informative instance is performed over the entire dataset.
     - For large datasets, searching the entire training set is a very time-consuming and computationally expensive task.
   - The "59 trick" does not necessitate a full search through the entire dataset, but instead locates an approximately most informative sample.
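A hedged sketch of this loop in Python. `X_pool` (a NumPy array of the full training pool), `oracle` (a function returning the label of an index on demand), and `seed_idx` are illustrative assumptions, not names from the paper.

```python
# Hedged sketch of the basic SVM active-learning loop: train, query the
# unlabeled instance closest to the hyper-plane, add it, retrain.
# `X_pool`, `oracle`, `seed_idx` are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def svm_active_learning(X_pool, oracle, seed_idx, n_queries=50):
    labeled = list(seed_idx)
    labels = [oracle(i) for i in labeled]
    unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]
    for _ in range(n_queries):
        svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_pool[labeled], labels)
        # |f(x)| is proportional to the distance to the hyper-plane;
        # the most informative instance has the smallest |f(x)|.
        dist = np.abs(svm.decision_function(X_pool[unlabeled]))
        pick = unlabeled[int(np.argmin(dist))]
        labeled.append(pick)
        labels.append(oracle(pick))      # ask the oracle for the label
        unlabeled.remove(pick)
    return SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_pool[labeled], labels)
```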
16. Active Learning with Small Pools
   - The selection method picks L (L << number of training instances) random training samples in each iteration and selects the best among them:
     - Pick a random subset X_L, L << N.
     - From X_L, select the sample x_i closest to the hyper-plane; x_i is then among the top p% closest instances in X_N with probability (1 − η).
   - Choosing L so that (1 − p)^L ≤ η guarantees this; for p = 0.05 and η = 0.05 this gives L = 59, hence the name "59 trick" (see the sketch below).
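A small sketch of the small-pool selection under the same assumptions as above; `select_from_small_pool` would replace the full argmin scan in the earlier loop, and `pool_size` derives L from p and η.

```python
# Hedged sketch of the small-pool ("59 trick") selection: instead of
# scanning every unlabeled instance, sample L candidates and take the
# one closest to the hyper-plane.
import math
import numpy as np

def pool_size(p=0.05, eta=0.05):
    # Smallest L with (1 - p)^L <= eta: the best of L random samples is
    # among the top p% with probability at least 1 - eta.  Gives 59 here.
    return math.ceil(math.log(eta) / math.log(1.0 - p))

def select_from_small_pool(svm, X_pool, unlabeled, L=59, rng=None):
    rng = rng or np.random.default_rng()
    candidates = rng.choice(unlabeled, size=min(L, len(unlabeled)), replace=False)
    dist = np.abs(svm.decision_function(X_pool[candidates]))
    return int(candidates[np.argmin(dist)])
```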
17. Active Learning with Small Pools (figure)
18. Online SVM for Active Learning
   - LASVM is an online kernel classifier which relies on the traditional soft-margin SVM formulation.
     - LASVM requires fewer computational resources.
   - LASVM's model is continually modified as it processes training instances one by one.
     - Each LASVM iteration receives a fresh training example and tries to optimize the dual cost function in equation 3 using feasible-direction searches.
   - A new informative instance selected by active learning can thus be integrated into the existing model without repeatedly retraining on all samples (an illustrative stand-in follows this slide).
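LASVM's feasible-direction dual updates are too involved for a short sketch. As a stand-in illustrating the same online pattern (one incremental update per queried instance, no full retrain), the following uses scikit-learn's SGDClassifier with hinge loss, i.e. a linear SVM trained online. This is explicitly not LASVM, and `X_pool`/`oracle` remain the illustrative names from the earlier sketches.

```python
# Stand-in, NOT LASVM: integrates each newly selected instance with a
# single incremental update instead of retraining on all samples,
# using an online linear SVM (SGD with hinge loss).
import numpy as np
from sklearn.linear_model import SGDClassifier

def online_active_learning(X_pool, oracle, seed_idx, n_queries=50):
    clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
    y_seed = [oracle(i) for i in seed_idx]          # seed must cover both classes
    clf.partial_fit(X_pool[list(seed_idx)], y_seed, classes=np.unique(y_seed))
    unlabeled = [i for i in range(len(X_pool)) if i not in set(seed_idx)]
    for _ in range(n_queries):
        dist = np.abs(clf.decision_function(X_pool[unlabeled]))
        pick = unlabeled[int(np.argmin(dist))]
        clf.partial_fit(X_pool[pick:pick + 1], [oracle(pick)])  # one online step
        unlabeled.remove(pick)
    return clf
```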
19. Active Learning with Early Stopping
   - A theoretically sound point to stop training is when the examples in the margin are exhausted.
   - To check whether there are still unseen training instances in the margin, the distance of the newly selected instance is compared with that of the support vectors of the current model; once the selected instance lies farther than the support vectors, no informative instances remain (a sketch of this check follows).
   - A practical implementation of this idea is to count the number of support vectors during the active learning process.
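In SVM terms, support vectors satisfy |f(x)| ≤ 1, so a candidate with |f(x)| ≥ 1 already lies outside the margin. A minimal sketch of the stopping check under that assumption:

```python
# Minimal sketch of the early-stopping check: support vectors satisfy
# |f(x)| <= 1, so if even the closest remaining candidate has
# |f(x)| >= 1, no unseen instance is left inside the margin.
import numpy as np

def margin_exhausted(svm, X_pool, unlabeled):
    dist = np.abs(svm.decision_function(X_pool[unlabeled]))
    return float(dist.min()) >= 1.0
```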
20. Active Learning with Early Stopping (figure)
21. Performance Metrics
   - Classification accuracy is not a good metric for evaluating classifiers in applications with the class imbalance problem.
     - In the non-separable case, if the misclassification penalty C is very small, the SVM learner simply tends to classify every example as negative.
   - G-means: the geometric mean of sensitivity and specificity, where
     - sensitivity = TP / (TP + FN)
     - specificity = TN / (TN + FP)
   - Receiver Operating Characteristic (ROC) curve:
     - A plot of the true positive rate against the false positive rate as the decision threshold is changed.
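A short sketch computing G-means from a confusion matrix with scikit-learn, assuming binary labels in {0, 1}:

```python
# Hedged sketch: G-means, the geometric mean of sensitivity and
# specificity, from a binary confusion matrix (labels in {0, 1}).
import numpy as np
from sklearn.metrics import confusion_matrix

def g_means(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # TP / (TP + FN)
    specificity = tn / (tn + fp)   # TN / (TN + FP)
    return np.sqrt(sensitivity * specificity)
```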
22. Performance Metrics
   - Area under the ROC curve (AUC):
     - A numerical measure of a model's discrimination performance.
     - Shows how successfully and correctly the model separates the positive and negative classes.
   - Precision-Recall Break-Even Point (PRBEP):
     - The accuracy of the positive class at the threshold where precision equals recall.
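A sketch computing both metrics from continuous scores; the PRBEP here is approximated as the point on the precision-recall curve where precision and recall are closest.

```python
# Hedged sketch: AUC via scikit-learn, and PRBEP approximated as the
# point on the precision-recall curve where precision ~= recall.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def auc_and_prbep(y_true, scores):
    auc = roc_auc_score(y_true, scores)
    precision, recall, _ = precision_recall_curve(y_true, scores)
    i = int(np.argmin(np.abs(precision - recall)))
    prbep = (precision[i] + recall[i]) / 2.0   # precision ~= recall at index i
    return auc, prbep
```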
23. Datasets (table)
24.-31. Experiments and Empirical Evaluation (results figures)
32. Conclusions
   - The results of this paper offer a better understanding of the effect of active learning on imbalanced datasets.
   - By focusing learning on the instances around the classification boundary, more balanced class distributions can be provided to the learner in the earlier steps of learning.
