On the Stratification of Multi-Label Data

Stratified sampling is a sampling method that takes into account the existence of disjoint groups within a population and produces samples where the proportion of these groups is maintained. In single-label classification tasks, groups are differentiated based on the value of the target variable. In multi-label learning tasks, however, where there are multiple target variables, it is not clear how stratified sampling could/should be performed. This paper investigates stratification in the multi-label data context. It considers two stratification methods for multi-label data and empirically compares them along with random sampling on a number of datasets and based on a number of evaluation criteria. The results reveal some interesting conclusions with respect to the utility of each method for particular types of multi-label datasets.

Speaker notes:

  • What about multi-label data?
  • Could also say this together with the next slide.
  • Why do we study ED? Because our iterative stratification approach does not respect the desired number of examples per subset.
  • Add animations.
  • Only fails in the 4 datasets where the minimum number of examples per label is 1.
  • We assumed that CV offers a good estimate of the test error and focused on its variance; it is not real CV in the iterative approach.

Presentation Transcript

  • On the Stratification of Multi-Label Data. Konstantinos Sechidis, Grigorios Tsoumakas, Ioannis Vlahavas. Machine Learning & Knowledge Discovery Group, Department of Informatics, Aristotle University of Thessaloniki, Greece.
  • Stratified Sampling
    - Sampling plays a key role in practical machine learning and data mining:
      - Exploration and efficient processing of vast data
      - Generation of training, validation and test sets for accuracy estimation, model selection, hyper-parameter selection and overfitting avoidance (e.g. reduced error pruning)
    - The stratified version of sampling is typically used in classification tasks:
      - The proportion of the examples of each class in a sample of a dataset follows that of the full dataset
      - It has been found to improve standard cross-validation both in terms of bias and variance of the estimate (Kohavi, 1995)
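As a concrete baseline for the slide above, here is a minimal sketch of single-label stratified assignment to k folds; the helper name and the round-robin dealing are illustrative assumptions, not the paper's implementation.

    import random
    from collections import defaultdict

    def stratified_folds(labels, k, seed=0):
        """Assign example indices to k folds so that each fold keeps
        (approximately) the class proportions of the full dataset."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            by_class[y].append(idx)
        folds = [[] for _ in range(k)]
        offset = 0
        for idxs in by_class.values():
            rng.shuffle(idxs)
            # deal this class's examples round-robin, continuing where the
            # previous class stopped so that fold sizes stay balanced
            for pos, idx in enumerate(idxs):
                folds[(offset + pos) % k].append(idx)
            offset += len(idxs)
        return folds

    # e.g. stratified_folds(['a', 'a', 'b', 'a', 'b', 'b'], k=3)
    # -> each fold receives one 'a' and one 'b'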
  • Stratifying Multi-Label Data Instances associated with a subset of a fixed set of labels Male, Horse, Natural, Animals, Sunny, Day, Mountains, Clouds, Sky, Plants, Outdoor
  • Stratifying Multi-Label Data Random sampling is typically used in the literature We consider two main approaches for the stratification of multi-label data  Stratified sampling based on labelsets (label combinations)  The number of labelsets is often quite large and each labelset is associated with very few examples, rendering this approach impractical  Set as goal the maintenance of the distribution of positive and negative examples of each label  This views the problem independently for each label  It cannot be achieved by simple independent stratification of each label, as the produced subsets need to be the same  Our solution: iterative stratification of labels
  • Stratification Based on Labelsets
    - Dataset of the running example (instance: λ1 λ2 λ3, labelset id in parentheses):
      i1: 1 0 1 (5), i2: 0 0 1 (1), i3: 0 1 0 (2), i4: 1 0 0 (4), i5: 0 1 1 (3), i6: 1 1 0 (6), i7: 1 0 1 (5), i8: 1 0 1 (5), i9: 0 0 1 (1)
    - The nine instances are to be split into three folds
  • Stratification Based on Labelsets (result)
    - 1st fold: i1 (1 0 1, labelset 5), i2 (0 0 1, 1), i3 (0 1 0, 2)
    - 2nd fold: i7 (1 0 1, 5), i9 (0 0 1, 1), i4 (1 0 0, 4)
    - 3rd fold: i8 (1 0 1, 5), i5 (0 1 1, 3), i6 (1 1 0, 6)
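A minimal sketch of the labelset-based approach shown above, assuming the stratified_folds helper sketched earlier: every distinct label combination is treated as one class and ordinary single-label stratification is applied to it.

    def labelset_folds(label_matrix, k, seed=0):
        """label_matrix: one binary tuple per example, e.g. i1 = (1, 0, 1).
        Each distinct tuple is treated as a single 'labelset' class."""
        labelsets = [tuple(row) for row in label_matrix]
        return stratified_folds(labelsets, k, seed)

    # The running example (labelsets 5, 1, 2, 4, 3, 6, 5, 5, 1):
    Y = [(1, 0, 1), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1),
         (1, 1, 0), (1, 0, 1), (1, 0, 1), (0, 0, 1)]
    # labelset_folds(Y, k=3) -> three folds of three example indices each;
    # the three instances with labelset 5 end up in different folds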
  • Statistics of Multi-Label Data (sorted by #labelsets / #examples)

    dataset        labels  examples  labelsets  labelsets/examples  examples per labelset (min/avg/max)  examples per label (min/avg/max)
    Scene          6       2407      15         0.01                1 / 160 / 405                        364 / 431 / 533
    Emotions       6       593       27         0.05                1 / 22 / 81                          148 / 185 / 264
    TMC2007        22      28596     1341       0.05                1 / 21 / 2486                        441 / 2805 / 16173
    Genbase        27      662       32         0.05                1 / 21 / 170                         1 / 31 / 171
    Yeast          14      2417      198        0.08                1 / 12 / 237                         34 / 731 / 1816
    Medical        45      978       94         0.10                1 / 10 / 155                         1 / 27 / 266
    Mediamill      101     43907     6555       0.15                1 / 7 / 2363                         31 / 1902 / 33869
    Bookmarks      208     87856     18716      0.21                1 / 5 / 6087                         300 / 857 / 6772
    Bibtex         159     7395      2856       0.39                1 / 3 / 471                          51 / 112 / 1042
    Enron          53      1702      753        0.44                1 / 2 / 163                          1 / 108 / 913
    Corel5k        374     5000      3175       0.64                1 / 2 / 55                           1 / 47 / 1120
    ImageCLEF2010  93      8000      7366       0.92                1 / 1 / 32                           12 / 1038 / 7484
    Delicious      983     16105     15806      0.98                1 / 1 / 19                           21 / 312 / 6495
  • Iterative Stratification Algorithm (a code sketch follows the worked example below)
    - Select the label with the fewest remaining examples
      - If rare labels are not examined with priority, they may be distributed in an undesired way, beyond subsequent repair
      - For frequent labels, there is still a chance to move the current distribution towards the desired one in a subsequent iteration, thanks to the availability of more examples
    - For each example of this label, select the subset with:
      - The largest desired number of examples for this label
      - The largest desired number of examples overall, in case of ties
      - Further ties are broken randomly
    - Update the statistics: the desired number of examples per label at each subset
    - Note: there is no hard constraint on the desired number of examples per subset
  • Worked Example (shown over several slides)
    - Data: instances i1..i9 with labels (λ1, λ2, λ3): i1 = 1 0 1, i2 = 0 0 1, i3 = 0 1 0, i4 = 1 0 0, i5 = 0 1 1, i6 = 1 1 0, i7 = 1 0 1, i8 = 1 0 1, i9 = 0 0 1
    - Label totals: λ1 = 5, λ2 = 3, λ3 = 6; desired positives per fold (3 folds): λ1 = 1.7, λ2 = 1, λ3 = 2
    - Firstly, distribute the positive examples of λ2 (the label with the fewest remaining examples): i3 → 1st fold, i5 → 3rd fold, i6 → 2nd fold
    - Secondly, distribute the positive examples of λ1: i1 → 1st fold, i4 → 3rd fold, i7 → 2nd fold, i8 → 1st fold
    - Thirdly, distribute the positive examples of λ3 (only i2 and i9 remain): i2 → 2nd fold, i9 → 3rd fold
    - After each assignment the desired counts of the receiving fold are decremented; they may become negative (e.g. for λ1 in the 1st fold: 1.7 → 0.7 → -0.3)
    - Final folds: 1st = {i3, i1, i8}, 2nd = {i6, i7, i2}, 3rd = {i5, i4, i9}
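Below is a compact sketch of the iterative stratification procedure walked through above. It follows the slide's description (label with fewest remaining examples first; fold chosen by largest desired count for that label, then by largest overall capacity, then at random), but the names and the handling of unlabeled examples are simplifications of this sketch, not the paper's exact algorithm.

    import random

    def iterative_stratification(Y, k, seed=0):
        """Y: one set of label indices per example. Returns k folds of example indices."""
        rng = random.Random(seed)
        n = len(Y)
        labels = sorted({l for y in Y for l in y})
        # desired number of examples per fold, overall and per label
        desired_size = [n / k] * k
        desired_label = {l: [sum(1 for y in Y if l in y) / k] * k for l in labels}
        remaining = set(range(n))
        folds = [[] for _ in range(k)]
        while remaining:
            # pick the label with the fewest remaining positive examples
            counts = {l: sum(1 for i in remaining if l in Y[i]) for l in labels}
            counts = {l: c for l, c in counts.items() if c > 0}
            if not counts:
                # leftover examples without any label: just balance fold sizes
                for i in sorted(remaining):
                    j = max(range(k), key=lambda f: desired_size[f])
                    folds[j].append(i)
                    desired_size[j] -= 1
                break
            lab = min(counts, key=counts.get)
            for i in [i for i in remaining if lab in Y[i]]:
                # fold with the largest desired count for this label,
                # ties broken by overall capacity, then randomly
                j = max(range(k), key=lambda f: (desired_label[lab][f],
                                                 desired_size[f], rng.random()))
                folds[j].append(i)
                remaining.discard(i)
                desired_size[j] -= 1
                for l in Y[i]:
                    desired_label[l][j] -= 1
        return folds

    # The running example, with labels lambda1, lambda2, lambda3 encoded as 0, 1, 2:
    Y = [{0, 2}, {2}, {1}, {0}, {1, 2}, {0, 1}, {0, 2}, {0, 2}, {2}]
    # iterative_stratification(Y, k=3) -> three folds of three examples each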
  • The Triggering Event
    - Implementation of evaluation software: stratification of multi-label data concerned us a while ago during the development of the Mulan open-source library
    - However, a more practical issue triggered this work: during our participation in ImageCLEF 2010, cross-validation experiments led to subsets without positive examples for some labels, and to problems in the calculation of the main evaluation measure of the challenge, Mean Average Precision
  • Subsets Without Label Examples When can this happen?  When there are rare labels Problems in calculation of evaluation measures  A test set without positive examples for a label (fn=tp=0) renders recall undefined, and so gets F1, AUC and MAP  Furthermore, if the model is correct (fp=0) then precision is undefined Predicted negative positive Recall: tp/(tp+fn) negative tn fp Precision: tp/(tp+fp) Actual positive fn tp
  • Comparison of the Approaches (labelset id in parentheses)
    - Random:
      1st fold: i1 (5), i2 (1), i3 (2); 2nd fold: i4 (4), i5 (3), i6 (6); 3rd fold: i7 (5), i8 (5), i9 (1)
    - Based on labelsets (intends to maintain the joint label distribution):
      1st fold: i1 (5), i2 (1), i3 (2); 2nd fold: i7 (5), i9 (1), i4 (4); 3rd fold: i8 (5), i5 (3), i6 (6)
    - Iterative (intends to maintain the marginal label distributions):
      1st fold: i3 (2), i1 (5), i8 (5); 2nd fold: i6 (6), i7 (5), i2 (1); 3rd fold: i5 (3), i4 (4), i9 (1)
  • Experiments Sampling approaches  Random (R)  Stratified sampling based on labelsets (L)  Iterative stratification algorithm (I) We experiment on 13 multi-label datasets  10-fold CV on datasets with up to 15k examples and  Holdout (2/3 for training and 1/3 for testing) on larger ones Experiments are repeated 5 times with different random orderings of the training examples  Presented results are averages over these 5 experiments
  • Distribution of Labels & Examples
    - Notation: q labels, k subsets, c_j the desired number of examples in subset j, D_i the set of examples of label i, S_j the set of examples in subset j, S_{ij} the set of examples of label i in subset j
    - Labels Distribution: $LD = \frac{1}{q} \sum_{i=1}^{q} \frac{1}{k} \sum_{j=1}^{k} \left| \frac{|S_{ij}|}{|S_j| - |S_{ij}|} - \frac{|D_i|}{|D| - |D_i|} \right|$
    - Examples Distribution: $ED = \frac{1}{k} \sum_{j=1}^{k} \big| |S_j| - c_j \big|$
    - Subsets without positive examples: number of folds that contain at least one label with zero positive examples (FZ); number of fold-label pairs with zero positive examples (FLZ)
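A sketch of how the four criteria above could be computed for a given fold assignment; the function name and the small guards against empty denominators are assumptions of this sketch, not part of the paper's definitions.

    def split_quality(Y, folds, desired_sizes, n_labels):
        """Y: one set of label indices per example; folds: lists of example indices;
        desired_sizes: the c_j values. Returns (LD, ED, FZ, FLZ)."""
        D = len(Y)
        k = len(folds)
        d = [sum(1 for y in Y if l in y) for l in range(n_labels)]        # |D_i|
        s = [len(f) for f in folds]                                       # |S_j|
        sij = [[sum(1 for i in f if l in Y[i]) for f in folds]
               for l in range(n_labels)]                                  # |S_ij|

        # Labels Distribution: average deviation of each label's pos/neg ratio
        # in each fold from its ratio in the full dataset (max(..., 1) guards
        # against an all-positive fold or an all-positive label)
        LD = sum(sum(abs(sij[l][j] / max(s[j] - sij[l][j], 1)
                         - d[l] / max(D - d[l], 1)) for j in range(k)) / k
                 for l in range(n_labels)) / n_labels

        # Examples Distribution: average deviation of fold sizes from the desired c_j
        ED = sum(abs(s[j] - desired_sizes[j]) for j in range(k)) / k

        # Fold-label pairs (FLZ) and folds (FZ) with zero positive examples
        FLZ = sum(1 for l in range(n_labels) for j in range(k) if sij[l][j] == 0)
        FZ = sum(1 for j in range(k)
                 if any(sij[l][j] == 0 for l in range(n_labels)))
        return LD, ED, FZ, FLZ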
  • Labels Distribution (chart)
    - LD of each method (I, L, R), normalized to the LD value of R, on all 13 datasets
    - Datasets are sorted in increasing order of #labelsets / #examples, from Scene (0.01) to Delicious (0.98)
  • Examples Distribution (chart)
    - ED of each method (R, L, I); datasets are sorted in decreasing order of #examples
    - Iterative stratification trades off the examples distribution for the labels distribution: the larger the dataset, the larger the deviation from the desired number of examples
    - Note: Mediamill contains 1730 examples without any annotation
  • Subsets Without Label Examples (results)

    dataset        labels  labelsets/examples  examples per label (min/avg/max)  FZ (R/L/I)     FLZ (R/L/I)
    Scene          6       0.01                364 / 431 / 533                   0 / 0 / 0      0 / 0 / 0
    Emotions       6       0.05                148 / 185 / 264                   0 / 0 / 0      0 / 0 / 0
    Genbase        27      0.05                1 / 31 / 171                      10 / 10 / 10   90 / 77 / 74
    Yeast          14      0.08                34 / 731 / 1816                   1 / 0 / 0      1 / 0 / 0
    Medical        45      0.10                1 / 27 / 266                      10 / 10 / 10   203 / 179 / 173
    Bibtex         159     0.39                51 / 112 / 1042                   1 / 1 / 0      1 / 1 / 0
    Enron          53      0.44                1 / 108 / 913                     10 / 10 / 10   95 / 88 / 47
    Corel5k        374     0.64                1 / 47 / 1120                     10 / 10 / 10   1140 / 1118 / 788
    ImageCLEF2010  93      0.92                12 / 1038 / 7484                  4 / 4 / 0      4 / 0 / 0

    - Iterative stratification produces the lowest FZ and FLZ in all datasets
    - All schemes fail in Genbase, Medical, Enron and Corel5k, due to label rarity
    - All schemes do well in Scene and Emotions, where examples per label abound
    - Only iterative stratification does well in Bibtex and ImageCLEF2010
  • Variance of 10-fold CV Estimates
    - Algorithms:
      - Binary Relevance (one-versus-rest)
      - Calibrated Label Ranking (Fürnkranz et al., 2008): a combination of pairwise and one-versus-rest models that considers label dependencies
    - Measures and the required type of output:
      - Hamming Loss (bipartition), Subset Accuracy (bipartition)
      - Coverage (ranking), Ranking Loss (ranking)
      - Mean Average Precision (probabilities), Micro-averaged AUC (probabilities)
  • Average Ranking for BR (1/3): on all 9 datasets
    - (Chart) Average rank of the standard deviation of the CV estimate for R, L and I across Hamming Loss, Subset Accuracy, Coverage, Ranking Loss, MAP and Micro-averaged AUC
    - MAP could not be computed on several datasets (the chart annotates 6, 4 and 7 failures across the three schemes), so the MAP ranking is based only on Scene and Emotions
  • Average Ranking for BR (2/3): on the 5 datasets where #labelsets/#examples ≤ 0.1
    - (Chart) Same layout as above
    - MAP could not be computed on several of these datasets (the chart annotates 3, 2 and 3 failures), so the MAP ranking is again based only on Scene and Emotions
  • Average Ranking for BR (3/3): on the 4 datasets where #labelsets/#examples ≥ 0.39
    - (Chart) Average rank of the standard deviation of the CV estimate for R, L and I across Hamming Loss, Subset Accuracy, Coverage, Ranking Loss and Micro-averaged AUC
    - Failures in MAP: R: 4, L: 4, I: 2
  • Average Ranking for CLR: on the 5 datasets with #labels < 50, for complexity reasons (those where #labelsets/#examples ≤ 0.1)
    - (Chart) Average rank of the standard deviation of the CV estimate for R, L and I across the six measures
    - MAP could not be computed on several of these datasets (the chart annotates 3, 2 and 3 failures), so the MAP ranking is based only on Scene and Emotions
  • BR vs CLR: on the 5 datasets where #labelsets/#examples ≤ 0.1
    - (Chart) Comparison of BR and CLR under each sampling scheme (R, L, I) across all measures
    - Iterative stratification suits BR; labelsets-based stratification suits CLR
  • Conclusions
    - Labelsets-based stratification:
      - Works well when #labelsets/#examples is small
      - Works well with Calibrated Label Ranking
    - Iterative stratification:
      - Works well when #labelsets/#examples is large
      - Works well with Binary Relevance
      - Works well for estimating the Ranking Loss
      - Handles rare labels in a better way
      - Maintains the imbalance ratio of each label in each subset
    - Random sampling:
      - Is consistently worse and should be avoided, contrary to the typical multi-label experimental setup in the literature
  • Future Work
    - Iterative stratification: investigate the effect of changing the algorithm so that it respects the desired number of examples in each subset
    - Hybrid approach: stratification based on labelsets for the examples of frequent labelsets, and iterative stratification for the rest of the examples
    - Sampling and generalization performance: conduct statistically valid experiments to assess the quality of the sampling schemes in terms of estimating the test error (unbiased and with low variance)