Genetic-based Synthetic Data
  Sets for the A l i f
  S t f th Analysis of
    Classifiers Behavior
 8th I t
     Internat...
Motivation

                                    Knowledge
     Data Set                                        Model
     ...
Outline
1.
1    Data complexity
2.   Synthetic data sets
3.   Design of GA
4.
4    Experiments and results
5.   Conclusion...
1. Data complexity
 Length of the class boundary
    Build minimum spanning tree (MST) connecting all
    the points regar...
2. Synthetic data sets
 Generation procedure
   Set the number of instances n, the number of
   attributes m and the lengt...
2. Synthetic data sets
 Exhaustive search
   Labelings grow exponentially with the number of
   instances
 Heuristic searc...
3. Design of GA
 Knowledge representation
     k-ary string where the bit i stores the class label of
     the ith instanc...
3. Design of GA
 Genetic operators
   s-wise tournament selection
   Two-point crossover
   T       it
   Bit-wise mutatio...
4. Experiment and results (I)
  Synthetic data set generation
    Different solutions < Solutions
       Population conver...
4. Experiment and results (II)
  Analysis of classifiers behavior
    Three different paradigms: C4.5, Naïve Bayes, and
  ...
5. Conclusions
 The GA allows us to generate data sets with
 the demanded length of the class boundary




     Overview a...
6. Further work
 Efficiency and scalability
   Move from simple GA to competent GA
 Capacity of satisfying multiple criter...
Genetic-based Synthetic Data
  Sets for the A l i f
  S t f th Analysis of
    Classifiers Behavior
 8th I t
     Internat...
Upcoming SlideShare
Loading in …5
×

HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

537 views

Published on

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
537
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

HIS'2008: Genetic-based Synthetic Data Sets for the Analysis of Classifiers Behavior

  1. 1. Genetic-based Synthetic Data Sets for the A l i f S t f th Analysis of Classifiers Behavior 8th I t International Conference on Hybrid Intelligent Systems ti lC f H b id I t lli tS t Núria Macià Albert Orriols-Puig Alb t O i l P i Ester Bernadó-Mansilla {nmacia,aorriols,esterb}@salle.url.edu Grup de Recerca en Sistemes Intel·ligents Enginyeria i Arquitectura La Salle Universitat Ramon Llull
  2. 2. Motivation Knowledge Data Set Model Extraction Real-world Learner problem + Prediction Necessity of synthetic data sets To evaluate real learners performance under controlled scenarios How to generate synthetic data sets? Data complexity (Ho & Basu, 2002) Length of the class boundary (Macià et al., 2008) Objective: Set of benchmark problems to analyze learners behavior Overview and Future Research Slide 2
  3. 3. Outline 1. 1 Data complexity 2. Synthetic data sets 3. Design of GA 4. 4 Experiments and results 5. Conclusions and further work Overview and Future Research Slide 3
  4. 4. 1. Data complexity Length of the class boundary Build minimum spanning tree (MST) connecting all the points regardless of class Count the number of edges joining opposite classes it l Two cases of many points in boundary: Very interleaved or random data Linearly separable problem with narrow margins Overview and Future Research Slide 4
  5. 5. 2. Synthetic data sets Generation procedure Set the number of instances n, the number of attributes m and the length of the class boundary m, b. Generate n points di t ib t d randomly and b ild G t i t distributed dl d build the MST. Label the class of each instances Overview and Future Research Slide 5
  6. 6. 2. Synthetic data sets Exhaustive search Labelings grow exponentially with the number of instances Heuristic search Demanded length of the class boundary is not always achieved No diverse solutions Genetic algorithm G ti l ith Overview and Future Research Slide 6
  7. 7. 3. Design of GA Knowledge representation k-ary string where the bit i stores the class label of the ith instance Data set i Individual i Att. 1 Att. 2 … Att. N Class 0.4 04 0.5 05 0.4 04 0 0.2 1.0 0.2 1 011011 0.5 0.3 0.4 1 0.6 0.5 0.4 0 0.7 0.1 1.0 1 0.5 0.3 0.9 1 Overview and Future Research Slide 7
  8. 8. 3. Design of GA Genetic operators s-wise tournament selection Two-point crossover T it Bit-wise mutation Fitness function fitnessi = bobj − bi Overview and Future Research Slide 8
  9. 9. 4. Experiment and results (I) Synthetic data set generation Different solutions < Solutions Population converge Pop lation con erge to the same sol tion solution {0100,1011} are equivalent individuals Intermediate complexity are obtained i early It di t l it bt i d in l generations Overview and Future Research Slide 9
  10. 10. 4. Experiment and results (II) Analysis of classifiers behavior Three different paradigms: C4.5, Naïve Bayes, and SMO Similar accuracy rates with noticeable variability Overview and Future Research Slide 10
  11. 11. 5. Conclusions The GA allows us to generate data sets with the demanded length of the class boundary Overview and Future Research Slide 11
  12. 12. 6. Further work Efficiency and scalability Move from simple GA to competent GA Capacity of satisfying multiple criteria C f f Multi-objective strategy j gy Achieve structure of real-world problems Provide a set of benchmark problems Overview and Future Research Slide 12
  13. 13. Genetic-based Synthetic Data Sets for the A l i f S t f th Analysis of Classifiers Behavior 8th I t International Conference on Hybrid Intelligent Systems ti lC f H b id I t lli tS t Núria Macià Albert Orriols-Puig Alb t O i l P i Ester Bernadó-Mansilla {nmacia,aorriols,esterb}@salle.url.edu Grup de Recerca en Sistemes Intel·ligents Enginyeria i Arquitectura La Salle Universitat Ramon Llull

×