  1. Data Mining Speaker: Chih-Yang Lin
  2. Outline <ul><li>Present two papers on clustering in data mining </li></ul><ul><ul><li>Spatial Clustering Methods in Data Mining: A Survey </li></ul></ul><ul><ul><li>Nonparametric Genetic Clustering: Comparison of Validity Indices </li></ul></ul><ul><li>Some ideas & future work </li></ul>
  3. What is data mining? <ul><li>It can be considered one of the steps in Knowledge Discovery in Databases (KDD) </li></ul><ul><li>Commonly used to find useful patterns or models </li></ul><ul><li>It handles real-world data that are large, dynamic, incomplete, and noisy </li></ul>
  4. The fields of data mining <ul><li>Mining association rules </li></ul><ul><li>Data generalization and summarization </li></ul><ul><li>Classification </li></ul><ul><li>Clustering </li></ul><ul><li>Pattern similarity </li></ul><ul><li>Sequential mining </li></ul><ul><li>… </li></ul>
  5. Spatial Clustering Methods in Data Mining: A Survey. Authors: Jiawei Han, M. Kamber and A. K. H. Tung. Source: Geographic Data Mining and Knowledge Discovery, 2001
  6. Introduction: Clustering in DM <ul><li>Partition methods </li></ul><ul><li>Hierarchical methods </li></ul><ul><li>Density-based methods </li></ul><ul><li>Grid-based methods </li></ul><ul><li>Model-based methods </li></ul>
  7. A good clustering method? <ul><li>Parameters </li></ul><ul><ul><li># of clusters, # of neighbors, radius… </li></ul></ul><ul><li>Shape </li></ul><ul><li>Similar size </li></ul><ul><li>Noise sensitivity </li></ul><ul><li>Input order </li></ul>
  8. Partition methods: K-means
  9. Partition methods: K-means <ul><li>Time complexity: O(knt) </li></ul><ul><li>Need to specify k in advance </li></ul><ul><li>Unable to handle noise and outliers </li></ul><ul><li>Discovers only clusters with spherical shape </li></ul><ul><li>Finding globally optimal centers for k clusters is NP-hard </li></ul>
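The k-means iteration summarized above (assign each point to its nearest center, then move each center to the mean of its cluster, repeating for t rounds over n points and k centers) can be sketched as follows. This is an illustrative sketch, not the survey's code; the function name, the random seeding, and the stopping rule are assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means: O(knt) for n points, k clusters, t iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # seed with k points from the data
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:           # no center moved: converged
            break
        centers = new_centers
    return centers, clusters
```

Because each step only decreases the within-cluster squared error, the loop always terminates, but only at a local optimum, which is why the slide notes that the global problem is NP-hard.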
  10. Partition methods: K-medoids <ul><li>Time complexity: O(k(n-k)²t) </li></ul><ul><li>More robust than k-means </li></ul><ul><li>Other improved algorithms </li></ul><ul><ul><li>CLARA (1990), CLARANS (Clustering Large Applications based upon Randomized Search) (1994) </li></ul></ul>
  11. Hierarchical methods <ul><li>Two ways </li></ul><ul><ul><li>Agglomerative approach: Bottom-up </li></ul></ul><ul><ul><ul><li>E.g.: AGNES </li></ul></ul></ul><ul><ul><li>Divisive approach: Top-down </li></ul></ul><ul><ul><ul><li>E.g.: DIANA </li></ul></ul></ul>[Figure: dendrogram over points a, b, c, d, e, built bottom-up by AGNES (steps 0 to 4) and split top-down by DIANA (steps 4 to 0)]
  12. Hierarchical methods <ul><li>Difficult to decide cut points or merge points </li></ul><ul><li>Irreversible </li></ul><ul><li>Time complexity: O(n²) </li></ul>
  13. Density-based methods <ul><li>Discover clusters of arbitrary shape </li></ul><ul><li>Handle noise </li></ul><ul><li>Time complexity: O(n²) </li></ul>
  14. Density-based methods <ul><li>Two parameters are needed </li></ul><ul><ul><li>Eps: maximum radius of the neighbourhood </li></ul></ul><ul><ul><li>MinPts: minimum number of points in an Eps-neighbourhood of that point </li></ul></ul>[Figure: Eps-neighbourhoods of points p, q and o, with MinPts = 5 and Eps = 1 cm]
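Using only the two parameters above, the classic density-based algorithm (DBSCAN, named on the comparison slide) can be sketched with a naive O(n) region query per point, giving the O(n²) total cost the previous slide mentions. This is a minimal sketch under those assumptions, not the survey's code:

```python
def dbscan(points, eps, min_pts):
    """DBSCAN sketch. Returns one cluster id per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n                          # None = unvisited

    def region(i):                               # Eps-neighbourhood, includes i itself
        return [j for j in range(n)
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps * eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:                 # i is not a core point
            labels[i] = -1                       # noise (may later become a border point)
            continue
        labels[i] = cluster                      # grow a new cluster from core point i
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster              # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region(j)
            if len(nb) >= min_pts:               # j is a core point too: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels
```

Because clusters grow by chaining core points rather than by distance to a center, the method finds arbitrary shapes and leaves isolated points labelled as noise.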
  15. Grid-based clustering methods <ul><li>Fast processing time: independent of the number of data objects, dependent only on the number of cells in each dimension </li></ul><ul><li>Useful on high-dimensional data </li></ul>
  16. Model-based clustering methods <ul><li>Statistical approach </li></ul><ul><li>Neural network approach </li></ul>
  17. Comparison <table><tr><th>Method</th><th>Complexity</th><th>Parameters</th><th>Shape limit</th><th>Similar size</th><th>Noise sensitive</th></tr><tr><td>K-means</td><td>O(nkt)</td><td>K, (t)</td><td>Spherical</td><td>Yes</td><td>Yes</td></tr><tr><td>K-medoids</td><td>O(k(n-k)²t)</td><td>K, (t)</td><td>Spherical</td><td>Yes</td><td>No</td></tr><tr><td>AGNES / DIANA</td><td>O(n²)</td><td>Cut or merge points</td><td>Spherical</td><td>No</td><td>No</td></tr><tr><td>DBSCAN</td><td>O(n²)</td><td>Radius, MinPts</td><td>No</td><td>No</td><td>No</td></tr><tr><td>STING</td><td>O(n)</td><td>Grid structure</td><td>Yes</td><td>No</td><td>No</td></tr></table>
  18. Nonparametric genetic clustering: comparison of validity indices. Bandyopadhyay, S.; Maulik, U. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2001
  19. Introduction <ul><li>GA is a randomized, parallel search algorithm </li></ul><ul><li>GA operations </li></ul><ul><ul><li>Selection, crossover, mutation </li></ul></ul><ul><li>Develop a nonparametric clustering technique </li></ul><ul><ul><li>Variable string length genetic algorithm (VGA) </li></ul></ul>
  20. VGA-Clustering (1) <ul><li>String representation: a chromosome encodes one clustering result </li></ul><ul><li>Chromosome i: </li></ul><ul><ul><li>Coordinates of the cluster centers (real numbers) </li></ul></ul><ul><ul><li>K_i clusters, N dimensions: length = N*K_i </li></ul></ul><ul><li>Partition: assign each point to the closest center </li></ul>center 1 (c1, c2, …, cN) | center 2 (c1, c2, …, cN) | … | center K_i (c1, c2, …, cN)
  21. VGA-Clustering (2) <ul><li>Population initialization </li></ul><ul><ul><li>Chromosome length: </li></ul></ul><ul><ul><ul><li>K_i = rand() mod K* </li></ul></ul></ul><ul><ul><ul><li>(K*: upper bound on the initial number of clusters) </li></ul></ul></ul><ul><ul><li>K_i centers: points randomly selected from the data set </li></ul></ul><ul><li>Selection: roulette wheel </li></ul>
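The initialization and roulette-wheel selection above can be sketched as below. The clamp to at least 2 clusters is an assumption (rand() mod K* alone could yield 0 or 1, which would not encode a valid clustering), as are the function names; this is a sketch, not the paper's implementation.

```python
import random

def init_population(data, pop_size, k_max, rng):
    """Each chromosome is a list of K_i cluster centres drawn from the data,
    with K_i = rand() mod K* clamped to >= 2 (clamping is an assumption)."""
    population = []
    for _ in range(pop_size):
        k = max(2, rng.randrange(k_max))         # K_i = rand() mod K*, at least 2
        population.append([list(rng.choice(data)) for _ in range(k)])
    return population

def roulette_select(population, fitness, rng):
    """Roulette-wheel selection: pick a chromosome with probability
    proportional to its fitness."""
    total = sum(fitness)
    r = rng.uniform(0, total)
    acc = 0.0
    for chrom, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return chrom
    return population[-1]
```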
  22. VGA-Clustering (3): Crossover <ul><li>Crossover: applied stochastically with probability μ_c </li></ul><ul><li>At least 2 clusters must remain in each offspring </li></ul><ul><ul><li>Boundary: </li></ul></ul><ul><ul><ul><li>C1 = rand() mod K1 </li></ul></ul></ul><ul><ul><ul><li>LB(C2) = min[2, max[0, 2-(K1-C1)]] </li></ul></ul></ul><ul><ul><ul><li>UB(C2) = K2 - max[0, 2-C1] </li></ul></ul></ul><ul><ul><ul><li>C2 = LB(C2) + rand() mod (UB(C2) - LB(C2)) </li></ul></ul></ul>[Figure: a bad crossover example]
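The bound formulas above can be checked by coding them up: offspring 1 gets C1 clusters from parent 1 plus K2-C2 from parent 2, offspring 2 gets the rest, and LB/UB on C2 guarantee both sums stay at least 2. A sketch, with chromosomes as lists of centres (treating one cluster centre as the crossover unit is an assumption of this sketch):

```python
import random

def crossover(p1, p2, rng):
    """Single-point crossover on variable-length chromosomes, with the
    slide's bounds so each offspring keeps at least 2 clusters."""
    k1, k2 = len(p1), len(p2)
    c1 = rng.randrange(k1)                       # C1 = rand() mod K1
    lb = min(2, max(0, 2 - (k1 - c1)))           # LB(C2)
    ub = k2 - max(0, 2 - c1)                     # UB(C2)
    c2 = lb + (rng.randrange(ub - lb) if ub > lb else 0)
    child1 = p1[:c1] + p2[c2:]                   # size C1 + (K2 - C2) >= 2
    child2 = p2[:c2] + p1[c1:]                   # size C2 + (K1 - C1) >= 2
    return child1, child2
```

Note that the two offspring together always hold exactly K1 + K2 centres; the bounds only control how they are split.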
  23. VGA-Clustering (4) <ul><li>Mutation </li></ul><ul><ul><li>Applied with probability μ_m </li></ul></ul><ul><ul><li>Random d in [0,1] </li></ul></ul><ul><ul><li>Gene value v: v → (1 ± 2d)*v, when v ≠ 0 </li></ul></ul><ul><ul><li>v → ±2d, when v = 0 </li></ul></ul>
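The mutation rule above perturbs each coordinate by at most a factor of 2 (or into [-2, 2] for a zero gene). A sketch, assuming μ_m is applied per coordinate and the sign is chosen uniformly (both assumptions of this sketch):

```python
import random

def mutate(chromosome, mu_m, rng):
    """Mutate each coordinate with probability mu_m, as on the slide:
    v -> (1 +/- 2d) * v for v != 0, and v -> +/- 2d for v == 0, d in [0, 1)."""
    out = []
    for centre in chromosome:
        new_centre = []
        for v in centre:
            if rng.random() < mu_m:
                d = rng.random()
                sign = rng.choice((1, -1))
                v = (1 + sign * 2 * d) * v if v != 0 else sign * 2 * d
            new_centre.append(v)
        out.append(new_centre)
    return out
```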
  24. Fitness Computation <ul><li>Use the cluster validity indices </li></ul><ul><ul><li>DB Index </li></ul></ul><ul><ul><li>Dunn's Index </li></ul></ul><ul><ul><li>Generalized Dunn's Index ν_GD </li></ul></ul><ul><ul><li>New Cluster Validity Index (by the authors) </li></ul></ul>
  25. Cluster Validity Indices (1) <ul><li>DB (Davies-Bouldin) Index (fitness: 1/DB) </li></ul><ul><li>Within-cluster scatter: S_i = (1/|C_i|) Σ_{x∈C_i} ||x - z_i|| </li></ul><ul><li>Between-cluster separation: d_ij = ||z_i - z_j|| </li></ul><ul><li>Ratio: R_ij = (S_i + S_j) / d_ij </li></ul><ul><li>DB Index: DB = (1/K) Σ_i max_{j≠i} R_ij </li></ul><ul><li>(|C_i|: number of points in cluster i; K: number of clusters; z_i: centroid of cluster i) </li></ul>
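The DB index on this slide follows directly from its definitions: average the worst (S_i + S_j)/d_ij ratio per cluster. A sketch using Euclidean distance and mean scatter (the paper's index has generality parameters this sketch fixes to the common choice):

```python
def db_index(clusters, centroids):
    """Davies-Bouldin index: lower is better, so the GA maximises 1/DB.
    clusters: list of point lists; centroids: one centre per cluster."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    k = len(clusters)
    # S_i: mean distance of cluster i's points to its centroid z_i.
    scatter = [sum(dist(p, centroids[i]) for p in c) / len(c)
               for i, c in enumerate(clusters)]
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

Compact, well-separated clusterings give small scatters and large centroid distances, hence a small DB value and a large 1/DB fitness.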
  26. Cluster Validity Indices (2) <ul><li>Dunn's Index: let S, T be two nonempty subsets </li></ul><ul><li>Diameter of S: Δ(S) = max_{x,y∈S} d(x,y) </li></ul><ul><li>Set distance between S and T: δ(S,T) = min_{x∈S, y∈T} d(x,y) </li></ul><ul><li>Dunn's Index: ν = min_{i≠j} δ(C_i, C_j) / max_k Δ(C_k) </li></ul>
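Dunn's index pairs the two set functions on this slide: the smallest between-cluster distance over the largest cluster diameter, so higher is better. A direct sketch (Euclidean distance assumed):

```python
def dunn_index(clusters):
    """Dunn's index: min inter-cluster set distance / max cluster diameter."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    def diameter(s):                 # Delta(S): farthest pair inside one cluster
        return max(dist(x, y) for x in s for y in s)
    def set_distance(s, t):          # delta(S, T): closest pair across clusters
        return min(dist(x, y) for x in s for y in t)
    k = len(clusters)
    max_diam = max(diameter(c) for c in clusters)
    min_sep = min(set_distance(clusters[i], clusters[j])
                  for i in range(k) for j in range(i + 1, k))
    return min_sep / max_diam
```

The generalized index on the next slide keeps this ratio but lets δ and Δ be any suitable set-distance and diameter functions.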
  27. Cluster Validity Indices (3) <ul><li>Generalized Dunn's Index </li></ul><ul><ul><li>δ_i: any positive, semi-definite, symmetric set distance function </li></ul></ul><ul><ul><li>Δ_i: any positive, semi-definite, symmetric diameter function </li></ul></ul>
  28. Cluster Validity Indices (4) <ul><li>New Cluster Validity Index </li></ul><ul><li>Index: </li></ul><ul><li>Scatter: </li></ul><ul><li>Separation: </li></ul>[Figure: the index balances cluster number, compactness, and separation]
  29. Experiment: Data Sets [Figure: AD_5_2 (5 clusters), AD_10_2 (10 clusters), AD_4_3 (4 clusters)]
  30. Experiment: Result (1) [Figure: clusterings found under the labels "DB, DUNN", "DB, DUNN", "DB, G. DUNN", "DUNN", "G. DUNN", "G. DUNN", yielding 5, 8, 4, 4, 2 and 2 clusters]
  31. Experiment: Result (2) [Figure: DB on iris data: 2 clusters; New Index: 10 clusters; New Index on iris data: 3 clusters]
  32. Experiment: Result (3) [Figure: number of clusters vs. generation; 1/DB vs. generation]
  33. Conclusion <ul><li>VGAs exploited for clustering when the number of clusters is not known a priori </li></ul><ul><li>New cluster validity index introduced </li></ul><ul><ul><li>Favours compact and well-separated clusters </li></ul></ul><ul><ul><li>Finds the correct number of clusters where the other indices sometimes fail </li></ul></ul>
  34. Disadvantages <ul><li>Handles only spherical clusters </li></ul><ul><li>Assumes clusters of similar size </li></ul><ul><li>Chromosome encoding is too complex </li></ul>
  35. Future work <ul><li>Bottlenecks of Apriori algorithms </li></ul><ul><ul><li>Parameters to specify: support and confidence </li></ul></ul><ul><ul><li>Too much I/O cost </li></ul></ul><ul><ul><li>Huge number of candidate sets </li></ul></ul><ul><li>Two efficient algorithms for association rules </li></ul><ul><ul><li>Hash-based algorithm </li></ul></ul><ul><ul><li>FP (frequent pattern)-tree </li></ul></ul>
  36. Future work <ul><li>Bottlenecks of parallel mining of association rules </li></ul><ul><ul><li>Communication overhead </li></ul></ul><ul><ul><li>Load balancing </li></ul></ul><ul><li>Privacy and data security: encoding </li></ul><ul><li>Bioinformatics </li></ul><ul><li>Integrated clustering methods </li></ul><ul><li>Incremental series </li></ul>
  37. Bibliography <ul><li>Journals </li></ul><ul><ul><li>Data Mining and Knowledge Discovery Journal </li></ul></ul><ul><ul><li>IEEE Trans. on Knowledge and Data Engineering (TKDE) </li></ul></ul><ul><li>Special interest groups </li></ul><ul><ul><li>ACM-SIGKDD </li></ul></ul><ul><ul><li>ACM-SIGMOD </li></ul></ul><ul><li>Conferences </li></ul><ul><ul><li>IEEE International Conf. on Data Mining </li></ul></ul><ul><ul><li>IEEE International Conf. on Data Engineering (ICDE) </li></ul></ul><ul><ul><li>IEEE International Conf. on KDDM (KDD) </li></ul></ul><ul><ul><li>SIAM International Conf. on Data Mining (SIAMDM) </li></ul></ul><ul><ul><li>Pacific-Asia Conf. on KDDM (PAKDD) </li></ul></ul>
