Data Mining
Speaker: Chih-Yang Lin
Transcript

  • 1. Data Mining Speaker: Chih-Yang Lin
  • 2. Outline
    - Presents two papers about clustering in data mining:
      - Spatial Clustering Methods in Data Mining: A Survey
      - Nonparametric Genetic Clustering: Comparison of Validity Indices
    - Some ideas & future work
  • 3. What is data mining?
    - It can be considered one of the steps in Knowledge Discovery in Databases (KDD)
    - Commonly used to find useful patterns or models
    - It handles real-world data that are large, dynamic, incomplete, and noisy
  • 4. The fields of data mining
    - Mining association rules
    - Data generalization and summarization
    - Classification
    - Clustering
    - Pattern similarity
    - Sequential mining
    - ...
  • 5. Spatial Clustering Methods in Data Mining: A Survey
    Authors: Jiawei Han, M. Kamber, and A. K. H. Tung
    Source: Geographic Data Mining and Knowledge Discovery, 2001
  • 6. Introduction: clustering in DM
    - Partition methods
    - Hierarchical methods
    - Density-based methods
    - Grid-based methods
    - Model-based methods
  • 7. What makes a good clustering method?
    - Parameters
      - # of clusters, # of neighbors, radius, ...
    - Cluster shape
    - Similar size
    - Noise sensitivity
    - Input order
  • 8. Partition methods: K-means
  • 9. Partition methods: K-means
    - Time complexity: O(knt), with k clusters, n points, and t iterations
    - Needs k specified in advance
    - Unable to handle noise and outliers
    - Discovers only clusters with spherical shape
    - Finding the globally optimal centers for k clusters is NP-hard
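
For concreteness, a minimal NumPy sketch of the standard Lloyd iteration behind K-means; this is illustrative only, not code from the survey, and the function and variable names are our own:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: O(k*n*t) with n points and t iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```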
  • 10. Partition methods: K-medoids
    - Time complexity: O(k(n-k)²t)
    - More robust to noise than K-means
    - Other improved algorithms:
      - CLARA (1990), CLARANS (Clustering Large Applications based upon Randomized Search) (1994)
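
A naive PAM-style swap loop, sketched under the assumption of Euclidean distances (names are hypothetical); the O(k(n-k)²t) cost comes from trying every medoid/non-medoid swap in each sweep:

```python
import numpy as np

def kmedoids(X, k, iters=20, seed=0):
    """Naive PAM: each sweep tries every (medoid, non-medoid) swap."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(len(X), k, replace=False))

    def cost(ms):
        return D[:, ms].min(axis=1).sum()  # total distance to nearest medoid

    best = cost(medoids)
    for _ in range(iters):
        improved = False
        for i in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h                     # try swapping medoid i for h
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:
            break
    return medoids, D[:, medoids].argmin(axis=1)
```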
  • 11. Hierarchical methods
    - Two approaches:
      - Agglomerative (bottom-up), e.g. AGNES
      - Divisive (top-down), e.g. DIANA
    - [Figure: points a, b, c, d, e merged step by step (steps 0 to 4) by agglomerative AGNES, and split in the reverse order by divisive DIANA]
  • 12. Hierarchical methods
    - Difficult to decide cut or merge points
    - Irreversible: a merge or split cannot be undone
    - Time complexity: O(n²)
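
A toy agglomerative (AGNES-style) sketch using single linkage; note that this naive version is O(n³), so the O(n²) bound above requires a smarter implementation:

```python
import numpy as np

def single_linkage(X, k):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    (single-linkage distance) until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()  # single linkage
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)  # irreversible merge
    return clusters
```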
  • 13. Density-based methods
    - Discover clusters of arbitrary shape
    - Handle noise
    - Time complexity: O(n²)
  • 14. Density-based methods
    - Two parameters are needed:
      - Eps: maximum radius of the neighbourhood
      - MinPts: minimum number of points in an Eps-neighbourhood of a point
    - [Figure: points p, q, o illustrating density-reachability with MinPts = 5 and Eps = 1 cm]
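
A compact sketch of the classic DBSCAN expansion using the two parameters above; label -1 marks noise (illustrative only, O(n²) as written):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Basic DBSCAN: returns labels, with -1 = noise, 0..k-1 = cluster ids."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -2)              # -2 = not yet visited
    cid = 0
    for i in range(n):
        if labels[i] != -2:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1               # noise (may later become a border point)
            continue
        labels[i] = cid
        seeds = list(neighbors[i])
        j = 0
        while j < len(seeds):
            q = seeds[j]; j += 1
            if labels[q] == -1:
                labels[q] = cid          # former noise becomes a border point
            if labels[q] != -2:
                continue
            labels[q] = cid
            if len(neighbors[q]) >= min_pts:
                seeds.extend(neighbors[q])  # q is a core point: keep expanding
        cid += 1
    return labels
```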
  • 15. Grid-based clustering methods
    - Fast processing time: independent of the number of data objects, dependent only on the number of cells in each dimension
    - Useful on high-dimensional data
  • 16. Model-based clustering methods
    - Statistical approach
    - Neural network approach
  • 17. Comparison

    Method      | Complexity   | Parameters          | Shape limit | Similar size | Noise sensitive
    ----------- | ------------ | ------------------- | ----------- | ------------ | ---------------
    K-means     | O(nkt)       | K, (t)              | Spherical   | Yes          | Yes
    K-medoids   | O(k(n-k)²t)  | K, (t)              | Spherical   | Yes          | No
    AGNES/DIANA | O(n²)        | Cut or merge points | Spherical   | No           | No
    DBSCAN      | O(n²)        | Radius, MinPts      | No          | No           | No
    STING       | O(n)         | Grid structure      | Yes         | No           | No
  • 18. Nonparametric Genetic Clustering: Comparison of Validity Indices
    Authors: S. Bandyopadhyay and U. Maulik
    Source: IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2001
  • 19. Introduction
    - GA is a randomized, parallel search algorithm
    - GA operations: selection, crossover, mutation
    - Goal: develop a nonparametric clustering technique based on a variable string length genetic algorithm (VGA)
  • 20. VGA clustering (1)
    - String representation: a chromosome encodes one clustering result
    - Chromosome i:
      - Coordinates of the cluster centers (real numbers)
      - K_i clusters in N dimensions, so length = N * K_i
    - Partition: assign each point to the closest cluster center
    - [Layout: center 1 (c1, c2, ..., cN) | center 2 (c1, c2, ..., cN) | ... | center K_i (c1, c2, ..., cN)]
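
A small sketch of how such a variable-length string might be decoded and used for the partition step (names are hypothetical, not the paper's code):

```python
import numpy as np

def decode(chromosome, n_dims):
    """Reshape a flat VGA chromosome of length N*K_i into K_i centers of dimension N."""
    return np.asarray(chromosome).reshape(-1, n_dims)

def assign(X, centers):
    """Partition step: each point goes to its closest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)
```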
  • 21. VGA clustering (2)
    - Population initialization:
      - Chromosome length: K_i = rand() mod K* (K*: upper bound on the initial number of clusters)
      - The K_i centers are points selected at random from the data set
    - Selection: roulette wheel
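
A possible initialization and roulette-wheel selection, sketched with the simplifying assumption that every chromosome starts with at least 2 clusters (the slide's rand() mod K* formula would also allow smaller values):

```python
import numpy as np

def init_population(X, pop_size, k_max, rng):
    """Each chromosome gets a random number of centers in [2, k_max], drawn from the data."""
    pop = []
    for _ in range(pop_size):
        k_i = rng.integers(2, k_max + 1)           # number of clusters for this chromosome
        idx = rng.choice(len(X), k_i, replace=False)
        pop.append(X[idx].ravel())                 # flat string of K_i * N reals
    return pop

def roulette_select(pop, fitness, rng):
    """Roulette-wheel selection: pick a chromosome with probability proportional to fitness."""
    p = np.asarray(fitness, dtype=float)
    p /= p.sum()
    return pop[rng.choice(len(pop), p=p)]
```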
  • 22. VGA clustering (3): crossover
    - Crossover is applied stochastically with probability μ_c
    - The crossover points are bounded so that each offspring keeps at least 2 clusters:
      - C1 = rand() mod K1
      - LB(C2) = min[2, max[0, 2 - (K1 - C1)]]
      - UB(C2) = K2 - max[0, 2 - C1]
      - C2 = LB(C2) + rand() mod (UB(C2) - LB(C2))
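
The boundary formulas above can be implemented directly; a sketch (an assumption of ours: crossover points count whole centers, so genes belonging to one center are never split):

```python
import numpy as np

def crossover(p1, p2, n_dims, rng):
    """Single-point crossover on whole centers, bounded so each child keeps >= 2 clusters."""
    k1, k2 = len(p1) // n_dims, len(p2) // n_dims
    c1 = rng.integers(0, k1)                        # C1 = rand() mod K1
    lb = min(2, max(0, 2 - (k1 - c1)))              # LB(C2) from the slide
    ub = k2 - max(0, 2 - c1)                        # UB(C2) from the slide
    c2 = lb if ub <= lb else lb + rng.integers(0, ub - lb)
    # child1: first c1 centers of p1 + last (k2 - c2) centers of p2, and vice versa
    child1 = np.concatenate([p1[:c1 * n_dims], p2[c2 * n_dims:]])
    child2 = np.concatenate([p2[:c2 * n_dims], p1[c1 * n_dims:]])
    return child1, child2
```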
  • 23. VGA clustering (4)
    - Mutation:
      - Applied with probability μ_m
      - Draw a random d in [0, 1]
      - The gene value v becomes (1 ± 2d) * v when v ≠ 0, and ± 2d when v = 0
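
A sketch of the mutation rule as stated above, with the sign chosen uniformly at random (an assumption; the paper's exact sign rule may differ):

```python
import numpy as np

def mutate(chrom, mu_m, rng):
    """Perturb each gene v with probability mu_m: v -> (1 +/- 2d)*v, or +/- 2d if v == 0."""
    out = chrom.copy()
    for i in range(len(out)):
        if rng.random() < mu_m:
            d = rng.random()                        # d in [0, 1)
            sign = 1 if rng.random() < 0.5 else -1
            v = out[i]
            out[i] = (1 + sign * 2 * d) * v if v != 0 else sign * 2 * d
    return out
```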
  • 24. Fitness computation
    - Uses cluster validity indices as the fitness function:
      - DB index
      - Dunn's index
      - Generalized Dunn's index ν_GD
      - A new cluster validity index (proposed by the authors)
  • 25. Cluster validity indices (1): the DB (Davies-Bouldin) index
    - Within-cluster scatter: S_i = (1/|C_i|) * Σ_{x in C_i} ||x - z_i||, where z_i is the centroid and |C_i| the number of points of cluster i
    - Between-cluster separation: d_ij = ||z_i - z_j||
    - Ratio: R_i = max_{j ≠ i} (S_i + S_j) / d_ij
    - DB index: DB = (1/K) * Σ_{i=1..K} R_i, where K is the number of clusters (fitness: 1/DB, so smaller DB is better)
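
A direct computation of the DB index as defined above (a sketch assuming Euclidean distance and non-empty clusters):

```python
import numpy as np

def db_index(X, labels, centers):
    """Davies-Bouldin index: lower is better; the paper uses 1/DB as fitness."""
    K = len(centers)
    # within-cluster scatter S_i: mean distance of points to their centroid
    S = np.array([np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                  for i in range(K)])
    db = 0.0
    for i in range(K):
        # R_i: worst-case (S_i + S_j) / d_ij over all other clusters j
        r = max((S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                for j in range(K) if j != i)
        db += r
    return db / K
```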
  • 26. Cluster validity indices (2): Dunn's index
    - Let S and T be two nonempty subsets (clusters)
    - Diameter of S: Δ(S) = max_{x, y in S} d(x, y)
    - Set distance between S and T: δ(S, T) = min_{x in S, y in T} d(x, y)
    - Dunn's index: ν_D = min_i min_{j ≠ i} [ δ(C_i, C_j) / max_k Δ(C_k) ] (larger values indicate compact, well-separated clusters)
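
A straightforward computation of Dunn's index from the two definitions above (a sketch; quadratic in the number of points):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index: min inter-cluster set distance over max cluster diameter."""
    ids = np.unique(labels)
    clusters = [X[labels == i] for i in ids]
    # max diameter over all clusters (clusters of one point have diameter 0)
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=2).max()
               for c in clusters if len(c) > 1)
    # min set distance over all cluster pairs
    sep = min(np.linalg.norm(a[:, None] - b[None, :], axis=2).min()
              for i, a in enumerate(clusters)
              for b in clusters[i + 1:])
    return sep / diam
```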
  • 27. Cluster validity indices (3): generalized Dunn's index ν_GD
    - Same form as Dunn's index, but with:
      - δ_i: any positive, semi-definite, symmetric set distance function
      - Δ_i: any positive, semi-definite, symmetric diameter function
  • 28. Cluster validity indices (4): new cluster validity index
    - The index combines three factors: the number of clusters, within-cluster scatter (compactness), and between-cluster separation
    - Scatter: measured from the distances of points to their cluster centers z_k
    - Separation: measured from the distances between cluster centers z_i and z_j
    - Partitions that are compact and well separated score best
  • 29. Experiment: data sets
    - [Figure: artificial data sets AD_5_2 (5 clusters), AD_10_2 (10 clusters), and AD_4_3 (4 clusters)]
  • 30. Experiment: results (1)
    - [Figure: clusterings selected on the artificial data sets; panels are labeled DB/Dunn, DB/Dunn, DB/generalized Dunn, Dunn, generalized Dunn, and generalized Dunn, finding 5, 8, 4, 4, 2, and 2 clusters respectively]
  • 31. Experiment: results (2)
    - [Figure: DB index on the iris data (2 clusters); new index (10 clusters); new index on the iris data (3 clusters)]
  • 32. Experiment: results (3)
    - [Figure: number of clusters vs. generation, and 1/DB vs. generation]
  • 33. Conclusion
    - VGAs can be exploited for clustering when the number of clusters is not known a priori
    - A new cluster validity index is introduced:
      - It favors the formation of compact and separated clusters
      - It finds the correct number of clusters in cases where the other indices sometimes fail
  • 34. Disadvantages
    - Handles only spherical cluster shapes
    - Assumes clusters of similar size
    - The chromosome encoding is too complex
  • 35. Future work
    - Bottlenecks of Apriori-style algorithms:
      - Parameters to specify: support and confidence
      - High I/O cost
      - Huge number of candidate sets
    - Two efficient algorithms for association rules:
      - Hash-based algorithm
      - FP-tree (frequent-pattern tree)
  • 36. Future work
    - Bottlenecks of parallel mining of association rules:
      - Communication overhead
      - Load balancing
    - Privacy and data security: encoding
    - Bioinformatics
    - Integrated clustering methods
    - Incremental series
  • 37. Bibliography
    - Journals:
      - Data Mining and Knowledge Discovery Journal
      - IEEE Transactions on Knowledge and Data Engineering (TKDE)
    - Special interest groups:
      - ACM SIGKDD
      - ACM SIGMOD
    - Conferences:
      - IEEE International Conference on Data Mining
      - IEEE International Conference on Data Engineering (ICDE)
      - ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
      - SIAM International Conference on Data Mining (SIAMDM)
      - Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
