Data Mining Speaker: Chih-Yang Lin
Transcript

  • 1. Data Mining Speaker: Chih-Yang Lin
  • 2. Outline
    • Present two papers on clustering in data mining
      • Spatial Clustering Methods in Data Mining: A Survey
      • Nonparametric Genetic Clustering: Comparison of Validity Indices
    • Some ideas and future work
  • 3. What is data mining
    • It can be considered one of the steps in Knowledge Discovery in Databases (KDD)
    • Commonly used to find useful patterns or models
    • It handles real-world data that are large, dynamic, incomplete, and noisy
  • 4. The fields of data mining
    • Mining association rules
    • Data generalization and summarization
    • Classification
    • Clustering
    • Pattern similarity
    • Sequential Mining
    • ……………
  • 5. Spatial Clustering Methods in Data Mining: A Survey. Authors: Jiawei Han, M. Kamber, and A. K. H. Tung. Source: Geographic Data Mining and Knowledge Discovery, 2001
  • 6. Introduction: Clustering in DM
    • Partition methods
    • Hierarchical methods
    • Density-based methods
    • Grid-based methods
    • Model-based methods
  • 7. A good clustering method?
    • Parameter
      • # of clusters, # of neighbors, radius…
    • Shape
    • Similar size
    • Noise sensitive
    • Input order
  • 8. Partition methods : K-means
  • 9. Partition methods : K-means
    • Time complexity: O(knt) for n objects, k clusters, and t iterations
    • k must be specified in advance
    • Unable to handle noisy data and outliers
    • Discovers only clusters of spherical shape
    • Finding globally optimal centers for k clusters is NP-hard
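The O(knt) bound can be read off a minimal sketch of the usual k-means iteration: each of the t passes compares all n points against all k centers. The function name `kmeans`, the fixed iteration count, and the choice of random data points as initial centers are assumptions made for illustration, not details taken from the slides.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # assumption: start from k random data points
    labels = [0] * len(points)
    for _ in range(iterations):          # t iterations
        # Assignment step: n points times k centers distance evaluations.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                  # keep the old center if a cluster empties
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels
```

Because the centers are means, a single distant outlier can drag a center away from its cluster, which is the noise sensitivity listed above.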
  • 10. Partition methods : K-medoid
    • Time complexity: O(k(n-k)^2 t)
    • More robust than k-means
    • Other improved algorithms
      • CLARA (1990), CLARANS (Clustering Large Applications based upon RANdomized Search) (1994)
  • 11. Hierarchical methods
    • Two ways
      • Agglomerative approach: bottom-up
        • E.g.: AGNES
      • Divisive approach: top-down
        • E.g.: DIANA
    [Figure: dendrogram over objects a, b, c, d, e. Agglomerative (AGNES) merges from Step 0 to Step 4; divisive (DIANA) splits in the opposite direction, Step 0 to Step 4]
  • 12. Hierarchical methods
    • Difficult to decide cut points or merge points
    • Irreversible
    • Time complexity: O(n^2)
  • 13. Density based methods
    • Discover clusters of arbitrary shape
    • Handle noise
    • Time complexity: O(n^2)
  • 14. Density based methods
    • Two parameters are needed
      • Eps : Maximum radius of the neighbourhood
      • MinPts : Minimum number of points in an Eps-neighbourhood of that point
    [Figure: points p, q, and o illustrating Eps-neighbourhoods, with MinPts = 5 and Eps = 1 cm]
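As a small illustration of the two parameters, the sketch below counts the points inside an Eps-neighbourhood and checks it against MinPts. The helper names `eps_neighbourhood` and `is_core_point` and the use of Euclidean distance are assumptions for this example; the neighbourhood is taken to include the point itself, as in the usual DBSCAN definition.

```python
import math

def eps_neighbourhood(points, p, eps):
    """All points within distance eps of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its Eps-neighbourhood holds at least MinPts points."""
    return len(eps_neighbourhood(points, p, eps)) >= min_pts
```

Computing the neighbourhood of every point by a linear scan, as done here, is what gives the O(n^2) cost quoted for density-based methods.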
  • 15. Grid-Based Clustering Method
    • Fast processing time, independent of the number of data objects, dependent on only the number of cells in each dimension
    • Useful on high dimensional data
  • 16. Model-Based Clustering methods
    • Statistical approach
    • Neural network approach
  • 17. Comparison
    Method        Complexity      Parameters           Shape limit  Similar size  Noise sensitive
    K-means       O(nkt)          K, (t)               Cyclic       Yes           Yes
    K-medoids     O(k(n-k)^2 t)   K, (t)               Cyclic       Yes           No
    AGNES, DIANA  O(n^2)          Cut or merge points  Cyclic       No            No
    DBSCAN        O(n^2)          Radius, MinPts       No           No            No
    STING         O(n)            Grid structure       Yes          No            No
  • 18. Nonparametric Genetic Clustering: Comparison of Validity Indices. Authors: S. Bandyopadhyay and U. Maulik. Source: IEEE Transactions on SMC, Part C: Applications and Reviews, 2001
  • 19. Introduction
    • A genetic algorithm (GA) is a randomized, parallel search algorithm
    • GA operation
      • Selection, crossover, mutation
    • Develop a nonparametric clustering technique
      • Variable string length genetic algorithm (VGA)
  • 20. VGA-Clustering(1)
    • String representation: each chromosome encodes one candidate clustering
    • Chromosome i:
      • Coordinates of the cluster centers (real numbers)
      • K_i clusters in N dimensions, so length = N * K_i
    • Partition: assign each point to the closest cluster center
    [Chromosome layout: center 1 (c_1, c_2, …, c_N) | center 2 (c_1, c_2, …, c_N) | … | center K_i (c_1, c_2, …, c_N)]
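The encoding and the partition step can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function names `decode` and `assign` and the use of squared Euclidean distance are assumptions.

```python
def decode(chromosome, n_dims):
    """Split a flat list of N * K_i reals into K_i centers of N coordinates each."""
    return [tuple(chromosome[i:i + n_dims])
            for i in range(0, len(chromosome), n_dims)]

def assign(points, centers):
    """Label each point with the index of its closest center."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]
```

Because the chromosome length is N * K_i, two chromosomes in the same population can describe different numbers of clusters, which is what makes the string length variable.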
  • 21. VGA-Clustering (2)
    • Population initialization
      • Chromosome length:
        • K_i = rand() mod K*
        • (K*: upper bound on the number of clusters in an initial chromosome)
      • K_i centers: randomly selected points from the data set
    • Selection: roulette wheel
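Roulette-wheel selection picks each chromosome with probability proportional to its fitness. A minimal sketch, assuming non-negative fitness values with a positive sum; the function name and the `rng` parameter are introduced here for illustration only.

```python
import random

def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    spin = rng.random() * total          # a point on the wheel, in [0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness               # right edge of this individual's slice
        if spin < running:
            return individual
    return population[-1]                # guard against floating-point round-off
```

Calling it repeatedly fills the mating pool, so fitter chromosomes tend to appear more than once.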
  • 22. VGA-Clustering (3): Crossover
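    • Crossover: applied stochastically with probability μ_c
    • Each offspring must contain at least 2 clusters
      • Bounds on the crossover points C1 (in parent 1, with K1 clusters) and C2 (in parent 2, with K2 clusters):
        • C1 = rand() mod K1
        • LB(C2) = min[2, max[0, 2 - (K1 - C1)]]
        • UB(C2) = K2 - max[0, 2 - C1]
        • C2 = LB(C2) + rand() mod (UB(C2) - LB(C2))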
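The bounds on C2 can be checked with a short sketch. It assumes both parents have at least 2 clusters, that offspring 1 takes the first C1 centers of parent 1 followed by the centers of parent 2 from position C2 onward, and that offspring 2 takes the complementary pieces. The slide does not spell out the swap, so this reading, the function names, and the fallback to LB(C2) when the range is empty are assumptions.

```python
import random

def crossover_points(k1, k2, rng=random):
    """Choose crossover points so that both offspring keep at least 2 clusters."""
    c1 = rng.randrange(k1)                    # C1 = rand() mod K1
    lb = min(2, max(0, 2 - (k1 - c1)))        # LB(C2)
    ub = k2 - max(0, 2 - c1)                  # UB(C2)
    # rand() mod 0 is undefined, so fall back to LB when UB == LB (assumption).
    c2 = lb + rng.randrange(ub - lb) if ub > lb else lb
    return c1, c2

def crossover(parent1, parent2, rng=random):
    """Parents are lists of centers; returns two offspring (lists of centers)."""
    c1, c2 = crossover_points(len(parent1), len(parent2), rng)
    child1 = parent1[:c1] + parent2[c2:]      # C1 + (K2 - C2) clusters
    child2 = parent2[:c2] + parent1[c1:]      # C2 + (K1 - C1) clusters
    return child1, child2
```

Under this reading, the upper bound keeps C1 + (K2 - C2) at 2 or more and the lower bound keeps C2 + (K1 - C1) at 2 or more.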
  • 23. VGA-Clustering (4)
    • Mutation
      • Applied with probability μ_m
      • Draw a random number d in [0, 1]
      • A gene with value v becomes (1 ± 2d) · v when v ≠ 0
      • and ± 2d when v = 0
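A minimal sketch of this mutation rule follows. The slide does not say how the sign is chosen, so picking "+" or "-" with equal probability is an assumption, as are the function names.

```python
import random

def mutate_gene(v, rng=random):
    """Perturb one gene value v following the (1 ± 2d) * v rule."""
    d = rng.random()                   # d drawn from [0, 1)
    sign = rng.choice((1, -1))         # assumption: "+" and "-" equally likely
    if v != 0:
        return (1 + sign * 2 * d) * v
    return sign * 2 * d                # v == 0 has nothing to scale, so shift by ± 2d

def mutate(chromosome, mu_m, rng=random):
    """Mutate each gene independently with probability mu_m."""
    return [mutate_gene(v, rng) if rng.random() < mu_m else v for v in chromosome]
```

The separate case for v = 0 matters because scaling a zero coordinate would leave it unchanged.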
  • 24. Fitness Computation
    • Use the cluster validity indices
      • DB Index
      • Dunn’s Index
      • Generalized Dunn’s Index ν_GD
      • New Cluster Validity Index (by authors)
  • 25. Cluster Validity Indices (1)
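    • DB (Davies-Bouldin) Index
    • Within-cluster scatter: spread of the points x of cluster i around its centroid z_i
    • Between-cluster separation: distance between the centroids z_i and z_j
    • Ratio: within-cluster scatter relative to between-cluster separation
    • DB Index: combines the ratios over all K clusters; lower is better
    • (fitness: 1/DB)
    • Legend: |C_i|: number of points in cluster i; K: number of clusters; z_i: centroid of cluster i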
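For reference, the Davies-Bouldin index in its commonly used form can be written with the symbols from the legend. The scatter and ratio symbols S_i, d_{ij}, and R_i are introduced here, and the use of plain Euclidean norms is an assumption; the paper may state the scatter and separation with more general exponents.

```latex
S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - z_i \rVert
\qquad
d_{ij} = \lVert z_i - z_j \rVert

R_i = \max_{j \neq i} \frac{S_i + S_j}{d_{ij}}
\qquad
DB = \frac{1}{K} \sum_{i=1}^{K} R_i
```

Small scatter and large separation give a small DB value, which is why the fitness is taken as 1/DB.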
  • 26. Cluster Validity Indices (2)
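    • Dunn’s Index: let S and T be two nonempty subsets of the data
    • Diameter of S: largest distance between two points x, y in S
    • Set distance between S and T: smallest distance between a point x in S and a point y in T
    • Dunn’s Index: smallest set distance between two clusters divided by the largest cluster diameter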
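In symbols, the standard form of Dunn's index reads as below. The notation (Δ for the diameter, δ for the set distance, ν_D for the index, d for the underlying point distance, C_1 … C_K for the clusters) is chosen here to match the generalized index ν_GD and is an assumption about the slide's notation.

```latex
\Delta(S) = \max_{x, y \in S} d(x, y)
\qquad
\delta(S, T) = \min_{x \in S,\; y \in T} d(x, y)

\nu_D = \min_{1 \le i \le K} \; \min_{\substack{1 \le j \le K \\ j \ne i}}
        \frac{\delta(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)}
```

Larger values indicate compact, well-separated clusters, so ν_D can be used directly as a fitness to maximize.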
  • 27. Cluster Validity Indices (3)
    • Generalized Dunn’s Index
      • δ_i: any positive, semi-definite, symmetric set distance function
      • Δ_i: any positive, semi-definite, symmetric diameter function
  • 28. Cluster Validity Indices (4)
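    • New Cluster Validity Index
    • Index: combines three factors: cluster number, compactness, and separation
    • Scatter: measured from the points x to their cluster center z_k
    • Separation: measured between cluster centers z_i and z_j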
  • 29. Experiment: Data Sets
    • AD_5_2: 5 clusters
    • AD_10_2: 10 clusters
    • AD_4_3: 4 clusters
  • 30. Experiment: Result (1)
    [Figure: six clustering plots. Panel labels: DB, DUNN; DB, DUNN; DB, G. DUNN; DUNN; G. DUNN; G. DUNN. Cluster counts shown: 5, 8, 4, 4, 2, 2]
  • 31. Experiment: Result (2)
    [Figure: three clustering plots. Panel labels: DB (iris data); New Index; New Index (iris data). Cluster counts shown: 2, 10, 3]
  • 32. Experiment: Result (3)
    [Figure: two plots: number of clusters vs. generation; 1/DB vs. generation]
  • 33. Conclusion
    • VGAs are exploited for clustering when the number of clusters is not known a priori
    • A new cluster validity index is introduced
      • Favors the formation of compact, well-separated clusters
      • Provides the correct number of clusters where the other indices sometimes fail
  • 34. Disadvantage
    • Handles only clusters of cyclic shape
    • Assumes clusters of similar size
    • The chromosome encoding is too complex
  • 35. Future work
    • Bottlenecks of Apriori algorithms
      • Parameters must be specified: support and confidence
      • High I/O cost
      • Huge number of candidate sets
    • Two efficient algorithms for association rules
      • Hash-based algorithm
      • FP-tree (frequent-pattern tree)
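To make the two parameters concrete, the sketch below computes the support of an itemset and the confidence of a rule over a small list of transactions. The function names and the toy data in the usage note are hypothetical; the definitions are the usual ones (support as the fraction of transactions containing the itemset, confidence as support of the whole rule divided by support of the antecedent).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)
```

For example, with the four transactions {bread, milk}, {bread, butter}, {bread, milk, butter}, and {milk}, the itemset {bread, milk} appears in two of four transactions, and the rule bread → milk holds in two of the three transactions that contain bread. Each call scans every transaction, which hints at the I/O cost listed above when many candidate sets must be counted.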
  • 36. Future work
    • Bottlenecks of parallel mining of association rules
      • Communication overhead
      • Load balancing
    • Privacy and data security: encoding
    • Bioinformatics
    • Integrated clustering methods
    • Incremental series
  • 37. Bibliography
    • Journals
      • Data Mining and Knowledge Discovery Journal
      • IEEE Trans. on Knowledge and Data Engineering (TKDE)
    • Special interest groups
      • ACM-SIGKDD
      • ACM-SIGMOD
    • Conferences
      • IEEE International Conf. on Data Mining
      • IEEE International Conf. on Data Engineering (ICDE)
      • ACM SIGKDD International Conf. on KDDM (KDD)
      • SIAM International Conf. on Data Mining (SIAMDM)
      • Pacific-Asia Conf. on KDDM (PAKDD)
