
Clustering: Large Databases in data mining

This chapter describes the application of clustering algorithms to large databases.


  1. Chapter 12: Clustering Large Databases. Written by Farial Shahnaz; presented by Zhao Xinyou. (Data Mining Technology)
  2. Contents
     - Introduction
     - The idea behind the three major approaches to scalable clustering: {Divide-and-Conquer, Incremental, Parallel}
     - Three algorithms for scalable clustering: {BIRCH, DBSCAN, CURE}
     - Applications
  3. Introduction: the common method
     - The common method for clustering visits all data in the database and analyzes it, which means:
     - Time: computational complexity of O(n²)
     - Memory: all data must be loaded into main memory
     (Figure: time/memory cost grows rapidly with data size, into the millions of records) PP133
  4. Motivation: clustering for large databases
     - Goal: find a method that reduces the cost from f(x) = O(n²) to f(x) = O(n) in both time and memory.
     (Figure: cost vs. data size, before and after) PP134
  5. Requirements for clustering large databases
     - No more (preferably fewer) than one scan of the database
     - Process each record only once
     - Work with limited memory
     - Can suspend, stop, and resume
     - Can update the results when new data is inserted or removed
     - Can apply different techniques to scan the database
     - During execution, the method should report its status and a current 'best' answer
     PP134
  6. Major approaches for scalable clustering
     - Divide-and-conquer approach
     - Parallel clustering approach
     - Incremental clustering approach
     PP135
  7. Divide-and-conquer approach
     - Definition: divide-and-conquer is a problem-solving approach in which we:
     - divide the problem into sub-problems,
     - recursively conquer (solve) each sub-problem, and then
     - combine the sub-problem solutions to obtain a solution to the original problem.
     Key assumptions:
     1. Problem solutions can be constructed from sub-problem solutions.
     2. Sub-problem solutions are independent of one another.
     (Example: a 9×9 Sudoku puzzle) PP135
  8. Parallel clustering approach
     - Idea: divide the data into small subsets and cluster each subset on a different machine (derived from divide-and-conquer)
     PP136-137
  9. Note on divide-and-conquer: the divide step and the conquer step can each be carried out by different algorithms.
  10. Applications of divide-and-conquer
     - Sorting: quick sort and merge sort
     - Fast Fourier transforms
     - Tower of Hanoi puzzle
     - Matrix multiplication
     - ...
     PP135
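As a concrete illustration of the pattern, here is merge sort in Python (a standard textbook example, not code from the chapter): the divide step splits the list, the conquer step sorts each half recursively, and the combine step merges the two sorted halves.

```python
def merge_sort(data):
    """Divide-and-conquer sorting: split, recursively solve, combine."""
    if len(data) <= 1:                   # base case: trivially sorted
        return data
    mid = len(data) // 2
    left = merge_sort(data[:mid])        # divide + conquer
    right = merge_sort(data[mid:])
    return merge(left, right)            # combine

def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out
```

Note how the two sub-problem solutions are independent, satisfying key assumption 2 above.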
  11. CURE as divide-and-conquer
     1. Get the size n of the data set D and partition D into p groups (each containing n/p elements).
     2. Cluster each group p_i into k partial clusters, using a heap and a k-d tree.
     3. Delete outliers: remove heap/k-d tree entries with no close neighbors.
     4. Cluster the partial clusters to obtain the final clustering.
     PP140-141
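Setting the heap and k-d tree machinery aside, the partition-then-merge skeleton of steps 1, 2, and 4 can be sketched as follows. This is an illustrative sketch, not CURE itself: `naive_cluster`, `centroid`, and `dist` are stand-ins for CURE's efficient hierarchical clustering with representative points, and the outlier-removal step 3 is omitted.

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(cluster):
    """Component-wise mean of a list of points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def naive_cluster(points, k):
    """Greedy agglomerative clustering down to k clusters by repeatedly
    merging the pair with the closest centroids. (CURE proper uses a
    heap + k-d tree to find the closest pair efficiently.)"""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge j into i
    return clusters

def cure_partitioned(data, p, k):
    """Partition D into p groups, pre-cluster each group, then cluster
    the partial clusters' centroids into the final k clusters."""
    n = len(data)
    size = (n + p - 1) // p
    partials = []
    for g in range(p):
        group = data[g * size:(g + 1) * size]
        if group:
            partials.extend(naive_cluster(group, k))
    reps = [centroid(c) for c in partials]    # one representative per partial cluster
    return naive_cluster(reps, k)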
  12. Heap (figure) PP140-141
  13. k-d tree
     - Technically, the letter k refers to the number of dimensions (e.g., a 3-dimensional k-d tree).
     PP140-141
  14. k-d tree (figure) PP140-141
  15. CURE as divide-and-conquer (figure: repeatedly find and merge the nearest partial clusters) PP140-141
  16. Incremental clustering approach
     - Idea: scan the database one record at a time, comparing each record with the existing clusters. If a similar cluster is found, assign the record to that cluster; otherwise, create a new cluster. Continue until no data remains.
     Steps:
     1. S = {};                          // set of clusters, initially empty
     2. do {
     3.   read one record d;
     4.   r = find_similar_cluster(d, S);
     5.   if (r exists)
     6.     assign d to cluster r;
     7.   else
     8.     add_cluster(d, S);
     9. } until (no record remains in the database);
     PP135-136
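The steps above correspond to a "leader"-style single-scan algorithm. A minimal Python sketch follows; the `threshold` parameter, the Euclidean distance, and the running-centroid summary are illustrative assumptions, since the slide leaves the similarity test (`find_similar_cluster`) unspecified.

```python
def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def incremental_cluster(records, threshold, dist):
    """One scan over the data; each cluster is a [centroid, members]
    pair, so memory is bounded by the number of clusters plus members,
    and each record is processed exactly once."""
    clusters = []
    for d in records:
        # find the most similar existing cluster within the threshold
        best = None
        for c in clusters:
            dd = dist(c[0], d)
            if dd <= threshold and (best is None or dd < best[1]):
                best = (c, dd)
        if best is not None:
            c = best[0]
            c[1].append(d)
            n = len(c[1])
            # update the running centroid: new mean = (old*(n-1) + d) / n
            c[0] = tuple((cx * (n - 1) + dx) / n for cx, dx in zip(c[0], d))
        else:
            clusters.append([tuple(d), [d]])   # start a new cluster
    return clusters
```

Note that this satisfies the earlier requirements: one scan, limited memory, and the clustering can be suspended and resumed after any record.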
  17. Applications of the incremental clustering approach
     - BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
     - DBSCAN: Density-Based Spatial Clustering of Applications with Noise
  18. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
     - Based on distance measurements, compute the similarity between each record and the existing clusters, and output the clusters.
     - Intra-cluster distance (within a cluster)
     - Inter-cluster distance (between clusters)
     PP137-138
  19. BIRCH (figure: intra-cluster vs. inter-cluster distance) PP137-138
  20. Related definitions
     - Cluster: {x_i}, where i = 1, 2, ..., N
     - CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N points, and SS is the square sum of the N points.
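The CF triple is useful because it is additive: the CF of the union of two disjoint clusters is the component-wise sum of their CFs, so clusters can be summarized, merged, and updated incrementally without revisiting the raw points. A small sketch (function names are illustrative):

```python
def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS).
    N = count, LS = component-wise linear sum, SS = sum of squared components."""
    n = len(points)
    dims = len(points[0])
    ls = tuple(sum(p[i] for p in points) for i in range(dims))
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(a, b):
    """CF additivity: CF(A ∪ B) = CF(A) + CF(B) component-wise.
    This is what lets BIRCH absorb a new point or merge subclusters
    in constant time."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            a[2] + b[2])

def cf_centroid(c):
    """Centroid recovered from a CF triple: LS / N."""
    n, ls, _ = c
    return tuple(x / n for x in ls)
```

Statistics such as the centroid, radius, and diameter of a cluster can all be computed from (N, LS, SS) alone, which is why a CF tree node needs only these three values per entry.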
  21. Related definitions
     - CF tree = (B, T), where B is the branching factor and T is a threshold:
     - an internal node holds entries (CF_i, child_i);
     - a leaf node holds entries CF_i plus prev and next pointers.
     - T: every leaf entry must satisfy the threshold, e.g., mean distance D < T.
  22. Algorithm for BIRCH (figure)
  23. DBSCAN
     - DBSCAN: Density-Based Spatial Clustering of Applications with Noise
     - Example 1: we want to pick out the houses along a river from a spatial photograph.
     - Example 2:
  24. Definitions for DBSCAN
     - Eps-neighborhood of a point: the Eps-neighborhood of a point p, denoted N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}.
     - Minimum number (MinPts): the minimum number of data points required in any cluster.
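These definitions translate directly into code. A small sketch (the function names and the Euclidean `euclid` metric are illustrative assumptions; DBSCAN works with any distance function):

```python
def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def eps_neighborhood(D, p, eps, dist):
    """N_Eps(p) = {q in D : dist(p, q) <= eps}; includes p itself."""
    return [q for q in D if dist(p, q) <= eps]

def is_core(D, p, eps, min_pts, dist):
    """p is a core point if its Eps-neighborhood contains at least MinPts points."""
    return len(eps_neighborhood(D, p, eps, dist)) >= min_pts
```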
  25. Definitions for DBSCAN
     - Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if
     1) p ∈ N_Eps(q); and
     2) |N_Eps(q)| ≥ MinPts.
  26. Definitions for DBSCAN
     - Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, p_2, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
  27. Definitions for DBSCAN (figure illustrating density-reachability)
  28. Algorithm of DBSCAN
     Input:
       D = {t_1, t_2, ..., t_n}, MinPts, Eps
     Output:
       clusters K_1, K_2, ..., K_k
     k = 0;
     for i = 1 to n do
       if t_i is not yet in a cluster then
         X = {t_j | t_j is density-reachable from t_i};
       end if
       if X is a valid cluster then
         k = k + 1;
         K_k = X;
       end if
     end for
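A minimal Python sketch of the pseudocode above: grow a cluster from each unvisited core point by collecting every point density-reachable from it; leftover points are labelled noise. The neighborhood search here is a naive O(n²) scan for clarity; real implementations speed it up with a spatial index such as an R*-tree.

```python
def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dbscan(D, eps, min_pts, dist):
    """Returns (labels, k): labels maps each point's index to a cluster
    id 1..k, or -1 for noise."""
    labels = {}
    k = 0
    for i in range(len(D)):
        if i in labels:
            continue
        neigh = [j for j in range(len(D)) if dist(D[i], D[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1              # noise (may become a border point later)
            continue
        k += 1                          # t_i is a core point: start cluster K_k
        labels[i] = k
        seeds = [j for j in neigh if j != i]
        while seeds:                    # expand to all density-reachable points
            j = seeds.pop()
            if labels.get(j) == -1:
                labels[j] = k           # former noise becomes a border point
            if j in labels:
                continue
            labels[j] = k
            jn = [m for m in range(len(D)) if dist(D[j], D[m]) <= eps]
            if len(jn) >= min_pts:      # j is also a core point: keep expanding
                seeds.extend(m for m in jn if m not in labels)
    return labels, k
```

Only core points propagate the expansion; border points join a cluster but do not extend it, matching the density-reachability definition on slide 26.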
