Clustering: Large Databases in data mining

    Clustering: Large Databases in data mining - Presentation Transcript

    1. Chapter 12, Clustering: Large Databases. Written by Farial Shahnaz. Presented by Zhao Xinyou. Data Mining Technology
    2. Contents
      • Introduction
      • Ideas behind three major approaches for scalable clustering
      • {Divide-and-Conquer, Incremental, Parallel}
      • Three approaches for scalable clustering
      • {BIRCH, DBSCAN, CURE}
      • Application
    3. Introduction: Common method
      • The common method for clustering visits every record in the database and analyzes it, which costs:
      • Time: computational complexity of O(n*n).
      • Memory: all data must be loaded into main memory.
      [Figure: time/memory cost against data size; the cost becomes huge when the data reaches millions of records] PP133
    4. Motivation: Clustering for large databases
      [Figure: two plots of time/memory against data size, one for f(x): O(n*n) and one for f(x): O(n), linked by "Method ???"] PP134
    5. Requirements: Clustering for large databases
      • No more than one scan of the database (preferably less).
      • Process each [record] only once.
      • Work with limited memory.
      • Can suspend, stop, and resume.
      • Can update the results when data is inserted or removed.
      • Can use different techniques to scan the database.
      • During execution, the method should provide its status and a ‘best’ answer so far.
      PP134
    6. Major approaches for scalable clustering
      • Divide-and-Conquer approach
      • Parallel clustering approach
      • Incremental clustering approach
      PP135
    7. Divide-and-Conquer approach
      • Definition.
      • Divide-and-conquer is a problem-solving approach in which we:
      • divide the problem into sub-problems,
      • recursively conquer or solve each sub-problem, and then
      • combine the sub-problem solutions to obtain a solution to the original problem.
      • Key assumptions:
      • 1. Problem solutions can be constructed from sub-problem solutions.
      • 2. Sub-problem solutions are independent of one another.
      • Example: 9*9 Sudoku
      PP135
    8. Parallel clustering approach
      • Idea: divide the data into small sets and process each set on a different machine (derived from divide-and-conquer).
      PP136-137
    9. Explanation of Divide-and-Conquer: the divide step and the conquer step are each algorithms in their own right.
    10. Application
      • Sorting: quick-sort and merge sort
      • Fast Fourier transforms
      • Tower of Hanoi puzzle
      • matrix multiplication
      • …..
      PP135
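Merge sort, the first application listed above, shows the three divide-and-conquer steps in a few lines. This is a minimal sketch for illustration; the function names `merge_sort` and `merge` are chosen here and do not come from the slides.

```python
def merge(left, right):
    """Combine step: merge two already sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One of the two lists is exhausted; append the remainder of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Sort a list by divide-and-conquer."""
    if len(items) <= 1:              # trivial sub-problem, already solved
        return list(items)
    mid = len(items) // 2            # divide
    left = merge_sort(items[:mid])   # conquer each half independently
    right = merge_sort(items[mid:])
    return merge(left, right)        # combine


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

The two halves are sorted independently of each other, which matches the second key assumption of the approach.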
    11. CURE: Divide-and-Conquer
      • 1. Get the size n of the data set D and partition D into p groups (each containing n/p elements).
      • 2. Cluster each group p_i into k clusters, using a heap and a k-d tree.
      • 3. Delete the nodes that have no relationship to the others (outliers) from the heap and the k-d tree.
      • 4. Cluster the partial clusters to get the final clusters.
      PP140-141
    12. Heap PP140-141
    13. k-d Tree
      • Technically, the letter k refers to the number of dimensions.
      [Figure: a 3-dimensional k-d tree] PP140-141
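A k-d tree can be sketched as a binary tree that splits on one dimension per level, cycling through the k dimensions. The code below is an illustrative sketch rather than the structure used in the slides; `Node`, `build_kdtree`, `squared_distance`, and `nearest` are names introduced here.

```python
from collections import namedtuple

# point: the record stored at this node; axis: the dimension it splits on.
Node = namedtuple("Node", ["point", "axis", "left", "right"])


def build_kdtree(points, depth=0):
    """Build a k-d tree by splitting on the median of one axis per level."""
    if not points:
        return None
    k = len(points[0])               # k = number of dimensions
    axis = depth % k                 # cycle through the dimensions
    ordered = sorted(points, key=lambda p: p[axis])
    median = len(ordered) // 2
    return Node(
        point=ordered[median],
        axis=axis,
        left=build_kdtree(ordered[:median], depth + 1),
        right=build_kdtree(ordered[median + 1:], depth + 1),
    )


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or squared_distance(node.point, target) < squared_distance(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < squared_distance(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3, 1), (5, 4, 7), (9, 6, 2), (4, 7, 9), (8, 1, 5), (7, 2, 6)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2, 5)))
```

Nearest-neighbor lookups of this kind are what make a k-d tree useful when clusters are repeatedly merged with their closest neighbor.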
    14. k-d Tree PP140-141
    15. CURE: Divide-and-Conquer
      [Figure: clusters combined by repeated "nearest" and "merge" steps] PP140-141
    16. Incremental clustering approach
      • Idea: scan every record in the database and compare it with the existing clusters. If a similar cluster is found, assign the record to that cluster; otherwise, create a new cluster. Continue until no records remain.
      • Steps:
      • 1. S = {}; // the set of clusters starts empty
      • 2. do {
      • 3.   read one record d;
      • 4.   r = find_similar_cluster(d, S);
      • 5.   if (r exists)
      • 6.     assign d to the cluster r;
      • 7.   else
      • 8.     add_cluster(d, S);
      • 9. } until (no record remains in the database);
      PP135-136
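The steps above can be sketched in Python as follows. The slides do not say how similarity is measured, so this sketch assumes Euclidean distance to a cluster centroid with a fixed `threshold`; the dictionary layout and the `threshold` parameter are assumptions made for illustration.

```python
import math


def find_similar_cluster(record, clusters, threshold):
    """Return the index of the closest cluster within threshold, or None."""
    best_index, best_dist = None, threshold
    for index, cluster in enumerate(clusters):
        dist = math.dist(record, cluster["centroid"])
        if dist <= best_dist:
            best_index, best_dist = index, dist
    return best_index


def incremental_clustering(records, threshold):
    """Single pass over the records; each record is processed exactly once."""
    clusters = []                                 # S = {}
    for record in records:                        # read one record d
        r = find_similar_cluster(record, clusters, threshold)
        if r is not None:                         # a similar cluster exists
            cluster = clusters[r]
            cluster["members"].append(record)
            n = len(cluster["members"])
            # Update the running mean so the centroid stays current.
            cluster["centroid"] = tuple(
                c + (x - c) / n for c, x in zip(cluster["centroid"], record)
            )
        else:                                     # otherwise start a new cluster
            clusters.append({"centroid": tuple(record), "members": [record]})
    return clusters


data = [(0, 0), (0.5, 0), (10, 10), (10.5, 10), (0, 0.5)]
for cluster in incremental_clustering(data, threshold=2.0):
    print(cluster["centroid"], cluster["members"])
```

Because each record is read once and only the cluster summaries are kept between records, the loop can be suspended and resumed, and new records can be added later without rescanning the database.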
    17. Applications: Incremental clustering approach
      • BIRCH
      • Balanced Iterative Reducing and Clustering using Hierarchies
      • DBSCAN
      • Density-Based Spatial Clustering of Applications with Noise
    18. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
      • Using a distance measure, compute the similarity between a record and each cluster, and produce the clusters.
      • Intra-cluster distance (within a cluster)
      • Inter-cluster distance (between clusters)
      PP137-138
    19. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
      [Figure: intra-cluster distance and inter-cluster distance] PP137-138
    20. Related Definitions
      • Cluster: {x_i}, where i = 1, 2, …, N
      • CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is their square sum.
    21. Related Definitions
      • CF tree = (B, T)
      • B = (CF_i, child_i) if the node is an internal node in a cluster
      • B = (CF_i, prev, next) if the node is an external (leaf) node in a cluster
      • T: threshold for all leaf nodes; each must satisfy mean distance D < T
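The value of the CF triple is that two clusters can be merged by adding their triples, and the centroid and spread can be recovered without revisiting the points. The sketch below illustrates this under two assumptions not stated in the slides: SS is stored as a single scalar (the sum of squared norms), and the "mean distance" compared with T is the cluster radius. The names `CF`, `from_point`, `merge`, and `fits_threshold` are introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int        # N: number of data points
    ls: tuple     # LS: linear sum of the points, one value per dimension
    ss: float     # SS: sum of the squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), float(sum(x * x for x in point)))

    def merge(self, other):
        """CF additivity: the CF of a union is the component-wise sum."""
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the points to the centroid."""
        centroid_sq = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - centroid_sq, 0.0))


def fits_threshold(entry, point, t):
    """Would absorbing point keep the entry's radius below threshold t?"""
    return entry.merge(CF.from_point(point)).radius() < t


a = CF.from_point((1, 2))
merged = a.merge(CF.from_point((3, 4)))
print(merged, merged.centroid(), merged.radius())
```

A leaf entry would absorb a new point only when `fits_threshold` holds; otherwise the point starts a new entry.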
    22. Algorithm for BIRCH
    23. DBSCAN
      • DBSCAN: Density-Based Spatial Clustering of Applications with Noise
      • Ex1:
      • We want to classify the houses along a river in a spatial photograph.
      • Ex2:
    24. Definitions for DBSCAN
      • Eps-neighborhood of a point
      • The Eps-neighborhood of a point p, denoted by N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
      • Minimum number of points (MinPts)
      • MinPts is the minimum number of data points in any cluster.
    25. Definitions for DBSCAN
      • Directly density-reachable
      • A point p is directly density-reachable from a point q with respect to Eps and MinPts if
      • 1) p ∈ N_Eps(q);
      • 2) |N_Eps(q)| ≥ MinPts.
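The two conditions above can be checked directly from the definition of the Eps-neighborhood. This is a small sketch that assumes Euclidean distance for dist(p, q); the function names are chosen for illustration.

```python
import math


def eps_neighborhood(data, p, eps):
    """N_Eps(p): all points q in the data set with dist(p, q) <= eps."""
    return [q for q in data if math.dist(p, q) <= eps]


def directly_density_reachable(data, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    neighborhood = eps_neighborhood(data, q, eps)
    return p in neighborhood and len(neighborhood) >= min_pts


data = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(directly_density_reachable(data, (0, 1), (0, 0), eps=1, min_pts=3))
print(directly_density_reachable(data, (0, 0), (0, 1), eps=1, min_pts=3))
```

The relation is not symmetric: in this data set (0, 1) is directly density-reachable from (0, 0), whose neighborhood holds three points, but not the other way round, because the neighborhood of (0, 1) holds only two.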
    26. Definitions for DBSCAN
      • Density-reachable
      • A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, p_2, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
    28. Algorithm of DBSCAN
      • Input
      • D = {t_1, t_2, …, t_n}
      • MinPts
      • Eps
      • Output
      • K = {K_1, K_2, …, K_k}
      • Algorithm
      • k = 0;
      • for i = 1 to n do
      •   if t_i is not in a cluster then
      •     X = {t_j | t_j is density-reachable from t_i};
      •     if X is a valid cluster then
      •       k = k + 1;
      •       K_k = X;
      •     end if
      •   end if
      • end for
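The loop above can be sketched in Python by growing X with a queue of points that still need expanding. This sketch assumes Euclidean distance and treats X as a valid cluster when its starting point has at least MinPts points in its Eps-neighborhood; the function name `dbscan` and the return format (lists of point indices plus a noise list) are choices made here, not part of the slides.

```python
import math
from collections import deque


def dbscan(data, eps, min_pts):
    """Return (clusters, noise), each holding indices into data."""

    def neighbors(i):
        return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

    assigned = [False] * len(data)
    clusters = []
    for i in range(len(data)):
        if assigned[i]:                      # t_i is already in a cluster
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:             # X would not be a valid cluster
            continue
        cluster = []
        queue = deque(seeds)
        while queue:                         # collect everything density-reachable from t_i
            j = queue.popleft()
            if assigned[j]:
                continue
            assigned[j] = True
            cluster.append(j)
            reachable = neighbors(j)
            if len(reachable) >= min_pts:    # only dense points extend the chain
                queue.extend(reachable)
        clusters.append(cluster)
    noise = [i for i in range(len(data)) if not assigned[i]]
    return clusters, noise


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
clusters, noise = dbscan(points, eps=1.5, min_pts=3)
print(clusters, noise)
```

Each call to `neighbors` scans the whole data set, so this version costs O(n*n) distance computations; a spatial index would be needed to make it practical for a large database.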

    ZHAO Sam, 2 years ago
