DBSCAN
Density-based spatial clustering of applications with noise
By: Cory Cook
CLUSTER ANALYSIS
The goal of cluster analysis is to associate data elements with each other based on some relevant element distance analysis. Each 'cluster' will represent elements that are part of a disjoint set in the superset.
[Image: http://ca-science7.wikispaces.com/file/view/cluster_analysis.gif/343040618/cluster_analysis.gif]
DBSCAN
• Originally proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996
• Allows the user to perform data cluster analysis without specifying the number of clusters beforehand
• Can find clusters of arbitrary shape and size (albeit of uniform density)
• Is noise- and outlier-resistant
• Requires only a minimum number of points and a neighborhood distance as input parameters
DBSCAN ALGORITHM

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C

regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)

[Image: http://upload.wikimedia.org/wikipedia/commons/a/af/DBSCAN-illustration.svg]
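Transcribed as runnable code, the pseudocode above might look like the following Python sketch. The deck's own implementation was in R, so the names here are illustrative; the region query is the naive O(n) scan, giving the O(n²) total discussed on the next slide.

```python
import math

NOISE = -1  # label for points that belong to no cluster

def region_query(points, p, eps):
    """Indices of all points within eps of points[p] (including p itself)."""
    px, py = points[p]
    return [i for i, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]

def expand_cluster(points, labels, visited, p, neighbors, c, eps, min_pts):
    """Grow cluster c outward from core point p; the seed list may grow."""
    labels[p] = c
    i = 0
    while i < len(neighbors):
        q = neighbors[i]
        if not visited[q]:
            visited[q] = True
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:          # q is a core point too
                neighbors += [x for x in q_neighbors if x not in neighbors]
        if labels[q] is None or labels[q] == NOISE:  # claim border/noise points
            labels[q] = c
        i += 1

def dbscan(points, eps, min_pts):
    """Return a cluster label (0, 1, ...) or NOISE for every point."""
    labels = [None] * len(points)
    visited = [False] * len(points)
    c = 0
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE
        else:
            expand_cluster(points, labels, visited, p, neighbors, c, eps, min_pts)
            c += 1
    return labels

points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5),
          (10, 10), (10.5, 10), (10, 10.5), (10.5, 10.5),
          (25, 25)]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Two dense 4-point blobs become clusters 0 and 1, and the isolated point at (25, 25) is labeled NOISE, matching the "noise and outlier resistant" behavior described earlier.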
DBSCAN COMPLEXITY
Complexity is in O(n) for the main algorithm, with additional complexity for the region query, resulting in O(n²) for the entire algorithm. The algorithm "visits" each point and determines the neighbors for that point. Determining neighbors depends on the algorithm used for the region query; however, it is most likely in O(n), as the distance must be computed between each point and the point in question.
DBSCAN IMPROVEMENTS
It is possible to improve the time complexity of the algorithm by utilizing an indexing structure to query neighborhoods in O(log n); however, the structure would require O(n²) space to store the indices. A majority of attempts to improve DBSCAN involve overcoming the statistical limitations, such as varying density in the data set.
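The slides do not name a specific indexing structure. As one illustrative option (an assumption, not the authors' choice), a uniform grid with cell side eps answers each region query by scanning only the 3×3 block of cells around the query point, which is effectively constant work when the density is uniform:

```python
import math
from collections import defaultdict

def build_grid(points, eps):
    """Hash each point index into a square cell of side eps."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // eps), int(y // eps))].append(i)
    return grid

def region_query_grid(points, grid, p, eps):
    """Indices within eps of points[p]; only 9 cells are ever scanned."""
    px, py = points[p]
    cx, cy = int(px // eps), int(py // eps)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for i in grid.get((cx + dx, cy + dy), []):
                x, y = points[i]
                if math.hypot(x - px, y - py) <= eps:
                    out.append(i)
    return out
```

The grid itself is O(n) space, so this particular index avoids the O(n²) storage cost mentioned above, at the price of degrading toward a full scan when many points share one cell.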
RANDOMIZED DBSCAN
RANDOMIZED DBSCAN
• Instead of analyzing every single point in the neighborhood, we can select a random subset of points to analyze.
• Randomizing ensures that the selection will roughly represent the entire distribution.
• Selecting asymptotically fewer points to analyze results in an improvement in the overall complexity of the algorithm.
• The effectiveness of this approach is largely determined by the data density relative to the epsilon distance. Edge cases will not be analyzed by DBSCAN as they do not meet the minimum-points requirement. Any of the points in the epsilon-neighborhood will share many of the same points.
[Image: http://i.stack.imgur.com/Su734.jpg]
ALGORITHM

expandCluster(P, NeighborPts, C, eps, MinPts, k)
   add P to cluster C
   for each point P' in NeighborPts
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts' = maximumCoercion(NeighborPts', k)
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C

maximumCoercion(Pts, k)
   visited <- number of visited points in Pts
   points <- select max(sizeof(Pts) - k - visited, 0) elements from Pts
   for each point P' in points
      mark P' as visited
   return Pts

The algorithm is the same as DBSCAN with a slight modification: we force a maximum number of points to continue the analysis. If there are more points in the neighborhood than the maximum, we mark the excess as visited. Marking points as visited allows us to "skip" them by not performing a region query for those points. This effectively reduces the overall complexity.
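The maximumCoercion step can be sketched in Python as follows. Choosing the pre-marked points uniformly at random is an assumption here; the pseudocode only says "select ... elements from Pts", but a random choice matches the deck's premise that a random subset roughly represents the distribution.

```python
import random

def maximum_coercion(neighbor_idxs, visited, k, rng):
    """Mark all but (at most) k unvisited neighbors as visited, so the main
    loop skips their region queries; they are still assigned to the cluster
    when expandCluster iterates over the neighbor list."""
    already = sum(1 for i in neighbor_idxs if visited[i])
    excess = max(len(neighbor_idxs) - k - already, 0)
    unvisited = [i for i in neighbor_idxs if not visited[i]]
    for i in rng.sample(unvisited, min(excess, len(unvisited))):
        visited[i] = True
    return neighbor_idxs
```

Plugged into expandCluster right after the MinPts check, this caps the number of points that can propagate the cluster at k per neighborhood, which is exactly the quantity the complexity analysis below sums over.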
PROBABILISTIC ANALYSIS
For now, assume a uniform distribution and two dimensions. The probability of selecting a point x_1 at distance d from the reference point is defined as

   \Pr\{x_1\} = \frac{\int_{d-\omega}^{d+\omega} 2\pi r \, dr}{\pi\epsilon^2}, \qquad 0 \le d \le \epsilon

   \Pr\{x_1\} = \frac{\pi\left[(d+\omega)^2 - (d-\omega)^2\right]}{\pi\epsilon^2}

   \Pr\{x_1\} = \frac{4d\omega}{\epsilon^2}

The probability increases as d increases.
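The closed form Pr{x_1} = 4dω/ε² can be sanity-checked by simulation. This sketch (not part of the original slides) assumes, consistent with the integral's limits, that ω is the half-width of a thin annulus at radius d inside the ε-disk:

```python
import math
import random

def shell_hit_probability(d, omega, eps, trials=200_000, seed=1):
    """Estimate the probability that a point drawn uniformly from the
    eps-disk lands in the annulus [d - omega, d + omega]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # r = eps * sqrt(u) makes the radius of a uniformly drawn disk point
        r = eps * math.sqrt(rng.random())
        if d - omega <= r <= d + omega:
            hits += 1
    return hits / trials

eps, d, omega = 1.0, 0.5, 0.01
estimate = shell_hit_probability(d, omega, eps)
closed_form = 4 * d * omega / eps**2  # = 0.02
```

For these parameters the Monte Carlo estimate agrees with the 4dω/ε² value to well within sampling error, and repeating with larger d confirms that the probability grows linearly with d.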
PROBABILISTIC ANALYSIS
The probability of finding a point in the 2-epsilon shell, given a k-point at distance d, is defined as

   \Pr\{x_2 \mid x_1\} = \frac{2\epsilon^2 \arctan\frac{d}{\sqrt{4\epsilon^2 - d^2}} + \frac{d}{2}\sqrt{4\epsilon^2 - d^2}}{3\pi\epsilon^2}

This is from a modified lens equation

   A = a^2\pi - 2a^2 \arctan\frac{d}{\sqrt{4a^2 - d^2}} - \frac{d}{2}\sqrt{4a^2 - d^2}

divided by the area of the 2-epsilon shell

   \pi(2\epsilon)^2 - \pi\epsilon^2 = 3\pi\epsilon^2

This can be approximated (from the Vesica Piscis) as

   \Pr\{x_2 \mid x_1\} \approx \frac{0.203\,d}{\epsilon}, \qquad 0 \le d \le \epsilon
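The exact lens-based probability and the 0.203d/ε approximation can be compared numerically (a quick check, not part of the original slides). At d = ε the exact formula evaluates to almost exactly 0.203, which is where the slide's constant comes from, and over the whole range 0 ≤ d ≤ ε the two stay within about 5% of each other:

```python
import math

def exact_shell_prob(d, eps):
    """The slide's closed form: lens-derived area over the 2-epsilon
    shell area 3*pi*eps^2."""
    s = math.sqrt(4 * eps**2 - d**2)
    return (2 * eps**2 * math.atan2(d, s) + (d / 2) * s) / (3 * math.pi * eps**2)

def approx_shell_prob(d, eps):
    """The linear approximation quoted on the slide."""
    return 0.203 * d / eps

# at d = eps the exact value is ~0.2030, matching the constant
print(round(exact_shell_prob(1.0, 1.0), 4))
```

Near d = 0 the exact curve behaves like (2 / 3π) d/ε ≈ 0.212 d/ε, so the 0.203 constant slightly underestimates at small d and is exact at d = ε.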
PROBABILISTIC ANALYSIS

   \Pr\{x_1\} = \frac{4d\omega}{\epsilon^2} \qquad \Pr\{x_2 \mid x_1\} \approx \frac{0.203\,d}{\epsilon}

This probability is greater than zero for all d greater than zero. So long as a point exists between the reference point and epsilon, there is a chance that it will find the target point in the 2-epsilon shell. This is the probability of finding a single point in the 2-epsilon shell. For each additional point in the shell, the probability of finding any point increases:

   \Pr\{X\} = \Pr\{x_1 \cup x_2 \cup \dots \cup x_m\}
COMPLEXITY
The effect of a point in a neighborhood is independent of the size of the problem and the epsilon chosen. Choose k points as the maximum number of neighbors to propagate.

Assume m (the size of the neighborhood) is constant:

   \sum_{i=1}^{n/m} k = \frac{k}{m} n = O(n)

Assume m = n/p, where p is constant, meaning that the neighborhood size is a fraction of the total size:

   \sum_{i=1}^{n/(n/p)} k = pk = O(1)

Assume m = \sqrt{n}:

   \sum_{i=1}^{n/\sqrt{n}} k = k\sqrt{n} = O(\sqrt{n})

Therefore, it is possible to choose epsilon and minimum points to maximize the efficiency of the algorithm.
COMPLEXITY
Choosing epsilon and minimum points such that the average number of points in a neighborhood is the square root of the number of points in the universe allows us to reduce the time complexity of the problem from O(n²) to O(n√n).
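Under the deck's uniformity assumption, the epsilon that puts √n points in an average neighborhood can be computed directly: with density ρ = n/area, the expected occupancy of an ε-disk is ρπε², and setting that to √n gives ε. The function name below is illustrative, not from the slides:

```python
import math

def eps_for_sqrt_n(n, area):
    """Radius whose ε-disk is expected to hold sqrt(n) points when n points
    are spread uniformly over the given area (2-D, as in the analysis)."""
    density = n / area
    return math.sqrt(math.sqrt(n) / (density * math.pi))

# e.g. 10,000 points on the 50 x 50 square used in the tests below
eps = eps_for_sqrt_n(10_000, 50 * 50)
```

With n = 10,000 on a 50×50 square this gives ε ≈ 2.82, and by construction ρπε² = √n = 100, the neighborhood size the O(n√n) bound asks for.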
TESTING (IMPLEMENTATION IN R)
TESTING
Method:
• Generate a random data set of n elements with values ranging between 0 and 50. Then trim values between 25 and 25+epsilon on the x and y axes. This should give us at least 4 clusters.
• Run each algorithm 100 times on each data set and record the average running time for each algorithm and the average accuracy of Randomized DBSCAN.
• Repeat for 1000, 2000, 3000, 4000 initial points (before trim).
• Repeat for eps = [1:10].
[Chart: "Complexity Analysis" — run time (s) vs. number of elements (N) for DBSCAN and eps = 1..10, with polynomial and linear trend lines]
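The data-generation step (implemented in R by the author) can be sketched in Python as follows; the parameter names are illustrative:

```python
import random

def generate_dataset(n, eps, lo=0.0, hi=50.0, mid=25.0, seed=None):
    """Draw n uniform points in [lo, hi]^2, then trim the band
    [mid, mid + eps] on both axes, leaving (at least) four clusters
    in the corners of the square."""
    rng = random.Random(seed)
    pts = [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n)]
    return [(x, y) for (x, y) in pts
            if not (mid <= x <= mid + eps) and not (mid <= y <= mid + eps)]
```

Trimming a full-width band of width eps on each axis guarantees that no ε-neighborhood spans the gap, so the four quadrants cannot merge into one cluster.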
TESTING
• Randomized DBSCAN improves as the epsilon increases (increasing the number of points per epsilon and the relative density).
• DBSCAN will perform in O(n²) regardless of epsilon and relative density.
• Randomized DBSCAN always performs at least as well as DBSCAN regardless of the relative density and chosen epsilon.
[Chart: "Complexity Analysis" — run time (s) vs. number of elements (N) for DBSCAN and eps = 1..10, with polynomial and linear trend lines]
TESTING
• Running time is dependent upon the number of elements; however, it improves with higher relative densities.
• Even a large amount of data can be processed quickly with a high relative density.
[Chart: "Complexity Analysis" — running time (s) vs. points per epsilon (PPE), with power-law fit y = 5.2012x^-0.364]
TESTING
• For any relative density above the minimum-points threshold, the Randomized DBSCAN algorithm returns the exact same result as the DBSCAN algorithm.
• We would expect Randomized DBSCAN to be more accurate at higher densities (higher probability for each point in epsilon range); however, it doesn't seem to matter above a very small threshold.
[Chart: "Accuracy Analysis" — error (%) vs. points per epsilon (PPE)]
FUTURE WORK
• Probabilistic analysis to determine the accuracy of the algorithm in n dimensions. Does the k-accuracy relationship scale linearly or (more likely) exponentially with the number of dimensions?
• Determine performance and accuracy implications for classification and discrete attributes.
• Combine the randomized DBSCAN with an indexed region query to reduce the time complexity of the clustering algorithm from O(n²) to O(n log n).
• Rerun tests with balanced data sets to highlight (and better represent) the improvement.
• Determine the optimal epsilon for the performance and accuracy of a particular data set.
DBRS
A Density-Based Spatial Clustering Method with Random Sampling
• Initially proposed by Xin Wang and Howard J. Hamilton in 2003
• Randomly selects points and assigns clusters
• Merges clusters that should be together
Advantages:
• Handles varying densities
Disadvantages:
• Same time and space complexity limitations as DBSCAN
• Requires an additional parameter and accompanying concept: purity
REFERENCES
I. Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. ISBN 1-57735-004-9. CiteSeerX: 10.1.1.71.1980.
II. Wang, Xin; Hamilton, Howard J. (2003). "DBRS: A Density-Based Spatial Clustering Method with Random Sampling."
III. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning: with Applications in R, Springer, 1st ed., 2013. ISBN 978-1461471370.
IV. Michael Mitzenmacher and Eli Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 1st ed., 2005. ISBN 978-0521835404.
V. Weisstein, Eric W. "Lens." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/Lens.html