Operation Point Cluster - Blue Raster Esri Developer Summit 2013 Presentation

  1. Brendan Collins
  2. “The function of the brain and nervous system is to protect us from being overwhelmed and confused by this mass of largely useless and irrelevant knowledge, by shutting out most of what we should otherwise perceive or remember at any moment, and leaving only that very small and special selection which is likely to be practically useful.” - Aldous Huxley
  3. 103,000 Public Schools (No Clustering)
  4. 103,000 Public Schools (Count)
  5. 103,000 Public Schools (Mean Student-Teacher Ratio)
  6. Operation Point Cluster
     • Review general clustering algorithms
     • Suggest strategies & implementations for clustering in web applications
       – Server-side (C#)
       – Offline w/ArcGIS (Python)
       – Offline w/3rd Party (Python)
  7. Data Classification (One-Dimensional Clustering)
     • Equal-interval
       – Clusters span the same interval (max – min)
     • Quantile
       – Clusters have the same count
     • Natural Breaks (Jenks)
       – Each cluster has minimum deviation from its own mean
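
The first two classification schemes on this slide are simple enough to sketch in plain Python. This is only an illustration, not code from the presentation: the function names and the sample student-teacher ratios are made up, and Natural Breaks (Jenks) is left out because it needs an optimization pass rather than a one-line rule.

```python
def equal_interval_breaks(values, k):
    """Upper class bounds for k classes that each span (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    """Upper class bounds chosen so each class holds about len(values) / k items."""
    ordered = sorted(values)
    n = len(ordered)
    # ceil(n * i / k) values fall at or below the i-th break.
    return [ordered[-(-n * i // k) - 1] for i in range(1, k + 1)]


def classify(value, breaks):
    """Index of the first class whose upper bound is >= value."""
    for index, upper in enumerate(breaks):
        if value <= upper:
            return index
    # Guard against floating-point rounding on the last bound.
    return len(breaks) - 1


# Hypothetical student-teacher ratios, for illustration only.
ratios = [10, 12, 13, 15, 18, 22, 30, 40]
interval_bounds = equal_interval_breaks(ratios, 3)
quantile_bounds = quantile_breaks(ratios, 4)
```

With skewed data like the sample above, equal-interval classes end up unevenly filled, while quantile classes hold two values each but cover very different ranges.
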
  8. KMeans (Centroid-based)
  9. KMeans (Centroid-based)
     1. Choose random starting points (the initial centroids)
     2. Assign each target point to the nearest cluster candidate
     3. Replace each randomly chosen centroid with the mean of its group
     4. Repeat steps 2 & 3 until convergence
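
The four steps on this slide map almost directly onto code. The following is a minimal pure-Python sketch for 2D points, not the presenter's implementation; the function name, the `seed` and `max_iter` parameters, and the choice of Euclidean distance are assumptions made for the example. It needs Python 3.8+ for `math.dist`.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster 2D points into k groups; returns (centroids, groups)."""
    rng = random.Random(seed)
    # Step 1: choose k random starting points as the initial centroids.
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Step 3: replace each centroid with the mean of its group
        # (an empty group keeps its previous centroid).
        new_centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(points, k=2)
```

For a map, each returned centroid would become one cluster marker, with `len(group)` (or any other statistic of the group) as its label.
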
  10. Grid Clustering (Grid-based)
      1. Overlay a mesh sized appropriately for the zoom level
      2. Compare point coordinates to the mesh to create clusters
      • Very common on client-side
      • Can lead to undesired “Grid” effect
      • Somewhat non-deterministic
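
A rough sketch of the two grid-clustering steps, under some assumptions that are not in the slides: coordinates are projected meters (Web Mercator with 256-pixel tiles), the mesh is anchored at the origin, and each cluster is reported at the mean position of its members. The function names and the `cell_pixels` parameter are hypothetical.

```python
import math
from collections import defaultdict

# Assumption: Web Mercator, where the whole world (about 40,075 km wide
# at the equator) fits in a single 256-pixel tile at zoom 0.
WORLD_WIDTH_M = 2 * math.pi * 6378137


def cell_size_for_zoom(zoom, cell_pixels=64):
    """Ground width in meters of a mesh cell that is cell_pixels wide on screen."""
    return cell_pixels * WORLD_WIDTH_M / (256 * 2 ** zoom)


def grid_cluster(points, cell_size):
    """Group (x, y) points by mesh cell; returns one summary dict per occupied cell."""
    cells = defaultdict(list)
    for x, y in points:
        # Step 2: the cell index is the coordinate divided by the cell size, floored.
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    clusters = []
    for members in cells.values():
        n = len(members)
        clusters.append({
            "x": sum(x for x, _ in members) / n,
            "y": sum(y for _, y in members) / n,
            "count": n,
        })
    return clusters


clusters = grid_cluster([(1, 1), (2, 2), (11, 1), (-1, -1)], cell_size=10)
```

Because cell membership depends only on where the mesh lines fall, two points a few meters apart can land in different clusters, which is one source of the “Grid” effect noted on the slide.
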
  11. QuadTree (Distance-based) http://en.wikipedia.org/wiki/QUADTREE
  12. QuadTree (Distance-based)
      1. Input minimum cluster tolerance
      2. Recursively insert points into the existing tree
         1. Where distance < tolerance, number of points++
         2. Where distance > tolerance, insert to child node
      • Easy to implement
      • Can lead to “Grid” effect
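
One way to read the insertion rule on this slide is as a point quadtree in which every node doubles as a cluster. The sketch below follows that reading; it is an assumption, not the presenter's C# code, and the class and function names are invented for the example. A point exactly at the tolerance distance is treated as "outside", since the slide leaves that case open.

```python
import math


class ClusterNode:
    """A quadtree node that also acts as a cluster anchored at (x, y)."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self.count = 1
        # Children indexed by quadrant relative to this node: SW, SE, NW, NE.
        self.children = [None, None, None, None]

    def insert(self, x, y, tolerance):
        node = self
        while True:
            # Within tolerance of an existing node: fold the point into it.
            if math.hypot(x - node.x, y - node.y) < tolerance:
                node.count += 1
                return
            # Otherwise descend into the child quadrant containing the point.
            quadrant = (x >= node.x) + 2 * (y >= node.y)
            child = node.children[quadrant]
            if child is None:
                node.children[quadrant] = ClusterNode(x, y)
                return
            node = child

    def clusters(self):
        """Yield (x, y, count) for this node and every descendant."""
        yield (self.x, self.y, self.count)
        for child in self.children:
            if child is not None:
                yield from child.clusters()


def quadtree_cluster(points, tolerance):
    """Build the tree from a non-empty list of (x, y) points and list its clusters."""
    first, *rest = points
    root = ClusterNode(*first)
    for x, y in rest:
        root.insert(x, y, tolerance)
    return sorted(root.clusters())


result = quadtree_cluster(
    [(0, 0), (0.5, 0.5), (10, 10), (10.2, 9.9), (-5, 3)], tolerance=1
)
```

The result depends on insertion order, because the first point to arrive in an area becomes the cluster's anchor.
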
  13. DBSCAN (Density-based) http://en.wikipedia.org/wiki/DBSCAN
  14. DBSCAN (Density-based)
      1. Takes a search radius and a minimum number of points per cluster
      2. Visit each point and count the number of points within the search radius
      • Clusters can be any shape
      • Search radius determined by zoom level
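
The slide compresses DBSCAN into two steps; the part it leaves implicit is that a cluster grows outward from any point that has enough neighbors. A minimal brute-force sketch (every distance check scans all points, so it is only suitable for small inputs) might look like the following. The function name, the `NOISE` marker, and the sample data are assumptions for the example; it needs Python 3.8+ for `math.dist`.

```python
import math

NOISE = -1


def dbscan(points, radius, min_points):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within the search radius, including the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= radius]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, radius=1.5, min_points=3)
```

In a web map, `radius` would be recomputed per zoom level, as the slide suggests, and isolated points labeled `NOISE` would simply be drawn as individual markers.
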
  15. Strategies & Implementations for Web Apps (Server Object Extension vs. Pre-Crunched)
  16. Where should clustering occur?
      • Client-side
        – Small number of points (< 10,000)
        – No additional server load
        – Widely available within client APIs
        – Limited by client-side languages
      • Server-side
        – Medium number of points (< 1M)
        – Many language/library options
        – Robust querying
        – Very maintainable / extensible
      • Offline
        – Large number of points (> 1M)
        – Many language/library options
        – Limited querying
        – Output is a normal Feature Class
  17. Clustering Server Object Extension (C#/QuadTree)
      1. Extends MapServer
      2. Wraps map query based on extent
      3. Returns clustered results
      4. Stateless
      5. Problems
         1. Re-calculates tree on each request
         2. Client-side wrappers
         3. Loses out-of-box ArcGIS Server functions
  18. Clustering with Arcpy (distance-based / offline)
      1. Divide data into logical chunks (where clause)
      2. Integrate using tolerance
      3. Collect Events
      4. Spatial Join to add descriptive statistics
      5. Append all results
  19. Clustering w/Python
      • Numpy/Scipy – De facto standard
      • Scikit-Learn – (Python machine learning library)
      • PyTables – HDF5, akin to NetCDF, but with support for hierarchical tables and very scalable
        – http://bcdcspatial.blogspot.com/2013/02/converting-arcgis-feature-class-to.html
  20. Scikit-Learn: btw, it’s awesome - http://scikit-learn.org/stable/
  21. Bleeding Edge Python
      • PyPy, Cython, Anaconda, NumbaPro, Pandas
      • Python is now a first-class citizen on the GPU!
  22. In Summary:
      • Clustering is not Panning
      • Think outside Count
      • Clustering is not only for spatial data
  23. Thank You!
      Follow us on Twitter: @blueraster @brendancol
      Visit us at: blueraster.com/blog bcdcspatial.blogspot.com
