4A_3 Parallel k-means clustering using GPUs for the Geocomputation of Real-time Geodemographics


Transcript

  • 1. Parallel k-means clustering using GPUs for the Geocomputation of Real-time Geodemographics
    Muhammad Adnan, Alex Singleton, Paul Longley
  • 2. Presentation Outline
    • Geodemographic Classifications
    • Does one size fit all?
    • Bespoke classifications
    • Live data
    • Major challenge for ‘on the fly’ classifications
    • Enhancements for k-means clustering algorithm
      • Parallel K-means clustering using Nvidia graphics cards
      • Comparison of K-means and Parallel K-means
      • Enhancement by using within sum of squares and standard deviation of clusters
    • Conclusion
  • 3. Geodemographic Classifications
    • OAC by ONS
    • Mosaic by Experian
    • Acorn by CACI
    • Microvision by NDS/Equifax
    • All of these classifications give a national-level overview of the UK by grouping areas into homogeneous clusters.
  • 4. Does one size fit all?
    [Figure panels: OAC (Output Area Classification): London; OAC (Output Area Classification): Birmingham]
  • 5. Does one size fit all?
    • There are some interesting underlying patterns at the finest geographical levels.
    • Different bespoke classifications have been developed over the past few years.
    [Figure panels: Employment classification of Yorkshire and Humber; OAC (Output Area Classification)]
  • 6. Does one size fit all?
    • E-society Classification by UCL
    • Education Classification by UCL
    • Crucible by Tesco
  • 7. Live data is available on the web
    • ONS NESS API (Live XML feeds)
    • Police API (Live XML feeds)
    • Hundreds of data sources on education, crime, transport and the environment (www.data.gov.uk)
    • This increases the need for ‘on the fly’ bespoke classifications.
      • Users control how the classification is created and how variables are weighted.
      • Classifications are produced within minutes rather than hours.
  • 8. Major challenge for ‘on the fly’ classifications
    • K-means is used to create geodemographic classifications.
    • K-means is an unstable clustering algorithm: different initial seeds can lead to different final clusters.
    • Creating a classification with K-means therefore requires running it multiple times on a data set.
      • 10,000 times (Singleton & Longley, 2008)
    • Creating OAC (k=7 groups) with K-means requires approx. 11.75 hours on a high-specification computer.
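The repeated-run procedure described on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans`, `best_of_n_runs`), the random seeding scheme and the tiny 2-D data set are assumptions made purely for illustration. Each run starts from different random seeds, and the run with the lowest within sum of squares is kept.

```python
import random


def wss(points, centroids, labels):
    """Within sum of squares: squared distance of each point to its centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(pt, centroids[lab]))
        for pt, lab in zip(points, labels)
    )


def kmeans(points, k, rng, max_iter=100):
    """One k-means run from random initial seeds; returns (wss, labels)."""
    centroids = rng.sample(points, k)  # assumption: seeds are drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: this run has converged
        labels = new_labels
        # Move each centroid to the mean of its members (unchanged if empty).
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return wss(points, centroids, labels), labels


def best_of_n_runs(points, k, n_runs, seed=0):
    """Run k-means n_runs times and keep the run with the minimum WSS."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(n_runs)), key=lambda r: r[0])


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_wss, best_labels = best_of_n_runs(points, k=2, n_runs=20)
```

Because every run is independent of the others, the total cost grows linearly with the number of runs, which is why 10,000 runs on a national data set become expensive.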
  • 9. Enhancements for K-means
    • This paper proposes two enhancements for K-means.
      • A parallel version of K-means that runs on the GPUs of Nvidia graphics cards.
      • A convergence criterion for repeated K-means runs, based on comparing the ‘within sum of squares’ and ‘standard deviation’ of consecutive runs.
  • 10. Parallel k-means Clustering algorithm
  • 11. Nvidia Graphics Cards
    • Nvidia graphics cards have multiple GPUs (Graphics Processing Units).
      • GeForce 8600M GS has 16 GPUs.
    • Each GPU can run one process independently of the others.
    • Programmers use C/C++ to write programs that can run on Nvidia graphics cards.
    • GPUs can be used for parallel computation of computationally expensive algorithms.
    [Figure: a graphics card that has 1,000 GPUs]
  • 12. Parallel k-means
    • User specifies K and N, where
      • K = number of clusters
      • N = number of k-means runs
    Step-1 CPU
    • Count number of GPUs
    • Prepare data points
    • Upload data to GPUs
  • 13. Parallel k-means
    Step-2 Graphics Card (GPU-1, GPU-2, GPU-3, ..., GPU-N)
    • Perform k-means clustering by minimizing within sum of squares.
    • Return the result back to CPU.
  • 14. Parallel k-means
    Step-3
    • The CPU keeps delegating data points to the GPUs until K-means has been run ‘N’ times.
    • The CPU compares the ‘within sum of squares’ of each run.
    • The run with the minimum ‘within sum of squares’ is taken as the geodemographic classification.
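The three steps above can be sketched on an ordinary CPU. This is only an analogy under stated assumptions, not the authors' GPU code: Python worker threads stand in for the GPUs, the data are one-dimensional to keep the example short, and the names `kmeans_1d` and `parallel_kmeans` are hypothetical. Threads in CPython will not give a real speed-up here; the point is the structure of dispatching independent runs and then reducing to the minimum within sum of squares.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def kmeans_1d(values, k, seed, max_iter=100):
    """One k-means run on 1-D data; returns (within sum of squares, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # New centroid = cluster mean; an empty cluster keeps its old centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    wss = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return wss, sorted(centroids)


def parallel_kmeans(values, k, n_runs, n_workers=4):
    # Step 1 (CPU): prepare the data and one random seed per run.
    seeds = range(n_runs)
    # Step 2 (workers, standing in for GPUs): runs are independent of each other.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda s: kmeans_1d(values, k, s), seeds))
    # Step 3 (CPU): keep the run with the minimum within sum of squares.
    return min(results, key=lambda r: r[0])


# Hypothetical toy data with two obvious groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
best_wss, best_centroids = parallel_kmeans(values, k=2, n_runs=8)
```

The design relies on the runs sharing nothing except the read-only input data, so no synchronisation is needed until the final comparison on the CPU.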
  • 15. Comparing k-means and Parallel k-means
    • OA (Output Area) Level Results
  • 16. Comparing k-means and Parallel k-means
    • LSOA (Lower Super Output Area) Level Results
  • 17. Comparing k-means and Parallel k-means
    • WARD Level Results
  • 18. Efficiency achieved by using Parallel K-means
    • OAC Classification by Parallel K-means
    • Parallel K-means gives an efficiency gain of roughly 90% over K-means.
    No. of clusters | K-means | Parallel K-means | Throughput
    7               | 9 sec.  | 0.54 sec.        | 94%
    12              | 25 sec. | 1.5 sec.         | 93%
    52              | 38 sec. | 2.16 sec.        | 89%
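The timings in the table can be turned into speed-up factors with a short calculation. The slide does not define how its ‘Throughput’ column was computed, so the two formulas below (ratio of run times, and percentage reduction in run time) are assumptions rather than the authors' definition. Applied to the rounded timings shown, the percentage reduction comes out at about 94% for all three rows, so the 93% and 89% figures on the slide may rest on a different definition or on unrounded timings.

```python
# (number of clusters, K-means seconds, Parallel K-means seconds) from the slide.
timings = [(7, 9.0, 0.54), (12, 25.0, 1.5), (52, 38.0, 2.16)]


def speedup(serial_s, parallel_s):
    """How many times faster the parallel run is."""
    return serial_s / parallel_s


def time_reduction_pct(serial_s, parallel_s):
    """Percentage of the serial run time that is saved (assumed definition)."""
    return 100.0 * (1.0 - parallel_s / serial_s)


for k, serial_s, parallel_s in timings:
    print(f"K={k}: {speedup(serial_s, parallel_s):.1f}x faster, "
          f"{time_reduction_pct(serial_s, parallel_s):.1f}% less run time")
# prints:
# K=7: 16.7x faster, 94.0% less run time
# K=12: 16.7x faster, 94.0% less run time
# K=52: 17.6x faster, 94.3% less run time
```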
  • 19. Within sum of squares of K-means
    • Running K-means on OA (Output Area) level data for the UK with K=7.
  • 20. 2nd performance enhancement for k-means: establishing a threshold value
    • If the threshold remains the same for another 100 runs, terminate the algorithm.
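A minimal sketch of this stopping rule follows. It assumes that the ‘threshold’ is the lowest within sum of squares seen so far; the slides also mention the standard deviation of clusters, which is not modelled here. The function name `runs_until_stable` and its parameters are hypothetical, and `run_once` stands for any callable that performs one K-means run and returns its within sum of squares.

```python
def runs_until_stable(run_once, patience=100, max_runs=10_000):
    """Repeat run_once() until the best WSS is unchanged for `patience` runs.

    Returns (best_wss, total_runs_performed).
    """
    best = float("inf")
    unchanged = 0  # consecutive runs that failed to improve on `best`
    for run in range(1, max_runs + 1):
        wss = run_once()
        if wss < best:
            best, unchanged = wss, 0  # new threshold: restart the counter
        else:
            unchanged += 1
        if unchanged >= patience:
            return best, run  # threshold held for `patience` runs: stop early
    return best, max_runs  # fall back to the fixed run budget


# Usage with a made-up sequence of WSS values and a short patience of 3 runs.
fake_results = iter([5.0, 4.0, 3.0] + [3.0] * 10)
best, n_runs = runs_until_stable(lambda: next(fake_results), patience=3)
# best == 3.0 and n_runs == 6: three improving runs, then three unchanged runs.
```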
  • 21. Testing the approach
    • OA (Output Area) level data for the UK was used, with K-means for K=7.
    • This approach is considerably faster than running K-means 10,000 times.
    Run Number | Convergence achieved
    1          | 1016 runs
    2          | 928 runs
    3          | 1800 runs
    4          | 826 runs
  • 22. Conclusion & Future Work
    • The need for real-time bespoke geodemographic classifications is increasing.
    • Parallel K-means is faster than the standard K-means clustering algorithm.
    • Parallel K-means can be used for ‘on the fly’ creation of geodemographic classifications.
    • Parallel K-means can be combined with the second approach described in this paper for greater computational throughput.
  • 23. Thank you for listening. Any questions?