Real Time Geodemographics

This presentation compares different clustering algorithms by their computational time. It is the first step towards creating open source, bespoke Geodemographic classifications in near real time.

Published in: Education, Technology, Business

  1. 1. Real time Geodemographics: Requirements and Challenges. Muhammad Adnan, Paul Longley
  2. 2. Current Geodemographic classifications <ul><li>Census data </li></ul><ul><ul><ul><li>E.g. the OA (Output Area) dataset has 41 census variables. </li></ul></ul></ul><ul><li>Variables are weighted according to their importance in the classification. </li></ul><ul><li>The K-means clustering algorithm is used to cluster the data into homogeneous groups. </li></ul><ul><ul><ul><li>K-means is run multiple times because of its instability </li></ul></ul></ul><ul><ul><ul><li>10,000 times (Singleton, 2008) </li></ul></ul></ul>
  3. 3. Need for real time Geodemographics <ul><li>Current classifications are created using static data sources. </li></ul><ul><li>The rate and scale of current population change are making large surveys (such as the census) increasingly redundant. </li></ul><ul><ul><ul><li>Significant hidden value in transactional data </li></ul></ul></ul><ul><li>Data is increasingly available in near real time </li></ul><ul><li>E.g. the ONS NESS API </li></ul><ul><li>Application-specific (bespoke) classifications have demonstrated utility (Longley & Singleton, 2009). </li></ul>
  4. 4. What are real time Geodemographics? Specification, Estimation, Testing
  5. 5. Computational challenges <ul><li>Integration of large and possibly disparate databases. </li></ul><ul><ul><ul><li>E.g. NHS data; Census data </li></ul></ul></ul><ul><li>Data normalisation and optimisation for fast transactions. </li></ul><ul><li>Minimising the computational time of clustering algorithms (Very Important)! </li></ul><ul><li>Common protocol </li></ul><ul><ul><ul><li>XML (SOAP) </li></ul></ul></ul><ul><li>Use of non-traditional data sources (Singleton, 2008). </li></ul><ul><ul><ul><li>E.g. Flickr; Facebook </li></ul></ul></ul>
  6. 6. Important Challenge: Selection of clustering algorithm <ul><li>K-Means </li></ul><ul><li>PAM (Partitioning Around Medoids) </li></ul><ul><li>CLARA (Clustering Large Applications) </li></ul><ul><li>GA (Genetic Algorithm) </li></ul>
  7. 7. K-means <ul><li>Attempts to find cluster centroids by minimising the within-cluster sum of squared distances. </li></ul><ul><li>K-means is unstable because its result depends on the initial seed assignment. </li></ul><ul><ul><ul><li>Sensitive to outliers. </li></ul></ul></ul><ul><li>Creating a Geodemographic classification requires running the algorithm multiple times. </li></ul><ul><ul><ul><li>10,000 times (Singleton, 2008) </li></ul></ul></ul><ul><ul><ul><li>Computationally expensive in a real time environment. </li></ul></ul></ul>
  8. 8. K-means (100 runs of k-means on OAC data set for k=4)
  9. 9. An example of a bad clustering result (K-means)
  10. 10. An example of a bad clustering result (K-means)
  11. 11. An example of a bad clustering result (K-means)
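The restart strategy described on the K-means slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain Lloyd-style K-means with random initial seeds, and the function names (`kmeans`, `best_of_n_runs`) and the toy data are hypothetical. Each run starts from different seeds, and the run with the lowest within-cluster sum of squares (WSS) is kept.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, max_iter=100):
    """One K-means run from randomly chosen initial seeds.

    Returns (centroids, labels, wss)."""
    centroids = rng.sample(points, k)  # initial seeds: k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wss


def best_of_n_runs(points, k, n_runs, seed=0):
    """Repeat K-means n_runs times; return the best run and every run's WSS."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2]), [run[2] for run in runs]


# Toy data (an assumption for illustration): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best, all_wss = best_of_n_runs(points, k=2, n_runs=10)
print("best WSS:", best[2])
print("WSS per run:", all_wss)
```

The spread of values in `all_wss` gives a rough picture of how much the result depends on the initial seeds. The cost of the approach grows linearly with the number of runs, which is why 10,000 restarts are expensive in a real time setting.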
  12. 12. Alternative Clustering Algorithms <ul><li>PAM (Partitioning Around Medoids) tries to minimise the sum of distances of the objects to their cluster centres (medoids). </li></ul><ul><ul><ul><li>Less sensitive to outliers than K-means. </li></ul></ul></ul><ul><ul><ul><li>Cannot handle large data sets. </li></ul></ul></ul><ul><li>CLARA (Clustering Large Applications) draws multiple samples of the dataset, applies PAM to each sample, and returns the best result. </li></ul><ul><li>GA (Genetic Algorithm) is inspired by models of biological evolution. It produces results through a breeding procedure. </li></ul>
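The relationship between PAM and CLARA described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the presentation: the `pam` function below starts from random medoids and applies greedy swaps only (the original PAM also has a separate build phase), Manhattan distance is an arbitrary choice, and all names, parameters, and data are hypothetical.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two points (tuples)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def pam(points, k, rng):
    """Simplified PAM: random initial medoids, then greedy swaps of a
    medoid with a non-medoid for as long as the total cost decreases."""
    medoids = rng.sample(points, k)
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    return medoids


def clara(points, k, n_samples=5, sample_size=20, seed=0):
    """CLARA: run PAM on several random samples and keep the medoids
    that give the lowest cost when evaluated on the full data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # scored on all points, not the sample
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost


# Toy data (an assumption for illustration): two 5x5 grids far apart.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(x, y) for x in range(100, 105) for y in range(100, 105)]
medoids, cost = clara(points, k=2)
print(medoids, cost)
```

Because the swap search only ever sees `sample_size` points, its cost no longer grows with the full data set, which is the property that makes CLARA attractive for larger inputs. The trade-off is that the result depends on which samples happen to be drawn.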
  13. 13. This paper compares <ul><li>K-means </li></ul><ul><li>CLARA </li></ul><ul><li>GA </li></ul><ul><li>Using three data normalisation techniques </li></ul><ul><li>Z-Scores </li></ul><ul><li>Range Standardisation </li></ul><ul><li>Principal Component Analysis. </li></ul><ul><li>Algorithm stability of K-means, CLARA, and GA </li></ul>
  14. 14. Data normalisation techniques used <ul><li>Z-Scores </li></ul><ul><ul><ul><li>Widely used variable normalisation technique </li></ul></ul></ul><ul><ul><ul><li>Can create outliers in the datasets </li></ul></ul></ul><ul><li>Range Standardisation </li></ul><ul><ul><ul><li>Standardises values to a range of 0-1 </li></ul></ul></ul><ul><ul><ul><li>Can erase interesting patterns in the data </li></ul></ul></ul><ul><li>Principal Component Analysis. </li></ul><ul><ul><ul><li>Reduces the dimensions of a data set </li></ul></ul></ul><ul><ul><ul><li>Can erase interesting patterns in the data </li></ul></ul></ul>
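The first two normalisation techniques listed above can be written out for a single variable as a short sketch. The function names and sample values are hypothetical, the population standard deviation is an assumption (the slides do not say which variant was used), and both functions assume the variable is not constant.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def range_standardise(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one census variable across eight areas.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(values))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(range_standardise(values))
```

Z-scores are unbounded, so an unusual area keeps a large value after scaling, whereas range standardisation forces every variable into 0-1 and a single extreme value can compress the remaining areas into a narrow band.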
  15. 15. Comparing computational efficiency (Z-scores) PAM, and GA on the three geographic aggregations of a dataset covering London. Figure 1: OA (Output Area) level results. Figure 2: LSOA (Lower Super Output Area) level results. Figure 3: Ward level results.
  16. 16. Comparing computational efficiency (Range Standardisation) PAM, and GA on the three geographic aggregations of a dataset covering London. Figure 4: OA (Output Area) level results. Figure 5: LSOA (Lower Super Output Area) level results. Figure 6: Ward level results.
  17. 17. Comparing computational efficiency (PCA) PAM, and GA on the three geographic aggregations of a dataset covering London. Figure 7: OA (Output Area) level results. Figure 8: LSOA (Lower Super Output Area) level results. Figure 9: Ward level results.
  18. 18. Algorithm Stability (w.r.t. computational time). Figure 10: Running k-means on OA (Output Area) 120 times on each iteration. Figure 11: Running CLARA on OA (Output Area) 120 times on each iteration. Figure 12: Running GA on OA (Output Area) 120 times on each iteration.
  19. 19. K-means and Principal Component Analysis <ul><li>PCA can be used to facilitate K-means clustering by reducing dimensions. </li></ul><ul><li>(Ding, C., He, X., 2004) </li></ul>Figure 13: K-means result for 41 “OAC variables”. Figure 14: K-means result for 26 “OAC Principal Components”. K=4 (99% similar)
  20. 20. K-means and Principal Component Analysis <ul><li>PCA can be used to facilitate K-means clustering by reducing dimensions. </li></ul><ul><li>(Ding, C., He, X., 2004) </li></ul>Figure 13: K-means result for 41 “OAC variables”. Figure 14: K-means result for 26 “OAC Principal Components”.
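The slides do not say how the similarity between the two K-means results was measured. One plausible way to compare two cluster assignments, offered here only as an assumption, is the share of areas placed in the same cluster once the arbitrary cluster numbers have been matched up. The function name and the example labels below are hypothetical.

```python
from itertools import permutations


def agreement(labels_a, labels_b, k):
    """Share of items given the same cluster by both labelings, after
    relabelling labels_b with the permutation that best matches labels_a."""
    best = 0
    for perm in permutations(range(k)):
        matches = sum(a == perm[b] for a, b in zip(labels_a, labels_b))
        best = max(best, matches)
    return best / len(labels_a)


# Hypothetical labels for eight areas: one clustering on the full set of
# variables, one on the principal components.
full = [0, 0, 1, 1, 2, 2, 3, 3]
reduced = [2, 2, 0, 0, 1, 1, 3, 0]
print(agreement(full, reduced, k=4))  # 0.875
```

Trying every permutation is only practical for small k (24 permutations for k=4). For larger k, an assignment-problem solver or a label-invariant index such as the adjusted Rand index would be the usual choice.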
  21. 21. Conclusion <ul><li>CLARA is a plausible alternative to k-means in a real time Geodemographic classification system. </li></ul><ul><li>K-means might be combined with PCA for improved computational efficiency. </li></ul><ul><li>In an online environment, k-means is better for small data sets. </li></ul><ul><li>Exploration of non-traditional data sources. </li></ul>
