Real time Geodemographics:  Requirements and Challenges Muhammad Adnan, Paul Longley
Real Time Geodemographics

This presentation compares different clustering algorithms by their computational time. It is the first step towards creating open source, bespoke Geodemographic classifications in near real time.

Published in: Education, Technology, Business

  1. Real time Geodemographics: Requirements and Challenges. Muhammad Adnan, Paul Longley
  2. Current Geodemographic classifications
     • Census data
         • E.g. the OA (Output Area) dataset has 41 census variables.
     • Variables are weighted according to their importance in the classification.
     • The K-means clustering algorithm is used to cluster data into homogeneous groups.
         • Multiple runs of K-means are needed because of its instability.
         • 10,000 times (Singleton, 2008)
  3. Need for real time Geodemographics
     • Current classifications are created using static data sources.
     • The rate and scale of current population change are making large surveys (census) increasingly redundant.
         • Significant hidden value in transactional data
     • Data is increasingly available in near real time.
         • E.g. ONS NESS API
     • Application specific (bespoke) classifications have demonstrated utility (Longley & Singleton, 2009).
  4. What are real time Geodemographics? Specification, Estimation, Testing
  5. Computational challenges
     • Integration of large and possibly disparate databases.
         • E.g. NHS data; Census data
     • Data normalisation and optimization for fast transactions.
     • Minimizing computational time of clustering algorithms (very important).
     • Common protocol
         • XML (SOAP)
     • Use of non-traditional data sources (Singleton, 2008).
         • E.g. Flickr; Facebook
  6. Important Challenge: Selection of clustering algorithm
     • K-Means
     • PAM (Partitioning Around Medoids)
     • CLARA (Clustering Large Applications)
     • GA (Genetic Algorithm)
  7. K-means
     • Attempts to find cluster centroids by minimising the within-cluster sum of squared distances.
     • K-means is unstable because its result depends on the initial seed assignment.
         • Sensitive to outliers.
     • Creating a Geodemographic classification requires running the algorithm multiple times.
         • 10,000 times (Singleton, 2008)
         • Computationally expensive in a real time environment.
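The multiple-run procedure described on this slide can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`wcss`, `kmeans_once`, `kmeans_best_of`), the random seeding strategy, and the default of 100 runs are assumptions chosen for the example. Each run starts from different random seeds, and the run with the lowest within-cluster sum of squares is kept.

```python
import random

def wcss(points, centroids, labels):
    """Within-cluster sum of squared Euclidean distances."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen seed points."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels, wcss(points, centroids, labels)

def kmeans_best_of(points, k, n_runs=100, seed=0):
    """Repeat k-means n_runs times and keep the run with the lowest WCSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[2])

# Usage: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, score = kmeans_best_of(data, k=2, n_runs=20)
```

The cost of this approach grows linearly with the number of runs, which is why 10,000 repetitions become expensive in a real time setting.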
  8. K-means (100 runs of k-means on OAC data set for k=4)
  9. An example of bad clustering result (K-means)
  10. An example of bad clustering result (K-means)
  11. An example of bad clustering result (K-means)
  12. Alternative Clustering Algorithms
      • PAM (Partitioning Around Medoids) tries to minimize the sum of distances of the objects to their cluster centers.
          • Less sensitive to outliers than K-means.
          • Cannot handle larger data sets.
      • CLARA (Clustering Large Applications) draws multiple samples of the dataset, applies PAM to each sample and returns the best result.
      • GA (Genetic Algorithm) is inspired by models of biological evolution. It produces results through a breeding procedure.
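The CLARA idea described above (sample, run PAM on the sample, keep the best medoids) can be sketched roughly as below. This is a simplified illustration and not the implementation used in the presentation: the PAM step here is a plain swap search from random starting medoids without the usual BUILD phase, and the names `pam`, `clara`, `total_cost` and the default sample settings are assumptions made for the example.

```python
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Simplified PAM: random initial medoids, then accept any improving swap."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # strictly better, so the loop terminates
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on several random samples; keep the medoids that
    give the lowest cost when scored against the FULL data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Usage: two groups of 30 points along a line, far apart from each other.
data = [(i * 0.1, 0.0) for i in range(30)] + [(100 + i * 0.1, 0.0) for i in range(30)]
medoids, cost = clara(data, k=2, n_samples=3, sample_size=20)
```

Because the expensive swap search only ever sees `sample_size` points, the run time depends mainly on the sample size rather than on the full data set, which is what makes the approach usable where PAM alone is not.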
  13. This paper compares
      • K-means
      • CLARA
      • GA
      • By using three data normalisation techniques:
          • Z-Scores
          • Range Standardisation
          • Principal Component Analysis
      • Algorithm stability of K-means, CLARA, and GA
  14. Data normalisation techniques used
      • Z-Scores
          • Widely used variable normalisation technique
          • Can create outliers in the datasets
      • Range Standardisation
          • Standardises values to the range 0-1
          • Can erase interesting patterns in the data
      • Principal Component Analysis
          • Reduces the dimensions of a data set
          • Can erase interesting patterns in the data
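The first two techniques are simple enough to show directly. The sketch below uses the standard definitions (subtract the mean and divide by the standard deviation; subtract the minimum and divide by the range). The function names are illustrative, the population standard deviation is an assumption (the slides do not say which form was used), and the variable is assumed not to be constant, since both formulas would otherwise divide by zero.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def range_standardise(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Usage: one extreme value dominates a small variable.
variable = [1, 2, 3, 4, 100]
z = z_scores(variable)            # the extreme value keeps a large score
r = range_standardise(variable)   # the other values are squeezed close to 0
```

In the usage example the single value of 100 illustrates both caveats from the slide: under Z-scores it remains an outlier, while under range standardisation it compresses the remaining values into a narrow band near zero, hiding the differences between them.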
  15. Comparing computational efficiency (Z-scores): PAM, and GA on the three geographic aggregations of a dataset covering London.
      • Figure 1: OA (Output Area) level results
      • Figure 2: LSOA (Lower Super Output Area) level results
      • Figure 3: Ward level results
  16. Comparing computational efficiency (Range Standardisation): PAM, and GA on the three geographic aggregations of a dataset covering London.
      • Figure 4: OA (Output Area) level results
      • Figure 5: LSOA (Lower Super Output Area) level results
      • Figure 6: Ward level results
  17. Comparing computational efficiency (PCA): PAM, and GA on the three geographic aggregations of a dataset covering London.
      • Figure 7: OA (Output Area) level results
      • Figure 8: LSOA (Lower Super Output Area) level results
      • Figure 9: Ward level results
  18. Algorithm Stability (w.r.t. computational time)
      • Figure 10: Running k-means on OA (Output Area) 120 times on each iteration
      • Figure 11: Running CLARA on OA (Output Area) 120 times on each iteration
      • Figure 12: Running GA on OA (Output Area) 120 times on each iteration
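A stability test of this kind amounts to timing the same algorithm repeatedly and looking at the spread of the run times. The following is a rough sketch of such a harness, not the procedure used for the figures: `time_runs` and `summarise` are hypothetical helper names, and the callable passed in stands for whichever clustering routine is being measured.

```python
import time
from statistics import mean

def time_runs(fn, n_runs=120):
    """Call fn() n_runs times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def summarise(durations):
    """Minimum, mean and maximum run time in seconds."""
    return {"min": min(durations), "mean": mean(durations), "max": max(durations)}

# Usage: time a stand-in workload; replace the lambda with a clustering call.
timings = time_runs(lambda: sorted(range(1000), reverse=True), n_runs=120)
stats = summarise(timings)
```

A narrow gap between the minimum and maximum suggests that the algorithm's run time is predictable, which matters when results must be returned in near real time.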
  19. K-means and Principal Component Analysis
      • PCA can be used to facilitate K-means clustering by reducing dimensions (Ding, C., He, X., 2004).
      • Figure 13: K-means result for 41 “OAC variables”
      • Figure 14: K-means result for 26 “OAC Principal Components”
      • K=4 (99% similar)
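The slide does not say how the "99% similar" figure was computed. One plausible way to compare two cluster assignments of the same objects, shown here purely as an assumption, is to find the relabelling of one result that maximises the share of objects placed in the same cluster, since cluster numbers themselves are arbitrary. The function name `agreement` is hypothetical, and the brute-force search over permutations is only practical for small k such as k=4.

```python
from itertools import permutations

def agreement(labels_a, labels_b, k):
    """Largest share of objects with matching labels over all
    relabellings of labels_b (cluster ids are 0 .. k-1)."""
    best = 0
    for perm in permutations(range(k)):
        matches = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best = max(best, matches)
    return best / len(labels_a)

# Usage: the same partition under different cluster numbering,
# and a partition that differs for one object out of four.
same = agreement([0, 0, 1, 1, 2, 3], [1, 1, 0, 0, 3, 2], k=4)
close = agreement([0, 0, 1, 1], [0, 1, 1, 1], k=2)
```

Here `same` is 1.0 because the two labelings describe an identical partition, while `close` is 0.75 because three of the four objects agree under the best relabelling.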
  20. K-means and Principal Component Analysis
      • PCA can be used to facilitate K-means clustering by reducing dimensions (Ding, C., He, X., 2004).
      • Figure 13: K-means result for 41 “OAC variables”
      • Figure 14: K-means result for 26 “OAC Principal Components”
  21. Conclusion
      • CLARA is a plausible alternative to k-means in a real time Geodemographic classification system.
      • K-means might be combined with PCA to reduce its computational cost.
      • In an online environment k-means is better for small data sets.
      • Exploration of non-traditional data sources.
