American Clusters Classification Methodology


This presentation describes the methodology and process used to create the American Clusters Geodemographic Classification. For more, visit WORLD CLUSTERS.ORG

Published in: Business

  • American Clusters Classification Methodology

    1. 1. American Clusters Geodemographic Classification METHODOLOGY
    2. 2. Geodemographics
        “Birds of a feather flock together” (English proverb).
        Geodemographics is the analysis of people by where they live (Sleight, 1996).
        Geodemographic classification categorizes neighborhoods based on their socio-economic and lifestyle characteristics.
    3. 3. Applications of Geodemographic Classifications
        Commercial: Market Research; Site Selection; Trade Area Analysis; Direct Marketing; Advertising; Management; Media Analysis.
        Not-for-profit: Public Sector; Health Care; Education; Local Authority; Policing; Academic; Poverty Prevention; Elections; Charities; Planning; Research.
        Commercial: people who live in an area where there is...
          • A high probability of buying a particular type of newspaper.
          • A high application rate for consolidation loans.
          • A high consumption of fashion goods.
        Not-for-profit: people who live in an area where there is...
          • Low grades in secondary school exams.
          • High risk of developing diabetes.
          • High fear of crime, but low crime levels.
        Source: adapted from Bolton Council; Harris et al. 2005; OAC.
    4. 4. Open Geodemographics
        The term “Open Geodemographics” was first used and applied by Dr. Dan Vickers, the author of “Multi-level Integrated Classification Based on the 2001 Census”. In this paper Dr. Vickers describes the methods and techniques used to build the National Classification of Census Output Areas (UK). Dr. Vickers has published the methodology and results of his work online for review and free download. To find out more about Open Geodemographics, please watch this presentation.
    5. 5. Project Objectives
        To create an open and free geodemographic classification of the USA using Census 2000 results, following the methodology developed by Dr. Dan Vickers.
    6. 6. Project Methodology
        Steps, in order:
          • Selection of cluster objects (operational taxonomic units)
          • Variables selection
          • Variables standardization
          • Clustering method selection
          • Identification of cluster number
          • Interpretation, testing and mapping of clusters
        The classification building process was largely based on the methodology elaborated by Daniel Vickers in “Multi-level Integrated Classification Based on the 2001 Census” (2006). Other methodologies were also considered, such as the methodology of building MOSAIC described in “Geodemographics, GIS and Neighborhood Targeting” by R. Harris, P. Sleight and R. Webber (Wiley, 2005).
    7. 7. Classification Inputs
          • 208,000 Census block groups of all 50 states
          • 281,000,000 Overall population
          • 106,000,000 Households
    8. 8. Data and Variables
        Data: US Census 2000 is the major source of data for the classification.
        Variables: Some researchers, such as Harris and Webber, suggest including as many variables as possible to create “more meaningful clusters” (Harris et al., 2005). Others, such as Everitt and Vickers, hold that the minimum number of variables should be used in order to prevent data redundancy and collinearity (Vickers, 2006). In our classification we try to use only “necessary” variables to avoid data redundancy and collinearity.
    9. 9. Variables Selection
        List of initial 77 variables:
        1. Male
        2. Female
        3. Urban
        4. Rural
        5. White
        6. Black
        7. Native
        8. Asian
        9. Hawaiian
        10. Some other race
        11. Two or more races
        12. Aged 0-4
        13. Aged 5-14
        14. Aged 15-24
        15. Aged 25-44
        16. Aged 45-64
        17. Aged 65+
        18. Foreign born
        19. Not a citizen
        20. Spanish language households
        21. Other European languages households
        22. Asian language households
        23. One person households (under 65)
        24. One person households (over 65)
        25. One parent households
        26. Family households with no children
        27. Non family households
        28. Married
        29. Divorced
        30. Widowed
        31. Never Married
        32. Occupied by renters
        33. Median Rent
        34. Avg. house size
        35. One bedroom
        36. Four bedrooms
        37. Gas
        38. Bottled Gas Kerosene
        39. Wood
        40. No fuel
        41. Median year house built
        42. Median house value
        43. Occupied, more than 1.5 persons per room
        44. Occupied, less than 0.5 persons per room
        45. Lacking complete plumbing facilities
        46. Lacking complete kitchen facilities
        47. Households without a mortgage
        48. Second mortgage or home equity
        49. Owner cost with mortgage
        50. Owner cost without mortgage
        51. Percent of monthly cost (with mortgage) exceeds 40
        52. Percent of monthly cost (with mortgage) does not exceed 10
        53. Percent of monthly cost (without mortgage) exceeds 25
        54. High School Degree
        55. Higher Education Attainment
        56. No car households
        57. 2+ car households
        58. Work at home
        59. Car to work
        60. Carpool to work
        61. Public Transport to Work
        62. Bicycle or Walk to Work
        63. Long commuters (60+ min to work)
        64. Short Commuters (less than 15 min to work)
        65. To work from 12 to 5 am
        66. To work from 8 to 10 am
        67. To work from 4 pm to 11:59 pm
        68. Standardised Disability Ratio
        69. Unemployed
        70. Working part-time
        71. People living below poverty line
        72. Retirement Income
        73. Public Assistance Income
        74. Supplemental Security Income
        75. Social Security Income for HH
        76. Interest, Dividends or Net rental income for HH
        77. Median HH income
        Some of them had to be removed.
    10. 10. Variables Selection
        The initial list of variables was reviewed and reduced by applying three analytical tools:
          • Principal Component Analysis (PCA)/Factor analysis: variables with high factor loadings were selected.
          • Correlation Matrix: pairs of variables strongly correlated with each other were examined, and only one variable from each pair was kept for analysis.
          • Standard Deviation Evaluation: variables with low SD were not included because they vary little between block groups and do not bring value to the clustering process.
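The correlation and standard-deviation screens described on this slide can be sketched as follows. This is a minimal illustration, not the project's actual procedure: the thresholds `min_sd` and `max_abs_r`, the function names and the toy variables are all assumptions, and the PCA/factor-loading step is not covered.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen_variables(data, min_sd=0.05, max_abs_r=0.8):
    """Return the variable names kept after the SD and correlation screens.

    data: dict mapping variable name -> list of values, one per block group.
    Both thresholds are illustrative assumptions, not values from the project.
    """
    # Drop variables that barely vary between areas.
    kept = [name for name, values in data.items() if pstdev(values) >= min_sd]
    selected = []
    for name in kept:
        # Keep a variable only if it is not strongly correlated
        # with a variable that has already been selected.
        if all(abs(pearson(data[name], data[s])) < max_abs_r for s in selected):
            selected.append(name)
    return selected

# Hypothetical toy data for five block groups.
toy = {
    "pct_renters": [0.10, 0.40, 0.80, 0.30, 0.60],
    "pct_owners":  [0.90, 0.60, 0.20, 0.70, 0.40],  # mirror image of pct_renters
    "pct_aged_65": [0.20, 0.10, 0.15, 0.30, 0.05],
    "pct_no_fuel": [0.01, 0.01, 0.02, 0.01, 0.01],  # almost constant
}
selected = screen_variables(toy)
print(selected)
```

In this toy case the near-constant variable is removed by the SD screen and the mirror-image variable by the correlation screen. Which member of a correlated pair survives here depends only on input order; the slides do not say how that choice was made.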
    11. 11. Variables Selection
        The list of variables was reduced from 77 to 47 variables:
        1. Urban population
        2. Black
        3. Asian
        4. Aged under 5
        5. Aged 5-14
        6. Aged 15-24
        7. Aged 25-44
        8. Aged 45-64
        9. Aged 65+
        10. One person households (under 65)
        11. One person households (over 65)
        12. One parent households
        13. Households with no children
        14. Foreign born population
        15. Spanish language households
        16. Other European languages households
        17. Never married
        18. Occupied by renters
        19. Median rent
        20. One bedroom
        21. Four bedrooms
        22. Bottled gas, oil, kerosene as fuel
        23. Median house value
        24. Occupied 1.5 person per room
        25. Households without a mortgage
        26. Second mortgage or home equity
        27. Monthly cost (with mortgage) less than 10%
        28. Monthly cost (with mortgage) more than 40%
        29. Higher education
        30. Two-car households
        31. Car to work
        32. Public transport to work
        33. Bicycle or walk to work
        34. Long commuters (60+ min to work)
        35. To work from 8 to 10 am
        36. Standardized Disability Ratio
        37. Unemployed
        38. Working part-time
        39. People living below poverty line
        40. Retirement income
        41. Public assistance income
        42. Social security income
        43. Interest, dividends or net rental income
        44. Median household income
        45. Government workers
        46. Self-employed
        47. Agriculture, forestry, fishing and hunting and mining
    12. 12. Variable Transformation
        Variable transformation and standardization allowed for the inclusion of different types of data (e.g. percent of black population, median number of rooms, median household income).
        Log transformation: Transforming the data to a logarithmic scale greatly reduced the problem of very high-value outliers, because the log scale shrinks differences between values at the extremes of the data set by more than differences between more typical values (Methods for Area classification for output areas, National Statistics, UK).
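A small sketch of why a log scale tames high-value outliers. The house values below are invented for illustration, and the `offset` of 1 (used so that zero values stay defined) is an assumption; the slides do not say how zeros were handled.

```python
from math import log

def log_transform(values, offset=1.0):
    """Natural-log transform; the offset keeps zero values defined (assumed)."""
    return [log(v + offset) for v in values]

# Hypothetical median house values; the last one is an extreme outlier.
median_house_value = [60_000, 80_000, 100_000, 120_000, 1_000_000]
transformed = log_transform(median_house_value)

# How many "typical steps" (first to second value) the outlier lies from the
# minimum, before and after the transformation.
raw_ratio = (median_house_value[-1] - median_house_value[0]) / (
    median_house_value[1] - median_house_value[0])
log_ratio = (transformed[-1] - transformed[0]) / (transformed[1] - transformed[0])
print(raw_ratio, round(log_ratio, 2))
```

On the raw scale the outlier sits 47 typical steps from the minimum; on the log scale it sits roughly a fifth as far, so it pulls far less on any distance-based method.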
    13. 13. Variables Transformation
        Z-score standardization: Z-score standardization is the most popular method of data standardization, but in our project it gave poor results because highly skewed variables were given too much weight, which could lead to misleading clusters.
        Range standardization: To reduce the effect of highly skewed variables, range standardization was chosen. This method rescales each variable to the range between 0 (minimum value) and 1 (maximum value).
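The two standardizations can be compared side by side. This is a minimal sketch on invented data: z-score computes (x - mean) / SD, range standardization computes (x - min) / (max - min). Function names are illustrative.

```python
from statistics import mean, pstdev

def z_score(values):
    """(x - mean) / standard deviation, using the population SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def range_standardize(values):
    """(x - min) / (max - min), so every variable spans exactly 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical skewed variable: most areas are low, one is very high.
skewed = [1, 2, 2, 3, 3, 3, 4, 50]
z = z_score(skewed)
r = range_standardize(skewed)

print("z-score span:", round(max(z) - min(z), 2))
print("range span:  ", max(r) - min(r))
```

After range standardization every variable has a span of exactly 1, so no single variable can contribute more than 1 to a per-variable distance. A z-scored variable with a long tail keeps a span several times larger than that.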
    14. 14. Clustering Process
        K-means: To create the classification, the k-means method (performed in SPSS) was chosen because it works well on large datasets (Vickers, 2006). The major issue with k-means is that the number of clusters has to be specified beforehand.
        To determine the number of clusters, two tests were performed:
          • Test 1: Evaluation of the average distance to the cluster center
          • Test 2: Cluster size range assessment
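For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the standard (Lloyd's) algorithm. It is not the SPSS implementation used in the project. The deterministic farthest-first choice of starting centers is an assumption made so the example is reproducible, and the toy points are invented.

```python
from math import dist

def farthest_first_centers(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (an assumed, deterministic rule)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_iterations=100):
    """Lloyd's k-means. Returns (centers, labels); k must be given up front."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the solution is stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Hypothetical range-standardized data: two obvious groups of three areas.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
centers, labels = k_means(points, 2)
print(labels)
```

The function needs `k` as an argument, which is exactly the limitation noted on the slide: the algorithm cannot decide the number of clusters by itself.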
    15. 15. Test 1. Evaluation of average distance to cluster center
        It has been suggested that the most useful number of clusters for a classification is around 6 (Vickers, 2006), so the target number of k-means clusters was sought between 4 and 8. We looked for the solution with the most significant increase in average distance from the cluster center. The most significant increase between two consecutive solutions was found at the 6-cluster solution.
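A sketch of how Test 1 could be computed. The slide does not spell out the exact comparison, so this is one plausible reading: compute the average distance from each area to its cluster center for every candidate number of clusters, then look at the change between consecutive solutions. The `toy_avg` figures are invented toy values, not project results.

```python
from math import dist
from statistics import mean

def average_distance_to_center(points, centers, labels):
    """Mean Euclidean distance from each point to its own cluster center."""
    return mean(dist(p, centers[lab]) for p, lab in zip(points, labels))

def changes_between_solutions(avg_by_k):
    """For each k, how much the average distance differs from the (k-1) solution."""
    ks = sorted(avg_by_k)
    return {k: avg_by_k[prev] - avg_by_k[k] for prev, k in zip(ks, ks[1:])}

# Tiny worked case: every point is exactly 1.0 away from its center.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centers = [(0.0, 1.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
avg = average_distance_to_center(points, centers, labels)
print(avg)

# Invented average distances for the 4- to 8-cluster solutions (illustration only).
toy_avg = {4: 0.50, 5: 0.46, 6: 0.38, 7: 0.36, 8: 0.35}
changes = changes_between_solutions(toy_avg)
largest_change_at = max(changes, key=changes.get)
print(largest_change_at)
```

With these made-up figures the largest change between neighbouring solutions occurs at 6 clusters; with real data the averages would come from running the clustering once per candidate k.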
    16. 16. Test 2. Cluster size range
        Another important factor considered was the size of the clusters: the more similar the clusters are in number of members, the better. The measure used was the mean difference between the optimal cluster size (all clusters having an equal number of members) and the actual cluster sizes for a given solution; the lower this value, the better.
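One way to express the size test in code, assuming the measure is the mean absolute difference between each cluster's size and the equal-share size n / k. The exact formula is not given on the slide, so treat this as an interpretation; the label lists are invented.

```python
from collections import Counter

def mean_size_deviation(labels, k):
    """Mean absolute gap between actual cluster sizes and the ideal size n / k."""
    ideal = len(labels) / k
    counts = Counter(labels)  # missing clusters count as size 0
    return sum(abs(counts[c] - ideal) for c in range(k)) / k

# Hypothetical 3-cluster solutions over nine areas.
balanced = [0, 0, 0, 1, 1, 1, 2, 2, 2]
unbalanced = [0, 0, 0, 0, 0, 0, 0, 1, 2]

balanced_score = mean_size_deviation(balanced, 3)
unbalanced_score = mean_size_deviation(unbalanced, 3)
print(balanced_score, round(unbalanced_score, 2))
```

A perfectly even solution scores 0, and the score grows as one cluster absorbs most of the areas. To compare solutions with different k fairly, one might divide by the ideal size; the slides do not say whether that was done.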
    17. 17. Number of Clusters
        The 7-cluster solution was chosen. The 4-cluster solution performed well in both tests, but was outperformed by the 5- and 7-cluster solutions, which did better in the second test. The 6-cluster solution showed poor results in the second test, while the 8-cluster solution did not do well in the first test.
    18. 18. Group of Clusters
        These 7 clusters form the highest level of the classification hierarchy; they represent the Groups of Clusters. To build the second level, each resulting cluster was split into a number of smaller clusters using the same methods applied before. As a result, we obtained 18 distinct clusters for the American Clusters Classification.
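The two-level structure can be sketched as a function that re-clusters the members of each top-level group and labels the result "group.subcluster" (the deck later uses codes of this form, such as 5.1). Everything here is illustrative: `split_at_mean` is a trivial stand-in for the real clustering step, and the data and group labels are invented.

```python
from collections import defaultdict

def second_level_codes(data, top_labels, split):
    """Return a 'group.subcluster' code for every area.

    split: callable that takes the rows of one group and returns a 0-based
    sub-cluster id per row (in the project this role was played by k-means).
    """
    by_group = defaultdict(list)
    for index, group in enumerate(top_labels):
        by_group[group].append(index)
    codes = [None] * len(data)
    for group, indices in by_group.items():
        sub_labels = split([data[i] for i in indices])
        for i, sub in zip(indices, sub_labels):
            codes[i] = f"{group}.{sub + 1}"
    return codes

def split_at_mean(rows):
    """Stand-in splitter: two sub-clusters, at or below vs. above the group mean."""
    m = sum(r[0] for r in rows) / len(rows)
    return [0 if r[0] <= m else 1 for r in rows]

# Hypothetical one-variable data with two top-level groups.
data = [(0.1,), (0.2,), (0.8,), (0.9,), (0.4,), (0.6,)]
top_labels = [1, 1, 1, 1, 2, 2]
codes = second_level_codes(data, top_labels, split_at_mean)
print(codes)
```

Passing the splitter in as a parameter keeps the hierarchy logic separate from the clustering method, so the same routine used for the first level can be reused inside each group.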
    19. 19. Clusters Hierarchy
          • 1st level of classification hierarchy: Groups of Clusters
          • 2nd level of classification hierarchy: 18 clusters
    20. 20. Analyzing Clusters
        The 18 identified clusters are now ready to be analyzed. The mean values of every variable in each cluster were compared with the mean values for the whole dataset.
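The comparison of cluster means with dataset means can be sketched as a per-variable difference table. Variable names, cluster labels and values below are invented for illustration; the slides do not say whether differences, ratios or index scores were used.

```python
from statistics import mean

def cluster_profiles(rows, labels, names):
    """For each cluster, the cluster mean minus the overall mean, per variable."""
    overall = [mean(col) for col in zip(*rows)]
    profiles = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        cluster_means = [mean(col) for col in zip(*members)]
        profiles[cluster] = {
            name: cm - om for name, cm, om in zip(names, cluster_means, overall)
        }
    return profiles

# Hypothetical standardized values for four areas in two clusters.
names = ["self_employed", "renters"]
rows = [(0.2, 0.8), (0.4, 0.6), (0.8, 0.2), (0.6, 0.4)]
labels = ["A", "A", "B", "B"]

profiles = cluster_profiles(rows, labels, names)
for cluster, profile in profiles.items():
    print(cluster, {name: round(diff, 2) for name, diff in profile.items()})
```

Positive values mark variables on which a cluster sits above the overall average, which is the raw material for a written cluster description.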
    21. 21. Describing Clusters
        Based on the comparison, clusters were analyzed and described. Then clusters were mapped and named.
        Example: Cluster 5.1 - Upscale Couples
        “Upscale Couples” is a cluster of rich people with a large share of men and women between 45 and 65 who live in suburban areas within the vicinity of large US cities.
          • “A significant share of this group's members is self-employed.”
          • “Their incomes are two times higher than the US average.”
          • “Those who work prefer to use a personal vehicle to get to work, and the majority of them leave home between 8 and 10 am.”
    22. 22. Mapping Clusters Mapping clusters using free tools from Google
    23. 23. Final Clusters
    24. 24. Clusters Description
        Descriptions are available for the following clusters: Depressed Blocks; Satellite City; Young Families; Low Income Families; Small Town Communities; Established Suburbs; Upscale Couples; Settled Achievers; Prosperous B Boomers; Suburban Middles; Successful Families; Rural Despair; Country Life; Rural Communities; Farmers' Land; Unfortunate Countryside; Multicultural Communities; Retired Citizens; Hispanic Families.
    25. 25. References
        Callingham, M. (2005), From areal classification to geodemographics, paper presented at the Demographic User Group Conference, Royal Society, London, 10th November 2005.
        Debenham, J. E. (2002), Understanding Geodemographic Classification: Creating The Building Blocks For An Extension, Working Paper 02/1, School of Geography, University of Leeds [online] http://
        Harris, R., Sleight, P. and Webber, R. (2005), Geodemographics, GIS and Neighbourhood Targeting, London, Wiley.
        Longley, P. A. (2005), Geographical Information Systems: a renaissance of geodemographics for public service delivery, Progress in Human Geography, 29(1).
        Sleight, P. (2004), Targeting customers: How to Use Geodemographic and Lifestyle Data in Your Business, Henley-on-Thames, World Advertising Research Centre.
        Vickers, D. (2006), Multi-level Integrated Classification Based on the 2001 Census, The University of Leeds.
        Webber, R. and Farr, M. (2001), MOSAIC-From an area classification system to household classification, Journal of Targeting, Measurement and Analysis for Marketing, 10(1).
    26. 26. To find more please visit World