American Clusters Classification Methodology

American Clusters Geodemographic Classification
Methodology

Geodemographics
“Birds of a feather flock together” (English proverb).
Geodemographics is the analysis of people by where they
live (Sleight, 1996).
Geodemographic classification categorizes neighborhoods
based on their socio-economic and lifestyle characteristics.

Applications of Geodemographic Classifications

Commercial: Market Research, Site Selection, Trade Area Analysis, Direct Marketing, Advertising, Management, Media Analysis, Elections

Not-for-Profit: Public Sector, Health Care, Education, Local Authority, Policing, Academic, Poverty Prevention, Charities, Planning, Research

Commercial – People who live in an area where there is...
• A high probability of buying a particular type of newspaper.
• A high application rate for consolidation loans.
• A high consumption of fashion goods.

Not-for-profit – People who live in an area where there is...
• Low grades in secondary school exams.
• High risk of developing diabetes.
• High fear of crime, but low crime levels.

Source: adapted from:
• Bolton Council
• Harris et al. 2005
• OAC

Open Geodemographics
The term “Open Geodemographics” was first used and applied by Dr. Dan Vickers,
the author of “Multi-level Integrated Classification Based on the 2001
Census”. In this paper Dr. Vickers describes the methods and techniques used to
build the National Classification of Census Output Areas (UK).

Dr. Vickers has published the methodology and results of his work online for
review and free download.

To find out more about Open Geodemographics, please watch this
presentation.

Project Objectives

The objective is to create an open and free geodemographic classification of
the USA using Census 2000 results, following the methodology
developed by Dr. Dan Vickers.

Project Methodology
1. Selection of cluster objects (operational taxonomic units)
2. Variables selection
3. Variables standardization
4. Clustering method selection
5. Identification of cluster number
6. Interpretation, testing and mapping of clusters

The classification building process was largely based on the methodology elaborated by Daniel Vickers in “Multi-level Integrated Classification Based on the 2001 Census” (2006). Other methodologies were also considered, such as the methodology for building MOSAIC described in “Geodemographics, GIS and Neighbourhood Targeting” by R. Harris, P. Sleight and R. Webber (Wiley, 2005).

Classification Inputs

208,000 Census block groups of all 50 states
281,000,000 Overall population
106,000,000 Households

Data and Variables
Data: US Census 2000 is the major source of data for the classification.

Variables: Some researchers, such as Harris and Webber, suggest including
as many variables as possible to create “more meaningful clusters” (Harris et
al., 2005).

Others, such as Everitt and Vickers, argue that the minimum number of variables
should be used in order to prevent data redundancy and collinearity
(Vickers, 2006).

In our classification we try to use only “necessary” variables to avoid data
redundancy and collinearity.

Variables Selection
List of initial 77 variables
1. Male
2. Female
3. Urban
4. Rural
5. White
6. Black
7. Native
8. Asian
9. Hawaiian
10. Some other race
11. Two or more races
12. Aged 0-4
13. Aged 5-14
14. Aged 15-24
15. Aged 25-44
16. Aged 45-64
17. Aged 65+
18. Foreign born
19. Not a citizen
20. Spanish language households
21. Other European languages households
22. Asian language households
23. One person households (under 65)
24. One person households (over 65)
25. One parent households
26. Family households with no children
27. Non family households
28. Married
29. Divorced
30. Widowed
31. Never Married
32. Occupied by renters
33. Median Rent
34. Avg. house size
35. One bedroom
36. Four bedrooms
37. Gas
38. Bottled Gas Kerosene
39. Wood
40. No fuel
41. Median year house built
42. Median house value
43. Occupied more 1.5 person per room
44. Occupied less 0.5 person per room
45. Lacking complete plumbing facilities
46. Lacking complete kitchen facilities
47. Households without a mortgage
48. Second mortgage or home equity
49. Owner cost with mortgage
50. Owner cost without mortgage
51. Percent of monthly cost (with mortgage) exceeds 40
52. Percent of monthly cost (with mortgage) does not exceed 10
53. Percent of monthly cost (without mortgage) exceeds 25
54. High School Degree
55. Higher Education Attainment
56. No car households
57. 2+ car households
58. Work at home
59. Car to work
60. Carpool to work
61. Public Transport to Work
62. Bicycle or Walk to Work
63. Long commuters (60+ min to work)
64. Short Commuters (less 15 min to work)
65. To work from 12 to 5 am
66. To work from 8 to 10 am
67. To work from 4 pm to 11:59 pm
68. Standardised Disability Ratio
69. Unemployed
70. Working part-time
71. People living below poverty line
72. Retirement Income
73. Public Assistance Income
74. Supplemental Security Income
75. Social Security Income for HH
76. Interest, Dividends or Net rental income for HH
77. Median HH income

Some of them had to be removed.

Variables Selection
The initial list of variables was reviewed and reduced by applying three analytical
tools:

• Principal Component Analysis (PCA)/Factor analysis: variables with high factor
loadings were selected.

• Correlation Matrix: pairs of variables strongly correlated with each other were
examined, and only one variable from each pair was kept for analysis.

• Standard Deviation Evaluation: variables with a low SD were not included
because they vary little between block groups and do not bring value to the
clustering process.
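
The correlation and standard deviation screens can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the project itself worked in SPSS, the thresholds `max_corr` and `min_sd`, the function names and the toy data are all hypothetical, and the PCA step is left out.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def screen_variables(columns, max_corr=0.8, min_sd=0.05):
    """Return the names of variables that survive both screens.

    columns: dict mapping variable name -> list of values, one per block group.
    max_corr and min_sd are assumed thresholds, not values from the project.
    """
    # Screen 1: drop variables that barely vary between block groups.
    kept = [name for name, values in columns.items() if pstdev(values) >= min_sd]
    # Screen 2: within each strongly correlated pair, keep only the first one seen.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[other])) <= max_corr
               for other in selected):
            selected.append(name)
    return selected

# Hypothetical shares for four block groups.
toy = {
    "renters": [0.10, 0.40, 0.70, 0.90],
    "owners": [0.90, 0.60, 0.30, 0.10],    # mirror image of "renters"
    "no_fuel": [0.01, 0.01, 0.02, 0.01],   # almost constant
    "aged_65_plus": [0.30, 0.10, 0.25, 0.05],
}
selected = screen_variables(toy)
print(selected)  # ['renters', 'aged_65_plus']
```

Which variable of a correlated pair is kept depends here only on input order; in practice that choice would be made by judgment about which variable is more meaningful.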

Variables Selection
The list of variables was reduced from 77 to 47 variables
1. Urban population
2. Black
3. Asian
4. Aged under 5
5. Aged 5-14
6. Aged 15-24
7. Aged 25-44
8. Aged 45-64
9. Aged 65+
10. One person households (under 65)
11. One person households (over 65)
12. One parent households
13. Households with no children
14. Foreign born population
15. Spanish language households
16. Other European languages households
17. Never married
18. Occupied by renters
19. Median rent
20. One bedroom
21. Four bedrooms
22. Bottled gas, oil, kerosene as fuel
23. Median house value
24. Occupied 1.5 person per room
25. Households without a mortgage
26. Second mortgage or home equity
27. Monthly cost (with mortgage) less than 10%
28. Monthly cost (with mortgage) more than 40%
29. Higher education
30. Two-car households
31. Car to work
32. Public transport to work
33. Bicycle or walk to work
34. Long commuters (60+ min to work)
35. To work from 8 to 10 am
36. Standardized Disability Ratio
37. Unemployed
38. Working part-time
39. People living below poverty line
40. Retirement income
41. Public assistance income
42. Social security income
43. Interest, dividends or net rental income
44. Median household income
45. Government workers
46. Self-employed
47. Agriculture, forestry, fishing and hunting and mining

Variable Transformation
Variable transformation and standardization allowed for the inclusion of
different types of data (e.g. percent of black population, median number of
rooms, median household income).

Log transformation

Transforming the data to a logarithmic scale greatly reduced the problem of
very high outliers, because differences between values at the extremes of
the data set shrink more than differences between more typical values
(Methods for Area Classification for Output Areas, National Statistics, UK).
http://www.statistics.gov.uk/about/methodology_by_theme/area_classification/oa/methodology.asp
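
The effect of the log transformation on outliers can be shown with a small Python sketch. The house values are hypothetical, and the `offset` of 1 (used so that zero values stay defined) is an assumption; the slides do not say how zeros were handled.

```python
import math

def log_transform(values, offset=1.0):
    """Natural-log transform; the offset keeps zero values defined (an assumption)."""
    return [math.log(v + offset) for v in values]

def relative_gap(values):
    """Gap between the two largest values, relative to the spread of the rest."""
    ordered = sorted(values)
    outlier_gap = ordered[-1] - ordered[-2]
    typical_spread = ordered[-2] - ordered[0]
    return outlier_gap / typical_spread

# Hypothetical median house values with one extreme outlier.
house_values = [80_000, 120_000, 150_000, 2_000_000]
logged = log_transform(house_values)

print(relative_gap(house_values))  # the outlier sits about 26 spreads away
print(relative_gap(logged))        # after the transform, only about 4
```

The ordering of the block groups is unchanged; only the distance between the extreme value and the rest is compressed.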

Variables Transformation
Z-score standardization

Z-score standardization is the most popular method of data standardization,
but in our project it showed poor results because highly skewed variables
were given too much weight. This could lead to poorly formed clusters
in the clustering process.

Range standardization
To reduce the effect of highly skewed variables, range standardization
was chosen. This method rescales each variable to the range between
0 (minimum value) and 1 (maximum value).
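
Range standardization maps each value x to (x - min) / (max - min). A minimal Python sketch, with hypothetical rent values and an assumed convention for constant variables:

```python
def range_standardize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has no range; mapping it to 0 is an assumption.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

median_rent = [300, 450, 600, 1500]  # hypothetical values
scaled = range_standardize(median_rent)
print(scaled)  # [0.0, 0.125, 0.25, 1.0]
```

Every variable ends up on the same 0 to 1 scale, so no single variable dominates the distance calculation simply because of its units.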

Clustering Process
K-means

The k-means method (performed in SPSS) was chosen to create the
classification because it works well on large datasets (Vickers, 2006).
The major issue with k-means is that the number of clusters has to be
specified beforehand.
To determine the number of clusters, two tests were performed:
Test 1: Evaluation of the average distance to the cluster center
Test 2: Cluster size range assessment
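
For readers without SPSS, the idea behind k-means can be sketched in plain Python. This is a simplified version of Lloyd's algorithm, not the SPSS procedure used in the project: the initial centers are simply the first k points (an assumption made for reproducibility), and the toy points are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm. Initial centers are the first k points,
    a simplification; statistical packages choose them differently."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers

# Two obvious groups of range-standardized toy points.
points = [(0.0, 0.1), (0.9, 1.0), (0.1, 0.0), (1.0, 0.9), (0.0, 0.0), (1.0, 1.0)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Note that `k` must be passed in, which is exactly the limitation the two tests are meant to address.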

Test 1. Evaluation of average distance to cluster center

It was suggested that the most useful number of clusters for the classification would be around 6 (Vickers, 2006), so the target number of k-means clusters was sought between 4 and 8. We needed to find the solution with the most significant increase in average distance from the cluster center. The most significant increase between two consecutive solutions was found at the 6-cluster solution.
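
The quantity compared in this test can be computed as below. This is a sketch under assumptions: Euclidean distance is assumed, the function names are hypothetical, and the toy points are not project data.

```python
import math

def average_distance(points, labels, centers):
    """Mean Euclidean distance from each point to the center of its cluster."""
    total = sum(math.dist(p, centers[lab]) for p, lab in zip(points, labels))
    return total / len(points)

def consecutive_changes(avg_by_k):
    """Change in average distance between each pair of consecutive solutions.

    avg_by_k: dict mapping number of clusters -> average distance.
    Returns a dict keyed by the larger k of each pair.
    """
    ks = sorted(avg_by_k)
    return {b: avg_by_k[b] - avg_by_k[a] for a, b in zip(ks, ks[1:])}

# Toy example: four points, two clusters centered at (0, 0) and (10, 0).
points = [(0.0, 1.0), (0.0, -1.0), (10.0, 2.0), (10.0, -2.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.0), (10.0, 0.0)]
avg = average_distance(points, labels, centers)
print(avg)  # 1.5
```

Running the clustering for each k from 4 to 8 and passing the resulting averages to `consecutive_changes` gives the step-to-step differences that the test compares.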

Test 2. Cluster size range

Another important factor was the size of the clusters: the more homogeneous the clusters are in terms of number of members, the better.
The measure used was the mean range between the optimal cluster size (all clusters equal in number of members) and the actual cluster sizes for a given solution; the lower the range, the better.
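
One plausible reading of this measure is the mean absolute gap between each cluster's size and the equal-size ideal n/k. The sketch below implements that reading; the exact formula used in the project is not given in the slides, so treat the function as an assumption.

```python
from collections import Counter

def mean_size_deviation(labels):
    """Mean absolute gap between actual cluster sizes and the equal-size ideal n/k.

    k is taken as the number of distinct labels, so empty clusters are ignored.
    """
    sizes = Counter(labels)
    ideal = len(labels) / len(sizes)
    return sum(abs(size - ideal) for size in sizes.values()) / len(sizes)

balanced = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # three clusters of three members
skewed = [0, 0, 0, 0, 0, 0, 0, 1, 2]    # one cluster holds almost everything
print(mean_size_deviation(balanced))  # 0.0
print(mean_size_deviation(skewed))    # larger value: sizes are uneven
```

A solution with a lower value has more evenly sized clusters, which is what this test rewards.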

Number of Clusters

The 7-cluster solution was chosen.
The 4-cluster solution performed well in both tests, but was
outperformed by the 5- and 7-cluster solutions, which performed better
in the second test. The 6-cluster solution showed poor results in
the second test, while the 8-cluster solution did not perform well in
the first test.

Group of Clusters
These 7 clusters form the highest level of the classification
hierarchy. They represent the Group of Clusters.
To build the second level, each resulting cluster was
split into a number of smaller clusters using the
same methods applied before.
As a result, we obtained 18 distinct clusters for
the American Clusters Classification.

Clusters Hierarchy

1st level of classification hierarchy (Group of Clusters)
2nd level of classification hierarchy: 18 clusters

Analyzing Clusters
The 18 identified clusters are now ready to be analyzed.
All mean values of clusters were compared with the dataset mean values.
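
A comparison of cluster means with the dataset mean might look like the sketch below. Expressing the comparison as a ratio is an assumption (a difference would work equally well), and the income figures are hypothetical.

```python
from statistics import mean

def cluster_profile(values, labels):
    """Ratio of each cluster's mean to the overall dataset mean for one variable."""
    overall = mean(values)
    return {
        c: mean(v for v, lab in zip(values, labels) if lab == c) / overall
        for c in sorted(set(labels))
    }

income = [20_000, 30_000, 90_000, 110_000]  # hypothetical block group values
labels = [0, 0, 1, 1]
profile = cluster_profile(income, labels)
print(profile)  # cluster 0 is at 0.4 of the dataset mean, cluster 1 at 1.6
```

A ratio well above or below 1 marks a variable that distinguishes the cluster and is worth mentioning in its description.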

Describing Clusters
Based on the comparison, clusters were analyzed and described.
Then clusters were mapped and named.

Cluster 5.1 - Upscale Couples
“Upscale Couples” is a cluster of rich people with a large
share of men and women between 45 and 65 who live in
suburban areas within the vicinity of large US cities.
“Significant share of these group members is self-employed”
“Their incomes are two times higher than the US average”
“Those who work prefer to use personal vehicle to get to work,
and majority of them leave home between 8-10am.”

Image Sources:
http://www.flickr.com/photos/loungerie/3029049309/
http://www.flickr.com/photos/whartz/1066907283/

Mapping Clusters
Clusters were mapped using free tools from Google.
Clusters Description
The 18 clusters are:
• Depressed Blocks
• Satellite City Young Families
• Low Income Families
• Small Town Communities
• Established Suburbs
• Upscale Couples
• Settled Achievers
• Prosperous B Boomers
• Suburban Middles
• Successful Families
• Rural Despair
• Country Life
• Rural Communities
• Farmers' Land
• Unfortunate Countryside
• Multicultural Communities
• Retired Citizens
• Hispanic Families

References
Callingham, M. (2005), From areal classification to geodemographics, paper presented at the Demographic User Group Conference, Royal Society, London, 10th November 2005.
Debenham, J. E. (2002), Understanding Geodemographic Classification: Creating The Building Blocks For An Extension, Working Paper 02/1, School of Geography, University of Leeds [online] http://www.geog.leeds.ac.uk/wpapers/02-1.pdf
Harris, R., Sleight, P. and Webber, R. (2005), Geodemographics, GIS and Neighbourhood Targeting, London, Wiley.
Longley, P. A. (2005), Geographical Information Systems: a renaissance of geodemographics for public service delivery, Progress in Human Geography, 29(1).
Sleight, P. (2004), Targeting customers: How to Use Geodemographic and Lifestyle Data in Your Business, Henley-on-Thames, World Advertising Research Centre.
Vickers, D. (2006), Multi-level Integrated Classification Based on the 2001 Census, The University of Leeds.
Webber, R. and Farr, M. (2001), MOSAIC: From an area classification system to household classification, Journal of Targeting, Measurement and Analysis for Marketing, 10(1).

To find out more, please visit
World Clusters.org
