A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Greater London
1. A geodemographic analysis of the ethnicity and
identity of Twitter users in Greater London
Muhammad Adnan, Guy Lansley, Paul Longley
Department of Geography, University College London
Web: http://www.uncertaintyofidentity.com
2. Introduction
• Use of social media services has increased
• But how representative are social media data sets of the
Census or Electoral Roll data?
• This paper provides an ethnicity, age, and gender analysis
of Twitter users
• A comparison is provided with the 2011 Census data
• Could have potential applications in cyber marketing and
cyber security
3. Twitter (www.twitter.com)
• Online social-networking and microblogging service
• Launched in 2006
• Users can send messages of 140 characters or less
• Approximately 200 million active users
• 350 million tweets daily
• In 2012, UK and London were ranked 4th and 3rd,
respectively, in terms of the number of posted tweets
4. Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
Users can download a 1% sample of the live tweets through the API
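The fields listed on this slide can be pictured as one record per tweet. The sketch below is illustrative only: the class name `TweetRecord`, its attribute names, and the helper `geotagged` are hypothetical and do not reproduce the Twitter API's own schema. It shows how tweets carrying coordinates might be selected before any spatial analysis.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class TweetRecord:
    """Hypothetical container for a subset of the fields listed on the slide."""
    user_id: int
    name: str
    screen_name: str
    language: str
    geo_enabled: bool
    created_at: datetime
    text: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


def geotagged(tweets: Iterable[TweetRecord]) -> List[TweetRecord]:
    """Keep only tweets from geo-enabled accounts that carry both coordinates."""
    return [
        t for t in tweets
        if t.geo_enabled and t.latitude is not None and t.longitude is not None
    ]
```

Only records that pass this filter can be assigned to an output area or land use category.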
8. Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names                   Fake Names
Kevin Hodge                  Castor 5.
Andre Alves                  WHAT IS LOVE?
Jose de Franco               MysticMind
Carolina Thomas, Dr.         KIRILL_aka_KID
Prof. Martha Del Val         Vanessa
Fabíola Sanchez Fernandes    Petuna
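A rough way to separate usable names from the rest is a rule-based filter. The sketch below is an assumption made for illustration, not the authors' procedure: the `TITLES` set and the function `split_name` are hypothetical. It strips honorifics and accepts a string only if at least two purely alphabetic tokens remain.

```python
import re
from typing import Optional, Tuple

# Assumed list of honorifics to strip; not taken from the slides.
TITLES = {"dr", "prof", "mr", "mrs", "ms", "miss"}


def split_name(raw: str) -> Optional[Tuple[str, str]]:
    """Return (forename, surname) if raw looks like a personal name, else None."""
    # Turn commas and full stops into spaces so "Thomas, Dr." tokenises cleanly.
    tokens = re.sub(r"[,.]", " ", raw).split()
    tokens = [t for t in tokens if t.lower() not in TITLES]
    # A usable name needs a forename and at least one surname token.
    if len(tokens) < 2:
        return None
    # Reject tokens containing digits, underscores, or punctuation.
    if not all(t.isalpha() for t in tokens):
        return None
    # Reject strings written entirely in capitals, which are often slogans.
    if all(t.isupper() for t in tokens):
        return None
    return tokens[0], " ".join(tokens[1:])
```

A filter this simple will still let through some invented names made of ordinary words, so its output is a candidate list rather than a verified set of real names.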
9. Classifying Twitter data by ethnic origin
• Applied Onomap (www.onomap.org) to FORENAME +
SURNAME pairs
Kevin Hodge (ENGLISH)
Andre de Franco (ITALIAN)
…
…
…
…
11. Segregation in different ethnic groups of Twitter
Users
• We used the Information Theory Index (Theil’s H) to compare
segregation between different Twitter ethnic groups
H = Σi [ ti (E - Ei) ] / (E T)
Where (for each Twitter ethnic group)
E = Greater London’s Entropy
Ei = Entropy of each output area in Greater London
T = Population of London
ti = Population of each output area in Greater London
• 0 = No Segregation; 1 = Maximum Segregation
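The index can be computed directly from per-area counts. The sketch below is a minimal illustration, assuming the conventional form H = sum over areas of ti (E - Ei) / (E T) and a two-group entropy (the group in question versus everyone else). The slides do not state which entropy definition was used, and the function names `entropy` and `theil_h` are hypothetical.

```python
import math
from typing import Sequence


def entropy(p: float) -> float:
    """Two-group entropy for a group share p (the group versus everyone else)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log(1.0 / p) + (1.0 - p) * math.log(1.0 / (1.0 - p))


def theil_h(group: Sequence[float], total: Sequence[float]) -> float:
    """Information Theory Index: sum_i t_i * (E - E_i) / (E * T).

    group[i] is the group's count in area i; total[i] is the area's population.
    """
    T = sum(total)
    E = entropy(sum(group) / T)
    if E == 0.0:
        # Group absent or universal: the index is undefined, reported as 0 here.
        return 0.0
    weighted = sum(
        t * (E - entropy(g / t)) for g, t in zip(group, total) if t > 0
    )
    return weighted / (E * T)
```

With the group spread evenly across areas the function returns 0, and with the group confined to areas it fills completely it returns 1, matching the scale given on the slide.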
12. Segregation in different ethnic groups of Twitter
Users
0 = No Segregation; 1 = Maximum Segregation
Ethnic Groups      Domestic buildings and gardens   Week Days   Week Nights   Weekend
British            0.483                            0.211       0.401         0.315
Irish              0.67                             0.357       0.571         0.475
White Other        0.63                             0.303       0.51          0.42
Pakistani          0.765                            0.488       0.679         0.633
Indian             0.748                            0.451       0.673         0.59
Bangladeshi        0.864                            0.671       0.834         0.784
Black Caribbean    0.831                            0.548       0.808         0.666
Black African      0.764                            0.492       0.704         0.64
Chinese            0.712                            0.403       0.608         0.524
Other              0.71                             0.374       0.593         0.497
13. Comparison of Ethnic Groups between ‘2011
Census’ and ‘Twitter’
• Onomap groups were aggregated to match the appropriate
groups from the Census
London        Total   White British   White Other   Indian   Pakistani   Bangladeshi   Black African   Chinese
Week Night    53611   71.35%          12.12%        2.63%    2.63%       1.82%         1.52%           1.74%
Week Day      80676   73.12%          11.80%        2.41%    2.41%       1.56%         1.25%           1.61%
Weekend       67351   72.86%          12.17%        2.61%    2.61%       1.67%         1.39%           1.73%
2011 Census   -       44.89%          12.65%        6.64%    2.74%       2.72%         7.02%           1.52%
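One simple way to read this comparison is as a percentage-point gap per group. The sketch below uses the Week Day and 2011 Census rows transcribed from the slide; the helper `share_gaps` is hypothetical and is offered as one possible summary, not as the authors' method.

```python
GROUPS = ["White British", "White Other", "Indian", "Pakistani",
          "Bangladeshi", "Black African", "Chinese"]

# Percentages transcribed from the slide's table.
CENSUS_2011 = [44.89, 12.65, 6.64, 2.74, 2.72, 7.02, 1.52]
TWITTER_WEEK_DAY = [73.12, 11.80, 2.41, 2.41, 1.56, 1.25, 1.61]


def share_gaps(twitter, census):
    """Percentage-point gap (Twitter minus Census) for each group."""
    return {g: round(t - c, 2) for g, t, c in zip(GROUPS, twitter, census)}


gaps = share_gaps(TWITTER_WEEK_DAY, CENSUS_2011)
for group, gap in gaps.items():
    print(f"{group:15s} {gap:+.2f}")
```

A positive gap means the group is over-represented among the classified Twitter users relative to the Census, and a negative gap means it is under-represented.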
14. Comparison of the distribution of ethnicity with the
2011 Census
[Figure: White British (quintiles). Panels: 2011 Census, Twitter]
16. Gender Analysis of Twitter Users
[Figure: percentage (0% to 60%) by gender category: Male, Female, Unisex, Not Found. Series: Number of Tweets, Number of Unique Users]
17. Monica: Age estimation from given names
• Original data provided by CACI, consisting of a total of
12,000 names from a sample of almost 7 million
individuals
• However, this sample did not account for people under the
age of 18
• Birth certificate data from 1994 to 2011 was used to
supplement the dataset (total of 9.7 million names)
• Data was then standardised by the age structure from the
2011 Census
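The slides do not say how the standardisation was carried out. One plausible reading, sketched below purely as an assumption, is to reweight each age band so that the sample matches the Census age structure and then normalise the counts for a given name. The function `age_profile` and the age bands are hypothetical, and the numbers are invented toy values.

```python
from typing import Dict


def age_profile(name_counts: Dict[str, int],
                sample_totals: Dict[str, int],
                census_totals: Dict[str, int]) -> Dict[str, float]:
    """Estimate P(age band | name) after reweighting the sample to the census.

    Assumes every band in census_totals has a non-zero sample total.
    """
    weighted = {
        band: name_counts.get(band, 0) * census_totals[band] / sample_totals[band]
        for band in census_totals
    }
    total = sum(weighted.values())
    return {band: (w / total if total else 0.0) for band, w in weighted.items()}


# Toy numbers, invented purely to exercise the function.
profile = age_profile(
    name_counts={"0-17": 5, "18-64": 30, "65+": 65},
    sample_totals={"0-17": 1000, "18-64": 1000, "65+": 1000},
    census_totals={"0-17": 2000, "18-64": 6000, "65+": 2000},
)
```

In this toy case the raw counts peak in the oldest band, but after reweighting the most likely band is 18-64, which shows why the age structure of the reference population matters.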
18. Monica: Age estimation from given names
[Figure: percentage (0% to 45%) by age group for the names PAUL, BETTY, GUY, MUHAMMAD]
20. Generalised Land Use Database
GLUD category            Tweets (%)   Tweets per km2
Open Water               1.11         402.71
Domestic Buildings       12.93        1748.52
Non-Domestic Buildings   14.14        3468.55
Road                     29.36        2681.84
Path                     0.84         1204.20
Rail                     2.17         1962.57
Green Space              10.91        303.62
Domestic Gardens         17.69        867.89
Other                    10.86        1637.06
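The two columns are linked: a category's tweet density is its tweet count divided by its area, so dividing its tweet share by its density gives a quantity proportional to its area. The sketch below uses the figures transcribed from the slide to recover the implied relative area of each category. The helper `implied_area_shares` is hypothetical, and the result assumes every located tweet falls in exactly one category.

```python
# Values transcribed from the slide: (tweets %, tweets per km2).
GLUD = {
    "Open Water": (1.11, 402.71),
    "Domestic Buildings": (12.93, 1748.52),
    "Non-Domestic Buildings": (14.14, 3468.55),
    "Road": (29.36, 2681.84),
    "Path": (0.84, 1204.20),
    "Rail": (2.17, 1962.57),
    "Green Space": (10.91, 303.62),
    "Domestic Gardens": (17.69, 867.89),
    "Other": (10.86, 1637.06),
}


def implied_area_shares(table):
    """Relative area per category, from tweet share divided by tweet density."""
    raw = {name: pct / density for name, (pct, density) in table.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


areas = implied_area_shares(GLUD)
```

Read this way, roads attract the largest share of tweets while Non-Domestic Buildings have the highest density, because they occupy a much smaller implied area.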
21. Hourly Twitter Activity by Land Use
[Figure: percentage of tweets (0% to 40%) by time of day. Series: Non-Domestic Buildings, Transport, Residential]
22. Conclusion
• An insight into the ethnic, gender, and age distribution of
Twitter users
• A first attempt to compare a social media data set with the
census of population
• Future work will investigate the micro-level activity
patterns of Twitter users at different times of the day
• We also envisage extending this analysis to other social
media services, e.g. FourSquare and Facebook