Use of social media continues to increase day by day, with implications for the creation of ‘big’ data – Twitter alone was forecast to have created 1.8 zettabytes of data in 2011. This talk presents an initial work towards the creation of geo-temporal geodemgoraphic classifications by using the Twitter social media data. London was chosen as the study area because of its high incidence of users and the consequent expectation that higher penetration might be associated with lower demographic bias.
Spatio-temporal demographic classification of the Twitter users
1. Spatio-temporal demographic classification of the
Twitter users
Paul Longley, Muhammad Adnan, Guy Lansley
Department of Geography, University College London
Web: http://www.uncertaintyofidentity.com
2. Outline
1. Introduction
• Geodemographics
• Social Media Geodemographics
2. Twitter
3. A geo-temporal demographic classification of Twitter users
• Residence of Twitter users
• Ethnic classification of Twitter users
• Age classification of Twitter users
• Computing the demographic classification
3. Introduction
• Geodemographics
• Analysis of people by where they live” [1]
• Night time characteristics of the population
• Social Media Geodemographics
• Moving beyond the night time geography
• Who: Ethnicity, Gender, and Age of social media users
• When: What time of day conversations happen
• Where: Where social media conversations happen
[1] Sleight, P. (2004). Targetting Customers-How to Use Geodemographic and Lifestyle Data in Your Business.
4. Twitter (www.twitter.com)
• Online social-networking and micro blogging service
• Launched in 2006
• Users can send messages of 140 characters or less
• Approximately 200 million active users [2]
• 350 million tweets daily
• In 2013, UK and London were ranked 4th and 3rd, respectively, in
terms of the number of posted tweets [3]
[2] Twitter. 2012. What is Twitter ?. Retrieved 31st December, 2012, from https://business.twitter.com/basics/what-is-twitter/.
[3] Bennet, S. 2013. Revealed: The Top 20 Countries and Cities of Twitter [STATS]. Retrieved 31st December, 2013, from
http://www.mediabistro.com/alltwitter/twitter-top-countries_b26726.
5. Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
6. Twitter data for the case study
• Approx. 8 million geo-tagged tweets (Jan – Dec, 2013)
• Sent by 385,050 unique users
• 155,249 users sent 5 or more tweets (7.6 million tweets)
7. Variables for creating a geo-temporal classification
1. Residence
• Where twitter users live
1. Ethnicity
• Probable ethnic origins of Twitter users
1. Age
• Probable Age of Twitter users
1. Land Use Category of a Tweet message
• Residential; Non-domestic building; Park etc.
2. Temporal Scales
• Day, Afternoon, Night, Peak travel hours
8. Residence of Twitter Users
• 170m X 170m grid was used to find the probable residence of users
• Probable residence was found for the 75,522 users
9. Extracting demographic attributes of Twitter users by
using their forenames and surnames
A name is a statement of the bearer’s cultural, ethnic, and linguistic
identity [4]
[4] Mateos P, Longley P A, O’Sullivan D 2011. Ethnicity and population structure in personal naming networks. PloS ONE (Public Library of Science) 6 (9)
e22943.
10. Analysing Names on Twitter
• Some examples of NAME variations on Twitter
• Approx. 68% of the accounts have real names
Fake Names
Castor 5.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Justin Bieber Home
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes
11. Onomap: Names to Ethnicity classification
• Onomap was created by clustering names of 1 billion individuals
around the world
• Applied ONOMAP (www.onomap.org) on forename – surname pairs
Kevin Hodge (English)
Pablo Mateos (Spanish)
…
…
…
…
12. Age estimation from ‘forenames’
• Monica dataset provided by CACI Ltd, UK
• Supplemented with UK birth certificate records
[5] Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)
13. Age distribution of Twitter users
Twitter Users vs. 2011 Census (Greater London)
[5] Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)
15. Variables for creating a geo-temporal classification
1. Residence
4. Age
V1: Tweet made near probable London residence
V14: <=20
V2: Tweeter lives ‘outside the UK’
V15: 21 - 30
V3: Tweeter lives in the rest of the UK outside London
V16: 31 - 40
V17: 41 - 50
2. Total Number of Tweets
V18: 50+
V4: Total number of tweets made by the user
3. Ethnicity
5. Tweets outside the UK
V19: In West Europe (not including UK)
V5: West European
V20: In East Europe
V6: East European
V21: In North America
V7: Greek or Turkish
V22: In Central or South American
V8: South East Asian
V23: In Australasia
V9: Other Asian
V24: In Africa
V10: African & Caribbean
V25: In Middle East
V11: Jewish
V26: In Asia
V12: Chinese
V27: In Paris
V13: Other minority
16. Variables for creating a geo-temporal classification
6. Number of countries visited
9. Temporal Scales
V28: Number of countries tweeter has visited
V42: Morning Peak Hours
7. London Land Use Category
V43: Week Day
V44: Afternoon
V29: Residential location
V45: Week Night
V30: Non-domestic buildings
V46: Weekend
V31: Transport links and locations
V32: Green-spaces
V33: All other land uses
8. 2011 London Output Area Classification
V34: Intermediate Lifestyles
V35: High Density and High Rise Flats
V36: Settled Asians
V37: Urban Elites
V38: City Vibe
V39: London Life-Cycle
V40: Multi-Ethnic Suburbs
V41: Ageing-City Fringe
17. Computing the geo-temporal classifications
• Segmentations were created by using K-means clustering algorithm
• K-means tries to find cluster centroids by minimising
• Seven clusters
n
n
V zx y
åå -
= =
=
x
y
1 1
2 ( m )
• Group A: London Residents
• Group B: Commuting Professionals
• Group C: Student Lifestyle
• Group D: The Daily Grind
• Group E: Spectators
• Group F: Visitors
• Group G: Workplace and tourist activity
18. Group A: London Residents
• Tweets made near primary residential locations
• Tweets made on weeknights or weekends
19. Group B: Commuting Professionals
• Tweets made from
• Transport locations
• ‘Urban Elites’ LOAC classification
• Tweets made by individuals of intermediate age (21-30)
20. Group F: Visitors
• Tweeters live outside London
• Tweets originated from residential land uses
• Mixed age groups
21. Group G: Workplace and tourist activity
• Tweets sent from non-domestic buildings
• Full range of Twitter age cohorts
• Tweets originate from a mix of residents and international visitors
22. Conclusion
• Geo-temporal demographic classifications
• Census (night time geography)
• Social media data (day and travel time geography)
• Issues of representation
• An insight into the residential and travel geographies of individuals
• An insight into the spatial activity patterns of different kind of
social media users