1. Data Mining to Understand International
Dimensions to Online Identity
- a classification of 2+ billion names and
their linkage to virtual identities and
social network traffic.
• Alistair Leak
• UCL SECReT
• a.leak.11@ucl.ac.uk
2. Who am I?
Education:
Kingston University (BSc) - GIS
UCL (M.Res) - Advanced Spatial Analysis and Visualisation
UCL 3+1 - PhD Security and Crime Science
Supervisors:
1st Supervisor: Professor Paul Longley
2nd Supervisor: Dr James Cheshire
3. Definitions:
• Netnography
– “A qualitative, interpretive research methodology that uses
internet-optimized ethnographic research techniques to study the
social context in online communities” (Kozinets, 2009)
• Cybergeodemographics
– “The analysis of people by where they live and by whom they
interact with, in real and virtual space” (Longley, 2012)
4. Uncertainty of Identity: Work Package 4:
Cybergeodemographics
• Use of primary and secondary data to relate virtual Internet traffic to the
probable physical locations from which it emanated; and the development
of typologies of social networks that are robust, generalized and related to
physical locations.
[Diagram: Secondary Data; Data Collection Tools (WP1); Text Analytics (WP2); Cybergeodemographics (WP4)]
5. Working Title:
• “Data Mining to Understand International Dimensions to
Online Identity - a classification of 2+ billion names and
their linkage to virtual identities and social network traffic”
Objectives:
• Develop spatial context of name network classification
• Develop typologies of social networks
• Measure how representative social media is of the
underlying population.
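One possible way to express the third objective in code is an index that compares each group's share of the social media sample with its share of the underlying population. This is a minimal sketch under stated assumptions, not the method used in the study: the function name `representation_index` and the group labels in the usage note are hypothetical.

```python
from typing import Dict


def representation_index(
    sample_counts: Dict[str, int],
    population_counts: Dict[str, int],
) -> Dict[str, float]:
    """Ratio of sample share to population share for each population group.

    A value of 1.0 means the group is represented in proportion to the
    population; below 1.0 is under-represented, above 1.0 over-represented.
    """
    sample_total = sum(sample_counts.values())
    population_total = sum(population_counts.values())
    if sample_total == 0 or population_total == 0:
        raise ValueError("both inputs need at least one observation")

    index = {}
    for group, population in population_counts.items():
        if population == 0:
            # The ratio is undefined when the group is absent from the population.
            continue
        sample_share = sample_counts.get(group, 0) / sample_total
        population_share = population / population_total
        index[group] = sample_share / population_share
    return index
```

For example, `representation_index({"A": 30, "B": 70}, {"A": 50, "B": 50})` gives roughly 0.6 for the hypothetical group "A" and 1.4 for "B", meaning "A" is under-represented in the sample.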
6. Work Plan
• M.Res (Present – 2013)
– Foundation work
• Assess representative capability of tweet data
– Skills Development
• Spatio-Temporal Data Mining
• Database Management
• Ph.D (2013 – 2016)
– Objectives
• Develop spatial component of names networks
• Develop typologies of social networks
• Develop a measure of uncertainty
– Completion in August 2016
8. Case Study: Tweets in London
• 1.4 million tweets over 3 months (Sep - Dec 2012)
9. What’s in a Tweet?
• Fields: First Name, Surname, Unique ID, # Themes, Location, Popularity, Interactions, Time/Date
• Possibilities:
– Political Affiliation
– Gender
– Age
– Location
10. Data Classification
• Gender
– Database of 62,000 names + genders
– Determined by Forename
• Demographic
– OAC – Output Area Classification
• ONOMAP
– Ethnicity, Religion, Geographical Origin.
– Determined by Forename Surname combination
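The forename-based gender step above can be sketched as a simple table lookup. This is an illustration only: the three-entry `NAME_GENDER` table is a hypothetical stand-in for the 62,000-name database, and the helper names `forename_of` and `classify_gender` are assumptions rather than part of the study's tooling.

```python
from typing import Optional

# Hypothetical miniature lookup table standing in for the names database.
NAME_GENDER = {
    "alistair": "male",
    "paul": "male",
    "sarah": "female",
}


def forename_of(display_name: str) -> Optional[str]:
    """Return the lower-cased first token of a display name, or None if empty."""
    tokens = display_name.strip().split()
    return tokens[0].lower() if tokens else None


def classify_gender(display_name: str) -> str:
    """Look up the forename; anything not in the table is labelled 'unknown'."""
    forename = forename_of(display_name)
    if forename is None:
        return "unknown"
    return NAME_GENDER.get(forename, "unknown")
```

Returning "unknown" rather than guessing keeps unmatched names (nicknames, organisations, automated accounts) out of the gender counts.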
15. Challenges of Study
• Signal from Noise
– Tweets are not all sent from individuals' homes
• Day and night demographics
– Not all geolocated tweets come from real people
• Data Quality/Sample Size
– Twitter users are self-selecting
• Only a small proportion have enabled location services
• Dataset currently has 92,000 unique users
16. Target Areas of Study
• Spatio-temporal differentiation of tweets
– Night
– Day
– Travel
• Expansion of the Methodology for World Names
– Initially into Europe.
• Application of new name datasets.
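The night, day and travel split listed above could be sketched as a rule on the hour at which a tweet was sent. The hour boundaries below are hypothetical values chosen only for illustration, and the function name `time_period` is an assumption; the study may define these periods differently.

```python
from datetime import datetime

# Hypothetical hour boundaries (local time), chosen only for illustration.
TRAVEL_HOURS = {7, 8, 17, 18}
DAY_HOURS = set(range(9, 17))  # 09:00 up to 16:59


def time_period(timestamp: datetime) -> str:
    """Assign a tweet timestamp to 'travel', 'day' or 'night'."""
    hour = timestamp.hour
    if hour in TRAVEL_HOURS:
        return "travel"
    if hour in DAY_HOURS:
        return "day"
    return "night"
```

Tagging each tweet this way would allow the day and night demographics to be compared separately for the same location.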
17. References:
• Dale, M. R. T., and M-J. Fortin. "From graphs to spatial graphs." Annual Review of Ecology,
Evolution, and Systematics 41.1 (2010): 21.
• Fischer, E. (July, 2011). World Map of Flickr and Twitter Locations. In See Something or Say
Something. Available at http://www.flickr.com/photos/walkingsf/5912169471/in/set-72157627140310742
• http://urbantick.blogspot.co.uk/2010/12/ncl-social-networks.html
• Kozinets, Robert V. Netnography: Doing ethnographic research online. Sage Publications Limited,
2009.
• R Core Team (2012). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
• Rao, D., Yarowsky, D., Shreevats, A., & Gupta, M. (2010, October). Classifying latent user attributes
in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated
contents (pp. 37-44). ACM.