Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic.
• Alistair Leak
• UCL SECReT
• email@example.com
Who am I?
Education:
Kingston University (BSc) - GIS
UCL (M.Res) - Advanced Spatial Analysis and Visualisation
UCL 3+1 - PhD Security and Crime Science
Supervisors:
1st Supervisor: Professor Paul Longley
2nd Supervisor: Dr James Cheshire
Definitions:
• Netnography – “A qualitative, interpretive research methodology that uses internet-optimized ethnographic research techniques to study the social context in online communities” (Kozinets, 2009)
• Cybergeodemographics – “The analysis of people by where they live and by whom they interact with, in real and virtual space” (Longley, 2012)
Uncertainty of Identity: Work Package 4: Cybergeodemographics
• Use of primary and secondary data to relate virtual Internet traffic to the probable physical locations from which it emanated; and the development of typologies of social networks that are robust, generalized and related to physical locations.
[Diagram labels: Secondary Data Collection Tools, Data (WP1), Cybergeodemographics (WP4), Text Analytics (WP2)]
Working Title:
• “Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic”
Objectives:
• Develop spatial context of name network classification
• Develop typologies of social networks
• Measure how representative social media is of the underlying population.
Work Plan
• M.Res (Present – 2013)
  – Foundation work
    • Assess representative capability of tweet data
  – Skills Development
    • Spatio-Temporal Data Mining
    • Database Management
• Ph.D (2013 – 2016)
  – Objectives
    • Develop spatial component of names networks
    • Develop typologies of social networks
    • Develop a measure of uncertainty
  – Completion in August 2016
Challenges of Study
• Signal from Noise
  – Tweets are not all sent from individuals' homes
    • Day and night demographics
  – Not all location-tagged tweets are sent by real people
• Data Quality/Sample Size
  – Twitter users are self-selecting
    • Only a small proportion have enabled location services
    • Dataset currently has 92,000 unique users
Target Areas of Study
• Spatio-temporal differentiation of tweets
  – Night
  – Day
  – Travel
• Expansion of the Methodology for World Names
  – Initially into Europe.
• Application of new name datasets.
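A minimal sketch of the night/day split named above, assuming each tweet carries a local timestamp. The hour thresholds and the names `time_band` and `band_counts` are illustrative assumptions, not part of the project; the travel category would additionally need location data and is not covered here.

```python
from datetime import datetime

# Assumed thresholds (not taken from the slides): "night" runs 22:00-05:59 local time.
NIGHT_START_HOUR = 22
NIGHT_END_HOUR = 6


def time_band(local_time: datetime) -> str:
    """Return 'night' or 'day' for a tweet's local timestamp."""
    hour = local_time.hour
    if hour >= NIGHT_START_HOUR or hour < NIGHT_END_HOUR:
        return "night"
    return "day"


def band_counts(timestamps):
    """Count how many timestamps fall into each band."""
    counts = {"night": 0, "day": 0}
    for ts in timestamps:
        counts[time_band(ts)] += 1
    return counts
```

Comparing the per-band counts for one area gives a first, rough view of how its night-time and day-time tweet populations differ.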
References:
• Dale, M. R. T., and M-J. Fortin. "From graphs to spatial graphs." Annual Review of Ecology, Evolution, and Systematics 41.1 (2010): 21.
• Fischer, E. (July, 2011). World Map of Flickr and Twitter Locations. In See Something or Say Something. Available at http://www.flickr.com/photos/walkingsf/5912169471/in/set-72157627140310742
• http://urbantick.blogspot.co.uk/2010/12/ncl-social-networks.html
• Kozinets, Robert V. Netnography: Doing ethnographic research online. Sage Publications Limited, 2009.
• R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
• Rao, D., Yarowsky, D., Shreevats, A., & Gupta, M. (2010, October). Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents (pp. 37-44). ACM.
Thank you
X Factor Graph
Produced with R and Gephi