Open Data: Analysis and Visualisation

This presentation gives an overview of open data. It presents several case studies on the spatio-temporal analysis and visualisation of social media (Twitter) data, and explains how to create a heat map visualisation in R.

Published in: Technology, Business

  1. 1. Open Data: Analysis and Visualisation Muhammad Adnan Department of Geography, University College London Web: http://www.uncertaintyofidentity.com Twitter: @gisandtech
  2. 2. Dr. Muhammad Adnan • Research Associate – Working on an EPSRC funded project “Uncertainty of Identity” – http://www.uncertaintyofidentity.com Research Interests • Data Mining • Social Media Analysis • Data Visualisation
  3. 3. Outline • Open Data • Crowd-Sourced Data (Social Media) • Analysis and Visualisation Challenges • Twitter Case Study • Spatial Analysis • Temporal Analysis • R • A brief introduction • How to create heat maps
  4. 4. Open data Data that is: • Open and free to the public • Complete • Accessible • Timely • Machine processable • Non-discriminatory
  5. 5. Dataset examples • National Budgets • Car registries • National roads • Water heights • Schools • Weather • Public transport • Council tax bands • And many more
  6. 6. Census Profiler • http://www.censusprofiler.org/ • Users can visualise 2001 Census data
  7. 7. Education Profiler • http://www.educationprofiler.org/ • Users can visualise education datasets
  8. 8. Open Data Profiler • http://www.opendataprofiler.com/ • Users can visualise 60 different 2011 Census datasets
  9. 9. Crowd-Sourced datasets • Twitter • Public streaming API can be used to download live tweets • Foursquare • Has an API which can be used to access the Foursquare data • Facebook • Facebook applications can access user information • Flickr • Wikipedia • YouTube
  10. 10. How big are crowd sourced datasets ? • Facebook • Number of active users: 850 Million • Average daily uploaded photos: 360 Million • Total data size: 30+ Petabytes • Twitter • Number of active users: 200 Million • Daily tweets (posts): 350 Million • Foursquare • Number of active users: 15 Million • Total check-ins: 1.5 Billion
  11. 11. What are the issues with these datasets? • How representative are social media datasets of Census or Electoral Roll data? • Who: Ethnicity, gender, and age of social media users • Where: Where social media conversations are happening and who is leading them • Intelligence about where people are located and what they are doing • When: What time of day conversations happen
  12. 12. Twitter (www.twitter.com) • Online social-networking and microblogging service • Launched in 2006 • Users can send messages of 140 characters or less • Approximately 200 million active users • 350 million tweets daily • In 2012, the UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets
  13. 13. Basic Analysis of the Twitter data
  14. 14. Data available through the Twitter API • User Creation Date • Followers • Friends • User ID • Language • Location • Name • Screen Name • Time Zone • Geo Enabled • Latitude • Longitude • Tweet date and time • Tweet text Users can download a 1% sample of the live tweets through the API
  15. 15. Created with approx. 100 million tweets
  16. 16. 4 million geo-tagged tweets downloaded during August and December, 2012
  17. 17. 4 million geo-tagged tweets downloaded during August and December, 2012
  18. 18. Hourly and Daily Twitter Activity in London
  19. 19. Hourly Twitter Activity in London
  20. 20. Daily Twitter Activity in London [Charts: number of tweets (0 to 12,000) by hour (1 to 24) for Monday, Tuesday, Wednesday, and Thursday]
  21. 21. Daily Twitter Activity in London [Charts: number of tweets (0 to 12,000) by hour (1 to 24) for Friday, Saturday, and Sunday]
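Hourly profiles like the ones charted on these slides can be approximated by counting tweets per weekday and hour. The following is a minimal sketch in R, not the author's code: the data frame `tweets`, its `created_at` column, and the five sample timestamps are all hypothetical. Hours are numbered 0 to 23 here, whereas the charts use 1 to 24.

```r
# Hypothetical sample: timestamps of five tweets (illustrative data only)
tweets <- data.frame(
  created_at = as.POSIXct(c(
    "2012-08-06 08:15:00", "2012-08-06 08:45:00", "2012-08-06 17:30:00",
    "2012-08-07 08:05:00", "2012-08-11 23:59:00"
  ), tz = "Europe/London")
)

# Hour of day (0-23), kept as a factor so that empty hours still appear
tweets$hour <- factor(as.integer(format(tweets$created_at, "%H")), levels = 0:23)

# Day of week, derived from the numeric weekday to avoid locale-dependent names
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
wday <- as.POSIXlt(tweets$created_at)$wday          # 0 = Sunday
tweets$day <- factor(day_names[wday + 1], levels = day_names)

# Cross-tabulate: one row per day, one column per hour
activity <- table(tweets$day, tweets$hour)

# Plot one day's hourly profile as a bar chart
barplot(activity["Monday", ], xlab = "Hour", ylab = "Number of tweets",
        main = "Monday")
```

Keeping both variables as factors with fixed levels means every day and hour appears in the table even when no tweets fall in that cell, which makes days directly comparable.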
  22. 22. Analysis of User Names on Twitter • A name is a statement of the person's ethnic, linguistic, and cultural identity. • E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo Mateos is a Spanish (Hispanic) name.
  23. 23. Analysing Names on Twitter • Some examples of NAME variations on Twitter • Real Names: Kevin Hodge; Andre Alves; Jose de Franco; Carolina Thomas, Dr.; Prof. Martha Del Val; Fabíola Sanchez Fernandes • Fake Names: JustinBieber_Home.; WHAT IS LOVE?; MysticMind; KIRILL_aka_KID; Vanessa Petuna
  24. 24. Analysing Names on Twitter • Some examples of NAME variations on Twitter, split into forename (F) and surname (S) • Real Names • Kevin Hodge -> F: 'Kevin' ; S: 'Hodge' • Andre Alves -> F: 'Andre' ; S: 'Alves' • Jose De Franco -> F: 'Jose' ; S: 'De Franco' • Carolina Thomas, Dr. -> F: 'Carolina' ; S: 'Thomas' • Prof. Martha Del Val -> F: 'Martha' ; S: 'Del Val' • Fabíola Sanchez Fernandes -> F: 'Fabíola' ; S: 'Fernandes'
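The forename/surname split shown on this slide can be sketched with a few string rules. The slides do not say how the split was actually done, so the function below is an illustrative heuristic only: `split_name`, the `titles` list, and the `particles` list are hypothetical names and far from exhaustive.

```r
# Titles and surname particles recognised by this sketch (illustrative only)
titles    <- c("dr", "prof", "mr", "mrs", "ms")
particles <- c("de", "del", "van", "von", "la")

split_name <- function(name) {
  # Replace commas and full stops with spaces, then split on whitespace
  tokens <- strsplit(gsub("[,.]", " ", name), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  # Drop titles such as "Dr." or "Prof."
  tokens <- tokens[!(tolower(tokens) %in% titles)]
  # A single token cannot be split into forename and surname
  if (length(tokens) < 2) {
    return(c(forename = NA_character_, surname = NA_character_))
  }
  n <- length(tokens)
  surname <- tokens[n]
  # Keep a particle that directly precedes the last token, e.g. "De Franco"
  if (n > 2 && tolower(tokens[n - 1]) %in% particles) {
    surname <- paste(tokens[n - 1], surname)
  }
  c(forename = tokens[1], surname = surname)
}

split_name("Prof. Martha Del Val")
split_name("Carolina Thomas, Dr.")
split_name("MysticMind")
```

Rules like these cannot tell a real name from a phrase such as "WHAT IS LOVE?", so in practice the resulting pairs would still need to be checked against a reference list of known forenames and surnames.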
  25. 25. Where they tweet from:
  26. 26. Where they tweet from:
  27. 27. Where they tweet from:
  28. 28. Predicting Ethnicity of Twitter Users by using their 'Names' • A name is a statement of the person's ethnic, linguistic, and cultural identity. • E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo Mateos is a Spanish (Hispanic) name.
  29. 29. Classifying Twitter Data to ethnic origins • Applied ONOMAP (www.onomap.org) on FORENAME + SURNAME pairs Kevin Hodge (ENGLISH) Pablo Mateos (Spanish) … … … …
  30. 30. Top 10 Ethnic Groups of Twitter Users
  31. 31. Tweeting Activity by different Ethnic Groups
  32. 32. Comparison of Ethnic Groups between '2011 Census' and 'Twitter' • Onomap groups were aggregated to match the appropriate groups from the Census

      | London | Total | White British | White Other | Indian | Pakistani | Bangladeshi | Black African | Chinese |
      |---|---|---|---|---|---|---|---|---|
      | Week Night | 53611 | 71.35% | 12.12% | 2.63% | 2.63% | 1.82% | 1.52% | 1.74% |
      | Week Day | 80676 | 73.12% | 11.80% | 2.41% | 2.41% | 1.56% | 1.25% | 1.61% |
      | Weekend | 67351 | 72.86% | 12.17% | 2.61% | 2.61% | 1.67% | 1.39% | 1.73% |
      | 2011 Census | | 44.89% | 12.65% | 6.64% | 2.74% | 2.72% | 7.02% | 1.52% |
  33. 33. Comparison of the distribution of ethnicity with the 2011 Census White British (Quintiles) 2011 Census Twitter
  34. 34. Gender and Age Analysis of Twitter Users by using their 'forenames'
  35. 35. Gender Analysis of Twitter Users [Chart: share of tweets and share of unique users (0% to 60%) in each category: Male, Female, Unisex, Not Found]
  36. 36. Age estimation from 'forenames' Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office for National Statistics) [Chart: percent (0% to 45%) by age group for the forenames PAUL, BETTY, GUY, and MUHAMMAD]
  37. 37. Age-Sex structure of Twitter Users and 2011 Census Male Female
  38. 38. Tweets by different Land-use Categories
  39. 39. Temporal Activity: Tweets from different Land-use Categories
  40. 40. Ethnic Segregation of Twitter Users
  41. 41. Segregation Analysis
  42. 42. Segregation Analysis • The value of the information theory index (H) is between 0 (low segregation) and 1 (high segregation).

      | Ethnic Groups | H (Domestic buildings and gardens) | H (Week Nights) | H (Week Days) | H (Weekend) |
      |---|---|---|---|---|
      | British | 0.483 | 0.401 | 0.211 | 0.315 |
      | Irish | 0.670 | 0.571 | 0.357 | 0.475 |
      | White Other | 0.630 | 0.510 | 0.303 | 0.420 |
      | Pakistani | 0.765 | 0.679 | 0.488 | 0.633 |
      | Indian | 0.748 | 0.673 | 0.451 | 0.590 |
      | Bangladeshi | 0.864 | 0.834 | 0.671 | 0.784 |
      | Black Caribbean | 0.831 | 0.808 | 0.548 | 0.666 |
      | Black African | 0.764 | 0.704 | 0.492 | 0.640 |
      | Chinese | 0.712 | 0.608 | 0.403 | 0.524 |
      | Other | 0.710 | 0.593 | 0.374 | 0.497 |
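The slides do not give the formula behind the index. A common two-group form of the information theory index (Theil's H) compares the entropy of each area with the entropy of the whole city, weighted by area population. The sketch below assumes that form and treats each ethnic group against everyone else; `entropy`, `theil_h`, and the sample counts are hypothetical and not taken from the presentation.

```r
# Binary entropy of a proportion p, treating 0 * log(0) as 0
entropy <- function(p) {
  term <- function(q) ifelse(q > 0, q * log(1 / q), 0)
  term(p) + term(1 - p)
}

# Two-group information theory index.
#   group: number of group members in each area
#   total: number of all users in each area (assumed > 0)
theil_h <- function(group, total) {
  city_entropy <- entropy(sum(group) / sum(total))
  area_entropy <- entropy(group / total)
  sum(total * (city_entropy - area_entropy)) / (sum(total) * city_entropy)
}

# Hypothetical counts for two areas
theil_h(group = c(10, 0), total = c(10, 10))  # group confined to one area
theil_h(group = c(5, 5),  total = c(10, 10))  # group spread evenly
```

Under this definition the first call yields 1 (complete segregation) and the second 0 (no segregation), matching the 0 to 1 range stated on the slide.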
  43. 43. Extending the analysis to other cities
  44. 44. Tweet density map of London
  45. 45. Tweet density map of Paris
  46. 46. Tweet density map of New York City
  47. 47. Top 10 ethnic groups in London
  48. 48. Top 10 ethnic groups in Paris
  49. 49. Top 10 ethnic groups in NYC
  50. 50. Tweeting Activity by different Ethnic Groups (NYC)
  51. 51. Tweeting Activity by different Ethnic Groups (Paris)
  52. 52. Gender Analysis
  53. 53. Exploring the Languages on Twitter
  54. 54. Data available through the Twitter API • User Creation Date • Followers • Friends • User ID • Language • Location • Name • Screen Name • Time Zone • Geo Enabled • Latitude • Longitude • Tweet date and time • Tweet text
  55. 55. Twitter Languages (World)
  56. 56. Twitter Languages (Europe)
  57. 57. Twitter Language Maps
  58. 58. Twitter Language Maps
  59. 59. Twitter Language Maps
  60. 60. Temporal Analysis of the data sets
  61. 61. Temporal Analysis of the Twitter Data • Data: 12 September, 2012 – 25 September, 2013 • We extracted a total of approx. 800 million tweets over the last year • A temporal activity analysis of different cities could potentially reveal a lot of information about the residents of each city • But Twitter data is not clean and has many problems!
  62. 62. Problems with the data 1) Extracting the data for individual cities or places • Use of bounding boxes to extract the data • New York City NE: 40.91762, -73.7004 SW: 40.47662, -74.2589 • http://isithackday.com could be used to find the bounding boxes of different cities
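Bounding-box extraction amounts to keeping the tweets whose coordinates fall between the two corners. A minimal sketch, treating the New York City coordinates on the slide as the maximum and minimum latitude and longitude; the data frame `tweets` and its four sample points are hypothetical.

```r
# Bounding box for New York City, from the corner coordinates on the slide
nyc <- list(lat_min = 40.47662, lat_max = 40.91762,
            lon_min = -74.2589, lon_max = -73.7004)

# Hypothetical geo-tagged tweets (illustrative coordinates only)
tweets <- data.frame(
  id  = 1:4,
  lat = c(40.7128, 51.5074, 40.6500, 41.2000),
  lon = c(-74.0060, -0.1278, -73.9500, -73.9000)
)

# Keep only the tweets that fall inside the box
in_box <- tweets$lat >= nyc$lat_min & tweets$lat <= nyc$lat_max &
          tweets$lon >= nyc$lon_min & tweets$lon <= nyc$lon_max
nyc_tweets <- tweets[in_box, ]
print(nyc_tweets)
```

A rectangular box is only an approximation of a city boundary, so it will include some tweets from neighbouring areas.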
  63. 63. Problems with the data 2) Twitter timestamps are in GMT or BST. Converting them to the local time of other cities is very time consuming • A timestamp of 12 p.m. GMT corresponds to 5 a.m. in Los Angeles while Los Angeles is on daylight saving time (UTC-7). • The same 12 p.m. timestamp read as BST (UTC+1) corresponds to 4 a.m. in Los Angeles.
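Base R can perform this kind of conversion once it is known which zone a timestamp was recorded in. The sketch below parses the same clock time first as GMT/UTC and then as UK local time (BST in July), and expresses both in Los Angeles time. The sample timestamp is hypothetical, and the zone names assume that the standard Olson time zone database is available on the system.

```r
# A hypothetical timestamp string as it might appear in the raw data
stamp <- "2013-07-15 12:00:00"

# Interpret the same clock time in two different ways
as_gmt <- as.POSIXct(stamp, tz = "UTC")            # timestamp is GMT
as_bst <- as.POSIXct(stamp, tz = "Europe/London")  # timestamp is UK local time (BST in July)

# Express both instants in Los Angeles local time
la_from_gmt <- format(as_gmt, "%H:%M", tz = "America/Los_Angeles")
la_from_bst <- format(as_bst, "%H:%M", tz = "America/Los_Angeles")

la_from_gmt  # "05:00"
la_from_bst  # "04:00"
```

The one-hour difference between the two results shows why the zone of the raw timestamp has to be established before comparing activity across cities.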
  64. 64. Temporal Analysis of different cities • Approx. 170 million tweets were sent from the following 30 cities. [Chart: number of tweets (millions, 0 to 40) per city]
  65. 65. Temporal Analysis of different cities LONDON
  66. 66. Temporal Analysis of different cities LONDON PARIS
  67. 67. Temporal Analysis of different cities JAKARTA
  68. 68. Temporal Analysis of different cities JAKARTA RIYADH
  69. 69. Temporal Analysis of different cities ISTANBUL JAKARTA
  70. 70. Introduction to R
  71. 71. What is R? • The R statistical programming language is a free open source package based on the S language developed by Bell Labs. • The language is very powerful for writing programs. • Many statistical functions are already built in. • Very easy to create maps and different visualizations.
  72. 72. What is R? • You will have to write some code to get things done! • R is available @ www.r-project.org • Supports both 32 and 64 bit Windows PCs, Linux, Unix, and Mac OS operating systems
  73. 73. Getting Started • The R GUI?
  74. 74. Getting Started
  75. 75. Interacting with R
      Math:
      > 1 + 1
      [1] 2
      > 1 + 1 * 7
      [1] 8
      > (1 + 1) * 7
      [1] 14
      > sqrt(16)
      [1] 4
      Variables:
      > x <- 1
      > x
      [1] 1
      > y <- 2
      > y
      [1] 2
      > z <- x+y
      > z
      [1] 3
  76. 76. Importing Data • How do we get data into R? • First make sure your data is in an easy to read format such as CSV (Comma Separated Values) • Use code: – D <- read.csv("path", sep=",", header=TRUE) – D <- read.table("path", sep=",", header=TRUE)
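To try the import step without an existing file, one option is to write a small CSV to a temporary location and read it back. This is only a sketch: the file contents are invented for illustration, and the column name `wg` is borrowed from the histogram example later in the deck.

```r
# Write a tiny CSV to a temporary file (illustrative data only)
path <- tempfile(fileext = ".csv")
writeLines(c("name,wg", "a,1.5", "b,2.5", "c,4.0"), path)

# Read it into a data frame; header = TRUE uses the first row as column names
D <- read.csv(path, sep = ",", header = TRUE)

# Inspect the result
str(D)       # column names and types
head(D)      # first rows
mean(D$wg)   # access one column with the $ operator
```

`str()` and `head()` are a quick way to confirm that the separator and header settings were interpreted as intended before any analysis.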
  77. 77. Working with data. • Accessing columns. • D has our data in it…. But you can't see it directly. • To select a column use D$column.
  78. 78. Basic Graphics • Histogram – hist(D$wg)
  79. 79. How to create a heat map in R ?
  80. 80. How to create a heat map in R ? • Three steps: – Read a CSV file – Choose the colours for the heat map – Create the heat map
  81. 81. How to create a heat map in R ? • Step 1: Read a CSV file read.csv("FILE NAME", sep=",", header=TRUE)
  82. 82. How to create a heat map in R ? • Step 1: Read a CSV file read.csv("FILE NAME", sep=",", header=TRUE) • Assign it to a variable Input <- read.csv("FILE NAME", sep=",", header=TRUE) i.e. with the '<' (less than) and '-' (dash) symbols typed together as the assignment operator.
  83. 83. How to create a heat map in R ? • Step 1: Read a CSV file
  84. 84. How to create a heat map in R ? • Step 2: Choose the colours for the heat map colours <- c(0) (Create a placeholder variable to hold the colours)
  85. 85. How to create a heat map in R ? • Step 2: Choose the colours for the heat map colours <- c(0) colours[1] <- "#FDD49E" colours[2] <- "#FDBB84" colours[3] <- "#FC8D59" colours[4] <- "#EF6548" colours[5] <- "#D7301F" colours[6] <- "#B30000" colours[7] <- "#7F0000"
  86. 86. How to create a heat map in R ? • Step 2: Choose the colours for the heat map colours <- c(0) colours[1] <- "#FDD49E" colours[2] <- "#FDBB84" colours[3] <- "#FC8D59" colours[4] <- "#EF6548" colours[5] <- "#D7301F" colours[6] <- "#B30000" colours[7] <- "#7F0000"
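The same seven colours can also be defined in a single call to `c()`, which avoids the placeholder and the element-by-element assignment. As an alternative, base R's `colorRampPalette()` can interpolate a ramp between two end colours. The variable name `ramp` is illustrative, and the interpolated shades will not exactly match the hand-picked ones.

```r
# Define the palette from the slide in one step, lightest to darkest
colours <- c("#FDD49E", "#FDBB84", "#FC8D59", "#EF6548",
             "#D7301F", "#B30000", "#7F0000")

# Alternative: interpolate seven shades between the two end colours
ramp <- colorRampPalette(c("#FDD49E", "#7F0000"))(7)

length(colours)
length(ramp)
```

An ordered light-to-dark palette matters for a heat map because `heatmap()` maps low values to the first colour and high values to the last.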
  87. 87. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)
  88. 88. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) • Input1_matrix: the input data
  89. 89. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) • scale: whether to apply scaling to the data. Options are 'row', 'column' (abbreviated here to 'col'), and 'none'.
  90. 90. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) • Rowv = NA, Colv = NA: leave them as they are! (They switch off the row and column dendrograms and reordering.)
  91. 91. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) • col: the colours chosen in Step 2
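The slides move from the data frame `Input` to `Input1_matrix` without showing the conversion. `heatmap()` expects a numeric matrix, so one plausible intermediate step is `as.matrix()` on the numeric columns. The end-to-end sketch below assumes that step; the city counts and column names are hypothetical data invented for illustration.

```r
# Hypothetical input, standing in for the data read from the CSV file
Input <- data.frame(
  city    = c("London", "Paris", "Jakarta"),
  morning = c(120, 80, 300),
  noon    = c(200, 150, 260),
  evening = c(310, 240, 180)
)

# heatmap() needs a numeric matrix: drop the label column and keep it as row names
Input1_matrix <- as.matrix(Input[, -1])
rownames(Input1_matrix) <- Input$city

# Colours from Step 2, lightest to darkest
colours <- c("#FDD49E", "#FDBB84", "#FC8D59", "#EF6548",
             "#D7301F", "#B30000", "#7F0000")

# Draw the heat map: scale each column, no dendrograms or reordering
heatmap(Input1_matrix, scale = "column", Rowv = NA, Colv = NA, col = colours)
```

With `scale = "column"` each column is standardised separately, so the colours compare cities within a time band and not across bands.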
  92. 92. • Open Data • Crowd-Sourced Data (Social Media) • Analysis and Visualisation Challenges • Twitter Case Study • Spatial Analysis • Temporal Analysis • R • A brief introduction • How to create heat maps Any Questions?
