Cities are complex environments in which digital technologies are more and more pervasive; this digitization of the urban space has led to a rich ecosystem of data producers and data consumers. Moreover, heterogeneous sources differ in terms of data complexity, spatio-temporal resolution and curation/maintenance costs. Do those diverse urban sources reflect the same picture of the city? Do distinct perspectives share some commonalities?
We present empirical data analytics experiments on a set of urban sources related to the city of Milano; our investigation aims to discover “affinities” between datasets by means of different quantitative and qualitative correlation analyses. We also explore the influence of spatial resolution and data complexity on the dependence strength between heterogeneous urban sources, to pave the way for meaningful information fusion.
City Data Dating: emerging affinities between diverse urban datasets
1. City Data Dating: emerging affinities between diverse urban datasets
Gloria Re Calegari, Irene Celino, Diego Peroni
Paper available at:
http://www.sciencedirect.com/science/article/pii/S0306437915001362
Elsevier - Information Systems Journal 1
2. Digital information about cities
• Open data (large number of data sources available on the web):
• Urban planning (land cover, public registers)
• Demographics and statistics about municipality
• Closed data sources produced and maintained by enterprises:
• Phone activity data (sometimes made open)
• User generated information:
• Volunteered geographic information and crowdsourcing information (Open Street Map)
• Location-based social networks (Foursquare check-ins and geolocated information)
• Real-time and streaming information
• Sensors (e.g. temperature, energy consumption, ...)
3. Do diverse urban datasets provide the same “picture” of the city?
Short term goals
• Discovering “affinities” between heterogeneous datasets.
• Using a human relations metaphor, do diverse urban datasets “date each other” and show “natural affinities”?
• What is the influence of spatial resolution and data complexity on the dependence strength between heterogeneous urban sources?
Long term goals
• Would it be possible to use one or more “cheap” datasets as a proxy for more “expensive” data sources?
4. Milano datasets
Demographics:
• Population density
• Spatial resolution: census area (6079 census areas, median size 12,000 m²)
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenity places, shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official, 6718) and Open Street Map (user generated, 44351)
5. Milano datasets
Land use:
• Type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined)
• CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html#
• 12 types selected (those that best characterize a metropolitan area such as Milan): dense residential areas, scattered residential areas, industrial areas, parks and green areas, roads, railways, hospitals, sports centres, public services and offices, construction sites, agricultural areas and wild areas
• Spatial resolution: building level
• Source: Lombardy region open data
6. Milano datasets
Call data records:
• 5 phone activities
• Incoming SMS
• Outgoing SMS
• Incoming CALL
• Outgoing CALL
• Internet
• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)
• Spatial resolution: grid of 3538 square cells of 250 m
• Source: Telecom Italia, provided for their Big Data Challenge http://theodi.fbk.eu/openbigdata/
7. Pre-processing of data
Making spatial resolution uniform
Spatial resolutions used:
• District level with 88 official subdivisions
• Grid level with 3538 square cells of 250 m
[Figure: maps of the two spatial subdivisions; panels: Cells, Districts]
New datasets generated:
• Density of POIs in each cell/district
• Weighted sum of population density in each cell/district
• Percentage shares of each land use over each cell/district area
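As an illustration of the first derived dataset, the following is a minimal sketch of how point POIs could be binned into a square grid and turned into a density. It assumes the lat-long points have already been projected to metric coordinates; the function name, the grid origin and the toy points are hypothetical and not taken from the original work.

```python
from collections import Counter

CELL_SIZE_M = 250.0  # cell side in metres, as in the grid described above


def poi_density_per_cell(points, origin, cell_size=CELL_SIZE_M):
    """Return POI density (POIs per km^2) for every non-empty grid cell.

    points: iterable of (x, y) in projected metres (assumption: the
            lat-long points were projected beforehand).
    origin: (x0, y0) lower-left corner of the grid (hypothetical value).
    """
    x0, y0 = origin
    counts = Counter()
    for x, y in points:
        col = int((x - x0) // cell_size)  # column index of the cell
        row = int((y - y0) // cell_size)  # row index of the cell
        counts[(row, col)] += 1
    area_km2 = (cell_size / 1000.0) ** 2
    return {cell: n / area_km2 for cell, n in counts.items()}


# Toy input (invented): two POIs fall in cell (0, 0), one in cell (0, 1).
pois = [(10.0, 20.0), (240.0, 249.0), (260.0, 20.0)]
density = poi_density_per_cell(pois, origin=(0.0, 0.0))
```

The same binning idea would apply at district level, with a point-in-polygon test replacing the integer division.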
8. Pre-processing of data
Telecom data
• Footprint/temporal signature for each cell/district (average activity over all the 60 days, distinguishing between weekdays and weekend days)
• Temporal data compression (pre-processing large-scale time series to get a more manageable compressed representation)
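The footprint described above can be sketched as a per-slot average computed separately for weekdays and weekend days. This is only an illustrative sketch for one activity in one cell: the function name and the toy series are hypothetical, and the actual pre-processing used in the work may differ.

```python
from datetime import date, timedelta

SLOTS_PER_DAY = 144  # one value every 10 minutes


def temporal_signature(daily_series, first_day):
    """Average a list of per-day series into (weekday, weekend) profiles.

    daily_series: list of lists, one per day, each with SLOTS_PER_DAY values.
    first_day:    calendar date of the first series.
    """
    sums = {False: [0.0] * SLOTS_PER_DAY, True: [0.0] * SLOTS_PER_DAY}
    days = {False: 0, True: 0}
    for offset, values in enumerate(daily_series):
        if len(values) != SLOTS_PER_DAY:
            raise ValueError("each day needs exactly SLOTS_PER_DAY values")
        # weekday() returns 5 for Saturday and 6 for Sunday
        is_weekend = (first_day + timedelta(days=offset)).weekday() >= 5
        days[is_weekend] += 1
        for slot, value in enumerate(values):
            sums[is_weekend][slot] += value

    def profile(is_weekend):
        n = days[is_weekend]
        return [s / n for s in sums[is_weekend]] if n else []

    return profile(False), profile(True)


# Toy input (invented): four constant days starting on Friday 1 Nov 2013.
toy = [[1.0] * 144, [2.0] * 144, [4.0] * 144, [3.0] * 144]
weekday_profile, weekend_profile = temporal_signature(toy, date(2013, 11, 1))
```

Concatenating the two profiles for each of the 5 activities would give 5 × 2 × 144 = 1440 values per cell or district.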
9. Data exploration experiments
To discover possibly “natural” connections between heterogeneous datasets.
Three-step process
1) Correlation Analysis
2) Regression Analysis
3) Clustering Analysis
All analyses are performed at both district and cell level.
[Diagram label alongside the three steps: complexity of data]
10. 1) Correlation analysis
Try to identify possible correspondences between different datasets.
Measure whether and how two variables change together using a correlation index -> Pearson’s correlation coefficient r, with -1 ≤ r ≤ 1 (r > 0: positive correlation; r < 0: negative correlation).
Pairwise comparisons between 1-dimensional vectors:
• POIs: density
• Population: density
• Telecom: first Principal Component with 90% of explained variability
• Land use data: only residential and agricultural land use, each used separately, as the percentage of district/cell area they cover
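For reference, a pairwise comparison of two 1-dimensional vectors with Pearson’s coefficient can be sketched as follows. The function name and the toy vectors are invented for illustration; they are not values from the Milano datasets.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy vectors (invented), one value per district or cell.
population = [1.0, 2.0, 3.0, 4.0]
residential = [2.0, 4.0, 6.0, 8.0]   # grows with population
agricultural = [8.0, 6.0, 4.0, 2.0]  # shrinks as population grows

r_positive = pearson_r(population, residential)
r_negative = pearson_r(population, agricultural)
```

On these toy vectors the first pair is perfectly positively correlated and the second perfectly negatively correlated, which are the two extremes of the range.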
11. Correlation analysis - district level
• Correlations between Telecom and residential land use, and between Telecom and POIs, can actually exist; the data fit quasi-linear models.
• Negative correlation between agricultural land use and the other datasets -> human activity is inversely related to agricultural areas.
[Figure: district-level correlation matrix; variables: tlc, resid, agric, POI mun, POI OSM, pop]
12. Correlation analysis - cell level
• All coefficients are lower than at district level
• The highest values are again between Telecom and residential land use and POIs
=> the choice of resolution level can have a significant impact on the correlation results.
• Some phenomena causing the correlation are independent of the resolution level (0.76 residential-population).
[Figure: cell-level correlation matrix; variables: tlc, resid, agric, POI mun, POI OSM, pop]
13. Correlation analysis - phone calls and population
• Could the correlation change during the day according to everyday human behaviour patterns (get up, go to work, come back home in the evening)?
• Call activity at 6 different times of day
• Weekday and weekend profiles are different -> mirroring people’s different habits
• Average correlation is higher at the weekend (phone activity is related to the actual presence of people at home)
• Weekday profile -> human behaviour pattern
[Figure: correlation profiles; panels: DISTRICT, CELL; series: WEEK, WEEKEND]
14. 2) Regression analysis
Fit multiple linear regression (MLR) models
(to take into consideration a larger portion of urban data complexity)
Datasets entering the MLR models:
• Telecom (a 1-d vector for each communication type, for week/weekend days, with the average activity)
• POIs (a 1-d vector for each POI category, with the POI density)
• Demographics (a 1-d vector with the population density)
• Land use (a 1-d vector for each land use, with the percentage of land covered; only 5 main land use types are used)
[Diagram labels: cheap-to-produce, expensive-to-maintain]
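A multiple linear regression of this kind can be sketched with ordinary least squares, reporting both R² and the adjusted R² that penalizes the number of predictors k. This is a minimal sketch assuming NumPy is available; the function name, the synthetic predictors and the outcome are hypothetical placeholders, not the datasets or models of the original work.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept.

    Returns (coefficients, r2, r2_adj); coefficients[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 corrects for the number of predictors k
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return coef, r2, r2_adj


# Synthetic data (invented): 88 rows, mirroring the number of districts.
rng = np.random.default_rng(0)
X = rng.random((88, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(0.0, 0.05, 88)

coef, r2, r2_adj = fit_mlr(X, y)
```

Comparing `r2_adj` across models with different predictor subsets is one simple way to see whether an extra predictor is worth its cost.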
15. Regression analysis - results
• The more predictors are selected (increasing k), the larger the part of the outcome variable’s variance that is explained (increasing adjusted R²).
• The strength of the correlation decreases from district level (coarse-grained) to cell level (fine-grained)
• The benefit of adding a higher number of predictors is weaker at cell level
• Population, dense residential and agricultural areas show a stronger correlation
• Model selection: manual stepwise selection with the AIC criterion
Heterogeneous datasets provide comparable or even similar pictures of the urban environment.
16. Regression analysis – significant predictors
Significant predictors of the models selected with manual backwards elimination
17. 3) Clustering analysis
Exploit the whole city information available (n-dimensional datasets), avoiding data compression.
Clustering technique to understand if diverse data naturally group together throughout the urban space.
Cluster each dataset and compare the clustering obtained for each pair of datasets.
k-Means algorithm with 5 classes (Silhouette coefficient) applied on:
• CORINE: one vector for each district (NIL)/cell, with the percentage shares of the 12 land use categories
• Telecom: the whole footprint for each cell/district (NIL) (a vector of 1440 values)
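The slide pairs k-means with the Silhouette coefficient; one plausible reading, assumed here, is that the silhouette score guided the choice of the number of classes. The sketch below implements plain Lloyd-style k-means and the mean silhouette with NumPy and picks the best k on toy data. The farthest-first initialization, the function names and the toy points are illustrative choices, not the authors’ setup.

```python
import numpy as np


def farthest_first(X, k):
    """Deterministic initialization: start at X[0], then add farthest points."""
    chosen = [0]
    dist = np.linalg.norm(X - X[0], axis=1)
    while len(chosen) < k:
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()


def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_first(X, k)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def silhouette(X, labels):
    """Mean silhouette coefficient (needs at least two non-empty clusters)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:            # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = d[i, same].sum() / (n_same - 1)   # mean intra-cluster distance
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return float(np.mean(scores))


# Toy data (invented): three tight groups of five 2-d points each.
offsets = np.array([(-0.2, 0.0), (0.2, 0.0), (0.0, -0.2), (0.0, 0.2), (0.0, 0.0)])
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
X = np.vstack([np.array(c) + offsets for c in blob_centers])

scores = {k: silhouette(X, kmeans(X, k)[0]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

On real footprints each row of `X` would be a land use share vector or a telecom footprint rather than a 2-d point.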
18. Clustering analysis - Pairwise clusterings comparisons
Pairwise clustering comparisons using the Rand Index and Overall Accuracy to evaluate the correspondence between datasets
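The two agreement measures can be sketched as follows. The Rand index counts the pairs of items on which two clusterings agree (both together or both apart), so it does not depend on how clusters are numbered; overall accuracy, as written here, is the share of items with identical labels and therefore assumes the cluster labels of the two clusterings were matched beforehand. Function names and label vectors are invented for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two clusterings agree."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        agree += together_a == together_b
        total += 1
    return agree / total


def overall_accuracy(labels_a, labels_b):
    """Share of items with the same label (assumes labels already matched)."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy clusterings (invented) of five districts.
corine = [0, 0, 1, 1, 2]
telecom_same = [1, 1, 0, 0, 2]   # same partition, different cluster ids
telecom_other = [0, 0, 1, 2, 2]  # one district grouped differently

ri_same = rand_index(corine, telecom_same)
ri_other = rand_index(corine, telecom_other)
oa_unmatched = overall_accuracy(corine, telecom_same)
```

The low `oa_unmatched` value on an identical partition shows why label matching is needed before computing overall accuracy.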
19. Clustering analysis - Qualitative analysis at district level
Districts classified in the same way in the Telecom and CORINE datasets
[Figure panels: CORINE and whole activity; CORINE and SMS in-out activity; CORINE and Internet activity]
20. Conclusions and future work
• Evaluation of the correlations between datasets at different levels
• different spatial resolution (coarse-grained district level vs. fine-grained cell level)
• different data complexity (very compressed information vs. multi-dimensional data)
• Correlations between different sources exist, and their strength depends on both spatial resolution and data complexity
• Diverse urban datasets can “date each other”, but their actual “affinity” can vary.
What is coming next?
• Extending our investigation toward a predictive approach (statistical and machine learning techniques).