City Data Dating: emerging affinities between diverse urban datasets

Gloria Re Calegari, Irene Celino, Diego Peroni
Paper available at:
http://www.sciencedirect.com/science/article/pii/S0306437915001362
Elsevier - Information Systems Journal

Digital information about cities
• Open data (large number of data sources available on the web):
• Urban planning (land cover, public registers)
• Demographics and statistics about the municipality
• Closed data sources produced and maintained by enterprises:
• Phone activity data -> but sometimes made open!
• User generated information:
• Volunteered geographic information and crowdsourced information (OpenStreetMap)
• Location-based social networks (Foursquare check-ins and geolocated information)
• Real-time and streaming information
• Sensors (e.g. temperature, energy consumption, ...)

Do diverse urban datasets provide the same
“picture” of the city?
Short term goals
• Discovering “affinities” between heterogeneous datasets.
• Using a human relations metaphor, do diverse urban datasets “date each other” and show
“natural affinities”?
• What is the influence of spatial resolution and data complexity on the strength of the dependence
between heterogeneous urban sources?
Long term goals
• Would it be possible to use one or more “cheap” datasets as a proxy for more “expensive” data
sources?

Milano datasets
Demographics:
• Population density
• Spatial resolution: census area (6079 areas; median
size of a census area: 12,000 m²)
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenities,
shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official, 6718 POIs) and
OpenStreetMap (user generated, 44351 POIs)

Milano datasets
Land use:
• Type of land use according to the CORINE taxonomy
(3-level hierarchy, up to 40 types of land use defined)
• CORINE taxonomy
http://swa.cefriel.it/ontologies/corine.html#
• 12 types selected (those that best characterise a
metropolitan area such as Milan):
dense residential areas, scattered residential areas, industrial
areas, parks and green areas, roads, railways, hospitals, sports
centres, public services and offices, construction sites, agricultural
areas and wild areas
• Spatial resolution: building level
• Source: Lombardy region open data

Milano datasets
Call data records:
• 5 phone activities
• Incoming SMS
• Outgoing SMS
• Incoming calls
• Outgoing calls
• Internet
• Recorded every 10 minutes (144 values a day for each
activity) for 2 months (Nov-Dec 2013)
• Spatial resolution: grid of 3538 square cells of 250m
• Source: Telecom Italia – provided for their Big Data
Challenge http://theodi.fbk.eu/openbigdata/

Pre-processing of data
Making spatial resolution uniform
Spatial resolutions used:
• District level with 88 official subdivisions
• Grid level with 3538 square cells of 250m
[Figure panels: Cells, Districts]
New datasets generated:
• Density of POIs in each cell/district
• Weighted sum of population density in each cell/district
• Percentage shares of each land use over each cell/district area
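
A minimal sketch of the first derived dataset (density of POIs in each cell), using only the Python standard library. It assumes the POI coordinates have already been projected from lat-long to metres and that "square cells of 250m" means cells with a 250 m side; the names `cell_index`, `poi_density` and the sample points are hypothetical and not taken from the paper.

```python
from collections import Counter

CELL_SIZE_M = 250.0  # assumed cell side in metres


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to a (column, row) grid cell."""
    return int(x_m // cell_size), int(y_m // cell_size)


def poi_density(points, cell_size=CELL_SIZE_M):
    """Return POIs per square kilometre for every cell holding at least one POI."""
    counts = Counter(cell_index(x, y, cell_size) for x, y in points)
    cell_area_km2 = (cell_size / 1000.0) ** 2
    return {cell: n / cell_area_km2 for cell, n in counts.items()}


# Hypothetical POIs, already projected to metres from the grid origin.
pois = [(10.0, 20.0), (240.0, 249.0), (260.0, 30.0), (700.0, 700.0)]
density = poi_density(pois)
```

The same binning step works at district level if the cell lookup is swapped for a point-in-polygon test against the district boundaries.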

Pre-processing of data
Telecom data
Footprint/temporal signature for each cell/district
(average activity over all 60 days, distinguishing
between weekdays and weekend days)
Temporal Data compression (pre-processing large scale time series to get a more
manageable compressed representation)
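
The footprint described above can be sketched as follows, for one cell and one activity type. This is an illustration under assumptions: the input layout (one 144-value list per consecutive day), the function name `footprint` and the start date are hypothetical, and the sketch assumes the data contains at least one weekday and one weekend day.

```python
from datetime import date, timedelta

SLOTS_PER_DAY = 144  # one value every 10 minutes


def footprint(daily_series, first_day):
    """Average the daily 144-slot series separately for weekdays and weekend days.

    daily_series: one 144-value list per consecutive day, starting at first_day.
    Returns (weekday_profile, weekend_profile), each a list of 144 averages.
    """
    sums = {False: [0.0] * SLOTS_PER_DAY, True: [0.0] * SLOTS_PER_DAY}
    days = {False: 0, True: 0}
    for offset, series in enumerate(daily_series):
        # weekday() returns 5 for Saturday and 6 for Sunday
        is_weekend = (first_day + timedelta(days=offset)).weekday() >= 5
        days[is_weekend] += 1
        for slot, value in enumerate(series):
            sums[is_weekend][slot] += value
    weekday_profile = [s / days[False] for s in sums[False]]
    weekend_profile = [s / days[True] for s in sums[True]]
    return weekday_profile, weekend_profile


# Synthetic data: flat activity of 1.0 on weekdays and 3.0 on weekend days.
start = date(2013, 11, 1)  # assumed first recording day
demo = []
for offset in range(14):
    weekend_day = (start + timedelta(days=offset)).weekday() >= 5
    demo.append([3.0 if weekend_day else 1.0] * SLOTS_PER_DAY)
week_profile, weekend_profile = footprint(demo, start)
```

Two months of 10-minute samples per activity collapse into two 144-value profiles, which is the compression the slide refers to.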

Data exploration experiments
To discover possibly “natural” connections between heterogeneous datasets.
Three-step process
1) Correlation Analysis
2) Regression Analysis
3) Clustering Analysis
All analyses were performed at both district and cell level.
[Diagram label: complexity of data]

1) Correlation analysis
Try to identify possible correspondences between different datasets.
Measure whether and how two variables change together using a correlation index -> Pearson’s
correlation coefficient
-1 ≤ r ≤ 1 (r > 0: positive correlation; r < 0: negative correlation)
Pairwise comparisons between 1-dimensional vectors:
• POIs: density
• Population: density
• Telecom: first principal component, which explains 90% of the variability
• Land use data: only residential and agricultural land use, each used separately, expressed as the
percentage of the district/cell area they cover
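
A minimal sketch of one pairwise comparison: Pearson's r between two 1-dimensional vectors holding one value per district or cell. The function name and the sample values are hypothetical and chosen only to show a negative and a positive correlation.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five districts.
population = [1.0, 2.0, 3.0, 4.0, 5.0]
agricultural = [10.0, 8.0, 6.0, 4.0, 2.0]   # falls as population rises
residential = [2.0, 4.0, 5.0, 4.0, 5.0]     # tends to rise with population

r_agric = pearson_r(population, agricultural)
r_resid = pearson_r(population, residential)
```

Running the function over every pair of vectors yields the kind of correlation matrix discussed on the following slides.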

Correlation analysis - district level
• Correlations do exist between
• Telecom and residential land use
• Telecom and POIs
The data fit quasi-linear models.
[Figure: pairwise correlations at district level; variables: tlc, resid, agric, POI mun, POI OSM, pop]
• Negative correlation between agricultural
land use and other datasets -> human action
inversely related to agricultural areas.

Correlation analysis - cell level
• All coefficients are lower than at district level
• The highest values are again between Telecom and
residential land use and POIs
=> the choice of resolution level can have a
significant impact on the correlation results.
[Figure: pairwise correlations at cell level; variables: tlc, resid, agric, POI mun, POI OSM, pop]
• Some phenomena causing the correlation are
independent of the resolution level (0.76
residential-population).

Correlation analysis - phone calls and population
• Could the correlation change during the
day according to the everyday human
behaviour pattern (get up, go to work,
come back home in the evening)?
• Call activity at 6 different times of day
• Week and weekend profiles are different
-> mirroring people’s different habits
• Average correlation is higher at the weekend
(phone activity related to the actual
presence of people at home)
• Weekday profile -> human behaviour
pattern
[Figure panels: DISTRICT, CELL; series: WEEK, WEEKEND]

2) Regression analysis
Fit multiple linear regression models (MLR)
(To take into consideration a larger portion of urban data complexity)
MLR models
• Telecom: one 1-d vector for each communication type and for week/weekend days, holding the average activity
• POIs: one 1-d vector for each POI category, holding the POI density
• Demographics: one 1-d vector holding the population density
• Land use: one 1-d vector for each land use, holding the percentage of land covered (only 5 main land use types are used)
[Diagram labels: Cheap-to-produce, Expensive-to-maintain]
Regression analysis - results
• The more predictors are selected (increasing
k), the larger the share of the outcome
variable’s variance that is explained (increasing
adjusted R²).
• The strength of the correlation decreases
from district level (coarse-grained) to
cell level (fine-grained)
• The benefit of adding a higher number of
predictors is weaker at cell level
• Population, dense residential and
agricultural areas show a stronger
correlation
[Figure labels: manual stepwise model selection; AIC criterion]
Heterogeneous datasets provide
comparable or even similar pictures of
the urban environment.
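
For reference, the two model-quality measures named on this slide have standard definitions for a least-squares fit with n observations and k predictors. RSS (residual sum of squares), TSS (total sum of squares) and p (number of estimated coefficients, here k + 1 with the intercept) are notation introduced for this note. The AIC form assumes Gaussian errors and drops an additive constant that does not affect model comparison.

```latex
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}, \qquad
\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p
```

Adding a predictor never lowers R², so both adjusted R² and AIC penalise model size; a lower AIC indicates a better trade-off.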

Regression analysis – significant predictors
Significant predictors of the models selected with manual backward elimination

3) Clustering analysis
Exploit the whole city information available (n-dimensional datasets), avoiding data compression.
Clustering technique to understand if diverse data naturally group together throughout the urban
space.
Cluster each dataset and compare the clustering obtained for each pair of datasets.
k-means algorithm with 5 classes (Silhouette coefficient) applied to:
• CORINE: for each district (NIL)/cell, one vector with the percentage of area belonging to each of the 12 categories
• Telecom: the whole footprint for each cell/district (a vector of 1440 values)
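
A minimal sketch of the clustering step: plain k-means followed by the mean silhouette coefficient, which is commonly compared across candidate values of k (presumably how the 5 classes were chosen here). The toy data uses two obvious groups rather than 5 classes; the seeding strategy and function names are assumptions, not the authors' setup.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (closer to 1 is better)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            others = [dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(others) / len(others)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(v for c, v in mean_dist.items() if c != own)    # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy vectors: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = kmeans(data, k=2)
score = silhouette(data, labels)
```

For the datasets on this slide, each point would be a 12-value CORINE vector or a 1440-value Telecom footprint instead of a 2-d toy point.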

Clustering analysis - Pairwise clusterings comparisons
Pairwise clustering comparisons using the Rand Index and Overall Accuracy to evaluate the correspondence between datasets
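
A minimal sketch of the two comparison measures, assuming the index meant here is the Rand Index. How cluster labels were matched before computing Overall Accuracy is not stated, so the sketch takes the best agreement over all relabelings of the second clustering; that matching rule, the function names and the sample labels are assumptions.

```python
from itertools import combinations, permutations


def rand_index(labels_a, labels_b):
    """Share of item pairs on which two clusterings agree (together or apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs
    )
    return agree / len(pairs)


def overall_accuracy(labels_a, labels_b, k):
    """Best share of identically labelled items over all relabelings of clustering B.

    Labels must be integers in range(k); feasible for small k (5 classes = 120 relabelings).
    """
    best = 0
    for perm in permutations(range(k)):
        hits = sum(a == perm[b] for a, b in zip(labels_a, labels_b))
        best = max(best, hits)
    return best / len(labels_a)


# Hypothetical cluster labels for six districts.
corine_labels = [0, 0, 1, 1, 2, 2]
telecom_labels = [1, 1, 0, 0, 2, 0]

ri = rand_index(corine_labels, telecom_labels)
oa = overall_accuracy(corine_labels, telecom_labels, k=3)
```

The Rand Index does not depend on how clusters are numbered, whereas Overall Accuracy needs the label matching step.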

Clustering analysis - Qualitative analysis at district level
Districts classified in the same way in the Telecom and CORINE datasets
[Figure panels: CORINE and whole activity; CORINE and SMS in-out activity; CORINE and Internet activity]

Conclusions and future work
• Evaluation of the correlations between datasets at different levels
• different spatial resolution (coarse-grained district level vs. fine-grained cell level)
• different data complexity (very compressed information vs multi-dimensional data).
• Correlations between different sources exist and their strength depends on both
spatial resolution and data complexity
• Diverse urban datasets can “date each other”, but their actual “affinity” can
vary.
What is coming next?
• Extending our investigation towards a predictive approach (statistical and machine
learning techniques).
