Prediction of expensive datasets starting from a set of cheap heterogeneous information sources in smart city scenarios.
Prediction of the population and land use of Milano starting from data about Points Of Interest and phone activity.
Smart Urban Planning Support through Web Data Science on Open and Enterprise Data
1. Smart Urban Planning Support through Web Data Science on Open and Enterprise Data
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano
The 24th International World Wide Web Conference
Florence, Italy
18 – 22 May 2015
Web Data Science meets Smart Cities
19th May 2015
2. Digital information about cities
• Large number of data sources available on the web (Open data):
• Urban planning (land cover, public registers)
• Demographics and statistics about municipality
• User generated information:
• Volunteered geographic information and crowdsourcing information (Open Street Map)
• Location-based social networks (Foursquare check-ins and geolocated information)
• Closed data sources produced and maintained by enterprises:
• Phone activity data
The cost of data management (collection, cleansing, maintenance) varies widely with the origin of the data.
3. Research goal
Long term goal:
• Can we predict (generate or update) a costly dataset from a set of cheap information sources?
[Diagram: Cheap datasets -> predict or update -> Expensive datasets]
4. Our case study
• Data collection
• Available datasets about Milano
• Problem of spatial granularities and pre-processing of the datasets
• Data processing
• Definition of input/output
• Predictive analysis
• Statistical learning
• Machine learning
• Results evaluation
5. Milano datasets
Demographics:
• population density
• Spatial resolution: census area
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenities, shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official) and OpenStreetMap (user generated)
6. Milano datasets
Land use cover:
• type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined)
• CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html#
• 5 types selected (those that best characterize a metropolitan area such as Milan):
1. Residential
2. Agricultural
3. Commercial/industrial
4. Parks and green areas
5. Sport centres
• Spatial resolution: building level
• Source: Lombardy region open data
7. Milano datasets
Call data records:
• 5 phone activities
• Incoming SMS
• Outgoing SMS
• Incoming CALL
• Outgoing CALL
• Internet
• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)
• Summarizing structure: a footprint for each cell (average activity over all the days, distinguishing between weekdays and weekend days)
• Spatial resolution: grid of 3538 square cells of 250m
• Source: Telecom Italia – provided for their Big Data Challenge
http://theodi.fbk.eu/openbigdata/
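The footprint described above can be sketched as follows. This is only an illustration: the record layout (a list of `(timestamp, value)` pairs for one cell and one activity) and the function name `footprint` are assumptions, not the format of the Telecom Italia dataset. Slots with no record on an observed day are counted as zero activity.

```python
from datetime import datetime

SLOTS_PER_DAY = 144  # one value every 10 minutes, as stated in the slides


def footprint(records):
    """Average activity per 10-minute slot, split into weekdays and weekend days.

    `records` is an iterable of (datetime, value) pairs for ONE cell and ONE
    activity type (hypothetical layout). Returns a dict with two lists of
    144 averages, keyed "week" and "weekend".
    """
    sums = {"week": [0.0] * SLOTS_PER_DAY, "weekend": [0.0] * SLOTS_PER_DAY}
    days = {"week": set(), "weekend": set()}
    for ts, value in records:
        kind = "weekend" if ts.weekday() >= 5 else "week"  # 5 = Sat, 6 = Sun
        slot = (ts.hour * 60 + ts.minute) // 10
        sums[kind][slot] += value
        days[kind].add(ts.date())
    # Divide each slot total by the number of distinct days observed.
    return {
        kind: [s / len(days[kind]) if days[kind] else 0.0 for s in sums[kind]]
        for kind in sums
    }
```

Averaging the 144 values of each list would then give one mean per activity and day type, which is one plausible reading of the per-activity means used later as predictors.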
8. Pre-processing
Make the spatial resolution uniform so that the datasets are comparable.
Spatial resolution used: grid of 3538 square cells of 250m
Layers overlaid and intersected using QGIS.
New datasets generated:
• Presence/absence of POIs in each cell
• Weighted sum of population density in each cell
• Percentage shares of each land use over each cell area
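The slides perform this step in QGIS; the sketch below only illustrates the idea in plain Python. It assumes projected coordinates in metres, a grid origin at `(x0, y0)`, and that the "weighted sum" uses the share of the cell area covered by each census area as weight. All function names and the input layouts are hypothetical.

```python
import math

CELL_SIZE_M = 250.0  # cell side, as stated in the slides
CELL_AREA_M2 = CELL_SIZE_M ** 2


def cell_of(x, y, x0=0.0, y0=0.0):
    """Return the (column, row) of the grid cell containing point (x, y)."""
    return (math.floor((x - x0) / CELL_SIZE_M), math.floor((y - y0) / CELL_SIZE_M))


def poi_presence(pois):
    """Map each cell to the set of POI categories present in it.

    `pois` is an iterable of (category, x, y) tuples (hypothetical layout).
    """
    presence = {}
    for category, x, y in pois:
        presence.setdefault(cell_of(x, y), set()).add(category)
    return presence


def weighted_density(overlaps):
    """Area-weighted population density of one cell.

    `overlaps` lists (census_area_density, overlap_area_m2) for every census
    area intersecting the cell; the weight is overlap area / cell area
    (an assumption about how the weighted sum was defined).
    """
    return sum(density * area / CELL_AREA_M2 for density, area in overlaps)
```

For example, a cell split evenly between two census areas with densities 100 and 200 gets a density of 150 under this weighting.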
9. Selection of input/output variables
INPUT
Telecom data
• means of each phone activity (10 values)
• hour-by-hour means of all the activities (24 values)
POIs
• School
• Transport
• Shop
• Food
• Sport
• ...
Predictive models (regression)
OUTPUT
Land use density:
• Residential
• Agricultural
• Commercial
• Green area
• Sport facilities
Population density
10. Aims of the experiments
1. Comparing different regression algorithms
1. Statistical Learning approach -> Multiple Linear Regression (MLR)
2. Machine Learning approach -> Random Forest (RF)
2. Evaluating how the number of predictors impacts model performance
1. All the predictors
2. Manual selection of a subset of predictors
3. Automatic selection of predictors by AIC (Akaike information criterion)
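For reference, the standard definition of the criterion used in the automatic selection is given below, together with its usual form for least-squares regression with Gaussian errors. The symbols are assumptions introduced here, not notation from the slides: k is the number of estimated parameters, \hat{L} the maximized likelihood, n the number of observations (cells), RSS the residual sum of squares, and C a constant that does not depend on the model. Among candidate sets of predictors, the one with the lowest AIC is preferred.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad\text{and, for least-squares regression,}\qquad
\mathrm{AIC} = n\,\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k + C
```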
11. Tests performed
5 tests combining the different algorithms and inputs
        All predictors    Manual selection    AIC selection
RF      x                 x                   -
MLR     x                 x                   x
12. Methodology of the experiments
• Dividing the dataset into training (90%) and test (10%) sets
• Training the model using 10-fold cross-validation to avoid overfitting
• Calculating the adjusted R^2 index to measure the goodness of the model (percentage of variance explained)
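The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names, the random seed, and the way folds are formed are all hypothetical. The adjusted R^2 uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors.

```python
import random


def train_test_split(n, test_share=0.10, seed=0):
    """Shuffle the indices 0..n-1 and return (train_indices, test_indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_share)
    return idx[n_test:], idx[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 of predictions made by a model with `n_predictors` inputs."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
```

With the 3538 grid cells of the case study, `train_test_split(3538)` would hold out 354 cells for testing and leave 3184 for the cross-validated training.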
13. Results
1) Different output results: some variables are predicted better than others
2) Models comparison: RF always equals or outperforms MLR (the data does not follow a linear distribution but a more complex one)
3) Number of predictors: RF-manual selection is usually better than RF-all, and MLR-AIC selection is better than the other MLR models. The higher the number of variables included in the model, the higher the risk of overfitting (larger difference between the R^2 of the training and test sets)
[Figure panels: MLR-all, MLR-manual selection, MLR-AIC selection, RF-all, RF-manual selection]
14. Predictors importance calculated by RF-all
Adj R-square    RF - all            RF - manual selection
                Train    Test       Train    Test
population      0.668    0.623      0.604    0.591
residential     0.633    0.588      0.623    0.614
• population: worse results in RF-manual selection
• residential: better results in RF-manual selection
[Figure: predictor importance calculated by RF-all, annotated "7 vars in the top 10 out of the manually selected" and "2 vars in the top 10 out of the manually selected"]
Variable selection is an essential step in optimizing a predictive model
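The adjusted R^2 values reported for RF-all and RF-manual selection can be compared with a few lines of Python. The sketch below computes the train/test gap, used here as a rough indicator of overfitting; the dictionary layout and variable names are illustrative only, while the numbers are those reported on the slide.

```python
# Adjusted R^2 as (train, test), copied from the slide.
ADJ_R2 = {
    ("population", "RF-all"): (0.668, 0.623),
    ("population", "RF-manual"): (0.604, 0.591),
    ("residential", "RF-all"): (0.633, 0.588),
    ("residential", "RF-manual"): (0.623, 0.614),
}

# Train/test gap per (output, model); a larger gap hints at more overfitting.
gaps = {key: round(train - test, 3) for key, (train, test) in ADJ_R2.items()}

# Change in test score when moving from RF-all to RF-manual selection.
test_change = {
    output: round(ADJ_R2[(output, "RF-manual")][1] - ADJ_R2[(output, "RF-all")][1], 3)
    for output in ("population", "residential")
}

for key, gap in gaps.items():
    print(key, "train-test gap:", gap)
print("test score change with manual selection:", test_change)
```

For both outputs the gap shrinks with manual selection, while the test score drops for population and rises for residential, matching the "worse" and "better" labels on the slide.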
15. Conclusions
• Encouraging results in employing open and enterprise datasets in
regression models
• Good results in predicting population, residential and agricultural
areas -> explained variability reaching 62%
• There is a relation between land use/population and the diverse and heterogeneous datasets used as predictors (POIs and phone activity)
• Choosing the best predictors is an "art". A lot of relevant data is available about cities. A preprocessing phase is essential to select only the most informative and discriminative variables.
16. Future work
• Improvements on input variables: preprocessing predictors to extract more discriminative information from the data (changing the POIs data from presence/absence to distance from the closest POI)
• Improvements on output variables: definition of new outputs that are easier to predict experimentally (dense residential, sparse residential, agricultural, industrial/commercial, parks and natural areas). Problems in predicting specific land uses (parks, sport centres) -> other kinds of input data may be required.
• Improvements on predictive algorithms: better results using Support Vector Machine (SVM) -> the urban environment is so complex that it cannot be modelled using linear models
• Reproducibility of our solution on different scenarios: comparable results obtained on other European cities (Barcelona, Muenchen and Brussels) -> the methodology proposed is successful.
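The first item above, replacing POI presence/absence with the distance from the closest POI, could look like the sketch below. It assumes projected coordinates in metres and measures from the cell centre; both choices, and the function name, are assumptions made for illustration.

```python
import math


def distance_to_closest(cell_center, pois):
    """Euclidean distance in metres from a cell centre to the nearest POI.

    `cell_center` is an (x, y) pair and `pois` an iterable of (x, y) pairs
    for one POI category. Returns math.inf when the category has no POIs.
    """
    cx, cy = cell_center
    return min((math.hypot(px - cx, py - cy) for px, py in pois), default=math.inf)
```

Unlike a presence/absence flag, this predictor still carries information for cells that contain no POI of a given category.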
17. Thank you! Any questions?
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano