Smart Urban Planning Support through Web Data Science on Open and Enterprise Data
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano
The 24th International World Wide Web Conference
Florence, Italy
18 – 22 May 2015
Web Data Science meets Smart Cities
19th May 2015
Digital information about cities
• Large number of data sources available on the web (open data):
  • Urban planning (land cover, public registers)
  • Demographics and statistics about the municipality
• User-generated information:
  • Volunteered geographic information and crowdsourced information (OpenStreetMap)
  • Location-based social networks (Foursquare check-ins and geolocated information)
• Closed data sources produced and maintained by enterprises:
  • Phone activity data
The cost of data management (collection, cleansing, maintenance) varies widely across these data origins.
Research goal
Long term goal:
• Can we predict (generate or update) a costly dataset from a set of
cheap information sources?
Cheap datasets -> predict or update -> Expensive datasets
Our case study
• Data collection
  • Available datasets about Milano
  • Problem of spatial granularities and pre-processing of the datasets
• Data processing
  • Definition of input/output
• Predictive analysis
  • Statistical learning
  • Machine learning
• Results evaluation
Milano datasets
Demographics:
• Population density
• Spatial resolution: census area
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenities, shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official) and OpenStreetMap (user generated)
Milano datasets
Land use cover:
• Type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined)
• CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html#
• 5 types selected (those that best characterize a metropolitan area such as Milan):
1. Residential
2. Agricultural
3. Commercial/industrial
4. Parks and green areas
5. Sport centres
• Spatial resolution: building level
• Source: Lombardy region open data
Milano datasets
Call data records:
• 5 phone activities:
  • Incoming SMS
  • Outgoing SMS
  • Incoming calls
  • Outgoing calls
  • Internet
• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)
• Summarizing structure: a footprint for each cell (average activity over all the days, distinguishing between weekdays and weekend days)
• Spatial resolution: grid of 3538 square cells of 250 m
• Source: Telecom Italia – provided for their Big Data Challenge
http://theodi.fbk.eu/openbigdata/
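The per-cell footprint described above can be sketched in a few lines. This is only a minimal illustration, assuming the records for one cell and one phone activity arrive as (timestamp, value) pairs; the function name `footprint` and the record layout are hypothetical and are not taken from the Telecom Italia dataset format.

```python
from collections import defaultdict
from datetime import datetime

SLOTS_PER_DAY = 24 * 60 // 10  # one value every 10 minutes -> 144 per day


def footprint(records):
    """Average activity per 10-minute slot, split into week and weekend days.

    `records` is an iterable of (datetime, value) pairs for a single grid
    cell and a single phone activity (hypothetical layout).
    """
    sums = {"week": defaultdict(float), "weekend": defaultdict(float)}
    counts = {"week": defaultdict(int), "weekend": defaultdict(int)}
    for ts, value in records:
        kind = "weekend" if ts.weekday() >= 5 else "week"  # 5 = Sat, 6 = Sun
        slot = (ts.hour * 60 + ts.minute) // 10  # index of the 10-minute slot
        sums[kind][slot] += value
        counts[kind][slot] += 1
    # Slots with no observations are reported as 0.0 in this sketch.
    return {
        kind: [
            sums[kind][s] / counts[kind][s] if counts[kind][s] else 0.0
            for s in range(SLOTS_PER_DAY)
        ]
        for kind in sums
    }


if __name__ == "__main__":
    demo = [
        (datetime(2013, 11, 4, 0, 0), 2.0),   # Monday
        (datetime(2013, 11, 5, 0, 5), 4.0),   # Tuesday, same 10-minute slot
        (datetime(2013, 11, 9, 0, 10), 6.0),  # Saturday
    ]
    fp = footprint(demo)
    print(len(fp["week"]), fp["week"][0], fp["weekend"][1])
```

Each cell then carries two curves of 144 values per activity, one for weekdays and one for weekend days.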
Pre-processing
Make the spatial resolution uniform so that the datasets are comparable.
Spatial resolution used: grid of 3538 square cells of 250 m
Overlaying and intersecting layers using QGIS software.
New datasets generated:
• Presence/absence of POIs in each cell
• Weighted sum of population density in each cell
• Percentage shares of each land use over each cell area
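The slides state that the overlay was done in QGIS. The sketch below only illustrates the arithmetic behind an area-weighted sum and a percentage share, under a strong simplifying assumption: every geometry is an axis-aligned rectangle `(xmin, ymin, xmax, ymax)` in a metric coordinate system, whereas real census areas and buildings are arbitrary polygons. All function names are hypothetical.

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned rectangles (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def cell_density(cell, zones):
    """Area-weighted sum of zone densities over one grid cell.

    `zones` is a list of (rectangle, density) pairs; each density is
    weighted by the fraction of the cell area that the zone covers.
    """
    cell_area = overlap_area(cell, cell)
    return sum(overlap_area(cell, rect) * d for rect, d in zones) / cell_area


def land_use_shares(cell, parcels):
    """Percentage of the cell area covered by each land-use type.

    `parcels` is a list of (rectangle, land_use_label) pairs.
    """
    cell_area = overlap_area(cell, cell)
    shares = {}
    for rect, land_use in parcels:
        share = 100.0 * overlap_area(cell, rect) / cell_area
        shares[land_use] = shares.get(land_use, 0.0) + share
    return shares


if __name__ == "__main__":
    cell = (0, 0, 250, 250)  # one 250 m x 250 m grid cell
    zones = [((-100, 0, 125, 250), 1000.0), ((125, 0, 500, 250), 3000.0)]
    parcels = [((0, 0, 250, 125), "residential"), ((0, 125, 125, 250), "parks")]
    print(cell_density(cell, zones))
    print(land_use_shares(cell, parcels))
```

In the demo each census zone covers half of the cell, so the cell density is the mean of the two zone densities.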
Selection of input/output variables
INPUT
Telecom data:
• Means of each phone activity (10 values)
• Hour-by-hour means of all the activities (24 values)
POIs:
• School
• Transport
• Shop
• Food
• Sport
• ...
Predictive models (regression)
OUTPUT
Land use density:
• Residential
• Agricultural
• Commercial
• Green area
• Sport facilities
Population density
Aims of the experiments
1. Comparing different regression algorithms
  1. Statistical learning approach -> Multiple Linear Regression (MLR)
  2. Machine learning approach -> Random Forest (RF)
2. Evaluating how the number of predictors impacts model performance
  1. All the predictors
  2. Manual selection of a subset of predictors
  3. Automatic selection of predictors by AIC (Akaike information criterion)
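To make the AIC criterion concrete, here is a minimal sketch of how it trades goodness of fit against the number of parameters for a least-squares model. It assumes Gaussian errors and uses the common form n·ln(RSS/n) + 2k, which equals the AIC up to an additive constant; the slides do not say which search procedure (for example stepwise selection) was used, so only the scoring step is shown and the function name is hypothetical.

```python
import math


def aic_least_squares(rss, n_obs, n_params):
    """AIC (up to an additive constant) of a least-squares fit.

    rss      -- residual sum of squares of the fitted model
    n_obs    -- number of observations
    n_params -- number of estimated parameters (coefficients + intercept)
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


if __name__ == "__main__":
    # Two hypothetical candidate models fitted on the same 100 observations:
    small = aic_least_squares(rss=100.0, n_obs=100, n_params=3)
    large = aic_least_squares(rss=99.9, n_obs=100, n_params=10)
    # The larger model barely reduces the RSS, so its AIC is higher (worse).
    print(small, large)
```

A lower AIC is preferred, so extra predictors are kept only if they reduce the residual error enough to offset the 2k penalty.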
Tests performed
5 tests combining the different algorithms and inputs:
       All predictors   Manual selection   AIC selection
RF     x                x
MLR    x                x                  x
Methodology of the experiments
• Dividing the dataset into training (90%) and test (10%) sets
• Training the model using 10-fold cross-validation to limit overfitting
• Calculating the adjusted R^2 index to measure the goodness of fit of the model (percentage of variance explained)
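The three steps above can be sketched with the standard library alone. This is an illustrative outline, not the authors' pipeline: the helper names, the random seed and the way folds are built are assumptions. The adjusted R^2 uses the usual definition 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors.

```python
import random


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2: penalises R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def train_test_split(indices, test_fraction=0.1, seed=0):
    """Shuffle the indices and hold out `test_fraction` of them for testing."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


if __name__ == "__main__":
    train, test = train_test_split(range(100))  # 90% / 10%
    for fold_train, fold_val in k_folds(train, k=10):
        pass  # fit on fold_train, validate on fold_val
    print(len(train), len(test), adjusted_r2(0.5, 101, 50))
```

The held-out 10% is never touched during cross-validation, which is what makes the train/test comparison in the results meaningful.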
Results
1) Different output results: some variables are predicted better than others
2) Model comparison: RF always equals or outperforms MLR (the data do not follow a linear relationship but a more complex one)
3) Number of predictors: RF-manual selection is usually better than RF-all, and MLR-AIC selection is better than the other MLR models. The higher the number of variables included in the model, the greater the risk of overfitting (larger difference between the R^2 of the training and test sets)
[Figure panels: MLR – all, MLR – manual selection, MLR – AIC selection, RF – all, RF – manual selection]
Adj R-square    RF - all          RF - manual selection
                Train   Test      Train   Test
population      0.668   0.623     0.604   0.591
residential     0.633   0.588     0.623   0.614
worse results in RF-manual selection
Predictors importance calculated by RF-all
7 of the manually selected vars are in the top 10
2 of the manually selected vars are in the top 10
Variable selection is an essential step in optimizing a predictive model
better results in RF-manual selection
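The overfitting argument rests on the difference between the training and test adjusted R^2. As a small worked check, the snippet below recomputes that gap from the four RF rows reported in the table above; the values are copied from the slides, and only the dictionary layout is an illustrative choice.

```python
# Adjusted R^2 values copied from the RF table above: (train, test).
scores = {
    ("population", "RF - all"): (0.668, 0.623),
    ("population", "RF - manual selection"): (0.604, 0.591),
    ("residential", "RF - all"): (0.633, 0.588),
    ("residential", "RF - manual selection"): (0.623, 0.614),
}

# A larger train-test gap suggests more overfitting.
gaps = {key: round(train - test, 3) for key, (train, test) in scores.items()}

for (output, model), gap in gaps.items():
    print(f"{output:12s} {model:22s} train-test gap = {gap:.3f}")
```

For both outputs the gap is smaller with manual selection than with all predictors, which is consistent with the claim that more variables increase the risk of overfitting.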
Conclusions
• Encouraging results in employing open and enterprise datasets in regression models
• Good results in predicting population, residential and agricultural areas -> explained variability reaching 62%
• There is a relation between land use/population and the diverse and heterogeneous datasets used as predictors (POIs and phone activity)
• Choosing the best predictors is an "art". A lot of relevant data is available about cities. A preprocessing phase is essential to select only the most informative and discriminative variables.
Future work
• Improvements on input variables: preprocessing predictors to extract more discriminative information from the data (changing the POI data from presence/absence to distance from the closest POI)
• Improvements on output variables: definition of new outputs that are easier to predict experimentally (dense residential, sparse residential, agricultural, industrial/commercial, parks and natural areas). Problems in predicting specific land uses (parks, sport centres) -> other kinds of input data may be required.
• Improvements on predictive algorithms: better results using Support Vector Machine (SVM) -> the urban environment is so complex that it cannot be modelled using linear models
• Reproducibility of our solution in different scenarios: comparable results obtained on other European cities (Barcelona, Munich and Brussels) -> the proposed methodology is successful.
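The first item above proposes replacing the presence/absence POI feature with the distance to the closest POI. A minimal sketch of that feature follows, assuming cell centres and POIs are given as (latitude, longitude) pairs and using the haversine great-circle distance with a mean Earth radius of 6371 km; the function names and the sample coordinates are illustrative only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_to_closest_poi(cell_centre, pois):
    """Distance in metres from a cell centre to its nearest POI."""
    lat, lon = cell_centre
    return min(haversine_m(lat, lon, p_lat, p_lon) for p_lat, p_lon in pois)


if __name__ == "__main__":
    centre = (45.4642, 9.1900)  # illustrative point in Milan
    schools = [(45.4700, 9.1900), (45.5000, 9.2500)]  # made-up POIs
    print(round(distance_to_closest_poi(centre, schools)))
```

Unlike a 0/1 flag, this feature still carries information for cells that contain no POI of a given type.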
Thank you! Any questions?
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano

Smart Urban Planning Support through Web Data Science on Open and Enterprise Data

  • 1. Smart Urban Planning Support through Web Data Science on Open and Enterprise Data Gloria Re Calegari and Irene Celino CEFRIEL – Politecnico di Milano 1 The 24th International World Wide Web Conference Florence, Italy 18 – 22 May 2015 Web Data Science meets Smart Cities 19th May 2015
  • 2. Digital information about cities • Large number of data sources available on the web (Open data): • Urban planning (land cover, public registers) • Demographics and statistics about municipality • User generated information: • Volunteered geographic information and crowdsourcing information (Open Street Map) • Location based social network (Foursquare check-ins and geo located information) • Close data sources produced and maintained by enterprises: • Phone activity data Cost of data management (collection, cleansing, maintenance) is highly variable with respect to the diverse data origins. 2
  • 3. Research goal Long term goal: • Can we predict (generate or update) a costly dataset from a set of cheap information sources? Cheap datasets Expensive datasets Predict or update 3
  • 4. Our case study • Data collection • Available datasets about Milano • Problem of spatial granularities and pre-processing of the datasets • Data processing • Definition of input/output • Predictive analysis • Statistical learning • Machine learning • Results evaluation 4
  • 5. Milano datasets Demographics: • population density • Spatial resolution: census area • Source: Milano open data Points of interest (POIs): • Trasports, schools, sports facilities, amenity places, shops ... • Spatial resolution: lat-long points • Source: Milano open data (official) and Open Street Map (user generated) 5
  • 6. Milano datasets Land use cover: • type of land use according to CORINE taxonomy (3-levels hierarchy, up to 40 types of land use defined) • CORINE taxonomy http://swa.cefriel.it/ontologies/corine.html# • 5 type selected (which better feature metropolitan area as Milan) 1. Residential 2. Agricultural 3. Commercial/industrial 4. Parks and green areas 5. Sport centres • Spatial resolution: building level • Source: Lombardy region open data 6
  • 7. Milano datasets Call data records: • 5 phone activities • Incoming SMS • Outcoming SMS • Incoming CALL • Outcoming CALL • Internet • Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013) • Summarizing structure: a footprint for each cell (average activity over all the days, distinguishing between week and weekend days) • Spatial resolution: grid of 3538 square cells of 250m • Source: Telecom Italia – provided for their Big Data Challenge http://theodi.fbk.eu/openbigdata/ 7
  • 8. Pre-processing Unify the spatial resolution in order to make the datasets comparable. Spatial resolution used: grid of 3538 square cells of 250 m. Layers overlapped and intersected using QGIS software. New datasets generated: • Presence/absence of POIs in each cell • Weighted sum of population density in each cell • Percentage share of each land use over each cell area
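The slides perform this step in QGIS. As a rough stand-alone illustration of the POI presence/absence layer only, the sketch below assigns points to 250 m cells. It assumes the POI coordinates have already been projected to metres and that cells are numbered row by row from the grid's lower-left corner; both are assumptions, and `poi_presence` is a hypothetical name.

```python
CELL_SIZE_M = 250.0  # cell side taken from the slides


def poi_presence(pois, x_min, y_min, n_cols, n_rows, cell_size=CELL_SIZE_M):
    """Map each grid cell to the set of POI categories present in it.

    `pois` is an iterable of (category, x, y) with x, y in metres
    (projected coordinates, an assumption). Cells are numbered row-major
    from the lower-left corner (x_min, y_min), also an assumption.
    """
    presence = {}
    for category, x, y in pois:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        # Ignore points that fall outside the grid.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            presence.setdefault(row * n_cols + col, set()).add(category)
    return presence
```

A cell absent from the result has no POI of any category; a binary predictor per category follows by testing membership in the cell's set.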
  • 9. Selection of input/output variables for the predictive models (regression). INPUT: Telecom data • means of each phone activity (10 values) • hour-by-hour means of all the activities (24 values); POIs • School • Transport • Shop • Food • Sport • ... OUTPUT: Land use density • Residential • Agricultural • Commercial • Green area • Sport facilities; Population density
  • 10. Aims of the experiments 1. Comparing different regression algorithms: (a) Statistical learning approach -> Multiple Linear Regression (MLR) (b) Machine learning approach -> Random Forest (RF) 2. Evaluating how the number of predictors affects model performance: (a) All the predictors (b) Manual selection of a subset of predictors (c) Automatic selection of predictors by AIC (Akaike information criterion)
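To make the AIC criterion concrete, here is a small sketch of how two linear models could be compared. The slides do not say which software or search procedure (for example, stepwise selection) was used, so this only shows the standard least-squares form of AIC with additive constants dropped; `aic_least_squares` is a hypothetical name.

```python
import math


def aic_least_squares(rss, n_samples, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to a constant.

    rss       -- residual sum of squares of the fitted model
    n_samples -- number of observations
    n_params  -- number of estimated parameters (coefficients + intercept)
    """
    return n_samples * math.log(rss / n_samples) + 2 * n_params


def prefer_by_aic(model_a, model_b, n_samples):
    """Return the (rss, n_params) pair with the lower AIC."""
    return min(model_a, model_b,
               key=lambda m: aic_least_squares(m[0], n_samples, m[1]))
```

The `2 * n_params` term is what penalizes larger predictor sets: an extra predictor is kept only if it lowers the residual error enough to offset the penalty.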
  • 11. Tests performed 5 tests combining the different algorithms and inputs: RF with all predictors and with manual selection; MLR with all predictors, with manual selection, and with AIC selection
  • 12. Methodology of the experiments • Dividing the dataset into training (90%) and test (10%) sets • Training the model using 10-fold cross-validation to avoid overfitting • Calculating the adjusted R^2 index to measure the goodness of fit of the model (percentage of variance explained)
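The evaluation steps above can be sketched in plain Python. The adjusted R^2 formula is the standard one; the split helper is only an illustration, since the slides do not state how the 90/10 partition was drawn, and all function names are hypothetical.

```python
import random


def split_indices(n_samples, test_fraction=0.1, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def r_squared(y_true, y_pred):
    """Share of the variance of y_true explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """R^2 corrected for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)
```

Because the correction grows with the number of predictors, adjusted R^2 allows a fairer comparison between the "all predictors" models and the smaller manually or AIC-selected ones.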
  • 13. Results 1) Different output results: some variables are predicted better than others 2) Model comparison: RF always equals or outperforms MLR (the relationship in the data is not linear but more complex) 3) Number of predictors: RF with manual selection is usually better than RF with all predictors, and MLR with AIC selection is better than the other MLR models. The higher the number of variables included in the model, the higher the risk of overfitting (larger difference between the R^2 of the training and test sets). Chart series: MLR-manual selection, MLR-all, MLR-AIC selection, RF-all, RF-manual selection. Adjusted R^2 (train / test): population: RF-all 0.668 / 0.623, RF-manual selection 0.604 / 0.591; residential: RF-all 0.633 / 0.588, RF-manual selection 0.623 / 0.614
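The overfitting remark on slide 13 can be checked directly against the four adjusted R^2 pairs reported there. The snippet below only recomputes the train/test gap from those published figures; the dictionary layout is an illustrative choice.

```python
# (train, test) adjusted R^2 values as reported on slide 13
adj_r2 = {
    ("population", "RF-all"): (0.668, 0.623),
    ("population", "RF-manual"): (0.604, 0.591),
    ("residential", "RF-all"): (0.633, 0.588),
    ("residential", "RF-manual"): (0.623, 0.614),
}

# Gap between training and test score; a larger gap suggests more overfitting.
gaps = {key: round(train - test, 3) for key, (train, test) in adj_r2.items()}

for (target, model), gap in gaps.items():
    print(f"{target:12s} {model:10s} gap = {gap:.3f}")
```

For both outputs the model with all predictors shows a gap of 0.045, while the manually selected model shows 0.013 (population) and 0.009 (residential), which is consistent with the claim that more predictors increase the risk of overfitting.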
  • 14. Predictor importance calculated by RF-all. (Figure: two importance rankings, one for an output with worse results in RF-manual selection and one for an output with better results in RF-manual selection; annotations: "7 vars in the top 10 out of the manually selected" and "2 vars in the top 10 out of the manually selected".) Variable selection is an essential step in optimizing a predictive model.
  • 15. Conclusions • Encouraging results in employing open and enterprise datasets in regression models • Good results in predicting population, residential and agricultural areas -> explained variability reaching 62% • There is a relation between land use/population and the diverse and heterogeneous datasets used as predictors (POIs and phone activity) • Choosing the best predictors is an "art". A lot of relevant data about cities is available; a preprocessing phase is essential to select only the most informative and discriminative variables.
  • 16. Future work • Improvements on input variables: preprocessing predictors to extract more discriminative information from the data (changing the POI data from presence/absence to distance from the closest POI) • Improvements on output variables: definition of new outputs that are easier to predict experimentally (dense residential, sparse residential, agricultural, industrial/commercial, parks and natural areas). Problems in predicting specific land uses (parks, sport centres) -> other kinds of input data may be required. • Improvements on predictive algorithms: better results using Support Vector Machines (SVM) -> the urban environment is so complex that it cannot be modelled using linear models • Reproducibility of our solution in different scenarios: comparable results obtained on other European cities (Barcelona, Munich and Brussels) -> the proposed methodology is successful.
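The first future-work item, replacing POI presence/absence with the distance to the closest POI, could look like the sketch below. It assumes projected coordinates in metres and uses the cell centre as the reference point; both choices, and the name `nearest_poi_distance`, are assumptions rather than details given in the slides.

```python
import math


def nearest_poi_distance(cell_centre, poi_points):
    """Euclidean distance from a cell centre to the closest POI.

    cell_centre -- (x, y) in metres (projected coordinates, an assumption)
    poi_points  -- iterable of (x, y) for the POIs of one category
    Returns math.inf when the category has no POI at all.
    """
    cx, cy = cell_centre
    return min(
        (math.hypot(px - cx, py - cy) for px, py in poi_points),
        default=math.inf,
    )
```

Unlike a binary flag, this predictor still distinguishes between cells that contain no POI of a category but lie at different distances from one.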
  • 17. Thank you! Any questions? Gloria Re Calegari and Irene Celino CEFRIEL – Politecnico di Milano