A Data Scientist Exploration in the World of
Heterogeneous Open Geospatial Data
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano
Como, July 17th 2015
Free and Open Source Software for Geospatial - FOSS4G Europe 2015
Digital information about cities
• Open data (large number of data sources available on the web):
• Urban planning (land cover, public registers)
• Demographics and statistics about the municipality
• Closed data sources produced and maintained by enterprises:
• Phone activity data -> but sometimes made open!
• User-generated information:
• Volunteered geographic information and crowdsourced information
(OpenStreetMap)
• Location-based social networks (Foursquare check-ins and geolocated information)
• Real-time and streaming information
• Sensors (e.g. temperature, energy consumption, ...)
Data exploration process and case study
Many datasets describe the urban environment from different
perspectives -> a great wealth for data scientists.
Managing, processing and comparing those data can be cumbersome ->
smarter solutions are required.
Data exploration of heterogeneous urban information sources related to
the city of Milano in Italy:
• Possible issues
• Best practices
• Data exploration through correlation analysis
(to understand whether diverse information sources give the same picture of a city)
Milano datasets
Demographics:
• Population density
• Spatial resolution: census area (6079 census areas,
median size 12,000 m²)
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenities,
shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official, 6718 POIs) and
OpenStreetMap (user-generated, 44351 POIs)
Milano datasets
Land use cover:
• Type of land use according to the CORINE
taxonomy (3-level hierarchy, up to 40 types of
land use defined)
• CORINE taxonomy
http://swa.cefriel.it/ontologies/corine.html#
• 2 types selected (those that best characterize a
metropolitan area such as Milan)
1. Residential
2. Agricultural
• Spatial resolution: building level
• Source: Lombardy region open data
Milano datasets
Call data records:
• 5 phone activities
• Incoming SMS
• Outgoing SMS
• Incoming CALL
• Outgoing CALL
• Internet
• Recorded every 10 minutes (144 values a day for
each activity) for 2 months (Nov-Dec 2013)
• Spatial resolution: grid of 3538 square cells with 250 m sides
• Source: Telecom Italia – provided for their Big Data
Challenge http://theodi.fbk.eu/openbigdata/
Challenges
• Varying spatial resolution of information sources (census area for
population, single points for POIs, ...)
• Different time frames (population census taken every 10 years, telecom data
recorded every 10 minutes)
• Reliability (to what extent the sources can be trusted; data from public
authorities versus crowdsourced data)
Best practices adopted
1) Data transformation, cleansing or normalization
(standard operation)
2) Making spatial resolution uniform
Spatial resolutions used:
• District level with 88 official subdivisions
• Grid level with 3538 square cells with 250 m sides
[Figure panels: Cells, Districts]
New datasets generated:
• Density of POIs in each cell/district
• Weighted sum of population density in each cell/district
• Percentage shares of each land use over each cell/district area
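The re-gridding step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes POI coordinates are already projected to metres, that the grid origin is at (0, 0), and that the overlap areas between census areas and a cell have been computed elsewhere (for example with a GIS library). It also reads "weighted sum" as an area-weighted average, which is an assumption. All function names and sample values are hypothetical.

```python
from collections import Counter

CELL_SIZE_M = 250.0  # side of a square grid cell, as stated in the slides


def cell_index(x_m, y_m, origin_x=0.0, origin_y=0.0, size=CELL_SIZE_M):
    """Return the (column, row) of the grid cell containing a projected point."""
    col = int((x_m - origin_x) // size)
    row = int((y_m - origin_y) // size)
    return col, row


def poi_density_per_cell(points, size=CELL_SIZE_M):
    """Count POIs per cell and divide by the cell area in km^2."""
    counts = Counter(cell_index(x, y, size=size) for x, y in points)
    area_km2 = (size / 1000.0) ** 2
    return {cell: n / area_km2 for cell, n in counts.items()}


def area_weighted_density(pieces):
    """Area-weighted mean of the census-area densities that overlap one cell.

    `pieces` is a list of (overlap_area_m2, density) pairs; computing the
    overlaps themselves is out of scope for this sketch.
    """
    total_area = sum(area for area, _ in pieces)
    return sum(area * dens for area, dens in pieces) / total_area


# Hypothetical POIs in projected metre coordinates (not real Milano data).
pois = [(10.0, 20.0), (240.0, 249.0), (260.0, 10.0), (600.0, 600.0)]
density = poi_density_per_cell(pois)
```

The same two helpers apply at district level if `cell_index` is swapped for a point-in-polygon lookup against the district boundaries.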
Best practices adopted
3) Data compression (pre-processing large-scale time series to get a
more manageable compressed representation)
Telecom data
Footprint/temporal signature for each
cell/district
(average activity over all 60 days, distinguishing
between weekdays and weekend days)
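One possible way to build such a footprint is to average each 10-minute slot across days, keeping weekdays and weekend days apart. The sketch below assumes the records for one cell or district arrive as (timestamp, value) pairs; the function name, the input layout and the sample values are hypothetical, not taken from the Telecom Italia dataset.

```python
from collections import defaultdict
from datetime import datetime

SLOTS_PER_DAY = 144  # one value every 10 minutes


def temporal_signature(records):
    """Average activity per 10-minute slot, split into weekday and weekend.

    `records` is an iterable of (datetime, value) pairs for one cell or
    district. Returns {"week": [...], "weekend": [...]}, each a list of 144
    averages (None for slots without observations).
    """
    sums = {"week": defaultdict(float), "weekend": defaultdict(float)}
    counts = {"week": defaultdict(int), "weekend": defaultdict(int)}
    for ts, value in records:
        kind = "weekend" if ts.weekday() >= 5 else "week"  # 5 = Sat, 6 = Sun
        slot = (ts.hour * 60 + ts.minute) // 10
        sums[kind][slot] += value
        counts[kind][slot] += 1
    return {
        kind: [
            sums[kind][s] / counts[kind][s] if counts[kind][s] else None
            for s in range(SLOTS_PER_DAY)
        ]
        for kind in sums
    }


# Hypothetical observations: a Friday, a Monday and a Saturday around 08:00.
records = [
    (datetime(2013, 11, 1, 8, 0), 10.0),
    (datetime(2013, 11, 4, 8, 5), 30.0),
    (datetime(2013, 11, 2, 8, 0), 4.0),
]
signature = temporal_signature(records)
```

Two months of raw values per activity collapse into two vectors of 144 numbers, which is the kind of compression the slide describes.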
Correlation analysis
Try to identify possible correspondences between different datasets.
Measure whether and how two variables change together using
correlation indexes -> Pearson’s correlation coefficient
-1 ≤ r ≤ 1 (r > 0: positive correlation; r < 0: negative correlation)
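For reference, Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. A plain-Python sketch follows; the function name is illustrative, and in practice a statistics library would normally be used instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / norm
```

Calling `pearson_r` on two per-district (or per-cell) vectors gives one entry of the pairwise comparison.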
Correlation analysis - datasets
Pairwise comparisons between 1-dimensional vectors:
• POIs municipality: density
• POIs OSM: density
• Population: density
• Telecom: first principal component, which explains 90% of the variability
• Land use data: residential and agricultural used separately, as the
percentage of each district/cell area that they cover
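Reducing each telecom signature to a single number can be sketched with a first-principal-component projection. The slides do not say how the component was computed; the version below uses power iteration on the covariance matrix purely as an illustration, and the function name and sample rows are hypothetical. It assumes the starting direction is not orthogonal to the leading component and that the data are not constant.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Project rows onto their first principal component via power iteration.

    Returns (scores, explained_ratio): scores[i] is the coordinate of rows[i]
    on the first component, explained_ratio is the share of total variance
    that the component captures.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the columns.
    cov = [
        [sum(c[j] * c[k] for c in centred) / (n - 1) for k in range(m)]
        for j in range(m)
    ]
    v = [1.0 / sqrt(m)] * m  # starting direction
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the leading eigenvalue for unit v.
    cov_v = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
    eigenvalue = sum(v[j] * cov_v[j] for j in range(m))
    total_variance = sum(cov[j][j] for j in range(m))
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centred]
    return scores, eigenvalue / total_variance


# Hypothetical signatures: 4 cells, 2 strongly related time slots.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
scores, explained = first_principal_component(rows)
```

The resulting `scores` vector has one value per cell or district, so it can be compared pairwise with the density and land-use vectors.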
Correlation analysis
at district level
• Correlations do exist between
• Telecom and residential
• Telecom and POIs
The data fit quasi-linear models.
[Figure: pairwise comparison at district level of tlc, resid, agric, POI mun, POI OSM, pop]
• Negative correlation between
agricultural land use and the other
datasets -> human activity is inversely
related to agricultural areas.
Correlation analysis
at cell level
• All coefficients are lower than at
district level
• The highest values are again between
Telecom and residential, and Telecom and POIs
=> the choice of resolution level can
have a significant impact on the
correlation results.
[Figure: pairwise comparison at cell level of tlc, resid, agric, POI mun, POI OSM, pop]
• Some phenomena behind the
correlation are independent of the
resolution level (0.76 between residential
land use and population).
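The effect of the resolution level can be reproduced with a toy example. The data below are synthetic and chosen by hand so that the within-district variation cancels out when cells are averaged; they are not the Milano datasets, and the helper names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient (compact form for this sketch)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def aggregate(values, district_of):
    """Average cell values within each district, ordered by district id."""
    groups = {}
    for value, district in zip(values, district_of):
        groups.setdefault(district, []).append(value)
    return [sum(g) / len(g) for _, g in sorted(groups.items())]


# Synthetic toy data: 4 districts of 4 cells each. Both variables share a
# district-level trend (2 * d) plus cell-level variation that is unrelated
# between the two variables and averages to zero inside every district.
district_of = [d for d in range(4) for _ in range(4)]
noise_x = [3, -3, 3, -3]
noise_y = [3, 3, -3, -3]
x_cells = [2 * d + noise_x[i % 4] for i, d in enumerate(district_of)]
y_cells = [2 * d + noise_y[i % 4] for i, d in enumerate(district_of)]

r_cell = pearson_r(x_cells, y_cells)
r_district = pearson_r(
    aggregate(x_cells, district_of), aggregate(y_cells, district_of)
)
```

Here `r_district` is higher than `r_cell` because averaging removes the cell-level variation. Real data will not cancel this cleanly, so the size of the gap in the toy example says nothing about the size of the gap in the Milano results.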
Correlation analysis: phone calls and population
• Could the correlation change during the
day according to the everyday human
behaviour pattern (get up, go to work,
come back home in the evening)?
• Call activity at 6 different times of day
• Weekday and weekend profiles differ
-> mirroring people's different habits
• Average correlation is higher at the weekend
(phone activity related to the actual
presence of people at home)
• Weekday profile -> follows the daily human
behaviour pattern
[Figure: correlation profiles by time of day; panels: DISTRICT, CELL; series: WEEK, WEEKEND]
Conclusions and future work
To sum up...
• Presented best practices for the data exploration process, applied to urban
datasets of Milano
• The approach was presented for an urban environment but can also be applied to
other environments
• Correlation between different sources exists and depends strongly on the resolution
level adopted
What is coming next?
• Extending our investigation towards a predictive approach
• Would it be possible to use one or more 'cheap' datasets (like open data) as a proxy
for more 'expensive' data sources?
• Exploratory analysis => statistical and machine learning techniques.
Predictive analysis (not in the paper)
• Support Vector Machine to
classify the CORINE classes using
the POIs as predictors.
• Accuracy > 83%
• Errors (black dots) occur on the
boundaries
=> promising results; we will continue in this
direction!
Thank you! Any questions?
