A Data Scientist Exploration in the World of
Heterogeneous Open Geospatial Data
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano
Como, July 17th 2015
Free and Open Source Software for Geospatial - FOSS4G Europe 2015
Digital information about cities
• Open data (large number of data sources available on the web):
• Urban planning (land cover, public registers)
• Demographics and statistics about the municipality
• Closed data sources produced and maintained by enterprises:
• Phone activity data -> but sometimes made open!
• User generated information:
• Volunteered geographic information and crowdsourced information
(OpenStreetMap)
• Location-based social networks (Foursquare check-ins and geolocated information)
• Real-time and streaming information
• Sensors (e.g. temperature, energy consumption, ...)
Data exploration process and case study
A lot of data can describe the urban environment from different
perspectives -> great wealth for data scientists.
Managing, processing and comparing those data can be cumbersome ->
smarter solutions are required.
Data exploration of heterogeneous urban information sources related to
the city of Milano in Italy:
• Possible issues
• Best practices
• Data exploration through correlation analysis
(to understand whether diverse information sources give the same picture of a city)
Milano datasets
Demographics:
• Population density
• Spatial resolution: census area (6079 areas,
median size 12,000 m²)
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenities,
shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official, 6718 POIs) and
OpenStreetMap (user generated, 44351 POIs)
Milano datasets
Land use cover:
• Type of land use according to the CORINE
taxonomy (3-level hierarchy, up to 40 types of
land use defined)
• CORINE taxonomy
http://swa.cefriel.it/ontologies/corine.html#
• 2 types selected (those that best characterize a
metropolitan area such as Milan)
1. Residential
2. Agricultural
• Spatial resolution: building level
• Source: Lombardy region open data
Milano datasets
Call data records:
• 5 phone activities
• Incoming SMS
• Outgoing SMS
• Incoming CALL
• Outgoing CALL
• Internet
• Recorded every 10 minutes (144 values a day for
each activity) for 2 months (Nov-Dec 2013)
• Spatial resolution: grid of 3538 square cells of 250 m
• Source: Telecom Italia – provided for their Big Data
Challenge http://theodi.fbk.eu/openbigdata/
Challenges
• Varying spatial resolution of information sources (census area for
population, single points for POIs, ...)
• Different time frames (population census done every 10 years, telecom data every
10 minutes)
• Reliability (to what extent the sources can be trusted; data from public
authorities or from crowdsourcing)
Best practices adopted
1) Data transformation, cleansing or normalization
(standard operation)
2) Making spatial resolution uniform
Spatial resolutions used:
• District level with 88 official subdivisions
• Grid level with 3538 square cells of 250 m
[Figure panels: Cells, Districts]
New datasets generated:
• Density of POIs in each cell/district
• Weighted sum of population density in each cell/district
• Percentage shares of each land use over each cell/district area
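The derived datasets listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes projected coordinates in metres, a regular grid anchored at a hypothetical origin, and area weights for the population density (the slides say only "weighted sum"). The function names and sample values are invented for the example.

```python
from collections import Counter

CELL_SIZE_M = 250.0  # cell side taken from the slides; coordinates assumed to be in metres


def cell_of(x, y, x0=0.0, y0=0.0, size=CELL_SIZE_M):
    """Return the (column, row) index of the grid cell containing point (x, y)."""
    return (int((x - x0) // size), int((y - y0) // size))


def poi_density(points, size=CELL_SIZE_M):
    """POIs per square kilometre for every cell that contains at least one POI."""
    counts = Counter(cell_of(x, y, size=size) for x, y in points)
    cell_area_km2 = (size / 1000.0) ** 2
    return {cell: n / cell_area_km2 for cell, n in counts.items()}


def weighted_population_density(overlaps):
    """Area-weighted mean population density for one cell (assumed weighting).

    `overlaps` is a list of (overlap_area_m2, density) pairs, one per census
    area intersecting the cell. Computing the overlap areas themselves is
    left to a GIS tool and is not shown here.
    """
    total_area = sum(area for area, _ in overlaps)
    return sum(area * density for area, density in overlaps) / total_area


# Hypothetical POI coordinates (metres from the grid origin).
pois = [(10.0, 20.0), (240.0, 100.0), (260.0, 30.0), (700.0, 700.0)]
density = poi_density(pois)
```

Here two POIs fall into cell (0, 0), so its density is 2 / 0.0625 km² = 32 POIs per km². The land use shares would follow the same pattern as the population function, dividing each overlap area by the cell area.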
Best practices adopted
3) Data compression (pre-processing large scale time series to get a
more manageable compressed representation)
Telecom data
Footprint/temporal signature for each
cell/district
(average activity over all 60 days, distinguishing
between weekdays and weekend days)
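A temporal signature of this kind could be computed as in the sketch below. It assumes the data for one cell and one activity are available as a mapping from calendar day to 144 values; that layout, the function name, and the synthetic values are assumptions made for illustration.

```python
from datetime import date, timedelta

SLOTS_PER_DAY = 144  # one value every 10 minutes, as in the slides


def temporal_signature(daily_series):
    """Average the daily profiles separately for weekdays and weekend days.

    `daily_series` maps a datetime.date to a list of SLOTS_PER_DAY values.
    Returns a dict with the keys "week" and "weekend".
    """
    groups = {"week": [], "weekend": []}
    for day, values in daily_series.items():
        key = "weekend" if day.weekday() >= 5 else "week"  # 5 = Saturday, 6 = Sunday
        groups[key].append(values)
    signature = {}
    for key, profiles in groups.items():
        # zip(*profiles) walks through the daily profiles slot by slot.
        signature[key] = [sum(slot) / len(slot) for slot in zip(*profiles)]
    return signature


# Synthetic data for one cell: a flat profile at 10 on weekdays and at 4 on weekends.
start = date(2013, 11, 1)
series = {}
for offset in range(14):
    day = start + timedelta(days=offset)
    level = 4.0 if day.weekday() >= 5 else 10.0
    series[day] = [level] * SLOTS_PER_DAY

signature = temporal_signature(series)
```

Each cell or district is reduced from 60 x 144 values per activity to two profiles of 144 values, which is the compression the slide refers to.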
Correlation analysis
Try to identify possible correspondences between different datasets.
Measure whether and how two variables change together using
correlation indexes -> Pearson’s correlation coefficient
-1 ≤ r ≤ 1
r > 0: positive correlation
r < 0: negative correlation
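For reference, Pearson's coefficient can be computed directly from its definition, as in this short sketch. The variable names and the five sample values are made up for illustration and are not taken from the Milano data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient r between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return covariance / spread


# Made-up values for five spatial units, for illustration only.
poi_density = [1.0, 2.0, 3.0, 4.0, 5.0]
telecom = [2.0, 4.1, 5.9, 8.2, 9.8]
agricultural_share = [50.0, 40.0, 30.0, 20.0, 10.0]

r_positive = pearson(poi_density, telecom)             # close to +1
r_negative = pearson(poi_density, agricultural_share)  # exactly linear, decreasing
```

A value near +1 or -1 indicates a nearly linear relationship; r says nothing about non-linear dependence or about causation.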
Correlation analysis - datasets
Pairwise comparisons between 1-dimensional vectors:
• POIs municipality: density
• POIs OSM: density
• Population: density
• Telecom: first principal component, explaining 90% of the variance
• Land use data: residential and agricultural used separately, as the
percentage of each district/cell area they cover
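Reducing a multi-value telecom signature to a single number per spatial unit can be sketched with a first principal component. The slides do not say how the PCA was computed; the version below uses power iteration on the covariance matrix as one simple option, and the data are invented.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (scores, explained_ratio) for the first principal component.

    `rows` holds one feature vector per spatial unit. The leading eigenvector
    of the sample covariance matrix is approximated by power iteration, which
    assumes the data are not constant and the start vector is not orthogonal
    to that eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / sqrt(d)] * d  # starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the eigenvalue, i.e. the variance along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    # Project every centred row onto v to get one score per spatial unit.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return scores, eigenvalue / total_variance


# Invented signatures for four spatial units (three values each).
signatures = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 6.0, 8.9],
    [4.0, 8.0, 12.0],
]
scores, explained = first_principal_component(signatures)
```

The `scores` list is the 1-dimensional vector that would enter the pairwise comparison, and `explained` is the share of variance it retains. On real data a library routine would normally be preferable to a hand-written iteration.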
Correlation analysis
at district level
• Correlations between
• Telecom and residential
• Telecom and POIs
actually exist.
The data fit quasi-linear models.
[Figure: variables compared: tlc, resid, agric, POI mun, POI OSM, pop]
• Negative correlation between
agricultural land use and the other
datasets -> human activity is inversely
related to agricultural areas.
Correlation analysis
at cell level
• All coefficients are lower than at
district level
• Higher values again between
Telecom and residential and POIs
=> the choice of resolution level can
have a significant impact on the
correlation results.
[Figure: variables compared: tlc, resid, agric, POI mun, POI OSM, pop]
• Some phenomena causing the
correlation are independent of the
resolution level (0.76 between residential
and population).
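The dependence of the coefficients on the resolution level can be reproduced on toy data. In the sketch below, 8 invented cell values are averaged into 4 coarser units of 2 cells each; real districts are irregular polygons, so grouping consecutive cells is a simplifying assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def aggregate(values, group_size):
    """Average consecutive groups of `group_size` cells into one coarser unit."""
    return [sum(values[i:i + group_size]) / group_size
            for i in range(0, len(values), group_size)]


# Toy data: both variables share the same coarse trend (1, 2, 3, 4), but
# fluctuate in opposite directions inside each pair of cells.
cells_a = [2.0, 0.0, 3.0, 1.0, 4.0, 2.0, 5.0, 3.0]
cells_b = [0.0, 2.0, 1.0, 3.0, 2.0, 4.0, 3.0, 5.0]

r_cell = pearson(cells_a, cells_b)
r_district = pearson(aggregate(cells_a, 2), aggregate(cells_b, 2))
```

At cell level the local fluctuations dominate and r is about 0.11, while averaging cancels them and gives r = 1 at the coarser level. The data are constructed to exaggerate the effect, but the direction matches the observation that cell-level coefficients are lower than district-level ones.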
Correlation analysis: phone calls and population
• Could the correlation change during the
day according to the everyday human
behaviour pattern (get up, go to work,
come back home in the evening)?
• Call activity at 6 different day times
• Week and weekend profiles are different
-> mirroring people’s different habits
• Average correlation higher in the weekend
(phone activity related to the actual
presence of people at home)
• Weekday profile -> human behaviour
pattern
[Figure panels: DISTRICT and CELL level; WEEK and WEEKEND profiles]
Conclusions and future work
To sum up...
• Best practices for the data exploration process were presented, applied to urban
datasets of Milano
• The approach was presented in an urban environment but can also be applied in other
environments
• Correlation between different sources exists and is strongly related to the resolution
level adopted
What is coming next?
• Extending our investigation towards a predictive approach
• Would it be possible to use one or more ‘cheap’ datasets (like open data) as a proxy
for more ‘expensive’ data sources?
• Explorative analysis => statistical and machine learning techniques.
Predictive analysis (not in the paper)
• Support Vector Machine to
classify the CORINE classes using
the POIs as predictors.
• Accuracy > 83%
• Errors (black dots) on the
boundary
=> promising results, go on in this
direction!
Thank you! Any questions?

More Related Content

What's hot

Merging statistics and geospatial information - demography / commuting / spat...
Merging statistics and geospatial information - demography / commuting / spat...Merging statistics and geospatial information - demography / commuting / spat...
Merging statistics and geospatial information - demography / commuting / spat...Mirosław Migacz
 
191008 kafka meetup_liebig
191008 kafka meetup_liebig191008 kafka meetup_liebig
191008 kafka meetup_liebigThomas Liebig
 
NERA: Network of European Research Infrastructures for Earthquake Risk Assess...
NERA: Network of European Research Infrastructures for Earthquake Risk Assess...NERA: Network of European Research Infrastructures for Earthquake Risk Assess...
NERA: Network of European Research Infrastructures for Earthquake Risk Assess...Global Earthquake Model Foundation
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyGuy Lansley
 
Km4city: Open Urban Platform for a Sentient Smart City
Km4city: Open Urban Platform for a Sentient Smart CityKm4city: Open Urban Platform for a Sentient Smart City
Km4city: Open Urban Platform for a Sentient Smart CityPaolo Nesi
 

What's hot (9)

Studying Migrations Routes: New data and Tools
Studying Migrations Routes: New data and ToolsStudying Migrations Routes: New data and Tools
Studying Migrations Routes: New data and Tools
 
Merging statistics and geospatial information - demography / commuting / spat...
Merging statistics and geospatial information - demography / commuting / spat...Merging statistics and geospatial information - demography / commuting / spat...
Merging statistics and geospatial information - demography / commuting / spat...
 
Studying Migrations Routes: New data and Tools
Studying Migrations Routes: New data and ToolsStudying Migrations Routes: New data and Tools
Studying Migrations Routes: New data and Tools
 
B08 A3pc 82 Diapo Girardot En
B08 A3pc 82 Diapo Girardot EnB08 A3pc 82 Diapo Girardot En
B08 A3pc 82 Diapo Girardot En
 
191008 kafka meetup_liebig
191008 kafka meetup_liebig191008 kafka meetup_liebig
191008 kafka meetup_liebig
 
NERA: Network of European Research Infrastructures for Earthquake Risk Assess...
NERA: Network of European Research Infrastructures for Earthquake Risk Assess...NERA: Network of European Research Infrastructures for Earthquake Risk Assess...
NERA: Network of European Research Infrastructures for Earthquake Risk Assess...
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
Km4city: Open Urban Platform for a Sentient Smart City
Km4city: Open Urban Platform for a Sentient Smart CityKm4city: Open Urban Platform for a Sentient Smart City
Km4city: Open Urban Platform for a Sentient Smart City
 
Maps with leafletR
Maps with leafletRMaps with leafletR
Maps with leafletR
 

Similar to A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data

Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city ServicesOntology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city ServicesPaolo Nesi
 
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...Sergio Consoli
 
Open Land Use - the Current Status and Steps Forward
Open Land Use - the Current Status and Steps ForwardOpen Land Use - the Current Status and Steps Forward
Open Land Use - the Current Status and Steps Forwardplan4all
 
Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...
Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...
Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...Istituto nazionale di statistica
 
Km4city: open flexible scalable city platform
Km4city: open flexible scalable city platformKm4city: open flexible scalable city platform
Km4city: open flexible scalable city platformPaolo Nesi
 
SoBigData - Exploring human mobility and migration with BigData @ NTTS2017
SoBigData - Exploring human mobility and migration with BigData @ NTTS2017SoBigData - Exploring human mobility and migration with BigData @ NTTS2017
SoBigData - Exploring human mobility and migration with BigData @ NTTS2017Vittorio Romano
 
From Digital Earth to the Internet of Places for Management of Risks and Emer...
From Digital Earth to the Internet of Places for Management of Risks and Emer...From Digital Earth to the Internet of Places for Management of Risks and Emer...
From Digital Earth to the Internet of Places for Management of Risks and Emer...Maria Antonia Brovelli
 
Patterns of public eService development across European cities
Patterns of public eService development across European citiesPatterns of public eService development across European cities
Patterns of public eService development across European citiesLuigi Reggi
 
Gianluca Vannuccini - Commune di Firenze - open data city of florence - July...
Gianluca Vannuccini - Commune di Firenze  - open data city of florence - July...Gianluca Vannuccini - Commune di Firenze  - open data city of florence - July...
Gianluca Vannuccini - Commune di Firenze - open data city of florence - July...AmbasciatadelCanada
 
BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...
BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...
BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...BigData_Europe
 
NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...
NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...
NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...Wolfgang Ksoll
 
Ontology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOntology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOscar Corcho
 
A Web of Things Based Eco-System for Urban Computing - Towards Smarter Cities
A Web of Things Based Eco-System for Urban Computing - Towards Smarter CitiesA Web of Things Based Eco-System for Urban Computing - Towards Smarter Cities
A Web of Things Based Eco-System for Urban Computing - Towards Smarter CitiesAndreas Kamilaris
 
OpenTransportNet: Stimulating Innovation with Open Geographic Information
OpenTransportNet: Stimulating Innovation with Open Geographic InformationOpenTransportNet: Stimulating Innovation with Open Geographic Information
OpenTransportNet: Stimulating Innovation with Open Geographic Information21cConsultancy_2012
 

Similar to A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data (20)

Lesson3 esa summer_school_brovelli
Lesson3 esa summer_school_brovelliLesson3 esa summer_school_brovelli
Lesson3 esa summer_school_brovelli
 
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city ServicesOntology Building vs Data Harvesting and Cleaning for Smart-city Services
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
 
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
 
What can be done with Open Data?
What can be done with Open Data?What can be done with Open Data?
What can be done with Open Data?
 
Open Land Use - the Current Status and Steps Forward
Open Land Use - the Current Status and Steps ForwardOpen Land Use - the Current Status and Steps Forward
Open Land Use - the Current Status and Steps Forward
 
Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...
Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...
Giorgio Alleva, Data Innovation in Official Statistics: the Leading Role of O...
 
Km4city: open flexible scalable city platform
Km4city: open flexible scalable city platformKm4city: open flexible scalable city platform
Km4city: open flexible scalable city platform
 
Lesson2 esa summer_school_brovelli
Lesson2 esa summer_school_brovelliLesson2 esa summer_school_brovelli
Lesson2 esa summer_school_brovelli
 
Citadel technical
Citadel technicalCitadel technical
Citadel technical
 
Geo Open Data
Geo Open DataGeo Open Data
Geo Open Data
 
SoBigData - Exploring human mobility and migration with BigData @ NTTS2017
SoBigData - Exploring human mobility and migration with BigData @ NTTS2017SoBigData - Exploring human mobility and migration with BigData @ NTTS2017
SoBigData - Exploring human mobility and migration with BigData @ NTTS2017
 
From Digital Earth to the Internet of Places for Management of Risks and Emer...
From Digital Earth to the Internet of Places for Management of Risks and Emer...From Digital Earth to the Internet of Places for Management of Risks and Emer...
From Digital Earth to the Internet of Places for Management of Risks and Emer...
 
Patterns of public eService development across European cities
Patterns of public eService development across European citiesPatterns of public eService development across European cities
Patterns of public eService development across European cities
 
Gianluca Vannuccini - Commune di Firenze - open data city of florence - July...
Gianluca Vannuccini - Commune di Firenze  - open data city of florence - July...Gianluca Vannuccini - Commune di Firenze  - open data city of florence - July...
Gianluca Vannuccini - Commune di Firenze - open data city of florence - July...
 
BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...
BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...
BigDataEurope 1st SC5 Workshop, Project Teleios & LEO, by M. Koubarakis, Univ...
 
NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...
NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...
NextGEOSS: The Next Generation European Data Hub and Cloud Platform for Earth...
 
Ontology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOntology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data Sharing
 
Lesson1 esa summer_school_brovelli
Lesson1 esa summer_school_brovelliLesson1 esa summer_school_brovelli
Lesson1 esa summer_school_brovelli
 
A Web of Things Based Eco-System for Urban Computing - Towards Smarter Cities
A Web of Things Based Eco-System for Urban Computing - Towards Smarter CitiesA Web of Things Based Eco-System for Urban Computing - Towards Smarter Cities
A Web of Things Based Eco-System for Urban Computing - Towards Smarter Cities
 
OpenTransportNet: Stimulating Innovation with Open Geographic Information
OpenTransportNet: Stimulating Innovation with Open Geographic InformationOpenTransportNet: Stimulating Innovation with Open Geographic Information
OpenTransportNet: Stimulating Innovation with Open Geographic Information
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data

  • 1. A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data Gloria Re Calegari and Irene Celino CEFRIEL – Politecnico di Milano Como, July 17th 2015 Free and Open Source Software for Geospatial - FOSS4G Europe 2015
  • 2. Digital information about cities • Open data (large number of data sources available on the web): • Urban planning (land cover, public registers) • Demographics and statistics about municipality • Closed data sources produced and maintained by enterprises: • Phone activity data  but sometimes made open! • User generated information: • Volunteered geographic information and crowdsourcing information (Open Street Map) • Location based social network (Foursquare check-ins and geo located information) • Real-time and streaming information • Sensors (e.g. Temperature, energy consumption, ..) 2Free and Open Source Software for Geospatial - FOSS4G Europe 2015
  • 3. Data exploration process and case study A lot of data could describe the urban environment from different perspectives -> great wealth for data scientist. Managing, processing and comparing those data can be cumbersome -> smarter solutions are required. Data exploration of hetherogeneous urban information sources related to the city of Milano in Italy: • Possible issues • Best practices • Data exploration through correlation analysis (understand if diverse information sources mirror the same picture of a city) Free and Open Source Software for Geospatial - FOSS4G Europe 2015 3
  • 4. Milano datasets Demographics: • Population density • Spatial resolution: census area (6079 – median size of census area 12,000 m2) • Source: Milano open data Points of interest (POIs): • Trasports, schools, sports facilities, amenity places, shops ... • Spatial resolution: lat-long points • Source: Milano open data (official, 6718) and Open Street Map (user generated, 44351) 4Free and Open Source Software for Geospatial - FOSS4G Europe 2015
  • 5. Milano datasets Land use cover: • type of land use according to CORINE taxonomy (3-levels hierarchy, up to 40 types of land use defined) • CORINE taxonomy http://swa.cefriel.it/ontologies/corine.html# • 2 types selected (which better feature metropolitan area as Milan) 1. Residential 2. Agricultural • Spatial resolution: building level • Source: Lombardy region open data 5Free and Open Source Software for Geospatial - FOSS4G Europe 2015
  • 6. Milano datasets Call data records: • 5 phone activities • Incoming SMS • Outcoming SMS • Incoming CALL • Outcoming CALL • Internet • Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013) • Spatial resolution: grid of 3538 square cells of 250m • Source: Telecom Italia – provided for their Big Data Challenge http://theodi.fbk.eu/openbigdata/ 6Free and Open Source Software for Geospatial - FOSS4G Europe 2015
  • 7. Challenges
    • Varying spatial resolution of information sources (census area for population, single points for POIs, ...)
    • Different time frames (population census carried out every 10 years, telecom data recorded every 10 minutes)
    • Reliability (to what extent the sources can be trusted; data from public authorities vs. data from crowdsourcing)
  • 8. Best practices adopted
    1) Data transformation, cleansing or normalization (standard operations)
    2) Making spatial resolution uniform
    Spatial resolutions used:
    • District level, with 88 official subdivisions
    • Grid level, with 3,538 square cells of 250 m
    [Figure: maps of the cells and of the districts]
    New datasets generated:
    • Density of POIs in each cell/district
    • Weighted sum of population density in each cell/district
    • Percentage share of each land use over each cell/district area
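The step of bringing point data to the grid resolution (slide 8) can be sketched roughly as follows. This is only an illustration, not the authors' code: it assumes the POI coordinates have already been projected to metres, and the grid origin, the function names and the sample points are hypothetical. Only the 250 m cell side comes from the slides.

```python
from collections import Counter

CELL_SIDE_M = 250.0  # cell side taken from the slides


def cell_index(x, y, origin_x=0.0, origin_y=0.0, side=CELL_SIDE_M):
    """Return the (column, row) of the square cell containing point (x, y).

    The origin is a hypothetical lower-left corner of the grid, in metres.
    """
    col = int((x - origin_x) // side)
    row = int((y - origin_y) // side)
    return col, row


def poi_density_per_cell(points, side=CELL_SIDE_M):
    """Count POIs per cell and divide by the cell area, giving POIs per km^2."""
    counts = Counter(cell_index(x, y, side=side) for x, y in points)
    area_km2 = (side / 1000.0) ** 2
    return {cell: n / area_km2 for cell, n in counts.items()}


# Made-up projected coordinates (metres), for illustration only.
pois = [(10.0, 20.0), (240.0, 249.0), (260.0, 20.0), (600.0, 600.0)]
density = poi_density_per_cell(pois)
```

Cells that contain no POI do not appear in the result; a real pipeline would likely fill them with zero before comparing against the other datasets, and would use a polygon overlay rather than simple binning for the district level.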
  • 9. Best practices adopted
    3) Data compression (pre-processing large-scale time series to get a more manageable compressed representation)
    Telecom data: footprint/temporal signature for each cell/district (average activity over all the 60 days, distinguishing between weekdays and weekend days)
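One possible reading of the footprint computation in slide 9 is sketched below: for a single cell, each time slot is averaged separately over weekdays and over weekend days. The function name, the input layout and the toy values are assumptions made for illustration. The real series has 144 ten-minute slots per day, while the example uses 3 slots to stay readable.

```python
from datetime import date, timedelta


def temporal_footprint(daily_series, first_day):
    """Average per-slot activity separately over weekdays and weekend days.

    daily_series: list of days, each a list with the same number of slot values.
    first_day: calendar date of the first entry in daily_series.
    """
    sums = {"week": None, "weekend": None}
    counts = {"week": 0, "weekend": 0}
    for offset, slots in enumerate(daily_series):
        day = first_day + timedelta(days=offset)
        # weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if day.weekday() >= 5 else "week"
        if sums[kind] is None:
            sums[kind] = [0.0] * len(slots)
        sums[kind] = [s + v for s, v in zip(sums[kind], slots)]
        counts[kind] += 1
    # Only return the day types that actually occur in the input.
    return {k: [s / counts[k] for s in sums[k]] for k in sums if counts[k]}


# Toy data: Friday 1 Nov 2013 to Monday 4 Nov 2013, 3 slots per day.
series = [[1, 2, 3], [10, 20, 30], [30, 40, 50], [3, 4, 5]]
footprint = temporal_footprint(series, date(2013, 11, 1))
```

With this layout, two months of ten-minute readings for a cell shrink to two vectors of 144 values each, which is the kind of compressed representation the slide describes.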
  • 10. Correlation analysis
    Try to identify possible correspondences between different datasets.
    Measure whether and how two variables change together using a correlation index -> Pearson's correlation coefficient r, with -1 ≤ r ≤ 1 (r > 0: positive correlation; r < 0: negative correlation)
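For reference, Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal pure-Python sketch is shown below; the function name and the sample vectors are illustrative and do not come from the presentation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up vectors: one rises with x, the other falls.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

Here `r_up` is close to 1 and `r_down` close to -1, the two ends of the range. Note that r only captures linear association, so a low value does not rule out a non-linear relationship.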
  • 11. Correlation analysis - datasets
    Pairwise comparisons between 1-dimensional vectors:
    • POIs from the municipality: density
    • POIs from OSM: density
    • Population: density
    • Telecom: first principal component, with 90% of explained variability
    • Land use data: residential and agricultural used separately, as the percentage of each district/cell area they cover
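Reducing each telecom footprint to a single value through the first principal component could look roughly like the sketch below. It uses power iteration on the sample covariance matrix, which is an assumption made here to keep the example dependency-free; the slides do not say how the component was computed, and a numerical library would be the usual choice on real data. Function and variable names are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (scores, explained_ratio) for the first principal component.

    rows: one feature vector per cell/district (e.g. its temporal footprint).
    Assumes the data are not constant and that the all-ones start vector is
    not orthogonal to the dominant eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of cov.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue, i.e. the variance along vec.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    # Project each centred row on the component to get one score per row.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return scores, eigenvalue / total_variance


# Toy data lying exactly on a line, so one component explains everything.
scores, ratio = first_principal_component([[1, 2], [2, 4], [3, 6]])
```

The returned `scores` vector has one entry per cell/district and is the kind of 1-dimensional vector that can be compared with the density and land-use vectors; `ratio` is the share of variance explained by the component.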
  • 12. Correlation analysis at district level
    [Figure: pairwise correlations among tlc, resid, agric, POI mun, POI OSM and pop]
    • Correlations between Telecom and residential land use, and between Telecom and POIs, can actually exist. The data fit quasi-linear models.
    • Negative correlation between agricultural land use and the other datasets -> human activity is inversely related to agricultural areas.
  • 13. Correlation analysis at cell level
    [Figure: pairwise correlations among tlc, resid, agric, POI mun, POI OSM and pop]
    • All coefficients are lower than at the district level
    • The higher values are again between Telecom and residential land use and between Telecom and POIs
    => the choice of resolution level can have a significant impact on the correlation results.
    • Some phenomena causing the correlation are independent of the resolution level (0.76 for residential-population).
  • 14. Correlation analysis: phone calls and population
    • Could the correlation change during the day according to everyday human behaviour patterns (get up, go to work, come back home in the evening)?
    • Call activity at 6 different times of day
    [Figure: correlation profiles at district and cell level, for weekdays and weekends]
    • Weekday and weekend profiles are different -> mirroring people's different habits
    • Average correlation is higher at the weekend (phone activity related to the actual presence of people at home)
    • Weekday profile -> human behaviour pattern
  • 15. Conclusions and future work
    To sum up...
    • Presentation of best practices for the data exploration process, applied to urban datasets of Milano
    • The approach was presented in an urban environment but can also be applied in different environments
    • Correlation between different sources exists and is strongly related to the resolution level adopted
    What is coming next?
    • Extending our investigation toward a predictive approach
    • Would it be possible to use one or more 'cheap' datasets (like open data) as a proxy for more 'expensive' data sources?
    • Explorative analysis => statistical and machine learning techniques.
  • 16. Predictive analysis (not in the paper)
    • Support Vector Machine to classify the CORINE classes using the POIs as predictors.
    • Accuracy > 83%
    • Errors (black dots) on the boundary
    => promising results, go on in this direction!
  • 17. Thank you! Any questions?
    A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data
    Gloria Re Calegari and Irene Celino, CEFRIEL – Politecnico di Milano
    Free and Open Source Software for Geospatial - FOSS4G Europe 2015