Presentation by John Murray, Fusion Data Science, giving examples and applications of joining data by location, using OS and other open data.
@MurrayData
2. Joining Data Together
• The real value of data is not the data itself, but the insights
derived from it.
• To achieve maximum economic benefit from data, disparate
sources need to be joined:
• Appending socio-demographic data to a customer
database for marketing insights.
• Merging crime data with benefits and deprivation data
to analyse causes of crime.
• Joining NHS mortality and prescribing data to census
data to examine factors in poor health.
• Geography is a common "currency" in much Open Data
which allows us to join it.
3. Geography Types in Open Data
• Census geography.
• Output areas.
• Administrative geography.
• Local authorities, regions, NHS areas, Police Forces.
• Political geography.
• Electoral wards, Parliamentary constituencies.
• Postal geography.
• Postcodes, sectors, areas, districts.
• Unstructured geography.
• Spatial points.
• Bespoke catchments, e.g. retail stores.
4. Census Geography
• Hierarchy of published area statistics.
• Output Area (OA)
• 40-250 households.
• Lower Super Output Area (LSOA)
• 400-1200 households.
• Middle Super Output Area (MSOA)
• 2000-6000 households.
• Links to administrative geography
• Open data geography tables:
• ONS Postcode Directory (ONSPD)
• National Statistics Postcode Lookup (NSPL)
5. Administrative and Political Geography
• Local Authorities.
• District.
• County.
• Metropolitan Boroughs and Unitary Councils.
• Parish and Town Councils.
• Parliamentary Constituencies.
• Government Regions.
• NHS.
• Police Forces.
• Environment Agency Regions.
• Links Provided in ONSPD and NSPL.
6. Postal Geography
• Based around the postcode.
• Introduced in 1959 on a trial basis.
• Current UK system in use since 1967.
• Designed for the purpose of efficient delivery of
mail.
• Doesn't align exactly with Census and
Administrative Geography.
• 1.8 million postcodes currently in use.
• Mean number of "delivery points" is 14.
7. Anatomy Of A Postcode
CH1 2HS
• CH – Postcode Area
• CH1 – Postcode District
• CH1 2 – Postcode Sector
• CH1 2HS – Postcode
• "HS" is called the walk.
• CH1 is referred to as the outward code (outcode).
• 2HS is referred to as the inward code (incode).
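The breakdown above can be sketched in code. This is a minimal illustration, not an official parser: the function name `split_postcode` and the regular expression are assumptions made for this example, and the pattern only covers the common postcode shapes (special cases such as GIR 0AA are not handled).

```python
import re

# Simplified pattern (an assumption for illustration):
#   area     = 1-2 letters
#   district = area + a digit, optionally followed by a digit or letter
#   incode   = sector digit + two letters
_POSTCODE_RE = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?)\s*(\d)([A-Z]{2})$")


def split_postcode(raw):
    """Split a UK postcode string into the levels of postal geography."""
    match = _POSTCODE_RE.match(raw.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable postcode: {raw!r}")
    area, district_part, sector_digit, unit = match.groups()
    outcode = area + district_part
    return {
        "area": area,                                   # e.g. CH
        "district": outcode,                            # e.g. CH1
        "sector": f"{outcode} {sector_digit}",          # e.g. CH1 2
        "postcode": f"{outcode} {sector_digit}{unit}",  # e.g. CH1 2HS
        "outcode": outcode,
        "incode": sector_digit + unit,
    }


parts = split_postcode("ch1 2hs")
```

Normalising to upper case and a single space before splitting means that the same postcode typed in different ways yields the same keys at every level.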
8. Postcode Facts
• Mean delivery points per postcode: 14.
• Mean delivery points per postcode sector: 2,530.
• Mean delivery points per postcode district: 9,080.
• Mean delivery points per postcode area: 200,000.
• 26 million delivery points in UK.
• Ordnance Survey Code-Point Open, ONSPD and NSPL contain grid references for postcode centroids.
9. Joining Data
• In most cases, use ONSPD
• Although approximate, good enough for most uses.
• For political and public sector work, use NSPL
• Specifically designed for that purpose.
• Use postcode to join data.
• Can join individual/household data.
• Augment existing data, e.g. customer database
• Customer demographic profiling.
• Store catchment analysis.
• Join open and closed data sources.
• Common in many open data sources.
• Links easily to other levels of geography.
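A postcode join of this kind can be sketched as below. Everything in the example is hypothetical: the field names (`pcd`, `lsoa`, `local_authority`), the area codes and the customer records are invented for illustration and are not taken from ONSPD or NSPL. The one real point is that postcodes should be normalised before matching, because sources format them differently.

```python
def normalise(postcode):
    """Strip all whitespace and upper-case, so 'ch1 2hs' matches 'CH1 2HS'."""
    return "".join(postcode.split()).upper()


# Hypothetical extract of a postcode lookup table (values invented).
lookup_rows = [
    {"pcd": "CH1 2HS", "lsoa": "LSOA_A", "local_authority": "LA_X"},
    {"pcd": "CH1 4BJ", "lsoa": "LSOA_B", "local_authority": "LA_X"},
]

# Hypothetical customer database with inconsistently formatted postcodes.
customers = [
    {"customer_id": 1, "postcode": "ch1 2hs"},
    {"customer_id": 2, "postcode": "CH14BJ"},
    {"customer_id": 3, "postcode": "ZZ99 9ZZ"},  # not present in the lookup
]

# Index the lookup by normalised postcode for constant-time matching.
lookup = {normalise(row["pcd"]): row for row in lookup_rows}


def append_geography(customer):
    """Left join: keep every customer, with None where no match is found."""
    geo = lookup.get(normalise(customer["postcode"]), {})
    return {
        **customer,
        "lsoa": geo.get("lsoa"),
        "local_authority": geo.get("local_authority"),
    }


joined = [append_geography(c) for c in customers]
```

Keeping unmatched customers (a left join) makes it easy to count how many records failed to match, which is a useful quality check before any analysis.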
11. Geospatial Data in Databases
• Spatial data types
• Point (single point)
• Line (set of joined points e.g. road)
• Polygon (closed set of joined points e.g. boundary)
• Most databases support spatial data types
• Proprietary e.g. MS SQL Server, Oracle.
• Open source: MySQL, MariaDB, PostgreSQL (with PostGIS)
• NoSQL: Neo4j, MongoDB
• Spatial queries
• Contained in (point in polygon).
• Intersects (crosses).
• Distance (not supported by all).
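To show what a "contained in" query does underneath, here is a minimal sketch of the ray-casting point-in-polygon test. It is a generic textbook technique, not the implementation used by any of the databases listed, and the function name and sample square are assumptions made for this example. Points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses.

    polygon is a list of (x, y) vertices; the last vertex joins the first.
    An odd number of crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# A hypothetical 10 x 10 boundary, standing in for e.g. a catchment polygon.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
inside_result = point_in_polygon(5, 5, square)
outside_result = point_in_polygon(15, 5, square)
```

In practice a spatial database also uses a spatial index to discard most polygons before running a test like this on the remaining candidates.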
13. Distance Metrics
• Euclidean Distance
• “Crow flies” linear distance
• Graph Distance
• Road distance
• Manhattan Distance
• Rectilinear distance
• Great Circle
• Shortest distance between two points on the surface
of a sphere
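Three of these metrics can be computed directly from coordinates; graph distance is the exception, since it needs a road network. The sketch below uses the haversine formula for the great-circle case and assumes a spherical Earth with a mean radius of 6371 km, which is an approximation. The function names are chosen for this example.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two planar points (x, y)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def manhattan(p, q):
    """Rectilinear distance: sum of the absolute coordinate differences."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two lat/lon points in degrees.

    radius_km is an assumed mean Earth radius; the Earth is not a true sphere.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Planar example in arbitrary units, e.g. metres on a national grid.
straight = euclidean((0, 0), (3, 4))
rectilinear = manhattan((0, 0), (3, 4))
```

Euclidean and Manhattan distances only make sense on projected coordinates such as eastings and northings; for latitude and longitude the great-circle form is the appropriate one.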
14. Euclidean Distance
• University of Chester to
Liverpool Airport.
• Euclidean distance 9.4
miles.
• Manhattan distance 11.1
miles.
• Graph distance (fastest)
24.5 miles.
• Used OS Strategi roads OpenData and the A* algorithm.
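For readers unfamiliar with A*, a minimal sketch on a toy network follows. It is not the code behind the figures above: the four-node graph, its coordinates and its edge lengths are invented, and the sketch minimises distance rather than travel time. The Euclidean distance to the goal serves as the heuristic, which is valid here because no road can be shorter than the straight line between its ends.

```python
import heapq
import math


def a_star(graph, coords, start, goal):
    """Return the shortest path length from start to goal.

    graph:  node -> list of (neighbour, edge_length)
    coords: node -> (x, y), used for the straight-line heuristic
    """
    def heuristic(node):
        (x1, y1), (x2, y2) = coords[node], coords[goal]
        return math.hypot(x2 - x1, y2 - y1)

    best = {start: 0.0}                  # best known distance to each node
    queue = [(heuristic(start), start)]  # ordered by distance so far + heuristic
    while queue:
        _, node = heapq.heappop(queue)
        if node == goal:
            return best[node]
        for neighbour, length in graph.get(node, []):
            candidate = best[node] + length
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate + heuristic(neighbour), neighbour))
    return math.inf  # goal not reachable


# Hypothetical junctions and road lengths (same units as the coordinates).
coords = {"A": (0, 0), "B": (4, 0), "C": (4, 3), "D": (0, 3)}
graph = {
    "A": [("B", 4.0), ("D", 3.5)],
    "B": [("A", 4.0), ("C", 3.0)],
    "C": [("B", 3.0), ("D", 6.0)],
    "D": [("A", 3.5), ("C", 6.0)],
}

road_distance = a_star(graph, coords, "A", "C")
crow_flies = math.hypot(4, 3)
```

On this toy network the road distance from A to C exceeds the straight-line distance, the same pattern as in the Chester to Liverpool example, where the road route has to detour around the straight line.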
15. Non-Formal Unstructured Geography
• Micro geo-centric analysis
• Describe neighbourhood
• Point based data
• Relate to formal geography through boundaries.
• User defined
• Store catchments
• Sales territories
• Radial/drive time
16. Point Based Data
• The simplest type of spatial object.
• Represents a point relative to the Earth's surface.
• Has at least 2 values for coordinates.
• May optionally have an elevation z value in some
systems.
• Ordnance Survey grid references are Cartesian
Coordinates, in metres, east and north of origin
point.
17. Converting Between Systems
• Use GIS software or conversion software.
• Scripts freely downloadable from Ordnance Survey and
others.
• Ordnance Survey provide comprehensive guides and
resources to write your own scripts.
• Unfortunately, it ISN’T as straightforward as using a
formula.
• Need to take into account tectonic shifts and historic
inaccuracies in surveying.
• OS provides a dataset of shifts to do this.
18. Geocentric analysis
• Use point as centre.
• Use Euclidean distance to aggregate metrics.
• Standardise units.
• Example – population density at postcode level:
• Census Postcode Estimates
• Ordnance Survey Code-Point Open
• Join the datasets.
• Sum the counts within specified radius.
• Convert to standardised unit e.g. people/hectare
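The population density example can be sketched as follows. The records are invented stand-ins for the result of joining postcode population estimates to postcode centroids; the field names, coordinates and counts are all hypothetical. Coordinates are assumed to be eastings and northings in metres, so Euclidean distance applies directly.

```python
import math

# Hypothetical joined data: one row per postcode, centroid in metres.
postcodes = [
    {"postcode": "P1", "easting": 1000, "northing": 1000, "population": 40},
    {"postcode": "P2", "easting": 1240, "northing": 1320, "population": 25},  # 400 m from P1
    {"postcode": "P3", "easting": 3000, "northing": 1000, "population": 60},  # 2,000 m from P1
]


def density_per_hectare(centre, rows, radius_m):
    """Sum population within radius_m of centre, divided by the circle's area."""
    total = sum(
        row["population"]
        for row in rows
        if math.hypot(row["easting"] - centre["easting"],
                      row["northing"] - centre["northing"]) <= radius_m
    )
    area_hectares = math.pi * radius_m ** 2 / 10_000  # 1 hectare = 10,000 square metres
    return total / area_hectares


# Density around P1 using a 500 m radius: P1 and P2 fall inside, P3 does not.
density = density_per_hectare(postcodes[0], postcodes, radius_m=500)
```

Dividing by the area of the circle is what standardises the result: sums taken at different radii become comparable once they are expressed as people per hectare.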
24. INSPIRE Directive
• INfrastructure for Spatial InfoRmation in Europe.
• EU Directive since May 2007.
• Lays down framework for spatial information.
• Aim is to ensure compatibility and usability across member states.
• Interoperability of spatial datasets.
• Metadata standards.
• Ordnance Survey OpenData.
• Land Registry Cadastral Polygons.
26. Street Level Data
• Use proximity to street geometry to link
attributes.
• Interrelation between features.
• Inference of addresses.
• Describe local neighbourhood.