Historic Postcode Directories - Progress and
Plans
Postcode GeoReferencing User Group, 5th April
James Crone, EDINA.
Overview
• About EDINA
• Project Background and Context
• Progress To Date
• Plans for coming months
• Outstanding Issues
EDINA
• A JISC funded national data centre based at Edinburgh University Data
Library.
• Provides the UK tertiary education and research community online access to a
library of data, information and research resources.
• Its largest section, Geo Data Services, is made up of GIS specialists and
software engineers and provides access to 2 key online services -
Digimap & UKBORDERS.
• We and our user community have an interest in both contemporary and
historical postcode products.
Background & Context
• What are the historical postcode directories? - datasets which list all unit
postcodes within the UK and assign to each a national grid reference,
geographic lookups and a count of assigned addresses.
• ESRC has purchased Gridlinked versions of AFPD (2001-2006) for use by the
academic community. This community also has an interest in historic versions
of the AFPD and thus ONS supplied to ESRC historic postcode directories
(1980-2000) for free on the basis that ESRC would QA the historic versions.
• All versions of the postcode directories received by ESRC have been
available to users through the EDINA UKBORDERS service since October
2004.
• Steady stream of user downloads. Data for census years is most popular but,
interestingly, there is also significant interest in non-census years.
Deliverables
• Objectives/Deliverables of the QA set out formally in August 2004 MOU
between ESRC & ONS:
• Key Deliverable is a Quality Controlled postcode instance database spanning
1980 to present day. From this ESRC will derive snapshot historical versions of
the postcode directories replacing the versions of unknown quality that are
currently in existence.
• Postcode Instance - defined as the existence of a postcode for a certain period
of time which is unique on both postcode label and date of introduction.
• Postcode Instance = Postcode Label + Date of Introduction
• Instance db will have a number of fields – Date of Introduction (DOI), Date of
Termination (DOT), most recent easting & northing and higher geography
lookups (1991 ED/OA; 1998 Ward; 2001 OA).
• The ONS Ward History Database will be used to check the veracity of ward
codes within the historic versions of the postcode directories.
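
As an illustration only, the postcode instance defined above could be modelled as a small record keyed on label plus date of introduction. This is a minimal sketch: the class name, field names and the `ZZ1 1ZZ` label are hypothetical, and the higher geography lookups are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class PostcodeInstance:
    """One life of a postcode: unique on (postcode label, date of introduction)."""
    postcode: str                    # postcode label (hypothetical example: "ZZ1 1ZZ")
    doi: date                        # Date of Introduction
    dot: Optional[date] = None       # Date of Termination; None means still live
    easting: Optional[int] = None    # most recent easting
    northing: Optional[int] = None   # most recent northing

    @property
    def key(self) -> Tuple[str, date]:
        # Postcode Instance = Postcode Label + Date of Introduction
        return (self.postcode, self.doi)

    @property
    def is_live(self) -> bool:
        return self.dot is None
```

Two records with the same label but different dates of introduction have different keys, which is what allows a recycled postcode to be held as two separate instances.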
Progress to Date
• 4 sequential work phases to complete these objectives:
• I. Data Loading (complete)
• II. Quality Assurance I - Audit (complete)
• III. Quality Assurance II - Verification (in progress)
• IV. Production of Historic Snapshots
• At this point the first 2 of these are complete and we are currently engaged in
the verification phase.
• ... Taking each phase in turn
Phase I – Data Loading
• Postcode directories were supplied by ONS from 1980 to present day.
• Origin of data varies:
• Central Postcode Directories: 1980 - 1990 (except 1989)
• AFPDs: 1991 - 1998 (except 1996 & 1997)
• NHSPD: 1996 & 1997
• AFPD (NHS Variant): 1999
• AFPD (Gridlink version): 2000
• + Gridlink versions of AFPD from 2001 to current release.
• With the exception of 1989, a complete set, quite remarkable given that
digital curation & preservation are a fairly recent concern.
Phase I – Data Loading
• We took each historic version, loaded it into its own
database table (database used is PostgreSQL) &
then merged each year's table into a super table
giving all postcodes from all versions of the AFPD.
• Given the differing origins of the year tables and the
tendency for number of attributes to increase over
time, the harmonisation of these snapshots itself
was an "interesting" data management challenge.
For practical purposes fields were distilled down to a
core set.
• The super table was reduced to a table with distinct
postcode labels (giving the labels of all postcodes
since 1980) and then to the more valuable postcode
instance table.
• Composite merged table - 50,986,078 rows
• Distinct postcode unit table - 2,330,886 rows
• Postcode Instance table - 2,763,839 rows
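
The two reductions described above can be sketched in a few lines. The work itself was done in PostgreSQL; the Python below is only an illustration, and the row layout (dicts with hypothetical `year`, `postcode` and `doi` fields) is an assumption.

```python
from datetime import date


def distinct_labels(merged_rows):
    """Distinct postcode labels across all directory versions."""
    return {row["postcode"] for row in merged_rows}


def instance_keys(merged_rows):
    """Distinct (postcode label, date of introduction) pairs, i.e. postcode instances."""
    return {(row["postcode"], row["doi"]) for row in merged_rows}


# Toy merged table: the same label appears in three yearly versions,
# with a new date of introduction in the last one.
merged = [
    {"year": 1985, "postcode": "ZZ1 1ZZ", "doi": date(1980, 1, 1)},
    {"year": 1986, "postcode": "ZZ1 1ZZ", "doi": date(1980, 1, 1)},
    {"year": 1995, "postcode": "ZZ1 1ZZ", "doi": date(1994, 6, 1)},
]
```

With this toy input there is one distinct label but two instances, which mirrors why the instance table has more rows than the distinct postcode unit table.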
Phase I – Data Loading
• By itself Date of Introduction only tells us when a postcode was instantised.
In order to be able to examine the lifecycle of each instance we also need to
know if this instance has been terminated or is still live.
• To each instance we attempted to add a Date Of Termination (DOT) by
searching through each of the historic AFPD version tables and determining if
the instance was terminated. Not a trivial task given volumes of data and
number of searches required.
• At the same time each instance also had its latest grid reference
associated with it.
• Instance database is therefore quite rich as it holds both the temporal and
spatial history for the instances associated with a postcode.
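
One possible way to attach termination dates is sketched below. It assumes, hypothetically, that every yearly version row carries a `dot` field that is blank (`None`) while the postcode is live; the actual search procedure is not described in these slides. The sketch collects every distinct termination date recorded for an instance across all versions.

```python
from collections import defaultdict
from datetime import date


def collect_dots(version_rows):
    """Map each (postcode, doi) instance to the sorted distinct termination
    dates recorded for it across all version tables.

    An empty list means no version ever recorded a termination, so the
    instance appears to be live."""
    dots = defaultdict(set)
    for row in version_rows:
        seen = dots[(row["postcode"], row["doi"])]  # creates the entry if missing
        if row["dot"] is not None:
            seen.add(row["dot"])
    return {key: sorted(values) for key, values in dots.items()}
```

An instance with exactly one collected date has an unambiguous DOT; more than one date for the same instance would need a rule to choose between them.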
Phase II – Quality Assurance
(Audit)
• Rationale for Quality Assurance – The quality of the instance database will be
propagated to derived products. It is therefore essential that we understand
which instances are genuine, which can be regarded as spurious and which may
need to be fixed or weeded out.
• First Step – Analysis of the frequency of instances associated with distinct postcodes.
• Frequency of instances associated with distinct postcodes:

  Num of postcode instances   Frequency
  -------------------------   ---------
                          1   2,379,140
                          2     343,995
                          3      34,986
                          4       4,839
                          5         571
                          6          85
                          7          27
                          8          26
                          9         138
                         10          18
                         11           8
                         12           2
                         13           4

• We can see straight away that in some cases distinct postcodes have multiple
instances associated with them.
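
A frequency analysis of this kind can be sketched with two counting passes: first count instances per postcode, then count how many postcodes share each instance count. This is one way to tabulate it; whether the figures on the slide were produced exactly this way is not stated, and the input layout (a collection of `(postcode, doi)` pairs) is an assumption.

```python
from collections import Counter


def instance_frequency(instance_keys):
    """Return {number of instances: number of postcodes with that many instances}."""
    per_postcode = Counter(postcode for postcode, _doi in instance_keys)
    return Counter(per_postcode.values())


# Toy input with hypothetical labels: one postcode with 2 instances, one with 1.
keys = [("ZZ1 1ZZ", "1980-01"), ("ZZ1 1ZZ", "1994-06"), ("ZZ9 9ZZ", "1985-03")]
```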
Phase II – Quality Assurance
(Audit)
• Majority of postcodes represented by only a single instance. But significant
number of postcodes have multiple instances associated with them – why?
• Genuine Postcode Recycling
• Spurious Instances due to imputation problems or systematic tablewide
update procedures in past versions (e.g. an update for all Scottish 1973
instances in the 1980 table).
• Expected vs. Divergent Cases.
Phase II – Quality Assurance
(Audit)
• Programmatic tests were designed to flag cases in the Instance database
which diverged from what we expected.
• Do this by taking each postcode in turn and examining the timelines
associated with its instances. Errors grouped into 3 types:
• Type I - in which the DOI = DOT (the instance is instantised & terminated at
the same point in time)
• Type II – (A) in which all instances of the postcode are live or (B) there are
other inconsistencies within the timeline such as blank dates of termination
within a sequence of instances.
• Type III - multiple dates of termination - postcode instantised once but has
multiple dates of termination
The names of these error types are a convenience – not to be confused with
Type I/II errors in Statistics!
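
The three error types can be expressed as a small classifier over one postcode's timeline. This is a sketch under stated assumptions, not the tests actually used: a timeline is taken to be a list of `(doi, dots)` pairs where `dots` is the list of termination dates found for that instance (empty meaning live), and a single live instance is treated as the expected case rather than a Type II.A error.

```python
from datetime import date


def classify(timeline):
    """Return the set of error flags ("I", "II.A", "II.B", "III") for one postcode."""
    flags = set()
    ordered = sorted(timeline, key=lambda inst: inst[0])  # order by date of introduction
    for doi, dots in ordered:
        if doi in dots:
            flags.add("I")       # introduced and terminated at the same point in time
        if len(set(dots)) > 1:
            flags.add("III")     # one introduction, several termination dates
    live = [not dots for _doi, dots in ordered]
    if len(ordered) > 1:
        if all(live):
            flags.add("II.A")    # every instance of the postcode is live
        elif any(live[:-1]):
            flags.add("II.B")    # blank termination followed by a later instance
    return flags
```

A postcode can carry more than one flag, so counts per type need not sum to the number of flagged postcodes.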
[Figure: bar chart of Count by Spurious Instance Type. Bar values:
I = 3,558; II.A = 347,828; II.B = 206,001; III = 4,448]
Phase II – Quality Assurance
(Audit)
• As we can see the Type II error cases represent the bulk of the errors so
effort has been directed at identifying different varieties of this type of error.
We will spend a few minutes examining two such examples now.
Phase II – Quality Assurance
(Audit)
• Case A
• 6 instances never with a date of termination - conflict immediately after the
first case.
• Is it valid for there to be so many postcodes which have multiple live
instances?
• Are all of these cases a result of postcode recycling or are they in fact due to
inconsistencies within the dataset itself?
Phase II – Quality Assurance
(Audit)
• Case B
• Again we have 6 instances - this time there is a blank date of termination
within the timeline (which conflicts with the latter 2 instances)
Phase II – Quality Assurance
(Audit)
• Why are these a problem? - when we create the historic cuts we don't want
any ambiguity. We need to be sure:
• that all live postcodes are truly live (and should not have
been terminated).
• that where a postcode has multiple instances associated with it, these are
genuine and not a result of problems with how the data was created or
updated.
• that all data is as consistent as possible.
• How to reconcile these Spurious cases?
Phase III – QA - Verification
• Type I errors - unclear - we can't see any logic behind these, so we ask:
is it valid for an instance to be introduced and terminated in the same month?
• Type II errors - problem less clear cut as we have already seen - different
species of the same problem causing instances to diverge from the expected
norm.
• Type III errors - multiple dates of termination - As a rule, pick either the
earliest OR latest and apply to all cases
• The rest of the presentation is mainly concerned with the Type II errors.
• Key Assumption – Instance database holds information about the location of
each instance in space and time. Instances which are similar in both these
respects can be merged.
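
The Type III rule above (pick either the earliest or the latest termination date) is simple enough to sketch directly. The function name and the `prefer` parameter are hypothetical; which of the two choices to apply is left open, as it is on the slide.

```python
from datetime import date


def resolve_dot(dots, prefer="latest"):
    """Collapse several candidate termination dates into one.

    Returns None when there are no candidates (the instance is live)."""
    if not dots:
        return None
    if prefer == "earliest":
        return min(dots)
    if prefer == "latest":
        return max(dots)
    raise ValueError("prefer must be 'earliest' or 'latest'")
```

Whichever choice is made, applying it uniformly to all cases keeps the result reproducible.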
Phase III – QA - Verification
• Time - According to Royal Mail:
• A postcode is only supposed to be reused after a minimum period of 3 years
has elapsed & residential postcodes are never reused.
• On this basis where we have 2 instances which are instantised within less
than 3 years of one another we can assume that they are referring to the
same thing.
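
The temporal rule can be sketched as a grouping of dates of introduction. Two assumptions are made here that the slides do not confirm: the 3-year minimum is measured in whole months (36), and each date is compared with the previous one in the group, so a chain of closely spaced instances ends up in one group.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from one date to another (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def group_by_temporal_rule(dois, min_gap_months=36):
    """Group dates of introduction that fall within min_gap_months of the previous one.

    Each returned group is a candidate set of instances referring to the same thing."""
    groups = []
    for doi in sorted(dois):
        if groups and months_between(groups[-1][-1], doi) < min_gap_months:
            groups[-1].append(doi)
        else:
            groups.append([doi])
    return groups
```

Instances introduced 36 or more months apart stay in separate groups and remain candidates for genuine reuse.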
Phase III – QA - Verification
Space (Geography)
• Nearby things tend to be more similar than things that are further
apart.
• Instances located close to one another likely reference the same set of
addresses. Instances located further apart may represent recycling
events.
• For a postcode we can see how its instances change in position over
time - are they spatially stationary or more dynamic?
• How do we quantify this within the instance table? - for each set of instances
associated with a postcode unit, compute the change in easting & northing
between instances.
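
The change in position between successive instances could be computed as below. This is a minimal sketch assuming each instance's grid reference is an `(easting, northing)` pair in metres on the same grid, listed in order of date of introduction.

```python
import math


def displacements(points):
    """For consecutive (easting, northing) pairs return (d_easting, d_northing, distance)."""
    return [
        (e2 - e1, n2 - n1, math.hypot(e2 - e1, n2 - n1))
        for (e1, n1), (e2, n2) in zip(points, points[1:])
    ]
```

A distance of zero means the two instances share a grid reference; larger distances are the cases where recycling becomes more plausible.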
Phase III – QA - Verification
• BUT we need to be aware of the spatial accuracy issue. Accuracy with which
grid references have been assigned to postcodes has increased over time as
methodologies have changed with technology advances.
• An overall increase in accuracy of georeferencing over time.
• Instance location change may therefore operate at multiple scales – a local
change due to changes in georeferencing plus a larger change brought about
by recycling.
Phase III – QA - Verification
• Summary statistics for all instances:
• 75% of postcodes with multiple instances record no change in location
whatsoever.
• Of those that do exhibit location change, in 90% of cases this was between
1m and 3km with the remaining cases exhibiting a change of up to 500km.
• Clearly it would be useful if we had a spatial threshold (like the 3 year
temporal threshold) that we could use to decide whether 2 instances should
be merged or kept separate as genuine reuses.
• We argue that using a combination of temporal & spatial measures of
similarity it is possible to discriminate between genuine and spurious
instances.
Phase III – QA - Verification
• Research has only recently begun to engage with this problem. Progress has
been hindered by the size of the datasets involved and the pain involved in
isolating indicative cases.
• Significant time has been invested in exploring the problem but we are by no
means experts - we need feedback - does this methodology seem
appropriate - are our core assumptions logical?
• Plans are to explore the effects of applying different threshold values - using
known cases of reuse to inform selection of threshold value.
• Pick a threshold value - determine the effects of applying this to the dataset
as a whole, e.g. in terms of the number of merges that this yields, taking
samples to determine the validity of results - are instances inappropriately
merged?
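
Exploring candidate thresholds could be as simple as counting how many instance pairs each value would merge. The sketch below is an assumption about how such a sweep might look: it takes a list of distances in metres between consecutive instances of the same postcode and treats a pair as mergeable when its distance is at or below the threshold. The toy distances and thresholds are illustrative only.

```python
def merges_by_threshold(distances, thresholds):
    """Return {threshold: number of instance pairs at or within that distance}."""
    return {t: sum(1 for d in distances if d <= t) for t in thresholds}


# Toy distances in metres between consecutive instances (illustrative only).
toy_distances = [0.0, 0.0, 50.0, 120.0, 2500.0, 410000.0]
sweep = merges_by_threshold(toy_distances, [0, 100, 1000, 5000])
```

Plotting the merge count against the threshold, and sampling the merged pairs at each value, is one way to see where inappropriate merges start to appear.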
Phase III – QA - Verification
• Demonstrate application of these rules by going back to the Spurious cases
we looked at earlier.
• Case A - using our temporal rule of 3 years - these 6 could be compressed to
3 instances. Using our spatial rule (assuming that our upper spatial threshold
exceeds 100m) these could be compressed to a single instance.
Phase III – QA - Verification
• Case B - the inconsistent instance must either be terminated or merged with
another instance. Applying the temporal rule it could be merged with the
following instance. However its location is quite different and so we might decide
that this falls outside our threshold and so instead we might terminate it with
the start date of the following instance.
Phase IV – Create QA Instance DB
At some point in order to move forward we are going to have to proceed,
implement the rules from phase 3 and carry out the updates to the instance
database.
• In doing this we run the risk of going in one of two directions - we can
either be too inclusive, leading to too many instances being merged together,
or not inclusive enough, with too few instances merged together.
• We intend to be pragmatic about this - we simply cannot have so many
possibly false instances associated with each postcode. It is unlikely that we
are going to be able to resolve all cases.
• Once the rules are in place, implementation of them should be fairly
straightforward.
Creation of Historic Snapshots
• With the Quality Controlled Instance database in place, yearly historic versions
of the postcode directories can then be derived by pulling out all instances that
exist within a particular time slice.
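
Pulling out a time slice can be sketched as a filter on the instance lifetimes. The tuple layout `(postcode, doi, dot)` is hypothetical, and so is the boundary choice: here an instance terminated exactly on the snapshot date is treated as no longer existing.

```python
from datetime import date


def snapshot(instances, as_of):
    """Return the instances that exist on the given date.

    An instance exists if it was introduced on or before as_of and is either
    still live (dot is None) or terminated after as_of."""
    return [
        (postcode, doi, dot)
        for postcode, doi, dot in instances
        if doi <= as_of and (dot is None or dot > as_of)
    ]


# Toy instance table with hypothetical labels.
toy_instances = [
    ("ZZ1 1ZZ", date(1980, 1, 1), date(1990, 6, 1)),
    ("ZZ1 1ZZ", date(1994, 6, 1), None),
    ("ZZ9 9ZZ", date(1985, 3, 1), None),
]
```

Running the same filter once per year would give the yearly historic versions.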
Outstanding Issues
• Reconciling the spurious instances is still an ongoing task.
• We would welcome comments/feedback, both from ONS and from other expert
users of the AFPD, about the assumptions/methodologies we have chosen to
adopt.
• Is there any documentation which might shed light on procedures used to
update the datasets in the past & might explain some of the systematic
inconsistencies we have discovered?
Conclusions
• 1. Historical & Contemporary postcode directory datasets are being accessed
by academic users through UKBORDERS.
• 2. QA process: data has been received and loaded - raw instance database
has been created.
• 3. Quality Assurance Audit has been carried out - quality of dataset has been
assessed.
• 4. Significant Progress has been made in reconciling inconsistencies, but work
remains before derived data can be created and exposed to user community.
• 5. Feedback on work to date and input from other users is requested in
order to bring work to a close.
Contact Details
• http://edina.ac.uk/
• james.crone@ed.ac.uk
• Questions?

More Related Content

Similar to Historic Postcode Directories

Final presentation
Final presentationFinal presentation
Final presentation
Amogh Hajare
 
Cleansing land ownership data, an FME use case - David Eagle
Cleansing land ownership data, an FME use case - David EagleCleansing land ownership data, an FME use case - David Eagle
Cleansing land ownership data, an FME use case - David Eagle
Association for Geographic Information (AGI)
 
Internet of things - 3/4. Solving the problems
Internet of things - 3/4. Solving the problemsInternet of things - 3/4. Solving the problems
Internet of things - 3/4. Solving the problems
Sumanth Bhat
 
Update of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsUpdate of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agents
Vrije Universiteit Amsterdam
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
DataStax Academy
 
TECHNIP
TECHNIPTECHNIP
Tracking Project WBS
Tracking Project WBSTracking Project WBS
Tracking Project WBS
Cameron
 
Data Wrangling: Working with Date / Time Data and Visualizing It
Data Wrangling: Working with Date / Time Data and Visualizing ItData Wrangling: Working with Date / Time Data and Visualizing It
Data Wrangling: Working with Date / Time Data and Visualizing It
kanaugust
 
CN Module 5 part 2 2022.pdf
CN Module 5 part 2 2022.pdfCN Module 5 part 2 2022.pdf
CN Module 5 part 2 2022.pdf
MayankRaj687571
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
Paris Data Engineers !
 
FME & Governement
FME & GovernementFME & Governement
FME & Governement
Safe Software
 
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Bernardo Najlis
 
Asufe juniors-training session2
Asufe juniors-training session2Asufe juniors-training session2
Asufe juniors-training session2
Omar Ahmed
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Tutorial(release)
Tutorial(release)Tutorial(release)
Tutorial(release)
Oshin Hung
 
Chicago AWS user group - Raja Dheekonda: Replatforming ML
Chicago AWS user group - Raja Dheekonda: Replatforming MLChicago AWS user group - Raja Dheekonda: Replatforming ML
Chicago AWS user group - Raja Dheekonda: Replatforming ML
AWS Chicago
 
Anomaly detection - TIBCO Data Science Central
Anomaly detection - TIBCO Data Science CentralAnomaly detection - TIBCO Data Science Central
Anomaly detection - TIBCO Data Science Central
Michael O'Connell
 
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Gloria Re Calegari
 
Network Detroit 9/25/15
Network Detroit 9/25/15Network Detroit 9/25/15
Network Detroit 9/25/15
Ellice Engdahl
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
Subhas Kumar Ghosh
 

Similar to Historic Postcode Directories (20)

Final presentation
Final presentationFinal presentation
Final presentation
 
Cleansing land ownership data, an FME use case - David Eagle
Cleansing land ownership data, an FME use case - David EagleCleansing land ownership data, an FME use case - David Eagle
Cleansing land ownership data, an FME use case - David Eagle
 
Internet of things - 3/4. Solving the problems
Internet of things - 3/4. Solving the problemsInternet of things - 3/4. Solving the problems
Internet of things - 3/4. Solving the problems
 
Update of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsUpdate of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agents
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
TECHNIP
TECHNIPTECHNIP
TECHNIP
 
Tracking Project WBS
Tracking Project WBSTracking Project WBS
Tracking Project WBS
 
Data Wrangling: Working with Date / Time Data and Visualizing It
Data Wrangling: Working with Date / Time Data and Visualizing ItData Wrangling: Working with Date / Time Data and Visualizing It
Data Wrangling: Working with Date / Time Data and Visualizing It
 
CN Module 5 part 2 2022.pdf
CN Module 5 part 2 2022.pdfCN Module 5 part 2 2022.pdf
CN Module 5 part 2 2022.pdf
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
 
FME & Governement
FME & GovernementFME & Governement
FME & Governement
 
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
 
Asufe juniors-training session2
Asufe juniors-training session2Asufe juniors-training session2
Asufe juniors-training session2
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Tutorial(release)
Tutorial(release)Tutorial(release)
Tutorial(release)
 
Chicago AWS user group - Raja Dheekonda: Replatforming ML
Chicago AWS user group - Raja Dheekonda: Replatforming MLChicago AWS user group - Raja Dheekonda: Replatforming ML
Chicago AWS user group - Raja Dheekonda: Replatforming ML
 
Anomaly detection - TIBCO Data Science Central
Anomaly detection - TIBCO Data Science CentralAnomaly detection - TIBCO Data Science Central
Anomaly detection - TIBCO Data Science Central
 
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
 
Network Detroit 9/25/15
Network Detroit 9/25/15Network Detroit 9/25/15
Network Detroit 9/25/15
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 

Recently uploaded

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 

Recently uploaded (20)

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 

Historic Postcode Directories

  • 1. Historic Postcode Directories - Progress and Plans Postcode GeoReferencing User Group, 5th April James Crone, EDINA.
  • 2. Overview • About EDINA • Project Background and Context • Progress To Date • Plans for coming months • Outstanding Issues
  • 3. EDINA • A JISC funded national data centre based at Edinburgh University Data Library. • Provides the UK tertiary education and research community online access to a library of data, information and research resources. • The largest section of which (Geo Data Services), comprised of GIS Specialists and Software Engineers provides access to 2 key online services - Digimap & UKBORDERS. • We and our user community have an interest in both contemporary and historical postcode products.
  • 4. Background & Context • What are the historical postcode directories? - datasets which list all unit postcodes within the UK and assigns to them a national grid reference, geographic lookups and counts of assigned addresses. • ESRC has purchased Gridlinked versions of AFPD (2001-2006) for use by the academic community. This community also has an interest in historic versions of the AFPD and thus ONS supplied to ESRC historic postcode directories (1980-2000) for free on the basis that ESRC would QA the historic versions. • At this point all versions of postcode directories received by ESRC have been available to users through the EDINA UKBORDERS service since October 2004. • Steady stream of user downloads. Data for census years most popular but interestingly significant interest in non-census years.
  • 5. Deliverables • Objectives/Deliverables of the QA set out formally in August 2004 MOU between ESRC & ONS: • Key Deliverable is a Quality Controlled postcode instance database spanning 1980 to present day. From this ESRC will derive snapshot historical versions of the postcode directories replacing the versions of unknown quality that are currently in existence. • Postcode Instance - defined as the existence of a postcode for a certain period of time which is unique on both postcode label and date of introduction. • Postcode Instance = Postcode Label + Date of Introduction • Instance db will have number of fields – DOI, DOT, most recent easting & northing and higher geography lookups (1991 ED/OA; 1998 Ward; 2001 OA). • The ONS Ward History Database will be used to check the veracity of ward codes within the historic versions of the postcode directories.
  • 6. Progress to Date • 4 sequential work phases to complete these objectives: • I. Data Loading (complete) • II. Quality Assurance I - Audit (complete) • III. Quality Assurance II - Verification (in progress) • IV. Production of Historic Snapshots • At this point first 2 of these are complete and we are currently engaged in the verification phase. • ... Taking each phase in turn
  • 7. Phase I – Data Loading • Postcode directories were supplied by ONS from 1980 to present day. • Origin of data varies: • Central Postcode Directories: 1980 - 1990 (except 1989) • AFPDs: 1991 - 1998 (except 1996 & 1997) • NHSPD: 1996 & 1997 • AFPD (NHS Variant): 1999 • AFPD (Gridlink version): 2000 • + Gridlink versions of AFPD from 2001 to current release. • With the exception of 1989, a complete set, quite remarkable given that digital curation & preservation are a fairly recent concern.
  • 8. Phase I – Data Loading • We took each historic version, loaded it into its own database table (database used is PostgreSQL) & then merged each year's table into a super table giving all postcodes from all versions of the AFPD. • Given the differing origins of the year tables and the tendency for the number of attributes to increase over time, the harmonisation of these snapshots was itself an "interesting" data management challenge. For practical purposes fields were distilled down to a core set. • The super table was reduced to a table of distinct postcode labels (giving the labels of all postcodes since 1980) and then to the more valuable postcode instance table. • Composite merged table - 50,986,078 rows • Distinct postcode unit table - 2,330,886 rows • Postcode Instance table - 2,763,839 rows
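The reduction from merged rows to distinct postcodes and then to postcode instances can be sketched in a few lines. This is a minimal illustration, not the project's actual PostgreSQL procedure: the `Row` fields, the YYYYMM date encoding and the postcode labels are all made up for the example.

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    """One record of the merged table (hypothetical core fields)."""
    version: int   # year of the directory version the row came from
    postcode: str  # unit postcode label
    doi: int       # date of introduction, assumed to be encoded as YYYYMM


def distinct_postcodes(rows: Iterable[Row]) -> set[str]:
    """All postcode labels that appear in any version."""
    return {r.postcode for r in rows}


def postcode_instances(rows: Iterable[Row]) -> set[tuple[str, int]]:
    """Instances are unique on (postcode label, date of introduction)."""
    return {(r.postcode, r.doi) for r in rows}


# Invented sample rows: the same label appears with two different DOIs.
merged = [
    Row(1980, "AA1 1AA", 198001),
    Row(1981, "AA1 1AA", 198001),
    Row(1995, "AA1 1AA", 199406),
    Row(1995, "AA1 1AB", 198001),
]

labels = distinct_postcodes(merged)
instances = postcode_instances(merged)
```

The same reduction in SQL would be a `SELECT DISTINCT` over the label, and over the label plus DOI, respectively.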
  • 9. Phase I – Data Loading • By itself Date of Introduction only tells us when a postcode was instantised. In order to be able to examine the lifecycle of each instance we also need to know if this instance has been terminated or is still live. • To each instance we attempted to add a Date Of Termination (DOT) by searching through each of the historic AFPD version tables and determining if the instance was terminated. Not a trivial task given volumes of data and number of searches required. • At the same time each instance also had associated with it its latest grid reference. • Instance database is therefore quite rich as it holds both the temporal and spatial history for the instances associated with a postcode.
  • 10. Phase II – Quality Assurance (Audit) • Rationale for Quality Assurance – The quality of the instance database will be propagated to derived products, therefore it is essential that we have an understanding of which instances are genuine, which can be regarded as spurious and which may need to be fixed or weeded out. • First Step – Analysis of the frequency of instances associated with distinct postcodes. • Frequency of instances associated with distinct postcodes (number of postcode instances: frequency): 1: 2,379,140; 2: 343,995; 3: 34,986; 4: 4,839; 5: 571; 6: 85; 7: 27; 8: 26; 9: 138; 10: 18; 11: 8; 12: 2; 13: 4 • Straightaway can see that in some cases distinct postcodes have multiple instances associated with them.
  • 11. Phase II – Quality Assurance (Audit) • Majority of postcodes represented by only a single instance. But significant number of postcodes have multiple instances associated with them – why? • Genuine Postcode Recycling • Spurious Instances due to imputation problems or systematic tablewide update procedures in past versions (e.g. update for all Scottish 1973 instances in 1980 table). • Expected vs. Divergent Cases.
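A frequency analysis of this kind can be sketched as a count of counts. The snippet below is one plausible way to tabulate how many postcodes have 1, 2, 3... instances; it assumes instances are (label, DOI) pairs with invented values, and it may not reproduce exactly how the table on the slide was tallied.

```python
from collections import Counter


def instance_frequency(instances):
    """Map 'number of instances per postcode' -> 'number of postcodes'."""
    # De-duplicate first, since an instance is unique on (label, DOI).
    per_postcode = Counter(postcode for postcode, _doi in set(instances))
    return Counter(per_postcode.values())


# Invented sample: one postcode with two instances, two with a single one.
sample = [
    ("AA1 1AA", 198001),
    ("AA1 1AA", 199406),
    ("AA1 1AB", 198001),
    ("AA1 1AC", 198001),
    ("AA1 1AC", 198001),  # duplicate row, same instance
]

freq = instance_frequency(sample)
```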
  • 12. Phase II – Quality Assurance (Audit)
  • 13. Phase II – Quality Assurance (Audit)
  • 14. Phase II – Quality Assurance (Audit) • Programmatic tests were designed to flag cases in the Instance database which diverged from what we expected. • Do this by taking each postcode in turn and examining the timelines associated with its instances. Errors grouped into 3 types: • Type I - in which the DOI = DOT (the instance is instantised & terminated at the same point in time) • Type II – (A) in which all instances of the postcode are live or (B) there are other inconsistencies within the timeline such as blank dates of termination within a sequence of instances. • Type III - multiple dates of termination - postcode instantised once but has multiple dates of termination • The naming of these errors is a convenience – not to be confused with Type I/II errors in Statistics!
  • 15. Phase II – Quality Assurance (Audit) • [Bar chart: count of spurious instances by type. I: 3,558; II.A: 347,828; II.B: 206,001; III: 4,448]
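The three error types could be flagged per postcode along the following lines. This is a sketch under stated assumptions rather than the tests actually used: records are (DOI, DOT) pairs encoded as YYYYMM, a blank DOT is `None`, and Type II.A is only raised when a postcode has more than one instance, since a single live instance is the expected case.

```python
from typing import Optional

Record = tuple[int, Optional[int]]  # (DOI, DOT) as YYYYMM; DOT None = live


def classify(records: list[Record]) -> set[str]:
    """Return the set of error types flagged for one postcode."""
    flags: set[str] = set()

    # Type I: introduced and terminated at the same point in time.
    if any(dot is not None and doi == dot for doi, dot in records):
        flags.add("I")

    # Collect the distinct non-blank DOTs recorded against each DOI.
    dots_by_doi: dict[int, set[int]] = {}
    for doi, dot in records:
        if dot is not None:
            dots_by_doi.setdefault(doi, set()).add(dot)

    # Type III: instantised once but with several dates of termination.
    if any(len(dots) > 1 for dots in dots_by_doi.values()):
        flags.add("III")

    # Type II: look at the timeline of instances ordered by DOI.
    dois = sorted({doi for doi, _ in records})
    live = [doi not in dots_by_doi for doi in dois]
    if len(dois) > 1 and all(live):
        flags.add("II.A")  # every instance is live
    elif any(live[:-1]):
        flags.add("II.B")  # blank DOT before the final instance
    return flags


expected_case = classify([(198001, 198501), (199001, None)])
```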
  • 16. Phase II – Quality Assurance (Audit) • As we can see the Type II error cases represent the bulk of the errors so effort has been directed at identifying different varieties of this type of error. We will spend a few minutes examining two such examples now.
  • 17. Phase II – Quality Assurance (Audit) • Case A • 6 instances never with a date of termination - conflict immediately after the first case. • Is it valid for there to be so many postcodes which have multiple live instances? • Are all of these cases a result of postcode recycling or are they in fact due to inconsistencies within the dataset itself?
  • 18. Phase II – Quality Assurance (Audit) • Case B • Again we have 6 instances - this time there is a blank date of termination within the timeline (which conflicts with the latter 2 instances)
  • 19. Phase II – Quality Assurance (Audit) • Why are these a problem? - when we create the historic cuts we don't want any ambiguity. • need to be sure that all live postcodes are truly live (and should not have been terminated). • that where a postcode has multiple instances associated with it, these are genuine and not a result of problems with how the data was created or updated. • that all data is as consistent as possible. • How to reconcile these Spurious cases?
  • 20. Phase III – QA - Verification • Type I errors - unclear - we can't see any logic behind this - to which we ask is it valid for an instance to be introduced and terminated in the same month? • Type II errors - problem less clear cut as we have already seen - different species of the same problem causing instances to diverge from the expected norm. • Type III errors - multiple dates of termination - As a rule, pick either the earliest OR latest and apply to all cases • Mainly concerned in rest of presentation with dealing with the Type II errors. • Key Assumption – Instance database holds information about the location of each instance in space and time. Instances which are similar in both these respects can be merged.
  • 21. Phase III – QA - Verification
  • 22. Phase III – QA - Verification • Time - According to Royal Mail: • A postcode is only supposed to be reused after a minimum period of 3 years has elapsed & residential postcodes are never reused. • On this basis where we have 2 instances which are instantised within less than 3 years of one another we can assume that they are referring to the same thing.
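The 3-year temporal rule can be sketched as a grouping pass over the DOIs of one postcode. The YYYYMM encoding and the function names are assumptions for illustration. Each DOI is compared with the previous one in the sorted sequence, which is one of several reasonable choices (comparing against the first DOI of the group would give stricter groups).

```python
def months(yyyymm: int) -> int:
    """Convert a YYYYMM integer to a running month count."""
    year, month = divmod(yyyymm, 100)
    return year * 12 + month


def group_by_time(dois, min_gap_years=3):
    """Group DOIs that fall within less than `min_gap_years` of the previous one."""
    groups = []
    for doi in sorted(dois):
        if groups and months(doi) - months(groups[-1][-1]) < min_gap_years * 12:
            groups[-1].append(doi)   # too soon to be a genuine reuse
        else:
            groups.append([doi])     # far enough apart: start a new group
    return groups


# Invented DOIs for a single postcode.
groups = group_by_time([198001, 198106, 199001, 199112, 200001])
```

Each resulting group is a candidate for merging into a single instance.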
  • 23. Phase III – QA - Verification Space (Geography) • Nearby things tend to be more similar than things that are further apart. • Instances located close to one another likely reference the same set of addresses. Instances located further apart may represent recycling events. • For a postcode we can see how its instances change in position over time - are they spatially stationary or more dynamic? • How to quantify this within the instance table? - for each set of instances associated with a postcode unit compute change in easting & northing between instances.
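Computing the change in easting and northing between successive instances is a small calculation. The sketch below assumes each instance is a (DOI, easting, northing) tuple with national grid coordinates in metres; the coordinates are invented.

```python
import math


def displacements(points):
    """Return (d_easting, d_northing, distance) between consecutive instances.

    `points` is a list of (doi, easting, northing); instances are ordered by DOI.
    """
    ordered = sorted(points)
    moves = []
    for (_, e0, n0), (_, e1, n1) in zip(ordered, ordered[1:]):
        de, dn = e1 - e0, n1 - n0
        moves.append((de, dn, math.hypot(de, dn)))  # straight-line distance in metres
    return moves


# Invented example: the second instance sits 3 m east and 4 m north of the first.
moves = displacements([(199001, 325003, 673004), (198001, 325000, 673000)])
```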
  • 24. Phase III – QA - Verification • BUT we need to be aware of the spatial accuracy issue. Accuracy with which grid references have been assigned to postcodes has increased over time as methodologies have changed with technology advances. • An overall increase in accuracy of georeferencing over time. • Instance location change may therefore operate at multiple scales – a local change due to changes in georeferencing plus a larger change brought about by recycling.
  • 25. Phase III – QA - Verification • Summary statistics for all instances: • 75% of postcodes with multiple instances record no change in location whatsoever. • Of those that do exhibit location change, in 90% of cases this was between 1m and 3km with the remaining cases exhibiting a change of up to 500km. • Clearly it would be useful if we had a spatial threshold (like the 3 year temporal threshold) that we could use to decide whether 2 instances should be merged or kept separate as genuine reuses. • We argue that using a combination of temporal & spatial measures of similarity it is possible to discriminate between genuine and spurious instances.
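A combined temporal and spatial comparison of two instances might look like the following. This is only one possible combination: the slides do not fix a spatial threshold, so `max_distance_m` is a required, hypothetical parameter, and the three-way outcome ("merge", "keep separate", "review") is an assumption made for the sketch.

```python
import math


def months(yyyymm: int) -> int:
    """Convert a YYYYMM integer to a running month count."""
    year, month = divmod(yyyymm, 100)
    return year * 12 + month


def compare(a, b, max_distance_m, max_gap_months=36):
    """Compare two instances given as (doi, easting, northing) tuples."""
    (doi_a, e_a, n_a), (doi_b, e_b, n_b) = a, b
    close_in_time = abs(months(doi_b) - months(doi_a)) < max_gap_months
    close_in_space = math.hypot(e_b - e_a, n_b - n_a) <= max_distance_m
    if close_in_time and close_in_space:
        return "merge"          # similar in both respects
    if not close_in_time and not close_in_space:
        return "keep separate"  # plausibly a genuine reuse
    return "review"             # the two measures disagree


# Invented instances; 250 m is a placeholder threshold, not a recommended value.
verdict = compare((198001, 325000, 673000), (198106, 325003, 673004), max_distance_m=250.0)
```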
  • 26. Phase III – QA - Verification • Research has only recently begun to engage with this problem; progress has been hindered by the size of the datasets involved and the pain involved in isolating indicative cases. • Significant time has been invested in exploring the problem but we are by no means experts - we need feedback - does this methodology seem appropriate - are our core assumptions logical? • Plans are to explore the effects of applying different threshold values - using known cases of reuse to inform selection of threshold value. • Pick a threshold value - determine the effects of applying this to the dataset as a whole, e.g. the number of merges that this yields, taking samples to determine the validity of results - are instances inappropriately merged?
  • 27. Phase III – QA - Verification
  • 28. Phase III – QA - Verification • Demonstrate application of these rules by going back to the Spurious cases we looked at earlier. • Case A - using our temporal rule of 3 years - these 6 could be compressed to 3 instances. Using our spatial rule (assuming that our upper spatial threshold exceeds 100m) these could be compressed to a single instance.
  • 29. Phase III – QA - Verification • Case B - the inconsistent instance must either be terminated or merged with another instance. Applying the temporal rule it could be merged with the following instance. However its location is quite different and so we might decide that this falls outside our threshold and so instead we might terminate it with the start date of the following instance.
  • 30. Phase IV – Create QA Instance DB • At some point in order to move forward we are going to have to proceed, implement the rules from phase 3 and carry out the updates to the instance database. • In doing this we run the risk of going in one of two directions - we can either be too inclusive, leading to too many instances being merged together, or not inclusive enough, with not enough instances merged together. • We intend to be pragmatic about this - we simply cannot have so many possibly false instances associated with each postcode. Unlikely that we are going to be able to resolve all cases. • Once the rules are in place, implementation of them should be fairly straightforward.
  • 31. Creation of Historic Snapshots • With Quality Controlled Instance database in place, yearly historic version of the postcode directories can then be derived by pulling out all instances that exist within a particular time slice.
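Deriving a snapshot from the instance database amounts to a filter on the instance lifetimes. The sketch below assumes instances are (postcode, DOI, DOT) tuples with YYYYMM dates and `None` for a blank DOT, and it treats an instance terminated in the snapshot month as no longer live; both the encoding and that boundary choice are assumptions.

```python
def snapshot(instances, as_of):
    """Return the instances that exist at the YYYYMM time slice `as_of`."""
    return [
        (postcode, doi, dot)
        for postcode, doi, dot in instances
        if doi <= as_of and (dot is None or dot > as_of)
    ]


# Invented instance table: one postcode is terminated and later reused.
instance_db = [
    ("AA1 1AA", 198001, 198912),
    ("AA1 1AA", 199506, None),
    ("AA1 1AB", 198001, None),
]

cut_1991 = snapshot(instance_db, 199101)
cut_1996 = snapshot(instance_db, 199601)
```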
  • 32. Outstanding Issues • Reconciling the spurious instances is still an ongoing task. • We would welcome comments/feedback about the assumptions/methodologies we have chosen to adopt both from ONS and from other expert users of the AFPD. • Is there any documentation which might shed light on procedures used to update the datasets in the past & might explain some of the systematic inconsistencies we have discovered?
  • 33. Conclusions • 1. Historical & Contemporary postcode directory datasets are being accessed by academic users through UKBORDERS. • 2. QA process data has been received and loaded - raw instance database has been created. • 3. Quality Assurance Audit has been carried out - quality of dataset has been assessed. • 4. Significant progress has been made in reconciling inconsistencies, but work remains before derived data can be created and exposed to user community. • 5. Feedback on work to date and input from other users is requested in order to bring work to a close.
  • 34. Contact Details • http://edina.ac.uk/ • james.crone@ed.ac.uk • Questions?