The document discusses the progress made on a project to quality assure and create historic snapshots of UK postcode directories from 1980 to the present. It outlines the following:
1) Four sequential phases of work, the first two of which (loading the raw data and auditing for errors) are complete, with a third in progress: developing a methodology to verify instances and reconcile inconsistencies based on temporal and spatial thresholds.
2) Common error types identified include instances with the same introduction and termination dates (Type I) and instances lacking termination dates or with inconsistent timelines (Type II).
3) Plans to finalize the quality assurance rules, update the instance database, and then derive the historic snapshots from the quality controlled data.
4) Outstanding issues include reconciling remaining
2. Overview
• About EDINA
• Project Background and Context
• Progress To Date
• Plans for coming months
• Outstanding Issues
3. EDINA
• A JISC funded national data centre based at Edinburgh University Data
Library.
• Provides the UK tertiary education and research community online access to a
library of data, information and research resources.
• The largest section (Geo Data Services), made up of GIS
Specialists and Software Engineers, provides access to 2 key online services -
Digimap & UKBORDERS.
• We and our user community have an interest in both contemporary and
historical postcode products.
4. Background & Context
• What are the historical postcode directories? - datasets which list all unit
postcodes within the UK and assign to them a national grid reference,
geographic lookups and counts of assigned addresses.
• ESRC has purchased Gridlinked versions of AFPD (2001-2006) for use by the
academic community. This community also has an interest in historic versions
of the AFPD and thus ONS supplied to ESRC historic postcode directories
(1980-2000) for free on the basis that ESRC would QA the historic versions.
• At this point all versions of postcode directories received by ESRC have been
available to users through the EDINA UKBORDERS service since October
2004.
• Steady stream of user downloads. Data for census years most popular but
interestingly significant interest in non-census years.
5. Deliverables
• Objectives/Deliverables of the QA set out formally in August 2004 MOU
between ESRC & ONS:
• Key Deliverable is a Quality Controlled postcode instance database spanning
1980 to present day. From this ESRC will derive snapshot historical versions of
the postcode directories replacing the versions of unknown quality that are
currently in existence.
• Postcode Instance - defined as the existence of a postcode for a certain period
of time which is unique on both postcode label and date of introduction.
• Postcode Instance = Postcode Label + Date of Introduction
• Instance db will have a number of fields – DOI (date of introduction), DOT (date of
termination), most recent easting & northing and higher geography lookups
(1991 ED/OA; 1998 Ward; 2001 OA).
• The ONS Ward History Database will be used to check the veracity of ward
codes within the historic versions of the postcode directories.
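The instance definition above can be pictured as a small record type. This is only an illustrative sketch: the class name, field names and example postcode are assumptions, and the higher geography lookups are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class PostcodeInstance:
    postcode: str                    # postcode label (hypothetical example below)
    doi: date                        # date of introduction
    dot: Optional[date] = None       # date of termination; None means still live
    easting: Optional[int] = None    # most recent grid reference
    northing: Optional[int] = None

    @property
    def key(self) -> Tuple[str, date]:
        # An instance is unique on postcode label + date of introduction.
        return (self.postcode, self.doi)

    @property
    def is_live(self) -> bool:
        return self.dot is None


first = PostcodeInstance("AB1 2CD", date(1985, 6, 1), dot=date(1994, 3, 1))
reuse = PostcodeInstance("AB1 2CD", date(1998, 1, 1))
```

Here `first` and `reuse` share a label but have different keys, so they count as two instances of one postcode.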
6. Progress to Date
• 4 sequential work phases to complete these objectives:
• I. Data Loading (complete)
• II. Quality Assurance I - Audit (complete)
• III. Quality Assurance II - Verification (in progress)
• IV. Production of Historic Snapshots
• At this point first 2 of these are complete and we are currently engaged in
the verification phase.
• ... Taking each phase in turn
7. Phase I – Data Loading
• Postcode directories were supplied by ONS from 1980 to present day.
• Origin of data varies:
• Central Postcode Directories: 1980 - 1990 (except 1989)
• AFPDs: 1991 - 1998 (except 1996 & 1997)
• NHSPD: 1996 & 1997
• AFPD (NHS Variant): 1999
• AFPD (Gridlink version): 2000
• + Gridlink versions of AFPD from 2001 to current release.
• With the exception of 1989, a complete set, quite remarkable given that
digital curation & preservation is a fairly recent concern.
8. Phase I – Data Loading
• We took each historic version, loaded it into its own
database table (database used is PostgreSQL) &
then merged each year's table into a super table
giving all postcodes from all versions of the AFPD.
• Given the differing origins of the year tables and the
tendency for number of attributes to increase over
time, the harmonisation of these snapshots itself
was an "interesting" data management challenge.
For practical purposes fields were distilled down to a
core set.
• The super table was reduced to a table with distinct
postcode labels (giving the labels of all postcodes
since 1980) and then to the more valuable postcode
instance table.
• Composite merged table - 50,986,078 rows
• Distinct postcode unit table - 2,330,886 rows
• Postcode Instance table - 2,763,839 rows
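The reduction from merged table to distinct labels to instances can be sketched in a few lines. The actual work was done in PostgreSQL; the Python below is a stand-in, and the row layout and sample values are made up for illustration.

```python
from datetime import date

# Hypothetical rows from the merged "super table": (version_year, postcode, doi)
merged_rows = [
    (1980, "AB1 2CD", date(1975, 1, 1)),
    (1981, "AB1 2CD", date(1975, 1, 1)),
    (1995, "AB1 2CD", date(1994, 6, 1)),
    (1995, "AB1 2CE", date(1990, 3, 1)),
]


def distinct_postcodes(rows):
    # One entry per postcode label, regardless of version or introduction date.
    return sorted({postcode for _, postcode, _ in rows})


def distinct_instances(rows):
    # One entry per (postcode label, date of introduction) pair.
    return sorted({(postcode, doi) for _, postcode, doi in rows})
```

With these sample rows there are two distinct postcodes but three instances, because "AB1 2CD" appears with two different dates of introduction.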
9. Phase I – Data Loading
• By itself Date of Introduction only tells us when a postcode was instantised.
In order to be able to examine the lifecycle of each instance we also need to
know if this instance has been terminated or is still live.
• To each instance we attempted to add a Date Of Termination (DOT) by
searching through each of the historic AFPD version tables and determining if
the instance was terminated. Not a trivial task given volumes of data and
number of searches required.
• At the same time each instance also had its latest grid
reference associated with it.
• Instance database is therefore quite rich as it holds both the temporal and
spatial history for the instances associated with a postcode.
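One way to picture the search through the version tables is a single pass that collects, for each instance, every termination date seen and the grid reference from the most recent version. The sketch assumes each version row carries a DOT field (blank if live) and a grid reference; the slides do not say how termination was actually determined, so treat the row layout and logic as hypothetical.

```python
from datetime import date

# Hypothetical rows: (version_year, postcode, doi, dot_or_None, easting, northing)
version_rows = [
    (1990, "AB1 2CD", date(1985, 1, 1), None, 325000, 673000),
    (1995, "AB1 2CD", date(1985, 1, 1), date(1994, 6, 1), 325010, 673020),
    (2000, "AB1 2CE", date(1998, 1, 1), None, 400000, 500000),
]


def build_instances(rows):
    instances = {}
    # Process versions oldest first so the last grid reference written is the latest.
    for year, postcode, doi, dot, easting, northing in sorted(rows, key=lambda r: r[0]):
        record = instances.setdefault((postcode, doi), {"dots": set(), "grid": None})
        if dot is not None:
            record["dots"].add(dot)            # keep every termination date observed
        record["grid"] = (easting, northing)   # later versions overwrite earlier ones
    return instances


instances = build_instances(version_rows)
```

Keeping the full set of observed termination dates, rather than a single value, makes it possible to spot instances with more than one date of termination later on.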
10. Phase II – Quality Assurance (Audit)
• Rationale for Quality Assurance – The quality of the instance database will be
propagated to derived products therefore essential that we have an understanding of
which instances are genuine and which can be regarded as spurious and which may
need to be fixed or weeded out.
• First Step – Analysis of the frequency of instances associated with distinct postcodes.
• Frequency of instances associated with distinct postcodes:
Num of postcode instances : Frequency
1 : 2,379,140
2 : 343,995
3 : 34,986
4 : 4,839
5 : 571
6 : 85
7 : 27
8 : 26
9 : 138
10 : 18
11 : 8
12 : 2
13 : 4
• Straight away we can see that in some cases distinct postcodes have multiple instances
associated with them.
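A frequency table of this kind can be produced by counting instances per postcode and then counting how often each count occurs. A minimal sketch, assuming instances are held as (postcode, DOI) pairs; the function name is illustrative.

```python
from collections import Counter


def instance_frequency(instance_keys):
    # Step 1: number of instances for each postcode label.
    per_postcode = Counter(postcode for postcode, _doi in instance_keys)
    # Step 2: how many postcodes have 1 instance, 2 instances, and so on.
    return Counter(per_postcode.values())


# Hypothetical keys: "A" has two instances, "B" has one.
example = instance_frequency([("A", 1980), ("A", 1995), ("B", 1990)])
```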
11. Phase II – Quality Assurance (Audit)
• Majority of postcodes represented by only a single instance. But significant
number of postcodes have multiple instances associated with them – why?
• Genuine Postcode Recycling
• Spurious Instances due to imputation problems or systematic tablewide
update procedures in past versions (e.g. update for all Scottish 1973
instances in 1980 table).
• Expected vs. Divergent Cases.
14. Phase II – Quality Assurance (Audit)
• Programmatic tests were designed to flag cases in the Instance database
which diverged from what we expected.
• Do this by taking each postcode in turn and examining the timelines
associated with its instances. Errors grouped into 3 types:
• Type I - in which the DOI = DOT (the instance is instantised & terminated at
the same point in time)
• Type II – (A) in which all instances of the postcode are live or (B) there are
other inconsistencies within the timeline such as blank dates of termination
within a sequence of instances.
• Type III - multiple dates of termination - postcode instantised once but has
multiple dates of termination
The naming of these errors is a convenience – not to be confused with Type I/II errors
in Statistics!
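The three error types could be flagged with a test along these lines. It is a sketch only: it assumes each instance of a postcode is held as a (DOI, set of observed DOTs) pair, and it reads Type II(A) as applying only to postcodes with more than one instance. Neither assumption comes from the slides.

```python
from datetime import date


def classify(instances):
    """instances: list of (doi, dots) for ONE postcode, where dots is the set
    of termination dates seen for that instance across all versions."""
    flags = set()
    ordered = sorted(instances, key=lambda inst: inst[0])
    for doi, dots in ordered:
        if doi in dots:
            flags.add("I")       # introduced and terminated at the same time
        if len(dots) > 1:
            flags.add("III")     # one introduction, several terminations
    if len(ordered) > 1:
        live = [not dots for _, dots in ordered]
        if all(live):
            flags.add("II-A")    # every instance of the postcode is live
        elif any(live[:-1]):
            flags.add("II-B")    # blank termination before a later instance
    return flags
```

A postcode whose earlier instance is terminated before a later live instance begins returns an empty set, which corresponds to the expected case.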
16. Phase II – Quality Assurance (Audit)
• As we can see the Type II error cases represent the bulk of the errors so
effort has been directed at identifying different varieties of this type of error.
We will spend a few minutes examining two such examples now.
17. Phase II – Quality Assurance (Audit)
• Case A
• 6 instances never with a date of termination - conflict immediately after the
first case.
• Is it valid for there to be so many postcodes which have multiple live
instances?
• Are all of these cases a result of postcode recycling or are they in fact due to
inconsistencies within the dataset itself?
18. Phase II – Quality Assurance (Audit)
• Case B
• Again we have 6 instances - this time there is a blank date of termination
within the timeline (which conflicts with the latter 2 instances)
19. Phase II – Quality Assurance (Audit)
• Why are these a problem? - when we create the historic cuts we don't want
any ambiguity.
• need to be sure that all live postcodes are truly live (and should not have
been terminated).
• that where a postcode has multiple instances associated with it, these are
genuine and not a result of problems with how the data was created or
updated.
• that all data is as consistent as possible.
• How to reconcile these Spurious cases?
20. Phase III – QA - Verification
• Type I errors - unclear - we can't see any logic behind this - to which we ask
is it valid for an instance to be introduced and terminated in the same month?
• Type II errors - problem less clear cut as we have already seen - different
species of the same problem causing instances to diverge from the expected
norm.
• Type III errors - multiple dates of termination - As a rule, pick either the
earliest OR latest and apply to all cases
• The rest of the presentation is mainly concerned with dealing with the Type II errors.
• Key Assumption – Instance database holds information about the location of
each instance in space and time. Instances which are similar in both these
respects can be merged.
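The Type III rule above (pick either the earliest or the latest date of termination and apply it consistently) amounts to a one-line policy choice. A minimal sketch, with a hypothetical function name and policy flag:

```python
from datetime import date


def resolve_termination(dots, policy="latest"):
    """Collapse several observed termination dates into one."""
    if not dots:
        return None              # no termination recorded: instance stays live
    if policy == "earliest":
        return min(dots)
    if policy == "latest":
        return max(dots)
    raise ValueError(f"unknown policy: {policy!r}")
```

Whichever policy is chosen, the point of the rule is that it is applied to all Type III cases alike.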
22. Phase III – QA - Verification
• Time - According to Royal Mail:
• A postcode is only supposed to be reused after a minimum period of 3 years
has elapsed & residential postcodes are never reused.
• On this basis where we have 2 instances which are instantised within less
than 3 years of one another we can assume that they are referring to the
same thing.
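The temporal rule can be sketched as a grouping of a postcode's dates of introduction. The sketch compares each DOI with the previous one in the group, so a chain of closely spaced instances ends up in one group; that chaining behaviour is an assumption, as the slides do not specify it.

```python
from datetime import date


def within_years(earlier, later, years=3):
    # Tuple comparison avoids leap-day problems when adding whole years.
    return (later.year, later.month, later.day) < (
        earlier.year + years, earlier.month, earlier.day)


def group_by_time(dois, years=3):
    """Group DOIs so that consecutive members are less than `years` apart."""
    groups = []
    for doi in sorted(dois):
        if groups and within_years(groups[-1][-1], doi, years):
            groups[-1].append(doi)    # treated as the same instance
        else:
            groups.append([doi])      # gap of 3+ years: possible genuine reuse
    return groups
```

Each resulting group is a candidate for merging into a single instance; instances introduced exactly 3 years apart or more stay separate.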
23. Phase III – QA - Verification
Space (Geography)
• Nearby things tend to be more similar than things that are further
apart.
• Instances located close to one another likely reference the same set of
addresses. Instances located further apart may represent recycling
events.
• For a postcode we can see how its instances change in position over
time - are they spatially stationary or more dynamic?
• How do we quantify this within the instance table? - for each set of instances
associated with a postcode unit compute change in easting & northing
between instances.
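The change in easting and northing between successive instances, together with the straight-line distance it implies, can be computed as below. This is a sketch assuming grid references are in metres and listed in order of introduction; the function name is illustrative.

```python
import math


def displacements(grid_refs):
    """grid_refs: list of (easting, northing) in metres, in DOI order.
    Returns (delta_easting, delta_northing, distance) per consecutive pair."""
    out = []
    for (e1, n1), (e2, n2) in zip(grid_refs, grid_refs[1:]):
        de, dn = e2 - e1, n2 - n1
        out.append((de, dn, math.hypot(de, dn)))
    return out
```

A postcode whose instances are spatially stationary yields distances of zero throughout.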
24. Phase III – QA - Verification
• BUT we need to be aware of the spatial accuracy issue. Accuracy with which
grid references have been assigned to postcodes has increased over time as
methodologies have changed with technology advances.
• An overall increase in accuracy of georeferencing over time.
• Instance location change may therefore operate at multiple scales – a local
change due to changes in georeferencing plus a larger change brought about
by recycling.
25. Phase III – QA - Verification
• Summary statistics for all instances:
• 75% of postcodes with multiple instances record no change in location
whatsoever.
• Of those that do exhibit location change, in 90% of cases this was between
1m and 3km with the remaining cases exhibiting a change of up to 500km.
• Clearly it would be useful if we had a spatial threshold (like the 3 year
temporal threshold) that we could use to decide whether 2 instances should
be merged or kept separate as genuine reuses.
• We argue that using a combination of temporal & spatial measures of
similarity it is possible to discriminate between genuine and spurious
instances.
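Combining the two measures, a merge decision might look like the following. The 3 year figure comes from the Royal Mail rule quoted earlier; the spatial threshold is deliberately left as a parameter because no value has been chosen, and the dictionary layout and sample coordinates are assumptions.

```python
import math
from datetime import date


def similar_in_time(doi_a, doi_b, years=3):
    earlier, later = sorted((doi_a, doi_b))
    return (later.year, later.month, later.day) < (
        earlier.year + years, earlier.month, earlier.day)


def similar_in_space(grid_a, grid_b, max_metres):
    return math.dist(grid_a, grid_b) <= max_metres


def should_merge(a, b, max_metres, years=3):
    # Merge only when the two instances are close in BOTH time and space.
    return (similar_in_time(a["doi"], b["doi"], years)
            and similar_in_space(a["grid"], b["grid"], max_metres))


a = {"doi": date(1990, 1, 1), "grid": (325000, 673000)}
b = {"doi": date(1991, 6, 1), "grid": (325030, 673040)}   # 50 m away
c = {"doi": date(1991, 6, 1), "grid": (400000, 500000)}   # far away
```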
26. Phase III – QA - Verification
• Research has only recently begun to engage with this problem; progress has
been hindered by the size of the datasets involved and the pain involved in
isolating indicative cases.
• Significant time has been invested in exploring the problem but we are by no
means experts - we need feedback - does this methodology seem
appropriate - are our core assumptions logical?
• Plans are to explore the effects of applying different threshold values - using
known cases of reuse to inform selection of threshold value.
• Pick a threshold value - determine the effects of applying this to the dataset
as a whole, e.g. the number of merges that this yields, taking samples
to determine the validity of results - are instances inappropriately merged?
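Exploring threshold values can start with a simple count of how many candidate pairs each threshold would merge. In this sketch the distances and thresholds are invented for illustration, and the input is assumed to be the distances between pairs of instances that already satisfy the temporal rule.

```python
def merges_by_threshold(pair_distances, thresholds):
    """Count, for each candidate spatial threshold (metres), how many
    instance pairs fall within it and would therefore be merged."""
    return {t: sum(1 for d in pair_distances if d <= t) for t in thresholds}


# Hypothetical distances in metres between consecutive instances.
counts = merges_by_threshold([0, 0, 40, 250, 3000], [100, 500, 5000])
```

Sampling the pairs that change status between two neighbouring thresholds is one way to check whether the larger value merges instances inappropriately.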
28. Phase III – QA - Verification
• Demonstrate application of these rules by going back to the Spurious cases
we looked at earlier.
• Case A - using our temporal rule of 3 years - these 6 could be compressed to
3 instances. Using our spatial rule (assuming that our upper spatial threshold
exceeds 100m) these could be compressed to a single instance.
29. Phase III – QA - Verification
• Case B - the inconsistent instance must either be terminated or merged with
another instance. Applying the temporal rule it could be merged with the
following instance. However its location is quite different and so we might decide
that this falls outside our threshold and so instead we might terminate it with
the start date of the following instance.
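The terminate option described for Case B (closing a blank date of termination with the start date of the following instance) can be sketched as below. The data layout is an assumption and the sample dates are not the values from the Case B slide.

```python
from datetime import date


def close_blank_terminations(instances):
    """instances: list of dicts with 'doi' and 'dot' (None if blank) for one
    postcode. Returns copies in DOI order with blank DOTs closed wherever a
    later instance exists; the final instance is left untouched."""
    ordered = sorted(instances, key=lambda inst: inst["doi"])
    fixed = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        patched = dict(current)
        if patched["dot"] is None and following is not None:
            patched["dot"] = following["doi"]
        fixed.append(patched)
    return fixed


timeline = [
    {"doi": date(1985, 1, 1), "dot": None},   # blank DOT inside the sequence
    {"doi": date(1996, 1, 1), "dot": None},   # latest instance, still live
]
repaired = close_blank_terminations(timeline)
```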
30. Phase IV – Create QA Instance DB
At some point in order to move forward we are going to have to proceed,
implement the rules from phase 3 and carry out the updates to the instance
database.
• In doing this we run the risk of going in one of two directions - we can
either be too inclusive, leading to too many instances being merged together,
or not inclusive enough, with too few instances merged
together.
• We intend to be pragmatic about this - we simply cannot have so many
possibly false instances associated with each postcode. Unlikely that we are
going to be able to resolve all cases.
• Once the rules are in place, implementation of them should be fairly
straightforward.
31. Creation of Historic Snapshots
• With Quality Controlled Instance database in place, yearly historic version of
the postcode directories can then be derived by pulling out all instances that
exist within a particular time slice.
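Deriving a snapshot then reduces to an interval overlap test. The sketch below assumes an instance exists from its DOI up to, but not including, its DOT, and that a time slice is a half-open date range; both conventions are assumptions rather than project decisions.

```python
from datetime import date


def snapshot(instances, start, end):
    """Return instances that exist at some point in the slice [start, end)."""
    return [
        inst for inst in instances
        if inst["doi"] < end and (inst["dot"] is None or inst["dot"] > start)
    ]


# Hypothetical quality controlled instances.
qc_instances = [
    {"postcode": "AB1 2CD", "doi": date(1985, 1, 1), "dot": date(1994, 6, 1)},
    {"postcode": "AB1 2CD", "doi": date(1994, 6, 1), "dot": None},
    {"postcode": "AB1 2CE", "doi": date(2001, 1, 1), "dot": None},
]
cut_1990 = snapshot(qc_instances, date(1990, 1, 1), date(1991, 1, 1))
```

A yearly slice that spans a termination and a reintroduction, such as 1994 in this sample, returns both instances of the postcode.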
32. Outstanding Issues
• Reconciling the spurious instances is still an ongoing task.
• We would welcome comments/feedback about the
assumptions/methodologies we have chosen to adopt both from ONS and
from other expert users of the AFPD.
• Is there any documentation which might shed light on procedures used to
update the datasets in the past & might explain some of the systematic
inconsistencies we have discovered?
33. Conclusions
• 1. Historical & Contemporary postcode directory datasets are being accessed
by academic users through UKBORDERS.
• 2. QA process data has been received and loaded - raw instance database
has been created.
• 3. Quality Assurance Audit has been carried out - quality of dataset has been
assessed.
• 4. Significant Progress has been made in reconciling inconsistencies, but work
remains before derived data can be created and exposed to user community.
• 5. Feedback on work to date and input from other users is requested in
order to bring work to a close.