The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each institution using their own identification scheme. To enable fruitful discussions within the team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol that generates a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of content being misread and producing an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).
An Identifier Scheme for the Digitising Scotland Project
1. An Identifier Scheme for the Digitising Scotland Project
Alasdair J G Gray
Department of Computer Science,
Heriot-Watt University,
Edinburgh
@gray_alasdair
www.macs.hw.ac.uk/~ajg33
Özgür Akgün, Uni. of St Andrews
Ahmad Alsadeeqi, Heriot-Watt Uni.
Peter Christen, Australian National Uni.
Tom Dalton, Uni. of St Andrews
Alan Dearle, Uni. of St Andrews
Chris Dibben, Uni. of Edinburgh
Eilidh Garrett, Uni. of Essex
Graham Kirby, Uni. of St Andrews
Alice Reid, Uni. of Cambridge
Lee Williamson, Uni. of Edinburgh
2. Digitising Scotland Project
Large-scale family reconstruction studies and pedigrees
• Transcription of data
• Linking of data
Performed at scale
• Whole nation
• Large timeframe
1 June 2017 ADRN Conference 2
3. Project Team
Backgrounds
• Demographers • Historians • Computer Scientists
Distributed team
St Andrews Cambridge Edinburgh Edinburgh Australia
5. Data Linkage Challenges
Low quality data
Probabilistic matches
Scalability
Skewed name distributions
[Figure: example records with near-duplicate names. John Grant (Fisherman), Fiona Sinclair, Ian Grant (Smithy, born 1861); Stuart Adam (Wheelwright), Morag Scott, Flora Adam (Seamstress, born 1866); married 1884; John Grant (Farmer), Fiona Sinclaire, Iain Grant (born 1860).]
7. Discussing records
Eilidh, I’m having problems with the Skye record B-BABY-8293.
Peter, which transcribed certificate is that?
It is the record for Chris Dibben, born 18 March 1893.
That is the child on record 5457. It should link to the death on record 5754, 4 December 1959.
Thanks, found it now. It is record D-DEATH-2182.
8. Existing Identifier Schemes
Historians: Example: 5457
• Incremental integer
• Easily confused with other record types
• Identifies certificate not actors
• Based on order of transcription
• Not derived from data
• Unique for a file
• Excel spreadsheet
Record Linkage: Example: B-BABY-8293
• Encode type of certificate and actor on certificate
• Four digits generated by linkage process
• Different from those used by the historians
• Different for each run of linkage pre-processing
9. Desiderata for Identifiers
1. Identifier for each actor on a certificate
2. Exchangeable between researchers
3. Unique generation process from the data
4. Immutable to data changes, e.g. typo discovered in data
5. Human derivable from data records
6. Human interpretable
7. Compact to enable efficient computation
8. Susceptible to blocking
9. Globally unique
10. Consistent approach for all record types
11. Compatible with pre-existing NRS approach
12. Compatibility with Open Data Standards
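Desideratum 8 asks that identifiers lend themselves to blocking, that is, partitioning records so that only records within the same block are compared during linkage. The sketch below shows one possible way to do this, assuming identifiers follow the typeYear_district_subdistrict_entryNumber_role form given in the conclusions and that the registration district is a useful blocking key. The function names, the choice of key, and the sample identifiers are all illustrative assumptions, not part of the project's tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def blocking_key(identifier: str) -> str:
    """Return the district field (second underscore-separated field).

    Assumption: blocking on the registration district alone; a real
    linkage pipeline may well choose a different key.
    """
    return identifier.split("_", 2)[1]


def build_blocks(identifiers: Iterable[str]) -> Dict[str, List[str]]:
    """Group identifiers so that only same-district records are compared."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for ident in identifiers:
        blocks[blocking_key(ident)].append(ident)
    return dict(blocks)


# Hypothetical identifiers; the district and entry numbers are made up.
example_blocks = build_blocks([
    "B1860_164_1_12_baby",
    "D1925_164_1_3_deceased",
    "M1884_200_2_27_groom",
])
```

Because the key is read directly from the identifier, blocks can be formed without opening the transcribed record content.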
11. Certificate Roles
Birth
• baby
• mother
• father
• registrar
• informant
Marriage
• groom
• groom_father
• groom_mother
• bride
• bride_father
• bride_mother
• witness1
• witness2
• officiant
• registrar
Death
• deceased
• mother
• father
• spouse1…spousen
• informant
• doctor
• registrar
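The role lists above can be captured as a small lookup table so that a role name can be checked against the certificate type it is used with. This is a minimal sketch: the dictionary keys ("birth", "marriage", "death") and the function name are illustrative choices, and "spouse1…spousen" is interpreted as spouse followed by a positive integer, which is an assumption about the notation.

```python
import re

# Roles per certificate type, copied from the slide.
CERTIFICATE_ROLES = {
    "birth": {"baby", "mother", "father", "registrar", "informant"},
    "marriage": {
        "groom", "groom_father", "groom_mother",
        "bride", "bride_father", "bride_mother",
        "witness1", "witness2", "officiant", "registrar",
    },
    "death": {"deceased", "mother", "father", "informant", "doctor", "registrar"},
}

# Assumption: "spouse1…spousen" means spouse1, spouse2, ... with no upper bound.
_SPOUSE = re.compile(r"spouse[1-9][0-9]*")


def is_valid_role(certificate_type: str, role: str) -> bool:
    """Return True if `role` can appear on a certificate of this type."""
    roles = CERTIFICATE_ROLES.get(certificate_type)
    if roles is None:
        return False
    if certificate_type == "death" and _SPOUSE.fullmatch(role):
        return True
    return role in roles
```

For example, `is_valid_role("death", "spouse2")` is accepted, while `is_valid_role("birth", "groom")` is rejected.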
12. Conclusions
• Agreed identifier scheme
typeYear_district_subdistrict_entryNumber_role
• Meets desiderata
• Reliant on “clean” parts of certificate
• Compatible with NRS
• Improved team communications
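The agreed format typeYear_district_subdistrict_entryNumber_role can be sketched as a small value type with a formatter and a parser. Several details are assumptions, since the slides give only the field order: the certificate type is taken to be a single letter ("B", "M" or "D") joined directly to the year, fields are separated by underscores, and a missing subdistrict is written as an empty field. The class and function names, and the district and entry numbers used in the example, are hypothetical.

```python
from typing import NamedTuple


class ActorId(NamedTuple):
    cert_type: str     # assumed one-letter code: "B", "M" or "D"
    year: int
    district: str
    subdistrict: str   # may be empty; the slides do not say how "none" is written
    entry_number: int
    role: str

    def __str__(self) -> str:
        return (f"{self.cert_type}{self.year}_{self.district}_"
                f"{self.subdistrict}_{self.entry_number}_{self.role}")


def parse_actor_id(text: str) -> ActorId:
    """Split an identifier string back into its fields."""
    # Roles such as "groom_father" contain underscores, so split at most 4 times.
    parts = text.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {text!r}")
    type_year, district, subdistrict, entry, role = parts
    return ActorId(type_year[0], int(type_year[1:]), district,
                   subdistrict, int(entry), role)


# Hypothetical example: father of the groom on a marriage entry.
example = ActorId("M", 1884, "164", "1", 27, "groom_father")
```

Every field comes from the register position or the certificate layout rather than from handwritten names or dates, so the identifier stays the same when a transcription error is later corrected.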
Alasdair Gray
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Acknowledgements:
• Julia Jennings
• Christine Jones
• Diego Ramiro-Farinas
Editor's Notes
Outline:
Background of the Digitising Scotland project and its aims
Need for an identifier scheme
Agreed upon scheme
Large scale family reconstruction studies and Pedigrees
Different backgrounds, different expertise, different terminology
Have run a series of workshops to bring us together to understand our approaches and terminologies
Civil registration of births, deaths and marriages in Scotland began on 1 January 1855
All historical vital events records have been converted into digital image format with a supporting index
Modern vital events data (from 1974 onwards) are available electronically
The DS project will digitise the 24 million Scottish vital events record images (births, marriages and deaths) since 1855.
This will allow research access to individual-level information on some 18 million individuals – a large proportion of those who have lived in Scotland.
Transcription outsourced. Now starting to receive data. Queens Centre for Data Digitisation and Analysis
Data is of low quality
Transcription errors
No occupation standards, etc.
Skewed name distributions (large proportion of people in a village/town have the same name)
Scalability (linking 24M records)
Low quality linkage due to challenges
- Skewed name distributions
Need to regularly discuss within team
Different teams using their own identifier schemes
No relationship between identifier scheme and original record
Existing approaches reliant on order of transcription
CS hash-based approaches reliant on data content – ID changes if data changes
Use information on registration book
Registration district on book or on microfiche image
Rathven, Banff has no subdistrict
Need to capture the different roles
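The note above that hash-based identifiers change when the data changes can be illustrated with a short comparison. The sketch below is not the project's code: it uses SHA-1 from Python's standard library as a stand-in for whichever hash a linkage pipeline might use, and the field names, district, subdistrict and entry values are made up. The names are taken from the example records on the data linkage slide.

```python
import hashlib


def content_hash_id(content: dict) -> str:
    """Identifier derived from transcribed content (changes if content changes)."""
    canonical = "|".join(f"{key}={content[key]}" for key in sorted(content))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]


def register_position_id(position: dict) -> str:
    """Identifier derived only from where the entry sits in the register."""
    return "{cert_type}{year}_{district}_{subdistrict}_{entry}".format(**position)


# Hypothetical register position for a birth entry.
position = {"cert_type": "B", "year": 1860, "district": "164",
            "subdistrict": "1", "entry": 12}

# Transcribed content before and after a spelling correction.
before = {"forename": "Iain", "surname": "Grant", "mother": "Fiona Sinclaire"}
after = dict(before, mother="Fiona Sinclair")

hash_changed = content_hash_id(before) != content_hash_id(after)
position_unchanged = register_position_id(position) == register_position_id(position)
```

Correcting the mother's surname alters the content hash, whereas the position-based identifier never reads the transcribed fields and so is unaffected.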