An Identifier Scheme for the Digitising Scotland Project

The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread and producing an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).

  1. An Identifier Scheme for the Digitising Scotland Project
     Alasdair J G Gray, Department of Computer Science, Heriot-Watt University, Edinburgh
     @gray_alasdair, www.macs.hw.ac.uk/~ajg33
     Özgür Akgün, Uni. of St Andrews
     Ahmad Alsadeeqi, Heriot-Watt Uni.
     Peter Christen, Australian National Uni.
     Tom Dalton, Uni. of St Andrews
     Alan Dearle, Uni. of St Andrews
     Chris Dibben, Uni. of Edinburgh
     Eilidh Garrett, Uni. of Essex
     Graham Kirby, Uni. of St Andrews
     Alice Reid, Uni. of Cambridge
     Lee Williamson, Uni. of Edinburgh
  2. Digitising Scotland Project
     Large scale family reconstruction studies and Pedigrees
     • Transcription of data
     • Linking of data
     Performed at scale
     • Whole nation
     • Large timeframe
     1 June 2017, ADRN Conference
  3. Project Team
     Backgrounds
     • Demographers
     • Historians
     • Computer Scientists
     Distributed team: St Andrews, Cambridge, Edinburgh, Australia
  4. Transcribing Scotland’s Vital Records: 1855 – 1974
     • 24M records
       • Birth
       • Marriage
       • Death
     • 18M individuals
  5. Data Linkage Challenges
     • Low quality data
     • Probabilistic matches
     • Scalability
     • Skewed name distributions
     [Figure: example certificates illustrating a probable match. One shows John Grant (Fisherman), Fiona Sinclair, Ian Grant (Smithy, born 1861), Stuart Adam (Wheelwright), Morag Scott and Flora Adam (Seamstress, born 1866), married 1884. The other shows John Grant (Farmer), Fiona Sinclaire and Iain Grant, born 1860.]
  6. Linking Skye Data
  7. Discussing records
     "Eilidh, I’m having problems with the Skye record B-BABY-8293."
     "Peter, which transcribed certificate is that?"
     "It is the record for Chris Dibben, born 18 March 1893."
     "That is the child on record 5457. It should link to the death on record 5754, 4 December 1959."
     "Thanks, found it now. It is record D-DEATH-2182."
  8. Existing Identifier Schemes
     Historians (example: 5457)
     • Incremental integer
     • Easily confused with other record types
     • Identifies certificate not actors
     • Based on order of transcription
     • Not derived from data
     • Unique for a file
     • Excel spreadsheet
     Record Linkage (example: B-BABY-8293)
     • Encode type of certificate and actor on certificate
     • Four digits generated by linkage process
     • Different from those used by the historians
     • Different for each run of linkage pre-processing
  9. Desiderata for Identifiers
     1. Identifier for each actor on a certificate
     2. Exchangeable between researchers
     3. Unique generation process from the data
     4. Immutable to data changes, e.g. typo discovered in data
     5. Human derivable from data records
     6. Human interpretable
     7. Compact to enable efficient computation
     8. Susceptible to blocking
     9. Globally unique
     10. Consistent approach for all record types
     11. Compatible with pre-existing NRS approach
     12. Compatibility with Open Data Standards
  10. Identifier Scheme
      B1903_164_00_baby
      typeYear_district_subdistrict_entryNumber_role
  11. Certificate Roles
      Birth
      • baby
      • mother
      • father
      • registrar
      • informant
      Marriage
      • groom
      • groom_father
      • groom_mother
      • bride
      • bride_father
      • bride_mother
      • witness1
      • witness2
      • officiant
      • registrar
      Death
      • deceased
      • mother
      • father
      • spouse1…spousen
      • informant
      • doctor
      • registrar
  12. Conclusions
      • Agreed identifier scheme: typeYear_district_subdistrict_entryNumber_role
      • Meets desiderata
      • Reliant on “clean” parts of certificate
      • Compatible with NRS
      • Improved team communications
      Alasdair Gray, www.macs.hw.ac.uk/~ajg33/, A.J.G.Gray@hw.ac.uk, @gray_alasdair
      Acknowledgements:
      • Julia Jennings
      • Christine Jones
      • Diego Ramiro-Farinas
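
The identifier template and role lists from the slides can be sketched in code. The following is a minimal Python illustration, not the project's implementation. Several details are assumptions because the slides do not state them: the type letters `M` and `D` (only `B` appears in the slides), the absence of any fixed field width or zero padding, and the entry number `12` used in the usage lines, which is purely hypothetical. The function and class names are also invented for this sketch.

```python
import re
from typing import NamedTuple

# Roles per certificate type, copied from the "Certificate Roles" slide.
# The type letters "M" and "D" are assumptions; only "B" appears in the slides.
ROLES = {
    "B": {"baby", "mother", "father", "registrar", "informant"},
    "M": {"groom", "groom_father", "groom_mother", "bride", "bride_father",
          "bride_mother", "witness1", "witness2", "officiant", "registrar"},
    "D": {"deceased", "mother", "father", "informant", "doctor", "registrar"},
}

# Death certificates list spouse1...spousen, so spouses are matched by pattern.
_SPOUSE = re.compile(r"spouse[1-9][0-9]*")


class Identifier(NamedTuple):
    cert_type: str
    year: int
    district: str
    subdistrict: str
    entry_number: str
    role: str


def is_valid_role(cert_type: str, role: str) -> bool:
    """Check that a role is listed for the given certificate type."""
    if role in ROLES.get(cert_type, set()):
        return True
    return cert_type == "D" and _SPOUSE.fullmatch(role) is not None


def make_identifier(cert_type, year, district, subdistrict, entry_number, role):
    """Compose typeYear_district_subdistrict_entryNumber_role."""
    if not is_valid_role(cert_type, role):
        raise ValueError(f"role {role!r} not listed for type {cert_type!r}")
    return f"{cert_type}{year}_{district}_{subdistrict}_{entry_number}_{role}"


def parse_identifier(identifier: str) -> Identifier:
    """Split an identifier into its parts; roles may contain underscores."""
    parts = identifier.split("_", 4)  # at most 5 parts, the last is the role
    if len(parts) != 5 or len(parts[0]) < 2:
        raise ValueError(f"not a five-part identifier: {identifier!r}")
    type_year, district, subdistrict, entry_number, role = parts
    cert_type, year = type_year[0], int(type_year[1:])
    if not is_valid_role(cert_type, role):
        raise ValueError(f"role {role!r} not listed for type {cert_type!r}")
    return Identifier(cert_type, year, district, subdistrict, entry_number, role)


# Usage with a hypothetical entry number of 12.
baby_id = make_identifier("B", 1903, "164", "00", "12", "baby")
parsed = parse_identifier("M1884_164_00_7_groom_father")
```

Splitting on at most four underscores keeps multi-word roles such as `groom_father` intact, which is why the role is placed last in the template.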

Editor's Notes

  • Outline:
    Background of the Digitising Scotland project and its aims
    Need for an identifier scheme
    Agreed upon scheme
  • Large scale family reconstruction studies and Pedigrees
  • Different backgrounds, different expertise, different terminology
    Have run a series of workshops to bring us together to understand our approaches and terminologies
  • Civil registration of births, deaths and marriages in Scotland began on 1 January 1855
    All historical vital events records have been converted into digital image format with a supporting index
    Modern vital events data (from 1974 onwards) are available electronically

    The DS project will digitise the 24 million Scottish vital events record images (births, marriages and deaths) since 1855.
    This will allow research access to individual-level information on some 18 million individuals – a large proportion of those who have lived in Scotland.

    Transcription is outsourced to the Queen's Centre for Data Digitisation and Analysis. We are now starting to receive data.


  • Data is of low quality
    Transcription errors
    No occupation standards, etc.
    Skewed name distributions (large proportion of people in a village/town have the same name)
    Scalability (linking 24M records)
  • Low quality linkage due to challenges
    - Skewed name distributions
    Need to regularly discuss within team
  • Different teams using their own identifier schemes
    No relationship between identifier scheme and original record
  • Existing approaches reliant on order of transcription
    CS hash-based approaches reliant on data content – ID changes if data changes
  • Use information on registration book

    Registration district on book or on microfiche image
    Rathven, Banff has no subdistrict

    Need to capture the different roles
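
The note above that hash-based identifiers change when the data changes can be illustrated with a small Python sketch. It is an illustration only: `content_hash_id` is a hypothetical stand-in for a content-derived scheme, `register_id` follows the template from the slides, and the entry number and record fields are invented for the example.

```python
import hashlib


def content_hash_id(record: dict) -> str:
    """Hypothetical content-derived ID: a digest of the transcribed fields."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def register_id(cert_type, year, district, subdistrict, entry_number, role):
    """ID built only from the certificate's place in the register."""
    return f"{cert_type}{year}_{district}_{subdistrict}_{entry_number}_{role}"


# A transcription typo is later corrected in the handwritten content.
transcribed = {"forename": "Fiona", "surname": "Sinclaire"}
corrected = {"forename": "Fiona", "surname": "Sinclair"}

# The content hash differs before and after the correction.
hash_changed = content_hash_id(transcribed) != content_hash_id(corrected)

# The register-based ID takes no handwritten content as input, so the
# correction cannot affect it (the entry number 12 is hypothetical).
mother_id = register_id("B", 1903, "164", "00", "12", "mother")
```

Because `register_id` never reads the names on the certificate, correcting a misread surname leaves the identifier untouched, which is the immutability property listed in the desiderata.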
