
Sonia Ranade: 'Traces Through Time overview and next steps'

Digital History Seminar and Archives and Society Seminar
Institute of Historical Research
23 June 2015
http://ihrdighist.blogs.sas.ac.uk/2015/06/15/23-june-2015-exploring-big-and-small-historical-datasets-reflections-on-two-recent-projects/


  1. Sonia Ranade, 23rd June 2015: Traces Through Time (Archives & Society/Digital History seminar)
  2. Introduction to Traces Through Time
  3. The Traces Through Time project: Aims
     • To trace individuals through records in the National Archives and other institutions
     • To handle the variation and inconsistency that we find in historical records
     • To assign confidence to the links that we find
     • To develop methods that work for records from different time periods
     • Funded by the AHRC
     • In partnership with IHR and the Universities of Brighton and Leiden
     • Project ran from January 2014 to March 2015
  4. Datasets
     • Highly structured
     • Often only partially transcribed
     • Big data
     Early data
     • Semi-structured narrative text
     • Rich in context and relationships
     • Small data
     Modern data
  5. Linking records
  6. Linking records: a 4-stage pipeline for processing and linking
     • Data cleansing and standardisation
     • Statistics
     • Optimisation
     • Linking and confidence
  7. Basic Probabilistic Model
     • Attribute by attribute comparison
     • Calculate ratio:
       • Probability of comparison score given it’s the same person vs.
       • Probability of comparison score given they’re different people
     • Use frequency tables to calculate scores
     • Add up the scores
     • Assumes that attributes are independent
  8. Very high-scoring matches
     | AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname score | Fname1 score | Fname2 score | Fname3 score | Day score | Month score | Year score | Total score |
     | denzil adair bartlett morle | 19 May 1879 | denzil adair bartlett morle | 19 May 1879 | 5.4442 | 4.4379 | 6.2457 | 5.5897 | 1.3302 | 0.9180 | 1.5307 | 25.50 |
     | kingsley storrs stanton parker | 02 January 1900 | kingsley storrs stanton parker | 02 January 1900 | 2.5287 | 4.8050 | 7.1751 | 5.2720 | 1.3302 | 0.9180 | 1.1603 | 23.19 |
  9. Lower confidence matches
     | AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname score | Fname1 score | Fname2 score | Fname3 score | Day score | Month score | Year score | Total score |
     | f e orton | | frank edward orton | 06 December 1890 | 3.812 | 1.289 | 0.915 | 0.000 | 0.000 | 0.000 | 0.000 | 6.016 |
     | harry leonard gardner | | harry leonard garner | 06 March 1897 | 1.839 | 1.939 | 2.237 | 0.000 | 0.000 | 0.000 | 0.000 | 6.016 |
     | charles measures | | charles measures | 29 October 1868 | 4.362 | 1.653 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 6.015 |
     | charles measures | | charles measures | 25 December 1861 | 4.362 | 1.653 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 6.015 |
     | g w lester | | george w lester | 11 October 1853 | 3.542 | 1.305 | 1.168 | 0.000 | 0.000 | 0.000 | 0.000 | 6.014 |
     | george scarrott | | george scarrott | 14 August 1878 | 4.592 | 1.422 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 6.014 |
     | w j cann | | william james cann | 17 July 1858 | 3.834 | 1.212 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013 |
     | | | william james cann | 29 November 1867 | 3.834 | 1.212 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013 |
     | d s gordon | | david stevenson gordon | 29 August 1896 | 3.225 | 1.407 | 1.381 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013 |
     | h c lyons | | henry charles lyons | 16 November 1893 | 3.415 | 1.268 | 1.330 | 0.000 | 0.000 | 0.000 | 0.000 | 6.013 |
     | arthur edward john dobson | | arthur edward dobson | 28 February 1885 | 3.181 | 1.685 | 1.847 | -0.699 | 0.000 | 0.000 | 0.000 | 6.013 |
     | j w chilton | | james william chilton | 15 July 1884 | 3.833 | 0.967 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.012 |
     | | | joseph wright chilton | 07 October 1899 | 3.833 | 0.967 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.012 |
     | arthur william smith | 25 July 1899 | arthur smith | 25 July 1899 | 1.693 | 1.685 | -0.699 | 0.000 | 1.330 | 0.918 | 1.085 | 6.012 |
     | john bertram granville bradley | 12 March 1899 | john bradley | 12 March 1899 | 2.875 | 1.201 | -0.699 | -0.699 | 1.330 | 0.918 | 1.085 | 6.011 |
     | h h england | | henry humphrey england | 30 July 1898 | 3.475 | 1.268 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | | | herbert henry england | 07 February 1889 | 3.475 | 1.268 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | albert edward monecroft | | albert edward morecroft | 29 May 1896 | 2.404 | 1.759 | 1.847 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | w t randall | | william thomas randall | 13 December 1890 | 3.250 | 1.212 | 1.548 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | | | william thomas randall | 31 August 1894 | 3.250 | 1.212 | 1.548 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | | | william thomas randall | 06 June 1883 | 3.250 | 1.212 | 1.548 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | e j breen | | edward james breen | 23 May 1902 | 4.128 | 0.915 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 6.010 |
     | c w moran | | charles william moran | 01 January 1886 | 3.462 | 1.330 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.004 |
     | | | christopher walter moran | 21 February 1885 | 3.462 | 1.330 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 6.004 |
     | t h johnston | | thomas henry johnston | 24 January 1894 | 3.188 | 1.548 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 6.004 |
     | a c m pym | | albert charles pym | 08 January 1899 | 4.398 | 0.975 | 1.330 | -0.699 | 0.000 | 0.000 | 0.000 | 6.004 |
     | c a walter | | charles alfred walter | 18 April 1892 | 3.695 | 1.330 | 0.975 | 0.000 | 0.000 | 0.000 | 0.000 | 6.000 |
  10. The threshold for high confidence matches: Computer says ‘Maybe’
      | AIR 76 name | AIR 76 DOB | ADM 188 name | ADM 188 DOB | Surname score | Fname1 score | Fname2 score | Fname3 score | Day score | Month score | Year score | Total score |
      | clifton james twine | | clipton james twine | 14 November 1897 | 4.576208 | 2.009343 | 1.524698 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.110249 |
      | josiah c wedgewood | | josiah wedgewood | 27 January 1897 | 5.215215 | 3.593108 | -0.698970 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.109353 |
      | harold vaughan hicks | 25 January 1897 | harold hicks | 25 January 1897 | 3.190312 | 2.100176 | -0.698970 | 0.000000 | 1.330211 | 0.918030 | 1.265789 | 8.105548 |
      | rupert john goodman crouch | 12 February 1897 | rupert john goodman crouch | 12 February 1892 | 3.633749 | 3.716703 | 1.200945 | 5.302958 | 1.330211 | 0.918030 | -8.000000 | 8.102596 |
      | r h wrate | | roy holcombe wrate | 29 May 1899 | 5.469066 | 1.313985 | 1.267723 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.050774 |
      | | | roy holcombe wrate | 29 May 1899 | 5.469066 | 1.313985 | 1.267723 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.050774 |
      | john norman longfield | 12 December 1884 | john norman longfield | 13 December 1884 | 4.829306 | 1.200945 | 2.417149 | 0.000000 | -2.735969 | 0.918030 | 1.417104 | 8.046566 |
      | d f crittall | | daniel frederick crittall | 06 August 1896 | 5.340946 | 1.406961 | 1.288987 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.036893 |
      | c c v terry | | christopher charles vincent terry | 13 October 1899 | 3.375747 | 1.330257 | 1.330257 | 1.979755 | 0.000000 | 0.000000 | 0.000000 | 8.016016 |
      | william john davies | 27 December 1894 | william john davies | 27 December 1894 | 2.057942 | 1.201008 | 1.200945 | 0.000000 | 1.330211 | 0.918030 | 1.303943 | 8.012078 |
      | h v phippen | | harold victor phippen | 19 July 1892 | 4.742067 | 1.267723 | 1.979755 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.989546 |
      | edward heron | 21 August 1893 | edward george appelbe heron | 21 August 1893 | 3.937212 | 1.846789 | -0.698970 | -0.698970 | 1.330211 | 0.918030 | 1.332133 | 7.966435 |
      | f r maddaford | | frank richard maddaford | 16 December 1892 | 5.360395 | 1.288987 | 1.313985 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.963367 |
      | albert john penaluna | | albert john penaluna | 06 May 1892 | 5.001995 | 1.759049 | 1.200945 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.961989 |
      | harry ward | 10 May 1898 | harry ward | 10 May 1898 | 2.458691 | 1.939398 | 0.000000 | 0.000000 | 1.330211 | 0.918030 | 1.243949 | 7.890279 |
      | w h a rockett | | william henry albert rockett | 12 May 1904 | 4.431952 | 1.212178 | 1.267723 | 0.975004 | 0.000000 | 0.000000 | 0.000000 | 7.886857 |
  11. Refining the basic probabilistic model
      • We’ve assumed independence of attributes: Is this a valid assumption?
      • Spelling variants and similar strings e.g. ‘Sidney’ and ‘Sydney’
      • Name variants e.g. ‘Jack’ and ‘John’ or ‘Henry’ and ‘Harry’
      • What about incorrectly transcribed initials?
      • What about dates?
  12. How independent are attributes?
  13. A stark example
  14. An anomaly
      | AIR 76 person name | AIR 76 person DOB | ADM 188 person name | ADM 188 person DOB | Surname score | Fname1 score | Fname2 score | Fname3 score | Day score | Month score | Year score | Total score |
      | j j gabell | | jonathan joseph gabell | 29 October 1898 | 5.271 | 0.967 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 7.205 |
      | e s mendoza | | elias sidney mendoza | 28 June 1900 | 4.908 | 0.915 | 1.381 | 0.000 | 0.000 | 0.000 | 0.000 | 7.205 |
      | robert m blackwood | | robert maxwell blackwood | 06 October 1887 | 4.428 | 1.764 | 1.012 | 0.000 | 0.000 | 0.000 | 0.000 | 7.204 |
      | a j loton | | alfred john loton | 05 April 1897 | 5.260 | 0.975 | 0.967 | 0.000 | 0.000 | 0.000 | 0.000 | 7.202 |
      | h r rickey | | henry rickey | 16 July 1851 | 6.633 | 1.268 | -0.699 | 0.000 | 0.000 | 0.000 | 0.000 | 7.201 |
      | h v briscoe | | hugh villiers briscoe | 25 March 1896 | 3.953 | 1.268 | 1.980 | 0.000 | 0.000 | 0.000 | 0.000 | 7.201 |
      | john wall | 16 June 1884 | john wall | 16 June 1886 | 3.231 | 1.201 | 0.000 | 0.000 | 1.330 | 0.918 | 0.521 | 7.200 |
      | h h girdlestone | | horace howard girdlestone | 12 January 1900 | 4.663 | 1.268 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 7.198 |
      | thomas francis taylor | 1885 | thomas francis taylor | 13 February 1885 | 2.030 | 1.492 | 2.244 | 0.000 | 0.000 | 0.000 | 1.430 | 7.197 |
      | harold powell | September 1881 | harold powell | 25 September 1881 | 2.687 | 2.100 | 0.000 | 0.000 | 0.000 | 0.918 | 1.488 | 7.193 |
      | frederick williams | 31 January 1894 | frederick williams | 31 January 1894 | 1.957 | 1.681 | 0.000 | 0.000 | 1.330 | 0.918 | 1.304 | 7.190 |
      | c h munford | | charles henry munford | 23 November 1897 | 4.591 | 1.330 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 7.189 |
      | patrick cashman | | patrick cashman | 01 June 1879 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188 |
      | | | patrick cashman | 18 October 1878 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188 |
      | | | patrick cashman | 16 September 1884 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188 |
      | | | patrick cashman | 17 March 1878 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188 |
      | | | patrick cashman | 04 April 1875 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188 |
      | | | patrick cashman | 04 April 1870 | 4.654 | 2.534 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7.188 |
      | f g donnison | | frederick george donnison | 28 December 1899 | 4.593 | 1.289 | 1.305 | 0.000 | 0.000 | 0.000 | 0.000 | 7.187 |
      | percival albert wright | | percival albert wright | 10 October 1883 | 2.317 | 3.110 | 1.759 | 0.000 | 0.000 | 0.000 | 0.000 | 7.186 |
      | stanley b crick | | stanley benjamin charles crick | 08 May 1899 | 4.005 | 2.267 | 1.608 | -0.699 | 0.000 | 0.000 | 0.000 | 7.181 |
      | g h tidman | | george henry tidman | 18 May 1900 | 4.608 | 1.305 | 1.268 | 0.000 | 0.000 | 0.000 | 0.000 | 7.181 |
      | william edward back | | william edward back | 30 December 1855 | 4.132 | 1.201 | 1.847 | 0.000 | 0.000 | 0.000 | 0.000 | 7.180 |
      | f w doy | | frederick william doy | 12 August 1891 | 4.679 | 1.289 | 1.212 | 0.000 | 0.000 | 0.000 | 0.000 | 7.180 |
  15. Where were all the Cashmans born?
  16. Clustering surnames and forenames…
  17. Attribute Similarity
  18. An example:
      | Name | Rank | Service Number | Date of Death | Regiment / Service |
      | BOULONOIS, PERCY THOMAS | Private | 28645 | 29/12/1917 | Royal Fusiliers |
      • WO 372 – Percy J. Boulonois GS/27645
      • MH 47 – Percy Thomas Boulonois
      • CWGC – Percy Thomas Boulonois 28645
  19. Henry and Harry, an example
      • Looked at high confidence matched pairs
      • Approx. 1% of Henrys are also recorded as Harry
      • Tried three scenarios:
        • Apply a weighting based on string similarity (our standard approach)
        • Assume Henry and Harry are interchangeable
        • Include the 1% probability in the calculation
      • Ran tests with and without Date of Birth
  20. Browsing Individuals
  21. From Conscription to Henry VIII
  22. Future work…
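
As a rough illustration of the basic probabilistic model on slide 7, the sketch below scores each attribute as a log ratio (probability of the comparison if the records are the same person, over the probability if they are different people), looks up value frequencies in a table, and adds the scores up as if the attributes were independent. It is not the project's code: the 0.9 agreement probability, the base-10 logarithm, the zero score for missing values, the toy frequency tables and every function name are assumptions made for this example.

```python
import math
from collections import Counter

# Assumption (not from the slides): a true match agrees on an attribute 90% of the time.
M_PROB = 0.9


def frequency_table(values):
    """Relative frequency of each value in a reference population."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def attribute_score(a, b, freqs, m_prob=M_PROB):
    """Log10 likelihood ratio for one attribute comparison."""
    if a is None or b is None:
        return 0.0  # assumed: a missing value is no evidence either way
    if a == b:
        # Two different people share a rare value less often, so rare values score higher.
        u_prob = freqs.get(a, min(freqs.values()))  # unseen values treated as rarest
        return math.log10(m_prob / u_prob)
    # Disagreement: compare with the chance that two random people disagree.
    u_prob = sum(f * f for f in freqs.values())
    return math.log10((1 - m_prob) / (1 - u_prob))


def total_score(rec_a, rec_b, freq_tables):
    """Add up the attribute scores, which assumes the attributes are independent."""
    return sum(
        attribute_score(rec_a.get(name), rec_b.get(name), freqs)
        for name, freqs in freq_tables.items()
    )


# Toy reference data, invented for illustration only.
surnames = ["smith"] * 6 + ["jones"] * 3 + ["morle"]
years = [1879, 1880, 1881, 1882, 1883] * 2
tables = {"surname": frequency_table(surnames), "year": frequency_table(years)}

rec_a = {"surname": "morle", "year": 1879}
rec_b = {"surname": "morle", "year": 1879}
print(round(total_score(rec_a, rec_b, tables), 3))
```

In this sketch a rare surname such as 'morle' earns a higher agreement score than a common one such as 'smith', which is consistent with the slide tables, where unusual surnames contribute most to the total.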

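
The three Henry and Harry scenarios on slide 19 can be sketched in the same style. Only the approximate 1% rate comes from the slides. The forename frequencies, the 0.9 agreement probability, the fixed chance-agreement value, the use of Python's `difflib.SequenceMatcher` as the string similarity measure, and the linear blend of agreement and disagreement scores are all assumptions chosen for illustration, not the project's method.

```python
import math
from difflib import SequenceMatcher

M_PROB = 0.9          # assumed P(forenames agree | same person)
U_PROB = 0.02         # assumed P(forenames agree | different people)
VARIANT_RATE = 0.01   # from the slide: approx. 1% of Henrys are also recorded as Harry

# Hypothetical forename frequencies, for illustration only.
FREQS = {"henry": 0.04, "harry": 0.02, "john": 0.10}


def agree_score(name):
    """Log10 ratio when both records carry the same forename."""
    return math.log10(M_PROB / FREQS[name])


def disagree_score():
    """Log10 ratio when the forenames are treated as simply different."""
    return math.log10((1 - M_PROB) / (1 - U_PROB))


def score_string_similarity(a, b):
    """Scenario 1: weight by string similarity (linear blend, an assumption)."""
    sim = SequenceMatcher(None, a, b).ratio()
    return sim * agree_score(a) + (1 - sim) * disagree_score()


def score_interchangeable(a, b, variants=None):
    """Scenario 2: map known variants to one form, then compare exactly."""
    variants = variants or {"harry": "henry"}
    a, b = variants.get(a, a), variants.get(b, b)
    # Simplification: uses the frequency of the canonical form alone.
    return agree_score(a) if a == b else disagree_score()


def score_variant_rate(a, b):
    """Scenario 3: put the observed 1% variant rate into the ratio."""
    if a == b:
        return agree_score(a)
    if {a, b} == {"henry", "harry"}:
        # Same person: f(henry) * rate. Different people: f(henry) * f(harry).
        return math.log10(VARIANT_RATE / FREQS["harry"])
    return disagree_score()


for scorer in (score_string_similarity, score_interchangeable, score_variant_rate):
    print(scorer.__name__, round(scorer("henry", "harry"), 3))
```

With these made-up frequencies the three scenarios give quite different scores for the same pair, which is the kind of spread that comparing scenarios with and without Date of Birth would help to assess.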