Mining and mapping places with multiple names


Presentation by James Butler & Christopher Donaldson at the 1st Lancaster Data Conversations, 30 January.

Published in: Data & Analytics

  1. Mining and mapping places with multiple names. James Butler & Christopher Donaldson, Lancaster University
  2. Corpus of Lake District Literature
     Timeline markers: 1688, 1789, 1837, 1901
     • 80 texts, comprising more than 1,500,000 words
     • Mixture of canonical and non-canonical literature about the Lake District, mainly from the eighteenth and nineteenth centuries (78 out of 80 works)
     • Mixture of genres, including guidebooks, travelogues, novels, poems, journals, and private letters
     Subtotals shown on the slide: 34 texts (650K words); 22 texts (250K words); 22 texts (613K words)
  3. Sample sentence collocation: beautiful
     ‘Again entering the boat, we passed up the channel between Lord’s Island [and] the shore, from whence beautiful prospects are obtained of the majestic form of Skiddaw, with the woods of Castlehead and Cockshot Park in the foreground.’ (Edward Baines, A Companion to the Lakes [1829] 121.)
     • ±5 tokens: no place-names identified
     • ±10 tokens: 2 place-names identified (Lord’s Island & Skiddaw)
     • Within sentence: 4 place-names identified (Lord’s Island, Skiddaw, Castlehead & Cockshot Park)
     Average sentence length: Lake District corpus = 29.8 words; British National Corpus (BNC) = 16 words
  4. Diagram of the Edinburgh Geoparser System, from C. Grover, et al., ‘Use of the Edinburgh Geoparser for Georeferencing Digitized Historical Collections’, Phil. Trans. R. Soc. A 368 (2010) 3875–89.
  5. Example of input/output from the Edinburgh Geoparser System
  6. Geo-referenced Data from the Edinburgh Geoparser
  7. Geo-referenced Data, Corrected
  8. Bowness: ‘the curved headland’, from ON bogi/OE boga ‘bow’ and ON nes/OE naess ‘headland’. Variant historical spellings: Bownus, Bawnas, Bonas, Bonus, Boulness. Cf. D. Whaley, A Dictionary of Lake District Place Names (Nottingham: English Place-Name Society, 2006), 42.
  9. Some common geo-referencing issues with generic gazetteers:
     • Spatial misattribution
     • Onomastic misassumption
     • Incorrect weighting
     And these apply just to the items that are found!
  10. An extract of our custom, manually collected gazetteer for the corpus
      Columns: Unique ID | Topog. Cat. | Primary Name | Secondary Names | Regional Placement
      CONISTON (lake): Thurstan, Coniston Lake, Coniston Water, Thurston, Conistone, Conistone Lake, Cunnistone Lake, Thurston Lake, Coniston Mere, Lake of Coniston, Conis- ton, Conyngs Tun, Conyngeston, Thorstane's watter, Turstinus.
  11. Geospatial categories chosen for flexibility and degree of universal referential specificity
  12. An extract from the latest iteration of the corpus, allowing referential relationships to be analysed on a whole new level. Categories shown: Lake, Vale, Specific - Farm, Waterfall
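
The window counts on slide 3 and the variant-name gazetteer on slide 10 can be illustrated together with a minimal Python sketch. This is not the authors' pipeline or the Edinburgh Geoparser: the function names, the tokeniser, and the four gazetteer entries for the sample sentence are assumptions made for illustration. The Coniston variants are a subset of those listed on slide 10, and the Baines sentence is used with "and" restored between "Lord's Island" and "the shore", which is also an assumption.

```python
import re

# Illustrative gazetteer: primary name -> known variant names.
# The CONISTON variants are a subset of those on slide 10; the other four
# entries are assumed here only so the slide 3 sentence can be matched.
GAZETTEER = {
    "CONISTON (lake)": [
        "Thurstan", "Coniston Lake", "Coniston Water", "Thurston",
        "Conistone", "Cunnistone Lake", "Thurston Lake", "Coniston Mere",
        "Lake of Coniston", "Thorstane's watter",
    ],
    "Lord's Island": [],
    "Skiddaw": [],
    "Castlehead": [],
    "Cockshot Park": [],
}

# Word tokens only; punctuation is dropped and "Lord's" stays one token.
TOKEN = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")


def tokenize(text):
    """Lower-case word tokens with apostrophes normalised."""
    return [t.lower().replace("’", "'") for t in TOKEN.findall(text)]


def build_index(gazetteer):
    """Map every tokenised name (primary or variant) to its primary name."""
    index = {}
    for primary, variants in gazetteer.items():
        for name in [primary, *variants]:
            index[tuple(tokenize(name))] = primary
    return index


def find_places(tokens, index):
    """Return (start, end, primary) spans, preferring the longest match."""
    longest = max(len(key) for key in index)
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            primary = index.get(tuple(tokens[i:i + n]))
            if primary:
                spans.append((i, i + n - 1, primary))
                i += n
                break
        else:
            i += 1
    return spans


def places_near(text, node, index, window=None):
    """Place-names lying wholly within +/- window tokens of the first
    occurrence of the node word; window=None means the whole text."""
    tokens = tokenize(text)
    node_at = tokens.index(node.lower())
    if window is None:
        lo, hi = 0, len(tokens) - 1
    else:
        lo, hi = node_at - window, node_at + window
    return [p for start, end, p in find_places(tokens, index)
            if lo <= start and end <= hi]


INDEX = build_index(GAZETTEER)
SENTENCE = (
    "Again entering the boat, we passed up the channel between Lord's "
    "Island and the shore, from whence beautiful prospects are obtained of "
    "the majestic form of Skiddaw, with the woods of Castlehead and "
    "Cockshot Park in the foreground."
)

for window in (5, 10, None):
    print(window, places_near(SENTENCE, "beautiful", INDEX, window))

# A variant spelling resolves to the primary gazetteer name.
print(INDEX[tuple(tokenize("Thurston Lake"))])
```

Under these assumptions the sketch finds no place-names within ±5 tokens of "beautiful", two within ±10, and four within the sentence, in line with the counts on slide 3. Counting punctuation as tokens, or omitting the restored "and", would shift the window boundaries, so the tokenisation choice matters for short windows.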