
Edin pelagios

Talk about the Edinburgh Geoparser and the Chalice project at the Pelagios workshop in London on 24/03/11.

  1. 1. Institute for Language, Cognition and Computation
     The Edinburgh Geoparser and Chalice
     Claire Grover, Kate Byrne, Richard Tobin, Jo Walsh
     www.inf.ed.ac.uk
  2. 2. Overview of the Edinburgh Geoparser
     •  System to automatically recognise place names in text and disambiguate them with respect to a gazetteer (e.g. Athens, Springfield).
     •  Patchy development over the past few years, funded by a variety of projects and applied to a range of data sets:
        –  GeoCrossWalk
        –  BOPCRIS
        –  GeoDigRef (Histpop, BOPCRIS, BL)
        –  Embedding GeoCrossWalk (Stormont Papers)
        –  SYNC3 (online news)
        –  Chalice (EPNS)
        –  Unlock
     •  Main concern has been to keep it generally usable while applying it to specific data sets.
  3. 3. Overview of the Edinburgh Geoparser (pipeline diagram)
     •  Geotagging: .txt / .html / .xml → Format conversion → Tokenisation → POS tagging → Lemmatisation → Named Entity Recognition → .geotagged.xml
     •  Georesolution: .geotagged.xml → Gazetteer lookup → Resolution (file shown: .gaz.xml)
  8. 8. Evaluation (2009)

     SpatialML (gold geotagging)          GeoNames    Unlock
     No. of place names                   3628        3628
     No. for which gaz entries found      3538        3049
     Correct within 5km                   2946        2143
     As % of total                        81.2%       59.0%

     SpatialML (end-to-end)               GeoNames
     No. of place names                   3628
     No. for which gaz entries found      2923
     Correct within 5km                   2504
     As % of total                        69.0%
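The "correct within 5km" measure above can be illustrated with a minimal sketch. It assumes great-circle (haversine) distance between the gold and predicted coordinates; the slides do not say which distance measure was actually used, and the function names and sample points here are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def score(pairs, threshold_km=5.0):
    """Score (gold, predicted) coordinate pairs.

    `predicted` is None when no gazetteer entry was found.
    Returns (total, found, correct, percent_correct_of_total).
    """
    total = len(pairs)
    found = sum(1 for _, pred in pairs if pred is not None)
    correct = sum(
        1
        for gold, pred in pairs
        if pred is not None and haversine_km(*gold, *pred) <= threshold_km
    )
    pct = 100.0 * correct / total if total else 0.0
    return total, found, correct, pct


# Hypothetical sample: one near hit, one wrong place, one missing entry.
sample = [
    ((55.9533, -3.1883), (55.95, -3.19)),      # a few hundred metres apart
    ((55.9533, -3.1883), (55.8642, -4.2518)),  # tens of kilometres apart
    ((55.9533, -3.1883), None),                # no gazetteer entry found
]
total, found, correct, pct = score(sample)
```

Note that the percentage is taken over all place names, not only those with gazetteer entries, which matches the "As % of total" row.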
  9. 9. Current Development Issues
     •  Open source release
     •  Increased configurability
        –  Input formats: plain text, HTML, simple XML, ...
        –  User's own text analysis: paragraphs, sentences, word tokens, place name mark-up
        –  Output formats: map visualisation, text mark-up, ...
        –  User input: constrain by area, bounding box, ...
     •  Choice of gazetteer: GeoNames, Unlock, geonames-local, Pleiades+, Chalice historical gazetteer, ...
     •  Performance monitoring/evaluation against test sets
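As an illustration of the "constrain by bounding box" option, here is a minimal sketch of filtering gazetteer candidates by a user-supplied box. The candidate format, function names and approximate coordinates are assumptions made for this example and are not the Geoparser's actual interface; boxes crossing the antimeridian are not handled.

```python
def in_bbox(lat, lon, bbox):
    """True if (lat, lon) lies inside bbox = (south, west, north, east)."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def constrain(candidates, bbox):
    """Keep only the candidate places that fall inside the bounding box."""
    return [c for c in candidates if in_bbox(c["lat"], c["lon"], bbox)]


# Hypothetical candidates for the ambiguous name "Athens" (approximate coordinates).
athens_candidates = [
    {"name": "Athens, Greece", "lat": 37.98, "lon": 23.73},
    {"name": "Athens, Georgia, USA", "lat": 33.95, "lon": -83.38},
]
rough_greece_bbox = (34.0, 19.0, 42.0, 30.0)  # illustrative box, not an official extent
kept = constrain(athens_candidates, rough_greece_bbox)
```

A hard filter like this is the simplest design; a resolver could equally use the box as a soft preference when ranking candidates.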
  10. 10. GAP project: Pleiades+
     •  Based on the Pleiades set of ancient place names but extended in two ways:
     •  By matching Pleiades place names against GeoNames place names in the same location and adding the GeoNames alternative names to the Pleiades+ list:
        –  This adds three alternative names for the single Pleiades entry for Autricum (Chartrez, Chartres, Shartr), because "Autricum" is present in both Pleiades and GeoNames, with the same approximate location.
     •  At run-time, by looking up place names found in the text against GeoNames (as well as against Pleiades+) and then using the alternative names from GeoNames to match against the Pleiades+ list:
        –  Pleiades has no entry for "Egypt". We look up the name in GeoNames and use its alternative names (which include Aegyptus) to match back against Pleiades (which does include Aegyptus). (We don't want to simply take places directly from GeoNames because, when we tried it, we were swamped with irrelevant modern places having names corresponding to ancient toponyms.)
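The two extension steps can be sketched on toy data as follows. This is only an illustration of the idea described on the slide: the record layout, the function names, the degree-based "same approximate location" test and all coordinates are assumptions, not the project's actual implementation or data.

```python
def normalise(name):
    """Case-insensitive key for name comparison."""
    return name.strip().casefold()


def near(a, b, tol_deg=0.5):
    """Crude 'same approximate location' test on (lat, lon); tolerance is illustrative."""
    return abs(a[0] - b[0]) <= tol_deg and abs(a[1] - b[1]) <= tol_deg


def extend_offline(pleiades, geonames):
    """Step 1: copy GeoNames names onto co-located Pleiades entries that share a name."""
    extended = {
        pid: {"names": set(e["names"]), "loc": e["loc"]} for pid, e in pleiades.items()
    }
    for entry in extended.values():
        keys = {normalise(n) for n in entry["names"]}
        for g in geonames:
            g_names = {g["name"], *g["alt_names"]}
            if keys & {normalise(n) for n in g_names} and near(entry["loc"], g["loc"]):
                entry["names"] |= g_names
    return extended


def lookup_runtime(name, pleiades_plus, geonames):
    """Step 2: direct lookup, else match back via GeoNames alternative names.

    Only Pleiades+ identifiers are returned, never GeoNames places themselves.
    """
    index = {}
    for pid, e in pleiades_plus.items():
        for n in e["names"]:
            index.setdefault(normalise(n), set()).add(pid)
    key = normalise(name)
    if key in index:
        return index[key]
    hits = set()
    for g in geonames:
        g_names = {normalise(n) for n in (g["name"], *g["alt_names"])}
        if key in g_names:
            for alt in g_names:
                hits |= index.get(alt, set())
    return hits


# Toy records with made-up identifiers and approximate, illustrative coordinates.
pleiades = {
    "p1": {"names": {"Autricum"}, "loc": (48.45, 1.48)},
    "p2": {"names": {"Aegyptus"}, "loc": (30.0, 31.0)},
}
geonames = [
    {"name": "Chartres", "alt_names": ["Autricum", "Chartrez", "Shartr"], "loc": (48.447, 1.489)},
    {"name": "Egypt", "alt_names": ["Aegyptus"], "loc": (27.0, 30.0)},
]
pleiades_plus = extend_offline(pleiades, geonames)
egypt_hits = lookup_runtime("Egypt", pleiades_plus, geonames)
```

In this toy data the two Egypt records are deliberately placed too far apart for the offline step to merge them, so "Egypt" is only resolved by the run-time step.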
  11. 11. Chalice
     •  Connecting Historical Authorities with Linked Data, Contexts, and Entities.
     •  Funded under the JISC jiscEXPO programme on exposing digital content for education and research.
     •  The project is exploring the viability of creating a historical gazetteer from digitized volumes from the English Place-Name Society (EPNS).
     •  Partners:
        –  CDDA, Queen's University, Belfast
        –  School of Informatics, Edinburgh
        –  EDINA, Edinburgh
        –  CeRch, King's College London
     •  Informatics' role is to adapt our existing text mining/geoparsing technology to convert the textual documents that are output from OCR into structured data.
  12. 12. Chalice data
     •  Cheshire
        –  Cheshire Part I. EPNS Volume 44, 1970
        –  Cheshire Part II. EPNS Volume 45, 1970
        –  Cheshire Part III. EPNS Volume 46, 1971
        –  Cheshire Part IV. EPNS Volume 47, 1972
        –  Cheshire Part V (1:i). EPNS Volume 48, 1981
        –  Cheshire Part V (1:ii). EPNS Volume 54, 1981
     •  Small samples from:
        –  Berkshire, Buckinghamshire (Vol. 2), Cambridgeshire (Vol. 19), Derbyshire (Vols 27-29), Hertfordshire (Vol. 15)
     •  Shropshire: Pimhill Hundred (born digital)
  13. 13. EPNS
     •  Parishes are usually organised in terms of the hundreds to which they belong.
     •  Towns and villages are usually referred to as townships and are organised in terms of the parish to which they belong.
     •  Township descriptions often contain relatively unstructured information about smaller associated places such as buildings, bridges, lanes, woods and farms.
     •  Township descriptions also frequently contain separately marked sections of information about field names and street names.
     •  River and major road names are described separately from the inhabited place descriptions.
     •  Place names are the primary object of interest, and descriptions of them contain information about alternative names and spellings that have been attested in historical sources, and about the etymology of names or name parts.
     •  In Chalice we focus on capturing parishes, townships, sub-townships and attestations. We don't deal with hundreds, field names, street names, rivers, roads etc.
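The levels that Chalice captures (parish, township, sub-township, attestation) suggest a simple nested record. The sketch below is one hypothetical way to hold such data; the class and field names are assumptions for illustration and are not the Chalice output schema. The attestation values are placeholders, not real historical forms.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Attestation:
    """One historically attested form of a name (all fields illustrative)."""
    form: str
    date: Optional[str] = None
    source: Optional[str] = None


@dataclass
class Place:
    """A parish, township or sub-township with its attestations and sub-places."""
    name: str
    kind: str  # "parish", "township" or "sub-township"
    attestations: List[Attestation] = field(default_factory=list)
    children: List["Place"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Place"]]:
        """Yield (depth, place) for this place and everything nested under it."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# The township of Willaston in the parish of Neston; attestation is a placeholder.
willaston = Place(
    "Willaston",
    "township",
    attestations=[Attestation(form="<attested spelling>", date="<date>")],
)
neston = Place("Neston", "parish", children=[willaston])
outline = [(depth, place.kind, place.name) for depth, place in neston.walk()]
```

Hundreds, field names, street names, rivers and roads are left out of the structure, in line with the scope stated on the slide.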
  15. 15. The start of the entry for the township of Willaston in the parish of Neston in Wirral Hundred.
  22. 22. Issues
     •  OCR quality needs to be high: not just recognising characters correctly but getting font and layout information right. Failure to recognise bold and small caps fonts, or the difference between a line break and a paragraph break, can lead to major errors in the recognition process.
     •  EPNS volumes vary in the use of layout and font to indicate structure (e.g. Cheshire parishes are signalled by centring combined with roman-numeral numbering, while Hertfordshire ones are unnumbered but centred and in bold font). In some volumes potentially useful information is contained in footnotes.
     •  Different volumes reflect different decisions about where place name information should be put. In most cases the information about the parish name occurs next to the town in the parish that has the same name. In the Shropshire text some place name information occurs in an earlier volume and is not subsequently repeated, e.g. the description of the parish of Baschurch, containing a township of the same name, has no attestation or etymological information because the name was discussed in Part 1.
  28. 28. Thank you!
