Towards Semantic Enrichment of Newspapers: a historical ecology use case

Slides presented at WHISE II workshop in conjunction with ISWC 2017. 21 October 2017, Vienna, Austria
http://whise.kmi.open.ac.uk/

Published in: Technology

  1. Towards Semantic Enrichment of Newspapers: a historical ecology use case
     Marieke van Erp, Thomas van Goethem, Katrien Depuydt, Jesse de Does
  2. Digital Humanities Group (September 2017) http://huc.knaw.nl
  3. Changing views on animal species
  4. Changing views on animal species
     Image source: https://en.wikipedia.org/wiki/Little_Red_Riding_Hood#/media/File:Carl_Larsson_-_Little_Red_Riding_Hood_1881.jpg
  5. Changing views on animal species
  6. Historical ecology
     • studies interactions between humans and their environment
     • draws on theory and methods from biology, ecology, history, geography and others
     • informs policy makers to guide environmental management
     • often involves analysis of archival records
  7. SERPENS & CLARIAH
     • This project is a CLARIAH pilot project
     • Tools developed in CLARIAH are tested in a real use case
     • CLARIAH covers three domains:
        • language
        • structured data
        • multimedia
  8. SERPENS & CLARIAH
     • This project is a CLARIAH pilot project
     • Tools developed in CLARIAH are tested in a real use case
     • CLARIAH covers three domains:
        • language
        • structured data
        • multimedia
  9. Why it’s hard to work with text
     Image source: https://www.powned.tv/media/73380/johndewolfff_gr.jpg?anchor=center&mode=crop&quality=80&width=1024&height=576&rnd=131081516640000000
  10. Why it’s hard to work with text
      Image source: https://www.gpswalking.nl/web/img/A860/A860-Erp-082.jpg
  11. Sources and Resources
  12. Sources and Resources
      • + taxonomic lists from the Netherlands Biodiversity Centre and ATHENA project
      • + CLARIAH natural language processing tools
  13. SERPENS Workflow
  14. SERPENS Workflow
  15. Creating a gold standard dataset
      • Query Delpher API for “bunzing” (European Polecat) and “lynx” (Lynx)
         • Bunzing: 2,515 results
         • Lynx: 5,530 results
      • But: spelling, dialectal and historic variations may yield more results:
         • Bunsing (1,319)
         • Bonzing (47)
         • Bunzel (617)
         • Bunsel (67)
         • Ulk (21,993)
         • Ulling (1,153)
         • Fret (25,830)
         • Eierdief (98)
  16. Creating a gold standard dataset

      Category                       Bunzing    Lynx
      Natural History                    553     182
      Nuisance, material damage          296      10
      Nuisance, immaterial damage         81       -
      Pest control                       287       -
      Hunt for economic reasons          636     173
      Prevention                           1       -
      Accidents                          220       -
      Figurative                         109      36
      Other                              154    3468

  17. OCR Quality
      • Not all papers come out with the same OCR quality
      • Rank retrieved articles through lexicon match
      • National Library is working on automatically recognising ads, crosswords & images
  18. Discussion
      • Animals and plants don’t care about borders
      • Take Belgian & German news into account as well?
      • How much language variation to take into account?
      • How to make this more generic?
      • Employ word-sense disambiguation?
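
The spelling-variant counts on slide 15 can be explored with a small sketch. The hit counts below are the ones reported on the slide; the function name `split_variants` and the heuristic itself are assumptions made for illustration, not part of the SERPENS workflow. The idea is that a variant returning far more hits than the canonical spelling (here "ulk" and "fret") may also carry unrelated senses, so it is set aside for manual review rather than added to the query blindly.

```python
# Hit counts per spelling, as reported on slide 15 (Delpher search results).
BUNZING_HITS = {
    "bunzing": 2515,
    "bunsing": 1319,
    "bonzing": 47,
    "bunzel": 617,
    "bunsel": 67,
    "ulk": 21993,
    "ulling": 1153,
    "fret": 25830,
    "eierdief": 98,
}


def split_variants(hits, canonical):
    """Split variants into 'safe' and 'needs review'.

    Hypothetical heuristic (not from the slides): a variant with more
    hits than the canonical spelling is treated as possibly ambiguous.
    """
    threshold = hits[canonical]
    safe = {word: n for word, n in hits.items() if n <= threshold}
    review = {word: n for word, n in hits.items() if n > threshold}
    return safe, review


safe, review = split_variants(BUNZING_HITS, "bunzing")

# Upper bound only: one article may match several spellings.
safe_total = sum(safe.values())

print("query with:", sorted(safe))
print("review first:", sorted(review))
print("hits (upper bound):", safe_total)
```

The summed count is an upper bound rather than a number of distinct articles, since a single article can contain more than one spelling.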

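Slide 17 mentions ranking retrieved articles through lexicon match as a proxy for OCR quality. A minimal sketch of that idea follows, assuming the score is simply the fraction of word tokens found in a lexicon; the slides do not specify the actual scoring, and the function names, the tiny lexicon and the sample texts are all made up for illustration.

```python
import re


def lexicon_score(text, lexicon):
    """Fraction of alphabetic tokens found in the lexicon (0.0 if no tokens)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def rank_by_ocr_quality(articles, lexicon):
    """Sort (article_id, text) pairs from best to worst lexicon coverage."""
    return sorted(
        articles,
        key=lambda article: lexicon_score(article[1], lexicon),
        reverse=True,
    )


# Toy lexicon and invented article snippets with simulated OCR errors.
LEXICON = {"de", "bunzing", "is", "een", "roofdier", "in", "het", "bos"}
ARTICLES = [
    ("a1", "De bunzing is een roofdier"),
    ("a2", "Dc bnnzing is ecn roofdicr"),
    ("a3", "De bunzing in hct bos"),
]

ranked = rank_by_ocr_quality(ARTICLES, LEXICON)
for article_id, text in ranked:
    print(article_id, round(lexicon_score(text, LEXICON), 2))
```

In practice the lexicon would need to cover historical spellings as well, otherwise well-recognised older articles would be ranked as poor OCR.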