
Slicing and Dicing a Newspaper Corpus for Historical Ecology Research

Presented at EKAW 2018

Historical newspapers are a novel source of information for historical ecologists studying the interactions between humans and animals through time and space. Newspaper archives are particularly interesting to analyse because of their breadth and depth. However, the size and occasional noisiness of such archives also bring difficulties, as manual analysis is impossible. In this paper, we present experiments and results on automatic query expansion and categorisation for the perception of animal species between 1800 and 1940. For query expansion and to support the manual annotation process, we used lexicons. For the categorisation, we trained a Support Vector Machine model. Our results indicate that we can distinguish newspaper articles that are about animal species from those that are not with an F1 of 0.92, and subcategorise the different types of newspaper articles on animals with an F1 of up to 0.84.
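
The F1 scores quoted in the abstract are the harmonic mean of precision and recall. A minimal Python sketch of that computation follows; the function name and the confusion counts are illustrative assumptions, not figures from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when it is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for an "about animals" vs "not about animals" classifier.
example = f1_score(true_positives=90, false_positives=10, false_negatives=10)
```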


  1. Slicing and Dicing a Newspaper Corpus for Historical Ecology Research • Marieke van Erp, Jesse de Does, Katrien Depuydt, Rob Lenders, Thomas van Goethem
  2. SERPENS in a Nutshell • Historical ecologists are starting to use newspaper corpora for their research • The abundance of data is both a blessing and a curse • SERPENS aims to make the computer do the ‘boring’ work of filtering relevant articles from irrelevant ones • Historical ecology researchers can then spend more time on the ‘hard’ analyses
  3. Why pest and nuisance species? • Ambivalent relationship: • Food, fur, totem • Diseases, agricultural damage • Relationships change over time • Exotic species, reintroductions, plagues • Understanding the past helps us to understand current ecological conditions • Useful to policy makers, conservation biologists, etc. (Image: Muskrat)
  4. Why newspapers? • Which species were considered “pest and nuisance species”? • Why were they considered as such? • How did humans respond? Also more tangible information: • Extermination methods, number of incidents/sightings, statistics, fur prices
  5. First hurdle: OCR • The older the source, the harder it is to read • OCR errors may result in relevant documents being missed and irrelevant documents being retrieved • We don’t try to ‘fix’ bad OCR but rank documents by OCR quality through lexicon overlap
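
Ranking by lexicon overlap, as described on the OCR slide, can be sketched roughly as below. This is only one plausible reading: it scores a document by the fraction of its alphabetic tokens found in a word list. The function names, the tokenisation rule, and the toy lexicon and documents are all assumptions made for illustration, not the SERPENS implementation.

```python
import re


def lexicon_overlap(text: str, lexicon: set[str]) -> float:
    """Fraction of alphabetic tokens in `text` that occur in `lexicon`."""
    # Letters only: digits split a token, so "w0lf" becomes "w" and "lf".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def rank_by_ocr_quality(documents: dict[str, str], lexicon: set[str]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs, best presumed OCR quality first."""
    scored = [(doc_id, lexicon_overlap(text, lexicon)) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): the same sentence with clean and with noisy OCR.
TOY_LEXICON = {"the", "wolf", "was", "seen", "near", "forest"}
TOY_DOCUMENTS = {
    "noisy": "Tlie w0lf was secn near tbe forcst",
    "clean": "The wolf was seen near the forest",
}
ranking = rank_by_ocr_quality(TOY_DOCUMENTS, TOY_LEXICON)
```

Documents with a low score can then be pushed down the ranking or labelled as bad OCR rather than corrected.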
  6. Ambiguity • Wolf: animal • Wolf: last name • Wolf in sheep’s clothing • … • Context of the document needed to find the right meaning
  7. Experimental Setup
  8. SERPENS Categories • Natural history • Nuisance, material damage • Nuisance, immaterial damage • Pest control • Hunt for economic reasons • Prevention • Accidents • Figurative • Other beast • No beast • Bad OCR
  9. Training a new topic classifier • Manually classified 9,940 documents • Replace occurrences of animal names from queries with “—ANIMAL—” • 10-fold cross-validation • Various experiments to measure the impact of settings and dataset size • Code available at: CLARIAH/serpens/
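
Two of the preprocessing steps on the training slide, masking the queried animal names and splitting the data into ten folds, can be sketched in plain Python as follows. The function names, the case-insensitive whole-word matching, and the round-robin fold assignment are assumptions for illustration; the project's own code may do this differently.

```python
import random
import re

PLACEHOLDER = "—ANIMAL—"  # placeholder string as written on the slide


def mask_animal_names(text: str, animal_names: list[str]) -> str:
    """Replace whole-word, case-insensitive occurrences of the names."""
    if not animal_names:
        return text
    # Longest names first, so multi-word names win over their own prefixes.
    ordered = sorted(animal_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)
    return pattern.sub(PLACEHOLDER, text)


def k_fold_indices(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for held_out in range(k):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, folds[held_out]


masked = mask_animal_names(
    "The wolf and the European polecat raided the farm.",
    ["wolf", "polecat", "European polecat"],
)
splits = list(k_fold_indices(25, k=10))
```

Masking keeps the classifier from simply memorising which species were queried, so it has to rely on the surrounding context. In each fold, a model would be trained on the masked texts at `train` and evaluated on those at `test`; that step is not shown here.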
  10. Results for different algorithms
  11. Zooming in (snippets)
  12. Results per class, linear SVM (snippets)
  13. Learning curves • Total dataset consists of nearly 10,000 annotated examples • Learning curves are a measure of performance vs training set size • Results converge rapidly: for the two-class problem, ~1,000 examples already achieve 90% accuracy
  14. Preliminary analysis • Public perception of Mustelidae (European polecat) • Combination of distant and close reading approaches • Newspaper archives not equally well digitised over time • Trends in news may affect reporting on animals
  15. Lessons Learnt & Future Work • Domain use cases often need specific solutions • Document classification already very useful to historical ecologists (probably also to other domain experts) • 1,000 annotated examples sufficient for two-class classification • Extend to more species • Improve classification of sub-categories • Add sentiment/opinions (Image: Mink)
  16. Shameless plug: 3rd Workshop on Humanities in the Semantic Web
  17. Questions?