
Using Linked Open Data to crowdsource Dutch WW2 underground newspapers on Wikipedia

During the Second World War, some 1,300 illegal newspapers were issued by the Dutch resistance.

Right after the war, as many of these newspapers as possible were physically preserved by Dutch memory institutions. They were described in formal library catalogues, which were digitized and brought online in the 1990s. In 2010 the national collection of underground newspapers – some 200,000 pages – was full-text digitized in Delpher, the Dutch national aggregator for historical full-texts.

With online metadata and full-texts available for these publications, the third pillar, "context", was still missing, making it hard for people to understand the historical background of the newspapers.

We are currently running a project to tackle this contextual problem. We started by extracting contextual entries from a hard-copy standard work on the Dutch illegal press and combined these with data from the library catalogue and Delpher into a central LOD triple store.
We then created links between historically related newspapers and used Named Entity Recognition to find persons, organisations and places related to the newspapers. We further semantically enriched the data using DBPedia.
Next, using an article template to ensure uniformity and consistency, we generated 1,300 Wikipedia article stubs from the database.
Finally, we sought collaboration with the Dutch Wikipedia volunteer community to extend these stubs into full encyclopedic articles.
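
For the Named Entity Recognition and DBpedia enrichment step, a service such as DBpedia Spotlight can be used. The sketch below is only an illustration of that idea, not the project's actual pipeline: the public Dutch Spotlight endpoint, the sample entry text and the confidence threshold are assumptions made for this example.

    # Illustrative sketch: annotate a (Dutch) entry text with DBpedia Spotlight.
    # Endpoint, sample text and confidence threshold are assumptions for this example.
    import requests

    SPOTLIGHT_NL = "https://api.dbpedia-spotlight.org/nl/annotate"

    def find_entities(text, confidence=0.5):
        """Return (surface form, DBpedia URI) pairs found in the text."""
        response = requests.get(
            SPOTLIGHT_NL,
            params={"text": text, "confidence": confidence},
            headers={"Accept": "application/json"},
        )
        response.raise_for_status()
        resources = response.json().get("Resources", [])
        return [(r["@surfaceForm"], r["@URI"]) for r in resources]

    if __name__ == "__main__":
        entry = ("De Geus onder studenten was een illegaal studentenblad, "
                 "uitgegeven tijdens de Tweede Wereldoorlog.")
        for surface, uri in find_entities(entry):
            print(surface, "->", uri)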
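
The stub generation is essentially a matter of filling one fixed wikitext template per title from the database. Again purely as a sketch: the record fields and the wikitext below are invented for illustration and are not the project's actual template.

    # Sketch: fill a wikitext stub from one database record.
    # Field names, links and wording are illustrative, not the real project template.
    STUB_TEMPLATE = (
        "'''{title}''' was een Nederlandse verzetskrant die tijdens de "
        "Tweede Wereldoorlog verscheen in {place}.\n\n"
        "== Externe links ==\n"
        "* [{delpher_url} {title} in Delpher]\n"
        "* [{catalogue_url} Beschrijving in de bibliotheekcatalogus]\n"
    )

    def make_stub(record):
        """Turn one database record (a dict) into a wikitext article stub."""
        return STUB_TEMPLATE.format(**record)

    example = {  # hypothetical record
        "title": "Voorbeeldkrant",
        "place": "Amsterdam",
        "delpher_url": "https://www.delpher.nl/nl/kranten",
        "catalogue_url": "http://example.org/catalogue/123",
    }
    print(make_stub(example))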

In this way we can give every newspaper its own Wikipedia article, making these WW2 materials much more visible to the Dutch public, over 80% of whom use Wikipedia.

At the same time the triple store can serve as a source for alternative applications, such as data visualizations. This will enable us to visualize connections and networks between underground newspapers as they developed between 1940 and 1945.
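
As an illustration of such reuse, a query along the following lines could pull the relations between newspapers out of the triple store and feed them into a network visualization. The endpoint URL and the dct:relation property are assumptions made for this sketch, not necessarily the schema of the actual store.

    # Sketch: fetch newspaper-to-newspaper relations from a SPARQL endpoint for a network graph.
    # The endpoint and the dct:relation property are assumptions for this example.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://example.org/sparql"  # hypothetical Virtuoso endpoint

    QUERY = """
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?title ?relatedTitle WHERE {
      ?paper dct:relation ?related ;
             rdfs:label   ?title .
      ?related rdfs:label ?relatedTitle .
    }
    """

    def relation_edges():
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setQuery(QUERY)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [(b["title"]["value"], b["relatedTitle"]["value"])
                for b in results["results"]["bindings"]]

    # The resulting pairs can be fed into e.g. networkx or a D3 visualization.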

Presentation during the SWIB (Semantic Web in Libraries) conference, 28-30 November 2016, Bonn, Germany

The SWIB conference is an annual conference focusing on Linked Open Data (LOD) in libraries and related organizations. The topics of talks and workshops at SWIB revolve around opening data, linking data and creating tools and software for LOD production scenarios. These areas of focus are supplemented by presentations of research projects in applied sciences, industry applications, and LOD activities in other areas. SWIB mainly targets IT staff, developers, librarians and researchers.



  1. 1. Using LOD to crowdsource Dutch WW2 underground newspapers on Wikipedia Olaf Janssen, National Library of the Netherlands & Wikipedia Gerard Kuys, DBpedia & Wikimedia Nederland olaf.janssen@kb.nl - @ookgezellig - slideshare.net/OlafJanssenNL SWIB 2016, Bonn, 29-11-2016
  2. 2. http://www.4en5meiamsterdam.nl/attachment/47454
  3. 3. During WW2 the Dutch resistance issued many underground newspapers. In every shape & form… http://www.4en5meiamsterdam.nl/attachment/47454
  4. 4. http://resolver.kb.nl/resolve?urn=ddd:010436323 http://resolver.kb.nl/resolve?urn=ddd:010442948 http://resolver.kb.nl/resolve?urn=ddd:010447825 http://resolver.kb.nl/resolve?urn=ddd:010450508 From well-organized, ‘professional’ big titles… (among others Parool, Vrij Nederland, Trouw, de Waarheid)
  5. 5. …to very small, amateur, home-made, pamphlet-like issues
  6. 6. After the war 1,300 newspaper titles were (physically) preserved at the NIOD … https://commons.wikimedia.org/wiki/File:Verzetskrant_in_archiefdozen_bij_het_NIOD.jpg – CC-BY-SA – OlafJanssen The Netherlands Institute for War, Holocaust and Genocide Studies in Amsterdam
  7. 7. http://opac-gonext.oclc.org:8180/DB=8/XMLPRS=Y/PPN?PPN=107123223 … and were described in formal library catalogues (1,300 titles) Bibliographic metadata Underground students’ newspaper from The Hague
  8. 8. In 2010 these WW2 newspapers were digitized…..
  9. 9. www.delpher.nl/kranten …into full-texts in Delpher … (1,300 titles) The Dutch national aggregator for historic full-texts • Newspapers • Books • Magazines
  10. 10. In Delpher you can read and search these newspapers… • Scans • Full-text OCR • ALTO
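
As a side note on the ALTO format mentioned above: ALTO is the XML layout-and-text format in which the OCR is delivered, and the plain text can be recovered from it with a few lines of code. A minimal sketch, assuming a locally downloaded ALTO file (the file name is hypothetical):

    # Sketch: pull the plain OCR text out of an ALTO XML file, line by line.
    # Kept namespace-agnostic because ALTO files may use different namespace versions.
    import xml.etree.ElementTree as ET

    def alto_to_text(path):
        """Concatenate the CONTENT attributes of all <String> elements per <TextLine>."""
        root = ET.parse(path).getroot()
        lines = []
        for element in root.iter():
            if element.tag.endswith("TextLine"):
                words = [child.get("CONTENT", "") for child in element
                         if child.tag.endswith("String")]
                lines.append(" ".join(word for word in words if word))
        return "\n".join(lines)

    # print(alto_to_text("some_delpher_page_alto.xml"))  # hypothetical file name
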
  11. 11. But say, I want to know more about this newspaper • What sort of illegal newspaper was it? • What is the history of this newspaper? • Who wrote it? • Where was this newspaper printed? • How was it distributed? • Were there any relations with other underground newspapers? • Etc…
  12. 12. But say, I want to know more about this newspaper • What sort of illegal newspaper was it? • What is the history of this newspaper? • Who wrote it? • Where was this newspaper printed? • How was it distributed? • Were there any relations with other underground newspapers or resistance groups? • Etc…
  13. 13. But say, I want to know more about this newspaper • What sort of illegal newspaper was it? • What is the history of this newspaper? • Who wrote it? • Where was this newspaper printed? • How was it distributed? • Were there any relations with other underground newspapers? • Etc… You can’t answer these questions from Delpher
  14. 14. Big drawback of Delpher: No contextual information about WW2 underground newspapers https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg
  15. 15. http://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Where would many people go to find contextual information about historic newspapers? Probably Wikipedia (via Google)
  16. 16. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg
  17. 17. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg
  18. 18. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg Information on underground newspapers is distributed across multiple, unconnected sources 1. Descriptions (metadata in library catalogue, 1,300 titles) 2. Content (full-text in Delpher, 1,300 titles) 3. Context (in Wikipedia… at least…)
  19. 19. This Wikipedia article is a carefully chosen exception
  20. 20. 1. There are very few illegal newspapers with their own WP articles 2. The inventory of these newspapers on WP is far from complete <<< 1,300 titles
  21. 21. We can tackle both problems!
  22. 22. Wikiproject Systematically and uniformly describe & interlink all 1,300 Dutch underground newspapers from WW2 on Wikipedia tinyurl.com/verzetskranten
  23. 23. Wikiproject Systematically and uniformly describe & interlink all 1,300 Dutch underground newspapers from WW2 on Wikipedia tinyurl.com/verzetskranten 2) Automatically make data available for other open purposes Wikidata -- DBpedia -- Dataviz 1) Reach big audiences
  24. 24. https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg We badly need contextual information about the newspapers. Where do we get it? De Ondergrondse Pers 1940-1945 (The Underground Press 1940-1945), Lydia E. Winkel, H. de Vries, 1989, ISBN 9021837463, Veen Uitgevers. This paper book contains entries about all 1,300 illegal newspapers
  25. 25. Entry 199 – De Geus; (onder studenten) Unique ID (within the book)
  26. 26. Place of publication Newspaper Place name Entry 199 – De Geus; (onder studenten)
  27. 27. Entry 199 – De Geus; (onder studenten) Context Raw material for Wikipedia article!
  28. 28. Entry 199 – De Geus; (onder studenten) Person names Newspaper Persons
  29. 29. Entry 199 – De Geus; (onder studenten) IDs of related students’ newspapers This newspaper Other newspapers
  30. 30. We OCRed this book into PDF (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 (PDF)
  31. 31. We OCRed this book into PDF (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 (PDF) Available online (PDF, flat file) Open license (CC-BY-SA) Convert PDF into structured database. Link: titles → places, persons, other titles Link: titles → library catalogue (metadata) and Delpher (full-text) Link: titles, persons and places → external sources
  32. 32. Convert PDF into structured database. Link: titles → places, persons, other titles Link: titles → library catalogue (metadata) and Delpher (full-text) Link: titles, persons and places → external sources My co-author Gerard Kuys
  33. 33. Convert PDF into structured database. Link: titles → places, persons, other titles Link: titles → library catalogue (metadata) and Delpher (full-text) Link: titles, persons and places → external sources VIAF
  34. 34. Technical appendix from slide 48 onwards
  35. 35. We OCRed this book into PDF (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 (PDF) Available online (PDF, flat file) Open license (CC-BY-SA) Convert PDF into structured database. Link: titles → places, persons, other titles Link: titles → library catalogue (metadata) and Delpher (full-text) Link: titles, persons and places → external sources
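
To make the conversion-and-linking step more concrete, here is a minimal sketch of what one entry from the book could look like as RDF. The ns0 namespace, ns0:UndergroundPublication and ns0:winkelSummary are taken from the CONSTRUCT query in the technical appendix; the remaining properties and values (place, related title, Delpher link) are illustrative stand-ins, not the store's actual schema.

    # Sketch: entry 199 (De Geus onder studenten) as RDF triples with rdflib.
    # Only the ns0 terms come from the appendix; other properties/values are illustrative.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, RDFS, DCTERMS

    NS0 = Namespace("http://almere.pilod.nl/LydiaWinkel/")

    g = Graph()
    paper = NS0["199"]  # unique ID within Winkel's book

    g.add((paper, RDF.type, NS0.UndergroundPublication))
    g.add((paper, RDFS.label, Literal("De Geus onder studenten", lang="nl")))
    g.add((paper, NS0.winkelSummary, Literal("...")))   # the OCRed entry text
    g.add((paper, DCTERMS.spatial, Literal("Leiden")))  # place of publication (illustrative value)
    g.add((paper, DCTERMS.relation, NS0["270"]))        # a related title, cf. '(Zie nr. 270)' (illustrative)
    g.add((paper, RDFS.seeAlso, URIRef("https://www.delpher.nl/nl/kranten")))  # full text in Delpher (generic link)

    print(g.serialize(format="turtle"))
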
  36. 36. Summer 2016 This LOD triple store (Virtuoso) is unique in the Netherlands. For the first time, data about underground newspapers has been systematically collected and linked online! https://www.pinterest.com/freethewronged/world-war-ii/ 2) For other open reuse purposes Wikidata -- DBpedia -- Dataviz 1) For Wikipedia
  37. 37. Wikiproject Systematically and uniformly describe & interlink all 1,300 Dutch underground newspapers from WW2 on Wikipedia
  38. 38. We have: LOD-database Using an article template we generated 1,300 uniform and interlinked Wikipedia stubs https://c1.staticflickr.com/9/8281/7699231918_11a7356c38_b.jpg
  39. 39. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Non-grey = Wikipedia article stub Automatically generated from database using a template
  40. 40. This bit was added manually to expand the stub into a full article → crowdsourcing by the Dutch Wikipedia community https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  41. 41. A group of Wikipedia volunteers is currently working to expand the 1,300 stubs… gradually creating more and more full articles. By Sebastiaan ter Burg [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
  42. 42. Before the project
  43. 43. The number of articles is growing steadily…
  44. 44. … making many Dutch people happy! http://www.formerdays.com/2011/05/dutch-liberation.html
  45. 45. Thanks! olaf.janssen@kb.nl - @ookgezellig tinyurl.com/verzetskranten
  46. 46. Technical appendix – Slides by Gerard Kuys http://www.ilord.com/vintage.html - http://www.ilord.com/images/enigma-8-rotors-1000px.jpg
  47. 47. Transforming Descriptive Data into Linked Open Data - Locations
  48. 48. Transforming Descriptive Data into Linked Open Data - Persons
  49. 49. Transforming Descriptive Data into Linked Open Data - Interlinking
  50. 50. Things yet to come
    • Interlinked descriptions in Lydia Winkel’s annotations (‘see also’) can be put to use in order to construct an affiliation chain for underground publications
    • Right now, the model of people involved with one or more underground publications is very flat indeed: either someone is involved or not mentioned in this context at all. The consequences are devastating:
      – No distinction between people writing and people distributing, or doing both
      – Hardly a clue as to the people who did the illegal multiplying of copies, and how they organised their logistics (labour, machines, paper, ink, stencil sheets or lead slugs, etc.)
      – And, worst of all: no way to distinguish resistance people from snitches and agents provocateurs
    • We need an event model in order to connect people to the things that happened to an underground publication, and to be at least a bit precise about their role in a particular event (a sketch of such a model follows after this slide)
    • More often than not, new editions sprang up as a result of collaborators whose opinions gradually diverged; we would like to create an overview of evolving points of view by way of some kind of representation of categorizations of political beliefs
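
A minimal sketch of what such an event model could look like, purely as an illustration of the idea; every class and property below is invented for the example and is not part of the current store:

    # Sketch of an event-centric model: a person is connected to a publication via an
    # event that carries a role. All names in the EX namespace are invented for illustration.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/model/")
    NS0 = Namespace("http://almere.pilod.nl/LydiaWinkel/")

    g = Graph()
    event = EX["event/199-printing-1943"]  # hypothetical event

    g.add((event, RDF.type, EX.PrintingEvent))
    g.add((event, EX.concernsPublication, NS0["199"]))               # the newspaper the event is about
    g.add((event, EX.hasParticipant, EX["person/example-printer"]))  # hypothetical person
    g.add((event, EX.inRole, EX.Printer))                            # writer / distributor / printer / informer ...
    g.add((event, EX.during, Literal("1943")))                       # illustrative date

    print(g.serialize(format="turtle"))
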
  51. 51. How did we do the linking?
    • Forget about a fully automated process: it is 80/20 all the time
    • But what we can do in an automated way is Named Entity Recognition
    • In order to do Named Entity Recognition, we need reference lists of people or things (‘gazetteers’) that strings within descriptive text fragments can be matched against
    • We have two excellent reference lists at our disposal:
      – The Index of Places (already in the 1954 edition of Lydia Winkel’s book)
      – The Index of Persons (added to the 1989 edition of the same work)
      – With only slight manual corrections (e.g., ‘Ferwerderadeel’ where Winkel has ‘Ferweradeel’)
      – Linked to the site gemeentegeschiedenis.nl, which provides data on Dutch municipality boundaries; these kept on changing during World War II
    • And, of course, there is DBpedia:
      – Currently identifying 402 Dutch resistance people, apart from people who became better known as writers, politicians, sportsmen, etc.
      – Identifying and linking to all of the locations mentioned in Lydia Winkel’s text
      – Inviting everyone to improve the list by adding entries or list items to Wikipedia
    • Once digitized, Lydia Winkel’s texts become very malleable and searchable, so we could easily locate all candidate references to other underground periodicals for interlinking
      – Find ‘(Zie nr. 270)’, ‘(Zie nr. 270, xxxx)’, ‘(Zie nrs. xxxx, nr. 270)’, ‘(Zie nrs. xxxx, 270, yyyy)’ (see the sketch after this slide)
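
The ‘Zie nr.’ cross-references listed above lend themselves to a simple regular expression. A sketch, assuming the OCRed entry text is available as a Python string (the sample text is made up):

    # Sketch: extract '(Zie nr. 270)'-style cross-references from an OCRed entry.
    # Covers the variants listed above: 'Zie nr. 270', 'Zie nrs. ..., 270', 'Zie nrs. ..., nr. 270'.
    import re

    ZIE_BLOCK = re.compile(r"\(Zie\s+nrs?\.\s*([^)]+)\)", re.IGNORECASE)
    NUMBER = re.compile(r"\d+")

    def related_entry_ids(entry_text):
        """Return all entry numbers referenced from one entry's text."""
        ids = []
        for block in ZIE_BLOCK.findall(entry_text):
            ids.extend(int(n) for n in NUMBER.findall(block))
        return ids

    sample = "Studentenblad, verschenen tijdens de bezetting. (Zie nrs. 123, nr. 270)"  # made-up example
    print(related_entry_ids(sample))  # -> [123, 270]
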
  52. 52. How did we do the linking?
  53. 53. How did we do the linking?
  54. 54. Named Entity Recognition using SILK Workbench
  55. 55. Generating References
    • The general idea is that a Reference is a resource in its own right
      – It is not the resource pointed to
      – It has properties of its own, like source, page number, connected resource
      – It could also be the place where an event is linked to the object that is referenced, because we have a context here
    • A single Reference resource is created for each occasion the subject is mentioned in a text
      – In this way, we can point to the exact place of a reference within a larger text fragment
    • A Reference is not a Link
      – A Reference is a real-world thing itself: it is a place in a text saying something about something else
      – owl:sameAs links should be bound to the real-world object or, better still, be stored in a LinkSet
  56. 56. Matching text fragments against Linked Data resources
    Approaches:
    • Brute force with SPARQL: a query with the ‘CONTAINS’ keyword
    • Using the existing data with SPARQL: a query connecting Persons from the Person index to References generated from the text
    • Matching against DBpedia: DBpedia Spotlight
    • Fine-grained comparison: GATE scripting
  57. 57. Generating References

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX bf:   <http://bibframe.org/vocab/>
    PREFIX ns0:  <http://almere.pilod.nl/LydiaWinkel/>
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    # Build a Reference resource for each underground publication whose Winkel summary
    # text contains the label of a subject that is referenced elsewhere in the data.
    CONSTRUCT {
      ?URI a dbo:Reference ;
           dct:references ?ts ;
           dct:source ?comm ;
           dbo:connectsReferencedTo ?subject
    }
    FROM <http://almere.pilod.nl/LydiaWinkel/>
    WHERE {
      ?ts a ns0:UndergroundPublication .
      BIND (IRI(CONCAT(STR(?ts), "-Ref1")) AS ?URI)
      ?ts ns0:winkelSummary ?comm .
      ?comm bf:annotationBody ?ann .          # the OCRed entry text
      ?ref dct:references ?subject .
      ?subject rdfs:label ?ond .
      FILTER (CONTAINS(?ann, ?ond))           # the entry text mentions the subject's label
    }
  58. 58. The Data Model: Library of Congress’ BibFrame
  59. 59. The Data Model: Interlinking Underground Publications
  60. 60. The Data Model: Interlinking Underground Publications
