Linked Open Data case study (illegal newspapers WW2, Wikipedia, DBpedia) - Lecture Leiden University 27-2-2017

A practical case study of how to create Linked Open Data for 1.300 Dutch underground newspapers from World War 2 using Wikipedia, DBpedia and an old paper book.

Lecture given by Olaf Janssen - Wikimedia & Open Data coordinator for the National Library of the Netherlands (KB) - for students of the master's course "Digital Access to Cultural Heritage" at Leiden University on 27-2-2017

Published in: Education

  1. 1. Linked Open Data case study: WW2 underground newspapers on Wikipedia Digital Access to Cultural Heritage, Leiden University, 23-2-2017 Olaf Janssen (Koninklijke Bibliotheek) olaf.janssen@kb.nl - @ookgezellig
  2. 2. What I hope you’ll learn today Using LOD theory (lecture René Voorburg) in practice 1. How to give a new life to an old paper book 2. How to get 1.300 newspapers from WW2 on Wikipedia While doing 1 and 2: 3. The advantages of Linked Open Data (= downsides of unconnected data sources) See this lecture on Slideshare: https://www.slideshare.net/OlafJanssenNL/linked-open-data-case-study-illegal-newspapers-ww2-wikipedia-dbpedia-lecture-leiden-university-2722017
  3. 3. http://www.4en5meiamsterdam.nl/attachment/47454
  4. 4. During WW2 the Dutch resistance issued many underground, illegal newspapers. In every shape & form… http://www.4en5meiamsterdam.nl/attachment/47454
  5. 5. http://resolver.kb.nl/resolve?urn=ddd:010436323 http://resolver.kb.nl/resolve?urn=ddd:010442948 http://resolver.kb.nl/resolve?urn=ddd:010447825 http://resolver.kb.nl/resolve?urn=ddd:010450508 From well-organized, ‘professional’ big titles… (among others Parool, Vrij Nederland, Trouw, de Waarheid)
  6. 6. …to very small, amateur, home-made, pamphlet-like issues
  7. 7. By OlafJanssen (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], - https://commons.wikimedia.org/wiki/File:Kluisdeur_bij_het_NIOD_in_Amsterdam.jpg After the war 1.300 newspaper titles were (physically) preserved at the NIOD, the national Institute for War, Holocaust and Genocide Studies in Amsterdam. By OSeveno (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], - https://commons.wikimedia.org/wiki/File:02_(NIOD)_2016_(Herengracht_380-382,_Amsterdam).jpg
  8. 8. By Romaine - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=37072767
  9. 9. https://commons.wikimedia.org/wiki/File:Verzetskrant_in_archiefdozen_bij_het_NIOD.jpg – CC-BY-SA - OlafJanssen
  10. 10. By Romaine - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=37072734
  11. 11. They were described in formal library catalogues (1.300 titles) Bibliographic metadata Underground students’ newspaper from The Hague
  12. 12. 107123223 = PPN = unique ID of this title in KB catalogue
  13. 13. In 2010 these newspapers were digitised page by page… resulting in …
  14. 14. www.delpher.nl/kranten … full-texts in Delpher … (1.300 titles)
  15. 15. Once more, De Geus onder studenten 37 editions
  16. 16. URL of this page: http://www.delpher.nl/nl/kranten/results?coll=dddtitel&cql[]=ppn+any+(107123223) 107123223 = PPN = unique ID of this title in Delpher (same for KB catalogue) Once more, De Geus onder studenten
  17. 17. In Delpher you can read and (word)search these newspapers… • Scans • Full-text OCR • ALTO
  18. 18. But say, I want to know more about this newspaper • What sort of illegal newspaper was it? • What is the history of this newspaper? • Who wrote it? • Where was this newspaper printed? • How was it distributed? • Were there any relations with other underground newspapers? • Etc…
  19. 19. But say, I want to know more about this newspaper • What sort of illegal newspaper was it? • What is the history of this newspaper? • Who wrote it? • Where was this newspaper printed? • How was it distributed? • Were there any relations with other underground newspapers or resistance groups? • Etc…
  20. 20. Under “Details” perhaps? But say, I want to know more about this newspaper • What sort of illegal newspaper was it? • What is the history of this newspaper? • Who wrote it? • Where was this newspaper printed? • How was it distributed? • Were there any relations with other underground newspapers? • Etc…
  21. 21. OK ok, some metadata… .. but no real contextual info
  22. 22. Maybe in the catalogue record??
  23. 23. Big drawback of Delpher (and KB catalogue) No contextual information about WW2 underground newspapers https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg
  24. 24. Question: Where would many people go to find contextual information about historic newspapers? Probably Wikipedia! (via Google)
  25. 25. http://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Question: Where would many people go to find contextual information about historic newspapers? Probably Wikipedia! (via Google)
  26. 26. Report on interest in WW2 among Dutch population http://www.oorlogsbronnen.nl/gebruikersonderzoek2015, May 2015
  27. 27. Many of us use the internet to search for information [..]. We often mention Wikipedia… General audience
  28. 28. Everything is of course on Wikipedia. Just type in a name and you can read entire essays... People > 60
  29. 29. Over half of us think that Wikipedia and Google contribute to our knowledge and understanding of history School/students
  30. 30. Over half of us think that Wikipedia and Google contribute to our knowledge and understanding of history When we have to find information about WW2 outside the class setting, we fully concentrate on digital resources like Google and Wikipedia. School/students
  31. 31. http://www.delpher.nl/nl/kranten/results?coll=dddtitel&cql[]=ppn+any+(107123223) Context about De Geus onder studenten is given in http://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  32. 32. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg
  33. 33. This Wikipedia article is a carefully chosen exception
  34. 34. 1. There are very few illegal newspapers with their own WP articles 2. The inventory of these newspapers on WP is far from complete <<< 1.300 titles
  35. 35. https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg So: We badly need contextual information about illegal newspapers. Where do we get it?
  36. 36. https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg So: We badly need contextual information about illegal newspapers. Where do we get it? De Ondergrondse Pers 1940-1945 Lydia E. Winkel, H. de Vries , 1989, ISBN 9021837463, Veen Uitgevers This paper book contains contextual entries about all 1.300 illegal newspapers
  37. 37. Entry 199 – De Geus; (onder studenten) Unique ID (within the book)
  38. 38. Metadata
  39. 39. Place of publication. Relation 1/3: Newspaper → Place name
  40. 40. Context Raw material for Wikipedia article
  41. 41. Person names. Relation 2/3: Newspaper → Persons
  42. 42. This article also contains references to other newspapers • 106 = Cereales Vadeness (students’ resistance newspaper, Wageningen) • 360 = Leidsche Brief (students’ resistance newspaper, Leiden) • 748 = Sol Justitiae (students’ resistance newspaper, Utrecht)
  43. 43. Relation 3/3: This newspaper → Other newspapers
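The three relation types above can be written down as subject-predicate-object triples, which is the shape the later RDF steps rely on. The following is a minimal sketch in Python: the `dop:` identifiers reuse the entry numbers from the book, the predicate names are hypothetical labels rather than a real vocabulary, and the place value is taken from the slide that describes the paper as coming from The Hague.

```python
# Relations for DOP entry 199, using the entry numbers mentioned on the slides.
# Predicate names ("title", "placeOfPublication", "relatedNewspaper") are
# hypothetical labels chosen for this sketch, not an existing vocabulary.
TRIPLES = [
    ("dop:199", "title", "De Geus onder studenten"),
    ("dop:199", "placeOfPublication", "The Hague"),
    ("dop:199", "relatedNewspaper", "dop:106"),
    ("dop:199", "relatedNewspaper", "dop:360"),
    ("dop:199", "relatedNewspaper", "dop:748"),
    ("dop:106", "title", "Cereales Vadeness"),
    ("dop:360", "title", "Leidsche Brief"),
    ("dop:748", "title", "Sol Justitiae"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def related_titles(subject, triples=TRIPLES):
    """Follow relatedNewspaper links and collect the title of each target."""
    titles = []
    for target in objects(subject, "relatedNewspaper", triples):
        titles.extend(objects(target, "title", triples))
    return titles


if __name__ == "__main__":
    print(related_titles("dop:199"))
```

Because every statement has the same three-part shape, following a relation is just a second lookup, which is what makes this representation easy to query by machine.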
  44. 44. https://knowledgeutopia.files.wordpress.com/2014/01/hollandhouselibraryblitz1940.jpg Too bad it’s a paper book → hard to find, copy, distribute and build upon
  45. 45. We need it digital!! https://knowledgeutopia.files.wordpress.com/2014/01/hollandhouselibraryblitz1940.jpg
  46. 46. We need it digital!! 1. Clear copyright with copyright holder (NIOD) → Open CC-BY-SA license 2. Scan & OCR 3. Convert into PDF 4. Put online: NIOD site & Wikimedia Commons
  47. 47. DOP as PDF on NIOD website (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945
  48. 48. DOP as PDF on Wikimedia Commons https://commons.wikimedia.org/wiki/File:PDF_of_De_Ondergrondse_Pers_1940-1945_-_derde_druk_-_1989.pdf
  49. 49. De Winkel as PDF on NIOD website (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 Saved us €13.330! http://www.brill.com/dutch-underground-press-1940-1945
  50. 50. Wikipedia article about DOP http://nl.wikipedia.org/wiki/De_ondergrondse_pers_1940-1945
  51. 51. Wikipedia article about the author http://nl.wikipedia.org/wiki/Lydia_Winkel
  52. 52. DOP, the pluses Available online (PDF, flat file) Open license (CC-BY-SA) Contextual information Relations • Newspaper → Places • Newspaper → Persons • Newspaper → Other newspapers http://www.archives.gov/research/military/ww2/photos/images/ww2-194.jpg
  53. 53. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg DOP, the minuses Unstructured data (PDF, flat file) → Not very machine readable (unlike CSV, XML, JSON, RDF) No relations between • newspaper ↔ KB catalogue (metadata) • newspaper ↔ Delpher (full-text) • newspaper, places & persons ↔ external sources (like Wikipedia)
  54. 54. To summarize: a lot of data sources are available about these WW2 underground newspapers 1. Metadata (from KB catalogue) 2. Content (full-text from Delpher) 3. Context (from De Ondergrondse Pers, PDF) 4. Relations: newspaper → places, persons, other newspapers (DOP, PDF) 5. External resources about newspapers, places and persons ... but the data sources are unconnected (and for 3+4: unstructured & not machine-readable)
  55. 55. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg ... making discovery, understanding & research of these newspapers (and related places & persons) more difficult than necessary
  56. 56. We can solve all these issues! Good news!
  57. 57. Wikiproject(*) Verzetskranten Systematically and uniformly describe & interlink all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia tinyurl.com/verzetskranten (in Dutch) * https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject, https://en.wikipedia.org/wiki/Wikipedia:Wikiproject
  58. 58. Wikiproject(*) Verzetskranten Systematically and uniformly describe & interlink all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia tinyurl.com/verzetskranten (in Dutch) * https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject, https://en.wikipedia.org/wiki/Wikipedia:Wikiproject Reach a big audience: 80% of Dutch people use Wikipedia
  59. 59. From 14 → 1.300 titles https://nl.wikipedia.org/wiki/Categorie:Illegale_pers_in_de_Tweede_Wereldoorlog
  60. 60. Wikiproject(*) Verzetskranten Systematically and uniformly describe & interlink all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia tinyurl.com/verzetskranten (in Dutch) We need a database!
  61. 61. Build central database https://www.youtube.com/watch?v=GVDGuCjog_0
  62. 62. Build central database www.youtube.com/watch?v=GVDGuCjog_0 Step 1: Create Excel-sheet with • Metadata about newspaper (from DOP PDF) • Unique Wikipedia article title • Contextual info, incl. related persons & titles (from DOP PDF) • PPN: unique identifier linking newspaper to KB-catalogue & Delpher
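As a rough illustration of Step 1, the sheet can be thought of as one row per newspaper with the fields listed above. The sketch below reads a CSV export of such a sheet with the Python standard library. The column names, the `;` separator for related entries and the place value are assumptions made for this example; the actual layout of the project's Excel sheet is not shown in the slides.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical CSV export of the sheet. Column names are assumptions.
SHEET_CSV = """dop_id,title,place,ppn,related_dop_ids
199,De Geus onder studenten,Den Haag,107123223,106;360;748
"""


@dataclass
class Newspaper:
    dop_id: int             # entry number in De Ondergrondse Pers
    title: str
    place: str              # place of publication
    ppn: str                # kept as text so the identifier is never altered
    related_dop_ids: list   # entry numbers of related newspapers


def load_sheet(text):
    """Parse the CSV text into a list of Newspaper records."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        related = [int(x) for x in raw["related_dop_ids"].split(";") if x]
        rows.append(
            Newspaper(
                dop_id=int(raw["dop_id"]),
                title=raw["title"].strip(),
                place=raw["place"].strip(),
                ppn=raw["ppn"].strip(),
                related_dop_ids=related,
            )
        )
    return rows


if __name__ == "__main__":
    for paper in load_sheet(SHEET_CSV):
        print(paper)
```

Keeping the DOP entry number and the PPN in the same row is what later lets one record point to the book, the KB catalogue and Delpher at once.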
  63. 63. DOP-ID
  64. 64. Place of publication
  65. 65. Other metadata
  66. 66. Title
  67. 67. Unique Wikipedia article title = <Newspaper title> (verzetsblad, <Place of publication>)
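The naming pattern on this slide is simple enough to express as a small function. This is only a sketch of the stated pattern: the function names are made up for the example and "Den Haag" is an illustrative place value. Note that the article shown elsewhere in the slides is titled "De Geus onder studenten (verzetsblad)" without a place, so the project evidently applied further rules that are not reproduced here.

```python
from collections import Counter


def article_title(title: str, place: str) -> str:
    """Build '<Newspaper title> (verzetsblad, <Place of publication>)'."""
    return f"{title.strip()} (verzetsblad, {place.strip()})"


def find_title_clashes(rows):
    """Return generated titles that occur more than once.

    `rows` is an iterable of (title, place) pairs. A clash means the pattern
    alone does not yet yield a unique Wikipedia article title.
    """
    counts = Counter(article_title(title, place) for title, place in rows)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(article_title("De Geus onder studenten", "Den Haag"))
```

Running a clash check over all rows before generating any articles is one way to confirm that each title really is unique.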
  68. 68. • Contextual info • Related persons • Related newspaper titles
  69. 69. PPN (107123223) http://opc4.kb.nl/DB=1/PPN?PPN=107123223
  70. 70. http://opc4.kb.nl/DB=1/PPN?PPN=107123223 PPN (107123223)
  71. 71. PPN (107123223) http://www.delpher.nl/nl/kranten/results/index?coll=dddtitel&cql[]=ppn%3D107123223
  72. 72. http://www.delpher.nl/nl/kranten/results/index?coll=dddtitel&cql[]=ppn%3D107123223 PPN (107123223)
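Since the PPN is the only variable part of both addresses, the two links can be derived from it mechanically. The sketch below uses the URL patterns exactly as they appear on the slides (2017); whether these addresses still resolve today is not something the slides can confirm, and the function name is invented for the example.

```python
# URL patterns copied from the slides; only the PPN varies.
CATALOGUE_URL = "http://opc4.kb.nl/DB=1/PPN?PPN={ppn}"
DELPHER_URL = (
    "http://www.delpher.nl/nl/kranten/results/index"
    "?coll=dddtitel&cql[]=ppn%3D{ppn}"
)


def links_for_ppn(ppn: str) -> dict:
    """Return the KB catalogue and Delpher links for one PPN."""
    ppn = ppn.strip()
    if not ppn:
        raise ValueError("empty PPN")
    return {
        "catalogue": CATALOGUE_URL.format(ppn=ppn),
        "delpher": DELPHER_URL.format(ppn=ppn),
    }


if __name__ == "__main__":
    for name, url in links_for_ppn("107123223").items():
        print(name, url)
```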
  73. 73. Build central database Step 2: Convert Excel into RDF triplestore (= special kind of online database anybody can access) • Steps 1-4 from http://linda-project.eu/linked-data-primer-2 • Step 4: Vocabulary used = Bibframe (http://bibframe.org/vocab)
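To give an idea of what "converting a row into triples" means, here is a minimal sketch that serialises one sheet row as N-Triples lines using only the Python standard library. It is not the project's conversion: the `http://example.org/verzetskranten/` namespace is hypothetical, and Dublin Core terms (`title`, `identifier`) are used as a stand-in vocabulary because the slides name Bibframe but do not show which of its properties were used.

```python
BASE = "http://example.org/verzetskranten/"  # hypothetical namespace
DCT = "http://purl.org/dc/terms/"            # stand-in for the Bibframe vocabulary


def literal(value: str) -> str:
    """Serialise a plain string as an N-Triples literal with basic escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_ntriples(dop_id: int, title: str, ppn: str) -> list:
    """Turn one sheet row into a list of N-Triples statements."""
    subject = f"<{BASE}newspaper/{dop_id}>"
    return [
        f"{subject} <{DCT}title> {literal(title)} .",
        f"{subject} <{DCT}identifier> {literal(ppn)} .",
    ]


if __name__ == "__main__":
    for line in row_to_ntriples(199, "De Geus onder studenten", "107123223"):
        print(line)
```

Each output line is one statement, so a file of such lines can be loaded into a triplestore without further structure.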
  74. 74. Build central database Step 3: Link to external resources • Step 5 from http://linda-project.eu/linked-data-primer-2 • DBpedia = machine-readable, structured version of Wikipedia • DBpedia = hub for linking different data sets on the Web to each other → Linked Open Data cloud • Connect persons & places in newspaper database to external resources via DBpedia
  75. 75. Step 3: Link to external resources • Step 5 from http://linda-project.eu/linked-data-primer-2 • DBpedia = machine-readable, structured version of Wikipedia DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data • Connect persons & places in newspaper database to external resources via DBpedia http://lod-cloud.net/versions/2010-09-22/lod-cloud_colored.png Linked Open Data cloud
  76. 76. Build central database Step 3: Link to external resources • Step 5 from http://linda-project.eu/linked-data-primer-2 • DBpedia = machine-readable, structured version of Wikipedia • DBpedia = hub for linking different data sets on the Web to each other → Linked Open Data cloud • We use DBpedia to connect persons & places in our newspaper database to information in other databases
  77. 77. http://nl.dbpedia.org/page/Huib_Drion https://nl.wikipedia.org/wiki/Huib_Drion http://www.dbnl.org/auteurs/auteur.php?id=drio001 http://www.biografischportaal.nl/persoon/41181342
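Linking a person in the newspaper database to DBpedia comes down to stating that the local record and the DBpedia resource describe the same thing. A minimal sketch, assuming the usual DBpedia convention that the `/page/` address shown above is the human-readable view of a `/resource/` identifier: the local person URI is hypothetical, and deriving the resource name directly from a Wikipedia article title is a simplification that ignores disambiguation.

```python
from urllib.parse import quote

NL_DBPEDIA = "http://nl.dbpedia.org/resource/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def dbpedia_uri(wikipedia_title: str) -> str:
    """Derive a Dutch DBpedia resource URI from a Wikipedia article title.

    Spaces become underscores, as in Wikipedia URLs. Parentheses are left
    unescaped, which is an assumption about how the target URIs are written.
    """
    name = wikipedia_title.strip().replace(" ", "_")
    return NL_DBPEDIA + quote(name, safe="()")


def same_as_triple(local_uri: str, wikipedia_title: str) -> str:
    """State, as an N-Triples line, that a local record equals a DBpedia resource."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{dbpedia_uri(wikipedia_title)}> ."


if __name__ == "__main__":
    # The local URI is a made-up example identifier.
    print(same_as_triple("http://example.org/verzetskranten/person/1", "Huib Drion"))
```

Once such a link exists, anything other data sets say about the DBpedia resource becomes reachable from the newspaper record.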
  78. 78. Build central database Added value of Linked Open Data & DBpedia Software can automatically query for additional information about places and persons mentioned in DOP that is not available in • KB catalogue • Delpher • DOP itself
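What "software can automatically query" looks like in practice is a SPARQL query sent to a public endpoint. The sketch below only builds the query text and the request URL; it does not contact any server. The endpoint address and the assumption that the Dutch DBpedia exposes `dbo:birthDate` and `dbo:birthPlace` for a person are both unverified here and should be checked against the live service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://nl.dbpedia.org/sparql"  # assumed endpoint address


def person_query(resource_uri: str) -> str:
    """Build a SPARQL query asking for a person's birth date and birth place."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?birthDate ?birthPlace WHERE {\n"
        f"  OPTIONAL {{ <{resource_uri}> dbo:birthDate ?birthDate }}\n"
        f"  OPTIONAL {{ <{resource_uri}> dbo:birthPlace ?birthPlace }}\n"
        "}"
    )


def request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as the `query` parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    q = person_query("http://nl.dbpedia.org/resource/Huib_Drion")
    url = request_url(q)
    print(url)
    # Decoding the URL again yields the original query text.
    print(parse_qs(urlsplit(url).query)["query"][0] == q)
```

The same pattern, with a different query, works for places or for any other resource that has been linked to DBpedia.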
  79. 79. Summary: data about 1.300 newspapers • Available online • Structured data (RDF-triples) • Open license (CC-BY-SA) • Open standard (RDF) • Contextual information • Links between newspaper ↔ Delpher & KB-cat (via PPNs) • Relations: Newspaper → Places, Newspaper → Persons, Newspaper → Other newspapers (PPNs) • Links between newspapers, places & persons ↔ external sources (via DBpedia)
  82. 82. http://5stardata.info/en/
  83. 83. Wikiproject Verzetskranten Systematically and uniformly describe & link all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia We need a template!
  84. 84. Using an article template we can generate 1.300 uniform Wikipedia article stubs from the LOD triple store https://c1.staticflickr.com/9/8281/7699231918_11a7356c38_b.jpg
  85. 85. LOD database + article template = Wikipedia article stub
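The formula on this slide can be illustrated with a few lines of templating. The project itself uses a Twig template (linked a few slides further on); the sketch below swaps in Python's `string.Template` purely for illustration. The English stub wording, the field names and the place value are invented for this example and are not the project's actual fixed strings.

```python
from string import Template

# Fixed wording lives in the template; $fields are filled in per newspaper.
# The wording and field names are hypothetical.
STUB_TEMPLATE = Template(
    "'''$title''' was an underground newspaper from the Second World War, "
    "published in $place.\n"
    "\n"
    "It is entry $dop_id in ''De Ondergrondse Pers 1940-1945''.\n"
    "\n"
    "== External links ==\n"
    "* [$catalogue_url KB catalogue record]\n"
)


def render_stub(record: dict) -> str:
    """Fill the template with one record. A missing field raises KeyError."""
    return STUB_TEMPLATE.substitute(record)


if __name__ == "__main__":
    print(
        render_stub(
            {
                "title": "De Geus onder studenten",
                "place": "Den Haag",
                "dop_id": 199,
                "catalogue_url": "http://opc4.kb.nl/DB=1/PPN?PPN=107123223",
            }
        )
    )
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a record with a missing field fails loudly instead of producing a stub with a stray placeholder.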
  86. 86. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  87. 87. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Titles → Get from triple store using a SPARQL query
  88. 88. Metadata → Get from triple store using SPARQL query
  89. 89. Metadata → Get from triple store using SPARQL query https://github.com/ookgezellig/verzetskranten/blob/master/var/sparql/DOP_Blad_met_Details.rq
  90. 90. Related persons → Get from triple store
  91. 91. Related newspapers → Get from triple store
  92. 92. ID of newspaper in DOP (199) → Get from triple store
  93. 93. Link to full-text editions in Delpher → Get from triple store
  94. 94. Link to KB catalogue record → Get from triple store
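Each of the "get from triple store" steps above ends with query results that have to be turned into values for the article. A minimal sketch of that last step, assuming the endpoint returns the standard SPARQL JSON results format: the sample document, its variable names and its place value are made up for illustration and are not output from the project's triple store.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (illustrative data).
SAMPLE_RESULTS = """
{
  "head": {"vars": ["title", "place", "ppn"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "De Geus onder studenten"},
     "place": {"type": "literal", "value": "Den Haag"},
     "ppn":   {"type": "literal", "value": "107123223"}}
  ]}
}
"""


def bindings_to_records(results_json: str) -> list:
    """Flatten SPARQL JSON results into one plain dict per result row."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    records = []
    for binding in doc["results"]["bindings"]:
        # A variable that is unbound in a row is simply absent from it.
        records.append(
            {v: binding[v]["value"] if v in binding else None for v in variables}
        )
    return records


if __name__ == "__main__":
    print(bindings_to_records(SAMPLE_RESULTS))
```

The resulting dicts have exactly the shape a template engine needs, so the query step and the rendering step stay independent of each other.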
  95. 95. These snippets/labels are identical for all 1.300 newspapers → Hard coded in template → https://github.com/ookgezellig/verzetskranten/blob/master/app/Resources/views/default/wiki.html.twig
  96. 96. Grey = Wikipedia article stub • From triple store (using SPARQL) • Hard coded fixed strings in template
  97. 97. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Non-grey = Wikipedia article stub • From triple store (using SPARQL) • Hard coded fixed strings in template
  98. 98. Overview of 1300 article stubs @WP:NL https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject/Verzetskranten/Beginnetjes
  102. 102. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  103. 103. This bit was added manually to expand stub into full article → Crowdsourcing by Dutch Wikipedia community https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  104. 104. A group of Wikipedia volunteers is currently working to expand the 1.300 stubs… gradually creating more and more full articles. By Sebastiaan ter Burg [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
  105. 105. Before the project (2015) Overview of full articles on illegal newspapers @WP:NL https://nl.wikipedia.org/wiki/Categorie:Nederlandse_illegale_pers_in_de_Tweede_Wereldoorlog
  106. 106. The number of articles is growing steadily… Overview of full articles on illegal newspapers @WP:NL https://nl.wikipedia.org/wiki/Categorie:Nederlandse_illegale_pers_in_de_Tweede_Wereldoorlog
  107. 107. … making many Dutch people happy! http://www.formerdays.com/2011/05/dutch-liberation.html
  108. 108. Thanks! olaf.janssen@kb.nl - @ookgezellig tinyurl.com/verzetskranten
  109. 109. Suggested reading • http://www.ted.com/talks/tim_berners_lee_on_the_next_web 20 years ago, Tim Berners-Lee invented the World Wide Web. For his next project, he's building a web for open, linked data that could do for numbers what the Web did for words, pictures, video: unlock our data and reframe the way we use it together. • https://en.wikipedia.org/wiki/Linked_data Wikipedia article related to the above video • http://5stardata.info/en/ The 5 stars of Linked Open Data (Tim Berners-Lee) • http://linda-project.eu/linked-data-primer-2/ Short primer about creating LOD in practice, starting from an Excel sheet • http://www.programmableweb.com/news/how-linked-data-solved-digital-age-marketing-problem/analysis/2015/08/31 The figure near the bottom of the first page is a good illustration of the concept of (linked) triples • https://en.wikipedia.org/wiki/DBpedia • https://en.wikipedia.org/wiki/Semantic_network http://www.gettyimages.nl/detail/nieuwsfoto's/three-women-of-the-ats-light-up-together-ats-regulations-nieuwsfotos/3094265
