Linked Open Data case study (illegal newspapers WW2, Wikipedia, DBpedia) - Lecture Leiden University 3-3-2016

A practical case study of how to create Linked Open Data for 1.300 Dutch underground newspapers from World War 2 using Wikipedia, DBpedia and an old paper book.

Lecture given by Olaf Janssen - Wikimedia & Open Data coordinator for the National Library of the Netherlands (KB) - for students of the master's course "Digital Access to Cultural Heritage" at Leiden University on 3-3-2016

Published in: Education

  1. 1. LOD case study: WW2 underground newspapers on Wikipedia Digital Access to Cultural Heritage, Leiden University, 3-3-2016 Olaf Janssen (Koninklijke Bibliotheek) olaf.janssen@kb.nl - @ookgezellig
  2. 2. What I hope you’ll learn today 1. How to give a new life to an old paper book 2. How to get 1.300 newspapers from WW2 on Wikipedia While doing 1 and 2: 3. The advantages of linked open data (= downsides of unconnected data sources) Olaf Janssen Wikipedia & Open data coordinator National library of the Netherlands
  3. 3. http://www.4en5meiamsterdam.nl/attachment/47454
  4. 4. During WW2, ± 1.300 Dutch underground newspaper titles were issued in NL, in every shape & form… http://www.4en5meiamsterdam.nl/attachment/47454
  5. 5. http://resolver.kb.nl/resolve?urn=ddd:010436323 http://resolver.kb.nl/resolve?urn=ddd:010442948 http://resolver.kb.nl/resolve?urn=ddd:010447825 http://resolver.kb.nl/resolve?urn=ddd:010450508 From well-known big titles (including Parool, Vrij Nederland, Trouw, de Waarheid)
  6. 6. To very small, home-made, pamphlet-like issues
  7. 7. After the war, many titles were (physically) preserved at the NIOD, the national Institute for War, Holocaust and Genocide Studies in Amsterdam By Romaine - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=37072767
  8. 8. https://commons.wikimedia.org/wiki/File:Verzetskrant_in_archiefdozen_bij_het_NIOD.jpg – CC-BY-SA - OlafJanssen
  9. 9. By Romaine - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=37072734
  10. 10. These 1.300 newspapers have been described in the KB-catalogue… Bibliographic metadata Like this one, De Geus onder studenten
  11. 11. PPN = unique ID of this title in KB-catalogue http://opc4.kb.nl/DB=1/PPN?PPN=107123223
  12. 12. These newspapers were digitised page by page… resulting in …
  13. 13. ... full texts in Delpher http://resolver.kb.nl/resolve?urn=ddd:010424553:mpeg21:p001 • Scans • Full-text OCR
  14. 14. Again, De Geus onder studenten
  15. 15. http://www.delpher.nl/nl/kranten/results?coll=dddtitel&cql[]=ppn+any+(107123223) PPN = unique ID of this title in Delpher (same as for KB-catalogue) Again, De Geus onder studenten
  16. 16. On Delpher you can read and (word)search this title...
  17. 17. Say I want to know more about this newspaper • What sort/style of underground paper was De Geus? • What is the history of this newspaper? • Who worked on it? • Where was this newspaper printed? • How was De Geus distributed and financed? • Were there any relations with other illegal newspapers or resistance groups? • Etc… Under “Details” perhaps?
  18. 18. OK ok, some metadata… .. but I want to know móóór….
  19. 19. Maybe in the catalogue record??
  20. 20. Problem with Delpher (and KB-catalogue) Véry little contextual information about the newspaper(title)s https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg
  21. 21. Question: Where would most people start searching for contextual information about De Geus onder studenten? Probably Wikipedia! (via Google)
  22. 22. http://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Question: Where would most people start searching for contextual information about De Geus onder studenten? Probably Wikipedia! (via Google)
  23. 23. https://www.youtube.com/watch?v=VREJV--VHSw
  24. 24. Report on interest in WW2 among Dutch population http://www.oorlogsbronnen.nl/gebruikersonderzoek2015, May 2015
  25. 25. Many of us use the internet to search for information [..]. We often mention Wikipedia…
  26. 26. Everything is of course on Wikipedia. Just type in a name and you can read entire essays... (man 70s)
  27. 27. Over half of us think that Wikipedia and Google contribute to our knowledge and understanding of history When we have to find information about WW2 outside the class setting, we fully concentrate on digital resources like Google and Wikipedia. (school kids)
  28. 28. Context about De Geus onder studenten (http://www.delpher.nl/nl/kranten/results?coll=dddtitel&cql[]=ppn+any+(107123223)) is given in http://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  29. 29. But now… We have another problem…
  30. 30. https://nl.wikipedia.org/wiki/Categorie:Illegale_pers_in_de_Tweede_Wereldoorlog … De Geus on Wikipedia is an exception 1. Very few underground newspapers have their own WP articles 2. The overview of these newspapers on WP is far from complete
  31. 31. Good news! http://www.archives.gov/research/military/ww2/photos/images/ww2-194.jpg There is a 1-fix solution to this contextual problem!
  32. 32. De Ondergrondse Pers 1940-1945 By Lydia E. Winkel & H. de Vries 1989, ISBN 9021837463 Veen Uitgevers This book (“De Winkel”) contains contextual articles about (nearly) all ± 1.300 illegal WW2 newspapers
  33. 33. “De Winkel” – nr. 199 De Ondergrondse Pers 1940-1945 , Lydia E. Winkel, H. de Vries , 1989, ISBN 9021837463, Veen Uitgevers
  34. 34. Every article has a unique ID (“Winkel-ID”)
  35. 35. Every article has metadata • Title, subtitle, motto • Place of publication • Period of publication • Publication frequency (daily, weekly, one-off, irregular) • Multiplication (stenciled, printed, typed, handwritten) • Contents (news, opinions, poems, illustrations, humor) • Number of prints (min – max)
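The metadata fields listed on this slide map naturally onto a small record type. The sketch below is only an illustration: the class name, field names and the placeholder Winkel-ID are assumptions, and the sample values are taken from the Wikipedia snippet about De Geus onder studenten quoted later in this deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WinkelRecord:
    """One article from De Winkel. Field names are illustrative, not the project's."""
    winkel_id: int
    title: str
    place: str
    period: str                # free text, e.g. "1940-1944"
    frequency: str             # daily, weekly, one-off, irregular
    multiplication: str        # stenciled, printed, typed, handwritten
    contents: List[str] = field(default_factory=list)
    prints_min: Optional[int] = None
    prints_max: Optional[int] = None
    subtitle: Optional[str] = None
    motto: Optional[str] = None


record = WinkelRecord(
    winkel_id=1,               # placeholder, not the real Winkel-ID
    title="De Geus onder studenten",
    place="Den Haag",
    period="1940-1944",
    frequency="monthly / irregular",
    multiplication="stenciled, later printed",
    contents=["opinions"],
    prints_min=250,
    prints_max=8000,
)
print(record.title, record.place, record.prints_min, record.prints_max)
```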
  36. 36. Relation 1/3 Newspaper ↔ Placename → semantics, linked data
  37. 37. Relation 1/3 Newspaper ↔ Placename → semantics, linked data
  38. 38. Contextual information → Nice material for a Wikipedia article
  39. 39. Very often persons related to this newspaper are mentioned
  40. 40. Relation 2/3 Newspaper ↔ Persons → semantics, linked data
  41. 41. Many articles also contain references to other newspapers • 106 = Cereales Vadeness (students resistance newspaper, Wageningen) • 360 = Leidsche Brief (students resistance newspaper, Leiden) • 748 = Sol Justitiae (students resistance newspaper, Utrecht)
  42. 42. Relation 3/3 Newspaper ↔ Other newspapers → semantics, linked data
  43. 43. https://knowledgeutopia.files.wordpress.com/2014/01/hollandhouselibraryblitz1940.jpg Too bad it’s a paper book → hard to find, multiply, distribute and build upon
  44. 44. We need it digital!! https://knowledgeutopia.files.wordpress.com/2014/01/hollandhouselibraryblitz1940.jpg
  45. 45. We need it digital!! 1. Clear copyright with copyright holder (NIOD) → Open CC-BY-SA license 2. Scan & OCR 3. Convert into PDF 4. Put online: NIOD site & Wikimedia Commons
  46. 46. De Winkel as PDF on NIOD website (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945
  47. 47. De Winkel as PDF on Wikimedia Commons https://commons.wikimedia.org/wiki/File:PDF_of_De_Ondergrondse_Pers_1940-1945_-_derde_druk_-_1989.pdf
  48. 48. De Winkel as PDF on NIOD website (CC-BY-SA) http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 Saved us €13.330! http://www.brill.com/dutch-underground-press-1940-1945
  49. 49. Wikipedia article about De Winkel http://nl.wikipedia.org/wiki/De_ondergrondse_pers_1940-1945
  50. 50. Wikipedia article about the author http://nl.wikipedia.org/wiki/Lydia_Winkel
  51. 51. Winkel, the plusses • Available online (PDF, flat file) • Open license (CC-BY-SA) • Contextual information • Relations: Titles ↔ Places, Titles ↔ Persons, Titles ↔ Other titles http://www.archives.gov/research/military/ww2/photos/images/ww2-194.jpg
  52. 52. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg Winkel, the minuses • Unstructured data (PDF, flat file) → Not very machine-readable (unlike CSV, XML, JSON, RDF) • PDF is no (real) open standard (unlike CSV, XML, JSON, RDF) • No links between titles ↔ Delpher & KB-cat • No links between titles, places & persons ↔ external sources (like Wikipedia)
  53. 53. To summarize: a lot of information is available about these WW2 underground newspapers 1. Metadata (KB-cat) 2. Content (full-text, Delpher) 3. Context (Winkel, PDF) 4. Relations: titles ↔ places, persons, other titles (Winkel, PDF) 5. External resources about titles, places and persons ... but the data sources are unconnected (and for 3+4 unstructured & not machine-readable)
  54. 54. http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg ... making discovery, understanding & research of these newspapers (and related places & persons) more difficult than necessary
  55. 55. We can solve all these issues! Good news!
  56. 56. Wikiproject(*) Verzetskranten Systematically and uniformly describe & link all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia tinyurl.com/verzetskranten (in Dutch) * https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject, https://en.wikipedia.org/wiki/Wikipedia:Wikiproject
  57. 57. From 14 → 1.300 titles https://nl.wikipedia.org/wiki/Categorie:Illegale_pers_in_de_Tweede_Wereldoorlog
  58. 58. Wikiproject(*) Verzetskranten Systematically and uniformly describe & link all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia tinyurl.com/verzetskranten (in Dutch) We need a database!
  59. 59. Build central database https://www.youtube.com/watch?v=GVDGuCjog_0
  60. 60. Build central database www.youtube.com/watch?v=GVDGuCjog_0 Step 1: Create Excel-sheet with • Metadata about newspaper (from Winkel PDF) • Unique Wikipedia article title • Contextual info, incl. related persons & titles (from Winkel PDF) • PPN: unique ID linking newspaper to KB-catalogue & Delpher
  61. 61. Winkel-ID
  62. 62. Place of publication
  63. 63. Other metadata
  64. 64. Title
  65. 65. Unique Wikipedia article title = <Newspaper title> (verzetsblad, <Place of publication>)
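The naming pattern on this slide can be sketched as a small helper. The function names are hypothetical. Because the De Geus article linked in this deck has no place name in its title, the place is treated as optional here; that is an assumption about how the pattern was applied, not something the slides state.

```python
from typing import Optional
from urllib.parse import quote


def article_title(newspaper_title: str, place: Optional[str] = None) -> str:
    """Build '<Newspaper title> (verzetsblad, <Place of publication>)'.

    The place is optional (assumption): without it the qualifier is just 'verzetsblad'.
    """
    qualifier = f"verzetsblad, {place}" if place else "verzetsblad"
    return f"{newspaper_title} ({qualifier})"


def article_url(title: str) -> str:
    """Turn an article title into a Dutch Wikipedia URL (spaces become underscores)."""
    return "https://nl.wikipedia.org/wiki/" + quote(title.replace(" ", "_"), safe="(),_")


print(article_title("Leidsche Brief", "Leiden"))
print(article_url(article_title("De Geus onder studenten")))
```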
  66. 66. • Contextual info • Related persons • Related newspaper titles
  67. 67. PPN (107123223) http://opc4.kb.nl/DB=1/PPN?PPN=107123223
  68. 68. http://opc4.kb.nl/DB=1/PPN?PPN=107123223 PPN (107123223)
  69. 69. PPN (107123223) http://www.delpher.nl/nl/kranten/results/index?coll=dddtitel&cql[]=ppn%3D107123223
  70. 70. http://www.delpher.nl/nl/kranten/results/index?coll=dddtitel&cql[]=ppn%3D107123223 PPN (107123223)
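Because the PPN is the shared key, both links can be derived from it mechanically. The sketch below only rebuilds the two URL patterns shown on these slides; the function names are hypothetical, and whether the project generated its links this way is an assumption.

```python
from urllib.parse import quote

# URL patterns copied from the slides; {ppn} and {query} are filled in below.
KB_CATALOGUE = "http://opc4.kb.nl/DB=1/PPN?PPN={ppn}"
DELPHER = "http://www.delpher.nl/nl/kranten/results/index?coll=dddtitel&cql[]={query}"


def catalogue_url(ppn: str) -> str:
    """Link to the KB-catalogue record for this PPN."""
    return KB_CATALOGUE.format(ppn=ppn)


def delpher_url(ppn: str) -> str:
    """Link to the Delpher search results for this PPN ('=' is percent-encoded)."""
    return DELPHER.format(query=quote(f"ppn={ppn}", safe=""))


print(catalogue_url("107123223"))
print(delpher_url("107123223"))
```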
  71. 71. Build central database Step 2: Convert Excel into RDF triplestore (=special kind of online database anybody can access) • Steps 1-4 from http://linda-project.eu/linked-data-primer-2 • Step 4: Vocabulary used = Bibframe (http://bibframe.org/vocab)
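To make the idea of turning spreadsheet rows into triples concrete, here is a minimal sketch that reads CSV rows and prints N-Triples. It is not the project's conversion: the `example.org` namespace, the column names and the Winkel-ID value are placeholders, and the predicates are Dublin Core elements chosen for illustration rather than the Bibframe terms the project used.

```python
import csv
import io

BASE = "http://example.org/verzetskranten/"   # hypothetical namespace
DC = "http://purl.org/dc/elements/1.1/"       # Dublin Core, swapped in for Bibframe

# Stand-in for the Excel sheet; winkel_id 1 is a placeholder value.
SHEET = """winkel_id,title,place,ppn
1,De Geus onder studenten,Den Haag,107123223
"""


def literal(value: str) -> str:
    """Quote a plain literal, escaping backslashes and double quotes only."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(row: dict) -> list:
    """One subject per newspaper, one triple per filled-in column."""
    subject = f"<{BASE}{row['winkel_id']}>"
    return [
        f"{subject} <{DC}title> {literal(row['title'])} .",
        f"{subject} <{DC}coverage> {literal(row['place'])} .",
        f"{subject} <{DC}identifier> {literal(row['ppn'])} .",
    ]


triples = []
for row in csv.DictReader(io.StringIO(SHEET)):
    triples.extend(row_to_triples(row))
print("\n".join(triples))
```

Each output line is one subject, predicate, object statement, which is the shape a triplestore loads.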
  72. 72. Build central database Step 3: Link to external resources • Step 5 from http://linda-project.eu/linked-data-primer-2 • DBpedia = machine-readable, structured version of Wikipedia • DBpedia = hub for linking different data sets on the Web to each other → Linked Open Data cloud • Connect persons & places in newspaper database to external resources via DBpedia
  73. 73. Step 3: Link to external resources • Step 5 from http://linda-project.eu/linked-data-primer-2 • DBpedia = machine-readable, structured version of Wikipedia. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data • Connect persons & places in newspaper database to external resources via DBpedia http://lod-cloud.net/versions/2010-09-22/lod-cloud_colored.png Linked Open Data cloud
  74. 74. Build central database Step 3: Link to external resources • Step 5 from http://linda-project.eu/linked-data-primer-2 • DBpedia = machine-readable, structured version of Wikipedia • DBpedia = hub for linking different data sets on the Web to each other → Linked Open Data cloud • We use DBpedia to connect persons & places in our newspaper database to information in other databases
  75. 75. http://nl.dbpedia.org/page/Huib_Drion https://nl.wikipedia.org/wiki/Huib_Drion http://www.dbnl.org/auteurs/auteur.php?id=drio001 http://www.biografischportaal.nl/persoon/41181342
  76. 76. Build central database Added value of Linked Open Data & DBpedia Software can automatically query for additional information about places and persons mentioned in De Winkel that is not available in • KB-catalogue • Delpher • De Winkel
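What "software can automatically query" might look like in practice is sketched below. The code only builds a SPARQL query and a request URL for one person and does not contact any server. Several things are assumptions: the endpoint address `http://nl.dbpedia.org/sparql`, the `/resource/` form of the URI (the slides show the `/page/` form), and the `format` parameter, which is a convention of the server software DBpedia typically runs on rather than part of SPARQL itself.

```python
from urllib.parse import urlencode

ENDPOINT = "http://nl.dbpedia.org/sparql"   # assumed endpoint address


def describe_query(resource_uri: str, limit: int = 50) -> str:
    """SPARQL query listing every property and value known for one resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_uri}> ?property ?value "
        "} "
        f"LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """GET URL for the query; the 'format' parameter is an assumption."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


query = describe_query("http://nl.dbpedia.org/resource/Huib_Drion")
print(query)
print(request_url(query))
```

Fetching that URL and following the returned links (for example to DBNL or Biografisch Portaal, as on the Huib Drion slide) is the step that pulls in information not present in the catalogue, Delpher or De Winkel.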
  77. 77. Summary: data about 1.300 newspapers • Available online • Structured data (RDF-triples) • Open license (CC-BY-SA) • Open standard (RDF) • Contextual information • Relations: Titles ↔ Places, Titles ↔ Persons, Titles ↔ Other titles • Links between titles ↔ Delpher & KB-cat • Links between titles, places & persons ↔ external sources (via DBpedia)
  79. 79. Summary: data about 1.300 newspapers • Available online • Structured data (RDF-triples) • Open license (CC-BY-SA) • Open standard (RDF) • Contextual information • Relations: Titles ↔ Places, Titles ↔ Persons, Titles ↔ Other titles (PPNs) • Links between titles ↔ Delpher & KB-cat (via PPNs) • Links between places & persons ↔ external sources (via DBpedia)
  80. 80. http://5stardata.info/en/
  81. 81. Wikiproject Verzetskranten Systematically and uniformly describe & link all 1.300 Dutch underground newspapers from WW2 on Dutch Wikipedia We need a template!
  82. 82. Using an article template we can generate 1.300 uniform and interlinked Wikipedia articles from the LOD-database https://c1.staticflickr.com/9/8281/7699231918_11a7356c38_b.jpg
  83. 83. LOD-database + article template = Wikipedia article
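The equation on this slide can be illustrated with a tiny template fill. This is a sketch, not the project's actual article template: the sentence is modelled on the opening line of the De Geus article quoted later in this deck, and the field names in the record are hypothetical.

```python
from string import Template

# Fixed string with placeholders; ''' and [[ ]] are wiki markup for bold and links.
INTRO = Template(
    "'''$title''' was een verzetsblad uit de Tweede Wereldoorlog, "
    "dat vanaf $start tot en met $end in [[$place]] werd uitgegeven."
)

# One row from the database (field names are illustrative).
row = {
    "title": "De Geus onder studenten",
    "start": "4 oktober 1940",
    "end": "13 juli 1944",
    "place": "Den Haag",
}

rendered = INTRO.substitute(row)
print(rendered)
```

Because the fixed wording lives in the template and only the placeholders vary, every generated article starts with the same sentence structure.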
  84. 84. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)
  85. 85. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad) Titles → from database
  86. 86. Metadata → from database
  87. 87. Related persons → from database
  88. 88. Related newspapers → from database
  89. 89. Winkel-ID → from database
  90. 90. Link to full-texts in Delpher → from database
  91. 91. Link to KB-catalogue record → from database
  92. 92. Fixed categories on WP and Commons
  93. 93. Predefined fixed strings
  94. 94. Grey = • From database • Predefined fixed strings → Uniformity between articles guaranteed!
  95. 95. All that WP-writers need to add manually
  96. 96. Problem with Delpher (and KB-catalogue) Véry little contextual information about the newspaper(title)s https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg
  97. 97. Problem with Delpher (and KB-catalogue) Véry little contextual information about the newspaper(title)s https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg The KB can re-use (embed) the Wikipedia content in its own websites to tackle this problem
  98. 98. https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_%28verzetsblad%29
  99. 99. Delpher - search results.. http://www.delpher.nl/nl/kranten/results?coll=dddtitel&cql[]=ppn+any+(107123223)
  100. 100. De geus; (onder studenten) was een verzetsblad uit de Tweede Wereldoorlog, dat vanaf 4 oktober 1940 tot en met 13 juli 1944 …. Lees verder op Wikipedia Embedded contextual snippet from Wikipedia http://www.delpher.nl/nl/kranten/results?coll=dddtitel&cql[]=ppn+any+(107123223) Delpher - search results.. De geus; (onder studenten) was een verzetsblad uit de Tweede Wereldoorlog, dat vanaf 4 oktober 1940 tot en met 13 juli 1944 …. Lees verder op Wikipedia
  101. 101. Delpher - object presentation http://resolver.kb.nl/resolve?urn=ddd:010424553:mpeg21:p001
  102. 102. Over De Geus onder studenten De geus; (onder studenten) was een verzetsblad uit de Tweede Wereldoorlog, dat vanaf 4 oktober 1940 tot en met 13 juli 1944 in Den Haag werd uitgegeven. Het blad verscheen in 1940, 1941 en 1943 maandelijks, verder onregelmatig in een oplage tussen de 250 en 8000 exemplaren. Het werd aanvankelijk gestencild, en vanaf november 1942 gedrukt en de inhoud bestond voornamelijk uit opinie-artikelen. Het blad werd uitgegeven door Jan Drion en Huib Drion, twee Leidse… Lees verder op Wikipedia Embedded contextual snippet from Wikipedia http://resolver.kb.nl/resolve?urn=ddd:010424553:mpeg21:p001 Delpher - object presentation
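One way such an embedded snippet could be produced is sketched below: build a request for the article's plain-text introduction and cut it to a teaser ending in "Lees verder op Wikipedia". The function names are hypothetical, no request is actually sent, and the use of the MediaWiki API's `extracts` property is an assumption about how the KB might implement the embedding, not something the slides describe.

```python
from urllib.parse import urlencode

API = "https://nl.wikipedia.org/w/api.php"


def extract_url(title: str) -> str:
    """URL asking the MediaWiki API for the plain-text intro of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def snippet(text: str, max_chars: int = 120, more: str = "Lees verder op Wikipedia") -> str:
    """Cut the text at the last word boundary within max_chars and add a read-more label."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars].rsplit(" ", 1)[0]
    return f"{cut}… {more}"


intro = (
    "De geus; (onder studenten) was een verzetsblad uit de Tweede Wereldoorlog, "
    "dat vanaf 4 oktober 1940 tot en met 13 juli 1944 in Den Haag werd uitgegeven."
)
print(extract_url("De Geus onder studenten (verzetsblad)"))
print(snippet(intro))
```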
  103. 103. Suggested reading • http://www.ted.com/talks/tim_berners_lee_on_the_next_web 20 years ago, Tim Berners-Lee invented the World Wide Web. For his next project, he's building a web for open, linked data that could do for numbers what the Web did for words, pictures, video: unlock our data and reframe the way we use it together. • https://en.wikipedia.org/wiki/Linked_data Wikipedia article related to the above video • http://5stardata.info/en/ The 5 stars of Linked Open Data (Tim Berners-Lee) • http://linda-project.eu/linked-data-primer-2/ Short primer about creating LOD in practice, starting from an Excel sheet • http://www.programmableweb.com/news/how-linked-data-solved-digital-age-marketing-problem/analysis/2015/08/31 The figure near the bottom of the first page is a good illustration of the concept of (linked) triples • https://en.wikipedia.org/wiki/DBpedia • https://en.wikipedia.org/wiki/Semantic_network http://www.gettyimages.nl/detail/nieuwsfoto's/three-women-of-the-ats-light-up-together-ats-regulations-nieuwsfotos/3094265
  104. 104. Questions? olaf.janssen@kb.nl - @ookgezellig tinyurl.com/verzetskranten http://www.gettyimages.nl/detail/nieuwsfoto's/three-women-of-the-ats-light-up-together-ats-regulations-nieuwsfotos/3094265
