
Building a knowledge graph of the Belgian War Press

Presentation by Brecht Van de Vyvere at Open Belgium 2017.

Published in: Data & Analytics

  1. 1. “Let’s talk Linked Data” session, Open Belgium 2017 • Brecht Van de Vyvere | @brechtvdv • Building a knowledge graph of the Belgian War Press
  2. 2. Can I easily link historic papers with other datasources?
  3. 3. Agenda • hetarchief.be • Knowledge graph • 5-star Open Data plan • Adding context • Linked Data as a Service • Future Work
  4. 4. Dataset
  5. 5. hetarchief.be “News from the Great War” • Newspapers 1914-1918 • 10+ content partners • Site launched in early 2015 • Functionality: search by keyword, map with place of publication, collections • Size: 1k titles, 55k newspapers, 300k pages
  6. 6. Human-readable interface
  7. 7. Policy 1. Metadata • No restrictions → CC0 2. OCR, documents • Pictures, short stories… • Uncertain copyright status • No license or “terms of use” that minimises restrictions for re-use • Disclaimer
  8. 8. hetarchief.be • One of the biggest databases online • No raw data? • Title • Description → OCR from ALTO • Date created • Owner • IDs (carrier, Abraham, VIAA) • URL image
  9. 9. 5-star Open Data plan
  10. 10. First 3 stars • Open license • Structured • Non-proprietary • Diagram labels: VIAA DB (IDs), VIAA API (metadata), Node.js transform → CSV • Code: github.com/viaacode/hetarchief2lod
  11. 11. Step 4: URIs for everything • Map VIAA's internal ID to a URI: http://data.viaa.be/noid/{id} • Use ontologies • BBC → Creative Work Ontology • schema.org • Hydra → collections
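The ID-to-URI mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code from hetarchief2lod; the function name `mint_uri` and the choice to percent-encode the identifier are assumptions.

```python
from urllib.parse import quote

# URI template taken from the slide: http://data.viaa.be/noid/{id}
BASE_URI = "http://data.viaa.be/noid/"


def mint_uri(internal_id: str) -> str:
    """Map an internal identifier to a URI (hypothetical helper)."""
    if not internal_id:
        raise ValueError("internal_id must be non-empty")
    # Percent-encode characters that are not safe in a URI path segment.
    return BASE_URI + quote(internal_id, safe="")


print(mint_uri("tm71v5c76q_191510XX"))
```

Encoding the identifier is a defensive choice; the identifiers shown in the slides only contain letters, digits and underscores, so they pass through unchanged.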
  12. 12. Knowledge graph • Semantic network • Concepts • Relations • Linked Data • URIs • RDF
  13. 13. <http://data.viaa.be/noid/example> <http://www.bbc.co.uk/ontologies/creativework#tag> <http://dbpedia.org/page/Albert_I_of_Belgium> • <http://dbpedia.org/page/Albert_I_of_Belgium> rdf:type <http://xmlns.com/foaf/0.1/Person>
  14. 14. 5-star: link to other sources • ABRAHAM: catalogue of newspapers in Belgium • <http://data.viaa.be/noid/tm71v5c76q_191510XX> owl:sameAs <http://anet.be/record/abraham/opacbnc/c:bnc:26>
  15. 15. Basic information triples • Subject: http://data.viaa.be/noid/tm71v5c76q_191510XX • rdf:type → cwork:CreativeWork • cwork:title → “L’illustration” • cwork:dateCreated → “1915-10-XX” • cwork:content → “On dit que c'est notre imagination et….” • schema:copyrightHolder → UGENT • schema:inLanguage → en
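As an illustration of the basic triples on this slide, the sketch below serialises one newspaper record as N-Triples using only the standard library. The namespace strings follow the URIs used elsewhere in the slides; the helper names and the decision to store the copyright holder as a plain literal are assumptions.

```python
CWORK = "http://www.bbc.co.uk/ontologies/creativework#"  # as written in the slides
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value: str) -> str:
    """Escape a string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def basic_triples(uri, title, date_created, content, holder, language):
    """Return the six basic triples for one newspaper as N-Triples lines."""
    subject = f"<{uri}>"
    return [
        f"{subject} <{RDF_TYPE}> <{CWORK}CreativeWork> .",
        f"{subject} <{CWORK}title> {literal(title)} .",
        f"{subject} <{CWORK}dateCreated> {literal(date_created)} .",
        f"{subject} <{CWORK}content> {literal(content)} .",
        # Assumption: the holder is a plain literal, not a URI.
        f"{subject} <{SCHEMA}copyrightHolder> {literal(holder)} .",
        f"{subject} <{SCHEMA}inLanguage> {literal(language)} .",
    ]


for line in basic_triples(
    "http://data.viaa.be/noid/tm71v5c76q_191510XX",
    "L’illustration",
    "1915-10-XX",
    "On dit que c'est notre imagination et….",
    "UGENT",
    "en",
):
    print(line)
```

Writing one line per triple keeps memory use flat, since each record can be written out and discarded before the next one is read.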
  16. 16. Hydra collection • Collection: http://data.viaa.be/noid/tm71v5c76q (totalItems: 3) • Pages (memberOf the collection): http://data.viaa.be/noid/tm71v5c76q_191804xx_0001, http://data.viaa.be/noid/tm71v5c76q_191804xx_0002, http://data.viaa.be/noid/tm71v5c76q_191804xx_0003 • Links: first, last, previous/next
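The navigation links in this diagram can be derived from an ordered list of page URIs. The sketch below is one possible reading of the diagram; the dictionary layout and the function name `hydra_navigation` are assumptions, and the keys only borrow the labels shown on the slide.

```python
def hydra_navigation(collection_uri, page_uris):
    """Compute first/last/previous/next links for an ordered list of pages."""
    if not page_uris:
        raise ValueError("a collection needs at least one page")
    pages = {}
    for index, uri in enumerate(page_uris):
        pages[uri] = {
            "memberOf": collection_uri,
            # The first page has no previous link, the last page no next link.
            "previous": page_uris[index - 1] if index > 0 else None,
            "next": page_uris[index + 1] if index + 1 < len(page_uris) else None,
        }
    return {
        "collection": collection_uri,
        "totalItems": len(page_uris),
        "first": page_uris[0],
        "last": page_uris[-1],
        "pages": pages,
    }


base = "http://data.viaa.be/noid/tm71v5c76q"
uris = [f"{base}_191804xx_{number:04d}" for number in (1, 2, 3)]
navigation = hydra_navigation(base, uris)
print(navigation["totalItems"], navigation["first"], navigation["last"])
```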
  17. 17. Problems • Node limited to 1.7 GB memory • OCR too big • Turtle file: 475 MB max (32k newspapers) • Compressed to HDT: 388 MB • Basic triples with HDT: • 54k newspapers → 8.2 MB
  18. 18. Adding context
  19. 19. Connect with other datasources • Cfr. Europeana, delpher.nl, lab.kbresearch.nl
  20. 20. Stanford NER • 4 types: Location, Organisation, Person and Other • Train your model: golden corpus • Write code that fits your needs • SPARQL query that matches strings • REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique • Difficult to find cultural APIs (cfr. InFlandersField list of names, Abraham catalogue)
  21. 21. DBpedia Spotlight • Proof of concept • Models for all languages (nl, en, fr, de) NL/FR/EN/DE trained model DBpedia matcher Stanford NER
  22. 22. Results? • Filter on OCR quality, e.g. drop text with <90% confidence in ALTO • Matches from the wrong time period, e.g. GeoNames • Standard models were used; they should be trained on this corpus • Use DBpedia knowledge later to filter “impossible” tags
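One way to apply the OCR-quality filter is at word level, using the `WC` (word confidence) attribute that ALTO defines on `String` elements. The slides do not say whether the filter was applied per word or per page, so the sketch below is an assumption about granularity, and the sample XML is invented for illustration.

```python
import xml.etree.ElementTree as ET


def confident_words(alto_xml: str, threshold: float = 0.9) -> list:
    """Return the CONTENT of ALTO String elements with WC >= threshold."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare the local name only: the namespace differs per ALTO version.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        confidence = element.get("WC")
        # Words without a WC attribute are dropped, since quality is unknown.
        if confidence is not None and float(confidence) >= threshold:
            words.append(element.get("CONTENT", ""))
    return words


# Invented sample, not taken from the hetarchief.be data.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Bruxelles" WC="0.97"/><SP/>
    <String CONTENT="Bmxelies" WC="0.41"/><SP/>
    <String CONTENT="Albert" WC="0.93"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(confident_words(SAMPLE))
```

Feeding only the surviving words to the entity recogniser trades recall for precision; a page-level average of `WC` would be an alternative if whole pages should be skipped instead.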
  23. 23. DBpedia Spotlight • Running your own endpoint is easy: • java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest • Or with Docker: • docker build -f Dockerfile -t dutch_spotlight . • docker run -i -p 2223:80 dutch_spotlight spotlight.sh
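Once such an endpoint is running, text can be sent to its `annotate` service over HTTP. The sketch below assumes the base URL from the `java` command above (the path may differ for the Docker setup) and assumes the JSON response carries a `Resources` list with `@surfaceForm` and `@URI` keys; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: base URL as passed to the jar on the slide.
ENDPOINT = "http://localhost:2223/nl/rest"


def build_annotate_request(text, confidence=0.5, endpoint=ENDPOINT):
    """Build a GET request for the Spotlight annotate service."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(
        f"{endpoint}/annotate?{query}",
        headers={"Accept": "application/json"},
    )


def extract_tags(response):
    """Pull (surface form, DBpedia URI) pairs out of a parsed JSON response."""
    return [
        (resource["@surfaceForm"], resource["@URI"])
        for resource in response.get("Resources", [])
    ]


def annotate(text, confidence=0.5):
    """Send text to the local endpoint; requires the server to be running."""
    with urlopen(build_annotate_request(text, confidence)) as reply:
        return extract_tags(json.load(reply))


print(build_annotate_request("Koning Albert").full_url)
```

Calling `annotate("Koning Albert bezocht het front")` would then return the recognised entities, provided the server is up. Long OCR texts may be better sent with POST, which this sketch does not cover.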
  24. 24. Linked Data as a Service • Allow federated queries • Low server cost • Be reliable • Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web
  25. 25. Linked Data Fragments querying • VIAA is part of the family! • Interfaces: http://data.viaa.be/ldf, https://query.wikidata.org/bigdata/ldf, http://data.linkeddatafragments.org/linkedgeodata, http://data.linkeddatafragments.org/dbpedia2014 • Your browser runs the client-side algorithm and GETs fragments
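Each "GET fragments" step boils down to requesting one triple pattern from an interface. The sketch below builds such a request URL, assuming the query parameter names `subject`, `predicate` and `object`; a conforming client would read the actual names from the hypermedia controls in each fragment rather than hard-coding them.

```python
from urllib.parse import urlencode


def fragment_url(interface, subject=None, predicate=None, obj=None):
    """Build the URL of a triple pattern fragment (parameter names assumed)."""
    pattern = (("subject", subject), ("predicate", predicate), ("object", obj))
    # Unbound positions are left out, which means "match anything".
    params = {name: value for name, value in pattern if value is not None}
    return f"{interface}?{urlencode(params)}" if params else interface


print(
    fragment_url(
        "http://data.viaa.be/ldf",
        predicate="http://www.bbc.co.uk/ontologies/creativework#title",
    )
)
```

The client answers a full SPARQL query by issuing many of these cheap requests and joining the results locally, which is what keeps the server cost low.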
  26. 26. Demo time!
  27. 27. Demo • Retrieve all newspaper titles: SELECT DISTINCT ?title WHERE { ?paper <http://www.bbc.co.uk/ontologies/creativework#title> ?title }
  28. 28. Demo • Retrieve more info from corresponding DBpedia URI: SELECT ?label ?comment WHERE { <http://data.viaa.be/noid/2z12n51476_19141120_0001> <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag . ?db owl:sameAs ?tag . ?db rdfs:label ?label . ?db rdfs:comment ?comment }
  29. 29. Battle of the Somme • Pages that mention military leaders from the Battle of the Somme, plus thumbnail: SELECT ?paper ?o ?thumbnail WHERE { <http://dbpedia.org/resource/Battle_of_the_Somme> <http://dbpedia.org/ontology/commander> ?o . ?paper <http://www.bbc.co.uk/ontologies/creativework#tag> ?ctag . ?o owl:sameAs ?ctag . ?o <http://dbpedia.org/ontology/thumbnail> ?thumbnail . }
  30. 30. Frontpainters • Semi-automatic generation of collections, e.g. about frontpainters: SELECT ?newspaper ?artist ?tag ?hetarchief WHERE { ?artist dc:subject <http://dbpedia.org/resource/Category:Belgian_war_artists> . ?artist owl:sameAs ?tag . ?newspaper <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag . ?newspaper <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?hetarchief }
  31. 31. Conclusion • Extra search method for our researchers • NER on top of OCR: enhanced findability • Adding extra information (cfr. Abraham) requires effort; we need more TPF interfaces
  32. 32. Future work • Dereferenceable URIs • http://data.viaa.be/noid/{id} • Content negotiation • HTML • JSON • RDF • Save location with OLR • Suggestions are welcome!
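Content negotiation for the three planned representations could work roughly as follows. This is a simplified sketch: the concrete media types (`text/html`, `application/json`, `text/turtle`) are assumptions, partial wildcards such as `text/*` are ignored, and `negotiate` is a hypothetical helper rather than part of any existing VIAA service.

```python
# Assumed media types for the HTML, JSON and RDF representations.
SUPPORTED = {"text/html": "html", "application/json": "json", "text/turtle": "rdf"}


def negotiate(accept_header, default="text/html"):
    """Pick a supported media type from an HTTP Accept header.

    Returns the default for an empty header, and None when nothing the
    client accepts is supported (the caller would answer 406).
    """
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, media_type))
    if not candidates:
        return default
    for negative_quality, _, media_type in sorted(candidates):
        if negative_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return None


print(negotiate("application/json;q=0.8, text/html;q=0.9"))
```

A request for http://data.viaa.be/noid/{id} would then be served as a web page to browsers and as data to clients that ask for JSON or Turtle, all under the same URI.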
  33. 33. Q&A Brecht Van de Vyvere | @brechtvdv
