Linked data at the BBC

July's Connected Data London meetup was hosted by Valtech and featured Augustine Kwanashie of the BBC talking about his experience of working on one of the world's leading linked data platforms.

In the BBC we store data about entities like people, places, events and organisations that matter to our audiences (and appear in our programmes and online content) in an RDF store. These entities are then used to tag BBC content, and the resulting RDF graphs help to power audience-facing apps and websites. By connecting BBC content in this way, we help enhance content discovery, grouping/aggregation and navigation, as well as personalisation and recommendations.


Augustine will talk about the architecture of our linked data system in terms of resilience, monitoring, performance, tooling, data quality and validation. He will also share some of our plans for opening up the platform and making the data accessible to the general public.


  1. 1. Linked Data at the BBC Augustine Kwanashie
  2. 2. Outline •  Introduction •  APIs and Tools •  Validation and Data Quality •  Performance & Resilience Measures •  Summary
  3. 3. Introduction
  4. 4. Tagging BBC Content <http://www.bbc.co.uk/things/2b7ba3ca-32ca…> a cwork:CreativeWork, cwork:NewsItem ; cwork:title "Pep Guardiola…" ; cms:locator <urn:bbc:cps:asset:748947894> ; cwork:language "en-gb" ; cwork:primaryFormat cwork:TextualFormat ; prov:dateCreated "2017-04-07T21:39:23+00:00" .
  5. 5. Tagging BBC Content <http://www.bbc.co.uk/things/4bdbf2-d1ad…> a core:Organisation, sport:SportingOrganisation ; core:label "Manchester City"@en-gb ; core:sameAs <http://www.wikidata.org/entity/Q50602> ; sport:competesIn <http://www.bbc../things/f1eb4771…> ; sport:discipline <http://www.bbc../things/ba6e1118…> ; sport:hasHome <http://www.bbc../things/0710009f…> .
  6. 6. Tagging BBC Content [Diagram: content items (Article, Video, Stream) linked by "about" to Manchester City <http://www.bbc.co.uk/things/4bdbf21d-d1ad-... >]
  7. 7. Tagging BBC Content <http://www.bbc.co.uk/things/2b7ba3ca-32ca…> a cwork:CreativeWork, cwork:NewsItem ; cwork:title "In the future I will be better…" ; cwork:about <http://www.bbc../things/4bdbf2-d1ad…> . <http://www.bbc.co.uk/things/4bdbf2-d1ad…> a core:Organisation, sport:SportingOrganisation ; core:label "Manchester City"@en-gb ; core:sameAs <http://www.wikidata.org/entity/Q50602> .
  8. 8. Tagging helps to: o  Group/aggregate content. o  Enhance content discovery. o  Enhance navigation. o  Improve personalisation and recommendation.
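The grouping/aggregation benefit listed above can be illustrated with a minimal sketch. It assumes the `cwork:about` tags have already been read out of the triple store as plain pairs; the content IDs and the `group_by_thing` helper are hypothetical and not part of the BBC platform.

```python
from collections import defaultdict

# Hypothetical (content_id, tagged_thing) pairs standing in for cwork:about triples.
ABOUT = [
    ("article-1", "things:635"),
    ("video-7", "things:635"),
    ("article-2", "things:834"),
]

def group_by_thing(about_pairs):
    """Index content IDs by the Thing they are tagged with."""
    index = defaultdict(list)
    for content_id, thing in about_pairs:
        index[thing].append(content_id)
    return dict(index)

grouped = group_by_thing(ABOUT)
print(grouped["things:635"])
```

Once content is indexed by Thing, a topic page or recommendation feed can be assembled with a single lookup rather than a text search.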
  9. 9. APIs and Tools
  10. 10. Landscape: Infrastructure | APIs and Tools | Clients
  11. 11. Tagging a BBC Article CMS Content Store Writer API Triple Store Core API Website Content API LDM Journalists Journalists Content Metadata
  12. 12. GraphDB Setup Jolokia Workbench Sesame Tomcat GraphDB master node Httpd Jolokia Workbench Sesame Tomcat GraphDB worker nodes Httpd S3 for storing backups Opsworks for management CloudWatch for monitoring
  13. 13. Custom APIs vs. SPARQL endpoints. Custom APIs: o  Can ensure performance. o  Ideal for rarely changing use-cases. o  Can validate writes. SPARQL endpoints: o  Complete flexibility with queries. o  Ideal for varied/changing use-cases.
  14. 14. Write APIs. Validation: Applies set of validation rules. Security: Authenticate via SSL certificate whitelists. Content-Types: Accepts Turtle, RDF+XML. Persistence: Writes asynchronously to triplestore. PUT: https://ldp-writer.int.api.bbci.co.uk/creative-works Content-Type: text/turtle Body: <http://www.bbc.co.uk/things/2b7ba3ca-32ca…> a cwork:CreativeWork, cwork:NewsItem ; cwork:title "Pep Guardiola…" ; cwork:about <http://www.bbc../things/4bd…> .
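As a rough illustration of the write call on this slide, the sketch below builds (but does not send) a PUT request carrying a Turtle body. The host name, the `cwork` prefix IRI and the example resource are placeholders chosen for this sketch; the real endpoint also requires client-certificate authentication, which is omitted here.

```python
import urllib.request

# Placeholder endpoint; the real writer API is internal and certificate-protected.
WRITER_URL = "https://ldp-writer.example.invalid/creative-works"

# Placeholder prefix IRI and resource, for illustration only.
TURTLE_BODY = """\
@prefix cwork: <http://example.org/cwork/> .
<http://example.org/things/1> a cwork:CreativeWork ;
    cwork:title "Example title" .
"""

def build_write_request(url, turtle):
    """Prepare a PUT request with a text/turtle payload (not sent here)."""
    return urllib.request.Request(
        url,
        data=turtle.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/turtle"},
    )

request = build_write_request(WRITER_URL, TURTLE_BODY)
print(request.get_method(), request.get_header("Content-type"))
```

Because the slide says writes are persisted asynchronously, a client would treat a successful response as "accepted" rather than as proof that the triples are already queryable.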
  15. 15. Read APIs. Filters & Mixins: Restrict returned data by type, domain, etc. Search: Full-text search on labels. Content-Types: Produces Trig, JSON+LD, JSON, HTML. Security: Authenticate via SSL certificates. GET: https://things.api.bbc.com/things ?type = core:Person &label_search = Theresa &mixin = pol Accept: json+ld
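The query parameters on this slide (`type`, `label_search`, `mixin`) can be assembled as in the sketch below. The base URL is a placeholder and `things_url` is a hypothetical helper; only the parameter names come from the slide.

```python
from urllib.parse import urlencode

# Placeholder base URL for illustration.
BASE_URL = "https://things.example.invalid/things"

def things_url(base, thing_type=None, label_search=None, mixins=()):
    """Build a read-API URL; repeated mixin parameters are allowed."""
    params = []
    if thing_type:
        params.append(("type", thing_type))
    if label_search:
        params.append(("label_search", label_search))
    params.extend(("mixin", mixin) for mixin in mixins)
    return base + "?" + urlencode(params)

url = things_url(BASE_URL, "core:Person", "Theresa", ["pol"])
print(url)
```

Using a list of pairs instead of a dict keeps repeated keys such as several `mixin` values in a stable order.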
  16. 16. Documenting APIs <urn:api:things:documentation> { <urn:api:things:get-multiple:covered-by> a api:Filter ; api:collectionFormat "multi"^^xsd:string ; api:description "Filter for Things with a matching bbc:coveredBy relationship."^^xsd:string ; api:in "query"^^xsd:string ; api:name "covered_by"^^xsd:string ; api:required "false"^^xsd:boolean ; api:type "array"^^xsd:string . }
  17. 17. Validation and Data Quality
  18. 18. ThingGraphs context:123 { things:635 a core:Thing, sport:SportingOrganisation ; core:label "Manchester City"@en-gb ; core:sameAs <http://www.wikidata.org/entity/Q50602> . context:123 a prov:ThingGraph ; prov:managedBy cms:LDM ; prov:provider <mailto:augustine@bbc.co.uk> ; prov:provided "2014-08-20T10:47:42+00:00"^^xsd:dateTime . }
  19. 19. Multiple ThingGraphs for single Thing context:01 { things:635 core:preferredLabel "Manchester City" . context:01 prov:managedBy cms:LDM . } context:02 { things:635 sport:competesIn things:834 . context:02 prov:managedBy cms:LDM . } context:03 { things:635 biz:listing things:856 . context:03 prov:managedBy cms:NewsIDM . }
  20. 20. Thing Response Status: 200 Content-Type: application/trig Body: things:635 { things:635 core:preferredLabel "Manchester City" ; sport:competesIn things:834 ; core:label "Manchester City" ; biz:listing things:856 . } GET: https://things.int.api.bbc.com/things ?mixin = sport &mixin = biz Accept: application/trig
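One way to picture how several ThingGraphs collapse into a single Thing response is sketched below. It assumes that `core` predicates are always returned and that each mixin simply enables one extra namespace prefix; that rule is inferred from the slides, not confirmed, and `thing_view` is a hypothetical helper.

```python
# Quads as (graph, subject, predicate, object), mirroring the three contexts above.
QUADS = [
    ("context:01", "things:635", "core:preferredLabel", '"Manchester City"'),
    ("context:02", "things:635", "sport:competesIn", "things:834"),
    ("context:03", "things:635", "biz:listing", "things:856"),
]

def thing_view(quads, thing, mixins):
    """Merge a Thing's triples across graphs, keeping core plus requested mixins."""
    allowed = {"core", *mixins}  # assumption: core is always included
    return sorted(
        (predicate, obj)
        for _, subject, predicate, obj in quads
        if subject == thing and predicate.split(":", 1)[0] in allowed
    )

view = thing_view(QUADS, "things:635", ["sport"])
print(view)
```

The graph names are dropped in the merged view, which matches the response on the slide where all triples appear under a single `things:635` graph.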
  21. 21. Some Validation Rules Cannot delete a Thing that is used to tag a CreativeWork things:635 core:preferredLabel "Manchester City" ; sport:competesIn things:834 ; DELETE: https://ldp-writer.bbc.com ?guid=things:635 cwork:345 a cwork:CreativeWork ; tagging:about things:635 .
  22. 22. Some Validation Rules Cannot update a ThingGraph managed by another CMS context:02 { things:635 sport:competesIn things:834 . context:02 prov:managedBy cms:LDM . } PUT: https://ldp-writer.bbc.com ?guid=things:635 X-ManagedBy: VIVO
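The two validation rules above can be expressed as simple predicates. This is a sketch over in-memory data with hypothetical function names; the real writer API presumably checks these conditions against the triple store.

```python
def can_delete(thing, about_tags):
    """A Thing may be deleted only if no CreativeWork is tagged with it.

    about_tags is an iterable of (creative_work, thing) pairs.
    """
    return all(tagged != thing for _, tagged in about_tags)

def can_update(graph_managed_by, requesting_cms):
    """A ThingGraph may be updated only by the CMS that manages it."""
    return graph_managed_by == requesting_cms

tags = [("cwork:345", "things:635")]
print(can_delete("things:635", tags))    # False: still used as a tag
print(can_update("cms:LDM", "cms:VIVO")) # False: managed by another CMS
```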
  23. 23. Managing Things
  24. 24. Managing Breaking Changes things:635 a core:Thing, core:Theme; core:label "Technology" ; core:sameAs dbpedia:Technology . things:635 a core:Thing, core:Theme; core:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:sameAs dbpedia:Technology .
  25. 25. Managing Breaking Changes things:635 a core:Thing, core:Theme; core:label "Technology" ; core:sameAs dbpedia:Technology . things:635 a core:Thing, core:Theme; trans-01:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:label "Technology" ; core:sameAs dbpedia:Technology . 1 Add transition triples
  26. 26. Managing Breaking Changes things:635 a core:Thing, core:Theme; core:label "Technology" ; core:sameAs dbpedia:Technology . things:635 a core:Thing, core:Theme; trans-01:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:sameAs dbpedia:Technology . 2 Align to new schema
  27. 27. Managing Breaking Changes things:635 a core:Thing, core:Theme; trans-01:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:sameAs dbpedia:Technology . 3 Remove transition triples things:635 a core:Thing, core:Theme; core:label "Technology"@en-gb, "Technologies"@fr, "Tecnología"@es ; core:sameAs dbpedia:Technology .
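The three migration steps (add transition triples, align to the new schema, remove transition triples) can be traced with a small sketch. A Thing is modelled here as a dict from predicate to values, which is a simplification for illustration; the function names are hypothetical.

```python
TRANSITION = "trans-01:label"

def add_transition(thing, new_labels):
    """Step 1: publish the new-style values under a temporary predicate."""
    return {**thing, TRANSITION: list(new_labels)}

def align(thing):
    """Step 2: copy the transition values onto the real predicate."""
    return {**thing, "core:label": list(thing[TRANSITION])}

def remove_transition(thing):
    """Step 3: drop the temporary predicate once clients have migrated."""
    return {k: v for k, v in thing.items() if k != TRANSITION}

thing = {"core:label": ['"Technology"'], "core:sameAs": ["dbpedia:Technology"]}
new_labels = ['"Technology"@en-gb', '"Technologies"@fr', '"Tecnología"@es']

step1 = add_transition(thing, new_labels)
step2 = align(step1)
step3 = remove_transition(step2)
print(step3)
```

Between steps 1 and 2 both shapes are readable, so clients can switch to the new schema before the old value disappears.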
  28. 28. Tagging out of Context core:label "The Presidents of the United States of America"; core:disambiguationHint "Music Group"; Hard to identify and prevent!
  29. 29. Managing Duplicate Things context:123 { things:635 a core:Thing, sport:Team; core:label "Manchester City"@en-gb ; core:sameAs dbpedia:Manchester_City_FC ; sport:managedBy things:6372 . context:123 prov:provided "2014-08-20" . } context:124 { things:636 a core:Thing, sport:Team; core:label "Man City"@en-gb ; core:sameAs <http://www.wikidata.org/234> . context:124 prov:provided "2017-02-12" . } 2800 CreativeWorks tagged 46 CreativeWorks tagged
  30. 30. Managing Duplicate Things context:123 { things:635 a core:Thing, sport:Team; core:label "Manchester City"@en-gb ; core:sameAs <http://www.wikidata.org/234>, dbpedia:Manchester_City_FC ; sport:managedBy things:6372 . context:123 prov:provided "2014-08-20" . } 2846 CreativeWorks tagged 1 Switch tags from things:636 to things:635 2 Compare all triples for both Things 3 Update things:635 with consolidated triples 4 Delete things:636 Merge script to:
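A possible shape for the merge script is sketched below. It assumes that labels are single-valued (the surviving Thing keeps its own) while other predicates such as `core:sameAs` are unioned; that policy is inferred from the before/after slides rather than stated, and all names are hypothetical.

```python
SINGLE_VALUED = {"core:label"}  # assumption: the survivor's value wins

def merge_things(triples, tags, keep, drop):
    """Retag content, consolidate triples onto `keep`, and remove `drop`."""
    # 1. Switch tags from the duplicate to the surviving Thing.
    new_tags = [(work, keep if thing == drop else thing) for work, thing in tags]
    # 2-3. Compare triples and carry over those the survivor lacks.
    kept = [t for t in triples if t[0] == keep]
    kept_predicates = {p for _, p, _ in kept}
    for subject, predicate, obj in triples:
        if subject != drop:
            continue
        if predicate in SINGLE_VALUED and predicate in kept_predicates:
            continue
        candidate = (keep, predicate, obj)
        if candidate not in kept:
            kept.append(candidate)
    # 4. Delete the duplicate by leaving its triples out of the result.
    others = [t for t in triples if t[0] not in (keep, drop)]
    return others + kept, new_tags

triples = [
    ("things:635", "core:label", "Manchester City"),
    ("things:635", "core:sameAs", "dbpedia:Manchester_City_FC"),
    ("things:636", "core:label", "Man City"),
    ("things:636", "core:sameAs", "wikidata:234"),
]
tags = [("cw:1", "things:635"), ("cw:2", "things:636")]
merged, retagged = merge_things(triples, tags, keep="things:635", drop="things:636")
print(merged)
```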
  31. 31. Performance
  32. 32. Load-balance reads and writes separately Master node WriteELB Worker nodes ReadELB1 ReadELB2
  33. 33. Asynchronous writes Clients Write API Write Pipeline POST POST 202 202 Write ELB POST 200 CREATED
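The asynchronous write flow can be sketched as an API that acknowledges with 202 and hands the document to a queue, with a separate step draining the queue into the store. This is an in-process simplification with hypothetical names; the real pipeline runs across separate services.

```python
import queue

class WriteApi:
    """Accepts writes immediately and persists them later."""

    def __init__(self):
        self.pending = queue.Queue()
        self.store = []  # stands in for the triple store

    def post(self, document):
        """Enqueue the document and acknowledge with 202 Accepted."""
        self.pending.put(document)
        return 202

    def drain(self):
        """Write pipeline step: persist everything queued so far."""
        written = 0
        while not self.pending.empty():
            self.store.append(self.pending.get())
            written += 1
        return written

api = WriteApi()
status = api.post("<thing> a core:Thing .")
print(status, len(api.store))  # accepted, but not yet persisted
print(api.drain(), len(api.store))
```

The client gets a fast response regardless of triple store latency, at the cost of a short window in which the write is accepted but not yet visible to readers.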
  34. 34. Optimise SPARQL queries SELECT ?subject ?predicate ?object WHERE { ?subject ?predicate ?object . { SELECT ?subject WHERE { OPTIONAL { ?subject prov:createdBy ?created . } } GROUP BY (?subject) HAVING BOUND (?subject) } } SELECT ?subject ?predicate ?object WHERE { ?subject ?predicate ?object . { SELECT ?subject WHERE { ?subject prov:createdBy ?created . } GROUP BY (?subject) } }
  35. 35. Load-test against future demand: 100% increase in the number of CreativeWorks by 2019. 60% increase in 99th percentile response times by 2019. 21m requests to the CreativeWorks API daily. 94m triples in Triplestore.
  36. 36. Monitor Everything!
  37. 37. Auto-scaling on API Instances 5 1 2 3 4 1 ELB sends metrics 2 Instances send metrics 3 Alarms trigger autoscaling action 4 New instance is created 5 Instance is added to pool
  38. 38. Caching Responses
  39. 39. Resilience
  40. 40. Queue-based write pipeline Queued writes across multiple clusters Writer API Consumer Primary GraphDB Cluster Consumer Replica GraphDB Cluster
  41. 41. Event-based write pipeline Event-based writes improves resilience Writer API Consumer Replica GraphDB Cluster Primary GraphDB Cluster Event store API RDS Notification Topics
  42. 42. Backup and Recovery: 26GB per backup. 20 mins recovery time. 16 full backups per day. Opsworks recipes to: o  Switch Primary and Replica cluster roles. o  Schedule backups. o  Restore backup to cluster. S3: o  Stores backups by date/time. o  Retires old backups to Glacier.
  43. 43. Replacing Triplestore Clusters Read ELB Write ELB Primary: Cluster 1 Replica: Cluster 2 Cluster 3 1 Create new cluster and load data
  44. 44. Replacing Triplestore Clusters Read ELB Write ELB Primary: Cluster 1 Replica: Cluster 3 Cluster 2 2 Swap new cluster with replica 3 Delete old replica cluster
  45. 45. Replacing Triplestore Clusters Read ELB Write ELB Primary: Cluster 3 Replica: Cluster 1 4 Swap new cluster with primary 5 Repeat steps 1 - 4
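The cluster replacement steps across the last three slides can be followed with a small state-transition sketch. The `rotate` function and the cluster names are illustrative only; in practice each step would be an Opsworks or load-balancer operation.

```python
def rotate(primary, replica, new_cluster):
    """Apply steps 2-4 once the new cluster has been created and loaded (step 1)."""
    # Step 2: the new cluster takes over the replica role.
    retired = [replica]
    replica = new_cluster
    # Step 3: the old replica is deleted (recorded in `retired`).
    # Step 4: swap the new cluster with the primary.
    primary, replica = replica, primary
    return primary, replica, retired

state = rotate("cluster-1", "cluster-2", "cluster-3")
print(state)
```

Repeating the procedure with another fresh cluster (step 5) would retire cluster 1 as well, so both original clusters end up replaced without the read or write load balancers ever losing a target.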
  46. 46. Responding to incidents CPU Utilisation > 90% for 5 mins? CloudWatch Alarm!!! Severity: - Warning? - Critical? Action: - Email to Dev team? - Notify 24/7 support? - Trigger Autoscaling action?
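The alarm condition on this slide (CPU above 90% for 5 minutes) amounts to a sustained-threshold check. The sketch below assumes one sample per minute; the mapping from severity to action is hypothetical, since the slide poses those choices as open questions.

```python
def alarm_breached(samples, threshold=90.0, periods=5):
    """True when the most recent `periods` samples all exceed the threshold."""
    if len(samples) < periods:
        return False
    return all(value > threshold for value in samples[-periods:])

# Hypothetical routing of severities to actions.
ACTIONS = {
    "warning": "email dev team",
    "critical": "notify 24/7 support",
}

cpu = [70, 92, 95, 97, 93, 91]  # one sample per minute (assumption)
breached = alarm_breached(cpu)
print(breached, ACTIONS["critical"] if breached else "no action")
```

Requiring every sample in the window to exceed the threshold avoids paging on a single spike.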
  47. 47. Summary
  48. 48. Opening up BBC Things
  49. 49. Opening up BBC Things
  50. 50. Opening up BBC Things
  51. 51. Opening up BBC Things
  52. 52. Main points o  Separating content from metadata o  APIs powered by Linked Data o  Monitoring and reacting to incidents o  Performance for present and future
  53. 53. Thanks... Augustine Kwanashie Connections.TechSupport@bbc.co.uk www.bbc.co.uk/things
  54. 54. Bonus slides…
  55. 55. Filters and Mixins http://www.bbc.co.uk/things/4bdbf2-d1a1 http://www.bbc.co.uk/things/4bdbf2-d1a2 http://www.bbc.co.uk/things/4bdbf2-d1a3 http://www.bbc.co.uk/things/4bdbf2-d1a4 http://www.bbc.co.uk/things/4bdbf2-d1a5 http://www.bbc.co.uk/things/4bdbf2-d1a6 Filter by type = core:Person Mixin = sport, core things:4bdbf2-d1b1 a core:Thing, sport:Team ; core:title "Manchester City" ; biz:listedIn <http://www.londonstockexchange> ; sport:managedBy things:4bdbf2-d1a1; biz:tradingAs "Manchester City PLC" .
  56. 56. Swagger Docs
  57. 57. Handling Data Scala libraries to enable easy RDF manipulation Trig Turtle etc. Connections-RDF o  Import/Export o  Create triples o  Compare Graphs o  Navigate Graphs o  Manage Datasets Trig Turtle etc.
  58. 58. Handling Data RDF DSL in Scala val rdfGraph = ( Iri("http://…") >> Rdf.`type` >>> Core.Thing >> Sport.`type` >>> Sport.Organisation >> BBC.coveredBy >>> Iri("urn:bbc:news") >> Core.label >>> "Manchester City" ) val label = (rdfGraph Core.label).get[String]
  59. 59. Some Validation Rules things:635 core:preferredLabel "Manchester City" ; cms:locator <urn:bbc:cps:asset:39715040>, <urn:bbc:cps:asset:39715040> . Thing locators must be unique things:635 core:preferredLabel "Manchester City" ; cms:locator <urn:bbc:cps:asset:01>, <urn:bbc:cps:asset:02> . <urn:bbc:cps:asset:01> a cms:CPSLocator . <urn:bbc:cps:asset:02> a cms:CPSLocator . Locator Types must be unique
  60. 60. Some Validation Rules things:635 cms:locator <urn:bbc:cps:asset:01> . things:636 cms:locator <urn:bbc:cps:asset:01> . Multiple Things with the same locator things:635 cms:sameAs dbpedia:01 . things:636 cms:sameAs dbpedia:01 . Multiple Things with the same sameAs things:635 core:label "Manchester City" rdf:type owl:Class . Blacklisted URIs present
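The uniqueness rules on this slide (no two Things sharing a locator or a sameAs target) can be checked with one generic helper. This is a sketch over in-memory triples with a hypothetical function name; a production check would more likely be a SPARQL query against the store.

```python
from collections import defaultdict

def shared_values(triples, predicate):
    """Return objects of `predicate` that are claimed by more than one subject."""
    owners = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            owners[obj].add(subject)
    return {obj: sorted(subs) for obj, subs in owners.items() if len(subs) > 1}

triples = [
    ("things:635", "cms:locator", "urn:bbc:cps:asset:01"),
    ("things:636", "cms:locator", "urn:bbc:cps:asset:01"),
    ("things:637", "cms:locator", "urn:bbc:cps:asset:02"),
]
conflicts = shared_values(triples, "cms:locator")
print(conflicts)
```

The same call with `"cms:sameAs"` as the predicate covers the second rule on the slide.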
  61. 61. Ordering Thing Updates Correctly create:1 update:2 update:3 delete:4 Document Writer Primary GraphDB Cluster 1 Fetch events from Event store 1 2 34 2 Execute task on Triplestore (only if task id is newer) 3 Errors? Put on Retry queue 4 Fetch and process events from Retry queue
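The ordering rule on this slide (execute a task only if its id is newer, and park failures on a retry queue) can be sketched as follows. Events are plain dicts and the store call is passed in as a function; these shapes are assumptions for illustration, not the Document Writer's actual interface.

```python
from collections import deque

last_applied = {}      # thing -> highest task id applied so far
retry_queue = deque()  # events that failed and await another attempt

def process(event, execute):
    """Apply an event unless a newer one has already been applied."""
    thing, task_id = event["thing"], event["id"]
    if task_id <= last_applied.get(thing, 0):
        return "skipped"
    try:
        execute(event)
    except Exception:
        retry_queue.append(event)
        return "queued"
    last_applied[thing] = task_id
    return "applied"

def failing(event):
    raise RuntimeError("triplestore unavailable")

results = [
    process({"thing": "things:635", "id": 2, "op": "update"}, lambda e: None),
    process({"thing": "things:635", "id": 1, "op": "create"}, lambda e: None),
    process({"thing": "things:635", "id": 3, "op": "update"}, failing),
]
print(results)
```

Because stale events are skipped rather than applied, a retried event that arrives after a newer update cannot overwrite it.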
  62. 62. Search: Creating an Index INSERT DATA { luc:index luc:setParam "uris" . luc:include luc:setParam "literals" . luc:includePredicates luc:setParam "core:label rdf:label core:shortLabel" . luc:moleculeSize luc:setParam "1" . luc:labelIndex luc:createIndex "true" . }
  63. 63. Search: Creating an Index :manutd :fclub :manc type label Football Club label Manchester United locatedIn label Manchester locatedIn :uk label United Kingdom RDF Module for :manutd RDF Module for :manc
  64. 64. Search: Full & Incremental Re-index INSERT DATA { luc:labelIndex luc:addToIndex <http://www.bbc.co.uk/things/2b7ba3ca-32ca…> . } Run incremental re-index after each Thing update INSERT DATA { luc:labelIndex luc:updateIndex _:b1 . } Run full re-index once daily
  65. 65. Search: Full Text Search Query SELECT ?thing ?score WHERE { ?thing a tagging:TagConcept . ?thing luc:score ?score . ?thing luc:labelIndex " (Manchester OR *Manchester OR *Manchester*) " . } o  Index available during the re-index process
  66. 66. Searching logs service="triple-store" and env="live" and "Error" Instance logs S3 bucket CloudWatch Logs
