Building a High Performance Environment for RDF Publishing
Upcoming SlideShare
Loading in...5
×
 

Building a High Performance Environment for RDF Publishing

on

  • 1,994 views

SWIB12, cologne 2012-11-27. Video recording: http://www.scivee.tv/node/55329

SWIB12, cologne 2012-11-27. Video recording: http://www.scivee.tv/node/55329

Statistics

Views

Total Views
1,994
Views on SlideShare
1,972
Embed Views
22

Actions

Likes
2
Downloads
32
Comments
0

1 Embed 22

https://twitter.com 22

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Building a High Performance Environment for RDF Publishing Building a High Performance Environment for RDF Publishing Presentation Transcript

  • Building a HighPerformance Environment for RDF Publishing Pascal Christoph
  • These slides and all the graphics made by the author and thosetaken from https://openclipart.org/ are dedicated to the publicdomain : https://creativecommons.org/about/cc0 .All marks mentioned may be trademarks or registered trademarksof their respective owners.Read about the license of „The scream“ of Edward Munch athttps://en.wikipedia.org/wiki/File:The_Scream.jpg Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Some more details • CaveatsFuture prospects Building a High Performance Environment for RDF Publishing 3
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Some more details • CaveatsFuture prospects Building a High Performance Environment for RDF Publishing 4
  • ing is for Consuming PublishPublishing is for Consuming Building a High Performance Environment for RDF Publishing 5
  • ing is for Consuming Publish MandatoryA resource: Building a High Performance Environment for RDF Publishing 6
  • ing is for Consuming Publish MandatoryA resource:gets a dereferenceable URI: Building a High Performance Environment for RDF Publishing 7
  • ing is for Consuming Publish MandatoryA resource:gets a dereferenceable URI:which provides RDF:<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .<http://lobid.org/resource/HT002948556><http://purl.org/dc/elements/1.1/creator><http://d-nb.info/gnd/135539897> . Building a High Performance Environment for RDF Publishing 8
  • ing is for Consuming Publish Mandatory=> basic LOD publishing is very simple:you just need a Webserver Building a High Performance Environment for RDF Publishing 9
  • ing is for Consuming Publish Nice to have• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation (best: RDFa in HTML)• Data searchable• Timely updates• High Availability• Versioning• Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 10
  • ing is for Consuming Publish SPARQL Endpoint• (Dumps)• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation (best: RDFa in HTML)• Data searchable• Timely updates• High Availability• Versioning• Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 11
  • ing is for Consuming Publish SPARQL Endpoint• (Dumps): but may be painfully slow when having lots of data• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation (best: RDFa in HTML)• (Data searchable) : maybe painfully slow• Timely updates• High Availability• Versioning• Web developers want simple APIs providing JSON • most triple stores provides JSON/RDF • Simple powerful API : too powerful/complex ?• ... Building a High Performance Environment for RDF Publishing 12
  • ing is for Consuming Publish Nice to haveIn principle, web developers already got simple APIs : LOD is the API ! Building a High Performance Environment for RDF Publishing 13
  • ing is for Consuming Publish Nice to haveIn principle, web developers already got simple APIs : Remember: Building a High Performance Environment for RDF Publishing 14
  • ing is for Consuming Publish MandatoryA resource:gets a dereferenceable URI:which provides the data (in RDF):<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .<http://lobid.org/resource/HT002948556><http://purl.org/dc/elements/1.1/creator><http://d-nb.info/gnd/135539897> . Building a High Performance Environment for RDF Publishing 15
  • ing is for Consuming Publish Nice to haveIn principle, web developers already got powerful APIs : RESTful SPARQL Building a High Performance Environment for RDF Publishing 16
  • ing is for Consuming Publish RESTful SPARQL example getting all data of all resources having a particular ISBN:curl -H "Accept: application/json" --data-urlencode query=prefix bibo: <http://purl.org/ontology/bibo/>SELECT * WHERE { ?s bibo:isbn13 "9780851706238" ; ?p ?o . } LIMIT 100 http://lobid.org/sparql/ Building a High Performance Environment for RDF Publishing 17
  • ing is for Consuming Publish Nice to haveBuilding a High Performance Environment for RDF Publishing 18
  • ing is for Consuming Publish RESTful SPARQL example… and the JSON/RDF result:{ "head": { "vars": [ "s", "p","o"] }, "results": { "bindings": [ { "o": { "type": "uri", "value": "http://openlibrary.org/works/OL2109573W" }, "p": { "type": "uri", "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested" }, "s": { "type": "uri", "value": "http://lobid.org/resource/HT007824357" } }, { "o": { ... Building a High Performance Environment for RDF Publishing 19
  • ing is for Consuming Publish Nice to have As it is, web developers dont like SPARQL web developerBuilding a High Performance Environment for RDF Publishing 20
  • ing is for Consuming Publish Nice to haveWeb developers want APIs like:http://lobid.org/resources/api/isbn/$isbn Building a High Performance Environment for RDF Publishing 21
  • Happy web developer
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Some more details • CaveatsFuture prospects Building a High Performance Environment for RDF Publishing 23
  • What is l obid.org ? lobid.orgBuilding a High Performance Environment for RDF Publishing 24
  • What is l obid.org ?● lobid := linking open bibliographic data● LOD services of the hbz ● lobid-resources : ● exposes 85% of the hbz cooperative catalogue ● entries coming from > 200 scientific German libraries ● ~ 16 M records with 700 M triples ● with links to ~ 5 M other resources ● with links to ~ 32 M items (consisting of 300 M triples) ● lobid-organisations : ● exposes German Sigelverzeichnis and MARC-Isil directory ● ~ 40 k descriptions of institutions Building a High Performance Environment for RDF Publishing 25
  • What is lobid.org ? Whats missing?• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Timely updates• High Availability• Versioning• Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 26
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Some more details • CaveatsFuture prospects Building a High Performance Environment for RDF Publishing 27
  • storing the data 2010 - 2011, lobid-organisationFilesystem : + easy to maintain + reliable + fast - no search - no SPARQL - ... Building a High Performance Environment for RDF Publishing 28
  • storing the data lobid todayTriple Store (4store) : + power of SPARQL +/- depending on the query: fast to horribly slow +/- search (but string searches often slow and limited) - sometimes gets stuck ! Building a High Performance Environment for RDF Publishing 29
  • storing the data lobid todaySearch engine (elasticsearch): + fast search + stemming, linguistics … + wildcard searching + facets + geo search + JSON + schema-less + simple RESTful API + many plugins + ... + easy to achieve High Availability + scales nicely Building a High Performance Environment for RDF Publishing 30
  • lobid today storing/getting the data
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Some more details • CaveatsFuture prospects Building a High Performance Environment for RDF Publishing 32
  • getting the datalobid : technology/dependency stack Webapp Search Engine Triple Store Building a High Performance Environment for RDF Publishing 33
  • getting the datalobid : technology/dependency stack Webapp we can do that Search Engine highly available ! Triple Store sometimes gets stuck! Building a High Performance Environment for RDF Publishing 34
  • getting the datalobid : technology/dependency stack Webapp we can do that Search Engine < highly available ! Triple Store = sometimes gets stuck! Building a High Performance Environment for RDF Publishing 35
  • getting the datalobid : technology/dependency stack Webapp we can do that Search Engine highly available ! Triple Store sometimes gets stuck! Building a High Performance Environment for RDF Publishing 36
  • oring/getting the data st Variant 1 : technology/dependency stack Triple Store Triple StoreClosed, internal. Will be safe from malign queries. For external access. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 37
  • oring/getting the data st Variant 1 : technology/dependency stack redundant, complex … Triple Store Triple StoreClosed, internal. Will be safe from malign queries. For external access. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 38
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch Benefits • • Some more details • CaveatsFuture prospects Building a High Performance Environment for RDF Publishing 39
  • D with elasticsearch Publishing LO Variant 2: technology/dependency stack Webapp we can do that Search Engine highly available !LOD basis functionality (and some other Triple Store For external access and some APIs) are highly available fancy nice-to-have stuff. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 40
  • D with elasticsearch Publishing LO Benefits• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• (Versioning)• Web developers want simple APIs returning JSON• ... Building a High Performance Environment for RDF Publishing 41
  • D with elasticsearch Publishing LO Benefitsfast, scalable search engine Building a High Performance Environment for RDF Publishing 42
  • D with elasticsearch Publishing LO performance testData: 10 M records <=> 300 M tripleCase-insensitive query: „beach“SELECT ?sWHERE { ?s <http://purl.org/dc/terms/title> ?o FILTER regex(str(?o), "beach", "i") }#### => SPARQL execution time for Q8316: 108.7s, returned 2815 rows.http://$ip:9200/_search?q=beach&from=0&size=2800# => Elasticsearch needed 0.4s=> Elasticsearch is 250 times faster Building a High Performance Environment for RDF Publishing 43
  • D with elasticsearch Publishing LO performance test(there is a support for text indexing in 4store, have not tested that.) Building a High Performance Environment for RDF Publishing 44
  • D with elasticsearch Publishing LO performance testElasticsearch: 18 M records , 6 GB RAM: 5 hour4store: 1 B triples, having 72 GB RAM: 7 hoursCPU: Quad Core mit 2.4 GhZ und Hyperthreading => 8 CPUsHD: 6 x 2.5" 10k U/min a 146GB(Dont take benchmarks too seriously – they just give a clue !) Building a High Performance Environment for RDF Publishing 45
  • D with elasticsearch Publishing LO Benefits• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• (Versioning) • Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 46
  • D with elasticsearch Publishing LO Benefitsbuild to be easily made highly available ! Building a High Performance Environment for RDF Publishing 47
  • D with elasticsearch Publishing LO Benefits• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• (Versioning) • Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 48
  • D with elasticsearch Publishing LO BenefitsVersioning with elasticsearch:Not out-of-the-box, but comes at least e.g. with* concurrency control* documents have a version number=> implementing versioning is not hard Building a High Performance Environment for RDF Publishing 49
  • D with elasticsearch Publishing LO Benefits, relying on elasticsearch as basic LOD storage• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• Versionizing• Web developers want: • JSON (LD) • Simple APIs• ... Building a High Performance Environment for RDF Publishing 50
  • D with elasticsearch Publishing LO Benefits• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• (Versioning) • Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 51
  • D with elasticsearch Publishing LO Why JSON-LD? JSON is :• stored natively by many tools (e.g. elasticsearch)• loved by consumers (web developers) JSON-LD is :• supported by RDF libraries (e.g. transforming to NTriples) Building a High Performance Environment for RDF Publishing 52
  • D with elasticsearch Publishing LO Benefits• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• (Versioning) • Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 53
  • D with elasticsearch Publishing LO BenefitsRESTful elasticsearch API, e. g. :http://lobid.org/resources/_search?q=isbn:$isbn Building a High Performance Environment for RDF Publishing 54
  • D with elasticsearch Publishing LO Benefits• … and many other nice things come with elasticsearch • geo-search : „Query only libraries/items residing up to 10 km from me.“ • … Building a High Performance Environment for RDF Publishing 55
  • D with elasticsearch Publishing LO Benefits• Dumps Content Negotiation (different RDF serializations) d!•• SPARQL plishe• accom Human readable representation ( RDFa in HTML)•• Data searchable Mis sion Near Real Time updates• High Availability• (Versioning)• Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 56
  • D with elasticsearch Publishing LO ( … ok, something is left to be done ! )• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates• High Availability• Versionizing• Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 57
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Caveats • Auto suggest demoConclusion Building a High Performance Environment for RDF Publishing 58
  • D with elasticsearch Publishing LO Caveats• Dumps• Content Negotiation (different RDF serializations)• SPARQL• Human readable representation ( RDFa in HTML)• Data searchable• Near Real Time updates•• High Availability Versionizing !?• Web developers want simple APIs providing JSON• ... Building a High Performance Environment for RDF Publishing 59
  • D with elasticsearch Publishing LO CaveatsHow to integrate semantic search into a document storage ?dct:contributor --------> dct:creator -------> dc:creator ---------> dc:contributor --------> bibo:translator …There is no inferencing as comes with SPARQL ! Building a High Performance Environment for RDF Publishing 60
  • D with elasticsearch Publishing LO CaveatsOur data flow :from records to RDF triples to records Building a High Performance Environment for RDF Publishing 61
  • D with elasticsearch Publishing LO CaveatsOur data flow :from records to RDF triples to records Building a High Performance Environment for RDF Publishing 62
  • D with elasticsearch Publishing LO Caveatsfrom records to RDF triples to records !? Building a High Performance Environment for RDF Publishing 63
  • D with elasticsearch Publishing LO CaveatsFrom records to RDF triples |-----> graph-database ------> computing ---> record-database MARC/MAB/PICA... JSON-LD Building a High Performance Environment for RDF Publishing 64
  • D with elasticsearch Publishing LO Caveats tree-based vs graph-based:Pre-render the whole document? What is the document ? Building a High Performance Environment for RDF Publishing 65
  • D with elasticsearch Publishing LO CaveatsBuilding a High Performance Environment for RDF Publishing 66
  • D with elasticsearch Publishing LO CaveatsWhat is the document ? Only the top-level node ? Building a High Performance Environment for RDF Publishing 67
  • D with elasticsearch Publishing LO CaveatsWhat is the document ? Only the top-level node ?… but then you couldnt even search the authors name ! Building a High Performance Environment for RDF Publishing 68
  • D with elasticsearch Publishing LO Caveatssearching needs integration of some fields from subgraphs into the document Building a High Performance Environment for RDF Publishing 69
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Caveats • Auto suggest demoConclusion Building a High Performance Environment for RDF Publishing 70
  • D with elasticsearch Publishing LO auto suggestauthority IDs must be easily found 71 Building a High Performance Environment for RDF Publishing
  • D with elasticsearch Publishing LO auto suggestauthority IDs must be easily found => in need of auto suggest 72 Building a High Performance Environment for RDF Publishing
  • D with elasticsearch Publishing LO auto suggestauto suggests needs fast searching Building a High Performance Environment for RDF Publishing 73
  • D with elasticsearch Publishing LO Demo auto suggestBuilding a High Performance Environment for RDF Publishing 74
  • D with elasticsearchPublishing LO 75
  • D with elasticsearch Publishing LO auto suggestRESTful APIs:http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karlhttp://demo.lobid.org/search?format=page&index=gnd-index&author=Schmidt%2C+Karlhttp://demo.lobid.org/search?format=full&index=gnd-index&author=Schmidt%2C+Karl…API usage:GET /search?format=<page|full|short>&index=<lobid-index|gnd-index>&author=<query>easy to enhance with the play framework and the elasticsearch API Building a High Performance Environment for RDF Publishing 76
  • D with elasticsearch Publishing LO auto suggestRESTful APIs:http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl [ "Schmidt, Karl (1894-1945)", "Schmidt, Karl", "Schmidt, Karl (1910-)", "Schmidt, Karl (1846-1928)", "Schmidt, Karl (1913-)", "Schmidt, Karl (1899-)", "Schmidt, Karl (1924-)", "Schmidt, Karl (1836-1888)", "Schmidt, L. F. Karl", "Schmidt, Karl (1902-1945)", "Schmidt, Karl J.", "Schmidt, Karl (1848-1905)", "Schmidt, Karl (1817-1882)", "Schmidt, Karl R.", "Schmidt, Karl (1954-)", "Schmidt, Karl (1888-)", "Schmidt, Karl (1867-)", ... ] Building a High Performance Environment for RDF Publishing 77
  • D with elasticsearch Publishing LO auto suggest GND authority file in lobid-resourcesBuilding a High Performance Environment for RDF Publishing 78
  • D with elasticsearchPublishing LO 79
  • D with elasticsearch Publishing LOBuilding a High Performance Environment for RDF Publishing 80
  • OverviewPublishing is for Consuming • Mandatory • Nice to haveStory so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the dataPublishing RDF through elasticsearch • Benefits • Caveats • Auto suggest demoConclusion Building a High Performance Environment for RDF Publishing 81
  • D with elasticsearch Publishing LO Conclusion a highly customizable/reliable/feature-rich LOD service Webapp we can do that Search Engine highly available !LOD basis functionality (and some other Triple Store For external access and some APIs) are highly available fancy nice-to-have stuff. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 82
  • D with elasticsearch Publishing LOthe software is Open Source: http://elasticsearch.org/ https://hadoop.apache.org/ http://www.playframework.org/ http://4store.org/ https://github.com/lobid/Building a High Performance Environment for RDF Publishing 83
  • Any Questions ? Pascal Christoph christoph@hbz-nrw.de semweb@hbz-nrw.de
  • Using a dark background, this presentation saves maybe 70% of energy