This document discusses building a high performance environment for publishing RDF data. It describes experiences publishing data through lobid.org, including what lobid.org is, how the data is stored, and how it is accessed. It discusses benefits of publishing RDF through Elasticsearch, including fast search performance, scalability, and ease of achieving high availability. The document also notes Elasticsearch provides benefits like dumps, content negotiation, searchability, near real-time updates, and APIs that return JSON for web developers.
2. These slides, all graphics made by the author, and those taken from
https://openclipart.org/ are dedicated to the public domain:
https://creativecommons.org/about/cc0.
All marks mentioned may be trademarks or registered trademarks
of their respective owners.
Read about the license of "The Scream" by Edvard Munch at
https://en.wikipedia.org/wiki/File:The_Scream.jpg
Linking Open Data cloud diagram, by Richard
Cyganiak and Anja Jentzsch. http://lod-cloud.net/
3. Overview
Publishing is for Consuming
• Mandatory
• Nice to have
Story so far - experiences with lobid.org
• What is lobid.org?
• Storing the data
• Getting the data
Publishing RDF through elasticsearch
• Benefits
• Some more details
• Caveats
Future prospects
Building a High Performance Environment for RDF Publishing 3
5. Publishing is for Consuming
8. Publishing is for Consuming
Mandatory
A resource:
gets a dereferenceable URI:
which provides RDF:
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/135539897> .
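The four statements above are N-Triples, so a consumer can read them with very little code. The following is a minimal sketch in Python, assuming that only IRIs and plain string literals occur (as in this example). The name `parse_ntriples` is hypothetical, and a real application would more likely rely on an RDF library.

```python
import re

# Matches one N-Triples statement whose object is an IRI or a plain literal.
# Assumption: no blank nodes, language tags or datatypes, as in the sample.
TRIPLE = re.compile(
    r'^\s*<([^>]*)>\s*<([^>]*)>\s*(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)

SAMPLE = """\
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/135539897> .
"""


def parse_ntriples(text):
    """Return a list of (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError("unsupported statement: " + line)
        subject, predicate, iri, literal = match.groups()
        triples.append((subject, predicate, iri if iri is not None else literal))
    return triples


# Print the local name of each predicate next to its value.
for s, p, o in parse_ntriples(SAMPLE):
    print(p.rsplit("/", 1)[-1], "=", o)
```

The sketch only shows that dereferencing a URI already yields machine-readable data; it does not attempt to cover the full N-Triples grammar.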
9. Publishing is for Consuming
Mandatory
=> Basic LOD publishing is very simple:
you just need a web server
10. Publishing is for Consuming
Nice to have
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (best: RDFa in HTML)
• Data searchable
• Timely updates
• High Availability
• Versioning
• Web developers want simple APIs providing JSON
• ...
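Content negotiation means that the same URI returns a different RDF serialization depending on the client's Accept header. The following Python sketch shows one way a server could pick the serialization. The function name `negotiate` and the order of preference are assumptions, and partial wildcards such as `text/*` are deliberately not handled.

```python
# Serializations this hypothetical server can produce, in order of preference.
SUPPORTED = [
    "text/turtle",
    "application/n-triples",
    "application/rdf+xml",
    "application/ld+json",
    "text/html",  # human readable representation, e.g. RDFa in HTML
]


def negotiate(accept_header, supported=SUPPORTED):
    """Pick the supported media type with the highest q-value, or None."""
    best, best_q = None, 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0].lower(), 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type == "*/*":
            candidate = supported[0]  # client takes anything: use our default
        elif media_type in supported:
            candidate = media_type
        else:
            continue
        if q > best_q:
            best, best_q = candidate, q
    return best


print(negotiate("application/rdf+xml;q=0.5, text/turtle"))
```

A return value of `None` would typically be answered with HTTP status 406 (Not Acceptable).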
12. Publishing is for Consuming
SPARQL Endpoint
• (Dumps): but may be painfully slow with lots of data
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (best: RDFa in HTML)
• (Data searchable): may be painfully slow
• Timely updates
• High Availability
• Versioning
• Web developers want simple APIs providing JSON
  • most triple stores provide JSON/RDF
  • Simple powerful API: too powerful/complex?
• ...
13. Publishing is for Consuming
Nice to have
In principle, web developers already have simple APIs:
LOD is the API!
14. Publishing is for Consuming
Nice to have
In principle, web developers already have simple APIs:
Remember:
15. Publishing is for Consuming
Mandatory
A resource:
gets a dereferenceable URI:
which provides the data (in RDF):
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/135539897> .
16. Publishing is for Consuming
Nice to have
In principle, web developers already have powerful APIs:
RESTful SPARQL
17. Publishing is for Consuming
RESTful SPARQL example
Getting all data of all resources having a particular ISBN:
curl -H "Accept: application/json" --data-urlencode 'query=
prefix bibo: <http://purl.org/ontology/bibo/>
SELECT * WHERE {
?s bibo:isbn13 "9780851706238" ;
?p ?o .
} LIMIT 100
' http://lobid.org/sparql/
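The curl call above can also be prepared from a program. The following Python sketch builds the same form-encoded POST request with the standard library, without sending it. The helper names `isbn_query` and `build_request` are hypothetical; the endpoint URL and the Accept header are taken from the slide.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://lobid.org/sparql/"  # endpoint URL as given on the slide


def isbn_query(isbn13):
    """Build the SPARQL query from the slide for a given ISBN-13."""
    if not (isbn13.isdigit() and len(isbn13) == 13):
        raise ValueError("expected 13 digits")
    return (
        "prefix bibo: <http://purl.org/ontology/bibo/>\n"
        "SELECT * WHERE {\n"
        '  ?s bibo:isbn13 "%s" ;\n'
        "     ?p ?o .\n"
        "} LIMIT 100" % isbn13
    )


def build_request(isbn13, endpoint=ENDPOINT):
    """Prepare (but do not send) the POST request that curl would issue."""
    body = urllib.parse.urlencode({"query": isbn_query(isbn13)}).encode("ascii")
    return urllib.request.Request(
        endpoint, data=body, headers={"Accept": "application/json"}
    )


request = build_request("9780851706238")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request).read()
```

Validating the ISBN before it is placed into the query string is a design choice of this sketch; it keeps arbitrary input out of the SPARQL text.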
19. Publishing is for Consuming
RESTful SPARQL example
… and the JSON/RDF result:
{
  "head": {
    "vars": [ "s", "p", "o" ]
  },
  "results": {
    "bindings": [ {
      "o": {
        "type": "uri",
        "value": "http://openlibrary.org/works/OL2109573W"
      },
      "p": {
        "type": "uri",
        "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested"
      },
      "s": {
        "type": "uri",
        "value": "http://lobid.org/resource/HT007824357"
      }
    },
    {
      "o": { ...
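To illustrate what a consumer has to do with such a result, the following Python sketch flattens the bindings into plain tuples. It uses only the one complete binding shown on the slide; the function name `rows` is hypothetical.

```python
import json

# One complete binding, copied from the result shown on the slide.
RESULT = json.loads("""
{
  "head": {"vars": ["s", "p", "o"]},
  "results": {"bindings": [
    {
      "s": {"type": "uri", "value": "http://lobid.org/resource/HT007824357"},
      "p": {"type": "uri", "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested"},
      "o": {"type": "uri", "value": "http://openlibrary.org/works/OL2109573W"}
    }
  ]}
}
""")


def rows(result):
    """Yield one tuple of plain values per binding, ordered as in head.vars."""
    variables = result["head"]["vars"]
    for binding in result["results"]["bindings"]:
        # A variable may be unbound in a row, hence the .get() calls.
        yield tuple(binding.get(v, {}).get("value") for v in variables)


for s, p, o in rows(RESULT):
    print(s, p, o)
```

Even this small step (unwrapping `head`, `results`, `bindings` and `value`) is extra work compared with an API that returns the record directly.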
20. Publishing is for Consuming
Nice to have
As it is, web developers don't like SPARQL
[Image: web developer]
21. Publishing is for Consuming
Nice to have
Web developers want APIs like:
http://lobid.org/resources/api/isbn/$isbn
24. What is lobid.org?
lobid.org
25. What is lobid.org?
● lobid := linking open bibliographic data
● LOD services of the hbz
● lobid-resources:
  ● exposes 85% of the hbz cooperative catalogue
  ● entries coming from > 200 German academic libraries
  ● ~ 16 M records with 700 M triples
  ● with links to ~ 5 M other resources
  ● with links to ~ 32 M items (consisting of 300 M triples)
● lobid-organisations:
  ● exposes the German Sigelverzeichnis and the MARC-Isil directory
  ● ~ 40 k descriptions of institutions
26. What is lobid.org?
What's missing?
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (RDFa in HTML)
• Data searchable
• Timely updates
• High Availability
• Versioning
• Web developers want simple APIs providing JSON
• ...
28. storing the data
2010 - 2011, lobid-organisation
Filesystem :
+ easy to maintain
+ reliable
+ fast
- no search
- no SPARQL
- ...
29. storing the data
lobid today
Triple Store (4store) :
+ power of SPARQL
+/- depending on the query: fast to horribly slow
+/- search (but string searches often slow and limited)
- sometimes gets stuck !
30. storing the data
lobid today
Search engine (elasticsearch):
+ fast search
+ stemming, linguistics …
+ wildcard searching
+ facets
+ geo search
+ JSON
+ schema-less
+ simple RESTful API
+ many plugins
+ ...
+ easy to achieve High Availability
+ scales nicely
34. getting the data
lobid: technology/dependency stack
Webapp: we can do that
Search Engine: highly available!
Triple Store: sometimes gets stuck!
38. storing/getting the data
Variant 1: technology/dependency stack
redundant, complex …
Triple Store (closed, internal): will be safe from malicious queries.
Triple Store (for external access): sometimes gets stuck!
40. Publishing LOD with elasticsearch
Variant 2: technology/dependency stack
Webapp: we can do that
Search Engine: highly available! Basic LOD functionality (and some other APIs) is highly available.
Triple Store: for external access and some fancy nice-to-have stuff. Sometimes gets stuck!
41. Publishing LOD with elasticsearch
Benefits
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (RDFa in HTML)
• Data searchable
• Near Real Time updates
• High Availability
• (Versioning)
• Web developers want simple APIs returning JSON
• ...
42. Publishing LOD with elasticsearch
Benefits
fast, scalable search engine
43. Publishing LOD with elasticsearch
performance test
Data: 10 M records <=> 300 M triples
Case-insensitive query: "beach"
SELECT ?s
WHERE
{ ?s <http://purl.org/dc/terms/title> ?o
  FILTER regex(str(?o), "beach", "i")
}
#### => SPARQL execution time for Q8316: 108.7s, returned 2815 rows.
http://$ip:9200/_search?q=beach&from=0&size=2800
# => Elasticsearch needed 0.4s
=> Elasticsearch is more than 250 times faster (108.7 s / 0.4 s ≈ 270)
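The Elasticsearch side of the comparison is an ordinary URI search. The following Python sketch builds such a URL and recomputes the speed-up from the two timings on the slide. The helper name `search_url` is hypothetical and `localhost` is only a placeholder for the elided `$ip`.

```python
from urllib.parse import urlencode


def search_url(host, query, start=0, size=10):
    """Build an Elasticsearch URI search request like the one on the slide."""
    params = urlencode({"q": query, "from": start, "size": size})
    return "http://%s:9200/_search?%s" % (host, params)


# "localhost" is a placeholder for the elided $ip.
print(search_url("localhost", "beach", size=2800))

# Speed-up implied by the two timings on the slide.
sparql_seconds, elasticsearch_seconds = 108.7, 0.4
print(round(sparql_seconds / elasticsearch_seconds))
```

Note that the two requests are not strictly equivalent: the SPARQL filter matches the substring in the title only, while `q=beach` is a full-text query over the indexed fields.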
44. Publishing LOD with elasticsearch
performance test
(4store has support for text indexing; we have not tested that.)
45. Publishing LOD with elasticsearch
performance test
Elasticsearch: 18 M records, 6 GB RAM: 5 hours
4store: 1 B triples, 72 GB RAM: 7 hours
CPU: quad core at 2.4 GHz with Hyper-Threading => 8 CPUs
HD: 6 x 2.5" 10k rpm, 146 GB each
(Don't take benchmarks too seriously. They just give a clue!)
47. Publishing LOD with elasticsearch
Benefits
built to be easily made highly available!
49. Publishing LOD with elasticsearch
Benefits
Versioning with elasticsearch:
Not available out of the box, but elasticsearch at least comes with, e.g.:
* concurrency control
* a version number for every document
=> implementing versioning is not hard
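The idea behind the two building blocks can be shown without any Elasticsearch code. The following Python sketch is a toy in-memory store, not the Elasticsearch client API: a write succeeds only if the caller knows the current version number (optimistic concurrency control), and older versions are kept so they can be read back. All class and method names are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


class VersionedStore:
    """Toy in-memory store: keeps every version of every document."""

    def __init__(self):
        self._history = {}  # document id -> list of bodies, oldest first

    def current_version(self, doc_id):
        return len(self._history.get(doc_id, []))  # 0 means "does not exist"

    def put(self, doc_id, body, expected_version):
        """Store a new version only if nobody else wrote in the meantime."""
        if expected_version != self.current_version(doc_id):
            raise VersionConflict(doc_id)
        self._history.setdefault(doc_id, []).append(dict(body))
        return self.current_version(doc_id)

    def get(self, doc_id, version=None):
        """Return the latest version, or an older one if requested."""
        versions = self._history[doc_id]
        return versions[-1] if version is None else versions[version - 1]


store = VersionedStore()
store.put("HT002948556", {"issued": "1983"}, expected_version=0)
print(store.current_version("HT002948556"))
```

In a real setup the history would have to be persisted somewhere, for example in a second index; that part is left open here.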
50. Publishing LOD with elasticsearch
Benefits, relying on elasticsearch as basic LOD storage
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (RDFa in HTML)
• Data searchable
• Near Real Time updates
• High Availability
• Versioning
• Web developers want:
  • JSON (LD)
  • Simple APIs
• ...
52. Publishing LOD with elasticsearch
Why JSON-LD?
JSON is:
• stored natively by many tools (e.g. elasticsearch)
• loved by consumers (web developers)
JSON-LD is:
• supported by RDF libraries (e.g. transforming to N-Triples)
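To make the link between a JSON document and RDF concrete, the following Python sketch converts one flat JSON-LD node in expanded form into N-Triples by hand. It is deliberately limited: no `@context`, no nesting, no datatypes or language tags. An RDF library would handle the general case; the function name `to_ntriples` is hypothetical.

```python
import json

# A flat JSON-LD node in expanded form (full IRIs as keys), so no @context
# is needed. The values are taken from the example resource shown earlier.
DOC = {
    "@id": "http://lobid.org/resource/HT002948556",
    "http://purl.org/dc/terms/title": [{"@value": "With reference to reference"}],
    "http://purl.org/dc/elements/1.1/creator": [{"@id": "http://d-nb.info/gnd/135539897"}],
}


def to_ntriples(node):
    """Serialize one flat, expanded JSON-LD node as N-Triples lines."""
    lines = []
    for predicate, values in node.items():
        if predicate.startswith("@"):
            continue  # keywords such as @id are not predicates
        for value in values:
            if "@id" in value:
                obj = "<%s>" % value["@id"]
            else:
                # json.dumps adds the quotes and escapes " and \ for us.
                obj = json.dumps(str(value["@value"]), ensure_ascii=False)
            lines.append("<%s> <%s> %s ." % (node["@id"], predicate, obj))
    return lines


print("\n".join(to_ntriples(DOC)))
```

The same dictionary can be stored as a JSON document and handed to web developers unchanged, which is the point of the slide.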
54. Publishing LOD with elasticsearch
Benefits
RESTful elasticsearch API, e.g.:
http://lobid.org/resources/_search?q=isbn:$isbn
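A client of such an API mostly needs two things: the URL and the stored documents inside the response. The following Python sketch shows both. The response dictionary is made up for illustration (only the `hits` / `hits` / `_source` nesting of an Elasticsearch search response matters), and the helper names are hypothetical.

```python
from urllib.parse import quote


def isbn_search_url(isbn):
    """URL pattern from the slide, with the ISBN filled in."""
    return "http://lobid.org/resources/_search?q=isbn:" + quote(isbn)


def sources(response):
    """Return the stored JSON documents from an Elasticsearch search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Made-up response for illustration; only the hits/_source nesting matters.
EXAMPLE_RESPONSE = {
    "hits": {
        "hits": [
            {
                "_id": "HT002948556",
                "_source": {
                    "isbn": "9780915145539",
                    "title": "With reference to reference",
                },
            }
        ]
    }
}

print(isbn_search_url("9780915145539"))
print(sources(EXAMPLE_RESPONSE)[0]["title"])
```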
55. Publishing LOD with elasticsearch
Benefits
• … and many other nice things come with elasticsearch
  • geo-search: "Query only libraries/items residing up to 10 km from me."
  • …
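The geo-search example could be expressed as a `geo_distance` filter. The following Python sketch only builds the request body. It assumes a field named `location` that is mapped as a geo point (the field name is hypothetical), and it follows the query DSL of recent Elasticsearch releases; older versions wrap filters differently.

```python
import json


def within_distance(lat, lon, distance="10km", field="location"):
    """Build a search body that keeps only documents near the given point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Cologne is used as an arbitrary example position.
print(json.dumps(within_distance(50.94, 6.96), indent=2))
```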
56. Publishing LOD with elasticsearch
Benefits
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (RDFa in HTML)
• Data searchable
• Near Real Time updates
• High Availability
• (Versioning)
• Web developers want simple APIs providing JSON
• ...
Mission accomplished!
57. Publishing LOD with elasticsearch
(… ok, something is left to be done!)
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (RDFa in HTML)
• Data searchable
• Near Real Time updates
• High Availability
• Versioning
• Web developers want simple APIs providing JSON
• ...
58. Overview
Publishing is for Consuming
• Mandatory
• Nice to have
Story so far - experiences with lobid.org
• What is lobid.org?
• Storing the data
• Getting the data
Publishing RDF through elasticsearch
• Benefits
• Caveats
• Auto suggest demo
Conclusion
59. Publishing LOD with elasticsearch
Caveats
• Dumps
• Content Negotiation (different RDF serializations)
• SPARQL
• Human readable representation (RDFa in HTML)
• Data searchable
• Near Real Time updates
• High Availability
• Versioning !?
• Web developers want simple APIs providing JSON
• ...
60. Publishing LOD with elasticsearch
Caveats
How to integrate semantic search into a document store?
dct:contributor --------> dct:creator -------> dc:creator
---------> dc:contributor
--------> bibo:translator
…
There is no inferencing like the one that comes with SPARQL!
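One common workaround for missing inferencing is to expand a property into its sub-properties before querying. The following Python sketch shows the idea. The sub-property table is a hypothetical one, loosely modelled on the relations sketched on the slide, and the use of prefixed names as field names is an assumption; `multi_match` is the Elasticsearch query type for searching several fields at once.

```python
# Hypothetical sub-property table, loosely modelled on the slide:
# super-property -> direct sub-properties.
SUB_PROPERTIES = {
    "dc:contributor": ["dct:contributor"],
    "dc:creator": ["dct:creator"],
    "dct:contributor": ["dct:creator", "bibo:translator"],
}


def expand(prop, table=SUB_PROPERTIES):
    """Return prop plus all of its direct and indirect sub-properties."""
    seen, todo = [], [prop]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.append(current)
            todo.extend(table.get(current, []))
    return sorted(seen)


def expanded_query(prop, text):
    """Build a multi_match body that searches every expanded field."""
    return {"query": {"multi_match": {"query": text, "fields": expand(prop)}}}


print(expand("dc:contributor"))
```

The alternative is to do the expansion at index time and store the inferred fields in each document, which trades index size for simpler queries.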
61. Publishing LOD with elasticsearch
Caveats
Our data flow:
from records to RDF triples to records
63. Publishing LOD with elasticsearch
Caveats
from records to RDF triples to records !?
64. Publishing LOD with elasticsearch
Caveats
From records to RDF triples
|-----> graph-database
'------> computing ---> record-database
MARC/MAB/PICA...
JSON-LD
65. Publishing LOD with elasticsearch
Caveats
Tree-based vs. graph-based:
Pre-render the whole document?
What is the document?
68. Publishing LOD with elasticsearch
Caveats
What is the document? Only the top-level node?
… but then you couldn't even search for the author's name!
69. Publishing LOD with elasticsearch
Caveats
Searching requires integrating some fields from subgraphs into the document
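The integration step can be sketched as a small denormalization: the document is built from the top-level node, and selected fields of directly linked nodes are copied into it. In the following Python sketch the graph, the `ex:` identifiers, the author name and the dotted field naming are all invented for illustration.

```python
# Tiny example graph as (subject, predicate, object) tuples. The author
# node and its name are invented for illustration.
GRAPH = [
    ("ex:book1", "dct:title", "With reference to reference"),
    ("ex:book1", "dct:creator", "ex:author1"),
    ("ex:author1", "foaf:name", "Jane Doe"),
]


def build_document(graph, root, embed=("foaf:name",)):
    """Turn the root node into a flat document and copy selected fields of
    directly linked nodes into it, so that they become searchable."""
    subjects = {s for s, _, _ in graph}
    doc = {"@id": root}
    for s, p, o in graph:
        if s != root:
            continue
        doc.setdefault(p, []).append(o)
        if o in subjects:  # the object is itself a node of the graph
            for s2, p2, o2 in graph:
                if s2 == o and p2 in embed:
                    doc.setdefault(p + "." + p2, []).append(o2)
    return doc


print(build_document(GRAPH, "ex:book1"))
```

Which fields to embed, and how deep to follow links, is exactly the "what is the document?" decision raised on the previous slides.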
72. Publishing LOD with elasticsearch
auto suggest
Authority IDs must be easy to find
=> auto suggest is needed
73. Publishing LOD with elasticsearch
auto suggest
auto suggest needs fast searching
74. Publishing LOD with elasticsearch
Demo
auto suggest
76. Publishing LOD with elasticsearch
auto suggest
RESTful APIs:
http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl
http://demo.lobid.org/search?format=page&index=gnd-index&author=Schmidt%2C+Karl
http://demo.lobid.org/search?format=full&index=gnd-index&author=Schmidt%2C+Karl
…
API usage:
GET /search?format=<page|full|short>&index=<lobid-index|gnd-index>&author=<query>
Easy to enhance with the Play framework and the elasticsearch API
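The usage line above translates directly into a small client helper. The following Python sketch builds the request URL and rejects parameter values that the usage line does not list. The function name `suggest_url` and the default values are assumptions.

```python
from urllib.parse import urlencode

BASE_URL = "http://demo.lobid.org/search"
FORMATS = ("page", "full", "short")
INDEXES = ("lobid-index", "gnd-index")


def suggest_url(author, fmt="short", index="gnd-index"):
    """Build a request URL following the API usage line on the slide."""
    if fmt not in FORMATS:
        raise ValueError("format must be one of %s" % (FORMATS,))
    if index not in INDEXES:
        raise ValueError("index must be one of %s" % (INDEXES,))
    query = urlencode({"format": fmt, "index": index, "author": author})
    return BASE_URL + "?" + query


print(suggest_url("Schmidt, Karl"))
```

`urlencode` takes care of the comma and the space in the author name, which yields the same `Schmidt%2C+Karl` encoding as in the example URLs.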
77. Publishing LOD with elasticsearch
auto suggest
RESTful APIs:
http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl
[
  "Schmidt, Karl (1894-1945)",
  "Schmidt, Karl",
  "Schmidt, Karl (1910-)",
  "Schmidt, Karl (1846-1928)",
  "Schmidt, Karl (1913-)",
  "Schmidt, Karl (1899-)",
  "Schmidt, Karl (1924-)",
  "Schmidt, Karl (1836-1888)",
  "Schmidt, L. F. Karl",
  "Schmidt, Karl (1902-1945)",
  "Schmidt, Karl J.",
  "Schmidt, Karl (1848-1905)",
  "Schmidt, Karl (1817-1882)",
  "Schmidt, Karl R.",
  "Schmidt, Karl (1954-)",
  "Schmidt, Karl (1888-)",
  "Schmidt, Karl (1867-)",
  ...
]
78. Publishing LOD with elasticsearch
auto suggest
GND authority file in lobid-resources
82. Publishing LOD with elasticsearch
Conclusion
a highly customizable/reliable/feature-rich LOD service
Webapp: we can do that
Search Engine: highly available! Basic LOD functionality (and some other APIs) is highly available.
Triple Store: for external access and some fancy nice-to-have stuff. Sometimes gets stuck!
83. Publishing LOD with elasticsearch
The software is open source:
http://elasticsearch.org/
https://hadoop.apache.org/
http://www.playframework.org/
http://4store.org/
https://github.com/lobid/
84. Any Questions ?
Pascal Christoph
christoph@hbz-nrw.de
semweb@hbz-nrw.de
85. Using a dark background, this presentation saves maybe 70% of energy