SlideShare a Scribd company logo
1 of 85
Download to read offline
Building a High
Performance Environment
   for RDF Publishing

        Pascal Christoph
These slides and all the graphics made by the author and those
taken from https://openclipart.org/ are dedicated to the public
domain : https://creativecommons.org/about/cc0 .

All marks mentioned may be trademarks or registered trademarks
of their respective owners.

Read about the license of „The scream“ of Edward Munch at
https://en.wikipedia.org/wiki/File:The_Scream.jpg

             Linking Open Data cloud diagram, by Richard
             Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have

Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  •
    Benefits
  •
    Some more details
  •
    Caveats

Future prospects


                   Building a High Performance Environment for RDF Publishing   3
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have

Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  •
    Benefits
  •
    Some more details
  •
    Caveats

Future prospects


                   Building a High Performance Environment for RDF Publishing   4
ing is for Consuming
                        Publish



Publishing is for Consuming




  Building a High Performance Environment for RDF Publishing   5
ing is for Consuming
                                    Publish
                             Mandatory

A resource:




              Building a High Performance Environment for RDF Publishing   6
ing is for Consuming
                                    Publish
                             Mandatory

A resource:



gets a dereferenceable URI:




              Building a High Performance Environment for RDF Publishing   7
ing is for Consuming
                                                        Publish
                                               Mandatory

A resource:



gets a dereferenceable URI:


which provides RDF:
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556><http://purl.org/dc/elements/1.1/creator><http://d-nb.info/gnd/135539897> .




                         Building a High Performance Environment for RDF Publishing                          8
ing is for Consuming
                           Publish
                    Mandatory


=> basic LOD publishing is very simple:


you just need a Webserver




     Building a High Performance Environment for RDF Publishing   9
ing is for Consuming
                                              Publish
                                      Nice to have
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation (best: RDFa in HTML)
•
    Data searchable
•
    Timely updates
•
    High Availability
•
    Versioning
•
    Web developers want simple APIs providing JSON
•
    ...




                        Building a High Performance Environment for RDF Publishing   10
ing is for Consuming
                                              Publish
                                  SPARQL Endpoint
•
    (Dumps)
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation (best: RDFa in HTML)
•
    Data searchable
•
    Timely updates
•
    High Availability
•
    Versioning
•
    Web developers want simple APIs providing JSON
•
    ...




                        Building a High Performance Environment for RDF Publishing   11
ing is for Consuming
                                                  Publish
                                      SPARQL Endpoint
•
    (Dumps): but may be painfully slow when having lots of data
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation (best: RDFa in HTML)
•
    (Data searchable) : maybe painfully slow
•
    Timely updates
•
    High Availability
•
    Versioning
•
    Web developers want simple APIs providing JSON
          •
              most triple stores provides JSON/RDF
          •
              Simple powerful API : too powerful/complex ?
•
    ...
                            Building a High Performance Environment for RDF Publishing   12
ing is for Consuming
                                   Publish
                           Nice to have


In principle, web developers already got simple APIs :

                 LOD is the API !




             Building a High Performance Environment for RDF Publishing   13
ing is for Consuming
                                   Publish
                           Nice to have


In principle, web developers already got simple APIs :

                      Remember:




             Building a High Performance Environment for RDF Publishing   14
ing is for Consuming
                                                        Publish
                                               Mandatory

A resource:



gets a dereferenceable URI:


which provides the data (in RDF):

<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556><http://purl.org/dc/elements/1.1/creator><http://d-nb.info/gnd/135539897> .




                         Building a High Performance Environment for RDF Publishing                          15
ing is for Consuming
                                   Publish
                           Nice to have


In principle, web developers already got powerful APIs :

                 RESTful         SPARQL




             Building a High Performance Environment for RDF Publishing   16
ing is for Consuming
                                    Publish
                  RESTful SPARQL example
  getting all data of all resources having a particular ISBN:


curl -H "Accept: application/json" --data-urlencode 'query=
prefix bibo: <http://purl.org/ontology/bibo/>
SELECT * WHERE {
 ?s bibo:isbn13 "9780851706238" ;
   ?p ?o .
 } LIMIT 100
' http://lobid.org/sparql/



              Building a High Performance Environment for RDF Publishing   17
ing is for Consuming
                      Publish
                Nice to have




Building a High Performance Environment for RDF Publishing 18
ing is for Consuming
                                                            Publish
                                  RESTful SPARQL example
… and the JSON/RDF result:
{
    "head": {
       "vars": [ "s", "p","o"]
    },
    "results": {
       "bindings": [ {
             "o": {
                "type": "uri",
                "value": "http://openlibrary.org/works/OL2109573W"
             },
             "p": {
                "type": "uri",
                "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested"
             },
             "s": {
                "type": "uri",
                "value": "http://lobid.org/resource/HT007824357"
             }
          },
          {
             "o": { ...




                            Building a High Performance Environment for RDF Publishing   19
ing is for Consuming
                      Publish
                Nice to have
               As it is, web developers don't like SPARQL




                               web developer




Building a High Performance Environment for RDF Publishing 20
ing is for Consuming
                                   Publish
                           Nice to have


Web developers want APIs like:


http://lobid.org/resources/api/isbn/$isbn




             Building a High Performance Environment for RDF Publishing   21
Happy web developer
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have


Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  •
    Benefits
  •
    Some more details
  •
    Caveats

Future prospects

                   Building a High Performance Environment for RDF Publishing   23
What is l obid.org ?
                 lobid.org




Building a High Performance Environment for RDF Publishing   24
What is l obid.org ?
●
    lobid := linking open bibliographic data
●
    LOD services of the hbz
     ●
       lobid-resources :
        ●
          exposes 85% of the hbz cooperative catalogue
        ●
          entries coming from > 200 scientific German libraries
        ●
          ~ 16 M records with 700 M triples
           ●
             with links to ~ 5 M other resources
           ●
             with links to ~ 32 M items (consisting of 300 M triples)
    ●
        lobid-organisations :
        ●
          exposes German Sigelverzeichnis and MARC-Isil directory
        ●
          ~ 40 k descriptions of institutions



                   Building a High Performance Environment for RDF Publishing   25
What is lobid.org ?
                                   What's missing?
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Timely updates
•
    High Availability
•
    Versioning
•
    Web developers want simple APIs providing JSON
•
    ...




                        Building a High Performance Environment for RDF Publishing   26
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have


Story so far - experiences with lobid.org
  •
      What is lobid.org ?
  •
      Storing the data
  •
      Getting the data

Publishing RDF through elasticsearch
  •
    Benefits
  •
    Some more details
  •
    Caveats

Future prospects

                    Building a High Performance Environment for RDF Publishing   27
storing the data
                2010 - 2011, lobid-organisation

Filesystem :
     + easy to maintain
     + reliable
     + fast
     - no search
     - no SPARQL
     - ...




               Building a High Performance Environment for RDF Publishing   28
storing the data
                              lobid today

Triple Store (4store) :
     + power of SPARQL
     +/- depending on the query: fast to horribly slow
     +/- search (but string searches often slow and limited)
     - sometimes gets stuck !




               Building a High Performance Environment for RDF Publishing   29
storing the data
                               lobid today

Search engine (elasticsearch):
     + fast search
     + stemming, linguistics …
     + wildcard searching
     + facets
     + geo search
     + JSON
     + schema-less
     + simple RESTful API
     + many plugins
     + ...
     + easy to achieve High Availability
     + scales nicely



                Building a High Performance Environment for RDF Publishing   30
lobid today storing/getting the data
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have


Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
      Getting the data
Publishing RDF through elasticsearch
  •
    Benefits
  •
    Some more details
  •
    Caveats

Future prospects

                   Building a High Performance Environment for RDF Publishing   32
getting the data
lobid : technology/dependency stack

            Webapp

                  Search Engine

                         Triple Store




  Building a High Performance Environment for RDF Publishing   33
getting the data
lobid : technology/dependency stack

            Webapp
              we can do that

                  Search Engine
                     highly available !

                         Triple Store
                              sometimes gets stuck!




  Building a High Performance Environment for RDF Publishing   34
getting the data
lobid : technology/dependency stack

            Webapp
              we can do that

                  Search Engine
                                              <
                     highly available !

                         Triple Store
                                                   =
                              sometimes gets stuck!




  Building a High Performance Environment for RDF Publishing   35
getting the data
lobid : technology/dependency stack

            Webapp
              we can do that

                  Search Engine
                     highly available !

                         Triple Store
                              sometimes gets stuck!




  Building a High Performance Environment for RDF Publishing   36
oring/getting the data
                                                  st
            Variant 1 : technology/dependency stack




         Triple Store                                       Triple Store
Closed, internal. Will be safe from malign queries.         For external access.
                                                            Sometimes gets stuck!

                    Building a High Performance Environment for RDF Publishing   37
oring/getting the data
                                                  st
            Variant 1 : technology/dependency stack


                                   redundant, complex …




         Triple Store                                       Triple Store
Closed, internal. Will be safe from malign queries.         For external access.
                                                            Sometimes gets stuck!

                    Building a High Performance Environment for RDF Publishing   38
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have


Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  Benefits
  •


  •
    Some more details
  •
    Caveats

Future prospects

                   Building a High Performance Environment for RDF Publishing   39
D with elasticsearch
                                 Publishing LO
             Variant 2: technology/dependency stack



                           Webapp
                             we can do that

                                 Search Engine
                                    highly available !


LOD basis functionality (and some other
                                                            Triple Store
                                                     For external access and some
      APIs) are highly available
                                                     fancy nice-to-have stuff.
                                                     Sometimes gets stuck!
                    Building a High Performance Environment for RDF Publishing   40
D with elasticsearch
                                    Publishing LO
                                         Benefits
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
    (Versioning)
•
    Web developers want simple APIs returning JSON
•
    ...



                        Building a High Performance Environment for RDF Publishing   41
D with elasticsearch
              Publishing LO
                   Benefits




fast, scalable search engine




  Building a High Performance Environment for RDF Publishing   42
D with elasticsearch
                                            Publishing LO
                                            performance test

Data: 10 M records <=> 300 M triple
Case-insensitive query: „beach“
SELECT ?s
WHERE
 { ?s <http://purl.org/dc/terms/title> ?o
   FILTER regex(str(?o), "beach", "i")
 }
#### => SPARQL execution time for Q8316: 108.7s, returned 2815 rows.

http://$ip:9200/_search?q=beach&from=0&size=2800
# => Elasticsearch needed 0.4s

=> Elasticsearch is 250 times faster


                          Building a High Performance Environment for RDF Publishing   43
D with elasticsearch
                             Publishing LO
                            performance test



(there is a support for text indexing in 4store, have not tested that.)




                 Building a High Performance Environment for RDF Publishing   44
D with elasticsearch
                                 Publishing LO
                                performance test


Elasticsearch: 18 M records , 6 GB RAM: 5 hour
4store: 1 B triples, having 72 GB RAM: 7 hours


CPU: Quad Core mit 2.4 GhZ und Hyperthreading => 8 CPUs
HD: 6 x 2.5" 10k U/min a 146GB




(Don't take benchmarks too seriously – they just give a clue !)



                    Building a High Performance Environment for RDF Publishing   45
D with elasticsearch
                                    Publishing LO
                                         Benefits
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
  (Versioning)
      •
        Web developers want simple APIs providing JSON
•
  ...




                        Building a High Performance Environment for RDF Publishing   46
D with elasticsearch
                    Publishing LO
                         Benefits




build to be easily made highly available !




        Building a High Performance Environment for RDF Publishing   47
D with elasticsearch
                                    Publishing LO
                                         Benefits
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
  (Versioning)
      •
        Web developers want simple APIs providing JSON
•
  ...




                        Building a High Performance Environment for RDF Publishing   48
D with elasticsearch
                       Publishing LO
                            Benefits


Versioning with elasticsearch:
Not out-of-the-box, but comes at least e.g. with
* concurrency control
* documents have a version number
=> implementing versioning is not hard

           Building a High Performance Environment for RDF Publishing   49
D with elasticsearch
                                    Publishing LO
      Benefits, relying on elasticsearch as basic LOD storage
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
    Versionizing
•
  Web developers want:
      •
        JSON (LD)
      •
        Simple APIs
•
  ...

                        Building a High Performance Environment for RDF Publishing   50
D with elasticsearch
                                    Publishing LO
                                         Benefits
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
  (Versioning)
      •
        Web developers want simple APIs providing JSON
•
  ...




                        Building a High Performance Environment for RDF Publishing   51
D with elasticsearch
                             Publishing LO
                             Why JSON-LD?
    JSON is :
•
    stored natively by many tools (e.g. elasticsearch)
•
    loved by consumers (web developers)


    JSON-LD is :
•
    supported by RDF libraries (e.g. transforming to NTriples)




                 Building a High Performance Environment for RDF Publishing   52
D with elasticsearch
                                    Publishing LO
                                         Benefits
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
  (Versioning)
      •
        Web developers want simple APIs providing JSON
•
  ...




                        Building a High Performance Environment for RDF Publishing   53
D with elasticsearch
                           Publishing LO
                                Benefits


RESTful elasticsearch API, e. g. :


http://lobid.org/resources/_search?q=isbn:$isbn




               Building a High Performance Environment for RDF Publishing   54
D with elasticsearch
                                  Publishing LO
                                       Benefits

•
    … and many other nice things come with elasticsearch
     •
         geo-search : „Query only libraries/items residing up to 10 km from me.“
     •
         …




                      Building a High Performance Environment for RDF Publishing   55
D with elasticsearch
                                    Publishing LO
                                         Benefits
•
    Dumps
    Content Negotiation (different RDF serializations)

                                       d!
•



•
    SPARQL
                                 plishe
•


                            accom
    Human readable representation ( RDFa in HTML)
•



•
    Data searchable
                   Mis sion
    Near Real Time updates
•
    High Availability
•
    (Versioning)
•
    Web developers want simple APIs providing JSON
•
    ...



                        Building a High Performance Environment for RDF Publishing   56
D with elasticsearch
                                    Publishing LO
                   ( … ok, something is left to be done ! )
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•
    High Availability
•
    Versionizing
•
    Web developers want simple APIs providing JSON
•
    ...



                        Building a High Performance Environment for RDF Publishing   57
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have

Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  •
      Benefits
  •
      Caveats
  •
      Auto suggest demo

Conclusion

                   Building a High Performance Environment for RDF Publishing   58
D with elasticsearch
                                    Publishing LO
                                         Caveats
•
    Dumps
•
    Content Negotiation (different RDF serializations)
•
    SPARQL
•
    Human readable representation ( RDFa in HTML)
•
    Data searchable
•
    Near Real Time updates
•


•
    High Availability
    Versionizing
                                                                    !?
•
    Web developers want simple APIs providing JSON
•
    ...



                        Building a High Performance Environment for RDF Publishing   59
D with elasticsearch
                           Publishing LO
                                Caveats
How to integrate semantic search into a document storage ?

dct:contributor --------> dct:creator -------> dc:creator
              ---------> dc:contributor
                --------> bibo:translator
              …


There is no inferencing as comes with SPARQL !



               Building a High Performance Environment for RDF Publishing   60
D with elasticsearch
                              Publishing LO
                                   Caveats

Our data flow :

from records to RDF triples to records




                  Building a High Performance Environment for RDF Publishing   61
D with elasticsearch
                              Publishing LO
                                   Caveats

Our data flow :

from records to RDF triples to records




                  Building a High Performance Environment for RDF Publishing   62
D with elasticsearch
                      Publishing LO
                           Caveats




from records to RDF triples to records


                   !?

          Building a High Performance Environment for RDF Publishing   63
D with elasticsearch
                                Publishing LO
                                     Caveats

From records to RDF triples
                  |-----> graph-database
                  '------> computing ---> record-database




 MARC/MAB/PICA...
                                                                JSON-LD




                    Building a High Performance Environment for RDF Publishing   64
D with elasticsearch
               Publishing LO
                    Caveats
    tree-based vs graph-based:

Pre-render the whole document?

 What is the document ?



   Building a High Performance Environment for RDF Publishing   65
D with elasticsearch
            Publishing LO
                 Caveats




Building a High Performance Environment for RDF Publishing   66
D with elasticsearch
                         Publishing LO
                              Caveats

What is the document ? Only the top-level node ?




             Building a High Performance Environment for RDF Publishing   67
D with elasticsearch
                          Publishing LO
                               Caveats

What is the document ? Only the top-level node ?




… but then you couldn't even search the authors name !

              Building a High Performance Environment for RDF Publishing   68
D with elasticsearch
                              Publishing LO
                                   Caveats

searching needs integration of some fields from subgraphs into the document




                  Building a High Performance Environment for RDF Publishing   69
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have

Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  •
    Benefits
  •
    Caveats
  •
      Auto suggest demo
Conclusion

                   Building a High Performance Environment for RDF Publishing   70
D with elasticsearch
                  Publishing LO
                    auto suggest



authority IDs must be easily found




                                                               71
      Building a High Performance Environment for RDF Publishing
D with elasticsearch
                  Publishing LO
                    auto suggest



authority IDs must be easily found
   => in need of auto suggest




                                                               72
      Building a High Performance Environment for RDF Publishing
D with elasticsearch
                   Publishing LO
                     auto suggest



auto suggests needs fast searching




       Building a High Performance Environment for RDF Publishing   73
D with elasticsearch
            Publishing LO
                   Demo



        auto suggest




Building a High Performance Environment for RDF Publishing   74
D with elasticsearch
Publishing LO




                               75
D with elasticsearch
                                Publishing LO
                                  auto suggest
RESTful APIs:

http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl

http://demo.lobid.org/search?format=page&index=gnd-index&author=Schmidt%2C+Karl

http://demo.lobid.org/search?format=full&index=gnd-index&author=Schmidt%2C+Karl
…

API usage:
GET /search?format=<page|full|short>&index=<lobid-index|gnd-index>&author=<query>

easy to enhance with the play framework and the elasticsearch API




                    Building a High Performance Environment for RDF Publishing   76
D with elasticsearch
                                         Publishing LO
                                           auto suggest
RESTful APIs:

http://demo.lobid.org/search
?format=short&index=gnd-index&author=Schmidt%2C+Karl
  [

            "Schmidt, Karl (1894-1945)",
            "Schmidt, Karl",
            "Schmidt, Karl (1910-)",
            "Schmidt, Karl (1846-1928)",
            "Schmidt, Karl (1913-)",
            "Schmidt, Karl (1899-)",
         "Schmidt, Karl (1924-)",
      "Schmidt, Karl (1836-1888)",
      "Schmidt, L. F. Karl",
      "Schmidt, Karl (1902-1945)",
      "Schmidt, Karl J.",
      "Schmidt, Karl (1848-1905)",
      "Schmidt, Karl (1817-1882)",
      "Schmidt, Karl R.",
      "Schmidt, Karl (1954-)",
      "Schmidt, Karl (1888-)",
      "Schmidt, Karl (1867-)",
      ...
  ]



                             Building a High Performance Environment for RDF Publishing   77
D with elasticsearch
            Publishing LO
              auto suggest

 GND authority file
                    in
     lobid-resources



Building a High Performance Environment for RDF Publishing   78
D with elasticsearch
Publishing LO




                               79
D with elasticsearch
            Publishing LO




Building a High Performance Environment for RDF Publishing   80
Overview

Publishing is for Consuming
  •
    Mandatory
  •
    Nice to have

Story so far - experiences with lobid.org
  •
    What is lobid.org ?
  •
    Storing the data
  •
    Getting the data

Publishing RDF through elasticsearch
  •
    Benefits
  •
    Caveats
  •
    Auto suggest demo

Conclusion

                   Building a High Performance Environment for RDF Publishing   81
D with elasticsearch
                                 Publishing LO
                                   Conclusion
    a highly customizable/reliable/feature-rich LOD service

                           Webapp
                             we can do that

                                 Search Engine
                                    highly available !


LOD basis functionality (and some other
                                                            Triple Store
                                                     For external access and some
      APIs) are highly available
                                                     fancy nice-to-have stuff.
                                                     Sometimes gets stuck!
                    Building a High Performance Environment for RDF Publishing   82
D with elasticsearch
             Publishing LO
the software is Open Source:



              http://elasticsearch.org/


         https://hadoop.apache.org/


    http://www.playframework.org/

  http://4store.org/

 https://github.com/lobid/




Building a High Performance Environment for RDF Publishing   83
Any Questions ?


     Pascal Christoph
     christoph@hbz-nrw.de


     semweb@hbz-nrw.de
Using a dark background, this presentation saves maybe 70% of energy

More Related Content

What's hot

Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
rhatr
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato
 
Yahoo Pipes Middleware In The Cloud
Yahoo Pipes Middleware In The CloudYahoo Pipes Middleware In The Cloud
Yahoo Pipes Middleware In The Cloud
ConSanFrancisco123
 

What's hot (13)

Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
 
Apache HBase: State of the Union
Apache HBase: State of the UnionApache HBase: State of the Union
Apache HBase: State of the Union
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Rails 6 Multi-DB 実戦投入
Rails 6 Multi-DB 実戦投入Rails 6 Multi-DB 実戦投入
Rails 6 Multi-DB 実戦投入
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
SharePoint Saturday San Antonio: SharePoint 2010 Performance
SharePoint Saturday San Antonio: SharePoint 2010 PerformanceSharePoint Saturday San Antonio: SharePoint 2010 Performance
SharePoint Saturday San Antonio: SharePoint 2010 Performance
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Yahoo Pipes Middleware In The Cloud
Yahoo Pipes Middleware In The CloudYahoo Pipes Middleware In The Cloud
Yahoo Pipes Middleware In The Cloud
 
Drupal + ApacheSolr
Drupal + ApacheSolrDrupal + ApacheSolr
Drupal + ApacheSolr
 
Yahoo! - Arun Murthy - Hadoop World 2010
Yahoo! - Arun Murthy - Hadoop World 2010Yahoo! - Arun Murthy - Hadoop World 2010
Yahoo! - Arun Murthy - Hadoop World 2010
 
FOSDEM '18 - Tools for large scale collection and analysis of source code re...
FOSDEM '18  - Tools for large scale collection and analysis of source code re...FOSDEM '18  - Tools for large scale collection and analysis of source code re...
FOSDEM '18 - Tools for large scale collection and analysis of source code re...
 

Similar to Building a High Performance Environment for RDF Publishing

Similar to Building a High Performance Environment for RDF Publishing (20)

ISWC GoodRelations Tutorial Part 2
ISWC GoodRelations Tutorial Part 2ISWC GoodRelations Tutorial Part 2
ISWC GoodRelations Tutorial Part 2
 
GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2
 
Practical Cross-Dataset Queries with SPARQL (Introduction)
Practical Cross-Dataset Queries with SPARQL (Introduction)Practical Cross-Dataset Queries with SPARQL (Introduction)
Practical Cross-Dataset Queries with SPARQL (Introduction)
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
 
Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013
 
LOD技術解説
LOD技術解説LOD技術解説
LOD技術解説
 
D2RQ
D2RQD2RQ
D2RQ
 
Why rdfa
Why rdfaWhy rdfa
Why rdfa
 
RDFa Introductory Course Session 3/4 Why RDFa
RDFa Introductory Course Session 3/4 Why RDFaRDFa Introductory Course Session 3/4 Why RDFa
RDFa Introductory Course Session 3/4 Why RDFa
 
D2 rq
D2 rqD2 rq
D2 rq
 
Towards a Commons RDF Library - ApacheCon Europe 2014
Towards a Commons RDF Library - ApacheCon Europe 2014Towards a Commons RDF Library - ApacheCon Europe 2014
Towards a Commons RDF Library - ApacheCon Europe 2014
 
Eclipse RDF4J - Working with RDF in Java
Eclipse RDF4J - Working with RDF in JavaEclipse RDF4J - Working with RDF in Java
Eclipse RDF4J - Working with RDF in Java
 
RejectKaigi2010 - RDF.rb
RejectKaigi2010 - RDF.rbRejectKaigi2010 - RDF.rb
RejectKaigi2010 - RDF.rb
 
Ruby semweb 2011-12-06
Ruby semweb 2011-12-06Ruby semweb 2011-12-06
Ruby semweb 2011-12-06
 
ArangoDB
ArangoDBArangoDB
ArangoDB
 
Apache dubbo (incubating) open source present and future
Apache dubbo (incubating) open source present and futureApache dubbo (incubating) open source present and future
Apache dubbo (incubating) open source present and future
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 
RDFauthor (EKAW)
RDFauthor (EKAW)RDFauthor (EKAW)
RDFauthor (EKAW)
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
Converting GHO to RDF
Converting GHO to RDFConverting GHO to RDF
Converting GHO to RDF
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Building a High Performance Environment for RDF Publishing

  • 1. Building a High Performance Environment for RDF Publishing Pascal Christoph
  • 2. These slides and all the graphics made by the author and those taken from https://openclipart.org/ are dedicated to the public domain : https://creativecommons.org/about/cc0 . All marks mentioned may be trademarks or registered trademarks of their respective owners. Read about the license of „The scream“ of Edward Munch at https://en.wikipedia.org/wiki/File:The_Scream.jpg Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 3. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Some more details • Caveats Future prospects Building a High Performance Environment for RDF Publishing 3
  • 4. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Some more details • Caveats Future prospects Building a High Performance Environment for RDF Publishing 4
  • 5. ing is for Consuming Publish Publishing is for Consuming Building a High Performance Environment for RDF Publishing 5
  • 6. ing is for Consuming Publish Mandatory A resource: Building a High Performance Environment for RDF Publishing 6
  • 7. ing is for Consuming Publish Mandatory A resource: gets a dereferenceable URI: Building a High Performance Environment for RDF Publishing 7
  • 8. ing is for Consuming Publish Mandatory A resource: gets a dereferenceable URI: which provides RDF: <http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" . <http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" . <http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" . <http://lobid.org/resource/HT002948556><http://purl.org/dc/elements/1.1/creator><http://d-nb.info/gnd/135539897> . Building a High Performance Environment for RDF Publishing 8
  • 9. ing is for Consuming Publish Mandatory => basic LOD publishing is very simple: you just need a Webserver Building a High Performance Environment for RDF Publishing 9
  • 10. ing is for Consuming Publish Nice to have • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation (best: RDFa in HTML) • Data searchable • Timely updates • High Availability • Versioning • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 10
  • 11. ing is for Consuming Publish SPARQL Endpoint • (Dumps) • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation (best: RDFa in HTML) • Data searchable • Timely updates • High Availability • Versioning • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 11
  • 12. ing is for Consuming Publish SPARQL Endpoint • (Dumps): but may be painfully slow when having lots of data • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation (best: RDFa in HTML) • (Data searchable) : maybe painfully slow • Timely updates • High Availability • Versioning • Web developers want simple APIs providing JSON • most triple stores provides JSON/RDF • Simple powerful API : too powerful/complex ? • ... Building a High Performance Environment for RDF Publishing 12
  • 13. ing is for Consuming Publish Nice to have In principle, web developers already got simple APIs : LOD is the API ! Building a High Performance Environment for RDF Publishing 13
  • 14. ing is for Consuming Publish Nice to have In principle, web developers already got simple APIs : Remember: Building a High Performance Environment for RDF Publishing 14
  • 15. ing is for Consuming Publish Mandatory A resource: gets a dereferenceable URI: which provides the data (in RDF): <http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" . <http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" . <http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" . <http://lobid.org/resource/HT002948556><http://purl.org/dc/elements/1.1/creator><http://d-nb.info/gnd/135539897> . Building a High Performance Environment for RDF Publishing 15
  • 16. ing is for Consuming Publish Nice to have In principle, web developers already got powerful APIs : RESTful SPARQL Building a High Performance Environment for RDF Publishing 16
  • 17. ing is for Consuming Publish RESTful SPARQL example getting all data of all resources having a particular ISBN: curl -H "Accept: application/json" --data-urlencode 'query= prefix bibo: <http://purl.org/ontology/bibo/> SELECT * WHERE { ?s bibo:isbn13 "9780851706238" ; ?p ?o . } LIMIT 100 ' http://lobid.org/sparql/ Building a High Performance Environment for RDF Publishing 17
  • 18. ing is for Consuming Publish Nice to have Building a High Performance Environment for RDF Publishing 18
  • 19. ing is for Consuming Publish RESTful SPARQL example … and the JSON/RDF result: { "head": { "vars": [ "s", "p","o"] }, "results": { "bindings": [ { "o": { "type": "uri", "value": "http://openlibrary.org/works/OL2109573W" }, "p": { "type": "uri", "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested" }, "s": { "type": "uri", "value": "http://lobid.org/resource/HT007824357" } }, { "o": { ... Building a High Performance Environment for RDF Publishing 19
  • 20. ing is for Consuming Publish Nice to have As it is, web developers don't like SPARQL web developer Building a High Performance Environment for RDF Publishing 20
  • 21. ing is for Consuming Publish Nice to have Web developers want APIs like: http://lobid.org/resources/api/isbn/$isbn Building a High Performance Environment for RDF Publishing 21
  • 23. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Some more details • Caveats Future prospects Building a High Performance Environment for RDF Publishing 23
  • 24. What is l obid.org ? lobid.org Building a High Performance Environment for RDF Publishing 24
  • 25. What is l obid.org ? ● lobid := linking open bibliographic data ● LOD services of the hbz ● lobid-resources : ● exposes 85% of the hbz cooperative catalogue ● entries coming from > 200 scientific German libraries ● ~ 16 M records with 700 M triples ● with links to ~ 5 M other resources ● with links to ~ 32 M items (consisting of 300 M triples) ● lobid-organisations : ● exposes German Sigelverzeichnis and MARC-Isil directory ● ~ 40 k descriptions of institutions Building a High Performance Environment for RDF Publishing 25
  • 26. What is lobid.org ? What's missing? • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Timely updates • High Availability • Versioning • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 26
  • 27. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Some more details • Caveats Future prospects Building a High Performance Environment for RDF Publishing 27
  • 28. storing the data 2010 - 2011, lobid-organisation Filesystem : + easy to maintain + reliable + fast - no search - no SPARQL - ... Building a High Performance Environment for RDF Publishing 28
  • 29. storing the data lobid today Triple Store (4store) : + power of SPARQL +/- depending on the query: fast to horribly slow +/- search (but string searches often slow and limited) - sometimes gets stuck ! Building a High Performance Environment for RDF Publishing 29
  • 30. storing the data lobid today Search engine (elasticsearch): + fast search + stemming, linguistics … + wildcard searching + facets + geo search + JSON + schema-less + simple RESTful API + many plugins + ... + easy to achieve High Availability + scales nicely Building a High Performance Environment for RDF Publishing 30
  • 32. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Some more details • Caveats Future prospects Building a High Performance Environment for RDF Publishing 32
  • 33. getting the data lobid : technology/dependency stack Webapp Search Engine Triple Store Building a High Performance Environment for RDF Publishing 33
  • 34. getting the data lobid : technology/dependency stack Webapp we can do that Search Engine highly available ! Triple Store sometimes gets stuck! Building a High Performance Environment for RDF Publishing 34
  • 35. getting the data lobid : technology/dependency stack Webapp we can do that Search Engine < highly available ! Triple Store = sometimes gets stuck! Building a High Performance Environment for RDF Publishing 35
  • 36. getting the data lobid : technology/dependency stack Webapp we can do that Search Engine highly available ! Triple Store sometimes gets stuck! Building a High Performance Environment for RDF Publishing 36
  • 37. oring/getting the data st Variant 1 : technology/dependency stack Triple Store Triple Store Closed, internal. Will be safe from malign queries. For external access. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 37
  • 38. oring/getting the data st Variant 1 : technology/dependency stack redundant, complex … Triple Store Triple Store Closed, internal. Will be safe from malign queries. For external access. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 38
  • 39. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch Benefits • • Some more details • Caveats Future prospects Building a High Performance Environment for RDF Publishing 39
  • 40. D with elasticsearch Publishing LO Variant 2: technology/dependency stack Webapp we can do that Search Engine highly available ! LOD basis functionality (and some other Triple Store For external access and some APIs) are highly available fancy nice-to-have stuff. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 40
  • 41. D with elasticsearch Publishing LO Benefits • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • (Versioning) • Web developers want simple APIs returning JSON • ... Building a High Performance Environment for RDF Publishing 41
  • 42. D with elasticsearch Publishing LO Benefits fast, scalable search engine Building a High Performance Environment for RDF Publishing 42
  • 43. D with elasticsearch Publishing LO performance test Data: 10 M records <=> 300 M triple Case-insensitive query: „beach“ SELECT ?s WHERE { ?s <http://purl.org/dc/terms/title> ?o FILTER regex(str(?o), "beach", "i") } #### => SPARQL execution time for Q8316: 108.7s, returned 2815 rows. http://$ip:9200/_search?q=beach&from=0&size=2800 # => Elasticsearch needed 0.4s => Elasticsearch is 250 times faster Building a High Performance Environment for RDF Publishing 43
  • 44. D with elasticsearch Publishing LO performance test (there is a support for text indexing in 4store, have not tested that.) Building a High Performance Environment for RDF Publishing 44
  • 45. D with elasticsearch Publishing LO performance test Elasticsearch: 18 M records , 6 GB RAM: 5 hour 4store: 1 B triples, having 72 GB RAM: 7 hours CPU: Quad Core mit 2.4 GhZ und Hyperthreading => 8 CPUs HD: 6 x 2.5" 10k U/min a 146GB (Don't take benchmarks too seriously – they just give a clue !) Building a High Performance Environment for RDF Publishing 45
  • 46. D with elasticsearch Publishing LO Benefits • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • (Versioning) • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 46
  • 47. D with elasticsearch Publishing LO Benefits build to be easily made highly available ! Building a High Performance Environment for RDF Publishing 47
  • 48. D with elasticsearch Publishing LO Benefits • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • (Versioning) • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 48
  • 49. D with elasticsearch Publishing LO Benefits Versioning with elasticsearch: Not out-of-the-box, but comes at least e.g. with * concurrency control * documents have a version number => implementing versioning is not hard Building a High Performance Environment for RDF Publishing 49
  • 50. D with elasticsearch Publishing LO Benefits, relying on elasticsearch as basic LOD storage • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • Versionizing • Web developers want: • JSON (LD) • Simple APIs • ... Building a High Performance Environment for RDF Publishing 50
  • 51. D with elasticsearch Publishing LO Benefits • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • (Versioning) • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 51
  • 52. D with elasticsearch Publishing LO Why JSON-LD? JSON is : • stored natively by many tools (e.g. elasticsearch) • loved by consumers (web developers) JSON-LD is : • supported by RDF libraries (e.g. transforming to NTriples) Building a High Performance Environment for RDF Publishing 52
  • 53. D with elasticsearch Publishing LO Benefits • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • (Versioning) • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 53
  • 54. D with elasticsearch Publishing LO Benefits RESTful elasticsearch API, e. g. : http://lobid.org/resources/_search?q=isbn:$isbn Building a High Performance Environment for RDF Publishing 54
  • 55. D with elasticsearch Publishing LO Benefits • … and many other nice things come with elasticsearch • geo-search : „Query only libraries/items residing up to 10 km from me.“ • … Building a High Performance Environment for RDF Publishing 55
  • 56. D with elasticsearch Publishing LO Benefits • Dumps Content Negotiation (different RDF serializations) d! • • SPARQL plishe • accom Human readable representation ( RDFa in HTML) • • Data searchable Mis sion Near Real Time updates • High Availability • (Versioning) • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 56
  • 57. D with elasticsearch Publishing LO ( … ok, something is left to be done ! ) • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • High Availability • Versionizing • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 57
  • 58. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Caveats • Auto suggest demo Conclusion Building a High Performance Environment for RDF Publishing 58
  • 59. D with elasticsearch Publishing LO Caveats • Dumps • Content Negotiation (different RDF serializations) • SPARQL • Human readable representation ( RDFa in HTML) • Data searchable • Near Real Time updates • • High Availability Versionizing !? • Web developers want simple APIs providing JSON • ... Building a High Performance Environment for RDF Publishing 59
  • 60. D with elasticsearch Publishing LO Caveats How to integrate semantic search into a document storage ? dct:contributor --------> dct:creator -------> dc:creator ---------> dc:contributor --------> bibo:translator … There is no inferencing as comes with SPARQL ! Building a High Performance Environment for RDF Publishing 60
  • 61. D with elasticsearch Publishing LO Caveats Our data flow : from records to RDF triples to records Building a High Performance Environment for RDF Publishing 61
  • 62. D with elasticsearch Publishing LO Caveats Our data flow : from records to RDF triples to records Building a High Performance Environment for RDF Publishing 62
  • 63. D with elasticsearch Publishing LO Caveats from records to RDF triples to records !? Building a High Performance Environment for RDF Publishing 63
  • 64. D with elasticsearch Publishing LO Caveats From records to RDF triples |-----> graph-database '------> computing ---> record-database MARC/MAB/PICA... JSON-LD Building a High Performance Environment for RDF Publishing 64
  • 65. D with elasticsearch Publishing LO Caveats tree-based vs graph-based: Pre-render the whole document? What is the document ? Building a High Performance Environment for RDF Publishing 65
  • 66. D with elasticsearch Publishing LO Caveats Building a High Performance Environment for RDF Publishing 66
  • 67. D with elasticsearch Publishing LO Caveats What is the document ? Only the top-level node ? Building a High Performance Environment for RDF Publishing 67
  • 68. D with elasticsearch Publishing LO Caveats What is the document ? Only the top-level node ? … but then you couldn't even search the authors name ! Building a High Performance Environment for RDF Publishing 68
  • 69. D with elasticsearch Publishing LO Caveats searching needs integration of some fields from subgraphs into the document Building a High Performance Environment for RDF Publishing 69
  • 70. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Caveats • Auto suggest demo Conclusion Building a High Performance Environment for RDF Publishing 70
  • 71. D with elasticsearch Publishing LO auto suggest authority IDs must be easily found 71 Building a High Performance Environment for RDF Publishing
  • 72. D with elasticsearch Publishing LO auto suggest authority IDs must be easily found => in need of auto suggest 72 Building a High Performance Environment for RDF Publishing
  • 73. D with elasticsearch Publishing LO auto suggest auto suggests needs fast searching Building a High Performance Environment for RDF Publishing 73
  • 74. D with elasticsearch Publishing LO Demo auto suggest Building a High Performance Environment for RDF Publishing 74
  • 76. D with elasticsearch Publishing LO auto suggest RESTful APIs: http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl http://demo.lobid.org/search?format=page&index=gnd-index&author=Schmidt%2C+Karl http://demo.lobid.org/search?format=full&index=gnd-index&author=Schmidt%2C+Karl … API usage: GET /search?format=<page|full|short>&index=<lobid-index|gnd-index>&author=<query> easy to enhance with the play framework and the elasticsearch API Building a High Performance Environment for RDF Publishing 76
  • 77. D with elasticsearch Publishing LO auto suggest RESTful APIs: http://demo.lobid.org/search ?format=short&index=gnd-index&author=Schmidt%2C+Karl [ "Schmidt, Karl (1894-1945)", "Schmidt, Karl", "Schmidt, Karl (1910-)", "Schmidt, Karl (1846-1928)", "Schmidt, Karl (1913-)", "Schmidt, Karl (1899-)", "Schmidt, Karl (1924-)", "Schmidt, Karl (1836-1888)", "Schmidt, L. F. Karl", "Schmidt, Karl (1902-1945)", "Schmidt, Karl J.", "Schmidt, Karl (1848-1905)", "Schmidt, Karl (1817-1882)", "Schmidt, Karl R.", "Schmidt, Karl (1954-)", "Schmidt, Karl (1888-)", "Schmidt, Karl (1867-)", ... ] Building a High Performance Environment for RDF Publishing 77
  • 78. D with elasticsearch Publishing LO auto suggest GND authority file in lobid-resources Building a High Performance Environment for RDF Publishing 78
  • 80. D with elasticsearch Publishing LO Building a High Performance Environment for RDF Publishing 80
  • 81. Overview Publishing is for Consuming • Mandatory • Nice to have Story so far - experiences with lobid.org • What is lobid.org ? • Storing the data • Getting the data Publishing RDF through elasticsearch • Benefits • Caveats • Auto suggest demo Conclusion Building a High Performance Environment for RDF Publishing 81
  • 82. D with elasticsearch Publishing LO Conclusion a highly customizable/reliable/feature-rich LOD service Webapp we can do that Search Engine highly available ! LOD basis functionality (and some other Triple Store For external access and some APIs) are highly available fancy nice-to-have stuff. Sometimes gets stuck! Building a High Performance Environment for RDF Publishing 82
  • 83. D with elasticsearch Publishing LO the software is Open Source: http://elasticsearch.org/ https://hadoop.apache.org/ http://www.playframework.org/ http://4store.org/ https://github.com/lobid/ Building a High Performance Environment for RDF Publishing 83
  • 84. Any Questions ? Pascal Christoph christoph@hbz-nrw.de semweb@hbz-nrw.de
  • 85. Using a dark background, this presentation saves maybe 70% of energy