
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty URIs, and Science Bots

Making scientific results available and archiving them is for the most part still considered the task of classical publishing companies, even though classical forms of publishing centered around printed narrative articles no longer seem well suited to the digital age. In particular, there are currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this talk, I outline how we can design scientific data publishing as a Web-based bottom-up process, without top-down control by central authorities. I present a protocol and a server network for storing and archiving data in a decentralized manner in the form of nanopublications, a format based on Semantic Web techniques to represent scientific data with formal semantics. Such nanopublications can be made verifiable and immutable by applying cryptographic methods with identifiers called Trusty URIs. I show how this approach allows researchers to produce, publish, retrieve, address, verify, and recombine datasets and their individual nanopublications in a reliable and trustworthy manner, and I discuss how the current small network can grow to handle the large amounts of structured data that modern science is producing and consuming.

(CC license does not apply to third-party content)


  1. A Decentralized Network for Publishing Linked Data: Nanopublications, Trusty URIs, and Science Bots
     Tobias Kuhn, http://www.tkuhn.ch, @txkuhn, ETH Zurich
     CERN Workshop on Innovations in Scholarly Communication (OAI9), Geneva, 17 June 2015
  2. Increasing Scientific Output: >1.5M New Articles Per Year
     Figure: citation network of 30M scientific publications. Image from: Kuhn et al. Inheritance patterns in citation networks reveal scientific memes. Physical Review X 4. 2014.
     Tobias Kuhn, ETH Zurich. A Decentralized Network for Publishing Linked Data. 2 / 30
  3. Increasing Importance of Scientific Data
     Figure: London Underground staff sorting 4M used tickets to analyse line use in 1939. Image from: http://www.telegraph.co.uk/travel/picturegalleries/9791007/The-history-of-the-Tube-in-pictures-150-years-of-London-Underground.html?frame=2447159
  4. Problem: Replication and Re-Use of Research Results
     Example situation: Sue publishes a script that should allow everybody to replicate her scientific analysis:
         # Download data:
         wget http://some-third-party.org/dataset/1.4
         # Analyze data:
         ...
     Problems:
       • What if the resource becomes unavailable at this location?
       • What if the third party silently changes that version of the dataset?
       • What if the web site gets hacked and the data manipulated?
  5. Data Publishing, Archiving, and Re-Use
     Scientific datasets are becoming increasingly important, and these data are increasingly produced and consumed directly by software. Published data should therefore be:
       • Verifiable (Is this really the data I am looking for?)
       • Immutable (Can I be sure that it hasn’t been modified?)
       • Permanent (Will it be available in 1, 5, 20 years from now?)
       • Reliable (Can it be efficiently retrieved whenever needed?)
       • Granular (Can I refer to individual data entries?)
       • Semantic (Can it be automatically interpreted?)
       • Linked (Does it use established identifiers and ontologies?)
       • Trustworthy (Can I trust the source?)
  6. Nanopublications: Provenance-Aware Semantic Publishing (based on RDF)
     Diagram: a nanopublication made up of three parts: assertion, provenance, and publication info.
     http://nanopub.org / @nanopub_org
  7. Vision: Changing Scholarly Communication
       • Now: narrative articles at the center
       • Future: nanopublications at the center
     Images from Mons et al. The value of data. Nature Genetics, 43(4):281–283, 2011.
  8. Nanopublication Example
         sub:assertion {
           sub:_3 a rdf:Statement ;
             rdf:subject schem:Adenosine%20triphosphate ;
             rdf:predicate belv:decreases ;
             rdf:object sub:_1 ;
             occursIn: obo:UBERON_0001134 , species:9606 .
           sub:_1 a go:0003824 ;
             hasAgent: sub:_2 .
           sub:_2 a Protein: ;
             geneProductOf: hgnc:12517 .
         }
         sub:provenance {
           sub:assertion prov:hadPrimarySource pubmed:9703368 ;
             prov:wasDerivedFrom beldoc: , sub:_4 .
           beldoc: dce:description "Approximately 61,000 statements." ;
             dce:rights "Copyright (c) 2011-2012, Selventa. All rights reserved." ;
             dce:title "BEL Framework Large Corpus Document" ;
             pav:authoredBy sub:_5 ;
             pav:version "20131211" .
           sub:_4 prov:value "UCP1 contains six potential transmembrane a-helices (72) and ..." ;
             prov:wasQuotedFrom pubmed:9703368 .
           sub:_5 rdfs:label "Selventa" .
         }
         sub:pubinfo {
           this: dct:created "2014-07-03T14:34:13.226+02:00"^^xsd:dateTime ;
             pav:createdBy orcid:0000-0001-6818-334X , orcid:0000-0002-1267-0234 .
         }
  9. Identifiability Problem of URIs (Web Links)
     http://some-third-party.org/dataset/1.4 ?
     Given a URI for a digital artifact, there is no reliable standard procedure for checking whether a retrieved file really represents the correct and original state of that artifact.
     Solution: Identifiers that include (iterative) cryptographic hash values (as applied, for example, by Git and Bitcoin)
  10. Cryptographic Hash Values
      A cryptographic hash value is a short random-looking sequence of bytes calculated on a given input:
          This is some text. ⇒ hRUv0M
      The same input always leads to exactly the same value:
          This is some text. ⇒ hRUv0M
      Even a minimally modified input leads to a completely different value:
          This is xome text. ⇒ sCtYbf
      The input is not reconstructible from the hash value (in practice):
          This is some text. ⇍ hRUv0M
      Given an input and a matching hash value, we can therefore be sure, for all practical purposes, that it was exactly that input that led to the hash.
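The properties on this slide can be tried out with a few lines of Python. This is only a sketch: the slide does not say which hash function or encoding produced its six-character values, so SHA-256 with base64url encoding, truncated to six characters, is an assumption here, and the printed values will not match `hRUv0M` or `sCtYbf`. The helper name `short_hash` is made up for this example.

```python
import base64
import hashlib


def short_hash(text: str, length: int = 6) -> str:
    """Return the first `length` characters of the base64url-encoded SHA-256 digest.

    Assumption: SHA-256 and base64url are illustrative choices, not taken from the slide.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


original = short_hash("This is some text.")
repeated = short_hash("This is some text.")   # same input, same value
modified = short_hash("This is xome text.")   # one character changed

print(original, repeated, modified)
```

Truncating to six characters keeps the output readable but weakens the guarantee considerably; a real identifier would keep the full digest.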
  11. Iterative Hashing
      Hash values can be used as identifiers in an iterative fashion:
          This is some text. ⇒ hRUv0M
          This text is based on hRUv0M. ⇒ LwGqwX
          This depends on LwGqwX. ⇒ civRbq
      From a single identifier (such as civRbq), the entire reference tree can be verified:
          This is some text. ⇒ hRUv0M
          This text is based on hRUv0M. ⇒ LwGqwX
          This depends on LwGqwX. ⇒ civRbq
      And any modification can be noticed:
          This is xome text. ⇒ hRUv0M (no longer matches; slide 10 gives sCtYbf for this input)
          This text is based on hRUv0M. ⇒ LwGqwX
          This depends on LwGqwX. ⇒ civRbq
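A minimal sketch of this idea, under the same assumptions as before (truncated SHA-256, base64url; none of this is specified on the slide). The names `short_hash`, `build_chain`, and `verify`, and the in-memory `store` dictionary standing in for "wherever the texts are published", are all hypothetical.

```python
import base64
import hashlib


def short_hash(text: str, length: int = 6) -> str:
    """Truncated base64url SHA-256 digest (illustrative choice, not from the slide)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


def build_chain(first_text: str) -> tuple[dict[str, str], str]:
    """Store three texts, each one referring to the previous one by its hash."""
    store: dict[str, str] = {}
    h1 = short_hash(first_text)
    store[h1] = first_text
    second = f"This text is based on {h1}."
    h2 = short_hash(second)
    store[h2] = second
    third = f"This depends on {h2}."
    h3 = short_hash(third)
    store[h3] = third
    return store, h3


def verify(store: dict[str, str], identifier: str) -> bool:
    """Check the identified text and, recursively, every text it references."""
    text = store.get(identifier)
    if text is None or short_hash(text) != identifier:
        return False
    # Any word that is a known identifier counts as a reference.
    references = [w.rstrip(".") for w in text.split() if w.rstrip(".") in store]
    return all(verify(store, ref) for ref in references)


store, root = build_chain("This is some text.")
valid_before = verify(store, root)

# Tamper with the first text while leaving its identifier unchanged.
first_id = short_hash("This is some text.")
store[first_id] = "This is xome text."
valid_after = verify(store, root)

print(valid_before, valid_after)
```

Starting from the last identifier alone, the check walks back through every referenced text, which is why a change at the bottom of the chain is detected at the top.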
  12. Trusty URIs: Cryptographic Hash Values for Verifiable and Immutable Web Identifiers
      Basic idea: Use of cryptographic hash values together with URIs as identifiers for digital artifacts such as nanopublications.
      Requirements:
        • To allow for the verification of entire reference trees, the hash should be part of the reference (i.e. the URI)
        • To allow for meta-data, digital artifacts should be allowed to contain self-references (i.e. their own URI)
        • Format-independent hash for different kinds of content (e.g. RDF)
        • The complete approach should be decentralized and open
        • We want to use them right away
      Example: http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70.trig
      Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
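The self-reference requirement looks circular at first: the URI contains the hash, and the content contains the URI. One way around this, sketched below, is to hash the content with a fixed placeholder where the hash will go. This is a simplified illustration only and is not the Trusty URI algorithm: it hashes raw text rather than normalized RDF, and the module prefix `ZZ`, the placeholder string, and the function names `mint` and `verify` are all invented for this example.

```python
import base64
import hashlib

MODULE = "ZZ"            # hypothetical two-letter prefix; not a real Trusty URI module
PLACEHOLDER = "#" * 45   # stands in for MODULE + hash while the content is hashed


def _hash(text: str) -> str:
    """Unpadded base64url SHA-256 digest (43 characters)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def mint(template: str, base_uri: str) -> tuple[str, str]:
    """Return (uri, content) for a template that refers to itself via PLACEHOLDER."""
    suffix = MODULE + _hash(template)
    return f"{base_uri}.{suffix}", template.replace(PLACEHOLDER, suffix)


def verify(uri: str, content: str) -> bool:
    """Recompute the hash with the placeholder restored and compare it to the URI."""
    suffix = uri.rsplit(".", 1)[-1]
    if not suffix.startswith(MODULE):
        return False
    restored = content.replace(suffix, PLACEHOLDER)
    return suffix == MODULE + _hash(restored)


base = "http://example.org/r1"
template = f'<{base}.{PLACEHOLDER}> <http://purl.org/dc/terms/title> "Example" .'
uri, content = mint(template, base)

print(uri)
print(verify(uri, content))
print(verify(uri, content.replace("Example", "Exbmple")))
```

Because the hash travels inside the URI, anyone holding the URI can run the check without consulting a registry or the original publisher.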
  13. Verifiable, Immutable, Permanent
      Verifiable: Whether or not a given resource is the one a given trusty URI is supposed to represent can be verified with perfect confidence. (assuming that the trusty URI for the required artifact is known, e.g. because another artifact contains it as a link)
      http://trustyuri.net
      Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
  14. Extended Range of Verifiability Through Iterative Hashing
      Diagram: a network of artifacts that link to one another, some through trusty URIs (http://...RAcbjcRI..., http://...RAQozo2w..., http://...RABMq4Wc..., http://...RAUx3Pqu..., http://...RARz0AX-...) and some through ordinary URIs (http://.../resource23, http://.../resource55), with the "range of verifiability" marked.
      http://trustyuri.net
      Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
  15. Verifiable, Immutable, Permanent
      Immutable: Trusty URI artifacts are immutable, as any change in the content also changes its URI, thereby making it a new artifact. (as soon as your trusty URI has been picked up by third parties, e.g. cached or linked from other resources, every change will be noticed)
      http://trustyuri.net
      Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
  16. Verifiable, Immutable, Permanent
      Permanent: Trusty URI artifacts are permanent, as they can be retrieved from the cache of third-party websites if otherwise no longer available. (if there are services regularly crawling and caching the artifacts on the web)
      http://trustyuri.net
      Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
  17. A Multi-Layer Architecture for Reliable Scientific Data Publishing?
        • User Interfaces
        • Services: Finding, querying, filtering, and aggregating data
  18. A Multi-Layer Architecture for Reliable Scientific Data Publishing?
        • User Interfaces
        • Services: Finding, querying, filtering, and aggregating data
        • Decentralized Data Publishing Network: hypotheses, facts, measurements, opinions, meta-data, assessments, annotations, ...
  19. A Multi-Layer Architecture for Reliable Scientific Data Publishing?
        • User Interfaces
        • Services: Finding, querying, filtering, and aggregating data
        • Science Bots: Automated Tasks
        • Decentralized Data Publishing Network: hypotheses, facts, measurements, opinions, meta-data, assessments, annotations, ...
  20. Decentralized and Reliable Publishing with a Nanopublication Server Network
      Diagram: nanopublications with trusty URIs pass through the server network in three steps: publication, propagation / archiving, and retrieval.
      http://npmonitor.inn.ac
      Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
  21. Decentralized, Open, Real-time
        • No central authority: Everybody can set up a server and join the network
        • No restrictions on publication: Everybody can upload nanopublications
        • No delay between submission and publication: Nanopublications are made public immediately
        • No updates: If a nanopublication is modified, that makes it a new nanopublication (enforced by trusty URIs)
        • No queries: Only simple identifier-based lookup
      Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
  22. Fast Parallel Access
      Plot: response time in seconds (logarithmic axis, 0.1 to 100) against time from start of test in seconds (0 to 300) and number of clients accessing the service in parallel (0 to 100), comparing a Virtuoso triple store with SPARQL endpoint and a nanopublication server.
      Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
  23. Defining Datasets with Nanopublication Indexes (which are themselves Nanopublications)
      Diagram: panels (a) to (f) showing indexes connected by three relations: "appends", "has sub-index", and "has element".
      Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
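The three relations named on this slide suggest a small recursive data structure. The sketch below is an in-memory illustration, not the nanopublication index format itself: the `Index` class, the `resolve` function, and the identifiers `idx-a`, `np1`, and so on are hypothetical, and it assumes that an index which "appends" another one contains that index's elements in addition to its own.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Index:
    elements: list[str] = field(default_factory=list)      # "has element"
    sub_indexes: list[str] = field(default_factory=list)   # "has sub-index"
    appends: Optional[str] = None                           # "appends"


def resolve(indexes: dict[str, Index], index_id: str) -> set[str]:
    """Collect every element reachable from the given index."""
    seen: set[str] = set()
    result: set[str] = set()
    stack = [index_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # an index reachable by two paths is only visited once
        seen.add(current)
        index = indexes[current]
        result.update(index.elements)
        stack.extend(index.sub_indexes)
        if index.appends is not None:
            stack.append(index.appends)
    return result


indexes = {
    "idx-a": Index(elements=["np1", "np2"]),
    "idx-b": Index(elements=["np3"], appends="idx-a"),
    "idx-c": Index(elements=["np4"], sub_indexes=["idx-b"]),
}
dataset = resolve(indexes, "idx-c")
print(sorted(dataset))
```

Since each index is itself immutable, growing a dataset means publishing a new index that appends the old one, and the identifier of the newest index then stands for the whole dataset.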
  24. Using Nanopublication Datasets
      Once published in the network, nanopublication indexes can be cited:
          [7] Nanopubs converted from OpenBEL’s Small and Large Corpus 20131211. Nanopublication index, 4 March 2014, http://np.inn.ac/RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
      Researchers can then fetch and reuse the data in a reliable and perfectly reproducible manner:
          # Download data:
          np get -c RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
          # Analyze data:
          ...
      Existing data can be recombined in new indexes, and researchers can unambiguously refer to the datasets they used when publishing new results:
          this: prov:wasDerivedFrom nps:RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1Jjx...
      Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
  25. Could these techniques and infrastructures allow us to make a step forward in terms of automation in science?
      SCIENCE BOTS
  26. Science Bots: Scientists’ Little Helpers in the Future?
      “Science bots” that autonomously publish results in their own name could cover a wide variety of applications, for example:
        • text mining bot: PubMed abstracts to nanopublications
        • sensor bot: sensor data to nanopublications
        • inference bot: nanopublications to nanopublications
      Kuhn. Science Bots: A Model for the Future of Scientific Computation? SAVE-SD, WWW 2015 Companion Proceedings.
  27. Quality Control with Reputation Mechanisms and Network Metrics?
      Robust automatic calculation of reputation metrics in a decentralized and open system:
      Diagram: a network with edges labeled "is contributed by".
      Kuhn. Science Bots: A Model for the Future of Scientific Computation? SAVE-SD, WWW 2015 Companion Proceedings.
  28. Quality Control with Reputation Mechanisms and Network Metrics?
      Robust automatic calculation of reputation metrics in a decentralized and open system:
      Diagram: a network with edges labeled "is contributed by" and "gives positive assessment for".
      Kuhn. Science Bots: A Model for the Future of Scientific Computation? SAVE-SD, WWW 2015 Companion Proceedings.
  29. Quality Control with Reputation Mechanisms and Network Metrics?
      Robust automatic calculation of reputation metrics in a decentralized and open system:
      Diagram: a network with edges labeled "is contributed by" and "gives positive assessment for"; each node is annotated with its eigenvector centrality on a scale from 0 to 100.
      Kuhn. Science Bots: A Model for the Future of Scientific Computation? SAVE-SD, WWW 2015 Companion Proceedings.
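Eigenvector centrality gives a node a high score when it is endorsed by nodes that themselves score highly. The sketch below computes it by power iteration on a toy graph and rescales the result to the 0 to 100 range used on the slide. It is an illustration under simplifying assumptions: it uses a single edge type (read "source gives a positive assessment for target"), whereas the slide's network also has "is contributed by" edges, and the node names and the function `eigenvector_centrality` are invented for this example.

```python
def eigenvector_centrality(edges, iterations=100):
    """Power iteration on a directed endorsement graph; scores scaled to 0..100.

    `edges` is a list of (source, target) pairs. Convergence is assumed, which
    holds for this example but not for every graph.
    """
    nodes = sorted({node for edge in edges for node in edge})
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new = {node: 0.0 for node in nodes}
        for source, target in edges:
            new[target] += score[source]   # endorsements pass on the endorser's score
        top = max(new.values())
        if top == 0:
            return new                     # no endorsements reach anyone
        score = {node: value / top for node, value in new.items()}
    return {node: round(100 * value, 1) for node, value in score.items()}


edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("alice", "carol"),
    ("dave", "alice"),   # dave endorses alice but is endorsed by nobody
]
scores = eigenvector_centrality(edges)
print(scores)
```

A node that nobody endorses ends up with a score of zero, so its own endorsements carry no weight; this is one reason such a metric may be harder to game in an open system than a plain count of positive assessments.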
  30. Thank you for your attention! Questions?
      Further information:
        • Trusty URIs: http://trustyuri.net
        • Nanopublications: http://nanopub.org
        • Nanopublication Server Network: http://arxiv.org/abs/1411.2749
        • Science Bots: http://arxiv.org/abs/1503.04374
