Classical forms of publishing centered around printed narrative articles do not seem well-suited for publishing scientific datasets, which have become increasingly important for science. In this talk, I will introduce the concept of nanopublications, an approach to enable provenance-aware data publishing. I will present our work to make nanopublications more general by allowing for informal assertions and meta-nanopublications. Because nanopublications rely on Semantic Web technologies, they allow us to organize and structure scientific knowledge and discourse, and thereby to improve the efficiency of scientific processes. By embedding cryptographic hash values in their identifiers, we can make nanopublications trustworthy and reliable even in an open and decentralized environment. I will present our proposal of Trusty URIs and a preliminary implementation of a decentralized server network to publish, archive, and retrieve nanopublications in a reliable and efficient manner.
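The core idea behind hash-based identifiers can be sketched in a few lines. Note this is a simplified illustration, not the actual Trusty URI algorithm: the real specification additionally defines module codes and a normalization step for RDF content, whereas here we simply hash the raw bytes of an artifact, base64url-encode the digest, and append it to the URI. The URI and content below are made-up examples.

```python
import base64
import hashlib

def make_hash_uri(base_uri: str, content: bytes) -> str:
    """Append a hash-derived identifier to a URI (simplified sketch)."""
    digest = hashlib.sha256(content).digest()
    # Base64url encoding (with padding stripped) keeps the
    # digest safe to embed directly in a URI.
    artifact_code = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return base_uri + artifact_code

def verify_hash_uri(uri: str, content: bytes) -> bool:
    """Recompute the digest and check it matches the URI's suffix."""
    digest = hashlib.sha256(content).digest()
    expected = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return uri.endswith(expected)

data = b"example nanopublication content"
uri = make_hash_uri("http://example.org/np1.", data)
assert verify_hash_uri(uri, data)               # untampered content verifies
assert not verify_hash_uri(uri, data + b"x")    # any modification is detected
```

Because the identifier is derived from the content itself, anyone holding the URI can verify a retrieved artifact without trusting the server that delivered it, which is what makes this scheme suitable for an open, decentralized network.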
Semantic Publishing with Nanopublications
1. Semantic Publishing with Nanopublications
Tobias Kuhn
http://www.tkuhn.ch
@txkuhn
ETH Zurich
Research Colloquium
Stanford Center for Biomedical Informatics Research
12 March 2015
2. Information Overload:
>1M New Scientific Articles Per Year
• How can we help scientists to stay up-to-date?
• How can we efficiently retrieve and aggregate published scientific results?
• How should we design the system for the future of scientific communication?
Image from: Kuhn et al. Inheritance patterns in citation networks reveal scientific memes. Physical Review X 4. 2014.
Tobias Kuhn, ETH Zurich Semantic Publishing with Nanopublications 2 / 30
6. Vision: Changing Scholarly Communication
Now: narrative articles at the center
Future: nanopublications at the center
Images from Mons et al. The value of data. Nature Genetics, 43(4):281–283, 2011.
7. Open Problems
Three important open problems for nanopublications are:
• Identifiability: How can we ensure the immutability of nanopublications and provide verifiable identifiers?
• Persistence: How can we ensure that published nanopublications can be reliably retrieved by others and remain permanently available?
• Representation: How can we deal with results that have no straightforward RDF representation?
8. Problem: Replication and Re-Use of Research Results
Exemplary situation: Sue publishes a script that should allow everybody to replicate her scientific analysis:
# Download data:
wget http://some-third-party.org/dataset/1.4
# Analyze data:
...
Problems: What if the third party silently changes that version of the dataset? What if the resource becomes unavailable at this location? What if the web site later gets hacked and the data manipulated?
9. Identifiability Problem
http://some-third-party.org/dataset/1.4 → ?
Given a URI for a digital artifact, there is no reliable standard procedure for checking whether a retrieved file really represents the correct and original state of that artifact.
10. Trusty URIs
Basic idea: use cryptographic hash values calculated on digital artifacts.
Requirements:
• To allow for the verification of entire reference trees, the hash should be part of the reference (i.e. the URI)
• To allow for metadata, digital artifacts should be allowed to contain self-references (i.e. their own URI)
• Format-independent hashes for different kinds of content
• The complete approach should be decentralized and open
• We want to use them right away
Example:
http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
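The hashing idea can be sketched in a few lines of Python. This is an illustration only: the actual Trusty URI specification defines content-type modules (e.g. FA for byte-level content; RDF content uses RA and is canonicalized before hashing), and the function and example URI below follow the simpler byte-level scheme.

```python
import base64
import hashlib

def trusty_suffix(content: bytes) -> str:
    """Return a hash code to append to a URI (byte-level sketch).

    Illustrative only: the real spec prefixes a module identifier
    ('FA' = byte content; RDF artifacts use 'RA' after graph
    canonicalization) to a base64url-encoded SHA-256 digest.
    """
    digest = hashlib.sha256(content).digest()
    return "FA" + base64.urlsafe_b64encode(digest).decode().rstrip("=")

# Any change to the content yields a different suffix, and thus a new URI:
uri = "http://example.org/r1." + trusty_suffix(b"some artifact content")
```

Because the hash is computed over the content itself, the suffix can be recomputed by anyone holding the artifact, with no central registry involved.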
11. Trusty URIs: Range of Verifiability
With the hash as a part of the URI, the “range of verifiability” extends to referenced artifacts (if they also use trusty URIs):
[Diagram: a reference tree of artifacts; those identified by trusty URIs (http://...RAcbjcRI..., http://...RAQozo2w..., http://...RABMq4Wc..., http://...RAUx3Pqu..., http://...RARz0AX-..., ...) fall inside the range of verifiability, while plain URIs (http://.../resource23, http://.../resource55) fall outside it.]
12. Verifiable — Immutable — Permanent
Whether or not a given resource is the one a given trusty URI is supposed to represent can be verified with perfect confidence.
(assuming that the trusty URI for the required artifact is known, e.g. because another artifact contains it as a link)
Kuhn, Dumontier. Making Digital Artifacts on the Web Verifiable and Reliable. IEEE TKDE. To appear.
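Verification amounts to recomputing the hash on the retrieved content and comparing it with the code embedded in the URI. A minimal sketch for byte-level artifacts (RDF artifacts additionally require graph canonicalization, omitted here; the function name is illustrative):

```python
import base64
import hashlib

def verify_trusty(uri: str, content: bytes) -> bool:
    """Check retrieved content against the hash embedded in its trusty URI.

    Byte-level sketch: recompute SHA-256, base64url-encode it, and
    compare with the artifact code after the last '.' in the URI.
    """
    digest = hashlib.sha256(content).digest()
    code = "FA" + base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return uri.rsplit(".", 1)[-1] == code
```

Any silent modification of the content, at the source or in transit, makes this check fail.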
13. Verifiable — Immutable — Permanent
A trusty URI artifact is immutable, as any change in the content also changes its URI, thereby making it a new artifact.
(as soon as your trusty URI has been picked up by third parties, e.g. cached or linked from other resources, every change will be noticed)
14. Verifiable — Immutable — Permanent
Trusty URI artifacts are permanent, as they can be retrieved from the cache of third-party websites if otherwise no longer available.
(if there are search engines and web archives regularly crawling and caching the artifacts on the web)
15. Calculating Trusty URIs is Fast and Reliable
16. Nanopublication Server Network
[Diagram: servers exchange nanopublications with trusty URIs via publication, retrieval, and propagation/archiving; network status at http://npmonitor.inn.ac]
Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
17. Decentralized — Open — Real-time
• No central authority: everybody can set up a server and join the network
• No restrictions on publication: everybody can upload nanopublications
• No delay between submission and publication: nanopublications are made public immediately
• No updates: if a nanopublication is modified, that makes it a new nanopublication (enforced by trusty URIs)
• No queries: only simple identifier-based lookup
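Because lookup is purely identifier-based, a client needs nothing more than an HTTP GET on a server URL ending in the artifact code. The helper below is a sketch (the function name is made up; the server address is one of the network's nodes, as shown on a later slide):

```python
def nanopub_lookup_url(server: str, artifact_code: str) -> str:
    # Identifier-based lookup: any server in the network that holds the
    # nanopublication can answer a plain HTTP GET on this URL; no query
    # language or central resolver is involved.
    return server.rstrip("/") + "/" + artifact_code

url = nanopub_lookup_url(
    "http://np.inn.ac",
    "RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw")
```

Since the artifact code is a trusty URI hash, the response can be verified locally, so it does not matter which server answers.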
18. High Performance and High Availability
19. Defining Datasets with Nanopublication Indexes
(which are themselves nanopublications)
[Diagram: indexes (a)–(f) linked by “appends”, “has sub-index”, and “has element” relations, defining datasets as trees of nanopublications.]
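The index structure can be sketched as a small data structure (the names are illustrative, not the actual RDF vocabulary): an index lists element nanopublications, may include sub-indexes, and may append to a previous index.

```python
from dataclasses import dataclass, field

@dataclass
class NanopubIndex:
    """Sketch of a nanopublication index (illustrative field names)."""
    elements: list = field(default_factory=list)     # member nanopub URIs
    sub_indexes: list = field(default_factory=list)  # included indexes
    appends: "NanopubIndex" = None                   # previous index extended

    def all_elements(self) -> list:
        """Collect the full dataset defined by this index."""
        out = list(self.elements)
        for sub in self.sub_indexes:
            out += sub.all_elements()
        if self.appends is not None:
            out += self.appends.all_elements()
        return out
```

Since indexes are themselves nanopublications with trusty URIs, a single index URI pins down an entire dataset, recursively and verifiably.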
20. Using Nanopublication Datasets
Once published in the network, nanopublication indexes can be cited:
[7] Nanopubs converted from OpenBEL’s Small and Large Corpus 20131211. Nanopublication index, 4 March 2014, http://np.inn.ac/RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
Researchers can then fetch and reuse the data in a reliable and perfectly reproducible manner:
# Download data:
GetNanopub.sh -c RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
# Analyze data:
...
Existing data can be recombined in new indexes, and researchers can unambiguously refer to the used datasets for new results:
this: prov:wasDerivedFrom nps:RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1Jjx
21. Real-Time and Replicable Data Mining
[Diagram: authors, users, curators, and bots feed structured and unstructured data sources into an “ocean of nanopublications”, which in turn supports nanopublication portals, in silico experiments, network analyses, aggregated views, and hypothesis generation.]
Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.
22. How can we Represent Actual Scientific Results in RDF?
Example:
[...] the risk of developing neurodegenerative disease in idiopathic REM sleep behavior disorder is substantial, with the majority of patients developing Parkinson disease and Lewy body dementia. [PubMed 19109537]
This is difficult to represent in RDF, but it would correspond to two nanopublication assertions:
• “The risk of developing neurodegenerative disease in idiopathic REM sleep behavior disorder is substantial.”
• “The majority of patients with idiopathic REM sleep behavior disorder who develop a neurodegenerative disease develop Parkinson disease and Lewy body dementia.”
Nanopublication provenance and metadata, on the other hand, would be relatively easy to produce.
23. English Sentences as Anchors to Structure Scientific Discourse
[Diagram: the sentences “Malaria is transmitted by mosquitoes.”, “Mosquitoes transmit malaria.”, and “Malaria is transmitted by female mosquitoes.” are connected by relations such as “same meaning”, “more specific meaning of”, and “corrected version of” (linking the misspelled “Malaria is transmitted by moscitos.” to its fixed form); studies A–D provide evidence for or counter-evidence against the sentences.]
24. AIDA Sentences: Criteria
Single English sentences that are:
• Atomic: a sentence describing one thought that cannot be further broken down in a practical way
• Independent: a sentence that can stand on its own, without external references like “this effect” or “we”
• Declarative: a complete sentence ending with a full stop that could in theory be either true or false
• Absolute: a sentence describing the core of a claim, ignoring the uncertainty about its truth and how it was discovered (no “probably” or “evaluation showed”); typically in present tense
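The Independent, Declarative, and Absolute criteria lend themselves to rough automatic checks (Atomicity genuinely needs human judgment). The filter below is a made-up heuristic sketch, not part of the actual AIDA tooling, with illustrative cue-word lists:

```python
import re

# Cue words violating Independence ("we", "this effect") or
# Absoluteness ("probably", "evaluation showed"); lists are illustrative.
NON_INDEPENDENT = re.compile(r"\b(we|this effect|these results)\b", re.I)
NON_ABSOLUTE = re.compile(r"\b(probably|might|evaluation showed)\b", re.I)

def looks_like_aida(sentence: str) -> bool:
    """Rough filter for AIDA candidates (heuristic sketch only)."""
    s = sentence.strip()
    return (s.endswith(".")                    # Declarative: full sentence
            and not NON_INDEPENDENT.search(s)  # Independent
            and not NON_ABSOLUTE.search(s))    # Absolute
```

Such heuristics can pre-filter candidate sentences extracted from abstracts, leaving the borderline cases to curators.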
25. Controlled Natural Language (CNL)
AIDA is a kind of controlled natural language.
More about that in my talk next week (20 March 2015) in the Protégé Seminar.
26. AIDA Nanopublications
Nanopublications can be combined with AIDA sentences to allow for informal and underspecified assertions:
[Diagram: the same claim at three levels of formality — the plain AIDA sentence “Malaria is transmitted by mosquitoes.”, a partially formalized version linking the sentence to ns1:mosquito, and a fully formalized version using ns2:malaria, ns3:transmission, and ns1:mosquito.]
They can be linked regardless of their degree of formality:
[Diagram: a formal assertion (ns2:malaria, ns3:transmission, ns1:mosquito) linked to the informal AIDA sentence.]
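An AIDA sentence can itself serve as a Semantic Web identifier by embedding the sentence text in a URI. The sketch below assumes a purl.org/aida namespace and plain percent-encoding; the actual namespace and escaping rules are defined in the paper and the tooling at https://github.com/tkuhn/aida/:

```python
from urllib.parse import quote

def aida_uri(sentence: str) -> str:
    # Embed the sentence in a URI under an AIDA namespace; the
    # percent-encoding here is an assumed simplification of the
    # actual escaping rules used by the AIDA tooling.
    return "http://purl.org/aida/" + quote(sentence, safe="")

uri = aida_uri("Malaria is transmitted by mosquitoes.")
```

Two nanopublications asserting the same sentence then share the same assertion URI, which makes linking formal and informal statements straightforward.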
27. AIDA Sentences can be Efficiently Created
Manual creation of AIDA sentences from existing abstracts:
• total: 163 sentences (100%)
• perfect: 114 (70%)
• typo etc.: 7 (4%)
• inaccurate: 10 (6%)
• not AIDA: 32 (20%)
Automatic extraction of AIDA sentences from the existing GeneRIF dataset:
• total: 250 sentences (100%)
• perfect: 177 (71%)
• typo etc.: 8 (3%)
• not AIDA: 65 (26%)
28. Examples: AIDA Sentences for Claims by Famous Computer Scientists
• Computers are more flexible and more powerful if data and program instructions are both stored side by side in the same memory. [John von Neumann]
• The problem of determining the provability of arbitrary propositions in first-order logic is undecidable. [Alonzo Church / Alan Turing]
• There is no general algorithm that can decide for all possible programs and inputs whether the program will terminate or run forever. [Alan Turing]
• Automated reasoning can be suitably implemented as a search tree with the hypothesis at the root and by applying heuristics to trim branches. [Herbert Simon and Allen Newell]
• Single-layer perceptrons cannot learn the XOR relation. [Marvin Minsky and Seymour Papert]
• The use of GOTO statements in programming is harmful to program structure. [Edsger Dijkstra]
• Hypertext representations are more flexible, more general, and more natural than classical text structure for computer-supported information systems. [Ted Nelson]
https://github.com/tkuhn/aida/
29. Summary
• With trusty URIs, digital artifacts on the Web can be made immutable and verifiable
• With a decentralized network of nanopublication servers, we can build a reliable and trustworthy infrastructure for data publishing
• With AIDA sentences, we can formally represent provenance and metadata of informal or underspecified assertions
30. Thank you for your attention!
Questions?