Semantic Web in Physical Science

The Semantic Web in Physical Science/Engineering
Peter Murray-Rust
University of Cambridge
Open Knowledge Foundation
Culham Laboratory, 2013-09-11, UK

Themes
To make the complete scientific literature
accessible to machines and humans
• The Semantic Web
• The power of, and need for, Open
• Building communities
• Multidisciplinarity

Funding includes JISC, Unilever, EPSRC.

The Semantic Web
"The Semantic Web is an extension of the
current web in which information is given well-defined meaning, better enabling computers
and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila, The
Semantic Web, Scientific American, May 2001

The scientist’s amanuensis
• "The bane of my life is doing things I know computers could do
for me" (Dan Connolly, W3C)
Example: A semantic amanuensis could
• Give me a daily digest of zeolite papers
• Extract all the crystal structures from them
• Compute physical properties with GULP and NWChem
• Compare the results statistically
• Preserve and distribute the complete operation
• Prepare the results for publication

The Semantic Web is like having a personal amanuensis

Linked Open Data – the world’s knowledge
[Figure: LOD Cloud Diagram as of September 2011, built from RDF triples. Region labels: Music, Social, Art, Literature; Knowledge bases; DBPedia; Lib; GOV.uk; Comp; PDB; GOV; Ontologies; BIO]
Very little physical science
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

Linked Open data from Wikipedia

“Which Rivers flow into the Rhine and are longer
than 50 kilometers?” or “Which Skyscrapers
in China have more than 50 floors and have
been constructed before the year 2000?”
Open Crystallography?
“Which countries where tropical diseases are
endemic have published structures of chiral
natural products?”
CC-BY-SA from Wikipedia
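
Questions like the ones quoted above become mechanical once facts are stored as subject-predicate-object triples. The following is a minimal sketch in plain Python, not a real SPARQL query against DBpedia. The river names, lengths, and predicate names (`flowsInto`, `lengthKm`) are invented placeholders chosen only to show the shape of the query.

```python
# Toy triple store: every fact is a (subject, predicate, object) tuple.
# All river names and lengths below are invented placeholders, not real data.
TRIPLES = [
    ("RiverA", "flowsInto", "Rhine"),
    ("RiverA", "lengthKm", 120.0),
    ("RiverB", "flowsInto", "Rhine"),
    ("RiverB", "lengthKm", 35.0),
    ("RiverC", "flowsInto", "Danube"),
    ("RiverC", "lengthKm", 300.0),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is stored."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def tributaries_longer_than(triples, river, min_km):
    """Subjects that flow into `river` and have a lengthKm above `min_km`."""
    candidates = [s for s, p, o in triples if p == "flowsInto" and o == river]
    return sorted(
        s for s in candidates
        if any(length > min_km for length in objects(triples, s, "lengthKm"))
    )


print(tributaries_longer_than(TRIPLES, "Rhine", 50))
```

The same two-step pattern (select by one relationship, filter by a property) would apply to the crystallography question, provided the structures and country data were published as linked triples.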

Semantics: (Things Take Time)*
• 1994 1st WWW Conference
• 1994 Chemical MIME, Chemical Markup Language (Henry Rzepa, PMR)
• 2001 UK eScience programme, eMinerals
• 2005 Materials Grid (Martin Dove group)
• 2006 Blue Obelisk (Open Source chemistry)
• 2011 PNNL (US) meetings and visit
• 2012 Semantic Physical Science (Cambridge)
*TTT: Piet Hein

Componentised approach liberates

Individual, manual, unreusable, flaky

Commodity, standard, reliable, re-usable

Representing Semantics
Interoperating approaches:
• Markup Languages (“hardcoded objects”): MathML, G(eo)ML, CellML, S(ys)B(io)ML
• CML (Chemistry and numeric science):
1. Molecules (atoms, bonds, coordinates)
2. Reactions
3. Spectra
4. Solid state
5. Computation
• RDF (relationships, annotations, linking)
• Ontologies (Dictionaries)

Humans and machines use different
languages

Scalable Vector Graphics (SVG)
Human-friendly

Automatic!

<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1280" height="640" viewBox="0 0
30240 15120">
<defs id="defs6">
<polygon points="0,-9 1.735535,-3.6038755 7.0364833,-5.6114082 3.8997116,-0.89008374 8.7743512,2.0026884 3.1273259,2.4939592
3.9049537,8.1087198 0,4 -3.9049537,8.1087198 -3.1273259,2.4939592 -8.7743512,2.0026884 -3.8997116,-0.89008374 -7.0364833,-5.6114082 -1.735535,-3.6038755 0,-9 " id="Star7"/>
</defs>
<path d="M 0,0 L 30240,0 L 30240,15120 L 0,15120 L 0,0 z" style="fill:#00008b"/>
<use transform="matrix(252,0,0,252,7560,11340)" id="Commonwealth_Star" style="fill:#fff" xlink:href="#Star7"/>
<use transform="matrix(120,0,0,120,22680,12600)" id="Star_Alpha_Crucis" style="fill:#fff" xlink:href="#Star7"/>
<!-- snipped -->
217,2520 L 10080,2520 L 15120,0 z" id="Red_Diagonals" style="fill:red"/>
<use transform="matrix(-1,0,0,-1,15120,7560)" id="Red_Diagonals_Rotated" style="fill:red" xlink:href="#Red_Diagonals"/>
</svg>

Machine-friendly
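
To illustrate why the markup counts as machine-friendly, here is a small sketch that reads the stars back out of an SVG document with Python's standard XML parser. The embedded SVG is a cut-down stand-in: the `id` and `transform` values are copied from the slide, but the polygon is simplified to a triangle so that the example stays short and well-formed.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Cut-down stand-in for the flag markup; the Star7 shape is simplified.
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 30240 15120">
  <defs><polygon id="Star7" points="0,-9 4,4 -4,4"/></defs>
  <use id="Commonwealth_Star" xlink:href="#Star7"
       transform="matrix(252,0,0,252,7560,11340)"/>
  <use id="Star_Alpha_Crucis" xlink:href="#Star7"
       transform="matrix(120,0,0,120,22680,12600)"/>
</svg>"""


def list_uses(svg_text):
    """Return (id, referenced shape, x-scale, x, y) for every <use> element."""
    root = ET.fromstring(svg_text)
    found = []
    for use in root.iter(f"{{{SVG_NS}}}use"):
        href = use.get(f"{{{XLINK_NS}}}href")
        # transform is "matrix(a,b,c,d,e,f)": a is the x-scale, (e, f) the offset.
        numbers = use.get("transform")[len("matrix("):-1].split(",")
        a, _, _, _, e, f = (float(n) for n in numbers)
        found.append((use.get("id"), href.lstrip("#"), a, e, f))
    return found


for row in list_uses(SVG_TEXT):
    print(row)
```

A program can therefore answer "which shapes are drawn, where, and at what size" without any image analysis, which is not possible from a bitmap of the same flag.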

MathML

Mathematics Markup Language
Energy of a c.c.p. lattice of argon

Automatic!

Human-friendly

4 pages clipped

Many editors and tools exist
We used MathWeaver

Machine-friendly

CML (Chemical Markup Language)

Automatic!

Human-friendly

Machine-friendly
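
As a rough illustration of the machine-friendly side of CML, the sketch below parses a small molecule and computes its bond lengths. The element and attribute names (`atomArray`, `atom`, `elementType`, `x3`/`y3`/`z3`, `bondArray`, `atomRefs2`) follow CML conventions. The coordinates are illustrative values for a water-like molecule, not measured data.

```python
import math
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Illustrative water-like molecule; coordinates (in angstrom) are made up.
CML_TEXT = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.00" y3="0.00" z3="0.00"/>
    <atom id="a2" elementType="H" x3="0.96" y3="0.00" z3="0.00"/>
    <atom id="a3" elementType="H" x3="-0.24" y3="0.93" z3="0.00"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def bond_lengths(cml_text):
    """Return a list of (bond label, length) computed from 3D coordinates."""
    root = ET.fromstring(cml_text)
    atoms = {}
    for atom in root.iter(f"{{{CML_NS}}}atom"):
        xyz = tuple(float(atom.get(k)) for k in ("x3", "y3", "z3"))
        atoms[atom.get("id")] = (atom.get("elementType"), xyz)
    result = []
    for bond in root.iter(f"{{{CML_NS}}}bond"):
        first, second = bond.get("atomRefs2").split()
        (el1, p1), (el2, p2) = atoms[first], atoms[second]
        result.append((f"{el1}-{el2}", math.dist(p1, p2)))
    return result


for label, length in bond_lengths(CML_TEXT):
    print(f"{label}: {length:.3f}")
```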

Current scientific information flow
… is broken for data-rich science
Non-semantic data
PDF
Lineprinter output
Human input
Text files
Data extraction difficult and incomplete
Human readers

Semantic network closes the loop
Measurement

Computation

Semantic Authoring

Analysis

Community

Data available for e-science and reuse

Data mined from document
The network grows autonomously

Human-machine

Human-human

Machine-human

Machine-machine

• Example: Materials 2012, 5, 27-46; doi:10.3390/ma5010027

REACTIONS

ABBREVIATIONS
“… electron donor (ED), such as an electron rich,
metal-based light absorber (LA), and electron
acceptor (EA) sites.”

PROPERTIES (NAME-VALUE-UNITS)

[Figure: text excerpt annotated with N (name), V (value) and U (units) labels]
Note: CML supports value ranges and errors
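
One way to picture a name-value-units property that also allows ranges and errors is the small Python record below. This is a stand-in for illustration only, not the CML schema: the class, its field names, and the sample values are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Property:
    """Hypothetical name-value-units record with optional error or range."""
    name: str
    units: str
    value: Optional[float] = None
    error: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def describe(self) -> str:
        if self.minimum is not None and self.maximum is not None:
            body = f"{self.minimum:g} to {self.maximum:g}"
        elif self.error is not None:
            body = f"{self.value:g} ± {self.error:g}"
        else:
            body = f"{self.value:g}"
        return f"{self.name} = {body} {self.units}"


# Sample values are invented for the example.
print(Property("melting point", "K", value=350.0, error=2.0).describe())
print(Property("temperature", "K", minimum=293.0, maximum=298.0).describe())
```

Keeping the units and the uncertainty next to the number is what lets a later program compare or convert values without guessing.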

Mathematics

CML is being integrated with
computable (content) MathML

Materials Search Challenge
• What would you like a “Google for materials”
to find for you in the scientific literature?

Tim Berners-Lee’s Open data
http://5stardata.info
★ make your stuff available on the Web (whatever format) under an OPEN license
★★ make it available as structured data (i.e. NOT PDF)
★★★ use non-proprietary formats (e.g., CSV)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
Resources marked on the slide: CIFDIC, ACS, IUCr, CRYSTALEYE, COD
• http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
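
The star scheme is cumulative, so a rating can be computed by checking the criteria in order and stopping at the first one that fails. The sketch below assumes a hypothetical `Dataset` record with one boolean per criterion; the field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist, one flag per level of the 5-star scheme."""
    on_web_open_licence: bool
    structured: bool
    non_proprietary_format: bool
    uses_uris: bool
    linked_to_other_data: bool


def star_rating(ds: Dataset) -> int:
    """Count stars cumulatively: a level only counts if all lower levels hold."""
    levels = [
        ds.on_web_open_licence,
        ds.structured,
        ds.non_proprietary_format,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for achieved in levels:
        if not achieved:
            break
        stars += 1
    return stars


# Open, structured and in CSV, but without URIs: links alone do not add stars.
print(star_rating(Dataset(True, True, True, False, True)))
```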

Creating semantic content
1. Authoring tools for humans
2. Program output
3. Chemical databases
4. Content mining and Natural Language Processing (NLP) of text
5. Community

Semantic authoring IUCr
• http://blogs.ch.cam.ac.uk/pmr/2012/01/23/brian-mcmahonpublishing-semantic-crystallography-every-science-data-publishershould-watch-this-all-the-way-through/
• 1:08 CIF
• 3:36 CIF Syntax and dataTypes
• 4:30 Publishing with CIF
• 6:41 Demonstration: CheckCIF
• 12:02 Interactive Chemical validation
• 14:42 Linking data to journal article and search for novelty of data
• 15:08 Jmol display applet
• 21:03 Supplementary data
• 21:47 PublCIF a tool to merge data and text and annotate them

Semanticizing Logfiles: JumboConverters

LOGFILE
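
The idea of semanticizing a logfile can be sketched as turning free-text lines into name-value-units records. The example below is an assumption-laden illustration: the log format `<name> = <number> <units>` and the sample values are hypothetical, and it does not reproduce JumboConverters' own templates or the real output of any simulation code.

```python
import re

# Hypothetical log line format: "<name> = <number> <units>".
LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z ]*?)\s*=\s*"
    r"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(?P<units>\S+)\s*$"
)

# Invented sample log; the numbers are placeholders.
LOG_TEXT = """\
Starting calculation
Total energy = -76.026 hartree
Dipole moment = 1.85 debye
Finished
"""


def semanticize(log_text):
    """Return one {name, value, units} record per recognised line."""
    records = []
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            records.append({
                "name": match["name"],
                "value": float(match["value"]),
                "units": match["units"],
            })
    return records


for record in semanticize(LOG_TEXT):
    print(record)
```

Lines that do not fit the pattern are skipped rather than guessed at, which keeps the extracted records trustworthy at the cost of completeness.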

QUIXOTE: Semantic KnowledgeBase for
Computational Chemistry

Content Mining of Chemistry

Typical chemical synthesis

http://wwmm.ch.cam.ac.uk/chemicaltagger

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have processed 500,000 patents. There are >3,000,000 reactions/year. Added value >1B EUR.

Open Content Mining of FACTS
Machines can interpret phylogenetic trees

Unusable FACT
Re-usable FACT

>100,000 diagrams in literature; cost 1,000,000,000 hours

Crowdcrafting for AEgIS/CERN
Does antimatter fall down or up?
Help the AEgIS experiment at CERN to work out how antimatter is affected by
gravity. Just join the dots!
Antimatter
The observable universe is composed almost entirely of matter but we can
produce stuff called antimatter in the lab. Antimatter is material composed of
antiparticles.
Antiparticles have the same mass as normal matter particles but the opposite
charge. When an antiparticle collides with an ordinary matter particle they both
annihilate - producing a burst of other particles and radiation.
Antiparticles should interact gravitationally just like particles of ordinary matter
because Einstein's weak equivalence principle states that gravity doesn't depend
on composition. But if they don't then gravity is much more complicated than our
current understanding indicates.

http://crowdcrafting.org http://crowdcrafting.org/antimatter

RCUK, Wellcome, ERC, NSF … require fully OPEN

According to EU Vice-President Neelie Kroes, speaking at the Research Data Alliance, we are entering a new “era of open science”, which will be “good for citizens, good for scientists and good for society”.
She explicitly highlighted the transformative potential of open access, open data, open
software and open educational resources, mentioning the EU’s policy requiring open access
to all publications and data resulting from EU-funded research.
http://blog.okfn.org/2013/03/21/we-are-entering-an-era-of-open-science-says-eu-vp-neeliekroes/

Open Definition
• “A piece of data or content is open if anyone is
free to use, reuse, and redistribute it —
subject only, at most, to the requirement to
attribute and/or share-alike.”
OPEN: PDB; COD, Crystaleye; RSC/ACS/IUCr CIFs; Acta Cryst E; CIF dictionaries
NOT OPEN: CCDC, ICSD; Elsevier/Wiley/Springer CIFs; Acta Cryst ABCD (default)
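
The Open Definition can be read as a simple test: a licence is open only if the conditions it imposes go no further than attribution and share-alike. The sketch below encodes that test for a few Creative Commons tools. The dictionary and function names are hypothetical, and treating a missing licence as not open is an assumption that follows the "nothing/unclear" reading used elsewhere in these slides.

```python
# Conditions the Open Definition tolerates.
ALLOWED_CONDITIONS = {"attribution", "share-alike"}

# Conditions attached by a few Creative Commons tools.
LICENCE_CONDITIONS = {
    "CC0": set(),
    "CC-BY": {"attribution"},
    "CC-BY-SA": {"attribution", "share-alike"},
    "CC-BY-NC": {"attribution", "non-commercial"},
    "CC-BY-ND": {"attribution", "no-derivatives"},
}


def is_open(licence):
    """Open only if the licence is known and adds nothing beyond BY / SA."""
    conditions = LICENCE_CONDITIONS.get(licence)
    if conditions is None:
        return False  # nothing stated or unclear: cannot be treated as open
    return conditions <= ALLOWED_CONDITIONS


for name in ("CC0", "CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-ND", None):
    print(name, is_open(name))
```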

Crystaleye
• A database of 200,000 crystal structures scraped
from the CIF supplemental information of publications
• CML molecules and name-value pairs
• Re-usable as fragment base
Nick Day, Jim Downing, Sam Adams, N. W. England
and Peter Murray-Rust*
J. Appl. Cryst. (2012). 45, 316–323,
doi:10.1107/S0021889812006462
http://wwmm.ch.cam.ac.uk/crystaleye
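
Harvesting name-value pairs from a CIF can be sketched in a few lines for the simplest case. This is not Crystaleye's actual pipeline. The data names follow the CIF core dictionary, the cell values are illustrative, and the parser only handles single-line items (no `loop_` blocks or multi-line text fields).

```python
import re

# Number with an optional standard uncertainty in parentheses, e.g. 5.4310(2).
NUMBER_SU = re.compile(r"^([-+]?\d+(?:\.(\d+))?)(?:\((\d+)\))?$")

# Illustrative CIF fragment; the numeric values are placeholders.
CIF_TEXT = """\
data_example
_cell_length_a    5.4310(2)
_cell_length_b    5.4310(2)
_cell_angle_alpha 90
_symmetry_space_group_name_H-M 'F d -3 m'
"""


def parse_value(text):
    """Return (value, uncertainty) for numbers, or the unquoted string."""
    match = NUMBER_SU.match(text)
    if not match:
        return text.strip("'\"")
    number, decimals, su = match.groups()
    value = float(number)
    if su is None:
        return (value, None)
    # The s.u. applies to the last quoted digits: 5.4310(2) means 0.0002.
    scale = 10 ** -len(decimals or "")
    return (value, int(su) * scale)


def parse_items(cif_text):
    """Collect single-line data items of the form '_tag value'."""
    items = {}
    for line in cif_text.splitlines():
        if line.startswith("_"):
            tag, _, rest = line.partition(" ")
            items[tag] = parse_value(rest.strip())
    return items


for tag, value in parse_items(CIF_TEXT).items():
    print(tag, value)
```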

Supplemental Information (CIFs) harvested from publications: ACS, IUCr, RSC, ELS

[Figure: As-Cl bond lengths, from short to long]

COD Letter to Editors 2012
[We] have become aware of growing concerns regarding the publication,
preservation and quality maintenance of crystallographic data. …However,
we believe that completely open deposition of data and multiple checks can
ensure the quality and wide availability of scientific data
[Please] recommend to your authors that they also deposit their
supplementary crystallographic data into the COD when they submit
scientific papers to your journals.
Being open by its design, the COD enables the creation of multiple mirrors
and backup copies. It provides, thus, archival storage of scientific data with
adequate reliability. … services for reviewers and editors to facilitate the
peer-review. …since our database follows the Open Access model, all
material deposited into the COD is available to other databases. The COD
team actually encourages the use of our data collection for any possible
scientific or industrial application by putting the database into the public
domain

Recommendations for Open
Crystallography
• Require Open Crystal Data for all publications
• Deposition of Open Data in COD
• Integrate CIF dictionaries as RDF into Linked
Open Data
• Integrate COD into Linked Open Data Cloud
• CCDC/ICSD to publish RAW author CIFs Openly

Most “Open Access” is not re-usable
[Figure: licence against price per article, 0 to 6000 USD. CC-BY: reusable. CC-NC, CC-ND, nothing/unclear: restricted by licence or lack of clarity.]
Ross Mounce, Panton Fellow, 2012

Panton Principles for Open Data in Science
Why? Wanted to avoid the mess in OA
• Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John Wilbanks
• 2008 -> 2010 (launch) at the Panton Arms
• Launch 2010: Peter Murray-Rust, Jordan Hatcher, John Wilbanks, Jenny Molloy, Rufus Pollock, Cameron Neylon

“Licence STM Data as CC0”

Panton Fellowships (2012)
Panton Fellows: Ross Mounce & Sophie Kershaw
(Support from Open Society Foundations)

• Data should be open
• Make your wishes clear
• Use an appropriate licence

Open Mining Manifesto
1. Define ‘open content mining’ in a broad and useful
manner
‘Open Content Mining’ means the unrestricted right of subscribers to extract, process and
republish content manually or by machine in whatever form, without prior specific
permissions and subject only to community norms of responsible behaviour in the electronic
age. Forms of content include:
• Text
• Numbers
• Tables
• Diagrams
• Graphical representations of relationships between variables
• Images, video and audio when they are the means of expressing a fact
• Semantics (XML, RDF)

2. Urge publishers and institutional repositories to adhere to the following principles:

Principle 1: Right of Legitimate Accessors to Mine
We assert that there is no legal, ethical or moral reason to refuse to allow legitimate
accessors of research content (OA or otherwise) to use machines to analyse the published
output of the research community. Researchers expect to access and process the full
content of the research literature with their computer programs and should be able to use
their machines as they use their eyes. The right to read is the right to mine.

Principle 2: Lightweight Processing Terms and Conditions
Mining by legitimate subscribers should not be prohibited by contractual or other legal
barriers. Publishers should add clarifying language in subscription agreements that content
is available for information mining by download or by remote access. Where access is
through researcher-provided tools, no further cost should be required. Users and
providers should encourage machine processing.

Principle 3: Use
Researchers can and will publish facts and excerpts which they discover by reading and
processing documents. They expect to disseminate and aggregate statistical results as facts
and context text as fair use excerpts, openly and with no restrictions other than attribution.
Publisher efforts to claim rights in the results of mining further retard the advancement of
science by making those results less available to the research community; such claims should
be prohibited.

Facts don’t belong to anyone.

3. Strategies

Assert the above rights by:
• Educating researchers and librarians about the potential of
content mining and the current impediments to doing so,
including alerting librarians to the need not to cede any of the
above rights when signing contracts with publishers
• Compiling a list of publishers and indicating what rights they
currently permit, in order to highlight the gap between the
rights here being asserted and what is currently possible
• Urging governments and funders to promote and aid the
enjoyment of the above rights.

Take-away messages
• Lost/unused STM* data costs 30-100 billion/yr [1]
• Licence: DATA as CCZero and TEXT as CC-BY
• Content Mining for DATA is a RIGHT
• Apathy is our worst enemy
• Trust and empower young people

“A piece of content or data is open if anyone is free to
use, reuse, and redistribute it — subject only, at most,
to the requirement to attribute and/or share-alike.”
Also given in French on the slide: “A piece of data is open if everyone is free to use it,
reuse it and redistribute it.”
*Scientific Technical Medical

[1] PMR: submission to UK Hargreaves process

To make the complete scientific literature
accessible to machines and humans
