Linked Data
for improved organization
of research data
Farmbio BioScience Seminar May 18, 2018
Samuel Lampa @smllmp
PhD Student in Pharm. Bioinformatics @ pharmb.io / farmbio.uu.se
Research interests
● Large datasets
● Automation
● Scientific workflows
● Machine Learning
● Semantic data
● Reasoning
● Query systems
● Something user friendly
● And hopefully usable
● “Answer all the (computational) research questions”
What’s the problem?
● Data in different formats
● Different data schemas
● Losing track of what data means
(meaning available only in context)
A database to the rescue?
Database to the rescue?
● Same problems with losing data identity on export
● So, put all data in the same database?
● One database can’t fit all the world’s data!
● What to do?
What to do?
What if all data could be:
● Easy to share
● Self-described
● Use the same (underlying) format
● Be easy to integrate with other data
(In other words: FAIR – Findable, Accessible, Interoperable, Reusable)
Linked Data!
Linked data – Basic ideas
● Use URIs (“https://”) to identify things
● Make URIs into dereferenceable links
(So one can visit them to find relevant data)
● Refer to other data using their links
What about the linking?
Triple model*:
– Subject (URI), Predicate (URI), Object (URI or literal value)
@prefix ex: <http://example.org/myontology/> .
ex:Sweden ex:hasPopulation 9000000 .
ex:Sweden ex:hasCapital ex:Stockholm .
* For more info: Check “RDF: Resource Description Framework”
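The triple model above can be sketched in a few lines of Python. This is only an illustration, not how any RDF library stores data: each triple is a plain tuple, the `EX` namespace is the example prefix from the slide, and the `match` helper is a hypothetical name introduced here.

```python
# Example namespace taken from the slide's "ex:" prefix.
EX = "http://example.org/myontology/"

# Each triple is (subject, predicate, object).
# Subjects and predicates are URIs; the object is a URI or a literal value.
triples = {
    (EX + "Sweden", EX + "hasPopulation", 9000000),
    (EX + "Sweden", EX + "hasCapital", EX + "Stockholm"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    hits = (
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
    # Sort on the string form so mixed literal types can be ordered.
    return sorted(hits, key=str)


# Follow the "hasCapital" link from Sweden to another resource.
capital = match(triples, s=EX + "Sweden", p=EX + "hasCapital")[0][2]
print(capital)
```

Because the object of one triple can be the subject of another, following links is just a matter of repeating the lookup with the returned URI.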
Web links vs. Data links
Example data
Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES.
Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6.
Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
<http://[...]/nmrshiftdb/?moleculeId=234>
    dc:title "warburganal";
    chem:casnumber "62994-47-2";
    nmr:moleculeId "234";
    nmr:hasSpectrum <http://[...]/nmrshiftdb/?spectrumId=4735>.
<http://[...]/nmrshiftdb/?spectrumId=4735>
    nmr:field "50";
    nmr:hasPeak <http://[...]/nmrshiftdb/?s4735p0>,
        <http://[...]/nmrshiftdb/?s4735p1>,
        <http://[...]/nmrshiftdb/?s4735p2>,
        <http://[...]/nmrshiftdb/?s4735p3>;
    nmr:solvent "Chloroform-D1 (CDCl3)";
    nmr:spectrumId "4735";
    nmr:spectrumType "13C";
    nmr:temperature "298".
<http://[...]/nmrshiftdb/?s4735p1>
    nmr:hasShift 18.3;
    a nmr:peak.
<http://[...]/nmrshiftdb/?s4735p2>
    nmr:hasShift 22.6;
    a nmr:peak.
<http://[...]/nmrshiftdb/?s4735p3>
    nmr:hasShift 26.5;
    a nmr:peak.
Powerful querying with SPARQL
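To give a feel for what a SPARQL query does, here is a minimal sketch of basic graph pattern matching in plain Python. It is not a SPARQL engine, and the short names (`mol234`, `spec4735`, and so on) are hypothetical stand-ins for the full URIs in the example data. Terms starting with `?` are variables, and the patterns are joined on shared variables much as a SPARQL WHERE clause would be.

```python
# Hypothetical short names standing in for the full URIs in the example data.
triples = [
    ("mol234", "nmr:hasSpectrum", "spec4735"),
    ("spec4735", "nmr:hasPeak", "s4735p0"),
    ("spec4735", "nmr:hasPeak", "s4735p1"),
    ("spec4735", "nmr:hasPeak", "s4735p2"),
    ("spec4735", "nmr:hasPeak", "s4735p3"),
    ("s4735p1", "nmr:hasShift", 18.3),
    ("s4735p2", "nmr:hasShift", 22.6),
    ("s4735p3", "nmr:hasShift", 26.5),
]


def is_var(term):
    """Variables are strings that start with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def unify(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if term in result and result[term] != value:
                return None
            result[term] = value
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Join all patterns; return one variable binding per solution."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Roughly: SELECT ?shift WHERE {
#   mol234 nmr:hasSpectrum ?s . ?s nmr:hasPeak ?p . ?p nmr:hasShift ?shift }
patterns = [
    ("mol234", "nmr:hasSpectrum", "?s"),
    ("?s", "nmr:hasPeak", "?p"),
    ("?p", "nmr:hasShift", "?shift"),
]
shifts = sorted(b["?shift"] for b in query(patterns, triples))
print(shifts)
```

The query walks from the molecule to its spectrum, to each peak, and finally to each shift value, without the caller needing to know how the data is laid out in tables. Peaks without a recorded shift simply produce no solution.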
What to do? Linked Data!
What if all data could be:
● Easy to share – Yes, RDF is a web-based format
● Self-described – Yes, links in the data describe the data
● Use the same (underlying) format – Yes, RDF triples
● Be easy to integrate with other data – Yes, just create links
(In other words: FAIR – Findable, Accessible, Interoperable, Reusable)
But how to actually use this in
practice?
What we did (1/3):
Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES.
Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6.
Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
SWI-Prolog for querying, integrated into Bioclipse
Pros / Cons:
+ Powerful querying
+ Easy to integrate into other software
=> Powerful interactive environment
+ Excellent performance
- No support for really large datasets
(exceeding RAM size)
What we did (2/3):
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O.
RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y.
Semantic MediaWiki (SMW) as a collaborative and
interactive platform for exploring, summarizing,
and visualizing data using SMW’s Ask query language
Pros / Cons:
+ Collaboration supported
+ Versioned data storage
+ UI generation included in SMW
- Performance concerns
- Lack of expressiveness and power
in the SMW “Ask” query language
What we did (3/3): urisolve
● A simple web server to resolve, or “dereference” URIs
● Returns any data / triples for the URI in question
● Based on data in a triplestore (semantic database)
or an RDF-HDT file (compressed, indexed file format)
● Source code: github.com/pharmbio/urisolve
Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O.
A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform. 2018;10(1):17. doi:10.1186/s13321-018-0271-1
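The core idea of dereferencing a URI can be sketched as a lookup that returns every triple about that URI. The code below is not taken from urisolve; it is a minimal illustration over an in-memory list, with hypothetical `BASE` and `NMR` namespaces and a hypothetical `resolve` function. A real resolver would read from a triplestore or an RDF-HDT file and serve the result over HTTP.

```python
# Hypothetical namespaces; the real base URI is elided in the example data.
BASE = "http://example.org/nmrshiftdb/"
NMR = "http://example.org/nmr#"

triples = [
    (BASE + "?spectrumId=4735", NMR + "hasPeak", BASE + "?s4735p1"),
    (BASE + "?s4735p1", NMR + "hasShift", 18.3),
    (BASE + "?s4735p2", NMR + "hasShift", 22.6),
]


def to_term(term):
    """Serialize a term in an N-Triples-like form.

    Simplification: anything that is not an http(s) URI becomes a plain
    quoted literal, so numeric datatypes are not preserved.
    """
    if isinstance(term, str) and term.startswith(("http://", "https://")):
        return f"<{term}>"
    escaped = str(term).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def resolve(uri, triples):
    """Return all triples with the given URI as subject, one per line."""
    lines = [
        f"{to_term(s)} {to_term(p)} {to_term(o)} ."
        for s, p, o in triples
        if s == uri
    ]
    return "\n".join(lines)


print(resolve(BASE + "?s4735p1", triples))
```

An unknown URI yields an empty result here; a web server built on this idea would typically answer with a 404 in that case.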
Conclusions
● Linked Data makes data self-describing
● It is extremely flexible to work with
● Lowers the barriers to data entry
Vision: A central workbench for Linked Data
SWISH: SWI-Prolog Notebook: swish.swi-prolog.org
… to access all data sources, and
“answer all the (computational) research questions”
Thank you
Samuel Lampa @smllmp
PhD Student in Pharm. Bioinformatics @ pharmb.io / farmbio.uu.se
