Talk at Linked Data Sweden 2018 at SciLifeLab Uppsala.
(Program and talk info at: https://lankadedata.se/LDSV/2018)
Abstract: Data in the life sciences are growing at an exponential rate. Semantic web technologies, which were conceived before the "Big Data" era, have not always been optimal for handling really large data sets. Based on our experience, the situation can be improved with the right approach, and some promising new developments help to better merge the worlds of semantic and big data.
3. A deep historic divide ...
Semantic Web
Data Science
“web-focused”
“distributed”
“verbose”
“slow”
“large-scale”
“performance focussed”
“pragmatic”
“academic”
“automated”
4. A deep historic divide ...
Any solution?
5. Semantic Web vs. Data Science
Data Science = “Be able to experiment with data”
Not been easy in SemWeb, because of ...
(warning: strong opinions ahead):
● Distributed data, rather than local data (the original vision)
● Massive technological “re-invention of the wheel”
6. So, what’s the problem?
● Data science requires:
● “Local” data (for large data)
● Powerful querying
● “Schema-less” is challenging without some starting point, or some structure (such as re-usable queries)
● SPARQL helps only so much (no re-usable queries)
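One way to read “re-usable queries” is as named, parameterized query patterns that can be built on top of each other, much like rules in a logic language. A minimal sketch in Python, assuming a small in-memory list of triples; the data, the predicate names and the helper functions are all made up for illustration:

```python
# Hypothetical example data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:aspirin", "ex:inhibits", "ex:COX1"),
    ("ex:aspirin", "ex:inhibits", "ex:COX2"),
    ("ex:ibuprofen", "ex:inhibits", "ex:COX2"),
    ("ex:COX1", "rdf:type", "ex:Protein"),
    ("ex:COX2", "rdf:type", "ex:Protein"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def targets_of(triples, compound):
    """Re-usable query: everything a given compound inhibits."""
    return sorted(o for _, _, o in match(triples, s=compound, p="ex:inhibits"))


def shared_targets(triples, compound_a, compound_b):
    """Query built from another query: targets inhibited by both compounds."""
    return sorted(
        set(targets_of(triples, compound_a)) & set(targets_of(triples, compound_b))
    )


if __name__ == "__main__":
    print(targets_of(TRIPLES, "ex:aspirin"))
    print(shared_targets(TRIPLES, "ex:aspirin", "ex:ibuprofen"))
```

The point of the sketch is only that `shared_targets` re-uses `targets_of` by name instead of repeating its pattern.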
8. What we did (1/3):
Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES.
Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi: 10.1186/2041-1480-2-S1-S6.
Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
← SWI-Prolog for querying
… Integrated into Bioclipse
Pros / Cons:
+ Powerful querying
+ Easy to integrate into other software
=> Powerful interactive environment
+ Excellent performance
- No support for really large datasets (exceeding RAM size)
9. What we did (2/3):
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O.
RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics. 2017;8(35):1-13. doi: 10.1186/s13326-017-0136-y.
Semantic MediaWiki as a collaborative and interactive platform for playing around with data, summarizing and visualizing using SMW’s Ask query language →
Pros / Cons:
+ Collaboration supported
+ Versioned data storage
+ UI generation included in SMW
- Performance concerns
- Lack of expressiveness and power in the SMW “Ask” query language
13. RDF-HDT
HDT: Header, Dictionary, Triples
+ Fast
+ Relatively few dependencies
+ Easy to integrate
+ SWI-Prolog support(!)
- Resource demanding conversion
- Still quite new and “bleeding-edge”
rdfhdt.org
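The Dictionary and Triples parts of the name hint at why the format is compact: each term is stored once in a dictionary, and triples refer to terms by integer ID. The following is a much simplified sketch of that idea in Python, not the actual HDT binary layout; the function names and the example data are made up for illustration:

```python
def encode(triples):
    """Map each distinct term to an integer ID and rewrite triples as ID tuples."""
    dictionary = {}  # term -> integer ID, assigned in order of first appearance
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded


def decode(dictionary, encoded):
    """Invert the dictionary and turn ID tuples back into term triples."""
    terms = [None] * len(dictionary)
    for term, term_id in dictionary.items():
        terms[term_id] = term
    return [tuple(terms[i] for i in ids) for ids in encoded]


# Hypothetical example data.
EXAMPLE = [
    ("ex:aspirin", "ex:inhibits", "ex:COX1"),
    ("ex:aspirin", "ex:inhibits", "ex:COX2"),
]

if __name__ == "__main__":
    dictionary, encoded = encode(EXAMPLE)
    print(dictionary)
    print(encoded)
```

Long, frequently repeated URIs are thus written out only once, while the triples themselves shrink to small integers.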
14. RDF serializations
Text (XML/Turtle/N3):
● Inefficient (compared to TSV)
● Search requires brute-force scan
(G)Zipped text:
● Compact
● Search requires decompression AND(!) brute-force scan
RDF-HDT:
● Compact, binary format
● Search can leverage indexes to make it fast
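The difference in search cost can be illustrated with a small Python sketch. It assumes line-based N-Triples text and uses a plain dict as a stand-in for an index; this only mimics the idea of indexed lookup and is not how RDF-HDT is implemented. All data and function names are made up for illustration:

```python
import gzip
from collections import defaultdict

# Hypothetical example data in N-Triples syntax.
NTRIPLES = (
    '<http://example.org/a> <http://example.org/p> "1" .\n'
    '<http://example.org/b> <http://example.org/p> "2" .\n'
    '<http://example.org/a> <http://example.org/q> "3" .\n'
)


def scan_text(text, subject):
    """Plain text: every line has to be inspected."""
    return [line for line in text.splitlines() if line.startswith(subject + " ")]


def scan_gzipped(blob, subject):
    """Gzipped text: decompress everything first, then scan every line anyway."""
    return scan_text(gzip.decompress(blob).decode("utf-8"), subject)


def build_index(text):
    """Built once up front: maps each subject to the lines that mention it."""
    index = defaultdict(list)
    for line in text.splitlines():
        index[line.split(" ", 1)[0]].append(line)
    return index


def lookup(index, subject):
    """Indexed search: a direct lookup, no scan over unrelated triples."""
    return index.get(subject, [])


if __name__ == "__main__":
    subject = "<http://example.org/a>"
    print(scan_text(NTRIPLES, subject))
    print(scan_gzipped(gzip.compress(NTRIPLES.encode("utf-8")), subject))
    print(lookup(build_index(NTRIPLES), subject))
```

All three calls return the same lines; what differs is how much work each has to do per query once the data grows.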
15. LOD Laundromat
● The “whole Linked Data cloud” !!!
● Cleaned up and integrated.
● Download in RDF-HDT format
● Or query via “Linked Data Fragments” or SPARQL
● Play around!
● lodlaundromat.org
● See also: youtu.be/sXJdSfjO1dU
16. What we did (3/3): urisolve
● Based on data in BlazeGraph or RDF-HDT
● Resolves RDF URIs
● Returns RDF with any triples connected to the URI in question
● Source code: github.com/pharmbio/urisolve
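The resolving step described above can be sketched roughly as follows in Python. This is not the urisolve implementation (see the repository for that); it assumes an in-memory list of triples instead of BlazeGraph or RDF-HDT, reads “connected to” as “the URI occurs as subject or object”, and uses made-up example data and function names:

```python
def resolve(triples, uri):
    """Return all triples in which the URI occurs as subject or object."""
    return [t for t in triples if t[0] == uri or t[2] == uri]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines (handles URIs only, no literals)."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


# Hypothetical example data.
EXAMPLE = [
    ("http://example.org/aspirin", "http://example.org/inhibits", "http://example.org/COX1"),
    ("http://example.org/ibuprofen", "http://example.org/inhibits", "http://example.org/COX2"),
    ("http://example.org/COX1", "http://example.org/partOf", "http://example.org/pathway1"),
]

if __name__ == "__main__":
    print(to_ntriples(resolve(EXAMPLE, "http://example.org/COX1")), end="")
```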
Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O.
A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform. 2018;10(1):17. doi: 10.1186/s13321-018-0271-1
17. The future: SWI-Prolog as central point?
SWISH: SWI-Prolog Notebook: swish.swi-prolog.org
… for powerful querying and reasoning, aka “hands-on data science”
18. “Linked Data is the Semantic Web done right”
– Tim Berners-Lee
tomheath.com/blog/2009/03/linked-data-web-of-data-semantic-web-wtf/