
Semantic Web ❤ Data Science? - Practical large scale semantic data handling with RDFIO and RDF-HDT

Talk at Linked Data Sweden 2018 at SciLifeLab Uppsala.
(Program and talk info at: https://lankadedata.se/LDSV/2018)

Abstract: Data in the life sciences are growing at an exponential rate. Semantic web technologies, which were initially conceived before the "Big Data" era, have not always been optimal for handling really large data sets. Based on our experience, the situation can improve with the right approach, and some promising new developments help to better merge the worlds of semantic and big data.

Published in: Technology

  1. 1. Semantic Web ❤ Data Science? Samuel Lampa | @smllmp | pharmb.io | Linked Data Sweden 2018 | Uppsala, April 9
  2. 2. Practical large scale semantic data handling with RDFIO and RDF-HDT … or, in other words:
  3. 3. A deep historic divide ... Semantic Web Data Science “web-focused” “distributed” “verbose” “slow” “large-scale” “performance focussed” “pragmatic” “academic” “automated”
  4. 4. A deep historic divide ... Semantic Web Data Science “web-focused” “distributed” “verbose” “slow” “large-scale” “performance focussed” “pragmatic” “academic” “automated” Any solution?
  5. 5. Semantic Web vs. Data Science Data Science = “Be able to experiment with data” Not been easy in SemWeb, because of ... (warning: strong opinions ahead): ● Distributedness of data locality (original vision) ● Massive technological “re-invention of the wheel”
  6. 6. So, what’s the problem? ● Data science requires: ● “Local” data (for large data) ● Powerful querying ● “Schema-less” is challenging without some starting point, or some structure (such as re-usable queries) ● SPARQL helps only so much (no re-usable queries)
  7. 7. One solution: SWI-Prolog – Re-usable rules: Great support for semweb: www.swi-prolog.org/web
  8. 8. What we did (1/3): Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES. Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi:10.1186/2041-1480-2-S1-S6. Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport ← SWI-Prolog for querying … Integrated into Bioclipse Pros / Cons: + Powerful querying + Easy to integrate into other software => Powerful interactive environment + Excellent performance - No support for really large datasets (exceeding RAM size)
  9. 9. What we did (2/3): Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y. Semantic MediaWiki as a collaborative and interactive platform for playing around with data, summarizing and visualizing using SMW’s Ask query language → Pros / Cons: + Collaboration supported + Versioned data storage + UI generation included in SMW - Performance concerns - Lack of expressiveness and power in the SMW “Ask” query language
  10. 10. What we did (2/3): Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. 2017;8(35):1-13. doi:10.1186/s13326-017-0136-y.
  11. 11. Ecosystem today: Totally different blazegraph.com rdfhdt.org
  12. 12. BlazeGraph blazegraph.com Powers query.wikidata.org + Fast + Easy to use, web-based interface - Requires running process - Needs importing - Only SPARQL (No re-usable queries)
  13. 13. RDF-HDT HDT: Header, Dictionary, Triples + Fast + Relatively few dependencies + Easy to integrate + SWI-Prolog support(!) - Resource demanding conversion - Still quite new and “bleeding-edge” rdfhdt.org
  14. 14. RDF serializations ● Text (XML/Turtle/N3): Inefficient (compared to TSV); search requires brute-force scan ● (G)Zipped Text: Compact; search requires decompression AND(!) brute-force scan ● RDF-HDT: Compact, binary format; search can leverage indexes to make it fast
  15. 15. LOD Laundromat ● The “whole Linked Data cloud” !!! ● Cleaned up and integrated. ● Download in RDF-HDT format ● Or query via “Linked data fragments” or SPARQL ● Play around! ● lodlaundromat.org ● See also: youtu.be/sXJdSfjO1dU
  16. 16. What we did (3/3): urisolve ● Based on data in BlazeGraph or RDF-HDT ● Resolves RDF URIs ● Returns RDF with any triples connected to the URI in question ● Source code: github.com/pharmbio/urisolve Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O. A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform. 2018;10(1):17. doi:10.1186/s13321-018-0271-1
  17. 17. The future: SWI-Prolog as central point? SWISH: SWI-Prolog Notebook: swish.swi-prolog.org … for powerful querying and reasoning, aka “hands-on data science”
  18. 18. “Linked Data is the Semantic Web done right” – Tim Berners-Lee tomheath.com/blog/2009/03/linked-data-web-of-data-semantic-web-wtf/
  19. 19. Linked Data Data Science
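
The contrast drawn on the RDF-HDT and "RDF serializations" slides (brute-force scanning of text versus index-backed search over a compact encoding) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the HDT format or any rdfhdt.org API: the class `TinyDictionaryIndex`, the function `brute_force_search`, and the `example.org` triples are all hypothetical names made up for illustration. It only mimics the general idea of a term dictionary plus sorted integer-ID triples.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example data; every term is treated as an IRI for simplicity.
EXAMPLE_TRIPLES: List[Triple] = [
    ("http://example.org/aspirin", "http://example.org/hasTarget", "http://example.org/cox1"),
    ("http://example.org/aspirin", "http://example.org/type", "http://example.org/Compound"),
    ("http://example.org/cox1", "http://example.org/type", "http://example.org/Protein"),
]


def to_ntriples_line(triple: Triple) -> str:
    """Serialize one all-IRI triple as an N-Triples line."""
    s, p, o = triple
    return f"<{s}> <{p}> <{o}> ."


def brute_force_search(lines: List[str], subject: str) -> List[str]:
    """Text serialization: every line must be inspected to find a subject."""
    prefix = f"<{subject}> "
    return [line for line in lines if line.startswith(prefix)]


class TinyDictionaryIndex:
    """Toy store (assumption, not HDT): term dictionary + sorted ID triples."""

    def __init__(self, triples: List[Triple]) -> None:
        # Dictionary: each distinct term is stored once and gets an integer ID.
        self.term_of: List[str] = sorted({term for t in triples for term in t})
        self.id_of: Dict[str, int] = {term: i for i, term in enumerate(self.term_of)}
        # Triples: compact integer tuples, sorted so that binary search works.
        self.triples: List[Tuple[int, int, int]] = sorted(
            (self.id_of[s], self.id_of[p], self.id_of[o]) for s, p, o in triples
        )

    def search_subject(self, subject: str) -> List[Triple]:
        """Find all triples with the given subject via binary search."""
        sid = self.id_of.get(subject)
        if sid is None:
            return []
        # (sid, -1, -1) sorts before every real triple with this subject ID.
        i = bisect_left(self.triples, (sid, -1, -1))
        hits: List[Triple] = []
        while i < len(self.triples) and self.triples[i][0] == sid:
            s, p, o = self.triples[i]
            hits.append((self.term_of[s], self.term_of[p], self.term_of[o]))
            i += 1
        return hits
```

With `store = TinyDictionaryIndex(EXAMPLE_TRIPLES)`, a call such as `store.search_subject("http://example.org/aspirin")` jumps straight to the matching range, whereas `brute_force_search` has to look at every serialized line. A gzipped text file would additionally need decompression before that scan could even start.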

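
The behaviour described on the urisolve slide (resolve an RDF URI and return the triples connected to it) might be sketched in Python as follows. This is not the code from github.com/pharmbio/urisolve and does not talk to BlazeGraph or RDF-HDT; it works on an in-memory list, and it assumes that "connected" means the URI appears as subject or object. The names `resolve_uri`, `to_ntriples`, and the `example.org` data are hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical in-memory data standing in for a real triple store.
DATA: List[Triple] = [
    ("http://example.org/aspirin", "http://example.org/hasTarget", "http://example.org/cox1"),
    ("http://example.org/ibuprofen", "http://example.org/hasTarget", "http://example.org/cox2"),
    ("http://example.org/cox1", "http://example.org/type", "http://example.org/Protein"),
]


def resolve_uri(triples: Iterable[Triple], uri: str) -> List[Triple]:
    """Return triples that mention the URI as subject or object (assumption)."""
    return [t for t in triples if t[0] == uri or t[2] == uri]


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize all-IRI triples as N-Triples, one statement per line."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


if __name__ == "__main__":
    # Print the RDF description of one resource.
    print(to_ntriples(resolve_uri(DATA, "http://example.org/cox1")), end="")
```

A real resolver would delegate the lookup to the backing store (for example a SPARQL query or an indexed HDT search) instead of filtering a Python list, and would need to handle literals and blank nodes, which this sketch leaves out.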