Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation - SEMANTiCS 2019 talk

145 views

Published on

Over the last two decades, the amount of data which has been created, published and managed using Semantic Web standards and especially via Resource Description Framework (RDF) has been increasing.
As a result, efficient processing of such big RDF datasets has become challenging.
Indeed, these processes require, both efficient storage strategies and query-processing engines, to be able to scale in terms of data size.
In this study, we propose a scalable approach to evaluate SPARQL queries over distributed RDF datasets using a semantic-based partition and is implemented inside the state-of-the-art RDF processing framework: SANSA.
An evaluation of the performance of our approach in processing large-scale RDF datasets is also presented.
The preliminary results of the conducted experiments show that our approach can scale horizontally and perform well as compared with the previous Hadoop-based system.
It is also comparable with the in-memory SPARQL query evaluators when there is less shuffling involved.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation - SEMANTiCS 2019 talk

  1. 1. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra, Hajira Jabeen, and Jens Lehmann
  2. 2. 2 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn ❖ Introduction ❖ Approach ➢ System Architecture Overview ➢ Distributed Algorithm Description ❖ Evaluation ❖ Conclusion and Future work Outline
  3. 3. 3 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction ❖ Over the last years, the size of the Semantic Web has increased and several large-scale datasets were published
  4. 4. 4 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Over the last years, the size of the Semantic Web has increased and several large-scale datasets were published ➢ As of August 2018
  5. 5. 5 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Over the last years, the size of the Semantic Web has increased and several large-scale datasets were published ➢ Based on LOD Stats (http://lodstats.aksw.org/ ) ~10, 000 datasets Openly available online using Semantic Web standards
  6. 6. 6 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Over the last years, the size of the Semantic Web has increased and several large-scale datasets were published ➢ Based on LOD Stats (http://lodstats.aksw.org/ ) ~10, 000 datasets Openly available online using Semantic Web standards many datasets RDFized and kept private (e.g. Supply chain, manufacture, ethereum dataset, etc.)
  7. 7. 7 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Dealing with such amount of data makes many tasks hard to be solved on single machines
  8. 8. 8 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Dealing with such amount of data makes many tasks hard to be solved on single machines Supply Chain Mng. Find relevant information about motorcycle part
  9. 9. 9 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Dealing with such amount of data makes many tasks hard to be solved on single machines Media Publishing Discover what others say about news stories Supply Chain Mng. Find relevant information about motorcycle part
  10. 10. 10 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Dealing with such amount of data makes many tasks hard to be solved on single machines Media Publishing Discover what others say about news stories Exploring Tourism Exploring the neighborhood (events, point of interests, etc) Supply Chain Mng. Find relevant information about motorcycle part
  11. 11. 11 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Dealing with such amount of data makes many tasks hard to be solved on single machines Media Publishing Discover what others say about news stories Exploring Tourism Exploring the neighborhood (events, point of interests, etc) Entity Linking Which datasets are good candidates for interlinking? Supply Chain Mng. Find relevant information about motorcycle part
  12. 12. 12 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Introduction Source: LOD-Cloud (http://lod-cloud.net/ ) ❖ Dealing with such amount of data makes many tasks hard to be solved on single machines Media Publishing Discover what others say about news stories Exploring Tourism Exploring the neighborhood (events, point of interests, etc) Entity Linking Which datasets are good candidates for interlinking? Supply Chain Mng. Find relevant information about motorcycle part These processes require, both efficient storage strategies and query processing engine.
  13. 13. 13 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn SANSA Overview ❖ SANSA [1] is a processing data flow engine that provides data distribution, and fault tolerance for distributed computations over RDF large-scale datasets ❖ SANSA includes several libraries: ➢ Read / Write RDF / OWL library ➢ Querying library ➢ Inference library ➢ ML- Machine Learning core library BigDataEurope Inference Knowledge Distribution & Representation DeployCoreAPIs&Libraries Local Cluster Standalone Resource manager Querying Machine Learning
  14. 14. 14 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map Distributed Data Structures Results RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview
  15. 15. 15 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  16. 16. 16 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach System Architecture Overview (workflow) SANSA Engine RDF Layer Data Ingestion RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany
  17. 17. 17 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  18. 18. 18 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn System Architecture Overview (workflow)
  19. 19. 19 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt System Architecture Overview (workflow)
  20. 20. 20 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany System Architecture Overview (workflow)
  21. 21. 21 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen System Architecture Overview (workflow)
  22. 22. 22 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  23. 23. 23 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  24. 24. 24 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  25. 25. 25 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow) #1 select (p=:owns)
  26. 26. 26 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow) #1 select (p=:owns) #2 select (p=:madeIn, o=:Ingolstadt)
  27. 27. 27 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow) #1 select (p=:owns) #2 select (p=:madeIn, o=:Ingolstadt) #3 join (?c)
  28. 28. 28 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow) #1 select (p=:owns) #2 select (p=:madeIn, o=:Ingolstadt) #3 join (?c) #4 project (?p)
  29. 29. 29 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map Distributed Data Structures RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  30. 30. 30 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Approach SANSA Engine RDF Layer Data Ingestion Partitioning Query Layer Semantic map map Distributed Data Structures Results RDFData SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . } SPARQL Joy :owns Car1 Joy :livesIn Bonn Car1 :typeOf Car Car1 :madeBy Audi Car1 :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany Joy :owns Car1 :livesIn Bonn Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt Bonn :cityOf Germany Audi :memeberOf Volkswagen Ingolstadt :cityOf Germany System Architecture Overview (workflow)
  31. 31. 31 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn ❖ Experimental Setup ➢ Cluster configuration ■ 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8 ➢ Datasets (all in nt format) ➢ Distributed SPARQL query evaluators we compare with: ■ SHARD [2], SPARQLGX-SDE [3], and Sparklify [4] Evaluation LUBM WatDiv 1K 2K 3K 10M 100M #nr. of triples 138,280,374 276,349,040 414,493,296 10,916,457 108,997,714 size (GB) 24 49 70 1.5 15
  32. 32. 32 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Runtime (s) (mean) Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours) C3 n/a 38.79 72.94 90.48 F3 n/a 38.41 74.69 n/a L3 n/a 21.05 73.16 72.84 S3 n/a 26.27 70.1 79.7 C3 n/a 181.51 96.59 300.82 F3 n/a 162.86 91.2 n/a L3 n/a 84.09 82.17 189.89 S3 n/a 123.6 93.02 176.2 Q1 774.93 103.74 103.57 226.21 Q2 fail fail 3348.51 329.69 Q3 772.55 126.31 107.25 235.31 Q4 988.28 182.52 111.89 294.8 Q5 771.69 101.05 100.37 226.21 Q6 fail 73.05 100.72 207.06 Q7 fail 160.94 113.03 277.08 Q8 fail 179.56 114.83 309.39 Q9 fail 204.62 114.25 326.29 Q10 780.05 106.26 110.18 232.72 Q11 783.2 112.23 105.13 231.36 Q12 fail 159.65 105.86 283.53 Q13 778.16 100.06 90.87 220.28 Q14 688.44 74.64 100.58 204.43 Performance analysis on large-scale RDF datasets Evaluation Watdiv-10MWatdiv-100MLUBM-1K
  33. 33. 33 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Runtime (s) (mean) Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours) C3 n/a 38.79 72.94 90.48 F3 n/a 38.41 74.69 n/a L3 n/a 21.05 73.16 72.84 S3 n/a 26.27 70.1 79.7 C3 n/a 181.51 96.59 300.82 F3 n/a 162.86 91.2 n/a L3 n/a 84.09 82.17 189.89 S3 n/a 123.6 93.02 176.2 Q1 774.93 103.74 103.57 226.21 Q2 fail fail 3348.51 329.69 Q3 772.55 126.31 107.25 235.31 Q4 988.28 182.52 111.89 294.8 Q5 771.69 101.05 100.37 226.21 Q6 fail 73.05 100.72 207.06 Q7 fail 160.94 113.03 277.08 Q8 fail 179.56 114.83 309.39 Q9 fail 204.62 114.25 326.29 Q10 780.05 106.26 110.18 232.72 Q11 783.2 112.23 105.13 231.36 Q12 fail 159.65 105.86 283.53 Q13 778.16 100.06 90.87 220.28 Q14 688.44 74.64 100.58 204.43 Performance analysis on large-scale RDF datasets Evaluation Watdiv-10MWatdiv-100MLUBM-1K
  34. 34. 34 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Runtime (s) (mean) Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours) C3 n/a 38.79 72.94 90.48 F3 n/a 38.41 74.69 n/a L3 n/a 21.05 73.16 72.84 S3 n/a 26.27 70.1 79.7 C3 n/a 181.51 96.59 300.82 F3 n/a 162.86 91.2 n/a L3 n/a 84.09 82.17 189.89 S3 n/a 123.6 93.02 176.2 Q1 774.93 103.74 103.57 226.21 Q2 fail fail 3348.51 329.69 Q3 772.55 126.31 107.25 235.31 Q4 988.28 182.52 111.89 294.8 Q5 771.69 101.05 100.37 226.21 Q6 fail 73.05 100.72 207.06 Q7 fail 160.94 113.03 277.08 Q8 fail 179.56 114.83 309.39 Q9 fail 204.62 114.25 326.29 Q10 780.05 106.26 110.18 232.72 Q11 783.2 112.23 105.13 231.36 Q12 fail 159.65 105.86 283.53 Q13 778.16 100.06 90.87 220.28 Q14 688.44 74.64 100.58 204.43 Performance analysis on large-scale RDF datasets Evaluation Watdiv-10MWatdiv-100MLUBM-1K We can see that our approach (SANSA.Semantic) evaluates most of the queries as opposed to SHARD. SHARD system fails to evaluate most of the LUBM queries and its parser does not support Watdiv queries.
  35. 35. 35 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Runtime (s) (mean) Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours) C3 n/a 38.79 72.94 90.48 F3 n/a 38.41 74.69 n/a L3 n/a 21.05 73.16 72.84 S3 n/a 26.27 70.1 79.7 C3 n/a 181.51 96.59 300.82 F3 n/a 162.86 91.2 n/a L3 n/a 84.09 82.17 189.89 S3 n/a 123.6 93.02 176.2 Q1 774.93 103.74 103.57 226.21 Q2 fail fail 3348.51 329.69 Q3 772.55 126.31 107.25 235.31 Q4 988.28 182.52 111.89 294.8 Q5 771.69 101.05 100.37 226.21 Q6 fail 73.05 100.72 207.06 Q7 fail 160.94 113.03 277.08 Q8 fail 179.56 114.83 309.39 Q9 fail 204.62 114.25 326.29 Q10 780.05 106.26 110.18 232.72 Q11 783.2 112.23 105.13 231.36 Q12 fail 159.65 105.86 283.53 Q13 778.16 100.06 90.87 220.28 Q14 688.44 74.64 100.58 204.43 Performance analysis on large-scale RDF datasets Evaluation Watdiv-10MWatdiv-100MLUBM-1K On the other hand, SPARQLGX-SDE performs better than both Sparklify and our approach, when the size of the dataset is considerably small (e.g. less than 25GB). This behavior is due to the large partitioning overhead for Sparklify and our approach.
  36. 36. 36 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Runtime (s) (mean) Queries SHARD SPARQLGX-SDE SANSA.Sparklify SANSA.Semantic (ours) C3 n/a 38.79 72.94 90.48 F3 n/a 38.41 74.69 n/a L3 n/a 21.05 73.16 72.84 S3 n/a 26.27 70.1 79.7 C3 n/a 181.51 96.59 300.82 F3 n/a 162.86 91.2 n/a L3 n/a 84.09 82.17 189.89 S3 n/a 123.6 93.02 176.2 Q1 774.93 103.74 103.57 226.21 Q2 fail fail 3348.51 329.69 Q3 772.55 126.31 107.25 235.31 Q4 988.28 182.52 111.89 294.8 Q5 771.69 101.05 100.37 226.21 Q6 fail 73.05 100.72 207.06 Q7 fail 160.94 113.03 277.08 Q8 fail 179.56 114.83 309.39 Q9 fail 204.62 114.25 326.29 Q10 780.05 106.26 110.18 232.72 Q11 783.2 112.23 105.13 231.36 Q12 fail 159.65 105.86 283.53 Q13 778.16 100.06 90.87 220.28 Q14 688.44 74.64 100.58 204.43 Performance analysis on large-scale RDF datasets Evaluation Watdiv-10MWatdiv-100MLUBM-1K However, Sparklify performs better compared to SPARQLGX-SDE when the size of the dataset increases (see Watdiv-100M results) and the queries involve more joins (see LUBM-1K results). This is due to the Spark SQL optimizer and Sparqlify self-joins optimizers.
  37. 37. 37 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Sizeup analysis (on LUBM dataset)
  38. 38. 38 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Sizeup analysis (on LUBM dataset) Query execution time for our approach grows linearly when the size of the datasets increases. SHARD suffers from the expensive overhead of MapReduce joins which impact its performance.
  39. 39. 39 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Sizeup analysis (on LUBM dataset) Sparqlify and SPARQLGX-SDE performs better when the size of the datasets increases as compared with our approach and SHARD.
  40. 40. 40 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Node scalability (on LUBM-1K)
  41. 41. 41 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Node scalability (on LUBM-1K) We see that as the number of nodes increases, the runtime cost of our query engine decreases linearly as compared with the SHARD, which keeps staying constant.
  42. 42. 42 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Node scalability (on LUBM-1K) SPARQLGX-SDE and Sparklify perform better when the number of nodes increases compared to our approach and SHARD
  43. 43. 43 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn ❖ Processing large-scale RDF datasets: ➢ data-intensive and computing-intensive ➢ challenge to develop fast and efficient engine that can handle large scale RDF datasets ❖ A scalable semantic-based query engine (integrated into the larger SANSA framework) for efficient evaluation of SPARQL queries over distributed RDF datasets using the Spark framework ❖ As a future work we plan to expand our parser and add statistics to the query engine while evaluating queries Conclusion
  44. 44. THANK YOU! SANSA Semantic Analytics Stack https://github.com/SANSA-Stack ● SANSA-RDF ● SANSA-OWL ● SANSA-Query ● SANSA-Inference ● SANSA-ML ● SANSA-Examples ● SANSA-Notebooks Gezim Sejdiu PhD Student Smart Data Analytics / University of Bonn http://sda.tech @Gezim_Sejdiu https://gezimsejdiu.github.io/ <http://sansa-stack.net/>
  45. 45. 45 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn [1]. J. Lehmann, G. Sejdiu, L. B¨uhmann, P. Westphal, C. Stadler, I. Ermilov, S. Bin, N. Chakraborty, M. Saleem, A.-C. N. Ngonga, and H. Jabeen. Distributed Semantic Analytics using the SANSA Stack. In Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC’2017), 2017. [2]. K. Rohloff and R. E. Schantz. High-performance, massively scalable distributed systems using the mapreduce software framework: The shard triple-store. In Programming Support Innovations for Emerging Distributed Applications, PSI EtA ’10, pages 4:1–4:5, New York, NY, USA, 2010. ACM. [3]. D. Graux, L. Jachiet, P. Genev`es, and N. Layaida. Sparqlgx: Efficient distributed evaluation of sparql with apache spark. In P. Groth, E. Simperl, A. Gray, M. Sabou, M. Kr¨otzsch, F. Lecue, F. Fl¨ock, and Y. Gil, editors, The Semantic Web – ISWC 2016, pages 80–87, Cham, 2016. [4]. C. Stadler, G. Sejdiu, D. Graux, and J. Lehmann. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets. In Proceedings of 18th International Semantic Web Conference (ISWC’2019), 2019. References
  46. 46. 46 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Backup Slides
  47. 47. 47 Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation Gezim Sejdiu - University of Bonn Evaluation Overall analysis of queries on LUBM-1K dataset (cluser mode)

×