
ParlBench: a SPARQL-benchmark for electronic publishing applications.


Slides from the workshop on Benchmarking RDF Systems, co-located with the Extended Semantic Web Conference 2013. The presentation covers ongoing work on building a benchmark for electronic publishing applications. The benchmark provides real-world data sets (the Dutch parliamentary proceedings) and a set of analytical SPARQL queries built on top of these data sets. The queries are grouped into micro-benchmarks according to their analytical aims, which allows a finer analysis of RDF store behavior with respect to the SPARQL features used in a given micro-benchmark or query.
Preliminary results of running the benchmark on the Virtuoso native RDF store are presented, along with references to the online material, including the data sets, the queries and the scripts that were used to obtain the results.

Published in: Education, Technology

  1. ParlBench: a SPARQL-benchmark for electronic publishing applications
     Tatiana Tarasova, Maarten Marx
     University of Amsterdam, Information and Language Processing Systems
     May 26, 2013
     Workshop on Benchmarking RDF Systems, ESWC 2013
  2. MEDIA, PUBLICATIONS, LIFE-SCIENCES, CROSS-DOMAIN, GEOGRAPHIC, GOVERNMENT
  3. MEDIA, PUBLICATIONS, LIFE-SCIENCES, CROSS-DOMAIN, GEOGRAPHIC, GOVERNMENT, ?
  6. The ParlBench Benchmark
     Goal:
     → test the performance of RDF stores in the setting of e-publishing applications
     Components:
     → real-world data: Dutch parliamentary proceedings, members and political parties
     → vocabulary: Parliamentary Proceedings [2] (ParliPro) + mix of existing vocabularies
     → 19 analytical SPARQL queries grouped into 4 micro-benchmarks: Average, Count, Factual and Top 10
     Performance metrics:
     → loading time
     → query response time
  7. Outline
     1 The ParlBench Benchmark: Data Sets, Queries
     2 ParlBench experimental run on Virtuoso
  11. The ParlBench Data Sets I
      PoliticalMashup: characteristics
      → Dutch parliamentary proceedings (1814-2013), political parties and politicians
      → richly structured XML documents (∼54,000)
      → URIs of concepts
      → metadata: who said what and when
      → links to Wikipedia
      Linked PoliticalMashup: design choices
      → keep the URIs and linking structure
      → re-use existing vocabularies
      → link to the Linked Open Data cloud
      → separate the structure from the text
  12. The ParlBench Data Sets II
      parties: Dutch political parties
      members: members of the Dutch parliament
      proceedings: structure of the Dutch parliamentary proceedings
      paragraphs: content of speeches of the parliamentary meetings
      tagged entities: links from the paragraphs to DBpedia

      data set          # of triples
      parties           510
      members           33,885
      proceedings       ∼36.5M
      paragraphs        ∼11.25M
      tagged entities   ∼34.4M
      total             ∼82.2M
  13. RDF Data Model
      Parliamentary Proceedings: ParliPro [2], DC and DC Terms [8]
      [Diagram: classes ParliamentaryProceedings, Topic, Scene, StageDirection, Speech, Paragraph, ParliamentMember and PoliticalParty, connected by edges labelled "has part", "references member" and "references party"]
  14. RDF Data Model
      Parliament Member: FOAF [4], Bio [3] and DBpedia Ontology [5]
      [Diagram: ParliamentMember "same as" DBpedia resource; ParliamentMember "biography" Biography]
  15. RDF Data Model
      Parties: ParliPro [2]
      [Diagram: PoliticalParty "same as" DBpedia resource]
  16. RDF Data Model
      Paragraphs: ParliPro [2]
      [Diagram: Paragraph "has text" content of the paragraph]
  17. RDF Data Model
      Tagged Entities: MUTO [6], FOAF [4], Basic WGS84 [7]
      [Diagram: Paragraph, Tag and DBpedia resource; Tag "has auto meaning" DBpedia resource; DBpedia resource "is a" Person, Organization or SpatialThing]
  19. 19 ParlBench queries: 4 micro-benchmarks
      → 3 Average, e.g. A0: Retrieve the average number of people who spoke per topic.
      → 5 Count, e.g. C4: Count the speeches of a female speaker from the topic where only one female spoke.
      → 6 Factual, e.g. F3: What is the percentage of female speakers?
      → 5 Top 10, e.g. T4: Retrieve the top 10 longest topics (i.e., by number of paragraphs).
  21. ParlBench experimental run
      Test Machine
      → MacBook Pro + Mac OS X Lion 10.7.6 x64
      → CPUs: 2.8 GHz Intel Core i7 (2x2 cores)
      → Memory: 8GB
  22. ParlBench experimental run
      System Under Test
      → Virtuoso Open Source Edition v.06.01.3, native RDF store
      → default Virtuoso index scheme
      → configuration for loading large data sets
  23. ParlBench experimental run
      Experimental set-up
      → 8 test collections: Parties, Members, scaled Proceedings (from 1 to 100%)
      → single user mode
      → 1 run = 10 permutations of 19 queries (190 queries)
      → warm-up period: 5 runs (950 queries)
      → measuring period: 3 runs (570 queries)
      → query response time: mean over all the permutations of all the measured runs (10*3 = 30 executions per query)
      Scaling of proceedings
      Scaling factor   1%      2%     4%      8%      16%     32%    64%    100%
      # of triples     ∼0.5M   ∼1M    ∼1.9M   ∼3.9M   ∼7.6M   ∼15M   ∼23M   ∼36.5M
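The run schedule on this slide can be sketched as a small driver. This is only an illustration under stated assumptions, not the authors' script: `execute_query` is a hypothetical stand-in for sending a query to the store, and drawing each permutation as a random shuffle is an assumption (the slide does not say how the 10 permutations were chosen).

```python
import random
import statistics
import time
from collections import defaultdict

# The 19 query labels used on the slides.
QUERY_IDS = ["A0", "A1", "A2", "C0", "C1", "C2", "C3", "C4",
             "F0", "F1", "F2", "F3", "F4", "F5",
             "T0", "T1", "T2", "T3", "T4"]
PERMUTATIONS_PER_RUN = 10   # 1 run = 10 permutations of the 19 queries
WARMUP_RUNS = 5             # not timed
MEASURED_RUNS = 3           # timed


def execute_query(query_id):
    """Hypothetical placeholder: a real driver would send the query to the store."""
    return None


def one_run(rng, timings=None):
    """Execute one run; record per-query times only when `timings` is given."""
    executed = 0
    for _ in range(PERMUTATIONS_PER_RUN):
        order = QUERY_IDS[:]
        rng.shuffle(order)  # assumption: a permutation is a random shuffle
        for query_id in order:
            start = time.perf_counter()
            execute_query(query_id)
            elapsed = time.perf_counter() - start
            if timings is not None:
                timings[query_id].append(elapsed)
            executed += 1
    return executed


def benchmark(seed=0):
    """Warm up, then measure; return counts, raw timings and per-query means."""
    rng = random.Random(seed)
    warmup_queries = sum(one_run(rng) for _ in range(WARMUP_RUNS))
    timings = defaultdict(list)
    measured_queries = sum(one_run(rng, timings) for _ in range(MEASURED_RUNS))
    mean_time = {q: statistics.mean(ts) for q, ts in timings.items()}
    return warmup_queries, measured_queries, timings, mean_time


warmup_queries, measured_queries, timings, mean_time = benchmark()
```

With these constants the driver issues 950 warm-up queries and 570 measured ones, and each query's mean is taken over 30 timed executions, matching the figures on the slide.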
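  24. Loading Time, log2(time, sec)
      [Figure: loading time in seconds (y-axis, log2 scale, 1 to 4096) against size of proceedings in % (x-axis: 1, 2, 4, 8, 16, 32, 64, 100)]
  25. Query Response Time by Micro-Benchmarks, log2(SUM(time), sec)
      [Figure: sum of execution time in seconds (y-axis, log2 scale, 0.25 to 256) against size of proceedings in % (x-axis: 1, 2, 4, 8, 16, 32, 64, 100); one series per micro-benchmark: average, count, factual, top10]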
  26. Query Response Time on the Largest Collection (∼36M)
      [Bar chart: time in seconds (y-axis, 0 to 170) per query (x-axis), bars grouped by micro-benchmark: average, count, factual, top10]
      Query   Time, sec
      A0      45.9422
      A1      39.5885
      A2      47.1268
      C0      2.4212
      C1      10.6883
      C2      1.4383
      C3      0.8649
      C4      30.0118
      F0      7.9996
      F1      78.1858
      F2      22.3778
      F3      22.4192
      F4      0.1053
      F5      48.8887
      T0      0.8357
      T1      10.2813
      T2      41.6915
      T3      0.9241
      T4      168.1313
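As a rough illustration of how the per-query times relate to the per-micro-benchmark sums plotted on the previous slide, the sketch below adds up the bar labels of this chart by group. It assumes the 19 labels appear in query order (A0 to T4) and that the first letter of a query ID identifies its micro-benchmark; both are readings of the slide, not something the authors state.

```python
from collections import defaultdict

# Bar labels from the chart, assumed to be in query order A0..T4 (seconds).
RESPONSE_TIME_SEC = {
    "A0": 45.9422, "A1": 39.5885, "A2": 47.1268,
    "C0": 2.4212, "C1": 10.6883, "C2": 1.4383, "C3": 0.8649, "C4": 30.0118,
    "F0": 7.9996, "F1": 78.1858, "F2": 22.3778, "F3": 22.4192,
    "F4": 0.1053, "F5": 48.8887,
    "T0": 0.8357, "T1": 10.2813, "T2": 41.6915, "T3": 0.9241, "T4": 168.1313,
}

# Assumption: the first letter of the query ID names the micro-benchmark.
MICRO_BENCHMARK = {"A": "average", "C": "count", "F": "factual", "T": "top10"}


def sum_by_micro_benchmark(times):
    """Sum the response times of all queries in each micro-benchmark."""
    totals = defaultdict(float)
    for query_id, seconds in times.items():
        totals[MICRO_BENCHMARK[query_id[0]]] += seconds
    return dict(totals)


totals = sum_by_micro_benchmark(RESPONSE_TIME_SEC)
slowest = max(RESPONSE_TIME_SEC, key=RESPONSE_TIME_SEC.get)

for name, seconds in sorted(totals.items()):
    print(f"{name:8s} {seconds:9.4f} s")
print("slowest query:", slowest)
```

Under that reading, T4 is the slowest single query, which fits the deck turning to T4 on the next slide.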
  27. T4: Retrieve top 10 longest topics (i.e., number of paragraphs).

      SELECT ?topic COUNT(?par) as ?numOfPars
      WHERE {
        ?topic rdf:type parlipro:Topic .
        ?speech rdf:type parlipro:Speech .
        ?speech dcterms:hasPart ?par .
        ?par rdf:type parlipro:Paragraph .
        { ?topic dcterms:hasPart ?speech . }
        UNION
        { ?topic dcterms:hasPart ?sd .
          ?sd rdf:type parlipro:StageDirection .
          ?sd dcterms:hasPart ?speech . }
        UNION
        { ?topic dcterms:hasPart ?scene .
          ?scene rdf:type parlipro:Scene .
          ?scene dcterms:hasPart ?speech . }
      }
      GROUP BY ?topic
      ORDER BY DESC(?numOfPars)
      LIMIT 10
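To make the three UNION branches of T4 concrete, here is a minimal plain-Python sketch of what the query counts, evaluated over a handful of toy triples. The identifiers `t1`, `sc1`, `s1`, `p1` and so on are hypothetical, and the code imitates the query's logic by hand; it is not a SPARQL engine and not part of ParlBench.

```python
from collections import Counter

TYPE = "rdf:type"
HAS_PART = "dcterms:hasPart"

# Toy data with hypothetical IDs: topic t1 holds a speech directly and one
# inside a scene; topic t2 holds a speech inside a stage direction.
TRIPLES = {
    ("t1", TYPE, "parlipro:Topic"), ("t2", TYPE, "parlipro:Topic"),
    ("sc1", TYPE, "parlipro:Scene"), ("sd1", TYPE, "parlipro:StageDirection"),
    ("s1", TYPE, "parlipro:Speech"), ("s2", TYPE, "parlipro:Speech"),
    ("s3", TYPE, "parlipro:Speech"),
    ("p1", TYPE, "parlipro:Paragraph"), ("p2", TYPE, "parlipro:Paragraph"),
    ("p3", TYPE, "parlipro:Paragraph"), ("p4", TYPE, "parlipro:Paragraph"),
    ("t1", HAS_PART, "s1"),                           # branch 1: direct
    ("t1", HAS_PART, "sc1"), ("sc1", HAS_PART, "s2"), # branch 3: via Scene
    ("t2", HAS_PART, "sd1"), ("sd1", HAS_PART, "s3"), # branch 2: via StageDirection
    ("s1", HAS_PART, "p1"), ("s1", HAS_PART, "p2"),
    ("s2", HAS_PART, "p3"), ("s3", HAS_PART, "p4"),
}


def instances(triples, cls):
    """All subjects typed as `cls`."""
    return {s for s, p, o in triples if p == TYPE and o == cls}


def parts(triples, whole):
    """All objects that `whole` links to via dcterms:hasPart."""
    return {o for s, p, o in triples if s == whole and p == HAS_PART}


def top_topics_by_paragraphs(triples, limit=10):
    """Count paragraphs per topic, reaching speeches directly or via a container."""
    speeches = instances(triples, "parlipro:Speech")
    paragraphs = instances(triples, "parlipro:Paragraph")
    containers = (instances(triples, "parlipro:StageDirection")
                  | instances(triples, "parlipro:Scene"))
    counts = Counter()
    for topic in instances(triples, "parlipro:Topic"):
        for child in parts(triples, topic):
            if child in speeches:
                reached = [child]
            elif child in containers:
                reached = [s for s in parts(triples, child) if s in speeches]
            else:
                reached = []
            for speech in reached:
                n = sum(1 for p in parts(triples, speech) if p in paragraphs)
                if n:  # topics without paragraphs yield no solutions in the query
                    counts[topic] += n
    return counts.most_common(limit)


result = top_topics_by_paragraphs(TRIPLES)
print(result)
```

On this toy graph, `t1` collects three paragraphs (two from the direct speech, one through the scene) and `t2` collects one through the stage direction.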
  28. Characteristics of ParlBench queries
      Micro-benchmarks and their queries:
      Average: A0 A1 A2
      Count: C0 C1 C2 C3 C4
      Factual: F0 F1 F2 F3 F4 F5
      Top 10: T0 T1 T2 T3 T4

      Number of queries (out of 19) marked as using each SPARQL feature:
      FILTER               8
      UNION                9
      LIMIT                7
      ORDER BY             7
      GROUP BY             12
      COUNT                17
      DISTINCT             4
      AVG                  3
      negation             1
      OPTIONAL             2
      subquery             7
      blank node scoping   9

      # of triple patterns per query:
      A0 10, A1 9, A2 12, C0 5, C1 5, C2 5, C3 6, C4 13, F0 8, F1 16, F2 6, F3 6, F4 2, F5 4, T0 2, T1 4, T2 9, T3 3, T4 11
  29. T2: Retrieve top 10 topics with the most speeches

      SELECT ?topic COUNT(?speech) as ?numOfSpeeches
      WHERE {
        ?topic rdf:type parlipro:Topic .
        ?speech rdf:type parlipro:Speech .
        { ?topic dcterms:hasPart ?speech . }
        UNION
        {
          { ?topic dcterms:hasPart ?sd .
            ?sd rdf:type parlipro:StageDirection .
            ?sd dcterms:hasPart ?speech . }
          UNION
          { ?topic dcterms:hasPart ?scene .
            ?scene rdf:type parlipro:Scene .
            ?scene dcterms:hasPart ?speech . }
        }
      }
      GROUP BY ?topic
      ORDER BY DESC(?numOfSpeeches)
      LIMIT 10
  30. Conclusion
      → SPARQL-benchmark for e-publishing applications
      → large collections of real data
      → intuitive analytical queries
      → micro-benchmarks for SPARQL features analysis
      Future work
      → enlarge the data sets
        - votes in proceedings
        - interlink proceedings with the Dutch legislation data set [1] (>280M triples)
        - tagged entities: more tags
      → extend the queries
        - SPARQL 1.1: path expressions
        - Linked Open Data integration scenario
      → run the benchmark on other RDF stores
  31. Thank you!
      ParlBench resources
      → data access:
        → resolvable URIs
        → RDF data dumps at http://data.politicalmashup.nl/RDF/data/
      → experimental run:
        website describing an experimental run: http://data.politicalmashup.nl/RDF/
        public SPARQL-endpoint to a test collection: http://data.politicalmashup.nl/sparql/
      → scripts are available at http://data.politicalmashup.nl/RDF/scripts/
      → ParliPro vocabulary:
        RDF representation: http://purl.org/vocab/parlipro#
        HTML representation: http://data.politicalmashup.nl/RDF/vocabularies/parlipro
  32. Thank you! Questions?
  33. References I
      [1] Dutch national regulations in CEN MetaLex. http://doc.metalex.eu/
      [2] The Parliamentary Proceedings (ParliPro) Vocabulary. http://purl.org/vocab/parlipro#
      [3] BIO: A vocabulary for biographical information. http://vocab.org/bio
      [4] The Friend of a Friend Vocabulary (FOAF). http://xmlns.com/foaf/0.1/
      [5] The DBpedia Ontology. http://dbpedia.org/ontology/
      [6] The Modular Unified Tagging Ontology (MUTO). http://muto.socialtagging.org/
      [7] Basic Geo (WGS84 lat/long) Vocabulary. http://www.w3.org/2003/01/geo/wgs84_pos#
  34. References II
      [8] Dublin Core Metadata Element Set, http://purl.org/dc/elements/1.1/, and Dublin Core collection description Terms, http://purl.org/dc/terms/
  35. Statistics of the benchmark data sets
      dataset           # of triples   size    # of files
      members           33,885         14M     3,583
      parties           510            612K    151
      proceedings       36,503,688     4.15G   51,233
      paragraphs        11,250,295     5.77G   51,233
      tagged entities   34,449,033     2.57G   34,755
      TOTAL             82,237,411     ∼13G    140,955
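The TOTAL row can be cross-checked with a few lines of arithmetic. The sketch below simply copies the triple and file counts from the table; the dictionary layout and names are illustrative choices, not part of the benchmark.

```python
# (number of triples, number of files) per data set, copied from the table.
DATASETS = {
    "members": (33_885, 3_583),
    "parties": (510, 151),
    "proceedings": (36_503_688, 51_233),
    "paragraphs": (11_250_295, 51_233),
    "tagged entities": (34_449_033, 34_755),
}

total_triples = sum(triples for triples, _ in DATASETS.values())
total_files = sum(files for _, files in DATASETS.values())

# Share of each data set in the overall triple count.
share = {name: triples / total_triples
         for name, (triples, _) in DATASETS.items()}

print(total_triples)  # 82237411
print(total_files)    # 140955
```

Both sums agree with the TOTAL row, and the triple total is consistent with the rounded ∼82.2M quoted earlier in the deck.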
  36. Statistics of the ParlBench data sets
      Number of classes: 9
      Number of properties: 25
      Number of instances per class:
      Member             3,583
      Party              151
      Proceedings        51,233
      Topic              102,289
      Stage Direction    1,776,598
      Scene              189,226
      Speech             2,495,969
      Paragraph          11,211,520
      Tagged Entity      11,383,787
  37. Parliamentary Proceedings: example of encoding
      [Graph diagram; recoverable node and edge labels:]
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483            rdf:type parlipro:ParliamentaryProceedings
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1          rdf:type parlipro:Topic
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7        rdf:type parlipro:Scene
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30     rdf:type parlipro:Speech
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1   rdf:type parlipro:Paragraph
      Other edges: dcterms:hasPart (between the nodes above), parlipro:refMember (pm:nl.m.02547), parlipro:refParty (pm:nl.p.gl), dc:date (1999-12-08)
  38. Members: example of encoding
      [Graph diagram; recoverable node and edge labels:]
      Nodes: pm:nl.m.02547 (rdf:type ParliamentMember), _:bio (rdf:type bio:Biography)
      Properties and values: owl:sameAs nl-dbpedia:Marijke_Vos; owl:sameAs en-dbpedia:Marijke_Vos; bio:biography _:bio; foaf:gender dbpedia-ont:Female; foaf:birthday 1957-05-04; dbpedia-ont:birthPlace Leidschendam
  40. Paragraphs and Tagged Entities: example of encoding
      Paragraph
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1   rdf:type parlipro:Paragraph
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1   has text "Blijkbaar is er nu het een en ander mis in de relatie tussen de Europese Unie en de Russische Federatie. ..." (English: "Apparently something is now amiss in the relationship between the European Union and the Russian Federation. ...")
      Tagged Entity
      pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1   muto:hasTag _:tag
      _:tag                                              muto:hasAutoMeaning nl-dbpedia:Rusland
      nl-dbpedia:Rusland                                 rdf:type geo:SpatialThing
