Geographica: A Benchmark for Geospatial RDF Stores

1,201 views
955 views

Published on

Geospatial extensions of SPARQL like GeoSPARQL and stSPARQL have recently been defined and corresponding geospatial RDF stores have been implemented. However, there is no widely used benchmark for evaluating geospatial RDF stores which takes into account recent advances to the state of the art in this area. In this paper, we develop a benchmark, called Geographica, which uses both real-world and synthetic data to test the o ffered functionality and the performance of some prominent geospatial RDF stores.

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,201
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
27
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Geographica: A Benchmark for Geospatial RDF Stores

  1. 1. Benchmarking Geospatial RDF Stores George Garbis, Kostis Kyzirakos, Manolis Koubarakis Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens, Greece 1st International Workshop on Benchmarking RDF Systems (BeRSys 2013)
  2. 2. 5/26/2013 2 Outline • Motivation • SPARQL extensions for querying geospatial data expressed in RDF • State-of-the-art geospatial RDF stores • Related benchmarks • The benchmark Geographica • Evaluating the performance of geospatial RDF stores using Geographica • Conclusions
  3. 3. 4 Geospatial data on the Web
  4. 4. 5 Open Government Data
  5. 5. 6 Linked geospatial data – Ordnance Survey
  6. 6. 7 Linked geospatial data – Open Street Map
  7. 7. 8 Linked geospatial data – http://www.linkedopendata.gr
  8. 8. 9 Linked geospatial data – Wildfire Monitoring
  9. 9. 5/26/2013 10 Outline • Motivation • SPARQL extensions for querying geospatial data expressed in RDF • State-of-the-art geospatial RDF stores • Related benchmarks • The benchmark Geographica • Evaluating the performance of geospatial RDF stores using Geographica • Conclusions
  10. 10. Our Contributions • The data model stRDF and the query language stSPARQL • The system Strabon 11
  11. 11. The Data Model stRDF • stRDF stands for spatiotemporal RDF. • It is an extension of the W3C standard RDF for the representation of geospatial data that may change over time. • stRDF extends RDF with: • Spatial literals encoded in OGC standards Well- Known Text or GML • New datatypes for spatial literals (strdf:WKT, strdf:GML and strdf:geometry) • Valid time of triples (ignored in this talk) [ ESWC 2013 ]
  12. 12. stRDF: An example
  13. 13. stRDF: An example (WKT) ex:BurntArea1 rdf:type noa:BurntArea. ex:BurntArea1 noa:hasID "1"^^xsd:decimal. ex:BurntArea1 noa:hasArea "23.7636"^^xsd:double. ex:BurntArea1 strdf:hasGeometry "POLYGON(( 38.16 23.7, 38.18 23.7, 38.18 23.8, 38.16 23.8, 38.16 23.7)); <http://spatialreference.org/ref/epsg/4121/>"^^strdf:WKT . Spatial Literal (OpenGIS Simple Features) Spatial Data Type Well-Known Text
  14. 14. stRDF: An example (GML) ex:BurntArea1 rdf:type noa:BurntArea. ex:BurntArea1 noa:hasID "1"^^xsd:decimal. ex:BurntArea1 noa:hasArea "23.7636"^^xsd:double. ex:BurntArea1 strdf:hasGeometry "<gml:Polygon srsName='http://www.opengis.net/def/crs/EPSG/0/4121'> <gml:outerBoundaryIs> <gml:LinearRing> <gml:coordinates>38.16,23.70 38.18,23.70 38.18, 23.80 38.16,23.80,38.16 23.70 </gml:coordinates> </gml:LinearRing> </gml:outerBoundaryIs> </gml:Polygon>"^^strdf:GML . Spatial Literal (GML Simple Features Profile) Spatial Data Type GML
  15. 15. • Find all burnt forests close to a city SELECT ?BA ?BAGEO WHERE { ?R rdf:type noa:Region . ?R strdf:hasGeometry ?RGEO . ?R noa:hasLandCover ?F . ?F rdfs:subClassOf clc:Forests . ?CITY rdf:type dbpedia:City . ?CITY strdf:hasGeometry ?CGEO . ?BA rdf:type noa:BurntArea . ?BA strdf:hasGeometry ?BAGEO . FILTER(strdf:anyInteract(?RGEO,?BAGEO) && strdf:distance(?BAGEO,?CGEO) < 0.02) } stSPARQL: An example Spatial Functions (OGC Simple Feature Access)
  16. 16. stSPARQL: Geospatial SPARQL 1.1 • We start from SPARQL 1.1. • We add a SPARQL extension function for each function defined in the OGC standard OpenGIS Simple Feature Access – Part 2: SQL option (ISO 19125) for adding geospatial data to relational DBMSs and SQL. • We add appropriate geospatial extensions to SPARQL 1.1 Update language
  17. 17. stSPARQL (cont’d) • Basic functions • Get a property of a geometry (e.g., strdf:srid) • Get the desired representation of a geometry (e.g., strdf:AsText) • Test whether a certain condition holds (e.g., strdf:IsEmpty, strdf:IsSimple) • Functions for testing topological spatial relationships (e.g., strdf:equals, strdf:intersects) • OGC Simple Features Access, Egenhofer, RCC-8 • Spatial analysis functions • Construct new geometric objects from existing geometric objects (e.g., strdf:buffer, strdf:intersection, strdf:convexHull) • Spatial metric functions (e.g., strdf:distance, strdf:area) • Spatial aggregate functions (e.g., strdf:union, strdf:extent)
  18. 18. stSPARQL (cont’d) • SELECT clause Construction of new geometries (e.g., strdf:buffer(?geo, 0.1)) Spatial aggregate functions (e.g., strdf:extent(?geo)) Metric functions (e.g., strdf:area(?geo)) • FILTER clause Functions for testing topological spatial relationships between spatial terms (e.g., strdf:contains(?G1, strdf:union(?G2, ?G3))) Numeric expressions involving spatial metric functions (e.g.,strdf:area(?G1)<=2*strdf:area(?G2)+1) • HAVING clause Spatial aggregate functions and spatial metric functions or functions testing for topological relationships between spatial terms (e.g., strdf:area(strdf:union(?geo))>1)
  19. 19. • Isolate the parts of the burnt areas that lie in coniferous forests. SELECT ?burntArea (strdf:intersection(?baGeom, strdf:union(?fGeom)) AS ?burntForest) WHERE { ?burntArea rdf:type noa:BurntArea; noa:hasGeometry [ noa:hasSerialization ?baGeo ]. ?forest rdf:type noa:Region; clc:hasLandCover noa:coniferousForest; clc:hasGeometry [ clc:hasSerialization ?fGeom ]. FILTER(strdf:intersects(?baGeom,?fGeom)) } GROUP BY ?burntArea ?baGeom stSPARQL: An example
  20. 20. GeoSPARQL Core Topology Vocabulary Extension - relation family Geometry Extension - serialization - version Geometry Topology Extension - serialization - version - relation family Query Rewrite Extension - serialization - version - relation family RDFS Entailment Extension - serialization - version - relation family Parameters • Serialization • WKT • GML • Relation Family • Simple Features • RCC8 • Egenhofer
  21. 21. System Language Index Geometries CRS support Comments on Functionality Strabon stSPARQL/ GeoSPARQL* R-tree- over-GiST WKT / GML support Yes • OGC-SFA • Egenhofer • RCC-8 Parliament GeoSPARQL R-Tree WKT / GML support Yes •OGC-SFA •Egenhofer •RCC-8 Brodt et al. (RDF-3X) SPARQL R-Tree WKT support No OGC-SFA Perry SPARQL-ST R-Tree GeoRSS GML Yes RCC8 AllegroGraph Extended SPARQL Distribution sweeping technique 2D point geometries Partial •Buffer •Bounding Box •Distance OWLIM Extended SPARQL Custom 2D point geometries (W3C Basic Geo Vocabulary) No •Point-in-polygon •Buffer •Distance Virtuoso SPARQL R-Tree 2D point geometries (in WKT) Yes SQL/MM (subset) uSeekM GeoSPARQL R-tree-over GiST WKT support No OGC-SFA
  22. 22. 5/26/2013 23 Outline • Motivation • SPARQL extensions for querying geospatial data expressed in RDF • State-of-the-art geospatial RDF stores • Related benchmarks • The benchmark Geographica • Evaluating the performance of geospatial RDF stores using Geographica • Conclusions
  23. 23. 5/26/2013 24 Related Work • Benchmarks for SPARQL query processing: • LUBM (JWS 2005) • DBpedia SPARQL Benchmark (ISWC 2011) • … • Benchmarks for geospatial relational DBMS: • Sequoia 2000 (SIGMOD 1993) • VESPA (BNCOD 2000) • Jackpine (ICDE 2011) • Benchmarks for spatial indexing and query processing operations • Benchmarks for geospatial RDF stores: • A Benchmark for Spatial Semantic Web Systems (Kolas, SSWS 2008)
  24. 24. 5/26/2013 25 The Benchmark Geographica • Aim: measure the performance of today’s geospatial RDF stores • Organized around two workloads: • Real-world workload: • Based on existing linked geospatial datasets and known application scenarios • Synthetic workload: • Measure performance in a controlled environment where we can play around with properties of the data and the queries. • Γεωγραφικά: 17-volume geographical encyclopedia by Στράβων (AD 17)
  25. 25. 5/26/2013 26 Real-World Workload • Datasets: Real-world datasets for the geographic area of Greece playing an important role in the LOD cloud or having complex geometries • LinkedGeoData (LGD) for rivers and roads in Greece • GeoNames for Greece • DBpedia for Greece • Greek Administrative Geography (GAG) • CORINE land cover (CLC) for Greece • Hotspots
  26. 26. 5/26/2013 27 Real-World Workload Datasets Dataset Size # of Triples # of Points # of Lines (max/min/avg points/line) # of Polygons (max/min/avg points/polygon) GAG 33MB 4K - - 325 (15K/4/400) CLC 401MB 630K - - 45K (5K/4/140) LGD 29MB 150K - 12K (1.6K/2/21) - GeoNames 45MB 400K 22K - - DBpedia 89MB 430K 8K - - Hotspots 90MB 450K - - 37K (4/4/4)
  27. 27. 5/26/2013 28 Real-World Workload Parts • For this workload, Geographica has two parts (following Jackpine): • Micro part: Tests primitive spatial functions offered by geospatial RDF stores • Macro part: Simulates some typical application scenarios
  28. 28. 5/26/2013 29 Real-World Workload Micro part • 29 queries that consist of one or two triple patterns and a spatial function. • Functions included: • Spatial analysis: boundary, envelope, convex hull, buffer, area • Topological: equals, intersects, overlaps, crosses, within, distance, disjoint • As used in spatial selections and spatial joins • Spatial aggregates: extent, union • Functions are applied to many representative types of geometries .
  29. 29. 5/26/2013 30 Example – spatial analysis • Construct the boundary of all polygons of CLC PREFIX geof: <http://www.opengis.net/def/function/geosparql/> SELECT ( geof:boundary(?o1) as ?ret ) WHERE { GRAPH <http://geographica.di.uoa.gr/dataset/clc> { ?s1 <http://geo.linkedopendata.gr/corine/ontology#asWKT> ?o1} }
  30. 30. 5/26/2013 31 Example – spatial selection • Find all points in Geonames that are contained in a given polygon. PREFIX geof: <http://www.opengis.net/def/function/geosparql/> SELECT ?s1 ?o1 WHERE { GRAPH <http://geographica.di.uoa.gr/dataset/geonames> { ?s1 <http://www.geonames.org/ontology#asWKT> ?o1 } FILTER( geof:sfWithin(?o1, "GIVEN_POLYGON_IN_WKT"^^<http://www.opengis.net/ont/geosparql# wktLiteral>)). }
  31. 31. 5/26/2013 32 Example – spatial join • Find all pairs of GAG polygons that overlap PREFIX geof: <http://www.opengis.net/def/function/geosparql/> SELECT ?s1 ?s2 WHERE { GRAPH <http://geographica.di.uoa.gr/dataset/gag> {?s1 <http://geo.linkedopendata.gr/gag/ontology/asWKT> ?o1} GRAPH <http://geographica.di.uoa.gr/dataset/clc> {?s2 <http://geo.linkedopendata.gr/corine/ontology#asWKT> ?o2} FILTER( geof:sfOverlaps(?o1, ?o2) ) }
  32. 32. Query Point Query Line Query Polygon Point Within Buffer Distance Within Disjoint Line Equals Crosses Intersects Disjoint Polygon Intersects Equals Overlaps Real-World Workload Micro part • Spatial Selections • Spatial Joins Point Line Polygon Point Equals Intersects Intersects Within Line Intersects Within Crosses Polygon Within Touches Overlaps
  33. 33. 5/26/2013 34 Real-World Workload Macro part • Reverse Geocoding: Attribute a street address and place to a given point. • Queries: • Find the closest populated place (from GeoNames) • Find the closest street (from LGD)
  34. 34. 5/26/2013 35 Real-World Workload Macro part • Web Map Search and Browsing • Queries: • Find the co-ordinates of a given POI based on thematic criteria (from GeoNames) • Find roads in a given bounding box around these co- ordinates (from LGD) • Find other POI in a given bounding box around these co- ordinates (from LGD)
  35. 35. 5/26/2013 36 Real-World Workload Macro part • Rapid Mapping for Fire Monitoring: representative of typical rapid mapping tasks carried out by space agencies in the case of an emergency
  36. 36. 5/26/2013 37 Real-World Workload Macro part • Rapid Mapping for Fire Monitoring • Queries: • Simple tasks retrieving background mapping information: • Find the land cover of areas inside a given bounding box (from CLC) • Find primary roads inside a given bounding box (from LGD) • Find capitals of prefectures inside a given bounding box (from GAG) • (Often) more complex main mapping tasks: • Find municipality boundaries inside a bounding box (from GAG) • Find coniferous forests which are on fire (from CLC and Hotspots) • Find road segments which may be damaged by fire (from LGD and Hotspots)
  37. 37. 5/26/2013 38 Experimental Evaluation Real-world workload • Geospatial RDF stores tested: Strabon, Parliament, uSeekM • Machine: Intel Xeon E5620, 12MB L3 cache, 2.4GHz, 24GB RAM, 4 HDD with RAID-5 • Micro part: • Metric: response time • Run 3 times and compute the median • Time out: 1 hour • Run both on warm caches and cold caches • Macro part: • Run each scenario many times for one hour with warm caches • Metric: Average time for a complete execution
  38. 38. 5/26/2013 39 Results Data Storage • Strabon is the slowest given the PostGIS tables and indices it is building. • uSeekM does much better using the Sesame native store • Parliament slows down due to geo:asWKT subproperty inferencing System Strabon uSeekM Parliament Storage time 550 sec 214 sec 250 sec
  39. 39. 5/26/2013 40 Results Micro part (cold caches)
  40. 40. 5/26/2013 41 Results Macro part Scenario Strabon uSeekM Parliament Reverse Geocoding 65 sec 0.77 sec 2.6 sec Map Search and Browsing 0.9 sec 0.6 sec 22.2 sec Rapid Mapping for Fire Monitoring 207.4 sec - -
  41. 41. 5/26/2013 42 Synthetic Workload • Goal: Evaluate performance in a controlled environment where we can vary the thematic and spatial selectivity of queries • Thematic selectivity: the fraction of the total geographic features of a dataset that satisfy the non- spatial part of a query • Spatial selectivity: the fraction of the total geographic features of a dataset which satisfy the topological relation in the FILTER clause of a query
  42. 42. 5/26/2013 43 Synthetic Workload • Dataset: As in VESPA, the produced datasets are geographic features on a synthetic map: • States in a country ( (n/3)2 ) • Land ownership (n2) • Roads (n) • POI (n2)
  43. 43. 5/26/2013 44 Synthetic Workload Ontology • Based roughly on the ontology of OpenStreetMap and the GeoSPARQL vocabulary • Tagging each feature with a key enables us to select a known fraction of features in a uniform way
  44. 44. 5/26/2013 45 Synthetic Workload Query template for spatial selections SELECT ?s WHERE { ?s ns:hasGeometry ?g. ?s c:hasTag ?tag. ?g ns:asWKT ?wkt. ?tag ns:hasKey “THEMA” FILTER(FUNCTION(?wkt, “GEOM”))} • Parameters: • ns: specifies the kind of feature (and geometry type) examined • THEMA: defines the thematic selectivity of the query using another parameter k • FUNCTION: specifies the topological function examined • GEOM: specifies a rectangle that controls the spatial selectivity of the query
  45. 45. 5/26/2013 46 Synthetic Workload Query template for spatial joins SELECT ?s1 ?s2 WHERE { ?s1 ns1:hasGeometry ?g1. ?s1 c:hasTag ?tag1. ?g1 ns:asWKT ?wkt1. ?tag1 ns:hasKey “THEMA” ?s2 ns2:hasGeometry ?g2. ?s2 c:hasTag ?tag2. ?g2 ns2:asWKT ?wkt2. ?tag2 ns2:hasKey “THEMA’” FILTER(FUNCTION(?wkt1, ?wkt2))}
  46. 46. 5/26/2013 47 Experimental Evaluation Synthetic workload • Geospatial RDF stores tested: Strabon, Parliament, uSeekM • Machine: Intel Xeon E5620, 12MB L3 cache, 2.4GHz, 24GB RAM, 4 HDD with RAID-5 • Details: • Metric: response time • Run 3 times and compute the median • Time out: 1 hour • Run both on warm caches and cold caches
  47. 47. 5/26/2013 48 Results Data Storage • We generate the synthetic dataset with n=512 and k=9. This results in: • 28900 states • 262144 land ownerships • 512 roads • 262144 points of interest • Size: 3,880,224 triples (745 MB) System Strabon uSeekM Parliament Storage time 221 sec 406 sec 462 sec
  48. 48. 5/26/2013 49 Results Synthetic Workload (Spatial Selections) Intersects Tag 1, cold caches Intersects Tag 512, cold caches
  49. 49. 5/26/2013 50 Results Synthetic Workload (Spatial Joins) Intersects
  50. 50. 5/26/2013 51 Outline • Motivation • SPARQL extensions for querying geospatial data expressed in RDF • State-of-the-art geospatial RDF stores • Related benchmarks • The benchmark Geographica • Evaluating the performance of geospatial RDF stores using Geographica • Conclusions
  51. 51. 5/26/2013 52 Conclusions • We defined Geographica, a new comprehensive benchmark for geospatial RDF stores, and used it to compare 3 relevant systems (Strabon, Parliament, uSeekM). • More implementation work is necessary in adding features to other geospatial RDF stores beyond the ones tested. • More real-world scenarios can be added. • Next target: spatiotemporal RDF stores
  52. 52. 5/26/2013 53 Advertisement  Strabon: http://strabon.di.uoa.gr  Geographica: http://geographica.di.uoa.gr  Tutorials/Survey paper  More at ESWC  Paper: Storing and Querying the Valid Time of Triples in Linked Geospatial Data  Demo: Sextant, a web tool for browsing and mapping Linked Geospatial Data http://strabon.di.uoa.gr:8080/sextant/  Project networking: TELEIOS http://www.earthobservatory.eu

×