SAVETA SPgen ISWC 2018 – Monterey CA
SPgen: A Benchmark Generator for Spatial Link Discovery Tools
T. Saveta (1), I. Fundulaki (1), G. Flouris (1), and A.-C. Ngonga-Ngomo (2)
(1) Institute of Computer Science - FORTH, Greece
(2) University of Paderborn, Germany
ISWC 2018 – Monterey CA
LOD Cloud and Geospatial Data
UK Postcodes
LinkedGeoData
GeoWordNet
Ocean Drilling - Forams
GeoNames Semantic Web
Pleiades
Ocean Drilling - Janus Age Models
Weather Stations
OpenMobileNetwork
GeoLinkedData
EARTh
Linked Crowdsourced Data
greek-administrative-geography
The Getty Thesaurus of Geographic Names
Linked Logainm
FAO geopolitical ontology
NUTS (GeoVocab)
AgreementMaker
Strabon
OntoIdea
Silk
RADON
Benchmarking Link Discovery Systems for Geospatial Data
Benchmarks are essential to determine the effectiveness and the efficiency of the proposed techniques and tools
• Compare them
• Assess their performance
• «Push» the systems to get better
Goal: develop benchmarks for geo-spatial link discovery systems
Link Discovery Benchmark Ingredients
Benchmarks are organized into test cases, each addressing a different kind of requirement
Datasets
The raw material of the benchmarks. These are the source and the target datasets that will be matched together to find the links
Gold Standard
The “correct answer sheet” used to judge the completeness and soundness of the link discovery algorithms
Metrics
The performance metric(s) that determine the systems' behavior and performance
Spatial Benchmark Generator (SPgen)
Only a limited number of benchmarks target the problem of linking geo-spatial entities
Proposal: Spatial Benchmark Generator (SPgen)
• Differs from the classical, mostly string-based approaches
• Generic, schema-agnostic and choke-point-based
• Tests the performance of link discovery systems that deal with DE-9IM (Dimensionally Extended nine-Intersection Model), a standard for describing topological relations between two geometries
• Integrated into the HOBBIT platform
SPgen Overview
Source Dataset
Traces (lists of (longitude, latitude) pairs) represented as LineStrings in the Well-Known Text (WKT) format
Target Dataset
Traces represented as LineStrings or Polygons in the WKT format
Gold Standard
Stores pairs of matched source and target traces
Generated using RADON to ensure completeness and correctness
Given a WKT geometry s as source and a DE-9IM relation r, we generate a WKT geometry t as target such that their intersection matrix follows the definition of relation r.
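To make "follows the definition of relation r" concrete: a DE-9IM intersection matrix is commonly written as a nine-character string over F, 0, 1 and 2, and each named relation corresponds to one or more patterns over T, F, * and the digits. The sketch below is illustrative only and is not taken from SPgen. The function and variable names are hypothetical, the patterns are the standard OGC/JTS ones, and the Crosses and Overlaps entries are the LineString/LineString forms.

```python
# DE-9IM patterns for named relations (standard OGC/JTS definitions).
# Crosses and Overlaps are listed in their LineString/LineString form only.
RELATION_PATTERNS = {
    "Equals": ["T*F**FFF*"],
    "Disjoint": ["FF*FF****"],
    "Touches": ["FT*******", "F**T*****", "F***T****"],
    "Contains": ["T*****FF*"],
    "Within": ["T*F**F***"],
    "Covers": ["T*****FF*", "*T****FF*", "***T**FF*", "****T*FF*"],
    "CoveredBy": ["T*F**F***", "*TF**F***", "**FT*F***", "**F*TF***"],
    "Intersects": ["T********", "*T*******", "***T*****", "****T****"],
    "Crosses": ["0********"],
    "Overlaps": ["1*T***T**"],
}


def matches(matrix, pattern):
    """Return True if a 9-character DE-9IM matrix satisfies a pattern.

    '*' accepts anything, 'T' accepts any non-empty intersection (0, 1 or 2),
    and 'F', '0', '1', '2' must match the matrix cell exactly.
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must both have 9 characters")
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue
        if want == "T":
            if cell not in {"0", "1", "2"}:
                return False
        elif cell != want:
            return False
    return True


def relations_holding(matrix):
    """List the named relations whose pattern(s) the matrix satisfies."""
    return [
        name
        for name, patterns in RELATION_PATTERNS.items()
        if any(matches(matrix, p) for p in patterns)
    ]


# Two LineStrings with nothing in common.
print(relations_holding("FF1FF0102"))
# Two LineStrings whose interiors meet in a single point.
print(relations_holding("0F1FF0102"))
```

A generated target t is acceptable for relation r exactly when the matrix computed for (s, t) satisfies one of the patterns listed for r.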
SPgen Choke Points & Key Performance Indicators
Choke Points
CP1: Scalability
• Large datasets
• Large number of points for each trace
CP2: Output Quality Metric
• Precision
• Recall
• F-measure
CP3: Time Performance
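The output quality metrics of CP2 can be computed by comparing the links a system returns against the gold standard. The following is a minimal sketch, assuming links are represented as (source id, target id) pairs; the function name and the identifiers in the example are hypothetical.

```python
def evaluate(gold_links, found_links):
    """Compute precision, recall and F-measure for a set of discovered links.

    Both arguments are iterables of (source_id, target_id) pairs.
    """
    gold = set(gold_links)
    found = set(found_links)
    true_positives = len(gold & found)

    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


gold = {("s1", "t1"), ("s2", "t2")}
found = {("s1", "t1"), ("s3", "t3")}
print(evaluate(gold, found))  # one correct link out of two returned and two expected
```

A system that returns exactly the gold standard scores 1.0 on all three metrics.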
SPgen: Source Dataset Generation
[Figure: SPgen pipeline. Components: Input Dataset (Traces); Test Case Generation Parameters; Source Data Generation; Source Data; Target Data Generation (JTS Extension); Target Data; Gold Standard Generation; Gold Standard]
Data Generators
TomTom Data Generator
GPSTrace Generator
• Trace: a list of (longitude, latitude) pairs recorded by one device (phone, car, etc.)
Spaten (Spatio-temporal and textual Big Data Generator)
We did not run their data generator but used the 163,845 already generated traces, which came from 9,460 users and 38,800,019 GPSTraces
The generator can produce 14 users per day using the free version of the Google Maps API (2,500 requests per day)
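A trace as defined above, a list of (longitude, latitude) pairs, maps directly onto a WKT LineString, since WKT writes each coordinate as "x y" (longitude first). The sketch below shows one way to serialize it; the function name is hypothetical and this is not the generators' own code.

```python
def trace_to_wkt(trace):
    """Serialize a trace, given as (longitude, latitude) pairs, to a WKT LineString."""
    if len(trace) < 2:
        raise ValueError("a LineString needs at least two points")
    # WKT separates the two ordinates with a space and points with a comma.
    coords = ", ".join(f"{lon} {lat}" for lon, lat in trace)
    return f"LINESTRING ({coords})"


print(trace_to_wkt([(23.7, 37.9), (23.8, 38.0)]))
# prints: LINESTRING (23.7 37.9, 23.8 38.0)
```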
SPgen: Target Dataset Generation
[Figure: SPgen pipeline. Components: Input Dataset (Traces); Test Case Generation Parameters; Source Data Generation; Source Data; Target Data Generation (JTS Extension); Target Data; Gold Standard Generation; Gold Standard]
DE-9IM Topological Relations between LineStrings
[Figure panels: Equals; Disjoint; Touches; Contains (Within) & Covers (Covered By); Intersects; Crosses; Overlaps]
DE-9IM Topological Relations between a LineString and a Polygon
[Figure panels: Disjoint; Touches; Intersects; Crosses; Contains (Within) & Covers (Covered By)]
• Equals and Overlaps cannot hold between a LineString and a Polygon
Disjoint example (LineString/LineString)
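One simple way to obtain a target LineString that is Disjoint from a source is to translate the source past its own bounding box: if the two bounding boxes do not overlap, the geometries cannot share a point. This is only an illustrative sketch under that assumption, not SPgen's actual generation procedure (which the slides describe as a JTS extension); all names and the default margin are hypothetical.

```python
def bounding_box(coords):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) points."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def disjoint_target(source, margin=0.001):
    """Shift the source along x by its bounding-box width plus a positive margin."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    min_x, _, max_x, _ = bounding_box(source)
    shift = (max_x - min_x) + margin
    return [(x + shift, y) for x, y in source]


def boxes_overlap(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes share at least one point."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


source = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
target = disjoint_target(source)
# Non-overlapping bounding boxes imply the two LineStrings are disjoint.
print(boxes_overlap(bounding_box(source), bounding_box(target)))
```

The target keeps the shape and number of points of the source, so it stresses a system in the same way while guaranteeing an empty intersection.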
SPgen: Gold Standard Generation
[Figure: SPgen pipeline. Components: Input Dataset (Traces); Test Case Generation Parameters; Source Data Generation; Source Data; Target Data Generation (JTS Extension); Target Data; Gold Standard Generation; Gold Standard]
Experiments for SPgen
Systems to test
Spatial Link Discovery systems: RADON, Silk, AML, OntoIdea
Tasks: for each data generator (TomTom/Spaten) we divided the tasks into 4 subtasks; systems were asked to match the geometries considering a given relation
SLL & LLL: LineStrings/LineStrings, Small dataset of 200 instances and Large dataset of 2K instances
SLP & LLP: LineStrings/Polygons, Small dataset of 200 instances and Large dataset of 2K instances
Experiments for SPgen
Aims
Scalability
• Due to a limitation of Silk, we had to select traces no larger than 64 KB
Output Quality Metric
• Precision, Recall and F-measure (all were equal to 1.0)
Time Performance, to show:
• How time performance relates to the size of the datasets
• How time performance relates to the different data generators
• The time difference between the topological relations
• The time difference between the geometries
Experimental setup: platform time limit = 75 minutes
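The 64 KB restriction on traces can be applied as a simple pre-filter. The slides do not say how the size was measured, so the sketch below assumes it refers to the UTF-8 size of the serialized WKT string and that 1 KB is 1024 bytes; both are assumptions, and the names are hypothetical.

```python
MAX_TRACE_BYTES = 64 * 1024  # assumption: 64 KB taken as 65,536 bytes


def within_size_limit(wkt, max_bytes=MAX_TRACE_BYTES):
    """True if the UTF-8 encoding of a WKT string fits within the size limit."""
    return len(wkt.encode("utf-8")) <= max_bytes


def select_traces(wkts, max_bytes=MAX_TRACE_BYTES):
    """Keep only the traces whose serialized form respects the limit."""
    return [wkt for wkt in wkts if within_size_limit(wkt, max_bytes)]


small = "LINESTRING (0 0, 1 1)"
# 20,000 points at roughly 5 bytes each is well above 64 KB.
large = "LINESTRING (" + ", ".join("0 0" for _ in range(20000)) + ")"
print(len(select_traces([small, large])))
```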
Experiments for SPgen (LineStrings/LineStrings) [SLL]
Time rate (in ms) while increasing population for each topological relation for RADON, Silk, AML and OntoIdea.
*Platform time limit = 75 mins
Experiments for SPgen (LineStrings/LineStrings) [LLL]
Time rate (in ms) while increasing population for each topological relation for RADON, Silk, AML and OntoIdea.
*Platform time limit = 75 mins
Experiments for SPgen (LineStrings/Polygons) [SLP]
Time rate (in ms) while increasing population for each topological relation for RADON, Silk, AML.
*Platform time limit = 75 mins
Experiments for SPgen (LineStrings/Polygons) [LLP]
Time rate (in ms) while increasing population for each topological relation for RADON, Silk, AML.
*Platform time limit = 75 mins
Experiments General Remarks
RADON
• Addressed all the tasks
• Best performance for the LineString/Polygon tasks
• Improvements in Touches and Intersects relations for the LineString/LineString tasks
AML
• Performs extremely well in most cases
• Improvements in LineString/Polygon tasks and in Disjoint relations
Silk
• Improvements in Touches, Intersects and Overlaps relations for the LineString/LineString tasks, and in the Disjoint relation for the LineString/Polygon tasks
OntoIdea
• Handles small datasets efficiently, but this does not hold for larger datasets
• Did not support LineString/Polygon tasks in this version
Experiments General Remarks
All the systems needed
• More time for TomTom than for Spaten, due to the smaller number of points per trace in the latter
• More time for the LineString/LineString tasks for the Touches, Intersects and Crosses relations
• Approximately the same time for both the LineString/LineString and LineString/Polygon tasks for the Disjoint relation
• Less time for the LineString/LineString tasks for the Contains, Within, Covers and Covered By relations
Depending on the test case we can choose the appropriate system
Concluding Remarks & Future Work
Conclusions
• SPgen checks whether spatial link discovery systems can identify DE-9IM topological relations between LineStrings and Polygons
• Experiments with RADON, Silk, AML and OntoIdea
• Choke points: Scalability, Output Quality Metric and Time Performance
Next
• Implement DE-9IM relations for all possible combinations of different geometries (e.g. Polygon/Polygon, combinations with Points, LineStrings and Polygons, etc.)
• More data generators (already added the DEBS dataset)
• More systems (already added Strabon)
Thank you!
Questions?
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
