Big Linked Data Interlinking - ExtremeEarth Open Workshop
1. Interlinking Big Linked Geospatial Data
George Papadakis
ExtremeEarth Online Workshop 9/12/2021
2. Geospatial Interlinking in action
Mandilaras et al.: “Ice monitoring with ExtremeEarth”. LASCAR workshop 2020, co-located with ESWC
3. Detected links
Two different types:
1. Proximity relations (such as dbp:near) with a distance threshold
• e.g., find all cities from S that are less than 1km away from any river in T
2. Topological relations according to the Dimensionally Extended 9-Intersection Model (DE-9IM)
• Equals
• Disjoint
• Touches
• Contains
• Covers
• Intersects
• Within
• CoveredBy
• Crosses
• Overlaps
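The proximity relation (type 1 above) can be illustrated with a brute-force sketch. It assumes planar coordinates in metres, cities given as points and rivers as polylines; the names `point_segment_distance` and `near_links` are hypothetical and not taken from the slides.

```python
from math import hypot

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:              # degenerate segment: a == b
        return hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def near_links(cities, rivers, threshold):
    """Return (city, river) pairs whose distance is below the threshold."""
    links = []
    for city, point in cities.items():
        for river, line in rivers.items():
            distance = min(point_segment_distance(point, a, b)
                           for a, b in zip(line, line[1:]))
            if distance < threshold:
                links.append((city, river))
    return links

# Illustrative data only: coordinates are assumed to be in metres.
cities = {"A": (0, 500), "B": (0, 5000)}
rivers = {"R": [(-1000, 0), (1000, 0)]}
print(near_links(cities, rivers, threshold=1000))
```

This compares every city with every river, which is exactly the quadratic cost that the Filtering step of the later slides is designed to avoid.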
5. GIA.nt: Geospatial Interlinking At large
Goes beyond existing Filtering methods in two ways:
1. Redundant pairs are inherently removed
2. Space tiling depends only on the source dataset → the target dataset is read from the disk (>50% lower memory footprint)
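The two points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GIA.nt implementation: geometries are reduced to MBR tuples `(min_x, min_y, max_x, max_y)`, the tile size `dx`, `dy` is passed in by the caller, and all function names are hypothetical. Only the source dataset is indexed; target MBRs are consumed one at a time, and a per-target `seen` set ensures that a pair sharing several tiles is emitted once.

```python
from collections import defaultdict
from math import floor

def tiles_of(mbr, dx, dy):
    """Yield the grid tiles (i, j) that an MBR overlaps."""
    min_x, min_y, max_x, max_y = mbr
    for i in range(floor(min_x / dx), floor(max_x / dx) + 1):
        for j in range(floor(min_y / dy), floor(max_y / dy) + 1):
            yield (i, j)

def build_index(source_mbrs, dx, dy):
    """Index only the source dataset: tile -> list of source ids."""
    index = defaultdict(list)
    for sid, mbr in enumerate(source_mbrs):
        for tile in tiles_of(mbr, dx, dy):
            index[tile].append(sid)
    return index

def mbrs_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_pairs(index, source_mbrs, target_mbrs, dx, dy):
    """Stream target MBRs (e.g. read from disk) and yield each candidate pair once."""
    for tid, target in enumerate(target_mbrs):
        seen = set()                      # removes redundant pairs per target
        for tile in tiles_of(target, dx, dy):
            for sid in index.get(tile, ()):
                if sid not in seen:
                    seen.add(sid)
                    if mbrs_intersect(source_mbrs[sid], target):
                        yield (sid, tid)

source = [(0, 0, 2, 2), (5, 5, 6, 6)]
target = [(1, 1, 3, 3), (10, 10, 11, 11)]
index = build_index(source, dx=1, dy=1)
print(list(candidate_pairs(index, source, target, dx=1, dy=1)))
```

Because `target_mbrs` can be any iterable, a generator that reads geometries from a file would keep only the source index in memory.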
Introduces Holistic Verification
• based on the Intersection Matrix → 80% lower run-time
G. Papadakis, G. Mandilaras, N. Mamoulis, M. Koubarakis: Progressive, Holistic Geospatial Interlinking. WWW 2021
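The idea behind Holistic Verification can be sketched as: compute the DE-9IM Intersection Matrix of a pair once, then read all topological relations off it instead of testing each relation separately. The sketch below assumes the matrix is already available as a 9-character string (as geometry libraries commonly return it) and uses the standard OGC pattern masks; the function names are hypothetical, and the dimension arguments (0 = point, 1 = line, 2 = area) are needed because Crosses and Overlaps are defined per dimension.

```python
# Standard DE-9IM masks: 'T' = non-empty (0, 1 or 2), 'F' = empty, '*' = any value.
PATTERNS = {
    "equals": ["T*F**FFF*"],
    "disjoint": ["FF*FF****"],
    "touches": ["FT*******", "F**T*****", "F***T****"],
    "contains": ["T*****FF*"],
    "covers": ["T*****FF*", "*T****FF*", "***T**FF*", "****T*FF*"],
    "within": ["T*F**F***"],
    "covered_by": ["T*F**F***", "*TF**F***", "**FT*F***", "**F*TF***"],
}

def matches(matrix, pattern):
    """Check a 9-character DE-9IM matrix against one pattern mask."""
    for cell, mask in zip(matrix, pattern):
        if mask == "*":
            continue
        if mask == "T":
            if cell == "F":
                return False
        elif cell != mask:
            return False
    return True

def relations(matrix, dim_s, dim_t):
    """Derive every topological relation from a single Intersection Matrix."""
    found = {name for name, masks in PATTERNS.items()
             if any(matches(matrix, m) for m in masks)}
    if "disjoint" not in found:
        found.add("intersects")
    # Crosses and Overlaps depend on the dimensions of the two geometries.
    if dim_s < dim_t:
        crosses = "T*T******"
    elif dim_s > dim_t:
        crosses = "T*****T**"
    else:
        crosses = "0********" if dim_s == 1 else None
    if crosses and matches(matrix, crosses):
        found.add("crosses")
    if dim_s == dim_t:
        overlaps = "1*T***T**" if dim_s == 1 else "T*T***T**"
        if matches(matrix, overlaps):
            found.add("overlaps")
    return found

print(sorted(relations("212101212", 2, 2)))   # two partially overlapping polygons
```

A single matrix computation therefore answers all ten relation tests, which is where the run-time saving reported on the slide would come from.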
6. Progressive Geospatial Interlinking
Same Filtering as GIA.nt.
Introduces Scheduling:
• Priority queue with top-BU weighted candidate pairs, where BU is the available budget and weight is determined by:
o Co-occurrence Frequency (CF): #common tiles
o Jaccard Similarity (JS): normalized CF
o Pearson’s χ² test (χ²): degree to which s and t occur independently in tiles
Verification processes the pairs of the queue in decreasing order of weight.
G. Papadakis, G. Mandilaras, N. Mamoulis, M. Koubarakis: Progressive, Holistic Geospatial Interlinking. WWW 2021
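The Scheduling step can be sketched as below. The sketch assumes each geometry is represented by the set of tiles it occupies; the χ² formula is the usual 2×2 contingency-table statistic and may differ in detail from the paper's formulation. All function names are hypothetical. A bounded min-heap keeps only the BU highest-weighted pairs.

```python
import heapq

def cf(tiles_s, tiles_t):
    """Co-occurrence Frequency: number of common tiles."""
    return len(tiles_s & tiles_t)

def js(tiles_s, tiles_t):
    """Jaccard Similarity: CF normalized by the size of the tile union.
    Assumes at least one of the two tile sets is non-empty."""
    common = len(tiles_s & tiles_t)
    return common / (len(tiles_s) + len(tiles_t) - common)

def chi2(tiles_s, tiles_t, n_tiles):
    """Pearson's chi-squared statistic over a 2x2 tile contingency table."""
    a = len(tiles_s & tiles_t)          # tiles containing both s and t
    b = len(tiles_s) - a                # tiles containing only s
    c = len(tiles_t) - a                # tiles containing only t
    d = n_tiles - a - b - c             # tiles containing neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n_tiles * (a * d - b * c) ** 2 / denom if denom else 0.0

def top_bu(candidates, weight, budget):
    """Keep the `budget` highest-weighted pairs, returned in decreasing weight."""
    heap = []                            # min-heap: the lightest kept pair is at heap[0]
    for pair in candidates:
        entry = (weight(pair), pair)
        if len(heap) < budget:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return [pair for _, pair in sorted(heap, reverse=True)]

weights = {("s1", "t1"): 1, ("s2", "t1"): 3, ("s3", "t2"): 2}
print(top_bu(weights, weights.get, budget=2))
```

Verification would then process the returned list front to back, so the pairs most likely to be related are examined before the budget runs out.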
7. Dynamic Progressive Geospatial Interlinking
New weighting schemes:
• POINTS: smaller geometries processed first → higher time efficiency
• MBR: higher overlap in Minimum Bounding Rectangles first → higher effectiveness
• composite weights → higher effectiveness, more deterministic behavior
New scheduling:
• instead of a static processing order of geometry pairs, the processing order is updated dynamically as more topologically related pairs are detected
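One way to picture dynamic scheduling is a priority queue whose weights change while Verification runs. The update rule used here is purely an assumption for illustration (the slide does not state it): whenever a pair turns out to be related, every pending pair that shares its source or target geometry receives a fixed `boost`. The function name and parameters are hypothetical; stale heap entries are skipped lazily.

```python
import heapq

def dynamic_schedule(weights, is_related, budget, boost=1.0):
    """Verify up to `budget` pairs, re-prioritizing neighbours of related pairs."""
    current = dict(weights)                       # pair -> latest weight
    heap = [(-w, pair) for pair, w in current.items()]
    heapq.heapify(heap)                           # max-heap via negated weights
    by_geometry = {}
    for pair in current:
        s, t = pair
        by_geometry.setdefault(("s", s), []).append(pair)
        by_geometry.setdefault(("t", t), []).append(pair)
    done, order = set(), []
    while heap and len(order) < budget:
        neg_w, pair = heapq.heappop(heap)
        if pair in done or -neg_w != current[pair]:
            continue                              # stale entry, a newer one exists
        done.add(pair)
        order.append(pair)
        if is_related(pair):                      # assumed update rule (see lead-in)
            s, t = pair
            for other in by_geometry[("s", s)] + by_geometry[("t", t)]:
                if other not in done:
                    current[other] += boost
                    heapq.heappush(heap, (-current[other], other))
    return order

weights = {("s1", "t1"): 3, ("s1", "t2"): 1, ("s2", "t3"): 2}
order = dynamic_schedule(weights, lambda p: p == ("s1", "t1"), budget=3, boost=5)
print(order)
```

With a static order the lowest-weighted pair `("s1", "t2")` would be verified last; here it moves ahead of `("s2", "t3")` once its sibling pair is found to be related.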
9. JedAI-spatial – Part B
• Common three-stage pipeline for the state-of-the-art parallel joins:
o GeoSpark, i.e., Apache Sedona
o Spatial Spark
o Magellan
o Location Spark
o Parallel GIA.nt
Scalability Analysis over D1 (|S|=2.3M, |T|=5.8M, |C|=6.3M)
10. Approximate Geospatial Interlinking
Goal:
• Improve Progressive Geospatial Interlinking in two ways:
1. Use comprehensive evidence to discard candidate pairs in a principled way
2. Reduce the memory requirements
Approach:
1. Filtering → as in (Progressive) GIA.nt
2. Supervised Filtering
o Classify candidate pairs into “likely related pairs” & “unlikely related pairs” using a feature vector
3. Verification → as in (Progressive) GIA.nt
Challenges:
• Avoid any human intervention
• Address class imbalance
• Define generic, effective & efficient features
• Minimize the feature and the training set → simple & efficient classification models
11. Approximate Geospatial Interlinking – Solution overview
• Contrastive, self-supervised learning
• 4 categories of features
1. Area-based
2. Boundary-based
3. Grid-based
4. Candidate-based
• 2 sub-categories in each case:
o Atomic features
o Composite features
Experimental results:
• Undersampling necessary
• All 31 features are important
• Just 1,000 labelled instances suffice
• Parallelization for higher scalability
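The undersampling finding can be illustrated with a small sketch. It assumes labelled instances are `(features, is_related)` tuples and that the training set is balanced 50/50 between the two classes; the exact sampling ratio used in the experiments is not stated on the slide, and the function name is hypothetical.

```python
import random

def undersample(labelled, size, seed=0):
    """Draw a class-balanced training set of at most `size` labelled instances."""
    rng = random.Random(seed)
    positives = [x for x in labelled if x[1]]
    negatives = [x for x in labelled if not x[1]]
    # The majority class is cut down to the same count as the minority class.
    per_class = min(size // 2, len(positives), len(negatives))
    sample = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sample)
    return sample

# Illustrative imbalanced data: 10% related pairs, 90% unrelated pairs.
labelled = [((i,), i % 10 == 0) for i in range(10_000)]
training_set = undersample(labelled, size=1000)
print(len(training_set), sum(label for _, label in training_set))
```

Without this step a classifier trained on the raw candidate pairs would see mostly unrelated pairs and could score well by predicting "unlikely related" for everything.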
12. Proactive Geospatial Interlinking
• Motivation:
o Most geometry pairs are disjoint
o Progressive Geospatial Interlinking maximizes throughput, but has no way to determine a priori the maximum number of Verifications, BU, for a desired recall level
▪ Low BU leads to low recall
▪ High BU leads to low precision
• Solution:
o Terminate Geospatial Interlinking automatically as soon as recall exceeds a desired level → minimize the time required for processing voluminous datasets
• Algorithms:
o Extrapolation-based
o Heuristics-based
▪ Precision-threshold
▪ Qualifying distance threshold
o Convergence-based
13. Convergence-based Algorithm
Based on:
• Trilateral weighting scheme (JS + CF + MBR) → fully deterministic approach
• Fine-grained MBRs → fewer candidate pairs (see next slide)
• Massive parallelization on Apache Spark
• Batch-oriented operation
o Terminate as soon as batch precision falls below a threshold for n consecutive batches
• Experiment with ~300M geometries in progress.
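The batch-oriented termination rule can be sketched as follows. The sketch assumes the batches arrive already ordered by decreasing weight and that `is_related` stands in for Verification; the names `verify_until_converged`, `min_precision` and `patience` (the n of the slide) are hypothetical.

```python
def verify_until_converged(batches, is_related, min_precision, patience):
    """Verify batches in order and stop once batch precision stays below
    `min_precision` for `patience` consecutive batches."""
    links, low_streak = [], 0
    for batch in batches:
        found = [pair for pair in batch if is_related(pair)]
        links.extend(found)
        precision = len(found) / len(batch) if batch else 0.0
        # A good batch resets the streak; a poor one extends it.
        low_streak = low_streak + 1 if precision < min_precision else 0
        if low_streak >= patience:
            break
    return links

# Illustrative data: pairs are plain integers, `related` plays the role of Verification.
related = {1, 2, 3, 5, 17, 18}
batches = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12],
           [13, 14, 15, 16], [17, 18, 19, 20]]
links = verify_until_converged(batches, related.__contains__,
                               min_precision=0.3, patience=2)
print(links)
```

In this toy run the process stops after the third batch, so the two related pairs in the last batch are never found; this is the recall that a convergence rule trades for run-time.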
[Figure: two plots, each labelled “Precision”]
14. Fine-grained MBR
Decompose large geometries into smaller geometry segments
As a result:
• further filter superfluous verifications
o Verifications in D4: 66,379,979
o Verifications in D4 using fine-grained MBRs: 45,209,855 → about 32% fewer candidate pairs
• accelerate Verification
o Instead of verifying two big geometries, verify only the intersecting segments
o Task: Discover all topological relations by verifying as few intersecting segments as possible
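The filtering effect of fine-grained MBRs can be sketched with a polyline. The sketch assumes a geometry is a list of `(x, y)` vertices that is cut into chunks of at most `max_vertices` vertices (with `max_vertices >= 2`), each with its own MBR; the decomposition strategy and all names are hypothetical, not the method from the slides.

```python
def mbr(points):
    """Minimum Bounding Rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def segment_mbrs(vertices, max_vertices):
    """Split a polyline into chunks that share one vertex; return one MBR per chunk."""
    step = max_vertices - 1
    return [mbr(vertices[i:i + max_vertices])
            for i in range(0, len(vertices) - 1, step)]

def mbrs_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def is_candidate(fine_mbrs_s, fine_mbrs_t):
    """A pair survives only if at least one pair of segment MBRs intersects."""
    return any(mbrs_intersect(a, b) for a in fine_mbrs_s for b in fine_mbrs_t)

# An L-shaped line and a small box sitting inside the line's coarse MBR.
line = [(0, 0), (10, 0), (10, 10)]
box = [(2, 5), (3, 5), (3, 6), (2, 6), (2, 5)]

coarse = mbrs_intersect(mbr(line), mbr(box))
fine = is_candidate(segment_mbrs(line, 2), [mbr(box)])
print(coarse, fine)
```

The coarse MBR of the line covers the whole square and so reports a candidate pair, while neither of its two segment MBRs touches the box, so the pair is discarded before Verification.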