Big Linked Data Interlinking - ExtremeEarth Open Workshop

Interlinking Big Linked Geospatial Data
George Papadakis
ExtremeEarth Online Workshop 9/12/2021

2
Geospatial Interlinking in action
Mandilaras et al., “Ice monitoring with ExtremeEarth”, LASCAR workshop 2020, co-located with ESWC

3
Detected links
Two different types:
1. Proximity relations (such as dbp:near) with a distance threshold
• e.g., find all cities from S that are less than 1km away from any river in T
2. Topological relations according to the Dimensionally Extended 9-Intersection Model (DE-9IM)
• Equals
• Disjoint
• Touches
• Contains
• Covers
• Intersects
• Within
• CoveredBy
• Crosses
• Overlaps
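The proximity relation in item 1 can be sketched as below. This is only an illustration, assuming planar coordinates in metres, cities given as points and rivers as polylines; the names `point_segment_distance` and `near_links` are hypothetical and not part of any tool mentioned in these slides.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def near_links(cities, rivers, threshold):
    """Return (city_id, river_id) pairs whose distance is below the threshold."""
    links = []
    for cid, point in cities.items():
        for rid, line in rivers.items():
            d = min(point_segment_distance(point, a, b)
                    for a, b in zip(line, line[1:]))
            if d < threshold:
                links.append((cid, rid))
    return links

# Toy data (assumed, in metres): one river along the x-axis, two cities.
cities = {"c1": (0.0, 500.0), "c2": (0.0, 5000.0)}
rivers = {"r1": [(-1000.0, 0.0), (1000.0, 0.0)]}
links = near_links(cities, rivers, threshold=1000.0)
print(links)  # [('c1', 'r1')]
```

The nested loop compares every city with every river, which is exactly the quadratic behaviour that Filtering is meant to avoid.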

4
Geospatial Interlinking Example
Three topological relations:
1. LineString g1 touches Polygon g3
2. LineString g1 intersects LineString g2
3. Polygon g3 contains Polygon g4
Challenges:
1. quadratic time complexity, O(n²)
2. time-consuming topological relations

5
GIA.nt: Geospatial Interlinking At large
Goes beyond existing Filtering methods in two ways:
1. Redundant pairs are inherently removed
2. Space tiling depends only on the source dataset → the target dataset is read from the disk (>50% lower memory footprint)
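A minimal sketch of this kind of Filtering follows. It assumes a regular grid whose tile size is the average width and height of the source MBRs, and it mimics reading the target dataset from disk with a generator. All function names are hypothetical; this is not the GIA.nt implementation.

```python
from collections import defaultdict
import math

def tiles_of(mbr, tile_w, tile_h):
    """Yield the (col, row) tiles touched by an MBR (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = mbr
    for col in range(math.floor(minx / tile_w), math.floor(maxx / tile_w) + 1):
        for row in range(math.floor(miny / tile_h), math.floor(maxy / tile_h) + 1):
            yield col, row

def mbrs_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build_index(source_mbrs):
    """Index the source MBRs only; the tile size is derived from the source."""
    n = len(source_mbrs)
    tile_w = sum(m[2] - m[0] for m in source_mbrs.values()) / n or 1.0
    tile_h = sum(m[3] - m[1] for m in source_mbrs.values()) / n or 1.0
    index = defaultdict(list)
    for sid, mbr in source_mbrs.items():
        for tile in tiles_of(mbr, tile_w, tile_h):
            index[tile].append(sid)
    return index, tile_w, tile_h

def candidate_pairs(target_stream, source_mbrs, index, tile_w, tile_h):
    """Stream targets one by one; each (source, target) pair is emitted once."""
    for tid, mbr in target_stream:
        seen = set()  # per-target set, so no redundant pairs are produced
        for tile in tiles_of(mbr, tile_w, tile_h):
            for sid in index.get(tile, ()):
                if sid not in seen and mbrs_intersect(source_mbrs[sid], mbr):
                    seen.add(sid)
                    yield sid, tid

source_mbrs = {"s1": (0, 0, 4, 4), "s2": (10, 10, 12, 12)}
index, tile_w, tile_h = build_index(source_mbrs)
targets = (t for t in [("t1", (1, 1, 5, 5)), ("t2", (20, 20, 21, 21))])
pairs = list(candidate_pairs(targets, source_mbrs, index, tile_w, tile_h))
print(pairs)  # [('s1', 't1')]
```

Here `s1` and `t1` share four tiles, yet the pair is reported only once, and only the source index is kept in memory.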
Introduces Holistic Verification
• based on the Intersection Matrix → 80% lower run-time
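To illustrate why a single Intersection Matrix is enough, the sketch below derives all ten relations from one DE-9IM string using the standard pattern definitions. The matrix is assumed to have been computed already by a geometry library; the helper names are hypothetical.

```python
def matches(matrix, pattern):
    """Check a 9-character DE-9IM matrix against a pattern of T, F, *, 0, 1, 2."""
    for value, wanted in zip(matrix, pattern):
        if wanted == "*":
            continue
        if wanted == "T":
            if value == "F":  # T accepts any non-empty intersection (0, 1 or 2)
                return False
        elif value != wanted:
            return False
    return True

PATTERNS = {
    "equals": ["T*F**FFF*"],
    "disjoint": ["FF*FF****"],
    "touches": ["FT*******", "F**T*****", "F***T****"],
    "contains": ["T*****FF*"],
    "covers": ["T*****FF*", "*T****FF*", "***T**FF*", "****T*FF*"],
    "intersects": ["T********", "*T*******", "***T*****", "****T****"],
    "within": ["T*F**F***"],
    "coveredBy": ["T*F**F***", "*TF**F***", "**FT*F***", "**F*TF***"],
}

def relations(matrix, dim_a, dim_b):
    """Return every topological relation that holds, from one matrix.

    dim_a and dim_b are the geometry dimensions (0 point, 1 line, 2 area),
    needed because crosses and overlaps are dimension-dependent.
    """
    patterns = dict(PATTERNS)
    if dim_a < dim_b:
        patterns["crosses"] = ["T*T******"]
    elif dim_a > dim_b:
        patterns["crosses"] = ["T*****T**"]
    elif dim_a == 1:
        patterns["crosses"] = ["0********"]
    if dim_a == dim_b:
        patterns["overlaps"] = ["1*T***T**"] if dim_a == 1 else ["T*T***T**"]
    return {name for name, pats in patterns.items()
            if any(matches(matrix, p) for p in pats)}

# A polygon with a second polygon strictly inside it.
inside = relations("212FF1FF2", dim_a=2, dim_b=2)
print(sorted(inside))  # ['contains', 'covers', 'intersects']

# A line and a polygon that share no point.
apart = relations("FF1FF0212", dim_a=1, dim_b=2)
print(sorted(apart))  # ['disjoint']
```

The expensive step is computing the matrix; checking the patterns afterwards is a handful of string comparisons per pair.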
G. Papadakis, G. Mandilaras, N. Mamoulis, M. Koubarakis: Progressive, Holistic Geospatial Interlinking. WWW 2021

6
Progressive Geospatial Interlinking
Same Filtering as GIA.nt.
Introduces Scheduling:
• Priority queue with the top-BU weighted candidate pairs, where BU is the available budget and the weight is determined by:
• Co-occurrence Frequency (CF): #common tiles
• Jaccard Similarity (JS): normalized CF
• Pearson’s χ² test (χ²): degree to which s and t occur independently in tiles
Verification processes the pairs of the queue in decreasing weight.
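The Scheduling step can be sketched as follows, assuming each candidate pair comes with the sets of tiles its two geometries intersect. JS is taken as CF divided by the size of the tile union, and χ² as the usual Pearson statistic on a 2×2 tile contingency table; both formulas are assumptions of this sketch, as are all function names.

```python
import heapq

def cf(tiles_s, tiles_t):
    """Co-occurrence Frequency: number of common tiles."""
    return len(tiles_s & tiles_t)

def js(tiles_s, tiles_t):
    """Jaccard Similarity: CF normalized by the number of distinct tiles."""
    common = cf(tiles_s, tiles_t)
    return common / (len(tiles_s) + len(tiles_t) - common)

def chi2(tiles_s, tiles_t, total_tiles):
    """Pearson chi-squared on the table (tile has s?) x (tile has t?)."""
    n11 = cf(tiles_s, tiles_t)
    n12 = len(tiles_s) - n11
    n21 = len(tiles_t) - n11
    n22 = total_tiles - n11 - n12 - n21
    observed = [[n11, n12], [n21, n22]]
    rows = [n11 + n12, n21 + n22]
    cols = [n11 + n21, n12 + n22]
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total_tiles
            if expected > 0:
                score += (observed[i][j] - expected) ** 2 / expected
    return score

def top_bu(candidates, weight, budget):
    """Keep the `budget` heaviest pairs in a min-heap, return them heaviest first."""
    heap = []
    for order, (pair, tiles_s, tiles_t) in enumerate(candidates):
        entry = (weight(tiles_s, tiles_t), -order, pair)  # -order breaks ties
        if len(heap) < budget:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return [pair for _, _, pair in sorted(heap, reverse=True)]

candidates = [
    (("s1", "t1"), {1, 2, 3}, {2, 3, 4}),  # CF = 2, JS = 0.5
    (("s1", "t2"), {1, 2, 3}, {3}),        # CF = 1, JS = 1/3
    (("s2", "t3"), {7, 8}, {7, 8}),        # CF = 2, JS = 1.0
]
by_js = top_bu(candidates, js, budget=2)
by_cf = top_bu(candidates, cf, budget=2)
print(by_js)  # [('s2', 't3'), ('s1', 't1')]
print(by_cf)  # [('s1', 't1'), ('s2', 't3')]
```

The min-heap never holds more than BU entries, so memory stays bounded by the budget regardless of how many candidate pairs Filtering produces.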
G. Papadakis, G. Mandilaras, N. Mamoulis, M. Koubarakis: Progressive, Holistic Geospatial Interlinking. WWW 2021

7
Dynamic Progressive Geospatial Interlinking
New weighting schemes:
• POINTS: smaller geometries processed first → higher time efficiency
• MBR: higher overlap in Minimum Bounding Rectangles first → higher effectiveness
• composite weights → higher effectiveness, more deterministic behavior
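The weighting schemes above might look like the sketch below. The exact formulas are not given on the slide, so everything here is an assumption: POINTS is taken as the inverse of the total number of points, MBR as the intersection area over the union area of the two MBRs, and the composite weight as a tuple that uses the second scheme only to break ties.

```python
def mbr(points):
    """Minimum Bounding Rectangle (minx, miny, maxx, maxy) of a point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def area(m):
    return max(0.0, m[2] - m[0]) * max(0.0, m[3] - m[1])

def mbr_weight(s_points, t_points):
    """Assumed MBR scheme: intersection area normalized by union area."""
    a, b = mbr(s_points), mbr(t_points)
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    if inter[0] > inter[2] or inter[1] > inter[3]:
        return 0.0
    inter_area = area(inter)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

def points_weight(s_points, t_points):
    """Assumed POINTS scheme: fewer points means a higher weight."""
    return 1.0 / (len(s_points) + len(t_points))

def composite_weight(s_points, t_points):
    """Primary weight first; the secondary weight only breaks ties."""
    return mbr_weight(s_points, t_points), points_weight(s_points, t_points)

s = [(0, 0), (4, 0), (4, 4), (0, 4)]
t = [(2, 2), (6, 2), (6, 6), (2, 6)]
w_mbr = mbr_weight(s, t)        # 4 / 28
w_points = points_weight(s, t)  # 1 / 8
w_both = composite_weight(s, t)
```

Because ties on the primary weight are resolved by a second criterion rather than by arrival order, a composite weight of this form yields the same processing order on every run.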
New scheduling:
• instead of a static processing order of geometry pairs, the processing order is updated dynamically, as more topologically related pairs are detected
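One way to picture dynamic scheduling is the sketch below. The update rule is purely hypothetical: whenever a verified pair turns out to be related, the remaining pairs that share its source or target geometry receive a fixed boost. The slide only states that the order changes as related pairs are detected, not how.

```python
def dynamic_schedule(weights, verify, budget, boost=1.0):
    """Verify up to `budget` pairs, re-ranking the rest after every hit.

    weights: dict mapping (source_id, target_id) to an initial weight.
    verify:  callable returning True when the pair is topologically related.
    """
    pending = dict(weights)
    processed = []
    while pending and len(processed) < budget:
        pair = max(pending, key=pending.get)  # current heaviest pair
        del pending[pair]
        processed.append(pair)
        s, t = pair
        if verify(s, t):
            # Hypothetical rule: promote pairs sharing a geometry with a hit.
            for other in pending:
                if other[0] == s or other[1] == t:
                    pending[other] += boost
    return processed

weights = {("s1", "t1"): 3.0, ("s2", "t2"): 2.5, ("s1", "t2"): 2.0}
related = {("s1", "t1"), ("s1", "t2")}
order = dynamic_schedule(weights, lambda s, t: (s, t) in related, budget=3)
print(order)  # [('s1', 't1'), ('s1', 't2'), ('s2', 't2')]
```

With a static order the unrelated pair `("s2", "t2")` would be verified second; here the hit on `("s1", "t1")` promotes `("s1", "t2")`, so both related pairs are found within a budget of two. A linear scan with `max` keeps the sketch short; a real implementation would need a queue that supports weight updates.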

8
JedAI-spatial
Publicly available at: https://github.com/GeoLinker/GeoLinker
Solution space:
Model-view-controller architecture

9
JedAI-spatial – Part B
• Common three-stage pipeline for the state-of-the-art parallel joins:
o GeoSpark (now Apache Sedona)
o Spatial Spark
o Magellan
o Location Spark
o Parallel GIA.nt
Scalability Analysis over D1 (|S|=2.3M, |T|=5.8M, |C|=6.3M)

10
Approximate Geospatial Interlinking
Goal:
• Improve Progressive Geospatial Interlinking in two ways:
1. Use comprehensive evidence to discard candidate pairs in a principled way
2. Reduce the memory requirements
Approach:
1. Filtering → as in (Progressive) GIA.nt
2. Supervised Filtering
o Classify candidate pairs into “likely related pairs” & “unlikely related pairs” using a feature vector
3. Verification → as in (Progressive) GIA.nt
Challenges:
• Avoid any human intervention
• Address class imbalance
• Define generic, effective & efficient features
• Minimize the feature and the training set → simple & efficient classification models

11
Approximate Geospatial Interlinking – Solution overview
• Contrastive, self-supervised learning
• 4 categories of features
1. Area-based
2. Boundary-based
3. Grid-based
4. Candidate-based
• 2 sub-categories in each case:
o Atomic features
o Composite features
Experimental results:
• Undersampling necessary
• All 31 features are important
• Just 1,000 labelled instances suffice
• Parallelization for higher scalability
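The undersampling step mentioned in the results could be as simple as the sketch below, which randomly discards majority-class instances until both classes have the same size. The function name and the toy data are assumptions; the actual sampling strategy is not described on the slide.

```python
import random

def undersample(instances, seed=0):
    """Balance (feature_vector, label) instances by random undersampling.

    label is True for "likely related" and False for "unlikely related".
    """
    rng = random.Random(seed)
    positives = [inst for inst in instances if inst[1]]
    negatives = [inst for inst in instances if not inst[1]]
    minority, majority = sorted((positives, negatives), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy training set: 5 related pairs versus 95 unrelated ones.
instances = [([float(i)], i < 5) for i in range(100)]
balanced = undersample(instances)
print(len(balanced), sum(label for _, label in balanced))  # 10 5
```

Training on the balanced set keeps the classifier from simply predicting the dominant "unlikely related" class for every candidate pair.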

12
Proactive Geospatial Interlinking
• Motivation:
o Most geometry pairs are disjoint
o Progressive Geospatial Interlinking maximizes throughput, but has no way to determine a priori the maximum number of Verifications, BU, for a desired recall level
▪ Low BU leads to low recall
▪ High BU leads to low precision
• Solution:
o Terminate Geospatial Interlinking automatically as soon as recall exceeds a desired level → minimize the time required for processing voluminous datasets
• Algorithms:
o Extrapolation-based
o Heuristics-based
▪ Precision-threshold
▪ Qualifying distance threshold
o Convergence-based

13
Convergence-based Algorithm
Based on:
• Trilateral weighting scheme (JS + CF + MBR) → fully deterministic approach
• Fine-grained MBRs → fewer candidate pairs (see next slide)
• Massive parallelization on Apache Spark
• Batch-oriented operation
o Terminate as soon as batch precision falls below a threshold for n consecutive batches
• Experiment with ~300M geometries in progress.
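The batch-oriented termination rule can be sketched as follows. The parameter names `min_precision` and `patience` (the threshold and the n of the slide) are assumptions, and the sketch runs sequentially rather than on Apache Spark.

```python
def run_until_converged(batches, verify, min_precision, patience):
    """Verify batches in order; stop after `patience` consecutive weak batches.

    A batch is weak when its precision (related pairs / verified pairs)
    falls below `min_precision`.
    """
    links = []
    processed = 0
    weak_streak = 0
    for batch in batches:
        hits = [pair for pair in batch if verify(*pair)]
        links.extend(hits)
        processed += 1
        precision = len(hits) / len(batch) if batch else 0.0
        weak_streak = weak_streak + 1 if precision < min_precision else 0
        if weak_streak >= patience:
            break
    return links, processed

related = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s9", "t9")}
batches = [
    [("s1", "t1"), ("s2", "t2")],    # precision 1.0
    [("s3", "t3"), ("s4", "t4")],    # precision 0.5
    [("s5", "t5"), ("s6", "t6")],    # precision 0.0, first weak batch
    [("s7", "t7"), ("s8", "t8")],    # precision 0.0, second weak batch: stop
    [("s9", "t9"), ("s10", "t10")],  # never verified
]
links, processed = run_until_converged(
    batches, lambda s, t: (s, t) in related, min_precision=0.25, patience=2)
print(len(links), processed)  # 3 4
```

The last batch still contains a related pair, which shows the price of early termination: the rule trades a small loss in recall for skipping the long tail of mostly disjoint pairs.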
[Figure: two plots, each with a Precision axis]

14
Fine-grained MBR
Decompose large geometries into smaller geometry segments
As a result:
• further filter superfluous verifications
o Verifications in D4: 66,379,979
o Verifications in D4 using fine-grained MBRs: 45,209,855 → 31% fewer candidate pairs
• accelerate Verification
o Instead of verifying two big geometries, verify only the intersecting segments
o Task: Discover all topological relations by verifying the minimum number of intersecting segments
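The filtering effect of fine-grained MBRs can be sketched for polylines as below. Splitting a geometry into chunks of `segment_size` consecutive segments is an assumption of this sketch, as are the function names; the decomposition used in the actual system may differ.

```python
def mbr(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def mbrs_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def fine_grained_mbrs(points, segment_size):
    """One MBR per chunk of `segment_size` consecutive line segments.

    Consecutive chunks share an endpoint, so no part of the line is lost.
    """
    return [mbr(points[i:i + segment_size + 1])
            for i in range(0, len(points) - 1, segment_size)]

def may_intersect(s_points, t_points, segment_size):
    """True if at least one pair of fine-grained MBRs intersects."""
    return any(mbrs_intersect(a, b)
               for a in fine_grained_mbrs(s_points, segment_size)
               for b in fine_grained_mbrs(t_points, segment_size))

# An L-shaped line and a short line lying in the empty corner of its MBR.
s = [(0, 0), (10, 0), (10, 10)]
t = [(2, 4), (4, 6)]
coarse = mbrs_intersect(mbr(s), mbr(t))
fine = may_intersect(s, t, segment_size=1)
print(coarse, fine)  # True False
```

The coarse MBRs intersect, so the pair would normally be sent to Verification; the fine-grained MBRs show that no segment of `s` comes near `t`, so the pair can be discarded without computing an Intersection Matrix.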
