Big Linked Data Querying - ExtremeEarth Open Workshop
2. ExtremeEarth
From Copernicus Big Data
to Extreme Earth Analytics
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 825258.
3. December 09, 2021
Extreme Earth Online Workshop
Dimitris Bilidas, Theofilos Ioannidis
Querying Big Linked Geospatial Data
4.
Overview
● Objective: Perform GeoSPARQL query answering on top of massive geospatial
RDF graphs.
○ Rich spatial analytics on large datasets interlinking information mined from EO
Copernicus data with other available datasets
● Development of the distributed Strabo2 system
○ Relies on the Apache Sedona framework (formerly GeoSpark) to perform
geospatial analytics on top of Apache Spark
○ Deployed in the Hopsworks platform on CREODIAS
● Application in the two use cases of ExtremeEarth
10.
Caching of Spatial Relations
● Use JedAI-spatial to pre-compute qualitative spatial relations
between the geometries in the dataset
○ Optional step after data import
● During query translation, replace FILTER clauses that introduce spatial
joins in GeoSPARQL with predicates accessing the stored qualitative
relations
○ We cannot replace the geof:distance and geof:disjoint functions
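As a rough illustration of the idea, the sketch below (plain Python, with made-up data and names; not the actual Strabo2 or JedAI-spatial API) answers a spatial join by probing a materialized relation table instead of computing geometry predicates:

```python
# Illustrative sketch: qualitative relations (e.g. sfContains, sfIntersects)
# between geometry pairs are materialized once; a GeoSPARQL spatial FILTER
# is then answered by a lookup instead of a geometry computation.
# All identifiers and data here are invented for the example.

# Pre-computed relation store: relation name -> set of (subject, object) pairs
precomputed = {
    "sfContains": {("norway", "oslo"), ("norway", "bergen")},
    "sfIntersects": {("norway", "sweden")},
}

def spatial_join(relation, candidates):
    """Answer a spatial join by probing the materialized relation table."""
    table = precomputed.get(relation, set())
    return [pair for pair in candidates if pair in table]

pairs = [("norway", "oslo"), ("norway", "sweden"), ("norway", "helsinki")]
print(spatial_join("sfContains", pairs))  # only the pairs stored as sfContains
```

Note that geof:distance cannot be answered this way because it returns a numeric value rather than a boolean qualitative relation, which is why it stays as a geometry computation.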
11.
Using Persistent Spatial Indexing and Partitioning
● Persistent spatial indexes cannot be created through the Sedona SQL interface: for each
operation, a temporary spatial index is created on the fly
● Use the Sedona RDD interface to create a persistent spatial index, and implement
spatial filter operations with code that accesses the RDD and transforms the result into a
DataFrame
● Data loading and indexing:
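The benefit of a persistent index can be sketched with a toy grid index in plain Python (this is not Sedona code; the cell size and the data are arbitrary): the index is built once at load time and reused by every range query, instead of being rebuilt on the fly per operation:

```python
# Toy persistent spatial index: points are bucketed into grid cells once,
# and every subsequent range query probes only the overlapping cells.
from collections import defaultdict

CELL = 10.0  # grid cell size; an arbitrary choice for this sketch

def build_grid_index(points):
    """Build the index once at data-loading time and keep it across queries."""
    index = defaultdict(list)
    for (x, y) in points:
        index[(int(x // CELL), int(y // CELL))].append((x, y))
    return index

def range_query(index, xmin, ymin, xmax, ymax):
    """Probe only the grid cells that overlap the query window."""
    hits = []
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            for (x, y) in index.get((cx, cy), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append((x, y))
    return hits

idx = build_grid_index([(5.0, 5.0), (15.0, 5.0), (25.0, 25.0)])
print(range_query(idx, 0.0, 0.0, 20.0, 10.0))  # -> [(5.0, 5.0), (15.0, 5.0)]
```

In Strabo2 the same build-once/probe-many pattern is achieved through the Sedona RDD interface, since Sedona's spatial RDDs can be indexed and partitioned explicitly.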
13.
Using Persistent Spatial Indexing and Partitioning
● During translation of a GeoSPARQL query that contains a spatial filter, we modify the
translation process so that it accesses the persistent spatial index and partitioning,
computes an intermediate result corresponding to the filter, and saves it in a temporary table.
● Finally, we modify the resulting SQL query by replacing the spatial filter with an access to
the temporary table.
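A minimal sketch of this final rewriting step, assuming a hypothetical temporary table name and a simplified SQL shape (this is not the actual Strabo2 translator):

```python
# Illustrative rewrite: the spatial predicate in the translated SQL is
# replaced by a membership test against a temporary table that already holds
# the ids satisfying the filter (computed via the persistent spatial index).
# The table and column names below are invented for the example.
import re

def rewrite_spatial_filter(sql, filter_pattern, temp_table, id_col="s"):
    """Drop the spatial predicate and probe the precomputed temp table instead."""
    return re.sub(filter_pattern,
                  f"{id_col} IN (SELECT {id_col} FROM {temp_table})",
                  sql)

sql = "SELECT s FROM geoms WHERE ST_Intersects(geom, window)"
print(rewrite_spatial_filter(sql, r"ST_Intersects\(geom, window\)", "tmp_filter_1"))
# -> SELECT s FROM geoms WHERE s IN (SELECT s FROM tmp_filter_1)
```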
14.
Caching Partitioned Thematic Tables
● Before executing a query, for each base table participating in a join, we perform
partitioning and sorting on the join key
○ This comes at no extra cost, as Spark would perform this operation during query execution anyway
● Cache the temporary result and re-use it in subsequent queries
● Spark manages the cache by saving table fragments to disk when appropriate

Original query:
SELECT prop1.subject FROM prop1, prop2
WHERE prop1.subject = prop2.object

With cached partitioned tables:
CACHE TABLE prop1SubjectPartitioned AS SELECT * FROM prop1 CLUSTER BY subject
CACHE TABLE prop2ObjectPartitioned AS SELECT * FROM prop2 CLUSTER BY object

SELECT prop1SubjectPartitioned.subject FROM prop1SubjectPartitioned, prop2ObjectPartitioned
WHERE prop1SubjectPartitioned.subject = prop2ObjectPartitioned.object
15.
Query Optimizer
● Takes as input a query that contains a series of thematic and spatial filters
and joins, and returns a query execution plan
● Needs estimates of the result size of each operation
● Plan enumeration is based on the DPsub dynamic programming algorithm
● Cost estimates take into consideration the partitioning of tables and the
possible use of spatial indexing for spatial selection operations
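A toy version of DPsub enumeration is sketched below, using an invented flat-selectivity cost model and made-up cardinalities; the real optimizer's cost estimates additionally account for partitioning and spatial indexing:

```python
# Toy DPsub join enumeration: subsets of tables are encoded as bitmasks and
# processed in increasing numeric order, so every proper subset's best plan
# is known before its supersets are considered.
# Cardinalities, selectivity and the cost model are illustrative assumptions.

card = [1000, 10, 100]   # estimated sizes of base tables R0, R1, R2
SEL = 0.01               # flat join selectivity (a toy assumption)

def size(mask):
    """Estimated result size of joining the tables in `mask`."""
    s, joins = 1, -1
    for i in range(len(card)):
        if mask & (1 << i):
            s *= card[i]
            joins += 1
    return s * (SEL ** max(joins, 0))

def dpsub(n):
    """Return (cost, plan) for the cheapest join order of tables 0..n-1,
    where cost is the sum of intermediate result sizes."""
    best = {1 << i: (0.0, str(i)) for i in range(n)}  # singletons cost 0
    for mask in range(1, 1 << n):
        if mask in best:          # skip singletons
            continue
        sub = (mask - 1) & mask   # enumerate proper non-empty subsets
        while sub:
            rest = mask ^ sub
            cost = best[sub][0] + best[rest][0] + size(mask)
            if mask not in best or cost < best[mask][0]:
                best[mask] = (cost, f"({best[sub][1]} x {best[rest][1]})")
            sub = (sub - 1) & mask
    return best[(1 << n) - 1]

print(dpsub(3))  # joins the two small tables first
```

Even in this toy model the enumeration picks the order that joins the two small tables first, which is exactly the behaviour the result-size estimates are meant to drive.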
16.
Endpoint Deployment Through Apache Livy
● Developed a SPARQL endpoint based on Apache Livy for
communication with HopsYARN
○ Can be deployed using Docker
● A Spark session is initiated on startup
○ The Apache Sedona jars are added to the Spark jars
○ The spatial UDFs are registered in the Spark engine through
specialized requests
● The endpoint has been deployed in the CREODIAS installation of
Hopsworks
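A sketch of the request bodies such an endpoint can send to Livy's REST API (`POST /sessions` to start the Spark session, `POST /sessions/{id}/statements` to run code such as UDF registration); the jar name, URL and Spark conf below are placeholders, not the actual deployment settings:

```python
# Illustrative Livy request payloads; no network call is made here.
# All concrete values (URL, jar names, conf) are placeholders.
import json

LIVY_URL = "http://livy:8998"  # placeholder address of the Livy server

def session_request(sedona_jars):
    """Body for POST {LIVY_URL}/sessions: start a Spark session with the
    Sedona jars attached."""
    return {
        "kind": "spark",
        "jars": sedona_jars,  # Sedona jars added to the Spark session
        "conf": {"spark.serializer":
                 "org.apache.spark.serializer.KryoSerializer"},
    }

def statement_request(code):
    """Body for POST {LIVY_URL}/sessions/{id}/statements: run code in the
    session, e.g. a request that registers the spatial UDFs."""
    return {"code": code}

print(json.dumps(session_request(["sedona-spark-shaded.jar"])))  # placeholder jar
```

Initiating the session once at startup means every incoming SPARQL query reuses the same warm Spark session instead of paying the session start-up cost per query.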
17.
Deployment At CREODIAS
● Strabo2 has been installed in CREODIAS with the following datasets:
● For polar use-case
○ Ice Charts dataset
○ Ship positions dataset
○ GADM Norway and GADM North
● For food security use-case
○ Crop type maps
○ Hydro River Network EU (Danube, Rhine, Elbe)
○ Irrigation dataset
○ Precipitation dataset
○ Snow cover dataset
○ Interlinking result between Precipitation & GADM
● Total size ~30 GB
18.
Deployment At CREODIAS
● Data loading time with 8 executors (2 cores and 4 GB memory per
executor) is 80 minutes
19.
Deployment At CREODIAS
● Queries (12 executors, 2 cores and 4 GB memory per executor):
○ Get all images that correspond to ice map observations that were
obtained between 2018-03-01 and 2018-03-03 and whose observation
CT class name is Close Drift Ice: ~233 million results in 2 minutes
○ Get all observations less than 5 km away from POLYGON ((0.0
0.0, 90.0 0.0, 90.0 77.94970848221368, 0.0 77.94970848221368,
0.0 0.0)): 35k results in 47 seconds
○ Get regions affected by precipitation in "Quarter 2 of 2021" that was
"lower than 15%" of the normal rainfall and that are "equipped with
irrigation": 12k results in 38 seconds
○ Get regions that showed a negative trend in precipitation in Q2 but a
positive one in Q3 (of e.g. year 2018): 64 results in 23 seconds
20.
Experiments with synthetic datasets
● Geographica 2 synthetic dataset
○ For each dataset, a minimal ontology that follows a general version
of the OSM schema is used.
○ 36 queries of spatial selections and spatial joins with different
selectivities (intersects, within, touches).
○ We have successfully executed the query set in hops.site for
dataset sizes of up to 2.35 billion triples (0.5 TB in plain text;
scale factor 12228).
■ Average execution time: 98 seconds using 128 workers, 271
seconds using 64 workers
23.
Ongoing Work
● Perform more experiments
○ With more datasets and queries from the use cases
○ Scalability experiments with the Synthetic dataset of Geographica 3
○ Evaluate specific aspects of the system and identify possible
shortcomings
■ steps for further improvement
24.
Geographica 3-CL Distributed Synthetic Generator
● Changed the serial processing logic to run in a distributed fashion,
allowing for horizontal scalability
● Optimized data structures to minimize the memory footprint on the driver
and on each executor, as well as the network communication
● To minimize the storage footprint, the Parquet+Snappy file format was
added to the output format options
● To increase parallelism for the distributed dataset consumers (Strabo2
loaders), the optimal number of partitions can be provided
● Dynamic spatial selectivities for query sets, provided by the user,
allow for better steering of query loads towards testing the scalability of
spatial behaviour
25.
Geographica 3-CL DistSynthGen: Number of Triples per Feature Class (TEXT/PARQUET)

| Feature class | N=512 (baseline) | N=1024 | N=2048 | N=4096 | N=8192 | N=16384 | N=32768 |
|---|---|---|---|---|---|---|---|
| HEXAGON_LARGE (states) | 202,524 | 814,403 | 3,256,764 | 13,044,367 | 52,173,916 | 208,764,915 | 835,045,124 |
| HEXAGON_LARGE_CENTER (state centers) | 202,524 | 814,403 | 3,256,764 | 13,044,367 | 52,173,916 | 208,764,915 | 835,045,124 |
| HEXAGON_SMALL (land ownerships) | 1,837,056 | 7,344,128 | 29,368,320 | 117,456,896 | 469,794,816 | 1,879,113,728 | 7,516,323,840 |
| LINESTRING (roads) | 3,588 | 7,172 | 14,340 | 28,676 | 57,348 | 114,692 | 229,380 |
| POINT (points of interest) | 1,837,056 | 7,344,128 | 29,368,320 | 117,456,896 | 469,794,816 | 1,879,113,728 | 7,516,323,840 |
| TOTAL | 4,082,748 | 16,324,234 | 65,264,508 | 261,031,202 | 1,043,994,812 | 4,175,871,978 | 16,702,967,308 |