ExtremeEarth
From Copernicus Big Data
to Extreme Earth Analytics
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 825258.
December 09, 2021
Extreme Earth Online Workshop
Dimitris Bilidas, Theofilos Ioannidis
Querying Big Linked Geospatial Data
Overview
● Objective: Perform GeoSPARQL query answering on top of massive geospatial
RDF graphs.
○ Rich spatial analytics on large datasets interlinking information mined from EO
Copernicus data with other available datasets
● Development of the distributed Strabo2 system
○ Relies on the Apache Sedona framework (formerly GeoSpark) in order to
perform geospatial analytics on top of Apache Spark.
○ Deployed on the Hopsworks platform at CREODIAS
● Application in the two use cases of ExtremeEarth
Strabo 2 on Hopsworks
Data Import
● Vertical Partitioning
○ For each predicate encountered in the RDF data, we create a 2-column table in
Hive (subject and object columns)
<observation1> rdf:type <IceObservation> .
<observation2> rdf:type <IceObservation> .
<image1> rdf:type <SatelliteImage> .
<observation1> polar:hasCTClassName "close drift ice" .
<observation2> polar:hasCTClassName "open drift ice" .
<image1> geo:hasGeometry <geometry1> .
Resulting tables (subject | object):
rdf:type
<observation1> | <IceObservation>
<observation2> | <IceObservation>
<image1> | <SatelliteImage>
polar:hasCTClassName
<observation1> | "close drift ice"
<observation2> | "open drift ice"
geo:hasGeometry
<image1> | <geometry1>
● During import, we create a dictionary with the corresponding Hive
table name for each RDF property.
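The import step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the in-memory lists stand in for Hive tables, and the `prop_...` naming rule in `table_name` is hypothetical, not the naming scheme Strabo2 actually uses.

```python
import re
from collections import defaultdict

# Sample triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    ("<observation1>", "rdf:type", "<IceObservation>"),
    ("<observation2>", "rdf:type", "<IceObservation>"),
    ("<image1>", "rdf:type", "<SatelliteImage>"),
    ("<observation1>", "polar:hasCTClassName", '"close drift ice"'),
    ("<observation2>", "polar:hasCTClassName", '"open drift ice"'),
    ("<image1>", "geo:hasGeometry", "<geometry1>"),
]


def table_name(predicate):
    # Hypothetical rule: replace every non-word character with "_".
    return "prop_" + re.sub(r"\W", "_", predicate)


def vertical_partition(triples):
    """Group triples into one 2-column table per predicate."""
    tables = defaultdict(list)  # table name -> list of (subject, object) rows
    dictionary = {}             # RDF property -> table name
    for s, p, o in triples:
        name = dictionary.setdefault(p, table_name(p))
        tables[name].append((s, o))
    return dict(tables), dictionary


tables, dictionary = vertical_partition(TRIPLES)
for prop, name in dictionary.items():
    print(prop, "->", name, tables[name])
```

During query translation, the dictionary is what lets a triple pattern with a known predicate be mapped directly to the table that holds its subject and object columns.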
Query Translation
Caching of Spatial Relations
● Use JedAI-spatial to pre-compute qualitative spatial relations
between the geometries in the dataset
○ Optional step after data import
● During query translation, replace FILTER clauses that introduce spatial
joins in GeoSPARQL with predicates accessing the stored qualitative
relations
○ We cannot replace the geof:distance and geof:disjoint functions
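The idea of answering a spatial join from stored relations can be illustrated with a toy example. The sketch below does not use JedAI-spatial; it assumes hypothetical geometries simplified to axis-aligned boxes and pre-computes only two relations, so that a later spatial join becomes a lookup rather than a geometry computation.

```python
from itertools import permutations

# Hypothetical geometries, simplified to boxes (xmin, ymin, xmax, ymax).
GEOMS = {
    "g1": (0, 0, 4, 4),
    "g2": (1, 1, 2, 2),      # lies inside g1
    "g3": (3, 3, 6, 6),      # overlaps g1
    "g4": (10, 10, 11, 11),  # far away from the others
}


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def within(a, b):
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]


def precompute(geoms):
    """Materialise qualitative relations once, as an optional post-import step."""
    relations = {"intersects": set(), "within": set()}
    for (n1, a), (n2, b) in permutations(geoms.items(), 2):
        if intersects(a, b):
            relations["intersects"].add((n1, n2))
        if within(a, b):
            relations["within"].add((n1, n2))
    return relations


RELATIONS = precompute(GEOMS)


def spatial_join(function_name):
    """Answer a spatial-join filter from the stored relations when possible."""
    if function_name in RELATIONS:
        return sorted(RELATIONS[function_name])  # lookup only, no geometry work
    raise LookupError(f"{function_name} is not cached; evaluate it on the geometries")


print(spatial_join("within"))
```

Functions that are not stored, such as a distance computation, fall through to regular evaluation, which mirrors the restriction noted in the last bullet.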
Using Persistent Spatial Indexing and
Partitioning
● Persistent spatial indexes cannot be created through the Sedona SQL interface: for each
operation, a temporary spatial index is created on the fly
● Use the Sedona RDD interface to create a persistent spatial index, and implement
spatial filter operations with code that accesses the RDD and transforms the result into
a DataFrame
● Data loading and Indexing:
● During translation of a GeoSPARQL query that contains a spatial filter, we modify the
translation process, such that it will access the persistent spatial index and partitioning,
compute an intermediate result corresponding to the filter, and save it in a temporary table.
● Finally, we modify the resulting SQL query, by replacing the spatial filter with access to
the temporary table.
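The translation steps above can be sketched as follows. This is a simplified stand-in, not the Strabo2 code: a uniform grid replaces the Sedona spatial index, point geometries replace general geometries, and the table, column and temporary-table names are hypothetical.

```python
from collections import defaultdict

CELL = 10.0  # grid cell size; an arbitrary choice for this sketch


class GridIndex:
    """Built once at load time and reused by every later query."""

    def __init__(self, points):
        self.cells = defaultdict(list)
        for key, (x, y) in points.items():
            self.cells[(int(x // CELL), int(y // CELL))].append((key, x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        hits = []
        for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
            for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
                for key, x, y in self.cells.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(key)
        return sorted(hits)


POINTS = {"geomA": (1.0, 2.0), "geomB": (15.0, 3.0), "geomC": (55.0, 70.0)}
INDEX = GridIndex(POINTS)  # the persistent index
TEMP_TABLES = {}           # temporary table name -> matching geometry ids


def translate(sql, spatial_filter, box):
    """Evaluate the filter through the index, store the result, rewrite the SQL."""
    name = f"tmp_filter_{len(TEMP_TABLES)}"
    TEMP_TABLES[name] = INDEX.range_query(*box)
    return sql.replace(spatial_filter, f"g.object IN (SELECT id FROM {name})")


sql = "SELECT g.subject FROM hasGeometry g WHERE ST_Within(g.object, :box)"
rewritten = translate(sql, "ST_Within(g.object, :box)", (0, 0, 20, 20))
print(rewritten)
print(TEMP_TABLES)
```

The point of the design is that the index is built once, while each query only pays for the lookup and for reading the small intermediate result.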
Caching Partitioned Thematic Tables
● Before the execution of a query, for each base table participating in a join, we perform
partitioning and sorting on the join key
○ this comes at no extra cost, as Spark performs this operation during query execution anyway
● Cache the temporary result and re-use it in subsequent queries
● Spark manages the cache by saving table fragments to disk when appropriate
-- Original query
SELECT prop1.subject FROM prop1, prop2
WHERE prop1.subject = prop2.object;
-- Cache each base table, partitioned and sorted on its join key
CACHE TABLE prop1SubjectPartitioned AS SELECT * FROM prop1 CLUSTER BY subject;
CACHE TABLE prop2ObjectPartitioned AS SELECT * FROM prop2 CLUSTER BY object;
-- Rewritten query over the cached tables
SELECT prop1SubjectPartitioned.subject FROM prop1SubjectPartitioned, prop2ObjectPartitioned
WHERE prop1SubjectPartitioned.subject = prop2ObjectPartitioned.object;
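Why pre-partitioned tables help can be seen in a small plain-Python model. It is a sketch under assumptions: lists of tuples stand in for the `prop1` and `prop2` tables, the rows are invented, and only the partitioning part of `CLUSTER BY` is modelled (the per-partition sort is left out for brevity).

```python
def hash_partition(rows, key_index, n):
    """Split rows into n buckets by the hash of the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key_index]) % n].append(row)
    return parts


def partitioned_join(left_parts, right_parts, left_key, right_key):
    """Join bucket i only with bucket i: equal keys always share a bucket."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        by_key = {}
        for r in rp:
            by_key.setdefault(r[right_key], []).append(r)
        for l in lp:
            for _ in by_key.get(l[left_key], []):
                out.append(l[0])  # SELECT prop1.subject
    return sorted(out)


# Hypothetical (subject, object) rows.
prop1 = [("a", "x"), ("b", "y"), ("c", "z")]
prop2 = [("p", "a"), ("q", "c"), ("r", "c")]

# Partition once on the join keys and keep the result (the "cached" tables).
prop1_by_subject = hash_partition(prop1, 0, 4)
prop2_by_object = hash_partition(prop2, 1, 4)

# Every later join on prop1.subject = prop2.object reuses the same buckets.
result = partitioned_join(prop1_by_subject, prop2_by_object, 0, 1)
print(result)
```

Because both sides are already split on the join key, a later join needs no further data movement, which is the saving the cached tables are meant to provide.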
Query Optimizer
● Takes as input a query that contains a series of thematic and spatial filters
and joins and returns a query execution plan
● Needs estimates about the result size of each operation
● Plan enumeration is based on the DPsub dynamic programming algorithm
● Cost estimates take into consideration the partitioning of tables and the
possible use of spatial indexing for spatial selection operations
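A minimal sketch of DPsub-style plan enumeration follows. It is not the Strabo2 optimizer: the cost model (sum of estimated intermediate result sizes), the cardinalities and the selectivities are assumptions chosen for illustration, cross products are allowed for simplicity, and the effects of partitioning and spatial indexing mentioned above are not modelled.

```python
from itertools import combinations


def dpsub(cards, sel):
    """Return (cost, plan) for joining all relations.

    cards: base-table cardinality estimates, one per relation.
    sel:   {(i, j): selectivity} for join predicates, with i < j.
    """
    n = len(cards)

    def size(mask):
        # Estimated result size of joining every relation in the subset `mask`.
        rels = [i for i in range(n) if mask >> i & 1]
        est = 1.0
        for i in rels:
            est *= cards[i]
        for pair in combinations(rels, 2):
            est *= sel.get(pair, 1.0)  # no predicate means a cross product
        return est

    # best[mask] = (cost, plan text); a single relation needs no join.
    best = {1 << i: (0.0, f"R{i}") for i in range(n)}
    for mask in range(1, 1 << n):  # subsets in increasing integer order
        if mask in best:
            continue
        out = size(mask)
        left = (mask - 1) & mask   # largest proper sub-subset of mask
        while left:
            right = mask ^ left
            cost = best[left][0] + best[right][0] + out
            if mask not in best or cost < best[mask][0]:
                best[mask] = (cost, f"({best[left][1]} JOIN {best[right][1]})")
            left = (left - 1) & mask  # next smaller sub-subset
    return best[(1 << n) - 1]


# Hypothetical estimates: R0 has 1000 rows, R1 has 10, R2 has 100.
cost, plan = dpsub([1000, 10, 100], {(0, 1): 0.01, (1, 2): 0.5})
print(plan, cost)
```

Subsets are visited in increasing integer order, so every sub-subset of a set has already been solved when the set itself is reached. This is why the enumeration needs result-size estimates for each operation.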
Endpoint Deployment Through Apache Livy
● Developed a SPARQL endpoint based on Apache Livy for
communication with HopsYARN
○ Can be deployed using Docker
● Spark session is initiated on startup
○ Apache Sedona jars are added to the Spark jars
○ Spatial UDFs are registered in the Spark engine through
specialized requests
● Endpoint has been deployed in the CREODIAS installation of
Hopsworks
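The startup sequence can be sketched as the two request bodies an endpoint would send to Livy. `POST /sessions` and `POST /sessions/{id}/statements` are Livy REST routes. The host, the jar paths and the choice of `SedonaSQLRegistrator.registerAll` as the registration statement are assumptions for illustration, not details taken from the Strabo2 endpoint. Nothing is sent over the network here.

```python
import json

LIVY_URL = "http://livy.example.org:8998"  # hypothetical endpoint address


def new_session_request(jars):
    """Body for POST /sessions: start a Spark session with extra jars."""
    return {"kind": "spark", "jars": list(jars)}


def statement_request(code):
    """Body for POST /sessions/{id}/statements: run code in the session."""
    return {"code": code}


# Hypothetical jar locations.
session_body = new_session_request(
    ["hdfs:///libs/sedona-core.jar", "hdfs:///libs/sedona-sql.jar"]
)

# Scala statement that registers Sedona's SQL functions in the running session.
register_udfs = statement_request(
    "org.apache.sedona.sql.utils.SedonaSQLRegistrator.registerAll(spark)"
)

print("POST", LIVY_URL + "/sessions", json.dumps(session_body))
print("POST", LIVY_URL + "/sessions/0/statements", json.dumps(register_udfs))
```

Keeping one long-lived session means the jars and function registrations are paid for once at startup rather than on every query.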
Deployment At CREODIAS
● Strabo2 has been installed in CREODIAS with the following datasets:
● For polar use-case
○ Ice Charts dataset
○ Ship positions dataset
○ GADM Norway and GADM North
● For food security use-case
○ Crop type maps
○ Hydro River Network EU (Danube, Rhine, Elbe)
○ Irrigation dataset
○ Precipitation dataset
○ Snow cover dataset
○ Interlinking result between Precipitation & GADM
● Total size ~30 GB
● Data loading time with 8 executors, 2 cores and 4GB memory per
executor is 80 minutes
● Queries (12 executors, 2 cores and 4GB memory per executor):
○ Get all images that correspond to ice map observations that were
obtained between 2018-03-01 and 2018-03-03 and whose observation
CT class name is Close Drift Ice: ~233 million results in 2 minutes
○ Get all observations at a distance of less than 5 km from POLYGON ((0.0
0.0, 90.0 0.0, 90.0 77.94970848221368, 0.0 77.94970848221368,
0.0 0.0)): 35k results in 47 seconds
○ Get regions affected by precipitation in “Quarter 2 of 2021” that was
“lower than 15%” of the normal rainfall and that are “equipped with
irrigation”: 12k results in 38 seconds
○ Get regions that showed a negative trend in precipitation in Q2 but a
positive trend in Q3 (of e.g. year 2018): 64 results in 23 seconds
Experiments with synthetic datasets
● Geographica 2 synthetic dataset
○ For each dataset, a minimal ontology that follows a general version
of the schema of OSM is used.
○ 36 queries of spatial selections and spatial joins with different
selectivities (intersects, within, touches).
○ We have successfully executed the query set in hops.site for a
dataset size of up to 2.35 billion triples (0.5 TB in plain text, scale
factor 12228).
■ Average execution time: 98 seconds using 128 workers, 271
seconds using 64 workers
Scalability-Dataset Size
Scalability-Number of Executors
Ongoing Work
● Perform more experiments
○ With more datasets and queries from the use cases
○ Scalability experiments with the Synthetic dataset of Geographica 3
○ Evaluate specific aspects of the system and identify possible
shortcomings
■ steps for further improvement
Geographica 3-CL Distributed Synthetic Generator
● Changed serial processing logic in order to run in a distributed fashion
to allow for horizontal scalability
● Optimized data structures to minimize the memory footprint on the driver and
each executor, as well as network communication
● To minimize the storage footprint, the Parquet+Snappy file format was
added to the output format options
● To increase parallelism for distributed consumers of the dataset (Strabo 2
loaders), the optimal number of partitions can be provided
● User-provided dynamic spatial selectivities for query sets
allow better steering of query loads towards testing the scalability of
spatial behaviour
Geographica 3-CL DistSynthGen
Number of Triples per Feature Class
TEXT / PARQUET | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768
HEXAGON_LARGE (states) | 202,524 | 814,403 | 3,256,764 | 13,044,367 | 52,173,916 | 208,764,915 | 835,045,124
HEXAGON_LARGE_CENTER (state centers) | 202,524 | 814,403 | 3,256,764 | 13,044,367 | 52,173,916 | 208,764,915 | 835,045,124
HEXAGON_SMALL (land ownerships) | 1,837,056 | 7,344,128 | 29,368,320 | 117,456,896 | 469,794,816 | 1,879,113,728 | 7,516,323,840
LINESTRING (roads) | 3,588 | 7,172 | 14,340 | 28,676 | 57,348 | 114,692 | 229,380
POINT (points of interest) | 1,837,056 | 7,344,128 | 29,368,320 | 117,456,896 | 469,794,816 | 1,879,113,728 | 7,516,323,840
TOTAL | 4,082,748 | 16,324,234 | 65,264,508 | 261,031,202 | 1,043,994,812 | 4,175,871,978 | 16,702,967,308
Geographica 3-CL DistSynthGen
Text format Storage Scaling
TEXT | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768
HEXAGON_LARGE (states) | 37.3M | 150.5M | 604.8M | 2.4G | 9.5G | 38.1G | 153.1G
HEXAGON_LARGE_CENTER (state centers) | 33.9M | 136.6M | 549.0M | 2.2G | 8.6G | 34.6G | 139.1G
HEXAGON_SMALL (land ownerships) | 374.7M | 1.5G | 5.9G | 23.7G | 94.6G | 379.5G | 1.5T
LINESTRING (roads) | 6.4M | 24.4M | 95.4M | 377.5M | 1.5G | 5.8G | 23.4G
POINT (points of interest) | 326.2M | 1.3G | 5.1G | 20.6G | 82.5G | 331.0G | 1.3T
TOTAL | 778.5M | 3.0G (x3.95) | 12.2G (x4.06) | 49.2G (x4.03) | 196.7G (x4.00) | 789.1G (x4.01) | 3.1T (x4.02)
Geographica 3-CL DistSynthGen
Parquet format Storage Scaling
PARQUET | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768
HEXAGON_LARGE (states) | 4.7M | 20.5M | 86.5M | 350.8M | 1.4G | 5.6G | 23.0G
HEXAGON_LARGE_CENTER (state centers) | 3.4M | 14.5M | 59.4M | 241.0M | 978.8M | 3.9G | 15.8G
HEXAGON_SMALL (land ownerships) | 46.9M | 198.3M | 835.0M | 3.3G | 13.3G | 53.8G | 218.8G
LINESTRING (roads) | 2.3M | 12.4M | 60.1M | 237.1M | 920.9M | 3.5G | 14.3G
POINT (points of interest) | 36.9M | 149.6M | 611.4M | 2.4G | 9.8G | 39.6G | 160.8G
TOTAL | 94.3M | 395.2M (x4.19) | 1.6G (x4.14) | 6.5G (x4.06) | 26.3G (x4.04) | 106.4G (x4.04) | 432.6G (x4.06)
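The scaling factors in the TOTAL rows, and the saving from Parquet, can be recomputed with a short script. It assumes the M/G/T suffixes are 1024-based, which is consistent with the reported x3.95 between 778.5M and 3.0G. Because the inputs are already rounded, some recomputed factors differ from the slide values in the second decimal.

```python
# Multipliers to MiB; assumption: M, G and T are 1024-based units.
UNIT = {"M": 1, "G": 1024, "T": 1024 ** 2}


def to_mib(size):
    return float(size[:-1]) * UNIT[size[-1]]


# TOTAL rows of the two storage tables, for N = 512 ... 32768.
TEXT = ["778.5M", "3.0G", "12.2G", "49.2G", "196.7G", "789.1G", "3.1T"]
PARQUET = ["94.3M", "395.2M", "1.6G", "6.5G", "26.3G", "106.4G", "432.6G"]


def growth(sizes):
    """Factor between consecutive columns (each column doubles N)."""
    mib = [to_mib(s) for s in sizes]
    return [round(b / a, 2) for a, b in zip(mib, mib[1:])]


def text_to_parquet(text, parquet):
    """How many times larger the text output is than the Parquet output."""
    return [round(to_mib(t) / to_mib(p), 2) for t, p in zip(text, parquet)]


print("text growth:   ", growth(TEXT))
print("parquet growth:", growth(PARQUET))
print("text / parquet:", text_to_parquet(TEXT, PARQUET))
```

Under these assumptions, each doubling of N multiplies storage by about four in both formats, and the Parquet+Snappy output stays roughly seven to eight times smaller than plain text across the whole range.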
Geographica 3-CL DistSynthGen
Time Scaling
Format | 2048 | 4096 | 8192 | 16384
Text | 1m 12s | 3m 24s (x2.83) | 13m 00s (x3.82) | 51m 00s (x3.92)
Parquet | 1m 24s | 4m 06s (x2.92) | 16m 00s (15m 00s 12prt) (x3.65) | 66m 00s (60m 00s 12prt) (x4.00)
PolarTEP (Medium Conf) - Driver (4GB, 1vCore), 4xExecutor (4GB, 1vCore)
* Spark Job Total Uptime
Thank you!

From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 

Big Linked Data Querying - ExtremeEarth Open Workshop

  • 2. ExtremeEarth From Copernicus Big Data to Extreme Earth Analytics This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825258.
  • 3. December 09, 2021 Extreme Earth Online Workshop Dimitris Bilidas, Theofilos Ioannidis Querying Big Linked Geospatial Data
  • 4. Overview
    ● Objective: Perform GeoSPARQL query answering on top of massive geospatial RDF graphs.
      ○ Rich spatial analytics on large datasets interlinking information mined from EO Copernicus data with other available datasets
    ● Development of the distributed Strabo2 system
      ○ Relies on the Apache Sedona framework (formerly GeoSpark) to perform geospatial analytics on top of Apache Spark.
      ○ Deployed on the Hopsworks platform in CREODIAS
    ● Application in the two use cases of ExtremeEarth
  • 5. Strabo 2 on Hopsworks
  • 6. Data Import
    ● Vertical Partitioning
      ○ For each predicate encountered in the RDF data, we create a 2-column table in Hive (subject and object columns)
    Input triples:
      <observation1> rdf:type <IceObservation> .
      <observation2> rdf:type <IceObservation> .
      <image1> rdf:type <SatelliteImage> .
      <observation1> polar:hasCTClassName "close drift ice" .
      <observation2> polar:hasCTClassName "open drift ice" .
      <image1> geo:hasGeometry <geometry1> .
    Resulting tables (subject, object), one per predicate:
      rdf:type
        <observation1>  <IceObservation>
        <observation2>  <IceObservation>
        <image1>        <SatelliteImage>
      polar:hasCTClassName
        <observation1>  "close drift ice"
        <observation2>  "open drift ice"
      geo:hasGeometry
        <image1>        <geometry1>
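The vertical partitioning step can be illustrated with a minimal Python sketch. It only groups the example triples from this slide by predicate, in memory; the function name `vertical_partition` is an assumption made for illustration and does not come from Strabo2, which stores the resulting tables in Hive.

```python
from collections import defaultdict

# Example triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    ("<observation1>", "rdf:type", "<IceObservation>"),
    ("<observation2>", "rdf:type", "<IceObservation>"),
    ("<image1>", "rdf:type", "<SatelliteImage>"),
    ("<observation1>", "polar:hasCTClassName", '"close drift ice"'),
    ("<observation2>", "polar:hasCTClassName", '"open drift ice"'),
    ("<image1>", "geo:hasGeometry", "<geometry1>"),
]


def vertical_partition(triples):
    """Group triples by predicate into two-column (subject, object) tables.

    Hypothetical helper: a real import would write each table to Hive
    instead of keeping the rows in a Python dictionary.
    """
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    return dict(tables)


tables = vertical_partition(TRIPLES)
for predicate, rows in tables.items():
    print(predicate)
    for subject, obj in rows:
        print("   ", subject, obj)
```

Each key of `tables` corresponds to one 2-column table, so a triple pattern with a fixed predicate only needs to read the rows of that single table.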
  • 7. Query Translation
    ● During import, we create a dictionary with the corresponding Hive table name for each RDF property.
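As a rough illustration of how such a dictionary could drive query translation, the sketch below turns a list of triple patterns into one SQL join over the 2-column tables. The table naming scheme (`prop1`, `prop2`, ...), the helper names, and the translation rules are assumptions for illustration only; the slides do not show the translation code used by Strabo2.

```python
def build_dictionary(predicates):
    """Map each distinct RDF property to a table name (hypothetical scheme)."""
    distinct = dict.fromkeys(predicates)  # keeps first-seen order
    return {p: f"prop{i}" for i, p in enumerate(distinct, start=1)}


def translate_bgp(patterns, dictionary):
    """Translate triple patterns (s, p, o) into a single SQL query.

    Assumptions of this sketch: predicates are constants, variables start
    with '?', and at least one variable occurs. Each pattern becomes a table
    alias, a repeated variable becomes an equality join, and a constant
    becomes a selection.
    """
    from_parts, where_parts, bindings = [], [], {}
    for i, (s, p, o) in enumerate(patterns):
        alias = f"t{i}"
        from_parts.append(f"{dictionary[p]} AS {alias}")
        for term, column in ((s, "subject"), (o, "object")):
            ref = f"{alias}.{column}"
            if term.startswith("?"):
                if term in bindings:
                    where_parts.append(f"{bindings[term]} = {ref}")
                else:
                    bindings[term] = ref
            else:
                escaped = term.replace("'", "''")
                where_parts.append(f"{ref} = '{escaped}'")
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in bindings.items())
    sql = f"SELECT {select} FROM {', '.join(from_parts)}"
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql


dictionary = build_dictionary(
    ["rdf:type", "polar:hasCTClassName", "geo:hasGeometry"]
)
patterns = [
    ("?obs", "rdf:type", "<IceObservation>"),
    ("?obs", "polar:hasCTClassName", "?cls"),
]
sql = translate_bgp(patterns, dictionary)
print(sql)
```

The dictionary lookup is the only place where a property IRI is resolved to a table, which is why it has to be built once during import.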
  • 10. Caching of Spatial Relations
    ● Use JedAI-spatial to pre-compute qualitative spatial relations between the geometries in the dataset
      ○ Optional step after data import
    ● During query translation, replace FILTER clauses that introduce spatial joins in GeoSPARQL with predicates accessing the stored qualitative relations
      ○ We cannot replace the geof:distance and geof:disjoint functions
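The rewriting rule on this slide could look roughly like the following sketch. The mapping from filter functions to stored-relation predicates and the function name `rewrite_spatial_filter` are hypothetical; the slides only state that spatial-join filters are replaced by predicates over the pre-computed relations, and that `geof:distance` and `geof:disjoint` are excluded.

```python
# Hypothetical mapping from GeoSPARQL filter functions to the predicates under
# which pre-computed qualitative relations might be stored.
STORED_RELATION = {
    "geof:sfIntersects": "geo:sfIntersects",
    "geof:sfWithin": "geo:sfWithin",
    "geof:sfTouches": "geo:sfTouches",
}

# Functions the slide names as not replaceable.
NOT_REPLACEABLE = {"geof:distance", "geof:disjoint"}


def rewrite_spatial_filter(function, left_var, right_var):
    """Return a triple pattern that replaces FILTER(function(left, right)).

    Returns None when the filter has to stay in place and be evaluated as a
    regular spatial join.
    """
    if function in NOT_REPLACEABLE or function not in STORED_RELATION:
        return None
    return f"{left_var} {STORED_RELATION[function]} {right_var} ."


replaced = rewrite_spatial_filter("geof:sfIntersects", "?g1", "?g2")
kept = rewrite_spatial_filter("geof:distance", "?g1", "?g2")
print(replaced)
print(kept)
```

When a pattern is returned, the spatial join turns into an ordinary lookup in one more 2-column table; when `None` is returned, the original FILTER is kept.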
  • 11. Using Persistent Spatial Indexing and Partitioning
    ● Persistent spatial indexes cannot be created through the Sedona SQL interface: for each operation, a temporary spatial index is created on the fly
    ● Use the Sedona RDD interface to create a persistent spatial index, and implement spatial filter operations with code that accesses the RDD and transforms the result into a DataFrame
    ● Data loading and Indexing:
  • 12. Using Persistent Spatial Indexing and Partitioning
    ● During translation of a GeoSPARQL query that contains a spatial filter, we modify the translation process so that it accesses the persistent spatial index and partitioning, computes an intermediate result corresponding to the filter, and saves it in a temporary table.
  • 13. Using Persistent Spatial Indexing and Partitioning
    ● During translation of a GeoSPARQL query that contains a spatial filter, we modify the translation process so that it accesses the persistent spatial index and partitioning, computes an intermediate result corresponding to the filter, and saves it in a temporary table.
    ● Finally, we modify the resulting SQL query by replacing the spatial filter with access to the temporary table.
  • 14. Caching Partitioned Thematic Tables
    ● Before the execution of a query, for each base table participating in a join, we perform partitioning and sorting on the join key
      ○ This comes at no extra cost, as Spark will perform this operation during query execution
    ● Cache the temporary result and re-use it in subsequent queries
    ● Spark manages the cache by saving table fragments to disk when appropriate
    Original query:
      SELECT prop1.subject FROM prop1, prop2 WHERE prop1.subject = prop2.object
    Rewritten with cached, partitioned tables:
      CACHE TABLE prop1SubjectPartitioned AS SELECT * FROM prop1 CLUSTER BY subject
      CACHE TABLE prop2ObjectPartitioned AS SELECT * FROM prop2 CLUSTER BY object
      SELECT prop1SubjectPartitioned.subject FROM prop1SubjectPartitioned, prop2ObjectPartitioned WHERE prop1SubjectPartitioned.subject = prop2ObjectPartitioned.object
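To show why pre-partitioning both tables on the join key helps, here is a small in-memory Python analogy, not Spark code. It hash-partitions and sorts each table on its join column, similar in spirit to what `CLUSTER BY` does, and then joins matching partitions independently. The helper names and the sample rows are assumptions for illustration.

```python
from collections import Counter


def cluster_by(rows, key_index, num_partitions):
    """Hash-partition rows on one column and sort each partition by it."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    for part in partitions:
        part.sort(key=lambda r: r[key_index])
    return partitions


def partitioned_join(prop1_parts, prop2_parts):
    """Evaluate prop1.subject = prop2.object one partition pair at a time.

    Equal keys hash to the same partition index, so partition i of prop1
    only needs to be compared with partition i of prop2. A dictionary lookup
    is used here for brevity; an engine could instead merge the sorted runs.
    """
    subjects = []
    for left, right in zip(prop1_parts, prop2_parts):
        object_counts = Counter(obj for _, obj in right)
        for subject, _ in left:
            subjects.extend([subject] * object_counts[subject])
    return subjects


# Made-up (subject, object) rows for two property tables.
prop1 = [("a", "x"), ("b", "y"), ("c", "z")]
prop2 = [("p", "a"), ("q", "a"), ("r", "c")]

prop1_subject_partitioned = cluster_by(prop1, key_index=0, num_partitions=4)
prop2_object_partitioned = cluster_by(prop2, key_index=1, num_partitions=4)
result = sorted(partitioned_join(prop1_subject_partitioned,
                                 prop2_object_partitioned))
print(result)
```

Because the partitioned tables are built once and kept, later joins on the same keys can skip the partitioning step, which is the effect the cached tables on this slide aim for.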
  • 15. Query Optimizer
    ● Takes as input a query that contains a series of thematic and spatial filters and joins and returns a query execution plan
    ● Needs estimates about the result size of each operation
    ● Plan enumeration is based on the DPsub dynamic programming algorithm
    ● Cost estimates take into consideration the partitioning of tables and the possible use of spatial indexing for spatial selection operations
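The subset-driven enumeration behind DPsub can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Strabo2 optimizer: the cost of a plan is the sum of its estimated intermediate result sizes, cross products are allowed, and the cardinalities and selectivities in the example are made up. Partitioning and spatial indexing, which the optimizer above also considers, are left out.

```python
def dpsub(cardinalities, selectivities):
    """Enumerate join plans over all relation subsets, DPsub style.

    cardinalities: estimated row counts of the base tables R0..Rn-1.
    selectivities: {(i, j): s} with i < j for each join predicate; pairs
                   without an entry get selectivity 1.0 (cross product).
    Returns (cost, plan) for joining all relations.
    """
    n = len(cardinalities)

    def size(mask):
        # Estimated rows of the join of all relations in the bitmask.
        rels = [i for i in range(n) if mask >> i & 1]
        rows = 1.0
        for i in rels:
            rows *= cardinalities[i]
        for a in range(len(rels)):
            for b in range(a + 1, len(rels)):
                rows *= selectivities.get((rels[a], rels[b]), 1.0)
        return rows

    # Base case: scanning a single relation has no intermediate result.
    best = {1 << i: (0.0, f"R{i}") for i in range(n)}

    # Visit subsets in increasing integer order, so every proper subset of
    # the current mask already has its best plan recorded.
    for mask in range(1, 1 << n):
        if mask & (mask - 1) == 0:
            continue  # single relation, handled above
        out_rows = size(mask)
        left = (mask - 1) & mask
        while left:
            right = mask ^ left
            if left < right:  # cost is symmetric: try each split once
                cost = best[left][0] + best[right][0] + out_rows
                if mask not in best or cost < best[mask][0]:
                    plan = f"({best[left][1]} JOIN {best[right][1]})"
                    best[mask] = (cost, plan)
            left = (left - 1) & mask
    return best[(1 << n) - 1]


# Made-up estimates: R0 joins R1 selectively, R1 joins R2 less selectively.
cost, plan = dpsub([1000, 10, 100], {(0, 1): 0.01, (1, 2): 0.5})
print(plan, cost)
```

With these estimates the selective join between R0 and R1 is placed first, because it keeps the intermediate result small before R2 is joined.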
  • 16. Endpoint Deployment Through Apache Livy
    ● Developed a SPARQL endpoint based on Apache Livy for communication with HopsYARN
      ○ Can be deployed using Docker
    ● A Spark session is initiated on startup
      ○ Apache Sedona jars are added to the Spark jars
      ○ Spatial UDFs are registered in the Spark engine through specialized requests
    ● The endpoint has been deployed in the CREODIAS installation of Hopsworks
  • 17. Deployment At CREODIAS
    ● Strabo2 has been installed in CREODIAS with the following datasets:
    ● For polar use-case
      ○ Ice Charts dataset
      ○ Ship positions dataset
      ○ GADM Norway and GADM North
    ● For food security use-case
      ○ Crop type maps
      ○ Hydro River Network EU (Danube, Rhine, Elbe)
      ○ Irrigation dataset
      ○ Precipitation dataset
      ○ Snow cover dataset
      ○ Interlinking result between Precipitation & GADM
    ● Total size ~30 GB
  • 18. Deployment At CREODIAS
    ● Data loading time with 8 executors, 2 cores and 4GB memory per executor is 80 minutes
  • 19. Deployment At CREODIAS
    ● Queries (12 executors, 2 cores and 4GB memory per executor):
      ○ Get all images that correspond to ice map observations that were obtained between 2018-03-03 and 2018-03-01 and whose observation CT class name is Close Drift Ice: ~233 million results in 2 minutes
      ○ Get all observations at a distance of less than 5 km from POLYGON ((0.0 0.0, 90.0 0.0, 90.0 77.94970848221368, 0.0 77.94970848221368, 0.0 0.0)): 35k results in 47 seconds
      ○ Get regions affected by precipitation in "Quarter 2 of 2021" that was "lower than 15%" of the normal rainfall and that are "equipped with irrigation": 12k results in 38 seconds
      ○ Get regions that showed a negative trend in precipitation in Q2 but a positive one in Q3 (of e.g. year 2018): 64 results in 23 seconds
  • 20. Experiments with synthetic datasets
    ● Geographica 2 synthetic dataset
      ○ For each dataset, a minimal ontology that follows a general version of the schema of OSM is used.
      ○ 36 queries of spatial selections and spatial joins with different selectivities (intersects, within, touches).
      ○ We have successfully executed the query set in hops.site for a dataset size of up to 2.35 billion triples (0.5 TB in plain text, scale factor 12228).
        ■ Average execution time: 98 seconds using 128 workers, 271 seconds using 64 workers
  • 23. Ongoing Work
    ● Perform more experiments
      ○ With more datasets and queries from the use cases
      ○ Scalability experiments with the Synthetic dataset of Geographica 3
      ○ Evaluate specific aspects of the system and identify possible shortcomings
        ■ steps for further improvement
  • 24. Geographica 3-CL Distributed Synthetic Generator
    ● Changed the serial processing logic to run in a distributed fashion, allowing for horizontal scalability
    ● Optimized data structures to minimize the memory footprint on the driver and on each executor, as well as network communication
    ● To minimize the storage footprint, the Parquet+Snappy file format was added to the output format options
    ● To increase parallelism for distributed dataset consumers (Strabo 2 Loaders), the optimal number of partitions can be provided
    ● Dynamic spatial selectivities for query sets, provided by the user, allow for better steering of query loads towards testing the scalability of spatial behaviour
  • 25. Geographica 3-CL DistSynthGen Number of Triples per Feature Class (TEXT / PARQUET)
    Feature class                         N=512 (baseline)  1024        2048        4096         8192           16384          32768
    HEXAGON_LARGE (states)                202.524           814.403     3.256.764   13.044.367   52.173.916     208.764.915    835.045.124
    HEXAGON_LARGE_CENTER (state centers)  202.524           814.403     3.256.764   13.044.367   52.173.916     208.764.915    835.045.124
    HEXAGON_SMALL (land ownerships)       1.837.056         7.344.128   29.368.320  117.456.896  469.794.816    1.879.113.728  7.516.323.840
    LINESTRING (roads)                    3.588             7.172       14.340      28.676       57.348         114.692        229.380
    POINT (points of interest)            1.837.056         7.344.128   29.368.320  117.456.896  469.794.816    1.879.113.728  7.516.323.840
    TOTAL                                 4.082.748         16.324.234  65.264.508  261.031.202  1.043.994.812  4.175.871.978  16.702.967.308
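The TOTAL row can be checked with a few lines of Python. The sketch below only divides each total by the total at the previous scale factor; the numbers are copied from the slide, and the function name `growth_ratios` is an assumption for illustration.

```python
# Total triples per scale factor N, copied from the TOTAL row of the slide.
TOTAL_TRIPLES = {
    512: 4_082_748,
    1024: 16_324_234,
    2048: 65_264_508,
    4096: 261_031_202,
    8192: 1_043_994_812,
    16384: 4_175_871_978,
    32768: 16_702_967_308,
}


def growth_ratios(totals):
    """Ratio of each total to the total at the previous (halved) N."""
    ns = sorted(totals)
    return {b: totals[b] / totals[a] for a, b in zip(ns, ns[1:])}


ratios = growth_ratios(TOTAL_TRIPLES)
for n, ratio in ratios.items():
    print(f"N={n}: x{ratio:.3f}")
```

Every ratio comes out close to 4, so the number of triples appears to grow roughly with the square of N, in line with the x4 storage factors reported for the text and Parquet outputs.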
  • 26. Geographica 3-CL DistSynthGen Text format Storage Scaling
    Feature class (TEXT)                  N=512 (baseline)  1024            2048            4096            8192            16384           32768
    HEXAGON_LARGE (states)                37.3M             150.5M          604.8M          2.4G            9.5G            38.1G           153.1G
    HEXAGON_LARGE_CENTER (state centers)  33.9M             136.6M          549.0M          2.2G            8.6G            34.6G           139.1G
    HEXAGON_SMALL (land ownerships)       374.7M            1.5G            5.9G            23.7G           94.6G           379.5G          1.5T
    LINESTRING (roads)                    6.4M              24.4M           95.4M           377.5M          1.5G            5.8G            23.4G
    POINT (points of interest)            326.2M            1.3G            5.1G            20.6G           82.5G           331.0G          1.3T
    TOTAL                                 778.5M            3.0G (x3.95)    12.2G (x4.06)   49.2G (x4.03)   196.7G (x4.00)  789.1G (x4.01)  3.1T (x4.02)
  • 27. Geographica 3-CL DistSynthGen Parquet format Storage Scaling
    Feature class (PARQUET)               N=512 (baseline)  1024            2048            4096            8192            16384           32768
    HEXAGON_LARGE (states)                4.7M              20.5M           86.5M           350.8M          1.4G            5.6G            23.0G
    HEXAGON_LARGE_CENTER (state centers)  3.4M              14.5M           59.4M           241.0M          978.8M          3.9G            15.8G
    HEXAGON_SMALL (land ownerships)       46.9M             198.3M          835.0M          3.3G            13.3G           53.8G           218.8G
    LINESTRING (roads)                    2.3M              12.4M           60.1M           237.1M          920.9M          3.5G            14.3G
    POINT (points of interest)            36.9M             149.6M          611.4M          2.4G            9.8G            39.6G           160.8G
    TOTAL                                 94.3M             395.2M (x4.19)  1.6G (x4.14)    6.5G (x4.06)    26.3G (x4.04)   106.4G (x4.04)  432.6G (x4.06)
  • 28. Geographica 3-CL DistSynthGen Time Scaling
    Format   2048     4096             8192                              16384
    Text     1m 12s   3m 24s (x2.83)   13m 00s (x3.82)                   51m 00s (x3.92)
    Parquet  1m 24s   4m 06s (x2.92)   16m 00s (15m 00s 12prt) (x3.65)   66m 00s (60m 00s 12prt) (x4.00)
    PolarTEP (Medium Conf) - Driver (4GB, 1vCore), 4xExecutor (4GB, 1vCore)
    * Spark Job Total Uptime