ExtremeEarth
From Copernicus Big Data
to Extreme Earth Analytics
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 825258.
December 09, 2021
Extreme Earth Online Workshop
Dimitris Bilidas, Theofilos Ioannidis
Querying Big Linked Geospatial Data
Overview
● Objective: Perform GeoSPARQL query answering on top of massive geospatial
RDF graphs.
○ Rich spatial analytics on large datasets interlinking information mined from EO
Copernicus data with other available datasets
● Development of the distributed Strabo2 system
○ Relies on the Apache Sedona framework (formerly GeoSpark) in order to
perform geospatial analytics on top of Apache Spark.
○ Deployed on the Hopsworks platform at CREODIAS
● Application in the two use cases of ExtremeEarth
Strabo 2 on Hopsworks
Data Import
● Vertical Partitioning
○ For each predicate encountered in the RDF data, we create a 2-column table in
Hive (subject and object columns)
<observation1> rdf:type <IceObservation> .
<observation2> rdf:type <IceObservation> .
<image1> rdf:type <SatelliteImage> .
<observation1> polar:hasCTClassName "close drift ice" .
<observation2> polar:hasCTClassName "open drift ice" .
<image1> geo:hasGeometry <geometry1> .

Table for rdf:type (subject | object):
<observation1> | <IceObservation>
<observation2> | <IceObservation>
<image1> | <SatelliteImage>

Table for polar:hasCTClassName (subject | object):
<observation1> | "close drift ice"
<observation2> | "open drift ice"

Table for geo:hasGeometry (subject | object):
<image1> | <geometry1>
Query Translation
● During import, we create a dictionary with the corresponding Hive
table name for each RDF property.
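The vertical partitioning and the property dictionary described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Strabo2 loader: the function name `vertical_partition` and the `prop1`, `prop2`, ... table naming scheme are assumptions made for the example, and in-memory lists stand in for Hive tables.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one 2-column table per predicate."""
    dictionary = {}             # RDF property -> table name
    tables = defaultdict(list)  # table name -> list of (subject, object) rows
    for subject, predicate, obj in triples:
        if predicate not in dictionary:
            # Hypothetical naming scheme: prop1, prop2, ... in order of first appearance.
            dictionary[predicate] = f"prop{len(dictionary) + 1}"
        tables[dictionary[predicate]].append((subject, obj))
    return dict(tables), dictionary


# The six example triples from the Data Import slide.
triples = [
    ("<observation1>", "rdf:type", "<IceObservation>"),
    ("<observation2>", "rdf:type", "<IceObservation>"),
    ("<image1>", "rdf:type", "<SatelliteImage>"),
    ("<observation1>", "polar:hasCTClassName", '"close drift ice"'),
    ("<observation2>", "polar:hasCTClassName", '"open drift ice"'),
    ("<image1>", "geo:hasGeometry", "<geometry1>"),
]
tables, dictionary = vertical_partition(triples)
```

During query translation, a triple pattern with a known predicate can then be mapped to its table by a lookup such as `dictionary["geo:hasGeometry"]`.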
Caching of Spatial Relations
● Use JedAI-spatial to pre-compute qualitative spatial relations
between the geometries in the dataset
○ Optional step after data import
● During query translation, replace FILTER clauses that introduce spatial
joins in GeoSPARQL with predicates accessing the stored qualitative
relations
○ We cannot replace the geof:distance and geof:disjoint functions
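A minimal sketch of the FILTER replacement step, assuming the pre-computed relations are stored as GeoSPARQL simple-features properties such as `geo:sfIntersects`. The mapping table, the function name `rewrite_spatial_filter` and the choice of which relations are materialized are assumptions for illustration; only the exclusion of `geof:distance` and `geof:disjoint` comes from the slide.

```python
# Functions that the slide says cannot be answered from stored relations.
NOT_REPLACEABLE = {"geof:distance", "geof:disjoint"}

# Hypothetical mapping from GeoSPARQL filter functions to the stored
# qualitative relation that answers them.
STORED_RELATIONS = {
    "geof:sfIntersects": "geo:sfIntersects",
    "geof:sfWithin": "geo:sfWithin",
    "geof:sfTouches": "geo:sfTouches",
}


def rewrite_spatial_filter(function, left_var, right_var):
    """Return a triple pattern replacing the FILTER, or None to keep the FILTER."""
    if function in NOT_REPLACEABLE or function not in STORED_RELATIONS:
        return None
    return f"{left_var} {STORED_RELATIONS[function]} {right_var} ."


pattern = rewrite_spatial_filter("geof:sfWithin", "?g1", "?g2")
kept = rewrite_spatial_filter("geof:distance", "?g1", "?g2")
```

When the function returns `None`, the translator would fall back to evaluating the spatial join as usual.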
Using Persistent Spatial Indexing and Partitioning
● Persistent spatial indexes cannot be created through the Sedona SQL interface: for each
operation, a temporary spatial index is created on-the-fly
● Use the Sedona RDD interface in order to create a persistent spatial index, and implement
spatial filter operations with code accessing the RDD and transforming the result into
a DataFrame
● Data loading and indexing:
● During translation of a GeoSPARQL query that contains a spatial filter, we modify the
translation process, such that it will access the persistent spatial index and partitioning,
compute an intermediate result corresponding to the filter, and save it in a temporary table.
● Finally, we modify the resulting SQL query, by replacing the spatial filter with access to
the temporary table.
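The three steps above (reuse a persistent index, materialize the filter result, rewrite the SQL) can be illustrated with a toy, single-machine sketch. A uniform grid over points stands in for the Sedona spatial index and partitioning, and the SQL text, the `tmp_filter_1` table name and all function names are hypothetical; the real system works on Spark RDDs and DataFrames.

```python
from collections import defaultdict

CELL = 10.0  # grid cell size, chosen arbitrarily for the example


def build_grid_index(points):
    """Build the index once; it is then reused by every spatial filter."""
    index = defaultdict(list)
    for gid, (x, y) in points.items():
        index[(int(x // CELL), int(y // CELL))].append(gid)
    return index


def window_filter(index, points, xmin, ymin, xmax, ymax):
    """Evaluate a rectangular spatial filter by visiting only the overlapping cells."""
    hits = []
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            for gid in index.get((cx, cy), []):
                x, y = points[gid]
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(gid)
    return sorted(hits)


def rewrite_sql(sql, spatial_predicate, temp_table, key_column="id"):
    """Replace the spatial filter text with a lookup in the temporary table."""
    return sql.replace(spatial_predicate, f"{key_column} IN (SELECT id FROM {temp_table})")


points = {"g1": (1.0, 1.0), "g2": (25.0, 25.0), "g3": (12.0, 3.0)}
index = build_grid_index(points)
temp_table_rows = window_filter(index, points, 0.0, 0.0, 15.0, 15.0)
new_sql = rewrite_sql(
    "SELECT s FROM geoms WHERE ST_Within(geom, :window)",
    "ST_Within(geom, :window)",
    "tmp_filter_1",
)
```

The point of the design is that the index is built once at load time, so each query pays only for the lookup and for reading the small temporary table.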
Caching Partitioned Thematic Tables
● Before the execution of a query, for each base table participating in a join, we perform
partitioning and sorting on the join key
○ This comes at no extra cost, as Spark would perform this operation during query execution anyway
● Cache the temporary result and re-use it in subsequent queries
● Spark manages the cache by saving table fragments to disk when appropriate
Original query:
SELECT prop1.subject FROM prop1, prop2
WHERE prop1.subject = prop2.object;
Cached, partitioned tables and the rewritten query:
CACHE TABLE prop1SubjectPartitioned AS SELECT * FROM prop1 CLUSTER BY subject;
CACHE TABLE prop2ObjectPartitioned AS SELECT * FROM prop2 CLUSTER BY object;
SELECT prop1SubjectPartitioned.subject FROM prop1SubjectPartitioned, prop2ObjectPartitioned
WHERE prop1SubjectPartitioned.subject = prop2ObjectPartitioned.object;
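To see why clustering both tables on the join key helps, the following plain-Python sketch mimics what `CLUSTER BY` does (hash-partition on a column, then sort within each partition) and then joins matching partitions pairwise. It is a conceptual model only; the function names are made up for the example and Spark's actual partitioner and join operators differ.

```python
def cluster_by(rows, key_index, num_partitions):
    """Hash-partition rows on one column and sort each partition on that column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    for partition in partitions:
        partition.sort(key=lambda row: row[key_index])
    return partitions


def partition_wise_join(prop1_parts, prop2_parts):
    """Join prop1.subject = prop2.object one partition pair at a time, with no reshuffle."""
    result = []
    for left, right in zip(prop1_parts, prop2_parts):
        object_counts = {}
        for _, obj in right:
            object_counts[obj] = object_counts.get(obj, 0) + 1
        for subject, _ in left:
            result.extend([subject] * object_counts.get(subject, 0))
    return result


prop1 = [("a", "x"), ("b", "y")]              # (subject, object)
prop2 = [("c", "a"), ("d", "a"), ("e", "z")]  # (subject, object)
prop1_parts = cluster_by(prop1, key_index=0, num_partitions=4)  # cluster by subject
prop2_parts = cluster_by(prop2, key_index=1, num_partitions=4)  # cluster by object
joined = partition_wise_join(prop1_parts, prop2_parts)
```

Because equal keys always hash to the same partition number, rows that can join are already co-located, which is the work a later query can skip when the clustered tables are cached.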
Query Optimizer
● Takes as input a query that contains a series of thematic and spatial filters
and joins and returns a query execution plan
● Needs estimates about the result size of each operation
● Plan enumeration is based on the DPsub dynamic programming algorithm
● Cost estimates take into consideration the partitioning of tables and the
possible use of spatial indexing for spatial selection operations
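As an illustration of DPsub-style plan enumeration, here is a compact sketch that enumerates table subsets as bitmasks in increasing order and, for each subset, tries every split into two smaller subsets. The cost model is an assumption: it simply sums estimated intermediate result sizes, computed from hypothetical base cardinalities and pairwise join selectivities, and it allows cross products. The optimizer described above additionally accounts for table partitioning and spatial indexing, which this sketch does not model.

```python
def dpsub(cardinalities, selectivities):
    """Return (cost, plan) for joining all tables; a plan is a nested tuple of table indexes.

    cardinalities: list of base-table sizes.
    selectivities: dict {(i, j): selectivity} with i < j; missing pairs count as 1.0.
    """
    n = len(cardinalities)
    best = {}  # bitmask of tables -> (cost, plan)
    for i in range(n):
        best[1 << i] = (0.0, i)
    for subset in range(1, 1 << n):
        if subset in best:
            continue  # single table, already initialized
        members = [i for i in range(n) if subset >> i & 1]
        # Estimated result size of joining all tables in this subset.
        size = 1.0
        for i in members:
            size *= cardinalities[i]
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                size *= selectivities.get((i, j), 1.0)
        # Enumerate every non-empty proper subset of `subset` as the left input.
        left = (subset - 1) & subset
        while left:
            right = subset ^ left
            if left < right:  # the cost is symmetric, so consider each split once
                cost = best[left][0] + best[right][0] + size
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, (best[left][1], best[right][1]))
            left = (left - 1) & subset
    return best[(1 << n) - 1]


# Hypothetical statistics for three tables and two join predicates.
cost, plan = dpsub([1000, 10, 100], {(0, 1): 0.01, (1, 2): 0.5})
```

Since every proper subset of a bitmask is numerically smaller than the bitmask itself, the best plans for both inputs are always available when a subset is processed.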
Endpoint Deployment Through Apache Livy
● Developed a SPARQL endpoint based on Apache Livy for
communication with HopsYARN
○ Can be deployed using Docker
● Spark session is initiated on startup
○ Apache Sedona jars are added to the Spark jars
○ Spatial UDFs are registered in the Spark engine through
specialized requests
● Endpoint has been deployed in CREODIAS installation of
Hopsworks
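The startup sequence can be sketched as the request bodies an endpoint might send to the Livy REST API: one `POST /sessions` that lists the extra jars, then one `POST /sessions/{id}/statements` that runs registration code. The jar paths are placeholders, and using Sedona's `SedonaSQLRegistrator.registerAll` inside the statement is an assumption about how the spatial functions get registered, not a description of the Strabo2 endpoint code. The sketch only builds the requests and does not contact a server.

```python
import json

# Placeholder jar locations; real paths depend on the installation.
SEDONA_JARS = [
    "hdfs:///path/to/sedona-core.jar",
    "hdfs:///path/to/sedona-sql.jar",
]

# Assumed Scala registration snippet, executed inside the Livy session.
REGISTER_UDFS = "org.apache.sedona.sql.utils.SedonaSQLRegistrator.registerAll(spark)"


def create_session_request(jars):
    """Request that starts a Spark session with the given jars on its classpath."""
    return ("POST", "/sessions", {"kind": "spark", "jars": list(jars)})


def run_statement_request(session_id, code):
    """Request that executes a code snippet in an existing session."""
    return ("POST", f"/sessions/{session_id}/statements", {"code": code})


startup = [
    create_session_request(SEDONA_JARS),
    run_statement_request(0, REGISTER_UDFS),  # session id 0 is illustrative
]
payloads = [json.dumps(body) for _, _, body in startup]
```

An HTTP client would send each tuple in order and wait for the session to reach the idle state before posting the statement.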
Deployment At CREODIAS
● Strabo2 has been installed in CREODIAS with the following datasets:
● For polar use-case
○ Ice Charts dataset
○ Ship positions dataset
○ GADM Norway and GADM North
● For food security use-case
○ Crop type maps
○ Hydro River Network EU (Danube, Rhine, Elbe)
○ Irrigation dataset
○ Precipitation dataset
○ Snow cover dataset
○ Interlinking result between Precipitation & GADM
● Total size ~30 GB
● Data loading time with 8 executors, 2 cores and 4GB memory per
executor is 80 minutes
● Queries (12 executors, 2 cores and 4GB memory per executor):
○ Get all images that correspond to ice map observations that were
obtained between 2018-03-03 and 2018-03-01 and the observation
CT class name is Close Drift Ice: ~233 million results in 2 minutes
○ Get all observations in less than 5km distance from POLYGON ((0.0
0.0, 90.0 0.0, 90.0 77.94970848221368, 0.0 77.94970848221368,
0.0 0.0)): 35k results in 47 seconds
○ Get regions affected by precipitation in “Quarter 2 of 2021” that was
“lower than 15%” of the normal rainfall and that are “equipped with
irrigation”: 12k results in 38 seconds
○ Get regions that showed a negative trend in precipitation in Q2 but a
positive in Q3 (of e.g. year 2018): 64 results in 23 seconds
Experiments with synthetic datasets
● Geographica 2 synthetic dataset
○ For each dataset, a minimal ontology that follows a general version
of the schema of OSM is used.
○ 36 queries of spatial selections and spatial joins with different
selectivities (intersects, within, touches).
○ We have successfully executed the query set in hops.site for a
dataset size of up to 2.35 billion triples (0.5 TB in plain text, scale
factor 12228).
■ Average execution time: 98 seconds using 128 workers, 271
seconds using 64 workers
Scalability-Dataset Size
Scalability-Number of Executors
Ongoing Work
● Perform more experiments
○ With more datasets and queries from the use cases
○ Scalability experiments with the Synthetic dataset of Geographica 3
○ Evaluate specific aspects of the system and identify possible
shortcomings
■ steps for further improvement
Geographica 3-CL Distributed Synthetic Generator
● Changed serial processing logic in order to run in a distributed fashion
to allow for horizontal scalability
● Optimized data structures to minimize memory footprint on driver,
each executor and network communication
● To minimize the storage footprint, the Parquet+Snappy file format was
added to the output format options
● To increase parallelism for distributed consumers of the dataset (Strabo 2
Loaders), the optimal number of partitions can be provided
● Dynamic spatial selectivities for query sets, provided by the user,
allow better steering of query loads towards testing the scalability of
spatial behaviour
Geographica 3-CL DistSynthGen
Number of Triples per Feature Class
| TEXT / PARQUET | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|---|---|---|---|---|---|---|---|
| HEXAGON_LARGE (states) | 202,524 | 814,403 | 3,256,764 | 13,044,367 | 52,173,916 | 208,764,915 | 835,045,124 |
| HEXAGON_LARGE_CENTER (state centers) | 202,524 | 814,403 | 3,256,764 | 13,044,367 | 52,173,916 | 208,764,915 | 835,045,124 |
| HEXAGON_SMALL (land ownerships) | 1,837,056 | 7,344,128 | 29,368,320 | 117,456,896 | 469,794,816 | 1,879,113,728 | 7,516,323,840 |
| LINESTRING (roads) | 3,588 | 7,172 | 14,340 | 28,676 | 57,348 | 114,692 | 229,380 |
| POINT (points of interest) | 1,837,056 | 7,344,128 | 29,368,320 | 117,456,896 | 469,794,816 | 1,879,113,728 | 7,516,323,840 |
| TOTAL | 4,082,748 | 16,324,234 | 65,264,508 | 261,031,202 | 1,043,994,812 | 4,175,871,978 | 16,702,967,308 |
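The growth pattern in the TOTAL row can be checked with a few lines of arithmetic. The snippet below uses only the totals from the table and computes the factor by which the triple count grows each time N doubles; the variable names are chosen for the example.

```python
# Total number of triples per scale factor N, copied from the TOTAL row.
total_triples = {
    512: 4_082_748,
    1024: 16_324_234,
    2048: 65_264_508,
    4096: 261_031_202,
    8192: 1_043_994_812,
    16384: 4_175_871_978,
    32768: 16_702_967_308,
}

scales = sorted(total_triples)
# Growth factor between consecutive scale factors (each step doubles N).
ratios = [total_triples[b] / total_triples[a] for a, b in zip(scales, scales[1:])]

for n, ratio in zip(scales[1:], ratios):
    print(f"N={n}: x{ratio:.3f}")
```

Each doubling of N multiplies the total by slightly less than 4, so the dataset size grows roughly quadratically in N; the LINESTRING class is the exception, as it only doubles.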
Geographica 3-CL DistSynthGen
Text format Storage Scaling
| TEXT | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|---|---|---|---|---|---|---|---|
| HEXAGON_LARGE (states) | 37.3M | 150.5M | 604.8M | 2.4G | 9.5G | 38.1G | 153.1G |
| HEXAGON_LARGE_CENTER (state centers) | 33.9M | 136.6M | 549.0M | 2.2G | 8.6G | 34.6G | 139.1G |
| HEXAGON_SMALL (land ownerships) | 374.7M | 1.5G | 5.9G | 23.7G | 94.6G | 379.5G | 1.5T |
| LINESTRING (roads) | 6.4M | 24.4M | 95.4M | 377.5M | 1.5G | 5.8G | 23.4G |
| POINT (points of interest) | 326.2M | 1.3G | 5.1G | 20.6G | 82.5G | 331.0G | 1.3T |
| TOTAL | 778.5M | 3.0G (x3.95) | 12.2G (x4.06) | 49.2G (x4.03) | 196.7G (x4.00) | 789.1G (x4.01) | 3.1T (x4.02) |
Geographica 3-CL DistSynthGen
Parquet format Storage Scaling
| PARQUET | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|---|---|---|---|---|---|---|---|
| HEXAGON_LARGE (states) | 4.7M | 20.5M | 86.5M | 350.8M | 1.4G | 5.6G | 23.0G |
| HEXAGON_LARGE_CENTER (state centers) | 3.4M | 14.5M | 59.4M | 241.0M | 978.8M | 3.9G | 15.8G |
| HEXAGON_SMALL (land ownerships) | 46.9M | 198.3M | 835.0M | 3.3G | 13.3G | 53.8G | 218.8G |
| LINESTRING (roads) | 2.3M | 12.4M | 60.1M | 237.1M | 920.9M | 3.5G | 14.3G |
| POINT (points of interest) | 36.9M | 149.6M | 611.4M | 2.4G | 9.8G | 39.6G | 160.8G |
| TOTAL | 94.3M | 395.2M (x4.19) | 1.6G (x4.14) | 6.5G (x4.06) | 26.3G (x4.04) | 106.4G (x4.04) | 432.6G (x4.06) |
Geographica 3-CL DistSynthGen
Time Scaling
| Format | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|
| Text | 1m 12s | 3m 24s (x2.83) | 13m 00s (x3.82) | 51m 00s (x3.92) |
| Parquet | 1m 24s | 4m 06s (x2.92) | 16m 00s (15m 00s 12prt) (x3.65) | 66m 00s (60m 00s 12prt) (x4.00) |
* Spark Job Total Uptime
PolarTEP (Medium Conf) - Driver (4GB, 1vCore), 4xExecutor (4GB, 1vCore)
Thank you!

The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 

Big Linked Data Querying - ExtremeEarth Open Workshop

  • 2. ExtremeEarth From Copernicus Big Data to Extreme Earth Analytics This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825258.
  • 3. December 09, 2021 Extreme Earth Online Workshop Dimitris Bilidas, Theofilos Ioannidis Querying Big Linked Geospatial Data
  • 4. Overview ● Objective: Perform GeoSPARQL query answering on top of massive geospatial RDF graphs. ○ Rich spatial analytics on large datasets interlinking information mined from EO Copernicus data with other available datasets ● Development of the distributed Strabo2 system ○ Relies on the Apache Sedona framework (formerly GeoSpark) to perform geospatial analytics on top of Apache Spark. ○ Deployed on the Hopsworks platform at CREODIAS ● Application in the two use cases of ExtremeEarth
  • 5. Strabo2 on Hopsworks
  • 6. Data Import ● Vertical Partitioning ○ For each predicate encountered in the RDF data, we create a 2-column table in Hive (subject and object columns)
      Input triples:
        <observation1> rdf:type <IceObservation> .
        <observation2> rdf:type <IceObservation> .
        <image1> rdf:type <SatelliteImage> .
        <observation1> polar:hasCTClassName "close drift ice" .
        <observation2> polar:hasCTClassName "open drift ice" .
        <image1> geo:hasGeometry <geometry1> .
      Resulting tables (subject, object):
        rdf:type
          <observation1>  <IceObservation>
          <observation2>  <IceObservation>
          <image1>        <SatelliteImage>
        polar:hasCTClassName
          <observation1>  "close drift ice"
          <observation2>  "open drift ice"
        geo:hasGeometry
          <image1>        <geometry1>
  • 7. Query Translation ● During import, we create a dictionary with the corresponding Hive table name for each RDF property.
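The import steps on slides 6 and 7 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `table_name_for` and `vertical_partition` and the `prop_` naming scheme are hypothetical, and in-memory lists stand in for Hive tables.

```python
import re
from collections import defaultdict


def table_name_for(predicate):
    # Hypothetical naming scheme (not from the slides): replace every
    # non-alphanumeric run with '_' so the result is a legal table name.
    return "prop_" + re.sub(r"[^0-9A-Za-z]+", "_", predicate).strip("_").lower()


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one 2-column table per predicate."""
    tables = defaultdict(list)   # table name -> list of (subject, object) rows
    dictionary = {}              # RDF property -> table name
    for subject, predicate, obj in triples:
        name = dictionary.setdefault(predicate, table_name_for(predicate))
        tables[name].append((subject, obj))
    return dict(tables), dictionary


triples = [
    ("<observation1>", "rdf:type", "<IceObservation>"),
    ("<observation2>", "rdf:type", "<IceObservation>"),
    ("<image1>", "rdf:type", "<SatelliteImage>"),
    ("<observation1>", "polar:hasCTClassName", '"close drift ice"'),
    ("<observation2>", "polar:hasCTClassName", '"open drift ice"'),
    ("<image1>", "geo:hasGeometry", "<geometry1>"),
]

tables, dictionary = vertical_partition(triples)
for prop, name in dictionary.items():
    print(prop, "->", name, tables[name])
```

During query translation, the dictionary is what lets a triple pattern such as `?x rdf:type ?y` be mapped to a scan of the corresponding table.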
  • 10. Caching of Spatial Relations ● Use JedAI-spatial to pre-compute qualitative spatial relations between the geometries in the dataset ○ Optional step after data import ● During query translation, replace FILTER clauses that introduce spatial joins in GeoSPARQL with predicates accessing the stored qualitative relations ○ We cannot replace the geof:distance and geof:disjoint functions
  • 11. Using Persistent Spatial Indexing and Partitioning ● Persistent spatial indexes cannot be created through the Sedona SQL interface: for each operation, a temporary spatial index is created on the fly ● Use the Sedona RDD interface to create a persistent spatial index, and implement spatial filter operations with code that accesses the RDD and transforms the result into a DataFrame ● Data loading and indexing:
  • 12. Using Persistent Spatial Indexing and Partitioning ● During translation of a GeoSPARQL query that contains a spatial filter, we modify the translation process so that it accesses the persistent spatial index and partitioning, computes an intermediate result corresponding to the filter, and saves it in a temporary table.
  • 13. Using Persistent Spatial Indexing and Partitioning ● During translation of a GeoSPARQL query that contains a spatial filter, we modify the translation process so that it accesses the persistent spatial index and partitioning, computes an intermediate result corresponding to the filter, and saves it in a temporary table. ● Finally, we modify the resulting SQL query by replacing the spatial filter with access to the temporary table.
  • 14. Caching Partitioned Thematic Tables ● Before the execution of a query, for each base table participating in a join, we perform partitioning and sorting on the join key ○ This comes at no extra cost, as Spark would perform this operation during query execution anyway ● Cache the temporary result and re-use it in subsequent queries ● Spark manages the cache by saving table fragments to disk when appropriate
      Original query:
        SELECT prop1.subject FROM prop1, prop2 WHERE prop1.subject = prop2.object
      Rewritten with cached, partitioned tables:
        CACHE TABLE prop1SubjectPartitioned AS SELECT * FROM prop1 CLUSTER BY subject
        CACHE TABLE prop2ObjectPartitioned AS SELECT * FROM prop2 CLUSTER BY object
        SELECT prop1SubjectPartitioned.subject FROM prop1SubjectPartitioned, prop2ObjectPartitioned WHERE prop1SubjectPartitioned.subject = prop2ObjectPartitioned.object
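The rewriting shown on slide 14 can be generated mechanically. The sketch below builds the three SQL strings for a two-table join. The helper names `cache_statement` and `rewrite_join` are hypothetical, and the code only produces text: it does not talk to Spark, and it assumes the `<table><Key>Partitioned` naming pattern seen in the slide's example.

```python
def cache_statement(table, key):
    """Return (cached_table_name, CACHE TABLE statement) for one join input."""
    # Naming pattern copied from the slide example, e.g. prop1SubjectPartitioned.
    cached = f"{table}{key.capitalize()}Partitioned"
    sql = f"CACHE TABLE {cached} AS SELECT * FROM {table} CLUSTER BY {key}"
    return cached, sql


def rewrite_join(left, left_key, right, right_key):
    """Rewrite a two-table equi-join to read from cached, partitioned copies."""
    left_cached, left_sql = cache_statement(left, left_key)
    right_cached, right_sql = cache_statement(right, right_key)
    query = (
        f"SELECT {left_cached}.{left_key} "
        f"FROM {left_cached}, {right_cached} "
        f"WHERE {left_cached}.{left_key} = {right_cached}.{right_key}"
    )
    return [left_sql, right_sql, query]


statements = rewrite_join("prop1", "subject", "prop2", "object")
for statement in statements:
    print(statement)
```

Because both cached tables are clustered on their join key, a later query joining on the same keys can reuse the cached partitioning instead of shuffling the base tables again.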
  • 15. Query Optimizer ● Takes as input a query that contains a series of thematic and spatial filters and joins and returns a query execution plan ● Needs estimates about the result size of each operation ● Plan enumeration is based on the DPsub dynamic programming algorithm ● Cost estimates take into consideration the partitioning of tables and the possible use of spatial indexing for spatial selection operations
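To make the plan-enumeration step concrete, here is a minimal sketch of DPsub-style enumeration over bitmask-encoded sets of tables. It is not the Strabo2 optimizer: the cost model (sum of estimated intermediate result sizes), the inputs `cards` and `selectivity`, and the function names are all assumptions made for illustration, and the sketch ignores partitioning and spatial indexing, which the slide says the real cost estimates consider.

```python
def cross_selectivity(s1, s2, selectivity):
    """Combined selectivity of all join predicates linking s1 and s2, or None if unconnected."""
    sel, connected = 1.0, False
    for (i, j), value in selectivity.items():
        links = ((s1 >> i) & 1 and (s2 >> j) & 1) or ((s1 >> j) & 1 and (s2 >> i) & 1)
        if links:
            sel *= value
            connected = True
    return sel if connected else None


def dpsub(cards, selectivity):
    """Return (cost, size, plan) for joining all tables, avoiding cross products."""
    n = len(cards)
    # best[mask] = (cost, estimated result size, plan string)
    best = {1 << i: (0.0, float(cards[i]), f"R{i}") for i in range(n)}
    # Increasing integer order guarantees every subset is solved before its supersets.
    for s in range(1, 1 << n):
        if s & (s - 1) == 0:
            continue  # single table, already initialised
        s1 = (s - 1) & s
        while s1:  # enumerate all non-empty proper subsets of s
            s2 = s ^ s1
            if s1 in best and s2 in best:
                sel = cross_selectivity(s1, s2, selectivity)
                if sel is not None:
                    cost1, size1, plan1 = best[s1]
                    cost2, size2, plan2 = best[s2]
                    size = size1 * size2 * sel
                    cost = cost1 + cost2 + size
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, size, f"({plan1} JOIN {plan2})")
            s1 = (s1 - 1) & s
    return best.get((1 << n) - 1)


# Chain query R0 - R1 - R2 with made-up cardinalities and selectivities.
cards = [1000, 10, 100]
selectivity = {(0, 1): 0.1, (1, 2): 0.01}
cost, size, plan = dpsub(cards, selectivity)
print(plan, cost, size)
```

With these made-up numbers the enumeration prefers joining the two small tables first, because that keeps the intermediate result small before the large table is added.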
  • 16. Endpoint Deployment Through Apache Livy ● Developed a SPARQL endpoint based on Apache Livy for communication with HopsYARN ○ Can be deployed using Docker ● A Spark session is initiated on startup ○ Apache Sedona jars are added to the Spark jars ○ Spatial UDFs are registered in the Spark engine through specialized requests ● The endpoint has been deployed in the CREODIAS installation of Hopsworks
  • 17. Deployment At CREODIAS ● Strabo2 has been installed in CREODIAS with the following datasets: ● For the polar use case ○ Ice Charts dataset ○ Ship positions dataset ○ GADM Norway and GADM North ● For the food security use case ○ Crop type maps ○ Hydro River Network EU (Danube, Rhine, Elbe) ○ Irrigation dataset ○ Precipitation dataset ○ Snow cover dataset ○ Interlinking result between Precipitation & GADM ● Total size ~30 GB
  • 18. Deployment At CREODIAS ● Data loading time with 8 executors (2 cores and 4 GB memory per executor) is 80 minutes
  • 19. Deployment At CREODIAS ● Queries (12 executors, 2 cores and 4 GB memory per executor): ○ Get all images that correspond to ice map observations that were obtained between 2018-03-03 and 2018-03-01 and whose observation CT class name is Close Drift Ice: ~233 million results in 2 minutes ○ Get all observations at less than 5 km distance from POLYGON ((0.0 0.0, 90.0 0.0, 90.0 77.94970848221368, 0.0 77.94970848221368, 0.0 0.0)): 35k results in 47 seconds ○ Get regions affected by precipitation in "Quarter 2 of 2021" that was "lower than 15%" of the normal rainfall and that are "equipped with irrigation": 12k results in 38 seconds ○ Get regions that showed a negative trend in precipitation in Q2 but a positive trend in Q3 (of e.g. year 2018): 64 results in 23 seconds
  • 20. Experiments with synthetic datasets ● Geographica 2 synthetic dataset ○ For each dataset, a minimal ontology that follows a general version of the schema of OSM is used. ○ 36 queries of spatial selections and spatial joins with different selectivities (intersects, within, touches). ○ We have successfully executed the query set in hops.site for a dataset size of up to 2.35 billion triples (0.5 TB in plain text, scale factor 12228). ■ Average execution time: 98 seconds using 128 workers, 271 seconds using 64 workers
  • 23. Ongoing Work ● Perform more experiments ○ With more datasets and queries from the use cases ○ Scalability experiments with the synthetic dataset of Geographica 3 ○ Evaluate specific aspects of the system and identify possible shortcomings ■ steps for further improvement
  • 24. Geographica 3-CL Distributed Synthetic Generator ● Changed the serial processing logic to run in a distributed fashion, allowing horizontal scalability ● Optimized data structures to minimize the memory footprint on the driver and on each executor, as well as network communication ● To minimize the storage footprint, the Parquet+Snappy file format was added to the output format options ● To increase parallelism for distributed consumers of the dataset (Strabo2 loaders), the optimal number of partitions can be provided ● User-provided dynamic spatial selectivities for query sets allow better steering of query loads towards testing the scalability of spatial behaviour
  • 25. Geographica 3-CL DistSynthGen: Number of Triples per Feature Class (TEXT / PARQUET)
      Feature class | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768
      HEXAGON_LARGE (states) | 202.524 | 814.403 | 3.256.764 | 13.044.367 | 52.173.916 | 208.764.915 | 835.045.124
      HEXAGON_LARGE_CENTER (state centers) | 202.524 | 814.403 | 3.256.764 | 13.044.367 | 52.173.916 | 208.764.915 | 835.045.124
      HEXAGON_SMALL (land ownerships) | 1.837.056 | 7.344.128 | 29.368.320 | 117.456.896 | 469.794.816 | 1.879.113.728 | 7.516.323.840
      LINESTRING (roads) | 3.588 | 7.172 | 14.340 | 28.676 | 57.348 | 114.692 | 229.380
      POINT (points of interest) | 1.837.056 | 7.344.128 | 29.368.320 | 117.456.896 | 469.794.816 | 1.879.113.728 | 7.516.323.840
      TOTAL | 4.082.748 | 16.324.234 | 65.264.508 | 261.031.202 | 1.043.994.812 | 4.175.871.978 | 16.702.967.308
  • 26. Geographica 3-CL DistSynthGen: Text format Storage Scaling
      Feature class (TEXT) | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768
      HEXAGON_LARGE (states) | 37.3M | 150.5M | 604.8M | 2.4G | 9.5G | 38.1G | 153.1G
      HEXAGON_LARGE_CENTER (state centers) | 33.9M | 136.6M | 549.0M | 2.2G | 8.6G | 34.6G | 139.1G
      HEXAGON_SMALL (land ownerships) | 374.7M | 1.5G | 5.9G | 23.7G | 94.6G | 379.5G | 1.5T
      LINESTRING (roads) | 6.4M | 24.4M | 95.4M | 377.5M | 1.5G | 5.8G | 23.4G
      POINT (points of interest) | 326.2M | 1.3G | 5.1G | 20.6G | 82.5G | 331.0G | 1.3T
      TOTAL | 778.5M | 3.0G (x3.95) | 12.2G (x4.06) | 49.2G (x4.03) | 196.7G (x4.00) | 789.1G (x4.01) | 3.1T (x4.02)
  • 27. Geographica 3-CL DistSynthGen: Parquet format Storage Scaling
      Feature class (PARQUET) | N=512 (baseline) | 1024 | 2048 | 4096 | 8192 | 16384 | 32768
      HEXAGON_LARGE (states) | 4.7M | 20.5M | 86.5M | 350.8M | 1.4G | 5.6G | 23.0G
      HEXAGON_LARGE_CENTER (state centers) | 3.4M | 14.5M | 59.4M | 241.0M | 978.8M | 3.9G | 15.8G
      HEXAGON_SMALL (land ownerships) | 46.9M | 198.3M | 835.0M | 3.3G | 13.3G | 53.8G | 218.8G
      LINESTRING (roads) | 2.3M | 12.4M | 60.1M | 237.1M | 920.9M | 3.5G | 14.3G
      POINT (points of interest) | 36.9M | 149.6M | 611.4M | 2.4G | 9.8G | 39.6G | 160.8G
      TOTAL | 94.3M | 395.2M (x4.19) | 1.6G (x4.14) | 6.5G (x4.06) | 26.3G (x4.04) | 106.4G (x4.04) | 432.6G (x4.06)
  • 28. Geographica 3-CL DistSynthGen: Time Scaling
      Format | 2048 | 4096 | 8192 | 16384
      Text | 1m 12s | 3m 24s (x2.83) | 13m 00s (x3.82) | 51m 00s (x3.92)
      Parquet | 1m 24s | 4m 06s (x2.92) | 16m 00s (15m 00s 12prt) (x3.65) | 66m 00s (60m 00s 12prt) (x4.00)
      PolarTEP (Medium Conf) - Driver (4GB, 1vCore), 4xExecutor (4GB, 1vCore)
      * Spark Job Total Uptime
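The scaling factors in parentheses on slide 28 are ratios between consecutive run times. The short sketch below recomputes them for the Text row; the helper names `to_seconds` and `scaling_factors` are hypothetical, and the durations are copied from the slide.

```python
def to_seconds(duration):
    """Convert a string such as '3m 24s' to a number of seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def scaling_factors(durations):
    """Ratio of each run time to the previous one, rounded to two decimals."""
    secs = [to_seconds(d) for d in durations]
    return [round(later / earlier, 2) for earlier, later in zip(secs, secs[1:])]


# Text-format generation times for N = 2048, 4096, 8192, 16384 (from slide 28).
text_times = ["1m 12s", "3m 24s", "13m 00s", "51m 00s"]
factors = scaling_factors(text_times)
print(factors)  # [2.83, 3.82, 3.92]
```

Each doubling of N roughly quadruples the number of triples (slide 25), so factors approaching 4 indicate generation time growing about linearly with output size.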