Comparing Geospatial Implementation in MongoDB, Postgres, and Elastic

Percona Live Online
12-13 May 2021

Antonios Giannopoulos, Senior Database Administrator
Pedro Albuquerque, Staff Database Engineer
Alex Cercel, Principal Database Engineer

Agenda
● Definitions
● Proximity search
● Proximity search with filters
● Proximity search with ordering
● Area search
● Best practices
● Benchmark

Dataset
We modified the NY restaurants dataset (https://bit.ly/3xwdNU8)
● Name
● Location
● Area
● Price range*
● Cuisines*
● Rating*
● Amenities*
*Randomly generated

MongoDB - GeoJSON
● Supports GeoJSON and legacy coordinate pairs [<lon>,<lat>]
● Point
● LineString
● Polygon
● MultiPoint
● MultiLineString
● MultiPolygon
● GeometryCollection

MongoDB - Indexes
● Supports 2d and 2dsphere indexes
● Version 2
● Version 3 (MongoDB 3.2)
● Sparse by default
● Must hold geometry data
● Supports compound indexes
● Can’t be used as a shard key

MongoDB - Proximity query
● Give me the points of interest near me
● $geoWithin
○ $box*
○ $polygon*
○ $center*
○ $centerSphere
● Doesn’t require a 2dsphere index
● Results don’t come in proximity order
● Limit results
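As a minimal sketch of the $geoWithin variant above, the filter can be built as a plain Python dictionary. The field name `location` and the helper name are assumptions made for illustration; `$centerSphere` takes its radius in radians, so the distance in meters is divided by the Earth's equatorial radius.

```python
EARTH_RADIUS_M = 6378100.0  # equatorial radius in meters, used to convert meters to radians


def geo_within_circle(field, lon, lat, radius_m):
    """Build a $geoWithin/$centerSphere filter (hypothetical helper, not a driver API)."""
    return {
        field: {
            "$geoWithin": {
                # $centerSphere expects [[lon, lat], radius_in_radians]
                "$centerSphere": [[lon, lat], radius_m / EARTH_RADIUS_M]
            }
        }
    }


# 100 m around the reference point used elsewhere in this deck
query = geo_within_circle("location", -73.9855, 40.7580, 100)
print(query)
```

With a driver such as pymongo, a filter like this would typically be passed to `find()` together with a limit, since the results are not returned in proximity order.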

MongoDB - Proximity query
● $nearSphere
○ Point
○ $minDistance
○ $maxDistance
● Requires a 2dsphere Index
● Results ordered by distance
● Limit works differently
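A comparable sketch for $nearSphere, again as a plain Python dictionary. The field name `location` and the helper name are assumptions; with a GeoJSON point, `$minDistance` and `$maxDistance` are given in meters.

```python
def near_sphere(field, lon, lat, max_m, min_m=0):
    """Build a $nearSphere filter around a GeoJSON point (hypothetical helper)."""
    return {
        field: {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$minDistance": min_m,  # meters
                "$maxDistance": max_m,  # meters
            }
        }
    }


query = near_sphere("location", -73.9855, 40.7580, max_m=100)
print(query)
```

Because the server returns matches nearest first, a limit applied to this query keeps the closest documents rather than an arbitrary subset.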

MongoDB - Proximity with filters
● Give me specific points of interest near me
● Compound indexes
● Both $geoWithin and $nearSphere support filters
● Index order matters
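The two key orders for a compound index could be written down as follows. This is only a sketch: the field names `cuisines` and `location` are assumptions based on the dataset description, and which order performs better depends on the data and the queries.

```python
GEOSPHERE = "2dsphere"  # index type string; pymongo exposes the same value as pymongo.GEOSPHERE

# Two candidate compound index specifications with different key order.
INDEX_FILTER_FIRST = [("cuisines", 1), ("location", GEOSPHERE)]
INDEX_GEO_FIRST = [("location", GEOSPHERE), ("cuisines", 1)]


def nearby_with_cuisine(lon, lat, max_m, cuisine):
    """Combine an equality filter with a $nearSphere condition (hypothetical helper)."""
    return {
        "cuisines": cuisine,
        "location": {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_m,  # meters
            }
        },
    }


query = nearby_with_cuisine(-73.9855, 40.7580, 100, "Japanese")
print(query)
```

With pymongo, either specification could be passed to `create_index()`, and comparing the explain output for each order is one way to see how many keys are examined.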

MongoDB - Ordered proximity
● Give me nearest points of interest ordered by criteria
● $geoWithin (natural order)
● $nearSphere orders by distance
● Both accept $sort criteria

● A little trick
● Results come ordered
● But… more keys to access
VS

● $nearSphere
● Results come ordered by distance
● The “trick” doesn’t work

MongoDB - Aggregation
● $geoNear adds extra functionalities
● distanceField
● min/maxDistance
● query
● key
● Must be the first stage of the pipeline
● Requires a geospatial index

MongoDB - Area search
● Which area does the point belong to?
● $geoIntersects
● Area definitions
● Usually polygons

MongoDB - Moving Points
● Accuracy vs Speed
○ Accuracy requires higher write throughput
○ Speed pushes the changes on regular intervals
● Scale the writes with sharding
● Pick a random(ish) shard key
● Update the active records only (client)

MongoDB - Best Practices
● Always have a geospatial index in place
● You may need different variations of the geospatial index
● $hint as much as possible
● $limit is your friend
● Control the document size (both search and sort)
● Use $geoWithin for ordered results
● Use metadata to avoid $geoIntersects
● Scale with additional secondaries and use tags
● Scale with sharding (divide and conquer vs targeted operations)
● Know your queries (random queries can hurt performance)

PostgreSQL - PostGIS
● Spatial database extension for PostgreSQL
● Extra data types
○ geometry
○ geography
● Additional functions and operators
● Raster map algebra
● Spatial reprojection SQL callable functions for both vector and raster data
● Import/export support of shapefiles

PostGIS - Data types
Geometry:
● Older data type
● Cartesian plane
● More support from third party tools
● Operations on it are generally faster
● Preferred when a lot of spatial processing is needed
Geography:
● Newer data type
● Points on the earth’s surface (latitude/longitude)
● Supports long range distance measurements
● Slower than geometry
● More accurate results

PostGIS - Geometric objects
Supports:
● POINT
● LINESTRING
● POLYGON
● MULTIPOINT
● MULTILINESTRING
● MULTIPOLYGON
● GEOMETRYCOLLECTION
● CURVES
● POLYHEDRALSURFACE

PostGIS - Spatial Indexes
● Used on spatial datasets
● Multi-dimensional
● GiST (Generalized Search Tree)
● R-tree index implementation
● Clustering on GiST indexes
Image: Object Trajectory Analysis in Video Indexing and Retrieval Applications (Mattia Broilo, Nicola Piotto, G. Boato, Nicola Conci, April 2010)

PostgreSQL - Proximity query
# EXPLAIN ANALYZE SELECT name FROM restaurants_geography WHERE ST_DWithin(location, ST_GeogFromText('POINT(-73.9855 40.7580)'),100);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using geography_location on restaurants_geography (cost=0.40..33.42 rows=3 width=17) (actual time=0.734..1.736 rows=31 loops=1)
Index Cond: (location && _st_expand('0101000020E6100000508D976E127F52C01B2FDD2406614440'::geography, '100'::double precision))
Filter: st_dwithin(location, '0101000020E6100000508D976E127F52C01B2FDD2406614440'::geography, '100'::double precision, true)
Rows Removed by Filter: 9
Planning Time: 0.212 ms
Execution Time: 1.858 ms
● Always have a spatial index in place
● ST_DWithin finds locations within a given distance
○ Geography: distance in meters
○ Geometry: distance in units defined by the SRID (e.g. degrees)

# EXPLAIN ANALYZE SELECT name FROM restaurants_geography WHERE
ST_DWithin(location, ST_GeogFromText('POINT(-73.9855 40.7580)'),1000);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on restaurants_geography (cost=4.43..119.10 rows=3 width=17) (actual time=1.924..18.900 rows=1782 loops=1)
Heap Blocks: exact=303
-> Bitmap Index Scan on geography_location (cost=0.00..4.43 rows=4 width=0) (actual time=1.200..1.202 rows=2547 loops=1)
Index Cond: (
location && _st_expand('0101000020E6100000508D976E127F52C01B2FDD2406614440'::geography, '1000'::double precision))
● && operator
● ST_DWithin(g1, g2, distance) translates into:
○ g1 && ST_Expand(g2, distance) AND ST_Distance(g1, g2) < distance
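The two-step evaluation (a cheap bounding-box test followed by an exact distance check) can be illustrated outside the database. The sketch below is not PostGIS code: it uses the haversine formula on a sphere, assumes the search point is away from the poles and the antimeridian, and all function names and sample points are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; an assumption of this sketch


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def expand(lon, lat, dist_m):
    """Bounding box (min_lon, min_lat, max_lon, max_lat) that contains the search circle."""
    r = dist_m / EARTH_RADIUS_M
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return lon - dlon, lat - dlat, lon + dlon, lat + dlat


def dwithin(points, lon, lat, dist_m):
    """Step 1: bounding-box overlap (index-friendly). Step 2: exact distance filter."""
    min_lon, min_lat, max_lon, max_lat = expand(lon, lat, dist_m)
    candidates = [
        (name, x, y) for name, x, y in points
        if min_lon <= x <= max_lon and min_lat <= y <= max_lat
    ]
    return [name for name, x, y in candidates if haversine_m(lon, lat, x, y) < dist_m]


points = [
    ("near", -73.9855, 40.7585),     # well inside the circle
    ("corner", -73.9844, 40.75885),  # inside the box, outside the circle
    ("far", -73.9855, 40.7680),      # outside the box
]
print(dwithin(points, -73.9855, 40.7580, 100))
```

The "corner" point plays the role of the rows counted under "Rows Removed by Filter" in the plan: it passes the bounding-box condition but fails the exact distance check.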

PostgreSQL - Proximity query with ordered results
# SELECT name, ST_Distance(location, ref_geog) AS distance FROM restaurants_geography CROSS JOIN (SELECT ST_GeogFromText('POINT(-73.9855 40.7580)') AS ref_geog)
AS r WHERE ST_DWithin(location, ref_geog, 100) ORDER BY ST_Distance(location, ref_geog) limit 15;
name | distance
-----------------------------------------+-------------
Cbre-1540 | 40.39000116
Buca Di Beppo | 40.39000116
Planet Hollywood | 40.39000116
Minskoff Theater | 46.50344181
Best Buy Theater | 48.41508544
Refresh Cafe | 48.41508544
Viacom Cafeteria | 48.41508544
Viacom Executive Dining Room | 48.41508544
Junior"S Restaurant | 48.41508544
Starbucks Coffee | 68.38420071
Nuchas | 79.01362202
Bond 45 Italian Kitchen Steak & Seafood | 83.16301778
Cookie Party(@Toy ""R"" Us) | 88.45480111
Scoops R Us | 88.45480111
Lyceum Theatre | 88.93144242
# CLUSTER geography_location ON restaurants_geography;
CLUSTER

PostgreSQL - Proximity with filters
● Compound indexes
● Bitmap Index Scan
● btree_gist extension
# CREATE INDEX geography_location_cuisines on restaurants_geography USING GIST(location, cuisines);
ERROR: syntax error at or near "USING"
LINE 1: CREATE INDEX geography_location_cuisines USING GIST(location…
percona=# CREATE EXTENSION btree_gist;
percona=# CREATE INDEX geography_location_cuisines on restaurants_geography USING GIST(location, cuisines);
percona=# SELECT tablename, indexname, indexdef FROM pg_indexes WHERE indexname = 'geography_location_cuisines' ORDER BY
tablename, indexname;
tablename | indexname | indexdef
-----------------------+-----------------------------+-------------------------------------------------------------------
---------------------------------------
restaurants_geography | geography_location_cuisines | CREATE INDEX geography_location_cuisines ON
public.restaurants_geography USING gist (location, cuisines)

PostgreSQL - Proximity with filters
GiST INDEX ON location
EXPLAIN ANALYZE SELECT name FROM restaurants_geography WHERE ST_DWithin(location, ST_GeogFromText('POINT(-73.9855 40.7580)'),100) and cuisines = 'Japanese';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------
Index Scan using geog_location on restaurants_geography (cost=0.40..33.42 rows=1 width=17) (actual time=0.794..1.261 rows=5 loops=1)
Index Cond: (location && _st_expand('0101000020E6100000508D976E127F52C01B2FDD2406614440'::geography, '100'::double precision))
Filter: (((cuisines)::text = 'Japanese'::text) AND st_dwithin(location, '0101000020E6100000508D976E127F52C01B2FDD2406614440'::geography, '100'::double
precision, true))
GiST INDEX ON location, cuisines
EXPLAIN ANALYZE SELECT name FROM restaurants_geography WHERE ST_DWithin(location, ST_GeogFromText('POINT(-73.9855 40.7580)'),100) and cuisines = 'Japanese';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------
Index Scan using geog_location_cuisines on restaurants_geography (cost=0.40..33.42 rows=1 width=17) (actual time=0.741..1.065 rows=5 loops=1)
Index Cond: ((location && _st_expand('0101000020E6100000508D976E127F52C01B2FDD2406614440'::geography, '100'::double precision)) AND ((cuisines)::text =
'Japanese'::text))

Elasticsearch - Geo Field Types:
● geo_point - fields that hold latitude/longitude pairs
● geo_shape - more advanced fields that support points, lines, circles, polygons, multi-polygons

Elasticsearch - Geo Field Types:
● Make sure you define the mappings before indexing as dynamic mappings will not do a good job. When we’ve indexed the dataset in Elastic, we ended up with “float” instead of “geo_point”
PUT /restaurants1
{
  "mappings": {
    "properties": {
      "loc": {
        "type": "geo_point"
      }
    }
  }
}

Elasticsearch - B(lock)KD Tree:
● Since Lucene 6, the geospatial implementation uses a form of KD tree called a BKD tree. A BKD tree is a collection of multiple KD trees. A KD tree works by repeatedly splitting a plane into two sub-planes.
Figure: example KD tree over the points A(5,4), B(3,2), C(9,5), D(6,4), E(3,5), F(8,4), with tree levels splitting alternately on the X and Y axes.
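To make the splitting idea concrete, here is a minimal in-memory KD tree in Python built from the six points above by simple insertion, with the split axis alternating between X and Y by depth. It is a teaching sketch, not Lucene's BKD implementation, and the resulting layout is not claimed to match the figure on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    point: Tuple[float, float]
    axis: int  # 0 = split on X, 1 = split on Y
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, name, point, depth=0):
    """Insert a point; smaller values on the split axis go left, the rest go right."""
    if node is None:
        return Node(name, point, depth % 2)
    if point[node.axis] < node.point[node.axis]:
        node.left = insert(node.left, name, point, depth + 1)
    else:
        node.right = insert(node.right, name, point, depth + 1)
    return node


def range_search(node, lo, hi, found=None) -> List[str]:
    """Collect names of points inside the box [lo, hi], pruning sub-planes that cannot match."""
    if found is None:
        found = []
    if node is None:
        return found
    x, y = node.point
    if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
        found.append(node.name)
    split = node.point[node.axis]
    if lo[node.axis] < split:    # the box reaches into the "smaller" sub-plane
        range_search(node.left, lo, hi, found)
    if hi[node.axis] >= split:   # the box reaches into the "greater or equal" sub-plane
        range_search(node.right, lo, hi, found)
    return found


root = None
for name, pt in [("A", (5, 4)), ("B", (3, 2)), ("C", (9, 5)),
                 ("D", (6, 4)), ("E", (3, 5)), ("F", (8, 4))]:
    root = insert(root, name, pt)

print(sorted(range_search(root, (5, 3), (9, 4))))
```

The pruning in `range_search` is the part that matters for geo queries: whole sub-planes are skipped when the search box cannot overlap them.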

Elasticsearch - Geo Queries:
● geo_bounding_box query.
● geo_distance query.
● geo_polygon query. *Deprecated in 7.12*
● geo_shape query.

Elasticsearch - Proximity query:
- All common filters will be cached
- The distance can be specified in many units; it defaults to meters
- By default, only the top 10 results are displayed, although we had 31 matches in this case
- The response reports how many shards were hit (only 1 shard in this setup)
- “hits.total.value” = number of matches
- It took 42ms initially, then 5-6ms with caching
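A request body along the lines described above might look like the following, written here as a Python dictionary. The field name `loc` comes from the mapping shown earlier; the helper name and the exact shape of the presenters' query are assumptions.

```python
import json


def proximity_query(lon, lat, distance="100m", field="loc"):
    """Build a match_all search restricted by a geo_distance filter (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "filter": {
                    "geo_distance": {
                        "distance": distance,             # unit suffix is optional, meters by default
                        field: {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


body = proximity_query(-73.9855, 40.7580)
print(json.dumps(body, indent=2))
```

A body like this would be sent to the index's `_search` endpoint; adding a `size` value raises the number of hits returned beyond the default of 10.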

Elasticsearch - Proximity with ﬁlters
- We’re no longer interested in match_all but in documents with the term Japanese
- The filter remained, of course, the same
- From 31, we now have 5 hits
- This took 14ms initially, down from 42ms, because we are limiting the number of documents that need to be returned

Elasticsearch - Ordered proximity
- I only sorted by price here, in ascending order (asc)
- Can also sort by _geo_distance to add additional sorting
- From my experiments, I didn’t see a noticeable difference in speed whether I sorted or not
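A sketch of a filtered and sorted search body as a Python dictionary. The field names `cuisines` and `price` are assumptions (the deck does not show the mapping for them), and a `match` clause stands in for the term lookup; `loc` is the geo_point field from the mapping shown earlier.

```python
def ordered_proximity_query(lon, lat, cuisine, distance="100m"):
    """Filter by cuisine and distance, sort by price, then by distance (hypothetical helper)."""
    origin = {"lat": lat, "lon": lon}
    return {
        "query": {
            "bool": {
                "must": {"match": {"cuisines": cuisine}},
                "filter": {"geo_distance": {"distance": distance, "loc": origin}},
            }
        },
        "sort": [
            {"price": "asc"},
            # secondary sort: nearest first, distance reported in meters
            {"_geo_distance": {"loc": origin, "order": "asc", "unit": "m"}},
        ],
    }


body = ordered_proximity_query(-73.9855, 40.7580, "Japanese")
print(body["sort"])
```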

Elasticsearch - Area search
● Which area does the point belong to?
- Used geo_polygon to draw the area
- Used _source:false to not retrieve additional info about the documents
- Used collapse to only receive one value per hit
- We had 10 hits, which means we had 10 documents in that polygon, but since we collapsed the area to unique values, we got only one unique term
- I cheated: I used the boundaries of that neighbourhood

Elasticsearch - GeoDistance agg
● Group my search per different ranges
- Based on the origin, the ranges defined in meters are the buckets where we’re searching for restaurants
- We know from previous examples that within 100m we have 31 restaurants, but now we have more insight into how many restaurants lie beyond that. Seems like we have more options
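A sketch of a geo_distance aggregation body as a Python dictionary. The aggregation name `rings` and the range boundaries are made up for illustration; `loc` is the geo_point field from the mapping shown earlier.

```python
def distance_rings(lon, lat, boundaries_m=(100, 300, 1000), field="loc"):
    """Build a geo_distance aggregation with one bucket per ring (hypothetical helper)."""
    ranges = [{"to": boundaries_m[0]}]
    for lower, upper in zip(boundaries_m, boundaries_m[1:]):
        ranges.append({"from": lower, "to": upper})
    ranges.append({"from": boundaries_m[-1]})  # everything beyond the last boundary
    return {
        "size": 0,  # only the bucket counts are needed, not the documents
        "aggs": {
            "rings": {
                "geo_distance": {
                    "field": field,
                    "origin": {"lat": lat, "lon": lon},
                    "unit": "m",
                    "ranges": ranges,
                }
            }
        },
    }


body = distance_rings(-73.9855, 40.7580)
print(body["aggs"]["rings"]["geo_distance"]["ranges"])
```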

Elasticsearch - Geo Aggregation
● Elasticsearch allows a hefty amount of options for aggregating data:
○ Bucket aggregations
■ Geodistance, Geohash & Geotile grid aggregations
○ Metrics aggregations
■ Geobounds, Geocentroid & Geoline (useful for maps) aggregations

Closing remarks/Thoughts
● Data structures used by Postgres and ES are more suitable for heavy geo workloads than MongoDB’s
● All three databases support a rich command set. PostGIS looks to have the richest command set
● ES works out of the box, MongoDB needs indexes to be deployed, and Postgres requires the extension to be installed
● All three provide various scaling mechanisms for geospatial workloads
● If we had to choose one… it would be...
