Algorithms for Spatial Joins
and Spatial Query Processing
and Optimization
-Natasha Mandal
Applications of Spatial Queries
O Spatial Database Systems
O Geographical Information Systems
O Urban Planning
O CAD/CAM systems
O Image Databases
[Figure: example spatial queries - point query, nearest neighbor query, range query, map overlay]
Goals
O Understand more about Query Processing in
SDBMS
O Learn more about Spatial Operations in SDBMS
O Learn about Optimization in SDBMS
What is Query Processing?
Why Optimize?
O Queries are expressed in a high-level declarative
language such as SQL.
O The database software is supposed to map the
query into a sequence of operations supported by
spatial indexes and storage structures.
O Goals:
 Process a query accurately
 Do this in the minimum amount of time possible
What is Query Processing?
Why Optimize?
O Relational queries are composed from a basic set of relational operators.
O Query processing and optimization are divided into
two steps:
 Design and fine-tune algorithms for each of the
basic relational operators.
 Map high-level queries into a composition of these
basic relational operators and optimize (using
information in the first step).
Challenges in Spatial Databases
 Unlike relational databases, spatial databases
have no fixed set of operators that serve as
building blocks for query evaluation (e.g., Overlap
and Intersect may return similar results).
 Spatial databases hold large volumes of complex
objects (with spatial extent) which have no natural
one-dimensional sort order.
 The assumption that I/O costs dominate CPU
costs is no longer valid since computationally
expensive algorithms are used to test for spatial
predicates.
Spatial Operations
O Spatial Operations can be classified into four
groups:
 Update - Modify, Create etc.
 Selection –
o Point Query (𝑃𝑄): Given a query point 𝑝, find all spatial
objects 𝑂 that contain it:
$PQ(p) = \{\, O \mid p \in O.G \,\}$
where 𝑂. 𝐺 is the geometry of the object 𝑂.
Ex. “Find all river flood-plains which contain the CITY” [CITY
is assumed to be a point type]
o Range Query (𝑅𝑄): Given a query polygon 𝑃, find all spatial
objects 𝑂 which intersect 𝑃. [If 𝑃 is a rectangle, 𝑅𝑄 is a
window query]
$RQ(P) = \{\, O \mid O.G \cap P.G \neq \emptyset \,\}$
Ex. “Get all forests which overlap with the flood plain of the
River Nile”
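The two selection queries above can be sketched in a few lines of Python. This is a minimal illustration, assuming every geometry O.G is an axis-aligned rectangle; the `Rect` class, the function names and the sample data are hypothetical and not part of the original slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle standing in for an object's geometry O.G (assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains_point(self, x, y):
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

    def intersects(self, other):
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def point_query(objects, x, y):
    """PQ(p): names of all objects whose geometry contains the point (x, y)."""
    return [name for name, g in objects.items() if g.contains_point(x, y)]


def range_query(objects, window):
    """RQ(P): names of all objects whose geometry intersects the query window."""
    return [name for name, g in objects.items() if g.intersects(window)]


# Hypothetical flood plains, each reduced to a rectangle.
objects = {
    "plain_a": Rect(0, 0, 4, 4),
    "plain_b": Rect(3, 3, 8, 8),
    "plain_c": Rect(10, 10, 12, 12),
}

print(point_query(objects, 3.5, 3.5))
print(range_query(objects, Rect(5, 5, 11, 11)))
```

Both functions scan every object, which corresponds to the brute-force case with no index; an index would only change how the candidates are found, not the predicate.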
Spatial Operations
 Spatial Join – Two tables 𝑅 and 𝑆 are joined on a
spatial predicate 𝜃 over their geometry attributes.
Map Overlay is an important variant of
Spatial Join.
$R \bowtie_{\theta} S = \{\, (o_1, o_2) \mid o_1 \in R,\ o_2 \in S,\ \theta(o_1.G, o_2.G) \,\}$
Some example 𝜃 predicates are intersect, contains,
is enclosed by, distance, northwest, adjacent,
meets, overlap etc.
Spatial Operations
Ex. “Find all forest stands and flood plains which
overlap”
SELECT FS.name, FP.name
FROM Forest_Stand FS, Flood_Plain FP
WHERE overlap(FS.G, FP.G)
 Spatial Aggregate – These are usually variants of
the nearest neighbor search.
$NNQ(o') = \{\, o \mid \forall o'' : dist(o'.G, o.G) \leq dist(o'.G, o''.G) \,\}$
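The nearest neighbor definition above translates directly into a brute-force scan. The sketch below assumes that every geometry is a 2-D point and that Euclidean distance is used; the function name is hypothetical.

```python
import math


def nearest_neighbors(query, points):
    """Return every point whose distance to `query` is minimal (ties are kept)."""
    if not points:
        return []
    best = min(math.dist(query, p) for p in points)
    return [p for p in points if math.isclose(math.dist(query, p), best)]


print(nearest_neighbors((0, 0), [(1, 0), (0, 1), (2, 2)]))
```

The result is a set rather than a single object because the definition admits ties.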
Two-Step Query Processing of
Object Operations
O Filter Step: Spatial objects are represented by
simpler approximations such as the minimum bounding
rectangle (MBR), and the exact predicate may be
replaced by a cheaper one. The filter may return
false positives, but no tuple of the final answer
(computed on exact geometry) may be eliminated in
the filter step.
E.g., touch(River.Flood-Plain, :CITY) may be
replaced by overlap(MBR(River.Flood-Plain),
MBR(:CITY))
Two-Step Query Processing of
Object Operations
 Refinement Step: The exact geometry of each
element from the candidate set and the exact
predicate are examined. This may require a CPU
intensive application and may be processed
outside the spatial database (in a GIS).
Filtering – MBRs
Geometric Filter (Approximations) – Convex Hull,
Minimum Enclosing Circle etc.
Exact Geometry – Plane Sweep etc.
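The filter and refinement steps can be illustrated with a small sketch. It assumes, purely for illustration, that the exact geometries are circles given as (cx, cy, r) and that the exact predicate is "the two circles intersect"; the helper names and data are hypothetical.

```python
import math


def mbr(circle):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle (cx, cy, r)."""
    cx, cy, r = circle
    return (cx - r, cy - r, cx + r, cy + r)


def mbr_overlap(a, b):
    """Cheap filter predicate on two rectangles."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circles_intersect(a, b):
    """Exact predicate used in the refinement step."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]


def filter_refine(query, objects):
    # Filter step: keeps every true answer, but may admit false positives.
    candidates = [o for o in objects if mbr_overlap(mbr(query), mbr(o))]
    # Refinement step: exact test, applied to the candidates only.
    answers = [o for o in candidates if circles_intersect(query, o)]
    return candidates, answers


query = (0.0, 0.0, 1.0)
objects = [(1.5, 0.0, 1.0), (1.8, 1.8, 1.0), (5.0, 5.0, 1.0)]
candidates, answers = filter_refine(query, objects)
print(len(candidates), len(answers))
```

The circle at (1.8, 1.8) survives the filter because its MBR overlaps the query MBR, and is then discarded by the exact test; this is the kind of false positive the refinement step exists to remove.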
Techniques for Spatial Selection
O What are the alternative ways of processing a
query? It depends on how the file containing the
relations being queried is organized.
 Unsorted Data and No Index – Use brute force to
scan the whole file and test each record for the
predicate.
 Spatial Indexing – Can be used to access geometric
data. The MBRs of spatial attributes of a relation
can be indexed.
 Space filling curves – These can be used to map
points of multidimensional space into one
dimensional space. A B-Tree index can be imposed
on ordered entries to enhance the search.
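One common space-filling curve is the Z-order (Morton) curve, which interleaves the bits of the coordinates. The sketch below assumes non-negative integer grid coordinates; the function name is hypothetical, and a sorted Python list stands in for the B-tree that would hold the ordered entries.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


points = [(1, 1), (0, 0), (2, 0), (0, 1)]
# Sorting by z-value gives the one-dimensional order an index could be built on.
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)
```

Points that are close in the plane tend to receive close z-values, although the curve has jumps, so a range query usually maps to several key intervals rather than one.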
General Spatial Selection
O A selection condition can be a combination of
several “primitive” selection conditions.
O For spatial selections, the order in which the
individual conditions of a selection in conjunctive
normal form (CNF) are processed is important,
because different spatial conditions
have different processing costs.
O Predicates can be applied in ascending order of
𝑅𝑎𝑛𝑘.
$Rank = \dfrac{selectivity - 1}{differential\ cost}$
General Spatial Selection
$selectivity(p) = \dfrac{cardinality(output(p))}{cardinality(input(p))}$
𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑡𝑖𝑎𝑙 𝑐𝑜𝑠𝑡 = per-tuple cost of a predicate. It
remains constant throughout the life of the function
and can be stored in the system catalog (along with
selectivity).
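A short sketch of ordering predicates by rank follows. The predicate names and the selectivity and cost figures are hypothetical values chosen for illustration, not measurements.

```python
def rank(selectivity, differential_cost):
    """Rank = (selectivity - 1) / differential cost; lower rank is applied first."""
    return (selectivity - 1.0) / differential_cost


# predicate name -> (selectivity, per-tuple cost); all numbers are made up.
predicates = {
    "exact_overlap": (0.1, 100.0),   # selective but expensive spatial test
    "name_equals":   (0.5, 1.0),     # cheap non-spatial comparison
    "mbr_overlap":   (0.2, 10.0),    # moderately cheap approximation test
}

ordered = sorted(predicates, key=lambda name: rank(*predicates[name]))
print(ordered)
```

With these numbers the cheap comparison runs first and the expensive exact test runs last, even though the exact test is the most selective: its cost per tuple outweighs its filtering power.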
Spatial Join
O Spatial Join can be an expensive operation and
the presence of indices can help in the fast
processing of queries.
Classification of spatial join methods
Both inputs are indexed:
 transformation to z-values
 spatial join index
 tree matching
One input is indexed:
 index nested loops
 seeded tree join
 build and match
 sort and match
 slot index spatial join
Neither input is indexed:
 spatial hash join
 partition based spatial merge join
 size separation spatial join
 scalable sweeping-based spatial join
The R-Tree Join
O This algorithm can be used when both the inputs
are indexed.
O It is based on the enclosure property of R-trees: if
the MBRs of two nodes do not intersect, then no
rectangles stored below them can intersect.
O RJ starts from the roots of the trees to be joined
and finds pairs of overlapping entries.
O For each such pair, the algorithm is recursively
called until the leaf levels where overlapping pairs
constitute solutions.
O The following algorithm assumes both the R-Trees
are of equal height (this can easily be extended).
The R-Tree Join
Alg. RJ(RTNode ni, RTNode nj)
for each entry ej,y ∈ nj do
{
  for each entry ei,x ∈ ni with ei,x ∩ ej,y ≠ ∅ do
  {
    if ni is a leaf node /* nj is also a leaf node */
      then Output(ei,x, ej,y);
    else /* intermediate nodes */
    {
      ReadPage(ei,x.ref); ReadPage(ej,y.ref);
      RJ(ei,x.ref, ej,y.ref);
    }
  }
} /* end for */
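The RJ pseudocode above can be sketched in Python on in-memory trees. The `Node` and `Entry` classes are hypothetical stand-ins for R-tree pages; ReadPage is replaced by following an object reference, and both trees are assumed to have equal height, as in the pseudocode.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    mbr: tuple                       # (xmin, ymin, xmax, ymax)
    child: Optional["Node"] = None   # set for intermediate entries
    obj_id: Optional[str] = None     # set for leaf entries


@dataclass
class Node:
    is_leaf: bool
    entries: list = field(default_factory=list)


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def rtree_join(ni, nj, out):
    """Append to `out` all pairs of leaf entries whose MBRs intersect."""
    for ej in nj.entries:
        for ei in ni.entries:
            if not intersects(ei.mbr, ej.mbr):
                continue             # enclosure property: prune both subtrees
            if ni.is_leaf:           # nj is also a leaf (equal heights)
                out.append((ei.obj_id, ej.obj_id))
            else:
                rtree_join(ei.child, ej.child, out)
    return out


# Two hypothetical trees of height 2.
r_root = Node(False, [
    Entry((0, 0, 3, 3), Node(True, [Entry((0, 0, 2, 2), obj_id="a"),
                                    Entry((1, 1, 3, 3), obj_id="b")])),
    Entry((10, 10, 12, 12), Node(True, [Entry((10, 10, 12, 12), obj_id="c")])),
])
s_root = Node(False, [
    Entry((1.5, 1.5, 4, 4), Node(True, [Entry((1.5, 1.5, 4, 4), obj_id="x")])),
    Entry((11, 11, 21, 21), Node(True, [Entry((11, 11, 13, 13), obj_id="y"),
                                        Entry((20, 20, 21, 21), obj_id="z")])),
])
pairs = rtree_join(r_root, s_root, [])
print(pairs)
```

The output pairs are candidates on MBRs only; in the two-step scheme they would still go through the refinement step on exact geometry.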
The R-Tree Join
 Optimizations for CPU speed:
 Search Space Restriction
 Plane Sweep – sorting in one dimension
reduces time for finding overlapping pairs
 Optimizations for I/O speed:
 Plane Sweep - consecutive computed
pairs overlap with high probability
 Breadth-first traversal that sorts the output
at each level in order to reduce the
number of page accesses.
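The plane-sweep idea mentioned above can be sketched as follows: both entry lists are sorted on their lower x-coordinate, and each rectangle is compared only with rectangles of the other list that start inside its x-range. The function name and data layout (tuples of xmin, ymin, xmax, ymax) are assumptions made for this illustration.

```python
def plane_sweep_pairs(rs, ss):
    """Return all intersecting (r, s) pairs of rectangles (xmin, ymin, xmax, ymax)."""
    rs = sorted(rs, key=lambda r: r[0])
    ss = sorted(ss, key=lambda s: s[0])
    i = j = 0
    pairs = []
    while i < len(rs) and j < len(ss):
        if rs[i][0] <= ss[j][0]:
            r, k = rs[i], j
            # Scan only the s rectangles that start within r's x-range.
            while k < len(ss) and ss[k][0] <= r[2]:
                if r[1] <= ss[k][3] and ss[k][1] <= r[3]:   # y-ranges overlap
                    pairs.append((r, ss[k]))
                k += 1
            i += 1
        else:
            s, k = ss[j], i
            # Symmetric case: s is the sweep rectangle.
            while k < len(rs) and rs[k][0] <= s[2]:
                if s[1] <= rs[k][3] and rs[k][1] <= s[3]:
                    pairs.append((rs[k], s))
                k += 1
            j += 1
    return pairs


rs = [(0, 0, 2, 2), (3, 0, 5, 2), (6, 6, 7, 7)]
ss = [(1, 1, 4, 4), (4.5, 1, 6.5, 6.5), (10, 10, 11, 11)]
print(len(plane_sweep_pairs(rs, ss)))
```

Compared with testing every pair, the sweep skips rectangles that cannot overlap in x, which is where the CPU saving comes from.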
Spatial Hash Join
O This algorithm can be used to compute
the join of two non-indexed datasets 𝑅
(build input i.e. smaller relation) and 𝑆
(probe input).
O 𝑅 is partitioned into 𝐾 buckets.
 The initial bucket extents are points chosen by
sampling 𝑅.
 Each object of 𝑅 is inserted into the bucket whose
extent has to be enlarged the least.
Spatial Hash Join
O 𝑆 is hashed into buckets with the same extent
as 𝑅's buckets
 An object is inserted into all buckets that intersect
it.
 Some objects may be assigned to multiple buckets
(replication) and some may not be inserted at all
(filtering).
O The two bucket sets are joined; each bucket from
R is matched with only one bucket from S, thus
requiring a single scan of both files.
O If for some pair neither bucket fits in memory, an
R-tree is built for one of them, and the bucket-to-
bucket join is executed in an index nested loop
fashion.
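The three phases of the spatial hash join described above (partition 𝑅, partition 𝑆 against the same extents, join matching buckets) can be sketched in memory as follows. This is a simplified illustration: the seed points are passed in explicitly instead of being sampled, objects are rectangles (xmin, ymin, xmax, ymax), buckets are Python lists rather than files, and all names are hypothetical.

```python
def area(m):
    return (m[2] - m[0]) * (m[3] - m[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_hash_join(R, S, seeds):
    # Build phase: every bucket extent starts as a point and grows with R.
    extents = [(x, y, x, y) for x, y in seeds]
    r_buckets = [[] for _ in seeds]
    for r in R:
        # Choose the bucket whose extent is enlarged the least.
        k = min(range(len(extents)),
                key=lambda i: area(union(extents[i], r)) - area(extents[i]))
        extents[k] = union(extents[k], r)
        r_buckets[k].append(r)

    # Probe partitioning: an S object goes to every non-empty bucket it
    # intersects (replication) or to none at all (filtering).
    s_buckets = [[] for _ in seeds]
    for s in S:
        for i, extent in enumerate(extents):
            if r_buckets[i] and intersects(extent, s):
                s_buckets[i].append(s)

    # Join phase: bucket i of R is matched only with bucket i of S.
    return [(r, s)
            for rb, sb in zip(r_buckets, s_buckets)
            for r in rb for s in sb if intersects(r, s)]


R = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
S = [(2.5, 2.5, 4, 4), (11, 0, 13, 1), (9, 9, 10.5, 10.5), (50, 50, 51, 51)]
result = spatial_hash_join(R, S, seeds=[(1, 1), (11, 11)])
print(result)
```

Because each object of 𝑅 lives in exactly one bucket, a matching pair is reported once even when the 𝑆 object is replicated across several buckets.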
Slot Index Spatial Join
O This algorithm is applicable when there is an
R-tree for one of the inputs (𝑅).
O If 𝐾 is the desired number of partitions, SISJ
will find the topmost level of the tree such that
the number of entries is larger than or equal
to 𝐾. These entries are then grouped into 𝐾
(possibly overlapping) partitions called slots.
 Each slot contains the MBR of the indexed R-
tree entries, along with a list of pointers to
these entries.
Slot Index Spatial Join
 SISJ starts with a single empty slot and inserts
entries into the slot that is enlarged the least.
 When the maximum capacity of a slot is reached
(determined by 𝐾 and the total number of entries),
either some entries are deleted and reinserted or
the slot is split according to the R*-tree splitting
policy.
O The second dataset 𝑆 is hashed into buckets with
the same extents as the slots.
 If an object from 𝑆 does not intersect any bucket, it
is filtered.
 If it intersects more than one bucket, it is replicated.
Slot Index Spatial Join
O The join phase
 All data from the R-tree of 𝑅 indexed by a slot
are loaded and joined with the corresponding
hash-bucket from 𝑆 using plane sweep.
 If the data to be joined does not fit in memory,
they can be joined using an algorithm which
employs external sorting and then plane
sweep.
 During the join phase of SISJ, when no data
from 𝑆 is inserted into a bucket, the sub-tree
data under the corresponding slot is not
loaded (slot filtering).
Query Optimization
O The metric used for an evaluation plan is time
required to execute the query. For spatial
databases this would include I/O and CPU costs.
O A query optimizer (a module in the database
software) generates different evaluation plans and
determines the appropriate execution strategy.
O The idea is to avoid the worst plans and choose a
good one (seldom the best one).
O The procedures of query optimizer can be divided
into two parts - 𝑙𝑜𝑔𝑖𝑐𝑎𝑙 𝑡𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛 and
𝑑𝑦𝑛𝑎𝑚𝑖𝑐 𝑝𝑟𝑜𝑔𝑟𝑎𝑚𝑚𝑖𝑛𝑔.
Logical Transformation
O Parsing
 The parser checks the syntax and transforms the
statement into a query tree.
 Parsers for spatial databases have to be more
sophisticated to identify and manage user-defined
data types.
 The leaf nodes of the query tree correspond to the
relations involved and the internal nodes correspond
to the operations.
 Query processing starts at the leaf nodes and
proceeds up until the operation at the root node has
been performed.
Logical Transformation
SELECT L.Name
FROM Lake L, Facilities Fa
WHERE Area(L.G) > 20
AND Fa.Name = 'Campground'
AND Distance(Fa.G, L.G) < 50
[Query tree, from root to leaves]
$\pi_{L.Name}$
  $\sigma_{Area(L.G) > 20}$
    $\sigma_{Fa.Name = \text{'Campground'}}$
      $\bowtie_{Distance(Fa.G,\, L.G) < 50}$
        Lake L        Facilities Fa
Logical Transformation
O Logical Transformation
 The query tree generated by parser is mapped onto
equivalent query trees (based on a formal set of
rules inherited from relational algebra).
 After equivalent trees are enumerated, we can apply
heuristics to filter out non-candidates.
 Clear-cut heuristics may not apply to spatial
databases because of user-defined functions etc.
 𝑅𝑎𝑛𝑘 can be used as a heuristic. 𝑆𝑒𝑙𝑒𝑐𝑡𝑖𝑣𝑖𝑡𝑦 and
𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑡𝑖𝑎𝑙 𝑐𝑜𝑠𝑡 can be stored in the System
Catalog.
Logical Transformation
O Equivalence Rules:
 Selections
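o $\sigma_{c_1 \wedge c_2 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\sigma_{c_2}(\dots(\sigma_{c_n}(R))\dots))$ – All
non-spatial conditions can be pushed towards the right
(innermost), so that they are evaluated first.
o $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$
 Projections
o $\pi_{a_1}(R) \equiv \pi_{a_1}(\pi_{a_2}(\dots(\pi_{a_n}(R))\dots))$ if $a_i \subseteq a_{i+1}$ for
$i = 1, \dots, n-1$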
 Cross Product and Joins
o 𝑅 ⋈ 𝑆 ≡ 𝑆 ⋈ 𝑅
o 𝑅 ⋈ (𝑆 ⋈ 𝑇) ≡ (𝑅 ⋈ 𝑆) ⋈ 𝑇
Logical Transformation
 Selection, Projection and Joins
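o If the selection condition involves only attributes retained by
the projection operator:
$\pi_a(\sigma_c(R)) \equiv \sigma_c(\pi_a(R))$
o If a selection condition involves only attributes that are
present in 𝑅 and not in 𝑆, then
$\sigma_c(R \bowtie S) \equiv \sigma_c(R) \bowtie S$
o Projection can be commuted with a join:
$\pi_a(R \bowtie S) \equiv \pi_{a_1}(R) \bowtie \pi_{a_2}(S)$
where $a_1$ is the subset of $a$ that appears in 𝑅 and $a_2$ is the
subset of $a$ that appears in 𝑆 (the join attributes must be
among those retained).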
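The selection push-down rule above can be checked on toy data. In this sketch relations are lists of dictionaries, a 1-D position `x` stands in for geometry, and the helper names and sample rows are hypothetical.

```python
def select(pred, rel):
    """sigma_pred(rel)"""
    return [t for t in rel if pred(t)]


def join(theta, r, s):
    """Nested-loop theta-join of r and s."""
    return [(a, b) for a in r for b in s if theta(a, b)]


lakes = [{"name": "Erie", "area": 30, "x": 0},
         {"name": "Pond", "area": 5, "x": 3}]
facilities = [{"name": "Campground", "x": 2},
              {"name": "Marina", "x": 40}]


def near(lake, facility):
    # Stand-in for Distance(Fa.G, L.G) < 50 on 1-D positions.
    return abs(lake["x"] - facility["x"]) < 50


def big(lake):
    # Stand-in for Area(L.G) > 20; it touches attributes of Lake only.
    return lake["area"] > 20


# sigma_c(R join S): select after the join.
late = [(l, f) for l, f in join(near, lakes, facilities) if big(l)]
# sigma_c(R) join S: select pushed below the join.
early = join(near, select(big, lakes), facilities)

print(late == early)
```

Both plans return the same pairs, but the second evaluates the join predicate on fewer lake tuples, which matters when that predicate is an expensive spatial function.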
Cost Based Optimization:
Dynamic Programming
O Dynamic Programming is used to determine the
optimal execution strategy from a set of execution
plans.
O The optimal solution minimizes the cost function.
O We focus on each node of the query tree and enumerate
the different execution strategies available to process
the node. The processing strategies for each node,
combined over the whole query, constitute
the plan space.
O The cardinality of the plan space may be high, and the
optimization time must be kept small. This
suggests that we should select a good (not necessarily
the best) plan.
Cost Based Optimization:
Dynamic Programming
O The factors that a good cost function must take
into account are:
o Access cost – Searching for and transferring data
from secondary storage.
o Storage cost – Storing intermediate temporary
relations produced by an execution strategy.
o Computation cost – CPU cost of performing in-
memory operations.
o Communication cost – Transferring information
between the client and server.
Cost Based Optimization:
Dynamic Programming
O System Catalog
 It contains the information required by the cost
function to design an optimal execution strategy.
 It includes:
o the size of each file
o the number of records in each file
o the number of blocks over which records are spread
o information about indexes and indexing attributes
o 𝑠𝑒𝑙𝑒𝑐𝑡𝑖𝑣𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑡𝑖𝑎𝑙 𝑐𝑜𝑠𝑡 of predicates
 It can also materialize expensive user-defined functions
and index their values for fast retrieval.
Cost Based Optimization:
Dynamic Programming
O Cost Functions
$cost = Exp(records\_examined) + K \cdot Exp(pages\_read)$
 $Exp(records\_examined)$ = expected number of records examined
[measure of CPU time]
 $Exp(pages\_read)$ = expected number of pages read from
storage [measure of I/O time]
 $K$ = weight that converts page reads into the same unit as
records examined; it reflects the relative importance of CPU
and I/O resources
O Decomposition and Merge in Hybrid Architecture
 A query is decomposed into spatial and non-spatial parts.
 The subqueries are optimized in separate modules and the
results are merged.
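The cost function on this slide can be used to compare candidate plans. In the sketch below the plan names, the expected record and page counts, and the value of K are all hypothetical; in a real system they would be estimated from the system catalog.

```python
def plan_cost(exp_records_examined, exp_pages_read, k):
    """cost = Exp(records_examined) + K * Exp(pages_read)"""
    return exp_records_examined + k * exp_pages_read


# plan name -> (expected records examined, expected pages read); made-up numbers.
plans = {
    "full_scan":   (10_000, 500),
    "rtree_index": (1_200, 40),
}

K = 50  # assumed weight of one page read relative to one record examined
costs = {name: plan_cost(*stats, k=K) for name, stats in plans.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

Changing K shifts the balance between CPU and I/O; for spatial predicates, where CPU cost is not negligible, the first term can dominate even for plans that read few pages.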
Conclusion
O We learnt about the 2-Step Query Processing
paradigm.
O We reviewed algorithms for Spatial Operations like
Spatial Join.
O We learnt how Dynamic Programming can be
used to optimize queries based on the cost
function.