3. Agenda
Spatial
• Polygons and Accuracy: SerializedDVStrategy
• FlexPrefixTree
• BBoxSpatialStrategy
• Student/Intern contributions, Geodesics
Temporal
• Dates, and Date Ranges
• Search
• Faceting
4. About David Smiley
• Freelance search consultant / developer
  • Expert Lucene/Solr development skills, advice (consulting), training
  • Java (full-stack), Web, Spatial
• Apache Lucene / Solr committer & PMC,
Eclipse Locationtech PMC
• Authored 1st book on Solr, plus two editions
• Presented at several conferences & meetups
• Taught several Solr classes, self-developed & LucidWorks
5. Lucene Spatial Overview
• Multiple approaches to index spatial data
  • abstract class SpatialStrategy (5+ concrete implementations)
• RecursivePrefixTreeStrategy (RPT) is most prominent, versatile
  • Grid based
  [Diagram labels: Shape; SpatialPrefixTree / Cell; PrefixTreeStrategy]
• Uses Spatial4j lib for shapes, distance calculations, and WKT
  • Uses JTS Topology Suite lib for polygons
[Diagram labels: IntersectsPrefixTreeFilter, Contains…, Within…; Geohash | Quad]
6. SpatialPrefixTrees and Accuracy
RecursivePrefixTree (RPT) uses Lucene’s index as a PrefixTree
  • Thus represents shapes as grid cells of varying precision by prefix
Example, a point shape:
  • D, DR, DRT, DRT2, DRT2Y
Example, a polygon shape:
  • Too many to list… 508 cells
More details here: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
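The point example above (D, DR, DRT, …) can be sketched as follows. This is a minimal illustration of the prefix idea only, assuming the standard geohash algorithm; the class and method names are hypothetical and it is not Lucene's GeohashPrefixTree code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not Lucene's GeohashPrefixTree. */
final class GeohashPrefixSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Standard geohash: alternate longitude/latitude bisection bits, 5 bits per character. */
    static String encode(double lat, double lon, int length) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean lonTurn = true; // a geohash starts with a longitude bit
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (lonTurn) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value <<= 1;              lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value <<= 1;              latMax = mid; }
            }
            lonTurn = !lonTurn;
            if (++bits == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    /** A point is represented by one cell per level: every prefix of its finest cell. */
    static List<String> prefixes(String cell) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= cell.length(); i++) {
            out.add(cell.substring(0, i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("DRT2Y"));
        System.out.println(prefixes(encode(42.605, -5.603, 5)));
    }
}
```

Each extra level adds exactly one more term for a point, which is why points scale linearly with the number of levels.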
7. …continued
• For more accuracy, index more levels (longer prefixes)
  • Points: linear relationship of levels to number of cells :)
  • Non-points: exponential relationship… :(
RPT applies a distErrPct shape size ratio to non-point shapes to
trade accuracy for scalability
• distErrPct=0.025 (2.5% of the radius, the default):
  • Massachusetts: level 6
  • USA: level 4 (not as precise)
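A rough sketch of how a distErrPct ratio can translate into a grid level. This assumes a simplified quad-style grid in degrees where each level halves the cell size; it is not RPT's actual level calculation, and the names below are hypothetical. It only illustrates why a larger shape ends up at a coarser level for the same ratio.

```java
/** Illustrative sketch only; a simplified model, not Lucene's level calculation. */
final class DistErrPctSketch {
    /** Assumed width of the whole grid, in degrees. */
    static final double WORLD_DEGREES = 360.0;

    /**
     * Returns the first level whose cell size is no larger than the allowed
     * error, where the allowed error is distErrPct times the shape's radius.
     */
    static int levelFor(double shapeRadiusDeg, double distErrPct, int maxLevels) {
        double distErr = distErrPct * shapeRadiusDeg;
        double cellSize = WORLD_DEGREES;
        for (int level = 1; level <= maxLevels; level++) {
            cellSize /= 2; // each level halves the cell size in this model
            if (cellSize <= distErr) {
                return level;
            }
        }
        return maxLevels; // cap at the finest level the grid supports
    }

    public static void main(String[] args) {
        System.out.println(levelFor(2.0, 0.025, 20));  // smaller shape, finer level
        System.out.println(levelFor(20.0, 0.025, 20)); // larger shape, coarser level
    }
}
```

Because the allowed error scales with the shape, big shapes stop at shallower levels, which keeps the exponential cell growth in check.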
8. SerializedDVStrategy (Lucene 4.7)
• Stores serialized geometry into Lucene BinaryDocValues
  • It’s as accurate as the underlying geometry coordinates/shape
  • But it’s not a spatial index; it’s retrievable on a per-document basis
• Use RPT + SerializedDV for speed and accuracy!
• More to come eventually:
  • Solr adapter: SOLR-5728, ElasticSearch adapter #2361
  • Speed: Skip the serialized geometry check for non-edge cells (LUCENE-5579)
9. Sample Code
SpatialArgs args = new SpatialArgs(INTERSECTS, point);

// Grid-cell index: fast, approximate at shape edges
treeStrategy = new RecursivePrefixTreeStrategy(grid, "geometry");
// Serialized geometry: checks each candidate against the exact shape
verifyStrategy = new SerializedDVStrategy(ctx, "serialized_geometry");

Query treeQuery = new ConstantScoreQuery(treeStrategy.makeFilter(args));
Query combinedQuery = new FilteredQuery(
    treeQuery,
    verifyStrategy.makeFilter(args),
    FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
Code is from a related presentation by the Climate Corporation presented at FOSS4G 2014
10. FlexPrefixTree (Coming to Lucene 5)
• A new SpatialPrefixTree by Varun Shenoy (GSOC 2014) !
  • LUCENE-4922; Still needs to be committed. Goal is for 5.0.
• More optimized, more flexible, than Geohash & Quad
  • Configurable sub-cells at each level: 4, 16, 64, 256
  • You choose trade-off between index speed/disk size & search speed
  • Internally uses an integer coordinate system
  • Rectangle searches are particularly fast; minimal floating-point conversion
  • Cells are always squares (equal sides), which is better for heatmaps
  • YMMV: 10% - 100% faster than GeohashPrefixTree
11. BBoxSpatialStrategy (Lucene 4.10)
• Rectangles (BBox’s) only, one value per field
• Wide predicate support
  • Equals, Intersects, Within, Contains, Disjoint
• Accurate (8-byte double floating point)
• Area overlap relevancy
  • Weight search results by a combination of query shape overlap & index shape overlap ratios
• Solr BBoxField…
15. Approach: Simple Two-field
(as you might do in SQL or any system without native range types)
• A start-time & end-time field pair
• A search window (time span) becomes two range queries
  • details vary by predicate (Intersects, Contains, vs. Within)
• Single-valued only
  • …even though Lucene supports multi-valued fields
  • Theoretically possible but would be a lot of work
    • because Lucene doesn’t store “position” info for numeric fields
    • because numeric range/prefix queries are position-less
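How a search window becomes two range conditions, one per field, can be sketched as below. This is a minimal illustration assuming inclusive bounds and the usual convention that the predicate describes the indexed range relative to the query range; the class, method, and field names (start_time, end_time) are hypothetical.

```java
/** Illustrative sketch only; hypothetical names, inclusive bounds assumed. */
final class TwoFieldRangeSketch {

    /** Indexed range overlaps the query: start_time:[* TO qEnd] AND end_time:[qStart TO *]. */
    static boolean intersects(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qEnd && docEnd >= qStart;
    }

    /** Indexed range contains the query: start_time:[* TO qStart] AND end_time:[qEnd TO *]. */
    static boolean contains(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qStart && docEnd >= qEnd;
    }

    /** Indexed range is within the query: start_time:[qStart TO *] AND end_time:[* TO qEnd]. */
    static boolean within(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart >= qStart && docEnd <= qEnd;
    }

    public static void main(String[] args) {
        // Document spans [10, 20]; query window is [15, 30].
        System.out.println(intersects(10, 20, 15, 30));
        System.out.println(contains(10, 20, 15, 30));
        System.out.println(within(10, 20, 15, 30));
    }
}
```

Each predicate only ever compares one start value with one end value, which is why this approach cannot pair up starts and ends when a document holds several ranges.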
16. Approach: 2D Spatial PrefixTree
• Lucene Spatial QuadPrefixTree (2D) with RPT Strategy
• Use ‘x’ for start-time, ‘y’ for end-time
• A search window (time span) becomes a rectangle query
  • details vary by predicate (Intersects, Contains, vs. Within)
• Cool…
  • But floating-point edge issues
  • Only ~50 levels supported; not 64
Details: http://wiki.apache.org/solr/SpatialForTimeDurations
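The mapping from a search window to a rectangle can be sketched like this. It is a minimal illustration, assuming each indexed time span is stored as the point (x = start, y = end) and that the grid covers a known [min, max] time extent; the class and method names are hypothetical and no Lucene API is used.

```java
/** Illustrative sketch only; hypothetical names, inclusive bounds assumed. */
final class TimeSpanAsPointSketch {

    /** Axis-aligned query rectangle over (x = start-time, y = end-time). */
    static final class Rect {
        final long minX, maxX, minY, maxY;

        Rect(long minX, long maxX, long minY, long maxY) {
            this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
        }

        boolean containsPoint(long x, long y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    /** Indexed span overlaps the window: start <= qEnd and end >= qStart. */
    static Rect intersectsRect(long qStart, long qEnd, long min, long max) {
        return new Rect(min, qEnd, qStart, max);
    }

    /** Indexed span contains the window: start <= qStart and end >= qEnd. */
    static Rect containsRect(long qStart, long qEnd, long min, long max) {
        return new Rect(min, qStart, qEnd, max);
    }

    /** Indexed span lies within the window: both start and end fall in [qStart, qEnd]. */
    static Rect withinRect(long qStart, long qEnd) {
        return new Rect(qStart, qEnd, qStart, qEnd);
    }

    public static void main(String[] args) {
        // Indexed span [10, 20] is the point (10, 20); the grid covers [0, 100].
        System.out.println(intersectsRect(15, 30, 0, 100).containsPoint(10, 20));
        System.out.println(containsRect(15, 30, 0, 100).containsPoint(10, 20));
        System.out.println(withinRect(5, 25).containsPoint(10, 20));
    }
}
```

Since every span is its own point, a document can hold many spans without mixing up which start belongs to which end.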
17. Approach: DateRangePrefixTree (Lucene 5)
• A new 1D SpatialPrefixTree: NumberRangePrefixTree
  • NumberRangePrefixTree w/ DateRangePrefixTree subclass
  • NR-SPT: Configurable sub-cells per level; no level limit
  • Not just for ranges; instances too
• Index/Search with NumberRangePrefixTreeStrategy
  • Indexing, and search predicate code (e.g. Intersects…) completely re-used
• DateRangePrefixTree
  • 9 Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis
…continued…
18. Trade-offs of N/D-SPT
• Indexing:
  • “Common” date-ranges use fewer than ~50 terms, but random millisecond ranges use up to ~14K terms
  • All date instances (not a range) <= 9 terms
  • Comparison to 2D SPT: instance or range, always 50
• Search:
  • Query for “common” query ranges faster than uncommon
  • Comparison to 2D SPT:
    • Contains & Within predicates: overlapping values per document get coalesced, can’t be differentiated
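The "<= 9 terms" figure for date instances follows from the nine levels if an instance is indexed as one cell per level down to its precision. The sketch below assumes exactly that; the enum and method names are hypothetical and this is not DateRangePrefixTree code.

```java
/** Illustrative sketch only; assumes one indexed term per level down to the instance's precision. */
final class DateLevelSketch {

    /** The nine levels listed on the slide, coarsest first (names are hypothetical). */
    enum Level {
        MILLION_YEARS, THOUSAND_YEARS, YEARS, MONTHS, DAYS, HOURS, MINUTES, SECONDS, MILLIS
    }

    /** Terms for a single date instance truncated to the given precision. */
    static int termsForInstance(Level precision) {
        return precision.ordinal() + 1; // one cell at every level up to and including this one
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + " -> " + termsForInstance(level) + " term(s)");
        }
    }
}
```

Under this assumption a day-precision instance needs 5 terms and a millisecond-precision instance needs all 9, compared with the fixed 50 of the 2D approach.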
19. Solr DateRangeField
• Configuration in schema.xml:
<field name="dateRange" type="dateRange" />
<fieldType name="dateRange" class="solr.DateRangeField" />
• Index field data, examples:
  • 2014-05-21T12:00:00.000Z (same as TrieDate)
  • 2014-05-21T12 (truncated to desired precision)
  • [1990 TO 1995]
• Query, examples:
  • fq=dateRange:[* TO 2014-05-21]
  • fq={!field f=dateRange op=Contains}[2000 TO 2014-05-21]
21. Date Faceting
• Option A: facet.range
  • Not for indexed date-ranges
  • Internally executes one query for each value & caches large bitset
• Option B: facet.interval (Solr 4.10)
  • Not for indexed date-ranges
  • Requires DocValues (more index data)
  • Supports variable/custom intervals
• New work-in-progress option: Facet on DateRangeField
  • Ranges are fixed/pre-determined (months, days, etc.)
  • Optimized for thousands of ranges to count
  • Each value-range is only 1 term!
22. Future stuff I’m excited about
• Continuing works in-progress
• Spatial heatmaps! Coming in January 2015!
  • Lucene layer & Solr adapter
• Lucene term auto-prefixing LUCENE-5879
  • Brings spatial, date, numeric, indexing/search to the next level!
• More prefix-tree optimizations
  • Inner vs edge leaf cell differentiation for non-point shapes
  • RPT + SerializedDVStrategy; skip accuracy checks for inner cells
  • Don’t index leaf cells twice
23. That’s all for now; thanks for coming!
Need Lucene/Solr guidance or custom development? Contact me!
Email: dsmiley@apache.org
LinkedIn: http://www.linkedin.com/in/davidwsmiley
G+: +DavidSmiley
Twitter: @DavidWSmiley
ETA: December 2014