2014 11 lucene spatial temporal update

The Latest in
Spatial & Temporal Search
David Smiley

Agenda
Spatial
• Polygons and Accuracy: SerializedDVStrategy
• FlexPrefixTree
• BBoxSpatialStrategy
• Student/Intern contributions, Geodesics
Temporal
• Dates, and Date Ranges
• Search
• Faceting

About David Smiley
• Freelance search consultant / developer
• Expert Lucene/Solr development skills,
advice (consulting), training
• Java (full-stack), Web, Spatial
• Apache Lucene / Solr committer & PMC,
Eclipse Locationtech PMC
• Authored 1st book on Solr, plus two editions
• Presented at several conferences & meetups
• Taught several Solr classes, self-developed & LucidWorks

Lucene Spatial Overview
• Multiple approaches to index spatial data
• abstract class SpatialStrategy (5+ concrete implementations)
• RecursivePrefixTreeStrategy (RPT) is the most prominent and versatile
• Grid based
• Uses Spatial4j lib for shapes, distance calculations, and WKT
• Uses JTS Topology Suite lib for polygons
[Diagram labels: Shape; SpatialPrefixTree / Cell (Geohash | Quad); PrefixTreeStrategy; IntersectsPrefixTreeFilter, Contains…, Within…]

SpatialPrefixTrees and Accuracy
RecursivePrefixTreeStrategy (RPT) uses Lucene’s index as a PrefixTree
• Thus represents shapes as grid cells of varying precision by prefix
Example, a point shape:
• D, DR, DRT, DRT2, DRT2Y
Example, a polygon shape:
• Too many to list… 508 cells
More details here:
http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
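To make the point example concrete, here is a minimal sketch of how a point turns into one cell per level. It assumes the standard geohash encoding (lowercase base-32 alphabet); the class and method names are hypothetical, and this is not Lucene's own GeohashPrefixTree code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: standard geohash encoding, shown only to illustrate prefix cells.
final class GeohashPrefixes {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a lat/lon pair as a geohash of the given length. */
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder();
        boolean lonBit = true; // bits alternate: longitude first, then latitude
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { value = (value << 1) | 1; minLon = mid; }
                else { value <<= 1; maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { value = (value << 1) | 1; minLat = mid; }
                else { value <<= 1; maxLat = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    /** One cell per level: each cell is a prefix of the next, finer one. */
    static List<String> prefixes(double lat, double lon, int levels) {
        String full = encode(lat, lon, levels);
        List<String> cells = new ArrayList<>();
        for (int i = 1; i <= levels; i++) {
            cells.add(full.substring(0, i));
        }
        return cells;
    }
}
```

A point indexed to 5 levels yields exactly 5 cells, which is the linear relationship between levels and cells that holds for points.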

…continued
• For more accuracy, index more levels (longer prefixes)
• Points: linear relationship of levels to number of cells 
• Non-points: exponential relationship… 
RPT applies a distErrPct shape size ratio to non-point shapes to
trade accuracy for scalability
• distErrPct=0.025 (2.5% of the radius, the default):
• Massachusetts: level 6
• USA: level 4 (not as precise)
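The accuracy/scalability trade-off can be sketched roughly as follows. This assumes a geohash grid, a shape "radius" given in degrees, and a rule that picks the first level whose cells are smaller than the allowed error; the class and method names are hypothetical, and Lucene's actual calculation may differ in detail.

```java
// Hypothetical sketch: map a shape size and distErrPct to a geohash grid level.
final class GridLevelForError {
    /** Allowed error in degrees: the shape's radius scaled by distErrPct. */
    static double maxError(double shapeRadiusDeg, double distErrPct) {
        return shapeRadiusDeg * distErrPct;
    }

    /** First geohash length whose cell width and height are both below maxErrorDeg, capped at maxLevels. */
    static int levelForError(double maxErrorDeg, int maxLevels) {
        for (int len = 1; len < maxLevels; len++) {
            int totalBits = 5 * len;            // each geohash character adds 5 bits
            int lonBits = (totalBits + 1) / 2;  // longitude gets the extra bit on odd totals
            int latBits = totalBits / 2;
            double cellWidth = 360.0 / (1L << lonBits);
            double cellHeight = 180.0 / (1L << latBits);
            if (cellWidth < maxErrorDeg && cellHeight < maxErrorDeg) {
                return len;
            }
        }
        return maxLevels;
    }
}
```

Under these assumptions a larger shape tolerates a larger absolute error, so it is indexed at a shallower level with far fewer cells.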

SerializedDVStrategy (Lucene 4.7)
• Stores serialized geometry into Lucene BinaryDocValues
• It’s as accurate as the underlying geometry coordinates/shape
• But it’s not a spatial index – it’s retrievable on a per-document basis
• Use RPT + SerializedDV for speed and accuracy!
• More to come eventually:
• Solr adapter – SOLR-5728, ElasticSearch adapter #2361
• Speed: Skip the serialized geometry check for non-edge cells –
LUCENE-5579

Sample Code
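// Lucene 4.x spatial API; assumes ctx (SpatialContext), grid (SpatialPrefixTree)
// and point (Shape) are already defined.
SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, point);
// Fast but approximate: the grid-cell index
RecursivePrefixTreeStrategy treeStrategy =
    new RecursivePrefixTreeStrategy(grid, "geometry");
// Exact: checks the geometry serialized into DocValues
SerializedDVStrategy verifyStrategy =
    new SerializedDVStrategy(ctx, "serialized_geometry");
Query treeQuery = new ConstantScoreQuery(
    treeStrategy.makeFilter(args));
// Run the cheap tree query first, then verify only its matches
Query combinedQuery = new FilteredQuery(
    treeQuery,
    verifyStrategy.makeFilter(args),
    FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);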
Code is from a related presentation given by the Climate Corporation at FOSS4G 2014

FlexPrefixTree (Coming to Lucene 5)
• A new SpatialPrefixTree by Varun Shenoy (GSOC 2014) !
• LUCENE-4922; Still needs to be committed. Goal is for 5.0.
• More optimized, more flexible, than Geohash & Quad
• Configurable sub-cells at each level: 4, 16, 64, 256
• You choose trade-off between index speed/disk size & search speed
• Internally uses an integer coordinate system
• Rectangle searches are particularly fast; minimal floating-point conversion
• Cells are always squares (equal sides) – better for heatmaps
• YMMV: 10% - 100% faster than GeohashPrefixTree

BBoxSpatialStrategy (Lucene 4.10)
• Rectangles (BBox’s) only, one value per field
• Wide predicate support
• Equals, Intersects, Within, Contains, Disjoint
• Accurate (8-byte double floating point)
• Area overlap relevancy
• Weight search results by a combination of query shape overlap &
index shape overlap ratios
• Solr BBoxField…
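One way to picture area overlap relevancy is the sketch below. It assumes rectangles given as {minX, minY, maxX, maxY} in a flat plane and a simple weighted sum of the two ratios; the weight parameter and all names are hypothetical and do not reflect the exact formula Lucene uses.

```java
// Hypothetical sketch of overlap-ratio scoring for two bounding boxes.
final class OverlapRatio {
    /** Area of the intersection of two rectangles {minX, minY, maxX, maxY}; 0 if they do not overlap. */
    static double intersectionArea(double[] a, double[] b) {
        double w = Math.min(a[2], b[2]) - Math.max(a[0], b[0]);
        double h = Math.min(a[3], b[3]) - Math.max(a[1], b[1]);
        return (w <= 0 || h <= 0) ? 0.0 : w * h;
    }

    static double area(double[] r) {
        return (r[2] - r[0]) * (r[3] - r[1]);
    }

    /** Weighted mix of "how much of the query is covered" and "how much of the indexed box is covered". */
    static double score(double[] query, double[] indexed, double queryWeight) {
        double inter = intersectionArea(query, indexed);
        if (inter == 0.0) {
            return 0.0;
        }
        double queryRatio = inter / area(query);
        double indexedRatio = inter / area(indexed);
        return queryWeight * queryRatio + (1 - queryWeight) * indexedRatio;
    }
}
```

With an equal weight of 0.5, an indexed box that sits entirely inside the query but covers only a quarter of it scores between the two ratios, so both tiny and oversized matches are penalized.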

Solr BBoxField
• Schema configuration
<field name="bbox" type="bbox" />
<fieldType name="bbox" class="solr.BBoxField"
geo="true" units="degrees" numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField"
precisionStep="8" docValues="true" stored="false"/>
• Search with overlap ratio ordering
&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
• score can be: overlapRatio, area, area2D

Recent Student/Intern Contributions
• Varun Shenoy via GSOC: summer 2014
• Lucene spatial: new “FlexPrefixTree” – an optimized grid
• Rebecca Alford via F.B. Open-Academy: winter 2014
• Spatial4j: geodesic polygons
• Chris Pavlicek via F.B. Open-Academy: winter 2014
• Spatial4j: geodesic buffered lines
• Evana Gizzi, MITRE intern: winter 2014
• Spatial4j: geodesic circle polygonizer
• Liviy Ambrose, MITRE intern: fall 2013
• Lucene spatial: integrated with Lucene’s benchmark module

Temporal/Date Durations
or basically any numeric ranges

Approach: Simple Two-field
(as you might do in SQL or any system without native range types)
• A start-time & end-time field pair
• A search window (time span) becomes two range queries
• details vary by predicate (Intersects, Contains, vs. Within)
• Single-valued only
• …even though Lucene supports multi-valued fields
• Theoretically possible but would be a lot of work
• because Lucene doesn’t store “position” info for numeric fields
• because numeric range/prefix queries are position-less
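The per-predicate details can be sketched as plain comparisons. Each method below corresponds to a pair of range conditions, one on the start field and one on the end field; the names are hypothetical and ranges are treated as inclusive.

```java
// Hypothetical sketch: how each predicate maps to two range conditions
// on a single-valued (docStart, docEnd) field pair.
final class TwoFieldRange {
    /** Document range overlaps the query window. */
    static boolean intersects(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qEnd && docEnd >= qStart;
    }

    /** Document range fully contains the query window. */
    static boolean contains(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qStart && docEnd >= qEnd;
    }

    /** Document range lies fully within the query window. */
    static boolean within(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart >= qStart && docEnd <= qEnd;
    }
}
```

This only works when each document has one start and one end value, because with several values per field the two conditions could be satisfied by different ranges of the same document.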

Approach: 2D Spatial PrefixTree
• Lucene Spatial QuadPrefixTree (2D) with RPT Strategy
• Use ‘x’ for start-time, ‘y’ for end-time
• A search window (time span) becomes a rectangle query
• details vary by predicate (Intersects, Contains, vs. Within)
• Cool…
• But floating-point edge issues
• Only ~50 levels supported; not 64
Details: http://wiki.apache.org/solr/SpatialForTimeDurations
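The mapping from a search window to a rectangle can be sketched like this. It assumes each document range is a point (x = start, y = end) and that min and max are the bounds of the whole time axis; the names are hypothetical and the linked wiki page remains the reference for the Solr syntax.

```java
// Hypothetical sketch: a time span becomes a 2D point, a search window becomes a rectangle.
final class RangeAsPoint {
    /** Rectangle {minX, minY, maxX, maxY} matching document points for the given predicate. */
    static long[] queryRect(String predicate, long qStart, long qEnd, long min, long max) {
        switch (predicate) {
            case "Intersects": // start <= qEnd and end >= qStart
                return new long[] {min, qStart, qEnd, max};
            case "Contains":   // start <= qStart and end >= qEnd
                return new long[] {min, qEnd, qStart, max};
            case "Within":     // start >= qStart and end <= qEnd
                return new long[] {qStart, qStart, qEnd, qEnd};
            default:
                throw new IllegalArgumentException("Unknown predicate: " + predicate);
        }
    }

    /** True if the document point (docStart, docEnd) falls inside the rectangle. */
    static boolean matches(long[] rect, long docStart, long docEnd) {
        return docStart >= rect[0] && docStart <= rect[2]
            && docEnd >= rect[1] && docEnd <= rect[3];
    }
}
```

Because every range is a single point, multi-valued fields work here: each (start, end) pair is matched as a unit.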

Approach: DateRangePrefixTree (Lucene 5)
• A new 1D SpatialPrefixTree: NumberRangePrefixTree
• NumberRangePrefixTree w/ DateRangePrefixTree subclass
• NR-SPT: Configurable sub-cells per level; no level limit
• Not just for ranges; instances too
• Index/Search with NumberRangePrefixTreeStrategy
• Indexing, and search predicate code (e.g. Intersects…) completely re-used
• DateRangePrefixTree
• 9 Levels: 1M years, 1K years, years, months, days, hours, minutes,
seconds, millis
…continued…
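To illustrate why level-aligned date ranges need few terms, here is a simplified sketch that covers a day-granularity range with the coarsest whole years, months, and days. It uses only three of the nine levels, the names are hypothetical, and it is not the actual DateRangePrefixTree algorithm.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cover an inclusive date range with whole years, months, and days.
final class DateRangeCells {
    static List<String> cover(LocalDate start, LocalDate end) {
        List<String> cells = new ArrayList<>();
        LocalDate cursor = start;
        while (!cursor.isAfter(end)) {
            LocalDate yearEnd = cursor.withDayOfYear(cursor.lengthOfYear());
            LocalDate monthEnd = cursor.withDayOfMonth(cursor.lengthOfMonth());
            if (cursor.getDayOfYear() == 1 && !yearEnd.isAfter(end)) {
                cells.add(String.valueOf(cursor.getYear()));   // a whole year fits
                cursor = cursor.plusYears(1);
            } else if (cursor.getDayOfMonth() == 1 && !monthEnd.isAfter(end)) {
                cells.add(YearMonth.from(cursor).toString());  // a whole month fits
                cursor = cursor.plusMonths(1);
            } else {
                cells.add(cursor.toString());                  // fall back to a single day
                cursor = cursor.plusDays(1);
            }
        }
        return cells;
    }
}
```

A range aligned to coarse boundaries, such as whole years, collapses to a handful of cells, while a range with ragged edges needs extra fine-grained cells at each end.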

Trade-offs of N/D-SPT
• Indexing:
• “Common” date-ranges use ~ <50 terms, but random millisecond
ranges use up to ~14K terms
• All date instances (not a range) <= 9 terms
• Comparison to 2D SPT: instance or range, always 50
• Search:
• Query for “common” query ranges faster than uncommon
• Comparison to 2D SPT:
• Contains & Within predicates: overlapping values per document get
coalesced, can’t be differentiated

Solr DateRangeField
• Configuration in schema.xml:
<field name="dateRange" type="dateRange" />
<fieldType name="dateRange" class="solr.DateRangeField" />
• Index field data, examples:
• 2014-05-21T12:00:00.000Z (same as TrieDate)
• 2014-05-21T12 (truncated to desired precision)
• [1990 TO 1995]
• Query, examples:
• fq=dateRange:[* TO 2014-05-21]
• fq={!field f=dateRange op=Contains} [2000 TO 2014-05-21]

Visualizing Date Facets
• http://bl.ocks.org/mbostock/4063318

Date Faceting
• Option A: facet.range
• Not for indexed date-ranges
• Internally executes one query for each value & caches large bitset
• Option B: facet.interval (Solr 4.10)
• Not for indexed date-ranges
• Requires DocValues (more index data)
• Supports variable/custom intervals
• New work-in-progress option: Facet on DateRangeField
• Ranges are fixed/pre-determined (months, days, etc.)
• Optimized for thousands of ranges to count
• Each value-range is only 1 term!
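The idea behind fixed-range faceting can be modeled very roughly as below. This assumes each document date contributes exactly one month-level term and that a facet count is simply the number of documents per term; the names are hypothetical and this is an in-memory illustration, not the work-in-progress implementation.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one month-level term per document, counted per term.
final class MonthFacets {
    static Map<String, Integer> countByMonth(List<LocalDate> docDates) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by term, i.e. chronologically
        for (LocalDate date : docDates) {
            String monthTerm = YearMonth.from(date).toString(); // e.g. "2014-05"
            counts.merge(monthTerm, 1, Integer::sum);
        }
        return counts;
    }
}
```

Since every bucket corresponds to a single term, counting thousands of buckets needs no per-bucket range query.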

Future stuff I’m excited about
• Continuing works in-progress
• Spatial heatmaps! Coming in January 2015!
• Lucene layer & Solr adapter
• Lucene term auto-prefixing LUCENE-5879
• Brings spatial, date, numeric, indexing/search to the next level!
• More prefix-tree optimizations
• Inner vs edge leaf cell differentiation for non-point shapes
• RPT + SerializedDVStrategy; skip accuracy checks for inner cells
• Don’t index leaf cells twice

That’s all for now; thanks for coming!
Need Lucene/Solr guidance or custom development?
Contact me!
Email: dsmiley@apache.org
LinkedIn: http://www.linkedin.com/in/davidwsmiley
G+: +DavidSmiley
Twitter: @DavidWSmiley
ETA: December
2014
