Lucene 4 spatial


Covers the new Apache Lucene 4 spatial module. Includes Solr usage info. Applicable to ElasticSearch too.
Presented at the 2012 Open Source Search in Government conference by Basis Technology.

Published in: Technology
  • Hi Sang,
    It depends on precisely what you load tested. Your message says 'distance' -- do you mean sorting by distance (or otherwise using the distance in a boost)? The only way I know right now to speed up distance calculations is to use the 'SloppyMath' code recently added to Lucene (but copied from some other project), which I recall is maybe 2x as fast. Another thing: if your data is all in the same region (just one country, even USA size), then project the data using something like Proj4j and use the standard Euclidean geometry distance calculations provided by Spatial4j/Lucene/Solr.
  • Hi David Smiley,
    Have you ever done a load test with Lucene Spatial?
    I have load tested it and get distance calculations taking around 100 ms (I have 5M documents). Could I speed it up?
  • An update to this as of May 2013 is here:

Lucene 4 spatial

  1. LUCENE 4 SPATIAL
     2012 Basis Technology Open Source Search Conference
     Presented by David Smiley, MITRE
     © 2012 The MITRE Corporation. All rights reserved.
  2. About David Smiley
     • Working at MITRE for 12 years
       • web development, Java, search
       • 3 Solr apps, 1 Endeca
     • Published 1st book on Solr; then 2nd edition (2009, 2011)
     • Apache Lucene / Solr committer (2012)
       • Specializing in spatial
     • Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011)
     • Taught Solr classes at MITRE (2010, 2011, 2012)
     • Solr search consultant within MITRE and its sponsors, and privately via OpenSource Connections
  3. What is Spatial Search?
     Primary features:
     • Spatial filter query
     • Spatial distance sorting
     • Spatial distance relevancy (i.e. spatial query score)
     NOT "geocoding" (resolving "Boston" to its latitude and longitude)
     Typical use-case:
     1. Index a location for each Lucene document given a latitude & longitude
     2. Then search for matching documents by a circle (point-radius) or bounding box
     3. Then sort results by distance
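The typical use-case on this slide (index a point per document, filter by a circle, sort by distance) can be pictured with a minimal brute-force sketch. It is plain Python, not Lucene code; the function names, the sample documents and the haversine formula with a mean Earth radius are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def search_within(docs, center, radius_km):
    """Keep documents inside the circle, then sort them by distance from its center."""
    lat0, lon0 = center
    hits = []
    for doc_id, (lat, lon) in docs.items():
        distance = haversine_km(lat0, lon0, lat, lon)
        if distance <= radius_km:  # spatial filter
            hits.append((distance, doc_id))
    hits.sort()  # spatial distance sort
    return [(doc_id, distance) for distance, doc_id in hits]


# Hypothetical documents: id -> (latitude, longitude)
docs = {
    "boston": (42.3601, -71.0589),
    "cambridge": (42.3736, -71.1097),
    "new_york": (40.7128, -74.0060),
}
results = search_within(docs, center=(42.3601, -71.0589), radius_km=50)
```

A real index avoids the full scan shown here; the strategies on the later slides exist to make the filter and the sort fast.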
  4. History of Spatial for Lucene & Solr
     • 2007: Local-Lucene
       • by Patrick O'Leary (AOL)
     • 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0
       • Local-Lucene graduates to an official Lucene contrib module
     • 2009-12: Spatial Search Plugin (SSP) for Solr
       • by Chris Male (JTeam -> Orange11, ElasticSearch)
     • 2010-10: SOLR-2155, a geohash prefix tree filter
       • by David Smiley (MITRE)
     • 2011-01: Lucene Spatial Playground (LSP)
       • by Ryan McKinley (Voyager GIS), David, and Chris
     • 2011-03: Solr 3.1 new spatial features
       • by Grant Ingersoll and Yonik Seeley (LucidWorks)
     • 2012-03: LSP -> Lucene 4 spatial module + Spatial4j
       • replaces former Lucene spatial contrib module
  5. Lucene Spatial Committers
     • David Smiley, MITRE
       • Bedford, MA
     • Chris Male, Elastic Search
       • New Zealand
     • Ryan McKinley, Voyager GIS
       • Oakland, CA
  6. Breakdown of Spatial Components
     • Spatial4j: 43%
     • Lucene spatial: 35%
     • Misc: 16%
     • Solr adapters: 6%
     Total: 4,781 Non-Comment Source Statements (without javadocs or tests)
  7. Spatial4j: It's all about the shapes
     • Shapes
       • Types: Point, Rectangle, Circle, Polygon
       • Geospatial & Euclidean/2D implementations
       • Intersection: within, contains, intersects, disjoint
     • Distance and area math utilities
     • Input/Output serialization to Well Known Text (WKT)
       • Ex: POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))
     • ASL licensed project independent of Apache on GitHub
     • Requires JTS (3rd party LGPL) for polygon & WKT support
     • Ported to .NET as Spatial4n and used by RavenDB
       • by Itamar Syn-Hershko
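To make the WKT example on this slide concrete, here is a small sketch that parses a single-ring WKT POLYGON and tests whether a point falls inside it on a flat (Euclidean) plane. It is plain Python rather than Spatial4j or JTS; the function names are hypothetical, and holes, multi-polygons and geodetic edges are deliberately not handled.

```python
import re


def parse_wkt_polygon(wkt):
    """Parse a single-ring WKT POLYGON into a list of (x, y) tuples."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.*?)\s*\)\)\s*", wkt, flags=re.IGNORECASE)
    if not match:
        raise ValueError("not a single-ring WKT POLYGON: %r" % wkt)
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()
        ring.append((float(x), float(y)))
    if ring[0] != ring[-1]:
        raise ValueError("ring must be closed (first point repeated at the end)")
    return ring


def bounding_box(ring):
    """Return (min_x, min_y, max_x, max_y) of the ring."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def contains_point(ring, x, y):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


ring = parse_wkt_polygon("POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))")
```

For this ring, `contains_point(ring, 25, 25)` is true and `contains_point(ring, 5, 5)` is false; points exactly on an edge are a corner case this sketch does not define.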
  8. Lucene 4 Spatial Module
     • There isn't one best way to implement spatial indexing for all use-cases
       • Index just points, or other shapes too? Which?
       • Multiple shapes per field?
       • Query by Intersection? Contains? Within? Equals? Disjoint? …
       • Distance sorting? Query boost by distance?
       • Or more exotic shape relevancy like overlap percentage?
       • Tradeoff shape precision for speed?
     • Multiple SpatialStrategy implementations:
       • RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy
       • PointVectorStrategy
       • BBoxStrategy (currently in trunk, not 4x)
       • JtsGeoStrategy (in Spatial4j/LSP)
     Names subject to change!
  9. Strategy: PointVector
     • Similar to Solr's PointType / LatLonType
       • X & Y trie double fields; caching via FieldCache
     • Characteristics
       • Indexes points (only)
       • Single-valued field (no multi)
       • Query by rectangle or circle (only)
         • Circle uses FieldCache (requires memory)
         • Circle does bbox pre-filter for performance
       • Relations: Intersects, Within (only)
       • Exact precision for x & y coordinates and query shape
       • Distance sort
         • Uses FieldCache (requires memory)
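The "bbox pre-filter" idea for circle queries can be sketched as follows: two independent range lookups on x and y give cheap candidates, and only those candidates pay for an exact distance check against cached coordinates. This is a toy in plain Python with Euclidean distance, not PointVectorStrategy itself; the class name, the sorted lists standing in for numeric fields, and the dictionary standing in for the FieldCache are all assumptions.

```python
import bisect
import math


class PointVectorSketch:
    """Toy stand-in for a point index with separate x and y fields."""

    def __init__(self, points):
        # doc id -> (x, y); plays the role of the in-memory FieldCache
        self.points = dict(points)
        by_x = sorted((x, doc) for doc, (x, _) in self.points.items())
        by_y = sorted((y, doc) for doc, (_, y) in self.points.items())
        self.x_keys = [key for key, _ in by_x]
        self.x_docs = [doc for _, doc in by_x]
        self.y_keys = [key for key, _ in by_y]
        self.y_docs = [doc for _, doc in by_y]

    @staticmethod
    def _range(keys, docs, lo, hi):
        """Doc ids whose key lies in [lo, hi], located by binary search."""
        start = bisect.bisect_left(keys, lo)
        end = bisect.bisect_right(keys, hi)
        return set(docs[start:end])

    def query_rect(self, min_x, min_y, max_x, max_y):
        """Rectangle query: intersect the x-range and y-range matches."""
        in_x = self._range(self.x_keys, self.x_docs, min_x, max_x)
        in_y = self._range(self.y_keys, self.y_docs, min_y, max_y)
        return in_x & in_y

    def query_circle(self, cx, cy, radius):
        """Circle query: bounding-box pre-filter, then an exact distance check."""
        candidates = self.query_rect(cx - radius, cy - radius, cx + radius, cy + radius)
        return {
            doc for doc in candidates
            if math.hypot(self.points[doc][0] - cx, self.points[doc][1] - cy) <= radius
        }


index = PointVectorSketch({"a": (0, 0), "b": (3, 4), "c": (4.5, 4.5), "d": (10, 10)})
```

Here document "c" survives the bounding box around a radius-5 circle at the origin but is rejected by the exact check, which is the work the pre-filter keeps small.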
  10. Strategy: RecursivePrefixTree
      (Potential rename to GridFilterSpatialStrategy)
      • Grid / Tile / Trie / Prefix-Tree based
        • With recursive descent algorithm
        • Or TermQueryPrefixTree alternative
      • Choose Geohash (geo only) or Quad tree
      • The most mature strategy to date
      • The current evolution of SOLR-2155
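Why a geohash works as a prefix tree is easiest to see from the encoding itself: each extra character subdivides the previous cell, so a longer hash always starts with the shorter one. The sketch below is the standard public geohash bit-interleaving written in plain Python; it is not taken from Lucene, and the function name is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision):
    """Encode a lat/lon point (degrees) as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    refine_lon = True  # bits alternate: longitude first, then latitude
    while len(chars) < precision:
        if refine_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        refine_lon = not refine_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, 3)
fine = geohash_encode(57.64911, 10.40744, 8)
```

Because `fine` starts with `coarse`, indexing a point under all of its prefixes lets a query match whole cells at whatever grid level suits the query shape.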
  11. Strategy: RecursivePrefixTree
      • Characteristics:
        • Indexes all shapes
          • Variable precision of shape edges
          • Highly precise shapes other than point won't scale
          • LineStrings possibly not precise enough for your needs
        • Multi-valued field support
        • Query by any shape
          • Variable precision for query shape
          • Highest precision usually scales
        • Relations: Intersects (only)
        • Distance sort (w/ multi-value support)
          • Warning: immature, won't scale
          • Uses significant amounts of memory
        • Fast spatial filtering; no cache needed
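One way to picture recursive descent over a grid with variable precision is the quad-tree sketch below: starting from the whole world, it skips cells disjoint from the query rectangle, emits cells that lie fully inside it without descending further, and subdivides the rest until a maximum level. This is an illustration in plain Python, not the Lucene implementation; the A/B/C/D cell labels, the function names and the choice to treat edge-touching cells as disjoint are assumptions.

```python
WORLD = (-180.0, -90.0, 180.0, 90.0)  # (min_x, min_y, max_x, max_y)


def intersects(a, b):
    """True if the rectangles share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def within(inner, outer):
    """True if `inner` lies entirely inside `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def cover(query, max_level, cell=WORLD, token=""):
    """Return tokens of quad cells covering the query rectangle."""
    min_x, min_y, max_x, max_y = cell
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    children = {
        "A": (min_x, mid_y, mid_x, max_y),  # upper left
        "B": (mid_x, mid_y, max_x, max_y),  # upper right
        "C": (min_x, min_y, mid_x, mid_y),  # lower left
        "D": (mid_x, min_y, max_x, mid_y),  # lower right
    }
    tokens = []
    for label, child in children.items():
        if not intersects(child, query):
            continue  # prune: nothing under this cell can match
        child_token = token + label
        if within(child, query) or len(child_token) == max_level:
            tokens.append(child_token)  # whole cell matches, or precision limit hit
        else:
            tokens.extend(cover(query, max_level, child, child_token))
    return tokens


coarse_cells = cover((10.0, 10.0, 20.0, 20.0), max_level=2)
fine_cells = cover((10.0, 10.0, 20.0, 20.0), max_level=4)
```

Raising `max_level` tightens the edge approximation at the cost of more cells, which is the precision-versus-scale tradeoff the slide warns about for highly precise non-point shapes.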
  12. Strategy: BBox
      • Implemented with 4 doubles & 1 boolean
      • Ported from ESRI Open Source GeoPortal
      • Characteristics:
        • Indexes rectangles (only)
        • Single-valued field (no multi)
        • Query by rectangle (only)
        • Supports all relations: Intersects, Within, Contains, …
        • Distance sort from box center
          • Uses FieldCache (requires memory)
        • Area overlap sorting
          • Sort results by percentage overlap between query and indexed boxes
          • Uses FieldCache (requires memory)
        • Note: FieldCache needs are somewhat high
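Rectangle relations and an overlap-based ranking score can be sketched with a few comparisons on box corners. The code below is plain Python, not BBoxStrategy; the function names are hypothetical, boxes are assumed not to cross the dateline, and the overlap score (intersection area divided by union area) is an assumed stand-in, since the slide does not give the actual formula.

```python
def area(box):
    """Area of (min_x, min_y, max_x, max_y); zero for empty or inverted boxes."""
    min_x, min_y, max_x, max_y = box
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def intersection(a, b):
    """Overlapping region of two boxes (may be inverted if they are disjoint)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def relate(indexed, query):
    """Relation of the indexed box to the query box."""
    i, q = indexed, query
    if i[0] >= q[0] and i[1] >= q[1] and i[2] <= q[2] and i[3] <= q[3]:
        return "within"  # equal boxes are reported as within
    if q[0] >= i[0] and q[1] >= i[1] and q[2] <= i[2] and q[3] <= i[3]:
        return "contains"
    if i[0] > q[2] or i[2] < q[0] or i[1] > q[3] or i[3] < q[1]:
        return "disjoint"
    return "intersects"


def overlap_ratio(indexed, query):
    """Assumed score: shared area over the area of the union of both boxes."""
    shared = area(intersection(indexed, query))
    union = area(indexed) + area(query) - shared
    return shared / union if union > 0 else 0.0


boxes = {"big": (0, 0, 10, 10), "small": (6, 6, 8, 8), "far": (50, 50, 60, 60)}
query = (5, 5, 15, 15)
ranked = sorted(boxes, key=lambda name: overlap_ratio(boxes[name], query), reverse=True)
```

Sorting by `overlap_ratio` puts "big" first and "far" last for this query, which is the kind of ordering the slide calls area overlap sorting.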
  13. Strategy: JtsGeoStrategy
      • Stores any JTS geometry in Lucene 4's DocValues
        • Stores WKB (the binary counterpart of WKT)
        • Full vector geometry is retained for search
        • DocValues is mostly a better FieldCache
          • Faster loading into memory
          • Can be disk resident or memory
      • Characteristics:
        • Indexes any shape
        • Single-valued field, but can be MultiPoint, MultiPolygon, etc.
        • Query by any shape
          • Uses DocValues (memory use optional)
        • Supports all relations: intersect, within, contains, …
        • No sorting
        • Experimental / immature status
  14. Solr Adapters
      • Configuration:
          <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
                     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
                     distErrPct="0.025" maxDistErr="0.000009" />
          <field name="geo" type="geo" indexed="true" stored="true" multiValued="true" />
      • Adding data:
          <field name="geo">43.17614,-90.57341</field>
          <field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
      • Search filter:
          fq=geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
      • Distance sort:
          sort=query($sortsq) asc&sortsq={! score=distance v=$sq}&sq=store:"Intersects(Circle(54.729696,-98.525391 d=10))"
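The filter and sort parameters above contain quotes, spaces, braces and dollar signs, so they need URL encoding before being sent to Solr. The following sketch builds them with Python's standard library only. The helper names and the `q=*:*` main query are assumptions added for illustration, and the sketch uses the `geo` field for both the filter and the sort query, whereas the slide's sort example refers to a field named `store`.

```python
from urllib.parse import urlencode, parse_qs


def circle_filter(field, lat, lon, d):
    """Build the Intersects(Circle(...)) clause in the syntax shown on the slide."""
    return '%s:"Intersects(Circle(%s,%s d=%s))"' % (field, lat, lon, d)


def build_params(field, lat, lon, d):
    """Request parameters for a circle filter plus distance sort."""
    spatial = circle_filter(field, lat, lon, d)
    return {
        "q": "*:*",  # assumed match-all main query; not on the slide
        "fq": spatial,
        "sort": "query($sortsq) asc",
        "sortsq": "{! score=distance v=$sq}",
        "sq": spatial,
    }


params = build_params("geo", 54.729696, -98.525391, 10)
query_string = urlencode(params)  # percent-encodes quotes, braces, '$' and spaces
```

The resulting `query_string` can be appended to a select handler URL; which host, core and handler to use depends on the deployment and is not specified here.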
  15. Future Possibilities
      • Solr:
        • Filter out points in multi-valued field from search results not matching filter
        • Heatmap/grid faceting spatial summarization
      • Spatial-Temporal search
        • 3d (x,y,t) point shapes, and "track" shape queries
      • Support any query shape for all Strategies
      • PrefixTreeStrategy:
        • More efficient binary grid encoding; use Hilbert Curve order
        • Better multi-value point caches
        • Cache-less sort of top-N results
        • More query relations: Contains, Within
      • Configurable DocValues vs. FieldCache choice
      • Choose floats or configurable bits instead of forcing doubles
      • CircleStrategy
  16. Thank you!
      • References
        • Lucene 4 spatial javadocs
        • Spatial4j at GitHub
        • Solr
      • Contact me:
        • David Smiley