Lucene/Solr 4 Spatial: Extended Deep Dive


Presented by David Smiley, Software Systems Engineer, Lead, MITRE

Lucene’s former spatial contrib is gone, and in its place is an entirely new spatial module developed by several well-known names in the Lucene/Solr spatial community. The heart of this module is an approach in which spatial geometries are indexed using edge-ngram tokenized geohashes and searched with a recursive prefix-tree/trie algorithm. It sounds cool and it is! In this presentation, you’ll see how it works, why it’s fast, and what new things you can do with it. Key features are support for multi-valued fields, indexing shapes with area (even polygons), and support for spatial predicates such as “Within”. You’ll see a live demonstration and a visual representation of geohash-indexed shapes. Finally, the session will conclude with a look at the future direction of the module.

Published in: Education, Technology

  1. LUCENE / SOLR 4 SPATIAL DEEP DIVE
     David Smiley, Software Systems Engineer, Lead
  2. LUCENE / SOLR 4 SPATIAL DEEP-DIVE
     2013 Lucene Revolution
     Presented by David Smiley, MITRE
     © 2013 The MITRE Corporation. All rights reserved.
  3. About David Smiley
     • Working at MITRE for 13 years
     • Web development, Java, search
       • 3 Solr apps, 1 Endeca
     • Published 1st book on Solr; then 2nd edition (2009, 2011)
     • Apache Lucene / Solr committer/PMC member (2012)
       • Specializing in spatial
     • Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
     • Taught Solr classes at MITRE (2010, 2011, 2012)
     • Solr search consultant within MITRE and its sponsors, and privately
  4. Agenda
     • Background, overview
     • Spatial4j
     • Lucene spatial
       • PrefixTree / Trie / Grid
     • Solr spatial
     • Demo
     • Interesting use-cases
  6. What is Spatial Search?
     Popular features:
     • Spatial filter query
     • Spatial distance sorting
     • Spatial distance relevancy (i.e. spatial query score)
     NOT “geocoding”, which resolves “Boston” to its latitude and longitude
     Typical use-case:
     1. Index a location for each Lucene document given a latitude & longitude
     2. Then search for matching documents by a circle (point-radius) or bounding box
     3. Then sort results by distance
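The typical use-case above can be sketched without any index at all. This is only a brute-force illustration in plain Java, not how Lucene does it: the `Doc` class, the `search` method, and the rounded Earth radius are assumptions made for the example, and distance is computed with the haversine formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PointRadiusSearch {
    // Rounded mean Earth radius in km; an assumption for this sketch.
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Hypothetical "document": an id plus a single lat/lon point. */
    static final class Doc {
        final String id;
        final double lat, lon;
        Doc(String id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    /** Great-circle distance in km between two lat/lon points (haversine). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Step 2: keep docs inside the circle. Step 3: sort nearest first. */
    static List<Doc> search(List<Doc> docs, double lat, double lon, double radiusKm) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (haversineKm(lat, lon, d.lat, d.lon) <= radiusKm) hits.add(d);
        }
        hits.sort(Comparator.comparingDouble((Doc d) -> haversineKm(lat, lon, d.lat, d.lon)));
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("boston", 42.36, -71.06),
                new Doc("cambridge", 42.37, -71.11),
                new Doc("new-york", 40.71, -74.01));
        for (Doc d : search(docs, 42.35, -71.08, 10)) System.out.println(d.id);
    }
}
```

A linear scan like this touches every document on every query; the strategies described later exist to avoid exactly that.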
  7. History of Spatial for Lucene & Solr
     • 2007: Local-Lucene
       • by Patrick O’Leary (AOL)
     • 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0
       • Local-Lucene graduates to an official Lucene contrib module
     • 2009-12: Spatial Search Plugin (SSP) for Solr
       • by Chris Male (JTeam -> Orange11, ElasticSearch)
     • 2010-10: SOLR-2155, a geohash prefix tree filter
       • by David Smiley (MITRE)
     • 2011-01: Lucene Spatial Playground (LSP)
       • by Ryan McKinley (Voyager GIS), David, and Chris
     • 2011-03: Solr 3.1 new spatial features
       • by Grant Ingersoll and Yonik Seeley (LucidWorks)
     • 2012-03: LSP -> Lucene 4 spatial module + Spatial4j + SSP
       • Replaces former Lucene spatial contrib module
  8. Lucene Spatial Committers
     • David Smiley
       • Works for MITRE
       • Boston area
     • Ryan McKinley
       • Works for Voyager GIS
       • Silicon Valley
     • Chris Male
       • Formerly at ElasticSearch
       • New Zealand
  9. Spatial decomposed
     • Spatial4j
       • Shapes, WKT, distance calculations, JTS adapter
     • Lucene spatial
       • Strategies: PrefixTree (TermQuery & Recursive impl.), BBox, PointVector
     • Solr adapters
     • Misc: Spatial Solr Sandbox
       • LSE
       • JtsGeoStrategy
       • Spatial-Demo (web app)
  10. Lines of Code for Spatial Components
     • Spatial4j: 43%
     • Lucene spatial: 35%
     • Solr adapters: 6%
     • Misc: 16%
     Total: 4,781 Non-Comment Source Statements (without javadocs or tests) as of 2012-09
  11. CarrotSearch Labs’ RandomizedTesting
     • Provides plumbing for repeatable randomized JUnit tests
     • All the spatial test code uses it extensively
     Randomized testing more generally is a certain philosophy / approach on how to test
     • A typical hard-coded test will only catch some regressions
     • A randomized test will catch just about anything eventually, especially nasty edge cases
     • Although it’s hard to read / write / maintain these tests
     • Randomized testing helped find bugs related to…
       • Computing the bounding box of a circle
       • Computing the relationship of a circle to a rectangle that has all 4 of its corners inside it
  12. SPATIAL4J
     It’s all about the shapes
  13. Spatial4j: It’s all about the shapes
     • Shapes
       • A “Shape” abstraction with multiple implementations
       • Geodetic (sphere) & Cartesian/2D implementations
       • Computes intersection relationship with other shapes
     • Also…
       • Distance and area math utilities, Geohash utilities
       • Parsing Well Known Text (WKT) formatted shapes
     • ASL licensed project independent of Apache, on GitHub
     • Requires JTS (LGPL licensed) for polygons & WKT*
       • JTS is “JTS Topology Suite”
       • * WKT parsing soon to be implemented directly by Spatial4j
     • Ported to .NET as Spatial4n and used by RavenDB
       • by Itamar Syn-Hershko
  14. The case for Spatial4j’s existence
     • Just for shapes? How much code could there be?
       • You’d be surprised. Determining the relationship between a lat-lon rectangle and a geodetic circle (Within, Contains, Intersects, Disjoint) is non-trivial, and that’s just one shape.
       • Lots of non-trivial test code goes with it.
     • Why isn’t it a part of Lucene spatial?
       • Parts of Spatial4j depend on JTS, an LGPL licensed library. The Lucene PMC voted not to introduce this compile-time dependency.
       • Spatial4j is independently useful.
     • Is this duplication of other open-source that could be used?
       • Spatial4j needs to be ASL licensed to be a dependency of Lucene.
       • Still… I haven’t found existing code that does what Spatial4j does.
     • Can’t only the JTS dependent parts be external to Lucene?
  15. The Shape interface
     (may become an abstract class in the next version)
       interface Shape {
         Point getCenter();
         Rectangle getBoundingBox();
         boolean hasArea();
         double getArea();
         SpatialRelation relate(Shape other);  // must support Point & Rectangle
       }
     • enum SpatialRelation
       • DISJOINT, INTERSECTS, WITHIN, CONTAINS
       • Note: simpler set than the “DE-9IM” spatial standard
       • No “equals” or “touches”
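To make the four relations concrete, here is a minimal sketch for the easiest case: two axis-aligned Cartesian rectangles. The `Rect` and `Relation` types are stand-ins written for this example, not Spatial4j's classes; the constructor argument order (minX, maxX, minY, maxY) is chosen to mirror `ctx.makeRectangle(...)` as used in the deck's code sample. Geodetic shapes and dateline wrap, which are where the real difficulty lies, are deliberately ignored.

```java
class RectRelate {
    /** Same four values as the slide's SpatialRelation; a local stand-in. */
    enum Relation { DISJOINT, INTERSECTS, WITHIN, CONTAINS }

    /** Axis-aligned rectangle on a flat plane (no dateline wrap). */
    static final class Rect {
        final double minX, maxX, minY, maxY;

        Rect(double minX, double maxX, double minY, double maxY) {
            this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
        }

        /** Reads as "this is RELATION other", e.g. WITHIN means this lies inside other. */
        Relation relate(Rect o) {
            // No overlap on at least one axis: disjoint.
            if (maxX < o.minX || minX > o.maxX || maxY < o.minY || minY > o.maxY) {
                return Relation.DISJOINT;
            }
            // Every edge of this rectangle lies inside the other one.
            if (minX >= o.minX && maxX <= o.maxX && minY >= o.minY && maxY <= o.maxY) {
                return Relation.WITHIN; // equal rectangles also land here: there is no "equals"
            }
            // The mirror case: the other rectangle lies inside this one.
            if (o.minX >= minX && o.maxX <= maxX && o.minY >= minY && o.maxY <= maxY) {
                return Relation.CONTAINS;
            }
            return Relation.INTERSECTS;
        }
    }

    public static void main(String[] args) {
        Rect big = new Rect(-10, 10, -10, 10);
        Rect small = new Rect(-1, 1, -1, 1);
        System.out.println(small.relate(big)); // small is inside big
        System.out.println(big.relate(small)); // big surrounds small
    }
}
```

In this sketch rectangles that merely share an edge come out as INTERSECTS, which is consistent with the slide's note that there is no separate “touches” relation.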
  16. Spatial4j shapes

     Shape                          | Cartesian | Cartesian with dateline wrap | Geodetic
     Point                          | Y         | Y                            | Y
     Line & LineString (w/ buffer)  | Y         | N                            | N
     Rectangle                      | Y         | Y                            | Y
     Circle                         | Y         | N                            | Y
     ShapeCollection                | Y         | Y                            | Y
     JTS Geometry (incl. polygons)  | Y         | Y                            | N

     • Cartesian (AKA Euclidean): a flat plane
     • Dateline wrap assumes the plane circles back on itself
     • Geodetic: a spherical mathematical model
  17. Well Known Text (WKT)
     (see Wikipedia)
     • A popular standard for representing shapes as strings
     • Requires JTS’s WKT parser, but Spatial4j has its own in progress
     • Extensions are TBD for Rectangles and Circles
     • Limited support for EMPTY and “Z” and “M” dimensions (future)
     • Some examples:
       • POINT (3 -2)
       • LINESTRING(30 10, 10 30, …
       • POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))
       • MULTIPOLYGON (((…
       • …
     • Deprecated (may move to Solr):
       • -90, -180
       • -180 -90 180 90
       • CIRCLE(4.56,1.23 d=0.071)
     • TBD / Pending:
       • ENVELOPE(-180,180,90,-90)
       • BOX2D(-180 -90, 180 90)
  18. Spatial4j code sample
       SpatialContext ctx = SpatialContext.GEO;
       Rectangle r = ctx.makeRectangle(-71, -70, 42, 43);  // minX, maxX, minY, maxY
       Circle c = ctx.makeCircle(-72, 42, 1);              // x, y, radius in degrees
       SpatialRelation rel = r.relate(c);
       System.out.println(rel);
       rel.intersects();  // boolean

       ctx = JtsSpatialContext.GEO;  // the JTS-backed context is needed for polygons
       Shape s = ctx.readShape("POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))");
       double distanceDegrees = ctx.getDistCalc().distance(
           ctx.makePoint(2, 2), ctx.makePoint(3, 3));
     Note: distances (including circle radius) are in “Degrees”, not radians or KM
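Because distances are expressed in degrees, a conversion to and from kilometers is usually needed at the edges of an application. The sketch below shows the arithmetic only (degrees of arc along a great circle times the sphere's radius); the class and method names are made up for this example, and the radius constant is a commonly cited mean Earth radius, assumed here rather than taken from the slides.

```java
class DegreesKm {
    // Commonly cited mean Earth radius in km; an assumption for this sketch.
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    /** Length in km of an arc spanning the given number of degrees on the sphere. */
    static double degreesToKm(double degrees) {
        return Math.toRadians(degrees) * EARTH_MEAN_RADIUS_KM;
    }

    /** Inverse: how many degrees of arc correspond to a distance in km. */
    static double kmToDegrees(double km) {
        return Math.toDegrees(km / EARTH_MEAN_RADIUS_KM);
    }

    public static void main(String[] args) {
        System.out.println(degreesToKm(1.0));      // km per degree of arc
        System.out.println(kmToDegrees(200.0));    // a 200 km radius expressed in degrees
        System.out.println(degreesToKm(0.000009)); // a very small error bound, in km
    }
}
```

One degree works out to a little over 111 km, and 0.000009 degrees to about one meter, which is a handy way to sanity-check precision settings given in degrees.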
  19. Spatial4j Future
     • Built-in WKT support (no JTS dependency)
     • Extensible to user-defined shapes
     • API improvements
       • Shape argument validation via WKT but not via ctx.makeShape(…)
       • ShapeCollection visitor design pattern
       • Refactor to remove need for isGeo()
     • LineString dateline & geodetic support
     • Projection / Datum support
  20. LUCENE SPATIAL
     Spatial index information retrieval
  21. Lucene 4 Spatial Module
     • There isn’t one best way to implement spatial indexing for all use-cases
       • Index just points, or other shapes too? Which?
       • Multiple shapes per field?
       • Query by Intersection? Contains? Within? Equals? Disjoint? …
       • Distance sorting? Query boost by distance?
       • Or more exotic shape relevancy like overlap percentage?
       • Trade off shape precision for speed?
     • Multiple SpatialStrategy implementations:
       • RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy
       • PointVectorStrategy
       • BBoxStrategy (currently in trunk, not 4x)
       • JtsGeoStrategy (in Spatial Solr Sandbox)
  22. Strategy: PointVector
     • Similar to Solr’s PointType / LatLonType
       • X & Y trie double fields; caching via FieldCache
     • Characteristics
       • Indexes points (only)
         • Single-valued field (no multi)
       • Query by rectangle or circle (only)
         • Circle uses FieldCache (requires memory)
         • Circle does bbox pre-filter for performance
       • Relations: Intersects, Within (only)
       • Exact precision for x & y coordinates and query shape
       • Distance sort
         • Uses FieldCache (requires memory)
  23. Strategy: BBox
     • Implemented with 4 doubles & 1 boolean
       • Ported from ESRI GeoPortal (Open Source)
     • Characteristics:
       • Indexes rectangles (only)
         • Single-valued field (no multi)
       • Query by rectangle (only)
       • Supports all relations: Intersects, Within, Contains, …
       • Distance sort from box center
         • Uses FieldCache (requires memory)
       • Area overlap sorting
         • Sort results by percentage overlap between query and indexed boxes
         • Uses FieldCache (requires memory)
     • Note: FieldCache needs are somewhat high
  24. Strategy: JtsGeoStrategy
     • Stores a JTS geometry in Lucene 4’s DocValues
       • Stores WKB (WKT in binary format)
       • Full vector geometry is retained for search
     • DocValues is mostly a better FieldCache
       • Faster loading into memory
       • Can be disk resident or memory
       • Multi-valued
     • Characteristics:
       • Indexes any shape, including Multi… varieties
       • Query by any shape
       • Uses DocValues (memory use optional)
       • Supports all relations: intersect, within, contains, …
         • Could easily also support JTS’s exotic DE-9IM based relations
       • Exact precision to the vector geometry
       • No sorting
       • Experimental / immature status
     More of a proof-of-concept for now
  25. PREFIXTREE STRATEGY
     Spatial grid indexing
  26. Strategy: RecursivePrefixTree
     • Grid / Tile / Trie / Prefix-Tree based
       • With recursive descent algorithms
       • Or TermQueryPrefixTree alternative
       • Choose Geohash (geo only) or Quad tree
     • The most mature strategy to date
       • Highly tested
     • The current evolution of SOLR-2155
  27. Strategy: RecursivePrefixTree
     • Characteristics:
       • Indexes all shapes
         • Variable precision of shape edges
         • Highly precise shapes other than Point won’t scale
         • LineString possibly not precise enough for your needs
       • Multi-valued field support
       • Query by any shape
         • Variable precision for query shape
         • Highest precision usually scales
       • All relations: Intersects, Within, Contains, Disjoint
       • Distance sort (w/ multi-value support)
         • Warning: immature, won’t scale
         • Uses significant amounts of memory
       • Fast scalable spatial filtering; no caches needed
     Callouts: “new in Lucene 4.3”; “How many search / NoSQL systems have these capabilities?”
  28. Geohashes
     • What is a Geohash?
       • A lat/lon geocode system
       • Has a hierarchical spatial structure
       • Gradual precision degradation
       • In the public domain
     • Example: (Boston) DRT2Y
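The hierarchical structure is easiest to see in an encoder. The following is a minimal sketch of the public-domain geohash scheme (interleave longitude and latitude bisection bits, starting with longitude, and emit one base-32 character per 5 bits); the class name is made up, and it is not the geohash utility shipped with Spatial4j. It emits lowercase characters, whereas the slides show geohashes in uppercase.

```java
class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a lat/lon point as a geohash with the given number of characters. */
    static String encode(double lat, double lon, int length) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder sb = new StringBuilder(length);
        boolean lonBit = true; // bits alternate, longitude first
        int bits = 0, value = 0;
        while (sb.length() < length) {
            if (lonBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value = value << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value = value << 1;       latMax = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // 5 bits make one base-32 character
                sb.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A point in Boston's Back Bay, encoded at increasing precision.
        for (int len = 1; len <= 5; len++) {
            System.out.println(encode(42.35, -71.08, len));
        }
    }
}
```

Each extra character only subdivides the previous cell, so a shorter geohash is always a prefix of the longer one for the same point. That prefix property is what lets a grid cell be treated as an indexed term and searched by prefix.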
  29. Demo
  30. Zooming In: D
  31. Zooming In: DR
  32. Zooming In: DRT
  33. Zooming In: DRT2
  34. Zooming In: DRT2Y
  35. Geohash Grids
     [Figure: internal coordinates of an odd-length geohash (DRT2Y) and of an even-length geohash (DRT2)]
  36. Demo
     • Spatial Solr Playground
       • Demo KML grid generation from geometries
     • A sample point with quad tree indexes to these tokens:
       • A, AD, ADB, ADBA
     • A sample circle with quad tree indexes to these tokens:
       • A, AB, ABA, ABAB+, ABAC+, ABAD+, ABB, ABBA+, ABBB+, ABBC+, ABBD+, ABC, ABCA+, ABCB+, ABCC+, ABCD+, ABD+, AD, ADA, ADAA+, ADAB+, ADAC+, ADAD+, ADB+, ADC, ADCA+, ADCB+, ADCD+, ADD, ADDA+, ADDB+, ADDC+, ADDD+, B, BA, BAA, BAAC+, BAAD+, BAC, BACA+, BACB+, BACC+, BACD+, BC, BCA, BCAA+, BCAB+, BCAC+, BCC, BCCA+, BCCC+, C, CB, CBB, CBBA+
     • Tokens with a ‘+’ are actually indexed with and without the ‘+’
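A token list like the one above can be produced by a recursive descent over the grid. The sketch below does this for a rectangular shape only, which keeps the cell-versus-shape test trivial; it is an illustration of the idea, not Lucene's implementation. The `Rect` type, the A/B/C/D quadrant layout (A top-left, B top-right, C bottom-left, D bottom-right), and the choice to emit only the ‘+’ form of a leaf are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

class QuadTokens {
    /** Axis-aligned rectangle given as minX, minY, maxX, maxY. */
    static final class Rect {
        final double minX, minY, maxX, maxY;
        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean contains(Rect o) {
            return o.minX >= minX && o.maxX <= maxX && o.minY >= minY && o.maxY <= maxY;
        }
        /** True when interiors overlap; rectangles that only share an edge do not count. */
        boolean overlaps(Rect o) {
            return o.minX < maxX && o.maxX > minX && o.minY < maxY && o.maxY > minY;
        }
    }

    /** Grid tokens covering the shape, down to maxLevel characters. */
    static List<String> tokens(Rect shape, Rect world, int maxLevel) {
        List<String> out = new ArrayList<>();
        descend(shape, world, "", maxLevel, out);
        return out;
    }

    private static void descend(Rect shape, Rect cell, String token, int maxLevel, List<String> out) {
        double cx = (cell.minX + cell.maxX) / 2, cy = (cell.minY + cell.maxY) / 2;
        Rect[] kids = {
            new Rect(cell.minX, cy, cx, cell.maxY), // A: top-left (assumed layout)
            new Rect(cx, cy, cell.maxX, cell.maxY), // B: top-right
            new Rect(cell.minX, cell.minY, cx, cy), // C: bottom-left
            new Rect(cx, cell.minY, cell.maxX, cy)  // D: bottom-right
        };
        for (int i = 0; i < kids.length; i++) {
            Rect kid = kids[i];
            if (!shape.overlaps(kid)) continue;      // disjoint cell: prune this branch
            String t = token + (char) ('A' + i);
            if (shape.contains(kid) || t.length() == maxLevel) {
                out.add(t + "+");                    // leaf: fully covered, or precision limit hit
            } else {
                out.add(t);                          // partially covered: index it and recurse
                descend(shape, kid, t, maxLevel, out);
            }
        }
    }

    public static void main(String[] args) {
        Rect world = new Rect(-180, -90, 180, 90);
        System.out.println(tokens(new Rect(-180, 0, -90, 90), world, 3));
    }
}
```

Cells wholly inside the shape stop the recursion early, which is why a large shape does not necessarily need many tokens; it is the shape's edges, refined down to the maximum level, that dominate the token count.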
  37. PrefixTreeStrategy Architecture
     • Shape: calc rect relationship
     • SpatialPrefixTree & Cell: byte string to/from Cell (rect)
     • PrefixTreeStrategy: index & search algorithms
     • Lucene: TermsEnum
     • Filters: IntersectsPrefixTreeFilter, ContainsPrefixTreeFilter, WithinPrefixTreeFilter
  38. Lucene Spatial example code
       ctx = SpatialContext.GEO;
       strategy = new RecursivePrefixTreeStrategy(
           new GeohashPrefixTree(ctx, 11), "myGeoField");  // 11 = max geohash levels
       … // make indexWriter and a Document
       for (Field f : strategy.createIndexableFields(shape))
         doc.add(f);
       indexWriter.addDocument(doc);
       …
       filter = strategy.makeFilter(new SpatialArgs(
           SpatialOperation.Intersects,
           ctx.makeCircle(-80.0, 33.0,
               // 200 km radius converted to degrees
               DistanceUtils.dist2Degrees(200, DistanceUtils.EARTH_MEAN_RADIUS_KM))));
       … , filter, 10);
     See the Lucene spatial tests for more
  39. Future
     • Possible de-emphasis of SpatialStrategy abstraction
     • Better options for distance sorting of PrefixTree strategies
     • Better PrefixTree encoding than both geohash & quad tree
       • Google Summer of Code 2013 -- TBD
     • Performance improvements to the spatial Intersects RecursivePrefixTree filter
     • Remove the need to double-index leaf-nodes (with and without ‘+’)
     • Exact geometry search by blending benefits of PrefixTree and JtsGeoStrategy
     • A single-dimensional PrefixTree (for numeric range index)
  40. SOLR SPATIAL
     Adapters to Lucene 4 spatial
  41. Solr 3 Spatial: LatLonType & friends
     • Solr 3 was Solr’s first release to include spatial support
       • Not based on Lucene’s old spatial contrib module
       • Similar to TwoDoublesStrategy but more optimized
       • Single-valued only, fast distance sorting, can choose floats (save memory)
     • Fields:
       • LatLonType (Geodetic)
       • PointType (Cartesian)
     • Query parsers (spatial filters):
       • {!geofilt} (circle) “p” and “sfield” and “d” params
       • {!bbox} (bounding box of a circle)
     • Distance function:
       • geodist() and some esoteric others
     Callout: NOT completely superseded by Solr 4 spatial fields
  42. Solr 4 Spatial
       <fieldType name="location_rpt"
           class="solr.SpatialRecursivePrefixTreeFieldType"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           distErrPct="0.025"
           maxDistErr="0.000009"
           units="degrees" />
     • spatialContextFactory: if you don’t need JTS (polygons), don’t set this
     • distErrPct: non-point shapes are approximated to the grid, up to 2.5% of radius
     • maxDistErr: max precision (1m) as measured in degrees
  43. Indexing
     • Point: Latitude, Longitude (i.e. Y, X)
         <field name="geo">43.17614, -90.57341</field>
     • Point: X Y
         <field name="geo">-90.57341 43.17614</field>
     • Rect: minX minY maxX maxY
         <field name="geo">-74.093 41.042 -69.347 44.558</field>
     • Circle: point then d=radius (in degrees); will be deprecated
         <field name="geo">Circle(4.56,1.23 d=0.0710)</field>
     • WKT (preferred; it’s a standard)
         <field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
  44. Filter (search)
     • Using Solr 3’s bbox or geofilt query parsers
       • Distance radius ‘d’ is interpreted as kilometers, just like LatLonType
       • Limited to bbox and bbox of a circle
         fq={!geofilt}&sfield=geo&pt=45.15,-93.85&d=5
     • Range query style (bounding box)
       • Handles dateline wrap
         fq=geo:[-90,-180 TO 90,180]
     • Field query style
       • Unique to Lucene 4 spatial; see SpatialArgsParser
         fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
       • Predicates: Intersects, IsDisjointTo, IsWithin, Contains, …
       • distErrPct (& distErr) optional; override field type’s default
     Callout: SOLR-4242: A better spatial query parser
  45. Distance Sort & Relevancy Boost
     • geodist() is for Solr 3 LatLonType only
         sort=geodist(lltField,45.15,-93.85) desc
     • Solr 4 spatial queries can return the distance as the score
         q={!geofilt sfield=geo pt=45.15,-93.85 d=5 score=distance}&sort=score asc&fl=*,score
     • Without a filter
         sort=query($sortsq) asc&sortsq={!geofilt filter=false score=distance sfield=geo pt=45.15,-93.85 d=0}
     • Relevancy boost
         defType=edismax&boost=query($mysq)&mysq={!geofilt filter=false score=recipDistance pt=45.15,-98.85 d=5}
  46. Distance Faceting
     • sfield=geo (the field)
     • pt=45.15,-93.85 (point of reference)
     • Within 10km
         facet.query={!geofilt d=10}
     • Within 50km
         facet.query={!geofilt d=50}
     • Within 100km
         facet.query={!geofilt d=100}
  47. Future
     • A more Solr-friendly spatial query parser: SOLR-4242
     • Retrofit geodist() to support the SpatialStrategies?
     • Expose more tunables
     • A grid-based heat-map faceting component
     • Idea: a multi-strategy spatial field encompassing
       • A PrefixTree field for points
       • A PrefixTree field for non-points
       • A TwoDoubles field for good distance sorting / relevancy
       • Knows whether it’s single vs. multi-valued
     • A FieldType for multi-value numeric ranges
  48. DEMO
  50. Heatmap / Grid faceting
     1. Geohash each point to multiple lengths and index each length into its own field
        • geohash_1:D, geohash_2:DR, geohash_3:DRT, geohash_4:DRT2
     2. Search with a rectangle (bbox) filter, and…
     3. Facet on the geohash field with the desired resolution
        • facet.field=geohash_4&facet.limit=10000
     • Lots of tuning / customization options
       • Projected / quad tree
       • facet.prefix may help
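The counting behind the three steps above can be sketched in a few lines. This is a plain in-memory illustration of what faceting on a `geohash_N` field yields, not Solr's faceting code: the `facet` method is hypothetical, the input is assumed to be one full-precision geohash per point, and its `prefix` argument only imitates the effect of `facet.prefix`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GeohashFacets {
    /**
     * Counts points per grid cell at the given geohash length, keeping only
     * cells whose geohash starts with the given prefix ("" keeps everything).
     */
    static Map<String, Integer> facet(List<String> geohashes, int length, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String g : geohashes) {
            if (g.length() < length) continue;    // not precise enough for this resolution
            String cell = g.substring(0, length); // the value the geohash_<length> field would hold
            if (cell.startsWith(prefix)) counts.merge(cell, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> points = List.of("drt2y", "drt2z", "drt2y", "dr5ru");
        System.out.println(facet(points, 4, ""));    // coarse grid over everything
        System.out.println(facet(points, 5, "drt")); // finer grid, restricted to one region
    }
}
```

Each map entry corresponds to one heatmap cell: the key identifies the cell's rectangle and the count drives its color or size.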
  51. Plotting many points on a map
     • Why not ask Solr for rows=1000?
       • It’s slow
       • With a variable number of points per doc, it could yield 1 distinct point or 1M
     • Instead, facet on a geohash with facet.limit=1000
       • Fast
       • Guaranteed <= 1000 points
       • But might need lots of memory
     • Or result-grouping on a geohash
     Callout: But do you really want to plot 1000+ points on a map?
  52. Filter by indexed distance constraints
     • Imagine a dating site where both potential parties have a maximum distance they’re willing to travel
     • Q: For the current user, who is not “too far” for you but is also not “too far” for them?
     • A: Index each user’s location as a point in one field and as a circle in another. Query by the current user’s circle to the indexed point field as well as the current user’s point to the indexed circle field.
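The two-sided condition in the answer above amounts to two circle-contains-point tests. Here is a brute-force sketch of that logic with no index involved; the `User` class, the `mutualMatch` method, and the rounded Earth radius are assumptions of this example, and each user's circle is simply their location plus their maximum travel distance.

```java
class MutualDistance {
    // Rounded mean Earth radius in km; an assumption for this sketch.
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Hypothetical user: a location plus the farthest they are willing to travel. */
    static final class User {
        final double lat, lon, maxKm;
        User(double lat, double lon, double maxKm) { this.lat = lat; this.lon = lon; this.maxKm = maxKm; }
    }

    /** Great-circle distance in km between two users (haversine). */
    static double distanceKm(User a, User b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
    }

    /**
     * True when each user's point falls inside the other user's circle:
     * "my circle intersects their point" AND "my point intersects their circle".
     */
    static boolean mutualMatch(User me, User other) {
        double d = distanceKm(me, other);
        boolean notTooFarForMe = d <= me.maxKm;      // query circle vs. indexed point
        boolean notTooFarForThem = d <= other.maxKm; // query point vs. indexed circle
        return notTooFarForMe && notTooFarForThem;
    }

    public static void main(String[] args) {
        User me = new User(42.36, -71.06, 5);
        System.out.println(mutualMatch(me, new User(42.37, -71.11, 3)));
        System.out.println(mutualMatch(me, new User(42.37, -71.11, 10)));
    }
}
```

In a spatial index the same two tests become two filters on two fields, so neither side's distance limit has to be known at indexing time by the other.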
  53. Multi-valued durations
     • What if your documents needed a variable number of time (or other numerical value) durations?
     • This approach won’t work:
         <field name="start" type="tdate" multiValued="true"/>
         <field name="end" type="tdate" multiValued="true"/>
     • Solr (without Solr 4 spatial fields) can’t do it!
     • You need to think differently to solve this…
     • Example use-cases
       • Searching for hotel-room vacancies
       • Searching for movie show-times
       • (next slides) Each document is a person with a variable number of “shifts” that they are working…
  54. … model durations as points
  55. … queries become rectangles
  56. … some config & search details
     • Configuration
         <fieldType name="days_of_year"
             class="solr.SpatialRecursivePrefixTreeFieldType"
             geo="false" units="degrees"
             worldBounds="0 0 365 365"
             distErrPct="0" maxDistErr="1"/>
     • Sample search: find shifts that have any overlap with the 19th day to the 23rd
         daysOfYear:Intersects(0 18.5 23.5 365)
     • Caveat: won’t scale to the full precision of a Java long (timestamp)
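The sample search follows from a small piece of reasoning: if a shift is indexed as the point (x = start, y = end), then it overlaps a query interval exactly when start <= queryEnd and end >= queryStart, and that condition describes a rectangle. The sketch below spells this out; the class and method names are hypothetical, the rectangle uses the slide's minX minY maxX maxY order, and the world bounds 0 to 365 are taken from the configuration above.

```java
class ShiftSearch {
    static final double WORLD_MIN = 0, WORLD_MAX = 365; // from worldBounds="0 0 365 365"

    /**
     * Rectangle (minX, minY, maxX, maxY) matching every duration point that
     * overlaps [queryStart, queryEnd]: start <= queryEnd and end >= queryStart.
     */
    static double[] overlapQuery(double queryStart, double queryEnd) {
        return new double[] { WORLD_MIN, queryStart, queryEnd, WORLD_MAX };
    }

    /** A duration is the point (x = start, y = end); test whether it lies in the rectangle. */
    static boolean matches(double[] rect, double start, double end) {
        return start >= rect[0] && start <= rect[2] && end >= rect[1] && end <= rect[3];
    }

    public static void main(String[] args) {
        // "Any overlap with the 19th day to the 23rd", padded by half a day as on the slide.
        double[] q = overlapQuery(18.5, 23.5);
        System.out.println(q[0] + " " + q[1] + " " + q[2] + " " + q[3]);
        System.out.println(matches(q, 10, 20)); // shift ends inside the query window
        System.out.println(matches(q, 24, 30)); // shift starts after the window closes
    }
}
```

Because each shift is its own point, a multi-valued field keeps every start paired with its end, which is exactly what the two separate multi-valued date fields could not do.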
  57. Thank you!
     • References
       • Lucene 4 spatial javadocs
       • Spatial4j at GitHub
       • Solr
       • Spatial Solr Sandbox
     • Contact me:
       • David Smiley