The Latest in 
Spatial & Temporal Search 
David Smiley
Agenda 
Spatial 
• Polygons and Accuracy: SerializedDVStrategy 
• FlexPrefixTree 
• BBoxSpatialStrategy 
• Student/Intern contributions, Geodesics 
Temporal 
• Dates, and Date Ranges 
• Search 
• Faceting
About David Smiley 
• Freelance search consultant / developer 
• Expert Lucene/Solr development skills, 
advice (consulting), training 
• Java (full-stack), Web, Spatial 
• Apache Lucene / Solr committer & PMC, 
Eclipse Locationtech PMC 
• Authored 1st book on Solr, plus two editions 
• Presented at several conferences & meetups 
• Taught several Solr classes, self-developed & LucidWorks
Lucene Spatial Overview 
• Multiple approaches to indexing spatial data 
• abstract class SpatialStrategy (5+ concrete implementations) 
• RecursivePrefixTreeStrategy (RPT) is the most prominent and versatile 
• Grid based 
• Uses Spatial4j lib for shapes, distance calculations, and WKT 
• Uses JTS Topology Suite lib for polygons 
[Diagram labels: Shape; SpatialPrefixTree / Cell; PrefixTreeStrategy; IntersectsPrefixTreeFilter, Contains…, Within…; Geohash | Quad]
SpatialPrefixTrees and Accuracy 
RecursivePrefixTreeStrategy (RPT) uses Lucene’s index as a PrefixTree 
• Thus represents shapes as grid cells of varying precision by prefix 
Example, a point shape: 
• D, DR, DRT, DRT2, DRT2Y 
Example, a polygon shape: 
• Too many to list… 508 cells 
More details here: 
http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
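To make the point example concrete, here is a minimal sketch of how a point maps to one cell per level, each cell being a prefix of the next. It uses the standard geohash encoding written from scratch; the class and method names are made up for illustration and this is not Lucene's GeohashPrefixTree code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: standard geohash encoding, not Lucene's implementation. */
final class GeohashPrefixes {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Returns the point's geohash at each level 1..maxLevel; each is a prefix of the next. */
    static List<String> cellsForPoint(double lat, double lon, int maxLevel) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder hash = new StringBuilder();
        List<String> cells = new ArrayList<>();
        boolean splitLon = true; // geohash interleaves bits, longitude first
        int bits = 0;
        int bitCount = 0;
        while (hash.length() < maxLevel) {
            if (splitLon) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { bits = (bits << 1) | 1; lonLo = mid; }
                else            { bits = bits << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { bits = (bits << 1) | 1; latLo = mid; }
                else            { bits = bits << 1;       latHi = mid; }
            }
            splitLon = !splitLon;
            if (++bitCount == 5) {          // 5 bits per base-32 character
                hash.append(BASE32.charAt(bits));
                cells.add(hash.toString()); // one cell (indexed term) per level
                bits = 0;
                bitCount = 0;
            }
        }
        return cells;
    }
}
```

Calling `GeohashPrefixes.cellsForPoint(lat, lon, 5)` yields five strings of length 1 to 5, which is the same shape as the D, DR, DRT, DRT2, DRT2Y list on the slide.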
…continued 
• For more accuracy, index more levels (longer prefixes) 
• Points: linear relationship of levels to number of cells 
• Non-points: exponential relationship… 
RPT applies a distErrPct shape size ratio to non-point shapes to 
trade accuracy for scalability 
• distErrPct=0.025 (2.5% of the radius, the default): 
• Massachusetts: level 6 
• USA: level 4 (not as precise)
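The effect of distErrPct can be sketched as follows: the tolerated error is a fraction of the shape's size, and the grid level is the first one whose cells are no larger than that error. This is a simplified model, assuming a quad-style grid whose cell side halves at every level over a 360-degree-wide world; the names are hypothetical and the numbers it produces are not the level 6 / level 4 figures from the slide, which come from Lucene's actual grids.

```java
/** Illustrative sketch of picking a grid level from an error ratio; not Lucene's code. */
final class LevelForError {
    /** Cell width in degrees at a level, assuming each level halves the cell side. */
    static double cellSize(int level) {
        return 360.0 / Math.pow(2, level);
    }

    /** Smallest level whose cells are no larger than the tolerated error. */
    static int levelFor(double shapeRadiusDeg, double distErrPct, int maxLevels) {
        double distErr = shapeRadiusDeg * distErrPct; // tolerated error scales with the shape
        for (int level = 1; level <= maxLevels; level++) {
            if (cellSize(level) <= distErr) {
                return level;
            }
        }
        return maxLevels; // cannot be more precise than the grid allows
    }
}
```

With the same 2.5% ratio, a shape ten times larger resolves to a shallower level, which is the qualitative point of the Massachusetts vs. USA comparison.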
SerializedDVStrategy (Lucene 4.7) 
• Stores serialized geometry into Lucene BinaryDocValues 
• It’s as accurate as the underlying geometry coordinates/shape 
• But it’s not a spatial index – it’s retrievable on a per-document basis 
• Use RPT + SerializedDV for speed and accuracy! 
• More to come eventually: 
• Solr adapter – SOLR-5728, ElasticSearch adapter #2361 
• Speed: Skip the serialized geometry check for non-edge cells – 
LUCENE-5579
Sample Code 
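// Lucene 4.x API; grid (SpatialPrefixTree), ctx (SpatialContext), and point are set up elsewhere 
SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, point); 
SpatialStrategy treeStrategy = new RecursivePrefixTreeStrategy( 
grid, "geometry"); 
SpatialStrategy verifyStrategy = new SerializedDVStrategy( 
ctx, "serialized_geometry"); 
// Fast but approximate: match on the indexed grid cells 
Query treeQuery = new ConstantScoreQuery( 
treeStrategy.makeFilter(args)); 
// Exact: verify each candidate against its serialized geometry, after the grid query 
Query combinedQuery = new FilteredQuery( 
treeQuery, 
verifyStrategy.makeFilter(args), 
FilteredQuery.QUERY_FIRST_FILTER_STRATEGY); 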
Code is from a related presentation by the Climate Corporation presented at FOSS4G 2014
FlexPrefixTree (Coming to Lucene 5) 
• A new SpatialPrefixTree by Varun Shenoy (GSOC 2014) ! 
• LUCENE-4922; Still needs to be committed. Goal is for 5.0. 
• More optimized, more flexible, than Geohash & Quad 
• Configurable sub-cells at each level: 4, 16, 64, 256 
• You choose trade-off between index speed/disk size & search speed 
• Internally uses an integer coordinate system 
• Rectangle searches are particularly fast; minimal floating-point conversion 
• Cells are always squares (equal sides) – better for heatmaps 
• YMMV: 10% - 100% faster than GeohashPrefixTree
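The sub-cell setting can be read as "how many bits of each axis one level resolves". A rough sketch of that arithmetic, assuming square cells so that the bits split evenly between x and y (the class name and the 24-bit target are made up for illustration, and this is not FlexPrefixTree's code):

```java
/** Illustrative arithmetic for sub-cells per level vs. number of levels. */
final class SubCellTradeoff {
    /** Levels needed to resolve bitsPerAxis bits on each axis with the given sub-cells per level. */
    static int levelsNeeded(int subCellsPerLevel, int bitsPerAxis) {
        int totalBits = Integer.numberOfTrailingZeros(subCellsPerLevel); // log2 for powers of two
        if (Integer.bitCount(subCellsPerLevel) != 1 || totalBits == 0 || totalBits % 2 != 0) {
            throw new IllegalArgumentException("sub-cells must be a power of 4, e.g. 4, 16, 64, 256");
        }
        int bitsPerAxisPerLevel = totalBits / 2; // square cells: half the bits go to each axis
        return (bitsPerAxis + bitsPerAxisPerLevel - 1) / bitsPerAxisPerLevel; // ceiling division
    }
}
```

For a 24-bit-per-axis resolution this gives 24, 12, 8, and 6 levels for 4, 16, 64, and 256 sub-cells. Since a point is indexed with roughly one cell per level, fewer levels means fewer terms per point, at the price of a wider fan-out at each level.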
BBoxSpatialStrategy (Lucene 4.10) 
• Rectangles (BBox’s) only, one value per field 
• Wide predicate support 
• Equals, Intersects, Within, Contains, Disjoint 
• Accurate (8-byte double floating point) 
• Area overlap relevancy 
• Weight search results by a combination of query shape overlap & 
index shape overlap ratios 
• Solr BBoxField…
Solr BBoxField 
• Schema configuration 
<field name="bbox" type="bbox" /> 
<fieldType name="bbox" class="solr.BBoxField" 
geo="true" units="degrees" numberType="_bbox_coord" /> 
<fieldType name="_bbox_coord" class="solr.TrieDoubleField" 
precisionStep="8" docValues="true" stored="false"/> 
• Search with overlap ratio ordering 
&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10)) 
• score can be: overlapRatio, area, area2D
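The idea behind area overlap relevancy can be sketched in plain Java: compute how much of the query box and how much of the indexed box the intersection covers, then blend the two ratios. The weighting parameter and all names below are assumptions for illustration; this is not the BBoxSpatialStrategy implementation and it ignores dateline wrap and zero-area boxes.

```java
/** Illustrative overlap-ratio scoring for axis-aligned rectangles. */
final class OverlapRatio {
    /** Assumes minX <= maxX and minY <= maxY. */
    static final class Rect {
        final double minX, minY, maxX, maxY;

        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        double area() {
            return (maxX - minX) * (maxY - minY);
        }
    }

    static double intersectionArea(Rect a, Rect b) {
        double w = Math.min(a.maxX, b.maxX) - Math.max(a.minX, b.minX);
        double h = Math.min(a.maxY, b.maxY) - Math.max(a.minY, b.minY);
        return (w <= 0 || h <= 0) ? 0 : w * h;
    }

    /** Blend of "share of the query covered" and "share of the indexed box covered". */
    static double score(Rect query, Rect indexed, double queryWeight) {
        double inter = intersectionArea(query, indexed);
        if (inter == 0) {
            return 0;
        }
        double queryRatio = inter / query.area();
        double indexedRatio = inter / indexed.area();
        return queryWeight * queryRatio + (1 - queryWeight) * indexedRatio;
    }
}
```

A small indexed box lying fully inside the query scores high on the indexed-box ratio and low on the query ratio, so the weight decides which kind of match ranks first.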
Recent Student/Intern Contributions 
• Varun Shenoy via GSOC: summer 2014 
• Lucene spatial: new “FlexPrefixTree” – an optimized grid 
• Rebecca Alford via F.B. Open-Academy: winter 2014 
• Spatial4j: geodesic polygons 
• Chris Pavlicek via F.B. Open-Academy: winter 2014 
• Spatial4j: geodesic buffered lines 
• Evana Gizzi, MITRE intern: winter 2014 
• Spatial4j: geodesic circle polygonizer 
• Liviy Ambrose, MITRE intern: fall 2013 
• Lucene spatial: integrated with Lucene’s benchmark module
Temporal/Date Durations 
or basically any numeric ranges
Approach: Simple Two-field 
(as you might do in SQL or any system without native range types) 
• A start-time & end-time field pair 
• A search window (time span) becomes two range queries 
• details vary by predicate (Intersects, Contains, vs. Within) 
• Single-valued only 
• …even though Lucene supports multi-valued fields 
• Theoretically possible but would be a lot of work 
• because Lucene doesn’t store “position” info for numeric fields 
• because numeric range/prefix queries are position-less
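How the two range conditions differ per predicate can be written out directly. A minimal sketch, assuming inclusive bounds and epoch-millisecond longs; the names are hypothetical. Each method corresponds to a pair of range clauses on the start and end fields.

```java
/** Illustrative predicates for a start/end field pair against a query window. */
final class TwoFieldRange {
    /** Indexed range overlaps the query window at all. */
    static boolean intersects(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qEnd && docEnd >= qStart;
    }

    /** Indexed range lies entirely within the query window. */
    static boolean within(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart >= qStart && docEnd <= qEnd;
    }

    /** Indexed range contains the whole query window. */
    static boolean contains(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qStart && docEnd >= qEnd;
    }
}
```

Because each predicate ties one start value to one end value, it only works when a document has a single start/end pair, which is the single-valued limitation noted above.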
Approach: 2D Spatial PrefixTree 
• Lucene Spatial QuadPrefixTree 
(2D) with RPT Strategy 
• Use ‘x’ for start-time, ‘y’ for end-time 
• A search window (time span) 
becomes a rectangle query 
• details vary by predicate (Intersects, 
Contains, vs. Within) 
• Cool… 
• But floating-point edge issues 
• Only ~50 levels supported; not 64 
Details: http://wiki.apache.org/solr/SpatialForTimeDurations
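The mapping from a time window to a rectangle can be sketched like this: each indexed duration is the point (x = start, y = end), and each predicate becomes a different box over those points. The world bounds `min` and `max` and all names are assumptions for illustration; the wiki page above has the actual Solr query syntax.

```java
/** Illustrative mapping of time-window predicates to rectangles over (start, end) points. */
final class TimeSpanAsPoint {
    /** Inclusive box over x = start-time, y = end-time. */
    static final class Box {
        final long minX, maxX, minY, maxY;

        Box(long minX, long maxX, long minY, long maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }

        boolean containsPoint(long x, long y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    /** Durations that overlap the window: start <= qEnd and end >= qStart. */
    static Box intersectsBox(long qStart, long qEnd, long min, long max) {
        return new Box(min, qEnd, qStart, max);
    }

    /** Durations inside the window: both start and end fall in [qStart, qEnd]. */
    static Box withinBox(long qStart, long qEnd) {
        return new Box(qStart, qEnd, qStart, qEnd);
    }

    /** Durations covering the window: start <= qStart and end >= qEnd. */
    static Box containsBox(long qStart, long qEnd, long min, long max) {
        return new Box(min, qStart, qEnd, max);
    }
}
```

Since every duration is its own point, a document can hold several durations, which the two-field approach cannot do.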
Approach: DateRangePrefixTree (Lucene 5) 
• A new 1D SpatialPrefixTree: NumberRangePrefixTree 
• NumberRangePrefixTree w/ DateRangePrefixTree subclass 
• NR-SPT: Configurable sub-cells per level; no level limit 
• Not just for ranges; instances too 
• Index/Search with NumberRangePrefixTreeStrategy 
• Indexing, and search predicate code (e.g. Intersects…) completely re-used 
• DateRangePrefixTree 
• 9 Levels: 1M years, 1K years, years, months, days, hours, minutes, 
seconds, millis 
…continued…
Trade-offs of N/D-SPT 
• Indexing: 
• “Common” date-ranges use fewer than ~50 terms, but random millisecond 
ranges use up to ~14K terms 
• All date instances (not a range) use <= 9 terms 
• Comparison to 2D SPT: instance or range, always 50 terms 
• Search: 
• Query for “common” query ranges faster than uncommon 
• Comparison to 2D SPT: 
• Contains & Within predicates: overlapping values per document get 
coalesced, can’t be differentiated
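To see why a date instance needs at most 9 terms, it helps to map a truncated date string to its level in the 9-level hierarchy listed earlier. The sketch below simply counts the fields present in an ISO-8601-style string; it assumes well-formed input and is not how DateRangePrefixTree parses dates.

```java
/** Illustrative mapping from a truncated date string to one of the 9 levels. */
final class DateLevels {
    private static final String[] LEVEL_NAMES = {
        "1M years", "1K years", "years", "months", "days",
        "hours", "minutes", "seconds", "millis"
    };

    /** Level of a string such as "2014", "2014-05-21T12" or "2014-05-21T12:00:00.000Z". */
    static int levelOf(String truncatedIso) {
        String s = truncatedIso.endsWith("Z")
                ? truncatedIso.substring(0, truncatedIso.length() - 1)
                : truncatedIso;
        int fields = 1; // the year is always present
        for (int i = 1; i < s.length(); i++) { // start at 1 to skip a possible leading sign
            char c = s.charAt(i);
            if (c == '-' || c == 'T' || c == ':' || c == '.') {
                fields++;
            }
        }
        return 2 + fields; // levels 1 and 2 (1M years, 1K years) sit above the year
    }

    static String levelName(int level) {
        return LEVEL_NAMES[level - 1];
    }
}
```

An instance is represented by one cell per level down to its own precision, so a value truncated to the hour sits at level 6 and a full millisecond timestamp at level 9.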
Solr DateRangeField 
• Configuration in schema.xml: 
<field name="dateRange" type="dateRange" /> 
<fieldType name="dateRange" class="solr.DateRangeField" /> 
• Index field data, examples: 
• 2014-05-21T12:00:00.000Z (same as TrieDate) 
• 2014-05-21T12 (truncated to desired precision) 
• [1990 TO 1995] 
• Query, examples: 
• fq=dateRange:[* TO 2014-05-21] 
• fq={!field f=dateRange op=Contains} [2000 TO 2014-05-21]
Visualizing Date Facets 
• http://bl.ocks.org/mbostock/4063318
Date Faceting 
• Option A: facet.range 
• Not for indexed date-ranges 
• Internally executes one query for each value & caches large bitset 
• Option B: facet.interval (Solr 4.10) 
• Not for indexed date-ranges 
• Requires DocValues (more index data) 
• Supports variable/custom intervals 
• New work-in-progress option: Facet on DateRangeField 
• Ranges are fixed/pre-determined (months, days, etc.) 
• Optimized for thousands of ranges to count 
• Each value-range is only 1 term!
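Conceptually, faceting on fixed, pre-determined ranges means every value falls into exactly one bucket per granularity, so counting is one increment per value. A minimal plain-Java sketch of month buckets (the class name is hypothetical, and this says nothing about how Solr computes the counts internally):

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative month-level facet counts over a list of dates. */
final class MonthFacets {
    /** Counts dates per calendar month, sorted by month. */
    static Map<YearMonth, Integer> countByMonth(List<LocalDate> dates) {
        Map<YearMonth, Integer> counts = new TreeMap<>();
        for (LocalDate date : dates) {
            counts.merge(YearMonth.from(date), 1, Integer::sum); // one bucket per value
        }
        return counts;
    }
}
```

The resulting month-to-count map is the kind of data a calendar view such as the one linked under Visualizing Date Facets would consume.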
Future stuff I’m excited about 
• Continuing works in-progress 
• Spatial heatmaps! Coming in January 2015! 
• Lucene layer & Solr adapter 
• Lucene term auto-prefixing LUCENE-5879 
• Brings spatial, date, numeric, indexing/search to the next level! 
• More prefix-tree optimizations 
• Inner vs edge leaf cell differentiation for non-point shapes 
• RPT + SerializedDVStrategy; skip accuracy checks for inner cells 
• Don’t index leaf cells twice
That’s all for now; thanks for coming! 
Need Lucene/Solr guidance or custom development? 
Contact me! 
Email: dsmiley@apache.org 
LinkedIn: http://www.linkedin.com/in/davidwsmiley 
G+: +DavidSmiley 
Twitter: @DavidWSmiley 
ETA: December 2014

2014 11 lucene spatial temporal update

  • 1.
  • 2. The Latest in Spatial & Temporal Search David Smiley
  • 3. Agenda Spatial • Polygons and Accuracy: SerializedDVStrategy • FlexPrefixTree • BBoxSpatialStrategy • Student/Intern contributions, Geodesics Temporal • Dates, and Date Ranges • Search • Faceting
  • 4. About David Smiley • Freelance search consultant / developer • Expert Lucene/Solr development skills, advice (consulting), training • Java (full-stack), Web, Spatial • Apache Lucene / Solr committer & PMC, Eclipse Locationtech PMC • Authored 1st book on Solr, plus two editions • Presented at several conferences & meetups • Taught several Solr classes, self-developed & LucidWorks
  • 5. Lucene Spatial Overview • Multiple approaches to index spatial data abstract class SpatialStrategy (5+ concrete implementations) • RecursivePrefixTreeStrategy (RPT) is most prominent, versatile • Grid based Shape SpatialPrefixTree / Cell PrefixTreeStrategy • Uses Spatial4j lib for shapes, distance calculations, and WKT • Uses JTS Topology Suite lib for polygons IntersectsPrefixTreeFilter Contains… Geohash | Quad Within…
  • 6. SpatialPrefixTrees and Accuracy RecursivePrefixTree (RPT) uses Lucene’s index as a PrefixTree • Thus represents shapes as grid cells of varying precision by prefix Example, a point shape: • D, DR, DRT, DRT2, DRT2Y Example, a polygon shape: • Too many to list… 508 cells More details here: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
  • 7. …continued • For more accuracy, index more levels (longer prefixes) • Points: linear relationship of levels to number of cells  • Non-points: exponential relationship…  RPT applies a distErrPct shape size ratio to non-point shapes to trade accuracy for scalability • distErrPct=0.025 (2.5% of the radius, the default): • Massachusetts: level 6 • USA: level 4 (not as precise)
  • 8. SerializedDVStrategy (Lucene 4.7) • Stores serialized geometry into Lucene BinaryDocValues • It’s as accurate as the underlying geometry coordinates/shape • But it’s not a spatial index – it’s retrievable on a per-document basis • Use RPT + SerializedDV for speed and accuracy! • More to come eventually: • Solr adapter – SOLR-5728, ElasticSearch adapter #2361 • Speed: Skip the serialized geometry check for non-edge cells – LUCENE-5579
  • 9. Sample Code SpatialArgs args = new SpatialArgs(INTERSECTS, point); treeStrategy = new RecursivePrefixTreeStrategy( grid, "geometry"); verifyStrategy = new SerializedDVStrategy( ctx, "serialized_geometry"); Query treeQuery = new ConstantScoreQuery( treeStrategy.makeFilter(args)); Query combinedQuery = new FilteredQuery( treeQuery, verifyStrategy.makeFilter(args), FilteredQuery.QUERY_FIRST_FILTER_STRATEGY); Code is from a related presentation by the Climate Corporation presented at FOSS4G 2014
  • 10. FlexPrefixTree (Coming to Lucene 5) • A new SpatialPrefixTree by Varun Shenoy (GSOC 2014) ! • LUCENE-4922; Still needs to be committed. Goal is for 5.0. • More optimized, more flexible, than Geohash & Quad • Configurable sub-cells at each level: 4, 16, 64, 256 • You choose trade-off between index speed/disk size & search speed • Internally uses an integer coordinate system • Rectangle searches are particularly fast; minimal floating-point conversion • Cells are always squares (equal sides) – better for heatmaps • YMMV: 10% - 100% faster than GeohashPrefixTree
  • 11. BBoxSpatialStrategy (Lucene 4.10) • Rectangles (BBox’s) only, one value per field • Wide predicate support • Equals, Intersects, Within, Contains, Disjoint • Accurate (8-byte double floating point) • Area overlap relevancy • Weight search results by a combination of query shape overlap & index shape overlap ratios • Solr BBoxField…
  • 12. Solr BBoxField • Schema configuration <field name="bbox" type="bbox" /> <fieldType name="bbox" class="solr.BBoxField” geo="true" units="degrees" numberType="_bbox_coord" /> <fieldType name="_bbox_coord" class="solr.TrieDoubleField” precisionStep="8" docValues="true" stored="false"/> • Search with overlap ratio ordering &q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10)) • score can be: overlapRatio, area, area2D
  • 13. Recent Student/Intern Contributions • Varun Shenoy via GSOC: summer 2014 • Lucene spatial: new “FlexPrefixTree” – an optimized grid • Rebecca Alford via F.B. Open-Academy: winter 2014 • Spatial4j: geodesic polygons • Chris Pavlicek via F.B. Open-Academy: winter 2014 • Spatial4j: geodesic buffered lines • Evana Gizzi, MITRE intern: winter 2014 • Spatial4j: geodesic circle polygonizer • Liviy Ambrose, MITRE intern: fall 2013 • Lucene spatial: integrated with Lucene’s benchmark module
  • 14. Temporal/Date Durations or basically any numeric ranges
  • 15. Approach: Simple Two-field (as you might do in SQL or any system without native range types) • A start-time & end-time field pair • A search window (time span) becomes two range queries • details vary by predicate (Intersects, Contains, vs. Within) • Single-valued only • …even though Lucene supports multi-valued fields • Theoretically possible but would be a lot of work • because Lucene doesn’t store “position” info for numeric fields • because numeric range/prefix queries are position-less
  • 16. Approach: 2D Spatial PrefixTree • Lucene Spatial QuadPrefixTree (2D) with RPT Strategy • Use ‘x’ for start-time, ‘y’ for end-time • A search window (time span) becomes a rectangle query • details vary by predicate (Intersects, Contains, vs. Within) • Cool… • But floating-point edge issues • Only ~50 levels supported; not 64 Details: http://wiki.apache.org/solr/SpatialForTimeDurations
  • 17. Approach: DateRangePrefixTree (Lucene 5) • A new 1D SpatialPrefixTree: NumberRangePrefixTree • NumberRangePrefixTree w/ DateRangePrefixTree subclass • NR-SPT: Configurable sub-cells per level; no level limit • Not just for ranges; instances too • Index/Search with NumberRangePrefixTreeStrategy • Indexing, and search predicate code (e.g. Intersects…) completely re-used • DateRangePrefixTree • 9 Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis …continued…
  • 18. Trade-offs of N/D-SPT • Indexing: • “Common” date-ranges use ~ <50 terms, but random millisecond ranges use up to ~14K terms • All date instances (not a range) <= 9 terms • Comparison to 2D SPT: instance or range, always 50 • Search: • Query for “common” query ranges faster than uncommon • Comparison to 2D SPT: • Contains & Within predicates: overlapping values per document get coalesced, can’t be differentiated
  • 19. Solr DateRangeField • Configuration in schema.xml: <field name="dateRange" type="dateRange" /> <fieldType name="dateRange" class="solr.DateRangeField" /> • Index field data, examples: • 2014-05-21T12:00:00.000Z (same as TrieDate) • 2014-05-21T12 (truncated to desired precision) • [1990 TO 1995] • Query, examples: • fq=dateRange:[* TO 2014-05-21] • fq={!field f=dateRange op=Contains} [2000 TO 2014-05-21]
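A truncated value such as 2014-05-21T12 stands for the whole period it names (here, the full hour). The sketch below shows that idea in plain Python for year, month, day, and hour precision. It is not Solr's parser: `implied_range` and `_FORMATS` are hypothetical names, UTC is assumed, and the end of the range is returned as an exclusive bound for convenience.

```python
from datetime import datetime, timedelta, timezone

# (strptime format, unit of the truncated value), coarsest first
_FORMATS = [
    ("%Y", "year"),
    ("%Y-%m", "month"),
    ("%Y-%m-%d", "day"),
    ("%Y-%m-%dT%H", "hour"),
]


def implied_range(text):
    """Return (start, end_exclusive) for a truncated date string."""
    for fmt, unit in _FORMATS:
        try:
            start = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # text does not match this precision; try the next
        if unit == "year":
            end = start.replace(year=start.year + 1)
        elif unit == "month":
            # Roll December over into January of the following year.
            end = start.replace(year=start.year + start.month // 12,
                                month=start.month % 12 + 1)
        elif unit == "day":
            end = start + timedelta(days=1)
        else:
            end = start + timedelta(hours=1)
        return start, end
    raise ValueError(f"unsupported date precision: {text!r}")


print(implied_range("2014-05-21T12"))
```

Seen this way, an explicit range like [1990 TO 1995] is just the span from the start of the first period to the end of the last one.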
  • 20. Visualizing Date Facets • http://bl.ocks.org/mbostock/4063318
  • 21. Date Faceting • Option A: facet.range • Not for indexed date-ranges • Internally executes one query for each value & caches large bitset • Option B: facet.interval (Solr 4.10) • Not for indexed date-ranges • Requires DocValues (more index data) • Supports variable/custom intervals • New work-in-progress option: Facet on DateRangeField • Ranges are fixed/pre-determined (months, days, etc.) • Optimized for thousands of ranges to count • Each value-range is only 1 term!
  • 22. Future stuff I’m excited about • Continuing works in progress • Spatial heatmaps! Coming in January 2015! • Lucene layer & Solr adapter • Lucene term auto-prefixing LUCENE-5879 • Brings spatial, date, numeric, indexing/search to the next level! • More prefix-tree optimizations • Inner vs edge leaf cell differentiation for non-point shapes • RPT + SerializedDVStrategy; skip accuracy checks for inner cells • Don’t index leaf cells twice
  • 23. That’s all for now; thanks for coming! Need Lucene/Solr guidance or custom development? Contact me! Email: dsmiley@apache.org LinkedIn: http://www.linkedin.com/in/davidwsmiley G+: +DavidSmiley Twitter: @DavidWSmiley ETA: December 2014

Editor's Notes

  1. There is a 3rd edition expected by the end of 2014.
  2. 508 cells at level 5 detail (same as the point example). 463 of these are “leaf” cells, and these get duplicated in the index with and without a leaf variant. Disclaimer: the polygon picture here actually goes to level 6, but that’s not important.
  3. distErrPct=0.025 tends to yield a few thousand cells or so. distErrPct is independent of maximum configured precision. More about
  4. The geometry format is dictated by Spatial4j, which has its own format for the Spatial4j native shapes; other shapes (e.g. polygons) use WKB. There are plenty of opportunities for a more compact representation; WKB is a little hefty but it’s known to be fast to read nonetheless.
  5. By the way, set PrefixGridScanLevel on the RPT strategy to be at least maxLevels (set to 100 is fine), such that it never scans. The scanning optimization has turned out to be very bad for non-point indexed shapes.
  6. TODO: Update for latest trunk, and run some randomized tests (beasting) for a while, then commit to trunk. Then wait a little and back-port to 5x. 256 levels is only supported for point data. No Hilbert Curve ordering yet. Configurable levels is similar in concept to precisionStep in the numeric Trie fields, but here it’s configurable at each “step” (level). https://issues.apache.org/jira/browse/LUCENE-4922
  7. See the Solr Ref Guide for more info: https://cwiki.apache.org/confluence/display/solr/Spatial+Search That ENVELOPE syntax is WKT/CQL
  8. In reverse chronological order. Note the middle 3 are works in progress. Non-coincidentally they all deal with geodesics. Geodesics is hard! Also, Varun was basically full-time at this.
  9. distErrPct=0 Once FlexPrefixTree is committed, it would be great to add an ‘integer’ based 2D Shape (or upgrade it to ‘long’) and add some ease-of-use wrappers (Solr FieldType) to make this nicer
  10. Could use configurable maximum depth.
  11. Theoretically should be faster than DateField for date instances given “common” query ranges because DateSpatialPrefixTree is customized for common date ranges resulting in ~ <50 terms whereas Lucene numeric trie fields with precisionStep 6 will use ~680 terms (math: 2^6 * 64/6)
  12. “op” local-param could get renamed to ‘pred’ by the time it’s released