The document discusses the evolution of range filters in Solr, from the initial naive implementation to optimizations using tries and then point fields. It explains how range filters were initially implemented by breaking them into individual term queries, and how tries were later used at index and query time to improve performance by exploiting common prefixes. The document then introduces point fields, which use a Bkd-tree to index multi-dimensional points in a way that adapts to the data distribution, offering better performance than tries.
2. Privileged and Confidential
Agenda
1. Recap: From query parser to TopDocsCollector.
2. TermQuery search flow.
3. How are Range Filters implemented?
4. Optimizations for Range Filters.
5. Point Fields.
Recap: From query parser to TopDocsCollector
● What is a query parser?
● What is the difference between a Query and a Scorer? And why do you need a Collector?
● What is the difference between TermsEnum and PostingsEnum?
Trie*Field query time
Original values
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
Additional values
42* -> [1, 2]
44* -> [3, 4]
52* -> [5, 7]
63* -> [5, 6, 7]
64* -> [5, 6, 7]
4** -> [1, 2, 3, 4]
5** -> [5, 7]
6** -> [5, 6, 7]
Exploiting the Trie*Field terms:
the range query ends up with 6 SHOULD clauses in total.
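A minimal Python sketch of how the prefix terms above can be used at query time. The query range [423 TO 642] is an assumption (the slide does not state the query); it is chosen because it produces six clauses, matching the count above. The helper names `build_index` and `split_range` are hypothetical, and decimal prefixes stand in for the binary precision steps a real Trie*Field uses.

```python
from collections import defaultdict

# Original values and their postings, copied from the slide.
POSTINGS = {
    421: [1], 423: [2], 445: [3], 446: [3], 448: [4],
    521: [5], 522: [7], 632: [5], 633: [6], 634: [7],
    641: [5], 642: [6], 644: [7],
}

def build_index(postings, width=3):
    """Index every value under its full term and under each shorter prefix."""
    index = defaultdict(set)
    for value, docs in postings.items():
        digits = str(value).zfill(width)
        for keep in range(width, 0, -1):
            term = digits[:keep] + "*" * (width - keep)
            index[term].update(docs)
    return dict(index)

def split_range(lo, hi, width=3):
    """Cover [lo, hi] with the coarsest prefix terms that fit entirely inside it."""
    terms = []

    def cover(prefix):
        span = 10 ** (width - len(prefix))
        first = int(prefix or "0") * span   # smallest value under this prefix
        last = first + span - 1             # largest value under this prefix
        if last < lo or first > hi:
            return                          # prefix lies outside the range
        if prefix and lo <= first and last <= hi:
            terms.append(prefix + "*" * (width - len(prefix)))
            return                          # whole prefix fits: one clause
        for digit in "0123456789":
            cover(prefix + digit)           # partial overlap: go one level deeper

    cover("")
    return terms

index = build_index(POSTINGS)
# Only terms that actually exist in the index become SHOULD clauses.
clauses = [t for t in split_range(423, 642) if t in index]
docs = set().union(*(index[t] for t in clauses))
print(clauses)       # ['423', '44*', '5**', '63*', '641', '642']
print(sorted(docs))  # [2, 3, 4, 5, 6, 7]
```

Under these assumptions, the eleven original values in the range are covered by six terms, because whole prefixes such as 5** replace their individual values.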
Isn't it enough? What about the distribution of terms?
The trie-based approach does not analyze the distribution of the terms.
q=PRICE:[100 TO 2002222]
Original values
1 -> [1]
100 -> [2]
2000001 -> [3]
2000022 -> [3]
2000222 -> [4]
2002222 -> [5]
50000005 -> [7]
Isn't it enough? I/O efficiency.
We need to store all original and additional values.
We need to read all terms of the field at search time.
Original values
1 -> [1]
100 -> [2]
2000001 -> [3]
2000022 -> [3]
2000222 -> [4]
2002222 -> [5]
50000005 -> [7]
Additional values
10* -> [2]
1** -> [1, 2]
200002* -> [3]
200022* -> [4]
20002** -> [4]
200**** -> [3, 4, 5]
200222* -> [5]
20022** -> [5]
2002*** -> [5]
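To get a rough feel for the storage overhead, the Python sketch below counts the terms a simplified trie scheme would write for the values above. It rests on assumptions not taken from the slide: values are zero-padded to 8 decimal digits and one term is written per prefix length, whereas a real Trie*Field shifts binary values by a configurable precisionStep. The helper name `trie_terms` is hypothetical.

```python
# Original values from the slide.
VALUES = [1, 100, 2000001, 2000022, 2000222, 2002222, 50000005]

def trie_terms(value, width=8):
    """Return the full-precision term plus one term per shorter decimal prefix."""
    digits = str(value).zfill(width)
    return [digits[:keep] + "*" * (width - keep) for keep in range(width, 0, -1)]

all_terms = {term for value in VALUES for term in trie_terms(value)}
original_terms = {str(value).zfill(8) for value in VALUES}
additional_terms = all_terms - original_terms

print(len(original_terms), len(additional_terms))  # 7 28
```

In this simplified scheme the additional terms outnumber the original ones several times over, even though only seven values are indexed.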
Point Fields
Point fields (available since Lucene 6.0) replace the now-deprecated numeric fields (Trie*Field) and
numeric range queries: they have better overall performance and are more general, allowing multiple dimensions.
● Based on the Bkd-tree ("Bkd-Tree: A Dynamic Scalable kd-Tree").
It naturally adapts to each data set's particular distribution, in contrast to legacy numeric fields,
which always index the same precision levels for every value regardless of how the points are distributed.
● Most of the data structure resides in on-disk blocks, with a small in-heap binary tree index
structure to locate the blocks at search time.
● Supports multi-dimensional points (for example maps and 3D models).
Point Fields: search time
[Figure: blocks of points on disk, located through the index in the heap]
q=PRICE:[100 TO 2002222]
● If a block only partially overlaps the query, we have to check every value inside it.
● If a block is fully contained within the query, the documents with values in that cell are
efficiently collected without having to test each point.
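The two cases above can be sketched in Python for the one-dimensional case. This is an illustration under stated assumptions, not Lucene's implementation: a flat list of per-block (min, max) pairs stands in for the in-heap binary tree, plain lists stand in for on-disk blocks, and the names `build_blocks`, `range_search`, and the block size of 2 are hypothetical.

```python
# Values and postings from the earlier PRICE example.
POSTINGS = {
    1: [1], 100: [2], 2000001: [3], 2000022: [3],
    2000222: [4], 2002222: [5], 50000005: [7],
}

def build_blocks(postings, block_size=2):
    """Sort (value, doc) points and cut them into fixed-size leaf blocks."""
    points = sorted((value, doc) for value, docs in postings.items() for doc in docs)
    blocks = [points[i:i + block_size] for i in range(0, len(points), block_size)]
    # The "heap" part keeps only the min and max value of every block.
    bounds = [(block[0][0], block[-1][0]) for block in blocks]
    return bounds, blocks

def range_search(bounds, blocks, lo, hi):
    """Return matching docs and the number of points that had to be tested."""
    hits, tested = set(), 0
    for (block_min, block_max), block in zip(bounds, blocks):
        if block_max < lo or block_min > hi:
            continue                                  # disjoint: skip the block
        if lo <= block_min and block_max <= hi:
            hits.update(doc for _, doc in block)      # fully contained: take all docs
        else:
            for value, doc in block:                  # partial overlap: test each point
                tested += 1
                if lo <= value <= hi:
                    hits.add(doc)
    return hits, tested

bounds, blocks = build_blocks(POSTINGS)
hits, tested = range_search(bounds, blocks, 100, 2002222)
print(sorted(hits), tested)  # [2, 3, 4, 5] 2
```

Here only the block holding 1 and 100 has to be inspected point by point; the two blocks that lie fully inside the range contribute their documents without any per-value test, and the last block is skipped.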
Links
Numeric Range Queries in Lucene/Solr
http://blog-archive.griddynamics.com/2014/10/numeric-range-queries-in-lucenesolr.html
Lucene Search Essentials: Scorers, Collectors and Custom Queries
https://www.slideshare.net/lucenerevolution/lucene-search-essentials-scorers-collectors-and-custom-queries-dublin13
Multi-dimensional points, coming in Apache Lucene 6.0
https://www.elastic.co/blog/lucene-points-6.0
Bkd-Tree: A Dynamic Scalable kd-Tree
https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf
The Evolution of Lucene & Solr Numerics from Strings to Points
https://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks?from_action=save