Elevation Query Extension: Introducing Subselects into Lucene Queries

#Activate19

Speaker Slide
Robert Kirchgessner
Search Technology Architect
Wolters Kluwer
Experience
• Search Algorithms Development
• Content Analysis
• Entity Recognition
• Solr plugins / extensions
• About 14 years of software development experience in various commercial projects
• Last 4 years focused on search, particularly Apache Solr and cloud-based Solr deployments, including availability and scalability
• Customers: Wolters Kluwer, TRAFIGURA, Daadkracht...
Nazar Seniuk
Lead Software Engineer
EPAM

Agenda
• Motivation
• Implementation Idea
• Implementation status
• Case study autosuggest
• Summary

Some background
• Developing search applications for the legal market since 2003
• Heterogeneous, structured content with rich metadata (laws, cases, commentaries)
• Use of metadata for ranking is essential for good results
• Up to 30% of queries contain legal / other entities
• Relying on query cooking using entity recognition in the user input
• Combining with full text search and tuning the results becomes a challenge

Example
User input: § 123 BGB
Transformed to queries Q1, Q2, Q3, Q4
Expected output:
• § 123 BGB (law document)
• Legal commentary A to § 123 BGB (promoted content)
• Legal commentary B to § 123 BGB (promoted content)
• Some latest cases based on § 123 BGB (relevant content)
• Full text (or whatever needed)
How to achieve?

Requirements
Content Structure
• Handle entities in the user input properly: legal citations, locations, dates, names
– e.g. place the document cited in the query at the top
– given a book title, place its entry document (table of contents) at the top
• Top (1-5) hits expected to be unambiguous
• Use the top slots efficiently (10-100 hits)
• Keep a balance between numerous document types (legal cases) and relevant or promoted document types
In general: more precise control over what appears in the top 10

Possible solutions
• Boost factors on queries, terms, documents
• Sort fields
• Ranking functions
• Function queries
• Reranking (in Solr or application)
• Filtering
• Multiple requests

Works, but…
• Some are too complex
• Some are too slow
• Others are not reliable
• Missing a concept of subquery:
– tracking which subquery a document comes from
• Missing LIMIT as in SQL

Example continued
User input: § 123 BGB
Transformed to queries Q1, Q2, Q3, Q4
Expected output:
• § 123 BGB (law document)
• Legal commentary A to § 123 BGB (promoted content)
• Legal commentary B to § 123 BGB (promoted content)
• Some latest cases based on § 123 BGB (relevant content)
• Full text (or whatever needed)
We want the request to look like: Q1 << Q2 << Q3 << Q4

Elevation query
Initial Idea / Specification
Given a list of queries Q1, Q2, …, QN produce a result fulfilling the conditions:
• All the documents of Qn are placed before the documents of Qm for m>n
• Each hit should occur in the leftmost possible subset
• No duplication of hits
• Meaningful scores
• Correct faceting
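
The ordering and de-duplication conditions above can be illustrated with a small reference model. This is only a sketch of the specification, not of the plugin: it assumes each subquery's hits are already available as a list of document IDs, and the function name `elevate` and the sample IDs are hypothetical. Scoring and faceting are not modeled.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def elevate(subquery_hits: Sequence[Iterable[Hashable]]) -> List[Tuple[Hashable, int]]:
    """Concatenate hit lists in priority order (Q1 first).

    Each document is kept only in the leftmost subquery that returned it,
    so the merged list contains no duplicates.
    Returns (doc_id, subquery_index) pairs.
    """
    seen = set()
    merged = []
    for index, hits in enumerate(subquery_hits):
        for doc_id in hits:
            if doc_id in seen:
                continue  # already placed by a higher-priority subquery
            seen.add(doc_id)
            merged.append((doc_id, index))
    return merged


# Hypothetical hit lists for three subqueries.
q1 = ["law-123"]
q2 = ["commentary-a", "commentary-b", "law-123"]
q3 = ["case-9", "commentary-a", "case-7"]
result = elevate([q1, q2, q3])
```

Here `law-123` and `commentary-a` appear only once, in the group of the first subquery that matched them.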

Elevation query
Additional requirements / expectations
• One request / one pass search
• Usable via some new syntax / parser support
• Implemented as plugin
Furthermore it should be possible to
• impose a limit on the results of each subquery
• provide a sort parameter for each subquery

Implementation
Idea
Where to start
• TopFieldCollector.collect
• TFC manages a priority queue
• The priority queue is parametrized with size and sorting
• DisjunctionMaxQuery:
– "generates the union of documents produced by its subqueries"
[Figure: animation of (document, value) pairs from Q1 passing through a single priority queue]
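
The starting point, a collector that keeps the best hits in a bounded priority queue, can be sketched in a few lines. This is a plain-Python illustration of the idea and not Lucene code: `TopKCollector` is a hypothetical name, the sort is by score only, and ties are assumed to be broken in favor of the lower document ID.

```python
import heapq
from typing import List, Tuple


class TopKCollector:
    """Keeps the `size` best (doc_id, score) pairs seen so far."""

    def __init__(self, size: int):
        self.size = size
        # Min-heap of (score, -doc_id): the root is always the weakest kept hit.
        self._heap: List[Tuple[float, int]] = []

    def collect(self, doc_id: int, score: float) -> None:
        entry = (score, -doc_id)
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The new hit beats the weakest kept hit, so it replaces it.
            heapq.heapreplace(self._heap, entry)

    def top_docs(self) -> List[Tuple[int, float]]:
        return [(-neg_id, score) for score, neg_id in sorted(self._heap, reverse=True)]


collector = TopKCollector(size=3)
# Illustrative (doc_id, score) stream in document order.
for doc_id, score in [(1, 23), (2, 31), (3, 10), (4, 50), (5, 23), (6, 13), (7, 28), (8, 71)]:
    collector.collect(doc_id, score)
top = collector.top_docs()
```

Memory stays bounded by the queue size regardless of how many documents match, which is why the queue is the natural place to add per-subquery behavior.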

Implementation
Idea
Where to go
• Provide more than one queue to collector
• Propagate information from DisjunctionMaxScorer to the collector
• Some additional bookkeeping
– Scores
– Sort field values
– Subquery index
– (Facets)
[Figure: animation of documents being distributed over three priority queues, one each for Q1, Q2 and Q3]
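
One way to picture a collector with several queues is the following plain-Python sketch. It is an assumption-laden illustration rather than the plugin's code: `MultiQueueCollector` is a hypothetical name, the per-subquery scores are assumed to arrive with each document (standing in for the information propagated from the scorer), and a document that is rejected by or evicted from a full queue is assumed to get another chance in the next subquery it matches.

```python
import heapq
from typing import List, Optional, Sequence, Tuple

Scores = Tuple[Optional[float], ...]


class MultiQueueCollector:
    """One bounded priority queue per subquery, filled in a single pass."""

    def __init__(self, queue_sizes: Sequence[int]):
        self.queue_sizes = list(queue_sizes)
        # Each heap holds (score, -doc_id, all_scores); the root is the weakest hit.
        self._heaps: List[list] = [[] for _ in queue_sizes]

    def collect(self, doc_id: int, scores: Sequence[Optional[float]]) -> None:
        """scores[i] is the document's score under subquery i, or None if no match."""
        self._place(doc_id, tuple(scores), start=0)

    def _place(self, doc_id: int, scores: Scores, start: int) -> None:
        for index in range(start, len(self._heaps)):
            score = scores[index]
            if score is None:
                continue  # document does not match this subquery
            heap = self._heaps[index]
            entry = (score, -doc_id, scores)
            if len(heap) < self.queue_sizes[index]:
                heapq.heappush(heap, entry)
                return
            if entry[:2] > heap[0][:2]:
                evicted = heapq.heapreplace(heap, entry)
                # Bookkeeping pays off here: the evicted document still knows
                # its other scores and can compete in lower-priority queues.
                self._place(-evicted[1], evicted[2], index + 1)
                return
            # Not competitive in this queue; fall through to the next subquery.

    def results(self) -> List[Tuple[int, int, float]]:
        out = []
        for index, heap in enumerate(self._heaps):
            for score, neg_id, _ in sorted(heap, key=lambda e: e[:2], reverse=True):
                out.append((-neg_id, index, score))
        return out


collector = MultiQueueCollector(queue_sizes=[1, 2, 3])
stream = [
    (1, (5.0, 1.0, 1.0)),
    (2, (None, 4.0, 2.0)),
    (3, (7.0, 3.0, 1.5)),
    (4, (None, 6.0, 0.5)),
    (5, (None, None, 9.0)),
    (6, (None, 2.0, 8.0)),
]
for doc_id, scores in stream:
    collector.collect(doc_id, scores)
hits = collector.results()
```

Each result carries its subquery index, so the caller can tell which subquery a document came from; no document appears in more than one queue.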

Implementation status
• https://github.com/rokirx/solr-eq
• Working
– Collector logic / multiple queues
– Sort and limit parameter per subquery
– Parser support
• In testing
– Correct scoring
– Faceting
– Multiple sort fields per subquery
• Works with 6.4, 7.6, 8.0, 8.2

Case Study: Autosuggest
User input: tax
• Assumptions on the relevancy of completions:
– Highest priority: exact match with the term at the beginning of the phrase, e.g. tax relief
– Lower priority: exact match with the term not at the beginning, e.g. income tax
– Lowest priority: prefix match anywhere in the phrase, e.g. estate taxes
• Map these conditions to queries:
– Term at the beginning of a phrase and exact match: ^tax$
– Exact match in the middle of a phrase: tax$
– Prefix match (edge n-gram): tax
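
The three priority levels can be mimicked without a search engine to check that the mapping gives the intended order. This is a sketch under assumptions: phrases are lowercased and split on whitespace, the functions `tier` and `suggest` are hypothetical, and plain string comparisons stand in for the three queries.

```python
from typing import List, Optional


def tier(phrase: str, term: str) -> Optional[int]:
    """Return the index of the leftmost condition the phrase satisfies, or None."""
    tokens = phrase.lower().split()
    term = term.lower()
    if tokens and tokens[0] == term:
        return 0  # exact match at the beginning of the phrase
    if term in tokens[1:]:
        return 1  # exact match, but not at the beginning
    if any(token.startswith(term) for token in tokens):
        return 2  # prefix match anywhere in the phrase
    return None   # no match at all


def suggest(phrases: List[str], term: str) -> List[str]:
    matched = [(tier(p, term), p) for p in phrases]
    matched = [pair for pair in matched if pair[0] is not None]
    # sorted() is stable, so phrases keep their input order within a tier.
    return [p for _, p in sorted(matched, key=lambda pair: pair[0])]


phrases = ["estate taxes", "income tax", "tax relief", "taxonomy", "sales report"]
suggestions = suggest(phrases, "tax")
```

A phrase such as "tax relief" also satisfies the prefix condition, but it is reported only once, at its highest-priority level.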

Case Study: Autosuggest
User input: tax
• The resulting query ^tax$ << tax$ << tax guarantees the specified behavior
• Additional benefit: performance can be optimized by cancelling subqueries
– If the exact hit count is not needed
– And the preceding queues have collected the minimum required number of hits
– Stop fetching documents for lower-priority queues by cancelling their subqueries in the collector/scorer
– Without missing any relevant documents
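
The cancellation idea can be sketched as follows. This is an illustration only: `collect_with_cancellation` and the three predicates are hypothetical, documents keep their collection order within a queue (no per-subquery sort), and counting predicate evaluations is a stand-in for the real cost of scoring a subquery.

```python
from typing import Callable, List, Sequence, Tuple


def collect_with_cancellation(
    docs: Sequence[str],
    subqueries: Sequence[Callable[[str], bool]],
    needed: int,
) -> Tuple[List[str], int]:
    """Return the top `needed` hits and the number of subquery evaluations."""
    queues: List[List[str]] = [[] for _ in subqueries]
    active = len(subqueries)  # subqueries [0, active) are still evaluated
    evaluations = 0
    for doc in docs:
        for index in range(active):
            evaluations += 1
            if subqueries[index](doc):
                queues[index].append(doc)  # leftmost matching subquery wins
                break
        # Once the first queues together hold enough hits, every
        # lower-priority subquery can no longer reach the top slots.
        collected = 0
        for index in range(active):
            collected += len(queues[index])
            if collected >= needed:
                active = index + 1
                break
    top = [doc for queue in queues[:active] for doc in queue][:needed]
    return top, evaluations


def begins(p: str) -> bool:   # stands in for ^tax$
    return p.split()[0] == "tax"


def middle(p: str) -> bool:   # stands in for tax$
    return "tax" in p.split()[1:]


def prefix(p: str) -> bool:   # stands in for tax
    return any(t.startswith("tax") for t in p.split())


docs = ["estate taxes", "tax relief", "income tax", "tax law",
        "taxonomy", "tax return", "sales tax", "syntax"]
top, evaluations = collect_with_cancellation(docs, [begins, middle, prefix], needed=2)
```

The highest-priority subquery keeps running to the end, so the top hits are the same as without cancellation; only the total hit count of the cancelled subqueries is lost.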

Potential benefits
• Reduce the number of search requests
• Reduce the complexity of the architecture
• Additional dimension to control rank
• Pluggable, easy to evaluate
• Improve performance through runtime subquery cancellation

Summary
It is technically possible to introduce the concept of a subquery into Solr/Lucene
• Single request / one pass collection of results
• Individual limits on each subquery
• Individual sort parameters on each subquery
• Optimization when the total hit count is not needed
– cancel lower-priority subqueries during evaluation without affecting the top hits
• Implemented as a plugin
