STAY CONNECTED WITH ACTIVATE 2019


Speaker Slide
Robert Kirchgessner
Search Technology Architect
Wolters Kluwer
Experience
• Search Algorithms Development
• Content Analysis
• Entity Recognition
• Solr plugins / extensions
• About 14 years of software development experience in various commercial projects
• Last 4 years focused on search, particularly Apache Solr and cloud-based Solr solutions, including availability and scalability
• Customers: Wolters Kluwer, TRAFIGURA, Daadkracht...
Nazar Seniuk
Lead Software Engineer
EPAM

Agenda
• Motivation
• Implementation Idea
• Implementation status
• Case study autosuggest
• Summary

Some background
• Developing search applications for the legal market since 2003
• Heterogeneous, structured content with rich metadata (laws, cases, commentaries)
• Use of metadata for ranking is essential for good results
• Up to 30% of queries contain legal / other entities
• Relying on query cooking using entity recognition in the user input
• Combining with full text search and tuning the results becomes a challenge

Example
User input: § 123 BGB
Transformed to queries Q1, Q2, Q3, Q4
Expected output:
• § 123 BGB (law document)
• Legal commentary A to § 123 BGB (promoted content)
• Legal commentary B to § 123 BGB (promoted content)
• Some latest cases based on § 123 BGB (relevant content)
• Full text (or whatever needed)
How to achieve?

Requirements
Content structure
• Handle entities in the user input properly: legal citations, locations, dates, names
– e.g. place the document cited in the query at the top
– given a book title, place an entry document (table of contents) at the top
• Top (1-5) hits are expected to be unambiguous
• Use the top slots efficiently (10-100 hits)
• Keep a balance between numerous document types (legal cases) and relevant or promoted document types
Generally: more precise control of what happens in the top 10

Possible solutions
• Boost factors on queries, terms, documents
• Sort fields
• Ranking functions
• Function queries
• Reranking (in Solr or application)
• Filtering
• Multiple requests

Works, but…
• Some are too complex
• Some are too slow
• Others are not reliable
• Missing a concept of subquery:
– tracking which subquery a document comes from
• Missing an equivalent of SQL's LIMIT

Example continued
User input: § 123 BGB
Transformed to queries Q1, Q2, Q3, Q4
Expected output:
• § 123 BGB (law document)
• Legal commentary A to § 123 BGB (promoted content)
• Legal commentary B to § 123 BGB (promoted content)
• Some latest cases based on § 123 BGB (relevant content)
• Full text (or whatever needed)
We want the request to look like: Q1 << Q2 << Q3 << Q4

Elevation query
Initial Idea / Specification
Given a list of queries Q1, Q2, …, QN produce a result fulfilling the conditions:
• All the documents of Qn are placed before the documents of Qm for m>n
• Each hit should occur in the leftmost possible subset
• No duplication of hits
• Meaningful scores
• Correct faceting
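The ordering and de-duplication conditions above can be sketched as a small reference function. This is only an illustration of the intended result, not the plugin's one-pass implementation; the function name `elevate` and the document IDs are hypothetical, and scoring and faceting are left out.

```python
from typing import Hashable, Iterable, List, Tuple


def elevate(subquery_hits: Iterable[Iterable[Hashable]]) -> List[Tuple[int, Hashable]]:
    """Concatenate subquery results so that every document appears once,
    attributed to the leftmost subquery that matched it."""
    seen = set()
    merged = []
    for index, hits in enumerate(subquery_hits):
        for doc_id in hits:
            if doc_id in seen:
                continue  # already placed by a higher-priority subquery
            seen.add(doc_id)
            merged.append((index, doc_id))
    return merged


# Hypothetical hit lists for Q1 << Q2 << Q3
q1 = ["law-123"]
q2 = ["comm-A", "law-123", "comm-B"]
q3 = ["case-9", "comm-A", "case-7"]
result = elevate([q1, q2, q3])
```

Each entry of `result` carries the index of the subquery the document came from, which is the tracking that the "Works, but…" slide lists as missing.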

Elevation query
Additional requirements / expectations
• One request / one pass search
• Usable via some new syntax / parser support
• Implemented as plugin
Furthermore it should be possible to
• impose a limit on the results of each subquery
• provide a sort parameter for each subquery

Implementation
Idea
Where to start
• TopFieldCollector.collect
• TFC manages a priority queue
• The priority queue is parametrized with size and sorting
• DisjunctionMaxQuery:
– "generates the union of documents produced by its subqueries"
[Figure: animation of a single priority queue for Q1 being filled as numbered documents and their values are collected]
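The idea of a collector that manages a priority queue parametrized with size and sorting can be sketched with a bounded min-heap. This is a simplified Python illustration of the concept, not Lucene's TopFieldCollector code; the class name `TopKCollector` and the scores are illustrative.

```python
import heapq
from typing import List, Tuple


class TopKCollector:
    """Keeps the `size` best hits seen so far in a bounded min-heap."""

    def __init__(self, size: int) -> None:
        self.size = size
        # Min-heap: the weakest retained hit is always at index 0.
        self._heap: List[Tuple[float, int]] = []

    def collect(self, doc_id: int, score: float) -> None:
        entry = (score, -doc_id)  # on equal scores the lower doc ID wins
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)  # evict the weakest hit

    def top_docs(self) -> List[Tuple[int, float]]:
        ordered = sorted(self._heap, reverse=True)
        return [(-neg_id, score) for score, neg_id in ordered]


collector = TopKCollector(size=3)
for doc_id, score in enumerate([23, 31, 10, 50, 23, 13, 28, 71], start=1):
    collector.collect(doc_id, score)
top = collector.top_docs()
```

Documents arrive in doc ID order and each one either enters the queue or is rejected immediately, so memory stays bounded by the queue size regardless of how many documents match.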

Implementation
Idea
Where to go
• Provide more than one queue to collector
• Propagate information from DisjunctionMaxScorer to the collector
• Some additional bookkeeping
– Scores
– Sort field values
– Subquery index
– (Facets)
[Figure: animation of three priority queues, one each for Q1, Q2 and Q3, being filled as documents are collected]
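A sketch of the multi-queue idea in plain Python: one bounded queue per subquery, with each collected document routed by subquery index. The class `MultiQueueCollector` and the input format (a mapping from subquery index to score, standing in for the information propagated from the scorer) are assumptions for illustration, as is the choice that a document competes only in the queue of its leftmost matching subquery.

```python
import heapq
from typing import Dict, List, Tuple


class MultiQueueCollector:
    """One bounded min-heap per subquery, filled in a single pass."""

    def __init__(self, sizes: List[int]) -> None:
        self.sizes = sizes
        self.queues: List[List[Tuple[float, int]]] = [[] for _ in sizes]

    def collect(self, doc_id: int, scores: Dict[int, float]) -> None:
        """`scores` maps subquery index -> score for every subquery that matched."""
        if not scores:
            return
        index = min(scores)  # leftmost matching subquery claims the document
        entry = (scores[index], -doc_id)
        queue = self.queues[index]
        if len(queue) < self.sizes[index]:
            heapq.heappush(queue, entry)
        elif entry > queue[0]:
            heapq.heapreplace(queue, entry)

    def results(self) -> List[Tuple[int, int, float]]:
        """Returns (subquery index, doc ID, score), queue by queue."""
        out = []
        for index, queue in enumerate(self.queues):
            for score, neg_id in sorted(queue, reverse=True):
                out.append((index, -neg_id, score))
        return out


collector = MultiQueueCollector(sizes=[1, 2, 2])
stream = [
    (1, {2: 50.0}),
    (2, {1: 43.0, 2: 10.0}),
    (3, {0: 31.0, 2: 55.0}),
    (4, {1: 51.0}),
    (5, {1: 74.0, 2: 20.0}),
    (6, {0: 88.0}),
]
for doc_id, scores in stream:
    collector.collect(doc_id, scores)
merged = collector.results()
```

Reading the queues in order yields the Q1 hits before the Q2 hits before the Q3 hits, without a second request or a second pass over the documents.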

Implementation status
• https://github.com/rokirx/solr-eq
• Working
– Collector logic / multiple queues
– Sort and limit parameter per subquery
– Parser support
• In testing
– Correct scoring
– Faceting
– Multiple sort fields per subquery
• Works with Solr 6.4, 7.6, 8.0, 8.2

Case Study: Autosuggest
User input: tax
• Assumptions on the relevancy of completions:
– Highest priority: exact match with the term at the beginning, e.g. tax relief
– Lower priority: exact match with the term not at the beginning, e.g. income tax
– Lowest priority: prefix match anywhere in the phrase, e.g. estate taxes
• Map these conditions to queries:
– Term at the beginning of a phrase and exact match: ^tax$
– Exact match in the middle of a phrase: tax$
– Prefix match (edge n-gram): tax
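The three priority levels can be illustrated with a small classifier over whitespace tokens. This is a plain-Python stand-in for the three queries, not how the Solr field analysis or the edge n-gram matching is actually configured; the function names and sample phrases are hypothetical.

```python
from typing import List, Optional


def tier(phrase: str, user_input: str) -> Optional[int]:
    """Return 0, 1 or 2 for the highest-priority rule the phrase satisfies."""
    tokens = phrase.lower().split()
    term = user_input.lower()
    if tokens and tokens[0] == term:
        return 0  # exact match at the beginning of the phrase
    if term in tokens:
        return 1  # exact match elsewhere in the phrase
    if any(token.startswith(term) for token in tokens):
        return 2  # prefix match anywhere in the phrase
    return None   # no match


def suggest(phrases: List[str], user_input: str) -> List[str]:
    matched = [(tier(p, user_input), p) for p in phrases]
    matched = [(t, p) for t, p in matched if t is not None]
    matched.sort(key=lambda pair: pair[0])  # stable: keeps input order within a tier
    return [p for _, p in matched]


suggestions = suggest(["estate taxes", "income tax", "tax relief", "property law"], "tax")
```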

Case Study: Autosuggest
User input: tax
• The resulting query ^tax$ << tax$ << tax guarantees the specified behavior
• Additional benefit: optimize performance by cancelling subqueries
– If the exact hit count is not necessary
– And the minimum required number of hits has been collected in the preceding queues
– Stop fetching documents for lower-priority queues by cancelling them in the collector/scorer
– Without missing any relevant documents
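The cancellation rule can be sketched as follows: once the queues ahead of a subquery already hold the required number of hits, nothing from that subquery can reach the top, so it is switched off. This is a plain-Python illustration under the assumption that only the top `needed` suggestions are wanted and no total hit count is required; the names `make_matchers` and `suggest_with_cancellation` are hypothetical and do not reflect the plugin's API.

```python
from typing import Callable, List, Tuple


def make_matchers(term: str) -> List[Callable[[List[str]], bool]]:
    term = term.lower()
    return [
        lambda toks: bool(toks) and toks[0] == term,          # like ^tax$
        lambda toks: term in toks,                            # like tax$
        lambda toks: any(t.startswith(term) for t in toks),   # like tax (prefix)
    ]


def suggest_with_cancellation(
    phrases: List[str], term: str, needed: int
) -> Tuple[List[str], List[bool]]:
    matchers = make_matchers(term)
    queues: List[List[str]] = [[] for _ in matchers]
    active = [True] * len(matchers)
    for phrase in phrases:
        tokens = phrase.lower().split()
        for i, matches in enumerate(matchers):
            if active[i] and matches(tokens):
                queues[i].append(phrase)
                break  # the leftmost matching subquery claims the phrase
        # Cancel every subquery whose preceding queues already hold enough hits.
        collected = 0
        for i, queue in enumerate(queues):
            if collected >= needed:
                active[i] = False
            collected += len(queue)
    top = [p for queue in queues for p in queue][:needed]
    return top, active


phrases = ["income tax", "tax relief", "estate taxes", "tax law",
           "tax return", "sales tax", "taxation"]
top, active = suggest_with_cancellation(phrases, "tax", needed=2)
```

In this run the prefix subquery is cancelled after the second phrase and the middle-match subquery after the fourth, yet the top two suggestions are the same as they would be without cancellation.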

Potential benefits
• Reduce the number of search requests
• Reduce the complexity of the architecture
• Additional dimension to control rank
• Pluggable, easy to evaluate
• Improve performance through runtime subquery cancellation

Summary
It is technically possible to implement a subquery concept in Solr/Lucene
• Single request / one pass collection of results
• Individual limits on each subquery
• Individual sort parameters on each subquery
• Optimization if the total hit count is not needed
– cancel lower-priority subqueries during evaluation without affecting the top hits
• Plugin
