Autocomplete as Relevancy
Haystack Search Relevance Conference
April 24, 2019
Rimple Shah
Revanth Malay
David Rhodes
2
LexisNexis
• Business: Information for Lawyers and other Professionals
• Mission: Advance the Rule of Law
• Flagship Products: Lexis Advance, Lexis Risk Solutions, Nexis
• Target Markets: Legal, Risk, Government, Academia, Professional Information Users
• Customers in 130 countries
• Subsidiary of RELX (NYSE: RELX) since 1994
• Primary Direct Competitors: Dow Jones, Thomson Reuters, Wolters Kluwer, Bloomberg
• > 10,000 employees worldwide
3
Agenda
• Autocomplete as Search Relevance
• Use case
• Customizing Solr for Autocomplete
• Architecture Design
• Evaluating Relevance
• Future Roadmap
4
Autocomplete is Search Relevance
5
Autocomplete in Lexis Advance
6
Who’s Your Data?
• The suggestion is the document
<doc>
<field name="query">obamacare</field>
<field name="display">patient protection and affordable care act</field>
<field name="token_count">1</field>
<field name="id">urn:label:7AC7C01A1EF546C4BCDF334557</field>
<field name="popularity">8464</field>
<field name="source">KRM</field>
<field name="region">United States</field>
</doc>
7
Where Do Suggestions Come From?
• Users’ queries
• LexisNexis legal experts
8
Data Preparation
9
Solr Suggester
• Built-in Solr Search Component
• Features
• Fast in-memory Finite State Transducer data structure
• Easy to add to an existing Solr index
10
Should We Use Solr Suggester?
Functionality
• An “index + query” approach supports complex weight calculation, stop word removal, or basic context filtering
Performance
• Lookups against an in-memory FST are incredibly fast
• Performance of a well-tuned Solr index is sufficient for this use case
11
Basic Solr Configuration

• Keyword Tokenizer
• Lowercase Filter Factory
• EdgeNGram Filter Factory
• MinGramSize=3
• MaxGramSize=30
Motion to Dismiss
• mot
• moti
• motio
• motion
• motion t
• motion to
• motion to di
• motion to dis
• …

• Whitespace Tokenizer
• Lowercase Filter Factory
• EdgeNGram Filter Factory
• MinGramSize=1
• MaxGramSize=30
Motion to Dismiss
• motion: m, mo, mot, moti, motio, motion
• to: t, to
• dismiss: d, di, dis, dism, dismi, dismis, dismiss

• Whitespace Tokenizer
• Lowercase Filter Factory
Motion to Dismiss
• motion
• to
• dismiss
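
The three analyzer chains above can be imitated in a few lines of Python. This is only a sketch of what each chain emits for "Motion to Dismiss", not Solr code; the function names are made up for illustration, and the gram sizes are the ones listed on the slide.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of `token` with length min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def keyword_edge_ngrams(text, min_gram=3, max_gram=30):
    # Keyword tokenizer: the whole lowercased input is a single token.
    return edge_ngrams(text.lower(), min_gram, max_gram)


def whitespace_edge_ngrams(text, min_gram=1, max_gram=30):
    # Whitespace tokenizer: every word gets its own set of prefixes.
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


def whitespace_tokens(text):
    # Whitespace tokenizer + lowercase only: complete words, no prefixes.
    return text.lower().split()


if __name__ == "__main__":
    print(keyword_edge_ngrams("Motion to Dismiss"))
    print(whitespace_edge_ngrams("Motion to Dismiss"))
    print(whitespace_tokens("Motion to Dismiss"))
```

The keyword variant only matches when the user types the suggestion from its first character, while the whitespace variant lets any word of the suggestion match a typed prefix.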
12
Term Frequency
[Figure: Problem and Solution panels for the query “motion to dis”, showing the suggestions “motion to dismiss” and “plaintiff’s motion”, with the annotation T.F. = 1.0]
13
EDisMax's pf2 Parameter
• Boost suggestions that have user query tokens next to each other.
• Example:
User Query: plaintiff’s rebuttal expert witness
Suggestions:
Doc1 : rebuttal expert witness | Score: 292
Doc2 : rebuttal witness and expert testimony | Score: 253
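
To see why Doc1 outranks Doc2, it may help to count how many adjacent word pairs from the user query also appear, in the same order, in each suggestion. The sketch below is only an illustration of that idea; it is not Solr's scoring formula, and the function names are hypothetical.

```python
def word_pairs(text):
    """Return the set of adjacent (left, right) word pairs in `text`."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def adjacent_pair_matches(query, suggestion):
    """Count query word pairs that also occur adjacently in the suggestion."""
    return len(word_pairs(query) & word_pairs(suggestion))


if __name__ == "__main__":
    query = "plaintiff's rebuttal expert witness"
    for doc in ("rebuttal expert witness",
                "rebuttal witness and expert testimony"):
        print(doc, "->", adjacent_pair_matches(query, doc))
```

Doc1 shares the pairs "rebuttal expert" and "expert witness" with the query, while Doc2 contains the same words but none of them next to each other in query order.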
14
Preference for First Word Match
• Insert an anchor term as the first token at both index time and query time.
• Example:
User query: motion dismiss → KXQHZ motion dismiss
Suggestions:
Documents | Index
motion to dismiss with prejudice | KXQHZ motion to dismiss with prejudice
dismiss motion with prejudice | KXQHZ dismiss motion with prejudice
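
A minimal sketch of the anchor idea, assuming the anchor is simply prepended to the text on both sides. The slide does not say how the anchor feeds into scoring; the comparison of leading word pairs below is an assumption (it would fit a phrase boost on adjacent tokens), and the helper names are hypothetical.

```python
ANCHOR = "KXQHZ"  # anchor term taken from the slide


def with_anchor(text):
    """Prepend the anchor so the first real word has a fixed left neighbor."""
    return f"{ANCHOR} {text}"


def leading_pair(text):
    """Return the first two tokens of `text` as a tuple."""
    return tuple(text.split()[:2])


if __name__ == "__main__":
    query = with_anchor("motion dismiss")
    for doc in ("motion to dismiss with prejudice",
                "dismiss motion with prejudice"):
        indexed = with_anchor(doc)
        print(indexed, "->", leading_pair(indexed) == leading_pair(query))
```

Only the suggestion that starts with the same word as the query shares the pair (KXQHZ, motion), which gives a matcher something to reward for a first-word match.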
15
Incorrectly Matching On Partial Words
• Autocomplete incorrectly treats a complete query token as a partial word and returns suggestions containing tokens that merely start with that word.
User Query: government is a
Documents: virgin islands government act
Index (edge n-grams per token):
virgin: v, vi, vir, virg, virgi, virgin
islands: i, is, isl, isla, islan, island, …
government: g, go, gov, gove, gover, govern, …, government
act: a, ac, act
16
Correctly Matching On Partial Words
• Condition 1: When user query has no trailing space
• Insert ‘xwkq’ at the beginning of the last token
User Query: government is a → government is xwkqa
Documents: virgin islands government act
Index (prefixed edge n-grams per token):
virgin: xwkqv, xwkqvi, xwkqvir, xwkqvirg, xwkqvirgi, xwkqvirgin
islands: xwkqi, xwkqis, …
government: xwkqg, xwkqgo, xwkqgov, xwkqgove, xwkqgover, xwkqgovern, …, xwkqgovernment
act: xwkqa, xwkqac, xwkqact
17
Correctly Matching On Partial Words
• Condition 2: When user query has trailing space
• The rest of the Solr analyzers handle this case
User Query: government is a_ (the underscore marks the trailing space)
Documents: virgin islands government act
Index (prefixed edge n-grams per token):
virgin: xwkqv, xwkqvi, xwkqvir, xwkqvirg, xwkqvirgi, xwkqvirgin
islands: xwkqi, xwkqis, …
government: xwkqg, xwkqgo, xwkqgov, xwkqgove, xwkqgover, xwkqgovern, …, xwkqgovernment
act: xwkqa, xwkqac, xwkqact
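
The two conditions can be sketched as a small query pre-processing step plus a matching index-side expansion. This is a hedged illustration, not the presenters' code: the function names are hypothetical, and it assumes complete tokens are indexed unprefixed alongside the ‘xwkq’-prefixed edge n-grams.

```python
PARTIAL_MARKER = "xwkq"  # marker string taken from the slides


def prepare_query(raw_query):
    """Mark the last token as partial unless the user typed a trailing space."""
    tokens = raw_query.lower().split()
    if tokens and not raw_query.endswith(" "):
        tokens[-1] = PARTIAL_MARKER + tokens[-1]
    return " ".join(tokens)


def index_terms(suggestion):
    """Index complete tokens plus marker-prefixed edge n-grams of each token."""
    terms = set()
    for token in suggestion.lower().split():
        terms.add(token)  # complete word (assumed to be indexed as-is)
        for n in range(1, len(token) + 1):
            terms.add(PARTIAL_MARKER + token[:n])
    return terms


if __name__ == "__main__":
    terms = index_terms("virgin islands government act")
    for raw in ("government is a", "government is a "):
        query = prepare_query(raw)
        matched = [t for t in query.split() if t in terms]
        print(repr(raw), "->", query, "| matched:", matched)
```

With the marker in place, the complete token "is" no longer matches the prefix of "islands", and "a" only matches "act" while the user is still typing it.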
18
Exact Token Match Before Stemmed & Synonym Match
• Exact match (^8): Standard Tokenizer Factory + Lowercase Filter Factory
• Stemmed match (^6): Standard Tokenizer Factory + Lowercase Filter Factory + Snowball Porter Filter Factory + English Possessive Filter Factory
• Synonym match (^6): Standard Tokenizer Factory + Lowercase Filter Factory + Synonym Graph Filter Factory
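
One way to express these boosts is through the eDisMax `qf` parameter, with one field per analyzer chain. The sketch below only builds the request parameters; the field names `suggest_exact`, `suggest_stemmed`, and `suggest_synonym` are hypothetical, and pairing the boosts ^8, ^6, ^6 with the chains in slide order is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical field names, one per analyzer chain, with the boosts from the slide.
FIELD_BOOSTS = [
    ("suggest_exact", 8),
    ("suggest_stemmed", 6),
    ("suggest_synonym", 6),
]


def build_params(user_query):
    """Build eDisMax request parameters that favor exact token matches."""
    qf = " ".join(f"{field}^{boost}" for field, boost in FIELD_BOOSTS)
    return {"defType": "edismax", "q": user_query, "qf": qf}


if __name__ == "__main__":
    params = build_params("zoning var")
    print(params["qf"])
    print(urlencode(params))
```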
19
Duplicate and Near Duplicate Suggestions
• Reduce the impression of repetitive suggestions by limiting suggestions whose words share the same root
• User Query: zoning var
• Suggestions:
20
Reduce Near Duplicate Suggestions
zone variance
zoning variance
zone variances
variance of zoning
→ variance_zone
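
A key such as variance_zone could be produced by stemming each word, dropping stop words, sorting, and joining with an underscore. The slides do not say how the key is actually computed, so the sketch below is an assumption throughout; the toy stemmer covers only the forms in this example and stands in for a real stemmer.

```python
STOPWORDS = {"of", "the", "a", "to"}  # illustrative stop word list


def toy_stem(token):
    """Very rough stemmer, sufficient only for this example."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3] + "e"   # zoning -> zone
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]         # variances -> variance
    return token


def dedup_key(suggestion):
    """Order-independent key built from the stems of the non-stop words."""
    stems = {toy_stem(t) for t in suggestion.lower().split() if t not in STOPWORDS}
    return "_".join(sorted(stems))


def drop_near_duplicates(suggestions):
    """Keep the first suggestion seen for each key; input is assumed ranked."""
    seen, kept = set(), []
    for suggestion in suggestions:
        key = dedup_key(suggestion)
        if key not in seen:
            seen.add(key)
            kept.append(suggestion)
    return kept


if __name__ == "__main__":
    ranked = ["zone variance", "zoning variance",
              "zone variances", "variance of zoning"]
    print({s: dedup_key(s) for s in ranked})
    print(drop_near_duplicates(ranked))
```

Because the input is assumed to be ranked already, keeping the first suggestion per key preserves the best-scoring variant.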
21
Spelling Correction
• About 10-12% of user queries to web search engines have spelling errors
22
Spelling Correction
23
Architecture Design
24
Architecture Design
25
Offline Evaluation
26
Offline Evaluation Feedback
27
Online Evaluation
• Measure user engagement with autocomplete suggestions
• Click-rate
• Mean Reciprocal Rank (MRR)
• Minimum Keystroke (MKS)
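
As an example of these metrics, Mean Reciprocal Rank averages 1/rank of the suggestion the user selected, counting sessions without a selection as 0. The sketch below assumes a simple log format (one 1-based rank per session, or `None` when nothing was selected); that format and the function name are illustrative assumptions.

```python
def mean_reciprocal_rank(selected_ranks):
    """Average of 1/rank over sessions; None means no suggestion was selected."""
    if not selected_ranks:
        return 0.0
    total = sum(0.0 if rank is None else 1.0 / rank for rank in selected_ranks)
    return total / len(selected_ranks)


if __name__ == "__main__":
    # Four sessions: selected rank 1, rank 2, nothing, rank 4.
    print(mean_reciprocal_rank([1, 2, None, 4]))  # (1 + 0.5 + 0 + 0.25) / 4
```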
28
Future Roadmap
29
Any Questions?
30
Thank You
• Rimple Shah
• rimple.shah@lexisnexis.com
• Revanth Malay
• revanth.malay@lexisnexis.com
• David Rhodes
• david.rhodes@lexisnexis.com
Stay in touch with us
