De Bitmanager, 2016
You Know, for Search
Peter van der Weerd
Who am I?
• Peter van der Weerd
• Search specialist
• Self-employed at De Bitmanager
• Enormous span of control 
Search
• Common sense:
Easy
Solved
Yeah, true…
• Install ES
• Fill it with some data
• And \o/: we can search
But…
• Are the users satisfied?
• Many people struggle with sub-optimal search results.
Search as a toolbox
• It consists of 1 or more(!) tools to find what you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…
Search at Booking
• Destination based (city, region, airport, etc.)
• Autocomplete
Results in max 5 destinations, query per keystroke
• Disambiguation
Show a partitioned result that enables people to choose a destination
Autocomplete in action
Disambiguation in action
Scoring
• In general, Lucene scores like: tf * idf
• Tf = term frequency
The more often a term occurs in a document, the more important
• Idf = inverse document frequency
The more documents match the term, the less important
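A minimal sketch of the tf * idf idea, assuming a square-root tf and a logarithmic idf (a shape similar to Lucene's classic similarity; the real scorer also applies length norms and other factors that are left out here). Function names and the index sizes are illustrative only.

```python
import math

def tf(term_count: int) -> float:
    """Term frequency factor: grows with the number of occurrences, but sub-linearly."""
    return math.sqrt(term_count)

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: the more documents contain the term, the lower the value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf_idf(term_count: int, doc_freq: int, num_docs: int) -> float:
    return tf(term_count) * idf(doc_freq, num_docs)

# Hypothetical index of 1000 documents: a rare term versus a common term.
rare = tf_idf(term_count=1, doc_freq=10, num_docs=1000)
common = tf_idf(term_count=1, doc_freq=500, num_docs=1000)
print(rare > common)  # the rare term contributes more to the score
```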
Term frequency
• Used to give more importance to terms that occur relatively often.
• Scoring examples for ‘house’
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
[vertical axis: score]
Inverse document frequency
• Prefers less frequent tokens.
• Useless on single-token queries: it is only used to score multiple tokens relative to each other
• Examples:
house
little
on
the
[vertical axis: score]
Drawback of idf
• Other example…
Pekela
Haarlem
Amsterdam
Paris
• Booking switched off idf, but could have used df instead…
[vertical axis: score]
When does idf work?
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed over shards
(or use dfs_query_then_fetch)
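To illustrate why the distribution over shards matters, here is a small sketch with made-up shard statistics. Each shard computes idf from its own counts, so the same term can get very different weights per shard; computing idf over the combined statistics (which is what dfs_query_then_fetch arranges by collecting term statistics first) removes that skew. The idf formula and the numbers are assumptions for the example.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shard statistics: (documents in shard, documents containing the term)
shards = {"shard_a": (1000, 200), "shard_b": (1000, 5)}

# Idf as each shard sees it in isolation.
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# Idf computed over the combined statistics of all shards.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_df, total_docs)

print(local_idf, global_idf)
```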
Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities named Paris)
Airports?
Hotels? Which one? There are thousands of them.
• Even worse:
What to deliver for query ‘p’ or ‘pa’?
Record boost
• Based on
Popularity
From where booked
Language
o Same (doc language == site language)
o Local translations
o English
o Mismatch
+ or x?
• Boosts are implemented by adding
• Intuitive justification:
Language could be seen as yet another (implicit!) search term
Same for popularity: people are typically not searching for unpopular things
• Example (from an English site):
amsterdam -> amsterdam english popular
But wait…
• How big should the record boost be?
0..1? 100?
• Lucene scores might vary heavily,
sometimes by more than 10x
• So let's take 10 as max record boost
But now the record boost might outweigh smaller scores
• Argggggg….
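The problem can be made concrete with a few invented numbers. The sketch below assumes the additive boost described earlier; the scores themselves are hypothetical.

```python
def final_score(text_score: float, record_boost: float) -> float:
    # Boosts are added to the text score, not multiplied.
    return text_score + record_boost

# A good textual match with a modest record boost...
good_match = final_score(text_score=1.5, record_boost=2.0)

# ...loses against a weak textual match with the maximum record boost of 10.
weak_match = final_score(text_score=0.2, record_boost=10.0)

print(good_match, weak_match)
```

As long as the text scores have no known range, any fixed maximum for the record boost is either too small to matter or large enough to drown the text score.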
Score ranges
• Difficult to tinker with:
For instance, use a stemmed token with boost 0.5
house^1.0 vs houses^0.5
What if the Lucene score of the stemmed token is more than 2 times higher than that of the original token? Then the 0.5 boost no longer ranks it lower.
• We are doing entity search rather than text search
Different scorers
Querying for ‘house’:
Title                            Score:default  Score:BM25  Score:custom
House                            1.22           0.77        1.20
The house                        0.76           0.61        1.10
The little house on the prairie  0.46           0.39        1.05
Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the base score
Idf is normalized between 0 .. 0.2 and added to the base score
Giving a score varying between 1 and 1.4 per term
(sometimes we don’t use idf)
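A rough sketch of such a normalized per-term score. How tf and idf are mapped onto a 0..1 range is not specified here, so the inputs below are assumed to be already normalized and are simply clamped; the function name and parameters are illustrative.

```python
from typing import Optional

def clamp01(value: float) -> float:
    return max(0.0, min(1.0, value))

def term_score(tf_norm: float, idf_norm: Optional[float] = None,
               tf_weight: float = 0.2, idf_weight: float = 0.2) -> float:
    """Base score 1.0 plus up to 0.2 for tf and (optionally) up to 0.2 for idf."""
    score = 1.0 + tf_weight * clamp01(tf_norm)
    if idf_norm is not None:  # idf is sometimes left out
        score += idf_weight * clamp01(idf_norm)
    return score

print(term_score(0.0, 0.0), term_score(1.0, 1.0), term_score(1.0))
```

Because every term lands in a narrow, known range, boosts added later can be sized against that range.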
Language boosting
• Same language or English: +0.7
• Local language: +0.3
(Roma vs Rome on an English site)
• Mismatched language: -0.3
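The rules above could be sketched as a small function. The parameter names, the use of language codes, and the idea that each destination has one known local language are assumptions made for the example.

```python
def language_boost(doc_lang: str, site_lang: str, local_lang: str) -> float:
    """Additive boost for a name, based on the language it is written in."""
    if doc_lang == site_lang or doc_lang == "en":
        return 0.7   # same language as the site, or English
    if doc_lang == local_lang:
        return 0.3   # local name, e.g. 'Roma' on an English site
    return -0.3      # any other language is a mismatch

print(language_boost("en", "en", "it"))  # Rome
print(language_boost("it", "en", "it"))  # Roma
print(language_boost("de", "en", "it"))  # Rom
```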
About N-grams
• For auto-complete: left-edge N-Grams
• Rome:
rome
rom
ro
r
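Generating left-edge N-grams is straightforward; a minimal sketch (the function name and the lowercasing step are assumptions):

```python
def edge_ngrams(token: str, min_len: int = 1) -> list:
    """Return all prefixes of the token, longest first."""
    token = token.lower()
    return [token[:n] for n in range(len(token), min_len - 1, -1)]

print(edge_ngrams("Rome"))
```

Indexing every prefix means a partially typed query can be matched with a plain term lookup.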
About N-grams
• When a user types ‘ro’…
Rome
Ródos
Rotterdam
Etc
• Score depends on percentage of match
(or Levenshtein distance)
[vertical axis: score]
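A sketch of scoring by percentage of match: the fraction of the name that the typed prefix covers. Accent folding is added here so that 'ro' matches 'Ródos'; that step, like the function names, is an assumption for the example.

```python
import unicodedata

def fold(text: str) -> str:
    """Lowercase and strip accents so that 'Ródos' compares equal to 'rodos'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def match_fraction(typed: str, name: str) -> float:
    """Fraction of the name covered by the typed prefix (0.0 if it is no prefix)."""
    typed, name = fold(typed), fold(name)
    if not typed or not name.startswith(typed):
        return 0.0
    return len(typed) / len(name)

names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: match_fraction("ro", n), reverse=True)
print(ranked)
```

Shorter names win for the same prefix because a larger share of the name has already been typed.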
Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query
Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker=0: score is the max score
Tiebreaker=1: score is the sum of all the individual scores (same behavior as boolean OR)
Dismax example
• Q= the house
Suppose S[the] = 0.8, S[house]=1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28
This makes documents containing ‘the house’ a little bit more important than ‘house’ only.
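The dismax formula and the numbers from this example can be checked with a few lines of code (the function name is illustrative):

```python
def dismax(scores, tie_breaker: float) -> float:
    """Score = max + (sum - max) * tieBreaker."""
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

scores = [0.8, 1.2]  # S[the], S[house]
print(dismax(scores, 1.0))  # behaves like a boolean OR (sum)
print(dismax(scores, 0.0))  # only the best score counts
print(dismax(scores, 0.1))  # the best score plus 10% of the rest
```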
Difficulties
• Lack of context
• Hard to create a reliable scoring model
Different approach
• Canonical name:
Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands
Weighting fields
• All fields are equal but some fields are more equal than others…
Self name is most important
Other names (like the city where a hotel resides) are less important
• Dismax over self name and other
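In Elasticsearch query DSL, a weighted dismax over two fields could look roughly like the request body built below. The field names self_name and rest, the boost values, and the tie breaker are hypothetical; only the dis_max and match query shapes come from the Elasticsearch query DSL.

```python
import json

def destination_query(text: str, tie_breaker: float = 0.1) -> dict:
    """Build a dis_max request body over a (hypothetical) self_name and rest field."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [
                    # The self name weighs more than the surrounding location info.
                    {"match": {"self_name": {"query": text, "boost": 2.0}}},
                    {"match": {"rest": {"query": text, "boost": 1.0}}},
                ],
            }
        }
    }

body = destination_query("hotel v amsterdam")
print(json.dumps(body, indent=2))
```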
Payload
• Small piece of information that is added to every occurrence of a token
• Basically a byte[]
Nowadays: payloads
• We need more information per occurrence of a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc.)
• All the above info is encoded in a 32-bit integer, and indexed as a payload
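A sketch of how such information could be packed into a 32-bit integer and turned into payload bytes. The bit layout (8 bits for the length, 1 bit for the self-name flag, 4 bits for the type) and the list of types are invented for the example; the actual layout is not given here.

```python
import struct

# Hypothetical list of name types; the position in the list is the type id.
NAME_TYPES = ["hotel", "city", "landmark", "region", "airport"]

def encode_payload(token_length: int, is_self_name: bool, name_type: str) -> bytes:
    """Pack the per-occurrence info into 4 bytes (assumed layout, see above)."""
    type_id = NAME_TYPES.index(name_type)
    value = (token_length & 0xFF) | (int(is_self_name) << 8) | (type_id << 9)
    return struct.pack(">I", value)  # big-endian unsigned 32-bit integer

def decode_payload(payload: bytes):
    """Reverse of encode_payload: returns (token_length, is_self_name, name_type)."""
    (value,) = struct.unpack(">I", payload)
    return value & 0xFF, bool((value >> 8) & 1), NAME_TYPES[(value >> 9) & 0xF]

payload = encode_payload(token_length=5, is_self_name=True, name_type="hotel")
print(len(payload), decode_payload(payload))
```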
Dismax vs payload
• With field info in the payload we can simulate dismax behavior
• We query only 1 index field (instead of 5)
• Context: easier to do advanced scoring: all info is in 1 scorer.
• Payloads *are* possible in Elasticsearch, but more difficult to use
Search
• Difficult
• Sensitive equilibrium
• Impossible to serve them all
Suits
• Reasons for people to wear a suit might include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc

Combining fields
• To prevent double counting, a dismax is advised.
• The fact that a term occurs in both the title and the abstract doesn’t make it roughly twice as important.
But it does make it somewhat more important
De Bitmanager, 2016
Combining fields
• Intuitive reaction: query terms in each other's neighborhood are more important…
• Example: search for a book:
chamber secrets rowling
• Expected top result:
Harry Potter and the Chamber of Secrets / J.K. Rowling
Combining fields
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
• More important if in the same field?
Combining fields
• But: we get an excerpt book that contains the requested terms
(all terms were present in the abstract field)
• Phrases behave even worse
Combining fields
• Suppose:
We have 2 fields: F1 and F2
And 2 query terms: qt1 and qt2
• Now we have choices how to combine…
Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
This will prefer records where both terms are found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
This behaves more as if there were no fields
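The difference between the two combinations can be reproduced with a toy scorer. The per-field term scores below are invented (they are not the Lucene scores from the book example), and the tie breaker of 0.1 is an assumption; "|" is modeled as a plain sum.

```python
def dismax(scores, tie_breaker):
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

FIELDS = ["author", "title", "abstract"]
TERMS = ["chamber", "secrets", "rowling"]

def fields_first(doc, tie_breaker=0.1):
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2): sum per field, dismax over fields."""
    per_field = [sum(doc[f].get(t, 0.0) for t in TERMS) for f in FIELDS]
    return dismax(per_field, tie_breaker)

def terms_first(doc, tie_breaker=0.1):
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2): dismax per term, then sum."""
    return sum(dismax([doc[f].get(t, 0.0) for f in FIELDS], tie_breaker) for t in TERMS)

# Hypothetical per-field term scores for the two books.
excerpt = {"author": {}, "title": {},
           "abstract": {"chamber": 0.7, "secrets": 0.7, "rowling": 0.7}}
original = {"author": {"rowling": 1.0},
            "title": {"chamber": 0.6, "secrets": 0.6}, "abstract": {}}

print(fields_first(excerpt), fields_first(original))
print(terms_first(excerpt), terms_first(original))
```

With these numbers the first combination ranks the excerpt book on top because all terms sit in one field, while the second ranks the original book on top.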
Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
Combining fields
• Of course: way more possibilities.
See the multi-match query for examples
Most but not all possibilities can be done by hand (blending)
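For reference, a sketch of two multi-match variants as request bodies. The field names are taken from the book example; which type fits best is a judgment call. best_fields scores each field separately and keeps the best one (dismax style), while cross_fields treats the fields as if they were one big field.

```python
import json

FIELDS = ["title", "author", "abstract"]

def multi_match(text: str, match_type: str) -> dict:
    """Build a multi_match request body of the given type."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": FIELDS,
                "type": match_type,
            }
        }
    }

best_fields = multi_match("chamber secrets rowling", "best_fields")
cross_fields = multi_match("chamber secrets rowling", "cross_fields")
print(json.dumps(cross_fields, indent=2))
```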
Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
o Search ‘rowling’ anywhere, score 1
o Search ‘potter’ anywhere, score 1
o Combine with additional queries to do a finishing touch
Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• When creating a new application, start by making explain part of your infrastructure
• At least expose the scores in debug mode.
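One way to wire this in is a debug switch that adds "explain": true to the search request body, so every hit comes back with a breakdown of its score. The function, the field name, and the debug flag are hypothetical; only the explain option itself is part of the Elasticsearch search API.

```python
def search_body(query_text: str, debug: bool = False) -> dict:
    """Build a search request; in debug mode, ask for a score explanation per hit."""
    body = {"query": {"match": {"name": query_text}}}  # 'name' is a hypothetical field
    if debug:
        body["explain"] = True
    return body

print(search_body("paris"))
print(search_body("paris", debug=True))
```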
Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not imply that:
I am trustworthy
I am competent
You Know, for Bits…
Peter @ bitmanager.nl
