5. De Bitmanager, 2016
But…
• Are the users satisfied?
• Many people struggle with sub-optimal search
results.
6. De Bitmanager, 2016
Search as a toolbox
• It consists of 1 or more(!) tools to find what
you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…
7. De Bitmanager, 2016
Search at Booking
• Destination based (city, region, airport, etc)
• Autocomplete
Results in max 5 destinations, query per
keystroke
• Disambiguation
Show a partitioned result that lets people
choose a destination
11. De Bitmanager, 2016
Scoring
• Lucene scores, in general, as tf * idf
• Tf = term frequency
The more often a term occurs in a document, the more important
• Idf = inverse document frequency
The more documents contain the term,
the less important
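A minimal sketch of the tf * idf idea, not Lucene's actual formula: here tf is the relative frequency of the term in the document and idf is log(N / df) + 1. Both choices, and the names `tf_idf` and `corpus`, are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document as tf * idf (simplified, illustrative)."""
    # tf: share of the document's tokens that equal the term
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # df: number of documents that contain the term at all
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    # idf: shrinks as more documents contain the term
    idf = math.log(len(corpus) / df) + 1.0
    return tf * idf

corpus = [
    "house".split(),
    "the house".split(),
    "the little house on the prairie".split(),
]
scores = [tf_idf("house", doc, corpus) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(" ".join(doc), round(score, 3))
```

With this simplified formula, the shorter the document that contains 'house', the higher its score.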
12. De Bitmanager, 2016
Term frequency
• Gives more importance to terms that occur
relatively often within a document.
• Scoring examples for ‘house’
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
(Figure: vertical "score" axis next to the examples above)
13. De Bitmanager, 2016
Inverse document frequency
• Prefers less frequent tokens.
• Useless on single-token queries: it only serves
to score multiple tokens relative to each other
• Examples:
house
little
on
the
(Figure: vertical "score" axis next to the examples above)
14. De Bitmanager, 2016
Drawback of idf
• Other example…
Pekela
Haarlem
Amsterdam
Paris
• Booking switched off idf, but could have used
df instead…
(Figure: vertical "score" axis next to the examples above)
15. De Bitmanager, 2016
When does idf work
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed
over shards
(or use dfs_query_then_fetch)
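A small sketch of why the distribution over shards matters, assuming each shard computes idf from its own local statistics. The shard sizes, document frequencies and the idf formula are made-up illustration values. Elasticsearch's `dfs_query_then_fetch` search type avoids the problem by gathering term statistics from all shards before scoring.

```python
import math

def idf(doc_count, doc_freq):
    """Illustrative idf: log(N / df) + 1."""
    return math.log(doc_count / doc_freq) + 1.0

# Hypothetical shards: (documents in shard, documents containing the term)
shards = [(1000, 10), (1000, 400)]

# Each shard scoring on its own sees a very different idf for the same term
local_idfs = [idf(n, df) for n, df in shards]

# Statistics summed over all shards give one consistent idf
total_docs = sum(n for n, _ in shards)
total_df = sum(df for _, df in shards)
global_idf = idf(total_docs, total_df)

print([round(x, 2) for x in local_idfs], round(global_idf, 2))
```

With skewed shards the same term gets clearly different weights depending on where a document happens to live, so scores from different shards are not comparable.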
16. De Bitmanager, 2016
Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities named Paris)
Airports?
Hotels? Which one? There are thousands of them.
• Even worse:
What to deliver for query ‘p’ or ‘pa’?
17. De Bitmanager, 2016
Record boost
• Based on
Popularity
From where booked
Language
o Same (doc language == site language)
o Local translations
o English
o Mismatch
18. De Bitmanager, 2016
+ or x?
• Boosts are implemented by adding
• Intuitive justification:
Language could be seen as yet another (implicit!)
search term
Same for popularity: people are typically not
searching for unpopular things
• Example (from an English site):
amsterdam->amsterdam english popular
19. De Bitmanager, 2016
But wait…
• How big should the record-boost be?
0..1? 100?
• Lucene scores might vary heavily,
sometimes by more than 10x
• So let's take 10 as the max record-boost
But now the record-boost might outweigh smaller
scores
• Argggggg….
20. De Bitmanager, 2016
Score ranges
• Difficult to tinker with:
For instance use a stemmed token with boost 0.5
house^1.0 vs houses^0.5
What if the Lucene score of the stemmed variant is more than
2 times higher than that of the original token?
Then the 0.5 boost no longer ranks it lower.
• We are doing entity search vs text search
21. De Bitmanager, 2016
Different scorers
Querying for ‘house’:
Title                           | Score: default | Score: BM25 | Score: custom
House                           | 1.22           | 0.77        | 1.20
The house                       | 0.76           | 0.61        | 1.10
The little house on the prairie | 0.46           | 0.39        | 1.05
22. De Bitmanager, 2016
Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the
base score
Idf is normalized between 0 .. 0.2 and added to the
base score
Giving a score varying between 1 and 1.4 per term
(sometimes we don’t use idf)
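A minimal sketch of the normalized per-term score described above. It assumes the tf and idf inputs have already been scaled to 0..1; how that scaling is done is not given on the slide. The names `term_score` and `clamp01` are hypothetical.

```python
def clamp01(x):
    """Keep a value inside the 0..1 range."""
    return max(0.0, min(1.0, x))

def term_score(tf_fraction, idf_fraction=None, base=1.0, max_part=0.2):
    """Base score plus up to 0.2 for tf and, optionally, up to 0.2 for idf."""
    score = base + max_part * clamp01(tf_fraction)
    if idf_fraction is not None:  # idf is sometimes left out
        score += max_part * clamp01(idf_fraction)
    return score

print(term_score(0.0, 0.0))  # lower bound
print(term_score(1.0, 1.0))  # upper bound with idf
print(term_score(1.0))       # upper bound without idf
```

Because every term lands in a narrow band around 1.0, a fixed boost such as +0.7 has a predictable effect relative to the term score.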
23. De Bitmanager, 2016
Language boosting
• Same language or English: +0.7
• Local language: +0.3
(Roma vs Rome in an English site)
• Mismatched language: -0.3
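The boosts above could be applied as in the following sketch. The function names, the language codes and the `local_langs` parameter (the languages spoken at the destination) are assumptions; only the three boost values come from the slide.

```python
def language_boost(doc_lang, site_lang, local_langs=()):
    """Additive boost for the language of a name (values from the slide)."""
    if doc_lang == site_lang or doc_lang == "en":
        return 0.7   # same language as the site, or English
    if doc_lang in local_langs:
        return 0.3   # local language of the destination
    return -0.3      # mismatched language

def boosted(term_score, doc_lang, site_lang, local_langs=()):
    """Boosts are added to the score, not multiplied."""
    return term_score + language_boost(doc_lang, site_lang, local_langs)

# 'Rome' (English) vs 'Roma' (Italian, local) on an English site
print(boosted(1.0, "en", "en", local_langs=("it",)))
print(boosted(1.0, "it", "en", local_langs=("it",)))
```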
25. De Bitmanager, 2016
About N-grams
• When a user types ‘ro’…
Rome
Ródos
Rotterdam
Etc
• Score depends on percentage of match
(or Levenshtein distance)
(Figure: vertical "score" axis next to the examples above)
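One way to read "percentage of match" is the share of the name that the typed prefix covers. The sketch below assumes exactly that, plus accent folding so that 'ro' matches 'Ródos'. Both are illustrative choices, not necessarily what Booking does.

```python
import unicodedata

def fold(text):
    """Lowercase and strip accents so that 'ro' can match 'Ródos'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def prefix_match_fraction(query, name):
    """Fraction of the name covered by the typed prefix, 0.0 if no match."""
    q, n = fold(query), fold(name)
    if not q or not n.startswith(q):
        return 0.0
    return len(q) / len(n)

names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: prefix_match_fraction("ro", n), reverse=True)
print(ranked)
```

Shorter names rank higher for the same prefix, because the prefix covers a larger part of them.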
26. De Bitmanager, 2016
Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query
27. De Bitmanager, 2016
Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker=0
Score is the max. score
Tiebreaker=1
Score is the sum of all the individual scores
(same behavior as a boolean OR)
28. De Bitmanager, 2016
Dismax example
• Q= the house
Suppose S[the] = 0.8, S[house]=1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28
This makes documents containing ‘the house’ a
little bit more important than those containing ‘house’ only.
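The dismax formula and the three scores above can be checked with a few lines. The function name `dismax` is just a label for this sketch, not a library call.

```python
def dismax(scores, tie_breaker=0.0):
    """Score = max + (sum - max) * tie_breaker."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

s = [0.8, 1.2]  # S[the], S[house]
for tie in (1.0, 0.0, 0.1):
    print(tie, round(dismax(s, tie), 2))
```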
30. De Bitmanager, 2016
Different approach
• Canonical name:
Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands
31. De Bitmanager, 2016
Weighting fields
• All fields are equal but some fields are more
equal than others…
Self name is most important
Other names (like the city where a hotel resides)
are less important
• Dismax over self name and other
33. De Bitmanager, 2016
Nowadays: payloads
• We need more information per occurrence of
a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc)
• All the above info is encoded in a 32 bit
integer, and indexed as a payload
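A sketch of packing this information into one integer. The bit layout (8 bits for the length, 1 bit for the self-name flag, 4 bits for the type) and the type codes are assumptions; the slide only says that everything fits in 32 bits.

```python
# Hypothetical type codes; the real list is not given on the slide
NAME_TYPES = ["hotel", "city", "landmark", "region", "airport"]

def encode_payload(token_length, is_self_name, name_type):
    """Pack the per-occurrence info into one integer (fits in 32 bits)."""
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token_length must fit in 8 bits")
    type_code = NAME_TYPES.index(name_type)
    return token_length | (int(is_self_name) << 8) | (type_code << 9)

def decode_payload(payload):
    """Inverse of encode_payload."""
    token_length = payload & 0xFF
    is_self_name = bool((payload >> 8) & 1)
    name_type = NAME_TYPES[(payload >> 9) & 0xF]
    return token_length, is_self_name, name_type

payload = encode_payload(5, True, "hotel")
print(payload, decode_payload(payload))
```

A scorer that reads the payload gets the original token length, the field role and the name type in one place, without querying several index fields.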
34. De Bitmanager, 2016
Dismax vs payload
• With fieldinfo in the payload we can simulate
dismax behavior
• We query only 1 index-field (instead of 5)
• Context: easier to do advanced scoring: all info
is in 1 scorer.
• Payloads *are* possible in ElasticSearch, but
more difficult to use
37. De Bitmanager, 2016
Suits
• Reasons for people to wear a suit might
include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc
38. De Bitmanager, 2016
Combining fields
• To prevent double counting, a dismax is
advised.
• The fact that a term occurs in both the title and
the abstract doesn’t make it roughly twice as
important.
But it does make it somewhat more important
39. De Bitmanager, 2016
Combining fields
• Intuitive reaction: query terms in each other's
neighborhood are more important…
• Example: search for a book:
chamber secrets rowling
• Expected top result:
Harry Potter and the Chamber of Secrets/J.K.
Rowling
40. De Bitmanager, 2016
Combining fields
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
• More important if in the same field?
41. De Bitmanager, 2016
Combining fields
• But: we get an excerpt book that contains the
requested terms
(all terms were present in the abstract field)
• Phrases behave even worse
42. De Bitmanager, 2016
Combining fields
• Suppose:
we have 2 fields: F1 and F2
2 query terms: qt1 and qt2
• Now we have choices how to combine…
43. De Bitmanager, 2016
Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
this will prefer records where both terms are
found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
this behaves more as if there were no
fields
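The two combinations can be compared on toy scores. In this sketch "|" is treated as a plain sum (boolean OR), every matching term scores 1.0, and the tie breaker is 0.1. All of these, and the function names, are assumptions for illustration.

```python
def dismax(scores, tie_breaker):
    """Score = max + (sum - max) * tie_breaker."""
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

def per_field_first(s, fields, terms, tie):
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)"""
    per_field = [sum(s.get((f, t), 0.0) for t in terms) for f in fields]
    return dismax(per_field, tie)

def per_term_first(s, fields, terms, tie):
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)"""
    return sum(dismax([s.get((f, t), 0.0) for f in fields], tie) for t in terms)

fields, terms = ["F1", "F2"], ["qt1", "qt2"]
same_field = {("F1", "qt1"): 1.0, ("F1", "qt2"): 1.0}    # both terms in F1
split_fields = {("F1", "qt1"): 1.0, ("F2", "qt2"): 1.0}  # one term per field

for name, combine in (("per field", per_field_first), ("per term", per_term_first)):
    print(name, combine(same_field, fields, terms, 0.1),
          combine(split_fields, fields, terms, 0.1))
```

The first form scores the same-field record clearly higher; the second form gives both records the same score, as if there were no fields.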
44. De Bitmanager, 2016
Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
45. De Bitmanager, 2016
Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
46. De Bitmanager, 2016
Combining fields
• Of course: way more possibilities.
See the multi-match query for examples
Most but not all possibilities can be done by hand
(blending)
47. De Bitmanager, 2016
Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
o Search ‘rowling’ anywhere, score 1
o Search ‘potter’ anywhere, score 1
o Combine with additional queries to do a finishing touch
48. De Bitmanager, 2016
Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• Create a new application by first making
explain part of your infrastructure
• At least expose the scores in debug mode.
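In Elasticsearch, setting `"explain": true` in the search request body returns a score explanation per hit. For custom scoring code, one hedged way to make explain part of the infrastructure is to let every scorer return its breakdown next to the score. The function below is a hypothetical sketch of that idea.

```python
def score_with_explain(term_scores, record_boost, debug=False):
    """Return (score, explanation); the explanation is built in debug mode only."""
    total = sum(term_scores.values()) + record_boost
    if not debug:
        return total, None
    lines = [f"{total:.2f} = sum of:"]
    for term, score in term_scores.items():
        lines.append(f"  {score:.2f} term '{term}'")
    lines.append(f"  {record_boost:.2f} record boost")
    return total, "\n".join(lines)

score, explanation = score_with_explain({"house": 1.2}, 0.7, debug=True)
print(explanation)
```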
49. De Bitmanager, 2016
Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not
imply that:
I am trustworthy
I am competent