You know, for search

De Bitmanager, 2016
You Know, for Search
Peter van der Weerd

Who am I?
• Peter van der Weerd
• Search specialist
• Self-employed: De Bitmanager
• Enormous span of control 

Search
• Common sense:
Easy
Solved

Yeah, true…
• Install ES
• Fill it with some data
• And o/: we can search

But…
• Are the users satisfied?
• Many people struggle with sub-optimal search results.

Search as a toolbox
• It consists of 1 or more(!) tools to find what you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…

Search at Booking
• Destination based (city, region, airport, etc.)
• Autocomplete
Results in max 5 destinations, one query per keystroke
• Disambiguation
Show a partitioned result that enables people to choose a destination

Autocomplete in action

Disambiguation in action

Scoring
• In general, Lucene scores roughly like tf * idf
• Tf = term frequency
The more often a term matches in a document, the more important the match
• Idf = inverse document frequency
The more documents match the term, the less important the term
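
As a rough illustration of tf * idf, here is a minimal sketch. The square-root tf and logarithmic idf resemble Lucene's classic similarity, but the function names and the toy numbers are assumptions made up for this example, and real Lucene scoring has more factors (length normalization, for instance).

```python
import math


def tf(freq: int) -> float:
    # Term frequency: grows with the number of occurrences, but sub-linearly.
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    # Inverse document frequency: the more documents contain the term,
    # the lower the value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf_idf(freq: int, doc_freq: int, num_docs: int) -> float:
    return tf(freq) * idf(doc_freq, num_docs)


# Toy index of 1000 documents (made-up numbers).
rare = tf_idf(freq=1, doc_freq=9, num_docs=1000)      # term occurs in 9 docs
common = tf_idf(freq=1, doc_freq=999, num_docs=1000)  # term occurs in 999 docs
print(rare > common)  # the rare term outweighs the common one
```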

Term frequency
• Gives more importance to terms that occur relatively often (relative to the length of the field).
• Scoring examples for ‘house’
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
[Figure: vertical "score" axis next to the examples above]

Inverse document frequency
• Prefers less frequent tokens.
• Useless for single-token queries: it only serves to score multiple tokens relative to each other
• Examples:
house
little
on
the
[Figure: vertical "score" axis next to the examples above]

Drawback of idf
• Another example…
Pekela
Haarlem
Amsterdam
Paris
• Booking switched off idf, but could have used df instead…
[Figure: vertical "score" axis next to the place names above]

When does idf work
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed over shards (or use dfs_query_then_fetch)

Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities called Paris)
Airports?
Hotels? Which one? There are thousands of them.
• Even worse:
What to deliver for query ‘p’ or ‘pa’?

Record boost
• Based on
Popularity
From where booked
Language
o Same (doc language == site language)
o Local translations
o English
o Mismatch

+ or x?
• Boosts are implemented by adding them to the score (not by multiplying)
• Intuitive justification:
Language could be seen as yet another (implicit!) search term
Same for popularity: people are typically not searching for unpopular things
• Example (from an English site):
amsterdam -> amsterdam english popular

But wait…
• How big should the record-boost be?
0..1? 100?
• Lucene scores might vary heavily, sometimes by more than 10x
• So let's take 10 as max record-boost
But now the record-boost might outweigh smaller scores
• Argggggg….

Score ranges
• Difficult to tinker with:
For instance, use a stemmed token with boost 0.5:
house^1.0 vs houses^0.5
What if the raw Lucene score for one form is more than 2 times higher than for the other? Then the 0.5 boost no longer decides the order.
• We are doing entity search vs text search

Different scorers
Querying for ‘house’:
Title                            Score:default  Score:BM25  Score:custom
House                            1.22           0.77        1.20
The house                        0.76           0.61        1.10
The little house on the prairie  0.46           0.39        1.05

Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the base score
Idf is normalized between 0 .. 0.2 and added to the base score
Giving a score varying between 1 and 1.4 per term
(sometimes we don’t use idf)
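
A minimal sketch of this normalization. The slide does not say how tf and idf are scaled, so the inputs here are assumed to be pre-scaled ratios between 0 and 1; the function and parameter names are hypothetical.

```python
BASE_SCORE = 1.0  # every matching term starts at 1.0
MAX_PART = 0.2    # tf and idf each contribute at most 0.2


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def normalized_term_score(tf_ratio: float, idf_ratio: float,
                          use_idf: bool = True) -> float:
    # tf_ratio and idf_ratio are assumed to be scaled to 0..1 beforehand;
    # how that scaling is done is not specified in the slides.
    score = BASE_SCORE + MAX_PART * clamp01(tf_ratio)
    if use_idf:
        score += MAX_PART * clamp01(idf_ratio)
    return score


print(normalized_term_score(0.0, 0.0))  # lower bound of the range
print(normalized_term_score(1.0, 1.0))  # upper bound of the range
```

Because every term ends up in a narrow band around 1.0, a fixed boost such as +0.3 has a predictable effect on the final order.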

Language boosting
• Same language or English: +0.7
• Local language: +0.3
(Roma vs Rome on an English site)
• Mismatched language: -0.3
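
The language boost could be sketched as a simple lookup that is added to the term score. The boost values come from the slide; the category names, the function name and the example term score of 1.2 are assumptions for illustration.

```python
# Boost values from the slide; the category names are hypothetical.
LANGUAGE_BOOST = {
    "same": 0.7,       # document language == site language
    "english": 0.7,    # English name
    "local": 0.3,      # local-language name, e.g. Roma on an English site
    "mismatch": -0.3,  # any other language
}


def boosted_score(term_score: float, language_match: str) -> float:
    # Boosts are added to the term score, not multiplied.
    return term_score + LANGUAGE_BOOST[language_match]


# On an English site, 'Rome' (English) should outrank 'Roma' (local name).
rome = boosted_score(1.2, "english")
roma = boosted_score(1.2, "local")
print(rome > roma)
```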

About N-grams
• For auto-complete: left-edge N-Grams
• Rome:
rome
rom
ro
r
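
The expansion above can be sketched in a few lines. This is only an illustration of left-edge N-grams, not the analyzer that was actually used; the function name and the lowercasing step are assumptions.

```python
from typing import List


def left_edge_ngrams(token: str, min_len: int = 1) -> List[str]:
    # All prefixes of the token, longest first.
    token = token.lower()
    return [token[:n] for n in range(len(token), min_len - 1, -1)]


print(left_edge_ngrams("Rome"))
```

Indexing these prefixes lets a query for 'ro' be answered by an exact term lookup instead of a wildcard search.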

About N-grams
• When a user types ‘ro’…
Rome
Ródos
Rotterdam
Etc
• Score depends on percentage of match
(or Levenshtein distance)
[Figure: vertical "score" axis next to the examples above]
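
A minimal sketch of scoring by percentage of match: the typed prefix length divided by the length of the full name. The accent folding (so that 'ro' matches 'Ródos') and both function names are assumptions; the slides do not show the actual formula.

```python
import unicodedata


def fold(text: str) -> str:
    # Lowercase and strip accents so that 'ro' can match 'Ródos'.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prefix_match_ratio(typed: str, name: str) -> float:
    # Fraction of the name that is covered by the typed prefix.
    typed, name = fold(typed), fold(name)
    if not typed or not name.startswith(typed):
        return 0.0
    return len(typed) / len(name)


names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: prefix_match_ratio("ro", n), reverse=True)
print(ranked)
```

Shorter names win because the same two typed characters cover a larger part of them.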

Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query

Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker=0
Score is the max. score
Tiebreaker=1
Score is the sum of all the individual scores
(same behavior as boolean or)

Dismax example
• Q= the house
Suppose S[the] = 0.8, S[house]=1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28
this makes documents containing ‘the house’ a
little bit more important than ‘house’ only.
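
The numbers above can be checked with a small sketch of the dismax formula, Score = max + (sum - max) * tieBreaker. The function name is an assumption.

```python
from typing import Sequence


def dismax(scores: Sequence[float], tie_breaker: float) -> float:
    # The best score plus a fraction of all the other scores.
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker


scores = [0.8, 1.2]  # S[the], S[house]
for tie_breaker in (1.0, 0.0, 0.1):
    print(tie_breaker, round(dismax(scores, tie_breaker), 2))
```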

Difficulties
• Lack of context
• Hard to create a reliable scoring model

Different approach
• Canonical name:
 Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands

Weighting fields
• All fields are equal but some fields are more
equal than others…
Self name is most important
Other names (like the city where a hotel resides)
are less important
• Dismax over self name and other

Payload
• Small piece of information that is added to
every occurrence
• Basically a byte[]

Nowadays: payloads
• We need more information per occurrence of
a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc)
• All the above info is encoded in a 32 bit
integer, and indexed as a payload
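
A sketch of how such information might be packed into a 32-bit integer. The bit layout, the type codes and the function names are entirely hypothetical; the slides only say that the three pieces of information are encoded in one integer.

```python
from typing import Tuple

# Hypothetical layout: bits 0-7 token length, bit 8 self-name flag,
# bits 9-15 name type. The remaining bits are unused.
NAME_TYPES = {"hotel": 0, "city": 1, "landmark": 2}


def encode_payload(token_length: int, is_self_name: bool, name_type: str) -> int:
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token_length must fit in 8 bits")
    return token_length | (int(is_self_name) << 8) | (NAME_TYPES[name_type] << 9)


def decode_payload(payload: int) -> Tuple[int, bool, str]:
    type_code = (payload >> 9) & 0x7F
    name_type = next(n for n, c in NAME_TYPES.items() if c == type_code)
    return payload & 0xFF, bool((payload >> 8) & 1), name_type


def to_bytes(payload: int) -> bytes:
    # A payload is basically a byte[]: here 4 bytes, big-endian.
    return payload.to_bytes(4, "big")


p = encode_payload(5, True, "hotel")  # e.g. the token 'hotel' in a self name
print(p, to_bytes(p), decode_payload(p))
```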

Dismax vs payload
• With field info in the payload we can simulate dismax behavior
• We query only 1 index field (instead of 5)
• Context: advanced scoring becomes easier, because all info is available in 1 scorer.
• Payloads *are* possible in Elasticsearch, but more difficult to use

Search
• Difficult
• Sensitive equilibrium
• Impossible to serve them all

Suits
• Reasons for people to wear a suit might
include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc


Combining fields
• To prevent double counting, a dismax is advised.
• The fact that a term occurs in both the title and the abstract doesn’t make it roughly twice as important.
But it does make it somewhat more important

Combining fields
• Intuitive reaction: query terms in each other's neighborhood are more important…
• Example: search for a book:
chamber secrets rowling
• Expected top result:
Harry Potter and the Chamber of Secrets/J.K.
Rowling

Combining fields
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
• More important if in the same field?

Combining fields
• But: we get an excerpt book that merely contains the requested terms
(all terms were present in the abstract field)
• Phrases behave even worse

Combining fields
• Suppose:
 we have 2 fields: F1 and F2
 2 query terms: qt1 and qt2
• Now we have choices how to combine…

Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
 this prefers records where both terms are found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
 this behaves more as if there were no fields
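
The difference between the two combinations can be made concrete with a small sketch. Here '|' is treated as a boolean OR that sums scores; the per-field term scores of 1.0, the two toy documents and all function names are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple

Doc = Dict[Tuple[str, str], float]  # (field, term) -> score of that match


def dismax(scores: Sequence[float], tie_breaker: float = 0.0) -> float:
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker


def per_field_then_dismax(doc: Doc, fields, terms, tie_breaker=0.0) -> float:
    # (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
    per_field = [sum(doc.get((f, t), 0.0) for t in terms) for f in fields]
    return dismax(per_field, tie_breaker)


def per_term_then_sum(doc: Doc, fields, terms, tie_breaker=0.0) -> float:
    # (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
    return sum(dismax([doc.get((f, t), 0.0) for f in fields], tie_breaker)
               for t in terms)


fields, terms = ["F1", "F2"], ["qt1", "qt2"]
same_field = {("F1", "qt1"): 1.0, ("F1", "qt2"): 1.0}  # both terms in F1
split = {("F1", "qt1"): 1.0, ("F2", "qt2"): 1.0}       # one term per field

print(per_field_then_dismax(same_field, fields, terms),
      per_field_then_dismax(split, fields, terms))
print(per_term_then_sum(same_field, fields, terms),
      per_term_then_sum(split, fields, terms))
```

With the first combination the document that has both terms in one field wins; with the second both documents score the same, as if the fields did not exist.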

Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038,
"_score": 1.2030121,

Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253,
"_score": 2.0767038,

Combining fields
• Of course, there are way more possibilities.
See the multi_match query for examples
Most but not all possibilities can be done by hand (blending)
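
As an illustration, a multi_match request body for the book example might be built like this. The field names and the query text come from the earlier slides; the chosen type, the tie breaker value and the helper name are assumptions, not a recommendation from the talk.

```python
from typing import List


def multi_match_body(text: str, fields: List[str],
                     match_type: str = "cross_fields",
                     tie_breaker: float = 0.3) -> dict:
    # The 'type' selects how the per-field scores are combined,
    # e.g. best_fields, most_fields or cross_fields.
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": match_type,
                "tie_breaker": tie_breaker,
            }
        }
    }


body = multi_match_body("chamber secrets rowling",
                        ["title", "author", "abstract"])
print(body)
```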

Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
o Search ‘rowling’ anywhere, score 1
o Search ‘potter’ anywhere, score 1
o Combine with additional queries to do a finishing touch

Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• Create a new application by first making
explain part of your infrastructure
• At least expose the scores in debug mode.
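
One way to make explain part of the infrastructure is a debug switch on the function that builds the search request. A minimal sketch, assuming an Elasticsearch-style request body; the helper name and the debug flag are hypothetical.

```python
def search_body(query: dict, debug: bool = False) -> dict:
    # In debug mode, ask the engine to explain the score of every hit.
    body = {"query": query}
    if debug:
        body["explain"] = True
    return body


query = {"match": {"title": "house"}}
print(search_body(query, debug=True))
```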

Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not
imply that:
I am trustworthy
I am competent

You Know, for Bits…
Peter @ bitmanager.nl
