2. Motivation
Real-world example
Techniques
Tokenization
Stop words
Normalization
Stemming/lemmatization
3. Using a variety of techniques, we want to
improve IR systems so that they “understand”
more of what we want from a query
E.g. when searching for a paper about
Facebook, the following queries should all
return the paper:
The facebook, facebook, face-book
4.
5.
6.
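7. Damerau–Levenshtein distance is the minimum
number of single-character operations needed to
turn one word into another
Insert
Delete
Change (substitute one character)
Swap (transpose two adjacent characters)
Good: adidas, adiidas and adifas should match
(each misspelling is distance 1 from adidas)
But: cat, rat and hat are also distance 1 apart,
yet they are different words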
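The four operations above can be sketched as a small dynamic program. This is the restricted "optimal string alignment" variant of Damerau–Levenshtein (each substring is edited at most once), which is an assumption made here for brevity; the function name `osa_distance` is hypothetical and not from the slides.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, change and adjacent swap.

    Restricted (optimal string alignment) variant, used as a sketch.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # change (or match)
            )
            # swap of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


for pair in [("adidas", "adiidas"), ("adidas", "adifas"), ("cat", "rat")]:
    print(pair, osa_distance(*pair))
```

All three pairs come out at distance 1, which is exactly the problem on the slide: the distance alone cannot tell a misspelling from a different word.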
8. Breaking up sentences into tokens using a
variety of rules
Split on non-alphanumeric characters?
Good: The dog ran to the park
Bad: Ms. O’Hannety went to O’Flaggerty’s pub
(Ms, O, Hannety, went, to, O, Flaggerty, s, pub)
Split on spaces?
Bad: San Francisco is a great city.
(San and Francisco become separate tokens)
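The two splitting rules can be tried directly with the standard library. This is a minimal sketch; the function names are hypothetical, and the character class assumes plain ASCII letters and digits.

```python
import re


def split_non_alphanumeric(text):
    # Split on any run of characters that are not ASCII letters or digits.
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def split_on_space(text):
    # Split on runs of whitespace only; punctuation stays attached.
    return text.split()


print(split_non_alphanumeric("The dog ran to the park"))
print(split_non_alphanumeric("Ms. O'Hannety went to O'Flaggerty's pub"))
print(split_on_space("San Francisco is a great city."))
```

The second call breaks the names apart at the apostrophes, and the third keeps "city." with its full stop and separates "San" from "Francisco", so neither rule is good enough on its own.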
9. Compound words, e.g.
Lebensversicherungsgesellschaftsangestellter =
life insurance company employee
Would not be split by any of the previously
mentioned methods
10. Drop common ‘useless’ words
But how useless are they? (“President of the USA”)
Keeping them is not a big problem, space- or
time-wise
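A minimal sketch of stop-word removal, assuming a tiny hand-picked stop list (real lists are longer; `STOP_WORDS` and `remove_stop_words` are hypothetical names):

```python
# Tiny illustrative stop list; an assumption, not a standard list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(tokens):
    # Keep only tokens whose lowercase form is not a stop word.
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(remove_stop_words("President of the USA".split()))
```

The query collapses to "President USA", which loses the phrase structure a user may have wanted to match exactly.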
11. What I did at Amazon (codenamed BrandSims
normalization)
Maps words/phrases that are semantically
related to each other, so they can refer to the
same content
E.g. Alan went to the store = Alan go store
12. Accents are mainly dropped since they were
not always supported
Problematic since in certain languages accents
are critical to understanding
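Dropping accents can be sketched with Python's `unicodedata` module: decompose each character, then discard the combining marks. The function name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text):
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé"))
print(strip_accents("pêche"), strip_accents("péché"))
```

The French words pêche (peach, fishing) and péché (sin) both become "peche", which illustrates why this step is risky in languages where accents carry meaning.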
13. Standardize to all caps or all lowercase (more
common)
Everywhere in the sentence?
Bad: We went to the White House
(White House would become white house)
A better solution is to lowercase only at the
beginning of a sentence and in titles
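The difference between the two policies can be sketched as follows, assuming whitespace tokenization and one sentence per string. The function name is hypothetical, and titles are not handled here.

```python
def lowercase_sentence_start(sentence):
    # Lowercase only the first word; leave mid-sentence capitals alone.
    words = sentence.split()
    if words:
        words[0] = words[0].lower()
    return " ".join(words)


text = "We went to the White House"
print(text.lower())                    # lowercases everything
print(lowercase_sentence_start(text))  # keeps "White House" intact
```

This heuristic still fails when a sentence begins with a proper noun, so it is a tradeoff rather than a fix.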
14. More complicated than previous normalization
techniques
Goal is to remove things like tense, number,
possession from strings
15. Stemming: chop off the end of the word
Con: Crude and sometimes ineffective
Pro: Fast and no overhead
E.g. cookies -> cooki, cup -> c
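A crude suffix stripper in this spirit can be sketched with a short ordered rule list. The rules below are illustrative assumptions loosely modelled on the first step of common English stemmers; this is not Porter's algorithm, and `crude_stem` is a hypothetical name.

```python
# (suffix, replacement) pairs, tried in order; first match wins.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def crude_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            # Cut the suffix and append its replacement.
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["cookies", "parks", "running", "glass"]:
    print(w, "->", crude_stem(w))
```

It is fast and needs no dictionary, but the output ("cooki", "runn") is often not a real word, which is the crudeness the slide refers to.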
16. Lemmatization: use a vocabulary list and
morphological (structural) analysis [which may
or may not help much]
Recognizes context in a sentence (saw would
become see if used as a verb, but stay saw as a
noun)
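The context-sensitive lookup can be sketched with a toy table keyed by word and part of speech. The table contents, the tag names and the function name are all hypothetical; a real lemmatizer would use a full vocabulary and a tagger to supply the part of speech.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("went", "VERB"): "go",
    ("cookies", "NOUN"): "cookie",
}


def lemmatize(word, pos):
    # Fall back to the lowercased word when the pair is unknown.
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
```

Unlike suffix chopping, the result is always a dictionary word, at the cost of needing the table and the part-of-speech information.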
Porter’s algorithm:
17. Understand the type of queries that will be
submitted
It is all about tradeoffs between precision and
recall
These techniques can be used differently
depending on the context.
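The precision/recall tradeoff can be made concrete with a small calculation. The document IDs below are invented purely for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved, relevant):
    # precision: fraction of retrieved documents that are relevant
    # recall: fraction of relevant documents that were retrieved
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}                       # made-up relevant documents
strict_match = {1, 2}                         # e.g. exact matching only
loose_match = {1, 2, 3, 4, 5, 6, 7, 8}        # e.g. aggressive normalization

print(precision_recall(strict_match, relevant))
print(precision_recall(loose_match, relevant))
```

In this toy case the strict matcher returns only relevant documents but misses half of them, while the loose matcher finds them all and pads the results with irrelevant ones.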