The basics of search
• A search engine mediates between a user’s query and metadata surrogates for
documents
• Documents are reduced to metadata
• User’s need is translated into a query
• Query terms are used to find matching metadata terms
• Lots and lots of room for error...
The search process
1. Crawl content for metadata
2. Index document terms into an inverted file;
an inverted file is very fast to search
3. Search the index to identify the result set;
search the index - not the documents
4. Rank the results for display;
ranking is the hardest part
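As a rough illustration of steps 2 and 3 above, the sketch below builds a tiny inverted file in memory and searches it instead of the documents. The function names, the sample documents, and the choice of AND semantics for multi-term queries are assumptions made for this example, not details from the slides.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (assumed AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample collection.
docs = {
    1: "How to repair a refrigerator",
    2: "Refrigerator buying guide",
    3: "Office locations and maps",
}
index = build_index(docs)
print(sorted(search(index, "refrigerator guide")))
```

The lookup touches only the postings for the query terms, which is why searching the index is fast compared with scanning every document.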
Search algorithm 1
Term-based Ranking (tf/idf)
• tf = term frequency
documents that use the query terms most are presumed to be most relevant
• idf = inverse document frequency
rarer terms are better indicators of relevance
1) relevance can be measured with document terms
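One common way to combine the two ideas above is to score a document as the sum of tf × idf over the query terms. The sketch below assumes a length-normalized term frequency and a plain logarithmic idf; real engines use many variants, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: how many documents contain each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        if not terms:
            scores[doc_id] = 0.0
            continue
        counts = Counter(terms)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term appears nowhere in the collection
            tf = counts[term] / len(terms)        # assumed normalization
            idf = math.log(n_docs / df[term])     # rarer terms weigh more
            score += tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical sample collection.
docs = {
    "a": "fridge repair fridge manual",
    "b": "fridge sale",
    "c": "office map",
}
scores = tf_idf_scores(["fridge", "repair"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Document "a" ranks first here because it matches the rarer term "repair" as well as "fridge".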
Search algorithm 2
• Relevant set is still identified by term matching
• A revolution in ranking:
based on linking between documents
1) important sites link to other important sites
2) if many people link to a site, it is important
• Authors carefully select articles to cite
• The more citations an article gets,
the better it must be
• Citations by authors who have a lot of citations confer their power to those
they cite
• Aggregate and leverage all these small individual decisions...
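The two link assumptions above can be sketched as a simple iterative score, in the spirit of PageRank-style link analysis. The slides do not name a specific algorithm, so the function name, the damping factor of 0.85, the iteration count, and the sample link graph are all assumptions for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Iteratively score pages; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = link_rank(links)
print(sorted(rank, key=rank.get, reverse=True))
```

Page "c" scores highest because many pages link to it, and "a" outranks "b" and "d" because its only inbound link comes from the important page "c".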
How Complex is
Google has about
36 ranking algorithms
Parsing Document Structure
Parsing Data in the Document
Recall
the percentage of all relevant documents that are retrieved;
100% recall means every relevant document is retrieved
Precision
the percentage of retrieved documents that are relevant;
100% precision means only relevant documents are retrieved
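A small worked example may make the two measures concrete. The sketch assumes we already know which documents are relevant, which is rarely true in practice; the function name and the sample id sets are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 5 relevant overall, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # prints 0.5 0.4
```

Here half of what was retrieved is relevant (50% precision), but only two of the five relevant documents were found (40% recall).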
Thoughts & Reservations about Evaluating Search
• Precision and Recall usually trade off against each other, so improving one
often reduces the other.
• Given a corpus of content like the web (tens of billions of items)...
Recall is unmeasurable, and thus essentially meaningless
• What is relevance?
• Measuring Precision depends on an agreed definition of relevance, which is
tricky (human cataloging is only about 80% ‘accurate’; relevance is very hard)
• Manually selected results, tied to specific query terms or phrases
• User-driven phrases
select the most-used phrases from search traffic;
go for easy wins, because returns diminish sharply
• Business-driven phrases
select phrases important to the business;
such as product names or office locations;
or politically sensitive phrases, so you can control the message people see
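One simple way to implement manually selected results is a lookup table keyed on normalized query phrases, whose entries are placed ahead of the regular results. Everything named below (the `PICKS` table, the URLs, and the `organic_search` callback) is hypothetical and only sketches the idea.

```python
def normalize(query):
    """Lowercase and collapse whitespace so phrase matching is forgiving."""
    return " ".join(query.lower().split())

def search_with_picks(query, picks, organic_search):
    """Put hand-picked results first, followed by the regular results."""
    pinned = picks.get(normalize(query), [])
    organic = [r for r in organic_search(query) if r not in pinned]
    return pinned + organic

# Hypothetical hand-curated table: phrase -> results chosen by an editor.
PICKS = {
    "office locations": ["/about/offices"],
    "refrigerator": ["/products/refrigerators"],
}

def fake_engine(query):
    """Stand-in for the real search engine, for demonstration only."""
    return ["/blog/fridge-tips", "/products/refrigerators"]

print(search_with_picks("  Refrigerator ", PICKS, fake_engine))
```

Keeping the table small and driven by the most-used phrases matches the "easy wins" advice above, since each entry has to be maintained by hand.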
• The user provides direct or indirect feedback on the search results
• Click tracking
• “More like this” or “Find similar”
• Designers use patterns in search behavior to guess the user’s intent;
this requires a substantial understanding of user behavior;
it may require structured content (although not necessarily)
• Zip Code -> Zip Code Lookup Tool
• Person’s name -> Directory Listing
• Product Name -> Shop or Support?
• Address -> Map this?
• Topic -> Introduction, Forms, Policies or Reports?
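A minimal sketch of this kind of pattern matching is shown below. It assumes US-style zip codes and street addresses and a known list of product names; the regular expressions, the intent labels, and the function name are illustrative assumptions, and a real system would need patterns derived from its own search logs.

```python
import re

# Assumed US formats: 5-digit zip with optional +4, and "number street suffix".
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
ADDRESS_RE = re.compile(
    r"^\d+\s+\w+.*\b(st|street|ave|avenue|rd|road|blvd)\b\.?$", re.IGNORECASE
)

def guess_intent(query, product_names=frozenset()):
    """Return a coarse intent label for the query (labels are hypothetical)."""
    q = query.strip()
    if ZIP_RE.match(q):
        return "zip_code_lookup"
    if q.lower() in product_names:
        return "product"
    if ADDRESS_RE.match(q):
        return "map"
    return "general_search"

products = frozenset({"refrigerator x100"})  # hypothetical product catalog
for q in ["94103", "123 Main St", "Refrigerator X100", "vacation policy"]:
    print(q, "->", guess_intent(q, products))
```

Queries that match no pattern fall through to ordinary search, so a wrong or missing guess costs the user nothing.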
• Classification with a controlled vocabulary is the best way to ensure 100%
• Lead-in synonyms
enter “fridge”; get “refrigerator” instead;
best if the collection is well-cataloged;
increases precision (e.g. in a library)
• Term-expansion synonyms
enter “refrigerator”; get “fridge” too;
best if the collection is not well-cataloged;
increases recall at the cost of precision (e.g. on eBay)
• Spell check on query phrases
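The difference between the two synonym strategies, plus a basic spell check, can be sketched as follows. The synonym tables and the vocabulary are hypothetical, and the spell check uses the standard library's `difflib.get_close_matches` with an assumed similarity cutoff of 0.8 rather than any particular engine's method.

```python
import difflib

# Hypothetical synonym tables.
LEAD_IN = {"fridge": "refrigerator"}        # replace with the preferred term
EXPANSIONS = {"refrigerator": {"fridge"}}   # add variants to the query

def apply_lead_in(terms):
    """Swap each entry term for its preferred term (narrows the query)."""
    return [LEAD_IN.get(t, t) for t in terms]

def expand_terms(terms):
    """Keep each term and append its synonyms (broadens the query)."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(sorted(EXPANSIONS.get(t, ())))
    return expanded

def suggest_spelling(term, vocabulary):
    """Return the closest known term, or the term unchanged if none is close."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term

vocabulary = ["refrigerator", "fridge", "office"]
print(apply_lead_in(["fridge", "repair"]))
print(expand_terms(["refrigerator", "repair"]))
print(suggest_spelling("refridgerator", vocabulary))
```

Lead-in replacement relies on documents being cataloged under the preferred term, while expansion makes no such assumption and therefore matches more documents, including some less relevant ones.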
Why is search
About half of all users prefer to
What percentage of a content
site’s development effort should
be devoted to search?
* This statistic is highly context-dependent. People’s
behavior depends on the context of their actions.
The stat is from Jared Spool.