Information Retrieval

   James Melzer

   June 15, 2006





Transcript of "Information Retrieval (for beginners)"

  1. Information Retrieval. James Melzer, June 15, 2006
  2. How Does Search Work?
  3. The basics of search
     • A search engine mediates between the user’s query and metadata surrogates for documents
     • Documents are reduced to metadata
     • The user’s need is translated into a query
     • Query terms are used to find matching metadata terms
     • Lots and lots of room for error...
  4. The search process
     1. Crawl content for metadata
     2. Index document terms into an inverted file; an inverted file is very fast to search
     3. Search the index to identify the result set; search the index, not the documents
     4. Rank the results for display; ranking is the hardest part
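Steps 2 and 3 of the search process can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine does it: the tokenizer is deliberately naive, the three documents are made up, and the query is treated as a boolean AND of its terms.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection, for illustration only.
documents = {
    1: "search engines index documents",
    2: "ranking search results is hard",
    3: "documents are reduced to metadata",
}
index = build_inverted_index(documents)
hits = search(index, "search documents")
```

The lookup touches only the index entries for the query terms, never the document text, which is why searching an inverted file is fast. Ranking (step 4) is not attempted here.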
  5. Search algorithm 1: Term-based Ranking (tf/idf)
     • tf = term frequency: documents that use the query terms most are presumed to be most relevant
     • idf = inverse document frequency: terms that are rarer are better indicators of relevance
     • Assumptions: 1) relevance can be measured with document terms
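To make tf and idf concrete, here is a small sketch of one common formulation: tf is the term count divided by document length, idf is the natural log of (number of documents / number of documents containing the term), and a document's score is the sum of tf × idf over the query terms. Real engines use many variants of this weighting; the sample documents are invented.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        if not terms:
            scores[doc_id] = 0.0
            continue
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term appears nowhere in the collection
            tf = counts[term] / len(terms)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores[doc_id] = score
    return scores


# Hypothetical collection, for illustration only.
documents = {
    1: "cheap refrigerator sale refrigerator",
    2: "refrigerator repair manual",
    3: "the sale of the century",
}
scores = tf_idf_scores(documents, "refrigerator repair")
ranking = sorted(scores, key=scores.get, reverse=True)
```

Document 2 ranks first even though document 1 uses "refrigerator" more often, because "repair" is the rarer term and so carries a higher idf.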
  6. Search algorithm 2: PageRank (Google)
     • The relevant set is still identified by term matching
     • A revolution in ranking: based on linking between documents
     • Assumptions: 1) important sites link to other important sites; 2) if many people link to a site, it is important
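The two assumptions above can be illustrated with a simplified, textbook-style power iteration. This is a sketch only and does not claim to reflect Google's production ranking; the link graph is made up, and the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
```

Here "c" ends up with the highest rank because many pages link to it, and "a" outranks "b" and "d" because its single inbound link comes from an important page.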
  7. Citation Analysis
     • Authors carefully select articles to cite
     • The more citations an article gets, the better it must be
     • Citations by authors who have a lot of citations confer their power on those they cite
     • Aggregate and leverage all these small individual decisions...
  8. How Complex is Google?
     Google has about 36 ranking algorithms. Examples:
     • Citation Analysis
     • Statistical Clustering
     • Parsing Document Structure
     • Parsing Data in the Document
     • Microcontent Parsing
  9. How to Make Search Better?
  10. Evaluating Search
      • Recall: the percentage of all relevant documents that are retrieved; 100% recall means every relevant document is retrieved
      • Precision: the percentage of retrieved documents that are relevant; 100% precision means only relevant documents are retrieved
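Both measures are simple ratios once you have a set of retrieved documents and a set of documents judged relevant. The sketch below assumes such relevance judgments exist; the document ids are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 documents are truly relevant,
# and 2 of the retrieved documents are among the relevant ones.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

In this example precision is 2/4 = 0.5 and recall is 2/5 = 0.4.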
  11. Thoughts & Reservations about Evaluating Search
      • Precision and Recall are usually inversely related, so improving one often reduces the other.
      • Given a corpus of content like the web (tens of billions of items), Recall is unmeasurable, and thus essentially meaningless
      • What is relevance?
      • Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’; relevance is very hard to quantify)
  12. Zipf: Best Bets
      • Manually selected results, tied to specific query terms or phrases
      • User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply
      • Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see
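One way to pick user-driven phrases is to count queries in the search log and see how much traffic the top few cover. The sketch below assumes a log that is simply a list of query strings; the log contents, the `BEST_BETS` table, and its URL are hypothetical.

```python
from collections import Counter


def top_query_coverage(query_log, k):
    """Return the k most-used queries and the share of all traffic they cover."""
    if not query_log:
        return [], 0.0
    counts = Counter(q.strip().lower() for q in query_log)
    top = counts.most_common(k)
    covered = sum(count for _, count in top)
    return [phrase for phrase, _ in top], covered / len(query_log)


# Hypothetical search log of ten queries.
query_log = (
    ["benefits"] * 5
    + ["holiday schedule"] * 3
    + ["parking"]
    + ["expense report"]
)
phrases, coverage = top_query_coverage(query_log, k=2)

# Hypothetical manually curated best bet for a high-traffic phrase.
BEST_BETS = {"benefits": "/hr/benefits-overview"}


def best_bet(query):
    """Return the hand-picked result for a query, or None if there is none."""
    return BEST_BETS.get(query.strip().lower())
```

In this made-up log, two phrases cover 80% of the traffic, while each further phrase adds only 10%. That is the "diminishing returns" shape that makes a short list of best bets worthwhile.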
  13. Relevance Feedback
      • The user provides direct or indirect feedback on the search results
      • Click tracking
      • “More like this” or “Find similar”
      • Clustering
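Click tracking is a form of indirect feedback. As a rough sketch of one possible use, the function below blends each result's original score with its click-through rate and re-sorts. The blending formula, the `weight` parameter, and all the numbers are assumptions made for illustration, not a method described in the slides.

```python
def rerank_with_clicks(results, clicks, impressions, weight=0.5):
    """Re-sort (doc_id, base_score) pairs using click-through rate as feedback."""

    def blended(item):
        doc_id, base_score = item
        shown = impressions.get(doc_id, 0)
        # Click-through rate: clicks divided by times shown (0 if never shown).
        ctr = clicks.get(doc_id, 0) / shown if shown else 0.0
        return (1.0 - weight) * base_score + weight * ctr

    return sorted(results, key=blended, reverse=True)


# Hypothetical engine output and click log.
results = [("a", 0.9), ("b", 0.8), ("c", 0.3)]
clicks = {"a": 1, "b": 40, "c": 2}
impressions = {"a": 100, "b": 100, "c": 100}
reranked = rerank_with_clicks(results, clicks, impressions)
order = [doc_id for doc_id, _ in reranked]
```

With these numbers, "b" moves ahead of "a" because users click it far more often when it is shown.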
  14. Structured Search
      • Designers use patterns in search behavior to guess the user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily)
      Examples
      • Zip Code -> Zip Code Lookup Tool
      • Person’s name -> Directory Listing
      • Product Name -> Shop or Support?
      • Address -> Map this?
      • Topic -> Introduction, Forms, Policies or Reports?
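A very small version of this idea is to test the query against a few patterns before falling back to ordinary search. The regular expressions, the intent names, and the `DIRECTORY` set below are all hypothetical; real patterns would come from studying actual search logs.

```python
import re

# Hypothetical patterns, for illustration only.
ZIP_CODE = re.compile(r"^\d{5}(-\d{4})?$")
STREET_ADDRESS = re.compile(
    r"^\d+\s+\w+(\s+\w+)*\s+(st|street|ave|avenue|rd|road|blvd)\.?$",
    re.IGNORECASE,
)
DIRECTORY = {"jane doe", "john smith"}  # stand-in for a staff directory


def guess_intent(query):
    """Guess what the user wants from the shape of the query string."""
    q = " ".join(query.split())  # collapse repeated whitespace
    if ZIP_CODE.match(q):
        return "zip_code_lookup"
    if STREET_ADDRESS.match(q):
        return "map_this"
    if q.lower() in DIRECTORY:
        return "directory_listing"
    return "full_text_search"
```

For example, `guess_intent("20170")` returns `"zip_code_lookup"`, while a query that matches nothing, such as `"information retrieval"`, falls through to `"full_text_search"`.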
  15. Controlled Vocabularies
      • Classification with a controlled vocabulary is the best way to ensure 100% Recall
      • Lead-in synonyms: enter “fridge”; get “refrigerator” instead; best if the collection is well-cataloged; increases precision (e.g. in a library)
      • Term-expansion synonyms: enter “refrigerator”; get “fridge” too; best if the collection is not well-cataloged; increases recall at the cost of precision (e.g. on eBay)
      • Spell check on query phrases
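The difference between the two kinds of synonym handling is easiest to see side by side. In this sketch the `LEAD_IN` table and the `SYNONYM_RINGS` list are hypothetical vocabularies made up for the example.

```python
# Lead-in synonyms: map each variant to the one preferred (cataloged) term.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}

# Term-expansion synonyms: groups of terms treated as interchangeable.
SYNONYM_RINGS = [{"refrigerator", "fridge", "icebox"}]


def apply_lead_in(query_terms):
    """Replace each variant with its preferred term; other terms pass through."""
    return [LEAD_IN.get(term, term) for term in query_terms]


def expand_terms(query_terms):
    """Add every synonym of each query term, widening the search."""
    expanded = set(query_terms)
    for term in query_terms:
        for ring in SYNONYM_RINGS:
            if term in ring:
                expanded |= ring
    return expanded


preferred = apply_lead_in(["fridge", "repair"])
widened = expand_terms(["refrigerator", "repair"])
```

Lead-in substitution keeps the query the same size and points it at the cataloged term, while expansion makes the query broader, which is why it trades precision for recall.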
  16. Why is search important?
      IF: About half of all users prefer to search first*
      THEN: What percentage of a content site’s development effort should be devoted to search?
      * This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.
  17. Questions?
      James Melzer, Information Architect, SRA International, james_melzer@sra.com