Information Retrieval (for beginners)

Transcript

  • 1. Information Retrieval. James Melzer, June 15, 2006
  • 2. How Does Search Work?
  • 3. The basics of search
      • A search engine mediates between the user’s query and metadata surrogates for documents
      • Documents are reduced to metadata
      • The user’s need is translated into a query
      • Query terms are used to find matching metadata terms
      • Lots and lots of room for error...
  • 4. The search process
      1. Crawl content for metadata
      2. Index document terms into an inverted file; an inverted file is very fast to search
      3. Search the index to identify the result set; search the index, not the documents
      4. Rank the results for display; ranking is the hardest part
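The index and search steps of the process above can be illustrated with a toy inverted file. This is a minimal sketch, not how any particular engine works: the `documents` dictionary, the whitespace tokenizer, and the AND-style matching are all assumptions made for the example, and crawling and ranking are left out.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real indexer would do far more."""
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND matching)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Only the index is consulted here; the documents are never re-read.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-document collection.
documents = {
    "d1": "search engines index documents",
    "d2": "ranking search results is hard",
    "d3": "libraries catalog documents",
}
index = build_inverted_index(documents)
print(sorted(search(index, "search documents")))  # ['d1']
```

Because each lookup is a dictionary access plus a set intersection, query time depends on the number of query terms and matching documents rather than on the total size of the text.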
  • 5. Search algorithm 1: Term-based Ranking (tf/idf)
      • tf = term frequency: documents that use the query terms most are presumed to be most relevant
      • idf = inverse document frequency: terms that are rarer are better indicators of relevance
      • Assumptions: 1) relevance can be measured with document terms
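A small worked example may make the tf and idf definitions concrete. The sketch below assumes one common weighting (tf as the term's share of the document, idf as the natural log of total documents over documents containing the term); the slide does not specify a formula, and many variants exist. The sample documents are invented.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0 or not tokens:
                continue  # term appears nowhere, or the document is empty
            tf = counts[term] / len(tokens)        # frequent in this document
            idf = math.log(n_docs / df[term])      # rare across the collection
            score += tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical collection.
documents = {
    "d1": "fridge repair guide fridge",
    "d2": "fridge buying guide",
    "d3": "oven repair manual",
}
scores = tf_idf_scores("fridge repair", documents)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # d1
```

Here `d1` ranks first because it uses both query terms, and uses "fridge" twice; a term found in every document would get an idf of zero and contribute nothing.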
  • 6. Search algorithm 2: PageRank (Google)
      • Relevant set is still identified by term matching
      • A revolution in ranking: based on linking between documents
      • Assumptions: 1) important sites link to other important sites 2) if many people link to a site, it is important
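The two link-based assumptions above can be sketched with the textbook power-iteration form of PageRank. This is a simplified illustration, not Google's production ranking: the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # c
```

Page `c` scores highest because many pages link to it, and `a` outranks `b` and `d` because its single inbound link comes from the important page `c`.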
  • 7. Citation Analysis
      • Authors carefully select articles to cite
      • The more citations an article gets, the better it must be
      • Citations by authors who have a lot of citations confer their power on those they cite
      • Aggregate and leverage all these small individual decisions...
  • 8. How Complex is Google? Google has about 36 ranking algorithms. Examples:
      • Citation Analysis
      • Statistical Clustering
      • Parsing Document Structure
      • Parsing Data in the Document
      • Microcontent Parsing
  • 9. How to Make Search Better?
  • 10. Evaluating Search
      • Recall: the percentage of all relevant documents that are retrieved; 100% recall means every relevant document is retrieved
      • Precision: the percentage of retrieved documents that are relevant; 100% precision means only relevant documents are retrieved
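Both measures reduce to simple set arithmetic once the retrieved set and the relevant set are known. The sketch below assumes someone has already judged which documents are relevant; the document ids are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgment: 4 documents retrieved, 3 relevant overall, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision)         # 0.5
print(round(recall, 2))  # 0.67
```

Note that recall needs the full relevant set as its denominator, which is only available when the whole collection has been judged.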
  • 11. Thoughts & Reservations about Evaluating Search
      • Precision and Recall usually trade off against each other, so improving one often reduces the other
      • Given a corpus of content like the web (tens of billions of items), Recall is unmeasurable, and thus essentially meaningless
      • What is relevance?
      • Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’; relevance is very hard to quantify)
  • 12. Zipf Best Bets
      • Manually selected results, tied to specific query terms or phrases
      • User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply
      • Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see
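One way to picture best bets is a hand-maintained lookup table consulted before the regular results, plus a query-log count to find the phrases worth curating. The sketch below is an assumption-laden illustration: the `BEST_BETS` table, its URLs, the query log, and the `organic_search` callable are all hypothetical.

```python
from collections import Counter

# Hypothetical, manually curated table: normalized phrase -> chosen results.
BEST_BETS = {
    "benefits": ["/hr/benefits-overview"],
    "office locations": ["/about/locations"],
}

def normalize(query):
    """Lowercase and collapse whitespace so phrase variants match."""
    return " ".join(query.lower().split())

def top_queries(query_log, n=3):
    """Most-used phrases from search traffic: candidates for best bets."""
    counts = Counter(normalize(q) for q in query_log)
    return [phrase for phrase, _ in counts.most_common(n)]

def search_with_best_bets(query, organic_search, best_bets=BEST_BETS):
    """Put curated results first, then the engine's results without repeats."""
    curated = best_bets.get(normalize(query), [])
    organic = [r for r in organic_search(query) if r not in curated]
    return curated + organic

# Hypothetical query log.
log = ["benefits", "Benefits", "office locations", "benefits",
       "holiday schedule", "office locations"]
print(top_queries(log, n=2))  # ['benefits', 'office locations']
```

Because search traffic is heavily skewed toward a few phrases, curating only the top handful already covers a large share of queries.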
  • 13. Relevance Feedback
      • The user provides direct or indirect feedback on the search results
      • Click tracking
      • “More like this” or “Find similar”
      • Clustering
  • 14. Structured Search
      • Designers use patterns in search behavior to guess the user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily)
      • Examples:
          • Zip Code -> Zip Code Lookup Tool
          • Person’s name -> Directory Listing
          • Product Name -> Shop or Support?
          • Address -> Map this?
          • Topic -> Introduction, Forms, Policies or Reports?
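A few of the examples above can be approximated with simple pattern rules applied to the query before it reaches the index. The following is only a rough sketch: the regular expressions, the `PRODUCT_NAMES` set, and the intent labels are invented, and a real deployment would derive its rules from observed search behavior.

```python
import re

# US-style zip code, optionally with a four-digit extension (assumed format).
ZIP_CODE = re.compile(r"^\d{5}(-\d{4})?$")
# Street address: a number, some words, then a common street suffix.
ADDRESS = re.compile(r"^\d+\s+\w+.*\b(st|street|ave|avenue|rd|road|blvd)\b\.?")
# Hypothetical product catalog.
PRODUCT_NAMES = {"widget pro", "widget lite"}

def guess_intent(query):
    """Return a label for the tool or page type the query probably wants."""
    q = " ".join(query.lower().split())
    if ZIP_CODE.match(q):
        return "zip_code_lookup"
    if q in PRODUCT_NAMES:
        return "product_shop_or_support"
    if ADDRESS.match(q):
        return "map"
    return "general_search"

print(guess_intent("20001"))        # zip_code_lookup
print(guess_intent("Widget Pro"))   # product_shop_or_support
print(guess_intent("123 Main St"))  # map
```

Queries that match no rule fall through to ordinary search, so a wrong guess costs little as long as the guessed tool is offered alongside the normal results.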
  • 15. Controlled Vocabularies
      • Classification with a controlled vocabulary is the best way to ensure 100% Recall
      • Lead-in synonyms: enter “fridge”; get “refrigerator” instead; best if the collection is well-cataloged (e.g. in a library); increases precision
      • Term-expansion synonyms: enter “refrigerator”; get “fridge” too; best if the collection is not well-cataloged (e.g. on eBay); increases recall at the cost of precision
      • Spell check on query phrases
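The difference between lead-in and term-expansion synonyms, and a basic spell check, can be shown on query terms directly. This sketch assumes a tiny hand-built synonym table (`LEAD_IN`) and vocabulary, both invented; the spell check uses Python's standard `difflib.get_close_matches`, which is one simple option among many.

```python
import difflib

# Hypothetical controlled vocabulary: variant term -> preferred term.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}
VOCABULARY = ["refrigerator", "repair", "freezer"]

def lead_in(terms, lead_in_map=LEAD_IN):
    """Replace each variant with its preferred term (favors precision)."""
    return [lead_in_map.get(term, term) for term in terms]

def expand(terms, lead_in_map=LEAD_IN):
    """Add every known variant of each term (favors recall)."""
    expanded = []
    for term in terms:
        preferred = lead_in_map.get(term, term)
        variants = sorted(v for v, p in lead_in_map.items() if p == preferred)
        for candidate in [preferred] + variants:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

def spell_check(term, vocabulary=VOCABULARY):
    """Suggest the closest vocabulary term, or keep the term unchanged."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term

print(lead_in(["fridge", "repair"]))  # ['refrigerator', 'repair']
print(expand(["refrigerator"]))       # ['refrigerator', 'fridge', 'icebox']
print(spell_check("refridgerator"))   # refrigerator
```

Lead-in substitution only helps when documents are reliably tagged with the preferred term; expansion casts a wider net over untagged text and so tends to pull in more irrelevant matches.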
  • 16. Why is search important?
      • IF: About half of all users prefer to search first*
      • THEN: What percentage of a content site’s development effort should be devoted to search?
      • * This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.
  • 17. Questions? James Melzer, Information Architect, SRA International, james_melzer@sra.com