Information Retrieval (for beginners)

    Information Retrieval (for beginners) - Presentation Transcript

    1. Information Retrieval. James Melzer, June 15, 2006
    2. How Does Search Work?
    3. The basics of search • A search engine mediates between the user’s query and metadata surrogates for documents • Documents are reduced to metadata • The user’s need is translated into a query • Query terms are used to find matching metadata terms • Lots and lots of room for error...
    4. The search process: 1. Crawl content for metadata 2. Index document terms into an inverted file; an inverted file is very fast to search 3. Search the index to identify the result set; search the index, not the documents 4. Rank the results for display; ranking is the hardest part
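Steps 2 and 3 of the search process can be sketched in a few lines. This is a minimal illustration only, assuming whitespace tokenization and boolean AND matching; the names `tokenize`, `build_index`, and `search` and the sample documents are made up for this example, and ranking (step 4) is left out.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real indexer does far more."""
    return text.lower().split()

def build_index(docs):
    """Inverted file: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Only the index is consulted here, never the documents themselves.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample collection.
docs = {
    1: "search engines index documents",
    2: "ranking search results is hard",
    3: "documents are reduced to metadata",
}
index = build_index(docs)
print(sorted(search(index, "search")))            # [1, 2]
print(sorted(search(index, "search documents")))  # [1]
```

The lookup cost depends on the number of query terms and the size of their posting sets, not on the total size of the documents, which is why the index is fast to search.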
    5. Search algorithm 1: Term-based Ranking (tf/idf) • tf = term frequency: documents that use the query terms most are presumed to be most relevant • idf = inverse document frequency: terms that are rarer are better indicators of relevance • Assumption: 1) relevance can be measured with document terms
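As a rough sketch of term-based ranking, the function below scores each document by summing tf × idf over the query terms. It assumes one common weighting (tf as the term's share of the document, idf as log(N / df)); many other variants exist, and the slide does not say which one is meant. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if not terms or df[term] == 0:
                continue  # empty document, or term absent from the collection
            tf = counts[term] / len(terms)      # share of the document
            idf = math.log(n_docs / df[term])   # rarer terms weigh more
            score += tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical sample collection.
docs = {
    "a": "apple pie recipe",
    "b": "apple apple tart",
    "c": "banana bread recipe",
}
scores = tf_idf_scores(docs, "apple")
print(sorted(scores, key=scores.get, reverse=True))  # ['b', 'a', 'c']
```

With this weighting, a term that appears in every document gets an idf of zero and contributes nothing to the score, which matches the idea that common terms are poor indicators of relevance.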
    6. Search algorithm 2: PageRank (Google) • Relevant set is still identified by term matching • A revolution in ranking: based on linking between documents • Assumptions: 1) important sites link to other important sites 2) if many people link to a site, it is important
    7. Citation Analysis • Authors carefully select articles to cite • The more citations an article gets, the better it must be • Citations by authors who themselves have a lot of citations confer their power on those they cite • Aggregate and leverage all these small individual decisions...
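The link-based idea on the two slides above (a page is important if important pages link to it) can be sketched as a simplified PageRank computed by power iteration. This is a textbook-style illustration, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```

In this graph "a" ends up ranked above "b" and "d" even though nothing but "c" links to it, because its single inbound link comes from the most important page.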
    8. How Complex is Google? Google has about 36 ranking algorithms. Examples: Citation Analysis; Statistical Clustering; Parsing Document Structure; Parsing Data in the Document; Microcontent Parsing
    9. How to Make Search Better?
    10. Evaluating Search • Recall: the percentage of all relevant documents that are retrieved; 100% recall means every relevant document is retrieved • Precision: the percentage of retrieved documents that are relevant; 100% precision means only relevant documents are retrieved
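Both measures are simple ratios once the retrieved set and the relevant set are known. A minimal sketch follows; the function name and the document ids are invented for the example, and it assumes someone has already judged which documents are relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the engine returned; relevant: ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Note that recall needs the full set of relevant documents as input, which is exactly what is unavailable for a very large corpus.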
    11. Thoughts & Reservations about Evaluating Search • Precision and Recall usually trade off against each other, so improving one often reduces the other • Given a corpus of content like the web (tens of billions of items), Recall is unmeasurable, and thus essentially meaningless • What is relevance? • Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’; relevance is very hard to quantify)
    12. Zipf Best Bets • Manually selected results, tied to specific query terms or phrases • User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply • Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see
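A best-bets layer can be sketched as a lookup table placed in front of the regular engine, plus a frequency count over the query log to find the user-driven phrases worth curating. Everything named here (`top_queries`, `search_with_best_bets`, the sample log, and the URLs) is hypothetical, and `organic_search` stands in for whatever engine is already in place.

```python
from collections import Counter

def top_queries(query_log, n):
    """Return the n most frequent queries in the log, after normalization."""
    counts = Counter(q.strip().lower() for q in query_log)
    return [query for query, _ in counts.most_common(n)]

def search_with_best_bets(query, best_bets, organic_search):
    """Show hand-picked results first, then engine results without duplicates."""
    pinned = best_bets.get(query.strip().lower(), [])
    organic = [r for r in organic_search(query) if r not in pinned]
    return pinned + organic

# Hypothetical search traffic: a few phrases account for most of it.
log = ["Benefits", "benefits", "parking", "benefits ", "holiday schedule", "parking"]
print(top_queries(log, 2))  # ['benefits', 'parking']

# Hypothetical curated table and stand-in engine.
best_bets = {"benefits": ["/hr/benefits-overview"]}

def fake_engine(query):
    return ["/news/benefits-fair", "/hr/benefits-overview"]

print(search_with_best_bets("Benefits", best_bets, fake_engine))
# ['/hr/benefits-overview', '/news/benefits-fair']
```

Because the most-used phrases dominate the traffic, curating only the top handful covers a large share of searches; each further phrase added to the table pays back less.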
    13. Relevance Feedback • The user provides direct or indirect feedback on the search results • Click tracking • “More like this” or “Find similar” • Clustering
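Click tracking is the simplest of these to illustrate. The sketch below is one possible way to use indirect feedback, not a method described in the slides: it counts clicks per query and result, then moves more-clicked results up while leaving the engine's order intact for ties. The class name and document ids are hypothetical.

```python
from collections import defaultdict

class ClickTracker:
    """Record which results users click for a query and re-rank accordingly."""

    def __init__(self):
        self.clicks = defaultdict(int)

    def record_click(self, query, doc_id):
        self.clicks[(query.strip().lower(), doc_id)] += 1

    def rerank(self, query, results):
        """More-clicked results first; ties keep the engine's original order."""
        q = query.strip().lower()
        # sorted() is stable, so equal click counts preserve the input order.
        return sorted(results, key=lambda doc_id: -self.clicks.get((q, doc_id), 0))

tracker = ClickTracker()
tracker.record_click("printer", "doc3")
tracker.record_click("printer", "doc3")
tracker.record_click("Printer", "doc2")
print(tracker.rerank("printer", ["doc1", "doc2", "doc3"]))
# ['doc3', 'doc2', 'doc1']
```

A real system would need to correct for position bias, since users click top results more often simply because they are on top.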
    14. Structured Search • Designers use patterns in search behavior to guess the user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily) Examples: • Zip Code -> Zip Code Lookup Tool • Person’s name -> Directory Listing • Product Name -> Shop or Support? • Address -> Map this? • Topic -> Introduction, Forms, Policies or Reports?
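A few of the examples above can be sketched as pattern checks that run before the ordinary full-text search. The regular expressions are deliberately crude assumptions (US-style zip codes, a short list of street suffixes), and the intent labels and product list are invented for the illustration; real patterns would come from studying actual search logs.

```python
import re

# Assumed patterns: US zip codes and a handful of street suffixes.
ZIP_CODE = re.compile(r"^\d{5}(-\d{4})?$")
STREET_ADDRESS = re.compile(
    r"^\d+\s+\w+.*\b(street|st|avenue|ave|road|rd|boulevard|blvd)\b\.?$",
    re.IGNORECASE,
)

def guess_intent(query, product_names):
    """Return a label for the tool that should handle the query."""
    q = query.strip()
    if ZIP_CODE.match(q):
        return "zip_code_lookup"
    if STREET_ADDRESS.match(q):
        return "map"
    if q.lower() in product_names:
        return "shop_or_support"
    return "full_text_search"

products = {"widget pro"}  # hypothetical product catalog
print(guess_intent("22033", products))            # zip_code_lookup
print(guess_intent("123 Main Street", products))  # map
print(guess_intent("Widget Pro", products))       # shop_or_support
print(guess_intent("vacation policy", products))  # full_text_search
```

The fallback matters: when no pattern matches, the query should still reach the regular search rather than an empty page.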
    15. Controlled Vocabularies • Classification with a controlled vocabulary is the best way to ensure 100% Recall • Lead-in synonyms: enter “fridge”, get “refrigerator” instead; best if the collection is well-cataloged; increases precision (e.g. in a library) • Term-expansion synonyms: enter “refrigerator”, get “fridge” too; best if the collection is not well-cataloged; increases recall at the cost of precision (e.g. on eBay) • Spell check on query phrases
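The difference between the two kinds of synonym handling can be shown side by side. This is a minimal sketch with a made-up vocabulary: `lead_in` swaps each entry term for its preferred term, while `expand` widens the query to every known variant. The function names and the `PREFERRED` table are assumptions for the example.

```python
# Hypothetical controlled vocabulary: entry term -> preferred term.
PREFERRED = {"fridge": "refrigerator", "icebox": "refrigerator"}

def lead_in(query_terms, preferred=PREFERRED):
    """Lead-in synonyms: replace each entry term with its preferred term."""
    return [preferred.get(term, term) for term in query_terms]

def build_rings(preferred=PREFERRED):
    """Group each preferred term with its variants, indexed by every member."""
    rings = {}
    for variant, term in preferred.items():
        rings.setdefault(term, {term}).add(variant)
    by_member = {}
    for ring in rings.values():
        for member in ring:
            by_member[member] = ring
    return by_member

def expand(query_terms, rings):
    """Term expansion: search for every known variant of each term."""
    expanded = set()
    for term in query_terms:
        expanded |= rings.get(term, {term})
    return expanded

rings = build_rings()
print(lead_in(["fridge", "magnet"]))            # ['refrigerator', 'magnet']
print(sorted(expand(["refrigerator"], rings)))  # ['fridge', 'icebox', 'refrigerator']
```

Lead-in substitution only works if documents are reliably cataloged under the preferred term; expansion makes no such assumption, but every extra variant is another chance to match something irrelevant.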
    16. Why is search important? IF: About half of all users prefer to search first.* THEN: What percentage of a content site’s development effort should be devoted to search? * This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.
    17. Questions? James Melzer, Information Architect, SRA International, james_melzer@sra.com

    CC Attribution-NonCommercial License
