Federated Search and Relevance<br />Send query to multiple search engines<br />May require special syntax<br />Response ti...
Retrieval: Access Control<br />Limit access to search itself<br />User enters password or other credentials<br />Search on...
Search User Experience<br />Limit user interface complexity<br />Show the scope of the information covered<br />Expose que...
Search Forms Interface <br />Balance simplicity with functionality<br />Put a search field in the navigation bar <br />Loc...
Search Field Auto-Complete<br />Dropdown menu of matching words<br />Base on search logs<br />Smallish list, 7-10<br />Mos...
Other Search Interfaces<br />Heavily researched<br />Natural language <br />Must keep typing<br />Defining a question is qu...
Simple vs. Advanced Search UI <br />Most searches are simple<br />Short: one to three words<br />Fewer than 10% use any op...
Advanced Search Fits Sometimes<br />EBay<br />High motivation <br />Complex search requirements <br />Frequent use<br />UX...
Search Results: Page Elements<br />Site context <br />General page layout, navigation links<br />Colors and design element...
Search Results: Good Example<br />Full but readable<br /><ul><li>white space</li><li>content blocks</li></ul>Site look-and-feel<br />Navigation<br />Familiar search results elements<br />
Search Results: Not-So-Good Example<br />Site page has navigation, colors: search results should too<br />
Search Results: Visualization<br />Fascinating to look at, great demos<br />Star charts<br />Topographical displays<br />I...
Search Results: Header Elements<br />Search field, with the current query<br />Users often edit to be more or less restric...
Search Results: Hits and Pages<br />Show number of items matched<br />Be accurate <br />Do not give estimates for small nu...
Results Headers: Examples<br />
Search Results: “Best Bets”<br />aka Search Suggestions, QuickLinks, KeyMatch, Recommendations<br />Special-case links for...
Best Bets Example<br />Best Bets are very clear<br />Would not come first in normal search results<br />
Search Results: List Sorting<br />List of links to items matching the query<br />Sorted by matching terms<br />Impossible ...
Search Results: Not Enough Variety<br />
Search Results: Weird Sort<br />Sorted by: “Degrees away”<br />Labels too subtle:<br /><ul><li>Hidden in header</li><li>Degree icon should be on the left side</li></ul>
Result Items: Elements<br />Information foraging: show hints about items<br />Title of document, or name of product<br />L...
Results Items: Not Enough Content<br />
Results Items: Too Much Content<br />
Results Items: Just Right<br />
Results Items: Additional Data<br />Date (if reliable)<br />Size and File type <br />Avoid surprising launches of Acrobat ...
Results Items: Rich Items Example<br />
Results: Dynamic Clustering <br />Uses search results text to infer topics <br />Groups by similarity in titles and result...
Results: Clustering Example<br />
Commerce and Catalog Results <br />Picture or graphic if possible<br />Important attributes <br />Price<br />Color<br />Si...
Online Store Results Example<br />
Multimedia Search <br />Image, audio, and video files<br />Audio and visual similarity search still theory<br />Show conte...
Multimedia Results Example<br />
Results: Faceted Metadata<br />Better than forms for structured text data <br />Exposes attributes as part of search resul...
Why Faceted Search is Better Than Forms<br />
Faceted Metadata: Commerce Example<br />
Faceted Metadata: Library Catalog<br />
No Matches Queries: Causes <br />Misspellings and typing errors<br />Scope problem: nothing for that topic<br />Vocabulary...
No Matches Queries: Responses<br />Track queries with no matches in logs <br />Use sessions, surveys & testing to find use...
No Matches Queries: Spelling Issues<br />Detect and address common problems<br />Spelling errors<br />Typos<br />Queries w...
Good Example of No-Matches Page<br />
Empty Searches<br />Users click or press “enter” in the search box<br /><ul><li>Test for this special case</li></ul>Should...
Fundamentals Of Search

These slides are from my 2009 Fundamentals of Search workshop at KMWorld. Please contact me for information about search engines, consulting, workshops and training.

Published in: Technology, Design
  • http://www.slideshare.net/bdelacretaz/beyond-fulltext-searches-with-lucene-and-solr
  • Great book: "Search User Interfaces" by Marti Hearst
  • Transcript of "Fundamentals Of Search"

    1. The Fundamentals of Enterprise Search<br />KMWorld 2009<br />Avi Rappoport, Search Tools Consulting<br />www.searchtools.com<br />consult9@searchtools.com<br />www.searchtools.com/slides/kmw09/fundamentals-of-search.html<br />
    2. What’s In This Workshop<br />Overview of enterprise search, in context<br />Search engine processes<br />Robot spiders, database access<br />Indexing<br />Security<br />Query parsing, retrieval, and relevance ranking<br />Usable search interfaces<br />Maintenance and analytics<br />Methods for choosing a good search engine<br />Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com<br />
    3. About SearchTools<br />Avi Rappoport is a librarian (MLIS from Berkeley)<br />Software developer and product manager<br />User interface designer<br />Long-time search consultant<br />Editor & Publisher, www.searchtools.com<br />Search Tools Consulting<br />Search needs analysis and recommendations<br />Enterprise search evaluation<br />Outsourced search administration<br />
    4. Defining Enterprise Search<br />Large-scale web site search<br />Corporate sites<br />Institutional sites<br />Online stores<br />Intranet search<br />Crossing departmental lines<br />Opening data silos<br />Extranets<br />Portal search<br />
    5. Similarities to Webwide Search<br />Robot crawlers<br />HTML over HTTP<br />Scaling to millions of items<br />Distributed processing<br />Full-text indexing of content<br />Simple query language<br />Relevance ranking of results<br />TF-IDF (term frequency : inverse document frequency)<br />Familiar results list<br />
    6. Differences from Web Search<br />Limited scope<br />A site, set of sites, extranet, or intranet<br />Few meaningful hyperlinks<br />PageRank and link analysis are less useful<br />Security and access control issues<br />Content in databases, CMSs, etc.<br />More control<br />Index update scheduling<br />Some content is very valuable, other content is not<br />No search spam<br />
    7. Text Search vs. Database Search<br />Indexes multiple content sources<br />Database fields, files, web pages, feeds...<br />Simple search commands instead of SQL<br />Flexible indexing and retrieval<br />Relevance ranking (this is a major issue)<br />Does not compete for database resources<br />Easy to scale separately from DBMS<br />New features: spellcheck, auto-complete, facets<br />Works in the real world, from eBay to Google<br />
    8. Search and Information Architecture<br />Information Architecture<br />The art and science of organizing information for access and use.<br />IA work enriches search<br />Creates order and systems<br />Provides standard vocabulary<br />Removes ROT (redundant, obsolete, trivial)<br />Search supplements IA<br />Supports user vocabularies<br />Changes dynamically with new content<br />
    9. Search and Taxonomy<br />Taxonomy creates categories<br />Labels and metadata<br />Improves quality of search results<br />Additional metadata extremely valuable<br />Search crosses categories<br />Bypasses ambiguous topic labels<br />Useful for novices<br />Supports user vocabulary<br />Dynamic updates for new topics<br />
    10. Search & Knowledge Management<br />KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine)<br />Organizes information, processes and people<br />Offers collaboration and archiving tools<br />Attempts to regularize implicit knowledge<br />Search mostly matches words<br />
    11. Two Main Types of Search<br />Known-item search<br />Short queries<br />“Good-enough” answers<br />Exploratory search<br />Research - finding unknowns<br />Scientific, legal, medical, business, sales<br />Conceptual overviews<br />Completeness - all possible relevant items<br />Law enforcement<br />Medicine<br />
    12. Search as an Iceberg<br />All people see are the search box and results list<br />Invisible functionality<br />Indexes<br />Query processing<br />Retrieval<br />Relevance ranking<br />Search is a mystery<br />But it’s just software<br />
    13. Elements of Search Engines<br />Automated tools to collect content<br />Specialized storage for quick retrieval<br />Query processing and expansion<br />Retrieval (matching query to index content)<br />Relevance ranking<br />Search results interfaces<br />Analytics, metrics and maintenance<br />
    14. Choosing Content To Index<br />Information sites<br />Consider indexing every single page<br />Use search indexing as a discovery mechanism<br />Online stores, catalogs<br />Product information: cost, color, size, materials<br />Other: return policies, CEO’s name, job listings<br />Intranets<br />Intranet portal and core servers<br />May need archive servers and search<br />Multimedia: images, audio, video<br />Metadata at least<br />
    15. (Near) Real Time Indexing<br />Twitter has changed expectations<br />Even in intranets<br />Index must support partial updates<br />Search engines are finding limits at scale<br />Distribute indexing and indexes<br />Trigger index updates (push vs. pull)<br />Continuous feed<br />Send web service message<br />Database trigger<br />Update watched URLs with new links<br />
    16. Indexing and Security<br />Search can undermine “security by obscurity”<br />One link can expose a whole set of documents<br />Work with your security team<br />List areas which contain sensitive content<br />Define words which trigger further analysis<br />Create a process for removing sensitive data<br />Indexing encrypted content<br />Search engine uses SSL client for indexing<br />Encrypt search results before returning<br />Physical security on search servers<br />
    17. Search and Access Control<br />Authentication and authorization in indexing<br />“Basic authentication” - user name and password<br />NT Security integration<br />ACLs and single sign-on<br />Conform to security rules during indexing<br />Keep access control info as part of document store<br />Showing results - who can see what?<br />Access to search engine itself<br />Collection-level access control<br />Locked results as teaser for subscription<br />Hit-level access control<br />Check before displaying results<br />
    18. Indexing: Sources of Content<br />Web sites<br />Intranets<br />Extranets<br />Blogs<br />Wikis<br />Mailing list archives & email public folders<br />File systems & shared servers<br />NFS, SMB, AFP, GFS, FTP, WebDAV<br />Content Management Systems<br />Databases<br />Legacy programs in silos<br />
    19. Indexing: Robot Spiders<br />Start with base URL for all hosts<br />For each page, repeat<br />Read text into internal format<br />Save document in cache<br />Save words into index<br />Extract all links and check the rules<br />If they are new URLs, add them to the list<br />
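The spider loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's crawler: the `crawl` function, the `fetch` callback, the `allowed_hosts` rule and the in-memory `SITE` are all hypothetical names introduced here, and a real robot would also honor robots.txt, throttle requests and handle errors.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch, allowed_hosts):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque([start_url])
    seen = {start_url}
    cache = {}  # url -> raw page ("save document in cache")
    index = {}  # word -> set of urls ("save words into index")
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(page)
        parser.close()
        cache[url] = page
        for word in " ".join(parser.text).lower().split():
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            link = urljoin(url, href)
            # "Check the rules": in this sketch, stay on allowed hosts only.
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return cache, index


# Hypothetical two-page site held in memory instead of fetched over HTTP.
SITE = {
    "http://example.test/": '<a href="/about.html">About</a> Welcome home',
    "http://example.test/about.html":
        '<a href="/">Home</a> <a href="http://other.test/">x</a> About us',
}
cache, index = crawl("http://example.test/", SITE.get, {"example.test"})
```

Passing `fetch` as a parameter keeps the loop testable without a network; swapping in a function built on `urllib.request` would turn the sketch into a live crawler.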
    20. Robot Indexing Spider<br />
    21. Common Problems With Robots<br />Pages that are not linked from anywhere<br />Spider disallowed by robots.txt or robots meta<br />URLs with ? and & (all robots should handle these now)<br />JavaScript, forms, and interactive dynamic links<br />Some robots can handle some of these<br />Session IDs that change<br />Duplicate detection<br />Multiple views of the same data (Lotus, wikis)<br />Symbolic links & bad redirects<br />Multiple copies of files or directories<br />
    22. Indexing: Other Data Sources<br />RSS feeds: nice clean text<br />File servers: SMB, file:/// etc.<br />Content / Document Management Systems<br />Email archives<br />Databases via ODBC, JDBC, Oracle API<br />Full-text content<br />Metadata: library catalog records, yellow pages<br />External sources using APIs<br />(application programming interfaces)<br />News feeds (Reuters, AP)<br />Twitter<br />
    23. Indexing: Text Files<br />Plain text is easy<br />RTF export format: text is easy to find<br />HTML semi-structured text<br />Content is between tags and in attributes<br />Generated by JavaScript - hard to extract<br />Bad HTML, especially missing &lt;/ close tags<br />XML files (structured)<br />Many tags are document-level<br />Content is between tags and in attributes<br />Complex tag hierarchy<br />TEI (Text Encoding Initiative) & Semantic Web<br />XQuery and XPath tools<br />
    24. Indexing: Binary File Formats<br />PDF<br />Scanned, may not have any text<br />Bad PDF generators break words at columns<br />“Shadow” text effect duplicates letters<br />SWF and Flash: API may not load dynamic text<br />Office documents<br />Word processing files (may have hidden text from revisions)<br />Spreadsheets (hard to know what to grab)<br />Presentations<br />Note: new docx, xlsx, pptx are really XML file sets<br />CAD and project files<br />Metadata (properties, Adobe XMP)<br />
    25. Indexing: Tokenizing<br />Lowercase all characters (aka ‘folding’)<br />Tokenizing makes words searchable<br />Break on punctuation and spaces<br />Recognize special words: C++ @ [TS]<br />Typography issues: the ligature ﬆ is really “st”<br />HTML escaped text: möchten = m&ouml;chten<br />Special cases for structured strings<br />Numbers, Prices, Dates<br />N-grams - an alternate approach<br />Break into short text patterns<br />Takes a lot of index space<br />
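The tokenizing steps above might look roughly like this in Python. It is a sketch under stated assumptions: `SPECIAL_WORDS` is a hypothetical, hand-maintained list, the punctuation handling is deliberately crude, and structured strings such as prices and dates are not treated specially.

```python
import html
import re

# Hypothetical list of "special words" that must survive punctuation splitting.
SPECIAL_WORDS = {"c++", "c#"}

# Runs of letters and digits in any script; underscores count as separators.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text):
    """Unescape HTML entities, fold case, then split on punctuation and spaces."""
    text = html.unescape(text).lower()
    tokens = []
    for chunk in text.split():
        stripped = chunk.strip(",.;:!?()\"'")
        if stripped in SPECIAL_WORDS:
            tokens.append(stripped)
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


def char_ngrams(token, n=3):
    """Alternate approach: overlapping character n-grams of one token."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]
```

For example, `tokenize("Wir m&ouml;chten C++, not Java-Script!")` yields the tokens `wir`, `möchten`, `c++`, `not`, `java`, `script`, and `char_ngrams("search")` yields four trigrams for one six-letter word, which hints at why n-gram indexes take so much space.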
    26. Indexing: Character Set Issues<br />World has many charsets (aka scripts, alphabets)<br />English has a simple alphabet: 26 letters, 10 digits<br />Other Roman languages: extended (ç, î, ß)<br />Non-Roman one-byte: Cyrillic, Arabic, Hebrew<br />Asian two-byte: Chinese, Japanese, Korean<br />Identifying character sets<br />Unicode characters<br />Older usage: language “code pages”<br />HTTP header or &lt;META http-equiv&gt;<br />Statistical detection techniques<br />
    27. Indexing: Language Issues<br />Text search works across languages<br />Simple pattern-matching, query to index<br />Language-specific indexing improves search<br />Tokenizing using appropriate rules<br />Compound nouns (kindergarten)<br />Language rules for stemming<br />Singular version of thés is thé<br />Language detection<br />Trusted tags<br />Bilingual dictionaries<br />Statistical matches, n-grams<br />Documents may have mixed languages…<br />
    28. Indexing: Multimedia<br />Images, photos, drawings, sound, scores, video<br />External metadata<br />File name<br />Link text, surrounding words<br />Internal metadata<br />ID3 tags for music<br />EXIF and other digital photo information<br />Subtitles (sometimes)<br />Content<br />OCR to extract graphic text and closed captions<br />Audio: Speech-to-text conversion, still buggy<br />Use human judgment, not just automated systems<br />
    29. Inverted Index Diagram<br /><ul><li>Inverted indexes work well</li><li>Lots of IR research shows this</li><li>Better than DBMS</li><li>Alphabetical list of tokens</li><li>Tokens not in paragraph order, thus, inverted</li><li>Each token has ID of source</li></ul>
    35. Richer Index Structures<br />Store word position (for phrase matching)<br />Enclosing tag or field<br />Document metadata<br />Database field names<br />Image (which attribute)<br />Named anchor text<br />Text markup tags (TEI, Semantic Web)<br />Extracted entities<br />Personal names, companies, geo locations, dates<br />Anchor text from incoming links<br />Can be very descriptive<br />Add to index as if part of the target document<br />
    36. Example Inverted Index Structure<br />For each word<br />Document ID<br />Position<br />Tag name<br />For each document<br />ID<br />Title<br />URL<br />Description<br />
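The two structures on this slide (a posting per word, a record per document) can be illustrated with plain Python dictionaries. The sample documents and the `lookup` helper are made up for this sketch; production engines use compressed on-disk formats rather than in-memory lists.

```python
import re
from collections import defaultdict

# Document store: one record per document (ID, title, URL, description).
documents = {
    1: {"title": "Shoe care", "url": "/shoes/care",
        "description": "Cleaning leather shoes"},
    2: {"title": "Leather bags", "url": "/bags",
        "description": "Care for leather bags"},
}

# Inverted index: word -> list of postings (document ID, position, tag name).
inverted = defaultdict(list)

for doc_id, doc in documents.items():
    for tag in ("title", "description"):
        words = re.findall(r"\w+", doc[tag].lower())
        for position, word in enumerate(words):
            inverted[word].append((doc_id, position, tag))


def lookup(word):
    """Return the postings for one word, or an empty list."""
    return inverted.get(word.lower(), [])
```

Here `lookup("leather")` returns three postings: one in the description of document 1 and one each in the title and description of document 2. Keeping the tag name in each posting is what later allows title matches to be weighted differently from body matches.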
    37. Indexing: Stopwords<br />Stopwords - very common terms<br />Linguistic (a an the as he she it you new)<br />Ubiquitous (names, copyright, click here)<br />Consequences of excluding stopwords:<br />Reduces the size of index files<br />Improves recall, finds more matching documents<br />Fails some queries<br />As You Like It, IT copyright policy<br />Problems matching phrases: “New York University”<br />Solutions vary:<br />Index everything, pay the price in index size<br />CommonGrams: n-grams of frequent phrases<br />
    38. Stopwords Problems: Example<br />Searching wordpress.com for whatever will be<br /><ul><li>Finds all matches for whatever (stopwords ignored)</li><li>Useless results ranking</li><li>No matches for will be</li><li>One ad gets it right</li><li>External search finds over 3,000 pages on site with phrase</li></ul>
    43. Indexing: Stemming<br />Singular query should find plural words & vice versa<br />Shoe &lt;=&gt; shoes, cans &lt;=&gt; can, geese &lt;=&gt; goose<br />Statistical and probabilistic truncation rules<br />Linguistic rules<br />Lemmatization - stemming based on part of speech<br />Stemming before indexing<br />Improve recall: find all forms of a word<br />Reduce index size<br />Consequences of extreme stemming<br />Short query problems<br />Search for Ran shouldn’t match Run, Lola, Run<br />Other options<br />Index everything (makes indexes larger and queries slower)<br />New idea: CommonGrams (n-grams of frequent phrases)<br />
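A deliberately conservative stemmer that only folds plurals shows the idea without conflating Ran and Run. This is a toy with a handful of English rules, not a published algorithm such as Porter's; the `IRREGULAR` table is a hypothetical stand-in for a real linguistic dictionary, and the rules will mis-stem some words (for example, singular words that happen to end in "s").

```python
# Hypothetical exception table for irregular plurals; real stemmers ship
# much larger linguistic dictionaries.
IRREGULAR = {"geese": "goose", "mice": "mouse", "children": "child"}


def stem(word):
    """Map a plural noun to a singular form using a few English rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word
```

The same function has to be applied both to indexed text and to query terms; otherwise a query for shoes would be stemmed to shoe while the index still holds shoes, or the reverse.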
    44. Indexing: Document Store<br />Minimum<br />ID (key for inverted index)<br />Unique location (URL / file path / record ID)<br />Richer document store<br />Implicit metadata: filename, size, location<br />Explicit metadata<br />Title, date, keywords, author<br />Taxonomy labels, classification, user tagging<br />Language, character set<br />Access control settings<br />Full text of the document<br />For snippets and caching<br />
    45. Indexing: Dealing with Duplicates<br />Detecting duplicate documents<br />Exact match is fairly easy: checksums<br />Document similarity check: harder but worth it<br />Choosing the primary copy<br />Most recent (if reliable)<br />Rules based on path or metadata<br />New web search “canonical” tag<br />What to do with duplicates<br />Remove from the index: saves space<br />Hide in results unless requested<br />That’s the Google way<br />
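Both detection approaches can be sketched briefly. The checksum uses SHA-256 from the standard library; the similarity check uses word shingles and Jaccard similarity, which is one common technique chosen here for illustration and not necessarily what any given engine does. The function names and the shingle size of 3 are assumptions.

```python
import hashlib
import re


def checksum(text):
    """Exact-duplicate key: SHA-256 of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") for similarity checks."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Two nine-word sentences that differ only in their last word share six of eight distinct shingles, so `jaccard` scores them at 0.75; a site-specific threshold would decide whether that counts as a near-duplicate.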
    46. Indexing: Document Dates<br />HTTP servers lie about dates<br />Frequent wrong settings: 1969, 2040<br />Dynamic pages send the current timestamp<br />File systems lie about dates<br />Applications lie about dates<br />Indexers do the best they can<br />Metadata (date tag, property, tag DC.date)<br />Extract from page content<br />Checksum to see if file has changed since last index<br />Consider external metadata repository<br />
    47. Search Process Flow<br />
    48. Where the Queries Come From<br />User-entered text in search fields<br />Search navigation: moving around in results list<br />Previous searches<br />May just be repeated clicks on URL<br />Save Search feature<br />Simplistic alerts<br />Facet click to add a metadata filter<br />May re-issue search with additional terms<br />May be navigational, no text query<br />Scripts or automated queries<br />Dynamic links (find all pictures by this artist)<br />Geographic information systems<br />
    49. Query Processing Steps<br />Try to recognize the character set and language<br />Tokenize the text by language rules<br />Break at spaces and punctuation<br />Same algorithm as index tokenizer<br />Check for operators<br />Internet Query Operators: + - &quot;quotes&quot;<br />Boolean Operators: AND OR NOT & | !<br />Others: NEAR, (parentheses)<br />Check for field names, zones, other filters<br />Example: title:lunch location=94703<br />Handle the rare natural language question<br />
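A small parser illustrates the operator and field checks. It is a sketch only: `parse_query` and its clause dictionaries are invented for this example, it recognizes `+`, `-`, quoted phrases, `field:value` filters and the words AND, OR and NOT, and it ignores the `&`, `|`, `!`, NEAR, parentheses and `field=value` forms mentioned on the slide.

```python
import re

# One alternative per token type: quoted phrase, field:value, or bare word,
# each optionally prefixed with + (required) or - (excluded).
QUERY_RE = re.compile(r'([+-]?)(?:"([^"]+)"|(\w+):(\S+)|(\S+))')


def parse_query(query):
    """Split a query string into a list of typed clauses."""
    clauses = []
    for sign, phrase, field, value, word in QUERY_RE.findall(query):
        if phrase:
            clause = {"type": "phrase", "terms": phrase.lower().split()}
        elif field:
            clause = {"type": "field", "field": field.lower(), "value": value}
        elif word.upper() in ("AND", "OR", "NOT"):
            clause = {"type": "operator", "op": word.upper()}
        else:
            clause = {"type": "term", "term": word.lower()}
        clause["sign"] = sign
        clauses.append(clause)
    return clauses
```

Given `title:lunch +"new york" -pizza OR salad`, the parser produces a field clause, a required phrase, an excluded term, an OR operator and a plain term, in that order. Terms should then go through the same tokenizer and stemmer as the index.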
    50. Query Expansion<br />Stemming<br />Dependent on index stemming choices<br />Good to find singular/plural forms<br />Word similarity searching - increases recall<br />Fuzzy matching<br />Phonetic, soundex, sound-alike<br />May overwhelm exact matches<br />Synonym expansion, should be site-specific<br />bus =&gt; coach, ATM =&gt; Air Tasking Message<br />
    51. Search: Retrieval, Recall & Precision<br />Retrieval<br />Finding the documents matching a particular query<br />Recall<br />Finding every relevant document<br />Precision<br />Finding only relevant documents<br />Balance more recall vs. better precision<br />Use search logs and user studies to guide choices<br />Use precision as part of relevance ranking<br />Top results should be more exact matches<br />
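In numeric terms, precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A short helper makes the trade-off concrete; the function name and the sample document IDs are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, five documents actually relevant, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

In this example half of what was returned is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4). Query expansion tends to raise the second number at the expense of the first.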
    52. One-Word Text Retrieval<br />Fast binary search in inverted index<br />Check index updates on disk or in memory<br />If there are distributed indexes, merge results<br />Store the related document information in a list<br />Document ID<br />Term frequency in document<br />Term positions in the document<br />Note: The document list is not yet sorted<br />Frequent searches may be cached<br />“Short head” vs. “long tail”<br />
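The lookup step can be sketched with the standard library's `bisect` module over a sorted term list. The `terms` and `postings` data are hypothetical, and real engines add caching, index merging and on-disk structures that this example leaves out.

```python
from bisect import bisect_left

# Sorted term dictionary and parallel posting lists (hypothetical data).
# Each posting: (document ID, term frequency, positions in the document).
terms = ["index", "query", "search", "spider"]
postings = [
    [(1, 2, [0, 7]), (3, 1, [4])],
    [(2, 1, [5])],
    [(1, 1, [3]), (2, 3, [0, 2, 9])],
    [(3, 1, [1])],
]


def retrieve(term):
    """Binary-search the sorted term list and return its posting list."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []
```

The returned list is in index order, not relevance order; ranking is a separate step applied afterwards.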
    53. 53. Multi-Word Text Retrieval <br />Relationship between words defines results<br />Boolean AND, + operator, find all default<br />Only documents which contain all terms<br />Boolean OR operator, find any default<br />All documents with any term<br />Boolean NOT, - operator<br />All documents with the first term but not next term <br />Phrase operators, quotes<br />Only documents with the words as a phrase<br />Also check for zones or field filters<br />Parentheses: use for order of processing<br />Merge resulting lists<br />
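The multi-word operators map naturally onto set operations over per-term document lists, with phrase matching as an extra check on term positions. The following sketch assumes a small positional index held in memory; `INDEX` and the function names are hypothetical.

```python
# Hypothetical positional index: term -> {document ID: [positions]}
INDEX = {
    "enterprise": {1: [0], 2: [3], 3: [5]},
    "search": {1: [1], 2: [0], 4: [2]},
    "engine": {1: [2], 4: [3]},
}

def docs(term):
    return set(INDEX.get(term, {}))

def match_all(terms):
    """Boolean AND: only documents containing every term."""
    sets = [docs(t) for t in terms]
    return set.intersection(*sets) if sets else set()

def match_any(terms):
    """Boolean OR: documents containing any term."""
    return set().union(*(docs(t) for t in terms))

def match_not(first, excluded):
    """Boolean NOT: documents with the first term but not the excluded term."""
    return docs(first) - docs(excluded)

def match_phrase(terms):
    """Phrase: all terms present and at consecutive positions."""
    result = set()
    for doc_id in match_all(terms):
        starts = INDEX[terms[0]][doc_id]
        if any(all(p + i in INDEX[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            result.add(doc_id)
    return result
```

With this data, `match_all(["enterprise", "search"])` finds documents 1 and 2, while the phrase "enterprise search" matches only document 1, where the two words are adjacent.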
    54. 54. Relevance Ranking Algorithms<br />Relevance <br />The likelihood that an item will fill an information need<br />Based on documents in retrieval list<br />Most common algorithm: TF:IDF<br />(Term frequency : inverse document frequency)<br />How often is the query word in the document?<br />How often is the word in the index?<br />Other relevance algorithms <br />Vectors and document-query similarity <br />Linguistic analysis and Natural Language Processing <br />Statistical and Bayesian analysis <br />
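To make the two TF:IDF questions concrete, here is one simple textbook variant: term frequency is the share of a document's tokens that match the term, and inverse document frequency is the log of total documents over documents containing the term. Search engines use many different weightings, so treat the formula and the function name `tf_idf_scores` as illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, documents):
    """Score documents for a query.

    documents: {doc_id: list of tokens}
    Returns {doc_id: score} for documents matching at least one term.
    """
    n_docs = len(documents)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}
    scores = {}
    for term in query_terms:
        # How often is the word in the index? (document frequency)
        df = sum(1 for c in counts.values() if term in c)
        if df == 0:
            continue
        idf = math.log(n_docs / df)
        for doc_id, c in counts.items():
            if term in c:
                # How often is the query word in this document?
                tf = c[term] / len(documents[doc_id])
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    return scores
```

A word that appears in every document gets an IDF of zero under this variant, so it contributes nothing to the ranking.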
    55. 55. Relevance Heuristics<br />Phrase matches for multiple query terms <br />Logs show most multi-word searches are phrases<br />Query terms found in special sections<br />Title<br />Metadata<br />Top of document<br />All terms matched in document <br />Even when not relevant, it’s transparent<br />Old systems gave excess weight to single rare terms<br />
    56. 56. More on Relevance <br />Relevance is task-specific <br />Results can never please all of the people<br />More like berry-picking than like hunting<br />Link analysis (PageRank) not very useful <br />Intranet and site links tend to be navigational <br />Situation-specific adjustments <br />Some areas more likely to be valuable <br />Current content<br />Local content <br />
    57. 57. Federated Search and Relevance<br />Send query to multiple search engines<br />May require special syntax<br />Response time often a factor<br />Receive results in relevance order for each<br />Display results, two options<br />Separate sections for each search engine<br />Merged single relevance rank list<br />Works if all search indexes are similar<br />Problems where the sources are very different<br />
    58. 58. Retrieval: Access Control<br />Limit access to search itself<br />User enters password or other credentials<br />Search only accepts queries when authenticated<br />Collection-level access control<br />Query filter only retrieves items from allowed groups<br />Hit-level access control<br />Real-time check for user access on documents<br />Start with most relevant documents<br />Repeat until there are ten (may be slow)<br />Display top results, include estimate of how many more<br />Show helpful message if user can’t see any<br />
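The hit-level access control loop described on this slide might look like the sketch below. The `can_read` callback stands in for whatever real-time permission check the document repository offers; it and the function name are assumptions, not a real API.

```python
def filter_by_access(ranked_doc_ids, user, can_read, page_size=10):
    """Walk hits in relevance order, keep the ones the user may see,
    and stop as soon as one page of results is filled.

    can_read(user, doc_id) is a hypothetical permission check.
    Returns (visible hits, number of hits not yet checked).
    """
    visible = []
    checked = 0
    for doc_id in ranked_doc_ids:
        checked += 1
        if can_read(user, doc_id):
            visible.append(doc_id)
            if len(visible) == page_size:
                break
    return visible, len(ranked_doc_ids) - checked
```

The count of unchecked hits is only an upper bound on how many more the user can see, which is why the results page should present it as an estimate. An empty `visible` list is the cue to show the helpful message.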
    59. 59. Search User Experience<br />Limit user interface complexity<br />Show the scope of the information covered<br />Expose query expansion and contraction <br />Use familiar UI elements<br />User experience goes beyond interface<br />Index coverage<br />Query syntax<br />Retrieval quality and speed<br />Relevance ranking (first ten are vital)<br />
    60. 60. Search Forms Interface <br />Balance simplicity with functionality<br />Put a search field in the navigation bar <br />Location should be consistent<br />Longer is better: short fields lead to short queries <br />Simple Search forms: limit options<br />Zone or section<br />Dates<br />
    61. 61. Search Field Auto-Complete<br />Dropdown menu of matching words<br />Base on search logs<br />Smallish list, 7-10<br />Most popular<br />Simple sort<br />Alphabetic<br />Price or size<br />Complete range (preferably lowest to highest)<br />
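Auto-complete based on search logs can be approximated with a frequency count and a prefix filter. The sketch below is an assumption-laden illustration: `build_suggester` is a made-up name, the log is a plain list of query strings, and ties are broken alphabetically.

```python
from collections import Counter

def build_suggester(logged_queries, limit=8):
    """Build a suggest(prefix) function from past queries in the search log."""
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix):
        prefix = prefix.strip().lower()
        if not prefix:
            return []
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most popular first, then alphabetical for a stable order
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest
```

The default `limit` of 8 falls in the slide's suggested range of 7 to 10 entries. A larger site would precompute a prefix structure rather than scanning every logged query on each keystroke.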
    62. 62. Other Search Interfaces<br />Heavily researched<br />Natural language <br />Must keep typing<br />Defining a question is quite hard <br />Interactive search<br />Guided interviews<br />But users want immediate results<br />Avatars <br />do not improve interaction<br />
    63. 63. Simple vs. Advanced Search UI <br />Most searches are simple<br />Short: one to three words<br />Fewer than 10% use any operators at all (maybe 1%)<br />Even experts prefer simple search <br />Will use advanced tools if simple doesn’t work <br />Default to simple search, link to advanced search <br />Those are your power users: librarians, techies<br />Expose all possible options<br />Don’t spend huge resources on advanced UI <br />Exploratory search is different<br />
    64. 64. Advanced Search Fits Sometimes<br />eBay<br />High motivation <br />Complex search requirements <br />Frequent use<br />UX testing still required <br />
    65. 65. Search Results: Page Elements<br />Site context <br />General page layout, navigation links<br />Colors and design elements<br />Results header<br />A search field, with the current search terms <br />Retrieval information - how many hits<br />Results list in relevance order<br />Each result item with at least a linked title<br />Facets: dynamic links for filtering results<br />Results footer<br />
    66. 66. Search Results: Good Example <br />Full but readable<br /><ul><li>white space</li><li>content blocks</li></ul>Site look-and-feel<br />Navigation<br />Familiar search results elements <br />
    68. 68. Search Results: Not-So-Good Example<br />Site page has navigation, colors: search results should too<br />
    69. 69. Search Results: Visualization<br />Fascinating to look at, great demos<br />Star charts<br />Topographical displays<br />Interactive fly-throughs<br />Hyperbolic trees <br />Require significant resources to run<br />Good for exploratory & comprehensive research<br />Finding unexpected synergies<br />Simple search is much cheaper for casual users<br />
    70. 70. Search Results: Header Elements<br />Search field, with the current query<br />Users often edit to be more or less restrictive<br />Number of results found<br />A few search options <br />Match Any Word / All Words / Exact Phrase<br />Filter by date option (if trustworthy)<br />Search zones<br />Results navigation<br />Best Bets<br />Spelling suggestions <br />
    71. 71. Search Results: Hits and Pages<br />Show number of items matched<br />Be accurate <br />Do not give estimates for small numbers<br />(Google and SharePoint are bad this way)<br />Pagination - results list navigation<br />Helps user calibrate content<br />Important for exploratory search<br />Follow web search conventions, example<br />&lt; previous 1 2 3 4 ... 26 next &gt;<br />Be accurate<br />
    72. 72. Results Headers: Examples <br />
    73. 73. Search Results: “Best Bets”<br />aka Search Suggestions, QuickLinks, KeyMatch, Recommendations<br />Special-case links for problem queries <br />Internal topic landing pages<br />External sites when appropriate<br />New and better query to search<br /><ul><li>Only implement for very frequent queries</li></ul>Discover problems from users, log analysis<br />“Short head” - few very popular query terms<br />Allocate resources to keep them current<br /><ul><li>Good search results are higher priority</li></ul>
    74. 74. Best Bets Example <br />Best Bets are very clear<br />Would not come first in normal search results<br />
    75. 75. Search Results: List Sorting<br />List of links to items matching the query<br />Sorted by matching terms<br />Impossible to be relevant to every query<br />Variety of sources when possible<br />Transparency: why these items in this order<br />Other sort orders - make very visible<br />By author’s last name<br />By date<br />By price<br />
    76. 76. Search Results: Not Enough Variety<br />
    77. 77. Search Results: Weird Sort<br />Sorted by: “Degrees away”<br />Labels too subtle:<br /><ul><li>Hidden in header</li><li>Degree icon should be on the left side</li></ul>
    79. 79. Result Items: Elements<br />Information foraging: show hints about items<br />Title of document, or name of product<br />Location: URL, file path, database ID <br />May need to rewrite to user-accessible URLs<br />Hide location if it’s not meaningful<br />Distinguishing data <br />Metadata: picture, product code, author name<br />Show match terms in context (snippets)<br />Text before and after query term matches <br />Highlight the matches<br />
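Showing match terms in context can be sketched as follows: find the first query term in the document text, cut a window around it, and wrap each match in a highlight tag. The function name, the `<b>` tag, and the character-based window are assumptions; real engines usually break snippets on word or sentence boundaries.

```python
import html
import re

def make_snippet(text, query_terms, context=30):
    """Return an HTML snippet around the first match, with matches in <b> tags."""
    if not query_terms:
        return html.escape(text[: 2 * context])
    pattern = re.compile("|".join(re.escape(t) for t in query_terms), re.IGNORECASE)
    first = pattern.search(text)
    if first is None:
        return html.escape(text[: 2 * context])
    start = max(0, first.start() - context)
    end = min(len(text), first.end() + context)
    window = text[start:end]
    # Escape the pieces separately so the highlight tags stay intact
    parts, pos = [], 0
    for m in pattern.finditer(window):
        parts.append(html.escape(window[pos:m.start()]))
        parts.append("<b>" + html.escape(m.group(0)) + "</b>")
        pos = m.end()
    parts.append(html.escape(window[pos:]))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + "".join(parts) + suffix
```

Escaping the surrounding text matters because indexed content may contain markup characters that would otherwise break the results page.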
    80. 80. Results Items: Not Enough Content<br />
    81. 81. Results Items: Too Much Content<br />
    82. 82. Results Items: Just Right<br />
    83. 83. Results Items: Additional Data<br />Date (if reliable)<br />Size and File type <br />Avoid surprising launches of Acrobat or other app.<br />Metadata <br />Author, department, brand, product... <br />Access status: password required? <br />Topics and subject headings<br />Taxonomy categories<br />Keywords and concept tags<br />User tags, folksonomy<br />
    84. 84. Results Items: Rich Items Example <br />
    85. 85. Results: Dynamic Clustering <br />Uses search results text to infer topics <br />Groups by similarity in titles and results text<br />Particularly good for portals and intranets<br />Unstructured, uncontrolled text<br />Dynamic, no preprocessing needed<br />Can supplement categorization and taxonomies<br />
    86. 86. Results: Clustering Example<br />
    87. 87. Commerce and Catalog Results <br />Picture or graphic if possible<br />Important attributes <br />Price<br />Color<br />Size<br />Compatibility<br />Availability<br />“Buy” button <br />Simplify process, save time<br />
    88. 88. Online Store Results Example <br />
    89. 89. Multimedia Search <br />Image, audio, and video files<br />Audio and visual similarity search still theory<br />Show context in results <br />Match terms from transcript or OCR<br />Text around image<br />Thumbnails or keyframes<br />
    90. 90. Multimedia Results Example <br />
    91. 91. Results: Faceted Metadata<br />Better than forms for structured text data <br />Exposes attributes as part of search results <br />Leverages metadata<br />Topic names, taxonomy<br />Mundane stuff: color, date, size, author... <br />Choices specifically relating to search results <br />Dynamically generates from metadata <br />Preview numbers offer users confidence in clicking <br />Supported by extensive usability testing<br />Used on a majority of large e-commerce sites<br />
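The "preview numbers" on facets are counts of metadata values within the current result set. A minimal sketch, assuming each result is a dictionary of metadata fields (the field names and function names are invented for the example):

```python
from collections import Counter, defaultdict

def facet_counts(results, facet_fields):
    """Count metadata values across the current results, one Counter per facet."""
    counts = defaultdict(Counter)
    for item in results:
        for field in facet_fields:
            value = item.get(field)
            if value is not None:
                counts[field][value] += 1
    return {field: dict(c) for field, c in counts.items()}

def apply_facet(results, field, value):
    """Narrow the results to items whose metadata matches the clicked facet."""
    return [item for item in results if item.get(field) == value]
```

Because the counts are generated from the results themselves, a facet link never leads to an empty page, which is the main advantage over a fixed search form.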
    92. 92. Why Faceted Search is Better Than Forms<br />
    93. 93. Faceted Metadata: Commerce Example <br />
    94. 94. Faceted Metadata: Library Catalog<br />
    95. 95. No Matches Queries: Causes <br />Misspellings and typing errors<br />Scope problem: nothing for that topic<br />Vocabulary differences <br />Users may be less precise, or use competitor’s terms<br />Marketers may dominate content <br />Restrictive search settings <br />Default may only match exact phrase or all words <br />Access control may disallow user<br />Software/hardware/network failures<br />
    96. 96. No Matches Queries: Responses<br />Track queries with no matches in logs <br />Use sessions, surveys & testing to find user intent <br />Design the no-matches page carefully <br />Explain what is and isn’t on the site <br />Provide useful navigation links<br />Add search engine help <br />Synonyms<br />Best Bets<br />Spelling<br />Add terms to text<br />Add content, topic pages<br />
    97. 97. No Matches Queries: Spelling Issues<br />Detect and address common problems<br />Spelling errors<br />Typos<br />Queries without spaces between words <br />Use site-specific dictionary<br />Easy to build from search index <br />Never suggests any words not on the site<br />Users are familiar with “Did you mean...?”<br />
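A site-specific "Did you mean" can be prototyped with Python's standard `difflib` module, using the index vocabulary as the dictionary. In the sketch below the function name, the similarity cutoff of 0.75, and the two-word split for run-together queries are all assumptions chosen for illustration.

```python
import difflib

def suggest_spelling(query, site_vocabulary, max_suggestions=3):
    """Suggest corrections drawn only from words that exist on the site."""
    vocab = sorted(set(w.lower() for w in site_vocabulary))
    word = query.strip().lower()
    if not word or word in vocab:
        return []  # nothing to correct
    # Queries without spaces between words: "lunchmenu" -> "lunch menu"
    for i in range(1, len(word)):
        if word[:i] in vocab and word[i:] in vocab:
            return [word[:i] + " " + word[i:]]
    # Otherwise fall back to the closest vocabulary words
    return difflib.get_close_matches(word, vocab, n=max_suggestions, cutoff=0.75)
```

Because every suggestion comes from the index vocabulary, following a suggestion should always return at least one result.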
    98. 98. Good Example of No-Matches Page <br />
    99. 99. Empty Searches<br />Users click or press “enter” in the search box<br /><ul><li>Test for this special case</li></ul>Should not find all items in the index<br /><ul><li>Interaction options:</li></ul>Do nothing<br />Go to a simple search page<br />Show an error dialog<br />
    100. 100. Search Engine Maintenance <br />Index maintenance <br />Obsolete content removal<br />Check for new content<br />Track technical problems (bad links, servers down)<br />Search quality <br />Re-run test suite<br />Compare with original results<br />Add new test queries<br />Track user feedback, surveys<br />Use metrics and log analysis to catch trends <br />
    101. 101. Metrics for Search Engines <br />Server uptime<br />Errors: how often and how serious<br />Index<br />Size on disc and in memory<br />Number of entries<br />Number and type of indexing errors<br />Search traffic <br />Queries per minute (60 qpm is common)<br />Average clicks on results items per query<br />Average next-page views per query<br />Number and percent of no-match queries<br />
    102. 102. Search Log Analysis <br />Most frequent query terms<br />Short head: a few very popular terms<br />Long tail of unique queries<br />Lots of junk: URLs, spam, gibberish<br />Frequent query terms not matched - fix somehow<br />More esoteric analysis - need a lot of data<br />Frequent query terms with low click-through <br />Frequent query terms with high “next page” clicks<br />Raw logs<br />Import into database for ad-hoc reports<br />Session analysis can be enlightening<br />
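The basic log reports listed here need very little code once the log is reduced to query and match-count pairs. The sketch below assumes that reduced format; the function name and the summary keys are hypothetical, and real logs would need parsing and junk filtering first.

```python
from collections import Counter

def summarize_log(entries, top_n=3):
    """Summarize a search log given (query, match_count) pairs."""
    queries = Counter()
    no_match = Counter()
    total = 0
    for query, match_count in entries:
        q = " ".join(query.lower().split())  # normalize case and spacing
        if not q:
            continue  # skip empty searches
        total += 1
        queries[q] += 1
        if match_count == 0:
            no_match[q] += 1
    return {
        "total": total,
        "top_queries": queries.most_common(top_n),      # the short head
        "top_no_match": no_match.most_common(top_n),    # candidates to fix
        "no_match_percent": 100.0 * sum(no_match.values()) / total if total else 0.0,
        "unique_queries": sum(1 for n in queries.values() if n == 1),  # long tail
    }
```

The frequent no-match queries are the natural starting point for synonyms, Best Bets, or new content.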
    103. 103. Choosing a Search Engine <br />Find specific information needs<br />Analyze content<br />Sources and formats<br />Rough number of pages / records / items<br />Define platform, API, language requirements<br />Buy (or use open source), don’t build<br />User surveys show problems with home-grown <br />Choose & compare likely candidates<br />Gathering, indexing, retrieval, relevance features<br />Scaling<br />Administration tools<br />Continuing development, support, user groups<br />Price<br />
    104. 104. Information Needs Analysis <br /><ul><li>What works already? </li></ul>Don’t fix what’s not broken <br />Where is the real pain?<br />Difficult search syntax<br />Data silos<br />New content not findable<br />What requires more complex tools? <br />Exploratory search<br />Scientific & academic research <br />Business intelligence and data mining<br />Comprehensive legal discovery<br />
    105. 105. Content Inventory<br />Work with Information Architects <br />Use existing taxonomies and catalogs<br />Learn what you have <br />Simple static HTML pages<br />Other formats: PDF, Office documents (which version)<br />CMS, document management, publishing systems<br />Databases and legacy systems<br />Multimedia audio and video files<br />Identify more and less valuable data <br />Some content should be in archives<br />
    106. 106. Search Engine Deployment Types<br />Software <br />Controlled by local IT<br />Flexible installation<br />Open-source - several high quality packages<br />Search Appliances <br />Server hardware/software combinations<br />Require very little technical attention<br />Check development and backup server pricing<br />Remote Search Services (SaaS) <br />Index using robot spiders or remote access<br />Query goes to service, results go back to user <br />Low network, hosting, IT load<br />
    107. 107. Scaling Search to Millions & Billions <br />What are the largest installations for each?<br />Talk to them before committing<br />Cache frequent queries<br />Add query servers, automated load balancing<br />Indexing at scale<br />Indexing on dedicated servers<br />Deal with new calls for near-real-time indexing <br />Distribute multiple clones of indexes<br />Segment indexes, parallel lookups, merge results<br />
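For segmented indexes, the final merge step can be illustrated with the standard library's `heapq.merge`. This sketch assumes each segment already returns its hits as `(score, doc_id)` pairs sorted by descending score, and that the scores are comparable across segments (for example, computed with shared collection statistics); the function name is hypothetical.

```python
import heapq
from itertools import islice

def merge_segment_results(segment_results, limit=10):
    """Merge per-segment hit lists into one top-N list.

    segment_results: list of lists of (score, doc_id), each sorted by
    descending score. Only the first `limit` merged hits are materialized.
    """
    merged = heapq.merge(*segment_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

If segment scores are not comparable, a merged list can rank weak hits from one segment above strong hits from another, which is the same problem federated search faces with dissimilar sources.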
    108. 108. Testing Search Indexing <br />Choose 3-4 good candidates <br />Index as much content as possible <br />Watch the robot, track errors<br />Try to index tricky data sources<br />Compare coverage among them<br />Test index scaling<br />Make a really big index based on expected use<br />Speed of add/ update/ delete<br />Responsiveness during big update<br />
    109. 109. Evaluating Search Results<br />Create a query test suite<br />Use existing search logs if possible<br />Short, long, unusual, common (check cache)<br />Simple and complex queries <br />Spelling, typing and vocabulary errors<br />Many matches, few matches, no matches<br />Perform searches against the test engines<br />Save results pages as HTML for later checking<br />Analyze differences among them<br />Retrieval (and indexing): what’s found?<br />Relevance: are the top results good ones?<br />
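One simple way to analyze differences among candidate engines is to compare their top results for each test query. The sketch below assumes each engine's results have been reduced to an ordered list of URLs or document IDs; the function name and the overlap measure are illustrative choices, not a standard metric.

```python
def compare_top_results(results_a, results_b, k=10):
    """Compare the top-k results two engines return for the same query."""
    top_a, top_b = results_a[:k], results_b[:k]
    shared = [doc for doc in top_a if doc in top_b]
    return {
        # Share of the top-k slots that both engines agree on
        "overlap": len(shared) / max(len(top_a), len(top_b), 1),
        "only_a": [d for d in top_a if d not in top_b],
        "only_b": [d for d in top_b if d not in top_a],
    }
```

Low overlap does not say which engine is better; it flags the queries where a person should look at both result pages and judge relevance.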
    110. 110. Search: Not a Black Box<br />Simple search solves many enterprise problems <br />Dynamic access to local content<br />Familiar interface, expectations<br />User vocabulary <br />Understand the real information needs <br />Index the right stuff<br />Work with content providers and IAs <br />Link to specialty research engines<br />Learn from users over time, make it better <br />