Fundamentals Of Search

These slides are from my 2009 Fundamentals of Search workshop at KMWorld. Please contact me for information about search engines, consulting, workshops and training.

Published in: Technology, Design

  • http://www.slideshare.net/bdelacretaz/beyond-fulltext-searches-with-lucene-and-solr
  • Great book: "Search User Interfaces" by Marti Hearst
  • Transcript

    • 1. The Fundamentals of Enterprise Search
      KMWorld 2009
      Avi Rappoport, Search Tools Consulting
      www.searchtools.com
      consult9@searchtools.com
      www.searchtools.com/slides/kmw09/fundamentals-of-search.html
    • 2. What’s In This Workshop
      Overview of enterprise search, in context
      Search engine processes
      Robot spiders, database access
      Indexing
      Security
      Query parsing, retrieval, and relevance ranking
      Usable search interfaces.
      Maintenance and Analytics
      Methods for choosing a good search engine
      Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com
    • 3. About SearchTools
      Avi Rappoport is a librarian (MLIS from Berkeley)
      Software developer and product manager
      User interface designer
      Long-time search consultant
      Editor & Publisher, www.searchtools.com
      Search Tools Consulting
      Search needs analysis and recommendations
      Enterprise search evaluation
      Outsourced search administration
    • 4. Defining Enterprise Search
      Large scale web site search
      Corporate sites
      Institutional sites
      Online stores
      Intranet search
      Crossing departmental lines
      Opening data silos
      Extranets
      Portal Search
    • 5. Similarities to Webwide Search
      Robot crawlers
      HTML over HTTP
      Scaling to millions of items
      Distributed processing
      Full-text indexing of content
      Simple query language
      Relevance ranking of results
      TF-IDF (term frequency : inverse document frequency)
      Familiar results list
    • 6. Differences from Web Search
      Limited scope
      A site, set of sites, extranet, or intranet
      Few meaningful hyperlinks
      PageRank and link analysis are less useful
      Security and access control issues
      Content in databases, CMSs, etc.
      More control
      Index update scheduling
      Some content is very valuable, other content is not
      No search spam
    • 7. Text Search vs. Database Search
      Indexes multiple content sources
      Database fields, files, web pages, feeds...
      Simple search commands instead of SQL
      Flexible indexing and retrieval
      Relevance ranking (this is a major issue)
      Does not compete for database resources
      Easy to scale separately from DBMS
      New features: spellcheck, auto complete, facets
      Works in the real world, from eBay to Google
    • 8. Search and Information Architecture
      Information Architecture
      The art and science of organizing information for access and use.
      IA work enriches search
      Creates order and systems
      Provides standard vocabulary
      Removes ROT (redundant, obsolete, trivial)
      Search supplements IA
      Supports user vocabularies
      Changes dynamically with new content
    • 9. Search and Taxonomy
      Taxonomy creates categories
      Labels and metadata
      Improves quality of search results
      Additional metadata extremely valuable
      Search crosses categories
      Bypasses ambiguous topic labels
      Useful for novices
      Supports user vocabulary
      Dynamic updates for new topics
    • 10. Search & Knowledge Management
      KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine)
      Organizes information, processes and people
      Offers collaboration and archiving tools
      Attempts to regularize implicit knowledge
      Search mostly matches words
    • 11. Two Main Types of Search
      Known-item search
      Short queries
      “Good-enough” answers
      Exploratory search
      Research - finding unknowns
      Scientific, legal, medical, business, sales
      Conceptual overviews
      Completeness - all possible relevant items
      Law enforcement
      Medicine
    • 12. Search as an Iceberg
      All people see is the search box and results list
      Invisible functionality
      Indexes
      Query processing
      Retrieval
      Relevance ranking
      Search is a mystery
      But it’s just software
    • 13. Elements of Search Engines
      Automated tools to collect content
      Specialized storage for quick retrieval
      Query processing and expansion
      Retrieval (matching query to index content)
      Relevance ranking
      Search results interfaces
      Analytics, metrics and maintenance
    • 14. Choosing Content To Index
      Information sites
      Consider indexing every single page
      Use search indexing as a discovery mechanism
      Online stores, catalogs
      Product information: cost, color, size, materials
      Other: return policies, CEO’s name, jobs listing
      Intranets
      Intranet portal and core servers
      May need archive servers and search
      Multimedia: images, audio, video
      Metadata at least
    • 15. (Near) Real Time Indexing
      Twitter has changed expectations
      Even in intranets
      Index must support partial updates
      Search engines finding limits at scale
      Distribute indexing and indexes
      Trigger index updates (push vs. pull)
      Continuous feed
      Send web service message
      Database trigger
      Update watched URLs with new links
    • 16. Indexing and Security
      Search can undermine “security by obscurity”
      One link can expose a whole set of documents
      Work with your security team
      List areas which contain sensitive content
      Define words which trigger further analysis
      Create a process for removing sensitive data
      Indexing encrypted content
      Search engine uses SSL client for indexing
      Encrypt search results before returning
      Physical security on search servers
    • 17. Search and Access Control
      Authentication and authorization in indexing
      “Basic authentication” - user name and password
      NT Security integration
      ACLs and single sign-on
      Conform to security rules during indexing
      Keep access control info as part of document store
      Showing results - who can see what?
      Access to search engine itself
      Collection-level access control
      Locked results as teaser for subscription
      Hit-level access control
      Check before displaying results
    • 18. Indexing: Sources of Content
      Web sites
      Intranets
      Extranets
      Blogs
      Wikis
      Mailing list archives & email public folders
      File systems & shared servers
      NFS, SMB, AFP, GFS, ftp, WebDAV
      Content Management Systems
      Databases
      Legacy programs in silos
    • 19. Indexing: Robot Spiders
      Start with base URL for all hosts
      For each page, repeat
      Read text into internal format
      Save document in cache
      Save words into index
      Extract all links and check the rules
      If they are new URLs, add them to the list
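
The spider loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any product's crawler: the `PAGES` dictionary is a hypothetical in-memory stand-in for fetching pages over HTTP, and the function names and the host and path rules are invented for the example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web": URL -> (page text, links found on that page).
PAGES = {
    "http://example.com/": ("Home page", ["/about", "/products", "http://other.example.org/"]),
    "http://example.com/about": ("About us", ["/", "/products"]),
    "http://example.com/products": ("Product list", ["/about", "/private/report"]),
    "http://example.com/private/report": ("Internal report", []),
}


def allowed(url, allowed_host, disallowed_prefixes):
    """Crawl rules (assumed): stay on one host and skip disallowed paths."""
    parts = urlparse(url)
    if parts.netloc != allowed_host:
        return False
    return not any(parts.path.startswith(prefix) for prefix in disallowed_prefixes)


def crawl(start_url, allowed_host, disallowed_prefixes=()):
    queue = deque([start_url])  # URLs waiting to be read
    seen = {start_url}          # every URL ever queued, so none is read twice
    cache = {}                  # URL -> stored document text
    while queue:
        url = queue.popleft()
        if url not in PAGES:    # stands in for a failed fetch
            continue
        text, links = PAGES[url]
        cache[url] = text       # save the document; a real indexer would also index its words
        for href in links:
            target = urljoin(url, href)  # resolve relative links against the current page
            if target not in seen and allowed(target, allowed_host, disallowed_prefixes):
                seen.add(target)
                queue.append(target)
    return cache


crawled = crawl("http://example.com/", "example.com", disallowed_prefixes=("/private/",))
print(sorted(crawled))
```

The `seen` set is what keeps the loop from revisiting pages that link to each other, which is the simplest form of the duplicate problems discussed on the later slides.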
    • 20. Robot Indexing Spider
    • 21. Common Problems With Robots
      Pages that are not linked from anywhere
      Spider disallowed by robots.txt or robots meta
      URLs with ? and & (all robots should handle these now)
      JavaScript, forms, and interactive dynamic links
      Some robots can handle some of these
      Session IDs that change
      Duplicate detection
      Multiple views of the same data (Lotus, wikis)
      Symbolic links & bad redirects
      Multiple copies of files or directories
    • 22. Indexing: Other Data Sources
      RSS feeds: nice clean text
      File servers: SMB, file:/// etc.
      Content / Document Management Systems
      Email archives
      Databases via ODBC, JDBC, Oracle API
      Full-text content
      Metadata: library catalog records, yellow pages
      External sources using APIs
      (Application programming interfaces)
      News feeds (Reuters, AP)
      Twitter
    • 23. Indexing: Text Files
      Plain text is easy
      RTF export format text easy to find
      HTML semi-structured text
      Content is between tags and in attributes
      Generated by JavaScript - hard to extract
      Bad HTML, especially missing </ close tags
      XML files (structured)
      Many tags are document-level
      Content is between tags and in attributes
      Complex tag hierarchy
      TEI (Text Encoding Initiative) & Semantic Web
      XQuery and XPath tools
    • 24. Indexing: Binary File Formats
      PDF
      Scanned, may not have any text
      Bad PDF generators break words at columns
      “Shadow” text effect duplicates letters
      SWF and Flash: API may not load dynamic text
      Office documents
      Word processing files (may have hidden text from revisions)
      Spreadsheets (hard to know what to grab)
      Presentations
      Note: new docx, xlsx, pptx are really XML file sets
      CAD and project files
      Metadata (properties, Adobe XMP)
    • 25. Indexing: Tokenizing
      Lowercase all characters (aka ‘folding’)
      Tokenizing makes words searchable
      Break on punctuation and spaces
      Recognize special words: C++ @ [TS]
      Typography issues: the ligature ﬆ is really “st”
      HTML escaped text: m&ouml;chten = möchten
      Special cases for structured strings
      Numbers, Prices, Dates
      N-grams - an alternate approach
      Break into short text patterns
      Takes a lot of index space
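
A rough sketch of these tokenizing steps, assuming a hand-maintained list of special words and a simple "split on non-word characters" rule. The names `SPECIAL_WORDS`, `tokenize`, and `char_ngrams` are hypothetical; real engines use far richer, language-specific rules.

```python
import html
import re

# Hypothetical list of special tokens that must survive punctuation splitting.
SPECIAL_WORDS = {"c++", "c#", ".net"}


def tokenize(text):
    """Unescape HTML entities, fold case, then break on spaces and punctuation."""
    text = html.unescape(text).lower()  # m&ouml;chten -> möchten, then lowercase
    tokens = []
    for chunk in text.split():
        trimmed = chunk.strip(".,;:!?\"'()[]")  # drop punctuation around the chunk
        if trimmed in SPECIAL_WORDS:
            tokens.append(trimmed)              # keep "c++" whole
        else:
            tokens.extend(re.findall(r"\w+", chunk))
    return tokens


def char_ngrams(token, n=3):
    """Alternate approach: overlapping character n-grams of one token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(tokenize("Wir m&ouml;chten C++, not Java."))
print(char_ngrams("search"))
```

The n-gram output shows why that approach takes more index space: a single six-letter word becomes four index entries.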
    • 26. Indexing: Character Set Issues
      World has many charsets (aka scripts, alphabets)
      English has a simple alphabet: 26 letters, 10 numbers
      Other Roman languages: extended (ç, î, ß)
      Non-Roman one byte: Cyrillic, Arabic, Hebrew
      Asian two bytes: Chinese, Japanese, Korean
      Identifying character sets
      Unicode characters
      Older usage: language “code pages”
      HTTP header or <META http-equiv>
      Statistical detection techniques
    • 27. Indexing: Language Issues
      Text search works across languages
      Simple pattern-matching, query to index
      Language-specific indexing improves search
      Tokenizing using appropriate rules
      Compound nouns (kindergarten)
      Language rules for stemming
      Singular version of thés is thé
      Language detection
      Trusted tags
      Bilingual dictionaries
      Statistical matches, n-grams
      Documents may have mixed languages…
    • 28. Indexing: Multimedia
      Images, photos, drawings, sound, scores, video
      External metadata
      File name
      Link text, surrounding words
      Internal metadata
      ID3 tags for music
      EXIF and other digital photo information
      Subtitles (sometimes)
      Content
      OCR to extract graphic text and closed captions
      Audio: Speech-to-text conversion, still buggy
      Use human judgment not just automated systems
    • 29. Inverted Index Diagram
      Inverted indexes work well
      Lots of IR research shows this
      Better than DBMS
      Alphabetical list of tokens
      Tokens not in paragraph order, thus, inverted
      Each token has ID of source
    • 35. Richer Index Structures
      Store word position (for phrase matching)
      Enclosing tag or field
      Document metadata
      Database field names
      Image (which attribute)
      Named anchor text
      Text markup tags (TEI, Semantic Web)
      Extracted entities
      Personal names, companies, geo locations, dates
      Anchor text from incoming links
      Can be very descriptive
      Add to index as if part of the target document
    • 36. Example Inverted Index Structure
      For each word
      Document ID
      Position
      Tag name
      For each document
      ID
      Title
      URL
      Description
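
The structure on this slide maps naturally onto two dictionaries. The sketch below is an assumption-laden toy, not a real engine's on-disk format: `DOCUMENT_STORE`, `build_inverted_index`, and the sample documents are hypothetical, and tokenizing is reduced to lowercasing and splitting on spaces.

```python
from collections import defaultdict

# For each document: ID -> title, URL, description (sample data is invented).
DOCUMENT_STORE = {
    1: {"title": "Search basics", "url": "http://example.com/basics",
        "description": "How search engines index text"},
    2: {"title": "Index design", "url": "http://example.com/design",
        "description": "Inverted index structures"},
}


def build_inverted_index(store, tags=("title", "description")):
    """For each word, record (document ID, position, tag name)."""
    index = defaultdict(list)
    for doc_id, doc in store.items():
        position = 0  # running word position within the document
        for tag in tags:
            for word in doc[tag].lower().split():
                index[word].append((doc_id, position, tag))
                position += 1
    return index


index = build_inverted_index(DOCUMENT_STORE)
print(sorted(index))       # alphabetical list of tokens
print(index["index"])      # postings for one token
```

Looking up a word returns its postings directly, and the document ID in each posting is the key back into the document store for the title and URL shown in results.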
    • 37. Indexing: Stopwords
      Stopwords - very common terms
      Linguistic (a an the as he she it you new)
      Ubiquitous (names, copyright, click here)
      Consequences of excluding stopwords:
      Reduces the size of index files
      Improves recall, finds more matching documents
      Fails some queries
      As You Like It, IT copyright policy
      Problems matching phrases: “New York University”
      Solutions vary:
      Index everything, pay the price in index size
      CommonGrams: n-grams of frequent phrases
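
A tiny illustration of the failure cases named on this slide, using the slide's own linguistic stopword list. The function name `index_terms` is hypothetical and tokenizing is simplified to lowercasing and splitting on spaces.

```python
# Stopword list taken from the slide's "linguistic" examples.
STOPWORDS = {"a", "an", "the", "as", "he", "she", "it", "you", "new"}


def index_terms(text, stopwords=STOPWORDS):
    """Return the terms that survive stopword removal."""
    return [word for word in text.lower().split() if word not in stopwords]


for query in ("As You Like It", "IT copyright policy", "New York University"):
    print(query, "->", index_terms(query))
```

Only "like" survives from the play title, the "IT" in the policy query disappears, and the phrase "New York University" loses its first word, so none of the three can be matched as the user intended.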
    • 38. Stopwords Problems: Example
      Searching wordpress.com for whatever will be
      Finds all matches for whatever (stopwords ignored)
      Useless results ranking
      No matches for will be
      One ad gets it right
      External search finds over 3,000 pages on site with phrase
    • 43. Indexing: Stemming
      Singular query should find plural words & vice versa
      Shoe <=> shoes, cans <=> can, geese <=> goose
      Statistical and probabilistic truncation rules
      Linguistic rules
      Lemmatization - stemming based on part of speech
      Stemming before indexing
      Improve recall: find all forms of a word
      Reduce index size
      Consequences of extreme stemming
      Short query problems
      Search for Ran shouldn’t match Run, Lola, Run
      Other options
      Index everything (makes indexes larger and queries slower)
      New idea: CommonGrams (n-grams of frequent phrases)
    • 44. Indexing: Document Store
      Minimum
      ID (key for inverted index)
      Unique location (URL / file path / record ID)
      Richer document store
      Implicit metadata: filename, size, location
      Explicit metadata
      Title, date, keywords, author
      Taxonomy labels, classification, user tagging
      Language, character set
      Access control settings
      Full text of the document
      For snippets and caching
    • 45. Indexing: Dealing with Duplicates
      Detecting duplicate documents
      Exact match is fairly easy: checksums
      Document similarity check: harder but worth it
      Choosing the primary copy
      Most recent (if reliable)
      Rules based on path or metadata
      New web search “canonical” tag
      What to do with duplicates
      Remove from the index: saves space
      Hide in results unless requested
      That’s the Google way
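
Exact-duplicate detection with checksums might look like the sketch below. It assumes SHA-256 as the checksum and collapses whitespace before hashing, which is a choice made for this example rather than something the slides specify; the function names and sample paths are hypothetical. Near-duplicate (similarity) detection needs different techniques and is not shown.

```python
import hashlib


def checksum(text):
    """Checksum of the text after collapsing runs of whitespace (an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(docs):
    """docs maps location -> text; returns groups of locations with identical checksums."""
    groups = {}
    for location, text in docs.items():
        groups.setdefault(checksum(text), []).append(location)
    return [locations for locations in groups.values() if len(locations) > 1]


docs = {
    "/reports/annual.html": "Annual report 2009",
    "/archive/annual.html": "Annual  report 2009\n",
    "/reports/quarterly.html": "Quarterly report",
}
print(group_duplicates(docs))
```

Each group would then go through the "choosing the primary copy" rules, with the remaining copies removed from the index or hidden in results.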
    • 46. Indexing: Document Dates
      HTTP servers lie about dates
      Frequent wrong settings: 1969, 2040
      Dynamic pages send the current timestamp
      File systems lie about dates
      Applications lie about dates
      Indexers do the best they can
      Metadata (date tag, property, tag DC.date)
      Extract from page content
      Checksum to see if file has changed since last index
      Consider external metadata repository
    • 47. Search Process Flow
    • 48. Where the Queries Come From
      User-entered text in search fields
      Search navigation: moving around in results list
      Previous searches
      May just be repeated clicks on URL
      Save Search feature
      Simplistic alerts
      Facet click to add a metadata filter
      May re-issue search with additional terms
      May be navigational, no text query
      Scripts or automated queries
      Dynamic links (find all pictures by this artist)
      Geographic information systems
    • 49. Query Processing Steps
      Try to recognize the character set and language
      Tokenize the text by language rules
      Break at spaces and punctuation
      Same algorithm as index tokenizer
      Check for operators
      Internet Query Operators: + - "quotes"
      Boolean Operators: AND OR NOT & | !
      Others: NEAR, (parentheses)
      Check for field names, zones, other filters
      Example: title:lunch location=94703
      Handle the rare natural language question
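
The operator and field checks can be sketched with one regular expression. This is a simplified, hypothetical parser: it handles only the `+`, `-`, quoted-phrase, and `field:value` forms, ignores Boolean keywords, parentheses, and the `field=value` form, and the `Clause` type and `parse_query` name are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class Clause(NamedTuple):
    op: str               # "+" required, "-" excluded, "" no operator
    field: Optional[str]  # e.g. "title", or None for full text
    text: str             # the term or phrase, lowercased
    is_phrase: bool


# Optional +/- sign, optional "field:" prefix, then a quoted phrase or a bare word.
CLAUSE_RE = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def parse_query(query):
    clauses = []
    for op, field, phrase, word in CLAUSE_RE.findall(query):
        text = (phrase or word).lower()
        if text:  # skip empty quoted phrases
            clauses.append(Clause(op, field or None, text, bool(phrase)))
    return clauses


for clause in parse_query('+title:lunch -meeting "free food" Berkeley'):
    print(clause)
```

In a real engine the text of each clause would go through the same tokenizer used at indexing time, as the slide notes.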
    • 50. Query Expansion
      Stemming
      Dependent on index stemming choices
      Good to find singular/plural forms
      Word similarity searching - increases recall
      Fuzzy matching
      Phonetic, soundex, sound-alike
      May overwhelm exact matches
      Synonym expansion, should be site-specific
      bus => coach, ATM => Air Tasking Message
    • 51. Search: Retrieval, Recall & Precision
      Retrieval
      Finding the documents matching a particular query
      Recall
      Finding every relevant document
      Precision
      Finding only relevant documents
      Balance more recall vs. better precision
      Use search logs and user studies to guide choices
      Use precision as part of relevance ranking
      Top results should be more exact matches
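
Recall and precision are simple ratios once someone has judged which documents are relevant. A minimal sketch, with a hypothetical function name and made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """precision = relevant hits / retrieved; recall = relevant hits / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# The engine returned four documents; five documents were judged relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(precision, recall)
```

Here half of what was returned is relevant (precision 0.5) but only two of the five relevant documents were found (recall 0.4). Query expansion would typically raise the second number at the cost of the first.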
    • 52. One-Word Text Retrieval
      Fast binary search in inverted index
      Check index updates on disk or in memory
      If there are distributed indexes, merge results
      Store the related document information in a list
      Document ID
      Term frequency in document
      Term positions in the document
      Note: The document list is not yet sorted
      Frequent searches may be cached
      “Short head” vs. “long tail”
    • 53. Multi-Word Text Retrieval
      Relationship between words defines results
      Boolean AND, + operator, find all default
      Only documents which contain all terms
      Boolean OR operator, find any default
      All documents with any term
      Boolean NOT, - operator
      All documents with the first term but not next term
      Phrase operators, quotes
      Only documents with the words as a phrase
      Also check for zones or field filters
      Parentheses: use for order of processing
      Merge resulting lists
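
With positions stored in the index, the operators on this slide reduce to set operations plus a position check for phrases. The postings below are invented sample data and the function names are hypothetical; this is a sketch of the idea, not an optimized implementation.

```python
# Hypothetical positional postings: term -> {document ID: [word positions]}.
POSTINGS = {
    "new":        {1: [0], 2: [3], 3: [0]},
    "york":       {1: [1], 3: [5]},
    "university": {1: [2], 2: [0]},
}


def docs_with(term):
    return set(POSTINGS.get(term, {}))


def search_and(terms):
    """Only documents which contain all terms."""
    return set.intersection(*(docs_with(t) for t in terms)) if terms else set()


def search_or(terms):
    """All documents with any term."""
    return set.union(*(docs_with(t) for t in terms)) if terms else set()


def search_not(first, excluded):
    """Documents with the first term but not the excluded term."""
    return docs_with(first) - docs_with(excluded)


def search_phrase(terms):
    """Only documents where the terms appear at consecutive positions."""
    matches = set()
    for doc in search_and(terms):
        for start in POSTINGS[terms[0]][doc]:
            if all(start + i in POSTINGS[t][doc] for i, t in enumerate(terms)):
                matches.add(doc)
                break
    return matches


print(search_and(["new", "york"]), search_phrase(["new", "york"]))
```

Document 3 contains both "new" and "york" but not next to each other, so it matches the AND search and fails the phrase search.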
    • 54. Relevance Ranking Algorithms
      Relevance
      The likelihood that an item will fill an information need
      Based on documents in retrieval list
      Most common algorithm: TF:IDF
      (Term frequency : inverse document frequency)
      How often is the query word in the document?
      How often is the word in the index?
      Other relevance algorithms
      Vectors and document-query similarity
      Linguistic analysis and Natural Language Processing
      Statistical and Bayesian analysis
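
One common way to turn those two questions into a score is sketched below. TF-IDF has many variants; this one assumes term frequency normalized by document length and a natural-log inverse document frequency, and it is not claimed to match any particular engine. The sample documents and the function name are hypothetical.

```python
import math
from collections import Counter

DOCS = {
    "d1": "search engines rank search results",
    "d2": "database engines store records",
    "d3": "users enter short queries",
}


def tf_idf_rank(query, docs):
    """Score each document as the sum over query terms of tf * idf."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = Counter()
    for term in query.lower().split():
        df = sum(1 for words in tokenized.values() if term in words)  # document frequency
        if df == 0:
            continue
        idf = math.log(n_docs / df)  # rarer terms in the index weigh more
        for doc_id, words in tokenized.items():
            tf = words.count(term) / len(words)  # how often the term is in this document
            if tf:
                scores[doc_id] += tf * idf
    return scores.most_common()  # highest score first


ranking = tf_idf_rank("search engines", DOCS)
print(ranking)
```

The document that repeats the rarer word "search" ranks first, while "engines", which appears in two of the three documents, contributes much less to either score.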
    • 55. Relevance Heuristics
      Phrase matches for multiple query terms
      Logs show most multi-word searches are phrases
      Query terms found in special sections
      Title
      Metadata
      Top of document
      All terms matched in document
      Even when not relevant, it’s transparent
      Old systems gave excess weight to single rare terms
    • 56. More on Relevance
      Relevance is task-specific
      Results can never please all of the people
      More like berry-picking than like hunting
      Link analysis (PageRank) not very useful
      Intranet and site links tend to be navigational
      Situation-specific adjustments
      Some areas more likely to be valuable
      Current content
      Local content
    • 57. Federated Search and Relevance
      Send query to multiple search engines
      May require special syntax
      Response time often a factor
      Receive results in relevance order for each
      Display results, two options
      Separate sections for each search engine
      Merged single relevance rank list
      Works if all search indexes are similar
      Problems where the sources are very different
    • 58. Retrieval: Access Control
      Limit access to search itself
      User enters password or other credentials
      Search only accepts queries when authenticated
      Collection-level access control
      Query filter only retrieves items from allowed groups
      Hit-level access control
      Real-time check for user access on documents
      Start with most relevant documents
      Repeat until there are ten (may be slow)
      Display top results, include estimate of how many more
      Show helpful message if user can’t see any
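
The hit-level loop can be sketched as follows. The `can_read` callback stands in for whatever real-time permission check the security system provides, and the way the "how many more" estimate is computed (assuming the rest of the list is readable at the same rate as the part already checked) is an assumption made for this example.

```python
def visible_results(ranked_doc_ids, user, can_read, page_size=10):
    """Walk the ranked list, keeping only documents the user may see."""
    visible = []
    checked = 0
    for doc_id in ranked_doc_ids:      # start with the most relevant documents
        checked += 1
        if can_read(user, doc_id):     # real-time access check per hit
            visible.append(doc_id)
            if len(visible) == page_size:
                break                  # repeat until there are ten
    remaining = len(ranked_doc_ids) - checked
    # Rough estimate: assume the unchecked hits are readable at the same rate.
    estimated_more = round(remaining * len(visible) / checked) if checked else 0
    return visible, estimated_more


def demo_can_read(user, doc_id):
    """Hypothetical rule for the demo: every third document is restricted."""
    return doc_id % 3 != 0


page, more = visible_results(list(range(1, 31)), "alice", demo_can_read)
print(page, more)
```

The loop had to check fourteen hits to fill a page of ten, which is why this approach can be slow when a user can see only a small share of the matching documents.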
    • 59. Search User Experience
      Limit user interface complexity
      Show the scope of the information covered
      Expose query expansion and contraction
      Use familiar UI elements
      User experience goes beyond interface
      Index coverage
      Query syntax
      Retrieval quality and speed
      Relevance ranking (first ten are vital)
    • 60. Search Forms Interface
      Balance simplicity with functionality
      Put a search field in the navigation bar
      Location should be consistent
      Longer is better: short fields lead to short queries
      Simple Search forms: limit options
      Zone or section
      Dates
    • 61. Search Field Auto-Complete
      Dropdown menu of matching words
      Base on search logs
      Smallish list, 7-10
      Most popular
      Simple sort
      Alphabetic
      Price or size
      Complete range (preferably lowest to highest)
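
Suggestions based on search logs can be sketched as a prefix match over past queries, sorted by popularity. The function name, the sample log, and the alphabetical tie-break are assumptions for illustration; a production system would precompute this rather than scan the log on each keystroke.

```python
from collections import Counter


def suggest(prefix, query_log, limit=7):
    """Return up to `limit` logged queries starting with prefix, most popular first."""
    counts = Counter(query.strip().lower() for query in query_log)
    prefix = prefix.lower()
    matches = [(query, n) for query, n in counts.items() if query.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))  # popularity, then alphabetical
    return [query for query, _ in matches[:limit]]


log = [
    "vacation policy", "vacation policy", "vacation request", "VPN setup",
    "vacation policy", "vacation request", "vacancy list",
]
print(suggest("vac", log))
```

The default `limit` of 7 follows the slide's advice to keep the list smallish.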
    • 62. Other Search Interfaces
      Heavily researched
      Natural language
      Must keep typing
      Defining a question is quite hard
      Interactive search
      Guided interviews
      But users want immediate results
      Avatars do not improve interaction
    • 63. Simple vs. Advanced Search UI
      Most searches are simple
      Short: one to three words
      Fewer than 10% use any operators at all (maybe 1%)
      Even experts prefer simple search
      Will use advanced tools if simple doesn’t work
      Default to simple search, link to advanced search
      Those are your power users: librarians, techies
      Expose all possible options
      Don’t spend huge resources on advanced UI
      Exploratory search is different
    • 64. Advanced Search Fits Sometimes
      eBay
      High motivation
      Complex search requirements
      Frequent use
      UX testing still required
    • 65. Search Results: Page Elements
      Site context
      General page layout, navigation links
      Colors and design elements
      Results header
      A search field, with the current search terms
      Retrieval information - how many hits
      Results list in relevance order
      Each result item with at least a linked title
      Facets: dynamic links for filtering results
      Results footer
    • 66. Search Results: Good Example
      Full but readable
      White space
      Content blocks
      Site look-and-feel
      Navigation
      Familiar search results elements
    • 68. Search Results: Not-So-Good Example
      Site page has navigation, colors: search results should too
    • 69. Search Results: Visualization
      Fascinating to look at, great demos
      Star charts
      Topographical displays
      Interactive fly-throughs
      Hyperbolic trees
      Require significant resources to run
      Good for exploratory & comprehensive research
      Finding unexpected synergies
      Simple search is much cheaper for casual users
    • 70. Search Results: Header Elements
      Search field, with the current query
      Users often edit to be more or less restrictive
      Number of results found
      A few search options
      Match Any Word / All Words / Exact Phrase
      Filter by date option (if trustworthy)
      Search zones
      Results navigation
      Best Bets
      Spelling suggestions
    • 71. Search Results: Hits and Pages
      Show number of items matched
      Be accurate
      Do not give estimates for small numbers
      (Google and SharePoint are bad this way)
      Pagination - results list navigation
      Helps user calibrate content
      Important for exploratory search
      Follow web search conventions, example
      < previous 1 2 3 4 ... 26 next >
      Be accurate
    • 72. Results Headers: Examples
    • 73. Search Results: “Best Bets”
      aka Search Suggestions, QuickLinks, KeyMatch, Recommendations
      Special-case links for problem queries
      Internal topic landing pages
      External sites when appropriate
      New and better query to search
      Only implement for very frequent queries
      Discover problems from users, log analysis
      “Short head” - few very popular query terms
      Allocate resources to keep them current
      Good search results are higher priority
    • 74. Best Bets Example
      Best Bets are very clear
      Would not come first in normal search results
    • 75. Search Results: List Sorting
      List of links to items matching the query
      Sorted by matching terms
      Impossible to be relevant to every query
      Variety of sources when possible
      Transparency: why these items in this order
      Other sort orders - make very visible
      By author’s last name
      By date
      By price
    • 76. Search Results: Not Enough Variety
    • 77. Search Results: Weird Sort
      Sorted by: “Degrees away”
      Labels too subtle:
      Hidden in header
      Degree icon should be on the left side
    • 79. Result Items: Elements
      Information foraging: show hints about items
      Title of document, or name of product
      Location: URL, file path, database ID
      May need to rewrite to user-accessible URLs
      Hide location if it’s not meaningful
      Distinguishing data
      Metadata: picture, product code, author name
      Show match terms in context (snippets)
      Text before and after query term matches
      Highlight the matches
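
Showing match terms in context can be sketched as below. This assumes the full text is available from the document store, uses `**` as a stand-in for whatever highlight markup the results page uses, and cuts the context by character count, so words at the edges may be clipped. The function name and sample text are hypothetical.

```python
import re

TEXT = "The quarterly budget report covers travel and training costs."


def snippet(text, query_terms, context=30, marker="**"):
    """Return text around the first match, with every match in the excerpt highlighted."""
    if not query_terms:
        return text[:2 * context]
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in query_terms) + r")\b",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return text[:2 * context]  # fallback: the start of the document
    start = max(0, match.start() - context)   # text before the match
    end = min(len(text), match.end() + context)  # text after the match
    excerpt = pattern.sub(lambda m: marker + m.group(0) + marker, text[start:end])
    return ("..." if start > 0 else "") + excerpt + ("..." if end < len(text) else "")


print(snippet(TEXT, ["budget"], context=10))
```

A fuller version would break at word boundaries and prefer the passage containing the most query terms.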
    • 80. Results Items: Not Enough Content
    • 81. Results Items: Too Much Content
    • 82. Results Items: Just Right
    • 83. Results Items: Additional Data
      Date (if reliable)
      Size and File type
      Avoid surprising launches of Acrobat or other app.
      Metadata
      Author, department, brand, product...
      Access status: password required?
      Topics and subject headings
      Taxonomy categories
      Keywords and concept tags
      User tags, folksonomy
    • 84. Results Items: Rich Items Example
    • 85. Results: Dynamic Clustering
      Uses search results text to infer topics
      Groups by similarity in titles and results text
      Particularly good for portals and intranets
      Unstructured, uncontrolled text
      Dynamic, no preprocessing needed
      Can supplement categorization and taxonomies
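      A toy illustration of the idea, not how any commercial clustering engine works: it labels clusters with terms shared by at least two results and assigns each result to the first label it contains. The record fields (`title`, `snippet`), the stopword list, and the function names are assumptions.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in", "on", "with"}

def tokens(text):
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def cluster_results(results, query, max_clusters=3):
    """Group result titles under labels inferred from the results text itself."""
    query_terms = set(tokens(query))
    # Terms per result, excluding the query terms (every result has those).
    doc_terms = [set(tokens(r["title"] + " " + r["snippet"])) - query_terms
                 for r in results]
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    # Candidate labels: terms found in two or more results, most common first,
    # ties broken alphabetically so the output is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    labels = [term for term, count in ranked if count >= 2][:max_clusters]
    clusters = defaultdict(list)
    for result, terms in zip(results, doc_terms):
        label = next((l for l in labels if l in terms), "other")
        clusters[label].append(result["title"])
    return dict(clusters)
```

      Because it works only on the text already in the results list, nothing has to be preprocessed at indexing time.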
    • 86. Results: Clustering Example
    • 87. Commerce and Catalog Results
      Picture or graphic if possible
      Important attributes
      Price
      Color
      Size
      Compatibility
      Availability
      “Buy” button
      Simplify process, save time
    • 88. Online Store Results Example
    • 89. Multimedia Search
      Image, audio, and video files
      Audio and visual similarity search is still mostly theory
      Show context in results
      Match terms from transcript or OCR
      Text around image
      Thumbnails or keyframes
    • 90. Multimedia Results Example
    • 91. Results: Faceted Metadata
      Better than forms for structured text data
      Exposes attributes as part of search results
      Leverages metadata
      Topic names, taxonomy
      Mundane stuff: color, date, size, author...
      Choices specifically relating to search results
      Dynamically generated from metadata
      Preview numbers offer users confidence in clicking
      Supported by extensive usability testing
      Used on a majority of large e-commerce sites
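      The preview numbers can be sketched as a count of metadata values over the current result set. This assumes results are plain records; the function and field names are hypothetical.

```python
from collections import Counter

def facet_counts(results, fields):
    """Count how many current results carry each value of each facet field."""
    counts = {field: Counter() for field in fields}
    for item in results:
        for field in fields:
            value = item.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts

def refine(results, field, value):
    """Narrow the result set when the user clicks a facet value."""
    return [item for item in results if item.get(field) == value]
```

      Since the counts come from the results themselves, a facet value shown to the user never leads to an empty page.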
    • 92. Why Faceted Search is Better Than Forms
    • 93. Faceted Metadata: Commerce Example
    • 94. Faceted Metadata: Library Catalog
    • 95. No Matches Queries: Causes
      Misspellings and typing errors
      Scope problem: nothing for that topic
      Vocabulary differences
      Users may be less precise, or use competitor’s terms
      Marketers may dominate content
      Restrictive search settings
      Default may only match exact phrase or all words
      Access control may disallow user
      Software/hardware/network failures
    • 96. No Matches Queries: Responses
      Track queries with no matches in logs
      Use sessions, surveys & testing to find user intent
      Design the no-matches page carefully
      Explain what is and isn’t on the site
      Provide useful navigation links
      Add search engine help
      Synonyms
      Best Bets
      Spelling
      Add terms to text
      Add content, topic pages
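      Synonyms and Best Bets can be as simple as two hand-maintained lookup tables, as in this sketch. The table contents, the URL path, and the function names are invented for illustration.

```python
# Hypothetical tables, maintained by the search administrator.
SYNONYMS = {"laptop": ["notebook"], "pto": ["vacation", "leave"]}
BEST_BETS = {"holiday schedule": "/hr/holidays.html"}

def normalize(query):
    return " ".join(query.lower().split())

def expand_query(query):
    """Return the query terms plus any configured synonyms."""
    terms = []
    for term in normalize(query).split():
        terms.append(term)
        terms.extend(SYNONYMS.get(term, []))
    return terms

def best_bet(query):
    """Return an editor-chosen link for this exact query, or None."""
    return BEST_BETS.get(normalize(query))
```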
    • 97. No Matches Queries: Spelling Issues
      Detect and address common problems
      Spelling errors
      Typos
      Queries without spaces between words
      Use site-specific dictionary
      Easy to build from search index
      Never suggests any words not on the site
      Users are familiar with “Did you mean...?”
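      A rough sketch of a site-specific suggester, assuming the vocabulary can be pulled from the indexed text. It uses Python's standard `difflib` for fuzzy matching; the function names and the 0.75 cutoff are assumptions.

```python
import difflib

def build_dictionary(indexed_docs):
    """Collect the site's own vocabulary from the indexed text."""
    vocab = set()
    for text in indexed_docs:
        vocab.update(word.strip(".,;:!?").lower() for word in text.split())
    vocab.discard("")
    return vocab

def did_you_mean(query, vocab):
    """Suggest a corrected query built only from words on the site, or None."""
    suggestion, changed = [], False
    for word in query.lower().split():
        if word in vocab:
            suggestion.append(word)
            continue
        # Query typed without spaces: look for a split into two known words.
        split = next(((word[:i], word[i:]) for i in range(1, len(word))
                      if word[:i] in vocab and word[i:] in vocab), None)
        if split:
            suggestion.extend(split)
            changed = True
            continue
        # Spelling error or typo: closest known word, if it is close enough.
        close = difflib.get_close_matches(word, sorted(vocab), n=1, cutoff=0.75)
        if close:
            suggestion.append(close[0])
            changed = True
        else:
            suggestion.append(word)
    return " ".join(suggestion) if changed else None
```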
    • 98. Good Example of No-Matches Page
    • 99. Empty Searches
      Users click or press “enter” in the search box
      Test for this special case
      Should not find all items in the index
      Interaction options:
      Do nothing
      Go to a simple search page
      Show an error dialog
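      Testing for the empty case can be a small guard in front of the engine. In this sketch the handler name and the returned action labels are hypothetical; it picks the "go to a simple search page" option.

```python
def handle_query(raw_query):
    """Decide what to do with a submitted query before calling the engine."""
    query = (raw_query or "").strip()
    if not query:
        # Empty or whitespace-only: never run a match-everything search.
        return {"action": "show_search_page"}
    return {"action": "search", "query": query}
```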
    • 100. Search Engine Maintenance
      Index maintenance
      Obsolete content removal
      Check for new content
      Track technical problems (bad links, servers down)
      Search quality
      Re-run test suite
      Compare with original results
      Add new test queries
      Track user feedback, surveys
      Use metrics and log analysis to catch trends
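      Re-running the test suite and comparing with the original results might look like the sketch below. It assumes the saved baseline and the current run are both dictionaries mapping each test query to an ordered list of result URLs; the names are hypothetical.

```python
def compare_with_baseline(baseline, current, top_n=5):
    """Report test queries whose top results differ from the saved baseline."""
    report = {}
    for query, old in baseline.items():
        old_top = old[:top_n]
        new_top = current.get(query, [])[:top_n]
        if old_top != new_top:
            # A pure reordering is also reported, with empty dropped/added lists.
            report[query] = {
                "dropped": [u for u in old_top if u not in new_top],
                "added": [u for u in new_top if u not in old_top],
            }
    return report
```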
    • 101. Metrics for Search Engines
      Server uptime
      Errors: how often and how serious
      Index
      Size on disk and in memory
      Number of entries
      Number and type of indexing errors
      Search traffic
      Queries per minute (60 qpm is common)
      Average clicks on results items per query
      Average next-page views per query
      Number and percent of no-match queries
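      The search traffic numbers can be computed from per-query log records, as in this sketch. The record fields (`hits`, `clicks`, `next_pages`) are assumptions about what a log might contain, not a real log format.

```python
def search_metrics(log, minutes):
    """Summarize search traffic for a log covering the given number of minutes."""
    total = len(log)
    if total == 0:
        return {"queries_per_minute": 0.0, "avg_clicks_per_query": 0.0,
                "avg_next_pages_per_query": 0.0,
                "no_match_count": 0, "no_match_percent": 0.0}
    no_match = sum(1 for entry in log if entry["hits"] == 0)
    return {
        "queries_per_minute": total / minutes,
        "avg_clicks_per_query": sum(e["clicks"] for e in log) / total,
        "avg_next_pages_per_query": sum(e["next_pages"] for e in log) / total,
        "no_match_count": no_match,
        "no_match_percent": 100.0 * no_match / total,
    }
```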
    • 102. Search Log Analysis
      Most frequent query terms
      Short head: a few very popular terms
      Long tail of unique queries
      Lots of junk: URLs, spam, gibberish
      Frequent query terms not matched - fix somehow
      More esoteric analysis - need a lot of data
      Frequent query terms with low click-through
      Frequent query terms with high “next page” clicks
      Raw logs
      Import into database for ad-hoc reports
      Session analysis can be enlightening
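      A sketch of the simpler reports, under the assumption that each log record holds the query text, the number of hits, and the number of result clicks. The field names and thresholds are hypothetical.

```python
from collections import Counter

def analyze_log(log, min_count=2, low_ctr=0.2):
    """Find frequent queries, frequent unmatched queries, and low click-through."""
    freq, clicked, unmatched = Counter(), Counter(), Counter()
    for entry in log:
        query = " ".join(entry["query"].lower().split())
        freq[query] += 1
        if entry["clicks"] > 0:
            clicked[query] += 1
        if entry["hits"] == 0:
            unmatched[query] += 1
    frequent = [q for q, n in freq.most_common() if n >= min_count]
    return {
        "top_queries": freq.most_common(3),
        # Frequent queries that never matched anything: fix these first.
        "frequent_unmatched": [q for q in frequent if unmatched[q] == freq[q]],
        # Frequent queries that match but rarely get a click.
        "low_click_through": [q for q in frequent
                              if unmatched[q] < freq[q]
                              and clicked[q] / freq[q] <= low_ctr],
    }
```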
    • 103. Choosing a Search Engine
      Find specific information needs
      Analyze content
      Sources and formats
      Rough number of pages / records / items
      Define platform, API, language requirements
      Buy (or use open source), don’t build
      User surveys show problems with home-grown
      Choose & compare likely candidates
      Gathering, indexing, retrieval, relevance features
      Scaling
      Administration tools
      Continuing development, support, user groups
      Price
    • 104. Information Needs Analysis
      What works already?
      Don’t fix what’s not broken
      Where is the real pain?
      Difficult search syntax
      Data silos
      New content not findable
      What requires more complex tools?
      Exploratory search
      Scientific & academic research
      Business intelligence and data mining
      Comprehensive legal discovery
    • 105. Content Inventory
      Work with Information Architects
      Use existing taxonomies and catalogs
      Learn what you have
      Simple static HTML pages
      Other formats: PDF, Office documents (which versions?)
      CMS, document management, publishing systems
      Databases and legacy systems
      Multimedia audio and video files
      Identify more and less valuable data
      Some content should be in archives
    • 106. Search Engine Deployment Types
      Software
      Controlled by local IT
      Flexible installation
      Open-source - several high quality packages
      Search Appliances
      Server hardware/software combinations
      Require very little technical attention
      Check development and backup server pricing
      Remote Search Services (SaaS)
      Index using robot spiders or remote access
      Query goes to service, results go back to user
      Low network, hosting, IT load
    • 107. Scaling Search to Millions & Billions
      What are the largest installations for each?
      Talk to them before committing
      Cache frequent queries
      Add query servers, automated load balancing
      Indexing at scale
      Indexing on dedicated servers
      Deal with new calls for near-real-time indexing
      Distribute multiple clones of indexes
      Segment indexes, parallel lookups, merge results
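      A toy model of segmented lookup with a query cache, using only the Python standard library. The in-memory `SEGMENTS` data stands in for index segments on separate servers, and all names here are hypothetical.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical index segments: each maps a term to (score, doc_id) pairs.
SEGMENTS = [
    {"search": [(0.9, "doc1"), (0.4, "doc3")]},
    {"search": [(0.7, "doc5")], "index": [(0.8, "doc6")]},
    {"search": [(0.95, "doc9"), (0.1, "doc8")]},
]

def lookup(segment, term):
    """Return one segment's matches, best score first."""
    return sorted(segment.get(term, []), reverse=True)

@lru_cache(maxsize=1024)  # cache frequent queries
def search(term, limit=3):
    # Query every segment in parallel, then merge the sorted partial lists.
    with ThreadPoolExecutor(max_workers=len(SEGMENTS)) as pool:
        partials = list(pool.map(lambda seg: lookup(seg, term), SEGMENTS))
    merged = heapq.merge(*partials, reverse=True)
    return tuple(itertools.islice(merged, limit))
```

      A cache like this has to be cleared whenever the index changes, which is one reason near-real-time indexing is harder at scale.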
    • 108. Testing Search Indexing
      Choose 3-4 good candidates
      Index as much content as possible
      Watch the robot, track errors
      Try to index tricky data sources
      Compare coverage among them
      Test index scaling
      Make a really big index based on expected use
      Speed of add / update / delete
      Responsiveness during big update
    • 109. Evaluating Search Results
      Create a query test suite
      Use existing search logs if possible
      Short, long, unusual, common (check cache)
      Simple and complex queries
      Spelling, typing and vocabulary errors
      Many matches, few matches, no matches
      Perform searches against the test engines
      Save results pages as HTML for later checking
      Analyze differences among them
      Retrieval (and indexing): what’s found?
      Relevance: are the top results good ones?
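      One way to analyze differences is to run the same test suite against two candidate engines and measure how much their top results overlap. In this sketch each engine is assumed to be a callable that returns an ordered list of result URLs; the overlap measure (Jaccard) and all names are my choice, not something from the slides.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Jaccard overlap of the top-k results from two engines (1.0 = identical sets)."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # both engines agree that nothing matches
    return len(top_a & top_b) / len(top_a | top_b)

def compare_engines(suite, engine_a, engine_b, k=10):
    """Run each test query on both engines; list the biggest disagreements first."""
    rows = []
    for query in suite:
        a, b = engine_a(query), engine_b(query)
        rows.append({"query": query, "found_a": len(a), "found_b": len(b),
                     "overlap": top_k_overlap(a, b, k)})
    return sorted(rows, key=lambda row: row["overlap"])
```

      Low overlap only flags queries worth a human look; judging which engine's top results are actually better still needs people.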
    • 110. Search: Not a Black Box
      Simple search solves many enterprise problems
      Dynamic access to local content
      Familiar interface, expectations
      User vocabulary
      Understand the real information needs
      Index the right stuff
      Work with content providers and IAs
      Link to specialty research engines
      Learn from users over time, make it better
