Fundamentals Of Search

These slides are from my 2009 Fundamentals of Search workshop at KMWorld. Please contact me for information about search engines, consulting, workshops and training.

Fundamentals Of Search: Presentation Transcript

  • The Fundamentals of Enterprise Search
    KMWorld 2009
    Avi Rappoport, Search Tools Consulting
    www.searchtools.com
    consult9@searchtools.com
    www.searchtools.com/slides/kmw09/fundamentals-of-search.html
  • What’s In This Workshop
    Overview of enterprise search, in context
    Search engine processes
    Robot spiders, database access
    Indexing
    Security
    Query parsing, retrieval, and relevance ranking
    Usable search interfaces
    Maintenance and Analytics
    Methods for choosing a good search engine
    Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com
  • About SearchTools
    Avi Rappoport is a librarian (MLIS from Berkeley)
    Software developer and product manager
    User interface designer
    Long-time search consultant
    Editor & Publisher, www.searchtools.com
    Search Tools Consulting
    Search needs analysis and recommendations
    Enterprise search evaluation
    Outsourced search administration
  • Defining Enterprise Search
    Large scale web site search
    Corporate sites
    Institutional sites
    Online stores
    Intranet search
    Crossing departmental lines
    Opening data silos
    Extranets
    Portal Search
  • Similarities to Webwide Search
    Robot crawlers
    HTML over HTTP
    Scaling to millions of items
    Distributed processing
    Full-text indexing of content
    Simple query language
    Relevance ranking of results
    TF-IDF (term frequency : inverse document frequency)
    Familiar results list
  • Differences from Web Search
    Limited scope
    A site, set of sites, extranet, or intranet
    Few meaningful hyperlinks
    PageRank and link analysis are less useful
    Security and access control issues
    Content in databases, CMSs, etc.
    More control
    Index update scheduling
    Some content is very valuable, other content is not
    No search spam
  • Text Search vs. Database Search
    Indexes multiple content sources
    Database fields, files, web pages, feeds...
    Simple search commands instead of SQL
    Flexible indexing and retrieval
    Relevance ranking (this is a major issue)
    Does not compete for database resources
    Easy to scale separately from DBMS
    New features: spellcheck, auto complete, facets
    Works in the real world, from eBay to Google
  • Search and Information Architecture
    Information Architecture
    The art and science of organizing information for access and use.
    IA work enriches search
    Creates order and systems
    Provides standard vocabulary
    Removes ROT (redundant, obsolete, trivial)
    Search supplements IA
    Supports user vocabularies
    Changes dynamically with new content
  • Search and Taxonomy
    Taxonomy creates categories
    Labels and metadata
    Improves quality of search results
    Additional metadata extremely valuable
    Search crosses categories
    Bypasses ambiguous topic labels
    Useful for novices
    Supports user vocabulary
    Dynamic updates for new topics
  • Search & Knowledge Management
    KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine)
    Organizes information, processes and people
    Offers collaboration and archiving tools
    Attempts to regularize implicit knowledge
    Search mostly matches words
  • Two Main Types of Search
    Known-item search
    Short queries
    “Good-enough” answers
    Exploratory search
    Research - finding unknowns
    Scientific, legal, medical, business, sales
    Conceptual overviews
    Completeness - all possible relevant items
    Law enforcement
    Medicine
  • All people see are the search box and results list
    Invisible functionality
    Indexes
    Query processing
    Retrieval
    Relevance ranking
    Search is a mystery
    But it’s just software
    Search as an Iceberg
  • Elements of Search Engines
    Automated tools to collect content
    Specialized storage for quick retrieval
    Query processing and expansion
    Retrieval (matching query to index content)
    Relevance ranking
    Search results interfaces
    Analytics, metrics and maintenance
  • Choosing Content To Index
    Information sites
    Consider indexing every single page
    Use search indexing as a discovery mechanism
    Online stores, catalogs
    Product information: cost, color, size, materials
    Other: return policies, CEO’s name, jobs listing
    Intranets
    Intranet portal and core servers
    May need archive servers and search
    Multimedia: images, audio, video
    Metadata at least
  • (Near) Real Time Indexing
    Twitter has changed expectations
    Even in intranets
    Index must support partial updates
    Search engines finding limits at scale
    Distribute indexing and indexes
    Trigger index updates (push vs. pull)
    Continuous feed
    Send web service message
    Database trigger
    Update watched URLs with new links
  • Indexing and Security
    Search can undermine “security by obscurity”
    One link can expose a whole set of documents
    Work with your security team
    List areas which contain sensitive content
    Define words which trigger further analysis
    Create a process for removing sensitive data
    Indexing encrypted content
    Search engine uses SSL client for indexing
    Encrypt search results before returning
    Physical security on search servers
  • Search and Access Control
    Authentication and authorization in indexing
    “Basic authentication” - user name and password
    NT Security integration
    ACLs and single sign-on
    Conform to security rules during indexing
    Keep access control info as part of document store
    Showing results - who can see what?
    Access to search engine itself
    Collection-level access control
    Locked results as teaser for subscription
    Hit-level access control
    Check before displaying results
  • Indexing: Sources of Content
    Web sites
    Intranets
    Extranets
    Blogs
    Wikis
    Mailing list archives & email public folders
    File systems & shared servers
    NFS, SMB, AFP, GFS, ftp, WebDAV
    Content Management Systems
    Databases
    Legacy programs in silos
  • Indexing: Robot Spiders
    Start with base URL for all hosts
    For each page, repeat
    Read text into internal format
    Save document in cache
    Save words into index
    Extract all links and check the rules
    If they are new URLs, add them to the list
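The spider loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PAGES` dictionary stands in for real HTTP fetching, the `allowed` rule (stay on one host) is a made-up example of "the rules", and a plain word-to-URL dictionary stands in for a real index.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
PAGES = {
    "http://example.com/": ("Welcome home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A text", ["http://example.com/b", "http://other.example.org/x"]),
    "http://example.com/b": ("Page B text", ["http://example.com/"]),
}

def allowed(url):
    """Example crawl rule (an assumption): stay on the starting host."""
    return url.startswith("http://example.com/")

def crawl(base_url, fetch=PAGES.get):
    """Breadth-first crawl; returns (cache, index)."""
    queue = deque([base_url])
    seen = {base_url}          # URLs already queued, so each is visited once
    cache = {}                 # URL -> saved document text
    index = {}                 # word -> set of URLs containing it
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:       # missing or unreachable page
            continue
        text, links = page
        cache[url] = text                          # save document in cache
        for word in text.lower().split():          # save words into index
            index.setdefault(word, set()).add(url)
        for link in links:                         # extract links, check rules
            if allowed(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return cache, index
```

The `seen` set is what keeps the spider from looping forever on pages that link back to each other.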
  • Robot Indexing Spider
  • Common Problems With Robots
    Pages that are not linked from anywhere
    Spider disallowed by robots.txt or robots meta
    URLs with ? and & (all should do these now)
    JavaScript, forms, and interactive dynamic links
    Some robots can handle some of these
    Session IDs that change
    Duplicate detection
    Multiple views of the same data (Lotus, wikis)
    Symbolic links & bad redirects
    Multiple copies of files or directories
  • Indexing: Other Data Sources
    RSS feeds: nice clean text
    File servers: SMB, file:/// etc.
    Content / Document Management Systems
    Email archives
    Databases via ODBC, JDBC, Oracle API
    Full-text content
    Metadata: library catalog records, yellow pages
    External sources using APIs
    (Application programming interfaces)
    News feeds (Reuters, AP)
    Twitter
  • Indexing: Text Files
    Plain text is easy
    RTF export format: text is easy to find
    HTML semi-structured text
    Content is between tags and in attributes
    Generated by JavaScript - hard to extract
    Bad HTML, especially missing close tags
    XML files (structured)
    Many tags are document-level
    Content is between tags and in attributes
    Complex tag hierarchy
    TEI (Text Encoding Initiative) & Semantic Web
    XQuery and XPath tools
  • Indexing: Binary File Formats
    PDF
    Scanned, may not have any text
    Bad PDF generators break words at columns
    “Shadow” text effect duplicates letters
    SWF and Flash: API may not load dynamic text
    Office documents
    Word processing files (may have hidden text from revisions)
    Spreadsheets (hard to know what to grab)
    Presentations
    Note: new docx, xlsx, pptx are really XML file sets
    CAD and project files
    Metadata (properties, Adobe XMP)
  • Indexing: Tokenizing
    Lowercase all characters (aka ‘folding’)
    Tokenizing makes words searchable
    Break on punctuation and spaces
    Recognize special words: C++ @ [TS]
    Typography issues: the ligature ﬆ is really “st”
    HTML escaped text: m&ouml;chten = möchten
    Special cases for structured strings
    Numbers, Prices, Dates
    N-grams - an alternate approach
    Break into short text patterns
    Takes a lot of index space
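A minimal tokenizer sketch for the steps above, assuming English-like text. The `SPECIAL` list is an illustrative assumption (real engines use much longer, configurable lists), and `char_ngrams` shows the alternate n-gram approach.

```python
import re

# Special words to keep whole (an illustrative list, not from the slides).
SPECIAL = ["c++", "c#"]

# Try the special words first, then fall back to runs of word characters.
TOKEN_RE = re.compile("|".join(re.escape(s) for s in SPECIAL) + r"|\w+")

def tokenize(text):
    """Lowercase ('fold') the text, then break on punctuation and spaces."""
    return TOKEN_RE.findall(text.lower())

def char_ngrams(word, n=3):
    """Break a word into overlapping n-character patterns."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Every word produces several n-grams, which is why the n-gram approach takes so much more index space.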
  • Indexing: Character Set Issues
    World has many charsets (aka scripts, alphabets)
    English has a simple alphabet: 26 letters, 10 numbers
    Other Roman languages: extended (ç, î, ß)
    Non-Roman one byte: Cyrillic, Arabic, Hebrew
    Asian two bytes: Chinese, Japanese, Korean
    Identifying character sets
    Unicode characters
    Older usage: language “code pages”
    HTTP header or <META http-equiv>
    Statistical detection techniques
  • Indexing: Language Issues
    Text search works across languages
    Simple pattern-matching, query to index
    Language-specific indexing improves search
    Tokenizing using appropriate rules
    Compound nouns (kindergarten)
    Language rules for stemming
    Singular version of thés is thé
    Language detection
    Trusted tags
    Bilingual dictionaries
    Statistical matches, n-grams
    Documents may have mixed languages…
  • Indexing: Multimedia
    Images, photos, drawings, sound, scores, video
    External metadata
    File name
    Link text, surrounding words
    Internal metadata
    ID3 tags for music
    EXIF and other digital photo information
    Subtitles (sometimes)
    Content
    OCR to extract graphic text and closed captions
    Audio: Speech-to-text conversion, still buggy
    Use human judgment not just automated systems
  • Inverted Index Diagram
    • Inverted indexes work well
    • Lots of IR research shows this
    • Better than DBMS
    • Alphabetical list of tokens
    • Tokens not in paragraph order, thus, inverted
    • Each token has ID of source
  • Richer Index Structures
    Store word position (for phrase matching)
    Enclosing tag or field
    Document metadata
    Database field names
    Image (which attribute)
    Named anchor text
    Text markup tags (TEI, Semantic Web)
    Extracted entities
    Personal names, companies, geo locations, dates
    Anchor text from incoming links
    Can be very descriptive
    Add to index as if part of the target document
  • Example Inverted Index Structure
    For each word
    Document ID
    Position
    Tag name
    For each document
    ID
    Title
    URL
    Description
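The structure above might look like this in Python. It is a sketch, not any particular engine's format: the field names `title`, `url`, and `body` and the tuple layout of each posting are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index from a document store.

    docs maps a document ID to a dict with "title", "url" and "body".
    Each posting is (document ID, position, tag name).
    """
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        for tag in ("title", "body"):
            for position, word in enumerate(doc[tag].lower().split()):
                index[word].append((doc_id, position, tag))
    return index

# Example document store: ID -> title, URL, and text.
DOCS = {
    1: {"title": "Lunch menu", "url": "/lunch", "body": "soup and salad"},
    2: {"title": "Dinner menu", "url": "/dinner", "body": "soup and steak"},
}
INDEX = build_index(DOCS)
```

Looking up `INDEX["soup"]` returns the postings for both documents, and the document IDs in those postings are the keys back into the document store for titles and URLs.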
  • Indexing: Stopwords
    Stopwords - very common terms
    Linguistic (a an the as he she it you new)
    Ubiquitous (names, copyright, click here)
    Consequences of excluding stopwords:
    Reduces the size of index files
    Improves recall, finds more matching documents
    Fails some queries
    As You Like It, IT copyright policy
    Problems matching phrases: “New York University”
    Solutions vary:
    Index everything, pay the price in index size
    CommonGrams: n-grams of frequent phrases
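A small sketch of why excluding stopwords fails some queries. The stopword set is the linguistic list from the slide; the `strip_stopwords` helper is a hypothetical name for this example.

```python
# Linguistic stopwords as listed on the slide.
STOPWORDS = {"a", "an", "the", "as", "he", "she", "it", "you", "new"}

def strip_stopwords(query):
    """Return the query words that survive stopword removal."""
    return [word for word in query.lower().split() if word not in STOPWORDS]
```

With this list, "As You Like It" is reduced to the single word "like", and "New York University" loses "new", so the phrase can no longer be matched exactly.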
  • Stopwords Problems: Example
    Searching wordpress.com for whatever will be
    • Finds all matches for whatever (stopwords ignored)
    • Useless results ranking
    • No matches for will be
    • One ad gets it right
    • External search finds over 3,000 pages on site with phrase
  • Indexing: Stemming
    Singular query should find plural words & vice versa
    Shoe <=> shoes, cans <=> can, geese <=> goose
    Statistical and probabilistic truncation rules
    Linguistic rules
    Lemmatization - stemming based on part of speech
    Stemming before indexing
    Improve recall: find all forms of a word
    Reduce index size
    Consequences of extreme stemming
    Short query problems
    Search for Ran shouldn’t match Run, Lola, Run
    Other options
    Index everything (makes indexes larger and queries slower)
    New idea: CommonGrams (n-grams of frequent phrases)
  • Indexing: Document Store
    Minimum
    ID (key for inverted index)
    Unique location (URL / file path / record ID)
    Richer document store
    Implicit metadata: filename, size, location
    Explicit metadata
    Title, date, keywords, author
    Taxonomy labels, classification, user tagging
    Language, character set
    Access control settings
    Full text of the document
    For snippets and caching
  • Indexing: Dealing with Duplicates
    Detecting duplicate documents
    Exact match is fairly easy: checksums
    Document similarity check: harder but worth it
    Choosing the primary copy
    Most recent (if reliable)
    Rules based on path or metadata
    New web search “canonical” tag
    What to do with duplicates
    Remove from the index: saves space
    Hide in results unless requested
    That’s the Google way
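Both kinds of duplicate detection can be sketched briefly. The exact-match check uses a SHA-256 checksum from the standard library. The similarity check uses word shingles with Jaccard overlap, which is one common technique chosen here for illustration; the slides do not say which similarity method to use.

```python
import hashlib

def checksum(text):
    """Exact-duplicate check: identical text gives an identical digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard overlap of shingle sets: 1.0 = same shingles, 0.0 = none shared."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A checksum changes completely when a single character differs, so near-duplicates (a changed date or footer) need the similarity check with a threshold tuned to the collection.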
  • Indexing: Document Dates
    HTTP servers lie about dates
    Frequent wrong settings: 1969, 2040
    Dynamic pages send the current timestamp
    File systems lie about dates
    Applications lie about dates
    Indexers do the best they can
    Metadata (date tag, property, tag DC.date)
    Extract from page content
    Checksum to see if file has changed since last index
    Consider external metadata repository
  • Search Process Flow
  • Where the Queries Come From
    User-entered text in search fields
    Search navigation: moving around in results list
    Previous searches
    May just be repeated clicks on URL
    Save Search feature
    Simplistic alerts
    Facet click to add a metadata filter
    May re-issue search with additional terms
    May be navigational, no text query
    Scripts or automated queries
    Dynamic links (find all pictures by this artist)
    Geographic information systems
  • Query Processing Steps
    Try to recognize the character set and language
    Tokenize the text by language rules
    Break at spaces and punctuation
    Same algorithm as index tokenizer
    Check for operators
    Internet Query Operators: + - "quotes"
    Boolean Operators: AND OR NOT & | !
    Others: NEAR, (parentheses)
    Check for field names, zones, other filters
    Example: title:lunch location=94703
    Handle the rare natural language question
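A rough sketch of the operator and field checks, using a regular expression. It only handles the `+`, `-`, quoted-phrase, and `field:value` forms; Boolean keywords, parentheses, and the `location=94703` style are left out, and the clause dictionary layout is an assumption for this example.

```python
import re

QUERY_RE = re.compile(r'''
    (?P<op>[+-])?                          # optional + or - operator
    (?:(?P<field>\w+):)?                   # optional field name, e.g. title:
    (?:"(?P<phrase>[^"]*)"|(?P<term>\S+))  # quoted phrase or bare term
''', re.VERBOSE)

def parse_query(query):
    """Split a query string into clauses with operator, field and text."""
    clauses = []
    for match in QUERY_RE.finditer(query):
        is_phrase = match.group("phrase") is not None
        text = match.group("phrase") if is_phrase else match.group("term")
        clauses.append({
            "op": match.group("op") or "",   # "+", "-" or "" for none
            "field": match.group("field"),   # None when no field was given
            "text": text.lower(),            # folded like the index tokens
            "phrase": is_phrase,
        })
    return clauses
```

For example, `parse_query('title:lunch +"new york" -pizza')` yields three clauses: a fielded term, a required phrase, and an excluded term.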
  • Query Expansion
    Stemming
    Dependent on index stemming choices
    Good to find singular/plural forms
    Word similarity searching - increases recall
    Fuzzy matching
    Phonetic, soundex, sound-alike
    May overwhelm exact matches
    Synonym expansion, should be site-specific
    bus => coach, ATM => Air Tasking Message
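Synonym expansion can be as simple as a site-specific lookup table. In this sketch the table holds the two examples from the slide; the `expand` function and its output shape (one list of alternatives per query term) are assumptions.

```python
# Site-specific synonym table (entries taken from the slide's examples).
SYNONYMS = {
    "bus": ["coach"],
    "atm": ["air tasking message"],
}

def expand(terms):
    """For each query term, return the term followed by its synonyms."""
    return [[term] + SYNONYMS.get(term, []) for term in terms]
```

Each inner list can then be searched as an OR group, with the original term weighted above its synonyms so exact matches are not overwhelmed.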
  • Search: Retrieval, Recall & Precision
    Retrieval
    Finding the documents matching a particular query
    Recall
    Finding every relevant document
    Precision
    Finding only relevant documents
    Balance more recall vs. better precision
    Use search logs and user studies to guide choices
    Use precision as part of relevance ranking
    Top results should be more exact matches
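Recall and precision have standard definitions that are easy to compute once someone has judged which documents are relevant. A minimal sketch, with document IDs as plain values:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a query retrieves four documents, two of them relevant, out of eight relevant documents in the collection, precision is 0.5 and recall is 0.25.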
  • One-Word Text Retrieval
    Fast binary search in inverted index
    Check index updates on disk or in memory
    If there are distributed indexes, merge results
    Store the related document information in a list
    Document ID
    Term frequency in document
    Term positions in the document
    Note: The document list is not yet sorted
    Frequent searches may be cached
    “Short head” vs. “long tail”
  • Multi-Word Text Retrieval
    Relationship between words defines results
    Boolean AND, + operator, find all default
    Only documents which contain all terms
    Boolean OR operator, find any default
    All documents with any term
    Boolean NOT, - operator
    All documents with the first term but not next term
    Phrase operators, quotes
    Only documents with the words as a phrase
    Also check for zones or field filters
    Parentheses: use for order of processing
    Merge resulting lists
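The operators above map naturally onto set operations over per-term document lists, with stored word positions used for phrases. This is a sketch over a tiny in-memory index; the sample documents and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample documents: ID -> text.
DOCS = {
    1: "new york university library",
    2: "york has a new university",
    3: "new library hours",
}

def build_positions(docs):
    """word -> {doc ID -> list of word positions}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split()):
            index[word][doc_id].append(position)
    return index

INDEX = build_positions(DOCS)

def docs_with(term):
    """IDs of documents containing the term."""
    return set(INDEX[term]) if term in INDEX else set()

def match_all(terms):              # Boolean AND, + operator
    return set.intersection(*(docs_with(t) for t in terms))

def match_any(terms):              # Boolean OR
    return set.union(*(docs_with(t) for t in terms))

def match_not(include, exclude):   # Boolean NOT, - operator
    return docs_with(include) - docs_with(exclude)

def match_phrase(words):           # quoted phrase
    """Documents where the words appear consecutively, in order."""
    results = set()
    for doc_id in match_all(words):
        starts = INDEX[words[0]][doc_id]
        if any(all(start + i in INDEX[w][doc_id] for i, w in enumerate(words))
               for start in starts):
            results.add(doc_id)
    return results
```

The phrase check only runs on documents that already passed the AND test, which keeps the more expensive position comparison to a small candidate set.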
  • Relevance Ranking Algorithms
    Relevance
    The likelihood that an item will fill an information need
    Based on documents in retrieval list
    Most common algorithm: TF-IDF
    (Term frequency : inverse document frequency)
    How often is the query word in the document?
    How often is the word in the index?
    Other relevance algorithms
    Vectors and document-query similarity
    Linguistic analysis and Natural Language Processing
    Statistical and Bayesian analysis
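TF-IDF has many variants. The sketch below uses one simple form, assumed for illustration: term frequency is the term count divided by document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number containing the term. Real engines add smoothing and length normalization.

```python
import math
from collections import Counter

def rank_tf_idf(query_terms, docs):
    """Rank documents by a simple TF-IDF sum.

    docs maps a document ID to its list of tokens. Only documents that
    contain at least one query term are returned, best score first.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))   # count each term once per document

    ranked = []
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        matched = [t for t in query_terms if counts[t] > 0]
        if not matched:
            continue
        score = sum(
            (counts[t] / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t in matched
        )
        ranked.append((doc_id, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

A word that appears in every document gets an IDF of zero here, which is the formula's way of saying that very common words do not help distinguish one document from another.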
  • Relevance Heuristics
    Phrase matches for multiple query terms
    Logs show most multi-word searches are phrases
    Query terms found in special sections
    Title
    Metadata
    Top of document
    All terms matched in document
    Even when not relevant, it’s transparent
    Old systems gave excess weight to single rare terms
  • More on Relevance
    Relevance is task-specific
    Results can never please all of the people
    More like berry-picking than like hunting
    Link analysis (PageRank) not very useful
    Intranet and site links tend to be navigational
    Situation-specific adjustments
    Some areas more likely to be valuable
    Current content
    Local content
  • Federated Search and Relevance
    Send query to multiple search engines
    May require special syntax
    Response time often a factor
    Receive results in relevance order for each
    Display results, two options
    Separate sections for each search engine
    Merged single relevance rank list
    Works if all search indexes are similar
    Problems where the sources are very different
  • Retrieval: Access Control
    Limit access to search itself
    User enters password or other credentials
    Search only accepts queries when authenticated
    Collection-level access control
    Query filter only retrieves items from allowed groups
    Hit-level access control
    Real-time check for user access on documents
    Start with most relevant documents
    Repeat until there are ten (may be slow)
    Display top results, include estimate of how many more
    Show helpful message if user can’t see any
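Hit-level access control can be sketched as a loop over the ranked list that stops once a page is full. The `can_view` callback is a placeholder for whatever real-time permission check the content system provides; its signature is an assumption.

```python
def visible_results(ranked_doc_ids, user, can_view, page_size=10):
    """Walk the ranked list, keeping only documents the user may see.

    Returns (page, unchecked), where unchecked is the number of ranked
    documents not yet tested: an upper bound on how many more the user
    might be able to see.
    """
    page = []
    checked = 0
    for doc_id in ranked_doc_ids:      # most relevant documents first
        checked += 1
        if can_view(user, doc_id):     # real-time access check
            page.append(doc_id)
            if len(page) == page_size:
                break
    return page, len(ranked_doc_ids) - checked
```

When the returned page is empty, that is the cue to show a helpful message rather than a bare "no results" page. The loop can be slow when a user has access to very little, because every skipped hit costs one permission check.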
  • Search User Experience
    Limit user interface complexity
    Show the scope of the information covered
    Expose query expansion and contraction
    Use familiar UI elements
    User experience goes beyond interface
    Index coverage
    Query syntax
    Retrieval quality and speed
    Relevance ranking (first ten are vital)
  • Search Forms Interface
    Balance simplicity with functionality
    Put a search field in the navigation bar
    Location should be consistent
    Longer is better: short fields lead to short queries
    Simple Search forms: limit options
    Zone or section
    Dates
  • Search Field Auto-Complete
    Dropdown menu of matching words
    Base on search logs
    Smallish list, 7-10
    Most popular
    Simple sort
    Alphabetic
    Price or size
    Complete range (preferably lowest to highest)
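An auto-complete list based on search logs can be sketched as a prefix match over past queries, most popular first. The log format (a plain list of query strings) is an assumption.

```python
from collections import Counter

def suggest(prefix, query_log, limit=7):
    """Return up to `limit` logged queries starting with the prefix.

    Sorted by popularity, with ties broken alphabetically.
    """
    counts = Counter(query.strip().lower() for query in query_log)
    prefix = prefix.lower()
    matches = [(query, n) for query, n in counts.items() if query.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:limit]]
```

Counting the whole log on every keystroke is fine for a demonstration; a production version would precompute the counts and use a prefix structure.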
  • Other Search Interfaces
    Heavily researched
    Natural language
    Must keep typing
    Defining a question is quite hard
    Interactive search
    Guided interviews
    But users want immediate results
    Avatars
    do not improve interaction
  • Simple vs. Advanced Search UI
    Most searches are simple
    Short: one to three words
    Fewer than 10% use any operators at all (maybe 1%)
    Even experts prefer simple search
    Will use advanced tools if simple doesn’t work
    Default to simple search, link to advanced search
    Those are your power users: librarians, techies
    Expose all possible options
    Don’t spend huge resources on advanced UI
    Exploratory search is different
  • Advanced Search Fits Sometimes
    eBay
    High motivation
    Complex search requirements
    Frequent use
    UX testing still required
  • Search Results: Page Elements
    Site context
    General page layout, navigation links
    Colors and design elements
    Results header
    A search field, with the current search terms
    Retrieval information - how many hits
    Results list in relevance order
    Each result item with at least a linked title
    Facets: dynamic links for filtering results
    Results footer
  • Search Results: Good Example
    Full but readable
    • white space
    • content blocks
    Site look-and-feel
    Navigation
    Familiar search results elements
  • Search Results: Not-So-Good Example
    Site page has navigation, colors: search results should too
  • Search Results: Visualization
    Fascinating to look at, great demos
    Star charts
    Topographical displays
    Interactive fly-throughs
    Hyperbolic trees
    Require significant resources to run
    Good for exploratory & comprehensive research
    Finding unexpected synergies
    Simple search is much cheaper for casual users
  • Search Results: Header Elements
    Search field, with the current query
    Users often edit to be more or less restrictive
    Number of results found
    A few search options
    Match Any Word / All Words / Exact Phrase
    Filter by date option (if trustworthy)
    Search zones
    Results navigation
    Best Bets
    Spelling suggestions
  • Search Results: Hits and Pages
    Show number of items matched
    Be accurate
    Do not give estimates for small numbers
    (Google and SharePoint are bad this way)
    Pagination - results list navigation
    Helps user calibrate content
    Important for exploratory search
    Follow web search conventions, example
    < previous 1 2 3 4 ... 26 next >
    Be accurate
  • Results Headers: Examples
  • Search Results: “Best Bets”
    aka Search Suggestions, QuickLinks, KeyMatch, Recommendations
    Special-case links for problem queries
    Internal topic landing pages
    External sites when appropriate
    New and better query to search
    • Only implement for very frequent queries
    Discover problems from users, log analysis
    “Short head” - few very popular query terms
    Allocate resources to keep them current
    • Good search results are higher priority
  • Best Bets Example
    Best Bets are very clear
    Would not come first in normal search results
  • Search Results: List Sorting
    List of links to items matching the query
    Sorted by matching terms
    Impossible to be relevant to every query
    Variety of sources when possible
    Transparency: why these items in this order
    Other sort orders - make very visible
    By author’s last name
    By date
    By price
  • Search Results: Not Enough Variety
  • Search Results: Weird Sort
    Sorted by: “Degrees away”
    Labels too subtle:
    • Hidden in header
    • Degree icon should be on the left side
  • Result Items: Elements
    Information foraging: show hints about items
    Title of document, or name of product
    Location: URL, file path, database ID
    May need to rewrite to user-accessible URLs
    Hide location if it’s not meaningful
    Distinguishing data
    Metadata: picture, product code, author name
    Show match terms in context (snippets)
    Text before and after query term matches
    Highlight the matches
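Showing match terms in context can be sketched as a small snippet function. The `**` markers stand in for whatever highlighting the results page uses, and the fixed character `width` is an assumption; real engines usually cut at word or sentence boundaries.

```python
import re

def snippet(text, term, width=30):
    """Return text around the first match of term, with the match marked.

    Returns None when the term does not occur in the text.
    """
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    excerpt = (text[start:match.start()]
               + "**" + match.group(0) + "**"
               + text[match.end():end])
    # Ellipses show that text was cut off before or after the excerpt.
    return ("..." if start > 0 else "") + excerpt + ("..." if end < len(text) else "")
```

The match is shown with its original capitalization, since the snippet is taken from the stored document text rather than from the lowercased index.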
  • Results Items: Not Enough Content
  • Results Items: Too Much Content
  • Results Items: Just Right
  • Results Items: Additional Data
    Date (if reliable)
    Size and File type
    Avoid surprising launches of Acrobat or other app.
    Metadata
    Author, department, brand, product...
    Access status: password required?
    Topics and subject headings
    Taxonomy categories
    Keywords and concept tags
    User tags, folksonomy
  • Results Items: Rich Items Example
  • Results: Dynamic Clustering
    Uses search results text to infer topics
    Groups by similarity in titles and results text
    Particularly good for portals and intranets
    Unstructured, uncontrolled text
    Dynamic, no preprocessing needed
    Can supplement categorization and taxonomies
  • Results: Clustering Example
  • Commerce and Catalog Results
    Picture or graphic if possible
    Important attributes
    Price
    Color
    Size
    Compatibility
    Availability
    “Buy” button
    Simplify process, save time
  • Online Store Results Example
  • Multimedia Search
    Image, audio, and video files
    Audio and visual similarity search still theory
    Show context in results
    Match terms from transcript or OCR
    Text around image
    Thumbnails or keyframes
  • Multimedia Results Example
  • Results: Faceted Metadata
    Better than forms for structured text data
    Exposes attributes as part of search results
    Leverages metadata
    Topic names, taxonomy
    Mundane stuff: color, date, size, author...
    Choices specifically relating to search results
    Dynamically generated from metadata
    Preview numbers offer users confidence in clicking
    Supported by extensive usability testing
    Used on a majority of large e-commerce sites
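A minimal sketch of how facet choices and their preview numbers could be generated from the metadata of the current results. The field names (`color`, `author`) and the dictionary-per-item shape are hypothetical; real engines usually compute these counts inside the index.

```python
from collections import Counter

def facet_counts(results, facet_fields):
    """For each facet field, count how many matching items carry each value.
    These counts are the preview numbers shown next to each choice."""
    counts = {field: Counter() for field in facet_fields}
    for item in results:
        for field in facet_fields:
            value = item.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts

def narrow(results, field, value):
    """Apply one facet choice: keep only items with that metadata value."""
    return [item for item in results if item.get(field) == value]
```

Since the choices come only from items that matched the query, a user can never click a facet value that leads to zero results.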
  • Why Faceted Search is Better Than Forms
  • Faceted Metadata: Commerce Example
  • Faceted Metadata: Library Catalog
  • No Matches Queries: Causes
    Misspellings and typing errors
    Scope problem: nothing for that topic
    Vocabulary differences
    Users may be less precise, or use competitor’s terms
    Marketers may dominate content
    Restrictive search settings
    Default may only match exact phrase or all words
    Access control may disallow user
    Software/hardware/network failures
  • No Matches Queries: Responses
    Track queries with no matches in logs
    Use sessions, surveys & testing to find user intent
    Design the no-matches page carefully
    Explain what is and isn’t on the site
    Provide useful navigation links
    Add search engine help
    Synonyms
    Best Bets
    Spelling
    Add terms to text
    Add content, topic pages
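A minimal sketch of two of the search engine helps listed above, synonyms and Best Bets. The tables, terms, and URL path below are invented for illustration; in practice they would be built from the no-match queries found in the logs.

```python
# Hypothetical, hand-maintained tables (all entries are illustrative).
SYNONYMS = {"vacation": ["pto", "leave"], "laptop": ["notebook"]}
BEST_BETS = {"401k": "/benefits/retirement"}

def expand_query(query, synonyms):
    """Return the query terms plus any synonyms, so user vocabulary
    can match the words the content actually uses."""
    terms = []
    for term in query.lower().split():
        terms.append(term)
        terms.extend(synonyms.get(term, []))
    return terms

def best_bet(query, best_bets):
    """Return an editor-chosen link for a known query, or None."""
    return best_bets.get(" ".join(query.lower().split()))
```

For example, `expand_query("Vacation policy", SYNONYMS)` yields the original terms plus "pto" and "leave", which can then be sent to the engine as alternatives.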
  • No Matches Queries: Spelling Issues
    Detect and address common problems
    Spelling errors
    Typos
    Queries without spaces between words
    Use site-specific dictionary
    Easy to build from search index
    Never suggests any words not on the site
    Users are familiar with “Did you mean...?”
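A minimal sketch of a site-specific dictionary built from indexed text, used for spelling suggestions and for queries typed without spaces. It uses Python's standard `difflib` as a stand-in for an engine's own spelling module; the cutoff value and the two-word split are assumptions.

```python
import difflib
import re

def build_dictionary(indexed_texts):
    """Collect every word that appears in the indexed content."""
    words = set()
    for text in indexed_texts:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words

def suggest(term, dictionary, cutoff=0.75):
    """Return the closest site word for a misspelled term, or None.
    Only words from the site dictionary can ever be suggested."""
    term = term.lower()
    if term in dictionary:
        return None  # already a known word, nothing to suggest
    matches = difflib.get_close_matches(term, sorted(dictionary), n=1, cutoff=cutoff)
    return matches[0] if matches else None

def split_run_together(term, dictionary):
    """Try to split a query typed without spaces into two site words."""
    term = term.lower()
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        if left in dictionary and right in dictionary:
            return left + " " + right
    return None
```

Because the dictionary comes from the index itself, a suggestion always leads to at least one matching page.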
  • Good Example of No-Matches Page
  • Empty Searches
    Users click or press “enter” in the search box
    Test for this special case
    Should not find all items in the index
    Interaction options:
    Do nothing
    Go to a simple search page
    Show an error dialog
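A minimal sketch of testing for the empty-search special case before the query reaches the engine. `handle_query` and the `search_fn` callback are hypothetical names; the "prompt" outcome stands for whichever interaction option is chosen (do nothing, go to a simple search page, or show an error).

```python
def handle_query(raw_query, search_fn):
    """Return ("prompt", None) for an empty query instead of searching,
    so an empty box never matches every item in the index."""
    query = (raw_query or "").strip()
    if not query:
        return ("prompt", None)
    return ("results", search_fn(query))
```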
  • Search Engine Maintenance
    Index maintenance
    Obsolete content removal
    Check for new content
    Track technical problems (bad links, servers down)
    Search quality
    Re-run test suite
    Compare with original results
    Add new test queries
    Track user feedback, surveys
    Use metrics and log analysis to catch trends
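A minimal sketch of re-running a test suite and comparing with the original results. It assumes the baseline and the current run are both stored as a mapping from query to an ordered list of result URLs or IDs; that storage format and the function name are assumptions.

```python
def regression_report(baseline, current, k=5):
    """List the test queries whose top-k results differ from the baseline.
    A query that only changes order is still reported, with empty
    'dropped' and 'added' lists."""
    changed = {}
    for query, old in baseline.items():
        old_top = old[:k]
        new_top = current.get(query, [])[:k]
        if old_top != new_top:
            changed[query] = {
                "dropped": [d for d in old_top if d not in new_top],
                "added": [d for d in new_top if d not in old_top],
            }
    return changed
```

A changed query is not necessarily worse; the report only points a person at the results pages worth checking.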
  • Metrics for Search Engines
    Server uptime
    Errors: how often and how serious
    Index
    Size on disk and in memory
    Number of entries
    Number and type of indexing errors
    Search traffic
    Queries per minute (60 qpm is common)
    Average clicks on results items per query
    Average next-page views per query
    Number and percent of no-match queries
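A minimal sketch of computing the search traffic metrics above from a query log. The log format is an assumption: one dictionary per query with hypothetical `time`, `matches`, `clicks`, and `next_pages` fields.

```python
from datetime import datetime

def traffic_metrics(log):
    """Summarize search traffic from a list of per-query log entries."""
    if not log:
        return {}
    n = len(log)
    times = [entry["time"] for entry in log]
    # Treat very short logs as one minute to avoid dividing by zero.
    minutes = max((max(times) - min(times)).total_seconds() / 60.0, 1.0)
    no_match = sum(1 for entry in log if entry["matches"] == 0)
    return {
        "queries_per_minute": n / minutes,
        "avg_clicks_per_query": sum(e["clicks"] for e in log) / n,
        "avg_next_pages_per_query": sum(e["next_pages"] for e in log) / n,
        "no_match_count": no_match,
        "no_match_percent": 100.0 * no_match / n,
    }
```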
  • Search Log Analysis
    Most frequent query terms
    Short head: a few very popular terms
    Long tail of unique queries
    Lots of junk: URLs, spam, gibberish
    Frequent query terms not matched - fix somehow
    More esoteric analysis - need a lot of data
    Frequent query terms with low click-through
    Frequent query terms with high “next page” clicks
    Raw logs
    Import into database for ad-hoc reports
    Session analysis can be enlightening
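A minimal sketch of the log reports described above: most frequent queries, frequent queries with no matches, and frequent queries with low click-through. The entry fields (`query`, `matches`, `clicks`) and both thresholds are assumptions, not values from the slides.

```python
from collections import Counter

def normalize(query):
    """Lowercase and collapse whitespace so variants count together."""
    return " ".join(query.lower().split())

def analyze_queries(log, min_count=2, max_click_rate=0.2):
    """Report top queries, frequent no-match queries, and frequent
    queries that found matches but were rarely clicked."""
    counts, clicked, unmatched = Counter(), Counter(), Counter()
    for entry in log:
        q = normalize(entry["query"])
        counts[q] += 1
        if entry["matches"] == 0:
            unmatched[q] += 1
        if entry["clicks"] > 0:
            clicked[q] += 1
    frequent = [q for q, c in counts.most_common() if c >= min_count]
    return {
        "top_queries": counts.most_common(10),
        "frequent_no_match": [q for q in frequent if unmatched[q] == counts[q]],
        "frequent_low_click": [
            q for q in frequent
            if unmatched[q] < counts[q] and clicked[q] / counts[q] <= max_click_rate
        ],
    }
```

The low click-through list needs a lot of data before it means much, as noted above; with small logs, start with the no-match list.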
  • Choosing a Search Engine
    Find specific information needs
    Analyze content
    Sources and formats
    Rough number of pages / records / items
    Define platform, API, language requirements
    Buy (or use open source), don’t build
    User surveys show problems with home-grown
    Choose & compare likely candidates
    Gathering, indexing, retrieval, relevance features
    Scaling
    Administration tools
    Continuing development, support, user groups
    Price
  • Information Needs Analysis
    What works already?
    Don’t fix what’s not broken
    Where is the real pain?
    Difficult search syntax
    Data silos
    New content not findable
    What requires more complex tools?
    Exploratory search
    Scientific & academic research
    Business intelligence and data mining
    Comprehensive legal discovery
  • Content Inventory
    Work with Information Architects
    Use existing taxonomies and catalogs
    Learn what you have
    Simple static HTML pages
    Other formats: PDF, Office documents (which versions?)
    CMS, document management, publishing systems
    Databases and legacy systems
    Multimedia audio and video files
    Identify more and less valuable data
    Some content should be in archives
  • Search Engine Deployment Types
    Software
    Controlled by local IT
    Flexible installation
    Open-source - several high quality packages
    Search Appliances
    Server hardware/software combinations
    Require very little technical attention
    Check development and backup server pricing
    Remote Search Services (SaaS)
    Index using robot spiders or remote access
    Query goes to service, results go back to user
    Low network, hosting, IT load
  • Scaling Search to Millions & Billions
    What are the largest installations for each?
    Talk to them before committing
    Cache frequent queries
    Add query servers, automated load balancing
    Indexing at scale
    Indexing on dedicated servers
    Deal with new calls for near-real-time indexing
    Distribute multiple clones of indexes
    Segment indexes, parallel lookups, merge results
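A minimal sketch of two of the scaling techniques above: caching frequent queries, and searching index segments in parallel before merging the results. The in-memory `SEGMENTS` list and the term-count scoring are toy stand-ins (assumptions) for real index shards and relevance ranking.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical index segments: each maps a document ID to its text.
SEGMENTS = [
    {"a": "search engine search"},
    {"b": "search appliance"},
    {"c": "taxonomy"},
]

def search_segment(segment, query, k):
    """Score documents in one segment by term count; return its top k."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in segment.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

@lru_cache(maxsize=1024)  # cache frequent queries; clear it when the index changes
def search(query, k=10):
    """Look up every segment in parallel, then merge into one top-k list."""
    with ThreadPoolExecutor(max_workers=len(SEGMENTS)) as pool:
        partials = pool.map(lambda seg: search_segment(seg, query, k), SEGMENTS)
        merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return tuple(merged)
```

Each segment only needs to return its own top k, because no document outside those can appear in the merged top k. After an index update, `search.cache_clear()` would drop stale cached answers.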
  • Testing Search Indexing
    Choose 3-4 good candidates
    Index as much content as possible
    Watch the robot, track errors
    Try to index tricky data sources
    Compare coverage among them
    Test index scaling
    Make a really big index based on expected use
    Speed of add / update / delete
    Responsiveness during big update
  • Evaluating Search Results
    Create a query test suite
    Use existing search logs if possible
    Short, long, unusual, common (check cache)
    Simple and complex queries
    Spelling, typing and vocabulary errors
    Many matches, few matches, no matches
    Perform searches against the test engines
    Save results pages as HTML for later checking
    Analyze differences among them
    Retrieval (and indexing): what’s found?
    Relevance: are the top results good ones?
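A minimal sketch of analyzing differences between two candidate engines on a query test suite. It assumes results were saved as ordered lists of document IDs per query, and that someone has hand-judged which documents are relevant; precision at k is one common measure, chosen here as an assumption rather than taken from the slides.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that were judged relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)

def compare_engines(results_a, results_b, judgments, k=10):
    """For each test query, show what each engine found that the other
    did not (retrieval) and how good the top results are (relevance)."""
    report = {}
    for query, relevant in judgments.items():
        a = results_a.get(query, [])
        b = results_b.get(query, [])
        report[query] = {
            "only_a": sorted(set(a) - set(b)),
            "only_b": sorted(set(b) - set(a)),
            "precision_a": precision_at_k(a, relevant, k),
            "precision_b": precision_at_k(b, relevant, k),
        }
    return report
```

Documents found by only one engine are worth a manual look: they may show an indexing gap in the other engine, or junk that should not have matched.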
  • Search: Not a Black Box
    Simple search solves many enterprise problems
    Dynamic access to local content
    Familiar interface, expectations
    User vocabulary
    Understand the real information needs
    Index the right stuff
    Work with content providers and IAs
    Link to specialty research engines
    Learn from users over time, make it better