How search engines work Anand Saini

Published in: Education, Technology


Transcript

  • 1. Helping people find what they’re looking for
      • Starts with an “information need”
      • Convert to a query
      • Gets results
    In the materials available
      • Web pages
      • Other formats
      • Deep Web
  • 2. Search can’t find what’s not there
      • The content is hugely important
    Information architecture is vital
    Usable sites have good navigation and structure
  • 3. Index ahead of time
      • Find files or records
      • Open each one and read it
      • Store each word in a searchable index
    Provide search forms
      • Match the query terms with words in the index
      • Sort documents by relevance
    Display results
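The index-then-match flow on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine works: the names `tokenize`, `build_index`, and `search` are hypothetical, and "relevance" is assumed here to be nothing more than the total count of query-term occurrences.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Index ahead of time: map each word to {doc_id: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index

def search(index, query):
    """Match query terms against the index and sort by a crude score.

    Assumption: the score is the summed occurrence count of the query
    terms; ties are broken by document id.
    """
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample collection.
docs = {
    "a.html": "Search engines index pages",
    "b.html": "Engines and more engines",
    "c.html": "Cooking recipes",
}
index = build_index(docs)
results = search(index, "engines")
```

Because the index is built before any query arrives, each search only touches the entries for its own terms instead of rereading every document.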
  • 4. Like an iceberg, 2/3 below water
      • Diagram labels: user interface, search functionality, content
  • 5.
      • Text search works for structured content
      • Keyword search vs. SQL queries
      • Approximate vs. exact match
      • Multiple sources of content
      • Response time and database resources
      • Relevance ranking, very important
      • Works in the real world (e.g. eBay)
  • 6. Users blame the search engine
      • Even when the content is unavailable
    Understand the scope of site or intranet
      • Kinds of information
      • Divided sites: products / corporate info
      • Dates
      • Languages
      • Sources and data silos: CMSs, databases...
      • Update processes
  • 7. Store text to search it later
    Many ways to gather text
      • Crawl (spider) via HTTP
      • Read files on file servers
      • Access databases (HTTP or API)
      • Data silos via local APIs
      • Applications, CMSs, via Web Services
    Security and Access Control
  • 8. Basic information for document or record
      • File name / URL / record ID
      • Title or equivalent
      • Size, date, MIME type
    Full text of item
    More metadata
      • Product name, picture ID
      • Category, topic, or subject
      • Other attributes, for relevance ranking and display
  • 9. Stop words
    Stemming
    Metadata
      • Explicit (tags)
      • Implicit (context)
    Semantics
      • CMS and Database fields
      • XML tags and attributes
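Stop-word removal and stemming can be illustrated with a deliberately naive sketch. Everything here is an assumption for illustration: the stop-word list is a tiny hypothetical sample, and `naive_stem` just strips a few English suffixes; it is not the Porter stemmer or any production algorithm.

```python
import re

# Hypothetical, very small stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def naive_stem(word):
    """Strip one common suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

terms = index_terms("Searching the indexed documents")
```

The same function would need to run on queries as well, so that a search for "documents" lands on the stemmed index entry "document".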
  • 10. What happens after you click the search button and before retrieval starts
    Usually in this order
      • Handle character set, maybe language
      • Look for operators and organize the query
      • Look for field names or metadata
      • Extract words (just like the indexer)
      • Deal with letter casing
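The query-processing steps listed on this slide might look roughly like the following sketch. The query syntax is an assumption made for the example (double-quoted phrases, uppercase `AND`/`OR`/`NOT` operators, and `field:value` pairs), and `parse_query` is a hypothetical helper, not the API of any real engine.

```python
import re
import unicodedata

OPERATORS = {"AND", "OR", "NOT"}  # assumed operator keywords

def parse_query(raw):
    """Split a raw query string into phrases, fields, operators, and terms."""
    # 1. Handle the character set: normalize Unicode to one canonical form.
    text = unicodedata.normalize("NFKC", raw).strip()
    parsed = {"phrases": [], "fields": {}, "operators": [], "terms": []}

    # 2. Pull out quoted phrases before splitting on whitespace.
    for phrase in re.findall(r'"([^"]+)"', text):
        parsed["phrases"].append(phrase.lower())
    text = re.sub(r'"[^"]*"', " ", text)

    for token in text.split():
        if token in OPERATORS:
            # 3. Operators are recognized before case folding.
            parsed["operators"].append(token)
        elif ":" in token and not token.startswith(":") and not token.endswith(":"):
            # 4. field:value pairs name a metadata field to search.
            name, value = token.split(":", 1)
            parsed["fields"][name.lower()] = value.lower()
        else:
            # 5. Extract plain words and fold case, as the indexer would.
            parsed["terms"].extend(re.findall(r"\w+", token.lower()))
    return parsed

query = parse_query('title:Cruise "deep web" Search AND Engines')
```

Case folding happens last in this sketch so that an uppercase `AND` can still be told apart from the ordinary word "and".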
  • 11.
      • Retrieval: find files with query terms
      • Not the same as relevance ranking
    Recall: find all relevant items
    Precision: find only relevant items
    Increasing one tends to decrease the other
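Recall and precision are usually computed as simple ratios: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 items retrieved, 3 items actually relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Returning more items can only raise recall, but every irrelevant item added lowers precision, which is the trade-off the slide describes.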
  • 12. Single-word queries
      • Find items containing that word
    Multi-word queries: combine lists
      • Any: every item with any query word
      • All: only items with every word
      • Phrases: find only items with all words in order
    Boolean and complex queries
      • Use algorithm to combine lists
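Combining the per-word lists maps naturally onto set operations. The sketch below assumes a positional index (each word maps to the documents and word positions where it occurs) so that phrases can be checked; the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def match_any(index, words):
    """Any: union of the document lists."""
    result = set()
    for w in words:
        result |= set(index.get(w, {}))
    return result

def match_all(index, words):
    """All: intersection of the document lists."""
    lists = [set(index.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(index, words):
    """Phrase: all words present, at consecutive positions."""
    result = set()
    for doc_id in match_all(index, words):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                result.add(doc_id)
                break
    return result

docs = {1: "new york bank", 2: "york has a new bank", 3: "river bank"}
index = build_positional_index(docs)
```

Here `match_all(index, ["new", "york"])` returns documents 1 and 2, while `match_phrase` keeps only document 1, where the two words are adjacent and in order.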
  • 13.
      • Empty search
      • Nothing on the site on that topic (scope)
      • Misspelling or typing mistakes
      • Vocabulary differences
      • Restrictive search defaults
      • Restrictive search choices
      • Software failure
  • 14. Theory: sort the matching items, so the most relevant ones appear first
    Can’t really know what the user wants
    Relevance is hard to define and situational
    Short queries tend to be deeply ambiguous
      • What do people mean when they type “bank”?
    First 10 results are the most important
    The more transparent, the better
  • 15. Sorting documents on various criteria
    Start with words matching query terms
    Citation and link analysis
      • Like old library Citation Indexes
      • Ted Nelson - not only hypertext, but the links
      • Google PageRank
          • Incoming links
          • Authority of linkers
    Taxonomies and external metadata
  • 16.
      • Term frequency in the item
      • Inverse document frequency of term
          • Rare words are likely to be more important
    w_ij = weight of term T_j in document D_i
    tf_ij = frequency of term T_j in document D_i
    N = number of documents in collection
    n = number of documents where term T_j occurs at least once
    From Salton 1989
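The transcript lists the symbols but not the formula they belong to. Assuming the slide showed the standard tf-idf weighting, w_ij = tf_ij × log(N / n), a small sketch of the calculation looks like this. The natural logarithm and the whitespace tokenizer are assumptions; the function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute w_ij = tf_ij * log(N / n) for every term in every document."""
    N = len(docs)  # number of documents in the collection
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # n: number of documents in which each term occurs at least once.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    weights = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)  # tf_ij: frequency of each term in this document
        weights[doc_id] = {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}
    return weights

docs = {"d1": "bank river bank", "d2": "bank loan", "d3": "loan rate"}
weights = tf_idf_weights(docs)
```

In this collection "river" appears in only one document, so it outweighs "bank" in `d1` even though "bank" occurs twice there. A term that appears in every document would get a weight of zero.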
  • 17.
      • Vector space
      • Probabilistic (binary independence)
      • Fuzzy set theory
      • Bayesian statistical analysis
      • Latent semantic indexing
      • Neural networks
      • Machine learning
      • All require sophisticated queries
      • See MIR, chapter 2
  • 18. Heuristics are rules of thumb
      • Not algorithms, not math
    Search Relevance Ranking Heuristics
      • Documents containing all search words
      • Search words as a phrase
      • Matches in title tag
      • Matches in other metadata
    Based on real-world user behavior
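The four heuristics on this slide can be combined into a single score by giving each one a weight. The sketch below is purely illustrative: the weights, the `heuristic_score` function, and the document layout (`title`, `body`, `metadata`) are all assumptions, and real systems tune such weights against observed user behavior.

```python
def heuristic_score(query, doc, weights=None):
    """Score one document against a query using rule-of-thumb boosts."""
    # Hypothetical default weights; adjust to taste.
    w = {"all_words": 2.0, "phrase": 3.0, "title": 4.0, "metadata": 1.0}
    if weights:
        w.update(weights)

    terms = query.lower().split()
    body = doc.get("body", "").lower()
    title_words = doc.get("title", "").lower().split()
    meta_words = " ".join(doc.get("metadata", {}).values()).lower().split()

    score = 0.0
    body_words = set(body.split())
    if terms and all(t in body_words for t in terms):
        score += w["all_words"]          # document contains all search words
    if terms and " ".join(terms) in body:
        score += w["phrase"]             # search words appear as a phrase
    score += w["title"] * sum(t in title_words for t in terms)
    score += w["metadata"] * sum(t in meta_words for t in terms)
    return score

doc1 = {"title": "Cruise deals", "body": "find cruise deals here",
        "metadata": {"category": "travel"}}
doc2 = {"title": "About us", "body": "deals on a cruise", "metadata": {}}
ranked = sorted([doc1, doc2],
                key=lambda d: heuristic_score("cruise deals", d),
                reverse=True)
```

The phrase check is a plain substring test, so it can also match inside longer words; a positional index would be the more careful way to do it.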
  • 19. What users see after they click the Search button
    The most visible part of search
    Elements of the results page
      • Page layout and navigation
      • Results header
      • List of results items
      • Results footer
  • 20. Human judgment beats algorithms
    Great for frequent, ambiguous searches
      • Use search log to identify best candidates
    Recommend good starting pages
      • Product information, FAQs, etc.
    Requires human resources
      • That means money and time
    More static than algorithmic search
  • 21. Leverage content structure
      • Database fields (e.g. cruise amenities)
      • Document metadata (e.g. news article bylines)
    Provide both search and browse
      • Support information foraging
      • Integrate navigation with results
      • Not just subject taxonomies
      • Display only fruitful paths, no dead ends
    Supported by academic research
      • Marti Hearst, UCB SIMS, flamenco.berkeley.edu
  • 22. Metrics
      • Number of searches
      • Number of no-matches searches
      • Traffic from search to high-value pages
      • Relate search changes to other metrics
    Search Log Analysis
      • Top 5% searches: phrases and words
      • Top no-matches searches
      • Use as market research
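Several of these metrics fall out of a simple count over the search log. The sketch below assumes a hypothetical log format of `(query, result_count)` pairs; real logs vary by product and usually need parsing first.

```python
from collections import Counter

def analyze_search_log(log, top_n=5):
    """Summarize a log of (query, result_count) pairs."""
    queries = Counter()
    no_matches = Counter()
    for query, result_count in log:
        # Normalize case and whitespace so "Bank" and "bank " count together.
        normalized = " ".join(query.lower().split())
        queries[normalized] += 1
        if result_count == 0:
            no_matches[normalized] += 1
    return {
        "total_searches": sum(queries.values()),
        "no_match_searches": sum(no_matches.values()),
        "top_queries": queries.most_common(top_n),
        "top_no_match_queries": no_matches.most_common(top_n),
    }

# Hypothetical log entries.
log = [
    ("Bank", 10), ("bank ", 8), ("bank", 5), ("cruise", 3),
    ("flamenco", 0), ("Flamenco", 0), ("xyzzy", 0),
]
report = analyze_search_log(log)
```

The top no-match queries show what visitors expect to find but cannot, which is what makes the log usable as market research.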
  • 23. Search engines can’t read minds
      • User queries are short and ambiguous
    Some things will help
      • Design a usable interface
      • Show match words in context
      • Keep index current and complete
      • Adjust heuristic weighting
      • Maintain suggestions and synonyms
      • Consider faceted metadata search
  • 24. Join us
      • Address: WZ-30-a, Bhagwan Das Nagar, East Punjabi Bagh, Delhi-110026
      • Tel.: 011 28316148, 3203571, 30538061
      • Mobile: +91-8010 298 388, 8010 198 388
      • E-mail: info@seocertification.org.in