How search engines work



    1. How Search Engines Work
       Presentation by cHinna
    2. What Is a Search Engine?
       A search engine is a software program that searches for sites based on the words that you designate as search terms. "Search engine" is the popular term for an Information Retrieval (IR) system.
    3. Motto of Search Engines
       A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list, often referred to as SERPs, or "search engine results pages". The information may consist of web pages, images, and other types of files.
    4. Purpose of Search Engines
       Helping people find what they’re looking for
         • Starts with an "information need"
         • Converts it to a query
         • Gets results
       In the materials available
         • Web pages
         • Other formats
         • Deep Web
    5. History
       Archie: first search tool for the Internet
       Gopher: indexed plain text documents
       Jughead: searched the files stored in Gopher index systems
       Wandex: first Web search engine
    6. How Web Search Engines Work
       A search engine operates in the following order:
         • Web crawling
         • Indexing
         • Searching
    7. How Do Search Engines Work?
       Spiders
       Robots
    8. Search Is Not a Panacea
       Search can’t find what’s not there
         • The content is hugely important
       Information architecture is vital
       Usable sites have good navigation and structure
    9. Search Engine Modules
       A query processor
       A search and matching function
       A ranking capability
       Summarizing and presenting documents
    10. How Search Engines Worked in Earlier Days
       From 1990 to 1998 (first generation of search tools):
         • Looked at the title of web pages
         • Ranking was based on page content
         • Looked at the number of times the search term appeared on the page
         • Looked at meta tags
    11. SEO (Search Engine Optimization)
       Used by companies to rank higher in search engine results
       White hat: using legitimate techniques
       Black hat: using deceptive techniques that violate search engine guidelines to trick the search engine, like paying sites to link to you
    12. Search Processing
    13. Search Is Only as Good as the Content
       Users blame the search engine
         • Even when the content is unavailable
       Understand the scope of the site or intranet
         • Kinds of information
         • Divided sites: products / corporate info
         • Dates
         • Languages
         • Sources and data silos: databases...
         • Update processes
    14. Making a Searchable Index
       Store text to search it later
       Many ways to gather text
         • Crawl (spider) via HTTP
         • Read files on file servers
         • Access databases (HTTP or API)
         • Data silos via local APIs
         • Applications, CMSs, via Web Services
       Security and access control
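
The "crawl (spider) via HTTP" step can be sketched as follows. This is a minimal illustration, not the method of any particular engine: it assumes the page has already been fetched and only shows how a spider might pull out the text to store and the links to visit next. The names `PageParser` and `parse_page` and the example.com URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Resolve each <a href="..."> against the page URL so that
        # relative links become absolute URLs the spider can fetch later.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep non-empty text runs. A real indexer would also skip
        # <script> and <style> contents, which this sketch does not.
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url: str, html: str) -> tuple[str, list[str]]:
    """Return (text to index, links to crawl next) for one page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


page = '<html><body><h1>Products</h1><a href="/about.html">About us</a></body></html>'
text, links = parse_page("http://example.com/index.html", page)
print(text)
print(links)
```

A full spider would wrap this in a loop that fetches each discovered link once, and would respect access controls before storing anything.
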
    15. Robot Indexing Diagram Sour
    16. What the Index Needs
       Basic information for document or record
         • File name / URL / record ID
         • Title or equivalent
         • Size, date, MIME type
       Full text of item
       More metadata
         • Product name, picture ID
         • Category, topic, or subject
         • Other attributes, for relevance ranking and display
    17. Simple Index Diagram
    18. Index Issues
       Stopwords
       Stemming
       Metadata
         • Explicit (tags)
         • Implicit (context)
       Semantics
         • CMS and database fields
         • XML tags and attributes
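
Stopwords and stemming are easier to see in code. The sketch below is illustrative only: the stopword list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's. Both names are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "the", "of", "to", "in", "is", "for"}

# Suffixes removed by the naive stemmer, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Lowercase, split into words, drop stopwords, then stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


print(index_terms("The spiders are crawling and indexing documents"))
# → ['spider', 'crawl', 'index', 'document']
```

The same function has to run on queries as well as documents, otherwise a search for "crawling" would never meet the stored term "crawl".
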
    19. Search Query Processing
       What happens after you click the search button, and before retrieval starts.
       Usually in this order
         • Handle character set, maybe language
         • Look for operators and organize the query
         • Look for field names or metadata
         • Extract words (just like the indexer)
         • Deal with letter casing
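
The order of steps above can be sketched as a small parser. The query syntax assumed here (uppercase `AND`/`OR`/`NOT`, double-quoted phrases, `field:value` pairs) is a common convention rather than something the slides specify, and `ParsedQuery` and `parse_query` are hypothetical names.

```python
import re
import unicodedata
from dataclasses import dataclass, field

OPERATORS = {"AND", "OR", "NOT"}
# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


@dataclass
class ParsedQuery:
    terms: list[str] = field(default_factory=list)
    phrases: list[str] = field(default_factory=list)
    operators: list[str] = field(default_factory=list)
    fields: dict[str, str] = field(default_factory=dict)


def parse_query(raw: str) -> ParsedQuery:
    # 1. Character set: fold compatibility forms (e.g. full-width letters).
    text = unicodedata.normalize("NFKC", raw)
    parsed = ParsedQuery()
    for phrase, token in TOKEN_RE.findall(text):
        if phrase:
            parsed.phrases.append(phrase.lower())
        elif token in OPERATORS:
            # 2. Operators are recognised before casing is folded.
            parsed.operators.append(token)
        elif ":" in token and not token.startswith(":"):
            # 3. Field names: title:search restricts a term to one field.
            name, _, value = token.partition(":")
            parsed.fields[name.lower()] = value.lower()
        else:
            # 4 and 5. Extract words and lowercase them, like the indexer.
            parsed.terms.extend(re.findall(r"\w+", token.lower()))
    return parsed


print(parse_query('title:Search "web crawler" AND Indexing'))
```

This sketch does not handle a quoted phrase inside a field (such as `title:"web crawler"`), which a production parser would need.
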
    20. Search and Retrieval
       Retrieval: find files with query terms
       Not the same as relevance ranking
       Recall: find all relevant items
       Precision: find only relevant items
       Increasing one tends to decrease the other
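
Recall and precision have standard set-based definitions: recall is the share of relevant items that were retrieved, and precision is the share of retrieved items that are relevant. The sketch below uses made-up document IDs to show the trade-off; `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one query's result set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}

# A strict search returns few items, all of them relevant.
narrow = {"d1", "d2"}
# A loose search returns every relevant item plus four irrelevant ones.
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

print(precision_recall(narrow, relevant))  # → (1.0, 0.5)
print(precision_recall(broad, relevant))   # → (0.5, 1.0)
```
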
    21. Retrieval = Matching
       Single-word queries
         • Find items containing that word
       Multi-word queries: combine lists
         • Any: every item with any query word
         • All: only items with every word
         • Phrases: find only items with all words in order
       Boolean and complex queries
         • Use algorithm to combine lists
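
"Combining lists" usually means combining posting lists from an inverted index. The following sketch assumes a positional index (word to document to word positions) over three invented documents; the function names are hypothetical and the phrase match is the simplest possible version.

```python
from collections import defaultdict

# Toy corpus, invented for illustration.
DOCS = {
    1: "web search engine",
    2: "search the web",
    3: "engine repair manual",
}


def build_index(docs: dict[int, str]):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def match_any(index, words: list[str]) -> set[int]:
    """Any: union of the posting lists."""
    result: set[int] = set()
    for w in words:
        result |= set(index.get(w, {}))
    return result


def match_all(index, words: list[str]) -> set[int]:
    """All: intersection of the posting lists."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()


def match_phrase(index, words: list[str]) -> set[int]:
    """Phrase: documents where the words occur at consecutive positions."""
    hits = set()
    for doc_id in match_all(index, words):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


index = build_index(DOCS)
print(match_any(index, ["web", "engine"]))     # → {1, 2, 3}
print(match_all(index, ["search", "web"]))     # → {1, 2}
print(match_phrase(index, ["web", "search"]))  # → {1}
```

Boolean queries with NOT or nested groups follow the same idea, using set difference and recursion over the parsed query.
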
    22. Why Searches Fail
       Empty search
       Nothing on the site on that topic (scope)
       Misspelling or typing mistakes
       Vocabulary differences
       Restrictive search defaults
       Restrictive search choices
       Software failure
    23. Relevance Ranking
       Theory: sort the matching items, so the most relevant ones appear first
       Can't really know what the user wants
       Relevance is hard to define and situational
       Short queries tend to be deeply ambiguous
         • What do people mean when they type “bank”?
       The first 10 results are the most important
    24. Relevance Processing
       Sorting documents on various criteria
       Start with words matching query terms
       Citation and link analysis
         • Like old library Citation Indexes
         • Not only hypertext, but the links
         • Google PageRank
         • Incoming links
         • Authority of linkers
       Taxonomies and external metadata
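
Link analysis of the PageRank kind can be sketched with a short power iteration. This is a simplified textbook version, not Google's production ranking: the damping factor of 0.85 and the even redistribution of rank from pages with no outgoing links are conventional assumptions, and the three-page link graph is invented.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split across its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# "c" has two incoming links, "a" has one (from the well-linked "c"),
# and "b" has none.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page "a" ends up close behind "c" even though it has only one incoming link, because that link comes from a high-ranked page. This is the "authority of linkers" effect.
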
    25. Search Results Interface
       What users see after they click the Search button
       The most visible part of search
       Elements of the results page
         • Page layout and navigation
         • Results header
         • List of results items
         • Results footer
    26. Search Suggestions
       Human judgment beats algorithms
       Great for frequent, ambiguous searches
         • Use search log to identify best candidates
       Recommend good starting pages
         • Product information, FAQs, etc.
       Requires human resources
         • That means money and time
       More static than algorithmic search
    27. Search Metrics
       Number of searches
       Number of matches searches
       Traffic from search to high-value pages
       Relate search changes to other metrics
    28. Query Example
       Consider the query "Mahendra Singh Dhoni".
       A good answer contains all three words, and the more frequently the better. We call this Term Frequency (TF).
       Some query terms are more important than others because they have better discriminating power.
       For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra". We call this Inverse Document Frequency (IDF).
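
The TF and IDF ideas can be combined into a score. The sketch below uses one common weighting, raw term count multiplied by log(N / document frequency); other variants exist and the slides do not say which one is meant. The four-document corpus is invented for illustration and `tf_idf_scores` is a hypothetical name.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
DOCS = {
    "d1": "mahendra singh dhoni captained india dhoni",
    "d2": "mahendra kapoor sang many songs",
    "d3": "singh is a common surname",
    "d4": "mahendra singh visited the stadium",
}


def tf_idf_scores(query: str, docs: dict[str, str]) -> dict[str, float]:
    """Score every document as the sum of TF * IDF over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(tokenized)
    terms = set(query.lower().split())
    # IDF: terms found in fewer documents discriminate better and weigh more.
    idf = {}
    for term in terms:
        df = sum(1 for words in tokenized.values() if term in words)
        if df:
            idf[term] = math.log(n / df)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)  # TF: how often each word occurs here
        scores[doc_id] = sum(counts[t] * weight for t, weight in idf.items())
    return scores


scores = tf_idf_scores("Mahendra Singh Dhoni", DOCS)
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 3))
```

In this corpus "dhoni" occurs in one document out of four while "mahendra" occurs in three, so a single "dhoni" match contributes log(4) to the score and a single "mahendra" match only log(4/3).
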
    29. Search Will Never Be Perfect
       Search engines can’t read minds
         • User queries are short and ambiguous
       Some things will help
         • Design a usable interface
         • Show match words in context
         • Keep index current and complete
         • Adjust heuristic weighting
         • Maintain suggestions and synonyms
         • Consider faceted metadata search