Search Engines
Presented by
ganesh kavhar
2
What Are They?
• Four components
  • A database of references to webpages
  • An indexing robot that crawls the WWW
  • An interface
    • Enables users to submit queries
    • Displays results
  • An information retrieval (IR) system
• Each engine is unique, but they mostly work the same way
3
Database
• Where the user's query is matched
• Contains only the essential parts of pages
• Only includes pages that were indexed
• Search engines are always somewhat out of date, because pages change after they are indexed
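A minimal sketch of what such a database could look like, assuming it is an inverted index that maps each word to the pages containing it. The page URLs, the sample text, and the function name build_index are hypothetical; real engines store far more than this.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


# Hypothetical pages standing in for documents the crawler has visited.
pages = {
    "a.example/home": "search engines index the web",
    "b.example/faq": "a crawler follows links on the web",
}
index = build_index(pages)

print(sorted(index["web"]))      # pages that mention "web"
print(sorted(index["crawler"]))  # pages that mention "crawler"
```

A page that was never passed to build_index cannot be returned for any query, which is the point of "only includes pages that were indexed".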
4
Web Crawler
• A robot that follows links
• Records data it finds
  • Words in the webpage
  • Metadata
  • ALT attributes in IMG tags
• Robot Exclusion Protocol
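The points above can be sketched with the Python standard library. This is an illustration only, not how any particular engine works: the class name PageRecorder, the bot name "demo-bot", the robots.txt rules, and the sample HTML are all assumptions, and nothing is fetched from the network.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class PageRecorder(HTMLParser):
    """Collect the kinds of data a crawler might record from one page."""

    def __init__(self):
        super().__init__()
        self.links, self.words, self.alts, self.meta = [], [], [], {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])        # links to follow later
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])          # ALT text of images
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"]] = attrs["content"]

    def handle_data(self, data):
        self.words.extend(data.split())             # visible words


# Robot Exclusion Protocol: rules a site publishes in robots.txt (made up here).
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("demo-bot", "http://site.example/index.html")
blocked = robots.can_fetch("demo-bot", "http://site.example/private/notes.html")

html = """<html><head><title>Demo</title>
<meta name="description" content="A demo page">
</head><body><p>Hello crawler</p>
<a href="/about.html">About</a>
<img src="logo.png" alt="Company logo">
</body></html>"""

recorder = PageRecorder()
recorder.feed(html)
print(allowed, blocked)
print(recorder.links, recorder.alts, recorder.meta, recorder.words)
```

A polite crawler would call can_fetch before requesting each URL and skip anything the site has excluded.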
5
Search Engine Interfaces
• Gathers input from users
• Presents results from the IR system
  • Often in ranked order
6
Search Engine Interfaces
• Input
  • User requirements
    • Search expression, search limits
  • Presentation style
    • Presentation format, search type
7
Search Engine Interfaces
• Output
  • Results
  • Descriptions
  • Clusters
8
Search Term Matching
• Trying to find a match in the database
• Two main methods
  • Keyword searching
    • Matching single terms, computing cosine similarity
  • Concept-based searching
    • Examining clusters of words
    • Attempts to determine the meaning of the query and find records related to that meaning
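For the keyword method, "computing cosine" can be illustrated as the cosine of the angle between the term-count vectors of the query and a document. This sketch assumes raw term counts as weights (real systems usually weight terms further); the function name cosine and the sample documents are made up for the example.

```python
import math
from collections import Counter


def cosine(query, document):
    """Cosine similarity between the term-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


docs = {
    "doc1": "web search engines crawl the web",
    "doc2": "cooking recipes for the weekend",
}
query = "web search"
scores = {name: cosine(query, text) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A score of 1 means the two vectors point in the same direction, and 0 means the texts share no terms.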
9
Basic IR Features
• Boolean operators
  • AND, OR, NOT, grouping
• Extended operators
  • NEAR, ADJACENT, phrase quotes (")
• Stop word deletion
• Stemming
• Searching in fields (e.g. host)
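A rough sketch of three of these features, assuming Boolean operators map onto set operations over an inverted index. The stop word list is a tiny illustrative one and the stem function is a deliberately crude suffix stripper, not a real stemming algorithm; all names and documents are hypothetical.

```python
STOP_WORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative list


def stem(word):
    """Crude suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, drop stop words, and stem what remains."""
    return {stem(w) for w in text.lower().split() if w not in STOP_WORDS}


docs = {
    1: "the crawler follows links",
    2: "ranking of search engines",
    3: "search engines and crawlers",
}

# Inverted index: stemmed term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in terms(text):
        index.setdefault(term, set()).add(doc_id)


def lookup(word):
    return index.get(stem(word.lower()), set())


both = lookup("search") & lookup("crawler")      # search AND crawler
either = lookup("ranking") | lookup("links")     # ranking OR links
without = lookup("engines") - lookup("ranking")  # engines NOT ranking
print(both, either, without)
```

Because of stemming, the query word "crawler" also matches the document that says "crawlers".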
10
Ranked Output
• Most SEs produce ranked lists by applying simple rules:
  • Early words are more important
  • Title is very important
  • Frequency of occurrence matters for some engines
  • Infrequent words matter more
  • Modification date
• Google is different:
  • PageRank™ method based on popularity
  • Links as money
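The "links as money" idea can be sketched as a simple iteration in which each page splits its current rank evenly among the pages it links to. This is a simplified illustration, not Google's implementation: the damping value of 0.85, the iteration count, and the three-page link graph are assumptions that do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along links (simplified sketch)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page "spends" its rank evenly across its outlinks,
                # so each link is worth less as the outdegree grows.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print({page: round(score, 3) for page, score in rank.items()})
```

In this graph C is linked from both A and B, so it ends up with the highest score, while B receives only half of A's share.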
11
Googlebombing
• Google spoofed from the lecture list
  • first hit from 1992
• Official GoogleBlog explanation
12
What about the Invisible Web?
• Also known as the Deep Web
• Documents that are on the WWW but not indexed by search engines
  • Some are available only by submitting forms
  • Some are not generally accessible (in subnets)
  • Some are not in (X)HTML format
13
The Invisible Web Isn't So Invisible Anymore…
• More search engines now parse non-(X)HTML formats than before
• Because of awareness of the problem, companies are making more content available using
  • Stable URLs
  • Robot-friendly sitemaps
• But much content is still not indexed
14
But, there's still plenty of important yet invisible docs
• How to find them?
  • Many of them are in databases
  • No one search engine covers everything
• Use database tools from the university's library
  • Especially for research articles
• Use multiple search engines or a meta-crawler
  • Dogpile is the most famous
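One way a meta-crawler might combine results is to interleave the ranked lists from several engines and drop duplicates. This is a sketch under that assumption, not a description of how Dogpile works; the engine result lists and the function name merge_results are hypothetical.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked lists round-robin, keeping each URL only once."""
    seen, merged = set(), []
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical results from two engines for the same query.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example"]
merged = merge_results(engine_a, engine_b)
print(merged)
```

Because no one engine covers everything, the merged list contains pages that either engine alone would have missed.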
Search Engines
A Summary of Practical Advice
16
How To Succeed With SEs
• As a surfer:
  • If you don't know what you are looking for
    • Use multiple SEs, or a meta-crawler
    • Search within results
  • If you know what you are looking for
    • Use multiple SEs, or a meta-crawler
    • Use Boolean expressions or search within results
    • Consider specialized engines
17
How To Succeed With SEs
• As a creator:
  • HTML level
    • Always use ALT attributes with <IMG>, etc.
    • Avoid frames
  • Make it easier to index
    • Don't expect SEs to find your pages
    • Make links between your pages
    • Use metadata
      • Informal: <meta name="description" …>
      • Formal: Dublin Core and others
  • Increase your pages' popularity
    • Don't use systematic reciprocal linking: rings, exchanges, lists
    • The PageRank™ a linking page passes to you is proportional to its own PageRank and inversely proportional to its outdegree
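The HTML-level advice above can be checked mechanically. The sketch below, using only the Python standard library, counts images without ALT text and looks for a description meta tag and for frames; the class name IndexabilityAudit and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class IndexabilityAudit(HTMLParser):
    """Flag a few things that make a page harder to index."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = 0
        self.has_description = False
        self.uses_frames = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and not attrs.get("alt"):
            self.images_missing_alt += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.has_description = bool(attrs.get("content"))
        elif tag in ("frame", "frameset"):
            self.uses_frames = True


# Hypothetical page: one image has ALT text, the other does not.
page = """<html><head>
<meta name="description" content="Notes on search engines">
</head><body>
<img src="chart.png" alt="Ranking chart">
<img src="spacer.gif">
</body></html>"""

audit = IndexabilityAudit()
audit.feed(page)
print(audit.images_missing_alt, audit.has_description, audit.uses_frames)
```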
18
How To Succeed With SEs
• As a creator (cont.)
  • For surfers:
    • Use <meta name="description" …>
    • Don't expect surfers to start at the top of your hierarchy
      • Don't rely on a hierarchy
      • Include a context map near the top of each page
      • Don't use frames
      • Think through dynamic content implications
      • Stickiness… is for another day
Follow on
• https://ganeshmkavhar.000webhostapp.com/
• https://github.com/ganeshkavhar
• https://www.csharpcorner.com/members/ganesh-kavhar
19


Editor's Notes

  • #18 PageRank™ (PR) is proportional to the PR of each page that links to you, but also inversely proportional to that page's outdegree