Web Technology Presentation
How Search Works in a Search Engine
Index
✔ Prelude
✔ How search engine works
✔ Search Engine Architecture
✔ Working
✔ Features (Web Crawling, Page Rank, etc.)
✔ Problems
✔ Future
✔ Conclusion
Prelude
✔ A Search Engine is an information retrieval system
designed to help find information stored on a computer
system, such as on the
● World Wide Web,
● inside a corporate or proprietary network,
● or in a personal computer.
✔ The search engine allows one to ask for content meeting
specific criteria (typically those containing a given word or
phrase) and retrieves a list of items that match those criteria.
How search engine works
✔ Search starts with the web. It's made up of over
60 trillion individual pages and it's constantly
growing.
✔ A search engine navigates the web by:
● Web Crawling
● Indexing
● Searching
Architecture Overview
Working...
✔ A URL server sends lists of URLs to be
fetched to the crawlers.
✔ The web pages that are fetched are then sent to the store
server.
✔ The store server then compresses and stores the web
pages into a repository.
✔ Every web page has an associated ID number called a
doc ID which is assigned whenever a new URL is parsed
out of a web page.
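The store server and doc ID steps above can be sketched in a few lines. This is only a toy model under stated assumptions: the `Repository` class, its method names, and the use of `zlib` for compression are illustrative choices, not details taken from the slides.

```python
import zlib


class Repository:
    """Toy stand-in for the store server plus repository (hypothetical API)."""

    def __init__(self):
        self.doc_ids = {}   # URL -> doc ID
        self.pages = {}     # doc ID -> compressed HTML

    def doc_id_for(self, url):
        # Assign the next free doc ID the first time a URL is seen.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def store(self, url, html):
        # Compress the fetched page before keeping it.
        doc_id = self.doc_id_for(url)
        self.pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        # Uncompress on read, as the indexer does later.
        return zlib.decompress(self.pages[doc_id]).decode("utf-8")


repo = Repository()
first = repo.store("http://example.com/", "<html><body>Hello</body></html>")
again = repo.doc_id_for("http://example.com/")   # same URL, same doc ID
```

Keeping the URL-to-ID map separate from the page store means a URL can receive a doc ID as soon as it is discovered, before its page has been fetched.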
Working...(cont.)
✔ The indexing function is performed by the indexer and the
sorter.
✔ The indexer performs a number of functions. It reads the
repository, uncompresses the documents, and parses them.
✔ The indexer distributes hits into a set of "barrels", creating a
partially sorted forward index.
✔ The indexer also parses out all the links in every web page and stores
important information about them in an anchors file.
✔ The anchors file contains enough information to determine where
each link points from and to, and the text of the link.
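As a rough illustration of what an anchors record might hold, the sketch below pulls (source URL, link target, link text) triples out of a page with Python's standard `html.parser`. The `AnchorParser` class and the tuple layout are assumptions made for this example; the slides do not specify the file format.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects (source URL, href, link text) for every <a> tag (hypothetical helper)."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.anchors = []
        self._href = None    # href of the link currently being read, if any
        self._text = []      # text chunks seen inside that link

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <a> both arrive as "a".
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self.source_url, self._href, text))
            self._href = None


parser = AnchorParser("http://example.com/index.html")
parser.feed('<p>See <A href="http://www.yahoo.com/">Yahoo!</A> and '
            '<a href="about.html">About  us</a>.</p>')
anchors = parser.anchors
```

The second link is still relative at this stage; turning it into an absolute URL is a separate step.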
Working...(cont.)
✔ The URL resolver reads the anchors file and converts
relative URLs into absolute URLs and in turn into doc IDs.
✔ It puts the anchor text into the forward index, associated
with the doc ID that the anchor points to. It also
generates a database of links which are pairs of doc IDs.
The links database is used to compute Page Ranks for
all the documents.
✔ The sorter takes the barrels, which are sorted by doc ID,
and re-sorts them by word ID to generate the inverted
index.
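The URL resolver step can be sketched with `urllib.parse.urljoin` from the standard library. The function name `resolve_anchors`, the triple layout of the input, and the rule of handing out doc IDs in order of first appearance are assumptions for illustration only.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """anchors: (source URL, href, text) triples; doc_ids: URL -> doc ID (updated in place)."""
    links = []          # links database: (from doc ID, to doc ID) pairs
    anchor_text = {}    # target doc ID -> anchor texts pointing at it
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)      # relative -> absolute
        # Any URL not seen before gets the next free doc ID.
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


doc_ids = {"http://example.com/index.html": 0}
anchors = [
    ("http://example.com/index.html", "about.html", "About us"),
    ("http://example.com/index.html", "http://www.yahoo.com/", "Yahoo!"),
]
links, anchor_text = resolve_anchors(anchors, doc_ids)
```

The `links` pairs are the kind of input a Page Rank computation needs, while `anchor_text` is keyed by the page each link points to.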
Working...(cont.)
✔ A program called Dump Lexicon takes the list of word IDs and
offsets into the inverted index that the sorter produces, together
with the lexicon produced by the indexer, and generates a
new lexicon to be used by the searcher.
✔ The searcher is run by a web server and uses the lexicon
built by Dump Lexicon together with the inverted index
and the Page Ranks to answer queries.
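A heavily simplified sketch of the searcher: it looks up each query word in the lexicon, intersects the matching doc ID lists from the inverted index, and orders the survivors by Page Rank. The function name, the dictionary shapes, and ranking by Page Rank alone are assumptions; a real searcher would weigh other signals as well.

```python
def search(query, lexicon, inverted_index, page_ranks):
    """Return doc IDs containing every query word, highest Page Rank first."""
    result = None
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []                       # unknown word: nothing can match
        docs = set(inverted_index.get(word_id, ()))
        result = docs if result is None else result & docs
    if not result:
        return []
    # Ties on Page Rank are broken by doc ID so the order is deterministic.
    return sorted(result, key=lambda d: (-page_ranks.get(d, 0.0), d))


# Made-up sample data for the example.
lexicon = {"web": 0, "search": 1, "crawler": 2}
inverted_index = {0: [0, 1, 2], 1: [0, 2], 2: [1]}
page_ranks = {0: 0.2, 1: 0.5, 2: 0.3}

both = search("web search", lexicon, inverted_index, page_ranks)
single = search("web", lexicon, inverted_index, page_ranks)
```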
Crawling the Web (Web Crawler)
✔ A web crawler or spider is used to create an index of web pages to
be used by a web search engine.
✔ Web search is based on this index.
✔ The search engine has a fast distributed crawling system.
✔ A single URL server serves lists of URLs to a number of crawlers.
✔ Both the URL server and the crawlers are implemented in Python.
✔ Each crawler keeps roughly 300 connections open at once. At peak
speeds, the system can crawl over 100 web pages per second using
four crawlers. This amounts to roughly 600 KB per second of data.
✔ Each crawler maintains its own DNS cache so it does not need to do
a DNS lookup before crawling each document.
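The per-crawler DNS cache can be illustrated as follows. The `DnsCache` class is a hypothetical helper, and a fake resolver is injected so the example runs without network access; a real crawler would fall back to something like `socket.gethostbyname`.

```python
import socket


class DnsCache:
    """Per-crawler cache of host name -> IP address (hypothetical helper)."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver   # replaced below so no network is needed
        self._cache = {}
        self.lookups = 0            # number of real lookups performed

    def resolve(self, host):
        # Only hit the resolver the first time a host is seen.
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


def fake_resolver(host):
    # Stand-in for a real DNS query; 192.0.2.1 is a documentation-only address.
    return "192.0.2.1"


cache = DnsCache(resolver=fake_resolver)
for _ in range(3):
    address = cache.resolve("example.com")
```

After three requests for the same host, only one lookup has been made. As a side check on the throughput figures, about 600 KB per second spread over 100 pages per second works out to roughly 6 KB per page.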
Crawling the Web
Indexing the Web
✔ Parsing
◦ Any parser which is designed to run on the entire Web must handle a huge
array of possible errors.
✔ Indexing Documents into Barrels
◦ After each document is parsed, it is encoded into a number of barrels. Every
word is converted into a word ID by using an in-memory hash table -- the
lexicon.
◦ Once the words are converted into word IDs, their occurrences in the current
document are translated into hit lists and are written into the forward barrels.
✔ Sorting
◦ The sorter takes each of the forward barrels and sorts it by word ID to produce
an inverted barrel for title and anchor hits and a full text inverted barrel.
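The indexing and sorting steps can be sketched as below. Several things here are assumptions made for the example: hits are reduced to plain word positions, barrels are chosen by word ID range with a made-up `WORDS_PER_BARREL` value, and no distinction is drawn between title, anchor, and full-text hits.

```python
from collections import defaultdict

WORDS_PER_BARREL = 2    # assumed value; each barrel covers a range of word IDs


def index_document(doc_id, text, lexicon, forward_barrels):
    """Turn one parsed document into hit lists and append them to forward barrels."""
    hits = defaultdict(list)                    # word ID -> positions in this doc
    for position, word in enumerate(text.lower().split()):
        # The lexicon is an in-memory hash table: word -> word ID.
        word_id = lexicon.setdefault(word, len(lexicon))
        hits[word_id].append(position)
    for word_id, positions in hits.items():
        barrel = forward_barrels[word_id // WORDS_PER_BARREL]
        barrel.append((doc_id, word_id, positions))


def sort_barrel(forward_barrel):
    """Re-sort one forward barrel by word ID to obtain an inverted barrel."""
    inverted = defaultdict(list)                # word ID -> [(doc ID, positions)]
    for doc_id, word_id, positions in sorted(
            forward_barrel, key=lambda entry: (entry[1], entry[0])):
        inverted[word_id].append((doc_id, positions))
    return dict(inverted)


lexicon = {}
forward_barrels = defaultdict(list)
index_document(0, "web search engine", lexicon, forward_barrels)
index_document(1, "search the web", lexicon, forward_barrels)
inverted = sort_barrel(forward_barrels[0])
```

Because documents are indexed in doc ID order, each forward barrel starts out ordered by doc ID; sorting by word ID is what turns it into a lookup structure for queries.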
Indexing the Web
Web Searching
✔ Match the query terms with words in the index
✔ Sort documents by relevance
✔ Find files or records
✔ Open each one and read it
✔ Store each word in a searchable index
✔ Kinds of information
✔ Dates
✔ Languages
✔ Update processes
Web Searching
Other System Features
✔ It makes use of the link structure of the Web to calculate
a quality ranking for each web page, called Page Rank
◦ Page Rank is a trademark of Google. The Page Rank
process has been patented.
✔ The search engine uses links to improve search results.
Page Rank
✔ Page Rank is a link analysis algorithm which assigns a
numerical weighting to each Web page, with the purpose
of "measuring" relative importance.
✔ Based on the hyperlink map of the Web
✔ An excellent way to prioritize the results of web keyword
searches
Intuitive Justification
✔ Page Rank models a "random surfer" who is given a web page at random and keeps
clicking on links, never hitting "back", but eventually gets bored
and starts on another random page.
◦ The probability that the random surfer visits a page is its Page
Rank.
◦ The damping factor d is the probability that, at each page, the "random
surfer" follows a link; with probability 1 - d the surfer gets bored
and requests another random page.
✔ A page can have a high Page Rank
◦ If there are many pages that point to it
◦ Or if there are some pages that point to it, and have a high Page
Rank.
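The random surfer idea can be turned into a short power-iteration sketch. The assumptions are stated plainly: d = 0.85 is a commonly used value, d is treated here as the probability of following a link (so 1 - d is the chance of jumping to a random page), ranks are normalized to sum to 1, and pages with no outgoing links spread their rank evenly. None of these choices come from the slides.

```python
def page_rank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}         # start from a uniform guess
    for _ in range(iterations):
        # Every page gets the "bored surfer" share first.
        new_ranks = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = d * ranks[page] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = d * ranks[page] / n
                for t in pages:
                    new_ranks[t] += share
        ranks = new_ranks
    return ranks


# Made-up four-page web: C is linked from A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = page_rank(links)
```

In this small graph, C ends up with the highest rank because many pages point to it, A comes next because the highly ranked C points to it, and D comes last because nothing links to it.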
Anchor Text
● <A href="http://www.yahoo.com/">Yahoo!</A>
✔ The text of a hyperlink (anchor text) is associated not only with
the page that the link is on, but also with the page the link
points to.
◦ Anchors often provide more accurate descriptions of
web pages than the pages themselves.
◦ Anchors may exist for documents which cannot be
indexed by a text-based search engine, such as images,
programs, and databases.
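A small sketch of why anchor text helps with documents that have no indexable text: words from the link text are indexed under the page the link points to. The function name, the sample URLs, and the simple punctuation stripping are assumptions for illustration.

```python
from collections import defaultdict


def index_anchor_text(anchors):
    """anchors: (source URL, target URL, anchor text) triples."""
    index = defaultdict(set)                    # word -> target URLs
    for _source, target, text in anchors:
        for word in text.lower().split():
            # Credit each anchor word to the page the link points to.
            index[word.strip("!?.,")].add(target)
    return index


anchors = [
    ("http://example.com/", "http://example.com/logo.png", "Company logo"),
    ("http://example.com/", "http://www.yahoo.com/", "Yahoo!"),
]
anchor_index = index_anchor_text(anchors)
```

The image `logo.png` contains no text of its own, yet a query for "logo" can now find it.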
Major Data Structures
✔ Big Files
◦ Virtual files that span multiple file systems and are addressable by 64-bit integers.
✔ Repository
◦ Contains the full HTML of every web page.
✔ Document Index
◦ Keeps information about each document.
✔ Lexicon
◦ Has two parts: a list of the words and a hash table of pointers.
✔ Hit Lists
◦ A list of occurrences of a particular word in a particular document, including position, font, and
capitalization information.
✔ Forward Index
◦ Stored in a number of barrels.
✔ Inverted Index
◦ Consists of the same barrels as the forward index, except that they have been processed by the sorter.
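To make the hit list entry concrete, here is a minimal sketch of one hit carrying position, font, and capitalization. The `Hit` dataclass, its field names, and the integer font scale are assumptions; the slides do not give the actual encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    position: int       # word offset within the document
    font_size: int      # relative font size (scale is an assumption)
    capitalized: bool   # whether the occurrence starts with a capital letter


def build_hit_list(tokens, word):
    """tokens: (word, font size) pairs in document order."""
    return [
        Hit(position, font_size, token[:1].isupper())
        for position, (token, font_size) in enumerate(tokens)
        if token.lower() == word.lower()
    ]


# Made-up document: the first word is in a larger font, as in a heading.
tokens = [("Search", 3), ("engines", 1), ("search", 1), ("the", 1), ("web", 1)]
hits = build_hit_list(tokens, "search")
```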
Search is NOT a Panacea
✔ Search can’t find what’s not there
◦ The content is hugely important
✔ Information Architecture is vital
✔ Usable sites have good navigation and structure
Problems
✔ High Quality Search
◦ The biggest problem facing users of web search engines
today is the quality of the results they get back.
✔ Scalable Architecture
◦ Google is designed to scale. It must be efficient in both
space and time.
The Future
“The ultimate search engine would
understand exactly what you mean
and give back exactly what you
want.”
- Larry Page
Conclusion
✔ The primary goal is to provide high quality search results
over a rapidly growing World Wide Web.
✔ Search engines employ a number of techniques to
improve search quality, including Page Rank, anchor text,
and proximity information.
✔ A search engine is a complete architecture for gathering
web pages, indexing them, and performing search
queries over them.
And that's how search works...
● THANK YOU
Presented by:
SAURABH GOEL
111438
