<ul><li>Google is the most popular search engine in the world.</li>
<li>A public company based in Mountain View, California, it provides services such as e-mail, online mapping, office productivity, social networking, video sharing, and an open-source web browser.</li></ul>
Google's main aim:<br />&ldquo;To organize the world's information and make it universally accessible and useful&rdquo;<br />
How Google Got the Name &ldquo;Google&rdquo;<br />
<ul><li>The original name for the search engine was BackRub.</li>
<li>Later, Sergey and Larry decided to name the company after the number called a &ldquo;googol&rdquo;, which is the digit 1 followed by 100 zeroes (10<sup>100</sup>).</li>
<li>The name &ldquo;Google&rdquo; itself was derived from a misspelling of &ldquo;googol&rdquo;, chosen to signify that the search engine aims to provide vast quantities of information to people.</li></ul>
Anatomy (Architecture) of Google<br />
High-level architecture of Google<br />
<ul><li>URL: A Uniform Resource Locator or Universal Resource Locator (URL) is a character string that specifies where a known resource is available on the Internet and the mechanism for retrieving it.</li>
<li>DNS: Short for Domain Name System, an Internet service that translates domain names into IP addresses.</li>
<li>For example, the domain name www.example.com might translate to 220.127.116.11.</li></ul>
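The translation step described above can be exercised directly with Python's standard library. This is a minimal sketch; the address returned for a real hostname depends on your network and may differ from the example value on the slide.

```python
import socket

def resolve(hostname):
    """Translate a domain name into an IPv4 address string via a DNS lookup."""
    return socket.gethostbyname(hostname)

# "localhost" resolves locally, so this works without Internet access.
print(resolve("localhost"))
```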
<ul><li>docID: The document index keeps information about each document.</li>
<li>Each entry stores the current document status, a pointer into the repository, a document checksum, various statistics, and a pointer to a variable-width file containing the crawled page's URL.</li></ul>
<ul><li>Parsing: to parse is to divide large components into smaller components that can be analyzed.</li>
<li>Parser: a program that dissects source code so that it can be translated into object code.</li></ul>
Components of the architecture:<br />
1) Crawler<br /><ul><li>In Google, web crawling (downloading of web pages) is done by several distributed crawlers.</li>
<li>It is a fragile application, implemented in Python.</li></ul>
<ul><li>It involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system.</li>
<li>Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve web pages at a fast enough pace.</li>
<li>At peak speed, the system can crawl over 100 web pages per second using four crawlers, amounting to roughly 600K per second of data.</li></ul>
Function:<br /><ul><li>Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document.</li>
<li>Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to the host, sending the request, and receiving the response.</li>
<li>It uses asynchronous I/O to manage events, and a number of queues to move page fetches from state to state.</li></ul>
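The per-crawler DNS cache described above can be sketched as follows. The class and method names are illustrative assumptions, not Google's actual code; a real crawler would combine this cache with asynchronous I/O across hundreds of open connections.

```python
# Hypothetical sketch of a crawler's private DNS cache: resolve each
# hostname at most once, then reuse the cached address for later fetches.

class Crawler:
    def __init__(self):
        self.dns_cache = {}  # hostname -> IP address, filled on first lookup

    def lookup(self, hostname, resolver):
        """Return the cached IP for hostname, querying `resolver` only on a miss."""
        if hostname not in self.dns_cache:
            self.dns_cache[hostname] = resolver(hostname)
        return self.dns_cache[hostname]

# Usage with a stubbed resolver (192.0.2.1 is a documentation-only address):
crawler = Crawler()
first = crawler.lookup("example.com", resolver=lambda h: "192.0.2.1")
second = crawler.lookup("example.com", resolver=lambda h: None)  # cache hit
print(first, second)  # 192.0.2.1 192.0.2.1
```

On a cache hit the resolver is never called, which is exactly why the slides note that a crawler avoids a DNS lookup before every document.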
<ul><li>Due to the huge amount of data involved, a crawler can crash or behave unexpectedly.</li>
<li>Systems that access large parts of the Internet need to be designed to be very robust and tested carefully.</li>
<li>Since large, complex systems such as crawlers invariably cause problems, significant resources must be devoted to reading the email and solving these problems as they come up.</li></ul>
2) URL server<br /><ul><li>The URLserver sends lists of URLs (Uniform Resource Locators) to be fetched to the crawlers.</li></ul>
3) Storeserver<br /><ul><li>The web pages that are fetched are then sent to the storeserver.</li>
<li>The storeserver then compresses and stores the web pages in a repository.</li></ul>
4) Repository<br /><ul><li>The repository contains the full HTML of every web page.</li>
<li>The choice of compression technique is a tradeoff between speed and compression ratio.</li>
<li>Each page is compressed using zlib, which achieves roughly 3-to-1 compression on the repository.</li>
<li>In the repository, the documents are stored one after the other, each prefixed by its docID, length, and URL, as can be seen in the figure.</li></ul>
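A toy version of this entry layout can be written with Python's `zlib` and `struct` modules. The exact on-disk byte format here (field widths, ordering) is an assumption for illustration; only the field list — docID, length, URL — comes from the slides.

```python
import struct
import zlib

HEADER = "<QII"  # docID (8 bytes), compressed length, URL length

def pack_entry(doc_id, url, html):
    """Serialize one repository entry: header + URL + zlib-compressed page."""
    compressed = zlib.compress(html.encode())
    url_bytes = url.encode()
    header = struct.pack(HEADER, doc_id, len(compressed), len(url_bytes))
    return header + url_bytes + compressed

def unpack_entry(blob):
    """Reverse pack_entry, recovering (doc_id, url, html)."""
    doc_id, clen, ulen = struct.unpack_from(HEADER, blob)
    offset = struct.calcsize(HEADER)
    url = blob[offset:offset + ulen].decode()
    html = zlib.decompress(blob[offset + ulen:offset + ulen + clen]).decode()
    return doc_id, url, html

entry = pack_entry(7, "http://example.com/", "<html>hello</html>")
print(unpack_entry(entry))  # (7, 'http://example.com/', '<html>hello</html>')
```

Because each entry carries its own lengths, entries can be stored one after the other and read back sequentially, as the slide describes.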
5) Indexing<br /><ul><li>Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.</li>
<li>The indexing function is performed by the indexer and the sorter.</li>
<li>The indexer performs a number of functions: it reads the repository, uncompresses the documents, and parses them.</li>
<li>It passes through three stages, as follows:</li></ul>
Sorting:<br /><ul><li>Each document is converted into a set of word occurrences called hits.</li>
<li>The hits record the word, its position in the document, an approximation of the font size, and capitalization.</li></ul>
<ul><li>The indexer distributes these hits into a set of &ldquo;barrels&rdquo;, creating a partially sorted forward index.</li>
<li>The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file.</li>
<li>This file contains enough information to determine where each link points from and to, and the text of the link.</li></ul>
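Extracting the raw material for such an anchors file — the target URL and the link text — can be sketched with Python's standard-library HTML parser. This is only an analogy to what the indexer does; the class name and output format are assumptions.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []    # (href, anchor text) pairs found so far
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorExtractor()
p.feed('<p>See <a href="/about">our about page</a>.</p>')
print(p.links)  # [('/about', 'our about page')]
```

A real anchors file would additionally record the source docID, so that each entry says where the link points from as well as to.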
<ul><li>The URLresolver reads the anchors file and converts relative URLs into absolute URLs, and in turn into docIDs.</li>
<li>It puts the anchor text into the forward index, associated with the docID that the anchor points to.</li></ul>
Part 3:<br />
<ul><li>It also generates a database of links, which are pairs of docIDs.</li>
<li>The links database is used to compute PageRanks for all the documents.</li>
<li>The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index.</li>
<li>This is done in place so that little temporary space is needed for the operation.</li></ul>
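The sorter's job — turning docID-ordered barrels into a wordID-keyed inverted index — can be shown with a toy in-memory version. The dictionary-based representation here is a simplification of the on-disk barrels, and the in-place aspect of the real sorter is not reproduced.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn a forward index (docID -> list of wordIDs) into an
    inverted index (wordID -> list of docIDs)."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):       # barrels are sorted by docID...
        for word_id in forward_index[doc_id]:
            inverted[word_id].append(doc_id)   # ...so each doclist stays sorted
    return dict(inverted)

forward = {1: [10, 20], 2: [20, 30], 3: [10]}
print(invert(forward))  # {10: [1, 3], 20: [1, 2], 30: [2]}
```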
<ul><li>The sorter also produces a list of wordIDs and offsets into the inverted index.</li>
<li>A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.</li>
<li>The searcher is run by a web server and uses the lexicon built by DumpLexicon, together with the inverted index and the PageRanks, to answer queries.</li></ul>
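How the searcher might combine these three structures can be sketched as follows. All the data and the ranking function here are made up for illustration; real Google ranking combines many more signals than PageRank alone.

```python
# Toy lexicon, inverted index, and PageRank table (all values invented).
lexicon = {"web": 1, "search": 2}
inverted_index = {1: [10, 20, 30], 2: [20, 30]}   # wordID -> sorted doclist
page_rank = {10: 0.1, 20: 0.6, 30: 0.3}

def search(words, k=10):
    """Answer a conjunctive query: every result must contain all the words."""
    word_ids = [lexicon[w] for w in words if w in lexicon]   # words -> wordIDs
    doclists = [set(inverted_index[w]) for w in word_ids]    # fetch doclists
    matches = set.intersection(*doclists)                    # docs with all terms
    ranked = sorted(matches, key=lambda d: page_rank[d], reverse=True)
    return ranked[:k]                                        # top k by rank

print(search(["web", "search"]))  # [20, 30]
```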
<ul><li>Step 2: Convert words into wordIDs.</li>
<li>Step 3: Seek to the start of the doclist in the short barrel for every word.</li>
<li>Step 4: Scan through the doclists until there is a document that matches all the search terms.</li>
<li>Step 5: Compute the rank of that document for the query.</li>
<li>Step 6: If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.</li>
<li>Step 7: If we are not at the end of any doclist, go to step 4.</li>
<li>Step 8: Sort the documents that have matched by rank and return the top k.</li></ul>
PageRank Algorithm<br />
<ul><li>PageRank assigns a numerical weighting to each element of a hyperlinked set of documents (such as the WWW).</li>
<li>Its purpose is to &ldquo;measure&rdquo; each element's relative importance within the set.</li>
<li>Pages that Google believes are important receive a higher PageRank and are more likely to appear at the top of the search results.</li>
<li>PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.</li>
<li>Web pages are modeled as nodes in a graph.</li></ul>
Example:<br /><ul><li>Assume a small universe of four web pages: A, B, C, and D.</li>
<li>The initial approximation of PageRank is divided evenly between these four documents.</li>
<li>Hence, each document begins with an estimated PageRank of 0.25.</li>
<li>If pages B, C, and D each link only to A, they each confer 0.25 of PageRank to A, for a total of 0.75.</li>
<li>Suppose instead that page B links to pages C and A, while page D links to pages A, B, and C: page B then gives a vote worth 0.125 to page A and a vote worth 0.125 to page C.</li></ul>
Thus,<br /><ul><li>The PageRank value for a page u depends on the PageRank values of each page v in the set B<sub>u</sub> (the set of all pages linking to page u), divided by the number L(v) of links from page v: PR(u) = &sum;<sub>v &isin; B<sub>u</sub></sub> PR(v) / L(v).</li></ul>
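This rule — each page v splits its current PageRank evenly among its L(v) outgoing links — can be implemented as a small iteration. This sketch omits the damping factor used in the full PageRank algorithm, so a single iteration reproduces exactly the four-page example above, where B, C, and D each link only to A.

```python
def pagerank(links, iterations=1):
    """Basic PageRank iteration (no damping factor), where `links`
    maps each page to the list of pages it links to."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}   # initial estimate: 0.25 each
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                new[u] += pr[v] / len(outlinks)  # v's vote, split evenly
        pr = new
    return pr

# The slides' example: B, C, and D each link only to A.
links = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
print(pagerank(links))  # {'A': 0.75, 'B': 0.0, 'C': 0.0, 'D': 0.0}
```

After one iteration A has collected 0.25 from each of B, C, and D, matching the 0.75 figure in the example.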