Google History and Architecture: Presentation Transcript
By:- Name: Divyangee Jain En no: 090410107015 Class: TY CE-A Batch: A 1
What is GOOGLE?
GOOGLE is the most popular search engine in the world.
A public company based in Mountain View, California, it provides services such as e-mail, online mapping, office productivity, social networking, video sharing and an open source web browser.
How Google got its name
Anatomy of Google
How Google searches a query
PageRank algorithm
History of Google
The company was founded by Larry Page and Sergey Brin, often dubbed the "Google Guys", while they were attending Stanford University as PhD candidates.
It was first incorporated as a privately held company on September 4, 1998, and its initial public offering followed on August 19, 2004.
Main AIM of GOOGLE: “To organize the world's information and make it universally accessible and useful”
How Google Got the Name Google?
The original name for the search engine was BackRub.
Later, Sergey and Larry decided to name the company after the number called a "googol", which is the number 1 followed by 100 zeros (10^100).
The name 'Google' itself was derived from a misspelling of 'googol', which was picked to signify that the search engine wants to provide large quantities of information for people.
Anatomy (Architecture) of Google
High-level architecture of GOOGLE
A few terms to be acquainted with:
A Uniform Resource Locator or Universal Resource Locator (URL) is a character string that specifies where a known resource is available on the Internet and the mechanism for retrieving it.
DNS: Short for Domain Name System, an Internet service that translates domain names into IP addresses.
For example, the domain name www.example.com might translate to 184.108.40.206.
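As a small illustration of the two definitions above, the sketch below splits a URL into the retrieval mechanism and the location of the resource using Python's standard library. The function name describe_url and the sample URL are assumptions made for this example only.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the retrieval mechanism and the resource location."""
    parts = urlsplit(url)
    return {
        "mechanism": parts.scheme,   # how to retrieve the resource, e.g. http
        "host": parts.hostname,      # the domain name that DNS would translate
        "path": parts.path or "/",   # where the resource lives on that host
    }

info = describe_url("http://www.example.com/docs/index.html")
print(info)
```

The host part is the piece a crawler would hand to DNS to obtain an IP address before connecting.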
The document index keeps information about each document.
The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. Each entry also points into a variable-width file that contains the crawled page's URL.
Parse: to divide large components into small components that can be analyzed.
PARSER: A program that dissects source code so that it can be translated into object code.
Components of architecture:
Part 1:
In Google, the web crawling (downloading of web pages) is done by several distributed crawlers.
Crawling is a fragile application; the crawlers are implemented in Python.
It involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace.
At peak speeds, the system can crawl over 100 web pages per second using four crawlers, which amounts to roughly 600K per second of data.
Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document.
Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response.
It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
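The ideas above (many open connections, a per-crawler DNS cache, asynchronous IO and a queue of pending fetches) can be sketched with Python's asyncio. This is only an illustration, not the original crawler: fake_dns_lookup and fake_fetch are hypothetical stand-ins that do no network access, and only the figure of 300 connections comes from the text.

```python
import asyncio
from urllib.parse import urlsplit

async def fake_dns_lookup(host):
    """Hypothetical stand-in for a DNS lookup; no network is used."""
    await asyncio.sleep(0)
    return "192.0.2.1"  # address from the documentation-only range

async def fake_fetch(ip, url):
    """Hypothetical stand-in for connecting, sending a request and reading the response."""
    await asyncio.sleep(0)
    return "<html>page at " + url + "</html>"

async def crawl(urls, connections=300):
    todo = asyncio.Queue()          # URLs waiting to be fetched
    for url in urls:
        todo.put_nowait(url)
    dns_cache = {}                  # host -> task that resolves to its IP
    pages = {}                      # url -> downloaded page
    states = {}                     # url -> states the fetch passed through

    async def worker():
        while True:
            try:
                url = todo.get_nowait()
            except asyncio.QueueEmpty:
                return
            host = urlsplit(url).hostname
            states[url] = []
            if host not in dns_cache:
                # Only the first fetch for a host performs a lookup.
                states[url].append("looking up DNS")
                dns_cache[host] = asyncio.create_task(fake_dns_lookup(host))
            ip = await dns_cache[host]
            states[url].append("sending request")
            pages[url] = await fake_fetch(ip, url)
            states[url].append("received response")

    # Each worker plays the role of one open connection.
    await asyncio.gather(*(worker() for _ in range(connections)))
    return pages, dns_cache, states

urls = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
pages, dns_cache, states = asyncio.run(crawl(urls))
print(len(pages), "pages fetched with", len(dns_cache), "DNS lookups")
```

Storing the lookup task in the cache, rather than the finished address, means two fetches for the same host that start at the same moment still share a single lookup.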
Due to the huge amount of data involved, a crawler can crash or behave unexpectedly.
Systems which access large parts of the Internet need to be designed to be very robust and carefully tested.
Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
2) URL server
The URL server sends lists of URLs (Uniform Resource Locators) to be fetched to the crawlers.
The web pages that are fetched are then sent to the storeserver.
The storeserver then compresses and stores the web pages into a repository.
The repository contains the full HTML of every web page.
The choice of compression technique is a tradeoff between speed and compression ratio.
Each page is compressed using zlib.
On the repository, zlib achieves a compression ratio of about 3 to 1.
In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure.
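The repository layout described above can be sketched as follows. The exact byte layout is an assumption for illustration (a fixed header holding docID, URL length and compressed length, followed by the URL and the zlib-compressed page); the names pack_record and read_records are hypothetical.

```python
import struct
import zlib

# Assumed header layout: docID, URL length, compressed page length.
HEADER = struct.Struct("<III")

def pack_record(doc_id, url, html):
    """Compress one page and prefix it with docID, lengths and URL."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body

def read_records(blob):
    """Walk the repository, yielding (docID, URL, HTML) one record after the other."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html

repository = b"".join([
    pack_record(1, "http://www.example.com/", "<html><body>hello hello hello</body></html>"),
    pack_record(2, "http://www.example.com/a", "<html><body>second page</body></html>"),
])
records = list(read_records(repository))
print(records[0][:2])
```

Because every record carries its own lengths, the documents can be read back in sequence without any other data structure.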
Part 2:
Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.
The indexing function is performed by the indexer and the sorter.
The indexer performs a number of functions.
It reads the repository, uncompresses the documents, and parses them.
It passes through three stages as follows:
Indexing Documents into Barrels
Each document is converted into a set of word occurrences called hits.
The hits record the word, position in document, an approximation of font size, and capitalization.
The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.
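A minimal sketch of this step is given below. It assumes that each barrel holds a fixed range of wordIDs and that a hit records only position and capitalization (font size is left out); the names WORDS_PER_BARREL, word_id and index_document are hypothetical.

```python
from collections import defaultdict

WORDS_PER_BARREL = 3   # assumed: each barrel covers a range of wordIDs
lexicon = {}           # word -> wordID, assigned in order of first appearance

def word_id(word):
    return lexicon.setdefault(word, len(lexicon))

def index_document(doc_id, text, barrels):
    """Turn a document into hits and distribute them into barrels."""
    for position, token in enumerate(text.split()):
        wid = word_id(token.lower())
        hit = (position, token[0].isupper())   # position and capitalization
        barrels[wid // WORDS_PER_BARREL][doc_id][wid].append(hit)

# barrel number -> docID -> wordID -> list of hits (a forward index)
barrels = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
index_document(1, "Google indexes the web", barrels)
index_document(2, "the web is large", barrels)

for number in sorted(barrels):
    print(number, {doc: dict(words) for doc, words in barrels[number].items()})
```

Inside each barrel the entries are grouped by docID, which is why the result is called a forward index: it maps documents to the words they contain.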
The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file.
This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs.
It puts the anchor text into the forward index, associated with the docID that the anchor points to.
Part 3:
It also generates a database of links which are pairs of docIDs.
The links database is used to compute PageRanks for all the documents.
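The work of the URLresolver can be sketched as below, assuming each anchors-file entry is a tuple of source page URL, link as written, and link text. The docID numbering and the names doc_id_for and resolve_anchors are assumptions for this example.

```python
from urllib.parse import urljoin

doc_ids = {}   # absolute URL -> docID

def doc_id_for(url):
    """Return the docID of a URL, assigning the next free number to a new URL."""
    return doc_ids.setdefault(url, len(doc_ids) + 1)

def resolve_anchors(anchors):
    """anchors: iterable of (source page URL, link as written, anchor text)."""
    links = []         # pairs of docIDs: (from, to)
    anchor_text = {}   # docID pointed to -> anchor texts
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)   # relative -> absolute
        src, dst = doc_id_for(source_url), doc_id_for(target_url)
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text

anchors = [
    ("http://www.example.com/a/index.html", "../b.html", "page B"),
    ("http://www.example.com/b.html", "a/index.html", "back to A"),
]
links, anchor_text = resolve_anchors(anchors)
print(links)
print(anchor_text)
```

The list of docID pairs is the kind of input a PageRank computation needs, while the anchor text is filed under the page being pointed to rather than the page containing the link.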
The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index.
This is done in place so that little temporary space is needed for this operation.
The sorter also produces a list of wordIDs and offsets into the inverted index.
A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.
The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
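The sorter step described earlier can be sketched as follows. The barrel is assumed to be a list of (docID, wordID, hits) entries, and sort_barrel is a hypothetical name; the sort is done in place, and the function returns the offset at which each wordID begins.

```python
def sort_barrel(barrel):
    """Re-sort a barrel in place by wordID and return wordID -> starting offset."""
    barrel.sort(key=lambda entry: (entry[1], entry[0]))   # wordID, then docID
    offsets = {}
    for offset, (doc_id, word_id, hits) in enumerate(barrel):
        offsets.setdefault(word_id, offset)   # keep the first position only
    return offsets

# A forward-index barrel, sorted by docID: (docID, wordID, hit positions)
barrel = [
    (1, 7, [0]), (1, 3, [1, 4]), (2, 3, [2]), (2, 9, [0]), (3, 7, [5]),
]
offsets = sort_barrel(barrel)
print(barrel)
print(offsets)
```

After sorting, all entries for one word sit next to each other, so a lookup by wordID needs only the starting offset of that word.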
SEARCHING A QUERY
The goal of searching is to provide quality search results efficiently.
The Google query evaluation process:
Step 1: Parse the query.
Step 2: Convert words into wordIDs.
Step 3: Seek to the start of the doclist in the short barrel for every word.
Step 4: Scan through the doclists until there is a document that matches all the search terms.
Step 5: Compute the rank of that document for the query.
Step 6: If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
Step 7: If we are not at the end of any doclist go to step 4.
Step 8: Sort the documents that have matched by rank and return the top k.
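The steps above can be sketched in a simplified form. This sketch uses a single set of doclists, so the switch from short to full barrels in steps 6 and 7 is left out, and the ranking function (number of hits times PageRank) is an assumption made only for illustration. The name search and the sample data are hypothetical.

```python
def search(query, lexicon, inverted_index, pagerank, k=10):
    # Steps 1 and 2: parse the query and convert words into wordIDs.
    words = query.lower().split()
    if not words or any(word not in lexicon for word in words):
        return []
    word_ids = [lexicon[word] for word in words]

    # Step 3: find the doclist (docID -> hits) for every word.
    doclists = [inverted_index.get(wid, {}) for wid in word_ids]

    # Step 4: keep the documents that match all the search terms.
    matching = set(doclists[0])
    for doclist in doclists[1:]:
        matching &= set(doclist)

    # Step 5: compute a rank for each matching document (assumed formula).
    def rank(doc_id):
        hit_count = sum(len(doclist[doc_id]) for doclist in doclists)
        return hit_count * pagerank.get(doc_id, 0.0)

    # Step 8: sort the matches by rank and return the top k.
    return sorted(matching, key=rank, reverse=True)[:k]

lexicon = {"google": 0, "search": 1, "engine": 2}
inverted_index = {
    0: {1: [0], 2: [3, 8]},          # wordID -> {docID: hit positions}
    1: {1: [1], 2: [4], 3: [0]},
    2: {3: [1]},
}
pagerank = {1: 0.5, 2: 0.3, 3: 0.2}

print(search("google search", lexicon, inverted_index, pagerank))
print(search("search engine", lexicon, inverted_index, pagerank))
```

Document 1 comes first for "google search" because its higher PageRank outweighs the extra hit in document 2 under the assumed formula.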
PageRank Algorithm
a link analysis algorithm
named after Larry Page
assigns a numerical weighting to each element of a hyperlinked set of documents(such as www)
its purpose is to measure each element's relative importance within the set
Pages that GOOGLE believes are important receive a higher PageRank and are more likely to appear at the top of the search results.
PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.
Web pages as nodes
Hyperlinks as edges
Assume a small universe of four web pages: A, B, C and D.
The initial approximation of PageRank would be evenly divided between these four documents.
Hence, each document would begin with an estimated PageRank of 0.25.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A.
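A's PageRank would then be 0.75.
Suppose instead that B links to C and A.
Page B then gives a vote worth 0.125 to page A and a vote worth 0.125 to page C.
Suppose also that D links to A, B and C, so its 0.25 is split three ways and each vote is worth about 0.083.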
The PageRank value for a page u is dependent on the PageRank values for each page v out of the set Bu (this set contains all pages linking to page u), divided by the number L(v) of links from page v.
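Written out, the rule above is PR(u) = sum of PR(v) / L(v) over all pages v in Bu. The sketch below applies one update of this simplified rule to the four-page example. The function name pagerank_step is hypothetical, and in the second graph the link from C to A is an assumption, since the text does not state C's links.

```python
def pagerank_step(ranks, links):
    """One update of the simplified rule PR(u) = sum(PR(v) / L(v)) over pages v linking to u."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in links.items():
        for u in targets:
            # Page v splits its current rank evenly among its L(v) outgoing links.
            new_ranks[u] += ranks[v] / len(targets)
    return new_ranks

# Initial approximation: rank divided evenly among A, B, C and D.
start = {page: 0.25 for page in "ABCD"}

# First case: B, C and D each link only to A.
first = pagerank_step(start, {"B": ["A"], "C": ["A"], "D": ["A"]})
print(first["A"])

# Second case: B links to C and A, D links to A, B and C.
# The link from C to A is assumed for this example.
second = pagerank_step(start, {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]})
print({page: round(value, 3) for page, value in second.items()})
```

This shows a single pass only; a full computation would repeat the update until the values settle.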