Google history and architecture
    Presentation Transcript

    • By:
      Name: Divyangee Jain
      En no: 090410107015
      Class: TY CE-A
      Batch: A
      1
    • What is GOOGLE?
      2
      • Google is the most popular search engine in the world.
      • A public company based in Mountain View, California, it provides services such as e-mail, online mapping, office productivity, social networking, video sharing and an open source web browser.
      3
    • 4
      • Topics:
      • History
      • How Google got its name
      • Anatomy of Google
      • How Google processes a search query
      • PageRank algorithm
    • History of Google
      5
      • The company was founded by Larry Page and Sergey Brin, often dubbed the "Google Guys", while they were attending Stanford University as PhD candidates.
      • It was first incorporated as a privately held company on September 4, 1998, and its initial public offering followed on August 19, 2004.
      6
    • 7
    • 8
      Main aim of Google:
      “To organize the world's information and make it universally accessible and useful”
    • How Google Got Its Name
      9
      • The original name for the search engine was BackRub
      • Later, Sergey and Larry decided to name the company after the number called a "googol", which is the number 1 followed by 100 zeros (10^100).
      10
      • The name 'Google' itself was derived from a misspelling of 'googol', which was picked to signify that the search engine aims to provide large quantities of information to people.
      11
    • Anatomy (Architecture) of Google
      12
    • High-level architecture of Google
      13
    • 14
      A few terms to get acquainted with:
    • 15
      • URL:
      • A Uniform Resource Locator or Universal Resource Locator (URL) is a character string that specifies where a known resource is available on the Internet and the mechanism for retrieving it.
    • 16
      • DNS:
      • Short for Domain Name System, an Internet service that translates domain names into IP addresses.
      • For example, the domain name www.example.com might translate to 198.105.232.4.
    • 17
      • docID
      • The document index keeps information about each document.
      • The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. For a crawled page, the entry also points into a variable-width file that contains the page's URL.
    • 18
      • PARSING:
      • Parse: to divide large components into small components that can be analyzed.
      • Parser: a program that dissects source code so that it can be translated into object code.
    • 19
      Components of architecture:
    • 20
      Part 1:
    • 1) CRAWLER
      • In Google, web crawling (the downloading of web pages) is done by several distributed crawlers.
      • The crawler is a fragile application, implemented in Python.
      21
    • 22
      • It involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
      • Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace.
    • 23
      At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data.
    • 24
      Function:
      • Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document.
      • Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response.
    • 25
      • It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
      • Due to the huge amount of data involved, a crawler can crash or behave unexpectedly.
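The crawler behaviour described above (a per-crawler DNS cache, a cap on open connections, and each fetch moving through four states) can be sketched roughly as follows. This is only an illustrative simulation: it uses Python's `asyncio` in place of the hand-built event handling and queues, and the `Crawler` class, the `FAKE_DNS` table, and the example URLs are assumptions made up for the sketch, so no real network access happens.

```python
import asyncio
from urllib.parse import urlsplit

# Hypothetical address table standing in for real name servers (offline sketch).
FAKE_DNS = {"a.example": "10.0.0.1", "b.example": "10.0.0.2"}

STATES = ("looking up DNS", "connecting to host",
          "sending request", "receiving response")

class Crawler:
    def __init__(self, max_connections=300):
        self.dns_cache = {}                               # this crawler's own DNS cache
        self.slots = asyncio.Semaphore(max_connections)   # cap on open connections
        self.log = []                                     # (url, state) transitions

    async def fetch(self, url):
        host = urlsplit(url).hostname
        async with self.slots:
            for state in STATES:
                self.log.append((url, state))
                if state == "looking up DNS" and host not in self.dns_cache:
                    self.dns_cache[host] = FAKE_DNS[host]  # simulated lookup
                await asyncio.sleep(0)                     # yield to other fetches
        return url, f"<html>page at {url}</html>"          # simulated response

async def crawl(urls):
    crawler = Crawler()
    pages = await asyncio.gather(*(crawler.fetch(u) for u in urls))
    return crawler, dict(pages)

urls = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
crawler, pages = asyncio.run(crawl(urls))
```

Here the two pages on `a.example` share one cached DNS entry, and every fetch passes through all four states while the others make progress in between.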
    • 26
      • Systems which access large parts of the Internet need to be designed to be very robust and carefully tested.
      • Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up.
    • 27
      2) URL server
      • The URLserver sends lists of URLs (Uniform Resource Locators) to be fetched to the crawlers.
    • 3) Storeserver
      • The web pages that are fetched are then sent to the storeserver.
      • The storeserver then compresses and stores the web pages into a repository.
      28
    • 4) REPOSITORY
      • The repository contains the full HTML of every web page.
      • The choice of compression technique is a tradeoff between speed and compression ratio.
    • 30
      • Each page is compressed using zlib.
      • zlib achieves a compression ratio of about 3 to 1 on the repository.
      • In the repository, the documents are stored one after the other, each prefixed by its docID, length, and URL, as shown in the figure.
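A minimal sketch of such a repository, assuming a made-up record layout: a fixed header holding the docID, the compressed length, and the URL length, followed by the URL and the zlib-compressed HTML. The exact field order and sizes here are assumptions for illustration, not the real on-disk format.

```python
import struct
import zlib

# Assumed header layout: docID (4 bytes), compressed length (4 bytes), URL length (2 bytes).
HEADER = struct.Struct("<IIH")

def pack_record(doc_id, url, html):
    """Compress one page and prefix it with its docID, length, and URL."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body

def read_records(blob):
    """Walk the repository, yielding (docID, URL, HTML) for each stored page."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html

page_a = "<html>" + "hello world " * 50 + "</html>"
page_b = "<html>" + "search engine " * 50 + "</html>"

# Documents are stored one after the other in a single byte string.
repository = pack_record(1, "http://a.example/", page_a) + \
             pack_record(2, "http://b.example/", page_b)
records = list(read_records(repository))
```

Because each record carries its own lengths, the repository can be read back sequentially without any other index.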
    • 31
    • 32
      Part 2:
    • 33
      5) INDEXING
      • Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.
      • The indexing function is performed by the indexer and the sorter.
    • 34
      • The indexer performs a number of functions.
      • It reads the repository, uncompresses the documents, and parses them.
    • 35
      • It passes through three stages as follows:
      • Parsing
      • Indexing Documents into Barrels
      • Sorting
      • Each document is converted into a set of word occurrences called hits.
      • The hits record the word, position in document, an approximation of font size, and capitalization.
      36
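As a rough illustration of turning a document into hits, the sketch below records the word, its position, a font size, and capitalization for each token. The `Hit` tuple, the `extract_hits` function, and the single per-document font size are simplifying assumptions, not the actual compact hit encoding.

```python
import re
from collections import namedtuple

# Hypothetical hit record with the four fields named on the slide.
Hit = namedtuple("Hit", "word position font_size capitalized")

def extract_hits(text, font_size=3):
    """Convert plain text into a list of hits, one per word occurrence."""
    hits = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        token = match.group()
        hits.append(Hit(token.lower(), position, font_size, token[0].isupper()))
    return hits

hits = extract_hits("Google indexes the Web")
```

Each hit keeps enough information to rank a match later, for example by favouring capitalized words or words set in a larger font.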
    • 37
      • The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.
      • The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file.
      • This file contains enough information to determine where each link points from and to, and the text of the link.
      38
      • The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs.
      • It puts the anchor text into the forward index, associated with the docID that the anchor points to.
      39
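The URLresolver step can be sketched with the standard library's `urljoin`. The tuple layout of the anchors, the `resolve_anchors` function, and the way docIDs are handed out (a simple counter in a dictionary) are assumptions for illustration only.

```python
from urllib.parse import urljoin

def resolve_anchors(anchors, doc_ids):
    """anchors: iterable of (source_url, href, anchor_text) tuples."""
    anchor_text = {}   # docID of the link target -> anchor texts pointing at it
    links = []         # (source docID, target docID) pairs
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)           # relative -> absolute URL
        source_id = doc_ids.setdefault(source_url, len(doc_ids) + 1)
        target_id = doc_ids.setdefault(target_url, len(doc_ids) + 1)
        anchor_text.setdefault(target_id, []).append(text)
        links.append((source_id, target_id))
    return anchor_text, links

anchors = [
    ("http://a.example/dir/page.html", "../about.html", "About us"),
    ("http://a.example/index.html", "/about.html", "Company info"),
]
doc_ids = {}
anchor_text, links = resolve_anchors(anchors, doc_ids)
```

Both relative links resolve to the same absolute URL, so both anchor texts end up associated with the docID of the page they point to.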
      Part 3:
    • 40
      • The URLresolver also generates a database of links, which are pairs of docIDs.
      • The links database is used to compute PageRanks for all the documents.
      • The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index.
      • This is done in place so that little temporary space is needed for this operation.
      41
      • The sorter also produces a list of wordIDs and offsets into the inverted index.
      • A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.
      42
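The sorter's job can be illustrated with a tiny barrel held in memory. The `(docID, wordID, position)` tuple layout and the `sort_barrel` function are assumptions for this sketch; the in-place `list.sort` stands in for the real on-disk sort.

```python
# A toy forward barrel: (docID, wordID, position), ordered by docID.
barrel = [(1, 7, 0), (1, 3, 1), (2, 3, 0), (2, 9, 1), (3, 7, 4)]

def sort_barrel(barrel):
    """Re-sort the barrel by wordID in place and return wordID -> offset."""
    barrel.sort(key=lambda hit: (hit[1], hit[0], hit[2]))   # no second copy is made
    offsets = {}
    for index, (_doc_id, word_id, _position) in enumerate(barrel):
        offsets.setdefault(word_id, index)                  # first entry for each word
    return offsets

offsets = sort_barrel(barrel)
```

After the sort, all entries for one wordID sit next to each other, and `offsets` tells a searcher where each word's entries begin.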
      • The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
      43
    • 44
      SEARCHING A QUERY
    • 45
      The goal of searching is to provide quality search results efficiently.
      • The Google query evaluation process:
      • Step 1: Parse the query.
      • Step 2: Convert words into wordIDs.
    • 46
      • Step 3: Seek to the start of the doclist in the short barrel for every word.
      • Step 4: Scan through the doclists until there is a document that matches all the search terms.
      • Step 5: Compute the rank of that document for the query.
    • 47
      • Step 6: If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
      • Step 7: If we are not at the end of any doclist, go to step 4.
      • Step 8: Sort the documents that have matched by rank and return the top k.
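The evaluation steps above can be sketched as follows. This is a simplification under stated assumptions: doclists are plain Python lists of docIDs, a set intersection replaces the step-by-step scan, and the `search` function, the `rank` callback, and the toy data are all made up for illustration.

```python
def search(query, lexicon, short_barrels, full_barrels, rank, k=10):
    words = query.lower().split()                        # Step 1: parse the query
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]               # Step 2: words -> wordIDs
    scored = {}
    for barrels in (short_barrels, full_barrels):        # Steps 3 and 6: short, then full
        doclists = [set(barrels.get(w, ())) for w in word_ids]
        for doc_id in set.intersection(*doclists):       # Step 4: docs matching all terms
            if doc_id not in scored:
                scored[doc_id] = rank(doc_id, word_ids)  # Step 5: rank the document
    # Step 8: sort the matches by rank and return the top k.
    return sorted(scored, key=scored.get, reverse=True)[:k]

lexicon = {"google": 1, "search": 2}
short_barrels = {1: [10, 20], 2: [20]}
full_barrels = {1: [10, 20, 30], 2: [10, 20, 30]}
pagerank = {10: 0.2, 20: 0.5, 30: 0.3}

results = search("Google search", lexicon, short_barrels, full_barrels,
                 rank=lambda doc_id, word_ids: pagerank[doc_id], k=2)
```

In this toy setup the rank is just a stored PageRank value; a fuller ranking function would also weigh the hits for each query word.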
    • 48
      PageRank Algorithm
    • 49
      • PageRank
      • a link analysis algorithm
      • named after Larry Page
      • assigns a numerical weighting to each element of a hyperlinked set of documents (such as the World Wide Web)
      • its purpose is to measure each element's relative importance within the set.
    • 50
      • Pages that Google believes are important receive a higher PageRank and are more likely to appear at the top of the search results.
      • A link to a page counts as a vote for it. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.
    • 51
      Webgraph:
      • Web pages as nodes
      • Hyperlinks as edges
    • 52
      Example:
      • Assume a small universe of four web pages: A, B, C and D.
      • The initial approximation of PageRank would be evenly divided between these four documents.
    • 53
      • Hence, each document would begin with an estimated PageRank of 0.25.
      • If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A.
      • A's PageRank would then be 0.25 + 0.25 + 0.25 = 0.75.
    • 54
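      • Now suppose B links to C and A.
      • Page B then gives a vote worth 0.125 to page A and a vote worth 0.125 to page C.
      • D links to A, B and C, so each of its three votes is worth 0.25 / 3, or about 0.083.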
    • 55
      Thus,
      • The PageRank value for a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
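Written as a formula, this rule is PR(u) = sum over v in B_u of PR(v) / L(v). The sketch below applies one such update to the four-page example. It is a simplified illustration: the `pagerank_step` function is hypothetical, there is no damping factor, and pages without outgoing links simply pass nothing on.

```python
def pagerank_step(ranks, links):
    """One update of PR(u) = sum of PR(v) / L(v) over pages v linking to u."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in links.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)   # v splits its rank over L(v) links
    return new_ranks

ranks = {page: 0.25 for page in "ABCD"}

# First case: B, C and D each link only to A.
simple = pagerank_step(ranks, {"B": ["A"], "C": ["A"], "D": ["A"]})

# Second case: B links to C and A, C links to A, D links to A, B and C.
mixed = pagerank_step(ranks, {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]})
```

In the first case A collects 0.75. In the second, A receives 0.125 from B, 0.25 from C, and about 0.083 from D, for a total of roughly 0.458.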
    • 56