THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL
    WEB SEARCH ENGINE


 ASIM, UNIVERSITY OF PESHAWAR.



 Authors: Sergey Brin, Lawrence Page
ABSTRACT
 Google: a prototype of a large-scale search engine
 Anatomy of the system
 Web users: tens of millions of queries

 Academic research

 Building a large-scale search engine
 Heavy use of hypertextual information
  (anchor text, hyperlinks)
INTRODUCTION

   Web (as a dynamic entity)

   Irrelevant Search Results

   Human maintained Indices, Table of Contents

   Too many low-quality results

   Addresses many of the users' problems (PageRank)
CONT…

Google: Scaling with the Web
    Fast crawling technology

    Efficient use of the available storage space

    Indexing system that processes hundreds of gigabytes
     of data

    Minimized query response time
DESIGN GOALS

   Improve search quality.

   A complete index alone does not guarantee relevant search results.

   Keep the share of junk results as low as possible.

   Users mostly look at the top-ranked results.

   The aim is to return the most relevant results first.

   Google makes use of link structure and anchor text.
CONT…

   Support academic search engine research.

   Make the desired results accessible and available to users.

   Support novel research.

   Offer all the problem-solving tools in a single place.
SYSTEM FEATURES

   The Google search engine has two important features.

   It uses the link structure of the web to compute PageRank.

   It uses link text (anchor text) to improve search
    results.
       <A href="http://www.yahoo.com/">Yahoo!</A>
    The text of a hyperlink (anchor text) is associated
not only with the page that the link is on but also
with the page the link points to.
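
The association of anchor text with the target page can be illustrated with a short Python sketch. It is not the code of the original system: the class `AnchorCollector`, the function `index_anchor_text`, and the page URL `http://example.org/links.html` are hypothetical names chosen for the example, and the parsing relies on Python's standard `html.parser` module.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":      # HTMLParser reports tag names in lowercase
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Map each link target to the anchor texts of the links pointing at it."""
    anchor_index = defaultdict(list)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        for target_url, text in collector.anchors:
            anchor_index[target_url].append(text)
    return dict(anchor_index)


# Hypothetical page that contains the link shown on the slide.
pages = {"http://example.org/links.html": '<A href="http://www.yahoo.com/">Yahoo!</A>'}
anchor_index = index_anchor_text(pages)
print(anchor_index)
```

The text "Yahoo!" ends up indexed under the target URL, so a query for it can match the target page even if that page never contains the word itself.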
PAGE RANK
 PageRank: bringing order to the web
 Ideas from the academic citation literature are applied to
  calculate PageRank

   PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

   In the equation, t1 ... tn are the pages linking to page A,
    C(t) is the number of outbound links on page t,
    and d is a damping factor, usually set to 0.85.
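
The formula can be evaluated by simple iteration: start every page at the same value and reapply the equation until the values settle. The Python sketch below does this for a tiny graph. The function name `pagerank`, the fixed iteration count, and the three-page graph are assumptions made for illustration only.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(t)/C(t)) over all pages t linking to A.

    links maps each page to the pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_links = {p: set(links.get(p, ())) for p in pages}
    pr = {p: 1.0 for p in pages}  # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page passes on its rank divided by its out-link count.
            incoming = sum(
                pr[src] / len(out_links[src])
                for src in pages
                if page in out_links[src]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this graph C collects links from both A and B and therefore ends up with the highest value, while B, which is linked only from A, ends up with the lowest.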
PAGE RANK (INTUITIVE JUSTIFICATION)
 A page ranks highly when many pages point to it
 A page also ranks highly when a page with a high PageRank
  points to it
 Sites with a high PageRank are unlikely to carry broken links
 The text of a link describes its target, and Google
  uses that information
       It gives more accurate results for images, graphs, and
        databases
SYSTEM ANATOMY
   URL Server:
        provides lists of URLs for the crawlers to fetch
        from the web
 Distributed Crawlers (download web pages)
 Store Server:
     Compresses pages and stores them in the repository
     docIDs are used to distinguish web pages
   Indexer
     Reads the repository, uncompresses and parses the documents
     Hits
         record word occurrences, positions, and text format information in
          documents
         Hits are distributed into barrels, which creates a partially sorted
          forward index
FORWARD INDEX




Document        Words
Document 1      the,cow,says,moo
Document 2      the,cat,and,the,hat
Document 3      the,dish,ran,away,with,the,spoon
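
A forward index like the one in the table can be sketched in Python as a mapping from each document to its words. In the sketch below every word is also given a numeric wordID through a small lexicon, and each word keeps the positions at which it occurs. The function name `build_forward_index` and the first-seen numbering of wordIDs are assumptions for illustration.

```python
def build_forward_index(documents):
    """Return (lexicon, forward index) for a {docID: text} mapping.

    lexicon maps word -> wordID; the forward index maps
    docID -> {wordID: [positions of that word in the document]}.
    """
    lexicon = {}
    forward = {}
    for doc_id, text in documents.items():
        hits = {}
        for position, word in enumerate(text.split()):
            # New words receive the next free wordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.setdefault(word_id, []).append(position)
        forward[doc_id] = hits
    return lexicon, forward


documents = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
lexicon, forward = build_forward_index(documents)
print(lexicon["the"], forward[2])
```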
INVERTED INDEX
                        T0 = "it is what it is"
                        T1 = "what is it"
                        T2 = "it is a banana"
   A term search for the terms "what", "is" and "it" gives the set {0, 1}, because only
    documents 0 and 1 contain all three terms.


   A phrase search for "what is it" gets hits for all the words in both documents 0
    and 1, but the terms occur consecutively only in document 1.

              Inverted Index (document, position)   Word
              {(2, 2)}                              a
              {(2, 3)}                              banana
              {(0, 1), (0, 4), (1, 1), (2, 1)}      is
              {(0, 0), (0, 3), (1, 2), (2, 0)}      it
              {(0, 2), (1, 0)}                      what
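
The table and both kinds of search can be reproduced with a few lines of Python. This is a minimal sketch over the three example texts; the function names `build_inverted_index`, `term_search`, and `phrase_search` are chosen for the example.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each word to its list of (document number, position) pairs."""
    index = defaultdict(list)
    for doc, text in enumerate(texts):
        for position, word in enumerate(text.split()):
            index[word].append((doc, position))
    return dict(index)


def term_search(index, terms):
    """Documents that contain every term, in any order."""
    doc_sets = [{doc for doc, _ in index.get(term, [])} for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase_search(index, phrase):
    """Documents in which the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return set()
    postings = [set(index.get(word, [])) for word in words]
    return {
        doc
        for doc, pos in postings[0]
        if all((doc, pos + i) in postings[i] for i in range(1, len(words)))
    }


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)
print(index["is"])
print(term_search(index, ["what", "is", "it"]))
print(phrase_search(index, "what is it"))
```

The phrase search needs the positions stored in the index: a document qualifies only if each following word appears exactly one position after the previous one.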
CONT…
   Indexer:
       While parsing, writes anchor files that hold the link
        information (in and out links)
   URL resolver:
     Reads the anchor files, converts relative URLs to absolute URLs
      and in turn into docIDs
     Puts anchor text into the forward index
     Builds the database of links needed to compute PageRanks
   Sorter:
      Takes the barrels, which are sorted by docID, and
      re-sorts them by wordID to generate the inverted index.
     It produces a list of wordIDs and offsets into the inverted
      index.
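
The URL resolver step can be sketched in Python with the standard `urllib.parse.urljoin` function, which turns a relative link into an absolute URL. The class `UrlResolver`, its attribute names, and the sequential numbering of docIDs are assumptions for illustration, not the original implementation.

```python
from urllib.parse import urljoin


class UrlResolver:
    """Turns (page URL, links) records into docIDs and a links database."""

    def __init__(self):
        self.doc_ids = {}      # absolute URL -> docID
        self.links = []        # (source docID, target docID) pairs
        self.anchor_text = {}  # target docID -> anchor texts pointing at it

    def doc_id_for(self, url):
        # Find the docID of a known URL or create the next free one.
        return self.doc_ids.setdefault(url, len(self.doc_ids))

    def resolve(self, page_url, anchors):
        source = self.doc_id_for(page_url)
        for href, text in anchors:
            target = self.doc_id_for(urljoin(page_url, href))
            self.links.append((source, target))
            self.anchor_text.setdefault(target, []).append(text)


resolver = UrlResolver()
resolver.resolve(
    "http://example.org/a/index.html",  # hypothetical page
    [("../b/page.html", "another page"), ("http://www.yahoo.com/", "Yahoo!")],
)
print(resolver.doc_ids)
print(resolver.links)
```

The `links` list is the kind of input a PageRank computation needs: pairs of docIDs saying which document links to which.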
CONT…
   DumpLexicon
       The DumpLexicon program takes the sorter's list of wordIDs and offsets
        together with the lexicon produced by the indexer and generates a new
        lexicon to be used by the searcher.


   Searcher:
       The searcher is run by a web server and uses the
        lexicon built by DumpLexicon together with the inverted
        index and the PageRanks to answer queries.
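
How the searcher combines the three inputs can be sketched as follows. The sketch is deliberately simplified: it orders the matching documents by PageRank alone, which is an assumption for illustration, and the function name `search` and all sample data are hypothetical.

```python
def search(query, lexicon, inverted_index, pageranks):
    """Return the docIDs that contain every query word, best PageRank first.

    lexicon maps word -> wordID; inverted_index maps wordID -> docIDs.
    """
    words = query.lower().split()
    if not words:
        return []
    doc_sets = []
    for word in words:
        word_id = lexicon.get(word)
        # A word missing from the lexicon matches no document.
        doc_sets.append(set(inverted_index.get(word_id, ())))
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda doc: pageranks.get(doc, 0.0), reverse=True)


# Hypothetical sample data.
lexicon = {"web": 0, "search": 1, "engine": 2}
inverted_index = {0: [1, 2, 3], 1: [2, 3], 2: [3]}
pageranks = {1: 0.4, 2: 1.1, 3: 0.9}

print(search("web search", lexicon, inverted_index, pageranks))
print(search("web search engine", lexicon, inverted_index, pageranks))
```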
CONT…
   Major data structures
       Data is stored in BigFiles, virtual files that
        support compression.
       About half of the storage is used by the raw HTML repository.
       It holds the compressed HTML of every page together with a small
        header.
       The document index keeps information about each document.
       It is an ISAM (Index Sequential Access Mode) index
        ordered by docID.
       Each stored entry includes the current document
        status, a pointer into the repository, a document checksum,
        and URL and title information.
       The lexicon is a memory-based hash table with
        values attached to each word.
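
The relationship between the repository and the document index can be sketched in Python. The sketch assumes Python's `zlib` module for compression and CRC-32 as the document checksum; the class `Repository` and the layout of its entries are hypothetical and only mirror the fields listed above (pointer into the repository, checksum, URL).

```python
import zlib


class Repository:
    """Stores compressed pages back to back, plus a document index by docID."""

    def __init__(self):
        self._data = bytearray()  # all compressed pages, one after another
        self.doc_index = {}       # docID -> entry describing the stored page

    def add(self, doc_id, url, html):
        raw = html.encode("utf-8")
        packed = zlib.compress(raw)
        self.doc_index[doc_id] = {
            "offset": len(self._data),    # pointer into the repository
            "length": len(packed),
            "checksum": zlib.crc32(raw),  # checksum of the uncompressed page
            "url": url,
        }
        self._data += packed

    def get(self, doc_id):
        entry = self.doc_index[doc_id]
        start = entry["offset"]
        packed = bytes(self._data[start:start + entry["length"]])
        raw = zlib.decompress(packed)
        if zlib.crc32(raw) != entry["checksum"]:
            raise ValueError("stored page is corrupted")
        return raw.decode("utf-8")


repo = Repository()
repo.add(7, "http://example.org/", "<html><body>hello web</body></html>")
print(repo.doc_index[7]["url"], repo.get(7))
```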
CONT…
   Hit list encoding
       Uses a compact, hand-optimized encoding
       It requires less space than a simple encoding and less bit
        manipulation than Huffman coding.
       It uses two bytes for every hit.
       To save space, the length of a hit list is combined with
        the wordID in the forward index and the docID in the
        inverted index.
       The forward index is partitioned into 64 barrels.
       Each barrel holds a range of wordIDs
       If a document contains words that fall into a particular barrel, its docID
        is recorded in that barrel, followed by the list of wordIDs with
        the hit lists that correspond to those words
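
A two-byte hit can be sketched in Python with bit packing. The slide does not give the bit layout; the sketch assumes the layout that the Brin and Page paper describes for plain hits (1 bit for capitalization, 3 bits for font size, 12 bits for the word position). The function names, the big-endian byte order, and the clamping of large positions to 4095 are assumptions for illustration.

```python
import struct


def encode_hit(capitalized, font_size, position):
    """Pack one hit into two bytes: 1 bit caps, 3 bits font size, 12 bits position."""
    if not 0 <= font_size <= 7:
        raise ValueError("font size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must not be negative")
    position = min(position, 0xFFF)  # assumed: clamp to the largest 12-bit value
    value = (int(capitalized) << 15) | (font_size << 12) | position
    return struct.pack(">H", value)  # one unsigned 16-bit integer


def decode_hit(data):
    """Unpack two bytes into (capitalized, font_size, position)."""
    (value,) = struct.unpack(">H", data)
    return bool(value >> 15), (value >> 12) & 0b111, value & 0xFFF


hit = encode_hit(True, 3, 42)
print(len(hit), decode_hit(hit))
```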
CONT…
 The inverted index consists of the same barrels as
  the forward index, after they have been processed by
  the sorter
 For every valid wordID, the lexicon holds a pointer into the
  barrel that the wordID falls into.
 The pointer points to a list of docIDs with their hit lists; this list is
  called a doclist
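
The work of the sorter on one barrel can be sketched in Python: entries that arrive grouped by docID are regrouped by wordID, so that each wordID leads to its doclist. The function name `invert_barrel` and the in-memory list and dictionary layout are assumptions for illustration.

```python
def invert_barrel(forward_barrel):
    """Regroup a forward barrel by wordID.

    forward_barrel is a list of (docID, [(wordID, hit list), ...]) entries
    sorted by docID. The result maps each wordID to its doclist, a list of
    (docID, hit list) pairs, with the wordIDs in ascending order.
    """
    inverted = {}
    for doc_id, word_entries in forward_barrel:
        for word_id, hit_list in word_entries:
            inverted.setdefault(word_id, []).append((doc_id, hit_list))
    return dict(sorted(inverted.items()))


# Hypothetical forward barrel; hit lists are shown as word positions.
forward_barrel = [
    (1, [(0, [0]), (1, [1])]),
    (2, [(0, [0, 3]), (4, [1])]),
]
inverted_barrel = invert_barrel(forward_barrel)
print(inverted_barrel)
```

Because the input is already sorted by docID, every doclist comes out ordered by docID without an extra sorting pass.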
CRAWLING

   Web Crawling (downloading pages)

   Crawlers (3 to 4)

   Each crawler keeps about three hundred connections open
    at once

   Social issues

   Efficiency
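
Keeping many connections open at once while still capping their number can be sketched with Python's `asyncio`. The sketch does not touch the network: `fake_fetch` is a stand-in for a real download, and the function names, the example URLs, and the small limit of 3 used in the demonstration are all assumptions for illustration.

```python
import asyncio


async def crawl(urls, fetch, max_connections=300):
    """Fetch all URLs concurrently with at most max_connections in flight."""
    limit = asyncio.Semaphore(max_connections)

    async def fetch_one(url):
        async with limit:  # wait here while the limit is reached
            return url, await fetch(url)

    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)


stats = {"active": 0, "peak": 0}  # tracks how many fetches run at once


async def fake_fetch(url):
    """Stand-in for a real page download; makes no network request."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # pretend to wait for the server
    stats["active"] -= 1
    return f"<html>{url}</html>"


urls = [f"http://example.org/page{i}" for i in range(10)]
pages = asyncio.run(crawl(urls, fake_fetch, max_connections=3))
print(len(pages), stats["peak"])
```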
ARCHITECTURE OF THE GOOGLE SEARCH ENGINE
DESCRIPTION OF THE PICTORIAL COMPONENTS
Components     Description

Crawlers       Several distributed crawlers download the web pages.

URL Server     Provides the crawlers with a list of URLs to fetch. The crawlers send
               the collected data to a store server.

Store Server   Compresses the pages and places them in the repository. Each
               page is stored with an identifier, a docID.

Repository     Contains a copy of the pages and images, allowing comparisons and
               caching.

Indexer        Decompresses documents and converts them into sets of word
               occurrences called "hits". It distributes the hits among a set of "barrels",
               which yields a partially sorted index. It also extracts the links on each page.
               A hit contains the following information: the word, its position in the
               document, font size, capitalization.

Barrels        These "barrels" are databases that hold the hits ordered by docID.
               They are created by the indexer and used by the sorter.
Anchors        The anchors file created by the indexer contains the links
               and the text associated with each link.
CONT…
Components     Description
URL Resolver   Takes the contents of the anchors file, converts relative URLs into
               absolute addresses, and finds or creates a docID for each.
               It builds an index of documents and a database of links.

Doc Index      Contains the information kept about each document (URL).

Links          The database of links associates each link with a docID (and so with a
               real document on the web).

PageRank       The software uses the database of links to compute the PageRank of each
               page.

Sorter         Works on the barrels: it takes the documents ordered by docID and
               creates an inverted list sorted by wordID.

Lexicon        A program called DumpLexicon takes the list provided by the sorter
               (ordered by wordID) together with the lexicon created by the
               indexer (the set of keywords found in the pages), and produces a new
               lexicon for the searcher.

Searcher       Runs on a web server in a datacenter, uses the lexicon built by
               DumpLexicon in combination with the index ordered by wordID,
               takes the PageRank into account, and produces a results page.
RESULTS, PROBLEMS & CONCLUSION
   The most important issue is the quality of search results
   Google's results compare well with those of other
    commercial engines
   Need for relevant and exact query results
   Up-to-date information processing
   Performing search queries
   Crawling technologies
   Google employs a number of techniques to improve
    search quality, including PageRank, anchor text, and
    proximity information.
   “The ultimate search engine would understand exactly
    what you mean and give back exactly what you want.”
    by Larry Page
   “The absolute search engine’s query generation would be based
    on information, not based on the repository records and query
    results will be real timed, and it will change the whole internet
    and web architecture.” by asim
Thanks!

Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 

Anatomy of Google

  • 1. THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE. Asim, University of Peshawar. Authors: Sergey Brin, Lawrence Page
  • 7. ABSTRACT
      - Google search engine as a prototype
      - Anatomy
      - Web users: queries (tens of millions)
      - Academic research
      - Building a large-scale search engine
      - Heavy use of hypertextual information (anchor text, hyperlinks)
  • 8. INTRODUCTION
      - The Web as a dynamic entity
      - Irrelevant search results
      - Human-maintained indices and tables of contents
      - Too many low-quality results
      - Addresses many user problems (page ranking)
  • 9. CONT… Google: Scaling with the Web
      - Google's fast crawling technology
      - Storage space availability
      - Indexing system processing hundreds of gigabytes of data
      - Minimized query response time
  • 10. DESIGN GOALS
      - Improved search quality.
      - Indexing alone does not provide relevant search results.
      - Keep the percentage of junk results as low as possible.
      - Users show interest in the top-ranked results.
      - The aim is to provide relevant results.
      - Google makes use of link structure and anchor text.
  • 11. CONT…
      - Academic search engine results.
      - User accessibility and availability of the desired results.
      - Supports novel research.
      - All problem-solving solutions provided in a single place.
  • 12. SYSTEM FEATURES
      - The Google search engine has two important features:
      - Link structure of the web (page ranking).
      - Use of links (anchor text) to improve search results.
      - <A href="http://www.yahoo.com/">Yahoo!</A>
      - The text of a hyperlink (anchor text) is associated not only with the page that the link is on, but also with the page the link points to.
  • 13. PAGE RANK
      - PageRank: bringing order to the web
      - Ideas from academic citation literature are applied to calculate PageRank
      - PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
      - In the equation, t1 ... tn are the pages linking to page A, C(t) is the number of outbound links on page t, and d is a damping factor, usually set to 0.85.
  • 14. PAGE RANK (INTUITIVE JUSTIFICATION)
      - A page ranks highly when many pages point to it
      - A page also ranks highly when a page with high PageRank points to it
      - Sites with high PageRank are unlikely to link to broken pages
      - The text of a link describes the page it points to; Google uses this information
      - This gives more accurate results for images, graphs, and databases
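The PageRank formula on slide 13 can be evaluated by simple iteration. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the fixed iteration count, and the three-page toy web are assumptions made for illustration; only the formula and d = 0.85 come from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}  # arbitrary starting value
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum PR(t) / C(t) over every page t that links to `page`.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical three-page web: A and B link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy web, C is pointed to by two pages and A by the highly ranked C, so both end up well above B, which has no incoming links and keeps only the base value 1 - d.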
  • 17. SYSTEM ANATOMY
      - URL Server: provides lists of URLs to the crawlers for fetching information from the web
      - Distributed crawlers: download web pages
      - Store Server: compresses pages and stores them in the repository
      - docIDs are used to distinguish web pages
      - Indexer: indexing, sorting, uncompressing, parsing
      - Hits: record word occurrences, position, and text format information in documents
      - Hits are organized into barrels, which creates a partially sorted forward index
  • 18. FORWARD INDEX
      Document      Words
      Document 1    the, cow, says, moo
      Document 2    the, cat, and, the, hat
      Document 3    the, dish, ran, away, with, the, spoon
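As a small illustration of the forward index table on slide 18, the sketch below maps each document to the list of words it contains. The names `documents` and `build_forward_index` and the whitespace tokenization are assumptions for this example; the real system stores wordIDs and hit lists in barrels rather than plain strings.

```python
# The three example documents from the slide, keyed by a hypothetical docID.
documents = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}


def build_forward_index(docs):
    """Map each docID to the ordered list of words in that document."""
    return {doc_id: text.split() for doc_id, text in docs.items()}


forward_index = build_forward_index(documents)

for doc_id, words in forward_index.items():
    print(f"Document {doc_id}: {','.join(words)}")
```

A forward index answers "which words are in this document?"; answering "which documents contain this word?" requires the inverted form.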
  • 19. INVERTED INDEX
      - T0 = "it is what it is"
      - T1 = "what is it"
      - T2 = "it is a banana"
      - A term search for the terms "what", "is" and "it" would give the set {0, 1}.
      - If we run a phrase search for "what is it", we get hits for all the words in both documents 0 and 1, but the terms occur consecutively only in document 1.
      Words     Inverted index (document, position)
      a         {(2, 2)}
      banana    {(2, 3)}
      is        {(0, 1), (0, 4), (1, 1), (2, 1)}
      it        {(0, 0), (0, 3), (1, 2), (2, 0)}
      what      {(0, 2), (1, 0)}
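The inverted index example on slide 19 can be reproduced with a short sketch. The function names `build_inverted_index`, `term_search` and `phrase_search` are hypothetical, and the linear membership checks are chosen for readability rather than speed.

```python
from collections import defaultdict

# The three texts T0, T1, T2 from the slide; the list position is the document number.
texts = ["it is what it is", "what is it", "it is a banana"]


def build_inverted_index(docs):
    """Map each word to a list of (document, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return index


def term_search(index, terms):
    """Return the documents that contain every term, in any order."""
    doc_sets = [{doc_id for doc_id, _ in index.get(term, [])} for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase_search(index, phrase):
    """Return the documents in which the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return set()
    matches = set()
    for doc_id, start in index.get(words[0], []):
        # Each following word must appear at the next position in the same document.
        if all((doc_id, start + k) in index.get(word, []) for k, word in enumerate(words)):
            matches.add(doc_id)
    return matches


index = build_inverted_index(texts)
print(term_search(index, ["what", "is", "it"]))
print(phrase_search(index, "what is it"))
```

Storing positions alongside document numbers is what makes the phrase search possible; a plain word-to-document mapping could only answer the term search.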
  • 20. CONT…
      - Indexer: parsing also produces anchor files, which hold link information (in-links and out-links)
      - URL resolver: reads the anchor files, converts relative URLs to absolute URLs, and in turn into docIDs
      - URL resolver: puts the anchor text into the forward index
      - URL resolver: generates the database of links, which is necessary to compute PageRanks
      - Sorter: takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index
      - Sorter: also produces a list of wordIDs and offsets into the inverted index
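The URL resolver step on slide 20 can be sketched as follows. This is an illustrative assumption, not the original code: the class `UrlResolver`, its attributes, the sequential docID scheme, and the example URLs are all hypothetical. Only `urllib.parse.urljoin` is a real standard-library call.

```python
from urllib.parse import urljoin


class UrlResolver:
    """Toy resolver: relative URL -> absolute URL -> docID, plus a links database."""

    def __init__(self):
        self.doc_ids = {}      # absolute URL -> docID (assumed sequential integers)
        self.links = []        # (source docID, target docID) pairs for PageRank
        self.anchor_text = {}  # target docID -> anchor texts pointing at it

    def doc_id_for(self, url):
        # Find the existing docID or create the next one.
        return self.doc_ids.setdefault(url, len(self.doc_ids))

    def resolve(self, page_url, anchors):
        """`anchors` is a list of (href, anchor text) pairs found on `page_url`."""
        source = self.doc_id_for(page_url)
        for href, text in anchors:
            target = self.doc_id_for(urljoin(page_url, href))
            self.links.append((source, target))
            # Anchor text is associated with the page the link points to.
            self.anchor_text.setdefault(target, []).append(text)


resolver = UrlResolver()
resolver.resolve(
    "http://example.com/a/index.html",
    [("../b.html", "B page"), ("http://www.yahoo.com/", "Yahoo!")],
)
```

The `links` list is the kind of structure a PageRank computation would consume, while `anchor_text` reflects the idea that link text describes the target page.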
  • 21. CONT…
      - DumpLexicon: a program that takes the sorter's list of wordIDs and offsets together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.
      - Searcher: run by a web server; it uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
  • 22. CONT… Major data structures
      - Data is stored in BigFiles, which are virtual files that support compression.
      - About half of the storage is used by the raw HTML repository.
      - The repository holds the compressed HTML of every page along with a small header.
      - The document index keeps information about each document.
      - The ISAM (index sequential access mode) index is ordered by docID.
      - Each stored entry includes the current status, a pointer into the repository, a document checksum, and URL and title information.
      - They are all memory-based hash tables with varying values attached to each word.
  • 23. CONT… Hit list encoding
      - Uses a hand-optimized compact encoding.
      - It requires less space than a simple encoding and less bit manipulation than Huffman coding.
      - It uses two bytes for every hit.
      - To save space, the length of a hit list is combined with the wordID in the forward index and with the docID in the inverted index.
      - The forward index is stored in a number of barrels (64).
      - Each barrel holds a range of wordIDs.
      - If a document contains words that fall into a particular barrel, the docID is recorded in that barrel, followed by a list of wordIDs with the hit lists that correspond to those words.
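To make the two-byte hit concrete, here is a hedged sketch of packing one plain hit into 16 bits. The field widths (1 capitalization bit, 3 font-size bits, 12 position bits, with font size 7 reserved for other hit types) follow the Brin and Page paper rather than the slides; the bit order, the clamping of large positions, and the function names are assumptions.

```python
POSITION_BITS = 12
MAX_POSITION = (1 << POSITION_BITS) - 1  # 4095, the largest 12-bit value


def pack_plain_hit(capitalized, font_size, position):
    """Pack one plain hit into a 16-bit integer (assumed layout: cap | font | position)."""
    if not 0 <= font_size < 7:
        raise ValueError("font size must be 0..6; 7 is reserved for other hit types")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)  # assumption: clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << POSITION_BITS) | position


def unpack_plain_hit(hit):
    """Inverse of pack_plain_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> POSITION_BITS) & 0b111, hit & MAX_POSITION


hit = pack_plain_hit(True, 3, 42)
encoded = hit.to_bytes(2, "big")  # exactly two bytes per hit
print(encoded.hex(), unpack_plain_hit(hit))
```

Packing fields with shifts and masks keeps each hit at a fixed two bytes, which is why hit lists can be stored compactly next to a wordID or docID.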
  • 24. CONT…
      - The inverted index consists of the same barrels as the forward index, after they have been processed by the sorter.
      - A pointer is used to locate each wordID in the barrels.
      - The pointer points to a list of docIDs together with their hit lists; this list is called a doclist.
  • 25. CRAWLING
      - Web crawling (downloading pages)
      - Crawlers (3 to 4)
      - Each crawler keeps about three hundred connections open
      - Social issues
      - Efficiency
  • 26. ARCHITECTURE OF THE GOOGLE SEARCH ENGINE
  • 27. DESCRIPTION OF THE PICTORIAL COMPONENTS
      Component: Description
      - Crawlers: There are several distributed crawlers; they parse the pages and extract links and keywords.
      - URL Server: Provides the crawlers with a list of URLs to scan. The crawlers send the collected data to a store server.
      - Store Server: Compresses the pages and places them in the repository. Each page is stored with an identifier, a docID.
      - Repository: Contains a copy of the pages and images, allowing comparisons and caching.
      - Indexer: Decompresses documents and converts them into sets of words called "hits". It distributes the hits among a set of "barrels", which provides a partially sorted index. It also creates a list of the URLs on each page. A hit contains the following information: the word, its position in the document, font size, and capitalization.
      - Barrels: These "barrels" are databases that classify documents by docID. They are created by the indexer and used by the sorter.
      - Anchors: The bank of anchors created by the indexer contains internal links and the text associated with each link.
  • 28. CONT…
      Component: Description
      - URL Resolver: Takes the contents of anchors, converts relative URLs into absolute addresses, and finds or creates a docID. It builds an index of documents and a database of links.
      - Doc Index: Contains the text relative to each URL.
      - Links: The database of links associates each one with a docID (and so with a real document on the Web).
      - PageRank: The software uses the database of links to define the PageRank of each page.
      - Sorter: Interacts with the barrels. It takes documents classified by docID and creates an inverted list sorted by wordID.
      - Lexicon: A program called DumpLexicon takes the list provided by the sorter (classified by wordID) together with the lexicon created by the indexer (the sets of keywords in each page) and produces a new lexicon for the searcher.
      - Searcher: Runs on a web server in a datacenter, uses the lexicon built by DumpLexicon in combination with the index classified by wordID, takes the PageRank into account, and produces a results page.
  • 29. RESULTS, PROBLEMS & CONCLUSION
      - The most important issue is the quality of search results
      - Google performs better than other commercial engines
      - Need for relevant and exact query results
      - Up-to-date information processing
      - Performing search queries
      - Crawling technologies
      - Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information.
      - “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” by Larry Page
  • 30.
      - “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” by Larry Page
      - “The absolute search engine’s query generation would be based on information, not based on the repository records and query results will be real timed, and it will change the whole internet and web architecture.” by Asim