Review of "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
KOSURU SAI MALLESWAR; SC09B093; SEM-6.

Sergey Brin and Lawrence Page designed 'Google' to build a search engine that could crawl and index the web quickly and efficiently, and deal effectively with huge, uncontrolled hypertext collections. One of the main goals was to improve the quality and scalability of search. Another was to set up a system that could support novel research activities on large-scale web data, and that a reasonable number of people could actually use for their academic research.

Google makes efficient use of storage space to store the index, which allows the quality of search to scale with the size of the web as it grows. Its data structures are optimized for fast, efficient access. To achieve high precision, Google uses the link structure of the web to compute a quality ranking for each page, called PageRank. The PageRank of a page is the probability that a 'random surfer' visits it. The ranking also involves a damping factor: the probability that, at each page, the random surfer gets bored and requests another random page. PageRank allows for personalization and makes it nearly impossible to deliberately mislead the system into assigning a higher ranking. The text of a link is associated both with the page the link is on and with the page the link points to. This idea of anchor-text propagation yields better-quality search, but using it efficiently was a challenge because of the heavy data processing it requires. Along with PageRank, Google keeps track of location information for all hits and some visual-presentation details, and stores the full raw HTML of pages in the repository.

Most of Google's architecture is implemented in C or C++ for efficiency and can run on either Solaris or Linux. Google's data structures include BigFiles, a document index, a lexicon, forward and inverted indexes, and a huge repository.
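The random-surfer model described above can be sketched as a short power iteration. This is a minimal illustration over a made-up three-page link graph, not the paper's implementation; the ranks are normalized here so they sum to one, and d = 0.85 is the damping value the paper suggests.

```python
# Minimal PageRank power iteration (illustrative; the link graph is invented).
# d is the damping factor: with probability 1-d the surfer jumps to a
# random page instead of following a link from the current page.

def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = pr[p] / len(outs)     # rank is split over out-links
                for q in outs:
                    new[q] += d * share
            else:                             # dangling page: spread rank evenly
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr

# Hypothetical three-page web: A and C link to B, B links back to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["B"]})
```

In this toy graph B collects rank from both A and C, so it ends up ranked highest, while C, which nothing links to, keeps only the baseline (1-d)/n share.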
Google's data structures are also optimized for cost by avoiding disk seeks whenever possible. Google has a fast distributed crawling system in which the URL server and the crawlers are implemented in Python. Each crawler maintains its own DNS cache to reduce the number of DNS lookups, and uses asynchronous IO and a number of queues. Indexing involves parsing, indexing documents into barrels using multiple indexers running in parallel, and sorting. Google's ranking system is designed so that no single factor can have too much influence. The dot product of the vector of count-weights with the vector of type-weights is used to compute an IR score for the document; finally, the IR score is combined with PageRank to give the document its final rank. For multi-word searches, Google uses a more complex algorithm. Google also considers feedback from trusted users when updating the ranks of web pages.

Google can produce better results than the major commercial search engines for most searches. It has evolved to overcome a number of bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO during various operations. Thanks to efficient crawling and indexing, information can be kept up to date and major changes can be tested relatively quickly. Google does not yet have optimizations such as query caching or sub-indices on common terms. The inventors intended to speed Google up considerably through distribution and through hardware, software, and algorithmic improvements. They wished to make Google a high-quality search tool for searchers and researchers all around the world, sparking the next generation of search-engine technology.
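The single-word ranking scheme summarized above, a dot product of count-weights and type-weights combined with PageRank, can be sketched as follows. The specific weights, the count cap, and the combination function here are all illustrative assumptions; the paper describes the mechanism but does not publish its actual values.

```python
# Illustrative single-word IR score: hits of each type (title, anchor,
# URL, plain text) are counted, counts are capped so that many hits of
# one type stop helping (the count-weights), and the dot product with
# per-type weights gives the IR score. All numbers here are made up.

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0, "plain": 1.0}
COUNT_CAP = 8  # beyond this, extra hits of a type no longer add score

def count_weight(count):
    return min(count, COUNT_CAP)

def ir_score(hit_counts):
    """hit_counts: dict mapping hit type -> number of hits of that type."""
    return sum(count_weight(c) * TYPE_WEIGHTS[t]
               for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    # One simple way to combine the IR score with PageRank; the paper
    # only states that the two are combined, not how.
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank

doc = {"title": 1, "anchor": 3, "plain": 20}   # hypothetical document
score = final_rank(doc, pagerank=4.2)
```

Note how the cap realizes the design goal the review mentions: the 20 plain-text hits contribute only as much as 8 would, so no single factor dominates the score.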