The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Group Members
• Laksri Wijerathna
• Harini Sirisena
• Himali Erangika
• Erica Jayasundara
What is this paper about..?
• Presents Google, a prototype of a large-scale
search engine.
• The paper was written by Sergey Brin and
Lawrence Page, then PhD students at
Stanford University, USA.
Outline
• Problem
• Solution
• Results
• Conclusion
• Future work
• Comparison with today
• Q&A
Problem Description..
Problems Addressed
• Addresses the question “How to build a practical
large-scale system which can exploit the
additional information present in hypertext”.
• Also addresses “How to effectively deal with
uncontrolled hypertext collections where
anyone can publish anything they want”.
Problems with Automated Search
Engines
• Automated search engines that rely on keyword
matching usually return too many low quality
matches.
• Some advertisers attempt to gain people’s
attention by taking measures meant to mislead
automated search engines.
Web Search Engine Scaling up
(1994-2000)
• Web search engines have to scale to keep up with
the growth of the web.
• In 1994, the World Wide Web Worm (WWWW), one of
the first web search engines, had an index of
110,000 web pages and web-accessible documents.
• In 1994, WWWW handled 1,500 queries per day.
• In 1997, AltaVista handled roughly 20 million
queries per day.
Goal of Proposed System
• To address many of the problems, both in
quality and scalability
Challenges for Google in Scaling with
the Web
• Fast crawling technology is needed to gather the
web documents and keep them up to date.
• Storage space must be used efficiently to store
indices and documents.
• Indexing system must process hundreds of
gigabytes of data efficiently.
• Queries must be handled quickly, at a rate of
hundreds to thousands per second.
Problems for Google in Scaling with
the Web
• Tasks are becoming increasingly difficult as the
Web grows.
• Bottlenecks such as disk seek time and
operating system robustness cannot be avoided.
Design Goals Background
• In 1994 some people believed that a complete
search index would make it possible to find
anything easily.
• But the index is not the only factor in the quality of
search results; “junk results” can wash out any
results that a user is interested in.
Design Goals- Main Cause
• The main cause of this problem is that the number of
documents in the indices has been increasing by
many orders of magnitude, but the user’s ability
to look at documents has not: people still only
look at the first few tens of results.
• Therefore, as the collection size grows, tools with
very high precision are needed, so that the
relevant documents appear in the top ten
results.
Design Goals
• To push more development and understanding
into the academic realm
• To build a system that reasonable numbers of
people can actually use.
• To build an architecture that can support novel
research activities on large-scale web data.
• To set up a Spacelab-like environment where
researchers or even students can propose and do
interesting experiments.
Solution described..
Google System Features
• Make use of link structure of the Web to
calculate a quality ranking for each web page. –
PageRank
• Intuitive Justification
• Utilizes links to improve the search results –
Anchor Text.
PageRank
• Link analysis algorithm, named after Larry Page.
• Assigns a numerical weight to each element of a
hyperlinked set of documents.
• Gives a higher rank to pages that are linked to
from other pages (popularity-based).
How the PageRank works
• To approximate a page’s importance or quality,
we can count the citations or backlinks
to a given page.
• PageRank extends this idea by not counting
links from all pages equally, but by normalizing
by the number of links on a page.
PageRank Definition
• PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• Assume page A has pages T1..Tn which point to A
• d – damping factor, which can be set between 0 and 1; usually
set to 0.85
• C(T1) – number of links going out of page T1
• PageRank forms a probability distribution over web
pages, so the sum of all web pages’ PageRanks will be one.
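The formula above can be computed iteratively; the sketch below uses a made-up three-page link graph. Note that with the formula exactly as written, the ranks sum to the number of pages; dividing the (1-d) term by the page count gives the variant that sums to one.

```python
# Iterative PageRank sketch following the slide's formula:
#   PR(A) = (1 - d) + d * sum(PR(T)/C(T) for each page T linking to A)
# The link graph below is a toy example, not real data.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            backlinks = [q for q in pages if p in links[q]]
            new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in backlinks)
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# "C" ends up with the highest rank: it is pointed to by both A and B.
```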
PageRank Definition cont
• PageRank is the probability that a random
surfer visits a page.
(“random surfer” – a person who is given a web
page at random and keeps clicking on links, never
hitting “back”, but who eventually gets bored and starts on
another random page)
• d – the damping factor is the probability, at each
page, that the random surfer will get bored and
request another random page.
Intuitive Justification
A page’s PageRank can be high if:
• There are many pages that point to it, or
• There are a few pages that point to it which
themselves have a high PageRank.
• Intuitively, pages that are well cited from many places
around the web are worth looking at, as are pages
with perhaps only one citation
from somewhere important like Yahoo!
Anchor Text
• Is the visible, clickable text in a hyperlink.
• Anchor text is treated in a special way in
Google.
• Advantages:
◦ Often provides a more accurate description of a web page
than the page itself.
◦ Anchors may exist for documents which cannot
be indexed by a text-based search engine, such as
images, programs and databases. This makes it
possible to return web pages which have not
actually been crawled.
Google - other features
• Has location information for all hits, and so
makes extensive use of proximity in search.
• Keeps track of some visual presentation details,
such as font size – words in a larger or
bolder font are weighted higher than others.
• Full raw HTML of pages is available in a
repository.
Google Architecture
Major Data Structures
• BigFiles
• Repository
• Document Index
• Lexicon
• Hit Lists
• Forward Index
• Inverted Index
BigFiles
• Virtual files spanning multiple file systems.
• Addressable by 64 bit integers.
• Allocation among multiple files handled
automatically.
• Handles allocation and deallocation of file
descriptors.
• Also supports rudimentary compression options.
Repository
• Contains the full HTML of every web page.
• Each page is compressed using zlib.
• The choice of compression technique is a tradeoff
between speed and compression ratio.
• The documents are stored one after the other,
each prefixed by docID, length, and URL.
• The repository requires no other data
structures in order to access it.
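The record layout above might be sketched as follows. The exact field widths and byte order are illustrative assumptions, not the paper's actual on-disk format.

```python
# Sketch of a repository record: zlib-compressed HTML prefixed by
# docID, URL length, and payload length (field widths are assumptions).
import struct
import zlib

HEADER = "<QII"  # 64-bit docID, 32-bit URL length, 32-bit payload length

def pack_record(docid, url, html):
    payload = zlib.compress(html.encode())
    header = struct.pack(HEADER, docid, len(url), len(payload))
    return header + url.encode() + payload

def unpack_record(buf):
    docid, url_len, payload_len = struct.unpack_from(HEADER, buf)
    off = struct.calcsize(HEADER)
    url = buf[off:off + url_len].decode()
    html = zlib.decompress(buf[off + url_len:off + url_len + payload_len]).decode()
    return docid, url, html

rec = pack_record(7, "http://example.com/", "<html>hello</html>")
assert unpack_record(rec) == (7, "http://example.com/", "<html>hello</html>")
```

Because each record carries its own lengths, the repository can be scanned sequentially without consulting any other data structure, as the slide notes.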
Document Index
• Keeps information about each document
• Information Includes
▫ Current document status
▫ Pointer into the repository
▫ Document checksum
▫ Various Statistics
• Is a fixed-width ISAM (Index Sequential Access
Mode) index, ordered by docID.
• Each crawled document’s entry contains a pointer into a
variable-width file called “docinfo” which
contains the URL and title.
Lexicon
• The current implementation keeps the lexicon in
memory on a machine with 256 MB of main
memory; at 14 million words, the lexicon fits
in memory on a machine of reasonable price.
• Lexicon Implementation has two parts
1. A list of the words
2. Hash table of pointers
Hit Lists
• A list of occurrences of each word in a particular
document
▫ Position
▫ Font
▫ Capitalization
• The hit list accounts for most of the space used in
both indices
• Uses a special compact encoding algorithm
▫ Requiring only 2 bytes for each hit
• The hit lists are very important in calculating the
Rank of a page
• There are two different types of hits:
Hit Lists cont
• Plain Hits:
▫ Capitalization bit
▫ Font size (relative to the rest of the page) -> 3 bits
▫ Word position in document -> 12 bits
• Fancy Hits (found in URL, title, anchor text, or meta tag)
▫ Capitalization bit
▫ Font size – set to 7 to indicate a Fancy Hit -> 3 bits
▫ Type of fancy hit -> 4 bits
▫ Position -> 8 bits
• If the type of fancy hit is an anchor, the position is split:
▫ 4 bits for position in anchor
▫ 4 bits for a hash of the docID the anchor occurs in
• The length of the hit list is stored before the hits
themselves
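The 2-byte plain-hit encoding above can be sketched with simple bit packing. The bit order chosen here (capitalization high bit, then font, then position) is an assumption; the slide only gives the field widths.

```python
# Pack a plain hit into 16 bits: 1 capitalization bit, 3 font-size bits,
# 12 position bits. Bit ordering is illustrative, not the paper's.

def pack_plain_hit(cap, font, pos):
    assert 0 <= font < 8 and 0 <= pos < 4096
    return (cap & 1) << 15 | font << 12 | pos

def unpack_plain_hit(hit):
    return hit >> 15 & 1, hit >> 12 & 0x7, hit & 0xFFF

h = pack_plain_hit(1, 3, 1000)
assert h < 1 << 16                       # fits in 2 bytes
assert unpack_plain_hit(h) == (1, 3, 1000)
```

This kind of compact, hand-tuned encoding is why the hit lists dominate index space yet remain affordable.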
Forward Index
• Stored in 64 barrels, each covering a range of
wordIDs. For each document containing words in
a barrel’s range, the barrel stores the docID
followed by a list of wordIDs with their
corresponding hit lists.
• Actual wordIDs are not stored in the barrels;
instead, the difference between the wordID and
the minimum of the barrel is stored. This requires
only 24 bits for each wordID, leaving 8 bits to
hold the hit-list length.
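The delta encoding above can be sketched like this; packing the 24-bit delta into the high bits and the 8-bit hit count into the low bits of one 32-bit word is an illustrative assumption.

```python
# Forward-barrel entry sketch: wordID stored as a 24-bit offset from the
# barrel's minimum wordID, sharing a 32-bit word with an 8-bit hit count.

def pack_entry(wordid, barrel_min, hit_count):
    delta = wordid - barrel_min
    assert 0 <= delta < 1 << 24 and 0 <= hit_count < 1 << 8
    return delta << 8 | hit_count

def unpack_entry(entry, barrel_min):
    return (entry >> 8) + barrel_min, entry & 0xFF

e = pack_entry(5_000_123, 5_000_000, 17)
assert unpack_entry(e, 5_000_000) == (5_000_123, 17)
```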
Inverted Index
• Contains the same barrels as the forward index,
except that they have been sorted by docID.
• Every word in the lexicon points to a doclist
containing all docIDs with their corresponding
hit lists.
• The barrels are duplicated
for speed in single-word searches.
Crawling the web
• 1 URL server.
• 3 crawler machines.
• Both the URL server and the crawlers are
implemented in Python.
• Each crawler keeps up to 300 connections open
at once.
• At peak, the system crawled over 100 pages (about
600 KB) per second, using 4 crawlers.
Crawler actions
1. Looking up DNS
2. Connecting to a host
3. Sending request
4. Receiving response
• Note – each crawler maintains a DNS cache to
speedup DNS lookup process
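The per-crawler DNS cache noted above can be a simple memoized lookup. The resolver function is injected here so the sketch stays self-contained; a real crawler would wrap something like `socket.gethostbyname`.

```python
# Per-crawler DNS cache sketch: memoize hostname -> address lookups so
# repeated requests to the same host skip the DNS round trip.

def make_cached_resolver(resolve):
    cache = {}
    def lookup(host):
        if host not in cache:
            cache[host] = resolve(host)  # only the first lookup is slow
        return cache[host]
    return lookup

calls = []
def fake_resolve(host):
    calls.append(host)
    return "93.184.216.34"  # made-up address for illustration

lookup = make_cached_resolver(fake_resolve)
lookup("example.com")
lookup("example.com")
assert len(calls) == 1  # second lookup was served from the cache
```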
Issues
• A large volume of email and phone calls
• Accidentally crawling an online game
• The robots exclusion protocol not being followed
Indexing the web
1. Parsing – custom flex-based parser (built for robustness and speed)
2. Indexing documents into barrels – hit list and
forward index constructed.
3. Sorting – done 1 barrel at a time, but multiple
sorters run in parallel. Constructs short
inverted barrel and full inverted barrel.
Searching
1. Parse query
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short
barrel for every word.
4. Scan through the doclist until there is a
document that matches all the terms.
Searching contd..
5. Compute the rank of that document, for that query
6. If we are in the short barrels and at the end of any
doclist, seek to the start of the doclist in the full
barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank
and return the top k (40K for efficiency)
Note – the cutoff in step 8 may cause suboptimal results.
This can be avoided by ranking on hits.
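Since the doclists in the barrels are sorted by docID, the scan in step 4 for a document matching all terms amounts to a sorted-list intersection, roughly as below (the doclists are made up).

```python
# Sorted-doclist intersection sketch: advance pointers through each
# sorted list, keeping only docIDs present in every list.

def intersect(doclists):
    """Return docIDs present in every sorted doclist."""
    ptrs = [0] * len(doclists)
    out = []
    while all(p < len(l) for p, l in zip(ptrs, doclists)):
        current = [l[p] for p, l in zip(ptrs, doclists)]
        hi = max(current)
        if all(c == hi for c in current):
            out.append(hi)                  # every list agrees: a match
            ptrs = [p + 1 for p in ptrs]
        else:
            # advance every list that is behind the largest current docID
            ptrs = [p + (l[p] < hi) for p, l in zip(ptrs, doclists)]
    return out

assert intersect([[1, 3, 5, 9], [3, 4, 9], [2, 3, 9, 11]]) == [3, 9]
```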
Ranking system
• Example of a single-word search:
- Type-weights vector (a different weight for each text type,
such as title, anchor, URL, large text, etc.)
- Count-weights vector (a weight for the number of occurrences;
increases linearly at first, then tapers off)
- The IR score is the dot product of the count-weights vector and
the type-weights vector.
- This score is combined with PageRank to give the final score for
the page.
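The single-word scoring above might look like the sketch below. All weight values, the taper cap, and the way the IR score is combined with PageRank are invented for illustration; the paper does not publish its actual weights.

```python
# Single-word ranking sketch: dot product of type weights and tapering
# count weights, combined with PageRank. Every number here is made up.

TYPE_WEIGHTS = {"title": 8, "anchor": 6, "url": 5, "plain": 1}

def count_weight(n, cap=4):
    return min(n, cap)  # increases linearly, then tapers off

def ir_score(hits_by_type):
    return sum(TYPE_WEIGHTS[t] * count_weight(n) for t, n in hits_by_type.items())

def final_score(hits_by_type, pagerank, alpha=0.5):
    # One simple way to combine the IR score with PageRank; the paper
    # does not specify the exact combination function.
    return alpha * ir_score(hits_by_type) + (1 - alpha) * pagerank

score = final_score({"title": 1, "plain": 10}, pagerank=3.0)
assert score > final_score({"plain": 10}, pagerank=3.0)  # a title hit boosts the score
```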
Ranking system contd..
• For multi-word queries, word proximity is considered
(10 different classifications, from an exact phrase
match to “not even close”). A type-prox-weights
vector is used instead.
Improving using feedback
• Uses feedback to figure out type-weights and
type-prox-weights.
• A trusted user evaluates the search results and
gives feedback. Weight vectors are modified
accordingly.
Results..
Search results
Storage
System performance
• Once the system was running smoothly, it
downloaded 11 million pages in 63 hours (about
48.5 pages per second).
• The indexer runs just faster than the crawlers, to
prevent it becoming a bottleneck.
• The indexer ran at roughly 54 pages per second.
The sorters ran in parallel on 4 machines;
sorting took about 24 hours.
Web page statistics
Search performance
• The focus was on improving search result quality,
not performance.
• Most queries are answered in 1-10 seconds, mostly
dominated by disk I/O, since the disks are spread
over a number of machines.
• There are no optimizations such as query caching
or sub-indices on common terms.
• Search can be sped up with distribution and
algorithm improvements.
Performance data
Conclusion..
Conclusion
• Google was designed to:
1. Be a scalable search engine.
2. Provide high-quality search results.
Considering the success of the Google search
engine, we can conclude that the original goals
were met to a very high degree.
Summary of Key Optimization
Techniques
• Each crawler maintains its own DNS lookup
cache.
• Use flex to generate lexical analyzer with own
stack for parsing documents.
• Parallelization of indexing phase.
• In-memory lexicon.
Summary of Key Optimization
Techniques contd..
• Compression of repository.
• Compact encoding of hitlists accounting for
major space savings.
• Document index is updated in bulk.
• Critical data structures placed on local disk.
Future work..
Future work
• Scale to approximately 100 million web pages
• Query efficiency handling by
▫ Query caching
▫ Smart disk allocation
▫ Sub indices
• Proxy caches to build search databases
Future work contd..
• Improvements found in commercial search
engines, such as:
▫ Boolean operators
▫ Negation
▫ Stemming
• Support of user context and result
summarization
• Increase the weight of bookmarks
• Support of text surrounding links in addition to
the link text itself
Comparison with today..
Google improvements
• Google File System - 2003
• MapReduce (Simplified Data Processing on Large
Clusters) - 2008
• BigTable (A Distributed Storage System for
Structured Data) - 2008
• Percolator (Large-scale Incremental Processing
Using Distributed Transactions and
Notifications) - 2010
Google File System
• Google's core data storage and usage needs
• An evolution of the BigFiles system developed by
Larry Page and Sergey Brin
• Designed for system-to-system interaction, and
not for user-to-system interaction. The chunk
servers replicate the data automatically.
Architecture of GFS
• Files are divided into chunks
• Fixed-size chunks (64MB)
• Replicated over chunkservers, called replicas
• Unique 64-bit chunk handles
• Chunks as Linux files
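With fixed 64 MB chunks, mapping a byte offset in a file to a chunk index is simple arithmetic; a sketch, not the real GFS client protocol (which then asks the master for the chunk handle and replica locations).

```python
# GFS-style chunk arithmetic: which 64 MB chunks does a read touch?

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def chunk_index(offset):
    return offset // CHUNK_SIZE

def chunk_span(offset, length):
    """Indices of all chunks a read of `length` bytes at `offset` touches."""
    return list(range(chunk_index(offset), chunk_index(offset + length - 1) + 1))

assert chunk_index(0) == 0
assert chunk_index(CHUNK_SIZE) == 1
assert chunk_span(CHUNK_SIZE - 10, 20) == [0, 1]  # read straddles a chunk boundary
```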
Advantages of GFS
• Scalability
• Autonomic computing (a concept in which
computers diagnose problems and solve them in
real time without the need for human
intervention)
• Chunk handles (files are split into chunks of
64 megabytes (MB), each with a unique handle)
• High availability and component failure
• Fault tolerance
MapReduce
Typical Challenges in Parallel Processing:
• How to assign tasks to the workers?
• What if we have more tasks than workers?
• What if the workers need to share partial
results?
• How do we aggregate partial results?
• How do we know whether the workers have
finished?
• What if the workers fail?
MapReduce contd..
Functional Programming
+
Distributed Processing Platform
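A minimal single-process word count in the MapReduce style shows how the model answers the challenges listed above: the framework handles grouping and aggregating partial results, while the programmer supplies only the map and reduce functions.

```python
# Word count in the MapReduce style, run in a single process. A real
# MapReduce framework distributes these same phases across workers.
from collections import defaultdict

def map_fn(doc):
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(docs, map_fn, reduce_fn):
    shuffle = defaultdict(list)          # group intermediate pairs by key
    for doc in docs:
        for k, v in map_fn(doc):
            shuffle[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in shuffle.items())

result = mapreduce(["the quick fox", "the fox"], map_fn, reduce_fn)
assert result == {"the": 2, "quick": 1, "fox": 2}
```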
Percolator
• Google’s indexing system stores tens of
petabytes of data and processes billions of
updates per day on thousands of machines.
• MapReduce and other batch-processing systems
cannot process small updates individually as
they rely on creating large batches for efficiency.
Percolator contd..
• Percolator is a system for incrementally
processing updates to a large data set; it was
deployed to create the Google web search index.
• By replacing a batch-based indexing system with
an indexing system based on incremental
processing it is possible to process the same
number of documents per day, while reducing
the average age of documents in Google search
results by 50%.
Google Challenges today..
• SEO (Search Engine Optimization) techniques
used by advertisers adversely affect the results
of users’ queries: SEO tools are used to
artificially improve page rankings.
Google Challenges today contd..
• Filter bubbles – customized results based on the
user’s activity history.
• Algorithms selectively guess what information a
user would like to see, based on information
about the user:
1. location,
2. past click behavior,
3. search history.
Modern Search Engine Capacities
New application areas of search
engines
• Image search engines using content-based
image retrieval (CBIR)
• Audio search engine
• Video search engine
• Semantic search engine
• Enterprise search engine
Q&A
Thank you..


Editor's Notes

  • #39 Indexer – after each document is parsed, a lexicon is used to convert words to wordIDs. Then a hit list is constructed for every word and written into the “forward barrels”. Words not in the lexicon are logged (a shared log, with separate lexicons per crawler). The logged words are finally indexed by a final indexer. The forward barrels map docID -> wordIDs (giving the hits per wordID in a doc). Sorter – a barrel won’t fit fully into memory, so baskets are sorted and written to the short and full inverted barrels.
  • #69 Note – black hat and white hat SEO.
  • #70 Notes – Google+, social web