Sergey Brin and Lawrence Page
             Computer Science Department,
Stanford University, Stanford, CA 94305, USA
Agenda
   Introduction
   History
   Design Goals
   System Features
   System Anatomy
   Crawling the web
   Indexing the web
   Searching
   Results and Performance
   Conclusion
Introduction
 Web information is growing rapidly, and so is the number of
  new users inexperienced in the art of web search.
 People are likely to surf the web using:
    Human-maintained indices
        e.g., Yahoo!
    Automated search engines
Why the name Google ?
 It’s a common spelling of googol, or 10^100.


 Fits well with our goal of building very large-scale
  search engines.
Scaling Up: 1994 - 2000
  [Bar chart: pages indexed (millions) and queries handled (millions), 1994-2000]

        Year    Pages         Queries
        1994    110,000       1,500
        1997    2 to 100 M    20 M
        2000    1 B           100 M
Technology Challenges
 Fast crawling technology is needed
    to gather the web documents and keep them up to date.


 Storage space must be used efficiently
   to store indices and, optionally, the documents themselves.


 The indexing system must process hundreds of gigabytes of
  data efficiently.

 Queries must be handled quickly
    at a rate of hundreds to thousands per second.
Design Goals
 Improved Search Quality
    “The number of documents in the indices has been
     increasing by many orders of magnitude, but the user’s
     ability to look at documents has not.”


 Academic Search Engine Research
    Search engine technology remains largely a black art and is
     advertising oriented.
System Features

 PageRank


 Anchor Text


 Other features
PageRank: Bringing Order to the Web
 PageRank Calculation
   We assume page A has pages T1...Tn which point to it (i.e., are
    citations). The parameter d is a damping factor which can be set
    between 0 and 1. We usually set d to 0.85. There are more details about d
    in the next section. Also C(A) is defined as the number of links going out
    of page A. The PageRank of a page A is given as follows:

     PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
 Note that the PageRanks form a probability distribution over web pages, so
 the sum of all web pages’ PageRanks will be one.
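
Below is a minimal, illustrative Python sketch of iterating the formula above until the PageRanks settle. The toy link graph, damping factor default, and iteration count are assumptions for demonstration, not values taken from these slides.

  # Minimal PageRank sketch using the formula above.
  # 'links' maps each page to the pages it links to (hypothetical toy graph).
  def pagerank(links, d=0.85, iterations=50):
      pages = list(links)
      pr = {p: 1.0 for p in pages}          # initial PageRank for every page
      for _ in range(iterations):
          new_pr = {}
          for a in pages:
              # sum PR(T)/C(T) over all pages T that point to A
              incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
              new_pr[a] = (1 - d) + d * incoming
          pr = new_pr
      return pr

  # Toy example: three pages linking to each other.
  toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
  print(pagerank(toy_graph))
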
Anchor Text
 First, anchors often provide more accurate descriptions of
  web pages than the pages themselves.

 Second, anchors may exist for documents which
  cannot be indexed by a text-based search engine, such
  as images, programs, and databases.

 Using anchor text efficiently is technically difficult because
  of the large amounts of data which must be processed (a small
  sketch follows below).
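
As a rough illustration of the idea (not the actual Google data structures), anchor text can be credited to the page a link points to rather than the page containing the link, so even un-crawled or non-text targets gain index terms. The function and field names below are hypothetical.

  # Sketch: associate anchor text with the target page rather than the
  # page containing the link.
  from collections import defaultdict

  anchor_index = defaultdict(list)   # target URL -> list of anchor texts

  def record_anchor(source_url, target_url, anchor_text):
      # The anchor text describes target_url, so index it under the target.
      anchor_index[target_url].append(anchor_text)

  record_anchor("http://example.com/a", "http://example.com/logo.gif", "company logo")
  record_anchor("http://example.com/b", "http://example.com/logo.gif", "official logo image")
  print(anchor_index["http://example.com/logo.gif"])
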
Other Features
 First, it has location information for all hits and so it
  makes extensive use of proximity in search.

 Second, it keeps track of some visual presentation
  details such as font size of words.
    Words in a larger or bolder font are weighted higher than
     other words (see the sketch after this list).


 Third, full raw HTML of pages is available in a
  repository.
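
A hedged sketch of how such presentation details might be recorded and weighted: a hit stores the word, its position, and a relative font size. The Hit fields and the specific weights below are illustrative assumptions, not the encoding Google actually uses.

  # Sketch: a "hit" records a word occurrence with its position and font size,
  # so larger/bolder words can be weighted higher and proximity can be scored.
  from dataclasses import dataclass

  @dataclass
  class Hit:
      word: str
      position: int        # word offset within the document
      relative_font: int   # font size relative to the rest of the document (0-7)

  def hit_weight(hit):
      # Hypothetical weighting: bigger fonts get a larger, capped boost.
      return 1.0 + 0.25 * min(hit.relative_font, 4)

  hits = [Hit("google", 3, 6), Hit("google", 120, 1)]
  print([hit_weight(h) for h in hits])
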
System Anatomy




      [Figure: High Level Google Architecture]
Important Data Structures
 BigFiles
    Virtual files addressable across multiple file systems


 Repository
    DocID, URL, Length, HTML Page Code
 Document Index
   ISAM index
   Contains DocID, URL, Checksum


 Lexicon
     Requires 256 MB of RAM
     Contains 14 million words
     Implemented in two parts: a list of words and a hash table
      of pointers (see the sketch below)
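
A rough Python sketch of that two-part layout, assuming the word list is one contiguous, null-separated string and the hash table maps each word to its offset into it. The details are illustrative, not the exact on-disk format.

  # Sketch: the lexicon stores all words in one contiguous string
  # plus a hash table mapping each word to its offset (its "pointer").
  class Lexicon:
      def __init__(self, words):
          self.blob = ""                 # concatenated, null-separated word list
          self.offsets = {}              # word -> offset of the word in the blob
          for w in words:
              self.offsets[w] = len(self.blob)
              self.blob += w + "\0"

      def lookup(self, word):
          # Returns the word's offset (usable as a wordID), or None if absent.
          return self.offsets.get(word)

  lex = Lexicon(["search", "engine", "anatomy"])
  print(lex.lookup("engine"))   # -> offset of "engine" in the word list
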

Editor's Notes

  • #4 Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.
  • #8 People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say, in the top tens of results). Indeed, we want our notion of "relevant" to only include the very best documents, since there may be tens of thousands of slightly relevant documents. There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2). One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. Another goal is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.
  • #10 Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo!'s homepage would not link to it.
  • #11 In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.
  • #13 Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux. In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries. (See the sketch of the forward-to-inverted index step after these notes.)
  • #14 BigFiles: virtual files; allocation and deallocation across multiple file systems is handled automatically; also handles allocation and deallocation of file descriptors; supports a compression option. Repository: docID, length, URL, page HTML code. Doc Index: ISAM index; docID, URL, checksum. Lexicon.
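
A compact sketch of the forward-to-inverted index step described in note #13: the indexer emits barrels keyed by docID, and the sorter re-sorts their postings by wordID. The barrel contents, hit fields, and IDs below are simplified assumptions for illustration.

  # Sketch: the indexer produces a forward index keyed by docID; the sorter
  # re-sorts those postings by wordID to produce the inverted index.
  from collections import defaultdict

  # Forward index (one "barrel"): docID -> list of (wordID, position) hits.
  forward_barrel = {
      1: [(10, 0), (42, 1)],
      2: [(42, 0), (7, 1)],
  }

  def invert(barrel):
      inverted = defaultdict(list)       # wordID -> list of (docID, position)
      for doc_id, hits in barrel.items():
          for word_id, position in hits:
              inverted[word_id].append((doc_id, position))
      # Sorting by wordID mirrors the sorter's pass over the barrels.
      return dict(sorted(inverted.items()))

  print(invert(forward_barrel))
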