

        Term Paper Presentation titled

                SEARCH ENGINE(s)




Submitted in partial fulfillment of the requirements for the award

                               of

     BACHELOR’S IN COMPUTER APPLICATIONS (BCA)

                of Integral University, Lucknow.

                       Session: 2004-07

                          Submitted by

                   PRASHANT MATHUR

                 Roll Number: 0400518017


          Under the guidance of Mr. Pavan Srivastava


             Name and Address of the Study Centre

UPTEC Computer Consultancy Limited, Kapoorthala, Lucknow.

                      Contents
    i. Prelude

    ii. History

   iii. Challenges faced by search engines

   iv. How search engines work

    v.   Storage costs and crawling time

   vi. Geospatially enabled search engines

   vii. Vertical Search engines

   viii. Search Engine Optimizer

   ix. Page Rank

    x.   GOOGLE Architecture Overview

   xi. Conclusions




               WHAT IS A SEARCH ENGINE?

                                 Prelude
With billions of items of information scattered around the World Wide Web, how
do you find what you are looking for? Someone might tell you the address of
an interesting site. You might hear of an address from the TV, radio or a
magazine. Without search engines these would be your only ways of finding
things. Search engines use computer programs that spend the whole time
trawling through the vast amount of information on the web. They create huge
indexes of all this information. You can go to the web page of a search engine
and type in what you are looking for. The search engine software will look
through its indexes and give you a list of Web pages that contain the words
you typed.

A search engine is computer software that compiles lists of documents, most
commonly those on the World Wide Web (WWW), and the contents of those
documents. Search engines respond to a user entry, or query, by searching
the lists and displaying a list of documents (called Web sites when on the
WWW) that match the search query. Some search engines include the
opening portion of the text of Web pages in their lists, but others include only
the titles or addresses (known as Universal Resource Locators, or URLs) of
Web pages. Some search engines occur separately from the WWW, indexing
documents on a local area network or other system.

The major global general-purpose search engines include Google, Yahoo!,
MSN Search, AltaVista, Lycos, and HotBot. Yahoo!—one of the first available
search engines—differs from most other search sites because the content and
listings are manually compiled and organized by subject into a directory. As of
January 2005, Google ranked as the most comprehensive search engine
available, with over four billion pages indexed.

These engines operate by building—and regularly updating—an enormous
index of Web pages and files. This is done with the help of a Web crawler, or
spider, a kind of automated browser that perpetually trolls the Web, retrieving
each page it finds. Pages are then indexed according to the words they
contain, with special treatment given to words in titles and other headers.
When a user inputs a query, the search engine then scans the index and
retrieves a list of pages that seem to best fit what the user is looking for.
Search engines often return results in fractions of a second.
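The index-then-query cycle described above can be sketched with a tiny inverted index. The pages, URLs, and text below are made-up examples, not real crawl data:

```python
# Toy crawl data: URL -> page text.
pages = {
    "page1.html": "search engines index the web",
    "page2.html": "a web crawler retrieves pages",
    "page3.html": "engines rank pages by links",
}

# Build an inverted index: word -> set of pages containing it.
index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(query):
    """Return the pages that contain every word of the query."""
    results = None
    for word in query.split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return sorted(results or [])

print(search("web pages"))  # pages containing both "web" and "pages"
```

A real engine scans a far larger index, but the lookup is the same idea: fetch the posting list for each query word and intersect them.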

Generally, when an engine displays a list of results, pages are ranked
according to how many other sites link to those pages. The assumption is that
the more useful a site is, the more often other sites will send users to it.
Google pioneered this technique in the late 1990s with a technology called
PageRank. But this is not the only way of ranking results. Dozens of other
criteria are used, and these will vary from engine to engine.




                                   Google
A Search Engine is an information retrieval system designed to help find
information stored on a computer system, such as on the World Wide Web,
inside a corporate or proprietary network, or in a personal computer. The
search engine allows one to ask for content meeting specific criteria (typically
those containing a given word or phrase) and retrieves a list of items that
match those criteria. This list is often sorted with respect to some measure of
relevance of the results. Search engines use regularly updated indexes to
operate quickly and efficiently.

Without further qualification, search engine usually refers to a Web search
engine, which searches for information on the public Web. Other kinds of
search engine are enterprise search engines, which search on intranets,
personal search engines, and mobile search engines. Different selection and
relevance criteria may apply in different environments, or for different uses.

Some search engines also mine data available in newsgroups, databases, or
open directories. Unlike Web directories, which are maintained by human
editors, search engines operate algorithmically or use a mixture of algorithmic
and human input.

                                 History
The very first tool used for searching on the Internet was Archie.[1] The name
stands for "archive" without the "v". It was created in 1990 by Alan Emtage, a
student at McGill University in Montreal. The program downloaded the
directory listings of all the files located on public anonymous FTP (File
Transfer Protocol) sites, creating a searchable database of filenames;
however, Archie could not search by file contents.

While Archie indexed computer files, Gopher indexed plain text documents.
Gopher was created in 1991 by Mark McCahill at the University of Minnesota;
Gopher was named after the school's mascot. Because these were text files,
most of the Gopher sites became websites after the creation of the World
Wide Web.

Two other programs, Veronica and Jughead, searched the files stored in
Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index
to Computerized Archives) provided a keyword search of most Gopher menu
titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher
Hierarchy Excavation And Display) was a tool for obtaining menu information
from various Gopher servers. While the name of the search engine "Archie"
was not a reference to the Archie comic book series, "Veronica" and
"Jughead" are characters in the series, thus referencing their predecessor.

The first Web search engine was Wandex, a now-defunct index collected by
the World Wide Web Wanderer, a web crawler developed by Matthew Gray at
MIT in 1993. Another very early search engine, Aliweb, also appeared in 1993,
and still runs today. The first "full text" crawler-based search engine was
WebCrawler, which came out in 1994. Unlike its predecessors, it let users
search for any word in any webpage, which has since become the standard for
all major search engines. It was also the first one to be widely known by the
public. Also in 1994 Lycos (which started at Carnegie Mellon University) came
out, and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These
included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. In some
ways, they competed with popular directories such as Yahoo!. Later, the
directories integrated or added on search engine technology for greater
functionality.

Search engines were also among the brightest stars in the Internet
investing frenzy that occurred in the late 1990s. Several companies entered
the market spectacularly, receiving record gains during their initial public
offerings. Some have taken down their public search engine, and are
marketing enterprise-only editions, such as Northern Light.

Google's success was based in part on the concept of link popularity and PageRank.
The number of other websites and webpages that link to a given page is taken
into consideration with PageRank, on the premise that good or desirable
pages are linked to more than others. The PageRank of linking pages and the
number of links on these pages contribute to the PageRank of the linked page.
This makes it possible for Google to order its results by how many websites
link to each found page. Google's minimalist user interface is very popular with
users, and has since spawned a number of imitators.

Google and most other web engines utilize not only PageRank but more than
150 criteria to determine relevancy. The algorithm "remembers" where it has
been and indexes the number of cross-links and relates these into groupings.
PageRank is based on citation analysis that was developed in the 1950s by
Eugene Garfield at the University of Pennsylvania. Google's founders cite
Garfield's work in their original paper. In this way virtual communities of
webpages are found. Teoma's search technology uses a communities
approach in its ranking algorithm. NEC Research Institute has worked on
similar technology. Web link analysis was first developed by Jon Kleinberg and
his team while working on the CLEVER project at IBM's Almaden Research
Center. Google is currently the most popular search engine.

                            Yahoo! Search
The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in
Electrical Engineering at Stanford University, started their guide in a campus
trailer in February 1994 as a way to keep track of their personal interests on
the Internet. Before long they were spending more time on their home-brewed
lists of favorite links than on their doctoral dissertations. Eventually, Jerry and
David's lists became too long and unwieldy, and they broke them out into
categories. When the categories became too full, they developed
subcategories ... and the core concept behind Yahoo! was born. In 2002,
Yahoo! acquired Inktomi and in 2003, Yahoo! acquired Overture, which owned
AlltheWeb and AltaVista. Despite owning its own search engine, Yahoo!
initially kept using Google to provide its users with search results on its main
website Yahoo.com. However, in 2004, Yahoo! launched its own search
engine based on the combined technologies of its acquisitions and providing a
service that gave pre-eminence to the Web search engine over the directory.

                                Microsoft
The most recent major search engine is MSN Search (evolved into Windows
Live Search), owned by Microsoft, which previously relied on others for its
search engine listings. In 2004 it debuted a beta version of its own results,
powered by its own web crawler (called msnbot). In early 2005 it started
showing its own results live. This was barely noticed by average users
unaware of where results come from, but was a huge development for many
webmasters, who seek inclusion in the major search engines. At the same
time, Microsoft ceased using results from Inktomi, now owned by Yahoo!. In
2006, Microsoft migrated to a new search platform - Windows Live Search,
retiring the "MSN Search" name in the process.

         Challenges faced by search engines
a. The Web is growing much faster than any present-technology search
   engine can possibly index (see distributed web crawling). In 2006, some
   users found that major search engines had become slower to index new web pages.
b. Many web pages are updated frequently, which forces the search engine to
   revisit them periodically.

c. The queries one can make are currently limited to searching for key words,
   which may result in many false positives, especially using the default
   whole-page search. Better results might be achieved by using a proximity-
   search option with a search-bracket to limit matches within a paragraph or
   phrase, rather than matching random words scattered across large pages.
   Another alternative is using human operators to do the researching for the
   user with organic search engines.
d. Dynamically generated sites may be slow or difficult to index, or may result
   in excessive results, perhaps generating 500 times more web pages than
   average. Example: for a dynamic webpage which changes content based
   on entries inserted from a database, a search engine might be requested to
   index 50,000 static web pages for 50,000 different parameter values
   passed to that dynamic webpage.
e. Many dynamically generated websites are not indexable by search
   engines; this phenomenon is known as the invisible web. There are
   search engines that specialize in crawling the invisible web by crawling
   sites that have dynamic content, require forms to be filled out, or are
   password protected.
f. Relevancy: sometimes the engine can't retrieve what the person is looking for.
g. Some search-engines do not rank results by relevance, but by the amount
   of money the matching websites pay.
h. In 2006, hundreds of generated websites used tricks to manipulate a
   search engine into displaying them higher in the results for numerous keywords.
   This can lead to some search results being polluted with linkspam or bait-
   and-switch pages which contain little or no information about the matching
   phrases. The more relevant web pages are pushed further down in the
   results list, perhaps by 500 entries or more.
i. Secure pages (content hosted on HTTPS URLs) pose a challenge for
   crawlers which either can't browse the content for technical reasons or
   won't index it for privacy reasons.

                    How search engines work
A search engine operates in the following order:

     a. Web crawling
     b. Indexing
     c. Searching

Web search engines work by storing information about a large number of web
pages, which they retrieve from the WWW itself. These pages are retrieved by
a Web crawler (sometimes also known as a spider) — an automated Web
browser which follows every link it sees. Exclusions can be made by the use of
robots.txt. The contents of each page are then analyzed to determine how it
should be indexed (for example, words are extracted from the titles, headings,
or special fields called meta tags). Data about web pages are stored in an
index database for use in later queries. Some search engines, such as
Google, store all or part of the source page (referred to as a cache) as well as
information about the web pages, whereas others, such as AltaVista, store
every word of every page they find. This cached page always holds the actual
search text, since it is the text that was actually indexed, so it can be very
useful when the content of the live page has been updated and the search
terms no longer appear in it. This problem might be considered a mild form
of linkrot, and Google's handling of it increases usability by satisfying the
principle of least astonishment: users normally expect the search terms to
appear on the returned pages. Increased search relevance makes these cached
pages very useful, even beyond the fact that they may contain data that is no
longer available elsewhere.
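The crawl loop described above (follow every link seen, honoring exclusions) can be sketched against a toy in-memory "web"; the URLs and the disallowed set below are illustrative stand-ins for real pages and a robots.txt file:

```python
from collections import deque

# A toy "web": each URL maps to the list of links found on that page.
links = {
    "/index.html":   ["/a.html", "/b.html"],
    "/a.html":       ["/b.html", "/private.html"],
    "/b.html":       ["/index.html"],
    "/private.html": [],
}
disallowed = {"/private.html"}  # stands in for robots.txt exclusions

def crawl(start):
    """Breadth-first crawl following every link it sees,
    skipping pages excluded by the robots rules."""
    seen, queue, order = set(), deque([start]), []
    while queue:
        url = queue.popleft()
        if url in seen or url in disallowed:
            continue
        seen.add(url)
        order.append(url)  # a real crawler would fetch and store the page here
        queue.extend(links.get(url, []))
    return order

print(crawl("/index.html"))  # visits every reachable, allowed page once
```

A production crawler fetches over the network, parses real robots.txt files, and distributes the queue across machines, but the visit-once, follow-links loop is the same.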

When a user comes to the search engine and makes a query, typically by
giving key words, the engine looks up the index and provides a listing of best-
matching web pages according to its criteria, usually with a short summary
containing the document's title and sometimes parts of the text. Most search
engines support the use of the boolean terms AND, OR and NOT to further
specify the search query. An advanced feature is proximity search, which
allows users to define the distance between keywords.
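The boolean operators mentioned above map directly onto set operations over an inverted index. The index below is a made-up example:

```python
# Toy inverted index: word -> set of document IDs containing it.
index = {
    "cat":  {1, 2, 3},
    "dog":  {2, 4},
    "fish": {3, 4, 5},
}

# Boolean query operators as set operations:
cat_and_dog = index["cat"] & index["dog"]   # AND -> {2}
cat_or_fish = index["cat"] | index["fish"]  # OR  -> {1, 2, 3, 4, 5}
cat_not_dog = index["cat"] - index["dog"]   # AND NOT -> {1, 3}
```

Proximity search requires more than this sketch stores: the index must also record word positions within each document, so the engine can check the distance between matched terms.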

The usefulness of a search engine depends on the relevance of the result set
it gives back. While there may be millions of webpages that include a
particular word or phrase, some pages may be more relevant, popular, or
authoritative than others. Most search engines employ methods to rank the
results to provide the "best" results first. How a search engine decides which
pages are the best matches, and what order the results should be shown in,
varies widely from one engine to another. The methods also change over time
as Internet usage changes and new techniques evolve.

Most Web search engines are commercial ventures supported by advertising
revenue and, as a result, some employ the controversial practice of allowing
advertisers to pay money to have their listings ranked higher in search results.
Those search engines which do not accept money for their search engine
results make money by running search related ads alongside the regular
search engine results. The search engines make money every time someone
clicks on one of these ads.

The vast majority of search engines are run by private companies using
proprietary algorithms and closed databases, though some are open source.

              Storage Costs and Crawling Time
Storage costs are not the limiting resource in search engine implementation.
Simply storing 10 billion pages of 10 KB each (compressed) requires 100TB
and another 100TB or so for indexes, giving a total hardware cost of under
$200k: 100 cheap PCs each with four 500GB disk drives.

However, a public search engine requires considerably more resources than
this to calculate query results and to provide high availability. Also, the costs of
operating a large server farm are not trivial.

Crawling 10B pages with 100 machines crawling at 100 pages/second would
take 1M seconds, or 11.6 days on a very high capacity Internet connection.
Most search engines crawl a small fraction of the Web (10-20% of pages) at
around this frequency or better, but also crawl dynamic websites (e.g. news
sites and blogs) at a much higher frequency.
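The figures above can be checked with a few lines of arithmetic, using the same assumed numbers (10 billion pages, 10 KB per compressed page, 100 machines at 100 pages/second):

```python
pages = 10_000_000_000        # 10 billion pages
page_kb = 10                  # ~10 KB per page, compressed
storage_tb = pages * page_kb / 1_000_000_000   # KB -> TB
# 100 TB for the pages themselves, and roughly as much again for indexes

machines = 100
rate = 100                    # pages per second, per machine
seconds = pages / (machines * rate)
days = seconds / 86_400       # 86,400 seconds in a day
print(storage_tb, days)       # 100.0 TB, about 11.6 days
```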

        Geospatially enabled Search Engines
A recent enhancement to search engine technology is the addition of
geocoding and geoparsing to the processing of the ingested documents being
indexed, to enable searching within a specified locality (or region). Geoparsing
attempts to match any found references to locations and places to a
geospatial frame of reference, such as a street address, gazetteer locations,
or to an area (such as a polygonal boundary for a municipality). Through this
geoparsing process, latitudes and longitudes are assigned to the found places,
and these latitudes and longitudes are indexed for later spatial query and
retrieval. This can enhance the search process tremendously by allowing a
user to search for documents within a given map extent, or conversely, plot
the location of documents matching a given keyword to analyze incidence and
clustering, or any combination of the two. See the list of search engines for
examples of companies which offer this feature.
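The spatial-query step can be sketched as a bounding-box filter, assuming documents have already been geoparsed to latitude/longitude pairs; the documents and coordinates below are invented:

```python
# Documents with their geoparsed (latitude, longitude) coordinates.
docs = [
    ("riverside-news.html", 26.85, 80.95),
    ("harbour-report.html", 19.07, 72.88),
    ("hill-station.html",   31.10, 77.17),
]

def within_extent(docs, lat_min, lat_max, lon_min, lon_max):
    """Return documents whose indexed coordinates fall inside the map extent."""
    return [name for name, lat, lon in docs
            if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max]

# Everything inside one rectangular map extent:
print(within_extent(docs, 25.0, 32.0, 75.0, 82.0))
```

Real geospatial indexes use spatial data structures (grids, R-trees) rather than a linear scan, but the query semantics are this same containment test.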

                   Vertical Search Engines
Vertical search engines or specialized search engines are search engines
which specialize in specific content categories or that search within a specific
media. Popular Search engines, like Google or Yahoo!, are very effective
when the user searches for web sites, web pages or general information.
Vertical search engines enable the user to find specific types of listings thus
making the search more customized to the user's needs.

               Search Engine Optimization (SEO)
Search engine optimization (SEO), a subset of search engine marketing, is the process of improving the volume
and quality of traffic to a web site from search engines via "natural" ("organic"
or "algorithmic") search results. SEO can also target specialized searches
such as image search, local search, and industry-specific vertical search
engines.

SEO is marketing by understanding how search algorithms work and what
human visitors might search for, to help match those visitors with sites offering
what they are interested in finding. Some SEO efforts may involve optimizing a
site's coding, presentation, and structure, without making very noticeable
changes to human visitors, such as incorporating a clear hierarchical structure
to a site, and avoiding or fixing problems that might keep search engine
indexing programs from fully spidering a site. Other, more visible efforts
involve adding unique content that search engines can easily index and
extract from pages, while also appealing to human visitors.




              A typical Search Engine Results Page (SERP)
The term SEO can also refer to "search engine optimizers," a term adopted by
an industry of consultants who carry out optimization projects on behalf of
clients, and by employees of site owners who may perform SEO services in-
house. Search engine optimizers often offer SEO as a stand-alone service or
as a part of a larger marketing campaign. Because effective SEO can require
making changes to the source code of a site, it is often very helpful when
incorporated into the initial development and design of a site, leading to the
use of the term "Search Engine Friendly" to describe designs, menus, content
management systems and shopping carts that can be optimized easily and
effectively.

             Optimizing for Traffic Quality
In addition to seeking better rankings, search engine optimization is also
concerned with traffic quality. Traffic quality is measured by how often a visitor
using a specific keyword phrase leads to a desired conversion action, such as
making a purchase, viewing or downloading a certain page, requesting further
information, signing up for a newsletter, or taking some other specific action.
By improving the quality of a page's search listings, more searchers may
select that page and those searchers may be more likely to convert. Examples
of SEO tactics to improve traffic quality include writing attention-grabbing titles,
adding accurate meta descriptions, and choosing a domain and URL that
improve the site's branding.

      Relationship between SEO and Search
                     Engines
By 1997 search engines recognized that some webmasters were making
efforts to rank well in their search engines, and even manipulating the page
rankings in search results. In some early search engines, such as Infoseek,
ranking first was as easy as grabbing the source code of the top-ranked page,
placing it on your website, and submitting a URL to instantly index and rank
that page. Due to the high value and targeting of search results, there is
potential for an adversarial relationship between search engines and SEOs. In
2005, an annual conference named AirWeb was created to discuss bridging
the gap and minimizing the sometimes damaging effects of aggressive web
content providers.

Some more aggressive site owners and SEOs generate automated sites or
employ techniques that eventually get domains banned from the search
engines. Many search engine optimization companies, which sell services,
employ long-term, low-risk strategies, and most SEO firms that do employ
high-risk strategies do so on their own affiliate, lead-generation, or content
sites, instead of risking client websites.

Some SEO companies employ aggressive techniques that get their client
websites banned from the search results. The Wall Street Journal profiled a
company, Traffic Power, that allegedly used high-risk techniques and failed to
disclose those risks to its clients. Wired reported the same company sued a
blogger for mentioning that they were banned. Google’s Matt Cutts later
confirmed that Google did in fact ban Traffic Power and some of its clients.

Some search engines have also reached out to the SEO industry, and are
frequent sponsors and guests at SEO conferences and seminars. In fact, with
the advent of paid inclusion, some search engines now have a vested interest
in the health of the optimization community. All of the main search engines
provide information/guidelines to help with site optimization: Google's,
Yahoo!'s, MSN's and Ask.com's. Google has a Sitemaps program to help
webmasters learn if Google is having any problems indexing their website and
also provides data on Google traffic to the website. Yahoo! has Site Explorer
that provides a way to submit your URLs for free (like MSN/Google),
determine how many pages are in the Yahoo! index and drill down on inlinks
to deep pages. Yahoo! has an Ambassador Program and Google has a
program for qualifying Google Advertising Professionals.

                           Types of SEO
SEO techniques are classified by some into two broad categories: techniques
that search engines recommend as part of good design and those techniques
that search engines do not approve of and attempt to minimize the effect of,
referred to as spamdexing. Most professional SEO consultants do not offer
spamming and spamdexing techniques amongst the services that they provide
to clients. Some industry commentators classify these methods, and the
practitioners who utilize them, as either "white hat SEO", or "black hat SEO".
Many SEO consultants reject the black and white hat dichotomy as a
convenient but unfortunate and misleading over-simplification that makes the
industry look bad as a whole.

                                 White Hat
An SEO tactic, technique or method is considered "White hat" if it conforms to
the search engines' guidelines and/or involves no deception. As the search
engine guidelines are not written as a series of rules or commandments, this is
an important distinction to note. White Hat SEO is not just about following
guidelines, but is about ensuring that the content a search engine indexes and
subsequently ranks is the same content a user will see.
White Hat advice is generally summed up as creating content for users, not for
search engines, and then making that content easily accessible to their
spiders, rather than trying to game the system. White hat SEO is in many ways similar
to web development that promotes accessibility, although the two are not
identical.

                     Black hat /Spamdexing
"Black hat" SEO refers to methods that try to improve rankings in ways the
search engines disapprove of and/or that involve deception. This can range from
text that is "hidden", either colored similarly to the background or placed in
an invisible or off-screen div, to redirecting users from a page built for
search engines to one that is more human friendly. A method that sends a
user to a page different from the page the search engine ranked is
Black hat as a rule. One well known example is Cloaking, the practice of
serving one version of a page to search engine spiders/bots and another
version to human visitors.
Search engines may penalize sites they discover using black hat methods,
either by reducing their rankings or eliminating their listings from their
databases altogether. Such penalties can be applied either automatically by
the search engines' algorithms or by a manual review of a site.

                         ‘Archie’ Search Engine
Archie is a tool for indexing FTP archives, allowing people to find specific
files. It is considered to be the first Internet search engine. The original
implementation was written in 1990 by Alan Emtage, Bill Heelan, and Peter J.
Deutsch, then students at McGill University in Montreal. The earliest versions
of archie simply contacted a list of FTP archives on a regular basis (contacting
each roughly once a month, so as not to waste too many resources on the
remote servers) and requested a listing. These listings were stored in local
files to be searched using the UNIX grep command. Later, more efficient front-
and back-ends were developed, and the system spread from a local tool, to a
network-wide resource, to a popular service available from multiple sites
around the Internet. Such archie servers could be accessed in multiple ways:
using a local client (such as archie or xarchie); telneting to a server directly;
sending queries by electronic mail; and later via World Wide Web interfaces.

The name derives from the word "archive", but is also associated with the
comic book series of the same name. This was not originally intended, but it
certainly acted as the inspiration for the names of Jughead and Veronica, both
search systems for the Gopher protocol, named after other characters from
the same comics.

The World Wide Web made searching for files much easier, and there are
currently very few archie servers in operation. One gateway can be found in
Poland.

                            System Features
The Google search engine has two important features that help it produce high
precision results. First, it makes use of the link structure of the Web to
calculate a quality ranking for each web page. This ranking is called
PageRank. Second, Google utilizes link text to improve search results.

           Page Rank: Bringing Order to the Web
The citation (link) graph of the web is an important resource that has largely
gone unused in existing web search engines. We have created maps
containing as many as 518 million of these hyperlinks, a significant sample of
the total. These maps allow rapid calculation of a web page's "PageRank", an
objective measure of its citation importance that corresponds well with
people's subjective idea of importance. Because of this correspondence,
PageRank is an excellent way to prioritize the results of web keyword
searches. For most popular subjects, a simple text matching search that is
restricted to web page titles performs admirably when PageRank prioritizes
the results. For the type of full text searches in the main Google system,
PageRank also helps a great deal.

            Description of Page Rank Calculation
Academic citation literature has been applied to the web, largely by counting
citations or backlinks to a given page. This gives some approximation of a
page's importance or quality. PageRank extends this idea by not counting
links from all pages equally, and by normalizing by the number of links on a
page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor which can be set between 0 and 1. We usually
set d to 0.85. There are more details about d in the next section. Also, C(A)
is defined as the number of links going out of page A. The PageRank of a page
A is given as follows:

            PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the
sum of all web pages' PageRanks will be one.
PageRank or PR (A) can be calculated using a simple iterative algorithm, and
corresponds to the principal eigenvector of the normalized link matrix of the
web. Also, a PageRank for 26 million web pages can be computed in a few
hours on a medium size workstation. There are many other details which are
beyond the scope of this paper.
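The simple iterative algorithm mentioned above can be sketched directly from the formula, with d = 0.85; the three-page link graph is a made-up example:

```python
# Link graph: page -> pages it links to. C(T) is len(links[T]).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85
pr = {page: 1.0 for page in links}   # initial ranks

for _ in range(100):                 # iterate until the ranks stabilize
    new = {}
    for page in links:
        # sum PR(T)/C(T) over every page T that links to `page`
        incoming = sum(pr[t] / len(links[t])
                       for t in links if page in links[t])
        new[page] = (1 - d) + d * incoming
    pr = new

# C collects links from both A and B, so it ends up with the highest rank.
```

Each pass applies PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) to every page simultaneously; the ranks converge to the principal eigenvector described in the text.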

                                Anchor Text
The text of links is treated in a special way in our search engine. Most search
engines associate the text of a link with the page that the link is on. In addition,
we associate it with the page the link points to. This has several advantages.
First, anchors often provide more accurate descriptions of web pages than the
pages themselves. Second, anchors may exist for documents which cannot be
indexed by a text based search engine, such as images, programs, and
databases. This makes it possible to return web pages which have not actually
been crawled. Note that pages that have not been crawled can cause
problems, since they are never checked for validity before being returned to
the user. In this case, the search engine can even return a page that never
actually existed, but had hyperlinks pointing to it. However, it is possible to sort
the results, so that this particular problem rarely happens. This idea of
propagating anchor text to the page it refers to was implemented in the World
Wide Web Worm especially because it helps search non-text information, and
expands the search coverage with fewer downloaded documents. We use
anchor propagation mostly because anchor text can help provide better quality
results. Using anchor text efficiently is technically difficult because of the large
amounts of data which must be processed. In our current crawl of 24 million
pages, we had over 259 million anchors which we indexed.

                              Other Features
Aside from PageRank and the use of anchor text, Google has several other
features. First, it has location information for all hits and so it makes extensive
use of proximity in search. Second, Google keeps track of some visual
presentation details such as font size of words. Words in a larger or bolder font
are weighted higher than other words. Third, full raw HTML of pages is
available in a repository.

                             System Anatomy
This section gives in-depth descriptions of the important data structures, and
then examines the major applications of crawling, indexing, and searching in
depth.

                   Google Architecture Overview




This is a high-level overview of how the whole system works; further sections
will discuss the applications and data structures not mentioned here. Most of
Google is implemented in C or C++ for efficiency and can run on either Solaris
or Linux. In Google, the web crawling (downloading of web pages) is done by
several distributed crawlers. There is a URLserver
that sends lists of URLs to be fetched to the crawlers. The web pages that are
fetched are then sent to the storeserver. The storeserver then compresses
and stores the web pages into a repository. Every web page has an
associated ID number called a docID which is assigned whenever a new URL
is parsed out of a web page. The indexing function is performed by the indexer
and the sorter. The indexer performs a number of functions. It reads the
repository, uncompresses the documents, and parses them. Each document is
converted into a set of word occurrences called hits. The hits
record the word, position in document, an approximation of font size, and
capitalization. The indexer distributes these hits into a set of "barrels", creating
a partially sorted forward index. The indexer performs another important
function. It parses out all the links in every web page and stores important
information about them in an anchors file. This file contains enough
information to determine where each link points from and to, and the text of
the link. The URLresolver reads the anchors file and converts relative URLs
into absolute URLs and in turn into docIDs. It puts the anchor text into the
forward index, associated with the docID that the anchor points to. It also
generates a database of links which are pairs of docIDs. The links database is
used to compute PageRanks for all the documents. The sorter takes the
barrels, which are sorted by docID and resorts them by wordID to generate the
inverted index. This is done in place so that little temporary space is needed
for this operation. The sorter also produces a list of wordIDs and offsets into
the inverted index. A program called DumpLexicon takes this list together with
the lexicon produced by the indexer and generates a new lexicon to be used
by the searcher. The searcher is run by a web server and uses the lexicon
built by DumpLexicon together with the inverted index and the PageRanks to
answer queries.

                        Major Data Structures
Google's data structures are optimized so that a large document collection can
be crawled, indexed, and searched with little cost. Although CPUs and bulk
input/output rates have improved dramatically over the years, a disk seek still
requires about 10 ms to complete. Google is designed to avoid disk seeks
whenever possible, and this has had a considerable influence on the design of
the data structures.

                                   Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular
document including position, font, and capitalization information. Hit lists
account for most of the space used in both the forward and the inverted
indices. Because of this, it is important to represent them as efficiently as
possible. We considered several alternatives for encoding position, font, and
capitalization: a simple encoding (a triple of integers), a compact encoding (a
hand-optimized allocation of bits), and Huffman coding.
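To make the compact encoding concrete, the following sketch packs one hit into 16 bits: one bit for capitalization, three bits for a relative font size, and twelve bits for the word position. The exact bit widths and layout here are an illustrative assumption, not the actual production format:

```python
# Pack a hit (capitalization, relative font size, word position) into 16 bits.
# Layout (illustrative): bit 15 = capitalized, bits 14-12 = font size,
# bits 11-0 = position within the document.
def encode_hit(capitalized, font_size, position):
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def decode_hit(hit):
    # Recover the three fields by shifting and masking.
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

h = encode_hit(True, 3, 1027)   # a capitalized word at position 1027
```

Two bytes per hit, rather than three integers, is what makes it feasible for hit lists to dominate the index while still fitting the collection on disk.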

                            Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and
reliability issues and even more importantly, there are social issues. Crawling
is the most fragile application since it involves interacting with hundreds of
thousands of web servers and various name servers which are all beyond the
control of the system. In order to scale to hundreds of millions of web pages,
Google has a fast distributed crawling system. A single URLserver serves lists
of URLs to a number of crawlers. Both the URLserver and the crawlers are
implemented in Python. Each crawler keeps roughly 300 connections open at
once. This is necessary to retrieve web pages at a fast enough pace. At peak
speeds, the system can crawl over 100 web pages per second using four
crawlers. This amounts to roughly 600K per second of data. A major
performance stress is DNS lookup. Each crawler maintains its own DNS
cache so it does not need to do a DNS lookup before crawling each document.
Each of the hundreds of connections can be in a number of different states:
looking up DNS, connecting to host, sending request, and receiving response.
These factors make the crawler a complex component of the system. It uses
asynchronous IO to manage events, and a number of queues to move page
fetches from state to state. It turns out that running a crawler which connects
to more than half a million servers, and generates tens of millions of log
entries generates a fair amount of email and phone calls. Because of the vast
number of people coming on line, there are always those who do not know
what a crawler is, because this is the first one they have seen. Almost daily,
we receive an email something like, "Wow, you looked at a lot of pages from
my web site. How did you like it?" There are also some people who do not
know about the robots exclusion protocol, and think their page should be
protected from indexing by a statement like, "This page is copyrighted and
should not be indexed", which needless to say is difficult for web crawlers to
understand. Also, because of the huge amount of data involved, unexpected
things will happen. For example, our system tried to crawl an online game.
This resulted in lots of garbage messages in the middle of their game! It turns
out this was an easy problem to fix. But this problem had not come up until we
had downloaded tens of millions of pages. Because of the immense variation
in web pages and servers, it is virtually impossible to test a crawler without
running it on a large part of the Internet. Invariably, there are hundreds of
obscure problems which may only occur on one page out of the whole web
and cause the crawler to crash, or worse, cause unpredictable or incorrect
behavior. Systems which access large parts of the Internet need to be
designed to be very robust and carefully tested. Since large complex systems
such as crawlers will invariably cause problems, there needs to be significant
resources devoted to reading the email and solving these problems as they
come up.
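On the robots exclusion protocol mentioned above: a well-behaved crawler consults a site's robots.txt before fetching each page. A minimal check using Python's standard library is sketched below (the rules string is an invented example, not any real site's file):

```python
# Honoring the robots exclusion protocol before fetching a URL.
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content: all crawlers are barred from /private/.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

allowed = rp.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = rp.can_fetch("MyCrawler", "http://example.com/private/data.html")
```

A statement like "This page is copyrighted and should not be indexed" in page text is invisible to this mechanism, which is exactly why such pages get crawled anyway.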

                          Indexing the Web
                                   Parsing
Any parser which is designed to run on the entire Web must handle a huge
array of possible errors. These range from typos in HTML tags to kilobytes of
zeros in the middle of a tag, non-ASCII characters, HTML tags nested
hundreds deep, and a great variety of other errors that challenge anyone's
imagination to come up with equally creative ones. For maximum speed,
instead of using YACC to generate a CFG parser, we use flex to generate a
lexical analyzer which we outfit with its own stack. Developing this parser
which runs at a reasonable speed and is very robust involved a fair amount of
work.

                     Indexing Documents into Barrels
After each document is parsed, it is encoded into a number of barrels. Every
word is converted into a wordID by using an in-memory hash table, the lexicon.
New additions to the lexicon hash table are logged to a file. Once the words
are converted into wordIDs, their occurrences in the current document are
translated into hit lists and are written into the forward barrels. The main
difficulty with parallelization of the indexing phase is that the lexicon needs to
be shared. Instead of sharing the lexicon, we took the approach of writing a
log of all the extra words that were not in a base lexicon, which we fixed at 14
million words. That way multiple indexers can run in parallel and then the small
log file of extra words can be processed by one final indexer.
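The shared-lexicon workaround can be sketched as follows: each indexer starts from the same fixed base lexicon, assigns local IDs to unseen words, and appends those words to a log for the final indexer to reconcile. The words and sizes here are toy stand-ins (the real base lexicon was fixed at 14 million words):

```python
# Convert words to wordIDs against a fixed base lexicon, logging any word
# not in the base so a final indexer can merge the extras later.
base_lexicon = {"search": 0, "engine": 1, "web": 2}  # toy base lexicon
lexicon = dict(base_lexicon)   # this indexer's working copy
extra_log = []                 # stand-in for the log file of extra words

def word_to_id(word):
    if word not in lexicon:
        extra_log.append(word)          # record for the final indexer
        lexicon[word] = len(lexicon)    # assign the next free wordID
    return lexicon[word]

ids = [word_to_id(w) for w in ["web", "crawler", "search", "crawler"]]
```

Because each indexer only appends to its own log, no lock on a shared lexicon is needed, and the logs are small enough for a single final pass.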

                                    Sorting
In order to generate the inverted index, the sorter takes each of the forward
barrels and sorts it by wordID to produce an inverted barrel for title and anchor
hits and a full text inverted barrel. This process happens one barrel at a time,
thus requiring little temporary storage. Also, we parallelize the sorting phase to
use as many machines as we have simply by running multiple sorters, which
can process different buckets at the same time. Since the barrels don't fit into
main memory, the sorter further subdivides them into baskets which do fit into
memory based on wordID and docID. Then the sorter loads each basket into
memory, sorts it and writes its contents into the short inverted barrel and the
full inverted barrel.
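In miniature, inverting one barrel amounts to regrouping (docID, wordID, hit) entries by wordID. The triples below are invented toy data, and this in-memory version skips the basket subdivision that the real sorter needs when barrels exceed main memory:

```python
# Invert a forward barrel (ordered by docID) into an inverted barrel
# (ordered by wordID), keeping each word's postings in docID order.
from collections import defaultdict

# Toy forward barrel: (docID, wordID, hit) triples, sorted by docID.
forward_barrel = [(1, 9, "hit-a"), (1, 4, "hit-b"),
                  (2, 4, "hit-c"), (3, 9, "hit-d")]

postings = defaultdict(list)
for doc_id, word_id, hit in forward_barrel:
    postings[word_id].append((doc_id, hit))

# The inverted barrel lists, for each wordID, the documents containing it.
inverted_barrel = sorted(postings.items())
```

The docID-order of each postings list falls out for free, since the forward barrel was already sorted by docID.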

                                   Searching
The goal of searching is to provide quality search results efficiently. Many of
the large commercial search engines seem to have made great progress in
terms of efficiency. Therefore, we have focused more on quality of search in
our research, although we believe our solutions are scalable to commercial
volumes with a bit more effort. We are currently investigating other ways to
solve this problem. In the past, we sorted the hits according to
PageRank, which seemed to improve the situation.

                                  Research
In addition to being a high quality search engine, Google is a research tool.
The data Google has collected has already resulted in many other papers
submitted to conferences and many more on the way. Recent research has
shown a number of limitations to queries about the Web that may be
answered without having the Web available locally. This means that Google
(or a similar system) is not only a valuable research tool but a necessary one
for a wide range of applications. We hope Google will be a resource for
searchers and researchers all around the world and will spark the next
generation of search engine technology.

                               Conclusions
Google is designed to be a scalable search engine. The primary goal is to
provide high quality search results over a rapidly growing World Wide Web.
Google employs a number of techniques to improve search quality including
PageRank, anchor text, and proximity information. Furthermore, Google is a
complete architecture for gathering web pages, indexing them, and performing
search queries over them.
                              Future Work
A large-scale web search engine is a complex system and much remains to be
done. Our immediate goals are to improve search efficiency and to scale to
approximately 100 million web pages. Some simple improvements to
efficiency include query caching, smart disk allocation, and subindices.
Another area which requires much research is updates. We must have smart
algorithms to decide what old web pages should be recrawled and what new
ones should be crawled. Work toward this goal has been done. We are
planning to add simple features supported by commercial search engines like
boolean operators, negation, and stemming. However, other features are just
starting to be explored such as relevance feedback and clustering (Google
currently supports a simple hostname based clustering). We are also working
to extend the use of link structure and link text. Simple experiments indicate
PageRank can be personalized by increasing the weight of a user's home
page or bookmarks. A web search engine is a very rich environment for
research ideas. We have far too many to list here so we do not expect this
Future Work section to become much shorter in the near future.

                         High Quality Search
The biggest problem facing users of web search engines today is the quality of
the results they get back. While the results are often amusing and expand
user’s horizons, they are often frustrating and consume precious time. For
example, the top result for a search for "Bill Clinton" on one of the most
popular commercial search engines was the Bill Clinton Joke of the Day: April
14, 1997. Google is designed to provide higher quality search so that as the
Web continues to grow rapidly, information can still be found easily. In order to
accomplish this Google makes heavy use of hypertextual information
consisting of link structure and link (anchor) text. Google also uses proximity
and font information. While evaluation of a search engine is difficult, we have
subjectively found that Google returns higher quality search results than
current commercial search engines. The analysis of link structure via
PageRank allows Google to evaluate the quality of web pages. The use of link
text as a description of what the link points to helps the search engine return
relevant (and to some degree high quality) results. Finally, the use of proximity
information helps increase relevance a great deal for many queries.

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Search Engine

  • 1. Page 1 of 19 Term Paper Presentation titled SEARCH ENGINE(s) Submitted in partial fulfillment of the requirements for the award of BACHELOR’S IN COMPUTER APPLICATIONS (BCA) of Integral University, Lucknow. Session: 2004-07 Submitted by PRASHANT MATHUR Roll Number: 0400518017 Under the guidance of Mr. Pavan Srivastava Name and Address of the Study Centre
  • 2. Page 2 of 19 UPTEC Computer Consultancy Limited, Kapoorthala, Lucknow. Contents i. Prelude ii. History iii. Challenges faced by search engines iv. How search engines work v. Storage costs and crawling time vi. Geospatially enabled search engines vii. Vertical Search engines viii. Search Engine Optimizer ix. Page Rank x. GOOGLE Architecture Overview xi. Conclusions
  • 3. Page 3 of 19 WHAT IS A SEARCH ENGINE? Prelude With billions of items of information scattered around the World Wide Web, how do you find what you are looking for? Someone might tell you the address of an interesting site. You might hear of an address from the TV, radio or a magazine. Without search engines these would be your only ways of finding things. Search engines use computer programs that spend the whole time trawling through the vast amount of information on the web. They create huge indexes of all this information. You can go to the web page of a search engine and type in what you are looking for. The search engine software will look through its indexes and give you a list of Web pages that contain the words you typed. A search engine is computer software that compiles lists of documents, most commonly those on the World Wide Web (WWW), and the contents of those documents. Search engines respond to a user entry, or query, by searching the lists and displaying a list of documents (called Web sites when on the WWW) that match the search query. Some search engines include the opening portion of the text of Web pages in their lists, but others include only the titles or addresses (known as Uniform Resource Locators, or URLs) of Web pages. Some search engines operate separately from the WWW, indexing documents on a local area network or other system. The major global general-purpose search engines include Google, Yahoo!, MSN Search, AltaVista, Lycos, and HotBot. Yahoo!—one of the first available search engines—differs from most other search sites because the content and listings are manually compiled and organized by subject into a directory. As of January 2005, Google ranked as the most comprehensive search engine available, with over eight billion pages indexed. These engines operate by building—and regularly updating—an enormous index of Web pages and files. 
This is done with the help of a Web crawler, or spider, a kind of automated browser that perpetually trolls the Web, retrieving each page it finds. Pages are then indexed according to the words they contain, with special treatment given to words in titles and other headers. When a user inputs a query, the search engine then scans the index and retrieves a list of pages that seem to best fit what the user is looking for. Search engines often return results in fractions of a second. Generally, when an engine displays a list of results, pages are ranked according to how many other sites link to those pages. The assumption is that the more useful a site is, the more often other sites will send users to it.
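The crawl-then-index pipeline described above can be sketched in a few lines of Python. This is a toy illustration, not any engine's actual code: the sample pages, the title-boost weight, and the function names are all invented for the example, but it shows the core idea of an inverted index with special treatment for title words.

```python
from collections import defaultdict

def build_index(pages):
    """Build a toy inverted index: word -> {url: score}.
    Words in a page's title get extra weight, echoing the special
    treatment given to titles described in the text (weight is arbitrary)."""
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        for word in body.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
        for word in title.lower().split():
            index[word][url] = index[word].get(url, 0) + 5  # hypothetical title boost
    return index

def search(index, query):
    """Return URLs containing every query word, best total score first."""
    word_hits = [index.get(w.lower(), {}) for w in query.split()]
    common = set.intersection(*(set(h) for h in word_hits)) if word_hits else set()
    return sorted(common, key=lambda u: -sum(h[u] for h in word_hits))

# Invented sample data for the sketch.
pages = {
    "http://example.com/a": ("Search Engines", "how search engines index the web"),
    "http://example.com/b": ("Web Crawlers", "a crawler retrieves pages from the web"),
}
```

A real engine would of course crawl these pages rather than receive them as a dictionary, and would rank with far richer signals than word counts.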
  • 4. Page 4 of 19 Google pioneered this technique in the late 1990s with a technology called PageRank. But this is not the only way of ranking results. Dozens of other criteria are used, and these will vary from engine to engine. Google A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or in a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria. This list is often sorted with respect to some measure of the relevance of the results. Search engines use regularly updated indexes to operate quickly and efficiently. Without further qualification, "search engine" usually refers to a Web search engine, which searches for information on the public Web. Other kinds of search engine are enterprise search engines, which search on intranets, personal search engines, and mobile search engines. Different selection and relevance criteria may apply in different environments, or for different uses. Some search engines also mine data available in newsgroups, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input. History The very first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v". It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of filenames; however, Archie could not search by file contents. While Archie indexed computer files, Gopher indexed plain text documents. Gopher was created in 1991 by Mark McCahill at the University of Minnesota;
  • 5. Page 5 of 19 Gopher was named after the school's mascot. Because these were text files, most of the Gopher sites became websites after the creation of the World Wide Web. Two other programs, Veronica and Jughead, searched the files stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from various Gopher servers. While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor. The first Web search engine was Wandex, a now-defunct index collected by the World Wide Web Wanderer, a web crawler developed by Matthew Gray at MIT in 1993. Another very early search engine, Aliweb, also appeared in 1993, and still runs today. The first "full text" crawler-based search engine was WebCrawler, which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which became the standard for all major search engines since. It was also the first one to be widely known by the public. Also in 1994 Lycos (which started at Carnegie Mellon University) came out, and became a major commercial endeavor. Soon after, many search engines appeared and vied for popularity. These included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. In some ways, they competed with popular directories such as Yahoo!. Later, the directories integrated or added on search engine technology for greater functionality. Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. 
Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light. Around 2000, Google's search engine rose to prominence. Its success was based in part on the concept of link popularity and PageRank. The number of other websites and webpages that link to a given page is taken into consideration with PageRank, on the premise that good or desirable pages are linked to more than others. The PageRank of linking pages and the number of links on these pages contribute to the PageRank of the linked page. This makes it possible for Google to order its results by how many websites link to each found page. Google's minimalist user interface is very popular with users, and has since spawned a number of imitators. Google and most other web engines utilize not only PageRank but more than 150 criteria to determine relevancy. The algorithm "remembers" where it has been and indexes the number of cross-links and relates these into groupings.
  • 6. Page 6 of 19 PageRank is based on citation analysis that was developed in the 1950s by Eugene Garfield at the University of Pennsylvania. Google's founders cite Garfield's work in their original paper. In this way virtual communities of webpages are found. Teoma's search technology uses a communities approach in its ranking algorithm. NEC Research Institute has worked on similar technology. Web link analysis was first developed by Jon Kleinberg and his team while working on the CLEVER project at IBM's Almaden Research Center. Google is currently the most popular search engine. Yahoo! Search The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University, started their guide in a campus trailer in February 1994 as a way to keep track of their personal interests on the Internet. Before long they were spending more time on their home-brewed lists of favorite links than on their doctoral dissertations. Eventually, Jerry and David's lists became too long and unwieldy, and they broke them out into categories. When the categories became too full, they developed subcategories ... and the core concept behind Yahoo! was born. In 2002, Yahoo! acquired Inktomi and in 2003, Yahoo! acquired Overture, which owned AlltheWeb and AltaVista. Despite owning its own search engine, Yahoo! initially kept using Google to provide its users with search results on its main website Yahoo.com. However, in 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions and providing a service that gave pre-eminence to the Web search engine over the directory. Microsoft The most recent major search engine is MSN Search (evolved into Windows Live Search), owned by Microsoft, which previously relied on others for its search engine listings. In 2004 it debuted a beta version of its own results, powered by its own web crawler (called msnbot). In early 2005 it started showing its own results live. 
This was barely noticed by average users unaware of where results come from, but was a huge development for many webmasters, who seek inclusion in the major search engines. At the same time, Microsoft ceased using results from Inktomi, now owned by Yahoo!. In 2006, Microsoft migrated to a new search platform - Windows Live Search, retiring the "MSN Search" name in the process. Challenges faced by search engines a. The Web is growing much faster than any present-technology search engine can possibly index (see distributed web crawling). In 2006, some users found major search engines became slower to index new Web pages. b. Many Web pages are updated frequently, which forces the search engine to revisit them periodically.
  • 7. Page 7 of 19 c. The queries one can make are currently limited to searching for key words, which may result in many false positives, especially using the default whole-page search. Better results might be achieved by using a proximity-search option with a search-bracket to limit matches within a paragraph or phrase, rather than matching random words scattered across large pages. Another alternative is using human operators to do the researching for the user with organic search engines. d. Dynamically generated sites may be slow or difficult to index, or may result in excessive results, perhaps generating 500 times more Web pages than average. Example: for a dynamic webpage which changes content based on entries inserted from a database, a search engine might be requested to index 50,000 static Web pages for 50,000 different parameter values passed to that dynamic webpage. e. Many dynamically generated websites are not indexable by search engines; this phenomenon is known as the invisible web. There are search engines that specialize in crawling the invisible web by crawling sites that have dynamic content, require forms to be filled out, or are password protected. f. Relevancy: sometimes the engine can't find what the person is looking for. g. Some search engines do not rank results by relevance, but by the amount of money the matching websites pay. h. In 2006, hundreds of generated websites used tricks to manipulate a search engine to display them in the higher results for numerous keywords. This can lead to some search results being polluted with linkspam or bait-and-switch pages which contain little or no information about the matching phrases. The more relevant Web pages are pushed further down in the results list, perhaps by 500 entries or more. i. Secure pages (content hosted on HTTPS URLs) pose a challenge for crawlers which either can't browse the content for technical reasons or won't index it for privacy reasons. 
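The proximity-search option mentioned in point (c) above can be illustrated with a small sketch. The helper below is hypothetical and assumes the simplest possible definition of proximity (two terms within a fixed number of words of each other), not the search-bracket mechanism of any particular engine.

```python
def within_proximity(text, term_a, term_b, max_gap=10):
    """Toy proximity search: return True if term_a and term_b both
    occur in text within max_gap words of each other (illustrative only)."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    # Any pair of occurrences close enough counts as a match.
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)
```

Restricting matches this way filters out pages where the query words appear only as "random words scattered across large pages".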
How search engines work A search engine operates in the following order: a. Web crawling b. Indexing c. Searching Web search engines work by storing information about a large number of web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings,
  • 8. Page 8 of 19 or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying the principle of least astonishment: the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere. When a user comes to the search engine and makes a query, typically by giving key words, the engine looks up the index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean terms AND, OR and NOT to further specify the search query. An advanced feature is proximity search, which allows users to define the distance between keywords. The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. 
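The Boolean AND, OR and NOT terms described above reduce to set operations on an inverted index. The following is a minimal sketch; the function name, its parameters, and the `word -> set of URLs` index shape are invented for illustration and are not any engine's API.

```python
def boolean_query(index, must=(), should=(), must_not=()):
    """Evaluate a simple Boolean query over an inverted index mapping
    word -> set of page URLs.  'must' terms are ANDed, 'should' terms
    are ORed, and 'must_not' terms are subtracted (NOT)."""
    # OR: union of all pages matching any 'should' term.
    docs = set.union(*(index.get(w, set()) for w in should)) if should else None
    # AND: intersect with pages matching every 'must' term.
    for w in must:
        hits = index.get(w, set())
        docs = set(hits) if docs is None else docs & hits
    if docs is None:
        docs = set()
    # NOT: remove pages matching any excluded term.
    for w in must_not:
        docs -= index.get(w, set())
    return docs
```

For example, `must=("search",), must_not=("spam",)` corresponds to the query `search AND NOT spam`.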
How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the controversial practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search engine results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads. The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, though some are open source. Storage Costs and Crawling Time Storage costs are not the limiting resource in search engine implementation. Simply storing 10 billion pages of 10 KB each (compressed) requires 100TB
  • 9. Page 9 of 19 and another 100TB or so for indexes, giving a total hardware cost of under $200k: 100 cheap PCs each with four 500GB disk drives. However, a public search engine requires considerably more resources than this to calculate query results and to provide high availability. Also, the costs of operating a large server farm are not trivial. Crawling 10B pages with 100 machines crawling at 100 pages/second would take 1M seconds, or 11.6 days on a very high capacity Internet connection. Most search engines crawl a small fraction of the Web (10-20% of pages) at around this frequency or better, but also crawl dynamic websites (e.g. news sites and blogs) at a much higher frequency. Geospatially enabled Search Engines A recent enhancement to search engine technology is the addition of geocoding and geoparsing to the processing of the ingested documents being indexed, to enable searching within a specified locality (or region). Geoparsing attempts to match any found references to locations and places to a geospatial frame of reference, such as a street address, gazetteer locations, or to an area (such as a polygonal boundary for a municipality). Through this geoparsing process, latitudes and longitudes are assigned to the found places, and these latitudes and longitudes are indexed for later spatial query and retrieval. This can enhance the search process tremendously by allowing a user to search for documents within a given map extent, or conversely, plot the location of documents matching a given keyword to analyze incidence and clustering, or any combination of the two. Vertical Search Engines Vertical search engines or specialized search engines are search engines which specialize in specific content categories or that search within a specific medium. 
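The storage and crawl-time figures quoted above are easy to check with back-of-the-envelope arithmetic (decimal units are assumed, as in the text):

```python
# Back-of-the-envelope check of the storage and crawl-time estimates above.
pages = 10_000_000_000              # 10 billion pages
page_kb = 10                        # ~10 KB per compressed page
storage_tb = pages * page_kb / 1_000_000_000   # KB -> TB (decimal units)

machines = 100
pages_per_sec_each = 100
crawl_seconds = pages / (machines * pages_per_sec_each)
crawl_days = crawl_seconds / 86_400            # seconds per day
```

This reproduces the 100 TB for page storage and roughly 11.6 days of crawling stated in the text.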
Popular search engines, like Google or Yahoo!, are very effective when the user searches for web sites, web pages or general information. Vertical search engines enable the user to find specific types of listings, thus making the search more customized to the user's needs. Search Engine Optimization (SEO) SEO, a subset of search engine marketing, is the process of improving the volume and quality of traffic to a web site from search engines via "natural" ("organic" or "algorithmic") search results. SEO can also target specialized searches such as image search, local search, and industry-specific vertical search engines. SEO is marketing by understanding how search algorithms work and what human visitors might search for, to help match those visitors with sites offering
  • 10. Page 10 of 19 what they are interested in finding. Some SEO efforts may involve optimizing a site's coding, presentation, and structure, without making very noticeable changes to human visitors, such as incorporating a clear hierarchical structure to a site, and avoiding or fixing problems that might keep search engine indexing programs from fully spidering a site. Other, more noticeable efforts involve including unique content on pages that can be easily indexed and extracted from those pages by search engines while also appealing to human visitors. A typical Search Engine Results Page (SERP) The term SEO can also refer to "search engine optimizers," a term adopted by an industry of consultants who carry out optimization projects on behalf of clients, and by employees of site owners who may perform SEO services in-house. Search engine optimizers often offer SEO as a stand-alone service or as a part of a larger marketing campaign. Because effective SEO can require making changes to the source code of a site, it is often very helpful when incorporated into the initial development and design of a site, leading to the use of the term "Search Engine Friendly" to describe designs, menus, content management systems and shopping carts that can be optimized easily and effectively. Optimizing for Traffic Quality In addition to seeking better rankings, search engine optimization is also concerned with traffic quality. Traffic quality is measured by how often a visitor using a specific keyword phrase leads to a desired conversion action, such as making a purchase, viewing or downloading a certain page, requesting further information, signing up for a newsletter, or taking some other specific action. By improving the quality of a page's search listings, more searchers may select that page and those searchers may be more likely to convert. Examples of SEO tactics to improve traffic quality include writing attention-grabbing titles,
  • 11. Page 11 of 19 adding accurate meta descriptions, and choosing a domain and URL that improve the site's branding. Relationship between SEO and Search Engines By 1997 search engines recognized that some webmasters were making efforts to rank well in their search engines, and even manipulating the page rankings in search results. In some early search engines, such as Infoseek, ranking first was as easy as grabbing the source code of the top-ranked page, placing it on your website, and submitting a URL to instantly index and rank that page. Due to the high value and targeting of search results, there is potential for an adversarial relationship between search engines and SEOs. In 2005, an annual conference named AirWeb was created to discuss bridging the gap and minimizing the sometimes damaging effects of aggressive web content providers. Some more aggressive site owners and SEOs generate automated sites or employ techniques that eventually get domains banned from the search engines. Many search engine optimization companies, which sell services, employ long-term, low-risk strategies, and most SEO firms that do employ high-risk strategies do so on their own affiliate, lead-generation, or content sites, instead of risking client websites. Some SEO companies employ aggressive techniques that get their client websites banned from the search results. The Wall Street Journal profiled a company, Traffic Power, that allegedly used high-risk techniques and failed to disclose those risks to its clients. Wired reported the same company sued a blogger for mentioning that they were banned. Google’s Matt Cutts later confirmed that Google did in fact ban Traffic Power and some of its clients. Some search engines have also reached out to the SEO industry, and are frequent sponsors and guests at SEO conferences and seminars. In fact, with the advent of paid inclusion, some search engines now have a vested interest in the health of the optimization community. 
All of the main search engines provide information/guidelines to help with site optimization: Google's, Yahoo!'s, MSN's and Ask.com's. Google has a Sitemaps program to help webmasters learn if Google is having any problems indexing their website and also provides data on Google traffic to the website. Yahoo! has Site Explorer that provides a way to submit your URLs for free (like MSN/Google), determine how many pages are in the Yahoo! index and drill down on inlinks to deep pages. Yahoo! has an Ambassador Program and Google has a program for qualifying Google Advertising Professionals. Types of SEO
  • 12. Page 12 of 19 SEO techniques are classified by some into two broad categories: techniques that search engines recommend as part of good design, and techniques that search engines do not approve of and attempt to minimize the effect of, referred to as spamdexing. Most professional SEO consultants do not offer spamming and spamdexing techniques amongst the services that they provide to clients. Some industry commentators classify these methods, and the practitioners who utilize them, as either "white hat SEO" or "black hat SEO". Many SEO consultants reject the black and white hat dichotomy as a convenient but unfortunate and misleading over-simplification that makes the industry look bad as a whole. White Hat An SEO tactic, technique or method is considered "white hat" if it conforms to the search engines' guidelines and/or involves no deception. As the search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White hat SEO is not just about following guidelines, but is about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see. White hat advice is generally summed up as creating content for users, not for search engines, and then making that content easily accessible to their spiders, rather than gaming the system. White hat SEO is in many ways similar to web development that promotes accessibility, although the two are not identical. Black Hat / Spamdexing "Black hat" SEO refers to methods that try to improve rankings but are disapproved of by the search engines and/or involve deception. This can range from text that is "hidden", either as text colored similarly to the background or placed in an invisible or off-screen div, to redirecting users from a page that is built for search engines to one that is more human friendly. A method that sends a user to a page that was different from the page the search engine ranked is black hat as a rule. 
One well known example is Cloaking, the practice of serving one version of a page to search engine spiders/bots and another version to human visitors. Search engines may penalize sites they discover using black hat methods, either by reducing their rankings or eliminating their listings from their databases altogether. Such penalties can be applied either automatically by the search engines' algorithms or by a manual review of a site. ‘Archie’ Search Engine Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine. The original implementation was written in 1990 by Alan Emtage, Bill Heelan, and Peter J. Deutsch, then students at McGill University in Montreal. The earliest versions
of archie simply contacted a list of FTP archives on a regular basis (contacting each roughly once a month, so as not to waste too many resources on the remote servers) and requested a listing. These listings were stored in local files to be searched using the UNIX grep command. Later, more efficient front- and back-ends were developed, and the system spread from a local tool, to a network-wide resource, to a popular service available from multiple sites around the Internet. Such archie servers could be accessed in multiple ways: using a local client (such as archie or xarchie); telneting to a server directly; sending queries by electronic mail; and later via World Wide Web interfaces.

The name derives from the word "archive", but it is also associated with the comic book series of the same name. This was not originally intended, but it certainly acted as the inspiration for the names of Jughead and Veronica, both search systems for the Gopher protocol, named after other characters from the same comics. The World Wide Web made searching for files much easier, and there are currently very few archie servers in operation. One gateway can be found in Poland.

System Features

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank. Second, Google utilizes link (anchor) text to improve search results.

Page Rank: Bringing Order to the Web

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance.
Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results. For the type of full text searches in the main Google system, PageRank also helps a great deal.

Description of Page Rank Calculation

Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is then given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. A PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.

Anchor Text

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
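The simple iterative algorithm for the PageRank formula above can be sketched in a few lines of Python. This is an illustrative sketch, not Google's actual implementation; the example graph, fixed iteration count, and function names are assumptions made here for clarity.

```python
# Minimal sketch of iterative PageRank (illustrative only, not Google's code).
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}          # initial rank for every page
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Sum PR(T)/C(T) over every page T that cites p,
            # matching the formula: each citation is normalized by
            # the number of outgoing links C(T) on the citing page.
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if p in targets)
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

# Tiny example graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Because d = 0.85 damps each pass, the iteration converges quickly; in this toy graph C ends up ranked highest, since it is cited by both A and B.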
This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm, especially because it helps search non-text information and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.

Other Features

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font
are weighted higher than other words. Third, full raw HTML of pages is available in a repository.

System Anatomy

What follows are some in-depth descriptions of important data structures. Finally, the major applications, crawling, indexing, and searching, will be examined in depth.

Google Architecture Overview

This is a high level overview of how the whole system works; later sections discuss the applications and data structures not mentioned here. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. A URLserver sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important
information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

Major Data Structures

Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible.
We considered several alternatives for encoding position, font, and capitalization: simple encoding (a triple of integers), compact encoding (a hand optimized allocation of bits), and Huffman coding.

Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers. Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four
crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries produces a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which, needless to say, is difficult for web crawlers to understand.

Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix, but it had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may occur on only one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior.
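The per-crawler DNS cache mentioned above can be sketched very simply: resolve each hostname once and reuse the answer for later fetches on the same host. The class and method names here are assumptions for illustration, not the actual crawler's code.

```python
# Illustrative per-crawler DNS cache: a hostname is resolved at most once,
# so repeated fetches from the same host skip the DNS round trip.
import socket

class CachingResolver:
    def __init__(self):
        self.cache = {}   # hostname -> IPv4 address string

    def resolve(self, hostname):
        # Only perform the (slow) lookup on a cache miss.
        if hostname not in self.cache:
            self.cache[hostname] = socket.gethostbyname(hostname)
        return self.cache[hostname]

resolver = CachingResolver()
addr = resolver.resolve("localhost")   # first call looks up, later calls hit cache
```

Since a crawler revisits the same hosts constantly, even this trivial cache removes most DNS traffic from the fetch path.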
Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there need to be significant resources devoted to reading the email and solving these problems as they come up.

Indexing the Web

Parsing

Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser so that it runs at a reasonable speed and is very robust involved a fair amount of work.
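The spirit of this error-tolerant approach can be sketched in Python: rather than a strict grammar that fails on malformed input, a simple scanner extracts words while skipping broken markup and stray bytes. This is only an illustration of the principle; Google's real parser is a flex-generated lexer.

```python
# Hedged sketch of error-tolerant word extraction: never fail on bad input,
# just skip over anything unrecognizable (illustrative, not Google's parser).
import re

WORD = re.compile(r"[A-Za-z0-9]+")

def extract_words(raw_bytes):
    # Decode permissively: latin-1 accepts every byte, so stray zeros
    # or non-ASCII garbage can never abort the parse.
    text = raw_bytes.decode("latin-1")
    # Strip anything that looks like a tag, even an unterminated one.
    text = re.sub(r"<[^>]*>?", " ", text)
    return [w.lower() for w in WORD.findall(text)]

words = extract_words(b"<html><p>Hello \x00 <b>World</unclosed")
```

Note that the unterminated `</unclosed` tag and the embedded zero byte are simply discarded instead of crashing the parse, which is exactly the robustness property a web-scale parser needs.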
Indexing Documents into Barrels

After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table, the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordIDs, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel, and the small log file of extra words can then be processed by one final indexer.

Sorting

In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. We also parallelize the sorting phase to use as many machines as we have, simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory, based on wordID and docID. The sorter then loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full inverted barrel.

Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. We are currently investigating other ways to solve this problem.
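The basket trick in the Sorting section can be sketched as follows: a forward barrel too large for memory is partitioned by wordID range into baskets that each fit, and each basket is sorted independently. The partitioning scheme and names here are simplified assumptions (the real sorter subdivides by wordID and docID).

```python
# Illustrative sketch of sorting a forward barrel into an inverted barrel
# via in-memory baskets (simplified: baskets keyed by wordID range only).
def sort_barrel(postings, basket_width=1000):
    """postings: list of (wordID, docID) pairs, arriving in docID order."""
    baskets = {}
    for word_id, doc_id in postings:
        # Route each posting to the basket covering its wordID range.
        baskets.setdefault(word_id // basket_width, []).append((word_id, doc_id))
    inverted = []
    # Visit baskets in ascending wordID order; each is small enough
    # to sort entirely in memory.
    for key in sorted(baskets):
        inverted.extend(sorted(baskets[key]))
    return inverted

inv = sort_barrel([(2500, 1), (10, 2), (2500, 0), (999, 1)])
```

The output is fully ordered by (wordID, docID), so all postings for one word are contiguous, which is exactly the layout the inverted index needs.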
In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

A Research Tool

In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences, and many more are on the way. Recent research has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.
Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates: we must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Work toward this goal has been done. We are planning to add simple features supported by commercial search engines, like boolean operators, negation, and stemming. Other features are just starting to be explored, such as relevance feedback and clustering (Google currently supports a simple hostname based clustering). We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. A web search engine is a very rich environment for research ideas; we have far too many to list here, so we do not expect this Future Work section to become much shorter in the near future.

High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time.
For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so as the Web continues to grow rapidly, information can be found easily. In order to accomplish this Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.