Term Paper Presentation titled
SEARCH ENGINE(s)

Submitted in partial fulfillment of the requirements for the award of
BACHELOR'S IN COMPUTER APPLICATIONS (BCA)
of Integral University, Lucknow.
Session: 2004-07

Submitted by
PRASHANT MATHUR
Roll Number: 0400518017

Under the guidance of
Mr. Pavan Srivastava

Name and Address of the Study Centre:
UPTEC Computer Consultancy Limited, Kapoorthala, Lucknow.
Contents
i. Prelude
ii. History
iii. Challenges faced by search engines
iv. How search engines work
v. Storage costs and crawling time
vi. Geospatially enabled search engines
vii. Vertical search engines
viii. Search engine optimization
ix. PageRank
x. Google architecture overview
xi. Conclusions
WHAT IS A SEARCH ENGINE?

Prelude

With billions of items of information scattered around the World Wide Web, how do you find what you are looking for? Someone might tell you the address of an interesting site. You might hear of an address on TV, on the radio or in a magazine. Without search engines, these would be your only ways of finding things. Search engines use computer programs that spend all their time trawling through the vast amount of information on the Web and building huge indexes of it. You can go to the web page of a search engine and type in what you are looking for, and the search engine software will look through its indexes and give you a list of Web pages that contain the words you typed.

A search engine is computer software that compiles lists of documents, most commonly those on the World Wide Web (WWW), and of the contents of those documents. Search engines respond to a user entry, or query, by searching the lists and displaying a list of documents (called Web sites when on the WWW) that match the search query. Some search engines include the opening portion of the text of Web pages in their lists, but others include only the titles or addresses (known as Uniform Resource Locators, or URLs) of Web pages. Some search engines operate separately from the WWW, indexing documents on a local area network or other system.

The major global general-purpose search engines include Google, Yahoo!, MSN Search, AltaVista, Lycos, and HotBot. Yahoo!, one of the first available search engines, differs from most other search sites because its content and listings are manually compiled and organized by subject into a directory. As of January 2005, Google ranked as the most comprehensive search engine available, with billions of pages indexed.

These engines operate by building, and regularly updating, an enormous index of Web pages and files. This is done with the help of a Web crawler, or spider, a kind of automated browser that perpetually trolls the Web, retrieving each page it finds. Pages are then indexed according to the words they contain, with special treatment given to words in titles and other headers. When a user inputs a query, the search engine scans the index and retrieves a list of pages that seem to best fit what the user is looking for. Search engines often return results in fractions of a second. Generally, when an engine displays a list of results, pages are ranked according to how many other sites link to them. The assumption is that the more useful a site is, the more often other sites will send users to it.
Google pioneered this technique in the late 1990s with a technology called PageRank. But this is not the only way of ranking results: dozens of other criteria are used, and these vary from engine to engine.

A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or on a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria. This list is often sorted with respect to some measure of the relevance of the results. Search engines use regularly updated indexes to operate quickly and efficiently.

Without further qualification, "search engine" usually refers to a Web search engine, which searches for information on the public Web. Other kinds of search engine are enterprise search engines, which search on intranets, personal search engines, and mobile search engines. Different selection and relevance criteria may apply in different environments, or for different uses. Some search engines also mine data available in newsgroups, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input.

History

The very first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v". It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of filenames; however, Archie could not search by file contents.

While Archie indexed computer files, Gopher indexed plain text documents. Gopher was created in 1991 by Mark McCahill at the University of Minnesota; it was named after the school's mascot.
Because these were text files, most of the Gopher sites became websites after the creation of the World Wide Web. Two other programs, Veronica and Jughead, searched the files stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from various Gopher servers. While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.

The first Web search engine was Wandex, a now-defunct index collected by the World Wide Web Wanderer, a web crawler developed by Matthew Gray at MIT in 1993. Another very early search engine, Aliweb, also appeared in 1993 and still runs today. The first "full text" crawler-based search engine was WebCrawler, which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which has become the standard for all major search engines since. It was also the first search engine to be widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) came out and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. In some ways, they competed with popular directories such as Yahoo!. Later, the directories integrated or added search engine technology for greater functionality.

Search engines were also known as some of the brightest stars in the Internet investing frenzy of the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some, such as Northern Light, have since taken down their public search engines and market enterprise-only editions.

Google

Google's success was based in part on the concept of link popularity and PageRank. The number of other websites and webpages that link to a given page is taken into consideration with PageRank, on the premise that good or desirable pages are linked to more than others. The PageRank of the linking pages and the number of links on those pages contribute to the PageRank of the linked page. This makes it possible for Google to order its results by how many websites link to each found page. Google's minimalist user interface is very popular with users and has since spawned a number of imitators. Google and most other web engines utilize not only PageRank but more than 150 criteria to determine relevancy. The algorithm "remembers" where it has been, indexes the number of cross-links, and relates these into groupings.
PageRank is based on citation analysis, developed in the 1950s by Eugene Garfield at the University of Pennsylvania; Google's founders cite Garfield's work in their original paper. In this way virtual communities of webpages are found. Teoma's search technology uses a communities approach in its ranking algorithm. NEC Research Institute has worked on similar technology. Web link analysis was first developed by Jon Kleinberg and his team while working on the CLEVER project at IBM's Almaden Research Center. Google is currently the most popular search engine.

Yahoo! Search

The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University, started their guide in a campus trailer in February 1994 as a way to keep track of their personal interests on the Internet. Before long they were spending more time on their home-brewed lists of favorite links than on their doctoral dissertations. Eventually, Jerry and David's lists became too long and unwieldy, and they broke them out into categories. When the categories became too full, they developed subcategories, and the core concept behind Yahoo! was born. In 2002, Yahoo! acquired Inktomi, and in 2003 it acquired Overture, which owned AlltheWeb and AltaVista. Despite owning its own search engine, Yahoo! initially kept using Google to provide its users with search results on its main website, Yahoo.com. However, in 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the Web search engine over the directory.

Microsoft

The most recent major search engine is MSN Search (which evolved into Windows Live Search), owned by Microsoft, which previously relied on others for its search engine listings. In 2004 it debuted a beta version of its own results, powered by its own web crawler (called msnbot). In early 2005 it started showing its own results live. This was barely noticed by average users unaware of where results come from, but it was a huge development for many webmasters, who seek inclusion in the major search engines. At the same time, Microsoft ceased using results from Inktomi, by then owned by Yahoo!. In 2006, Microsoft migrated to a new search platform, Windows Live Search, retiring the "MSN Search" name in the process.

Challenges faced by search engines

a. The Web is growing much faster than any present-technology search engine can possibly index (see distributed web crawling). In 2006, some users found that major search engines had become slower to index new web pages.
b. Many web pages are updated frequently, which forces the search engine to revisit them periodically.
c. The queries one can make are currently limited to searching for keywords, which may result in many false positives, especially using the default whole-page search. Better results might be achieved by using a proximity-search option with a search bracket to limit matches to within a paragraph or phrase, rather than matching random words scattered across large pages. Another alternative is using human operators to do the researching for the user with organic search engines.
d. Dynamically generated sites may be slow or difficult to index, or may produce excessive results, perhaps generating 500 times more web pages than average. For example, for a dynamic webpage that changes content based on entries taken from a database, a search engine might be asked to index 50,000 static web pages for 50,000 different parameter values passed to that dynamic webpage.
e. Many dynamically generated websites are not indexable by search engines; this phenomenon is known as the invisible web. There are search engines that specialize in crawling the invisible web by crawling sites that have dynamic content, require forms to be filled out, or are password protected.
f. Relevancy: sometimes the engine cannot find what the person is looking for.
g. Some search engines do not rank results by relevance, but by the amount of money the matching websites pay.
h. In 2006, hundreds of generated websites used tricks to manipulate search engines into displaying them in the top results for numerous keywords. This can lead to some search results being polluted with linkspam or bait-and-switch pages which contain little or no information about the matching phrases. The more relevant web pages are pushed further down in the results list, perhaps by 500 entries or more.
i. Secure pages (content hosted on HTTPS URLs) pose a challenge for crawlers, which either cannot browse the content for technical reasons or will not index it for privacy reasons.

How search engines work

A search engine operates in the following order:
a. Web crawling
b. Indexing
c. Searching

Web search engines work by storing information about a large number of web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries.
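As a minimal illustration of the fetch-and-exclude step just described, the sketch below uses Python's standard library to honor robots.txt before downloading a page. The URL and the crawler name are placeholders, and a real crawler would add politeness delays, link extraction, and error handling.

    import urllib.request
    import urllib.robotparser

    USER_AGENT = "example-crawler"                 # hypothetical crawler name
    url = "http://example.com/some/page.html"      # placeholder URL

    # Check the site's robots.txt before fetching, as described above.
    robots = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
    robots.read()

    if robots.can_fetch(USER_AGENT, url):
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            html = response.read().decode("utf-8", errors="replace")
        print(f"Fetched {len(html)} characters from {url}")
    else:
        print("robots.txt disallows crawling this URL")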
Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual indexed text, so it can be very useful when the content of the live page has been updated and the search terms no longer appear in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will appear on the returned page, in keeping with the principle of least astonishment. Beyond increased search relevance, cached pages are also useful because they may contain data that is no longer available elsewhere.

When a user comes to the search engine and makes a query, typically by giving keywords, the engine looks up the index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean terms AND, OR and NOT to further specify the search query. An advanced feature is proximity search, which allows users to define the distance between keywords.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results so as to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.

Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the controversial practice of allowing advertisers to pay to have their listings ranked higher in search results. Those search engines which do not accept money for their results make money by running search-related ads alongside the regular results; the search engines make money every time someone clicks on one of these ads. The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, though some are open source.
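To make the index-then-search flow concrete, the following toy sketch builds an inverted index over a few hard-coded documents and answers simple AND/NOT queries. The document texts and queries are invented for illustration and bear no relation to any real engine's data structures.

    from collections import defaultdict

    # Toy corpus: docID -> page text (invented examples).
    pages = {
        1: "search engines index the web",
        2: "a web crawler retrieves pages from the web",
        3: "directories are compiled by human editors",
    }

    # Build an inverted index: word -> set of docIDs containing it.
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    def search(must_have, must_not_have=()):
        """Return docIDs containing all words in must_have (AND)
        and none of the words in must_not_have (NOT)."""
        results = set(pages)
        for word in must_have:
            results &= index.get(word, set())
        for word in must_not_have:
            results -= index.get(word, set())
        return sorted(results)

    print(search(["web"]))                  # [1, 2]
    print(search(["web"], ["crawler"]))     # [1]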
Storage Costs and Crawling Time

Storage costs are not the limiting resource in search engine implementation. Simply storing 10 billion pages of 10 KB each (compressed) requires 100 TB, and another 100 TB or so for indexes, giving a total hardware cost of under $200k: 100 cheap PCs, each with four 500 GB disk drives. However, a public search engine requires considerably more resources than this to calculate query results and to provide high availability, and the costs of operating a large server farm are not trivial. Crawling 10 billion pages with 100 machines crawling at 100 pages/second each would take 1 million seconds, or about 11.6 days, on a very high-capacity Internet connection. Most search engines crawl a small fraction of the Web (10-20% of pages) at around this frequency or better, but crawl dynamic websites (e.g. news sites and blogs) at a much higher frequency.
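The back-of-the-envelope figures above can be reproduced directly; the numbers below simply restate the assumptions made in the text (10 billion pages, 10 KB per compressed page, 100 machines at 100 pages per second each).

    pages = 10_000_000_000            # 10 billion pages
    page_size = 10_000                # 10 KB per compressed page, in bytes (decimal)
    machines = 100
    rate = 100                        # pages per second per machine

    storage_tb = pages * page_size / 1e12           # bytes -> TB (decimal)
    crawl_seconds = pages / (machines * rate)
    crawl_days = crawl_seconds / 86_400

    print(f"storage: {storage_tb:.0f} TB")                               # 100 TB for the pages alone
    print(f"crawl:   {crawl_seconds:,.0f} s = {crawl_days:.1f} days")    # 1,000,000 s = 11.6 days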
Geospatially Enabled Search Engines

A recent enhancement to search engine technology is the addition of geocoding and geoparsing to the processing of the ingested documents being indexed, to enable searching within a specified locality (or region). Geoparsing attempts to match any references to locations and places found in a document to a geospatial frame of reference, such as a street address, gazetteer locations, or an area (such as a polygonal boundary for a municipality). Through this geoparsing process, latitudes and longitudes are assigned to the found places, and these latitudes and longitudes are indexed for later spatial query and retrieval. This can enhance the search process tremendously by allowing a user to search for documents within a given map extent, or conversely, to plot the locations of documents matching a given keyword in order to analyze incidence and clustering, or any combination of the two. See the list of search engines for examples of companies which offer this feature.

Vertical Search Engines

Vertical search engines, or specialized search engines, are search engines which specialize in specific content categories or which search within a specific medium. Popular search engines, like Google or Yahoo!, are very effective when the user searches for web sites, web pages or general information. Vertical search engines enable the user to find specific types of listings, making the search more customized to the user's needs.

Search Engine Optimization (SEO)

Search engine optimization, a subset of search engine marketing, is the process of improving the volume and quality of traffic to a web site from search engines via "natural" ("organic" or "algorithmic") search results. SEO can also target specialized searches such as image search, local search, and industry-specific vertical search engines. SEO is marketing by understanding how search algorithms work and what human visitors might search for, in order to match those visitors with sites offering what they are interested in finding.

Some SEO efforts involve optimizing a site's coding, presentation, and structure without making very noticeable changes for human visitors, such as incorporating a clear hierarchical structure in a site and avoiding or fixing problems that might keep search engine indexing programs from fully spidering the site. Other, more noticeable efforts involve including unique content on pages that can be easily indexed and extracted by search engines while also appealing to human visitors.

(Figure: a typical Search Engine Results Page, or SERP.)

The term SEO can also refer to "search engine optimizers," a term adopted by an industry of consultants who carry out optimization projects on behalf of clients, and by employees of site owners who may perform SEO services in-house. Search engine optimizers often offer SEO as a stand-alone service or as part of a larger marketing campaign. Because effective SEO can require making changes to the source code of a site, it is often very helpful to incorporate it into the initial development and design of a site, which has led to the term "search engine friendly" to describe designs, menus, content management systems and shopping carts that can be optimized easily and effectively.

Optimizing for Traffic Quality

In addition to seeking better rankings, search engine optimization is also concerned with traffic quality. Traffic quality is measured by how often a visitor using a specific keyword phrase leads to a desired conversion action, such as making a purchase, viewing or downloading a certain page, requesting further information, signing up for a newsletter, or taking some other specific action. By improving the quality of a page's search listings, more searchers may select that page, and those searchers may be more likely to convert. Examples of SEO tactics to improve traffic quality include writing attention-grabbing titles, adding accurate meta descriptions, and choosing a domain and URL that improve the site's branding.
Relationship between SEO and Search Engines

By 1997 search engines recognized that some webmasters were making efforts to rank well in their results, and were even manipulating page rankings. In some early search engines, such as Infoseek, ranking first was as easy as grabbing the source code of the top-ranked page, placing it on your website, and submitting a URL to instantly index and rank that page. Due to the high value and targeting of search results, there is potential for an adversarial relationship between search engines and SEOs. In 2005, an annual conference named AIRWeb was created to discuss bridging the gap and minimizing the sometimes damaging effects of aggressive web content providers.

Some more aggressive site owners and SEOs generate automated sites or employ techniques that eventually get domains banned from the search engines. Many search engine optimization companies employ long-term, low-risk strategies, and most SEO firms that do employ high-risk strategies do so on their own affiliate, lead-generation, or content sites, instead of risking client websites. Some SEO companies, however, employ aggressive techniques that get their client websites banned from the search results. The Wall Street Journal profiled a company, Traffic Power, that allegedly used high-risk techniques and failed to disclose those risks to its clients. Wired reported that the same company sued a blogger for mentioning that it had been banned. Google's Matt Cutts later confirmed that Google did in fact ban Traffic Power and some of its clients.

Some search engines have also reached out to the SEO industry and are frequent sponsors and guests at SEO conferences and seminars. In fact, with the advent of paid inclusion, some search engines now have a vested interest in the health of the optimization community. All of the main search engines provide information and guidelines to help with site optimization: Google's, Yahoo!'s, MSN's and Ask.com's. Google has a Sitemaps program to help webmasters learn whether Google is having any problems indexing their website, and it also provides data on Google traffic to the website. Yahoo! has Site Explorer, which provides a way to submit URLs for free (like MSN and Google), determine how many pages are in the Yahoo! index, and drill down on inlinks to deep pages. Yahoo! has an Ambassador Program, and Google has a program for qualifying Google Advertising Professionals.
Types of SEO

SEO techniques are classified by some into two broad categories: techniques that search engines recommend as part of good design, and techniques that search engines do not approve of and attempt to minimize the effect of, referred to as spamdexing. Most professional SEO consultants do not offer spamming and spamdexing techniques among the services that they provide to clients. Some industry commentators classify these methods, and the practitioners who utilize them, as either "white hat SEO" or "black hat SEO". Many SEO consultants reject the black hat and white hat dichotomy as a convenient but unfortunate and misleading over-simplification that makes the industry look bad as a whole.

White Hat

An SEO tactic, technique or method is considered "white hat" if it conforms to the search engines' guidelines and/or involves no deception. As the search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White hat SEO is not just about following guidelines; it is about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see. White hat advice is generally summed up as creating content for users, not for search engines, and then making that content easily accessible to spiders, rather than trying to game the system. White hat SEO is in many ways similar to web development that promotes accessibility, although the two are not identical.

Black Hat / Spamdexing

"Black hat" SEO refers to methods of trying to improve rankings that are disapproved of by the search engines and/or involve deception. These range from hiding text, either by coloring it similarly to the background or by placing it in an invisible div or one positioned off-screen, to redirecting users from a page built for search engines to one that is more human-friendly. As a rule, a method that sends a user to a page different from the page the search engine ranked is black hat. One well-known example is cloaking, the practice of serving one version of a page to search engine spiders or bots and another version to human visitors. Search engines may penalize sites they discover using black hat methods, either by reducing their rankings or by eliminating their listings from their databases altogether. Such penalties can be applied either automatically by the search engines' algorithms or by a manual review of a site.

The 'Archie' Search Engine

Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine. The original implementation was written in 1990 by Alan Emtage, Bill Heelan, and Peter J. Deutsch, then students at McGill University in Montreal. The earliest versions of Archie simply contacted a list of FTP archives on a regular basis (contacting each roughly once a month, so as not to waste too many resources on the remote servers) and requested a listing. These listings were stored in local files to be searched using the UNIX grep command.
Later, more efficient front- and back-ends were developed, and the system spread from a local tool, to a network-wide resource, to a popular service available from multiple sites around the Internet. Such Archie servers could be accessed in multiple ways: using a local client (such as archie or xarchie); telnetting to a server directly; sending queries by electronic mail; and later via World Wide Web interfaces. The name derives from the word "archive", but it is also associated with the comic book series of the same name. This was not originally intended, but it certainly acted as the inspiration for the names of Jughead and Veronica, both search systems for the Gopher protocol, named after other characters from the same comics. The World Wide Web made searching for files much easier, and there are currently very few Archie servers in operation; one gateway can be found in Poland.

System Features

The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page; this ranking is called PageRank. Second, Google utilizes link (anchor) text to improve search results.

PageRank: Bringing Order to the Web

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text-matching search restricted to web page titles performs admirably when PageRank prioritizes the results. For the type of full-text searches in the main Google system, PageRank also helps a great deal.

Description of the PageRank Calculation

Academic citation analysis has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows. We assume page A has pages T1...Tn which point to it (i.e., cite it). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is then:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. A PageRank for 26 million web pages can be computed in a few hours on a medium-sized workstation. There are many other details which are beyond the scope of this paper.
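The iterative calculation mentioned above can be sketched in a few lines. The tiny three-page link graph below is invented purely to show the update rule PR(A) = (1 - d) + d * sum(PR(Ti)/C(Ti)); it does not reflect any real data.

    # Toy link graph (invented): page -> pages it links to.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
    }
    d = 0.85                             # damping factor, as in the text
    pr = {page: 1.0 for page in links}   # initial guess

    for _ in range(50):                  # repeat until the values settle
        new_pr = {}
        for page in links:
            # Pages Ti that point to this page, each contributing PR(Ti)/C(Ti).
            incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr

    for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))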
Anchor Text

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user; in this case, the search engine can even return a page that never actually existed but had hyperlinks pointing to it. However, it is possible to sort the results so that this particular problem rarely happens.

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm, especially because it helps in searching non-text information and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better-quality results. Using anchor text efficiently is technically difficult because of the large amount of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.
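A toy illustration of the anchor-text propagation described above: anchor text found on a source page is credited to the target URL, so the target can be retrieved even if it was never crawled or contains no indexable text. The pages and anchors below are made up.

    from collections import defaultdict

    # Links observed while parsing crawled pages (invented examples):
    # (source page, anchor text, target URL)
    observed_links = [
        ("http://a.example/index.html", "cute cat pictures", "http://b.example/cats.jpg"),
        ("http://c.example/blog.html",  "cat photos",        "http://b.example/cats.jpg"),
        ("http://a.example/index.html", "contact us",        "http://a.example/contact.html"),
    ]

    # Credit each anchor's words to the *target* of the link.
    anchor_index = defaultdict(set)
    for _source, anchor_text, target in observed_links:
        for word in anchor_text.lower().split():
            anchor_index[word].add(target)

    # The image URL is now findable by text it never contained itself.
    print(anchor_index["cat"])   # {'http://b.example/cats.jpg'}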
Other Features

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits, and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details, such as the font size of words: words in a larger or bolder font are weighted higher than other words. Third, the full raw HTML of pages is available in a repository.

System Anatomy

The following subsections give a high-level overview of the whole system, describe the important data structures, and examine the major applications of crawling, indexing, and searching in more depth.

Google Architecture Overview

Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux. In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. A URLserver sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver, which compresses the web pages and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.

The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer also performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
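To illustrate the hit-and-barrel idea just described, here is a toy sketch in which parsed word occurrences are recorded as (position, font, capitalization) information and distributed into a few forward-index barrels by a made-up wordID hash. Google's actual encodings and barrel layout are far more compact; every name and constant here is invented for the example.

    from collections import defaultdict

    def make_hits(doc_id, words):
        """Turn a parsed document into hits: (docID, wordID, position, big_font, capitalized)."""
        hits = []
        for position, (word, big_font) in enumerate(words):
            word_id = hash(word.lower()) & 0xFFFFFF     # stand-in for a lexicon lookup
            hits.append((doc_id, word_id, position, big_font, word[0].isupper()))
        return hits

    NUM_BARRELS = 4
    barrels = defaultdict(list)     # barrel number -> list of hits (the forward index)

    # Invented parsed page: (word, appears-in-large-font?) pairs.
    parsed_page = [("Search", True), ("Engine", True), ("results", False), ("ranking", False)]

    for hit in make_hits(doc_id=42, words=parsed_page):
        word_id = hit[1]
        barrels[word_id % NUM_BARRELS].append(hit)      # distribute hits across barrels by wordID

    for barrel_no, hit_list in sorted(barrels.items()):
        print(barrel_no, hit_list)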
The URLresolver reads the anchors file and converts relative URLs into absolute URLs, and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon, together with the inverted index and the PageRanks, to answer queries.

Major Data Structures

Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices, so it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization: a simple encoding (a triple of integers), a compact encoding (a hand-optimized allocation of bits), and Huffman coding.
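As a rough illustration of what a hand-optimized allocation of bits might look like, the sketch below packs one hit into a single 32-bit integer: 1 capitalization bit, 3 font-size bits, and 28 position bits. The field widths are chosen for this example only and are not Google's actual layout.

    # Pack one hit into a 32-bit integer: 1 capitalization bit, 3 font-size bits,
    # 28 position bits. (Field widths invented for this example.)
    def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
        assert 0 <= font_size < 8 and 0 <= position < 2**28
        return (int(capitalized) << 31) | (font_size << 28) | position

    def unpack_hit(hit: int):
        return bool(hit >> 31), (hit >> 28) & 0b111, hit & (2**28 - 1)

    packed = pack_hit(capitalized=True, font_size=5, position=1234)
    print(hex(packed))            # 0xd00004d2
    print(unpack_hit(packed))     # (True, 5, 1234)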
Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers. Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers, which amounts to roughly 600 KB of data per second. A major performance stress is DNS lookup, so each crawler maintains its own DNS cache and does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to the host, sending the request, and receiving the response. These factors make the crawler a complex component of the system. It uses asynchronous I/O to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries produces a fair amount of email and phone calls. Because of the vast number of people coming online, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email along the lines of, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things happen. For example, our system tried to crawl an online game, which resulted in lots of garbage messages in the middle of the game! It turned out to be an easy problem to fix, but it had not come up until we had downloaded tens of millions of pages.

Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up.
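The pattern described above of many open connections driven by asynchronous I/O can be sketched with Python's asyncio and a third-party HTTP client. The aiohttp dependency, the seed URLs, and the connection limit are assumptions made for this example; they are not details of Google's crawler.

    import asyncio
    import aiohttp   # third-party HTTP client, assumed to be installed

    MAX_CONNECTIONS = 30          # stand-in for the "roughly 300 connections" above
    seed_urls = ["http://example.com/", "http://example.org/"]   # placeholder URLs

    async def fetch(session, url, limit):
        async with limit:                       # cap the number of open connections
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    body = await resp.text()
                    print(f"{url}: {resp.status}, {len(body)} characters")
            except Exception as exc:            # a real crawler logs the error and moves on
                print(f"{url}: failed ({exc})")

    async def crawl(urls):
        limit = asyncio.Semaphore(MAX_CONNECTIONS)
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(fetch(session, url, limit) for url in urls))

    asyncio.run(crawl(seed_urls))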
Indexing the Web

Parsing

Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser, which runs at a reasonable speed and is very robust, involved a fair amount of work.

Indexing Documents into Barrels

After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table, the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels. The main difficulty with parallelizing the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way, multiple indexers can run in parallel and the small log file of extra words can then be processed by one final indexer.

Sorting

In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. We also parallelize the sorting phase to use as many machines as we have, simply by running multiple sorters, which can process different buckets at the same time. Since the barrels do not fit into main memory, the sorter further subdivides them into baskets which do fit into memory, based on wordID and docID. The sorter then loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full inverted barrel.
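A toy version of the sort step just described: forward-index entries (stored in docID order) are grouped into small in-memory baskets by wordID range, each basket is sorted, and the results are concatenated into an inverted ordering. The basket width and the sample hits are invented for the example.

    from collections import defaultdict

    # Forward barrel: hits stored in docID order as (docID, wordID, position).
    forward_barrel = [
        (1, 900, 0), (1, 17, 1), (1, 433, 2),
        (2, 17, 0), (2, 433, 1),
        (3, 900, 0), (3, 17, 1),
    ]

    BASKET_WIDTH = 500   # wordID range covered by each in-memory basket (invented)

    # Subdivide by wordID range so that each basket fits comfortably in memory.
    baskets = defaultdict(list)
    for doc_id, word_id, position in forward_barrel:
        baskets[word_id // BASKET_WIDTH].append((word_id, doc_id, position))

    # Sort each basket by (wordID, docID) and concatenate: the inverted barrel.
    inverted_barrel = []
    for basket_key in sorted(baskets):
        inverted_barrel.extend(sorted(baskets[basket_key]))

    for word_id, doc_id, position in inverted_barrel:
        print(f"wordID {word_id}: docID {doc_id}, position {position}")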
Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency; therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

A Research Tool

In addition to being a high-quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences, with many more on the way. Recent research has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.

Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates: we must have smart algorithms to decide which old web pages should be recrawled and which new ones should be crawled. Work toward this goal has been done. We are planning to add simple features supported by commercial search engines, like Boolean operators, negation, and stemming. However, other features are just starting to be explored, such as relevance feedback and clustering (Google currently supports a simple hostname-based clustering). We are also working to extend the use of link structure and link text. Simple experiments indicate that PageRank can be personalized by increasing the weight of a user's home page or bookmarks. Web search is a very rich environment for research ideas; we have far too many to list here, so we do not expect this Future Work section to become much shorter in the near future.

High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher-quality search so that, as the Web continues to grow rapidly, information can still be found easily. In order to accomplish this, Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher-quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high-quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.