Searching the Web: General and Scientific Information Access
Steve Lawrence and C. Lee Giles, NEC Research Institute
IEEE Communications Magazine • January 1999
0163-6804/99/$10.00 © 1999 IEEE

ABSTRACT
The World Wide Web has revolutionized the way people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government, and health care. There are many avenues for improvement of the Web; for example, in the areas of locating and organizing information. Current techniques for access to both general and scientific information on the Web provide much room for improvement; search engines do not provide comprehensive indices of the Web and have difficulty in accurately ranking the relevance of results. Scientific information on the Web is very disorganized. We discuss the effectiveness of Web search engines, including results showing that the major Web search engines cover only a fraction of the "publicly indexable Web." Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search. The creation of digital libraries incorporating autonomous citation indexing is discussed for improved access to scientific information on the Web.

The World Wide Web is revolutionizing the way people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government, and health care. The amount of publicly available information on the Web is increasing rapidly [1]. The Web is a gigantic digital library, a searchable 15-billion-word encyclopedia [2]. It has stimulated research and development in information retrieval and dissemination, and fostered search engines such as AltaVista. These new developments are not limited to the Web, and can enhance access to virtually all forms of digital libraries.

The revolution the Web has brought to information access is not so much due to the availability of information (huge amounts of information have long been available in libraries and elsewhere), but rather the increased efficiency of accessing information, which can make previously impractical tasks practical. There are many avenues for improvement in the efficiency of accessing information on the Web, for example, in the areas of locating and organizing information.

This article discusses general and scientific information access on the Web, and many of our comments are applicable to digital libraries in general. The effectiveness of Web search engines is discussed, including results that show that the major search engines cover only a fraction of the "publicly indexable Web" (the part of the Web which is considered for indexing by the major engines, which excludes pages hidden behind search forms, pages with authorization requirements, etc.). Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search.

The amount of scientific information and the number of electronic journals on the Internet continue to increase. Researchers are increasingly making their work available online. This article also discusses the creation of digital libraries of the scientific literature, incorporating autonomous citation indexing. The autonomous creation of citation indices is possible today, and can improve access to scientific information on the Web or in other digital libraries of scientific articles.

WEB SEARCH
One of the key aspects of the World Wide Web which makes it a valuable information resource is that the full text of documents can be searched using Web search engines such as AltaVista and HotBot. Just how effective are the Web search engines? The following sections discuss the effectiveness of current engines and current research into improved techniques.

THE COMPREHENSIVENESS AND RECENCY OF THE WEB SEARCH ENGINES
This section considers the effectiveness of the major Web search engines in terms of comprehensiveness and recency. We provide results on the size of the Web, the coverage of each search engine, and the freshness of the search engine databases. These results show that none of the search engines covers more than about one third of the publicly indexable Web, and that the freshness of the various databases varies significantly.

Typical quotes regarding the coverage and recency of the major search engine databases include: "If you can't find it using AltaVista search, it's probably not out there" [3], "[With AltaVista] you can find new information just about as quickly as it's available on the Web" [3], and "HotBot is the first search robot capable of indexing and searching the entire Web" [4]. However, the World Wide Web is a distributed, dynamic, and rapidly growing [1] information resource that presents difficulties to traditional information retrieval technologies. Traditional information retrieval software was designed for different environments, and has typically been used for indexing a static collection of directly accessible documents. The nature of the Web brings up questions such as: can the centralized architecture of the search engines keep up with the increasing number of documents on the Web? Can they update their databases regularly to detect modified, deleted, and relocated information? Answers to these questions impact the best methodology to use when searching the Web, and the future of Web search technology.

We performed a study of the comprehensiveness and recency of the major Web search engines in December 1997, by analyzing the responses of AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light to 575 queries made by employees at the NEC Research Institute [1]. Search engines rank documents differently and can return documents that do not contain the query terms (e.g., pages with morphological variants or synonyms).
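The validation step used in the study, counting only documents that can be downloaded and that actually contain every query term, can be sketched as follows. This is a minimal illustration, not the study's code: the `fetch` callback and the URL-normalization rules shown are hypothetical stand-ins for the details described in [1].

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Normalize a URL so the same page is counted once
    (lowercase scheme and host, drop fragments, default path)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def valid_results(urls, query_terms, fetch):
    """Keep only pages that can be downloaded and that contain
    every query term (case-insensitive full-text check).
    fetch(url) returns the page text, or None for a dead link."""
    seen, kept = set(), []
    for url in urls:
        url = normalize_url(url)
        if url in seen:          # duplicate after normalization
            continue
        seen.add(url)
        text = fetch(url)
        if text is None:         # dead link
            continue
        text = text.lower()
        if all(term.lower() in text for term in query_terms):
            kept.append(url)
    return kept
```

In use, `fetch` would wrap an HTTP client; dead links and pages returned only because of morphological variants are both filtered out before counting.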
Therefore, we only considered queries for which we could download the full text of every document that each engine reports as matching the query. Documents were only counted if they could be downloaded and contained the query terms. We handled other important details, such as the normalization of URLs, and capitalization and morphology (full details can be found in [1]).

Table 1. Estimated coverage of each engine with respect to the estimated size of the Web, and the percentage of invalid links returned by each engine (from 575 queries performed December 15–17, 1997).

  Search engine                       HotBot  AltaVista  Northern Light  Excite  Infoseek  Lycos
  Coverage w.r.t. estimated Web size    34%      28%          20%          14%      10%      3%
  Percentage of dead links returned    5.3%     2.5%         5.0%         2.0%     2.6%     1.6%

Table 1 shows the estimated coverage of the search engines, which varies by an order of magnitude. This variation is much greater than would be expected from considering the number of pages that each engine reports to have indexed. The variation may be explained by differences in indexing or retrieval technology between the engines (e.g., an engine would appear to be smaller if it only indexed part of the text on some pages), or differences in the kinds of pages indexed (our study used mostly scientific queries, which may not be covered as well if an engine focuses more on well-connected, "popular" pages). Note that the results in the table are specific to the particular queries performed (typical queries made by scientists), and the state of the engine databases at the time they were performed.

We estimated a lower bound on the size of the publicly indexable Web to be 320 million pages. In order to produce this estimate, we analyzed the overlap between pairs of engines [1]. Consider two engines a and b. Using the assumption that each engine samples the Web independently, the quantity n_o/n_b, where n_o is the number of documents returned by both engines and n_b is the number of documents returned by engine b, is an estimate of the fraction of the indexable Web, p_a, covered by engine a. The size of the indexable Web can then be estimated with s_a/p_a, where s_a is the number of pages indexed by engine a. This technique is limited because the engines do not choose pages to sample independently; they all allow pages to be registered, and they are typically biased toward indexing more popular or well-connected pages. To estimate the size of the Web we used the overlap between the largest two engines, where the independence assumption is more valid (the larger engines can index more of the nonregistered and less popular pages). Some dependence between the sampling of the engines remains between the largest two engines, and therefore this estimate is a lower bound. Using this estimate of the size of the Web, we found that no engine indexes more than about one third of the indexable Web. We also found that combining the results of the six engines returned approximately 3.5 times more documents on average when compared to using only one engine.

Recall that the queries used in the study were from the employees of the NEC Research Institute. Most of the employees are scientists, and scientists tend to search for less "popular" or harder-to-find information. This is beneficial when estimating the size of the Web as above. However, the search engines are typically biased toward indexing more "popular" information. Therefore, the coverage of the search engines is typically better for more popular information.

There are a number of possible reasons why the major search engines do not provide comprehensive indices of the Web: the engines may be limited by network bandwidth, disk storage, computational power, scalability of their indexing and retrieval technology, or a combination of these items (despite claims to the contrary [5]). Because Web pages are continually added and modified, a truly comprehensive index would have to index all pages simultaneously, which is not currently possible. Furthermore, there may be many pages that contain no links to them, making it difficult for the search engines to know that the pages exist.

We also looked at the percentage of dead links returned by the search engines, which is related to how often the engines update their databases. Intuitively, it is possible for a trade-off to exist between the comprehensiveness and freshness of a search engine; it should be possible to check for modified documents and update the index more rapidly if the index is smaller. Some evidence of such a trade-off was found — the most comprehensive engine had the largest percentage of dead links, and the least comprehensive engine had the smallest percentage of dead links. Table 1 shows the percentage of invalid links for each search engine. However, we found that the rating of the engines in terms of the percentage of dead links varies greatly over time. This provides evidence that the search engines may not be very regular in their indexing processes; for example, an engine might suspend the processing of new pages for a period of time during upgrades.

How can this knowledge of the effectiveness of the search engines be used to improve Web search? The coverage investigations indicate that the coverage of the Web engines is much lower than commonly believed, and that the engines tend to index different sets of pages. This indicates that when searching for less popular information, it can be very useful to combine the results of multiple engines. The freshness investigations indicate that it is difficult to predict ahead of time which search engine will be the best engine to use when looking for recent information. Therefore, it can also be very useful to combine the results of multiple engines when looking for recent information. There are other ways to compare the search engines besides comprehensiveness and recency, such as how well the engines rank the relevance of results (discussed in the next section), and features of the query interface.

RESEARCH IN WEB SEARCH
Research into technology for searching the Web is abundant, which is not surprising considering that the existence of full-text search engines is one of the major differences between the Web and previous means of accessing information. The following sections look specifically at some of the recent research: improved methods for ranking pages that utilize the graph structure of the Web, a metasearch technique that can improve the efficiency of Web search by downloading matching pages in order to extract query term context and analyze the pages, and "softbots" which can be used to locate pages that may not be indexed by any of the engines.

Page Relevance — A common complaint against search engines is that they return too many pages, and that many of them have low relevance to the query. This has been used as an argument for not providing comprehensive indices of the Web ("people are already overloaded with too much information").
However, a search engine could be more comprehensive while still returning the same set of pages first. One of the main problems is that the search engines do not rank the relevance of results very well. Research search engines such as Google [6] and LASER [7] promise improved ranking of results. These engines make greater use of HTML structure and the graph formed by hyperlinks in order to determine page relevancy than do the major Web search engines. For example, Google uses a ranking algorithm called PageRank that iteratively uses information from the number of pages pointing to each page (which is related to the popularity of the pages). Google also uses the text in links to a page as descriptors of the page (the links often contain better descriptions of the pages than the pages themselves). Another engine with a novel ranking measure is Direct Hit, which is typically good for common queries. Direct Hit ranks results for a given query according to the number of times previous users have clicked on the pages (i.e., the more popular pages are ranked higher).

Kleinberg [8] has presented a method for locating two types of useful pages — authorities, which are highly referenced pages, and hubs, which are pages that contain links to many authorities. The underlying principle is the following: good hub pages point to many good authority pages, and a good authority page is pointed to by many good hub pages. An iterative process can be used to find hubs and authorities [8]. Future search engines may use this method to classify hub and authority pages, and to rank the pages within these classes.

Metasearch — Limitations of the search services have led to the introduction of metasearch engines [9]. A metasearch engine searches the Web by making requests to multiple search engines such as AltaVista or HotBot. The primary advantages of current metasearch engines are the ability to combine the results of multiple search engines and the ability to provide a consistent user interface for searching these engines. The idea of querying and collating results from multiple databases is not new. Companies like PLS, Lexis-Nexis, DIALOG, and Verity have long since created systems that integrate the results of multiple heterogeneous databases [9]. Many other Web metasearch services exist, such as the popular and useful MetaCrawler service [9]. Services similar to MetaCrawler include SavvySearch and Infoseek Express.

Metasearch engines can introduce their own deficiencies; for example, they can have difficulty ranking the list of results. If one engine returns many low-relevance documents, these documents may make it more difficult to find relevant pages in the list. Most of the metasearch engines on the Web also limit the number of results that can be obtained, and typically do not support all of the features of the query languages for each engine.

The NEC Research Institute has been developing an experimental metasearch engine called Inquirus [10]. Inquirus was motivated by problems with current metasearch engines, as well as the poor precision, limited coverage, limited availability, limited user interfaces, and out-of-date databases of the major Web search engines. Rather than work with the list of documents and summaries returned by search engines, as current metasearch engines typically do, Inquirus works by downloading and analyzing the individual documents. Inquirus makes improvements over existing engines in a number of areas, such as more useful document summaries incorporating query term context, identification of both pages that no longer exist and pages that no longer contain the query terms, improved detection of duplicate pages, progressive display of results, improved document ranking using proximity information (because Inquirus has the full text of all pages, it avoids the ranking problem of standard metasearch engines), dramatically improved precision for certain queries by using specific expressive forms, and quick jump links and highlighting when viewing the full documents.

One of the fundamental features of Inquirus is that it analyzes each document and displays local context around the query terms. The benefit of displaying the local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine if the document answers his or her specific query (without repeatedly clicking and waiting for pages to download). A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially for Web search, where the database is very large, diverse, and poorly organized.

A study by Tombros (1997) shows that query-sensitive summaries can improve the efficiency of search. Tombros considered the use of query-sensitive summaries and performed a user study which showed that users working with query-sensitive summaries had a higher success rate. Query-sensitive summaries allowed users to perform relevance judgments more accurately and rapidly, and greatly reduced the need to refer to the full text of documents.

One interesting feature of Inquirus is the Specific Expressive Forms (SEF) search technique. The Web is highly redundant, and techniques that trade recall (the fraction of all relevant documents returned) for improved precision (the fraction of returned documents that are relevant) are often useful. The SEF search technique transforms queries in the form of a question into specific forms for expressing the answer. For example, the query "What does NASDAQ stand for?" is transformed into the query "NASDAQ stands for" "NASDAQ is an abbreviation" "NASDAQ means". Clearly, the information may be contained in a different form than these three possibilities; however, if the information does exist in one of these forms, there is a higher likelihood that finding these phrases will provide the answer to the query. For many queries, the answer might exist on the Web, but not in any of the specific forms used. However, our experiments indicate that the method works well enough to be effective for certain queries.

Inquirus is surprisingly efficient. Inquirus downloads search engine responses and Web pages in parallel, and typically returns the first result faster than the average response time of a search engine.

In summary, metasearch techniques can improve the efficiency of Web search by combining the results of multiple search engines, and by implementing functionality which is not provided by the underlying engines (e.g., extracting query term context and filtering dead links). The Inquirus metasearch prototype at the NEC Research Institute has shown that downloading and analyzing pages in real time is feasible. Inquirus, like other meta engines and various Web tools, relies on the underlying search engines, which provide important and valuable services. Widescale use of this or any metasearch engine would require an amicable arrangement with the underlying search engines. Such arrangements may include passing through ads or micro-payment systems.

IMPROVING WEB SEARCH
Users tend to make queries that result in poor precision. About 70 percent of queries to Infoseek contain only one term (Harry Motro, Infoseek CEO, CNBC, May 7, 1998).
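The mutually reinforcing definition of hubs and authorities given in the Page Relevance section (good hubs point to good authorities, good authorities are pointed to by good hubs) can be sketched with the standard power-iteration formulation. This is an illustration of the iterative process, not code from the cited work [8]:

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links maps each page to the list of pages it points to.
    Each round: a page's authority score is the sum of the hub
    scores of pages linking to it; its hub score is the sum of
    the authority scores of the pages it links to. Scores are
    normalized each round so the iteration converges.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

On a toy graph where two pages each link to the same pair of targets, those two pages emerge with the highest hub scores and their shared targets with the highest authority scores, matching the intuition in the text.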
About 40 percent of queries made by the employees of the NEC Research Institute to the Inquirus engine contain only one term. In information retrieval, there is typically a trade-off between precision and recall. Simple (e.g., single-term) queries can return thousands or millions of documents. Unfortunately, ranking the relevance of these documents is a difficult problem, and the desired documents may not appear near the top of the list. One way to improve the precision of results is to use more query terms, and to tell the search engines that relevant documents must contain certain terms (required terms). Other ways include using phrases or proximity (e.g., searching for specific phrases rather than single terms), using constraints offered by some search engines such as date ranges and geographic restrictions, or using the refinement features offered by some engines (e.g., AltaVista offers a refine function, and Infoseek allows subsequent searches within the results set of previous searches).

Another alternative is to combine available search engines with automated online searching. One example is the Internet "softbot" [11]. The softbot transforms queries into goals and uses a planning algorithm (with extensive knowledge of the information sources) in order to generate a sequence of actions that satisfies the goal. AHOY! is a successful softbot that locates homepages for individuals [11]. Shakes et al. performed a study where they searched for the homepages of 582 researchers, and AHOY! was able to locate more homepages than MetaCrawler (which located more homepages than HotBot or AltaVista). AHOY! also provided greatly improved precision.

More comprehensive and relevant results may also be possible using a search engine that specializes in a particular area; for example, Excite NewsTracker specializes in indexing news sites, and OpenText Pinstripe specializes in indexing business sites. Because there are fewer pages to index, the engines may be able to be more comprehensive within their area, and may also be able to update the index more regularly. When searching for popular information, directories constructed by hand, such as Yahoo's directory, can be very useful because fewer low-relevance results are returned.

In summary, there exist several ways of improving on the major Web search engines, depending on the type of information desired. For harder-to-find information, metasearch and softbots can improve coverage. If the topic being queried is covered by one of the more specialized engines, these engines can be used, and they often provide more comprehensive and up-to-date indices within their specialty compared to the general Web search engines.

SCIENTIFIC INFORMATION RETRIEVAL
Immediate access to scientific literature has long been desired by scientists, and the Web search engines have made a large and growing body of scientific literature and other information resources accessible within seconds. Advances in computing and communications, and the rapid rise of the Web, have led to the increasingly widespread availability of online research articles, as well as a simple-to-use Web version of the Institute for Scientific Information's® (ISI) Science Citation Index® — the Web of Science®. The Web is changing the way researchers locate and access scientific publications. Many print journals now provide access to the full text of articles on the Web, and the number of online journals was about 1000 in 1996 [12]. Researchers are increasingly making their work available on their homepages or in technical report archives.

AVAILABILITY
A lot of scientific literature is copyrighted by the authors or publishers and is not generally available on the "publicly indexable Web." However, the amount of scientific material available on the publicly indexable Web is growing. Some journals owned by societies such as IEEE (the largest technical/scientific society) and ACM are permitting their papers to be placed on the authors' Web sites as long as the proper copyright notices are posted. Some private publishers, MIT Press for example, are doing the same. Some publishers permit prepublication Web access but do not allow posting of the final version of papers. We predict that more and more papers will be available on the publicly indexable Web in the future.

We used six major Web search engines to search for the papers in a recent issue of Neural Computation, after the table of contents was released but before we obtained our copy of the journal. We found that about 50 percent of the papers were available on the homepages of the authors. As mentioned before, the coverage of any one search engine is limited. The simplest means of improving the chances of finding a particular scientist or paper on the publicly indexable Web is to combine the results of multiple engines, as is done with metasearch engines such as MetaCrawler.

Although more and more scientific papers are being made available on the publicly indexable Web, these papers are spread throughout researcher and institution homepages, technical report archives, and journal sites. The Web search engines do not make it easy to locate these papers because the search engines typically do not index Postscript or PDF documents, which account for a large percentage of the available articles. The next section introduces a technique for organizing and indexing this literature.

DIGITAL LIBRARIES AND CITATION INDEXING
The Web offers the possibility of providing easy and efficient services for organizing and accessing scientific information. A citation index is one such service. Citation indices [13] index the citations in an article, linking the article with the cited works. Citation indices were originally designed for literature search, allowing a researcher to find subsequent articles that cite a given article. Citation indices are also valuable for other purposes, including evaluation of articles, authors, and so on, and analysis of research trends. The most popular citation indices of academic research are produced by the ISI. One such index, the Science Citation Index, is intended to be a practical cost-effective tool for indexing the significant scientific journals. Unfortunately, the ISI databases are expensive and not available to all researchers. Much of the expense is due to the manual effort required during indexing.

The rise of the Internet and the Web has led to proposals for online digital libraries that incorporate citation indexing. For example, Cameron proposed a "universal, [Internet-based,] bibliographic and citation database linking every scholarly work ever written" [14]. Such a database would be highly "comprehensive and up-to-date," making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indices. However, Cameron's proposal presents significant difficulty for implementation, and requires authors or institutions to provide citation information in a specific format.

The NEC Research Institute is working on a digital library of scientific publications that creates a citation index autonomously (using Autonomous Citation Indexing, ACI), without the requirement of any additional effort on the part of the authors or institutions, and without any manual assistance [15].
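The first step of autonomous citation indexing, pulling individual citations out of a paper's reference section, can be illustrated with a heavily simplified sketch. The regular expression and the numbered-entry assumption below are illustrative only; a real ACI system such as the one described in [15] must recognize many more reference formats.

```python
import re

def extract_citations(references_text):
    """Split a numbered reference section into individual citations.

    Assumes entries are introduced by bracketed markers like
    [1], [2], ... which is only one of the many citation styles
    a real ACI system has to handle autonomously.
    """
    parts = re.split(r"\[\d+\]\s*", references_text)
    return [p.strip().rstrip(".") for p in parts if p.strip()]

refs = """[1] S. Lawrence and C. L. Giles, Science, 1998.
[2] J. Kleinberg, Proc. SODA, 1998."""
```

Subsequent stages would parse authors, title, and venue out of each extracted string, and match variant forms of the same citation, which is where most of the difficulty lies.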
  5. 5. Searching for “simulated annealing” in Machine Learning [small test index] (13828 documents 278202 citations total). example of how an ACI system can 1218 citations found extract the context of citations to a Click on the [Context] links to see the citing documents and the context of the citations. given paper and display them for easy browsing. Note that finding Citations (self) Article and extracting the context of cita- tions to a given paper could previ- 196 S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by ously be done by using traditional simulated annealing,” Science, vol. 220, 1983, pp. 671–680. [Context] citation indices and manually locat- [Check] ing and searching the citing papers — the difference is that the 49 (1) D. S. Johnson et al., “Optimization by simulated annealing: An automation and Web interface experimental evaluation,” Technical report, Bell Labs preprint. [Context] make the task far more efficient, [Check] and thus practical, where it may not have been before. 39 E. Aarts and J. Korst, “Simulated Annealing and Boltzmann Machines,” Digital libraries incorporating John Wiley and Sons, 1989. [Context] [Check] citation indexing can be used to organize the scientific literature, [... section deleted...] and help with literature search and evaluation. A “universal citations Figure 1. An ACI system can group variant forms of citations to the same paper (citations can database” which accurately indexes be written in many different formats), and rank search results by the number of citations. all literature would be ideal, but is currently impractical because of the limited availability of articlesof the authors or institutions, and without any manual assis- in electronic form and the lack of standardization in citationtance [15]. An ACI system autonomously extracts citations, practices. 
An ACI system identifies identical citations that occur in different formats, and identifies the context of citations in the body of articles. As with traditional citation indices such as the Science Citation Index, ACI allows literature search using citation links, and the ranking of papers, journals, authors, and so on by the number of citations. Compared to traditional citation indexing systems, ACI has both disadvantages and advantages. The disadvantages include lower accuracy (which is expected to be less of a disadvantage over time). However, the advantages are significant: no manual effort is required for indexing, resulting in a corresponding reduction in cost and increase in availability, and literature search can be based on the context of citations. Given a particular paper of interest, an ACI system can display the context of how the paper is cited in subsequent publications. The context of citations can be very useful for efficient literature search and evaluation.

ACI has the potential for broader coverage of the literature because human indexers are not required, and it can provide more timely feedback and evaluation by indexing items such as conference proceedings and technical reports. Overall, ACI can improve scientific communication and facilitate an increased rate of scientific dissemination and feedback.

ACI is ideal for operation on the Web: new articles can be automatically located and indexed when they are posted on the Web or announced on mailing lists, and an efficient interface for browsing the articles, citations, and the context of the citations can be created. Part of the benefit of autonomous citation indexing is due to the ability to format and organize information on demand using a Web interface to the citation index.

Figure 1 shows an example of the output from the NEC Research Institute's prototype autonomous citation indexing digital library system, CiteSeer. This example shows the results of a search for citations containing the phrase "simulated annealing" in a small test database of the machine learning literature (only a subset of the machine learning literature on the Web). Searching for citations to papers by a given author can also be performed (including secondary authors). The [Context] links show the context of the individual citations. The [Check] links show the individual citations in each group and can be used to check for errors in the citation grouping. Figure 2 shows an example of how an ACI system can display the context of citations to a given paper.

However, CiteSeer shows that it is possible to organize and index the subset of the literature available on the Web, and to autonomously process freeform citations with reasonable accuracy. As long as there is a significant portion of publishing through the Web, be it the publicly indexable Web or the subscription-only Web, there is great value in being able to prepare citation indices from the machine-readable material. Citation indices may appear that index from both parts of the Web. Access to the full text of articles may be open or by subscription, depending on how the Web and the publication business evolve. Citation indices for subscription-only data may be offered by the publisher, or prepared by a third party that has an agreement with the publisher.

THE FUTURE OF WEB SEARCH AND DIGITAL LIBRARIES

What is the future of the Web, Web search, and digital libraries? Improvements in technology will enable new applications. Computational and storage resources will continue to improve. Bandwidth is likely to increase significantly as technology advances and the following positive spiral operates: more people become connected to the Internet as it becomes easier to use and more popular, and as new access mechanisms are introduced (e.g., cable modems and digital subscriber lines). This provides incentive for the infrastructure companies to invest more in the backbone, improving bandwidth. More investment in the backbone improves access, so more people want to be connected.

Will the fraction of the Web covered by the major search engines increase? Some search engines are focusing on indexing Web pages that satisfy the majority of searches, as opposed to trying to catalog all of the Web. However, there are still some engines that aim to index the Web comprehensively. Improvements in indexing technology and computational resources will allow the creation of larger indices. Nevertheless, it is unlikely to become economically practical for a single search engine to index close to all of the publicly indexable Web in the near future. However, it is predicted that the cost of indexing and storage will decline over time relative to the increase in the size of the indexable Web [6], resulting in favorable scaling properties for centralized text search engines.
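The citation grouping exposed by CiteSeer's [Check] links, in which differently formatted citation strings are recognized as referring to the same paper, can be sketched as a simple normalization-and-similarity heuristic. This is an illustrative simplification: the function names, the similarity measure, and the threshold below are assumptions for the sketch, not CiteSeer's actual algorithm.

```python
import re
from difflib import SequenceMatcher

def normalize(citation):
    """Lowercase and strip punctuation/digits, which vary most across citation formats."""
    letters_only = re.sub(r"[^a-z ]+", " ", citation.lower())
    return " ".join(letters_only.split())

def similar(a, b, threshold=0.7):
    """Heuristic match: the normalized forms of the two citations are mostly alike."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def group_citations(citations):
    """Greedily group citation strings that appear to refer to the same paper."""
    groups = []
    for citation in citations:
        for group in groups:
            if similar(citation, group[0]):  # compare against the group's representative
                group.append(citation)
                break
        else:
            groups.append([citation])
    return groups

# Two formats of the same Kirkpatrick et al. paper, plus one unrelated citation.
citations = [
    'S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, Optimization by '
    'simulated annealing, Science 220 (4598) 671-680, 1983.',
    'Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) "Optimization '
    'by simulated annealing," Science, vol. 220, pp. 671-680.',
    'J. Kleinberg, "Authoritative sources in a hyperlinked environment," 1998.',
]
groups = group_citations(citations)
# The two Kirkpatrick variants fall into one group; the Kleinberg citation stands alone.
```

A real ACI system must handle much more (reordered and abbreviated author names, abbreviated journal titles, OCR errors), but even this greedy sketch groups the two differently formatted Kirkpatrick et al. citations while keeping the unrelated citation separate.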
[Figure 2: An example of how an autonomous citation indexing system can show the context of citations to a given paper, here citations to S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, 1983, as cited by papers from R. W. Picard and R. M. Neal. The sentences containing the citations are automatically highlighted.]

In the meantime, an increased number of specialized search services may arise that cover specific types of information.

The use of more expensive and better algorithms (e.g., as in Google) will produce improved page rankings. More information retrieval techniques aimed at the large, diverse, low signal-to-noise ratio database of the Web will be developed. One interesting possibility is the use of machine learning to create query transformations similar to those used in the SEF technique discussed earlier.

Metasearch techniques, which combine the results of multiple engines, are likely to remain useful when searching for hard-to-find information, or when comprehensive results are desired. The major Web search engines are also likely to continue to focus on performing queries as quickly as possible; therefore, metasearch engines that perform additional client-side processing (e.g., query term context summaries) may become increasingly popular as these products become more powerful, address problems with data fusion from different sources, and learn to deal better with the constantly evolving search services. Improvements in bandwidth should also improve the feasibility of metasearch techniques.

Digital libraries incorporating ACI should become more widely available, bringing the benefits of citation indexing to groups who cannot afford the commercial services, and improving the dissemination and retrieval of scientific literature.

SUMMARY

The Web is revolutionizing information access; however, current techniques for access to both general and scientific information on the Web leave much room for improvement. The Web search engines are limited in terms of coverage, recency, how well they rank query results, and the query options they support. Access to the growing body of scientific literature on the publicly indexable Web is limited by the lack of organization, and because the major search engines do not index PostScript or PDF documents. We have discussed several fruitful research directions that will improve access to general and scientific information and greatly enhance the utility of the Web: improved ranking methods, metasearch engines, softbots, and autonomous citation indexing. It is not clear how availability will evolve, because this depends on how the Web emerges as a business platform for publishers. Nevertheless, improved methods for basic searching, and for specialized citation searching, are likely to evolve and replace present methods, and will greatly increase the utility of the Web over what is available today.

ACKNOWLEDGMENTS

We thank H. Stone and the reviewers for very useful comments and suggestions.

REFERENCES

[1] S. Lawrence and C. L. Giles, "Searching the World Wide Web," Science, vol. 280, no. 5360, 1998, pp. 98–100.
[2] J. Barrie and D. Presti, "The World Wide Web as an instructional tool," Science, vol. 274, 1996, pp. 371–72.
[3] R. Seltzer, E. Ray, and D. Ray, The AltaVista Search Revolution: How to Find Anything on the Internet, McGraw-Hill, 1997.
[4] Inktomi, 1997.
[5] S. Steinberg, "Seek and ye shall find (maybe)," Wired, vol. 4, no. 5, 1996.
[6] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Proc. 7th Int'l. WWW Conf., Brisbane, Australia, 1998.
[7] J. Boyan, D. Freitag, and T. Joachims, "A machine learning architecture for optimizing Web search engines," Proc. AAAI Wksp. Internet-Based Info. Sys., 1996.
[8] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Proc. ACM-SIAM Symp. Discrete Algorithms, 1998.
[9] E. Selberg and O. Etzioni, "Multi-service search and comparison using the MetaCrawler," Proc. 1995 WWW Conf., 1995.
[10] S. Lawrence and C. L. Giles, "Context and page analysis for improved Web search," IEEE Internet Comp., vol. 2, no. 4, 1998, pp. 38–46.
[11] O. Etzioni and D. Weld, "A softbot-based interface to the Internet," Commun. ACM, vol. 37, no. 7, 1994, pp. 72–76.
[12] G. Taubes, Science, vol. 271, 1996, p. 764.
[13] E. Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, New York: Wiley, 1979.
[14] R. D. Cameron, "A universal citation database as a catalyst for reform in scholarly communication," First Monday, vol. 2, no. 4, 1997.
[15] C. L. Giles, K. Bollacker, and S. Lawrence, "CiteSeer: An automatic citation indexing system," Digital Libraries 98: The Third ACM Conf. Digital Libraries (I. Witten, R. Akscyn, and F. M. Shipman III, Eds.), Pittsburgh, PA, 1998, pp. 89–98.

BIOGRAPHIES

STEVE LAWRENCE is a research scientist at the NEC Research Institute, Princeton, New Jersey. His research interests include information retrieval and dissemination, machine learning, artificial intelligence, neural networks, face recognition, speech recognition, time series prediction, and natural language. His awards include an NEC Research Institute excellence award, ATERB and APRA priority scholarships, a QUT university medal and award for excellence, QEC and Telecom Australia Engineering prizes, and three successive prizes in the annual Australian Mathematics Competition. He received a B.Sc. in computing and a B.Eng. in electronic systems from the Queensland University of Technology, Australia, and a Ph.D. from the University of Queensland, Australia.

C. LEE GILES [F] is a senior research scientist in Computer Science at NEC Research Institute, Princeton, New Jersey. He is currently an adjunct professor at the Institute for Advanced Computer Studies at the University of Maryland. His research interests are in novel applications of neural networks and machine learning, and of agents and AI in the Web and computing. He is on the editorial boards of IEEE Intelligent Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks, the Journal of Computational Intelligence in Finance, the Journal of Parallel and Distributed Computing, Neural Networks, Neural Computation, and Applied Optics.

IEEE Communications Magazine • January 1999