Computer Networks and ISDN Systems 29 (1997) 1225-1235

The quest for correct information on the Web: hyper search engines

Massimo Marchiori
Department of Pure and Applied Mathematics, University of Padova, Via Belzoni 7, 35131 Padova, Italy
E-mail: email@example.com

0169-7552/97/$17.00 © 1997 Published by Elsevier Science B.V. All rights reserved. PII S0169-7552(97)00036-6

Abstract

Finding the right information in the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW contains is growing at an incredible rate. In this paper, we present a novel method to extract from a web object its "hyper" informative content, in contrast with current search engines, which only deal with the "textual" informative content. This method is not only valuable per se, but it is shown to be able to considerably increase the precision of current search engines. Moreover, it integrates smoothly with existing search engine technology, since it can be implemented on top of every search engine, acting as a post-processor, thus automatically transforming a search engine into its corresponding "hyper" version. We also show how, interestingly, the hyper information can be usefully employed to face the search engines persuasion problem. © 1997 Published by Elsevier Science B.V.

Keywords: World Wide Web; Information retrieval; Search engines; Sep; Browsers and tools; Design principles and techniques; Integration of heterogeneous information sources; Search techniques

1. Introduction

The World Wide Web is growing at phenomenal rates, as witnessed by every recent estimation. This explosion, both of Internet hosts and of people using the web, has made crucial the problem of managing such an enormous amount of information. As market studies clearly indicate, in order to survive in this informative jungle, web users have to resort almost exclusively to search engines (automatic catalogs of the web) and repositories (human-maintained collections of links, usually topic-based). In turn, repositories are now themselves resorting to search engines to keep their databases up-to-date. Thus, the crucial component in information management is given by search engines.

As all the recent studies on the subject (see e.g. [8]) report, the performance of current search engines is far from being fully satisfactory. While so far technological progress has made it possible to deal reasonably with the size of the web, in terms of the number of objects classified, the big problem is to properly classify objects in response to the users' needs: in other words, to give the user the information s/he needs.

In this paper, we argue for a solution to this problem, by means of a new measure of the informative content of a web object, namely its "hyper information". Informally, the problem with current search engines is that they evaluate a web object almost just like a piece of text. Even with extremely sophisticated techniques like those already present in some search engine scoring mechanisms, this approach suffers from an intrinsic weakness: it doesn't take into account the web structure the object is part of.

The power of the web relies precisely on its capability of redirecting the information flow via hyperlinks, so
it should appear rather natural that in order to evaluate the informative content of a web object, the web structure has to be carefully analyzed. Instead, so far only very limited forms of analysis of an object in the web space have been used, like "visibility" (the number of links pointing to a web object), and the gain in obtaining more precise information has been very little.

In this paper, instead, we develop a natural notion of "hyper" information which properly takes into account the web structure an object is part of. Informally, the "overall information" of a web object is not composed only of its static "textual information": another "hyper" information has to be added, which is the measure of the potential information of a web object with respect to the web space. Roughly speaking, it measures how much information one can obtain using that page with a browser, and navigating starting from it.

Besides its fundamental characteristic, and future potentials, the hyper information has an immediate practical impact: indeed, it works "on top" of any textual information function, manipulating its "local" scores to obtain the more complete "global" score of a web object. This means that it provides a smooth integration with existing search engines, since it can be used to considerably improve the performance of a search engine simply by post-processing its scores.

Moreover, we show how the hyper information can nicely deal with the big problem of "search engines persuasion" (tuning pages so as to cheat a search engine, in order to make it give a higher rank), one of the most alarming factors degrading the performance of most search engines.

Finally, we present the result of extensive testing of the hyper information, used as post-processor in combination with the major search engines, and show how the hyper information is able to considerably improve the quality of information provided by search engines.

2. World Wide Web

In general, we consider an (untimed) web structure to be a partial function from URLs (http://www.w3.org/pub/WWW/Addressing/rfc1738.txt) to sequences of bytes. The intuition is that for each URL we can require from the web structure the corresponding object (an HTML (http://www.w3.org/pub/WWW/MarkUp/) page, a text file, etc.). The function has to be partial because for some URLs there is no corresponding object.

In this paper we consider as web structure the World Wide Web structure WWW. Note that in general the real WWW is a timed structure, since the URL mapping varies with time (i.e. it should be written as $\mathrm{WWW}_t$ for each time instant $t$). However, we will work under the hypothesis that the WWW is locally time consistent, i.e. that there is a non-void time interval $I$ such that the probability that

$\mathrm{WWW}_t(url) = seq,\ t \le t' \le t + I \ \Rightarrow\ \mathrm{WWW}_{t'}(url) = seq$

is extremely high. That is to say, if at a certain time $t$ a URL points to an object seq, then there is a time interval in which this property stays true. Note this doesn't mean that the web structure stays the same, since new web objects can be added.

We will also assume that all the operations relying on the WWW structure that we will employ in this paper can be performed with extremely high probability within time $I$: this way, we are acting on a "locally consistent" time interval, and thus it is (with extremely high probability) safe to get rid of the dynamic behavior of the World Wide Web and consider it as an untimed web structure, as we will do.

Note that these assumptions are not restrictive, since empirical observations (cf. ) show that WWW is indeed locally time consistent, e.g. setting $I$ = 1 day (in fact, this is one of the essential reasons for which the World Wide Web works). This, in turn, also implies that the second hypothesis (algorithms working in at most $I$ time) is not restrictive.

A web object is a pair (url, seq), made up of a URL url and a sequence of bytes seq = WWW(url).

In the sequel, we will usually consider the WWW web structure understood in all the situations where web objects are considered.

As usual in the literature, when no confusion arises we will sometimes talk of a link meaning the pointed web object (for instance, we may talk about the score of a link meaning the score of the web object it points to).
3. Hyper vs. non-hyper

A great problem with search engines' scoring mechanisms is that they tend to score text more than hypertext. Let us explain this better. It is well known that the essential difference between normal texts and hypertext is the relation-flow of information which is possible via hyperlinks. Thus, when evaluating the informative content of a hypertext, this fundamental, dynamic behavior of hypertext should be taken into account. Instead, most search engines tend to simply forget the "hyper" part of a hypertext (the links), by simply scoring its textual components, i.e. providing the textual information (henceforth, we say "information" in place of the more appropriate, but longer, "measure of information").

Note that ranking in a different way words appearing in the title (http://www.w3.org/pub/WWW/TR/REC-html32.html#title), or between headings (http://www.w3.org/pub/WWW/TR/REC-html32.html#headings), etc., like all contemporary search engines do, is not in contradiction with what was said above, since title, headings and so on are not "hyper", but they are simply attributes of the text.

More formally, the textual information of a web object (url, seq) considers only the textual component seq, not taking into account the hyper information given by url and by the underlying World Wide Web structure.

A partial exception is considering the so-called visibility (cf. for instance [1]) of a web object. Roughly, the visibility of a web object is the number of other web objects with a link to it. Thus, visibility is in a sense a measure of the importance of a web object in the World Wide Web context. Excite (http://www.excite.com) and WebCrawler (http://www.webcrawler.com) have been the first search engines to provide higher ranks to web objects with high visibility, now followed by Lycos (http://www.lycos.com) and Magellan (http://www.mckinley.com).

The problem is that visibility says nothing about the informative content of a web object. The misleading assumption is that if a web object has a high visibility, then this is a sign of importance and consideration, and so de facto its informative content must be more valuable than that of other web objects that have fewer links pointing to them. This reasoning would be correct if all web objects were known to users and to search engines. But this assumption is clearly false. The fact is that a web object which is not known enough, for instance because of its location, is going to have a very low visibility (when it is not completely neglected by search engines), regardless of its informative content, which may be by far better than that of other web objects.

In a nutshell, visibility is likely to be a synonym of popularity, which is something completely different from quality, and thus its usage by search engines to give higher scores is a rather poor choice.

3.1. The hyper information

As said, what is really missing in the evaluation of the score of a web object is its hyper part, that is the dynamic information content which is provided by hyperlinks (henceforth, simply links).

We call this kind of information hyper information: this information should be added to the textual information of the web object, giving its (overall) information in the World Wide Web. We indicate these three kinds of information as HYPERINFO, TEXTINFO and INFORMATION, respectively. So, for every web object $A$ we have that $\mathrm{INFORMATION}(A) = \mathrm{TEXTINFO}(A) + \mathrm{HYPERINFO}(A)$ (note that in general these information functions depend on a specific query, that is to say they measure the informative content of a web object with respect to a certain query: in the sequel, we will always consider such a query to be understood).

The presence of links in a web object clearly augments the informative content with the information contained in the pointed web objects (although we have to establish to what extent).

Recursively, links present in the pointed web objects further contribute, and so on. Thus, in principle, the analysis of the informative content of a web object $A$ should involve all the web objects that are reachable from it via hyperlinks (i.e., "navigating" in the World Wide Web).

This is clearly unfeasible in practice, so, for practical reasons, we have to stop the analysis at a certain depth, just like in programs for chess analysis we
have to stop considering the moves tree after few moves. So, one should fix a certain upper bound for the depth of the evaluation. The definition of depth is completely natural: given a web object $O$, the (relative) depth of another web object $O'$ is the minimum number of links that have to be activated ("clicked") in order to reach $O'$ from $O$. So, saying that the evaluation has max depth $k$ means that we consider only the web objects at depth less than or equal to $k$.

By fixing a certain depth, we thus select a suitable finite "local neighborhood" of a web object in the World Wide Web. Now we are faced with the problem of establishing the hyper information of a web object $A$ w.r.t. this neighborhood (the hyper information with depth $k$). We denote this information by $\mathrm{HYPERINFO}_{[k]}$, and the corresponding overall information (with depth $k$) by $\mathrm{INFORMATION}_{[k]}$. In most cases, $k$ will be considered understood, and so we will omit the $[k]$ subscript.

We assume that INFORMATION, TEXTINFO and HYPERINFO are functions from web objects to non-negative real numbers. Their intuitive meaning is that the more information a web object has, the greater is the corresponding number.

Observe that in order to have a feasible implementation, we need all of these functions to be bounded (i.e., there is a number $M$ such that $M \ge \mathrm{INFORMATION}(A)$, for every $A$). So, we assume without loss of generality that TEXTINFO has an upper bound of 1, and that HYPERINFO is bounded.

3.2. Single links

To start with, consider the simple case where we have at most one link in every web object.

In the basic case when the depth is one, the most complicated case is when we have one link from the considered web object $A$ to another web object $B$, i.e.

[Figure: A → B]

A naive approach is just to add the textual information of the pointed web object. This idea, more or less, is tantamount to identifying a web object with the web object obtained by replacing the links with the corresponding pointed web objects.

This approach is attractive, but not correct, since it raises a number of problems. For instance, suppose that $A$ has almost zero textual information, while $B$ has an extremely high overall information. Using the naive approach, $A$ would have more informative content than $B$, while it is clear that the user is much more interested in $B$ than in $A$.

The problem becomes even more evident when we increase the depth of the analysis: consider a situation where $B_k$ is at depth $k$ from $A$, i.e. to go from $A$ to $B_k$ one needs $k$ "clicks", and $k$ is big (e.g. $k > 10$):

[Figure: A → B_1 → B_2 → ... → B_k]

Let $A, B_1, \ldots, B_{k-1}$ have almost zero informative content, and $B_k$ have very high overall information. Then $A$ would have higher overall information than $B_k$, which is paradoxical since $A$ clearly has too weak a relation with $B_k$ to be more useful than $B_k$ itself.

The problem essentially is that the textual information pointed by a link cannot be considered as actual, since it is potential: for the user there is a cost to obtain the textual information pointed by a link (click and... wait).

The solution to these two factors is: the contribution to the hyper information of a web object at depth $k$ is not simply its textual information, but it is its textual information diminished via a fading factor depending on its depth, i.e. on "how far" the information is from the user (how many clicks s/he has to perform).

Our choice about the law regulating this fading function is that textual information fades exponentially w.r.t. the depth, i.e. the contribution to the hyper information of $A$ given by an object $B$ at depth $k$ is $F^k \cdot \mathrm{TEXTINFO}(B)$, for a suitable fading factor $F$ ($0 < F < 1$).

Thus, in the above example, the hyper information of $A$ is not $\mathrm{TEXTINFO}(B_1) + \ldots + \mathrm{TEXTINFO}(B_k)$, but $F \cdot \mathrm{TEXTINFO}(B_1) + F^2 \cdot \mathrm{TEXTINFO}(B_2) + \ldots + F^k \cdot \mathrm{TEXTINFO}(B_k)$, so that the overall information of $A$ is not necessarily greater than that of $B_k$.

As an aside, note that when calculating the overall information of a web object $A$, its textual information can be nicely seen as a special degenerate case of hyper information, since $\mathrm{TEXTINFO}(A) = F^0 \cdot \mathrm{TEXTINFO}(A)$ (viz., the object is at "zero distance" from itself).
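The difference between the naive sum and the exponentially fading law of this section can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the chain length of 12 and the value F = 0.5 are assumptions chosen only for the example, and the list of scores stands for TEXTINFO(B_1), ..., TEXTINFO(B_k) along a chain of single links.

```python
from typing import Sequence


def naive_hyperinfo(chain_scores: Sequence[float]) -> float:
    """Naive approach: add up the textual information of every pointed object."""
    return sum(chain_scores)


def faded_hyperinfo(chain_scores: Sequence[float], fading: float) -> float:
    """Exponential fading: the object at depth d contributes fading**d * TEXTINFO."""
    if not 0 < fading < 1:
        raise ValueError("the fading factor F must satisfy 0 < F < 1")
    return sum(fading ** depth * score
               for depth, score in enumerate(chain_scores, start=1))


# Hypothetical chain A -> B_1 -> ... -> B_12: only the last object is informative.
chain = [0.0] * 11 + [0.9]
text_a = 0.0  # TEXTINFO(A), assumed to be zero for the example

naive_information = text_a + naive_hyperinfo(chain)
faded_information = text_a + faded_hyperinfo(chain, fading=0.5)

print(naive_information)  # as high as TEXTINFO(B_12) itself: the paradox
print(faded_information)  # 0.9 * 0.5**12, i.e. almost nothing
```

With these assumed values, A no longer appears as informative as B_12 once fading is applied, which is the behaviour motivated above.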
3.3. More motivations

There is still one point to clarify. We have introduced an exponentially fading law, and seen how it behaves well with respect to our expectations. But one could ask why the fading function has to be exponential, and not of a different form, for instance polynomial (e.g., multiplying by $F/k$ instead of by $F^k$). Let us consider the overall information of a web object with depth $k$: if we have

[Figure: A → B]

it is completely reasonable to assume that the overall information of $A$ (with depth $k$) is given by the textual content of $A$ plus the faded overall information of $B$ (with depth $k - 1$). It is also reasonable to assume that this fading can be approximated by multiplying by an appropriate fading constant $F$ ($0 < F < 1$). So, we have the recursive relation

$\mathrm{INFORMATION}_{[k]}(A) = \mathrm{TEXTINFO}(A) + F \cdot \mathrm{INFORMATION}_{[k-1]}(B)$

Since readily for every object $A$ we have $\mathrm{INFORMATION}_{[0]}(A) = \mathrm{TEXTINFO}(A)$, eliminating recursion we obtain that if

[Figure: A → B_1 → B_2 → ... → B_k]

then

$\mathrm{INFORMATION}(A) = \mathrm{TEXTINFO}(A) + F \cdot (\mathrm{TEXTINFO}(B_1) + F \cdot (\mathrm{TEXTINFO}(B_2) + F \cdot (\ldots \mathrm{TEXTINFO}(B_k)))) = \mathrm{TEXTINFO}(A) + F \cdot \mathrm{TEXTINFO}(B_1) + F^2 \cdot \mathrm{TEXTINFO}(B_2) + \ldots + F^k \cdot \mathrm{TEXTINFO}(B_k)$,

which is just the exponential fading law that we have adopted.

3.4. Multiple links

Now we turn to the case where there is more than one link in the same web object. For simplicity, we assume the depth is one. So, suppose one has the following situation

[Figure: A with links to B_1, B_2, ..., B_n]

What is the hyper information in this case? The easiest answer, just sum the contribution of every link (i.e. $F \cdot \mathrm{TEXTINFO}(B_1) + \ldots + F \cdot \mathrm{TEXTINFO}(B_n)$), isn't feasible since we want the hyper information to be bounded.

This would seem in contradiction with the interpretation of a link as potential information that we have given earlier: if one has many links that can be activated, then one has all of their potential information. However, this paradox is only apparent: the user cannot get all the links at the same time, but has to select them sequentially. In other words, nondeterminism has a cost. So, in the best case the user will select the most informative link, and then the second most informative one, and so on. Suppose for example that the most informative link is $B_1$, the second one is $B_2$ and so on (i.e., we have $\mathrm{TEXTINFO}(B_1) \ge \mathrm{TEXTINFO}(B_2) \ge \ldots \ge \mathrm{TEXTINFO}(B_n)$). Thus, the hyper information is $F \cdot \mathrm{TEXTINFO}(B_1)$ (the user selects the best link) plus $F^2 \cdot \mathrm{TEXTINFO}(B_2)$ (the second time, the user selects the second best link) and so on, that is to say

$F \cdot \mathrm{TEXTINFO}(B_1) + \ldots + F^n \cdot \mathrm{TEXTINFO}(B_n)$

Nicely, evaluating the score this way gives a bounded function, since for any number of links the sum cannot be greater than $F/(1 - F)$ (each TEXTINFO value is at most 1, and the geometric series $F + F^2 + \ldots$ sums to $F/(1 - F)$).

Note that we chose the best sequence of selections, since hyper information is the best "potential" information, so we have to assume the user makes the best choices: we cannot use e.g. a random selection of the links, or even other functions like the average of the contributions of each link, since we cannot impose that every link has to be relevant. For instance, if we did so, accessory links with zero score (e.g. think of the "powered with Netscape" links, http://www.netscape.com/comprod/products/navigator/version_3.0/images/netnow3.gif) would devalue by far the hyper information even in presence of highly scored links, while those accessory links should simply be ignored (as the above method, consistently, does).

3.5. The general case

The general case (multiple links in the same page and arbitrary depth $k$) is treated according to what was previously seen. Informally, all the web objects at depth less than or equal to $k$ are ordered w.r.t. a
"sequence of selections" such that the corresponding hyper information is the highest one.

For example, consider the following situation:

[Figure: web objects A, B, C, D and E connected by links]

with $F = 0.5$, $\mathrm{TEXTINFO}(B) = 0.4$, $\mathrm{TEXTINFO}(C) = 0.3$, $\mathrm{TEXTINFO}(D) = 0.2$, $\mathrm{TEXTINFO}(E) = 0.6$.

Then via a "sequence of selections" we can go from $A$ to $B$, to $C$, to $E$ and then to $D$, and this is readily the best sequence that maximizes the hyper information, which is $0.5 \cdot \mathrm{TEXTINFO}(B) + 0.5^2 \cdot \mathrm{TEXTINFO}(C) + 0.5^3 \cdot \mathrm{TEXTINFO}(E) + 0.5^4 \cdot \mathrm{TEXTINFO}(D)$ $(= 0.3625)$.

4. Refining the model

There are a number of subtleties that for clarity of exposition we haven't considered so far.

The first big problem comes from possible duplications. For instance, consider the case of backward links: suppose that a user has set up two pages $A$ and $B$, with a link in $A$ from $A$ to $B$, and then a "come back" link in $B$ from $B$ to $A$:

[Figure: A and B linking to each other]

This means that we have a kind of recursion, and that the textual information of $A$ is repeatedly added (although with increasingly higher fading) when calculating the hyper information of $A$ (this because from $A$ we can navigate to $B$ and then to $A$ and so on...). Another example is given by duplicate links (e.g. two links in a web object pointing to the same web object). The solution to avoid these kinds of problems is not to consider all the "sequences of selections", but only those without repetitions. This is consistent with the intuition that hyper information measures, in a sense, how far a web object is from a page: if one has already reached an object, s/he has already got its information, and so it makes no sense to get it again.

Another important issue is that the very definition of link in a web object is far from trivial. A link present in a web object $O$ is said to be active if the web objects it points to can be accessed by viewing (url, seq) with an HTML (http://www.w3.org/pub/WWW/MarkUp/) browser (e.g., Netscape (http://www.netscape.com) Navigator (http://www.netscape.com/comprod/products/navigator/version_3.0/index.html) or Microsoft (http://www.microsoft.com) Internet Explorer (http://www.microsoft.com/ie)). This means, informally, that once we view $O$ with the browser, we can activate the link by clicking over it. The previous definition is rather operational, but it is much more intuitive than a formal technical definition which can be given by tediously specifying all the possible cases according to the HTML specification (http://www.w3.org/pub/WWW/MarkUp/) (note a problem complicating a formal analysis is that one cannot assume that seq is composed of legal HTML code, since browsers are error-tolerating).

Thus, the links mentioned in the paper should be only the active ones.

Also, there are many different kinds of links, and each of them would require a specific treatment. For instance,

• Local links (links pointing to some point in the same web object, using the #-specifier, http://www.w3.org/pub/WWW/Addressing/rfc1738.txt) should be readily ignored.

• Frame (http://home.netscape.com/assist/net_sites/frames.html) links should be automatically expanded, i.e. if $A$ has a frame link to $B$, then this link should be replaced with a proper expansion of $B$ inside $A$ (since a frame link is automatically activated, its pointed web object is just part of the original web object, and the user does not see any link at all).

• Other links that are automatically activated are for instance the image (http://www.w3.org/pub/WWW/TR/REC-html32.html#img) links, i.e. source links of image tags, the background (http://www.w3.org/pub/WWW/TR/REC-html32.html#body) links and so on: they should be treated analogously to frame links. However, since usually it is a very hard problem to recover some useful information from images
(cf. ), in a practical implementation of the hyper information all these links can be ignored (note this doesn't mean that the textual information function has to ignore such links, since it can e.g. take into account the name of the pictures). This also happens for active links pointing to images, movies, sounds etc., which can be syntactically identified by their file extensions .gif, .jpg, .tiff, .avi, .wav and so on.

4.1. Search engines persuasion

A big problem that search engines have to face is the phenomenon of so-called sep (search engines persuasion). Indeed, search engines have become so important in the advertisement market that it has become essential for companies to have their pages listed in top positions of search engines, in order to get a significant web-based promotion. Starting with the already pioneering work of Rhodes , this phenomenon is now growing at such a rate as to have caused serious problems to search engines, and has revolutionized the web design companies, which are now specifically asked not only to design good web sites, but also to make them rank high in search engines. A vast number of new companies was born just to make customer web pages as visible as possible. More and more companies, like Exploit (http://www.exploit.com), Allwilk (http://www.allwilk.com), Northern Webs (http://www.digital-cafe.com/~webmaster/norweb01.htm), Ryley & Associates (http://www.ryley.com), etc., explicitly study ways to make a page rank high in search engines. OpenText (http://www.opentext.com) arrived even to sell "preferred listings", i.e. assuring a particular entry to stay in the top ten for some time, a policy that has provoked some controversies (cf. [9]).

This has led to a bad performance degradation of search engines, since an increasingly high number of pages is designed to have an artificially high textual content. The phenomenon is so serious that search engines like InfoSeek (http://www.infoseek.com) and Lycos (http://www.lycos.com) have introduced penalties to face the most common of these persuasion techniques, "spamdexing" [4,7,8], i.e. the artificial repetition of relevant keywords. Despite these efforts, the situation is getting worse and worse, since more sophisticated techniques have been developed, which analyze the behavior of search engines and tune the pages accordingly. This has also led to the situation where search engines maintainers tend to assign penalties to pages that rank "too high", and at the same time to provide fewer and fewer details on their information functions just in order to prevent this kind of tuning, thus in many cases penalizing pages of high informative value that were not designed with "persuasion" intentions. For a comprehensive study of the sep phenomenon, we refer to .

The hyper methodology that we have developed is able to a certain extent to nicely cope with the sep problem. Indeed, maintainers can keep details of their TEXTINFO function hidden, and make public the information that they are using a HYPERINFO function.

The only precaution is to distinguish between two fundamental types of links, assigning different fading factors to each of them. Suppose we have a web object (url, seq). A link contained in seq is called outer if it does not have the same domain (http://www.dns.net/dnsrd/) as url, and inner in the other case. That is to say, inner links of a web object point to web objects in the same site (its "local world", so to say), while outer links point to web objects of other sites (the "outer world").

Now, inner links are dangerous from the sep point of view, since they are under the direct control of the site maintainer. For instance, a user that wants to artificially increase the hyper information of a web object $A$ could set up a very similar web object $B$ (i.e. such that $\mathrm{TEXTINFO}(A) \approx \mathrm{TEXTINFO}(B)$), and put a link from $A$ to $B$: this would increase the score of $A$ by roughly $F \cdot \mathrm{TEXTINFO}(A)$.

On the other hand, outer links do not present this problem since they are out of direct control and manipulation (at least in principle, cf. [2]).

Thus, one should consequently assign a very low or even null fading factor ($F_{in}$) to the inner links, and a reasonably high fading factor ($F_{out}$) to the outer links. Indeed, in our practical experimentations we saw that setting $F_{in}$ to zero, i.e. completely omitting
the inner link contributions, gave very good results (although the "best" value according to our tests was near 0.1). Setting $F_{in}$ to zero also gave the advantage of making the implementation of the hyper information considerably faster, since most of the links in web objects are inner. As far as outer links are concerned, we set $F_{out}$ to 0.75. Again, from our tests it resulted that similar values did not significantly affect the quality of the hyper information.

We said earlier that the information about the use of the hyper information could be given as white box to the external users (while keeping as black box the details of the TEXTINFO function). This way, search engines persuasion has the effect of reshaping the web, by considerably improving its connectivity. Indeed, as noticed in [1], at present the inter-connectivity is rather poor, since almost 80% of sites contain no outer link (!), and a relatively small number of web sites is carrying most of the load of hypertext navigation.

Using hyper information thus forces the sites that want to rank better to improve their connectivity, improving the overall web structure.

5. Testing

The hyper information has been implemented as post-processor of the main search engines now available, i.e. remotely querying them, extracting the corresponding scores (i.e., their TEXTINFO function), and calculating the hyper information and therefore the overall information.

Indeed, as said in the introduction, one of the most appealing features of the hyper methodology is that it can be implemented "on top" of existing scoring functions. Note that, strictly speaking, scoring functions employing visibility, like those of Excite (http://www.excite.com), WebCrawler (http://www.webcrawler.com) and Lycos (http://www.lycos.com), are not pure "textual information". However, this is of little importance: although visibility, as shown, is not a good choice, it provides information which is disjoint from the hyper information, and so we can view such scoring functions as providing purely textual information slightly perturbed (improved?) with some other kind of WWW-based information.

The search engines for which a post-processor was developed were: Excite (http://www.excite.com), HotBot (http://www.hotbot.com), Lycos (http://www.lycos.com), WebCrawler (http://www.webcrawler.com), and OpenText (http://www.opentext.com). This includes all of the major search engines, but for AltaVista (http://www.altavista.com) and InfoSeek (http://www.infoseek.com), which unfortunately do not give the user access to the scores, and thus cannot be remotely post-processed. The implemented model included all the accessory refinements seen in Section 4.

We then conducted some tests in order to see how the values of $F_{in}$ and $F_{out}$ affected the quality of the hyper information. The depth and fading parameters can be flexibly employed to tune the hyper information. However, it is important for its practical use that the behaviour of the hyper information is tested when the depth and the fading factors are fixed in advance.

We randomly selected 25 queries, collected all the corresponding rankings from the aforementioned search engines, and then ran several tests in order to maximize the effectiveness of the hyper information. We finally selected the following global parameters for the hyper information: $F_{in} = 0$, $F_{out} = 0.75$, and depth one. Although the best values were slightly different ($F_{in} = 0.1$, depth two), the differences were almost insignificant. On the other hand, choosing $F_{in} = 0$ and depth one had great advantages in terms of execution speed of the hyper information.

After this setting, we chose another 25 queries, and tested again the quality of the hyper information with these fixed parameters: the results showed that these initial settings also gave extremely good results for this new set of queries.

While our personal tests clearly indicated a considerable increase of precision for the search engines, there was obviously the need to get external confirmation of the effectiveness of the approach. Also, although the tests indicated that slightly perturbing the fading factors did not substantially affect the hyper information, the goodness of the values of the
fading factors could somehow have been dependent on our way of choosing queries.

Therefore, we performed a deeper and more challenging test. We considered a group of thirty persons, and asked each of them to arbitrarily select five different topics to search for in the World Wide Web. Then each person had to perform the following test: s/he had to search for relevant information on each chosen topic by inputting suitable queries to our post-processor, and give an evaluation mark to the obtained ranking lists. The evaluation consisted of an integer ranging from 0 (terrible ranking, of no use) to 100 (perfect ranking).

The post-processor was implemented as a cgi-script: the prompt asks for a query, and returns two ranking lists relative to that query, one per column in the same page (frames 39 are used). In one column there is the original top ten of the search engine. In the other column, the top ten of the ranking obtained by: (1) taking the first 100 items of the ranking of the search engine, and (2) post-processing them with the hyper information.

We made serious efforts to provide each user with the most natural conditions in which to evaluate the rankings. Indeed, to avoid "psychological pressure" we provided each user with a password, so that each user could remotely perform the test on his/her favorite terminal, at any time (the test process could be frozen via a special button, terminating the session, and resumed whenever wanted), with all the time necessary to evaluate a ranking (e.g. possibly by looking at each link using the browser). Besides the evident beneficial effects on the quality of the data, this also had the side-effect that we did not have to personally supervise each test, something which could have been extremely time consuming.

Another extremely important point is that the post-processor was implemented to make the test completely blind. This was achieved in the following way.

Each time a query was run, the order in which the five search engines had to be applied was randomly chosen. Then, for each of the five search engines the following process occurred:

• The data obtained by the search engine were "filtered" and presented in a standard "neutral" format. The same format was adopted for the post-processed data, so that the user couldn't see any difference between the two rankings except for the links they contained.
• The columns in which to view the original search engine results and the post-processed results were randomly chosen.
• The users were told only that we were performing a test evaluating several different scoring mechanisms, and that the choice of the column in which to view each ranking was random. This also avoided possible "chain" effects, where a user, after having repeatedly given higher scores to a fixed column, can be unconsciously led to give it bonuses even if it actually doesn't deserve them.

Fig. 1 shows an example snapshot of a session, where the selected query is "search engine score": the randomly selected search engine in this case was HotBot 40, and its original ranking (filtered) is in the left column.

The final results of the test are pictured in Fig. 2. As can be seen, the chart clearly shows (especially in view of the blind evaluation process) that the hyper information is able to considerably improve the evaluation of the informative content.

The precise data regarding the evaluation test are shown in the next table:

          Excite  HotBot  Lycos  WebCrawler  OpenText  Average
Normal      80.1    62.2   59.0        54.2      63.4     63.2
Hyper       85.2    77.3   75.4        68.5      77.1     76.7

The next table shows the evaluation increment for each search engine w.r.t. its hyper version, and the corresponding standard deviation:

                      Excite  HotBot  Lycos  WebCrawler  OpenText  Average
Evaluation increment    +5.1   +15.1  +16.4       +14.3     +13.7    +12.9
Standard deviation       2.2     4.1    3.6         1.6       3.0      2.9

As can be seen, the small standard deviations are further empirical evidence of the superiority of the hyper search engines over their non-hyper versions.

39 http://home.netscape.com/assist/net_sites/frames.html
40 http://www.hotbot.com
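As a quick arithmetic check on the two tables above, the evaluation increment of each engine should simply be its Hyper mark minus its Normal mark. The following Python sketch recomputes the increments from the printed per-engine values; the variable and function names are illustrative and do not come from the paper.

```python
# Mean evaluation marks per engine, copied from the first table
# (scale: 0 = terrible ranking, 100 = perfect ranking).
normal = {"Excite": 80.1, "HotBot": 62.2, "Lycos": 59.0,
          "WebCrawler": 54.2, "OpenText": 63.4}
hyper = {"Excite": 85.2, "HotBot": 77.3, "Lycos": 75.4,
         "WebCrawler": 68.5, "OpenText": 77.1}


def increments(normal_marks, hyper_marks):
    """Return the Hyper-minus-Normal difference per engine, to one decimal."""
    return {engine: round(hyper_marks[engine] - normal_marks[engine], 1)
            for engine in normal_marks}


inc = increments(normal, hyper)
# Average increment over the five engines, to one decimal.
mean_increment = round(sum(inc.values()) / len(inc), 1)

for engine, delta in inc.items():
    print(f"{engine:<11} +{delta}")
print(f"{'Average':<11} +{mean_increment}")
```

The recomputed per-engine differences and their mean agree with the "Evaluation increment" row of the second table.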
Fig. 1. Snapshot from a test session.

Fig. 2. Evaluation of search engines vs. their hyper versions (chart categories: Lycos, WebCrawler, OpenText, Excite, HotBot, Average).
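The post-processing step described in Section 5 (take the first 100 items of a search engine ranking, add the hyper information to each textual score, and keep the new top ten) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Page` class, the `textinfo_of` lookup table, the same-host test for "same site", and in particular the rule for combining several links (best faded contribution first, the k-th one faded k times) are hypothetical stand-ins; the precise definition of the hyper information is the one given in the earlier sections of the paper. Only the parameter values (F_in = 0, F_out = 0.75, depth one) are taken from the text.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

F_IN = 0.0    # fading factor for links inside the same site (Section 5)
F_OUT = 0.75  # fading factor for links to other sites (Section 5)


@dataclass
class Page:
    url: str
    textinfo: float                            # score from the search engine
    links: list = field(default_factory=list)  # URLs this page links to


def same_site(url_a, url_b):
    # Assumption: two URLs belong to the same site if they share a host.
    return urlparse(url_a).netloc == urlparse(url_b).netloc


def hyperinfo_depth_one(page, textinfo_of):
    """Depth-one hyper information of `page` (assumed combination rule)."""
    contributions = []
    for target in page.links:
        fading = F_IN if same_site(page.url, target) else F_OUT
        # Assumption: pages with no known textual score contribute 0.
        contributions.append((fading, textinfo_of.get(target, 0.0)))
    # Best faded contribution first; the k-th one is faded k times.
    contributions.sort(key=lambda c: c[0] * c[1], reverse=True)
    return sum((fading ** k) * score
               for k, (fading, score) in enumerate(contributions, start=1))


def rerank(ranking, textinfo_of, considered=100, keep=10):
    """Re-rank the first `considered` items by textual + hyper information."""
    scored = [(p.textinfo + hyperinfo_depth_one(p, textinfo_of), p.url)
              for p in ranking[:considered]]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [url for _, url in scored[:keep]]
```

With F_in = 0, links that stay inside the same site contribute nothing, so only outgoing links to other sites can raise a page's overall score. In a real post-processor the `textinfo_of` table would have to be filled by querying the underlying engine for the linked pages, which is outside the scope of this sketch.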
6. Conclusion

We have presented a new way to considerably improve the quality of search engine evaluations by considering the "hyper information" of web objects. We have shown that, besides its practical success, the method presents several other interesting features:

• It allows one to focus separately on the "textual" and "hyper" components.
• It allows current search engines to be improved in a smooth way, since it works "on top" of any existing scoring function.
• It can be used for locally improving the performance of search engines, just by using a local post-processor on the client side (as done in this paper).
• Analogously, it can be used to improve so-called "meta searchers", like SavvySearch 41, MetaCrawler 42, WebCompass 43, NlightN 44 and so on. These suffer even more from the problem of good scoring, since they have to mix data coming from different search engines, and the result is in most cases not very readable. The method can considerably improve the effectiveness and utility of such meta-engines for the user.
• Last but not least, it behaves extremely well with respect to search engines persuasion attacks.

41 http://www.cs.colostate.edu/~dreiling/smartform.html
42 http://www.metacrawler.com
43 http://arachnid.qdeck.com/qdeck/demosoft/lite/
44 http://www.nlightn.com

References

[1] T. Bray, Measuring the Web (http://www5conf.inria.fr/fich_html/papers/P9/Overview.html), 5th International World Wide Web Conference (http://www.w3.org/pub/Conferences/WWW5/), Paris, May 1996.
[2] M. Marchiori, The hyper information: theory and practice, Technical Report No. 36, Dept. of Pure & Applied Mathematics, University of Padova, 1996.
[3] M. Marchiori, Security of World Wide Web search engines, 3rd International Conference on Reliability, Quality and Safety of Software-Intensive Systems, Athens, Greece, Chapman & Hall (http://www.thomson.com:8866/chaphall/default.html), 1997.
[4] K. Murphy, Cheaters never win (http://www.webweek.com/96May20/undercon/cheaters.html), Web Week (http://www.webweek.com), May 1996.
[5] J. Rhodes (mailto:deadlock@deadlock.com), The Art of Business Web Site Promotion (http://deadlock.com/~deadlock/promote), Deadlock Design (http://deadlock.com/~deadlock), 1996.
[6] S. Sclaroff, World Wide Web image search engines (http://www.cs.bu.edu/techreports/95-016-www-image-search-engines.ps.Z), NSF Workshop on Visual Information Management, Cambridge, MA, 1995.
[7] D. Sullivan, The Webmaster's Guide to Search Engines and Directories (http://calafia.com/webmasters/), Calafia Consulting (http://calafia.com), 1996.
[8] G. Venditto, Search engine showdown (http://www.internetworld.com/1996/05/showdown.html), Internet World (http://www.internetworld.com), Vol. 7(5) (http://www.internetworld.com/1996/05/toc.html), May 1996.
[9] N. Wingfield, Engine sells results, draws fire (http://www.news.com/News/Item/0,4,1635,00.html), C|net Inc. (http://www.news.com), June 1996.

Massimo Marchiori received the M.S. in Mathematics with Highest Honors, and the Ph.D. in Computer Science, both from the University of Padua. His research interests include World Wide Web and intranets (information retrieval, search engines, metadata, digital libraries), visualisation, programming languages (constraint programming, visual programming, functional programming, logic programming), genetic algorithms, and rewriting systems. He has published over twenty refereed papers on the above topics in various journals and proceedings of international conferences.