User Access Pattern Enhanced Small Web Search

Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang

March 6, 2003

Technical Report MSR-TR-2003-12

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
Abstract

Current search engines generally utilize link analysis techniques to improve the ranking of returned web-pages. However, the same techniques applied to a small Web, such as a website or an intranet, cannot achieve the same level of performance because the link structure is different from that of the global Web. In this paper we propose a novel method that generates an implicit link structure based on users' access patterns and then applies a modified PageRank algorithm to rank web-pages. Our experimental results indicate that the proposed method outperforms the keyword-based method by 16%, explicit link-based PageRank by 20% and DirectHit by 14%, respectively.

1. INTRODUCTION

Global search engines such as Google [6] and AltaVista [3] are widely used by Web users to find their desired information directly on the global Web. Nevertheless, to get more specific and up-to-date information, users often go directly to websites as their starting points and search within those websites. Intranet search is another growing application area, which enables search over the documents of an organization. In both cases, search occurs in a closed sub-space of the Web, which we call small Web search in this paper.

Existing small Web search engines generally utilize the same search and ranking technologies as those widely used by global search engines. However, their performance is problematic. As reported in a Forrester survey [24], current site-specific search engines fail to deliver all the relevant content and instead return too much irrelevant content to meet the user's information needs. In the survey, the search facilities of 50 websites were tested, but none of them received a satisfactory result. Furthermore, in the TREC-8 Small Web Task, little benefit was obtained from the use of link-based methods [4], which also demonstrates the failure of existing search technologies in small Web search.

The reasons for this failure lie in several aspects. First, it is generally accepted that ranking by keyword-based similarity faces difficulties such as the shortness and ambiguity of queries. Second, link analysis technologies cannot be directly applied because the link structure of a small Web is different from that of the global Web; we explain this difference in detail in Section 2. Third, users' access logs could be utilized to improve search performance, but so far few efforts have been made in this direction except DirectHit [5].

In this paper, we propose to generate an implicit link structure based on user access patterns mined from Web logs. The link analysis algorithms then use the implicit links instead of the original explicit links to improve search performance. The experimental results reveal that the generated implicit links contain more recommendation links than the explicit links. Search experiments on the Berkeley website illustrate that our method outperforms the keyword-based method by 16%, explicit link-based PageRank by 20% and DirectHit by 14%, respectively.

The rest of this paper is organized as follows. In Section 2, we present the basic link structure of a small Web. Section 3 analyzes user access patterns in a small Web. Section 4 describes the mining technology used to construct the implicit link structure. Section 5 discusses the ranking process. Our experimental results are presented in Section 6. In Section 7, we present related work on small Web search and link analysis. Finally, conclusions and future work are discussed in Section 8.

2. THE LINK STRUCTURE OF SMALL WEBS

Link analysis algorithms such as PageRank [14] and HITS [13] use eigenvector calculations to identify authoritative Web pages based on hyperlink structures. The intuition is that a page with high in-degree is highly recommended and should have a high rank score.

[Figure 1. Link structure of the global Web (left) and a small Web (right)]

However, there is a basic assumption underlying those link analysis algorithms: the whole Web is a citation graph (see the left plot in Figure 1), and each hyperlink represents a citation or recommendation relationship. Formally, there is the following recommendation assumption.
Recommendation assumption: a hyperlink in page X pointing to page Y stands for a recommendation of page Y by the author of page X.

Henzinger [16] also points out that there is a similar-topic assumption underlying link analysis. In this paper, we consider the similar-topic assumption to be roughly true because a small Web is often about a single domain. On the global Web, the recommendation assumption is generally correct because hyperlinks encode a considerable amount of the authors' judgment. Of course, some hyperlinks are not created for recommendation purposes, but their influence can be reduced to an ignorable level on the global Web.

However, this assumption is commonly invalid in the case of a small Web. As depicted in the right part of Figure 1, the majority of hyperlinks in a small Web are more "regular" than those in the global Web. For example, most links are from a parent node to its children, between sibling nodes, or from leaves to the root (e.g. "Back to Home"). The reason is primarily that hyperlinks in a small Web are created by a small number of authors, and the purpose of the hyperlinks is usually to organize the content into a hierarchical or linear structure. Thus the in-degree measure does not reflect the authority of Web pages, making the existing link analysis algorithms unsuitable for this domain.

In a small Web, hyperlinks can be divided into navigational links and recommendation links. The latter are useful for link analysis. However, filtering out navigational links from all hyperlinks is not sufficient, because the remaining recommendation links are incomplete and unweighted. In other words, there are many implicit recommendation links (called "implicit links" for short) in a small Web.

We propose to use implicit links instead of explicit links to construct the graph of a small Web. The pages and the implicit links are modeled as a connected weighted directed graph G = (V, E), where the set V of vertices consists of the pages in the small Web, and the set E represents the implicit links among pages, each link being associated with a numeric value wi ≥ 0. The graph is completely connected because for each pair of pages there is an implicit link. The subsequent link analysis is based on this graph.

3. USER ACCESS PATTERN AND IMPLICIT LINK

User access patterns include straightforward statistics as well as more sophisticated forms such as association rules or sequential patterns. DirectHit [5] is one attempt to utilize access patterns in search. It simply takes the hit count of each web-page as its rank score. However, the problem with DirectHit is that it cannot reveal the real authoritativeness of web-pages, which is supposed to be the weighted sum of the authoritativeness of the pages that cite them.

For the purpose of mining implicit recommendation links among web-pages, we seek patterns that reflect such a recommendation relationship. In particular, we seek the following two-item sequential pattern.

Two-item sequential pattern: an ordered pair of web-pages (A→B) with support s%, which denotes that in a browsing session, if a user visits page A, he intends to visit page B after a few steps with probability s%.

To explain why two-item sequential patterns are the implicit links required for link analysis, we should show that (1) all implicit recommendation links can be generated by two-item sequential pattern mining, and (2) all two-item sequential patterns satisfy the recommendation assumption. The first hypothesis is commonly true because each implicit link reflects many users' intention to go from the source page to the destination page. Thus many users will traverse from the source page through different paths but finally arrive at the destination. This pattern can be discovered if we allow a large interval between page pairs in the mining process.

However, the second hypothesis is less intuitive. Many factors affect the quality of the generated links. Users' browsing behaviors are often random and unintentional, which brings much noise into the results. Because user access patterns are statistically significant patterns mined from a large group of users instead of any individual, the impact of noise can be lowered when there is sufficient log data.

Moreover, because the traversal paths of users are composed of the explicit links of the small Web, user access patterns are subject to the explicit link structure, which might cause the generated access patterns to include unnecessary navigational links. In the following we illustrate that explicit links have little influence on user access patterns when a small Web's connectivity is high and users have no significant path-selection bias.

The left graph of Figure 2 shows a link structure in which there is no explicit link between A and Z. Suppose the user access patterns indicate that users often start from A, traverse through B-C or through D, and finally arrive at Z. The two-item sequential patterns are A→B, B→C, C→Z, A→C, B→Z, A→D, D→Z, and A→Z, in which A→Z is the real implicit link while the others are in fact navigational links. However, if users have no path-selection bias, the support of A→Z will be two times the supports of the other patterns. If there are more paths between A and Z, each path will have only a tiny support, which can easily be removed by an appropriate minimum support. Thus the explicit links can be filtered out and only the pattern (A→Z) remains.
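To make this filtering effect concrete, consider a hypothetical count (the numbers here are ours, for illustration only): suppose 200 sessions start at A and end at Z, half following A→B→C→Z and half following A→D→Z. Allowing a large interval between page pairs, every session contains the ordered pair (A, Z), so supp(A→Z) = 200/200 = 1.0, while each navigational pattern such as (A→B), (B→C), (C→Z), (A→D) or (D→Z) appears in only half of the sessions, giving a support of 0.5. Any minimum support above 0.5 therefore keeps only the implicit link A→Z and discards the navigational ones.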
Another example, in the right graph of Figure 2, considers an existing search page (or, equivalently, a site map). Users usually pose queries on the search page S and get result sets I, J, and K. If users who have visited I also visit J for further information (maybe other search results are visited between them), there is likely an implicit link between I and J. In this case, explicit links between I and J (for example, a path I→X→Y→J) have no influence on the access patterns either. Furthermore, the existence of the search page or site map S dramatically increases the connectivity of a website.

[Figure 2. User access patterns]

Of course, there are cases where noise and explicit links may affect the link generation. The experiments in Section 6 examine the implicit links generated by our algorithm as well as their quality, i.e. whether they are really recommendation links.

4. IMPLICIT RECOMMENDATION LINK GENERATION

The data source for generating access patterns is the Web logs collected from a small Web. We first preprocess the logs and segment them into browsing sessions. The preprocessing procedure consists of three steps: data cleaning, session identification and consecutive-repetition elimination.

An entry in a Web log contains the following information: the timestamp of a traversal from a source page to a target page, the IP address of the originating host, the type of access (GET or POST) and other data. Since we do not care about the images or scripts embedded in the web-pages, we simply filter out the access entries for them. After cleaning the data, we distinguish users by their IP addresses, i.e. we assume that consecutive accesses from the same IP during a certain time interval come from the same user. In order to handle the case of multiple users behind the same IP, we remove IP addresses whose page-hit count exceeds some threshold. Afterward, the consecutive entries are grouped into browsing sessions. Different grouping criteria are modeled and compared in [29]. In this paper, we choose a similar criterion, i.e., a new session starts when the duration of the whole group of traversals exceeds a time threshold. Consecutive repetitions within a session are then eliminated, e.g. the session (A, A, B, C, C, C) is compacted to (A, B, C). After preprocessing, we obtain a series of user browsing sessions S = (s1, s2, …, sm), where si = (ui: pi1, pi2, …, pik), ui is the user id and pij are the web-pages in the browsing path.
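To make the preprocessing concrete, the following is a minimal Python sketch of these three steps (not the authors' code; the session-duration limit, the hit-count threshold and the file-suffix list are illustrative assumptions, and parsing of the raw log into (ip, timestamp, url) tuples is assumed to have been done already):

    from collections import defaultdict

    SESSION_DURATION_LIMIT = 30 * 60   # assumed time threshold for a session, in seconds
    MAX_HITS_PER_IP = 5000             # assumed cutoff for IPs shared by many users
    IGNORED_SUFFIXES = ('.gif', '.jpg', '.png', '.css', '.js')   # embedded images/scripts

    def build_sessions(entries):
        """entries: iterable of (ip, timestamp, url) tuples, timestamps in seconds.
        Returns a list of browsing sessions, each a list of page URLs."""
        # Step 1: data cleaning -- drop accesses to embedded images and scripts.
        entries = [(ip, ts, url) for ip, ts, url in entries
                   if not url.lower().endswith(IGNORED_SUFFIXES)]

        # Step 2: identify users by IP; drop IPs whose hit count is suspiciously high.
        hits = defaultdict(int)
        for ip, _, _ in entries:
            hits[ip] += 1
        entries = [e for e in entries if hits[e[0]] <= MAX_HITS_PER_IP]

        # Step 3: group consecutive accesses of one IP into sessions; a new session
        # starts when the duration of the whole group exceeds the time threshold.
        by_ip = defaultdict(list)
        for ip, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
            by_ip[ip].append((ts, url))

        sessions = []
        for visits in by_ip.values():
            current, start = [], None
            for ts, url in visits:
                if current and ts - start > SESSION_DURATION_LIMIT:
                    sessions.append(current)
                    current, start = [], None
                if not current:
                    start = ts
                # consecutive-repetition elimination: (A, A, B, C, C, C) -> (A, B, C)
                if not current or current[-1] != url:
                    current.append(url)
            if current:
                sessions.append(current)
        return sessions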
The process of two-item sequential pattern mining is outlined below. Given a set of page-visiting transactions, where each transaction is a list of web-pages ordered by access time, we find all ordered pairs of web-pages that have minimum support. After that, an implicit link is created for each two-item user access pattern (i→j), with its weight being the support of the pattern.

Existing algorithms for sequential mining such as AprioriAll [27] do not fit our situation very well. For example, when we compute the frequency of the two-item candidate pattern (D→C), a long path such as (D→A→…→V→C), where the distance between D and C is large, is also taken into consideration; that is, a long path also contributes to the frequency of (D→C). According to the analysis in Berkhin et al. [23], when the interval between page pairs in a path is larger than 6, the topics of those pages may no longer be relevant. Moreover, we only need to create the two-item sequential patterns.

Thus, we propose the following algorithm. First, a gliding window is used to create two-item ordered pairs. Second, the frequency of each ordered pair is computed, and all the infrequent pairs are filtered out with a user-specified minimum support to generate the two-item sequential patterns.

In general, when a user wants to find some useful information on a website, he may follow several pages to reach the destination. According to another analysis in Berkhin et al. [23], the average length of a session is more than 3. Based on this observation, we use a gliding window moving over the path within a session to generate the ordered pairs. Here, the window size represents the number of clicks allowed between the source page and the target page. In Table 1, we provide an example transaction set and the corresponding two-item ordered pairs extracted with a gliding window of size 2.

Table 1. Sample Web transactions and candidate patterns
  #    Transaction     Candidate patterns
  T1   {A→B→D→E}       A→B, A→D, B→D, B→E, D→E
  T2   {A→B→E→C→D}     A→B, A→E, B→E, B→C, E→C, E→D, C→D
  T3   {A→B→E→C}       A→B, A→E, B→E, B→C, E→C
  T4   {B→E→B→A→C}     B→E, E→B, E→A, B→A, B→C, A→C
  T5   {D→A→B→E→C}     D→A, D→B, A→B, A→E, B→E, B→C, E→C

We obtain all possible ordered pairs with their frequencies from a large number of user browsing sessions. Infrequent occurrences often represent random behaviors and should be removed. Precisely, the support of an item i, denoted supp(i), refers to the percentage of the sessions that contain the item i. The support of a two-item pair (i, j), denoted supp(i, j), is defined in a similar way. A two-item ordered pair is frequent if its support supp(i, j) ≥ min_supp, where min_supp is a user-specified number. Filtering the infrequent occurrences with a frequency threshold of 3 (a minimum support of 0.6), the resulting two-item sequential patterns are shown in Table 2.
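Under our reading of this procedure, the gliding-window mining step can be sketched as follows (a simplified illustration, not the authors' implementation; function and variable names are ours):

    from collections import defaultdict

    def mine_two_item_patterns(sessions, window=2, min_supp=0.6):
        """sessions: list of page sequences produced by preprocessing.
        Returns {(source, target): support} for the frequent ordered pairs."""
        counts = defaultdict(int)
        for session in sessions:
            pairs = set()   # count each ordered pair at most once per session
            for i, src in enumerate(session):
                # gliding window: targets at most `window` clicks after the source
                for dst in session[i + 1 : i + 1 + window]:
                    if dst != src:
                        pairs.add((src, dst))
            for pair in pairs:
                counts[pair] += 1
        n = len(sessions)
        return {pair: c / n for pair, c in counts.items() if c / n >= min_supp}

Applied to the five transactions of Table 1 with window = 2 and min_supp = 0.6, this sketch returns exactly the five patterns of Table 2, with supports 0.8 (A→B), 0.6 (A→E), 0.8 (B→C), 1.0 (B→E) and 0.6 (E→C).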
Table 2. Two-item sequential patterns with min_supp = 3
  Ordered pair   Frequency
  A→B            4
  A→E            3
  B→C            4
  B→E            5
  E→C            3

After the two-item sequential patterns are generated, they are used to update the connected directed graph G = (V, E) described in Section 2. All the weights of the edges in E are initialized to zero. For each two-item sequential pattern (i→j), we add its support supp(i→j) to the weight of the edge (i, j) ∈ E.

5. IMPROVE SEARCH BY IMPLICIT LINKS

After obtaining the implicit link structure, we apply an algorithm similar to PageRank [14] on it to rerank the web-pages for small Web search.

Since the implicit link structure of a small Web is modeled as a connected directed graph G = (V, E), we construct a matrix to describe the graph. In particular, assume the graph contains n pages. The n×n adjacency matrix is denoted by A, and the entry A[i, j] is defined to be the weight of edge (i, j) ∈ E. Each row of the matrix A is normalized so that it sums to 1.

The adjacency matrix is used to compute the rank score of each page. In an "ideal" form, the rank score xi of page i is a function of the rank scores of all the pages that point to page i:

  x_i = Σ_{(j,i)∈E} x_j · A[j, i]

This recursive definition gives each page a fraction of the rank of each page pointing to it, inversely weighted by the total out-link strength of that page. The above equation can be written in matrix form as:

  x = A^T x

However, in practice many pages have no in-links (or their weights are 0), and the eigenvector of the above equation is mostly zero. To get around this problem, we modify this basic model to obtain an "actual model" based on a random walk. Upon browsing a web-page, with probability 1−ε a user randomly chooses one of the links on the current page and jumps to the page it links to; with probability ε, the user "resets" by jumping to a web-page picked uniformly at random from the collection. Therefore the ranking formula is modified to the following form:

  x_i = ε/n + (1−ε) Σ_{(j,i)∈E} x_j · A[j, i]

or, in matrix form:

  x = (ε/n)·e + (1−ε)·A^T x

where e is the vector of all 1's and ε (0 < ε < 1) is a parameter. In our experiments, we set ε to 0.15. Instead of computing an eigenvector, the simplest iterative method, Jacobi iteration, is used to solve the equation.
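The ranking step can be sketched as follows (a minimal illustration of the iteration above, not the authors' implementation; it takes the pattern supports from Section 4 as edge weights, and the function and parameter names are ours):

    import numpy as np

    def implicit_pagerank(patterns, pages, eps=0.15, max_iters=100, tol=1e-8):
        """patterns: {(source, target): support}; pages: list of page identifiers.
        Returns {page: rank score}."""
        n = len(pages)
        index = {p: i for i, p in enumerate(pages)}

        # Weighted adjacency matrix: A[i, j] = supp(i -> j).
        A = np.zeros((n, n))
        for (src, dst), supp in patterns.items():
            if src in index and dst in index:
                A[index[src], index[dst]] += supp

        # Row-normalize so that each row sums to 1 (all-zero rows are left as zeros).
        row_sums = A.sum(axis=1, keepdims=True)
        A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

        # Iterate x = (eps/n) e + (1 - eps) A^T x until the change is negligible.
        x = np.full(n, 1.0 / n)
        for _ in range(max_iters):
            x_new = eps / n + (1.0 - eps) * (A.T @ x)
            if np.abs(x_new - x).sum() < tol:
                x = x_new
                break
            x = x_new
        return {p: float(x[index[p]]) for p in pages}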
The reranking mechanism is based on two types of linear combination: score-based reranking and order-based reranking. The score-based reranking uses a linear combination of the similarity score and the PageRank value of the web-pages:

  Score(P) = αS + (1−α)R    (α ∈ [0, 1])

where S is the similarity score and R is the PageRank value. The order-based reranking is based on the rank orders of the web-pages. Here we use a linear combination of a page's positions in two lists, where one list is sorted by similarity score and the other is sorted by PageRank value. That is,

  Score(P) = αO1 + (1−α)O2    (α ∈ [0, 1])

where O1 and O2 are the positions of page P in the similarity-score list and the PageRank list, respectively. Our experimental results indicate that the order-based reranking performs better than the score-based reranking.
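Both combination schemes can be sketched as follows (illustrative only; we assume the similarity scores come from the full-text engine, and for the order-based variant a smaller combined value means a better final rank, since O1 and O2 are list positions):

    def score_based_rerank(similarity, pagerank, alpha=0.5):
        """similarity: {page: similarity score S}; pagerank: {page: R}.
        Returns pages sorted by alpha*S + (1 - alpha)*R, best first."""
        combined = {p: alpha * s + (1 - alpha) * pagerank.get(p, 0.0)
                    for p, s in similarity.items()}
        return sorted(combined, key=combined.get, reverse=True)

    def order_based_rerank(similarity, pagerank, alpha=0.5):
        """Combine each page's position O1 in the similarity-sorted list with its
        position O2 in the PageRank-sorted list; a smaller combined value is better."""
        by_sim = sorted(similarity, key=similarity.get, reverse=True)
        by_pr = sorted(similarity, key=lambda p: pagerank.get(p, 0.0), reverse=True)
        o1 = {p: rank for rank, p in enumerate(by_sim, start=1)}
        o2 = {p: rank for rank, p in enumerate(by_pr, start=1)}
        combined = {p: alpha * o1[p] + (1 - alpha) * o2[p] for p in similarity}
        return sorted(combined, key=combined.get)   # ascending: best first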
6. EXPERIMENTS

In this section, we discuss the experimental data set, our evaluation metrics, and the results of our study based on those metrics.

6.1. Implicit Link Generation

Since our model targets enhancing small Web search by analyzing user click-through on web-pages, we chose a website with click-through logs for evaluation. We took a 4-month server log from the website at http://www.cs.berkeley.edu/logs/. Before mining the user access patterns from this log, we preprocessed the data as stated in Section 4 to filter out noise, separate the click-throughs into different sessions according to the interval between each user's consecutive accesses, and remove the consecutive repetitions. After the preprocessing, the logs are organized into multiple browsing sessions, each containing a sequence of consecutive click-throughs from one user. The final data set contains about 300,000 transactions, 170,000 pages and 60,000 users.

In order to build a small Web search engine, we also need the text and link structure of the website. First, we downloaded all the 170,000 pages and indexed them with the well-known Okapi system. For each page, an HTML parser is applied to remove tags and extract the links in the page. In total we obtained 216,748 hyperlinks.

We fixed several parameters for the rest of the experiments: window size 4, minimum support threshold 7, the support-weighted adjacency matrix, and order-based reranking. These parameters were determined based on an extensive experiment which we discuss in Section 6.3. We compared our algorithm with several state-of-the-art algorithms, i.e. full text search, explicit link-based PageRank, DirectHit, and the Modified HITS algorithm [12].

After two-item sequential pattern mining, we obtained about 336,812 implicit links. Figure 3 shows the overlap between the explicit links and the generated implicit links, which contains 22,122 links. That is, only 11% of the links overlap.

[Figure 3. Overlap of implicit links and explicit links]

Some evidence should be given to show that the implicit links satisfy the recommendation assumption. We developed a prediction model using the implicit links, similar to [26]. This model predicts whether a page will be visited by a user in the next step. The prediction accuracy directly reflects the correctness of the page recommendation. We take 4/5 of the log data as training data to create the implicit links and 1/5 of the log data as testing data. The following precision is used as our evaluation metric:

  Prediction precision = P+ / (P+ + P−)

where P+ is the number of correct predictions and P− is the number of wrong predictions.

As stated in Section 4, the implicit links are generated under the restriction of the minimum support. The higher the support of an implicit link, the higher the probability that the linked pages are accessed in the same session. As shown in Figure 3, the prediction precision monotonically increases as the minimum support increases. This shows that our implicit links are accurate and reflect users' behaviors and interests.

[Figure 3. Precision of page prediction by implicit links]

The quality of the implicit links was also evaluated from a human perspective. We randomly selected three subsets, in total about 375 implicit links. Seven volunteers were asked to evaluate whether the implicit links are recommendation links.

Table 3. Recommendation links in implicit and explicit links
  Subset    Implicit links   Recomm. links   Ratio
  1         128              87              0.68
  2         114              82              0.72
  3         133              84              0.63
  Average                                    0.67

  Subset    Explicit links   Recomm. links   Ratio
  1         107              47              0.44
  2         84               26              0.31
  3         99               42              0.42
  Average                                    0.39

As shown in the upper part of Table 3, about 67% of the implicit links on average are recommendation links. Three subsets selected from the explicit links are shown in the lower part of Table 3, where the average recommendation-link ratio is only about 39%.

Several examples of these implicit links are shown in Table 4. For example, the 4th implicit link, "Xuanlong's course: CS188" → "Wilensky's course: CS188", represents the same course taught by different instructors. When the user visits the page "Xuanlong's course: CS188", the page "Wilensky's course: CS188" can be recommended. Table 4 also shows that some of the implicit links overlap with explicit links, which were created by the author and satisfy the recommendation assumption.

Table 4. Examples of the implicit links
  #   Source page                                        Target page                          Explanation                  Exp. link
  1   Book: Artificial Intelligence: A Modern Approach   The book's slides                    A book and its slides        Y
  2   Jordan's homepage                                  Andrew Ng's homepage                 Teacher and student          Y
  3   Various pictures                                   Large Format landscape photographs   Picture                      N
  4   Xuanlong's course: CS188                           Wilensky's course: CS188             Same course                  N
  5   Brian Harvey's homepage                            Anthony D. Joseph's homepage         People in Vision group       N
  6   Machine learning                                   AI on the Web                        Machine learning software    N
  7   Sequin's course: cs284                             Sequin's course: cs285               Courses of the same person   N

6.2. Search Result

Because our system only reranks the results of the full text search engine, the global search precision is not changed. However, the precision of the top matches should be improved, so we evaluate the top 30 retrieval results. Consider a query Q given by a user, and let R be the set of pages relevant to the query and |R| be the number of pages in the set. Assume the system generates a result set; we take only the top 30 results from the result set as A, and the precision of search is defined as:

  Precision = |R ∩ A| / |A|
In order to evaluate our method effectively, we not only take precision as our evaluation methodology but also propose a new evaluation methodology: the degree of authority. Given a query, we ask the 7 volunteers to identify the top 10 authoritative pages from the collection of web-pages as M, and we take the top 10 of the results generated by the search system as N. Then

  Authority = |M ∩ N| / |M|

The precision measures the degree to which the algorithm produces an accurate result, while the authority measures the ability of the algorithm to produce the pages that are most likely to be visited by the user; the authority measurement is thus more relevant to the user's degree of satisfaction with the performance of a small Web search engine.
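The two metrics can be computed as in the following sketch (function names are ours; the cutoffs of 30 and 10 follow the definitions above):

    def precision_at_30(results, relevant):
        """results: ranked list of pages; relevant: set R of relevant pages.
        Implements Precision = |R ∩ A| / |A| over the top 30 results A."""
        top = results[:30]
        return len(set(top) & set(relevant)) / len(top) if top else 0.0

    def authority_at_10(results, authoritative):
        """authoritative: the 10 pages M judged most authoritative by the volunteers.
        Implements Authority = |M ∩ N| / |M| with N the top 10 system results."""
        top = set(results[:10])
        return len(top & set(authoritative)) / len(authoritative) if authoritative else 0.0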
All the implicit links are utilized to improve search performance with the techniques described in Section 5. Seven volunteers were asked to evaluate both the precision and the authority of the search results for 10 selected queries (Jordan, Vision, Artificial Intelligence, Bayesian Network, Wireless Network, Security, Reinforcement, HCI, Data Mining, and CS188). For precision, we evaluate the top 30 documents; for authority, we evaluate the top 10 results. The final relevance judgment for each document is decided by majority vote. The comparison of the algorithms is shown in Figure 4. The right-most label "Avg" stands for the average value over the 10 queries. As can be seen from Figure 4, our algorithm outperforms the other 4 algorithms. The average improvement of precision over full text search is 16%, over PageRank 20%, over DirectHit 14% and over modified HITS 12%; the average improvement of authority over full text search is 24%, over PageRank 26%, over DirectHit 15% and over modified HITS 14%.

[Figure 4. Precision and authority of the different ranking methods (full text, explicit link-based PageRank, DirectHit, modified HITS, and implicit link-based PageRank) over the 10 queries]

From Figure 4, we also found that the performance of explicit link-based PageRank is even worse than that of the full text search technique, which demonstrates the unreliability of the explicit link structure of this website.

We also compared the results with DirectHit. The DirectHit algorithm ranks the pages by simply counting the visits of previous users from the query logs: the more users access a page, the higher it is ranked. Different from DirectHit, our algorithm uses server logs to analyze users' behaviors when accessing a website. Another problem with DirectHit is that it easily creates a quick positive feedback loop in which access to a popular resource leads to a higher rank, which in turn leads to an even higher number of accesses. The experiments also show that DirectHit only improves the precision of a subset of popular queries, so its average precision is not as good as our algorithm's. Table 5 shows some example pages and their ranks as calculated by the different algorithms.

Table 5. Results of different ranking methods for the query "Vision"
  URL address                                                   #1   #2   #3   #4
  UC Berkeley Computer Vision Group                             1    41   2    8
  David Forsyth's Book: Computer Vision                         2    94   1    4
  David Forsyth's Book: Computer Vision (3rd Draft)             3    9    3    10
  A workshop on Vision and Graphics                             4    44   20   1
  UC Berkeley Computer Vision Group                             5    2    13   7
  CS 280 Home Page                                              6    14   10   11
  Thomas Leung's Publications                                   7    55   4    31
  Jitendra Malik's Brief Biography                              8    17   7    6
  An overview of Grouping and Perceptual Organization           9    5    21   5
  David Forsyth's Homepage                                      10   87   29   35
  A paper of Phil                                               13   1    6    9
  ZuWhan Kim's resume                                           18   3    5    2
  A slide of Landay's talk about Notepals                       37   4    33   14
  John A. Haddon's publication                                  39   23   41   13
  A slide of Landay's talk about Notepals                       41   6    42   18
  Chris Bregler's Publications                                  44   27   8    42
  Course: Appearance Models for Computer Graphics and Vision    51   63   47   3
  Reference of Object Recognition                               62   59   9    17

We also noticed that the modified HITS algorithm achieves relatively higher performance than full text search, DirectHit and PageRank. In fact, this algorithm is a special case of our proposed algorithm, obtained when we set the minimum support threshold to 0 and the window size to 1. As we mentioned earlier, when the minimum support threshold is 0, a lot of noisy data will be created; and when the window size is 1, we will miss many useful links, which also affects the result.

Table 5 lists the top 10 pages for the query "Vision." Here #1, #2, #3 and #4 correspond to implicit link-based PageRank, explicit link-based PageRank, modified HITS, and DirectHit, respectively.
We also found that the results from the implicit link-based PageRank are more authoritative than the results from the modified HITS. To give an example, one of the pages ranked high by explicit link-based PageRank is "ANSI Common Lisp". It contains many out-links to other pages and in-links from other pages, so its PageRank value is quite high. But according to our analysis of the logs, this page is rarely visited by users. Pages like this are not really authorities, either in content or in users' view. Compared to explicit link-based PageRank, the ranks based on implicit links are preferable.

The convergence curves of several algorithms are shown in Figure 5. It can be seen that the difference of the PageRank values between adjacent iterations drops significantly after 7 iterations and shows a strong tendency toward zero. This demonstrates the convergence of our algorithm in a practical way.

[Figure 5. Convergence of different ranking models (gap between adjacent iterations for explicit link PageRank, modified HITS and implicit link PageRank)]

6.3. Parameter Selection

As we mentioned, we used several parameters in the experiments, i.e. window size = 4, minimum support threshold = 7, the support-weighted adjacency matrix, and order-based reranking. Here we describe the experiments that we conducted in order to select those parameters.

We performed an experiment to choose the most suitable support threshold for mining the user access patterns. We asked seven volunteers to test 5 queries (Machine Learning, Web Mining, Graphics, OOP, Database Concurrency Control) for each support value and then computed the average precision of the top 30 documents. As shown in Figure 6, the system achieved the best search results when the minimum support is 7. The figure also shows that the system performance drops dramatically when the minimum support is less than 4 or higher than 10. From our observations, the reason can be explained from two aspects. First, when the minimum support is too small, users' random behaviors are counted and the number of implicit links increases (as shown in Figure 6), which introduces more irrelevant links that affect the ranking results. Second, a high support results in missing some potentially important but infrequent implicit links. Furthermore, as shown in Figure 6, increasing the support value decreases the number of implicit links, which leads to a decrease in the number of pages whose PageRank is larger than 0.15 (ε = 0.15). In the end, the impact of the PageRank on the search result becomes very weak.

[Figure 6. Search precision and implicit link number with different min_supp (search precision measured with window size = 4; the ratio of pages with implicit link PageRank > 0.15 is also plotted against min_supp)]

We also performed an experiment to test the impact of the window size. Figure 7 shows the impact of different window sizes on the search precision. The evaluation method is the same as above. From Figure 7, we found that the precision increases when the window size changes from 1 to 4, which supports the analysis of Berkhin et al. [23] that a user may click several times to get what she wants.

[Figure 7. Precision of different window sizes (min_supp = 7)]

On the other hand, by analyzing the effect of the window size on the number of implicit links, we found that more noisy implicit links are created if the window size is too large. So if we continue to increase the window size, the performance may decrease.
Furthermore, we calculated the interval distribution of our mined implicit links. As shown in Figure 9, about 13.7% of the implicit links are accessed in one step, 26% in two steps, 24% in three steps, and so on, with the remaining infrequent intervals grouped together as "Other".

[Figure 9. Interval distribution of implicit links]

When constructing the adjacency matrix, there are two choices for setting the weights of the matrix: weights of 0 or 1 (called 0-1 weighted) or the support of the implicit link as the weight (called support-weighted). We asked 7 volunteers to test 10 queries (Machine Learning, Web Mining, Graphics, OOP, Database Concurrency Control, Classification, Titanium, Distributed Database System, Parallel Algorithm, Mobile) and evaluated the average precision for the top 30 documents. As can be seen from Figure 8, the support-weighted method achieves better search precision than the 0-1 weighted method on average. The improvement may come from the fact that the support-weighted method carries a stronger recommendation signal than the 0-1 weighted one.

[Figure 8. Precision of different weighting methods (0-1 weighted vs. support-weighted) over the 10 queries]

We also performed an experiment to compare the two reranking mechanisms, i.e. score-based reranking and order-based reranking. We again asked 7 volunteers to test the same 10 queries as above and calculated the average precision for the top 30 documents. As can be seen from Figure 9, the order-based reranking works much better than the score-based reranking. In our analysis, few results have both a much higher similarity score and a much higher PageRank score; furthermore, there is little difference among the search results when looking at their similarity scores and PageRank scores. As a result, a linear combination of the similarity score cannot achieve good results. So, in our experiments, we chose the order-based reranking method to rerank the search results from full-text search.

[Figure 9. Precision of different reranking mechanisms (score-based vs. order-based) over the 10 queries]

7. Related Work

There are many research and commercial small Web search engines in the market today. Most of them use "full text search" technology, which retrieves a large number of documents containing the same keywords as the query input by the user. For example, the AltaVista site search engine [3] and Berkeley's Cha-Cha search engine [17] both use full-text search technology to build their search engines.

Hence, many approaches have been proposed to improve small Web search performance. Among these approaches, one kind of solution is to speed up the user's search for a relevant document. The Cha-Cha system [17] provides a new user interface that organizes the search results into categories reflecting the underlying intranet structure instead of returning a plain ranked list. In the Cha-Cha system, an "outline" or "table of contents" is created by first recording the shortest hyperlink paths from the root pages to every page within the intranet. After a user submits a query, those shortest paths are dynamically combined to form a hierarchical outline of the context in which the search results occur. By providing well-organized search results, the Cha-Cha system can save users' search time to some extent, but it fails when the relevance of the baseline search results is poor. M. Levene et al. [19] also proposed a navigation system, which returns a sequence of linked pages, to help the end-user browse the search results. This system can provide a good starting point for the user to initiate a navigation session, but it does not further use the association relations between pages to help the user navigate or to rerank the search results.

Link analysis technology has been widely used to analyze page importance, such as Brin and Page's PageRank [14][33] and Kleinberg's Hub/Authority [13], by analyzing the hyperlinks within a website. PageRank became Google's [6] important ranking algorithm for measuring the importance of web-pages. The whole linkage graph of the Web is used to compute a universal, query-independent rank value for each page. The Web is represented as a directed graph G = {N, E}, where N stands for the web-pages and E stands for the links between pages. In order to calculate the importance of web-pages, Google also models users' browsing as a random surfing model, which assumes that a user either follows a link from the current page or jumps to a random page in the graph and continues from there. The PageRank of a page is then computed by weighting each hyperlink proportionally to the quality of the page containing the hyperlink. To determine the quality of a referring page, they use its PageRank recursively.
This leads to the following definition of the PageRank PR(p) of a page p:

  PR(p) = ε/n + (1−ε) × Σ_{(q,p)∈G} PR(q) / outdegree(q)

where ε is a dampening factor, which is usually set between 0.1 and 0.2 [14]; n is the number of nodes in G; and outdegree(q) is the number of edges leaving page q, i.e., the number of hyperlinks on page q. The PageRank can be computed by an iterative algorithm and corresponds to the primary eigenvector of a matrix derived from the adjacency matrix of the available portion of the Web.

Although PageRank is very successful in Web search engines, we note that it relies on the recommendation assumption: by linking to a target, a page author is recommending it. Accordingly, a page with high in-degree is highly recommended and should be ranked more highly. A recent study found that this assumption is often true on the WWW but does not hold within a specific website. We noticed that Google's enterprise search engine provides standard keyword analysis and other ranking methods, but its PageRank is less valuable in the structured design of business sites or intranets. The failure of applying link analysis to the Web TREC datasets [4] also demonstrates that link analysis does not work for a sub-space of the Web. The main reason is that the links within a website do not satisfy the two assumptions for link analysis, i.e. the recommendation assumption and the topic similarity assumption.

Since directly applying link analysis fails for small Web search engines, some systems shift their focus to users' usage information. For example, DirectHit [5] is one representative example, which harnesses the decisions of millions of daily Internet searchers to provide more relevant and better organized search results. DirectHit's site ranking system, which is based on the concepts of "click popularity" and "stickiness," is currently used by Lycos, HotBot, MSN, Infospace, About.com and about 20 other search engines. The underlying assumption is that the most relevant pages for a topic are the most visited ones. Although this method is intuitive, it also has some restrictions. One of the problems is that it needs a large amount of user logs and only works for some popular queries. Another problem is that it can easily fall into a quick positive feedback loop in which access to a popular resource leads to a higher rank, which in turn leads to an even higher number of accesses.

There is also some work on utilizing usage data to help Web link analysis. For example, Miller et al. [12] propose a method that uses usage data to modify the adjacency matrix in Kleinberg's HITS algorithm (called Modified HITS). It replaces the adjacency matrix M with a link matrix M', which assigns the weights between nodes (pages) based on usage data collected from web-server logs. At first, the weights in the link matrix are all initialized to 0. Then, the weight from node i to node j is incremented by one whenever a user travels from page i to page j. This method is somewhat similar to our proposed solution, but it has some problems. (1) Since this method does not separate the user logs into sessions based on their tasks, it is inevitable that much noisy data is introduced into the link matrix. For example, if a user first reads some news on cnn.com and then goes to Hotmail to check his email, cnn will be considered to have some relationship with Hotmail. (2) Sometimes users follow different links to reach the same goal. If we simply consider adjacent pages as related, the underlying relationship may not really be discovered. In our solution, we not only consider adjacent pages as related, but also consider the pages within the same browsing session. Hence, the latent relationship can be recovered by our process.

8. CONCLUSION AND FUTURE WORK

In this paper, we proposed a novel small Web search model based on mining users' browsing behaviors in a small Web. First, a two-item sequential pattern mining algorithm is applied to find the frequent user access patterns from the Web log. Then, the mined access patterns are treated as the implicit links of the web-pages within the small Web, replacing the original explicit links. Furthermore, the weights of the implicit links in the adjacency matrix of the website are assigned by the support of the mined access patterns instead of a simple 0 or 1. Finally, the PageRank algorithm is applied on the adjacency matrix to calculate the rank score of each web-page, and this rank score is interpolated with the similarity from full-text search to rerank the search results. Our experimental results showed that our proposed method outperforms the existing solutions. In other words, the mined access patterns represent the latent relationships between web-pages and provide better information for recommending a web-page, which is crucial because this kind of link obeys the recommendation assumption of the PageRank algorithm.

Since PageRank is a query-independent algorithm, as a next step we hope to adapt our algorithm to rank the pages relevant to a given topic. Another possible direction is to apply our mined implicit links to improve web-page clustering accuracy. Existing solutions for clustering web-pages are based on the content and the explicit links of web-pages. As we noted earlier, the explicit links in a specific website are mainly for content organization, so it is difficult to achieve good clustering results with this kind of link. We plan to combine our mined implicit links and the contents of web-pages to cluster web-pages effectively. Furthermore, to improve the effectiveness of a website, we may discover the gap between the website designer's expectation and the visitors' behavior by comparing the importance of web-pages calculated from the explicit links and the implicit links, and then suggest modifications to the site structure to make it more effective for browsing and navigation.
9. ACKNOWLEDGMENTS

This work was performed at Microsoft Research Asia. Many thanks to Jian-Yun Nie for his very helpful advice.

10. REFERENCES

[1] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Finding authorities and hubs from link structures on the World Wide Web. WWW10, 2001.
[2] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the Web. WWW9 / Computer Networks 33(1-6): 309-320, 2000.
[3] AltaVista: http://www.altavista.com.
[4] D. Hawking, E. Voorhees, P. Bailey and N. Craswell. Overview of the TREC-8 Web Track. In Proceedings of TREC-8, pages 131-150, Gaithersburg, MD, 2001.
[5] DirectHit: http://www.directhit.com.
[6] Google: http://www.google.com.
[7] H. Kato, T. Nakayama and Y. Yamane. Navigation Analysis Tool based on the Correlation between Contents Distribution and Access Patterns. WEBKDD, 2000.
[8] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3): 259-289, November 1997.
[9] H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 1973.
[10] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan and A. S. Tomkins. The Web as a graph: measurements, models, and methods. 1999.
[11] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu. Mining Access Patterns Efficiently from Web Logs. Proceedings 2000 Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 2000.
[12] J. C. Miller, G. Rae and F. Schaefer. Modifications of Kleinberg's HITS Algorithm Using Matrix Exponentiation and Web Log Records. SIGIR'01, 2001.
[13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms. ACM, 1998.
[14] L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford University Database Group, 1998.
[15] M. Hearst. Next Generation Web Search: Setting Our Sites. IEEE Data Engineering Bulletin, Special issue on Next Generation Web Search, Luis Gravano (Ed.), September 2000.
[16] M. R. Henzinger. Link Analysis in Web Information Retrieval. IEEE Data Engineering Bulletin, November 2000.
[17] M. Chen, M. Hearst, J. Hong and J. Lin. Cha-Cha: A System for Organizing Intranet Search Results. In Proceedings of the 2nd USITS, Boulder, CO, October 11-14, 1999.
[18] M. Levene and G. Loizou. Web interaction and the navigation problem in hypertext. In A. Kent, J. Williams, and C. Hall, editors, Encyclopedia of Microcomputers. Marcel Dekker, New York, NY, 1999.
[19] M. Levene and R. Wheeldon. A Web Site Navigation Engine. Proceedings 10th International WWW Conference, 2001.
[20] M. M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 1963.
[21] N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information. SIGIR'01, 2001.
[22] N. Craswell, D. Hawking, and S. Robertson. Effective Site Finding using Link Anchor Information. SIGIR'01, September 2001.
[23] P. Berkhin, J. D. Becher and D. J. Randall. Interactive Path Analysis of Web Site Traffic.
[24] P. Hagen, H. Manning and Y. Paul. Must search stink? The Forrester Report, Forrester, June 2000.
[25] P. Raghavan. Social Networks from the Web to the Enterprise. IEEE Internet Computing, January 2002.
[26] Q. Yang, H. H. Zhang and T. Li. Mining Web Logs for Prediction Models in WWW Caching and Prefetching. In The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'01, Industry Applications Track, August 26-29, 2001.
[27] R. Agrawal and R. Srikant. Mining Sequential Patterns. Proc. of the 14th Int'l Conference on Data Engineering, Taiwan, 1995.
[28] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[29] R. Cooley, B. Mobasher and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1), 1999.
[30] R. Cooley, P.-N. Tan and J. Srivastava. Discovery of Interesting Usage Patterns from Web Data. To appear in Springer-Verlag LNCS/LNAI series, 2000.
[31] R. Srikant and Y. Yang. Mining Web Logs to Improve Website Organization. WWW10, 2001.
[32] S. Chakrabarti, B. E. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the Link Structure of the World Wide Web. IEEE Computer, vol. 32, no. 8, August 1999.
[33] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings 7th International WWW Conference, 1998.
[34] T. Nakayama, H. Kato and Y. Yamane. Discovering the Gap between Web Site Designers' Expectations and Users' Behavior. WWW9, 2000.
[35] W. Lin, S. Alvarez and C. Ruiz. Efficient Adaptive-Support Association Rule Mining for Recommender Systems. Data Mining and Knowledge Discovery, 6, 83-105, 2002.
