UProRevs-User Profile Relevant Results
This work describes a new system, User Profile Relevant Results (UProRevs), which filters the results given by a search engine based on the user's profile.

"UProRevs - User Profile Relevant Results" was published by the IEEE Computer Society in the proceedings of the 10th International Conference on Information Technology.


10th International Conference on Information Technology

UProRevs - User Profile Relevant Results

Amiya Kumar Tripathy
Don Bosco Institute of Technology, Mumbai, India
tripathy.a@gmail.com

Royston Olivera
Xoriant Solutions Private Limited, Powai, Mumbai, India
roystonolivera@gmail.com

0-7695-3068-0/07 $25.00 © 2007 IEEE. DOI 10.1109/ICIT.2007.39

Abstract

Internet search engines use Web crawlers to download data from the Web. The crawled data is stored on centralized servers, where it is parsed and indexed. The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. Search engines use a ranking algorithm to determine the order in which matching web pages are returned on the results page [12]. They build indices based mostly on keyword occurrence, link popularity and frequency, and use these indices for query negotiation. Using these connectivity based algorithms, they measure the quality of each individual page so that users receive a ranked page list for their queries. These search engines perform the search with respect to the query fired by the user without considering the perspective of the user, and so give results based on a generalized perspective. This paper presents a system architecture that works as a subordinate to a normal search engine by taking its results, calculating the relevance of these results with respect to the user's perspective (profile) and then displaying the results along with their relevance to the user.

1. Introduction

The World Wide Web has become a major area of interest and use for different people. The growth of the web has been exponential in the past years. Millions of people use the web daily for information retrieval. Many companies around the world have invested in developing tools for better information retrieval on the web. Search engines act as a primary gateway for information retrieval on the web.

Search engines use certain ranking algorithms that determine the order in which the web pages matching the query are displayed. These ranking algorithms are based on the frequency of keywords, link popularity and the frequency of query negotiation. But this method works only when the query fired by the user has a unique meaning.

Suppose a user types in the query python. A user who is a computer professional would expect results related to the programming language python, whereas an environmentalist or a student of zoology would expect results related to the snake python. When a Google search is made with the above query, the first 30 - 40 results displayed by Google are related to the programming language python, which is great for a computer professional but annoying for an environmentalist or zoologist. Thus we have come to a stage where we need to personalize the web for efficient information retrieval.

UProRevs (User Profile Relevant Results) deals with this by generating user profiles based on the interest (profession) that users enter while registering.

A page's quality is best judged by the human experts who need that information. UProRevs incorporates this human edge by using relevance feedback to update a user's profile.

This paper describes a new system, UProRevs, which filters the results given by a search engine based on the user's profile.

2. Related work

2.1. MetaSearch Engines

A meta-search engine is a search engine that sends user requests to several other search engines and/or databases and returns the results from each one. It allows users to enter their search criteria only once and access several search engines simultaneously. Since it is hard to catalogue the entire web, the idea is that by searching multiple search engines you are able to search more of the web in less time and do it with only one click. The ease of use and high probability of finding the desired page(s) make metasearch engines popular with those who are willing to weed through the lists of irrelevant 'matches'. Another use is to get at least some results when no result has been obtained with traditional search engines [12].

2.2. Page Rank Algorithm

Page Rank is a patented method to assign a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the Page Rank of E and denoted by PR(E). Page Rank was developed at Stanford University by Larry Page (hence the name Page-Rank [Vise and Malseed, 2005]) and Sergey Brin as part of a research project about a new kind of search engine.

2.3. Personalised Page Rank Algorithm

The framework of a traditional personalised search engine consists of two phases: a data collection phase and a data ranking phase. The first phase captures user interests from the user's browsing behaviour. This is done by extracting the keywords of web pages visited by the user. Because of the large number of keywords per page, each keyword is assigned a weight depending on its frequency. The keywords are then classified into m global categories, and the Personalised Page Rank (PPR) algorithm is applied to assign ranks to the search list.

The Personalised Page Rank Algorithm has 4 stages [1]:

i. In the first stage the web pages are clustered into the global categories, and larger weights are assigned to those web pages whose category is similar to the user's interests.

ii. In the 2nd stage web pages are further ranked based on whether their categories satisfy the user's interests and the query submitted.

iii. In the 3rd stage the algorithm checks for a user's changing interests.

iv. In the final stage, the search results undergo collaborative filtering based on recommendations by other users having similar interests.

2.4. User Profile Construction without User's Effort

This work deals with the construction of user profiles without any effort from the user. The profile generated is used to filter the search results. The profiles are constructed in two ways [2]:

2.5. Profile Construction based on Pure Browsing

This method assumes that the preferences of each user consist of the following two aspects: (1) persistent (or long term) preferences and (2) ephemeral (or short term) preferences. In persistent preferences, the user profile is incrementally developed over time and is stored for use in later sessions. The information exploited for constructing the profile usually comes from various sources, so it relies on different aspects of the user. In ephemeral preferences, on the other hand, the information used to construct each user profile is gathered only during the current session, and it is immediately exploited for executing some adaptive process aimed at personalizing the current interaction.

2.6. Profile Construction based on Modified Collaborative Filtering

In this approach, the following two methods are proposed: (1) user profile construction based on a static number of users in the neighbourhood, and (2) user profile construction based on a dynamic number of users in the neighbourhood.

3. System Architecture

The UProRevs system architecture is as shown in Figure 1.

3.1. Description of System Architecture

Core- It acts as an interface between the user, the search engine and the UProRevs system. It accepts the query from the user and fires it to the remote search engine. On receiving the results it interacts with the UProRevs system and displays the results to the user along with their relevance.
Profile Generator- The Profile Generator develops the profile, which is basically a large set of keywords, with their frequencies, that are related to the user's profile.

Profile Updater- The Profile Updater updates the user's profile based on the feedback given by the user.

Relevance Calculator- The Relevance Calculator calculates the relevance of a particular URL with respect to the user's profile. The output of the Relevance Calculator is the URL along with its relevance. This output is given to the Core, which displays the results along with their relevance. The output is also stored in the Relevance Log for future reference.

Search Engine- It is an engine which provides a set of results when the query is fired by the Core. The results provided are simple and sorted as per the ranking algorithm of the search engine.

User Information Database- The User Information Database stores the information that the user entered during registration. This information forms the base of the user profile, i.e. the initial user profile is created using the information in this database.

User Profiles- The User Profiles Database stores the profiles of all the users that are generated by the Profile Generator. This database basically consists of a set of keywords that are related to the profile of the user, along with a rank which specifies the relevance of each keyword to the user's profile. The Profile Updater updates this database whenever the user gives feedback about a particular webpage.

Feedback Log- The Feedback Log consists of a log of the URLs that the user has visited and the feedback given by the user. This feedback is used by the Profile Updater to update the user's profile every time he/she gives feedback about a particular webpage.

Relevance Log- The Relevance Log stores the URLs and their relevance. Before calculating the relevance for a particular URL, the Relevance Calculator consults the Relevance Log to check whether the relevance for this URL has already been calculated.

4. System's Flow

1. User registers by giving his personal and professional details.
2. A profile is generated by the Profile Generator based on the details given by the user.
3. User enters search query.
4. Search Engine provides current results.
5. Relevance of the results with the user's profile is calculated.
6. The results are displayed along with their relevance.
7. User gives feedback for the webpage he visits.
8. User profile is updated by the Profile Updater based on the feedback given.

5. System's Mechanism

The goal of our project is to apply our filter to the results given by a normal search engine like Google and get results of the same quality as Google, but along with their relevance to the user's profile.

5.1. Relevance of Search Results

The results that are displayed for a particular query should be relevant to the user's profile. Relevancy is a relative term, and hence the user's registered information and feedback decide how relevant a webpage is for a particular query.

5.2. Calculation of Relevancy

The user profile will basically be a large set of keywords along with a frequency value. These keywords will be sorted in decreasing order of frequency and given a rank accordingly.

Table 1.

Keywords (kui)    Frequency (fui)    Rank (ri)
ku0               fu0                r0
ku1               fu1                r1
ku2               fu2                r2
...               ...                ...
kuN               fuN                rN

The User Profile can be represented as in Table 1. When the user enters a query, the Core fires this query to the search engine, which returns the results in the form of a list of URLs sorted with respect to the Page Rank Algorithm of the remote search engine. Keywords are extracted from the web pages specified in the result. A dynamic Webpage Profile is created which contains the keywords on that particular webpage along with their frequencies. The keywords in the Webpage Profile are also ranked based on their frequency. The Webpage Profile can be represented as in Table 2.

Thus the mechanics of the UProRevs system are initialised by developing the User Profile and the Webpage Profile. After generating the two profiles, the system compares them to calculate the relevance of the page. After calculating the relevance, the results are displayed along with their relevance to the user's profile.

Table 2.

Keywords (kwi)    Frequency (fwi)    Rank (si)
kw0               fw0                s0
kw1               fw1                s1
kw2               fw2                s2
...               ...                ...
kwn               fwn                sn

Once the results are displayed, the user will check the relevance and select an appropriate URL to view. On viewing and analysing the webpage, the user will be in a position to say how relevant the webpage is to his/her profile and will accordingly give feedback on the webpage. Once the feedback is given, the user's profile gets updated based on the rating of the feedback.

6. Evaluation Metrics

6.1. Relevance Calculation

As described in the previous section, we have two profiles which consist of keywords and their corresponding ranks. The relevance in the UProRevs system is determined by calculating the weight of a particular webpage with respect to the User Profile. The weight of the webpage is calculated by first calculating the weight of each keyword in the webpage that exists in the user profile.

The factors to be considered while calculating the weight of each keyword are as follows:

i. Difference in the ranks of the keyword in the User Profile and Webpage Profile: the smaller the difference in the ranks of the keyword, the more relevant it is to the User's Profile, and so a higher weight needs to be given to that keyword. The greater the difference in the ranks of the keyword, the less relevant the keyword is to the User's Profile, and so a lower weight needs to be given to that keyword.

ii. Rank of the keyword in the Webpage Profile: the higher the rank of the keyword in the Webpage Profile, the higher the weight it gets, and vice versa. This factor covers the case where two keywords have the same rank difference but different frequencies on the webpage.

Considering these two factors, the weight of each keyword $K_i$ can be calculated as follows:

$$W_i = \begin{cases} 1 & \text{for } d_i = 0 \\ \dfrac{1}{d_i \, s_i} & \text{for } d_i \neq 0 \end{cases}$$

where
$d_i = |r_i - s_i|$
$W_i$ = weight of the keyword $K_i$ with respect to the profile.
$s_i$ = rank of the keyword $K_i$ in the Webpage Profile.
$r_i$ = rank of the keyword $K_i$ in the User Profile.

Thus if there is no difference in the ranks of the keyword in the User Profile and Webpage Profile, it gets a weight of 1, which is the highest weight for any keyword. When there is a difference in the ranks, the weight decreases as the rank difference increases and as the frequency of the keyword in the webpage decreases.

To calculate the weight of the entire page, we take the sum of the weights of each keyword in the webpage that exists in the User Profile. Thus the weight of the webpage W can be calculated as follows:

$$W = \sum_{i=1}^{n} W_i$$

where
$W_i$ = weight of the keyword $K_i$.
$n$ = number of keywords in the webpage.

Since the maximum weight a keyword can get is 1, the maximum weight that a webpage can get is n (the number of keywords on the webpage). Thus the relevance ρ of the webpage can be calculated from the weight of the webpage as follows:

$$\rho = \frac{W}{n} \times 100$$

where
$W$ = weight of the webpage.
$n$ = number of keywords in the Webpage Profile.

Thus the relevance ρ can be defined as the ratio of the weight of the webpage to the maximum weight that the page can be assigned, expressed as a percentage.

6.2. Profile Updation

When a user visits a website and gives feedback f, specifying that the website is f% relevant to his/her profile and query, the profile gets updated as per the value of f given by the user.

The Webpage Profile consists of keywords and their respective frequencies. The User Profile is updated by updating the frequency of those keywords in the User Profile that exist in the Webpage Profile. The updating of keywords depends on the following factors:

i. The feedback rating given by the user for the webpage.

ii. The frequency of the keyword in the webpage.

Based on these factors, the frequency of the keyword can be updated using the following formula:

$$nf = cf \left(1 + \frac{r}{100}\right)^{n}$$

where
$cf$ = current frequency of the keyword.
$nf$ = new frequency of the keyword.
$r$ = $f/100$, feedback rating.
$n$ = frequency of the keyword in the webpage.

7. Test Cases

The UProRevs concept was implemented under the name of the Personalized Search Engine BINGbeta. The test cases consist of testing the system against the query python for the two user profiles of a Programmer and a Zoologist. For a Programmer the system should give higher relevance to URLs related to the programming language 'python', whereas for a Zoologist the URLs related to the snake 'python' must get a higher relevance.

The result set consists of a set of URLs that we get after firing the query 'python' for a programmer and a zoologist. Note that the URLs selected for the result set relate either to the programmer or to the zoologist. Results that had little content on their page were discarded. If there is a clear distinction between the relevance for a programmer and for a zoologist, it would prove that the system is indeed recognizing a distinction between a Programmer and a Zoologist. Table 3 shows the structure of a result set.

The result set consists of 5 columns which represent the URL, the relevance for the programmer, the relevance for the zoologist, the rank of the relevance for the programmer and the rank of the relevance for the zoologist. The distinction between the two sets of relevance can be shown by Spearman's Rank Correlation.

Table 3.

URL      Programmer (P)    Zoologist (Z)    Rank (Rp)    Rank (Rz)
url 1    P1                Z1               Rp1          Rz1
url 2    P2                Z2               Rp2          Rz2
...      ...               ...              ...          ...
url n    Pn                Zn               Rpn          Rzn

7.1. Spearman's Rank Correlation

In statistics, Spearman's rank correlation coefficient, named after Charles Spearman and often denoted by the Greek letter ρ (rho), is a non-parametric measure of correlation; that is, it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables.

In principle, ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to rankings before calculating the coefficient. In practice, however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks, and the differences d between the ranks of each observation on the two variables are calculated. ρ is then given by:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where
$d_i$ = the difference between the ranks of corresponding values of P and Z.
$n$ = the number of pairs of values.

The above formula gives the correlation ρ between the two sets of relevance. The value of ρ ranges from -1 to +1. The closer ρ is to -1 or +1, the more closely the two variables are related. If ρ is close to 0, there is no relation between the variables. Thus to show that the system is distinguishing between the two profiles, we need to get values of ρ close to 0.

To test the UProRevs system we considered 3 result sets. Result Set 1 reflects the state of the system when the two profiles were just created. Result Set 2 reflects the state of the system after a few iterations, and Result Set 3 reflects the state of the system after further iterations. The ρ values for Result Sets 1, 2 and 3 were calculated as 0.618, 0.296 and 0.15 respectively. The correlation between the two sets of values decreases with time.

Figure 3 plots the correlation between the two sets of values at different instances of time. It shows that as time increases, the correlation approaches zero. In other words, the distinction between the two sets of values increases, which indicates that the system is recognising the difference between the profile of a Programmer and that of a Zoologist.
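The relevance calculation of Section 6.1 and the profile update of Section 6.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the BINGbeta implementation: the function names and the sample profiles are hypothetical, ranks are assumed to start at 1 (so that the division by s_i is always defined), ties in frequency are broken alphabetically, and the trailing n in the update formula is read as an exponent.

```python
def rank_keywords(frequencies):
    """Map each keyword to its rank; rank 1 is the most frequent keyword.

    Assumption: ranks start at 1 and ties are broken alphabetically.
    """
    ordered = sorted(frequencies, key=lambda kw: (-frequencies[kw], kw))
    return {kw: position for position, kw in enumerate(ordered, start=1)}


def keyword_weight(user_rank, page_rank):
    """W_i = 1 if d_i == 0, else 1 / (d_i * s_i), with d_i = |r_i - s_i|."""
    d = abs(user_rank - page_rank)
    return 1.0 if d == 0 else 1.0 / (d * page_rank)


def page_relevance(user_freq, page_freq):
    """Relevance rho = (W / n) * 100, where n is the number of page keywords."""
    user_ranks = rank_keywords(user_freq)
    page_ranks = rank_keywords(page_freq)
    n = len(page_ranks)
    if n == 0:
        return 0.0
    # Only page keywords that also exist in the User Profile contribute weight.
    total = sum(
        keyword_weight(user_ranks[kw], s)
        for kw, s in page_ranks.items()
        if kw in user_ranks
    )
    return total / n * 100


def update_profile(user_freq, page_freq, feedback_percent):
    """Apply nf = cf * (1 + r/100) ** n, with r = f/100.

    Only keywords present in both profiles are updated; n is the frequency
    of the keyword in the webpage. The exponent reading is an assumption.
    """
    r = feedback_percent / 100
    growth = 1 + r / 100
    return {
        kw: cf * growth ** page_freq[kw] if kw in page_freq else cf
        for kw, cf in user_freq.items()
    }


if __name__ == "__main__":
    programmer = {"python": 10, "code": 6, "snake": 1}  # hypothetical profile
    page = {"python": 5, "code": 3, "library": 2}       # hypothetical page
    print(round(page_relevance(programmer, page), 2))
    print(update_profile(programmer, page, feedback_percent=50))
```

One consequence of the weight formula that the sketch makes easy to see: a keyword ranked first on the webpage with a rank difference of 1 still receives the full weight of 1, since d_i * s_i = 1.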
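The Spearman check of Section 7.1 can be sketched the same way. The sketch below converts two lists of relevance values to ranks and applies the simplified formula; the function names are hypothetical, the relevance values in the usage lines are made-up illustrations rather than data from the paper, and the simplified formula is exact only when there are no tied values.

```python
def ranks_descending(values):
    """Return the rank of each value; rank 1 goes to the largest value.

    Assumption: no ties. Tied values would receive distinct ranks in
    input order, which the simplified Spearman formula does not correct for.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(p, z):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) over paired rank differences."""
    if len(p) != len(z):
        raise ValueError("both relevance lists must have the same length")
    n = len(p)
    if n < 2:
        raise ValueError("at least two pairs of values are required")
    rank_p = ranks_descending(p)
    rank_z = ranks_descending(z)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_p, rank_z))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical relevance values for five URLs (not taken from the paper).
    programmer_relevance = [80, 60, 40, 20, 10]
    zoologist_relevance = [60, 80, 10, 40, 20]
    print(spearman_rho(programmer_relevance, zoologist_relevance))
```

Under the paper's reading, a value near 0 for the two relevance columns of a result set would indicate that the two profiles rank the same URLs differently.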
Figure 3: Graph of Time v/s ρ

8. Conclusion

The UProRevs system provides the user with relevant search results, thus saving the valuable time that the user would otherwise spend with a general search engine. A few drawbacks, such as dishonest details provided by the user at registration or unfair ratings provided by users, may prove to be critical. However, it must be emphasized that the UProRevs system describes the architecture of a simple Personalized Search Engine.

Another point to be noted is that we have discussed the UProRevs system as a subordinate system to a remote general search engine. This subordinate system can be transformed into a stand-alone search engine in which the relevance would act as a parameter to re-rank search results, thus redefining web search.

9. Acknowledgment

This work has been done in the Multimedia Laboratory of Don Bosco Institute of Technology and is fully supported by Don Bosco Institute of Technology, Mumbai, India.

10. References

[1] Wen-Chih Peng and Yu-Chin Lin. "Ranking Web Search Results from Personalized Perspective". Proceedings of the IEEE joint Conference on E-Commerce Technology (CEC'06) and Enterprises Computing, E-Commerce and E-Service (EEE'06), San Francisco, California, June 26-29, 2006.

[2] Kazunari Sugiyama, Kenji Hatano, Masatoshi Yoshikawa. "Adaptive Web Search Based on User Profile Constructed without Any Effort from Users". Proceeding of 13th International Conference on World Wide Web, New York, USA, ISBN: 1-58113-844-X, Page 675-684, (2004).

[3] Ricardo Baeza-Yates, Carlos Hurtado, Marcelo Mendoza and Georges Dupret. "Modelling User Search Behavior". Proceedings of the Third Latin American Web Congress, (2005).

[4] Yuefeng Li and Ning Zhong. "Mining Ontology for Automatically Acquiring Web User Information Needs". IEEE Transaction on Knowledge and Data Engineering, Volume 14, Issue 4, ISSN: 1041-4347, (2006).

[5] Boris Chidlovskii, Natalie S. Glance and M. Antonietta Grasso. "Collaborative Re-Ranking of Search Results". Proceeding of AAAI-2000 Workshop on AI for Web Search, (2000).

[6] Sergey Brin and Lawrence Page. "The Anatomy of a Large Scale Hyper textual Web Search Engine". Proceeding of 7th International Conference on World Wide Web, Australia, Elsevier Science Publishers, ISSN: 0169-7552, Page 107-117, (1998).

[7] Ray-I Chang, Jan-Ming Ho. "Active Feedback for Effective Web Search". Technical Report No. TR-IIS-05-013, September 2005.

[8] Hsin-Chang Yang, Chung-Hong Lee. "Automatic Metadata Generation for Web Pages Using a Text Mining Approach". Proceedings International Workshop on Challenges in Web Information Retrieval and Integration (WIRE05), April 2005.

[9] Sung-Won Jung, Hyuk-Chul Kwon. "A Scalable Hybrid Approach for Extracting Head Components from Web Tables". IEEE Transaction on Knowledge and Data Engineering, Volume 18, Issue 2, Page 174-187, February 2006.

[10] Lawrence Page, Sergey Brin, Rajeev Motwani and T. Winograd. "Page Rank Citation Ranking: Bringing Order to the Web". http://dbpubs.stanford.edu:8090/pub/showDoc.pdf

[11] Akshay Surve, Manav Shah and Amiya Tripathy. "Optimizing Web search engine by using user feedback". Proceedings of International Conference Business and Information (BAI2007), July 2007, Tokyo, Japan. Volume 4, ISSN: 1729-9322.