Determining Relevance Rankings with Search Click Logs

Thesis Proposal

Inderjeet Singh
Supervisor: Dr. Carson Kai-Sang Leung
July 8, 2011

Abstract

Search engines track users' search activities in search click logs. These logs can be mined to gain better insight into user behavior and to build user behavior models. These models can then be used in a ranking algorithm to give better, more focused, and more desirable results to the user. There are two problems with the existing models. First, researchers have not considered trust bias while interpreting click logs. Trust bias, or trust factor, is the preference a user gives to certain URLs that he or she trusts. For example, users show a preference for websites like wikipedia.com, Yahoo! Answers, stackoverflow.com, and many others because they trust the documents from these URLs. The trusted websites can differ for people in different areas or niches. Thus, trust bias is an important parameter to consider when designing a user behavior model and using it in a ranking algorithm. Second, researchers have not considered user clicks on other parts of a search page, such as advertisements, when building their models. Interpreting these clicks is important because advertisements are also part of the search results, and relevant advertisements help a user fulfill his or her information needs. I propose to extend the existing research to build a user behavior model from search click logs that overcomes the above two problems and then estimates the relevance of documents.

1 Introduction

Search engines are used to answer ad-hoc or specific queries at various times. Queries can be of two types: navigational and informational. A navigational query looks for specific information such as a single website, web page, or single entity. An informational query looks for information about general or niche topics.

Search engines rank search results in decreasing order of relevance to the query. Search engines assign scores to the documents for the user query using a ranking function, which is derived automatically from a ranking algorithm using training data.
Training data is a collection of query-document pairs. Each query-document pair in the training data is represented by a set of properties of both the query and the document, called features. Each query-document pair is labeled according to its relevance under categories such as perfect, excellent, good, fair, or bad. These relevance labels are assigned by humans to each query-document pair, indicating how well the document matches the query. These human judgments are known as editorial judgments. Good editorial judgments are important because the quality of a ranking function depends upon the quality of the training data used.

Ranking function efficiency depends upon two critical aspects: the construction of training data with good relevance labels and the selection of features in the feature set.

Usually, the user starts examining the result snippets (the combination of a URL and a small description of the document) for a query from top to bottom. The probability of a user examining a result snippet is called the examination probability. When examining, the user may find some snippets useful. This usefulness is the perceived relevance, or attractiveness, of the snippet. Eventually, the user clicks on a snippet and lands on a document. The information fulfillment that the user gets out of that document is the actual relevance of the document. The probability that the user will click on a result snippet is known as the click probability. The click probability is always less than or equal to the examination probability. The click probability for URLs on a results page decreases from top to bottom. This decrease is known as position bias.

Search engines maintain logs of every user interaction. The log entries have information about the queries, the results displayed for each query, the number of results displayed, which results were clicked, the user's IP address, and timestamps. Generally, these logs are on the order of terabytes in size.
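To make this concrete, the following is a minimal sketch in Python of what one such log entry might look like. The field names and layout are hypothetical, since commercial engines use their own proprietary schemas; the sketch only illustrates the kind of information listed above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClickLogEntry:
    """One user interaction record; all field names are illustrative."""
    session_id: str          # identifies one user search session
    timestamp: float         # Unix time of the interaction
    user_ip: str             # IP address of the user
    query: str               # the query string issued
    results: List[str]       # URLs displayed, in ranked order
    clicked_positions: List[int] = field(default_factory=list)  # 0-based ranks clicked

entry = ClickLogEntry(
    session_id="s-001",
    timestamp=1310102400.0,
    user_ip="192.0.2.1",
    query="frequent pattern mining",
    results=["url-a", "url-b", "url-c"],
    clicked_positions=[0, 2],  # the user clicked the first and third results
)
```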
Mining click logs gives better insight into how a user interacts with the search engine. For example, a click can be interpreted as a vote for that document for a particular query. So, the information about clicks and user interaction can be used to build user behavior models that capture user preferences. These models can be further used to estimate document relevance for better search results, as described in Section 2.1. The model in this context would be a set of equations and rules describing user interactions or actions on the search page in terms of probability values.

Also, to date, most of the relevance labels for training data are manually assigned by editors (humans), who can be biased at times and may not necessarily represent the aggregate behavior of search engine users. Manually drafting training data is also a time-consuming process. To overcome these problems, training data must be automatically labeled; see Section 2.4.

I intend to build a user behavior model that differs from existing models in considering trust bias and clicks on other parts of a search page. My model will closely capture user preferences and will make realistic and flexible assumptions about user behavior. For example, a user can do any of the following: click a single result or multiple results, go to a document and never come back to the results, or click advertisements on a search page. This model will then be used to estimate the actual document relevance for a query. The relevance estimates will then be added as a feature in the training data to compute a better ranking function.

To evaluate my model, I will first compare the relevance estimates of documents from my model with editorial judgments and then with the relevance estimates of earlier models; see Section 5. I will look for an improved ranking function after adding the relevance estimates from my model as a feature in the training data.

2 Related Work

Section 2.1 describes how to build user behavior models by interpreting click logs and then use the document relevance estimated from these models as a feature in the training data to get a new ranking function. My research will follow this work closely, and I will try to improve upon these models. Section 2.2 describes trust bias modeling in online communities. My user behavior model will consider trust bias toward certain URLs while interpreting document relevance from click logs. Section 2.3 describes modeling the relevance of advertisements for queries using click logs. In my model, the relevance of advertisements will be considered in the overall fulfillment of user information needs. Section 2.4 describes a method to automatically estimate relevance labels for query-document pairs in training data from click logs. I will use this method to automatically generate labels for the training data, which will also include a feature from my proposed user behavior model.
2.1 Estimating Document Relevance from User Behavior Models

Dupret and Liao [4] designed a user behavior model that estimates the actual relevance of clicked documents, not the perceived relevance. Their model focuses on the fulfillment that a user gets while browsing and clicking documents. Their main assumption is that a user stops searching when his information need is fulfilled. Dupret and Liao did not limit the number of clicks (single or multiple) or the number of query reformulations in their model assumptions, which makes their model quite realistic. My model will match their solution methodology, adopting their assumptions in addition to my own of including the trust factor and other parts of the search page. I will consider two more ranking efficiency metrics in my evaluation over and above what they have used; see Section 5.

Craswell et al. [3] designed a model that explains position bias from click logs. They assumed that the user examines the result snippets sequentially from top to bottom and that the user's search ends as soon as he or she clicks a relevant document for the query. This assumption is known as the single-click assumption. Their work also assumes that the user does not skip a result snippet without examining it.
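These assumptions yield the standard cascade formulation, which I restate here in my own notation: if r_j is the probability that the snippet at position j is clicked once examined, then the probability of a click at position i is

\[
P(C_i = 1) \;=\; r_i \prod_{j=1}^{i-1} (1 - r_j),
\]

since position i is examined only if none of the higher-ranked snippets 1, ..., i-1 attracted the click. The shrinking product term is how this family of models accounts for position bias.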
Chapelle and Zhang [2] developed a model that gives an unbiased estimate of the actual relevance of a webpage, i.e., the model removes any position bias. Chapelle and Zhang's work extends the work of Craswell et al. [3] with the assumption that the user will not stop searching until satisfied with the information. They overrule Craswell et al.'s single-click assumption, instead assuming multiple clicks and query reformulations. Their work, however, does not consider the other parts of a search page, like sponsored results and related queries, which I am going to consider in my work.

Dupret and Piwowarski [5] developed a model that differs from the work of Craswell et al. [3] in that the user can skip a document without examining it. Their focus is more on attractiveness and perceived relevance, and they model only single clicks. This model makes many assumptions, which limits its ability to estimate actual user behavior.

Guo et al. [7] proposed independent and dependent click models for modeling multiple clicks on a result page. The independent click model assumes that the click probability is independent for different positions of results and that the examination probability is unity for every result. This model is only successful in explaining that the user usually clicks on the first snippet on a result page. The dependent click model extends the idea of Craswell et al. [3] to multiple clicks. This model describes the interdependence between clicks and examination at different positions. The dependent click model is good at explaining clicks on the first and the last snippets of the result page. These two models can also be used with click log streams, i.e., a continuous flow of data. They do not, however, consider trust bias and other elements of a search page, which I will work on.
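For contrast with the cascade formulation above, the two models of Guo et al. [7] can be summarized as follows; this is my paraphrase of their assumptions, not their exact notation. In the independent click model, every position is examined, so a click depends only on the snippet itself:

\[
P(E_i = 1) = 1, \qquad P(C_i = 1) = r_i .
\]

In the dependent click model, the user continues past an unclicked position, but after a click at position i continues examining only with a position-dependent probability \lambda_i:

\[
P(E_{i+1} = 1 \mid E_i = 1, C_i = 0) = 1, \qquad P(E_{i+1} = 1 \mid C_i = 1) = \lambda_i ,
\]

which is what lets the dependent click model capture sessions with multiple clicks.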
2.2 Trust Bias Modeling in Online Communities

Finin et al. [6] modeled trust bias, or influence, in online social communities. They discussed how a popular blog or website in an online community can influence the opinions of other blogs. Their model of trust bias in online communities can be applied to URLs in my user behavior model.

2.3 Advertisement Relevance Prediction from Search Click Logs

Raghavan and Hillard [8] proposed a model that improves the relevance of advertisements for a query in a search engine. Earlier models that ranked advertisements for a query depended upon the number of clicks an advertisement received, i.e., an advertisement got a better rank based upon its number of clicks. Raghavan and Hillard's model instead interprets click logs to estimate the actual relevance of an advertisement to the query and ranks advertisements accordingly; it is not based upon the number of clicks. I will apply the actual relevance of advertisements from Raghavan and Hillard to the overall fulfillment of the user's information need for a query in my model.
2.4 Automatically Estimating Relevance Labels for Training Data

Agrawal et al. [1] proposed a method that can be used to automatically estimate relevance labels of query-document pairs from click logs. They transformed user clicks into weighted, directed graphs and formulated the label generation problem as an ordered graph partitioning problem. In full generality, the problem of finding n labels is NP-hard. Agrawal et al. showed that optimal labeling of a query-document pair can be done in linear time by using only two labels (relevant or non-relevant). They proposed heuristic solutions to automatically estimate efficient labels from click logs. This automatically labeled training data can save humans from manually defining labels for query-document pairs.
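As an illustration of the graph-construction step only (the ordering and partitioning heuristics of Agrawal et al. [1] are beyond this sketch, and the exact edge definition here is my hypothetical simplification), the following code aggregates sessions into a weighted, directed graph per query, where an edge (a, b) counts sessions in which document a was clicked while document b was displayed but skipped:

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

# One session: (query, displayed_docs, clicked_docs); layout is hypothetical.
Session = Tuple[str, List[str], List[str]]

def build_click_graph(sessions: List[Session]) -> Dict[str, Dict[Tuple[str, str], int]]:
    """For each query, build a weighted digraph over documents from clicks."""
    graphs: Dict[str, Dict[Tuple[str, str], int]] = defaultdict(lambda: defaultdict(int))
    for query, displayed, clicked in sessions:
        skipped = [d for d in displayed if d not in clicked]
        for a, b in product(clicked, skipped):
            graphs[query][(a, b)] += 1  # a preferred over b in this session
    return graphs

sessions = [
    ("mining", ["d1", "d2", "d3"], ["d2"]),
    ("mining", ["d1", "d2", "d3"], ["d2", "d3"]),
]
print(dict(build_click_graph(sessions)["mining"]))
# {('d2', 'd1'): 2, ('d2', 'd3'): 1, ('d3', 'd1'): 1}
```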
3 Problem Description

Previous user behavior models for estimating the relevance of documents from search click logs have not considered trust bias. Also, while building models, little consideration has been given to assumptions such as clicks on other parts of a search page, like advertisements. Such assumptions allow a model to interpret flexible and realistic user behavior from click logs more closely.

4 Solution Methodology

My proposed model will be an extension of the work of Dupret and Liao [4]. In addition to their assumptions described in Section 2.1, I will include my own, based on trust bias and clicks on other parts of the search page, especially advertisements.

I intend to build a model that will estimate the actual relevance of documents with respect to specific queries. My model will be a set of equations and rules describing user interactions or actions on the search page in terms of probability values. A user session is the set of actions that the user performs on a search page to satisfy his information needs, such as examining a result snippet, clicking on a search result or advertisement, coming back and clicking more results in decreasing ranking order, reformulating the query, or abandoning the search.

I will model trust bias in the form of probability equations, just like any other user interaction. Trust bias equations would come into play in my model after the user has already examined the result snippet and found it attractive. The user then clicks on the snippet based upon his trust in that URL. Thus, the click probability on a URL now depends on its examination probability, its attractiveness, and then the trust bias, while in Dupret and Liao's model, the click probability on a URL depends only on its examination probability and then its attractiveness. For instance, if the examination probability of a URL is e, its attractiveness probability is a, the trust bias is t, the click probability is c, the query is q, the document is d, the actual document relevance is r, and the fulfillment probability is f, then the click probability c on a particular d for a specific q will depend upon the joint probability of e, a, and t happening in sequence. Also, the actual document relevance r depends on the fulfillment probability f from the clicked document d.
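A minimal sketch of this chain in probability notation, under my working assumption that examination, attractiveness, and trust act as successive conditions (the exact factorization is part of the ongoing model design):

\[
P(c = 1 \mid q, d) \;=\; P(e = 1) \,\cdot\, P(a = 1 \mid e = 1) \,\cdot\, P(t = 1 \mid a = 1),
\]

and the actual relevance r of document d for query q is then estimated from the fulfillment observed after the click, e.g., r \propto P(f = 1 \mid c = 1, d). Dropping the trust term P(t = 1 | a = 1) recovers the examination-then-attractiveness structure of Dupret and Liao's model.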
Modeling clicks on other parts of a search page, such as advertisements, will also be done in the form of probability equations. Clicks on advertisements are a form of user interaction on the search page, and advertisements also become part of the overall fulfillment of the user's information needs. I am still researching solution methodologies for modeling these clicks.

After building the model with the above-mentioned improvements, the estimated relevance of documents for a query from my model will be combined with the existing features of the training data to recompute a new ranking function. The ranking obtained by the new function will be measured by the discounted cumulative gain metric, a measure of ranking effectiveness. If the metric improvement is significant when compared with the results obtained by the ranking function used in an existing search engine for the same query, then the document relevance from my model can be used as a feature in training data.

My challenge will be to get search click log data from a commercial search engine. If I am unable to get such logs, I will try getting logs from a meta search engine like MetaCrawler, Dogpile, or Excite. If that is not possible, I will implement my own meta search engine and then collect logs. If all else fails, I will use previously released logs from a search engine.

5 Evaluation

I will use the discounted cumulative gain and the normalized discounted cumulative gain to measure ranking effectiveness. I will also use precision and recall as metrics in my evaluation. A correct result is a result retrieved by a search engine that is relevant to a query. Precision is the ratio of correct results to the results retrieved, and recall is the ratio of correct results to the relevant results for the query.
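In standard notation, these metrics are as follows, where rel_i is the graded relevance label of the result at rank i and k is the evaluation depth; the 2^{rel_i} - 1 gain is one common variant of DCG, and which variant I adopt will be fixed during evaluation:

\[
\text{precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|},
\qquad
\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|},
\]

\[
\text{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\text{nDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k},
\]

where IDCG@k is the DCG@k of the ideal ranking, i.e., the results sorted by their true relevance labels.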
A raw search click log will be pre-processed to remove duplicate queries and noise. Noise here refers to the following queries: queries for which the number of user sessions is less than 10, queries with fewer than 10 results, and queries with no clicks on snippets in a user session. Queries with no clicks on snippets are removed because most of these queries are misspelled or ambiguous. Only the first result page will be considered because most of the clicks happen there. Position bias will be removed from the dataset using editorial judgments for the results of a query.
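A minimal sketch of the noise-filtering step, assuming the log has already been grouped into per-query sessions; the record layout is hypothetical, and this is one reading of the three noise criteria above:

```python
from typing import Dict, List, Tuple

# One session for a query: (displayed_docs, clicked_docs).
Session = Tuple[List[str], List[str]]

def filter_noise(
    query_sessions: Dict[str, List[Session]],
    min_sessions: int = 10,
    min_results: int = 10,
) -> Dict[str, List[Session]]:
    """Drop noisy queries as defined in Section 5."""
    clean: Dict[str, List[Session]] = {}
    for query, sessions in query_sessions.items():
        if len(sessions) < min_sessions:
            continue  # fewer than 10 user sessions for this query
        if any(len(displayed) < min_results for displayed, _ in sessions):
            continue  # query displayed fewer than 10 results
        if any(not clicked for _, clicked in sessions):
            continue  # a session with no snippet clicks: likely misspelled or ambiguous
        clean[query] = sessions
    return clean
```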
After pre-processing, the dataset will be produced automatically by using the methods described in Section 2.4. The dataset will then be split equally into training and test datasets. The training dataset will be used to train the ranking algorithm to generate a new ranking function, while the test dataset will be used to measure how effectively the function ranks the results for a user's query.

I will do a comparative analysis of the estimated relevance of results from my model against the models of Dupret and Liao [4] and Guo et al. [7]. The analysis will also compare the estimated document relevance from these three models with the editorial judgments for the same query-document pairs. The comparison will give an idea of how accurately these models estimate relevance. Results will be compared for both informational and navigational queries.

If the estimated document relevance from my model matches editorial judgments to a considerable extent, then these relevance estimates will be used as a feature in the training data. After this step, the ranking algorithm will be trained on the above data to generate a new ranking function, which will be used to rank the test data. Rankings will also be generated for the same test data by the existing ranking functions used by popular search engines. I will then calculate the discounted cumulative gain, normalized discounted cumulative gain, precision, and recall. If these metrics show considerable improvement, my model can be considered successful.

6 Timeline

Task | Start Date | End Date
Literature review | Sept 2010 | Ongoing
Designing the model | May 2011 | Sept 2011
Comparative analysis with existing models | Oct 2011 | Dec 2011
Inclusion of relevance feature from my model in the training data | first week of Jan 2012 | last week of Jan 2012
Evaluation of the new ranking function for different metric improvements | Feb 2012 | last week of April 2012
Thesis write-up | May 2012 | mid July 2012

7 Summary

I want to build a model that can estimate a document's actual relevance from click logs, after modeling trust bias and clicks on the other parts of a search page. This model will follow some of the assumptions and solution methodology of Dupret and Liao [4]. If successful, the relevance estimates from this model can be used as a feature in training data to improve the ranking function of a search engine.

References

[1] Rakesh Agrawal, Alan Halverson, Krishnaram Kenthapadi, Nina Mishra, and Panayiotis Tsaparas. Generating labels from clicks. In Proceedings of the Second Web Search and Data Mining (WSDM) Conference, Barcelona, Spain, pages 172–181. ACM, 9–11 February 2009.

[2] Olivier Chapelle and Ye Zhang. A dynamic Bayesian network click model for web search and ranking. In Proceedings of the 18th International Conference on World Wide Web (WWW), Madrid, Spain, pages 1–10. ACM, 20–24 April 2009.

[3] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the First Web Search and Data Mining (WSDM) Conference, Palo Alto, CA, USA, pages 87–94. ACM, 11–12 February 2008.

[4] Georges Dupret and Ciya Liao. A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine. In Proceedings of the Third Web Search and Data Mining (WSDM) Conference, New York City, NY, USA, pages 181–190. ACM, 4–6 February 2010.

[5] Georges Dupret and Benjamin Piwowarski. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, pages 331–338. ACM, 20–24 July 2008.

[6] Tim Finin, Anupam Joshi, Pranam Kolari, Akshay Java, Anubhav Kale, and Amit Karandikar. The information ecology of social media and online communities. AI Magazine, 29(3):77–92, 2008.

[7] Fan Guo, Chao Liu, and Yi Min Wang. Efficient multiple-click models in web search. In Proceedings of the Second Web Search and Data Mining (WSDM) Conference, Barcelona, Spain, pages 124–131. ACM, 9–11 February 2009.

[8] Hema Raghavan and Dustin Hillard. A relevance model based filter for improving ad quality. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, pages 762–763. ACM, 19–23 July 2009.