Mathematics in Internet Information Retrieval



Mathematics in Internet Information Retrieval. Zhi-Ming Ma. April 24, 2009, Xiamen (厦门).
How can Google rank 2,040,000 pages in 0.11 seconds?
A main task of Internet (Web) information retrieval is the design and analysis of search engine (SE) algorithms, which involves plenty of mathematics.
The Internet is a large-scale complex random network. The Earth is developing an electronic nervous system, a network with diverse nodes and links.
The workflow of a search engine (diagram): Web, links & anchors, pages, link map. Online part: user interface, query, cache, cached pages. Offline part: web crawler, page parser, inverted index, page & site database, web graph generator, web graph, link analysis, page ranks, index builder. Indexing and ranking.
Static Rank (static ranking)
  - Importance ranking
    - Goal: compute page importance, page authority.
    - Method: link analysis
      - The method is based on the topology of the graph of the whole Web.
      - Web graph: page → node, hyperlink → edge.
  - Algorithms:
    - HITS (Kleinberg) [5]
    - PageRank (Google) [6]
Dynamic Rank (relevance ranking)
  - Goal: compute the content-match relevance score between pages and a query.
  - Method: statistical machine learning
  - Algorithms:
    - Pointwise: BM25 [7]
    - Pairwise: RankBoost [3], RankSVM [4], RankNet [1], …
    - Listwise: ListNet [10]
Research on Complex Networks and Information Retrieval. In recent years we have been involved in the research direction of random complex networks and information retrieval. I shall briefly review some of our recent results (in collaboration with Microsoft Research Asia) in this direction.
Outline
  - Markov chain methods in search engines
  - Point processes describing browsing behavior
  - Two-layer statistical learning
  - Stochastic complement method in ranking Web sites
  - Final remarks
BrowseRank vs. PageRank?
1998: HITS (Jon Kleinberg, Cornell University) and PageRank (Sergey Brin and Larry Page, Stanford University).
Nevanlinna Prize (2006): Jon Kleinberg
  - One of Kleinberg's most important research achievements concerns the network structure of the World Wide Web.
  - Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure.
  - Kleinberg introduced the idea of "authorities" and "hubs": an authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.
PageRank, the ranking system used by the Google search engine, is:
  - query independent,
  - content independent,
  - based only on the web graph structure.
Markov chain describing surfing behavior.
Web surfers usually have two basic ways to access web pages:
  1. with probability α, they visit a web page by clicking a hyperlink;
  2. with probability 1 − α, they visit a web page by typing in its URL address.
This gives a Markov chain whose transition matrix combines the hyperlink matrix P with the teleportation distribution d, i.e. G = αP + (1 − α)·1dᵀ. More generally we may consider a personalized d. PageRank is the unique positive eigenvector: πG = π with π > 0 and Σᵢ πᵢ = 1. By the strong ergodic theorem, the long-run fraction of time the surfer spends on each page converges to π.
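The fixed-point computation behind PageRank can be sketched with a plain power iteration; the toy graph and all names below are illustrative, not from the talk.

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with damping factor alpha and uniform teleportation."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-normalize the adjacency matrix; dangling pages jump uniformly.
    P = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg, 1.0)[:, None],
                 1.0 / n)
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = alpha * (pi @ P) + (1.0 - alpha) / n  # uniform personalization vector d
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

# Toy 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
adj = np.array([[0., 1., 0.],
                [0., 0., 1.],
                [1., 1., 0.]])
pi = pagerank(adj)
```

By the ergodic theorem cited above, `pi` approximates the long-run fraction of time the random surfer spends on each page.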
Problem:
PageRank as a Function of the Damping Factor. Paolo Boldi, Massimo Santini, Sebastiano Vigna (DSI, Università degli Studi di Milano), WWW 2005.
  - Section 3: general behaviour; 3.1 choosing the damping factor; 3.2 getting close to 1.
  - Can we somehow characterise the properties of the limit of PageRank as the damping factor tends to 1?
  - What makes it different from the other (infinitely many, if P is reducible) limit distributions of P?
Conjecture 1: the limit of PageRank as the damping factor tends to 1 is the limit distribution of P when the starting distribution is uniform.
Research results by our group:
  - Limit of PageRank
  - Comparison of different irreducible Markov chains
  - N-step PageRank
  - …
Weak points of PageRank:
  - It uses only the static web graph structure.
  - It reflects only the will of web managers and ignores the will of users, e.g. the time users stay on a web page.
  - It cannot effectively fight spam and junk pages.
Letting Web Users Vote for Page Importance (Yuting Liu @ SIGIR'08, 06/09/09)
  - When calculating page importance:
    - use the users' real browsing behavior (make no artificial assumptions on the users' behavior);
    - use the users' complete browsing behavior (including the time information).
Browsing process:
  - Markov property
  - time-homogeneity
BrowseRank: the user browsing graph.
  - Vertex: web page.
  - Edge: transition between pages.
  - Edge weight w_ij: the number of observed transitions from page i to page j.
  - Staying time T_i: the time spent on page i.
  - Reset probability: the normalized frequency with which a page appears as the first page of a session.
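Building this graph from log data can be sketched as follows; the session format, function name, and toy data are assumptions for illustration only.

```python
from collections import defaultdict

def build_browsing_graph(sessions):
    """Build edge weights, staying times, and reset probabilities from session logs.

    sessions: list of sessions, each a list of (page, staying_time) pairs
    in visit order (an assumed log format, not the paper's).
    """
    w = defaultdict(int)       # w[(i, j)]: number of observed transitions i -> j
    stay = defaultdict(list)   # stay[i]: observed staying times on page i
    first = defaultdict(int)   # counts of each page as the first page of a session
    for session in sessions:
        first[session[0][0]] += 1
        for (page, t), (nxt, _) in zip(session, session[1:]):
            w[(page, nxt)] += 1
            stay[page].append(t)
        last_page, last_t = session[-1]
        stay[last_page].append(last_t)
    total = sum(first.values())
    reset = {p: c / total for p, c in first.items()}  # normalized first-page frequencies
    return dict(w), dict(stay), reset

# Two toy sessions.
sessions = [[("a", 2.0), ("b", 5.0)],
            [("b", 1.0), ("a", 3.0), ("b", 4.0)]]
w, stay, reset = build_browsing_graph(sessions)
```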
Mathematical deduction: maximum likelihood estimation of the staying time.
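If the staying time on a page is modeled as exponentially distributed (the usual assumption for a continuous-time Markov process), the maximum likelihood estimate of the rate is one over the sample mean; a minimal sketch with made-up numbers:

```python
def exp_mle_rate(times):
    """MLE of the exponential rate from observed staying times: lambda_hat = n / sum(times)."""
    return len(times) / sum(times)

# Three hypothetical observed staying times (seconds) on one page.
rate = exp_mle_rate([2.0, 5.0, 3.0])
mean_stay = 1.0 / rate  # estimated mean staying time
```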
Mathematical deduction. Additional noise in the observed staying time comes from:
  - the speed of the Internet connection,
  - the length of the page,
  - the layout of the page,
  - the user doing something else in the meantime (e.g., answering a phone call),
  - other factors.
Mathematical deduction: assume the noise follows a chi-square distribution with k degrees of freedom.
Mathematical deduction: ideally we could estimate these quantities directly; however, due to data sparseness, we encounter challenges.
Mathematical deduction: to tackle this challenge, we turn the estimation into optimization problems.
Mathematical deduction: the stationary distribution.
  - The mean staying time on page i: the more important a page is, the longer the staying time on it.
  - The mean first re-visit time of page i: the more important a page is, the smaller the re-visit time and the larger the visit frequency.
Mathematical deduction: properties of the Q-process.
  - The jumping probability is conditionally independent of the jumping time.
  - Embedded Markov chain: the sequence of visited pages is a Markov chain with its own transition probability matrix.
Mathematical deduction.
  - The stationary distribution of the continuous-time process is obtained from the stationary distribution of the embedded Markov chain.
  - The stationary distribution of the discrete model is easy to compute: the power method on the embedded chain, with its parameters estimated from log data.
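These two steps can be sketched numerically: run the power method on an embedded transition matrix, then reweight by mean staying times, using the standard relation for continuous-time chains that the stationary probability of page i is proportional to its embedded-chain probability times its mean staying time. The toy matrix and all names are illustrative.

```python
import numpy as np

def browse_rank(P_embedded, mean_stay, tol=1e-12, max_iter=10000):
    """Stationary distribution of the continuous-time browsing process:
    pi_i proportional to (mean staying time on i) * (embedded-chain stationary prob of i)."""
    n = P_embedded.shape[0]
    pt = np.full(n, 1.0 / n)
    for _ in range(max_iter):              # power method on the embedded chain
        new = pt @ P_embedded
        if np.abs(new - pt).sum() < tol:
            break
        pt = new
    pi = pt * mean_stay                    # reweight by mean staying times
    return pi / pi.sum()

# Toy 2-page embedded chain; mean staying times: 2s on page 0, 1s on page 1.
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])
pi = browse_rank(P, np.array([2.0, 1.0]))
```

Here the embedded chain favors page 1 (stationary 1/3 vs. 2/3), but the longer staying time on page 0 exactly balances it.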
Experiments
  - Data set: 5.6 million vertices, 53 million edges.
  - Baselines: PageRank, TrustRank.
  - Aims: find good websites; fight spam websites.
Website level: finding good websites.
Website level: fighting spam.
BrowseRank: Letting Web Users Vote for Page Importance. Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. July 23, 2008, Singapore, the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval. Best student paper!
BrowseRank: Letting Web Users Vote for Page Importance. A Google search returns 110,000,000 results for "Browse Rank".
Further studies:
  - Browsing processes will be a basic mathematical tool in Internet information retrieval.
  - What about inhomogeneous processes?
  - Marked point processes:
    - hyperlinks are not reliable;
    - users' real behavior should be considered.
Dynamic Rank (relevance ranking)
  - Goal: compute the content-match relevance score between pages and a query.
  - Method: statistical machine learning
  - Algorithms:
    - Pointwise: BM25 [7]
    - Pairwise: RankBoost [3], RankSVM [4], RankNet [1], …
    - Listwise: ListNet [10]
Outline
  - Markov chain methods in search engines
  - Point processes describing browsing behavior
  - Two-layer statistical learning
  - Stochastic complement method in ranking Web sites
  - Final remarks
Learning to rank (diagram): a learning system learns a model by minimizing a loss; a ranking system applies the model. Wei-Ying Ma, Microsoft Research Asia.
Learning to rank in IR is a two-layer statistical learning problem:
  - the distributions of the relevance judgments of documents may vary from query to query;
  - the numbers of documents for different queries may differ greatly;
  - evaluation in IR is usually conducted at the query level.
Document level vs. query level:
  - two queries in total;
  - the same error in terms of pairwise classification: 780/790 = 98.73% correct;
  - different errors in terms of query-level evaluation: 99% vs. 50%.
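The contrast can be reproduced with a tiny computation; the pair counts below are hypothetical, chosen to echo the slide's numbers.

```python
# (correctly ordered pairs, total pairs) per query; hypothetical counts.
queries = {"q1": (780, 785), "q2": (5, 10)}

# Document (pair) level: pool all pairs across queries, then measure accuracy.
overall = sum(c for c, _ in queries.values()) / sum(t for _, t in queries.values())

# Query level: measure accuracy per query, then average over queries.
query_level = sum(c / t for c, t in queries.values()) / len(queries)
```

Here `overall` is about 98.7% while `query_level` is about 74.7%, because the tiny query counts as much as the large one at query level.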
Query-Level Stability and Generalization in Learning to Rank, to appear in Proceedings of the 25th International Conference on Machine Learning, 2008. Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, Hang Li.
Two-Layer Statistical Learning and Applications in Information Retrieval, in preparation. Yanyan Lan, Hang Li, Tie-Yan Liu, Zhi-Ming Ma, Tao Qin.
(Microsoft Scholar Fellowship)
We propose a new framework of statistical learning in which the training data are composed in two layers. The two-layer structure of the training data is not artificial but arises from the real world, especially from learning to rank in information retrieval.
Two-layer statistical learning framework:
  - First layer: objects.
  - Second layer: associated samples, i.e. instances and descriptions of instances. Instances are what we are ultimately concerned with.
In learning to rank for IR, an object is a query; an instance and its corresponding description can be interpreted as:
  - a single document, with a score (or label) of the document (pointwise algorithms);
  - a pair of documents, with an order on the pair (pairwise algorithms);
  - a set of documents, with a permutation (list) of the documents (listwise algorithms).
Training process: the objects are drawn i.i.d.; for each object i, the associated samples are drawn from a distribution that depends on the object. Together, the objects and their associated samples constitute the training data.
A training algorithm learns from the training data a function that will be used to predict the features of instances.
From a loss function on instances we obtain the empirical object-level loss and the expected object-level loss.
Averaging the object-level losses over objects yields the empirical risk and the expected risk.
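The formulas themselves were lost with the slide images; a plausible reconstruction, with all notation assumed: n objects q_1, …, q_n, each with m_i associated samples z_{i1}, …, z_{im_i}, and a loss ℓ.

```latex
\[
  \widehat{R}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n}
      \underbrace{\frac{1}{m_i}\sum_{j=1}^{m_i} \ell\bigl(f; z_{ij}\bigr)}_{\text{empirical object-level loss}},
  \qquad
  R(f) \;=\; \mathbb{E}_{q}\Bigl[\,\mathbb{E}_{z \mid q}\bigl[\ell(f; z)\bigr]\Bigr].
\]
```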
The challenge is that, when dealing with two-layer training data, most existing results of statistical learning cannot be applied directly. We therefore have to develop new results, or modify existing ones, to suit the new model. Much research remains to be done in this respect.
Generalization analysis based on stability theory:
  - Devroye, L. and Wagner, T. (1979): stability. Stability bounds depend on properties of the algorithm itself rather than on the function class.
  - Bousquet, O. and Elisseeff, A. (2002): uniform leave-one-out stability.
  - Motivated by the above work, we introduce object-level uniform leave-one-out stability (in short, object-level stability).
Definition: we say an algorithm possesses object-level uniform leave-one-out stability (abbreviated as object-level stability) if the function learned from the full training data and the function learned from the training data with one object removed are uniformly close.
Generalization based on object-level stability: with probability at least 1 − δ, the expected risk is bounded by the empirical risk plus a term that depends on the object-level stability and on the number of training objects.
Note: if the stability coefficient vanishes fast enough as the number of training objects grows, the bound is meaningful; this condition is satisfied in many practical cases. As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization into its objective function, Ranking SVM has query-level stability. For RankBoost, query-level stability can be achieved if we introduce both query-level normalization and regularization into its objective function. These analyses largely agree with our experiments and with the experiments in Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, 2006 [5] and [11].
IRSVM: the modified Ranking SVM of Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, 2006, which uses query-level normalization in the query-level empirical risk, together with its generalization bound.
Comparison of generalization bounds: Ranking SVM vs. the modified Ranking SVM.
RankBoost with query-level normalization and regularization:
  - Introducing query-level normalization alone into RankBoost does not lead to good performance [11]: query-level normalization cannot make RankBoost query-level stable.
  - By adding both query-level normalization and regularization to the objective function, query-level stability can be achieved.
  - Thus the framework offers a way of modifying RankBoost for good ranking performance.
Experimental results (I): query-level stability.
  - 1,200 queries from a search engine's data repository: 200 queries for training, 500 for validation, and 500 for test.
  - Five relevance labels; we treat the first three labels as "relevant" and the remaining ones as "irrelevant" to construct pairs.
Experimental results (II): query-level generalization bound.
Future problems and challenges:
  - It is worth seeing whether new learning-to-rank algorithms can be derived under the guidance of our theoretical studies.
  - We have investigated generalization analysis based on the novel two-layer statistical learning; we will continue with other theoretical analyses.
  - We have proposed object-level stability; we will try to investigate other tools.
  - Two-layer statistical learning in other fields.
Outline
  - Markov chain methods in search engines
  - Point processes describing browsing behavior
  - Two-layer statistical learning
  - Stochastic complement method in ranking Web sites
  - Final remarks
  - We have briefly reviewed part of our recent joint work (in collaboration with Microsoft Research Asia) concerning Internet information retrieval.
  - Mathematics is becoming more and more important in the area of Internet information retrieval.
  - Internet information retrieval has been a rich source of interesting and challenging problems in mathematics.
Thank you!