Mathematics in Internet Information Retrieval. Zhi-Ming Ma. April 24, 2009, Xiamen. Email: mazm@amt.ac.cn  http://www.amt.ac.cn/member/mazhiming/index.html
 
How can Google produce a ranking of 2,040,000 pages in 0.11 seconds?
A main task of Internet (Web) Information Retrieval = design and analysis of search engine (SE) algorithms, involving plenty of mathematics.
The Internet is a large-scale complex random network. The Earth is developing an electronic nervous system: a network with diverse nodes and links.
Search engine workflow (flowchart): Web; links & anchors; pages; link map; query; online part; offline part; link analysis; cache; page parser; inverted index; page & site database; web graph; web crawler; user interface; cached pages; index builder; Page Ranks; web graph generator; indexing and ranking.
Static Rank: importance ranking. Goal: compute page importance (page authority). Method: link analysis, based on the topology of the graph of all Web pages. Web graph: page → node, hyperlink → edge. Algorithms: HITS (Kleinberg) [5], PageRank (Google) [6].
Dynamic Rank: relevance ranking. Goal: compute the content-match relevance score between pages and a query. Method: statistical machine learning. Algorithms: point-wise: BM25 [7]; pair-wise: RankBoost [3], RankSVM [4], RankNet [1], …; list-wise: ListNet [10].
Research on Complex Networks and Information Retrieval. In recent years we have been involved in the research direction of random complex networks and information retrieval. I shall briefly review some of our recent results (in collaboration with Microsoft Research Asia) in this direction.
 
 
Outline: Markov chain methods in search engines; point processes describing browsing behavior; two-layer statistical learning; stochastic complement method in ranking Web sites; final remarks.
 
 
 
 
 
Browse Rank    vs.   Page Rank  ?
   HITS    PageRank 1998  Jon Kleinberg  Cornell University Sergey Brin and Larry Page  Stanford University
Nevanlinna Prize (2006): Jon Kleinberg. One of Kleinberg's most important research achievements focuses on the internetwork structure of the World Wide Web. Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure. Kleinberg introduced the idea of "authorities" and "hubs": an authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.
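The authority/hub idea can be stated as one mutual-reinforcement iteration. A minimal sketch, assuming a toy adjacency matrix (the link structure is illustrative, not from the talk):

```python
# HITS iteration on a toy graph: hub scores feed authority scores
# and vice versa, with normalization each round.
import numpy as np

A = np.array([[0, 1, 1, 0],   # adjacency: A[i, j] = 1 if page i links to page j
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

auth = np.ones(4)
hub = np.ones(4)
for _ in range(100):
    auth = A.T @ hub          # a good authority is linked to by good hubs
    hub = A @ auth            # a good hub links to good authorities
    auth /= np.linalg.norm(auth)
    hub /= np.linalg.norm(hub)
print("authorities:", auth, "hubs:", hub)
```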
PageRank, the ranking system used by the Google search engine, is query-independent and content-independent: it uses only the web graph structure.
 
Markov chain describing surfing behavior
Web surfers usually have two basic ways to access web pages: 1. with probability α, they visit a web page by clicking a hyperlink; 2. with probability 1−α, they visit a web page by typing its URL address.
The resulting transition matrix of the surfing Markov chain is $P = \alpha \tilde{P} + (1-\alpha)\,\tfrac{1}{n}\,ee^{T}$, where $\tilde{P}$ is the row-normalized hyperlink matrix, $e$ is the all-ones vector, and $n$ is the number of pages.
More generally we may consider a personalized distribution $d$: $P = \alpha \tilde{P} + (1-\alpha)\,ed^{T}$. PageRank $\pi$ is the unique positive eigenvector: $\pi^{T}P = \pi^{T}$, $\sum_i \pi_i = 1$. By the strong ergodic theorem, $\lim_{t\to\infty} x^{T}P^{t} = \pi^{T}$ for every initial distribution $x$.
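To make the two access modes concrete, here is a minimal power-iteration sketch of PageRank; the toy link structure and α = 0.85 are illustrative assumptions, not data from the talk:

```python
# Power iteration for PageRank with uniform teleportation.
import numpy as np

def pagerank(links, alpha=0.85, tol=1e-10, max_iter=1000):
    """links[i] = list of pages that page i links to."""
    n = len(links)
    pi = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(max_iter):
        new = np.full(n, (1.0 - alpha) / n)   # teleportation: typing in a URL
        for i, outs in enumerate(links):
            if outs:                          # following a hyperlink
                new[outs] += alpha * pi[i] / len(outs)
            else:                             # dangling page: jump uniformly
                new += alpha * pi[i] / n
        if np.abs(new - pi).sum() < tol:
            break
        pi = new
    return pi

# Toy 4-page web: 0 -> 1,2; 1 -> 2; 2 -> 0; 3 -> 2
print(pagerank([[1, 2], [2], [0], [2]]))
```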
Problem: how does PageRank behave as the damping factor α varies, in particular as α approaches 1?
 
 
PageRank as a Function of the Damping Factor. Paolo Boldi, Massimo Santini, Sebastiano Vigna (DSI, Università degli Studi di Milano), WWW 2005 paper. Section 3, General behaviour; 3.1, Choosing the damping factor; 3.2, Getting close to 1: can we somehow characterise the properties of the limit of PageRank as α → 1? What makes it different from the other (infinitely many, if P is reducible) limit distributions of P?
Conjecture 1: the limit of PageRank as α → 1 is the limit distribution of $P$ when the starting distribution is uniform, that is, $\lim_{\alpha \to 1^-} \pi(\alpha)^{T} = \lim_{t \to \infty} \tfrac{1}{n} e^{T} P^{t}$ (a Cesàro limit when the ordinary limit does not exist).
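A small numerical sketch of this limit, assuming a toy reducible 3-page chain: PageRank is computed in closed form for α close to 1 and compared with the uniform-start limit distribution of P.

```python
# pi(alpha)^T = (1-alpha)/n * e^T (I - alpha P)^{-1}, compared with the
# limit distribution of P started from the uniform distribution.
import numpy as np

P = np.array([[0., 1., 0.],   # page 0 links to page 1
              [0., 1., 0.],   # page 1 is absorbing
              [0., 0., 1.]])  # page 2 is absorbing
n = len(P)

def pr(alpha):
    # closed form of PageRank with the uniform teleportation vector
    return (1 - alpha) / n * np.linalg.solve(np.eye(n) - alpha * P.T, np.ones(n))

for alpha in (0.85, 0.99, 0.999999):
    print(alpha, pr(alpha))
# uniform-start limit distribution of P: (0, 2/3, 1/3)
print("uniform-start limit:", np.full(n, 1 / n) @ np.linalg.matrix_power(P, 1000))
```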
Research results by our group: limit of PageRank; comparison of different irreducible Markov chains; N-step PageRank; …
Weak points of PageRank: it uses only the static web graph structure; it reflects only the will of web managers, ignoring the will of users (e.g., the time users stay on a page); it cannot effectively fight spam and junk pages.
 
Letting Web Users Vote for Page Importance. When calculating page importance: use the users' real browsing behavior; make no artificial assumption on the users' behavior; use the users' complete browsing behavior; include the time information.
 
Browsing process: Markov property; time-homogeneity.
 
 
 
BrowseRank: user browsing graph. Vertex: a web page. Edge: a transition between pages. Edge weight $w_{ij}$: the number of observed transitions from page $i$ to page $j$. Staying time $T_i$: the time spent on page $i$. Reset probability: the normalized frequency with which a page appears as the first page of a session.
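A minimal sketch of how these three quantities could be extracted from session logs; the log format and the example sessions are hypothetical simplifications of real browser logs.

```python
# Build edge weights, staying times, and reset probabilities from sessions.
from collections import defaultdict

sessions = [
    [("a.com", 5.0), ("b.com", 30.0), ("c.com", 2.0)],
    [("b.com", 40.0), ("c.com", 1.0)],
]

weights = defaultdict(int)   # w_ij: number of observed transitions i -> j
stay = defaultdict(list)     # T_i: observed staying times on page i
first = defaultdict(int)     # counts as first page of a session

for s in sessions:
    first[s[0][0]] += 1
    for (i, _), (j, _) in zip(s, s[1:]):
        weights[(i, j)] += 1
    for page, t in s:
        stay[page].append(t)

total = sum(first.values())
reset = {p: c / total for p, c in first.items()}          # reset probabilities
mean_stay = {p: sum(ts) / len(ts) for p, ts in stay.items()}
print(reset, mean_stay, dict(weights))
```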
Mathematical deduction: maximum likelihood estimation of the staying-time distribution.
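For concreteness, the standard MLE computation under an exponential staying-time model (the exponential assumption is ours, consistent with the Q-process setting of the later slides):

```latex
% MLE for the staying-time rate on page i, assuming i.i.d. exponential
% observations t_1, ..., t_{n_i}.
L(\lambda_i) = \prod_{k=1}^{n_i} \lambda_i e^{-\lambda_i t_k},
\qquad
\log L(\lambda_i) = n_i \log \lambda_i - \lambda_i \sum_{k=1}^{n_i} t_k,
\qquad
\hat{\lambda}_i = \frac{n_i}{\sum_{k=1}^{n_i} t_k}.
```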
Mathematical deduction. Additional noise in the observed staying times: the speed of the Internet connection; the length of the page; the layout of the page; the user doing something else (e.g., answering a phone call); other factors.
Mathematical deduction. Assume the noise follows a chi-square distribution with $k$ degrees of freedom.
Mathematical deduction. Ideally we would have the estimates above; however, due to data sparseness, we encounter challenges.
Mathematical deduction. To tackle this challenge, we turn it into an optimization problem:
 
Mathematical deduction. Stationary distribution: the mean staying time on page $i$ enters the numerator; the more important a page is, the longer the staying time on it. The mean first re-visit time at page $i$ enters the denominator; the more important a page is, the smaller the re-visit time and the larger the visit frequency (see the identity below).
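The identity behind these two statements, in the standard form for a positive recurrent continuous-time Markov process (the notation $\tau_i$ is ours):

```latex
% Renewal-reward identity: the stationary mass of page i equals its mean
% staying time divided by its mean return-cycle length.
\pi_i \;=\; \frac{\mathbb{E}_i[T_i]}{\mathbb{E}_i[\tau_i]},
\qquad
\tau_i = \inf\{\, t > 0 : X_t = i \text{ and } \exists\, s < t,\ X_s \neq i \,\}
\quad \text{(first re-visit time at } i\text{)}.
```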
Mathematical deduction. Properties of the Q-process: the jumping probability is conditionally independent of the jumping time; the embedded Markov chain is a Markov chain with transition probability matrix $\tilde{P}$ (jump probabilities $\tilde{p}_{ij} = q_{ij}/q_i$).
Mathematical deduction. $\tilde{\pi}$ is the stationary distribution of the embedded chain, and the stationary distribution of the continuous-time process is obtained from it by weighting with the mean staying times: $\pi_i \propto \tilde{\pi}_i\,\mathbb{E}[T_i]$. The stationary distribution of the discrete model is easy to compute: the power method for $\tilde{\pi}$, log data for $\mathbb{E}[T_i]$ (a computational sketch follows).
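Putting the two steps together, a minimal computational sketch; the toy transition matrix and staying times are illustrative assumptions:

```python
# BrowseRank-style two-step computation: power method on the embedded
# chain, then reweighting by mean staying times.
import numpy as np

P_tilde = np.array([[0.0, 0.7, 0.3],   # embedded-chain transitions, e.g.
                    [0.2, 0.0, 0.8],   # smoothed w_ij frequencies combined
                    [0.5, 0.5, 0.0]])  # with reset probabilities
mean_stay = np.array([10.0, 40.0, 5.0])  # E[T_i], estimated from log data

# power method: stationary distribution of the embedded chain
pi = np.full(3, 1 / 3)
for _ in range(1000):
    pi = pi @ P_tilde
    pi /= pi.sum()

# stationary distribution of the continuous-time process:
# pi_i proportional to tilde(pi)_i * E[T_i]
browse_rank = pi * mean_stay
browse_rank /= browse_rank.sum()
print(browse_rank)
```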
 
Experiments. Data set: 5.6 million vertices, 53 million edges. Baselines: PageRank, TrustRank. Aims: find good websites; fight spam websites.
Website level: finding good websites.
Website level: fighting spam.
 
BrowseRank: Letting Web Users Vote for Page Importance. Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. July 23, 2008, Singapore: the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval. Best student paper!
BrowseRank: Letting Web Users Vote for Page Importance. A Google search for "Browse Rank" returns 110,000,000 results.
Further studies. Browsing processes will be a basic mathematical tool in Internet information retrieval. What about inhomogeneous processes? Marked point processes. Hyperlinks are not reliable; users' real behavior should be considered.
Dynamic Rank: relevance ranking. Goal: compute the content-match relevance score between pages and a query. Method: statistical machine learning. Algorithms: point-wise: BM25 [7]; pair-wise: RankBoost [3], RankSVM [4], RankNet [1], …; list-wise: ListNet [10].
Outline: Markov chain methods in search engines; point processes describing browsing behavior; two-layer statistical learning; stochastic complement method in ranking Web sites; final remarks.
Learning to rank (diagram): learning system → model → ranking system; the learning system minimizes a loss ("min Loss"). Wei-Ying Ma, Microsoft Research Asia.
Learning to rank in IR is a two-layer statistical learning problem: distributions of the relevance judgments of documents may vary from query to query; the numbers of documents for different queries may differ greatly; evaluation in IR is usually conducted at the query level.
Document level vs. query level. Two queries in total. Same errors in terms of pairwise classification: 780/790 = 98.73% of pairs correct overall. Different errors in terms of query-level evaluation: 99% vs. 50%, because one query contributes far more pairs than the other and dominates the document-level figure. A worked version follows.
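In the sketch below, the split of the 10 errors between the two queries is inferred so as to reproduce the slide's 98.73% and 99% vs. 50% figures; the per-query pair counts are assumptions.

```python
# Document-level pairwise accuracy vs. query-level averaged accuracy.
correct = {"q1": 775, "q2": 5}   # correctly ordered pairs per query (assumed)
total   = {"q1": 780, "q2": 10}

doc_level = sum(correct.values()) / sum(total.values())
query_level = [correct[q] / total[q] for q in total]
print(f"document-level: {doc_level:.2%}")            # ~98.73%
print("query-level:", [f"{a:.0%}" for a in query_level])  # ~99% vs. 50%
```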
Query-Level Stability and Generalization in Learning to Rank, to appear in Proceedings of the 25th International Conference on Machine Learning (ICML 2008); Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, Hang Li. Two-Layer Statistical Learning and Applications in Information Retrieval, in preparation; Yanyan Lan, Hang Li, Tie-Yan Liu, Zhi-Ming Ma, Tao Qin. Microsoft Scholar Fellowship.
We propose a new framework of statistical learning in which the training data are composed in two layers. The two-layer structure of the training data is not artificial but arises from the real world, especially from learning to rank in information retrieval.
Two-layer statistical learning framework. First layer: objects. Second layer: associated samples, consisting of instances and descriptions of instances. Instances are the objects with which we are concerned.
In learning to rank for IR, an object is a query; an instance and its corresponding description can be interpreted as a single document (pointwise algorithms), a pair of documents (pairwise algorithms), or a set of documents (listwise algorithms), with descriptions being, respectively, a score (or label) of a document, an order on a pair of documents, or a permutation (list) of documents.
Training process. Objects $q_1,\dots,q_m$ are drawn i.i.d.; for each $i$, the associated samples $z^{(i)}_1,\dots,z^{(i)}_{n_i}$ are drawn from the distribution associated with $q_i$; the training data is denoted $S = \{(q_i;\, z^{(i)}_1,\dots,z^{(i)}_{n_i})\}_{i=1}^{m}$.
The algorithm of a training process learns from the training data a function that will be used to predict the features of instances.
Given a loss function $\ell$ on instances, the empirical object-level loss is $\widehat{L}(f; q_i) = \tfrac{1}{n_i}\sum_{j=1}^{n_i} \ell(f; z^{(i)}_j)$, and the expected object-level loss is $L(f; q) = \mathbb{E}_{z \sim D_q}\,\ell(f; z)$.
The empirical risk is $\widehat{R}(f) = \tfrac{1}{m}\sum_{i=1}^{m} \widehat{L}(f; q_i)$; the expected risk is $R(f) = \mathbb{E}_q\, L(f; q)$.
The challenge is that when dealing with two-layer training data, most existing results of statistical learning cannot be directly applied. Thus we have to develop new results, or modify existing ones, to suit the new model. Much research remains to be conducted in this direction.
Generalization analysis based on stability theory. Devroye, L. and Wagner, T. (1979): stability. Stability bounds depend on properties of the algorithm itself rather than on the function class. Bousquet, O. and Elisseeff, A. (2002): uniform leave-one-out stability. Motivated by the above work, we introduce object-level uniform leave-one-out stability, in short object-level stability.
Definition: we say an algorithm possesses object-level uniform leave-one-out stability (abbreviated object-level stability) with coefficient $\tau$ if, for every training set $S$ and every $i$, $\sup_z \left|\ell(f_S; z) - \ell(f_{S^{\setminus i}}; z)\right| \le \tau$, where $f_S$ is the function learned from the training data $S$ and $f_{S^{\setminus i}}$ is the function learned from the training data with the $i$-th object removed.
Generalization based on object-level stability. Let $\tau$ be the object-level stability coefficient and $m$ the number of training objects. Then with probability at least $1-\delta$, the expected risk is bounded by the empirical risk plus a term of order $\tau$ plus a term of order $\sqrt{\ln(1/\delta)/m}$.
Note: if $\tau = o(1/\sqrt{m})$, then the bound makes sense. This condition is satisfied in many practical cases. As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization to its objective function, Ranking SVM has query-level stability. For RankBoost, query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function. These analyses agree largely with our experiments and the experiments in Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, 2006 [5] and [11].
IRSVM: the modified Ranking SVM of Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon (2006), which introduces query-level normalization into the objective. Query-level empirical risk and its generalization bound (a sketch of the normalization follows).
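A sketch of the query-level normalization idea: each query contributes equally to the empirical risk regardless of how many document pairs it generates. The hinge loss on score differences follows Ranking SVM; the variable names and toy data are illustrative.

```python
# Query-level normalized empirical risk over pairwise hinge losses.
def query_level_risk(pairs_by_query, score):
    """pairs_by_query: {query: [(better_doc, worse_doc), ...]}
    score: function mapping a document to a real-valued score."""
    risk = 0.0
    for q, pairs in pairs_by_query.items():
        q_loss = sum(max(0.0, 1.0 - (score(d1) - score(d2)))
                     for d1, d2 in pairs)
        risk += q_loss / len(pairs)          # per-query normalization
    return risk / len(pairs_by_query)        # average over queries

# usage with a toy scorer
pairs = {"q1": [("a", "b"), ("a", "c")], "q2": [("x", "y")]}
s = {"a": 2.0, "b": 0.5, "c": 1.5, "x": 0.0, "y": 1.0}.get
print(query_level_risk(pairs, s))
```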
Generalization bounds comparison: the bound for Ranking SVM vs. the bound for the modified RSVM (IRSVM).
RankBoost with query-level normalization and regularization. Introducing query-level normalization alone into RankBoost does not lead to good performance [11]: query-level normalization cannot make RankBoost have query-level stability. Adding both query-level normalization and regularization to the objective function, query-level stability can be achieved. Thus the framework offers us a way of modifying RankBoost for good ranking performance.
Experimental Results (I): query-level stability. 1200 queries from a search engine's data repository: 200 queries for training, 500 for validation, and 500 for testing. Five relevance labels; we treat the first three labels as "relevant" and the other two as "irrelevant" to construct pairs.
Experimental Results (II): query-level generalization bound.
Future problems and challenges. It is worth investigating whether new learning-to-rank algorithms can be derived under the guidance of our theoretical studies. We have investigated generalization analysis based on the novel two-layer statistical learning; we will continue to conduct other theoretical analyses. We have proposed "object-level stability"; we will try to investigate other tools. Two-layer statistical learning in other fields.
Outline: Markov chain methods in search engines; point processes describing browsing behavior; two-layer statistical learning; stochastic complement method in ranking Web sites; final remarks.
We have briefly reviewed part of our recent joint work (in collaboration with Microsoft Research Asia) concerning Internet information retrieval. Mathematics is becoming more and more important in the area of Internet information retrieval, and Internet information retrieval has become a rich source of interesting and challenging problems in mathematics.
Thank you !
 
 



Editor's Notes

  • #5 Algorithm-based web search technology.
  • #37 That is, when we compute page importance, we should base it on real users' behavior, in which every transition reflects a user's real endorsement of moving from one page to another, without any artificial assumption on user behavior. Second, we should base it on users' complete behavior: not only the transitions, but also the time users spend on web pages.
  • #43 Now we introduce the user browsing graph. Before that, consider a collection of user behavior data: each record notes when the user visited a web page and with which visiting type. From a huge amount of such data, we generate the user browsing graph. In this graph, a vertex denotes a web page and a directed edge denotes a transition between pages, with the number of transitions as the edge weight. Besides these, we collect the staying time and reset probability as metadata for each vertex. In one word, the user browsing graph is a directed weighted graph with metadata on each vertex.
  • #55 In this setting, we do not distinguish pages within the same website, and we collect user behavior data from a commercial search engine. We compare BrowseRank with PageRank and TrustRank, aiming to show that BrowseRank is an effective method for finding good websites and fighting spam websites.
  • #56 This figure lists the top 20 websites ranked by the three methods; we highlight the Web 2.0 websites, which are considered more important because more users visit them frequently and want to spend longer time on them. From the figure, we find that BrowseRank ranks more Web 2.0 websites in top positions than PageRank and TrustRank.
  • #57 As to spam pages, we did the following experiment: first, we used the three methods to rank all websites and split them into 15 buckets so that the sum of PageRank in each bucket is equal. The three columns show the number of spam websites in each bucket. We find that BrowseRank ranks fewer spam websites in the top buckets than the other two methods and pushes most spam websites into the tail bucket.