Language Model Information Retrieval with Document Expansion
A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai Presented By Kumar Ashish INF384H/CS395T: Concepts of Information Retrieval (and Web Search) Fall 2011
Zero Count Problem: a term that is a plausible word for the information need may not occur in the document. General Problem of Estimation: terms occurring only once are overestimated, since their occurrence was partly by chance. To solve these problems, high-quality extra data is required to enlarge the sample of the document.
This gives the average logarithmic distance between the probabilities that a word would be observed at random from the unigram query language model and from the unigram document language model.
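In symbols, the standard KL-divergence retrieval criterion described above (a sketch consistent with the slide's description; Θq and Θd denote the query and document language models):

```latex
D(\Theta_q \,\|\, \Theta_d) \;=\; \sum_{w} p(w \mid \Theta_q)\,\log\frac{p(w \mid \Theta_q)}{p(w \mid \Theta_d)}
```

Documents are ranked by the negative divergence, so a smaller distance to the query model means a higher score.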
c(w, d) is the number of times word w occurs in document d, and |d| is the length of the document. Problems: ◦ Assigns zero probability to any word not present in the document, causing problems when scoring a document with KL-divergence.
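A minimal sketch of the maximum-likelihood estimate p(w|d) = c(w,d)/|d| and the zero-count problem it creates (the tokens and names here are illustrative):

```python
# Maximum-likelihood unigram estimate: p(w|d) = c(w,d) / |d|.
from collections import Counter

def mle(doc_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

doc = ["language", "model", "retrieval", "model"]
p = mle(doc)
# "model" occurs twice in a 4-word document:
assert p["model"] == 0.5
# Any unseen query word gets probability 0, which makes
# log p(w|d) (and hence KL-divergence scoring) undefined:
assert p.get("expansion", 0.0) == 0.0
```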
Proposes a fixed parameter λ to control the interpolation with p(w|Θc), the probability of word w given by the collection model Θc.
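A minimal sketch of Jelinek-Mercer smoothing as described above, p(w|d) = (1-λ)·p_ml(w|d) + λ·p(w|Θc) (function names and the toy collection model are illustrative):

```python
# Jelinek-Mercer smoothing with a fixed interpolation parameter lam.
from collections import Counter

def jm_prob(w, doc_tokens, collection_prob, lam=0.1):
    counts = Counter(doc_tokens)
    p_ml = counts[w] / len(doc_tokens)          # maximum-likelihood part
    return (1 - lam) * p_ml + lam * collection_prob[w]

coll = {"model": 0.01, "expansion": 0.001}      # toy collection model
doc = ["language", "model", "retrieval", "model"]
# Seen word: dominated by the document estimate.
p_seen = jm_prob("model", doc, coll)            # 0.9*0.5 + 0.1*0.01
# Unseen word: gets a nonzero probability from the collection model.
p_unseen = jm_prob("expansion", doc, coll)
assert p_unseen > 0
```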
It uses a document-dependent coefficient (parameterized with μ) to control the interpolation.
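A minimal sketch of Dirichlet smoothing, p(w|d) = (c(w,d) + μ·p(w|Θc)) / (|d| + μ); note how the effective interpolation weight μ/(|d|+μ) shrinks for longer documents (names and toy values are illustrative):

```python
# Dirichlet prior smoothing: the interpolation coefficient depends
# on document length through mu.
from collections import Counter

def dirichlet_prob(w, doc_tokens, collection_prob, mu=2000):
    counts = Counter(doc_tokens)
    return (counts[w] + mu * collection_prob[w]) / (len(doc_tokens) + mu)

coll = {"model": 0.01}                      # toy collection model
short_doc = ["model"]                       # |d| = 1
long_doc = ["model"] * 10 + ["retrieval"] * 990   # |d| = 1000
# The short document is pulled strongly toward the collection model;
# the long document relies mostly on its own counts.
p_short = dirichlet_prob("model", short_doc, coll)   # (1 + 20) / 2001
p_long = dirichlet_prob("model", long_doc, coll)     # (10 + 20) / 3000
```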
Uses clustering information to smooth a document. ◦ Divides all documents into K clusters. ◦ First smoothes the cluster model with the collection model using Dirichlet smoothing. ◦ Then takes the smoothed cluster as a new reference model to smooth the document using JM smoothing.
ΘLd stands for document d's cluster model, and λ and β are the smoothing parameters.
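Using the slide's notation, the two-stage smoothing can be written as (a sketch; ΘC denotes the collection model and Ld document d's cluster):

```latex
p(w \mid d) \;=\; (1-\lambda)\, p_{ml}(w \mid d) \;+\; \lambda\, p(w \mid \Theta_{L_d}),
\qquad
p(w \mid \Theta_{L_d}) \;=\; \frac{c(w, L_d) + \beta\, p(w \mid \Theta_C)}{|L_d| + \beta}
```

The second equation is the Dirichlet step (cluster smoothed with the collection model); the first is the JM step (document smoothed with the cluster).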
Better than JM or Dirichlet smoothing: it expands a document with more data from its cluster instead of just using the same collection language model for every document.
Cluster D is good for smoothing document a but not good for document d. Ideally, each document should have its own cluster centered around itself.
Expand each document using a probabilistic neighborhood to estimate a virtual document (d'). Apply any interpolation-based method (e.g., JM or Dirichlet) to this virtual document, treating the word counts given by the virtual document as if they were the original word counts.
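The expansion step above can be sketched as follows. The exact weighting scheme here (α on the original counts, 1-α spread over confidence-weighted neighbor counts) is an assumption for illustration, not necessarily the paper's exact formula:

```python
# Sketch of document expansion: build pseudo word counts for a
# virtual document d' by mixing the original counts with
# confidence-weighted counts from neighbor documents.
from collections import Counter

def expand(doc_tokens, neighbors, alpha=0.5):
    """neighbors: list of (confidence, tokens) pairs, confidences summing to 1."""
    pseudo = Counter()
    for w, c in Counter(doc_tokens).items():
        pseudo[w] += alpha * c                       # original document's share
    for gamma, nb_tokens in neighbors:
        for w, c in Counter(nb_tokens).items():
            pseudo[w] += (1 - alpha) * gamma * c     # neighbors' share
    return pseudo   # treat these as the word counts of the virtual document d'

doc = ["document", "expansion"]
neighbors = [(0.7, ["document", "model"]), (0.3, ["retrieval"])]
d_prime = expand(doc, neighbors)
# A related word unseen in the original document now has a nonzero count:
assert d_prime["model"] > 0
```

The resulting counts can be fed directly into JM or Dirichlet smoothing in place of c(w, d).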
Can use the cosine rule to determine the documents in the neighborhood of the original document. Problems: ◦ In a narrow sense the neighborhood would contain only a few documents, whereas in a wide sense the whole collection may be included. ◦ Neighbor documents cannot be treated as sampled from the same model as the original document.
Associates a confidence value with every document in the collection. ◦ This confidence value reflects the belief that the document is sampled from the same underlying model as the original one.
A confidence value (γd) is associated with every document to indicate how strongly it is believed to be sampled from d's model. The confidence value should follow a normal-distribution (Gaussian decay) shape.
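The slide's formula is not reproduced here; one plausible Gaussian-decay form (an assumption for illustration, using cosine similarity sim(d, b), not necessarily the paper's exact definition) is:

```latex
\gamma_b \;\propto\; \exp\!\left(-\,\frac{\bigl(1-\operatorname{sim}(d,b)\bigr)^{2}}{2\sigma^{2}}\right)
```

Documents very similar to d get confidence near 1, and confidence decays rapidly as similarity drops.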
Shorter documents require more help from their neighbors, while longer documents rely more on themselves. A parameter α is introduced to control this balance.
For efficiency: the pseudo term count can be calculated using only the top M closest neighbors (as the confidence value follows a decaying shape).
For performance comparison: ◦ Uses four TREC data sets: AP (Associated Press news, 1988-90), LA (LA Times), WSJ (Wall Street Journal, 1987-92), SJMN (San Jose Mercury News, 1991). For testing how the algorithm scales up: ◦ Uses TREC8. For testing the effect on short documents: ◦ Uses DOE (Department of Energy).
Comparison of DELM + (Diri/JM) with Diri/JM. λ for JM and μ for Dirichlet are set to their optimal values, and the same values of λ or μ are used for DELM without further tuning. M is 100 and α is 0.5 for DELM. DELM outperforms JM and Dirichlet on every data set, with improvement as much as 15% in the case of Associated Press News (AP).
Compared precision values at different levels of recall on the AP data set. DELM + Dirichlet outperforms Dirichlet at every precision point. (Precision-recall curve on AP data.)
Compares the performance trend with respect to M (top M closest neighbors for each document). Conclusions: ◦ Neighborhood information improves retrieval accuracy. ◦ Performance becomes insensitive to M when M is sufficiently large.
Comparison of DELM + Dirichlet with CBDM: DELM + Dirichlet outperforms CBDM in MAP values on all four data sets.
Documents in AP88-89 were shrunk to 30% of the original length in the 1st run, 50% in the 2nd, and 70% in the 3rd. Results show that DELM helps shorter documents more than longer ones (41% improvement on the 30%-length corpus vs. 16% on the full-length corpus).
Performance change with respect to α: the optimal points migrate as documents become shorter (the 100%-length corpus is optimal at α = 0.4, but the 30% corpus has to use α = 0.2).
Combination of DELM with pseudo feedback: DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a). Experiment performed by: ◦ Retrieving documents by the DELM method. ◦ Choosing the top five documents to do model-based feedback. ◦ Using the expanded query model to retrieve documents again. Result: DELM can be combined with pseudo feedback to improve performance.