# Language Model Information Retrieval with Document Expansion


1. A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai. Presented by Kumar Ashish. INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011.
2. Zero-count problem: a term that is a plausible word for the information need does not occur in the document. General estimation problem: terms occurring only once are overestimated, since their single occurrence is partly due to chance. Solving both problems requires high-quality extra data to enlarge the sample of the document.
3. This gives the average logarithmic distance between the probabilities that a word would be observed at random from the unigram query language model and from the unigram document language model.
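The slide's equation image did not survive extraction. The standard KL-divergence ranking function from the language-modeling retrieval literature, which matches this description, is:

```latex
\mathrm{score}(q, d) \;=\; -D\!\left(\theta_q \,\|\, \theta_d\right)
\;=\; -\sum_{w \in V} p(w \mid \theta_q)\,\log\frac{p(w \mid \theta_q)}{p(w \mid \theta_d)}
```

where θq and θd are the unigram query and document language models and V is the vocabulary.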
4. C(w, d) is the number of times word w occurs in document d, and |d| is the length of the document. Problem: this maximum-likelihood estimate assigns zero probability to any word not present in the document, which breaks scoring with KL-divergence.
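A minimal sketch of this maximum-likelihood estimator, showing the zero-count problem the slide describes (the toy document is illustrative):

```python
from collections import Counter

def mle(document_tokens):
    """Maximum-likelihood unigram model: p(w|d) = C(w, d) / |d|."""
    counts = Counter(document_tokens)
    length = len(document_tokens)
    return {w: c / length for w, c in counts.items()}

doc = "language model smoothing for retrieval".split()
p = mle(doc)
# Each of the 5 distinct words gets probability 1/5.
# A query word absent from the document gets probability zero,
# which makes log p(w|d) undefined in KL-divergence scoring.
missing = p.get("expansion", 0.0)
```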
5. Two standard smoothing methods: Jelinek-Mercer (JM) smoothing and Dirichlet smoothing.
6. JM smoothing uses a fixed parameter λ to control the interpolation with the probability of word w given by the collection model Θc.
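In standard notation, the JM-smoothed document model is a linear interpolation of the maximum-likelihood document model with the collection model:

```latex
p_\lambda(w \mid d) \;=\; (1 - \lambda)\, p_{ml}(w \mid d) \;+\; \lambda\, p(w \mid \Theta_c)
```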
7. Dirichlet smoothing uses a document-dependent coefficient (parameterized by μ) to control the interpolation.
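The Dirichlet-smoothed estimate in standard notation is:

```latex
p_\mu(w \mid d) \;=\; \frac{C(w, d) + \mu\, p(w \mid \Theta_c)}{|d| + \mu}
```

This is equivalent to JM smoothing with a document-dependent coefficient λ = μ / (|d| + μ), so shorter documents are smoothed more heavily.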
8. Cluster-based document modeling (CBDM) uses clustering information to smooth a document. It divides all documents into K clusters, first smooths each cluster model with the collection model using Dirichlet smoothing, then takes the smoothed cluster as a new reference model to smooth the document using JM smoothing.
9. ΘLd stands for document d's cluster model, and λ and β are smoothing parameters.
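The slide's formula is missing; a plausible reconstruction consistent with the description on the previous slide (Dirichlet smoothing of the cluster model ΘLd with β, then JM interpolation with λ) is:

```latex
p(w \mid d) \;=\; (1 - \lambda)\, p_{ml}(w \mid d) \;+\; \lambda\,
\frac{C(w, L_d) + \beta\, p(w \mid \Theta_c)}{|L_d| + \beta}
```

where the second term is document d's cluster model smoothed with the collection model.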
10. CBDM works better than JM or Dirichlet smoothing because it expands a document with additional data from its cluster instead of relying on the single collection language model.
11. A cluster D may be good for smoothing document a but not for document d. Ideally, each document should have its own cluster centered around itself.
12. Expand each document using its probabilistic neighborhood to estimate a virtual document d'. Then apply any interpolation-based method (e.g., JM or Dirichlet) to this virtual document, treating the word counts it gives as if they were the original word counts.
13. Cosine similarity can be used to determine the documents in the neighborhood of the original document. Problems: a narrowly defined neighborhood would contain only a few documents, whereas a widely defined one may include the whole collection; and neighbor documents are not sampled from exactly the same model as the original document.
14. Instead, a confidence value is associated with every document in the collection; this value reflects the belief that the document is sampled from the same underlying model as the original one.
15. A confidence value γd is associated with every document to indicate how strongly it is believed to be sampled from d's underlying model. The confidence value is assumed to follow a normal (Gaussian) shape:
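The slide's formula is missing; one illustrative Gaussian-shaped instantiation (the sim(d, b) notation and σ parameter here are assumptions, not necessarily the paper's exact parameterization) is:

```latex
\gamma_d(b) \;\propto\; \exp\!\left(-\frac{\left(1 - \mathrm{sim}(d, b)\right)^2}{2\sigma^2}\right)
```

so that documents b close to d receive high confidence, decaying rapidly with distance.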
16. Shorter documents require more help from their neighbors, while longer documents can rely more on themselves. A parameter α is introduced to control this balance.
17. For efficiency, the pseudo term count can be calculated using only the top M closest neighbors, since the confidence value follows a decaying shape.
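The expansion described in slides 12-17 can be sketched as follows. This is a simplified illustration, not the paper's exact estimator: the Gaussian weighting, the α mixing, and the parameter values are assumptions for demonstration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def pseudo_counts(doc, collection, alpha=0.5, top_m=100, sigma=0.3):
    """Sketch of DELM-style document expansion: build pseudo term counts
    for a virtual document d' by mixing the original counts with counts
    from the top-M closest neighbors, each weighted by a Gaussian-shaped
    confidence value (illustrative form, not the paper's exact formula)."""
    d = Counter(doc)
    # Keep only the top-M closest neighbors for efficiency (slide 17).
    neighbors = sorted(
        (Counter(b) for b in collection if b is not doc),
        key=lambda b: cosine(d, b),
        reverse=True,
    )[:top_m]
    virtual = Counter(d)
    for b in neighbors:
        gamma = math.exp(-((1.0 - cosine(d, b)) ** 2) / (2 * sigma ** 2))
        for w, c in b.items():
            # alpha balances original counts against neighbor counts (slide 16).
            virtual[w] += (1 - alpha) / alpha * gamma * c if alpha else gamma * c
    return virtual
```

Any interpolation-based smoothing (JM or Dirichlet) can then be applied to these virtual counts as if they were the original document's counts.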
18. Experimental data:
    - For performance comparison: four TREC data sets: AP (Associated Press news, 1988-90), LA (LA Times), WSJ (Wall Street Journal, 1987-92), and SJMN (San Jose Mercury News, 1991).
    - For testing how the algorithm scales up: TREC8.
    - For testing the effect on short documents: DOE (Department of Energy).
19. Comparison of DELM + (Dirichlet/JM) with plain Dirichlet/JM: λ for JM and μ for Dirichlet are set to their optimal values, and the same λ or μ is used for DELM without further tuning; M is 100 and α is 0.5 for DELM. DELM outperforms JM and Dirichlet on every data set, with improvements as high as 15% on the Associated Press news (AP) set.
20. Precision-recall curve on the AP data: precision is compared at different levels of recall, and DELM + Dirichlet outperforms Dirichlet at every precision point.
21. Performance trend with respect to M (the top M closest neighbors for each document). Conclusion: neighborhood information improves retrieval accuracy, and performance becomes insensitive to M once M is sufficiently large.
22. Comparison of DELM + Dirichlet with CBDM: DELM + Dirichlet outperforms CBDM in MAP on all four data sets.
23. Documents in AP88-89 were shrunk to 30% of their original length in the first experiment, 50% in the second, and 70% in the third. The results show that DELM helps shorter documents more than longer ones (41% improvement on the 30%-length corpus versus 16% on the full-length corpus).
24. Performance change with respect to α: the optimal point migrates as documents become shorter (the 100%-length corpus is optimal at α = 0.4, but the 30%-length corpus has to use α = 0.2).
25. Combination of DELM with pseudo feedback: DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a). The experiment: retrieve documents with the DELM method, choose the top five documents for model-based feedback, then use the expanded query model to retrieve documents again. Result: DELM can be combined with pseudo feedback to further improve performance.
26. References:
    - http://sifaka.cs.uiuc.edu/czhai/pub/hlt06-exp.pdf
    - http://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
    - http://krisztianbalog.com/files/sigir2008-csiro.pdf