Webs2008


  1. 1. Self-Similarity Metric for Index Pruning in Conceptual Vector Space Models. Dario Bonino, Fulvio Corno. Dipartimento di Automatica ed Informatica, Politecnico di Torino. dario.bonino@polito.it, http://elite.polito.it
  2. 2. Agenda: Introduction; Problem statement; Self-Similarity based pruning; Experimental results; Conclusion. (2008-09-02, Self-Similarity metric for Index Pruning - WebS 2008 @ DEXA)
  3. 3. Semantic IR. New-generation search tools exploit conceptual information. Many techniques are in use: logic and reasoning, annotation, Natural Language Processing, Latent Semantic Indexing. Research is still open, but some convergences are emerging: several researchers independently chose to work on Conceptual Vector Space Models (C-VSM).
  4. 4. C-VSM vs VSM: differences. C-VSM: document features are concepts; vector components are related to the strength of association to a concept. VSM: document features are words; vector components are related to word frequency.
  5. 5. Index pruning. Commonalities between C-VSM and VSM: very similar models and data structures; both need large indexes. Reducing the index size (ideally) improves search efficiency; this operation is called Index Pruning. Index Pruning can be on-line (applicable in parallel to indexing, works on single documents) or off-line (runs during idle time, rebuilds the whole index).
  6. 6. Objectives. Long-term goal: to analyze storage and pruning techniques for C-VSM indexes. Current objective: on-line pruning, i.e. index pruning based on document-local information; design of a Self-Similarity metric for index pruning; implementation of a simple index pruning algorithm based on the Self-Similarity metric.
  8. 8. C-VSM: a formal definition. $C\text{-}VSM = \langle C, D, A \rangle$, where $C$ is the set of concepts of a domain ontology, $D$ is the set of documents, and $A$ is the set of annotations. Annotations: $A \subseteq D \times C \times \mathbb{R}^{+}$. Each annotation $a \in A$, $a = \langle d, c, w \rangle$, associates a document $d$ to a concept $c$ with a weight $w$.
  9. 9. Documents in C-VSM. In C-VSM a document is represented by a vector whose components are the weights $w_i$ of annotations toward domain concepts: $V(d) = (w_1, w_2, w_3, \dots, w_{|C|})$, where $w_i = \{ w : \langle d, c_i, w \rangle \in A \}$. (Figure: a document $d_i$ plotted in the concept space $c_1, c_2, c_3$ with coordinates $w_1, w_2, w_3$.)
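As an illustration of the definition above, the following minimal Python sketch builds V(d) from a list of annotation triples. The function name `document_vector`, the sample identifiers, and the choice of 0.0 for concepts without an annotation are assumptions made for the example; the slides do not specify them.

```python
from typing import List, Tuple

# An annotation <d, c, w>: (document id, concept id, weight).
Annotation = Tuple[str, str, float]


def document_vector(doc_id: str,
                    concepts: List[str],
                    annotations: List[Annotation]) -> List[float]:
    """Return V(d): one weight per concept, in the order given by `concepts`.

    Assumptions (not stated in the slides): at most one annotation per
    (document, concept) pair, and a weight of 0.0 when no annotation exists.
    """
    weights = {c: w for d, c, w in annotations if d == doc_id}
    return [weights.get(c, 0.0) for c in concepts]


# Hypothetical toy data: three concepts, two documents.
concepts = ["c1", "c2", "c3"]
annotations = [("d1", "c1", 0.8), ("d1", "c3", 0.3), ("d2", "c2", 0.5)]
vector_d1 = document_vector("d1", concepts, annotations)
```

Storing only the non-zero weights (a sparse map instead of a dense list) would be the natural choice for a real index; the dense list is used here only to mirror the notation of the slide.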
  10. 10. Self-similarity metric. Defined as the cosine similarity between the original document vector $d$ and its pruned version $d'$: $S(V(d), V(d')) = \cos(V(d), V(d')) = \frac{V(d) \cdot V(d')}{|V(d)|\,|V(d')|}$. (Figure: $d$ and $d'$ in the concept space $c_1, c_2, c_3$, separated by the angle $\alpha$.)
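The metric can be sketched in a few lines of Python. The function name `self_similarity` and the convention of returning 0.0 when either vector is all zeros are assumptions for the example, not part of the slides.

```python
import math
from typing import List


def self_similarity(original: List[float], pruned: List[float]) -> float:
    """Cosine similarity between a document vector and its pruned version."""
    dot = sum(a * b for a, b in zip(original, pruned))
    norm_original = math.sqrt(sum(a * a for a in original))
    norm_pruned = math.sqrt(sum(b * b for b in pruned))
    if norm_original == 0.0 or norm_pruned == 0.0:
        # Assumed convention: the cosine is undefined for a zero vector.
        return 0.0
    return dot / (norm_original * norm_pruned)


v = [3.0, 4.0, 0.0]
v_pruned = [0.0, 4.0, 0.0]   # first component removed
s = self_similarity(v, v_pruned)   # 16 / (5 * 4)
```

When the pruned vector is obtained only by zeroing components of the original, the dot product equals $|V(d')|^2$, so the metric reduces to $|V(d')| / |V(d)|$; in the example this is 4 / 5.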
  12. 12. Self-similarity pruning: general definition. Given a document d represented by its vector V(d), find a new representation V(d') such that (1) |V(d')| < |V(d)|, and (2) for any query q, the difference |S(V(d),V(q)) - S(V(d'),V(q))| is minimal.
  13. 13. Greedy algorithm. Self-similarity prune (V(d), τ), where τ = self-similarity threshold:
          V(d') = V(d)
          while (S(V(d), V(d')) >= τ) {
              i = argmin(V(d')i) over the non-zero components   // find the lowest weight
              V(d')i = 0                                         // delete annotation
          }
          return V(d')
      (Figure: d and its pruned version d' in the concept space c1, c2, c3, with weights w1, w2, w3.)
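A runnable Python sketch of the greedy procedure follows. It is one possible reading of the slide, not the authors' implementation. Two choices are assumptions: the deletion is tested on a candidate copy before being committed, so the returned vector never falls below τ (the loop on the slide deletes first and tests afterwards), and at least one non-zero weight is always kept so that the cosine stays defined. Weights are assumed to be non-negative.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def self_similarity_prune(vector: List[float], tau: float) -> List[float]:
    """Greedily zero the smallest non-zero weights while S(V(d), V(d')) >= tau."""
    pruned = list(vector)
    while True:
        nonzero = [i for i, w in enumerate(pruned) if w != 0.0]
        if len(nonzero) <= 1:
            break  # assumption: never prune the last remaining annotation
        lowest = min(nonzero, key=lambda i: pruned[i])  # lowest weight
        candidate = list(pruned)
        candidate[lowest] = 0.0                         # delete annotation
        if cosine(vector, candidate) < tau:
            break  # this deletion would violate the threshold
        pruned = candidate
    return pruned


# Hypothetical document vector and threshold.
vector = [0.1, 0.2, 0.9, 0.4]
pruned = self_similarity_prune(vector, tau=0.9)
```

With these values the two smallest weights are removed (self-similarity about 0.975), while removing 0.4 as well would drop the self-similarity to about 0.89, below the threshold.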
  15. 15. Metrics (1/2): ranking similarity. Measures the similarity of the search results obtained using the ranking $r_o$ deriving from the original index and the ranking $r_p$ deriving from the pruned index. The simplest and most widely used metric is the Symmetric Difference Score (at the top $k$ results): $r_o \,\Delta\, r_p = (r_o - r_p) \cup (r_p - r_o)$, $R(r_o, r_p) = 1 - \frac{|r_o \,\Delta\, r_p|}{2k}$. $R = 1$ means a perfect match, $R = 0$ means no match.
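The Symmetric Difference Score can be sketched with Python sets. The function name `ranking_similarity` and the sample document identifiers are hypothetical; the sketch assumes each ranking contains at least k distinct results, so that the 2k normalisation spans the full range from 0 to 1.

```python
from typing import List


def ranking_similarity(original_ranking: List[str],
                       pruned_ranking: List[str],
                       k: int) -> float:
    """R(r_o, r_p) = 1 - |r_o symmetric-difference r_p| / (2k) on the top k results."""
    top_original = set(original_ranking[:k])
    top_pruned = set(pruned_ranking[:k])
    # '^' is the set symmetric difference: (r_o - r_p) union (r_p - r_o).
    return 1.0 - len(top_original ^ top_pruned) / (2 * k)


# Hypothetical rankings returned by the original and the pruned index.
r_o = ["d1", "d2", "d3", "d4"]
r_p = ["d1", "d3", "d5", "d2"]
r = ranking_similarity(r_o, r_p, k=3)   # top-3 sets differ in d2 and d5
```

Note that the score compares the top-k results as sets, so it ignores the order of the documents inside the top k.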
  16. 16. Metrics (2/2): compression ratio. Measures the amount of pruning achieved by a given compression algorithm: $CR = \frac{|prunedEntries|}{|originalEntries|}$.
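A small Python sketch of the ratio for a single document vector is given below. The slide does not say whether prunedEntries counts the entries removed or the entries that survive; this sketch assumes it counts the removed entries, which matches the wording "amount of pruning achieved". The function name `compression_ratio` is hypothetical.

```python
from typing import List


def compression_ratio(original: List[float], pruned: List[float]) -> float:
    """Fraction of the original non-zero entries removed by pruning.

    Assumption: prunedEntries = number of entries removed (not remaining).
    """
    original_entries = sum(1 for w in original if w != 0.0)
    remaining_entries = sum(1 for w in pruned if w != 0.0)
    if original_entries == 0:
        return 0.0  # nothing to prune
    return (original_entries - remaining_entries) / original_entries


# Hypothetical vectors: two of the four annotations were pruned.
cr = compression_ratio([0.1, 0.2, 0.9, 0.4], [0.0, 0.0, 0.9, 0.4])
```

For a whole index, the same ratio would be computed on entry counts summed over all documents.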
  17. 17. Experimental setting (1/2). Semantic IR system: H-DOSE, http://dose.sourceforge.net; uses a C-VSM; shallow indexing based on a bag-of-words technique. Document test sets: Sider, a subset of the e-Class ontology on siderurgy (677 concepts); 250 documents gathered from the web and manually classified; 12 queries; available on request (mail to dario.bonino@polito.it).
  18. 18. Experimental setting (2/2). Document test sets (continued): Passepartout, an ontology on disabilities developed in collaboration with the municipality of Turin (181 concepts, 20 different relations); documents: all news and documents published on the Passepartout web site from 2004 to 2006 (around 2400 pages); 12 queries; available on request (mail to dario.bonino@polito.it), ontology in Italian.
  19. 19. CR vs Self-similarity (plot). τ = self-similarity threshold. The plot is limited to τ > 60% (for lower values R becomes too low).
  20. 20. Ranking Similarity - Sider (plot: ranking similarity vs compression ratio).
  21. 21. Ranking Similarity - Passepartout (plot: ranking similarity vs compression ratio).
  22. 22. Query time vs pruning (plots for the Passepartout and Sider test sets).
  23. 23. Discussion (1/2). Sider: quite controlled and small; smoother behavior; quite satisfying performance (80% similarity @ 30% pruning). Passepartout: medium-sized, captured “in the wild”; complex behavior; fair performance (65% similarity @ 20% pruning).
  24. 24. Open Issues. Test sets: small and relatively custom; few or no standard test sets are available for Semantic IR systems; we are working on CNN news + KIM ontology and on the Acquis corpus + Eurovoc. Semantic IR system: quite simple indexing technique; sensitive to the composition of the bag of words.
  26. 26. Conclusions. Index pruning is expected to become a critical issue for Semantic IR systems (as is already the case for traditional IR). Self-similarity pruning can be applied on-line and reaches relatively good performance. On-line pruning does not prevent off-line pruning, which may lead to better results. Experimentation on bigger and less controlled datasets is needed (however, there is a significant lack of test data). Porting traditional pruning algorithms to Semantic IR systems should be investigated.
  27. 27. Thank you! Questions? Dario Bonino - dario.bonino@polito.it
