Maria daniele

281 views
Extending ranking with interword spacing analysis

Published in: Technology, Spiritual

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
281
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Extending ranking with interword spacing analysis. Maria Carmela Daniele, Claudio Carpineto and Andrea Bernardini
  • 2. Overview
    I. Word weighting based on interword spacing: σp
    II. Extension of the quantistic weight through corpora analysis: σ*
    III. Application of σ* to ranking
    IV. Experiments
    V. Selective application of quantistic and frequentistic metrics based on: a) document length; b) query hardness
  • 3. Word weighting based on spacing between term occurrences: σp
    - A research branch that has evolved over the last decade.
    - Follows studies by Ortuño et al. (2002) on the energy-level statistics of disordered quantum systems.
    - Keyword extraction based on the distances between a term's occurrences in a document, independent of any term-frequency analysis of the document.
    - Let's see it in more detail…
  • 4. Reference scenario
    - As in a quantum system, terms in a document are subject to attraction/repulsion phenomena, which are stronger for relevant terms than for common words.
    - Reference document: Charles Darwin's "On the Origin of Species"
    - In practice: relevant words tend to cluster in documents (e.g. "INSTINCT"), while common words like "THE" are distributed uniformly.
  • 5. Definition of σp
    - A weighting method based on the probability distribution of the distances between a term's occurrences.
    - The method is characterized by the standard deviation of those distances. Example:
        A great scientist must be a good teacher and a good researcher
        1   2     3         4    5  6 7    8       9   10 11   12
    - For the term "a" we get X = {1, 6, 10} and, counting the document start and end as boundaries, D = {0, 5, 4, 2} (dᵢ = xᵢ₊₁ − xᵢ), with:
        s = sqrt( (1/(n−1)) · Σᵢ (dᵢ − μ)² )
    - Normalizing with respect to the mean value: σp = s / μ
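The σp computation above can be sketched in Python. The boundary convention (treating the document start and end as extra delimiters, which reproduces the slide's D = {0, 5, 4, 2}) and the function name are assumptions for illustration, not taken from the original slides:

```python
import math

def sigma_p(positions, n_words):
    """Normalized std-dev of the spacings between a term's occurrences.

    positions: sorted 1-based word positions of the term in the document.
    n_words:   total number of words in the document.
    """
    # Spacings between consecutive occurrences, with the document start
    # and end acting as extra boundaries (assumption; yields D = {0,5,4,2}
    # for the slide's example sentence).
    d = [positions[0] - 1]
    d += [b - a for a, b in zip(positions, positions[1:])]
    d.append(n_words - positions[-1])

    mu = sum(d) / len(d)                      # mean spacing
    s = math.sqrt(sum((x - mu) ** 2 for x in d) / (len(d) - 1))
    return s / mu                             # normalize by the mean
```

For the slide's sentence, `sigma_p([1, 6, 10], 12)` evaluates the spacings D = [0, 5, 4, 2] with μ = 2.75.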
  • 6. Extension of quantistic weighting through corpora analysis: σ*
    - We propose to modify the original metric with a factor σf based on the variance of term frequencies across the collection (Salton 1975). The factor σf is analogous to σp and has a twofold goal:
      1. Penalize rare words, which can often be seen as 'noise' in real document collections but tend to be overestimated by σp;
      2. Reward words that make it possible to better discriminate a document from the rest of the collection, a feature lacking in quantistic weighting.
    - with σf(w) = sqrt( (1/N_D) · Σᵢ (fᵢ(w) − μ_f)² ), where fᵢ(w) is the frequency of w in document i of the N_D documents in the collection.
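A minimal sketch of the σf factor, assuming it is the population standard deviation of the term's per-document frequencies fᵢ(w) over the N_D documents; the exact normalization is not fully recoverable from the slide, so treat this as one plausible reading:

```python
import math

def sigma_f(term, docs):
    """Std-dev of a term's frequency across a collection of documents.

    docs: list of documents, each a list of tokens.
    """
    nd = len(docs)                            # N_D, the collection size
    freqs = [doc.count(term) for doc in docs] # f_i(w) for each document
    mu_f = sum(freqs) / nd                    # mean frequency
    return math.sqrt(sum((f - mu_f) ** 2 for f in freqs) / nd)
```

A term with uniform frequency everywhere gets σf = 0, while a term concentrated in few documents gets a large σf, matching the goal of rewarding discriminative words.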
  • 7. Comparison between quantistic and frequentistic metrics
    - Tf-Idf (with and without stop words) for the frequency-based metric
    - σp and σ* for the quantistic weighting
    - Reference document: the King James Bible
    - To compute Idf and σf, which require a collection, we use the TREC WT10g collection

      Rank  Tf-Idf  Tf-Idf*   σp         σ*
      1     unto    lord      jesus      jesus
      2     shall   god       christ     saul
      3     lord    absalom   paul       absalom
      4     thou    son       peter      jephthah
      5     thy     king      disciples  jubile
      6     thee    behold    faith      ascendeteh
      7     him     man       john       abimelech
      8     god     judah     david      elias
      9     his     land      saul       joab
      10    hath    men       gospel     haman
  • 8. Application of σ* to ranking (1)
    - Using the σ* metric it is possible to rank a collection of documents against a query q.
    - Given the complementary strengths of the quantistic and frequentistic weighting metrics, we would like to combine the two.
  • 9. Application of σ* to ranking (2)
    - The combined metric is obtained through a linear combination of Okapi BM25 and σ*.
    - A prerequisite for the linear combination is that the scores lie in a similar range, so the scores are first normalized and then combined linearly.
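The slide leaves the normalization and combination formulas to images that did not survive extraction, so this sketch assumes a common choice: min-max normalization of each ranker's scores followed by an α-weighted linear combination. Function names and the smoothing of documents missing from one ranking are illustrative assumptions:

```python
def min_max(scores):
    """Rescale a {doc_id: score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def combined_score(bm25, sigma_star, alpha=0.8):
    """alpha * norm(BM25) + (1 - alpha) * norm(sigma*), per document.

    alpha = 1 reduces to BM25 alone, alpha = 0 to sigma* alone.
    """
    b = min_max(bm25)
    s = min_max(sigma_star)
    # Documents absent from the sigma* ranking contribute 0 (assumption).
    return {d: alpha * b[d] + (1 - alpha) * s.get(d, 0.0) for d in b}
```

With α = 0.8 (the best-performing region on WT10g in the experiments below), BM25 dominates but σ* still reorders near-ties.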
  • 10. Experiments (1)
    - Collections: Web Track (about 1,690,000 documents); Robust Track (more than 500,000 documents)
    - Evaluation measure: MAP (mean average precision)
    - Lucene with the BM25 extension created by Pérez-Iglesias
  • 11. Experiments (2)
    - The quantistic metric alone does not work well:

      Collection  Topics            BM25   σ*     BM25+σ*
      WT10g       501-550           0.143  0.057  0.153
      Robust      301-450, 601-700  0.195  0.089  0.203

    - Experiments with the combined quantistic method significantly improve the performance of classical IR methods.
    - We let the α parameter vary in the range [0,1]: the two extreme points coincide with BM25 and σ* respectively.
    - The results suggest that the method is sufficiently robust, since there is a range of values in which the combined method performs well.

      α             1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1    0
      MAP (WT10g)   .1436  .1469  .1537  .1535  .1501  .1379  .1222  .096   .0819  .0679  .0547
      MAP (Robust)  .1954  .2033  .2031  .1983  .1673  .1549  .1428  .1203  .1075  .9674  .0898
  • 12. Query-by-query analysis
    [Chart: per-query average precision (AvP) for BM25, σ*, and BM25+σ* over the 50 queries]
  • 13. Selective application of quantistic and frequentistic techniques
    1. Rely on predictors of query difficulty to choose which metric to use (rationale: the quantistic method should be better on difficult queries)
    2. Rely on document length to choose which metric to use (rationale: the quantistic method should be better for long documents)
  • 14. Query hardness (1)
    - We used two well-known query difficulty predictors:
      - Simplified Clarity Score (SCS)
      - σ1
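As a rough illustration, the Simplified Clarity Score can be computed as the KL divergence between the query's maximum-likelihood language model and the collection model. The handling of terms unseen in the collection below is an assumption added to keep the sketch runnable:

```python
import math

def scs(query_terms, coll_freq, coll_tokens):
    """Simplified Clarity Score of a query.

    query_terms: list of query tokens.
    coll_freq:   {term: occurrence count in the collection}.
    coll_tokens: total number of tokens in the collection.
    """
    n = len(query_terms)
    score = 0.0
    for w in set(query_terms):
        p_q = query_terms.count(w) / n               # P(w | query)
        p_c = coll_freq.get(w, 1) / coll_tokens      # P(w | collection); count-1 fallback is an assumption
        score += p_q * math.log2(p_q / p_c)
    return score
```

A high SCS means the query's term distribution diverges strongly from the collection's, i.e. the query is specific and predicted to be easy; a low SCS flags a hard query.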
  • 15. Query hardness (2)
    [Charts: MAP of BM25 and σ* as a function of predictor value; WT10g with the σ1 predictor, Robust with the SCS predictor. X-axis: predictor value; y-axis: MAP.]
  • 16. Document length (1)
    - Why use document length? Because the quantistic method works better with long texts:

                              BM25  σ*
      Relevant retrieved      1544  3729
      Relevant NOT retrieved  4239  2115
  • 17. Document length (2)
    [Chart, WT10g collection: cumulative percentage of relevant documents (retrieved in blue, not retrieved in red) for σ* and BM25. X-axis: document length in number of words.]
  • 18. Conclusions on the selective application of frequentistic and quantistic weighting
    - Query hardness did not work.
    - Using document length was more promising.
  • 19. Conclusions and future work
    - Definition of an extended quantistic weighting method through corpora analysis.
    - Integration of quantistic and frequentistic ranking methods.
    - A linear combination showed a significant performance improvement over the classical frequentistic method.
    - Selective application: query hardness was not useful; document length was useful.
    - The method could be applied to other Information Retrieval tasks, e.g.:
      - Document summarization: creating a short version of a text
      - Query expansion: expanding the query phrase (e.g. with synonyms)
      - Search result clustering: grouping results into clusters
  • 20. Thanks for listening! Questions?