Raffaele Perego "Efficient Query Suggestions in the Long Tail"
    Presentation Transcript

    • Efficient Query Suggestions in the Long Tail • Joint work: R. Perego, F. Silvestri, H. Vahabi, R. Venturini (HPC Lab, Italy); F. Bonchi (Yahoo! Research, Spain) • Wednesday, August 24, 2011
    • Query suggestion practices • Use of the Wisdom of the Crowd mined from Query Logs to recommend related queries that are likely to better specify the information need of the user • shorten the length of user sessions • enhance perceived quality of experience (QoE)
    • Queries in the Head
    • Queries in the Long Tail • Rare and never-seen queries account for more than 50% of the traffic!
    • Open issues • Sparsity of models: query assistance services perform poorly or are not even triggered on long-tail queries • Performance: on-line process going in parallel with query answering • [Figure: query popularity, with queries ordered by popularity]
    • State of the art (SoA): Query Flow Graph • Query-centric approach • Suggest queries by computing Random Walks with Restart (RWRs) on the query-flow graph (QFG), starting from the current user query • P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, S. Vigna: The query-flow graph: model and applications. CIKM 2008: 609-618 • P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, S. Vigna: Query suggestions using query-flow graphs. WSCD 2009
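A Random Walk with Restart on a query-flow graph can be sketched with plain power iteration. The snippet below is a minimal illustration, not the implementation from the cited papers: the toy graph, its edge weights, and the function name are made up, and `alpha` is assumed to be the probability of following an outgoing edge (the walk jumps back to the start query with probability `1 - alpha`; the slides do not spell out this convention).

```python
def random_walk_with_restart(graph, start, alpha=0.9, iterations=100):
    """Approximate the stationary distribution of a walk restarting at `start`.

    `graph` maps each node to a dict {successor: edge weight}.
    Assumption: `alpha` is the probability of following an edge.
    """
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    scores = {node: 0.0 for node in nodes}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node, mass in scores.items():
            successors = graph.get(node, {})
            total = sum(successors.values())
            if total == 0:
                # Dangling node: send all of its mass back to the start.
                updated[start] += mass
                continue
            updated[start] += (1 - alpha) * mass
            for successor, weight in successors.items():
                updated[successor] += alpha * mass * weight / total
        scores = updated
    return scores


# Hypothetical query-flow graph: edge weights stand for reformulation counts.
qfg = {
    "apple": {"apple ipod": 3, "apple store": 1},
    "apple ipod": {"ipod nano": 1},
    "apple store": {"apple": 1},
    "ipod nano": {},
}
scores = random_walk_with_restart(qfg, "apple")
suggestions = sorted((q for q in scores if q != "apple"),
                     key=scores.get, reverse=True)
print(suggestions)
```

The walk can only start from a query that is already a node of the graph, which is why this query-centric scheme has nothing to offer for never-seen queries.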
    • Query-centric suggestions • Computing RWRs on a huge graph, e.g., built from a query log (QL) recording 580,797,850 queries (from Y! us): • |V| = 28,763,637 • |E| = 56,250,874 • |{q: f(q)=1}| = 162,221,967 (28%)
    • Term-centric opportunities • But, in the same Y! QL: • queries: 580,797,850 • term occurrences: 1,343,988,549 • |{t: f(t)=1}| = 5,099,145 (0.04%)
    • The TQGraph • [Figure: example TQGraph with term nodes linked to query nodes] • Term nodes are added to the QFG; they have only outgoing links, pointing at the query nodes corresponding to the queries in which the terms occur.
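The construction described on this slide can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, the `("term", t)` tuple used to keep term nodes apart from single-word queries, and the uniform weight of 1.0 on term-to-query edges are all choices made here, not details given in the slides.

```python
def build_tqgraph(qfg):
    """Return a copy of the query-flow graph extended with term nodes.

    Term nodes get only outgoing edges, pointing at every query node
    whose query contains the term.
    """
    tqgraph = {query: dict(successors) for query, successors in qfg.items()}
    # Make sure queries that appear only as successors are nodes too.
    for successors in qfg.values():
        for query in successors:
            tqgraph.setdefault(query, {})
    term_edges = {}
    for query in tqgraph:
        for term in set(query.split()):
            # Assumption: uniform weight on term -> query edges.
            term_edges.setdefault(("term", term), {})[query] = 1.0
    tqgraph.update(term_edges)
    return tqgraph


# Hypothetical query-flow graph.
qfg = {
    "apple": {"apple ipod": 3, "apple store": 1},
    "apple ipod": {"ipod nano": 1},
    "apple store": {"apple": 1},
}
tqgraph = build_tqgraph(qfg)
print(sorted(tqgraph[("term", "apple")]))
```

Because no edge ever points at a term node, a walk that starts from a term can only come back to it through a restart.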
    • TQGraph model • Suggestions for an incoming query q composed of terms {t1,...,tm} ⊆ T are generated from G by extracting the center-piece subgraph starting from the nodes corresponding to terms t1,...,tm • perform m Random Walks with Restart, one from each of the m term nodes corresponding to the terms in q • multiply component-wise the resulting m stationary distributions
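The two steps above (one RWR per query term, then a component-wise product) can be sketched as below. The toy graph, the `("term", t)` node naming, the meaning of `alpha` as the probability of following an edge, and the decision to skip unseen terms are assumptions made for this illustration.

```python
def rwr(graph, start, alpha=0.9, iterations=50):
    """Power-iteration Random Walk with Restart (restart prob. 1 - alpha)."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    scores = {node: 0.0 for node in nodes}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node, mass in scores.items():
            successors = graph.get(node, {})
            total = sum(successors.values())
            if total == 0:
                updated[start] += mass  # dangling node: back to the start
                continue
            updated[start] += (1 - alpha) * mass
            for successor, weight in successors.items():
                updated[successor] += alpha * mass * weight / total
        scores = updated
    return scores


def suggest(tqgraph, query, k=5, alpha=0.9):
    """Multiply, query node by query node, the RWR scores of each term."""
    combined = None
    for term in set(query.split()):
        node = ("term", term)
        if node not in tqgraph:
            continue  # assumption: unseen terms are simply skipped
        dist = rwr(tqgraph, node, alpha)
        if combined is None:
            # Keep query nodes (strings) only; term nodes are tuples.
            combined = {n: s for n, s in dist.items() if isinstance(n, str)}
        else:
            combined = {n: combined[n] * dist[n] for n in combined}
    if not combined:
        return []
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return [(q, s) for q, s in ranked if s > 0][:k]


# Hypothetical TQGraph; "dog heat" itself is not a query node.
tqgraph = {
    ("term", "dog"): {"dog in heat symptoms": 1, "dog food": 1},
    ("term", "heat"): {"dog in heat symptoms": 1, "heat pump": 1},
    "dog in heat symptoms": {"heat cycle dog pads": 1},
    "heat cycle dog pads": {},
    "dog food": {},
    "heat pump": {},
}
result = suggest(tqgraph, "dog heat")
print([q for q, _ in result])
```

The product keeps only queries reachable from every term of the input, so in this toy example "dog food" and "heat pump" drop out even though each scores well for one term.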
    • TQG effectiveness • User study results comparing TQG and QFG effectiveness for two different testbeds (Y! US and MSN QLs).
      Table 2: Effectiveness of TQGraph-based recommendations on the two different sets of queries and query logs, by varying the restart parameter α.
        100 queries on Yahoo! | useful | somewhat | not useful
        α = 0.9 | 48% | 11% | 41%
        α = 0.5 | 41% | 20% | 39%
        α = 0.1 | 37% | 20% | 43%
      Table 3: User study results comparing effectiveness of our method with the baseline for the two different testbeds.
        TREC on MSN | useful | somewhat | not useful
        TQGraph α = 0.9 | 57% | 16% | 27%
        QFG | 50% | 9% | 42%
        100 queries on Yahoo! | useful | somewhat | not useful
        TQGraph α = 0.9 | 48% | 11% | 41%
        QFG | 23% | 10% | 67%
    • Effectiveness on rare queries • Anecdotal evidence
      [...]ing to outperform the effectiveness of QFG. The fact that we, instead, achieve such a remarkable result is, actually, an additional benefit of TQGraph-based methods.
      Anecdotal evidence. We next show a few examples of query recommendations. We start from queries that have never been observed in the query log, i.e., the most difficult [...] is one among the eight cases.
      Query not occurring in the training log: "lower heart rate", from the TREC testbed, which does not appear at all in the MSN query log. Top 5 recommendations:
        Suggested Query | Score
        things to lower heart rate | 2.9 e−14
        lower heart rate through exercise | 2.6 e−14
        accelerated heart rate and pregnant | 2.9 e−15
        web md | 2.0 e−16
        heart problems | 8.0 e−17
      We can observe that all the top-5 suggestions can be considered pertinent to the initial topic. Moreover, even if this is not an objective in this paper, they present some diversity: the first two are how-to queries, while the last three are queries related to finding information w.r.t. possible problems (with one very specific for pregnant women). The most interesting recommendation is probably "web md", which makes perfect sense and has a large edit distance from the original query.
      Query occurring twice in the training log: "dog heat", a rare (i.e., rarely appearing) query, which appears only twice in the MSN query log. Top 5 recommendations:
        Suggested Query | Score
        heat cycle dog pads | 4.3 e−10
        what happens when female dog is in heat & a male dog is around | 4.0 e−10
        boxer dog in heat | 3.99 e−10
        dog in heat symptoms | 3.98 e−10
        behavior of a male dog around a female dog in heat | 3.95 e−10
      As in the previous example, the top-5 suggestions are qualitatively good and present some diversity. Also, the TQGraph-based method returns long queries, thus likely to [...]
      Evaluation setup: [...] we pair TREC queries up with the model built on the MSN query log. In fact, the period from which TREC queries come is close to the period in which MSN queries were submitted. [...] generated the top-5 recommendations for each query using both the QFG and the TQGraph with different parameter settings. Using a web interface, each assessor was presented a random query followed by the list of all the different recommendations produced. Recommendations were presented shuffled, in order for the assessor to not be able to distinguish which system produced them. We give assessors the possibility to observe the search engine results for the original query and the recommended query that was being evaluated. The assessor was asked to rate a recommendation with one of the following scores: useful, somewhat useful, [...]
    • TQG pros • provides query suggestions of quality comparable to or better than QFG, even for rare and unique queries • several possible optimizations for achieving an efficient on-line query recommendation service
    • Indexing precomputed suggestions • [Figure: Inverted index representation of the RWRs computed on the TQGraph. The lexicon is made up of term nodes, postings are the stationary distribution values. Figure label: stationary distribution of query nodes in the TQGraph as obtained by a RWR from term t. Within buckets, queries are sorted by their IDs; scores are approximated by the greatest bound, i.e., [...]] • recommendations for an incoming query are computed by processing the posting lists associated with the terms in the query • O(|T|) posting lists • O(|Q|) length of each posting list
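Once the per-term stationary distributions are precomputed, answering a query reduces to combining the posting lists of its terms. The sketch below assumes a simple in-memory layout (a dict from term to a list of `(query, score)` pairs) and made-up scores; queries missing from any of the lists are treated as having score zero.

```python
def recommend(index, query, k=5):
    """Multiply the precomputed scores found in the posting lists of the terms."""
    postings = [index[term] for term in set(query.split()) if term in index]
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list
    scores = dict(postings[0])
    for posting_list in postings[1:]:
        other = dict(posting_list)
        # A query missing from one list has score zero and is dropped.
        scores = {q: s * other[q] for q, s in scores.items() if q in other}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical index: term -> [(suggested query, stationary probability)].
index = {
    "lower": [("things to lower heart rate", 0.30), ("lower back pain", 0.25)],
    "heart": [("things to lower heart rate", 0.20), ("heart problems", 0.15)],
    "rate": [("things to lower heart rate", 0.10), ("exchange rate", 0.40)],
}
top = recommend(index, "lower heart rate")
print(top)
```

No random walk is run at query time: the cost is dominated by reading m posting lists, which is what makes the later pruning, bucketing and caching steps worthwhile.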
    • Pruning posting lists • sort postings by probability and prune them at a reasonable threshold p, e.g. 20,000 • O(|T|) lists, each of size O(p) and no loss in quality!
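Pruning as described here keeps only the p most probable postings of each list. A minimal sketch, assuming the index is a dict from term to a list of `(query, probability)` pairs (a layout chosen for illustration) and using a tiny p instead of 20,000:

```python
import heapq


def prune_index(index, p):
    """Keep the p highest-probability postings of every list, best first."""
    return {
        term: heapq.nlargest(p, postings, key=lambda posting: posting[1])
        for term, postings in index.items()
    }


# Hypothetical posting lists with made-up probabilities.
index = {
    "dog": [("dog food", 0.05), ("dog in heat symptoms", 0.30),
            ("hot dog", 0.01), ("boxer dog in heat", 0.20)],
    "heat": [("heat pump", 0.10), ("dog in heat symptoms", 0.25)],
}
pruned = prune_index(index, p=2)
print(pruned["dog"])
```

Every list is now bounded by p entries regardless of how many queries the term can reach, which gives the O(|T|) lists of size O(p) stated on the slide.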
    • Bucketing probabilities • Most space used for storing probabilities • Given ε < 1, we can arrange postings in buckets implicitly coding the approximate probabilities • Within buckets, queries are sorted by their IDs; scores are approximated by the greatest bound, i.e., [...] • Each entry coded with a few bits, e.g., 11-19 bits • ~5x reduction! • no loss in quality! • [Figure: Inverted index representation of the RWRs computed on the TQGraph. The lexicon is made up of term nodes, postings are the stationary distribution values. Figure label: stationary distribution of query nodes in the TQGraph as obtained by a RWR from term t.]
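One way to read the bucketing idea is sketched below. The bucket boundaries are an assumption made for this illustration: bucket i is taken to hold probabilities in the interval (ε^(i+1), ε^i], so that every posting in it can be approximated by the upper bound ε^i without storing the probability itself. The slides only state that buckets implicitly code approximate probabilities.

```python
import math


def bucket_of(prob, eps):
    """Index i such that eps**(i + 1) < prob <= eps**i, for 0 < prob <= 1."""
    return max(0, math.floor(math.log(prob) / math.log(eps)))


def bucketize(postings, eps):
    """Group query IDs by bucket; IDs are sorted inside each bucket."""
    buckets = {}
    for query_id, prob in postings:
        buckets.setdefault(bucket_of(prob, eps), []).append(query_id)
    return {i: sorted(ids) for i, ids in sorted(buckets.items())}


def approximate_score(bucket_index, eps):
    """Upper bound of the bucket, used in place of the exact probability."""
    return eps ** bucket_index


# Hypothetical posting list: (query ID, probability).
postings = [(7, 0.3), (3, 0.26), (5, 0.2), (1, 0.9)]
buckets = bucketize(postings, eps=0.5)
print(buckets)
```

Only the bucket boundaries and the sorted query IDs need to be stored, and sorted IDs compress well with gap encoding, which is consistent with the few bits per entry reported on the slide.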
    • Caching posting lists • achieving in-memory query suggestion • [Figure 4: Miss ratio of our cache as a function of [...]]
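A cache of posting lists and its miss ratio can be illustrated with a small LRU cache. The replacement policy is an assumption: the slide does not say which policy is used, and LRU is picked here only because it is a common default. The class name, the `load` callback, and the term stream are hypothetical.

```python
from collections import OrderedDict


class PostingListCache:
    """LRU cache keyed by term; `load` fetches a posting list on a miss."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, term):
        if term in self.entries:
            self.entries.move_to_end(term)  # mark as most recently used
            self.hits += 1
            return self.entries[term]
        self.misses += 1
        posting_list = self.load(term)
        self.entries[term] = posting_list
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return posting_list

    def miss_ratio(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


# Hypothetical slow storage and term stream.
storage = {"dog": [(1, 0.3)], "heat": [(1, 0.25)], "cat": [(2, 0.4)]}
cache = PostingListCache(capacity=2, load=storage.__getitem__)
for term in ["dog", "heat", "dog", "heat", "cat", "dog"]:
    cache.get(term)
print(cache.hits, cache.misses)
```

Replaying a real stream of query terms through such a cache at different capacities would produce a miss-ratio curve of the kind the figure on this slide refers to.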
    • Conclusions • TQG model to overcome limitations of current query recommenders • based on a principled, term-centric approach supporting rare and never-seen queries • deployment with an efficient inverted index, resulting in effectiveness comparable to or better than SoA approaches • the pruning, bucketing, and caching techniques proposed constitute an independent contribution in the area of efficiency in large-scale RWR computations • reduction of about 80% in the space occupancy w.r.t. uncompressed data structures • in-memory RWRs on huge graphs with a cache hit ratio above 90%
    • Questions?