From Dango to Japanese Cakes: Query Reformulation Models and Patterns

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    From Dango to Japanese Cakes: Query Reformulation Models and Patterns - Presentation Transcript

    1. Francesco Bonchi, Carlos Castillo Yahoo! Research · Barcelona, Spain From “Dango” to “Japanese Cakes” Query Reformulation Models and Patterns Paolo Boldi, Sebastiano Vigna Università degli Studi · Milan, Italy
    2. (Physical) User session Basic concepts
      • A query log is a collection of user sessions
      • More information:
        • timestamps
        • views
        • clicks
      ebay autotrader brcelona barcelona barcelona f.c. barcelona hotels
    3. (Physical) User session Basic concepts Chain #1 Chain #2
      • For more information on how sessions can be segmented into chains (even intertwined), see:
        • A. Gionis, P. Boldi, F. Bonchi, C. Castillo, D. Donato, S. Vigna: Query-flow graphs (CIKM 2008)
      ebay autotrader brcelona barcelona barcelona f.c. barcelona hotels
    4. User chain Basic concepts brcelona barcelona barcelona f.c. barcelona hotels
    5. Basic concepts
      • Understanding query-reformulation type is the main goal of this paper
      User chain brcelona barcelona barcelona f.c. barcelona hotels correction specialization parallel move
    6. Related works
      • Lau & Horvitz, Patterns of search: analyzing and modeling web query refinement . UM'99
      • Rieh & Xie, Analysis of multiple query reformulations on the web: the interactive information retrieval context. Inf. Process. Manage. 2006
      • Jansen et al., Query modifications patterns during web searching , ITNG'07
    7. 1. Query-reformulation types 2. Building a classifier 3. Patterns 4. An application to query suggestion 5. Conclusions
    8. 1. Query-reformulation types 2. Building a classifier 3. Patterns 4. An application to query suggestion 5. Conclusions
    9. Rieh, S. Y. and Xie, H: “Analysis of multiple query reformulations on the web”. IPM 32 (3) 2006. Query-reformulation types barcelona cheap barcelona hotels barcelona hotels luxury barcelona hotels Specialize Generalize brcelona Correct barcelona f.c. Generalize Parallel move Specialize Specialize
    10. Reformulation types
      • Correction
        • startford cinema -> stratford cinema
      • Generalization (“zoom out”)
        • barcelona hotels -> barcelona
      • Specialization (“zoom in”)
        • barcelona soccer -> barcelona camp nou
      Zoom-in, zoom-out, pan, names comes from Y!SAMA Rieh and Xie: “Analysis of multiple query reformulations”. IPM 2006.
    11. Reformulation types
      • Rephrasing
        • wikipedia english -> english wikipedia
        • robbs celebrities -> robbs celebs
      • Parallel move
        • barcelona -> rome
      Rieh and Xie: “Analysis of multiple query reformulations”. IPM 2006.
    12. G S C P
    13. 1. Query-reformulation types 2. Building a classifier 3. Patterns 4. An application to query suggestion 5. Conclusions
    14. Query-reformulation classifier
    15. Classifier: 92% accuracy
    16. Example 1/4: “cat”
    17. Example 2/4: “peanut”
    18. Example 3/4: “surf board”
    19. Example 4/4: “bruce springsteen”
    20. 1. Query-reformulation types 2. Building a classifier 3. Patterns 4. An application to query suggestion 5. Conclusions
    21. Finding query-reformulation patterns
      • Sessions can be seen as strings of QRTs:
        • XSSX
        • XCGSSX
        • XCSGSGSGPX
      • Patterns can be identified in these strings
      • Data: UK and US query logs, early 2008
    22. Chain length distribution
    23. QRT distribution
    24. Interesting patterns
    25. General observations on QRTs
      • Parallel moves (50%-60%)
        • The most frequent class
      • Specializations (30%-40%)
      • Generalizations (5%-10%)
        • Frequently appear together in alternating order
      • Corrections (5%-10%)
        • More frequent at the beginning or end of a chain
    26. 1. Query-reformulation types 2. Building a classifier 3. Patterns 4. An application to query suggestion 5. Conclusions
    27. Query recommendation setting
      • Dataset: “Spring 2006 Microsoft data”
        • Provided by WSCD'09 workshop organizers; 15M queries from a 1-month period
      • Query-flow weighted graph + Reformulations types
        • (see “ Query-flow graphs ”, CIKM 2008)
    28. Implemented systems
      • Slices
          • Queryflow-{G, S, P, C, SP, GSP, GSPC, ...}
      • Composed graphs
          • Queryflow-{(SS T ), (SG), ...}
          • Weight is weight of heaviest length-2 path
      • Recommendations based on random walks
      • WebGraph (graph) SUX4J (hashing)
    29. Random walks
      • Number of steps: 1, 5, 10 (2, 6, 12)
      • Scoring: PPR ( Absolute ) or PPR/PR ( Relative )
      • Self-transition with prob. 0.9, no random jumps
    30. For comparison: click graphs
      • Weights based on clicks
      • Two weighting schemes:
      N. Craswell and M. Szummer: “Random walks on the click graph” SIGIR 2007. d 1 d 2 d 3 q 1 q 2 d 4 d 5 q 3 q 4
    31. Example: “sudoku”
      • sudoku
      • free sudoku
      • sudoku shack
      • daily sudoku
      • msn sudoku
      • rci
      • sudoku
      • sudoku #5 easy
      • brainy puzzle
      • sudoku game
      • SuDoKu
      • How to play Sudoku
      QueryFlow-S, 10, Abs. QueryDocument-Bwd, 6, Rel.
    32. Example: “myspace layouts”
      • myspace layouts
      • free myspace layouts
      • myspace contact tables
      • myspace codes
      • premade myspace layouts
      • myspace icons
      • myspace layouts
      • myspace codes + hide last login
      • myspace and layouts
      • dissect virtual frogs
      • Myspace Editors Cars
      • MYSPACE GENERATOR
      QueryFlow-S, 10, Abs. QueryDocument-Bwd, 6, Rel.
    33. Evaluation method
      • Sampled 114 queries (freq 700 - 15,000)
      • Run all systems for each query
      • Create a pool: union of top-5 recommendations
        • ~ 6,000 query recommendations in total
      • Evaluate each recommendation in the pool
        • Useful, somewhat useful, not useful
    34. Assessment
      • Example, query “cnn news”:
    35. Distribution of assessments n = 6 093
    36. Inter-assessor agreement n = 560
    37. Inter-assessor agreement n = 560 Good Bad
    38. Precision @ 5
    39. Precision @ 5
    40. Diversity score
      • Take one query
      • Get top-5 recommendations that are good
      • Get top-5 search results for each of them
      • Count how many distinct URLs appear
        • Discard URLs from initial query, this is in [0,25]
    41. Diversity scores
    42. 1. Query-reformulation types 2. Building a classifier 3. Patterns 4. An application to query suggestion 5. Conclusions
    43. Conclusions (1/3)
      • A definition of query reformulation types
        • Refine the categories depending on the application
      • An accurate classifier
      • Interesting typical patterns
    44. Conclusions (2/3)
      • Query reformulation types are useful
        • Fewer slices are better
        • Even more useful than clicks!
        • Few iterations are enough
    45. Conclusions (3/3) (history-based suggestions)
    46. Q&A barcelona cheap barcelona hotels barcelona hotels luxury barcelona hotels Specialize Generalize brcelona Correct barcelona f.c. Generalize Parallel move Specialize Specialize
    47. That's all folks
    48. User session Basic concepts Radlinksi & Joachims: “Query chains: learning to rank from implicit feedback”. KDD 2005. ebay autotrader used fox vw ryanair barcelona places barcelona rent barcelona Chain/Mission #1 Chain/Mission #2
    49. Whence come the chains? ... ebay autotrader used fox vw barcelona places barcelona rent barcelona barcelona soccer barcelona fc soccer
    50. Whence come the chains? Query-flow graph ... ebay autotrader used fox vw barcelona places barcelona rent barcelona barcelona soccer barcelona fc soccer
    51. Query-Flow Graph P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, S. Vigna: “The Query-Flow Graph”. CIKM 2008. ebay autotrader used fox vw barcelona hotel barcelona rent barcelona barcelona soccer barcelona fc soccer s t ... ...
    52. The query-flow graph
      • Directed graph
      • Built from a query log
      • Nodes are queries
    53. Frequencies of query transitions
    54. Frequencies of query transitions
    55. Transitions are non-symmetrical 93% of the time 93% of the time 7% of the time And when it happens, the frequencies are not correlated barclona barcelona barcelona rent barcelona
    56. The query-flow graph
      • Directed graph
      • Built from sessions/chains
      • Nodes are queries
      • Edges hold information, e.g.:
        • Aggregate features from query transitions
        • Weights derived from those features
    57. Statistics on query transitions
      • Data for (q,q')
        • Frequency
          • Relative to other transitions from q
        • Bags of terms and character n-grams
          • Cosine similarity, Jaccard coefficient, ...
        • Average time
        • Average position in session
        • ...
      Aggregate multiple users R. Jones & K. Klinkner “Beyond session timeout”. CIKM 2008. Radlinksi & Joachims: “Query chains: learning to rank from implicit feedback”. KDD 2005. barcelona places barcelona rent barcelona
    58. Our query-flow graph
      • Directed graph
      • Built from sessions/chains
      • Nodes are queries
      • Edges hold information, e.g.:
        • Aggregate features from query transitions
        • Weights derived from those features
    59. Weighting schemes
      • Interested in weights such as:
        • w(q,q') = freq(q,q') / freq(q)
              • -or-
        • w(q,q') = chain(q,q')
    60. Weighting schemes
      • Interested in weights such as:
        • w(q,q') = freq(q,q') / freq(q)
              • -or-
        • w(q,q') = chain(q,q')
      • freq(q,q') is the frequency of the transition
      • chain(q,q') is the chaining probability
        • Pr(q, q' in same chain | q, q' in same session)
    61. Example with observed transition probabilities
    62. 1. Building a query-flow graph 2. Application to session segmentation 3. Query-reformulation types 4. Application to query recommendation 5. Conclusions
    63. Weighting schemes
      • Interested in weights such as:
        • w(q,q') = freq(q,q') / freq(q)
              • -or-
        • w(q,q') = chain(q,q')
      • freq(q,q') is the frequency of the transition
      • chain(q,q') is the chaining probability
        • Pr(q, q' in same chain | q, q' in same session)
    64. Estimating chain(q,q')
      • Data for (q,q')
        • Frequency
        • N-grams and terms similarity
        • Average delta time
        • Average position in session
        • Clicks between q and q'
        • ...
      R. Jones & K. Klinkner “Beyond session timeout”. CIKM 2008. barcelona places barcelona rent barcelona
    65. Estimating chain(q,q')
      • Data for (q,q')
        • Frequency
        • N-grams and terms similarity
        • Average delta time
        • Average position in session
        • Clicks between q and q'
        • ...
      • Training labels
        • 5,000 consecutive query pairs
        • Manually labelled
        • 2/3 are same-chain
        • +
      barcelona places barcelona rent barcelona
    66. Estimating chain(q,q')
      • Data for (q,q')
        • Frequency
        • N-grams and terms similarity
        • Average delta time
        • Average position in session
        • Clicks between q and q'
        • ...
      freq(q,q') = 1: different-chain if α time - β sim.ngrams.jaccard - γ sim.terms.jaccard + δ > θ same-chain otherwise freq(q,q') > 1: C5.0 rules, 8 rules in total
        • + Training labels
      barcelona places barcelona rent barcelona
    67. There are intertwined chains (photos: commons.wikimedia.org) pointui forum audi ipswich golfers elbow cox ipswich
    68. Efficient session breaking
      • 1. sort the session
      • 2. break into contiguous chains
    69. Sorting: ATSP formulation pointui forum audi ipswich golfers elbow cox ipswich
    70. ATSP formulation pointui forum audi ipswich golfers elbow cox ipswich
    71. ATSP formulation Weights given by chain(q,q') pointui forum audi ipswich golfers elbow cox ipswich
    72. ATSP solution Maximize Π chain(q i ,q i+1 ) s.t. TSP constraints Greedy algorithm with d-steps lookahead pointui forum audi ipswich golfers elbow cox ipswich
    73. ATSP solution pointui forum audi ipswich golfers elbow cox ipswich
    74. Breaking using a threshold pointui forum audi ipswich golfers elbow cox ipswich
    75. Breaking using a threshold Chain #1 Chain #2 pointui forum audi ipswich golfers elbow cox ipswich
    76. Testing set
      • Queries from 1,458 users from UK
        • Collected queries over 3 months
        • Selected users with >1 query
        • Partitioned into chains manually
      • 415 (28%) were single-chain
      • 3.6 chains per user on average
    77. Evaluation wrt ground truth
      • Sorting strategy
        • Rand index of omniscient breaking strategy
      • Breaking strategy
        • Rand index
        • Rand index = pair-wise agreements / ( )
      n 2
    78. Evaluation of reordering strategy
      • Average over all users
    79. Evaluation of reordering + breaking
      • Baseline: 30 min. timeout
    80. Evaluation (cont.) Baseline: 30' timeout Ours Better rand index for the “difficult” chains. Ours and baseline score 0.85 and 0.71 if we exclude the cases where baseline has score of 1
    81. Session segmentation remarks
      • Simple time-outs OK ... but we can do much better
      • Global information in the query-flow graph can help
    82. 1. Building a query-flow graph 2. Application to session segmentation 3. Query-reformulation types 4. Application to query recommendation 5. Conclusions
    83. (Physical) User session Basic concepts ebay autotrader brcelona barcelona barcelona f.c. barcelona hotels

    + Carlos CastilloCarlos Castillo, 2 months ago

    custom

    213 views, 0 favs, 0 embeds more stats

    Paolo Boldi, Francesco Bonchi, Carlos Castillo and more

    More info about this document

    CC Attribution License

    Go to text version

    • Total Views 213
      • 213 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 7
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories

    Tags