Identifying Task-based Sessions in Search Engine Query Logs
Gabriele Tolomei
Ca’ Foscari University of Venice
ISTI-CNR, Pisa, Italy
Query Features
Content-based (µcontent)
✓ two queries (qi, qj) sharing common
term...
Semantic-based (µsemantic)
✓ using Wikipedia and Wiktionary for
“ex...
Distance Functions: µ1 vs. µ2
✓ Convex combination µ1
✓ Conditional formula µ2
Ide...
QC-WCC
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
• set of nodes V are the queries in...
[Figure: QC-WCC example on a time-gap session φ with queries 1 to 8: build similarity graph Gφ, then drop “weak edges”.]
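The two steps named on the QC-WCC slides (build the similarity graph, drop “weak edges”) can be illustrated with a minimal sketch. It assumes that the groups left after dropping weak edges are the connected components of the graph, that an edge is “weak” when its weight is below a threshold `eta`, and that a toy term-overlap score stands in for the real query similarity; `qc_wcc`, `term_overlap`, and the sample queries are hypothetical names for illustration, not the authors' implementation.

```python
from itertools import combinations


def term_overlap(q1, q2):
    """Toy similarity (assumption): shared terms over all terms of the two queries."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def qc_wcc(queries, similarity, eta=0.3):
    """Group the queries of one time-gap session by connected components.

    Every pair of queries is an edge weighted by `similarity`; edges with
    weight below `eta` are dropped, and each remaining connected component
    is returned as one candidate task.
    """
    n = len(queries)
    parent = list(range(n))  # union-find forest over query indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if similarity(queries[i], queries[j]) >= eta:  # keep only strong edges
            parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(queries[i])
    return list(components.values())


session = ["fly to hong kong", "nba news", "hong kong flights", "nba sport news"]
tasks = qc_wcc(session, term_overlap, eta=0.3)
print(tasks)
```

Because components are found over the whole session, interleaved queries of the same task end up together even when they are not adjacent in time.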
QC-HTC
• Variation of QC-WCC based on head-tail components
• Does not need to compute the full similarity graph
• Exploits the se...
QC-HTC: sequential c...
• Partition each time-gap session into sequential
clusters containing only queries issued in a row
• Each query in every s...
QC-HTC: merging
• Merge together related sequential clusters due to
multi-tasking
• Hyp: a cluster is represented by its chronologically-
...
[Figure: QC-HTC example on a time-gap session φ with queries 1 to 8: 1) Sequential Clustering, 2) Merging.]
• In the first step the algorithm computes the similarity
only between one query and the next in the original data
• O(n) w...
Setup
• Run and compare all the proposed
approaches with:
• TS-26: time-splitting technique (baseline)
• QFG: session extraction...
• Measure the degree of correspondence between true tasks,
i.e., manually-extracted ground-truth, and predicted tasks,
i.e...
Results: TS-t
• 3 time thresholds used: 5, 15, and 26 minutes
• Note: TS-26 was used for splitting sample data set
• task-based sessions ...
Results: QFG
✓ trained on a segment of
our sample data set
✓ best results using η ...
Results: QC-WCC
✓ best results using µ2 and
η = 0.3
✓ vs. baseline:
• +20% F-measu...
Results: QC-HTC
✓ best results using µ2 and
η = 0.3
✓ vs. baseline:
• +19% F-measu...
Results: best
Results: Wiki impact
• Benefit of using Wikipedia instead of only lexical
content when computing query distance function
• Capturing other two q...
• Introduced the Task-based Session Discovery Problem
• from a WSE log of user activities extract several sets of
queries ...
Future Work
• Why should we stop here?
• Once discovered, smaller tasks might be part of
larger and more complex tasks
• The task “fly ...
Vision
• Make Web Search Engine the “universal driver” for
executing our daily activities on the Web
• Once user types in a query...
References
[1] Silverstein, Marais, Henzinger, and Moricz. “Analysis of a very larg...
Thank You!
Questions?

WSDM 2011


These slides are from the talk I gave at the ACM International Conference on Web Search and Data Mining (WSDM 2011), where I presented the paper "Identifying Task-based Sessions in Search Engine Query Logs". The paper was accepted for both oral and poster presentation (acceptance rate = 8.6%) and was one of the six best-paper candidates overall.

The research challenge addressed in this work is to devise effective techniques for identifying task-based sessions, i.e., sets of possibly non-contiguous queries issued by the user of a Web Search Engine to carry out a given task. To evaluate and compare different approaches, we built, by means of a manual labeling process, a ground-truth in which the queries of a given query log are grouped into tasks.
Our analysis of this ground-truth shows that users tend to perform more than one task at the same time, since about 75% of the submitted queries involve a multi-tasking activity. We formally define the Task-based Session Discovery Problem (TSDP) as the problem of best approximating the manually annotated tasks, and we propose several variants of well-known clustering algorithms, as well as a novel efficient heuristic algorithm, specifically tuned for solving the TSDP. These algorithms also exploit the collaborative knowledge collected by Wiktionary and Wikipedia to detect query pairs that are not lexically similar but are semantically related. The proposed algorithms have been evaluated on the above ground-truth and are shown to perform better than state-of-the-art approaches, because they effectively take into account the multi-tasking behavior of users.

Transcript of "WSDM 2011"

  1. Identifying Task-based Sessions in Search Engine Query Logs
      Gabriele Tolomei
      Ca’ Foscari University of Venice
      ISTI-CNR, Pisa, Italy
      February 12, 2011
      Fabrizio Silvestri, Raffaele Perego, Claudio Lucchese, Salvatore Orlando
  2. Agenda
      • Introduction
      • Contributions
      • Experiments and Results
      • Conclusions and Future Work
  4. Problem Statement: TSDP
      Task-based Session Discovery Problem: discover sets of possibly non-contiguous queries, issued by users and collected in Web Search Engine (WSE) query logs, whose aim is to carry out specific “tasks”.
  10. Background
      • What is a Web task?
        • A “template” for representing any (atomic) activity that can be achieved by exploiting the information available on the Web, e.g., “find a recipe”, “book a flight”, “read news”, etc.
      • Why WSE Query Logs?
        • Users rely on WSEs to satisfy their information needs by issuing possibly interleaved streams of related queries
        • WSEs collect the search activities, i.e., sessions, of their users by means of issued queries, timestamps, clicked results, etc.
        • User search sessions (especially long-term ones) might contain interesting patterns that can be mined, e.g., sub-sessions whose queries aim to perform the same Web task
  14. Motivation
      • “Addiction to Web search”: no matter what your information need is, ask it to a WSE and it will give you the answer, e.g., people querying Google for “google”!
      • Everyone who is now at WSDM 2011 has dealt with a lot of “stuff” for organizing her/his attendance
      • The conference Web site is full of useful information, but some tasks still have to be performed (e.g., book a flight, reserve a hotel room, rent a car, etc.)
      • Discovering tasks from WSE logs will allow us to better understand user search intents at a “higher level of abstraction”:
        • from query-by-query to task-by-task Web search
  22. The Big Picture
      [Figure: a user's query stream (e.g., “hong kong flights”, “fly to hong kong”, “nba sport news”, “pisa to hong kong”) forms a long-term session, which is split wherever Δt > tφ into sessions 1, 2, ..., n; example labels shown: “fly to Hong Kong”, “nba news”, “shopping in Hong Kong”.]
  23. Related Work
      • Previous work on session identification can be classified into:
        1. time-based
        2. content-based
        3. novel heuristics (combining 1. and 2.)
  25. Related Work: time-based
      • 1999: Silverstein et al. [1] first defined the concept of “session”:
        • 2 adjacent queries (qi, qi+1) are part of the same session if their time submission gap is at most 5 minutes
      • 2000: He and Göker [2] used different timeouts to split user sessions (from 1 to 50 minutes)
      • 2006: Jansen and Spink [4] described a session as the time gap between the first and last recorded timestamp on the WSE server
      PROs
      ✓ ease of implementation
      CONs
      ✓ unable to deal with multi-tasking behaviors
  27. Related Work: content-based
      • Some works exploit the lexical content of the queries to determine a topic shift in the stream, i.e., a session boundary [3, 5, 6, 7]
      • Several string similarity scores have been proposed, e.g., Levenshtein, Jaccard, etc.
      • 2005: Shen et al. [8] compared “expanded representations” of queries
        • the expansion of a query q is obtained by concatenating titles and Web snippets for the top-50 results provided by a WSE for q
      PROs
      ✓ effectiveness improvement
      CONs
      ✓ vocabulary-mismatch problem: e.g., (“nba”, “kobe bryant”)
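As a minimal sketch of one lexical score of the kind named above, the snippet below computes Jaccard similarity over whitespace-separated query terms. The tokenization and the function name `jaccard_similarity` are assumptions made for illustration; the cited works may tokenize and normalize queries differently.

```python
def jaccard_similarity(q1: str, q2: str) -> float:
    """Jaccard similarity between the term sets of two queries."""
    terms1 = set(q1.lower().split())
    terms2 = set(q2.lower().split())
    if not terms1 and not terms2:
        return 0.0  # two empty queries: treat as unrelated
    return len(terms1 & terms2) / len(terms1 | terms2)


# Queries that share terms get a positive score.
print(jaccard_similarity("fly to hong kong", "hong kong flights"))  # 0.4

# Vocabulary mismatch: related queries with no common term score zero.
print(jaccard_similarity("nba", "kobe bryant"))  # 0.0
```

The second call shows the vocabulary-mismatch problem directly: a purely lexical score cannot relate “nba” and “kobe bryant”.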
  29. Related Work: novel
      • 2005: Radlinski and Joachims [3] introduced query chains, i.e., sequences of queries with a similar information need
      • 2008: Boldi et al. [9] introduced the query-flow graph as a model for representing WSE log data
        • session identification as Traveling Salesman Problem
      • 2008: Jones and Klinkner [10] addressed a problem similar to the TSDP
        • hierarchical search: mission vs. goal
        • supervised approach: learn a suitable binary classifier to detect whether two queries (qi, qj) belong to the same task or not
      PROs
      ✓ effectiveness improvement
      CONs
      ✓ computational complexity
  35. Outline
      • Formalize the Task-based Session Discovery Problem
      • Analyze a real long-term WSE log of queries
      • Build a ground-truth of tasks by manually grouping a sample of task-related queries in the given WSE log
      • Perform some statistics on top of the ground-truth
      • Propose several techniques for addressing the TSDP
  36. Agenda
      • Introduction
      • Contributions
        • Query Log Analysis
        • Ground-truth Analysis
        • Approaching TSDP
      • Experiments and Results
      • Conclusions and Future Work
  38. Data Set: AOL Query Log
      Original Data Set
      ✓ 3-month collection
      ✓ ~20M queries
      ✓ ~657K users
      Sample Data Set
      ✓ 1-week collection
      ✓ ~100K queries
      ✓ 1,000 users
      ✓ removed empty queries
      ✓ removed “non-sense” queries
      ✓ removed stop-words
      ✓ applied Porter stemming algorithm
  40. Data Analysis: query time gap
      tφ = 26 min.
      84.1% of adjacent query pairs are issued within 26 minutes
  45. Ground-truth: construction
      • Long-term sessions of the sample data set are first split using the threshold tφ devised before (i.e., 26 minutes)
        • obtaining several time-gap sessions
      • Human annotators group queries that they judge to be task-related inside each time-gap session
      • The result represents the true task-based partitioning, manually built from actual WSE query log data
      • Useful both for statistical purposes and for evaluating automatic task-based session discovery methods
  46. Ground-truth: statistics
      ✓ 2,004 queries
      ✓ 446 time-gap sessions
      ✓ 1,424 annotated queries
      ✓ 307 annotated time-gap sessions
      ✓ 554 detected task-based sessions
  49. Ground-truth: statistics
      ✓ 4.49 avg. queries per time-gap session
      ✓ more than 70% of time-gap sessions contain at most 5 queries
      ✓ 2.57 avg. queries per task
      ✓ ~75% of tasks contain at most 3 queries
      ✓ 1.80 avg. tasks per time-gap session
      ✓ ~47% of time-gap sessions contain more than one task (multi-tasking)
      ✓ 1,046 out of 1,424 queries (i.e., ~74%) are included in multi-tasking sessions
  50. Ground-truth: statistics
      ✓ overlapping degree of multi-tasking sessions
      ✓ a jump occurs whenever two queries of the same task are not originally adjacent
      ✓ ratio of tasks in a time-gap session that contain at least one jump
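The jump-based overlap measure above can be sketched as follows. This is one possible reading, stated as an assumption: a task “contains a jump” when two chronologically consecutive queries of that task are separated by at least one query of another task. The function name `jump_ratio` and the label-list input format are hypothetical.

```python
from collections import defaultdict


def jump_ratio(task_labels):
    """Fraction of tasks in one time-gap session that contain at least one jump.

    `task_labels[i]` is the task label of the i-th query of the session,
    in chronological order.
    """
    positions = defaultdict(list)
    for index, task in enumerate(task_labels):
        positions[task].append(index)
    if not positions:
        return 0.0
    tasks_with_jump = sum(
        1
        for indexes in positions.values()
        # A jump: two successive queries of the task are not adjacent.
        if any(later - earlier > 1 for earlier, later in zip(indexes, indexes[1:]))
    )
    return tasks_with_jump / len(positions)


# Only "flight" is interrupted by another task, so 1 of 3 tasks has a jump.
labels = ["flight", "flight", "nba", "flight", "shopping"]
print(jump_ratio(labels))
```

A session whose tasks are all contiguous blocks gets a ratio of 0, so higher values indicate more interleaving.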
  55. 55. TSDP: approaches 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far enough apart in time, they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if the gap between their submission times is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) CONs: ✓ unable to deal with multi-tasking ✓ unaware of other discriminating query features (e.g., lexical content) Methods: TS-5, TS-15, TS-26, etc.
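The TimeSplitting-t rule can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: timestamps are taken to be seconds in chronological order for a single user, and the function name and return format (lists of query indices) are hypothetical.

```python
def time_splitting(timestamps, threshold_minutes):
    """Split a user's chronological query stream into sessions.

    Two consecutive queries stay in the same session iff the gap between
    their submission times is lower than the threshold t (in minutes).
    timestamps: submission times in seconds, sorted ascending (assumed).
    Returns a list of sessions, each a list of query indices.
    """
    threshold = threshold_minutes * 60
    sessions = []
    for idx, ts in enumerate(timestamps):
        if sessions and ts - timestamps[idx - 1] < threshold:
            sessions[-1].append(idx)   # small gap: same session
        else:
            sessions.append([idx])     # first query or large gap: new session
    return sessions


# TS-26 on four queries: the 1,940-second gap exceeds 26 minutes.
print(time_splitting([0, 60, 2000, 2100], 26))
```

A single pass over the queries is enough, which is where the O(n) cost comes from; the rule never looks at the query text, so interleaved tasks inside one session cannot be separated.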
  59. 59. TSDP: approaches 2) QueryClustering-m Description: Queries are grouped using clustering algorithms that exploit several query features. The clustering algorithms combine these features using two different distance functions for computing query-pair similarity. Two queries (qi, qj) are in the same task-based session if and only if they are in the same cluster. PROs: ✓ able to detect multi-tasking sessions ✓ able to deal with “noisy queries” (i.e., outliers) CONs: ✓ O(n²) time complexity (i.e., quadratic in the number n of queries, due to the all-pairs similarity computation) Methods: QC-MEANS, QC-SCAN, QC-WCC, and QC-HTC
  61. 61. Query Features Content-based (µcontent) ✓ two queries (qi, qj) sharing common terms are likely related ✓ µjaccard: Jaccard index on query character 3-grams ✓ µlevenshtein: normalized Levenshtein distance Semantic-based (µsemantic) ✓ using Wikipedia and Wiktionary for “expanding” a query q ✓ “wikification” of q using vector-space model ✓ relatedness between (qi, qj) computed using cosine similarity
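The two content-based components can be sketched as follows. The slide does not say how µjaccard and µlevenshtein are combined into µcontent, nor how the Levenshtein distance is normalized, so this sketch stops at the two components and assumes normalization by the longer query's length; lowercasing and all function names are also assumptions.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a query (lowercased; an assumption)."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_index(a, b, n=3):
    """Jaccard index on character n-grams: |A ∩ B| / |A ∪ B|."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(jaccard_index("hong kong flights", "fly to hong kong"))
print(normalized_levenshtein("hong kong flights", "fly to hong kong"))
```

Note that the Jaccard index is a similarity (1 means identical 3-gram sets) while the normalized Levenshtein value is a distance (0 means identical strings), so one of the two has to be inverted before they can be combined.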
  62. 62. Distance Functions: µ1 vs. µ2 ✓ Convex combination µ1 ✓ Conditional formula µ2 Idea: if two queries are close in terms of lexical content, the semantic expansion could be unhelpful. Vice versa, nothing can be said when queries do not share any content feature ✓ Both µ1 and µ2 rely on the estimation of some parameters, i.e., α, t, and b ✓ Use ground-truth for tuning parameters
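The formulas for µ1 and µ2 were shown as images and are not part of this transcript. The sketch below is therefore an assumption: µ1 is written as the usual convex combination weighted by α, and µ2 is only one conditional form that is consistent with the stated idea (trust the content distance when it is below a threshold t, otherwise also consult the semantic distance scaled by b). Function and argument names are hypothetical.

```python
def mu1(content, semantic, alpha):
    """Convex combination of content and semantic distance, 0 <= alpha <= 1."""
    return alpha * content + (1 - alpha) * semantic


def mu2(content, semantic, t, b):
    """One possible conditional form (assumed, not taken from the slides).

    If the queries are lexically close (content distance below t), the
    semantic expansion is skipped; otherwise the scaled semantic distance
    may lower the overall distance.
    """
    if content < t:
        return content
    return min(content, b * semantic)


print(mu1(0.2, 0.8, alpha=0.5))
print(mu2(0.8, 0.1, t=0.5, b=4))
```

Whatever the exact form, α, t, and b are free parameters, which is why the ground-truth is needed to tune them.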
  68. 68. QC-WCC • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) • the nodes in V are the queries in φ • the edges in E are weighted by the similarity of the corresponding nodes • Drops weak edges, i.e., those with low similarity, assuming the corresponding queries are not related, and obtains G’φ • Clusters are built on the basis of strong edges by finding all the connected components of the pruned graph G’φ • O(n²) time complexity, where n = |V|
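The QC-WCC steps can be sketched with a union-find structure, which finds the connected components of the pruned graph without materializing it. This is a minimal illustration under assumptions: `sim` is any query-pair similarity in [0, 1], an edge counts as weak when its weight is not above a threshold `eta`, and `word_jaccard` is only a toy stand-in for the similarity used in the talk.

```python
def word_jaccard(q1, q2):
    """Toy similarity (assumption): Jaccard index on query words."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def qc_wcc(queries, sim, eta):
    """Cluster the queries of one time-gap session.

    Conceptually: build the complete similarity graph, drop edges with
    weight <= eta, return the connected components as lists of indices.
    """
    n = len(queries)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):                 # all pairs: O(n^2) similarity calls
        for j in range(i + 1, n):
            if sim(queries[i], queries[j]) > eta:   # keep strong edges only
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)
    return sorted(components.values())


session = ["hong kong flights", "cheap hong kong flights", "nba news",
           "nba scores news", "hong kong flights deals"]
print(qc_wcc(session, word_jaccard, eta=0.3))
```

The last query rejoins the first task even though two unrelated queries were issued in between, which is the multi-tasking case a time threshold alone cannot handle.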
  72. 72. QC-WCC [Figure: the eight queries (1 to 8) of a time-gap session φ; the similarity graph Gφ is built, “weak edges” are dropped, and the remaining connected components form the clusters.]
  76. 76. QC-HTC • Variation of QC-WCC based on head-tail components • Does not need to compute the full similarity graph • Exploits the sequentiality of query submissions to reduce the number of similarity computations • Performs 2 steps: 1. sequential clustering 2. merging
  79. 79. QC-HTC: sequential clustering • Partition each time-gap session into sequential clusters containing only queries issued in a row • Each query in every sequential cluster has to be “similar enough” to the chronologically next one • Only the similarity between each query and the next one in the original data needs to be computed
  84. 84. QC-HTC: merging • Merge related sequential clusters that were split apart by multi-tasking • Hyp: a cluster is represented by its chronologically first and last queries, i.e., head and tail, respectively • Given two sequential clusters ci, cj with head and tail queries hi, ti and hj, tj, respectively, the similarity s(ci, cj) is computed as follows: s(ci, cj) = min w(e(qi, qj)) s.t. qi ∈ {hi, ti} and qj ∈ {hj, tj} • ci and cj are merged as long as s(ci, cj) > η • hi, ti and hj, tj are updated accordingly
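Both QC-HTC steps can be sketched together. This is an illustration under assumptions rather than the authors' implementation: “similar enough” in the first step is taken to mean similarity above the same threshold η, merging is repeated until no pair of clusters exceeds η, and `word_jaccard` is a toy stand-in for the real query similarity. Clusters are kept as sorted lists of query indices, so the head is the first element and the tail the last.

```python
def word_jaccard(q1, q2):
    """Toy similarity (assumption): Jaccard index on query words."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def sequential_clusters(queries, sim, eta):
    """Step 1: group queries issued in a row while each is similar to the next."""
    clusters = []
    for idx in range(len(queries)):
        if clusters and sim(queries[idx - 1], queries[idx]) > eta:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])
    return clusters


def cluster_similarity(ci, cj, queries, sim):
    """s(ci, cj): minimum similarity over the head/tail queries of both clusters."""
    return min(sim(queries[a], queries[b])
               for a in (ci[0], ci[-1]) for b in (cj[0], cj[-1]))


def qc_htc(queries, sim, eta):
    """Step 2: merge sequential clusters whose head/tail similarity exceeds eta."""
    clusters = sequential_clusters(queries, sim, eta)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if cluster_similarity(clusters[i], clusters[j], queries, sim) > eta:
                    # Sorting keeps head (first) and tail (last) up to date.
                    clusters[i] = sorted(clusters[i] + clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


session = ["hong kong flights", "cheap hong kong flights", "nba news",
           "nba scores news", "hong kong flights deals"]
print(sequential_clusters(session, word_jaccard, eta=0.3))
print(qc_htc(session, word_jaccard, eta=0.3))
```

Step 1 yields three sequential clusters; step 2 then merges the first and the last because both the head and the tail of the first cluster are similar to the single query of the last one.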
  92. 92. QC-HTC [Figure: the eight queries (1 to 8) of a time-gap session φ; 1) sequential clustering groups queries issued in a row, 2) merging joins related sequential clusters.]
  94. 94. QC-HTC: time complexity • In the first step the algorithm computes the similarity only between each query and the next one in the original data • O(n), where n is the size of the time-gap session • In the second step the algorithm computes the pairwise similarity between sequential clusters • O(k²), where k is the number of sequential clusters • if k = β·n with 0 < β ≤ 1, then the time complexity is O(β²·n²) • e.g., β = 1/2 gives O(n²/4), i.e., about 4 times fewer similarity computations than QC-WCC
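The comparison can be made concrete by counting comparisons. The sketch below is a rough model under stated assumptions: QC-WCC is charged one similarity computation per query pair, QC-HTC is charged n − 1 computations for the sequential step plus one cluster-pair comparison per pair of sequential clusters in a single merging pass, and the constant factor of up to four head/tail query comparisons per cluster pair is ignored. The function names are hypothetical.

```python
def wcc_pairs(n):
    """Query pairs compared by QC-WCC on a session of n queries."""
    return n * (n - 1) // 2


def htc_pairs(n, k):
    """Comparisons in the simplified QC-HTC model: n - 1 consecutive-query
    comparisons plus one comparison per pair of the k sequential clusters."""
    return (n - 1) + k * (k - 1) // 2


n = 100
k = n // 2  # beta = 1/2
print(wcc_pairs(n), htc_pairs(n, k))
```

With n = 100 and β = 1/2 this gives 4,950 against 1,324 comparisons, close to the factor of 4 once the linear first step is accounted for.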
  95. 95. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work
  98. 98. Setup • Run and compare all the proposed approaches with: • TS-26: time-splitting technique (baseline) • QFG: session extraction method based on the query-flow graph model (state of the art)
  102. 102. Evaluation • Measure the degree of correspondence between true tasks, i.e., the manually extracted ground-truth, and predicted tasks, i.e., the output of the algorithms a) F-MEASURE ✓ evaluates the extent to which a predicted task contains only and all the queries of a true task ✓ combines p(i, j) and r(i, j), the precision and recall of task i w.r.t. class j b) RAND ✓ considers pairs of queries instead of single queries ✓ uses f00, f01, f10, f11 c) JACCARD ✓ considers pairs of queries instead of single queries ✓ uses f01, f10, f11
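The three measures can be sketched as follows, assuming the standard clustering-evaluation definitions since the slide formulas are not in the transcript: f11 counts query pairs placed together in both the ground-truth and the prediction, f00 pairs separated in both, f01 and f10 the two kinds of disagreement; Rand = (f00 + f11) / (f00 + f01 + f10 + f11), Jaccard = f11 / (f01 + f10 + f11), and the overall F-measure is the class-size-weighted best F(i, j) per true task. Function names and the label-list input format are hypothetical.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count query pairs by agreement between ground-truth and prediction."""
    f00 = f01 = f10 = f11 = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_true = true_labels[a] == true_labels[b]
        same_pred = pred_labels[a] == pred_labels[b]
        if same_true and same_pred:
            f11 += 1
        elif same_true:
            f10 += 1   # same true task, different predicted tasks
        elif same_pred:
            f01 += 1   # different true tasks, same predicted task
        else:
            f00 += 1
    return f00, f01, f10, f11


def rand_index(true_labels, pred_labels):
    f00, f01, f10, f11 = pair_counts(true_labels, pred_labels)
    total = f00 + f01 + f10 + f11
    return (f00 + f11) / total if total else 1.0


def jaccard_index(true_labels, pred_labels):
    _, f01, f10, f11 = pair_counts(true_labels, pred_labels)
    denom = f01 + f10 + f11
    return f11 / denom if denom else 1.0


def f_measure(true_labels, pred_labels):
    """Weighted average over true tasks of the best F(i, j) among predicted tasks."""
    n = len(true_labels)
    total = 0.0
    for cls in set(true_labels):
        members = {i for i, l in enumerate(true_labels) if l == cls}
        best = 0.0
        for clu in set(pred_labels):
            predicted = {i for i, l in enumerate(pred_labels) if l == clu}
            overlap = len(members & predicted)
            if overlap:
                p, r = overlap / len(predicted), overlap / len(members)
                best = max(best, 2 * p * r / (p + r))
        total += len(members) / n * best
    return total


truth = [0, 0, 1, 1]
pred = [0, 0, 0, 1]
print(rand_index(truth, pred), jaccard_index(truth, pred), f_measure(truth, pred))
```

Jaccard ignores f00, so it is not inflated by the many query pairs that are trivially kept apart, which is one reason the Rand and Jaccard gains reported for the same method can differ widely.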
  104. 104. Results: TS-t • 3 time thresholds used: 5, 15, and 26 minutes • Note: TS-26 was used for splitting the sample data set • for TS-26, task-based sessions == time-gap sessions
  105. 105. Results: QFG ✓ trained on a segment of our sample data set ✓ best results using η = 0.7 ✓ vs. baseline: • +16% F-measure • +52% Rand • +15% Jaccard
  106. 106. Results: QC-WCC ✓ best results using µ2 and η = 0.3 ✓ vs. baseline: • +20% F-measure • +56% Rand • +23% Jaccard ✓ vs. QFG: • +5% F-measure • +9% Rand • +10% Jaccard
  107. 107. Results: QC-HTC ✓ best results using µ2 and η = 0.3 ✓ vs. baseline: • +19% F-measure • +56% Rand • +21% Jaccard ✓ vs. QFG: • +4% F-measure • +9% Rand • +8% Jaccard
  108. 108. Results: best
  111. 111. Results: Wiki impact • Benefit of using Wikipedia instead of only lexical content when computing the query distance function • Captures pairs of queries that are lexically different but somehow “semantically” similar • Try going here: http://en.wikipedia.org/wiki/Cancun
  113. 113. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work
  116. 116. Conclusions • Introduced the Task-based Session Discovery Problem • from a WSE log of user activities, extract sets of queries that are all related to the same task • Compared clustering solutions exploiting two distance functions based on query content and semantic expansion (i.e., Wiktionary and Wikipedia) • Proposed a novel graph-based heuristic, QC-HTC, lighter than QC-WCC and outperforming the TS-26 baseline and QFG in terms of F-measure, Rand, and Jaccard index
  119. 119. Future Work • Why should we stop here? • Once discovered, smaller tasks might be part of larger and more complex tasks • The task “fly to Hong Kong” might be a step of a larger task, e.g., “holidays in Hong Kong”, which in turn could involve several other tasks...
  123. 123. Vision • Make the Web Search Engine the “universal driver” for executing our daily activities on the Web • Once the user types in a query, the WSE should “infer the tasks” the user aims to perform (if any): serendipity! • Results should no longer be only a list of plain links but also tasks, both simple and complex • Recommendation of queries and/or Web pages both intra- and inter-task: task vs. query recommendation
  124. 124. References [1] Silverstein, Marais, Henzinger, and Moricz. “Analysis of a very large web search engine query log”. In SIGIR Forum, 1999 [2] He and Göker. “Detecting session boundaries from web user logs”. In BCS-IRSG, 2000 [3] Radlinski and Joachims. “Query chains: Learning to rank from implicit feedback”. In KDD '05 [4] Jansen and Spink. “How are we searching the world wide web?: a comparison of nine search engine transaction logs”. In IPM, 2006 [5] Lau and Horvitz. “Patterns of search: Analyzing and modeling web query refinement”. In UM '99 [6] He and Harper. “Combining evidence for automatic web session identification”. In IPM, 2002 [7] Ozmutlu and Çavdur. “Application of automatic topic identification on excite web search engine data logs”. In IPM, 2005 [8] Shen, Tan, and Zhai. “Implicit user modeling for personalized search”. In CIKM '05 [9] Boldi, Bonchi, Castillo, Donato, Gionis, and Vigna. “The query-flow graph: model and applications”. In CIKM '08 [10] Jones and Klinkner. “Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs”. In CIKM '08 [11] MacQueen. “Some methods for classification and analysis of multivariate observations”. In BSMSP, 1967 [12] Ester, Kriegel, Sander, and Xu. “A density-based algorithm for discovering clusters in large spatial databases with noise”. In KDD '96
  125. 125. Thank You! Questions?