Identifying Users’ Topical Tasks in Web Search

W. Hua, Y. Song, H. Wang, X. Zhou, WSDM 2013
2013/06/30 SEXI2013 reading group
@harapon

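About this paper
● A search task consists of a query and its reformulations
● Task identification matters to a search engine
• to provide valuable information
• to predict the search intent of the user
• to suggest queries to the user
● Previous approaches use
• temporal features of queries
• lexical features (word-level features)
● Many query reformulations are topical, so a reformulation may share no words with the original query (Problem 1)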
Suggesting “cheap US flight” for “flight to LA” → easy
Suggesting “hotel in LA” for “flight to LA” → hard

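About this paper
● Moreover, several tasks can be interleaved within a single search session, so tasks cannot be separated just by putting the queries in timestamp order (Problem 2)
● To address these problems, the paper
• builds a similarity measure that compares two queries at the semantic level, addressing Problem 1
• proposes the Sequential Cut and Merge (SCM) algorithm, which handles queries that are far apart in time, addressing Problem 2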

3. Probase
http://research.microsoft.com/en-us/projects/probase/
● There is a gap between the bag-of-words representation and human understanding
● Probase is a semantic network

3.1 Knowledge Representation
● node
• entity (“Barack Obama”)
• concept (“President of America”)
• attribute (e.g., “age”, “color”)
● edge
• isA relationship (“Barack Obama” isA “President of America”)
• isAttributeOf relationship (“population” isAttributeOf “country”)
● Edges in Probase are weighted with probabilistic information
• P(instance | concept)
• e.g., P(“poodle” | “dogs”) > P(“pugs” | “dogs”)
• P(concept | instance), P(concept | attribute) and P(attribute | concept) can also be defined
● Built from data extracted from billions of web pages
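
As a rough illustration of how such edge weights could be obtained, the sketch below estimates P(instance | concept) and P(concept | instance) as relative frequencies of isA co-occurrence counts. The counts, the `typicality` helper and the relative-frequency estimator are assumptions made for illustration; the scoring actually used by Probase may differ.

```python
from collections import defaultdict

# Hypothetical isA co-occurrence counts: (concept, instance) -> count.
ISA_COUNTS = {
    ("dogs", "poodle"): 80,
    ("dogs", "pugs"): 20,
    ("companies", "apple"): 60,
    ("fruit", "apple"): 40,
}

def typicality(counts):
    """Return (P(instance | concept), P(concept | instance)) as nested dicts."""
    by_concept = defaultdict(int)
    by_instance = defaultdict(int)
    for (concept, instance), n in counts.items():
        by_concept[concept] += n
        by_instance[instance] += n
    p_inst = defaultdict(dict)
    p_conc = defaultdict(dict)
    for (concept, instance), n in counts.items():
        # Relative frequency within the concept / within the instance.
        p_inst[concept][instance] = n / by_concept[concept]
        p_conc[instance][concept] = n / by_instance[instance]
    return p_inst, p_conc

p_instance_given_concept, p_concept_given_instance = typicality(ISA_COUNTS)
```

With these made-up counts, "poodle" is a more typical instance of "dogs" than "pugs", which mirrors the inequality on the slide.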

4. Methodology
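● Task identification reduces to the following two problems
• How to quantify the similarity or relatedness between two queries?
• How to cluster similar queries efficiently within one query session?
● Definition of a task
Given a session as a sequence of queries, a task is defined in terms of
• the timestamp of each query
• the similarity between a pair of queries
• a similarity threshold
The queries of a task need not be consecutive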

4.1 Similarity Calculation
● Four kinds of features are built for query similarity
• conceptual features
• lexical features
• template features
• temporal features
● Conceptual features are the main ones
• Motivated by resolving query ambiguity
• e.g., whether “apple” means the fruit or the company

4.1.1 Conceptual Features
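● Turn the concepts behind a query into features
● Step 1: Parsing
• Split the query into words
• Use the longest word sequences that map to an instance/attribute in Probase
• Among candidates of equal length, prefer the one linked to more concepts
• For the query “truck driving school pay after training”, the longest instances found in Probase are “truck driving”, “school”, “pay” and “training”; “driving” alone is not used
• For the query “tiger woods”, the result is “tiger woods”, not “tiger” and “woods”
• The parsed query is easier to interpret than its BoW representation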
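
The parsing step can be sketched as a greedy, left-to-right longest match against a term vocabulary. This is a minimal illustration, not the parser from the paper: `parse_query`, `max_len` and the toy `VOCAB` (standing in for Probase instances/attributes) are hypothetical, and the tie-breaking rule on the number of linked concepts is omitted.

```python
def parse_query(query, vocabulary, max_len=5):
    """Greedy left-to-right longest match against a term vocabulary."""
    words = query.lower().split()
    terms = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocabulary:
                match = candidate
                break
        if match is None:
            i += 1  # word unknown to the vocabulary: skip it
        else:
            terms.append(match)
            i += len(match.split())
    return terms

# Hypothetical vocabulary standing in for Probase instances/attributes.
VOCAB = {"truck driving", "driving", "school", "pay", "training",
         "tiger", "woods", "tiger woods"}

parsed_truck = parse_query("truck driving school pay after training", VOCAB)
parsed_tiger = parse_query("tiger woods", VOCAB)
```

With this toy vocabulary the two queries from the slide parse to "truck driving", "school", "pay", "training" and to the single term "tiger woods".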

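● Step 2: Conceptualization
• Map a query to a set of instances/attributes
• Infer the concepts that best represent these instances/attributes
• First, identify the candidate concepts of each instance/attribute t_l as a concept vector c_l = (P(c_1 | t_l), …, P(c_M | t_l))
• M is the total number of concepts in Probase
• Keep only the top K concepts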

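● Clustering
• Next, to find the multiple topics within a query, group the instances/attributes so that each cluster corresponds to one topic
• e.g., “alabama home insurance” gives “alabama” (“state”) and “home insurance” (“insurance” and “benefits”)
• Build a weighted graph
• Each edge weight is the cosine similarity between the concept vectors of its two nodes
• Remove edges whose cosine similarity is below the threshold; the remaining subgraphs represent the instance/attribute clusters
• Cluster r is a subgraph made of a node set, corresponding to terms of T, and an edge set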
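
The clustering step above can be sketched as follows: compute pairwise cosine similarities between concept vectors, keep only the edges that reach the threshold, and read the clusters off as connected components. The function names, the sparse-dict representation and the example vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_terms(concept_vectors, threshold):
    """Group terms into connected components of the thresholded graph."""
    terms = list(concept_vectors)
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            # Keep the edge only if the similarity reaches the threshold.
            if cosine(concept_vectors[a], concept_vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())

# Hypothetical concept vectors P(concept | term).
VECTORS = {
    "alabama": {"state": 0.9, "university": 0.1},
    "home insurance": {"insurance": 0.7, "benefits": 0.3},
}
clusters = cluster_terms(VECTORS, threshold=0.5)
```

Because "alabama" and "home insurance" share no concept in this toy data, their cosine similarity is zero and they end up in separate clusters, as in the slide example.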

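● Resolving query ambiguity
• Conceptualize the instances/attributes in each cluster r as a concept vector c_r
• A Naive Bayes function computes what the concept vectors of the instances/attributes in cluster r have in common
• The instances/attributes are assumed to be mutually independent
• The common concepts in cluster r are then ranked
P(c_k | t_l^r): the k-th value of the concept vector of instance/attribute t_l^r
P(c_k): the popularity of concept c_k in Probase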
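
Under the independence assumption, Bayes' rule gives P(c_k | t_1, …, t_L) ∝ P(c_k) · Π_l P(t_l | c_k) ∝ Π_l P(c_k | t_l) / P(c_k)^(L−1). The sketch below implements this standard Naive Bayes combination; it is a plausible reading of the step described above, and the exact formula in the paper may differ. `combine_concepts` and all probabilities are hypothetical.

```python
import math

def combine_concepts(term_vectors, prior):
    """Score the concepts shared by all terms of one cluster.

    score(c) is proportional to prod_l P(c | t_l) / P(c) ** (L - 1).
    """
    n_terms = len(term_vectors)
    # Only concepts present in every term's vector survive (the intersection).
    common = set(term_vectors[0])
    for vec in term_vectors[1:]:
        common &= set(vec)
    scores = {}
    for c in common:
        numerator = math.prod(vec[c] for vec in term_vectors)
        scores[c] = numerator / prior[c] ** (n_terms - 1)
    total = sum(scores.values())
    # Normalise so the scores of the cluster sum to 1.
    return {c: s / total for c, s in scores.items()} if total else {}

# Hypothetical concept vectors for a cluster containing "apple" and "microsoft".
apple = {"fruit": 0.5, "company": 0.4, "brand": 0.1}
microsoft = {"company": 0.8, "brand": 0.2}
# Hypothetical concept popularities P(c).
PRIOR = {"fruit": 0.1, "company": 0.2, "brand": 0.1}

posterior = combine_concepts([apple, microsoft], PRIOR)
ranked = sorted(posterior, key=posterior.get, reverse=True)
```

In this toy cluster the "fruit" sense of "apple" drops out because "microsoft" has no such concept, and "company" ranks first.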

● Example of query disambiguation
● Conceptualizing the whole query
• The query is conceptualized from the disambiguated concept vectors

● Step 3: Calculating conceptual similarity
• Compute the cosine similarity between queries from the conceptualization result of each query q (its concept vector c_q)
Overall flow (figure): query q_i → word sequences → instance/attribute set T = (t_1, t_2, …, t_L) → concept vector of each term (t_1 → c_1, t_2 → c_2, …, t_L → c_L) → clustering (T_1 → c_1, …, T_r → c_r: concept vector of each cluster) → concept vector c_q of query q_i

4.1.2 Lexical Features
● Two ways to express the BoW similarity between queries
• N-word Jaccard
• e.g., with 2-word units, “the car james bond drive” becomes [“the car”, “car james”, “james bond”, “bond drive”]
• N-char Jaccard
• defined in the same way at the character level
v_i: the N-word set of query q_i
v_i^k: the term frequency of the k-th N-word in set v_i
m, n: the sizes of sets v_i and v_j
k_i, k_j: the indexes of a common N-word in sets v_i and v_j
v_i^{k_i}, v_j^{k_j}: the term frequencies of that common N-word in sets v_i and v_j
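
A minimal sketch of the two lexical features follows. Since the definitions above mention term frequencies, it uses a frequency-weighted Jaccard (sum of minima over sum of maxima); that particular weighting is an assumption, as are the helper names, and the paper's formula may differ.

```python
from collections import Counter

def word_ngrams(query, n):
    """Multiset of contiguous n-word sequences in the query."""
    words = query.split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def char_ngrams(query, n):
    """Multiset of contiguous n-character sequences in the query."""
    return Counter(query[i:i + n] for i in range(len(query) - n + 1))

def weighted_jaccard(vi, vj):
    """Term-frequency-weighted Jaccard: sum of mins over sum of maxes."""
    keys = set(vi) | set(vj)
    denom = sum(max(vi[k], vj[k]) for k in keys)
    return sum(min(vi[k], vj[k]) for k in keys) / denom if denom else 0.0

a = word_ngrams("the car james bond drive", 2)
b = word_ngrams("james bond car", 2)
similarity = weighted_jaccard(a, b)
```

Here `a` holds the four 2-word units listed on the slide, and the two queries share only "james bond" out of five distinct units.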

4.1.3 Template Features
● Method of Huang et al. (2009)
● substring/superstring, add/remove words, stemming, spelling correction, acronym and abbreviation, etc.
● In short, an edit distance that covers typos and derived word forms
● Levenshtein edit distance
ed(q_i, q_j): the Levenshtein edit distance between queries q_i and q_j
len(q_i): the length of query q_i
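
The two quantities defined above can be combined into a normalised similarity. The sketch below computes the Levenshtein distance with the usual dynamic programme and normalises it by the longer query length; the normalisation 1 − ed / max(len) is an assumption chosen for illustration, not necessarily the feature used in the paper.

```python
def levenshtein(a, b):
    """Levenshtein edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(qi, qj):
    """1 - ed(qi, qj) / max(len(qi), len(qj)); 1.0 for two empty strings."""
    longest = max(len(qi), len(qj))
    return 1.0 - levenshtein(qi, qj) / longest if longest else 1.0

typo_distance = levenshtein("flight", "flihgt")
typo_similarity = edit_similarity("flight", "flihgt")
```

A transposed pair of letters counts as two substitutions under plain Levenshtein distance, so the typo above is at distance 2.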

4.1.4 Temporal Features
● Time interval between consecutive queries
● The closer two queries are in time, the more likely they belong to the same task
t(q_i): the time at which query q_i is issued
d(q_i): the dwell time of query q_i (the sum of the dwell times of the clicks after q_i)

4.2 Task Identification
● Sequential Cut and Merge (SCM)
Figure annotations: cannot handle interleaved tasks; computationally expensive; over-merge

4.2 Task Identification
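● Sequential Cut and Merge (SCM)
• First apply SC; the resulting tasks are called sub-tasks
• Merge the BoW of the queries in each sub-task into a new query that represents the sub-task
• Apply GC to the set of sub-tasks, cutting edges at or below the threshold
• Selling points of SCM
• Handles the weakness of SC (interleaved tasks)
• Needs less computation time than GC (because GC runs at the higher, sub-task level)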
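
The steps above can be sketched as follows, reading SC as a sequential cut between adjacent dissimilar queries and GC as a graph step that keeps edges above the threshold and takes connected components. Both readings, the function names and the toy `word_overlap` similarity (a stand-in for the learned query similarity) are assumptions for illustration.

```python
def sequential_cut(queries, sim, threshold):
    """Cut the time-ordered session wherever adjacent queries are dissimilar."""
    subtasks = []
    for q in queries:
        if subtasks and sim(subtasks[-1][-1], q) > threshold:
            subtasks[-1].append(q)
        else:
            subtasks.append([q])
    return subtasks

def merge_subtasks(subtasks, sim, threshold):
    """Graph step on sub-tasks: link similar ones, return connected components."""
    # Each sub-task is represented by the concatenation of its queries.
    merged = [" ".join(s) for s in subtasks]
    parent = list(range(len(subtasks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(merged)):
        for j in range(i + 1, len(merged)):
            # Edges at or below the threshold are cut.
            if sim(merged[i], merged[j]) > threshold:
                parent[find(i)] = find(j)

    tasks = {}
    for i, s in enumerate(subtasks):
        tasks.setdefault(find(i), []).extend(s)
    return list(tasks.values())

def scm(queries, sim, threshold):
    """Sequential cut followed by the merge step on the resulting sub-tasks."""
    return merge_subtasks(sequential_cut(queries, sim, threshold), sim, threshold)

def word_overlap(a, b):
    """Toy stand-in for the learned similarity: Jaccard over word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

session = ["flight to la", "cheap flight la", "python tutorial", "la flight deals"]
subtasks = sequential_cut(session, word_overlap, threshold=0.3)
tasks = scm(session, word_overlap, threshold=0.3)
```

In this toy session the unrelated query "python tutorial" interrupts the flight task, so the sequential cut alone yields three sub-tasks; the merge step then rejoins the two flight sub-tasks while comparing only three merged queries instead of all four originals.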

5. Evaluation and Results
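● Sessions were extracted from one day of commercial browser logs in May 2012
● For simplicity, they were filtered to English-language sessions of US residents with at least 10 queries per session
● This yielded 45,813 sessions, of which 600 samples were labeled by hand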

● Effectiveness of Classifiers and Features
(Figure: error rate)

● Accuracy of Algorithms
● Computational Time
• SCM is fast
GC improves on SC by 12.03% in F-measure and 46.21% in Jaccard; SCM improves on GC by 1.49% in F-measure and 12.61% in Jaccard

6. Conclusions
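● Built conceptual features using Probase
● Combining them with the existing features improves classifier accuracy
● For task identification, the SCM algorithm, which combines the existing algorithms, improves both computation time and identification accuracy
● Future work: resolve the problem that “celtics members” and “kevin garnette” end up in the same task
• The former is an NBA team, the latter an NBA player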
