Fairness in Search & RecSys (NAVER Search Colloquium), 김진영 (Jin Young Kim)
As search and recommender systems take on a larger social role, the fairness of their results has recently become a growing concern. This talk covers fairness issues in search and recommender systems and their remedies: various ways of defining fair search and recommendation results, the resource-allocation and stereotyping harms caused by a lack of fairness, and the solutions available at each stage of search and recommender system development, with a focus on recent research. It closes with practical considerations for building fair systems in the real world.
Slides from SIGIR 2013 talk. The full paper can be found here: http://staff.science.uva.nl/~mdr/Publications/sigir2013-metrics.pdf
ABSTRACT: In recent years many models have been proposed that are aimed at predicting clicks of web search users. In addition, some information retrieval evaluation metrics have been built on top of a user model. In this paper we bring these two directions together and propose a common approach to converting any click model into an evaluation metric. We then put the resulting model-based metrics as well as traditional metrics (like DCG or Precision) into a common evaluation framework and compare them along a number of dimensions.
One of the dimensions we are particularly interested in is the agreement between offline and online experimental outcomes. It is widely believed, especially in an industrial setting, that online A/B-testing and interleaving experiments are generally better at capturing system quality than offline measurements. We show that offline metrics that are based on click models are more strongly correlated with online experimental outcomes than traditional offline metrics, especially in situations when we have incomplete relevance judgements.
8. Do users scan documents from top to bottom?
1. The click-through rate (CTR) of the first document is about 0.45, while the CTR of the tenth document is well below 0.05.
2. The document below a click is viewed roughly 50% of the time.
17. Click Models
Random Click Model (RCM)
Click-Through Rate Models (CTR):
  Rank-Based CTR Model (RCTR)
  Document-Based CTR Model (DCTR)
User Browsing Model (UBM)
Position-Based Model (PBM)
Dependent Click Model (DCM)
Click Chain Model (CCM)
Dynamic Bayesian Network Model (DBN)
Simplified DBN Model (SDBN)
Cascade Model (CM)
18. Baseline models
1. Random Click Model (RCM): any document can be clicked with the same (fixed) probability.
2. Rank-Based CTR Model (RCTR): the click probability depends on the rank of the document.
3. Document-Based CTR Model (DCTR): a separate click-through rate for each query-document pair; subject to overfitting, because some documents and/or queries were not previously encountered in the click log.
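As a minimal sketch (function and variable names are my own, not from the slides), the maximum likelihood estimates for RCM and DCTR reduce to counting clicks over impressions, globally for RCM and per query-document pair for DCTR:

```python
from collections import defaultdict

def estimate_rcm(sessions):
    """RCM: one global click probability = total clicks / total impressions."""
    clicks = sum(c for session in sessions for (_, c) in session)
    shown = sum(len(session) for session in sessions)
    return clicks / shown

def estimate_dctr(sessions_by_query):
    """DCTR: one click-through rate per (query, document) pair."""
    counts = defaultdict(lambda: [0, 0])  # (q, d) -> [clicks, impressions]
    for q, sessions in sessions_by_query.items():
        for session in sessions:
            for doc, c in session:
                counts[(q, doc)][0] += c
                counts[(q, doc)][1] += 1
    return {key: c / n for key, (c, n) in counts.items()}

# A session is a ranked list of (doc_id, clicked) pairs, top result first.
sessions = [[("d1", 1), ("d2", 0)], [("d1", 0), ("d2", 1)]]
print(estimate_rcm(sessions))            # global CTR over all impressions
print(estimate_dctr({"q1": sessions}))   # per query-document CTRs
```

DCTR's overfitting problem is visible here: a pair seen once with one click gets CTR 1.0, and an unseen pair has no estimate at all.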
19. Position-Based Model
The position-based model (PBM) assumes that a document is clicked only when the user both examines it and finds it attractive.
Examination hypothesis: the probability of a user examining a document depends heavily on its rank (position). PBM introduces a set of examination parameters γ, one for each rank.
PBM does not depend on the events at previous ranks.
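In the notation commonly used in the click-model literature (my rendering, not taken from the slides: α_{uq} is the attractiveness of document u for query q, γ_r the examination probability at rank r), PBM factorizes the click probability as:

```latex
P(C_u = 1) = P(E_r = 1) \cdot P(A_u = 1),
\qquad P(E_r = 1) = \gamma_r,
\qquad P(A_u = 1) = \alpha_{uq}
```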
20. Cascade Model
Under the cascade model (CM), the user:
1. Starts from the first document
2. Examines documents one by one
3. Stops at the first click
4. Otherwise, continues to the next document
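The four steps above can be sketched as a small simulator (a hypothetical illustration; the `rng` parameter is only there to make the behavior testable):

```python
import random

def simulate_cascade(attractiveness, rng=random.random):
    """Simulate one session under the cascade model.
    attractiveness: list of P(click | examined) per rank, top result first.
    Returns the 0-based rank of the single click, or None if no click."""
    for rank, alpha in enumerate(attractiveness):
        if rng() < alpha:   # user examines this document; clicks if attractive
            return rank     # cascade assumption: stop right after the first click
    return None             # examined everything, clicked nothing

# e.g. a session over three results of decreasing attractiveness
print(simulate_cascade([0.9, 0.5, 0.1]))
```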
21. Cascade Model (CM)
In particular:
1. CM does not allow sessions with more than one click
2. CM cannot explain non-linear examination patterns
22. So far
1. CTR models
+ count clicks (simple and fast)
- do not distinguish examination from attractiveness
2. Position-based model (PBM), refined by the user browsing model
+ separates examination and attractiveness
- examination of a document at rank r does not depend on examinations and clicks above r
3. Cascade model (CM), refined by the dynamic Bayesian network
+ cascade dependency of examination at r on examinations and clicks above r
- only one click is allowed
23. User Browsing Model
In the user browsing model (UBM), the examination probability depends not only on the rank r of a document, but also on the rank r' of the previously clicked document.
r' is the rank of the previously clicked document, or 0 if none was clicked, where c_0 is set to 1 for convenience.
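Written out (again in the standard click-model notation, with the examination parameter now indexed by both ranks), the UBM examination probability is:

```latex
P(E_r = 1 \mid C_1 = c_1, \ldots, C_{r-1} = c_{r-1}) = \gamma_{r r'},
\qquad r' = \max\{\, k \in \{0, \ldots, r-1\} : c_k = 1 \,\},
\qquad c_0 = 1
```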
24. Dynamic Bayesian Network Model
Under the dynamic Bayesian network model (DBN), the user:
1. Starts from the first document
2. Examines documents one by one
3. On a click, reads the actual document and may be satisfied by it
4. If satisfied, stops
5. Otherwise, continues with a fixed probability
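The five steps above can be sketched as a simulator (a hypothetical illustration, with my own parameter names; `rng` is injectable only to make the behavior testable):

```python
import random

def simulate_dbn(attr, sat, gamma, rng=random.random):
    """Simulate one session under DBN.
    attr[r]: P(click | examined) at rank r, sat[r]: P(satisfied | clicked),
    gamma: continuation probability when the user is not (yet) satisfied.
    Returns the list of clicked ranks (0-based)."""
    clicks = []
    for rank in range(len(attr)):
        if rng() < attr[rank]:          # attractive: user clicks
            clicks.append(rank)
            if rng() < sat[rank]:       # satisfied: abandon the search
                break
        if rng() >= gamma:              # otherwise continue with probability gamma
            break
    return clicks
```

Note how setting `gamma = 1` and `sat[r] = 1` for all ranks recovers the cascade model: the user always continues until the first (satisfying) click.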
25. Dynamic Bayesian Network Model (DBN)
In particular:
1. Gamma is the continuation probability for a user who either did not click on a document, or clicked but was not satisfied by it
2. Setting gamma to 1 gives the Simplified DBN Model (SDBN), which admits maximum likelihood estimation and shows good performance
3. If, in addition, the satisfaction probability is set to 1, SDBN reduces to the Cascade Model (CM)
26. Parameter Estimation
1. Maximum likelihood estimation (MLE): RCM, RCTR, DCTR, DCM, SDBN, CM
2. Expectation maximization (EM): UBM, PBM, CCM, DBN
27. Simplified DBN Model (SDBN) -- MLE
In particular:
1. SDBN assumes that a user examines all documents up to the last-clicked one and then abandons the search. In this case, both the attractiveness A and the satisfaction S of SDBN are observed.
2. The attractiveness of a document is, for a given query, the ratio of its click count to its impression count (counting only impressions at or before the last click).
3. The satisfaction of a document is, for a given query, the fraction of its clicks in which it was the last-clicked document of the session.
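The two counting rules above translate directly into code (a minimal sketch for a single query, with my own names; sessions without any click contribute nothing, since under SDBN nothing is then known to have been examined):

```python
from collections import defaultdict

def sdbn_mle(sessions):
    """SDBN MLE for one query. sessions: ranked lists of (doc_id, clicked) pairs.
    Returns (attractiveness, satisfaction) dicts keyed by doc_id."""
    a_num = defaultdict(int); a_den = defaultdict(int)  # clicks / impressions
    s_num = defaultdict(int); s_den = defaultdict(int)  # last clicks / clicks
    for session in sessions:
        click_ranks = [r for r, (_, c) in enumerate(session) if c]
        if not click_ranks:
            continue                      # no click: examined prefix is unknown
        last = click_ranks[-1]
        for r, (doc, c) in enumerate(session[: last + 1]):
            a_den[doc] += 1               # impression at or before the last click
            if c:
                a_num[doc] += 1
                s_den[doc] += 1
                if r == last:
                    s_num[doc] += 1       # this click ended the session
    attr = {d: a_num[d] / a_den[d] for d in a_den}
    sat = {d: s_num[d] / s_den[d] for d in s_den}
    return attr, sat
```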
29. Dynamic Bayesian Network Model (DBN) -- EM
In particular:
1. E-step: given the three parameters, compute the posterior probabilities of A, E, and S; this involves the forward-backward algorithm
2. M-step: given the posterior probabilities, update the three parameters
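A full DBN E-step needs the forward-backward algorithm, but the same alternating pattern can be illustrated on the simpler PBM, where the posteriors have a closed form (a sketch with my own names, not the slides' implementation): clicked documents were certainly examined and attractive; for skipped documents the posteriors follow from Bayes' rule.

```python
from collections import defaultdict

def pbm_em(sessions, n_ranks, n_iters=10):
    """EM for PBM. sessions: ranked (doc_id, clicked) lists of length n_ranks.
    Returns (gamma per rank, alpha per doc_id)."""
    gamma = [0.5] * n_ranks               # examination probability per rank
    alpha = defaultdict(lambda: 0.5)      # attractiveness per document
    for _ in range(n_iters):
        g_num = [0.0] * n_ranks; g_den = [0.0] * n_ranks
        a_num = defaultdict(float); a_den = defaultdict(float)
        for session in sessions:
            for r, (doc, c) in enumerate(session):
                if c:                     # clicked => examined and attractive
                    pe, pa = 1.0, 1.0
                else:                     # E-step: posteriors given no click
                    denom = 1.0 - gamma[r] * alpha[doc]
                    pe = gamma[r] * (1.0 - alpha[doc]) / denom
                    pa = alpha[doc] * (1.0 - gamma[r]) / denom
                g_num[r] += pe; g_den[r] += 1.0
                a_num[doc] += pa; a_den[doc] += 1.0
        gamma = [g_num[r] / g_den[r] for r in range(n_ranks)]   # M-step
        for doc in a_den:
            alpha[doc] = a_num[doc] / a_den[doc]
    return gamma, dict(alpha)
```

For DBN the E-step is more involved only because examination at rank r depends on the whole click history above r, which is exactly what forward-backward marginalizes over.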
30. Results
1. DBN outperforms the other models
2. X-axis = 100 means URLs whose training set contains at least 100 sessions; with more sessions the priors matter less, and the Cascade and DBN models improve
3. Navigational queries have a quality-of-context bias and many sessions, so the position models suffer there
31. Limitations and future research
Limitations:
1. Click models cannot model out-of-order clicks
2. They are completely blind to query reformulations
3. They assume a homogeneous user population
Future research:
1. Learning the structure of a click model from data instead of defining it manually
2. Modeling interactions beyond clicks
32. References
[1] Anne Schuth, Floor Sietsma, Shimon Whiteson, and Maarten de Rijke. "Optimizing Base Rankers Using Clicks: A Case Study Using BM25."
[2] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. "Accurately Interpreting Clickthrough Data as Implicit Feedback."
[3] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. "An Experimental Comparison of Click Position-Bias Models."
[4] Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. "Click Chain Model in Web Search."
[5] Olivier Chapelle and Ya Zhang. "A Dynamic Bayesian Network Click Model for Web Search Ranking."
[6] Kevin Patrick Murphy. "Machine Learning: A Probabilistic Perspective."
[7] Suzan Verberne, Hans van Halteren, and Daphne Theijssen. "Learning to Rank QA Data."
[8] Thorsten Joachims. "Optimizing Search Engines Using Clickthrough Data."
[9] Daxin Jiang, Jian Pei, and Hang Li. "Mining Search and Browse Logs for Web Search: A Survey."