• Converting browsers into buyers
• Improving web site design and usability
• Improving customer retention and loyalty
• Increasing cross-sell by recommending items related to the ones being considered
• Helping visitors to quickly find relevant information on a website
• Making results of information retrieval/search more aware of the context and user interests
A key early step in knowledge discovery in databases (KDD) is to create a suitable target dataset for data mining
Web logs may not be completely reliable
• Caching – files stored at the client are served locally and never requested from the server, so these accesses are not logged
• Information passed through the POST method is not available in a server log
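Content preprocessing consists of converting the text, images, scripts and other files such as multimedia into forms that are useful for the Web usage mining process
Data cleaning removes the following records:
(1) Records for images, video and other resources that the user did not explicitly request, i.e. records whose URI attribute has a suffix such as gif, jpg, jpeg, ico or rm.
(2) Records for page formatting information, i.e. records whose URI attribute has the suffix css.
(3) Records whose Status attribute indicates an access error, i.e. log records whose Status code is less than 200 or greater than 299 (Status codes 200 to 299 generally indicate a successful response).
User session identification:
(1) If the IP addresses differ, the records are treated as belonging to different users.
(2) If the IP address is the same but the browser software or operating system differs, the records are treated as belonging to different users.
(3) If the IP address, browser software and operating system are all the same, the referrer information is used for a further decision. The ReferURI attribute of the record is checked: if the URL recorded in ReferURI has not been visited, the record is treated as a new user session; or, if ReferURI is empty and the interval between this record and the previous record is greater than 10 s, the record is also treated as a new user session.
(4) Each user session obtained with the first three rules may contain several visits made by the user at different times, so a method based on page access time is applied to split the sessions further, yielding the final set of user sessions.
Path completion:
Path completion analyzes the log to fill in information that was not recorded, recovering the user's actual browsing path. This work uses a referrer-based analysis method for path completion.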
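The cleaning rules and user-identification rules (1) to (3) above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the `LogRecord` fields, the function names, and the choice to compare ReferURI against URI as plain strings are all assumptions, and "the previous record" is read here as the previous record with the same IP address and user agent.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Suffixes named in cleaning rules (1) and (2); the source list ends with "etc."
SKIPPED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".ico", ".rm", ".css")


@dataclass(frozen=True)
class LogRecord:  # hypothetical record layout, not a real log format
    ip: str
    user_agent: str   # stands in for browser software + operating system
    timestamp: float  # seconds
    uri: str
    status: int
    referrer: str     # ReferURI; "" when empty


def clean(records):
    """Apply cleaning rules (1)-(3): drop embedded resources and non-2xx responses."""
    kept = []
    for rec in records:
        path = urlsplit(rec.uri).path.lower()
        if path.endswith(SKIPPED_SUFFIXES):
            continue
        if not 200 <= rec.status <= 299:
            continue
        kept.append(rec)
    return kept


def identify_users(records, gap_seconds=10.0):
    """Group cleaned records into user sessions with rules (1)-(3)."""
    groups_by_key = {}  # (ip, user_agent) -> list of sessions
    last_seen = {}      # (ip, user_agent) -> timestamp of the previous record
    for rec in sorted(records, key=lambda r: r.timestamp):
        key = (rec.ip, rec.user_agent)  # rules (1) and (2)
        groups = groups_by_key.setdefault(key, [])
        target = None
        if groups:
            if rec.referrer:
                # Rule (3): attach to the latest session that visited the referrer.
                for group in reversed(groups):
                    if any(prev.uri == rec.referrer for prev in group):
                        target = group
                        break
            elif rec.timestamp - last_seen[key] <= gap_seconds:
                # Empty referrer but within the 10 s gap: same session.
                target = groups[-1]
        if target is None:
            target = []
            groups.append(target)
        target.append(rec)
        last_seen[key] = rec.timestamp
    return [group for groups in groups_by_key.values() for group in groups]
```

Rule (4), the time-based split, would then be applied to each group returned by `identify_users`. Real logs usually store the referrer as a full URL and the request as a path, so a practical version would need to normalize both before comparing them.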
• Ambiguity: the level at which clicks are analyzed (URL A, B, or C as basic identifier) is very shallow, with almost no meaning
– Dynamic URLs: meaningless URLs, even more ambiguity
– Semantic Web Usage Mining: (Oberle et al., 2003)
• Scalability: massive Web log data that cannot fit in main memory requires techniques that are scalable (stream data mining) (Nasraoui et al.: WebKDD 2003, ICDM 2003)
• Handling Evolution: usage data that changes with time
– Mining & validation in dynamic environments: largely unexplored area, except in: (Mitchell et al., 1994; Widmer, 1996; Maloof & Michalski, 2000)
– In the Web usage domain: (Desikan & Srivastava, 2004; Nasraoui et al.: WebKDD 2003, ICDM 2003, KDD 2005, Computer Networks 2006, CIKM 2006)
• From Clicks to Concepts: few efforts exist, based on laborious manual construction of concepts, website ontology or taxonomy
– How to do this automatically? (Berendt et al., 2002; Oberle et al., 2003; Dai & Mobasher, 2002; Eirinaki et al., 2003)
• Implementing recommender systems can be slow, costly and a bottleneck, especially
– for researchers who need to perform tests on a variety of websites
– for website owners that cannot afford expensive or complicated solutions
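1. A time threshold (Time out) separates different user sessions; improved session identification uses a dynamic Time out; statistical features, sliding window
Pattern discovery:
1. Statistical analysis
2. Sequential patterns: Markov models
3. Association rules: Maximal Forward Reference (MFR), Apriori algorithm
4. Classification and clustering: decision trees, classification methods, Bayesian classification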
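Two of the techniques listed above, timeout-based session splitting and maximal forward references, can be sketched as follows. The function names, the `(timestamp, page)` input layout, and the 30-minute default are assumptions made for illustration; the source does not fix a Time out value or describe a dynamic variant in detail.

```python
def split_by_timeout(visits, timeout=1800.0):
    """Split one user's visits into sessions with a fixed time threshold.

    `visits` is a list of (timestamp_seconds, page) tuples sorted by time.
    A new session starts when the gap to the previous visit exceeds `timeout`
    (30 minutes here, an assumed default).
    """
    sessions = []
    for ts, page in visits:
        if not sessions or ts - sessions[-1][-1][0] > timeout:
            sessions.append([])
        sessions[-1].append((ts, page))
    return sessions


def maximal_forward_references(pages):
    """Extract maximal forward references (MFRs) from one session's page sequence.

    A revisit of a page already on the current path counts as a backward move:
    the path built so far is emitted (if the previous move was forward) and
    then truncated back to the revisited page.
    """
    path = []
    moving_forward = False
    result = []
    for page in pages:
        if path and page == path[-1]:
            continue  # a reload of the current page is neither forward nor backward
        if page in path:
            if moving_forward:
                result.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        result.append(list(path))
    return result
```

For example, the traversal A B C D C B E yields the references A B C D and A B E. The resulting paths can then be used as transactions for association-rule mining such as Apriori.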