5. Supplementary Reading 1
The Search: How Google and Its Rivals
Rewrote the Rules of Business and
Transformed Our Culture
by John Battelle
• ISBN-10: 1591840880
• Publisher: Portfolio (September 8, 2005)
Author's blog
http://battellemedia.com/
6. Supplementary Reading 2
Modern Information Retrieval
by Ricardo Baeza-Yates (Universidad
de Chile, Chile) and Berthier Ribeiro-
Neto (Univ Federal de Minas Gerais,
Brazil)
• ISBN-10: 020139829X
• Publisher: Addison-Wesley, 1999
8. What Is a Search Engine
A search engine is a program designed to help find
information stored on a computer system such as the
World Wide Web, inside a corporate or proprietary
network or a personal computer.
--- Wikipedia
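Search engines are an interdisciplinary application spanning information
retrieval, databases, data mining, computer systems, multimedia, artificial
intelligence, computer networking, distributed processing, library science,
natural language processing, and other fields. They are among the most
complex foundational applications on the Internet today.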
22. Vertical Search vs. Web Search
                   Vertical Search               Web Search
Index Size         Smaller and specialized       Global and general
Document Type      Typically more structured     Typically less structured
Relevance          Highly customizable           Fixed algorithm
                   Relevance enhanced by         Popularity-based
                   – Constrained context
                   – Structured data
                   – Domain taxonomy
Comprehensiveness  Focused/deep crawling         Broad/surface crawling
Freshness          Customizable schedules        Fixed schedule
                   From seconds to months        Days on average
Presentation       Structured, navigational      Flat list
                   – Taxonomy drill-down
                   – Sorting & grouping
                   – Clustering & collapsing
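As a rough illustration of the Vertical Search column above, the following sketch filters structured records within a constrained context (a drill-down) and then sorts and groups the hits. The job-listing records, the field names, and the `drill_down` and `group_sorted` helpers are all hypothetical; none of them come from the slides.

```python
from collections import defaultdict

# Hypothetical structured documents for a job-search vertical.
LISTINGS = [
    {"title": "Search Engineer", "city": "Beijing", "category": "Engineering", "salary": 30},
    {"title": "Data Analyst", "city": "Shanghai", "category": "Analytics", "salary": 20},
    {"title": "Crawler Engineer", "city": "Beijing", "category": "Engineering", "salary": 25},
    {"title": "Product Manager", "city": "Beijing", "category": "Product", "salary": 28},
]


def drill_down(records, **constraints):
    """Keep records whose structured fields match every constraint."""
    return [r for r in records if all(r.get(k) == v for k, v in constraints.items())]


def group_sorted(records, group_key, sort_key):
    """Group titles by one field, ordering each group by another field (descending)."""
    groups = defaultdict(list)
    for record in sorted(records, key=lambda r: r[sort_key], reverse=True):
        groups[record[group_key]].append(record["title"])
    return dict(groups)


beijing = drill_down(LISTINGS, city="Beijing")
by_category = group_sorted(beijing, group_key="category", sort_key="salary")
```

A general web search engine typically has only unstructured page text to work with, which is one reason its presentation tends to be a flat list rather than grouped, drillable results.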
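27. How a Web Search Engine Works
[Figure: three-stage pipeline]
1. Crawler: collect a large number of web pages (URL A, URL B, URL C, …)
2. Index Pages: index the pages by keyword (keyword A → URL A, keyword B → URL B, keyword C → URL C, …)
3. Search & Rank: users search for pages by keyword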
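The three stages above can be sketched with a tiny inverted index. This is a minimal illustration, not how any production engine works: the page URLs and texts are made up, tokenization is a plain whitespace split, and the "ranking" is just alphabetical order standing in for a real scoring function.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (the output of stage 1).
PAGES = {
    "http://a.example/": "search engine basics",
    "http://b.example/": "web crawler and search index",
    "http://c.example/": "vertical search engine",
}


def build_index(pages):
    """Stage 2: map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Stage 3: return the URLs that contain every keyword in the query.

    Results are sorted alphabetically as a placeholder for real ranking.
    """
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


INDEX = build_index(PAGES)
RESULTS = search(INDEX, "search engine")
```

Looking up a keyword in the index is a dictionary access, so query time does not depend on rescanning every crawled page. That is the reason indexing is a separate stage from crawling.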
29. Rules for Crawlers (Spiders)
• A Crawler must show identification
– Yahoo! Slurp, Googlebot, Baidu Spider
• A Crawler must obey the robots exclusion
standard
– http://www.robotstxt.org/wc/norobots.html
• A Crawler must not hog resources
• A Crawler must report errors
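The identification and robots-exclusion rules above can be checked with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the crawler name `ExampleBot`, the robots.txt content, and the example.com URLs are assumptions, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical name the crawler identifies itself with

# Hypothetical robots.txt content that a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url):
    """Return True if the rules allow this crawler to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(ROBOTS_TXT)
allowed = may_fetch(parser, "https://example.com/index.html")
blocked = may_fetch(parser, "https://example.com/private/data.html")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests, if stated
```

Honoring the crawl delay between requests is one simple way to respect the "must not hog resources" rule. A real crawler would also send `USER_AGENT` in the `User-Agent` header of each request.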
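44. Challenges Facing Search Engines
Search within Search
How to reduce the cost of users having to search again within the search results, making search truly fast and accurate
Invisible Tabs
Reduce the cost to users of learning about the various vertical search products, and help return comprehensive, relevant information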
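45. The Value of Vertical Search Is Constrained
• The term “Invisible Tabs” was coined by Danny Sullivan, senior editor at
Search Engine Watch, to describe how search engines might try to deliver
results that are closer to what the user actually meant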
“You almost need a search engine
for all our search engines”
Marissa Mayer
VP of Search Products and User
Experience at Google
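47. Challenges Facing Search Engines
Search within Search
How to reduce the cost of users having to search again within the search results, making search truly fast and accurate
Invisible Tabs
Reduce the cost to users of learning about the various vertical search products, and help return comprehensive, relevant information
Deep Web or Invisible Web
Handling the information on the Internet that search engines cannot reach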
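48. Deep Web Overview
Data scale: about 300,000 sites, 450,000 databases, and 1.26 million interfaces; grew 3 to 7 times between 2000 and 2004
Topic diversity: spread across all kinds of subject areas, not only e-commerce
Data structure: mostly structured data
Data depth: 94% can be discovered within the first 3 levels
Search engine coverage:
- The deep web is not entirely uncrawlable; the mainstream search engines cover about 1/3 of the data
- However, because of their inherent limitations, the engines all cover largely the same data
Directory site coverage: very low, only 0.2% to 15.6%
Source: "Accessing the Deep Web", Communications of the ACM, May 2007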
49. Search Engine Coverage of the Deep Web
[Figure: Coverage of Search Engines on Deep Web, shown against the entire deep web (100%); axis marks at 0%, 5%, 37%, 100%]
Google (32%)
Yahoo (32%)
MSN (11%)
All (37%)
Source: "Accessing the Deep Web", Communications of the ACM, May 2007
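A quick arithmetic check on the figures above shows how much the engines overlap. The sketch only uses the four percentages from the slide; the variable names are illustrative, and the bounds are simple set arithmetic rather than anything taken from the cited article.

```python
# Coverage figures as shown on the slide (percent of the entire deep web).
COVERAGE = {"Google": 32, "Yahoo": 32, "MSN": 11}
COMBINED = 37  # "All": coverage of the three engines taken together

# If the engines covered completely different data, coverage would add up.
total_if_disjoint = sum(COVERAGE.values())

# The combined coverage can never be lower than the best single engine.
best_single = max(COVERAGE.values())

# Percentage points that are covered by more than one engine.
overlap_points = total_if_disjoint - COMBINED

# What the other engines add on top of the best single engine.
gain_over_best = COMBINED - best_single
```

With these figures the engines would reach 75% if they did not overlap, yet together they reach only 37%, just 5 points more than the best single engine. This is consistent with the observation that the engines cover largely the same data.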
58. From Search to Information Supply
[Figure: Information Supply Engine]
Inputs: Avail. Info. Supply & Context; Activity Context; User Profile
Output: Matching information, leading to User Action
Feedback: two feedback arrows from User Action
Source: Andrei Broder 2006
64. Mobile Search
Desktop Search ≠ Mobile Search
Mobile search has to account for factors such as phone screen size,
interaction mode (for example, the iPhone touch screen), the mobile
browser, and the user's location
Mobile Web 2009 = Desktop Web 1998
Jakob Nielsen
67. The Future of Search
Unstructured → Structured
Desktop Search → Mobile Search
Solo Search → Universal Search
Relevance → Intelligence
Surface Web → Deep Web
Search → Recommendation