  • 1. Building a search engine with Scrapy + Sphinx. 银平, 2010-06-07
  • 2. Outline
      • Overview
      • Scrapy – a Python crawling framework
      • Sphinx – a C++ full-text search engine
      • Demo – a novel search engine built with Scrapy + Sphinx
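  • 3. Overview - categories of search engines and crawlers
      • Search engines
          o General-purpose search engines
          o Vertical search engines
          o Resource-oriented vertical search engines
      • Crawlers
          o General-purpose crawlers
          o Specialized crawlers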
  • 4. Overview - search engines
      • Word segmentation (tokenization)
      • Inverted index
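The two building blocks named on this slide can be sketched in a few lines of Python. This is only an illustration, not how Sphinx stores its index: the `tokenize`, `build_inverted_index`, and `search_all` names are made up for the example, and the regex tokenizer only handles space-delimited text (Chinese text would need a real word segmenter).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search_all(index, query):
    """Return ids of documents that contain every token of the query."""
    postings = [set(index.get(token, ())) for token in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "Scrapy is a crawling framework",
    2: "Sphinx is a full-text search engine",
    3: "Scrapy and Sphinx build a search engine",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting posting lists, so the cost depends on the lists for the query terms rather than on the size of the whole collection.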
  • 5. Scrapy – a Python crawling framework
      • Architecture
      • Built-in middlewares
      • Extensions
      • Extracting data from web pages
  • 6. Architecture
      • Components
          o Scrapy Engine
          o Scheduler
          o Downloader
          o Spider
          o Item Pipeline
          o Middlewares
      • Event-driven networking: Twisted
  • 7. Architecture
  • 8. Built-in middlewares
      • Downloader middlewares
          o DefaultHeadersMiddleware
          o HttpAuthMiddleware
          o HttpCacheMiddleware
          o RedirectMiddleware
          o RetryMiddleware
      • Spider middlewares
          o DepthMiddleware
          o RefererMiddleware
      • Scheduler middlewares
          o DuplicatesFilterMiddleware
  • 9. Extensions
      • Characteristics
          o Ordinary classes that are loaded when Scrapy starts
          o Listen for various signals (engine_started, item_scraped, item_dropped)
      • Built-in extensions
          o CoreStats
          o WebConsole
          o …
  • 10. Extracting data from web pages
      • CrawlSpider: Rule/Matcher/callback
      • Extraction with XPath
      • Scrapy shell
      • Parsley: a selector language, a superset of XPath and CSS3 (suffers from memory leaks), e.g. li.main>a/@href
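To make the selector on this slide concrete: `li.main>a/@href` corresponds roughly to the XPath `//li[@class='main']/a/@href`. The sketch below uses only Python's standard library rather than Scrapy's own selectors, so it is an illustration under assumptions: the sample markup is invented, it must be well-formed XML, and because `xml.etree.ElementTree` supports only a subset of XPath, the `href` attribute is read in Python instead of with `/@href`.

```python
import xml.etree.ElementTree as ET

# Invented sample page; real HTML is rarely well-formed XML.
PAGE = """
<html><body><ul>
  <li class="main"><a href="/book/1">Book one</a></li>
  <li class="side"><a href="/ads">Ad</a></li>
  <li class="main"><a href="/book/2">Book two</a></li>
</ul></body></html>
"""


def extract_main_links(markup):
    """Return the href of every <a> that is a direct child of <li class="main">."""
    root = ET.fromstring(markup.strip())
    # ElementTree's XPath subset supports attribute predicates on elements.
    anchors = root.findall(".//li[@class='main']/a")
    return [a.get("href") for a in anchors]


links = extract_main_links(PAGE)
```

Note that `@class='main'` is an exact match, whereas the CSS-style `li.main` also matches elements that carry additional classes.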
  • 11. Sphinx – a C++ full-text search engine
      • Sphinx features
      • Sphinx components
      • Indexing
      • Searching
      • SphinxSE: a MySQL storage engine
  • 12. Sphinx features
      • high indexing speed (up to 10 MB/sec on modern CPUs);
      • high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
      • high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
      • provides good relevance ranking through a combination of phrase proximity ranking and statistical (BM25) ranking;
      • provides distributed searching capabilities;
      • provides document excerpts generation;
      • provides searching from within MySQL through a pluggable storage engine;
      • supports boolean, phrase, and word proximity queries;
      • supports multiple full-text fields per document (up to 32 by default);
      • supports multiple additional attributes per document (i.e. groups, timestamps, etc.);
      • supports stopwords;
      • supports both single-byte encodings and UTF-8;
      • supports English stemming, Russian stemming, and Soundex for morphology;
      • supports MySQL natively (MyISAM and InnoDB tables are both supported);
      • supports PostgreSQL natively.
  • 13. Sphinx components
      • indexer (binary)
      • searchd (binary)
      • search (binary)
      • sphinxapi (API libraries for PHP, Python, Perl, Ruby)
      • spelldump
      • indextool
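  • 14. Indexing
      • Data sources: databases, XML, and so on
          o Each table row is treated as one document
          o The configuration specifies which columns are indexed
      • Attributes: some columns can be designated as document attributes; they are not full-text indexed but can be used for filtering and sorting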
  • 15. Indexing (2)
      Fragment of an index configuration:
          sql_query = SELECT id, title, content, author_id, forum_id, post_date FROM my_forum_posts
          sql_attr_uint = author_id
          sql_attr_uint = forum_id
          sql_attr_timestamp = post_date
      Filtering and sorting example:
          // only search posts by author whose ID is 123
          $cl->SetFilter ( "author_id", array ( 123 ) );
          // only search posts in sub-forums 1, 3 and 7
          $cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
          // sort found posts by posting date in descending order
          $cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
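What the filter and sort calls do to a result set can be mimicked on plain Python dictionaries. This is not the sphinxapi client: `set_filter` and `sort_attr_desc` are hypothetical helpers, and the sample posts (including their timestamps) are invented for illustration.

```python
# Invented documents carrying the three attributes from the configuration.
posts = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": 100},
    {"id": 2, "author_id": 123, "forum_id": 2, "post_date": 200},
    {"id": 3, "author_id": 456, "forum_id": 3, "post_date": 300},
    {"id": 4, "author_id": 123, "forum_id": 7, "post_date": 400},
]


def set_filter(docs, attr, values):
    """Keep only documents whose attribute is one of the allowed values."""
    allowed = set(values)
    return [doc for doc in docs if doc[attr] in allowed]


def sort_attr_desc(docs, attr):
    """Order documents by an attribute, largest value first."""
    return sorted(docs, key=lambda doc: doc[attr], reverse=True)


matches = set_filter(posts, "author_id", [123])       # posts by author 123
matches = set_filter(matches, "forum_id", [1, 3, 7])  # in sub-forums 1, 3, 7
matches = sort_attr_desc(matches, "post_date")        # newest first
result_ids = [doc["id"] for doc in matches]
```

Successive filters combine with AND, which is why post 2 (right author, wrong forum) drops out.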
  • 16. Searching – matching modes
      • Matching modes
          o SPH_MATCH_ALL
          o SPH_MATCH_ANY
          o SPH_MATCH_PHRASE
          o SPH_MATCH_BOOLEAN
          o SPH_MATCH_EXTENDED2
      • SPH_MATCH_EXTENDED2 is the most flexible; example queries:
          hello | world
          hello | -world
          @name hello @intro world
          "hello world"
          aaa << bbb << ccc
          "hello world foo"~10
          "the world is a wonderful place"/3
          "hello world" @title "example program"~5 @body python -(php|perl) @* code
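Of the operators above, the quorum operator (`"..."/3`) is perhaps the least obvious: a document matches when at least 3 of the quoted words occur in it. A minimal sketch of that idea follows, assuming naive whitespace tokenization; `quorum_match` is a hypothetical helper and ignores everything else Sphinx does (morphology, stopwords, fields).

```python
def quorum_match(text, words, threshold):
    """Return True when at least `threshold` distinct query words occur in text."""
    tokens = set(text.lower().split())
    present = sum(1 for word in {w.lower() for w in words} if word in tokens)
    return present >= threshold


query_words = "the world is a wonderful place".split()

# "a", "wonderful" and "world" are present: 3 of 6 words, so this matches.
hit = quorum_match("what a wonderful world", query_words, 3)

# Only "world" is present: 1 of 6 words, so this does not match.
miss = quorum_match("hello world", query_words, 3)
```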
  • 17. Searching – sorting modes
      • SPH_SORT_RELEVANCE
      • SPH_SORT_EXTENDED
          @weight DESC, price ASC, @id DESC
      • SPH_SORT_EXPR
          $cl->SetSortMode ( SPH_SORT_EXPR, "@weight + ( user_karma + ln(pageviews) )*0.1" );
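The SPH_SORT_EXPR expression above can be evaluated by hand to see how attributes shift the ranking. The sketch below assumes that matches with a higher expression value come first and that `ln` is the natural logarithm; the match records and their values are invented.

```python
import math


def sort_key(match):
    """Evaluate @weight + (user_karma + ln(pageviews)) * 0.1 for one match."""
    return match["weight"] + (match["user_karma"] + math.log(match["pageviews"])) * 0.1


# Invented matches; pageviews=1 keeps ln(pageviews) at 0 for easy checking.
matches = [
    {"id": 1, "weight": 10, "user_karma": 0, "pageviews": 1},   # key 10.0
    {"id": 2, "weight": 9, "user_karma": 20, "pageviews": 1},   # key 11.0
    {"id": 3, "weight": 10, "user_karma": 5, "pageviews": 1},   # key 10.5
]

ranked = sorted(matches, key=sort_key, reverse=True)
ranked_ids = [match["id"] for match in ranked]
```

Match 2 has the lowest text relevance but the highest karma, so it overtakes the other two.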
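  • 18. Searching – distributed search
      • Partition the data horizontally and index each partition separately
      • Configure a distributed index on the master searchd
      • The master searchd sends the request to each slave searchd, merges the returned results, and returns the final result
      • Every searchd in the cluster can act as the master, which allows load balancing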
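The scatter-and-merge flow described on this slide can be sketched as follows. Everything here is a stand-in: `search_shard` plays the role of a slave searchd over an in-memory partition, the weight is a simple term count rather than Sphinx's ranking, and the calls run sequentially instead of over the network.

```python
import heapq


def search_shard(shard, query):
    """Hypothetical slave: return (weight, doc_id) for documents containing the term."""
    hits = []
    for doc_id, text in shard.items():
        weight = text.lower().split().count(query.lower())
        if weight:
            hits.append((weight, doc_id))
    return hits


def distributed_search(shards, query, limit=10):
    """Hypothetical master: query every shard, merge, keep the top `limit` ids."""
    merged = [hit for shard in shards for hit in search_shard(shard, query)]
    # Ties on weight are broken by the larger doc_id, an artifact of tuple ordering.
    top = heapq.nlargest(limit, merged)
    return [doc_id for _weight, doc_id in top]


# Two horizontal partitions of an invented collection.
shards = [
    {1: "sphinx search", 2: "scrapy crawl"},
    {3: "search search engine", 4: "novel search"},
]
top_two = distributed_search(shards, "search", limit=2)
```

Because each shard only returns its own matches, the master never needs the full index, only the merged candidate list.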
  • 19. Searching – SphinxQL: searching with SQL syntax
      • searchd implements the MySQL network protocol
      • searchd can be used as if it were a MySQL server, connecting with a mysql client
          SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
          SELECT * FROM test1 WHERE MATCH("test doc"/3)
          SELECT * FROM test WHERE MATCH(@title hello @body world) OPTION ranker=bm25, max_matches=3000
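  • 20. SphinxSE: a MySQL storage engine
      • Characteristics
          o Like InnoDB and MyISAM, it has to be compiled into MySQL
          o It stores no data itself; it communicates with searchd to fetch data
      • Advantages
          o Usable from any language, whereas the native API supports only a few languages
          o More efficient when search results need further processing on the MySQL side (JOIN, mysql-like filtering)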
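  • 21. Sphinx vs. Xapian
      • Sphinx
          o searchd provides the search service
          o No need to implement your own indexer or write C++ code; indexing and searching work through configuration alone
          o Distributed search
      • Xapian
          o Similar to Lucene: the API accesses the index files directly to search
          o You have to implement the indexer yourself
          o Highly customizable (Douban switched from Sphinx to Xapian)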
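  • 22. Demo – a search engine built with Scrapy + Sphinx
      A novel search engine, using the crawling, indexing, and searching of Qidian (起点) novels as the example.
      The demo code can be downloaded from GitHub: git clone git://
      • Implement the crawler with Scrapy
      • Implement indexing and searching with Sphinx
      • Implement the search front end
      For details, see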