Building a Search Engine with Scrapy + Sphinx

  1. Building a Search Engine with Scrapy + Sphinx
     银平, pkufranky@gmail.com, 2010-06-07
  2. Outline
     • Overview
     • Scrapy: a Python crawling framework
     • Sphinx: a C++ full-text search engine
     • Demo: a novel search engine built with Scrapy + Sphinx
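  3. Overview: types of search engines and crawlers
     • Search engines
       o General-purpose search engines
       o Vertical search engines
       o Resource-oriented vertical search engines
     • Crawlers
       o General-purpose crawlers
       o Special-purpose (focused) crawlers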
  4. Overview: search engine basics
     • Word segmentation (tokenization)
     • Inverted index
       http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
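
The two ideas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how Sphinx stores its index: the whitespace tokenizer stands in for real word segmentation (Chinese text needs a proper segmenter), and the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real word segmentation)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return IDs of documents containing every query term (posting-list intersection)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(index["july"])
print(search_all(index, "home july"))
```

The index trades build time and space for fast lookups: a query only touches the posting lists of its own terms instead of scanning every document.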
  5. Scrapy: a Python crawling framework
     • Architecture
     • Built-in middlewares
     • Extensions
     • Extracting data from web pages
  6. Architecture
     • Components
       o Scrapy Engine
       o Scheduler
       o Downloader
       o Spider
       o Item Pipeline
       o Middlewares
     • Event-driven networking: Twisted
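
How the components listed above hand work to each other can be shown with a toy, synchronous loop. This is only a sketch of the data flow under simplifying assumptions: it uses no Scrapy code, the "web" is a hypothetical in-memory dictionary, and the real engine is asynchronous (built on Twisted) rather than a plain loop.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b"]),
    "http://example.com/b": ("Page B", []),
}


def download(url):
    """Downloader: fetch a page (here a dictionary lookup instead of HTTP)."""
    return FAKE_WEB[url]


def parse(url, page):
    """Spider: yield one scraped item, then the follow-up URLs to request."""
    title, links = page
    yield {"url": url, "title": title}
    for link in links:
        yield link


def run_engine(start_url):
    """Engine: move requests and items between the other components."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}              # do not schedule the same URL twice
    pipeline = []                   # Item Pipeline: here it only stores items
    while scheduler:
        url = scheduler.popleft()
        for result in parse(url, download(url)):
            if isinstance(result, dict):
                pipeline.append(result)
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return pipeline


items = run_engine("http://example.com/")
print([item["title"] for item in items])
```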
  7. Architecture (diagram slide; the image is not part of this transcript)
  8. Built-in middlewares
     • Downloader middlewares
       o DefaultHeadersMiddleware
       o HttpAuthMiddleware
       o HttpCacheMiddleware
       o RedirectMiddleware
       o RetryMiddleware
     • Spider middlewares
       o DepthMiddleware
       o RefererMiddleware
     • Scheduler middlewares
       o DuplicatesFilterMiddleware
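
The idea behind a middleware chain can be illustrated without Scrapy. The sketch below mimics what a default-headers middleware and a duplicates filter do, but the class names, the `process_request` signature, the `DropRequest` exception, and the dictionary-based request are all assumptions made for this example and are not Scrapy's actual interfaces.

```python
import hashlib


class DropRequest(Exception):
    """Hypothetical signal that a middleware wants the request discarded."""


class DefaultHeaders:
    """Fill in headers that the request does not already set."""

    def __init__(self, defaults):
        self.defaults = defaults

    def process_request(self, request):
        for name, value in self.defaults.items():
            request["headers"].setdefault(name, value)


class DuplicatesFilter:
    """Reject requests whose URL fingerprint has been seen before."""

    def __init__(self):
        self.seen = set()

    def process_request(self, request):
        fingerprint = hashlib.sha1(request["url"].encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            raise DropRequest(request["url"])
        self.seen.add(fingerprint)


def apply_middlewares(middlewares, request):
    """Run the request through each middleware in order; return None if dropped."""
    try:
        for middleware in middlewares:
            middleware.process_request(request)
    except DropRequest:
        return None
    return request


chain = [DuplicatesFilter(), DefaultHeaders({"User-Agent": "sedemo-bot"})]
first = apply_middlewares(chain, {"url": "http://example.com/", "headers": {}})
second = apply_middlewares(chain, {"url": "http://example.com/", "headers": {}})
print(first, second)
```

Each middleware does one small job, so behaviours such as retrying, caching, or depth limiting can be switched on and off independently.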
  9. Extensions
     • Characteristics
       o Ordinary classes that are loaded when Scrapy starts
       o Listen for signals (engine_started, item_scraped, item_dropped)
     • Built-in extensions
       o CoreStats
       o WebConsole
       o …
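
The "ordinary class that listens for signals" pattern looks roughly like the following. The `SignalBus` dispatcher and the `ItemCounter` class are hypothetical stand-ins written for this sketch; only the signal names come from the slide, and Scrapy's own signal API is not used here.

```python
from collections import defaultdict


class SignalBus:
    """Hypothetical minimal signal dispatcher (not Scrapy's implementation)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def connect(self, signal, handler):
        self._handlers[signal].append(handler)

    def send(self, signal, **kwargs):
        for handler in self._handlers[signal]:
            handler(**kwargs)


class ItemCounter:
    """Extension-like class: counts scraped and dropped items."""

    def __init__(self, bus):
        self.stats = {"item_scraped": 0, "item_dropped": 0}
        bus.connect("item_scraped", self.on_scraped)
        bus.connect("item_dropped", self.on_dropped)

    def on_scraped(self, item):
        self.stats["item_scraped"] += 1

    def on_dropped(self, item):
        self.stats["item_dropped"] += 1


bus = SignalBus()
counter = ItemCounter(bus)
bus.send("item_scraped", item={"title": "a"})
bus.send("item_scraped", item={"title": "b"})
bus.send("item_dropped", item={"title": "c"})
print(counter.stats)
```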
  10. Extracting data from web pages
      • CrawlSpider: Rule/Matcher/callback
      • Extraction with XPath
      • Scrapy shell
      • Parsley: a selector language, a superset of XPath and CSS3 (has a memory leak), e.g. li.main>a/@href
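
As a rough illustration of XPath-style extraction, the Parsley selector `li.main>a/@href` can be approximated with the limited XPath subset in Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the sample markup is hypothetical and well-formed (real HTML usually needs a tolerant parser), and `@class='main'` matches the exact attribute value, whereas the CSS class selector would also match elements with several classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body><ul>
  <li class="main"><a href="/book/1">Book One</a></li>
  <li class="side"><a href="/ads">Ad</a></li>
  <li class="main"><a href="/book/2">Book Two</a></li>
</ul></body></html>
"""


def extract_main_links(markup):
    """Return (href, text) for each <a> directly under an <li class="main">."""
    root = ET.fromstring(markup.strip())
    anchors = root.findall(".//li[@class='main']/a")
    return [(a.get("href"), a.text) for a in anchors]


links = extract_main_links(PAGE)
print(links)
```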
  11. Sphinx: a C++ full-text search engine
      • Sphinx features
      • Sphinx components
      • Indexing
      • Searching
      • SphinxSE: a MySQL storage engine
  12. Sphinx features
      • high indexing speed (up to 10 MB/sec on modern CPUs);
      • high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
      • high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
      • provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
      • provides distributed searching capabilities;
      • provides document excerpts generation;
      • provides searching from within MySQL through pluggable storage engine;
      • supports boolean, phrase, and word proximity queries;
      • supports multiple full-text fields per document (up to 32 by default);
      • supports multiple additional attributes per document (i.e. groups, timestamps, etc.);
      • supports stopwords;
      • supports both single-byte encodings and UTF-8;
      • supports English stemming, Russian stemming, and Soundex for morphology;
      • supports MySQL natively (MyISAM and InnoDB tables are both supported);
      • supports PostgreSQL natively.
  13. Sphinx components
      • indexer (binary)
      • searchd (binary)
      • search (binary)
      • sphinxapi (API libraries for PHP, Python, Perl, Ruby)
      • spelldump
      • indextool
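  14. Indexing
      • Data sources: databases, XML, and so on
        o Each table row is treated as one document
        o The configuration specifies which columns are indexed
      • Attributes: some columns can be declared as document attributes; they are not full-text indexed, but can be used for filtering and sorting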
  15. Indexing (2)
      A fragment of the index configuration:
        sql_query = SELECT id, title, content, author_id, forum_id, post_date FROM my_forum_posts
        sql_attr_uint = author_id
        sql_attr_uint = forum_id
        sql_attr_timestamp = post_date
      Filtering and sorting example:
        // only search posts by author whose ID is 123
        $cl->SetFilter ( "author_id", array ( 123 ) );
        // only search posts in sub-forums 1, 3 and 7
        $cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
        // sort found posts by posting date in descending order
        $cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
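
What the filter and sort calls above mean can be modelled over in-memory rows. This is a pure-Python sketch of the semantics only, not the Sphinx client API: the helper names `set_filter` and `sort_attr_desc` and the sample rows are assumptions, and successive filters are combined with AND, as they are when several filters are set on one query.

```python
from datetime import datetime

# Hypothetical rows carrying the attributes declared in the configuration fragment.
posts = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": datetime(2010, 5, 1)},
    {"id": 2, "author_id": 123, "forum_id": 2, "post_date": datetime(2010, 6, 1)},
    {"id": 3, "author_id": 456, "forum_id": 3, "post_date": datetime(2010, 6, 5)},
    {"id": 4, "author_id": 123, "forum_id": 7, "post_date": datetime(2010, 4, 1)},
]


def set_filter(rows, attribute, values):
    """Keep only rows whose attribute value is one of the allowed values."""
    allowed = set(values)
    return [row for row in rows if row[attribute] in allowed]


def sort_attr_desc(rows, attribute):
    """Order rows by one attribute, largest value first."""
    return sorted(rows, key=lambda row: row[attribute], reverse=True)


matched = set_filter(posts, "author_id", [123])       # posts by author 123
matched = set_filter(matched, "forum_id", [1, 3, 7])  # in sub-forums 1, 3 and 7
matched = sort_attr_desc(matched, "post_date")        # newest first
result_ids = [row["id"] for row in matched]
print(result_ids)
```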
  16. Searching: matching modes
      Matching modes
        o SPH_MATCH_ALL
        o SPH_MATCH_ANY
        o SPH_MATCH_PHRASE
        o SPH_MATCH_BOOLEAN
        o SPH_MATCH_EXTENDED2
      The most flexible mode is SPH_MATCH_EXTENDED2. Example queries:
        hello | world
        hello | -world
        @name hello @intro world
        "hello world"
        aaa << bbb << ccc
        "hello world foo"~10
        "the world is a wonderful place"/3
        "hello world" @title "example program"~5 @body python -(php|perl) @* code
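
The difference between the three simplest modes (all words, any word, exact phrase) can be shown on plain token lists. This is a conceptual sketch with assumed function names and whitespace tokenization; it does not reproduce Sphinx's query parser, ranking, or the boolean and extended syntaxes.

```python
def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def match_all(query, doc):
    """Like SPH_MATCH_ALL: every query word must occur in the document."""
    doc_terms = set(tokens(doc))
    return all(term in doc_terms for term in tokens(query))


def match_any(query, doc):
    """Like SPH_MATCH_ANY: at least one query word must occur in the document."""
    doc_terms = set(tokens(doc))
    return any(term in doc_terms for term in tokens(query))


def match_phrase(query, doc):
    """Like SPH_MATCH_PHRASE: the query words must occur adjacently, in order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "hello brave new world"
print(match_all("hello world", doc))
print(match_any("hello php", doc))
print(match_phrase("hello world", doc))
print(match_phrase("new world", doc))
```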
  17. Searching: sorting modes
      • SPH_SORT_RELEVANCE
      • SPH_SORT_EXTENDED
          @weight DESC, price ASC, @id DESC
      • SPH_SORT_EXPR
          $cl->SetSortMode ( SPH_SORT_EXPR, "@weight + ( user_karma + ln(pageviews) )*0.1" );
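
To make the SPH_SORT_EXPR example concrete, the same arithmetic can be evaluated in Python. This is a sketch that assumes results are ordered by the expression value in descending order; the sample matches and the `sort_key` helper are made up for the illustration, with `weight` standing in for `@weight`.

```python
import math


def sort_key(match):
    """Evaluate @weight + ( user_karma + ln(pageviews) ) * 0.1 for one match."""
    return match["weight"] + (match["user_karma"] + math.log(match["pageviews"])) * 0.1


# Hypothetical matches with a full-text weight and two attributes.
matches = [
    {"id": 1, "weight": 10.0, "user_karma": 5, "pageviews": 100},
    {"id": 2, "weight": 10.0, "user_karma": 50, "pageviews": 10},
    {"id": 3, "weight": 12.0, "user_karma": 0, "pageviews": 1},
]

ranked = sorted(matches, key=sort_key, reverse=True)
print([match["id"] for match in ranked])
```

Here the high-karma author outranks a document with a better text weight, which is the point of mixing attributes into the sort expression.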
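  18. Searching: distributed search
      • Partition the data horizontally and index each partition separately
      • Configure a distributed index on the master searchd
      • The master searchd sends the query to each slave searchd, merges their results, and returns the final result
      • Every searchd in the cluster can act as the master, which allows load balancing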
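
The scatter-and-merge step performed by the master can be sketched as follows. This is an in-process illustration under assumptions: each "shard" is a plain list of hypothetical documents, the weight is a simple term count rather than Sphinx's ranking, and no network calls to searchd are made.

```python
import heapq
from itertools import islice


def search_shard(shard, term):
    """Return (weight, doc_id) hits from one shard, best first."""
    hits = [(doc["text"].split().count(term), doc["id"]) for doc in shard]
    return sorted((hit for hit in hits if hit[0] > 0), reverse=True)


def distributed_search(shards, term, limit):
    """Query every shard, then merge the already-sorted result lists."""
    per_shard = [search_shard(shard, term) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return list(islice(merged, limit))


# Hypothetical horizontal partitions of one document collection.
shard_a = [{"id": 1, "text": "sword sword hero"}, {"id": 2, "text": "hero"}]
shard_b = [{"id": 3, "text": "sword"}, {"id": 4, "text": "sword sword sword"}]

top = distributed_search([shard_a, shard_b], "sword", limit=2)
print(top)
```

Because each shard returns its hits already sorted, the master only needs a cheap merge rather than a full re-sort of all results.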
  19. Searching: SphinxQL, searching with SQL syntax
      • searchd implements the MySQL network protocol
      • searchd can therefore be used like a MySQL server and accessed with a MySQL client
        SELECT *, @weight*10+docboost AS skey FROM example ORDER BY skey
        SELECT * FROM test1 WHERE MATCH('"test doc"/3')
        SELECT * FROM test WHERE MATCH('@title hello @body world') OPTION ranker=bm25, max_matches=3000
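  20. SphinxSE: a MySQL storage engine
      Characteristics
      • Like InnoDB and MyISAM, it has to be compiled into MySQL
      • It stores no data itself; it talks to searchd to fetch results
      Advantages
      • Usable from any language, whereas the native API supports only a few languages
      • More efficient when search results need further processing on the MySQL side (JOIN, MySQL-like filtering)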
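  21. Sphinx vs. Xapian
      Sphinx
      • searchd provides the search service
      • No need to implement your own indexer or write C++ code; indexing and searching are set up through configuration alone
      • Distributed search
      Xapian
      • Similar to Lucene: the API accesses the index files directly to search
      • You have to implement the indexer yourself
      • Highly customizable (Douban switched from Sphinx to Xapian)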
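  22. Demo: a search engine built with Scrapy + Sphinx
      The demo crawls, indexes, and searches novels from Qidian (起点) to build a novel search engine.
      The demo code can be downloaded from GitHub:
        git clone git://github.com/pkufranky/sedemo-indexer.git
        git clone git://github.com/pkufranky/sedemo-spider.git
      • The crawler is implemented with Scrapy
      • Indexing and searching are implemented with Sphinx
      • A search front end is implemented
      For details, see http://pkufranky.heroku.com/2010/06/03/scrapysphinx/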
