Building a search engine with Scrapy + Sphinx

Transcript

  • 1. Building a search engine with Scrapy + Sphinx. 银平, pkufranky@gmail.com, 2010-06-07
  • 2. Outline
      • Overview
      • Scrapy – a Python crawling framework
      • Sphinx – a C++ full-text search engine
      • Demo – a novel search engine built with Scrapy + Sphinx
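  • 3. Overview - types of search engines and crawlers
      • Search engines
          o General-purpose search engines
          o Vertical search engines
          o Resource-oriented vertical search engines
      • Crawlers
          o General-purpose crawlers
          o Special-purpose crawlers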
  • 4. Overview - search engines
      • Word segmentation
      • Inverted index: http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
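The two steps named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: `tokenize`, `build_inverted_index`, and `search_all` are hypothetical names, and the regular-expression tokenizer is only a stand-in for real word segmentation (Chinese text in particular needs a proper segmenter).

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "Scrapy is a crawling framework",
    2: "Sphinx is a full-text search engine",
    3: "A search engine needs a crawling step",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting posting lists rather than scanning documents, which is why the index is built ahead of time.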
  • 5. Scrapy – a Python crawling framework
      • Architecture
      • Built-in middlewares
      • Extensions
      • Extracting data from web pages
  • 6. Architecture
      • Components
          o Scrapy Engine
          o Scheduler
          o Downloader
          o Spider
          o Item Pipeline
          o Middlewares
      • Event-driven networking: Twisted
  • 7. Architecture
  • 8. Built-in middlewares
      • Downloader middlewares
          o DefaultHeadersMiddleware
          o HttpAuthMiddleware
          o HttpCacheMiddleware
          o RedirectMiddleware
          o RetryMiddleware
      • Spider middlewares
          o DepthMiddleware
          o RefererMiddleware
      • Scheduler middlewares
          o DuplicatesFilterMiddleware
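To give a feel for what a duplicates filter does, here is a toy sketch in plain Python. It does not use Scrapy and is not how DuplicatesFilterMiddleware is implemented; the `SeenRequests` class and its fingerprint scheme (a SHA-1 of method and URL) are assumptions made for illustration.

```python
import hashlib


class SeenRequests:
    """Remember request fingerprints so repeated requests can be skipped."""

    def __init__(self):
        self._fingerprints = set()

    def fingerprint(self, method, url):
        # Hypothetical fingerprint: hash of the upper-cased method plus the URL.
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def is_duplicate(self, method, url):
        """Return True if this request was seen before, else record it."""
        fp = self.fingerprint(method, url)
        if fp in self._fingerprints:
            return True
        self._fingerprints.add(fp)
        return False
```

A scheduler would call `is_duplicate` before queueing a request and drop the request when it returns True.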
  • 9. Extensions
      • Characteristics
          o Ordinary classes that are loaded when Scrapy starts
          o Listen for signals (engine_started, item_scraped, item_dropped)
      • Built-in extensions
          o CoreStats
          o WebConsole
          o …
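The pattern described here, an ordinary class that subscribes to signals at load time, can be sketched without Scrapy. `SignalBus` and `ItemCounter` below are hypothetical stand-ins for illustration, not Scrapy classes; only the signal names come from the slide.

```python
from collections import defaultdict


class SignalBus:
    """A toy dispatcher: receivers register per signal name."""

    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, signal, receiver):
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # Call every receiver registered for this signal.
        for receiver in self._receivers[signal]:
            receiver(**kwargs)


class ItemCounter:
    """A hypothetical extension that counts scraped and dropped items."""

    def __init__(self, bus):
        self.scraped = 0
        self.dropped = 0
        bus.connect("item_scraped", self.on_scraped)
        bus.connect("item_dropped", self.on_dropped)

    def on_scraped(self, item):
        self.scraped += 1

    def on_dropped(self, item):
        self.dropped += 1


bus = SignalBus()
counter = ItemCounter(bus)
bus.send("engine_started")  # no receivers registered, so nothing happens
bus.send("item_scraped", item={"title": "a"})
bus.send("item_scraped", item={"title": "b"})
bus.send("item_dropped", item={"title": "c"})
```

The extension never touches the crawl loop itself; it only reacts to events, which keeps statistics and monitoring code out of the spiders.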
  • 10. Extracting data from web pages
      • CrawlSpider: Rule/Matcher/callback
      • Extraction with XPath
      • Scrapy shell
      • Parsley: a selector language, a superset of XPath and CSS3 (has memory leaks), e.g. li.main>a/@href
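The selector li.main>a/@href picks the href of every link directly inside a list item with class "main". A rough equivalent using only Python's standard library is sketched below. It assumes well-formed markup, and `xml.etree.ElementTree` supports only a subset of XPath, so the attribute is read in Python rather than selected with `/@href`. The sample page and the `main_links` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment.
PAGE = """
<html><body><ul>
  <li class="main"><a href="/book/1">Book one</a></li>
  <li class="side"><a href="/ads">Ad</a></li>
  <li class="main"><a href="/book/2">Book two</a></li>
</ul></body></html>
"""


def main_links(markup):
    """Return the href of each <a> that is a direct child of <li class="main">."""
    root = ET.fromstring(markup.strip())
    return [a.get("href") for a in root.findall(".//li[@class='main']/a")]
```

Real pages are rarely well-formed XML, which is one reason a crawler relies on an HTML-tolerant selector library instead of a strict XML parser.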
  • 11. Sphinx – a C++ full-text search engine
      • Sphinx features
      • Sphinx components
      • Indexing
      • Searching
      • SphinxSE: a MySQL storage engine
  • 12. Sphinx features
      • high indexing speed (up to 10 MB/sec on modern CPUs);
      • high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
      • high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
      • provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
      • provides distributed searching capabilities;
      • provides document excerpts generation;
      • provides searching from within MySQL through pluggable storage engine;
      • supports boolean, phrase, and word proximity queries;
      • supports multiple full-text fields per document (up to 32 by default);
      • supports multiple additional attributes per document (i.e. groups, timestamps, etc.);
      • supports stopwords;
      • supports both single-byte encodings and UTF-8;
      • supports English stemming, Russian stemming, and Soundex for morphology;
      • supports MySQL natively (MyISAM and InnoDB tables are both supported);
      • supports PostgreSQL natively.
  • 13. Sphinx components
      • indexer (binary)
      • searchd (binary)
      • search (binary)
      • sphinxapi (API libraries for PHP, Python, Perl, Ruby)
      • spelldump
      • indextool
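  • 14. Indexing
      • Data sources: databases, XML, and so on.
          o Each table row is treated as one document
          o The configuration specifies which columns are indexed
      • Attributes: some columns can be designated as document attributes. They are not full-text indexed, but can be used for filtering and sorting.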
  • 15. Indexing (2)
      A fragment of the index configuration:
          sql_query = SELECT id, title, content, author_id, forum_id, post_date FROM my_forum_posts
          sql_attr_uint = author_id
          sql_attr_uint = forum_id
          sql_attr_timestamp = post_date
      Filtering and sorting example:
          // only search posts by author whose ID is 123
          $cl->SetFilter ( "author_id", array ( 123 ) );
          // only search posts in sub-forums 1, 3 and 7
          $cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
          // sort found posts by posting date in descending order
          $cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
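The effect of the two filters and the sort mode can be mimicked on an in-memory list. The sketch below is plain Python, not the Sphinx client API: `set_filter` and `sort_attr_desc` are hypothetical helpers, and the sample posts are invented. It only illustrates that attributes restrict and order the matches without taking part in full-text matching.

```python
from datetime import datetime

# Invented rows; the attribute names follow the sql_attr_* lines on the slide.
posts = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": datetime(2010, 5, 1)},
    {"id": 2, "author_id": 456, "forum_id": 3, "post_date": datetime(2010, 5, 3)},
    {"id": 3, "author_id": 123, "forum_id": 7, "post_date": datetime(2010, 6, 1)},
    {"id": 4, "author_id": 123, "forum_id": 2, "post_date": datetime(2010, 6, 5)},
]


def set_filter(rows, attr, values):
    """Keep only rows whose attribute value is in the allowed set."""
    allowed = set(values)
    return [row for row in rows if row[attr] in allowed]


def sort_attr_desc(rows, attr):
    """Order rows by an attribute, largest (newest) first."""
    return sorted(rows, key=lambda row: row[attr], reverse=True)


# Filters are cumulative: author 123 AND forum in {1, 3, 7}, newest first.
found = set_filter(posts, "author_id", [123])
found = set_filter(found, "forum_id", [1, 3, 7])
found = sort_attr_desc(found, "post_date")
result_ids = [row["id"] for row in found]
```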
  • 16. Searching – matching modes
      Matching modes
          o SPH_MATCH_ALL
          o SPH_MATCH_ANY
          o SPH_MATCH_PHRASE
          o SPH_MATCH_BOOLEAN
          o SPH_MATCH_EXTENDED2
      The most flexible mode is SPH_MATCH_EXTENDED2. Example queries:
          hello | world
          hello | -world
          @name hello @intro world
          "hello world"
          aaa << bbb << ccc
          "hello world foo"~10
          "the world is a wonderful place"/3
          "hello world" @title "example program"~5 @body python -(php|perl) @* code
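Of the operators listed, the quorum form "..."/3 is perhaps the least obvious: a document matches when at least 3 of the quoted words occur in it. A toy version in Python follows. It is a sketch of the idea only; `quorum_match` is a hypothetical function, and real Sphinx matching runs against the index with its own tokenization and morphology.

```python
import re


def quorum_match(text, words, threshold):
    """Return True if at least `threshold` distinct query words occur in text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    hits = sum(1 for word in set(w.lower() for w in words) if word in tokens)
    return hits >= threshold


query_words = "the world is a wonderful place".split()
```

For example, "What a wonderful world" contains three of the six words ("a", "wonderful", "world") and so satisfies a threshold of 3, while "Hello world" contains only one.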
  • 17. Searching – sorting modes
      • SPH_SORT_RELEVANCE
      • SPH_SORT_EXTENDED
          @weight DESC, price ASC, @id DESC
      • SPH_SORT_EXPR
          $cl->SetSortMode ( SPH_SORT_EXPR, "@weight + ( user_karma + ln(pageviews) )*0.1" );
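The SPH_SORT_EXPR expression on this slide blends the relevance weight with two attributes. Evaluated by hand in Python it looks like the sketch below, which assumes that `ln` is the natural logarithm and that higher expression values rank first. The matches and the `sort_key` helper are invented for illustration.

```python
import math


def sort_key(match):
    """Compute @weight + (user_karma + ln(pageviews)) * 0.1 for one match."""
    return match["weight"] + (match["user_karma"] + math.log(match["pageviews"])) * 0.1


# Invented matches with a full-text weight and two numeric attributes.
matches = [
    {"id": 1, "weight": 10, "user_karma": 5, "pageviews": 100},
    {"id": 2, "weight": 10, "user_karma": 50, "pageviews": 1},
    {"id": 3, "weight": 12, "user_karma": 0, "pageviews": 1000},
]

ranked = sorted(matches, key=sort_key, reverse=True)
ranked_ids = [m["id"] for m in ranked]
```

The logarithm damps the page-view count, so a very popular page gains only a modest boost compared with author karma.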
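  • 18. Searching – distributed search
      • Partition the data horizontally and index each partition separately
      • Configure a distributed index on the master searchd
      • The master searchd sends the request to each slave searchd, merges the results they return, and returns the final result
      • Every searchd in the cluster can act as the master, which allows load balancing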
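The scatter-and-merge flow can be sketched as follows. This is a single-process toy under stated assumptions: each shard is a dict of document ID to text, the weight is a simple term count, and the shards are queried one after another rather than in parallel over the network. `search_shard` and `search_distributed` are hypothetical names.

```python
import heapq


def search_shard(shard, term, limit):
    """Return up to `limit` (weight, doc_id) hits from one partition."""
    hits = []
    for doc_id, text in shard.items():
        weight = text.lower().split().count(term.lower())
        if weight:
            hits.append((weight, doc_id))
    return heapq.nlargest(limit, hits)


def search_distributed(shards, term, limit=3):
    """Query every shard, then merge the partial results into one top list."""
    partial = [search_shard(shard, term, limit) for shard in shards]
    merged = heapq.nlargest(limit, (hit for hits in partial for hit in hits))
    return [doc_id for _weight, doc_id in merged]


shards = [
    {1: "sphinx search search", 2: "scrapy crawl"},
    {3: "search engine", 4: "search search search"},
]
```

Each shard only needs to return its own top `limit` hits, because no document outside a shard's local top list can appear in the global top list.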
  • 19. Searching – SphinxQL: searching with SQL syntax
      • searchd implements the MySQL network protocol
      • searchd can be used like a MySQL server and accessed with a mysql client
          SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
          SELECT * FROM test1 WHERE MATCH("test doc"/3)
          SELECT * FROM test WHERE MATCH(@title hello @body world) OPTION ranker=bm25, max_matches=3000
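  • 20. SphinxSE: a MySQL storage engine
      Characteristics
      • Like InnoDB and MyISAM, it must be compiled into MySQL
      • It stores no data itself; it communicates with searchd to fetch data
      Advantages
      • Usable from any language, whereas the native API supports only a few languages
      • More efficient when search results need further processing on the MySQL side (JOIN, MySQL-like filtering)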
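  • 21. Sphinx vs. Xapian
      Sphinx
      • searchd provides the search service
      • No need to implement an indexer or write C++ code; indexing and searching are set up through configuration alone
      • Distributed search
      Xapian
      • Similar to Lucene: the API accesses the index files directly to search
      • You have to implement the indexer yourself
      • Highly customizable (Douban switched from Sphinx to Xapian)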
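  • 22. Demo – a search engine built with Scrapy + Sphinx
      Using the crawling, indexing, and searching of novels from Qidian (起点) as the example, the demo implements a novel search engine.
      The demo code can be downloaded from GitHub:
          git clone git://github.com/pkufranky/sedemo-indexer.git
          git clone git://github.com/pkufranky/sedemo-spider.git
      • Implement the crawler with Scrapy
      • Implement indexing and searching with Sphinx
      • Implement the search front end
      For details see http://pkufranky.heroku.com/2010/06/03/scrapysphinx/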