Crawler pieces

This is a presentation I prepared for Beijing Open Party. It summarizes what I learned while building a crawler system. It surely contains some mistakes, so please don't rely on it for serious purposes.

Published in: Technology
  • Crawler pieces

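    1. Spider implementation: details of a Spider
    2. Our … system … architecture … knowledge; terminology … implementation … simple … problems … monitoring, feedback; key problems; predicting bottlenecks; reading …; Spike
    3. Task scheduling … "knowledge" … relatively challenging
    4. … vs. … (photos: http://www.flickr.com/photos/blueblankut/497571704/sizes/z/in/photostream/ and http://www.flickr.com/photos/coreyburger/2481836757/sizes/z/in/photostream/)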
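    5. Architecture: processing
    6. Architecture: normalization; DUST links; anchors
    7. Architecture: pages; … recognition; links; task priority; page scoring; … key …; error penalties; history; freshness; visits
    8. Architecture: load control; per-IP load; rules and protocols: sitemap, robots.txt; link … detection
    9. Architecture: processing; structure recognition; semantics; … structure recognition; … related links (Link)
    10. Architecture: words; word dictionary: Trie
    11. Architecture: words; … graph; query conversion; words; keywords; … layer; access cache
    12. Complicated… (photo: http://www.flickr.com/photos/regolare/791385521/)
    13. As for the Spider… Architecture: [on/off]line, real-time; priority; [on/off]line real-time processing; … system: Map-Reduce; resources; … resources; feedback; long latency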
    14. Crawler Architecture (diagram). Components: Repository, Downloader, Download Worker, Extractor Worker, Crawler, Linkbase, main loop, TaskLoader, Priority Heap, Scope.txt, Sites (ordered sites and their links), links queue, pages queue. Flow labels: "get a link"; "update link HTTP status"; "if 302 found"; "save page to repository"; "put downloaded page to queue"; "extract links and save"; "main loop will put peek sites' links to queue"; "Site will refill itself when it's empty".
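    15. Key technologies: storage; … system; … system; scheduling; monitoring; … -driven; links; storage structure; task scheduling
    16. Key technologies: DUST pages; Simhash; PageRank scoring; also simple …; … scoring; word dictionary; words
    17. Key technologies: robust HTML; CSS selector; lxml; tidy; CAPTCHA recognition; rules; URL techniques; proxy techniques; also UA spoofing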
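    18. Legend 1 - NoSQL: traditional … storage; scaling bottleneck; trying NoSQL
    19. Legend - NoSQL: traditional … structure; NoSQL; implementing a priority queue: Heap; queues: FIFO
    20. Legend - NoSQL: technology selection; scope; … candidates: HBase, Cassandra
    21. Legend - NoSQL: Cassandra: stability problems; bugs; … random partitioning; … resources; for a crawler …; our crawler depends heavily on … locks; … load; news
    22. Legend - NoSQL: CAP: in principle, HBase -> CP and Cassandra -> AP; in practice, Cassandra … C …; Crawler … link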
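    23. Legend 2 - Google's moves: incremental processing system - Percolator, the system behind the Caffeine index
    24. Legend 2 - Google's moves: BigTable storage; … transaction guarantees: timestamp oracle, lightweight lock; produces Notifications, like a database trigger; … Observers pass Notifications along; … system consumes Notifications; Percolator Worker implementation; … threads, transactions
    25. Legend 2 - Google's moves: assessment of Map-Reduce: latency; dependence …; locality; design
    26. Legend 2 - Google's moves: trade-off; time …; trillion : million; Map-Reduce latency; per-page RPC; MR …; reads; queues; grouping; read-ahead; caching; 10 …; MR; RPC; resources
    27. Legend 2 - Google's moves: Percolator vs. traditional DBMS: DBMS query language; designed for scale; database …; nodes; Percolator … nodes … scheduling … latency …; Percolator … defined as shared-nothing parallel databases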
    28. Thanks
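The architecture slide mentions a "Priority Heap" of ordered sites and their links, and the NoSQL slides mention a heap for the priority queue and FIFO queues. A minimal Python sketch of that combination might look like the following. The class name `Frontier`, its methods, and the rule that a site keeps the priority it was first added with are assumptions made for illustration; the slides do not show the actual implementation, and the "Site will refill itself" step is not covered here.

```python
import heapq
from collections import deque


class Frontier:
    """Priority heap of sites; each site holds a FIFO queue of its links."""

    def __init__(self):
        self._heap = []    # entries: (priority, insertion order, site)
        self._links = {}   # site -> deque of pending URLs
        self._counter = 0  # tie-breaker so equal priorities are served in order

    def add(self, site, url, priority=0):
        """Queue a URL under its site; smaller priority numbers come first.

        Assumption: the priority is only used the first time a site is seen.
        """
        if site not in self._links:
            self._links[site] = deque()
            heapq.heappush(self._heap, (priority, self._counter, site))
            self._counter += 1
        self._links[site].append(url)

    def pop(self):
        """Return (site, url) for the best site, or None when nothing is left."""
        if not self._heap:
            return None
        priority, _, site = heapq.heappop(self._heap)
        queue = self._links[site]
        url = queue.popleft()
        if queue:
            # The site still has links: put it back behind same-priority sites.
            heapq.heappush(self._heap, (priority, self._counter, site))
            self._counter += 1
        else:
            del self._links[site]
        return site, url
```

Calling `add` for several sites and then `pop` repeatedly yields links from the lowest-numbered priority first, and links within one site come out in the order they were added.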

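The key-technologies slide lists Simhash next to DUST pages. As a rough illustration of how a Simhash fingerprint can flag near-duplicate pages, here is a small Python sketch. The whitespace tokenization, the use of MD5 as the per-token hash, and the 64-bit width are assumptions chosen for brevity, not details taken from the slides.

```python
import hashlib


def simhash(text, bits=64):
    """Return a `bits`-bit Simhash fingerprint (bits must be <= 128 here)."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a threshold; the threshold itself has to be tuned on real data.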
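The load-control slide names robots.txt among the rules a crawler has to respect. A minimal sketch of checking those rules with Python's standard `urllib.robotparser` is shown below. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are made up for illustration, and the rules are parsed from a string so that no network access is needed. Per-IP throttling, which the slide also mentions, is not shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
allowed = rules.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = rules.can_fetch("ExampleBot", "http://example.com/private/data.html")
delay = rules.crawl_delay("ExampleBot")  # seconds to wait between requests
```

A downloader could consult `can_fetch` before queueing a link and use `crawl_delay` as the minimum gap between two requests to the same host.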