Crawler Tidbits
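A summary of this half year of "not doing my proper job".

Crawlers are part of many search engines, and their reputation is not great. Compared with a search engine's word segmentation and indexing techniques, a crawler is very basic, seems to offer fewer tricks, and is regarded as boring grunt work. Here I share the wide range of techniques involved in this unremarkable system. Because the material is fragmented, each topic can only be touched on briefly.

Keywords that describe it: crawler, architecture, distributed systems, NoSQL, real-time vs. offline system comparison, Google Caffeine, Percolator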

Published in: Technology
  • Crawler Tidbits

    1. Spider Spider
    2. “ ”Spike
    3. “ ”
    4. vs. http://www.flickr.com/photos/blueblankut/497571704/sizes/z/in/photostream/ http://www.flickr.com/photos/coreyburger/2481836757/sizes/z/in/photostream/
    5. Dust
    6. ip protocol sitemap robots.txt
    7. Link
    8. Trie
    9. QueryCache
    10. …… http://www.flickr.com/photos/regolare/791385521/
    11. Spider…… Map-Reduce
    12. Crawler Architecture (diagram labels): Repository; Downloader; Download Worker; Extractor Worker; save page to repository; if 302 found, get a link; update link HTTP status; put downloaded page to queue; links queue; pages queue; extract links and save; main loop will put the peeked sites' links to queue; Crawler; Linkbase; main loop; Site will refill itself when it's empty; TaskLoader; Priority Heap; Scope.txt; Sites; Ordered Sites and their links
    13. Dust Simhash PageRank
    14. robust html css selector lxml tidy url proxy UA
    15. 1-NoSQL NoSQL
    16. -NoSQL NoSQL Heap FIFO
    17. -NoSQL HBase Cassandra
    18. -NoSQL Cassandra bug Random partitioning Crawler Crawler
    19. -NoSQL CAP HBase -> CP Cassandra -> AP Cassandra CCrawler link
    20. 2-Google incremental processing system - Percolator, a.k.a. Caffeine
    21. 2-Google BigTable timestamp oracle lightweight lock Notification trigger Observer Notification Notification Percolator Worker
    22. 2-Google Map-Reduce Locality
    23. 2-Google Trade-off trillion:million Map-Reduce Page RPC MR 10 MR RPC
    24. 2-Google Percolator DBMS DBMS scale Percolator Percolator shared-nothing parallel databases
    25. Thanks
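
The crawler architecture labelled on slide 12 (a priority heap of sites, a links queue, a pages queue, download and extractor workers, a linkbase and a repository) can be illustrated with a small single-threaded sketch. This is not the author's implementation: the class name `Crawler`, the `FAKE_WEB` dictionary standing in for the network, the regex-based link extraction, and the numeric priorities are all assumptions made so the example runs without any network access.

```python
import heapq
import re
from collections import deque
from urllib.parse import urljoin, urlsplit

# Hypothetical in-memory "web": URL -> (HTTP status, body or redirect target).
FAKE_WEB = {
    "http://a.example/": (200, '<a href="/1">one</a> <a href="http://b.example/">b</a>'),
    "http://a.example/1": (302, "http://a.example/2"),
    "http://a.example/2": (200, "leaf page"),
    "http://b.example/": (200, '<a href="http://a.example/">back</a>'),
}

HREF_RE = re.compile(r'href="([^"]+)"')  # simplistic extractor, for illustration only


class Crawler:
    def __init__(self, seeds, priorities):
        self.linkbase = {}          # URL -> last HTTP status (None = not fetched yet)
        self.repository = {}        # URL -> downloaded page body
        self.site_links = {}        # host -> deque of pending URLs for that site
        self.site_heap = []         # (priority, host); lower number is crawled first
        self.links_queue = deque()  # links waiting for a download worker
        self.pages_queue = deque()  # pages waiting for an extractor worker
        self.priorities = priorities
        for url in seeds:
            self.add_link(url)

    def add_link(self, url):
        """Record a URL in the linkbase once and file it under its site."""
        if url in self.linkbase:
            return
        self.linkbase[url] = None
        host = urlsplit(url).netloc
        pending = self.site_links.setdefault(host, deque())
        if not pending:
            # A site with no pending links is not in the heap, so (re)insert it.
            heapq.heappush(self.site_heap, (self.priorities.get(host, 10), host))
        pending.append(url)

    def schedule(self):
        """Main loop step: take the top site and move one of its links to the links queue."""
        if not self.site_heap:
            return False
        priority, host = heapq.heappop(self.site_heap)
        self.links_queue.append(self.site_links[host].popleft())
        if self.site_links[host]:
            heapq.heappush(self.site_heap, (priority, host))
        return True

    def download(self):
        """Download worker: fetch one link, update its status, save or follow it."""
        url = self.links_queue.popleft()
        status, payload = FAKE_WEB.get(url, (404, ""))
        self.linkbase[url] = status
        if status == 302:
            self.add_link(payload)  # a redirect yields a new link
        elif status == 200:
            self.repository[url] = payload
            self.pages_queue.append((url, payload))

    def extract(self):
        """Extractor worker: pull links out of one page and save them."""
        url, body = self.pages_queue.popleft()
        for href in HREF_RE.findall(body):
            self.add_link(urljoin(url, href))

    def run(self):
        while self.schedule():
            self.download()
            while self.pages_queue:
                self.extract()
        return self.repository


crawler = Crawler(["http://a.example/"], {"a.example": 1, "b.example": 2})
crawler.run()
print(sorted(crawler.repository))
```

In a real deployment the two worker roles would run concurrently and the queues, linkbase and repository would live in external storage. The sketch keeps everything in one process only to make the data flow between the components visible.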
