Crawler Tidbits


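A summary of this half-year of "not attending to my proper work."

Crawlers are a part of many search engines, and they do not have a good reputation. Compared with a search engine's word segmentation and indexing techniques, crawling is very basic, seems to offer fewer tricks, and is regarded as dull grunt work. Here I share the many kinds of technology involved in this unremarkable system. Because the material is fragmented, the talk can only touch on each topic briefly.

A few keywords to describe it: crawler, architecture, distributed systems, NoSQL, real-time vs. offline system comparison, Google Caffeine, Percolator.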


  1. Spider Spider
  2. “ ” Spike
  3. “ ”
  4. vs. http://www.flickr.com/photos/blueblankut/497571704/sizes/z/in/photostream/ http://www.flickr.com/photos/coreyburger/2481836757/sizes/z/in/photostream/
  5. Dust
  6. ip, protocol, sitemap, robots.txt
  7. Link
  8. Trie
  9. QueryCache
  10. …… http://www.flickr.com/photos/regolare/791385521/
  11. Spider …… Map-Reduce
  12. Crawler Architecture (diagram). Components: Repository, Downloader, Extractor, Crawler, Linkbase, TaskLoader, Priority Heap, Scope.txt. Download Worker: save page to repository; if 302 found, get a link; update link HTTP status; put downloaded page to queue. Extractor Worker: extract links and save. Queues: links queue, pages queue. Main loop: will put peek sites' links to queue. Site: will refill itself when it is empty. Sites: ordered Sites and their links.
  13. Dust, Simhash, PageRank
  14. robust, html, css selector, lxml, tidy, url, proxy, UA
  15. 1-NoSQL: NoSQL
  16. -NoSQL: NoSQL, Heap, FIFO
  17. -NoSQL: HBase, Cassandra
  18. -NoSQL: Cassandra, bug, Random partitioning, Crawler, Crawler
  19. -NoSQL: CAP, HBase -> CP, Cassandra -> AP, Cassandra, CCrawler, link
  20. 2-Google: incremental processing system - Percolator, a.k.a. Caffeine
  21. 2-Google: BigTable, timestamp oracle, lightweight lock, Notification, trigger, Observer, Notification, Notification, Percolator Worker
  22. 2-Google: Map-Reduce, Locality
  23. 2-Google: Trade-off, trillion:million, Map-Reduce, Page, RPC, MR, 10, MR, RPC
  24. 2-Google: Percolator, DBMS, DBMS, scale, Percolator, Percolator, shared-nothing parallel databases
  25. Thanks
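
The flow on the Crawler Architecture slide (slide 12) could be sketched roughly as follows. This is a minimal single-threaded illustration, not the author's implementation: the names `Crawler`, `add_link`, `load_tasks`, `download_one`, and `extract_one`, the injected `fetch(url) -> (status, location, body)` callable, and the use of a set of hosts in place of Scope.txt are all assumptions made for the example.

```python
import heapq
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


class Crawler:
    def __init__(self, fetch, scope):
        self.fetch = fetch          # hypothetical: url -> (status, location, body)
        self.scope = set(scope)     # allowed hosts, standing in for Scope.txt
        self.linkbase = {}          # url -> last HTTP status (None = not fetched yet)
        self.site_links = {}        # host -> deque of links waiting to be crawled
        self.site_heap = []         # priority heap of (priority, host)
        self.links_queue = deque()  # links handed to the download worker
        self.pages_queue = deque()  # pages handed to the extractor worker
        self.repository = {}        # url -> page body

    def add_link(self, url, priority=0):
        """Save a link in the linkbase; a site with no pending links re-enters the heap."""
        host = urlsplit(url).netloc
        if host not in self.scope or url in self.linkbase:
            return
        self.linkbase[url] = None
        pending = self.site_links.setdefault(host, deque())
        if not pending:
            heapq.heappush(self.site_heap, (priority, host))
        pending.append(url)

    def load_tasks(self, batch=1):
        """Main-loop step: move links of the top site into the links queue."""
        if not self.site_heap:
            return
        priority, host = heapq.heappop(self.site_heap)
        pending = self.site_links[host]
        for _ in range(min(batch, len(pending))):
            self.links_queue.append(pending.popleft())
        if pending:
            heapq.heappush(self.site_heap, (priority, host))

    def download_one(self):
        """Download worker: fetch, record status, handle 302, queue the page."""
        url = self.links_queue.popleft()
        status, location, body = self.fetch(url)
        self.linkbase[url] = status
        if status == 302 and location:
            self.add_link(urljoin(url, location))
        elif status == 200:
            self.repository[url] = body
            self.pages_queue.append((url, body))

    def extract_one(self):
        """Extractor worker: extract links from a page and save them."""
        url, body = self.pages_queue.popleft()
        parser = LinkParser()
        parser.feed(body)
        for href in parser.hrefs:
            self.add_link(urljoin(url, href))

    def run(self, seeds):
        for seed in seeds:
            self.add_link(seed)
        while self.site_heap or self.links_queue or self.pages_queue:
            self.load_tasks()
            while self.links_queue:
                self.download_one()
            while self.pages_queue:
                self.extract_one()
```

Injecting `fetch` keeps the sketch free of network code, so it can be driven by a fake that returns canned responses. A real crawler along the lines of the slide would presumably run several download and extractor workers concurrently and persist the linkbase and repository, which this sketch does not attempt.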
