Published at the MongoDB Beijing Meetup on May 7th.



  1. Crawlware - Seravia's Deep Web Crawling System
     Presented by 邹志乐 and 敬宓
  2. Agenda
     ● What is Crawlware?
     ● Crawlware Architecture
     ● Job Model
     ● Payload Generation and Scheduling
     ● Rate Control
     ● Auto Deduplication
     ● Crawler Testing with Sinatra
     ● Some problems we encountered & TODOs
  3. What is Crawlware?
     Crawlware is a distributed deep web crawling system that enables scalable and friendly crawls of data that must be retrieved with complex queries.
     ● Distributed: executes across multiple machines
     ● Scalable: scales up by adding extra machines and bandwidth
     ● Efficiency: makes efficient use of various system resources
     ● Extensible: extensible to new data formats, protocols, etc.
     ● Freshness: able to capture data changes
     ● Continuous: continuous crawling without administrator intervention
     ● Generic: each crawling worker can crawl any given site
     ● Parallelization: crawls all websites in parallel
     ● Anti-blocking: precise rate control
  4. A General Crawler Architecture
     From "Introduction to Information Retrieval".
     [Diagram: WWW → Fetch (with DNS) → Parse → Content Seen? (Doc FPs) → URL Filter (robots templates) → Dup URL Elim (URL Set) → URL Frontier → back to Fetch]
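The textbook pipeline in that diagram maps naturally onto a small loop. A minimal single-threaded sketch in Ruby, using an in-memory set for the document-fingerprint and URL-seen tests; the function and parameter names here are illustrative, not Crawlware's actual API:

```ruby
require 'set'
require 'digest'

# Minimal version of the textbook pipeline:
# frontier -> fetch -> parse -> content-seen? -> URL filter -> URL-seen? -> frontier
def crawl(seeds, fetch:, parse:, url_ok: ->(_u) { true }, limit: 100)
  frontier  = seeds.dup          # URL frontier (plain FIFO; real crawlers prioritize)
  doc_fps   = Set.new            # document fingerprints (content dedup)
  seen_urls = Set.new(seeds)     # URL-seen test
  pages     = {}

  while (url = frontier.shift) && pages.size < limit
    body = fetch.call(url) or next
    fp = Digest::MD5.hexdigest(body)
    next if doc_fps.include?(fp)            # content already seen elsewhere
    doc_fps << fp
    pages[url] = body
    parse.call(body).each do |link|         # extract out-links
      next unless url_ok.call(link)         # robots/template filtering goes here
      frontier << link if seen_urls.add?(link)  # duplicate URL elimination
    end
  end
  pages
end
```

With `fetch` and `parse` stubbed over a small link graph, the loop visits each reachable page exactly once.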
  5. Crawlware Architecture - High-Level Working Flows
  6. Crawlware Architecture
  7. Job Model
     ● XML syntax
     ● An assembly of reusable actions
     ● A context shared at runtime
     ● Job & actions customized through properties
     Actions:
     ● HTTP Get, Post, Next Page
     ● Page Extractor, Link Extractor
     ● File Storage
     ● Assignment
     ● Code Snippet
  8. Job Model Sample
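The sample on this slide was an image and did not survive extraction. Going only by the model described on the previous slide (XML syntax, reusable actions, properties), an illustrative job definition might look like the sketch below; every element and attribute name here is a guess, not Crawlware's real schema:

```xml
<job name="example-site">
  <property name="base_url" value="http://example.com/search"/>
  <actions>
    <http-get url="${base_url}?q=${payload}"/>
    <link-extractor xpath="//a[@class='result']/@href"/>
    <next-page xpath="//a[@rel='next']/@href"/>
    <page-extractor template="detail-template"/>
    <file-storage path="/data/${job.name}"/>
  </actions>
</job>
```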
  9. Payload Generation & Scheduling
     Key generators (configured per crawl job):
     ● KeyGeneratorIncrement
     ● KeyGeneratorDecrement
     ● KeyGeneratorFile
     ● KeyGeneratorDecorator
     ● KeyGeneratorDate
     ● KeyGeneratorDateRange
     ● KeyGeneratorCustomRange
     ● KeyGeneratorComposite
     Flow: 1) crawl jobs and config files push payloads into the Job DB; 2) the Scheduler loads payloads and reads the frequency settings; 3) the Scheduler pushes payload blocks (1024 payloads each) for the various crawl jobs onto the Redis Queue.
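The generator-plus-blocks scheme can be sketched in a few lines of Ruby. `KeyGeneratorIncrement` below is a guess at the behavior of the class of that name on the slide (walking an ID range upward, one payload per ID), and the block size mirrors the 1024-payload blocks shown in the diagram:

```ruby
BLOCK_SIZE = 1024  # payloads per payload block, per the slide

# A guess at KeyGeneratorIncrement: enumerates an ID range upward,
# emitting one payload (query key) per ID.
class KeyGeneratorIncrement
  def initialize(from, to)
    @range = (from..to)
  end

  def each_payload(&blk)
    @range.each(&blk)
  end

  # Group payloads into fixed-size blocks, since the scheduler pushes
  # whole payload blocks onto the Redis queue rather than single payloads.
  def blocks
    @range.each_slice(BLOCK_SIZE).to_a
  end
end
```

For a range of 3000 IDs this yields two full 1024-payload blocks plus one partial block of 952.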
 10. Rate Control
     ● Site frequency configuration: a given site's payload amount in a payload block is determined by its crawling frequency
     ● The Scheduler controls the crawling rate of the entire system (N crawler nodes/IPs)
     ● Each Worker Controller controls the crawling rate of a single node/IP
     Flow: Worker Controllers pull payload blocks from the Redis Queue into an in-memory queue, from which the local Workers pull individual payloads.
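At the bottom of this two-level scheme, each node needs some per-site limiter. A minimal sketch of one possible mechanism (a fixed-interval gate driven by a per-site requests-per-minute setting; the class and its interface are our own, and Crawlware's actual mechanism may differ):

```ruby
# Per-site rate limiter for a single worker node: enforces a minimum
# gap between consecutive requests to the same site, derived from the
# site's configured requests-per-minute frequency.
class SiteRateLimiter
  def initialize(requests_per_minute, clock: -> { Time.now.to_f })
    @interval = 60.0 / requests_per_minute
    @clock = clock       # injectable clock, so the limiter is testable
    @last = {}           # site => timestamp of the last request
  end

  # Seconds the caller must still wait before hitting `site`.
  def wait_time(site)
    last = @last[site]
    last ? [@interval - (@clock.call - last), 0.0].max : 0.0
  end

  # Record that a request to `site` was just issued.
  def record!(site)
    @last[site] = @clock.call
  end
end
```

A worker would call `wait_time`, sleep that long, then `record!` before fetching; sites never seen before can be fetched immediately.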
 11. Auto Deduplication
     Flow: a Crawl Job crawls a brief page and extracts links → the links are deduplicated through the Dedup RPC against the Dedup DB → unique links are pushed into the Dedup DB → unique links are pushed into the Job DB through the Job DB RPC.
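Stripped of the RPC plumbing, this flow is a set-membership check between link extraction and the job queue. A sketch with an in-memory `Set` standing in for the Dedup DB and its RPC service:

```ruby
require 'set'

# Filters freshly extracted links against everything already seen.
# `@dedup_db` stands in for Crawlware's Dedup DB behind its RPC layer.
class Deduplicator
  def initialize
    @dedup_db = Set.new
  end

  # Returns only the links never seen before, marking them as seen,
  # so the caller can push exactly these into the Job DB.
  def unique_links(links)
    links.select { |l| @dedup_db.add?(l) }   # add? is nil for duplicates
  end
end
```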
 12. Crawler Testing with Sinatra
     ● What is Sinatra?
     ● Crawler testing:
       ● Simulate various crawling actions via HTTP, such as Get, Post, and NextPage
       ● Simulate job profiles
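A stub site for such tests can be just a few Sinatra routes. The sketch below is illustrative, not the team's actual test harness: the route paths, parameter names, and the `RUN_TEST_SITE` guard are our own, and the pure-Ruby `result_page` helper builds the paginated HTML so the pagination logic can be exercised without starting a server:

```ruby
# Builds a fake search-result page; pages below `total` carry a
# rel='next' link so NextPage crawling can be simulated.
def result_page(page, total: 3)
  links = (1..5).map { |i| "<a class='result' href='/item/#{page}-#{i}'>item</a>" }
  nxt = page < total ? "<a rel='next' href='/search?page=#{page + 1}'>next</a>" : ''
  "<html><body>#{links.join}#{nxt}</body></html>"
end

if ENV['RUN_TEST_SITE']    # start the stub site only when explicitly requested
  require 'sinatra'        # gem install sinatra

  get '/search' do         # simulates Get + NextPage actions
    result_page(params.fetch('page', '1').to_i)
  end

  post '/search' do        # simulates Post-driven queries
    result_page(params.fetch('page', '1').to_i)
  end

  get '/item/:id' do       # detail page for Page Extractor tests
    "<html><body>detail #{params['id']}</body></html>"
  end
end
```

Run with `RUN_TEST_SITE=1 ruby test_site.rb` and point a crawl job at `http://localhost:4567/search`.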
 13. Encountered Problems & TODOs
     ● Changing site load/performance
       ● Monitoring
       ● Dynamic rate switching based on time zone
     ● Page correctness
       ● Page tracker: continuous errors or identical pages
     ● Data freshness
       ● Scheduled updates
       ● Crawl delta for ID or date-range payloads
       ● Recrawl for keyword payloads
     ● Javascript
 14. Thank You
     Please contact us for job opportunities.