
Frontera: a distributed robot for large-scale web crawling / Alexander Sibiryakov (Scrapinghub)

In this talk I am going to share our experience of crawling the Spanish internet. We set ourselves the task of crawling about 600 thousand websites in the .es zone in order to collect statistics about hosts and their sizes. I will cover the robot's architecture, the storage, the problems we ran into during the crawl, and their solutions.

Our solution is available as the open source framework Frontera. The framework makes it possible to build a distributed robot for downloading pages from the Internet at large scale in real time. It can also be used to build focused robots that fetch a subset of websites known in advance.

The framework offers: configurable storage for document URLs (RDBMS or key-value), crawl strategy management, a transport layer abstraction, and a fetcher module abstraction.
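As a taste of the storage abstraction, here is a hedged sketch of selecting a backend through Frontera settings; the module paths and values are illustrative and may differ between versions:

    # Hedged sketch: choosing the URL storage backend in a Frontera settings
    # module. Paths and values are illustrative; check the docs for your version.
    BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'      # RDBMS storage
    # BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend' # key-value storage
    MAX_NEXT_REQUESTS = 256  # size of the URL batch handed to the fetcher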

The talk is structured in an engaging way: a description of the problem, the solution, and the problems that arose while developing it.


  1. Frontera: open source, large scale web crawling framework. Alexander Sibiryakov, Scrapinghub Ltd., sibiryakov@scrapinghub.com
  2–6. Hello, participants!
    • Software Engineer @ Scrapinghub
    • Born in Yekaterinburg, RU
    • 5 years at Yandex: social & QA search, snippets
    • 2 years at Avast! antivirus: false positives, malicious downloads
  7–9. We help turn web content into useful data
    • Over 2 billion requests per month (~800/sec.)
    • Focused crawls & broad crawls

    Sample of extracted data shown on the slide:
    {
      "content": [
        {
          "title": {
            "text": "'Extreme poverty' to fall below 10% of world population for first time",
            "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
          },
          "points": "9 points",
          "time_ago": {
            "text": "2 hours ago",
            "href": "https://news.ycombinator.com/item?id=10352189"
          },
          "username": {
            "text": "hliyan",
            "href": "https://news.ycombinator.com/user?id=hliyan"
          }
        },
  10–17. Broad crawl usages
    • News analysis
    • Topical crawling
    • Plagiarism detection
    • Sentiment analysis (popularity, likability)
    • Due diligence (profile/business data)
    • Lead generation (extracting contact information)
    • Tracking criminal activity & finding lost persons (DARPA)
  18–24. Saatchi Global Gallery Guide
    • www.globalgalleryguide.com
    • Discover 11K online galleries.
    • Extract general information, art samples, descriptions.
    • NLP-based extraction.
    • Find more galleries on the web.
  25–34. Task
    • Spanish web: hosts and their size statistics.
    • Only the .es ccTLD.
    • Breadth-first strategy: first the 1-click environ, then 2, 3, … (a minimal sketch follows this list).
    • Finishing condition: at most 100 docs per host, all hosts.
    • Low costs.
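A minimal sketch of this crawl order, not Frontera's distributed implementation: breadth-first traversal restricted to .es with the per-host finishing condition. `fetch_links` is a hypothetical caller-supplied function that downloads a page and returns its outgoing links.

    from collections import Counter, deque
    from urllib.parse import urlparse

    def bfs_crawl(seeds, fetch_links, max_docs_per_host=100):
        queue, seen, per_host = deque(seeds), set(seeds), Counter()
        while queue:
            url = queue.popleft()                    # FIFO queue = breadth-first
            host = urlparse(url).hostname or ""
            if per_host[host] >= max_docs_per_host:
                continue                             # per-host finishing condition
            per_host[host] += 1
            for link in fetch_links(url):            # the 1-click environ of url
                link_host = urlparse(link).hostname or ""
                if link not in seen and link_host.endswith(".es"):
                    seen.add(link)
                    queue.append(link)
        return per_host                              # hosts and their sizes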
  35. Spanish, Russian and world Web, 2012

                                Domains    Web servers    Hosts    DMOZ*
      Spanish (.es)             1.5M       280K           4.2M     122K
      Russian (.ru, .рф, .su)   4.8M       2.6M           ?        105K
      World                     233M       62M            890M     1.7

      Sources: OECD Communications Outlook 2013, statdom.ru
      * current period (October 2015)
  36–40. Solution
    • Scrapy (based on Twisted) - async network operations.
    • Apache Kafka - data bus (offsets, partitioning).
    • Apache HBase - storage (random access, linear scanning, scalability).
    • Snappy - efficient compression algorithm for IO-bound applications.
  41. Architecture [diagram: spiders, crawling strategy workers (SW) and storage/DB workers connected through Kafka topics]
  42–46. 1. Big and small hosts problem
    • The queue gets flooded with URLs from the same host,
    • → underuse of spider resources.
    • Fix: an additional per-host (per-IP) queue and a metering algorithm (see the sketch below).
    • URLs from big hosts are cached in memory.
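A hedged sketch of the metering idea, under the assumption that "metering" means round-robin over per-host queues so one big host cannot monopolize a spider's download slots:

    from collections import defaultdict, deque
    from itertools import cycle

    class PerHostQueue:
        def __init__(self):
            self.queues = defaultdict(deque)      # host -> its own URL queue

        def push(self, host, url):
            self.queues[host].append(url)

        def next_batch(self, n):
            batch = []
            hosts = [h for h, q in self.queues.items() if q]
            for host in cycle(hosts):             # meter fairly across hosts
                if len(batch) == n or not any(self.queues[h] for h in hosts):
                    break
                if self.queues[host]:
                    batch.append(self.queues[host].popleft())
            return batch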
  47–56. 2. DDoS of the Amazon AWS DNS service
    Breadth-first strategy → first visits to unknown hosts → a huge amount of DNS requests.
    Fix: a recursive DNS server
    • on every spider node,
    • with upstream to Verizon & OpenDNS.
    We used dnsmasq.
  57–63. 3. Tuning the Scrapy thread pool for efficient DNS resolution
    • Scrapy uses the OS DNS resolver,
    • with blocking calls,
    • delegated to a thread pool that resolves DNS names to IPs.
    • Result: numerous errors and timeouts 🆘
    • Fix: a patch allowing adjustment of the thread pool size and timeout (see the settings sketch below).
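In recent Scrapy versions the same knobs are exposed as regular settings; at the time of the talk (Scrapy 0.24) a patch was required. A hedged sketch with illustrative values, not the ones used in the talk:

    # settings.py: enlarge the thread pool that performs blocking DNS
    # lookups and shorten the resolution timeout.
    REACTOR_THREADPOOL_MAXSIZE = 30   # threads for blocking getaddrinfo() calls
    DNS_TIMEOUT = 20                  # seconds before a DNS lookup fails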
  64–72. 4. Overloaded HBase region servers during the state check
    (3 TB of metadata: URLs, timestamps, …; 275 bytes/doc)
    • ~10^3 links per doc,
    • state check: CRAWLED / NOT CRAWLED / ERROR,
    • HDDs.
    • Small volume 🆗
    • With ⬆ table size, response times ⬆ 🆘
    • Disk queue ⬆
    • Fix: a host-local fingerprint function for keys in HBase (sketched below),
    • and tuning the HBase block cache to fit average host states into one block.
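A hedged sketch of a host-local key: URLs from the same host share a key prefix, so their states land close together in HBase and a host's state check touches few blocks. The exact byte layout is an assumption, not Frontera's verbatim code.

    import hashlib
    from urllib.parse import urlparse
    from zlib import crc32

    def host_local_fingerprint(url: str) -> bytes:
        host = urlparse(url).hostname or ""
        prefix = crc32(host.encode()).to_bytes(4, "big")  # groups keys by host
        digest = hashlib.sha1(url.encode()).digest()      # disambiguates URLs
        return prefix + digest

    # URLs of one host now sort next to each other: fingerprints of
    # http://example.es/a and http://example.es/b share the 4 leading bytes.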
  73–76. 5. Intensive network traffic from workers to services
    • Throughput between workers and Kafka/HBase: ~1 Gbit/s.
    • Fixes: the Thrift compact protocol for HBase,
    • and message compression in Kafka with Snappy (see the sketch below).
  77–81. 6. Further query and traffic optimizations to HBase
    • The state check costs lots of requests and network traffic.
    • Consistency is required.
    • Fix: a local state cache in the strategy worker.
    • For consistency, the spider log was partitioned by host (see the sketch below).
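A minimal sketch of the partitioning idea, assuming a simple hash partitioner: all spider-log events about one host reach the same strategy worker, so its local state cache never races with another worker.

    from zlib import crc32

    def partition_for(host: str, num_partitions: int) -> int:
        # Deterministic host -> partition mapping for the spider log.
        return crc32(host.encode()) % num_partitions

    # Every event about example.es lands on the same partition/worker:
    assert partition_for("example.es", 3) == partition_for("example.es", 3)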
  82–87. State cache
    • All ops are batched:
      – if a key is not in the cache → read HBase,
      – every ~4K docs → flush.
    • Close to 3M (~1 GB) elements → flush & cleanup.
    • Least-Recently-Used (LRU) eviction 👍
    A hedged sketch of the idea follows.
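The sketch below assumes a `backend` object with batched `read(keys)` and `write(dict)` methods standing in for HBase; class and method names are illustrative, not Frontera's API.

    from collections import OrderedDict

    class StateCache:
        def __init__(self, backend, max_items=3_000_000, flush_every=4000):
            self.backend = backend         # object with read(keys) / write(dict)
            self.max_items = max_items
            self.flush_every = flush_every
            self.cache = OrderedDict()     # key -> state, kept in LRU order
            self.dirty = {}                # states not yet written to HBase

        def get_many(self, keys):
            missing = [k for k in keys if k not in self.cache]
            if missing:
                self.cache.update(self.backend.read(missing))  # one batched read
            for k in keys:
                if k in self.cache:
                    self.cache.move_to_end(k)   # mark as recently used
            return {k: self.cache.get(k) for k in keys}

        def set(self, key, state):
            self.cache[key] = state
            self.cache.move_to_end(key)
            self.dirty[key] = state
            if len(self.dirty) >= self.flush_every:  # every ~4K docs -> flush
                self.flush()

        def flush(self):
            self.backend.write(self.dirty)           # one batched write
            self.dirty.clear()
            while len(self.cache) > self.max_items:  # ~3M elements -> cleanup
                self.cache.popitem(last=False)       # evict least recently used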
  88–93. Spider priority queue (slot)
    • Cell: array of
      – fingerprint,
      – Crc32(hostname),
      – URL,
      – score.
    • Dequeueing the top N (sketched below).
    • Prone to huge hosts.
    • Scoring model: document count per host.
  94–99. 7. The problem of big and small hosts (strikes back!)
    • Discovered a few very huge hosts (>20M docs).
    • All queue partitions were flooded with huge hosts.
    • Fix: two MapReduce jobs:
      – queue shuffling,
      – limiting all hosts to 100 docs max.
  100–107. Hardware requirements
    • A single-threaded Scrapy spider → 1200 pages/min. from ~100 websites in parallel.
    • The spiders-to-workers ratio is 4:1 (without content).
    • 1 GB of RAM for every SW (state cache, tunable).
    • Example:
      – 12 spiders ~ 14.4K pages/min.,
      – 3 SW and 3 DB workers,
      – 18 cores in total.
  108–113. Software requirements
    • Apache HBase,
    • Apache Kafka,
    • Python 2.7+,
    • Scrapy 0.24+,
    • a DNS service.
    Hadoop distribution: CDH (a 100% open source Hadoop package).
  114–119. Maintaining Cloudera Hadoop on Amazon EC2
    • CDH is very sensitive to free space on the root partition, where parcels and Cloudera Manager storage live.
    • We moved them to a separate EBS partition using symbolic links.
    • The EBS volume should be at least 30 GB; baseline IOPS should be enough.
    • Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB, 2x40 GB SSD).
    • After one week of crawling we ran out of space and started to move DataNodes to d2.xlarge (4 CPU, 30.5 GB, 3x2 TB HDD).
  120–125. Spanish (.es) internet crawl results
    • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
    • 68.7K domains found (~600K expected),
    • 46.5M pages crawled overall,
    • 1.5 months,
    • 22 websites with more than 50M pages.
  126. Where are the rest of the web servers?!
  127. Bow-tie model. A. Broder et al., Computer Networks 33 (2000) 309–320.
  128. Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005.
  129. 12-year dynamics: Graph Structure in the Web - Revisited, Meusel, Vigna et al., WWW 2014.
  130–134. Main features
    • Online operation: scheduling of new batches, updating of DB state.
    • Storage abstraction: write your own backend (SQLAlchemy and HBase are included).
    • Canonical URL resolution abstraction: each document has many URLs; which one to use?
    • Scrapy ecosystem: good documentation, big community, easy customization.
  135–139. Distributed Frontera features
    • The communication layer is Apache Kafka: topic partitioning, offsets mechanism.
    • Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module (see the hedged sketch below).
    • Polite by design: each website is downloaded by at most one spider.
    • Python: workers, spiders.
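A hedged sketch of what such a strategy module might look like; the class and method names are illustrative, not the exact distributed-frontera API.

    class BreadthFirstStrategy:
        def __init__(self, max_docs_per_host=100):
            self.max_docs_per_host = max_docs_per_host
            self.crawled_per_host = {}

        def score(self, url, depth, host):
            # Crawling goal and URL ordering live here: prefer shallow pages,
            # stop scheduling a host once its document cap is reached.
            if self.crawled_per_host.get(host, 0) >= self.max_docs_per_host:
                return None              # drop: finishing condition reached
            return 1.0 / (depth + 1)     # breadth-first: shallower is better

        def page_crawled(self, host):
            self.crawled_per_host[host] = self.crawled_per_host.get(host, 0) + 1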
  140. References
    • Frontera: https://github.com/scrapinghub/frontera
    • Distributed extension: https://github.com/scrapinghub/distributed-frontera
    • Documentation:
      – http://frontera.readthedocs.org/
      – http://distributed-frontera.readthedocs.org/
    • Google group: Frontera (https://goo.gl/ak9546)
  141–148. Future plans
    • A lighter version without HBase and Kafka, communicating over sockets.
    • A revisiting strategy out of the box.
    • A watchdog solution: tracking website content changes.
    • A PageRank or HITS strategy.
    • Our own HTML and URL parsers.
    • Integration into Scrapinghub services.
    • Testing on larger volumes.
  149–153. Run your business using Frontera
    • SCALABLE
    • OPEN
    • CUSTOMIZABLE
    Made in Scrapinghub (the authors of Scrapy).
  154–157. YOUR code could be here!
    • A web-scale crawler,
    • historically the first attempt in Python,
    • a truly resource-intensive task: CPU, network, disks.
  158. We're hiring! http://scrapinghub.com/jobs/
  159. Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com
