«Scrapy internals», Alexander Sibiryakov, Scrapinghub

Talk given at PyCon Russia 2017

Published in: Internet

  1. 1. Scrapy internals. Alexander Sibiryakov (sibiryakov@scrapinghub.com), PyConRU 2017, 16-17 July 2017
  5. 5. Talk scope
        • Design of a complex asynchronous application
        • Flow-control issues
        • Open-source life
  11. 11. Scrapy: web scraping
        • Extraction of structured data
        • Selecting and extracting data from HTML/XML (CSS, XPath, regexps) → Parsel
        • Interactive shell
        • Feed exports in JSON, CSV, XML, stored on FTP, S3 or the local filesystem
        • Robust encoding support and auto-detection
  18. 18. Main features
        • Extensible: spiders, signals, middlewares, extensions, and pipelines
        • Cookies
        • Robots.txt
        • Form submission
        • Telnet console
        • Graceful shutdown by signal
  19. 19. Scrapy architecture
  26. 26. Twisted
        • Event-driven network programming framework
        • Event loop and Deferreds ("promises")
        • Protocols and transports:
            • TCP, UDP, SSL, UNIX sockets
            • HTTP, DNS, SMTP/IMAP, IRC
        • Cross-platform
  29. 29. Glyph Lefkowitz, creator of Twisted
  30. 30. self._nameResolver = _SimpleResolverComplexifier(resolver) (from the Twisted source code)
  34. 34. Twisted event loop
        • events: [e1: Event, e2: Event, … eN]
        • Event: func, args, desired_time
        • min: O(1), since the earliest event sits at the top of a binary heap
        • Sources: https://stackoverflow.com/questions/35111265/how-does-pythons-twisted-reactor-work and https://www.cs.cmu.edu/~adamchik/15-121/lectures/Binary%20Heaps/heaps.html
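The event list above (func, args, desired_time, with the minimum available in O(1)) can be sketched with the standard-library heapq module. This is a toy model under stated assumptions, not Twisted's reactor code: the names ToyLoop, call_later, next_deadline and run are hypothetical, and a virtual clock replaces real sleeping.

```python
import heapq
import itertools


class ToyLoop:
    """Toy timer queue: a binary heap ordered by desired_time (not Twisted's code)."""

    def __init__(self):
        self.now = 0.0                  # virtual clock, advanced by run()
        self._events = []               # heap of (desired_time, seq, func, args)
        self._seq = itertools.count()   # tie-breaker: equal times keep insertion order

    def call_later(self, delay, func, *args):
        # Inserting into the heap costs O(log n).
        heapq.heappush(self._events, (self.now + delay, next(self._seq), func, args))

    def next_deadline(self):
        # Peeking at the earliest event is O(1): it is always at index 0.
        return self._events[0][0] if self._events else None

    def run(self):
        # Pop events in time order; a callback may schedule further events.
        while self._events:
            desired_time, _, func, args = heapq.heappop(self._events)
            self.now = desired_time     # jump the virtual clock instead of sleeping
            func(*args)


loop = ToyLoop()
order = []
loop.call_later(0.5, order.append, "b")
loop.call_later(0.1, order.append, "a")
loop.call_later(0.1, loop.call_later, 1.0, order.append, "c")
first_deadline = loop.next_deadline()
loop.run()
print(order)
```

A real loop would use next_deadline() as the timeout for its select/poll call, so that it wakes up either for IO or for the next timed event.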
  43. 43. x86 time sources
        • Real Time Clock: absolute time, 1 s precision
        • 8254 chip (programmable interval timer), used previously
        • HPET (High Precision Event Timer), at least 10 MHz
            • single counter for periodic mode
            • many for one-shot mode
            • compares actual timer value and target
        • RDTSC/RDTSCP: CPU clock cycles
        • Proprietary timers
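From Python these hardware sources are only visible indirectly, through the clocks the operating system exposes. A small sketch using the standard time.get_clock_info call shows what each clock promises on the current machine; which hardware timer backs a given clock is chosen by the OS and is not something this snippet reveals.

```python
import time

# Collect the properties of the clocks Python exposes on this machine.
clock_report = {}
for name in ("time", "monotonic", "perf_counter"):
    info = time.get_clock_info(name)
    clock_report[name] = (info.implementation, info.resolution, info.monotonic)

for name, (implementation, resolution, monotonic) in clock_report.items():
    print(f"{name:13} {implementation:40} resolution={resolution:.1e} monotonic={monotonic}")
```

An event loop that orders events by desired_time would normally want a monotonic clock, so that wall-clock adjustments cannot reorder or delay scheduled calls.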
  50. 50. Twisted.Deferred
        • callback
        • errback
        • addCallback, addErrback
        • cancel
        • addTimeout
        • pause/unpause
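The callback/errback part of this list can be illustrated with a deliberately simplified stand-in. ToyDeferred below is a hypothetical class written for this example, not Twisted's implementation: it uses plain exceptions instead of Failure objects and omits cancel, addTimeout and pause/unpause. Its snake_case methods mirror addCallback, addErrback, callback and errback.

```python
def _passthrough(value):
    # Used for the "other" slot of a pair: hand the value on unchanged.
    return value


class ToyDeferred:
    """Simplified model of a Deferred: a chain of (callback, errback) pairs."""

    def __init__(self):
        self._chain = []
        self._fired = False
        self.result = None

    def add_callbacks(self, callback, errback):
        self._chain.append((callback, errback))
        if self._fired:              # already has a result: run the new pair now
            self._run()
        return self

    def add_callback(self, fn):
        return self.add_callbacks(fn, _passthrough)

    def add_errback(self, fn):
        return self.add_callbacks(_passthrough, fn)

    def callback(self, result):
        self._fire(result)

    def errback(self, exc):
        self._fire(exc)

    def _fire(self, result):
        if self._fired:
            raise RuntimeError("already fired")
        self._fired = True
        self.result = result
        self._run()

    def _run(self):
        while self._chain:
            callback, errback = self._chain.pop(0)
            # An exception as the current result selects the errback side.
            handler = errback if isinstance(self.result, Exception) else callback
            try:
                self.result = handler(self.result)
            except Exception as exc:
                self.result = exc    # switch to the error path


def build_chain():
    d = ToyDeferred()
    d.add_callback(int).add_callback(lambda n: n * 2).add_errback(lambda exc: -1)
    return d


ok = build_chain()
ok.callback("21")
failed = build_chain()
failed.callback("not a number")
print(ok.result, failed.result)
```

In the second chain int() raises, the doubling callback is skipped, and the errback converts the error back into an ordinary result.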
  51. 51. Internal components intercommunication
  52. 52. Web agent pipeline
  53. 53. Downloader Slots:
  54. 54. PROBLEMS
  60. 60. Throttling between internal components
        • Downloader
        • Scraper
        • Item pipelines (cleansing, validation, deduplication, storage, ...)
        • Feed exports (serialization + disk/network IO)
        • ?
  64. 64. Flow control: memory
        • Unlimited downloading leads to unbounded growth of items from cascading feed pages.
        • Maintain a limit on the amount of memory used by Responses in the queue (~5 MB).
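A minimal sketch of such a memory limit, assuming the budget is tracked as the sum of response body sizes. The class name ResponseBudget, its methods and the 1 KB per-response floor are assumptions made for illustration; only the ~5 MB figure comes from the slide.

```python
from collections import deque

MIN_RESPONSE_SIZE = 1024  # assumed floor, so that empty bodies still count


class ResponseBudget:
    """Tracks the bytes held by queued responses (hypothetical helper)."""

    def __init__(self, max_active_size=5_000_000):  # ~5 MB, as on the slide
        self.max_active_size = max_active_size
        self.active_size = 0
        self.queue = deque()

    def add(self, body: bytes):
        # A downloaded response enters the queue and is charged to the budget.
        self.queue.append(body)
        self.active_size += max(len(body), MIN_RESPONSE_SIZE)

    def finish(self):
        # A response has been fully processed: release its share of the budget.
        body = self.queue.popleft()
        self.active_size -= max(len(body), MIN_RESPONSE_SIZE)
        return body

    def needs_backout(self):
        # True means: stop feeding new downloads until the queue drains.
        return self.active_size > self.max_active_size


budget = ResponseBudget()
states = []
for _ in range(3):
    budget.add(b"x" * 2_000_000)
    states.append(budget.needs_backout())
budget.finish()
states.append(budget.needs_backout())
print(states)
```

The limit is soft: the response that crosses the threshold is still accepted, and only further intake is paused.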
  65. 65. Flow control: CPU
        • Callbacks -> CPU: the process spends more time on CPU than on IO
        • > reactor.callLater(0.1, d.errback, _failure)
        • An artificial delay of 100 ms
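The same idea, delivering a failure slightly later so the loop can service sockets first, can be tried without Twisted. The sketch below swaps in asyncio's loop.call_later for reactor.callLater and a Future for the Deferred; fetch_failing and the ConnectionError are made up for the example.

```python
import asyncio


async def fetch_failing(loop):
    """Fail a request, but deliver the failure 100 ms later (hypothetical helper)."""
    fut = loop.create_future()
    error = ConnectionError("download failed")
    # Instead of calling fut.set_exception(error) right away, give the loop
    # 100 ms to handle pending IO before the error-handling callbacks run.
    loop.call_later(0.1, fut.set_exception, error)
    return await fut


async def main():
    loop = asyncio.get_running_loop()
    started = loop.time()
    try:
        await fetch_failing(loop)
    except ConnectionError as exc:
        return str(exc), loop.time() - started


message, elapsed = asyncio.run(main())
print(message, round(elapsed, 2))
```

The trade-off is latency on the error path in exchange for keeping fast-failing requests from monopolising the loop with callback work.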
  71. 71. Summarizing
        • Limits on concurrent items
        • Limits on memory consumption
        • Scheduling of new calls with delays
        • If a limit is reached, don't pick up a new request from the scheduler
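The last rule amounts to pull-based back-pressure: the engine asks the scheduler for work only while it is under its limits. A minimal sketch with a single concurrency limit follows; pull_requests, the deque standing in for the scheduler and the in_flight set are hypothetical names, not Scrapy's API.

```python
from collections import deque


def pull_requests(scheduler, in_flight, max_concurrent):
    """Move requests from the scheduler into in_flight until the limit is hit."""
    started = []
    while scheduler and len(in_flight) < max_concurrent:
        request = scheduler.popleft()
        in_flight.add(request)
        started.append(request)
    return started


scheduler = deque(["/page1", "/page2", "/page3", "/page4"])
in_flight = set()

first = pull_requests(scheduler, in_flight, max_concurrent=2)   # fills both slots
blocked = pull_requests(scheduler, in_flight, max_concurrent=2) # limit reached: nothing pulled
in_flight.discard("/page1")                                     # one download finishes
second = pull_requests(scheduler, in_flight, max_concurrent=2)  # one slot is free again
print(first, blocked, second)
```

Requests that are not pulled simply stay in the scheduler, so nothing is dropped while the downstream components catch up.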
  79. 79. It just stopped…
        • Why?
        • Some Deferred was lost?
        • Where?
        • How to debug?
        • No silver bullet.
        • > self.heartbeat = task.LoopingCall(nextcall.schedule) + extensive logging
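The heartbeat line above periodically re-triggers the "next call", so the crawl keeps moving even if the callback that should have triggered it never fires. Below is a deterministic sketch of that idea using the standard sched module with a virtual clock in place of Twisted's LoopingCall; FakeClock, ToyEngine, start_heartbeat, the 5-second interval and the bounded number of beats are all assumptions for the example.

```python
import sched


class FakeClock:
    """Virtual time, so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class ToyEngine:
    """Hypothetical engine whose 'next request' step must be triggered from outside."""

    def __init__(self, pending):
        self.pending = list(pending)
        self.processed = []

    def next_request(self):
        if self.pending:
            self.processed.append(self.pending.pop(0))


def start_heartbeat(scheduler, interval, func, beats):
    """Call func every `interval` seconds, `beats` times in total."""
    def beat(remaining):
        func()
        if remaining > 1:
            scheduler.enter(interval, 1, beat, (remaining - 1,))
    scheduler.enter(interval, 1, beat, (beats,))


clock = FakeClock()
scheduler = sched.scheduler(clock.time, clock.sleep)
engine = ToyEngine(["req1", "req2", "req3"])
# Nothing else ever calls engine.next_request(): this models a lost Deferred.
start_heartbeat(scheduler, 5.0, engine.next_request, beats=3)
scheduler.run()
print(engine.processed, clock.now)
```

A heartbeat hides the stall rather than explaining it, which is why the slide pairs it with extensive logging.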
  82. 82. Design your async application well
        • Iterations
        • State diagrams
  83. 83. Questions. Alexander Sibiryakov, Scrapinghub Ltd., sibiryakov@scrapinghub.com
