Scrapy internals
Alexander Sibiryakov, 16-17 July 2017, PyConRU 2017
sibiryakov@scrapinghub.com
Talk scope
• Design of complex asynchronous
application,
• Flow-control issues,
• open source life.
Scrapy: web scraping
• Extraction of structured data,
• Selecting and extracting data from HTML/XML
(CSS, XPath, regexps) → Parsel
• Interactive shell,
• Feed exports in JSON, CSV, XML and storing in
FTP, S3, local fs,
• Robust encoding support and auto-detection,
Main features
• Extensible: spider, signals, middlewares,
extensions, and pipelines,
• Cookies,
• Robots.txt,
• Form submission,
• Telnet console,
• Graceful shutdown by signal.
Scrapy architecture
Twisted
• Event-driven network programming
framework
• Event loop and Deferreds ("promises")
• Protocols and transport:
• TCP, UDP, SSL, UNIX sockets
• HTTP, DNS, SMTP/IMAP, IRC
• Cross platform
Creator of Twisted
Glyph Lefkowitz
self._nameResolver = _SimpleResolverComplexifier(resolver)
– Twisted source code
Twisted event loop
https://stackoverflow.com/questions/35111265/how-does-pythons-twisted-reactor-work
https://www.cs.cmu.edu/~adamchik/15-121/lectures/Binary%20Heaps/heaps.html
events:
[e1: Event, e2: Event, … eN]
Event:
func, args, desired_time
min (earliest desired_time): O(1)
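The event list above can be sketched as a binary heap keyed on desired_time, which is what makes reading the minimum O(1). This is a minimal illustration built on the standard heapq module, not Twisted's actual reactor code; the EventQueue class and its method names are hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Hypothetical timed-event queue: a binary heap ordered by desired_time."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so events with equal times never compare their funcs.
        self._counter = itertools.count()

    def call_at(self, desired_time, func, *args):
        # Insertion into the heap is O(log n).
        heapq.heappush(self._heap, (desired_time, next(self._counter), func, args))

    def next_time(self):
        # The earliest event is always at index 0: O(1).
        return self._heap[0][0] if self._heap else None

    def run_due(self, now):
        # Pop and run every event whose desired_time has passed.
        ran = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, func, args = heapq.heappop(self._heap)
            func(*args)
            ran += 1
        return ran


fired = []
queue = EventQueue()
queue.call_at(3.0, fired.append, "late")
queue.call_at(1.0, fired.append, "early")
queue.call_at(2.0, fired.append, "middle")

earliest = queue.next_time()
ran = queue.run_due(now=2.0)
```

An event loop built this way would sleep (or wait on sockets) until next_time() and then call run_due() with the current time.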
x86 time sources
• Real Time Clock: absolute time, 1 s precision,
• 8254 chip (programmable interval timer), previously,
• HPET (High Precision Event Timer), at least 10 MHz
• single counter for periodic mode,
• many for one-shot mode,
• compares actual timer value and target
• RDTSC/RDTSCP - CPU clock cycles
• Proprietary timers
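From Python, these hardware sources are only visible through the clocks the operating system exposes. A small sketch, assuming only the standard time module, of how one might inspect the monotonic clock that an event loop would typically use for scheduling; which hardware timer backs it depends on the OS and is not reported here.

```python
import time

# Ask the interpreter what it knows about the monotonic clock.
info = time.get_clock_info("monotonic")
print("implementation:", info.implementation)
print("resolution (s):", info.resolution)
print("monotonic:", info.monotonic)

# Measure a short interval with a clock that never goes backwards.
start = time.monotonic()
deadline = start + 0.01
while time.monotonic() < deadline:
    time.sleep(0.001)
elapsed = time.monotonic() - start
```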
Twisted.Deferred
• callback
• errback
• addCallback, addErrback
• cancel
• addTimeout
• pause/unpause
Internal components intercommunication
Web agent pipeline
Downloader
Slots:
PROBLEMS
Throttling between internal
components
• Downloader,
• Scraper
• Item pipelines (cleansing, validating, deduplication,
storing, ...)
• Feed exports (serialization + disk/network IO)
• ?
Flow control: memory
• Unlimited downloading -> unlimited growth of items
from cascading feed pages.
• Limit the amount of memory used by Responses
waiting in the queue (~5 MB)
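The memory limit can be sketched as a byte budget for responses that are downloaded but not yet processed. This is a simplified illustration, not Scrapy's actual code; the ResponseBudget class, its method names and the 1 KB floor are assumptions, and the default limit mirrors the ~5 MB figure above.

```python
MIN_RESPONSE_SIZE = 1024  # assumed floor so that tiny responses still count


class ResponseBudget:
    """Hypothetical tracker of bytes held by queued responses."""

    def __init__(self, max_active_size=5_000_000):
        self.max_active_size = max_active_size
        self.active_size = 0

    def add(self, body):
        # Called when a downloaded response enters the queue.
        self.active_size += max(len(body), MIN_RESPONSE_SIZE)

    def release(self, body):
        # Called when the response has been fully processed.
        self.active_size -= max(len(body), MIN_RESPONSE_SIZE)

    def needs_backout(self):
        # True means: stop feeding new requests to the downloader.
        return self.active_size > self.max_active_size


budget = ResponseBudget(max_active_size=3000)
page = b"x" * 2000

budget.add(page)
first = budget.needs_backout()   # 2000 bytes held, under the limit
budget.add(page)
second = budget.needs_backout()  # 4000 bytes held, over the limit
budget.release(page)
third = budget.needs_backout()   # back to 2000 bytes
```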
Flow control: CPU
spending more time on callbacks (CPU) than on IO
> reactor.callLater(0.1, d.errback, _failure)
an artificial delay of 100 ms
Summarizing
• concurrent items limits,
• memory consumption limits,
• scheduling of new calls with delays.
if a limit is reached ->
don't pick up new requests from the scheduler
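The rule above can be sketched as a loop that pulls from the scheduler only while no limit is hit. This is a toy model, not Scrapy's engine; MiniEngine and its attributes are hypothetical, and only the concurrency limit is modelled.

```python
from collections import deque


class MiniEngine:
    """Hypothetical engine that backs out when a concurrency limit is reached."""

    def __init__(self, requests, max_concurrent=2):
        self.scheduler = deque(requests)
        self.in_progress = set()
        self.max_concurrent = max_concurrent

    def needs_backout(self):
        return len(self.in_progress) >= self.max_concurrent

    def next_requests(self):
        # Pick up requests only while there is work and no limit is reached.
        started = []
        while self.scheduler and not self.needs_backout():
            request = self.scheduler.popleft()
            self.in_progress.add(request)
            started.append(request)
        return started

    def finished(self, request):
        # A completed request frees a slot, so try to schedule again.
        self.in_progress.discard(request)
        return self.next_requests()


engine = MiniEngine(["a", "b", "c"], max_concurrent=2)
first_batch = engine.next_requests()
second_batch = engine.finished("a")
```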
It just stopped…
• Why?
• Some Deferred was lost?
• Where?
• How to debug?
No silver bullet.
> self.heartbeat = task.LoopingCall(nextcall.schedule)
+ extensive logging
Design your async
application well
Iterations
State diagrams
Questions
Alexander Sibiryakov, Scrapinghub Ltd.,
sibiryakov@scrapinghub.com

"Scrapy internals", Alexander Sibiryakov, Scrapinghub