The Evolution Theory of Spider (Spider进化论)

Published in: Technology, Education

  1. The Evolution Theory of Spider
     逐浪 @ Taobao Beijing R&D Center
  2. Topic
     • Simplest Spider
     • Framework (Scrapy)
       – Abstraction
       – IO Model
     • Evolution
       – Architecture
       – Module
     • Simplify
     • Do it
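  3. Simplest Spider

         # Python 2 era code: urllib.urlopen() became urllib.request.urlopen() in Python 3.
         import urllib, lxml, MySQLdb

         urls = [...]
         for url in urls:
             html = urllib.urlopen(url).read()
             item = parse(html)   # parse() and save() are not defined on the slide
             save(item)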
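The loop on the Simplest Spider slide can be sketched for Python 3 as follows. This is a minimal sketch, not the author's code: the function names `fetch` and `crawl` are hypothetical, and `parse` and `save` are passed in as callables because the slide does not define them.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download one page and return its body as text (a blocking call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(urls, parse, save, fetch=fetch):
    """Fetch, parse, and save each URL in turn; return the number of items saved."""
    count = 0
    for url in urls:
        html = fetch(url)      # download
        item = parse(html)     # extract (caller-supplied, assumed)
        save(item)             # store (caller-supplied, assumed)
        count += 1
    return count
```

Passing `fetch` as a parameter keeps the loop testable without network access, for example `crawl(urls, parse, save, fetch=lambda u: "<html></html>")`.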
  4. Framework (Scrapy) --- Abstraction
     • Workflow abstraction
       – Work abstraction
       – Flow abstraction
     • Task abstraction
       – Request/Response
       – Task
     • Platform abstraction
       – Linux
       – Windows
  5. Framework (Scrapy) --- Abstraction
     • Workflow abstraction
       – Work abstraction
         • Schedule
         • Download
         • Extract
         • Pipeline
       – Immutable and variable
         • Scrapy perspective
         • My perspective
       – What is the spider class?
         • An abstraction of the variable work
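The split between the fixed work (schedule, download, extract, pipeline) and the variable work held by a spider class can be illustrated with a small sketch. The `Spider` and `Engine` classes below are hypothetical and are not Scrapy's API; the convention that a yielded string is a follow-up URL and anything else is an item is also an assumption made for brevity.

```python
from collections import deque


class Spider:
    """The variable part: where to start and how to extract from a page."""
    start_urls = []

    def extract(self, url, html):
        raise NotImplementedError


class Engine:
    """The fixed part: schedule -> download -> extract -> pipeline."""

    def __init__(self, spider, download, pipeline):
        self.spider = spider
        self.download = download      # callable: url -> html
        self.pipeline = pipeline      # callable: item -> None
        self.queue = deque(spider.start_urls)
        self.seen = set(spider.start_urls)

    def run(self):
        while self.queue:
            url = self.queue.popleft()                      # schedule
            html = self.download(url)                       # download
            for result in self.spider.extract(url, html):   # extract
                if isinstance(result, str):                 # assumed: str means a new URL
                    if result not in self.seen:
                        self.seen.add(result)
                        self.queue.append(result)
                else:
                    self.pipeline(result)                   # pipeline
```

Only `Spider` subclasses change from site to site; the engine loop stays the same.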
  6. Framework (Scrapy) --- Abstraction
     • Workflow abstraction
       – Flow abstraction
         • Apache vs Scrapy
       – Why a control center
         • Control ability
           – Error retry
         • Extensibility
         • Module independence
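Error retry is one kind of control that a central engine can apply uniformly, instead of each downloader handling it on its own. A minimal sketch, assuming a hypothetical `download` callable that raises on failure (the retry count and function name are illustrative, not taken from the slides):

```python
def download_with_retry(url, download, max_retries=3):
    """Call download(url); on failure, retry up to max_retries more times.

    Returns (body, attempts). Re-raises the last error if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return download(url), attempts
        except Exception:
            if attempts > max_retries:
                raise
```

A production version would likely add a delay between attempts and retry only on errors that are known to be transient.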
  7. Framework (Scrapy) --- Abstraction
     • Task abstraction
       – Request/Response
       – Task
     • Platform abstraction
       – Linux
       – Windows
  8. Framework (Scrapy) --- IO Model
     • Concepts
       – Synchronous/Asynchronous (IO state consistency)
       – Blocking/Non-blocking (process/thread status)
     • IO Model
       – Synchronous blocking (urllib)
       – Asynchronous blocking (spynner, gevent, nginx_lua)
       – Asynchronous non-blocking (twisted, reactor, proactor)
       – Synchronous non-blocking (mystery)
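The difference between the synchronous blocking and asynchronous non-blocking rows can be sketched with the standard library. Note that `asyncio` is swapped in here for Twisted, which is what Scrapy itself uses, and that the network wait is simulated with sleeps, so the function names and delays are assumptions for illustration only.

```python
import asyncio
import time


def fetch_blocking(url, delay=0.05):
    time.sleep(delay)            # the thread is stuck here until the "response" arrives
    return "body of " + url


async def fetch_async(url, delay=0.05):
    await asyncio.sleep(delay)   # control returns to the event loop while waiting
    return "body of " + url


def crawl_blocking(urls):
    # Requests run one after another; total wait is roughly len(urls) * delay.
    return [fetch_blocking(u) for u in urls]


async def crawl_async(urls):
    # All requests wait concurrently on one thread; results keep input order.
    return await asyncio.gather(*(fetch_async(u) for u in urls))
```

Both versions return the same bodies; `asyncio.run(crawl_async(urls))` simply overlaps the waiting periods instead of serializing them.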
  9. Evolution --- Architecture
     • Why
       – Scrapy
         • Single process
       – Etao Spider v1
  10. Evolution --- Architecture
     • Distributed on processes
     • Distributed on machines
     • How
       – Thrift/HSF
       – Interact
         • Direction
       – Dependent
         • Task queue
       – Stateless
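The combination of a task queue and stateless workers can be sketched in a single process. In this sketch, threads and `queue.Queue` stand in for separate processes or machines talking over an RPC layer such as Thrift; the names `worker` and `run_workers` are hypothetical.

```python
import queue
import threading


def worker(tasks, results, handle):
    """Stateless worker: everything it needs arrives inside the task itself."""
    while True:
        task = tasks.get()
        if task is None:          # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.put(handle(task))
        tasks.task_done()


def run_workers(task_list, handle, n_workers=3):
    """Fan tasks out to n_workers and return the results in sorted order."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results, handle))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in task_list:
        tasks.put(task)
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(len(task_list)))
```

Because no worker keeps state between tasks, any worker can take any task, which is what makes adding processes or machines straightforward.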
  11. Evolution --- Module
     • Downloader
       – Render
         • WebKit (JavaScript)
         • WebKit (AJAX): click simulation, event notification
         • WebKit (CSS): CSS features
       – ADSL proxy
         • How to get
           – Why scan by ourselves
         • How to use
           – Why nginx
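The slide does not say how the proxies are rotated, so the following is only one plausible way to use a pool of proxies from the downloader: round-robin selection with eviction of proxies that fail repeatedly. The `ProxyPool` class and its failure threshold are hypothetical.

```python
class ProxyPool:
    """Round-robin over proxies, dropping any that fail too often."""

    def __init__(self, proxies, max_failures=2):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures
        self._next = 0

    def get(self):
        """Return the next live proxy in rotation."""
        if not self.proxies:
            raise RuntimeError("no live proxies left")
        proxy = self.proxies[self._next % len(self.proxies)]
        self._next += 1
        return proxy

    def report_failure(self, proxy):
        """Count a failure; evict the proxy once it reaches max_failures."""
        if proxy not in self.failures:
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            self.proxies.remove(proxy)
            del self.failures[proxy]
```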
  12. Evolution --- Module
     • Extractor
       – Wrapper induction
         • Semi-automation
           – Firefox extensions
           – How to improve
           – Templates management
         • Full automation
       – Scrapy extract tool
         • Cascade extraction supported
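Template-driven extraction, where each site has a stored template that maps fields to patterns, can be sketched as below. The slides do not show the template format, so the regex-per-field layout, the `TEMPLATES` table, and the site name `shop.example` are all assumptions; a real wrapper would more likely use XPath or CSS selectors.

```python
import re

# Hypothetical template store: one regex per field, keyed by site.
TEMPLATES = {
    "shop.example": {
        "title": r"<h1[^>]*>(.*?)</h1>",
        "price": r'<span class="price">([\d.]+)</span>',
    },
}


def extract(site, html, templates=TEMPLATES):
    """Apply the site's template; fields that do not match come back as None."""
    item = {}
    for field, pattern in templates[site].items():
        m = re.search(pattern, html, re.S)
        item[field] = m.group(1).strip() if m else None
    return item
```

Keeping templates as data rather than code is what makes them manageable: a template can be added or corrected for a site without touching the extractor.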
  13. Evolution --- Module
     • Scheduler
       – FIFO queue
       – Priority queue
         • Seed weight
         • Smallest interval
         • User query distribution
         • User query importance
         • Webpage change characteristics
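The move from a FIFO queue to a priority queue can be sketched with `heapq`. The single `weight` argument here is a stand-in for whatever combination of the listed signals (seed weight, query importance, change rate) the real scheduler computes; that scoring function is not given in the slides, and the class name is hypothetical.

```python
import heapq
import itertools


class PriorityScheduler:
    """Pop the URL with the highest weight first; ties keep FIFO order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # insertion order, used as tie-breaker

    def push(self, url, weight=1.0):
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With equal weights everywhere, this degenerates to the plain FIFO queue, so the two scheduler variants share one interface.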
  14. Evolution --- Module
     • Processor
       – MySQL
       – Redis
       – Hadoop
  15. Simplify
     • IO module
       – Synchronous blocking
     • No middleware support
     • No Item Loader
     • No framework
     • No …
  16. Do it
     • Time estimation
       – Basic: 1-2 months
       – Improve
