The Evolution Theory of Spider
Published in Technology, Education
Transcript

  • 1. The Evolution Theory of Spider
      逐浪 @ Taobao Beijing R&D Center
  • 2. Topic
      • Simplest Spider
      • Framework(Scrapy)
          – Abstraction
          – IO Model
      • Evolution
          – Architecture
          – Module
      • Simplify
      • Do it
  • 3. Simplest Spider
        import urllib, lxml, MySQLdb
        urls = [...]
        for url in urls:
            html = urllib.urlopen(url).read()
            item = parse(html)
            save(item)
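The loop on the slide above is Python 2 and leaves `parse` and `save` undefined. A minimal Python 3 sketch of the same fetch, parse, save loop could look like the following. Everything beyond the loop itself is an assumption made for illustration: `TitleParser`, the `items` table, and the use of the standard-library `sqlite3` and `html.parser` modules as stand-ins for `MySQLdb` and `lxml`.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(url):
    # Synchronous, blocking download, as in the slide's urllib call.
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse(url, html):
    # Assumption: the "item" is just the page title.
    parser = TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}


def save(conn, item):
    # Assumption: sqlite3 stands in for the MySQL database on the slide.
    conn.execute("INSERT INTO items (url, title) VALUES (?, ?)",
                 (item["url"], item["title"]))
    conn.commit()


def crawl(urls, conn, fetch=fetch):
    conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)")
    for url in urls:
        save(conn, parse(url, fetch(url)))
```

Passing `fetch` as a parameter is a design choice of this sketch, so the loop can be exercised without network access.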
  • 4. Framework(Scrapy)---Abstraction
      • Workflow abstraction
          – Work abstraction
          – Flow abstraction
      • Task abstraction
          – Request/Response
          – Task
      • Platform abstraction
          – Linux
          – Windows
  • 5. Framework(Scrapy)---Abstraction
      • Workflow abstraction
          – Work abstraction
              • Schedule
              • Download
              • Extract
              • Pipeline
          – Immutable and variable
              • Scrapy perspective
              • My perspective
          – What is a spider class?
              • An abstraction of the variable work
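One way to read the slide above: the four kinds of work (schedule, download, extract, pipeline) form a fixed loop, and the spider class holds only the parts that change from site to site. The sketch below illustrates that split under stated assumptions. It is not Scrapy's API; `Spider`, `run`, `download`, and `pipeline` are hypothetical names.

```python
from collections import deque


class Spider:
    """Holds the parts that vary per site; the run() loop below stays fixed."""
    start_urls = []

    def extract(self, url, body):
        """Return (items, new_urls) found in a downloaded page."""
        raise NotImplementedError


def run(spider, download, pipeline):
    queue = deque(spider.start_urls)                # Schedule (plain FIFO here)
    seen = set(queue)
    while queue:
        url = queue.popleft()
        body = download(url)                        # Download
        items, links = spider.extract(url, body)    # Extract
        for item in items:
            pipeline(item)                          # Pipeline
        for link in links:
            if link not in seen:                    # never schedule a URL twice
                seen.add(link)
                queue.append(link)
```

A concrete spider only needs to subclass `Spider`, set `start_urls`, and implement `extract`; the loop itself does not change.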
  • 6. Framework(Scrapy)---Abstraction
      • Workflow abstraction
          – Flow abstraction
              • Apache vs Scrapy
          – Why a control center
              • Control ability
                  – Error Retry
              • Extensibility
              • Module independence
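The control-center idea on the slide above can be sketched as a single engine that owns the flow: the modules never call each other, so retry policy lives in one place. This is a minimal illustration, not Scrapy's engine. `Engine`, `max_retries`, and the choice of `IOError` as the retryable failure are assumptions.

```python
class Engine:
    """Central loop: modules never call each other, only the engine calls them."""

    def __init__(self, download, extract, pipeline, max_retries=2):
        self.download = download
        self.extract = extract
        self.pipeline = pipeline
        self.max_retries = max_retries
        self.failed = []                 # URLs that exhausted their retries

    def run(self, urls):
        pending = [(url, 0) for url in urls]
        while pending:
            url, attempts = pending.pop(0)
            try:
                body = self.download(url)
            except IOError:
                # Error retry is decided here, not inside the downloader.
                if attempts < self.max_retries:
                    pending.append((url, attempts + 1))
                else:
                    self.failed.append(url)
                continue
            for item in self.extract(url, body):
                self.pipeline(item)
```

Because the downloader only raises and the engine decides what happens next, a downloader can be swapped without touching the retry logic, which is one reading of "module independence".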
  • 7. Framework(Scrapy)---Abstraction
      • Task abstraction
          – Request/Response
          – Task
      • Platform abstraction
          – Linux
          – Windows
  • 8. Framework(Scrapy)---IO Model
      • Concepts
          – Synchronous/Asynchronous (IO state consistency)
          – Block/Nonblock (Process/Thread status)
      • IO Model
          – Synchronous Block (urllib)
          – Asynchronous Block (spynner, gevent, nginx_lua)
          – Asynchronous NonBlock (twisted, reactor, proactor)
          – Synchronous NonBlock (mystery)
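To make the blocking versus non-blocking distinction on the slide above concrete, the sketch below simulates downloads with sleeps. It uses the standard-library `asyncio` event loop as a stand-in for the Twisted reactor named on the slide, which is an assumption of this example. The function names and the delays are hypothetical.

```python
import asyncio
import time


def blocking_download(url, delay):
    time.sleep(delay)             # the calling thread is stuck for the whole wait
    return url


def crawl_sequentially(jobs):
    # Synchronous, blocking: each download starts only after the previous one ends.
    return [blocking_download(url, delay) for url, delay in jobs]


async def nonblocking_download(url, delay, finished):
    await asyncio.sleep(delay)    # yields to the event loop instead of blocking
    finished.append(url)          # record completion order
    return url


async def crawl_concurrently(jobs):
    # All downloads are in flight at once on a single thread.
    finished = []
    await asyncio.gather(*(nonblocking_download(u, d, finished) for u, d in jobs))
    return finished
```

With jobs of different durations, the sequential version returns results in submission order, while the event-loop version finishes the shortest wait first, because no single download holds up the thread.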
  • 9. Evolution---Architecture
      • Why
          – Scrapy
              • Single Process
          – Etao Spider v1
  • 10. Evolution---Architecture
      • Distributed on Processes
      • Distributed on Machines
      • How
          – Thrift/HSF
          – Interact
              • Direction
          – Dependent
              • Task queue
          – Stateless
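The task-queue and stateless points on the slide above can be sketched in a single process. The example below uses threads and `queue.Queue` purely as a stand-in for separate processes or machines talking over Thrift/HSF; that substitution, and the names `worker` and `crawl_with_workers`, are assumptions of this sketch.

```python
import queue
import threading


def worker(tasks, results, download):
    # A stateless worker: everything it needs arrives in the task itself,
    # so any worker can handle any task.
    while True:
        url = tasks.get()
        if url is None:            # sentinel: no more work for this worker
            break
        results.put((url, download(url)))


def crawl_with_workers(urls, download, n_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results, download))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)             # the queue is the only coupling between modules
    for _ in threads:
        tasks.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return dict(results.get() for _ in range(len(urls)))
```

This sketch assumes `download` does not raise; a worker that dies mid-task would leave the final collection step waiting, so a real system would need failure handling around the download call.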
  • 11. Evolution---Module
      • Downloader
          – Render
              • Webkit (Javascript)
              • Webkit (AJAX): click simulation, event notification
              • Webkit (CSS): CSS features
          – ADSL Proxy
              • How to get
                  – Why scan by ourselves
              • How to use
                  – Why nginx
  • 12. Evolution---Module
      • Extractor
          – Wrapper induction
              • Semi automation
                  – Firefox extensions
                  – How to improve
                  – Templates management
              • Full automation
          – Scrapy extract tool
              • Cascade extraction supported
  • 13. Evolution---Module
      • Scheduler
          – FIFO Queue
          – Priority Queue
              • Seed weight
              • Smallest interval
              • User Query distribution
              • User Query importance
              • Webpage change characteristics
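As an illustration of how a priority queue might combine two of the signals listed on the slide above, the sketch below orders URLs by seed weight and enforces a minimum interval between fetches to the same site. Reading "smallest interval" as a per-site minimum delay is an interpretation, and `PriorityScheduler`, `push`, `pop`, and `min_interval` are hypothetical names.

```python
import heapq
import itertools


class PriorityScheduler:
    """Pops the highest-weight URL whose site is outside its minimum interval."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.heap = []
        self.counter = itertools.count()   # tie-breaker keeps FIFO order
        self.last_fetch = {}               # site -> time of its last fetch

    def push(self, url, site, weight):
        # heapq is a min-heap, so negate the weight to pop the largest first.
        heapq.heappush(self.heap, (-weight, next(self.counter), url, site))

    def pop(self, now):
        """Return the best URL that may be fetched at time `now`, or None."""
        deferred = []
        chosen = None
        while self.heap:
            entry = heapq.heappop(self.heap)
            _, _, url, site = entry
            last = self.last_fetch.get(site, float("-inf"))
            if now - last >= self.min_interval:
                self.last_fetch[site] = now
                chosen = url
                break
            deferred.append(entry)          # site fetched too recently, try later
        for entry in deferred:
            heapq.heappush(self.heap, entry)
        return chosen
```

The other signals on the slide (query distribution, query importance, page change characteristics) could feed into the `weight` argument; how they were actually combined is not stated in the source.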
  • 14. Evolution---Module
      • Processor
          – MySQL
          – Redis
          – Hadoop
  • 15. Simplify
      • IO module
          – Synchronous block
      • No Middleware supported
      • No Item Loader
      • No Framework
      • No …
  • 16. Do it
      • Time Estimation
          – Basic: 1-2 months
          – Improve