The Evolution Theory of Spider

     Zhulang (逐浪) @ Taobao Beijing R&D Center
Topic
• Simplest Spider
• Framework(Scrapy)
  – Abstraction
  – IO Model
• Evolution
  – Architecture
  – Module
• Simplify
• Do it
Simplest Spider
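import urllib.request        # Python 3 home of urlopen (Python 2: urllib.urlopen)
import lxml.html, MySQLdb    # used by parse() and save(), which the slide does not show
urls = [...]                 # seed URLs, elided on the slide
for url in urls:
    html = urllib.request.urlopen(url).read()   # download (synchronous, blocking)
    item = parse(html)                          # extract
    save(item)                                  # store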
Framework(scrapy)---Abstraction
• Workflow abstraction
  – Work abstraction
  – Flow abstraction
• Task abstraction
  – Request/Response
  – Task
• Platform abstraction
  – Linux
  – Windows
Framework(scrapy)---Abstraction
• Workflow abstraction
  – Work abstraction
     • Schedule
     • Download
     • Extract
     • Pipeline
  – Immutable and variable
     • Scrapy perspective
     • My perspective
  – What is spider class?
     • Variable works abstraction
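The split between the fixed workflow and the variable spider class can be sketched roughly as below. This is a minimal illustration, not Scrapy's actual API: `Spider`, `TitleSpider`, `run`, and `FAKE_WEB` are hypothetical names, and the downloader is a dictionary lookup so the sketch runs without network access.

```python
from collections import deque


class Spider:
    """The variable part: where to start and how to extract items."""
    start_urls = []

    def parse(self, url, html):
        raise NotImplementedError


class TitleSpider(Spider):
    # Hypothetical spider: treats the first line of a page as its title.
    start_urls = ["page-a", "page-b"]

    def parse(self, url, html):
        return {"url": url, "title": html.split("\n", 1)[0]}


def run(spider, download, pipeline):
    """The immutable part: schedule -> download -> extract -> pipeline."""
    queue = deque(spider.start_urls)        # Schedule
    while queue:
        url = queue.popleft()
        html = download(url)                # Download
        item = spider.parse(url, html)      # Extract
        pipeline(item)                      # Pipeline


# Fake "web" so the sketch is self-contained (an assumption for illustration).
FAKE_WEB = {"page-a": "Title A\nbody", "page-b": "Title B\nbody"}
items = []
run(TitleSpider(), FAKE_WEB.__getitem__, items.append)
```

Only `TitleSpider` changes from site to site; `run` stays the same, which is one way to read "variable works abstraction".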
Framework(scrapy)---Abstraction
• Workflow abstraction
  – Flow abstraction
     • Apache vs Scrapy
  – Why control center
     • Control ability
        – Error Retry
     • Extensibility
     • Module independence
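One way to picture the control center is an engine that sits between the modules, so that concerns such as error retry live in a single place. The sketch below is an assumption about what such a design could look like, not Scrapy's engine: `Engine`, `max_retries`, and `flaky_download` are hypothetical names.

```python
class Engine:
    """Hypothetical control center: modules talk only to the engine."""

    def __init__(self, downloader, extractor, pipeline, max_retries=2):
        self.downloader = downloader
        self.extractor = extractor
        self.pipeline = pipeline
        self.max_retries = max_retries
        self.failed = []                      # URLs given up on

    def crawl(self, urls):
        pending = [(url, 0) for url in urls]  # (url, retries used so far)
        while pending:
            url, retries = pending.pop(0)
            try:
                html = self.downloader(url)
            except IOError:
                if retries < self.max_retries:
                    pending.append((url, retries + 1))  # reschedule the task
                else:
                    self.failed.append(url)             # retries exhausted
                continue
            self.pipeline(self.extractor(html))


# Simulated downloader: every URL fails once; "dead" always fails.
calls = {}

def flaky_download(url):
    calls[url] = calls.get(url, 0) + 1
    if url == "dead" or calls[url] == 1:
        raise IOError("download failed: " + url)
    return "<html>%s</html>" % url


items = []
engine = Engine(flaky_download, str.upper, items.append)
engine.crawl(["a", "dead"])
```

Because the downloader, extractor, and pipeline never call each other directly, any one of them can be swapped without touching the others, and the retry policy is changed in one place.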
Framework(scrapy)---Abstraction
• Task abstraction
  – Request/Response
  – Task
• Platform abstraction
  – Linux
  – Windows
Framework(scrapy)---IO Model
• Concepts
  – Synchronous/Asynchronous(IO state consistency)
  – Block/Nonblock(Process/Thread status)
• IO Model
  – Synchronous Block(urllib)
  – Asynchronous Block(spynner, gevent, nginx_lua)
  – Asynchronous NonBlock(twisted, reactor, proactor)
  – Synchronous NonBlock(mystery)
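The difference between the synchronous blocking model and the asynchronous non-blocking model can be illustrated with a small timing sketch. It uses the standard library's `asyncio` as a stand-in for the Twisted reactor named on the slide, and simulates network latency with sleeps; `DELAY` and the function names are assumptions for illustration.

```python
import asyncio
import time

DELAY = 0.05  # simulated network latency per request, in seconds


def fetch_blocking(url):
    time.sleep(DELAY)             # the thread is blocked until the "IO" finishes
    return "body of " + url


async def fetch_async(url):
    await asyncio.sleep(DELAY)    # yields to the event loop instead of blocking
    return "body of " + url


def crawl_blocking(urls):
    # Requests run one after another: total time is about len(urls) * DELAY.
    return [fetch_blocking(u) for u in urls]


async def _gather(urls):
    return await asyncio.gather(*(fetch_async(u) for u in urls))


def crawl_async(urls):
    # Requests overlap on one thread: total time is about DELAY.
    return asyncio.run(_gather(urls))
```

Both functions return the same bodies in the same order; only the waiting overlaps in the second case, which is why a single-process event loop can keep many downloads in flight.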
Evolution---Architecture
• Why
  – Scrapy
     • Single Process
  – Etao Spider v1
     [Figure: Etao Spider v1]
Evolution---Architecture
• Distributed on Processes
• Distributed on Machines
• How
  – Thrift/HSF
  – Interact
     • Direction
        – Dependent
     • Task queue
        – Stateless
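The task-queue style of interaction can be sketched as workers that pull self-contained tasks from a shared queue and keep no state between tasks. This is an in-process stand-in: threads and `queue.Queue` replace the separate processes or machines and the Thrift/HSF RPC layer named on the slide, and `worker`, `run_workers`, and `handle` are hypothetical names.

```python
import queue
import threading


def worker(tasks, results, handle):
    """Stateless worker: everything it needs arrives inside the task."""
    while True:
        task = tasks.get()
        if task is None:            # sentinel: no more work
            break
        results.put(handle(task))


def run_workers(task_list, handle, n_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, handle))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for task in task_list:
        tasks.put(task)
    for _ in threads:
        tasks.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    # Completion order is not deterministic, so sort for a stable result.
    return sorted(results.get() for _ in task_list)
```

Because no worker remembers anything between tasks, adding or removing workers changes only throughput, which is what makes spreading them over processes or machines straightforward.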
Evolution---Module
• Downloader
  – Render
    • Webkit(Javascript)
    • Webkit(AJAX): click simulation, event notification
    • Webkit(CSS): css feature
  – ADSL Proxy
    • How to get
       – Why scan by ourselves
    • How to use
       – Why nginx
Evolution---Module
• Extractor
  – Wrapper induction
     • Semi automation
        – Firefox extensions
        – How to improve
        – Templates management
     • Full automation
  – Scrapy extract tool
     • Cascade extraction supported
Evolution---Module
• Scheduler
  – FIFO Queue
  – Priority Queue
     • Seed weight
     • Smallest interval
     • User Query distribution
     • User Query importance
     • Webpage change characteristics
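A priority-queue scheduler that combines a seed weight with a smallest fetch interval might look like the sketch below. It is an assumption about how these two factors could be combined, not the Etao Spider's scheduler: `PriorityScheduler`, `min_interval`, and `next_url` are hypothetical names, and the other factors on the slide (query distribution, query importance, page change characteristics) would presumably feed into the weight.

```python
import heapq
import itertools


class PriorityScheduler:
    def __init__(self, min_interval=10.0):
        self.min_interval = min_interval  # seconds between fetches of one URL
        self._heap = []                   # entries: (-weight, tie_breaker, url)
        self._last_fetch = {}             # url -> time of its last fetch
        self._counter = itertools.count()

    def add(self, url, weight):
        # heapq is a min-heap, so negate the weight to pop the heaviest first.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))

    def next_url(self, now):
        """Return the highest-weight URL that is due at time `now`, or None."""
        deferred, chosen = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            url = entry[2]
            last = self._last_fetch.get(url)
            if last is None or now - last >= self.min_interval:
                chosen = url
                self._last_fetch[url] = now
                break
            deferred.append(entry)        # fetched too recently, keep for later
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen


scheduler = PriorityScheduler(min_interval=10.0)
scheduler.add("low", weight=1)
scheduler.add("high", weight=5)
scheduler.add("high", weight=5)           # scheduled again for a re-crawl
order = [scheduler.next_url(now) for now in (0, 1, 2, 10)]
```

With a plain FIFO queue the two `high` entries would be fetched back to back; here the second one waits until the interval has passed, and the lower-weight seed is served in the meantime.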
Evolution---Module
• Processor
  – MySQL
  – Redis
  – Hadoop
Simplify
• IO model
    – Synchronous block
• No middleware support
• No Item Loader
• No framework
•   No …
Do it
• Time Estimation
  – Basic: 1-2 months
  – Improve
