Kiwipycon2011 async-with-gevent-redis

Transcript

  • 1. Writing a distributed crawler system using gevent and redis Alex Dong @alexdong
  • 2. Roadmap
    • Crawler: the unsung hero
    • Async 101
    • Gevent: the monkey king
    • Redis: data structure server
    • Lessons learned
  • 7. Guess
    • How many links did Google index when it launched in 1998?
    • How many links today?
    • What project was Google employee #1 working on?
  • 10. Talking about a crawler!
    • Get a url from task queue
    • DNS resolution
    • Request HTTP header
    • Download full content
    • Store to local file store, database and index
    • Scheduling, throttling, status monitoring, scaling up by switching on more servers
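
The steps on this slide can be sketched as a single function that takes each stage as a parameter. This is only an illustration of the flow, not the speaker's code: `crawl_one` and all of its parameter names are hypothetical, and the usage below plugs in in-memory stand-ins rather than real DNS or HTTP calls.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_one(task_queue, resolve, fetch_header, fetch_body, store):
    """Run one URL through the pipeline; return it, or None if the queue is empty."""
    if not task_queue:
        return None
    url = task_queue.popleft()             # 1. get a URL from the task queue
    host = urlparse(url).hostname
    address = resolve(host)                # 2. DNS resolution
    headers = fetch_header(address, url)   # 3. request the HTTP header
    body = fetch_body(address, url)        # 4. download the full content
    store(url, headers, body)              # 5. store the result
    return url


# Usage with stand-ins (all values below are made up for the example).
saved = {}
queue = deque(["http://example.com/a"])
crawl_one(
    queue,
    resolve=lambda host: "192.0.2.1",
    fetch_header=lambda addr, url: {"content-type": "text/html"},
    fetch_body=lambda addr, url: "<html>hello</html>",
    store=lambda url, headers, body: saved.update({url: body}),
)
```

Passing the stages in as callables keeps each step replaceable, which also makes it easy to run the whole pipeline in one ordinary process for testing.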
  • 16. Async 101 – Why bother?
    • What's wrong with multi-threading?
      • GIL
      • Yields on IO/socket, but computationally expensive work will block
    • What about multi-process?
      • Memory efficiency
      • Context switch overhead
  • 20. Async 101
    • Controller + worker model: register and callback
    • Cooperative multitasking
    • epollfd = epoll_create();
    • epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev)
    • epoll_wait(epollfd, events, MAX_EVENTS, -1)
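
The register-and-callback model above can be tried from Python with the standard `selectors` module, which picks epoll on Linux when it is available. This is a minimal sketch, not part of the talk: the `on_readable` callback and the socket pair are assumptions chosen so the example runs without a network.

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll-backed on Linux when available
received = []


def on_readable(sock):
    """Callback the controller runs once `sock` has data waiting."""
    received.append(sock.recv(1024))


# A connected socket pair stands in for a real client connection.
reader, writer = socket.socketpair()
reader.setblocking(False)

# Register: "call on_readable when reader becomes readable".
sel.register(reader, selectors.EVENT_READ, data=on_readable)

writer.sendall(b"ping")

# Wait: block until at least one registered socket is ready, then dispatch.
for key, _mask in sel.select(timeout=1):
    key.data(key.fileobj)

sel.unregister(reader)
reader.close()
writer.close()
sel.close()
```

The worker code never blocks on a socket itself; it only runs when the controller's wait call reports that the socket is ready.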
  • 25. Gevent – Monkey King and Pool
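      from gevent import monkey
      monkey.patch_all()  # patch the standard library before importing socket
      import socket
      from gevent.pool import Pool
      worker_pool = Pool(SIZE)  # SIZE: maximum number of concurrent greenlets
      # get domain into payload
      # getaddrinfo requires a port argument; None is used here as a placeholder
      worker_pool.spawn(socket.getaddrinfo, payload, None)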
  • 26. Redis – Data Structure Server
    • High performance: 15,000 req/sec
      • Lock free, single process
      • Master/slave ready
    • Data structures
      • FIFO queue: Lists - LPOP, RPUSH
      • Working: Hashtable - HSET, HDEL, HEXISTS
      • One and only one: Sets - SADD, SPOP
  • 30. Redis – Limit parallel requests
      > LPUSH "trunk.ly" "http://trunk.ly/developers/"
      > SADD "waiting" "trunk.ly"
      > HSET "processing" "trunk.ly" 1227266553
      > SPOP "domains" -> trunk.ly
      > RPOP "trunk.ly" -> …/developers/
      > HDEL "processing" "trunk.ly"
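
One way to read the command sequence above is "at most one in-flight request per domain". The sketch below models that reading with plain Python containers standing in for the Redis list, set and hash, so it runs without a server. The function names `submit`, `claim` and `release` are hypothetical, and the slide's "waiting" and "domains" sets are treated as a single set here, which is an assumption.

```python
import time
from collections import deque

queues = {}       # per-domain FIFO of URLs   (Redis list: LPUSH / RPOP)
waiting = set()   # domains free to be picked (Redis set:  SADD / SPOP)
processing = {}   # domain -> start timestamp (Redis hash: HSET / HDEL)


def submit(domain, url):
    """Queue a URL and mark its domain as available unless it is in flight."""
    queues.setdefault(domain, deque()).appendleft(url)   # LPUSH
    if domain not in processing:
        waiting.add(domain)                              # SADD


def claim():
    """Take one available domain and its oldest URL, or None if nothing is free."""
    if not waiting:
        return None
    domain = waiting.pop()                               # SPOP
    processing[domain] = time.time()                     # HSET
    return domain, queues[domain].pop()                  # RPOP


def release(domain):
    """Finish a request; make the domain available again if URLs remain."""
    del processing[domain]                               # HDEL
    if queues[domain]:
        waiting.add(domain)


submit("trunk.ly", "http://trunk.ly/developers/")
job = claim()       # the domain is now in flight; a second claim() finds nothing
release(job[0])
```

Because a set holds each domain once and popping removes it, two workers cannot pick the same domain at the same time, which is what limits parallel requests per site.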
  • 31. Lessons learned - Dashboard
    • Turning point: most important code we've written
    • 25% of the code is for status updates and monitoring
    • What's causing the pile-up?
      • Someone abusing the system?
      • DNS is down?
      • ISP's bandwidth?
      • Large file download?
      • Scheduler re-submitting tasks?
  • 38. Lessons learned – Fine balance
    • Conflict between frontend and backend
    • Capacity planning
      • 10 seconds network timeout
      • 1000 parallel threadlets
      • 0.01 second per task
      • 500 reads / 100 writes hitting backend
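
The slide gives the numbers without showing how they relate. One possible reading, which is an interpretation and not something the slide states, is that throughput is roughly concurrency divided by time per task: the worst case on the network side (every request running into the timeout) and the CPU side (tasks processed back to back in a single process) then land on the same figure.

```python
network_timeout_s = 10       # from the slide
parallel_threadlets = 1000   # from the slide
cpu_per_task_s = 0.01        # from the slide

# Worst case on the network side: every threadlet waits for the full timeout.
worst_case_rate = parallel_threadlets / network_timeout_s   # tasks per second

# CPU side: one process handling tasks back to back.
cpu_bound_rate = 1 / cpu_per_task_s                          # tasks per second

print(worst_case_rate, cpu_bound_rate)
```

Under this reading the pool size, the timeout and the per-task CPU cost are balanced against each other; changing one of them shifts the bottleneck.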
  • 39. Lessons learned – Use profiler
    • Structure the code so that all steps can run in one process without gevent enabled
    • Profile carefully to make sure socket.recv becomes the main bottleneck
    • The get_title crisis
    • Rule of thumb: load average < 1 to saturate 10M bandwidth
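
A minimal sketch of the profiling advice, using the standard `cProfile` module on steps that run in one ordinary process. The functions `parse_title` and `run_all_steps` are hypothetical stand-ins for a CPU-bound step and the pipeline; they are not the speaker's `get_title`.

```python
import cProfile
import io
import pstats


def parse_title(html):
    """Stand-in for a CPU-bound step: pull the text between the title tags."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return ""
    return html[start + len("<title>"):end]


def run_all_steps(pages):
    """Run every step in-process, with no gevent involved."""
    return [parse_title(page) for page in pages]


pages = ["<html><title>page %d</title></html>" % i for i in range(1000)]

profiler = cProfile.Profile()
profiler.enable()
titles = run_all_steps(pages)
profiler.disable()

# Collect the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If a pure-Python step shows up at the top of such a report instead of socket reads, that step is taking CPU time away from every other greenlet in the process.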
  • 43. Twitter: @alexdong trunk.ly/?q=from:alexdong+gevent