Kiwipycon2011 async-with-gevent-redis-110826190218-phpapp01

Trunk.ly launched as a half-baked product when Yahoo announced its plan to "sunset" Delicious. Within two days, 5,000 users had signed up for the service. To deal with the demand, we spun up 10+ EC2 instances to crunch through the links. Alex then rewrote the crawler architecture using gevent and Redis, reducing the total number of servers to 2 and saving $3,000 per month. This talk is about the framework and architecture used in the rewrite.

Transcript

  • 1. Writing a distributed crawler system using gevent and redis Alex Dong @alexdong
  • 2. Roadmap
    • Crawler: the unsung hero
    • Async 101
    • Gevent: the monkey king
    • Redis: data structure server
    • Lessons learned
  • 7. Guess
    • How many links did Google index when it launched in 1998?
    • How many links today?
    • What project was Google employee #1 working on?
  • 10. Talking about a crawler!
    • Get a url from task queue
    • DNS resolution
    • Request HTTP header
    • Download full content
    • Store to local file store, database and index
    • Scheduling, throttling, status monitoring; scale up by flicking on more servers
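The steps on this slide can be read as a small pipeline. The following is a minimal sketch, not the Trunk.ly code: `crawl_one` and its injected step functions (`resolve`, `fetch_headers`, `fetch_body`, `store`) are hypothetical names, and the content-type check between the header request and the full download is an assumption about why the two steps are separate.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_one(queue, resolve, fetch_headers, fetch_body, store):
    """Run one URL through the crawl steps; return the URL, or None if idle."""
    if not queue:
        return None
    url = queue.popleft()                    # get a URL from the task queue
    host = urlsplit(url).hostname
    address = resolve(host)                  # DNS resolution
    headers = fetch_headers(address, url)    # request the HTTP headers only
    # Assumption: the header request decides whether the body is worth fetching.
    if not headers.get("content-type", "").startswith("text/"):
        return url
    body = fetch_body(address, url)          # download full content
    store(url, headers, body)                # file store, database, index
    return url
```

Passing the steps in as functions keeps each one replaceable, so the same pipeline can run with blocking stubs in a test or with cooperative network calls in production.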
  • 16. Async 101 – Why bother?
    • What's wrong with multi-thread?
      • GIL
      • Yields on IO/socket, but
      • Computationally expensive work will block
    • What about multi-process?
      • Memory efficiency
      • Context switch overhead
  • 20. Async 101
    • Controller + worker model: register and callback
    • Cooperative multitasking
    • epollfd = epoll_create1(0);
    • epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev);
    • epoll_wait(epollfd, events, MAX_EVENTS, -1);
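The register-and-callback model behind the three epoll calls can be sketched in Python with the standard `selectors` module, which uses epoll on Linux. This is an illustrative sketch only; `run_once`, `demo` and `on_readable` are hypothetical names, and a socket pair stands in for a real network connection.

```python
import selectors
import socket


def run_once(selector, timeout=1.0):
    """Wait for ready sockets and call the callback registered for each."""
    fired = 0
    for key, _events in selector.select(timeout):
        callback = key.data          # the callback stored at registration time
        callback(key.fileobj)
        fired += 1
    return fired


def demo():
    selector = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS
    reader, writer = socket.socketpair()
    received = []

    def on_readable(sock):
        received.append(sock.recv(1024))

    # Register: "call on_readable when reader has data" (like epoll_ctl ADD).
    selector.register(reader, selectors.EVENT_READ, data=on_readable)
    writer.sendall(b"ping")
    run_once(selector)                        # like one epoll_wait iteration
    selector.unregister(reader)
    reader.close()
    writer.close()
    selector.close()
    return received
```

A real controller would call `run_once` in a loop; no worker blocks on a single socket, because the loop only touches sockets the kernel reports as ready.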
  • 25. Gevent – Monkey King and Pool
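      from gevent import monkey
      monkey.patch_all()  # make blocking stdlib calls (sockets, DNS) cooperative
      import socket
      from gevent.pool import Pool

      worker_pool = Pool(SIZE)  # SIZE: maximum number of concurrent greenlets
      # get domain into payload
      # getaddrinfo requires a port argument; None leaves it unspecified
      worker_pool.spawn(socket.getaddrinfo, payload, None)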
  • 26. Redis – Data Structure Server
    • High performance: 15,000 req/sec
      • Lock free, single process
      • Master/slave ready
    • Data structures
      • FIFO queue: Lists - LPOP, RPUSH
      • Working: Hashtable - HSET, HDEL, HEXISTS
      • One and only one: Sets - SADD, SPOP
  • 30. Redis – Limit parallel requests
      > LPUSH "trunk.ly" "http://trunk.ly/developers/"
      > SADD "waiting" "trunk.ly"
      > HSET "processing" "trunk.ly" 1227266553
      > SPOP "domains" -> trunk.ly
      > RPOP "trunk.ly" -> …/developers/
      > HDEL "processing" "trunk.ly"
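One way to read the commands above is as a scheduler that allows at most one in-flight request per domain. The sketch below mirrors that idea with plain Python containers instead of a Redis connection, so it runs without a server. `DomainScheduler` and its method names are hypothetical, and the exact key layout used by Trunk.ly is an assumption.

```python
import time
from collections import deque


class DomainScheduler:
    """In-memory stand-in for the Redis list, set and hash on the slide."""

    def __init__(self):
        self.queues = {}       # per-domain FIFO of URLs (Redis list)
        self.waiting = set()   # domains with work and no active fetch (Redis set)
        self.processing = {}   # domain -> start timestamp (Redis hash)

    def submit(self, domain, url):
        self.queues.setdefault(domain, deque()).appendleft(url)  # LPUSH
        if domain not in self.processing:
            self.waiting.add(domain)                             # SADD

    def next_task(self):
        if not self.waiting:
            return None
        domain = self.waiting.pop()                              # SPOP
        url = self.queues[domain].pop()                          # RPOP
        self.processing[domain] = time.time()                    # HSET
        return domain, url

    def finish(self, domain):
        self.processing.pop(domain, None)                        # HDEL
        if self.queues.get(domain):
            self.waiting.add(domain)   # more URLs queued: make it eligible again
```

Because a domain leaves the waiting set as soon as one worker pops it, a second worker cannot pick up the same domain until `finish` is called, which is what limits parallel requests per site.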
  • 31. Lessons learned - Dashboard
    • Turning point: most important code we've written
    • 25% of the code is for status updates and monitoring
    • What's causing the piling up?
      • Someone abusing the system?
      • DNS is down?
      • ISP's bandwidth?
      • Large file download?
      • Scheduler re-submitting tasks?
  • 38. Lessons learned – Fine balance
    • Conflict between frontend and backend
    • Capacity planning
      • 10 seconds network timeout
      • 1000 parallel threadlets
      • 0.01 second per task
      • 500 reads / 100 writes hitting backend
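The capacity figures on this slide lend themselves to a quick back-of-the-envelope check. The calculation below is one possible reading of the numbers, not something the slide states: it assumes the worst case is every request waiting for the full timeout, and that 0.01 second is the CPU time a single process spends per task. The function names are hypothetical.

```python
def timeout_bound(parallel_tasks, timeout_seconds):
    """Tasks per second if every in-flight request waits for the full timeout."""
    return parallel_tasks / timeout_seconds


def cpu_bound(seconds_per_task):
    """Tasks per second one process can finish if CPU time is the only cost."""
    return 1.0 / seconds_per_task


# Figures taken from the slide; the interpretation is an assumption.
worst_case_network = timeout_bound(parallel_tasks=1000, timeout_seconds=10)
single_process_cpu = cpu_bound(seconds_per_task=0.01)
```

Under these assumptions both bounds come out at about 100 tasks per second, which suggests the pool size and timeout were chosen so that neither the network wait nor the CPU work dominates.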
  • 39. Lessons learned – Use profiler
    • Structure the code to make it possible to run all steps in one non-gevent enabled process
    • Carefully profile to make sure socket.recv becomes the main bottleneck.
    • The get_title crisis
    • Rule of thumb: load average < 1 to saturate 10M bandwidth
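Profiling a single, non-gevent process can be done with the standard `cProfile` module. The sketch below is illustrative: `parse`, `process` and `profile_report` are hypothetical names, and the pipeline is a CPU-only stand-in, so nothing here touches the network.

```python
import cProfile
import io
import pstats


def parse(body):
    # Hypothetical CPU-bound step standing in for HTML parsing.
    return sum(len(word) for word in body.split())


def process(urls):
    # Hypothetical synchronous pipeline; a real one would fetch each URL first.
    return [parse("lorem ipsum " * 1000) for _ in urls]


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Calling `print(profile_report(process, ["a", "b"]))` lists the most expensive functions first. In a crawler that is correctly I/O-bound, the socket receive call should sit at the top of that list rather than a parsing helper.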
  • 43. Twitter: @alexdong, trunk.ly/?q=from:alexdong+gevent