Kiwipycon2011 async-with-gevent-redis

1. Writing a distributed crawler system using gevent and redis (Alex Dong, @alexdong)
2. Roadmap
   - Crawler: the unsung hero
   - Async 101
   - Gevent: the monkey king
   - Redis: data structure server
   - Lessons learned
3. Guess
   - How many links did Google index when it launched in 1998?
   - How many links does it index today?
   - What project was Google employee #1 working on?
4. Talking about a crawler!
   - Get a URL from the task queue
   - DNS resolution
   - Request the HTTP headers
   - Download the full content
   - Store to local file store, database and index
   - Scheduling, throttling, status monitoring; scale up by flicking on more servers
   (a minimal sketch of this pipeline follows)
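A minimal Python sketch of that per-URL pipeline, assuming a Redis list named task:queue as the task queue and a hypothetical store_result helper for the final step:

    import socket
    import urllib.request

    import redis  # redis-py client

    r = redis.Redis()

    def crawl_one():
        # 1. Get a URL from the task queue (BLPOP blocks until one arrives)
        _, url = r.blpop('task:queue')
        url = url.decode()
        # 2. DNS resolution
        host = url.split('/')[2]
        ip = socket.gethostbyname(host)
        # 3./4. Request the headers, then download the full content
        resp = urllib.request.urlopen(url, timeout=10)
        headers = dict(resp.headers)
        body = resp.read()
        # 5. Store to local file store, database and index
        store_result(url, ip, headers, body)  # hypothetical helper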
5. Async 101 – Why bother?
   - What's wrong with multi-threading?
     - GIL
     - Threads yield on IO/sockets, but
     - computationally expensive work still blocks
   - What about multi-process?
     - Poor memory efficiency
     - Context-switch overhead
6. Async 101
   - Controller + worker model: register and callback
   - Cooperative multitasking
   - The kernel primitive underneath (C):
       epollfd = epoll_create();
       epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev);
       epoll_wait(epollfd, events, MAX_EVENTS, -1);
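Those three C calls map one-to-one onto Python's select.epoll (Linux only); a toy accept loop, with the port number invented:

    import select
    import socket

    listen_sock = socket.socket()
    listen_sock.bind(('0.0.0.0', 8080))
    listen_sock.listen(128)
    listen_sock.setblocking(False)

    epollfd = select.epoll()                                # epoll_create()
    epollfd.register(listen_sock.fileno(), select.EPOLLIN)  # epoll_ctl(..., EPOLL_CTL_ADD, ...)

    while True:
        # epoll_wait(epollfd, events, MAX_EVENTS, -1): block until a socket is ready
        for fd, event in epollfd.poll():
            if fd == listen_sock.fileno():
                conn, addr = listen_sock.accept()
                conn.sendall(b'hello\n')
                conn.close()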
7. Gevent – Monkey King and Pool
       from gevent import monkey
       monkey.patch_all()

       from gevent.pool import Pool
       worker_pool = Pool(SIZE)

       # get domain into payload
       worker_pool.spawn(socket.getaddrinfo, payload)
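The same fragment filled out so it runs end to end (the pool size and domain list are invented):

    from gevent import monkey
    monkey.patch_all()  # patch socket et al. so blocking calls yield to the hub

    import socket
    from gevent.pool import Pool

    POOL_SIZE = 100
    worker_pool = Pool(POOL_SIZE)

    def resolve(domain):
        # socket.getaddrinfo is monkey-patched: it blocks only this greenlet
        print(domain, socket.getaddrinfo(domain, 80)[0][4][0])

    for payload in ('python.org', 'redis.io', 'gevent.org'):
        worker_pool.spawn(resolve, payload)
    worker_pool.join()  # wait for all greenlets to finish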
8. Redis – Data Structure Server
   - High performance: 15,000 req/sec
     - Lock-free, single process
     - Master/slave ready
   - Data structures
     - FIFO queue: Lists (LPOP, RPUSH)
     - Work in progress: Hashes (HSET, HDEL, HEXISTS)
     - One and only one: Sets (SADD, SPOP)
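The same three structures through the redis-py client (key names are illustrative):

    import redis

    r = redis.Redis()

    # FIFO queue: push on the right, pop on the left
    r.rpush('queue:urls', 'http://example.com/')
    url = r.lpop('queue:urls')

    # Work in progress: hash keyed by domain, value = claim timestamp
    r.hset('processing', 'example.com', 1318000000)
    r.hexists('processing', 'example.com')  # -> True
    r.hdel('processing', 'example.com')

    # One and only one: a set deduplicates, SPOP removes a single member
    r.sadd('domains', 'example.com')
    r.sadd('domains', 'example.com')   # no-op: already present
    domain = r.spop('domains')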
9. Redis – Limit parallel requests
       > LPUSH "trunk.ly" "http://trunk.ly/developers/"
       > SADD "waiting" "trunk.ly"
       > HSET "processing" "trunk.ly" 1227266553
       > SPOP "domains"        -> trunk.ly
       > RPOP "trunk.ly"       -> …/developers/
       > HDEL "processing" "trunk.ly"
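One way to read that sequence as Python: each domain gets its own list of pending URLs, a set makes a domain claimable by at most one worker at a time, and a hash records in-flight claims. The slide's "waiting" and "domains" look like the same set; this sketch is an interpretation rather than the original code, and uses one name:

    import time
    import redis

    r = redis.Redis()

    def enqueue(domain, url):
        r.rpush(domain, url)        # per-domain FIFO of pending URLs
        r.sadd('domains', domain)   # set: each domain appears at most once

    def claim():
        domain = r.spop('domains')  # atomically take a domain nobody else holds
        if domain is None:
            return None
        r.hset('processing', domain, int(time.time()))  # record the claim
        url = r.lpop(domain)
        return domain, url

    def release(domain):
        r.hdel('processing', domain)
        if r.llen(domain):          # more URLs queued? make it claimable again
            r.sadd('domains', domain)

Because SPOP removes the domain from the set, at most one worker can hold a given domain, which caps parallel requests per site.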
10. Lessons learned – Dashboard
   - Turning point: the most important code we've written
   - 25% of the code is status updates and monitoring
   - What's causing the piling up? (see the counters sketch below)
     - Someone abusing the system?
     - DNS is down?
     - ISP's bandwidth?
     - Large file download?
     - Scheduler re-submitting tasks?
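A sketch of the kind of per-stage counters such a dashboard could be fed from (all key and stage names are invented):

    import redis

    r = redis.Redis()

    def record(stage, outcome):
        # e.g. record('dns', 'ok'), record('download', 'timeout')
        r.hincrby('stats:%s' % stage, outcome, 1)

    def snapshot():
        # One HGETALL per stage answers "what's piling up, and where?"
        return {stage: r.hgetall('stats:%s' % stage)
                for stage in ('queue', 'dns', 'headers', 'download', 'store')}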
11. Lessons learned – Fine balance
   Conflict between frontend and backend: capacity planning.
   - 10-second network timeout
   - 1,000 parallel threadlets
   - 0.01 seconds per task
   - 500 reads / 100 writes hitting the backend
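The slide doesn't show the arithmetic; one possible reading, stated as an assumption, is that 1,000 threadlets drained against the 10-second worst-case timeout complete roughly 100 tasks a second, which at an assumed 5 reads and 1 write per task yields the quoted backend load:

    # Back-of-envelope capacity check (one possible reading of the slide's numbers)
    threadlets = 1000                     # parallel greenlets
    timeout = 10.0                        # seconds a slow fetch can hold a threadlet
    completions = threadlets / timeout    # worst case ~100 task completions/sec
    reads_per_task, writes_per_task = 5, 1  # assumed per-task backend traffic
    print(completions * reads_per_task)   # -> 500 reads/sec
    print(completions * writes_per_task)  # -> 100 writes/sec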
12. Lessons learned – Use a profiler
   - Structure the code so that all steps can run in one non-gevent-enabled process
   - Profile carefully to make sure socket.recv is the main bottleneck
   - The get_title crisis
   - Rule of thumb: keep load average < 1 to saturate 10M of bandwidth
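Because all steps can run in one plain process, stock cProfile applies; a sketch where run_pipeline_once is a hypothetical entry point that crawls one batch of URLs:

    import cProfile
    import pstats

    # With gevent disabled, profile the whole pipeline in one process;
    # if socket.recv doesn't dominate, there's CPU work worth fixing first.
    cProfile.run('run_pipeline_once()', 'crawl.prof')
    stats = pstats.Stats('crawl.prof')
    stats.sort_stats('cumulative').print_stats(20)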
13. Twitter: @alexdong
    trunk.ly/?q=from:alexdong+gevent