Kiwipycon2011 async-with-gevent-redis
Usage Rights

CC Attribution-ShareAlike License

Presentation Transcript

  • Writing a distributed crawler system using gevent and redis Alex Dong @alexdong
  • Roadmap
    • Crawler: the unsung hero
    • Async 101
    • Gevent: the monkey king
    • Redis: data structure server
    • Lessons learned
  • Guess
    • How many links did Google index when it launched in 1998?
    • How many links today?
    • What project was Google employee #1 working on?
  • Talking about a crawler!
    • Get a url from task queue
    • DNS resolution
    • Request HTTP Header
    • Download full content
    • Store to local file store, database and index
    • Scheduling, throttling, status monitoring, scale up by flicking on more servers.
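The per-URL steps listed above can be sketched as one synchronous function. This is a minimal illustration using only the standard library, not the crawler from the talk: the names `crawl_one`, `resolve`, `fetch_headers` and `fetch_body` are hypothetical, the task queue is a plain list, and the "file store, database and index" is reduced to a dict.

```python
import socket
import urllib.request
from urllib.parse import urlsplit


def resolve(host):
    """DNS resolution: return the first address reported for host."""
    return socket.getaddrinfo(host, None)[0][4][0]


def fetch_headers(url, timeout=10):
    """Request only the HTTP headers, using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def fetch_body(url, timeout=10):
    """Download the full content."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_one(queue, store, resolve=resolve,
              fetch_headers=fetch_headers, fetch_body=fetch_body):
    """Run the crawl steps for one URL (hypothetical helper names)."""
    url = queue.pop(0)                      # get a URL from the task queue
    address = resolve(urlsplit(url).hostname)  # DNS resolution
    headers = fetch_headers(url)            # request HTTP headers
    body = fetch_body(url)                  # download full content
    # Stand-in for the file store, database and index.
    store[url] = {"address": address, "headers": headers, "body": body}
    return url
```

Passing the network steps in as parameters keeps each stage replaceable, which also makes it possible to exercise the function with fakes instead of real network calls.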
  • Async 101 – Why bother?
    • What's wrong with multi-threading?
      • The GIL
      • Threads yield on I/O and socket calls, but
      • computationally expensive code still blocks the other threads
    • What about multi-processing?
      • Memory efficiency
      • Context switch overhead
  • Async 101
    • Controller + worker model: register and callback
    • Cooperative multitasking
    • epollfd = epoll_create1(0);
    • epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev);
    • epoll_wait(epollfd, events, MAX_EVENTS, -1);
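The register-and-callback model above can be tried from Python with the standard `selectors` module, which uses epoll where the platform provides it. This is a small sketch rather than anything from the talk: `on_readable`, `run_once` and `demo` are hypothetical names, and a local socket pair stands in for a real network connection.

```python
import selectors
import socket


def demo():
    """Register a socket with a callback, then let the event loop dispatch it."""
    selector = selectors.DefaultSelector()  # epoll-backed on Linux
    received = []

    def on_readable(sock):
        # Callback: invoked only once the socket is reported readable.
        received.append(sock.recv(1024))
        selector.unregister(sock)

    def run_once(timeout=1.0):
        # Controller: wait for events, then call the registered callback.
        for key, _mask in selector.select(timeout):
            key.data(key.fileobj)

    reader, writer = socket.socketpair()
    reader.setblocking(False)
    # Comparable to epoll_ctl(..., EPOLL_CTL_ADD, ...): register interest.
    selector.register(reader, selectors.EVENT_READ, on_readable)
    writer.sendall(b"ping")
    run_once()  # comparable to one epoll_wait() call

    reader.close()
    writer.close()
    selector.close()
    return received


if __name__ == "__main__":
    print(demo())
```

The multitasking is cooperative because `on_readable` runs to completion before the loop looks at the next event; a slow callback delays every other socket.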
  • Gevent – Monkey King and Pool
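      from gevent import monkey
      monkey.patch_all()  # make blocking stdlib calls (sockets, DNS) cooperative
      import socket
      from gevent.pool import Pool

      worker_pool = Pool(SIZE)  # SIZE: maximum number of concurrent greenlets
      # get domain into payload
      # getaddrinfo requires a port argument; None resolves the host only
      worker_pool.spawn(socket.getaddrinfo, payload, None)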
  • Redis – Data Structure Server
    • High performance: 15,000 req/sec
      • Lock free, single process
      • Master/slave ready
    • Data structures
      • FIFO queue: Lists - LPOP, RPUSH
      • In progress: Hashes - HSET, HDEL, HEXISTS
      • Uniqueness (one and only one): Sets - SADD, SPOP
  • Redis – Limit parallel requests
      > LPUSH "trunk.ly" "http://trunk.ly/developers/"
      > SADD "waiting" "trunk.ly"
      > HSET "processing" "trunk.ly" 1227266553
      > SPOP "domains" -> trunk.ly
      > RPOP "trunk.ly" -> …/developers/
      > HDEL "processing" "trunk.ly"
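One plausible reading of the command sequence above is a per-domain throttle: each domain has its own FIFO list, a set holds the domains with pending work, and a hash records which domains are being fetched. The sketch below models that reading in memory with a deque, a set and a dict, so it runs without a Redis server. `DomainThrottle` and its methods are hypothetical names, and two details are assumptions rather than facts from the slide: the set named in SADD and SPOP is treated as the same set, and a domain is not re-queued while it is still marked as processing.

```python
import time
from collections import deque


class DomainThrottle:
    """In-memory stand-in for the Redis keys used on the slide."""

    def __init__(self):
        self.queues = {}      # one list per domain: LPUSH / RPOP
        self.waiting = set()  # domains with pending work: SADD / SPOP
        self.processing = {}  # domain -> start timestamp: HSET / HDEL

    def submit(self, domain, url):
        self.queues.setdefault(domain, deque()).appendleft(url)  # LPUSH
        if domain not in self.processing:                        # HEXISTS
            self.waiting.add(domain)                             # SADD

    def claim(self, now=None):
        """Take one URL from a waiting domain, or None if nothing is ready."""
        if not self.waiting:
            return None
        domain = self.waiting.pop()                              # SPOP
        self.processing[domain] = time.time() if now is None else now  # HSET
        return domain, self.queues[domain].pop()                 # RPOP

    def release(self, domain):
        """Mark the domain as finished and re-queue it if work remains."""
        del self.processing[domain]                              # HDEL
        if self.queues.get(domain):
            self.waiting.add(domain)                             # SADD
```

Because a claimed domain leaves the waiting set until `release` is called, at most one request per domain is in flight at a time, however many workers call `claim`.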
  • Lessons learned - Dashboard
    • Turning point: most important code we've written
    • 25% of the code is for status updates and monitoring
    • What's causing the queue to pile up?
      • Someone abusing the system?
      • DNS is down?
      • ISP's bandwidth?
      • Large file download?
      • Scheduler re-submitting tasks?
  • Lessons learned – Fine balance
    • Conflict between frontend and backend
    • Capacity planning
      • 10 seconds network timeout
      • 1000 parallel threadlets
      • 0.01 second per task
      • 500 reads / 100 writes hitting backend
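The slide does not say how these capacity figures relate to each other. One possible reading, offered here as an assumption, applies Little's law (throughput = concurrency / time per task) twice: once for the worst case where every request runs into the network timeout, and once for the CPU time each task needs in a single process. The function names are hypothetical.

```python
NETWORK_TIMEOUT_S = 10     # from the slide
PARALLEL_THREADLETS = 1000  # from the slide
CPU_PER_TASK_S = 0.01      # from the slide, read here as CPU time per task


def worst_case_rate(concurrency, timeout_s):
    """Tasks per second if every request waits for the full timeout."""
    return concurrency / timeout_s


def cpu_bound_rate(cpu_per_task_s):
    """Tasks per second a single process can handle if it never waits on I/O."""
    return 1 / cpu_per_task_s


if __name__ == "__main__":
    print(worst_case_rate(PARALLEL_THREADLETS, NETWORK_TIMEOUT_S))
    print(cpu_bound_rate(CPU_PER_TASK_S))
```

Under this reading both limits come out at about 100 tasks per second, which would explain why the three numbers appear together: the pool size, the timeout and the per-task CPU cost have to be tuned against each other and against what the backend can absorb.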
  • Lessons learned – Use profiler
    • Structure the code to make it possible to run all steps in one non-gevent enabled process
    • Carefully profile to make sure socket.recv becomes the main bottleneck.
    • The get_title crisis
    • Rule of thumb: load average < 1 to saturate 10M bandwidth
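Running every step in one ordinary process makes the standard profiler usable. The sketch below shows the idea with `cProfile` and `pstats`. `fetch` and `parse` are hypothetical stand-ins (no network access is made), and `parse` only loosely imitates the kind of CPU-bound title extraction the slide hints at.

```python
import cProfile
import io
import pstats


def fetch(url):
    """Stand-in for the network step; a real crawler would wait in socket.recv."""
    return b"<title>example</title>" * 1000


def parse(body):
    """Stand-in for CPU-bound work such as extracting a title."""
    return body.upper().count(b"<TITLE>")


def run_pipeline(urls):
    """All steps in one plain, non-gevent process."""
    return [parse(fetch(url)) for url in urls]


def profile_pipeline(urls, top=5):
    """Profile the pipeline and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_pipeline(urls)
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return report.getvalue()


if __name__ == "__main__":
    print(profile_pipeline(["http://example.com/"] * 10))
```

In a real run, the goal stated above is that the network wait dominates the report; if a parsing function shows up at the top instead, the process is CPU-bound and adding greenlets will not help.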
  • Twitter: @alexdong
    • trunk.ly/?q=from:alexdong+gevent