Kiwipycon2011 async-with-gevent-redis
 

Usage Rights

CC Attribution-ShareAlike License

    Kiwipycon2011 async-with-gevent-redis Presentation Transcript

    • Writing a distributed crawler system using gevent and redis Alex Dong @alexdong
    • Roadmap
      • Crawler: the unsung hero
      • Async 101
      • Gevent: the monkey king
      • Redis: data structure server
      • Lessons learned
    • Guess
      • How many links did Google index when it launched in 1998?
      • How many links today?
      • What project was Google employee #1 working on?
    • Talking about a crawler!
      • Get a url from task queue
      • DNS resolution
      • Request HTTP Header
      • Download full content
      • Store to local file store, database and index
      • Scheduling, throttling, status monitoring, scale up by flicking on more servers.
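
The steps on this slide can be sketched as a single function. This is only an illustration, not the speaker's code: the names `crawl_one`, `resolve`, `fetch_headers`, `fetch_body` and `store` are hypothetical, the network steps are passed in as callables so the sketch runs without network access, and skipping non-HTML responses is an assumption added for the example.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_one(queue, resolve, fetch_headers, fetch_body, store):
    """Run one pass of the pipeline; return the URL handled, or None if idle."""
    if not queue:
        return None
    url = queue.popleft()                      # 1. get a URL from the task queue
    host = urlsplit(url).hostname
    address = resolve(host)                    # 2. DNS resolution
    headers = fetch_headers(address, url)      # 3. request HTTP headers
    # Assumption for this sketch: only HTML responses are downloaded in full.
    if headers.get("Content-Type", "").startswith("text/html"):
        body = fetch_body(address, url)        # 4. download full content
        store(url, headers, body)              # 5. store to file store / database / index
    return url


# Stub implementations stand in for real DNS, HTTP and storage.
saved = {}
queue = deque(["http://trunk.ly/developers/"])
handled = crawl_one(
    queue,
    resolve=lambda host: "192.0.2.1",  # documentation-only IP address
    fetch_headers=lambda address, url: {"Content-Type": "text/html"},
    fetch_body=lambda address, url: "<title>stub</title>",
    store=lambda url, headers, body: saved.update({url: body}),
)
```

Keeping each step behind its own callable makes it easier to time, throttle or replace the steps separately.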
    • Async 101 – Why bother?
      • What's wrong with multithreading?
        • GIL
        • Threads yield on I/O and socket calls, but
        • Computationally expensive work will block
      • What about multiprocessing?
        • Memory efficiency
        • Context switch overhead
    • Async 101
      • Controller + worker model: register and callback
      • Cooperative multitasking
      • epollfd = epoll_create1(0);
      • epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev);
      • epoll_wait(epollfd, events, MAX_EVENTS, -1);
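
The register-and-callback model above can be tried from Python with the standard-library `selectors` module, which uses epoll on Linux. This is a minimal sketch rather than anything from the talk: the names `on_readable` and `received` are made up, and a local socket pair stands in for a real network connection.

```python
import selectors
import socket

selector = selectors.DefaultSelector()   # epoll on Linux, kqueue or select elsewhere
received = []


def on_readable(sock):
    """Callback run by the controller when `sock` has data waiting."""
    received.append(sock.recv(1024))


# A connected socket pair stands in for a real network connection.
reader, writer = socket.socketpair()
reader.setblocking(False)

# Register: "call on_readable when reader becomes readable" (cf. EPOLL_CTL_ADD).
selector.register(reader, selectors.EVENT_READ, data=on_readable)

writer.sendall(b"hello")

# One turn of the event loop (cf. epoll_wait): dispatch ready sockets to callbacks.
for key, _mask in selector.select(timeout=1):
    key.data(key.fileobj)

selector.unregister(reader)
selector.close()
reader.close()
writer.close()
```

The loop only calls a callback when its socket is ready, so one thread can serve many connections as long as each callback returns quickly.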
    • Gevent – Monkey King and Pool
        from gevent import monkey
        monkey.patch_all()
        import socket
        from gevent.pool import Pool
        worker_pool = Pool(SIZE)
        # get domain into payload
        # getaddrinfo also requires a port; 80 is assumed here
        worker_pool.spawn(socket.getaddrinfo, payload, 80)
    • Redis – Data Structure Server
      • High performance: 15,000 req/sec
        • Lock free, single process
        • Master/slave ready
      • Data structures
        • FIFO queue: Lists - LPOP, RPUSH
        • Working: Hashtable - HSET, HDEL, HEXISTS
        • One and only one: Sets - SADD, SPOP
    • Redis – Limit parallel requests
        > LPUSH "trunk.ly" "http://trunk.ly/developers/"
        > SADD "waiting" "trunk.ly"
        > HSET "processing" "trunk.ly" 1227266553
        > SPOP "domains" -> trunk.ly
        > RPOP "trunk.ly" -> …/developers/
        > HDEL "processing" "trunk.ly"
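
One way to read the commands on this slide is "at most one request in flight per domain". The sketch below models that idea with plain Python containers in place of a Redis server, so it runs anywhere. It is an interpretation, not the speaker's implementation: the function names `enqueue`, `claim` and `finish` are hypothetical, and a single set called `waiting` is assumed where the slide mentions both "waiting" and "domains".

```python
import time
from collections import defaultdict, deque

queues = defaultdict(deque)   # one list per domain (LPUSH / RPOP)
waiting = set()               # domains with queued work and nothing in flight (SADD / SPOP)
processing = {}               # domain -> start timestamp (HSET / HDEL)


def enqueue(domain, url):
    queues[domain].appendleft(url)            # LPUSH domain url
    if domain not in processing:              # HEXISTS processing domain
        waiting.add(domain)                   # SADD waiting domain


def claim():
    """Pick a free domain and return (domain, url), or None when nothing is ready."""
    if not waiting:
        return None
    domain = waiting.pop()                    # SPOP waiting
    processing[domain] = int(time.time())     # HSET processing domain <timestamp>
    return domain, queues[domain].pop()       # RPOP domain


def finish(domain):
    del processing[domain]                    # HDEL processing domain
    if queues[domain]:                        # more URLs queued for this domain
        waiting.add(domain)


enqueue("trunk.ly", "http://trunk.ly/developers/")
enqueue("trunk.ly", "http://trunk.ly/")
first = claim()      # takes the domain out of the waiting set
blocked = claim()    # nothing ready while trunk.ly is still being processed
finish("trunk.ly")
second = claim()     # the domain is available again
```

Against a real Redis server each step would be a separate command, so a production version would also need to think about what happens if a worker dies between two of them; the stored timestamp could be used to detect such stale entries.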
    • Lessons learned - Dashboard
      • Turning point: most important code we've written
      • 25% of the code is for status updates and monitoring
      • What's causing the pile-up?
        • Someone abusing the system?
        • DNS is down?
        • ISP's bandwidth?
        • Large file download?
        • Scheduler re-submitting tasks?
    • Lessons learned – Fine balance
      • Conflict between frontend and backend
      • Capacity planning
        • 10-second network timeout
        • 1000 parallel threadlets
        • 0.01 second per task
        • 500 reads / 100 writes hitting the backend
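
The slide does not say how these figures relate to each other. One possible reading, worked through below, is that the timeout and the pool size give a worst-case throughput, and the per-task time gives a ceiling for a single process. Treating "0.01 second per task" as non-yielding processing time is an assumption, and the variable names are made up for this sketch.

```python
NETWORK_TIMEOUT_S = 10        # from the slide
PARALLEL_THREADLETS = 1000    # from the slide
SECONDS_PER_TASK = 0.01       # from the slide; assumed to be non-yielding work

# Worst case: every request runs into the timeout, so each threadlet
# completes one task per timeout period.
worst_case_tasks_per_s = PARALLEL_THREADLETS / NETWORK_TIMEOUT_S

# Ceiling for one process if every task costs 0.01 s that cannot be overlapped.
single_process_ceiling = 1 / SECONDS_PER_TASK

print(worst_case_tasks_per_s, single_process_ceiling)
```

Under these assumptions both figures come out at about 100 tasks per second, which would fit the "fine balance" in the slide title.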
    • Lessons learned – Use profiler
      • Structure the code so that all steps can run in a single process without gevent enabled
      • Carefully profile to make sure socket.recv becomes the main bottleneck.
      • The get_title crisis
      • Rule of thumb: load average < 1 to saturate 10M bandwidth
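
Running every step in one plain process makes the standard-library profiler easy to apply. The sketch below shows the general approach only: `get_title` is named on the slide, but its body here is a hypothetical stand-in, and `run_all_steps` and the generated pages are invented for the example.

```python
import cProfile
import io
import pstats
import re


def get_title(html):
    """Hypothetical CPU-bound step: pull the <title> text out of a page."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def run_all_steps(pages):
    """Run every processing step sequentially, with no gevent involved."""
    return [get_title(page) for page in pages]


pages = ["<html><title>Page %d</title></html>" % i for i in range(1000)]

profiler = cProfile.Profile()
profiler.enable()
titles = run_all_steps(pages)
profiler.disable()

# Collect the ten most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If a parsing helper dominates a report like this, the process is CPU-bound rather than waiting on `socket.recv`, which is the situation the slide warns about.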
    • Twitter: @alexdong trunk.ly/?q=from:alexdong+gevent