SCALING DJANGO FOR X FACTOR
             MALCOLM BOX, DJUGL OCTOBER 2012
WHAT I’M TALKING ABOUT
  Scaling Django to >10K requests/s
  Caching, Counting and Cassandra
  Toolbox
ME
 Malcolm Box, CTO & Co-Founder

 @malcolmbox

 malcolm@tellybug.com

 http://tellybug.com
Making TV more entertaining
  Live interaction
  Highly social
  Unique content
WHO ARE YOU?
  Technical?
  Running Django?
  Scale?
THE CHALLENGE
  Millions of people watch the
  shows we work with

  TV tells them to buzz/clap/
  score....

  A giant DDOS is launched
  against our servers
HOW BIG?
  Peak loads of 10,000 requests/s
  Read/write mix
    Write-heavy workload - lots of user interactions
HOW BIG?

10K REQUESTS/S IS
 25,920,000,000
REQUESTS/MONTH
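
The monthly figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 10K requests/s peak is sustained around the clock for a 30-day month, which real TV-driven traffic will not do.

```python
# Assumption: the peak rate holds constantly for a 30-day month.
REQUESTS_PER_SECOND = 10_000
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_MONTH = 30

requests_per_month = REQUESTS_PER_SECOND * SECONDS_PER_DAY * DAYS_PER_MONTH
print(f"{requests_per_month:,}")  # 25,920,000,000
```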
ARCHITECTURE
  Entirely cloud based
  Nodes come and go - frequently!
  Automatic deployment direct from GitHub via Chef
  [Diagram components: The Internet, Static assets, HAProxy layer, Web layer, Cache, Cassandra Cluster, RDS MySQL, Chef, Monitor, Task Server (Amazon AWS eu-west-1); Logs, backups (Amazon S3)]
CACHING
  Cache as speedup or Cache as mission-critical?
  Use Django cache framework
    Pylibmc - consistent hashing and server death patches
  Problems as you scale up...
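
Why consistent hashing matters when cache nodes come and go can be shown with a toy hash ring. This is a minimal sketch, not the algorithm pylibmc or libmemcached actually use; the `HashRing` class, the server names and the replica count are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring; illustrative only."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> server name
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Each server is placed on the ring at many points to even out load.
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def server_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        index = bisect.bisect_left(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[index]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])  # hypothetical node names
keys = [f"key-{n}" for n in range(1000)]
before = {k: ring.server_for(k) for k in keys}
ring.remove("cache-b")  # simulate a cache node dying
after = {k: ring.server_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that lived on the dead node are remapped; with naive `hash(key) % number_of_servers` placement, most keys would move and the database would see a burst of cache misses.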
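CACHE PROBLEMS
  Cache miss behaviour
    Thundering herds are bad
  Key overload
  Server overload
  Dualcache - https://gist.github.com/953524

    from django.core.cache import cache

    def get_or_calculate(key):
        value = cache.get(key)
        if value is None:
            lock = False
            try:
                # add() only succeeds if the lock key does not already exist
                lock = cache.add(lock_key(key), 1)
                if lock:
                    # Do something expensive
                    new_value = calculate_new_value()
                    cache.set(key, new_value)
                    return new_value
            finally:
                if lock:
                    cache.delete(lock_key(key))
        # Still None here if another process holds the lock
        return value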
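
One way to avoid handing `None` to callers while another process recalculates is to keep a longer-lived stale copy next to the fresh one. The sketch below is not the Dualcache gist; `FakeCache`, `get_with_stale_fallback`, the key prefixes and the timeouts are all hypothetical, and the in-memory cache merely stands in for a memcache client.

```python
import time


class FakeCache:
    """Tiny in-memory stand-in for a memcache client (get/set/add/delete)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires = self._data.get(key, (None, 0))
        return value if expires > time.monotonic() else None

    def set(self, key, value, timeout):
        self._data[key] = (value, time.monotonic() + timeout)

    def add(self, key, value, timeout):
        # Like memcache add(): only succeeds if the key is absent.
        if self.get(key) is not None:
            return False
        self.set(key, value, timeout)
        return True

    def delete(self, key):
        self._data.pop(key, None)


def get_with_stale_fallback(cache, key, calculate, fresh_for=60, stale_for=3600):
    fresh = cache.get("fresh:" + key)
    if fresh is not None:
        return fresh
    if cache.add("lock:" + key, 1, 30):
        try:
            value = calculate()
            cache.set("fresh:" + key, value, fresh_for)
            cache.set("stale:" + key, value, stale_for)
            return value
        finally:
            cache.delete("lock:" + key)
    # Someone else is recalculating: serve the older copy instead of None.
    return cache.get("stale:" + key)


cache = FakeCache()
calls = []

def expensive():
    calls.append(1)
    return "leaderboard"

first = get_with_stale_fallback(cache, "top", expensive)
second = get_with_stale_fallback(cache, "top", expensive)  # served from cache
```

A completely cold key with the lock already held still returns `None`, so callers need to handle that case either way.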
COUNTING
  Hard to count a few things very fast
  And have real-time access to the latest result
  Things we tried:
    memcache
    Cassandra counters
  Final solution: Sharded counters
SHARDED COUNTERS
  Implemented in about 350 lines of Python
  To provide two basic operations!
    incr()
    get()
  Uses a combination of two layers of memcache and
  Cassandra to provide real-time, scalable counters
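
The general idea behind a sharded counter can be sketched in a few lines. This does not reproduce the 350-line implementation or its two memcache layers plus Cassandra; the `ShardedCounter` class, the shard count and the dict used as a backing store are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


class ShardedCounter:
    """Spread increments over several shard keys so no single key is hot."""

    def __init__(self, name, store, num_shards=16):
        self.name = name
        self.store = store          # stand-in for memcache/Cassandra
        self.num_shards = num_shards

    def _shard_key(self, shard):
        return f"{self.name}:shard:{shard}"

    def incr(self, amount=1):
        # Each write lands on a randomly chosen shard.
        shard = random.randrange(self.num_shards)
        self.store[self._shard_key(shard)] += amount

    def get(self):
        # A read sums every shard to get the total.
        return sum(self.store[self._shard_key(s)] for s in range(self.num_shards))


store = defaultdict(int)
claps = ShardedCounter("claps", store)
for _ in range(1000):
    claps.incr()
print(claps.get())  # 1000
```

Writes become cheap and contention-free at the cost of a more expensive read, which is why a real deployment would typically cache the summed total for a short time.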
CASSANDRA
  Core piece of our infrastructure
  Highly write-scalable
  Reads scaled from cache
  Using Acunu Cassandra for virtual nodes
  “Fake” Django ORM classes to make it feel more natural
    But no automatic join support
TOOLBOX
  Development
    Django Extensions, Celery, Piston (heavily forked), IPython, pycassa
    Tsung (load testing tool)
  Deployment:
    Fabric, Chef, Boto
  Operations
    Sentry, Gargoyle
THINGS THAT STILL SUCK
  Monitoring
Q&A
AND YES, WE’RE HIRING SO IF YOU’RE INTERESTED IN BUILDING EXTREMELY LARGE
                    DJANGO SITES THEN GET IN TOUCH
                        MALCOLM@TELLYBUG.COM


Editor's Notes

  • #5 XFactor 2012 app. Also Switch, BGT, Arab Voice, Unzipped...
  • #6 Questions for audience:
    - Technical?
    - Running Django in production?
    - Scale - 10 ... 100 .... 1000 .... 10000 .... 100000 req/s
  • #7-#12 XFactor - over 1M installs, 260 Million boos/claps
    BGT - 250K simultaneous users
  • #14 cf. Google serving 34K searches/s worldwide
  • #16 Cache is either a speedup for your site, or it is mission critical. The deciding factor is whether your DB can handle the load if the cache fails.
    At > 500 req/s, MySQL on AWS can’t keep up - hence cache is critical
  • #17 Discuss the code:
    - what happens if you return None? How does that affect upstream bits of code?
    - occasional latency problems if the value expires - everything fails for as long as calculate_new_value() takes to return
    Ghetto locking - if using it to protect e.g. DB writes, the lock key itself can end up as a problem
  • #19 Describe how sharded counters work
    - and the very interesting challenge of debugging!
  • #20 Used for write performance rather than data size - still more data in MySQL than Cassandra
  • #22 Mini rant - trouble finding any tool that copes with infrastructure that scales up and down
    Tried: Zabbix, Nagios, Cloudwatch, New Relic, Sensu, librato ... and probably some others
    Now building our own :(