DjangoCon 2010 Scaling Disqus

Scaling the World’s Largest Django App

Jason Yan                 David Cramer
@jasonyan                    @zeeg
What is DISQUS?


            dis·cuss • dĭ-skŭs'

We are a comment system with an emphasis on
           connecting communities




              http://disqus.com/about/
What is Scale?

[Chart: Number of Visitors over time; y-axis from 50M to 300M]



Our traffic at a glance
17,000 requests/second peak
450,000 websites
15 million profiles
75 million comments
250 million visitors (August 2010)
Our Challenges


• We can’t predict when things will happen
  • Random celebrity gossip
  • Natural disasters
• Discussions never expire
  • We can’t keep those millions of articles from
    2008 in the cache
  • You don’t know in advance (generally) where the
    traffic will be
  • Especially with dynamic paging, realtime, sorting,
    personal prefs, etc.
Our Challenges (cont’d)


• High availability
  • Not a destination site
  • Difficult to schedule maintenance
Server Architecture
Server Architecture - Load Balancing
• Load Balancing
  • Software, HAProxy
     • High performance, intelligent
       server availability checking
     • Bonus: Nice statistics reporting
• High Availability
  • heartbeat




                                                     Image Source: http://haproxy.1wt.eu/
Server Architecture



• ~100 Servers
 • 30% Web Servers (Apache + mod_wsgi)
 • 10% Databases (PostgreSQL)
 • 25% Cache Servers (memcached)
 • 20% Load Balancing / High Availability
   (HAProxy + heartbeat)
 • 15% Utility Servers (Python scripts)
Server Architecture - Web Servers


• Apache 2.2
• mod_wsgi
  • Using `maximum-requests` to
    plug memory leaks.

• Performance Monitoring
  • Custom middleware
    (PerformanceLogMiddleware)
  • Ships performance statistics
    (DB queries, external calls,
    template rendering, etc) through
    syslog
  • Collected and graphed through
    Ganglia
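
As a rough illustration of the idea, here is a minimal timing middleware. It is a sketch only, not the actual PerformanceLogMiddleware: it is written as plain WSGI middleware rather than Django middleware so it runs on its own, the logger name "perf" is an assumption, and it records only total time (not DB queries or template rendering). In production a logging.handlers.SysLogHandler could be attached to the logger so the lines reach syslog.

```python
import logging
import time

# Hypothetical logger name; attach a SysLogHandler to it to ship to syslog.
perf_log = logging.getLogger("perf")


class PerformanceLogMiddleware:
    """WSGI middleware that logs how long the wrapped app took to return."""

    def __init__(self, app, logger=perf_log):
        self.app = app
        self.logger = logger

    def __call__(self, environ, start_response):
        started = time.time()
        try:
            return self.app(environ, start_response)
        finally:
            # Time until the app returned its response iterable.
            elapsed_ms = (time.time() - started) * 1000.0
            self.logger.info(
                "path=%s total_ms=%.1f",
                environ.get("PATH_INFO", "-"),
                elapsed_ms,
            )


def demo_app(environ, start_response):
    # Stand-in for the real application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = PerformanceLogMiddleware(demo_app)
```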
Server Architecture - Database




• PostgreSQL
• Slony-I for Replication
  • Trigger-based
  • Read slaves for extra read capacity
  • Failover master database for high
    availability
Server Architecture - Database

• Make sure indexes fit in memory and
  measure I/O
 • High I/O generally means slow queries
   due to missing indexes or indexes not in
   buffer cache
• Log Slow Queries
 • syslog-ng + pgFouine + cron to automate
   slow query logging
Server Architecture - Database



• Use connection pooling
 • Django doesn’t do this for you
 • We use pgbouncer
 • Limits the maximum number of
   connections your database needs to
   handle
 • Save on costly opening and tearing down
   of new database connections
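
From Django's side, using a pooler mostly means pointing the database settings at it instead of at PostgreSQL. The fragment below is a sketch with assumed values: the host assumes pgbouncer runs locally on each web server, and 6432 is pgbouncer's conventional listen port.

```python
# settings.py fragment (illustrative values only)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "disqus",
        "HOST": "127.0.0.1",  # assumed: pgbouncer on the same machine
        "PORT": "6432",       # pgbouncer, rather than PostgreSQL's 5432
    }
}
```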
Our Data Model
Partitioning




• Fairly easy to implement, quick wins
• Done at the application level
  • Data is replayed by Slony
• Two methods of data separation
Vertical Partitioning
Vertical partitioning involves creating tables with fewer columns
  and using additional tables to store the remaining columns.



     Forums         Posts             Users         Sentry




          http://en.wikipedia.org/wiki/Partition_(database)
Pythonic Joins


            Allows us to separate datasets

posts = Post.objects.all()[0:25]

# store users in a dictionary based on primary key
users = dict(
    (u.pk, u) for u in 
    User.objects.filter(pk__in=set(p.user_id for p in posts))
)

# map users to their posts
for p in posts:
  p._user_cache = users.get(p.user_id)
Pythonic Joins (cont’d)



• Slower than at database level
    • But not enough that you should care
    • Trading performance for scale
• Allows us to separate data
    • Easy vertical partitioning
• More efficient caching
    • get_many, object-per-row cache
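
The caching point can be sketched as follows: because users are fetched by primary key, each one can be cached as its own row and looked up in bulk, with the database asked only for the misses. Everything here is hypothetical scaffolding (attach_related, DictCache and the fetch_from_db callable are not from the talk); a real version would use the Django cache's get_many/set_many and a queryset filtered with pk__in.

```python
def attach_related(objects, fk_attr, target_attr, cache, fetch_from_db):
    """Attach related rows to objects, using the cache before the database.

    cache: object with get_many(keys) -> dict and set_many(mapping)
    fetch_from_db: callable taking a set of ids and returning {id: row}
    """
    ids = set(getattr(o, fk_attr) for o in objects)
    found = cache.get_many(ids)
    missing = ids - set(found)
    if missing:
        fetched = fetch_from_db(missing)
        cache.set_many(fetched)  # one cache entry per row
        found.update(fetched)
    for o in objects:
        setattr(o, target_attr, found.get(getattr(o, fk_attr)))
    return objects


class DictCache:
    """In-memory stand-in for a memcached client."""

    def __init__(self):
        self.data = {}

    def get_many(self, keys):
        return {k: self.data[k] for k in keys if k in self.data}

    def set_many(self, mapping):
        self.data.update(mapping)


class Post:
    """Minimal stand-in for the Post model."""

    def __init__(self, pk, user_id):
        self.pk = pk
        self.user_id = user_id
```

A second call for the same posts is then served from the cache without touching the database.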
Designating Masters




• Alleviates some of the write load on your
  primary application master
• Masters exist under specific conditions:
  • application use case
  • partitioned data
• Database routers make this (fairly) easy
Routing by Application




class ApplicationRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None

        app_label = instance._meta.app_label

        return get_application_alias(app_label)
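
The slide does not show get_application_alias. One plausible shape, assuming a static mapping from app label to database alias (the labels and aliases below are made up for illustration):

```python
# Hypothetical mapping from Django app label to database alias.
APPLICATION_DATABASES = {
    "sentry": "sentry",
    "users": "users",
}


def get_application_alias(app_label, default="default"):
    """Return the database alias for an app, falling back to the primary."""
    return APPLICATION_DATABASES.get(app_label, default)
```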
Horizontal Partitioning
Horizontal partitioning (also known as sharding) involves splitting
               one set of data into different tables.



      Disqus      Your Blog            CNN        Telegraph




           http://en.wikipedia.org/wiki/Partition_(database)
Horizontal Partitions




• Some forums have very large datasets
• Partners need high availability
• Helps scale the write load on the master
• We rely more on vertical partitions
Routing by Partition

class ForumPartitionRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None

        forum_id = getattr(instance, 'forum_id', None)
        if not forum_id:
              return None

        return get_forum_alias(forum_id)


# What we used to do
Post.objects.filter(forum=forum)


# Now, making sure hints are available
forum.post_set.all()
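
get_forum_alias is likewise not shown. A minimal sketch, assuming a few very large forums get dedicated partitions and the rest are spread over a fixed list of shared ones (all ids and aliases here are invented):

```python
# Hypothetical: forums large enough to get their own partition.
DEDICATED_FORUMS = {42: "forum_cnn"}
# Hypothetical shared partitions for everything else.
SHARED_PARTITIONS = ["forums_a", "forums_b"]


def get_forum_alias(forum_id):
    """Map a forum id to the database alias that holds its data."""
    if forum_id in DEDICATED_FORUMS:
        return DEDICATED_FORUMS[forum_id]
    # Simple modulo placement; changing the list would move existing forums.
    return SHARED_PARTITIONS[forum_id % len(SHARED_PARTITIONS)]
```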
Optimizing QuerySets




• We really dislike raw SQL
  • It creates more work when dealing with
    partitions
• Built-in cache allows sub-slicing
  • But isn’t always needed
  • We removed this cache
Removing the Cache


• Django internally caches the results of your QuerySet
  • This adds additional memory overhead

     # 1 query
     qs = Model.objects.all()[0:100]

     # 0 queries (we don’t need this behavior)
     qs = qs[0:10]

     # 1 query
     qs = qs.filter(foo=bar)


• Many times you only need to view a result set once
• So we built SkinnyQuerySet
Removing the Cache (cont’d)

Optimizing memory usage by removing the cache
class SkinnyQuerySet(QuerySet):
    def __iter__(self):
        if self._result_cache is not None:
            # __len__ must have been run
            return iter(self._result_cache)

        has_run = getattr(self, 'has_run', False)
        if has_run:
            raise QuerySetDoubleIteration("...")
        self.has_run = True
        # We wanted .iterator() as the default
        return self.iterator()



                http://gist.github.com/550438
Atomic Updates




• Keeps your data consistent
• save() isn’t thread-safe
  • use update() instead
• Great for things like counters
  • But should be considered for all write
    operations
Atomic Updates (cont’d)


  Thread safety is impossible with .save()
Request 1

post = Post(pk=1)
# a moderator approves
post.approved = True
post.save()

Request 2

post = Post(pk=1)
# the author adjusts their message
post.message = 'Hello!'
post.save()
Atomic Updates (cont’d)


            So we need atomic updates
Request 1

post = Post(pk=1)
# a moderator approves
Post.objects.filter(pk=post.pk).update(approved=True)

Request 2

post = Post(pk=1)
# the author adjusts their message
Post.objects.filter(pk=post.pk).update(message='Hello!')
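
The difference between the two approaches can be simulated without a database. In this sketch a dict stands in for the posts table; save() writes every column back (which, as far as we know, is how Model.save() behaved in Django at the time), while update() writes only the named columns. The helper names are ours.

```python
# A single "posts" row, standing in for the database table.
TABLE = {1: {"approved": False, "message": "Hi"}}


def load(pk):
    """Read the whole row into a private copy, like fetching a model."""
    return dict(TABLE[pk])


def save(pk, row):
    """Write every column back, like a full-row save()."""
    TABLE[pk] = dict(row)


def update(pk, **changes):
    """Write only the named columns, like QuerySet.update()."""
    TABLE[pk].update(changes)


def race_with_save():
    TABLE[1] = {"approved": False, "message": "Hi"}
    moderator = load(1)
    author = load(1)  # both requests read before either writes
    moderator["approved"] = True
    save(1, moderator)
    author["message"] = "Hello!"
    save(1, author)  # overwrites approved=True with the stale False
    return dict(TABLE[1])


def race_with_update():
    TABLE[1] = {"approved": False, "message": "Hi"}
    update(1, approved=True)
    update(1, message="Hello!")
    return dict(TABLE[1])
```

With save() the moderator's approval is silently lost; with update() both changes survive regardless of ordering.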
Atomic Updates (cont’d)


           A better way to approach updates
def update(obj, using=None, **kwargs):
    """
    Updates specified attributes on the current instance.
    """
    assert obj, "Instance has not yet been created."
    # Parentheses let the chained call span several lines.
    (obj.__class__._base_manager.using(using)
                                .filter(pk=obj.pk)
                                .update(**kwargs))
    for k, v in kwargs.items():
        if isinstance(v, ExpressionNode):
            # Expressions are evaluated in the database, so the
            # new value is not known here. Not implemented.
            continue
        setattr(obj, k, v)



http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py
Delayed Signals




• Queueing low priority tasks
 • even if they’re fast
• Asynchronous (Delayed) signals
 • very friendly to the developer
 • ..but not as friendly as real signals
Delayed Signals (cont’d)



  We send a specific serialized version
   of the model for delayed signals

from disqus.common.signals import delayed_save

def my_func(data, sender, created, **kwargs):
    print(data['id'])

delayed_save.connect(my_func, sender=Post)




 This is all handled through our Queue
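
How such a signal might flow can be sketched with the standard library alone. This is not the disqus.common.signals implementation: queue.Queue stands in for a real message queue, JSON stands in for whatever serialization is actually used, and connect, send_delayed and run_worker_once are hypothetical names.

```python
import json
import queue

task_queue = queue.Queue()  # stand-in for a real message queue
receivers = {}              # sender name -> list of callables


def connect(func, sender):
    """Register a receiver for delayed signals from the given sender."""
    receivers.setdefault(sender, []).append(func)


def send_delayed(sender, instance_data, created):
    """Called during the request: enqueue a serialized payload and return."""
    task_queue.put(json.dumps(
        {"sender": sender, "data": instance_data, "created": created}
    ))


def run_worker_once():
    """Called in a worker process: deliver one queued signal to receivers."""
    payload = json.loads(task_queue.get_nowait())
    for func in receivers.get(payload["sender"], []):
        func(data=payload["data"], sender=payload["sender"],
             created=payload["created"])
```

Receivers get a plain dict rather than a live model instance, which is why the receiver on the slide reads data['id'].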
Caching




• Memcached
• Use pylibmc (newer libMemcached-based)
 • Ticket #11675 (add pylibmc support)
 • Third party applications:
   • django-newcache, django-pylibmc
Caching (cont’d)



• libMemcached / pylibmc is configurable with
  “behaviors”.
• Memcached “single point of failure”
  • Distributed system, but we must take
    precautions.
  • Connection timeout to memcached can stall
    requests.
    • Use `_auto_eject_hosts` and
      `_retry_timeout` behaviors to prevent
      reconnecting to dead caches.
Caching (cont’d)



   • Default (naive) hashing behavior
     • The hashed cache key, modulo the number of servers,
       gives the index into the server list.
     • Removal of a server causes majority of
       cache keys to be remapped to new
       servers.

CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
key = 'my_cache_key'
cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
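
To see how much moves, the sketch below counts the keys that land on a different server when a pool shrinks from four servers to three. The key names and addresses are made up, and MD5 replaces the built-in hash() only so the result is the same on every run. With modulo placement, about three quarters of keys are expected to move in this case.

```python
import hashlib


def stable_hash(key):
    """Deterministic hash (the built-in hash() can vary between runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def pick_server(key, servers):
    """Naive placement: hash modulo the number of servers."""
    return servers[stable_hash(key) % len(servers)]


def remapped_fraction(keys, before, after):
    """Fraction of keys whose server differs between the two lists."""
    moved = sum(
        1 for k in keys if pick_server(k, before) != pick_server(k, after)
    )
    return moved / float(len(keys))


keys = ["key-%d" % i for i in range(10000)]
four = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
three = four[:3]  # the last server is removed
fraction = remapped_fraction(keys, four, three)
```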
Caching (cont’d)

• Better approach: consistent hashing
  • libMemcached (pylibmc) uses libketama
    (http://tinyurl.com/lastfm-libketama)


  • Addition / removal of a cache server
    remaps (K/n) cache keys
    (where K=number of keys and n=number of servers)




                 Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
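
A minimal hash ring shows why fewer keys move. This is a sketch of the general technique, not libketama itself: each server is placed at many points on a ring, and a key belongs to the first point clockwise from its own hash. The number of points per server and the MD5-based hash are assumptions made for the example.

```python
import bisect
import hashlib


def stable_hash(value):
    """Deterministic integer hash of a string."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, points_per_server=100):
        # Each server owns many points so load spreads evenly.
        self._ring = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def get_server(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[index][1]
```

Removing a server deletes only that server's points, so the only keys that change owner are the ones it used to hold; every other key keeps its server.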
Caching (cont’d)


• Thundering herd (stampede) problem
  • Invalidating a heavily accessed cache key causes many
    clients to refill cache.
  • But everyone refetching to fill the cache from the data
    store or reprocessing data can cause things to get even
    slower.
  • Most times, it’s ideal to return the previously invalidated
    cache value and let a single client refill the cache.
  • django-newcache or MintCache
    (http://djangosnippets.org/snippets/793/) will do this for you.
  • Filling the cache on invalidation, instead of deleting
    from it, also helps to prevent the thundering herd
    problem.
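
The "return the stale value, let one client refill" idea can be sketched like this. It is written in the spirit of MintCache but is not its code: a dict stands in for memcached, the class and parameter names are ours, and the clock is injectable so the behaviour can be exercised without sleeping. Against a real memcached the read-then-write below is not atomic, so a few clients may still slip through.

```python
import time


class StaleWhileRefillCache:
    """Stores values with a soft expiry so one caller refills at a time."""

    def __init__(self, refill_window=30, clock=time.time):
        self._data = {}  # stand-in for memcached
        self._refill_window = refill_window
        self._clock = clock

    def set(self, key, value, timeout):
        # A real backend would also set a longer hard expiry.
        self._data[key] = (value, self._clock() + timeout)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, soft_expiry = entry
        if self._clock() >= soft_expiry:
            # Stale: extend the soft expiry so other callers keep getting
            # the old value, and report a miss to this caller only.
            self._data[key] = (value, self._clock() + self._refill_window)
            return None
        return value
```

The first caller after expiry sees a miss and recomputes; everyone else keeps reading the old value until the refill lands or the refill window passes.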
Transactions


• TransactionMiddleware got us started, but
  down the road became a burden
• For postgresql_psycopg2, there’s a database
  option, OPTIONS['autocommit']
  • Each query is in its own transaction. This
    means each request won’t start in a
    transaction.
    • But sometimes we want transactions
      (e.g., saving multiple objects and rolling
      back on error)
Transactions (cont’d)


• Tips:
  • Use autocommit for read slave databases.
  • Isolate slow functions (e.g., external calls,
    template rendering) from transactions.
  • Selective autocommit
    • Most read-only views don’t need to be
      in transactions.
    • Start in autocommit and switch to a
      transaction on write.
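
The "autocommit by default, transaction on write" pattern can be demonstrated with the standard library. This sketch uses SQLite purely as a stand-in for PostgreSQL (isolation_level=None puts the sqlite3 module in autocommit mode); the table and function names are invented.

```python
import sqlite3

# Autocommit mode: no transaction is open unless we start one explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, message TEXT NOT NULL)"
)


def count_posts():
    """Read-only work: runs without an explicit transaction."""
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


def save_all(messages):
    """Insert several rows atomically; roll back all of them on any error."""
    conn.execute("BEGIN")
    try:
        for message in messages:
            conn.execute(
                "INSERT INTO posts (message) VALUES (?)", (message,)
            )
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
```

A failure partway through save_all leaves the table as it was, while plain reads never pay for a transaction.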
Scaling the Team




• Small team of engineers
• Monthly users / developers = 40m
• Which means writing tests..
• ..and having a dead simple workflow
Keeping it Simple




• A developer can be up and running in a few
  minutes
 • assuming postgres and other server
   applications are already installed
 • pip, virtualenv
 • settings.py
Setting Up Local




1. createdb -E UTF-8 disqus
2. git clone git://repo
3. mkvirtualenv disqus
4. pip install -U -r requirements.txt
5. ./manage.py syncdb && ./manage.py migrate
Sane Defaults


settings.py
from disqus.conf.settings.default import *

try:
    from local_settings import *
except ImportError:
    import sys, traceback
    sys.stderr.write("Can't find 'local_settings.py'\n")
    sys.stderr.write("\nThe exception was:\n\n")
    traceback.print_exc()



local_settings.py
from disqus.conf.settings.dev import *
Continuous Integration



• Daily deploys with Fabric
  • several times an hour on some days
• Hudson keeps our builds going
  • combined with Selenium
• Post-commit hooks for quick testing
  • like Pyflakes
• Reverting to a previous version is a matter of
  seconds
Continuous Integration (cont’d)

 Hudson makes integration easy
Testing



• It’s not fun breaking things when you’re the new
  guy
• Our testing process is fairly heavy
• 70k (Python) LOC, 73% coverage, 20 min suite
• Custom Test Runner (unittest)
  • We needed XML, Selenium, Query Counts
  • Database proxies (for read-slave testing)
  • Integration with our Queue
Testing (cont’d)


Query Counts
# failures yield a dump of queries
def test_read_slave(self):
    Model.objects.using('read_slave').count()
    self.assertQueryCount(1, 'read_slave')


Selenium
def test_button(self):
    self.selenium.click('//a[@class="dsq-button"]')



Queue Integration
class WorkerTest(DisqusTest):
    workers = ['fire_signal']

    def test_delayed_signal(self):
        ...
Bug Tracking



• Switched from Trac to Redmine
  • We wanted Subtasks
• Emailing exceptions is a bad idea
  • Even if it’s localhost
• Previously using django-db-log to aggregate
  errors to a single point
• We’ve overhauled db log and are releasing
  Sentry
django-sentry

Groups messages intelligently




   http://github.com/dcramer/django-sentry
django-sentry (cont’d)

Similar feel to Django’s debugger




    http://github.com/dcramer/django-sentry
Feature Switches



• We needed a safety in case a feature wasn’t
  performing well at peak
  • it had to respond without delay, globally,
    and without writing to disk
• Allows us to work out of trunk (mostly)
• Easy to release new features to a portion of
  your audience
• Also nice for “Labs” type projects
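
A feature switch with a percentage rollout might look roughly like the sketch below. It is not the implementation linked in the references: the switch names, the in-memory SWITCHES table and is_active are hypothetical, and how the table reaches every web server without a disk write is left out.

```python
import hashlib

# Hypothetical in-memory switch table.
SWITCHES = {
    "realtime": {"enabled": True, "percent": 100},
    "new_sorting": {"enabled": True, "percent": 10},
    "labs_widget": {"enabled": False, "percent": 100},
}


def is_active(name, user_id):
    """Return True if the feature is on for this user."""
    switch = SWITCHES.get(name)
    if not switch or not switch["enabled"]:
        return False
    # A stable bucket in 0..99 per (switch, user) pair, so a given user
    # always gets the same answer for a given switch.
    digest = hashlib.md5(("%s:%s" % (name, user_id)).encode("utf-8"))
    bucket = int(digest.hexdigest(), 16) % 100
    return bucket < switch["percent"]
```

Flipping "enabled" to False turns the feature off for everyone at once; raising "percent" widens the audience gradually.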
Feature Switches (cont’d)
Final Thoughts


• The language (usually) isn’t your problem
• We like Django
  • But we maintain local patches
• Some tickets don’t have enough of a following
  • Patches, like #17, completely change
    Django..
  • ..arguably in a good way
• Others don’t have champions
      Ticket #17 describes making the ORM an identity mapper
Housekeeping




       Birds of a Feather
   Want to learn from others about
  performance and scaling problems?
           Or play some StarCraft 2?


          We’re Hiring!

DISQUS is looking for amazing engineers
Questions
References


django-sentry
http://github.com/dcramer/django-sentry

Our Feature Switches
http://cl.ly/2FYt

Andy McCurdy’s update()
http://github.com/andymccurdy/django-tips-and-tricks

Our PyFlakes Fork
http://github.com/dcramer/pyflakes

SkinnyQuerySet
http://gist.github.com/550438

django-newcache
http://github.com/ericflo/django-newcache

attach_foreignkey (Pythonic Joins)
http://gist.github.com/567356
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
Vadym Kazulkin75 views
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen... by NUS-ISS
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
NUS-ISS28 views
.conf Go 2023 - Data analysis as a routine by Splunk
.conf Go 2023 - Data analysis as a routine.conf Go 2023 - Data analysis as a routine
.conf Go 2023 - Data analysis as a routine
Splunk93 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma17 views
Future of Learning - Khoong Chan Meng by NUS-ISS
Future of Learning - Khoong Chan MengFuture of Learning - Khoong Chan Meng
Future of Learning - Khoong Chan Meng
NUS-ISS33 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10209 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software225 views
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu... by NUS-ISS
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
NUS-ISS37 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman27 views

DjangoCon 2010 Scaling Disqus

  • 1. Scaling the World’s Largest Django App Jason Yan David Cramer @jasonyan @zeeg
  • 3. What is DISQUS? dis·cuss • dĭ-skŭs' We are a comment system with an emphasis on connecting communities http://disqus.com/about/
  • 4. What is Scale? Number of Visitors 300M 250M 200M 150M 100M 50M Our traffic at a glance 17,000 requests/second peak 450,000 websites 15 million profiles 75 million comments 250 million visitors (August 2010)
  • 5. Our Challenges • We can’t predict when things will happen • Random celebrity gossip • Natural disasters • Discussions never expire • We can’t keep those millions of articles from 2008 in the cache • You don’t know in advance (generally) where the traffic will be • Especially with dynamic paging, realtime, sorting, personal prefs, etc.
  • 6. Our Challenges (cont’d) • High availability • Not a destination site • Difficult to schedule maintenance
  • 8. Server Architecture - Load Balancing • Load Balancing • High Availability • Software, HAProxy • heartbeat • High performance, intelligent server availability checking • Bonus: Nice statistics reporting Image Source: http://haproxy.1wt.eu/
  • 9. Server Architecture • ~100 Servers • 30% Web Servers (Apache + mod_wsgi) • 10% Databases (PostgreSQL) • 25% Cache Servers (memcached) • 20% Load Balancing / High Availability (HAProxy + heartbeat) • 15% Utility Servers (Python scripts)
  • 10. Server Architecture - Web Servers • Apache 2.2 • mod_wsgi • Using `maximum-requests` to plug memory leaks. • Performance Monitoring • Custom middleware (PerformanceLogMiddleware) • Ships performance statistics (DB queries, external calls, template rendering, etc) through syslog • Collected and graphed through Ganglia
  • 11. Server Architecture - Database • PostgreSQL • Slony-I for Replication • Trigger-based • Read slaves for extra read capacity • Failover master database for high availability
  • 12. Server Architecture - Database • Make sure indexes fit in memory and measure I/O • High I/O generally means slow queries due to missing indexes or indexes not in buffer cache • Log Slow Queries • syslog-ng + pgFouine + cron to automate slow query logging
  • 13. Server Architecture - Database • Use connection pooling • Django doesn’t do this for you • We use pgbouncer • Limits the maximum number of connections your database needs to handle • Save on costly opening and tearing down of new database connections
  • 15. Partitioning • Fairly easy to implement, quick wins • Done at the application level • Data is replayed by Slony • Two methods of data separation
  • 16. Vertical Partitioning Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns. Forums Posts Users Sentry http://en.wikipedia.org/wiki/Partition_(database)
  • 17. Pythonic Joins Allows us to separate datasets posts = Post.objects.all()[0:25] # store users in a dictionary based on primary key users = dict( (u.pk, u) for u in User.objects.filter(pk__in=set(p.user_id for p in posts)) ) # map users to their posts for p in posts: p._user_cache = users.get(p.user_id)
  • 18. Pythonic Joins (cont’d) • Slower than at database level • But not enough that you should care • Trading performance for scale • Allows us to separate data • Easy vertical partitioning • More efficient caching • get_many, object-per-row cache
  • 19. Designating Masters • Alleviates some of the write load on your primary application master • Masters exist under specific conditions: • application use case • partitioned data • Database routers make this (fairly) easy
  • 20. Routing by Application class ApplicationRouter(object): def db_for_read(self, model, **hints): instance = hints.get('instance') if not instance: return None app_label = instance._meta.app_label return get_application_alias(app_label)
  • 21. Horizontal Partitioning Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables. Disqus Your Blog CNN Telegraph http://en.wikipedia.org/wiki/Partition_(database)
  • 22. Horizontal Partitions • Some forums have very large datasets • Partners need high availability • Helps scale the write load on the master • We rely more on vertical partitions
  • 23. Routing by Partition class ForumPartitionRouter(object): def db_for_read(self, model, **hints): instance = hints.get('instance') if not instance: return None forum_id = getattr(instance, 'forum_id', None) if not forum_id: return None return get_forum_alias(forum_id) # What we used to do Post.objects.filter(forum=forum) # Now, making sure hints are available forum.post_set.all()
  • 24. Optimizing QuerySets • We really dislike raw SQL • It creates more work when dealing with partitions • Built-in cache allows sub-slicing • But isn’t always needed • We removed this cache
  • 25. Removing the Cache • Django internally caches the results of your QuerySet • This adds additional memory overhead # 1 query qs = Model.objects.all()[0:100] # 0 queries (we don’t need this behavior) qs = qs[0:10] # 1 query qs = qs.filter(foo=bar) • Many times you only need to view a result set once • So we built SkinnyQuerySet
  • 26. Removing the Cache (cont’d) Optimizing memory usage by removing the cache class SkinnyQuerySet(QuerySet): def __iter__(self): if self._result_cache is not None: # __len__ must have been run return iter(self._result_cache) has_run = getattr(self, 'has_run', False) if has_run: raise QuerySetDoubleIteration("...") self.has_run = True # We wanted .iterator() as the default return self.iterator() http://gist.github.com/550438
  • 27. Atomic Updates • Keeps your data consistent • save() isn’t thread-safe • use update() instead • Great for things like counters • But should be considered for all write operations
  • 28. Atomic Updates (cont’d) Thread safety is impossible with .save() Request 1 post = Post(pk=1) # a moderator approves post.approved = True post.save() Request 2 post = Post(pk=1) # the author adjusts their message post.message = 'Hello!' post.save()
  • 29. Atomic Updates (cont’d) So we need atomic updates Request 1 post = Post(pk=1) # a moderator approves Post.objects.filter(pk=post.pk) .update(approved=True) Request 2 post = Post(pk=1) # the author adjusts their message Post.objects.filter(pk=post.pk) .update(message='Hello!')
  • 30. Atomic Updates (cont’d) A better way to approach updates def update(obj, using=None, **kwargs): """ Updates specified attributes on the current instance. """ assert obj, "Instance has not yet been created." obj.__class__._base_manager.using(using) .filter(pk=obj) .update(**kwargs) for k, v in kwargs.iteritems(): if isinstance(v, ExpressionNode): # NotImplemented continue setattr(obj, k, v) http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py
  • 31. Delayed Signals • Queueing low priority tasks • even if they’re fast • Asynchronous (Delayed) signals • very friendly to the developer • ..but not as friendly as real signals
  • 32. Delayed Signals (cont’d) We send a specific serialized version of the model for delayed signals from disqus.common.signals import delayed_save def my_func(data, sender, created, **kwargs): print data['id'] delayed_save.connect(my_func, sender=Post) This is all handled through our Queue
  • 33. Caching • Memcached • Use pylibmc (newer libMemcached-based) • Ticket #11675 (add pylibmc support) • Third party applications: • django-newcache, django-pylibmc
  • 34. Caching (cont’d) • libMemcached / pylibmc is configurable with “behaviors”. • Memcached “single point of failure” • Distributed system, but we must take precautions. • Connection timeout to memcached can stall requests. • Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches.
  • 35. Caching (cont’d) • Default (naive) hashing behavior • The hashed cache key, modulo the number of servers, is the index into the server list. • Removal of a server causes the majority of cache keys to be remapped to new servers. CACHE_SERVERS = ['10.0.0.1', '10.0.0.2'] key = 'my_cache_key' cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
  • 36. Caching (cont’d) • Better approach: consistent hashing • libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama) • Addition / removal of a cache server remaps (K/n) cache keys (where K=number of keys and n=number of servers) Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
  • 37. Caching (cont’d) • Thundering herd (stampede) problem • Invalidating a heavily accessed cache key causes many clients to refill cache. • But everyone refetching to fill the cache from the data store or reprocessing data can cause things to get even slower. • Most times, it’s ideal to return the previously invalidated cache value and let a single client refill the cache. • django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you. • Filling the cache on invalidation, instead of deleting from it, also helps to prevent the thundering herd problem.
  • 38. Transactions • TransactionMiddleware got us started, but down the road became a burden • For postgresql_psycopg2, there’s a database option, OPTIONS['autocommit'] • Each query is in its own transaction. This means each request won’t start in a transaction. • But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)
  • 39. Transactions (cont’d) • Tips: • Use autocommit for read slave databases. • Isolate slow functions (e.g., external calls, template rendering) from transactions. • Selective autocommit • Most read-only views don’t need to be in transactions. • Start in autocommit and switch to a transaction on write.
  • 40. Scaling the Team • Small team of engineers • Monthly users / developers = 40m • Which means writing tests.. • ..and having a dead simple workflow
  • 41. Keeping it Simple • A developer can be up and running in a few minutes • assuming postgres and other server applications are already installed • pip, virtualenv • settings.py
  • 42. Setting Up Local 1. createdb -E UTF-8 disqus 2. git clone git://repo 3. mkvirtualenv disqus 4. pip install -U -r requirements.txt 5. ./manage.py syncdb && ./manage.py migrate
  • 43. Sane Defaults settings.py from disqus.conf.settings.default import * try: from local_settings import * except ImportError: import sys, traceback sys.stderr.write("Can't find 'local_settings.py'\n") sys.stderr.write("\nThe exception was:\n\n") traceback.print_exc() local_settings.py from disqus.conf.settings.dev import *
  • 44. Continuous Integration • Daily deploys with Fabric • several times an hour on some days • Hudson keeps our builds going • combined with Selenium • Post-commit hooks for quick testing • like Pyflakes • Reverting to a previous version is a matter of seconds
  • 45. Continuous Integration (cont’d) Hudson makes integration easy
  • 46. Testing • It’s not fun breaking things when you’re the new guy • Our testing process is fairly heavy • 70k (Python) LOC, 73% coverage, 20 min suite • Custom Test Runner (unittest) • We needed XML, Selenium, Query Counts • Database proxies (for read-slave testing) • Integration with our Queue
  • 47. Testing (cont’d) Query Counts # failures yield a dump of queries def test_read_slave(self): Model.objects.using('read_slave').count() self.assertQueryCount(1, 'read_slave') Selenium def test_button(self): self.selenium.click('//a[@class="dsq-button"]') Queue Integration class WorkerTest(DisqusTest): workers = ['fire_signal'] def test_delayed_signal(self): ...
  • 48. Bug Tracking • Switched from Trac to Redmine • We wanted Subtasks • Emailing exceptions is a bad idea • Even if it’s localhost • Previously using django-db-log to aggregate errors to a single point • We’ve overhauled db log and are releasing Sentry
  • 49. django-sentry Groups messages intelligently http://github.com/dcramer/django-sentry
  • 50. django-sentry (cont’d) Similar feel to Django’s debugger http://github.com/dcramer/django-sentry
  • 51. Feature Switches • We needed a safety in case a feature wasn’t performing well at peak • it had to respond without delay, globally, and without writing to disk • Allows us to work out of trunk (mostly) • Easy to release new features to a portion of your audience • Also nice for “Labs” type projects
  • 53. Final Thoughts • The language (usually) isn’t your problem • We like Django • But we maintain local patches • Some tickets don’t have enough of a following • Patches, like #17, completely change Django.. • ..arguably in a good way • Others don’t have champions Ticket #17 describes making the ORM an identity mapper
  • 54. Housekeeping Birds of a Feather Want to learn from others about performance and scaling problems? Or play some StarCraft 2? We’re Hiring! DISQUS is looking for amazing engineers
  • 56. References django-sentry http://github.com/dcramer/django-sentry Our Feature Switches http://cl.ly/2FYt Andy McCurdy’s update() http://github.com/andymccurdy/django-tips-and-tricks Our PyFlakes Fork http://github.com/dcramer/pyflakes SkinnyQuerySet http://gist.github.com/550438 django-newcache http://github.com/ericflo/django-newcache attach_foreignkey (Pythonic Joins) http://gist.github.com/567356
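
The difference between the naive modulo hashing on slide 35 and the consistent hashing on slide 36 can be illustrated with a small, standalone sketch. This is not libketama's actual algorithm: the `HashRing` class, the `stable_hash` helper, the 100 points per server, and the sample keys and server addresses are all assumptions made for illustration. It only shows the general idea that losing one cache server remaps far fewer keys on a hash ring than with modulo indexing.

```python
import bisect
import hashlib


def stable_hash(value):
    """Deterministic integer hash (Python's built-in hash() can vary between runs)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_server(key, servers):
    """Naive approach: index into the server list by hash modulo its length."""
    return servers[stable_hash(key) % len(servers)]


class HashRing(object):
    """Simplified consistent-hash ring (illustrative only, not libketama)."""

    def __init__(self, servers, replicas=100):
        # Each server owns `replicas` points scattered around the ring.
        self._points = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(replicas)
        )
        self._hashes = [point_hash for point_hash, _ in self._points]

    def get_server(self, key):
        # Pick the first point at or after the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._hashes, stable_hash(key)) % len(self._hashes)
        return self._points[idx][1]


def count_remapped(keys, before, after):
    """Count keys whose server differs between two key -> server functions."""
    return sum(1 for key in keys if before(key) != after(key))


SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
REMAINING = SERVERS[:-1]  # simulate losing the last cache server
KEYS = ["cache_key_%d" % i for i in range(1000)]

modulo_moved = count_remapped(
    KEYS,
    lambda key: modulo_server(key, SERVERS),
    lambda key: modulo_server(key, REMAINING),
)
old_ring, new_ring = HashRing(SERVERS), HashRing(REMAINING)
ring_moved = count_remapped(KEYS, old_ring.get_server, new_ring.get_server)

print("modulo remapped: %d of %d" % (modulo_moved, len(KEYS)))
print("ring remapped:   %d of %d" % (ring_moved, len(KEYS)))
```

On the ring, the only keys that move are the ones that lived on the removed server, which is roughly the K/n figure from slide 36. With modulo indexing, most keys change servers and the cache is effectively flushed.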

Editor's Notes

  1. Hi. I'm Jason (and I'm David), and we're from Disqus.
  2. Show of hands, How many of you know what DISQUS is?
  3. For those of you who are not familiar with us, DISQUS is a comment system that focuses on connecting communities. We power discussions on such sites as CNN, IGN, and more recently Engadget and TechCrunch. Our company was founded back in 2007 by my co-founder, Daniel Ha, and me, when we started working out of our dorm room. Our decision to use Django came down primarily to our dislike for PHP, which we were previously using. Since then, we've grown Disqus to over 250 million visitors a month.
  4. We've peaked at over 17,000 requests per second, to Django, and we currently power comments on nearly half a million websites which accounts for more than 15 million profiles who have left over 75 million comments.
  5. As you can imagine we have some big challenges when it comes to scaling a large Django application. For one, it’s hard to predict when events happen like last year with Michael Jackson’s death, and more recently, the Gulf Oil Spill. Another challenge we have is the fact that discussions never expire. When you visit that blog post from 2008 we have to be ready to serve those comments immediately. Not only does THAT make caching difficult, but we also have to deal with things such as dynamic paging, realtime commenting, and other personal preferences. This makes it even more important to be able to serve those quickly without relying on the cache.
  6. So we also have some interesting infrastructure problems when it comes to scaling Disqus. We're not a destination website, so if we go down, it affects other sites as well as ours. Because of this, it's difficult for us to schedule maintenance, so we face some interesting scaling and availability challenges.
  7. As you can see, we have tried to keep the stack pretty thin. This is because, as we've learned, the more services we try to add, the more difficult it is to support. And especially because we have a small team, this becomes difficult to manage. So we use DNS load balancing to spread the requests to multiple HAProxy servers which are our software load balancers. These proxy requests to our backend app servers which run mod_wsgi. We use memcache for caching, and we have a custom wrapper using syslog for our queue. For our data store, we use PostgreSQL, and for replication, we use Slony for failover and read slaves.
  8. As I said, we use HAProxy for HTTP load balancing. It's a high performance software load balancer with intelligent failure detection. It also provides you with nice statistics of your requests. We use heartbeat for high availability and we have it take over the IP address of the down machine.
  9. We have about 100GB of cache. Because of our high availability requirements, 20% of our servers are allocated to high availability and load balancing.
  10. Our web servers are pretty standard. We use mod_wsgi mostly because it just works. Performance-wise, you're really going to be bottlenecked on the application. The cool thing we do is that we actually have a custom middleware that does performance monitoring. What this does is ship data from our application about external calls like database and cache calls, and we collect it and graph it with Ganglia.
  11. The more interesting aspect of our server architecture is how we have our database set up. As I mentioned, we use Postgres as our database. Honestly, we used it because Django recommended it, and my recommendation is that if you’re not already an expert in a database, you're better off going with Postgres. We use Slony for replication. Slony is trigger-based, which means that every write is captured and stored in a log table, and those events are replayed to slave databases. This is nice compared to other methods such as log shipping because it allows us to have flexible schemas across read slaves. For example, some of our read slaves have different indexes. We also use Slony for failover for high availability.
  12. There are a few things we do to keep our database healthy. We keep our indexes in memory, and when we can't, we partition our data. We also have application-specific indexes on our read slaves. Another important thing we've done is measure I/O. Any time we've seen high I/O, it's usually because we're missing indexes or indexes aren't fitting in memory. Lastly, we monitor slow queries. We send logs to pgFouine via syslog, which generates a nice report showing you which queries are the slowest.
  13. The last thing we've found to be really helpful is switching to a database connection pool. Remember, Django doesn't do this for you. We use pgbouncer for this, and there are a few easy wins from using it. One is that it limits the maximum connections to the database, so it doesn't have to handle as many concurrent connections. Secondly, you save the cost of opening and tearing down new connections per request.
  14. Moving on to our application, we’ve found that most of the struggle is with the database layer. We’ve got a pretty standard layout if you’re familiar with forums. Forum has many threads, which has many posts. Posts use an adjacency list model, and also reference Users. With this kind of data model, one of our quickest wins has been the ability to partition data.
  15. It’s almost entirely done at the application level, which makes it fairly easy to implement. The only thing not handled by the app is replication, and Slony does that for us. We handle partitioning in a couple of ways.
  16. The first of which are vertical partitions. This is probably the simplest thing you can implement in your application. Kill off your joins and spread out your applications on multiple databases. Some database engines might make this easier than others, but Slony allows us to easily replicate very specific data.
  17. Using this method you’ll need to handle joins in your Python application. We do this by performing two separate queries and mapping the foreign keys to the parent objects. For us the easiest way has been to throw them into a dictionary, iterate through the other queryset, and set the foreignkey cache’s value to the instance.
  18. A few things to keep in mind when doing pythonic joins. They’re not going to be as fast as joins in the database. You can’t avoid this, but it’s not something you should worry about. With this, however, you get plain and simple vertical partitions. You also can cache things a lot easier, and more efficiently fetch them using things like get_many and a singular object cache. Overall you’re trading performance for scale.
  19. Another benefit that comes from vertical partitioning is the ability to designate masters. We do this to alleviate some of the load on our primary application master. So for example, server FOO might be the source for writes on the Users table, while server BAR handles all of our other forum data. Since we’re using Django 1.2 we also get routing for free through the new routers.
  20. Here’s an example of a simple application router. It lets us specify a read-slave based on our app label. So if it’s users, we go to FOO, if it’s forums, we go to BAR. You can handle this logic any way you want, pretty simple and powerful.
  21. While we use vertical partitioning for most cases, eventually you hit an issue where your data just doesn’t scale on a single database. You’re probably familiar with the word sharding, well that’s what we do with our forum data. We’ve set it up so that we can send certain large sites to dedicated machines. This also uses designated masters as we mentioned with the other partitions.
  22. We needed this when write and read load combined became so big that it was just hard to keep up on a single set of machines. It also gives the nice added benefit of high availability in many situations. Mostly though, it all goes back to scaling our master databases.
  23. So again we’re using the router here to handle partitioning of the forums. We can specify that CNN goes to this database alias, which could be any number of machines, and everything else goes to our default cluster. The one caveat we found with this, is sometimes hints aren’t present in the router. I believe within the current version of Django they are only available when using a relational lookup, such as a foreign key. All in all it’s pretty powerful, and you just need to be aware of it while writing your queries.
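
The caveat about missing hints can be seen in a small sketch that runs without Django. It mirrors the router from slide 23, but the `get_forum_alias` mapping, the alias names, and the plain `Post` stand-in class are assumptions made for illustration; the talk does not show how aliases are actually resolved.

```python
# Hypothetical mapping of forum ids to database aliases. The real
# get_forum_alias() is not shown in the talk, so this is an assumption.
FORUM_ALIASES = {1: "cnn_cluster"}
DEFAULT_ALIAS = "default"


def get_forum_alias(forum_id):
    """Return the alias for a partitioned forum, or the default cluster."""
    return FORUM_ALIASES.get(forum_id, DEFAULT_ALIAS)


class ForumPartitionRouter(object):
    """Same shape as the router on slide 23."""

    def db_for_read(self, model, **hints):
        instance = hints.get("instance")
        if instance is None:
            # No hint was passed, so the router has no opinion.
            return None
        forum_id = getattr(instance, "forum_id", None)
        if not forum_id:
            return None
        return get_forum_alias(forum_id)


class Post(object):
    """Plain stand-in for a model instance so the sketch runs without Django."""

    def __init__(self, forum_id=None):
        self.forum_id = forum_id


router = ForumPartitionRouter()

# A relational lookup supplies an instance hint, so the partition is found.
with_hint = router.db_for_read(Post, instance=Post(forum_id=1))

# A forum with no dedicated partition falls back to the default cluster.
other_forum = router.db_for_read(Post, instance=Post(forum_id=2))

# A plain filter() call supplies no hint, so the router cannot choose.
without_hint = router.db_for_read(Post)

print(with_hint, other_forum, without_hint)
```

When a router returns None, Django treats it as "no opinion" and moves on to the next router, ending up on the default database if none of them answers. That is why a query written without a relational lookup can quietly miss the partition.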