Building Scalable Web Apps


Talk at EuroPython 2011

1. DISQUS: Building Scalable Web Apps. David Cramer (@zeeg). Tuesday, June 21, 2011

2. Agenda
   • Terminology
   • Common bottlenecks
   • Building a scalable app
   • Architecting your database
   • Utilizing a queue
   • The importance of an API

3. Performance vs. Scalability
   “Performance measures the speed with which a single request can be executed, while scalability measures the ability of a request to maintain its performance under increasing load.”
   (but we’re not just going to scale your code)

4. Sharding
   “Database sharding is a method of horizontally partitioning data by common properties.”
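The definition above can be sketched in a few lines. This is an illustration only, not code from the talk: the shard names and the modulo-hash routing are assumptions, chosen because they are the simplest common scheme.

```python
# Hypothetical sketch of key-based sharding. The shard names are made up,
# and hash(user_id) % len(SHARDS) is one simple routing choice.
SHARDS = ['sql-1', 'sql-2', 'sql-3', 'sql-4']

def shard_for(user_id):
    # rows sharing the partition key (user_id) always map to the same shard
    return SHARDS[hash(user_id) % len(SHARDS)]
```

Routing by a stable property of the row is what makes later lookups cheap: given a user id, you know which machine holds that user's data without asking every shard.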
5. Denormalization
   “Denormalization is the process of attempting to optimize the performance of a database by adding redundant data or by grouping data.”
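A minimal sketch of the idea, assuming a follower counter (the class and field names here are illustrative, not from the talk): rather than running COUNT(*) against the relationships table on every profile view, keep a redundant counter and bump it at write time.

```python
# Illustrative denormalization: follower_count duplicates information that
# already lives in the relationships table, trading write work for O(1) reads.
class User:
    def __init__(self):
        self.follower_count = 0  # redundant, denormalized copy

def on_follow(followed_user):
    followed_user.follower_count += 1  # maintained on write, cheap to read
```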
6. Common Bottlenecks
   • Database (almost always)
   • Caching, invalidation
   • Lack of metrics, lack of tests

7. Building Tweeter

8. Getting Started
   • Pick a framework: Django, Flask, Pyramid
   • Package your app; repeatability
   • Solve problems
   • Invest in architecture

9. Let’s use Django

10. (image-only slide)

11. Django is..
   • Fast (enough)
   • Loaded with goodies
   • Maintained
   • Tested
   • Used

12. Packaging Matters
13. setup.py

    #!/usr/bin/env python
    from setuptools import setup, find_packages

    setup(
        name='tweeter',
        version='0.1',
        packages=find_packages(),
        install_requires=[
            'Django==1.3',
        ],
        package_data={
            'tweeter': [
                'static/*.*',
                'templates/*.*',
            ],
        },
    )

14. setup.py (cont.)

    $ mkvirtualenv tweeter
    $ git clone git.example.com:tweeter.git
    $ cd tweeter
    $ python setup.py develop

15. setup.py (cont.)

    ## fabfile.py
    def setup():
        run('git clone git.example.com:tweeter.git')
        run('cd tweeter')
        run('./bootstrap.sh')

    ## bootstrap.sh
    #!/usr/bin/env bash
    virtualenv env
    env/bin/python setup.py develop

16. setup.py (cont.)

    $ fab web setup
    setup executed on web1
    setup executed on web2
    setup executed on web3
    setup executed on web4
    setup executed on web5
    setup executed on web6
    setup executed on web7
    setup executed on web8
    setup executed on web9
    setup executed on web10
17. Database(s) First

18. Databases
   • Usually core
   • Common bottleneck
   • Hard to change
   • Tedious to scale
   http://www.flickr.com/photos/adesigna/3237575990/

19. What a tweet “looks” like

20. Modeling the data

    from django.contrib.auth.models import User
    from django.db import models

    class Tweet(models.Model):
        user = models.ForeignKey(User)
        message = models.CharField(max_length=140)
        date = models.DateTimeField(auto_now_add=True)
        parent = models.ForeignKey('self', null=True)

    class Relationship(models.Model):
        from_user = models.ForeignKey(User)
        to_user = models.ForeignKey(User)

    (Remember, bare bones!)

21. Public Timeline

    # public timeline
    SELECT * FROM tweets
    ORDER BY date DESC
    LIMIT 100;

   • Scales to the size of one physical machine
   • Heavy index, long tail
   • Easy to cache, invalidate

22. Following Timeline

    # tweets from people you follow
    SELECT t.* FROM tweets AS t
    JOIN relationships AS r
      ON r.to_user_id = t.user_id
    WHERE r.from_user_id = 1
    ORDER BY t.date DESC
    LIMIT 100

   • No vertical partitions
   • Heavy index, long tail
   • “Necessary evil” join
   • Easy to cache, expensive to invalidate
23. Materializing Views

    PUBLIC_TIMELINE = []

    def on_tweet_creation(tweet):
        global PUBLIC_TIMELINE
        PUBLIC_TIMELINE.insert(0, tweet)

    def get_latest_tweets(num=100):
        return PUBLIC_TIMELINE[:num]

    Disclaimer: don’t try this at home
24. Introducing Redis

    from redis import Redis

    class PublicTimeline(object):
        def __init__(self):
            self.conn = Redis()
            self.key = 'timeline:public'

        def add(self, tweet):
            score = float(tweet.date.strftime('%s.%m'))
            self.conn.zadd(self.key, tweet.id, score)

        def remove(self, tweet):
            self.conn.zrem(self.key, tweet.id)

        def list(self, offset=0, limit=-1):
            tweet_ids = self.conn.zrevrange(self.key, offset, limit)
            return tweet_ids
25. Cleaning Up

    from datetime import datetime, timedelta

    class PublicTimeline(object):
        def truncate(self):
            # Remove entries older than 30 days, i.e. everything
            # scored below the 30-day-old timestamp
            d30 = datetime.now() - timedelta(days=30)
            score = float(d30.strftime('%s.%m'))
            self.conn.zremrangebyscore(self.key, 0, score)
26. Scaling Redis

    from nydus.db import create_cluster

    class PublicTimeline(object):
        def __init__(self):
            # create a cluster of 64 dbs
            self.conn = create_cluster({
                'engine': 'nydus.db.backends.redis.Redis',
                'router': 'nydus.db.routers.redis.PartitionRouter',
                'hosts': dict((n, {'db': n}) for n in xrange(64)),
            })

27. Nydus

    # create a cluster of Redis connections which
    # partition reads/writes by key (hash(key) % size)
    from nydus.db import create_cluster

    redis = create_cluster({
        'engine': 'nydus.db.backends.redis.Redis',
        'router': 'nydus.db.routers.redis.PartitionRouter',
        'hosts': {
            0: {'db': 0},
        },
    })

    # maps to a single node
    res = redis.incr('foo')
    assert res == 1

    # executes on all nodes
    redis.flushdb()

    http://github.com/disqus/nydus
28. Vertical vs. Horizontal

29. Looking at the Cluster
    (diagram: one SQL master, sql-1-master, replicating to sql-1-slave; a single Redis node, redis-1, holding DB0–DB9)

30. “Tomorrow’s” Cluster
    (diagram: the SQL master splits vertically into sql-1-users and sql-1-tweets; redis-1 keeps DB0–DB4 while a new redis-2 takes DB5–DB9)

31. Asynchronous Tasks
32. In-Process Limitations

    def on_tweet_creation(tweet):
        # O(1) for public timeline
        PublicTimeline.add(tweet)

        # O(n) for users following author
        for user_id in tweet.user.followers.all():
            FollowingTimeline.add(user_id, tweet)

        # O(1) for profile timeline (my tweets)
        ProfileTimeline.add(tweet.user_id, tweet)

33. In-Process Limitations (cont.)

    # O(n) for users following author
    # 7 MILLION writes for Ashton Kutcher
    for user_id in tweet.user.followers.all():
        FollowingTimeline.add(user_id, tweet)
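The talk's fix for this O(n) fan-out is to move it onto a queue. Independent of the queue library, a common companion pattern is to chunk the follower list so that each queued job does a bounded amount of work. This sketch is an assumption, not the talk's code; `enqueue` stands in for whatever task queue is used.

```python
# Hypothetical chunked fan-out: split a huge follower list into fixed-size
# chunks so each queued task performs a bounded number of timeline writes.
def chunked(ids, size=1000):
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fan_out(follower_ids, enqueue):
    # enqueue(chunk) would hand the chunk to a task queue (e.g. Celery)
    for chunk in chunked(follower_ids, size=1000):
        enqueue(chunk)
```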
34. Introducing Celery

    #!/usr/bin/env python
    from setuptools import setup, find_packages

    setup(
        install_requires=[
            'Django==1.3',
            'django-celery==2.2.4',
        ],
        # ...
    )

35. Introducing Celery (cont.)

    @task(exchange='tweet_creation')
    def on_tweet_creation(tweet_dict):
        # HACK: not the best idea
        tweet = Tweet()
        tweet.__dict__ = tweet_dict

        # O(n) for users following author
        for user_id in tweet.user.followers.all():
            FollowingTimeline.add(user_id, tweet)

    on_tweet_creation.delay(tweet.__dict__)

36. Bringing It Together

    def home(request):
        "Shows the latest 100 tweets from your follow stream"
        if random.randint(0, 9) == 0:
            return render('fail_whale.html')

        ids = FollowingTimeline.list(
            user_id=request.user.id,
            limit=100,
        )
        res = dict((str(t.id), t)
                   for t in Tweet.objects.filter(id__in=ids))

        tweets = []
        for tweet_id in ids:
            if tweet_id not in res:
                continue
            tweets.append(res[tweet_id])

        return render('home.html', {'tweets': tweets})
37. Build an API

38. APIs
   • PublicTimeline.list
   • redis.zrange
   • Tweet.objects.all()
   • example.com/api/tweets/

39. Refactoring

    def home(request):
        "Shows the latest 100 tweets from your follow stream"
        tweet_ids = FollowingTimeline.list(
            user_id=request.user.id,
            limit=100,
        )

    def home(request):
        "Shows the latest 100 tweets from your follow stream"
        tweets = FollowingTimeline.list(
            user_id=request.user.id,
            limit=100,
        )

40. Refactoring (cont.)

    from datetime import datetime, timedelta

    class PublicTimeline(object):
        def list(self, offset=0, limit=-1):
            ids = self.conn.zrevrange(self.key, offset, limit)
            cache = dict((t.id, t)
                         for t in Tweet.objects.filter(id__in=ids))
            return filter(None, (cache.get(i) for i in ids))
41. Optimization in the API

    class PublicTimeline(object):
        def list(self, offset=0, limit=-1):
            ids = self.conn.zrevrange(self.list_key, offset, limit)

            # pull objects from a hash map (cache) in Redis
            cache = dict((i, self.conn.get(self.hash_key(i)))
                         for i in ids)

            if not all(cache.itervalues()):
                # fetch missing from database
                missing = [i for i, c in cache.iteritems() if not c]
                m_cache = dict((str(t.id), t)
                               for t in Tweet.objects.filter(id__in=missing))

                # push missing back into cache
                cache.update(m_cache)
                for i, c in m_cache.iteritems():
                    self.conn.set(self.hash_key(i), c)

            # return only results that still exist
            return filter(None, (cache.get(i) for i in ids))
42. Optimization in the API (cont.)

    def list(self, offset=0, limit=-1):
        ids = self.conn.zrevrange(self.list_key, offset, limit)

        # pull objects from a hash map (cache) in Redis
        cache = dict((i, self.conn.get(self.hash_key(i)))
                     for i in ids)

    Store each object in its own key

43. Optimization in the API (cont.)

    if not all(cache.itervalues()):
        # fetch missing from database
        missing = [i for i, c in cache.iteritems() if not c]
        m_cache = dict((str(t.id), t)
                       for t in Tweet.objects.filter(id__in=missing))

    Hit the database for misses

44. Optimization in the API (cont.)

    Store misses back in the cache

    # push missing back into cache
    cache.update(m_cache)
    for i, c in m_cache.iteritems():
        self.conn.set(self.hash_key(i), c)

    # return only results that still exist
    return filter(None, (cache.get(i) for i in ids))

    Ignore database misses
45. (In)validate the Cache

    class PublicTimeline(object):
        def add(self, tweet):
            score = float(tweet.date.strftime('%s.%m'))

            # add the tweet into the object cache
            self.conn.set(self.make_key(tweet.id), tweet)

            # add the tweet to the materialized view
            self.conn.zadd(self.list_key, tweet.id, score)
46. (In)validate the Cache

    class PublicTimeline(object):
        def remove(self, tweet):
            # remove the tweet from the materialized view
            self.conn.zrem(self.key, tweet.id)

            # we COULD remove the tweet from the object cache
            self.conn.delete(self.make_key(tweet.id))
47. Wrap Up

48. Reflection
   • Use a framework!
   • Start simple; grow naturally
   • Scale can lead to performance
     • Not the other way around
   • Consolidate entry points

49. Reflection (cont.)
   • 100 shards > 10; rebalancing sucks
     • Use VMs
   • Push to caches, don’t pull
   • “Denormalize” counters, views
   • Queue everything

50. Food for Thought
   • Normalize object cache keys
   • Application triggers directly to queue
   • Rethink pagination
   • Build with future-sharding in mind
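The first bullet, normalizing object cache keys, can be sketched as a single shared helper; earlier slides used both hash_key and make_key, and routing every caller through one function avoids exactly that drift. The name and key format below are hypothetical, not from the talk.

```python
# Hypothetical cache-key normalizer: one function every caller shares, so
# the object cache and the timeline code can never disagree on key format.
def make_key(model, pk, version=1):
    # the version segment lets you invalidate a whole class of keys at once
    return '%s:%s:%s' % (model.lower(), version, pk)
```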
51. DISQUS: Questions? (psst, we’re hiring: jobs@disqus.com)
