Scaling the Guardian


Michael Brunton-Spall (@bruntonspall)

michael.brunton-spall@guardian.co.uk
The Guardian - Some Figures

 ABCe Audited (Dec 2009)
    Unique Users - 36.9m per month, 1.8m per day
    Page Impression...
Initial Architecture
Scaling Problems

 In-memory cache is an order of magnitude too small at 500MB
Even Worse!




 Cache is local to appserver
   Adding an App Server makes the problem worse
Our Solution



            Memcached!
    or more accurately, a distributed cache
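
A minimal sketch of why a distributed cache avoids the per-appserver problem: every app server maps a key to the same cache node, so an entry is stored once rather than once per server. The node names are hypothetical, and the simple modulo routing is an assumption for illustration; real memcached clients often use consistent hashing instead.

```python
import zlib

# Hypothetical cache node addresses, for illustration only.
CACHE_NODES = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

def pick_node(key, nodes):
    """Map a cache key to one node; every caller gets the same answer."""
    index = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return nodes[index]

# Two different app servers asking about the same key agree on the node,
# so adding app servers does not multiply the number of cached copies.
node_seen_by_app1 = pick_node("article:12345", CACHE_NODES)
node_seen_by_app2 = pick_node("article:12345", CACHE_NODES)
print(node_seen_by_app1 == node_seen_by_app2)
```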
Our Solution
Phase 1

 Memcache object cache
   Massive reduction in number of DB calls




    No significant drop in DB Load
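
As a rough illustration of an object cache, here is a cache-aside read in Python. `FakeCache` and `FakeDatabase` are stand-ins invented for this sketch, not the Guardian's code or a real memcached client. One plausible reading of the slide is that cheap per-object lookups were cached while the expensive queries still ran, which would explain fewer DB calls without a drop in DB load; that interpretation is an assumption.

```python
class FakeCache:
    """Dict-backed stand-in for a memcached client (illustrative only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FakeDatabase:
    """Counts calls so the effect of the cache is visible."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = 0

    def load(self, object_id):
        self.calls += 1
        return self.rows[object_id]


def load_article(object_id, cache, db):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = "article:%s" % object_id
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = db.load(object_id)
    cache.set(key, value)
    return value


db = FakeDatabase({1: "Election debate live blog"})
cache = FakeCache()
for _ in range(5):
    load_article(1, cache, db)
print(db.calls)  # only the first read reaches the database
```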
Phase 2

 Memcached query cache
   Massive reduction in DB Load
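
A query cache can be sketched by keying the cached result set on the query text plus its parameters. The key scheme, the plain dict used as the cache, and the `run_query` callable are all assumptions made for this example.

```python
import hashlib
import json

def query_cache_key(sql, params):
    """Derive a stable cache key from the SQL text and its parameters."""
    payload = json.dumps([sql, list(params)])
    return "query:" + hashlib.sha1(payload.encode("utf-8")).hexdigest()

def cached_query(sql, params, cache, run_query):
    """Return cached rows when present, otherwise run the query once."""
    key = query_cache_key(sql, params)
    if key in cache:
        return cache[key]
    rows = run_query(sql, params)
    cache[key] = rows
    return rows

executed = []  # records every query that actually reaches the "database"

def run_query(sql, params):
    executed.append((sql, tuple(params)))
    return [("row for", tuple(params))]

cache = {}
sql = "SELECT * FROM articles WHERE section = %s"
cached_query(sql, ["politics"], cache, run_query)
cached_query(sql, ["politics"], cache, run_query)  # served from the cache
cached_query(sql, ["sport"], cache, run_query)     # different params, new key
print(len(executed))
```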
Phase 3
Phase 3

 Memcached pages
   More reduction in Appserver load
   Must handle customisation outside of cache
   Memcached f...
Getting a Scaling Solution

  The problem isn't technical
  It's all about the process
  Agile doesn't scale well!
       ...
Scaling small apps rapidly

On Thursday 15th April 2010 there was a historic UK event - a
televised national debate.
Poll Charts

Always sounds simple:

"Let people viewing the page vote at anytime whether they like
or dislike what the par...
Our coverage looked like this...
The poll itself
The poll itself

  Python
  Google App Engine
  An in-house, in-platform cache
The Naive Implementation

class IncrLibDemRequest:
   def get(self):
     Poll.get().libdems += 1


Why?
  Google App Engi...
Our Implementation (Phase 1)

Sharded counters are the way to go
   Follow the article at code.google.com/appengine on
   ...
Our Results and Numbers
Our Results and Numbers
Some interesting notes

 Average of around 100-120 req/s
 Peaked at 400 req/s
 Total of nearly 1,000,000 requests
 Surpris...
Request Duration




 Between 1 sec and 8 seconds!
 Causes
    Thread contention
    Not enough shards
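
To show why the number of shards matters, here is a plain-Python simulation of a sharded counter. It is a sketch only and does not use the App Engine datastore API: on App Engine each shard would be a separate entity updated in its own transaction, so writers only contend when they pick the same shard. The class and method names are invented for this example.

```python
import random

class ShardedCounter:
    """In-memory model of a counter split across several shards."""

    def __init__(self, num_shards):
        self.shards = [0] * num_shards

    def increment(self, rng):
        # Each write picks one shard at random; with N shards, two
        # concurrent writers collide on the same shard about 1 time in N.
        index = rng.randrange(len(self.shards))
        self.shards[index] += 1
        return index

    def total(self):
        # Reading the counter means summing every shard.
        return sum(self.shards)


rng = random.Random(42)
counter = ShardedCounter(num_shards=20)
for _ in range(1000):
    counter.increment(rng)
print(counter.total())
```

More shards lower the chance of write collisions, at the cost of a more expensive read that has to sum every shard.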
Our Implementation (2)

 Increase shards by factor of 10?
     Completely eliminates transaction failures
     Each request s...
The Results?
The Results?
Some notes
  Total of around 2,727,000 requests
  Average of around 454 req/s
  Peaked at 750 req/s
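
As a quick sanity check on these figures, dividing the total by the average rate gives the implied length of the traffic window. This assumes the average was taken over the whole period, which the slide does not state.

```python
total_requests = 2_727_000
average_rps = 454

# Implied duration if the average covers the whole period (an assumption).
duration_seconds = total_requests / average_rps
print(round(duration_seconds), round(duration_seconds / 60))
# prints: 6007 100
```

That is roughly 100 minutes of sustained traffic.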
Requests per Second




But...
Request Duration




 Average 1.2s at first
 A live deploy brought it down to 300ms
Any Questions?




Michael Brunton-Spall (@bruntonspall)

michael.brunton-spall@guardian.co.uk
Scaling the guardian

How does the Guardian website scale?
With millions of page views per month, we need to think about scaling to an extreme level. But being Agile, we did it as we went.

Transcript of "Scaling the guardian"

  1. Scaling the Guardian. Michael Brunton-Spall (@bruntonspall), michael.brunton-spall@guardian.co.uk
  2. The Guardian - Some Figures. ABCe Audited (Dec 2009): Unique Users - 36.9m per month, 1.8m per day; Page Impressions - 259m per month, 9.2m per day. Log file analysis: 37m requests per day, 1.1bn requests per month - not including images / static files.
  3. Initial Architecture
  4. Scaling Problems. In-memory cache is an order of magnitude too small at 500MB.
  5. Even Worse! Cache is local to appserver. Adding an App Server makes the problem worse.
  6. Our Solution. Memcached! Or more accurately, a distributed cache.
  7. Our Solution
  8. Phase 1. Memcache object cache. Massive reduction in number of DB calls. No significant drop in DB Load.
  9. Phase 2. Memcached query cache. Massive reduction in DB Load.
  10. Phase 3
  11. Phase 3. Memcached pages. More reduction in Appserver load. Must handle customisation outside of cache. Memcached for pages is a filter. Page customisation is a higher filter. Time based decache only. Decache only on direct page edit.
  12. Getting a Scaling Solution. The problem isn't technical. It's all about the process. Agile doesn't scale well! Onsite customer doesn't care about scaling. Dedicated 10% team to look at "platform" issues. Still Agile; Customer is Operations Team & Architects (backend and frontend).
  13. Scaling small apps rapidly. On Thursday 15th April 2010 there was a historic UK event - a televised national debate.
  14. Poll Charts. Always sounds simple: "Let people viewing the page vote at anytime whether they like or dislike what the party leader is saying. Oh, and let's show it with a real time graph". Bad words here: "anytime", "real-time graph".
  15. Our coverage looked like this...
  16. The poll itself
  17. The poll itself. Python. Google App Engine. An in-house, in-platform cache.
  18. The Naive Implementation. class IncrLibDemRequest: def get(self): Poll.get().libdems += 1. Why? Google App Engine has transaction locks; simultaneous threads can't atomically increment a counter (duh). If you wrap in a txn, all threads are serialised. You just turned Google's massively parallel data center into a very expensive file backed db.
  19. Our Implementation (Phase 1). Sharded counters are the way to go. Follow the article at code.google.com/appengine on sharded counters. Gives parallel counters. But beware....
  20. Our Results and Numbers
  21. Our Results and Numbers
  22. Some interesting notes. Average of around 100-120 req/s. Peaked at 400 req/s. Total of nearly 1,000,000 requests. Surprisingly little cheating: only 2000 requests. But...
  23. Request Duration. Between 1 sec and 8 seconds! Causes: thread contention; not enough shards.
  24. Our Implementation (2). Increase shards by factor of 10? Completely eliminates transaction failures. Each request still takes 200ms. The cost is the datastore write. Replace datastore with memcache? Different architecture: vote does memcache atomic increment/decrement; results get from memcache; cronjob 1/min reads from memcache and writes to datastore; requests now take 20 ms.
  25. The Results?
  26. The Results?
  27. Some notes. Total of around 2,727,000 requests. Average of around 454 req/s. Peaked at 750 req/s.
  28. Requests per Second. But...
  29. Request Duration. Average 1.2s at first. A live deploy brought it down to 300ms.
  30. Any Questions? Michael Brunton-Spall (@bruntonspall), michael.brunton-spall@guardian.co.uk
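
The second design in slide 24 (votes go to a memcache atomic increment, results are read from memcache, and a once-a-minute cron job copies the totals to the datastore) can be sketched as follows. `FakeMemcache`, the key names, and the dict standing in for the datastore are assumptions for illustration; this is not the App Engine API or the Guardian's code.

```python
class FakeMemcache:
    """Dict-backed stand-in offering an increment like memcached's incr."""

    def __init__(self):
        self._data = {}

    def incr(self, key, delta=1):
        self._data[key] = self._data.get(key, 0) + delta
        return self._data[key]

    def get(self, key):
        return self._data.get(key, 0)


PARTIES = ["libdems", "labour", "conservatives"]  # illustrative key names

def vote(cache, party):
    """Request handler path: one cheap cache increment, no datastore write."""
    return cache.incr("votes:" + party)

def results(cache):
    """Results are served straight from the cache."""
    return {party: cache.get("votes:" + party) for party in PARTIES}

def flush_to_datastore(cache, datastore):
    """What the once-a-minute cron job would do: persist current totals."""
    for party in PARTIES:
        datastore[party] = cache.get("votes:" + party)


cache = FakeMemcache()
datastore = {}  # stand-in for the persistent store
for _ in range(3):
    vote(cache, "libdems")
vote(cache, "labour")
flush_to_datastore(cache, datastore)
print(results(cache), datastore)
```

The trade-off in a design like this is durability: a cache is not persistent storage, so votes cast since the last flush can be lost if the cache entry is evicted.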