Scaling the Guardian


How does the Guardian website scale?
With hundreds of millions of page views per month, we need to think about scaling to an extreme level. But, being Agile, we did it as we went.


  1. Scaling the Guardian Michael Brunton-Spall (@bruntonspall)
  2. The Guardian - Some Figures. ABCe audited (Dec 2009): unique users - 36.9m per month, 1.8m per day; page impressions - 259m per month, 9.2m per day. Log file analysis: 37m requests per day, 1.1bn requests per month - not including images / static files.
  3. Initial Architecture
  4. Scaling Problems. The in-memory cache is an order of magnitude too small at 500MB.
  5. Even Worse! The cache is local to each app server, so adding an app server makes the problem worse.
  6. Our Solution. Memcached! Or, more accurately, a distributed cache.
  7. Our Solution
  8. Phase 1. Memcached object cache: a massive reduction in the number of DB calls, but no significant drop in DB load.
  9. Phase 2. Memcached query cache: a massive reduction in DB load.
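The object and query caches in phases 1 and 2 both follow the cache-aside pattern: check memcached first, and only hit the database on a miss. A minimal sketch, assuming a plain dict as a stand-in for both the memcached client and the database (`CacheAsideStore` is a hypothetical name, not the Guardian's code):

```python
class CacheAsideStore:
    """Cache-aside: check the cache first, fall back to the DB on a miss."""

    def __init__(self, db):
        self.db = db          # dict standing in for the real database
        self.cache = {}       # dict standing in for a memcached client
        self.db_calls = 0     # track how many times we reach the DB

    def get(self, key):
        if key in self.cache:
            return self.cache[key]    # cache hit: no DB call at all
        self.db_calls += 1            # cache miss: go to the database
        value = self.db[key]
        self.cache[key] = value       # populate the cache for next time
        return value

store = CacheAsideStore({"article:1": "Scaling the Guardian"})
store.get("article:1")   # miss: one DB call
store.get("article:1")   # hit: still only one DB call
```

This illustrates why phase 1 cut the number of DB calls: repeated reads of the same object are absorbed by the cache, and only the first read per key reaches the database.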
  10. Phase 3
  11. Phase 3. Memcached pages: a further reduction in app server load, but customisation must be handled outside the cache. Memcached for pages is a filter; page customisation is a higher filter. Decaching is time-based only, plus decache on a direct page edit.
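The page-cache filter described above can be sketched as a small wrapper around the renderer: serve the cached page until its TTL expires, and drop the entry when the page is edited directly. This is an illustrative sketch only; `PageCacheFilter`, its `ttl`, and the `render` callable are all assumptions, not the Guardian's implementation:

```python
import time

class PageCacheFilter:
    """Time-based page cache: serve the rendered page from cache until its
    TTL expires; personalisation is layered on top by a higher filter."""

    def __init__(self, render, ttl=60):
        self.render = render     # function that renders a path to HTML
        self.ttl = ttl           # seconds a cached page stays fresh
        self.cache = {}          # path -> (expiry_time, html)
        self.renders = 0         # count how often we actually render

    def get(self, path, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(path)
        if entry and entry[0] > now:
            return entry[1]                      # still fresh: serve cached
        self.renders += 1
        html = self.render(path)                 # miss or stale: re-render
        self.cache[path] = (now + self.ttl, html)
        return html

    def decache(self, path):
        self.cache.pop(path, None)               # direct page edit: drop it

pages = PageCacheFilter(lambda p: "<html>%s</html>" % p, ttl=60)
pages.get("/politics", now=0)     # renders
pages.get("/politics", now=30)    # served from cache
pages.get("/politics", now=120)   # TTL expired: renders again
```

Most requests within the TTL window never touch the app server's rendering path, which is where the further load reduction comes from.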
  12. Getting a Scaling Solution. The problem isn't technical; it's all about the process. Agile doesn't scale well! The onsite customer doesn't care about scaling, so we dedicated 10% of the team to look at "platform" issues. Still Agile; the customer is the operations team and the architects (backend and frontend).
  13. Scaling small apps rapidly. On Thursday 15 April 2010 there was a historic UK event: a televised national debate.
  14. Poll Charts. It always sounds simple: "Let people viewing the page vote at any time on whether they like or dislike what the party leader is saying. Oh, and let's show it with a real-time graph." The bad words here: "at any time", "real-time graph".
  15. Our coverage looked like this...
  16. The poll itself
  17. The poll itself. Python, Google App Engine, and an in-house, in-platform cache.
  18. The Naive Implementation. class IncrLibDemRequest: def get(self): Poll.get().libdems += 1. Why is this naive? Google App Engine has transaction locks; without one, simultaneous threads can't atomically increment a counter. But if you wrap the increment in a transaction, all threads are serialised: you've just turned Google's massively parallel data centre into a very expensive file-backed DB.
  19. Our Implementation (Phase 1). Sharded counters are the way to go; follow the Google App Engine article on sharded counters. They give parallel counters. But beware...
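The sharded-counter idea can be sketched in plain Python, with a list standing in for the datastore-backed shard entities (no App Engine APIs here; `ShardedCounter` is a hypothetical stand-in, not the code from the talk):

```python
import random

class ShardedCounter:
    """Sharded counter: spread increments across N shards so concurrent
    writers rarely contend on the same row; the total is the sum of shards."""

    def __init__(self, num_shards=20):
        self.shards = [0] * num_shards   # each slot stands in for one entity

    def increment(self):
        # Each writer picks a random shard, so contention on any single
        # shard drops by roughly a factor of num_shards.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += 1

    def count(self):
        return sum(self.shards)          # read path: sum every shard

counter = ShardedCounter(num_shards=20)
for _ in range(1000):
    counter.increment()
counter.count()   # -> 1000
```

The trade-off hinted at by "but beware" shows up on both paths: reads must touch every shard, and if there are too few shards for the write rate, transactions on hot shards still collide.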
  20. Our Results and Numbers
  21. Our Results and Numbers
  22. Some interesting notes. Average of around 100-120 req/s, peaking at 400 req/s; a total of nearly 1,000,000 requests. Surprisingly little cheating: only 2,000 requests. But...
  23. Request Duration. Between 1 second and 8 seconds! Causes: thread contention, and not enough shards.
  24. Our Implementation (2). Increase shards by a factor of 10? That completely removes the transaction failures, but each request still takes 200ms; the cost is the datastore write. Replace the datastore with memcache? A different architecture: a vote does a memcache atomic increment/decrement, results are read from memcache, and a cron job once per minute reads from memcache and writes to the datastore. Requests now take 20ms.
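The memcache-buffered architecture above can be sketched with a dict standing in for memcache's atomic increment and another for the durable datastore (`CountBuffer` and its method names are assumptions for illustration, not the Guardian's code):

```python
class CountBuffer:
    """Buffer votes in a fast in-memory store (standing in for memcache's
    atomic incr) and periodically flush the counts to the durable datastore."""

    def __init__(self):
        self.memcache = {}     # fast, volatile counts; incremented per vote
        self.datastore = {}    # durable counts; written only by flush()

    def vote(self, key):
        # memcache's incr is atomic, so no per-request transaction is needed
        self.memcache[key] = self.memcache.get(key, 0) + 1

    def results(self, key):
        return self.memcache.get(key, 0)   # reads never touch the datastore

    def flush(self):
        # The cron job's role: once a minute, persist the buffered counts
        for key, value in self.memcache.items():
            self.datastore[key] = value

buf = CountBuffer()
for _ in range(5):
    buf.vote("libdems")
buf.flush()
```

The design choice is a classic trade: the per-request path drops the slow datastore write (hence 200ms down to 20ms), at the cost of up to a minute of counts being lost if memcache is evicted before the next flush.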
  25. The Results?
  26. The Results?
  27. Some notes. A total of around 2,727,000 requests; average of around 454 req/s, peaking at 750 req/s.
  28. Requests per Second But...
  29. Request Duration. Averaged 1.2s at first; a live deploy fixed it to 300ms.
  30. Any Questions? Michael Brunton-Spall (@bruntonspall)