Scaling the guardian

How does the guardian website scale?
With millions of page views per month, we need to think about scaling to an extreme level. But being Agile we did it as we went.

  • 1. Scaling the Guardian Michael Brunton-Spall (@bruntonspall)
  • 2. The Guardian - Some Figures. ABCe audited (Dec 2009): unique users 36.9m per month, 1.8m per day; page impressions 259m per month, 9.2m per day. Log file analysis: 37m requests per day, 1.1bn requests per month, not including images / static files.
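
To put the slide 2 figures in per-second terms, the arithmetic can be sketched as below. Only the four input numbers come from the slide; the variable names are illustrative, and the results are plain averages that say nothing about peak load.

```python
# Figures quoted on slide 2 (ABCe audit and log file analysis).
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = 37000000
page_impressions_per_month = 259000000
unique_users_per_month = 36900000

# Average request rate, assuming traffic were spread evenly over the day
# (it is not, so real peaks will be higher than this).
avg_requests_per_second = requests_per_day / SECONDS_PER_DAY

# Average page impressions per unique user per month.
pages_per_user_per_month = page_impressions_per_month / unique_users_per_month

print("average requests per second: %.0f" % avg_requests_per_second)
print("pages per user per month: %.1f" % pages_per_user_per_month)
```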
  • 3. Initial Architecture
  • 4. Scaling Problems: the in-memory cache is an order of magnitude too small at 500 MB.
  • 5. Even Worse! The cache is local to each app server, so adding an app server makes the problem worse.
  • 6. Our Solution: Memcached! Or, more accurately, a distributed cache.
  • 7. Our Solution
  • 8. Phase 1: Memcached object cache. Massive reduction in the number of DB calls, but no significant drop in DB load.
  • 9. Phase 2: Memcached query cache. Massive reduction in DB load.
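
The query cache in Phase 2 can be sketched as a cache-aside lookup keyed on the query text and its parameters. This is a minimal illustration, not the Guardian's code: `FakeMemcache`, `CountingDb`, and `cached_query` are hypothetical names, and the in-process dictionary stands in for a shared memcached cluster so the example runs on its own.

```python
import hashlib


class FakeMemcache:
    """In-process stand-in for a shared memcached client (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class CountingDb:
    """Hypothetical database that counts how many queries reach it."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = 0

    def query(self, sql, params):
        self.calls += 1
        # The SQL text is ignored here; the fake just filters by section.
        return [row for row in self.rows if row["section"] == params[0]]


def cached_query(cache, db, sql, params):
    """Return the cached result for (sql, params), querying the DB on a miss."""
    raw_key = repr((sql, params)).encode("utf-8")
    key = "q:" + hashlib.sha1(raw_key).hexdigest()
    result = cache.get(key)
    if result is None:
        result = db.query(sql, params)
        cache.set(key, result)
    return result


if __name__ == "__main__":
    cache = FakeMemcache()
    db = CountingDb([{"section": "politics", "title": "Debate"}])
    sql = "SELECT * FROM articles WHERE section = %s"
    cached_query(cache, db, sql, ("politics",))
    cached_query(cache, db, sql, ("politics",))
    print("database calls:", db.calls)
```

Because every app server would talk to the same cache, a repeated query from any of them is served without touching the database, which is the effect the slide attributes to this phase.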
  • 10. Phase 3
  • 11. Phase 3: Memcached pages. Further reduction in app server load. Customisation must be handled outside the cache: Memcached for pages is a filter, and page customisation is a higher filter. Time-based decache only; explicit decache only on direct page edit.
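
The filter arrangement on slide 11 can be sketched roughly as follows. All names here (`PageCache`, `page_cache_filter`, `customisation_filter`, the `<!--greeting-->` placeholder) are assumptions made for illustration; the slides do not show the real filter code, and the dictionary stands in for memcached.

```python
import time


class PageCache:
    """In-process stand-in for a memcached page cache (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._pages = {}  # url -> (expires_at, html)

    def get(self, url):
        entry = self._pages.get(url)
        if entry is None:
            return None
        expires_at, html = entry
        if self.clock() >= expires_at:
            # Time-based decache: the entry has outlived its TTL.
            del self._pages[url]
            return None
        return html

    def put(self, url, html):
        self._pages[url] = (self.clock() + self.ttl_seconds, html)

    def decache(self, url):
        """Explicit decache, intended for a direct edit of this page."""
        self._pages.pop(url, None)


def page_cache_filter(cache, render, url):
    """Inner filter: serve the shared, uncustomised page from the cache."""
    html = cache.get(url)
    if html is None:
        html = render(url)
        cache.put(url, html)
    return html


def customisation_filter(cache, render, url, user_name):
    """Outer filter: personalise the cached page without caching the result."""
    html = page_cache_filter(cache, render, url)
    return html.replace("<!--greeting-->", "Hello, %s" % user_name)
```

Keeping customisation in the outer filter means one cached copy of the page serves every user, and only the cheap substitution step runs per request.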
  • 12. Getting a Scaling Solution: The problem isn't technical; it's all about the process. Agile doesn't scale well! The onsite customer doesn't care about scaling. A dedicated 10% team looks at "platform" issues. Still Agile, but the customer is the Operations Team & Architects (backend and frontend).
  • 13. Scaling small apps rapidly: On Thursday 15th April 2010 there was a historic UK event, a televised national debate.
  • 14. Poll Charts. It always sounds simple: "Let people viewing the page vote at any time on whether they like or dislike what the party leader is saying. Oh, and let's show it with a real-time graph." Bad words here: "any time" and "real-time graph".
  • 15. Our coverage looked like this...
  • 16. The poll itself
  • 17. The poll itself: Python, Google App Engine, an in-house, in-platform cache.
  • 18. The Naive Implementation: class IncrLibDemRequest: def get(self): Poll.get().libdems += 1. Why? Google App Engine has transaction locks; simultaneous threads can't atomically increment a counter (duh). If you wrap it in a txn, all threads are serialised. You just turned Google's massively parallel data center into a very expensive file-backed db.
  • 19. Our Implementation (Phase 1): Sharded counters are the way to go. Follow the article on sharded counters. Gives parallel counters. But beware...
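
The idea behind sharded counters can be sketched in plain Python, with threads and per-shard locks standing in for App Engine requests and datastore transactions. This is an analogy under those assumptions, not App Engine code: `ShardedCounter` is a hypothetical class, and the referenced article's datastore-backed version is not reproduced here.

```python
import random
import threading


class ShardedCounter:
    """Counter split into shards so concurrent writers rarely collide."""

    def __init__(self, num_shards):
        self._counts = [0] * num_shards
        # One lock per shard plays the role of a per-entity transaction.
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self):
        # Each writer picks a random shard, so contention is spread out
        # instead of every writer queueing on a single counter.
        index = random.randrange(len(self._counts))
        with self._locks[index]:
            self._counts[index] += 1

    def total(self):
        # Reading sums all shards; it costs more than reading one value.
        return sum(self._counts)


if __name__ == "__main__":
    counter = ShardedCounter(num_shards=20)

    def cast_votes():
        for _ in range(1000):
            counter.increment()

    workers = [threading.Thread(target=cast_votes) for _ in range(8)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print("total votes:", counter.total())
```

The trade-off is visible in `total`: writes get cheaper as shards are added, while every read has to visit all of them.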
  • 20. Our Results and Numbers
  • 21. Our Results and Numbers
  • 22. Some interesting notes: average of around 100-120 req/s; peaked at 400 req/s; total of nearly 1,000,000 requests. Surprisingly little cheating: only 2000 requests. But...
  • 23. Request Duration: between 1 second and 8 seconds! Causes: thread contention, not enough shards.
  • 24. Our Implementation (2). Increase shards by a factor of 10? Eliminates transaction failures, but each request still takes 200 ms; the cost is the datastore write. Replace the datastore with memcache? A different architecture: a vote does a memcache atomic increment/decrement; results are read from memcache; a cron job runs once a minute, reads from memcache and writes to the datastore. Requests now take 20 ms.
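
The second architecture on slide 24 can be sketched as below. `FakeMemcache`, `FakeDatastore`, the key layout, and the function names are all assumptions for illustration; the fake `incr` treats a missing key as zero, which a real memcached client may not do by default.

```python
import threading


class FakeMemcache:
    """In-process stand-in for memcache with an atomic incr (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, key, delta=1):
        # Assumption: a missing key starts from zero.
        with self._lock:
            value = self._data.get(key, 0) + delta
            self._data[key] = value
            return value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


class FakeDatastore:
    """Hypothetical durable store that counts its (slow) writes."""

    def __init__(self):
        self.saved = {}
        self.writes = 0

    def put(self, key, value):
        self.writes += 1
        self.saved[key] = value


def vote_key(party, sentiment):
    return "votes:%s:%s" % (party, sentiment)


def vote(cache, party, sentiment):
    """Request path: one atomic in-memory increment, no datastore write."""
    return cache.incr(vote_key(party, sentiment))


def results(cache, keys):
    """Results path: read the current counts straight from the cache."""
    return {key: cache.get(key) or 0 for key in keys}


def flush_to_datastore(cache, datastore, keys):
    """Cron path (once a minute on the slide): persist the cached counts."""
    for key, count in results(cache, keys).items():
        datastore.put(key, count)
```

The expensive write moves out of the request path entirely. The cost is durability: if the cache loses a counter between flushes, up to a minute of votes for that counter is gone.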
  • 25. The Results?
  • 26. The Results?
  • 27. Some notes: total of around 2,727,000 requests; average of around 454 req/s; peaked at 750 req/s.
  • 28. Requests per Second But...
  • 29. Request Duration: average 1.2 s at first; a live deploy brought it down to 300 ms.
  • 30. Any Questions? Michael Brunton-Spall (@bruntonspall)