Scaling the Guardian


How does the Guardian website scale? With hundreds of millions of page views per month, we need to think about scaling at an extreme level. But being Agile, we did it as we went.

Scaling the Guardian

Michael Brunton-Spall (@bruntonspall)

michael.brunton-spall@guardian.co.uk

The Guardian - Some Figures

ABCe Audited (Dec 2009)
Unique Users - 36.9m per month, 1.8m per day
Page Impressions - 259m per month, 9.2m per day
Log file analysis
37m requests per day, 1.1bn requests per month, not including images or static files

Scaling Problems

In-memory cache is an order of magnitude too small at 500 MB

Even Worse!

Cache is local to each app server
Adding an App Server makes the problem worse

Our Solution

Memcached!
or more accurately, a distributed cache
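A minimal sketch of why a distributed cache removes the local-cache problem: every app server maps a given key to the same cache node, so adding app servers no longer adds more cold, duplicated caches. The node addresses and the `node_for` helper are hypothetical, and real memcached clients ship their own key distribution (many use consistent hashing rather than a plain modulo).

```python
import hashlib

# Hypothetical cache node addresses, for illustration only.
NODES = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]


def node_for(key, nodes=NODES):
    """Pick a cache node from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Any app server computing this gets the same node for the same key.
print(node_for("article:12345"))
```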

Phase 1

Memcached object cache
Massive reduction in number of DB calls

No significant drop in DB Load

Phase 2

Memcached query cache
Massive reduction in DB Load
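A query cache stores whole result sets rather than single objects. A minimal sketch, assuming a dict-backed `FakeCache` and a caller-supplied `run_query` function (both hypothetical), keys each result on a hash of the SQL text and its parameters:

```python
import hashlib
import json


class FakeCache:
    """Dict-backed stand-in for a memcached client (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def query_key(sql, params):
    """Build a short, stable cache key from the query and its parameters."""
    raw = json.dumps([sql, list(params)])
    return "query:" + hashlib.sha1(raw.encode("utf-8")).hexdigest()


def cached_query(cache, run_query, sql, params=()):
    key = query_key(sql, params)
    rows = cache.get(key)
    if rows is None:  # miss: run the query once and keep the result set
        rows = run_query(sql, params)
        cache.set(key, rows)
    return rows


calls = []


def run_query(sql, params):
    """Hypothetical database call that records each execution."""
    calls.append((sql, params))
    return [("row", params[0])]


cache = FakeCache()
sql = "SELECT id FROM articles WHERE section = %s"
cached_query(cache, run_query, sql, ("politics",))
cached_query(cache, run_query, sql, ("politics",))
print(len(calls))  # 1
```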

Phase 3

Memcached pages
Further reduction in app server load
Must handle customisation outside of cache
Memcached for pages is a filter
Page customisation is a higher filter
Time-based decache only
Decache only on direct page edit
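The filter layering can be sketched as two wrappers: a page cache with time-based expiry and an explicit decache hook, and a customisation filter that sits above it so per-user output never enters the cache. Everything here is an assumption made for illustration, including the in-process dict standing in for memcached and the `<!--USER-->` placeholder.

```python
import time


def page_cache_filter(render, ttl_seconds, now=time.time):
    """Cache rendered pages by path; entries expire by time or explicit decache."""
    store = {}  # path -> (expires_at, html); stand-in for memcached

    def handler(path):
        entry = store.get(path)
        if entry is not None and entry[0] > now():
            return entry[1]
        html = render(path)
        store[path] = (now() + ttl_seconds, html)
        return html

    def decache(path):
        """Call when the page itself is edited."""
        store.pop(path, None)

    handler.decache = decache
    return handler


def customisation_filter(inner):
    """Runs above the page cache and personalises the shared page."""

    def handler(path, user=None):
        html = inner(path)
        greeting = "Welcome, %s" % user if user else "Sign in"
        return html.replace("<!--USER-->", greeting)

    return handler


renders = []


def render(path):
    """Hypothetical expensive page render that records each call."""
    renders.append(path)
    return "<h1>%s</h1><!--USER-->" % path


cached = page_cache_filter(render, ttl_seconds=60)
site = customisation_filter(cached)
print(site("/uk", user="alice"))  # <h1>/uk</h1>Welcome, alice
print(site("/uk"))                # <h1>/uk</h1>Sign in
print(len(renders))               # 1
```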

Getting a Scaling Solution

The problem isn't technical
It's all about the process
Agile doesn't scale well!
On-site customer doesn't care about scaling
Dedicated 10% team to look at "platform" issues
Still Agile; the customer is the Operations Team & Architects (backend and frontend)

Scaling small apps rapidly

On Thursday 15th April 2010 there was a historic UK event: a televised national debate.

Poll Charts

Always sounds simple:

"Let people viewing the page vote at anytime whether they like or dislike what the party leader is saying. Oh, and let's show it with a real-time graph"

Bad words here
anytime
real-time graph

The poll itself

Python
Google App Engine
An in-house, in-platform cache

The Naive Implementation

class IncrLibDemRequest:
    def get(self):
        Poll.get().libdems += 1

Why?
Google App Engine has transaction locks; without one, simultaneous threads can't atomically increment a counter (duh)
If you wrap it in a transaction, all threads are serialised.
You've just turned Google's massively parallel data center into a very expensive file-backed DB
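The failure mode of the naive handler can be shown without App Engine at all. The sketch below uses an in-memory `Poll` stand-in (an assumption) and interleaves two read-modify-write sequences by hand, the way two simultaneous requests might.

```python
class Poll:
    """In-memory stand-in for a single datastore entity (assumption)."""

    def __init__(self):
        self.libdems = 0


poll = Poll()

# Two requests each read, add one, and write, with the reads interleaved.
read_a = poll.libdems      # request A reads 0
read_b = poll.libdems      # request B reads 0 before A has written
poll.libdems = read_a + 1  # A writes 1
poll.libdems = read_b + 1  # B also writes 1, overwriting A's vote

print(poll.libdems)  # 1, although two votes were cast
```

A transaction prevents the lost vote, but only by making every request queue behind the same entity.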

Our Implementation (Phase 1)

Sharded counters are the way to go
Follow the article on sharded counters at code.google.com/appengine
Gives parallel counters
But beware...
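The idea behind a sharded counter is to spread writes over N independent records and sum them on read. This is an in-memory sketch only: per-shard locks stand in for datastore transactions, and the `ShardedCounter` class is hypothetical rather than the code from the App Engine article.

```python
import random
import threading


class ShardedCounter:
    """Counter split over several shards so writers rarely collide."""

    def __init__(self, num_shards):
        self._counts = [0] * num_shards
        # One lock per shard stands in for a per-entity transaction.
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self):
        i = random.randrange(len(self._counts))  # pick a shard at random
        with self._locks[i]:
            self._counts[i] += 1

    def total(self):
        # Reads sum all shards; the value can lag slightly under load.
        return sum(self._counts)


def worker(counter, votes):
    for _ in range(votes):
        counter.increment()


counter = ShardedCounter(num_shards=20)
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.total())  # 8000
```

With too few shards, writers still queue on the same shard, which matches the contention described later in the talk.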

Some interesting notes

Average of around 100-120 req/s
Peaked at 400 req/s
Total of nearly 1,000,000 requests
Surprisingly little cheating
Only 2,000 requests

But...

Request Duration

Between 1 and 8 seconds!
Causes
Thread contention
Not enough shards

Our Implementation (2)

Increase shards by a factor of 10?
Eliminates transaction failures
Each request still takes 200 ms
The cost is the datastore write
Replace datastore with memcache?
Different architecture
Vote does a memcache atomic increment/decrement
Results are read from memcache
Cron job once a minute reads from memcache and writes to the datastore
Requests now take 20 ms
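The memcache-first architecture can be sketched as three small functions: vote, read results, and a periodic flush. The `FakeMemcache` class, the plain dict used as the datastore, and the key naming are all assumptions for illustration; the real App Engine memcache API has its own signatures and limits.

```python
import threading


class FakeMemcache:
    """Dict-backed stand-in offering an atomic increment (assumption)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, key, delta=1):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


datastore = {}  # stand-in for the persistent datastore


def vote(cache, party, like=True):
    """Request path: one atomic cache operation, no datastore write."""
    return cache.incr("votes:" + party, 1 if like else -1)


def results(cache, parties):
    """The graph reads straight from the cache."""
    return {party: cache.get("votes:" + party) for party in parties}


def flush_to_datastore(cache, parties):
    """Run from a once-a-minute cron job to persist the current totals."""
    for party, count in results(cache, parties).items():
        datastore[party] = count


cache = FakeMemcache()
vote(cache, "libdems")
vote(cache, "libdems")
vote(cache, "libdems", like=False)
flush_to_datastore(cache, ["libdems"])
print(datastore)  # {'libdems': 1}
```

The trade-off, which the slide does not discuss, is that memcache is not durable: votes cast since the last flush are lost if the entry is evicted.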

Some notes
Total of around 2,727,000 requests
Average of around 454 req/s
Peaked at 750 req/s
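As a quick consistency check, dividing the total by the average rate gives the measurement window these figures imply. The slide does not state the window, so this is derived arithmetic only.

```python
total_requests = 2727000
average_rate = 454.0  # requests per second

seconds = total_requests / average_rate
print(round(seconds))       # 6007
print(round(seconds / 60))  # 100
```

That is roughly 100 minutes of traffic at the average rate.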

Request Duration

Average 1.2 s at first
A live deploy brought it down to 300 ms

Any Questions?

