To prepare for hosting online coverage of the 2012 London Olympics, the BBC ran extensive load, performance, and resilience testing against its live systems and platforms. This testing exposed problems at every architecture layer that had never surfaced on the staging environment, including poor application performance, misconfigured timeouts, and missing caching. Finding and fixing these issues in the months before the Games meant they never affected the live online coverage.
Velocity London 2012: BBC Olympics
1. The BBC’s Experience of Preparing for the 2012 London Olympics
For Velocity London 2012
Andy “Bob” Brockhurst
Principal Engineer, BBC Platforms/Frameworks
2. Introduction
• The Team
• LAMP (without the M)
• Tomcat Java Service Layer
• Custom apache modules
• Varnish with extensions
• Zend Framework...
– ...customised, a.k.a. PAL
• Barlesque
3. How the BBC works
• One domain[1]
• Two technology stacks[2]
• Certs and SSL
• ProxyPass rules
• Apps are a TLD
• 360+ apps
• Everyone shares everything[3]
[1] Okay there are several but they are all really the same one.
[2] Okay, three if you are going to be picky.
[3] Yes really, everything!
5. Network Topology
• Dual DC[1]
• No DC affinity[2]
[1] One more soon(ish)
[2] Well a couple of apps do[3]
[3] We don't talk about them
12. Why?
• Too much change
– Network Architecture
– Server Configurations
– Load balancers
– Peering points
• High Profile
• Gain confidence
13. Gaining Confidence
• Load testing on Stage
– Tests individual applications
– Single endpoints only
– No concurrent load
• Real hardware
• Real data
– As much as possible
• Real Journalism
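The slide above notes that stage load testing hit single endpoints only, with no concurrent load. A minimal sketch of what a concurrent multi-endpoint driver looks like instead, assuming hypothetical localhost URLs (this is not the BBC's tooling):

```python
# Minimal concurrent load-test sketch (illustrative, not the BBC's tooling).
# Fires requests at several endpoints at once via a thread pool and records
# per-request latency, in contrast to single-endpoint sequential tests.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical endpoints for illustration only.
ENDPOINTS = [
    "http://localhost:8000/sport",
    "http://localhost:8000/sport/olympics",
]

def hit(url: str) -> float:
    """Fetch one URL and return elapsed seconds (errors still timed)."""
    start = time.monotonic()
    try:
        urlopen(url, timeout=2).read()
    except Exception:
        pass  # a real harness would count errors separately
    return time.monotonic() - start

def run(requests_per_url: int = 10) -> dict:
    """Hit every endpoint concurrently; return latencies per endpoint."""
    results = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url in ENDPOINTS:
            results[url] = list(pool.map(hit, [url] * requests_per_url))
    return results
```

The point of the sketch is the shape of the test, not the numbers: concurrent load across multiple endpoints exercises shared caches, connection pools, and load balancers in ways that isolated single-endpoint runs cannot.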
15. “What the Abdication did for Radio and the Coronation did for Television, London 2012 will do for Online.”
16. Current Volumetrics
• Big numbers for sport
– 9M users/day
– 90M views/day
• Punishing peaks
– Saturday football final scores 4000 pv/s
– ~750k Concurrent users
• Wimbledon
– 1700 pv/s
17. Expected Volumetrics
• Expected peaks
– 1.5M concurrent users
– 60k different sports pages
• 2,500 per minute
– 30% video via iPlayer
19. Timeline
• March 2012 (T minus 5 months)
– Team members assigned
– Resilience testing
– Performance testing
• Testing with External Partner
20. Olympics Run-up
• Jubilee (2nd June)
• Euros (8th June)
• Wimbledon (25th June)
• Formula One
43. One week before…
The opening ceremony…
• 1st successful test on Live
– with no errors at all.
45. Performance Overview
• Did find problems
– Weren’t found on stage
• In all architecture layers
• Components believed to be “fine” were not
• Stage is not suitable for this level of testing
• Proposal for any future “high profile” event
• CDNs didn’t really get tested
46. Resilience Overview
• Teams never tested failure scenarios
• Assumed that services didn’t fail
• Inconsistent use of flagpoles
• Reliance on mod_cache stale-on-error
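The “mod_cache stale-on-error” behaviour relied on above is, in modern Apache httpd (2.4+), exposed by the `CacheStaleOnError` directive. A minimal illustrative fragment, with paths and TTLs assumed rather than taken from the BBC's actual configuration:

```apache
# Serve a stale cached copy rather than a 5xx when the backend fails.
# Paths and TTLs are illustrative only.
CacheEnable disk /sport
CacheRoot /var/cache/apache2/mod_cache_disk
CacheStaleOnError On        # fall back to stale content on backend error
CacheDefaultExpire 60       # cache for 60s when no Expires header is sent
```

Stale-on-error masks backend failures from users, but as the slide notes, relying on it as the only resilience mechanism means failure paths go untested.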
47. Other problems
• Running a “fake” Olympics
– That is invisible to the public
– Did consider publicising a test
• No A/B (bucket) testing capability
• Some tests affected BAU
• No real test of HLS/HDS streaming
• Platform monitoring cycle
48. Other problems
• RCA complicated by shared platform
• Testing stopped by BAU/TX
• High reliance on key staff
– Some tests suffered
• No CDN testing
– At their request
– Places unfair load on infrastructure
• Unable to simulate network congestion
50. Working with external tester
• Workflow testing differed
– User journeys
– Direct linking to hotspots
• Very responsive to altering tests
• Did add extra complexity
51. Did it work?
• YES
– Found and fixed issues
– Before they bit us
– On production
– With little impact on BAU
52. Recommendations
• Increase stage capacity
• Intelligent load balancing
• Test NFRs in Development
• Caching, caching and more caching
• Kill load tests quickly
• Improve internal load testing
• Profile frontends under load
• Better post analysis tools
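“Kill load tests quickly” implies an automatic abort once errors spike, rather than waiting for a human to notice. A minimal sketch of such a guard, with window size and threshold as assumed values:

```python
# Abort guard for a load test: signal a stop as soon as the error rate
# over a sliding window crosses a threshold. Thresholds are illustrative.
from collections import deque

class AbortGuard:
    def __init__(self, window: int = 100, max_error_rate: float = 0.05):
        self.window = deque(maxlen=window)   # rolling record of successes
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        """Record one request outcome (True = success)."""
        self.window.append(ok)

    def should_abort(self) -> bool:
        """True once the window is full and errors exceed the threshold."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough samples to judge yet
        errors = self.window.count(False)
        return errors / len(self.window) > self.max_error_rate
```

The load driver calls `record()` after every request and stops generating traffic the moment `should_abort()` returns true, limiting the damage a runaway test can do to a shared production platform.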
Roughly 1/3 traffic from mobile, 2/3 desktop/tablet
18M users/week on sport (normal)
9M users/day on sport (event)
90M page views/day (1,000/sec)
30% traffic international
July–August traditionally quiet (no football)
New mobile site
Expect to exceed normal peaks
Formula One Monaco, May 27th: internal video stream testing
Olympics: Fri 27th July -> Sun 12th August
Stage environment done, not suitably confident
Live considered too different from stage
Our internal testing can’t use proxies easily
Target of x concurrent users
No concurrent load
Backends return whatever data they have after a certain time
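“Backends return whatever data they have after a certain time” describes a deadline pattern: fan requests out to several sources, then return whatever completed within the time budget and drop the stragglers. A minimal sketch, where the source callables and budget are hypothetical:

```python
# Deadline pattern: return whatever completed within the time budget.
from concurrent.futures import ThreadPoolExecutor, wait

def fetch_with_deadline(sources, deadline_s=0.5):
    """sources: mapping of name -> zero-arg callable returning data.
    Returns whatever completed without error before the deadline;
    slow sources are simply omitted from the result."""
    pool = ThreadPoolExecutor(max_workers=max(len(sources), 1))
    futures = {name: pool.submit(fn) for name, fn in sources.items()}
    done, _ = wait(futures.values(), timeout=deadline_s)
    partial = {name: f.result()
               for name, f in futures.items()
               if f in done and f.exception() is None}
    pool.shutdown(wait=False)  # leave stragglers running; don't block
    return partial
```

Serving a page with one stale or missing component beats making every user wait on the slowest backend, which is why the approach pairs naturally with the caching strategy described elsewhere in this talk.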
Speculative requests for content
Would have killed us, had we not taken action
Stage frontends *always* died first
Journalism and frontends under-spec’d for this type of test
Root cause analysis
Added to the non-functional requirements for BBC products at the development stage
Tennis Singles Finals (Serena Williams and Andy Murray golds): 820,000 requests, Sun 5th Aug
Bradley Wiggins TT peaked at 700 Gbps
2.8 petabytes that day
Exceeded in 24 hrs the entire coverage of the FIFA World Cup 2010