The BBC’s Experience of preparing for the 2012 London Olympics
For Velocity London 2012
Andy “Bob” Brockhurst ...
  • Flagpoles; Varnish device detection; Varnish geo IP lookup; cookie manipulation; variant caching; mod_annotate; Micro (MVC helper for Zend/PAL); Spectrum (templating)
  • Okay, bbc.co.uk and bbc.com. Okay, Forge, Journalism. Yup, Service layer on shared physical servers. PAL apps installed on all frontends.
  • Physically the same TMs and Varnishes. Traffic routing destinations, header at entry point.
  • Journalism run a separate stack with the same environments. Also have previewers for editorial previews.
  • Increased network capacity; hardware replacements; RHEL 5 -> 6; 10Gb NICs
  • Stage load tests done internally
  • What the Abdication did for Radio, and the Coronation did for Television, the Olympics will do for Online.
  • Roughly 1/3 traffic from mobile, 2/3 desktop/tablet; 18M users/week on sport (normal); 9M users/day on sport (event); 90M page views/day (1000/sec); 30% traffic international; July – August traditionally quiet (no football); new mobile site; expect to exceed normal peaks
  • Formula One Monaco, May 27th: internal video stream testing. Olympics: Fri 27th July -> Sun 12th August
  • Virtualised, dynamically provisioned, externally hosted testing
  • Stage environment done, not suitably confident. Live considered too different from stage. Our internal testing can’t use proxies easily. Target of x concurrent users. No concurrent load.
  • Backends return whatever data they have after a certain time
  • Speculative requests for content
  • Would have killed us, had we not taken action. Stage frontends *Always* died first. Journalism and frontends under-spec’d for this type of test.
  • Root cause analysis
  • Added to Non-Functional Requirements for BBC Products at Development
  • Tennis Singles Finals - Serena Williams and Andy Murray golds: 820,000 requests, Sun 5th Aug
  • Bradley Wiggins TT peaked at 700 Gbps; 2.8 petabytes that day. Exceeded in 24hrs the entire coverage of FIFA World Cup 2010.

    1. The BBC’s Experience of preparing for the 2012 London Olympics
       For Velocity London 2012
       Andy “Bob” Brockhurst, Principal Engineer, BBC Platforms/Frameworks
    2. Introduction
       • The Team
       • LAMP (without the M)
       • Tomcat Java Service Layer
       • Custom Apache modules
       • Varnish with extensions
       • Zend Framework...
         – ...customised a.k.a. PAL
       • Barlesque
    3. How the BBC works
       • One domain[1]
       • Two technology stacks[2]
       • Certs and SSL
       • ProxyPass’
       • Apps are a TLD
       • 360+ apps
       • Everyone shares everything[3]
       [1] Okay there are several but they are all really the same one.
       [2] Okay, three if you are going to be picky.
       [3] Yes really, everything!
    4. Network Topology
       • Dual DC[1]
       • No DC affinity[2]
       [1] One more soon(ish)
       [2] Well a couple of apps do[3]
       [3] We don’t talk about them
    5. Network Topology
    6. Traffic Routing
       • TM -> PAL
       • TM -> Varnish -> TM -> PAL
       • TM -> Service Layer
       • TM -> Varnish -> TM -> Service Layer
    7. Traffic Routing

                  Request
       iplayer       |      everything else
       sport         |
                     v
                     TM    .-> TM    .--> TM     .--> TM
                     |    /    |    /     |     /     |
                     |   /     |   /      |    /      |
                     v  /      v  / api   v   /       v
                  Varnish     PAL      Varnish     Dynamite
    8. Environments
       • Integration
       • Test
       • Staging
       • Live
       • Journalism
    9. Right let’s do some testing
    10. Why?
       • Too much change
         – Network Architecture
         – Server Configurations
         – Load balancers
         – Peering points
       • High Profile
       • Gain confidence
    11. Gaining Confidence
       • Load testing on Stage
         – Tests individual applications
         – Single endpoints only
         – No concurrent load
       • Real hardware
       • Real data
         – As much as possible
       • Real Journalism
    12. Other objectives
       • Maintain BAU
       • Handle failure gracefully
       • Deliver Expectation
    13. “What the Abdication did for Radio and the Coronation did for Television, London 2012 will do for Online.”
    14. Current Volumetrics
       • Big numbers for sport
         – 9M users/day
         – 90M views/day
       • Punishing peaks
         – Saturday football final scores 4000 pv/s
         – ~750k Concurrent users
       • Wimbledon
         – 1700 pv/s
    15. Expected Volumetrics
       • Expected peaks
         – 1.5M concurrent users
         – 60k different sports pages
           • 2,500 per minute
         – 30% video via iPlayer
    16. Timeline
       • March 2012 (T minus 5 months)
         – Team members assigned
         – Resilience testing
         – Performance testing
       • Testing with External Partner
    17. Olympics Run-up
       • Jubilee (2nd June)
       • Euros (8th June)
       • Wimbledon (25th June)
       • Formula One
    18. Cloud Testing
       • International testing
       • Detailed test results
    19. Cloud Testing
       • First performance test breaks live
       • Exposed monitoring issues
       • Couldn’t internally diagnose
       • Lots of tail, grep, awk, sed.
    20. Early Findings
       • Stop tests
       • Monitoring
       • UK Data centre capacity
       • UK Data centre network segments
    21. (Not) Caching kills
       • Conditional modules
       • Non-Olympics related modules
         – Commenting / Favourites
       • Lowers cachability
       • Testing an immature product
       • Subsequent testing exposed more
    22. What is a failure?
       • Error 500?
       • Blank pages?
       • Stale content?
       • Slow pages?
       • Burning data centres?
    23. Resilience Testing
       • Kill backends
       • Traffic Manager
         – Screw with headers
         – Screw with status (418 anyone)
         – Truncate body
       • Introduce waits
       • Limit cache sizes
       • Reduce network bandwidth
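
The failure modes on the Resilience Testing slide can be sketched as a small fault-injecting wrapper around a backend call. This is an illustrative Python sketch only: the talk applied these faults at the Traffic Manager, whose rule language is not shown in the slides, and `Response`, `inject_fault`, `healthy_backend` and the fault names are all hypothetical.

```python
import time
from dataclasses import dataclass, field, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def inject_fault(backend: Callable[[], Response], fault: str,
                 delay_s: float = 0.0,
                 sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call the backend and apply one named fault (hypothetical names)."""
    if fault == "kill":
        # Simulate a dead backend: the caller never gets a response.
        raise ConnectionError("backend unavailable (injected)")
    if fault == "wait":
        sleep(delay_s)  # introduce latency before the backend answers
    resp = backend()
    if fault == "status":
        return replace(resp, status=418)  # unexpected status code
    if fault == "headers":
        # Drop the caching header to see how downstream caches react.
        kept = {k: v for k, v in resp.headers.items()
                if k.lower() != "cache-control"}
        return replace(resp, headers=kept)
    if fault == "truncate":
        return replace(resp, body=resp.body[: len(resp.body) // 2])
    return resp


def healthy_backend() -> Response:
    return Response(200, {"Cache-Control": "max-age=60"}, b"<html>ok</html>")
```

Running each fault against a page and checking what the user would actually see is one way to answer the "What is a failure?" question per application.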
    24. Early findings
       • Failure mode testing
         – Everything is a SPOF
         – Performance sucks in a failure
    25. Specific findings
       • Monitoring Thresholds
       • Verbose logging, everywhere
       • Timeouts
       • No data
       • Volumetrics
       • Unfair load balancing
    26. Verbose Logging
       • Wrong levels configured
       • Diagnostic information
       • Expected/Handled errors
       • Too much detail
       • Hurts health/forensic reporting
    27. Not enough logging
       • Fatals with no logging
       • Unhandled conditions
       • Monitoring holes
       • Operations staff blind
    28. Platform Configuration
       • Unfair load-balancing
         – Remove older commodity servers
       • Competitive service applications
         – Re-home critical applications
    29. “Timeouts at lower levels in the architecture MUST be set shorter than the timeouts configured at higher levels of the architecture.”
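
The timeout rule quoted on this slide lends itself to an automated configuration check. The sketch below assumes that "higher" means closer to the user, uses layer names from the talk, and invents the timeout values purely for illustration; `check_timeout_chain` is a hypothetical helper, not a BBC tool.

```python
from typing import List, Tuple


def check_timeout_chain(layers: List[Tuple[str, float]]) -> List[str]:
    """Return violations of the rule for layers ordered highest to lowest.

    Each entry is (layer name, timeout in seconds). Every layer must have
    a strictly shorter timeout than the layer directly above it.
    """
    problems = []
    for (upper, upper_t), (lower, lower_t) in zip(layers, layers[1:]):
        if lower_t >= upper_t:
            problems.append(
                f"{lower} ({lower_t}s) must be shorter than {upper} ({upper_t}s)"
            )
    return problems


# Hypothetical values, ordered from the edge down to the service layer.
chain = [
    ("Traffic Manager", 10.0),
    ("Varnish", 8.0),
    ("PAL frontend", 9.0),   # violates the rule: longer than Varnish
    ("Service Layer", 3.0),
]

for problem in check_timeout_chain(chain):
    print(problem)
```

A check like this could run wherever configuration is validated, so a violation is caught before a load test finds it.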
    30. Timeouts
       • Frontend/Backend timeouts
         – Frontends with lower timeouts
         – Caches never populated
       • Alter backends to return early
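
When a frontend gives up before its backend answers, the response is thrown away and the cache never fills. One way to "return early" is to give the backend its own deadline and return whatever has arrived by then. The following is a minimal sketch of that idea using Python's standard `concurrent.futures`; `gather_until` and the source names are hypothetical, and the slides do not say how the real backends did this.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict


def gather_until(sources: Dict[str, Callable[[], str]],
                 deadline_s: float) -> Dict[str, str]:
    """Run every source in parallel and return only what finished in time."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn): name for name, fn in sources.items()}
    done, _not_done = wait(futures, timeout=deadline_s)
    # Do not block on stragglers; they finish (or fail) in the background.
    pool.shutdown(wait=False)
    # Sources that raised an error are simply left out of the result.
    return {futures[f]: f.result() for f in done if f.exception() is None}


def fast_source() -> str:
    return "medal table"


def slow_source() -> str:
    time.sleep(0.5)  # slower than the deadline used below
    return "comments"


partial = gather_until({"medals": fast_source, "comments": slow_source},
                       deadline_s=0.1)
print(partial)
```

The caller gets a partial but cacheable page inside its own timeout, instead of nothing at all.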
    31. More timeouts
       • Unspecified timeouts
       • Wrongly specified timeout units
         – ms/sec
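
Both problems on this slide (no timeout given, and milliseconds confused with seconds) can be caught by refusing bare numbers in configuration. A minimal sketch, assuming timeouts are written as strings such as `500ms` or `2s`; the format and the `parse_timeout` helper are illustrative, not taken from the BBC platform.

```python
import re

_UNITS_TO_SECONDS = {"ms": 0.001, "s": 1.0}


def parse_timeout(value: str) -> float:
    """Convert '500ms' or '2s' to seconds; reject values without a unit."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*", value)
    if match is None:
        raise ValueError(f"timeout {value!r} needs an explicit unit (ms or s)")
    number, unit = match.groups()
    return float(number) * _UNITS_TO_SECONDS[unit]
```

With this, a bare `30` in a config file fails loudly at load time instead of silently meaning 30 ms in one component and 30 s in another.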
    32. Poor Application Performance
       • Multiple synchronous content requests
       • International cachability
       • Missing negative caching
         – Bypassed shared caches
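
Negative caching means remembering "nothing there" answers for a short time, so repeated requests for missing content do not hit the backend every time. The sketch below is a generic in-process illustration with hypothetical names and TTL values; it does not represent the shared caches used on the BBC platform.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class NegativeCache:
    """Cache hits and misses; misses get a shorter time to live."""

    def __init__(self, fetch: Callable[[str], Optional[str]],
                 ttl_s: float = 60.0, negative_ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._negative_ttl_s = negative_ttl_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh entry, which may be a cached miss (None)
        value = self._fetch(key)  # None means "not found"
        ttl = self._ttl_s if value is not None else self._negative_ttl_s
        self._store[key] = (now + ttl, value)
        return value
```

The short negative TTL is a trade-off: content that appears later becomes visible within a few seconds, while a burst of requests for a missing page reaches the backend only once.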
    33. Testing frequency
       • Every two weeks
       • Every week
       • Every other day
    34. One week before… The opening ceremony…
       • 1st successful test on Live
         – with no errors at all.
    35. Performance Overview
       • Did find problems
         – Weren’t found on stage
       • In all architecture layers
       • Components believed to be “fine” were not
       • Stage is not suitable for this level of testing
       • Proposal for any future “high profile” event
       • CDNs didn’t really get tested
    36. Resilience Overview
       • Teams never tested failure scenarios
       • Assumed that services didn’t fail
       • Inconsistent use of flagpoles
       • Reliance on mod_cache stale-on-error
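
Stale-on-error means that when a refresh from the backend fails, the cache serves the last good copy instead of the error. The sketch below illustrates the behaviour in plain Python; it is not Apache's mod_cache implementation, and the class name, `max_age_s` value and return shape are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve the last good copy when a refresh from the backend fails."""

    def __init__(self, fetch: Callable[[str], str], max_age_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (body, is_stale)."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._max_age_s:
            return entry[1], False
        try:
            body = self._fetch(key)
        except Exception:
            if entry is None:
                raise  # nothing cached yet: the error reaches the user
            return entry[1], True  # expired copy is better than an error
        self._store[key] = (now, body)
        return body, False
```

The `raise` branch shows why relying on this alone is fragile: a page that was never cached, or that cannot be cached, has no stale copy to fall back on.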
    37. Other problems
       • Running a “fake” Olympics
         – That is invisible to the public
         – Did consider publicising a test
       • No A/B (bucket) testing capability
       • Some tests affected BAU
       • No real test of the HLS HDS streaming
       • Platform monitoring cycle
    38. Other problems
       • RCA complicated by shared platform
       • Testing stopped by BAU/TX
       • High reliance on key staff
         – Some tests suffered
       • No CDN testing
         – At their request
         – Places unfair load on infrastructure
       • Unable to simulate network congestion
    39. Working with external tester
       • Workflow testing differed
         – User journeys
         – Direct linking to hotspots
       • Very responsive to altering tests
       • Did add extra complexity
    40. Did it work
       • YES
         – Found and fixed issues
         – Before they bit us
         – On production
         – With little impact on BAU
    41. Recommendations
       • Increase stage capacity
       • Intelligent load balancing
       • Test NFRs in Development
       • Caching, caching and more caching
       • Kill load tests quickly
       • Improve internal load testing
       • Profile frontends under load
       • Better post analysis tools
    42. Some Statistics
    43. Daily Reach (M)
    44. Streaming Views (M) Wed 1st Aug
    45. Unique Browsers (M)
    46. Thanks for listening
       • Thanks to flickr users:
         – dgjones
           • Office Dalek, London, 14-10-06
           • http://www.flickr.com/photos/dgjones/284592369
         – b3cft
           • Bombe rebuild detail
           • http://www.flickr.com/photos/b3cft/3797123899
         – Karindalziel
           • Clouds
           • http://www.flickr.com/photos/nirak/644336486
         – Enjoy Surveillance
           • What are you looking at?
           • http://www.flickr.com/photos/enjoy-surveillance/34795807/
         – Solo
           • 45th Annual Watsonville Fly-in and Air Show
           • http://www.flickr.com/photos/donsolo/4959045491/in/photostream/
         – SF Brit
           • Sunset over Iguazu
           • http://www.flickr.com/photos/cnbattson/4333692253/
       • Olympics Photos: www.london2012.com
       • Other Photos: EpicWin, FailBlog, Haha-Business
    47. Special Thanks to:
         – David Holroyd
           • Technical Architect BBC Sport (Olympics)
         – Matt Clark
           • Senior Technical Architect BBC Sport
    48. Thanks for listening
       • This presentation:
         – TBC
       • Me:
         – Andy “Bob” Brockhurst
         – Twitter: b3cft (and pretty much anywhere online)
         – www.kingkludge.net