The BBC’s
Experience of
preparing for the
2012 London
Olympics




For Velocity London
2012
            Andy “Bob” Brockhurst
           Principal Engineer BBC
            Platforms/Frameworks
Introduction

•   The Team
•   LAMP (without the M)
•   Tomcat Java Service Layer
•   Custom Apache modules
•   Varnish with extensions
•   ZendFramework..
    – ...customised a.k.a PAL
• Barlesque
How the BBC works

•     One domain[1]
•     Two technology stacks[2]
•     Certs and SSL
•     ProxyPass’
•     Apps are a TLD
•     360+ apps
•     Everyone shares everything[3]
[1] Okay there are several but they are all really the same one.
[2] Okay, three if you are going to be picky.
[3] Yes really, everything!
Network Topology

• Dual DC[1]
• No DC affinity[2]




[1] One more soon(ish)
[2] Well a couple of apps do[3]

[3] We don't talk about them
Traffic Routing

•   TM -> PAL
•   TM -> Varnish -> TM -> PAL
•   TM -> Service Layer
•   TM -> Varnish -> TM -> Service Layer
Traffic Routing

    Request
iplayer |  everything else
sport | 
     v 
    TM      .-> TM .--> TM .--> TM
     | / | /          | /    |
     | /      | /     | /    |
     v /       v / api v /      v
   Varnish        PAL     Varnish Dynamite
Environments

•   Integration
•   Test
•   Staging
•   Live
•   Journalism
Right let’s do some testing
Why?

• Too much change
  – Network Architecture
  – Server Configurations
  – Load balancers
  – Peering points
• High Profile
• Gain confidence
Gaining Confidence

• Load testing on Stage
  – Tests individual applications
  – Single endpoints only
  – No concurrent load
• Real hardware
• Real data
  – As much as possible
• Real Journalism
Other objectives

• Maintain BAU
• Handle failure gracefully
• Deliver Expectation
“What the Abdication did for Radio

and the Coronation did for
 Television,

London 2012 will do for Online.”
Current Volumetrics

• Big numbers for sport
  – 9M users/day
  – 90M views/day
• Punishing peaks
  – Saturday football final scores 4000 pv/s
  – ~750k Concurrent users
• Wimbledon
  – 1700 pv/s
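
A rough cross-check of these figures, as a sketch only: it converts the daily page-view total to an average per-second rate and compares it with the Saturday peak. The variable names are illustrative, and treating the two figures as directly comparable is an assumption the slide does not state.

```python
# Figures quoted on the slide; the names are illustrative.
VIEWS_PER_DAY = 90_000_000
SECONDS_PER_DAY = 24 * 60 * 60
SATURDAY_PEAK_PVS = 4000

# Average page views per second if traffic were spread evenly over the day.
average_pvs = VIEWS_PER_DAY / SECONDS_PER_DAY

# How much higher the Saturday final-scores peak is than that average.
peak_to_average = SATURDAY_PEAK_PVS / average_pvs

print(f"average: {average_pvs:.0f} pv/s")
print(f"Saturday peak is {peak_to_average:.1f}x the daily average")
```

The average comes out at roughly 1,040 pv/s, so the football peak is a little under four times the daily average.
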
Expected Volumetrics

• Expected peaks
  – 1.5M concurrent users
  – 60k different sports pages
    • 2,500 per minute
  – 30% video via iPlayer
Timeline

• March 2012 (T minus 5 months)
  – Team members assigned
  – Resilience testing
  – Performance testing
• Testing with External Partner
Olympics Run-up

•   Jubilee (2nd June)
•   Euros (8th June)
•   Wimbledon (25th June)
•   Formula One
Cloud Testing

• International testing
• Detailed test results
Cloud Testing

•   First performance test breaks live
•   Exposed monitoring issues
•   Couldn’t internally diagnose
•   Lots of tail, grep, awk, sed.
Early Findings

•   Stop tests
•   Monitoring
•   UK Data centre capacity
•   UK Data centre network segments
(Not) Caching kills

• Conditional modules
• Non-Olympics related modules
  – Commenting / Favourites
• Lowers cachability
• Testing an immature product
• Subsequent testing exposed more
What is a failure?

•   Error 500?
•   Blank pages?
•   Stale content?
•   Slow pages?
•   Burning data centres?
Resilience Testing

• Kill backends
• Traffic Manager
  – Screw with headers
  – Screw with status (418 anyone)
  – Truncate body
• Introduce waits
• Limit cache sizes
• Reduce network bandwidth
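
The kinds of damage listed above can be sketched as small functions applied to a response. This is an illustration only: the `Response` class and the fault helpers are hypothetical names, not the Traffic Manager's API, and nothing here touches a real network.

```python
import time
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Iterable


# Hypothetical response model, just enough to show the faults.
@dataclass(frozen=True)
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


Fault = Callable[[Response], Response]


def drop_header(name: str) -> Fault:
    """Screw with headers: remove one header, case-insensitively."""
    def fault(resp: Response) -> Response:
        kept = {k: v for k, v in resp.headers.items() if k.lower() != name.lower()}
        return replace(resp, headers=kept)
    return fault


def force_status(status: int) -> Fault:
    """Screw with status: replace the status code (418, anyone?)."""
    return lambda resp: replace(resp, status=status)


def truncate_body(max_bytes: int) -> Fault:
    """Truncate body: keep only the first max_bytes bytes."""
    return lambda resp: replace(resp, body=resp.body[:max_bytes])


def add_wait(seconds: float, sleep: Callable[[float], None] = time.sleep) -> Fault:
    """Introduce waits: pause, then pass the response through unchanged."""
    def fault(resp: Response) -> Response:
        sleep(seconds)
        return resp
    return fault


def apply_faults(resp: Response, faults: Iterable[Fault]) -> Response:
    """Apply each fault in order and return the damaged response."""
    for fault in faults:
        resp = fault(resp)
    return resp
```

Keeping each fault as a separate function means a test run can combine them, for example a forced 418 with a truncated body, and check how the frontend copes.
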
Early findings

• Failure mode testing
  – Everything is a SPOF
  – Performance sucks in a failure
Specific findings

•   Monitoring Thresholds
•   Verbose logging, everywhere
•   Timeouts
•   No data
•   Volumetrics
•   Unfair load balancing
Verbose Logging

•   Wrong levels configured
•   Diagnostic information
•   Expected/Handled errors
•   Too much detail
•   Hurts health/forensic reporting
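
One way to keep expected errors out of the operational logs, sketched with Python's standard `logging` module. The per-environment levels and the `fetch_with_fallback` helper are assumptions for illustration, not the BBC's configuration.

```python
import logging

# Hypothetical mapping: chatty in development, quiet on live.
LEVEL_BY_ENVIRONMENT = {
    "integration": logging.DEBUG,
    "test": logging.DEBUG,
    "stage": logging.INFO,
    "live": logging.WARNING,
}


def configure_logger(environment: str, name: str = "example.app") -> logging.Logger:
    """Set the logger threshold for the environment; unknown names get WARNING."""
    logger = logging.getLogger(name)
    logger.setLevel(LEVEL_BY_ENVIRONMENT.get(environment, logging.WARNING))
    return logger


def fetch_with_fallback(logger: logging.Logger, fetch, fallback):
    """Treat a missing item as an expected, handled condition."""
    try:
        return fetch()
    except KeyError:
        # Handled error: logged at INFO so it does not appear on live
        # and does not drown out the errors operations need to see.
        logger.info("content missing, serving fallback")
        return fallback
```

With this arrangement a handled miss is still visible on stage, while live logs stay reserved for warnings and above.
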
Not enough logging

•   Fatals with no logging
•   Unhandled conditions
•   Monitoring holes
•   Operations staff blind
Platform Configuration

• Unfair load-balancing
  – Remove older commodity servers
• Competitive service applications
  – Re-home critical applications
“Timeouts at lower levels in the architecture
 MUST be set shorter than the timeouts
 configured at higher levels of the
 architecture.”
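
The rule lends itself to an automated check. A minimal sketch, assuming each layer's timeout is known in seconds and the layers are listed from the top of the architecture down; the function name and the layer names in the usage note are hypothetical.

```python
from typing import List, Tuple


def find_timeout_violations(layers: List[Tuple[str, float]]) -> List[str]:
    """Return a message for every layer whose timeout is not shorter
    than the timeout of the layer directly above it.

    `layers` is ordered from the highest level (closest to the user)
    to the lowest, as (name, timeout_in_seconds) pairs.
    """
    violations = []
    for (upper_name, upper_t), (lower_name, lower_t) in zip(layers, layers[1:]):
        if lower_t >= upper_t:
            violations.append(
                f"{lower_name} ({lower_t}s) is not shorter than {upper_name} ({upper_t}s)"
            )
    return violations
```

For example, `find_timeout_violations([("traffic-manager", 10.0), ("frontend", 2.0), ("service", 5.0)])` reports the service layer, because its 5 second timeout outlasts the frontend calling it.
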
Timeouts

• Frontend/Backend timeouts
  – Frontends with lower timeouts
  – Caches never populated
• Alter backends to return early
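
"Return early" can be sketched as a backend that asks several sources in parallel and answers with whatever has arrived by a deadline. This is an assumption about the shape of the fix, not the BBC's code; `gather_with_deadline` and `slow_source` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Tuple


def gather_with_deadline(
    fetchers: Dict[str, Callable[[], object]],
    deadline_seconds: float,
) -> Tuple[Dict[str, object], List[str]]:
    """Run every fetcher in its own thread and return (results, missing).

    `results` holds the sources that answered in time without raising;
    `missing` lists the names that were late or failed.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    futures = {executor.submit(fetch): name for name, fetch in fetchers.items()}
    done, _ = wait(list(futures), timeout=deadline_seconds)
    executor.shutdown(wait=False)  # do not block on the stragglers
    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    missing = sorted(name for name in fetchers if name not in results)
    return results, missing


def slow_source(delay_seconds: float, value: object) -> Callable[[], object]:
    """Demo helper: a source that takes delay_seconds to answer."""
    def fetch() -> object:
        time.sleep(delay_seconds)
        return value
    return fetch
```

The deadline would be chosen to sit below the caller's own timeout, so the frontend receives a partial page rather than nothing. Late threads still run to completion in the background; a real service would also need to cancel or bound that work.
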
More timeouts

• Unspecified timeouts
• Wrongly specified timeout units
  – ms/sec
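
One guard against the ms/sec mix-up is to refuse any timeout that does not name its unit. A minimal sketch; the accepted syntax (`500ms`, `2s`) and the `parse_timeout` name are assumptions for illustration.

```python
import re

# Divisors that convert each accepted unit to seconds.
_DIVISOR_TO_SECONDS = {"ms": 1000.0, "s": 1.0}
_DURATION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*$")


def parse_timeout(value: str) -> float:
    """Convert '500ms' or '2s' to seconds; reject bare numbers."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"timeout {value!r} needs an explicit unit (ms or s)")
    number, unit = match.groups()
    return float(number) / _DIVISOR_TO_SECONDS[unit]
```

A bare `"30"` raises `ValueError`, which also covers the unspecified-timeout case: a missing or unitless value fails at configuration time instead of silently becoming 30 ms or 30 s.
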
Poor Application Performance

• Multiple synchronous content requests
• International cachability
• Missing negative caching
  – Bypassed shared caches
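
Negative caching means remembering "not found" answers as well as successful ones, so repeated requests for missing content do not reach the backend every time. A minimal in-memory sketch; the class name, the TTL values and the use of `KeyError` to signal "not found" are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple

_MISSING = object()  # sentinel stored for "the backend said not found"


class NegativeCache:
    def __init__(
        self,
        loader: Callable[[str], object],
        ttl_seconds: float = 30.0,
        negative_ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._negative_ttl = negative_ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> object:
        """Return the cached value, or raise KeyError for a cached miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            value = entry[1]
        else:
            try:
                value = self._loader(key)
                ttl = self._ttl
            except KeyError:
                # Remember the miss, but for a shorter time than a hit.
                value = _MISSING
                ttl = self._negative_ttl
            self._entries[key] = (now + ttl, value)
        if value is _MISSING:
            raise KeyError(key)
        return value
```

The shorter negative TTL is a design choice: content that appears shortly after a miss becomes visible quickly, while a burst of requests for a missing page still costs the backend only one lookup.
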
Testing frequency

• Every two weeks
• Every week
• Every other day
One week before…

  The opening ceremony…

• 1st successful test on Live
  – with no errors at all.
Performance Overview

• Did find problems
    – Weren’t found on stage
•   In all architecture layers
•   Components believed to be “fine” were not
•   Stage is not suitable for this level of testing
•   Proposal for any future “high profile” event
•   CDNs didn’t really get tested
Resilience Overview

•   Teams never tested failure scenarios
•   Assumed that services didn’t fail
•   Inconsistent use of flagpoles
•   Reliance on mod_cache stale-on-error
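
The stale-on-error behaviour being relied on can be illustrated with a small cache that serves an expired copy when the backend fails. This is a sketch of the idea only, not how mod_cache implements it; the class name and the status labels are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    def __init__(
        self,
        fetch: Callable[[str], object],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> Tuple[object, str]:
        """Return (value, status), status being 'fresh', 'fetched' or 'stale'."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1], "fresh"
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                # Backend failed: fall back to the expired copy.
                return entry[1], "stale"
            raise  # nothing cached, so the failure reaches the user
        self._entries[key] = (now + self._ttl, value)
        return value, "fetched"
```

The last branch is the weak spot: a page that was never cached gets no protection, which is one reason relying on this alone is risky.
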
Other problems

• Running a “fake” Olympics
    – That is invisible to the public
    – Did consider publicising a test
•   No A/B (bucket) testing capability
•   Some tests affected BAU
•   No real test of the HLS HDS streaming
•   Platform monitoring cycle
Other problems

• RCA complicated by shared platform
• Testing stopped by BAU/TX
• High reliance on key staff
  – Some tests suffered
• No CDN testing
  – At their request
  – Places unfair load on infrastructure
• Unable to simulate network congestion
Working with external tester

• Workflow testing differed
  – User journeys
  – Direct linking to hotspots
• Very responsive to altering tests
• Did add extra complexity
Did it work?

• YES
 – Found and fixed issues
 – Before they bit us
 – On production
 – With little impact on BAU
Recommendations

•   Increase stage capacity
•   Intelligent load balancing
•   Test NFRs in Development
•   Caching, caching and more caching
•   Kill load tests quickly
•   Improve internal load testing
•   Profile frontends under load
•   Better post analysis tools
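
"Kill load tests quickly" could be automated with a guard that watches recent responses and signals when to stop. A minimal sketch; the window size, the 5% threshold and the choice of 5xx as "error" are assumptions, not figures from the talk.

```python
from collections import deque


class LoadTestGuard:
    """Signal that a load test should stop when the recent error rate is too high."""

    def __init__(self, window: int = 100, max_error_rate: float = 0.05, min_samples: int = 20) -> None:
        self._results = deque(maxlen=window)  # True for each 5xx response
        self._max_error_rate = max_error_rate
        self._min_samples = min_samples

    def record(self, status: int) -> bool:
        """Record one response status; return True if the test should stop."""
        self._results.append(status >= 500)
        if len(self._results) < self._min_samples:
            return False  # too few samples to judge
        error_rate = sum(self._results) / len(self._results)
        return error_rate > self._max_error_rate
```

The load generator would call `record()` for every response and abort as soon as it returns `True`, rather than waiting for someone to notice that live is suffering.
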
Some Statistics
Daily Reach (M)
Streaming Views (M) Wed 1st Aug
Unique Browsers (M)
Thanks for listening
•   Thanks to flickr users:
     –   dgjones
           •     Office Dalek, London, 14-10-06
           •     http://www.flickr.com/photos/dgjones/284592369
     –   b3cft
           •     Bombe rebuild detail
           •     http://www.flickr.com/photos/b3cft/3797123899
     –   Karindalziel
           •     Clouds
           •     http://www.flickr.com/photos/nirak/644336486
     –   Enjoy Surveillance
           •     What are you looking at?
           •     http://www.flickr.com/photos/enjoy-surveillance/34795807/
     –   Solo
           •     45th Annual Watsonville Fly-in and Air Show
           •     http://www.flickr.com/photos/donsolo/4959045491/in/photostream/
     –   SF Brit
           •     Sunset over Iguazu
           •     http://www.flickr.com/photos/cnbattson/4333692253/
•   Olympics Photos: www.london2012.com
•   Other Photos: EpicWin, FailBlog, Haha-Business
Special Thanks

to:
      – David Holroyd
        • Technical Architect BBC Sport (Olympics)


      – Matt Clark
        • Senior Technical Architect BBC Sport
Thanks for listening

• This presentation:
  – TBC
• Me:
  – Andy “Bob” Brockhurst
  – Twitter: b3cft (and pretty much anywhere online)
  – www.kingkludge.net


Editor's Notes

  1. Flagpoles; Varnish device detection; Varnish geo IP lookup; Cookie manipulation; Variant caching; mod_annotate; Micro - MVC helper for Zend/PAL; Spectrum - templating
  2. Okay, bbc.co.uk and bbc.com. Okay, Forge, Journalism. Yup, Service layer on shared physical servers. PAL apps installed on all frontends.
  3. Physically same TMs and Varnishes. Traffic routing destinations, header at entry point.
  4. Journalism run a separate stack with the same environments. Also have previewers for editorial previews.
  5. Increased network capacity; Hardware replacements; RHEL 5 -> 6; 10Gb NICs
  6. Stage load tests done internally
  7. What the abdication did for Radio, and the Coronation did to Television, the Olympics will do for Online.
  8. Roughly 1/3 traffic from mobile, 2/3 desktop/tablet; 18M users/week on sport (normal); 9M users/day on sport (event); 90M page views/day (1000/sec); 30% traffic international; July - August traditionally quiet (no football); New mobile site; Expect to exceed normal peaks
  9. Formula One Monaco May 27th internal video stream testing; Olympics Fri 27th July -> Sun 12th August
  10. Virtualised dynamically provisioned externally hosted testing
  11. Stage environment done, not suitably confident; Live considered too different from stage; Our internal testing can’t use proxies easily; Target of x concurrent users; No concurrent load
  12. Backends return whatever data they have after a certain time
  13. Speculative requests for content
  14. Would have killed us, had we not taken action; Stage frontends *Always* died first; Journalism and frontends under spec’d for this type of test
  15. Root cause analysis
  16. Added to Non-Functional Requirements for BBC Products at Development
  17. Tennis Singles Finals - Serena Williams and Andy Murray golds; 820,000 Request Sun 5th Aug
  18. Bradley Wiggins TT peaked at 700 Gbps; 2.8 petabytes that day; Exceeded in 24hrs entire coverage of FIFA World Cup 2010