OBJECTIVE
Put working software into production as quickly as possible, whilst minimising risk of load-related problems:
• Bad response times
• Lack of capacity
• Availability too low
• Excessive system resource use
Within the context of websites.

TRADITIONAL APPROACH
Load testing through simulation
http://www.flickr.com/photos/danramarch/4423023837

DECIDE WHAT TO TEST
• Focus on busiest instant
• Model most-hit functionality
• Extrapolate to expected load
• Look at production traffic
• Or attempt educated guess

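A minimal sketch of "look at production traffic, focus on the busiest instant, extrapolate to expected load". It assumes request timestamps have already been parsed from an access log; the function names, the sample timestamps and the growth factor are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter
from datetime import datetime


def peak_requests_per_second(timestamps):
    """Return the highest number of requests seen within any single second."""
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    return max(per_second.values(), default=0)


def target_load(peak_rps, growth_factor):
    """Extrapolate the observed peak to the load the test should generate."""
    return peak_rps * growth_factor


# Hypothetical sample: three requests in one second, one in the next.
timestamps = [
    datetime(2013, 5, 22, 12, 0, 0),
    datetime(2013, 5, 22, 12, 0, 0, 500000),
    datetime(2013, 5, 22, 12, 0, 0, 900000),
    datetime(2013, 5, 22, 12, 0, 1),
]
peak = peak_requests_per_second(timestamps)
target = target_load(peak, growth_factor=2.0)  # growth factor is an assumption
```

The growth factor is where the educated guess enters; keeping it as an explicit parameter keeps that assumption visible.
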
DECIDE ON SCOPE
Component test
Chain test
Full environment test
• Test coverage
• Level of certainty
• Number of systems
• Amount of work

SET UP TEST DATA
• Usually starts as a copy from production
• Or an educated guess at what people will enter
• Anonymise
• Make tests deterministic
• Synchronise between all systems
http://www.flickr.com/photos/22168167@N00/3889737939/

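One possible way to combine "anonymise", "deterministic" and "synchronise between all systems" is keyed hashing: the same input always maps to the same pseudonym, so copies of the data in different systems still match. This is a sketch under assumptions; the key, the field names and the `user-` prefix are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; it only needs to be identical in every system
# that receives a copy of the test data.
SECRET_KEY = b"test-environment-only"


def pseudonymise(value, key=SECRET_KEY):
    """Map a sensitive string to a stable pseudonym (same input, same output)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user-" + digest[:12]


def anonymise_record(record, sensitive_fields=("name", "email")):
    """Return a copy of the record with the sensitive fields replaced."""
    return {
        field: pseudonymise(value) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical records as they might appear in two different systems.
order = anonymise_record({"email": "jane@example.com", "total": 42})
account = anonymise_record({"email": "jane@example.com", "name": "Jane"})
```

Because the mapping is deterministic, `order["email"]` and `account["email"]` still refer to the same pseudonymous user, and a test run produces the same data every time.
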
DECIDE ON STRATEGY
One or more of:
• Scalability test
• Stress test
• Endurance test
• Regression test
• Resilience test
http://www.flickr.com/photos/timjoyfamily/5935279962/

DECIDE ON TEST DURATION
(which is tricky)
http://www.flickr.com/photos/wwarby/3297205226

PROVIDE HARDWARE
Copy of production?
Only one copy?
Virtualisation?
Sharing between teams?
http://www.flickr.com/photos/s_w_ellis/2681151694/

INTEGRATE INTO PIPELINE
Unit test: very fast
Functional integration test: fast
Load test: takes longer

INTEGRATE INTO PIPELINE
Unit test, functional integration test, load test
Very fast, takes longer

PERMANENT LOAD TESTING
Daytime: constant load, teams inspect impact of changes
Nighttime: endurance test
Weekends: refresh test data
http://www.flickr.com/photos/renaissancechambara/5106171956/

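The schedule above could be expressed as a small function that a scheduler calls to decide what the load test environment should be doing. This is only a sketch: the office hours (08:00 to 18:00) and the mode names are assumptions, not part of the slide.

```python
from datetime import datetime


def load_test_mode(now, day_start_hour=8, day_end_hour=18):
    """Pick the activity for the permanent load test environment."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return "refresh test data"
    if day_start_hour <= now.hour < day_end_hour:
        return "constant load"
    return "endurance test"


weekday_morning = load_test_mode(datetime(2013, 5, 22, 10, 0))  # a Wednesday
weekday_night = load_test_mode(datetime(2013, 5, 22, 23, 0))
saturday = load_test_mode(datetime(2013, 5, 25, 10, 0))
```
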
RESPONSE TIME
DNS lookup (www.xebia.com)
SSL handshake
Time to first byte + loading HTML
Time to render
Time to document complete
Browser CPU use
Bandwidth
# connections to a single host
Parse times
Blocking client code
http://www.webpagetest.org/result/130522_FG_10SC/1/details/

CLEAR REQUIREMENTS
Response time
Fail: 10   Now: 3.5   Goal: 1
Intention: Users get a response quickly so that they are happy and spend more money.
Stakeholder: Marketing dept.
Scale: 95th percentile of “document complete” response times, in seconds, measured over one minute.
Metric: Page load times as reported by our RUM tool.
Inspired by Tom Gilb, Competitive Engineering

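The Scale line above can be turned into a check fairly directly. The sketch below groups load-time samples into one-minute windows, takes the 95th percentile of each window (nearest-rank method, which is one of several percentile definitions) and compares it with the Fail and Goal levels. The sample format and the verdict wording are assumptions, as is treating a value at or above the Fail level as a failure.

```python
import math
from collections import defaultdict

FAIL_SECONDS = 10.0  # "Fail" level from the requirement
GOAL_SECONDS = 1.0   # "Goal" level from the requirement


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_per_minute(samples):
    """samples: iterable of (epoch_seconds, load_time_seconds) pairs."""
    windows = defaultdict(list)
    for epoch_seconds, load_time in samples:
        windows[int(epoch_seconds // 60)].append(load_time)
    return {minute: percentile(times, 95) for minute, times in sorted(windows.items())}


def verdict(p95):
    """Classify a one-minute 95th percentile against the requirement."""
    if p95 >= FAIL_SECONDS:
        return "fail"
    if p95 <= GOAL_SECONDS:
        return "goal met"
    return "acceptable, goal not met"


# Hypothetical samples: minute 0 has a slow tail, minute 1 is uniformly fast.
samples = [(i, 0.5) for i in range(18)] + [(30, 3.5), (45, 12.0)]
samples += [(60 + i, 0.8) for i in range(10)]
results = {minute: verdict(p95) for minute, p95 in p95_per_minute(samples).items()}
```

Note that the single 12-second outlier in minute 0 does not trigger a failure, because the requirement is stated on the 95th percentile rather than on the maximum.
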
ADJUST REQUIREMENTS DUE TO LACK OF REAL BROWSERS
WebPageTest: first view + repeat view (median of 3)
95th percentile response times from access logs

Playground to test changes
No impact on real users
Less pressure

More work
Guesswork and extrapolation
Can take a significant amount of time
More hardware

THINGS WILL BREAK...
... in spite of your best efforts
http://www.flickr.com/photos/jmarty/1239950166/

SO INSTEAD WE SHOULD FOCUS ON FAST RECOVERY
http://www.flickr.com/photos/19107136@N02/8386567228/

“MTTR is more important than MTBF*”
John Allspaw
* for most types of F

MTBF LEADS TO FUD
[Chart: 99th percentile response time (s), axis from 0 to 2.0, plotted against test duration]

Time →
TTD → find cause (RCA) → write & test fix → build → deploy → validate
compile, deploy & test
TTR
Monitoring, Alerts
• Skills
• Organisation
• Culture
• Maintainability
• Simple architecture
• Fast workstations
• Good tooling
• Able to quickly test locally
• Automation
• Fast build server
• Efficient tests
Monitoring
• Automation
• Flexible architecture

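To track how the timeline above develops, the time to detect and the time to recover can be averaged over past incidents. The sketch below assumes each incident record carries three timestamps and that TTR is measured from detection to validated recovery; both the record layout and that definition are assumptions, and teams that measure from the start of the failure would adjust the second function.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when the fix was validated in production


def mean_time_to_detect(incidents):
    """Average of (detected - started) over all incidents."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_recover(incidents):
    """Average of (resolved - detected) over all incidents."""
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Hypothetical incident history.
incidents = [
    Incident(datetime(2013, 5, 1, 9, 0), datetime(2013, 5, 1, 9, 5), datetime(2013, 5, 1, 9, 35)),
    Incident(datetime(2013, 5, 8, 14, 0), datetime(2013, 5, 8, 14, 15), datetime(2013, 5, 8, 15, 15)),
]
mttd = mean_time_to_detect(incidents)
mttr = mean_time_to_recover(incidents)
```
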
MONITORING
Technical metrics
• CPU use
• Memory use
• TPS
• Response times
• etc
Process metrics
• # bugs
• MTTR, MTTD
• Time from idea to live on site
• etc
Business metrics
• Revenue
• # unique visitors
• etc
http://www.flickr.com/photos/smieyetracking/5609671098/

MEASURE LATENCY
Avg. response times front end vs backend
Number of calls

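As an illustration of the two measurements named above, the sketch below summarises a set of page views into an average front-end time, an average backend call time and the number of backend calls per page. The record layout (`front_end_s`, `backend_calls_s`) is hypothetical; real data would come from whatever tracing or logging is already in place.

```python
from statistics import mean


def latency_summary(page_views):
    """page_views: non-empty list of dicts with 'front_end_s' (float)
    and 'backend_calls_s' (list of backend call durations in seconds)."""
    all_calls = [d for view in page_views for d in view["backend_calls_s"]]
    return {
        "avg_front_end_s": mean(view["front_end_s"] for view in page_views),
        "avg_backend_call_s": mean(all_calls) if all_calls else 0.0,
        "avg_calls_per_page": len(all_calls) / len(page_views),
    }


# Hypothetical measurements for two page views.
summary = latency_summary([
    {"front_end_s": 2.0, "backend_calls_s": [0.25, 0.75]},
    {"front_end_s": 4.0, "backend_calls_s": [0.5, 0.5]},
])
```

Comparing the front-end average with calls per page times the backend average gives a rough idea of how much of the user-visible latency the backend accounts for, assuming the calls are made sequentially.
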
GO/NO-GO MEETINGS
• What are the biggest fears?
• How can we measure this?
• What can be done if it does happen?

RETROSPECTIVES
How can we prevent a failure from happening again?
How can we detect it earlier?
Was there only one root cause?
http://www.flickr.com/photos/katerha/8380451137

INTRODUCE OUTAGES
Chaos Monkey
Game day exercises
http://www.flickr.com/photos/frostnova/440551442/

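The idea of deliberately introducing outages can be sketched in a few lines: pick a random instance and take it down, then watch how the system and the team recover. This is not the actual Chaos Monkey tool; the function, the instance names and the `terminate` callback are hypothetical stand-ins for whatever the hosting environment provides.

```python
import random


def introduce_outage(instances, terminate, rng=None, protected=()):
    """Terminate one randomly chosen instance that is not protected.

    Returns the chosen instance, or None if there was nothing to terminate.
    """
    rng = rng or random.Random()
    candidates = [i for i in instances if i not in protected]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    terminate(victim)
    return victim


# Dry run: record the victim instead of calling a real infrastructure API.
terminated = []
victim = introduce_outage(
    ["web-1", "web-2", "db-1"],
    terminate=terminated.append,
    protected=("db-1",),
)
```

Passing `terminate` as a parameter makes a dry run like this possible before the function is pointed at a real environment.
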
CULTURE
• Dev and Ops work together on providing information.
• Assumptions are dangerous; try to eliminate as many as possible.
• Small changes are easier to fix than large ones.
• Deploy during office hours so everyone is available in case problems occur.
• All information, including business metrics, should be accessible to everyone.

SIMPLE, FLEXIBLE ARCHITECTURE
• If the site goes down often, its architecture is probably at fault
• Avoid fragile systems
• Resilience is key
• Scalable (redundancy is not waste)
• Prefer many small systems to a few large ones
• State is a “hot brick”

CHANGES FOR THE BUSINESS
• Accept pushing smaller changes.
• Continuous delivery vs continuous deployment.
• Share data.

CONCLUSION
Work on your ability to respond to failure.
Trying to prevent failure can slow you down and make you focus on the wrong things.
Keep assumptions clearly separated from facts. Make your decisions based on evidence.
Measure everything, including the impact of changes to the business.
Look for your compromise: try permanent load testing first and learn from that.