Ensuring a website will scale with excellent performance under peak levels of load is no easy task. Any number of problems can occur, from switch hardware failures and third-party service outages to a poor choice of algorithms or memory use in the code. Melissa Chawla describes Shutterfly's three-tiered approach to preventing site outages during peak load. First, check the development team's designs for scalability by holding performance design reviews for each project, including identifying throughput requirements for all downstream resources. Second, automate continuous load testing of individual web services against a non-production test environment. Third, load test client applications in concert against production systems. Melissa provides examples of defects Shutterfly identified and prevented with each of these three types of testing and describes the tool set Shutterfly uses. Join Melissa as she shares some of the challenges they faced, including making the test environment as performant as production.
T14
Performance Testing
10/5/17 13:30

A Three-Tier Load Testing Program Saved Our Bacon
Presented by: Melissa Chawla, Guidewire

Brought to you by:
350 Corporate Way, Suite 400, Orange Park, FL 32073
888-268-8770 · 904-278-0524 · info@techwell.com · http://www.starwest.techwell.com/
Melissa Chawla, Guidewire

Melissa Chawla manages a software development team at Guidewire. For the previous five years, she managed the test infrastructure and performance engineering teams at Shutterfly. Prior to that, Melissa held software development positions at startups in the high-performance computing, storage, and development tools software industries. She worked on the Tcl/Tk language development team at Sun Labs, contributing to several open source projects around Tcl. Her favorite sports are downhill skiing and Ultimate Frisbee. Learn more about Melissa at LinkedIn.
How a 3-Tiered Load Testing Program Saves Shutterfly's Bacon
STARWEST Conference, October 5, 2017
By Melissa Chawla

Agenda
• Goal of performance testing
• Shutterfly's ecosystem
• 3 tiers of performance testing
  o Goal, testing method, value added & challenges faced
• The Journey
  o Evolution of the 3 perf testing tiers
  o Steps to increase perf testing return on investment
  o Creating a performance culture
Main Performance Goals for Shutterfly
1. Improve web page speed during peak load
2. Avoid preventable site outages

Shutterfly's Complex Layered Ecosystem
• Application Software
  o Ecommerce + photo personalization/management
  o Main website + native iPhone, iPad & Android apps
  o Diverse software platform due to time & acquisitions
• Software Layers
  o Proxy, authentication & caching layers
  o Third-party dependencies
    • CDN, tax & payment services, beacons, etc.
• Hardware Platform
  o Datacenter
    • Switches, load balancers, 400+ servers
  o Cloud (AWS)
    • cloud ≠ scale
Shutterfly's Software Application Layers
[diagram]

Solution: 3 Tiers of Load Testing
1. Scalability Design Review
2. Continuous Unit Load Testing
3. End-to-end Production Load Test
Tier 1: Scalability Design Review
Goal: Prevent performance & scalability problems before code is written

Scalability Design Review: Method
• Treat performance at scale as a required feature
• Ask questions during the design phase
  o Expected performance of the feature
  o Impact on dependencies
• Required (where applicable)
  o Stories to supply performance monitoring
  o Stories to automate measuring performance at scale
Clarify Feature Performance & Monitoring Expectations
• Throughput
  o Estimated peak requests/minute
  o Average number of bytes inbound & outbound per request
• Latency
  o Response times during peak traffic levels
    • Median
      o Not the average, because we don't want outliers to skew results
    • 95th percentile
      o 95% of requests get this response time or better
      o 5% of requests are slower
• Monitoring
  o Require runtime visibility for throughput, latency & error rate (see the sketch below)
  o If you can't measure performance, assume it is poor
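These expectations translate directly into load-test pass/fail criteria. A minimal sketch in Gatling 3 syntax (Gatling is the tool Shutterfly used for its Tier 2 unit load tests, described below); the endpoint, injection rate, and thresholds are illustrative assumptions, not Shutterfly's actual numbers:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Encodes the slide's latency and error-rate expectations as pass/fail criteria.
class ExpectationsSimulation extends Simulation {
  val httpProtocol = http.baseUrl("https://perf-test.example.com") // hypothetical test host

  val scn = scenario("peak traffic")
    .exec(http("get feature page").get("/api/some-feature")) // hypothetical endpoint

  setUp(scn.inject(constantUsersPerSec(10).during(10.minutes))) // ~600 req/min peak estimate
    .protocols(httpProtocol)
    .assertions(
      global.responseTime.percentile(50).lt(200), // median: robust to outliers
      global.responseTime.percentile(95).lt(800), // 95% of requests this fast or better
      global.failedRequests.percent.lt(1.0)       // error rate stays visible and bounded
    )
}

Asserting on the median and the 95th percentile rather than the mean keeps a few extreme outliers from hiding, or faking, a regression.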
Assess Impact on Dependent Resources
• List all of the feature's resource dependencies
  o API calls to other code/3rd-party vendors
  o Databases
• Response time
  o What timeout is planned for each call to a dependency?
  o How will the feature behave when its dependency call fails or times out? (see the sketch below)
• Throughput
  o What additional peak throughput (req/min) will your feature place on its dependencies?
  o Can the dependent resources handle the additional load?
• Database scale
  o Number of concurrent database connections
  o Capacity: data growth rate over time
• Get approval from dependency owners
  o Hallway agreements are insufficient
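Deciding the timeout questions up front means fallback behavior is designed rather than improvised. A minimal sketch of the idea, assuming a hypothetical fetchTaxRate call to a third-party tax service; the 500 ms timeout and the fallback rate are invented for illustration:

import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object TaxClient {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical third-party call; stands in for any dependency API.
  def fetchTaxRate(zip: String): Future[BigDecimal] =
    Future { BigDecimal("0.0825") /* real HTTP call to the tax vendor goes here */ }

  // Two design-review answers captured in code: the per-call timeout (500 ms)
  // and the behavior when the dependency fails or times out (degrade to a
  // documented default rate instead of failing the whole checkout).
  def taxRateOrFallback(zip: String): BigDecimal =
    try Await.result(fetchTaxRate(zip), 500.millis)
    catch {
      case _: TimeoutException => BigDecimal("0.10") // dependency too slow
      case NonFatal(_)         => BigDecimal("0.10") // dependency call failed
    }
}

Recording both numbers in the review lets the dependency owner confirm the timeout is realistic and lets the business confirm the fallback is acceptable.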
Scalability Design Review: Challenges Faced
• Who needs to determine throughput & response time requirements?
  o Ideally, the Product Owner/Product Manager owns this
    • Requires input from dev/architecture
• Which projects & teams need scalability review?
  o Encountered pushback on the additional design overhead/meetings
    • Consider the cost of delivering features that don't scale
  o All dev teams; all dev projects and post-release bug fixes
Tier 2: Continuous Unit Load Testing
Meaning
• Benchmark web services and heavy-traffic web pages individually
Goals
1. Early identification of bottlenecks and leaks
2. Measure code performance prior to first production release
3. Catch code performance regressions over time
What Does Unit Load Testing Cover?
[diagram]
Unit Load Testing: Method at Shutterfly
• Isolated, stable test environment
• Automated continuous testing
  o Develop a Gatling load test for each web service/page (example below)
  o Launch load tests continuously via Jenkins
• Visual test results/performance trends
  o Collect test results in a Graphite database
  o View test trends in Jenkins' Gatling plugin
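A sketch of what one such per-service benchmark might look like; the service, payload, and rates are assumed for illustration. Jenkins runs simulations like this on a schedule, and enabling Gatling's graphite data writer in gatling.conf streams live results into the Graphite database mentioned above:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// One benchmark per web service; this one targets a hypothetical upload service
// in the isolated test environment.
class UploadServiceBenchmark extends Simulation {
  val httpProtocol = http.baseUrl("https://isolated-test-env.example.com")

  val scn = scenario("photo metadata upload")
    .exec(
      http("create photo record")
        .post("/api/photos")
        .body(StringBody("""{"albumId": 42, "title": "perf test"}""")).asJson
        .check(status.is(200))
    )

  // Ramp up, then hold a steady rate so run-over-run trend lines in the
  // Jenkins Gatling plugin stay comparable.
  setUp(
    scn.inject(
      rampUsersPerSec(1).to(20).during(2.minutes),
      constantUsersPerSec(20).during(15.minutes)
    )
  ).protocols(httpProtocol)
}

Keeping the injection profile fixed between runs is what makes the trend graphs meaningful; changing the rate would itself look like a performance regression.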
Unit Load Testing: Find Leaks Early
o Memory, file & database connection leaks (example below)
o Heavy database queries
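Connection leaks in particular tend to pass functional tests and code review, then surface only under sustained load. A hypothetical JDBC sketch of the pattern this tier catches (illustrative, not Shutterfly's code):

import java.sql.DriverManager

object ConnectionLeakExample {
  // Leaky version: the connection is never closed, so under load the
  // pool (or the database's connection limit) slowly drains.
  def countOrdersLeaky(jdbcUrl: String): Int = {
    val conn = DriverManager.getConnection(jdbcUrl)
    val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM orders")
    rs.next()
    rs.getInt(1) // conn.close() is missing
  }

  // Fixed version: always release the connection, even when the query throws.
  def countOrders(jdbcUrl: String): Int = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM orders")
      rs.next()
      rs.getInt(1)
    } finally conn.close()
  }
}

A single functional test passes either version; an hour of steady Gatling traffic exhausts the pool under the leaky one.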
Unit Load Testing: Challenges Faced
Environment, Environment, Environment!
  o Designing & funding a test environment to match production
    • Hardware & data
  o Maintaining a stable non-production environment
  o Can't run a load test for every code change
• Isolated environment
• Tests run serially to avoid performance variability (shared resources)
Tier 3: End-to-end Load Testing
Meaning
• Generate site load to reflect projected peak site usage
Goals
1. Prove all components can sustain peak load simultaneously
  o Shared resources can handle mixed traffic at scale
    • Databases, load balancers, switches, etc.
2. Prove effectiveness of failover & resiliency features at scale

What Does End-to-end Load Testing Encompass?
[diagram]
End-to-end Load Testing Method
Tooling
  o [tool logos in the original slide]
Environment
  o Test in production during low-usage periods
Data set
  o Generate test accounts populated with data similar to real users
Test planning, project tracking, post-testing results
  o Google Sheets to calculate & track transaction throughput (arithmetic sketched below)
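The spreadsheet arithmetic is simple but worth making explicit: projected peak business transactions fan out into per-service request rates, which become the injection targets for the load generators. A sketch with invented numbers (the services and per-order call counts are illustrative):

object ThroughputTargets {
  // Invented projection: 36,000 orders/hour = 10 orders/sec at peak.
  val peakOrdersPerHour = 36000

  // Hypothetical per-order fan-out to downstream services, as would be
  // measured from production access logs.
  val callsPerOrder = Map(
    "auth"    -> 2.0, // login + token refresh
    "cart"    -> 5.0,
    "tax"     -> 1.0,
    "payment" -> 1.2  // includes retries
  )

  // Per-service injection rates (requests/sec) for the load test plan.
  val targetsPerSec: Map[String, Double] =
    callsPerOrder.map { case (svc, calls) =>
      svc -> calls * peakOrdersPerHour / 3600.0
    }

  def main(args: Array[String]): Unit =
    targetsPerSec.foreach { case (svc, rps) => println(f"$svc%-8s $rps%6.1f req/s") }
}

Tracking these targets per service, rather than only a site-wide total, is what lets a mixed end-to-end test prove that each shared resource can take its share of peak load.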
Site Scale Tests
• Cover ~15 end-user scenarios
• ~20 load test executions in 2016
  o 3 for hardware failover
  o 5 for full site scale
  o 12 for targeted functionality
End-to-end Load Testing: Site Outages Prevented
• Database
  o Heavier read/write load than expected
  o Connection pool exhausted
  o Failover/failback configuration problems
• Front-end code DoS'ing back-end services
• Client bypassing the Squid cache
  o Client calling web services more frequently than expected
• Load balancer misconfiguration
• Switch/ISP misconfiguration
• Gaps in monitoring/alerting
End-to-end Load Testing: Challenges Faced
• Cost
  o Software & cloud services licenses
  o Complex test development work
  o 30+ engineers up most of the night during test execution
  o Find & fix bottlenecks iteratively
• Getting actionable results from artificial tests
  o Can't test every functional case
  o Fixed test data set
    • Hard to avoid hot-spotting
    • Hard to mimic real cache patterns
• Impacting production data
  o Minimize customer impact during & after tests
  o Don't mess with real customer data
  o Ensure business metrics don't include load test activity
  o Avoid order fulfillment
  o Conserve production DB space while avoiding deletion
Evolution of the 3 Tiers Over 5+ Years
• Started with unit load testing (2013)
  o Started building the isolated test environment in 2012
  o Attracted early adopters
  o Rapidly provided benchmarks for critical services
• Added end-to-end testing (2015)
  o Started with a proof of concept
    • Simple tests to find the biggest problems; show value
  o Expanded coverage: more end-user paths
  o Introduced failover testing under load
• Implementing scalability design review (2017)
  o Started with early adopters
  o Still a work in progress at Shutterfly
Steps to Increase Perf Testing Return on Investment
• Root-cause every issue, not just the worst ones
  o Most outages have multiple contributing root causes
  o Capture all contributing factors, not just the easiest to blame/fix
  o Lack of monitoring/alerting is also a root cause!
• For each issue that crops up in unit & end-to-end testing
  o Ask how the issue could have been prevented and/or caught sooner
  o Fix classes of problems, not just instances of problems
• Look for "near misses," not just full outages/failures
Creating a Culture of Performance at Scale
No longer acceptable
  o Blaming tools
    • Fix, replace, or delete faulty tooling
  o Ignoring or explaining away slow performance
    • Especially without proving our hypothesis!
      o "Should" vs. "does"
Required
  o Performance parameters up front
    • Peak throughput
    • Expected latency (95% response time)
  o Part of the "definition of done"
    • Load unit tests
    • Monitoring & alerts

Questions