11. Downtime costs
Amazon offline ($1M/h)
Amazon loses nearly $1M/hour if down (NYT, 2008)
Network downtime ($42K/h)
1 hour of network downtime costs $42,000 (Gartner, 2003)
eBay offline ($90K/h)
22h outage at eBay cost $2M ($90,909/h) (Internetnews, 1999)
Financial company down ($100K/h)
53.2% of finance companies lose over $100,000/hour (nextslm.org)
Let’s say $50K/h if you’re serious.
14. Availability   Downtime/year   Loss @$50K/h
90%                36.5 days       Can$43,800,000
95%                18.25 days      Can$21,900,000
98%                7.30 days       Can$8,760,000
99%                3.65 days       Can$4,380,000
99.5%              1.83 days       Can$2,190,000
99.8%              17.52 hours     Can$876,000
99.9%              8.76 hours      Can$438,000
99.95%             4.38 hours      Can$219,000
99.99%             52.6 minutes    Can$43,800      (less than an hour a year)
99.999%            5.26 minutes    Can$4,380
99.9999%           31.5 seconds    Can$438         (less than a minute a year)
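The arithmetic behind the availability table can be sketched in a few lines. This is a minimal sketch, assuming a 365-day year and the $50K/h figure from the slide; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (100.0 - availability_pct) / 100.0


def yearly_loss(availability_pct: float, cost_per_hour: float = 50_000) -> float:
    """Yearly loss if every hour of downtime costs cost_per_hour."""
    return downtime_hours(availability_pct) * cost_per_hour


if __name__ == "__main__":
    for pct in (90, 95, 98, 99, 99.5, 99.8, 99.9, 99.95, 99.99, 99.999, 99.9999):
        print(f"{pct}%  {downtime_hours(pct):10.4f} h  {yearly_loss(pct):,.0f}")
```

Swapping in your own hourly cost is the only change needed to redo the table for your business.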
20. You really don’t want web users to call you.
[Bar chart: cost estimates by channel (low, average, high), y-axis $0 to $15]
Web self-service: Can$0.24
IVR: Can$0.45
Email: Can$3.00
Live phone: Can$5.50
Source: BiT Group White Paper: “Web Self-Service Lowers Call Center Costs and Improves Customer Service”
22. If you don’t know the past
you can’t know the future.
If you don’t know the future,
you can’t budget for it.
Photo by Alan Cleaver from his Flickr Freestock set. Thanks, Alan!
http://www.flickr.com/photos/alancleaver/2638883650/
40. [Diagram: how visitors arrive at and move through the site, in numbered steps 1 to 9. Labels: organic search, ad network, campaigns, advertiser site, visitor, offer, upselling, abandonment, reach, purchase step (twice), mailing/alerts/promotions, conversion, enrolment, disengagement. Legend: impact on site, $ positive and $ negative.]
43. [Diagram: visitor flow for a collaboration site, in numbered steps 1 to 7. Labels: social network link, invitation, search results, visitor, content creation, good content, bad content, moderation, spam & trolls, engagement, viral spread, social graph, disengagement. Legend: impact on site, $ positive and $ negative.]
45. [Diagram: flow for a SaaS site, in numbered steps 1 to 7. Labels: enterprise subscriber, end user (employee), performance (good/bad), usability (good/bad), productivity (good/bad), SLA violation, refund, helpdesk escalation, support costs, renewal/upsell/reference, churn. Legend: impact on site, $ positive and $ negative.]
47. [Diagram: visitor flow for a media site, in numbered steps 1 to 6. Labels: visitor, media site, enrolment, targeted embedded ad, ad network, advertiser site, departure. Legend: impact on site, $ positive and $ negative.]
48. Why measure
Tactical, to find and fix
Strategic, to plan/trend
Part two
The elements of web latency
54. Slow sites suck
Lower conversion rates
Less likely to attract a loyal following
Liable for damages
Liable for refunds or service credits
Customers find other channels that cost more
55. Why the web is slow
A crash course in performance & availability.
56. [Diagram: Client, Internet, load balancer, web server, app server, DB; the client requests www.example.com]
57. Your website
[Same diagram]
58. DNS
[Same diagram; the client asks DNS for “www.example.com”]
59. DNS lookup
[Same diagram; the client asks DNS for “www.example.com”]
61. IP
[Same diagram; IP at both ends]
62. Internet routing
[Same diagram; IP at both ends]
63. Internet routing
[Same diagram; routers (R) inside the Internet]
64. TCP session
[Same diagram; routers (R) inside the Internet, IP at both ends]
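To see the first two steps on the wire, here is a minimal sketch that times the DNS lookup and the TCP session setup separately. The function name is hypothetical, and resolver caching on your machine may make the DNS figure look close to zero on repeat runs.

```python
import socket
import time


def time_dns_and_connect(host: str, port: int, timeout: float = 5.0):
    """Return (dns_seconds, connect_seconds) for one TCP connection to host:port."""
    start = time.perf_counter()
    # Resolve the name and take the first address the resolver returns.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    dns_seconds = time.perf_counter() - start

    start = time.perf_counter()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)  # returns once the TCP handshake completes
    connect_seconds = time.perf_counter() - start
    return dns_seconds, connect_seconds
```

Calling it as `time_dns_and_connect("www.example.com", 80)` from several locations gives a rough picture of how much latency is spent before any HTTP is exchanged.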
95. Getting a page by hand
Trying 67.205.65.12...
Connected to bitcurrent.com.
Escape character is '^]'.
GET /
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/
xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://gmpg.org/xfn/11">
<script type="text/javascript" src="http://
www.bitcurrent.com/wp-content/themes/
grid_focus_public/js/perftracker.js"></script>
<script>
</body>
</html>
Connection closed by foreign host.
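The same experiment can be scripted. This is a minimal sketch, not the exact request typed above: it sends an HTTP/1.0 request with a Host header (the telnet session used a bare `GET /`), and `fetch_by_hand` is a made-up name.

```python
import socket


def fetch_by_hand(host: str, port: int = 80, path: str = "/",
                  timeout: float = 5.0) -> bytes:
    """Send a bare HTTP/1.0 GET over a TCP socket and return the raw response."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection, as in the telnet session
                break
            chunks.append(data)
    return b"".join(chunks)
```

The returned bytes include the status line and headers, followed by the same HTML the telnet session printed.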
96. Static content
[Diagram: Client, Internet, load balancer, web server, app server, DB, with the protocol layers IP, TCP, SSL, and HTTP between them. Request: GET www.example.com/image.gif returns image.gif.]
98. Static content, dynamic content
[Same diagram. Request: GET www.example.com/dynamic.jsp returns dynamic.jsp.]
100. Static content, dynamic content, stored data
[Same diagram. Request: POST www.example.com/data.cgi reaches the database.]
107. Browser and data center server
Browser: TCP SYN (“let’s talk”)
Server: TCP SYN ACK (“Agreed: let’s talk”)
Browser: TCP ACK (“OK, we’re talking”)
Browser: SSL (“Someone might be listening!”)
Server: SSL (“Here’s a decoder ring”)
Browser: HTTP GET / (“Can I have your home page?”)
Server: HTTP 200 OK (“Sure!”)
Server: (thinks a bit)
Server: [index.html] (“Here it is!”)
Browser: (Renders furiously)
Bump, bump.
Server: [img js css] (“Have this too!”)
Browser: TCP FIN (“Thanks! I’m done now.”)
Server: TCP FIN ACK (“You’re welcome. Have a nice day.”)
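Each exchange in that conversation costs at least one network round trip, which is why distance hurts even a fast server. A rough lower bound can be sketched as follows; the two round trips for the SSL handshake are an assumption (a full handshake in older SSL/TLS versions), and the function name is made up.

```python
def min_time_to_first_byte(rtt: float, tls_round_trips: int = 2,
                           server_think: float = 0.0) -> float:
    """Lower bound, in seconds, before the first byte of index.html arrives."""
    tcp_handshake = rtt                    # SYN out, SYN ACK back
    tls_handshake = tls_round_trips * rtt  # assumed number of SSL round trips
    http_request = rtt + server_think      # GET out, first response byte back
    return tcp_handshake + tls_handshake + http_request
```

With a 100 ms round trip and a server that thinks for half a second, `min_time_to_first_byte(0.1, server_think=0.5)` comes to roughly 0.9 seconds before rendering can even begin.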
109. “Page load time” isn’t simple
Documents versus event models
AJAX
Mobility
CDNs
Third-party content
Embedded objects and plug-ins
116. Part of the problem
You control:
Server latency
Network latency for known content and network parameters
You’re blamed for:
Page rendering
Total network latency
User environment
You need diagnostic metrics so you can fix it.
You need escalation metrics so you can prove it and make it someone else’s problem.
117. Why measure
Tactical, to find and fix
Strategic, to plan/trend
What to measure:
How long until a user can use
the app as you intended?
Part three
Where to measure
118. Three tiers of data
WAN accessibility: One test from many locations
Can everybody get here?
App functionality: Several tests of key processes
Is my business model working correctly?
Tiered tests: Frequent metrics of each tier
Is network, service, CPU, data I/O to blame?
119. WAN accessibility
[Diagram: clients in several locations (A, B, C, ...) reach the site across the Internet: load balancer, web server, app server, DB. Labels: Place, Task, Goal.]
136. Landing page: View one story
Task: Create account
Pick name
Check if free
Set Password
CAPTCHA
Send mail
Get confirm
Task: Log in
Enter credentials
Verify
Recovery
Place: View stories
Vote up, Vote down, Next 25, Last 25
Place: Read poster comments
Vote up, Vote down, Next 25, Last 25
Task: Forward a story
Enter recipients
Enter message
Send
Task: Submit a new story
Enter URL
Describe
Deduplicate
Post it
Place: My account
Change address
Change PW
My comments
See karma
137. Landing page: View one story
Create acct.
Task: Log in
Place: View stories
Place: Read poster comments
Task: Forward a story
Task: Submit a new story
Place: My account
138. [Same site map, with “Create acct.” highlighted]
Create acct.
Form uptime
# started
Bad form
# CAPTCHA
Mail uptime
Mail bounced
Confirm & return
Return 3x
139. [Same site map, with “Place: View stories” highlighted]
Place: View stories
Stories/visit
# up/down
Time/story
Top stories
Refresh time
Views/page
141. Places
Efficiency matters
How quickly, how many,
productivity
Learning curve OK
Leave when they’re bored
Collect “aha” feedback
A/B test content for
pages/session, exits
142. Tasks
Effectiveness matters
Completion, abandonment
Intuitiveness rules
Leave when they change their
mind or it breaks
Collect “motivation” feedback
A/B test layouts for conversion
143. 2 sides of the same coin
Web analytics: What did visitors do?
End user monitoring: Could they do it?
145. For media sites
Are ads loading quickly and successfully clicked
through?
Is content loading fast enough for visitors?
146. For collaboration sites
Can visitors contribute (posting content, voting)?
Is bad content being mitigated (trolling, spam)?
147. For SaaS sites
Are your end users productive?
Are they making fewer mistakes?
Is the site working during customers’ business hours?
148. Tiered tests
[Diagram: clients (A, B, C) reach the site across the Internet: load balancer, web server, app server, DB. Labels: Place, Task, Goal.]
149. Testing the tiers
[Diagram: Client, Internet, load balancer, web server, app server, DB, with one test per tier, left to right]
Request a big object
Request uncached object
Do some heavy computing (or watch CPU)
Search a dataset for a string (or track query time)
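One way to run such per-tier tests is a small harness that times one probe per tier and reports the slowest. This is only a sketch: the tier names and the probe callables are placeholders you would replace with real requests against your own site.

```python
import time
from typing import Callable, Dict


def time_probes(probes: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each tier's probe once and return elapsed seconds per tier."""
    timings = {}
    for tier, probe in probes.items():
        start = time.perf_counter()
        probe()
        timings[tier] = time.perf_counter() - start
    return timings


def slowest_tier(timings: Dict[str, float]) -> str:
    """Name of the tier whose probe took longest."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    # Stand-in probes (assumptions): real ones would fetch a big object,
    # request an uncached page, trigger a heavy computation, run a search.
    demo = {
        "network": lambda: time.sleep(0.01),
        "web": lambda: time.sleep(0.02),
        "app": lambda: time.sleep(0.03),
        "db": lambda: time.sleep(0.05),
    }
    print(slowest_tier(time_probes(demo)))
```

Running the probes frequently and keeping the history matters more than any single reading.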
151. Why measure
Tactical, to find and fix
Strategic, to plan/trend
What to measure:
How long until a user can use
the app as you intended?
Where to measure:
WAN, from everywhere
Core app functionality
Tiers of components
Part four
How to measure performance data
184. Simultaneous: 5 tests at 15:00
Sequential: 5 tests from 15:00 to 15:05
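The difference between the two schedules can be sketched as two lists of start times. Even spacing for the sequential case is an assumption, as are the function names.

```python
from datetime import datetime


def simultaneous(start: datetime, count: int) -> list:
    """All tests fire at the same instant."""
    return [start] * count


def sequential(start: datetime, end: datetime, count: int) -> list:
    """Spread count test start times evenly across the window [start, end)."""
    step = (end - start) / count
    return [start + i * step for i in range(count)]


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 15, 0)
    t1 = datetime(2020, 1, 1, 15, 5)
    print(simultaneous(t0, 5))
    print(sequential(t0, t1, 5))
```

Simultaneous tests give a snapshot from many vantage points at once; sequential tests spread the same load over the window and catch short outages that a single burst would miss.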
185. Synthetic pros & cons
Pros
Easy to set up
Only way to test without actual visitor traffic
Can compare to competitors
Easy baseline establishment
Detects a problem before visitors see it
Consistent data over time
Cons
Brittle
Detects macro outages, not user events
Good geographic & network coverage costs money, generates load
No measurement of traffic volume
Places load on the site under test
205. [Diagram: a network tap between the browser and the load balancer and web server observes users A, B, and C. From the captured traffic it builds a visit history (pages P1, P2, P3), aggregate reports, and alerts.]
212. TopN, worstN
RUM tools are excellent for more qualitative data
What’s most broken?
What’s biggest?
What’s slowest?
What’s most inconsistent?
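Two of those questions, slowest and most inconsistent, reduce to simple statistics over per-page samples. A minimal sketch, assuming RUM data arrives as (page, latency in seconds) pairs; the function name and data shape are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def worst_n(samples, n=3):
    """Return (slowest pages, most inconsistent pages), worst first.

    samples is an iterable of (page, latency_seconds) pairs.
    """
    by_page = defaultdict(list)
    for page, latency in samples:
        by_page[page].append(latency)
    # Slowest: highest mean latency. Most inconsistent: highest spread.
    slowest = sorted(by_page, key=lambda p: mean(by_page[p]), reverse=True)[:n]
    inconsistent = sorted(by_page, key=lambda p: pstdev(by_page[p]), reverse=True)[:n]
    return slowest, inconsistent
```

Means hide outliers, so in practice you may prefer a high percentile over the mean when ranking the slowest pages.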
213. RUM pros & cons
Pros
Directly correlated with clickstream, analytics
Watches everything, not just the things you know about
Can be used to reproduce problems
Measures traffic as well as performance
Cons
May require physical installation
Can be a privacy risk
Doesn’t work if there’s no traffic
Need to filter out your own visits, crawlers, etc.
215. Why measure
Tactical, to find and fix
Strategic, to plan/trend
What to measure:
How long until a user can use
the app as you intended?
Where to measure:
WAN, from everywhere
Core app functionality
Tiers of components
How to measure it:
Synth, to ensure it’s working
RUM, to see where it’s broken
Part five
Getting the math right
232. “It can scarcely be
denied that the supreme goal of
all theory is to make the irreducible
basic elements as simple and as few
as possible without having to
surrender the adequate
representation of a single
datum of experience.”
http://media.photobucket.com/image/einstein/derekabril/einstein_010.png
233. “As simple as possible,
but no simpler.”
(FYI, this is irony.)
251. [Chart: traffic and latency plotted together over time]
71% correlation between traffic and latency.
If you have traffic predictions, and latency
is correlated with traffic, you may
be able to estimate performance in the
future from the business plan.*
*It’s seldom this simple.
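A correlation figure like the one on this slide can be computed from paired samples of traffic and latency. This is a minimal sketch of the Pearson coefficient; whether the slide used this exact statistic is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

Feed it hourly request counts and the matching hourly average latency; a value near 1 means latency rises with traffic, a value near 0 means traffic alone does not explain it.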
254. Baselines
Establish an agreed-upon set of metrics, and always
compare to these baselines.
What does “normal” look like?
Weekly variance? Seasonality?
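One simple way to account for weekly variance is to keep a separate baseline per weekday and compare each new reading to the baseline for that day. A sketch under that assumption; the function names and the (date, latency) data shape are made up.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_baseline(samples):
    """Mean latency per weekday (Monday is 0) from (date, latency) pairs."""
    buckets = defaultdict(list)
    for day, latency in samples:
        buckets[day.weekday()].append(latency)
    return {weekday: mean(values) for weekday, values in buckets.items()}


def ratio_to_baseline(baseline, day: date, latency: float) -> float:
    """How many times slower (or faster) a reading is than normal for that weekday."""
    return latency / baseline[day.weekday()]
```

A ratio of 2.0 on a Monday means twice as slow as a normal Monday, which is a more useful alarm than a fixed number of seconds.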
255. Why measure
Tactical, to find and fix
Strategic, to plan/trend
What to measure:
How long until a user can use
the app as you intended?
Where to measure:
WAN, from everywhere
Core app functionality
Tiers of components
How to measure it:
Synth, to ensure it’s working
RUM, to see where it’s broken
Get the math right
Part six
Targeting metrics to your audience
263. Your goal is to be clearly understood.
How technical are they?
How will they use it?
To fix something
To escalate to others
To plan the future
What words do they use?
Translate to their jargon
270. By timeframe
Type of metric         Timeframe   Delivery             Detail
Break/fix monitoring   Urgent      Push alerts to PDA   Simple messages
Daily reports          Automated   Mail PDF             Historical context
Quarterly planning     Prepared    Slide deck           Part of big picture
271. By medium
Where will this wind up?
Dashboard
NOC screen
Log file
Someone’s spreadsheet
Inbox
http://www.flickr.com/photos/warrenski/4190341621/
272. Why measure
Tactical, to find and fix
Strategic, to plan/trend
What to measure:
How long until a user can use
the app as you intended?
Where to measure:
WAN, from everywhere
Core app functionality
Tiers of components
How to measure it:
Synth, to ensure it’s working
RUM, to see where it’s broken
Get the math right
Part seven
Marching orders
274. First
Meet your analytics team
Find out
What are the key goals they’re monitoring?
Where are visitors coming from?
What are the most common entrance and exit
pages?
275. Second
Pick the three processes, pages, or functions that
matter most to you
Landing pages, or part of a conversion funnel
276. Third
Set up monitoring of:
Your site from many places (synthetic testing)
Your top 3 core business processes (synthetic or
RUM)
Your important infrastructure tiers (from agents +
synthetic, or RUM)
277. Fourth
Wait a week or two
To establish a baseline
To detect seasonal variance
To show others and get buy-in
278. Fifth
While you’re waiting, understand the elements of latency
and how they affect your performance
DNS
SSL
Network latency
Host (server) latency
Client page load time
279. Set a target threshold
Now that you have an idea of what “normal” is, set a
threshold
... but not just any threshold.
280. The login page                   Function
will have a total latency             Metric
of under 4 seconds                    Target
with a cached browser copy            User situation
from any US branch office             Testing point
95% of the time                       Percentile
weekdays, 8AM ET to 6PM PST           Time window
by synth test at 5m intervals         Collection type
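Checking a target like this comes down to a percentile test over the samples collected in the time window. A minimal sketch, assuming the nearest-rank percentile definition (other definitions interpolate) and made-up function names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(latencies, target_seconds=4.0, pct=95):
    """True if pct% of samples came in under target_seconds."""
    return percentile(latencies, pct) < target_seconds
```

Filtering the samples down to the right function, testing point, and time window has to happen before this check; the percentile only answers the last part of the sentence.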
282. How Apdex works
Frustrated: over 8 seconds
Tolerating: 2-8 seconds
Satisfied: 0-2 seconds
283. How Apdex works
Satisfied: 65 hits
Tolerating: 30 hits
Frustrated: 5 hits
Total requests: 100
Apdex = (65 + 30/2) / 100 = 0.80
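The score is easy to compute from raw latencies. A minimal sketch using the thresholds from the previous slide (satisfied up to 2 seconds, tolerating up to 8, which is 4 times the satisfied threshold); the function name is made up.

```python
def apdex(latencies, t=2.0):
    """Apdex score: (satisfied + tolerating / 2) / total requests."""
    if not latencies:
        raise ValueError("no samples")
    satisfied = sum(1 for x in latencies if x <= t)
    tolerating = sum(1 for x in latencies if t < x <= 4 * t)
    # Frustrated requests (slower than 4 * t) count for nothing.
    return (satisfied + tolerating / 2) / len(latencies)
```

With 65 requests at 1 second, 30 at 5 seconds, and 5 at 9 seconds, this reproduces the (65 + 30/2) / 100 calculation from the slide.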
284. Train your audience
Visit key stakeholders and walk them through the
report
Get them used to the information
In the same format
At the same time
From the same place
285. Put monitoring into your
release cycle
Talk to the development team
Adding instrumentation
Identifying new code functions that need testing
Verifying whether optimization worked
289. RUM
Client-side
AJAX (Gomez, Coradiant TrueSight Edge)
Agent-based (Aternity)
Inline (sniffer/tap)
Coradiant, Tealeaf, Beatbox (HP), Atomic Labs, Compuware
Full disclosure: We both worked at Coradiant.
Apdex
Server-side (logfile, agent)
302. AJAX
As for your male and female
slaves whom you may have:
you may buy male and female
slaves from among the nations
that are around you.
- Leviticus 25:44