Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/So63P6.
Richard Campbell shares his experiences evolving a web site from ordinary to resilient, the triage process, the quick-and-dirty solutions as well as the work to bring the site to true resiliency. Filmed at qconlondon.com.
Richard Campbell has spent more than 30 years playing around with microcomputers. Along the way he's done virtually every job you can imagine in the industry, from manufacturing to programming to consulting, training and writing. Richard is a Microsoft Regional Director, an ASP.NET MVP and a serial entrepreneur. He is also a partner in PWOP Productions, which creates a number of popular podcasts.
From Instability to Resilience: The Story of a Web Site
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/resiliency-website
3. Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. From Instability to Resilience: The Story of a Web Site
Richard Campbell
@richcampbell
5. Richard Campbell
• Background
– First laid hands on a microcomputer in 1977, it’s been all downhill from there
– Spent the last fifteen years helping companies scale software on a variety of platforms
• Currently
– Architect and Consultant
– Organizer of DevIntersection
– Rabid Podcaster
6. Podcasts
For .NET Developers
First published 2002
Two shows a week
955 episodes in the archive
For IT Pros
First published 2007
Once a week
357 episodes in the archive
For Tablet Developers
First Published 2011
Once a week
126 episodes so far
9. The Web Site Story
• Vertical market e-commerce system
• More than 200,000 discrete items
• Considered mature (version 3!)
• Busier than ever, but making less money
• Increasing tech support calls and complaints
11. Initial Diagnosis
• CPU utilization high, not pinned
• 20-30 Requests/sec, steady
• Request Queued jumps occasionally
• .NET Heap climbs to 800MB, then dumps
– Memory leak?
• Worker process recycling every 20 minutes
12. Taking Action
• Speed of response vs. comprehensive response
• Must have measurements before and after
• Addressing a memory leak with more memory!
13. First Actions & Results
• Switch to 64 Bit OS (but compile to 32 bit)
• Added 4GB of Memory (for a total of 8GB)
• Memory leak continued
– Just took longer
– Dump occurred at 2GB (maximum pool size)
– Worker process recycling every two hours
• Complaint level drops dramatically
14. Character of the Web Site
• Accuracy
• Reliability
• Scalability
• Performance
• Where does Resilience fit?
15. Better Living through Hardware
• Adding more web servers
– Spreading failure around
• Moving to the Cloud
– Paying for failure by the hour
16. Increasing Understanding
• What is causing this memory leak?
– Using .NET Memory Profiling to analyze
• Most memory consumed by cache
<foreshadow>
• Noticed large blocks of static memory (session objects)
</foreshadow>
17. When Cache Runs Amok
• Caching added to improve performance
• Cache objects being created around searches
• What are the chances of that cache object ever being used again?
• How do you expire a cache object?
18. Instrumenting Cache
• Wrapped cache objects in a dictionary abstraction
• Kept counters for re-use
• Two hours of run time (one worker process recycle) results in 95% of cache objects never reused
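The instrumentation on this slide can be sketched roughly as follows. This is a minimal illustration in Python rather than the .NET code used on the actual site; the class name `InstrumentedCache`, its methods, and the sample queries are all assumptions made for the example.

```python
class InstrumentedCache:
    """Dictionary-style cache wrapper that counts reads per entry (hypothetical)."""

    def __init__(self):
        self._store = {}
        self._reads = {}  # key -> number of reads after the entry was written

    def put(self, key, value):
        self._store[key] = value
        self._reads.setdefault(key, 0)

    def get(self, key, default=None):
        if key in self._store:
            self._reads[key] += 1
            return self._store[key]
        return default

    def never_reused_ratio(self):
        """Fraction of cached entries that were written but never read back."""
        if not self._reads:
            return 0.0
        unused = sum(1 for count in self._reads.values() if count == 0)
        return unused / len(self._reads)


cache = InstrumentedCache()
# Free-form search strings rarely repeat exactly (note the misspelled one).
for query in ["red widget", "blue widget", "green widget", "red wdiget"]:
    cache.put(query, f"results for {query}")
cache.get("red widget")  # the only entry that is ever read again

print(f"{cache.never_reused_ratio():.0%} of entries never reused")
```

Reading the ratio just before a worker process recycle gives the kind of figure quoted on the slide.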
19. Fixing Cache
• Don’t allow free form search cache
• Don’t let the users populate cache at all
• Treat the real problem
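One possible reading of the first two bullets, sketched in Python: the application fills the cache once from the catalog, and free-form searches run uncached so that user input never adds entries. The names `CatalogCache` and `run_search` and the sample catalog are hypothetical, not taken from the talk.

```python
class CatalogCache:
    """Read-only cache filled once by the application, never by user requests."""

    def __init__(self, items):
        # items: mapping of item id -> item record, loaded by the application
        self._items = dict(items)

    def get(self, item_id):
        return self._items.get(item_id)

    def __len__(self):
        return len(self._items)


def run_search(query, index):
    """Hypothetical stand-in for an uncached search over the catalog."""
    words = query.lower().split()
    return [item_id for item_id, name in index.items()
            if all(word in name.lower() for word in words)]


catalog = {101: "Red Widget", 102: "Blue Widget", 103: "Green Gadget"}
cache = CatalogCache(catalog)

# Searches are recomputed each time; only item lookups touch the cache,
# so its size stays fixed no matter what users type.
for query in ["widget", "red widget", "gadget", "purple thing"]:
    ids = run_search(query, catalog)
    records = [cache.get(item_id) for item_id in ids]
```

The trade-off is that every search pays its full cost, but memory use is bounded by the size of the catalog rather than by the variety of user input.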
20. Valuing Resiliency
• Nothing like failure to help determine value
• Owning your ROI
• Spending CapEx to reduce OpEx
21. Adding More Resiliency
• Bend further
– Multiple servers provide redundancy
• Elasticity
– Can we expand or shrink based on need?
• Better feedback
– How do we know when we’re bent, when to expand and when to shrink?
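The feedback question on this slide can be illustrated with a small decision rule. The sketch below is an assumption-laden example, not something shown in the talk: the function name `scaling_decision`, the choice of queued requests as the signal, and every threshold value are hypothetical.

```python
def scaling_decision(requests_queued, servers, min_servers=2, max_servers=8,
                     expand_above=50, shrink_below=5):
    """Return the new server count for one measurement interval.

    The gap between expand_above and shrink_below keeps the pool from
    flapping when load hovers near a single threshold. All values here
    are illustrative assumptions.
    """
    if requests_queued > expand_above and servers < max_servers:
        return servers + 1  # bent: add capacity
    if requests_queued < shrink_below and servers > min_servers:
        return servers - 1  # idle: release capacity
    return servers          # within the comfortable band


print(scaling_decision(80, servers=2))
print(scaling_decision(20, servers=3))
print(scaling_decision(1, servers=3))
```

Keeping `min_servers` at two or more preserves the redundancy from the first bullet even when load is low.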
22. Clusters of Joy
• Successful failover is hardware, software & configuration together
• If it hasn’t been tested, it doesn’t work
• Routine failover is your friend!
23. The Plague of State
• Getting session out of the web server
• The battle of performance over scalability
• As little state as possible (and no less)
24. Instrumenting Production
• The only source of truth
• How can you build software that can test in production without impacting the user?
• Instrumentation driving features
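One lightweight way to collect production measurements without changing what the user sees is to wrap operations in a timing decorator. The sketch below assumes an in-memory `timings` store and a hypothetical `lookup_item` operation; a real site would ship these numbers to its monitoring system instead.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)  # operation name -> list of durations in seconds


def instrumented(name):
    """Record how long each call takes, whether it succeeds or raises."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on error; the result or exception
                # still reaches the caller unchanged.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("lookup_item")
def lookup_item(item_id):
    # Hypothetical operation standing in for a real page or query.
    return {"id": item_id}


lookup_item(101)
lookup_item(102)
print(len(timings["lookup_item"]), "calls recorded")
```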
25. Building Resiliency
• Resiliency is a journey, not a destination
• Know the character of your system
• Work from facts, not conjecture to move the system forward