issues with the use of canaries in upgrade
1. NICTA Copyright 2012 From imagination to impact
Issues with the Use of Canaries
Len Bass
Goal of Presentation
• For you to understand there are interesting problems associated with canaries.
• For me to be sure that I understand how canaries are used in practice.
• For me to get feedback about appropriate testing results.
Various Upgrade Strategies
• How many at once?
– One at a time (rolling upgrade)
– Groups at a time (staged upgrade, e.g. canaries)
– All at once (big flip)
• What happens to old versions?
– Replaced en masse
– Maintained for some period for compatibility purposes
• This talk will focus on the canary strategy and examine some issues associated with canaries
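The three "how many at once" options differ only in how the instances are grouped into batches. A minimal sketch, assuming a hypothetical helper `upgrade_batches` and made-up instance names (neither comes from the talk):

```python
from typing import List


def upgrade_batches(instances: List[str], strategy: str,
                    group_size: int = 2) -> List[List[str]]:
    """Split instances into the batches that are upgraded together.

    The strategy names and group_size are illustrative assumptions.
    """
    if strategy == "rolling":        # one at a time
        size = 1
    elif strategy == "staged":       # groups at a time; the first group plays the canaries
        size = group_size
    elif strategy == "big_flip":     # all at once
        size = max(len(instances), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [instances[i:i + size] for i in range(0, len(instances), size)]


if __name__ == "__main__":
    fleet = ["i-1", "i-2", "i-3", "i-4", "i-5"]
    for name in ("rolling", "staged", "big_flip"):
        print(name, upgrade_batches(fleet, name))
```

In the staged case, the first batch is the one that would be watched before the remaining batches are upgraded.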
Context
• Deep service dependency hierarchy – may be 70 deep
• Upgrading one service in this hierarchy
• Need to consider both service and its clients
Figure from Netflix Tech Blog
Current State of a Major Internet Provider
• Each service has an owner
• Every service instance is instrumented
• When a canary is deployed, the service owner examines monitoring data (next slide) and uses judgment to decide when to move to production.
• Canary testing is currently based on functionality. There is no stress testing of canaries.
• Research question – what scientific criteria can be used to judge when to go into production?
Netflix Monitoring Sequence
• Client outbound (start/end)
• Network (start/end)
• Service network (inbound start/end)
• Service processing (start/end)
• Service outbound (start/end)
• Network (start/end)
• Client inbound (start/end)
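Each stage in the sequence has a start and an end timestamp, so per-stage latency can be derived by subtraction. A minimal sketch, assuming the timestamps arrive as a dictionary keyed by stage name; the stage labels, the `(request)`/`(response)` suffixes that distinguish the two network hops, and the function names are illustrative, not Netflix's actual schema:

```python
from typing import Dict, Tuple

# Stage order follows the slide; the labels themselves are assumptions.
STAGES = [
    "client outbound",
    "network (request)",
    "service network inbound",
    "service processing",
    "service outbound",
    "network (response)",
    "client inbound",
]


def stage_durations(spans: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Return end - start for every stage, in the units of the input."""
    return {stage: spans[stage][1] - spans[stage][0] for stage in STAGES}


def slowest_stage(spans: Dict[str, Tuple[int, int]]) -> str:
    """Name the stage that took the longest for this request."""
    durations = stage_durations(spans)
    return max(durations, key=durations.get)


if __name__ == "__main__":
    # Made-up timestamps in milliseconds, for illustration only.
    sample = {stage: (i * 10, i * 10 + i + 1) for i, stage in enumerate(STAGES)}
    print(stage_durations(sample))
    print(slowest_stage(sample))
```

Comparing these per-stage durations between canary and non-canary instances is one way an owner might look at the monitoring data.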
Common upgrade strategy
• Require all versions to be backward compatible with previous versions.
• Require changes associated with a new version to be software switchable.
• Clients of a service must be version aware in order to know whether to utilize new functionality.
• Once all instances have been upgraded to the new version, send a signal to turn on changes both in the new version and in its clients.
• When using canaries, only turn on changes for a subset of services and their clients.
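The combination of backward compatibility, a software switch, and version-aware clients can be sketched as follows. The `Service` class, the integer version numbers, and the `client_version` request field are hypothetical names chosen for illustration, not an API described in the talk:

```python
class Service:
    """A service instance whose new behaviour sits behind a software switch."""

    def __init__(self, version: int):
        self.version = version
        self.new_features_on = False   # switch is off when the instance is deployed

    def enable_new_features(self) -> None:
        # Only instances running the new version (assumed here to be 2) can switch.
        if self.version >= 2:
            self.new_features_on = True

    def handle(self, request: dict) -> str:
        # Version-unaware clients send no client_version and get old behaviour,
        # which any instance can serve because versions are backward compatible.
        wants_new = request.get("client_version", 1) >= 2
        if self.new_features_on and wants_new:
            return "new behaviour"
        return "old behaviour"


if __name__ == "__main__":
    fleet = [Service(1), Service(2), Service(2)]
    canaries = fleet[1:2]              # turn the switch on for a subset only
    for instance in canaries:
        instance.enable_new_features()
    for instance in fleet:
        print(instance.version, instance.handle({"client_version": 2}))
```

Under these assumptions, an upgraded instance with the switch off is indistinguishable from an old instance, which is what allows the switch to be thrown separately from the deployment.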
Canary Issues
• Canaries are a form of live testing. Put a new version into limited production to test its correctness.
• Issues
– How long are new versions tested to determine correctness?
• Period based – for some period of time
• Load based – under some utilization assumptions
• Result based – until some criterion is met
– How are clients of the new version chosen and how is this choice enforced?
– How are the canaries deployed?
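The three ways of deciding how long to test can be written as three predicates over the same canary statistics. A minimal sketch; the `CanaryStats` fields, the thresholds, and the use of error rate as the result criterion are all assumptions made for illustration, since the talk leaves the actual criteria open:

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    elapsed_s: float        # how long the canary has been running
    requests_served: int    # how much load it has seen
    errors: int             # how many of those requests failed


def period_based(stats: CanaryStats, min_seconds: float) -> bool:
    """Done once the canary has run for some period of time."""
    return stats.elapsed_s >= min_seconds


def load_based(stats: CanaryStats, min_requests: int) -> bool:
    """Done once the canary has served an assumed amount of load."""
    return stats.requests_served >= min_requests


def result_based(stats: CanaryStats, max_error_rate: float,
                 min_requests: int = 1) -> bool:
    """Done once a criterion is met; error rate is a stand-in criterion."""
    if stats.requests_served < min_requests:
        return False
    return stats.errors / stats.requests_served <= max_error_rate


if __name__ == "__main__":
    stats = CanaryStats(elapsed_s=3600, requests_served=5000, errors=5)
    print(period_based(stats, min_seconds=1800))
    print(load_based(stats, min_requests=10000))
    print(result_based(stats, max_error_rate=0.01))
```

The predicates can disagree for the same canary, which is part of why the choice of stopping rule matters.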
General Picture
[Figure: a client sends requests to a top-level load balancer, which feeds two second-level load balancers. One second-level load balancer is shown with two Version A servers and one Version B server; the other with one Version A server and one Version B server.]
Client
• Version aware
• Must know about new versions in order to take advantage of new functionality
• May be implicitly version aware based on, e.g., cluster
• Version-unaware clients will only use old functionality, and these can be served by any server since services are backward compatible.
In addition:
• Load variation may trigger elasticity rules.
• Deciding whether to load the new version or the old version raises other issues.
More Detail on Upgrade Process
• Canaries are deployed and allowed to run for a period without turning on new features.
• This is to test backward compatibility.
• Once canaries pass this test, then the new features are turned on.
Question 1 – how are client messages routed?
• Three cases:
1. Clients are separated, a priori, into those utilizing the new version and those not.
2. Messages are routed arbitrarily by the load balancer, and those that are received by the new version of the service cause the client to be designated as utilizing the new version.
3. All services are capable of being old version or new version and choose based on the version of the message they receive. (This seems contrary to the canary strategy.)
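The three routing cases can be contrasted in a few lines each. This is a sketch under stated assumptions: "A" and "B" stand for the old and new versions, and `route_a_priori`, `StickyRouter`, `behaviour_for_message`, and `canary_fraction` are hypothetical names, not part of any load balancer's API:

```python
import random
from typing import Optional, Set


def route_a_priori(client_id: str, canary_clients: Set[str]) -> str:
    """Case 1: the set of clients using the new version is fixed in advance."""
    return "B" if client_id in canary_clients else "A"


class StickyRouter:
    """Case 2: routing is arbitrary, but a client that lands on the new
    version once is designated as a new-version client from then on."""

    def __init__(self, canary_fraction: float,
                 rng: Optional[random.Random] = None):
        self.canary_fraction = canary_fraction
        self.rng = rng or random.Random()
        self.designated: Set[str] = set()

    def route(self, client_id: str) -> str:
        if client_id in self.designated:
            return "B"
        if self.rng.random() < self.canary_fraction:
            self.designated.add(client_id)
            return "B"
        return "A"


def behaviour_for_message(message_version: int) -> str:
    """Case 3: every instance can act as either version and picks per message."""
    return "B" if message_version >= 2 else "A"


if __name__ == "__main__":
    print(route_a_priori("c1", {"c1", "c7"}))
    router = StickyRouter(canary_fraction=0.1, rng=random.Random(0))
    print([router.route(f"c{i}") for i in range(10)])
    print(behaviour_for_message(2))
```

Case 2 needs the designation to be remembered somewhere (here, an in-memory set), which is one form of the enforcement question raised on the Canary Issues slide.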
More Questions
2. After turning on new functionality, how does one decide that the canaries have been sufficiently functionally tested for the fixed set of clients?
– Are there results from the testing community that pertain here? I don’t know.
– After answering this question, one can add additional clients to those being routed to Version B until the metric available(?) from the testing community passes some threshold.
3. How can one perform stress testing in a live environment?
– We are examining a metric called “Performance Nonscalability Likelihood” for its applicability.
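The idea in question 2 of adding clients to Version B until a metric passes a threshold can be sketched as a simple ramp loop. The metric itself is the open question, so `adequacy` below is a placeholder callback supplied by the caller; it, `ramp_canary_clients`, and all parameter names are hypothetical:

```python
from typing import Callable, List


def ramp_canary_clients(all_clients: List[str], initial: int, step: int,
                        adequacy: Callable[[List[str]], float],
                        threshold: float) -> List[str]:
    """Grow the set of clients routed to Version B until adequacy >= threshold.

    adequacy is a stand-in for whatever test-adequacy metric is eventually chosen.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    routed = list(all_clients[:initial])
    while adequacy(routed) < threshold and len(routed) < len(all_clients):
        routed.extend(all_clients[len(routed):len(routed) + step])
    return routed


if __name__ == "__main__":
    clients = [f"c{i}" for i in range(10)]
    # Toy metric for illustration only: fraction of clients already routed.
    toy_metric = lambda routed: len(routed) / len(clients)
    print(ramp_canary_clients(clients, initial=2, step=3,
                              adequacy=toy_metric, threshold=0.5))
```

The loop also stops when every client has been added, in which case the threshold may never have been reached and the caller would need to check the metric again.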
Summary
• We have identified the problem of determining when canary testing is adequate as one that could use more rigor.
• There are multiple strategies for connecting new-version clients to new-version services.
• Outstanding questions are:
– How long before all of the benefit of using canaries has been realized and the new functionality can be turned on?
– How is stress testing performed?
Questions/comments
• Len.bass@nicta.com.au