Presented in Pittsburgh, PA for Abstractions.io in August 2016.
"The greater your adoption success, the greater your retirement pain." The number of APIs has exploded and will continue growing as more teams adopt microservices. In my time at Amazon, my team of eight owned hundreds of microservices, covering every possible life cycle phase. Two of our services were more than a decade old. These lingered despite several interface iterations launched to cover alternative use-cases, multiple ownership changes, and funded retirement initiatives.
The reason was simple: hundreds of internal and external clients had a dependency on these services. A coordinated migration effort was nearly impossible to prioritize across the whole of Amazon and integrated third-party merchant software. To make matters worse, many of our clients were as old as the services themselves. Several were owned by teams that had long since forgotten about them. We were all victims of these successful services.
When I took over as manager of the team, I knew that failure to retire these services in the next year would be a critical blow to our system scalability in Q4. Failure was not an option. I was hell-bent on hitting the power switch.
Every service owner will eventually encounter this problem. Few want to talk about it. In this session I'll elaborate on the challenges facing service owners, my approach to solving them, the impact of our success, and the lessons learned. I'll share the highs and lows - smooth migrations, and the pain of scream test victims. It is my hope that attendees will be able to learn from these experiences and succeed in their own retirement efforts.
Retiring Service Interfaces: A Retrospective on Two 10+ Year Old Services
1. JEFF NICKOLOFF
All in Geek Consulting
Retiring Service Interfaces: A Retrospective on Two 10+ Year Old Services
@allingeek
jeff@allingeek.com
2. Who am I?
• Former Amazonian
• Author of Docker in Action
• Independent Software Engineer
• Blogger
• Containerization and AWS consulting
3. Why should I care?
(about service retirement)
• There are a surprising number of ways to change a
service interface
• Most of those changes will require some retirement
campaign
4. Background and
Declarations
Microservices enable implementation iteration in isolation
while hindering interface iteration.
Amazon runs an amazing number of microservices.
My former team of eight owned hundreds of services.
Amazon has amazing tooling for service owners
(but it won’t save you).
What follows is an account of a process that surely happens
all the time, everywhere that different people’s code
establishes runtime service dependencies.
5. A Long Story Short
• We retired two 10+ year old service interfaces each
having hundreds of unique consumers
• We did so in about 7 months
• We did not go out of business
• Other similar deprecation campaigns had been
attempted within the five years prior
6. What and Why
Two services that predated all of AWS:
• Ol’ McCruftyface suffered from nonsensical UX, an
RPC-style interface, mutable entities, an overly broad
ID space, “variadic” function definitions, and no
authentication.
• Blobby Cleartext used fake crypto, weak (never
rotated) keys, crappy write-through caching, file
re-streaming, and no authentication.
7. The Plan
• Identify clients that will need to migrate
• Identify active use-cases
• Document migration paths and timeline
• Open communication with clients and present plan
• Follow up regularly and monitor migration efforts
• Increase migration pressure
• Shut it down… eventually
8. Client Discovery
• Amazon has one package manager to rule them all
• Service dependencies are modeled
• Dependent packages have known owners
• Amazon has strong hardware ownership, so we inspected
our service logs for client IP addresses
• Additional challenges?
Mixed ownership, old software, and infrequent usage
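As a rough illustration of the log-based discovery step, the sketch below counts requests per client IP and groups them by owning team. The log format, the `host_owners` inventory, and all names here are hypothetical stand-ins; the talk does not describe Amazon's actual tooling.

```python
import re
from collections import Counter

# Hypothetical access-log format: "<timestamp> <client_ip> <operation>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+(?P<op>\S+)$"
)


def discover_clients(log_lines, host_owners):
    """Count requests per client IP and group them by owning team.

    host_owners is an assumed inventory mapping IP -> team name.
    Returns (requests_by_owner, unowned_ips).
    """
    by_ip = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match:  # skip lines that do not fit the assumed format
            by_ip[match.group("ip")] += 1

    by_owner = Counter()
    unowned = []
    for ip, count in by_ip.items():
        owner = host_owners.get(ip)
        if owner is None:
            unowned.append(ip)  # no known owner: needs manual follow-up
        else:
            by_owner[owner] += count
    return dict(by_owner), sorted(unowned)


sample_logs = [
    "2016-01-04T10:00:00Z 10.0.0.5 GetItem",
    "2016-01-04T10:00:01Z 10.0.0.5 PutItem",
    "2016-01-04T10:00:02Z 10.0.1.9 GetItem",
    "malformed line",
]
owners = {"10.0.0.5": "team-orders"}
requests_by_owner, unowned_ips = discover_clients(sample_logs, owners)
```

The list of unowned IPs is where the additional challenges on this slide tend to show up: hosts with mixed ownership, old software, or infrequent callers that only appear in long log windows.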
10. Wait…
“Why didn’t you just make all of the client changes yourself?”
Hundreds of clients, unique release cycles, repositories,
permissions issues, and politics.
11. Migration Path and
Assistance
• Analyze existing API and usage patterns from logs
• Document an internal client migration
• Prepare a migration matrix (before and after
mapping for discovered use-cases)
• Organize a migration assistance on-call rotation
• Establish lightweight procedures for assistance
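One way to picture a migration matrix is as a lookup from each legacy operation to its documented replacement, checked against the operations actually seen in the logs. This is only a sketch: the operation names and notes below are invented for illustration and are not the real interfaces from the talk.

```python
# Hypothetical legacy operations mapped to their documented replacements.
MIGRATION_MATRIX = {
    "UpdateEntity": {
        "replacement": "CreateEntityVersion",
        "note": "entities become immutable",
    },
    "GetEntity": {
        "replacement": "GetEntityVersion",
        "note": "caller must authenticate",
    },
}


def plan_migration(observed_operations):
    """Split observed use-cases into those with a documented path and those without."""
    covered = {}
    undocumented = []
    for op in sorted(set(observed_operations)):
        entry = MIGRATION_MATRIX.get(op)
        if entry is None:
            undocumented.append(op)  # a use-case the matrix still has to cover
        else:
            covered[op] = entry["replacement"]
    return covered, undocumented


covered, undocumented = plan_migration(
    ["GetEntity", "UpdateEntity", "GetEntity", "BulkFetch"]
)
```

Any operation that lands in `undocumented` is a discovered use-case with no before-and-after mapping yet, which is worth resolving before the plan goes out to clients.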
12. Timeline and
Communication
• Make your sales pitch (carrots)
• Provide a complete but concise explanation,
documentation, assistance options, timeline, and
milestones
• Establish clear rules for regular communication
(frequency, medium, heartbeat, etc.)
• Highlight escalation paths and consequences of
compliance failure (sticks)
13. Carrots
• Better availability
• Clearer API UX
• Enhanced features
• Immutability
• Tighter latency guarantees
• Real data protection
15. Sticks and Secret Sticks
• Secretly reducing service redundancy and
throughput capacity (make the services worse)
• Failure to communicate - pageable
• Missed milestones - pageable
You’re supposed to own these clients, so own them.
16. Following Up… with Sticks
• Half your customers will comply quickly <3
• A quarter will talk to you and never act
(missing milestones - page ‘em)
• About a quarter will ignore you
(until later - page ‘em regularly)
• Some small percentage will never respond at all
(wait for the hammer)
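The follow-up rules above can be approximated as a small decision function. This is a sketch under assumptions: the `ClientStatus` fields, the 14-day silence limit, and the action labels are all hypothetical, and the real process involved human judgment that a function like this does not capture.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ClientStatus:
    name: str
    last_response: Optional[date]  # None means the team never replied
    milestone_due: date
    milestone_done: bool


def next_action(client, today, silence_limit_days=14):
    """Pick a follow-up action; the threshold is an assumed value."""
    if client.milestone_done:
        return "none"
    if today > client.milestone_due:
        return "page: missed milestone"
    silent = (
        client.last_response is None
        or (today - client.last_response).days > silence_limit_days
    )
    if silent:
        return "page: failure to communicate"
    return "follow up"


today = date(2016, 3, 1)
done = ClientStatus("a", date(2016, 2, 25), date(2016, 4, 1), True)
late = ClientStatus("b", date(2016, 2, 25), date(2016, 2, 15), False)
quiet = ClientStatus("c", None, date(2016, 4, 1), False)
active = ClientStatus("d", date(2016, 2, 25), date(2016, 4, 1), False)
actions = [next_action(c, today) for c in (done, late, quiet, active)]
```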
17. More Client Discovery…
Sticks
• We found that some clients could not be identified
• Fell back to weekly advertisements in org-wide
meetings
• Unscheduled, inconvenient scream tests
Scream tests have to be painful or they’ll be ignored.
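A scream test can be sketched as a request gate that rejects all legacy traffic during a window chosen by the service owner. The class, the status codes, and the message text below are assumptions for illustration, not the mechanism used in the talk; the windows are known only to the owner, which is what makes the test unscheduled from the client's point of view.

```python
from datetime import datetime, timedelta


class ScreamTest:
    """A window during which the legacy service deliberately fails."""

    def __init__(self, start, duration):
        self.start = start
        self.end = start + duration

    def is_active(self, now):
        # Half-open interval: the window ends exactly at self.end.
        return self.start <= now < self.end


def handle_request(operation, now, scream_tests):
    """Return an (HTTP-style status, message) pair for a legacy call."""
    if any(test.is_active(now) for test in scream_tests):
        return 503, f"{operation} is being retired; see the migration guide"
    return 200, "ok"


tests = [ScreamTest(datetime(2016, 5, 3, 14, 0), timedelta(hours=2))]
during = handle_request("GetEntity", datetime(2016, 5, 3, 15, 0), tests)
after = handle_request("GetEntity", datetime(2016, 5, 3, 16, 0), tests)
```

Failing every call in the window, rather than a small fraction, is in keeping with the point on this slide that a scream test has to hurt enough to be noticed.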
18. Prioritizing Empathy
1. I acknowledge that the short term gains of ignoring
service debt are tempting.
2. I also acknowledge that your team’s needs are
important.
3. However, I propose that massive risks like this one
are more important, and that our customers share my
opinion.
4. I’m always going to prioritize customer needs over
your comfort.
19. A Hammer
I could always just turn them off, release the
hardware, and delete the service definitions.
“… but you can’t do that.”
“Are you sure?”
20. A Few Anecdotes
• Muddled ownership problems
• Reorg problems
• Unowned clients
• “This person just doesn’t want to work” problems
• “The painful test is painful” problems
21. Swinging the Hammer
It’s 10:32 am PST, we’ve only got a trickle of traffic,
the remaining known clients have been unresponsive for
months, we’ve extended the shutdown date by a month,
we’re wearing “deal with it” sunglasses inside.
Hit it.
23. Dousing the Final Fires
• Swinging the hammer will light a few fires. It
may take a few days for people to notice them.
• Deal with them and resist all urges to turn the
service back on. Salt the earth where it stood.
Say, “This is your life now.”
24. Lessons Learned
… or suspicions confirmed
• Service adoption (really, any dependency) is debt
• Don’t transfer ownership of “complete” services
• Services that “just work” suffer the most drastic
knowledge rot
• Greater success will bring greater pain
25. Things I’d Do Again…
• Structured planning and communication
• Provide strong positive incentives
• Use operational pain as leverage
• Scream tests were very successful
• Swing the hammer
26. Things I’d Do Next Time
• Improve communication consistency
• Increase awareness of scream test risk
… but not of individual tests
• Escalate more quickly
27. JEFF NICKOLOFF
All in Geek Consulting
Questions about services or Docker? Come talk in the hall!
@allingeek
jeff@allingeek.com