You can break your customers almost as easily with a well specified API as with an unspecified one. See examples of how, and what you can do to mitigate the problem.
1. YOUR API SPEC ISN’T WORTH THE PAPER IT’S WRITTEN ON
@NORDICAPIS 2019 AUSTIN #AUSTINAPISUMMIT
GARETH JONES
PRINCIPAL API ARCHITECT
@GARETHJ_MSFT
2. YOU ARE AN API OWNER.
YOU ARE FEELING SMUG.
• Open API 3.0 description
• Generated SDKs
• Automated tests
• Pervasive monitoring
3. YOU DEPLOY A NEW BUILD
8. HYRUM’S LAW
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
9. NUMBER OF THINGS GROWS
• Array size grows from 0 to 1
• Array size grows from 1 to n
• Array size grows from < 1 page to > 1 page
10. SIZES OF THINGS GROW
• Image sizes get big
• Overall packet sizes get too big
• Example: major retailer’s 10 MB packet – Android device can’t handle it with some JSON stacks
Afternoon, I hope you’ve had a great first day?
My name is Gareth Jones, and I’ve been with Microsoft for over twenty years now, working on APIs for about the last six or so.
I spent a couple of years as an architect for the Microsoft Graph, and more recently in our Education team focusing on building a platform on the Graph for app-builders targeting the classroom.
I’d like to take half an hour this afternoon to talk about the limits of where we are with API descriptions when it comes to protecting our API consumers from unexpected change.
So let’s imagine you’ve shipped an API to a set of customers. And things are running well.
You’ve followed best practices, you feel in control.
And then…
…you deploy a new build. You’re sending data that matches your spec – all your tests are green.
But suddenly - tickets are flying – customers are on the phone – their apps are broken – your boss is NOT happy.
What went wrong?
Whether your API is public facing or internal, it’s essentially a consumer/producer contract. An API specification has many internal benefits to the producer in terms of engineering quality and predictability.
But, like all contracts, the looser it is, the more room for interpretation there is.
And I’m here to tell you folks, that even the best API descriptions out there today have quite a lot of wiggle room in them.
So spec interpretation happens on both sides of the relationship, but the burden of pain is usually felt by the consumers, because they don’t know what change to expect or what change they SHOULD have anticipated.
But perhaps more importantly - people are busy and maybe even lazy.
This doesn’t just apply to marmalade cats.
So often consuming code will be written to handle just the data that is returned from an API call to the first test account that gets set up.
We tend to focus on not making “breaking changes” in our APIs for some definition of breaking change and then anything we do outside of that definition, we say is the API consumer’s problem.
But what were we trying to achieve with our API in the first place?
Typically we were trying to enable some kind of business relationship.
So who is the burden on in that relationship to ensure success?
There’s a fundamental tension between optimizing for relationship continuity by not making any changes in an API,
and being flexible and agile to meet the changing needs of a business.
*You* have to design where you should land on that spectrum.
And today’s API definition languages and tools might not go far enough out of the box.
Of course, really, this isn’t a completely winnable game.
Hyrum Wright made this great observation – that fundamentally implementations leak to become implicit interfaces.
So let’s talk about some implementation leaks that most commonly cause problems.
This is perhaps the simplest mistake consumers make: rushing to get an implementation shipped.
A test account always had an empty list of Foos.
The initial data only had one bank account per person. But the API is defined as an array.
These initial manifestations in data translate into assumptions in code again and again and again.
They’re wrong – but they happen all the time.
Sometimes at the parsing layer – sometimes at the application code.
It’s not just arrays - often a paged collection handler ignores the next link and only processes the first page.
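A defensive consumer treats both problems the same way: assume the array can hold 0, 1, or n items, and keep following the next link until the server stops sending one. Here’s a minimal sketch; the `value` and `@odata.nextLink` field names are assumptions modeled on common OData-style APIs, not any specific service.

```python
def fetch_all(get_page, first_url):
    """Follow next links until exhausted; tolerate empty pages."""
    items = []
    url = first_url
    while url:
        page = get_page(url)
        # The array may legitimately have 0, 1, or n entries.
        items.extend(page.get("value", []))
        # Stop only when the server stops sending a next link.
        url = page.get("@odata.nextLink")
    return items
```

A consumer written this way keeps working when the test account’s empty list suddenly grows past a page boundary in production.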
Other things that grow are the actual payload itself.
Perhaps it’s the JSON overrunning some buffer – especially in IoT solutions.
Or perhaps test images were all low-res samples, but now in production you’re returning high-res PNGs.
Can your stack cope? Here’s a real example from my friend Dave, the CEO of APIMetrics.
A major retailer hit a problem when the stack in their Android app couldn’t deal with a JSON packet greater than 10 MB.
They hit that limit and … bang.
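On the consumer side, one mitigation is to fail explicitly on oversized payloads rather than letting the JSON stack blow up mid-parse. A minimal sketch, assuming a hypothetical 10 MB budget:

```python
import json

MAX_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB budget for this app

def parse_bounded(raw: bytes):
    """Refuse to parse payloads over budget instead of crashing deep
    in the JSON stack; callers can then log, page, or re-request."""
    if len(raw) > MAX_BYTES:
        raise ValueError(f"payload of {len(raw)} bytes exceeds budget")
    return json.loads(raw)
```

A controlled error at the boundary is far easier to diagnose than an out-of-memory crash somewhere inside a parser.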
Perf’s another frequent problem.
Perhaps it’s obvious that if you slow down your API calls you will have unhappy customers, especially if they happen to have called directly from a mobile app.
Think about your sequencing and flows and be super-sensitive to perf of calls that need to happen as predecessors to other calls.
e.g. identity lookups.
But sad to say, even improving your performance can break your customers, if they had undiscovered race conditions based on your previous typical latency.
Auth is often the hardest thing to get right when onboarding to an API.
And auth perhaps breaks more apps than anything else after they have shipped too.
Changes to token default or mandatory lifetimes can make app flows that previously worked well be unusable.
Apps may have gotten away without implementing OAuth refresh tokens but now need them.
Apps may have used an embedded browser redirect, and now you require a separate tab for OAuth.
Perhaps *you* didn’t even make this change – perhaps it came from your IDP – be vigilant!
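The refresh-token case above is the one consumers most often skip. A sketch of proactive refresh handling (the refresh function and skew value are illustrative assumptions, not any particular IDP’s API):

```python
import time

class TokenStore:
    """Refresh the access token shortly before it expires, so shrinking
    token lifetimes don't break app flows mid-session."""

    def __init__(self, refresh_fn, skew=60):
        self.refresh_fn = refresh_fn   # returns (access_token, expires_in)
        self.skew = skew               # refresh this many seconds early
        self.token, self.expires_at = None, 0.0

    def get(self, now=None):
        now = time.time() if now is None else now
        if now >= self.expires_at - self.skew:
            self.token, ttl = self.refresh_fn()
            self.expires_at = now + ttl
        return self.token
```

An app built like this survives a provider cutting token lifetimes from hours to minutes; one that caches a token forever does not.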
You have a role-based access AuthZ system and you introduce a new role that users need to be added to.
Flooding a consumer with 10x the number of webhooks they were previously handling isn’t likely to go well.
Many webhook handlers don’t implement decent throttling.
Many webhook handlers try to process the packet inline, which isn’t good practice.
So simply making the webhook packet more detailed can also degrade them.
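The fix on the consumer side is to acknowledge fast and process later. A sketch of a receiver that enqueues instead of processing inline; the bounded queue gives natural back-pressure if the producer 10x’s delivery volume (status codes and queue size are illustrative choices):

```python
from queue import Queue, Full

class WebhookReceiver:
    """Acknowledge webhooks quickly; hand the payload to a worker queue."""

    def __init__(self, maxsize=1000):
        self.queue = Queue(maxsize=maxsize)

    def handle(self, payload):
        # Do NOT process inline; just enqueue and return immediately.
        try:
            self.queue.put_nowait(payload)
            return 202  # accepted for asynchronous processing
        except Full:
            return 429  # queue full: ask the producer to back off
```

A separate worker drains the queue at its own pace, so a burst of notifications degrades into latency rather than dropped or timed-out deliveries.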
Lots of APIs redirect for secondary calls to a subdomain outside the initial subdomain of the API.
For example, redirecting to a CDN for image downloads.
Callers can have unfortunate proxy configurations set up to only route to known domains and changing here can break the redirect.
Don’t assume servers have the same freedom to follow all URLs that browser users have.
Note this one can be mentioned in the OAS document but is rarely acted upon today.
Lots of APIs have rich query parameters for describing paging, filtering, sorting, counting resources etc.
You can describe these in your OAS, but not how they interact.
It’s really easy to make sorting not work with filtering on some collection of resources and break a lot of customers.
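Because the OAS can’t express these interactions, the practical defense is to test the combinations explicitly. A sketch that generates every combination of query options to throw at an endpoint (the `$filter`/`$orderby` names are illustrative OData-style examples):

```python
from itertools import product

def param_combinations(options):
    """Yield every combination of the given query-parameter options,
    including 'absent' if None is listed among a parameter's values."""
    keys = sorted(options)
    for values in product(*(options[k] for k in keys)):
        yield {k: v for k, v in zip(keys, values) if v is not None}
```

Running each generated combination against a test endpoint surfaces the sort-plus-filter interactions that a per-parameter spec check never will.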
Provide a mock endpoint for your API for testing that has a really wide diversity of data delivered.
Don’t live with one fixed set. Mix it up ideally.
Push every limit and have slow calls, fast calls, big packets, small ones etc.
Vary anything that can be varied
and start the variance at different points on each session so callers don’t just repeat the same pattern.
If your consumers can cope with such a mock, they will probably cope with your real life data.
Anything which is optional or a preference can be disobeyed by the server under some circumstances.
There might not be enough data to fill a ten-record page.
So sometimes send back five two-record pages instead to make sure the client can handle it.
Especially if you have a pre-production mode.
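The varied-mock idea above can be sketched very simply: a mock page generator that deliberately randomizes page sizes (including empty and short pages) and field sizes, so clients can’t bake in assumptions. Field names here are illustrative.

```python
import random

def mock_page(rng, requested_size=10):
    """Return a page that may be empty, short, or full, with field
    values of wildly varying size."""
    size = rng.choice([0, 1, requested_size // 2, requested_size])
    return {
        "value": [
            {"id": i, "name": "x" * rng.choice([1, 10, 10_000])}
            for i in range(size)
        ],
    }
```

Seeding the generator differently per session gives you the "start the variance at different points" behavior, while a fixed seed keeps any one failing run reproducible.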
Unusual calling patterns?
More calls?
Fewer calls?
More 400s?
More 500s?
Average packet size changes in or out?
Consider extending your breaking changes policy to include some of the types of cases I’ve described.
This isn’t for everyone, but if relationship continuity is your top priority then you might want to set this higher bar.
Then you do whatever you would normally do with a breaking change.
Version the API or format/delay/rollback the change etc.
It has to actually WORK….
Here’s another example from APIMetrics.
Here’s a UK bank’s API for locating ATMs
After a deployment, it could only find ATMs in one city in the country.
Perfectly compliant with the spec, but mostly no actual data.
Don’t be afraid to take a heuristic measurement of content across your APIs.
If it changes A LOT – be very suspicious.
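One crude but useful heuristic is to compare typical payload size after a deployment against a baseline and flag large swings. A sketch, with a hypothetical 50% threshold:

```python
def content_drift(baseline_bytes, current_bytes, threshold=0.5):
    """Return True if the average payload size moved by more than
    `threshold` (relative) versus the baseline sample."""
    if not baseline_bytes or not current_bytes:
        return True  # no data at all is itself suspicious
    base = sum(baseline_bytes) / len(baseline_bytes)
    cur = sum(current_bytes) / len(current_bytes)
    return abs(cur - base) / base > threshold
```

Against the ATM example: responses that are spec-compliant but nearly empty would show a collapse in average size, and this check would fire even though every schema validation passes.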
So I hope I’ve offered you some food for thought on a wider set of things that can and will break the consumers of your APIs, and just dipped into some strategies for mitigating the problems.
I’d love to chat more about your experiences in this area at the reception this evening.
Thanks very much.