60. Why did we find out so late?
● The test passed
● The system was not in production, so there were no errors.
● The other developer didn’t notice because we
committed at 18:30
66. ● Could’ve used more explicit tests.
● Should’ve separated internal and external
data types.
● Would’ve separated the Apps Manager data
type from the App Data data types.
Could’ve, should’ve, would’ve
69. ● Just because all your tests are green doesn’t mean
others will integrate with you successfully.
● Tools and development methods are there to help
you, not replace you.
● Be ready to handle problems, because mess-ups are
a part of life.
Takeaways
71. Thank You!
Any Questions?
amita@wix.com
Amit Anafy
Editor's Notes
Hi,
I’m Amit and I’m going to talk to you about what I feel is the blind spot of TDD.
The world outside.
But first I have a question,
Have any of you worked with TDD? Please raise your hands.
…
If you worked with TDD and still had problems in production, please keep your hands up.
…
So I guess everything isn’t perfect.
// In my talk I’m going to demonstrate problems I had working with TDD at Wix.
//
//Wix is a website publishing platform.
We at Wix use TDD to work with a microservices architecture. That means we don’t have one big monolith; we have a lot of independent services. We started by splitting the monolith into 10 microservices, and now we have over 250 microservices.
This distributed style is possible for us because of TDD. TDD also helps us work in Continuous Delivery, in which we have no versions, just continuous delivery of every little thing we do.
// We don’t have a quiet period followed by a stressful period.
In this talk, I’m going to give you tips on how to avoid integration pitfalls when working with TDD in a microservice environment.
Let’s begin
///TODO: improve the explanation of continuous delivery and microservices
//
///
Those of you who worked with TDD and didn’t experience production problems, probably a
Those of you who worked with TDD probably know it doesn’t exempt you from unexpected production problems.
In this talk, I'll demonstrate how, despite using TDD, a simple code change can cause an inter-system failure, and how to avoid it.
Let’s begin
TDD is test-driven development. The idea is that we first write a test and see it fail, then write the minimum code to make it pass, and then refactor.
Why do you use TDD?
TDD is a crucial factor in minimizing mistakes in a complex system
We don’t write code for future scenarios; we write for what we need right now, and by that we minimize the mistakes caused by unnecessary overhead.
TDD helps you know what a system does without needing to read outdated documents.
It eliminates the need for documents that describe the system.
When you start working at a company, many times they sit you in front of outdated documents to understand the system.
But when you finish reading the documents and start working on the system, you see it has nothing to do with what you read.
What I believe is the greatest advantage of TDD is the confidence it gives to a developer
When you start working on the system, you know that the tests you wrote cover what you implemented, because all you implemented was the minimum needed to make them pass. And you know the whole system works because all the other tests passed.
But this confidence may lead to overconfidence in your E2E system.
You are sure your system will work, but in production it fails.
Let’s dive into that and talk about which tests we have and where we fail.
So you have your service
//
We can simulate outside communication for testing our system end-to-end, but what is your system?
Is it the microservice you're working on? Whatever is in your repository? Where does my system end and another system start?
The problems with making those bigger tests are plenty: we don’t know how to simulate the other teams' code, it might be a system outside the organization, the tests are not reliable because they depend on stuff outside your control (and false negatives are bad), and they take a lot of time and a lot of talking.
As expected, we at Wix have our unit tests. They are the majority, and they test most of the complex scenarios.
We check the unit, the specific internal class or function, with surgical tweezers.
//at wix.
We also have our integration tests that check the connection with other systems.
/// More data about integration
We have our E2E tests that check the entire flow.
These are the classical test types, the ones my team uses.
///Service with capital S.
But there are more tests, like those tests, I guess those are the E2E..
No, those
And it goes on and on, so my team at Wix only really does the first E2E test.
//change my system to my services
Until we get all those tests.
But as I said, we don’t do all the question-marked tests.
There are many reasons why we don’t do them.
Like not knowing where to stop, or which tests to skip and which tests to linger on.
Not having enough data about the other system to be able to simulate the state you want, or not being able to access it.
E2E tests with too many uncontrollable variables are not reliable.
And they cause false negatives.
A false negative is when the test fails but in reality your system works just fine. It can be exhausting because you will lose trust in your tests and just run them again and again. And all your confidence will go out of the window.
//write the story
In E2E you want to get as close as you can to 100% of the system in production.
But it is really hard, and in some cases impossible.
I wanted to add everything, but it is not always possible; I’m not able to reach 100%.
Rule of thumb: just do unit tests, integration tests, and the vanilla E2E test.
Now let’s see a simple integration example.
Imagine you need to write an International Beer Delivery Service.
The IBDS
Imagine you need to write an International Beer Delivery system that gets your beer in pints from the UK and needs to deliver it to the Israeli brewery in liters.
But then an American brewery starts using your system. They saw that we take pints, and they work with pints, but a US pint (about 0.473 liters) is smaller than a UK pint (about 0.568 liters).
But the system will return the wrong amount, and still your test will pass.
This is a very dangerous false positive that might lead to errors in production.
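Here is a minimal Python sketch of the pint problem. The function names are made up for illustration and are not real IBDS code; the only facts used are the two pint sizes.

```python
UK_PINT_IN_LITERS = 0.56826125   # imperial pint
US_PINT_IN_LITERS = 0.473176473  # US liquid pint


def pints_to_liters(pints: float) -> float:
    """Convert pints to liters, silently assuming UK (imperial) pints."""
    return pints * UK_PINT_IN_LITERS


def test_converts_pints_to_liters() -> None:
    # Written with the UK brewery in mind; it stays green whoever the caller is.
    assert abs(pints_to_liters(100) - 56.826125) < 1e-9


def liters_actually_shipped(us_pints: float) -> float:
    """What an American brewery really sent when it says 'pints'."""
    return us_pints * US_PINT_IN_LITERS


test_converts_pints_to_liters()  # passes
overstated = pints_to_liters(100) - liters_actually_shipped(100)
print(f"The system reports {overstated:.2f} liters more than were shipped")
```

The test only knows the word "pints", so nothing in it can notice that the American brewery means a different unit.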
This doesn’t happen only in imaginary situations.
Let me tell you about my sad integration story.
Six months ago I worked on a project called Wix Touch.
Wix Touch is a service that allows you to easily create a mobile app for your Wix online store, in just one click.
At the time we were working on a new feature that would allow you to create an app that can be downloaded via the App Store.
I was working on the Apps Manager, which is responsible for managing the server side of the new app from the moment it is created by the Apps Builder, an external microservice.
Now we know the microservices involved, and that I was only working on the Apps Manager.
Brace yourself, we are stepping into a maze.
//
The Apps Manager needs to hold the Apple push notification certificate and key in the App Data from the time the app was built.
At the Apps Builder's request, the Apps Manager should give those fields back.
Are you confused by the names Apple Push Notification service certificate key and Apple Push Notification service certificate? Well, I am.
/////
Apps Builder sends the Apps Manager 2 fields: a certificate and a key.
Apps Manager saves them in App Data.
Upon request, Apps Manager can send the certificate and key from App Data to the App or to the Apps Builder,
in order to be able to send Apple push notifications.
We did an end-to-end test for our APNS certificate key.
We sent the key as if we were the Apps Builder and saved it.
Then we asked for it back and asserted on it.
//Add E2E circle
///
We should get the certificate in base64 format, and the key in regular UTF-8 format.
Our Apps Manager has a test that creates a new object,
inserts it into the database,
and then asserts that what we get from Apps Builder is the same value.
Remember the two very similar fields?
Well, the certificate needed to be in base64 and the key in UTF-8.
But because the names are so similar, there was confusion.
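To see why a save-and-read-back test cannot catch this kind of confusion, here is a small Python sketch. The in-memory class, the field names, and the PEM-style sample key are all assumptions made up for illustration; they are not the real Apps Manager code.

```python
import base64


class InMemoryAppsManager:
    """Hypothetical stand-in for the Apps Manager storage."""

    def __init__(self) -> None:
        self._store: dict = {}

    def save(self, app_id: str, certificate: str, key: str) -> None:
        self._store[app_id] = {"certificate": certificate, "key": key}

    def load(self, app_id: str) -> dict:
        return dict(self._store[app_id])


def round_trip_ok(manager: InMemoryAppsManager, app_id: str,
                  certificate: str, key: str) -> bool:
    """The E2E-style check: save, read back, compare."""
    manager.save(app_id, certificate, key)
    return manager.load(app_id) == {"certificate": certificate, "key": key}


def looks_like_plain_key(key: str) -> bool:
    """A more explicit check, assuming the plain key is PEM text."""
    return key.startswith("-----BEGIN")


plain_key = "-----BEGIN PRIVATE KEY-----\nabc\n-----END PRIVATE KEY-----"
base64_key = base64.b64encode(plain_key.encode("utf-8")).decode("ascii")

manager = InMemoryAppsManager()
print(round_trip_ok(manager, "app-1", "cert", plain_key))   # right format
print(round_trip_ok(manager, "app-2", "cert", base64_key))  # wrong format
print(looks_like_plain_key(base64_key))
```

The round trip succeeds for both keys, because it only compares what went in with what came out. Only a check that states the expected format can tell them apart.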
///
The APNS certificate key was sent in base64 instead of UTF-8.
OK, let’s check our test… it passed. Not a good omen, it was supposed to fail...
OK, I know this one: it is the pint issue, we got the wrong input.
So, if everything passed, how was it discovered? By a meticulous developer in Apps Builder, not by magic.
We needed the Apps Builder to start sending the data in UTF-8 format.
///
But what actually happened was that Apps Builder sent both fields in base64.
This caused my App test to fail. Data was saved as base64, and sent to the app upon request.
Let’s check our test. Is it green? YES!
I got the wrong input, smells like the pint issue….
How was it discovered? By a meticulous developer in Apps Builder.
Add boom after test passed
But there's a problem: the data is in base64 format.
We can’t just add UTF-8 data;
we won’t be able to distinguish between them.
And we can’t just move all the data to UTF-8, because we are not sure whether we are going to get the new UTF-8 or the old base64.
So we added the new APNS certificate key, to avoid the backward compatibility problem.
//To not break backward compatibility. We didn’t want the App Data to be half base64 and half UTF-8
Then we converted the old APNSCK in the App Data to the new APNSCK.
OK, backward compatibility: check.
Now we can start using the UTF-8 based key and remove stuff we don’t need, like the old base64 APNS certificate key.
We don’t have a backward compatibility problem.
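The two steps, adding the new field next to the old one and only later dropping the old one, could look roughly like this in Python. The field names and the dict-shaped record are hypothetical; the real App Data schema is not shown in this talk.

```python
import base64

OLD_KEY_FIELD = "apnsCertificateKey"     # hypothetical name, holds base64 text
NEW_KEY_FIELD = "newApnsCertificateKey"  # hypothetical name, holds UTF-8 text


def add_new_key(record: dict) -> dict:
    """Step 1: derive the UTF-8 field from the base64 one, keeping both."""
    migrated = dict(record)
    if OLD_KEY_FIELD in migrated and NEW_KEY_FIELD not in migrated:
        raw = base64.b64decode(migrated[OLD_KEY_FIELD])
        migrated[NEW_KEY_FIELD] = raw.decode("utf-8")
    return migrated


def drop_old_key(record: dict) -> dict:
    """Step 2: remove the base64 field. Safe only once nobody reads it."""
    return {name: value for name, value in record.items()
            if name != OLD_KEY_FIELD}


stored = {OLD_KEY_FIELD: base64.b64encode(b"secret").decode("ascii")}
both = add_new_key(stored)
only_new = drop_old_key(both)
```

Because each value lives in its own field, a record never mixes the two encodings under one name. The risky part is step 2, which is only safe after every reader has moved to the new field.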
We did that. I thought my test implementation of the Apps Builder simulated the Apps Builder 1:1. It used to… but I guess I changed it….
I didn’t think I had changed both the actual and the expected values.
Let’s check our test. Is it green? Yes! Go ahead and commit.
At 6:30 PM we push.
I went to sleep, woke up the next day, and went to a conference.
At 9:30 AM I get a call: what happened last night? Nothing works!
And if you can picture the situation, my phone battery was draining fast, like at every conference.
My laptop charger was sleeping at home, and I was trying to fix the trouble fast so other people could work.
///
Why didn’t we get it right away?
The tests didn’t catch it because they used the same local type, and it worked for them.
The system is not in production, so no errors.
Didn’t run the relevant HTTP calls, didn’t think of doing so.
Why didn’t the others who used it find it? We committed at 18:30….
The systems were expecting the APNS certificate key field and we gave them the new APNS certificate key.
And as you know, if you give the API the wrong input, it explodes.
The tests didn’t catch it because they used the same local type as Apps Manager, not the one Apps Builder uses.
So we started fixing. First we tried to add the missing field, and then realized the data is not there anymore.
So we added the old field back and converted the new value back to the old format.
Then the Apps Builder started reading the new field, and we were able to remove the old APNSCK without breaking anything.
//
Fix 1: add the missing field! No, the DB doesn't have it anymore :(
Fix 2: add the field on load, filling the nonexistent field using the existing field's data.
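Fix 2 can be sketched like this in Python. The field names are hypothetical, and I am assuming the old field held the base64 form of the same key, as described above.

```python
import base64


def with_legacy_key(record: dict) -> dict:
    """On load, rebuild the removed base64 field from the UTF-8 field."""
    loaded = dict(record)
    if "apnsCertificateKey" not in loaded and "newApnsCertificateKey" in loaded:
        utf8_key = loaded["newApnsCertificateKey"]
        loaded["apnsCertificateKey"] = base64.b64encode(
            utf8_key.encode("utf-8")
        ).decode("ascii")
    return loaded


from_db = {"newApnsCertificateKey": "secret"}
served = with_legacy_key(from_db)
```

Nothing is written back to the DB here. The old field exists only in what is served, so it can be dropped again once the callers read the new field.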
We found out so late not only because the tests passed, but also because the system was not in production, so no one other than the other system's dev team sent us messages.
And the dev team was on their way home because it was the end of the day.
What I learned from this ordeal
We could’ve used a backward compatibility test.
You can use Pact for HTTP tests. It is a tool that helps with contract testing; maybe add a test like this for any open API. (A good lecture about contract testing: https://www.youtube.com/watch?v=-6x6XBDf9sQ)
Contract tests: run a separate set of integration contract tests that check that all the calls against your test doubles return the same results as a call to the external service would.
In these tests we check an exact field by name, not the internal type. Here we check the APNS certificate key, not the certificate itself.
////It does this by running a separate set of integration contract tests that check that all the calls against your test doubles return the same results as a call to the external service would.
These tests need to be a part of the deploy pipeline,
even when it is an external service.
// We use the real Apps Builder - Or a special made test kit
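The idea of checking exact field names can be sketched in plain Python. This does not use Pact's API; the contract dict, the field names, and the helper are all hypothetical and only illustrate the kind of check a contract test performs.

```python
# Hypothetical contract: the field names and types the Apps Builder expects.
APPS_BUILDER_CONTRACT = {
    "apnsCertificate": str,
    "apnsCertificateKey": str,
}


def contract_violations(payload: dict, contract: dict) -> list:
    """List every way the payload breaks the contract (empty means OK)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


before_change = {"apnsCertificate": "cert", "apnsCertificateKey": "key"}
after_change = {"apnsCertificate": "cert", "newApnsCertificateKey": "key"}

print(contract_violations(before_change, APPS_BUILDER_CONTRACT))
print(contract_violations(after_change, APPS_BUILDER_CONTRACT))
```

Run against both the real responses and the test doubles in the deploy pipeline, a check like this would have flagged the renamed field before the other team did.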
We would have used one type for the internal data and another for the communication, to separate the internal design from the API.
In our test we would have stayed with the old field APNSCK, and in our implementation we would have seen the NewAPNSCK, and then the explosion would have happened in the E2E test.
We should’ve separated the data type saved in the App Data from the App Data type.
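A rough Python sketch of that separation, with made-up class and field names: the stored type is free to change, while one mapping function owns the names that go over the wire.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredAppData:
    """Internal type: free to rename fields as the design changes."""
    certificate: str
    new_apns_certificate_key: str


def to_api_payload(stored: StoredAppData) -> dict:
    """External type: the names other services rely on stay fixed here.

    Any format conversion between storage and API would also live here.
    """
    return {
        "apnsCertificate": stored.certificate,
        "apnsCertificateKey": stored.new_apns_certificate_key,
    }


payload = to_api_payload(StoredAppData("cert", "key"))
```

An E2E test that asserts on `payload["apnsCertificateKey"]` keeps using the external name, so an internal rename that leaks into the API makes the test fail instead of passing silently.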
We wouldn’t have had to change the data in the App Data back and forth, because there would be a different type.
//At the end there’ll be three
Just because your tests are green doesn’t mean integration will work.
Tools and methods are there to help you; don’t think they could replace you. Be thankful, that’s why your job won’t disappear in a few years.
And be ready to handle problems, because they happen.
And most importantly remember
///
On tools and deployment methods say “be thankful for that because I like having a job”
TDD is great
But you have to be aware of what you don’t test
///Think about changing it.
Integration is a *#@$, but if you decouple it’s a little bit easier