BEHAVIOR TEST vs. IMPLEMENTATION TEST

Both tests follow the same flow:
# Create book
# Edit book
# Get updated description
# Assert correctness

The behavior test does everything through the external API:

def test_editing_description_sets_correct_value():
    # Create book - API
    requests.post(...)
    # Edit book - API
    requests.post(...)
    # Get updated description - API
    new_desc = requests.get(...)
    # Assert correctness
    assert new_desc == ...

The implementation test creates the book and reads the result directly from the DB:

def test_editing_description_sets_correct_value():
    # Create book - DB
    DbBook.create_new(...)
    # Edit book - API
    requests.post(...)
    # Get updated description - DB
    new_desc = DbBook.query_one(...)
    # Assert correctness
    assert new_desc == ...

Behavior test: tests the WHAT - COHESIVE
Implementation test: tests the HOW - INCOHESIVE
TOOLS etc.!
>>> pytest-watch
Re-run tests on file changes
>>> pytest-testmon
Only run tests that can be impacted
by the code that changed
>>> unittest subTest
Organizing test output when you have large tests.
For pytest: pytest-subtests
>>> hypothesis 😍
Property-based testing - for getting
strong tests.
>>> coverage.py
Python test-coverage
>>> vcrpy
Record HTTP requests
Hi everyone,
Thank you for coming.
Today I’m going to talk about testing.
Testing is great,
But if we do it wrong, sometimes it’s not so great.
Quick intro:
My name is Shai Geva
I’ve been in the industry for a while now, mostly in hands-on engineering, but also in other roles like management and product.
I’m a principal developer at codium.ai.
So, my day job is creating tools that generate tests, which is pretty nice for someone who loves testing.
Lots of cool stuff happening in this field, if you want to talk about it, come catch me after the talk.
My purpose with this talk is to help you get a better ROI on testing work.
I’ll talk about concrete things I’ve seen that can hurt that - either make us spend more time on testing work or make that work less effective.
Naturally - not everything is going to be a good match for every team, and some things are going to need adjustments.
So take the basic ideas, and see what applies to you.
We’ll talk about different practices, different ways that we can work.
These practices will affect us by changing properties of our tests.
The main properties we’ll see:
Strength - how good the tests are at catching bugs.
Maintainability -
How easy it is for us to deal with the tests as things change.
And, dealing with changes is very important.
As developers, dealing with change is one of the main things that we do - changes to requirements, changes to scale, and even things like changes to the team as new developers join.
And Performance -
How long does it take for the tests to run.
This might sound like a lesser consideration because computers are fast, but as we’ll see, it matters.
So, 10 ways to shoot yourself in the foot with tests
Footgun 1: There are no tests
It’s better to have some tests than to not have tests at all.
Even if the tests are not well-written, and even if they seem like a drop in the sea.
They still catch bugs and they are still an improvement.
So if you don’t have tests yet - just start with something small, and slowly keep improving.
Footgun 2: If it doesn’t fail, it doesn’t pass.
Sometimes our tests lie to us.
We have a test that is supposed to protect us from something - but it still happens.
Obviously, this happens because the test didn’t actually check what we thought it did.
Maybe we copy-pasted and forgot to change something.
My suggestion here - when you write a test, always make it fail.
For every assertion, do a tiny change either to the code or the test, and make sure it fails the way you expect.
And only after you saw that it fails - consider it a passing test.
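As a sketch of this habit, here is a tiny, hypothetical example where the sanity check is itself written down (apply_discount is a made-up function under test):

```python
def apply_discount(price, percent):
    # Hypothetical product code under test.
    return round(price * (1 - percent / 100), 2)

def test_discount_is_applied():
    assert apply_discount(100.0, 15) == 85.0

def test_discount_check_can_fail():
    # The "make it fail" step: feed a deliberately wrong expectation
    # and confirm the assertion actually fires. Only then does the
    # passing test above mean anything.
    try:
        assert apply_discount(100.0, 15) == 999.0  # deliberately wrong
    except AssertionError:
        pass  # good - the assertion is alive
    else:
        raise RuntimeError("assertion never fired - the test is lying")
```

In practice you would usually do the failing run once, manually, and then revert the change; keeping it as a permanent test is optional.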
Footgun 3: testing more than a single fact.
Just like with product code, if we put too many things in the same place we get a mess.
My rule of thumb is to try hard to test a single fact about the behavior of the code.
And it helps if I use these specific words mentally.
SINGLE. FACT. About the BEHAVIOR.
Let’s say we have a book store and we’re testing the edit book functionality.
For example, that’s a single fact about the behavior of the code.
user_can_edit_their_own_book.
And, this is not a single fact
test_edit_book
It’s general
How do they compare?
Single fact test: It’s clear what the test checks. It’s clear that it only checks that.
But, with the general test: we’ll need to read and understand all the test code to know.
If the single-fact test fails, it’s clear what functionality stopped working.
And because it’s small, it’ll be easy to debug it.
If the general test fails, anything related to edit book might have failed. We’ll need to dig in. And it does a lot of things, so debugging might be a lot of work.
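To make the comparison concrete, here is a minimal, self-contained sketch of a single-fact test (BookStore and its methods are hypothetical stand-ins for the real system):

```python
class BookStore:
    # Hypothetical stand-in for the system under test.
    def __init__(self):
        self.books = {}

    def create(self, book_id, owner, title):
        self.books[book_id] = {"owner": owner, "title": title}

    def edit_title(self, book_id, user, title):
        if self.books[book_id]["owner"] != user:
            raise PermissionError("only the owner can edit")
        self.books[book_id]["title"] = title

# Single fact, stated in the name, checked by the body - and nothing else.
def test_user_can_edit_their_own_book():
    store = BookStore()
    store.create(1, owner="alice", title="Old")
    store.edit_title(1, user="alice", title="New")
    assert store.books[1]["title"] == "New"
```

A general test_edit_book would pile several such facts (permissions, validation, persistence) into one body, and a failure anywhere in it tells you much less.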
Footgun 4: unclear language.
The words we use make a big difference in how we think about the tests, and how easy it is to understand them.
The guidelines I use for myself:
First, like we said earlier - prefer to test a single fact about the behavior of code.
This is not the language itself but it really sets the tone.
We want to use decisive language
And we want the language to be specific and explicit.
A few examples of this:
Using the same example:
test_edit_book.
Like we saw, this is hard to understand, so it’s not a very good choice.
Adding things like “works” or “is correct” - most of the time, it’s just bloat.
Doesn’t really help.
test_user_should_be_able_to_edit_their_own_book
That’s better. Much more specific.
The only problem here is this indecisive language.
It’s kind-of confusing, right?
Why “should”?
Are we not sure about this?
Is this ever going to be NOT TRUE?
So also, not optimal.
And, again, this does sound like a fact.
“User can edit their own book”,
Decisive, specific, explicit.
And I recommend to go with language like this.
Footgun 5: The devil’s in the details.
Tests that highlight too much or too little detail are more difficult to maintain.
One problem is what I like to think of as non-locality.
Here we’re testing some parser, and the data is in a file -
so we read the data from the file before we move on with the test.
The problem is that no matter what the test checks,
it’s impossible to know if this is correct without going somewhere else and looking.
Maybe a different file, but even if it’s a constant at the top of this file.
Sometimes we can’t avoid this, but a lot of times we can.
Maybe try something like this.
It’s exactly the same test, but the data is local.
If you can find some data sample that’s small enough to fit locally, the tests become much more maintainable.
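A minimal sketch of the "local data" version, with a made-up CSV parser standing in for whatever is actually under test:

```python
import csv
import io

def parse_books(stream):
    # Hypothetical parser under test: CSV rows -> list of dicts.
    return [{"title": title, "year": int(year)}
            for title, year in csv.reader(stream)]

def test_parses_title_and_year():
    # The sample lives right here, so the expected values below can be
    # checked against it without opening another file.
    sample = io.StringIO("Dune,1965\nHyperion,1989\n")
    books = parse_books(sample)
    assert books[0] == {"title": "Dune", "year": 1965}
    assert books[1]["year"] == 1989
```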
The other side of this is too-much-detail.
There’s so much stuff.
The important things
It’s not easy to spot them
And just organizing a little makes a big difference
It’s not a lot of work,
And it makes it much easier to understand what’s going on.
Footgun 6: the tests are not isolated
If your tests are not isolated, it means you sometimes get different behavior if you run only some of them or run them in a different order.
And what I have to say about tests that are not isolated,
Is DON’T.
Just don’t do it.
If you have 30 tests, and test number 24 fails because of some things that happened during test 8 and test 15.
You’re not going to have a good time debugging.
This gets really bad, and I cannot stress enough how much I’m against this.
Footgun 7: improper test scope.
This is the root cause for many testing problems.
My approach here, is that we want to test a cohesive whole.
Some complete, make-sense story.
It’s very close to the notion of “testing implementation instead of behavior”,
Just a phrasing that’s a little more general.
Let’s say our Book Store is a web service, and it uses a DB.
We’ll think about two alternative test suites - “behavior tests” and “implementation tests”
And try to imagine what our life will look like if we would have chosen one test suite or the other.
We’ll look at an almost identical test in both test suites.
The test verifies that if we edit the description of a book, then it has really been updated.
Pretty simple.
Both tests have the same flow -
Create a book
Edit the book
Get the updated description
Make the assertion.
The behavior test does everything through the external http API, IN THE SAME WAY things would be done in the actual system.
The implementation test does some of the things at a lower level:
It creates the book by directly creating a record in the database, and it also checks the updated description through the DB.
So the behavior test only looks at the WHAT -
It looks at things as they appear from outside.
The implementation test also knows about HOW.
It knows how the code will change the DB.
Now, checking the implementation like this will USUALLY be equivalent to the behavior - but not always.
But why does this matter to us?
Let’s look at a possible scenario:
We’ve had this test suite for a while, maybe even years.
We’ve invested a lot in them, and we rely on them.
Now, we’re making a change to optimize the database.
We’re moving the description out of the Book table, and into a separate table.
However, we’re not deleting the old field yet - we’ll do that later after all the data has moved to the new table.
Now, we’re finished with everything else,
and, it’s time to update the edit-book endpoint.
Now, what if we just forgot to update the edit-book endpoint?
Completely forgot.
It now changes the wrong field in the database so behavior-wise, it doesn’t do anything.
If this gets to production, then we created a major bug.
If we chose behavior tests -
The test only uses the external API.
The behavior test…
Does not care about implementation details.
So if the behavior is wrong, the test will fail, just like it should.
The regression bug was prevented.
Everything’s ok.
If we chose the implementation test -
The test looks directly at the old description field in the DB.
When we run the test, the old description field will change, just like before, so the test will not fail.
The regression bug was not prevented.
And a major bug made it to production.
It’s not ok.
On the other side of this, what if we made the change correctly?
Edit book now changes the new table instead of the old field.
No bugs, everything’s fine.
If we chose the behavior test -
Everything behaves correctly, so the test will pass.
We don’t need to do anything.
If we chose the implementation test -
The old field is not updated any more, so
even though the code is correct, this test will fail.
The distinction here is that the failure reason is not that the code is not correct.
The test fails because it has become technically invalid.
So, we have extra work - we need to figure out whether the failure is real or technical.
And then we’ll need to update the test.
Also - because we changed the test, we now have less confidence in it. We need to learn to trust it again.
On large code bases, this can become a real pain.
You have to update the tests, even if the code change has no bugs, and sometimes even if the test has nothing to do with the feature you worked on.
You end up wasting hours and you hate the test suite.
Summing up
Cohesive, behavior tests - are closer to reality
They are better at protecting us.
They create less redundant work
And we have higher confidence in them in the long run.
One more thing worth mentioning: we looked at an example of a small, incremental change.
But sometimes, we need to make BIG changes. SCARY changes.
It happens less often but when it happens it’s a big deal.
For example, in a lot of companies, at some point, the DB doesn’t deal with the scale well.
We have stability issues, and we need to make a big change - maybe move the data to a different type of database.
And that’s when tests are MOST important.
And if we went with behavior level tests - everything will be fine.
Those same tests that we’ve been running with for 3 years now - we don’t change them.
When they pass, we’re done.
But if we went with Implementation level tests - they all become technically invalid and they all fail.
We will need to port all of them to use the new database,
but more importantly: because we’re changing them - we’re not going to trust them enough.
This might make the difference between a project that takes a few weeks, and a company-level event that drags out for months while the product has stability issues.
So I cannot recommend enough.
Test behavior.
A cohesive whole.
Footgun 8: test doubles everywhere
Sometimes, in a test, we switch a part of the system, a dependency, with an alternative implementation.
These are called test doubles. Things like stubs, mocks and fakes.
The main reason we use them is performance - if the real thing is too slow to run a lot of tests, we switch it with a fast test double.
Test doubles can be useful, but…
Test doubles are a re-implementation.
They know the implementation details of the thing they’re replacing.
Different types of test doubles do it differently, but this is what they do.
The main problem this causes is correctness.
The test double might not behave exactly like the real thing, and that makes the tests less accurate, less correct.
And as time goes by,
the real thing might be slowly changed,
but the test double would stay the same, so it would drift further and further from reality.
And, of course, this can hurt your foot.
This is actually a flavor of the implementation vs. behavior problem.
There are some differences, but essentially, it’s the same category of issues - tests that use test doubles are not as good at catching bugs, and sometimes they fail even though the code is correct, causing all that extra work.
So, test doubles - use, with caution.
The question is - how do we avoid the pitfalls?
And I’ll suggest a couple of ideas
First - code design.
So important.
Try to design so you can test a lot of functionality effectively, with fast unit tests, that don’t need test doubles.
Not ALWAYS possible, but a lot of times it is.
Another thing is to choose which test double you’re working with.
And I suggest to mostly use fakes.
A fake behaves like your dependency, but fast.
For example, a fake database table can be an in-memory list of tuples, where each tuple is a row.
In tests - it behaves the same way.
We can make a fake more reliable by writing some tests, not for the code - but for the fake itself.
For example, we can run the same operations against the fake and the real thing and verify we get the same results.
It’ll never be 100% the same - we make tradeoffs in how much we are willing to invest in testing the fake.
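A sketch of both ideas - the fake as an in-memory list of tuples, plus a small test for the fake itself (FakeBookTable and RealBookTable are hypothetical names):

```python
class FakeBookTable:
    # A fake: same behavior as the (hypothetical) real table, but it's
    # just an in-memory list of tuples - one tuple per row.
    def __init__(self):
        self.rows = []  # (book_id, description)

    def insert(self, book_id, description):
        self.rows.append((book_id, description))

    def update_description(self, book_id, description):
        self.rows = [(i, description if i == book_id else d)
                     for (i, d) in self.rows]

    def get_description(self, book_id):
        return next(d for (i, d) in self.rows if i == book_id)

def check_table_contract(table):
    # A test for the fake itself: run the same operations we'd run
    # against the real table and expect identical results.
    table.insert(1, "old")
    table.update_description(1, "new")
    assert table.get_description(1) == "new"

check_table_contract(FakeBookTable())
# In CI we could also run: check_table_contract(RealBookTable(conn))
# (RealBookTable is hypothetical - whatever wraps the actual DB.)
```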
Sometimes, a reliable fake already exists.
For example, if you’re using SQLite - Python’s built-in sqlite3 module can run it entirely in memory.
So google it, maybe you’ll get lucky.
An interesting thing you can do with fakes, is to run exactly the same test - once with a fake, and once with the real thing.
For example, maybe we have 10 tests, and that’s too much to run against the real thing.
So we run all 10 with the fake.
And then, we choose the 2 most important ones, and we run them ALSO with the real thing.
And this gives us some real world certainty.
The essence of the idea is to use test doubles, but selectively verify their correctness until we get an acceptable tradeoff.
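One way to sketch the "fake always, real thing selectively" pattern, without any framework machinery (InMemoryKV, RealKV and the RUN_REAL_STORE switch are all hypothetical):

```python
import os

class InMemoryKV:
    # Minimal fake of a hypothetical key-value store dependency.
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

def check_put_then_get(store):
    # The shared test body - identical no matter which backend runs it.
    store.put("book:1", "A new description")
    assert store.get("book:1") == "A new description"

# Every run uses the fake; the most important checks also run against
# the real store (hypothetical RealKV) when the environment allows it.
stores = [InMemoryKV()]
if os.environ.get("RUN_REAL_STORE"):
    pass  # stores.append(RealKV(connection))  # hypothetical real backend
for store in stores:
    check_put_then_get(store)
```

With pytest, the same idea is usually expressed as a parametrized fixture that yields each backend in turn.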
Another way to “use test doubles and verify” is by caching recordings.
We can record HTTP requests, DB actions, or anything else.
For example, at CodiumAI,
We have an HTTP service that calls another HTTP service - an AI layer that does code analysis and generation.
So, in our tests, we record and save the HTTP interactions between the main server and the AI server.
Locally, we run the tests against the recordings, so they are very fast.
But - we also verify.
In the cloud, we also run the tests against the real thing to make sure they are still valid.
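For HTTP, vcrpy does this for you; the core record/replay idea can be sketched in a few lines (Recorder and its cassette format are made up for illustration):

```python
import json
import os

class Recorder:
    """Record/replay sketch - the idea behind tools like vcrpy.
    The first run calls the real dependency and saves the response to a
    cassette file; later runs replay from the file, so local tests stay
    fast and deterministic."""
    def __init__(self, cassette_path, real_call):
        self.path = cassette_path
        self.real_call = real_call

    def call(self, key):
        cassette = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                cassette = json.load(f)
        if key not in cassette:
            cassette[key] = self.real_call(key)  # hits the real thing once
            with open(self.path, "w") as f:
                json.dump(cassette, f)
        return cassette[key]
```

The "also verify" part then amounts to periodically deleting the cassette (or running in a record-always mode) against the real service.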
Footgun 9: Slow tests
Yeah, slow tests are not fun.
I’ll talk about two ways in which they are not fun.
The first way is what I like to think of as the bottleneck and the time bomb.
The bottleneck here is where the tests take so long to run, that we have a long queue of tasks waiting to be merged to the main branch.
What kind of numbers are we talking about here?
Assume we have, say, 10 work-hours each day.
with tests that take 5 minutes.
So, that’s 12 merges per hour,
Which is 120 merges a day before the tests slow us down.
For most teams, that virtually never happens, so a 5-minute test suite -
not a bottleneck.
On the other extreme, and it usually won’t get to that but just so it’s easy to imagine -
if the test suite takes 2 HOURS,
Then we can only merge 5 tasks to main each day.
Whenever we want to wrap up a bunch of tasks quickly, maybe before a major version, the merge queue length becomes days.
If the tests sometimes fail, then this can happen on any random day.
It just doesn’t work.
The team will probably just stop waiting for the tests to pass before merging, and spend a lot of time with the tests being broken.
Now, we can SURVIVE this way.
But it’s a lot of extra work and it’s really not what we want.
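The back-of-the-envelope math is simple enough to write down (the 10 work-hour day is the assumption from above):

```python
def max_merges_per_day(suite_minutes, work_hours=10):
    # If merges are serial and each one waits for a full test run,
    # throughput is capped at one merge per suite run.
    return int(work_hours * 60 // suite_minutes)

print(max_merges_per_day(5))    # 5-minute suite: 120 merges/day
print(max_merges_per_day(120))  # 2-hour suite: 5 merges/day
```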
And really, even with less extreme numbers,
From what I see:
With a 30 minutes test suite
The same things happen.
They happen less, but they happen.
And, a few years ago I was actually part of a team where this happened.
When the tests took 20 minutes, I understood it’s a time bomb and eventually things were going to get bad.
But I didn’t have this clear phrasing of exactly how the slowness would be a problem. The bottleneck.
Another problem there was that the tests were also flaky and we always needed to fix them,
so it wasn’t clear to most people that slowness was the more urgent problem.
After a while, we were getting all these problems every few weeks.
Multi-day merge queues, everything was stuck.
Real crisis mode.
It only became ok after we did an expensive project and made the tests run in parallel.
Tests would still break sometimes, but the queue got back to zero fast enough so it was not a crisis.
The question is - what do we do about this?
We don’t want premature optimization, so what we need on day 1 is to make sure that WHEN we want to optimize, it’s not going to be a very expensive project.
And specifically, it should be possible to run the tests in parallel because that’s going to be the go-to solution.
The only thing we need for that, is to remember the footgun about isolated tests.
If the tests don’t affect each other, they can run in parallel.
My advice is to consider this as a must-have.
Another way that slow tests can hurt us is by making our feedback loop longer.
The feedback loop is how fast we learn about bugs and understand what happened.
And I’m talking about any type of bug here - anything from a typo to complex concurrency issues.
The feedback loop is very important,
And anything that makes it shorter is very good.
Even a squiggly red line in the IDE.
I usually aim for a setup where most of the time, I’m working in watch-mode, so the tests re-run every time a file changes, and I run a sub-set of the tests that finishes within 2 or 3 seconds.
Being on the fast side is great.
For example, if a test fails just a few seconds after I wrote the code - I instantly understand what’s going on.
With a 10-minute test suite in CI - the commit with a failing test contains a lot of code.
Plus my brain will do a context switch and go catch up on slack.
So when I try to understand what’s going on with the failing test - it’s a lot more work.
Now, some tests HAVE to be slow.
But, we can still have pretty fast feedback loop.
What helps me here is that instead of asking
“How long does it take for the tests to run”
I’m asking
“How long does it take to catch a bug”
And I’m visualizing this using the “bug funnel”.
All possible theoretical bugs come in, and some of them get filtered out on every stage.
And the key observation here is that what matters to the feedback loop is that we catch MOST bugs quickly.
We will have a good experience if the feedback loop is USUALLY fast.
Let’s say we start out with a bug funnel that looks like this.
We only have long-running integration tests, and we only run them in CI.
So if we create 10 bugs - we need to wait and debug the CI 10 times.
If we start adding fast unit tests, then pretty quickly the bug funnel will look more like this.
We don’t wait 10 times for the long-running CI anymore. Only for, say, 2 of the bugs.
For the rest of the bugs, so most of the time - we’ll have a much faster feedback loop.
And try to run at least some of the tests in watch-mode!
You will have a 2-second feedback loop, even if it’s not for everything.
And, as we discussed - you can also use test doubles, that’s why they exist.
By the way, local recordings have a very good tradeoff here - you should try it.
So, before the last footgun - I haven’t directly mentioned specific tools because it kind of broke the flow, so here is a bunch of stuff worth exploring.
No need to take a picture - I uploaded the slides, and there’s a link at the end.
Footgun 10: wrong priorities
We saw a bunch of different practices,
and how they will affect us by changing the properties of our tests.
The bug funnel is all about performance.
“Testing implementation instead of behavior” is about maintainability and strength.
But how do we prioritize?
Now, the objective of tests is their strength.
We have tests so that they catch bugs.
The unintuitive thing is that this is not what we should prioritize when we work.
Start with making them maintainable,
Then make sure they are fast enough
And then make them strong
Here’s the thing
Slow tests are weak, or at least they are EVENTUALLY weak.
Let’s say that, as a team, we decided that we are not willing to have tests that run for more than 30 minutes.
If, at some point the tests reach 30 minutes…
It becomes very difficult to add more tests.
So after enough time, there will be a lot of code that’s not tested well.
And the same thing happens with maintenance.
It’s more subtle, but if tests are not maintainable, it costs more to have them, and we end up creating fewer tests.
So again, they will be eventually weak.
And, maintainability issues can also make it difficult to handle performance.
An example we saw is test isolation and parallelization.
In other words,
Maintainability is a necessary condition for performance, and both are necessary conditions for strength.
So make maintainability the priority.
Testing a single fact, code design and all the others.
When you have a choice to make - I suggest to go with the most maintainable option almost always.
Even at the cost of other things.
Because in the long run, that’s how we get tests that let us move fast, and have confidence in our code.