BEHAVIOR TEST vs. IMPLEMENTATION TEST

Both tests follow the same flow:
# Create book
# Edit book
# Get updated description
# Assert correctness

The behavior test does everything through the external API:

def test_editing_description_sets_correct_value():
    # Create book - API
    requests.post(...)
    # Edit book - API
    requests.post(...)
    # Get updated description - API
    new_desc = requests.get(...)
    # Assert correctness
    assert new_desc == ...

The implementation test creates the book and reads the result directly from the DB:

def test_editing_description_sets_correct_value():
    # Create book - DB
    DbBook.create_new(...)
    # Edit book - API
    requests.post(...)
    # Get updated description - DB
    new_desc = DbBook.query_one(...)
    # Assert correctness
    assert new_desc == ...

Behavior test: tests the WHAT - COHESIVE
Implementation test: tests the HOW - INCOHESIVE
TOOLS etc.!
>>> pytest-watch
Re-run tests on file changes
>>> pytest-testmon
Only run tests that can be impacted
by the code that changed
>>> unittest subTest
Organizing test output when you have large tests.
For pytest: pytest-subtests
>>> hypothesis 😍
Property-based testing - for getting
strong tests.
>>> coverage.py
Python test-coverage
>>> vcrpy
Record HTTP requests
Hi everyone,
Thank you for coming.
Today I’m going to talk about testing.
Testing is great,
But if we do it wrong, sometimes it’s not so great.
Quick intro:
My name is Shai Geva
I’ve been in the industry for a while now, mostly in hands-on engineering, but also in other roles like management and product.
I’m a principal developer at codium.ai.
So, my day job is creating tools that generate tests, which is pretty nice for someone who loves testing.
Lots of cool stuff happening in this field, if you want to talk about it, come catch me after the talk.
My purpose with this talk is to help you get a better ROI on testing work.
I’ll talk about concrete things I’ve seen that can hurt that - either make us spend more time on testing work or make that work less effective.
Naturally - not everything is going to be a good match for every team, and some things are going to need adjustments.
So take the basic ideas, and see what applies to you.
We’ll talk about different practices, different ways that we can work.
These practices will affect us by changing properties of our tests.
The main properties we’ll see:
Strength - how good the tests are at catching bugs.
Maintainability -
How easy it is for us to deal with the tests as things change.
And, dealing with changes is very important.
As developers, dealing with change is one of the main things that we do - changes to requirements, changes to scale, and even things like changes to the team as new developers join.
And Performance -
How long does it take for the tests to run.
This might sound like a lesser consideration because computers are fast, but as we’ll see, it matters.
So, 10 ways to shoot yourself in the foot with tests
Footgun 1: There are no tests
It’s better to have some tests than to not have tests at all.
Even if the tests are not well-written, and even if they seem like a drop in the sea.
They still catch bugs and they are still an improvement.
So if you don’t have tests yet - just start with something small, and slowly keep improving.
Footgun 2: If it doesn’t fail, it doesn’t pass.
Sometimes our tests lie to us.
We have a test that is supposed to protect us from something - but it still happens.
Obviously, this happens because the test didn’t actually check what we thought it did.
Maybe we copy-pasted and forgot to change something.
My suggestion here - when you write a test, always make it fail.
For every assertion, do a tiny change either to the code or the test, and make sure it fails the way you expect.
And only after you saw that it fails - consider it a passing test.
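As a sketch of this habit, here is a tiny, hypothetical example where the sanity check is itself written down (apply_discount is a made-up function under test):

```python
def apply_discount(price, percent):
    # Hypothetical product code under test.
    return round(price * (1 - percent / 100), 2)

def test_discount_is_applied():
    assert apply_discount(100.0, 15) == 85.0

def test_discount_check_can_fail():
    # The "make it fail" step: feed a deliberately wrong expectation
    # and confirm the assertion actually fires. Only then does the
    # passing test above mean anything.
    try:
        assert apply_discount(100.0, 15) == 999.0  # deliberately wrong
    except AssertionError:
        pass  # good - the assertion is alive
    else:
        raise RuntimeError("assertion never fired - the test is lying")
```

In practice you would usually do the failing run once, manually, and then revert the change; keeping it as a permanent test is optional.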
Footgun 3: testing more than a single fact.
Just like with product code, if we put too many things in the same place we get a mess.
My rule of thumb is to try hard to test a single fact about the behavior of the code.
And it helps if I use these specific words mentally.
SINGLE. FACT. About the BEHAVIOR.
Let’s say we have a book store and we’re testing the edit book functionality.
For example, that’s a single fact about the behavior of the code.
user_can_edit_their_own_book.
And, this is not a single fact
test_edit_book
It’s general
How do they compare?
Single fact test: It’s clear what the test checks. It’s clear that it only checks that.
But, with the general test: we’ll need to read and understand all the test code to know.
If the single-fact test fails, it’s clear what functionality stopped working.
And because it’s small, it’ll be easy to debug it.
If the general test fails, anything related to edit book might have failed. We’ll need to dig in. And it does a lot of things, so debugging might be a lot of work.
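To make the comparison concrete, here is a minimal, self-contained sketch of a single-fact test (BookStore and its methods are hypothetical stand-ins for the real system):

```python
class BookStore:
    # Hypothetical stand-in for the system under test.
    def __init__(self):
        self.books = {}

    def create(self, book_id, owner, title):
        self.books[book_id] = {"owner": owner, "title": title}

    def edit_title(self, book_id, user, title):
        if self.books[book_id]["owner"] != user:
            raise PermissionError("only the owner can edit")
        self.books[book_id]["title"] = title

# Single fact, stated in the name, checked by the body - and nothing else.
def test_user_can_edit_their_own_book():
    store = BookStore()
    store.create(1, owner="alice", title="Old")
    store.edit_title(1, user="alice", title="New")
    assert store.books[1]["title"] == "New"
```

A general test_edit_book would pile several such facts (permissions, validation, persistence) into one body, and a failure anywhere in it tells you much less.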
Footgun 4: unclear language.
The words we use make a big difference in how we think about the tests, and how easy it is to understand them.
The guidelines I use for myself:
First, like we said earlier - prefer to test a single fact about the behavior of code.
This is not the language itself but it really sets the tone.
We want to use decisive language
And we want the language to be specific and explicit.
A few examples of this:
Using the same example:
test_edit_book.
Like we saw, this is hard to understand, so it’s not a very good choice.
Adding things like “works” or “is correct” - most of the time, it’s just bloat.
Doesn’t really help.
test_user_should_be_able_to_edit_their_own_book
That’s better. Much more specific.
The only problem here is this indecisive language.
It’s kind-of confusing, right?
Why “should”?
Are we not sure about this?
Is this ever going to be NOT TRUE?
So also, not optimal.
And, again, this does sound like a fact.
“User can edit their own book”,
Decisive, specific, explicit.
And I recommend to go with language like this.
Footgun 5: The devil’s in the details.
Tests that highlight too much or too little detail are more difficult to maintain.
One problem is what I like to think of as non-locality.
Here we’re testing some parser, and the data is in a file -
so we read the data from the file before we move on with the test.
The problem is that no matter what the test checks,
it’s impossible to know if this is correct without going somewhere else and looking.
Maybe a different file, but even if it’s a constant at the top of this file.
Sometimes we can’t avoid this, but a lot of times we can.
Maybe try something like this.
It’s exactly the same test, but the data is local.
If you can find some data sample that’s small enough to fit locally, the tests become much more maintainable.
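A minimal sketch of the "local data" version, with a made-up CSV parser standing in for whatever is actually under test:

```python
import csv
import io

def parse_books(stream):
    # Hypothetical parser under test: CSV rows -> list of dicts.
    return [{"title": title, "year": int(year)}
            for title, year in csv.reader(stream)]

def test_parses_title_and_year():
    # The sample lives right here, so the expected values below can be
    # checked against it without opening another file.
    sample = io.StringIO("Dune,1965\nHyperion,1989\n")
    books = parse_books(sample)
    assert books[0] == {"title": "Dune", "year": 1965}
    assert books[1]["year"] == 1989
```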
The other side of this is too-much-detail.
There’s so much stuff.
The important things
It’s not easy to spot them
And just organizing a little makes a big difference
It’s not a lot of work,
And it makes it much easier to understand what’s going on.
Footgun 6: the tests are not isolated
If your tests are not isolated, it means you sometimes get different behavior if you run only some of them or run them in a different order.
And what I have to say about tests that are not isolated,
Is DON’T.
Just don’t do it.
If you have 30 tests, and test number 24 fails because of some things that happened during test 8 and test 15.
You’re not going to have a good time debugging.
This gets really bad, and I cannot stress enough how much I’m against this.
Footgun 7: improper test scope.
This is the root cause for many testing problems.
My approach here, is that we want to test a cohesive whole.
Some complete, make-sense story.
It’s very close to the notion of “testing implementation instead of behavior”,
Just a phrasing that’s a little more general.
Let’s say our Book Store is a web service, and it uses a DB.
We’ll think about two alternative test suites - “behavior tests” and “implementation tests”
And try to imagine what our life will look like if we would have chosen one test suite or the other.
We’ll look at an almost identical test in both test suites.
The test verifies that if we edit the description of a book, then it has really been updated.
Pretty simple.
Both tests have the same flow -
Create a book
Edit the book
Get the updated description
Make the assertion.
The behavior test does everything through the external http API, IN THE SAME WAY things would be done in the actual system.
The implementation test does some of the things at a lower level:
It creates the book by directly creating a record in the database, and it also checks the updated description through the DB.
So the behavior test only looks at the WHAT -
It looks at things as they appear from outside.
The implementation test also knows about HOW.
It knows how the code will change the DB.
Now, checking the implementation like this will USUALLY be equivalent to the behavior - but not always.
But why does this matter to us?
Let’s look at a possible scenario:
We’ve had this test suite for a while, maybe even years.
We’ve invested a lot in them, and we rely on them.
Now, we’re making a change to optimize the database.
We’re moving the description out of the Book table, and into a separate table.
However, we’re not deleting the old field yet - we’ll do that later after all the data has moved to the new table.
Now, we’re finished with everything else,
and, it’s time to update the edit-book endpoint.
Now, what if we just forgot to update the edit-book endpoint?
Completely forgot.
It now changes the wrong field in the database so behavior-wise, it doesn’t do anything.
If this gets to production, then we created a major bug.
If we chose behavior tests -
The test only uses the external API.
The behavior test…
Does not care about implementation details.
So if the behavior is wrong, the test will fail, just like it should.
The regression bug was prevented.
Everything’s ok.
If we chose the implementation test -
The test looks directly at the old description field in the DB.
When we run the test, the old description field will change, just like before, so the test will not fail.
The regression bug was not prevented.
And a major bug made it to production.
It’s not ok.
On the other side of this, what if we made the change correctly?
Edit book now changes the new table instead of the old field.
No bugs, everything’s fine.
If we chose the behavior test -
Everything behaves correctly, so the test will pass.
We don’t need to do anything.
If we chose the implementation test -
The old field is not updated any more, so
even though the code is correct, this test will fail.
The distinction here is that the failure reason is not that the code is not correct.
The test fails because it has become technically invalid.
So, we have extra work - we need to figure out whether the failure is real or technical.
And then we’ll need to update the test.
Also - because we changed the test, we now have less confidence in it. We need to learn to trust it again.
On large code bases, this can become a real pain.
You have to update the tests, even if the code change has no bugs, and sometimes even if the test has nothing to do with the feature you worked on.
You end up wasting hours and you hate the test suite.
Summing up
Cohesive, behavior tests - are closer to reality
They are better at protecting us.
They create less redundant work
And we have higher confidence in them in the long run.
One more thing worth mentioning: we looked at an example of a small, incremental change.
But sometimes, we need to make BIG changes. SCARY changes.
It happens less often but when it happens it’s a big deal.
For example, in a lot of companies, at some point, the DB doesn’t deal with the scale well.
We have stability issues, and we need to make a big change - maybe move the data to a different type of database.
And that’s when tests are MOST important.
And if we went with behavior level tests - everything will be fine.
Those same tests that we’ve been running with for 3 years now - we don’t change them.
When they pass, we’re done.
But if we went with Implementation level tests - they all become technically invalid and they all fail.
We will need to port all of them to use the new database,
but more importantly: because we’re changing them - we’re not going to trust them enough.
This might make the difference between a project that takes a few weeks, and a company-level event that drags out for months while the product has stability issues.
So I cannot recommend enough.
Test behavior.
A cohesive whole.
Footgun 8: test doubles everywhere
Sometimes, in a test, we switch a part of the system, a dependency, with an alternative implementation.
These are called test doubles. Things like stubs, mocks and fakes.
The main reason we use them is performance - if the real thing is too slow to run a lot of tests, we switch it with a fast test double.
Test doubles can be useful, but…
Test doubles are a re-implementation.
They know the implementation details of the thing they’re replacing.
Different types of test doubles do it differently, but this is what they do.
The main problem this causes is correctness.
The test double might not behave exactly like the real thing, and that makes the tests less accurate, less correct.
And as time goes by,
the real thing might be slowly changed,
but the test double would stay the same, so it would drift further and further from reality.
And, of course, this can hurt your foot.
This is actually a flavor of the implementation vs. behavior problem.
There are some differences, but essentially, it’s the same category of issues - tests that use test doubles are not as good at catching bugs, and sometimes they fail even though the code is correct, causing all that extra work.
So, test doubles - use, with caution.
The question is - how do we avoid the pitfalls?
And I’ll suggest a couple of ideas
First - code design.
So important.
Try to design so you can test a lot of functionality effectively, with fast unit tests, that don’t need test doubles.
Not ALWAYS possible, but a lot of times it is.
Another thing is to choose which test double you’re working with.
And I suggest to mostly use fakes.
A fake behaves like your dependency, but fast.
For example, a fake database table can be an in-memory list of tuples, where each tuple is a row.
In tests - it behaves the same way.
We can make a fake more reliable by writing some tests, not for the code - but for the fake itself.
For example, we can run the same operations against the fake and the real thing and verify we get the same results.
It’ll never be 100% the same - we make tradeoffs in how much we are willing to invest in testing the fake.
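A sketch of both ideas - the fake as an in-memory list of tuples, plus a small test for the fake itself (FakeBookTable and RealBookTable are hypothetical names):

```python
class FakeBookTable:
    # A fake: same behavior as the (hypothetical) real table, but it's
    # just an in-memory list of tuples - one tuple per row.
    def __init__(self):
        self.rows = []  # (book_id, description)

    def insert(self, book_id, description):
        self.rows.append((book_id, description))

    def update_description(self, book_id, description):
        self.rows = [(i, description if i == book_id else d)
                     for (i, d) in self.rows]

    def get_description(self, book_id):
        return next(d for (i, d) in self.rows if i == book_id)

def check_table_contract(table):
    # A test for the fake itself: run the same operations we'd run
    # against the real table and expect identical results.
    table.insert(1, "old")
    table.update_description(1, "new")
    assert table.get_description(1) == "new"

check_table_contract(FakeBookTable())
# In CI we could also run: check_table_contract(RealBookTable(conn))
# (RealBookTable is hypothetical - whatever wraps the actual DB.)
```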
Sometimes, a reliable fake already exists.
For example, if you’re using SQLite - Python’s built-in sqlite3 module can run it entirely in memory.
So google it, maybe you’ll get lucky.
An interesting thing you can do with fakes, is to run exactly the same test - once with a fake, and once with the real thing.
For example, maybe we have 10 tests, and that’s too much to run against the real thing.
So we run all 10 with the fake.
And then, we choose the 2 most important ones, and we run them ALSO with the real thing.
And this gives us some real world certainty.
The essence of the idea is to use test doubles, but selectively verify their correctness until we get an acceptable tradeoff.
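One way to sketch the "fake always, real thing selectively" pattern, without any framework machinery (InMemoryKV, RealKV and the RUN_REAL_STORE switch are all hypothetical):

```python
import os

class InMemoryKV:
    # Minimal fake of a hypothetical key-value store dependency.
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

def check_put_then_get(store):
    # The shared test body - identical no matter which backend runs it.
    store.put("book:1", "A new description")
    assert store.get("book:1") == "A new description"

# Every run uses the fake; the most important checks also run against
# the real store (hypothetical RealKV) when the environment allows it.
stores = [InMemoryKV()]
if os.environ.get("RUN_REAL_STORE"):
    pass  # stores.append(RealKV(connection))  # hypothetical real backend
for store in stores:
    check_put_then_get(store)
```

With pytest, the same idea is usually expressed as a parametrized fixture that yields each backend in turn.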
Another way to “use test doubles and verify” is by caching recordings.
We can record HTTP requests, DB actions, or anything else.
For example, at CodiumAI,
We have an HTTP service that calls another HTTP service - an AI layer that does code analysis and generation.
So, in our tests, we record and save the HTTP interactions between the main server and the AI server.
Locally, we run the tests against the recordings, so they are very fast.
But - we also verify.
In the cloud, we also run the tests against the real thing to make sure they are still valid.
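For HTTP, vcrpy does this for you; the core record/replay idea can be sketched in a few lines (Recorder and its cassette format are made up for illustration):

```python
import json
import os

class Recorder:
    """Record/replay sketch - the idea behind tools like vcrpy.
    The first run calls the real dependency and saves the response to a
    cassette file; later runs replay from the file, so local tests stay
    fast and deterministic."""
    def __init__(self, cassette_path, real_call):
        self.path = cassette_path
        self.real_call = real_call

    def call(self, key):
        cassette = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                cassette = json.load(f)
        if key not in cassette:
            cassette[key] = self.real_call(key)  # hits the real thing once
            with open(self.path, "w") as f:
                json.dump(cassette, f)
        return cassette[key]
```

The "also verify" part then amounts to periodically deleting the cassette (or running in a record-always mode) against the real service.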
Footgun 9: Slow tests
Yeah, slow tests are not fun.
I’ll talk about two ways in which they are not fun.
The first way is what I like to think of as the bottleneck and the time bomb.
The bottleneck here is where the tests take so long to run, that we have a long queue of tasks waiting to be merged to the main branch.
What kind of numbers are we talking about here?
Assume we have, say, 10 work-hours each day.
with tests that take 5 minutes.
So, that’s 12 merges per hour,
Which is 120 merges a day before the tests slow us down.
For most teams, that virtually never happens, so a 5-minute test suite -
not a bottleneck.
On the other extreme, and it usually won’t get to that but just so it’s easy to imagine -
if the test suite takes 2 HOURS,
Then we can only merge 5 tasks to main each day.
Whenever we want to wrap up a bunch of tasks quickly, maybe before a major version, the merge queue length becomes days.
If the tests sometimes fail, then this can happen on any random day.
It just doesn’t work.
The team will probably just stop waiting for the tests to pass before merging, and spend a lot of time with the tests being broken.
Now, we can SURVIVE this way.
But it’s a lot of extra work and it’s really not what we want.
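The back-of-the-envelope math is simple enough to write down (the 10 work-hour day is the assumption from above):

```python
def max_merges_per_day(suite_minutes, work_hours=10):
    # If merges are serial and each one waits for a full test run,
    # throughput is capped at one merge per suite run.
    return int(work_hours * 60 // suite_minutes)

print(max_merges_per_day(5))    # 5-minute suite: 120 merges/day
print(max_merges_per_day(120))  # 2-hour suite: 5 merges/day
```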
And really, even with less extreme numbers,
From what I see:
With a 30 minutes test suite
The same things happen.
They happen less, but they happen.
And, a few years ago I was actually part of a team where this happened.
When the tests took 20 minutes, I understood it’s a time bomb and eventually things were going to get bad.
But I didn’t have this clear phrasing of exactly how the slowness would be a problem. The bottleneck.
Another problem there was that the tests were also flaky and we always needed to fix them,
so it wasn’t clear to most people that slowness was the more urgent problem.
After a while, we were getting all these problems every few weeks.
Multi-day merge queues, everything was stuck.
Real crisis mode.
It only became ok after we did an expensive project and made the tests run in parallel.
Tests would still break sometimes, but the queue got back to zero fast enough so it was not a crisis.
The question is - what do we do about this?
We don’t want premature optimization, so what we need on day 1 is to make sure that WHEN we want to optimize, it’s not going to be a very expensive project.
And specifically, it should be possible to run the tests in parallel because that’s going to be the go-to solution.
The only thing we need for that, is to remember the footgun about isolated tests.
If the tests don’t affect each other, they can run in parallel.
My advice is to consider this as a must-have.
Another way that slow tests can hurt us is by making our feedback loop longer.
The feedback loop is how fast we learn about bugs and understand what happened.
And I’m talking about any type of bug here - anything from a typo to complex concurrency issues.
The feedback loop is very important,
And anything that makes it shorter is very good.
Even a squiggly red line in the IDE.
I usually aim for a setup where most of the time, I’m working in watch-mode, so the tests re-run every time a file changes, and I run a sub-set of the tests that finishes within 2 or 3 seconds.
Being on the fast side is great.
For example, if a test fails just a few seconds after I wrote the code - I instantly understand what’s going on.
With a 10-minute test suite in CI - the commit with a failing test contains a lot of code.
Plus my brain will do a context switch and go catch up on slack.
So when I try to understand what’s going on with the failing test - it’s a lot more work.
Now, some tests HAVE to be slow.
But, we can still have pretty fast feedback loop.
What helps me here is that instead of asking
“How long does it take for the tests to run”
I’m asking
“How long does it take to catch a bug”
And I’m visualizing this using the “bug funnel”.
All possible theoretical bugs come in, and some of them get filtered out on every stage.
And the key observation here is that what matters to the feedback loop is that we catch MOST bugs quickly.
We will have a good experience if the feedback loop is USUALLY fast.
Let’s say we start out with a bug funnel that looks like this.
We only have long-running integration tests, and we only run them in CI.
So if we create 10 bugs - we need to wait and debug the CI 10 times.
If we start adding fast unit tests, then pretty quickly the bug funnel will look more like this.
We don’t wait 10 times for the long-running CI anymore. Only for, say, 2 of the bugs.
For the rest of the bugs, so most of the time - we’ll have a much faster feedback loop.
And try to run at least some of the tests in watch-mode!
You will have a 2-second feedback loop, even if it’s not for everything.
And, as we discussed - you can also use test doubles, that’s why they exist.
By the way, local recordings have a very good tradeoff here - you should try it.
So, before the last footgun - I haven’t directly mentioned specific tools because it kind of broke the flow, so here is a bunch of stuff worth exploring.
No need to take a picture - I uploaded the slides, and there’s a link at the end.
Footgun 10: wrong priorities
We saw a bunch of different practices,
and how they will affect us by changing the properties of our tests.
The bug funnel is all about performance.
“Testing implementation instead of behavior” is about maintainability and strength.
But how do we prioritize?
Now, the objective of tests is their strength.
We have tests so that they catch bugs.
The unintuitive thing is that this is not what we should prioritize when we work.
Start with making them maintainable,
Then make sure they are fast enough
And then make them strong
Here’s the thing
Slow tests are weak, or at least they are EVENTUALLY weak.
Let’s say that, as a team, we decided that we are not willing to have tests that run for more than 30 minutes.
If, at some point the tests reach 30 minutes…
It becomes very difficult to add more tests.
So after enough time, there will be a lot of code that’s not tested well.
And the same thing happens with maintenance.
It’s more subtle, but if tests are not maintainable, it costs more to have them, and we end up creating fewer tests.
So again, they will be eventually weak.
And, maintainability issues can also make it difficult to handle performance.
An example we saw is test isolation and parallelization.
In other words,
Maintainability is a necessary condition for performance, and both are necessary conditions for strength.
So make maintainability the priority.
Testing a single fact, code design and all the others.
When you have a choice to make - I suggest to go with the most maintainable option almost always.
Even at the cost of other things.
Because in the long run, that’s how we get tests that let us move fast, and have confidence in our code.