My talk at Swiss Testing Day 2019: Use BDD, Cucumber and IBM Watson Assistant to build, specify and test conversational applications (like Chatbots). Case study with the UBS Innovation Lab, IBM and FaceMe on Virtual Avatars.
CONVERSATIONAL APPLICATIONS
• Natural language conversations between human
and machine
• Virtual Assistants, Customer Care, Information
Systems, First-Level Support ...
• Self-improvement through ongoing (re)-training and
feedback loops.
MANUAL TESTING
When building a conversational application, typical first steps
include repeated, manual testing of conversations as they are
modelled.
Problems
Manual conversation testing lacks
• Repeatability
• Consistency
• Automation
• Speed
• Visibility into internals (e.g. confidence levels,
context variables ...)
UNIT TESTS
Teams have implemented unit tests against API calls,
comparing actual vs. expected responses.
Problems
• Written and maintained by the technical team
• Technical tests are source code, i.e. not suitable for
communication with business stakeholders and
domain experts
BEHAVIOR-DRIVEN DEVELOPMENT (BDD)
“Behavior-driven development combines the general techniques
and principles of TDD with ideas from domain-driven design and
object-oriented analysis and design to provide software
development and management teams with shared tools and a
shared process to collaborate on software development.”
https://en.wikipedia.org/wiki/Behavior-driven_development
UBIQUITOUS LANGUAGE
“A ubiquitous language is a (semi-)formal language
that is shared by all members of a software
development team — both software developers and
non-technical personnel. The language in question is
both used and developed by all team members as a
common means of discussing the domain of the
software in question.”
https://en.wikipedia.org/wiki/Behavior-driven_development
CUCUMBER AND GHERKIN
• Cucumber: test automation framework.
We use the JavaScript / Node implementation:
https://github.com/cucumber/cucumber-js
• Gherkin: a domain-specific language (DSL) for
describing feature specifications in a semi-structured,
more "natural" language accessible to non-technical
stakeholders
OUR APPROACH
Describe conversation scenarios in (almost) natural
language
• "Happy path" conversation regression testing
• Corner cases / digression / "navigation"
• Assert minimum #intent confidence
• Assert on context variables (white box)
• Goal:
create, discuss and test conversation specifications
collaboratively with technical and non-technical domain
experts.
SCENARIO SPECIFICATION
• Feature: Name of feature under test
• Background: setup stage before each scenario/test run
• Scenario: A single "test case"
• Steps: the sequence of executable steps in the
test case
• The step phrases understood are specific to our tool
(see next slides)
Feature: Customer Service – Sample
Background:
Given the conversation workspace is "Customer Service Assistant"
And I start a new conversation
# Strict text matching
Scenario: Get directions to store
When I ask "give me directions"
Then Watson will respond "We're located by Union Square on the
corner of 13th and Broadway"
TEXT MATCHING
• Assert "strict" response text:
• Watson will respond "<STRING>"
• Match response text against a regular
expression:
• Watson will say something like
"<REG_EXP>"
• partial matches, ignores case
• Multiple output texts get concatenated
• Note: "I ask ..." steps trigger an API
request to the Watson Assistant service
# Regexp text matching
Scenario: Get directions to store from Landmark
When I ask "How do I find you coming from Times Square?"
Then Watson will respond something like
"from …* take the …* We're located by …"
INTENT DETECTION
• Assert that a specific #intent is
detected:
• Watson will detect that my intent is
"<intent name>"
• Assert a minimum confidence score:
• [and] have a confidence of at least
<percentage>
# Recognize intent
Scenario: Get connected to a human agent
When I ask "agent, please"
Then Watson will detect that my intent is "Connect_to_Agent"
And have a confidence of at least 95%
TEST FOR CONTEXT VARIABLES
• Assert $context_variable:
• "context_variable" will have a value of
"<expected_value>"
• "context_variable" will be "<expected_value>"
• "context_variable" will contain
"<expected_substring>"
• "context_variable" will match
"<expected_pattern>"
• Supports simple data types (numbers, strings)
and JSON
• JSON objects ({...}): deep-equal test or partial
matching (via JSON.stringify)
# Check context variables
Scenario: Make an appointment
When I ask "make an appointment"
Then Watson will respond like "What day ?"
And when I say "next monday"
Then "date" will have a value of "2019-03-25"
SCENARIO OUTLINES
• Test scenarios using <placeholders>
• Examples: a table containing the placeholder
values
• One row = one test case (see the sketch below)
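A minimal sketch of such an outline, built from the step phrases on the
previous slides; the utterances, the intent name "Make_Appointment" and
the confidence value are illustrative, not taken from the actual test suite:
# Scenario outline: one row = one test case (values are illustrative)
Scenario Outline: Recognize appointment requests
When I ask "<utterance>"
Then Watson will detect that my intent is "Make_Appointment"
And have a confidence of at least 80%
Examples:
| utterance |
| make an appointment |
| I need to see an advisor |
| can I book a meeting? |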
“GUARD RAILS”
• Detect deterioration and regressions, e.g.
caused by continuous (re-)training, workspace
migrations etc.
• Assert that "happy path" conversation flows
consistently work as expected
• Test correct dialog flow, e.g. digressions and
drill-down/out conversation paths (see the
sketch below)
• Test corner cases and fuzzy / problematic
intents
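As an illustration, a digression check can be phrased with the same step
vocabulary; the dialog content, the intent name "Opening_Hours" and the
response pattern below are hypothetical:
# Digression: leave and return to the appointment flow (hypothetical dialog)
Scenario: Digress to opening hours while booking an appointment
When I ask "make an appointment"
Then Watson will respond like "What day ?"
And when I say "wait, what are your opening hours?"
Then Watson will detect that my intent is "Opening_Hours"
And Watson will respond something like "open .*"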
WHITE-BOX TEST
• Test "internals"
• Assert the value of context variables after
specific dialog steps (see the sketch below)
• Useful for checking correct filling of slots
• Confidence baselines for a set of related
inputs / intent examples
• Integration with other systems
• Not really "BDD": tests depend on implementation
details
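A sketch of such a slot-filling check, reusing the context-variable steps
shown earlier; the utterance, slot names and expected values are made up
for illustration:
# White box: assert slot filling via context variables (illustrative)
Scenario: Fill appointment slots in one utterance
When I ask "make an appointment next monday at 10 am"
Then "date" will have a value of "2019-03-25"
And "time" will have a value of "10:00"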
WEB APPLICATION: DEPLOYMENT & SETUP
• Docker image
• Manifests for IBM Cloud Kubernetes
service
• Persistent volume storage for
configuration and feature
specification (= test case) data
RUN TEST
• Editor with syntax highlighting
• Run test
• Save & save as...
• stdout/stderr streamed to the client
using Server-Sent Events (SSE)
Summary
... and Lessons Learned
• Test-first: describe the "happy path"
• Test-later: automate test cases to avoid regressions
• Developer "white box" tests: confidence levels and "intents" behind utterances
• You will need domain experts, not just developers plus requirements / chat logs
• Test cases will evolve and need some "relaxing"
• Separate test vs. deployment instances, e.g. using API-call tagging
• Digressions, jumps and navigation (drill down / out) tend to be brittle
• Iteratively learning machines: automated testing and analytics are a must