How to train a transactional chatbot using reinforcement
learning?
leewayhertz.com/train-transactional-chatbot-using-reinforcement-learning
In an age where artificial intelligence is reshaping our world, chatbots have emerged as a
valuable tool for businesses. With a staggering 80% of businesses projected to integrate
chatbots in their operations by 2024, the focus is now shifting towards transactional chatbots,
also known as Goal-oriented (GO) chatbots. Unlike typical chatbots, transactional chatbots
are laser-focused on solving specific user problems. Need to book a ticket? There is a
chatbot for that. Looking to make a reservation? A transactional chatbot is on it. These
transactional chatbots are not just sophisticated, they are becoming smarter and more
efficient by the day.
But how are these transactional chatbots trained to be so proficient? The answer lies in two
major learning techniques: supervised learning and reinforcement learning. Supervised
learning uses an encoder-decoder approach to map user dialogue to responses directly. In
contrast, reinforcement learning takes a more hands-on approach, training chatbots using
trial-and-error conversations with a rule-based user simulator or real users.
Among these, transactional chatbots using reinforcement learning have recently surfaced as
an exciting field teeming with potential applications. One stellar example of this rapidly
growing field is the TC-Bot developed by MiuLab. The TC-Bot showcases how a user can be
simulated using basic rules, significantly expediting the training process compared to using
real people.
With more advanced chatbot training methods being developed, it’s safe to say we are on
the cusp of a new era where transactional chatbots will become ubiquitous, changing the
way we interact with technology. In this article, we will dive deep into the world of
transactional chatbots, explore the process of their training, their use cases and other vital
aspects.
What is a transactional chatbot?
Transactional chatbots vs. traditional chatbots
Key components of transactional chatbots
Benefits of transactional chatbots
How does a transactional chatbot operate?
Understanding the dialogue system
The role of the user simulator and error model controller
An overview of Deep-Q-Network
Training a transactional chatbot using Deep-Q-Network
The scenario
Prerequisites
Understanding the data (movie tickets) for the chatbot
Understanding the anatomy of an action
Preparing the state
Dialogue configuration for the agent
Building a neural network model
Implementing policy
Training an agent
Use cases of transactional chatbots
What is a transactional chatbot?
A transactional chatbot, also known as a task-oriented or goal-oriented chatbot, is a
specialized form of artificial intelligence software designed with a clear purpose – to help
users achieve a specific goal or complete a specific task. This could range from booking a
flight or scheduling a doctor’s appointment to placing an order for a pizza.
Unlike their counterparts (general conversation or social chatbots), which focus on simulating
human-like interaction and carrying out broad, non-specific conversations, transactional
chatbots have a clear focus. Their role is not to engage in small talk or provide entertainment
but to aid users in accomplishing a particular task as quickly and efficiently as possible.
Transactional chatbots operate by recognizing and understanding the user’s intent and then
taking appropriate actions to fulfill the user’s request. To do this, they employ sophisticated
Natural Language Understanding (NLU) capabilities and machine learning algorithms to
interpret the user’s inputs, map them to the correct action, and generate a suitable response.
The importance of these goal-oriented chatbots in today’s digital ecosystem cannot be
overstated. In a world that is increasingly driven by speed, efficiency, and convenience,
transactional chatbots serve as a pivotal touchpoint between businesses and customers.
They provide instant, 24/7 support, helping to improve customer service and engagement,
streamline business processes, and reduce operational costs. Moreover, they provide a
personalized user experience, understand and remember customer preferences, and deliver
tailor-made solutions, enhancing customer satisfaction and loyalty.
Furthermore, in times of social distancing and remote operations, transactional chatbots
have become invaluable tools for businesses to maintain constant, uninterrupted customer
support. By handling routine tasks and queries, they allow human staff to focus on more
complex and critical issues, thus enhancing the overall efficiency of the business.
In sum, transactional chatbots are more than just fancy technology; they are powerful tools
that are reshaping the way businesses operate and interact with their customers, making
them indispensable in the modern digital landscape.
Transactional chatbots vs. traditional chatbots
| Comparison criteria | Transactional chatbot | Traditional chatbot |
| --- | --- | --- |
| Purpose | Primarily designed to handle transactions and support complex tasks. They can assist in making reservations, completing purchases, and providing personalized recommendations. | Typically designed for simple tasks such as answering basic FAQs or guiding users to the appropriate resources. |
| Complexity of interaction | Capable of understanding and responding to more complex customer queries. These chatbots can process multiple layers of communication and follow the flow of conversation. | Generally capable of managing simple, linear conversations and might struggle with complex interactions. |
| Use of AI | Uses advanced AI and machine learning to provide personalized responses, understand user intent, and remember previous interactions. | Primarily uses rule-based responses and may or may not leverage AI. Its capabilities are often limited to predefined responses. |
| Data analysis | Continually learns from user interactions, enabling it to make more accurate predictions and provide personalized services. | Data analysis is typically minimal or non-existent, with less emphasis on learning from user interactions. |
| User experience | Enhances user experience by offering personalized responses and handling complex requests. | Provides a satisfactory user experience for straightforward inquiries but may not handle complex requests as effectively. |
| Integration with other systems | Often integrated with other systems (CRM, ERP) to access customer data, process transactions, etc. | Usually standalone, with minimal integration with other systems. |
| Cost and implementation time | Might require a higher initial investment and longer implementation time due to their complex nature. | Generally cheaper and quicker to implement as they’re less complex. |
| Scalability | High scalability due to its ability to learn and adapt from interactions. Can handle an increasing number of complex queries effectively. | Limited scalability. As queries become more complex, these chatbots might struggle to maintain efficiency. |
Key components of transactional chatbots
Transactional chatbots, also known as goal-oriented or task-oriented chatbots, have
several key components that enable them to interact with users effectively and accomplish
specific tasks. Here are some of the main elements:
Natural Language Understanding (NLU) unit: This is the component of the chatbot
that interprets and understands the user’s input. It transforms human language into a
machine-readable format. NLU employs tokenization, stemming, part-of-speech
tagging, and entity extraction to understand the user’s message’s context, intent, and
entities.
Dialogue Manager (DM): The DM is the central control unit of the chatbot. It maintains
the context and state of the conversation, decides the next action based on the current
state and user’s input, and generates the appropriate system response.
State Tracker (ST): Sometimes considered a part of the Dialogue Manager, the state
tracker keeps track of the current state of the conversation, including the user’s goals,
requests, and the information that the chatbot has provided.
Policy learner: This component uses reinforcement learning algorithms to determine
the best responses based on the state of the conversation. It “learns” from its past
actions and their outcomes to optimize the chatbot’s responses.
Natural Language Generator (NLG) unit: The NLG takes the system response
generated by the dialogue manager and translates it into natural, human-like language.
This can either be a simple template-based system or a more complex machine
learning model.
User simulator: In training a transactional chatbot, a user simulator is used. It’s a model
that generates simulated user behavior, which can be used for training the chatbot in a
controlled environment.
Database (DB): Chatbots that provide information or perform transactions often need
to interact with a database. This could be checking ticket availability, booking
appointments, providing product details, etc. The DB is an integral part of these chatbot
systems.
Error model controller: This component is often used during training to add some
noise to the user simulator’s responses, making the training environment more similar
to real-world conditions where user inputs can be unpredictable and varied.
These components work together in a cycle to enable transactional chatbots to handle
complex, multi-turn dialogues, manage user goals, and offer an engaging, human-like
conversation experience.
Benefits of transactional chatbots
Transactional chatbots, a form of virtual assistant, are seeing increased adoption across
various industries, all thanks to the multitude of benefits they bring to the table. Here are
some benefits of using them:
Enhanced efficiency: Transactional chatbots are designed for multitasking, handling
several customer interactions simultaneously without any hitches. They provide round-
the-clock service, responding to customer queries in real time, regardless of
geographical boundaries or time differences. Automated responses also help keep answers
consistent, improving the overall efficiency of your team and services.
Budget-friendly solution: Incorporating transactional chatbots into your customer
service protocol allows you to minimize the need for human intervention, leading to
considerable cost savings. With their capacity to operate 24/7, chatbots also contribute
to improved cost-effectiveness. By optimizing operations and reducing personnel
expenses, chatbots offer substantial cost advantages.
Tailored interactions: Chatbots can comprehend each customer’s preferences,
paving the way for more personalized interactions and tailored recommendations.
Customers are more likely to interact with businesses offering a personal touch,
enhancing their overall experience.
Augmented sales: Transactional chatbots can significantly boost sales by providing
personalized suggestions based on customer preferences and buying history. They
also contribute to lead generation by simultaneously managing multiple queries,
potentially enhancing your business’s revenue and sales figures.
Superior customer experience: With their round-the-clock service and efficient
customer management, transactional chatbots significantly improve the customer
experience. By offering seamless service without human involvement, these chatbots
can contribute to the growth and reputation of your organization.
How does a transactional chatbot operate?
Here is the sequence of steps that describe how a transactional chatbot works.
User initiation: The process begins when a user sends a message or a request to the
chatbot. This could be a query, a request for information, or an action such as booking
a ticket or making a reservation.
Input interpretation: The chatbot uses its Natural Language Understanding (NLU) unit
to interpret the user’s message. It converts the natural language input into a machine-
readable format. The NLU unit employs tokenization, stemming, part-of-speech
tagging, and entity extraction to understand the context, intent, and entities in the
user’s message.
Dialogue management: The Dialogue Manager (DM) processes this interpreted input.
It uses the state tracker to keep track of the conversation’s context, including the user’s
goals, requests, and the information the chatbot has provided.
Policy learning: Based on the current state of the conversation, the policy learner
uses reinforcement learning algorithms to decide on the best possible action or
response.
System response generation: Once the action is determined, the system generates
an appropriate response. This could involve querying a database for required
information, initiating a transaction, or formulating a reply to the user’s query.
Response delivery: The generated system response is then translated into natural,
human-like language using the Natural Language Generator (NLG) unit. This response
is then delivered to the user.
User feedback and learning: The chatbot observes and learns from user feedback.
For instance, if a user corrects information or rephrases a request, the chatbot uses
this feedback to update its understanding and improve future responses.
Conversation continuation or termination: Depending on the user’s response or the
chatbot’s settings, the conversation may continue with further exchanges or be
concluded if the chatbot has successfully addressed the user’s request.
This is a generalized flow of how a transactional chatbot operates. Please note that the exact
workings can vary based on the chatbot’s specific design, functionalities, and the complexity
of tasks it is programmed to perform.
Understanding the dialogue system
A transactional chatbot employs a dialogue system designed to facilitate meaningful,
purpose-driven conversations with users. This system revolves around three key
components: the Dialogue Manager (DM), the Natural Language Understanding (NLU) unit,
and the Natural Language Generator (NLG) unit, each playing a unique role in the
conversational process.
The NLU unit acts as the ears of the chatbot, listening to and interpreting user inputs. When
a user utters something, it is the job of the NLU to translate this into a semantic frame. This
frame is a structured representation of the user’s utterance, stripped of natural language
complexities and brought down to a format the chatbot can understand and process.
Now enter the DM, the chatbot’s brain. Composed of a Dialogue State Tracker (DST) and a
policy, often represented by a neural network, the DM controls the flow of the conversation.
The DST takes the semantic frame from the NLU, combines it with the history of the
conversation, and creates a state representation. This state is the distilled essence of the
dialogue so far, allowing the bot to maintain the context and continuity of the conversation.
Next, the state representation is ingested by the policy component of the DM, determining
the chatbot’s next action. Here, reinforcement learning can play a vital role, enabling the
chatbot to learn the best responses over time from repeated interactions.
In some cases, an external database can be consulted to supplement the chatbot’s
responses with useful information, like specifics about a restaurant reservation or movie
ticket availability.
Once the chatbot’s response is decided, it is still in a semantic frame, which isn’t user-
friendly. Here is where the NLG unit, the chatbot’s mouth, steps in. The NLG takes this
semantic frame and transforms it back into natural, human-like language. This allows the
chatbot to deliver responses that are easily understandable by the user.
The user’s goal, be it making a reservation, booking a ticket, or gathering information, forms
the driving force behind this dialogue loop. Through iterative cycles of understanding,
managing dialogue, and generating natural language, the transactional chatbot works
towards achieving this user goal, creating a dynamic, interactive, and purposeful
conversational experience.
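To make this loop concrete, here is a minimal Python sketch of one pass through NLU, state tracking, policy, and NLG. It is a toy illustration, not the TC-Bot implementation: the keyword-based `nlu`, the dictionary-based `update_state`, the hand-written `policy`, and the template-based `nlg` are all hypothetical stand-ins for the real components, and the start time in the reply is hard-coded where a real system would consult its database.

```python
def nlu(utterance):
    """Toy NLU: turn text into a semantic frame using keyword matching."""
    frame = {'intent': 'inform', 'inform_slots': {}, 'request_slots': {}}
    text = utterance.lower()
    if 'tonight' in text:
        frame['inform_slots']['date'] = 'tonight'
    if 'what time' in text:
        frame['intent'] = 'request'
        frame['request_slots']['starttime'] = 'UNK'
    return frame


def update_state(state, frame):
    """Toy state tracker: merge the new frame into the running dialogue state."""
    state['history'].append(frame)
    state['inform_slots'].update(frame['inform_slots'])
    state['last_user_frame'] = frame
    return state


def policy(state):
    """Toy policy: answer a pending request, otherwise ask for the movie name."""
    requested = state['last_user_frame']['request_slots']
    if 'starttime' in requested:
        # Hard-coded value; a real agent would look this up in the database.
        return {'intent': 'inform', 'inform_slots': {'starttime': '8 pm'}, 'request_slots': {}}
    return {'intent': 'request', 'inform_slots': {}, 'request_slots': {'moviename': 'UNK'}}


def nlg(action):
    """Toy NLG: render the agent's semantic frame with fixed templates."""
    if action['intent'] == 'inform':
        slot, value = next(iter(action['inform_slots'].items()))
        return 'The {} is {}.'.format(slot, value)
    slot = next(iter(action['request_slots']))
    return 'Which {} would you like?'.format(slot)


state = {'history': [], 'inform_slots': {}, 'last_user_frame': None}
for utterance in ['I want to see a movie tonight', 'What time does it start?']:
    state = update_state(state, nlu(utterance))
    reply = nlg(policy(state))
    print(reply)
```

In a trained system, the `policy` function is the part that reinforcement learning replaces with a neural network.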
The role of the user simulator and error model controller
In transactional chatbots, two significant components contribute to refining the model’s
training and performance: the user simulator and the Error Model Controller (EMC). Both are
crucial in enabling the chatbot to handle more realistic, diverse, and error-prone
conversations.
User simulator
The user simulator is akin to a virtual training partner for the chatbot. It emulates the
behavior of a real user, offering a more efficient way to train the bot compared to hours of
user interactions. This simulator operates based on an agenda, meaning it has a predefined
goal for each interaction episode, and its actions align with this goal. The internal state of the
simulator allows it to follow the dialogue progression and take informed actions accordingly.
Responses to agent actions are crafted using a combination of deterministic rules with a
touch of stochastic rules to introduce variety.
User goals are essential elements for the simulator, representing what the user wants to
achieve from a conversation. These goals can be sourced from an actual dialogue corpus or be
manually created, comprising ‘inform slots’ and ‘request slots.’ The inform slots represent
constraints the user has in mind, while request slots simulate the user’s quest for specific
information. However, unlike real users who may change their minds during a conversation,
the simulator’s goals remain static throughout an episode. A “default slot” is added to every
goal’s request slots, and the agent must provide a value for this slot for successful goal
fulfillment.
The user simulator’s internal state records the goal slots and the conversation’s history. It
aids in formulating user actions at each step, containing dictionaries of slots and an intent:
rest slots, history slots, request slots, inform slots, and the intent of the current action.
The actions that a user simulator can perform are varied and can sometimes be complex,
incorporating multiple request or inform slots. These actions can even contain a mix of both
types of slots.
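As a rough illustration of these structures, the sketch below builds a user goal and the simulator’s initial internal state as plain Python dictionaries. The goal values, the helper name `init_user_state`, and the default slot name `'ticket'` are assumptions made for this example, and the comments on each field reflect our reading of how the state is used rather than a definitive specification.

```python
import copy

# A hypothetical user goal: constraints the user holds (inform slots) and
# information the user wants back (request slots).
user_goal = {
    'inform_slots': {'moviename': 'zootopia', 'date': 'tomorrow'},
    'request_slots': {'starttime': 'UNK', 'theater': 'UNK'},
}

DEFAULT_SLOT = 'ticket'  # assumed name of the default slot


def init_user_state(goal):
    """Build the simulator's internal state at the start of an episode."""
    goal = copy.deepcopy(goal)  # leave the caller's goal untouched
    # The default slot is added to every goal's request slots.
    goal['request_slots'][DEFAULT_SLOT] = 'UNK'
    rest_slots = {}
    rest_slots.update(goal['inform_slots'])
    rest_slots.update(goal['request_slots'])
    return {
        'rest_slots': rest_slots,  # goal slots not yet informed or requested
        'history_slots': {},       # slots already exchanged with the agent
        'request_slots': {},       # slots requested in the current action
        'inform_slots': {},        # slots informed in the current action
        'intent': '',              # intent of the current action
    }


state = init_user_state(user_goal)
print(sorted(state['rest_slots']))
```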
Error Model Controller (EMC)
The Error Model Controller (EMC) comes into play once a user action is received from the
simulator. It is responsible for introducing errors into these actions, mimicking the
imperfections of real-world interactions and helping the bot cope with potential
misunderstandings or mistakes in user responses. The EMC can add errors to the user
action’s inform slots and intent, training the bot to handle unexpected scenarios better and
ensuring it’s equipped to deal with more realistic, less-than-perfect human interactions.
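A simplified version of this idea can be sketched as follows. The function name `infuse_slot_errors`, its parameters, and the sample slot values are hypothetical, and the sketch only corrupts inform-slot values; intent errors are left out for brevity.

```python
import random


def infuse_slot_errors(action, slot_values, error_prob, rng):
    """Return a copy of a user action with some inform-slot values corrupted.

    Each inform slot is, with probability error_prob, replaced by a different
    value drawn from that slot's list of known values.
    """
    noisy = {
        'intent': action['intent'],
        'inform_slots': dict(action['inform_slots']),
        'request_slots': dict(action['request_slots']),
    }
    for slot, value in action['inform_slots'].items():
        alternatives = [v for v in slot_values.get(slot, []) if v != value]
        if alternatives and rng.random() < error_prob:
            noisy['inform_slots'][slot] = rng.choice(alternatives)
    return noisy


slot_values = {'date': ['tonight', 'tomorrow', 'friday']}
action = {'intent': 'inform', 'inform_slots': {'date': 'tonight'}, 'request_slots': {}}
rng = random.Random(0)

clean = infuse_slot_errors(action, slot_values, error_prob=0.0, rng=rng)
noisy = infuse_slot_errors(action, slot_values, error_prob=1.0, rng=rng)
```

With an error probability of 0 the action passes through unchanged, and with a probability of 1 every inform slot that has an alternative value is replaced, so the parameter gives direct control over how noisy the training conversations are.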
An overview of Deep-Q-Network
Deep Q-Network (DQN) is a reinforcement learning technique that combines Q-Learning with
deep neural networks. DQN was proposed by researchers at Google DeepMind, and it
had a significant impact on the field of reinforcement learning, particularly in environments
with high-dimensional raw inputs, such as video games.
In traditional Q-Learning, a table called the Q-table stores the value of every possible state-
action pair. However, this approach doesn’t scale well to problems with large state spaces or
problems where states are not easily expressible in table form, such as image inputs.
DQN addresses these challenges using a deep neural network to approximate the Q-
function, which maps state-action pairs to expected future rewards. This way, a neural
network can be trained to predict the Q-values for a given state instead of maintaining a table
for each possible state-action pair.
A key innovation in DQN is using experience replay and target networks to stabilize training.
Experience replay stores past experiences in a replay buffer and samples mini-batches from
this buffer to train the network, which breaks the correlation between sequential experiences.
The target network is a separate network used to compute the target Q-values during
learning, which is periodically updated from the main network. This helps to avoid harmful
feedback loops during learning.
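The two stabilizing ideas can be sketched in plain Python. For a stored experience (s, a, r, s', done), the training target is r when the episode has ended and r + gamma * max Q_target(s', a') otherwise. The class name `ReplayBuffer`, the function `td_targets`, and the fixed-value `toy_target_q` are illustrative assumptions; in a real agent the target Q-values come from the separate target network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences drop out first

    def add(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size, rng):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.memory), batch_size)


def td_targets(batch, target_q, gamma):
    """Compute r + gamma * max_a' Q_target(s', a') for each experience."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            targets.append(reward)  # no future reward after a terminal step
        else:
            targets.append(reward + gamma * max(target_q(next_state)))
    return targets


def toy_target_q(state):
    """Stand-in for the target network: fixed Q-values for two actions."""
    return [0.5, 2.0]


buffer = ReplayBuffer(capacity=100)
buffer.add(('s0', 1, 1.0, 's1', False))
buffer.add(('s1', 0, -1.0, 's2', True))
batch = list(buffer.memory)
targets = td_targets(batch, toy_target_q, gamma=0.9)
```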
Since the inception of DQN, many extensions have been proposed to improve its
performance and stability, such as Double DQN, Dueling DQN, and Prioritized Experience
Replay.
Training a transactional chatbot using Deep-Q-Network
Building a transactional chatbot using reinforcement learning involves several steps that
should be executed sequentially. Here’s the sequence:
1. Preparing the state: The initial step in developing a chatbot is preparing the state,
which represents the current situation that the chatbot is in. This typically involves
processing the raw input data (like text conversation history) into a format the model
can understand. The state also includes the chatbot’s internal information about the
conversation, like the identified intents or entities in the user’s utterances.
2. Dialogue configuration for the agent: The next step is to set up the dialogue
configuration for the agent. This includes defining the possible actions that the agent
can take (like answering a question, asking for more information, or ending the
conversation) and defining the reward structure that the agent will use to learn. This
configuration guides the agent about the context of the conversation, its possible
actions, and their consequences.
3. Neural network model: Once the state and dialogue configuration have been set up,
the next step is to build the neural network model that will be used to learn the dialogue
policy. This model takes the current state as input and outputs the Q-values for each
possible action. The Q-values represent the expected future reward for taking each
action, which is used to decide the best action to take. This model could be a Deep Q-
Network (DQN) or other types of network, depending on the complexity of the task and
the available data.
4. Policy: With the neural network model in place, a policy that dictates how the agent
chooses its actions can be defined. A common policy is an epsilon-greedy policy,
where the agent mostly chooses the action with the highest Q-value (as predicted by
the model) but occasionally chooses a random action to explore the environment.
5. Agent training: Finally, with the state, dialogue configuration, neural network model,
and policy set up, the agent can be trained. During training, the agent interacts with the
environment (in this case, the chatbot conversing with users or a user simulator), takes
actions according to its policy, observes the results, and receives rewards. The agent
then uses these experiences to update its neural network model, aiming to maximize
its total reward over time. The agent continually goes through this interaction and
learning process until it reaches a satisfactory performance level.
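The epsilon-greedy policy mentioned in step 4 fits in a few lines. This is a generic sketch, with a hypothetical function name and made-up Q-values, rather than the code used by any particular chatbot.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(42)
q_values = [0.1, 0.7, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always index 1
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)  # any valid index
```

A common design choice is to start with a high epsilon and lower it as training progresses, so the agent explores widely at first and relies on its learned Q-values later.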
The scenario
The main objective of our transactional chatbot is to engage in proficient interactions with
real users, successfully accomplishing specific tasks such as locating suitable reservations
or movie tickets within the users’ specified constraints. The chatbot, referred to as the agent,
has a crucial role in processing an ongoing conversation’s state and generating an
appropriate, near-optimal response. In essence, the agent takes a snapshot of the current
dialogue history from the Dialogue State Tracker (ST) and uses it to decide on the most
fitting dialogue response to offer next.
The supporting code for our system draws inspiration from a dialogue system developed by
MiuLab, known as TC-Bot. The notable achievement of their research is the demonstration
of a user simulation with fundamental rules. This approach enables the swift training of the
chatbot agent via reinforcement learning, which is considerably faster than when training with
real people. While other studies have attempted similar methods, this research stands out
for a training approach that works well and is accompanied by accessible and
comprehensive code.
The complete code is available here – https://github.com/maxbrenner-ai/GO-Bot-DRL
Prerequisites
To fully comprehend the code, there are a few prerequisites that won’t be explicitly covered
but are vital for a comprehensive understanding. Here they are:
Proficiency in Python programming – A solid grasp of Python programming language is
a must.
Mastery of Python dictionaries – We will extensively utilize dictionaries in Python, so
understanding their operation is crucial.
Understanding of the DQN (Deep Q-Network) – Familiarity with developing a simple
DQN is necessary.
Experience with Keras for building neural networks – You should know how to construct
a straightforward neural network model using Keras.
Please ensure you are familiar with these areas before proceeding.
You need to have the following dependencies ready before executing the code:
Python >= 3.5
Keras >= 2.2.4 (Earlier versions probably work)
numpy
Understanding the data (movie tickets) for the chatbot
Data sources: Our dataset comprises movie tickets with varied attributes or slots. It is
structured as a dictionary where the keys are the unique identifiers of the tickets
(represented as long integers) and the values are sub-dictionaries encapsulating the
detailed information that each ticket holds. It’s important to note that not every ticket will
have the same attributes and certainly not the same values! Data source –
https://gist.github.com/maxbrenner-ai/f665bb570e1ac55568001c7991faebcd#file-movie_dict-txt
Database index: There is another file that houses a dictionary. The keys in this
dictionary represent different slots that a ticket might hold, while the values are lists of
potential values that each slot can take. Data dictionary link –
https://gist.github.com/maxbrenner-ai/f665bb570e1ac55568001c7991faebcd#file-movie_dict-txt
User goal collection: Lastly, we have a list that stores user goals. Each goal is
represented as a dictionary comprising request and inform slots. We will delve deeper
into what these slots signify later on. User goal list –
https://gist.github.com/maxbrenner-ai/79c1ace99eafcc376f37090c7e5287aa#file-movie_user_goals-txt
The core objective here is to enable the chatbot agent to locate a ticket that aligns with the
user’s specific requirements, which are defined by the goal for each episode. This is quite a
challenging task considering each ticket’s uniqueness and variance in slots!
Understanding the anatomy of an action
Understanding the structure of an action is crucial in this dialogue system. Ignoring the
natural language aspect for a moment, we can see that both the user simulator and the
agent work with actions represented as semantic frames. An action consists of an intent,
inform slots, and request slots. Here, a ‘slot’ signifies a key-value pair, typically referring to a
singular inform or request. For instance, in the dictionary {'starttime': 'tonight', 'theater': 'regal
16'}, both 'starttime': 'tonight' and 'theater': 'regal 16' are considered slots. More example
actions are available here:
https://gist.github.com/maxbrenner-ai/dcf1185a0f2dffc9f88b4054b908cf13#file-action_examples-txt
The intent indicates the kind of action it is. The remainder of the action is divided into inform
slots, which contain constraints, and request slots, which carry information that needs
completion. The potential keys are specified in dialogue_config.py, and their values are
provided in the aforementioned database dictionary.
An inform slot shares information that the sender wants the receiver to acknowledge. It
comprises a key from the list of keys and a value from that key’s associated list of values.
Conversely, a request slot contains a key for which the sender wishes to retrieve a value
from the receiver. In essence, it is a key from the list of keys and ‘UNK’ (indicating
“unknown”) as the value, as the sender doesn’t yet know the appropriate value for this slot.
The intents include:
Inform: Provides constraints in the form of inform slots.
Request: Asks for the completion of request slots with values.
Thanks: Used exclusively by the user, it signals to the agent that it has done
something satisfactory, or that the user is prepared to conclude the conversation.
Match found: Used solely by the agent, it informs the user that a match fulfilling the
user’s goal has been identified.
Reject: Utilized only by the user in response to the agent’s ‘match found’ intent,
indicating that the suggested match doesn’t fit their constraints.
Done: The agent uses this to wrap up the conversation and verify if the current goal
has been accomplished. The user action automatically adopts this intent if the
conversation drags on too long.
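The rules above translate naturally into a small validity check. The sketch below is an illustration only: the helper `is_valid_action`, the intent groupings, and the sample actions are our own assumptions based on the descriptions in this section, not code from the linked repository.

```python
USER_ONLY = {'thanks', 'reject'}        # intents only the user may send
AGENT_ONLY = {'match_found'}            # intents only the agent may send
SHARED = {'inform', 'request', 'done'}  # intents either side may send


def is_valid_action(action, sender):
    """Check an action's intent against its sender and its request slot values."""
    allowed = SHARED | (USER_ONLY if sender == 'user' else AGENT_ONLY)
    if action['intent'] not in allowed:
        return False
    # Request slots carry 'UNK' because the sender does not know the value yet.
    return all(value == 'UNK' for value in action['request_slots'].values())


# A user action that mixes one inform slot with one request slot
user_action = {'intent': 'request',
               'inform_slots': {'moviename': 'zootopia'},
               'request_slots': {'starttime': 'UNK'}}
agent_action = {'intent': 'match_found', 'inform_slots': {}, 'request_slots': {}}

print(is_valid_action(user_action, 'user'))
print(is_valid_action(agent_action, 'user'))
```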
Preparing the state
The Dialogue State Tracker (ST) is essential in a transactional chatbot. Its primary function is
to create a ‘state’ for the chatbot to work from. A ‘state’ is like a snapshot of the current
situation in the chat, which the chatbot uses to decide its next action.
To do this, the ST maintains a record of the dialogue, capturing both the user’s and chatbot’s
actions as they happen. It also keeps track of any information (known as ‘inform slots’)
shared in the chat. For instance, if the user mentions they prefer Italian food, this information
is saved in an ‘inform slot.’
The state prepared by the ST is essentially an array of data representing current dialogue
history and all the information slots mentioned so far. It’s like a conversation summary to
date, which helps the chatbot make informed decisions.
Also, whenever the chatbot needs to provide information to the user, the ST can fetch this
from a database using the current inform slots as constraints. For example, if the user asks
for Italian restaurants, the ST can pull a list from the database matching this criterion.
One crucial aspect of the ST’s job is to compile a useful state that gives the chatbot an
accurate view of the ongoing conversation. This state includes recent actions from both the
user and the chatbot, letting the chatbot know where the dialogue is at. It also includes a
count of the number of rounds or interactions that have occurred. This helps the chatbot
gauge how much time it has left, especially in scenarios where the chat has a maximum
number of rounds allowed.
Lastly, the state also includes details about the current inform slots and how many database
entries match this information. This helps the chatbot know how much information it has to
work with and how relevant it is to the user’s requirements.
The Dialogue State Tracker is like the chatbot’s memory and awareness, helping it
understand the current conversation and make the best possible response.
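To show what such a state might look like as numbers, here is a simplified encoding sketch. The reduced intent and slot lists, the `MAX_ROUNDS` limit, the scaling constants, and the function name `encode_state` are all assumptions chosen for illustration; a full implementation would encode more information, such as the agent’s own last action.

```python
INTENTS = ['inform', 'request', 'thanks', 'reject', 'done']  # user intents
SLOTS = ['moviename', 'date', 'starttime', 'theater']        # reduced slot set
MAX_ROUNDS = 20                                              # assumed round limit


def encode_state(last_user_action, current_informs, round_num, db_match_count):
    """Flatten the tracked dialogue information into a list of numbers."""
    # One-hot encoding of the most recent user intent
    intent_rep = [1.0 if last_user_action['intent'] == i else 0.0 for i in INTENTS]
    # Which slots the user requested in the last action
    request_rep = [1.0 if s in last_user_action['request_slots'] else 0.0 for s in SLOTS]
    # Which slots have been informed at any point so far
    inform_rep = [1.0 if s in current_informs else 0.0 for s in SLOTS]
    # Progress through the allowed rounds, scaled to [0, 1]
    round_rep = [round_num / MAX_ROUNDS]
    # Number of database entries matching the current informs, capped and scaled
    db_rep = [min(db_match_count, 100) / 100.0]
    return intent_rep + request_rep + inform_rep + round_rep + db_rep


last_user_action = {'intent': 'request', 'inform_slots': {'date': 'tonight'},
                    'request_slots': {'starttime': 'UNK'}}
state = encode_state(last_user_action, {'date': 'tonight'}, round_num=2, db_match_count=14)
print(len(state))
```

The length of this list is what the neural network later receives as its input size.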
Dialogue configuration for the agent
Dialogue configuration for the agent is a critical step in building a transactional chatbot. This
process involves defining how the chatbot will interact with users, specifying the flow of
conversation, and the range of responses it can deliver. Essentially, it is setting up the rules
of engagement for the chatbot, ensuring that it can understand user inputs and provide
relevant and meaningful responses. This configuration becomes the foundation upon which
further layers of learning and adaptation are built, making it a vital part of any successful
chatbot development.
Here are the dialogue config constants used by the agent:
# Possible inform and request slots for the agent
agent_inform_slots = ['moviename', 'theater', 'starttime', 'date', 'genre', 'state', 'city', 'zip',
                      'critic_rating', 'mpaa_rating', 'distanceconstraints', 'video_format',
                      'theater_chain', 'price', 'actor', 'description', 'other', 'numberofkids']
agent_request_slots = ['moviename', 'theater', 'starttime', 'date', 'numberofpeople', 'genre',
                       'state', 'city', 'zip', 'critic_rating', 'mpaa_rating',
                       'distanceconstraints', 'video_format', 'theater_chain', 'price',
                       'actor', 'description', 'other', 'numberofkids']
# Possible actions for agent
agent_actions = [
    {'intent': 'done', 'inform_slots': {}, 'request_slots': {}},  # Triggers closing of conversation
    {'intent': 'match_found', 'inform_slots': {}, 'request_slots': {}}
]
for slot in agent_inform_slots:
    agent_actions.append({'intent': 'inform', 'inform_slots': {slot: 'PLACEHOLDER'},
                          'request_slots': {}})
for slot in agent_request_slots:
    agent_actions.append({'intent': 'request', 'inform_slots': {}, 'request_slots': {slot: 'UNK'}})
# Rule-based policy request list
rule_requests = ['moviename', 'starttime', 'city', 'date', 'theater', 'numberofpeople']
# These are possible inform slot keys that cannot be used to query
# (usersim_default_key is not defined in this excerpt; it must be set elsewhere in the config)
no_query_keys = ['numberofpeople', usersim_default_key]
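To make the link between this action list and the network output concrete, here is a small sketch that builds a reduced action list with the same pattern and maps between output indices and actions. The shortened slot lists and the helper names `map_index_to_action` and `map_action_to_index` are illustrative assumptions, not a copy of the agent's own helpers.

```python
# Reduced, hypothetical slot lists; the config above uses longer ones.
inform_slots = ['moviename', 'date']
request_slots = ['moviename', 'date', 'numberofpeople']

# Same construction pattern as the agent_actions list in the config.
actions = [
    {'intent': 'done', 'inform_slots': {}, 'request_slots': {}},
    {'intent': 'match_found', 'inform_slots': {}, 'request_slots': {}},
]
for slot in inform_slots:
    actions.append({'intent': 'inform', 'inform_slots': {slot: 'PLACEHOLDER'},
                    'request_slots': {}})
for slot in request_slots:
    actions.append({'intent': 'request', 'inform_slots': {},
                    'request_slots': {slot: 'UNK'}})


def map_index_to_action(index):
    """Return a copy of the action stored at the given network output index."""
    action = actions[index]
    return {'intent': action['intent'],
            'inform_slots': dict(action['inform_slots']),
            'request_slots': dict(action['request_slots'])}


def map_action_to_index(action):
    """Return the output index whose action equals the given action."""
    return actions.index(action)


num_actions = len(actions)  # 2 fixed intents + 2 informs + 3 requests
print(num_actions)
print(map_index_to_action(4))
```

With the full slot lists shown in the config, the same construction yields 2 + 18 + 19 = 39 actions, so the network's output layer would have 39 units.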
Building a neural network model
In the development of a transactional chatbot, constructing the neural network model is a
pivotal step. The model for the chatbot agent is built with Keras, a popular deep learning
framework. It is a neural network with a single hidden layer which, despite its simplicity,
proves effective for this task. The network takes the dialogue state as input and outputs one
value per possible agent action, which the agent uses to choose its response. Here is the
code snippet:
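from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def _build_model(self):
    # Method of the agent class: state vector in, one value per action out
    model = Sequential()
    model.add(Dense(self.hidden_size, input_dim=self.state_size, activation='relu'))
    model.add(Dense(self.num_actions, activation='linear'))
    # Recent Keras releases name this argument learning_rate; older ones used lr
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.lr))
    return model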
Hyperparameters such as the hidden layer size and the learning rate are set in the
constants.json file, located here:
https://github.com/maxbrenner-ai/GO-Bot-DRL/blob/master/constants.json
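For readers who want to see what this architecture computes without installing Keras, the following is a minimal pure-Python sketch of the forward pass: a ReLU hidden layer followed by a linear output layer with one value per action. The layer sizes, the random initialization, and all function names are assumptions chosen for illustration.

```python
import random


def relu(values):
    """Apply the ReLU activation element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for a layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def q_values(state, hidden_layer, output_layer):
    """Forward pass: ReLU hidden layer, then a linear layer with one value per action."""
    hidden = relu(dense(state, *hidden_layer))
    return dense(hidden, *output_layer)


rng = random.Random(0)
state_size, hidden_size, num_actions = 6, 8, 4  # illustrative sizes only
hidden_layer = init_layer(state_size, hidden_size, rng)
output_layer = init_layer(hidden_size, num_actions, rng)

state = [1.0, 0.0, 0.0, 1.0, 0.5, 0.2]
q = q_values(state, hidden_layer, output_layer)
# A greedy policy picks the action whose predicted value is highest.
best_action_index = max(range(num_actions), key=lambda i: q[i])
print(len(q), best_action_index)
```

The linear output activation matters here: the outputs are value estimates that can be negative or larger than one, so they should not be squashed into a probability range.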
Implementing policy
The policy guides the agent in selecting a suitable action for the current state. How it does
so depends on whether the agent is in the warm-up or the training stage. The warm-up stage,
which precedes training, fills the agent’s memory with initial experiences; in many DQN
applications this is done with a random policy. For our GO chatbot, however, a basic
rule-based policy is used during the warm-up phase.
import random

def get_action(self, state, use_rule=False):
    # self.eps is initialized to the starting epsilon and does NOT get annealed
    if self.eps > random.random():
        # Explore: pick a random action index (randint is inclusive on both ends)
        index = random.randint(0, self.num_actions - 1)
        # self._map_index_to_action(index) takes an index and maps the action from all
        # possible agent actions
        action = self._map_index_to_action(index)
        return index, action
    else:
        if use_rule:
            return self._rule_action()
        else:
            return self._dqn_action(state)
Once training begins, the behavior model takes over action selection. The use_rule flag is
set to True during the warm-up stage. With probability self.eps the method picks a random
action instead, which keeps the agent exploring. In every case, it returns both the index of
the action and the action itself.
The rule-based policy employed during the warm-up stage is a straightforward one. A
noteworthy component of this rule-based policy is the reset method of the agent. This
primarily serves to reset a couple of variables associated with the rule-based policy. Although
simple, this policy is crucial for initiating the agent’s activity in a somewhat meaningful way,
thus improving results over taking random actions.
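One plausible shape for such a rule-based policy is sketched below: request each slot from rule_requests in order, then announce a match, then close the conversation. The `RulePolicy` class, its attribute names, and this exact ordering are assumptions for illustration rather than a verbatim copy of the agent's code.

```python
rule_requests = ['moviename', 'starttime', 'city', 'date', 'theater', 'numberofpeople']


class RulePolicy:
    """Hypothetical warm-up policy: ask for each slot in order, then close."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Called at the start of every episode so the requests start over.
        self.slot_index = 0
        self.phase = 'not done'

    def next_action(self):
        # First, work through the fixed list of requests.
        if self.slot_index < len(rule_requests):
            slot = rule_requests[self.slot_index]
            self.slot_index += 1
            return {'intent': 'request', 'inform_slots': {}, 'request_slots': {slot: 'UNK'}}
        # Then report a match once, and finally end the dialogue.
        if self.phase == 'not done':
            self.phase = 'done'
            return {'intent': 'match_found', 'inform_slots': {}, 'request_slots': {}}
        return {'intent': 'done', 'inform_slots': {}, 'request_slots': {}}


policy = RulePolicy()
intents = [policy.next_action()['intent'] for _ in range(8)]
print(intents)
```

The reset method is what makes the policy reusable across episodes: without it, the second dialogue would begin at the end of the request list.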
Training an agent

In a transactional chatbot, the agent’s role is much like a skilled conversation partner, adept
at helping users achieve a specific target, such as booking a reservation or buying a movie
ticket, while considering the user’s specific needs and limitations. This agent’s primary task is
navigating through a conversation and making the best possible decision at each step.
The agent relies on a Dialogue State Tracker (ST) to do this. This tracker is like the memory
of the conversation, keeping track of the discussion’s history. Using this information, the
agent selects an appropriate response that moves the conversation forward, aiming to fulfill
the user’s goal.
The agent’s policy maps the current state to an action. During the warm-up phase, this policy
can be as simple as working through a fixed list of requests. During training, the policy
becomes the neural behavior model with a single hidden layer.
The training method is straightforward and differs only slightly from other Deep Q-Network
(DQN) training setups. It can be worthwhile to experiment with the model’s structure,
incorporate prioritized experience replay (a technique that replays more important
experiences more often), and develop a more sophisticated rule-based policy. This kind of
iterative tuning can make the agent more efficient and effective at accomplishing its goals.
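The core of a DQN training step is the target value each sampled experience is regressed toward. The sketch below shows the standard target, reward plus discounted best next-state value, using a separate target model. The discount factor of 0.9, the toy value functions, and the `dqn_targets` name are assumptions for illustration.

```python
def dqn_targets(batch, q_online, q_target, gamma=0.9):
    """Return one updated value vector per experience in the batch.

    batch    : list of (state, action_index, reward, next_state, done)
    q_online : function state -> list of values from the behavior model
    q_target : function state -> list of values from the target model
    """
    targets = []
    for state, action_index, reward, next_state, done in batch:
        # Start from the current predictions so only the taken action changes.
        updated = list(q_online(state))
        if done:
            # No future value after the final step of a dialogue.
            updated[action_index] = reward
        else:
            updated[action_index] = reward + gamma * max(q_target(next_state))
        targets.append(updated)
    return targets


# Toy stand-ins for the two networks (assumed, fixed outputs).
def q_online(state):
    return [0.0, 1.0, 2.0]


def q_target(state):
    return [0.5, 4.0, 1.0]


batch = [
    ('s0', 1, -1.0, 's1', False),  # mid-dialogue step with a small penalty
    ('s1', 2, 10.0, None, True),   # final step of a successful dialogue
]
targets = dqn_targets(batch, q_online, q_target)
print(targets)
```

The behavior model would then be fitted to these vectors with a mean squared error loss, which matches the loss the model is compiled with.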
Here’s a simpler explanation of the flow of an agent’s action in a transactional chatbot, as
shown in the above diagram:
A single round or loop in training involves four main components:

The agent (dqn_agent)
The dialogue state tracker (state_tracker)
The user (or user simulator)
The Error Model Controller (EMC)
The following steps outline the sequence of events:
1. The round begins by acquiring the current state, which is either the initial state at the
start of the conversation (episode) or the next state produced by the previous round.
This state is then fed into the agent’s action determination method.
2. The agent decides on an action based on the current state and passes it to the state
tracker. The state tracker updates its record of the conversation and enriches the
agent’s action with additional information retrieved from a database.
3. The enriched agent’s action is then given to the user simulator. Here, the user simulator
generates a rule-based response and also returns the reward and a success indicator
(though these aren’t shown in the diagram).
4. The user’s response then goes through the error model controller, which introduces
potential errors mimicking real-world scenarios.
5. The possibly erroneous user response is then fed into the state tracker, which updates
its conversation record. Unlike in step 2, the tracker only records the user response
and does not enrich it.
6. Lastly, the state tracker produces the next state of the conversation, completing the
current experience tuple (state, action, reward, next state). This tuple is then added to
the agent’s memory, and the cycle continues with the next round.
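The six steps above can be sketched as a single function that wires the four components together. All classes below are simplified stand-ins written for this example, and their method names (`update_state_agent`, `infuse_error`, and so on) are assumptions rather than a guaranteed match for the repository's interfaces.

```python
import random


class StubAgent:
    def __init__(self):
        self.memory = []

    def get_action(self, state):
        # Always asks for the date; a real agent would consult its policy.
        return 0, {'intent': 'request', 'inform_slots': {}, 'request_slots': {'date': 'UNK'}}

    def add_experience(self, state, action_index, reward, next_state, done):
        self.memory.append((state, action_index, reward, next_state, done))


class StubStateTracker:
    def __init__(self):
        self.history = []

    def update_state_agent(self, agent_action):
        # A real tracker would also enrich the action with database information.
        self.history.append(('agent', agent_action))

    def update_state_user(self, user_action):
        self.history.append(('user', user_action))

    def get_state(self):
        return len(self.history)  # placeholder state: number of turns so far


class StubUser:
    def __init__(self, max_rounds=3):
        self.max_rounds = max_rounds
        self.round = 0

    def step(self, agent_action):
        self.round += 1
        done = self.round >= self.max_rounds
        user_action = {'intent': 'inform', 'inform_slots': {'date': 'tomorrow'},
                       'request_slots': {}}
        reward = 10 if done else -1
        return user_action, reward, done, done  # success mirrors done in this stub


class StubErrorModel:
    def __init__(self, error_prob=0.05, seed=0):
        self.error_prob = error_prob
        self.rng = random.Random(seed)

    def infuse_error(self, user_action):
        # Occasionally corrupt slot values to mimic a misunderstanding.
        if self.rng.random() < self.error_prob:
            for slot in user_action['inform_slots']:
                user_action['inform_slots'][slot] = 'corrupted'


def run_round(state, agent, tracker, user, emc):
    action_index, agent_action = agent.get_action(state)          # steps 1 and 2
    tracker.update_state_agent(agent_action)                      # step 2
    user_action, reward, done, success = user.step(agent_action)  # step 3
    emc.infuse_error(user_action)                                 # step 4
    tracker.update_state_user(user_action)                        # step 5
    next_state = tracker.get_state()                              # step 6
    agent.add_experience(state, action_index, reward, next_state, done)
    return next_state, reward, done, success


agent, tracker, user, emc = StubAgent(), StubStateTracker(), StubUser(), StubErrorModel()
state, done = tracker.get_state(), False
while not done:
    state, reward, done, success = run_round(state, agent, tracker, user, emc)
print(len(agent.memory), success)
```

Each pass through run_round stores exactly one experience, so the agent's memory grows by one tuple per round regardless of how the dialogue ends.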
Before the actual learning and decision-making begin for a Deep Q-Network (DQN) agent,
like our chatbot, it undergoes a ‘warm-up’ phase. This phase is necessary to fill the agent’s
memory buffer with initial information. But, unlike DQN applications in games where the
agent may perform random actions, our chatbot uses a basic rule-based algorithm during
this warm-up stage. The specifics of this algorithm will be covered in detail in part II of the
series.
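A warm-up phase of this kind can be pictured as filling a bounded buffer with experiences from rule-driven episodes before any learning happens. In the sketch below, the `ReplayMemory` class, the `warmup` function, and the fake episode generator are all hypothetical helpers written for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer; the oldest experiences are dropped when it is full."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size, rng=random):
        # Uniform sampling; prioritized replay would weight experiences instead.
        return rng.sample(list(self.buffer), batch_size)

    def empty(self):
        self.buffer.clear()


def warmup(memory, run_episode, max_experiences):
    """Run rule-driven episodes until enough experiences are stored."""
    collected = 0
    while collected < max_experiences and not memory.is_full():
        for experience in run_episode():
            memory.add(experience)
            collected += 1
    return collected


def fake_rule_episode():
    # Stand-in for one rule-policy dialogue: (state, action, reward, next_state, done).
    return [(t, 0, -1, t + 1, t == 3) for t in range(4)]


memory = ReplayMemory(max_size=10)
collected = warmup(memory, fake_rule_episode, max_experiences=8)
print(collected, len(memory.buffer))
```

Seeding the buffer with rule-driven dialogues means the first training batches contain at least some coherent conversations, which random actions in a large action space would rarely produce.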
It’s also important to note that we are not using any Natural Language (NL) components in
this training process. This means that all the actions of the chatbot will be in the form of
‘semantic frames’ – structured data representing meanings. The focus here is on training the
Dialogue Manager (DM), which doesn’t require Natural Language Understanding (NLU) or
Natural Language Generation (NLG). These NL components are usually pre-trained
separately from the agent and are not crucial to understanding the reinforcement learning
process.
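To show what a semantic frame looks like in practice, here are two frames in the same dictionary layout as the agent actions in the dialogue config. The slot values and the `describe` helper are illustrative assumptions; the helper only formats a frame for logging and is not an NLG component.

```python
# An agent action asking the user which date they want.
agent_request = {'intent': 'request', 'inform_slots': {}, 'request_slots': {'date': 'UNK'}}

# A user action answering that question and asking for a start time.
user_reply = {
    'intent': 'request',
    'inform_slots': {'date': 'tomorrow'},
    'request_slots': {'starttime': 'UNK'},
}


def describe(frame):
    """Render a frame as a compact string, for example for logging."""
    informs = ', '.join(f'{k}={v}' for k, v in frame['inform_slots'].items())
    requests = ', '.join(frame['request_slots'])
    return f"{frame['intent']}(inform: [{informs}]; request: [{requests}])"


print(describe(agent_request))
print(describe(user_reply))
```

Because both sides exchange these structures directly, the dialogue manager can be trained without ever parsing or generating a sentence.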
Here is the code snippet to train the agent:
print('Training Started...')
episode = 0
period_reward_total = 0
period_success_total = 0
success_rate_best = 0.0
while episode < NUM_EP_TRAIN:
    episode_reset()
    episode += 1
    done = False
    state = state_tracker.get_state()
    while not done:
        next_state, reward, done, success = run_round(state)
        period_reward_total += reward
        state = next_state
    period_success_total += success
    # Train
    if episode % TRAIN_FREQ == 0:
        # Check success rate
        success_rate = period_success_total / TRAIN_FREQ
        avg_reward = period_reward_total / TRAIN_FREQ
        # Flush
        if success_rate >= success_rate_best and success_rate >= SUCCESS_RATE_THRESHOLD:
            dqn_agent.empty_memory()
        # Update current best success rate
        if success_rate > success_rate_best:
            print('Episode: {} NEW BEST SUCCESS RATE: {} Avg Reward: {}'.format(
                episode, success_rate, avg_reward))
            success_rate_best = success_rate
            dqn_agent.save_weights()
        period_success_total = 0
        period_reward_total = 0
        # Copy
        dqn_agent.copy()
        # Train
        dqn_agent.train()
print('...Training Ended')
The complete code is available here:
https://github.com/maxbrenner-ai/GO-Bot-DRL/blob/master/train.py
Use cases of transactional chatbots

Transactional chatbots hold great potential across a multitude of sectors, including but not
limited to banking, insurance, e-commerce, healthcare, and hospitality. Here is how they can
be leveraged in various contexts:
Banking: Transactional chatbots can enhance banking services by automating tasks
traditionally handled by bank operators. For instance, they can authenticate user
identities, block stolen credit cards, provide operational hours of nearby branches, or
confirm outgoing transfers. Moreover, they can offer immediate assistance in case of
account queries, balance checks, or recent transactions, providing users with real-time
convenience.
Insurance: In the insurance sector, these chatbots can offer quotes to potential
customers or distribute insurance certificates to existing ones. More advanced bots can
even streamline the conversion process, allowing prospects to sign up directly if the
quote matches their budget and needs. The bot gathers the necessary details and
forwards the contract and supporting documents, reducing manual intervention and
accelerating policy issuance.
E-commerce: For e-commerce platforms, transactional chatbots can assist users in
product discovery based on their preferences. Additionally, they can facilitate the
buying process and handle requests for order modifications or cancellations. These
bots can also provide real-time order tracking, enhancing the shopping experience.
Healthcare: Transactional chatbots in the healthcare industry can help patients book
appointments, send medication reminders, or offer guidance on common health issues.
They can also gather patient data for health records, making the patient intake process
more efficient.
Hospitality: In the hospitality sector, these bots can automate room bookings, provide
information about facilities, offer personalized recommendations, and address common
queries about the stay, check-in/check-out process, etc.
Energy companies or mobile service providers: Similar to insurance, these
businesses can use transactional chatbots to provide quotes, facilitate service
sign-ups, offer upgrades, or handle cancellation requests.
These few instances illustrate the versatility and utility of transactional chatbots. However,
their use is not confined to these areas, and they can be tailored to address the unique
needs of various other industries.
Endnote
Transactional chatbots have indeed ushered in a new era of interaction between businesses
and their customers. It has become vital for companies to incorporate this transformative
technology into their communication strategies, ensuring they remain adaptable and
responsive to the shifting needs of their clientele. The promise that transactional chatbots
hold for the future is substantial, and with careful planning and tactical execution, they can
contribute to substantial growth for any business. Therefore, if a company wishes to stay
competitive and not fall behind in the rapidly advancing digital world, integrating a
transactional chatbot into its strategic planning becomes an astute decision.
As consumer expectations continue to evolve, the prospects for transactional chatbots are
looking brighter than ever. Future developments may involve more advanced levels of
personalization, with chatbots becoming increasingly intelligent. This would offer a more
enriched user experience, potentially featuring responses or suggestions specifically tailored
to an individual user’s preferences or past interactions.
Security is another area poised for significant improvement, particularly given the sensitive
transactional information these chatbots handle. Expect to see advancements in encryption,
fraud detection, and even biometric authentication as a means to protect and secure user
data.
Another promising direction for chatbots is their increasing integration with other
sophisticated technologies. Chatbots are already deployed across a wide array of business
sectors; in the future, we could see them combined with other cutting-edge
technologies, such as voice assistants or augmented reality, to offer even more engaging
customer experiences.
In sum, transactional chatbots are fast becoming necessary for businesses wishing to thrive
and grow in the digital age. Their potential future developments point to a world of more
personalized, secure, and immersive customer experiences.
Looking to boost your business operations with AI-driven transactional chatbots? Achieve
this with LeewayHertz’s AI chatbot development expertise!
