2. AI-driven UI: conversational interfaces and more
May 22nd, 2018
by Dmitriy Semashkov & Eirik Stavelin, Making Waves
How to build a chatbot for the
Norwegian market
http://www.meshnorway.com/events/ai-driven-ui-conversational-interfaces-and-more
3. HOW TO?
There are plenty of guides for creating conversational UIs, personas and fulfilment engines out there.
You can figure it out!
Step 0
Do .say("Hello")
8.
Speech-to-text && text-to-speech
TLDR: the densest part of the black box; for us it's just in/out from an API anyway.
• Computers deal in code
• Developers in text
• Users in voice
• A mic is opened for the user; the developer receives a text transcript
• The developer returns a text string; TTS makes sound
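The developer's slice of that loop can be sketched in a few lines. This is an illustrative sketch, not a Dialogflow API: `handleTranscript` is an invented name, and the point is only the contract the slide describes — text in, text out, with STT and TTS handled by the platform on either side.

```javascript
// Sketch of the developer's slice of the voice loop: the platform
// handles STT and TTS, so our code only ever sees and returns text.
// handleTranscript is a hypothetical name, not a platform API.
function handleTranscript(transcript) {
  // Accents, noise and punctuation are already resolved by the
  // STT engine before we receive the string.
  if (/hello|hei/i.test(transcript)) {
    return 'Hei! Hva kan jeg hjelpe deg med?';
  }
  return 'Beklager, det forsto jeg ikke.';
}

// The returned string is handed straight to TTS on the way out.
handleTranscript('hei mesh');
```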
9.
Intentions & goals
A classification problem
(TLDR; they did it for you, just bring your own data)
• Goals: things users want done
• Intents: things users want to do to fulfil goals
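The speaker notes describe intent matching as classification with a threshold and a default fallback. Here is a deliberately naive sketch of that shape — keyword scoring stands in for whatever model Dialogflow actually uses (a black box, per the deck), and the intent names and word lists are invented for the demo:

```javascript
// Toy intent classifier: score each intent against the utterance,
// apply a threshold, fall back to a default when nothing scores
// well enough. The intents and keywords are invented examples;
// Dialogflow's real classifier is far more capable.
const intents = {
  order_food: ['pils', 'pizza', 'biff', 'bestille'],
  opening_hours: ['åpent', 'åpningstider', 'stenger'],
};

function classifyIntent(utterance, threshold = 1) {
  const words = utterance.toLowerCase().split(/\s+/);
  let best = { name: 'fallback', score: 0 };
  for (const [name, keywords] of Object.entries(intents)) {
    const score = words.filter((w) => keywords.includes(w)).length;
    if (score > best.score) best = { name, score };
  }
  return best.score >= threshold ? best.name : 'fallback';
}
```

The fallback branch is what produces the agent's "sorry, I didn't get that" behaviour when no candidate clears the threshold.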
11.
Entities
Named or unnamed: the bits of text that distinguish one order from the next
("I want extra pepperoni on mine")
• Identify “things” in the world
• NLP
• (N)ER
• regex
• Word lists
• magic
• …
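Two of the bullets above — regex and word lists — are simple enough to sketch by hand. The product list and number words below are assumptions for the demo, not anything shipped with Dialogflow; the output shape loosely mirrors the entity JSON shown on the Dialogflow slide:

```javascript
// Minimal entity extractor in the "regex + word lists" spirit:
// spot known products, and an optional amount just before them.
// PRODUCTS and NUMBER_WORDS are invented for illustration.
const PRODUCTS = ['pils', 'pizza', 'vin', 'biff', 'champagne'];
const NUMBER_WORDS = { to: '2', tre: '3', fire: '4', fem: '5' };

function extractEntities(utterance) {
  const words = utterance.toLowerCase().split(/\s+/);
  const entities = [];
  words.forEach((word, i) => {
    if (!PRODUCTS.includes(word)) return;
    const prev = words[i - 1];
    // A number word or a digit string right before the product
    // becomes its amount; otherwise the product stands alone.
    const amount =
      NUMBER_WORDS[prev] || (/^\d+$/.test(prev) ? prev : null);
    entities.push(amount ? { product: word, amount } : { product: word });
  });
  return entities;
}
```

Real NER handles inflection, synonyms and word order; this sketch only shows why word lists get you surprisingly far for a narrow ordering domain.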
15.
DIALOGFLOW
console.dialogflow.com
Fire pils og en pizza. Ei flaske vin i ny og ne'. Lite biff og dyr champagne. Men ka gjør no' det?
("Four beers and a pizza. A bottle of wine every now and then. A little steak and expensive champagne. But what does it matter?")
the_entity
[{"product":"pils","amount":"4"},{"product":"pizza"},{"product":"biff"}]
17.
Contexts
Provide short-term memory to the agent: ensure that prereqs are done, or keep track of what has been talked about already.
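Contexts can be sketched as a tiny store where each entry carries a lifespan measured in conversational turns, roughly how Dialogflow's context lifespans behave. The class and method names here are ours, not Dialogflow's:

```javascript
// Sketch of contexts as short-term memory. Each active context has a
// lifespan in user turns and expires when it reaches zero — roughly
// the behaviour of Dialogflow context lifespans. Names are invented.
class ContextStore {
  constructor() {
    this.active = new Map();
  }
  set(name, lifespan = 5) {
    this.active.set(name, lifespan);
  }
  has(name) {
    return this.active.has(name);
  }
  // Call once per user turn: every context ages, expired ones drop out.
  tick() {
    for (const [name, life] of this.active) {
      if (life <= 1) this.active.delete(name);
      else this.active.set(name, life - 1);
    }
  }
}
```

An intent that needs a prerequisite (say, a hypothetical `placeOrder`) would simply check `ctx.has('has_address')` before being allowed to trigger.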
18.
Back-end aka fulfilment
TLDR: you can use whatever you want. The Node.js lib works OK.
• A simple webhook
• Computations are done back-end
• Log-in (account linking)
• Con: splits the sentences the bot says between Dialogflow and the back-end system
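The "simple webhook" can be sketched as a single function: Dialogflow POSTs a JSON body containing the matched intent and its parameters, and expects a `fulfillmentText` back. The handler below is framework-free and the intent name and price list are invented; a real deployment would sit behind an HTTP server with auth and error handling:

```javascript
// Minimal shape of a fulfilment webhook handler. The request body
// layout (queryResult.intent.displayName, queryResult.parameters)
// follows Dialogflow's webhook format; the 'order.price' intent and
// PRICES table are invented for this sketch.
const PRICES = { pils: 89, pizza: 179 };

function handleWebhook(body) {
  const intent = body.queryResult.intent.displayName;
  const params = body.queryResult.parameters;
  if (intent === 'order.price') {
    const price = PRICES[params.product];
    return {
      fulfillmentText: price
        ? `En ${params.product} koster ${price} kroner.`
        : `Beklager, vi har ikke ${params.product}.`,
    };
  }
  // Unknown intents get a generic reply rather than an error.
  return { fulfillmentText: 'Det kan jeg ikke hjelpe med ennå.' };
}
```

This also makes the slide's "con" concrete: the Norwegian sentences the bot speaks now live in back-end code, outside the Dialogflow console.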
25.
For what tasks are conversational UIs the perfect match?
Sales?
Support?
Self-service?
Any dictation/stenographer situation?
Knowledge (e.g. tourist info, wiki lookups, etc.)
Entertainment (the guide through history at Folkemuseet?)
User guides (how do I assemble the Peter Opsvik chair "Tripp Trapp"?)
26.
In what situations is a computer/tablet a hindrance or distraction?
Alarm! I'm stuck.
Language barriers
Blind / visual impairment
Vehicles (car / boat / bike / etc.)
My hands are occupied / I have tools in my hands
I'm actively moving around/under/over/through my {work}space
27. Just like in any other language (no Norwegian language support out of the box in Dialogflow).
The critical parts (STT & TTS, intent classification & entity extraction) are better now.
The art of conversation is still hard.
The tech is here; where to apply it for max effect is our problem to figure out.
Editor's Notes
Hello and good evening.
My name is Eirik Stavelin and I'm here with my colleague Dmitriy Semashkov. We work as data scientists at Making Waves and are here to share some experiences in creating information systems with conversational UIs, using technology from Google.
I was given this title, how to build a chatbot for the Norwegian market. I'll get back to that at the end.
Our title is “How to build conversational interfaces for the Norwegian market”.
There are plenty of guides out there on how to build all the parts these systems consist of. You typically require fewer of them as features are consolidated into bot frameworks. As these are still somewhat new and changing technologies, documentation is quickly outdated and lacking, but this gets better as stable versions are rolled out.
We roughly follow the design guidelines composed by Google.
So this is not what we are going to talk about…
…what we are going to talk about is our experiences, as developing and designing data scientists, in creating conversational UIs with the Google technologies just presented (hopefully).
We could take a data-science perspective on this, or a design perspective, or a programming perspective. Or a commercial one. There are many, and time is short, so what we'll do is take our normal tech-y data-science view and zoom out a little, talk about the pieces from a certain distance, and quickly get to the point where we hope you have a rough idea and can ask us and each other about the best ways forward.
Google's "conversational UI" is a black box; not all details are public, but the bird's-eye view of these systems is known. They are also more or less the same as other such systems:
text to speech & speech to text
Intents
Entities
Short term memory
Back end processing & longer term memory
About 15 years ago I sat and read to my computer. In broken English. Alone under the stairs. The new version of MS speech recognition was out, and from then on I'd never have to write another English paper again. Ever. I'd just dictate the content, and the machine would deliver perfectly correct text. I'd ace my English grades from that point on.
That didn't work. Many of you probably also tested this tech in the late 90s and early 2000s. It did not work. It did not transcribe well and the speech synthesis was awful. Every time between then and now that a new speech system came out, I'd ignore it as fast as I could. That stuff does not work.
But now though, it sort of does work. Siri kind-of works. Alexa kind-of works. And the google assistant kind-of works. Perhaps this time around, voice as UI finally works well enough to actually be useful.
say -v nora "hei mesh, kan dere høre entusiasmen i stemmen min?" ("hi Mesh, can you hear the enthusiasm in my voice?")
say -v Alex "I'm sorry Dave, I'm afraid I can't do that" || in Norwegian: "Jeg beklager Dave, jeg er redd jeg ikke kan gjøre det."
https://deepmind.com/blog/wavenet-generative-model-raw-audio/ (how the latest wave of speech synthesis is made at Google)
Goals: what the user actually needs and wants to accomplish; these should be navigable through one or more intents.
Identifying which intent a user input expresses is a classification problem with a threshold and a default fallback if no good candidate is found. What is new-ish for me here is that, whatever algorithm Google uses, they've gotten really good at this problem even with very few training examples.
Intents are much like simple functions in programming: they can be with or without parameters and trigger some action. In Dialogflow, intents that require parameters will prompt the user with follow-up questions in order to secure all needed inputs.
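That slot-filling behaviour can be sketched by hand: an intent declares its required parameters, and the agent prompts for the first one still missing before the action fires. The intent name, parameters and prompts below are invented for illustration, not Dialogflow's API:

```javascript
// Hand-rolled slot filling: prompt for the first missing required
// parameter, fulfil once all are collected. All names are invented.
const orderIceCream = {
  required: ['flavour', 'container'],
  prompts: {
    flavour: 'Hvilken smak vil du ha?',
    container: 'I kjeks eller beger?',
  },
};

function nextStep(intent, collected) {
  const missing = intent.required.find((p) => !(p in collected));
  return missing
    ? { prompt: intent.prompts[missing] } // keep asking
    : { fulfil: collected };              // all slots filled: act
}
```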
A simple Q&A that just returns the text of a FAQ needs no back-end; it can just return the answer text as voice output. More complicated stuff needs a fulfilment engine through a webhook. This lets you connect your existing systems to the voice UI.
Mapping from input natural-language text to intent and parameters is the NLU part.
Here we have some training data, annotated with entities. I've made an ice-cream ordering system, where the intent is order_ice_cream, and we have a few entities: the flavour of the ice cream (in yellow), the container (in red) and the topping (in pink). We can only presume this is also used in the classification of the intent, but it is most visible here as training examples of what and where entities occur in sentences assumed to be used to fulfil the intent.
If several intents need to be fulfilled in order to trigger a later one, contexts can be set to make it possible to trigger new intents. These have an expiry: a number of back-and-forths between man and machine. Let's say you need both an item to purchase and an address to deliver to in order to place an order. The intents getAddress and findProduct need to be fulfilled and contexts set before the placeOrder intent can be triggered and the end goal of playing with that nice thing can be realized.
We can also set variables that carry through the conversation, in order to track how many times we misunderstand each other, remember permissions given, etc.
This is one way this generation of conversational interfaces attempts to keep up the illusion of smartness as a conversation partner.
Both the beginning and the end of a conversational interface are voice, but both ends are handled through text. What words should the assistant use? What level of speed, detail, accuracy etc. should it use?
This was a good exercise for us, as it creates a space where PR people, content people, tech people, admin people etc. all had to come together and create a persona. It is also a good opportunity to dust off those core values that were composed back in the day. These core values and this world-view function as a yardstick for evaluating the quality of speech. Our persona is a grown woman with lots of experience and a bias towards healthy food. With that we can qualitatively but easily evaluate whether wordings and tone of voice feel right. We found good value in using this persona as a common creation that includes different people, from Making Waves and our clients.
(Our designer Heidi Lisle leads this work.)
For most of us, conversational or auditory interfaces are new or uncomfortable, as we remember how clumsy and painful the journey has been, or we just never bothered to go that route. We talk about it as new because we are entering a new generation of techniques in the TTS and STT areas. But the blind and visually impaired have endured the previous generations of this tech. I believe we can learn a lot from this group about how we design our systems, and where to apply these kinds of technologies.
The Norwegian Association of the Blind / Blindeforbundet.
I have also slowly learned that people who can see and can read prefer not to, and will gladly take a 50/50 guess on a two-button confirm/deny dialog box on their computer or mobile, even if the result is that all the contacts in their address book are deleted if they choose wrong. Info in sound might remedy some of these situations.
And then there are those who cannot read the language or have no experience with computers. It should be possible for a computer-illiterate Chinese person to purchase train tickets from the ticket kiosk at Oslo S without any clicking except entering the PIN for the payment. There are probably a myriad of such problems where the machine creates friction, where voice can let users with other things to worry about than pushing buttons interact with machines hands-free.
Some of these problems are obvious and have financial incentives pushing them to be solved fast. Others might be of a social, cultural or humanitarian character and take longer to find and fix. Whatever those problems turn out to be, building a conversational user interface in Norwegian is no longer the hard part.