Hi everyone, I’m Martin from Netcetera, and I will be talking about some interesting projects we have built around conversational interfaces and machine learning.
First, we will see what conversational interfaces are and why they are so popular today. Then I will show you a few Siri use cases that we have developed. Next, we will explore api.ai and how we are using it for natural language understanding. At the end, I will show you another app that recognises characters from an image.
People and computers speak different languages – the former use words and sentences, while the latter are more into ones and zeros. As we know, this gap in communication is filled by a mediator which knows how to translate the information flowing between the two parties. These mediators are called graphical user interfaces (GUIs). The first graphical interface was the desktop with the mouse cursor; then came multi-touch on smartphones. These are great and still present today. However, both require a learning curve and can be challenging for users. Voice is different:
• Simple and natural way to give commands
• No learning curve for users
• Same on all platforms and devices
• All big tech companies have products in this area (Siri, OK Google, Cortana, Alexa, etc.)
• Conversational interfaces are an important part of Netcetera’s innovation platform
Transport is a big part of Netcetera’s area of expertise. Now I will show you a video of our first proof-of-concept app with Siri, which is about booking a ride.
So how do these things work? SiriKit is a fairly new framework which enables access to Siri from third-party apps. Here the heavy lifting is done by Apple, who have their own implementation of natural language understanding algorithms. If the spoken phrase matches the app’s domain, the developers simply receive the relevant values in a callback. However, this also comes with some restrictions. You can only get certain information from Siri, limited to a set of pre-defined domains; for example, as we saw, booking a ride is one of them, as is paying with Siri. Also, the UI is not that customisable.
To overcome the limitations of Siri, we have also tried a different approach. First, I’m going to show you another video. This is a Grocery List application, where you can add items via a voice interface and later share the list with other people. After you are done shopping, you can pay for the list with Masterpass.
Here’s an overview of the steps needed to present the extracted entities to the user. First, the phone is in a recording state and the user gives a command, like “I need milk and bananas”. Next, we use the native speech recogniser from the iOS SDK to get a transcribed version of the user’s spoken phrase. We then send this string to api.ai, which is a natural language understanding platform that we will discuss later. Api.ai gives us back the user’s intent and some parameters, which we can save either on a backend or locally in the app, and then present to the user.
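The flow above can be sketched as a small pipeline. This is only an illustration, not the app's actual code: the function names are hypothetical, and the speech recogniser and the api.ai call are stubbed with the kind of values they would produce.

```python
# Sketch of the voice-to-list pipeline described above.
# transcribe() stands in for the iOS speech recogniser and
# understand() for the api.ai query; both are stubbed here.

def transcribe(audio: bytes) -> str:
    # In the real app this is the native iOS speech recogniser.
    return "I need milk and bananas"

def understand(phrase: str) -> dict:
    # In the real app this is a request to api.ai; here we return
    # the kind of result it would give back: an intent plus parameters.
    return {"intent": "add_items",
            "parameters": {"items": ["milk", "bananas"]}}

def handle_command(audio: bytes, grocery_list: list) -> list:
    phrase = transcribe(audio)
    result = understand(phrase)
    if result["intent"] == "add_items":
        # Persist the extracted entities (here: just an in-memory list).
        grocery_list.extend(result["parameters"]["items"])
    return grocery_list

print(handle_command(b"", []))  # ['milk', 'bananas']
```

The key design point is that the app itself never parses language; it only maps an already-structured intent to an action.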
Api.ai is Google’s NLU platform. The cool thing about it is that developers can train the model with their own sample sentences and automatically apply those changes to all users. The platform is accessible to developers as a REST service. It supports creating agents in different languages, as well as intents, entities, webhooks and contexts.
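As a hedged sketch of what the REST interaction looks like: the snippet below builds (but does not send) a query against api.ai's v1 endpoint. The token and session id are placeholders, and the exact endpoint, version parameter and field names should be checked against the current api.ai documentation.

```python
import json
import urllib.request

# Builds a POST request for api.ai's v1 query endpoint.
# "TOKEN" and the session id are placeholders -- real values
# come from the agent's settings in the api.ai console.
def build_query_request(phrase, token, session_id):
    body = json.dumps({
        "query": phrase,         # the transcribed user phrase
        "lang": "en",            # language of the agent
        "sessionId": session_id, # groups phrases into one conversation
    }).encode("utf-8")
    return urllib.request.Request(
        "https://api.api.ai/v1/query?v=20150910",
        data=body,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_query_request("I need milk and bananas", "TOKEN", "session-1")
print(req.full_url)
```

The JSON response then contains the resolved intent (the matched action) and its parameters, which is exactly what the app persists and shows to the user.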
These terms are really important to NLP engineers. Intent is, as the name implies, what the user wants to do. Entities are specific values or parameters for that action, such as a location, date or product type. Session represents the whole conversation, from start to end. Context is knowledge of the previous state of the conversation.
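A toy example makes the role of context concrete: within one session, parameters from an earlier turn let a follow-up phrase be understood even though the phrase itself is incomplete. The intents and entities here are made up for illustration, not taken from a real agent.

```python
# Toy illustration of intent / entities / context within one session.
# Context carries parameters from earlier turns so that a follow-up
# phrase ("make it for tomorrow") can be resolved.

session_context = {}

def interpret(phrase):
    if "ride to " in phrase:
        destination = phrase.split("ride to ")[1]
        session_context["destination"] = destination  # remember for later turns
        return {"intent": "book_ride",
                "entities": {"destination": destination}}
    if phrase == "make it for tomorrow":
        # The destination is not in this phrase -- it comes from context.
        return {"intent": "book_ride",
                "entities": {"destination": session_context["destination"],
                             "date": "tomorrow"}}
    return {"intent": "unknown", "entities": {}}

print(interpret("book a ride to the airport"))
print(interpret("make it for tomorrow"))
```

Without the stored context, the second phrase on its own would be meaningless to the agent.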
Our next showcase is an app that recognises characters from an image. This is handy for our development process at Netcetera. Usually we have meetings where we discuss architecture and write things on the whiteboard. Later, when we want to have this data in a digital format, we have to enter it all again. With this app, we can just take a picture of the board and it will extract the needed text.
So how did we do this? As you may have guessed, this requires first detecting an object and then recognising what that object is. To do this, we are using two new iOS frameworks: Vision and Core ML. Vision does the real-time object detection, while Core ML enables easy integration of already trained machine learning models. We used convolutional neural networks, which achieved 86-90% accuracy. Since the training is done with separate characters, the image is first cropped into separate parts for each letter, and each crop is padded, resized and converted to grayscale.
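The preprocessing steps can be sketched on a tiny bitmap instead of a real photo: pad a cropped character to a square, then resize it with nearest-neighbour sampling to the fixed input size the network expects. The 28x28 target size is an assumption for illustration, and grayscale conversion is omitted since the toy bitmap is already single-channel.

```python
# Sketch of the per-character preprocessing: pad the crop to a
# square, then resize to a fixed size via nearest-neighbour
# sampling. Images are plain 2D lists of pixel values here.

def pad_to_square(img):
    h, w = len(img), len(img[0])
    size = max(h, w)
    out = [[0] * size for _ in range(size)]     # zero padding
    top, left = (size - h) // 2, (size - w) // 2  # centre the crop
    for y in range(h):
        for x in range(w):
            out[top + y][left + x] = img[y][x]
    return out

def resize(img, size):
    n = len(img)  # img is square after padding
    return [[img[y * n // size][x * n // size] for x in range(size)]
            for y in range(size)]

char = [[1, 1], [1, 0], [1, 0]]   # a tall, narrow crop (3x2)
square = pad_to_square(char)      # now 3x3
small = resize(square, 28)        # fixed network input size (assumed 28x28)
print(len(small), len(small[0]))  # 28 28
```

Padding before resizing matters: scaling a narrow crop directly to a square input would distort the letter's aspect ratio and hurt recognition.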
Conversational Interfaces and Machine Learning
Martin Mitrevski, ICT Innovations, Skopje, 2017
• Conversational Interfaces
• Siri use cases
• Natural language understanding with api.ai
• Text recognizer
• Apple’s framework for accessing Siri from third-party apps
• Apple does the natural language understanding
• Restricted to pre-defined domains
• Not that customisable
Paying Grocery List
Intents and entities
Persist extracted data
• Google’s NLU platform
• Accessible as a REST Service (JSON)
• Training is done on the web app by the developers
• Supporting agents in different languages, intents,
entities, webhooks, contexts
• Intent - mapping a phrase to a specific action
• Entities - parameters of the action (e.g. location,
date, product, etc.)
• Session - the whole conversation
• Context - intermediate states and parameters
from previous expressions
• Vision - iOS framework for real time object detection
• Core ML - iOS framework for integrating trained
machine learning models
• Convolutional Neural Networks (86-90% accuracy)
• Cropping the image into separate parts for each letter
• Padding insertion, resizing, grayscale conversion