NLU / Intent Detection Benchmark
by Intento
August 2017
About
• At Intento, we want to make Machine Intelligence
services easy to discover, choose and use.
• So far, evaluation is the most problematic part: to compare providers, one needs to sign a lot of contracts and integrate a lot of APIs.
• We deliver that for FREE on public datasets. To evaluate on your own dataset, contact our sales.
• Also, check out our Machine Translation Benchmark. Machine Translation is an easy way to build a multilingual bot.
Overview
• Natural Language Understanding Services with Public
APIs*
• IBM Watson Conversation
• API.ai API
• Microsoft LUIS
• Amazon Lex
• Recast.ai API
• wit.ai API
• SNIPS API
• Benchmark Dimensions
• Intent Prediction Performance
• False Positives
• Learning Speed (performance on small datasets)
• Language Coverage, Price, Response time
* As of today, some of these services don't have a public API; we were granted access for the purposes of this benchmark.
NLU Engines Compared
API.ai / Google
• Website: https://api.ai
• Launched: 2010 (acquired by Google in 2016)
• Pricing model: FREE
• Interface: HTTP REST API
“Build delightful and natural
conversational experiences”
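For reference, a minimal sketch of an intent query against the 2017-era API.ai v1 REST API; the access token and session id are placeholders:

```python
# Sketch of an intent query against the 2017-era API.ai v1 REST API.
# YOUR_CLIENT_ACCESS_TOKEN and the session id are placeholders.
import requests

resp = requests.get(
    "https://api.api.ai/v1/query",
    headers={"Authorization": "Bearer YOUR_CLIENT_ACCESS_TOKEN"},
    params={
        "v": "20150910",  # protocol version
        "query": "Is it windy in Boston, MA right now?",
        "lang": "en",
        "sessionId": "benchmark-session-1",
    },
)
result = resp.json()["result"]
print(result["metadata"].get("intentName"), result.get("score"))
```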
wit.ai / Facebook
• Website: https://wit.ai
• Launched: 2013 (acquired by Facebook in 2015)
• Pricing model: FREE
• Interface: HTTP REST API
“Natural Language for Developers”
IBM Watson Conversation
• Website: https://www.ibm.com/watson/services/conversation/
• Launched: 2016
• Pricing model: Pay As You Go
• Interface: HTTP REST API
“Quickly build and deploy chatbots and
virtual agents across a variety of
channels, including mobile devices,
messaging platforms, and even robots.”
Microsoft LUIS
• Website: https://www.luis.ai/
• Launched: 2015
• Pricing model: Pay As You Go
• Interface: HTTP REST API
“Language Understanding
Intelligent Service. Add
conversational intelligence to your
apps.”
Amazon Lex
• Website: https://aws.amazon.com/lex/
• Launched: 2016
• Pricing model: Pay As You Go
• Interface: HTTP REST API
• Important Restrictions apply (see Dataset slide)
“Conversational interfaces for your
applications. Powered by the same deep
learning technologies as Alexa”
Recast.ai
• Website: https://recast.ai
• Launched: 2016
• Pricing model: Free tier + Contact sales
• Interface: HTTP REST API
“The collaborative platform to
build, train, deploy and monitor
intelligent bots for developers”
SNIPS
• Website: https://snips.ai
• Launched: 2017
• Pricing model: Free to test + Pay per Device
• Interface: On Device*
“Snips is an AI-powered voice
assistant you can add to your
products. It runs on-device and is
Private by Design”
* we’ve tested a hosted version with a private API provided by SNIPS
Historical timeline
[timeline chart, 2010–2017: launch and acquisition dates for API.ai (Google), wit.ai (Facebook), Microsoft LUIS, and IBM Watson Conversation]
The Approach
Dataset
• The original dataset is the SNIPS.ai 2017 NLU Benchmark
• English language only
• We removed duplicates that differ only in whitespace, quotes, letter case, etc.
• Resulting dataset parameters*: 7 intents (next slide), 15.6K samples (~2K per intent), 340K characters
• We also used ~450 samples from the SNIPS.ai 2016 NLU Benchmark to test for False Positives
* For Amazon Lex, the training set is capped at 200K characters by an API limitation; per-utterance character limits also apply, resulting in 4,500 utterances (~640 per intent).
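A minimal sketch of the duplicate removal described above; the exact normalization rules (lowercase, strip quotes, collapse whitespace) are our reading of "differ only in whitespace, quotes, letter case":

```python
# Near-duplicate removal: lowercase, strip quotes, collapse whitespace,
# and keep one sample per normalized form.
import re

def normalize(utterance: str) -> str:
    u = utterance.lower().replace('"', "").replace("'", "")
    return re.sub(r"\s+", " ", u).strip()

def dedupe(samples):
    seen, unique = set(), []
    for s in samples:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```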
Intents [*]
• SearchCreativeWork (e.g. Find me the I, Robot television show)
• GetWeather (e.g. Is it windy in Boston, MA right now?)
• BookRestaurant (e.g. I want to book a highly rated restaurant for
me and my boyfriend tomorrow night)
• PlayMusic (e.g. Play the last track from Beyoncé off Spotify)
• AddToPlaylist (e.g. Add Diamonds to my roadtrip playlist)
• RateBook (e.g. Give 6 stars to Of Mice and Men)
• SearchScreeningEvent (e.g. Check the showtimes for Wonder
Woman in Paris)
[*] quoted from https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines
I. Prediction Performance
Experimental setting
• Inspired by the SNIPS Benchmarks (2016, 2017)
• We benchmark intent detection only, no parameter
extraction yet
• For each provider, one model is trained to detect all
intents
• We run all models with the default confidence score
thresholds.
• 3-fold 80/20* Stratified Monte Carlo cross-validation
* 47/20 for Amazon Lex
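A sketch of the evaluation loop under these settings, using scikit-learn's StratifiedShuffleSplit for the stratified Monte Carlo (repeated random 80/20) splits; train_provider_model and predict are hypothetical wrappers around each vendor's API:

```python
# 3-fold stratified Monte Carlo cross-validation (random 80/20 splits).
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score

def evaluate(utterances, intents, train_provider_model, predict):
    """utterances, intents: NumPy arrays; the last two args are
    placeholder wrappers around a vendor's train/query API."""
    splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
    fold_scores = []
    for train_idx, test_idx in splitter.split(utterances, intents):
        model = train_provider_model(utterances[train_idx], intents[train_idx])
        preds = np.array([predict(model, u) for u in utterances[test_idx]])
        fold_scores.append(f1_score(intents[test_idx], preds, average=None))
    return np.mean(fold_scores, axis=0)  # mean per-intent F1 across folds
```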
Scores Normalization (I)
Intents: the good, the bad and the ugly
(to compare providers we need to remove intent-related bias)
[charts: per-intent F1 Scores* and confusion matrix]
* F1 Score is the harmonic mean of precision and recall, F1 = 2·P·R / (P + R); it reaches its best value at 1 and worst at 0.
August 2017© Intento, Inc.
Scores Normalization (II)
• Standardizing the F1, P, and R scores for each intent (SD-normalization)
• Then adjusting scales by multiplying by the global std and adding the global mean (next slide; see the sketch below)
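A compact sketch of this normalization, assuming a providers × intents matrix of F1 scores:

```python
# Per-intent SD-normalization of scores, then rescaling to the original
# scale. Assumed layout: rows = providers, columns = intents.
import numpy as np

def normalize_scores(f1: np.ndarray) -> np.ndarray:
    z = (f1 - f1.mean(axis=0)) / f1.std(axis=0)  # standardize each intent
    return z * f1.std() + f1.mean()              # rescale with global mean/std
```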
Detection Performance*
• Black bars indicate confidence intervals
• Amazon Lex is trained on a smaller dataset due to its API limits
* Mean standardized F1 scores, adjusted to the initial scale using the global mean and std
Average Precision*
• Black bars indicate confidence intervals
• Amazon Lex is trained on a smaller dataset due to its API limits
* Mean standardized Precision, adjusted to the initial scale using the global mean and std
Average Recall*
• Black bars indicate confidence intervals
• Amazon Lex is trained on a smaller dataset due to its API limits
* Mean standardized Recall, adjusted to the initial scale using the global mean and std
Precision vs. Recall
Discussion
• Training models is cumbersome for most of the services:
• manual work (adding models in the web interface) and
• solving issues with tech support.
• Based on the confidence intervals, the providers fall into several groups:
• Top-runners: IBM Watson, API.ai, Microsoft LUIS
• Amazon Lex: API limits don't allow training a multi-intent model on enough samples
• Choosing the provider:
• For “good” intents (like GetWeather), all providers are good enough. For “bad” intents, having a “good” provider is crucial
• Within each tier, the leader depends on the intent data
II. False Positives
Approach
• How does an NLU service behave when a user expresses an intent it wasn't trained for?
• Expected behavior: produce the Fallback Intent (no trained intent passes the detection threshold)
• 1,411 utterances from domains other than Music, Movie, Weather, RestaurantReservation, Entertainment, and TV
• Compare the % of (false) positive detections for each NLU provider
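The metric itself is a one-liner; a sketch, with detect_intent as a hypothetical provider wrapper that returns None when the Fallback intent fires:

```python
# Share of out-of-domain utterances that are (falsely) mapped to a trained
# intent instead of the Fallback intent.
def false_positive_rate(out_of_domain_utterances, detect_intent):
    hits = sum(1 for u in out_of_domain_utterances
               if detect_intent(u) is not None)
    return hits / len(out_of_domain_utterances)
```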
Out-of-Domain Samples
Discussion
• Only Snips.ai and API.ai are somewhat good at detecting that a user asks for something the agent is not trained for
• IBM Watson and Microsoft LUIS try to map any user request to one of the intents from the training set
Perhaps the Fallback intent should be manually added and trained on junk utterances?
III. Learning Curve
Experimental Setting
• Similar to the Prediction Performance setting (slide 18)
• 20% of the dataset reserved for testing (stratified)
• From the remaining 80%, for each intent we randomly built a series of training sets of the following cardinalities: 10, 25, 50, 100, 200, 500, 1000* (see the sketch below)
• No cross-validation
• Analyzed F1 Scores, normalized as described on slides 19-20
* for Amazon Lex: 10, 25, 50, 100, 200
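A sketch of how such a series of training sets can be built; whether the sets were nested is not stated in the deck, so this version assumes nested prefixes of a single shuffle per intent:

```python
# Build training subsets of increasing size from the 80% training pool,
# one series per intent. Sizes follow the slide; capped at 200 for Amazon Lex.
import random

SIZES = [10, 25, 50, 100, 200, 500, 1000]

def training_series(pool_by_intent, sizes=SIZES, seed=0):
    rng = random.Random(seed)
    series = {}
    for intent, samples in pool_by_intent.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        # nested prefixes: each larger set contains the smaller ones
        series[intent] = {n: shuffled[:n] for n in sizes if n <= len(shuffled)}
    return series
```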
Learning curve by provider
Vertical bars denote confidence intervals
Discussion
• On <100 samples, IBM Watson is superior*. API.ai catches up at >100 samples.
• Detecting the user's intent is crucial for the subsequent slot extraction and response generation.
• The learning curve is quite steep: good performance requires hundreds of utterances to train on.
• Most of the pre-built intents (Microsoft LUIS, API.ai, etc.) are built on 10-50 utterances.
* SNIPS advertises a special enterprise feature that generates additional samples for smaller datasets, but it is not available by default
IV. Language coverage
Supported languages
• Merged all dialects (e.g. en-GB and en-US)
• Note: we've tested the performance only for English
Language popularity
[bar chart: number of supporting providers per language. English (all 7 providers), followed by Korean, German, Spanish, French, Italian, Portuguese, Japanese, Chinese, Dutch, Arabic, Russian; then Norwegian, Polish, Hindi, Finnish, Danish, Czech, Swedish, Catalan, Ukrainian, +29 more]
Discussion
Potentially, Machine Translation may be used to increase language coverage and/or performance:
• either by translating both the training and testing utterances to English,
• or by translating only the testing utterances and using the English-trained model.
That's something to check in future benchmarks.
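As a rough illustration of the second option, a hypothetical translate-then-classify pipeline; translate and detect_intent_en are placeholders, not real APIs:

```python
def detect_intent_any_language(utterance, source_lang, translate, detect_intent_en):
    # translate(): placeholder MT call; detect_intent_en(): placeholder call
    # to an intent model trained on English data only.
    english = translate(utterance, source=source_lang, target="en")
    return detect_intent_en(english)
```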
V. Other observations
Average response time*
[chart: average response time by provider]
* Snips assumes on-device deployment; we report the response time of a hosted test bench
Price per 1K requests*
[chart: price per 1K prediction requests by provider; API.ai and wit.ai are free, Recast.ai is “contact sales”]
* prediction requests; SNIPS has per-device pricing and is not shown on this chart
Conclusions
Performance (F1) vs. Price*
[scatter chart: Performance (y) vs. Affordability = 1/Price (x); providers: wit.ai, amazon.lex, microsoft.luis, ibm.watson, api.ai (FREE)]
* Recast and SNIPS are not shown as they don't provide public pricing
Performance (F1) vs. Latency*
[scatter chart: Performance (y) vs. Speed = 1/Latency (x); providers: snips, recast, wit.ai, amazon.lex, microsoft.luis, ibm.watson, api.ai]
* Snips assumes on-device deployment; we report the response time of a hosted test bench
Conclusions
1. API.ai, Microsoft LUIS, and IBM Watson have the overall best intent detection performance, speed, and language coverage.
• Within this group, API.ai is superior on price (free), Microsoft LUIS on speed (almost 50% faster responses), and IBM Watson on performance (especially on smaller datasets). Of the three, only API.ai detects (~40% of) out-of-domain requests and produces the Fallback intent.
2. For extreme language coverage, go with wit.ai.
• It would be interesting to see whether Machine Translation can be applied at the training or testing stage to increase language coverage for the top-3 providers.
3. Performance varies a lot across intents and dataset sizes. We recommend evaluating several providers on your own data before making a choice.
Intento Service Platform
• Discover the best service providers for your AI task
• Evaluate performance on your own data
• Access any provider with no effort using our Single API
Intento
https://inten.to
Dmitry Labazkin,
Grigory Sapunov,
Konstantin Savenkov
Intento, Inc.

<hello@inten.to>
