
NLU / Intent Detection Benchmark by Intento, August 2017

We have evaluated intent prediction performance, false positives, learning rate, language coverage, response time and pricing for 7 NLU providers: Amazon Lex, Facebook's Wit.ai, IBM Watson Conversation, Google's API.ai, Microsoft LUIS, Recast.AI and SNIPS.



  1. NLU / Intent Detection Benchmark by Intento, August 2017
  2. About • At Intento, we want to make Machine Intelligence services easy to discover, choose and use. • So far, evaluation is the most problematic part: to compare providers, one needs to sign a lot of contracts and integrate a lot of APIs. • We deliver that for FREE on public datasets. To evaluate on your own dataset, contact our sales. • Also, check out our Machine Translation Benchmark. Machine Translation is an easy way to build a multilingual bot.
  3. Overview • Natural Language Understanding Services with Public APIs* • API.ai • Wit.ai • IBM Watson Conversation • Microsoft LUIS • Amazon Lex • Recast.AI • SNIPS • Benchmark Dimensions • Intent Prediction Performance • False Positives • Learning Speed (performance on small datasets) • Language Coverage, Price, Response time * as of today, some of them don't have a Public API and we've got access for the purpose of this benchmark. August 2017 © Intento, Inc.
  4. NLU Engines Compared
  5. API.ai / Google • Website: • Launched: 2010 (acquired by Google in 2016) • Pricing model: FREE • Interface: HTTP REST API • "Build delightful and natural conversational experiences"
  6. Wit.ai / Facebook • Website: • Launched: 2013 (acquired by Facebook in 2015) • Pricing model: FREE • Interface: HTTP REST API • "Natural Language for Developers"
  7. IBM Watson Conversation • Website: conversation/ • Launched: 2016 • Pricing model: Pay As You Go • Interface: HTTP REST API • "Quickly build and deploy chatbots and virtual agents across a variety of channels, including mobile devices, messaging platforms, and even robots."
  8. Microsoft LUIS • Website: • Launched: 2015 • Pricing model: Pay As You Go • Interface: HTTP REST API • "Language Understanding Intelligent Service. Add conversational intelligence to your apps."
  9. Amazon Lex • Website: • Launched: 2016 • Pricing model: Pay As You Go • Interface: HTTP REST API • Important Restrictions apply (see Dataset slide) • "Conversational interfaces for your applications. Powered by the same deep learning technologies as Alexa"
  10. Recast.AI • Website: • Launched: 2016 • Pricing model: Free tier + Contact sales • Interface: HTTP REST API • "The collaborative platform to build, train, deploy and monitor intelligent bots for developers"
  11. SNIPS • Website: • Launched: 2017 • Pricing model: Free to test + Pay per Device • Interface: On Device* • "Snips is an AI-powered voice assistant you can add to your products. It runs on-device and is Private by Design" * we've tested a hosted version with a private API provided by SNIPS
  12. Historical timeline (2010-2017): API.ai (2010), Wit.ai (2013), Microsoft LUIS (2015), IBM Watson Conversation, Amazon Lex and Recast.AI (2016), SNIPS (2017)
  13. The Approach
  14. Dataset • The original dataset is the SNIPS 2017 NLU Benchmark • English language only • We have removed duplicates that differ only in whitespace, quotes, lettercase, etc. • Resulting dataset parameters*: 7 intents (next slide), 15.6K samples (~2K per intent), 340K symbols • We also used ~450 samples from the 2016 NLU Benchmark to test for False Positives * For Amazon Lex, the training set is capped at 200K symbols per API limitation; further symbol limitations apply, resulting in 4500 utterances (~640 per intent).
  15. Intents [*] • SearchCreativeWork (e.g. Find me the I, Robot television show) • GetWeather (e.g. Is it windy in Boston, MA right now?) • BookRestaurant (e.g. I want to book a highly rated restaurant for me and my boyfriend tomorrow night) • PlayMusic (e.g. Play the last track from Beyoncé off Spotify) • AddToPlaylist (e.g. Add Diamonds to my roadtrip playlist) • RateBook (e.g. Give 6 stars to Of Mice and Men) • SearchScreeningEvent (e.g. Check the showtimes for Wonder Woman in Paris) [*] quoted from
  16. I. Prediction Performance
  17. Experimental setting • Inspired by the SNIPS Benchmarks (2016, 2017) • We benchmark intent detection only, no parameter extraction yet • For each provider, one model is trained to detect all intents • We run all models with the default confidence score thresholds • 3-fold 80/20* Stratified Monte Carlo cross-validation * 47/20 for Amazon Lex
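The splitting procedure above can be sketched in a few lines of Python. Everything below (the function, the toy labels) is illustrative, not the benchmark's actual code; "stratified Monte Carlo" here means the test set is re-drawn at random for each of the 3 rounds, keeping the intent proportions of the full dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """One stratified Monte Carlo round: draw test_frac of each intent at random."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for i, intent in enumerate(labels):
        by_intent[intent].append(i)
    test = []
    for idxs in by_intent.values():
        test.extend(rng.sample(idxs, round(len(idxs) * test_frac)))
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)

# Toy stand-in labels (the real benchmark has 7 intents, ~2K samples each).
labels = ["GetWeather"] * 50 + ["PlayMusic"] * 50

# Three Monte Carlo rounds: the split is re-drawn independently each time.
folds = [stratified_split(labels, 0.2, seed=s) for s in range(3)]
```

Unlike k-fold cross-validation, the Monte Carlo test sets may overlap between rounds; each round is just an independent stratified 80/20 draw.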
  18. Scores Normalization (I) • Intents: the good, the bad and the ugly (to compare providers we need to remove intent-related bias) • F1 Scores* • Confusion matrix * F1 Score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
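For reference, per-intent F1 can be computed directly from the confusion matrix the slide refers to. The 2-intent matrix below is made up for illustration, not benchmark data:

```python
# Rows are true intents, columns are predicted intents.
confusion = [
    [90, 10],  # true GetWeather: 90 correct, 10 misrouted to PlayMusic
    [5, 95],   # true PlayMusic:   5 misrouted, 95 correct
]

def f1_for_intent(cm, k):
    """F1 for intent k, treating it as the positive class."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp  # predicted as k, but wrong
    fn = sum(cm[k]) - tp                 # truly k, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For the GetWeather row above this gives precision 90/95 and recall 90/100, i.e. F1 ≈ 0.923.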
  19. Scores Normalization (II) • Standardizing F1, P and R scores for each intent (SD-normalization) • Then adjusting scales by multiplying by the global std and adding the global mean (next slide)
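A minimal sketch of this two-step normalization, with made-up F1 scores (the benchmark applies the same procedure to precision and recall): standardize each intent's column of scores across providers, then map back to the original scale using the global mean and std.

```python
import statistics

# Hypothetical per-intent F1 scores for three providers (illustration only).
f1 = {
    "provider_a": {"GetWeather": 0.98, "RateBook": 0.70},
    "provider_b": {"GetWeather": 0.96, "RateBook": 0.60},
    "provider_c": {"GetWeather": 0.99, "RateBook": 0.80},
}
intents = ["GetWeather", "RateBook"]

scores = [f1[p][i] for p in f1 for i in intents]
g_mean, g_std = statistics.mean(scores), statistics.pstdev(scores)

normalized = {p: {} for p in f1}
for i in intents:
    col = [f1[p][i] for p in f1]
    m, s = statistics.mean(col), statistics.pstdev(col)
    for p in f1:
        # Standardize within the intent (so "easy" intents like GetWeather
        # no longer dominate), then rescale to a familiar 0..1-ish range.
        normalized[p][i] = (f1[p][i] - m) / s * g_std + g_mean

mean_score = {p: statistics.mean(normalized[p].values()) for p in f1}
```

After this step, averaging over intents ranks providers by how far they sit above or below the field on each intent, rather than by raw scores that mostly reflect intent difficulty.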
  20. Detection Performance* • Black bars indicate confidence intervals • Amazon Lex is trained on a smaller dataset due to its API limits * Mean standardized F1 scores, adjusted to the initial scale using global mean and std
  21. Average Precision* • Black bars indicate confidence intervals • Amazon Lex is trained on a smaller dataset due to its API limits * Mean standardized Precision, adjusted to the initial scale using global mean and std
  22. Average Recall* • Black bars indicate confidence intervals • Amazon Lex is trained on a smaller dataset due to its API limits * Mean standardized Recall, adjusted to the initial scale using global mean and std
  23. Precision vs. Recall
  24. Discussion • Training models is cumbersome for most of the services: • manual work (adding models in the web interface) and • solving issues with tech support. • Based on the confidence intervals, the providers fall into several groups: • Top-runners: IBM Watson, API.ai, Microsoft LUIS • Amazon Lex: API limits don't allow training a multi-intent model on enough samples • Choosing the provider: • For "good" intents (like GetWeather), all providers are good enough. For "bad" intents, having a "good" provider is crucial • Within each tier, the leader depends on the intent data
  25. II. False Positives
  26. Approach • How does an NLU service behave when a user expresses an intent it wasn't trained for? • Expected behavior: produce the Fallback Intent (no trained intent passes the detection threshold) • 1411 utterances from domains other than Music, Movie, Weather, RestaurantReservation, Entertainment and TV • Compare the % of (false) positive detections for each NLU provider
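The false-positive measurement can be sketched as follows. The prediction records, field names and threshold value are hypothetical; each real provider returns its own response format and applies its own default threshold:

```python
# Each record: the provider's top intent (None if it fell back) and its
# confidence score, for one out-of-domain utterance. Values are made up.
ood_predictions = [
    {"intent": "GetWeather", "confidence": 0.91},  # false positive
    {"intent": None,         "confidence": 0.00},  # correct fallback
    {"intent": "PlayMusic",  "confidence": 0.40},  # below threshold: OK
]

THRESHOLD = 0.5  # assumed default confidence threshold

# A false positive: an out-of-domain utterance confidently mapped to a
# trained intent instead of triggering the Fallback Intent.
false_positives = sum(
    1 for p in ood_predictions
    if p["intent"] is not None and p["confidence"] >= THRESHOLD
)
fp_rate = false_positives / len(ood_predictions)
```

The per-provider comparison on the next slide is then just this rate computed over the full set of out-of-domain utterances.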
  27. Out-of-the-Domain Samples
  28. Discussion • Only a couple of providers are somewhat good at detecting that the user asks for something the agent is not trained for • IBM Watson and Microsoft LUIS try to map any user request to one of the intents from the training set • Perhaps the Fallback intent should be manually added and trained on junk utterances?
  29. III. Learning Curve
  30. Experimental Setting • Similar to the Prediction Performance setting (slide 18) • 20% of the dataset reserved for testing (stratified) • From the remaining 80%, for each intent we've randomly built training sets of the following cardinalities: 10, 25, 50, 100, 200, 500, 1000* • No cross-validation • Analyzed F1 Scores, normalized as described on slides 19-20 * for Amazon Lex: 10, 25, 50, 100, 200
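One way to build training sets of growing cardinality is sketched below. The slide doesn't say whether the benchmark's sets were nested, so the prefix trick here (each set containing the smaller ones) is an assumption; it just makes the resulting learning curve better behaved:

```python
import random

SIZES = [10, 25, 50, 100, 200, 500, 1000]  # cardinalities from the slide

def nested_subsets(pool, sizes, seed=0):
    """Shuffle the training pool once, then take prefixes of each size."""
    rng = random.Random(seed)
    shuffled = rng.sample(pool, len(pool))
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}

# Stand-in pool: ~80% of the ~2K samples available for one intent.
pool = list(range(1600))
subsets = nested_subsets(pool, SIZES)
```

A model is then trained once per size and scored against the fixed 20% test split, producing one point of the learning curve per cardinality.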
  31. Learning curve by provider • Vertical bars denote confidence intervals
  32. Discussion • On <100 samples IBM Watson is superior*; the other top providers catch up at >100 samples • Detecting the user's intent is crucial for the subsequent slot extraction and response generation • The learning curve is quite steep: good performance requires hundreds of training utterances • Most pre-built intents (Microsoft LUIS and others) are built on 10-50 utterances * SNIPS advertises a special enterprise feature to generate additional samples for smaller datasets, but it is not available by default
  33. IV. Language coverage
  34. Supported languages • Merged all dialects (e.g. en-uk and en-us) • Note: we've tested performance only for English
  35. Language popularity (number of supporting providers, descending): English, Korean, German, Spanish, French, Italian, Portuguese, Japanese, Chinese, Dutch, Arabic, Russian, Norwegian, Polish, Hindi, Finnish, Danish, Czech, Swedish, Catalan, Ukrainian, +29 more
  36. Discussion • Potentially, Machine Translation may be used to increase language coverage and/or performance: • either by translating both the training and testing utterances to English • or by translating only testing utterances and using the English training model • That's something to check in future benchmarks
  37. V. Other observations
  38. Average response time* * Snips assumes on-device deployment; we put here the response time for a hosted test bench
  39. Price per 1K requests* * prediction requests; SNIPS has per-device pricing and is not shown on this chart
  40. Conclusions
  41. Performance (F1) vs. Price (Affordability = 1/Price) * Recast and SNIPS are not shown as they don't provide public pricing
  42. Performance (F1) vs. Latency (Speed = 1/Latency) * Snips assumes on-device deployment; we put here the response time for a hosted test bench
  43. Conclusions 1. API.ai, Microsoft LUIS and IBM Watson have the overall best intent detection performance, speed and language coverage. • Within this group, API.ai is superior at price (free), Microsoft LUIS at speed (almost 50% faster response), IBM Watson at performance (esp. on smaller datasets). Here, only API.ai detects (~40% of) out-of-domain requests and produces the Fallback intent. 2. For extreme language coverage, go with Wit.ai. • It's interesting whether Machine Translation can be applied at either the training or the testing stage to increase language coverage for the top-3 providers. 3. The performance varies a lot across intents and dataset sizes. We recommend evaluating several providers on your data before making a choice.
  44. Intento Service Platform • Discover the best service providers for your AI task • Evaluate performance on your own data • Access any provider with no effort using our Single API
  45. Intento • Dmitry Labazkin, Grigory Sapunov, Konstantin Savenkov • Intento, Inc. <>