Shop By Voice Product Overview

A New Voice in Customer Experience

Introduction
Shop By Voice (SBV) is a product developed by Firebird Summit, Inc. that applies a voice-driven
customer interface to a standard shopping experience.
The demonstration details included here are for the purposes of showcasing the technical capabilities
of the system, including how it can be compared to standard, commercially available voice solutions.
Custom integration with a commerce platform and adaptation to a retailer-specific shopping lexicon are
done during implementation.
For a live demonstration or additional questions, please contact the Firebird Summit team at
info@firebirdsummit.com

Limitations of commercially available voice solutions
Say the phrase “I want Illy Espresso…”
Alexa hears:
Google hears:
Watson hears:
Microsoft hears:
Siri hears:
Standard commercial voice technology is only the start of an
actual voice-based solution for business-specific needs.

Voice engines for product search
The strength of SBV technology is in
enhancing the accuracy of underlying
technical capabilities of commercial
engines.
Even when a specific product is not found
by a standard voice solution, SBV’s
custom engine and algorithms can still
produce accurate results.
The addition of background noise can make a significant difference in the quality of results from various engines.
Metrics included herein all demonstrate these differences by showing test results with and without added noise.

Shop By Voice for a specific product: Illy Espresso
Using our custom engine and full-context
search, SBV is able to return specific
results (with uncommon proper nouns)
with 95% accuracy.
Say the phrase “I want Illy Espresso…”

Contextual relevance
Context matters.
Each of the engines includes natively
embedded assumptions about how to
interpret different words.
Meat = Meet – unless there is enough
context to change the interpretation.
But these assumptions are tuned to average,
generic usage and are not directly customizable to an
individual business context.
No Context
Minimal Context
Specific Context

Solving the Problem
Wreck a nice beach?
Reckon eyes peach?
Recognize speech?
Acoustic Model
Basic Acoustic Model (Commercial Voice Service) +
Ecommerce Domain Acoustic Model +
Dynamic Custom Acoustic Model =>
Dynamic Custom Domain Acoustic Model
Language Model
Basic Language Model (Commercial Voice Service) +
Customer Domain Analysis +
Ecommerce Domain Language Model +
Dynamic Custom Language Model =>
Dynamic Custom Domain Language Model
User
Speech
System
Text
Commercial voice
technologies are the
beginning of a solution
– not the whole solution!

Voice Engine Word Error Rate (WER)
Not all voice engines are created equal,
and not all work well in all circumstances.
Word error rate* is the calculation of an
engine's native capability to understand
contextually relevant, business-specific
language.
*WER technical definition and calculations described in detail in Appendix

Shop By Voice for a generic product: Meat
Shop By Voice uses a customer’s order history and
different types of preferences to intelligently return search
results for products with large numbers of possibilities.
Just ask and Shop By Voice will read the titles of returned
results aloud.
Returns previously purchased product first
Returns prioritized ‘favorites’ second
Returns regular ‘favorites’ next
Returns popular results next
Browse through more options
Select product to add to cart by number
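
The result ordering described on this slide (previously purchased, then prioritized favorites, then regular favorites, then popular results) can be illustrated with a small sketch. This is not SBV's actual implementation: the `Product` fields, the `tier` function, and the tie-break by popularity within a tier are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical fields; the real SBV data model is not described in the deck.
    title: str
    previously_purchased: bool = False
    prioritized_favorite: bool = False
    favorite: bool = False
    popularity: int = 0


def tier(product: Product) -> int:
    """Map a product to its priority tier (lower is returned earlier)."""
    if product.previously_purchased:
        return 0
    if product.prioritized_favorite:
        return 1
    if product.favorite:
        return 2
    return 3  # everything else competes on popularity only


def rank_results(products):
    """Order by tier; within a tier, more popular first (an assumption)."""
    return sorted(products, key=lambda p: (tier(p), -p.popularity))


results = rank_results([
    Product("Ground beef", popularity=90),
    Product("Chicken breast", favorite=True, popularity=40),
    Product("Ribeye steak", previously_purchased=True, popularity=10),
    Product("Pork chops", prioritized_favorite=True, popularity=20),
])
# Number the titles the way a voice interface might read them aloud.
for number, product in enumerate(results, start=1):
    print(number, product.title)
```

Numbering the results matches the "select product to add to cart by number" step: the shopper answers with the number that was read aloud.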

Voice engine recall rate
Voice engine’s native ability to accurately
recognize spoken language.

Shop By Voice Reporting Dashboard
SBV Administration Dashboard
shows voice-driven data about
users, products, orders and devices.
Native responsive design works on computer, tablet and mobile screen sizes.

Thank you for your interest!
Contact info@firebirdsummit.com to arrange
a live demo or ask questions.

Appendix
Engine comparison metrics and testing methodology

Test description and assumptions
We ran 1000 * 7 * 4 = 28,000 tests to determine the
current level of SBV accuracy. Our test set contained
common search queries for grocery stores. Most phrases
are single-word or two-word queries, and most are common
words that also appear in articles, everyday speech, and
on the web.
Examples of tested phrases: aloe, cranberries, pork, stain
remover, vegetarian, fish.
We calculated the widely known metrics for speech
recognition engines:
• Word error rate
• Word accuracy (not included here)
• Recall rate
We also tested the algorithms inside our system. We
measure the quality of our system as the percentage of
Found Products, given the Automatic Speech Recognition
(ASR) results.
We used IBM voice synthesis to emulate speech, with
male and female voices, with and without noise, and with
American and British English accents.
Using a synthetic voice is a naive approach that is far
from representative of real-world scenarios. This
assumption significantly improves ASR results, so the
metric values cannot be used as real-world indicators.
The approach does, however, make it possible to rate ASR
engines and compare them relative to one another.
Our comparison model included Shop By Voice and six
commercially available engines:
1. Google
2. Microsoft
3. IBM Watson
4. Open Source
5. Apple’s Siri
6. Amazon Alexa

Word Error Rate
Word error rate is a common metric of the performance
of a speech recognition or machine translation system.
It is computed by comparing a reference transcription
with the transcription output by the speech recognizer.
In simple terms, WER shows the number of transformations
that must be applied to the ASR hypothesis to obtain the
reference. It is the most accurate metric for comparing
ASRs. From this comparison it is possible to compute the
number of errors, which typically belong to 3 categories:
1. Insertions I (the ASR output contains a word that is
not present in the reference)
2. Deletions D (a word is missing from the ASR output)
3. Substitutions S (a word is confused with another
one)
Word error rate can then be computed as:
$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$$
where
• S is the number of substitutions,
• D is the number of deletions,
• I is the number of insertions,
• C is the number of correct words,
• N is the number of words in the reference transcription.
The main issue in computing this score is the required
alignment between the 2 word sequences. This can be
obtained through dynamic programming, using the
so-called Levenshtein distance.
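
As a rough illustration of the alignment step, the sketch below computes the S, D, I and C counts defined above with a word-level Levenshtein dynamic program and derives WER from them. It is a minimal example, not the code used for the tests in this appendix; the function names are hypothetical, and the mis-heard phrase "really" is invented for illustration.

```python
def edit_counts(reference: str, hypothesis: str):
    """Return (S, D, I, C) for a minimum-edit word alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # table[i][j] = (errors, S, D, I, C) for ref[:i] aligned with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j, 0)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins, c = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins, c + 1)      # correct word
            else:
                diagonal = (e + 1, s + 1, d, ins, c)  # substitution
            e, s, d, ins, c = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins, c)
            e, s, d, ins, c = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1, c)
            # Keep whichever path has the fewest total errors.
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return table[len(ref)][len(hyp)][1:]


def word_error_rate(reference: str, hypothesis: str) -> float:
    s, d, i, c = edit_counts(reference, hypothesis)
    n = s + d + c  # number of words in the reference
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + i) / n


print(word_error_rate("I want Illy Espresso", "I want really espresso"))  # 0.25
print(word_error_rate("recognize speech", "wreck a nice beach"))          # 2.0
```

The second call shows that WER can exceed 1 when the hypothesis contains more words than the reference: two substitutions plus two insertions against a two-word reference.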

Recall and Found Product Metrics
Recall (information retrieval)
This metric is the ratio of correctly
recognized words (H) to the total
number of words in the reference (N).
It is used to measure ASR performance.
However, it does not account for the
noise (extra words) that an ASR can
generate, which can invert or
undermine the context of the phrase.
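
To make the limitation concrete, here is a small sketch that computes recall as H / N from alignment counts, next to WER for the same counts. The function names are hypothetical; the counts use the same S, D, I, C definitions as the Word Error Rate section, with H taken as the number of correct words C.

```python
def recall_rate(substitutions: int, deletions: int, correct: int) -> float:
    """Recall = H / N, where H is the number of correctly recognized words."""
    n = substitutions + deletions + correct  # words in the reference
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return correct / n


def word_error_rate(substitutions: int, deletions: int,
                    insertions: int, correct: int) -> float:
    """WER = (S + D + I) / N for the same alignment counts."""
    n = substitutions + deletions + correct
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (substitutions + deletions + insertions) / n


# Illustrative case: both reference words are recognized correctly,
# but the engine also emits two extra words (insertions).
print(recall_rate(substitutions=0, deletions=0, correct=2))                    # 1.0
print(word_error_rate(substitutions=0, deletions=0, insertions=2, correct=2))  # 1.0
```

Recall stays at 1.0 because insertions never enter the formula, while WER reports a 100% error rate for the same output.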
Found Product Rate
Found Product Rate is calculated as the
ratio of correctly found* products to all
products (here, the number of products
equals the number of tests). A product is
considered correctly found if the
product returned by the search engine
for the reference phrase is the same as
the product returned by the search
engine for the ASR hypothesis phrase.
*Search engine used is a native ecommerce, on-site engine across all ASRs to establish consistent results. Typical on-site ecommerce search
engines include enhanced results management for common needs, such as misspellings, related words, suggested alternatives, etc.
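
The Found Product Rate definition can be sketched as follows. This is a simplified illustration, not the test harness used here: `toy_search`, the catalog, the SKU values, and the mis-heard hypotheses are all invented, and a real on-site search engine would be far more tolerant of misspellings. The sketch also assumes that a test only counts as found when the reference phrase itself returns a product.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical catalog standing in for an on-site ecommerce search engine.
CATALOG = {"pork": "SKU-PORK", "fish": "SKU-FISH", "aloe": "SKU-ALOE"}


def toy_search(query: str) -> Optional[str]:
    """Exact-match lookup; returns None when nothing is found."""
    return CATALOG.get(query.strip().lower())


def found_product_rate(
    pairs: Iterable[Tuple[str, str]],
    search: Callable[[str], Optional[str]],
) -> float:
    """Share of tests where the ASR hypothesis finds the reference's product."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one test is required")
    found = 0
    for reference, hypothesis in pairs:
        expected = search(reference)
        # Assumption: a reference with no search result never counts as found.
        if expected is not None and search(hypothesis) == expected:
            found += 1
    return found / len(pairs)


# (reference phrase, ASR hypothesis) pairs; the hypotheses are invented.
tests = [("pork", "pork"), ("fish", "Fish"), ("aloe", "hello"), ("pork", "fork")]
print(found_product_rate(tests, toy_search))  # 0.5
```

Because the same search function is applied to both the reference and the hypothesis, the metric isolates the effect of recognition errors from the quality of the search engine itself.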
