It’s not breaking news that voice search is the emerging technology of greatest interest, but what hasn’t been demystified is how it works. This session will uncover how the algorithm functions at a structural level by dissecting Google’s Automatic Speech Recognition, Google's Quality Metrics for voice search, and deciphering the nuances of the spoken word as they apply to semantic search.
2. #SMX #22A @UpasnaGautam
▪ Name: Upasna Gautam
▪ Nickname: Pas
▪ Job: SEO Manager at Ziff Davis for
PC Magazine & Mashable
▪ Past Life Job: Scientist/Lab Rat
▪ Other Job: Fitness & Dance Instructor
▪ Hobbies: The Office & hiking
▪ Location: Austin but Michigan is home
About Me
3. #SMX #22A @UpasnaGautam
Anthony Verre
Veteran SEO & SMX Speaker
Former Boss at Rockfish
Former/Current Mentor
Everyone Tweet at @TonyVerre
and tell him we miss him here!
Shout Out!
5. #SMX #22A @UpasnaGautam
▪ The form of a structure is
correlated to the purpose/function
of that structure
▪ When we understand FORM, we
can better understand FUNCTION
Form Follows Function:
Why Is This So Important?
Why don’t you explain this
to me like I’m 5?
6. #SMX #22A @UpasnaGautam
▪ Before we strategize and implement, we should understand
HOW the voice search system works.
▪ Automatic Speech Recognition (ASR), fueled by deep neural
networks, is the system that powers applications like
speech transcription and voice search.
ASR is the FORM behind the voice search FUNCTION
Form Follows Function:
How Does This Apply To Voice Search?
7. #SMX #22A @UpasnaGautam
Automatic Speech Recognition:
How Do Humans Do It?
Human articulation produces sound waves which
the ear conveys to the brain for processing.
New phone who dis?
8. #SMX #22A @UpasnaGautam
Automatic Speech Recognition:
How Do Machines Do It?
Part 1: Fourier Transform (Sound Signal Processing)
• Turning sound into math functions that are digested into data
• Extract the most significant coefficients
Part 2: Hidden Markov Model (Speech Modeling)
• Take the newly created sound/math functions and build a sequence
of states
• In this model, the states are the letters of the message and the
sequence of events is the sound signal
Part 3: Viterbi Algorithm
• Obtain the sequence of states of maximum likelihood. This
maximum-likelihood sequence is the recognized query, which drives
what we get served in the Google SERPs after submitting a voice
search query.
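Parts 2 and 3 can be sketched in a few lines of Python. This is a toy illustration, not Google's implementation: the two-letter model, the "breathy"/"vowel" sound labels, and every probability below are made-up assumptions chosen only to show how the Viterbi algorithm picks the most likely sequence of states (letters) for a sequence of observed sounds.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed sounds.

    Works in log space; assumes no probability in the model is zero.
    """
    # Each column maps state -> (best log-probability so far, previous state).
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that gives the highest score for reaching s.
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Backtrack from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: hidden states are letters, observations are sounds.
states = ["h", "i"]
start_p = {"h": 0.8, "i": 0.2}
trans_p = {"h": {"h": 0.3, "i": 0.7}, "i": {"h": 0.2, "i": 0.8}}
emit_p = {
    "h": {"breathy": 0.9, "vowel": 0.1},
    "i": {"breathy": 0.2, "vowel": 0.8},
}
sounds = ["breathy", "vowel", "vowel"]
decoded = viterbi(sounds, states, start_p, trans_p, emit_p)
print(decoded)  # ['h', 'i', 'i']
```

A real recognizer works with far larger state spaces and learned probabilities, but the decoding step follows the same idea: keep the best-scoring path into each state, then trace back from the winner.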
9. #SMX #22A @UpasnaGautam
Automatic Speech Recognition:
Sound Signal Processing + Speech Modeling
• Convert speech signal into a
sequence of vectors
• Vectors are measured
throughout the duration of the
speech signal
• Using a syntactic decoder, a
valid sequence of
representations is generated
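As a rough sketch of the first two bullets, the Python below slices a signal into overlapping frames and turns each frame into a small vector of its strongest Fourier coefficients. The frame size, hop, synthetic sine-wave "speech", and the function names are assumptions for illustration; production recognizers typically use richer acoustic features than raw spectral peaks.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of each non-negative frequency bin of a real-valued frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def signal_to_vectors(signal, frame_size, hop, keep):
    """Measure one vector per frame across the duration of the signal.

    Each vector holds the `keep` strongest (bin, magnitude) pairs.
    """
    vectors = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        mags = dft_magnitudes(signal[start:start + frame_size])
        top = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:keep]
        vectors.append([(k, round(mags[k], 3)) for k in top])
    return vectors

# Synthetic stand-in for speech: a tone with 4 cycles per 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
vectors = signal_to_vectors(signal, frame_size=32, hop=16, keep=1)
for v in vectors:
    print(v)  # the dominant bin is 4 in every frame
```

The resulting sequence of vectors is the kind of input a speech model such as the Hidden Markov Model then decodes.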
10. #SMX #22A @UpasnaGautam
Google’s Voice Search Quality Metrics
Google has defined and uses a set of metrics to track
the quality of its voice search system.
They use these metrics to drive their research directions
as well as provide insight and guidance for solving
specific problems and tuning system performance.
“We strive to find metrics that illuminate the end-user experience, to make sure that we
optimize the most important aspects and make effective tradeoffs. We also design
metrics which can bring to light specific issues with the underlying technology.” -GOOG
12. #SMX #22A @UpasnaGautam
Google’s Voice Search Quality Metrics:
Word Error Rate (WER)
• Measures misrecognitions at the word level
• Compares the words outputted by the recognizer to those the user really spoke
• Every error (substitution, insertion or deletion) is counted against the recognizer
WER = (Number of Substitutions + Insertions + Deletions) / Total Number of Words
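A minimal Python sketch of the WER calculation, using a standard word-level edit distance to count substitutions, insertions, and deletions. The example query is made up, and the function assumes the reference transcription is not empty.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)

spoken = "pictures of the golden gate bridge"    # what the user really said
recognized = "picture of golden gate bridge"     # what the recognizer output
wer = word_error_rate(spoken, recognized)
print(round(wer, 3))  # 0.333: one substitution + one deletion over 6 words
```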
13. #SMX #22A @UpasnaGautam
Google’s Voice Search Quality Metrics:
Semantic Quality (Webscore)
• Individual word errors do not necessarily affect the final search results shown (deleting
function words like “in” or “of,” or minor misspellings, like forgetting an “s” to pluralize)
• The semantic quality of the recognizer (Webscore) is tracked by measuring how many
times the search result as queried by the recognition hypothesis varies from the search
result as queried by a human transcription
• A better recognizer has a higher Webscore
• PageRank + Degree + Betweenness + Closeness
• The Webscore gives us a much clearer picture of what the user experiences when they
search by voice. Google focuses on optimizing this metric, rather than the more
traditional WER metric defined in the previous slide.
Webscore = Number of Correct Search Results / Total Number of Spoken Queries
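To make the idea concrete, here is a small Python sketch that scores a recognizer by how often its hypothesis retrieves the same result as the human transcription. The `toy_search` function is a hypothetical stand-in for a search engine (it just drops function words and trailing "s"), and the query pairs are invented for illustration.

```python
FUNCTION_WORDS = {"in", "of", "the"}

def toy_search(query):
    """Hypothetical stand-in for a search engine's top result.

    Ignores function words and plural endings, so small word errors
    can still lead to the same "result".
    """
    words = [w.rstrip("s") for w in query.lower().split()
             if w not in FUNCTION_WORDS]
    return " ".join(words)

def webscore(pairs, search):
    """Share of queries where the hypothesis retrieves the same result
    as the human transcription."""
    correct = sum(1 for hypothesis, transcription in pairs
                  if search(hypothesis) == search(transcription))
    return correct / len(pairs)

# (recognition hypothesis, human transcription)
pairs = [
    ("pictures golden gate bridge", "pictures of the golden gate bridge"),
    ("restaurant in austin", "restaurants in austin"),
    ("whether in boston", "weather in boston"),
]
score = webscore(pairs, toy_search)
print(round(score, 3))  # 0.667: only the third query changes the result
```

The first two pairs contain word errors that would raise WER, yet they still count as correct here, which is why this metric tracks the user experience more closely.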
14. #SMX #22A @UpasnaGautam
Google’s Voice Search Quality Metrics:
Perplexity (PPL)
• Measure of the size of the set of words that can be recognized next, given the previously
recognized words in the query
• Provides a rough measure of the quality of the language model
• The lower the perplexity, the better the model is at predicting the next word
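One common way to compute perplexity is as the exponential of the average negative log-probability the language model assigns to each word given the words before it. The Python sketch below uses that definition; the probability lists are made-up values, not output from a real model.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model's probability for each word in a query,
    each conditioned on the previously recognized words."""
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that is always choosing among 4 equally likely next words.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that narrows each step to 2 equally likely next words.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
print(round(uncertain, 3), round(confident, 3))  # 4.0 2.0
```

Read this way, a perplexity of 4 means the model is, on average, as unsure as if it had to pick among 4 equally likely next words.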
15. #SMX #22A @UpasnaGautam
Google’s Voice Search Quality Metrics:
Out-of-Vocabulary Rate (OOV)
• Tracks the percentage of words spoken by the user that
are not modeled by the language model
• It is important to keep this number as low as possible
• Any word spoken by our users that is not in our
vocabulary will ultimately result in a recognition error
• Recognition errors may also cause errors in surrounding
words due to the subsequent poor predictions of the
language model and acoustic misalignments
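The OOV rate itself is a simple ratio, sketched below in Python. The vocabulary and the spoken query are tiny made-up examples; a real language model's vocabulary is vastly larger.

```python
def oov_rate(spoken_words, vocabulary):
    """Fraction of spoken words that the language model does not know."""
    missing = [w for w in spoken_words if w not in vocabulary]
    return len(missing) / len(spoken_words)

# Hypothetical vocabulary and query for illustration.
vocabulary = {"weather", "in", "austin", "today"}
spoken = "weather in pflugerville today".split()
rate = oov_rate(spoken, vocabulary)
print(rate)  # 0.25: "pflugerville" is not in the vocabulary
```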
16. #SMX #22A @UpasnaGautam
Google’s Voice Search Quality Metrics:
Latency
Contributing Factors
• Time it takes the system to detect end-of-speech
• Total time to recognize the spoken query
• Time to perform the web query
• Time to return the web search results back to the
client over the network
• Time it takes to render the search results in the
browser of the user's phone.
The total time (in seconds) it takes to complete a search by voice. More specifically, the
time from when the user finishes speaking until the search results appear on screen
Each of these factors is studied and optimized to provide a streamlined user experience.
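As a simple illustration, the contributing factors can be modeled as stages whose times add up to the total latency. The class name, field names, and timings in this Python sketch are hypothetical, not measurements from Google.

```python
from dataclasses import dataclass, fields

@dataclass
class VoiceSearchTimings:
    """Seconds spent in each contributing stage (hypothetical values)."""
    end_of_speech_detection: float
    recognition: float
    web_query: float
    network_return: float
    rendering: float

    def total(self):
        # Total latency is the sum of every stage.
        return sum(getattr(self, f.name) for f in fields(self))

    def slowest(self):
        # The stage most worth optimizing first.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name

timings = VoiceSearchTimings(
    end_of_speech_detection=0.3,
    recognition=0.4,
    web_query=0.2,
    network_return=0.25,
    rendering=0.15,
)
print(round(timings.total(), 2), timings.slowest())
```

Breaking the total down this way shows which stage dominates, which is the kind of insight needed to optimize each factor separately.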
17. #SMX #22A @UpasnaGautam
Share these #SMXInsights on your social channels!
#SMXInsights
▪ To Understand Automatic Speech
Recognition is to Understand Voice Search
▪ ASR is the form behind the voice
search function
▪ Sound processing and speech modeling
power voice search results
18. #SMX #22A @UpasnaGautam
#SMXInsights
▪ Semantic Quality is EVERYTHING.
▪ Keyword research must include user behavior
research & consumer journey analyses to
uncover natural language patterns
▪ Google’s recognition hypotheses and human
transcription are in sync – we need to serve the
resources (aka content) that facilitate that sync
▪ Long-Tail is Life
19. #SMX #22A @UpasnaGautam
#SMXInsights
▪ A High-Quality UX is a Fast UX
▪ From the time it takes to detect end-of-speech,
to the time it takes to render the search results,
time is of the essence during speech processing.
▪ “It is generally desirable to reduce any user noticeable latency, and
in certain circumstances may be desirable to reduce latency even if
improved speed comes at the cost of reduced quality ASR results.”
-GOOG