Sound, Search, and Semantics: How Form Follows Function

#SMX #22A @UpasnaGautam
Optimizing for Voice Search & Virtual Assistants
Sound, Search,
and Semantics:
How Form Follows Function

About Me
▪ Name: Upasna Gautam
▪ Nickname: Pas
▪ Job: SEO Manager at Ziff Davis for PC Magazine & Mashable
▪ Past Life Job: Scientist/Lab Rat
▪ Other Job: Fitness & Dance Instructor
▪ Hobbies: The Office & hiking
▪ Location: Austin but Michigan is home

Shout Out!
Anthony Verre
Veteran SEO & SMX Speaker
Former Boss at Rockfish
Former/Current Mentor
Everyone tweet at @TonyVerre and tell him we miss him here!

Agenda
▪ Automatic Speech Recognition
  • Sound Processing
  • Speech Modeling
▪ Quality Metrics
  • Word Error Rate (WER)
  • Semantic Quality (Webscore)
  • Perplexity (PPL)
  • Out-of-Vocabulary Rate (OOV)
  • Latency

Form Follows Function:
Why Is This So Important?
▪ The form of a structure is correlated to the purpose/function
of that structure
▪ When we understand FORM, we can better understand FUNCTION
Why don’t you explain this to me like I’m 5?

Form Follows Function:
How Does This Apply To Voice Search?
▪ Before we strategize and implement, we should understand
HOW the voice search system works.
▪ Automatic Speech Recognition (ASR), fueled by deep learning
neural networks, is the system that powers applications like
speech transcription and voice search.
ASR is the FORM behind the voice search FUNCTION

Automatic Speech Recognition:
How Do Humans Do It?
Human articulation produces sound waves which
the ear conveys to the brain for processing.
New phone who dis?

How Do Machines Do It?
Part 1: Fourier Transform (Sound Signal Processing)
• Turning sound into math functions that are digested into data
• Extract the most significant coefficients
Part 2: Hidden Markov Model (Speech Modeling)
• Take the newly created sound/math functions and build a sequence
of states
• In this model, the states are the letters of the message and the
sequence of events is the sound signal
Part 3: Viterbi Algorithm
• Obtain the sequence of states of maximum likelihood. This
maximum-likelihood sequence of states is what we get served in the
Google SERPs after submitting a voice search query.
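The Viterbi step in Part 3 can be sketched on a toy Hidden Markov Model. This is a minimal illustration, not Google's implementation: the two states, the "low"/"high" sound labels, and every probability below are made-up assumptions chosen only to show how the most likely state sequence is recovered.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] holds (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # choose the predecessor that maximizes the probability of arriving in s
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    # the overall winner is the best path across all possible final states
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: hidden states are letters, observations are coarse sound labels.
STATES = ["a", "b"]
START_P = {"a": 0.6, "b": 0.4}
TRANS_P = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
EMIT_P = {"a": {"low": 0.8, "high": 0.2}, "b": {"low": 0.1, "high": 0.9}}

prob, path = viterbi(["low", "high", "high"], STATES, START_P, TRANS_P, EMIT_P)
print(path, prob)
```

With these assumed numbers the decoder settles on the letter sequence a, b, b for the sound sequence low, high, high.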

Sound Signal Processing + Speech Modeling
• Convert speech signal into a
sequence of vectors
• Vectors are measured
throughout the duration of the
speech signal
• Using a syntactic decoder, a
valid sequence of
representations is generated
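Turning a speech signal into a sequence of vectors can be sketched as follows. This is a simplified illustration under stated assumptions: the frame size, hop length, and number of kept coefficients are arbitrary, the Fourier transform is a naive textbook version, and keeping the first few bins stands in for "extracting the most significant coefficients". Production front ends are considerably more elaborate.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Naive discrete Fourier transform; returns the magnitude of each frequency bin."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n)
    ]


def signal_to_vectors(signal, frame_size=8, hop=4, keep=4):
    """Slice the signal into overlapping frames; keep the first `keep` magnitudes of each."""
    vectors = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        vectors.append(dft_magnitudes(frame)[:keep])
    return vectors


# Hypothetical input: a pure tone that completes one cycle every 8 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(16)]
vectors = signal_to_vectors(signal)
print(len(vectors), [round(m, 3) for m in vectors[0]])
```

Each vector describes one short slice of time, so the list as a whole follows the signal through its duration. For this pure tone, nearly all of the energy lands in the second coefficient of every vector.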

Google’s Voice Search Quality Metrics
Google has defined and uses a set of metrics to track
the quality of its voice search system.
They use these metrics to drive their research directions
as well as provide insight and guidance for solving
specific problems and tuning system performance.
“We strive to find metrics that illuminate the end-user experience, to make sure that we
optimize the most important aspects and make effective tradeoffs. We also design
metrics which can bring to light specific issues with the underlying technology.” -GOOG

Google’s Voice Search Quality Metrics
• Word Error Rate (WER)
• Semantic Quality (Webscore)
• Perplexity (PPL)
• Out-of-Vocabulary Rate (OOV)
• Latency

Google’s Voice Search Quality Metrics:
Word Error Rate (WER)
• Measures misrecognitions at the word level
• Compares the words outputted by the recognizer to those the user really spoke
• Every error (substitution, insertion or deletion) is counted against the recognizer
WER = (Number of Substitutions + Insertions + Deletions) / Total Number of Words
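The WER formula can be computed with a word-level edit distance. The sketch below is one common way to do it, not necessarily how Google's tooling works; the function name and the example query are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


# Hypothetical example: what the user said vs. what the recognizer output.
wer = word_error_rate("pizza places in austin texas", "pizza place austin texas")
print(wer)
```

Here the recognizer made one substitution ("places" became "place") and one deletion ("in"), so two errors over five spoken words gives a WER of 0.4.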

Semantic Quality (Webscore)
• Individual word errors do not necessarily affect the final search results shown (deleting
function words like “in” or “of,” or minor misspellings, like forgetting an “s” to pluralize)
• The semantic quality of the recognizer (Webscore) is tracked by measuring how many
times the search result as queried by the recognition hypothesis varies from the search
result as queried by a human transcription
• A better recognizer has a higher Webscore
• PageRank + Degree + Betweenness + Closeness
• The Webscore gives us a much clearer picture of what the user experiences when they
search by voice. Google focuses on optimizing this metric, rather than the more
traditional WER metric defined in the previous slide.
Webscore = Number of Correct Search Results / Total Number of Spoken Queries
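A Webscore-style ratio can be sketched as below. This assumes that a query counts as "correct" when the search result for the recognizer's hypothesis matches the search result for the human transcription; the result identifiers are hypothetical placeholders, not real data.

```python
def webscore(results_from_hypothesis, results_from_transcription):
    """Fraction of spoken queries whose search result matches the human-transcribed one."""
    if len(results_from_hypothesis) != len(results_from_transcription):
        raise ValueError("both lists must cover the same spoken queries")
    if not results_from_hypothesis:
        raise ValueError("at least one spoken query is required")
    correct = sum(
        1
        for hyp, ref in zip(results_from_hypothesis, results_from_transcription)
        if hyp == ref
    )
    return correct / len(results_from_hypothesis)


# Hypothetical top results for four spoken queries.
from_recognizer = ["result-a", "result-b", "result-x", "result-d"]
from_human = ["result-a", "result-b", "result-c", "result-d"]
score = webscore(from_recognizer, from_human)
print(score)
```

Three of the four queries return the same result either way, so the score is 0.75, even if some of those three contained word-level errors.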

Perplexity (PPL)
• Measure of the size of the set of words that can be recognized next, given the previously
recognized words in the query
• Provides a rough measure of the quality of the language model
• The lower the perplexity, the better the model is at predicting the next word
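Perplexity is commonly computed as the exponential of the average negative log probability that the language model assigns to each word. The sketch below assumes the per-word probabilities are already available; the values used are invented for illustration.

```python
import math


def perplexity(word_probabilities):
    """exp of the average negative log probability the model gave each word."""
    if not word_probabilities:
        raise ValueError("at least one word probability is required")
    if any(p <= 0 for p in word_probabilities):
        raise ValueError("probabilities must be greater than zero")
    avg_neg_log = -sum(math.log(p) for p in word_probabilities) / len(word_probabilities)
    return math.exp(avg_neg_log)


# Hypothetical model that is always choosing among 4 equally likely next words.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
# Hypothetical model that is more confident about each next word.
confident_ppl = perplexity([0.9, 0.8, 0.9, 0.7])
print(uniform_ppl, confident_ppl)
```

The uniform case comes out at roughly 4, which matches the "size of the set of words that can be recognized next" reading, and the more confident model scores lower.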

Out-of-Vocabulary Rate (OOV)
• Tracks the percentage of words spoken by the user that
are not modeled by the language model
• It is important to keep this number as low as possible
• Any word spoken by our users that is not in our
vocabulary will ultimately result in a recognition error
• Recognition errors may also cause errors in surrounding
words due to the subsequent poor predictions of the
language model and acoustic misalignments
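The OOV rate is a simple percentage. A minimal sketch, assuming the vocabulary is a plain set of lowercase words; the vocabulary and the spoken query below are hypothetical.

```python
def oov_rate(spoken_words, vocabulary):
    """Percentage of spoken words that are missing from the vocabulary."""
    if not spoken_words:
        raise ValueError("at least one spoken word is required")
    missing = sum(1 for word in spoken_words if word.lower() not in vocabulary)
    return 100.0 * missing / len(spoken_words)


# Hypothetical vocabulary and spoken query.
vocabulary = {"best", "tacos", "in", "austin"}
rate = oov_rate("best birria tacos in austin".split(), vocabulary)
print(rate)
```

One of the five spoken words ("birria") is not modeled, so the rate is 20 percent.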

Latency
The total time (in seconds) it takes to complete a search by voice. More specifically, the
time from when the user finishes speaking until the search results appear on screen.
Contributing Factors
• Time it takes the system to detect end-of-speech
• Total time to recognize the spoken query
• Time to perform the web query
• Time to return the web search results back to the
client over the network
• Time it takes to render the search results in the
browser of the user’s phone
Each of these factors is studied and optimized to provide a streamlined user experience.
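The contributing factors can be added up to see where the time goes. The sketch below assumes the stages run one after another, which is a simplification, and all timings are invented numbers rather than measurements.

```python
from dataclasses import dataclass, asdict


@dataclass
class VoiceSearchTimings:
    """Seconds spent in each stage of one voice search (hypothetical field names)."""
    end_of_speech_detection: float
    recognition: float
    web_query: float
    network_return: float
    rendering: float

    def total(self):
        # assumes the stages are strictly sequential
        return sum(asdict(self).values())

    def slowest_stage(self):
        stages = asdict(self)
        return max(stages, key=stages.get)


# Hypothetical timings for a single query.
timings = VoiceSearchTimings(
    end_of_speech_detection=0.30,
    recognition=0.50,
    web_query=0.20,
    network_return=0.15,
    rendering=0.25,
)
print(round(timings.total(), 2), timings.slowest_stage())
```

Breaking the total down this way shows which stage would be the most rewarding one to optimize first.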

Share these #SMXInsights on your social channels!
#SMXInsights
▪ To Understand Automatic Speech
Recognition is to Understand Voice Search
▪ ASR is the form behind the voice
search function
▪ Sound processing and speech modeling
power voice search results

#SMXInsights
▪ Semantic Quality is EVERYTHING.
▪ Keyword research must include user behavior
research & consumer journey analyses to
uncover natural language patterns
▪ Google’s recognition hypotheses and human
transcription are in sync – we need to serve the
resources (aka content) that facilitate that sync
▪ Long-Tail is Life

#SMXInsights
▪ A High-Quality UX is a Fast UX
▪ From the time it takes to detect end-of-speech,
to the time it takes to render the search results,
time is of the essence during speech processing.
▪ “It is generally desirable to reduce any user noticeable latency, and
in certain circumstances may be desirable to reduce latency even if
improved speed comes at the cost of reduced quality ASR results.”
-GOOG
