Speech recognition - how does it work?
The first speech recognition device appeared in 1952, and it could understand digits
spoken by a person. Forty years later, the first commercial programs that recognize human
speech were introduced. They were intended for people who could not type by hand for
physiological reasons. Today speech recognition is available on
almost any smartphone; it lets us interact with applications by voice, making our lives
easier. This article explains how speech recognition works.
The applications most commonly associated with the term "voice search" rely on
speech recognition systems, and often on speech synthesis, to return search results
automatically. Voice search is typically used to:
search for companies by name or category;
search for a person in a list;
search for information such as finances, weather, news, traffic congestion, or
movie theater listings (this is frequently used to drive multi-level voice menus).
How voice recognition is used in real life
When you speak a voice request, for example a destination address, the smartphone does
not hear a street and a house number. It hears a sound signal in which the sounds flow
smoothly into each other, without clear boundaries. Note that the same phrase, uttered
by different people in different situations, produces quite different signals.
The voice request is recorded by the smartphone and sent to the servers.
There the level of interference is estimated, the noise is removed, and the useful signal is
separated out. The recording is then divided into small fragments (frames), for example 25
milliseconds long with a step of 10 milliseconds, so that neighboring frames overlap. Thus,
one second of speech produces about a hundred frames.
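The framing step can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name split_into_frames and the 16 kHz sampling rate are assumptions made for the example, not details of any particular recognition service.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split a list of audio samples into overlapping frames (no padding at the end)."""
    frame_len = sample_rate * frame_ms // 1000  # samples in one frame
    step = sample_rate * step_ms // 1000        # samples between frame starts
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // step
    return [samples[i * step : i * step + frame_len] for i in range(n_frames)]


# One second of (silent) audio at an assumed sampling rate of 16 kHz.
sample_rate = 16000
one_second = [0.0] * sample_rate
frames = split_into_frames(one_second, sample_rate)
print(len(frames), len(frames[0]))
```

With these settings each frame holds 400 samples and a new frame starts every 160 samples, so one second yields 98 complete frames, which is where the figure of roughly a hundred frames per second comes from.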
Machine Learning processing
First, each frame is passed through the acoustic model. A machine learning algorithm
determines the candidate spoken words and their context. The accuracy of the results depends
directly on the completeness of the system's phonetic alphabet. For each sound, a complex
statistical model is built in advance that describes how this sound is pronounced in speech.
The recognition system matches the incoming speech signal to phonemes and then
assembles words from them. Each frame is mapped not to a single phoneme but to several that
match with varying degrees of probability. In addition, the system takes into account the
probability of transitions, that is, it determines which phonemes can follow a particular
phoneme. For this purpose, data on pronunciation, morphology, and semantics are used. Thus,
the system selects candidate words, which are then analyzed for forms, parts of speech, and
possible statistical relationships between them.
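One common way to combine per-frame phoneme probabilities with transition probabilities is the Viterbi algorithm. The sketch below is a toy illustration under that assumption; the article does not say which algorithm any given service uses, and the phoneme set and all probabilities here are invented for the example.

```python
import math


def viterbi(frame_probs, transition_probs, start_probs):
    """Return the most likely phoneme for each frame.

    frame_probs: list of dicts, one per frame, phoneme -> match probability
    transition_probs: dict, (previous, next) -> probability of that transition
    start_probs: dict, phoneme -> probability of starting with that phoneme
    """
    if not frame_probs:
        return []
    phonemes = list(start_probs)

    def log(p):
        return math.log(max(p, 1e-12))  # floor so that log() never sees zero

    # best[p] is the log-probability of the best path that ends in phoneme p.
    best = {p: log(start_probs[p]) + log(frame_probs[0].get(p, 0.0)) for p in phonemes}
    back_pointers = []
    for probs in frame_probs[1:]:
        new_best, pointers = {}, {}
        for cur in phonemes:
            prev = max(phonemes, key=lambda q: best[q] + log(transition_probs.get((q, cur), 0.0)))
            new_best[cur] = (best[prev]
                             + log(transition_probs.get((prev, cur), 0.0))
                             + log(probs.get(cur, 0.0)))
            pointers[cur] = prev
        best = new_best
        back_pointers.append(pointers)

    # Walk the back pointers from the best final phoneme to the first frame.
    path = [max(phonemes, key=lambda p: best[p])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Invented probabilities for five frames of the word "cat" (k, ae, t).
frame_probs = [
    {"k": 0.7, "ae": 0.2, "t": 0.1},
    {"k": 0.4, "ae": 0.5, "t": 0.1},
    {"k": 0.1, "ae": 0.6, "t": 0.3},
    {"k": 0.4, "ae": 0.3, "t": 0.3},  # taken alone, this frame looks most like "k"
    {"k": 0.1, "ae": 0.1, "t": 0.8},
]
transition_probs = {
    ("k", "k"): 0.6, ("k", "ae"): 0.4,
    ("ae", "ae"): 0.6, ("ae", "t"): 0.4,
    ("t", "t"): 1.0,
}
start_probs = {"k": 0.8, "ae": 0.1, "t": 0.1}
print(viterbi(frame_probs, transition_probs, start_probs))
```

Taken frame by frame, the fourth frame would be labeled "k", but no transition from "ae" back to "k" is allowed in this toy model, so the path through all five frames assigns it to "t" instead.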
Next, a language model comes into play, with which the system determines the probable
word order and, if necessary, restores unrecognized words from the context.
Finally, the information is sent to the central unit of the recognition system, the
decoder. This software component combines data from the acoustic and language models and,
based on their combination, produces the final result: the most likely sequence
of words.
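The decoder's job can be illustrated with a toy example that scores every candidate word sequence by adding the acoustic log-probabilities to the language model log-probabilities. This is a minimal sketch: the decode function, the bigram language model, and all numbers are assumptions made for illustration, and real decoders search far larger spaces with beam search instead of trying every combination.

```python
import math
from itertools import product


def decode(word_candidates, bigram_probs, lm_weight=1.0, unseen=1e-6):
    """Return the word sequence with the best combined acoustic + language score.

    word_candidates: list of dicts, one per word position, word -> acoustic probability
    bigram_probs: dict, (previous_word, word) -> probability of that word pair
    unseen: probability assumed for word pairs missing from bigram_probs
    """
    best_score, best_sequence = -math.inf, ()
    for sequence in product(*word_candidates):
        acoustic = sum(math.log(word_candidates[i][w]) for i, w in enumerate(sequence))
        language = sum(math.log(bigram_probs.get(pair, unseen))
                       for pair in zip(sequence, sequence[1:]))
        score = acoustic + lm_weight * language
        if score > best_score:
            best_score, best_sequence = score, sequence
    return list(best_sequence)


# Invented acoustic scores: "write" sounds slightly more likely than "right".
word_candidates = [
    {"turn": 0.6, "torn": 0.4},
    {"write": 0.55, "right": 0.45},
    {"now": 0.7, "no": 0.3},
]
# Invented bigram language model: "turn right now" is a common phrase.
bigram_probs = {
    ("turn", "right"): 0.3,
    ("turn", "write"): 0.001,
    ("right", "now"): 0.4,
    ("write", "now"): 0.05,
}
print(decode(word_candidates, bigram_probs))
```

On acoustic scores alone the winner would be "turn write now"; once the language model is added, "turn right now" comes out on top.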
Integrating speech recognition and voice commands into a website
If you want to integrate speech recognition into your website, you can look for tutorials
on the internet that use the browser's Speech Recognition API. An even easier option is to
install Voxpow, a speech recognition tool for websites that we found.
It is the first online tool for adding voice commands to a website and controlling everything
from a single point. It lets you use voice control quite easily and for free.
Big players in the speech recognition world
Google
The well-known IT corporation lets you test its Google Cloud Platform product online.
Anyone can try out the service for free. The product itself is convenient and clear to use.
Pluses:
support for more than 80 languages;
fast processing of named entities;
high-quality recognition over poor connections and in the presence of
extraneous sounds.
Minuses:
it has difficulty recognizing speech with accents or poor pronunciation, which
makes the system hard to use for anyone other than native speakers;
lack of clear technical support for the service.
Yandex
Speech recognition from Yandex is available in several ways:
via a cloud service;
via a library for access from mobile applications;
via a JavaScript API.
Pluses:
easy to use and configure;
good recognition of Russian-language speech;
the system returns several candidate answers and uses neural networks to find
the option closest to the truth.
Minuses:
some words may not be recognized correctly during streaming.
Azure
The Azure system was developed by Microsoft. Compared with its analogues, it
stands out strongly because of its price. However, be prepared to face some difficulties.
Pluses:
compared with other services, Azure processes messages very quickly in real time.
Minuses:
the system is very sensitive to accents and struggles to recognize speech from non-native speakers;
the system works only in English.
Overview
Thanks to machine learning, systems are resistant to noise and can recognize speech with
an accent. The accuracy of modern speech recognition systems exceeds 90 percent. We are
very close to a time when speech recognition technologies will be used in every aspect of
our lives.