This is an overview of ELSA's proprietary speech recognition technology. Find out how voice technology can help with language learning, and what differentiates ELSA's model from others.
In the age of globalization, English fluency has become a pivotal skill for any professional or
student to develop, with a special focus on the ability to communicate confidently. In recent years,
language education has constantly been in the spotlight for innovation. Novel methodologies
are extensively researched and applied in order to deliver the best learning experience, with
advances in automation leading the wave. In the current context of COVID-19 disrupting
traditional schooling in many ways, technology-aided language education is growing at an
unprecedented speed.
One of the most popular technologies applied in English teaching is Automatic Speech
Recognition (ASR). ASR converts audio speech input into written text by searching large
language corpora for matching patterns (Carrier, 2017). This technology supports the
development of listening and speaking – the two foundational pillars of English
communication. The application of ASR in language pedagogy offers all the benefits
of interactive learning in a regular classroom setup, as well as several distinctive
characteristics that make it an ideal solution for the digitalization of education.
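As a minimal illustrative sketch (not ELSA's production system), matching a decoded phoneme sequence against a pronunciation lexicon can be reduced to choosing the closest entry; the simplified ARPAbet-style lexicon below is an assumption for demonstration only:

```python
from difflib import SequenceMatcher

# Toy pronunciation lexicon (illustrative only)
LEXICON = {
    "ship":  ["SH", "IH", "P"],
    "sheep": ["SH", "IY", "P"],
    "chip":  ["CH", "IH", "P"],
}

def best_match(decoded_phonemes):
    """Return the lexicon word whose phoneme pattern best matches the input."""
    def similarity(word):
        return SequenceMatcher(None, LEXICON[word], decoded_phonemes).ratio()
    return max(LEXICON, key=similarity)

print(best_match(["SH", "IH", "P"]))  # prints: ship
```

Real systems decode with acoustic and language models over far larger lexicons, but the core idea of scoring candidate patterns against the input signal is the same.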
The ELSA Speak App is equipped with our own spoken language
assessment technology, which will be discussed in detail in this
paper along with the benefits of its application in English learning
through real success stories with our partners.
1. TECHNICAL OVERVIEW

The ELSA ASR technology is developed entirely in-house: we have a dedicated research
department responsible for creating and evolving all of our speech assessment algorithms. Our
technology builds on statistical signal processing techniques traditionally used in
speech recognition. We use deep neural networks, trained on thousands of hours of native
American English speech, to learn what correct pronunciation sounds like, alongside other
machine learning algorithms that classify intonation, rhythm, and related features. Our pipeline
also includes novel signal processing to extract information from the speech signal, which is
then converted into features classified as correct, almost correct, or incorrect.
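A minimal sketch of that final classification step, assuming a per-phoneme confidence score in [0, 1] (the thresholds here are illustrative assumptions, not ELSA's actual values):

```python
def classify_phoneme(confidence):
    """Map a pronunciation confidence score in [0, 1] to a feedback class.
    Thresholds are illustrative assumptions for this sketch."""
    if confidence >= 0.8:
        return "correct"
    if confidence >= 0.5:
        return "almost correct"
    return "incorrect"

print(classify_phoneme(0.9))  # prints: correct
print(classify_phoneme(0.6))  # prints: almost correct
```

In practice the decision boundaries would be learned from labeled speech data rather than hand-set.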
The main factor differentiating our technology from speech recognition research (and especially
from the APIs of major cloud computing services) is what we aim to identify. While speech
recognition research aims at identifying what words were spoken regardless of the speaker’s
accent, we focus on distinguishing between correct and incorrect pronunciations and
benchmark the speech against actual native speaker standards. Traditionally, as speech
recognition systems get more training data from non-native speakers, they lose the ability to
tell users that their pronunciation is not correct, let alone identify why. In
our case, as we gather more data from users and evolve the system, we will improve our
pronunciation detection algorithms and the quality of our feedback.
To deliver effective communication training, we independently analyze the following
dimensions of the student’s speaking proficiency:
Pronunciation: This skill evaluates the correct generation of American
English sounds, identifying when the user speaks with a non-native accent
and offering concrete actionable feedback to reduce it.
Word stress: This skill measures how well the student stresses the right
syllable in multi-syllable words so that the word is correctly understood by the
listener, notably in the case of homographs whose meaning depends on stress.
Fluency: This skill measures the student’s ability to speak with the right
rhythm and apply pauses where necessary. This is a suprasegmental skill that
is fundamental to bringing students to a good conversational level with native
English speakers.
Sentence intonation: This skill evaluates two different aspects of intonation.
On the one hand, it checks whether there is adequate pitch variation to make the
speech pleasant rather than monotonous. On the other hand, it inspects
whether the student places enough emphasis on the prominent words in
the sentence, conveying the intended message to the listener.
Listening: This skill refers to the student’s ability to differentiate among
the different sounds of English, which often convey different
meanings. This is currently assessed via minimal pairs (i.e. words
whose only difference is the tested phoneme).
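The minimal-pair check used for the listening skill can be sketched as follows; the simplified ARPAbet transcriptions are illustrative assumptions, not ELSA's internal representation:

```python
def is_minimal_pair(phonemes_a, phonemes_b):
    """True if two phoneme transcriptions differ in exactly one position."""
    if len(phonemes_a) != len(phonemes_b):
        return False
    differences = sum(a != b for a, b in zip(phonemes_a, phonemes_b))
    return differences == 1

# "light" /L AY T/ vs "right" /R AY T/ differ only in the initial consonant
print(is_minimal_pair(["L", "AY", "T"], ["R", "AY", "T"]))  # prints: True
```

A listening exercise then asks the student to pick which member of such a pair was spoken, isolating exactly one phoneme contrast at a time.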
In the ELSA app, we combine our proprietary speech analysis
technology with a unique scoring system and a gamified learning
experience to evaluate the user’s input and give accurate,
actionable feedback through engaging, interactive lessons.
When the student starts using the application, they are invited to take a placement test that
evaluates their overall English proficiency. The assessment highlights to the student what
English mistakes they are making and creates a learning path to improve proficiency. The ELSA
proficiency score shown to the user is a combination of the scores obtained in each of the skills
outlined above. The ELSA interface is also designed to support the learning journey. The
home screen features planets created as part of the core user experience; each planet
represents an English phoneme, helping users build their pronunciation at the atomic level.
For example, the assessment test evaluates the user’s pronunciation of all American English
sounds and gives a score for each planet at the end. Once the result is generated, the planets
are reordered on the home screen with the highest score at the top and the lowest at the bottom.
This experience allows users to easily access the sounds they want to practice first. Finally,
ELSA offers multiple types of lessons in the form of games that train the different skills above.
There are games that focus on a single skill (e.g. listening and word stress games) and others
that train multiple skills, like the conversation games, which evaluate pronunciation, intonation,
and fluency. This gamified approach can boost learners’ performance by up to 90%, according
to a 2020 study published on ScienceDirect.
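The score combination and planet ordering described above could be sketched as follows; the equal weighting, sample scores, and phoneme labels are assumptions for illustration, as ELSA's actual formula is proprietary:

```python
# Illustrative sketch only: equal weighting and sample scores are assumptions.
SKILLS = ["pronunciation", "word_stress", "fluency", "intonation", "listening"]

def overall_score(skill_scores, weights=None):
    """Combine per-skill scores (0-100) into a single proficiency score."""
    weights = weights or {s: 1 / len(SKILLS) for s in SKILLS}
    return sum(skill_scores[s] * weights[s] for s in SKILLS)

scores = {"pronunciation": 70, "word_stress": 80, "fluency": 60,
          "intonation": 75, "listening": 85}
print(round(overall_score(scores)))  # prints: 74

# Planets reordered with the highest-scoring phoneme at the top of the screen
planet_scores = {"IY": 92, "IH": 55, "TH": 40, "R": 78}
ordered = sorted(planet_scores, key=planet_scores.get, reverse=True)
print(ordered)  # prints: ['IY', 'R', 'IH', 'TH']
```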
2. THE BENEFITS OF ASR

2.1 Interactive learning
One-on-one interaction is known to be the best way for language learners to improve their
conversational skills (Strik, Neri and Cucchiarini, 2011). This practice gives them instant
corrective feedback, not only keeping the knowledge fresh but also developing a natural flow
and vocabulary suitable for real-life interaction.
However, organizing one-on-one tutoring with trained instructors is complicated for students as
well as institutions. The cost and logistics become substantial over time, not to mention the
need to customize the curriculum to each student, which may even compromise
teaching quality.
The ELSA Speak App may offer a new approach to personal tutoring. Our ASR provides
immediate feedback on speech input. The built-in library of 120+ daily-conversation topics
helps students develop vocabulary naturally, and all learning curricula are tailored to individual
needs by the ELSA A.I. coach.
2.2 Pronunciation improvement
Recent studies have shown that, although ignored in the past, pronunciation plays an important
role in effective communication and can improve the cognitive level of learners (Strik, 2009). Our
pronunciation scoring is benchmarked against the standardized IELTS speaking band, and the app
produces a native-accent version of the sample sentence. The student can thus learn with
a quality similar to a teacher’s lesson or, better, to a conversation with native speakers. Pronunciation
does not simply mean correctly producing each phoneme; it also covers listening
comprehension. Many students report that interpreting foreign accents is a common challenge
(Fathi Sidig Sidgi and Jelani Shaari, 2017). As illustrated in section 1, ELSA technology
can detect and identify accent issues and design training accordingly.
An independent study conducted at the University of Yogyakarta (Indonesia) among students of
the English Department saw a 30% increase in average pronunciation scores after 3 rounds of
practice with the ELSA Speak App. The research also reflected positive feedback from users on
the learning experience (Kholis, 2021).
2.3 Confidence boost & experience enhancement
No longer bound to the classroom setting and schedule, the student can now learn anytime,
anywhere, at their own pace, with continual access to additional learning materials such as
visualizations and recordings. According to Golonka et al. (2012), learners often report favorable
experiences with A.I.-assisted language training and more motivation to use English in real life.
Bodnar et al. (2017) even demonstrated that receiving corrections from an ASR-based system
had no negative impact on learners’ enjoyment, willingness to practice, or self-efficacy. All in
all, the technology offers a private, stress-free learning environment that ultimately improves
learners’ confidence and overall learning experience.
The study by Kholis mentioned above and a similar study conducted by Darsih et al.
(2021) both reported over 85% positive feedback on the ELSA app.
A 2-month pilot program with our EdTech partner saw
the attendance rate increase 4x and total engagement
reach 8x the previous levels. The organization saved a
total of 359 hours of teachers’ time.
Assuming a $20-40 per hour salary, ELSA is projected
to help save $400,000 - $800,000 per year per 1,000
students.
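As a sanity check, the projected range follows directly from the hourly-rate assumption; the ~20,000 teacher-hours saved per year per 1,000 students is our back-calculation from the stated dollar range, not a figure from the pilot data:

```python
# Back-of-the-envelope check of the projection above (assumed inputs).
hours_per_year_per_1000_students = 20_000  # implied by the stated dollar range

for hourly_rate in (20, 40):
    savings = hours_per_year_per_1000_students * hourly_rate
    print(f"${savings:,} per year per 1,000 students at ${hourly_rate}/hour")
```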
Learners also reported improved confidence, and ELSA’s rating on the app stores was
maintained.
REQUEST A DEMO:
https://elsaspeak.com/en/english-for-companies/demo
A 2021 report from J’son & Partners Consulting estimates that language learning constitutes a
quarter of the EdTech market, with English taking up 80% of that segment. According to data
from the same report, the global EdTech market was 268 billion USD in 2021, implying 44 billion
USD for the English segment. On the supply side, nearly 20 billion USD of private funds was
poured into learning-technology companies during the first half of 2021, according to Metaari’s
Chief Researcher. This corresponds well with a reported 4.1 billion USD invested in artificial
intelligence development for the sector, per the A.I. Index Report 2021 by Stanford University.
That is to say, EdTech for English has been, and will continue to be, a target of innovation and
investment. This paper has discussed how automated spoken language technology will
be key to the development of communicative language training. Educational
institutions and EdTech companies alike are looking for ways to integrate spoken language
technology into their existing programs to enhance training effectiveness and the learner
experience.
CONCLUSION
Proudly equipped with our AI-powered language assessment
technology, ELSA is ready to take your solution to the next
level. Book a consultation today to learn how we can help you.