A system to generate speech to text in real-time
Kavyashree Shankar
Illinois Institute of Technology
Chicago, US
kshankar1@hawk.iit.edu
Saptarshi Chatterjee
Illinois Institute of Technology
Chicago, US
schatterjee@hawk.iit.edu
Abstract - A real-time speech-to-text conversion system
converts spoken words into text. Speech-to-text technology
converts audio to text by applying powerful neural network
models. It has a number of applications for users with and
without disabilities. Speech-to-text has been used for voice
search, to help writers boost their productivity, and to
provide alternate access to a computer for individuals with
physical impairments. Other applications include speech
recognition for foreign language learning, voice-activated
products for the blind, and many familiar mainstream
technologies. It is a driving force behind the success of
new-age voice-controlled speakers like Amazon Echo and
Google Home.
Cloud solutions like Google Cloud Speech-to-Text, Amazon
Transcribe, and IBM Watson Speech to Text effectively convert
audio and voice into written text, but there are scalability
challenges in using them effectively. In this project, we
tried to address some of those challenges and tested which
configurations and parameters yield the best results for such
a system.
Keywords - Speech-to-Text, voice search, Amazon Echo,
Google Home
1. INTRODUCTION.
In this article, we propose a scalable system architecture
for generating text from speech input in real time. We also
publish our experimental results and performance metrics for
the different technologies used as the building blocks of
this architecture.
We are not trying to build a new speech-to-text application
from scratch. Our objective is to propose an architecture for
such a system and to report how it performs under various
configurations.
2. RELATED WORK, CHALLENGES AND HOW WE
ADDRESSED THE CHALLENGES.
A lot of work has already been done in this field. Cloud
services have become fairly good at converting audio and
voice into written text using machine learning algorithms
such as neural networks, but a system built on them still
faces scalability challenges. The following challenges drove
most of the architecture decisions while designing such a
system:
A. Selection of scalable system design patterns.
If we push streaming audio directly to the web server, the
system is bound to choke under heavy load: the CPU will be
busy processing previous inputs, new inputs will wait to be
consumed, and eventually the system will start returning 5XX
errors. So we need to store the audio file provisionally
until an available process can pick it up and start
processing it.
We researched several scalable system design patterns [2]
and found the "event-based architecture" model the most
suitable for this problem. Event-based architecture supports
several communication styles:
● Publish-subscribe
● Broadcast
● Point-to-point
The publish-subscribe communication style decouples sender
and receiver and facilitates asynchronous communication.
Event-driven architecture (EDA) promotes the production,
detection, consumption of, and reaction to events [3].
The main advantage of this architecture is that its
components are loosely coupled.
So for this application, we push the audio files to a
distributed queue (Kafka), and a multithreaded Spring Boot
application consumes those audio files from the queue.
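The decoupling described in this subsection can be illustrated with a small in-process sketch. It is only an analogy under stated assumptions: Python's queue.Queue stands in for the Kafka topic, worker threads stand in for the Spring Boot consumers, and fake_transcribe is a hypothetical placeholder for the speech-to-text call. None of these names come from the actual system.

```python
import queue
import threading


def fake_transcribe(audio: bytes) -> str:
    # Hypothetical stand-in for a slow speech-to-text request.
    return f"transcript of {len(audio)} bytes"


def run_pipeline(clips, num_workers=2):
    """Buffer clips in a queue and let worker threads drain it."""
    buffer = queue.Queue()   # plays the role of the Kafka topic
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            clip = buffer.get()
            if clip is None:          # sentinel: no more work for this worker
                buffer.task_done()
                return
            text = fake_transcribe(clip)
            with lock:
                results.append(text)
            buffer.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # The "web server" side only enqueues; it never waits for transcription.
    for clip in clips:
        buffer.put(clip)
    for _ in threads:
        buffer.put(None)
    buffer.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([b"\x00" * 10, b"\x00" * 20, b"\x00" * 30]))
```

The order of the results depends on thread scheduling, which mirrors the fact that the sender does not know when, or by which consumer, a given clip will be processed.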
B. Selection of storage platform for provisional storage
of audio file.
We need to store the audio file in a database or storage
system until a consumer picks it up and starts processing
it. Audio files need to be stored as BLOB data. We looked at
several databases and found that a NoSQL database like
MongoDB performs better at storing BLOB data than
traditional RDBMS systems [4].
However, we found that storing the audio files in MongoDB
would not be a good choice, because:
a. The maximum supported document size is 16 MB, and
large sound files can easily exceed that limit.
b. If we store the audio in MongoDB, the consumer
application needs to poll the DB constantly to check
whether there is a new file. Increasing the polling
frequency makes detection of a new file nearly
instantaneous, but it also increases the system
workload. So there is a tradeoff between workload and
response time.
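The polling tradeoff in point b can be made concrete with a little arithmetic. The sketch below assumes that files arrive at uniformly random moments and that each poll costs one database query; the function name and the chosen intervals are illustrative only.

```python
def polling_tradeoff(interval_s: float, window_s: float = 3600.0):
    """Return (queries per window, mean detection delay in seconds)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    queries = window_s / interval_s
    # A file arriving at a uniformly random moment waits, on average,
    # half a polling interval before the next poll notices it.
    mean_delay = interval_s / 2.0
    return queries, mean_delay


for interval in (0.1, 1.0, 10.0):
    queries, delay = polling_tradeoff(interval)
    print(f"poll every {interval:>4}s: {queries:>7.0f} queries/hour, "
          f"mean delay {delay:.2f}s")
```

Halving the interval halves the expected delay but doubles the query load, whether or not any new audio has arrived. A push-based queue avoids paying that cost while idle.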
So we decided to use Kafka as the provisional storage
platform for audio files. In the benchmarks reported in [5],
Kafka also achieves much higher throughput than other
messaging platforms like ActiveMQ or RabbitMQ.
C. Storing Large files in Kafka
Kafka was initially designed as a distributed messaging
system for log processing, so its default maximum message
size is about 1 MB. A single-channel .wav file longer than
10 seconds easily exceeds 1 MB, so we had to override the
default size-limit properties to support large messages.
These values are specified in bytes, and with our overrides
the system could easily handle messages of about 40 MB.
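A rough size estimate shows why the default limit is too small. The sketch below assumes uncompressed 16-bit PCM at 44.1 kHz with a 44-byte WAV header; the two limits are round-number assumptions (about 1 MB and about 40 MB), not the exact values configured in our system. Kafka settings that typically bound message size include message.max.bytes on the broker, max.request.size on the producer, and max.partition.fetch.bytes on the consumer.

```python
WAV_HEADER_BYTES = 44  # size of the canonical PCM WAV header


def wav_size_bytes(seconds, sample_rate=44_100, channels=1, bits_per_sample=16):
    """Approximate size of an uncompressed PCM .wav file."""
    bytes_per_second = sample_rate * channels * bits_per_sample // 8
    return WAV_HEADER_BYTES + int(seconds * bytes_per_second)


DEFAULT_LIMIT = 1_000_000        # assumed: roughly Kafka's ~1 MB default
RAISED_LIMIT = 40 * 1024 * 1024  # assumed: a ~40 MB overridden limit

for seconds in (5, 10, 12, 60):
    size = wav_size_bytes(seconds)
    print(f"{seconds:>3}s mono: {size:>9,} bytes  "
          f"fits default: {size <= DEFAULT_LIMIT}  "
          f"fits raised: {size <= RAISED_LIMIT}")
```

Under these assumptions a mono clip crosses the 1 MB mark a little past 11 seconds, and a stereo clip of the same length is twice as large.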
D. Multithreaded Kafka consumer
Since multiple producers push to a multi-partition Kafka
topic, we also need a multithreaded consumer to keep the
system robust and balanced. Previous work [6] has been done
on implementing a multithreaded Kafka consumer using
kafka-python. However, the kafka-python consumer is not
thread safe and does not support multithreading.
To overcome this challenge, we decided to use Spring-Kafka
along with our multithreaded Spring Boot application, which
supports multithreaded consumption safely and gracefully.
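How much parallelism a multithreaded consumer can reach depends on the partition count, because within one consumer group Kafka assigns each partition to at most one consumer. The sketch below illustrates this with a plain round-robin assignment. It is a simplification of Kafka's real assignors, and assign_partitions is a hypothetical helper, not part of any Kafka client.

```python
def assign_partitions(num_partitions: int, num_threads: int):
    """Round-robin partitions over the consumer threads of one group."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    assignment = {t: [] for t in range(num_threads)}
    for p in range(num_partitions):
        assignment[p % num_threads].append(p)
    return assignment


for threads in (1, 2, 4):
    plan = assign_partitions(num_partitions=2, num_threads=threads)
    idle = [t for t, parts in plan.items() if not parts]
    print(f"{threads} thread(s): {plan}  idle threads: {idle}")
```

With a two-partition topic, any consumer threads beyond the second receive no partitions and stay idle, so the partition count should be chosen with the intended consumer concurrency in mind.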
3. TECHNOLOGIES USED IN SYSTEM
IMPLEMENTATION
A. Apache Kafka - Apache Kafka is a distributed
streaming platform. A streaming platform has three
key capabilities: 1) Publish and subscribe to
streams of records, similar to a message queue or
enterprise messaging system. 2) Store streams of
records in a fault-tolerant durable way. 3) Process
streams of records as they occur.
We use Kafka as our transient storage for audio
files. Related work [6] shows how, with the default
message size configuration overridden, Kafka can
be used effectively to store BLOB files, and [5]
shows that it is faster than other messaging
platforms.
B. MongoDB - MongoDB is an open-source,
document database designed for ease of
development and scaling.
We use MongoDB to store the transcripts for
future reference. Related work affirms that
MongoDB performs better than an RDBMS for
unstructured or semi-structured data [4].
Ease of use was another key reason for our
choice.
C. Google Cloud Speech-to-Text - A speech recognition
cloud service that enables developers to convert
audio to text by applying powerful neural network
models in an easy-to-use API.
For this experiment we also tried Amazon
Transcribe and IBM Watson Speech to Text, and
found that Google Speech-to-Text gives more
accurate results with lower median latency
than the other two platforms.
However, Amazon Transcribe supports the .mp3 file
format and allows the user to customize the
vocabulary, which we found was not supported by
the Google Speech-to-Text service.
D. Spring Boot - Spring Boot is a Spring framework
module that adds RAD (Rapid Application
Development) features to the Spring framework.
We used Spring along with Tomcat as our
application server to process client requests. We
also use Spring-Kafka so that Spring components
can work as Kafka producers and consumers.
E. Sockets - a JavaScript library for real-time web
applications. It enables real-time, bi-directional
communication between web clients and servers.
F. HTML5 Web Audio API - The Web Audio API
provides a powerful and versatile system for
controlling audio on the Web, allowing developers
to choose audio sources, add effects to audio, create
audio visualizations, apply spatial effects (such as
panning) and much more.
4. PROPOSED SYSTEM ARCHITECTURE AND DESIGN
A. We use the HTML5 MediaStream Recording API
to record the audio inside the browser.
B. The browser posts the audio as BLOB data, using Ajax, to
a Spring Boot application running on a Tomcat server.
C. A Spring controller receives the audio data and
pushes it to the Kafka topic recieved_sound.
D. A Kafka consumer receives an event that new audio
is available, consumes the audio, and streams it to the
Google Speech-to-Text API. When the transcript is
available, the consumer pushes it to the topic
‘transcriptToClient’.
E. A service method calls the Google Speech-to-Text API
to get the transcript from Google.
F. When the transcript is available in the
‘transcriptToClient’ topic, a consumer picks it up and
stores it in MongoDB for future reference. It also pushes
the transcript to an open socket connection initiated by
the client.
G. The client, which is listening on the open socket,
receives the transcript and displays it in the browser.
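The flow of steps B to G can be traced with a single-threaded, in-memory sketch. It makes several assumptions for illustration: deques stand in for the two Kafka topics, plain lists stand in for the MongoDB collection and the client's socket, and fake_speech_to_text is a hypothetical placeholder for the cloud request. Only the two topic names are taken from the design above.

```python
from collections import deque

topics = {"recieved_sound": deque(), "transcriptToClient": deque()}
transcript_store = []   # stands in for the MongoDB collection
client_inbox = []       # stands in for the client's open socket


def fake_speech_to_text(audio: bytes) -> str:
    # Hypothetical placeholder for the cloud speech-to-text request.
    return f"<{len(audio)} bytes transcribed>"


def controller_receive(audio: bytes) -> None:
    # Steps B-C: the controller only publishes the audio and returns.
    topics["recieved_sound"].append(audio)


def audio_consumer_step() -> None:
    # Steps D-E: consume one clip, transcribe it, publish the transcript.
    audio = topics["recieved_sound"].popleft()
    topics["transcriptToClient"].append(fake_speech_to_text(audio))


def transcript_consumer_step() -> None:
    # Steps F-G: persist the transcript and push it to the client.
    text = topics["transcriptToClient"].popleft()
    transcript_store.append(text)
    client_inbox.append(text)


controller_receive(b"\x00" * 1024)
audio_consumer_step()
transcript_consumer_step()
print(client_inbox)
```

Each stage touches only the topic before it and the topic after it, which is what allows the stages to be scaled or replaced independently.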
5. RESULTS AND THEIR ANALYSIS.
A. Comparison of cloud services to generate text from speech
We tried Google Cloud Speech-to-Text, Amazon Transcribe,
and IBM Watson Speech to Text in our experiment. Google
Cloud Speech-to-Text performed best of the three: both its
error rate and its latency were significantly lower. Its
average latency was 6 sec for files smaller than 1 MB.
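One common way to quantify a transcript's error rate is the word error rate (WER): the word-level edit distance between a reference transcript and the service's output, divided by the number of reference words. The sketch below is a generic illustration of that metric, offered as an assumption about how such a comparison could be scored. It is not a description of the measurement procedure used in this experiment, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
```

In this invented example one word is dropped and one is substituted, so two errors against five reference words give a WER of 0.4.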
B. System load testing
We tested our system with multiple parallel client requests
against a two-partition Kafka topic with 2 brokers, and
recorded the following metrics on macOS Mojave (MacBook
Pro 2017, 2.3 GHz Intel Core i5, 8 GB 2133 MHz LPDDR3).
[Figure: producer metrics]
[Figure: consumer metrics]
C. Effects of audio encoding formats on generated transcripts
We recorded mono sound (1 channel), having noticed that
recording 2 channels doubles the file size. We used FLAC,
LINEAR16, and MULAW encodings separately with the .wav file
format.
FLAC and LINEAR16 performed better, with almost
error-free output when used for text generation.
FLAC and LINEAR16 are lossless encodings, whereas MULAW is
lossy. MULAW is an 8-bit PCM encoding in which the
sample's amplitude is modulated logarithmically rather
than linearly. As a result, MULAW reduces the effective
dynamic range of the audio thus compressed. So a higher
error rate was expected for MULAW.
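The logarithmic modulation can be illustrated with the textbook continuous mu-law formula, F(x) = sgn(x) ln(1 + mu|x|) / ln(1 + mu) with mu = 255. The sketch below is an approximation for intuition only: real MULAW codecs use a segmented, table-based variant of this curve, and the uniform 8-bit quantization step here is a simplifying assumption.

```python
import math

MU = 255  # companding constant used by 8-bit mu-law


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def roundtrip_8bit(x: float) -> float:
    """Compress, quantize to 8 bits (256 codes), then expand again."""
    levels = 255  # 256 codes span [-1, 1] in 255 equal steps
    code = round((mulaw_compress(x) + 1) / 2 * levels)
    return mulaw_expand(code / levels * 2 - 1)


for x in (0.001, 0.01, 0.1, 0.9):
    y = roundtrip_8bit(x)
    print(f"x={x:<6} decoded={y:.5f} abs error={abs(x - y):.5f}")
```

Quiet samples survive the round trip with small absolute error, while loud samples are reconstructed more coarsely. That loss of detail is the kind of degradation a recognizer may be sensitive to, whereas LINEAR16 keeps every 16-bit sample unchanged.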
6. DEMO LINK.
HTTPS://WWW.YOUTUBE.COM/WATCH?V=4FZ2EOAZ8P8&T=6S
7. SOURCE CODE -
HTTPS://GITHUB.COM/SAP9433/SPEECHTOTEXT
REFERENCES
[1] Yadav, Devender. "Save Large Files in MongoDB
Using Kundera - DZone Database." Dzone.com. July 24,
2016. Accessed December 01, 2018.
https://dzone.com/articles/save-large-files-in-mongodb-using-kundera
[2] Ho, Ricky. "Scalable System Design Patterns - DZone
Database." Dzone.com. October 18, 2010. Accessed
December 01, 2018.
https://dzone.com/articles/scalable-system-design
[3] "Distributed Systems Architecture." Accessed
November 14, 2018.
http://cse.csusb.edu/tongyu/courses/cs660/notes/distarch.php
[4] Kolonko, Kamil. "Performance comparison of the most
popular relational & non-relational database management
systems." IEEE.
http://www.diva-portal.org/smash/get/diva2:1199667/FULLTEXT02
[5] Kreps, Jay, Neha Narkhede, and Jun Rao. "Kafka: a
Distributed Messaging System for Log Processing."
ACM 978-1-4503-0652-2/11/06. Accessed November 14, 2018.
http://notes.stephenholiday.com/Kafka.pdf
[6] Shilo, Guy. "Moving binary data with Kafka – a more
realistic scenario." Accessed November 14, 2018.
http://www.idata.co.il/2017/12/moving-binary-data-with-kafka-a-more-realistic-scenario/
[7] Momtselidze, Nodar. "Apache Kafka - Real-time
Data Processing." Accessed November 14, 2018.
https://jtst.ibsu.edu.ge/jms/index.php/jtst/article/view/80
[8] Misal, Disha. "Google Speech Vs Amazon Transcribe:
The War Of Speech Technology." Analytics India
Magazine. October 22, 2018. Accessed December 05, 2018.
https://www.analyticsindiamag.com/google-speech-vs-amazon-transcribe-the-war-of-speech-technology/

System to generate speech to text in real time

  • 1. A system to generate speech to text in real-time Kavyashree Shankar Illinois Institute of Technology Chicago, US kshankar1@hawk.iit.edu Saptarshi Chatterjee Illinois Institute of Technology Chicago, US schatterjee@hawk.iit.edu Abstract​— A real-time speech to text conversion system converts the spoken words into text . Speech-to-Text technology enables us to convert audio to text by applying powerful neural network models. It has a number of applications for users with and without disabilities. Speech-to-text has been used for voice search, help writers boost their productivity, and to provide alternate access to a computer for individuals with physical impairments. Other applications include speech recognition for foreign language learning, voice-activated products for the blind and many familiar mainstream technologies. It is a driving force behind the success of new age voice-controlled speakers like Amazon Echo and Google Home. While cloud solutions like Google Cloud Speech-to-Text, Amazon Transcribe, IBM watson Speech to Text etc effectively convert audio and voice into written text, but there are some scalability challenges in effectively using them. In this project, we tried to address some of those challenges and also tested what are the configurations and parameters yield best results for such a system. Keywords — ​Speech-to-Text​, ​voice search​, ​Amazon Echo, Google Home 1. INTRODUCTION. In this article, we are proposing a scalable system architecture to generate text from speech input in real time. We also are publishing our experimental results and performance metrics for different technologies used as the building blocks of this architecture. Here we are not trying to build a new Speech to Text generation application from scratch , but our objective is to propose an architecture of such a system and how it performs under various configurations. 2. RELATED WORK, CHALLENGES AND HOW WE ADDRESSED THE CHALLENGES. 
A lot of work is already been done in this field. Cloud services became fairly good at converting audio and voice into written text using Machine Learning Algorithm like Neural Network, but still, such a system has scalability challenges in effectively using them. Following are few challenges that drives most of the architecture decisions while designing such a system - A. Selection of scalable system design patterns. If we push streaming audio directly to web server, then such a system is bound to choke under heavy load, as CPU will be busy processing previous inputs and new inputs will be waiting to be consumed and eventually system will start throwing 5XX errors . So we need to provisionally store the audio file , before an available process can pick it up and start processing . We researched several Scalable System Design Patterns​[2] and found out " event-based architecture" model is most suitable to address this problem .Event-based architecture supports several communication styles: ● Publish-subscribe ● Broadcast ● Point-to-point Publish-subscribe communication style decouples sender & receiver and facilitates asynchronous communication. Event-driven architecture (EDA) promotes the production, detection, consumption of, and reaction to events[3]. The main advantage of this architecture is that they are loosely coupled. So for this application, we are pushing the audio files to a distributed queue (Kafka ) and a multithreaded Spring Boot Application consumes those audio files from the queue.
  • 2. B. Selection of storage platform for provisional storage of audio file. We need to store the audio file in a DataBase / Storage system before a consumer picks that up and starts processing . Audio files need to be stored as BLOB data . We looked at several DataBases and found NoSql database like MongoDB performs better at storing BLOB data compared to traditional RDBMS systems​[4]​ . However, we found that storing the Audio files in MongoDB will not be a good choice , because - a. Maximum document size supported is 16MB and large sound files can easily exceed that range . b. If we store the audio in MongoDB , then consumer application need to constantly poll the DB to check if there is a new file. If we increase the polling frequency then this new event detection is instantaneous, but it increases the system workload. So there is a tradeoff between workload and response time. So , we ​decided to use Kafka as provisional storage platform​ for Audio files. Kafka performs drastically faster than other point-to-point messaging platform like Activemq or Rabbitmq​[5]. C. Storing Large files in Kafka Kafka was initially designed as Distributed Messaging System for Log Processing. So default max message size ~ 1MB . A > 10 Sec 1 channel .wav file easily exceeds 1MB in size. So we need to override default configs in order to support large messages. Following properties had to be changed to support this. These values are in bytes , hence out system could easily handle ~40MB message. D. Multithreaded kafka consumer
When multiple producers push to Kafka on a multi-partition topic, we also need a multithreaded consumer to keep the whole system robust and balanced. Previous work [6] implemented a multithreaded Kafka consumer using kafka-python. But kafka-python is not thread safe and does not support multithreading. To overcome this challenge, we decided to use Spring-Kafka along with our multithreaded Spring Boot application, which is thread safe and gracefully supports multithreading.

3. TECHNOLOGIES USED IN SYSTEM IMPLEMENTATION
A. Apache Kafka - Apache Kafka is a distributed streaming platform. A streaming platform has three key capabilities:
1) Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
2) Store streams of records in a fault-tolerant, durable way.
3) Process streams of records as they occur.
We are using Kafka as our transient storage for audio files. Related work [6] shows how, with the default message size configuration overridden, it can effectively be used as storage for BLOB files, and how it is faster than other messaging platforms [5].
B. MongoDB - MongoDB is an open-source document database designed for ease of development and scaling. We are using MongoDB to store the transcripts for future reference. Related work in this area affirms that MongoDB performs better than an RDBMS for unstructured and semi-structured data [4]. Ease of use was also one of the key reasons for our choice.
C. Google Cloud Speech-to-Text - A speech recognition cloud service that enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. For this experiment we also tried other services, namely Amazon Transcribe and IBM Watson Speech to Text, and found that Google Speech-to-Text gives more accurate results with lower median latency than the other two platforms.
However, Amazon Transcribe supports the .mp3 file format and allows the user to customize the vocabulary, which is not supported in the Google Speech-to-Text service.
D. Spring Boot - Spring Boot is a Spring framework module that provides RAD (Rapid Application Development) features to the Spring framework. We used Spring along with Tomcat as our application server to process client requests. We are also using Spring-Kafka so that Spring components can work as Kafka producers and consumers.
E. Sockets - A JavaScript library for real-time web applications. It enables real-time, bi-directional communication between web clients and servers.
F. HTML5 Web Audio API - The Web Audio API provides a powerful and versatile system for controlling audio on the Web, allowing developers to choose audio sources, add effects to audio, create audio visualizations, apply spatial effects (such as panning) and much more.

4. PROPOSED SYSTEM ARCHITECTURE AND DESIGN
A. We use the HTML5 MediaStream Recording API to record the audio inside a browser environment.
B. The audio is posted as BLOB data using Ajax to a Spring Boot application running on a Tomcat server.
C. A Spring controller receives the audio data and pushes it to the Kafka topic recieved_sound.
D. A Kafka consumer receives an event that new audio is available, consumes the audio and streams it to the Google Speech-to-Text API. When the transcript is available, it pushes it to the topic ‘transcriptToClient’.
E. A service method calls the Google Speech-to-Text API to get the transcript from Google.
F. When the transcript is available in the ‘transcriptToClient’ topic, a consumer picks it up and stores it in MongoDB for future reference. It also pushes the transcript to an open socket connection initiated by the client.
G. The client, which is listening on an open socket, receives the transcript and displays it in the browser.

5. RESULTS AND THEIR ANALYSIS.
A. Comparison of cloud services to generate text from speech
We tried Google Cloud Speech-to-Text, Amazon Transcribe and IBM Watson Speech to Text in our experiment. We found that Google Cloud Speech-to-Text performs best of the three: both its error rate and its latency are significantly lower. The average latency is 6 sec for files smaller than 1MB.
B. System load testing
We tested our system with multiple parallel client requests and a dual-partition Kafka topic with 2 brokers, and recorded the following metrics on macOS Mojave (MacBook Pro 2017, 2.3 GHz Intel Core i5, 8 GB 2133 MHz LPDDR3).
[Figure: producer metrics]
[Figure: consumer metrics]
C. EFFECTS OF AUDIO ENCODING FORMATS ON GENERATED TRANSCRIPTS.
We recorded mono sound (1 channel) and noticed that recording 2 channels doubles the file size. We used FLAC, LINEAR16 and MULAW encoding separately with the .wav file format. FLAC and LINEAR16 perform better, with almost error-free output when used for text generation. FLAC and LINEAR16 are lossless encodings, whereas MULAW is not. MULAW is an 8 bit PCM encoding, where the sample's amplitude is modulated logarithmically rather than linearly. As a result, MULAW reduces the effective dynamic range of the audio thus compressed. So a higher error rate was expected for MULAW.

6. DEMO LINK.
HTTPS://WWW.YOUTUBE.COM/WATCH?V=4FZ2EOAZ8P8&T=6S

7. SOURCE CODE - HTTPS://GITHUB.COM/SAP9433/SPEECHTOTEXT

REFERENCES
[1] Yadav, Devender. "Save Large Files in MongoDB Using Kundera - DZone Database." Dzone.com. July 24, 2016. Accessed December 01, 2018. https://dzone.com/articles/save-large-files-in-mongodb-using-kundera.
[2] Ho, Ricky. "Scalable System Design Patterns - DZone Database." Dzone.com. October 18, 2010. Accessed December 01, 2018. https://dzone.com/articles/scalable-system-design
[3] Distributed Systems Architecture. Accessed November 14, 2018. http://cse.csusb.edu/tongyu/courses/cs660/notes/distarch.php
[4] Kamil Kolonko, "Performance comparison of the most popular relational & non-relational database management systems". IEEE. http://www.diva-portal.org/smash/get/diva2:1199667/FULLTEXT02
[5] Jay Kreps, Neha Narkhede, Jun Rao, "Kafka: a Distributed Messaging System for Log Processing". Accessed November 14, 2018. ACM 978-1-4503-0652-2/11/06 http://notes.stephenholiday.com/Kafka.pdf
[6] Guy Shilo, "Moving binary data with Kafka – a more realistic scenario". Accessed November 14, 2018. http://www.idata.co.il/2017/12/moving-binary-data-with-kafka-a-more-realistic-scenario/
[7] Nodar MOMTSELIDZE, "Apache Kafka - Real-time Data Processing". Accessed November 14, 2018. https://jtst.ibsu.edu.ge/jms/index.php/jtst/article/view/80
[8] Misal, Disha. "Google Speech Vs Amazon Transcribe: The War Of Speech Technology." Analytics India Magazine. October 22, 2018. Accessed December 05, 2018. https://www.analyticsindiamag.com/google-speech-vs-amazon-transcribe-the-war-of-speech-technology/.