A Distributed System for Recognizing Home Automation
Commands and Distress Calls in the Italian Language
Emanuele Principi¹, Stefano Squartini¹, Francesco Piazza¹, Danilo Fuselli², and Maurizio Bonifazi²
¹ A3Lab, Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
² FBT Elettronica Spa, Via Paolo Soprani 1, Recanati (MC), Italy
e.principi@univpm.it
1. Introduction
The ageing of the population in the most developed countries is posing a great challenge to social healthcare systems [1].
Population Ageing
Ambient Assisted Living: use of technologies for supporting and assisting people in
their own homes.
Solution: Ambient Assisted Living
• Current technologies employ application-dependent sensors, such as vital sensors, cameras, or microphones.
• Recent studies demonstrated that people prefer non-invasive sensors, such as microphones [2, 3].
• Several works in the literature recognize emergencies based on audio signals only [2, 4, 5, 6].
Different Approaches to the Problem
A new system that integrates technologies for monitoring the acoustic environment, hands-free communication, and home automation services.
Idea
2. The proposed system
• The system is composed of two functional units:
– Local Multimedia Control Unit (LMCU): recognizes distress calls and home automation commands, and manages the hands-free communication.
– Central Management and Processing Unit (CMPU): integrates home automation and content delivery services, and manages emergency states.
• Private Branch eXchange (PBX): interfaces the system with the Public Switched Telephone Network (PSTN) and analogue phone devices.
[Figure: system overview. Two LCMUs on the home network connect to the CMPU and to a PBX, which interfaces with the PSTN; smartphones reach the system through the Internet.]
Overview
[Figure: LMCU processing chain. The microphone signal passes through a VAD and interference cancellation before speech recognition in the "monitoring" state; the AEC and VoIP stack handle the "calling" state; known sources (television, radio, etc.) feed the interference canceller.]
• One microphone acquires the input speech signal.
• A Voice Activity Detector (VAD) selects the audio segments that contain the voice signal.
• An interference cancellation module removes the sounds coming from known audio sources (e.g., from a television or a radio).
• Two operating states (see the sketch after this box):
– monitoring: the speech recognition engine is active and detects home automation commands and distress calls.
– calling: occurs when a phone call is ongoing; the speech recognition engine is disabled, and the acoustic echo cancellation and the VoIP stack are active.
• Hands-free communication is guaranteed by a VoIP stack and an AEC.
• Hardware platform: based on ARM processors and embedded Linux.
The Local Multimedia Control Unit
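The poster does not detail the LMCU firmware; as a minimal illustration of the two-state behaviour described above, the following Python sketch (all class and method names are hypothetical) routes each audio frame either to the recognizer or to the VoIP/AEC chain:

from enum import Enum, auto

class State(Enum):
    MONITORING = auto()  # ASR active, listening for commands and distress calls
    CALLING = auto()     # ASR disabled, AEC and VoIP stack active

class LMCU:
    """Minimal sketch of the LMCU control loop (hypothetical API)."""
    def __init__(self, vad, canceller, recognizer, voip):
        self.vad, self.canceller = vad, canceller
        self.recognizer, self.voip = recognizer, voip
        self.state = State.MONITORING

    def process_frame(self, mic_frame, reference_frame):
        # Known interference (e.g., TV audio) is removed in both states.
        clean = self.canceller.filter(mic_frame, reference_frame)
        if self.state is State.MONITORING:
            if self.vad.is_speech(clean):
                hypothesis = self.recognizer.decode(clean)
                if hypothesis in ("aiuto", "ambulanza"):  # distress call
                    self.voip.start_emergency_call()
                    self.state = State.CALLING
        else:
            # Calling state: echo-cancelled hands-free communication.
            self.voip.send(clean)
            if self.voip.call_ended():
                self.state = State.MONITORING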
• x(n): the far-end signal.
• s(n): the near-end speech of a local speaker.
• w(n): the room impulse response.
• d(n) = s(n) + w(n) ∗ x(n): the signal acquired by the microphone.
• e(n): the residual echo.
• Acoustic Echo Cancellation (AEC) algorithms remove the echo produced by the acoustic coupling between the loudspeaker and the microphone.
• In the system presented in this paper, the algorithm implemented in the Speex codec has been used [7] (an illustrative sketch follows this box).
• Interference cancellation: the same algorithm is employed to reduce the degrading effects of known sources of interference (e.g., a television).
Acoustic Echo & Interference Cancellation
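The deployed canceller is the frequency-domain algorithm of the Speex codec [7]. As a simpler stand-in that illustrates the same idea (estimate the echo path w(n), then subtract the estimated echo from d(n)), here is a minimal time-domain NLMS sketch:

import numpy as np

def nlms_echo_canceller(x, d, filter_len=256, mu=0.1, eps=1e-8):
    """Time-domain NLMS echo canceller (illustrative stand-in for the
    frequency-domain Speex algorithm [7]).
    x: far-end signal; d: microphone signal, d(n) = s(n) + w(n) * x(n).
    Returns e: the residual signal, ideally close to the near-end speech s(n)."""
    w_hat = np.zeros(filter_len)  # estimate of the room impulse response
    e = np.zeros(len(d))
    x_pad = np.concatenate([np.zeros(filter_len - 1), x])
    for n in range(len(d)):
        x_vec = x_pad[n:n + filter_len][::-1]  # most recent samples first
        y_hat = w_hat @ x_vec                  # estimated echo
        e[n] = d[n] - y_hat                    # residual (near-end + error)
        # Normalized LMS update of the echo-path estimate
        w_hat += mu * e[n] * x_vec / (x_vec @ x_vec + eps)
    return e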
3. Commands and Distress Calls Recognition
• Commands and distress calls recognition is performed by means of the PocketSphinx [8] speech recognition engine.
• The support for semi-continuous Hidden Markov Models and for fixed-point arithmetic makes the engine suitable for embedded devices.
The engine: PocketSphinx
• Mel-Frequency Cepstral Coefficients (MFCC) are currently employed in the majority of speech recognition systems.
• Power Normalized Cepstral Coefficients (PNCC)ᵃ have been recently proposed to improve the robustness against distortions (e.g., noise and reverberation) [9].
• The final feature vector is composed of 13 mean-normalized static coefficients and their first and second derivatives (a delta-computation sketch follows this box).
• Number of arithmetic operations: PNCC 17516, MFCC 13010.
[Figure: PNCC extraction pipeline. Speech → pre-emphasis → STFT & magnitude squared → gammatone frequency integration → medium-time processing → time-frequency normalization → mean power normalization → power-function non-linearity → DCT & mean normalization → PNCC, plus first & second derivatives.]
ᵃ A free library for PNCC extraction is available at http://www.a3lab.dii.univpm.it/projects/libpncc
Feature Extraction: Power Normalized Cepstral Coefficients
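As an illustration of how the final 39-dimensional vector is assembled, the sketch below appends first and second derivatives to the 13 mean-normalized static coefficients. The +/-2 frame regression window is an assumption, since the poster does not specify how the derivatives are computed:

import numpy as np

def add_derivatives(static, K=2):
    """static: (T, 13) static cepstral coefficients (PNCC or MFCC).
    Returns (T, 39): mean-normalized statics + first and second derivatives.
    Regression window of +/-K frames (assumed; not specified in the poster)."""
    static = static - static.mean(axis=0)  # cepstral mean normalization
    T = len(static)
    padded = np.pad(static, ((K, K), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, K + 1))
    delta = sum(k * (padded[K + k:T + K + k] - padded[K - k:T + K - k])
                for k in range(1, K + 1)) / denom
    # Second derivative: the same regression applied to the deltas
    dpad = np.pad(delta, ((K, K), (0, 0)), mode="edge")
    delta2 = sum(k * (dpad[K + k:T + K + k] - dpad[K - k:T + K - k])
                 for k in range(1, K + 1)) / denom
    return np.hstack([static, delta, delta2])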
[Figure: training pipeline. PNCC extraction feeds acoustic model training.]
APASCI Training Set: 1310 utterances, 105 minutes, 60 speakers, 50% female/male
Acoustic Model: 3 states per phone, no skip, 4 Gaussians per state, 200 senones
• Training has been performed using the APASCI corpus [10].
– It is composed of Italian speech utterances recorded in a quiet room, with a vocabulary of about 3000 words.
– Training has been performed on the training part of the speaker-independent subset.
– Acoustic model parameters have been chosen on the test part of the speaker-independent subset (860 utterances spoken by 20 males and 20 females, for a total duration of about 69 minutes).
• These values have been selected using the APASCI speaker-independent test set as development set and a word-loop grammar as language model. The obtained word recognition accuracy is 64.5%.
Acoustic Model
• The current version of PocketSphinx (version 0.8) is not able to perform rejection using generic word models [11] or garbage models as in keyword spotting [12].
• The "confidence score" associated with each result is reliable only for vocabularies of more than 100 words, a number exceeding the vocabulary size of the application scenario taken into consideration.
• The approach followed to reject out-of-grammar words is inspired by the technique of Vertanen [13].
– It consists of training a special "garbage phone" that represents a generic word phone, replacing the actual phonetic transcription of a percentage of the vocabulary words with a sequence of garbage phones.
– Then, a garbage word whose phonetic transcription is the single garbage phone is introduced in the recognition dictionary.
– Here, the garbage phone has been trained by substituting 10% of the phonetic transcriptions of the words present in the training vocabulary (see the sketch after this box).
– Words have been chosen so that each phone of the Italian language is equally represented, so that all phones are trained with the same amount of data.
Out of Grammar Rejection
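A minimal sketch of the dictionary manipulation described above, assuming a CMU/Sphinx-style pronunciation dictionary ("word PH1 PH2 ..."); the phone-balanced word selection used in the poster is simplified here to random sampling:

import random

GP = "GP"  # the garbage phone

def garble_dictionary(dic_lines, fraction=0.10, seed=0):
    """Replace the phonetic transcription of a fraction of the vocabulary
    with sequences of garbage phones, in the spirit of the Vertanen-style
    recipe [13]. dic_lines: CMU-style entries, e.g. "aiuto a j u t o".
    NOTE: the poster balances the chosen words so every Italian phone is
    equally represented; random sampling is a simplification."""
    random.seed(seed)
    chosen = set(random.sample(range(len(dic_lines)),
                               int(fraction * len(dic_lines))))
    out = []
    for i, line in enumerate(dic_lines):
        word, *phones = line.split()
        if i in chosen:
            phones = [GP] * len(phones)  # same length, garbage phones
        out.append(" ".join([word] + phones))
    # The recognition dictionary also gets a garbage word mapped to GP alone.
    out.append(f"GARBAGE {GP}")
    return out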
• Robustness to mismatches between training and testing conditions can be improved by adapting the acoustic model.
• Before using the system for the first time, the user is asked to speak a set of phonetically rich sentences, so that all phones of the Italian language are trained using the same amount of data.
• Adaptation is performed by means of Maximum Likelihood Linear Regression (MLLR); a sketch of the mean transform follows this box.
Speaker adaptation
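MLLR adapts the acoustic model by applying an affine transform to the Gaussian means, mu_hat = A mu + b, with A and b estimated from the adaptation utterances. This sketch shows only the application step, not the estimation:

import numpy as np

def apply_mllr(means, A, b):
    """Apply one MLLR regression-class transform to stacked Gaussian means.
    means: (G, D) array of mean vectors; A: (D, D); b: (D,).
    A and b are assumed to have been estimated beforehand by maximizing
    the likelihood of the user's phonetically rich adaptation sentences."""
    return means @ A.T + b

# Example: a global transform applied to a toy 2-Gaussian, 3-dimensional model.
means = np.zeros((2, 3))
A, b = np.eye(3), np.array([0.5, 0.0, -0.5])
adapted = apply_mllr(means, A, b)  # each mean shifted by b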
• A simple finite-state grammar where the word GARBAGE is mapped to the GP phone in the dictionary.
• Sequences of one or more garbage words are automatically ignored.
• Stop words, such as articles and prepositions, are also ignored by the system (see the post-processing sketch after this box).
public <SENTENCE> = <COMMAND> | <DISTRESS_CALL> | GARBAGE;
// "turn on/off the light", "raise/lower the temperature"
<COMMAND> = ((accendi|spegni) la luce) |
((alza|abbassa) la temperatura);
// "help", "ambulance"
<DISTRESS_CALL> = aiuto | ambulanza;
Language Model
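Discarding garbage words and stop words from a decoded hypothesis can be done with a small post-processing step, as in this sketch (the stop-word list and the string-based hypothesis format are assumptions):

STOP_WORDS = {"la", "il", "lo", "le", "di", "a", "per"}  # assumed list

def clean_hypothesis(hyp):
    """Drop garbage words and stop words from a decoded word sequence,
    e.g. "GARBAGE accendi la luce" -> ["accendi", "luce"]."""
    return [w for w in hyp.split()
            if w != "GARBAGE" and w.lower() not in STOP_WORDS]

assert clean_hypothesis("GARBAGE accendi la luce") == ["accendi", "luce"]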
4. The ITAAL speech corpus
• ITAAL is a new speech corpus of home automation commands and distress
calls in Italian a
.
• People read three groups of sentences in Italian: home automation commands
(e.g., “close the door”), distress calls (e.g., “help!”, “ambulance!”) and phonet-
ically rich sentences. These have been extracted from the “speaker indepen-
dent” set of the APASCI corpus and they cover all the phones of the Italian
language.
• Every sentence was spoken both in normal and shouted conditions, with the
home automation commands and the phonetically rich sentences pronounced
without emotional inflection. In contrast, people were asked to speak distress
calls as they were frightened.
a
Free samples are available at http://www.a3lab.dii.univpm.it/projects/itaal
Description
                        Commands   Distress calls   APASCI
Number of sentences     15         5                6
Total words             45         8                52
Unique words            18         6                51
Duration (minutes)      39         13               17
Repetitions             3          3                1

Number of speakers: 20
Female/male ratio: 50%
Average age: 41.70 ± 11.17
SNR (headset/distant): 51.46 dB / 34.08 dB
Details
[Figures: recording equipment and room setup.]
[Figure: distributions of fundamental frequency F0 (Hz) and active speech level (dB) for normal and shouted speech.]
Pitch and active speech level
5. Experiments
• Signals: headset and array central microphone signals of the ITAAL corpus.
• Four recognition tasks:
1. recognition of home automation commands uttered in normal speaking style
2. recognition of distress calls uttered in normal speaking style
3. recognition of home automation commands uttered in shouted speaking style
4. recognition of distress calls uttered in shouted speaking style
• Each task has been evaluated with and without adapting the acoustic model.
• Adaptation strategy: acoustic model adaptation has been performed separately for each speaker, for each talking style (normal, shouted), and for each microphone (headset, distant) on the phonetically rich sentences of ITAAL.
• Evaluation metric: Sentence Recognition Accuracy, defined as
  (number of correctly recognized sentences) / (total number of sentences);
  a small computation sketch follows this box.
Description
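The metric can be computed directly from reference/hypothesis sentence pairs, as in this minimal sketch:

def sentence_accuracy(references, hypotheses):
    """Sentence Recognition Accuracy in percent: a sentence counts as
    correct only if the whole recognized sequence matches the reference."""
    correct = sum(ref == hyp for ref, hyp in zip(references, hypotheses))
    return 100.0 * correct / len(references)

# Example: 2 of 3 sentences recognized correctly -> 66.7%.
refs = [["accendi", "luce"], ["aiuto"], ["alza", "temperatura"]]
hyps = [["accendi", "luce"], ["aiuto"], ["abbassa", "temperatura"]]
print(round(sentence_accuracy(refs, hyps), 1))  # 66.7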
             Commands            Distress calls       Average
             Normal    Shouted   Normal     Shouted
Headset
  Baseline   94.22     87.89     99.67      93.00      93.70
  MLLR       95.77     96.11     100.00     95.33      96.80
Distant
  Baseline   19.13     10.67     70.67      45.33      36.45
  MLLR       37.15     37.67     85.00      72.67      58.12

(All values are sentence recognition accuracy, %.)
Sentence recognition results
• The accuracy on the headset microphone signals exceeds 90% in each task, with the only exception being shouted commands recognition.
• In shouted speaking style, both headset and distant microphone signals exhibit a similar performance decrease.
• The performance difference between normal and shouted conditions is evident also on the distant microphone signals, and this is consistent with the results in [14].
• The high mismatch between training and testing conditions, mainly due to reverberation (the T60 of the room is 0.72 s), is the main cause of the performance reduction.
• Consider that the acoustic model parameters have been chosen using the APASCI test set, i.e., on noise- and reverberation-free signals.
Comments on “Baseline” results
• On the headset microphone, the introduction of speaker adaptation produces a significant performance improvement over the non-adapted results.
– Besides the acoustic conditions, the improvement can be attributed also to the accent difference between the APASCI corpus speakers (northern Italy) and the ITAAL ones (central Italy).
– Improvements are more significant in the shouted speaking style.
• On the distant-talking microphone, the improvement is more evident and reaches an average of 21.67%.
• Distress call recognition can be considered satisfactory enough to be employed in a real scenario.
Comments on “MLLR” results
6. Conclusions and Future Developments
• A system for emergency detection and remote assistance in a smart home has been presented.
• It is composed of multiple Local Multimedia Control Units that actively monitor the acoustic environment, and one Central Management and Processing Unit that coordinates the system.
• The acoustic environment is monitored by means of a voice activity detector, an interference removal module, and a speech recognizer based on PocketSphinx.
• The detection of a distress call automatically triggers a phone call to a contact in the address book.
• Hands-free communication is possible by means of a VoIP stack based on the SIP protocol and the Speex acoustic echo canceller.
• The system has been evaluated on ITAAL, a new speech corpus of home automation commands and distress calls in Italian.
• Recognition results showed that, with MLLR speaker adaptation, the system achieves a distress call recognition accuracy on the distant microphone of 85.00% for utterances in normal speaking style and of 72.67% for those in shouted speaking style.
Conclusions
• Algorithms will be integrated to improve the recognition accuracy on shouted speech, for example by detecting the speaking style as in [15, 16].
• Multiple microphone channels will be employed to increase the performance in distant-talking conditions.
• A subjective evaluation will be carried out to establish the effectiveness of the overall system.
Future work
References
[1] K. Giannakouris, "Ageing characterises the demographic perspectives of the European societies," Eurostat Statistics in Focus, no. 72, 2008.
[2] S. Goetze, J. Schroder, and S. Gerlach, “Acoustic monitoring and localization for social care,” Journal of Computing Science and Engineering, vol. 6, no. 1, pp. 40–50, 2012.
[3] M. Ziefle, C. Rocker, and A. Holzinger, “Medical technology in smart homes: Exploring the user’s perspective on privacy, intimacy and trust,” in Proc. of COMPSACW, Jul. 18-22
2011, pp. 410–415.
[4] M. Vacher, A. Fleury, J.-F. Serignat, N. Noury, and H. Glasson, "Preliminary evaluation of speech/sound recognition for telemedicine application in a real environment," in Proc. of Interspeech, Brisbane, Australia, Sep. 2008, pp. 496–499.
[5] M. Vacher, B. Lecouteux, and F. Portet, “Recognition of voice commands by multisource ASR and noise cancellation in a smart home environment,” in Proc. of Eusipco, Bucharest,
Romania, Aug. 27-31 2012, pp. 1663–1667.
[6] K. Drossos, A. Floros, K. Agavanakis, N.-A. Tatlas, and N.-G. Kanellopoulos, “Emergency voice/stress-level combined recognition for intelligent house applications,” in Proc. of the
132nd AES Convention, Budapest, Hungary, Apr. 26-29 2012, pp. 1–11.
[7] J.-M. Valin, “On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk,” IEEE Trans. on Audio, Speech, and Lang. Process., vol. 15, no. 3, pp.
1030–1034, 2007.
[8] D. Huggins-Daines, M. Kumar, A. Chan, A. Black, M. Ravishankar, and A. Rudnicky, “PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices,”
in Proc. of ICASSP, vol. 1, Toulouse, France, May 15-19 2006, pp. 185–188.
[9] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in Proc. of ICASSP, Kyoto, Japan, Mar. 2012, pp. 4101–4104.
[10] B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, and M. Omologo, "Automatic segmentation and labeling of English and Italian speech databases," in Proc. of Eurospeech, Berlin, Germany, Sep. 22-25 1993, pp. 653–656.
[11] I. Bazzi and J. Glass, “Modeling out-of-vocabulary words for robust speech recognition,” in Proc. of ICSLP, vol. 1, Beijing, China, 2000, pp. 401–404.
[12] E. Principi, S. Cifani, C. Rocchi, S. Squartini, and F. Piazza, “Keyword spotting based system for conversation fostering in tabletop scenarios: Preliminary evaluation,” in Proc. of
HSI, Catania, Italy, May 21-23 2009, pp. 216–219.
[13] K. Vertanen, “Baseline WSJ acoustic models for HTK and Sphinx: Training recipes and recognition experiments,” Cavendish Laboratory, University of Cambridge, Tech. Rep.,
2006.
[14] P. Zelinka, M. Sigmund, and J. Schimmel, "Impact of vocal effort variability on automatic speech recognition," Speech Communication, vol. 54, pp. 732–742, 2012.
[15] J. Pohjalainen, P. Alku, and T. Kinnunen, “Shout detection in noise,” in Proc. of ICASSP, Prague, Czech Republic, May 22-27 2011, pp. 4968–4971.
[16] N. Obin, "Cries and whispers: Classification of vocal effort in expressive speech," in Proc. of Interspeech, Portland, OR, USA, Sep. 9-13 2012, pp. 2234–2237.
Interspeech 2013, Lyon, France, 25-29 August 2013
