606                                                                                  IEEE Transactions on Consumer Electro...
I. I. Papp et al.: Hands-free Voice Communication with TV                                                                 ...

Hands free voice communication with tv



IEEE Transactions on Consumer Electronics, Vol. 57, No. 2, May 2011

Hands-free Voice Communication with TV

Istvan I. Papp, Member, IEEE, Zoran M. Šarić, Nikola Dj. Teslić, Member, IEEE

Abstract: This paper presents a system for full-duplex hands-free voice communication integrated with TV technology. The system provides comfortable conversation by using a microphone array and advanced voice processing algorithms, even with simultaneous TV usage. Signal processing includes a superdirective beamformer steered by a direction-finding module, a postprocessing module, an acoustic echo canceller, a stationary noise reduction module and automatic gain control. All processing is realized in real time on a DSP-based platform. Either GSM or VoIP can be used as the communication channel (1).

Index Terms: Microphone array, beamforming, speech enhancement, noise reduction, acoustic echo cancellation, direction of arrival, automatic gain control, hands-free communication.

(1) This research was supported by Grant TR13011 from the Ministry of Science and Technological Development of the Republic of Serbia.
Nikola Dj. Teslic is with Faculty of Engineering, The University of Novi Sad, Novi Sad, Serbia (nikola.teslic@rt-rk.com).
Zoran M. Saric is with RT-SP d.o.o., Novi Sad, Serbia (Zoran.Saric@rt-sp.com).
Istvan I. Papp is with Faculty of Engineering, The University of Novi Sad, Novi Sad, Serbia (istvan.papp@rt-rk.com).
Contributed Paper. Manuscript received 11/29/10. Current version published 06/27/11. Electronic version published 06/27/11. 0098 3063/11/$20.00 © 2011 IEEE

I. INTRODUCTION

Speech is the most natural means of communication between people. Hence, many researchers have worked on technical means that allow distant persons to communicate as if they were near each other. The first step toward this aim is the development of a hands-free communication system. Hands-free systems can work as a near or long distance speakerphone connected to the telephone line. In this paper we propose such a system, integrated into a TV, to provide comfortable communication in a room environment.

Hands-free communication involves a number of technical problems such as room reverberation, acoustic echo, and various ambient interferences. Each of these problems has reasonable solutions, but integrating them into a single system working under real-time conditions is still a problem. The solution in [1] addresses speech enhancement in a car environment using a microphone array and an adaptive beamformer controlled by a VAD detector. The hands-free recording unit in [2] uses an adaptive microphone array, an echo canceller and a sound source localization module. It is intended for near distances.

There are many microphone arrays on the market that work in stand-alone mode. This paper describes a system that provides full-duplex hands-free communication in such a complex acoustic environment. The typical use case scenario is depicted in Fig. 1, where the TV is used as a communication platform. The developed system makes it possible to place and accept calls from remote parties. It also provides contact management through an intuitive graphical interface rendered on the TV screen. The system connects to a gateway via Bluetooth. Either a GSM phone or a PC can be used as the gateway. In the latter case, a VoIP application has to be active on the PC. The voice of the remote speaker is played back on the TV loudspeakers, mixed with the TV broadcast sound. The remote speaker, in turn, receives speech enhanced by the speech processing algorithms. The communication is full-duplex. The system provides quality communication even if the TV broadcast sound is on.

The rest of the paper describes the components of the system in detail. Section II briefly describes the communication and processing components of the system. Section III describes the components of the system in detail. The acoustic echo canceller (AEC) and the superdirective beamformer are described in subsections III.A and III.B. Subsection III.C describes our solution of the sound source localization algorithm optimized for real-time application. The postfilter and automatic gain control (AGC) blocks are described in subsections III.D and III.E. Experimental results obtained on the real-time DSP platform are presented in section IV.

Fig. 1. Hands-free voice communication platform integrated with TV.

II. PLATFORM OVERVIEW

In order to efficiently suppress acoustic disturbances, the system uses a microphone array of 5 elements. Microphone signals are processed by a DSP in real time. The developed algorithms suppress both echo and ambient noise, leaving only the desired speech. The improved voice is transmitted to the gateway (GSM phone or PC) via Bluetooth, and then to the remote party. The DSP and the connectivity module are located on an add-on card, which can be easily connected to the host. The tasks of the host are to route audio channels
appropriately, to control the connectivity module and to provide an interface to the user.

The usage of a TV as a hands-free phone communication platform induces numerous problems, such as a weak speech signal coming from the near-end talker, who may be at a distance of up to five meters from the TV set, as well as a number of disturbances such as strong echo coming from the TV loudspeakers, room reverberation, and various stationary and non-stationary noise sources. Both the far-end signal and the stereo TV audio cross talk are cancelled by the multichannel acoustic echo canceller (AEC), which has a long tail, short adaptation time and short latency provided by the adaptive algorithm known as the partitioned block frequency domain adaptive filter (PBFDAF) [3]. The reference signals from the left and right loudspeakers are recorded by separate channels of the analog/digital converter to make the system robust against packet loss when the VoIP protocol is used. The multichannel signal from the microphone array is processed by a superdirective beamformer steered by the sound source localization (SSL) module, which is capable of estimating the position of the speaker even in reverberant conditions. The remaining stationary and non-stationary diffuse noise is additionally suppressed by a postfilter (PF) based on the spatial diversity of the active speaker and noise sources. The automatic gain control (AGC) module at the end of the processing chain provides an equal level for weak and strong speech signals. AGC is controlled by a double talk detector and spatial information from the microphone array. This information is used to prevent amplification of the residual noise in pauses of speech. The algorithms are implemented on a DSP (2) and optimized for real-time performance at a sample rate of 8 kHz.

(2) Texas Instruments C6727

III. OVERVIEW OF THE SYSTEM

A. AEC module

One of the main features of our system, denoted as Phone TV, is to allow the user to continuously watch the TV program during a phone conversation without the need to decrease the TV volume. In this case, the AEC module has to suppress both the far-end talker and the TV audio. TV audio may be mono or stereo. Suppression of the echo coming from stereo audio is about twice as computationally demanding as for mono audio. Depending on the available computation resources, the system may switch stereo TV audio to mono during conversation to decrease the computation demand.

According to ITU-T G.167 [4], the cumulative attenuation between the receive point RCV and the send point SND (Fig. 2), also referred to as Terminal Coupling Loss (TCL), has to be over 45 dB. The AEC module can provide up to 35 dB of echo attenuation. The additional attenuation is provided by the superdirective beamformer, the postfilter and the AGC modules so as to comply with the G.167 recommendations.

Combined echo cancellation and beamforming can be realized in one of two ways: 'AEC first' or 'beamforming first' [5]. In the 'beamforming first' structure, the beamforming is essentially independent of the AEC, and the acoustic echo is perceived as another source of interference. AEC is realized by a single adaptive filter, which is attractive with regard to computational complexity.

The 'AEC first' structure uses one adaptive filter per microphone. With perfect echo cancellation, beamforming is undisturbed by acoustical echoes and uses all its degrees of freedom to suppress spatially distributed interferences and room reverberation. 'AEC first' is appropriate if the sound source localization module (SSL) and the postfilter (PF) have to be used. The shortcoming of the 'AEC first' structure is its computational complexity, because an array with M microphones requires approximately M-fold computation cost compared to the single-channel AEC.

In order to provide good performance of the SSL and postfilter modules, we use the 'AEC first' structure displayed in Fig. 2.

Fig. 2. Voice processing algorithm. [Block diagram: microphone inputs M1 to M5 and loudspeaker references Sp-L, Sp-R; ADC; AEC with DTD; superdirective beamformer (SD-BF); SSL providing the azimuth of the speaker; post filter (PF); AGC; receive input RCV from the far end mixed with TV audio; send output SND.]

Microphone signals M1 to M5 are digitized simultaneously with loudspeaker signals SP-L and SP-R by a multi-channel AD converter (module 103, Fig. 2). Digitized microphone signals d1 to d5 and reference signals xL, xR are provided to AEC module 104. The outputs of the AEC module are signals e1 to e5 with suppressed echo. AEC module includes a quality
double talk detector (DTD) to prevent adaptation during double talk. The double talk detection signal sdtd is an auxiliary output of the AEC module used in the SSL and AGC modules.

There are various adaptive algorithms that can be used in AEC modules. The most frequently used one, the time domain normalized least mean squares algorithm (NLMS), has a complexity that is linear in the filter length. For moderate room reverberation the filter length has to be a few thousand points for the typical sampling frequency fs = 8000 Hz. The computation demand of such an adaptive filter is too high for a real-time application. In addition, it suffers from rather slow convergence for signals with a colored spectrum such as speech. More efficient implementations of the NLMS algorithm rely on frequency-domain techniques [6]. The frequency domain LMS (FLMS) adaptive filter implements the block LMS algorithm efficiently by using the fast Fourier transform (FFT). This achieves a significant reduction in computational load for the same adaptation performance.

The main problem in real-time implementation of the FLMS algorithm is its long processing delay. For example, if the block size, i.e. the filter length, is N = 1024, the delay is 128 ms at an 8 kHz sampling rate. To overcome this problem, the so-called partitioned block frequency-domain adaptive filter (PBFDAF) was developed [3], [7]. The basic idea is to divide the long impulse response of the filter into shorter partitions, each L = N/P samples in length, where P is the number of partitions. As a result, the delay is reduced to L samples. To increase the convergence rate, multiple iterations have to be applied over the same data block [3].

The algorithm used in the Phone TV system is the iterated PBFDAF briefly described in [3]. This algorithm is modified for a stereo reference signal. We used seven blocks, each containing 256 points of the adaptive filter, which provides a tail of 224 ms. Linear echo reduction is over 35 dB in single talk and over 30 dB in double talk.

B. Superdirective beamformer

In a real reverberant room, the acoustic noise field is almost diffuse, with strong reflections highly correlated with the direct wave of the near-end talker. In this case, noise reduction techniques based on adaptive beamforming provide poor results [8], [9], [10]. Hence, many authors propose the use of a (non-adaptive) superdirective beamformer [11], [12], [13]. Its directivity pattern is optimized for the ideally diffuse noise field according to some optimization criteria [13]. The main lobe of the superdirective beamformer is directed towards the near-end talker, whose position is obtained by the sound source localization (SSL) module (304).

Inputs of the superdirective beamformer (SDBF) are signals e1 to e5. These signals are processed in the DFT domain. Signals are partitioned into blocks, each L = 512 samples wide. Block overlapping is 50%. Each data block $\mathbf{e}_l(n)$, $l = 1,\dots,5$, is windowed by an appropriate window function $\mathbf{w}$ and transformed into the frequency domain by the fast Fourier transform (FFT)

$$\mathbf{E}_l(n) = \mathbf{F}\,\mathbf{W}\,\mathbf{e}_l(n), \qquad (1)$$

where $\mathbf{E}_l(n) = [E_l(n,1),\dots,E_l(n,L)]^T$ is the L-element complex column vector of DFT bins of the lth microphone channel of the nth processing block, $\mathbf{e}_l(n) = [e_l(nL/2),\dots,e_l(nL/2+L-1)]^T$ is the corresponding time domain data vector of the lth microphone channel, $\mathbf{F}$ denotes the L-by-L discrete Fourier transformation matrix, and $\mathbf{W} = \mathrm{diag}\{\mathbf{w}\}$ is the diagonal matrix of the window vector $\mathbf{w}$. The window is applied twice. First, it is applied before the FFT (1) to improve frequency resolution. Second, it is applied after processing in the DFT domain, just after the inverse FFT, to prevent clicks at the boundaries of the processing blocks. The overall effect is the same as if the microphone signals were windowed by the square of $\mathbf{w}$. The window function $\mathbf{w}$ used in our solution is the square root of the Hanning window, which provides perfect reconstruction of the processed signal for 50% overlapping of the output buffers.

The five microphone signals arranged in the vector $\mathbf{E}(n,i) = [E_1(n,i),\dots,E_5(n,i)]^T$ are processed by

$$S_{BF}(n,i) = \mathbf{H}_{sdb}^{H}(\theta_n,i)\,\mathbf{E}(n,i), \qquad (2)$$

where $\mathbf{H}_{sdb}(\theta_n,i)$ is the column weight vector whose coefficients depend on the steering angle $\theta_n$ and the superdirective design method. The DFT coefficients of the enhanced near-end talker are transformed into the time domain by

$$\tilde{\mathbf{s}}_{BF}(n) = \mathbf{W}\,\mathbf{F}^{-1}\,[S_{BF}(n,1),\dots,S_{BF}(n,L)]^T, \qquad (3)$$

where $\tilde{\mathbf{s}}_{BF}(n) = [\tilde{s}_{BF}(n,1),\dots,\tilde{s}_{BF}(n,L)]^T$ is the output buffer of signal samples. The output stream $\hat{s}(k)$ is obtained by overlapping the current and previous output buffers:

$$\hat{s}(nL/2 + t) = \tilde{s}_{BF}(n-1,\, t + L/2) + \tilde{s}_{BF}(n,\, t), \quad t = 0,\dots,L/2-1. \qquad (4)$$

The weight vector $\mathbf{H}_{sdb}(\theta,i)$ is estimated using the coherence matrix $\mathbf{\Gamma}(i)$ of the ideally diffuse noise field [13] by

$$\mathbf{H}_{sdb}(\theta,i) = \frac{(\mathbf{\Gamma}(i) + \mu\mathbf{I})^{-1}\,\mathbf{d}_\theta(i)}{\mathbf{d}_\theta^{H}(i)\,(\mathbf{\Gamma}(i) + \mu\mathbf{I})^{-1}\,\mathbf{d}_\theta(i)}. \qquad (5)$$

For the ideally diffuse noise field, the elements of the coherence matrix are

$$\Gamma_{X_m X_n}(i) = \frac{E\{X_m(i)\,X_n^{*}(i)\}}{\sqrt{E\{|X_m(i)|^2\}\,E\{|X_n(i)|^2\}}} = \frac{\sin(k_i d_{m,n})}{k_i d_{m,n}}, \qquad (6)$$

where $d_{m,n}$ is the distance between microphones m and n, $k_i = 2\pi f_i / c$ is the wavenumber for the central frequency $f_i$ of the ith DFT bin, and c is the speed of sound. The small positive scalar $\mu$ in (5) is used to stabilize the matrix inversion when $\mathbf{\Gamma}(i)$ is close to singular. $\mathbf{d}_\theta(i)$ is the steering vector that compensates the delay between microphones

$$\mathbf{d}_\theta(i) = [\,1,\; e^{-j k_i d_0 \sin\theta},\; \dots,\; e^{-j(M-1) k_i d_0 \sin\theta}\,]^T, \qquad (7)$$
where $d_0$ is the distance between adjacent microphones. The coefficients of the superdirective beamformer were optimized at design time and stored in flash memory for a finite set of arriving angles $\theta \in \{0^\circ, 7^\circ, 14^\circ, 21.3^\circ, 29.0^\circ, 40.0^\circ, 53.1^\circ, 73.7^\circ\}$. For any particular estimated arriving angle $\theta$, the nearest set of coefficients is used.

C. Speaker localization

The problem of speaker localization is usually addressed as a sound source localization (SSL) problem. The task of the SSL module is to determine the position of the active speaker to be enhanced by the microphone array. In addition, the determined speaker position can also be used for rotating the camera in a videoconferencing system. Many interferences disturb localization of the speaker, such as: (a) reverberation of the room, (b) ambient noise, (c) residual acoustic echo. These interferences make sound source localization a difficult task.

To locate a sound source we need at least two microphones. The cross correlation method is usually used to estimate the time difference of arrival. To improve the reliability of the source location, most algorithms exploit the redundant information provided by multiple microphones [14], [15]. Several methods based on time difference estimation (TDE) fuse the estimated cost functions from multiple sensor pairs before searching the time delay [15], [16], [17]. Another method, denoted SRP-PHAT, combines steered response power (SRP) with the phase transform [17], [18].

In our solution we used the generalized cross correlation with phase transform (GCC-PHAT) method, adapted for the multiple-microphone case by fusing the generalized cross correlation (GCC) functions of the different microphone pairs before estimation of the time difference. The proposed algorithm exploits geometrical properties of the uniform linear microphone array to improve the reliability of the sound source localization and to optimize the computation cost for real-time applications. The algorithm also includes bin selection to increase its robustness against ambient noise and residual echo. The performance in highly reverberant environments is also improved using a time varying cost function which exploits the precedence effect in sound source location [19]. The precedence effect can be explained by the fact that the direct path sound arrives before any correlated reflections; because of that, the initial onsets tend to be less corrupted by reverberation than subsequent sounds.

For two given microphone signals $x_i(t)$ and $x_j(t)$ (3) and their generalized cross-correlation $R_{i,j}(\tau)$, the time difference of arrival (TDOA) is estimated by [15]

$$\hat{\tau}_{i,j} = \arg\max_{\tau \in D} R_{i,j}(\tau), \qquad (8)$$

(3) According to the notation in Fig. 2, signals x1,…,x5 are actually signals e1,…,e5 from the output of the AEC module.

where $R_{i,j}(\tau)$ is the generalized cross-correlation calculated by the inverse discrete Fourier transform of the normalized cross-spectrum $G_{i,j}(n)$:

$$R_{i,j}(\tau) = \sum_{n=0}^{N-1} G_{i,j}(n)\, e^{j 2\pi n \tau / N}, \qquad G_{i,j}(n) = \frac{\Phi_{i,j}(n)}{\sqrt{E\{|X_i(n)|^2\}\,E\{|X_j(n)|^2\}}}, \qquad (9)$$

where $\Phi_{i,j}(n) = E\{X_i(n)\,X_j^{*}(n)\}$ is the cross-spectrum on the nth DFT bin, N is the number of DFT points, and $X_i(n)$ and $X_j(n)$, $n = 0,\dots,N-1$, are the DFTs of the signals on microphones i and j, respectively. $E\{\cdot\}$ is the expectation operator and $(\cdot)^{*}$ denotes the complex conjugate. The range of potential TDOA values is restricted to a finite interval D, which is determined by the physical separation between the microphones. The practical implementation of the algorithm includes the following:

(a) The cross-spectrum $\Phi_{i,j}(k,n)$ of microphones i and j is averaged in time by

$$\Phi_{i,j}(k,n) = \alpha\,\Phi_{i,j}(k-1,n) + (1-\alpha)\,X_i(k,n)\,X_j^{*}(k,n), \qquad (10)$$

where k is the index of the data block and $\alpha$ is the smoothing factor. A typical value for $\alpha$ is 0.5.

(b) DFT bins that are not corrupted by the ambient noise are selected and used for the calculation of the normalized cross-spectrum by

$$G_{i,j}(k,n) = \begin{cases} \dfrac{\Phi_{i,j}(k,n)}{|\Phi_{i,j}(k,n)|}, & \text{if } |\Phi_{i,j}(k,n)| > \Phi_N(k,n) \\[2mm] 0, & \text{otherwise} \end{cases} \qquad (11)$$

where $\Phi_N(k,n)$ is the estimate of the ambient noise. $\Phi_N(k,n)$ is estimated by the minima controlled recursive averaging algorithm [20].

(c) The precedence effect is included using the weight function $w_e(k,n)$ [19], [21]:

$$w_e(k,n) = \begin{cases} w(k,n)^{\beta}, & |\Phi_{i,j}(k,n)| \ge |\Phi_{i,j}(k-1,n)| \\[2mm] w(k,n)^{\beta}\left(\dfrac{|\Phi_{i,j}(k,n)|}{|\Phi_{i,j}(k-1,n)|}\right)^{\gamma}, & |\Phi_{i,j}(k,n)| < |\Phi_{i,j}(k-1,n)| \end{cases} \qquad (12)$$

where $w(k,n)$ is the limited and normalized time derivative of the signal power

$$w(k,n) = \max\left\{0.1,\; \frac{|\Phi_{i,j}(k,n)| - |\Phi_{i,j}(k-1,n)|}{|\Phi_{i,j}(k,n)|}\right\}. \qquad (13)$$
The scalar constants $\beta$ and $\gamma$ in (12) are empirically set to 0.4 and 0.3, respectively. Using the weight function $w_e(k,n)$, the normalized cross-spectrum is modified by

$$\tilde{G}_{i,j}(k,n) = w_e(k,n)\,G_{i,j}(k,n). \qquad (14)$$

(d) The multi-microphone extension is applied by fusing the GCC functions of the different microphone pairs before estimation of the time difference. The normalized cross-correlation between microphones 1 and 5 can be modeled by

$$\tilde{G}_{1,5}(k,n) = w_e(k,n)\left(e^{-j n A\, 4 d_0} + \tilde{N}_{1,5}(k,n)\right), \qquad A = \frac{2\pi f_s}{N c}\sin(\theta), \quad n = 0,\dots,N-1, \qquad (15)$$

where $f_s$ is the sampling frequency, N is the number of DFT points, c is the speed of sound, $d_0$ is the distance between adjacent microphones, $\theta$ is the arriving angle, i.e. the azimuth, and $\tilde{N}_{1,5}(k,n)$ is the noise caused by the error in the phase of the normalized cross-spectrum. In a similar manner we can model the cross-correlations between microphone pairs (1, 3) and (1, 2) by

$$\tilde{G}_{1,3}(k,n) = w_e(k,n)\left(e^{-j n A\, 2 d_0} + \tilde{N}_{1,3}(k,n)\right), \qquad (16)$$

$$\tilde{G}_{1,2}(k,n) = w_e(k,n)\left(e^{-j n A\, d_0} + \tilde{N}_{1,2}(k,n)\right). \qquad (17)$$

Since only the exponential terms in (15), (16), and (17) contain information about the time difference of arrival, we can combine this information by

$$\tilde{G}(k,n) = \tilde{G}_{1,5}(k,n) + \tilde{G}_{1,3}(k,2n) + \tilde{G}_{1,2}(k,4n). \qquad (18)$$

Evaluating the pairs (1, 3) and (1, 2) at bins 2n and 4n makes all three exponential terms equal to $e^{-j n A\, 4 d_0}$, so they add coherently.

Finally, the proposed SSL algorithm for the five-microphone array can be described by the following steps:

1. For the kth data block, apply the DFT to each of the microphone signals. Estimate the short-term cross-spectra $\Phi_{i,j}(k,n)$ of the microphone pairs (1,2), (1,3), and (1,5) by (10).
2. Estimate the ambient noise $\Phi_N(k,n)$ by [20] and calculate the normalized cross-spectra $G_{i,j}(k,n)$ by (11).
3. Apply the precedence effect by (12), (13) and (14) to calculate $\tilde{G}_{i,j}(k,n)$ for pairs (1,2), (1,3), and (1,5).
4. Fuse the normalized cross-spectra of the different microphone pairs by (18).
5. Apply the inverse DFT to $\tilde{G}(k,n)$ and calculate $\hat{\tau}$ by (8).
6. Calculate the azimuth by $\theta = \arcsin\left(\dfrac{c\,\hat{\tau}}{4 d_0}\right)$, with $\hat{\tau}$ expressed in seconds.

D. Post-filter

The optimal signal estimation on the basis of multi-sensor signals consists of two processing steps [11]. The first one is a Minimum Variance Distortionless Response (MVDR) beamformer and the second one is a single-channel Wiener filter, which acts as a post-filter. The MVDR beamformer suppresses coherent noise sources, while the post-filter further reduces residual diffuse noise.

The post-filter was examined by Zelinski [22], who used auto and cross-spectral densities of the input to estimate the time varying parameters of the filter. The use of such a post-filter with sub-array beamforming was thoroughly investigated by Marro et al. [12]. I. A. McCowan et al. [23] developed an algorithm based on a model of the ideally diffuse noise field for which the complex coherence function is known. Another postfilter solution, the adaptive post-filter for arbitrary beamforming (APAB) algorithm [11], estimates the post-filter parameters as the ratio of output and input power of the beamformer.

The post-filter used in this paper is based on the solution proposed in [24], [25], [26]. The post-filter input signals e1,…,e5 and sBF come from the AEC module and the SDB beamformer, respectively. Signals are processed in the DFT domain. Each ith DFT bin is processed independently. The input signal vector $\mathbf{X}(k,i)$,

$$\mathbf{X}(k,i) = [E_1(k,i),\dots,E_5(k,i)]^T, \qquad (19)$$

consists of signals $E_1(k,i),\dots,E_5(k,i)$ that are the DFT coefficients of the microphone signals processed in the AEC module. Index i denotes the DFT bin, whereas k denotes the data block. The signal vector $\mathbf{X}(k,i)$ is modeled by

$$\mathbf{X}(k,i) = \mathbf{V}(k,i) + \mathbf{N}(k,i), \qquad (20)$$

where $\mathbf{N}(k,i)$ is the diffuse ambient noise and $\mathbf{V}(k,i)$ is the response to the excitation of the desired speaker $s(k,i)$,

$$\mathbf{V}(k,i) = \mathbf{d}_\theta(i)\,s(k,i). \qquad (21)$$

The column vector $\mathbf{d}_\theta(i)$ is the direct path transfer function (7). From Wiener filtering theory, the optimal post-filter is

$$H_{post}(i) = \frac{\Phi_{ss}^{out}(i)}{\Phi_{ss}^{out}(i) + \Phi_{uu}^{out}(i)}, \qquad (22)$$

where $\Phi_{ss}^{out}(i)$ and $\Phi_{uu}^{out}(i)$ are the signal and noise powers at the output of the beamformer. They are unknown and have to be estimated from the available signal measurements. Assuming the desired signal and noise are uncorrelated, and taking into account (20) and (21), the averaged power of the microphone signals on data block k, defined as $\Phi_m(i) = \frac{1}{5} E\{\mathbf{X}(k,i)^H \mathbf{X}(k,i)\}$, is

$$\Phi_m(i) = \Phi_{ss}(i) + \Phi_{uu}(i), \qquad (23)$$

where $\Phi_{ss}(i) = E\{|s(k,i)|^2\}$ and $\Phi_{uu}(i) = E\{\mathbf{N}(k,i)^H \mathbf{N}(k,i)\}/5$ are the signal and interference powers, respectively. The power of the SDB output, $\Phi_{BF}(i) = E\{S_{BF}(i)\,S_{BF}(i)^{*}\}$, is

$$\Phi_{BF}(i) = \Phi_{ss}(i) + A_\Gamma(i)\,\Phi_{uu}(i), \qquad (24)$$
where $A_\Gamma(i) = \mathbf{H}_{sdb}^{H}(\theta,i)\,\mathbf{\Gamma}_{NN}(i)\,\mathbf{H}_{sdb}(\theta,i)$ is the noise power attenuation factor (NPAF) [11]. $\mathbf{\Gamma}_{NN}(i)$ is the noise coherence matrix defined by (6) for the ideally diffuse noise field. We assume that $\mathbf{\Gamma}_{NN}(i)$ is time invariant when there is no movement in the room. Equations (23) and (24) form a linear system of equations with the solution

$$\Phi_{ss}(i) = \frac{\Phi_{BF}(i) - A_\Gamma(i)\,\Phi_m(i)}{1 - A_\Gamma(i)}, \qquad \Phi_{uu}(i) = \frac{\Phi_m(i) - \Phi_{BF}(i)}{1 - A_\Gamma(i)}. \qquad (25)$$

The noise power attenuation factor $A_\Gamma(i)$ is always less than one because the beamformer reduces ambient noise. The powers $\Phi_m(i)$ and $\Phi_{BF}(i)$ are recursively estimated by

$$\hat{\Phi}_m(k,i) = \alpha\,\hat{\Phi}_m(k-1,i) + (1-\alpha)\,\mathbf{X}(k,i)^H \mathbf{X}(k,i)/5, \qquad (26)$$

$$\hat{\Phi}_{BF}(k,i) = \alpha\,\hat{\Phi}_{BF}(k-1,i) + (1-\alpha)\,|S_{BF}(k,i)|^2, \qquad (27)$$

where the positive scalar $\alpha$, $0 < \alpha < 1$, controls the exponential averaging. We estimate the unknown value of $A_\Gamma(i)$ using the auxiliary variable $\tilde{A}_\Gamma(k,i)$,

$$\tilde{A}_\Gamma(k,i) = \frac{\hat{\Phi}_{BF}(k,i)}{\hat{\Phi}_m(k,i)}. \qquad (28)$$

The variable $\tilde{A}_\Gamma(k,i)$ has two extreme values. During a speech interval, $\Phi_{ss}(k,i) \gg \Phi_{uu}(k,i)$, and $\tilde{A}_\Gamma(k,i)$ tends to its upper limit of one. On the contrary, during a pause of speech, $\Phi_{ss}(k,i) \ll \Phi_{uu}(k,i)$, and it tends to its lower limit $A_\Gamma(i)$. Using this property of $\tilde{A}_\Gamma(k,i)$, the value of $A_\Gamma(i)$ can be recursively estimated by tracking the minimum of $\tilde{A}_\Gamma(k,i)$. For this purpose we applied a first order IIR filter with different forgetting factors $\lambda_1$ and $\lambda_2$:

$$\hat{A}_\Gamma(k,i) = \begin{cases} \lambda_1\,\hat{A}_\Gamma(k-1,i) + (1-\lambda_1)\,\tilde{A}_\Gamma(k,i), & \text{for } \tilde{A}_\Gamma(k,i) < \hat{A}_\Gamma(k-1,i) \\[2mm] \lambda_2\,\hat{A}_\Gamma(k-1,i) + (1-\lambda_2)\,\tilde{A}_\Gamma(k,i), & \text{for } \tilde{A}_\Gamma(k,i) \ge \hat{A}_\Gamma(k-1,i) \end{cases} \qquad 0 < \lambda_1 < \lambda_2 < 1. \qquad (29)$$

The typical values for $\lambda_1$ and $\lambda_2$ are 0.95 and 0.999, respectively. By substituting the estimates of $\Phi_m(k,i)$, $\Phi_{BF}(k,i)$, and $A_\Gamma(k,i)$ into (25) and taking into account $\hat{\Phi}_{uu}^{out}(k,i) = \hat{A}_\Gamma(k,i)\,\hat{\Phi}_{uu}(k,i)$ and $\hat{\Phi}_{ss}^{out}(k,i) = \hat{\Phi}_{ss}(k,i)$, the estimate of the post-filter by (22) is

$$\hat{H}_{post}(k,i) = \frac{\hat{\Phi}_{BF}(k,i) - \hat{A}_\Gamma(k,i)\,\hat{\Phi}_m(k,i)}{\hat{\Phi}_{BF}(k,i)\,\left(1 - \hat{A}_\Gamma(k,i)\right)}. \qquad (30)$$

To reduce the effects of the estimation errors, an additional constraint has to be applied:

$$0 \le \hat{H}_{post}(k,i) \le 1. \qquad (31)$$

The algorithm estimates the post-filter parameters after every processing block to track slow changes in the coherence function of the room. This post-processing algorithm successfully suppresses both stationary and non-stationary interferences.

E. Automatic gain control (AGC)

The main task of the AGC module is speech loudness adjustment. Weak speech signals have to be amplified, while loud signals have to be attenuated. The main features of our AGC module are:

- dynamic range compression,
- protection against false gain increase during background noise,
- protection against false gain increase caused by residual echo,
- clipping compression,
- internal voice activity detection based on multi-microphone feature extraction.

The solution of the AGC module is inspired by [27], which calculates the gain according to the spatial information of the sound sources. The block diagram of the AGC module is depicted in Fig. 3. It consists of the level estimator, the VAD detector and the gain processor, which generates the gain Gn. When there is no double talk and the level of the noise is low, the gain processor calculates the gain Gn according to the input signal level using the diagram depicted in Fig. 4 (bold line). If the level of the signal SPF is below Lnom_in, the signal is just amplified by Gnom (no dynamic compression). If the signal level is between Lnom_in and Lsat_in, the compression is ½. If the input signal is above Lsat_in, the output level is limited to Lsat_out (except briefly after a sudden increase in input loudness, known as an "attack").

Fig. 3. Block diagram of the AGC module. [Blocks: level estimator, VAD detector, gain processor producing Gn; inputs SPF, SBF, SAEC1 to SAEC5 and the DTD signal; output SAGC = Gn SPF.]

If there is some residual echo or some background noise, the internal VAD detector will detect it. According to the soft decision generated in the internal VAD, the gain processor will increase the slope of the part below Lnoise_tres (dotted line). As a result, weak residual echo and ambient noise will be additionally attenuated.
Fig. 4. Compression transfer function. [Output level [dB] versus input level [dB]; regions: gain Gnom, dynamic compression, saturation at Lsat_out; thresholds Lnoise_tres, Lnom_in, Lsat_in.]

IV. EXPERIMENTAL RESULTS

The proposed audio processing algorithm was tested using a real-time DSP implementation. The performance of the algorithm was tested in real conditions using objective and subjective quality measures:

a) Echo Return Loss Enhancement (ERLE) [28],
b) Signal to Noise Ratio Enhancement (SNRE) [11],
c) Perceptual Evaluation of Speech Quality (PESQ) [29],
d) Speech intelligibility, as the ultimate measure of quality, using a nonsense syllable recognition test.

The quality assessment was performed using a real-time implementation of the algorithms on the DSP board depicted in Fig. 5. The audio processing board was designed using a floating point DSP running at 300 MHz. The platform contains a Bluetooth connectivity module, as well as a serial command interface for interfacing with the host (TV device). The developed board is interfaced to the 5 microphones using A/D converters and microphone amplifiers. The processed output signal is passed to the VoIP or GSM gateway via a Bluetooth connection.

Fig. 5. Audio processing board. [Labeled parts: microphone interface, AD converters, floating point DSP, external memory, Bluetooth connectivity module, control interface.]

To achieve real-time performance, the Matlab code was converted into floating point C code, which was then integrated into the DSP framework and compiled for the target platform. A standard DSP development tool was used during development (4). No manual optimizations were applied. During operation the measured load of the DSP was 80%, including signal processing and the load coming from the DSP framework (signal data transfer and I/O handling). The memory consumption was 512 kB.

(4) Texas Instruments Code Composer Studio

Fig. 6. Integration of audio processing into TV.

The audio processing board was integrated into a TV device (Fig. 6). The TV platform was the host, while the audio processing board acted as an add-on module. The microphone array was built into the frame of the TV (Fig. 7). The TV software was extended to support the communication features and control of the DSP board. This way the TV became a communication device suitable for interactive hands-free communication.

The testing was performed in the environment shown in Fig. 8 (Room 1). The dimensions of the room were 7 m x 5 m x 2.6 m, and its volume was 90 m3. The room was a typical office with flat walls. The reverberation time of the room was 400 ms. The following was varied across the test cases:

- single or double talk,
- loudness of the near-end talker,
- position of the near-end talker (0 and 30 degrees),
- type of the far-end signal, i.e. echo (speech, different types of music: pop, rock, ambient),
- type of the noise signal (speech, cocktail-party, music, non-stationary noise).

Fig. 7. Test device used for real-time testing. [Labeled parts: 5 element microphone array, OSD, loudspeakers.]

This resulted in 22 different test cases covering expected real-life use cases. To ensure reproducibility, the signal levels were adjusted using a sound level meter. The signals corresponding to the test cases were recorded. The objective
The off-line results for ERLE, SNRE, and PESQ are depicted in Fig. 9-11.

Fig. 8. Test environment (labels: whiteboard, microphones, TV, board, doors; marked distances 1.8 m and 2 m).
Fig. 9. PESQ results (PESQ MOS versus test case).
Fig. 10. ERLE results (ERLE [dB] versus test case).
Fig. 11. SNRE results (SNRE [dB] versus test case).

The same room (Room 1) was used as the near-end environment for the intelligibility measurement. The nonsense syllables were played back in Room 1 as the near-end signal, together with all disturbances (echo and noise). The test subjects were in an acoustically isolated room (Room 2). The processed voice was transferred to Room 2 using VoIP and played back on loudspeakers. The results of the nonsense syllable recognition are shown in Fig. 12. In addition, a system that provides 75% recognition of nonsense syllables provides 100% word recognition of normal speech [30].

Fig. 12. Nonsense syllables recognition results (recognition rate [%]; conditions: ideal case with no disturbances and no processing; with disturbances, with processing; with disturbances, no processing; bar labels in the figure: 69, 50, 21).

Based on the test cases, the following results were achieved. The average PESQ value is around 2.6 for both single and double talk conditions. The average echo suppression (ERLE) across the set of test vectors is 27 dB. The average noise suppression (SNRE), including nonstationary noise, is 28 dB.

V. CONCLUSIONS

The developed system provides high quality full-duplex voice communication in hands-free mode. The targeted use case is an environment such as a living room or office. The developed audio subsystem is integrated with a TV. This led to a hands-free communication terminal with advanced features such as simultaneous TV usage, full-duplex operation and high voice quality. The developed system makes comfortable conversation possible using either of the communication technologies (GSM or VoIP). The developed technology can also be used in systems such as car hands-free kits, teleconferencing systems and voice based human-machine interfaces.

REFERENCES

[1] J.-S. Hu, C.-C. Cheng, W.-H. Liu, and C.-H. Yang, “A Robust Adaptive Speech Enhancement System for Vehicular Applications,” IEEE Transactions on Consumer Electronics, vol. 52, no. 3, pp. 1069-1077, Aug. 2006.
[2] K. Kobayashi, Y. Haneda, K. Furuya, and A. Kataoka, “A Hands-Free Unit with Noise Reduction by Using Adaptive Beamformer,” IEEE Transactions on Consumer Electronics, vol. 54, no. 1, pp. 116-122, Feb. 2008.
[3] K. Eneman and M. Moonen, “Iterated Partitioned Block Frequency-Domain Adaptive Filtering for Acoustic Echo Cancellation,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 2, pp. 143-158, Mar. 2003.
[4] ITU-T recommendation G.167, Acoustic echo controllers, Mar. 1993.
[5] W. L. Kellermann, “Acoustic Echo Cancellation for Beamforming Microphone Arrays,” in M. Brandstein and D. Ward (eds), Microphone Arrays, New York: Springer, 2001, pp. 281-306.
[6] E. R. Ferrara, “Fast Implementation of LMS Adaptive Filters,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, pp. 474-475, Aug. 1980.
[7] J.-S. Soo and K. K. Pang, “Multidelay block frequency domain adaptive filter,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 2, pp. 373-376, Feb. 1990.
[8] Z. M. Saric and S. T. Jovicic, “Adaptive microphone array based on pause detection,” Acoustics Research Letters Online (ARLO) 5(2), pp. 68-74, Apr. 2004.
[9] S. T. Jovicic, Z. M. Saric, and S. R. Turajlic, “Application of the maximum signal to interference ratio criterion to the adaptive microphone array,” Acoustics Research Letters Online (ARLO) 6(4), pp. 232-237, Oct. 2005.
[10] I. I. Papp, Z. M. Saric, S. T. Jovicic, and N. Dj. Teslic, “Adaptive microphone array for unknown desired speaker’s transfer function,” Journal of the Acoustical Society of America, Express Letters, pp. 44-49, July 2007.
[11] K. U. Simmer, J. Bitzer, and C. Marro, “Post-filtering techniques,” in M. Brandstein and D. Ward (eds), Microphone Arrays, New York: Springer, 2001, pp. 36-60.
[12] C. Marro, Y. Mahieux, and K. U. Simmer, “Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering,” IEEE Trans. Speech Audio Process., vol. 6, no. 3, pp. 240-259, May 1998.
[13] J. Bitzer and K. U. Simmer, “Superdirective microphone arrays,” in M. Brandstein and D. Ward (eds), Microphone Arrays, New York: Springer, 2001, pp. 19-38.
[14] K. Kwak and S. Kim, “Sound source localization with the aid of excitation source information in home robot environments,” IEEE Trans. Consumer Electronics, vol. 54, no. 2, pp. 852-856, May 2008.
[15] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, “Localization of multiple sound sources based on a CSP analysis with a microphone array,” in Proc. of IEEE ICASSP’00, vol. 2, pp. 1053-1055, Istanbul, Turkey, June 2000.
[16] S. M. Griebel and M. S. Brandstein, “Microphone array source localization using realizable delay vectors,” in Proceedings of IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (WASPAA ’01), pp. 71-74, New Paltz, NY, USA.
[17] J. Chen, J. Benesty, and Y. A. Huang, “Time delay estimation in room acoustic environments: an overview,” EURASIP Journal on Applied Signal Processing, vol. 2006, pp. 1-19, 2006.
[18] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, “Robust localization in reverberant rooms,” in M. Brandstein and D. Ward (eds), Microphone Arrays, New York: Springer, 2001, pp. 157-180.
[19] K. W. Wilson and T. Darrell, “Learning a precedence effect-like weighting function for the generalized cross-correlation framework,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2156-2164, Nov. 2006.
[20] I. Cohen and B. Berdugo, “Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement,” IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, Jan. 2002.
[21] J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, “Robust Sound Source Localization Using a Microphone Array on a Mobile Robot,” in IEEE Proceedings International Conference on Intelligent Robots and Systems (2003), vol. 2, pp. 1228-1233, Dec. 2003.
[22] R. Zelinski, “A microphone array with adaptive post-filtering for noise reduction in reverberant rooms,” Proc. of IEEE ICASSP’88, pp. 2578-2581, 1988.
[23] I. McCowan and H. Bourlard, “Microphone Array Post-filter based on Noise Field Coherence,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 709-716, Nov. 2003.
[24] S. Jovicic and Z. Saric, “Adaptive microphone array free of the desired speaker cancellation combined with post filter,” Proc. of Acoustics 08, Paris, pp. 5143-5147, 2008.
[25] Z. M. Saric, S. T. Jovicic, M. Janev, I. I. Papp, and Z. S. Marceta, “Microphone Array Post-filter Based on Noise Power Attenuation Factor and a priori Knowledge of the Noise Field Coherence,” Proc. of International Conference SPECOM 2007, Moscow, Russia, 2007, pp. 252-258.
[26] Z. M. Saric, D. P. Simić, and S. T. Jovicic, “A new post-processing algorithm combined with two-step adaptive beamformer,” submitted for publication in Circuits, Systems and Signal Processing, 2010.
[27] K. Kobayashi, Y. Haneda, K. Furuya, and A. Kataoka, “A hands-free unit with adaptive microphone array for directional AGC,” in Acoustic echo and noise control: ninth international workshop, IWAENC 2005, September 12-15, 2005, Eindhoven, The Netherlands.
[28] ITU-T recommendation G.168, Digital network echo cancellers, June 2002.
[29] ITU-T, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” International Telecommunications Union, 2001.
[30] H. Levitt and J. C. Webster, “Effects of Noise and Reverberation on Speech,” in C. M. Harris (ed), Handbook of Acoustical Measurements and Noise Control, Chapter 16, McGraw-Hill, 1991.

BIOGRAPHIES

Istvan I. Papp received the B.S., M.S. and Ph.D. degrees in electrical engineering from the Faculty of Technical Sciences, University of Novi Sad, in 1998, 2001 and 2009 respectively. He is an assistant professor at the Computing and Control Department, University of Novi Sad. He is also employed in RT-RK Company, and engaged in delivering real-time embedded solutions. His research interest is related to digital signal processing and toolchain development for DSPs. IEEE Member since 2008.

Zoran M. Šarić was born in 1956 in Vis, Croatia. He received B.S. and Ph.D. degrees from the School of Electrical Engineering, University of Belgrade, in 1987 and 1993. His current research interests include sound source localization, microphone arrays, signal processing, adaptive signal processing, speech signal processing, and noise cancellation. He has led and participated in a number of projects supported by the Ministry of Science and Technological Development of the Republic of Serbia. He is currently employed in RT-SP d.o.o. signal processing, Novi Sad, Serbia.

Nikola Dj. Teslić received the B.S., M.S. and Ph.D. degrees in electrical engineering from the Faculty of Technical Sciences, University of Novi Sad, in 1995, 1997 and 1999 respectively. He is a professor at the Computing and Control Department, University of Novi Sad. He is CTO of RT-RK Company, delivering development services and products in the consumer area. His research interest covers design of real time systems, audio and video processing and design of testing systems. IEEE Member since 2008.