Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Reshmi S, Sreenesh Shasidharan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue6, November- December 2012, pp.1271-1275 Porting and Optimization of ITU-T G.729.1 codec on SC3850 DSP core Reshmi S1, Sreenesh Shasidharan 2 1 (M.Tech Student, Department of Electronics & Communication, Federal Institute of Science & Technology, Ankamaly) 2 (Assistant Professor, Department of Electronics & Communication, Federal Institute of Science & Technology, Ankamaly)Abstract—This paper presents a methodology quality by exploiting the full range of humantowards the details of optimization for the real- speech.G.729.1 codec- an 8-32 Kbit/s scalabletime implementation of ITU’s G.729.1 codec on wideband speech and audio codec was standardizedSC3850 DSP core running at 1 GHZ clock speed. by ITU-T in May 2006 as a part of standardizationThe main aim is the optimized fixed point activity held in Jan 2004. This is the former speechimplementation of G.729.1 speech coding codec with an embedded scalable structure and isalgorithm on SC3850 DSP core. Optimizations backward compatible with existing ITU G.729 speechwere done in C language to reduce the execution coding standard. As bit rate increases quality ofspeed of the codec by exploiting the architectural speech will increase proportionally due to the scalablefeatures of the target device. First the reference structure of the codec. Scalable coding technologycode provided by ITU-T’s published documents used by the G.729.1 codec allows the proper selectionwas modified to meet the requirements in C of bit-rate during transmission by any communicationlanguage. Next by exploiting the features of component. This is done by simply truncating theSC3850core, the critical parts of our code that bitstream. Coder operates in many modes and high-consumes more cycles were modified in order to quality speech communication is obtained with thereduce their execution time. In each stage, the low-delay mode. The main applications of this codeccorrectness of the implementation was verified by are: VoIP (IP telephony) including IP phones, VoIPtesting our codes against testing vectors provided handsets, voice recording equipments audio/videoby ITU-T. More than 90 percent improvement conferencing, media servers/gateways, call centerover the baseline codec was achieved by the equipment voice messaging servers. G.729 Annex Jsystematic implementation of these optimization and G.729EV is another name for G.729.1 where EVtechniques for G.729.1 codec on SC3850 core. denotes Embedded Variable (bit rate). Keywords:bitrate,G.729.1codec,MCPS,optimizati StarCores variable length execution set (VLES)on,SC3850 architecture processors are largely used in codecs due to lower power consumption, efficient compilability, I. INTRODUCTION and very compact code density. Merging of VLES and Nowadays, the speech communication DSP technologies on to a single core/system hasservices demanded by the current tech-savvy globe are mitigated the ever increasing requirements of signalever-increasing. The numbers of requests for services processing applications on several platforms. Here weaugment with number of users. Hence the most are going to discuss the implementation of ITU’sstraight forward way for the designers of G.729.1 fixed point standard on SC3850 digital signalcommunication systems to fulfill this increasing need providing transmission canals with lower bandwidth II. OVERVIEW OF G.729.1 CODECto each user. Compression is the most efficient G.729.1 codec is an 8-32 kbit/s scalabletechnique for that. The objective of speech wideband speech and audio coding algorithmcompression is to represent a digitized speech signal interoperable with G.729, G.729A and G.729B[2].using reduced bit-rate as possible so that the The input which is an analog audio signal is sampledreconstructed speech signal maintains an optimum at 8 KHz or 16 KHz(default)sampling frequency. Thelevel of perceptual quality. resulting signal is then quantized and converted to VoIP is nowadays facing new challenges in terms of 16-bit linear PCM which serves as the input to thequality of service and efficiency in networks. Wide- encoder. Decoder output is also 16-bit linear audio codecs are extensively used to meet these The output bitstream of encoder is scalable and ischallenges by allowing a flat transition of voice structured in 12 embedded layers. This layersquality from narrowband (300-3400 Hz) to wideband corresponds to 12 available bit rates from 8 to 32(50-7000 Hz). They are capable of providing most kbit/s with core layer interoperable with G.729. Thelifelike conversations with higher level of fidelity and output bandwidth of G.729.1 is 50-4000 Hz at 8 and 12 kbit/s and 50-7000 Hz from 14 to 32 kbit/s (per 2 kbit/s steps). The G.729.1 coder operates on 1271 | P a g e
  2. 2. Reshmi S, Sreenesh Shasidharan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue6, November- December 2012, pp.1271-127520 milliseconds frames of input speech, corresponding Time-domain aliasing cancellation (TDAC) encoderto 320 samples at a sampling rate of 16000 samples which is a transform-based coder jointly encodes theper second. In-order to consistent with G.729 using 10 lower-band CELP difference signal and the higher-ms frame and 5 ms subframes [3], two 10 ms CELP band signal SHB(n).In addition, parameter-levelframes are processed per 20 ms frame. Hence 20ms redundancy in the bitstream is introduced by the frameframes are referred to as superframes. erasure concealment (FEC) encoder by transmitting the signal classification information, energyA. Encoding priciple information, and phase information [2]. Encoding is done in such a way that the inputspeech samples are analyzed in order to extract the B. Decoding principleparameters of the speech synthesis model. G.729.1 decoder is illustrated in Figure 2.AtTransmitting these parameters as opposed to the receiver, the transmitted parameters are decoded.compressed speech samples saves the bandwidth and The decoder operation is based on the received bit-ratereduces the noise. At the receiving side parameters are or actual number of received layers. If the received bitused to synthesize and reconstruct speech. rates are 8 or 12 kbit/s the CELP decoder done the job The G.729.1 encoder is illustrated in Figure 1. The of decoding by reconstructing a lower-band signalencoder operates at the maximal bit-rate of 32 kbit/s. (50-4000 Hz).This is then post-filtered and pre-The coder is organized in three-stages: embedded processed by a high-pass filter to increase theCode-Excited Linear-Prediction (CELP) coding, subjective quality and reduces the coding error. TheTime-Domain Bandwidth Extension (TDBWE) and resulting signal is upsampled to 16 kHz using thepredictive transform coding [2]. Layers 1 and 2 is QMF synthesis filterbank. If the received bit rate is 14generated by the embedded CELP stage which gives a kbit/s, the TDBWE decoder reconstructs a higher-narrowband output(50-4000 Hz) at 8 and 12 kbit/s. band signal SBWE(n) which is upon merging withLayer 3 is generated by the TDBWE stage producing narrowband enhancement layer (12 kbit/s) produces aa wideband output (50-7000 Hz) at 14 kbit/s. Layers 4 wide-band output of 50-7000 Hz. If the received bitto 12 are generated by the TDAC stage operates in rate is between 16 to 32 kbit/s, the TDAC decoderthe Modified Discrete Cosine Transform (MDCT) reconstructs both a lower-band difference signal and adomain at 14 to 32 kbit/s. higher-band signal. Inorder to mitigate the pre/postecho artefacts due to transform coding the resultant signal is shaped in time domain. CELP output is combined with modified TDAC lower-band signal. The quality of the speech output can be improved by using modified TDAC higher-band signal instead of TDBWE output for the whole frequency range.The signals are then upsampled to 16 KHz and combined in QMF filterbank. Adaptive 2 LP CELP post-proc Encoder TDAC Pre/post echo Encoder Reduction Figure 1. Block diagram of encoder Pre/post echo reduction SBWE(n) A 64-coefficient quadrature mirror filterbank 2 HPF(QMF) divides the input signal SWB(n’) into a higher TDBWEband and lower band. The input signal at lower-band Encoder (-1) nand higher-band is decimated by a factor of 2. In orderto remove the frequency components below 50Hz thelower-band decimated signal is passed through an Figure 2. Block digram of decoderelliptical high-pass filter (HPF) with cut-off frequency0.05 KHz .The resulting signal is encoded by a III. FIXED POINT IMPLEMENTATIONnarrowband embedded CELP encoder [3]. The input A. Arcitecturesignal at higher-band is spectrally folded to get more The StarCore SC3850 DSP core used in mainstreamexact representation and is passed through an elliptic DSP applications, such as wireless and wirelinelow-pass filter (LPF) with 3 kHz cutoff frequency to communications, on both the infrastructure and theremove the frequency components below 3 KHz. The subscriber sides. The SC3850 core is having aresulting signal is then encoded by parametric variable-length execution set (VLES) executiontime-domain bandwidth extension (TDBWE) encoder. 1272 | P a g e
  3. 3. Reshmi S, Sreenesh Shasidharan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue6, November- December 2012, pp.1271-1275model, which utilizes maximum parallelism by number of processor cycles by analyzing the profilerallowing multiple address generation and data information.arithmetic logic units to execute multiple instructions 3. Apply generic optimization techniques to improvein a single clock cycle[4]. Target markets for the the execution speed in C.SC3850 architecture includes wireless and wireline 4. Analyze the disassembly the code (assemblybase stations, wireless and wireline infrastructure, generated by the complier) to find the areas wherebroadband wireless etc. Key features of the SC3850 better assembly can be written.DSP core include the following [4]: 5. After each optimization MCPS is calculated and Main core resources note down the value. 4 data ALU execution units 6. If not satisfied with the performance of codes, go 2 integer and address generation units back to step 2. Sixteen 40-bit data register B. Profiling Sixteen 32-bit address registers Profile information of the reference code gives an Instruction set precise estimate about the relative contribution of 16-bit instruction set, expandable to 32- and different modules to the total computational 48-bit instructionsSixteen 40-bit data register complexity. Profiling information helped to identify Sixteen 32-bit address registers critical portions of the code, which were optimized to Very high execution parallelism improve the performance. Codewarrior’s simulator Up to six instructions executed in a single configuration is used for profiling. clock cycle, statically scheduled Up to 4 data ALU instructions and 2 memory C. MCPS Calculation access/integer instructions per cycle MCPS (million cycles per second) is a Data type support performance metric. It is the total no. of cycles taken Byte (8-bit), word (16-bit), and long (32-bit) to process all the frames It is obtained by multiplying data widths, supported by instructions and the no. of cycles required to execute one frame by no. memory moves. of frames per second as in (1).B. Software 𝐶 ×𝑓𝑠C language was used for high level implementation. MCPS= (1) 𝑆×𝑁The encoder and decoder parameters were dealtefficiently with the help of structures. SC3850 where, C is total no. of cycles for executing oneassembly language was used for low level frame, fs is the sampling rate, S is the total no. ofmodifications. Firstly the encoder and decoder tester samples / frame and N is the total no. of frames .applications were developed using Visual Studio 2008 D. Optimization Techniquesexpress edition by modification of reference code. We followed both target dependent and targetLater the code is loaded to Code Warrior for SC3850 independent optimization techniques in this project.specific implementation. Target independent techniques are generalized techniques applied at high level optimization. Target IV. OPTIMIZATION The methodology for reforming a software dependent optimization involves techniques utilizing architectural features of SC3850. The compilersystem in all aspects to make it work more efficiently specific optimization level was set at –o3[5] .utilizing fewer resources is known as optimization Some of the techniques followed are explained asOptimization can be done for speed and memory. follows[5]Here we concentrated on speed optimization forsuperior performance.The optimization tool used in this project was Code 1) Making use of intrinsicsWarrior 10.2.9. It is latest version from the An intrinsic function is a way to instruct the compilerCodeWarrior family provides a graphical user to use a specific assembly language instructioninterface for managing software development projects. [5].Many functions in the reference code wereCodeWarrior 10.2.9 can be used to develop C, C++, replaced by intrinsic after studying its exactand assembly language code targeted at many functionality. The replacement was done with proper care to maintain bit-exactness of the original ITU-Tprocessors. implementation.. The StarCore C/C++ compilerA. Optimization strategy substitutes the intrinsic function call with a set ofThe strategy we followed in the optimization is as designated assembly language instructions. Intrinsicsfollows improves performance of generated code and exploits1. Initial MCPS of the code is calculated by running the architectural features of SC3850 that can’t be donethe project with worst-case and all the optimization by normal C language.parameters enabled.2. Profile all the functions in the implementation.Identify the critical functions that takes the more 1273 | P a g e
  4. 4. Reshmi S, Sreenesh Shasidharan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue6, November- December 2012, pp.1271-1275 2) Function inlining reduced memory moves with high degree ofFunction inlining is the process of substituting the parallelism.called function code in the place of functional call 7) Making use of overflow flagthereby eliminating functional call overhead .Inlining SC3850 Exception and mode register (EMR) has ancan be done by instructing the StarCore compiler overflow flag called DOVF which is set if saturationusing special #pragma directives. This technique results due to some of the arithmetic instructions.improves execution time at the expense of larger code Overflow check can be done efficiently by monitoringsize. Small functions that are frequently-called are the this flag and this can be directly accessed by makingbest candidates for inlining in the C code. The criteria use of assembly routines.for the selecting the best candidate depends upon thenumber and type of parameters the function passed, V. VERIFICATION AND TESTINGthe data type of the returned value and register At each stage of optimization we verified theallocation in the resulting assembly code. correctness of our implementation using the test vectors included as a part of G.729.1 3) Fast 32-bit DPF format operations recommendation. Each of these test vectors includesThe ITU-T G.729EV reference code was originally an input and output file. For each input file, the outputdesigned for 16-bit processors which do not support file of an implementation should match the prescribed32-bit operations. Thus the 32-bit operations were output for the specified input in a bit exact manner. Bydone using a 32-bit double precision format in which doing this testing phase we became assured that ourdata is represented with a precision of 1 in 2^31.Some implementation was fully compliant with pseudofunctions which originally took the 32-bit parameters codes provided by two higher and lower 16-bit half words andperform 16-bit operations on them were replaced with VI. RESULTS AND CONCLUSIONsingle 32-bit DPF parameter. Instead of using two In this paper we presented fixed-pointseparate DALU registers for two 16-bit half words optimized implementation of G.729.1 codec (annex J)now only a single DALU register for a 32-bit word is on SC3850 DSP core. First the details ofused. Thus, functions that originally received two implementation of this algorithm in C language werepointers to two 16-bit arrays could now operate with discussed. Next using our C written codes,pointer to a single 32-bit array. Modifications are optimization was done for real-time implementation.implemented in DPF format to maintain bit-exactness We suggested a strategy for optimization processof the original ITU-T implementation. This which can be used in other similar researches. Usingoptimization makes use of the processors 32-bit the proposed strategy, a set of different techniques incapabilities and got 50% reduction in MCPS and order to reduce consuming cycles of G.729.1memory moves since the parameters passed to algorithm were introduced. These techniques includedfunction reduced by half. some methods based on SC3850 architecture and 4) Loop unrolling some methods specifically for optimization of G.729.1 Looping overhead in software which increases the algorithm. . The codec can be easily used for real-timecycle count can be reduced by loop unrolling. applications.The speed of operation of codec isUnrolling the loop N times can be done by improved by 90% and it’s a great success.duplicates the loop body N times and decreasing theloop counter by a factor of N.Since the architecture of Table.1 shows the MCPS consumption of the codecSC3850 contains 4 ALU’s unrolling the bigger loops before and after optimization for a sampling rate of 16(higher loop count) by a factor of N=4 gave better KHz .performance. This significantly reduces the MCPS,since it maximizes the efficient usage of multiple After %ALU execution units simultaneously and multiple Initial Final G.729.1 adding improvregister moves. Unrolling the inner loops reduces the MCPS MCPS Codec Intrinsic ementfunction looping overhead significantly, as loop (MCPS)counter needs to be updated less often and fewerbranches are executed. Encoder 400 245 28 93 5) Loop mergingCombining two or more loops of same loop count intoa single loop loads the ALU more efficiently. Loop decoder 240 215 22 90fusion is a process of combining multiple loops, whichreuse the same data, into one loop. The advantage ofthis is that data can be reused, reducing the memoryaccess, and loop overhead is reduced. 6) Use of SIMD2 instructionsThis optimization will make efficient use of MACregisters and enables packet data movement results in 1274 | P a g e
  5. 5. Reshmi S, Sreenesh Shasidharan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue6, November- December 2012, pp.1271-1275 ACKNOWLEDGMENTI gratefully acknowledge the contributions of FederalInstitute of science and technology, ankamaly for thesupport and assistance for the successful completionof this project . REFERENCES[1] Wai c Chu Speech Coding Algorithms Foundation and Evolution of Standardized Coders- Mobile Media Laboratory DoCoMo USA Labs San Jose, California.[2] ITU-T Rec. G.729.1, “G.729 based Embedded Variable bit-rate coder - An 8-32kbits/s scalable wideband coder bitstream interoperable with G.729” May 2006.[3] ITU-T Rec. G.729, “Coding of Speech at 8 kbit/s using Conjugate Structure Algebraic Code Excited Linear Prediction (CSACELP),”March 1996[4] MSC8156 Reference Manual Six Core Digital Signal Processor MSC8156RM Rev 2, June 2011[5] C Code Optimization Examples for the StarCore SC3850 Core, Rev 0, April 2010 1275 | P a g e