Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA
Veton Z. Këpuska, Mohamed M. Eljhani, Brian H. Hight
Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL 32901, USA
Int. Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 4, Issue 12 (Part 3), December 2014, pp. 160-168
Abstract
A typical speech recognition system is push-to-talk operated and requires manual activation. However, for users of hands-busy applications, movement may be restricted or impossible. One alternative is a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes such a speech-only interface. A WUW-SR system would allow the user to activate systems (cell phone, computer, etc.) with speech commands alone instead of manual activation. The trend in WUW-SR hardware design is towards implementing a complete system on a single chip intended for various applications. This paper presents an experimental FPGA design and implementation of a novel architecture for a real-time feature extraction processor that includes a Voice Activity Detector (VAD) and the extraction of MFCC, LPC, and ENH_MFCC features. In the WUW-SR system, the recognizer front-end with VAD is located at the terminal, which is typically connected over a data network to a server for remote back-end recognition. The VAD is responsible for segmenting the signal into speech-like and non-speech-like segments. For any given frame, the VAD reports one of two possible states: VAD_ON or VAD_OFF. The back-end is then responsible for scoring the features segmented during the VAD_ON state. The most important characteristic of the presented design is that it should guarantee virtually 100% correct rejection of non-WUW (out-of-vocabulary, OOV) words while maintaining a correct acceptance rate of 99.9% or higher for in-vocabulary (INV) words. This requirement sets WUW-SR apart from other speech recognition tasks, because no existing system can guarantee 100% reliability by any measure.
Keywords: Speech Recognition System (SR); Wake-Up-Word (WUW) Speech Recognition; Front-end (FE); Voice Activity Detector (VAD); Feature Extraction; Mel-frequency Cepstral Coefficients (MFCC); Linear Predictive Coding (LPC); Enhanced Mel-frequency Cepstral Coefficients (ENH_MFCC); Field Programmable Gate Arrays (FPGA).
I. Introduction
The Voice Activity Detector is responsible for segmenting the signal into speech and non-speech segments. For any given frame, the VAD reports one of two possible states: VAD_ON or VAD_OFF. Word recognition in the back-end stage begins when the VAD enters the VAD_ON state and ends when the VAD switches to the VAD_OFF state. The VAD works in two phases: in the first phase, a classifier decides whether a single input frame is speech-like or non-speech-like; in the second phase, the number of speech frames and non-speech frames over a period of time is analyzed, and certain rules are applied to report the final decision (VAD_ON or VAD_OFF).
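To make the two-phase logic concrete, the following C++ sketch mirrors it in software. The energy threshold and the hangover counts are hypothetical placeholders; the paper does not publish its exact classifier or decision rules, which combine LPC spectrogram, log energy, and MFCC features.

#include <cstddef>

// Phase 1: per-frame classifier. A plain energy threshold stands in for the
// paper's richer LPC/log-energy/MFCC-based frame classifier.
bool frameIsSpeechLike(const float* frame, std::size_t n, float threshold) {
    float energy = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        energy += frame[i] * frame[i];
    return energy > threshold;
}

// Phase 2: smooth the per-frame decisions over time before reporting a state.
class VadSmoother {
public:
    enum State { VAD_OFF, VAD_ON };

    // onFrames/offFrames are hypothetical hangover parameters.
    VadSmoother(int onFrames, int offFrames)
        : state_(VAD_OFF), count_(0), onFrames_(onFrames), offFrames_(offFrames) {}

    State update(bool speechLike) {
        if (state_ == VAD_OFF) {
            count_ = speechLike ? count_ + 1 : 0;
            if (count_ >= onFrames_) { state_ = VAD_ON; count_ = 0; }
        } else {
            count_ = speechLike ? 0 : count_ + 1;
            if (count_ >= offFrames_) { state_ = VAD_OFF; count_ = 0; }
        }
        return state_;
    }

private:
    State state_;
    int count_;
    int onFrames_, offFrames_;
};

With, say, VadSmoother vad(5, 30), the state switches to VAD_ON only after five consecutive speech-like frames and back to VAD_OFF after thirty consecutive non-speech-like frames, which keeps brief pauses inside an utterance from splitting it.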
The VAD is responsible for finding sections of speech by segmenting them from the rest of the audio stream. The back-end then identifies whether or not the segmented utterance is a WUW. In "Front-end of Wake-Up-Word Speech Recognition System Design on FPGA" [1], we showed the generation of three sets of spectrograms. In "Wake-Up-Word Feature Extraction on FPGA" [2], we presented an efficient hardware architecture and implementation of the front-end of WUW-SR on FPGA. This front-end is responsible for generating three sets of features:
1. Mel-frequency Cepstral Coefficients (MFCC),
2. Linear Predictive Coding (LPC), and
3. Enhanced Mel-frequency Cepstral Coefficients (ENH-MFCC).
A great deal of work has been done to address the problem of recognizing speech-like segments by designing an efficient hardware front-end with a built-in VAD on FPGA. The board that has been used is based on an Altera DSP system, acting as a processor responsible for extracting three different sets of features from the input audio signal.
The feature extraction of speech is an important issue in the front-end. There are two types of acoustic measurements of the speech signal. One is the parametric modeling approach, which is developed to match closely the resonant structure of the human vocal tract that produces the corresponding speech sound. It is mainly derived from Linear Predictive analysis, such as the LPC-based cepstrum. The other approach, MFCCs, is a nonparametric modeling method that originates from the human auditory perception system. The MFCCs are derived in this form [3].
In recent studies of speech recognition systems, the MFCC parameters perform better than others in recognition accuracy [4, 5]. This paper presents a VAD-based feature extraction solution, optimized for implementation in FPGA structures. The system is designed to be implemented as a System-on-Programmable-Chip (SOPC). The design not only has relatively low resource usage, but also maintains a reasonably high level of performance, so that it can run for a long time on battery power.
The remainder of this paper is organized as follows. Section II describes Wake-Up-Word speech recognition and its architecture. Section III describes the embedded front-end of the WUW-SR design procedure and architecture. Section IV describes the Mel-frequency Cepstral Coefficient feature design. Section V describes the Linear Predictive Coding feature design. Section VI describes the Enhanced Mel-frequency Cepstral Coefficient feature design. Section VII describes the Voice Activity Detector design and implementation. Section VIII presents results and comparisons of the three spectrograms and features from the FPGA hardware implementation and the C++ front-end algorithm. Conclusions follow in Section IX.
Figure 1: Overall WUW-SR architecture
II. WUW-SR Overall System Architecture
As shown in Fig. 1, the WUW-SR can be broken down into three components, as explained in [6]. The front-end takes an input waveform (the "operator" audio signal) and outputs a sequence of parameters: the MFCC, LPC, and ENH-MFCC features described in [2]. The Voice Activity Detector is responsible for segmenting the signal into speech and non-speech segments; for any given frame, the VAD reports one of two possible states: VAD_ON or VAD_OFF. Word recognition in the back-end stage begins when the VAD enters the VAD_ON state and ends when the VAD switches to the VAD_OFF state. The back-end takes the feature sequence produced during the VAD_ON state and outputs the recognized command. As the figure shows, the signal processing module accepts raw audio samples and produces spectral representations of short-time signals. The feature-extraction module generates features from this spectral representation, which are decoded with the corresponding hidden Markov models (HMMs). The individual feature scores are classified using support vector machines (SVMs) into INV or OOV: in-vocabulary or out-of-vocabulary speech [6].
III. Embedded Front-end with Built-in VAD of WUW-SR Architecture
As shown in Fig. 2, the new front-end design is divided into twenty-eight modules in five stages. The first seven pink-colored modules represent the pre-processing stage and serve as basic modules that provide the windowed speech signal to the other stages. The six yellow-colored modules represent the LPC stage and are used to generate 13 LPC features. The five brown-colored modules represent the MFCC stage and are used to generate 13 MFCC features. The five green-colored modules represent the ENH-MFCC stage and are used to generate 13 ENH-MFCC features. The five red-colored modules represent the VAD stage, which is responsible for finding segments of spoken speech.
3. Veton Z. Këpuska et al. Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 12( Part 3), December 2014, pp.160-168
www.ijera.com 3 | P a g e
Figure 2: Front-end with Built-in VAD of WUW-SR Block Diagram
Stage A: Pre-Processing
1. Analog-to-Digital Converter (ADC).
2. DC Filtering.
3. Serial to 32-bit parallel converter.
4. Integer to floating-point converter.
5. Pre-emphasis filtering.
6. Window advance buffering.
7. Hamming window.
Stage B: Linear Predictive Coding Coefficients
1. Autocorrelation Linear Predictive Coding.
2. Fast Fourier Transform (FFT).
3. LPC Spectrogram.
4. Mel-scale Filtering.
5. Discrete Cosine Transform (DCT).
Stage C: Mel-Frequency Cepstral Coefficients
1. Fast Fourier Transform (FFT).
2. MFCC Spectrogram.
3. Mel-scale Filtering.
4. Discrete Cosine Transform (DCT).
Stage D: Enhanced Mel-Frequency Cepstral Coefficients
1. Enhanced Spectrum.
2. Enhanced MFCC Spectrogram.
3. Mel-scale Filtering.
4. Discrete Cosine Transform (DCT).
Stage E: Voice Activity Detector
The five blue-colored modules represent the Voice Activity Detector stage. The VAD is responsible for finding utterances spoken in the correct context and segmenting them from the rest of the audio stream; the system then identifies whether or not the segmented utterance is a WUW.
1. LPC Spectrogram Features.
2. Log Energy Features.
3. MFCC Feature.
4. Voice Activity Detection Logic.
IV. Mel-scale Frequency Cepstral Coefficients (MFCC) Design and Implementation
Feature extraction involves identifying the formants in the speech, which represent the frequency locations of energy concentrations in the speaker's vocal tract. Many different approaches are used: Mel-scale Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Reflection Coefficients (RCs). Among these, MFCC has been found to be more robust in the presence of background noise compared to other algorithms [7]. It also offers the best trade-off between performance and size (memory) requirements. The primary reason for the effectiveness of MFCC is that it models the non-linear auditory response of the human ear, which resolves frequencies on a log scale [8].
Intensive efforts have been carried out to achieve a high-performance front-end. Converting a speech waveform into a form suitable for processing by the decoder requires several stages, as shown in Fig. 2:
1. Filtration: The waveform is sent through a low-pass filter, typically 4 kHz to 8 kHz. As evidenced by the bandwidth of the telephone system being around 4 kHz, this is sufficient for comprehension and is used as the minimum bandwidth required for telephony transmission.
2. Analog-to-Digital Conversion: The process of digitizing and quantizing the analog speech waveform begins with this stage. Recall that the first step in processing speech is to convert the analog representations (first air pressure, then the analog electrical signal from a microphone) into a digital signal.
3. Sampling: The resulting waveform is sampled. Sampling theory requires a sampling (Nyquist) rate of double the maximum frequency (so 8 to 16 kHz, as appropriate). A sampling rate of 8 kHz was used in our front-end. (We used a CODEC chip to perform the first, second, and third stages.)
4. Serial-to-Parallel Converter: This module receives the serial digital signal from the CODEC and converts it to 32-bit parallel words.
5. Integer to floating-point converter: This module converts 32-bit signed integer data to single-precision (32-bit) floating-point values. The input data is routed through the int_2_float Megafunction core named ALTFP_CONVERT.
6. Pre-emphasis: The digitized speech signal s(n) is put through a low-order digital filter to spectrally flatten the signal and make it less susceptible to finite-precision effects later in the signal processing. The filter is represented by:
y[n] = x[n] − α·x[n−1]
i.e., Output = Input − (PRE_EMPH_FACTOR × Previous_input), where we have chosen the value of PRE_EMPH_FACTOR (α) as 0.975.
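As a software reference for the hardware module, a minimal C++ rendering of this filter (assuming x[−1] = 0 at the start of the stream):

#include <cstddef>
#include <vector>

// Pre-emphasis: y[n] = x[n] - 0.975 * x[n-1]
std::vector<float> preEmphasize(const std::vector<float>& x, float alpha = 0.975f) {
    std::vector<float> y(x.size());
    float prev = 0.0f;  // assume x[-1] = 0 before the stream starts
    for (std::size_t n = 0; n < x.size(); ++n) {
        y[n] = x[n] - alpha * prev;
        prev = x[n];
    }
    return y;
}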
7. Window Buffering: A 32-bit, 256-deep dual-port RAM (DPRAM) stores 200 input samples. A state machine handles moving audio data into the RAM and pulling data out of the RAM (40 samples at a time) to be multiplied by the Hamming coefficients, which are stored in a ROM.
8. Windowing: The Hamming window function smooths the input audio data with a Hamming curve prior to the FFT. This stage slices the input signal into discrete time segments, using a window N milliseconds wide, typically 25 ms (200 samples). A Hamming window size of 25 ms, consisting of 200 samples at the 8 kHz sampling frequency, with a 5 ms frame shift (40 samples), was picked for our front-end windowing.
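In software form, the buffering and windowing stages amount to slicing the stream into 200-sample frames every 40 samples and multiplying each frame by precomputed Hamming coefficients, as this sketch illustrates:

#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kFrameLen  = 200;  // 25 ms at 8 kHz
constexpr std::size_t kFrameStep = 40;   // 5 ms frame shift

// Hamming coefficients; in the hardware design these live in a ROM.
std::vector<float> hammingWindow(std::size_t N) {
    const float pi = 3.14159265f;
    std::vector<float> w(N);
    for (std::size_t n = 0; n < N; ++n)
        w[n] = 0.54f - 0.46f * std::cos(2.0f * pi * n / (N - 1));
    return w;
}

// Slice the signal into overlapping, windowed frames.
std::vector<std::vector<float>> frameSignal(const std::vector<float>& x) {
    const std::vector<float> w = hammingWindow(kFrameLen);
    std::vector<std::vector<float>> frames;
    for (std::size_t start = 0; start + kFrameLen <= x.size(); start += kFrameStep) {
        std::vector<float> f(kFrameLen);
        for (std::size_t n = 0; n < kFrameLen; ++n)
            f[n] = x[start + n] * w[n];
        frames.push_back(f);
    }
    return frames;
}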
9. Fast Fourier Transform: In order to map the
sound data from the time domain to the frequency
domain, the Altera IP Megafunction FFT module
is used. The module is configured so as to produce
a 200-point FFT. This function can take a streaming data input in natural order and output the transformed data in natural order, with a maximum latency of 200 clock cycles once all 200 data samples have been received.
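As an algorithmic reference only (the hardware uses the Altera FFT Megafunction, not this code), a direct DFT of one frame illustrates the time-to-frequency mapping this stage performs:

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N).
std::vector<std::complex<float>> dft(const std::vector<float>& frame) {
    const float pi = 3.14159265358979f;
    const std::size_t N = frame.size();
    std::vector<std::complex<float>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<float> sum(0.0f, 0.0f);
        for (std::size_t n = 0; n < N; ++n) {
            const float ang = -2.0f * pi * static_cast<float>(k * n) / N;
            sum += frame[n] * std::complex<float>(std::cos(ang), std::sin(ang));
        }
        X[k] = sum;
    }
    return X;
}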
10. Spectrogram: This module takes the complex
data generated by the FFT and performs the
function:
20 * log10 (fft_real^2 + fft_imag^2).
We designed the spectrogram module to show how the spectral density of the signal varies with time, and we used it to identify phonetic sounds. Digitally sampled time-domain data are broken up into chunks, which usually overlap, and Fourier transformed to calculate the magnitude of the frequency spectrum for each chunk. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. These spectra are then laid side by side to form the image surface.
11. Mel-scale filtering: While the resulting FFT spectrum carries information at each frequency on a linear scale, human hearing is less sensitive at frequencies above 1000 Hz. This has a direct effect on the performance of automatic speech recognition systems; therefore, the spectrum is warped using the logarithmic Mel scale. To create this effect on the FFT spectrum, a bank of filters is constructed, with filters distributed equally below 1000 Hz and spaced logarithmically above 1000 Hz.
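The filter placement can be sketched in C++ with the standard Mel warping, which is approximately linear below 1000 Hz and logarithmic above it; the exact constants and filter count used in the hardware are not given in the paper, so the values below are illustrative:

#include <cmath>
#include <vector>

// Standard Mel warping (assumed; the paper does not state its constants).
float hzToMel(float hz)  { return 2595.0f * std::log10(1.0f + hz / 700.0f); }
float melToHz(float mel) { return 700.0f * (std::pow(10.0f, mel / 2595.0f) - 1.0f); }

// Apply numFilters triangular filters, equally spaced on the Mel axis,
// to one frame's power spectrum (bins assumed to span 0..sampleRate/2).
std::vector<float> melFilterbank(const std::vector<float>& power,
                                 int numFilters, float sampleRate) {
    const int bins = static_cast<int>(power.size());
    const float melMax = hzToMel(sampleRate / 2.0f);
    std::vector<float> edges(numFilters + 2);
    for (int m = 0; m < numFilters + 2; ++m)
        edges[m] = melToHz(melMax * m / (numFilters + 1));

    std::vector<float> out(numFilters, 0.0f);
    for (int m = 1; m <= numFilters; ++m) {
        const float lo = edges[m - 1], mid = edges[m], hi = edges[m + 1];
        for (int k = 0; k < bins; ++k) {
            const float f = k * (sampleRate / 2.0f) / (bins - 1);
            if (f > lo && f < hi) {
                const float w = (f <= mid) ? (f - lo) / (mid - lo)
                                           : (hi - f) / (hi - mid);
                out[m - 1] += w * power[k];
            }
        }
    }
    return out;
}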
12. Discrete Cosine Transform: The DCT is a Fourier-related transform similar to the discrete Fourier
transform (DFT), but using only real numbers.
DCTs are equivalent to DFTs of roughly twice the
length, operating on real data with even symmetry
(since the Fourier transform of a real and even
function is real and even). A DCT computes a
sequence of data points in terms of summation of
cosine functions oscillating at various frequencies.
Performing the DCT on the Mel-scale spectrum is motivated by the extraction of the frequency-domain characteristics of speech. The DCT module reduces the redundant information in the speech signal and compacts it into feature coefficients of minimal dimension.
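A C++ sketch of this stage as the DCT-II commonly used for MFCC, applied to the log Mel-filterbank energies (the exact normalization in the hardware is not specified, so none is applied here):

#include <cmath>
#include <vector>

// Cepstral coefficients from log Mel energies:
// c[n] = sum_m logMel[m] * cos(pi*n*(m+0.5)/M)   (DCT-II, unnormalized).
std::vector<float> dctII(const std::vector<float>& logMel, int numCoeffs) {
    const float pi = 3.14159265358979f;
    const int M = static_cast<int>(logMel.size());
    std::vector<float> c(numCoeffs, 0.0f);
    for (int n = 0; n < numCoeffs; ++n)
        for (int m = 0; m < M; ++m)
            c[n] += logMel[m] * std::cos(pi * n * (m + 0.5f) / M);
    return c;
}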
V. Autocorrelation Linear Predictive Coding
(LPC) Design and Implementation
As shown in Fig. 2, an additional module, Autocorrelation Linear Predictive Coding (LPC), is used to extract LPC features from the speech. The basic idea of LPC is to approximate the current speech sample as a linear combination of past samples, as shown in the following equation:
x[n] = Σ (k = 1 to p) a_k · x[n − k] + e[n]
where:
x[n − k]: previous speech samples
p: order of the model
a_k: prediction coefficients
e[n]: prediction error
This module receives windowed data from the window module and represents the spectral envelope of the digital speech signal in compressed form, using the information of a linear predictive model. We used this method to encode good-quality speech and to provide an estimate of the speech parameters.
The goal of this method is to calculate the prediction coefficients a_k for each frame. The order of the LPC model, i.e., the number of coefficients p, determines how closely the prediction coefficients can approximate the original spectrum. As the order increases, the accuracy of LPC also increases. This
means the distortion will decrease. The main advantage of LPC is usually attributed to the all-pole characteristics of vowel spectra; moreover, the ear is more sensitive to spectral poles than to zeros [9]. In
comparison to non-parametric spectral modeling
techniques such as filter banks, LPC is more powerful
in compressing the spectral information into few filter
coefficients [10].
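The prediction coefficients a_k can be obtained with the classical autocorrelation method; the following C++ sketch uses the Levinson-Durbin recursion (the frame interface and order handling are illustrative; the paper does not detail its hardware realization of this step):

#include <vector>

// Autocorrelation-method LPC of order p for one windowed frame.
// Returns a where a[1..p] are the prediction coefficients
// (a[0] is unused and corresponds to the implicit leading 1).
std::vector<float> lpc(const std::vector<float>& frame, int p) {
    const int N = static_cast<int>(frame.size());
    std::vector<float> r(p + 1, 0.0f);
    for (int lag = 0; lag <= p; ++lag)          // autocorrelation r[0..p]
        for (int n = lag; n < N; ++n)
            r[lag] += frame[n] * frame[n - lag];

    std::vector<float> a(p + 1, 0.0f), prev(p + 1, 0.0f);
    float err = r[0];
    for (int i = 1; i <= p; ++i) {              // Levinson-Durbin recursion
        if (err <= 0.0f) break;                 // degenerate (e.g. silent) frame
        float k = r[i];
        for (int j = 1; j < i; ++j) k -= prev[j] * r[i - j];
        k /= err;
        a[i] = k;
        for (int j = 1; j < i; ++j) a[j] = prev[j] - k * prev[i - j];
        err *= (1.0f - k * k);
        prev = a;
    }
    return a;
}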
VI. Enhanced Mel-scale Frequency Cepstral
Coefficients (ENH-MFCC) Design and
Implementation
The spectrum enhancement module is used to generate the ENH-MFCC set of features. We implemented this module, as shown in Fig. 2, to perform an enhancement algorithm on the LPC spectrum signal. The ENH-MFCC features have a higher dynamic range than regular MFCC features, so these new features help the back-end improve recognition quality and accuracy.
The algorithm uses only the single-sided
spectrum, so the state machine starts the calculations
when 128 data points have been written into the input
RAM.
VII. Voice Activity Detector (VAD) Design and
Implementation
There are various kinds of signals that may contain actual speech. In our front-end system it is imperative to monitor for and detect any speech signal. The algorithm is designed to delineate speech from non-speech as efficiently as possible. It considers a segment to be speech only if VAD_ON is triggered for more than 15 frames; otherwise it does not trigger.
To achieve speech/non-speech detection, as shown in Fig. 3, we designed and implemented the VAD models. The VAD continuously monitors the system and determines, for every frame, whether the signal is speech or background noise. The VAD uses three inputs, the Log Energy, LPC, and MFCC features, to decide whether VAD is ON or OFF.
Figure 3: Voice Activity Detector Architecture
Those features are finely tuned to detect any
voice in the speech frame. Each feature has an
independent detection method. Depending on the
calculations done by the features, the overall VAD
logic reacts by turning the VAD output ON or OFF (speech or background noise). Each feature detects the presence of speech by computing a variance and a mean deviation. Each feature has its own flag, which is set only when finely tuned criteria are met.
The final decision is based on all these
flags/states; if at least two out of the three are true, the
overall logic of VAD triggers and the VAD turns ON.
Fig. 4 shows the “Onward” waveform as input audio data, superimposed with its VAD segmentation, its MFCC spectrogram, LPC spectrogram, and ENH-MFCC. The VAD, as implemented, uses three features to provide the back-end with the final decision on whether a frame contains speech. Those three features are: Energy, MFCC, and LPC [6].
The VAD logic requires at least two out of the three features to trigger in order for VAD to be ON; otherwise VAD is not triggered (OFF).
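A minimal C++ sketch of this two-out-of-three decision (the flag names follow the feature descriptions below):

// VAD_ON when at least two of the three feature flags agree.
bool vadDecision(bool energyFlag, bool mfccFlag, bool lpcFlag) {
    const int votes = (energyFlag ? 1 : 0) + (mfccFlag ? 1 : 0) + (lpcFlag ? 1 : 0);
    return votes >= 2;
}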
In the following sections each individual feature
is described:
Figure 4: Speech signal “Onward” with VAD segmentation, MFCC, LPC, and ENH-MFCC
spectrograms
1. Log Energy Feature
The Energy feature is computed by squaring the samples of the frame. It is used to track the change in mean frame energy and to detect whether the frame contains a speech signal. If there is a drastic change in frame energy, the Energy flag is set. This feature function receives the frame energy parameter, log2_frame_energy, from the Hamming Window module and uses it to calculate the overall mean frame energy, which is compared with the present frame energy. The mean frame energy, mean_log2_fm_en, depends on a variable lambda, lambda_LTE, which is calculated differently during the first few frames.
The mean is not updated during the VAD_ON stage; it is kept in a separate variable, mean_log2_fm_en_VAD_ON. If the difference between the mean frame energy and the current frame energy is greater than a certain threshold, SNR_THRESHOLD_VAD, the Energy flag is set.
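A hedged C++ sketch of this test, assuming an exponential long-term-energy update; SNR_THRESHOLD_VAD, lambda_LTE, and the update rule itself are placeholders, since the paper does not give their values:

// Energy feature: compares the running mean of log2 frame energy
// against the current frame (assumed update rule, illustrative only).
struct EnergyFeature {
    float mean = 0.0f;
    bool  initialized = false;

    bool update(float log2FrameEnergy, float lambdaLTE, float snrThreshold,
                bool vadOn) {
        if (!initialized) { mean = log2FrameEnergy; initialized = true; }
        if (!vadOn)  // during VAD_ON the mean is tracked separately
            mean = lambdaLTE * mean + (1.0f - lambdaLTE) * log2FrameEnergy;
        return (log2FrameEnergy - mean) > snrThreshold;  // Energy flag
    }
};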
2. Mel-frequency Cepstral Coefficient Feature
The Mel-frequency cepstral coefficients are calculated for each frame and are included in the feature vector as the MFCC component. When the mean value over all frames (mean_vad_mfcc_feature) exceeds the current frame’s MFCC feature value (local_aver_mfcc_fea_val), the flag (VAD_MFCC_State_Flag) is set. The mean is calculated differently depending on VAD_MFCC_State_Flag and on the frame counter. It is dynamically adapted to the actual values so it can detect the signal accurately: the mean is lowered if the current frame’s feature value is too low compared to the mean calculated up to that point. The mean is not updated while VAD is ON; during that stage it is stored in a different variable (mean_vad_mfcc_feature_VAD_ON).
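A hedged C++ sketch of this test; the adaptation rates are placeholders, as the paper does not state its update constants:

// MFCC feature: flag is set when the running mean exceeds the current
// frame's feature value, as described above (rates are assumed).
struct MfccVadFeature {
    float mean = 0.0f;
    int   frames = 0;

    bool update(float localAverMfccFeaVal, bool vadOn) {
        const bool flag = (frames > 0) && (mean > localAverMfccFeaVal);
        if (!vadOn) {
            // Lower the mean faster when the current value is far below it.
            const float rate = (localAverMfccFeaVal < 0.5f * mean) ? 0.9f : 0.99f;
            mean = rate * mean + (1.0f - rate) * localAverMfccFeaVal;
        }
        ++frames;
        return flag;  // VAD_MFCC_State_Flag
    }
};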
3. Linear Predictive Coding Feature
The Linear Predictive Coding feature is computed from the spectrum of the signal. This function uses a different form of feature computation: it uses a variance measurement of the signal to detect whether speech is present. When the current variance (var_spec) is much larger than the average variance (aver_var_spec) computed from the previous frames, this indicates that speech is present. The variance flag (VAD_SPEC_State_Flag) is set when there is a sudden change in variance relative to previous frames, and it is cleared if the change is minimal.
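A hedged C++ sketch of the variance test; the ratio threshold and the averaging schedule are placeholders:

// LPC feature: flag is set when the current spectral variance jumps
// well above the running average of previous frames.
struct SpecVarianceFeature {
    float averVarSpec = 0.0f;
    int   frames = 0;

    bool update(float varSpec, float ratioThreshold /* e.g. 2.0, assumed */) {
        const bool flag = (frames > 0) && (varSpec > ratioThreshold * averVarSpec);
        averVarSpec = (averVarSpec * frames + varSpec) / (frames + 1);
        ++frames;
        return flag;  // VAD_SPEC_State_Flag
    }
};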
VIII. Results and Comparisons
After observing the VAD system on several test utterances, we concluded that its hardware
implementation performs very well compared to the C++ implementation. In addition, the analysis included testing the performance of the hardware by comparing its three sets of spectrograms and features (MFCC, LPC, and ENH-MFCC) with the C++ software implementation of the WUW front-end.
The VAD described in this paper has been modeled and integrated with the front-end as a processor in Verilog HDL and implemented in a low-cost, high-speed, and power-efficient FPGA (Cyclone III EP3C120F780C7) on a DSP development kit.
The embedded front-end with built-in VAD used the C++ implementation as a baseline. In addition, a floating-point MATLAB implementation of the WUW front-end was developed. Each module was tested to ensure correct operation.
The sample utterance containing the WUW “Operator” was sampled at an 8 kHz sampling rate. This utterance was included in the experiments presented in this paper as input audio data for testing the WUW embedded front-end. In addition, the three sets of spectrograms and features produced by the front-end were compared with the baseline C++ implementation. The results show the following:
As shown in Figs. 5 and 6 below, the VAD, MFCC, LPC, and ENH-MFCC spectrograms generated by the hardware front-end and by the baseline C++ front-end are practically identical.
Likewise, Figs. 7 and 8 show that the VAD state and the MFCC, LPC, and ENH-MFCC features generated by the hardware front-end and the baseline C++ implementation are practically identical.
Figure 5: Hardware Front-end with VAD, MFCC, LPC, and Enhanced MFCC Spectrograms for “Operator”. The VAD curve above has three states: INITIAL (zero-valued), VAD_OFF (one-valued), and VAD_ON (two-valued).
Figure 6: C++ Front-end with VAD, MFCC, LPC, and Enhanced MFCC Spectrograms for
“Operator” Audio Data
Figure 7: Hardware Front-end with VAD, MFCC, LPC, and Enhanced MFCC Histograms for “Operator” Audio Data. The VAD curve above has three states: INITIAL (zero-valued), VAD_OFF (one-valued), and VAD_ON (two-valued).
Figure 8: C++ Front-end with VAD, MFCC, LPC, and Enhanced MFCC Histograms for “Operator”
Audio Data
IX. Conclusions and Applications
In this study, an efficient embedded front-end of a WUW-SR hardware implementation with built-in VAD in an FPGA is described. This front-end is responsible for generating three sets of features: MFCC, LPC, and ENH-MFCC. These features are then decoded against the corresponding HMMs in the back-end stage of the WUW speech recognizer. The computational complexity and memory requirements of the three feature algorithms and the VAD algorithm were analyzed in detail. The proposed front-end is the first embedded hardware system specifically designed for WUW-SR feature extraction based on three different sets of features. To demonstrate its effectiveness, the proposed design has been implemented in Cyclone III FPGA hardware [11]. The custom DSP board developed is power efficient and flexible in design.
References
[1] Këpuska VZ, Eljhani MM, Hight BH (2013) Front-end of Wake-Up-Word Speech Recognition System Design on FPGA. J Telecommun Syst Manage 2: 108. doi:10.4172/2167-0919.1000108.
[2] Këpuska, V.Z., Eljhani, M.M. and Hight, B.H.
(2014) Wake-Up-Word Feature Extraction on
FPGA. World Journal of Engineering and
Technology, 2, 1-12.
http://dx.doi.org/10.4236/wjet.2014.21001.
[3] O. Burak Tuzun, M. Demirekler and K. Bora, ”Comparison of Parametric and Non-Parametric Representations of Speech for Recognition,” Proceedings of MELECON, Mediterranean Electrotechnical Conference, 1994, pp. 65-68.
[4] J.P. Openshaw, Z.P. Sun, J.S. Mason, A
comparison of composite features under
degraded speech in speaker recognition,
Proceedings of the International Conference on
Acoustics, Speech, and Signal Processing, Vol.
2, Minneapolis, USA, 1993, pp. 371–374.
[5] R. Vergin, D. O’Shaughnessy, V. Gupta,
Compensated mel frequency cepstrum
coefficients, Proceedings of the International
Conference on Acoustics, Speech, and Signal
Processing, Minneapolis, USA, 1996, pp. 323–
326.
[6] V. Z. Këpuska and T. B. Klein, A novel Wake-
Up-Word speech recognition system, Wake-Up-
Word recognition task, technology and
evaluation, Nonlin. Anal.: Theory Meth. Appl.
71, pp. e2772–e2789, 2009. doi:10.1016/j.na.2009.06.089.
[7] S. Davis and P. Mermelstein, “Comparison of
parametric representations for monosyllabic
word recognition in continuously spoken
sentences,” IEEE Trans. Acoust., Speech,
Signal Processing, vol. 28, no. 4, pp. 357–366,
1980.
[8] H. Combrinck and E. Botha, “On the mel-scaled cepstrum,” Department of Electrical and Electronic Engineering, University of Pretoria.
[9] M. R. Schroeder, “Linear prediction, extremal entropy and prior information in speech signal analysis and synthesis,” Speech Communication, vol. 1, no. 1, pp. 9–20, May 1982.
[10] K. K. Paliwal and W. B. Kleijn, Speech
Synthesis and Coding, chapter Quantization of
LPC parameters, pp. 433–466, Elsevier Science
Publ., Amsterdam, the Netherlands, 1995.
[11] http://my.fit.edu/~vkepuska/web1/