Speaker recognition for multiethnic speakers is an interesting research topic. Studies involving many ethnicities require the right approach to achieve optimal model performance. Deep learning has been applied to speaker recognition involving many classes and has achieved promising accuracy. However, multi-class and imbalanced datasets remain obstacles in studies using deep learning, causing overfitting and reduced accuracy. Data augmentation is an approach for overcoming small amounts of data and multiclass problems, and it can improve the quality of research data depending on the method applied. This study proposes a data augmentation method using pitch shifting with a deep neural network, called pitch shifting data augmentation deep neural network (PSDA-DNN), to identify multiethnic Indonesian speakers. The results prove that the PSDA-DNN approach is the best method for multi-ethnic speaker recognition, with accuracy reaching 99.27% and precision, recall, and F1 score of 97.60%.
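As an illustration of how pitch-shifting augmentation of this kind is commonly implemented, the minimal sketch below uses librosa to create shifted copies of one utterance; the semitone steps, file names, and output format are assumptions for the sketch, not the authors' exact pipeline.

```python
# Minimal pitch-shifting augmentation sketch (assumed parameters, not the paper's exact setup).
import librosa
import soundfile as sf

def augment_with_pitch_shift(wav_path, out_prefix, semitone_steps=(-2, -1, 1, 2)):
    """Create pitch-shifted copies of one utterance to enlarge a speaker's training data."""
    y, sr = librosa.load(wav_path, sr=None)          # keep the original sampling rate
    for n in semitone_steps:
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)
        sf.write(f"{out_prefix}_shift{n:+d}.wav", shifted, sr)

# Example: four augmented versions of a single recording.
# augment_with_pitch_shift("speaker01_utt01.wav", "speaker01_utt01")
```

Each shifted copy keeps the linguistic content while changing the apparent pitch, which is what lets the augmented set counteract small or imbalanced per-speaker data.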
Enhancing speaker verification accuracy with deep ensemble learning and inclu...IJECEIAES
Effective speaker identification is essential for achieving robust speaker recognition in real-world applications such as mobile devices, security, and entertainment while ensuring high accuracy. However, deep learning models trained on large datasets with diverse demographic and environmental factors may lead to increased misclassification and longer processing times. This study proposes incorporating ethnicity and gender information as critical parameters in a deep learning model to enhance accuracy. Two convolutional neural network (CNN) models classify gender and ethnicity, followed by a Siamese deep learning model trained with critical parameters and additional features for speaker verification. The proposed model was tested using the VoxCeleb 2 database, which includes over one million utterances from 6,112 celebrities. In an evaluation after 500 epochs, equal error rate (EER) and minimum decision cost function (minDCF) showed notable results, scoring 1.68 and 0.10, respectively. The proposed model outperforms existing deep learning models, demonstrating improved performance in terms of reduced misclassification errors and faster processing times.
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...journalBEEI
This document discusses spoken language identification using i-vectors and x-vectors for feature extraction, and PLDA and logistic regression for classification. It examines extracting features from the Javanese, Sundanese, and Minangkabau languages, then classifying the languages using various parameters. The study finds that the x-vector outperforms the i-vector when using PLDA classification, while with logistic regression the i-vector performs better. It tunes parameters for the i-vector UBM size, the i-vector dimension, the x-vector maximum frame size, and the number of repeats, reporting equal error rates to evaluate performance on test segments of 3, 10, and 30 seconds.
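For readers unfamiliar with how such embeddings are typically scored, a minimal sketch of logistic-regression classification over precomputed i-vector/x-vector embeddings, together with an equal-error-rate calculation, might look like the following; the arrays, dimensions, and labels are placeholders, not the study's data.

```python
# Hypothetical sketch: classify precomputed language embeddings and compute an EER.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# X_* would be i-vectors or x-vectors; y_* are language labels (placeholders).
X_train = np.random.randn(300, 512)
y_train = np.random.randint(0, 3, size=300)     # e.g. Javanese / Sundanese / Minangkabau
X_test = np.random.randn(100, 512)
y_test = np.random.randint(0, 3, size=100)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)

# EER for one target language treated as the positive class.
target = 0
fpr, tpr, _ = roc_curve((y_test == target).astype(int), scores[:, target])
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
print(f"EER for language {target}: {eer:.3f}")
```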
Deep convolutional neural networks-based features for Indonesian large vocabu...IAESIJAI
This document describes a study that used convolutional neural networks (CNNs) to extract features for Indonesian large vocabulary speech recognition. The CNN model was trained discriminatively on speech data that had undergone speed perturbation, unlike typical deep learning models that are trained generatively. Evaluations showed the proposed CNN-DNN method achieved a 7.26% error reduction over DBN-DNN using MFCC features and a 9.01% error reduction over DBN-DNN using filterbank features on an Indonesian speech dataset. An additional 6.13% error reduction was achieved compared to a generatively trained CNN-DNN model. The study aims to address the challenge of limited data for non-mainstream languages like Indonesian.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
This document presents a method for generating suggestions for specific erroneous parts of sentences in Indian languages like Malayalam using deep learning. The method uses recurrent neural networks with long short-term memory layers to train a model on input-output examples of sentences and their corrections. The model takes in preprocessed sentence data and generates a set of possible corrections for erroneous parts through multiple network layers. An analysis of the model shows that it can accurately generate suggestions for a word length of three, but requires more data and study to handle the complex morphology and symbols of Malayalam. The performance of the method is limited by the hardware used and could be improved with a more powerful system and additional training data.
Speech Recognition Application for the Speech Impaired using the Android-base...TELKOMNIKA JOURNAL
Those who are speech impaired (tunawicara in the Indonesian language) suffer from abnormalities in their articulation of language as well as in their voice during normal speech, resulting in difficulty communicating verbally within their environment. Therefore, an application is required that can help and facilitate conversation. In this research, the authors developed a speech recognition application that can recognise the speech of the speech impaired and translate it into text, with input in the form of sound detected on a smartphone. The Google Cloud Speech Application Programming Interface (API) is used to convert audio to text, and such APIs are user-friendly. The Google Cloud Speech API integrates with Google Cloud Storage for data storage. Although research into speech-to-text recognition has been widely practised, this research tries to develop speech recognition specifically for the speech of the speech impaired, as well as to perform a likelihood calculation to see the effect of tone, pronunciation, and speech speed on recognition. The test was conducted by pronouncing the digits 1 through 10. The experimental results showed that the recognition rate for the speech impaired is about 80%, while the recognition rate for normal speech is 100%.
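A minimal sketch of how an audio recording could be sent to the Google Cloud Speech-to-Text API for transcription is shown below; the language code, encoding, sample rate, and file name are assumptions, and the paper's Android-side integration is not reproduced here.

```python
# Hypothetical sketch: transcribing one audio file with the Google Cloud Speech-to-Text client.
from google.cloud import speech

def transcribe(path: str, language_code: str = "id-ID") -> str:
    client = speech.SpeechClient()
    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language_code,   # Indonesian; an assumption for this sketch
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

# print(transcribe("digit_recording.wav"))
```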
Performance estimation based recurrent-convolutional encoder decoder for spee...karthik annam
This document discusses a proposed Recurrent-Convolutional Encoder-Decoder (R-CED) network for speech enhancement. The R-CED network aims to overcome challenges with existing methods by estimating the a priori and posteriori signal-to-noise ratios to separate noise from speech. The R-CED consists of convolutional layers with increasing and decreasing numbers of filters to encode and decode features. Performance will be evaluated using metrics like PESQ, STOI, CER, MSE, SNR, and SDR. The proposed method aims to improve speech enhancement accuracy and recover enhanced speech quality compared to other techniques.
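As a rough illustration of the encoder-decoder idea described above, a convolutional network whose filter counts first grow and then shrink could be sketched as follows; the filter counts, kernel sizes, and input dimensions are assumptions, not the paper's R-CED configuration.

```python
# Hypothetical R-CED-style sketch in Keras: filter counts grow then shrink; sizes are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_ced(input_len=1032):
    inp = keras.Input(shape=(input_len, 1))
    x = inp
    # Encoder: increasing numbers of filters.
    for filters in (16, 32, 64):
        x = layers.Conv1D(filters, kernel_size=9, padding="same", activation="relu")(x)
    # Decoder: decreasing numbers of filters.
    for filters in (32, 16):
        x = layers.Conv1D(filters, kernel_size=9, padding="same", activation="relu")(x)
    out = layers.Conv1D(1, kernel_size=9, padding="same")(x)   # enhanced signal estimate
    return keras.Model(inp, out)

model = build_ced()
model.compile(optimizer="adam", loss="mse")
```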
Constructed model for micro-content recognition in lip reading based deep lea...journalBEEI
The document describes a proposed model for micro-content recognition in lip reading using deep learning. The model takes micro-contents (the English alphabet) as input from video and recognizes them using a convolutional neural network (CNN). The CNN performs feature extraction and recognition. The model was tested on a dataset containing videos of 11 people pronouncing letters and achieved a high recognition rate of 98%.
A computationally efficient learning model to classify audio signal attributesIJECEIAES
The era of machine learning has opened up groundbreaking realities and opportunities in the field of medical diagnosis. However, faster and proper diagnosis of diseases and medical conditions requires proper analysis and classification of digital signal data: brain magnetic resonance imaging (MRI) data has to be appropriately classified for the identification of brain tumors, and similarly, pulse signal analysis is required to evaluate the operating condition of the human heart. Several studies have used machine learning (ML) modeling to classify speech signals, but very few have explored the classification of audio signal attributes in the context of intelligent healthcare monitoring. The study therefore aims to introduce novel mathematical modeling to analyze and classify synthetic pulse audio signal attributes with cost-effective computation. The numerical modeling is composed of several functional blocks in which deep neural network-based learning (DNNL) plays a crucial role during the training phase, further combined with a recurrent structure of long short-term memory (R-LSTM) feedback connections (FCs). The design is evaluated experimentally in a numerical computing environment in terms of accuracy and computational cost. The classification outcome of the proposed approach shows that it attains approximately 85% accuracy, which is comparable to the baseline approaches in both accuracy and execution time.
Speech recognition techniques are among the most important modern technologies. Many different systems have been developed in terms of the methods used for feature extraction and classification. Voice recognition includes two areas, speech recognition and speaker recognition; this research is confined to the field of speech recognition. The research presents a proposal to improve the performance of single-word recognition systems through an algorithm that combines more than one of the techniques used for feature extraction together with a modification of the neural network, and studies the effect of noise on the proposed system. Four speech recognition systems were studied: the first adopted the MFCC algorithm to extract features; the second adopted the PLP algorithm; the third combined the two previous algorithms in addition to the zero-crossing rate; and in the fourth, the neural network used for discrimination was modified and the error ratio was determined, along with the impact of noise on the previous systems. The outcomes were compared in terms of recognition rate and the time needed to train the neural network for each system independently, achieving a recognition rate of up to 98% with the proposed framework.
SPEECH RECOGNITION BY IMPROVING THE PERFORMANCE OF ALGORITHMS USED IN DISCRIM...ijcsit
This document discusses improving speech recognition performance through algorithms used for feature extraction and classification. It examines 4 systems: 1) MFCC extraction, 2) PLP extraction, 3) combining MFCC, PLP and zero-crossing rate, 4) modifying the neural network. System 3 achieved the highest recognition rate of 98% even with noise, outperforming the individual algorithms. Increasing the training samples to 500 further improved recognition ratio.
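To make the combined-feature idea concrete, a minimal sketch of extracting MFCC and zero-crossing-rate features with librosa and concatenating them is shown below; PLP features would come from a dedicated library and are omitted, and the sample rate and coefficient count are assumptions.

```python
# Hypothetical sketch: combined MFCC + zero-crossing-rate features for a single-word recording.
import numpy as np
import librosa

def combined_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # shape (n_mfcc, frames)
    zcr = librosa.feature.zero_crossing_rate(y)                 # shape (1, frames)
    # Summarize each feature over time and concatenate into one vector per utterance.
    return np.concatenate([mfcc.mean(axis=1), zcr.mean(axis=1)])

# vec = combined_features("word_01.wav")   # e.g. a 14-dimensional feature vector
```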
A prior case study of natural language processing on different domain IJECEIAES
This document summarizes a prior case study on natural language processing across different domains. It begins with an introduction to natural language processing, describing how it is a branch of artificial intelligence that allows computers to understand human language. It then reviews several existing studies that applied natural language processing techniques such as named entity recognition and text mining to tasks like identifying technical knowledge in resumes, enhancing reading skills for deaf students, and predicting student performance. The document concludes by highlighting some of the challenges in developing new natural language processing models.
A Survey on Speech Recognition with Language Specificationijtsrd
As a cross-disciplinary topic, speech recognition is entirely based on speech as the survey object. Speech recognition allows the machine to convert the speech signal into text or commands via a process of identification and understanding. Speech recognition involves various fields such as physiology, psychology, linguistics, computer science, and signal processing, and is even related to a person's body language; its goal is to achieve natural language communication between man and machine. Speech recognition technology is gradually becoming the key technology of the IT man-machine interface. This paper describes the development of speech recognition technology and its basic principles and methods, reviews the classification of speech recognition systems, speech recognition approaches, and voice recognition technology, and analyzes the problems faced by speech recognition. Dr. Preeti Savant | Lakshmi Sandhya H, "A Survey on Speech Recognition with Language Specification", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-6, Issue-3, April 2022. URL: https://www.ijtsrd.com/papers/ijtsrd49370.pdf Paper URL: https://www.ijtsrd.com/computer-science/speech-recognition/49370/a-survey-on-speech-recognition-with-language-specification/dr-preeti-savant
Deep convolutional neural network for hand sign language recognition using mo...journalBEEI
Image processing systems based on computer vision have received much attention from science and technology experts. Research on image processing is needed in the development of human-computer interaction, such as hand recognition or gesture recognition for people with hearing impairments and deaf people. In this research we collect hand gesture data and use a simple deep neural network architecture, which we call model E, to recognize the actual hand gestures. The dataset is collected from kaggle.com in the form of ASL (American Sign Language) datasets. We perform an accuracy comparison with another existing model, AlexNet, to see how robust our model is. We find that adjusting the kernel size and the number of epochs for each model also gives different results. After comparison with the AlexNet model, we find that our model E performs better, with 96.82% accuracy.
GLOVE BASED GESTURE RECOGNITION USING IR SENSORIRJET Journal
This document summarizes research on a glove-based gesture recognition system using IR sensors. The system aims to help those who are deaf and mute communicate through hand gestures. An IR sensor and LED placed on a glove detect hand gestures based on the amount of light received by the sensor. The Arduino microcontroller recognizes the gestures and displays the meaning on an LCD screen while playing an audio message. The researchers claim this method is more accurate and has a lower error rate than conventional image processing approaches. It is intended to help address both safety and communication issues faced by those who are deaf or speech-impaired. Experimental results showed the system successfully recognized gestures and could help reduce the gap between those who are normal and speech-impaired.
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...csandit
This document describes a study that developed a speech recognition system for recognizing spoken Malayalam digits. It used two wavelet-based feature extraction techniques - Discrete Wavelet Transforms (DWT) and Wavelet Packet Decomposition (WPD) - and evaluated their performance using a Naive Bayes classifier. DWT achieved 83.5% accuracy and WPD achieved 80.7% accuracy. To improve recognition accuracy, the study introduced a new technique called Discrete Wavelet Packet Decomposition (DWPD) that utilizes features from both DWT and WPD. DWPD achieved the highest accuracy of 86.2% along with the Naive Bayes classifier.
Combined feature extraction techniques and naive bayes classifier for speech ...csandit
Speech processing and consequent recognition are important areas of Digital Signal Processing, since speech allows people to communicate more naturally and efficiently. In this work, a speech recognition system is developed for recognizing digits in Malayalam. For recognizing speech, features are to be extracted from speech, and hence the feature extraction method plays an important role in speech recognition. Here, front-end processing for extracting the features is performed using two wavelet-based methods, namely Discrete Wavelet Transforms (DWT) and Wavelet Packet Decomposition (WPD). A Naive Bayes classifier is used for classification. After classification using the Naive Bayes classifier, DWT produced a recognition accuracy of 83.5% and WPD produced an accuracy of 80.7%. This paper is intended to devise a new feature extraction method which produces improvements in the recognition accuracy. So, a new method called Discrete Wavelet Packet Decomposition (DWPD) is introduced which utilizes the hybrid features of both DWT and WPD. The performance of this new approach is evaluated, and it produced an improved recognition accuracy of 86.2% along with the Naive Bayes classifier.
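A minimal sketch of how DWT-based features could be fed to a Naive Bayes classifier is shown below, using PyWavelets and scikit-learn; the wavelet, decomposition level, summary statistics, and the random placeholder data are assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: DWT sub-band energy features + Gaussian Naive Bayes for spoken digits.
import numpy as np
import pywt
from sklearn.naive_bayes import GaussianNB

def dwt_features(signal, wavelet="db4", level=4):
    """Energy of each DWT sub-band as a compact feature vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

# Placeholder data: each row is one utterance's samples, y holds digit labels 0-9.
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((200, 8000))
y = rng.integers(0, 10, size=200)

X = np.vstack([dwt_features(sig) for sig in X_raw])
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:5]))
```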
Modeling Text Independent Speaker Identification with Vector QuantizationTELKOMNIKA JOURNAL
Speaker identification is one of the most important technologies nowadays. Many fields such as
bioinformatics and security are using speaker identification. Also, almost all electronic devices are using
this technology too. Based on number of text, speaker identification divided into text dependent and text
independent. On many fields, text independent is mostly used because number of text is unlimited. So, text
independent is generally more challenging than text dependent. In this research, speaker identification text
independent with Indonesian speaker data was modelled with Vector Quantization (VQ). In this research
VQ with K-Means initialization was used. K-Means clustering also was used to initialize mean and
Hierarchical Agglomerative Clustering was used to identify K value for VQ. The best VQ accuracy was
59.67% when k was 5. According to the result, Indonesian language could be modelled by VQ. This
research can be developed using optimization method for VQ parameters such as Genetic Algorithm or
Particle Swarm Optimization.
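A minimal sketch of VQ-based speaker identification is given below: one K-Means codebook per speaker, with classification by minimum average quantization distortion. The feature dimensionality, k value, and the random placeholder frames are assumptions, not the study's data.

```python
# Hypothetical sketch: VQ speaker identification with one K-Means codebook per speaker.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(features_per_speaker, k=5):
    """features_per_speaker: dict speaker_id -> (n_frames, n_dims) feature matrix (e.g. MFCCs)."""
    return {spk: KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
            for spk, feats in features_per_speaker.items()}

def identify(codebooks, test_features):
    """Pick the speaker whose codebook gives the lowest average quantization distortion."""
    def distortion(km):
        d = np.linalg.norm(test_features[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        return d.min(axis=1).mean()
    return min(codebooks, key=lambda spk: distortion(codebooks[spk]))

# Placeholder usage with random "MFCC" frames for two speakers.
rng = np.random.default_rng(1)
books = train_codebooks({"spk_a": rng.normal(0, 1, (300, 13)),
                         "spk_b": rng.normal(2, 1, (300, 13))})
print(identify(books, rng.normal(2, 1, (80, 13))))   # expected: "spk_b"
```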
This paper proposes a unified learning framework to jointly address audio-visual speech recognition and manipulation tasks using cross-modal mutual learning. It aims to disentangle representative features from audio and visual input data using advanced learning strategies. A linguistic module is used to extract knowledge across modalities through cross-modal learning. The goal is to recognize speech with the aid of visual information like lip movements, while preserving identity information for data recovery and synthesis tasks.
Sentiment analysis on Bangla conversation using machine learning approachIJECEIAES
Nowadays, online communication is more convenient and popular than face-to-face conversation, so people prefer online communication over face-to-face meetings. An enormous number of people use online chatting systems to speak with their loved ones at any given time throughout the world, and they create massive quantities of conversation every second through this online engagement. People's feelings during a conversation can be gleaned as useful information from these conversations. Text analysis and summarization of any material can be done through sentiment analysis with natural language processing. The use of conversations in the customer service portals of various e-commerce platforms and in crime investigations based on digital evidence is increasing the need for sentiment analysis of conversation. Other languages, such as English, have well-developed libraries and resources for natural language processing, yet few studies have been conducted on Bangla. It is more challenging to extract sentiments from Bangla conversational data due to the language's grammatical complexity, which opens vast study opportunities. So, support vector machine, multinomial naïve Bayes, k-nearest neighbors, logistic regression, decision tree, and random forest were used. From the dataset, the extracted information was labeled as positive and negative.
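A minimal sketch of the kind of classical pipeline such a study relies on is shown below, using TF-IDF features with one of the listed classifiers; the example texts and labels are English placeholders, not the Bangla dataset.

```python
# Hypothetical sketch: TF-IDF features with a multinomial Naive Bayes sentiment classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["I love this", "this is terrible", "what a great day", "I am so angry"]  # placeholders
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["such a great conversation"]))
```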
Speech emotion recognition with light gradient boosting decision trees machineIJECEIAES
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
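A minimal sketch of the class-weighting idea with a light gradient boosting classifier is shown below, with weights inversely proportional to class frequency; the features, labels, and class proportions are placeholders, not any of the cited datasets.

```python
# Hypothetical sketch: inverse-frequency class weights with a LightGBM classifier.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 40))                       # placeholder audio feature vectors
y = rng.choice(["neutral", "happy", "angry"], size=600, p=[0.7, 0.2, 0.1])  # imbalanced labels

# Minority classes get higher weights, inversely proportional to their sample counts.
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}

clf = LGBMClassifier(class_weight=class_weight, n_estimators=200)
clf.fit(X, y)
print(clf.predict(X[:3]))
```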
This document provides an agenda for research on knowledge discovery from web search. It begins with an introduction on knowledge discovery and how search engines can help extract information. It then outlines the goals and objectives, provides a literature review on related work, and discusses some common limitations observed, such as models achieving low accuracy and WSD approaches not being efficient enough. The document serves to provide background and planning for a research study on improving knowledge discovery through web search.
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORKijitcs
Speech technology is an emerging technology, and automatic speech recognition has made advances in recent years. Much research has been performed for many foreign and regional languages, but at present multilingual speech processing technology is attracting research interest. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
This document summarizes a study that used a back-propagation neural network to estimate students' word recognition abilities based on their performance on a vocabulary test. The study collected test results from 83 elementary school students and used their scores on different word frequency groups as input for the neural network model. The model was trained and tested, showing high correlation between estimated and actual vocabulary volumes. The results demonstrated that a back-propagation neural network can accurately estimate word recognition and could be an effective alternative to traditional statistical methods.
Word embedding for detecting cyberbullying based on recurrent neural networksIAESIJAI
The phenomenon of cyberbullying has spread and has become one of the biggest problems facing users of social media sites, generating significant adverse effects on society and on the victim in particular. Finding appropriate solutions to detect and reduce cyberbullying has become necessary to mitigate its negative impacts. Twitter comments from two datasets are used to detect cyberbullying: the first is an Arabic cyberbullying dataset and the second an English cyberbullying dataset. Three different pre-trained global vectors (GloVe) corpora with different dimensions were used on the original and preprocessed datasets to represent the words. Recurrent neural network (RNN), long short-term memory (LSTM), bidirectional LSTM (BiLSTM), gated recurrent unit (GRU), and bidirectional GRU (BiGRU) classifiers were utilized, evaluated, and compared. The GRU outperforms the other classifiers on both datasets; its accuracy on the Arabic cyberbullying dataset using the 256-dimensional Arabic GloVe corpus is 87.83%, while its accuracy on the English dataset using the 100-dimensional pre-trained GloVe corpus is 93.38%.
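A minimal sketch of a GRU classifier over pre-trained GloVe embeddings in Keras might look like the following; the vocabulary size, sequence length, and the (zero-filled) embedding matrix are placeholders, and loading the actual GloVe vectors is omitted.

```python
# Hypothetical sketch: GRU text classifier on top of a frozen pre-trained embedding matrix.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, max_len = 20000, 100, 50            # placeholder sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))        # would be filled from GloVe vectors

model = keras.Sequential([
    keras.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.GRU(64),
    layers.Dense(1, activation="sigmoid"),                   # bullying vs. not bullying
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```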
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...IRJET Journal
This document presents a robust sign language and hand gesture recognition system using convolutional neural networks. The system captures image frames and processes them through various neural network layers to classify hand gestures into letters, numbers, or other symbols. It segments the hand from images using color thresholds and edge detection. The images then undergo preprocessing like resizing before being classified by the CNN. The CNN is trained on a large dataset to accurately recognize gestures in sign language and provide text output to bridge communication between deaf and non-signing individuals. The system achieved good results classifying several alphabet letters but could be expanded to recognize word combinations.
Sensing complicated meanings from unstructured data: a novel hybrid approachIJECEIAES
The majority of data on computers nowadays is in the form of unstructured data and unstructured text. The inherent ambiguity of natural language makes it incredibly difficult but also highly profitable to find hidden information or comprehend complex semantics in unstructured text. In this paper, we present a natural language processing (NLP) and convolutional neural network (CNN) hybrid architecture called automated analysis of unstructured text using machine learning (AAUT-ML) for the detection of complex semantics from unstructured data, which enables different users to understand the formal semantic knowledge extracted from an unstructured text corpus. AAUT-ML has been evaluated using three datasets, data mining (DM), operating system (OS), and database (DB), and compared with existing models, i.e., YAKE, term frequency-inverse document frequency (TF-IDF), and text-R. The results show better outcomes in terms of precision, recall, and macro-averaged F1-score. This work presents a novel method for identifying complex semantics using unstructured data.
This document describes a system to help deaf and mute people communicate through sign language and voice recognition. The system uses algorithms like support vector machines and hidden Markov models to recognize hand gestures and speech. It can translate sign language into text and voice into sign language representations. The system aims to reduce communication barriers for deaf/mute communities by converting between sign language, text, and voice. It outlines the implementation process which includes steps like skin color detection, hand location detection, finger region detection, and pattern matching to recognize gestures from video input.
Convolutional neural network with binary moth flame optimization for emotion ...IAESIJAI
Electroencephalograph (EEG) signals have the ability to reflect brain activities in real time, and using EEG signals to analyze human emotional states is a common study. The EEG signals of emotions are not distinctive and differ from one person to another, as everyone has different emotional responses to the same stimuli. This is why EEG signals are subject-dependent and have proven effective for subject-dependent detection of emotions. For the purpose of achieving enhanced accuracy and a high true positive rate, the suggested system proposes a binary moth flame optimization (BMFO) algorithm for feature selection and convolutional neural networks (CNNs) for classification. In this proposal, optimum features are chosen using accuracy as the objective function, and the optimally chosen features are then classified with a CNN to discriminate different emotion states.
A novel ensemble model for detecting fake newsIAESIJAI
Due to the growing proliferation of fake news over the past couple of years, our objective in this paper is to propose an ensemble model for the automatic classification of news articles as either real or fake. For this purpose, we opt for a blending technique that combines three models, namely bidirectional long short-term memory (Bi-LSTM), a stochastic gradient descent classifier, and a ridge classifier. The implementation of the proposed model (i.e., BI-LSR) on real-world datasets has shown outstanding results; it achieved an accuracy score of 99.16%. Accordingly, this ensemble learning approach has proven to perform better than individual conventional machine learning and deep learning models as well as many ensemble learning approaches cited in the literature.
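A minimal sketch of the blending idea with scikit-learn is given below; for brevity the Bi-LSTM member is stood in for by a logistic regression, the texts and labels are placeholders, and this is not the paper's BI-LSR implementation.

```python
# Hypothetical sketch: blending an SGD classifier, a ridge classifier, and a third base model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, RidgeClassifier, LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["official report confirms event", "shocking secret they hide",
         "budget approved today", "miracle cure doctors hate"] * 50   # placeholder articles
labels = np.array([0, 1, 0, 1] * 50)                                   # 0 = real, 1 = fake

X = TfidfVectorizer().fit_transform(texts)
X_train, X_hold, y_train, y_hold = train_test_split(X, labels, test_size=0.3, random_state=0)

# Base models trained on the training split; a logistic regression replaces the Bi-LSTM member here.
bases = [SGDClassifier(loss="log_loss", random_state=0),
         RidgeClassifier(random_state=0),
         LogisticRegression(max_iter=1000)]
for b in bases:
    b.fit(X_train, y_train)

# Blending: a meta-learner trained on the base models' hold-out scores.
meta_features = np.column_stack([b.decision_function(X_hold) for b in bases])
meta = LogisticRegression().fit(meta_features, y_hold)
print(meta.predict(meta_features[:5]))
```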
More Related Content
Similar to Improving Indonesian multietnics speaker recognition using pitch shifting data augmentation
Speech recognition techniques are one of the most important modern technologies. Many different systems have been developed in terms of methods used in the extraction of features and methods of classification. Voice recognition includes two areas: speech recognition and speaker recognition, where the research is confined to the field of speech recognition. The research presents a proposal to improve the performance of single word recognition systems by an algorithm that combines more than one of the techniques used in character extraction and modulation of the neural network to study the effects of recognition science and study the effect of noise on the proposed system. In this research four systems of speech recognition were studied, the first system adopted the MFCC algorithm to extract the features. The second system adopted the PLP algorithm, while the third system was based on combining the two previous algorithms in addition to the zero-passing rate. In the fourth system, the neural network used in the differentiation process was modified and the error ratio was determined. The impact of noise on these previous systems. The outcomes were looked at regarding the rate of recognizable proof and the season of preparing the neural network for every system independently, to get a rate of distinguishing proof and quiet up to 98% utilizing the proposed framework.
SPEECH RECOGNITION BY IMPROVING THE PERFORMANCE OF ALGORITHMS USED IN DISCRIM...ijcsit
This document discusses improving speech recognition performance through algorithms used for feature extraction and classification. It examines 4 systems: 1) MFCC extraction, 2) PLP extraction, 3) combining MFCC, PLP and zero-crossing rate, 4) modifying the neural network. System 3 achieved the highest recognition rate of 98% even with noise, outperforming the individual algorithms. Increasing the training samples to 500 further improved recognition ratio.
A prior case study of natural language processing on different domain IJECEIAES
This document summarizes a prior case study on natural language processing across different domains. It begins with an introduction to natural language processing, describing how it is a branch of artificial intelligence that allows computers to understand human language. It then reviews several existing studies that applied natural language processing techniques such as named entity recognition and text mining to tasks like identifying technical knowledge in resumes, enhancing reading skills for deaf students, and predicting student performance. The document concludes by highlighting some of the challenges in developing new natural language processing models.
A Survey on Speech Recognition with Language Specificationijtsrd
As a cross disciplinary, speech recognition is entirely based on the speech as the survey object. Speech recognition allows the machine to convert the speech signal into text or commands via the process of identification and understanding. Speech recognition involves in various fields of physiology, psychology, linguistics, computer science and signal processing, and is even related to the person’s body language, and its goal is to achieve natural language communication between man and machine. The speech recognition technology is gradually becoming the key technology of the IT man machine interface. This paper describes the development of speech recognition technology and its basic principles, methods, reviewed the classification of speech recognition systems, speech recognition approaches and voice recognition technology, analyzed the problems faced by the speech recognition. Dr. Preeti Savant | Lakshmi Sandhya H "A Survey on Speech Recognition with Language Specification" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-3 , April 2022, URL: https://www.ijtsrd.com/papers/ijtsrd49370.pdf Paper URL: https://www.ijtsrd.com/computer-science/speech-recognition/49370/a-survey-on-speech-recognition-with-language-specification/dr-preeti-savant
Deep convolutional neural network for hand sign language recognition using mo...journalBEEI
An image processing system that based computer vision has received many attentions from science and technology expert. Research on image processing is needed in the development of human-computer interactions such as hand recognition or gesture recognition for people with hearing impairments and deaf people. In this research we try to collect the hand gesture data and used a simple deep neural network architecture that we called model E to recognize the actual hand gestured. The dataset that we used is collected from kaggle.com and in the form of ASL (American Sign Language) datasets. We doing accuracy comparison with another existing model such as AlexNet to see how robust our model. We find that by adjusting kernel size and number of epoch for each model also give a different result. After comparing with AlexNet model we find that our model E is perform better with 96.82% accuracy.
GLOVE BASED GESTURE RECOGNITION USING IR SENSORIRJET Journal
This document summarizes research on a glove-based gesture recognition system using IR sensors. The system aims to help those who are deaf and mute communicate through hand gestures. An IR sensor and LED placed on a glove detect hand gestures based on the amount of light received by the sensor. The Arduino microcontroller recognizes the gestures and displays the meaning on an LCD screen while playing an audio message. The researchers claim this method is more accurate and has a lower error rate than conventional image processing approaches. It is intended to help address both safety and communication issues faced by those who are deaf or speech-impaired. Experimental results showed the system successfully recognized gestures and could help reduce the gap between those who are normal and speech-impaired.
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...csandit
This document describes a study that developed a speech recognition system for recognizing spoken Malayalam digits. It used two wavelet-based feature extraction techniques - Discrete Wavelet Transforms (DWT) and Wavelet Packet Decomposition (WPD) - and evaluated their performance using a Naive Bayes classifier. DWT achieved 83.5% accuracy and WPD achieved 80.7% accuracy. To improve recognition accuracy, the study introduced a new technique called Discrete Wavelet Packet Decomposition (DWPD) that utilizes features from both DWT and WPD. DWPD achieved the highest accuracy of 86.2% along with the Naive Bayes classifier.
Combined feature extraction techniques and naive bayes classifier for speech ...csandit
Speech processing and consequent recognition are important areas of Digital Signal Processing
since speech allows people to communicate more natu-rally and efficiently. In this work, a
speech recognition system is developed for re-cognizing digits in Malayalam. For recognizing
speech, features are to be ex-tracted from speech and hence feature extraction method plays an
important role in speech recognition. Here, front end processing for extracting the features is
per-formed using two wavelet based methods namely Discrete Wavelet Transforms (DWT) and
Wavelet Packet Decomposition (WPD). Naive Bayes classifier is used for classification purpose.
After classification using Naive Bayes classifier, DWT produced a recognition accuracy of
83.5% and WPD produced an accuracy of 80.7%. This paper is intended to devise a new
feature extraction method which produces improvements in the recognition accuracy. So, a new
method called Dis-crete Wavelet Packet Decomposition (DWPD) is introduced which utilizes
the hy-brid features of both DWT and WPD. The performance of this new approach is evaluated
and it produced an improved recognition accuracy of 86.2% along with Naive Bayes classifier.
COMBINED FEATURE EXTRACTION TECHNIQUES AND NAIVE BAYES CLASSIFIER FOR SPEECH ...cscpconf
Speech processing and consequent recognition are important areas of Digital Signal Processing since speech allows people to communicate more natu-rally and efficiently. In this work, a
speech recognition system is developed for re-cognizing digits in Malayalam. For recognizing speech, features are to be ex-tracted from speech and hence feature extraction method plays animportant role in speech recognition. Here, front end processing for extracting the features is per-formed using two wavelet based methods namely Discrete Wavelet Transforms (DWT) and Wavelet Packet Decomposition (WPD). Naive Bayes classifier is used for classification purpose.After classification using Naive Bayes classifier, DWT produced a recognition accuracy of83.5% and WPD produced an accuracy of 80.7%. This paper is intended to devise a new feature extraction method which produces improvements in the recognition accuracy. So, a new method called Dis-crete Wavelet Packet Decomposition (DWPD) is introduced which utilizes
the hy-brid features of both DWT and WPD. The performance of this new approach is evaluated and it produced an improved recognition accuracy of 86.2% along with Naive Bayes classifier.
Modeling Text Independent Speaker Identification with Vector QuantizationTELKOMNIKA JOURNAL
Speaker identification is one of the most important technologies nowadays. Many fields such as
bioinformatics and security are using speaker identification. Also, almost all electronic devices are using
this technology too. Based on number of text, speaker identification divided into text dependent and text
independent. On many fields, text independent is mostly used because number of text is unlimited. So, text
independent is generally more challenging than text dependent. In this research, speaker identification text
independent with Indonesian speaker data was modelled with Vector Quantization (VQ). In this research
VQ with K-Means initialization was used. K-Means clustering also was used to initialize mean and
Hierarchical Agglomerative Clustering was used to identify K value for VQ. The best VQ accuracy was
59.67% when k was 5. According to the result, Indonesian language could be modelled by VQ. This
research can be developed using optimization method for VQ parameters such as Genetic Algorithm or
Particle Swarm Optimization.
This paper proposes a unified learning framework to jointly address audio-visual speech recognition and manipulation tasks using cross-modal mutual learning. It aims to disentangle representative features from audio and visual input data using advanced learning strategies. A linguistic module is used to extract knowledge across modalities through cross-modal learning. The goal is to recognize speech with the aid of visual information like lip movements, while preserving identity information for data recovery and synthesis tasks.
Sentiment analysis on Bangla conversation using machine learning approachIJECEIAES
Nowadays, online communication is more convenient and popular than faceto-face conversation. Therefore, people prefer online communication over face-to-face meetings. Enormous people use online chatting systems to speak with their loved ones at any given time throughout the world. People create massive quantities of conversation every second because of their online engagement. People's feelings during the conversation period can be gleaned as useful information from these conversations. Text analysis and conclusion of any material as summarization can be done using sentiment analysis by natural language processing. The use of communication for customer service portals in various e-commerce platforms and crime investigations based on digital evidence is increasing the need for sentiment analysis of a conversation. Other languages, such as English, have welldeveloped libraries and resources for natural language processing, yet there are few studies conducted on Bangla. It is more challenging to extract sentiments from Bangla conversational data due to the language's grammatical complexity. As a result, it opens vast study opportunities. So, support vector machine, multinomial naïve Bayes, k-nearest neighbors, logistic regression, decision tree, and random forest was used. From the dataset, extracted information was labeled as positive and negative.
Speech emotion recognition with light gradient boosting decision trees machineIJECEIAES
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
This document provides an agenda for research on knowledge discovery from web search. It begins with an introduction on knowledge discovery and how search engines can help extract information. It then outlines the goals and objectives, provides a literature review on related work, and discusses some common limitations observed, such as models achieving low accuracy and WSD approaches not being efficient enough. The document serves to provide background and planning for a research study on improving knowledge discovery through web search.
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORKijitcs
Speech technology is an emerging technology and automatic speech recognition has made advances in recent years. Many researches has been performed for many foreign and regional languages. But at present the multilingual speech processing technology has been attracting for research purpose. This paper tries to propose a methodology for developing a bilingual speech identification system for Assamese and English language based on artificial neural network.
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
This document summarizes a study that used a back-propagation neural network to estimate students' word recognition abilities based on their performance on a vocabulary test. The study collected test results from 83 elementary school students and used their scores on different word frequency groups as input for the neural network model. The model was trained and tested, showing high correlation between estimated and actual vocabulary volumes. The results demonstrated that a back-propagation neural network can accurately estimate word recognition and could be an effective alternative to traditional statistical methods.
Word embedding for detecting cyberbullying based on recurrent neural networksIAESIJAI
The phenomenon of cyberbullying has spread and has become one of the biggest problems facing users of social media sites and generated significant adverse effects on society and the victim in particular. Finding appropriate solutions to detect and reduce cyberbullying has become necessary to mitigate its negative impacts on society and the victim. Twitter comments on two datasets are used to detect cyberbullying, the first dataset was the Arabic cyberbullying dataset, and the second was the English cyberbullying dataset. Three different pre-trained global vectors (GloVe) corpora with different dimensions were used on the original and preprocessed datasets to represent the words. Recurrent neural networks (RNN), long short-term memory (LSTM), Bidirectional LSTM (BiLSTM), gated recurrent unit (GRU), and Bidirectional GRU (BiGRU) classifiers utilized, evaluated and compared. The GRU outperform other classifiers on both datasets; its accuracy on the Arabic cyberbullying dataset using the Arabic GloVe corpus of dimension equal to 256D is 87.83%, while the accuracy on the English datasets using 100 D pre-trained GloVe corpus is 93.38%.
IRJET - A Robust Sign Language and Hand Gesture Recognition System using Conv...IRJET Journal
This document presents a robust sign language and hand gesture recognition system using convolutional neural networks. The system captures image frames and processes them through various neural network layers to classify hand gestures into letters, numbers, or other symbols. It segments the hand from images using color thresholds and edge detection. The images then undergo preprocessing like resizing before being classified by the CNN. The CNN is trained on a large dataset to accurately recognize gestures in sign language and provide text output to bridge communication between deaf and non-signing individuals. The system achieved good results classifying several alphabet letters but could be expanded to recognize word combinations.
Sensing complicated meanings from unstructured data: a novel hybrid approachIJECEIAES
The majority of data on computers nowadays is in the form of unstructured data and unstructured text. The inherent ambiguity of natural language makes it incredibly difficult but also highly profitable to find hidden information or comprehend complex semantics in unstructured text. In this paper, we present the combination of natural language processing (NLP) and convolution neural network (CNN) hybrid architecture called automated analysis of unstructured text using machine learning (AAUT-ML) for the detection of complex semantics from unstructured data that enables different users to make understand formal semantic knowledge to be extracted from an unstructured text corpus. The AAUT-ML has been evaluated using three datasets data mining (DM), operating system (OS), and data base (DB), and compared with the existing models, i.e., YAKE, term frequency-inverse document frequency (TF-IDF) and text-R. The results show better outcomes in terms of precision, recall, and macro-averaged F1-score. This work presents a novel method for identifying complex semantics using unstructured data.
This document describes a system to help deaf and mute people communicate through sign language and voice recognition. The system uses algorithms like support vector machines and hidden Markov models to recognize hand gestures and speech. It can translate sign language into text and voice into sign language representations. The system aims to reduce communication barriers for deaf/mute communities by converting between sign language, text, and voice. It outlines the implementation process which includes steps like skin color detection, hand location detection, finger region detection, and pattern matching to recognize gestures from video input.
Improving Indonesian multietnics speaker recognition using pitch shifting data augmentation
IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 12, No. 4, December 2023, pp. 1901~1908
ISSN: 2252-8938, DOI: 10.11591/ijai.v12.i4.pp1901-1908
Journal homepage: http://ijai.iaescore.com
Improving Indonesian multietnics speaker recognition using
pitch shifting data augmentation
Kristiawan Nugroho1, Isworo Nugroho1, De Rosal Ignatius Moses Setiadi2, Omar Farooq3
1 Department of Information Technology and Industry, Universitas Stikubank, Semarang, Indonesia
2 Department of Computer Science, Universitas Dian Nuswantoro, Semarang, Indonesia
3 Department of Electronics Engineering, Z. H. College of Engg and Technology, A. M. U, Aligarh, India
Article Info
Article history: Received Jan 25, 2023; Revised Mar 15, 2023; Accepted Mar 27, 2023

ABSTRACT
Speaker recognition to recognize multiethnic speakers is an interesting
research topic. Various studies involving many ethnicities require the right
approach to achieve optimal model performance. The deep learning approach
has been used in speaker recognition research involving many classes to
achieve high accuracy results with promising results. However, multi-class
and imbalanced datasets are still obstacles encountered in various studies
using the deep learning method which cause overfitting and decreased
accuracy. Data augmentation is an approach model used in overcoming the
problem of small amounts of data and multiclass problems. This approach can
improve the quality of research data according to the method applied. This
study proposes a data augmentation method using pitch shifting with a deep
neural network called pitch shifting data augmentation deep neural network
(PSDA-DNN) to identify multiethnic Indonesian speakers. The results of the
research that has been done prove that the PSDA-DNN approach is the best
method in multi-ethnic speaker recognition where the accuracy reaches
99.27% and the precision, recall, F1 score is 97.60%.
Keywords:
Data augmentation
Deep learning
Pitch shifting
Speaker recognition
This is an open access article under the CC BY-SA license.
Corresponding Author:
Kristiawan Nugroho
Department of Information Technology and Industry, Universitas Stikubank Semarang
Jl. Tri Lomba Juang, Semarang, Indonesia
Email: kristiawan@edu.unisbank.ac.id
1. INTRODUCTION
Speaker recognition is a challenging research field. Various kinds of problems need to be
solved to produce new theories that can make a positive contribution to human life. Research
results in the field of speaker recognition have led to various forms of new technology applied
to voice authentication, surveillance speaker recognition, forensic speaker recognition, security, and multi-speaker
tracking. Tech giant companies such as Google, Microsoft, and Amazon have also taken advantage of
this technology, as seen in Google Voice, Apple Siri, and Amazon's Alexa.
The application of this technology has proven able to help human work, such as in the fields of
security and authentication in smart homes, and to help people with disabilities recognize sound patterns and
images. Initially, research in the field of speaker recognition used classical machine learning methods such as the
gaussian mixture model (GMM), as in the studies carried out by Motlicek et al. [1] and Veena and Mathew [2].
In speaker recognition, the hidden markov model (HMM) strategy has also been utilized, as demonstrated by
Maghsoodi et al. [3] and Hussein et al. [4], as has the support vector machine (SVM) approach in the research
done by Chaunan et al. [5].
However, along with the growth of data that is getting bigger and the complexity of the problems
faced today, researchers are starting to use the deep learning method in speaker recognition research. Deep
2. ISSN: 2252-8938
Int J Artif Intell, Vol. 12, No. 4, December 2023: 1901-1908
1902
learning (DL) is an approach developed from the neural network algorithm. This method continues to be
developed by researchers to solve various problems in machine learning. DL has advantages in terms of the
ability to handle computational processes that involve very large data. In addition, DL is able to process
data representations of various forms such as text, images, and sound; this multimodal capability allows it to
outperform previous machine learning methods, especially in the field of computer vision.
Various DL methods are used in voice signal processing research, including the deep neural network
(DNN) approach, as in the work of Guo et al. [6], which combined a DNN with i-vectors for short speech and
improved the equal error rate (EER) to 26.47%. In other research, Mohan and Patil [7], using a self-organizing
map (SOM) and latent dirichlet allocation (LDA), succeeded in increasing crop prediction accuracy by 7-23%.
A DNN is also used by Saleem and Khattak [8] in research on speech separation, with promising performance.
However, alongside its advantages, DL also has several weaknesses: achieving high accuracy requires large
datasets, the models are prone to overfitting, and the computational process requires large resources. In addition
to the need for large data volumes, modern classification research also faces the problem of multiple classes
when processing an explosion of features with limited data.
In several studies regarding multi-ethnic speaker recognition, the problem of limited data is an initial
challenge, such as in the study conducted by Hanifa et al. [9], which had only 62 recordings of Malaysian
speakers, and the study of Cole [10], which identified speakers in South East England with 227 speakers. To
solve the problem of limited data, various approaches are used, among others data augmentation (DA). DA is
a method of expanding the quantity of data and has proven effective for training neural networks. In speech
signal processing, DA is proven to increase accuracy in audio classification using DL [11]. Several DA methods
that are often used in speech signal processing are adding white noise (AWN), time stretching (TS), pitch
shifting (PS), mixup, and speech augment. These DA approaches can increase the quantity of research data so
that research using DL, which requires relatively large amounts of data, can be done well and achieve a high
level of accuracy.
Currently, the problems researchers face in trying to increase the accuracy of speaker recognition
include the large number of classes that must be classified in imbalanced datasets. Unbalanced multiclass data
is one of the problems encountered in machine learning classification and causes model inaccuracies when
predicting data. This problem can cause prediction errors in machine learning algorithms, so it needs to be
resolved. In research conducted by Khan et al. [12] and Mi et al. [13], the DA approach used the generative
adversarial network (GAN) method to solve multi-class problems and achieved an accuracy increase of 6.4%.
In several studies on the implementation of DA in speech signal processing, the methods used include AWN,
such as the research by Morales et al. [14], which sought to increase speaker recognition accuracy by adding
noise effects.
Jacques and Roebel [15] also used a noise-adding approach in their research. TS is likewise used by
several researchers in speech signal processing; it changes the speed and duration of the audio signal, as in the
research by Sasaki et al. [16] and Aguiar et al. [17] to classify music genres with a convolutional neural network
(CNN). However, the AWN and TS methods still cannot produce high accuracy for speech recognition because
they only achieve an accuracy level of around 70% to 80%. Another approach used in voice-based data
augmentation is PS. The PS method has been used in several studies: by Morbale and Navale [18] in processing
audio files, by Rai and Barkana [19] in processing musical instruments, and by Ye et al. [20] with a CNN using
the UME and TIMIT datasets. In the research conducted by Ye et al., the pitch shifting method achieved an
accuracy rate above 90%, with the highest accuracy of 98.72%.
This paper aims to improve the performance of a multi-ethnic speaker recognition model with a
pitch shifting method based on deep neural networks. The paper is organized as follows: Section 1, the
introduction, describes the research problems, related work, and the relationship to several other studies.
Section 2 presents the proposed method. Section 3 contains the experimental results and discussion. The final
part, Section 4, presents the conclusions of the research carried out.
2. METHOD
This study proposes a pitch shifting data augmentation deep neural network (PSDA-DNN) for voice
signal processing, supported by MFCC as the feature extraction method and a DNN that performs the
multiethnic speaker recognition classification. The research begins with processing the dataset of multiethnic
speakers, followed by preprocessing. Data augmentation is a step that
provides a solution to the limited dataset of multiethnic speakers. The dataset is then extracted using the mel
frequency cepstral coefficient (MFCC) approach, and the resulting features are fed into the seven-layer DNN
architecture. As the last step, the performance of the proposed framework is measured in terms of accuracy,
precision, recall, and F1 score. The proposed method can be seen in Figure 1.
Figure 1. Proposed method
2.1. Datasets and preprocessing
This research on multi-ethnic speaker recognition uses a dataset of ethnic speakers from Indonesia
taken from the YouTube video 301 languages in Indonesia [21]. The dataset was compiled from 70 ethnic
male speakers out of the hundreds of ethnic groups in Indonesia. Sound processing was carried out in Adobe
Audition CS6, and each ethnic speaker's voice was sampled into 10 clips with a duration of 1 second each.
The sampling process uses a standard sample rate (SR) of 44,100 Hz with mono, 32-bit floating point bit
depth. The next stage is preprocessing, an important step in data mining that is needed to obtain good data
quality, reduce processing time, and get the desired results. This study uses the Adobe Audition CS6
application for preprocessing, with its noise reduction facilities used to remove noise from the speakers'
speech data.
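The noise reduction above is performed interactively in Adobe Audition CS6. For readers who want a scripted equivalent of this preprocessing step, the following is a minimal sketch using the noisereduce package, an assumed substitute rather than the authors' tool; the spectral-gating parameters are library defaults and the file name is hypothetical.

# A minimal, assumed substitute for the Adobe Audition noise-reduction step:
# spectral gating with the noisereduce package (not the tool used in the paper).
import librosa
import soundfile as sf
import noisereduce as nr

y, sr = librosa.load("speaker_001_raw.wav", sr=44100)   # hypothetical file name
y_clean = nr.reduce_noise(y=y, sr=sr)                   # spectral-gating noise reduction
sf.write("speaker_001_clean.wav", y_clean, sr)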
2.2. Data augmentation
DA is a method that is often used to increase the quantity of data needed in research. It aims to enlarge
the dataset and is a powerful technique in data mining and data processing for regression and classification
purposes. PS is a DA method in sound signal processing that raises or lowers the original voice pitch without
affecting the duration of the recorded sound. PS is used in this study because it has the advantage that the
overall spectral envelope does not change, so it can achieve high-quality output. In this study, the PS results
are compared with two other DA methods, namely AWN and TS, which are widely implemented in various
fields such as sound recording, music production, music learning, and foreign language learning. The original
speaker's voice signal is processed using the pitch shifting approach; the voice signal before and after applying
the PS method can be seen in Figure 2.
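To make the three augmentation methods concrete, the following is a minimal Python sketch using librosa. The shift steps, noise level, and stretch rate are illustrative assumptions, since the paper does not report the exact values used, and the file names are hypothetical.

# A minimal sketch (not the authors' exact pipeline) of the three DA methods
# compared in this study: pitch shifting (PS), adding white noise (AWN), and
# time stretching (TS).
import numpy as np
import librosa
import soundfile as sf

def pitch_shift(y, sr, n_steps=2.0):
    # Raise (or lower, with negative n_steps) the pitch without changing duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def add_white_noise(y, noise_level=0.005):
    # Mix zero-mean Gaussian noise into the signal (AWN baseline).
    return y + noise_level * np.random.randn(len(y))

def time_stretch(y, rate=1.1):
    # Speed the signal up (rate > 1) or slow it down (rate < 1), changing duration.
    return librosa.effects.time_stretch(y, rate=rate)

# Example: augment one 1-second speaker sample recorded at 44,100 Hz.
y, sr = librosa.load("speaker_001.wav", sr=44100)
sf.write("speaker_001_ps.wav", pitch_shift(y, sr), sr)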
2.3. Feature extraction and deep neural network seven layer
2.3.1. Mel-frequency cepstral coefficient
MFCC is the feature extraction method used in this study; it is a robust approach widely used in speech
recognition. The speech signals of the ethnic speakers, consisting of 700 wav files from 70 ethnic speakers,
were extracted using a Python application with the following MFCC settings: frame lengths 900 and 25, frame
shift 10, a Hamming window, a preemphasis coefficient of 0.97, 13 cepstral coefficients, and 22 lifters. In the
MFCC approach, the voice signal is processed through the following steps:
i) Preemphasis
Preemphasis is carried out after the sound sampling process and serves to reduce noise from the sound source. The preemphasis process in the time domain can be formulated as:

$y(n) = x(n) - a\,x(n-1)$ (1)

where $a$ denotes a constant in the range $0.9 < a < 1.0$.
Figure 2. Original and pitch-shifted speech signals
ii) Frame blocking
The speech signal then enters the frame blocking process, which divides the sound into several frames.
iii) Windowing
Windowing is a step for analyzing a long signal by taking the appropriate part to be processed in the next stage. If the window is defined as $w(n)$, $0 \le n \le N-1$, where $N$ is the number of samples in each frame, then the windowed signal can be formulated as:

$y_1(n) = x_1(n)\,w(n), \quad 0 \le n \le N-1$ (2)
iv) Fast fourier transform (FFT)
The Fourier transform is used to convert a finite time-domain signal into a frequency spectrum, while the FFT is a fast algorithm for the discrete Fourier transform (DFT) that converts each frame of N samples from the time domain to the frequency domain. The result of this processing is the spectrum, formulated as:

$X(n) = \sum_{k=0}^{N-1} x_k\, e^{-2\pi j k n / N}$ (3)
where $n = 0, 1, 2, \ldots, N-1$ and $j = \sqrt{-1}$, while $X(n)$ is the $n$-th frequency component resulting from the Fourier transform.
v) Mel-frequency wrapping
In this step, the FFT values are grouped by a bank of triangular filters: each FFT value is multiplied by the corresponding filter gain, and the results are summed. The wrapping process of a signal in the frequency domain can be formulated as:

$X_i = \log_{10}\!\left(\sum_{k=0}^{N-1} \lvert X(k)\rvert\, H_i(k)\right)$ (4)

where $i = 1, 2, 3, \ldots, M$, $M$ is the number of triangular filters, and $H_i(k)$ is the value of the $i$-th triangular filter at acoustic frequency $k$.
vi) Cepstrum
Finally, the signal is converted back into a time-like domain using a discrete cosine transform (DCT). The final result of this process is referred to as the mel frequency cepstral coefficients, formulated as:

$C_j = \sum_{i=1}^{K} X_i \cos\!\left(j\left(i - \tfrac{1}{2}\right)\frac{\pi}{K}\right), \quad j = 1, 2, 3, \ldots, K$ (5)

where $C_j$ is the $j$-th MFCC coefficient, $X_i$ is the strength of the mel-frequency spectrum from (4), $K$ is the number of desired coefficients, and $M$ is the number of filters.
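As a rough illustration of the MFCC pipeline above, the following sketch extracts 13 coefficients per frame with librosa and averages them over each 1-second clip. The window and hop sizes are library defaults, the pooling into a fixed-length vector is an assumption, and the 193-dimensional input vector used by the DNN in section 2.3.2 presumably aggregates additional features that are not reproduced here; the file name is hypothetical.

# A minimal sketch of MFCC feature extraction for the 1-second speaker clips.
import numpy as np
import librosa

def extract_mfcc(path, sr=44100, n_mfcc=13, preemph=0.97):
    y, _ = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - preemph * y[:-1])            # preemphasis, eq. (1)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # framing, FFT, mel, DCT
    return mfcc.mean(axis=1)                                 # average over frames -> fixed-size vector

features = extract_mfcc("speaker_001.wav")                   # hypothetical file name
print(features.shape)                                        # (13,)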
2.3.2. Deep neural network seven layer
One of the finest DL techniques is the DNN, which has the advantage of building more accurate
models. This work uses a seven-layer DNN with the network architecture shown in Table 1. The architecture
consists of dense (fully connected) layers, in which every neuron is connected to the neurons of the previous
layer. Layer 1 is the input layer with 193 nodes, corresponding to the 193 extracted features. A dropout function
is placed between layers; dropout is a technique used to mitigate overfitting and prediction problems in large
neural networks. Layers two to seven each use half the number of nodes of the previous layer in order to reduce
the computational complexity of each layer.
Table 1. DNN7L architecture
Layer (type)         Output shape
Dense 1 (dense)      (None, 193)
Dense 2 (dense)      (None, 400)
Dropout_1 (Dropout)  (None, 400)
Dense 3 (dense)      (None, 200)
Dropout_2 (Dropout)  (None, 200)
Dense 4 (dense)      (None, 100)
Dropout_3 (Dropout)  (None, 100)
Dense 5 (dense)      (None, 50)
Dropout_4 (Dropout)  (None, 50)
Dense 6 (dense)      (None, 25)
Dropout_5 (Dropout)  (None, 25)
Dense 7 (dense)      (None, 15)
Dropout_6 (Dropout)  (None, 15)
Dense 8 (dense)      (None, 8)
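The following is a minimal Keras sketch of a fully connected network with the layer sizes of Table 1, where Dense 1 is treated as the 193-feature input. The activations, dropout rate, optimizer, and loss are assumptions, since the paper reports only the layer shapes.

# A minimal sketch of a seven-layer fully connected network matching Table 1.
from tensorflow.keras import layers, models

def build_dnn7l(n_features=193, n_classes=8, dropout_rate=0.3):
    model = models.Sequential([layers.Input(shape=(n_features,))])
    for units in (400, 200, 100, 50, 25, 15):        # Dense 2..7, each followed by dropout
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout_rate))      # dropout rate is an assumption
    model.add(layers.Dense(n_classes, activation="softmax"))  # Dense 8: output layer
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_dnn7l()
model.summary()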
2.4. Evaluation
The final step of the Indonesian ethnic speaker recognition process is an evaluation in which the
performance of the proposed model is measured in terms of accuracy, precision, recall, and F1 measure.
Accuracy is the ratio of correctly predicted cases to the total number of cases, while precision is the ratio of
correctly predicted positive cases to all cases predicted as positive. Accuracy, precision, recall, and F1-score
can be calculated using (6) to (9), respectively.
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$ (6)

$\text{Precision} = \frac{TP}{TP + FP} \times 100\%$ (7)

$\text{Recall} = \frac{TP}{TP + FN} \times 100\%$ (8)

$\text{F1} = \frac{2 \times (\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}}$ (9)
where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
Performance measurement in this study also uses recall, which is the proportion of actual positive cases that
are correctly predicted as positive.
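A minimal sketch of computing the metrics in (6) to (9) with scikit-learn is shown below; macro averaging over the speaker classes is an assumption, since the paper does not state which averaging it uses, and the label values are hypothetical.

# A minimal sketch of computing the reported metrics, eqs. (6)-(9).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1]          # hypothetical ethnic-speaker class labels
y_pred = [0, 1, 2, 1, 1]          # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred) * 100)
print("precision:", precision_score(y_true, y_pred, average="macro") * 100)
print("recall   :", recall_score(y_true, y_pred, average="macro") * 100)
print("f1       :", f1_score(y_true, y_pred, average="macro") * 100)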
3. RESULTS AND DISCUSSION
The dataset of multi-ethnic speakers in Indonesia was augmented using the DA approaches, namely
AWN, PS, and TS, and then trained with the DNN using split ratios of 70:30, 80:20, and 90:10. The test results
show that PSDA-DNN has better performance than the adding white noise data augmentation deep neural
network (AWNDA-DNN) and time stretching data augmentation deep neural network (TSDA-DNN) methods.
The results of the PSDA-DNN model performance testing using a 70:30 split ratio are depicted in Figure 3.
Testing on the 70:30 split ratio resulted in an accuracy of 98.55%, with precision, recall, and F1
measure each at 94.37%. This result indicates that PSDA-DNN becomes more robust when used on larger data.
Classification with many classes using various machine learning methods is not easy and causes various
problems in learning [22]; an appropriate approach is needed to manage the dataset. Table 2 compares the
number of classes processed across various related studies.
Figure 3. PSDA-DNN performance
Table 2. Comparison with other datasets
Methods Datasets ∑Class Acc (%)
SVM [9] Ethnicity of Malaysian dataset 4 57.7
CNN [23] Urdu speakers 4 87.5
Deep Belief Network (DBN) [24] Accented spoken English corpus 6 90.2
DNN [25] TITML-IDN, OpenSLR 4 98.9
PSDA-DNN Indonesian multietnics speakers 42 99.2
Based on the comparison presented in Table 2, the proposed PSDA-DNN model achieves better performance than the other machine learning methods, with an accuracy of up to 99.2%, even though it handles more classes. This high level of performance is due to good preprocessing of the speech signal dataset, followed by data augmentation with PS and proper classification using the seven-layer DNN. The PSDA-DNN method also delivers the most effective performance compared with various other techniques on the Indonesian multiethnic speakers dataset. The comparison of these methods is given in Table 3.
Table 3. Comparison with other methods
Methods               Accuracy
K-nearest neighbor    92%
Random forest         81%
DNN                   98.4%
PSDA-DNN (ours)       99.2%
The results of this study were compared with several classical machine learning and deep learning methods. Table 3 shows that the PSDA-DNN method performs better than the other methods. From the performance comparisons presented above, it can be concluded that the proposed method gives the best performance.
4. CONCLUSION
Research in the area of speaker identification is an interesting topic that challenges researchers around the world to make new scientific contributions, including research on Indonesian multiethnic speaker recognition. The DL method is an approach often chosen for processing large amounts of data, including sound signal processing. However, the problems of multiple classes and data imbalance cause low accuracy because model performance is not optimal. The PS approach to data augmentation is a solution for increasing the quantity of data and for handling multi-class classification problems. The high model accuracy of 99.27% obtained by the proposed model shows that it is one solution to the problems of multiple classes and imbalanced datasets in machine learning. This study proposes the PSDA-DNN approach, a multi-ethnic speaker recognition method that combines the PSDA technique with MFCC features and a DNN for processing speech signals. The research results show that PSDA-DNN performs better than the other DNN-based approaches, AWN and TS, in processing speech signals. The PSDA-DNN approach achieves an average accuracy of 99.27% and precision, recall, and F1 measure of 97.60%.
REFERENCES
[1] P. Motlicek, S. Dey, S. Madikeri, and L. Burget, “Employment of subspace gaussian mixture models in speaker recognition,” in
2015 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, Apr. 2015, pp. 4445–4449,
doi: 10.1109/icassp.2015.7178811.
[2] K. V Veena and D. Mathew, “Speaker identification and verification of noisy speech using multitaper MFCC and gaussian mixture
models,” Dec. 2015, doi: 10.1109/picc.2015.7455806.
[3] N. Maghsoodi, H. Sameti, H. Zeinali, and T. Stafylakis, “Speaker recognition with random digit strings using uncertainty normalized
HMM-based i-vectors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1815–1825,
Nov. 2019, doi: 10.1109/taslp.2019.2928143.
[4] J. S. Hussein, A. A. Salman, and T. R. Saeed, “Arabic speaker recognition using HMM,” Indonesian Journal of Electrical
Engineering and Computer Science, vol. 23, no. 2, pp. 1212–1218, Aug. 2021, doi: 10.11591/ijeecs.v23.i2.pp1212-1218.
[5] N. Chauhan, T. Isshiki, and D. Li, “Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large
input database,” in 2019 IEEE 4th International Conference on Computer and Communication Systems ICCCS, Feb. 2019,
pp. 130–133, doi: 10.1109/ccoms.2019.8821751.
[6] J. Guo et al., “Deep neural network based i-vector mapping for speaker verification using short utterances,” Speech Communication,
vol. 105, pp. 92–102, Dec. 2018, doi: 10.1016/j.specom.2018.10.004.
[7] P. Mohan and K. Patil, “Deep learning based weighted SOM to forecast weather and crop prediction for agriculture application,”
International Journal of Intelligent Engineering and Systems, vol. 11, no. 4, pp. 167–176, Aug. 2018,
doi: 10.22266/ijies2018.0831.17.
[8] N. Saleem and M. I. Khattak, “Deep neural networks based binary classification for single channel speaker independent multi-talker
speech separation,” Applied Acoustics, vol. 167, p. 107385, Oct. 2020, doi: 10.1016/j.apacoust.2020.107385.
[9] R. M. Hanifa, K. Isa, and S. Mohamad, “Speaker ethnic identification for continuous speech in Malay language using pitch and
MFCC,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 19, no. 1, pp. 207–214, Jul. 2020,
doi: 10.11591/ijeecs.v19.i1.pp207-214.
[10] A. Cole, “Identifications of speaker ethnicity in South-East England: multicultural London English as a divisible perceptual variety,”
in Proceedings of the LREC 2020 Workshop on Citizen Linguistics in Language Resource Development, 2020, pp. 49–57.
[11] L. Nanni, G. Maguolo, and M. Paci, “Data augmentation approaches for improving animal audio classification,” Ecological
Informatics, vol. 57, p. 101084, May 2020, doi: 10.1016/j.ecoinf.2020.101084.
[12] M. H.-M. Khan et al., “Multi-class skin problem classification using deep generative adversarial network (DGAN),” Computational
Intelligence and Neuroscience, vol. 2022, pp. 1–13, Mar. 2022, doi: 10.1155/2022/1797471.
[13] Q. Mi, Y. Hao, M. Wu, and L. Ou, “An enhanced data augmentation approach to support multi-class code readability classification,”
in Proceedings of the International Conference on Software Engineering and Knowledge Engineering SEKE, Jul. 2022, pp. 48–53,
doi: 10.18293/SEKE2022-130.
[14] N. Morales, L. Gu, and Y. Gao, “Adding noise to improve noise robustness in speech recognition,” Jan. 2007,
doi: 10.21437/interspeech.2007-335.
[15] C. Jacques and A. Roebel, “Data augmentation for drum transcription with convolutional neural networks,” Sep. 2019,
doi: 10.23919/eusipco.2019.8902980.
[16] T. Sasaki et al., “Time stretching: illusory lengthening of filled auditory durations,” Attention, Perception, & Psychophysics, vol. 72,
no. 5, pp. 1404–1421, Jul. 2010, doi: 10.3758/app.72.5.1404.
[17] R. L. Aguiar, Y. M. G. Costa, and C. N. Silla, “Exploring data augmentation to improve music genre classification with convNets,”
in 2018 International Joint Conference on Neural Networks IJCNN, Jul. 2018, pp. 1–8, doi: 10.1109/ijcnn.2018.8489166.
[18] P. R. Morbale and M. Navale, “Design and implementation of real time audio pitch shifting on FPGA,” International Journal of
Innovative Trends in Engineering, vol. 4, no. 2, pp. 81–88, 2015.
[19] A. Rai and B. D. Barkana, “Analysis of three pitch-shifting algorithms for different musical instruments,” in 2019 IEEE Long Island
Systems, Applications and Technology Conference LISAT, May 2019, pp. 1–6, doi: 10.1109/lisat.2019.8817334.
[20] Y. Ye, L. Lao, D. Yan, and R. Wang, “Identification of weakly pitch-shifted voice based on convolutional neural network,”
International Journal of Digital Multimedia Broadcasting, vol. 2020, pp. 1–10, Jan. 2020, doi: 10.1155/2020/8927031.
[21] Indonesia Ideas, “301 languages in Indonesia - regional dialects #2 (in Indonesian: 301 bahasa di Indonesia - logat dialek daerah #2),” 2018. [Online]. Available: https://www.youtube.com/watch?v=FkwXbCY1rWg.
[22] Y. Xue and M. Hauskrecht, “Active learning of multi-class classification models from ordered class sets,” Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 33, no. 01, pp. 5589–5596, Jul. 2019, doi: 10.1609/aaai.v33i01.33015589.
[23] A. Ashar, M. S. Bhatti, and U. Mushtaq, “Speaker identification using a hybrid CNN-MFCC approach,” Mar. 2020,
doi: 10.1109/icetst49965.2020.9080730.
[24] R. Upadhyay and S. Lui, “Foreign English accent classification using deep belief networks,” in 2018 IEEE 12th International
Conference on Semantic Computing ICSC, Jan. 2018, pp. 290–293, doi: 10.1109/icsc.2018.00053.
[25] K. Azizah, M. Adriani, and W. Jatmiko, “Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-
based TTS on low-resource languages,” IEEE Access, vol. 8, pp. 179798–179812, 2020, doi: 10.1109/access.2020.3027619.
ACKNOWLEDGEMENTS
We would like to thank Universitas Stikubank for funding this work through scientific publication incentives.
BIOGRAPHIES OF AUTHORS
Kristiawan Nugroho works as a lecturer at the Faculty of Information Technology and Industry, Universitas Stikubank. He obtained a bachelor's degree in information systems in 2001 from the Faculty of Computer Science, Universitas Dian Nuswantoro Semarang, and a master's degree in informatics engineering from Universitas Dian Nuswantoro in 2007. He obtained a doctoral degree in computer science, with a concentration in machine learning and artificial intelligence, from Universitas Dian Nuswantoro Semarang in 2022. He has conducted research in machine learning, speech recognition, and sentiment analysis. He can be contacted at email: kristiawan@edu.unisbank.ac.id.
Isworo Nugroho works as a lecturer at the Faculty of Information Technology and Industry, Universitas Stikubank. He obtained a bachelor's degree in management in 2001 from the Faculty of Economics, Universitas Stikubank, and a master's degree in computer science from Universitas Gadjah Mada in 2003. He has conducted research in text processing, data mining, and statistics. He can be contacted at email: isworo@edu.unisbank.ac.id.
De Rosal Ignatius Moses Setiadi received a Bachelor of Science in informatics engineering from Universitas Soegijapranata, Indonesia, and a Master of Science in informatics engineering from Universitas Dian Nuswantoro Semarang, both in 2012. He is presently a lecturer and researcher at the Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang, Indonesia. He has written more than 138 peer-reviewed journal and conference publications indexed by Scopus. His research interests include machine learning, cryptography, image steganography, and watermarking. He can be contacted at email: moses@dsn.dinus.ac.id.
Omar Farooq joined the Department of Electronics Engineering, AMU Aligarh, as a lecturer in 1992 and is currently working as a professor. He was awarded a Commonwealth Scholarship from 1999 to 2002 for his PhD at Loughborough University, UK, and a one-year postdoctoral fellowship under the UKIERI in 2007-2008. His broad area of research interest is signal processing, with a focus on speech recognition. He has approximately 250 publications in reputable academic journals and conference proceedings and has helped 9 scholars complete their PhDs. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE, USA). He can be contacted at email: omar.farooq@amu.ac.in.