Speaker diarization is a critical task in speech processing that aims to determine "who spoke when" in an
audio or video recording containing an unknown amount of speech from an unknown number of speakers.
Diarization has numerous applications in speech recognition, speaker identification, and automatic
captioning. Both supervised and unsupervised algorithms are used to address speaker diarization, but
exhaustively labeling the training dataset can become costly in supervised learning, while accuracy can be
compromised by unsupervised approaches. This paper presents a novel approach to speaker diarization that
defines loosely labeled data and employs x-vector embeddings together with a formalized threshold-search
procedure, given an abstract similarity metric, to cluster temporal segments into unique speaker segments.
The proposed algorithm uses concepts from graph theory, matrix algebra, and genetic algorithms to
formulate and solve the optimization problem. Additionally, the algorithm is applied to English, Spanish,
and Chinese audio, and its performance is evaluated using well-known similarity metrics. The results
demonstrate the robustness of the proposed approach. The findings of this research have significant
implications for speech processing and speaker identification across languages, including tonal ones. The
proposed method offers a practical and efficient solution for speaker diarization in real-world scenarios
with labeling time and cost constraints.
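The thresholded clustering step described above can be sketched with off-the-shelf tools. The snippet below is a minimal illustration, not the paper's algorithm: it assumes cosine distance as the abstract similarity metric and uses random vectors in place of real x-vector embeddings, and the threshold value is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_segments(xvectors, threshold):
    """Group segment embeddings into speakers by a cosine-distance threshold."""
    dists = pdist(xvectors, metric="cosine")       # pairwise cosine distances
    tree = linkage(dists, method="average")        # agglomerative clustering
    return fcluster(tree, t=threshold, criterion="distance")

rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.05, size=(5, 16)) + 1.0     # stand-in segments, "speaker A"
b = rng.normal(0.0, 0.05, size=(5, 16)) - 1.0     # stand-in segments, "speaker B"
labels = cluster_segments(np.vstack([a, b]), threshold=0.5)
print(len(set(labels)))                           # well-separated speakers -> 2 clusters
```

In a real pipeline the threshold would be searched for (here is where the paper's genetic-algorithm formulation comes in) rather than fixed by hand.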
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE (IRJET Journal)
The document describes a system for multilingual speech-to-text conversion using Hugging Face that aims to assist deaf individuals. The system uses a Transformer-based encoder-decoder model trained on audio datasets. Feature extraction and tokenization are performed on the audio inputs. The model is fine-tuned using Hugging Face and evaluated based on word error rate. A web application allows users to record speech via microphone and see the transcription output. The implemented model achieved a 32.42% word error rate on a Hindi language dataset. The goal is to enable seamless communication for those with hearing impairments across multiple languages.
Text independent speaker identification system using average pitch and forman... (ijitjournal)
The aim of this paper is to design a closed-set, text-independent speaker identification system using
average pitch and speech features from formant analysis. The speech features carried by the speech signal
are characterized by formant analysis (power spectral density). We design two methods: one for average
pitch estimation based on autocorrelation, and the other for formant analysis. The average pitches of the
speech signals are calculated and combined with formant analysis. Performance comparison of the proposed
method with several existing methods shows that the speaker identification system built on the proposed
method is superior.
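As a rough illustration of the autocorrelation idea (not the authors' implementation), the sketch below estimates average pitch from the dominant autocorrelation lag on a synthetic tone; the 50–500 Hz voiced search band is an assumption.

```python
import numpy as np

def average_pitch(signal, sr, fmin=50.0, fmax=500.0):
    """Estimate pitch (Hz) from the autocorrelation peak in the voiced lag range."""
    sig = signal - signal.mean()
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]  # non-negative lags
    lo, hi = int(sr / fmax), int(sr / fmin)                  # lag search window
    lag = lo + np.argmax(ac[lo:hi])                          # strongest periodicity
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)   # synthetic 220 Hz "voiced" signal
print(average_pitch(tone, sr))         # close to 220 (quantized to integer lags)
```

Real systems would window the signal, average the per-frame estimates, and gate out unvoiced frames before reporting an average pitch.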
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
Isolated word recognition using LPC & vector quantization (eSAT Journals)
Abstract: Speech recognition has always been regarded as a fascinating field in human-computer interaction, and it is one of the fundamental steps toward understanding human recognition and behavior. This paper explains the theory and implementation of a speaker-dependent, real-time isolated word recognizer. The approach first obtains feature vectors using LPC, followed by vector quantization; the quantized vectors are then recognized by measuring the minimum average distortion. All speech recognition systems contain two main phases, a training phase and a testing phase. In the training phase, features of the words are extracted by LPC analysis and stored as templates in the database; during the recognition phase, the extracted features are compared with the templates in the database. Vector quantization is used for generating the codebooks, and the final recognition decision is based on the matching score. MATLAB is used to implement this concept. Index Terms: Speech Recognition, LPC, Vector Quantization, Code Book.
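The codebook-plus-minimum-average-distortion decision rule can be sketched as follows. This is a toy reconstruction under assumptions: random vectors stand in for real LPC feature vectors, and a plain k-means loop stands in for the LBG-style codebook training used in practice.

```python
import numpy as np

def train_codebook(features, k, iters=20, seed=0):
    """Toy k-means codebook over one word's training feature vectors."""
    rng = np.random.default_rng(seed)
    book = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        idx = np.argmin(((features[:, None] - book) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (idx == j).any():
                book[j] = features[idx == j].mean(axis=0)   # update codeword
    return book

def avg_distortion(features, book):
    """Mean squared distance from each vector to its nearest codeword."""
    return np.min(((features[:, None] - book) ** 2).sum(-1), axis=1).mean()

rng = np.random.default_rng(1)
word_a = rng.normal(0, 0.3, (50, 8))   # stand-in "LPC" vectors for word A
word_b = rng.normal(3, 0.3, (50, 8))   # stand-in "LPC" vectors for word B
books = {"A": train_codebook(word_a, 4), "B": train_codebook(word_b, 4)}

test_utt = rng.normal(3, 0.3, (20, 8))  # unseen utterance of word B
print(min(books, key=lambda w: avg_distortion(test_utt, books[w])))  # "B"
```

The recognized word is simply the one whose codebook yields the lowest average distortion on the test vectors, mirroring the matching-score decision described in the abstract.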
Bayesian distance metric learning and its application in automatic speaker re... (IJECEIAES)
This document proposes a state-of-the-art automatic speaker recognition system based on Bayesian distance metric learning as a feature extractor. It explores constraints on the distance between modified and simplified i-vector pairs from the same speaker and different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. This Bayesian distance learning approach achieves better performance than advanced methods and is insensitive to normalization compared to cosine scores. It is also effective with limited training data.
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION (ijaia)
In natural language processing, attention mechanisms in neural networks are widely utilized. In this paper, the research team explores a new mechanism for extending output attention in recurrent neural networks for dialog systems. The new attention method was compared with the current method in generating dialog sentences on a real dataset. Our architecture exhibits several attractive properties, such as better handling of long sequences, and it generates more reasonable replies in many cases.
Modeling Text Independent Speaker Identification with Vector Quantization (TELKOMNIKA Journal)
Speaker identification is one of the most important technologies nowadays. Many fields, such as
bioinformatics and security, use speaker identification, and almost all electronic devices use this
technology as well. Based on the text constraint, speaker identification is divided into text-dependent
and text-independent. In many fields text-independent identification is the most used because the possible
text is unlimited, so it is generally more challenging than the text-dependent case. In this research,
text-independent speaker identification on Indonesian speaker data was modelled with Vector Quantization
(VQ). VQ with K-Means initialization was used: K-Means clustering initialized the means, and Hierarchical
Agglomerative Clustering was used to identify the K value for VQ. The best VQ accuracy was 59.67% when k
was 5. According to the results, the Indonesian language can be modelled by VQ. This research can be
extended with optimization methods for the VQ parameters, such as Genetic Algorithms or Particle Swarm
Optimization.
We propose a model for carrying out deep-learning-based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text-based and audio-based models and then fused these heterogeneous feature maps, taken from intermediate layers, to complete the architecture. Performance measures (accuracy, precision, recall, and F1-score) are observed to outperform the existing models.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
This document summarizes a study that compares different acoustic feature extraction methods (LPC, MFCC, PLP) for a Bangla speech recognition system using LSTM neural networks. It finds that PLP outperforms MFCC and LPC based on statistical distance measurements of phoneme coefficients. PLP shows better distinction between phonemes compared to MFCC and LPC. While RNN/LSTM are inherently slow, combining PLP with faster networks like Transformers may improve performance for large datasets.
IRJET- A Review on Audible Sound Analysis based on State Clustering throu... (IRJET Journal)
This document reviews progress in acoustic modeling for statistical parametric speech synthesis, from early hidden Markov models to recent neural network approaches. It discusses how hidden Markov models were previously dominant but artificial neural networks are now replacing them due to improvements in naturalness. The document also examines developing accurate audio classifiers using machine learning techniques on public datasets and improving classification accuracy beyond current levels of 50-79% by employing strategies like cross-validation.
Centralized Class Specific Dictionary Learning for wearable sensors based phy... (Sherin Mathews)
With recent progress in pervasive healthcare, physical activity recognition with wearable body sensors has become an important and challenging area in both the research and industrial communities. Here, we address a novel technique for a sensor platform that performs physical activity recognition by leveraging a class-specific regularizer term in the dictionary pair learning objective function. The proposed algorithm jointly learns a synthesis dictionary and an analysis dictionary in order to simultaneously perform signal representation and classification once the time-domain features have been extracted. Specifically, the class-specific regularizer term ensures that sparse codes belonging to the same class are concentrated, which benefits the classification stage. To develop a more practical approach, we employ a combination of the alternating direction method of multipliers and an l1-ls minimization method to approximately minimize the objective function. We validate the effectiveness of the proposed model on two activity recognition problems and an intensity estimation problem, all of which include a large number of physical activities. Experimental results demonstrate that classifiers built in this dictionary-learning framework outperform state-of-the-art algorithms using simple features, achieving competitive results compared with classical systems built upon features with prior knowledge.
Centralized Class Specific Dictionary Learning for Wearable Sensors based phy... (sherinmm)
This document proposes a centralized class specific dictionary learning framework for physical activity recognition using wearable sensors. It incorporates a class specific regularizer term into the dictionary pair learning objective function to make sparse codes belonging to the same class more concentrated. This improves classification performance. The framework involves learning a synthesis dictionary and analysis dictionary jointly. It extracts time-domain features from acceleration and heart rate sensor data. Experimental results on activity recognition datasets demonstrate the framework outperforms state-of-the-art methods using only simple features, achieving competitive results to approaches using more complex, domain-specific features.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
The performance of various acoustic feature extraction methods is compared in this work using a Long
Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a
series of vectors that represent the speech signals; they can be classified into either words or sub-word
units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector
extraction technique, chosen for its widespread popularity. Then other vector extraction techniques,
Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), which closely resemble
the human auditory system, are also used. These feature vectors are trained with the LSTM neural network,
and the resulting models of different phonemes are compared with statistical tools, namely the
Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
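For reference, the Bhattacharyya distance between two Gaussian models has a closed form; the sketch below computes it on synthetic feature clouds. The single-Gaussian-per-phoneme assumption and the toy data are ours, not the paper's setup.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = (cov1 + cov2) / 2
    diff = mu1 - mu2
    term1 = diff @ np.linalg.solve(cov, diff) / 8          # mean separation
    term2 = 0.5 * np.log(np.linalg.det(cov) /              # covariance mismatch
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

rng = np.random.default_rng(0)
ph1 = rng.normal(0, 1, (500, 3))   # synthetic feature vectors, phoneme 1
ph2 = rng.normal(2, 1, (500, 3))   # synthetic feature vectors, phoneme 2

d_far = bhattacharyya(ph1.mean(0), np.cov(ph1.T), ph2.mean(0), np.cov(ph2.T))
d_same = bhattacharyya(ph1.mean(0), np.cov(ph1.T), ph1.mean(0), np.cov(ph1.T))
print(d_far > d_same)   # distinct phonemes are farther apart than a model from itself
```

Larger distances between phoneme models indicate features that separate the phonemes better, which is the basis for the comparison described above.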
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS (ijseajournal)
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining its context. Typically, clustering is performed on numerical data; however, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist that can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t-SNE to develop a context-aware clustering approach using pre-trained word embeddings. We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points with common clustering algorithms like K-means.
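A minimal sketch of this pipeline, using a handful of hand-made 2-D vectors as stand-ins for real pre-trained GloVe embeddings (which would normally be loaded from a glove.*.txt file):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for pre-trained GloVe vectors: animals vs. vehicles.
embeddings = {
    "cat":   np.array([0.90, 0.10]), "dog":   np.array([0.80, 0.20]),
    "lion":  np.array([0.85, 0.15]),
    "car":   np.array([0.10, 0.90]), "bus":   np.array([0.20, 0.80]),
    "train": np.array([0.15, 0.85]),
}
words = list(embeddings)
X = np.stack([embeddings[w] for w in words])

# Cluster the context-aware vectors with plain K-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
clusters = {}
for w, l in zip(words, labels):
    clusters.setdefault(int(l), []).append(w)
print(sorted(sorted(v) for v in clusters.values()))
```

Because the categorical values are mapped to vectors that encode co-occurrence context, semantically related categories end up in the same cluster even though no numeric attribute was given.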
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation in context, speaker variability, and environmental conditions. In this paper we present a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments: the input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which successfully reduces the computational complexity and the feature vector size while offering better accuracy and a varying window size suited to non-stationary signals. For better word classification and recognition, a discrete hidden Markov model can be used, as it considers the time distribution of speech signals. The HMM classification method attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction methods and the classification phase in a speech recognition system; the various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using discrete curvelet transforms over conventional methods.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R... (IJERA Editor)
This paper presents a compressive sensing technique for speech reconstruction using linear predictive
coding, since speech is sparser in the LPC domain. The DCT of the speech is taken, and DCT points of the
sparse speech are discarded arbitrarily by zeroing selected points in the DCT domain with mask functions.
From the incomplete points in the DCT domain, the original speech is reconstructed using compressive
sensing, with Gradient Projection for Sparse Reconstruction as the tool. The result is compared
subjectively with a direct IDCT; the experiments show that compressive sensing performs better than the
DCT alone.
A computationally efficient learning model to classify audio signal attributes (IJECEIAES)
The era of machine learning has opened up groundbreaking opportunities in the field of medical diagnosis. However, faster and proper diagnosis of diseases and medical conditions requires proper analysis and classification of digital signal data: brain magnetic resonance imaging (MRI) data must be appropriately classified to identify brain tumors, and pulse signal analysis is similarly required to evaluate the operating condition of the human heart. Several studies have used machine learning (ML) modeling to classify speech signals, but very few have explored the classification of audio signal attributes in the context of intelligent healthcare monitoring. The study therefore introduces a novel mathematical model to analyze and classify synthetic pulse audio signal attributes with cost-effective computation. The numerical model is composed of several functional blocks in which deep-neural-network-based learning (DNNL) plays a crucial role during the training phase, further combined with a recurrent long short-term memory (R-LSTM) structure with feedback connections (FCs). The design is evaluated in a numerical computing environment in terms of accuracy and computational cost. The proposed approach attains approximately 85% classification accuracy, which is comparable to baseline approaches in both accuracy and execution time.
A novel automatic voice recognition system based on text-independent in a noi... (IJECEIAES)
An automatic voice recognition system aims to limit fraudulent access to sensitive areas such as labs. The primary objective of this paper is to increase the accuracy of voice recognition in noisy environments with the Microsoft Research (MSR) Identity Toolbox. The proposed system lets the user speak into the microphone and matches the unknown voice against the human voices in the database using a statistical model, in order to grant or deny access to the system. Voice recognition is done in two steps: training and testing. During training, a Universal Background Model and Gaussian Mixture Models (GMM-UBM) are calculated from different sentences pronounced by the human voice(s) used to record the training data. Testing of the voice signal in a noisy environment then computes the log-likelihood ratio of the GMM-UBM models in order to classify the user's voice. Before testing, noise and de-noising methods were applied; we investigated different MFCC features of the voice to determine the best possible features, as well as the noise filter algorithm, which subsequently improved the performance of the automatic voice recognition system.
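The GMM-UBM log-likelihood-ratio decision can be sketched with scikit-learn. This is an illustrative stand-in for the MSR Identity Toolbox pipeline, with synthetic features in place of MFCCs and arbitrary component counts.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_data = rng.normal(0, 2, (2000, 4))        # pooled background speech features
target_data = rng.normal(1.5, 0.5, (400, 4))  # enrollment features of one speaker

# Universal Background Model and target-speaker GMM.
ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm_data)
target = GaussianMixture(n_components=4, random_state=0).fit(target_data)

def llr(test_features):
    """Average log-likelihood ratio: positive when the target model fits better."""
    return target.score(test_features) - ubm.score(test_features)

genuine = rng.normal(1.5, 0.5, (100, 4))   # test features from the same speaker
impostor = rng.normal(-2, 0.5, (100, 4))   # test features from someone else
print(llr(genuine) > llr(impostor))
```

Access is granted when the ratio exceeds a decision threshold; in practice the target model is usually MAP-adapted from the UBM rather than trained from scratch as in this sketch.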
Developing automatic part-of-speech (POS) tagging for any new language is considered a necessary step
before further computational linguistics methodology beyond tagging, such as chunking and parsing, can be
fully applied to the language. Many POS disambiguation technologies have been developed for this type of
research, and several factors influence the choice among them; an approach can be either corpus-based or
non-corpus-based. In this paper, we present a review of POS tagging technologies.
The document discusses various techniques for audibly representing mathematical formulae for visually impaired users. It analyzes methods such as using lexical cues, dynamic sonic trajectories, earcons, spearcons, auditory icons, and hybrid sounds. Research experiments found that spearcons were as easy to learn as speech and allowed two streams of data, making them more efficient than speech alone. The document also discusses using prosody and digital representation standards to improve the audibility of mathematical formulae.
The document compares two supervised machine learning algorithms, Naive Bayes and decision trees, for the task of word sense disambiguation (WSD) using an empirical approach. It describes implementing both algorithms on a dataset of 15 English words annotated with senses from WordNet. Naive Bayes achieved an accuracy of 62.86% on the Senseval-3 test, while decision trees achieved 45.14% accuracy. The document analyzes and compares the performance of the two approaches to determine which is most successful for WSD and how their combination could potentially improve accuracy.
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreed-upon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
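A single-machine sketch of the bisecting k-means idea (without the MapReduce distribution or the WordNet integration) might look like this; the toy "document vectors" are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Repeatedly split the largest cluster in two until k clusters remain."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()                 # pick the largest cluster
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[biggest])
        clusters += [biggest[halves == 0], biggest[halves == 1]]
    labels = np.empty(len(X), dtype=int)
    for i, idx in enumerate(clusters):
        labels[idx] = i
    return labels

rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(c, 0.2, (30, 5)) for c in (0, 3, 6)])  # 3 topic blobs
labels = bisecting_kmeans(docs, 3)
print(len(set(labels)))   # 3
```

Each 2-way split is an independent k-means run over one cluster's members, which is exactly the step that the paper distributes across a MapReduce cluster.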
This document summarizes an algorithm for weakly supervised image and video grouping. The algorithm uses efficient data mining tools like min-Hash and APriori, originally developed for text analysis, to iteratively cluster media into similar groups based on a small number of seed examples provided by the user. It introduces the concept of an image signature as a frequency histogram that can be dynamically expanded to identify common features between examples of the same class. The algorithm pulls positive classes together through an iterative process of identifying common elements between seed examples and accentuating those elements to increase similarity. It is tested on image and video datasets, requiring only 10% of labelled ground truth data compared to other approaches.
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text (kevig)
The article presents Sentiment Analysis (SA) and Tense Classification using the Skip-gram model for word-to-vector encoding of Nepali language text. The experiment on SA for positive-negative classification is carried out in two ways. In the first experiment the vector representation of each sentence is generated using the Skip-gram model followed by Multi-Layer Perceptron (MLP) classification, and an F1 score of 0.6486 is achieved for positive-negative classification with an overall accuracy of 68%. In the second experiment the verb chunks are extracted using a Nepali parser and a similar experiment is carried out on the verb chunks. An F1 score of 0.6779 is observed for positive-negative classification with an overall accuracy of 85%. Hence, chunker-based sentiment analysis proves better than sentiment analysis using whole sentences. This paper also proposes using a Skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, the vector representation of each verb chunk is generated using the Skip-gram model followed by Multi-Layer Perceptron (MLP) classification, and verb chunks had a very low overall accuracy of 53%. The fourth experiment, conducted for tense classification using sentences, resulted in improved efficiency with an overall accuracy of 89%. Past tenses were identified and classified more accurately than other tenses. Hence, sentence-based tense classification proves better than verb-chunker-based tense classification.
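The sentence-vector step in the pipeline above (one fixed-length vector per sentence, fed to the MLP) can be sketched as a simple average of word vectors. The three-dimensional toy vectors below are invented for illustration, standing in for trained Skip-gram embeddings:

```python
def sentence_vector(sentence, word_vectors, dim):
    # mean of the word vectors of all in-vocabulary words; zeros if none found
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# toy "embeddings" (a real system would load trained Skip-gram vectors)
toy = {"good": [1.0, 0.0, 0.5], "movie": [0.0, 1.0, 0.5]}
vec = sentence_vector("good movie", toy, dim=3)
```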
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIAR... (mathsjournal)
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an
audio or video recording that contains unknown amounts of speech from unknown speakers and unknown
number of speakers. Diarization has numerous applications in speech recognition, speaker identification,
and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization
problems, but providing exhaustive labeling for the training dataset can become costly in supervised
learning, while accuracy can be compromised when using unsupervised approaches. This paper presents a
novel approach to speaker diarization, which defines loosely labeled data and employs x-vector embedding
and a formalized approach for threshold searching with a given abstract similarity metric to cluster
temporal segments into unique user segments. The proposed algorithm uses concepts of graph theory,
matrix algebra, and genetic algorithms to formulate and solve the optimization problem. Additionally, the
algorithm is applied to English, Spanish, and Chinese audio, and the performance is evaluated using well-known similarity metrics. The results demonstrate the robustness of the proposed approach. The
findings of this research have significant implications for speech processing and speaker identification across languages,
including those with tonal differences. The proposed method offers a practical and efficient solution for
speaker diarization in real-world scenarios where there are labeling time and cost constraints.
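The clustering step described above (grouping temporal segments whose embeddings are similar enough under a chosen threshold) can be sketched as connected components of a similarity graph. The use of cosine similarity and the toy 2-D "x-vectors" below are illustrative assumptions, not the paper's exact formulation:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def cluster_by_threshold(embeddings, threshold):
    # union-find over the graph whose edges join pairs with similarity >= threshold;
    # each connected component becomes one speaker cluster
    parent = list(range(len(embeddings)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(embeddings))]
    relabel = {}
    return [relabel.setdefault(r, len(relabel)) for r in roots]

segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = cluster_by_threshold(segments, threshold=0.9)
```

Searching over `threshold` (e.g. with a genetic algorithm, as the paper does) then amounts to scoring the resulting labelings against the loosely labeled data.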
A POSSIBLE RESOLUTION TO HILBERT’S FIRST PROBLEM BY APPLYING CANTOR’S DIAGONA... (mathsjournal)
We present herein a new approach to the Continuum Hypothesis (CH). We will employ string conditioning,
a technique that limits the range of a string over some of its sub-domains, to form subsets K of R. We
will prove that these are well defined and in fact proper subsets of R by making use of Cantor’s Diagonal
argument in its original form to establish that the cardinality of K lies between those of N and R, respectively.
More Related Content
Similar to OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIARIZATION SYSTEMS: A MATHEMATICAL FORMULATION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
This document summarizes a study that compares different acoustic feature extraction methods (LPC, MFCC, PLP) for a Bangla speech recognition system using LSTM neural networks. It finds that PLP outperforms MFCC and LPC based on statistical distance measurements of phoneme coefficients. PLP shows better distinction between phonemes compared to MFCC and LPC. While RNN/LSTM are inherently slow, combining PLP with faster networks like Transformers may improve performance for large datasets.
IRJET- A Review on Audible Sound Analysis based on State Clustering throu... (IRJET Journal)
This document reviews progress in acoustic modeling for statistical parametric speech synthesis, from early hidden Markov models to recent neural network approaches. It discusses how hidden Markov models were previously dominant but artificial neural networks are now replacing them due to improvements in naturalness. The document also examines developing accurate audio classifiers using machine learning techniques on public datasets and improving classification accuracy beyond current levels of 50-79% by employing strategies like cross-validation.
Centralized Class Specific Dictionary Learning for wearable sensors based phy... (Sherin Mathews)
With recent progress in pervasive healthcare,
physical activity recognition with wearable body sensors has
become an important and challenging area in both research and
industrial communities. Here, we address a novel technique for
a sensor platform that performs physical activity recognition by
leveraging a class specific regularizer term into the dictionary
pair learning objective function. The proposed algorithm jointly
learns a synthesis dictionary and an analysis dictionary in
order to simultaneously perform signal representation and
classification once the time-domain features have been extracted.
Specifically, the class specific regularizer term ensures that the
sparse codes belonging to the same class will be concentrated
thereby proving beneficial for the classification stage. In order
to develop a more practical approach, we employ a combination
of an alternating direction method of multipliers and a l1 − ls
minimization method to approximately minimize the objective
function. We validate the effectiveness of our proposed model
by employing it on two activity recognition problems and an
intensity estimation problem, both of which include a large
number of physical activities. Experimental results demonstrate
that classifiers built in this dictionary-learning-based framework
outperform state-of-the-art algorithms using simple features,
thereby achieving competitive results when compared with
classical systems built upon features with prior knowledge.
Centralized Class Specific Dictionary Learning for Wearable Sensors based phy... (sherinmm)
This document proposes a centralized class specific dictionary learning framework for physical activity recognition using wearable sensors. It incorporates a class specific regularizer term into the dictionary pair learning objective function to make sparse codes belonging to the same class more concentrated. This improves classification performance. The framework involves learning a synthesis dictionary and analysis dictionary jointly. It extracts time-domain features from acceleration and heart rate sensor data. Experimental results on activity recognition datasets demonstrate the framework outperforms state-of-the-art methods using only simple features, achieving competitive results to approaches using more complex, domain-specific features.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
The performance of various acoustic feature extraction methods has been compared in this work using
Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic
features are a series of vectors that represent the speech signals. They can be classified into either words or
sub-word units such as phonemes. In this work, at first linear predictive coding (LPC) is used as the acoustic
vector extraction technique. LPC has been chosen due to its widespread popularity. Then other vector
extraction techniques like Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction
(PLP) have also been used. These two methods closely resemble the human auditory system. These feature
vectors are then trained using the LSTM neural network. Then the obtained models of different phonemes
are compared with different statistical tools namely Bhattacharyya Distance and Mahalanobis Distance to
investigate the nature of those acoustic features.
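For univariate Gaussian summaries of phoneme models, the Bhattacharyya distance mentioned above has a closed form. The sketch below assumes each phoneme is reduced to a mean and variance, which is a simplification of the LSTM-derived models in the paper:

```python
from math import log

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    # closed form for two univariate Gaussians:
    # 1/4 * (mu1 - mu2)^2 / (var1 + var2) + 1/4 * ln(1/4 * (var1/var2 + var2/var1 + 2))
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.25 * log(0.25 * (var1 / var2 + var2 / var1 + 2.0))
    return mean_term + var_term
```

The distance is zero for identical distributions and grows as two phoneme models separate, which is what makes it useful for judging how well a feature set keeps phonemes apart.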
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS (ijseajournal)
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t-SNE to develop a context-aware clustering approach (using pre-trained
word embeddings). We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points using common clustering algorithms like K-means.
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems. However, speech recognition accuracy still suffers under variation of context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based Feature Extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature vector size while offering better accuracy and a varying window size, making it suitable for non-stationary signals. For better word classification and recognition, discrete hidden Markov models can be used, as they consider the time distribution of
speech signals. The HMM classification method attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction methods and the classification phase in a speech
recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that signal recognition accuracy is increased by using discrete curvelet transforms over conventional methods.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R... (IJERA Editor)
This paper presents a compressive sensing technique for speech reconstruction using linear predictive coding, because
speech is more sparse in the LPC domain. The DCT of the speech is taken and DCT points of the sparse speech are discarded arbitrarily.
This is achieved by setting some points in the DCT domain to zero by multiplying with mask functions. From the incomplete
points in the DCT domain, the original speech is reconstructed using compressive sensing; the tool used is Gradient
Projection for Sparse Reconstruction. The performance of the result is compared subjectively with the direct IDCT. The
experiment shows that the performance is better for compressive sensing than for the DCT alone.
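The sparse-reconstruction step can be illustrated with iterative soft-thresholding (ISTA), a simpler relative of the Gradient Projection for Sparse Reconstruction solver the paper uses; both minimize the same l1-regularized least-squares objective. The tiny 2-D problem below is an invented illustration, not the paper's speech data:

```python
def soft(x, lam):
    # soft-thresholding: the proximal operator of lam * |x|
    return x - lam if x > lam else (x + lam if x < -lam else 0.0)

def ista(A, y, lam, step, iters=200):
    # minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient steps
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [soft(x[j] - step * g[j], step * lam) for j in range(n)]
    return x

# with an orthonormal A, the minimizer is simply soft-thresholding of y
x_hat = ista([[1.0, 0.0], [0.0, 1.0]], [1.0, -0.3], lam=0.4, step=1.0)
```

Small coefficients (here the -0.3) are driven exactly to zero, which is how the missing DCT points get a sparse explanation rather than noise.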
A computationally efficient learning model to classify audio signal attributes (IJECEIAES)
The era of machine learning has opened up groundbreaking realities and opportunities in the field of medical diagnosis. However, it is also observed that faster and proper diagnosis of any disease or medical condition requires proper analysis and classification of digital signal data, for example the identification of tumors in the brain. Brain magnetic resonance imaging (MRI) data has to be appropriately classified, and similarly, pulse signal analysis is required to evaluate the operating condition of the human heart. Several studies have used machine learning (ML) modeling to classify speech signals, but very few studies have explored the classification of audio signal attributes in the context of intelligent healthcare monitoring. The study thereby aims to introduce novel mathematical modeling to analyze and classify synthetic pulse audio signal attributes with cost-effective computation. The numerical modeling is composed of several functional blocks where deep neural network-based learning (DNNL) plays a crucial role during the training phase, and it is further combined with a recurrent long short-term memory (R-LSTM) structure with feedback connections (FCs). The design approach is further experimented with in a numerical computing environment in terms of accuracy and computational aspects. The classification outcome of the proposed approach shows that it attains approximately 85% accuracy, which is comparable to the baseline approaches, with competitive execution time.
A novel automatic voice recognition system based on text-independent in a noi... (IJECEIAES)
Automatic voice recognition systems aim to limit fraudulent access to sensitive areas such as labs. The primary objective of this paper is to increase the accuracy of voice recognition in a noisy environment using the Microsoft Research (MSR) Identity Toolbox. The proposed system enables the user to speak into the microphone; it then matches the unknown voice with other human voices existing in the database using a statistical model, in order to grant or deny access to the system. The voice recognition is done in two steps: training and testing. During training, a Universal Background Model as well as a Gaussian Mixture Model (GMM-UBM) are calculated based on different sentences pronounced by the human voice(s) used to record the training data. Then the testing of the voice signal in a noisy environment calculates the log-likelihood ratio of the GMM-UBM models in order to classify the user's voice. Before testing, noise and de-noising methods were applied; we investigated different MFCC features of the voice to determine the best possible feature, as well as a noise filter algorithm that subsequently improved the performance of the automatic voice recognition system.
Developing automatic part-of-speech (POS) tagging for any new language is considered a necessary
step for further computational linguistics methodology beyond tagging, like chunking and parsing, to be
fully applied to the language. Many POS disambiguation technologies have been developed for this type of
research, and several factors influence the choice among them. An approach can be either corpus-based
or non-corpus-based. In this paper, we present a review of POS tagging technologies.
The document discusses various techniques for audibly representing mathematical formulae for visually impaired users. It analyzes methods such as using lexical cues, dynamic sonic trajectories, earcons, spearcons, auditory icons, and hybrid sounds. Research experiments found that spearcons were as easy to learn as speech and allowed two streams of data, making them more efficient than speech alone. The document also discusses using prosody and digital representation standards to improve the audibility of mathematical formulae.
The document compares two supervised machine learning algorithms, Naive Bayes and decision trees, for the task of word sense disambiguation (WSD) using an empirical approach. It describes implementing both algorithms on a dataset of 15 English words annotated with senses from WordNet. Naive Bayes achieved an accuracy of 62.86% on the Senseval-3 test, while decision trees achieved 45.14% accuracy. The document analyzes and compares the performance of the two approaches to determine which is most successful for WSD and how their combination could potentially improve accuracy.
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreed-upon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE (ijdms)
A Positive Integer N Such That p_n + p_{n+3} ~ p_{n+1} + p_{n+2} For All n ≥ N (mathsjournal)
According to Bertrand's postulate, we have p_n + p_n ≥ p_{n+1}. Is it true that p_{n-1} + p_n ≥ p_{n+1} for all n > 1? And is there a large enough value N such that p_n + p_{n+3} > p_{n+1} + p_{n+2} for all n ≥ N?
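The comparison in the conjecture is easy to inspect numerically. This small sieve-based check over the primes below 1000 is an illustration added here, not part of the paper:

```python
def primes_up_to(limit):
    # Sieve of Eratosthenes
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

p = primes_up_to(1000)
# sign of (p_n + p_{n+3}) - (p_{n+1} + p_{n+2}) along the initial primes
diffs = [(p[n] + p[n + 3]) - (p[n + 1] + p[n + 2]) for n in range(len(p) - 3)]
```

Among small primes the difference takes positive, negative, and zero values (e.g. 31 + 43 < 37 + 41), which is consistent with the paper asking for a threshold N beyond which one relation dominates rather than claiming it holds from the start.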
Moving Target Detection Using CA, SO and GO-CFAR detectors in Nonhomogeneous ... (mathsjournal)
A fundamental problem for radar systems operating in complex situations is to automatically detect targets while maintaining a
desired constant false alarm probability. This work studies two detection approaches, the first with a fixed threshold and the
other with an adaptive one. In the latter, we study the three types of detectors CA-, SO-, and GO-CFAR. This research
aims to apply intelligent techniques to improve detection performance in a nonhomogeneous environment using standard
CFAR detectors. The objective is to maintain the false alarm probability and enhance target detection by combining
intelligent techniques. With these objectives in mind, standard CFAR detectors are applied to nonhomogeneous
environment data. The primary focus is understanding the reason for false detections when applying standard CFAR
detectors in a nonhomogeneous environment and how to avoid them using intelligent approaches.
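The cell-averaging (CA) CFAR detector studied above can be sketched as follows. The window sizes and scale factor are illustrative choices, not the paper's tuned values:

```python
def ca_cfar(power, num_train=8, num_guard=2, scale=4.0):
    # CA-CFAR: for each cell under test (CUT), estimate the noise level as the
    # mean of training cells on both sides (skipping guard cells) and compare
    # the CUT against scale * noise, keeping the false-alarm rate constant
    half = num_train // 2
    detections = []
    for cut in range(half + num_guard, len(power) - half - num_guard):
        train = (power[cut - num_guard - half:cut - num_guard]
                 + power[cut + num_guard + 1:cut + num_guard + 1 + half])
        noise = sum(train) / len(train)
        if power[cut] > scale * noise:
            detections.append(cut)
    return detections

signal = [1.0] * 30
signal[15] = 50.0          # a strong target in flat noise
hits = ca_cfar(signal)
```

The SO- and GO-CFAR variants differ only in the noise estimate: they take the smaller or greater of the two one-sided training means instead of the overall mean, which is what changes their behavior at clutter edges.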
The Impact of Allee Effect on a Predator-Prey Model with Holling Type II Func... (mathsjournal)
There is currently much interest in predator–prey models across a variety of bioscientific disciplines. The focus is on quantifying predator–prey interactions, and this quantification is being formulated especially as regards climate change. In this article, a stability analysis is used to analyse the behaviour of a general two-species model with respect to the Allee effect (on the growth rate and nutrient limitation level of the prey population). We present a description of the local and non-local interaction stability of the model and detail the types of bifurcation which arise, proving that there is a Hopf bifurcation in the Allee effect module. A stable periodic oscillation was encountered which was due to the Allee effect on the
prey species. As a result of this, the positive equilibrium of the model could change from stable to unstable and then back to stable, as the strength of the Allee effect (or the ‘handling’ time taken by predators when predating) increased continuously from zero. Hopf bifurcation has arose yield some complex patterns that have not been observed previously in predator-prey models, and these, at the same time, reflect long term behaviours. These findings have significant implications for ecological studies, not least with respect to examining the mobility of the two species involved in the non-local domain using Turing instability. A spiral generated by local interaction (reflecting the instability that forms even when an infinitely large
carrying capacity is assumed) is used in the model.
A POSSIBLE RESOLUTION TO HILBERT’S FIRST PROBLEM BY APPLYING CANTOR’S DIAGONA...mathsjournal
We present herein a new approach to the Continuum hypothesis CH. We will employ a string conditioning,a technique that limits the range of a string over some of its sub-domains for forming subsets K of R. We will prove that these are well defined and in fact proper subsets of R by making use of Cantor’s Diagonal argument in its original form to establish the cardinality of K between that of (N,R) respectively.
Moving Target Detection Using CA, SO and GO-CFAR detectors in Nonhomogeneous ...mathsjournal
Modernization of radar technology and improved signal processing techniques are necessary to improve detection systems in complex situations. A fundamental problem in radar systems is to automatically detect targets while maintaining a
desired constant false alarm probability. This work studies two detection approaches, the first with a fixed threshold and the
other with an adaptive one. In the latter, we have learned the three types of detectors CA, SO, and GO-CFAR. This research
aims to apply intelligent techniques to improve detection performance in a nonhomogeneous environment using standard
CFAR detectors. The objective is to maintain the false alarm probability and enhance target detection by combining
intelligent techniques. With these objectives in mind, implementing standard CFAR detectors is applied to nonhomogeneous
environment data. The primary focus is understanding the reason for the false detection when applying standard CFAR
detectors in a nonhomogeneous environment and how to avoid it using intelligent approaches
Modified Alpha-Rooting Color Image Enhancement Method on the Two Side 2-D Qua...mathsjournal
Color in an image is resolved to 3 or 4 color components and 2-Dimages of these components are stored in separate channels. Most of the color image enhancement algorithms are applied channel-by-channel on each image. But such a system of color image processing is not processing the original color. When a color image is represented as a quaternion image, processing is done in original colors. This paper proposes an implementation of the quaternion approach of enhancement algorithm for enhancing color images and is referred as the modified alpha-rooting by the two-dimensional quaternion discrete Fourier transform (2-D QDFT). Enhancement results of this proposed method are compared with the channel-by-channel image enhancement by the 2-D DFT. Enhancements in color images are quantitatively measured by the color enhancement measure estimation (CEME), which allows for selecting optimum parameters for processing by thegenetic algorithm. Enhancement of color images by the quaternion based method allows for obtaining images which are closer to the genuine representation of the real original color.
An Application of Assignment Problem in Laptop Selection Problem Using MATLABmathsjournal
The assignment – selection problem used to find one-to- one match of given “Users” to “Laptops”, the main objective is to minimize the cost as per user requirement. This paper presents satisfactory solution for real assignment – Laptop selection problem using MATLAB coding.
The aim of this paper is to study the class of β-normal spaces. The relationships among s-normal spaces, pnormal spaces and β-normal spaces are investigated. Moreover, we study the forms of generalized β-closed
functions. We obtain characterizations of β-normal spaces, properties of the forms of generalized β-closed
functions and preservation theorems.
Cubic Response Surface Designs Using Bibd in Four Dimensionsmathsjournal
Response Surface Methodology (RSM) has applications in Chemical, Physical, Meteorological, Industrial and Biological fields. The estimation of slope response surface occurs frequently in practical situations for the experimenter. The rates of change of the response surface, like rates of change in the yield of crop to various fertilizers, to estimate the rates of change in chemical experiments etc. are of
interest. If the fit of second order response is inadequate for the design points, we continue the
experiment so as to fit a third order response surface. Higher order response surface designs are sometimes needed in Industrial and Meteorological applications. Gardiner et al (1959) introduced third order rotatable designs for exploring response surface. Anjaneyulu et al (1994-1995) constructed third order slope rotatable designs using doubly balanced incomplete block designs. Anjaneyulu et al (2001)
introduced third order slope rotatable designs using central composite type design points. Seshu babu et al (2011) studied modified construction of third order slope rotatable designs using central composite
designs. Seshu babu et al (2014) constructed TOSRD using BIBD. In view of wide applicability of third
order models in RSM and importance of slope rotatability, we introduce A Cubic Slope Rotatable Designs Using BIBD in four dimensions.
The caustic that occur in geodesics in space-times which are solutions to the gravitational field equations with the energy-momentum tensor satisfying the dominant energy condition can be circumvented if quantum variations are allowed. An action is developed such that the variation yields the field equations
and the geodesic condition, and its quantization provides a method for determining the extent of the wave packet around the classical path.
Approximate Analytical Solution of Non-Linear Boussinesq Equation for the Uns...mathsjournal
For one dimensional homogeneous, isotropic aquifer, without accretion the governing Boussinesq equation under Dupuit assumptions is a nonlinear partial differential equation. In the present paper approximate analytical solution of nonlinear Boussinesq equation is obtained using Homotopy perturbation transform method(HPTM). The solution is compared with the exact solution. The comparison shows that the HPTM is efficient, accurate and reliable. The analysis of two important aquifer
parameters namely viz. specific yield and hydraulic conductivity is studied to see the effects on the height of water table. The results resemble well with the physical phenomena.
Common Fixed Point Theorems in Compatible Mappings of Type (P*) of Generalize...mathsjournal
In this paper, we give some new definition of Compatible mappings of type (P), type (P-1) and type (P-2) in intuitionistic generalized fuzzy metric spaces and prove Common fixed point theorems for six mappings
under the conditions of compatible mappings of type (P-1) and type (P-2) in complete intuitionistic fuzzy
metric spaces. Our results intuitionistically fuzzify the result of Muthuraj and Pandiselvi [15]
Mathematics subject classifications: 45H10, 54H25
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...mathsjournal
- The document presents a probabilistic algorithm for computing the polynomial greatest common divisor (PGCD) with smaller factors.
- It summarizes previous work on the subresultant algorithm for computing PGCD and discusses its limitations, such as not always correctly determining the variant τ.
- The new algorithm aims to determine τ correctly in most cases when given two polynomials f(x) and g(x). It does so by adding a few steps instead of directly computing the polynomial t(x) in the relation s(x)f(x) + t(x)g(x) = r(x).
Table of Contents - September 2022, Volume 9, Number 2/3mathsjournal
Applied Mathematics and Sciences: An International Journal (MathSJ ) aims to publish original research papers and survey articles on all areas of pure mathematics, theoretical applied mathematics, mathematical physics, theoretical mechanics, probability and mathematical statistics, and theoretical biology. All articles are fully refereed and are judged by their contribution to advancing the state of the science of mathematics.
Code of the multidimensional fractional pseudo-Newton method using recursive ...mathsjournal
The following paper presents one way to define and classify the fractional pseudo-Newton method through a group of fractional matrix operators, as well as a code written in recursive programming to implement this
method, which through minor modifications, can be implemented in any fractional fixed-point method that allows
solving nonlinear algebraic equation systems.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Batteries -Introduction – Types of Batteries – discharging and charging of battery - characteristics of battery –battery rating- various tests on battery- – Primary battery: silver button cell- Secondary battery :Ni-Cd battery-modern battery: lithium ion battery-maintenance of batteries-choices of batteries for electric vehicle applications.
Fuel Cells: Introduction- importance and classification of fuel cells - description, principle, components, applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Applied Mathematics and Sciences: An International Journal (MathSJ) Vol.10, No.1/2, June 2023
DOI: 10.5121/mathsj.2023.10201
OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIARIZATION SYSTEMS: A MATHEMATICAL FORMULATION
Jagat Chaitanya Prabhala, Dr. Venkatnareshbabu K and Dr. Ragoju Ravi
Department of Applied Sciences National Institute of Technology Goa, India
ABSTRACT
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an
audio or video recording that contains unknown amounts of speech from unknown speakers and unknown
number of speakers. Diarization has numerous applications in speech recognition, speaker identification,
and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization
problems, but providing exhaustive labeling for the training dataset can become costly in supervised
learning, while accuracy can be compromised when using unsupervised approaches. This paper presents a
novel approach to speaker diarization, which defines loosely labeled data and employs x-vector embedding
and a formalized approach for threshold searching with a given abstract similarity metric to cluster
temporal segments into unique user segments. The proposed algorithm uses concepts of graph theory,
matrix algebra, and genetic algorithm to formulate and solve the optimization problem. Additionally, the
algorithm is applied to English, Spanish, and Chinese audio, and the performance is evaluated using well-known similarity metrics. The results demonstrate the robustness of the proposed approach. The findings of this research have significant implications for speech processing and speaker identification, including languages with tonal differences. The proposed method offers a practical and efficient solution for
speaker diarization in real-world scenarios where there are labeling time and cost constraints.
KEYWORDS
Speaker diarization, speech processing, signal processing, x-vector, graph theory, similarity matrix,
optimization
1. INTRODUCTION
Speech processing is a fundamental area of research in which the goal is to analyze and
understand spoken language. It can be divided into two primary categories: speech recognition
and speaker recognition. While speech recognition is concerned with the content of the speech
signal, speaker recognition aims to identify the speaker(s) within a conversation. Speaker
diarization is an important step in multi-speaker speech processing applications that falls under
the latter category.
In speaker diarization, the speech audio is segmented into temporal segments that contain the
voice of a single speaker. These segments are further clustered into homogeneous groups to
identify each speaker and the boundaries/frames of their speech. This process is becoming
increasingly important as an automatic pre-processing step for building speech processing
systems, with direct applications in speaker indexing, retrieval, speech recognition with speaker
identification, and diarizing meetings and lectures.
This research paper proposes a novel approach to diarization that can handle scenarios where the
number of speakers is unknown and their identities are not known beforehand. Specifically, we
explore the use of loosely labeled speech data and formalize a mathematical model to search for
the appropriate similarity threshold to diarize the speakers. Our approach utilizes pre-trained x-
vector based voice verification embeddings, along with some simple concepts of graph theory for
clustering and a genetic algorithm to identify the optimal similarity threshold for speaker
diarization that minimizes the error rate. We evaluate the effectiveness of our approach on speech
data in various languages, using different frame lengths and similarity metrics. The results
demonstrate that our method can achieve high accuracy in speaker diarization even with unknown speakers and an unknown number of speakers, which has significant implications for speech processing applications in a variety of fields.
This paper begins by reviewing the existing literature on x-vectors and briefly surveying approaches to diarization for labeled and unlabeled data, followed by a description of the proposed methodology, in which we introduce some definitions and the mathematical formulation. After describing the methodology, we illustrate the approach with a dummy example, followed by results on real audio data, and conclude with the findings and applications in the closing section.
2. RELATED WORK
Extensive research has been conducted on speaker diarization and clustering of speech segments
using i-vector similarity or x-vector embeddings, along with various similarity metrics, such as
cosine similarity [1]. Bayesian Information Criterion (BIC) is a widely used method for speaker
diarization model selection [2][3], which can also estimate the number of speakers in a recording.
Other Bayesian methods have also been proposed for this purpose [4][5]. Hierarchical
Agglomerative Clustering (HAC) is a commonly employed unsupervised method for speaker
cluster learning; however, the techniques for determining the optimal number of clusters are not
sophisticated due to the problem being posed as an unsupervised learning problem.
Supervised learning-based clustering using Probabilistic Linear Discriminant Analysis (PLDA) of
x-vectors and Bayesian Hidden Markov Model (BHMM)-based re-segmentation are other
approaches that have been explored in this field [6]. However, these methods may not be suitable
for loosely labeled data.
Our proposed methodology is better suited for training models with loosely labeled data. The
existing methods are not optimal for our data as it is neither fully labeled nor completely
unlabeled. Therefore, we propose a new approach that can effectively train models with loosely
labeled data for speaker diarization and clustering of speech segments.
3. METHODOLOGY
Def: Training audio data is called 'loosely labeled' if each audio file has information about the number of speakers in the audio but does not have a detailed annotation of who spoke when.
Given a set of multi-speaker audio files D = {(A_k, t_k) | k = 1, 2, …, n}, where A_k is the k-th multi-speaker audio and t_k is the number of speakers in A_k. We define audio data where only the number of speakers is known, but no information about the temporal segmentation of speakers is given, as loosely labeled.
Below [ref: Image 1] is the process diagram for training and prediction.
3.1. Detailed Methodology
3.1.1. Data
As a first step, we used single-speaker audio from multiple languages, i.e., the SLR38-Chinese, SLR61-Spanish, SLR45-English & SLR83-English datasets provided in the openslr data [7], as input for synthetic multi-speaker audio generation, i.e., each generated audio has more than one speaker. The table below has more details about the synthesized dataset.
Language   Total # of files   Average duration   Average # of speakers
English    6844               103s               3.2
Spanish    2106               53s                2.7
Chinese    5823               154s               4.5
3.1.2. Pre-Processing & Feature Extraction
Split the above generated audio files into small temporal segments with length of 15ms and use
pretrained model in the kaldi project [8] to extract x-vector embeddings of the temporal
segments. The block diagram [ref: Image2] of x-vector training architecture is as follows
Image2
Image 1
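As a rough sketch of this segmentation step, the following splits a mono signal into fixed-length temporal segments. The 15 ms segment length comes from the text; the 16 kHz sample rate and the plain-list signal representation are illustrative assumptions (the paper itself extracts x-vectors with a pretrained Kaldi model [8], which is not reproduced here).

```python
# Sketch of the pre-processing step: split a mono signal into fixed-length
# temporal segments. 15 ms follows the paper; the 16 kHz sample rate is an
# illustrative assumption.

def split_into_segments(samples, sample_rate=16000, segment_ms=15):
    """Return a list of fixed-length temporal segments (trailing partial segment dropped)."""
    seg_len = int(sample_rate * segment_ms / 1000)  # 240 samples at 16 kHz
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]

# One second of dummy audio yields 66 full 15 ms segments of 240 samples each.
segments = split_into_segments([0.0] * 16000)
```

Each resulting segment would then be mapped to its x-vector embedding by the pretrained model before the similarity matrix is built.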
By the end of the data processing step, the k-th multi-speaker audio file A_k results in a stream of x-vectors x_k1, x_k2, ..., x_kn, where x_ki represents the x-vector of the i-th temporal segment.
3.1.3. Problem Formulation
Further, we compute the similarity matrix

S_k = S(A_k) =
[ s_11  s_12  ⋯  s_1n
  s_21  s_22  ⋯  s_2n
    ⋮     ⋮   ⋱    ⋮
  s_n1  s_n2  ⋯  s_nn ]

where S_ij = s(x_ki, x_kj) for an abstract similarity function 's'.
Define a function G_p: {S_k} → {Adj_k} from the set of similarity matrices {S_k} to the set of adjacency matrices {Adj_k} by [G_p(S_k)]_ij = 1 if S_ij ≥ p and 0 otherwise. Clearly, G_p(S_k) is an adjacency matrix with the temporal segments as nodes.
Define a function H: {Adj_k} → ℤ⁺ as H(Adj_k) = number of connected components [12].
The objective is to minimize over the threshold p

Q(p) = Σ_{k=1}^{l} ( H∘G_p(S_k) − (t_k + 1) )²

represented as minimize_p Q(p). We use t_k + 1 to account for silence segments.
For the optimal 'p', G_p(S_k) is the adjacency matrix corresponding to the k-th audio file A_k, and each connected component of this adjacency matrix is a unique user, resulting in diarization of the multi-speaker audio.
Given that H∘G_p(S_k) and t_k are positive integers, Q(p) is a discrete quadratic function, so at least one global optimum is guaranteed, and due to the properties of the quadratic curve, Q(p) does not have any local optima. As Q(p) is neither continuous nor differentiable, we cannot use gradient descent or sub-gradient descent algorithms to find the optimal 'p'. We have used a genetic algorithm [11] to search for the optimal p.
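The optimization just described can be sketched end to end: threshold a similarity matrix at p, count the connected components of the resulting graph, and search for the p that minimizes Q(p). The union-find component counter follows the formulation above; the genetic-algorithm settings below (population size, averaging crossover, Gaussian mutation) are illustrative toy choices, not the configuration of [11].

```python
# Sketch of the threshold search: Q(p) from the paper's formulation plus a
# toy genetic algorithm. GA hyperparameters here are illustrative assumptions.
import random

def connected_components(sim, p):
    """Number of connected components of the graph with edge (i, j) iff sim[i][j] >= p."""
    n = len(sim)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= p:
                ra, rb = find(i), find(j)
                if ra != rb:
                    parent[ra] = rb
    return len({find(i) for i in range(n)})

def Q(p, data):
    """data: list of (similarity_matrix, t_k); t_k + 1 accounts for silence."""
    return sum((connected_components(S, p) - (t + 1)) ** 2 for S, t in data)

def genetic_search(data, pop_size=20, generations=40, seed=0):
    """Toy GA over thresholds in [-1, 1]: elitism, averaging crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [rng.uniform(-1, 1) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: Q(p, data))
        elite = pop[:pop_size // 2]
        children = [min(1.0, max(-1.0,
                        (rng.choice(elite) + rng.choice(elite)) / 2  # crossover
                        + rng.gauss(0, 0.05)))                       # mutation
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=lambda p: Q(p, data))
```

For instance, with a single 4-segment matrix containing two tight clusters (within-cluster similarity 0.9, cross-cluster 0.1) and t_1 = 1, any threshold between 0.1 and 0.9 gives two components and hence Q(p) = 0.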
3.2. Illustration of Proposed method
Let {(A1, 2), (A2, 3), (A3, 2)} be the set of similarity matrices created from audio files, together with the corresponding numbers of speakers, with arbitrary similarity values. For ease of illustration, assume the similarity metric is bounded in [-1, 1]. Let matrices A1, A2, A3 be as represented in the following tables [Image 3].
Let us evaluate the number of connected components and Q(p) for each threshold 'p', where p = 0.9, 0.7, 0.3, for all the matrices in [ref: Image 3].
For p = 0.9, the connected components of A1, A2, A3 are 8, 9, 7 respectively. The similarly colored edges [ref: Image 4] in each matrix form a connected component.
Image3
5. Applied Mathematics and Sciences: An International Journal (MathSJ) Vol.10, No.1/2, June 2023
5
Q(0.9) = (8 − 3)² + (9 − 4)² + (7 − 3)² = 66
For p = 0.7, the connected components of A1, A2, A3 are 4, 8, 5 respectively. The similarly colored edges [ref: Image 5] in each matrix form a connected component.
Q(0.7) = (4 − 3)² + (8 − 4)² + (5 − 3)² = 21
For p = 0.3, the connected components of A1, A2, A3 are 2, 3, 2 respectively. The similarly colored edges [ref: Image 6] in each matrix form a connected component.
Q(0.3) = (2 − 3)² + (3 − 4)² + (2 − 3)² = 3
Clearly, for threshold p = 0.2 the connected components are 1, 1, 1 respectively for A1, A2, A3, and hence

Q(0.2) = (1 − 3)² + (1 − 4)² + (1 − 3)² = 17
For the example illustrated in [3.2] above, p = 0.3 is the best threshold among those enumerated, as Q(0.3) is the minimum.
In real-life use cases, when the data is large, we use a genetic algorithm [11] to find the optimal value of 'p' instead of enumeration and exhaustive search.
Image4
Image5
Image6
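The Q(p) values worked out in the illustration can be recomputed directly from the connected-component counts and the speaker counts t_k = 2, 3, 2:

```python
# Q(p) for the illustration: sum of squared gaps between the number of
# connected components and t_k + 1 (the extra 1 accounts for silence).
def Q_from_counts(components, speakers):
    return sum((c - (t + 1)) ** 2 for c, t in zip(components, speakers))

t = [2, 3, 2]
assert Q_from_counts([8, 9, 7], t) == 66   # p = 0.9
assert Q_from_counts([4, 8, 5], t) == 21   # p = 0.7
assert Q_from_counts([2, 3, 2], t) == 3    # p = 0.3, the minimum
assert Q_from_counts([1, 1, 1], t) == 17   # p = 0.2
```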
4. RESULTS & DISCUSSION
4.1. Evaluation Metric
As a standard, we use the mean DER (diarization error rate) as the evaluation metric

DER(A) = Espkr + Emiss + Efa + Eovl
where,
Espkr is the percentage of scored time the speaker ID is assigned to the wrong speaker
Emiss is the percentage of scored time a speaker segment is categorized as a non-speech segment
Efa is the percentage of scored time a non-speech segment is categorized as a speaker segment
Eovl is the percentage of time there is an overlap of speakers in the segment
Mean DER = (1/n) Σ_{k=1}^{n} DER(A_k)
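A minimal sketch of this metric: the DER of one file is the sum of its four error components, and the reported score is the mean over n files. The component values in the usage line are made-up illustrative numbers, not results from the paper.

```python
# Sketch of the evaluation metric: per-file DER is the sum of the four error
# components (all in percent); the reported score is the mean over files.
def der(e_spkr, e_miss, e_fa, e_ovl):
    return e_spkr + e_miss + e_fa + e_ovl

def mean_der(files):
    """files: list of (e_spkr, e_miss, e_fa, e_ovl) tuples, in percent."""
    return sum(der(*f) for f in files) / len(files)

# Two hypothetical files with DER 8.0% and 10.0% give a mean DER of 9.0%.
score = mean_der([(5.0, 1.5, 1.0, 0.5), (6.0, 2.0, 1.0, 1.0)])
```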
4.2. Similarity Metrics
The following similarity metrics are considered in this paper to test the proposed approach
4.2.1. Cosine Similarity
The cosine between two vectors X and Y is defined as Cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖)
4.2.2. Euclidean Distance (L2 norm)
The L2 distance between two vectors X and Y is defined as ‖X − Y‖₂ = √( Σᵢ (Xᵢ − Yᵢ)² )
4.2.3. Manhattan Distance (L1 norm)
The L1 distance between two vectors X and Y is defined as ‖X − Y‖₁ = Σᵢ |Xᵢ − Yᵢ|
4.2.4. Kendall's Tau Similarity
Kendall's tau between two vectors X and Y is defined as τ(X, Y) = (n_c − n_d) / (n(n − 1)/2), where n_c is the number of concordant pairs, n_d is the number of discordant pairs, and n is the length of the vectors.
Note that cosine and Kendall's tau are bounded metrics, and segments can be interpreted as more similar when the value is higher; the L1 and L2 distance metrics are not bounded above, and segments are closer when the value is closer to 0.
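For concreteness, here are minimal reference implementations of the four metrics on two equal-length vectors. These are plain-Python sketches (Kendall's tau uses the O(n²) pairwise definition given above), not the code used in the experiments.

```python
# Reference implementations of the four similarity/distance metrics above.
import math

def cosine(x, y):
    """Cos(X, Y) = (X . Y) / (||X|| ||Y||); bounded in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def l2(x, y):
    """Euclidean distance: sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def kendall_tau(x, y):
    """(n_c - n_d) / (n(n-1)/2) over all index pairs; ties count as neither."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
    return (nc - nd) / (n * (n - 1) / 2)
```

In the paper's setting, x and y would be the x-vector embeddings of two temporal segments.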
4.3.Experimental Results
Test results when the training & test split is 90% and 10% respectively [ref: Image 7]
Test results when the training & test split is 80% and 20% respectively [ref: Image 8]
Test results when the training & test split is 70% and 30% respectively [ref: Image 9]
Test results when the training & test split is 60% and 40% respectively [ref: Image 10]
Image7
Image8
Image9
Image10
Test results when the training & test split is 50% and 50% respectively [ref: Image 11]
4.4. Code & Results with Pretrained Model [9] without Threshold Optimization
We have also used the same test data and ran a diarization algorithm [ref: code snippet 1] that uses the pyannote library [9]. The results are below.
Image11
Sample split   English DER   Spanish DER   Chinese DER
90%-10%        8.4%          9.6%          9.3%
80%-20%        7.8%          7.2%          6.4%
70%-30%        9.6%          10.4%         11.2%
60%-40%        10.2%         11.9%         9.1%
50%-50%        14.8%         12.7%         13.8%
5. CONCLUSION
In conclusion, this paper has presented a generic approach for identifying speakers and segmenting speech into homogeneous speaker segments using loosely labeled data. We have found that similarity metrics such as cosine and Kendall's tau perform better than distance metrics such as the L1 and L2 norms, with the cosine similarity metric performing the best across languages with an 80-20 split. Our threshold-optimized model has reduced the error by at least 2 percentage points over the existing state-of-the-art approach [4.4]. This methodology is a generic formulation that can be used with custom metrics or combined with kernel techniques such as Gaussian or radial basis kernels.
Although supervised techniques give the lowest DER, they are costly in terms of money and time for detailed annotation. Our approach is best suited when the aim is to be better than unsupervised approaches while minimizing the cost of learning. The current approach took an average of 3 hours of training time for each language and metric combination, with Kendall's tau taking the longest at 5-6 hours. As next steps, we plan to explore other optimization techniques to further reduce training time.
ACKNOWLEDGEMENTS
I am grateful to my mother Subbalakshmi Devulapally for inspiring me to do research. Also, a special
thanks to my wife Bhavani Paruchuri for continually supporting me and standing by me through this
journey.
REFERENCES
[1] M. Senoussaoui, P. Kenny, T. Stafylakis and P. Dumouchel, (2014), "A Study of the Cosine
Distance-Based Mean Shift for Telephone Speech Diarization," in IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 217-227
[2] G. Schwarz, (1978), “Estimating the dimension of a model,” Ann. Statist. 6, 461-464
[3] S. S. Chen and P. Gopalakrishnan, (1998), “Clustering via the bayesian information criterion with
applications in speech recognition,” in ICASSP’98, vol. 2, Seattle, USA, pp. 645–648
[4] F. Valente, (2005), “Variational Bayesian methods for audio indexing,” Ph.D. dissertation, Eurecom
[5] P. Kenny, D. Reynolds and F. Castaldo,(2010), “Diarization of Telephone Conversations using Factor
Analysis,” Selected Topics in Signal Processing, IEEE Journal of, vol.4, no.6, pp.1059-1070
[6] F. Landini et al., (2020), "But System for the Second Dihard Speech Diarization Challenge," ICASSP IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6529-6533
[7] https://www.openslr.org/resources.php
[8] http://kaldi-asr.org/models/m7
[9] Khoma, Volodymyr, et al., (2023), "Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library"
[10] Canovas, O., & Garcia, F. J., (2023). Analysis of Classroom Interaction Using Speaker Diarization
and Discourse Features from Audio Recordings. In Learning in the Age of Digital and Green
Transition: Proceedings of the 25th International Conference on Interactive Collaborative Learning
(ICL2022), Volume 2 (pp. 67-74). Cham: Springer International Publishing
[11] Li, T., Shao, G., Zuo, W., & Huang, S. (2017). Genetic algorithm for building optimization: State-of-
the-art survey. In Proceedings of the 9th international conference on machine learning and computing
(pp. 205-210)
[12] Hubert, L. J. (1974). Some applications of graph theory to clustering. Psychometrika, 39(3), 283-309.
AUTHORS
Mr. Jagat Chaitanya Prabhala is pursuing a Ph.D. in the Applied Sciences Department of the National Institute of Technology, Goa, India. He also has 15 years of industry experience applying data science and AI to real business use cases. Currently, he is working as a data science leader at Sixt Research & Development Pvt Ltd.
Dr. Venkatnareshbabu K is working as an Assistant Professor in the Computer Science Department of the National Institute of Technology, Goa, India. He has published multiple papers in highly reputed journals in the areas of computer vision and AI. He is a co-guide of Mr. Jagat Chaitanya.
Dr. Ragoju Ravi is working as an Associate Professor in the Applied Sciences Department of the National Institute of Technology, Goa, India. He has published multiple papers in reputed journals in the area of applied mathematics, including fluid mechanics, heat and mass transfer, etc. Dr. Ravi is the guide of Mr. Jagat Chaitanya.